PHP - html_simple_dom, crawlers encodes innerhtml?

I'm using PHP html_simple_dom.
The targeted site uses UTF-8. My PHP script and the stream context are both set to use UTF-8.
An element (which I inspect in the browser) has an innerHTML of "AAA ' BBB", at least as rendered by Firefox and Chrome.
However, my PHP script always fetches this string as "AAA &#039; BBB".
I can fix this using htmlspecialchars_decode($string, 1), but I really want to know why the PHP script, or rather the website, is ("wrongly") encoding the string in the first place when I visit it using PHP, which is explicitly set to UTF-8:
header('Content-Type: text/html; charset=utf-8');
define("CONTEXT", stream_context_create(
    array(
        "http" => array(
            "header" => 'Content-Type: text/html; charset=utf-8'
            // also tried 'header' => 'Accept-Charset: UTF-8'
        )
    )
));
// target site reads UTF-8 - http://mtggoldfish.com.cutercounter.com/
$html = file_get_html($url, false, CONTEXT);
// do things, blurts out every "'" as encoded &#039;

Browser inspectors transform the markup a bit to show something human-readable.
Create a simple HTML page with only AAA &#039; BBB in the body, and you will see AAA ' BBB in the inspector.
If you really want to see the content of the page, look at the source code (which is what file_get_html gets).
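If the decoded text is what the script needs, the entity can be resolved after fetching. The following is a minimal sketch, assuming the raw string looks like the one in the question; the variable names are illustrative only.

```php
<?php
// Raw markup as it appears in the page source (assumed sample string).
$raw = "AAA &#039; BBB";

// ENT_QUOTES makes the decoder handle single-quote entities as well as double quotes.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```

This is the same transformation the browser inspector applies for display: the bytes sent by the server contain the entity, and only the rendered view shows the plain apostrophe.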

Related

How to find out the character-encoding standard that has been used in a PHP file?

I'm using PHP 7.2.11 on my laptop that runs on Windows 10 Home Single Language 64-bit operating system.
I've installed Apache/2.4.35 (Win32) and PHP 7.2.10 using the latest version of XAMPP.
I typed the code below into a file named demo.php:
<?php
$string1 = "Hel\xE1lo"; //Tried hexadecimal equivalent code-point from ISO-8859-1
echo $string1;
?>
After running the above program in my web browser, I got the output below:
Hel�lo
Then I made a small change to the above program and rewrote the code as below:
<?php
$string1 = "Hel\xC3\xA1lo"; //Tried hexadecimal equivalent code-point from UTF-8, C form
echo $string1;
?>
After running the changed program in my web browser, I got the output below (which is indeed the expected result):
Helálo
This raised a question for me.
Is there any built-in function or mechanism in PHP that will tell me which character encoding has been used in the current file?
P.S.: I know that in PHP a string is encoded in whatever way the script file itself is encoded. I want to know whether there is a built-in function, mechanism, or any other workaround that will tell me the character encoding used in the file under consideration.

This function must be in the same file whose encoding is to be determined.
// returns 'UTF-8', 'iso-8859-1', ... or false
function getPageCoding() {
    $codes = array(
        'UTF-8' => "\xc3\xa4",
        'iso-8859-1' => "\xe4",
        'cp850' => "\x84",
    );
    return array_search('ä', $codes);
}
echo getPageCoding();
Demo: https://3v4l.org/UVvBM
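A related check, sketched here under the assumption that the mbstring extension is available, is to test whether a byte sequence is valid UTF-8. It uses the two strings from the question, written with hex escapes so the result does not depend on how this file itself is saved.

```php
<?php
$latin1 = "Hel\xE1lo";      // byte E1 is "á" in ISO-8859-1
$utf8   = "Hel\xC3\xA1lo";  // bytes C3 A1 are "á" in UTF-8

// A lone E1 byte is not a complete UTF-8 sequence, so the first check fails.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)

// Converting the ISO-8859-1 bytes yields exactly the UTF-8 form.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($converted === $utf8); // bool(true)
```

This would explain the output in the question: a browser decoding the page as UTF-8 cannot interpret the single E1 byte and shows a replacement character instead.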

file_get_contents() breaking ISO-8859-1 encoding

I am trying to read a page using file_get_contents() but I cannot get the character encoding to work.
This is my code:
$username = "masked";
$password = "maskedPass";
$remote_url = 'https://utfws.utfpr.edu.br/aluno01/sistema/mplistahorario.inicio?p_curscodnr=212';
// Create a stream
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => array(
            "Authorization: Basic " . base64_encode("$username:$password"),
            'Accept-Charset: iso-8859-1'
        )
    )
);
$context = stream_context_create($opts);
// Open the file using the HTTP headers set above
$file = file_get_contents($remote_url, false, $context);
echo $file;
I tried to change the character encoding to utf-8 but I always get a page with question marks instead of áéíóúãõç.
When I open the page directly in my browser it works just fine. Why is this happening?

It sounds to me like this might just be a problem of lost encoding details.
What you're describing is:
1. Request the document from the web server, specifying encoding 8859-1.
2. The server responds with the document in the requested encoding, including a header specifying that the encoding is 8859-1. This will look correct in a browser.
3. Output the document (but not the header data!) from PHP (where this goes isn't specified).
4. Open the data in some sort of viewer.
See where the encoding specification was lost, there in step 3?
The data can be decoded correctly with 8859-1, but it will only be decoded that way if the viewer is configured to use that encoding by default. Some apps may default to 8859-1, but UTF-8 is a lot more common these days.
If you load the data into a different storage engine, say MySQL, the problem may compound. MySQL associates a charset with text data. If your database defaults to utf-8 and you don't tell it the data is actually in 8859-1, you're feeding it data that is assumed to be utf-8, and the data will be treated as such in the database going forward. Now even if you ask the database for 8859-1 in the future, the data will be re-encoded from utf-8 to 8859-1, but it was never valid utf-8 to begin with, so the result is yet another incorrect set of bytes.
To address this problem, specify the encoding when you view the data, or when you save it to a database.
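One way to keep the encoding information from getting lost is sketched below. It assumes the response headers are available as an array of raw lines (PHP fills $http_response_header after file_get_contents() on an HTTP URL); the helper name charsetFromHeaders and the sample data are hypothetical, and mbstring is assumed to be installed.

```php
<?php
// Hypothetical helper: read the charset from raw response header lines.
// The last Content-Type wins, so headers from redirects are superseded.
function charsetFromHeaders(array $headers, string $default = 'ISO-8859-1'): string
{
    $charset = $default;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset=\s*"?([\w.:-]+)/i', $line, $m)) {
            $charset = strtoupper($m[1]);
        }
    }
    return $charset;
}

// Sample data standing in for a real response (assumed, not fetched).
$headers = ['HTTP/1.1 200 OK', 'Content-Type: text/html; charset=iso-8859-1'];
$body    = "hor\xE1rio"; // "horário" encoded as ISO-8859-1

$charset = charsetFromHeaders($headers);
$utf8    = mb_convert_encoding($body, 'UTF-8', $charset);

// Tell the viewer which encoding the output uses, then emit it.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
echo $utf8;
```

The alternative is to leave the bytes untouched and forward the original charset instead, for example with header('Content-Type: text/html; charset=ISO-8859-1') before echoing the body.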

Corrupted UTF-8 encoding when reading Google feed / alerts

Whenever I try to read a Google alert via PHP using something like:
$feed = file_get_contents("http://www.google.com/alerts/feeds/01445174399729103044/950192755411504138");
Regardless of whether I save $feed to a file or echo the result to the output, all UTF-8 Unicode characters (i.e. those with diacritics) are represented by white space. I have tried, without success, various combinations of:
utf8_encode
utf8_decode
iconv
mb_convert_encoding
I think the wrong characters have come from the stream, but I'm lost because if I try this URI in a browser then everything is fine. Can anyone shed some light on the issue?

Sorry, you are absolutely correct - there is something untoward happening! Though it is not what you would first suspect... For reference, given that:
echo mb_detect_encoding($feed); // prints: ASCII
The Unicode data is lost before it is even sent by the remote server. It appears that Google is looking at the User-Agent string in the request header, which file_get_contents does not send by default when no stream context is given.
Because it cannot identify the client making the request, it defaults to and forces ASCII encoding. This is presumably a necessary fallback in the event of some kind of cataclysmic cock-up. [citation needed...]
Simply naming your application is not enough, however; you need to include a known vendor. I'm unsure of the full extent of this, but I believe most folks include "Mozilla [version]" to work around the issue, for example:
$url = 'http://www.google.com/...';
$feed = file_get_contents($url, false, stream_context_create([
    'http' => [
        'method' => 'GET',
        'header' => 'Accept-Charset: UTF-8' . "\r\n"
                  . 'User-Agent: (Mozilla/5.0 compatible) MyFeedReader/1.0'
    ]
]));
file_put_contents('test.txt', $feed); // should now work as expected
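As a variation, the HTTP stream context also has a dedicated user_agent option, which avoids assembling the header string by hand. The sketch below only builds the context and inspects it without making a request; the application name is a placeholder.

```php
<?php
// The application name and version are placeholders, not a recommendation.
$context = stream_context_create([
    'http' => [
        'method'     => 'GET',
        'user_agent' => 'Mozilla/5.0 (compatible) MyFeedReader/1.0',
        'header'     => 'Accept-Charset: UTF-8',
    ],
]);

// Inspect what will be sent, without performing a request.
$options = stream_context_get_options($context);
echo $options['http']['user_agent'], PHP_EOL;

// $feed = file_get_contents($url, false, $context); // $url as in the answer
```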

Send UTF-16 encoded data with PHP curl

I'm building a PHP client for a web service that requires posted data to be encoded as UTF-16. How do I configure curl to encode my data in UTF-16 and also to decode the answer from UTF-16?
Some sample code:
$s = curl_init($url);
curl_setopt($s,CURLOPT_POST,1);
curl_setopt($s,CURLOPT_RETURNTRANSFER, 1 );
curl_setopt($s,CURLOPT_POSTFIELDS,$data);
curl_setopt($s,CURLOPT_TIMEOUT, 5);
curl_setopt($s,CURLOPT_HTTPHEADER,array('Content-Type: text/plain'));
$result = curl_exec($s);
curl_close($s);
Adding an Accept-Encoding header does not seem to do the trick. Is it possible to encode my $data string in UTF-16 first and then pass a byte array to curl instead of a string?
Thank you for your answers!

First, you need to find out what encoding your data is currently in. After that, the choice of function is yours: both iconv() and mb_convert_encoding() work pretty well.
Additionally, you should declare the encoding in the HTTP header:
curl_setopt($s,CURLOPT_HTTPHEADER,array('Content-Type: text/plain; charset=UTF-16'));
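The conversion step might look like the sketch below. It assumes the source data is UTF-8, that mbstring is installed, and that the service expects little-endian UTF-16 without a byte order mark; the byte order is a guess that has to be checked against the service's documentation, and the charset label in the header should match whatever is chosen.

```php
<?php
// Assumption: the service expects little-endian UTF-16 without a BOM.
// Use 'UTF-16BE' instead if it expects big-endian.
$data = "h\xC3\xA9llo"; // "héllo" as UTF-8 bytes

$payload = mb_convert_encoding($data, 'UTF-16LE', 'UTF-8');
echo bin2hex($payload), PHP_EOL; // 6800e9006c006c006f00

// $payload is an ordinary PHP string of bytes and can be passed as-is:
// curl_setopt($s, CURLOPT_POSTFIELDS, $payload);

// Decoding a UTF-16LE response body back to UTF-8 is the same call reversed.
$roundTrip = mb_convert_encoding($payload, 'UTF-8', 'UTF-16LE');
var_dump($roundTrip === $data); // bool(true)
```

PHP strings are byte strings, so no separate byte array type is needed; curl sends the converted string unchanged.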

Does charset matter when I echo content from a BLOB field?

I have various images in a mysql table (a classic BLOB field).
In my viewImage.php I simply do:
header( 'Content-Type: image/jpeg' );
//> Fetch Image from database then echo $row['blobField'];
Should I specify a charset?
Should I write:
header( 'Content-Type: image/jpeg; charset=UTF-8' );
I believe the answer is no, we don't need to set a charset.

No, since character sets and encodings only matter for text.

No, you don't need to set a charset.
The Content-Type header tells the browser how to handle the response. If you send both text/html and image/jpeg, the browser will not be able to handle it.
It has to be one or the other, it can't be both.
Using text/html is kind of a catch-all.
