I've created my crawler using the file_get_contents function, but when I crawl some sites I get this character: � where I should get this: é. Any ideas what is happening?
This is on a Windows VPS running PHP.
I've already tried:
file_get_contents() Breaks Up UTF-8 Characters
How fix UTF-8 Characters in PHP file_get_contents()
How to get file content with a proper utf-8 encoding using file_get_contents?
But none of these worked.
PS: The file I'm running this code from is saved as UTF-8.
$url = "https://play.google.com/books/reader?id=4rqYDwAAQBAJ&hl=en_US";
$options = array('http'=>array('method'=>"GET", 'header'=>"Accept-language: en-US,en;q=0.8\r\n" ."Accept-Charset: UTF-8, *;q=0"));
$context = stream_context_create($options);
$profile = file_get_contents($url,false,$context);
echo $profile;
I'm expecting to get accented characters and not this diamond character �.
Google is ignoring your Accept-Charset header because you're not specifying a User-Agent, no idea why. It took me one hour to figure it out. Adjust your options as follows:
$options = [
"http" => [
"method" => "GET",
"header" => "Accept-language: en-US,en;q=0.8\r\n" .
"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0\r\n" .
"Accept-Charset: UTF-8, *;q=0"
]
];
Adding the "User-Agent" header seems to do the trick. Google is probably returning a different encoding if not.
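To confirm what the server actually sent, it can help to look at the charset declared in the response headers. The following is a minimal sketch, not part of the original answer: charset_from_headers() is a hypothetical helper name, and it assumes the server declares its charset in the Content-Type header at all.

```php
<?php
// Hypothetical helper (not from the original answer): extract the charset
// declared in a list of raw response header lines, or null if none is declared.
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        // After redirects the list holds several header sets; the last match wins.
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([\w.:-]+)/i', $line, $m)) {
            $charset = strtoupper($m[1]);
        }
    }
    return $charset;
}

// Usage after a request (PHP fills $http_response_header in the calling scope):
// $profile = file_get_contents($url, false, stream_context_create($options));
// echo charset_from_headers($http_response_header) ?? 'no charset declared';
```

If the reported charset differs between a request with and without the User-Agent header, that would support the guess that the server varies its encoding by client.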
I'm using PHP Simple HTML DOM.
The targeted site is using UTF-8. My PHP script as well as the stream context are set to use UTF-8.
An element (which I inspect in the browser) has an innerHTML of "AAA ' BBB", at least when it is rendered in Firefox and Chrome.
However, my PHP script always fetches this string with the apostrophe encoded as an HTML entity (something like "AAA &#039; BBB").
I can fix this using htmlspecialchars_decode($string, 1), but I really want to know why the PHP script, or rather the website, is ("wrongly") encoding the string in the first place when I visit it with PHP, which is explicitly set to UTF-8.
header('Content-Type: text/html; charset=utf-8');
define("CONTEXT", stream_context_create(
array(
"http" =>
array(
"header" => 'Content-Type: text/html; charset=utf-8'
// also tried 'header' => 'Accept-Charset: UTF-8'
)
)
)
);
// target site reads UTF-8 - http://mtggoldfish.com.cutercounter.com/
$html = file_get_html($url, false, CONTEXT);
// do things; every "'" comes out as an encoded HTML entity
Browser inspectors do a bit of transformation to have something human-readable.
Create a simple HTML page with only the entity-encoded text (for example AAA &#039; BBB) in the body, and you will see AAA ' BBB in the inspector.
If you really want to see the content of the page, look at the source code (which is what file_get_html gets)
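A small sketch of the difference between the raw source and what an inspector shows. It assumes the page source contains the apostrophe as the numeric entity &#039; (the exact entity the site uses is an assumption); html_entity_decode() does roughly the decoding a browser performs before it displays text.

```php
<?php
// Assumption: this is what the raw page source contains.
$raw = 'AAA &#039; BBB';

// Decode entities the way a browser would before rendering the text.
// ENT_QUOTES is passed so that single-quote entities are decoded too.
$rendered = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $rendered, PHP_EOL;
```

So the entity is not an encoding error on the PHP side; it is simply the markup as the server wrote it.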
I am trying to read a page using file_get_contents() but I cannot get the character encoding to work.
This is my code:
$username = "masked";
$password = "maskedPass";
$remote_url = 'https://utfws.utfpr.edu.br/aluno01/sistema/mplistahorario.inicio?p_curscodnr=212';
// Create a stream
$opts = array(
'http'=>array(
'method'=>"GET",
'header' => array(
"Authorization: Basic " . base64_encode("$username:$password"),
'Accept-Charset: iso-8859-1'
)
)
);
$context = stream_context_create($opts);
// Open the file using the HTTP headers set above
$file = file_get_contents($remote_url, false, $context);
echo $file;
I tried to change the character encoding to utf-8 but I always get a page with question marks instead of áéíóúãõç.
When I open the page directly in my browser it works just fine. Why is this happening?
It sounds to me like this might just be a problem of lost encoding details.
What you're describing is:
1. Request a document from the web server, specifying encoding 8859-1.
2. The server responds with the document in the requested encoding, including a header specifying that the encoding is 8859-1. This will look correct in a browser.
3. Output the document (but not the header data!) from PHP (where this goes isn't specified).
4. Open the data in some sort of viewer.
See where the encoding specification was lost, there in step 3?
The data can correctly be decoded with 8859-1, but only will be decoded with 8859-1 if the viewer is configured to use that encoding by default. Some apps may have a default of 8859-1, but UTF-8 is a lot more common these days.
If you load the data into a different storage engine, say MySQL, the problem may compound. MySQL associates a charset with text data. If your database defaults to UTF-8 and you don't tell it the data is actually in 8859-1, you're feeding it data that is assumed to be UTF-8, and it will be treated as such in the database going forward. Even if you later ask the database for 8859-1, the data will be re-encoded from UTF-8 to 8859-1, but it was never valid UTF-8, so the result is yet another incorrect set of bytes.
To address this problem, specify the encoding when you view the data, or when you save it to a database.
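One way to keep the encoding from getting lost is to convert the fetched bytes to UTF-8 right away, so everything downstream can assume a single encoding. A minimal sketch, assuming the server really does answer in ISO-8859-1 as requested; to_utf8() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert a body from a known source charset to UTF-8.
function to_utf8(string $body, string $sourceCharset): string
{
    // Leave the body alone if it is already in the target encoding.
    if (strcasecmp($sourceCharset, 'UTF-8') === 0) {
        return $body;
    }
    return mb_convert_encoding($body, 'UTF-8', $sourceCharset);
}

// "\xE1\xE9" is "áé" in ISO-8859-1; in UTF-8 each letter takes two bytes.
$utf8 = to_utf8("\xE1\xE9", 'ISO-8859-1');
echo bin2hex($utf8), PHP_EOL; // c3a1c3a9
```

Alternatively, the bytes can be echoed unchanged as long as the script first sends header('Content-Type: text/html; charset=ISO-8859-1'), so the viewer knows how to decode them.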
I'm trying to read a Google alert via PHP using something like:
$feed = file_get_contents("http://www.google.com/alerts/feeds/01445174399729103044/950192755411504138");
Regardless of whether I save $feed to a file or echo the result to the output, all UTF-8 characters (i.e. those with diacritics) are represented by white space. I have tried, without success, various combinations of:
utf8_encode
utf8_decode
iconv
mb_convert_encoding
I think the wrong characters have come from the stream, but I'm lost because if I try this URI in a browser then everything is fine. Can anyone shed some light on the issue?
Sorry, you are absolutely correct - there is something untoward happening! Though it is not what you would first suspect... For reference, given that:
echo mb_detect_encoding($feed); // prints: ASCII
The unicode data is lost before it is even sent by the remote server - it appears that Google is looking at the user-agent string in the request header - which is non-existent using file_get_contents by default without a stream-context.
Because it cannot identify the client making the request it defaults to and forces ASCII encoding. This is presumably a necessary fallback in the event of some kind of cataclysmic cock-up. [citation needed...]
It's not enough simply to name your application, however; you need to include a known vendor. I'm unsure of the full extent of this, but I believe most folks include "Mozilla [version]" to work around the issue, for example:
$url = 'http://www.google.com/...';
$feed = file_get_contents($url, false, stream_context_create([
'http' => [
'method' => 'GET',
'header' => 'Accept-Charset: UTF-8' ."\r\n"
.'User-Agent: (Mozilla/5.0 compatible) MyFeedReader/1.0'
]
]));
file_put_contents('test.txt', $feed); // should now work as expected
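To check whether the non-ASCII characters actually arrived, a quick diagnostic along these lines may help. This is only a sketch; describe_bytes() is a hypothetical helper name, not part of the original answer.

```php
<?php
// Hypothetical diagnostic: classify a fetched body by the bytes it contains.
function describe_bytes(string $data): string
{
    if (!preg_match('/[\x80-\xFF]/', $data)) {
        return 'ascii-only';      // no bytes outside the ASCII range arrived
    }
    return mb_check_encoding($data, 'UTF-8')
        ? 'valid-utf8'            // non-ASCII bytes that form well-formed UTF-8
        : 'not-utf8';             // some other encoding, or damaged data
}

// echo describe_bytes($feed);
```

An 'ascii-only' result for a feed that should contain diacritics points at the server response rather than at any later conversion step.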
How can I determine whether a string was compressed with gzcompress (apart from comparing the sizes of the string before and after calling gzuncompress, or would that be the proper way of doing it)?
PRE: I guess, if you send a request, you can immediately look into $http_response_header to see whether one of the items in the array is a variation of Content-Encoding: gzip. But this is not ideal!
There is a far better method.
Here is HOW TO...
Check if it's GZIP. Like a BOSS!
According to the GZIP RFC:
The header of gzip content looks like this
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
ID1 and ID2 identify the content as GZIP, and CM states the compression method: 8 means deflate, which is what web servers customarily use with GZIP.
oh! and they have fixed values:
The value of ID1 is "\x1f"
The value of ID2 is "\x8b"
The value of CM is "\x08" (or just 8...)
almost there:
`$is_gzip = 0 === mb_strpos($mystery_string , "\x1f" . "\x8b" . "\x08");`
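Since the signature is a fixed sequence of raw bytes, a plain byte comparison also works and does not depend on any multi-byte encoding setting. A minimal sketch under that assumption; is_gzip() is a hypothetical helper name, and gzencode()/gzdecode() require the zlib extension.

```php
<?php
// Hypothetical helper: true if $data starts with the gzip signature
// ID1 = 0x1f, ID2 = 0x8b, CM = 0x08 (deflate).
function is_gzip(string $data): bool
{
    return strncmp($data, "\x1f\x8b\x08", 3) === 0;
}

$packed = gzencode('hello');          // gzencode() produces a gzip stream
var_dump(is_gzip($packed));           // bool(true)
var_dump(is_gzip('hello'));           // bool(false)

// gzdecode() is the counterpart of gzencode().
echo is_gzip($packed) ? gzdecode($packed) : $packed, PHP_EOL;
```

A matching signature only says the data claims to be gzip; gzdecode() can still return false if the rest of the stream is damaged.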
Working example
<?php
/** @link https://gist.github.com/eladkarako/d8f3addf4e3be92bae96#file-checking_gzip_like_a_boss-php */
date_default_timezone_set("Asia/Jerusalem");
while (ob_get_level() > 0) ob_end_flush();
mb_language("uni");
@mb_internal_encoding('UTF-8');
setlocale(LC_ALL, 'en_US.UTF-8');
header('Time-Zone: Asia/Jerusalem');
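// The charset belongs in Content-Type. "Charset" is not a standard response header,
// and Content-Encoding names compression codings (gzip, deflate), not character sets.
header('Content-Type: text/plain; charset=UTF-8');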
header('Access-Control-Allow-Origin: *');
function get($url, $cookie = '') {
$html = @file_get_contents($url, false, stream_context_create([
'http' => [
'method' => "GET",
'header' => implode("\r\n", [''
, 'Pragma: no-cache'
, 'Cache-Control: no-cache'
, 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2310.0 Safari/537.36'
, 'DNT: 1'
, 'Accept-Language: en-US,en;q=0.8'
, 'Accept: text/plain'
, 'X-Forwarded-For: ' . implode(', ', array_unique(array_filter(array_map(function ($item) { return filter_input(INPUT_SERVER, $item, FILTER_SANITIZE_SPECIAL_CHARS); }, ['HTTP_X_FORWARDED_FOR', 'REMOTE_ADDR', 'HTTP_CLIENT_IP', 'SERVER_ADDR', 'REMOTE_ADDR']), function ($item) { return null !== $item; })))
, 'Referer: http://eladkarako.com'
, 'Connection: close'
, 'Cookie: ' . $cookie
, 'Accept-Encoding: gzip'
])
]]));
$is_gzip = 0 === mb_strpos($html, "\x1f" . "\x8b" . "\x08", 0, "US-ASCII");
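// zlib_decode() detects the gzip wrapper itself; its second argument is a maximum output length, not an encoding constant.
return $is_gzip ? zlib_decode($html) : $html;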
}
$html = get('http://www.pogdesign.co.uk/cat/');
echo $html;
What do we see here that is worth mentioning?
Start by initializing the PHP engine to use UTF-8 (since we don't really know whether the web server will return GZIP content).
Providing the header Accept-Encoding: gzip tells the web server that it may output GZIP content.
Discovering GZIP content (you should use the multi-byte functions with ASCII encoding).
Finally, returning the plain output is easy using the ZLIB functions.
A string and a compressed string are both simply sequences of bytes. You cannot really distinguish one sequence of bytes from another sequence of bytes. You should know whether a blob of bytes represents a compressed format or not from accompanying metadata.
If you really need to guess programmatically, you have several things you can try:
Try to uncompress the string and see if the uncompress operation succeeds. If it fails, the bytes probably did not represent a compressed string.
Try to check for obvious "weird" bytes like anything before 0x20. Those bytes aren't typically used in regular text. There's no real guarantee that they occur in a compressed string though.
Use mb_check_encoding to see whether a string is valid in the encoding you suspect it to be in. If it isn't, it's probably compressed (or you checked for the wrong encoding). With the caveat that virtually any byte sequence is valid in virtually every single-byte encoding, so this'll only work for multi-byte encodings.
This works fine for me:
if (@gzuncompress($_xml) !== false) {
// gzipped string
}
You can simply try gzuncompress() on the data as noted by @DiDiegodaFonseca. If it fails, it was not made by gzcompress(), or it was not faithfully transmitted.
If you really want to, you can check the first two bytes for a zlib header (not a gzip header, as incorrectly suggested in the accepted answer). gzcompress() produces a zlib stream, not a gzip stream. gzencode() is what produces a gzip stream. gzdeflate() produces a raw deflate stream.
RFC 1950 describes the zlib header. It is two bytes, where the two bytes taken as a big-endian 16-bit unsigned integer must be a multiple of 31. In addition to checking that, you can check that the low four bits of the first byte are 8 (1000), and that the high bit is zero.
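The header check described above can be sketched as follows. looks_like_zlib() is a hypothetical helper name; the three conditions are the ones listed in the paragraph, and the example calls assume the zlib extension is loaded.

```php
<?php
// Hypothetical helper: check the two-byte zlib header described in RFC 1950.
function looks_like_zlib(string $data): bool
{
    if (strlen($data) < 2) {
        return false;
    }
    $cmf = ord($data[0]);   // compression method and window size
    $flg = ord($data[1]);   // flags, including the check bits

    return (($cmf << 8) | $flg) % 31 === 0   // 16-bit value is a multiple of 31
        && ($cmf & 0x0F) === 8               // low four bits: method 8 (deflate)
        && ($cmf & 0x80) === 0;              // high bit of the first byte is zero
}

var_dump(looks_like_zlib(gzcompress('hello')));  // bool(true)
var_dump(looks_like_zlib(gzencode('hello')));    // bool(false)
```

A passing check only says the header is plausible; gzuncompress() remains the real test, since arbitrary data can happen to satisfy all three conditions.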
I'm building a PHP client for a web service that requires posted data to be encoded as UTF-16. How do I configure curl to encode my data in UTF-16 and also to decode the answer from UTF-16?
Some sample code:
$s = curl_init($url);
curl_setopt($s,CURLOPT_POST,1);
curl_setopt($s,CURLOPT_RETURNTRANSFER, 1 );
curl_setopt($s,CURLOPT_POSTFIELDS,$data);
curl_setopt($s,CURLOPT_TIMEOUT, 5);
curl_setopt($s,CURLOPT_HTTPHEADER,array('Content-Type: text/plain'));
$result = curl_exec($s);
curl_close($s);
Adding an Accept-Encoding header does not seem to do the trick. Is it possible to encode my $data string in UTF-16 first and then pass a byte array to curl instead of a string?
Thank you for your answers!
First, you need to find out what your data's encoding is. Then it's your choice: both iconv() and mb_convert_encoding() work pretty well.
Additionally, you should declare the encoding in the HTTP header:
curl_setopt($s,CURLOPT_HTTPHEADER,array('Content-Type: text/plain; charset=UTF-16'));
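Putting the two steps together might look like the sketch below. It assumes the script's own strings are UTF-8 and that the service expects little-endian UTF-16 without a byte order mark; both are assumptions to check against the service's documentation. to_wire() and from_wire() are hypothetical helper names.

```php
<?php
// Assumption: the service expects little-endian UTF-16 without a byte order mark.
const WIRE_CHARSET = 'UTF-16LE';

// Hypothetical helpers: convert between the script's UTF-8 strings and the wire format.
function to_wire(string $utf8): string
{
    return mb_convert_encoding($utf8, WIRE_CHARSET, 'UTF-8');
}

function from_wire(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', WIRE_CHARSET);
}

// PHP strings are byte arrays, so the converted data can be handed to curl as-is:
// curl_setopt($s, CURLOPT_POSTFIELDS, to_wire($data));
// curl_setopt($s, CURLOPT_HTTPHEADER, ['Content-Type: text/plain; charset=' . WIRE_CHARSET]);
// $result = curl_exec($s);
// if ($result !== false) { $result = from_wire($result); }

echo bin2hex(to_wire('hi')), PHP_EOL; // 68006900
```

Naming the byte order explicitly avoids relying on how a given converter treats the bare name UTF-16, for example whether it adds a byte order mark.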