Corrupted UTF-8 encoding when reading Google feed / alerts

Whenever I try to read a Google alert via PHP using something like:
$feed = file_get_contents("http://www.google.com/alerts/feeds/01445174399729103044/950192755411504138");
Regardless of whether I save the $feed to a file or echo the result to the output, all UTF-8 characters (i.e. those with diacritics) are replaced by white space. I have tried, without success, various combinations of:
utf8_encode
utf8_decode
iconv
mb_convert_encoding
I think the wrong characters have come from the stream, but I'm lost because if I try this URI in a browser then everything is fine. Can anyone shed some light on the issue?

Sorry, you are absolutely correct - there is something untoward happening! Though it is not what you would first suspect... For reference, given that:
echo mb_detect_encoding($feed); // prints: ASCII
The unicode data is lost before it is even sent by the remote server. It appears that Google looks at the User-Agent string in the request header, which file_get_contents does not send by default unless you pass a stream context.
Because it cannot identify the client making the request, it defaults to, and forces, ASCII encoding. This is presumably a necessary fallback in the event of some kind of cataclysmic cock-up. [citation needed...]
Simply naming your application is not enough, however; you need to include a known vendor. I'm unsure of the full extent of this, but I believe most folks include "Mozilla [version]" to work around the issue, for example:
$url = 'http://www.google.com/...';
$feed = file_get_contents($url, false, stream_context_create([
    'http' => [
        'method' => 'GET',
        'header' => 'Accept-Charset: UTF-8' . "\r\n"
                  . 'User-Agent: (Mozilla/5.0 compatible) MyFeedReader/1.0'
    ]
]));
file_put_contents('test.txt', $feed); // should now work as expected
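To confirm that the accented characters actually survived the download, one option is to check that the body is valid UTF-8 and contains at least one byte outside the ASCII range. This is only a sketch: it assumes the mbstring extension is loaded, and contains_multibyte_utf8 is a made-up helper name, not part of PHP.

```php
<?php
// Hypothetical helper: true when $body is valid UTF-8 and holds at least one
// non-ASCII byte, i.e. multi-byte characters are present in the response.
function contains_multibyte_utf8(string $body): bool
{
    return mb_check_encoding($body, 'UTF-8')
        && preg_match('/[\x80-\xFF]/', $body) === 1;
}

var_dump(contains_multibyte_utf8('plain ascii')); // bool(false)
var_dump(contains_multibyte_utf8("caf\xC3\xA9")); // bool(true)
```

Running it on $feed before and after adding the User-Agent header should show whether the server stripped the characters.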

Related

PHP - html_simple_dom, crawlers encodes innerhtml?

I'm using PHP html_simple_dom.
The targeted site is using UTF-8. My PHP as well as the stream context are set to use UTF-8.
An element (which I inspect in the browser) has an innerHTML of "AAA ' BBB", at least as rendered by my Firefox and Chrome browsers.
However, my PHP script always fetches this string as "AAA &#039; BBB".
I can fix this using htmlspecialchars_decode($string, 1), but I really want to know why the PHP script, or rather the website, is ("wrongly") encoding the string in the first place when I visit it using my PHP, which is explicitly set to UTF-8:
header('Content-Type: text/html; charset=utf-8');
define("CONTEXT", stream_context_create(
    array(
        "http" => array(
            "header" => 'Content-Type: text/html; charset=utf-8'
            // also tried 'header' => 'Accept-Charset: UTF-8'
        )
    )
));
// target site serves UTF-8 - http://mtggoldfish.com.cutercounter.com/
$html = file_get_html($url, false, CONTEXT);
// do things, blurts out every "'" as encoded &#039;

Browser inspectors do a bit of transformation to have something human-readable.
Create a simple HTML file with only AAA &#039; BBB in the body, and you will see AAA ' BBB in the inspectors.
If you really want to see the content of the page, look at the source code (which is what file_get_html gets)
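As a small illustration of the point above, the raw source carries the apostrophe as a numeric character reference, and decoding it yields what the browser inspector displays. The sample string is the one from the question; the flag choice is just one reasonable option.

```php
<?php
// What file_get_html() sees is the raw source, where the apostrophe is
// written as a numeric character reference.
$source = 'AAA &#039; BBB';

// ENT_QUOTES makes the decoder handle single quotes as well as double quotes.
$decoded = html_entity_decode($source, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL; // AAA ' BBB
```

So nothing is being re-encoded for PHP specifically: the entity is in the page source, and the browser decodes it before display.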

file_get_contents() breaking ISO-8859-1 encoding

I am trying to read a page using file_get_contents() but I cannot get the character encoding to work.
This is my code:
$username = "masked";
$password = "maskedPass";
$remote_url = 'https://utfws.utfpr.edu.br/aluno01/sistema/mplistahorario.inicio?p_curscodnr=212';
// Create a stream
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => array(
            "Authorization: Basic " . base64_encode("$username:$password"),
            'Accept-Charset: iso-8859-1'
        )
    )
);
$context = stream_context_create($opts);
// Open the file using the HTTP headers set above
$file = file_get_contents($remote_url, false, $context);
echo $file;
I tried to change the character encoding to utf-8 but I always get a page with question marks instead of áéíóúãõç.
When I open the page directly in my browser it works just fine. Why is this happening?

It sounds to me like this might just be a problem of lost encoding details.
What you're describing is:
1. Request document from webserver, specifying encoding 8859-1.
2. Server responds with document in requested encoding, including a header specifying that the encoding is 8859-1. This will look correct in a browser.
3. Output document (but not header data!) from PHP (where this goes isn't specified).
4. Open the data in some sort of viewer.
See where the encoding specification was lost, there in step 3?
The data can correctly be decoded with 8859-1, but only will be decoded with 8859-1 if the viewer is configured to use that encoding by default. Some apps may have a default of 8859-1, but UTF-8 is a lot more common these days.
If you load the data into a different storage engine, say MySQL, the problem may compound. MySQL associates a charset with text data. If your database defaults to utf-8 and you don't tell it the data is actually in 8859-1, you're feeding it data that is assumed to be utf-8, and it will be treated as such in the database going forward. Even if you ask the database for 8859-1 in the future, the data will be re-encoded from utf-8 to 8859-1, but it was never valid utf-8, so you get yet another incorrect set of bytes.
To address this problem, specify the encoding when you view the data, or when you save it to a database.
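One way to avoid carrying the encoding around separately is to re-encode the body to UTF-8 as soon as it is downloaded. The sketch below assumes the mbstring extension and hard-codes the source charset; to_utf8 is a hypothetical helper name, and in practice the charset would come from the Content-Type response header.

```php
<?php
// Hypothetical helper: re-encode a downloaded body to UTF-8, given the
// charset the server declared for it.
function to_utf8(string $body, string $sourceCharset): string
{
    return mb_convert_encoding($body, 'UTF-8', $sourceCharset);
}

// "áé" as ISO-8859-1 bytes: one byte per character.
$latin1 = "\xE1\xE9";
$utf8   = to_utf8($latin1, 'ISO-8859-1');

echo bin2hex($utf8), PHP_EOL; // c3a1c3a9

// When echoing to a browser, declare the charset before any output, e.g.
// header('Content-Type: text/html; charset=utf-8');
```

The alternative is to leave the bytes untouched and send header('Content-Type: text/html; charset=iso-8859-1') yourself, so the viewer knows how to decode them.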

Sending compressed text over Amazon SQS from PHP to NodeJS

I seem to be stuck at sending the compressed messages from PHP to NodeJS over Amazon SQS.
Over on the PHP side I have:
$SQS->sendMessage(Array(
'QueueUrl' => $queueUrl,
'MessageBody' => 'article',
'MessageAttributes' => Array(
'json' => Array(
'BinaryValue' => bzcompress(json_encode(Array('type'=>'article','data'=>$vijest))),
'DataType' => 'Binary'
)
)
));
NOTE 1: I also tried putting compressed data directly in the message, but the library gave me an error with some invalid byte data
On the Node side, I have:
body = decodeBzip(message.MessageAttributes.json.BinaryValue);
Where message comes from the sqs.receiveMessage() call; that part works, since it worked for raw (uncompressed) messages.
What I am getting is TypeError: improper format
I also tried using:
PHP - NODE
gzcompress() - zlib.inflateraw()
gzdeflate() - zlib.inflate()
gzencode() - zlib.gunzip()
And each of those pairs gave me their version of the same error (essentially, input data is wrong)
Given all that I started to suspect that an error is somewhere in message transmission
What am I doing wrong?
EDIT 1: It seems that the error is somewhere in transmission, since bin2hex() in php and .toString('hex') in Node return totally different values. It seems that Amazon SQS API in PHP transfers BinaryAttribute using base64 but Node fails to decode it. I managed to partially decode it by turning off automatic conversion in amazon aws config file and then manually decoding base64 in node but it still was not able to decode it.
EDIT 2: I managed to accomplish the same thing by using base64_encode() on the PHP side and sending the base64 as the MessageBody (not using MessageAttributes). On the Node side I used new Buffer(messageBody,'base64') and then decodeBzip on that. It all works, but I would still like to know why MessageAttributes is not working as it should. The current base64 approach adds overhead, and I like to use services as they are intended, not through workarounds.

This is what all the SQS libraries do under the hood. You can get the php source code of the SQS library and see for yourself. Binary data will always be base64 encoded (when using MessageAttributes or not, does not matter) as a way to satisfy the API requirement of having form-url-encoded messages.
I do not know how long the data in your $vijest is, but I am willing to bet that after zipping and then base64 encoding it will be bigger than before.
So my answer to you would be two parts (plus a third if you are really stubborn):
1. When looking at the underlying raw API it is absolutely clear that not using MessageAttributes does NOT add additional overhead from base64. Instead, using MessageAttributes adds some slight additional overhead because of the structure of the data enforced by the SQS PHP library. So not using MessageAttributes is clearly NOT a workaround, and you should do it if you want to zip the data yourself and you got it to work that way.
2. Because of the nature of an HTTP POST request, it is a very bad idea to compress your data inside your application. Base64 overhead will likely nullify the compression advantage, and you are probably better off sending plain text.
3. If you absolutely do not believe me or the API spec or the HTTP spec and want to proceed, then I would advise sending a simple short string 'teststring' in the BinaryValue parameter and comparing what you sent with what you got. That will make it very easy to understand the transformations the SQS library is doing on the BinaryValue parameter.
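The size argument is easy to measure for your own payloads. The sketch below uses an assumed sample payload, and gzcompress() stands in for bzcompress() so that only the zlib extension is needed; the actual numbers depend entirely on what is in $vijest.

```php
<?php
// Assumed sample payload; gzcompress() stands in for bzcompress() so the
// sketch only needs the zlib extension.
$json       = json_encode(['type' => 'article', 'data' => 'teststring']);
$compressed = gzcompress($json);
$encoded    = base64_encode($compressed);

printf("plain: %d bytes\n", strlen($json));
printf("compressed: %d bytes\n", strlen($compressed));
printf("compressed + base64: %d bytes\n", strlen($encoded));

// base64 always emits 4 output characters per 3 input bytes (rounded up).
$expected = 4 * intdiv(strlen($compressed) + 2, 3);
var_dump(strlen($encoded) === $expected); // bool(true)
```

Since base64 adds roughly a third to whatever it encodes, compression only pays off when it shrinks the payload by clearly more than that.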

gzcompress() would be decoded by zlib.Inflate(). gzdeflate() would be decoded by zlib.InflateRaw(). gzencode() would be decoded by zlib.Gunzip(). So out of the three you listed, two are wrong, but one should work.
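The three PHP functions wrap the same deflate data in different containers, which is why each needs its own decoder. A short sketch of the difference, using 'teststring' as sample input and assuming the zlib extension:

```php
<?php
$data = 'teststring';

$zlib = gzcompress($data); // zlib stream (RFC 1950)      -> zlib.inflate()
$raw  = gzdeflate($data);  // raw deflate (RFC 1951)      -> zlib.inflateRaw()
$gzip = gzencode($data);   // gzip file format (RFC 1952) -> zlib.gunzip()

// The wrappers differ in their leading bytes, which is why the wrong
// decoder rejects the input.
echo bin2hex(substr($zlib, 0, 1)), PHP_EOL; // 78
echo bin2hex(substr($gzip, 0, 2)), PHP_EOL; // 1f8b

// Each PHP encoder has a matching PHP decoder too.
var_dump(gzuncompress($zlib) === $data); // bool(true)
var_dump(gzinflate($raw) === $data);     // bool(true)
var_dump(gzdecode($gzip) === $data);     // bool(true)
```

Comparing these leading bytes with what arrives on the Node side is a quick way to tell whether the data was altered in transit (for example, still base64 encoded).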

php: gzdecode() insufficient memory

Sometimes when downloading the source of a webpage and trying to decode it, I get an error: gzdecode() insufficient memory. (The memory limit is 500M; usage is far below that.)
I include headers with my curl output; those are separated from the content correctly before decoding. The Content-Encoding header of the pages is clearly gzip. I read on php.net that including a length argument could cause such a crash, but I do not use the length argument with gzdecode.
So while seemingly everything should be fine, I still get the error. Last time I found it with this page: https://ahmia.fi/address/.
Is there perhaps something about HTTPS that I do not know? My curl setting is \CURLOPT_SSL_VERIFYPEER => false.
Any help appreciated!

CURLOPT_ENCODING The contents of the "Accept-Encoding: " header. This enables decoding of the response. Supported encodings are "identity", "deflate", and "gzip". If an empty string, "", is set, a header containing all supported encoding types is sent.
Try this:
CURLOPT_ENCODING => ""
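In context, that option might be used as in the sketch below. build_curl_options is a hypothetical helper, the URL is a placeholder, and the request itself is left commented out because it needs network access.

```php
<?php
// Hypothetical helper that builds the cURL options for a fetch.
function build_curl_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        // Empty string: advertise every encoding cURL supports and let cURL
        // decompress the body, so gzdecode() is not needed afterwards.
        CURLOPT_ENCODING       => '',
    ];
}

// Usage (performs a network request, so it is left commented out here):
// $ch = curl_init();
// curl_setopt_array($ch, build_curl_options('https://example.com/'));
// $html = curl_exec($ch); // body arrives already decoded
// curl_close($ch);
```

With this setting the body should not be passed through gzdecode() again, since it is no longer compressed by the time curl_exec() returns it.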

How to format incoming email text for HTML display

I've set up a script that processes incoming emails and creates blog entries on Blogger. I'm using PEAR's Mail_Mime libs (for now) to read the incoming message. The messages often have characters in them that cannot be read by browsers--this happens most often when people use Outlook or cut/paste from MS Word.
So the output at the other end is something like this:
Here is a test post with “quotes” and ‘apostrophes�for what it�s worth, it also has dashes�and other strange formatting cut and paste from MS Word.
You can also see the output in the wild.
It's not hard to fix any specific instance, but each client (hotmail, gmail, outlook, etc) seems to handle things just a bit differently. Mail_Mime only seems to munge the output and, if I turn off Mail_Mime's parsing and try to translate the encoded characters myself using mb_convert_encoding or some manual simulation of this, it's even worse.
Please note that this is not going to be solved by selecting the right encoding type and using decode/encode/convert functions. The incoming formats vary from Windows-1252 to UTF-8 to just about anything else mail clients can think of.
Has anyone scripted this before that could save me some time by offering up a sample or advice on the best approach? I've tried all the simple answers and done plenty of experimenting, so please don't bother responding unless you've dealt with a similar issue successfully or have a deep understanding of encoding issues.

The only way to do this is by the spec, which I'm afraid means pulling in the 'Content-Type' MIME header, picking up the charset (it'll look like Content-Type: text/plain; charset="us-ascii"), converting to UTF-8, and of course ensuring your output on the web is sent as UTF-8 with the right headers.
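A rough sketch of that approach, assuming the mbstring extension: both function names are hypothetical, and the Windows-1252 fallback for parts that declare no charset is an assumption of this example rather than anything the spec requires.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type
// header value; returns null when none is declared.
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m) === 1) {
        return strtolower($m[1]);
    }
    return null;
}

// Hypothetical helper: convert an already-decoded body part to UTF-8.
function part_to_utf8(string $text, ?string $charset): string
{
    // Assumption: fall back to Windows-1252 when no charset is declared.
    return mb_convert_encoding($text, 'UTF-8', $charset ?? 'Windows-1252');
}

$charset = charset_from_content_type('text/plain; charset="us-ascii"');
echo $charset, PHP_EOL; // us-ascii
```

Mail clients sometimes declare charsets that mbstring does not recognise, so real code would need to handle a failed conversion as well.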

To solve this problem, and get my message into valid UTF-8 that is readable from a browser, I found this PHP lib, ConvertCharset by Mikolaj Jedrzejak, which worked on almost everything. It still had issues with a specific symbol (=A0) when converting from Windows-1252 or iso-8859-1. So I converted this character manually before setting the code loose.
Here's what it looks like overall:
// decode using Mail_Mime
require 'Mail.php';
require 'Mail/mime.php';
require 'Mail/mimeDecode.php';
$params = array();
$params['include_bodies'] = true;
$params['decode_bodies'] = true; // this decodes it!
$params['decode_headers'] = true;
$decoder = new Mail_mimeDecode($input);
$mime = $decoder->decode($params);
// too much work to put in this example
$charset = ...; // do some magic with $mime->parts to get the character set
$text = ...; // do some magic with $mime->parts to get the text
require 'ConvertCharset.class.php';
if (strtolower($charset) != 'utf-8') {
    // fix the =A0 character (a non-breaking space in Windows-1252 and
    // ISO-8859-1); it's already been decoded by Mail_Mime, so we need the
    // actual byte now. This has to be done before converting to UTF-8, and
    // only for non-UTF-8 input, where 0xA0 is not part of a multi-byte sequence.
    $char = chr(hexdec(substr('=A0', 1))); // the single byte 0xA0
    $text = str_replace($char, '', $text);
    // convert to UTF-8 using ConvertCharset
    $converter = new ConvertCharset($charset, 'utf-8', false);
    $text = $converter->Convert($text);
}
Then everything is spiffy. It even does the infamous Iñtërnâtiônàlizætiøn conversion, as well as accepting French, Spanish, and pastes directly from MS Word :)
