cURL gets response with utf-8 BOM - php

In my script I send data with cURL, with CURLOPT_RETURNTRANSFER enabled. The response is JSON-encoded data. When I try to json_decode it, it returns null. Then I found that the response contains the UTF-8 BOM bytes (EF BB BF, displayed as ï»¿) at the beginning of the string.
Here are some experiments:
$data = curl_exec($ch);
echo $data;
the result is
{"field_1":"text_1","field_2":"text_2","field_3":"text_3"}
$data = curl_exec($ch);
echo mb_detect_encoding($data);
result - UTF-8
$data = curl_exec($ch);
echo mb_convert_encoding($data, 'UTF-8', mb_detect_encoding($data));
// identical to echo mb_convert_encoding($data, 'UTF-8', 'UTF-8');
result - {"field_1":"text_1","field_2":"text_2","field_3":"text_3"}
The only thing that helps is removing the first 3 bytes:
if (substr($data, 0, 3) == pack('CCC', 239, 187, 191)) {
    $data = substr($data, 3);
}
But what if a different BOM arrives? So the question is:
How can I detect the right encoding of a cURL response? Or how can I detect which BOM has arrived? Or maybe how can I convert a response that has a BOM?

I'm afraid you have already found the answer yourself. The bad news is that there is no better answer that I know of.
The BOM should not be there, and it's the sender's responsibility to not send it along.
But I can reassure you: the BOM is either there or it is not, and if it is, it's those three bytes you know.
You can make the check slightly faster and handle any number of repeated BOMs with a small alteration:
$__BOM = pack('CCC', 239, 187, 191);
// Careful about the three ='s -- they're all needed.
while (0 === strpos($data, $__BOM)) {
    $data = substr($data, 3);
}
A third-party BOM detector wouldn't do anything different. This way you're covered even if at a later time cURL begins stripping unneeded BOMs.
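To tie this back to the original json_decode problem, here is a minimal sketch that strips any leading UTF-8 BOMs and then decodes. The function name decode_json_without_bom is made up for illustration, and the sketch assumes PHP 8 and a response whose top level is a JSON object or array.

```php
<?php
// Hypothetical helper: strip any leading UTF-8 BOMs, then decode the JSON.
function decode_json_without_bom(string $data): array
{
    $bom = "\xEF\xBB\xBF"; // same bytes as pack('CCC', 239, 187, 191)

    // strncmp() looks only at the first 3 bytes, so it never scans the body.
    while (strncmp($data, $bom, 3) === 0) {
        $data = substr($data, 3);
    }

    $decoded = json_decode($data, true);
    if (!is_array($decoded)) {
        // Assumption: the caller expects a JSON object or array at the top level.
        throw new RuntimeException('JSON decode failed: ' . json_last_error_msg());
    }
    return $decoded;
}

// Simulated cURL response that starts with a BOM.
$response = "\xEF\xBB\xBF" . '{"field_1":"text_1","field_2":"text_2","field_3":"text_3"}';
$fields = decode_json_without_bom($response);
echo $fields['field_1'], "\n";
```

Throwing on failure keeps a null from json_decode from silently travelling further into the script.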
Possible causes
Some JSON optimizers and filters may decide the output requires a BOM. Also, perhaps more simply, whoever wrote the script generating the JSON inadvertently saved it with a BOM before the opening PHP tag. PHP treats anything before the opening tag as literal output, so the BOM is sent along ahead of whatever the script itself prints. This can occasionally also cause the "Cannot modify header information - headers already sent" warning.
Content detection
You can verify that the JSON is valid UTF-8, with or without a BOM, but you need the mbstring extension and you must use strict mode to catch some edge cases:
if (false === mb_detect_encoding($data, 'UTF-8', true)) {
    // JSON contains invalid sequences (erroneously NOT JSON encoded)
}
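If mbstring is not available, the same validity check can be sketched with PCRE instead. This is an alternative to the snippet above, not the answerer's method, and is_valid_utf8 is a hypothetical name.

```php
<?php
// Returns true when $data is well-formed UTF-8 (a leading BOM is still valid UTF-8).
function is_valid_utf8(string $data): bool
{
    // With the /u modifier PCRE validates the subject string first; on
    // malformed input preg_match() returns false instead of 0 or 1.
    return preg_match('//u', $data) === 1;
}

var_dump(is_valid_utf8("\xEF\xBB\xBF{\"a\":1}")); // BOM followed by JSON
var_dump(is_valid_utf8("\xB0C"));                  // lone 0xB0 byte, not valid UTF-8
```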
I would advise against trying to correct a possible encoding error; you risk breaking your own code, and also having to maintain someone else's work.

This page details a similar issue: BOM in a PHP page auto generated by Wordpress
Basically, it can occur when the JSON generator is written in PHP and an editor has somehow snuck in the BOM before the opening <?php tag. Since your client language is PHP I'm assuming this is relevant.
You could strip it out using the substr comparison -- a BOM only ever occurs at the start of a document. But if you have control over the JSON source, you should remove the BOM from the source document instead.

There will never be more than 3 bytes before the "{". Those 3 bytes are a single character (U+FEFF) in UTF-8. So once you know the BOM is there, $data = substr($data, 3); is all you need.
Take a look here for more information: json_decode returns NULL after webservice call

Related

Removing invisible characters from UTF-8 XML data

I am consuming an XML feed which contains a great deal of whitespace.
When I echo out the raw feed, it looks as though the columns of the tabled data are properly formatted with just the white space.
I have tried many regex patterns to remove it, to only allow visible characters, trim, chop, utf-8 encode/decode, nothing is touching it. It's like it is laughing in my face when I echo out a value and see this:
string(17) "72"
Opened the data in Notepad++ with show all characters on, and it simply shows it as spaces. I am at a loss of where to go with this.
I did receive the following error:
simplexml_load_string(): Entity: line 265: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x43 0x20 0x74
I just found this regex (untested)
$xml_data = preg_replace("/>\s+</", "><", $xml_data);
If you are using the xml parser, I think you can use the 'XML_OPTION_SKIP_WHITE' option referenced here:
http://php.net/manual/en/function.xml-parser-set-option.php
Try running the data through utf8_encode() - it might seem like a hack, but it seems like the originating data isn't properly setup.
My theory is that you're grabbing it with the wrong encoding, and the proper solution would be to load it differently.
Solution
My very hacky workaround that works:
$raw = file_get_contents('http://stupidwebservice.com/xmldata.asmx/Feed');
$raw = urlencode(utf8_encode($raw));
$raw = str_replace('++','',$raw);
$raw = urldecode($raw);
urlencoding after the utf-8 encoding turned the space into +'s. I simply removed all instances of double ++'s and took it back. Works great.
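A less roundabout variant of the same idea, as a sketch only. It assumes the feed is really ISO-8859-1 (byte 0xB0 from the error message is the degree sign in that encoding) and that the padding consists of no-break spaces, which the question does not confirm. mb_convert_encoding is used here because utf8_encode is deprecated as of PHP 8.2; both function names below are made up for illustration.

```php
<?php
// Assumption: the feed is ISO-8859-1, so convert it to UTF-8 before parsing.
function latin1_to_utf8(string $raw): string
{
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}

// Trim ASCII whitespace and Unicode separators such as the no-break space,
// which plain trim() leaves in place.
function trim_unicode(string $value): string
{
    return preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $value);
}

$raw   = "\xA0\xA0 72 \xA0"; // hypothetical padded value from the feed
$clean = trim_unicode(latin1_to_utf8($raw));
var_dump($clean);
```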

write unicode character to a file in php - not duplicate

I have a unicode string received over HTTP Post or fetched from a DB (does not matter)
In PHP I checked the encoding of the string using "mb_detect_encoding" and got UTF-8 as the result.
So the string is Unicode.
But how do I write the string from PHP to an output file with the proper encoding?
$fd = fopen('myfile.php', "wb");
fwrite($fd, $msg."\n");
What I see is "टेसà¥à¤Ÿ" instead of the actual string which is
टेस्ट्
Pasting the 'junk' into Notepad++ and then from menu option doing 'encoding UTF-8' will show the proper text.
EDIT
*SOLUTION*
Sorry for posting the question and figuring out the answer myself.
I found the solution at the following site
http://www.codingforums.com/showthread.php?t=129270
function writeUTF8File($filename, $content) {
    $f = fopen($filename, "w");
    # Now UTF-8 - Add byte order mark
    fwrite($f, pack("CCC", 0xef, 0xbb, 0xbf));
    fwrite($f, $content);
    fclose($f);
}
PHP does not change the encoding of the string or does anything with it when you write to a file. It simply dumps the bytes of the string (PHP strings are really byte arrays) into the file, period. If you actually receive the string as UTF-8 and do not do anything with it except write it to a file, the content of the file will be UTF-8 encoded. Your problem is most likely that whatever application you're using to view the file does not properly read it as UTF-8 encoded.
That BOM solution is not necessarily the best. A BOM is not necessary for UTF-8 and many applications have problems with it. It only helps applications that are otherwise unable (too stupid) to detect that a file is UTF-8 encoded. The better solution may be to simply explicitly tell the application in question that it needs to treat the file as UTF-8 encoded when opening the file. Or use a better application.
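A small sketch of the point above: writing the string unchanged, with no BOM, already produces a UTF-8 file. The name write_utf8_file and the scratch file are assumptions for the demo.

```php
<?php
// Write a UTF-8 string exactly as it is, with no BOM prepended.
function write_utf8_file(string $filename, string $content): void
{
    if (file_put_contents($filename, $content) === false) {
        throw new RuntimeException("Could not write $filename");
    }
}

$msg  = "टेस्ट्\n";
$path = tempnam(sys_get_temp_dir(), 'utf8'); // hypothetical scratch file
write_utf8_file($path, $msg);

// The bytes on disk are the bytes of the string: nothing added, nothing changed.
var_dump(file_get_contents($path) === $msg);
unlink($path);
```

If the viewer still shows junk, the file is fine and the viewer needs to be told to read it as UTF-8.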
You have to specify the strict parameter of mb_detect_encoding, or you'll get many false positives.
Also, while the output may be UTF-8, you will have to send the right header (Content-Type with a charset parameter) and/or the charset meta tag (if it's HTML).

PHP fread() Function Returning Extra Characters at the Front on UTF-8 Text Files

When I use fread() on a normal text file (for example, an ANSI file saved with Notepad), the returned content string is correct, as expected.
But when I read a UTF-8 text file, the returned string contains invisible characters at the front. I say invisible because the extra characters can't be seen in normal output (e.g. a plain echo). But when the string is used for further processing (for example, to build a link with an href value), the problem appears.
$filename = "blabla.txt";
$handle = fopen($filename, "r");
$contents = fread($handle, filesize($filename));
fclose($handle);
echo '<a href="'.$contents.'">'.$contents.'</a>';
I put only http://www.google.com in the UTF-8 encoded text file. When you run the PHP file, you will see an output link http://www.google.com
.. but you will never reach Google,
because the href in the page source looks like this:
%EF%BB%BFhttp://www.google.com
It means fread added the weird characters %EF%BB%BF at the front.
This is extra annoying. Why is it happening?
Added:
Some people point out that it is a BOM. So, BOM or whatever, it is changing my original values, and now it causes problems in other steps, function calls, etc. Now I have to call substr($string, 3) for all outputs. Changing the original values like this makes no sense.
This is called the UTF-8 BOM. Please refer to http://en.wikipedia.org/wiki/Byte_order_mark
It is something that is optionally added to the beginning of UTF-8 files, meaning it is in the file, and not something fread adds. Most text editors won't display the BOM, but some will, mostly those that don't understand it. Not all editors add it to UTF-8 files, but some do.
For UTF-8 the use of a BOM is not recommended, as it carries no byte-order information and many programs do not understand it.
It is the UTF-8 BOM. If you look at the docs for fread, someone has discussed a solution for it.
The solution given there is the following:
// Reads past the UTF-8 BOM if it is there.
function fopen_utf8($filename, $mode) {
    $file = @fopen($filename, $mode);
    if ($file === false) {
        return false; // could not open the file
    }
    $bom = fread($file, 3);
    if ($bom !== "\xEF\xBB\xBF") {
        rewind($file); // no BOM: go back to the start (rewind() takes only the handle)
    } else {
        echo "bom found!\n";
    }
    return $file;
}
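For the whole-file case in the question, a minimal sketch along the same lines, assuming the file is small enough to read in one go. The name read_without_bom and the scratch file are made up for the demo.

```php
<?php
// Read a whole file and drop a leading UTF-8 BOM if the file has one.
function read_without_bom(string $filename): string
{
    $contents = file_get_contents($filename);
    if ($contents === false) {
        throw new RuntimeException("Could not read $filename");
    }
    if (strncmp($contents, "\xEF\xBB\xBF", 3) === 0) {
        $contents = substr($contents, 3);
    }
    return $contents;
}

// Demo with a scratch file that starts with a BOM, as some editors save it.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBFhttp://www.google.com");
$url = trim(read_without_bom($path));
unlink($path);

echo '<a href="' . htmlspecialchars($url) . '">' . htmlspecialchars($url) . '</a>', "\n";
```

Stripping once at the point where the file is read means the rest of the code never needs substr($string, 3).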

Php cannot find way to split utf-8 strings

i just started dabbling in php and i'm afraid i need some help to figure out how to manipulate utf-8 strings.
I'm working in ubuntu 11.10 x86, php version 5.3.6-13ubuntu3.2. I have a utf-8 encoded file (vim :set encoding confirms this) which i then proceed to reading it using
$file = fopen("file.txt", "r");
while(!feof($file)){
$line = fgets($file);
//...
}
fclose($file);
using mb_detect_encoding($line) reports UTF-8
If i do echo $line I can see the line properly (no mangled characters) in the browser
so I guess everything is fine with browser and apache. Though i did search my apache configuration for AddDefaultCharset and tried adding http meta-tags for character encoding (just in case)
When i try to split the string using $arr = mb_split(';',$line) the fields of the resulting array contain mangled utf-8 characters (mb_detect_encoding($arr[0]) reports utf-8 as well).
So echo $arr[0] will result in something like this: ΑΘΗÎÎ.
I have tried setting mb_detect_order('utf-8'), mb_internal_encoding('utf-8'), but nothing changed. I also tried to manually detect utf-8 using this w3 perl regex because i read somewhere that mb_detect_encoding can sometimes fail (myth?), but results were the same as well.
So my question is how can i properly split the string? Is going down the mb_ path the wrong way? What am I missing?
Thank you for your help!
UPDATE: I'm adding sample strings and base64 equivalents (thanks to @chris for his suggestion)
1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "ΑΘΗÎΑ"
5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="
Ok, so after doing this there seems to be a 77u/ difference between 3. and 5. which according to this is a utf-8 BOM mark. So how can i avoid it?
UPDATE 2: I woke up refreshed today and with your tips in mind I tried it again. It seems that $line=fgets($file) reads the first line correctly (no mangled chars), and fails for each subsequent line. So then I base64_encoded the first and second lines, and the 77u/ BOM appeared in the base64'd string of the first line only. I then opened up the offending file in vim, and entered :set nobomb :w to save the file without the BOM. Firing up PHP again showed that the first line was also mangled now. Based on @hakre's remove_utf8_bom I added its complementary function
function add_utf8_bom($str){
$bom= "\xEF\xBB\xBF";
return substr($str,0,3)===$bom?$str:$bom.$str;
}
and voila each line is read correctly now.
I do not much like this solution, as it seems very hackish (I can't believe that an entire framework/language does not provide a way to deal with strings that have no BOM). So do you know of an alternative approach? Otherwise I'll proceed with the above.
Thanks to @chris, @hakre and @jacob for their time!
UPDATE 3 (solution): It turns out after all that it was a browser thing: it was not enough to add header('Content-type: text/html; charset=UTF-8') and meta-tags like <meta http-equiv="Content-type" value="text/html; charset=UTF-8" />. It also had to be properly enclosed inside an <html><body> section or the browser would not understand the encoding correctly. Thanks to @jake for his suggestion.
Moral of the story: I should learn more about HTML before trying to code for the browser in the first place. Thanks for your help and patience everyone.
UTF-8 has the very nice feature that it is ASCII-compatible. With this I mean that:
ASCII characters stay the same when encoded to UTF-8
no other characters will be encoded to ASCII characters
This means that when you try to split a UTF-8 string by the semicolon character ;, which is an ASCII character, you can just use standard single byte string functions.
In your example, you can just use explode(';',$utf8encodedText) and everything should work as expected.
PS: Since the UTF-8 encoding is prefix-free, you can actually use explode() with any UTF-8 encoded separator.
PPS: It seems like you try to parse a CSV file. Have a look at the fgetcsv() function. It should work perfectly on UTF-8 encoded strings as long as you use ASCII characters for separators, quotes, etc.
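A quick sketch of both suggestions, using the sample line from the question. The explicit enclosure and escape arguments to str_getcsv are my choice for the demo, not something the question specifies.

```php
<?php
$line = "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889";

// Byte-oriented explode() is safe here: the byte for ';' never occurs
// inside a multi-byte UTF-8 sequence.
$fields = explode(';', $line);

// str_getcsv() splits the same way and also understands quoted fields.
$csv = str_getcsv($line, ';', '"', '\\');

var_dump($fields === $csv);
echo $fields[0], "\n";
```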
When you write debug/testing scripts in php, make sure you output a more or less valid HTML page.
I like to use a PHP file similar to the following:
<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8>
<title>Test page for project XY</title>
</head>
<body>
<h1>Test Page</h1>
<pre><?php
echo print_r($_GET,1);
?></pre>
</body>
</html>
If you don't include any HTML tags, the browser might interpret the file as a text file and all kinds of weird things could happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM, because whenever the BOM was present, the browser recognized the file as a UTF-8 file.
Edit, I just read your post closer. You're suggesting this should output false, because you're suggesting a BOM was introduced by mb_split().
header('content-type: text/plain;charset=utf-8');
$s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
$str = base64_decode($s);
$pieces = mb_split(';', $str);
var_dump(substr($str, 0, 10) === $pieces[0]);
var_dump($pieces);
Does it? It works as expected for me (bool true, and the strings in the array are correct).
The mb_split function should be fine, but you should also define the charset it uses with mb_regex_encoding:
mb_regex_encoding('UTF-8');
About mb_detect_encoding: it can fail, but that's just because you can never reliably detect an encoding. You either know it, or you can try, but that's all. Encoding detection is mostly a gambling game; however, you can use the strict parameter with that function and specify the encoding(s) you're looking for.
How to remove the BOM mark:
You can filter the string input and remove a UTF-8 bom with a small helper function:
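/**
 * remove UTF-8 BOM if string has it at the beginning
 *
 * @param string $str
 * @return string
 */
function remove_utf8_bom($str)
{
    // Compare the first three bytes directly; combining an assignment with &&
    // in the condition would evaluate the && first and never see the bytes.
    if (substr($str, 0, 3) === "\xEF\xBB\xBF")
    {
        $str = substr($str, 3);
    }
    return $str;
}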
Usage:
$line = remove_utf8_bom($line);
There are probably better ways to do it, but this should work.

How to remove %EF%BB%BF in a PHP string

I am trying to use the Microsoft Bing API.
$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
The returned data has an invisible ' ' character at the start of the string. It is not a space, because I trimmed the data before returning it.
The ' ' character turned out to be %EF%BB%BF.
I wonder why this happened, maybe a bug from Microsoft?
How can I remove this %EF%BB%BF in PHP?
You should not simply discard the BOM unless you're 100% sure that the stream will: (a) always be UTF-8, and (b) always have a UTF-8 BOM.
The reasons:
In UTF-8, a BOM is optional - so if the service quits sending it at some future point you'll be throwing away the first three characters of your response instead.
The whole purpose of the BOM is to identify unambiguously the type of UTF stream being interpreted (UTF-8, -16, or -32?), and also to indicate the 'endian-ness' (byte order) of the encoded information. If you just throw it away you're assuming that you're always getting UTF-8; this may not be a very good assumption.
Not all BOMs are 3 bytes long; only the UTF-8 one is. The UTF-16 BOM is two bytes, and the UTF-32 BOM is four bytes. So if the service switches to a wider UTF encoding in the future, your code will break.
I think a more appropriate way to handle this would be something like:
/* Detect the encoding, then convert from detected encoding to ASCII */
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "ASCII", $enc);
$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
if (substr($data, 0, 3) == "\xef\xbb\xbf") {
$data = substr($data, 3);
}
It's a byte order mark (BOM), indicating the response is encoded as UTF-8. You can safely remove it, but you should parse the remainder as UTF-8.
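Putting the two views together, here is a rough sketch that looks at which BOM arrived, strips it, and hands back UTF-8. It assumes mbstring is available; the BOMS constant and the bom_to_utf8 name are hypothetical. Data without a BOM is returned untouched on the assumption that it is already UTF-8.

```php
<?php
// Map of BOM byte sequences to encoding names. Longer BOMs come first,
// because the UTF-32LE BOM begins with the same bytes as the UTF-16LE one.
const BOMS = [
    "\x00\x00\xFE\xFF" => 'UTF-32BE',
    "\xFF\xFE\x00\x00" => 'UTF-32LE',
    "\xEF\xBB\xBF"     => 'UTF-8',
    "\xFE\xFF"         => 'UTF-16BE',
    "\xFF\xFE"         => 'UTF-16LE',
];

// Strip a leading BOM, if any, and return the text as UTF-8.
function bom_to_utf8(string $data): string
{
    foreach (BOMS as $bom => $encoding) {
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            $body = substr($data, strlen($bom));
            return $encoding === 'UTF-8'
                ? $body
                : mb_convert_encoding($body, 'UTF-8', $encoding);
        }
    }
    return $data; // no BOM: assumed to be UTF-8 already
}

var_dump(bom_to_utf8("\xEF\xBB\xBFhi"));       // UTF-8 BOM
var_dump(bom_to_utf8("\xFF\xFEh\x00i\x00"));   // UTF-16LE BOM
```

One known ambiguity: UTF-16LE text whose first character after the BOM is NUL looks like a UTF-32LE BOM, so this is a heuristic and not a guarantee.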
I had the same problem today, and fixed by ensuring the string was set to UTF-8:
http://php.net/manual/en/function.utf8-encode.php
$content = utf8_encode ( $content );
To remove it from the beginning of the string (only):
$data = preg_replace('/^%EF%BB%BF/', '', $data);
$data = str_replace('%EF%BB%BF', '', $data);
You probably shouldn't be using stripslashes -- unless the API returns backslashed data (and 99.99% chance it doesn't), take that call out.
You could use substr to only get the rest without the UTF-8 BOM:
// if it’s binary UTF-8
$data = substr($data, 3);
// if it’s percent-encoded UTF-8
$data = substr($data, 9);
