How to remove %EF%BB%BF in a PHP string

I am trying to use the Microsoft Bing API.
$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
The data returned has a ' ' character as its first character. It is not a space, because I trimmed the string before using it.
The ' ' character turned out to be %EF%BB%BF.
I wonder why this happened, maybe a bug from Microsoft?
How can I remove this %EF%BB%BF in PHP?

You should not simply discard the BOM unless you're 100% sure that the stream will: (a) always be UTF-8, and (b) always have a UTF-8 BOM.
The reasons:
In UTF-8, a BOM is optional, so if the service stops sending it at some future point you'll be throwing away the first three bytes of your response instead.
The whole purpose of the BOM is to identify unambiguously the type of UTF stream being interpreted (UTF-8, UTF-16, or UTF-32) and, for UTF-16 and UTF-32, to indicate the 'endian-ness' (byte order) of the encoded information. If you just throw it away you're assuming that you're always getting UTF-8; this may not be a very good assumption.
Not all BOMs are three bytes long; only the UTF-8 one is. The UTF-16 BOM is two bytes and the UTF-32 BOM is four bytes. So if the service switches to a wider UTF encoding in the future, your code will break.
I think a more appropriate way to handle this would be something like:
/* Detect the encoding, then convert from detected encoding to ASCII */
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "ASCII", $enc);
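The point about BOM lengths differing per encoding can be sketched as a small lookup. This is only an illustration: split_bom() and BOM_SIGNATURES are hypothetical names, not part of PHP or of the Microsoft API, and the sketch assumes the payload is text that starts at byte 0.

```php
<?php
// Known BOM signatures. The UTF-32 entries come first because the
// UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
const BOM_SIGNATURES = [
    'UTF-32BE' => "\x00\x00\xFE\xFF",
    'UTF-32LE' => "\xFF\xFE\x00\x00",
    'UTF-8'    => "\xEF\xBB\xBF",
    'UTF-16BE' => "\xFE\xFF",
    'UTF-16LE' => "\xFF\xFE",
];

/**
 * Hypothetical helper: returns [encoding name or null, data without its BOM].
 */
function split_bom(string $data): array
{
    foreach (BOM_SIGNATURES as $encoding => $bom) {
        // strncmp() is binary safe, so NUL bytes in the BOM are fine.
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return [$encoding, substr($data, strlen($bom))];
        }
    }
    return [null, $data]; // no BOM found: leave the data untouched
}

// Usage: strip the BOM only when one is actually present.
[$encoding, $body] = split_bom("\xEF\xBB\xBFhello");
```

One caveat of this sketch: UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE, so treat the result as a hint rather than proof.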

$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
if (substr($data, 0, 3) === "\xef\xbb\xbf") {
    $data = substr($data, 3);
}

It's a byte order mark (BOM), indicating the response is encoded as UTF-8. You can safely remove it, but you should parse the remainder as UTF-8.

I had the same problem today, and fixed it by ensuring the string was set to UTF-8:
http://php.net/manual/en/function.utf8-encode.php
$content = utf8_encode($content);

To remove it from the beginning of the string (only):
$data = preg_replace('/^%EF%BB%BF/', '', $data);
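The pattern above matches the literal text %EF%BB%BF, which only applies if the string has been URL-encoded. If the BOM is still present as three raw bytes (as it would be straight out of file_get_contents), a variant along these lines may be closer to what is needed. The sample input is an assumption for illustration.

```php
<?php
// Assumed sample input: a raw UTF-8 BOM followed by a payload.
$data = "\xEF\xBB\xBFhello";

// Anchored to the start of the string. The \x escapes sit inside a
// single-quoted PHP string, so they are interpreted by PCRE as raw bytes.
$data = preg_replace('/^\xEF\xBB\xBF/', '', $data);
```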

$data = str_replace('%EF%BB%BF', '', $data);
You probably shouldn't be using stripslashes -- unless the API returns backslashed data (and 99.99% chance it doesn't), take that call out.

You could use substr to only get the rest without the UTF-8 BOM:
// if it’s binary UTF-8
$data = substr($data, 3);
// if it’s percent-encoded UTF-8
$data = substr($data, 9);

Related

mb_detect_encoding returns both ASCII and UTF8 [duplicate]

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values, and if none occurred I would go on to let mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encodings don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specify it explicitly:
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
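An alternative to guessing with mb_detect_encoding is to validate first and convert only when validation fails. The sketch below assumes that anything which is not valid UTF-8 is ISO-8859-1; that fallback is an assumption about the IPTC data, and to_utf8() is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: returns $val as UTF-8. Input that is not already
 * valid UTF-8 is assumed (not detected) to be in $fallback.
 */
function to_utf8(string $val, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($val, 'UTF-8')) {
        return $val; // pure ASCII is valid UTF-8 as well, so it passes through
    }
    return mb_convert_encoding($val, 'UTF-8', $fallback);
}

// Usage: "caf\xE9" is Latin-1 for "café" and is not valid UTF-8.
$converted = to_utf8("caf\xE9");
```

Because ASCII is a subset of UTF-8, there is no need to tell the two apart before storing the value as UTF-8.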

cURL gets response with utf-8 BOM

In my script I send data with cURL, with CURLOPT_RETURNTRANSFER enabled. The response is JSON-encoded data. When I try to json_decode it, it returns null. Then I found that the response contains a UTF-8 BOM (the bytes EF BB BF) at the beginning of the string.
Here are some experiments:
$data = curl_exec($ch);
echo $data;
the result is
{"field_1":"text_1","field_2":"text_2","field_3":"text_3"}
$data = curl_exec($ch);
echo mb_detect_encoding($data);
result - UTF-8
$data = curl_exec($ch);
echo mb_convert_encoding($data, 'UTF-8', mb_detect_encoding($data));
// identical to echo mb_convert_encoding($data, 'UTF-8', 'UTF-8');
result - {"field_1":"text_1","field_2":"text_2","field_3":"text_3"}
The one thing that helps is removing the first 3 bytes:
if (substr($data, 0, 3) == pack('CCC', 239, 187, 191)) {
$data = substr($data, 3);
}
But what if a different BOM arrives? So the question is:
How can I detect the right encoding of a cURL response? Or how can I detect which BOM has arrived? Or maybe how can I convert a response that has a BOM?
I'm afraid you already found the answer by yourself - it's bad news in that there is no better answer that I know of.
The BOM should not be there, and it's the sender's responsibility to not send it along.
But I can reassure you: the BOM is either there or it is not, and if it is, it's those three bytes you know.
You can make the check slightly faster and handle any number of repeated BOMs with a small alteration:
$__BOM = pack('CCC', 239, 187, 191);
// Careful about the three ='s -- they're all needed.
while (0 === strpos($data, $__BOM)) {
    $data = substr($data, 3);
}
A third-party BOM detector wouldn't do any different. This way you're covered even if at a later time cURL began stripping unneeded BOMs.
Possible causes
Some JSON optimizers and filters may decide the output requires a BOM. Also, perhaps more simply, whoever wrote the script generating the JSON inadvertently included a BOM before the opening PHP tag. PHP treats anything before the opening tag as literal output, so the BOM is sent along to the client without the script itself ever seeing it. This can occasionally also cause the "Cannot modify header information - headers already sent" warning.
Content detection
You can verify that the JSON is valid UTF-8, BOM or no BOM, but you need mbstring support and you must use strict mode to catch some edge cases:
if (false === mb_detect_encoding($data, 'UTF-8', true)) {
    // JSON contains invalid sequences (erroneously NOT JSON encoded)
}
I would advise against trying to correct a possible encoding error; you risk breaking your own code, and also having to maintain someone else's work.
This page details a similar issue: BOM in a PHP page auto generated by Wordpress
Basically, it can occur when the JSON generator is written in PHP and an editor has somehow snuck in the BOM before the opening <?php tag. Since your client language is PHP I'm assuming this is relevant.
You could strip it out using the substr comparison -- a BOM only ever occurs at the start of a document. But if you have control over the JSON source, you should remove the BOM from the source document instead.
There will never be more than 3 bytes before the "{". Those 3 bytes are one character in UTF-8. So if you just do $data = substr($data, 3); you will be fine.
Take a look here for more information: json_decode returns NULL after webservice call
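Putting the suggestions in this thread together, a minimal sketch might strip any leading UTF-8 BOMs and only then decode. decode_json_response() is a hypothetical helper name, and the sketch assumes the response is a JSON object or array encoded as UTF-8.

```php
<?php
/**
 * Hypothetical helper: strips any leading UTF-8 BOMs, then decodes JSON.
 * Returns the decoded array, or null if the payload is not a JSON
 * object/array.
 */
function decode_json_response(string $data): ?array
{
    $bom = "\xEF\xBB\xBF";
    // Loop rather than test once, in case more than one BOM was prepended.
    while (strncmp($data, $bom, 3) === 0) {
        $data = substr($data, 3);
    }
    $decoded = json_decode($data, true);
    return is_array($decoded) ? $decoded : null;
}

// Usage with an assumed response body that starts with a BOM.
$result = decode_json_response("\xEF\xBB\xBF" . '{"field_1":"text_1"}');
```

When null comes back, json_last_error_msg() can be consulted to see why decoding failed.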

How to Process Japanese Characters in $_GET

I do
http://localhost/api/test2.php?id=jr-東北本線-荒川橋梁__35.79_139.72
Then I do
$data=$_GET['id']; // Zend says that $data is jr-????-????__35.79_139.72
$encoding = mb_detect_encoding ($data); // $encoding is ASCII
$data= mb_convert_encoding($data,'utf-8'); //$data is still jr-????-????__35.79_139.72
$encoding2 = mb_detect_encoding ($data); // $encoding2 is still ASCII
The thing is I want $data to be jr-東北本線-荒川橋梁__35.79_139.72
So what should I do?
If the URL data (the query part) is actually UTF-8 encoded, you don't need to do anything at all. PHP supports UTF-8 out of the box here thanks to its binary-safe strings.
So you had better not run any conversions just for the fun of trying (and failing, which sucks big time).
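To check that claim without a browser in the loop, the round trip can be simulated in a script. This is only a sketch: it assumes the source file is saved as UTF-8, and it uses parse_str() as a stand-in for the way PHP populates $_GET.

```php
<?php
// Assumes this source file itself is saved as UTF-8.
$id = 'jr-東北本線-荒川橋梁__35.79_139.72';

// What a client should send: the UTF-8 bytes, percent-encoded.
$query = 'id=' . rawurlencode($id);

// parse_str() percent-decodes a query string much like PHP does for $_GET.
parse_str($query, $get);

// No conversion was applied, yet the bytes survive unchanged.
$sameBytes = ($get['id'] === $id);
$validUtf8 = mb_check_encoding($get['id'], 'UTF-8');
```

If both checks hold but the value still shows up as question marks, the likely culprit is whatever displays the string, or a client that did not send UTF-8 in the first place.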

UTF-8, XML, and htmlentities with PHP / Mysql

I have found a lot of varying / inconsistent information across the web on this topic, so I'm hoping someone can help me out with these issues:
I need a function to cleanse a string so that it is safe to insert into a utf-8 mysql db or to write to a utf-8 XML file. Characters that can't be converted to utf-8 should be removed.
For writing to an XML file, I'm also running into the problem of converting html entities into numeric entities. The htmlspecialchars() works almost all the time, but I have read that it is not sufficient for properly cleansing all strings, for example one that contains an invalid html entity.
Thanks for your help, Brian
You didn't say where the strings were coming from, but if you're getting them from an HTML form submission, see this article:
Setting the character encoding in form submit for Internet Explorer
Long and short, you'll need to explicitly tell the browser what charset you want the form submission in. If you specify UTF-8, you should never get invalid UTF-8 from a browser. If you want to protect yourself against ANY type of malicious attack, you'll need to use iconv:
http://www.php.net/iconv
$utf_8_string = iconv($from_charset, $to_charset, $original_string);
If you specify "utf-8" as both $from_charset and $to_charset, iconv() should return an error if $original_string contains invalid UTF-8.
If you're getting your strings from a different source and you know the character encoding, you can still use iconv(). Typical encodings in the US are CP-1252 (Windows) and ISO-8859-1 (everything else.)
Something like this?
function cleanse($in) {
$bad = Array('”', '“', '’', '‘');
$good = Array('"', '"', '\'', '\'');
$out = str_replace($bad, $good, $in);
return $out;
}
You can convert a string from any encoding to UTF-8 with iconv or mbstring:
// With the //IGNORE flag, this will ignore invalid characters
iconv('input-encoding', 'UTF-8//IGNORE', $the_string);
or
mb_convert_encoding($the_string, 'UTF-8', 'input-encoding');
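For the original request (drop what cannot be UTF-8, then make the result safe for an XML file), one possible sketch using mbstring follows. clean_for_xml() is a hypothetical name, and the sketch assumes the input is meant to be UTF-8 already; it does not cover the MySQL side, where prepared statements are the usual answer.

```php
<?php
/**
 * Hypothetical helper: removes invalid UTF-8 sequences and control
 * characters that XML 1.0 forbids, then escapes XML special characters.
 */
function clean_for_xml(string $in): string
{
    // Tell mbstring to drop invalid sequences instead of emitting "?".
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $utf8 = mb_convert_encoding($in, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous); // restore the global setting

    // XML 1.0 allows only tab, LF and CR among the C0 control characters.
    $utf8 = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $utf8);

    // ENT_XML1 restricts escaping to the entities XML itself defines.
    return htmlspecialchars($utf8, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// Usage: the stray \xFF byte is dropped, & and <> are escaped.
$safe = clean_for_xml("a\xFFb & <c>");
```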
