How to Process Japanese Characters in $_GET - php

I do
http://localhost/api/test2.php?id=jr-東北本線-荒川橋梁__35.79_139.72
Then I do
$data=$_GET['id']; // Zend says that $data is jr-????-????__35.79_139.72
$encoding = mb_detect_encoding ($data); // $encoding is ASCII
$data= mb_convert_encoding($data,'utf-8'); //$data is still jr-????-????__35.79_139.72
$encoding2 = mb_detect_encoding ($data); // $encoding is still ASCII
The thing is I want $data to be jr-東北本線-荒川橋梁__35.79_139.72
So what should I do?

If the encoding of the URL data (the query part) is actually UTF-8 encoded, you don't need to do nothing at all. PHP supports UTF-8 then out-of-the-box thanks to it's binary safe strings.
So you better do not run any conversions just for having some fun trying (and failing which sucks big time).

Related

mb_detect_encoding returns both ASCII and UTF8 [duplicate]

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values and if none occured I would go on to letting mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encoding don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specified explicitly
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));

Testing non UTF-8 string

I have read some other threads on this subject but I cannot understand what I am doing wrong.
I have a function
public function reEncode($item)
{
if (! mb_detect_encoding($item, 'utf-8', true)) {
$item = utf8_encode($item);
}
return $item;
}
I am writing a test for this. I want to test a string that is not UTF-8 to see if this statement is hit. I am having trouble creating the test string.
$contents = file_get_contents('CyrillicKOI8REncoded.txt');
var_dump(mb_detect_encoding($contents));
$sanitized = $this->reEncode($contents);
var_dump(mb_detect_encoding($sanitized));
Initially I used file_get_contents on a file I encoded in sublime with various encodings; Cyrillic (KOI8-R), HEX and DOS (CP 437) as it has been stated that file_get_contents() ignores the file encoding. This seems to be true as the characters returned are a jumbled mess.
That said, every time I use mb_detect_encoding() on these variables, I always get ASCII or UTF-8. The statement is never triggered because ASCII is a subset of UTF-8.
So I have tried mb_convert_encoding() and iconv() to convert a basic string to UTF-16, UTF-32, base64, hex etc etc but every time mb_detect_encoding() returns ASCII or UTF-8
In my tests I want to assert the encoding type before and after this function is called.
$sanitized = $this->reEncode($contents);
$this->assertEquals('UTF-32', mb_detect_encoding($contents));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitized));
I cannot understand what basic mistake I am doing to constantly get ASCII or UTF-8 returned from mb_detect_encoding().
Ok, so it turns out you must use strict to check or the mb_detect_encoding() function is next to useless.
$item = mb_convert_encoding('Котёнок', 'KOI8-R');
$sanitized = $this->reEncode($item);
$this->assertEquals('KOI8-R', mb_detect_encoding($item, 'KOI8-R', true));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitised, 'UTF-8', true));

Encoding string with non-ascii characters

I have a string such as this - Panamá. I need to convert this string to Panam\xE1 so it's readable in a JavaScript file I'm generating using PHP.
Is there a function to encode this in PHP? Any ideas would be appreciated.
My rule is,
If you try to encode or escape data using preg_replace or
using massive mapping arrays or str_replace, STOP you are probably doing it wrong.
All it takes is one missed or eroneous mapping (and you WILL miss some mappings) then you end up with code that doesn't work in all cases and code which corrupts your data in some cases. Whole libraries have been written already dedicated to doing the translations for you (e.g. iconv) and for escaping data, you should use the proper PHP function.
If you plan on outputting the data to a browser (the fact you want to encode for javascript suggests this) then I suggest using UTF8 encoding. If your data is in latin-1, use the utf8_encode function.
Whether your PHP string contains ASCII characters or not, to send any data from PHP to JS you should ALWAYS use the json_encode function.
PHP code
$your_encoding = 'latin1';
$panama = "Panamá";
//Get your data in utf8 if it isnt already
$panama = iconv($your_encoding, "utf-8", $panama);
$panama_encoded = json_encode($panama);
echo "var js_panama = " . $panama_encoded . ";";
JS Output
var js_panama = "Panam\u00e1";
Even though JSON supports unicode, it may not be compatible with your non UTF-8 javascript file. This is not a problem because the json_encode PHP function will escape unicode characters by default.
Assuming that your input is in the latin-1 encoding then ord and dechex will do what you want:
$result = preg_replace_callback(
'/[\x80-\xff]/',
function($match) {
return '\x'.dechex(ord($match[0]));
},
$input);
If your input is in any other encoding then you would need to know what encoding that is and adapt the solution accordingly. Note that in this case it would not be possible to use specifically the \x## notation in the JS output in all cases.
This should work for you:
$str = "Panamá";
$str = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$utf = iconv('UTF-8', 'UCS-4', current($m));
return sprintf("\x%s", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $str);
echo $str;
Output (Source Code):
Panam\xE1

mb_detect_encoding detects ASCII as UTF-8?

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values and if none occured I would go on to letting mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encoding don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specified explicitly
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));

How to remove %EF%BB%BF in a PHP string

I am trying to use the Microsoft Bing API.
$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
The data returned has a ' ' character in the first character of the returned string. It is not a space, because I trimed it before returning the data.
The ' ' character turned out to be %EF%BB%BF.
I wonder why this happened, maybe a bug from Microsoft?
How can I remove this %EF%BB%BF in PHP?
You should not simply discard the BOM unless you're 100% sure that the stream will: (a) always be UTF-8, and (b) always have a UTF-8 BOM.
The reasons:
In UTF-8, a BOM is optional - so if the service quits sending it at some future point you'll be throwing away the first three characters of your response instead.
The whole purpose of the BOM is to identify unambiguously the type of UTF stream being interpreted UTF-8? -16? or -32?, and also to indicate the 'endian-ness' (byte order) of the encoded information. If you just throw it away you're assuming that you're always getting UTF-8; this may not be a very good assumption.
Not all BOMs are 3-bytes long, only the UTF-8 one is three bytes. UTF-16 is two bytes, and UTF-32 is four bytes. So if the service switches to a wider UTF encoding in the future, your code will break.
I think a more appropriate way to handle this would be something like:
/* Detect the encoding, then convert from detected encoding to ASCII */
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "ASCII", $enc);
$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
if (substr($data, 0, 3) == "\xef\xbb\xbf") {
$data = substr($data, 3);
}
It's a byte order mark (BOM), indicating the response is encoded as UTF-8. You can safely remove it, but you should parse the remainder as UTF-8.
I had the same problem today, and fixed by ensuring the string was set to UTF-8:
http://php.net/manual/en/function.utf8-encode.php
$content = utf8_encode ( $content );
To remove it from the beginning of the string (only):
$data = preg_replace('/^%EF%BB%BF/', '', $data);
$data = str_replace('%EF%BB%BF', '', $data);
You probably shouldn't be using stripslashes -- unless the API returns blackslashed data (and 99.99% chance it doesn't), take that call out.
You could use substr to only get the rest without the UTF-8 BOM:
// if it’s binary UTF-8
$data = substr($data, 3);
// if it’s percent-encoded UTF-8
$data = substr($data, 9);

Categories