Testing non UTF-8 string - php

I have read some other threads on this subject but I cannot understand what I am doing wrong.
I have a function
public function reEncode($item)
{
    if (! mb_detect_encoding($item, 'utf-8', true)) {
        $item = utf8_encode($item);
    }
    return $item;
}
I am writing a test for this. I want to test a string that is not UTF-8 to see if this statement is hit. I am having trouble creating the test string.
$contents = file_get_contents('CyrillicKOI8REncoded.txt');
var_dump(mb_detect_encoding($contents));
$sanitized = $this->reEncode($contents);
var_dump(mb_detect_encoding($sanitized));
Initially I used file_get_contents() on a file I saved in Sublime with various encodings (Cyrillic (KOI8-R), HEX and DOS (CP 437)), as it has been stated that file_get_contents() ignores the file encoding. This seems to be true, as the characters returned are a jumbled mess.
That said, every time I use mb_detect_encoding() on these variables, I always get ASCII or UTF-8. The statement is never triggered because ASCII is a subset of UTF-8.
So I have tried mb_convert_encoding() and iconv() to convert a basic string to UTF-16, UTF-32, base64, hex, etc., but every time mb_detect_encoding() returns ASCII or UTF-8.
In my tests I want to assert the encoding type before and after this function is called.
$sanitized = $this->reEncode($contents);
$this->assertEquals('UTF-32', mb_detect_encoding($contents));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitized));
I cannot understand what basic mistake I am making that causes mb_detect_encoding() to constantly return ASCII or UTF-8.

OK, so it turns out you must use strict mode for the check, or the mb_detect_encoding() function is next to useless.
$item = mb_convert_encoding('Котёнок', 'KOI8-R');
$sanitized = $this->reEncode($item);
$this->assertEquals('KOI8-R', mb_detect_encoding($item, 'KOI8-R', true));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitized, 'UTF-8', true));
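A related way to build and verify a non-UTF-8 fixture is to validate the bytes with mb_check_encoding(), which answers a yes/no question instead of guessing an encoding name. The following is a minimal sketch, assuming the source file is saved as UTF-8 and the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// Assumes this source file is saved as UTF-8 and mbstring is loaded.
$utf8 = 'Котёнок';

// Name the source encoding explicitly instead of relying on the internal default.
$koi8 = mb_convert_encoding($utf8, 'KOI8-R', 'UTF-8');

// KOI8-R is a single-byte encoding, so seven characters become seven bytes.
var_dump(strlen($koi8));

// The KOI8-R bytes do not form a well-formed UTF-8 sequence, so validation fails.
var_dump(mb_check_encoding($koi8, 'UTF-8'));

// Converting back with the correct source encoding restores the original text.
$roundTrip = mb_convert_encoding($koi8, 'UTF-8', 'KOI8-R');
var_dump($roundTrip === $utf8);
```

Note that utf8_encode() always interprets its input as ISO-8859-1, so the output of reEncode() for this fixture is valid UTF-8 but will not read as the original Cyrillic text.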

Related

mb_detect_encoding returns both ASCII and UTF8 [duplicate]

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is given an ASCII string (with special characters in the Latin-1 range 192-255), it detects it as UTF-8, so the subsequent attempt to convert everything to proper UTF-8 removes all the special characters.
I tried writing my own method by looking for Latin-1 values and, if none occurred, letting mb_detect_encoding() decide what it is. But I stopped midway when I realized that I can't be sure that other encodings don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
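As a rough illustration of the custom-order approach, the sketch below combines an explicit candidate list with strict mode. It assumes the mbstring extension is loaded and uses a hand-written ISO-8859-1 sample string; the candidate list is an assumption, and single-byte encodings such as ISO-8859-1 and ISO-8859-15 cannot be told apart from the bytes alone.

```php
<?php
// Assumes mbstring is loaded; "caf\xE9" is "café" encoded as ISO-8859-1.
$latin1 = "caf\xE9";

// Strict mode rejects candidates for which the bytes are not valid, so ASCII
// (byte above 0x7F) and UTF-8 (incomplete sequence) are both ruled out here.
$encoding = mb_detect_encoding($latin1, ['ASCII', 'UTF-8', 'ISO-8859-1'], true);

$utf8 = null;
if ($encoding !== false) {
    // Convert using the detected source encoding.
    $utf8 = mb_convert_encoding($latin1, 'UTF-8', $encoding);
}

var_dump($encoding);
```

If detection returns false, it is usually safer to leave the value untouched or flag it than to convert with a guessed encoding.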
You can specify it explicitly:
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about which encodings to allow, you can add them all:
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));

How to check if a string can safely be converted in another character set without loss?

Is it possible, prior to converting a string from a charset to another, to know whether this conversion will be lossless?
If I try to convert a UTF-8 string to latin1, for example, the characters that can't be converted are replaced by ?. Checking for ? in the result string to find out whether the conversion was lossless is obviously not an option.
The only solution I can see right now is to convert back to the original charset and compare the result to the original string:
function canBeSafelyConverted($string, $fromEncoding, $toEncoding)
{
    $encoded = mb_convert_encoding($string, $toEncoding, $fromEncoding);
    $decoded = mb_convert_encoding($encoded, $fromEncoding, $toEncoding);
    // Strict comparison avoids PHP's loose numeric-string comparison rules.
    return $decoded === $string;
}
This is just a quick and dirty one though, which may come with unexpected behaviours at times, and I guess there might be a cleaner way to do this with mbstring, iconv, or any other library.
An alternative is to set up your own error handler with set_error_handler(). If you use iconv() on the string, it raises a notice when the string cannot be fully converted, which you can catch in the handler and react to in your code.
Or you could just count the number of question marks before and after encoding. Or call iconv() with //IGNORE and count the number of characters.
None of these suggestions is much more elegant than yours, but they avoid the double processing.
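The question-mark counting idea could look roughly like the sketch below. The helper name conversionIsLossless() is hypothetical, and the sketch assumes mbstring is loaded and that both encodings are ASCII-compatible, so that a literal "?" is always the single byte 0x3F.

```php
<?php
// Hypothetical helper; assumes mbstring is loaded and that both encodings are
// ASCII-compatible, so a literal "?" is always the single byte 0x3F.
function conversionIsLossless(string $text, string $from, string $to): bool
{
    // Force "?" as the substitute for unmappable characters, then restore
    // whatever setting was active before.
    $previous = mb_substitute_character();
    mb_substitute_character(0x3F);
    $converted = mb_convert_encoding($text, $to, $from);
    mb_substitute_character($previous);

    // A higher count after conversion means at least one character was replaced.
    return substr_count($converted, '?') === substr_count($text, '?');
}

// "café?" is representable in ISO-8859-1.
$fits = conversionIsLossless("caf\u{E9}?", 'UTF-8', 'ISO-8859-1');
// These two kanji are not representable in ISO-8859-1.
$lossy = conversionIsLossless("\u{6771}\u{5317}?", 'UTF-8', 'ISO-8859-1');
var_dump($fits, $lossy);
```

This converts only once, but it relies on the substitution character being distinguishable by a byte count, which does not hold for encodings such as UTF-16.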

How to Process Japanese Characters in $_GET

I do
http://localhost/api/test2.php?id=jr-東北本線-荒川橋梁__35.79_139.72
Then I do
$data=$_GET['id']; // Zend says that $data is jr-????-????__35.79_139.72
$encoding = mb_detect_encoding ($data); // $encoding is ASCII
$data= mb_convert_encoding($data,'utf-8'); //$data is still jr-????-????__35.79_139.72
$encoding2 = mb_detect_encoding ($data); // $encoding is still ASCII
The thing is I want $data to be jr-東北本線-荒川橋梁__35.79_139.72
So what should I do?
If the URL data (the query part) is actually UTF-8 encoded, you don't need to do anything at all. PHP then handles UTF-8 out of the box thanks to its binary-safe strings.
So you had better not run any conversions just for the fun of trying (and failing, which sucks big time).
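To check that claim without a web server, the following sketch builds the query string by hand and decodes it with parse_str(), which applies the same percent-decoding PHP uses for $_GET. It assumes the source file is saved as UTF-8 and mbstring is loaded; the hand-built query string is a stand-in for what a browser would send.

```php
<?php
// Assumes this source file is saved as UTF-8 and mbstring is loaded.
$id = 'jr-東北本線-荒川橋梁__35.79_139.72';

// Percent-encode the value the way a client would before sending it.
$query = 'id=' . rawurlencode($id);

// parse_str() percent-decodes the values into an array.
parse_str($query, $params);

// The decoded bytes are identical to the original UTF-8 string.
var_dump($params['id'] === $id);
var_dump(mb_check_encoding($params['id'], 'UTF-8'));
```

If the value still shows up as question marks, the bytes are more likely being altered by whatever displays them (a debugger or a terminal) than by PHP itself.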

How to remove %EF%BB%BF in a PHP string

I am trying to use the Microsoft Bing API.
$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
The data returned has a ' ' character as the first character of the returned string. It is not a space, because I trimmed it before returning the data.
The ' ' character turned out to be %EF%BB%BF.
I wonder why this happened, maybe a bug from Microsoft?
How can I remove this %EF%BB%BF in PHP?
You should not simply discard the BOM unless you're 100% sure that the stream will: (a) always be UTF-8, and (b) always have a UTF-8 BOM.
The reasons:
In UTF-8, a BOM is optional - so if the service stops sending it at some future point, you'll be throwing away the first three bytes of your response instead.
The whole purpose of the BOM is to identify unambiguously the type of UTF stream being interpreted (UTF-8, UTF-16 or UTF-32) and also to indicate the 'endian-ness' (byte order) of the encoded information. If you just throw it away, you're assuming that you're always getting UTF-8; this may not be a very good assumption.
Not all BOMs are three bytes long; only the UTF-8 one is. The UTF-16 BOM is two bytes and the UTF-32 BOM is four bytes. So if the service switches to a wider UTF encoding in the future, your code will break.
I think a more appropriate way to handle this would be something like:
/* Detect the encoding, then convert from detected encoding to ASCII */
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "ASCII", $enc);
$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
if (substr($data, 0, 3) === "\xef\xbb\xbf") {
    $data = substr($data, 3);
}
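The points above about BOM length and encoding can be combined into one check. The following is only a sketch: splitBom() is a hypothetical helper, not part of PHP, and it cannot distinguish a UTF-32LE BOM from a UTF-16LE BOM followed by a NUL character.

```php
<?php
// Hypothetical helper: returns [encoding name or null, data without the BOM].
function splitBom(string $data): array
{
    // UTF-32 is checked before UTF-16 because the UTF-32LE BOM begins with
    // the same two bytes as the UTF-16LE BOM.
    $boms = [
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        // strncmp() is binary-safe and compares only the BOM-length prefix.
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return [$encoding, (string) substr($data, strlen($bom))];
        }
    }
    // No BOM found: leave the data untouched.
    return [null, $data];
}

[$encoding, $body] = splitBom("\xEF\xBB\xBFhello");
var_dump($encoding, $body);
```

Because the data is only shortened when a BOM is actually present, a response that arrives without one is left intact.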
It's a byte order mark (BOM), indicating the response is encoded as UTF-8. You can safely remove it, but you should parse the remainder as UTF-8.
I had the same problem today, and fixed by ensuring the string was set to UTF-8:
http://php.net/manual/en/function.utf8-encode.php
$content = utf8_encode ( $content );
To remove it from the beginning of the string (only):
$data = preg_replace('/^%EF%BB%BF/', '', $data);
Or, to remove every occurrence:
$data = str_replace('%EF%BB%BF', '', $data);
You probably shouldn't be using stripslashes -- unless the API returns backslashed data (and 99.99% chance it doesn't), take that call out.
You could use substr to only get the rest without the UTF-8 BOM:
// if it’s binary UTF-8
$data = substr($data, 3);
// if it’s percent-encoded UTF-8
$data = substr($data, 9);
