I'm trying to iron out the warnings and notices from a script. The script includes the following:
$clean_string = iconv('UTF-8', 'UTF-8//IGNORE', $supplier.' => '.$product_name);
As I understand it, the purpose of this line, as intended by the original author of the script, is to remove non-UTF-8 characters from the string, but obviously any non-UTF-8 characters in the input will cause iconv to throw an illegal character warning.
To solve this, my idea was to do something like the following:
$clean_string = iconv(mb_detect_encoding($supplier.' => '.$product_name), 'UTF-8//IGNORE', $supplier.' => '.$product_name);
Oddly however, mb_detect_encoding() is returning UTF-8 as the detected encoding!
The letter e with an accent (é) is an example of a character that causes this behaviour.
I realise I'm mixing multibyte libraries between detection and conversion, but I couldn't find an encoding detection function in the iconv library.
I've considered using the mb_convert_encoding() function to clean the string up into UTF-8, but the PHP documentation isn't clear what happens to characters that cannot be represented.
I am using PHP 5.2.17 with the glibc iconv implementation, library version 2.5.
Can anyone offer any suggestions on how to clean the string into UTF-8, or insight into why this behaviour occurs?
Your example:
$string = $supplier . ' => ' . $product_name;
$stringUtf8 = iconv('UTF-8', 'UTF-8//IGNORE', $string);
might work for you on PHP 5.2. In later PHP versions, if the input is not strictly valid UTF-8, iconv will drop the whole string (you will get an empty string). I mention this in case you are not aware of it.
Then you try mb_detect_encoding to find out the original encoding:
$string = $supplier . ' => ' . $product_name;
$encoding = mb_detect_encoding($string);
$stringUtf8 = iconv($encoding, 'UTF-8//IGNORE', $string);
As I already linked in a comment, mb_detect_encoding relies on guesswork and cannot be fully reliable. It tries to help you, but it cannot detect the encoding very well. That is in the nature of the problem: the same bytes are often valid in more than one encoding. You can try setting strict mode to true:
$order = mb_detect_order();
$encoding = mb_detect_encoding($string, $order, true);
if (FALSE === $encoding) {
    throw new UnexpectedValueException(
        sprintf(
            'Unable to detect input encoding with mb_detect_encoding, order was: %s'
            , print_r($order, true)
        )
    );
}
Beyond that, you might also need to translate the encoding names between the two libraries (iconv and mbstring), and/or validate them against the encodings each one supports.
I hope this helps you better understand why some things might not work, and how to find the error cases and filter the input using the standard PHP extensions.
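One way to sidestep detection altogether is to validate first and convert only when validation fails. This is a minimal sketch, not a definitive fix: `toUtf8` is a hypothetical helper name, and the ISO-8859-1 fallback is an assumption about where the data comes from that none of these functions can confirm.

```php
<?php
// Hypothetical helper: return $string as UTF-8.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function toUtf8($string, $fallbackEncoding = 'ISO-8859-1')
{
    // Already valid UTF-8: leave it untouched.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    // Otherwise reinterpret the bytes using the assumed legacy encoding.
    return mb_convert_encoding($string, 'UTF-8', $fallbackEncoding);
}

$latin1 = "Caf\xE9 => Cr\xE8me";   // accented letters as single ISO-8859-1 bytes
$clean  = toUtf8($latin1);         // the same text, re-encoded as UTF-8
```

Because every byte sequence is valid ISO-8859-1, the fallback never fails; it will, however, produce wrong characters if the input was really in some other single-byte encoding.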
I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values, and if none occurred I would go on to let mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encodings don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specify it explicitly:
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
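A custom order can also be combined with strict mode, so that detection fails loudly instead of guessing. The following is only a sketch: the candidate list is an assumption about which encodings the data can plausibly be in, and detection remains a heuristic whenever more than one candidate accepts the bytes.

```php
<?php
// Candidate encodings, most restrictive first (an assumed list for illustration).
$candidates = array('ASCII', 'UTF-8', 'ISO-8859-1');

// "Zürich" with the umlaut as a single ISO-8859-1 byte.
// 0xFC is not valid ASCII and can never appear in UTF-8.
$val = "Z\xFCrich";

// Strict mode (third argument) rejects candidates the bytes are invalid in.
$encoding = mb_detect_encoding($val, $candidates, true);

if ($encoding === false) {
    throw new UnexpectedValueException('None of the candidate encodings matched');
}

// Convert using the detected source encoding.
$utf8 = mb_convert_encoding($val, 'UTF-8', $encoding);
```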
So today I was updating some code I made that took some data from a webpage and emailed it to people for convenience. However, I noticed that whoever was typing the text used a program which used some other encoding which had a weird ’ character which was 0xD5 (213) in the Mac Roman set. But when they uploaded it to their website, it came out as Õ. So I used php and did this:
$parsed = str_ireplace("Õ", "'", $parsed);
So I did this and tested it, but it didn't seem to work. Can anyone help me? Thanks!
If this is just a single anomaly you're correcting you can specify it with a hex escape sequence like:
$parsed = str_replace("\xD5", "'", $parsed);
The reason "Õ" alone isn't working is that the encoding of your PHP file doesn't represent Õ as 0xD5. Strings are just byte sequences, and what you're giving str_ireplace doesn't match. (Well, that and str_ireplace is going to do funky things with it; str_replace is preferred here.)
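To make the byte mismatch concrete, here is a small sketch. It assumes the PHP file is saved as UTF-8 (where Õ is stored as the two bytes C3 95) and that `$parsed` holds raw Mac Roman bytes; the sample text is made up for illustration.

```php
<?php
// Assumption: $parsed holds raw Mac Roman bytes, where 0xD5 is the curly apostrophe.
$parsed = "It\xD5s ready";

// In a source file saved as UTF-8, the literal "Õ" is the two bytes C3 95.
$needleFromUtf8File = "\xC3\x95";

// The two-byte needle never occurs in the input, so nothing is replaced.
$unchanged = str_replace($needleFromUtf8File, "'", $parsed);

// Matching the actual byte works regardless of how the PHP file is saved.
$fixed = str_replace("\xD5", "'", $parsed);
```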
More appropriate to handle the problem in general would be to use iconv to convert the input string from whatever its source encoding is into the output encoding you need.
Examples:
$parsed = iconv('MACINTOSH', 'UTF-8', $parsed);
or
$parsed = iconv('MACINTOSH', 'ASCII//TRANSLIT', $parsed);
The //TRANSLIT here means that when a character can't be represented in the target charset, it'll be approximated through one or several similarly looking characters. There's a lot ASCII (and others) can't represent, so transliteration can come in handy if you're not outputting UTF-8 (which would be ideal.)
The iconv function sometimes gives me an error:
Notice:
iconv() [function.iconv]:
Detected an incomplete multibyte character in input string in [...]
Is there a way to detect that there are illegal characters in a UTF-8 string before sending data to iconv()?
First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.
You can make use of the UTF-8 validity check that is available in preg_match since PHP 4.3.5. It will return 0 (with no additional information) if an invalid string is given:
$isUTF8 = preg_match('//u', $string);
Another possibility is mb_check_encoding:
$validUTF8 = mb_check_encoding($string, 'UTF-8');
Another function you can use is mb_detect_encoding:
$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));
It's important to set the strict parameter to true.
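The first two checks can be wrapped together for a quick cross-check. This is just a sketch: `isValidUtf8` is a hypothetical helper name, and comparing against 1 is a deliberate choice so the result is the same whether preg_match signals a bad subject with 0 or with false.

```php
<?php
// Hypothetical helper: true only if both checks accept the string as UTF-8.
function isValidUtf8($string)
{
    // preg_match() with /u refuses to match an invalid UTF-8 subject.
    $pcreOk = (preg_match('//u', $string) === 1);
    // mb_check_encoding() validates the byte sequence directly.
    $mbOk = mb_check_encoding($string, 'UTF-8');

    return $pcreOk && $mbOk;
}

$valid   = isValidUtf8("na\xC3\xAFve");  // i-diaeresis encoded as two UTF-8 bytes
$invalid = isValidUtf8("na\xEFve");      // lone 0xEF byte, not valid UTF-8
```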
Additionally, iconv allows you to change or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it raises a notice; this behavior cannot be changed.)
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;
You can use @ to suppress the notice and check the length of the returned string:
strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));
Check the examples on the iconv manual page as well.
If you use json_encode, try json_last_error:
<?php
// An invalid UTF8 sequence
$text = "\xB1\x31";
$json = json_encode($text);
$error = json_last_error();
var_dump($json, $error === JSON_ERROR_UTF8);
output (e.g. for PHP versions 5.3.3 - 5.3.13, 5.3.15 - 5.3.29, 5.4.0 - 5.4.45)
string(4) "null"
bool(true)
You could try using mb_detect_encoding to detect if you've got a different character set (than UTF-8) then mb_convert_encoding to convert to UTF-8 if required. It's more likely that people are giving you valid content in a different character set than giving you invalid UTF-8.
The specification of which characters are invalid in UTF-8 is pretty clear. You probably want to strip those out before trying to parse it. They shouldn't be there, so if you could avoid them even before generating the XML, that would be even better.
See here for a reference:
http://www.w3.org/TR/xml/#charsets
That isn't a complete list. Many parsers also disallow some low-numbered control characters, but I can't find a comprehensive list right now.
However, iconv might have built-in support for this:
http://www.zeitoun.net/articles/clear-invalid-utf8/start
Put an @ in front of iconv() to suppress the notice, and append //IGNORE to the destination encoding to drop invalid characters:
@iconv('UTF-8', $destinationEncoding . '//IGNORE', $yourString);
I have found a lot of varying / inconsistent information across the web on this topic, so I'm hoping someone can help me out with these issues:
I need a function to cleanse a string so that it is safe to insert into a utf-8 mysql db or to write to a utf-8 XML file. Characters that can't be converted to utf-8 should be removed.
For writing to an XML file, I'm also running into the problem of converting html entities into numeric entities. The htmlspecialchars() works almost all the time, but I have read that it is not sufficient for properly cleansing all strings, for example one that contains an invalid html entity.
Thanks for your help, Brian
You didn't say where the strings were coming from, but if you're getting them from an HTML form submission, see this article:
Setting the character encoding in form submit for Internet Explorer
Long and short, you'll need to explicitly tell the browser what charset you want the form submission in. If you specify UTF-8, you should never get invalid UTF-8 from a browser. If you want to protect yourself against ANY type of malicious attack, you'll need to use iconv:
http://www.php.net/iconv
$utf_8_string = iconv($from_charset, $to_charset, $original_string);
If you specify "utf-8" as both $from_charset and $to_charset, iconv() should return an error if $original_string contains invalid UTF-8.
If you're getting your strings from a different source and you know the character encoding, you can still use iconv(). Typical encodings in the US are CP-1252 (Windows) and ISO-8859-1 (everything else.)
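The UTF-8 to UTF-8 round trip mentioned above can be wrapped in a small check. This is a sketch only: `passesIconv` is a hypothetical helper name, and it relies on iconv() returning false (alongside the notice) when it meets illegal input.

```php
<?php
// Hypothetical helper: use a UTF-8 to UTF-8 round trip through iconv()
// as a validity check.
function passesIconv($string)
{
    // "@" silences the notice iconv() raises on illegal or incomplete input;
    // in that case iconv() returns false instead of a string.
    return @iconv('UTF-8', 'UTF-8', $string) !== false;
}

$ok  = passesIconv("M\xC3\xBCnchen");  // u-umlaut as two UTF-8 bytes
$bad = passesIconv("M\xFCnchen");      // u-umlaut as one ISO-8859-1 byte
```

A string that fails this check is a candidate for conversion from CP-1252 or ISO-8859-1, if that is what your source is known to send.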
Something like this?
function cleanse($in) {
    $bad = Array('”', '“', '’', '‘');
    $good = Array('"', '"', '\'', '\'');
    $out = str_replace($bad, $good, $in);
    return $out;
}
You can convert a string from any encoding to UTF-8 with iconv or mbstring:
// With the //IGNORE flag, this will ignore invalid characters
iconv('input-encoding', 'UTF-8//IGNORE', $the_string);
or
mb_convert_encoding($the_string, 'UTF-8', 'input-encoding');
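If the input is already supposed to be UTF-8 and you only want the invalid bytes removed, mbstring can do that too. By default it writes a "?" in place of anything it cannot convert; setting the substitute character to 'none' drops those bytes instead. A sketch, with `stripInvalidUtf8` as a hypothetical helper name:

```php
<?php
// Hypothetical helper: drop byte sequences that are not valid UTF-8.
function stripInvalidUtf8($string)
{
    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();
    // 'none' tells mbstring to omit invalid sequences instead of writing "?".
    mb_substitute_character('none');
    $clean = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);

    return $clean;
}

// 0xFF can never appear in UTF-8; the trailing e-acute is valid and is kept.
$clean = stripInvalidUtf8("in\xFFvalid \xC3\xA9");
```

This removes bytes rather than repairing them, so text that was really in another encoding loses its accented characters; convert from the real source encoding first if you know it.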