How can I detect a malformed UTF-8 string in PHP?

The iconv function sometimes gives me an error:
Notice:
iconv() [function.iconv]:
Detected an incomplete multibyte character in input string in [...]
Is there a way to detect that there are illegal characters in a UTF-8 string before sending data to iconv()?

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.
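You can make use of the UTF-8 validity check that is available in preg_match [PHP Manual] since PHP 4.3.5. With the u modifier, preg_match does not return 1 when the subject is invalid UTF-8: older versions return 0 (with no additional information), while current versions return false and preg_last_error() reports PREG_BAD_UTF8_ERROR: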
$isUTF8 = preg_match('//u', $string);
Another possibility is mb_check_encoding [PHP Manual]:
$validUTF8 = mb_check_encoding($string, 'UTF-8');
Another function you can use is mb_detect_encoding [PHP Manual]:
$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));
It's important to set the strict parameter to true.
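The checks above can be wrapped in one helper. This is only a minimal sketch: the function name isValidUtf8 is hypothetical, it assumes PHP 7 or later (for the type declarations), and it falls back to the PCRE check when mbstring is not loaded.

```php
<?php
// Hypothetical helper: returns true when $string is well-formed UTF-8.
function isValidUtf8(string $string): bool
{
    // Prefer mbstring when the extension is loaded.
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($string, 'UTF-8');
    }
    // Fallback: an empty pattern with the u modifier matches any valid
    // UTF-8 subject (result 1) and does not match an invalid one.
    return preg_match('//u', $string) === 1;
}

var_dump(isValidUtf8("D\xC3\xBCsseldorf")); // "ü" encoded as C3 BC
var_dump(isValidUtf8("D\xFCsseldorf"));     // lone ISO-8859-1 byte FC
```

Comparing the preg_match result with === 1 avoids having to distinguish between the 0 and false return values.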
Additionally, iconv [PHP Manual] allows you to transliterate or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it raises a notice; this behavior cannot be changed.)
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;
You can use @ to suppress the notice and check the length of the returned string:
strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));
Check the examples on the iconv manual page as well.
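As a rough sketch, the iconv length comparison could be wrapped like this. The function name hasNoInvalidSequences is hypothetical, and the sketch assumes that, depending on the PHP version and iconv implementation, iconv may return false instead of a shortened string, so both outcomes are treated as "invalid".

```php
<?php
// Hypothetical helper built on the iconv length comparison.
function hasNoInvalidSequences(string $string): bool
{
    // @ suppresses the notice that iconv raises on invalid input.
    $converted = @iconv('UTF-8', 'UTF-8//IGNORE', $string);

    // Some PHP/iconv combinations return false rather than a string with
    // the bad bytes dropped, so false also counts as invalid.
    return $converted !== false
        && strlen($converted) === strlen($string);
}

var_dump(hasNoInvalidSequences("D\xC3\xBCsseldorf")); // valid input
var_dump(hasNoInvalidSequences("D\xFCsseldorf"));     // invalid input
```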

If you use json_encode, you can check json_last_error afterwards:
<?php
// An invalid UTF8 sequence
$text = "\xB1\x31";
$json = json_encode($text);
$error = json_last_error();
var_dump($json, $error === JSON_ERROR_UTF8);
output (e.g. for PHP versions 5.3.3 - 5.3.13, 5.3.15 - 5.3.29, 5.4.0 - 5.4.45)
string(4) "null"
bool(true)
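The same idea as a small reusable check, as a sketch only. The function name jsonReportsBadUtf8 is hypothetical. It looks only at json_last_error() and ignores the return value of json_encode, because as far as I know newer PHP versions return false rather than the string "null" for such input.

```php
<?php
// Hypothetical helper: true when json_encode rejects $text as invalid UTF-8.
function jsonReportsBadUtf8(string $text): bool
{
    // The return value is ignored on purpose; only the error code matters.
    json_encode($text);

    // json_last_error() reflects the most recent json_encode/json_decode call.
    return json_last_error() === JSON_ERROR_UTF8;
}

var_dump(jsonReportsBadUtf8("\xB1\x31"));   // the invalid sequence from above
var_dump(jsonReportsBadUtf8("plain text")); // valid input
```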

You could try using mb_detect_encoding to detect if you've got a different character set (than UTF-8) then mb_convert_encoding to convert to UTF-8 if required. It's more likely that people are giving you valid content in a different character set than giving you invalid UTF-8.
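A minimal sketch of that detect-then-convert approach follows. The function name toUtf8 and the fallback list are assumptions for illustration; which encodings are plausible depends entirely on where your data comes from. Valid UTF-8 is passed through unchanged, so detection only runs on strings that are not already UTF-8.

```php
<?php
// Hypothetical helper: returns $string as UTF-8, or null when none of the
// fallback encodings matches. The default fallback list is an assumption.
function toUtf8(string $string, array $fallbacks = ['ISO-8859-1']): ?string
{
    // Already valid UTF-8: nothing to convert.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }

    // Strict detection among the fallback candidates only.
    $encoding = mb_detect_encoding($string, $fallbacks, true);
    if ($encoding === false) {
        return null;
    }

    return mb_convert_encoding($string, 'UTF-8', $encoding);
}

var_dump(toUtf8("D\xFCsseldorf") === "D\xC3\xBCsseldorf");
```

Keep in mind that single-byte encodings such as ISO-8859-1 accept every byte, so a match there only means "not impossible", not "certainly this encoding".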

The specification of which characters are invalid is pretty clear. You probably want to strip those out before trying to parse the data. They shouldn't be there, so if you can avoid them even before generating the XML, that would be even better.
See here for a reference:
http://www.w3.org/TR/xml/#charsets
That isn't a complete list. Many parsers also disallow some low-numbered control characters, but I can't find a comprehensive list right now.
However, iconv might have built-in support for this:
http://www.zeitoun.net/articles/clear-invalid-utf8/start
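As an illustration of stripping such characters, the sketch below removes everything outside the Char production of the XML 1.0 specification linked above. The function name stripXmlInvalidChars is hypothetical, and the sketch assumes the input is already valid UTF-8; preg_replace with the u modifier returns null otherwise.

```php
<?php
// Hypothetical helper: removes code points that XML 1.0 does not allow.
// Returns null when $string is not valid UTF-8, because preg_replace
// with the u modifier fails on such input.
function stripXmlInvalidChars(string $string): ?string
{
    // Allowed: #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $string
    );
}

var_dump(stripXmlInvalidChars("a\x00b\x0Bc\td") === "abc\td"); // NUL and VT removed, tab kept
```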

Put an @ in front of iconv() to suppress the notice, and append //IGNORE to the encoding ID to ignore invalid characters. The PHP manual documents //IGNORE as a suffix of the target encoding:
@iconv('UTF-8', $destinationEncoding . '//IGNORE', $yourString);

Related

Using iconv() to check for invalid UTF-8 characters: Detected an illegal character in input string

I am using iconv() to check if a string contains non-valid UTF-8 characters.
$valid = $string == iconv('UTF-8', 'UTF-8//IGNORE', $string);
However, this still throws the error: "iconv(): Detected an illegal character in input string"
To the best of my knowledge this should not be possible using the //IGNORE flag?
I'm using PHP 5.5.9-1ubuntu4.6 on Ubuntu 14.04.1 LTS
Another answer gives a better explanation of why iconv() is throwing an error:
The output character set (the second parameter) should be different
from the input character set (first param). If they are the same, then
if there are illegal UTF-8 characters in the string, iconv will reject
them as being illegal according to the input character set.
Taken from a comment in the PHP manual, you can check whether a string is valid UTF-8 like this:
$valid = mb_detect_encoding($str, 'UTF-8', true) !== false; // mb_detect_encoding returns the encoding name or false
More info on mb_detect_encoding();

mb_check_encoding for Windows encoding

I get an input string and try to check if it's a valid windows-1255 string:
mb_check_encoding($string, 'windows-1255');
I get an error message: "Invalid encoding "windows-1255""
The encoding name 'windows-1255' is probably correct, as I use it in the "iconv" function and it works fine. I also tried "WINDOWS-1255" and "Windows-1255" and got the same results.
How can I check if the string is valid windows-1255 encoding?
In my experience, trying to sniff the encoding is always broken one way or the other.
The following isn't tested, but should work if the encoding is registered with your system. Make sure you test it thoroughly (negatives as well as positives) before using.
You could use iconv() to convert it from Windows-1255 to UTF-8 and back. If it's still the same string, it's valid Windows-1255.
$string = "צקר"; // the source file needs to be Windows-1255 as well
$string_utf8 = iconv ("Windows-1255", "UTF-8//IGNORE", $string);
$string_final = iconv ("UTF-8", "Windows-1255//IGNORE", $string_utf8);
if ($string == $string_final)
echo "Yay!!! :)";
else
echo "No :(";
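The round-trip idea can be sketched as a reusable function. The name isValidInEncoding is hypothetical, and the sketch deliberately leaves out //IGNORE so that bytes that are invalid in the source encoding make iconv fail instead of being dropped silently. It assumes your iconv build knows the encoding name you pass in.

```php
<?php
// Hypothetical helper: true when $string survives a round trip from
// $encoding to UTF-8 and back. $encoding must be a name iconv knows.
function isValidInEncoding(string $string, string $encoding): bool
{
    // @ suppresses the notice raised for illegal input bytes.
    $utf8 = @iconv($encoding, 'UTF-8', $string);
    if ($utf8 === false) {
        return false;
    }

    $back = @iconv('UTF-8', $encoding, $utf8);
    return $back === $string;
}

var_dump(isValidInEncoding("abc", 'ASCII'));  // all bytes are ASCII
var_dump(isValidInEncoding("\xE9", 'ASCII')); // byte E9 is not ASCII
```

For the question above, the call would be isValidInEncoding($string, 'Windows-1255'), provided iconv on your system supports that name.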

Checking for UTF-8 replacement character

I'm trying to determine whether or not my string contains the UTF-8 replacement character.
Currently I've had two attempts which failed.
First attempt:
stristr($string, "\xEF\xBF\xBD")
Second attempt:
preg_match("#\xEF\xBF\xBD#i", $string)
Neither of these works.
Question is, how can I check my string for the replacement character?
If you mean to use this just to see if there are invalid byte sequences in a string, you could use something like this:
if (strlen($string) != strlen(iconv("UTF-8", "UTF-8//IGNORE", $string)))
echo "This string has invalid characters";
The method in your question should also work, but it requires the character encoding for the string to actually be in UTF-8. You can use iconv to convert a string from whatever its encoding is to UTF-8 before checking if the character is there.
Also: you might want to use the Unicode escape notation for this character, which is \uFFFD. PHP versions before 7.0 do not support this in string literals (since PHP 7.0 you can write "\u{FFFD}"), meaning that on older versions you'll have to use a trick like this:
mb_convert_encoding('&#xFFFD;', 'UTF-8', 'HTML-ENTITIES');
More info on that here.
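For completeness, a byte-level check for the replacement character can be sketched as follows. The function name containsReplacementChar is hypothetical, and the sketch assumes $string is UTF-8, where U+FFFD is encoded as the three bytes EF BF BD.

```php
<?php
// Hypothetical helper: true when $string (assumed to be UTF-8) contains
// U+FFFD REPLACEMENT CHARACTER, encoded as the bytes EF BF BD.
function containsReplacementChar(string $string): bool
{
    // strpos compares raw bytes, so no multibyte functions are needed.
    return strpos($string, "\xEF\xBF\xBD") !== false;
}

var_dump(containsReplacementChar("a\xEF\xBF\xBDb")); // contains U+FFFD
var_dump(containsReplacementChar("abc"));            // does not
```

If this returns false although the output still shows the replacement character, the string probably holds invalid bytes that the viewer merely renders that way, in which case a validity check is the more relevant test.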
<?php
if (mb_detect_encoding($str, "UTF-8", true) !== FALSE) { // strict mode, otherwise the check is unreliable
// $str is UTF-8 encoded
} else {
// $str is not UTF-8 encoded
}
Please refer this.

Why is iconv generating an illegal character error?

I'm trying to iron out the warnings and notices from a script. The script includes the following:
$clean_string = iconv('UTF-8', 'UTF-8//IGNORE', $supplier.' => '.$product_name);
As I understand it, the purpose of this line, as intended by the original author of the script, is to remove non-UTF-8 characters from the string, but obviously any non-UTF-8 characters in the input will cause iconv to throw an illegal character warning.
To solve this, my idea was to do something like the following:
$clean_string = iconv(mb_detect_encoding($supplier.' => '.$product_name), 'UTF-8//IGNORE', $supplier.' => '.$product_name);
Oddly however, mb_detect_encoding() is returning UTF-8 as the detected encoding!
The letter e with an accent (é) is an example of a character that causes this behaviour.
I realise I'm mixing multibyte libraries between detection and conversion, but I couldn't find an encoding detection function in the iconv library.
I've considered using the mb_convert_encoding() function to clean the string up into UTF-8, but the PHP documentation isn't clear what happens to characters that cannot be represented.
I am using PHP 5.2.17, and with the glibc iconv implementation, library version 2.5.
Can anyone offer any suggestions on how to clean the string into UTF-8, or insight into why this behaviour occurs?
Your example:
$string = $supplier . ' => ' . $product_name;
$stringUtf8 = iconv('UTF-8', 'UTF-8//IGNORE', $string);
and using PHP 5.2 might work for you. Be aware, though, that in later PHP versions iconv drops the whole string (you get an empty result) if the input is not strictly valid UTF-8.
You then try mb_detect_encoding to find out the original encoding:
$string = $supplier . ' => ' . $product_name;
$encoding = mb_detect_encoding($string);
$stringUtf8 = iconv($encoding, 'UTF-8//IGNORE', $string);
As I already linked in a comment, mb_detect_encoding relies on guesswork and cannot work reliably. It tries to help you, but it cannot detect the encoding very well; that is inherent to the problem. You can try setting strict mode to true:
$order = mb_detect_order();
$encoding = mb_detect_encoding($string, $order, true);
if (FALSE === $encoding) {
throw new UnexpectedValueException(
sprintf(
'Unable to detect input encoding with mb_detect_encoding, order was: %s'
, print_r($order, true)
)
);
}
In addition, you might need to translate the encoding names (and/or validate them against the supported encodings) between the two libraries (iconv and multibyte strings).
I hope this helps you at least understand better why some things might not work, and how you can find the error cases and filter the input using the standard PHP extensions.
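Validating an encoding name against what mbstring supports could look roughly like this. The function name mbSupportsEncoding is hypothetical, and the sketch only compares against the names returned by mb_list_encodings(), case-insensitively; aliases are not consulted.

```php
<?php
// Hypothetical helper: true when mbstring lists $name as a supported
// encoding (case-insensitive comparison; aliases are not consulted).
function mbSupportsEncoding(string $name): bool
{
    $known = array_map('strtoupper', mb_list_encodings());
    return in_array(strtoupper($name), $known, true);
}

var_dump(mbSupportsEncoding('utf-8'));
var_dump(mbSupportsEncoding('no-such-encoding'));
```

A name that iconv accepts but this check rejects needs to be mapped by hand before it is passed to an mbstring function.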

Unicode unknown "�" character detection in PHP

Is there any way in PHP of detecting the following character �?
I'm currently fixing a number of UTF-8 encoding issues with a few different algorithms and need to be able to detect if � is present in a string. How do I do so with strpos?
Simply pasting the character into my codebase does not seem to work.
if (strpos($names['decode'], '?') !== false || strpos($names['decode'], '�') !== false)
Converting a UTF-8 string into UTF-8 using iconv() using the //IGNORE parameter produces a result where invalid UTF-8 characters are dropped.
Therefore, you can detect a broken character by comparing the length of the string before and after the iconv operation. If the lengths differ, the string contained a broken character.
Test case (make sure you save the file as UTF-8):
<?php
header("Content-type: text/html; charset=utf-8");
$teststring = "Düsseldorf";
// Deliberately create broken string
// by encoding the original string as ISO-8859-1
$teststring_broken = utf8_decode($teststring);
echo "Broken string: ".$teststring_broken ;
echo "<br>";
$teststring_converted = iconv("UTF-8", "UTF-8//IGNORE", $teststring_broken );
echo $teststring_converted;
echo "<br>";
if (strlen($teststring_converted) != strlen($teststring_broken ))
echo "The string contained an invalid character";
In theory, you could drop //IGNORE and simply test for a failed (empty) iconv operation, but there might be reasons other than invalid characters for an iconv call to fail. I don't know. I would use the comparison method.
Here is what I do to detect and correct the encoding of strings not encoded in UTF-8 when that is what I am expecting:
$encoding = mb_detect_encoding($str, 'utf-8, iso-8859-1, ascii', true);
if (strcasecmp($encoding, 'UTF-8') !== 0) {
$str = iconv($encoding, 'utf-8', $str);
}
As far as I know, that question mark symbol is not a single character. There are many different character codes in the standard font sets that are not mapped to a symbol, and that is the default symbol that is used. To do detection in PHP, you would first need to know what font it is that you're using. Then you need to look at the font implementation and see what ranges of codes map to the "?" symbol, and then see if the given character is in one of those ranges.
I use the CUSTOM method (using str_replace) to sanitize undefined characters:
$text='a³';
$text=str_replace("\n\n", "sample000" ,$text);
$text=str_replace("\n", "sample111" ,$text);
$text=filter_var($text,FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);
$text=str_replace("sample000", "<br/><br/>" ,$text);
$text=str_replace("sample111", "<br/>" ,$text);
echo $text; //outputs ------------> a3
