After converting my site to use UTF-8, I'm now faced with the prospect of validating all incoming UTF-8 data, to ensure it's valid and coherent.
There seem to be various regexps and PHP APIs to detect whether a string is UTF-8, but the ones I've seen seem incomplete (regexps which validate UTF-8, but still allow invalid third bytes, etc.).
I'm also concerned about detecting (and preventing) overlong encoding, meaning ASCII characters that can be encoded as multibyte UTF-8 sequences.
Any suggestions or links welcome!
mb_check_encoding() is designed for this purpose:
mb_check_encoding($string, 'UTF-8');
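A minimal sketch of how this might be exercised against the cases raised in the question (invalid bytes, overlong encodings, truncated sequences). The helper name is_valid_utf8() and the sample strings are made up for illustration; only mb_check_encoding() comes from the answer above.

```php
<?php
// Hypothetical helper (not from the original answer): wraps mb_check_encoding().
function is_valid_utf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

// Sample byte strings, written with escapes so the file encoding does not matter.
$samples = [
    'plain ASCII'        => "hello",
    'valid two-byte'     => "D\xC3\xBCsseldorf", // "Düsseldorf"
    'invalid lead byte'  => "\xFE\x20",
    'overlong slash'     => "\xC0\xAF",          // "/" encoded in two bytes
    'truncated sequence' => "\xE2\x82",          // first two bytes of the euro sign
];

foreach ($samples as $label => $bytes) {
    echo $label, ': ', is_valid_utf8($bytes) ? 'valid' : 'invalid', "\n";
}
```

Only the first two samples should be reported as valid; the overlong and truncated forms are rejected along with the stray 0xFE byte.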
You can do a lot of things with iconv that can tell you if the sequence is valid UTF-8.
Telling it to convert from UTF-8 to the same:
$str = "\xfe\x20"; // Invalid UTF-8
$conv = @iconv('UTF-8', 'UTF-8', $str);
if ($str != $conv) {
print("Input was not a valid UTF-8 sequence.\n");
}
Asking for the length of the string in bytes:
$str = "\xfe\x20"; // Invalid UTF-8
if (@iconv_strlen($str, 'UTF-8') === false) {
print("Input was not a valid UTF-8 sequence.\n");
}
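Another check in the same spirit, offered as a sketch rather than something from the answer above: PCRE validates the subject string when the u modifier is used, so an empty pattern can serve as a validity test. The function name pcre_is_valid_utf8() is made up for illustration.

```php
<?php
// Hypothetical helper: relies on PCRE's own UTF-8 validation of the subject.
function pcre_is_valid_utf8(string $s): bool
{
    // With the u modifier, preg_match() returns false (not 0) when the
    // subject is malformed UTF-8; the empty pattern otherwise always matches.
    return preg_match('//u', $s) === 1;
}

var_dump(pcre_is_valid_utf8("D\xC3\xBCsseldorf")); // valid UTF-8
var_dump(pcre_is_valid_utf8("\xFE\x20"));          // invalid lead byte
var_dump(pcre_is_valid_utf8("\xC0\xAF"));          // overlong encoding of "/"
```

Unlike the iconv variants, this needs no error suppression, since a malformed subject is reported through the return value and preg_last_error() rather than a notice.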
If I use the function mb_convert_encoding() to convert an ASCII encoded string in PHP to a UTF-8 string, then concatenate it with an ASCII encoded string, what encoding is it? Are there any negative consequences for doing this?
It would depend firstly on whether you mean strict ASCII, which only includes 128 characters. Every single one of these characters has the exact same encoding in the ASCII encoding scheme as it does in the UTF-8 encoding scheme. For these characters, the mb_convert_encoding function will have no effect. You can easily verify this yourself with this script:
/* Convert ASCII to UTF-8 */
for ($i=0; $i<128; $i++) {
$str1 = chr($i);
$str2 = mb_convert_encoding($str1, "UTF-8", "ASCII");
echo $str1 . " - " . $str2 . " - ";
if ($str1 !== $str2) {
echo " - DIFFERENT!";
} else {
echo " - same";
}
echo "\n";
}
For all of these true ASCII characters, there's no point in transcoding them.
HOWEVER, if by "ASCII" you mean extended ASCII (see here) and are talking about characters with accents and stuff, then you are getting into trouble because there is no definitive character set described by this term. You'll notice that in the list of supported character encodings for php's Multibyte String extension there is only one occurrence of the acronym ASCII and that is for ASCII itself.
To answer your questions more precisely:
If I use the function mb_convert_encoding() to convert an ASCII encoded string in PHP to a UTF-8 string, then concatenate it with an ASCII encoded string, what encoding is it?
The resulting string is both ASCII and UTF-8 because both encoding schemes use identical byte encodings for those 128 characters.
Are there any negative consequences for doing this?
There should be no negative consequences under any circumstance if the characters are in fact true ASCII characters.
If, on the other hand, the strings include some accented character like Å or õ and some sloppy coder is calling this "extended ASCII" then you might have problems. Those characters have different encodings in the latin-1 and UTF-8 encoding schemes, for instance.
Consider taking a peek at this php function and it may shake loose some understanding. Ask yourself what it means to convert a character which is NOT ASCII from ASCII to UTF-8. It is not a meaningful conversion but it does result in a change in this particular script:
$chars = array("Å", "õ");
foreach ($chars as $char) {
echo $char . " : ";
$str1 = mb_convert_encoding($char, "UTF-8", "ASCII");
$str2 = mb_convert_encoding($char, "UTF-8", "ISO-8859-1");
echo $str1 . " - " . $str2 . " - ";
if ($char !== $str1) {
echo " - ASCII DIFFERENT";
}
if ($char !== $str2) {
echo " - LATIN 1 DIFFERENT";
}
echo "\n";
}
You might start to get confused at this point. It might help for you to know that my PHP code in that last script has its own character encoding, which on my workstation happens to be UTF-8. These transformations I've performed are therefore pretty stupid. I'm lying to PHP, saying that these UTF-8 strings are ASCII or Latin-1 and asking PHP to transform them to UTF-8. It performs a transformation as best it can, but we all know that transformation isn't meaningful.
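To make the "lying to PHP" point concrete, here is a small sketch that looks at the raw bytes. The variable names are made up; the string is written with byte escapes so the result does not depend on how the file itself is saved.

```php
<?php
// "Å" as UTF-8 is the two bytes C3 85.
$utf8 = "\xC3\x85";

// Claim (wrongly) that those bytes are ISO-8859-1 and convert to UTF-8.
// Each byte is treated as its own Latin-1 character and re-encoded.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c385
echo bin2hex($double), "\n"; // c383c285 (double-encoded)

// Reversing the same conversion gets the original bytes back.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
echo bin2hex($restored), "\n"; // c385
```

The double-encoded result is still valid UTF-8, which is why this kind of mistake tends to slip past validity checks and only shows up as garbled text on screen.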
I hope you can appreciate what I'm getting at here. Every time you see a character on a computer, it has some encoding. Whether or not there are any negative consequences will depend on how you treat the data that comes to you, what transformations you perform on it, and what you intend to do with it later.
It's helpful to think of a chain of custody. Where did your data come from? What encoding did they use? Is that what I'm using on my system? Where am I sending this data? Does it need to be converted? You should also be careful to specify character sets for all these things:
data you receive from clients
form submissions to your website
display of html on your website
operations on text strings in your applications
character encoding of your connection to a database, character encoding of the tables in your db and encodings of the columns in those tables
character encoding of stored data
email character encoding
character encoding of data submitted to an API
And so on.
General rule of thumb: use utf-8 for everything you possibly can.
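As a rough illustration of "specify the character set everywhere" on the PHP side only (database and email settings are outside this sketch), the following sets the defaults explicitly and passes the encoding to string functions. The sample name is made up.

```php
<?php
// Make the assumed encoding explicit instead of relying on defaults.
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');

$name = "\xC3\x85sa"; // "Åsa" as UTF-8 bytes

echo strlen($name), "\n";    // 4 (bytes)
echo mb_strlen($name), "\n"; // 3 (characters)

// Pass the encoding explicitly when escaping output for HTML.
echo htmlspecialchars($name, ENT_QUOTES, 'UTF-8'), "\n";
```

The difference between strlen() and mb_strlen() is the kind of thing that goes wrong when string operations are not told which encoding they are working with.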
ASCII is a subset of UTF-8, so an ASCII string is a valid UTF-8 string. Concatenating two UTF-8 strings is unambiguous.
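A quick way to convince yourself of this, sketched with made-up sample strings:

```php
<?php
$ascii  = "Total: ";           // pure ASCII
$utf8   = "10 \xE2\x82\xAC";   // "10" followed by the euro sign, as UTF-8
$joined = $ascii . $utf8;

var_dump(mb_check_encoding($ascii, 'ASCII'));  // true
var_dump(mb_check_encoding($ascii, 'UTF-8'));  // true: ASCII is also valid UTF-8
var_dump(mb_check_encoding($joined, 'UTF-8')); // true: the concatenation is valid UTF-8
var_dump(mb_check_encoding($joined, 'ASCII')); // false: it is no longer pure ASCII
```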
I'm trying to determine whether or not my string contains the UTF-8 replacement character.
Currently I've had two attempts which failed.
First attempt:
stristr($string, "\xEF\xBF\xBD")
Second attempt
preg_match("#\xEF\xBF\xBD#i", $string)
Neither of these works.
Question is, how can I check my string for the replacement character?
If you mean to use this just to see if there are non-visible characters in a string, you could use something like this:
if (strlen($string) != strlen(iconv("UTF-8", "UTF-8//IGNORE", $string)))
echo "This string has invisible characters";
The method in your question should also work, but it requires the character encoding for the string to actually be in UTF-8. You can use iconv to convert a string from whatever its encoding is to UTF-8 before checking if the character is there.
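Also: you may want to use the Unicode escape notation for this character, U+FFFD, rather than spelling out its bytes. PHP 7.0 and later support this directly in double-quoted strings as "\u{FFFD}". Older versions do not, so there you'll have to use a trick like this (shown for U+1000; use '&#xFFFD;' for the replacement character):
mb_convert_encoding('&#x1000;', 'UTF-8', 'HTML-ENTITIES');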
More info on that here.
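For what it's worth, a byte-level search along the lines of the question can be sketched like this. The helper name has_replacement_char() is made up, and the sketch assumes $s really is UTF-8 encoded.

```php
<?php
// Hypothetical helper: true if $s contains U+FFFD (bytes EF BF BD in UTF-8).
function has_replacement_char(string $s): bool
{
    // Compare against false explicitly: a match at offset 0 returns int 0.
    return strpos($s, "\xEF\xBF\xBD") !== false;
}

var_dump(has_replacement_char("abc\xEF\xBF\xBDdef")); // true
var_dump(has_replacement_char("\xEF\xBF\xBDabc"));    // true (match at offset 0)
var_dump(has_replacement_char("abc"));                // false

// On PHP 7.0+ the escape "\u{FFFD}" produces the same three bytes.
var_dump("\u{FFFD}" === "\xEF\xBF\xBD");              // true
```

If a check like this reports nothing, one possible cause is that the string is not UTF-8 at all, in which case the character shown on screen is the browser's rendering of invalid bytes rather than a real U+FFFD in the data.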
<?php
if (mb_detect_encoding($str, "UTF-8", true) !== FALSE) { // strict mode: reject invalid byte sequences
// $str is UTF-8 encoded
} else {
// $str is not UTF-8 encoded
}
Please refer to this.
How to convert ASCII encoding to UTF8 in PHP
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8.
If you know for sure that your current encoding is pure ASCII, then you don't have to do anything because ASCII is already a valid UTF-8.
But if you still want to convert, just to be sure that it's UTF-8, then you can use iconv:
$string = iconv('ASCII', 'UTF-8//IGNORE', $string);
The IGNORE will discard any invalid characters just in case some were not valid ASCII.
Use mb_convert_encoding to convert an ASCII string to UTF-8. More info here
$string = "chárêctërs";
print(mb_detect_encoding ($string));
$string = mb_convert_encoding($string, "UTF-8");
print(mb_detect_encoding ($string));
"ASCII is a subset of UTF-8, so..." - so UTF-8 is a set? :)
In other words: any string built from code points 0x00 to 0x7F has indistinguishable representations (byte sequences) in ASCII and UTF-8. Converting such a string is pointless.
Use utf8_encode(). Note that it converts from ISO-8859-1 to UTF-8 (for pure ASCII input that changes nothing), and it is deprecated as of PHP 8.2.
Man page can be found here http://php.net/manual/en/function.utf8-encode.php
Also read this article from Joel on Software. It provides an excellent explanation of what Unicode is and how it works. http://www.joelonsoftware.com/articles/Unicode.html
I have found a lot of varying / inconsistent information across the web on this topic, so I'm hoping someone can help me out with these issues:
I need a function to cleanse a string so that it is safe to insert into a utf-8 mysql db or to write to a utf-8 XML file. Characters that can't be converted to utf-8 should be removed.
For writing to an XML file, I'm also running into the problem of converting html entities into numeric entities. The htmlspecialchars() works almost all the time, but I have read that it is not sufficient for properly cleansing all strings, for example one that contains an invalid html entity.
Thanks for your help, Brian
You didn't say where the strings were coming from, but if you're getting them from an HTML form submission, see this article:
Setting the character encoding in form submit for Internet Explorer
Long and short, you'll need to explicitly tell the browser what charset you want the form submission in. If you specify UTF-8, you should never get invalid UTF-8 from a browser. If you want to protect yourself against ANY type of malicious attack, you'll need to use iconv:
http://www.php.net/iconv
$utf_8_string = iconv($from_charset, $to_charset, $original_string);
If you specify "utf-8" as both $from_charset and $to_charset, iconv() returns false (and raises a notice) if $original_string contains invalid UTF-8.
If you're getting your strings from a different source and you know the character encoding, you can still use iconv(). Typical encodings in the US are CP-1252 (Windows) and ISO-8859-1 (everything else.)
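A small sketch of that second case, assuming the input is known to be Windows-1252. The sample bytes are made up, and the encoding name 'CP1252' is the spelling this sketch assumes your iconv build accepts.

```php
<?php
// Curly quotes around "Café", as Windows-1252 bytes.
$cp1252 = "\x93Caf\xE9\x94";

$utf8 = iconv('CP1252', 'UTF-8', $cp1252);
if ($utf8 === false) {
    // iconv() returns false when the input cannot be converted.
    exit("Conversion failed.\n");
}

echo bin2hex($utf8), "\n";                      // e2809c436166c3a9e2809d
var_dump(mb_check_encoding($utf8, 'UTF-8'));    // true
```

Getting the source charset right matters here: 0x93 and 0x94 are curly quotes in Windows-1252 but control characters in ISO-8859-1, so declaring the wrong one of the two silently produces different output.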
Something like this?
function cleanse($in) {
$bad = Array('”', '“', '’', '‘');
$good = Array('"', '"', '\'', '\'');
$out = str_replace($bad, $good, $in);
return $out;
}
You can convert a string from any encoding to UTF-8 with iconv or mbstring:
// With the //IGNORE flag, this will ignore invalid characters
iconv('input-encoding', 'UTF-8//IGNORE', $the_string);
or
mb_convert_encoding($the_string, 'UTF-8', 'input-encoding');
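If the input is already supposed to be UTF-8 and the goal is only to drop whatever bytes are not valid, one possible sketch with mbstring is to switch the substitution character off and convert UTF-8 to UTF-8. The function name strip_invalid_utf8() is made up for illustration.

```php
<?php
// Hypothetical helper: removes byte sequences that are not valid UTF-8.
function strip_invalid_utf8(string $s): string
{
    $previous = mb_substitute_character(); // remember the current setting
    mb_substitute_character('none');       // drop bad bytes instead of replacing them
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);    // restore the setting
    return $clean;
}

echo strip_invalid_utf8("a\x80b"), "\n";                     // "ab": stray continuation byte removed
echo bin2hex(strip_invalid_utf8("D\xC3\xBCsseldorf")), "\n"; // valid input is left unchanged
```

Silently dropping bytes hides the fact that something upstream sent the wrong encoding, so it may be worth logging when the cleaned string differs from the input.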
Is there any way in PHP of detecting the following character �?
I'm currently fixing a number of UTF-8 encoding issues with a few different algorithms and need to be able to detect if � is present in a string. How do I do so with strpos?
Simply pasting the character into my codebase does not seem to work.
if (strpos($names['decode'], '?') !== false || strpos($names['decode'], '�') !== false)
Converting a UTF-8 string into UTF-8 using iconv() using the //IGNORE parameter produces a result where invalid UTF-8 characters are dropped.
Therefore, you can detect a broken character by comparing the length of the string before and after the iconv operation. If they differ, the string contained a broken character.
Test case (make sure you save the file as UTF-8):
<?php
header("Content-type: text/html; charset=utf-8");
$teststring = "Düsseldorf";
// Deliberately create broken string
// by encoding the original string as ISO-8859-1
$teststring_broken = utf8_decode($teststring);
echo "Broken string: ".$teststring_broken ;
echo "<br>";
$teststring_converted = iconv("UTF-8", "UTF-8//IGNORE", $teststring_broken );
echo $teststring_converted;
echo "<br>";
if (strlen($teststring_converted) != strlen($teststring_broken ))
echo "The string contained an invalid character";
In theory, you could drop //IGNORE and simply test for a failed (empty) iconv operation, but there might be other reasons for an iconv call to fail than just invalid characters... I don't know. I would use the comparison method.
Here is what I do to detect and correct the encoding of strings not encoded in UTF-8 when that is what I am expecting:
$encoding = mb_detect_encoding($str, 'utf-8, iso-8859-1, ascii', true);
if (strcasecmp($encoding, 'UTF-8') !== 0) {
$str = iconv($encoding, 'utf-8', $str);
}
As far as I know, that question mark symbol is not a single character. There are many different character codes in the standard font sets that are not mapped to a symbol, and that is the default symbol that is used. To do detection in PHP, you would first need to know what font it is that you're using. Then you need to look at the font implementation and see what ranges of codes map to the "?" symbol, and then see if the given character is in one of those ranges.
I use the CUSTOM method (using str_replace) to sanitize undefined characters:
$text='a³';
$text=str_replace("\n\n", "sample000" ,$text);
$text=str_replace("\n", "sample111" ,$text);
$text=filter_var($text,FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);
$text=str_replace("sample000", "<br/><br/>" ,$text);
$text=str_replace("sample111", "<br/>" ,$text);
echo $text; //outputs ------------> a3