How do I check if the charset of a string is UTF8?
Don't reinvent the wheel. There is a builtin function for that task: mb_check_encoding().
mb_check_encoding($string, 'UTF-8');
Just a side note:
You cannot determine if a given string is encoded in UTF-8. You can only determine if a given string is definitely not encoded in UTF-8. Please see a related question here:
You cannot detect if a given string
(or byte sequence) is a UTF-8 encoded
text as for example each and every
series of UTF-8 octets is also a valid
(if nonsensical) series of Latin-1 (or
some other encoding) octets. However
not every series of valid Latin-1
octets are valid UTF-8 series.
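The asymmetry described in the quote can be seen with two short byte strings. This is only an illustrative sketch: it assumes the mbstring extension is loaded, and the sample bytes and variable names are made up for the example.

```php
<?php
// "é" encoded as UTF-8 is the two bytes C3 A9.
$utf8Bytes = "\xC3\xA9";
// "é" encoded as Latin-1 (ISO-8859-1) is the single byte E9.
$latin1Bytes = "\xE9";

// The UTF-8 bytes pass the UTF-8 check ...
$utf8IsValid = mb_check_encoding($utf8Bytes, 'UTF-8');     // true
// ... but a lone E9 is an incomplete UTF-8 sequence, so it fails.
$latin1IsValid = mb_check_encoding($latin1Bytes, 'UTF-8'); // false

// Nothing stops you from reading the UTF-8 bytes as Latin-1: every byte maps
// to some character, so the conversion succeeds and yields the mojibake "Ã©".
$misread = mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');
```

So a failed check proves the string is not UTF-8, while a passed check only says the bytes could be UTF-8.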
function is_utf8($string) {
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
I have checked. This function is effective.
Better yet, use both of the above solutions.
function isUtf8($string) {
if (function_exists("mb_check_encoding") && is_callable("mb_check_encoding")) {
return mb_check_encoding($string, 'UTF8');
}
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
mb_detect_encoding($string); returns the most likely encoding of $string from its list of candidates; it is a guess, not a guarantee. mb_check_encoding($string, 'UTF-8'); returns TRUE if $string is a valid UTF-8 byte sequence, and FALSE otherwise.
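If the string reaches you through an HTTP request, the Accept-Charset header of that request (when one is sent) is available as: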
echo $_SERVER['HTTP_ACCEPT_CHARSET'];
None of the above answers is really correct. Yes, they may work, but if you take the answer with the preg_match function and process a lot of strings, you are going to kill your server. Use this pure PHP function instead: it uses no regex, works 100% of the time, and is way faster.
if (function_exists('grk_Is_UTF8') === FALSE) {
    function grk_Is_UTF8($String = '') {
        # Compute the length of the string in bytes
        $Len = strlen($String);
        # Loop over every byte
        for ($i = 0; $i < $Len; $i++) {
            # Get the numeric value of the byte
            $Ord = ord($String[$i]);
            # Bytes 0-127 are ASCII; anything above must start a multi-byte sequence
            if ($Ord > 127) {
                if ($Ord > 247) {
                    return FALSE;
                } elseif ($Ord > 239) {
                    $Bytes = 4;
                } elseif ($Ord > 223) {
                    $Bytes = 3;
                } elseif ($Ord > 191) {
                    $Bytes = 2;
                } else {
                    return FALSE;
                }
                # The sequence must not run past the end of the string
                if (($i + $Bytes) > $Len) {
                    return FALSE;
                }
                # Loop over the continuation bytes
                while ($Bytes > 1) {
                    # Move to the next byte
                    $i++;
                    # Get the numeric value of the byte
                    $Ord = ord($String[$i]);
                    if ($Ord < 128 OR $Ord > 191) {
                        return FALSE;
                    }
                    # Valid continuation byte
                    $Bytes--;
                }
            }
        }
        # Every byte checked out
        return TRUE;
    }
}
I have a string, for example:
$a = 'abc🔹abc';
The 'small blue diamond' is: bin2hex('🔹') => f09f94b9
So, I would like to convert the $a string into a string which represents the small blue diamond with the HTML escape: &#128313;
Which function should I call to convert all Unicode characters into their HTML-escape representation?
More details on this case
In WordPress, when I want to insert the $a variable into a table, $wpdb runs its checks (see the WPDB source code).
When WordPress prepares the $data to be inserted or updated, it runs the fields through the $wpdb->strip_invalid_text method and then checks whether anything invalid was found in the $data. It finds the text in the $a variable invalid with the following regexp:
$regex = '/
(
(?: [\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC2-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| \xE0[\xA0-\xBF][\x80-\xBF] # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xE1-\xEC][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| [\xEE-\xEF][\x80-\xBF]{2}';
if ( 'utf8mb4' === $charset ) {
$regex .= '
| \xF0[\x90-\xBF][\x80-\xBF]{2} # four-byte sequences 11110xxx 10xxxxxx * 3
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
';
}
$regex .= '){1,40} # ...one or more times
)
| . # anything else
/x';
$value['value'] = preg_replace( $regex, '$1', $value['value'] );
if ( false !== $length && mb_strlen( $value['value'], 'UTF-8' ) > $length ) {
$value['value'] = mb_substr( $value['value'], 0, $length, 'UTF-8' );
}
When the 'small blue diamond' is represented with f09f94b9, this regexp marks the data invalid. When it is represented with &#128313;, the data passes. So what I need is to convert such Unicode characters into a representation that WordPress accepts.
Here is what I came up with to convert all of the characters. You can modify it further to convert only the characters in the range you need.
$s = 'abc🔹def';
$a = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
foreach($a as $c){
echo '&#' . unpack('V', iconv('UTF-8', 'UCS-4LE', $c))[1] . ';';
}
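To convert only the characters in the range that trips the WordPress check, the loop above can be narrowed to code points above U+FFFF (the ones that need four bytes in UTF-8). This is a minimal sketch, assuming the mbstring extension (for mb_ord, PHP 7.2+) and valid UTF-8 input; the function name encode_supplementary_as_entities is hypothetical and not part of WordPress.

```php
<?php
// Hypothetical helper (not part of WordPress): replace each character outside
// the Basic Multilingual Plane with its decimal numeric character reference.
function encode_supplementary_as_entities(string $text): string
{
    $result = preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]/u',
        static function (array $match): string {
            // mb_ord() returns the Unicode code point of the matched character.
            return '&#' . mb_ord($match[0], 'UTF-8') . ';';
        },
        $text
    );
    // preg_replace_callback() returns null on failure (e.g. invalid UTF-8
    // input); in that case the text is handed back unchanged.
    return $result ?? $text;
}

echo encode_supplementary_as_entities("abc\u{1F539}abc"), "\n"; // abc&#128313;abc
```

Characters that fit in three bytes or fewer, such as accented Latin letters, are left as they are.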
I am parsing a text file and am occasionally running into data such as:
CASTA¥EDA, JASON
I am using a MongoDB backend, and when I try to save the information I get errors like:
[MongoDB\Driver\Exception\UnexpectedValueException]
Got invalid UTF-8 value serializing 'Jason Casta�eda'
After Googling a few places, I located two functions that the author says would work:
function is_utf8( $str )
{
return preg_match( "/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x",
$str
);
}
public function force_utf8($str, $inputEnc='WINDOWS-1252')
{
if ( $this->is_utf8( $str ) ) // Nothing to do.
return $str;
if ( strtoupper( $inputEnc ) === 'ISO-8859-1' )
return utf8_encode( $str );
if ( function_exists( 'mb_convert_encoding' ) )
return mb_convert_encoding( $str, 'UTF-8', $inputEnc );
if ( function_exists( 'iconv' ) )
return iconv( $inputEnc, 'UTF-8', $str );
// You could also just return the original string.
trigger_error(
'Cannot convert string to UTF-8 in file '
. __FILE__ . ', line ' . __LINE__ . '!',
E_USER_ERROR
);
}
Using the two functions above, I am trying to determine whether a line of text is valid UTF-8 by calling is_utf8($text), and if it is not, I call the force_utf8($text) function. However, I am getting the same error. Any pointers?
This question is pretty old, but for those who face the same issue and land on this page like me:
mb_convert_encoding($value, 'UTF-8', 'UTF-8');
This code should replace every invalid UTF-8 sequence with a ? symbol, which makes the value safe for MongoDB insert/update operations.
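Scrubbing loses the original character, so when the real source encoding is known it may be preferable to convert from it instead. The sketch below contrasts the two options. It assumes the mbstring extension, and it assumes, purely for illustration, that the input is Windows-1252; the sample bytes are invented and the actual encoding of the file in the question is not known.

```php
<?php
// "Castañeda" as Windows-1252 bytes: ñ is the single byte F1,
// which on its own is not valid UTF-8.
$raw = "Casta\xF1eda";

// Option 1: scrub. Invalid sequences are replaced by the mbstring
// substitute character, so the result is valid UTF-8 but the ñ is lost.
$scrubbed = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');

// Option 2: convert from the source encoding (assumed to be Windows-1252
// here), but only when the bytes are not already valid UTF-8.
$converted = mb_check_encoding($raw, 'UTF-8')
    ? $raw
    : mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
```

If the converted text still looks wrong, the assumed source encoding is probably not the right one.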
I am trying to write code to hyphenate a string of Latin verse. There are a few constraints to it, which I have taken care of; however, I do not get the desired output. My code is given below:
<?php
$string = "impulerittantaenanimis caelestibusirae";
$precedingC = precedingConsonant($string);
$xrule = xRule($precedingC);
$consonantc = consonantCT($xrule);
$consonantp = consonantPT($consonantc);
$cbv = CbetweenVowels($consonantp);
$tv = twoVowels($cbv);
echo $tv;
function twoVowels($string)
{
return preg_replace('/([aeiou])([aeiou])/', '$1-$2', $string);
}
function CbetweenVowels($string)
{
return preg_replace('/([aeiou])([^aeiou])([aeiou])/', '$1-$2$3', $string);
}
function consonantPT($string)
{
return preg_replace('/([^aeiou]p)(t[aeiou])/', '$1-$2', $string);
}
function consonantCT($string)
{
return preg_replace('/([^aeiou]c)(t[aeiou])/', '$1-$2', $string);
}
function precedingConsonant($string)
{
$arr1 = str_split($string);
$length = count($arr1);
for($j=0;$j<$length;$j++)
{
if(isVowel($arr1[$j]) && !isVowel($arr1[$j+1]) && !isVowel($arr1[$j+2]) && isVowel($arr1[$j+3]))
{
$pc++;
}
}
function strAppend2($string)
{
$arr1 = str_split($string);
$length = count($arr1);
for($i=0;$i<$length;$i++)
{
$check = $arr1[$i+1].$arr1[$i+2];
$check2 = $arr1[$i+1].$arr1[$i+2].$arr1[$i+3];
if($check=='br' || $check=='cr' || $check=='dr' || $check=='fr' || $check=='gr' || $check=='pr' || $check=='tr' || $check=='bl' || $check=='cl' || $check=='fl' || $check=='gl' || $check=='pl' || $check=='ch' || $check=='ph' || $check=='th' || $check=='qu' || $check2=='phl' || $check2=='phr')
{
if(isVowel($arr1[$i]) && !isVowel($arr1[$i+1]) && !isVowel($arr1[$i+2]) && isVowel($arr1[$i+3]))
{
$updatedString = substr_replace($string, "-", $i+1, 0);
return $updatedString;
}
}
else
{
if(isVowel($arr1[$i]) && !isVowel($arr1[$i+1]) && !isVowel($arr1[$i+2]) && isVowel($arr1[$i+3]))
{
$updatedString = substr_replace($string, "-", $i+2, 0);
return $updatedString;
}
}
}
}
$st1 = $string;
for($k=0;$k<$pc;$k++)
{
$st1 = strAppend2($st1);
}
return $st1;
}
function xRule($string)
{
return preg_replace('/([aeiou]x)([aeiou])/', '$1-$2', $string);
}
function isVowel($ch)
{
if($ch=='a' || $ch=='e' || $ch=='i' || $ch=='o' || $ch=='u')
{
return true;
}
else
{
return false;
}
}
function isConsonant($ch)
{
if($ch=='a' || $ch=='e' || $ch=='i' || $ch=='o' || $ch=='u')
{
return false;
}
else
{
return true;
}
}
?>
I believe if I combine all these functions it will result in the desired output. However I will specify my constraints below :
Rule 1 : When two or more consonants are between vowels, the first consonant is joined to the preceding vowel; for example - rec-tor, trac-tor, ac-tor, delec-tus, dic-tator, defec-tus, vic-tima, Oc-tober, fac-tum, pac-tus,
Rule 2 : 'x' is joined to the preceding vowel; as, rex-i.
However, we make a special exception for the following consonant groups: br, cr, dr, fr, gr, pr, tr; bl, cl, fl, gl, pl, phl, phr, ch, ph, th, qu. These are joined to the following vowel, for example con-sola-trix.
Rule 3 : When 'ct' follows a consonant, that consonant and 'c' are both joined to the first vowel for example - sanc-tus and junc-tum
Similarly for 'pt' we apply the same rule for example - scalp-tum, serp-tum, Redemp-tor.
Rule 4 : A single consonant between two vowels is joined to the following vowel for example - ma-ter, pa-ter AND Z is joined to the following vowel.
Rule 5 : When two vowels come together they are divided, if they be not a diphthong; as au-re-us. Diphthongs are - "ae","oe","au"
If you look carefully at each rule, you can see that all of them involve a vowel at the beginning, i.e. a preceding vowel. Once you realize that, you can try to build a single pattern that factors [aeiou] out at the beginning:
$pattern = '~
(?<=[aeiou]) # each rule involves a vowel at the beginning (also called a
# "preceding vowel")
(?:
# Rule 2: capture particular cases
( (?:[bcdfgpt]r | [bcfgp] l | ph [lr] | [cpt] h | qu ) [aeiou] x )
|
[bcdfghlmnp-tx]
(?:
# Rule 3: When "ct" follows a consonant, that consonant and "c" are both
# joined to the first vowel
[cp] \K (?=t)
|
# Rule 1: When two or more consonants are between vowels, the first
# consonant is joined to the preceding vowel
\K (?= [bcdfghlmnp-tx]+ [aeiou] )
)
|
# Rule 4: a single consonant between two vowels is joined to the following
# vowel
(?:
\K (?= [bcdfghlmnp-t] [aeiou] )
|
# Rule 2: "x" is joined to the preceding vowel
x \K (?= [a-z] | (*SKIP)(*F) )
)
|
# Rule 5: When two vowels come together they are divided, if they not be a
# diphthong ("ae", "oe", "au")
\K (?= [aeiou] (?<! a[eu] | oe ) )
)
~xi';
This pattern is designed to only match the position where to put the hyphen (except for particular cases of Rule 2), that's why it uses a lot of \K to start the match result at this position and lookaheads to test what follows without matching characters.
$string = <<<EOD
Aeneadum genetrix, hominum diuomque uoluptas,
alma Uenus, caeli subter labentia signa
quae mare nauigerum, quae terras frugiferentis
concelebras, per te quoniam genus omne animantum
EOD;
$result = preg_replace($pattern, '-$1', $string);
Ae-ne-a-dum ge-ne-trix, ho-mi-num di-u-om-qu-e u-o-lup-tas,
al-ma U-e-nus, cae-li sub-ter la-ben-ti-a sig-na
qu-ae ma-re nau-i-ge-rum, qu-ae ter-ras fru-gi-fe-ren-tis
con-ce-leb-ras, per te qu-o-ni-am ge-nus om-ne a-ni-man-tum
Note that I didn't include several letters like k, y and z that don't exist in the Latin alphabet; feel free to include them if you need to handle translated Greek words or others.
While the real problem is the collation of the field in the database, I can't change it. I need to drop the invalid characters instead.
Using @iconv('utf-8', 'utf-8//IGNORE'); won't work, because the characters are valid UTF-8 characters, but invalid when inserted into a field with that collation.
$broken_example = '↺ﺆী▜Ꮛ︷ሚ◶ヲɸʩ𝑸ᚙ𐤄🃟ʳ⸘ᥦฆⵞ䷿ꘚꕛ𝆖𝇑𝆺𝅥𝅮↺ﺆী▜Ꮛ︷ሚ◶ヲɸʩ𝑸ᚙ𐤄🃟ʳ⸘ᥦฆⵞ䷿ꘚꕛ𝆖𝇑𝆺𝅥𝅮';
$utf8 = html_entity_decode($broken_example, ENT_QUOTES, 'UTF-8');
I've tried to use some workaround like preg_replace('/&#([0-9]{6,});/', '');, but with no success.
The error MySQL reports is Incorrect string value: '\xF0\x90\xA4\x84\xCA\xB3...'
A regex that keeps every UTF-8 sequence and drops everything else is:
function removeInvalidChars ($text) {
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
return preg_replace($regex, '$1', $text);
}
Removing the match for 4-byte characters leaves only the characters that can be stored in a MySQL utf8 column (for example with the utf8_general_ci collation).
function removeInvalidChars ($text) {
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2}) | ./x';
return preg_replace($regex, '$1', $text);
}
By the way, it's the character set that matters, not the collation. Also, you would be much better off just switching to utf8mb4 with utf8mb4_unicode_ci rather than putting in a hack like this.
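When the input is already known to be valid UTF-8, the same filtering can be written in terms of code points rather than bytes by using the /u modifier. This is a sketch of that alternative, not the answer's method; the function name strip_four_byte_chars is hypothetical.

```php
<?php
// Hypothetical helper: drop every character that needs four bytes in UTF-8,
// i.e. every code point above U+FFFF.
function strip_four_byte_chars(string $text): string
{
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
    // preg_replace() returns null on failure (e.g. invalid UTF-8 input);
    // in that case the text is handed back unchanged.
    return $result ?? $text;
}

// U+1F0DF needs four bytes and is removed; U+21BA needs three and is kept.
$clean = strip_four_byte_chars("a\u{1F0DF}b\u{21BA}c");
```

The three-byte and shorter characters survive untouched, which matches what a 3-byte-per-character column can hold.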
Today I decided to test a small function that checks if a string is UTF-8.
I followed the recommendations from "Multilingual form encoding" and created a small helper:
function is_utf8($string) {
if (strlen($string) == 0)
{
return true;
}
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
As a test, I used a string of 196 characters and simply ran it through my helper. But the browser doesn't display the page with the result; instead I get "404 Page not found".
$string = "1234567890123456789012345678..."; // 196 characters here
echo strlen($string); // result - 196
var_dump(is_utf8($string)); // Error - Page not found!
But if I use 195 characters, everything works fine.
I've tried all kinds of characters, even spaces. The function only works with strings of at most 195 characters.
Why?
This works as well, with a simple regular expression and serialize
function check_utf8($str) {
return (bool)preg_match('//u', serialize($str));
}
I did a simple test: I ran each approach 1,000,000 times to see which one is fastest.
I would also like to thank @mario for the tip about atomic grouping.
$string = "ывлдоkfdsuLIU(*knj4k58u7MJHKkiyhsf9hfhlknhlkjldfivjo8iulkjlgs".
"2345678901234567890123456789012345678901234567890123456789012".
"ыдваолт ДЛЯОЧДльы0щ39478509г0*()*?Щчялртодылматцю4к 2ылвсголо".
"4567890123456789012345678901234567890123456789012345678901234".
"4567890123456789012345678901234567890123456789012345678901234".
"asdfsd ds.kjasldasjlKUJLjLKZjulizL kzjxLkUJOLIULKM.LKl;.mcvss";
$s = microtime(true);
for ($i=0; $i<1000000; $i++)
{
// algorithm
}
$e = microtime(true);
echo $e-$s;
And here are the results:
preg_match('//u', $string )
Result: 11.634791135788 sec
preg_match('%^(?>
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string)
Result: Fatal error: Maximum execution time of 30 seconds exceeded
preg_match('/^./su', $string)
Result: 12.27244400978 sec
mb_detect_encoding($string, array('UTF-8'), true)
Result: 15.370143890381 sec
I also tried the method proposed here by @helloworld:
preg_match('//u', serialize($string))
Result: 23.193331956863 sec
Thank you all for your advice!
You helped me to understand the problem:
If the string is too long, PCRE crashes.
See http://www.java-samples.com/showtutorial.php?tutorialid=1526 for a way to solve it.
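Given the timings above, one way to sidestep the long alternation entirely is to let PCRE do the validation itself through the /u modifier. The sketch below wraps that check in a function; the name is_valid_utf8 is hypothetical, and the approach relies on preg_match() returning false for a subject that is not valid UTF-8.

```php
<?php
// Hypothetical helper built on the preg_match('//u', ...) approach.
function is_valid_utf8(string $string): bool
{
    // With the /u modifier PCRE validates the subject as UTF-8 before
    // matching. The empty pattern matches any valid subject (result 1);
    // for invalid UTF-8 preg_match() returns false.
    return preg_match('//u', $string) === 1;
}

// A string far longer than the 196 characters from the question.
$long = str_repeat('1234567890', 1000);

var_dump(is_valid_utf8($long));      // bool(true)
var_dump(is_valid_utf8("\xC3\x28")); // bool(false): C3 needs a continuation byte
```

Because the pattern is empty, there is no repeated group for PCRE to work through, so the string length should not matter in the way it did for the alternation.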