How to check the charset of string? - php

How do I check if the charset of a string is UTF8?

Don't reinvent the wheel. There is a builtin function for that task: mb_check_encoding().
mb_check_encoding($string, 'UTF-8');

Just a side note:
You cannot determine that a given string is encoded in UTF-8. You can only determine that a given string is definitely not encoded in UTF-8. Please see a related question here:
You cannot detect if a given string (or byte sequence) is a UTF-8 encoded text as, for example, each and every series of UTF-8 octets is also a valid (if nonsensical) series of Latin-1 (or some other encoding) octets. However, not every series of valid Latin-1 octets is a valid UTF-8 series.
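The asymmetry described in the quote can be seen with a minimal sketch, assuming the mbstring extension is loaded. The sample bytes are my own illustration and are not taken from the question:

```php
<?php
// "é" encoded as UTF-8 is the two bytes C3 A9; in Latin-1 it is the single byte E9.
$utf8Bytes   = "\xC3\xA9";
$latin1Bytes = "\xE9";

// The UTF-8 bytes also pass as Latin-1 (they would read as "Ã©"),
// so a successful check never proves which encoding was intended.
var_dump(mb_check_encoding($utf8Bytes, 'UTF-8'));      // bool(true)
var_dump(mb_check_encoding($utf8Bytes, 'ISO-8859-1')); // bool(true)

// The lone Latin-1 byte is not a valid UTF-8 sequence, so UTF-8 can be ruled out.
var_dump(mb_check_encoding($latin1Bytes, 'UTF-8'));    // bool(false)
```

In other words, a FALSE result is conclusive, while a TRUE result only means "could be UTF-8".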

function is_utf8($string) {
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
I have checked. This function is effective.

Better yet, use both of the above solutions.
function isUtf8($string) {
if (function_exists("mb_check_encoding") && is_callable("mb_check_encoding")) {
return mb_check_encoding($string, 'UTF8');
}
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}

mb_detect_encoding($string); returns a best guess at the character set of $string; it cannot know the actual one. mb_check_encoding($string, 'UTF-8'); returns TRUE if $string is a valid UTF-8 byte sequence, otherwise FALSE.
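A short sketch of how the two functions differ, assuming mbstring is available. The sample byte and the candidate list are assumptions chosen for illustration:

```php
<?php
// Hypothetical sample: the single byte E9 ("é" in Latin-1) is not valid UTF-8.
$bytes = "\xE9";

// Strict detection only chooses among the candidate encodings that are listed,
// so the result is a guess restricted to that list.
$guess = mb_detect_encoding($bytes, ['UTF-8', 'ISO-8859-1'], true);

// mb_check_encoding() answers a narrower question: are these bytes valid UTF-8?
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');

var_dump($guess);  // string(10) "ISO-8859-1"
var_dump($isUtf8); // bool(false)
```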

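If the string comes from an HTTP request, you can look at the charsets the client says it accepts (this is the client's Accept-Charset header, not a statement about the encoding of any particular string):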
echo $_SERVER['HTTP_ACCEPT_CHARSET'];

None of the above answers is ideal. Yes, they may work, but if you process a lot of strings with the preg_match answer, are you trying to kill your server? Use this pure-PHP function with no regex; it is way faster. Note that it only checks the lead-byte/continuation-byte structure, so it does not reject overlong encodings or surrogates.
if(function_exists('grk_Is_UTF8') === FALSE){
    function grk_Is_UTF8($String=''){
        # Compute the length of the string in bytes
        $Len = strlen($String);
        # Loop over each byte
        for($i = 0; $i < $Len; $i++){
            # Get the numeric value of the byte
            $Ord = ord($String[$i]);
            # Bytes 0-127 are ASCII; anything above must start a multi-byte sequence
            if($Ord > 127){
                if($Ord > 247){
                    return FALSE;
                } elseif($Ord > 239){
                    $Bytes = 4;
                } elseif($Ord > 223){
                    $Bytes = 3;
                } elseif($Ord > 191){
                    $Bytes = 2;
                } else {
                    # A continuation byte (128-191) cannot start a sequence
                    return FALSE;
                }
                # The sequence must not run past the end of the string
                if(($i + $Bytes) > $Len){
                    return FALSE;
                }
                # Loop over the continuation bytes
                while($Bytes > 1){
                    # Move to the next byte
                    $i++;
                    # Get the numeric value of the byte
                    $Ord = ord($String[$i]);
                    # Continuation bytes must be in the range 128-191
                    if($Ord < 128 OR $Ord > 191){
                        return FALSE;
                    }
                    # This byte is fine
                    $Bytes--;
                }
            }
        }
        # Every byte passed
        return TRUE;
    }
}

Related

PHP convert special characters to HTML entity

I have a string ex:
$a = 'abc🔹abc';
The 'small blue diamond' is: bin2hex('🔹') => f09f94b9
So, I would like to convert the $a string into a string which represents the small blue diamond with the HTML escape: &#128313;
Which function should I call to convert all Unicode characters into their HTML-escape representation?
More details on this case
In WordPress, when I want to insert the $a variable into a table, $wpdb runs its checks (see the WPDB source code).
When WordPress prepares the $data to be inserted or updated, it runs the fields through the $wpdb->strip_invalid_text method and then checks whether anything invalid was found in the $data. It marks the text in the $a variable as invalid with the following regexp:
$regex = '/
(
(?: [\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC2-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| \xE0[\xA0-\xBF][\x80-\xBF] # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xE1-\xEC][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| [\xEE-\xEF][\x80-\xBF]{2}';
if ( 'utf8mb4' === $charset ) {
$regex .= '
| \xF0[\x90-\xBF][\x80-\xBF]{2} # four-byte sequences 11110xxx 10xxxxxx * 3
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
';
}
$regex .= '){1,40} # ...one or more times
)
| . # anything else
/x';
$value['value'] = preg_replace( $regex, '$1', $value['value'] );
if ( false !== $length && mb_strlen( $value['value'], 'UTF-8' ) > $length ) {
$value['value'] = mb_substr( $value['value'], 0, $length, 'UTF-8' );
}
When the 'small blue diamond' is represented as f09f94b9, this regexp marks the data invalid. When it is represented as &#128313;, it is accepted. So what I need is to convert such Unicode characters into a representation that WordPress accepts.
Here is what I came up with to convert all of the characters. You can modify it further to convert only the characters in the range you need.
$s = 'abc🔹def';
$a = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
foreach($a as $c){
echo '&#' . unpack('V', iconv('UTF-8', 'UCS-4LE', $c))[1] . ';';
}
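As one way to restrict the conversion to a range, here is a minimal sketch that escapes only code points above U+FFFF (the ones that need four bytes in UTF-8). The helper name escapeAstralChars is hypothetical, and the sketch assumes PHP 7.4+ with mbstring for mb_ord():

```php
<?php
// Hypothetical helper: escape only code points above U+FFFF as decimal HTML entities.
function escapeAstralChars(string $text): string
{
    $result = preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]/u',
        // mb_ord() returns the Unicode code point of the matched character.
        fn(array $m): string => '&#' . mb_ord($m[0], 'UTF-8') . ';',
        $text
    );
    // preg_replace_callback() returns null if $text is not valid UTF-8;
    // in that case this sketch hands the input back unchanged.
    return $result ?? $text;
}

echo escapeAstralChars("abc\u{1F539}def"); // abc&#128313;def
```

ASCII and other characters from the Basic Multilingual Plane pass through untouched.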

PHP UTF-8 handling

I am parsing a text file and am occasionally running into data such as:
CASTA¥EDA, JASON
Using a Mongo DB backend when I try saving information, I am getting errors like:
[MongoDB\Driver\Exception\UnexpectedValueException]
Got invalid UTF-8 value serializing 'Jason Casta�eda'
After Googling a few places, I located two functions that the author says would work:
function is_utf8( $str )
{
return preg_match( "/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x",
$str
);
}
public function force_utf8($str, $inputEnc='WINDOWS-1252')
{
if ( $this->is_utf8( $str ) ) // Nothing to do.
return $str;
if ( strtoupper( $inputEnc ) === 'ISO-8859-1' )
return utf8_encode( $str );
if ( function_exists( 'mb_convert_encoding' ) )
return mb_convert_encoding( $str, 'UTF-8', $inputEnc );
if ( function_exists( 'iconv' ) )
return iconv( $inputEnc, 'UTF-8', $str );
// You could also just return the original string.
trigger_error(
'Cannot convert string to UTF-8 in file '
. __FILE__ . ', line ' . __LINE__ . '!',
E_USER_ERROR
);
}
Using the two functions above I am trying to determine if a line of text has UTF-8 by calling is_utf8($text) and if it is not then I call the force_utf8($text) function. However I am getting the same error. Any pointers?
This question is pretty old, but for those who face the same issue and land on this page like me:
mb_convert_encoding($value, 'UTF-8', 'UTF-8');
This code should replace all invalid UTF-8 byte sequences with the ? symbol, which makes the value safe for MongoDB insert/update operations.
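A small sketch of that call, assuming mbstring is loaded. The input below is a made-up example (the name with ñ stored as the single Windows-1252 byte F1), not the actual bytes from the question's file:

```php
<?php
// Hypothetical input: "Castañeda" with ñ stored as the single byte F1.
$value = "Casta\xF1eda";

// Converting from UTF-8 to UTF-8 substitutes bytes that are not valid UTF-8.
$clean = mb_convert_encoding($value, 'UTF-8', 'UTF-8');

var_dump(mb_check_encoding($value, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)

// If the real source encoding is known, converting from it keeps the character
// instead of replacing it.
$converted = mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
var_dump($converted === "Casta\u{F1}eda");    // bool(true)
```

The first form loses information, so it is best treated as a last resort when the source encoding cannot be determined.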

Combine Multiple Regex into One

I am trying to write code to hyphenate a string of Latin verse. There are a few constraints, which I have tried to take care of, but I do not get the desired output. My code is given below:
<?php
$string = "impulerittantaenanimis caelestibusirae";
$precedingC = precedingConsonant($string);
$xrule = xRule($precedingC);
$consonantc = consonantCT($xrule);
$consonantp = consonantPT($consonantc);
$cbv = CbetweenVowels($consonantp);
$tv = twoVowels($cbv);
echo $tv;
function twoVowels($string)
{
return preg_replace('/([aeiou])([aeiou])/', '$1-$2', $string);
}
function CbetweenVowels($string)
{
return preg_replace('/([aeiou])([^aeiou])([aeiou])/', '$1-$2$3', $string);
}
function consonantPT($string)
{
return preg_replace('/([^aeiou]p)(t[aeiou])/', '$1-$2', $string);
}
function consonantCT($string)
{
return preg_replace('/([^aeiou]c)(t[aeiou])/', '$1-$2', $string);
}
function precedingConsonant($string)
{
$arr1 = str_split($string);
$length = count($arr1);
for($j=0;$j<$length;$j++)
{
if(isVowel($arr1[$j]) && !isVowel($arr1[$j+1]) && !isVowel($arr1[$j+2]) && isVowel($arr1[$j+3]))
{
$pc++;
}
}
function strAppend2($string)
{
$arr1 = str_split($string);
$length = count($arr1);
for($i=0;$i<$length;$i++)
{
$check = $arr1[$i+1].$arr1[$i+2];
$check2 = $arr1[$i+1].$arr1[$i+2].$arr1[$i+3];
if($check=='br' || $check=='cr' || $check=='dr' || $check=='fr' || $check=='gr' || $check=='pr' || $check=='tr' || $check=='bl' || $check=='cl' || $check=='fl' || $check=='gl' || $check=='pl' || $check=='ch' || $check=='ph' || $check=='th' || $check=='qu' || $check2=='phl' || $check2=='phr')
{
if(isVowel($arr1[$i]) && !isVowel($arr1[$i+1]) && !isVowel($arr1[$i+2]) && isVowel($arr1[$i+3]))
{
$updatedString = substr_replace($string, "-", $i+1, 0);
return $updatedString;
}
}
else
{
if(isVowel($arr1[$i]) && !isVowel($arr1[$i+1]) && !isVowel($arr1[$i+2]) && isVowel($arr1[$i+3]))
{
$updatedString = substr_replace($string, "-", $i+2, 0);
return $updatedString;
}
}
}
}
$st1 = $string;
for($k=0;$k<$pc;$k++)
{
$st1 = strAppend2($st1);
}
return $st1;
}
function xRule($string)
{
return preg_replace('/([aeiou]x)([aeiou])/', '$1-$2', $string);
}
function isVowel($ch)
{
if($ch=='a' || $ch=='e' || $ch=='i' || $ch=='o' || $ch=='u')
{
return true;
}
else
{
return false;
}
}
function isConsonant($ch)
{
if($ch=='a' || $ch=='e' || $ch=='i' || $ch=='o' || $ch=='u')
{
return false;
}
else
{
return true;
}
}
?>
I believe that combining all these functions will produce the desired output. My constraints are specified below:
Rule 1: When two or more consonants are between vowels, the first consonant is joined to the preceding vowel; for example: rec-tor, trac-tor, ac-tor, delec-tus, dic-tator, defec-tus, vic-tima, Oc-tober, fac-tum, pac-tus.
Rule 2: 'x' is joined to the preceding vowel, as in rex-i.
However, we make a special exception for the following consonant groups: br, cr, dr, fr, gr, pr, tr; bl, cl, fl, gl, pl, phl, phr, ch, ph, th, qu. These groups are joined to the following vowel; for example: con-sola-trix.
Rule 3: When 'ct' follows a consonant, that consonant and 'c' are both joined to the first vowel; for example: sanc-tus and junc-tum.
Similarly, the same rule applies to 'pt'; for example: scalp-tum, serp-tum, Redemp-tor.
Rule 4: A single consonant between two vowels is joined to the following vowel; for example: ma-ter, pa-ter. 'z' is also joined to the following vowel.
Rule 5: When two vowels come together they are divided, unless they form a diphthong, as in au-re-us. The diphthongs are "ae", "oe", "au".
If you look carefully at each rule, you can see that they all involve a vowel at the beginning (a preceding vowel). Once you realize that, you can build a single pattern by factoring [aeiou] out at the beginning:
$pattern = '~
(?<=[aeiou]) # each rule involves a vowel at the beginning (also called a
# "preceding vowel")
(?:
# Rule 2: capture particular cases
( (?:[bcdfgpt]r | [bcfgp] l | ph [lr] | [cpt] h | qu ) [aeiou] x )
|
[bcdfghlmnp-tx]
(?:
# Rule 3: When "ct" follows a consonant, that consonant and "c" are both
# joined to the first vowel
[cp] \K (?=t)
|
# Rule 1: When two or more consonants are between vowels, the first
# consonant is joined to the preceding vowel
\K (?= [bcdfghlmnp-tx]+ [aeiou] )
)
|
# Rule 4: a single consonant between two vowels is joined to the following
# vowel
(?:
\K (?= [bcdfghlmnp-t] [aeiou] )
|
# Rule 2: "x" is joined to the preceding vowel
x \K (?= [a-z] | (*SKIP)(*F) )
)
|
# Rule 5: When two vowels come together they are divided, if they not be a
# diphthong ("ae", "oe", "au")
\K (?= [aeiou] (?<! a[eu] | oe ) )
)
~xi';
This pattern is designed to only match the position where to put the hyphen (except for particular cases of Rule 2), that's why it uses a lot of \K to start the match result at this position and lookaheads to test what follows without matching characters.
$string = <<<EOD
Aeneadum genetrix, hominum diuomque uoluptas,
alma Uenus, caeli subter labentia signa
quae mare nauigerum, quae terras frugiferentis
concelebras, per te quoniam genus omne animantum
EOD;
$result = preg_replace($pattern, '-$1', $string);
Ae-ne-a-dum ge-ne-trix, ho-mi-num di-u-om-qu-e u-o-lup-tas,
al-ma U-e-nus, cae-li sub-ter la-ben-ti-a sig-na
qu-ae ma-re nau-i-ge-rum, qu-ae ter-ras fru-gi-fe-ren-tis
con-ce-leb-ras, per te qu-o-ni-am ge-nus om-ne a-ni-man-tum
Note that I didn't include several letters such as k, y and z, which are rare in native Latin words; feel free to include them if you need to handle transliterated Greek words or others.
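To see the \K-plus-lookahead technique in isolation, here is a deliberately simplified sketch that implements only Rule 4. The pattern name $rule4 and the sample words are my own; this is not a replacement for the full pattern above:

```php
<?php
// Simplified, hypothetical version of Rule 4 only: hyphenate before a single
// consonant that stands between two vowels.
$rule4 = '~[aeiou] \K (?= [bcdfghlmnp-t] [aeiou] )~xi';

// [aeiou] is consumed, \K drops it from the reported match, and the lookahead
// tests what follows without consuming it, so the match is an empty position.
echo preg_replace($rule4, '-', 'mater pater'); // ma-ter pa-ter
```

Because the match is empty, the replacement string only has to contain the hyphen; nothing from the subject needs to be copied back.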

Properly validate UTF-8 characters for insertion in a table with utf8_general_ci collation

While the real problem is the collation of the field in the database, I can't change it. I need to drop the invalid characters instead.
Using iconv('utf-8', 'utf-8//IGNORE'); won't work, because the characters are valid UTF-8 characters, but they are invalid when inserted into a field with that collation.
$broken_example = '↺ﺆী▜Ꮛ︷ሚ◶ヲɸʩ𝑸ᚙ𐤄🃟ʳ⸘ᥦฆⵞ䷿ꘚꕛ𝆖𝇑𝆺𝅥𝅮↺ﺆী▜Ꮛ︷ሚ◶ヲɸʩ𝑸ᚙ𐤄🃟ʳ⸘ᥦฆⵞ䷿ꘚꕛ𝆖𝇑𝆺𝅥𝅮';
$utf8 = html_entity_decode($broken_example, ENT_QUOTES, 'UTF-8');
I've tried to use some workaround like preg_replace('/&#([0-9]{6,});/', '');, but with no success.
The error mysql is reporting is Incorrect string value: '\xF0\x90\xA4\x84\xCA\xB3...'
A regex for validating all utf-8 chars is:
function removeInvalidChars ($text) {
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
return preg_replace($regex, '$1', $text);
}
Removing the match for 4-byte chars will allow only the characters that can be stored in utf8_general.
function removeInvalidChars ($text) {
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2}) | ./x';
return preg_replace($regex, '$1', $text);
}
By the way, it's the character set that matters, not the collation. Also, you would be much better off just switching to utf8mb4 with utf8mb4_unicode_ci rather than putting in a hack like this.
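If switching the character set really is not an option, the same filtering can be sketched with the /u modifier, which lets the pattern talk about code points instead of raw bytes. The helper name stripFourByteChars is hypothetical, and returning an empty string for invalid input is just one possible choice:

```php
<?php
// Hypothetical helper: drop code points above U+FFFF, which need four bytes in
// UTF-8 and therefore do not fit MySQL's three-byte "utf8" character set.
function stripFourByteChars(string $text): string
{
    // With the /u modifier, \x{...} refers to code points rather than raw bytes.
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
    // preg_replace() returns null when the subject is not valid UTF-8;
    // this sketch treats such input as unusable and returns an empty string.
    return $result ?? '';
}

// U+10904 is the F0 90 A4 84 sequence from the MySQL error; U+02B3 is CA B3.
var_dump(stripFourByteChars("a\u{10904}\u{2B3}c") === "a\u{2B3}c"); // bool(true)
```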

Regular expression testing for UTF-8

Today I decided to test a small function that checks if a string is UTF-8.
I used recommendations of the Multilingual form encoding and created a small helper:
function is_utf8($string) {
if (strlen($string) == 0)
{
return true;
}
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
As a test, I used a string of 196 characters and checked my helper with it. But the browser doesn't display the page with the result; instead I get 404 Page not found.
$string = "1234567890123456789012345678..."; // 196 characters here
echo strlen($string); // result - 196
var_dump(is_utf8($string)); // Error - Page not found!
But if I use 195 characters, everything works fine.
I've tried all kinds of characters, even spaces. The function only works with strings of no more than 195 characters.
Why?
This works as well, with a simple regular expression and serialize
function check_utf8($str) {
return (bool)preg_match('//u', serialize($str));
}
I did a simple test.
I ran each function 1,000,000 times to see which is faster.
I would also like to thank @mario for suggesting atomic grouping.
$string = "ывлдоkfdsuLIU(*knj4k58u7MJHKkiyhsf9hfhlknhlkjldfivjo8iulkjlgs".
"2345678901234567890123456789012345678901234567890123456789012".
"ыдваолт ДЛЯОЧДльы0щ39478509г0*()*?Щчялртодылматцю4к 2ылвсголо".
"4567890123456789012345678901234567890123456789012345678901234".
"4567890123456789012345678901234567890123456789012345678901234".
"asdfsd ds.kjasldasjlKUJLjLKZjulizL kzjxLkUJOLIULKM.LKl;.mcvss";
$s = microtime(true);
for ($i=0; $i<1000000; $i++)
{
// algorithm
}
$e = microtime(true);
echo $e-$s;
And here are the results:
preg_match('//u', $string )
Result: 11.634791135788 sec
preg_match('%^(?>
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string)
Result: Fatal error: Maximum execution time of 30 seconds exceeded
preg_match('/^./su', $string)
Result: 12.27244400978 sec
mb_detect_encoding($string, array('UTF-8'), true)
Result: 15.370143890381 sec
And I also tried the method proposed here by @helloworld
preg_match('//u', serialize($string))
Result: 23.193331956863 sec
Thank you all for your advice!
You helped me to understand:
If the string is too long, PCRE crashes.
See http://www.java-samples.com/showtutorial.php?tutorialid=1526 for a solution.
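For reference, the preg_match('//u', ...) variant from the benchmark can be wrapped as below. This is only a sketch: the helper name isValidUtf8 is hypothetical, and I am assuming (without having confirmed it for the setup in the question) that the crash comes from the repeated group in the long pattern, which this version avoids entirely:

```php
<?php
// Hypothetical helper: let PCRE's own UTF-8 validation do the work.
function isValidUtf8(string $string): bool
{
    // An empty pattern with /u matches any valid UTF-8 subject; on invalid
    // input preg_match() fails and preg_last_error() reports PREG_BAD_UTF8_ERROR.
    return preg_match('//u', $string) === 1;
}

var_dump(isValidUtf8(str_repeat('1234567890', 100))); // bool(true)
var_dump(isValidUtf8("\xE9"));                        // bool(false)
```

Since the pattern contains no repetition, the length of the subject does not affect how deep the matcher has to go.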
