Regular expression testing for UTF-8

Regular expression testing for UTF-8 - php

Today I decided to test a small function that checks if a string is UTF-8.
I used recommendations of the Multilingual form encoding and created a small helper:
function is_utf8($string) {
if (strlen($string) == 0)
{
return true;
}
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
As a test, I used a string with 196 characters. And just checked my helper. But browser doesn't display page with result, instead - 404 Page not found.
$string = "1234567890123456789012345678..."; // 196 characters here
echo strlen($string); // result - 196
var_dump(is_utf8($string)); // Error - Page not found!
But if I use 195 characters, everything works fine.
I've tried any of the characters, even spaces. This function only works with a string of no more than 195 characters.
Why?

This works as well, with a simple regular expression and serialize
function check_utf8($str) {
return (bool)preg_match('//u', serialize($str));
}

Did a simple test.
I performed the function of 1000000 times. Looked who faster.
I would also like to thank #mario for the help of an atomic grouping.
$string = "ывлдоkfdsuLIU(*knj4k58u7MJHKkiyhsf9hfhlknhlkjldfivjo8iulkjlgs".
"2345678901234567890123456789012345678901234567890123456789012".
"ыдваолт ДЛЯОЧДльы0щ39478509г0*()*?Щчялртодылматцю4к 2ылвсголо".
"4567890123456789012345678901234567890123456789012345678901234".
"4567890123456789012345678901234567890123456789012345678901234".
"asdfsd ds.kjasldasjlKUJLjLKZjulizL kzjxLkUJOLIULKM.LKl;.mcvss";
$s = microtime(true);
for ($i=0; $i<1000000; $i++)
{
// algorithm
}
$e = microtime(true);
echo $e-$s;
And here result:
preg_match('//u', $string )
Result: 11.634791135788 sec
(preg_match('%^(?>
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string)
Result: Fatal error: Maximum execution time of 30 seconds exceeded
preg_match('/^./su', $string)
Result: 12.27244400978 sec
mb_detect_encoding($string, array('UTF-8'), true)
Result: 15.370143890381 sec
And I also tried method proposed here by #helloworld
preg_match('//u', serialize($string))
Result: 23.193331956863 sec
Thank you all for your advice!
You helped me to understand

If the String is too long -> PCRE crash
look http://www.java-samples.com/showtutorial.php?tutorialid=1526 for solving

Related

PHP UTF-8 handling

I am parsing a text file and am occassionally running into data such as:
CASTA¥EDA, JASON
Using a Mongo DB backend when I try saving information, I am getting errors like:
[MongoDB\Driver\Exception\UnexpectedValueException]
Got invalid UTF-8 value serializing 'Jason Casta�eda'
After Googling a few places, I located two functions that the author says would work:
function is_utf8( $str )
{
return preg_match( "/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x",
$str
);
}
public function force_utf8($str, $inputEnc='WINDOWS-1252')
{
if ( $this->is_utf8( $str ) ) // Nothing to do.
return $str;
if ( strtoupper( $inputEnc ) === 'ISO-8859-1' )
return utf8_encode( $str );
if ( function_exists( 'mb_convert_encoding' ) )
return mb_convert_encoding( $str, 'UTF-8', $inputEnc );
if ( function_exists( 'iconv' ) )
return iconv( $inputEnc, 'UTF-8', $str );
// You could also just return the original string.
trigger_error(
'Cannot convert string to UTF-8 in file '
. __FILE__ . ', line ' . __LINE__ . '!',
E_USER_ERROR
);
}
Using the two functions above I am trying to determine if a line of text has UTF-8 by calling is_utf8($text) and if it is not then I call the force_utf8($text) function. However I am getting the same error. Any pointers?

This question is pretty old, but for those who face same issue and get on this page like me:
mb_convert_encoding($value, 'UTF-8', 'UTF-8');
This code should replace all non UTF-8 characters by ? symbol and it will be safe for MongoDB insert/update operations.

preg_split based on a Sentence

I have the follwoing script to split up sentences. There are a few phrases that I would like to treat as the end of a sentence in addition to punctuation. This works fine if it is a single character, but not when it there is a space.
This is the code I have that works:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:\#*] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| HYPERLINK
| .org
| .gov
| .aspx
| .com
| Date
| Dear
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| U\.S\.A\.
| U\.S\.
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
| a\.m\. # or "a.m.",
| p\.m\. # or "p.m.",
| a€￠\.
| :\.
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
This is an example phrase I have tried to add:
"Total Gross Income"
I have tried formating it in these ways, but none of them work:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:\#*] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| HYPERLINK
| .org
| .gov
| .aspx
| .com
| Date
| Dear
| "Total Gross Income"
| Total[ X]Gross[ X]Income
| Total" "Gross" "Income
)
This for example if I have the following code:
$block_o_text = "You could receive the wrong amount. If you receive more benefits than you should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross Income Total ResourcesMedical ProgramsHousehold.";
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
echo $i . " - " . $sentance . "<BR>";
}
The results I get are:
77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income Total ResourcesMedical ProgramsHousehold
What I want to get is :
77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income
83 - Total ResourcesMedical ProgramsHousehold
What am I doing wrong?

Your problem is with the white space declaration that follows your lookbehind - it requires at least one white space in order to split, but if you remove it, then you end up capturing the preceeding letter and breaking the whole thing.
Thus As far as I can tell, you can't do this entirely with lookarounds. You'll still need to have some of the expression work with lookarounds (space preceded by punctuation, etc.), but for specific phrases, you can't.
You can also use the PREG_SPLIT_DELIM_CAPTURE flag to capture out what you're splitting. Something like this should get you started:
$re = '/((?<=[\.\?\!])\s+|Total\sGross\sIncome)/ix';
$block_o_text = "You could receive the wrong amount. If you receive more benefits than you should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross IncomeTotal ResourcesMedical ProgramsHousehold.";
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
if (!ctype_space($sentences[$i])) {
echo $i . " - " . $sentences[$i] . "<br>";
}
}
Output:
0 - You could receive the wrong amount.
2 - If you receive more benefits than you should, you must pay them back.
4 - When will we review your case?
6 - An eligibility review form will be sent before your benefits stop.
8 - Total Gross Income
9 - Total ResourcesMedical ProgramsHousehold.

Breaking a string on dates

I am using the following script I have altered to split a large string into sentances. However I am having issues getting it to also break on dates.
Original working code:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| U\.S\.A\.
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
| a\.m\. # or "a.m.",
| p\.m\. # or "p.m.",
| â€¢\.
| :\.
| •\.
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
I have added [0-9]/[0-9]/[0-9], but it doesn't seem to be having the desired effect. What am I missing? Here Is my updated code below:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| [0-9]/[0-9]/[0-9] # or on a date
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| U\.S\.A\.
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
| a\.m\. # or "a.m.",
| p\.m\. # or "p.m.",
| â€¢\.
| :\.
| •\.
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';

Dates do not have only single digits especially in the year. You need to account for that. You also need to escape the / since that is your regex delimiter.
[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{2,4}

Regular expression to find a string included between two bars and containing certain words

I always forget regex right after learning it. I want to extract the isbn number from a string.
String: English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB
Target to extract: 1285463234

You can use findall with this regex:
/(?<=ISBN: )\d+/
Regex Demo
Explanation:
(?<= Opens a positive lookahead group, asserts that this matches after:
ISBN: Matches the string "ISBN: "
) Closes the lookahead group.
\d+ Matches one or more digits.

You could try the below regex to extract the ISBN number,
ISBN:\s*\K\d+
DEMO
Your PHP code would be,
<?php
$mystring = 'English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB';
$regex = '~ISBN:\s*\K\d+~';
if (preg_match($regex, $mystring, $m)) {
$yourmatch = $m[0];
echo $yourmatch;
}
?> //=> 1285463234
Explanation:
ISBN: Matches the string ISBN:
\s* Matches zero or more spaces.
\K Discards previously matched characters.(ie, ISBN:)
\d+ Matches one or more digits.

If you have problems with regex, there are also other libraries which can do it with your straight forward example, for example the sscanf function in the PHP string library:
$subject = 'English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB';
$result = sscanf($subject, 'English | ISBN: %d | ', $isbn);
If there is a match ($result is 1), the $isbn variable will contain the ISBN number as integer:
int(1285463234)
ISBN numbers never start with 0, so this should not pose any problem. If you need is as string, use %s instead of %d:
$result = sscanf($subject, 'English | ISBN: %s | ', $isbn);
The result is then a string:
string(10) "1285463234"
Scanf patterns are way easier to deal with less complex string parsing than PCRE regular expressions (which have more power but are also more complex). Also the assignment to a specific variable is easier to do.

An exact representation (and also a patchwork of the other answers ^^) would be this:
(?<=\| ISBN: )(\S+)(?= \|)
Debuggex Demo

How to check the charset of string?

How do I check if the charset of a string is UTF8?

Don't reinvent the wheel. There is a builtin function for that task: mb_check_encoding().
mb_check_encoding($string, 'UTF-8');

Just a side note:
You cannot determine if a given string is encoded in UTF-8. You only can determine if a given string is definitively not encoded in UTF-8. Please see a related question here:
You cannot detect if a given string
(or byte sequence) is a UTF-8 encoded
text as for example each and every
series of UTF-8 octets is also a valid
(if nonsensical) series of Latin-1 (or
some other encoding) octets. However
not every series of valid Latin-1
octets are valid UTF-8 series.

function is_utf8($string) {
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
I have checked. This function is effective.

Better yet, use both of the above solutions.
function isUtf8($string) {
if (function_exists("mb_check_encoding") && is_callable("mb_check_encoding")) {
return mb_check_encoding($string, 'UTF8');
}
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}

mb_detect_encoding($string); will return the actual character set of $string. mb_check_encoding($string, 'UTF-8'); will return TRUE if character set of $string is UTF-8 else FALSE

if its send to u from server
echo $_SERVER['HTTP_ACCEPT_CHARSET'];

None of the above answers are correct. Yes, they may be working. If you take the answer with the preg_replace function, are you trying to kill your server if you process a lot of stirng ? Use this pure PHP function with no regex, work 100% of the time and it's way faster.
if(function_exists('grk_Is_UTF8') === FALSE){
function grk_Is_UTF8($String=''){
# On va calculer la longeur de la chaîne
$Len = strlen($String);
# On va boucler sur chaque caractère
for($i = 0; $i < $Len; $i++){
# On va aller chercher la valeur ASCII du caractère
$Ord = ord($String[$i]);
if($Ord > 128){
if($Ord > 247){
return FALSE;
} elseif($Ord > 239){
$Bytes = 4;
} elseif($Ord > 223){
$Bytes = 3;
} elseif($Ord > 191){
$Bytes = 2;
} else {
return FALSE;
}
#
if(($i + $Bytes) > $Len){
return FALSE;
}
# On va boucler sur chaque bytes / caractères
while($Bytes > 1){
# +1
$i++;
# On va aller chercher la valeur ASCII du caractère / byte
$Ord = ord($String[$i]);
if($Ord < 128 OR $Ord > 191){
return FALSE;
}
# Parfait
$Bytes--;
}
}
}
# Vrai
return TRUE;
}
}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Regular expression testing for UTF-8 - php

This works as well, with a simple regular expression and serialize function check_utf8($str) { return (bool)preg_match('//u', serialize($str)); }

If the String is too long -> PCRE crash look http://www.java-samples.com/showtutorial.php?tutorialid=1526 for solving

Related

PHP UTF-8 handling

preg_split based on a Sentence

Breaking a string on dates

Regular expression to find a string included between two bars and containing certain words

How to check the charset of string?

Categories

Resources