PHP UTF-8 handling


I am parsing a text file and am occasionally running into data such as:
CASTA¥EDA, JASON
I am using a MongoDB backend, and when I try to save the information I get errors like:
[MongoDB\Driver\Exception\UnexpectedValueException]
Got invalid UTF-8 value serializing 'Jason Casta�eda'
After Googling a few places, I located two functions that the author says would work:
function is_utf8( $str )
{
return preg_match( "/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x",
$str
);
}
public function force_utf8($str, $inputEnc='WINDOWS-1252')
{
if ( $this->is_utf8( $str ) ) // Nothing to do.
return $str;
if ( strtoupper( $inputEnc ) === 'ISO-8859-1' )
return utf8_encode( $str );
if ( function_exists( 'mb_convert_encoding' ) )
return mb_convert_encoding( $str, 'UTF-8', $inputEnc );
if ( function_exists( 'iconv' ) )
return iconv( $inputEnc, 'UTF-8', $str );
// You could also just return the original string.
trigger_error(
'Cannot convert string to UTF-8 in file '
. __FILE__ . ', line ' . __LINE__ . '!',
E_USER_ERROR
);
}
Using the two functions above, I check whether a line of text is valid UTF-8 by calling is_utf8($text), and if it is not, I call force_utf8($text). However, I am still getting the same error. Any pointers?

This question is pretty old, but for those who face the same issue and land on this page like me:
mb_convert_encoding($value, 'UTF-8', 'UTF-8');
This should replace every byte sequence that is not valid UTF-8 with a ? symbol, which makes the value safe for MongoDB insert/update operations.
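Note that converting from UTF-8 to UTF-8 is lossy: each invalid byte becomes ?, so the Ñ in the name is not recovered. If the encoding of the source file is known, converting from that encoding keeps the character. The ¥ in the sample hints at byte 0xA5, which is Ñ in DOS code pages such as CP437, but that is only a guess about this particular file. A minimal sketch, assuming mbstring is loaded and the input is Windows-1252 (the helper name to_valid_utf8 and its default encoding are illustrative, not from the original posts):

```php
<?php
// Hypothetical helper: returns a string that is valid UTF-8.
// $sourceEncoding is an assumption about the input file; set it to the real encoding.
function to_valid_utf8(string $value, string $sourceEncoding = 'Windows-1252'): string
{
    // Already valid UTF-8 (this includes plain ASCII): leave untouched.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Reinterpret the bytes as the assumed source encoding.
    $converted = mb_convert_encoding($value, 'UTF-8', $sourceEncoding);
    // Last resort, as in the answer above: replace anything still invalid with "?".
    return mb_convert_encoding($converted, 'UTF-8', 'UTF-8');
}

var_dump(to_valid_utf8("Casta\xF1eda") === "Casta\u{00F1}eda"); // bool(true)
```

Checking with mb_check_encoding first means strings that are already valid pass through unchanged, so the function can be applied to every value before it is handed to the driver.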

Related

PHP convert special characters to HTML entity

I have a string ex:
$a = 'abc🔹abc';
The 'small blue diamond' is: bin2hex('🔹') => f09f94b9
So, I would like to convert the $a string into a string which represents the small blue diamond with the HTML escape: &#128313;
Which function should I call to convert all Unicode characters into their HTML-escape representation?
More details on this case
In WordPress, when I want to insert the $a variable into a table, $wpdb runs its checks.
When WordPress prepares the $data to be inserted or updated, it runs the fields through the $wpdb->strip_invalid_text method and then checks whether anything invalid was found in the $data. It finds the text in the $a variable invalid with the following regexp:
$regex = '/
(
(?: [\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC2-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| \xE0[\xA0-\xBF][\x80-\xBF] # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xE1-\xEC][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| [\xEE-\xEF][\x80-\xBF]{2}';
if ( 'utf8mb4' === $charset ) {
$regex .= '
| \xF0[\x90-\xBF][\x80-\xBF]{2} # four-byte sequences 11110xxx 10xxxxxx * 3
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
';
}
$regex .= '){1,40} # ...one or more times
)
| . # anything else
/x';
$value['value'] = preg_replace( $regex, '$1', $value['value'] );
if ( false !== $length && mb_strlen( $value['value'], 'UTF-8' ) > $length ) {
$value['value'] = mb_substr( $value['value'], 0, $length, 'UTF-8' );
}
When the 'small blue diamond' is represented with f09f94b9, this regexp marks the data invalid. When it is represented with &#128313;, the data is accepted. So what I need is to convert such Unicode characters into a representation that is accepted by WordPress.

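Here is what I came up with to convert all of the characters. You can modify it further to convert only the characters in the range you need.
$s = 'abc🔹def';
// Split into individual UTF-8 characters (-1 means no limit; null is deprecated here since PHP 8.1).
$a = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
foreach($a as $c){
// Convert the character to UCS-4 little-endian and read its code point as a 32-bit integer.
echo '&#' . unpack('V', iconv('UTF-8', 'UCS-4LE', $c))[1] . ';';
}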
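To escape only the characters that the three-byte check rejects (code points above U+FFFF) instead of every character, mbstring's mb_encode_numericentity accepts a code point range. A sketch, assuming the mbstring extension is loaded; the function name escape_astral is made up for illustration:

```php
<?php
// Hypothetical helper: turn code points U+10000..U+10FFFF into decimal HTML entities
// and leave everything in the Basic Multilingual Plane as it is.
function escape_astral(string $s): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x10000, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($s, $map, 'UTF-8');
}

echo escape_astral("abc\u{1F539}def"), "\n"; // abc&#128313;def
```

Widening the first entry of the map to 0x80 would escape every non-ASCII character instead.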

Replace unicode with actual data in php

I want to replace unicode \u00a0 with actual data in php.
Example : "<b>Sort By\u00a0-\u00a0</b>A to Z\u00a0|\u00a0Recently Added\u00a0|\u00a0Most Downloaded"

\u00a0 is the escape sequence of a NO-BREAK SPACE
To decode any such escape sequence in PHP you can use this function:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = mb_internal_encoding();
// create_function() was removed in PHP 8.0, so a closure is used instead.
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function ($match) use ($encoding) {
return mb_convert_encoding(pack("H*", $match[1]), $encoding, "UTF-16BE");
}, $str);
}
echo unicodeString($str);
//<b>Sort By - </b>A to Z | Recently Added | Most Downloaded
DEMO:
https://ideone.com/9QtzMO
If you just need to replace a single escape sequence, use:
$str = str_replace("\u00a0", " ", $str);
echo $str;
<b>Sort By - </b>A to Z | Recently Added | Most Downloaded
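The four-hex-digit pattern above decodes one \uXXXX unit at a time, so a character outside the BMP, which JSON-style escapes write as a surrogate pair (for example \ud83d\udd39), would not come out correctly. One possible variant, assuming the escapes follow JSON rules, hands each run of escapes to json_decode; the function name decodeUnicodeEscapes is illustrative:

```php
<?php
// Hypothetical helper: decode JSON-style \uXXXX escapes, including surrogate pairs.
function decodeUnicodeEscapes(string $str): string
{
    // Match a whole run of escapes so both halves of a surrogate pair stay together.
    return preg_replace_callback('/(?:\\\\u[0-9a-fA-F]{4})+/', function ($m) {
        $decoded = json_decode('"' . $m[0] . '"');
        // json_decode returns null for malformed input (e.g. a lone surrogate): keep the text as-is.
        return $decoded === null ? $m[0] : $decoded;
    }, $str);
}

$str = '<b>Sort By\u00a0-\u00a0</b>A to Z';
var_dump(decodeUnicodeEscapes($str) === "<b>Sort By\u{00A0}-\u{00A0}</b>A to Z"); // bool(true)
```

If the whole input is a JSON document rather than a fragment, calling json_decode on it directly is simpler still.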

RegExp Language Routing

Good Day,
I am in need of a Route-RegExp based on client-language for a website.
It should be like this:
Relative URL / Route:
/(No-Language) -> /?lng=(someDefaultLanguage)
/(No-Language)/ -> /?lng=(someDefaultLanguage)
/lngCode/page -> /page/?lng=lngCode
/lngCode/page/ -> /page/?lng=lngCode
/lngCode/pageL1/pageL2 -> /pageL1/pageL2/?lng=lngCode
/language/page?param=Value -> /page/?lng=lngCode&param=Value
(Notice the trailing slashes on some lines)
Tree structure is, ...well infinite :)
There are cases with single and multiple URL-Params.
I'm absolutely no regex wizard; I managed this result in, uhm, ...hours:
/^\/([a-z]{2})(?:(.*[^\?])|^$)((?:[\/\?]).*|^$)/
Please don't ask me what I was trying to route there. I am sooooo new to regex.
Thank you in advance
--
Edit for clarification (I hope):
Basically it is this concept (it is internal routing, no redirection, in case I didn't mention it):
The language parameter (directory-style) must be grabbed from the first URL segment and attached as a real parameter named "lng". The directory-style parameter should disappear.
If there are already other parameters, they need to be attached as well (the ?/& case).
If there is no language given (= default language), there is no directory-style parameter in the URL. It would be nice if a ?lng=en parameter could still be attached.
Examples:
localhost/blogpage/coolentry (default language)
localhost/de/blogpage/coolentry
localhost/es/blogpage/coolentry
localhost/blogpage/ -> localhost/blogpage/?lng=en
localhost/de/blogpage/ -> localhost/blogpage/?lng=de
localhost/blogpage/coolentry/ -> localhost/blogpage/coolentry/?lng=en
localhost/de/blogpage/coolentry/ -> localhost/blogpage/coolentry/?lng=de
localhost/de/blogpage/coolentry/?entryPage=1 -> localhost/blogpage/coolentry/?lng=de&entryPage=1
It gets routed always with a real language parameter.
I have also edited the first post; there was a confusing typo in it.

Sorry for the delay, @hcm.
Well, I gotta tell ya. I spent 10 minutes writing the original regex.
Tested in Perl, everything worked great.
Then I go to dump it into an online PHP tester and I get this "Undefined offset" error/warning.
Capture groups 2, 3 and 4 are optional, so I do a (?: ( capture ) )? but PHP is such a mess
you can't even test for an undefined group.
Update
@hcm - OK, figured it out. Those online testers don't translate CRLFs to LFs.
In multiline mode, $ is the boundary before a newline or the end of the string, and
^ is after a newline or the beginning of the string, which is no problem.
So, $ won't match before a CR, only before an LF. A workaround is probably using the
\R any-linebreak construct, but that is not a boundary, it's an actual character.
What I did to cure this was to use (?: $ | (?= \r) ) outside of assertions, and
(?: $ | \r ) inside assertions. This cures all problems.
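The CRLF behaviour described here can be reproduced with a tiny pattern. The sketch below assumes PCRE's default newline setting of LF, which is what PHP builds normally use; the (*ANYCRLF) variant is an alternative that is not part of the original answer:

```php
<?php
$text = "a\r\nb\r\nc"; // CRLF line endings, as pasted into some online testers

// In multiline mode "$" only matches before "\n" (or at the very end),
// so the "\r" left on each line blocks the match on all but the last line.
$plain = preg_match_all('/^\w+$/m', $text);

// The workaround described above: also accept a position just before "\r".
$patched = preg_match_all('/^\w+(?:$|(?=\r))/m', $text);

// Alternative: tell PCRE that CR, LF and CRLF all count as newlines.
$anycrlf = preg_match_all('/(*ANYCRLF)^\w+$/m', $text);

printf("%d %d %d\n", $plain, $patched, $anycrlf); // 1 3 3
```

Normalising the input first with str_replace("\r\n", "\n", $text) would avoid the issue without touching the pattern.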
After reading your message, I've changed the regex so that every part is optional, but
still positional.
The 4 optional parts are as follows.
1. Before the lang code.
2. The lang code.
3. After the lang code.
4. The parameters.
No part will run over the other.
All leading /'s are taken out of each part (not part of the match),
while internal slashes are left in place.
All that's left is to construct the new url as you wish.
Let me know how this turns out or if you need a little tweak.
PHP code:
// Go to this website:
// http://writecodeonline.com/php/
// Cut & paste this into the code box, hit run.
$text = '
invalid
/
/de
/de/coolentry
localhost/
localhost/blogpage
localhost/blogpage/
localhost/blogpage/de/
/localhost/blogpage/coolentry/famous invalid
/root/blog/page/cool/entry/?entryPage=1&var1=A&var2=B
localhost/blogpage/coolentry/
localhost/blogpage/de/coolentry/
localhost/blogpage/de/coolentry/
localhost/blogpage/de/coolentry/?entryPage=1
localhost/blogpage/coolentry/?entryPage=2
';
$str = preg_replace_callback('~^(?![^\S\r\n]*(?:\r|$))(?|(?!/[a-z]{2}[^\S\r\n]*(?:/|(?:\r|$)))/?((?:(?!/[a-z]{2}[^\S\r\n]*(?:/|(?:\r|$))|\?|/[^\S\r\n]*(?:\r|$))\S)*)|())(?|/([a-z]{2})(?=/|[^\S\r\n]*(?:\r|$))|())(?|/((?:(?!/[^\S\r\n]*(?:\r|$))[^?\s])+)|())(?|/\?((?:(?!/[^\S\r\n]*(?:\r|$))\S)*)|())/?[^\S\r\n]*(?:$|(?=\r))~m',
function( $matches )
{
///////////////// URL //////////////////
$url = '';
// Before lang code -- Group 1
if ( $matches[1] != '' ) {
$url .= '/' . $matches[1];
}
// After lang code -- Group 3
if ( $matches[3] != '' ) {
$url .= '/' . $matches[3];
}
///////////////// PARAMS //////////////////
$params = '/?lng=';
// Lang code -- Group 2
if ( $matches[2] != '' ) {
$params .= $matches[2];
}
else {
$params .= 'en'; // No lang given, set default
}
// Other params
if ( $matches[4] != '') {
$params .= '&' . $matches[4];
}
///////////////// Check there is a Url //////////////////
if ( $url == '' ) { // No url given, set a default
$url = '/language'; // 'language', 'localhost', etc...
}
///////////////// Put the pieces back together //////////////////
$NewURL = $url . $params;
return $NewURL;
},
$text);
print $str;
output:
invalid
/language/?lng=en
/language/?lng=de
/coolentry/?lng=de
/localhost/?lng=en
/localhost/blogpage/?lng=en
/localhost/blogpage/?lng=en
/localhost/blogpage/?lng=de
/localhost/blogpage/coolentry/famous invalid
/root/blog/page/cool/entry/?lng=en&entryPage=1&var1=A&var2=B
/localhost/blogpage/coolentry/?lng=en
/localhost/blogpage/coolentry/?lng=de
/localhost/blogpage/coolentry/?lng=de
/localhost/blogpage/coolentry/?lng=de&entryPage=1
/localhost/blogpage/coolentry/?lng=en&entryPage=2
Regex
# '~^(?![^\S\r\n]*(?:\r|$))(?|(?!/[a-z]{2}[^\S\r\n]*(?:/|(?:\r|$)))/?((?:(?!/[a-z]{2}[^\S\r\n]*(?:/|(?:\r|$))|\?|/[^\S\r\n]*(?:\r|$))\S)*)|())(?|/([a-z]{2})(?=/|[^\S\r\n]*(?:\r|$))|())(?|/((?:(?!/[^\S\r\n]*(?:\r|$))[^?\s])+)|())(?|/\?((?:(?!/[^\S\r\n]*(?:\r|$))\S)*)|())/?[^\S\r\n]*(?:$|(?=\r))~m'
^ # BOL
(?! # Not a blank line, remove to generate a total default url
[^\S\r\n]*
(?: \r | $ )
)
(?| # BEFORE lang code
(?!
/ [a-z]{2} [^\S\r\n]* # not lang code
(?:
/
| (?: \r | $ )
)
)
/? # strip leading '/'
( # (1 start)
(?:
(?!
/ [a-z]{2} [^\S\r\n]* # not lang code
(?:
/
| (?: \r | $ )
)
|
\? # not parms
|
/ [^\S\r\n]* # not final slash
(?: \r | $ )
)
\S
)*
) # (1 end)
|
( ) # (1)
)
(?| # LANG CODE
/ # strip leading '/'
( [a-z]{2} ) # (2)
(?=
/
| [^\S\r\n]*
(?: \r | $ )
)
|
( ) # (2)
)
(?| # AFTER lang code
/ # strip leading '/'
( # (3 start)
(?:
(?! # not final slash
/ [^\S\r\n]*
(?: \r | $ )
)
[^?\s] # not parms
)+
) # (3 end)
|
( ) # (3)
)
(?| # PARAMETERS
/ \? # strip leading '/?'
( # (4 start)
(?:
(?! # not final slash
/ [^\S\r\n]*
(?: \r | $ )
)
\S
)*
) # (4 end)
|
( ) # (4)
)
/?
[^\S\r\n]* # EOL
(?:
$
| (?= \r )
)
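For comparison, when each URL arrives as its own string (the usual case inside a router) the same split can be done without one large pattern. This is only a sketch under assumptions that are not in the original posts: the function name routeLanguage, the $languages whitelist and the 'en' default are all hypothetical, and the language directory is expected as the first path segment, as in the question's examples.

```php
<?php
// Hypothetical router sketch: move a leading /xx language directory into ?lng=xx.
function routeLanguage(string $uri, array $languages = ['en', 'de', 'es'], string $default = 'en'): string
{
    $path  = parse_url($uri, PHP_URL_PATH);
    $query = parse_url($uri, PHP_URL_QUERY);
    if (!is_string($path)) {
        $path = '/'; // no path component, or a URL that parse_url() rejects
    }

    $lang = $default;
    // A leading two-letter segment counts as a language only if it is whitelisted.
    if (preg_match('~^/([a-z]{2})(?=/|$)(.*)$~', $path, $m) && in_array($m[1], $languages, true)) {
        $lang = $m[1];
        $path = $m[2]; // the rest of the path, without the language directory
    }

    // Normalise to exactly one trailing slash, then append the parameters.
    $routed = rtrim($path, '/') . '/?lng=' . $lang;
    if (is_string($query) && $query !== '') {
        $routed .= '&' . $query;
    }
    return $routed;
}

echo routeLanguage('/de/blogpage/coolentry/?entryPage=1'), "\n"; // /blogpage/coolentry/?lng=de&entryPage=1
echo routeLanguage('/blogpage/coolentry'), "\n";                  // /blogpage/coolentry/?lng=en
```

The whitelist keeps a two-letter page name such as /qa from being mistaken for a language code, which a bare [a-z]{2} test cannot do.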

Properly validate UTF-8 characters for insertion in a table with utf8_general_ci collation

While the real problem is the collation of the field in the database, I can't change it. I need to drop invalid characters instead.
Using @iconv('utf-8', 'utf-8//IGNORE'); won't work, because the characters are valid UTF-8 characters, but invalid when inserted into a field with that collation.
$broken_example = '↺ﺆী▜Ꮛ︷ሚ◶ｦɸʩ𝑸ᚙ𐤄🃟ʳ⸘ᥦฆⵞ䷿ꘚꕛ𝆖𝇑𝆺𝅥𝅮↺ﺆী▜Ꮛ︷ሚ◶ｦɸʩ𝑸ᚙ𐤄🃟ʳ⸘ᥦฆⵞ䷿ꘚꕛ𝆖𝇑𝆺𝅥𝅮';
$utf8 = html_entity_decode($broken_example, ENT_QUOTES, 'UTF-8');
I've tried some workarounds like preg_replace('/&#([0-9]{6,});/', '');, but with no success.
The error MySQL reports is: Incorrect string value: '\xF0\x90\xA4\x84\xCA\xB3...'

A regex that matches UTF-8 characters of every length (1 to 4 bytes) is:
function removeInvalidChars ($text) {
// Keep each well-formed sequence (captured as $1); any other single byte matches "." and is dropped.
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
return preg_replace($regex, '$1', $text);
}
Removing the branch for 4-byte characters keeps only the characters that can be stored in a utf8 (utf8mb3) column, such as one using utf8_general_ci:
function removeInvalidChars ($text) {
// Same idea, but 4-byte sequences now fall through to "." byte by byte and are removed.
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2}) | ./x';
return preg_replace($regex, '$1', $text);
}
By the way, it's the character set that matters, not the collation. Also, you would be much better off just switching to utf8mb4 with utf8mb4_unicode_ci rather than putting in a hack like this.
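If the input is already known to be valid UTF-8, the same filtering can be written against code points instead of raw bytes. A sketch, not taken from the answer; the function name strip_four_byte_chars is hypothetical:

```php
<?php
// Hypothetical helper: drop code points above U+FFFF, which need four bytes in UTF-8
// and therefore do not fit MySQL's three-byte utf8 (utf8mb3) character set.
function strip_four_byte_chars(string $text): string
{
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
    // With the /u flag preg_replace() returns null for invalid UTF-8 input;
    // in that case the text is returned unchanged and should be validated separately.
    return $result ?? $text;
}

var_dump(strip_four_byte_chars("a\u{1F539}b\u{00E9}") === "ab\u{00E9}"); // bool(true)
```

Unlike the byte-level pattern, this version does not repair malformed sequences; it only removes well-formed supplementary-plane characters.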

Regular expression testing for UTF-8

Today I decided to test a small function that checks if a string is UTF-8.
I used recommendations of the Multilingual form encoding and created a small helper:
function is_utf8($string) {
if (strlen($string) == 0)
{
return true;
}
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
As a test, I used a string of 196 characters and simply ran it through my helper. But the browser doesn't display a page with the result; instead I get 404 Page not found.
$string = "1234567890123456789012345678..."; // 196 characters here
echo strlen($string); // result - 196
var_dump(is_utf8($string)); // Error - Page not found!
But if I use 195 characters, everything works fine.
I've tried all kinds of characters, even spaces. The function only works with strings of no more than 195 characters.
Why?

This works as well, with a simple regular expression and serialize
function check_utf8($str) {
return (bool)preg_match('//u', serialize($str));
}
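A non-regex alternative that avoids PCRE entirely is mbstring's mb_check_encoding. A sketch, assuming mbstring is available; the wrapper name is_valid_utf8 is illustrative. Unlike the hand-written pattern in the question, it also accepts control characters such as \x00, since those are valid UTF-8:

```php
<?php
// Hypothetical wrapper around mbstring's validity check.
function is_valid_utf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

var_dump(is_valid_utf8(str_repeat('1234567890', 100))); // bool(true)
var_dump(is_valid_utf8("Casta\xF1eda"));                // bool(false)
```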

I did a simple test.
I ran each function 1,000,000 times to see which is faster.
I would also like to thank @mario for the help with atomic grouping.
$string = "ывлдоkfdsuLIU(*knj4k58u7MJHKkiyhsf9hfhlknhlkjldfivjo8iulkjlgs".
"2345678901234567890123456789012345678901234567890123456789012".
"ыдваолт ДЛЯОЧДльы0щ39478509г0*()*?Щчялртодылматцю4к 2ылвсголо".
"4567890123456789012345678901234567890123456789012345678901234".
"4567890123456789012345678901234567890123456789012345678901234".
"asdfsd ds.kjasldasjlKUJLjLKZjulizL kzjxLkUJOLIULKM.LKl;.mcvss";
$s = microtime(true);
for ($i=0; $i<1000000; $i++)
{
// algorithm
}
$e = microtime(true);
echo $e-$s;
And here are the results:
preg_match('//u', $string )
Result: 11.634791135788 sec
(preg_match('%^(?>
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string)
Result: Fatal error: Maximum execution time of 30 seconds exceeded
preg_match('/^./su', $string)
Result: 12.27244400978 sec
mb_detect_encoding($string, array('UTF-8'), true)
Result: 15.370143890381 sec
And I also tried the method proposed here by @helloworld
preg_match('//u', serialize($string))
Result: 23.193331956863 sec
Thank you all for your advice!
You helped me to understand
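
The timing loop used for these measurements can be wrapped in a small reusable harness. This is a sketch, not the code that produced the numbers above: the function name bench, the iteration count and the test input are assumptions, and it needs PHP 7.4+ for the arrow functions.

```php
<?php
// Hypothetical harness: time $iterations calls of $check on $input, in seconds.
function bench(callable $check, string $input, int $iterations = 100000): float
{
    $start = hrtime(true); // monotonic clock in nanoseconds (PHP 7.3+)
    for ($i = 0; $i < $iterations; $i++) {
        $check($input);
    }
    return (hrtime(true) - $start) / 1e9;
}

$candidates = [
    "preg_match('//u')" => fn(string $s) => preg_match('//u', $s) === 1,
    'mb_check_encoding' => fn(string $s) => mb_check_encoding($s, 'UTF-8'),
];
foreach ($candidates as $name => $check) {
    printf("%-20s %.4f sec\n", $name, bench($check, str_repeat('abc', 100)));
}
```

Absolute timings depend on the PHP version and on whether the PCRE JIT is enabled, so only the relative order on one machine is meaningful.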

If the string is too long, PCRE crashes.
See http://www.java-samples.com/showtutorial.php?tutorialid=1526 for a solution.
