I have a string, for example:
$a = 'abc🔹abc';
The 'small blue diamond' is: bin2hex('🔹') => f09f94b9
So I would like to convert the $a string into a string that represents the small blue diamond with its HTML escape: &#128313;
Which function should I call to convert all Unicode characters into their HTML-escape representation?
More details on this case
In WordPress, when I want to insert the $a variable into a table, $wpdb runs its checks (see the WPDB source code).
When WordPress prepares the $data to be inserted or updated, it runs the fields through the $wpdb->strip_invalid_text method and then checks whether anything invalid was found in the $data. It marks the text in the $a variable as invalid with the following regexp:
$regex = '/
(
(?: [\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC2-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| \xE0[\xA0-\xBF][\x80-\xBF] # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xE1-\xEC][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| [\xEE-\xEF][\x80-\xBF]{2}';
if ( 'utf8mb4' === $charset ) {
$regex .= '
| \xF0[\x90-\xBF][\x80-\xBF]{2} # four-byte sequences 11110xxx 10xxxxxx * 3
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
';
}
$regex .= '){1,40} # ...one or more times
)
| . # anything else
/x';
$value['value'] = preg_replace( $regex, '$1', $value['value'] );
if ( false !== $length && mb_strlen( $value['value'], 'UTF-8' ) > $length ) {
$value['value'] = mb_substr( $value['value'], 0, $length, 'UTF-8' );
}
When the 'small blue diamond' is represented with the bytes f09f94b9, this regexp marks the data invalid. When it is represented with &#128313;, it is accepted. So what I need is to convert such Unicode characters into a representation that WordPress accepts.
Here is what I came up with to convert all of the characters. You can modify it further to convert only the characters in the range you need.
$s = 'abc🔹def';
$a = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY); // -1 = no limit; passing null is deprecated since PHP 8.1
foreach($a as $c){
echo '&#' . unpack('V', iconv('UTF-8', 'UCS-4LE', $c))[1] . ';';
}
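If only the characters outside a given range need escaping, mbstring's mb_encode_numericentity() can do the conversion in one call. The following is a minimal sketch, assuming the mbstring extension is loaded; the helper name toNumericEntities and the chosen ranges are illustrative and not part of WordPress.

```php
<?php
// Hypothetical helper: escape every code point in [$from, U+10FFFF]
// as a decimal HTML entity and leave everything below $from untouched.
function toNumericEntities(string $s, int $from = 0x80): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [$from, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($s, $map, 'UTF-8');
}

$a = "abc\u{1F539}abc"; // the string from the question, written with an escape

echo toNumericEntities($a), "\n";          // escapes all non-ASCII characters
echo toNumericEntities($a, 0x10000), "\n"; // escapes only 4-byte characters
```

Passing 0x10000 as the second argument leaves 2- and 3-byte characters alone, which may be enough when only the four-byte sequences are rejected.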
I have data in which each line has two parts separated by |:
$data = 'hello | Hello there
price | Lets talk about our support.
how are you ?| Im fine ';
And my static word is $word= 'price'
My code
$msg = array_filter(array_map('trim', explode("\n", $data)));
foreach ($msg as $singleLine) {
$partition = preg_split("/[|]+/", trim($singleLine), 2); // limit must be an integer, not the string '2'
$part1 = strtolower($partition[0]);
}
How can I match $word against the first part of each line and get the second part? For $word = 'price' I need the result to be: Lets talk about our support.
You may use a single regex approach:
'~^\h*price\h*\|\h*\K.*\S~m'
See the regex demo
Details
^ - start of a line (due to m modifier)
\h* - 0+ horizontal whitespace
price - your static word
\h*\|\h* - | enclosed with 0+ horizontal whitespaces
\K - match reset operator that discards the text matched so far
.*\S - 0+ chars other than line break chars, as many as possible, up to the last non-whitespace char on the line (including it).
PHP code:
if (preg_match('~^\h*' . preg_quote($word, '~') . '\h*\|\h*\K.*\S~m', $data, $match)) {
echo $match[0];
}
Wiktor's answer seems good, but you might want to turn your data into a key -> value array.
If that is the case, you may do this:
$avp = [];
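// .+? instead of [^$]+?: inside a character class $ is a literal dollar sign, so values containing "$" would not match.
if (preg_match_all('/^ \h* (?<key>[^|]+?) \h* \| \h* (?<value>.+?) \h* $/mx', $data, $matches, PREG_SET_ORDER)) {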
foreach ($matches as [, $key, $value]) {
$avp[$key] = $value;
}
}
$word = 'price';
echo $avp[$word]; // Lets talk about our support.
Demo: https://3v4l.org/uMBAg
I have the following expression in preg_split:
$content = preg_split('/([\p{P}\p{S}])|\s/', $file, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
Now if the content of the input file was int somenumber;
It would split into:
int
somenumber
;
If it was int some_number; what I'd get is:
int
some
_
number
;
However, what I'd like is:
int
some_number
;
Is there a way to edit this expression to group together alphanumeric characters + the "_" ?
The _ is matched by \p{P} (punctuation property class). Restrict it with the (?!_) negative lookahead:
$content = preg_split('/((?!_)[\p{P}\p{S}])|\s/', 'int some_number;', -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
See the PHP demo and a regex demo.
With this (?!_)[\p{P}\p{S}], all punctuation and symbol characters with the exception of _ can be matched.
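To check the behaviour, the pattern can be wrapped in a small function and run against the sample input. This is only a sketch; the function name tokenize is made up for the example.

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function tokenize(string $source): array
{
    return preg_split(
        '/((?!_)[\p{P}\p{S}])|\s/',   // punctuation/symbols except "_" are captured, whitespace is dropped
        $source,
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
    );
}

print_r(tokenize('int some_number;')); // three tokens: int, some_number, ;
```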
I am parsing a text file and am occasionally running into data such as:
CASTA¥EDA, JASON
I use a MongoDB backend, and when I try to save the information I get errors like:
[MongoDB\Driver\Exception\UnexpectedValueException]
Got invalid UTF-8 value serializing 'Jason Casta�eda'
After Googling a few places, I located two functions that the author says would work:
function is_utf8( $str )
{
return preg_match( "/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x",
$str
);
}
public function force_utf8($str, $inputEnc='WINDOWS-1252')
{
if ( $this->is_utf8( $str ) ) // Nothing to do.
return $str;
if ( strtoupper( $inputEnc ) === 'ISO-8859-1' )
return utf8_encode( $str );
if ( function_exists( 'mb_convert_encoding' ) )
return mb_convert_encoding( $str, 'UTF-8', $inputEnc );
if ( function_exists( 'iconv' ) )
return iconv( $inputEnc, 'UTF-8', $str );
// You could also just return the original string.
trigger_error(
'Cannot convert string to UTF-8 in file '
. __FILE__ . ', line ' . __LINE__ . '!',
E_USER_ERROR
);
}
Using the two functions above, I check whether a line of text is valid UTF-8 by calling is_utf8($text), and if it is not I call force_utf8($text). However, I am still getting the same error. Any pointers?
This question is pretty old, but for those who face the same issue and land on this page like I did:
mb_convert_encoding($value, 'UTF-8', 'UTF-8');
This code should replace every invalid UTF-8 byte sequence with a ? symbol, which makes the value safe for MongoDB insert/update operations.
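A slightly fuller sketch of the same idea, which only touches strings that actually fail validation. The helper name sanitizeUtf8 is hypothetical, and the sample assumes the offending byte in the file is 0xA5, which the question does not confirm.

```php
<?php
// Hypothetical helper: return a string that is guaranteed to be valid UTF-8.
function sanitizeUtf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid, leave it untouched
    }
    // Invalid byte sequences are replaced with the mbstring substitute
    // character (see mb_substitute_character(); "?" unless configured otherwise).
    return mb_convert_encoding($value, 'UTF-8', 'UTF-8');
}

$raw   = "CASTA\xA5EDA, JASON"; // assumption: a lone 0xA5 byte, which is not valid UTF-8
$clean = sanitizeUtf8($raw);

var_dump(mb_check_encoding($raw, 'UTF-8'));   // validity before
var_dump(mb_check_encoding($clean, 'UTF-8')); // validity after
```

This loses the original character. If the source encoding of the file is known, converting from that encoding with mb_convert_encoding() or iconv() preserves it instead.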
I want to replace the Unicode escape \u00a0 with the actual character in PHP.
Example : "<b>Sort By\u00a0-\u00a0</b>A to Z\u00a0|\u00a0Recently Added\u00a0|\u00a0Most Downloaded"
\u00a0 is the escape sequence of a NO-BREAK SPACE
To decode any escape sequence in PHP you can use this function:
function unicodeString($str, $encoding = null) {
    // The mbstring.internal_encoding ini setting is deprecated; ask mbstring directly.
    if (is_null($encoding)) $encoding = mb_internal_encoding();
    // create_function() was removed in PHP 8.0; a closure does the same job.
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function ($match) use ($encoding) {
        return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
    }, $str);
}
echo unicodeString($str);
//<b>Sort By - </b>A to Z | Recently Added | Most Downloaded
DEMO:
https://ideone.com/9QtzMO
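The function above decodes one \uXXXX unit at a time, so a character written as a surrogate pair (two consecutive escapes) would not survive. A possible variant, sketched here under the assumption that the target encoding is UTF-8, decodes each run of consecutive escapes as a single UTF-16BE string. The name decodeUnicodeEscapes is invented for the example.

```php
<?php
// Hypothetical variant: decode runs of \uXXXX escapes, including surrogate pairs.
function decodeUnicodeEscapes(string $str): string
{
    return preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',          // one or more consecutive \uXXXX escapes
        function (array $m): string {
            $hex = str_replace('\\u', '', $m[0]); // keep only the hex digits
            // The digits form a UTF-16BE byte string; convert it as a whole.
            return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
        },
        $str
    );
}

// Single quotes keep the backslashes literal, as in the question's input.
echo decodeUnicodeEscapes('<b>Sort By\u00a0-\u00a0</b>A to Z'), "\n";
```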
If you just need to replace a single escape sequence, use:
$str = str_replace("\u00a0", " ", $str);
echo $str;
<b>Sort By - </b>A to Z | Recently Added | Most Downloaded
While the real problem is the collation of the field in the database, I can't change it. I need to drop the invalid characters instead.
Using iconv('utf-8', 'utf-8//IGNORE'); won't work, because the characters are valid UTF-8 but are rejected when inserted into a field with that collation.
$broken_example = '↺ﺆী▜Ꮛ︷ሚ◶ヲɸʩ𝑸ᚙ𐤄🃟ʳ⸘ᥦฆⵞ䷿ꘚꕛ𝆖𝇑𝆺𝅥𝅮↺ﺆী▜Ꮛ︷ሚ◶ヲɸʩ𝑸ᚙ𐤄🃟ʳ⸘ᥦฆⵞ䷿ꘚꕛ𝆖𝇑𝆺𝅥𝅮';
$utf8 = html_entity_decode($broken_example, ENT_QUOTES, 'UTF-8');
I've tried to use some workaround like preg_replace('/&#([0-9]{6,});/', '');, but with no success.
The error MySQL reports is Incorrect string value: '\xF0\x90\xA4\x84\xCA\xB3...'
A regex that keeps every UTF-8 sequence of 1 to 4 bytes and drops any other byte is:
function removeInvalidChars ($text) {
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
return preg_replace($regex, '$1', $text);
}
Removing the alternative for 4-byte sequences keeps only the characters that can be stored in a MySQL utf8 (3-byte, utf8mb3) column, such as one using utf8_general_ci.
function removeInvalidChars ($text) {
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2}) | ./x';
return preg_replace($regex, '$1', $text);
}
By the way, it's the character set that matters, not the collation. You would also be much better off switching to utf8mb4 with utf8mb4_unicode_ci rather than putting in a hack like this.
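If the column has to stay on the 3-byte utf8 character set, the same filtering can also be expressed with a code-point range in UTF-8 mode. This is a sketch of an alternative to the byte-level regex, not the answer's own code, and the helper name stripFourByteChars is hypothetical. It assumes the input is already valid UTF-8.

```php
<?php
// Hypothetical helper: drop every code point above U+FFFF, i.e. every
// character that needs four bytes in UTF-8.
function stripFourByteChars(string $text): string
{
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
    if ($result === null) {
        // With the /u modifier preg_replace() returns null on malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

// U+10904 encodes as F0 90 A4 84, the bytes from the MySQL error message.
echo bin2hex(stripFourByteChars("a\u{10904}b")), "\n";
```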