Converting Unicode apostrophe in PHP

So I’m getting some text from a user in PHP and one of the characters is supposed to be an apostrophe but instead of coming in as the character apostrophe ’ it comes in as %u2019.
I tried all of the following to no avail:
$b = urldecode($a);
$c = utf8_decode($a);
$d = html_entity_decode($a);
$e = rawurldecode($a);
This %u2019 seemingly can’t be turned back to this character.
urldecode: %u2019
utf8_decode: %u2019
html_entity_decode: %u2019
rawurldecode: %u2019

It must be a JavaScript-escaped string. You can see the same string if you run escape("’") in your browser console.
Use this PHP function, which is equivalent to JavaScript's unescape() for %uXXXX sequences:
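$str = preg_replace_callback(
// %u is followed by four hexadecimal digits, so match a-f as well as 0-9
'/%u([0-9a-fA-F]{4})/',
function($matches) {
// Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2
return mb_convert_encoding('&#'.hexdec($matches[1]).';', 'UTF-8', 'HTML-ENTITIES');
},
$str
);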
If your PHP is older than 5.3, define the callback as a normal named function, because older versions do not support closures.
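As a variation on the answer above, here is a minimal sketch that avoids HTML entities altogether. It assumes the input came from JavaScript's escape(), which emits %uXXXX for UTF-16 code units and %XX for code points up to U+00FF. The function name js_unescape is hypothetical, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: decode a JavaScript escape()-style string to UTF-8.
function js_unescape(string $input): string
{
    return preg_replace_callback(
        // Group 1: a run of %uXXXX units (kept together so surrogate pairs survive).
        // Group 2: a single %XX byte, which escape() uses for U+0000..U+00FF.
        '/((?:%u[0-9a-fA-F]{4})+)|%([0-9a-fA-F]{2})/',
        function (array $m): string {
            if ($m[1] !== '') {
                // Strip the "%u" markers and read the hex digits as UTF-16BE.
                $hex = str_replace('%u', '', $m[1]);
                return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
            }
            // A %XX value is a Latin-1 code point, not a UTF-8 byte.
            return mb_convert_encoding(chr(hexdec($m[2])), 'UTF-8', 'ISO-8859-1');
        },
        $input
    );
}

echo js_unescape('It%u2019s%20here'), "\n"; // It’s here
```

Doing both replacements in one pass means a decoded "%" is never re-read as the start of another escape sequence.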

Related

How to convert MS dot character to Unicode [duplicate]

This question looks embarrassingly simple, but I haven't been able to find an answer.
What is the PHP equivalent to the following C# line of code?
string str = "\u1000";
This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal (4096 in decimal).
That is, in PHP, how can I create a string with a single Unicode character whose "Unicode numeric value" is known?
PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.
It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.
$unicodeChar = "\u{1000}";
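A short sketch of what that escape produces, assuming PHP 7.0 or later and the mbstring extension for mb_strlen(). PHP strings are byte strings, so the single character U+1000 ends up as three UTF-8 bytes.

```php
<?php
$unicodeChar = "\u{1000}";      // code point escape (PHP 7.0+)
$sameBytes   = "\xE1\x80\x80";  // the same character written as raw UTF-8 bytes

var_dump($unicodeChar === $sameBytes);      // bool(true)
echo bin2hex($unicodeChar), "\n";           // e18080
echo strlen($unicodeChar), "\n";            // 3 (bytes)
echo mb_strlen($unicodeChar, 'UTF-8'), "\n"; // 1 (character)
```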
Because JSON directly supports the \uxxxx syntax the first thing that comes into my mind is:
$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');
Another option would be to use mb_convert_encoding()
echo mb_convert_encoding('&#x1000;', 'UTF-8', 'HTML-ENTITIES');
or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:
echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');
I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:
\x[0-9A-Fa-f]{1,2}
The sequence of characters matching the regular expression is a
character in hexadecimal notation.
ASCII example:
<?php
echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>
Hello World!
So for your case, all you need to do is $str = "\x30\xA2";. But these are bytes, not characters. The byte representation of the Unicode codepoint coincides with UTF-16 big endian, so we could print it out directly as such:
<?php
header('content-type:text/html;charset=utf-16be');
echo("\x30\xA2");
?>
ア
If you are using a different encoding, you'll need to alter the bytes accordingly (mostly done with a library, though possible by hand too).
UTF-16 little endian example:
<?php
header('content-type:text/html;charset=utf-16le');
echo("\xA2\x30");
?>
ア
UTF-8 example:
<?php
header('content-type:text/html;charset=utf-8');
echo("\xE3\x82\xA2");
?>
ア
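To illustrate the "by hand" route mentioned above, here is a minimal sketch that turns a code point into its UTF-8 bytes with plain bit operations. The function name codepoint_to_utf8 is hypothetical; on PHP 7.2+ with mbstring, mb_chr() does the same job.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function codepoint_to_utf8(int $cp): string
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp < 0x80) {           // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {          // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x30A2)), "\n"; // e382a2
```

The result for U+30A2 matches the "\xE3\x82\xA2" bytes used in the UTF-8 example.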
There is also the pack function, but you can expect it to be slow.
PHP does not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}
Or with an anonymous function expression instead of create_function (required on PHP 8.0+, where create_function() no longer exists):
function unicodeString($str, $encoding=null) {
// mb_internal_encoding() returns the current internal encoding (UTF-8 by default on modern PHP)
if (is_null($encoding)) $encoding = mb_internal_encoding();
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
}, $str);
}
Its usage:
$str = unicodeString("\u1000");
html_entity_decode('&#x30A8;', 0, 'UTF-8');
This works too. However the json_decode() solution is a lot faster (around 50 times).
Try Portable UTF-8:
$str = utf8_chr( 0x1000 );
$str = utf8_chr( '\u1000' );
$str = utf8_chr( 4096 );
All work exactly the same way. You can get the codepoint of a character with utf8_ord(). Read more about Portable UTF-8.
As mentioned by others, PHP 7 introduces support for the \u Unicode syntax directly.
As also mentioned by others, before PHP 7 the only way to obtain a string value from a readable Unicode character description is to convert it from something else (e.g. JSON parsing, HTML parsing or some other form). But this comes at a run-time performance cost.
However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.
This is especially useful if you prefer not to enter the character directly in a string through its natural form. For example, if it is an invisible control character, or other hard to detect whitespace.
First, a proof example:
// Unicode Character 'HAIR SPACE' (U+200A)
$htmlEntityChar = "&#x200A;";
$realChar = html_entity_decode($htmlEntityChar);
$phpChar = "\xE2\x80\x8A";
echo 'Proof: ';
var_dump($realChar === $phpChar); // bool(true)
Note that, as mentioned by Pacerier in another answer, this binary code is unique to a specific character encoding. In the above example, \xE2\x80\x8A is the binary coding for U+200A in UTF-8.
The next question is, how do you get from U+200A to \xE2\x80\x8A?
Below is a PHP script to generate the escape sequence for any character, based on either a JSON string, HTML entity, or any other method once you have it as a native string.
function str_encode_utf8binary($str) {
/** @author Krinkle 2018 */
$output = '';
// str_split() splits on bytes, which is what we want here
foreach (str_split($str) as $octet) {
// Two-digit lowercase hex for PHP \x syntax; zero-padded so bytes below 0x10 stay unambiguous
$output .= sprintf('\x%02x', ord($octet));
}
return $output;
}
function str_convert_html_to_utf8binary($str) {
return str_encode_utf8binary(html_entity_decode($str));
}
function str_convert_json_to_utf8binary($str) {
return str_encode_utf8binary(json_decode($str));
}
// Example for raw string: Unicode Character 'INFINITY' (U+221E)
echo str_encode_utf8binary('∞') . "\n";
// \xe2\x88\x9e
// Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_html_to_utf8binary('&#x200A;') . "\n";
// \xe2\x80\x8a
// Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
// \xe2\x80\x8a
function unicode_to_textstring($str){
$rawstr = pack('H*', $str);
$newstr = iconv('UTF-16BE', 'UTF-8', $rawstr);
return $newstr;
}
$msg = '67714eac99c500200054006f006b0079006f002000530074006100740069006f006e003a0020';
echo unicode_to_textstring($msg); // 東京駅 Tokyo Station: 

PHP str_replace unintentionally removing Chinese characters

I have a PHP script that removes special characters, but unfortunately some Chinese characters are also removed.
<?php
function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split('#/\\:*?\"<>|[]\'_+(),{}’! &'), "", $inputString);
return $inputString;
}
$test = '赵景然 赵景然';
print(removeSpecialCharactersFromString($test));
?>
Oddly, the output is 赵然 赵然. The character 景 is removed.
In addition, 陈 and 一 are also removed. What might be the possible cause?
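The string you're using as the list of characters to replace doesn't work well with str_split(), which splits on bytes rather than characters: the multi-byte ’ (bytes E2 80 99 in UTF-8) is broken into single bytes, and those bytes also occur inside characters such as 景 (E6 99 AF). What I've done is to convert this string to UTF16 and then split it.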
function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split(
mb_convert_encoding('#/\\:*?\"<>|[]\'_+(),{}’! &', 'UTF16')), "", $inputString);
return $inputString;
}
$test = '#赵景然 赵景然';
print(removeSpecialCharactersFromString($test));
Which gives...
赵景然赵景然
BTW, str_replace() is multibyte-safe as long as the search strings and the subject use the same encoding; see this note (I sort of recognised the poster): http://php.net/manual/en/ref.mbstring.php#109937
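An alternative sketch, assuming all strings are UTF-8: split the list into whole characters instead of bytes, so no partial byte sequence is ever used as a search string. The UTF-16 conversion above still splits on bytes, so it may misfire once the list contains other non-ASCII characters. The function name remove_special_characters is hypothetical.

```php
<?php
// Hypothetical helper: strip a fixed set of characters from a UTF-8 string.
function remove_special_characters(string $input): string
{
    // "\u{2019}" is the right single quotation mark (’), written as an
    // escape so the example does not depend on the file's encoding.
    $special = '#/\\:*?"<>|[]\'_+(),{}' . "\u{2019}" . '! &';

    // The /u modifier makes preg_split() cut between UTF-8 characters,
    // so ’ stays one three-byte search string.
    $chars = preg_split('//u', $special, -1, PREG_SPLIT_NO_EMPTY);

    return str_replace($chars, '', $input);
}

echo remove_special_characters('#赵景然 赵景然'), "\n"; // 赵景然赵景然
```

On PHP 7.4+ with mbstring, mb_str_split($special) could replace the preg_split() call.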

substr_replace() when used with special characters (äöå) replaces with a? [duplicate]

I'm trying to do accented character replacement in PHP but get funky results. My guess is that it's because I'm using a UTF-8 string and str_replace can't properly handle multi-byte strings.
$accents_search = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ');
$accents_replace = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e',
'e','e','E','E','E','E','i','i','i','i','I','I','I','I','oe','o','o','o','o','o','o',
'O','O','O','O','O','u','u','u','U','U','U','c','C','N','n');
$str = str_replace($accents_search, $accents_replace, $str);
Results I get:
Ørjan Nilsen -> �orjan Nilsen
Expected Result:
Ørjan Nilsen -> Orjan Nilsen
Edit: I've got my internal character handler set to UTF-8 (according to mb_internal_encoding()), also the value of $str is UTF-8, so from what I can tell, all the strings involved are UTF-8. Does str_replace() detect char sets and use them properly?
According to the PHP documentation, str_replace() is binary-safe, which means that it can handle UTF-8 encoded text without any data loss.
It looks like the string was not replaced correctly because your input encoding and the encoding of the source file do not match.
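A small sketch of that point, using byte escapes so the result does not depend on how the file is saved. It only shows that str_replace() compares bytes: a search string in the subject's encoding matches, and one in a different encoding silently does not. It does not try to reproduce the exact garbled output from the question.

```php
<?php
$name = "\u{D8}rjan Nilsen"; // "Ørjan Nilsen" as UTF-8 (Ø is the bytes C3 98)

// Search string in the same encoding as the subject: the bytes match.
$sameEncoding = str_replace("\u{D8}", 'O', $name);

// Search string as the single ISO-8859-1 byte for Ø (0xD8): no match.
$otherEncoding = str_replace("\xD8", 'O', $name);

echo $sameEncoding, "\n";           // Orjan Nilsen
var_dump($otherEncoding === $name); // bool(true)
```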
It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.
NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).
header('Content-Type: text/plain; charset=utf-8');
$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));
$test = Normalizer::normalize($test, Normalizer::FORM_D);
// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';
echo preg_replace($pattern, '', $test);
Output:
aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn
The Normalizer class is part of the PECL intl package. (The algorithm itself isn't very complicated but needs to load a lot of character mappings afaik. I wrote a PHP implementation a while ago.)
(I'm adding this two months late because I think it's a nice technique that's not known widely enough.)
Try this function definition:
if (!function_exists('mb_str_replace')) {
function mb_str_replace($search, $replace, $subject) {
if (is_array($subject)) {
foreach ($subject as $key => $val) {
$subject[$key] = mb_str_replace($search, $replace, $val);
}
return $subject;
}
// Quote each whole search string; closures replace create_function(), which was removed in PHP 8.0
$pattern = '/(?:'.implode('|', array_map(function($s) { return preg_quote($s, '/'); }, (array)$search)).')/u';
if (is_array($search)) {
if (is_array($replace)) {
$len = min(count($search), count($replace));
$table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
// Look up the matched text in the search => replace table
$f = function($match) use ($table) { return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0]; };
$subject = preg_replace_callback($pattern, $f, $subject);
return $subject;
}
}
$subject = preg_replace($pattern, (string)$replace, $subject);
return $subject;
}
}

Creating a UTF-8 string from hexadecimal code

In C++, it's possible to create a UTF-8 string using this kind of notation: "\uD840\uDC50".
However this doesn't work in PHP. Is there a similar notation?
If not, is there any built-in way to create a UTF-8 string knowing its Unicode code point?
I've ended up implementing it like this:
$utf8 = html_entity_decode("&#x4E00;", ENT_COMPAT, 'UTF-8');
function hexToString($str){return chr(hexdec(substr($str, 2)));}
$result = preg_replace_callback("/(\\\\x..)/isU", function($m) { return hexToString($m[0] ); }, $str);
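For the surrogate-pair notation from the question, a minimal sketch under two assumptions: the json extension is available, and PHP is 7.0 or later for the "\u{...}" escape. The pair D840 DC50 encodes the single code point U+20050, and json_decode() combines the two halves.

```php
<?php
// JSON \uXXXX escapes are UTF-16 code units, so a surrogate pair is allowed.
$fromJson = json_decode('"\uD840\uDC50"');

// PHP 7.0+ takes the code point itself, not the surrogate halves.
$fromEscape = "\u{20050}";

var_dump($fromJson === $fromEscape); // bool(true)
echo bin2hex($fromEscape), "\n";     // f0a08190
```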
