PHP: Encode UTF8-Characters to html entities - php

I want to encode normal characters to html-entities like
a => a
A => A
b => b
B => B
but
echo htmlentities("a");
doesn't work. It outputs the normal charaters (a A b B) in the html source code instead of the html-entities.
How can I convert them?

You can build a function for this fairly easily using mb_ord or IntlChar::ord, either of which will give you the numeric value for a Unicode Code Point.
You can then convert that to a hexadecimal string using base_convert, and add the '&#x' and ';' around it to give an HTML entity:
function make_entity(string $char) {
$codePoint = mb_ord($char, 'UTF-8'); // or IntlChar::ord($char);
$hex = base_convert($codePoint, 10, 16);
return '&#x' . $hex . ';';
}
echo make_entity('a');
echo make_entity('€');
echo make_entity('🐘');
You then need to run that for each code point in your UTF-8 string. It is not enough to loop over the string using something like substr, because PHP's string functions work with individual bytes, and each UTF-8 code point may be multiple bytes.
One approach would be to use a regular expression replacement with a pattern of /./u:
The . matches each single "character"
The /u modifier turns on Unicode mode, so that each "character" matched by the . is a whole code point
You can then run the above make_entity function for each match (i.e. each code point) with preg_replace_callback.
Since preg_replace_callback will pass your callback an array of matches, not just a string, you can make an arrow function which takes the array and passes element 0 to the real function:
$callback = fn($matches) => make_entity($matches[0]);
So putting it together, you have this:
echo preg_replace_callback('/./u', fn($m) => make_entity($m[0]), 'a€🐘');
Arrow functions were introduced in PHP 7.4, so if you're stuck on an older version, you can write the same thing as a regular anonymous function:
echo preg_replace_callback('/./u', function($m) { return make_entity($m[0]) }, 'a€🐘');
Or of course, just a regular named function (or a method on a class or object; see the "callable" page in the manual for the different syntax options):
function make_entity_from_array_item(array $matches) {
return make_entity($matches[0]);
}
echo preg_replace_callback('/./u', 'make_entity_from_array_item', 'a€🐘');

Related

How to convert MS dot character to Unicode [duplicate]

This question looks embarrassingly simple, but I haven't been able to find an answer.
What is the PHP equivalent to the following C# line of code?
string str = "\u1000";
This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal (4096 in decimal).
That is, in PHP, how can I create a string with a single Unicode character whose "Unicode numeric value" is known?
PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.
It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.
$unicodeChar = "\u{1000}";
Because JSON directly supports the \uxxxx syntax the first thing that comes into my mind is:
$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');
Another option would be to use mb_convert_encoding()
echo mb_convert_encoding('က', 'UTF-8', 'HTML-ENTITIES');
or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:
echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');
I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:
\x[0-9A-Fa-f]{1,2}
The sequence of characters matching the regular expression is a
character in hexadecimal notation.
ASCII example:
<?php
echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>
Hello World!
So for your case, all you need to do is $str = "\x30\xA2";. But these are bytes, not characters. The byte representation of the Unicode codepoint coincides with UTF-16 big endian, so we could print it out directly as such:
<?php
header('content-type:text/html;charset=utf-16be');
echo("\x30\xA2");
?>
ア
If you are using a different encoding, you'll need alter the bytes accordingly (mostly done with a library, though possible by hand too).
UTF-16 little endian example:
<?php
header('content-type:text/html;charset=utf-16le');
echo("\xA2\x30");
?>
ア
UTF-8 example:
<?php
header('content-type:text/html;charset=utf-8');
echo("\xE3\x82\xA2");
?>
ア
There is also the pack function, but you can expect it to be slow.
PHP does not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}
Or with an anonymous function expression instead of create_function:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
}, $str);
}
Its usage:
$str = unicodeString("\u1000");
html_entity_decode('エ', 0, 'UTF-8');
This works too. However the json_decode() solution is a lot faster (around 50 times).
Try Portable UTF-8:
$str = utf8_chr( 0x1000 );
$str = utf8_chr( '\u1000' );
$str = utf8_chr( 4096 );
All work exactly the same way. You can get the codepoint of a character with utf8_ord(). Read more about Portable UTF-8.
As mentioned by others, PHP 7 introduces support for the \u Unicode syntax directly.
As also mentioned by others, the only way to obtain a string value from any sensible Unicode character description in PHP, is by converting it from something else (e.g. JSON parsing, HTML parsing or some other form). But this comes at a run-time performance cost.
However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.
This is especially useful if you prefer not to enter the character directly in a string through its natural form. For example, if it is an invisible control character, or other hard to detect whitespace.
First, a proof example:
// Unicode Character 'HAIR SPACE' (U+200A)
$htmlEntityChar = " ";
$realChar = html_entity_decode($htmlEntityChar);
$phpChar = "\xE2\x80\x8A";
echo 'Proof: ';
var_dump($realChar === $phpChar); // bool(true)
Note that, as mentioned by Pacerier in another answer, this binary code is unique to a specific character encoding. In the above example, \xE2\x80\x8A is the binary coding for U+200A in UTF-8.
The next question is, how do you get from U+200A to \xE2\x80\x8A?
Below is a PHP script to generate the escape sequence for any character, based on either a JSON string, HTML entity, or any other method once you have it as a native string.
function str_encode_utf8binary($str) {
/** #author Krinkle 2018 */
$output = '';
foreach (str_split($str) as $octet) {
$ordInt = ord($octet);
// Convert from int (base 10) to hex (base 16), for PHP \x syntax
$ordHex = base_convert($ordInt, 10, 16);
$output .= '\x' . $ordHex;
}
return $output;
}
function str_convert_html_to_utf8binary($str) {
return str_encode_utf8binary(html_entity_decode($str));
}
function str_convert_json_to_utf8binary($str) {
return str_encode_utf8binary(json_decode($str));
}
// Example for raw string: Unicode Character 'INFINITY' (U+221E)
echo str_encode_utf8binary('∞') . "\n";
// \xe2\x88\x9e
// Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_html_to_utf8binary(' ') . "\n";
// \xe2\x80\x8a
// Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
// \xe2\x80\x8a
function unicode_to_textstring($str){
$rawstr = pack('H*', $str);
$newstr = iconv('UTF-16BE', 'UTF-8', $rawstr);
return $newstr;
}
$msg = '67714eac99c500200054006f006b0079006f002000530074006100740069006f006e003a0020';
echo unicode_to_textstring($str);

What is this encoding and how can I encode a string to it in PHP?

I have an input like this:
$input = 'GFL/R&D/50/67289';
I am trying to get to this:
GFL$2fR$26D$2f50$2f67289
So far, the closest I have come is this:
echo filter_var($input, FILTER_SANITIZE_ENCODED, FILTER_FLAG_ENCODE_LOW)
which produces:
GFL%2FR%26D%2F50%2F67289
How can I get from the given input to the desired output and what sort of encoding is the result in?
By the way, please note the case sensitivity going on there. $2f is required rather than $2F.
This will do the trick: url-encode, then lower-case the encoded sequences and swap % for $ with a preg callback (PHP's PCRE doesn't support case-shifting modifiers):
$input = 'GFL/R&D/50/67289';
echo preg_replace_callback('/(%)([0-9A-F]{2})/', function ($m) {
return '$' . strtolower($m[2]);
}, urlencode($input));
output:
GFL$2fR$26D$2f50$2f67289

PHP string comparison for strings which look the same in some views, but not in others

I have two strings that look the same when I echo them, but when I var_dump() them they are different string types:
Echo:
http://blah
http://blah
var dump:
string(14) "http://blah"
string(11) "http://blah"
strToHex:
%68%74%74%70%3a%2f%2f%62%6c%61%68%00%00%00
%68%74%74%70%3a%2f%2f%62%6c%61%68
When I compare them, they return false. How can I manipulate the string type, so that I can perform a comparison that returns true?
What is the difference between string 11 and string 14? I am sure there is a simple resolution, but I have not found anything yet. No matter how I implode, explode, UTF-8 encode, etc., they will not compare the strings or change type.
Letter "a" can be written in another encoding.
For example: blаh. Here a is a Cyrillic 'а'.
All of these letters are Cyrillic, but it looks like Latin: у, е, х, а, р, о, с
Trim the strings before comparing. There are escaped characters, like \t and \n, which are not visible.
$clean_str = trim($str);
When using var_dump(), then string(14) means that the value is a string that holds 14 bytes. So string(11) and string(14) are not different "types" of strings; they are just strings of different length.
I would use something like this to see what actually is inside those strings:
function strToHex($value, $prefix = '') {
$result = '';
$length = strlen($value);
for ( $n = 0; $n < $length; $n++ ) {
$result .= $prefix . sprintf('%02x', ord($value[$n]));
}
return $result;
}
echo strToHex("test\r\n", '%');
Output:
%74%65%73%74%0d%0a
This decodes as:
%74 - t
%65 - e
%73 - s
%74 - t
%0d - \r (carriage return)
%0a - \n (line feed)
Or, as pointed out in comments by Karolis, you can use the built-in function bin2hex():
echo bin2hex("test\r\n");
Output:
746573740d0a
Try to trim these strings:
if (trim($string1) == trim($string2)) {
// Do things
}
Probably Unicode strings within the upper range are counted as double bytes.
Use mb_strlen() to check lengths.
Also some characters may not be visible, but present (there are many of Unicode spaces, etc.)
Generally, when you work with Unicode functions, you should use the mb_* string functions.
You may overload string encoding functions in php.ini to always use mb_* functions instead the standard ones (I am not sure if Xdebug honors those settings).
In PHP 6 this problem will be solved, as it should be globally Unicode-aware.

Unicode character in PHP string

This question looks embarrassingly simple, but I haven't been able to find an answer.
What is the PHP equivalent to the following C# line of code?
string str = "\u1000";
This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal (4096 in decimal).
That is, in PHP, how can I create a string with a single Unicode character whose "Unicode numeric value" is known?
PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.
It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.
$unicodeChar = "\u{1000}";
Because JSON directly supports the \uxxxx syntax the first thing that comes into my mind is:
$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');
Another option would be to use mb_convert_encoding()
echo mb_convert_encoding('က', 'UTF-8', 'HTML-ENTITIES');
or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:
echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');
I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:
\x[0-9A-Fa-f]{1,2}
The sequence of characters matching the regular expression is a
character in hexadecimal notation.
ASCII example:
<?php
echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>
Hello World!
So for your case, all you need to do is $str = "\x30\xA2";. But these are bytes, not characters. The byte representation of the Unicode codepoint coincides with UTF-16 big endian, so we could print it out directly as such:
<?php
header('content-type:text/html;charset=utf-16be');
echo("\x30\xA2");
?>
ア
If you are using a different encoding, you'll need alter the bytes accordingly (mostly done with a library, though possible by hand too).
UTF-16 little endian example:
<?php
header('content-type:text/html;charset=utf-16le');
echo("\xA2\x30");
?>
ア
UTF-8 example:
<?php
header('content-type:text/html;charset=utf-8');
echo("\xE3\x82\xA2");
?>
ア
There is also the pack function, but you can expect it to be slow.
PHP does not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}
Or with an anonymous function expression instead of create_function:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
}, $str);
}
Its usage:
$str = unicodeString("\u1000");
html_entity_decode('エ', 0, 'UTF-8');
This works too. However the json_decode() solution is a lot faster (around 50 times).
Try Portable UTF-8:
$str = utf8_chr( 0x1000 );
$str = utf8_chr( '\u1000' );
$str = utf8_chr( 4096 );
All work exactly the same way. You can get the codepoint of a character with utf8_ord(). Read more about Portable UTF-8.
As mentioned by others, PHP 7 introduces support for the \u Unicode syntax directly.
As also mentioned by others, the only way to obtain a string value from any sensible Unicode character description in PHP, is by converting it from something else (e.g. JSON parsing, HTML parsing or some other form). But this comes at a run-time performance cost.
However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.
This is especially useful if you prefer not to enter the character directly in a string through its natural form. For example, if it is an invisible control character, or other hard to detect whitespace.
First, a proof example:
// Unicode Character 'HAIR SPACE' (U+200A)
$htmlEntityChar = " ";
$realChar = html_entity_decode($htmlEntityChar);
$phpChar = "\xE2\x80\x8A";
echo 'Proof: ';
var_dump($realChar === $phpChar); // bool(true)
Note that, as mentioned by Pacerier in another answer, this binary code is unique to a specific character encoding. In the above example, \xE2\x80\x8A is the binary coding for U+200A in UTF-8.
The next question is, how do you get from U+200A to \xE2\x80\x8A?
Below is a PHP script to generate the escape sequence for any character, based on either a JSON string, HTML entity, or any other method once you have it as a native string.
function str_encode_utf8binary($str) {
/** #author Krinkle 2018 */
$output = '';
foreach (str_split($str) as $octet) {
$ordInt = ord($octet);
// Convert from int (base 10) to hex (base 16), for PHP \x syntax
$ordHex = base_convert($ordInt, 10, 16);
$output .= '\x' . $ordHex;
}
return $output;
}
function str_convert_html_to_utf8binary($str) {
return str_encode_utf8binary(html_entity_decode($str));
}
function str_convert_json_to_utf8binary($str) {
return str_encode_utf8binary(json_decode($str));
}
// Example for raw string: Unicode Character 'INFINITY' (U+221E)
echo str_encode_utf8binary('∞') . "\n";
// \xe2\x88\x9e
// Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_html_to_utf8binary(' ') . "\n";
// \xe2\x80\x8a
// Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
// \xe2\x80\x8a
function unicode_to_textstring($str){
$rawstr = pack('H*', $str);
$newstr = iconv('UTF-16BE', 'UTF-8', $rawstr);
return $newstr;
}
$msg = '67714eac99c500200054006f006b0079006f002000530074006100740069006f006e003a0020';
echo unicode_to_textstring($str);

PHP regex issue: cannot find $C

I'm trying to parse dollar amounts from a text of in mixed French (Canadian) and English. The text is in UTF-8. They use $C to denote currency. For some reason when I use preg_match neither the '$' nor the 'C' can be found. Everything else works fine. Any ideas?
e.g. use
preg_match_all('/\$C/u', $match)
on "Thanks for a payment of 46,00 $C" returns empty.
I think the regex can't find those characters because they aren't there. If you initialize the string like this:
$source = "Thanks for a payment of 46,00 $C";
...(i.e., as a double-quoted string literal), $C gets interpreted as a variable name. Since you never initialized that variable, it gets replaced with nothing in the actual string. You should either use single-quotes to initialize the string, or escape the dollar sign with a backslash like you did in the regex.
By the way, this couldn't be an encoding problem, because (in the example, at least), all the characters are from the ASCII character set. Whether it was encoded as UTF-8, ISO-8859-1 or ASCII, the binary representation of the string would be identical.
preg_match_all('/\$C/u', 'Thanks for a payment of 46,00 $C', $matches);
print_r($matches);
works fine for me:
Array
(
[0] => Array
(
[0] => $C
)
)
Maybe this helps:
// assuming $text is the input string
$matches = array();
preg_match_all('/([0-9,\\.]+)\\s*\\$C/u', $text, $matches);
if ($matches) {
$price = floatval(str_replace(',', '.', $matches[1][0]));
printf("%.2f\n", $price);
} else {
printf("No price found\n");
}
Just make sure the input string ($text) has been properly decoded into an Unicode string. (For example, if it's in UTF-8, use the utf8_decode function.)

Categories