Emoji (Unicode) to UTF-8 ampersand hash (?) encoding

To maintain compatibility with a pre-existing PHP solution, I require
input: 😁 // emoji character,
output: &#xF0;&#x9F;&#x98;&#x81;
I believe this is 'ampersand hash' encoding (I'm not sure that's what it's called; I can't find any resources which explain how to arrive at this format, or what this encoding is suitable for).
I can get the bytes by URL-encoding the Unicode...
<?php print urlencode("😁"); /* Output: %F0%9F%98%81 */ ?>
...and I can use a Regex to convert this to the format I need... but I don't like this solution. It's very hacky and very prone to accidentally encoding non-encoded strings...
<?php
$enc = urlencode("😁");
print $enc; // %F0%9F%98%81
$find = '/(%)([0-9a-fA-F][0-9a-fA-F])/i';
$replacement = '&#x$2;';
print preg_replace($find,$replacement,$enc);
?>
Result: &#xF0;&#x9F;&#x98;&#x81;
Is there a better approach?
What is this encoding known as, and how do I arrive at it (via PHP)?
Many thanks!
Edit: Turns out this approach is unsuitable after all. urlencode converts all the spaces into + characters. There must be a correct approach to arrive at this format?

&#xF0;&#x9F;&#x98;&#x81; is a run of HTML numeric character references ("HTML entities"); it represents the 4 bytes F0 9F 98 81, which is the UTF-8 encoding of that emoji. I suspect it is HTML, not PHP, that you are trying to appease?
http://unicode.scarfboy.com/?s=%F0%9F%98%81 -- go part way down the page to "string stuff" to see how to encode it for HTML, utf8, python, javascript, etc.
One way in PHP is:
echo bin2hex('😁'); // f09f9881
Then break it into groups of 2 hex digits.
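The "break it into groups of 2 hex digits" step can be sketched without urlencode, so spaces and other ASCII text are left alone. The function name highBytesToHexEntities is hypothetical, and the choice to encode only bytes 0x80 to 0xFF is an assumption about what the pre-existing solution expects. Note that this is a byte-wise format; the usual HTML reference for this emoji would be the single code point &#x1F601;.

```php
<?php
// Hypothetical helper: replace every non-ASCII byte with &#xNN;
// The pattern has no /u flag, so it matches single bytes, not characters.
function highBytesToHexEntities(string $s): string
{
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        static function (array $m): string {
            return '&#x' . strtoupper(bin2hex($m[0])) . ';';
        },
        $s
    );
}

// The emoji is written as raw bytes so the example does not depend on
// the encoding of the source file.
echo highBytesToHexEntities("a b \xF0\x9F\x98\x81"), "\n";
// a b &#xF0;&#x9F;&#x98;&#x81;
```

Because ASCII bytes pass through untouched, a string that is already plain ASCII comes back unchanged.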

List of known troublesome characters that causes PHP to fail to detect the proper character encoding before converting to UTF-8 resulting in lost data

PHP isn't always correct, but what I write has to always be correct. In this case an email subject contains an en dash character. This thread is about detecting oddball characters that, when they appear alone (say, among otherwise purely ASCII text), cause PHP to detect the encoding incorrectly. I've already determined one static example, though my goal here is to create a definitive thread containing something as close to drop-in code as we can possibly create.
Here is my starting string from the subject header of an email:
<?php
//This is AFTER exploding the : of the header and using trim on $p[1]:
$s = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//orkut – convite enviado por Lais Piccirillo
?>
Typically the next step is to do the following:
$s = imap_mime_header_decode($s);//orkut � convite enviado por Lais Piccirillo
Typically past that point I'd do the following:
$s = mb_convert_encoding($subject, 'UTF-8', mb_detect_encoding($s));//en dash missing!
Now I received a static answer for an earlier static question. Eventually I was able to put this working set of code together:
<?php
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//Attempt to determine the character set:
$en = mb_detect_encoding($s1);//ASCII; wrong!!!
$p = explode('?', $s1, 3)[1];//ISO-8859-1; wrong!!!
//Necessary to decode the q-encoded header text any way FIRST:
$s2 = imap_mime_header_decode($s1);
//Now scan for character exceptions in the original text to compensate for PHP:
if (strpos($s1, '=96') !== false) {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8', 'CP1252');}
else {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8');}
//String is finally ready for client output:
echo '<pre>'.print_r($s2,1).'</pre>';//orkut – convite enviado por Lais Piccirillo
?>
Now either I've still programmed this incorrectly and there is something in PHP I'm missing (tried many combinations of html_entity_decode, iconv, mb_convert_encoding and utf8_encode) or, at least for the moment with PHP 8, we'll be forced to detect specific characters and manually override the encoding as I've done on line 12. In the latter case a bug report either needs to be created or likely updated if one specific to this issue already exists.
So technically the question is:
How do we properly detect all character encodings to prevent any characters from being lost during the conversion of strings to UTF-8?
If no such proper answer exists valid answers include characters that when among otherwise purely ASCII text results in PHP failing to properly detect the correct character encoding thus resulting in an incorrect UTF-8 encoded string. Presuming this issue becomes fixed in the future and can be validated against all odd-ball characters listed in all of the other relevant answers then a proper answer can be accepted.

You are blaming PHP for something that PHP could not possibly solve:
$s1 is an ASCII string; just as the string "smiling face emoji" is ASCII, even though it describes the string "🙂".
$s2 is decoded according to the information you were sent. In fact, it's decoded into a raw sequence of bytes, and a label which was provided in the input.
Your actual problem is that the information you were sent was wrong - the system that sent it to you has made the common mistake of mislabelling Windows-1252 as ISO-8859-1.
Those two encodings agree on the meanings of 224 out of the 256 possible 8-bit values. They disagree on the values from 0x80 to 0x9F: those are control characters in ISO 8859 and (mostly) assigned to printable characters in Windows-1252.
Note that there is no way for any system to automatically tell you which interpretation was intended - either way, there is simply a byte in memory containing (for instance) 0x96. However, the extra control characters from ISO 8859 are very rarely used, so if the string claims to be ISO-8859-1 but contains bytes in that range, it's almost certainly in some other encoding. Since Windows-1252 is very widely used (and often mislabelled in this way), a common solution is simply to assume that any data labelled ISO-8859-1 is actually Windows-1252.
That makes the solution really very simple:
// $input is the ASCII string you've received
$input = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
// Decode the string into its labelled encoding, and string of bytes
$mime_decoded = imap_mime_header_decode($input);
$input_encoding = $mime_decoded[0]->charset;
$raw_bytes = $mime_decoded[0]->text;
// If it claims to be ISO-8859-1, assume it's lying
if ( $input_encoding === 'ISO-8859-1' ) {
$input_encoding = 'Windows-1252';
}
// Now convert from a known encoding to UTF-8 for the use of your application
$utf8_string = mb_convert_encoding($raw_bytes, 'UTF-8', $input_encoding);
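To see why the label matters, here is a minimal sketch that converts the same bytes under both interpretations. The variable names and the shortened sample string are illustrative only, and the mbstring extension is assumed to be available.

```php
<?php
// Stand-in for the decoded header bytes: the byte 0x96 between ASCII words.
$rawBytes = "orkut \x96 convite";

// Same bytes, two different labels.
$asLatin1 = mb_convert_encoding($rawBytes, 'UTF-8', 'ISO-8859-1');
$asCp1252 = mb_convert_encoding($rawBytes, 'UTF-8', 'Windows-1252');

// ISO-8859-1 maps 0x96 to the control character U+0096 (UTF-8 bytes C2 96).
echo bin2hex($asLatin1), "\n";
// Windows-1252 maps 0x96 to the en dash U+2013 (UTF-8 bytes E2 80 93).
echo bin2hex($asCp1252), "\n";
```

Neither conversion fails; only the second one yields a printable en dash, which is why a detection step cannot settle the question.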

How can I reproducibly represent a non-UTF8 string in PHP (Browser)

I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes. Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
function strToHex2($str) {
return '\x'.rtrim(chunk_split(strtoupper(bin2hex($str)),2,'\x'),'\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.

I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represent, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here, which is asking for problems. The only safe and sane answer is Unicode with one of the officially supported encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with a byte sequence of x80. Instead, you entered the € character which is 20AC in Unicode and expressed as the three bytes xE2 x82 xAC in UTF-8. The function mb_convert_encoding doesn't have a map of all logical characters in every encoding, and so for this specific case it doesn't know how to map "Euro Sign" to the CP1252 codepage. Whenever a character conversion fails, the Unicode FFFD character is used instead.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
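One very simple heuristic of that kind, as a sketch: treat the bytes as UTF-8 if they validate as UTF-8, and otherwise fall back to a single assumed legacy encoding. The function name toUtf8Guess and the choice of Windows-1252 as the fallback are assumptions for illustration; a real import may need a different fallback or a larger rule set.

```php
<?php
// Hypothetical helper: valid UTF-8 is kept as-is, anything else is
// assumed to be Windows-1252. Requires the mbstring extension.
function toUtf8Guess(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

var_dump(toUtf8Guess("The price is 15 \x80") === "The price is 15 \u{20AC}");
var_dump(toUtf8Guess("The price is 15 \u{20AC}") === "The price is 15 \u{20AC}");
```

The guess can still be wrong: many legacy byte sequences are valid in several 8-bit encodings at once, so this only narrows the problem.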
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see â‚¬ or similar.) To say that again, your browser/console does this, PHP is not doing that.
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach(str_split($s) as $c){
$buf .= $c >= "\x80" ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"
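The loop above escapes every byte over x7F, including bytes that belong to valid UTF-8 characters. If the goal is what the question literally asks for (leave correct UTF-8 alone, escape only the rest), a sketch could look like this. The function name escapeInvalidUtf8 is hypothetical, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: keep valid UTF-8 sequences, escape every other
// byte as \xNN.
function escapeInvalidUtf8(string $s): string
{
    $out = '';
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $taken = 0;
        // A UTF-8 character is 1 to 4 bytes long; try each length in turn.
        for ($n = 1; $n <= 4 && $i + $n <= $len; $n++) {
            if (mb_check_encoding(substr($s, $i, $n), 'UTF-8')) {
                $taken = $n;
                break;
            }
        }
        if ($taken > 0) {
            // Valid character: copy its bytes unchanged.
            $out .= substr($s, $i, $taken);
            $i += $taken;
        } else {
            // No valid character starts here: escape this one byte.
            $out .= '\x' . bin2hex($s[$i]);
            $i++;
        }
    }
    return $out;
}

var_dump(escapeInvalidUtf8("The price is 15 \x80"));     // Windows-1252 euro byte
var_dump(escapeInvalidUtf8("The price is 15 \u{20AC}")); // UTF-8 euro sign
```

The first call escapes the lone \x80 byte; the second returns its input unchanged because the euro sign is a valid three-byte UTF-8 sequence.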

A simple comparison in utf8, wrong result?

This code prints "no", but it should print "ok"; the UTF-8 encodings of the two strings are different.
$a="کیهان";
$b="كيهان";
echo utf8_encode($a)."==".utf8_encode($b)."<br>";
if(utf8_encode($a)==utf8_encode($b))
echo "ok";
else
echo "no";
and the result :
Ú©ÛÙØ§Ù==ÙÙÙØ§Ù
no
what's that © ?
edit : $a is copied and $b is typed

Your Unicode strings are different to begin with, shown here with spaces to highlight the point:
$a="ک ی ه ن";
$b="ك ي ه ن";
EDIT: for curiosity's sake...
Seems that they display identically in the tab at the top of the file, which must have font features which combine characters together, but displays differently in the body of code, where it is actually displayed back to front.
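The difference is easier to see as code points than as glyphs. A minimal sketch, assuming the mbstring extension and PHP 7.4 or later; the helper name codePoints is hypothetical, and the two strings are written with escapes so the example does not depend on the source file's encoding:

```php
<?php
// Hypothetical helper: list the code points of a UTF-8 string as U+XXXX.
function codePoints(string $s): array
{
    return array_map(
        static function (string $ch): string {
            return sprintf('U+%04X', mb_ord($ch, 'UTF-8'));
        },
        mb_str_split($s, 1, 'UTF-8')
    );
}

$a = "\u{06A9}\u{06CC}\u{0647}\u{0627}\u{0646}"; // Persian keheh and yeh
$b = "\u{0643}\u{064A}\u{0647}\u{0627}\u{0646}"; // Arabic kaf and yeh

// Positions where the two strings differ.
print_r(array_diff_assoc(codePoints($a), codePoints($b)));
```

Only the first two letters differ (U+06A9 vs U+0643 and U+06CC vs U+064A); the remaining three are identical, which is why the words look the same in many fonts.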

EDIT:
Billy's completely right (+1) about why the strings are not equal. This answer may explain why you see garbage text after the conversion.
I'm guessing that your original encoding is not ISO-8859-1.
See the first comment in the docs.
Please note that utf8_encode only converts a string encoded in
ISO-8859-1 to UTF-8. A more appropriate name for it would be
"iso88591_to_utf8". If your text is not encoded in ISO-8859-1, you do
not need this function. If your text is already in UTF-8, you do not
need this function. In fact, applying this function to text that is
not encoded in ISO-8859-1 will most likely simply garble that text.
You may want iconv instead.
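This also answers "what's that ©?". A short sketch of what happens to the first letter when text that is already UTF-8 is treated as ISO-8859-1. It uses mb_convert_encoding with an ISO-8859-1 source as a stand-in for utf8_encode (which is deprecated as of PHP 8.2); the variable names are illustrative.

```php
<?php
$keheh = "\u{06A9}"; // first letter of $a, UTF-8 bytes DA A9

// Treat each byte as an ISO-8859-1 character and re-encode it as UTF-8.
$garbled = mb_convert_encoding($keheh, 'UTF-8', 'ISO-8859-1');

echo bin2hex($keheh), "\n";   // daa9
echo bin2hex($garbled), "\n"; // c39ac2a9
echo $garbled, "\n";          // Ú©
```

Byte DA becomes Ú (U+00DA) and byte A9 becomes © (U+00A9), so the copyright sign is just the second byte of the original letter.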

How can I strip out odd copy-pasted characters like: â€™

I have a PHP web app/tool that people end up copy-pasting data into. The data eventually turns into XML, in which certain characters turn into really odd characters once they are saved. I am not sure if "â€™" looked like that before it was copy-pasted. It might have just been interpreted that way. It might have just been a long "-". In any case, all these characters are really odd. Is there a way to strip them out easily?

That is because PHP uses 8-bit encoding but your data is most likely written in UTF-8. You will find Joel's article on Encoding very enlightening.
And for the short answer try just encoding it in UTF-8
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1");
echo 'Detected Encoding '.$enc."<br />";
echo 'Fixed Result: '.iconv($enc, "UTF-8", $text)."<br />";
?>
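For the specific sequence in the question, here is a sketch of how "â€™" can arise and how it can sometimes be reversed. It assumes the original character was a right single quotation mark (U+2019) whose UTF-8 bytes were read as Windows-1252; the variable names are illustrative and the mbstring extension is assumed.

```php
<?php
$original = "\u{2019}"; // right single quotation mark, UTF-8 bytes E2 80 99

// Misreading: each UTF-8 byte is taken as one Windows-1252 character.
$mojibake = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');
echo $mojibake, "\n"; // â€™

// Reversal: map the three characters back to their single bytes.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
var_dump($repaired === $original); // bool(true)
```

The reversal only works when every garbled character maps back to a single Windows-1252 byte, so it is a repair for known-damaged data rather than something to run on all input.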

It would probably be easier in your case to whitelist rather than blacklist; i.e., make a list of acceptable characters and strip the rest. You can do this easily using preg_replace:
// Strip every character that is NOT in the whitelist; add more allowed characters inside the brackets
$str = preg_replace("/[^A-Za-z0-9'._()-]/", '', $str);

When you see a character pair starting with an accented "A" or "a", it generally means you're seeing a character whose actual encoding is utf-8 displayed by software that thinks it's displaying iso-8859-1 (or Windows-1252).
If you're going to allow people to modify text in an XML document using tools that aren't XML-aware, the likelihood is that you will end up with characters encoded in iso-8859-1. That should be no problem provided the XML declaration at the start of the file is present and says that the encoding is iso-8859-1. But if there's no XML declaration, or if the encoding in the declaration is utf-8, you're going to end up with corrupt data.
You've asked about how to repair the data, but when you experience data corruption the focus should always be on prevention rather than repair.

Convert HTML numbered entities in php to unicode for use on iPhone

I'm creating a web service to transfer json to an iPhone app. I'm using json-framework to receive the json, and that works great because it automatically decodes things like "\u2018". The problem I'm running into is there doesn't seem to be a comprehensive way to get all the characters in one fell swoop.
For example html_entity_decode() gets most things, but it leaves behind stuff like &#8216; (‘). In order to catch these entities and convert them to something json-framework can use (e.g., \u2018), I'm using this code to convert the &# to \u, convert the numbers to hex, and then strip the ending semicolon.
function func($matches) {
return "\u" . dechex($matches[1]);
}
$json = preg_replace_callback("/&#(\d{4});/", "func", $json);
This is working for me at the moment, but it just doesn't feel right. It seems like I'm surely missing some characters that are going to come back to haunt me later.
Does anyone see flaws in this approach? Can anyone think of characters this approach will miss?
Any help would be most appreciated!

From where are you getting this HTML-encoded input? If you're scraping a web page you should be using an HTML parser, which will decode both entity and character references for you. If you are getting them in form input data, you've got a problem with encodings (make sure to serve the page containing the form as UTF-8 to avoid this).
If you must convert an HTML-encoded stretch of literal text to JSON, you should do it by HTML-decoding first then JSON-encoding, rather than attempting to go straight to JSON format (which will fail for a bunch of other characters that need escaping). Use the built-in decoder and encoder functions rather than trying to create JSON-encoded characters like \u.... yourself (as there are traps there).
$html= 'abc &quot; def &#1234; ghi &#x1234; jkl \n mno';
$raw= html_entity_decode($html, ENT_COMPAT, 'utf-8');
$json= json_encode($raw);
"abc \" def \u04d2 ghi \u1234 jkl \\n mno"

&#8216; is a decimal numbered entity, while I believe \u2018 is a hexadecimal representation. HTML also supports hexadecimal numbered entities (e.g., &#x2018;), but once you've found # as the entity prefix you're looking at either decimal or hex. There are also named entities (e.g., &amp;) but it doesn't sound like you need to cover those cases in your code.
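A short sketch showing that the built-in decoder already covers all three forms when it is told to produce UTF-8, so no digit-counting regex is needed. The sample string and variable names are illustrative; the flags ENT_QUOTES | ENT_HTML5 are one reasonable choice, not the only one.

```php
<?php
// Decimal, hexadecimal and named references in one string.
$html = '&#8216; &#x2018; &amp;';

// Decode to UTF-8; with a UTF-8 target, U+2018 is representable.
$decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// json_encode escapes non-ASCII characters as \uXXXX by default.
echo json_encode($decoded), "\n"; // "\u2018 \u2018 &"
```

Both numeric forms decode to the same character, and json_encode then produces the \u2018 escape that the iPhone side expects.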

$html_escape = "&quot;Love sex magic rise&quot; &amp; &#23609;&#30495;&#24076; &#8216;";
$utf8 = mb_convert_encoding($html_escape, 'UTF-8', 'HTML-ENTITIES');
echo json_encode(array(
"title" => $utf8
));
// {"title":"\"Love sex magic rise\" & \u5c39\u771f\u5e0c \u2018"}
This works well for me.
