I receive country names like this from a library: "\u00c3\u0096sterreich".
How do I convert this to Österreich?
Using PHP 7.3
This one is a lot trickier than it seems, but the code below appears to work.
First we run it through the standard regex for Unicode escape sequences, then pack each match as a binary string, convert the encoding, and finally decode. I can't promise this is the best way to do it, but it appears to work correctly as far as I can tell.
$str = '\u00c3\u0096sterreich';
$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
return utf8_decode(mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE'));
}, $str);
The Unicode code point for the character "Ö" is U+00D6.
In UTF-8, this character is encoded as the two bytes c3 and 96 (hex).
The representation \u00c3 \u0096 for these 2 bytes is a bit strange. Provided that the multibyte character is represented byte for byte, the following code can also be used.
$str = '\u00c3\u0096sterreich';
$str = preg_replace_callback(
'~\\\\u00([0-9a-f]{2})~i',
function($m){
return hex2bin($m[1]);
},
$str
);
//Test
$expect = "Österreich";
var_dump($str === $expect); //bool(true)
In case anyone else ends up here with a similar issue, I thought I'd try to shed some light on what's going on, because, as mentioned, this is a lot more complicated than it might look.
A string like \u00c3 refers to a Unicode code-point, in hexadecimal. Ö in the Unicode table is character 214, or \u00d6.
The 214 here is not directly related to how Ö is actually stored in any particular encoding (UTF-8, UTF-16, etc.); it's just an abstract number in the overall Unicode table that refers to that character. UTF-8, for instance, stores it in two bytes: 11000011 10010110 (195 150 in decimal, c3 96 in hexadecimal). There's a really good explanation of how this works in this answer, if you're interested in the finer details.
What appears to have happened in your string is that these two bytes have then been written back out in hexadecimal, each as a separate Unicode code point. \u00c3 is Ã, and \u0096 is a control character. This is why any standard methods of decoding this (json_decode, etc.) won't have worked - ultimately what you have is not a valid representation of the string Österreich.
The other answers should both work perfectly well, but this code snippet might better illustrate the issue with the format your library is using. It specifically matches two consecutive low Unicode code-points, recombines their decimal representations into an unsigned two-byte integer, and then returns the result.
$str = '\u00c3\u0096sterreich';
echo preg_replace_callback('/\\\\u00([0-9a-fA-F]{2})\\\\u00([0-9a-fA-F]{2})/', function ($match) {
$i = (hexdec($match[1]) << 8) + hexdec($match[2]);
return pack('n', $i); // 'n' = unsigned 16-bit integer, big-endian (two bytes)
}, $str);
Österreich
See https://3v4l.org/QtUuGD
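To make the double encoding concrete, here is a minimal sketch of what a standard decoder sees. It assumes the json and mbstring extensions are loaded, and the variable names are only for illustration. json_decode faithfully produces the two code points U+00C3 and U+0096; mapping each code point back to a single byte (UTF-8 to ISO-8859-1) then recovers the UTF-8 bytes of Ö.

```php
<?php
// The literal text from the question (single quotes: PHP does no escape processing).
$raw = '\u00c3\u0096sterreich';

// A standard decoder treats each \uXXXX as one Unicode code point.
$decoded = json_decode('"' . $raw . '"');
echo bin2hex(substr($decoded, 0, 4)), "\n"; // c383c296: U+00C3 and U+0096 encoded as UTF-8

// Each of those code points was really one byte, so map them back to single bytes.
$fixed = mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');
echo bin2hex(substr($fixed, 0, 2)), "\n";   // c396: the UTF-8 bytes of Ö
echo $fixed, "\n";
```

This only works if every escaped value in the input is below U+0100, which seems to be the case for this library's output but is an assumption.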
Today I was updating some code that takes data from a webpage and emails it to people for convenience. I noticed that whoever typed the text used a program with a different encoding, which produced a curly apostrophe (’), byte 0xD5 (213) in the Mac Roman set. When they uploaded it to their website, it came out as Õ. So in PHP I did this:
$parsed = str_ireplace("Õ", "'", $parsed);
I tested it, but it didn't seem to work. Can anyone help me? Thanks!
If this is just a single anomaly you're correcting, you can specify it with a hex escape sequence like:
$parsed = str_replace("\xD5", "'", $parsed);
The reason plain "Õ" isn't working is that the encoding of your PHP file doesn't represent Õ as the byte 0xD5. Strings are just byte sequences, and what you're giving str_ireplace doesn't match. (Well, that and str_ireplace is going to do funky things with it; str_replace is preferred here.)
A more appropriate way to handle the problem in general is to use iconv to convert the input string from whatever its source encoding is into the output encoding you need.
Examples:
$parsed = iconv('MACINTOSH', 'UTF-8', $parsed);
or
$parsed = iconv('MACINTOSH', 'ASCII//TRANSLIT', $parsed);
The //TRANSLIT here means that when a character can't be represented in the target charset, it'll be approximated by one or several similar-looking characters. There's a lot that ASCII (and other charsets) can't represent, so transliteration can come in handy if you're not outputting UTF-8 (which would be ideal).
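As a small illustration of the conversion, the sketch below feeds iconv a hand-written Mac Roman string. It assumes your iconv implementation recognizes the charset name MACINTOSH (glibc and GNU libiconv do; others may not), and the sample string is made up for the example.

```php
<?php
// "it’s" as Mac Roman bytes: the curly apostrophe is the single byte 0xD5.
$macRoman = "it\xD5s";

// Convert to UTF-8; the apostrophe becomes U+2019 (bytes E2 80 99).
$utf8 = iconv('MACINTOSH', 'UTF-8', $macRoman);
echo bin2hex($utf8), "\n"; // 6974e2809973

// Alternatively, approximate in ASCII. The exact replacement character
// depends on the iconv implementation, so it is not shown here.
$ascii = iconv('MACINTOSH', 'ASCII//TRANSLIT', $macRoman);
echo $ascii, "\n";
```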
From the answers to this question I tried to make my program safer by converting strings to hex and comparing those values instead of dangerously using strings directly from the user. I modified the code on that question to add a conversion:
function mssql_escape($data) {
if(is_numeric($data))
return $data;
$data = iconv("ISO-8859-1", "UTF-16", $data);
$unpacked = unpack('H*hex', $data);
return '0x' . $unpacked['hex'];
}
I do this because in my database I am using nvarchar instead of varchar. Now when I run through it on the php side, it comes up with
0xfeff00680065006c006c006f00200077006f0072006c00640021
Then I run the following query:
declare @test nvarchar(100);
set @test = 'hello world!';
select CONVERT(VARBINARY(MAX), @test);
It results in:
0x680065006C006C006F00200077006F0072006C0064002100
Now you'll notice those numbers are ALMOST the same. Other than the trailing zeros, the only difference is feff00. Why is that there? I realize all I would have to do is shift, but I'd really like to know WHY it's there instead of just making an assumption. Can anybody explain to me why php decides to throw feff00 (yellow!) in the front of my hex?
Well, Andrew, I seem to answer a lot of your questions. This link explains:
So the people were forced to come up with the bizarre convention of
storing a FE FF at the beginning of every Unicode string; this is
called a Unicode Byte Order Mark and if you are swapping your high and
low bytes it will look like a FF FE and the person reading your string
will know that they have to swap every other byte. Phew. Not every
Unicode string in the wild has a byte order mark at the beginning.
And Wikipedia explains:
If the 16-bit units are represented in big-endian byte order, this BOM
character will appear in the sequence of bytes as 0xFE followed by
0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text
display that expects the text to be ISO-8859-1.
If the 16-bit units
use little-endian order, the sequence of bytes will have 0xFF followed
by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a
text display that expects the text to be ISO-8859-1.
So the output you showed starts with FEFF, which means it's in big-endian order. Use UTF-16LE for little-endian, and SQL Server will understand that; naming the byte order explicitly also makes iconv omit the BOM. Stripping the first six hex digits only works by coincidence, as long as the high byte of every character is 00.
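A minimal sketch of that change, using a hypothetical function name (mssql_escape_utf16le) so it doesn't clash with the original. The only difference from the question's code is the target encoding.

```php
<?php
// Hypothetical variant of mssql_escape() from the question.
function mssql_escape_utf16le($data) {
    if (is_numeric($data)) {
        return $data;
    }
    // UTF-16LE fixes the byte order to little-endian and emits no byte order mark.
    $data = iconv('ISO-8859-1', 'UTF-16LE', $data);
    $unpacked = unpack('H*hex', $data);
    return '0x' . $unpacked['hex'];
}

echo mssql_escape_utf16le('hello world!'), "\n";
// 0x680065006c006c006f00200077006f0072006c0064002100
```

Apart from letter case, this matches what SQL Server returned for the nvarchar value in the question.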
How can I append a 16-bit Unicode character to a string in PHP?
$test = "testing" . (U + 199F);
From what I can see, \x only takes 8-bit characters, i.e. single bytes.
From the manual:
PHP only supports a 256-character set, and hence does not offer native Unicode support.
You could enter a manually-encoded UTF-8 sequence, I suppose.
You can also type out UCS4 as byte sequence and use iconv("UTF-32LE", "UTF-8", $str); to convert it into UTF-8 for further processing. You just can't input the codepoint as a 32-bit code unit in one go.
Unicode characters don't directly exist in PHP(*), but you can deal with strings containing bytes that represent characters in UTF-8 encoding. Here's one way of converting a numeric character code point to UTF-8:
function unichr($i) {
return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
$test= 'testing'.unichr(0x199F);
(*: and ‘16-bit’ Unicode characters don't exist at all; Unicode has code points way beyond U+FFFF. There are 16-bit ‘code units’ in UTF-16, but that's an ugly encoding you're unlikely to meet in PHP.)
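For comparison, here is a short sketch checking the iconv-based helper against two alternatives that newer PHP versions offer: mb_chr() (PHP 7.2+, mbstring extension) and the "\u{...}" string escape (PHP 7.0+). Whether these are available depends on your PHP version, which is an assumption here.

```php
<?php
// The helper from the answer above, repeated so the snippet runs on its own.
function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

$viaIconv  = unichr(0x199F);
$viaMbChr  = mb_chr(0x199F, 'UTF-8'); // PHP 7.2+ with mbstring
$viaEscape = "\u{199F}";              // PHP 7.0+ double-quoted string escape

echo bin2hex($viaIconv), "\n"; // e1a69f: the three UTF-8 bytes of U+199F
var_dump($viaIconv === $viaMbChr && $viaMbChr === $viaEscape); // bool(true)

$test = 'testing' . $viaIconv;
```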
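Because a Unicode character in UTF-8 is just a multibyte sequence and PHP strings are byte sequences, you can write a multibyte character as several single-byte escapes, as long as they are the UTF-8 encoding of the code point (E1 A6 9F for U+199F, not the raw code point bytes 19 9F) :)
$test = "testing\xE1\xA6\x9F";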
Try (PHP 7.0 or later, which added the \u{...} escape in double-quoted strings):
$test = "testing" . "\u{199F}";
I'm using the mb_detect_encoding() function to check if a string contains non latin1 (ISO-8859-1) characters.
Since Japanese isn't part of latin1 I'm using it as the text within the test string, yet when the string is passed in to the function it seems to return ok for ISO-8859-1. Example code:
$str = "これは日本語のテキストです。読めますか";
$res = mb_detect_encoding($str,"ISO-8859-1",true);
print $res;
I've tried using 'ASCII' instead of 'ISO-8859-1', which correctly returns false. Is anyone able to explain the discrepancy?
I wanted to be funny and say hexdump could explain it:
0000000 81e3 e393 8c82 81e3 e6af a597 9ce6 e8ac
0000010 9eaa 81e3 e3ae 8683 82e3 e3ad b982 83e3
0000020 e388 a781 81e3 e399 8280 aae8 e3ad 8182
0000030 81e3 e3be 9981 81e3 0a8b
But alas, that's quite the opposite.
In ISO-8859-1 practically only the code points \x80-\x9F are invalid. But these are exactly the byte values your UTF-8 representation of the Japanese characters occupy.
Anyway, mb_detect_encoding uses heuristics, and it fails in this example. My conjecture is that it mistakes ISO-8859-1 for -15 or worse: CP1252, the incompatible Windows charset, which allows said code points.
I would suggest using a workaround and testing it yourself. The only check that ensures a byte in a string is certainly not a Latin-1 character is:
preg_match('/[\x7F-\x9F]/', $str);
I'm linking to the German Wikipedia, because their article shows the differences best: http://de.wikipedia.org/wiki/ISO_8859-1
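The byte-range check can be wrapped in a small helper, sketched below with a hypothetical function name (has_non_latin1_bytes). The test strings are written as explicit byte escapes so the result doesn't depend on the encoding of the source file.

```php
<?php
// Hypothetical helper: true if $str contains a byte that ISO-8859-1 leaves undefined.
function has_non_latin1_bytes($str) {
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/[\x7F-\x9F]/', $str) === 1;
}

$japaneseUtf8 = "\xE3\x81\x93\xE3\x82\x8C"; // "これ" encoded as UTF-8
$germanLatin1 = "\xD6sterreich";            // "Österreich" encoded as ISO-8859-1

var_dump(has_non_latin1_bytes($japaneseUtf8)); // bool(true)
var_dump(has_non_latin1_bytes($germanLatin1)); // bool(false)
```

Note that this is a one-way check: some UTF-8 text (for example "é", bytes C3 A9) contains no byte in that range, so a false result doesn't prove the string is Latin-1.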
I'm having a problem with UTF-8 string comparison that I can't figure out, and it's starting to give me a headache. Please help me out.
Basically I have this string from a xml document encoded in UTF8: 'Mina Tidigare anställningar'
I compare that string with exactly the same string, which I typed myself: 'Mina Tidigare anställningar' (also in UTF8). And the result is FALSE!!!
I have no idea why. It is so strange. Can someone help me out?
This seems somewhat relevant. To simplify, there are several ways to get the same text in Unicode (and therefore UTF8): for example, this: ř can be written as one character ř or as two characters: r and the combining ˇ.
Your best bet would be the Normalizer class (part of the intl extension) - normalize both strings to the same normalization form and compare the results.
In one of the comments, you show these hex representations of the strings:
4d696e61205469646967617265 20 616e7374 c3a4 6c6c6e696e676172 // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
^^-----------------^^^^1 ^^^^^^2
Note the parts I marked, apparently there are two parts to this problem.
For the first, observe this question on the meaning of byte sequence "c2a0" - for some reason, your typing is translated to a non-breakable space where the XML file has a normal space. Note that there's a normal space in both cases after "Mina". Not sure what to do about that in PHP, except to replace all whitespace with a normal space.
As to the second, that is the case I outlined above: c3a4 is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS" - one character, two bytes), whereas 61 is a (U+0061 "LATIN SMALL LETTER A" - one character, one byte) and cc88 would be the combining umlaut " (U+0308 "COMBINING DIAERESIS" - two characters, three bytes). Here, the normalization library should be useful.
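Both fixes can be sketched together as follows. The helper name canonical_text is hypothetical, the Normalizer class requires the intl extension, and the two strings are rebuilt from the hex dumps above.

```php
<?php
// Hypothetical helper; Normalizer comes from the intl extension.
function canonical_text($s) {
    // Replace the non-breaking space (U+00A0, bytes C2 A0) with a normal space.
    $s = str_replace("\xC2\xA0", ' ', $s);
    // Compose "a" + U+0308 into the single code point U+00E4 (form NFC).
    return Normalizer::normalize($s, Normalizer::FORM_C);
}

$fromXml = "Mina Tidigare anst\xC3\xA4llningar";         // precomposed ä, normal space
$typed   = "Mina Tidigare\xC2\xA0ansta\xCC\x88llningar"; // NBSP, a + combining diaeresis

var_dump($fromXml === $typed);                                 // bool(false)
var_dump(canonical_text($fromXml) === canonical_text($typed)); // bool(true)
```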
Let's try blindly: maybe the two UTF-8 strings do not have the same underlying representation (you can get accented characters as a sequence or as a single character). You should give us a hex dump of both UTF-8 strings, and someone may be able to help.
// If $s is not valid UTF-8 (strict check), assume ISO-8859-1 and convert it.
if (mb_detect_encoding($s, 'UTF-8', true) !== 'UTF-8') {
    $s = utf8_encode($s);
}