PHP UTF-8 encoding problem of U+009A

PHP UTF-8 encoding problem of U+009A - php

I have problems displaying the Unicode character of U+009A.
It should look like "š", but instead looks like a rectangular block with the numbers 009A inside.
Converting it to the entity "" displays the character correctly, but I don't want to store entities in the database.
The encoding of the webpage is in UTF-8.
The character is URL-encoded as "%C2%9A".
Reproduce:
# php -E 'echo urldecode("%C2%9A");' > /tmp/test ; less /tmp/test
This gives me <U+009A> in less or <9A> in vim.

The Unicode character "š" is U+0161, not U+009A
I suspect that it's 0x9A in another character set.
The box with 009A is usually shown when you don't have a font installed with that character.

If you’re using UTF-8 as your input encoding, then you can simply use the plain š. Or you could use the hexadecimal representation "\xC2\x9A" (in double quotes) that’s independent from the input encoding. Or utf8_encode("\x9A") since the first 256 characters of Unicode and ISO 8859-1 are identical.

If I do a hexdump of the output of echo urldecode("%C2%9A"); I get c2 9a, which is the correct UTF-8 encoding for character 0x9a.
You get that same encoding from the output of utf8_encode("\x9A")
When I try to view Unicode char 0x9a, I get a square box too - suspect it's not the char you think it should be (Aha: as Azquelt has posted, unicode character "š" is U+0161, not U+009A)

Codeigniter have utf-8 character input data save issue in some hosting servers like Etisalat. system/core/Utf8.php have function to detect illegal char in input data(post/get). In some cases utf-8 char is consider as illegal and save function will fail. For avoid data saving issue do the following in clean_string() function of Utf8.php at line 85.
$str = !mb_detect_encoding($str, 'UTF-8', TRUE) ? utf8_encode($str) : $str;
$str = #iconv('UTF-8', 'UTF-8//IGNORE', $str);

Related

how to get unicode character from a unicode string in php

I want to get a single unicode chatacter from a unicode string.
for example:-
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo $str[0];
output is:- �
but i want to get char 'प' at 0 index of the string.
plz help me how to get char 'प' instead of � .

As #deceze writes, you need to use mb_substr in order to get a character, instead of just a byte. In addition, you need to set the internal encoding with mb_internal_encoding. Assuming that the encoding of your .php file is UTF-8, the following should work:
mb_internal_encoding('utf-8');
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo mb_substr($str, 0, 1);

PHP's default $str[x] notation operates on bytes, so you're just getting the first part of a multibyte character. To extract entire encoding aware byte sequences for whole characters, you need to use mb_substr.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

Bullet "•" in XML

Similar to this question
I am consuming an XML product that has some illegal chars in it. I seriously doubt I can get them to fix the problem, but I will try. In the meantime I'd like a work-around.
The problem is that it contains a bullet. It renders as "â€¢" in my source. I've tried a few encoding conversions but have not found a combination that works. (I'm not accustomed to even thinking about my encoding type, so I'm out of my element here.) So, I tried the below and it seems that str_replace does not recognize the "•". (it renders as tall block in my text editor)
You can see the commented lines where I tried a few different things.
I tried str replace on "â€¢" first, then tweaked around and this is my latest:
// deal with bullets in XML.
$bullet="•"; //this was copied and pasted from transliterated text.
//$data=iconv( "UTF-8", "windows-1252//TRANSLIT", $data ); //transliterate the text:
//$data=str_replace($bullet,'•',$data); // replace the bullet char
$data=str_replace($bullet,' - ',$data); // replace the bullet char
//$data=iconv( "windows-1252", "UTF-8", $data ); // return the text to utf-8 encoding.
Any ideas how to strip or replace this char? If there's a function to pre-clean the XML, that'd be great, and I wouldn't have to worry about it.

XML by definition has no illegal chars. If some string contains a character that is not part of XML, then that string is not XML by definition.
The character you're concerned about is part of Unicode. As XML is based on Unicode, this is good news. So let's name what you aim for:
Unicode Character 'BULLET' (U+2022)
So you now say it renders as â€¢. Because U+2022 is encoded as 0xE2 0x80 0xA2 in UTF-8, it is a more or less safe assumption to say that you take an UTF-8 encoded string (that is the default encoding used in XML btw) but command the software that renders it to treat it as some single-byte encoding hence turning the single code-point into three different characters:
Unicode Character 'LATIN SMALL LETTER A WITH CIRCUMFLEX' (U+00E2)
Unicode Character 'EURO SIGN' (U+20AC)
Unicode Character 'CENT SIGN' (U+00A2)
Instead you need to command the rendering application to use the UTF-8 encoding. That should immediately solve your issue. So find the place where you introduce the wrong encoding, you will likely not need to re-encode it, just to properly hint the encoding.
If you wonder which single-byte character-encodings have these three Unicode Characters at the corresponding bytes (0xE2 0x80 0xA2), here is a list. I have highlighted the most popular one of these:
ISO-8859-15 (Latin 9)
OEM 858 (Multilingual Latin I + Euro)
Windows 1252 (Latin I)
Windows 1254 (Turkish)
Windows 1256 (Arabic)
Windows 1258 (Vietnam)

utf8_encode function purpose

Supposed that im encoding my files with UTF-8.
Within PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); //Do i need this step?
if(preg_match('/ぁ/u',$string))
//Do if match...
Its that string really UTF-8 without the utf8_encode() function?
If you encode your files with UTF-8 dont need this function?

If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.

PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variables content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a questionmark or similar. So before you go on, you already see that something might not be as intended. (Turned out it was a missing font on my end)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions encoding does play a role. That's why there is the u modifier to signal you want to make the expression work on and with unicode encoded data. However, if the data is already unicode encoded, you don't need to change it into unicode before you use preg_match. However with your code example, regular expressions are not necessary at all and a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff

Your string is not a utf-8 character so it can't preg match it, hence why you need to utf8_encode it. Try encoding the PHP file as utf-8 (use something like Notepad++) and it may work without it.

Summary:
The utf8_encode() function will encode every byte from a given string to UTF-8.
No matter what encoding has been used previously to store the file.
It's purpose is encode strings¹ that arent UTF-8 yet.
1.- The correctly use of this function is giving as a parameter an ISO-8859-1 string.
Why? Because Unicode and ISO-8859-1 have the same characters at same positions.
[Char][Value/Position] [Encoded Value/Position]
[Windows-1252] [€][80] ----> [C2|80] Is this the UTF-8 encoded value/position of the [€]? No
[ISO-8859-1] [¢][A2] ----> [C2|A2] Is this the UTF-8 encoded value/position of the [¢]? Yes
The function seems that work with another encodings: it work if the string to encode contains only characters with same
values that the ISO-8859-1 encoding (e.g On Windows-1252 00-EF & A0-FF positions).
We should take into account that if the function receive an UTF-8 string (A file encoded as a UTF-8) will encode again that UTF-8 string and will make garbage.

DomDocument and special characters written in two bytes

I have a web application, written in PHP, based on UTF-8 (both PHP and MySQL are on UTF-8). Everything is beautiful - no problem with special characters.
However, I had to build an export to XML with encoding ISO-8859-2 (Polish), so I picked DomDocument because it has built in encoding conversion.
But when I had sent the XML to my partner for validation, he said that one of tags have too many characters. It was strange because it had the specific maximum number of characters. Then I have opened the file in HexEditor and saw that every special character has two bytes.
I have tried to convert the result with iconv and mb_convert_encoding.
Iconv says:
iconv() [<a href='function.iconv'>function.iconv</a>]: Detected an illegal character in input string in file application/controllers/report/export.php at 169
mb_convert_encoding is simply deleting all special characters and result is encoded in ASCII.
Is there a way to convert the output of DomDocument to one-byte characters?
Thanks in advance!

One problem when switching between encodings is that, even with transliteration, not all characters are representable in other encodings in a single byte.
For example, consider the EURO SIGN, a character that takes 3 bytes when encoded in UTF-8. If you look at the charset support page, you can see that ISO-8859-2 is not listed.
Since there is not a single character to represent the euro sign, then transliteration does its best to still represent it in the output
echo iconv( 'UTF-8', 'ISO-8859-2//TRANSLIT', '€' ); // EUR
In this example, we still end up with 3 bytes to represent the euro sign after transliterating.
EDIT
P.S. The NOTICE level error you're getting is because you executed iconv() without the transliteration flag. And as I highlighted above, the EURO SIGN doesn't exist in ISO-8859-2, so you clearly have at least one character in your data that also doesn't exist in ISO-8859-2, so you'll have to use transliteration. Just know that it doesn't guarantee that you'll get down to 1 byte/char.

Replace unicode character

I am trying to replace a certain character in a string with another. They are quite obscure latin characters. I want to replace character (hex) 259 with 4d9, so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
**EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.

U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);

A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that your byte stream occurs in the $string. I mention this because your hex string begins with a low-byte, so you'll need to make sure your $string is not UTF8 encoded.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.