PHP Utf8 Decoding Issue

PHP Utf8 Decoding Issue - php

I have the following address line: Praha 5, Staré Město,
I need to use utf8_decode() function on this string before I can write it to a PDF file (using domPDF lib).
However, the php utf8 decode function for the above address line appears incorrect (or rather, incomplete).
The following code:
<?php echo utf8_decode('Praha 5, Staré Město,'); ?>
Produces this:
Praha 5, Staré M?sto,
Any idea why ě is not getting decoded?

utf8_decode converts the string from a UTF-8 encoding to ISO-8859-1, a.k.a. "Latin-1".
The Latin-1 encoding cannot represent the letter "ě". It's that simple.
"Decode" is a total misnomer, it does the same as iconv('UTF-8', 'ISO-8859-1', $string).
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

I wound up using a home-grown UTF-8 / UTF-16 decoding function (convert to &#number; representations), I haven't found any pattern to why UTF-8 isn't detected, I suspect it's because the "encoded-as" sequence isn't always exactly in the same position in the string returned. You might do some additional checking on that.
Three-character UTF-8 indicator: $startutf8 = chr(0xEF).chr(187).chr(191); (if you see this ANYWHERE, not just first three characters, the string is UTF-8 encoded)
Decode according to UTF-8 rules; this replaced an earlier version which chugged through byte by byte:using
function charset_decode_utf_8 ($string) {
/* Only do the slow convert if there are 8-bit characters */
/* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string))
return $string;
// decode three byte unicode characters
$string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
"'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",
$string);
// decode two byte unicode characters
$string = preg_replace("/([\300-\337])([\200-\277])/e",
"'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",
$string);
return $string;
}

Problem is in your PHP file encoding , save your file in UTF-8 encoding , then even no need to use utf8_decode , if you get these data 'Praha 5, Staré Město,' from database , better change it charset to UTF-8

you don't need that (#Rajeev :this string is automatically detected as utf-8 encoded :
echo mb_detect_encoding('Praha 5, Staré Město,');
will always return UTF-8.).
You'd rather see :
https://code.google.com/p/dompdf/wiki/CPDFUnicode

Related

PHP, convert string into UTF-8 and then hexadecimal

In PHP, I want to convert a string which contains non-ASCII characters into a sequence of hexadecimal numbers which represents the UTF-8 encoding of these characters. For instance, given this:
$text = 'ąćę';
I need to produce this:
C4=84=C4=87=C4=99
How do I do that?

As your question is written, and assuming that your text is properly UTF-8 encoded to start with, this should work:
$text = 'ąćę';
$result = implode('=', str_split(strtoupper(bin2hex($text)), 2));
If your text is not UTF-8, but some other encoding, then you can use
$utf8 = mb_convert_encoding($text, 'UTF-8', $yourEncoding);
to get it into UTF-8, where $yourEncoding is some other character encoding like 'ISO-8859-1'.
This works because in PHP, strings are just arrays of bytes. So as long as your text is encoded properly to start with, you don't have to do anything special to treat it as bytes. In fact, this code will work for any character encoding you want without modification.
Now, if you want to do quoted-printable, then that's another story. You could try using the function quoted_printable_encode (requires PHP 5.3 or higher).

how to get unicode character from a unicode string in php

I want to get a single unicode chatacter from a unicode string.
for example:-
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo $str[0];
output is:- �
but i want to get char 'प' at 0 index of the string.
plz help me how to get char 'प' instead of � .

As #deceze writes, you need to use mb_substr in order to get a character, instead of just a byte. In addition, you need to set the internal encoding with mb_internal_encoding. Assuming that the encoding of your .php file is UTF-8, the following should work:
mb_internal_encoding('utf-8');
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo mb_substr($str, 0, 1);

PHP's default $str[x] notation operates on bytes, so you're just getting the first part of a multibyte character. To extract entire encoding aware byte sequences for whole characters, you need to use mb_substr.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

working with UTF-8 encoded text

I have a problem. I need to find some utf-8 characters from my text file and output them, but it doens't output the letters, instead it outputs "?", questionmarks...
ini_set( 'default_charset', 'UTF-8' );
$homepage = file_get_contents('t1.txt');
echo $homepage;
echo "\t";
echo "\t!!!!!!!!!!!!";
echo $homepage[14];
so, here it is very strange, if I'm using exsisting index it outputs nothing, but if I put
echo $homepage[35];
it outputs "?",
but my $homepage string is only 30 charecters long, what's wrong?
It is very strange, it takes the string from file correctly, and outputs it correctly, but when I call for the character by index, it doesn't work.. here is what's in my text file:
advhasgdvgv
олыолоываи
ouhh
and it outputs it correctly, when I just call $homepage, but when $homepage[14] it doesn't work.Here is output:
advhasgdvgv олыолоываи ouhh !!!!!!!!!!!!

Try mb_convert_encoding, and see if that fixes the problem.
http://www.php.net/manual/en/function.mb-convert-encoding.php
string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding ] )
$homepage = mb_convert_encoding(
file_get_contents('t1.txt'),
"UTF-8"
);
You should also check on the encodings of both the PHP file and the text file you have there.

I used this approach for dealing with UTF-8:
<?php
$string = 'ئاکام';//my name
mb_internal_encoding("UTF-8");
$mystring = mb_substr($string,0,1);ئ
//without mb_internal_encoding the return was Ø
echo $mystring;
?>
I also saved all files (Encoding as UTF-8)

Unicode characters have more than 1 byte per letter, so you access them you would have to do:
echo $homepage[30] . $homepage[31];
> и
But that is assuming the character is only 2 bytes, but there could be more; so a more general solution would be:
function charAt($str, $pos, $encoding = "UTF-8")
{
return mb_substr($str, $pos, 1, $encoding);
}

PHP does not really support UTF-8 in strings, which means that accessing text[n] will get the n'th byte instead of n'th char. UTF-8 chars might have 1-4 bytes in them, which means that you simply cannot access them by index using PHP, as you don't know what index a char starts from. Also, you obviously cannot retrieve a char using text[n], because it might need multiple bytes.
Depending on what you want, you can either convert the string to ISO 8859 using utf8_decode(), or use some UTF-8-aware mechanism to iterate through the string from the beginning and extract the bytes you want/need.
Be aware that Linux and Windows versions of PHP might produce different output on certain conversions, such as mb_strtoupper(), and that not all regex functions support UTF-8.

utf8_encode function purpose

Supposed that im encoding my files with UTF-8.
Within PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); //Do i need this step?
if(preg_match('/ぁ/u',$string))
//Do if match...
Its that string really UTF-8 without the utf8_encode() function?
If you encode your files with UTF-8 dont need this function?

If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.

PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variables content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a questionmark or similar. So before you go on, you already see that something might not be as intended. (Turned out it was a missing font on my end)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions encoding does play a role. That's why there is the u modifier to signal you want to make the expression work on and with unicode encoded data. However, if the data is already unicode encoded, you don't need to change it into unicode before you use preg_match. However with your code example, regular expressions are not necessary at all and a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff

Your string is not a utf-8 character so it can't preg match it, hence why you need to utf8_encode it. Try encoding the PHP file as utf-8 (use something like Notepad++) and it may work without it.

Summary:
The utf8_encode() function will encode every byte from a given string to UTF-8.
No matter what encoding has been used previously to store the file.
It's purpose is encode strings¹ that arent UTF-8 yet.
1.- The correctly use of this function is giving as a parameter an ISO-8859-1 string.
Why? Because Unicode and ISO-8859-1 have the same characters at same positions.
[Char][Value/Position] [Encoded Value/Position]
[Windows-1252] [€][80] ----> [C2|80] Is this the UTF-8 encoded value/position of the [€]? No
[ISO-8859-1] [¢][A2] ----> [C2|A2] Is this the UTF-8 encoded value/position of the [¢]? Yes
The function seems that work with another encodings: it work if the string to encode contains only characters with same
values that the ISO-8859-1 encoding (e.g On Windows-1252 00-EF & A0-FF positions).
We should take into account that if the function receive an UTF-8 string (A file encoded as a UTF-8) will encode again that UTF-8 string and will make garbage.

PHP UTF-8 encoding problem of U+009A

I have problems displaying the Unicode character of U+009A.
It should look like "š", but instead looks like a rectangular block with the numbers 009A inside.
Converting it to the entity "" displays the character correctly, but I don't want to store entities in the database.
The encoding of the webpage is in UTF-8.
The character is URL-encoded as "%C2%9A".
Reproduce:
# php -E 'echo urldecode("%C2%9A");' > /tmp/test ; less /tmp/test
This gives me <U+009A> in less or <9A> in vim.

The Unicode character "š" is U+0161, not U+009A
I suspect that it's 0x9A in another character set.
The box with 009A is usually shown when you don't have a font installed with that character.

If you’re using UTF-8 as your input encoding, then you can simply use the plain š. Or you could use the hexadecimal representation "\xC2\x9A" (in double quotes) that’s independent from the input encoding. Or utf8_encode("\x9A") since the first 256 characters of Unicode and ISO 8859-1 are identical.

If I do a hexdump of the output of echo urldecode("%C2%9A"); I get c2 9a, which is the correct UTF-8 encoding for character 0x9a.
You get that same encoding from the output of utf8_encode("\x9A")
When I try to view Unicode char 0x9a, I get a square box too - suspect it's not the char you think it should be (Aha: as Azquelt has posted, unicode character "š" is U+0161, not U+009A)

Codeigniter have utf-8 character input data save issue in some hosting servers like Etisalat. system/core/Utf8.php have function to detect illegal char in input data(post/get). In some cases utf-8 char is consider as illegal and save function will fail. For avoid data saving issue do the following in clean_string() function of Utf8.php at line 85.
$str = !mb_detect_encoding($str, 'UTF-8', TRUE) ? utf8_encode($str) : $str;
$str = #iconv('UTF-8', 'UTF-8//IGNORE', $str);

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.