Wrong output when using array indexing on UTF-8 string - php

I have encountered a problem when using a UTF-8 string. I want to read a single character from the string, for example:
$string = "üÜöÖäÄ";
echo $string[0];
I am expecting to see ü, but I get � -- why?

Use mb_substr($string, 0, 1, 'utf-8') to get the character instead.
What happens in your code is that the expression $string[0] gets the first byte of the UTF-8 encoded representation of your string because PHP strings are effectively arrays of bytes (PHP does not internally recognize encodings).
Since the first character in your string is encoded in more than one byte (per the UTF-8 encoding rules), you are getting only part of the character. Furthermore, those rules make the byte you retrieve invalid as a character on its own, which is why you see the replacement character (�).
mb_substr knows the encoding rules, so it will not naively give you back just one byte; it will get as many as needed to encode the first character.
You can see that $string[0] gives you back just one byte with:
$string = "üÜöÖäÄ";
echo strlen($string[0]);
While mb_substr gives you back two bytes:
$string = "üÜöÖäÄ";
echo strlen(mb_substr($string, 0, 1, 'utf-8'));
And these two bytes are in fact just one character (you need to use mb_strlen for this):
$string = "üÜöÖäÄ";
echo mb_strlen(mb_substr($string, 0, 1, 'utf-8'), 'utf-8');
Finally, as Marwelln points out below, the situation becomes more tolerable if you use mb_internal_encoding to get rid of the 'utf-8' redundancy:
$string = "üÜöÖäÄ";
mb_internal_encoding('utf-8');
echo mb_strlen(mb_substr($string, 0, 1));
You can see most of the above in action.
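As a small sketch of the byte-versus-character difference described above, the following prints the raw bytes behind each access and then splits the string on character boundaries. It assumes the file is saved as UTF-8 with precomposed letters and that the mbstring extension is loaded; the variable name $chars is only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8 (precomposed letters) and mbstring is available.
$string = "üÜöÖäÄ";

// Byte view: "ü" is stored as the two bytes c3 bc.
echo bin2hex($string[0]), "\n";                        // c3   (first byte only)
echo bin2hex(mb_substr($string, 0, 1, 'UTF-8')), "\n"; // c3bc (the whole character)

// Character view: with the /u modifier, an empty pattern splits between characters.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
echo $chars[0], "\n";     // ü
echo count($chars), "\n"; // 6
```

Once the string is split this way, $chars[0] behaves the way $string[0] was expected to.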

Related

Trouble decoding some special characters ’ “ ”

I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And actually if I take this directly and use htmlentities to encode it I see different characters to begin with.
Thi&rsquo;s i&quot;s a&rsquo;n e&rdquo;xa&ldquo;mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non standard, do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see the characters I'm looking after are not listed. When I test a few characters from this table they decode just fine. I guess the list in php is incomplete by default?
If you check the unicode standard for the characters you're referring to: http://www.unicode.org/charts/PDF/U0080.pdf
You would see that all the codepoints you have in your string do not have representable glyphs and are control characters.
Which means that it is expected that they are rendered as empty squares (or dots, depending on how your renderer treats those).
If it works for someone somewhere - it's a non-standard behaviour, which one must not rely on, since it is, well, non-standard.
Apparently the text you have has the initial encoding of cp1250, so you either should treat it accordingly, or re-encode entities manually:
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
return iconv('cp1250', 'utf-8', chr($m[1]));
}, $str);
echo $str;
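A slightly more defensive variant of the same idea might look like the sketch below. The function name decode_numeric_entities is hypothetical, and treating the values 128 to 159 as cp1250 bytes is an assumption carried over from the answer above; other code points are converted with mb_chr (PHP 7.2+), and anything that cannot be converted is left untouched.

```php
<?php
// Hypothetical helper: decodes decimal numeric entities such as &#146;.
// Assumption: values 128-159 are really cp1250 bytes, not Unicode code points.
function decode_numeric_entities(string $str): string
{
    return preg_replace_callback('/&#([0-9]+);/', function (array $m): string {
        $code = (int) $m[1];
        if ($code >= 128 && $code <= 159) {
            // Reinterpret the number as a single cp1250 byte.
            $char = @iconv('CP1250', 'UTF-8', chr($code));
        } else {
            $char = mb_chr($code, 'UTF-8');
        }
        // Keep the original entity if the conversion failed.
        return ($char === false || $char === '') ? $m[0] : $char;
    }, $str);
}

echo decode_numeric_entities('Thi&#146;s e&#148;xa&#147;mple'), "\n";
```

If the source turns out to be cp1252 rather than cp1250, only the encoding name passed to iconv would need to change.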

c2a0 and 20 string comparison

I have two UTF-8 strings:
one is saved as a variable in a PHP file (the file is saved in UTF-8)
the other is extracted from an external source with a regular expression.
When I compare these two space-separated strings, which look identical, the result is false, meaning that they are not the same.
The space in the string I saved as a variable is rendered as 20 by bin2hex (an ASCII-encoded space).
The space in the string I got externally, processed with mb_strtolower($string, 'utf-8'), is rendered as c2a0 by bin2hex (a UTF-8 space).
My questions are:
Why is the string I saved in UTF-8 not fully encoded as UTF-8 (its space is plain ASCII)?
How do I get rid of this problem?
As said in the comments, c2a0 is a no-break space (U+00A0) and 20 is a normal space. Both are valid UTF-8; they are simply two different characters.
Since you can see the problem in bin2hex, you could do:
$str = hex2bin(str_replace('c2a0', '20', bin2hex($str)));
or, to put that another way:
$str = preg_replace('~\xc2\xa0~', ' ', $str);
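To see the mismatch and the fix side by side, here is a minimal sketch. The sample strings are made up, and the "\u{00A0}" escape requires PHP 7 or later.

```php
<?php
// Made-up sample data: one string with a plain space, one with a no-break space.
$fromFile = "hello world";
$external = "hello\u{00A0}world";

echo bin2hex(' '), "\n";         // 20
echo bin2hex("\u{00A0}"), "\n";  // c2a0

var_dump($fromFile === $external);   // bool(false)

// Normalize no-break spaces to ordinary spaces before comparing.
$normalized = str_replace("\u{00A0}", ' ', $external);
var_dump($fromFile === $normalized); // bool(true)
```

str_replace works here because it compares raw bytes, so no multibyte-aware function is needed for this particular substitution.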

how to get unicode character from a unicode string in php

I want to get a single Unicode character from a Unicode string.
For example:
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo $str[0];
The output is: �
But I want to get the character 'प' at index 0 of the string.
Please help me get the character 'प' instead of �.
As @deceze writes, you need to use mb_substr in order to get a character, instead of just a byte. In addition, you need to set the internal encoding with mb_internal_encoding. Assuming that the encoding of your .php file is UTF-8, the following should work:
mb_internal_encoding('utf-8');
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo mb_substr($str, 0, 1);
PHP's default $str[x] notation operates on bytes, so you're just getting the first byte of a multibyte character. To extract the whole, encoding-aware byte sequence of a character, you need to use mb_substr.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
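As an illustration of what mb_substr counts, the sketch below walks the first word of the example string. It assumes a UTF-8 source file and the mbstring extension. Note that mb_substr works on code points, so a combining sign such as the virama (्) is returned as its own "character".

```php
<?php
// Assumption: UTF-8 source file, mbstring loaded.
$word = "पर्वत";

echo strlen($word), "\n";             // 15 (bytes, 3 per code point here)
echo mb_strlen($word, 'UTF-8'), "\n"; // 5  (code points)

// Print each code point next to its UTF-8 bytes.
for ($i = 0, $n = mb_strlen($word, 'UTF-8'); $i < $n; $i++) {
    $ch = mb_substr($word, $i, 1, 'UTF-8');
    echo $i, ': ', $ch, ' ', bin2hex($ch), "\n";
}
```

The first line of the loop output shows प with the bytes e0a4aa, which is why $str[0] alone (the single byte e0) cannot be displayed.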

PHP Utf8 Decoding Issue

I have the following address line: Praha 5, Staré Město,
I need to use utf8_decode() function on this string before I can write it to a PDF file (using domPDF lib).
However, the php utf8 decode function for the above address line appears incorrect (or rather, incomplete).
The following code:
<?php echo utf8_decode('Praha 5, Staré Město,'); ?>
Produces this:
Praha 5, Staré M?sto,
Any idea why ě is not getting decoded?
utf8_decode converts the string from a UTF-8 encoding to ISO-8859-1, a.k.a. "Latin-1".
The Latin-1 encoding cannot represent the letter "ě". It's that simple.
"Decode" is a total misnomer, it does the same as iconv('UTF-8', 'ISO-8859-1', $string).
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
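The point about Latin-1 can be checked with a round trip. This is only a sketch: it uses mb_convert_encoding, and it picks ISO-8859-2 (Latin-2) as an example of an encoding that does contain "ě". Whether that encoding suits the PDF library is not something this example can tell you.

```php
<?php
// The address from the question, written with escapes so the bytes are unambiguous.
$address = "Praha 5, Star\u{00E9} M\u{011B}sto,";

$latin1 = mb_convert_encoding($address, 'ISO-8859-1', 'UTF-8'); // "ě" cannot be represented
$latin2 = mb_convert_encoding($address, 'ISO-8859-2', 'UTF-8'); // "ě" fits

// Convert back and compare with the original.
var_dump(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $address); // bool(false)
var_dump(mb_convert_encoding($latin2, 'UTF-8', 'ISO-8859-2') === $address); // bool(true)
```

The Latin-1 round trip fails because the unrepresentable letter is replaced by a substitute character during the first conversion, and that loss cannot be undone.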
I wound up using a home-grown UTF-8 / UTF-16 decoding function (it converts to &#number; representations). I haven't found any pattern to why UTF-8 isn't detected; I suspect it's because the "encoded-as" sequence isn't always in exactly the same position in the string returned. You might do some additional checking on that.
Three-byte UTF-8 indicator (the byte order mark): $startutf8 = chr(0xEF).chr(187).chr(191); (if you see this ANYWHERE, not just in the first three bytes, the string is UTF-8 encoded)
Decode according to UTF-8 rules; this replaced an earlier version which chugged through the string byte by byte:
function charset_decode_utf_8($string) {
    // Only do the slow conversion if there are 8-bit characters.
    // preg_match replaces ereg(), which was removed in PHP 7.
    if (!preg_match('/[\x80-\xFF]/', $string)) {
        return $string;
    }
    // Decode three-byte UTF-8 sequences into numeric entities.
    // A callback replaces the /e modifier, which was also removed in PHP 7.
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
        },
        $string);
    // Decode two-byte UTF-8 sequences into numeric entities.
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
        },
        $string);
    return $string;
}
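If the mbstring extension is available, the same kind of conversion to &#number; form can probably be done without a hand-written decoder. The sketch below uses mb_encode_numericentity; the conversion map shown (every code point from U+0080 upward, no offset) is my own choice and not part of the answer above.

```php
<?php
// Assumption: mbstring is loaded. The map means: encode U+0080..U+10FFFF,
// with offset 0 and a mask wide enough to keep the full code point.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$address = "Praha 5, Star\u{00E9} M\u{011B}sto,";
$encoded = mb_encode_numericentity($address, $convmap, 'UTF-8');

echo $encoded, "\n"; // Praha 5, Star&#233; M&#283;sto,
```

Unlike the two- and three-byte patterns above, this also covers four-byte sequences.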
The problem is in your PHP file encoding. Save your file in UTF-8, and then there is no need to use utf8_decode at all. If you get data such as 'Praha 5, Staré Město,' from a database, it is better to change its charset to UTF-8.
You don't need that (@Rajeev: this string is automatically detected as UTF-8 encoded;
echo mb_detect_encoding('Praha 5, Staré Město,');
will always return UTF-8).
You should rather look at:
https://code.google.com/p/dompdf/wiki/CPDFUnicode

working with UTF-8 encoded text

I have a problem. I need to find some UTF-8 characters in my text file and output them, but it doesn't output the letters; instead it outputs "?" question marks...
ini_set( 'default_charset', 'UTF-8' );
$homepage = file_get_contents('t1.txt');
echo $homepage;
echo "\t";
echo "\t!!!!!!!!!!!!";
echo $homepage[14];
So here it is very strange: if I use an existing index it outputs nothing, but if I put
echo $homepage[35];
it outputs "?",
but my $homepage string is only 30 characters long. What's wrong?
It is very strange: it reads the string from the file correctly and outputs it correctly, but when I ask for a character by index, it doesn't work. Here is what's in my text file:
advhasgdvgv
олыолоываи
ouhh
and it outputs it correctly when I just echo $homepage, but with $homepage[14] it doesn't work. Here is the output:
advhasgdvgv олыолоываи ouhh !!!!!!!!!!!!
Try mb_convert_encoding, and see if that fixes the problem.
http://www.php.net/manual/en/function.mb-convert-encoding.php
string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding ] )
$homepage = mb_convert_encoding(
file_get_contents('t1.txt'),
"UTF-8"
);
You should also check on the encodings of both the PHP file and the text file you have there.
I used this approach for dealing with UTF-8:
<?php
$string = 'ئاکام';//my name
mb_internal_encoding("UTF-8");
$mystring = mb_substr($string, 0, 1); // ئ
// without mb_internal_encoding the return was Ø
echo $mystring;
?>
I also saved all files (Encoding as UTF-8)
In UTF-8, non-ASCII characters take more than one byte per letter, so to access them you would have to do:
echo $homepage[30] . $homepage[31];
> и
That assumes the character is exactly 2 bytes long, but it could be longer, so a more general solution would be:
function charAt($str, $pos, $encoding = "UTF-8")
{
return mb_substr($str, $pos, 1, $encoding);
}
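A quick usage sketch of the helper above, with the file content from the question inlined as a string. It assumes "\n" line endings and a UTF-8 source file; the helper is repeated so the snippet runs on its own.

```php
<?php
// Same helper as above, repeated so this snippet is self-contained.
function charAt($str, $pos, $encoding = "UTF-8")
{
    return mb_substr($str, $pos, 1, $encoding);
}

// Assumption: the text file uses "\n" line endings and is UTF-8 encoded.
$homepage = "advhasgdvgv\nолыолоываи\nouhh";

echo strlen($homepage), "\n";             // 37 bytes
echo mb_strlen($homepage, 'UTF-8'), "\n"; // 27 characters

echo charAt($homepage, 21), "\n";         // и (last Cyrillic letter)
echo bin2hex($homepage[14]), "\n";        // d0 (lead byte of a 2-byte sequence)
```

The byte count and the character count differ because each Cyrillic letter occupies two bytes, which is why byte indexes such as $homepage[14] land in the middle of a character.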
PHP does not really support UTF-8 in strings, which means that accessing $text[n] gets the n-th byte instead of the n-th character. A UTF-8 character takes 1 to 4 bytes, so you cannot reliably access characters by index this way: you don't know at which byte a character starts, and $text[n] returns a single byte even when the character needs several.
Depending on what you want, you can either convert the string to ISO-8859-1 using utf8_decode() (which only works if every character exists in Latin-1), or use some UTF-8-aware mechanism to iterate through the string from the beginning and extract the bytes you want/need.
Be aware that Linux and Windows versions of PHP might produce different output on certain conversions, such as mb_strtoupper(), and that not all regex functions support UTF-8.
