I have two utf-8 strings:
one saved as variable in php file (saved in UTF-8)
another gets externally from another with regular expression.
When I compare those two same space-separated strings, the result is false, meaning that they are not the same.
The string I saved as a variable is rendered as 20 with bin2hex (ascii encoded space symbol)
The string I got externally, processed with mb_strtolower($string, 'utf-8') is rendered as c2a0 with bin2hex (utf-8 space)
My questions is:
Why when I save in utf-8 string not fully encoded as utf-8 (meaning space in ascii)?
How to get rid of that problem?
As said in the comments c2a0 is a no-break space and 20 is normal space
Since you can see the problem in bin2hex you could:
$str = hex2bin(str_replace('c2a0', '20', bin2hex($str)));
or to put that another way:
$str = preg_replace('~\xc2\xa0~', ' ', $str); // typo corrected
Related
I have the following address line: Praha 5, Staré Město,
I need to use utf8_decode() function on this string before I can write it to a PDF file (using domPDF lib).
However, the php utf8 decode function for the above address line appears incorrect (or rather, incomplete).
The following code:
<?php echo utf8_decode('Praha 5, Staré Město,'); ?>
Produces this:
Praha 5, Staré M?sto,
Any idea why ě is not getting decoded?
utf8_decode converts the string from a UTF-8 encoding to ISO-8859-1, a.k.a. "Latin-1".
The Latin-1 encoding cannot represent the letter "ě". It's that simple.
"Decode" is a total misnomer, it does the same as iconv('UTF-8', 'ISO-8859-1', $string).
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
I wound up using a home-grown UTF-8 / UTF-16 decoding function (convert to &#number; representations), I haven't found any pattern to why UTF-8 isn't detected, I suspect it's because the "encoded-as" sequence isn't always exactly in the same position in the string returned. You might do some additional checking on that.
Three-character UTF-8 indicator: $startutf8 = chr(0xEF).chr(187).chr(191); (if you see this ANYWHERE, not just first three characters, the string is UTF-8 encoded)
Decode according to UTF-8 rules; this replaced an earlier version which chugged through byte by byte:using
function charset_decode_utf_8 ($string) {
/* Only do the slow convert if there are 8-bit characters */
/* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string))
return $string;
// decode three byte unicode characters
$string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
"'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",
$string);
// decode two byte unicode characters
$string = preg_replace("/([\300-\337])([\200-\277])/e",
"'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",
$string);
return $string;
}
Problem is in your PHP file encoding , save your file in UTF-8 encoding , then even no need to use utf8_decode , if you get these data 'Praha 5, Staré Město,' from database , better change it charset to UTF-8
you don't need that (#Rajeev :this string is automatically detected as utf-8 encoded :
echo mb_detect_encoding('Praha 5, Staré Město,');
will always return UTF-8.).
You'd rather see :
https://code.google.com/p/dompdf/wiki/CPDFUnicode
I have some PHP code that I use for text filtering. During filtering, some ASCII characters such as ampersand (&) and tilde (~) are temporarily converted to low ASCII characters (such as decimal code-points 4 and 5). Just before the final filtered output is generated, the conversion is reverted.
$temp = str_replace(array('&', '~'), array("\x04", "\x05"), $input);
... some filtering code to work with $temp ...
$out = str_replace(array("\x04", "\x05"), array('&', '~'), $temp);
This works well with input text of character encodings that use 8-bit code units such as UTF-8 and ISO 8859-1. But I am not sure about input encoded in larger code units, such as UTF-16 or UTF-32. Will the first conversion step mangle the well-formedness of the input text? Will there be some conflict during the reversion step because of some pre-existing characters of the input? The PHP setup does not overload multi-byte string functions.
Can anyone comment? Thanks.
str_replace works fine, as long as all strings passed to it are in the same encoding. It just does a binary compare/replace of data, so the actual encoding doesn't really matter.
That's why there's no mb_str_replace in this list.
I have encountered a problem when using a UTF-8 string. I want to read a single character from the string, for example:
$string = "üÜöÖäÄ";
echo $string[0];
I am expecting to see ü, but I get � -- why?
Use mb_substr($string, 0, 1, 'utf-8') to get the character instead.
What happens in your code is that the expression $string[0] gets the first byte of the UTF-8 encoded representation of your string because PHP strings are effectively arrays of bytes (PHP does not internally recognize encodings).
Since the first character in your string is composed in more than one byte (UTF-8 encoding rules), you are effectively only getting part of the character. Furthermore, these rules make the byte you are retrieving invalid to stand as a character on its own, which is why you see the question mark.
mb_substr knows the encoding rules, so it will not naively give you back just one byte; it will get as many as needed to encode the first character.
You can see that $string[0] gives you back just one byte with:
$string = "üÜöÖäÄ";
echo strlen($string[0]);
While mb_substr gives you back two bytes:
$string = "üÜöÖäÄ";
echo strlen(mb_substr($string, 0, 1, 'utf-8'));
And these two bytes are in fact just one character (you need to use mb_strlen for this):
$string = "üÜöÖäÄ";
echo mb_strlen(mb_substr($string, 0, 1, 'utf-8'), 'utf-8');
Finally, as Marwelln points out below, the situation becomes more tolerable if you use mb_internal_encoding to get rid of the 'utf-8' redundancy:
$string = "üÜöÖäÄ";
mb_internal_encoding('utf-8');
echo mb_strlen(mb_substr($string, 0, 1));
You can see most of the above in action.
I am trying to replace a certain character in a string with another. They are quite obscure latin characters. I want to replace character (hex) 259 with 4d9, so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
**EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.
U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);
A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that your byte stream occurs in the $string. I mention this because your hex string begins with a low-byte, so you'll need to make sure your $string is not UTF8 encoded.
I have problems displaying the Unicode character of U+009A.
It should look like "š", but instead looks like a rectangular block with the numbers 009A inside.
Converting it to the entity "" displays the character correctly, but I don't want to store entities in the database.
The encoding of the webpage is in UTF-8.
The character is URL-encoded as "%C2%9A".
Reproduce:
# php -E 'echo urldecode("%C2%9A");' > /tmp/test ; less /tmp/test
This gives me <U+009A> in less or <9A> in vim.
The Unicode character "š" is U+0161, not U+009A
I suspect that it's 0x9A in another character set.
The box with 009A is usually shown when you don't have a font installed with that character.
If you’re using UTF-8 as your input encoding, then you can simply use the plain š. Or you could use the hexadecimal representation "\xC2\x9A" (in double quotes) that’s independent from the input encoding. Or utf8_encode("\x9A") since the first 256 characters of Unicode and ISO 8859-1 are identical.
If I do a hexdump of the output of echo urldecode("%C2%9A"); I get c2 9a, which is the correct UTF-8 encoding for character 0x9a.
You get that same encoding from the output of utf8_encode("\x9A")
When I try to view Unicode char 0x9a, I get a square box too - suspect it's not the char you think it should be (Aha: as Azquelt has posted, unicode character "š" is U+0161, not U+009A)
Codeigniter have utf-8 character input data save issue in some hosting servers like Etisalat. system/core/Utf8.php have function to detect illegal char in input data(post/get). In some cases utf-8 char is consider as illegal and save function will fail. For avoid data saving issue do the following in clean_string() function of Utf8.php at line 85.
$str = !mb_detect_encoding($str, 'UTF-8', TRUE) ? utf8_encode($str) : $str;
$str = #iconv('UTF-8', 'UTF-8//IGNORE', $str);