PHP UTF-8 mb_convert_encoding and Internet Explorer

I have been reading about character encoding for a few days, and I want to serve all my pages as UTF-8 for compatibility. But I get stuck when I try to convert user input to UTF-8: it works in all browsers except Internet Explorer (as always).
I don't know what's wrong with my code; it seems fine to me.
I set the header with the character encoding.
I saved the file as UTF-8 (no BOM).
This happens only when the page is accessed with a $_GET parameter in Internet Explorer: myscript.php?c=äüöß
When I write special characters directly into the page, they are displayed correctly.
This is my code:
// User Input
$_GET['c'] = "äüöß"; // Access URL ?c=äüöß
//--------
header("Content-Type: text/html; charset=utf-8");
mb_internal_encoding('UTF-8');
$_GET = userToUtf8($_GET);
function userToUtf8($string) {
if(is_array($string)) {
$tmp = array();
foreach($string as $key => $value) {
$tmp[$key] = userToUtf8($value);
}
return $tmp;
}
return userDataUtf8($string);
}
function userDataUtf8($string) {
print("1: " . mb_detect_encoding($string) . "<br>"); // Shows: 1: UTF-8
$string = mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string)); // Convert non UTF-8 String to UTF-8
print("2: " . mb_detect_encoding($string) . "<br>"); // Shows: 2: ASCII
$string = preg_replace('/[\xF0-\xF7].../s', '', $string);
print("3: " . mb_detect_encoding($string) . "<br>"); // Shows: 3: ASCII
return $string;
}
echo $_GET['c']; // Shows nothing
echo mb_detect_encoding($_GET['c']); // ASCII
echo "äöü+#"; // Shows "äöü+#"
The most confusing part is that it reports the string as converted from UTF-8 to ASCII. Can someone tell me why the special characters are not displayed correctly and what is wrong here? Or is this a bug in Internet Explorer?
Edit:
If I disable the conversion, it reports everything as UTF-8, but the characters are still not shown. They are displayed as "????".
Note: This happens ONLY in Internet Explorer!

Although I prefer using URL-encoded strings in the address bar, in your case you can try converting $_GET['c'] to UTF-8. Note that utf8_encode() assumes its input is ISO-8859-1. E.g.:
$_GET['c'] = utf8_encode($_GET['c']);
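A minimal sketch of a more defensive variant, assuming the cause is that Internet Explorer sends the typed query string in a legacy single-byte encoding instead of UTF-8 (the question does not confirm this). The helper name to_utf8 and the Windows-1252 fallback are assumptions for illustration. It validates with mb_check_encoding rather than guessing with mb_detect_encoding, and converts only when the input is not already valid UTF-8:

```php
<?php
// Hypothetical helper: return $input as UTF-8.
// Assumption: anything that is not valid UTF-8 is in $fallback (Windows-1252 here).
function to_utf8(string $input, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

// "äüöß" as raw single-byte input, the way a legacy-encoded query string would arrive
$legacy = "\xE4\xFC\xF6\xDF";
echo bin2hex(to_utf8($legacy)), PHP_EOL; // c3a4c3bcc3b6c39f

// Input that is already UTF-8 passes through unchanged
echo bin2hex(to_utf8("\xC3\xA4")), PHP_EOL; // c3a4
```

Unlike an unconditional utf8_encode(), this does not double-encode input from browsers that already send UTF-8.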

An approach that worked for displaying the characters in IE 11.0.18:
Retrieve the Unicode code point of your character, for example 'ü' = 'U+00FC'.
According to this post, convert it to an HTML entity and decode that entity to UTF-8.
Decode it using utf8_decode before dumping.
The line of code illustrating the example with the 'ü' character is:
var_dump(utf8_decode(html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", 'U+00FC'), ENT_NOQUOTES, 'UTF-8')));
To summarize: for display purposes, go from the Unicode code point to UTF-8, then decode it before displaying it.
Other resources:
a post to retrieve characters' unicode

Related

Create UTF-8 code from dynamic Unicode in PHP

I am generating a Unicode icon dynamically in PHP, and I want the UTF-8 bytes of that icon.
So far I have done:
$value = "1F600";
$emoIcon = "\u{$value}";
$emoIcon = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $emoIcon);
echo $emoIcon; //output 😀
$hex=bin2hex($emoIcon);
echo $hex; // output 26237831463630303b
$hexVal=chunk_split($hex,2,"\\x");
var_dump($hexVal); // output 26\x23\x78\x31\x46\x36\x30\x30\x3b\x
$result= "\\x" . substr($hexVal,0,-2);
var_dump($result); // output \x26\x23\x78\x31\x46\x36\x30\x30\x3b
But when I put the value directly, it prints the correct data:
$emoIcon = "\u{1F600}";
$emoIcon = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $emoIcon);
echo $emoIcon; //output 😀
$hex=bin2hex($emoIcon);
echo $hex; // output f09f9880
$hexVal=chunk_split($hex,2,"\\x");
var_dump($hexVal); // output f0\x9f\x98\x80\x
$result= "\\x" . substr($hexVal,0,-2);
var_dump($result); // output \xf0\x9f\x98\x80
\u{1F600} is a Unicode escape sequence used in double-quoted strings. It must contain a literal value: trying to use "\u{$value}", as you've seen, doesn't work (for a couple of reasons, but that doesn't matter so much).
If you want to start with "1F600" and end up with 😀, use hexdec to turn it into an integer and feed that to IntlChar::chr to encode that code point as UTF-8. E.g.:
$value = "1F600";
echo IntlChar::chr(hexdec($value));
Outputs:
😀
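If the intl extension is not available, a similar result can be sketched with mb_chr() from the mbstring extension (PHP 7.2 or later). This is an alternative to the IntlChar call above, not a replacement for it. It also shows one way to build the \x-escaped form the question was after:

```php
<?php
$value = "1F600";

// Encode the code point as UTF-8 (assumes the mbstring extension is loaded)
$char = mb_chr(hexdec($value), 'UTF-8');

// Hex dump of the UTF-8 bytes
$hex = bin2hex($char);
echo $hex, PHP_EOL; // f09f9880

// Prefix every byte with a literal \x
$escaped = '\x' . implode('\x', str_split($hex, 2));
echo $escaped, PHP_EOL; // \xf0\x9f\x98\x80
```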

Check if a string begins with the euro/pound symbol

I'm trying to check whether a string starts with '€' or '£' in PHP.
Here is the code:
$text = "€123";
if($text[0] == "€"){
echo "true";
}
else{
echo "false";
}
//output false
If I check only a single character, it works fine:
$symbol = "€";
if($symbol == "€"){
echo "true";
}
else{
echo "false";
}
// output true
I have also tried printing the string in the browser:
$text = "€123";
echo $text; //display euro symbol correctly
echo $text[0]; // shows a question mark
I have also tried substr(), but the same problem occurred.
Characters such as '€' or '£' are multi-byte characters in UTF-8. There is an excellent article that you can read here. According to the PHP docs, PHP strings are byte arrays. As a result, accessing or modifying a string using array brackets is not multi-byte safe, and should only be done with strings that are in a single-byte encoding such as ISO-8859-1.
Also make sure your file is encoded as UTF-8: you can use a text editor such as Notepad++ to convert it.
If I reduce the PHP to this, it works; the key is to use mb_substr:
<?php
header ('Content-type: text/html; charset=utf-8');
$text = "€123";
echo mb_substr($text,0,1,'UTF-8');
?>
Finally, it would be a good idea to add the UTF-8 meta-tag in your head tag:
<meta charset="utf-8">
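To make the byte-versus-character distinction concrete, here is a small sketch, assuming the source file is saved as UTF-8 and the mbstring extension is loaded. It compares what the byte-oriented functions and the mb_* functions report for the same string:

```php
<?php
$text = "€123"; // assumes this source file is saved as UTF-8

echo strlen($text), PHP_EOL;             // 6: three bytes for the euro sign plus three digits
echo mb_strlen($text, 'UTF-8'), PHP_EOL; // 4: four characters

// Array brackets return a single byte, not a character
echo bin2hex($text[0]), PHP_EOL;         // e2

// mb_substr returns the whole first character
$first = mb_substr($text, 0, 1, 'UTF-8');
echo bin2hex($first), PHP_EOL;           // e282ac

var_dump($first === "€");                // bool(true)
```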
I suggest this as the easiest solution for you. Convert the symbols to their HTML entities using htmlentities():
$text = htmlentities($text, ENT_QUOTES, "UTF-8");
This gives you either &pound; or &euro; at the start of the string. That allows you to run a switch() {case:} statement to check (or use your if statements):
$symbols = explode(";", $text);
switch($symbols[0]) {
case "&pound":
echo "It's Pounds";
break;
case "&euro":
echo "It's Euros";
break;
}
This happens because you’re using a multi-byte character encoding (probably UTF-8) in which both € and £ are recorded using multiple bytes. That means that "€" is a string of three bytes, not just one.
When you use $text[0] you're getting only the first byte of the first character, and so it doesn't match the three bytes of "€". You need to get the first three bytes instead, to check whether one string starts with another.
Here’s the function I use to do that:
function string_starts_with($string, $prefix) {
return substr($string, 0, strlen($prefix)) == $prefix;
}
The question mark appears because the first byte of "€" isn’t enough to encode a whole character: the error is indicated by ‘�’ when available, otherwise ‘?’.
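The same byte-wise prefix comparison can be extended to test several symbols at once. The following is a sketch with a hypothetical helper name, starts_with_any, assuming the source file and the input share the same encoding (UTF-8 here). On PHP 8.0 or later, the built-in str_starts_with() could replace the strncmp() call:

```php
<?php
// Hypothetical helper: true if $text begins with any of the given prefixes.
// The comparison is byte-wise, so multi-byte prefixes work as long as
// $text and the prefixes use the same encoding.
function starts_with_any(string $text, array $prefixes): bool
{
    foreach ($prefixes as $prefix) {
        if (strncmp($text, $prefix, strlen($prefix)) === 0) {
            return true;
        }
    }
    return false;
}

var_dump(starts_with_any("€123", ["€", "£"]));  // bool(true)
var_dump(starts_with_any("£9.99", ["€", "£"])); // bool(true)
var_dump(starts_with_any('$123', ["€", "£"]));  // bool(false)
```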

Convert UTF-8 (Chinese characters) to hexadecimal format in PHP

I am passing my message to an SMS API.
This is what the documentation says:
Normally Unicode Messages are Arabic and Chinese Message, which are
defined by GSM Standards. Unicode messages are nothing but normal text
type messages but it has to be submitted in HEX form. To submit
Unicode messages following Url to be used.
I tried bin2hex(), but it does not produce the expected output.
$str = '人';
//$str = 'a';
$output = bin2hex($str);
echo $output;
//output
//人 = e4baba ; I would expect '4EBA'
I found a similar solution, but it is in VB.NET. Can anyone convert it?
http://www.supportchain.com/index.php?/Knowledgebase/Article/View/28/7/unable-to-send-sms-with-chinese-character-using-api
I tried the sample there, and it works.
Example of the conversion: a converted to hexadecimal is 0061, and 人 converted to hexadecimal is 4EBA.
The issue is the encoding. bin2hex() dumps the bytes of the string as PHP stores it, which here is UTF-8 (e4baba for 人), while the expected 4EBA is the character's UTF-16 (UCS-2) representation. Convert the string to that encoding before converting it to hex.
Each of these outputs exactly what you were looking for when I run them:
echo bin2hex(iconv('UTF-8', 'ISO-10646-UCS-2', '人')) . PHP_EOL;
//Outputs 4eba
echo bin2hex(iconv('UTF-8', 'UNICODE-1-1', '人')) . PHP_EOL;
//Outputs 4eba
echo bin2hex(iconv('UTF-8', 'UTF-16BE', '人')) . PHP_EOL;
//Outputs 4eba
Pick whichever one you fancy.
If you want to convert back:
echo iconv('UTF-16BE', 'UTF-8', hex2bin('4eba')) . PHP_EOL;
//outputs 人
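For a whole message rather than a single character, the conversion can be wrapped in a small function. This sketch uses mb_convert_encoding() instead of iconv() and a hypothetical function name, to_utf16be_hex. It assumes the API accepts uppercase hex of UTF-16BE code units, as the 0061 / 4EBA examples in the question suggest. How the API treats characters outside the Basic Multilingual Plane (which become surrogate pairs) is not covered by the quoted documentation:

```php
<?php
// Hypothetical helper: UTF-8 text to uppercase hex of its UTF-16BE code units.
function to_utf16be_hex(string $utf8): string
{
    $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    return strtoupper(bin2hex($utf16));
}

echo to_utf16be_hex('a'), PHP_EOL;   // 0061
echo to_utf16be_hex('人'), PHP_EOL;  // 4EBA
echo to_utf16be_hex('a人'), PHP_EOL; // 00614EBA
```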

How do I write UTF-8 data to a UTF-16LE file using PHP?

Given a string of UTF-8 data in PHP, how can I convert it and save it to a UTF-16LE file? (This particular file is destined for InDesign, to be placed as a tagged text document.)
Data:
$copy = "<UNICODE-MAC>\n";
$copy .= "<Version:8><FeatureSet:InDesign-Roman><ColorTable:=<Black:COLOR:CMYK:Process:0,0,0,1>>\n";
$copy .= "A bunch of unicode special characters like ñ, é, etc.";
I am using the following code, but to no avail:
file_put_contents("output.txt", pack("S",0xfeff) . $copy);
You can use iconv:
$copy_utf16 = iconv("UTF-8", "UTF-16LE", $copy);
file_put_contents("output.txt", $copy_utf16);
Note that UTF-16LE does not include a byte order mark (BOM), because the byte order is already fixed by the encoding name. To produce a BOM, use "UTF-16" instead; the byte order of the output then depends on the iconv implementation.
I found a solution using the following code.
This function changes the byte order (from http://shiplu.mokadd.im/95/convert-little-endian-to-big-endian-in-php-or-vice-versa/):
function chbo($num) {
$data = dechex($num);
if (strlen($data) <= 2) {
return $num;
}
$u = unpack("H*", strrev(pack("H*", $data)));
$f = hexdec($u[1]);
return $f;
}
Used with a UTF-8 to UTF-16LE conversion, it creates a file that works with InDesign:
file_put_contents("output.txt", pack("S",0xfeff) . chbo(iconv("UTF-8","UTF-16LE",$copy)));
Alternatively, you could use mb_convert_encoding() as follows:
$copy_UTF16LE = mb_convert_encoding($copy,'UTF-16LE','UTF-8');
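If the target application expects a byte order mark, one option is to prepend the little-endian BOM bytes explicitly instead of relying on pack("S", ...), whose byte order follows the machine. A minimal sketch, assuming the mbstring extension is loaded. The function name and the temporary output path are hypothetical, and whether InDesign needs the BOM is not confirmed here:

```php
<?php
// Hypothetical helper: UTF-8 text to UTF-16LE bytes with a leading BOM (FF FE).
function utf8_to_utf16le_with_bom(string $utf8): string
{
    return "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

$bytes = utf8_to_utf16le_with_bom("ñ");
echo bin2hex($bytes), PHP_EOL; // fffef100

// Write the result to a file (the path is only an example)
$path = sys_get_temp_dir() . '/output.txt';
file_put_contents($path, utf8_to_utf16le_with_bom("A bunch of characters like ñ, é, etc."));
```

pack("v", 0xFEFF) would produce the same two BOM bytes, since the "v" format is always little-endian.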

get utf8 urlencoded characters in another page using php

I have used rawurlencode on a utf8 word.
For example
$tit = 'தேனின் "வாசம்"';
$t = (rawurlencode($tit));
When I click the link for the UTF-8 word ($t), I am redirected to another page via .htaccess, where I read the word with $_GET['word'].
The word displays as தேனினà¯_"வாசமà¯" not the actual word. How can I get the actual utf8 word. I have used the header charset=utf-8.
This was a comment first, but it should have been an answer:
Is magic_quotes off? It would be weird if it were still on in 2011, but you should check and apply stripslashes if it is.
Did you use rawurldecode($_GET['word'])? And do you use UTF-8 encoding for your PHP file?
<?php
$s1 = <<<EOD
தேனினà¯_"வாசமà¯"
EOD;
$s2 = <<<EOD
தேனின் "வாசம்"
EOD;
$s1 = mb_convert_encoding($s1, "WINDOWS-1252", "UTF-8");
echo bin2hex($s1), "\n";
echo bin2hex($s2), "\n";
echo $s1, "\n", $s2, "\n";
Output:
e0aea4e0af87e0aea9e0aebfe0aea9e0af5f22e0aeb5e0aebee0ae9ae0aeaee0af22
e0aea4e0af87e0aea9e0aebfe0aea9e0af8d2022e0aeb5e0aebee0ae9ae0aeaee0af8d22
தேனின��_"வாசம��"
தேனின் "வாசம்"
You're probably just not showing the data as UTF-8 and you're showing it as ISO-8859-1 or similar.
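The round trip itself can be checked in isolation. The sketch below, with a hypothetical link target /page.php, shows that rawurlencode() and rawurldecode() preserve the UTF-8 bytes, and that reading those bytes as ISO-8859-1 produces a different string, which is the kind of mismatch described above. PHP decodes $_GET values automatically, so the rawurldecode() call here only simulates the receiving page:

```php
<?php
// Declare the charset before any output so the browser reads the page as UTF-8
header('Content-Type: text/html; charset=utf-8');

$tit = 'தேனின் "வாசம்"';

// Building the link: rawurlencode() percent-encodes the UTF-8 bytes
$href = '/page.php?word=' . rawurlencode($tit); // hypothetical URL

// Simulated receiving side: decoding restores the original bytes
$word = rawurldecode(rawurlencode($tit));
var_dump($word === $tit);                    // bool(true)
var_dump(mb_check_encoding($word, 'UTF-8')); // bool(true)

// Treating the UTF-8 bytes as ISO-8859-1 yields a different, garbled string
$misread = mb_convert_encoding($word, 'UTF-8', 'ISO-8859-1');
var_dump($misread === $word);                // bool(false)

// Escape for HTML when printing the word
echo htmlspecialchars($word, ENT_QUOTES, 'UTF-8'), PHP_EOL;
```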
