Detect URL query string encoding - php

On a request URL, I can get the query string ?dir=Documents%20partag%C3%A9s or ?dir=Documents%20partag%E9s. I think the first one is UTF-8 and the second is ASCII.
The real string is : Documents partagés
So, I have a PHP script (in UTF-8) and what I want to do, is to detect if the query string is ASCII or UTF-8, and if ASCII, convert it to UTF-8.
I tried with mb_ functions, but the query string is always detected as ASCII and urldecode version of query string as UTF-8.
How can I achieve this? Note that Wikipedia has a similar function -it encodes itself %E9 to %C3%A9.

E9 is 233 in decimal. It is not a valid ASCII byte (0-127 only), but it is é in ISO-8859-1 (Latin1). When using mb_convert_encoding, you can specify multiple encodings (e.g.: UTF-8 and ISO-8859-1).
This should fix it:
mb_convert_encoding($str, 'UTF-8', 'UTF-8,ISO-8859-1');
With the following script:
$str1 = 'Documents%20partag%E9s';
$str2 = 'Documents%20partag%C3%A9s';
var_dump(mb_convert_encoding(urldecode($str1), 'UTF-8', 'UTF-8,ISO-8859-1'));
var_dump(mb_convert_encoding(urldecode($str2), 'UTF-8', 'UTF-8,ISO-8859-1'));
I get:
string(19) "Documents partagés"
string(19) "Documents partagés"

Related

PHP, convert string into UTF-8 and then hexadecimal

In PHP, I want to convert a string which contains non-ASCII characters into a sequence of hexadecimal numbers which represents the UTF-8 encoding of these characters. For instance, given this:
$text = 'ąćę';
I need to produce this:
C4=84=C4=87=C4=99
How do I do that?
As your question is written, and assuming that your text is properly UTF-8 encoded to start with, this should work:
$text = 'ąćę';
$result = implode('=', str_split(strtoupper(bin2hex($text)), 2));
If your text is not UTF-8, but some other encoding, then you can use
$utf8 = mb_convert_encoding($text, 'UTF-8', $yourEncoding);
to get it into UTF-8, where $yourEncoding is some other character encoding like 'ISO-8859-1'.
This works because in PHP, strings are just arrays of bytes. So as long as your text is encoded properly to start with, you don't have to do anything special to treat it as bytes. In fact, this code will work for any character encoding you want without modification.
Now, if you want to do quoted-printable, then that's another story. You could try using the function quoted_printable_encode (requires PHP 5.3 or higher).

How to convert UTF-8 interpreted GB2312 encoding to real UTF-8 encoding?

This is a strange scenario, not conventional converting one encoding to another one.
Question
I use Readability API to retrieve main content from given url, it works fine if the target url is encoded with UTF-8, but when target url is encoded in GB2312(one of Chinese encoding), I get rubbish information instead(the Chinese characters are wrongly encoded but English letters and digits work fine).
Deep Research
I inspected the HTTP header Readability API returns, it indicates that the encoding of API response is UTF-8.
Here's a snippet of wrongly encoded Chinese characters:
ÄÉ´ï¶û¾ø¾³Ï´󷴻÷¾Ü¾øÀäÃÅÄæת½ú¼¶ÖÐÍøËÄÇ¿
Length: 42
Which originally are:
纳达尔绝境下大反击拒绝冷门逆转晋级中网四强
Length: 21
However, if you convert the correct Chinese into unicode, it should be:
纳达尔绝境下大反击拒绝冷门逆转晋级中网四强
Tried But Not Working
iconv("GB2312", "UTF-8", $str);
iconv("GBK", "UTF-8", $str);
iconv("GB18300", "UTF-8", $str);
mb_convert_enconding($str, "UTF-8", "GB2312");
mb_convert_enconding($str, "UTF-8", "GB18300");
mb_convert_enconding($str, "UTF-8", "GBK");
Solution Requested
Since Readability API doesn't provide a parameter for charset of target url( I call this API like https://www.readability.com/api/content/v1/parser?url=http://sports.sina.com.cn/t/2013-10-04/14596813815.shtml&token=my_token_here), I have to do the convertion when handling the API response.
I will appreciate it very much if you have any idea about this issue.
Environment Info: PHP 5.3.6
It seems that the individual bytes that make up the characters have been encoded as HTML numeric entities as if they were characters from ISO-8859-1 or some other 8-bit encoding. To undo the numeric entity encoding you can use mb_decode_numericentity:
$str = "ÄÉ´ï¶û¾ø¾³Ï´󷴻÷¾Ü¾øÀäÃÅÄæת½ú¼¶ÖÐÍøËÄÇ¿";
$str = mb_decode_numericentity($str, array(0, 255, 0, 255), "ISO-8859-1");
echo iconv("gb2312", "utf8", $str);
This produces the expected output of 纳达尔绝境下大反击拒绝冷门逆转晋级中网四强.

mb_detect_encoding showing the same encoding

I have a weird problem , the following code :
$str = "נסיון" // <--- Hebrew chars
echo mb_detect_encoding ($str)."<br><br><br>";
$str = iconv (mb_detect_encoding($str),'UCS-2BE',$str);
echo mb_detect_encoding ($str)."<br><br><br>";
This will output :
UTF-8
UTF-8
This code is written in a file that's encoded (using Notepad++) in UTF-8 Without BOM, trying other encodings and didn't work.
I also tried converting the string using :
$str = mb_convert_encoding($str,'UCS-2BE');
But that didn't work either. Any insights?
From the documentation for mb_detect_order, the function that establishes the order in which mb_detect_encoding tests different encodings:
mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.
UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP
For ISO-8859-*, mbstring always detects as ISO-8859-*.
For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.
So, you can't detect the encoding of the second string with the mb functions.

PHP Utf8 Decoding Issue

I have the following address line: Praha 5, Staré Město,
I need to use utf8_decode() function on this string before I can write it to a PDF file (using domPDF lib).
However, the php utf8 decode function for the above address line appears incorrect (or rather, incomplete).
The following code:
<?php echo utf8_decode('Praha 5, Staré Město,'); ?>
Produces this:
Praha 5, Staré M?sto,
Any idea why ě is not getting decoded?
utf8_decode converts the string from a UTF-8 encoding to ISO-8859-1, a.k.a. "Latin-1".
The Latin-1 encoding cannot represent the letter "ě". It's that simple.
"Decode" is a total misnomer, it does the same as iconv('UTF-8', 'ISO-8859-1', $string).
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
I wound up using a home-grown UTF-8 / UTF-16 decoding function (convert to &#number; representations), I haven't found any pattern to why UTF-8 isn't detected, I suspect it's because the "encoded-as" sequence isn't always exactly in the same position in the string returned. You might do some additional checking on that.
Three-character UTF-8 indicator: $startutf8 = chr(0xEF).chr(187).chr(191); (if you see this ANYWHERE, not just first three characters, the string is UTF-8 encoded)
Decode according to UTF-8 rules; this replaced an earlier version which chugged through byte by byte:using
function charset_decode_utf_8 ($string) {
/* Only do the slow convert if there are 8-bit characters */
/* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string))
return $string;
// decode three byte unicode characters
$string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
"'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",
$string);
// decode two byte unicode characters
$string = preg_replace("/([\300-\337])([\200-\277])/e",
"'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",
$string);
return $string;
}
Problem is in your PHP file encoding , save your file in UTF-8 encoding , then even no need to use utf8_decode , if you get these data 'Praha 5, Staré Město,' from database , better change it charset to UTF-8
you don't need that (#Rajeev :this string is automatically detected as utf-8 encoded :
echo mb_detect_encoding('Praha 5, Staré Město,');
will always return UTF-8.).
You'd rather see :
https://code.google.com/p/dompdf/wiki/CPDFUnicode

Convert ASCII TO UTF-8 Encoding

How to convert ASCII encoding to UTF8 in PHP
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8.
If you know for sure that your current encoding is pure ASCII, then you don't have to do anything because ASCII is already a valid UTF-8.
But if you still want to convert, just to be sure that its UTF-8, then you can use iconv
$string = iconv('ASCII', 'UTF-8//IGNORE', $string);
The IGNORE will discard any invalid characters just in case some were not valid ASCII.
Use mb_convert_encoding to convert an ASCII to UTF-8. More info here
$string = "chárêctërs";
print(mb_detect_encoding ($string));
$string = mb_convert_encoding($string, "UTF-8");
print(mb_detect_encoding ($string));
"ASCII is a subset of UTF-8, so..." - so UTF-8 is a set? :)
In other words: any string build with code points from x00 to x7F has indistinguishable representations (byte sequences) in ASCII and UTF-8. Converting such string is pointless.
Use utf8_encode()
Man page can be found here http://php.net/manual/en/function.utf8-encode.php
Also read this article from Joel on Software. It provides an excellent explanation if what Unicode is and how it works. http://www.joelonsoftware.com/articles/Unicode.html

Categories