How can I encode from Unicode to GB18030 in PHP?

I know how to do this in Python:
In [1]: u'中华人民共和国'.encode('GB18030').encode('base64')
Out[1]: '1tC7qsjLw/G5srrNufo=\n'
But I need to do this in PHP, and I'm not sure how to do it.

You can use iconv to convert the string from UTF-8 (or whatever your initial encoding is) to GB18030, then base64_encode the result. E.g.:
echo base64_encode(iconv('UTF-8', 'GB18030', '中华人民共和国'));
outputs:
1tC7qsjLw/G5srrNufo=
Note that PHP doesn't have native Unicode strings - they're just a bunch of bytes, so you'll need to specify the encoding the string is in. If it's a string literal in your PHP it'll be whatever encoding you've used for the file.
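A slightly more defensive version of the iconv approach might look like the sketch below. It assumes the iconv extension is loaded and that this source file is saved as UTF-8; the helper name toGb18030Base64 is made up for illustration.

```php
<?php
// Hypothetical helper: convert UTF-8 text to GB18030, then base64-encode the bytes.
function toGb18030Base64(string $utf8): string
{
    $gb = iconv('UTF-8', 'GB18030', $utf8);
    if ($gb === false) {
        // iconv() returns false when the input is not valid in the source encoding
        throw new RuntimeException('Input could not be converted to GB18030');
    }
    return base64_encode($gb);
}

$text = '中华人民共和国'; // UTF-8 only because the file is assumed to be saved as UTF-8

$encoded = toGb18030Base64($text);
echo $encoded, "\n"; // 1tC7qsjLw/G5srrNufo=

// Round trip back to UTF-8 to confirm nothing was lost.
$back = iconv('GB18030', 'UTF-8', base64_decode($encoded, true));
var_dump($back === $text); // bool(true)
```

The result matches the Python output from the question, minus the trailing newline that Python's base64 codec appends.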

echo mb_convert_encoding($str, "gb18030", "UTF-8");
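Replace UTF-8 with whatever encoding the text is already in. The third parameter is optional; if you leave it out, PHP assumes the internal encoding (see mb_internal_encoding), so it only works without it when the text is in that encoding.
See mb_convert_encoding in the PHP manual.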

Related

PHP and Unicode or UTF-8?

My PHP application outputs JSON where special characters are encoded; for example, the string "Brøndum" is represented as "Br\u00f8ndum".
Can you tell me which encoding this is, and how I get back from "Br\u00f8ndum" to "Brøndum"?
I have tried utf8_encode/decode but they don't work as expected.
Thanks!
That's standard JSON unicode escaping.
You get back to the actual character by using a JSON parser. json_decode in the case of PHP.
You can tell PHP not to escape Unicode characters in the first place with the JSON_UNESCAPED_UNICODE flag.
json_encode("Brøndum", JSON_UNESCAPED_UNICODE)
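As a rough illustration of the round trip described above, the sketch below decodes the escaped form and re-encodes it both ways. The ø is written as explicit UTF-8 byte escapes so the example does not depend on how the file is saved.

```php
<?php
// The JSON text as it arrives. Single quotes keep \u00f8 as six literal ASCII characters.
$json = '"Br\u00f8ndum"';

// json_decode() turns the escape into the real character, encoded as UTF-8.
$name = json_decode($json);
var_dump($name === "Br\xC3\xB8ndum"); // bool(true): ø is the two bytes C3 B8

// Default json_encode() escapes non-ASCII characters again.
$escaped = json_encode($name);
echo $escaped, "\n"; // "Br\u00f8ndum"

// With JSON_UNESCAPED_UNICODE the UTF-8 bytes are emitted as-is.
$unescaped = json_encode($name, JSON_UNESCAPED_UNICODE);
echo $unescaped, "\n"; // "Brøndum"
```

Both outputs are valid JSON for the same string; the flag only changes how it is written.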
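mb_detect_encoding may help here. You pass it the string and it guesses the encoding. You can also pass it a list of candidate encodings, since a plain string like "hello" is valid in several encodings. Note, though, that the escaped form "Br\u00f8ndum" consists only of ASCII characters, so detection will just report ASCII for it; the \u00f8 is a JSON escape, not a character encoding.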
echo mb_detect_encoding("Br\u00f8ndum");

PHP, convert string into UTF-8 and then hexadecimal

In PHP, I want to convert a string which contains non-ASCII characters into a sequence of hexadecimal numbers which represents the UTF-8 encoding of these characters. For instance, given this:
$text = 'ąćę';
I need to produce this:
C4=85=C4=87=C4=99
How do I do that?
As your question is written, and assuming that your text is properly UTF-8 encoded to start with, this should work:
$text = 'ąćę';
$result = implode('=', str_split(strtoupper(bin2hex($text)), 2));
If your text is not UTF-8, but some other encoding, then you can use
$utf8 = mb_convert_encoding($text, 'UTF-8', $yourEncoding);
to get it into UTF-8, where $yourEncoding is some other character encoding like 'ISO-8859-1'.
This works because in PHP, strings are just arrays of bytes. So as long as your text is encoded properly to start with, you don't have to do anything special to treat it as bytes. In fact, this code will work for any character encoding you want without modification.
Now, if you want to do quoted-printable, then that's another story. You could try using the function quoted_printable_encode (requires PHP 5.3 or higher).
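To compare the two outputs side by side, here is a small sketch. The input is spelled out as UTF-8 byte escapes (ą is U+0105, which is C4 85 in UTF-8) so the result does not depend on the encoding of the source file.

```php
<?php
$text = "\xC4\x85\xC4\x87\xC4\x99"; // UTF-8 bytes of 'ąćę'

// Hex pairs joined with '=', as in the answer above.
$hexPairs = implode('=', str_split(strtoupper(bin2hex($text)), 2));
echo $hexPairs, "\n"; // C4=85=C4=87=C4=99

// Quoted-printable puts '=' in front of every encoded byte instead.
$qp = quoted_printable_encode($text);
echo $qp, "\n"; // =C4=85=C4=87=C4=99
```

The only difference for this input is the leading "="; for longer or mixed text, quoted-printable also leaves printable ASCII untouched and wraps long lines.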

How to encode foreign characters in url in PHP

How do you correctly encode a URL with foreign characters in PHP?
I assumed urlencode() would do the trick but it does not.
The correct encoding for the following URL
http://eu.battle.net/wow/en/character/anachronos/Paddestøel/advanced
is this:
http://eu.battle.net/wow/en/character/anachronos/Paddest%C3%B8el/advanced
But urlencode encodes it like this:
http://eu.battle.net/wow/en/character/anachronos/Paddest%F8el/advanced
Which function do I use to encode it as in the second example?
Your PHP scripts seem to use some single-byte encoding. You can either:
Save the source code as UTF-8
Convert data to UTF-8 with iconv() or mb_convert_encoding()
In general, making the full switch to UTF-8 fixes all encoding issues at once but initial migration might require some extra work.
There is no "correct" encoding. URL-percent-encoding simply represents raw bytes. It's up to you what those bytes are or how you're going to interpret them later. If your string is UTF-8 encoded, the percent-encoded raw byte representation is %C3%B8. If your string is not UTF-8 encoded, it's something else. If you want %C3%B8, make sure your string is UTF-8 encoded.
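The point that percent-encoding only reflects the underlying bytes can be seen in a short sketch. The two input strings are written as byte escapes, and the conversion step assumes the mbstring extension is available.

```php
<?php
$latin1 = "Paddest\xF8el";     // ø as a single ISO-8859-1 byte
$utf8   = "Paddest\xC3\xB8el"; // ø as two UTF-8 bytes

// Same function, different bytes in, different escapes out.
$fromLatin1 = rawurlencode($latin1);
$fromUtf8   = rawurlencode($utf8);
echo $fromLatin1, "\n"; // Paddest%F8el
echo $fromUtf8, "\n";   // Paddest%C3%B8el

// Converting to UTF-8 first gives the form the target site expects.
$converted = rawurlencode(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1'));
echo $converted, "\n";  // Paddest%C3%B8el
```

Only the path segment is encoded here; passing a whole URL to rawurlencode would also escape the slashes and the colon.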
Use UTF-8 encoding
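function url_encode($string){
    return urlencode(utf8_encode($string));
}
Then use this function to encode your URL (taken from a comment at http://php.net/manual/en/function.urlencode.php). Note that utf8_encode only converts from ISO-8859-1, so this works only if the input is in that encoding; the function is also deprecated as of PHP 8.2.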

PHP: Use (or not) 'utf8_encode' in combination with setting BOM to \xEF\xBB\xBF

When using the following code:
$myString = 'some contents';
$fh=fopen('newfile.txt',"w");
fwrite($fh, "\xEF\xBB\xBF" . $myString);
Is there any point in using PHP functions to encode the text ($myString in the example) first, e.g. running utf8_encode($myString); or a similar iconv() call?
Assuming that the BOM \xEF\xBB\xBF is written to the file first and that UTF-8 can represent practically all characters in the world, I don't see any potential failure scenario in creating a file this way. In other words, I don't see any case where a major text editor wouldn't be able to interpret the newly created file correctly and display all characters as intended, even if $myString were a PHP $_POST variable from an HTML form. Am I right?
If your source file is UTF-8 encoded, then the string $myString is also UTF-8 encoded and you don't need to convert it. Otherwise, you need to use iconv() to convert the encoding before writing it to the file.
And note utf8_encode() is used to encode an ISO-8859-1 string to UTF-8.
Note that utf8_encode will only convert ISO-8859-1 encoded strings.
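In general, PHP strings are plain byte sequences with no built-in notion of encoding, so you need to make sure any string containing non-ASCII characters is actually UTF-8 encoded before writing it to a UTF-8 file. Writing a BOM in front of the data does not convert it.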
The BOM is optional (most text file readers now will scan the file for its encoding).
From Wikipedia:
The Unicode Standard permits the BOM in UTF-8, but does not require or recommend for or against its use.
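Putting the answers above together, a sketch of a writer that converts only when needed and validates before writing could look like this. The helper name writeUtf8File and its parameters are hypothetical, and the caller is assumed to know the encoding of the incoming text (for form data, that is normally the encoding of the page that submitted it).

```php
<?php
// Hypothetical helper: write $text to $path as UTF-8, optionally prefixed with a BOM.
// $sourceEncoding is whatever encoding the caller knows the text to be in.
function writeUtf8File(
    string $path,
    string $text,
    string $sourceEncoding = 'UTF-8',
    bool $withBom = true
): void {
    if ($sourceEncoding !== 'UTF-8') {
        $text = mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
    }
    // A BOM in front of bytes that are not UTF-8 would only mislabel the file.
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Text is not valid UTF-8');
    }
    $bom = $withBom ? "\xEF\xBB\xBF" : '';
    if (file_put_contents($path, $bom . $text) === false) {
        throw new RuntimeException("Could not write $path");
    }
}

// Usage: "caf\xE9" is 'café' with é as a single ISO-8859-1 byte.
$path = tempnam(sys_get_temp_dir(), 'bom');
writeUtf8File($path, "caf\xE9", 'ISO-8859-1');
$written = file_get_contents($path);
unlink($path);

echo bin2hex($written), "\n"; // efbbbf636166c3a9
```

The validation step is what catches the failure case from the question: data that arrives in some other encoding and would otherwise be written behind a UTF-8 BOM unchanged.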

utf8_encode function purpose

Suppose that I'm encoding my files as UTF-8.
Within PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); // Do I need this step?
if(preg_match('/ぁ/u',$string))
//Do if match...
Is that string really UTF-8 without the utf8_encode() function?
If you encode your files as UTF-8, do you not need this function?
If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.
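The byte sequence mentioned above, and the garbling caused by converting an already UTF-8 string again, can be reproduced with a short sketch. It uses mb_convert_encoding with ISO-8859-1 as the source, which is the same conversion utf8_encode performs, and assumes the mbstring extension is available.

```php
<?php
$s = "\xE3\x81\x82"; // UTF-8 bytes of あ (U+3042)

// Show each byte in binary, as PHP holds it between the quotes.
$bits = implode(' ', array_map(
    function (int $byte): string {
        return sprintf('%08b', $byte);
    },
    unpack('C*', $s)
));
echo $bits, "\n"; // 11100011 10000001 10000010

// Treating those three bytes as ISO-8859-1 characters and converting to UTF-8
// turns each byte into a two-byte sequence: the original character is gone.
$garbled = mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
echo bin2hex($garbled), "\n"; // c3a3c281c282
```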
PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variable's content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a question mark or similar. So before you go on, you already see that something might not be as intended. (It turned out this was a missing font on my end.)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions, encoding does play a role. That's why there is the u modifier, which tells PCRE to treat both the pattern and the subject as UTF-8. If the data is already UTF-8 encoded, you don't need to convert it before you use preg_match. With your code example, though, regular expressions are not necessary at all; a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff
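The difference the u modifier makes can be checked with a one-character pattern. This is only a sketch; the strings are written as byte escapes so that the file's own encoding plays no part.

```php
<?php
$utf8   = "\xC3\xA4"; // 'ä' in UTF-8 (two bytes)
$latin1 = "\xE4";     // 'ä' in ISO-8859-1 (one byte, not valid UTF-8)

// With /u the two bytes count as one character.
$withU = preg_match('/^.$/u', $utf8);
var_dump($withU); // int(1)

// Without /u the pattern works on bytes, so "exactly one" does not match two.
$withoutU = preg_match('/^.$/', $utf8);
var_dump($withoutU); // int(0)

// With /u and a subject that is not valid UTF-8, preg_match() fails outright.
$bad = preg_match('/^.$/u', $latin1);
var_dump($bad); // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// A plain comparison needs no modifier at all: it just compares bytes.
var_dump("\xC3\xA4" === $utf8); // bool(true)
```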
Your string is not valid UTF-8, so preg_match can't match it, which is why you need to utf8_encode it. Try saving the PHP file as UTF-8 (using something like Notepad++) and it may work without it.
Summary:
The utf8_encode() function converts every byte of the given string to UTF-8 as if it were an ISO-8859-1 character, no matter what encoding was actually used to store the file.
Its purpose is to encode strings¹ that aren't UTF-8 yet.
1. The correct use of this function is to pass it an ISO-8859-1 string.
Why? Because Unicode and ISO-8859-1 have the same characters at the same positions (for the first 256 code points).
Encoding       Char   Byte   utf8_encode output   Correct UTF-8 for that character?
Windows-1252   €      80     C2 80                No (the euro sign is E2 82 AC in UTF-8)
ISO-8859-1     ¢      A2     C2 A2                Yes
The function appears to work with other encodings as well, but only if the string contains nothing but characters that have the same byte values as in ISO-8859-1 (e.g. in Windows-1252, the positions 00-7F and A0-FF).
Keep in mind that if the function receives a string that is already UTF-8 (for example from a file saved as UTF-8), it will encode that string a second time and produce garbage.
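The euro and cent cases from the summary can be verified with mbstring, as a sketch. Converting from ISO-8859-1 here stands in for utf8_encode, which performs the same conversion; hexOf is a made-up helper for readable output.

```php
<?php
// Hypothetical helper for printing bytes as uppercase hex.
function hexOf(string $bytes): string
{
    return strtoupper(bin2hex($bytes));
}

$euroCp1252 = "\x80"; // € in Windows-1252
$centLatin1 = "\xA2"; // ¢ in ISO-8859-1 (same byte in Windows-1252)

// Treated as ISO-8859-1, byte 80 becomes the control character U+0080, not the euro sign.
$euroWrong = hexOf(mb_convert_encoding($euroCp1252, 'UTF-8', 'ISO-8859-1'));
echo $euroWrong, "\n"; // C280

// Naming the real source encoding gives the correct euro sign, U+20AC.
$euroRight = hexOf(mb_convert_encoding($euroCp1252, 'UTF-8', 'Windows-1252'));
echo $euroRight, "\n"; // E282AC

// Byte A2 means ¢ in both encodings, so either conversion is correct.
$cent = hexOf(mb_convert_encoding($centLatin1, 'UTF-8', 'ISO-8859-1'));
echo $cent, "\n"; // C2A2
```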
