Understanding character encoding in PHP - php

I am struggling to understand character encoding in PHP.
Consider the following script (you can run it here):
$string = "\xe2\x82\xac";
var_dump(mb_internal_encoding());
var_dump($string);
var_dump(unpack('C*', $string));
$utf8string = mb_convert_encoding($string, "UTF-8");
var_dump($utf8string);
var_dump(unpack('C*', $utf8string));
mb_internal_encoding("UTF-8");
var_dump($string);
var_dump($utf8string);
I have a string, actually the € character, represented with its Unicode code point. Up to PHP 5.5 the default internal encoding is ISO-8859-1, hence I think that my string will be encoded using this encoding. With unpack I can see the byte representation of my string, and it corresponds to the hexadecimal codes I used to define the string.
Then I convert the encoding of the string to UTF-8, using mb_convert_encoding. At this point the string displays differently on the screen and its byte representation changes (and this is expected).
If I also change the PHP internal encoding to UTF-8, I'd expect $utf8string to be displayed correctly on the screen, but this doesn't happen.
What am I missing?

The script you show doesn't contain any literal non-ASCII characters, so the internal encoding does not make any difference. mb_internal_encoding() does not convert your data on output; it only sets the default encoding the mbstring functions assume. This question will tell you more about how it works; it will also tell you it's better not to use it.
The three-byte string $string in your code is the UTF-8 representation of the euro symbol, not its "Unicode code point" (which is U+20AC: two bytes in UTF-16, like most common Unicode characters).
Does this clear up the behavior you see?

You started with a string that is the utf-8 representation of the Euro symbol. If you run echo($string) all versions of PHP produce the three bytes you put in $string. How they are interpreted by the browser depends on the character set specified in the Content-Type header. If it is text/html; charset=utf-8 then you get the Euro sign in the rendered page.
Then you make the wrong move: you call mb_convert_encoding() with only two arguments. This lets PHP fall back to the mbstring extension's current internal encoding for the third argument ($from_encoding). Why does that matter?
For PHP 5.6 and newer, the default value returned by mb_internal_encoding() is UTF-8, and the call to mb_convert_encoding() is a no-op.
But for earlier versions of PHP, the default value returned by mb_internal_encoding() is ISO-8859-1, which doesn't match the encoding of your string. Accordingly, mb_convert_encoding() interprets the bytes of $string as three individual ISO-8859-1 characters and encodes each of them using the rules of UTF-8. The outcome is obviously wrong.
Btw, if you initialize $string with a literal '€' (in a source file saved as UTF-8) you get the same output on all PHP versions (even on PHP 4, iirc).
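As a sketch of the safer call, you can pass the source encoding explicitly so the result no longer depends on mb_internal_encoding() (the variable names below are just illustrative):
$string = "\xe2\x82\xac"; // the euro sign, already UTF-8
// Explicit $to_encoding and $from_encoding: behaves the same on every PHP version.
$stillUtf8 = mb_convert_encoding($string, 'UTF-8', 'UTF-8'); // no-op, still "\xe2\x82\xac"
// For comparison, converting a genuine ISO-8859-1 byte works as expected:
$cent = mb_convert_encoding("\xa2", 'UTF-8', 'ISO-8859-1'); // "\xc2\xa2", i.e. '¢'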

Related

PHP strlen and mb_strlen not working as expected

PHP functions strlen() and mb_strlen() both are returning the wrong number of characters when I run them on a string.
Here is a piece of the code I'm using...
$foo = mb_strlen($itemDetails['ITEMDESC'], 'UTF-8');
echo $foo;
It is telling me this string - "4Â½Â" Straight Iris Scissors" - is 45 characters long. It's actually 27.
It also tells me that this string - "Infant Heel Warmer, No Adhesive Attachment Pad, 100/cs" is 54, which is correct.
I assume it's some issue with character encoding; everything should be UTF-8, I think. I've tried feeding mb_strlen() several different character encodings and they all return this oddball count for the string that has those non-standard characters.
I've no idea why this is happening.
Double-check whether your text really is UTF-8 or not. That "Â" character makes it look like a classic character encoding problem to me. You should check the entire path from the origin of the text through the point in your code that you quoted above, because there are a lot of places where the encodings can get munged.
Did the text originate from an HTML form? Ensure your <form> element includes the accept-charset="UTF-8" attribute.
Did the text get stored in a database along the way? Make sure the database stores and returns the data in UTF-8. This means checking the server's global defaults, the defaults for the database or schema, and the table itself.
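If you want to inspect the data itself, a quick diagnostic sketch along these lines may help (it assumes $itemDetails['ITEMDESC'] holds the suspect value; the hex strings in the comments are just the patterns to look for):
$desc = $itemDetails['ITEMDESC'];
var_dump(mb_check_encoding($desc, 'UTF-8')); // true even for double-encoded text
echo bin2hex($desc); // '½' in UTF-8 is c2bd; double-encoded it shows up as c382c2bd
// If the data turns out to be UTF-8 that was encoded a second time,
// decoding one layer back to ISO-8859-1 repairs it:
$fixed = mb_convert_encoding($desc, 'ISO-8859-1', 'UTF-8');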
It is very likely that your input is encoded in UTF-16.
You may convert to UTF-8
$foo = mb_strlen(mb_convert_encoding($itemDetails['ITEMDESC'], "UTF-8", "UTF-16"));
or, if you use mb_strlen() directly, be sure to pass the proper encoding as the second parameter:
$foo = mb_strlen($itemDetails['ITEMDESC'], "UTF-16");
Without the correct encoding, mb_strlen() will return wrong results. It's easy to get into trouble when you're dealing with UTF-8/16/32 encoded strings, and mb_detect_encoding() will not solve this problem.

utf8_encode function purpose

Suppose that I'm encoding my files as UTF-8.
Within a PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); //Do i need this step?
if(preg_match('/ぁ/u',$string))
//Do if match...
Is that string really UTF-8 without the utf8_encode() function?
If I encode my files as UTF-8, do I even need this function?
If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible with ASCII. That means all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.
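A small sketch of that point (assuming the file below is saved as UTF-8):
$string = "あ";
echo bin2hex($string); // e38182, exactly the bytes saved in the file
var_dump($string === "\xe3\x81\x82"); // bool(true), PHP compares the raw bytes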
Generally, PHP does not care about string encoding; strings are binary data within PHP. So if encoding matters, you must know yourself which encoding the data inside the string uses. The question is: does encoding matter in your case?
If you set a string variables content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a question mark or similar. So before you go on, you already see that something might not be as intended. (Turned out it was a missing font on my end.)
This also shows that your editor saves the file as UTF-8 or some other flavour of Unicode encoding. Just keep the following in mind: one file, one encoding. If you store the string inside the file, it's in the encoding of that file. Check in which encoding your editor saves the file; then you know the encoding of the string.
Let's just assume it is some valid UTF-8, like so (a character my font does support):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions, encoding does play a role. That's why there is the u modifier, which tells the regex engine to treat the pattern and the subject as UTF-8. However, if the data is already UTF-8 encoded, you don't need to convert it before you use preg_match(). And in your code example, regular expressions are not necessary at all; a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff
Your string is not a UTF-8 character, so preg_match() can't match it, which is why you need to utf8_encode() it. Try encoding the PHP file as UTF-8 (use something like Notepad++) and it may work without it.
Summary:
The utf8_encode() function will encode every byte of a given string to UTF-8, no matter what encoding was previously used to store the file.
Its purpose is to encode strings¹ that aren't UTF-8 yet.
1. The correct use of this function is to give it an ISO-8859-1 string as the parameter.
Why? Because Unicode and ISO-8859-1 have the same characters at the same positions (code points U+0000 to U+00FF correspond one-to-one to the ISO-8859-1 byte values).
[Encoding] [Char][Value/Position] ----> [Encoded Value/Position]
[Windows-1252] [€][80] ----> [C2|80] Is this the UTF-8 encoded value/position of [€]? No
[ISO-8859-1] [¢][A2] ----> [C2|A2] Is this the UTF-8 encoded value/position of [¢]? Yes
The function may seem to work with other encodings too: it works if the string to encode contains only characters that have the same values as in ISO-8859-1 (e.g. on Windows-1252, the 00-7F and A0-FF positions).
Keep in mind that if the function receives a string that is already UTF-8 (a file encoded as UTF-8), it will encode that UTF-8 string a second time and produce garbage.
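A short sketch of both cases (the hex strings in the comments are the resulting byte sequences):
$latin1 = "\xa2"; // '¢' as a single ISO-8859-1 byte
echo bin2hex(utf8_encode($latin1)); // c2a2, the correct UTF-8 for '¢'
$utf8 = "\xc2\xa2"; // '¢' already encoded as UTF-8
echo bin2hex(utf8_encode($utf8)); // c382c2a2, encoded twice, i.e. garbage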

DomDocument and special characters written in two bytes

I have a web application, written in PHP, based on UTF-8 (both PHP and MySQL are on UTF-8). Everything is beautiful - no problem with special characters.
However, I had to build an export to XML with encoding ISO-8859-2 (Polish), so I picked DomDocument because it has built-in encoding conversion.
But when I sent the XML to my partner for validation, he said that one of the tags has too many characters. That was strange, because it was within the specified maximum number of characters. Then I opened the file in a hex editor and saw that every special character takes two bytes.
I have tried to convert the result with iconv and mb_convert_encoding.
Iconv says:
iconv() [<a href='function.iconv'>function.iconv</a>]: Detected an illegal character in input string in file application/controllers/report/export.php at 169
mb_convert_encoding is simply deleting all special characters and result is encoded in ASCII.
Is there a way to convert the output of DomDocument to one-byte characters?
Thanks in advance!
One problem when switching between encodings is that, even with transliteration, not all characters are representable in other encodings in a single byte.
For example, consider the EURO SIGN, a character that takes 3 bytes when encoded in UTF-8. If you look at a charset support table for that character, you can see that ISO-8859-2 is not among the encodings that can represent it.
Since ISO-8859-2 has no single character for the euro sign, transliteration does its best to still represent it in the output:
echo iconv( 'UTF-8', 'ISO-8859-2//TRANSLIT', '€' ); // EUR
In this example, we still end up with 3 bytes to represent the euro sign after transliterating.
EDIT
P.S. The NOTICE level error you're getting is because you executed iconv() without the transliteration flag. And as I highlighted above, the EURO SIGN doesn't exist in ISO-8859-2, so you clearly have at least one character in your data that also doesn't exist in ISO-8859-2, so you'll have to use transliteration. Just know that it doesn't guarantee that you'll get down to 1 byte/char.
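To make the two behaviours concrete, here is a small sketch (the XML string is assumed to be UTF-8):
$xml = '<price>10 €</price>'; // UTF-8 input containing a character ISO-8859-2 lacks
// Without //TRANSLIT, iconv() stops at the euro sign and raises the notice you saw:
// iconv('UTF-8', 'ISO-8859-2', $xml);
// With //TRANSLIT, the euro sign is approximated, but as three bytes ("EUR"), not one:
$latin2 = iconv('UTF-8', 'ISO-8859-2//TRANSLIT', $xml); // '<price>10 EUR</price>'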

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform a UTF-8 string into a URL-safe string you can use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() not to fail when it meets a character it can't convert, and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
The next step is to preg_replace() spaces into underscores (or hyphens) and to substitute or drop any character which is unsafe within a URL, either with preg_replace() or urlencode(), as sketched below.
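A minimal sketch of such a helper (the name slugify() is made up, and iconv's transliteration output can vary with the system locale):
function slugify($str)
{
    // Transliterate to ASCII, dropping anything iconv cannot map.
    $str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
    // Lowercase, collapse runs of non-alphanumerics into single hyphens, trim the ends.
    $str = strtolower($str);
    $str = preg_replace('/[^a-z0-9]+/', '-', $str);
    return trim($str, '-');
}
echo slugify('medúlla'); // "medulla"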
As for the database stuff, you really should have done all this setting before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry, and remember that the database will return you, byte by byte, exactly what you stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8-encode and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (though sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
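For instance (assuming the 'ú' below is stored as UTF-8):
echo rawurlencode('ú'); // %C3%BA, suitable for path segments
echo urlencode('ú');    // %C3%BA here too, but meant for query components
echo rawurlencode('a b'); // a%20b
echo urlencode('a b');    // a+b, the difference shows up with spaces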
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, so the escapes are interpreted) to avoid the problem.
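For example (the variable names are illustrative):
// "\xc3\xba" spells out the UTF-8 bytes of 'ú', so this works
// no matter which encoding the source file itself was saved in.
$slug = str_replace("\xc3\xba", 'u', $value);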
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.
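A hedged sketch of that call with the old mysql extension this question appears to use (mysqli has an equivalent set_charset() method; the credentials are placeholders):
$link = mysql_connect('localhost', 'user', 'password');
mysql_select_db('mydb', $link);
// Tells the client library which charset the connection uses and, unlike
// SET NAMES, keeps mysql_real_escape_string() aware of it.
mysql_set_charset('utf8', $link);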

(PHP) rawurlencode/decode seems to encode '£' sign as 'Â£' (%C2%A3 instead of %A3)

So, I've run into a problem with PHP's rawurlencode function. All text fields in our web app are of course encoded before being processed by the web server, and we've used rawurlencode for this. This works fine with almost every character I've found, except for the "£" sign. Now, there is no reason for our users to ever enter a pound sign, but they might, so I want to take care of this.
The problem is that rawurlencode doesn't encode a pound sign entered on the webpage as %A3, but instead as %C2%A3. Even worse, if the user failed to enter another bit of critical information (which causes the webpage to refresh - the checks are done on the backend side - and try to refill the form boxes with the information the user had entered), then when the %C2 is run through rawurldecode/encode, it becomes Ã‚ - aka, %C3%82. And of course the "£" is also turned into another "Â£"!
So, what is causing this? I assume it's a character encoding issue, but I'm not that knowledgeable about these things. I heard somewhere that I can encode £s as &pound; manually, but why should I need to do that when the database can handle "£"s, and there is a percent-encoding for a pound sign? Is this a bug in rawurlencode, or a bug caused by differing character sets?
Thanks for any help.
The standard requires forms to be submitted in the character encoding you specify in <form accept-charset="...">, or in UTF-8 if none is specified or if the text the user has entered cannot be represented in the charset you specify.
Clearly, you're receiving the pound sign encoded in UTF-8. If you want to convert it to ISO-8859-15, write:
iconv("UTF-8", "ISO-8859-15//TRANSLIT", $original)
This is probably the A3 character in your native single-byte character set being encoded as C2 A3, which is the valid UTF-8 encoding of U+00A3, the pound sign. Just consume your encoded URL as UTF-8, or convert the string to a single-byte ("ANSI") encoding before you urlencode it.
Artefacto's answer covers the case where you need to convert character encodings, for example when the page you are displaying declares its encoding as Latin-1. rawurlencode() will produce escaped strings with multibyte character representations. rawurldecode() simply gives you back the original bytes; if those bytes were UTF-8, £ is represented as two bytes. If you then display this string while claiming it is ISO-8859 encoded, it will appear as two characters.
A primer on PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8
Some "hot tips": http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/
Likely, between getting the string from rawurldecode, and using the string, the locale is assumed to be ISO8859, so two bytes get interpreted as two characters when they represent one.
Use mb_convert_encoding to force PHP to realize that the bytes in the string represent a UTF-8 encoded string.
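A small sketch of what is going on (it assumes the submitted data is UTF-8, which is the usual case):
$pound = "\xc2\xa3"; // '£' in UTF-8 (U+00A3)
echo rawurlencode($pound); // %C2%A3, two bytes because UTF-8 uses two for £
// If you really need the single-byte %A3 form, convert to a Latin encoding first:
$latin = iconv('UTF-8', 'ISO-8859-15//TRANSLIT', $pound);
echo rawurlencode($latin); // %A3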
