iconv() vs. utf8_encode() - PHP

When you have a charset different from UTF-8 and you need to put it into JSON format to migrate it to a DB, there are two methods that can be used in PHP: calling utf8_encode() or iconv(). I would like to know which one has better performance, and when it is convenient to use one or the other.

when you have a charset different from UTF-8
Nope - utf8_encode() is suitable only for converting an ISO-8859-1 string to UTF-8. iconv() provides a vast number of source and target encodings.
Regarding performance: I have no idea how utf8_encode() works internally and what libraries it uses, but my prediction is there won't be much of a difference - at least not on "normal" amounts of data in the byte or kilobyte range. If in doubt, do a benchmark.
I tend to use iconv() because it's clearer that there is a conversion from character set A to character set B.
Also, iconv() provides more detailed control on what to do when it encounters invalid data. Adding //IGNORE to the target character set will cause it to silently drop invalid characters. This may be helpful in certain situations.
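For illustration, a minimal sketch of the difference (the sample string is made up; the transliterated output depends on the iconv implementation and locale):

<?php
// Valid UTF-8 input containing "ü", converted down to plain ASCII.
$utf8 = "Z\u{00FC}rich"; // "Zürich"

// //TRANSLIT approximates characters the target encoding lacks;
// //IGNORE silently drops them instead.
echo iconv('UTF-8', 'ASCII//TRANSLIT', $utf8), "\n"; // e.g. "Zurich"
echo iconv('UTF-8', 'ASCII//IGNORE', $utf8), "\n";   // "Zrich"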

I recommend writing your own function.
It will be 2-3 lines long, and it will be better than struggling with locale, iconv, etc. issues.
For example:
Fix Turkish Charset Issue Html / PHP (iconv?)
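For illustration, a minimal sketch of such a helper, assuming any non-UTF-8 input is ISO-8859-1 (the function name to_utf8() is made up for this example):

<?php
// Hypothetical helper: return $str as UTF-8, converting only when needed.
// Assumes anything that is not already valid UTF-8 is ISO-8859-1.
function to_utf8(string $str): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    return iconv('ISO-8859-1', 'UTF-8', $str);
}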

Related

When is the correct time to use utf8_encode and utf8_decode?

Character encoding has always been a problem for me. I don't really get when the correct time to use it is.
All the databases I use now I set up with utf8_general_ci, as that seems to be a good 'general' start. I have since learned in the past five minutes that it is case insensitive. So that's helpful.
But my question is: when should I use utf8_encode and utf8_decode? As far as I can see now, if I $_POST a form from a table on my website, I need to utf8_encode() the value before I insert it into the database.
Then when I pull it out, I need to utf8_decode it. Is that the case? Or am I missing something?
utf8_encode and _decode are pretty bad misnomers. The only thing these functions do is convert between UTF-8 and ISO-8859-1 encodings. They do exactly the same thing as iconv('ISO-8859-1', 'UTF-8', $str) and iconv('UTF-8', 'ISO-8859-1', $str) respectively. There's no other magic going on which would necessitate their use.
If you receive a UTF-8 encoded string from the browser and you want to insert it as UTF-8 into the database using a database connection with the utf8 charset set, there is absolutely no use for either function anywhere in this chain. You are not interested in converting encodings at all here, and that should be the goal.
The only time you could use either function is if you need to convert from UTF-8 to ISO-8859-1 or vice versa at any point, because external data is encoded in this encoding or an external system expects data in this encoding. But even then, I'd prefer the explicit use of iconv or mb_convert_encoding, since it makes it more obvious and explicit what is going on. And in this day and age, UTF-8 should be the default go-to encoding you use throughout, so there should be very little need for such conversion.
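A quick sketch of that equivalence:

<?php
$latin1 = "B\xFCrki"; // "Bürki" in ISO-8859-1 (0xFC = ü)

// All three produce the same UTF-8 bytes. Note that utf8_encode()
// is deprecated as of PHP 8.2 for exactly this reason.
$a = utf8_encode($latin1);
$b = iconv('ISO-8859-1', 'UTF-8', $latin1);
$c = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($a === $b && $b === $c); // bool(true)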
See:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Handling Unicode Front To Back In A Web App
UTF-8 all the way through
Basically, utf8_encode is used to encode an ISO-8859-1 string to UTF-8.
When you are working on translation from one language to another, you may have to use this function to prevent garbage characters from showing up.
For example, when you display Spanish characters, sometimes the script doesn't recognize them and displays garbage characters instead.
In that case you can use utf8_encode.
For more about this, please follow this link:
http://php.net/manual/en/function.utf8-encode.php

PHP utf-8 best practices and risks for distributed web applications

I have read several things about this topic but still I have doubts I want to share with the community.
I want to add complete UTF-8 support to the application I developed, DaDaBIK; the application can be used with different DBMSs (such as MySQL, PostgreSQL, SQLite). The charset used in the databases can be ANY. I can't set or assume the charset.
My approach would be to convert, using iconv functions, everything I read from the DB into UTF-8 and then convert it back to the original charset when I have to write to the DB. This would allow me to assume I'm working with UTF-8 throughout.
The problem, as you probably know, is that PHP doesn't support UTF-8 natively and, even assuming mbstring is used, there are (according to http://www.phpwact.org/php/i18n/utf-8) several PHP functions which can create problems with UTF-8 and DON'T have an mbstring correspondence, for example the PREG extension, strcspn, trim, ucfirst, ucwords...
Since I'm using some external libraries such as ADOdb and htmLawed, I can't control all the source code; in those libraries there are several uses of those functions. Do you have any advice about this? And above all, how do very popular applications like WordPress handle this (IMHO big) problem? I doubt they don't have any "trim" in the code... do they just take the risk (data corruption, for example), or is there something I can't see?
Thanks a lot.
First of all: PHP supports UTF-8 just fine natively. Only a few of the core functions dealing with strings should not be used on multi-byte strings.
It entirely depends on the functions you are talking about and what you're using them for. PHP strings are encoding-less byte arrays. Most standard functions therefore just work on raw bytes. trim just looks for certain bytes at the start and end of the string and trims them off, which works perfectly fine with UTF-8 encoded strings, because UTF-8 is entirely ASCII compatible. The same goes for str_replace and similar functions that look for characters (bytes) inside strings and replace or remove them.
The only real issue is functions that work with an offset, like substr. The default functions work with byte offsets, whereas you really want a more intelligent character offset, which does not necessarily correspond to bytes. For those functions an mb_ equivalent typically exists.
preg_ supports UTF-8 just fine using the /u modifier.
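A quick sketch of the byte-offset vs. character-offset difference and the /u modifier:

<?php
$str = "héllo"; // valid UTF-8; "é" is two bytes (0xC3 0xA9)

echo strlen($str), "\n";             // 6 - counts bytes
echo mb_strlen($str, 'UTF-8'), "\n"; // 5 - counts characters

echo substr($str, 0, 2), "\n";             // "h" plus half of "é" - mangled
echo mb_substr($str, 0, 2, 'UTF-8'), "\n"; // "hé"

// With /u the dot matches whole UTF-8 characters, so this matches:
var_dump(preg_match('/^.{5}$/u', $str)); // int(1)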
If you have a library which uses, for instance, substr on a potential multi-byte string, use a different library because it's a bad library.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for some more in-depth discussion and demystification about PHP and character sets.
Further, it does not matter what encoding the database stores the strings in. You can set the connection encoding for the database, which will cause it to convert everything for you and always return data in the desired client encoding. No need to iconv() everything in PHP.
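For example, a sketch of setting the connection encoding, assuming MySQL (host, credentials and database name are placeholders):

<?php
// PDO: declare the client encoding in the DSN.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'password');

// mysqli: set it on the open connection.
$mysqli = new mysqli('localhost', 'user', 'password', 'test');
$mysqli->set_charset('utf8mb4');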

UTF8 to latin1_swedish_ci

There are a lot of topics about latin1_swedish_ci to UTF-8 conversion. But what about the other way around? I've been dealing with this problem for quite a long time and I haven't found a solution so far. Since I don't know what else is accessing this database, I don't want to change the character encoding of the table.
I have in the table a column which uses latin1_swedish_ci. Now I have to write queries in PHP. This database contains German and French names, meaning that I have characters like ö, ä, ô and so on. How can I do that?
As an example, if I want to query the name 'Bürki', then I have to write something like $name='Bürki'. Is there a proper way to convert it to latin1_swedish_ci without using string replacement for those special characters?
iconv() will convert strings from one encoding to the other.
The encodings that are of interest to you are UTF-8 and ISO-8859-1 - the latter is equivalent to latin1.
The "swedish", "german", etc. suffixes are collations: they affect sorting and comparison only, the character encoding is always the same.
PS.
then I have to write something like $name='Bürki'.
If you encode your source file as UTF-8, you can write Bürki directly. (You would then have to convert that string into iso-8859-1)
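A small sketch of that, assuming the source file is saved as UTF-8 and $pdo is an open PDO connection (table and column names are placeholders):

<?php
// The source file is UTF-8, so this literal is UTF-8 encoded.
$name = 'Bürki';

// Convert to ISO-8859-1 (latin1) before querying the latin1 column.
$latin1Name = iconv('UTF-8', 'ISO-8859-1', $name);

$stmt = $pdo->prepare('SELECT * FROM people WHERE name = ?');
$stmt->execute([$latin1Name]);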
I agree with Pekka; however, I would try to use the utf8_decode() function instead, because it is possible that iconv is not installed...
iconv, however, is more powerful - it can do transliteration, for example. But for this purpose I believe utf8_decode() is enough.

about mb string and normal string in PHP

How do I know whether a string is a multibyte string, so that we should use mb_strlen instead of strlen?
You need to always know what encoding a string is in, and whether it is a multibyte one. After all, you need to pass the string's encoding as the second parameter to mb_strlen() to get reliable results, right?
The encoding of incoming data will always be defined in some way - the page's encoding when processing form data; the database connection's and tables' encoding when processing database data; and so on. It is your job to build the flow in a way that you always know what is in what encoding where.
The only exception is when you're dealing with arbitrary third-party data that doesn't declare its content's encoding properly. It is then (and only then) that it's okay to employ sniffing functions like mb_detect_encoding() and colleagues. Remember that those functions are very error-prone and can give you only an educated guess as to what encoding a string is in, not hard, reliable info.
No. A string is a string. There is no way to tell whether it contains multi-byte characters.
You can guess with something like mb_detect_encoding(), but your mileage may vary depending on the charset and encoding. For example, UTF-8 has a very distinct pattern and you will get very good results. But other encodings like GB2312 are really hard to detect.
If you are designing a new protocol or system, it's best to keep the encoding information.
Compare the strlen and the mb_strlen results, and if they do not match, the string contains multibyte characters.
Isn't mb_check_encoding or mb_detect_encoding supposed to be used for that?
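A minimal sketch of both suggestions (this assumes the string is valid UTF-8):

<?php
$ascii = "hello";
$multi = "héllo";

// If byte length and character length differ, multi-byte characters exist.
var_dump(strlen($ascii) !== mb_strlen($ascii, 'UTF-8')); // bool(false)
var_dump(strlen($multi) !== mb_strlen($multi, 'UTF-8')); // bool(true)

// mb_check_encoding() only tells you whether the bytes are valid in a
// given encoding, not which encoding the string is "really" in.
var_dump(mb_check_encoding($multi, 'UTF-8')); // bool(true)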

Strange behaviour of mb_detect_order() in PHP

I would like to detect the encoding of some text (using PHP).
For that purpose I use the mb_detect_encoding() function.
The problem is that the function returns different results if I change the order of possible encodings with the mb_detect_order() function.
Consider the following example
$str = <<<STR
ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります。特に商売をされている個人の方、法人の方は気をつけるようにしてください
STR;
mb_detect_order(array('UTF-8','EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($str);
die($originalEncoding); // $originalEncoding = 'UTF-8'
However, if you change the order of encodings in mb_detect_order(), the result will be different:
mb_detect_order(array('EUC-JP','UTF-8', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($str);
die($originalEncoding); // $originalEncoding = 'EUC-JP'
So my questions are:
Why is that happening?
Is there a way in PHP to correctly and unambiguously detect the encoding of a text?
That's what I would expect to happen.
The detection algorithm probably just keeps trying, in order, the encodings you specified in mb_detect_order and then returns the first one under which the bytestream would be valid.
Something more intelligent requires statistical methods (I think machine learning is commonly used).
EDIT: See e.g. this article for more intelligent methods.
Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast, but the implementation applies many domain specific knowledges in case-by-case basis. As opposed to their methods, we aimed at a simple algorithm which can be uniformly applied to every charset, and the algorithm is based on well-established, standard machine learning techniques. We also studied the relationship between language and charset detection, and compared byte-based algorithms and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).
Not really. Different encodings often have large areas of overlap, and if the string you are testing exists entirely inside that overlap, then both encodings are acceptable.
For example, UTF-8 and ISO-8859-1 are the same for the letters a-z. The string "hello" would have an identical sequence of bytes in both encodings.
This is exactly why there is an mb_detect_order() function in the first place: it allows you to say what you would prefer to happen when these clashes occur. Would you like "hello" to be UTF-8 or ISO-8859-1?
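A quick sketch of that ambiguity:

<?php
$str = "hello"; // pure ASCII: valid in both UTF-8 and ISO-8859-1

// The first encoding in the list that the bytes are valid in wins:
echo mb_detect_encoding($str, array('UTF-8', 'ISO-8859-1')), "\n"; // UTF-8
echo mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8')), "\n"; // ISO-8859-1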
Keep in mind mb_detect_encoding() does not know what encoding the data is in. You may see a string, but the function itself only sees a stream of bytes. Going by that, it needs to guess the encoding - e.g. ASCII if all bytes are in the 0-127 range, UTF-8 if there are ASCII bytes plus 128+ bytes that occur only in valid multi-byte sequences, and so forth.
As you can imagine, given that context, it's quite difficult to detect an encoding reliably.
Like rihk said, this is what the mb_detect_order() function is for - you're basically supplying your best guess what the data is likely to be. Do you work with UTF-8 files frequently? Then chances are your stuff isn't likely to be UTF-16 even if mb_detect_encoding() could guess it as that.
You might also want to check out Artefacto's link for a more in-depth view.
Example case: Internet Explorer uses some interesting encoding guessing if nothing is specified (#link, Section: 'To automatically detect a website's language'), which has caused strange behaviour on websites that took encoding for granted in the past. You can probably find some amusing stuff on that if you google around. It makes for a nice showcase of how even statistical methods can backfire horribly, and why encoding-guessing in general is problematic.
mb_detect_encoding() looks at the first charset entry in your mb_detect_order() and then loops through your input $html, checking character by character whether each character falls within the valid set of characters for that charset. If every character matches, it returns that charset; if any character fails, it moves on to the next charset in the mb_detect_order() and tries again.
The wikipedia list of charsets is a good place to see the characters that make up each charset.
Because charsets overlap (the same byte sequence can be valid in, say, both 'UTF-8' and 'EUC-JP' while representing a totally different character in each), a string can match more than one charset. Unless some character value exists in one charset but not in another, mb_detect_encoding() can't rule any candidate out, and it will return the first charset from your array list which could be valid.
As far as I'm aware, there is no surefire way of identifying a charset. PHP's "best guess" method can be helped if you have a reasonable idea of what charsets you are likely to encounter, and order your list accordingly based on the gaps (invalid characters) in each charset.
The best solution is to "know" the charset. If you are scraping your html from another page, look for the charset identifier in the header of that page.
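For instance, a rough sketch of pulling a declared charset out of scraped HTML (the helper name is made up, the regex is deliberately simplified, and a <meta> tag is only one of several places a charset can be declared; the HTTP Content-Type header is another):

<?php
// Matches <meta charset="..."> as well as the older
// <meta http-equiv="Content-Type" content="...; charset=..."> form.
function find_declared_charset(string $html): ?string
{
    if (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

echo find_declared_charset('<meta charset="utf-8">'); // UTF-8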
If you really want to be clever, you can try and identify the language in which the html is written, perhaps using trigrams or n-grams or similar as described in this article on PHP/ir.
