About mb strings and normal strings in PHP

How do I know whether a string is a multibyte string, so that we know to use mb_strlen() instead of strlen()?

You need to always know what encoding a string is in, and whether it is a multibyte one. After all, you need to pass the string's encoding as the second parameter to mb_strlen() to get reliable results, right?
The encoding of incoming data will always be defined in some way - the page's encoding when processing form data; the database connection's and tables' encoding when processing database data; and so on. It is your job to build the flow in a way that you always know what is in what encoding where.
The only exception is when you're dealing with arbitrary third-party data that doesn't declare its content's encoding properly. It is then (and only then) that it's okay to employ sniffing functions like mb_detect_encoding() and colleagues. Remember that those functions are very error-prone and can give you only an educated guess at what encoding a string is in, not hard, reliable info.
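For illustration, a minimal sketch (it assumes this source file itself is saved as UTF-8; var_dump output shown as comments):
$s = "héllo"; // 5 characters, 6 bytes in UTF-8
var_dump(strlen($s)); // int(6) - counts bytes
var_dump(mb_strlen($s, 'UTF-8')); // int(5) - counts characters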

No. A string is a string. There is no way to tell from the string alone whether it contains multibyte characters.
You can guess with something like mb_detect_encoding(), but your mileage may vary depending on the charset and encoding. For example, UTF-8 has a very distinct pattern, so you will get very good results. But other encodings, like GB2312, are really hard to detect.
If you are designing a new protocol or system, it's best to keep the encoding information.
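A minimal sketch of such a guess (the candidate list and the $bytes variable are just assumptions for illustration; the third argument enables strict checking):
$guess = mb_detect_encoding($bytes, array('UTF-8', 'GB2312', 'ISO-8859-1'), true);
// returns an encoding name such as 'UTF-8', or false if no candidate matched
// caveat: single-byte candidates like ISO-8859-1 match almost any byte sequence, so order matters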

Compare the strlen() and mb_strlen() results: if they do not match, the string contains multibyte characters.
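A one-line sketch of that trick, assuming the multibyte encoding in question is UTF-8:
$hasMultibyte = strlen($s) !== mb_strlen($s, 'UTF-8');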

Isn't mb_check_encoding or mb_detect_encoding supposed to be used for that?

Related

Is it a good practice to use mb_convert_encoding function

This question is different from UTF-8 all the way through, as it asks how safe it is, and whether it is good practice, to use the mb_convert_encoding function.
Let's say that a user can upload files using a PHP API. Each filename and path gets stored in a PostgreSQL database table which has UTF-8 as its default encoding.
Sometimes a user uploads files whose names aren't UTF-8 encoded, and they get imported into the database. The problem is that the characters that are not UTF-8 encoded come out scrambled and do not display as they should in the table columns.
I was thinking of adding the following to the PHP code before import:
if ( ! mb_check_encoding($content, 'UTF-8')) {
    $content = mb_convert_encoding($content, 'UTF-8');
}
Does this look like good practice, and will the output be displayed and converted correctly by the user's client if I return UTF-8? Is there potential loss of bytes when using mb_convert_encoding?
If you're going to convert an encoding, you need to know what you're converting from. You can check whether the encoding is or isn't valid UTF-8, but if it tells you it's not valid UTF-8 then you still have no clue what it is. Omitting the $from_encoding parameter from mb_convert_encoding just makes it assume some preset encoding for that parameter, but that doesn't mean that $content actually is in that encoding.
In other words: if you don't know what encoding a string is in, you cannot meaningfully convert it to anything else either, and just trying to convert it from ¯\_(ツ)_/¯ is a crapshoot, with the result equally likely to be something useful or utter garbage.
If you encounter unknown encodings, you only have a few choices:
Reject the input value.
Test whether it's one of a handful of other expected encodings and then explicitly convert from your best guess; but that is pretty much a crapshoot as well.
Just use bin2hex or something similar on the value, essentially giving up on trying to interpret it correctly, but still keeping some semblance of the original value (the latter two options are sketched below).
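A hedged sketch of those two fallbacks (the candidate list and the $filename variable are assumptions for illustration):
$candidates = array('UTF-8', 'ISO-8859-1', 'Windows-1252');
$detected = mb_detect_encoding($filename, $candidates, true); // strict mode
if ($detected !== false) {
    $filename = mb_convert_encoding($filename, 'UTF-8', $detected); // convert from the explicit best guess
} else {
    $filename = bin2hex($filename); // give up on meaning, keep the bytes
}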

iconv() vs. utf8_encode()

When you have a charset different from UTF-8 and you need to put it in JSON format to migrate it to a DB, there are two methods that can be used in PHP: calling utf8_encode() or iconv(). I would like to know which one has better performance, and when it is convenient to use one or the other.
when you have a charset different from UTF-8
Nope - utf8_encode() is suitable only for converting an ISO-8859-1 string to UTF-8. iconv() provides a vast number of source and target encodings.
Re performance, I have no idea how utf8_encode() works internally and what libraries it uses, but my prediction is there won't be much of a difference - at least not on "normal" amounts of data in the bytes or kilobytes. If in doubt, do a benchmark.
I tend to use iconv() because it's clearer that there is a conversion from character set A to character set B.
Also, iconv() provides more detailed control on what to do when it encounters invalid data. Adding //IGNORE to the target character set will cause it to silently drop invalid characters. This may be helpful in certain situations.
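A small sketch of both uses (the $latin1 variable is assumed input; //TRANSLIT, which approximates characters instead of dropping them, is the other common suffix):
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1); // explicit source and target encodings
$ascii = iconv('UTF-8', 'ASCII//IGNORE', $utf8); // silently drop characters that don't fit the target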
I recommend writing your own function.
It will be two or three lines long, and it will be better than struggling with locale, iconv, and similar issues.
For example:
Fix Turkish Charset Issue Html / PHP (iconv?)
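That linked answer boils down to a character-map approach; a hypothetical sketch (the mapping assumes Turkish ISO-8859-9 text that was mis-decoded as ISO-8859-1):
function fixTurkish($s) {
    // the same byte positions hold different characters in the two charsets
    $map = array('Ý' => 'İ', 'ý' => 'ı', 'Þ' => 'Ş', 'þ' => 'ş', 'Ð' => 'Ğ', 'ð' => 'ğ');
    return strtr($s, $map);
}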

Methods for identifying encoding type using php

I have a PHP string variable which may arrive encoded either as hexadecimal or as Base64.
For example:
737461636b6f766572666c6f772e636f6d
c3RhY2tvdmVyZmxvdy5jb20=
Both lines mean stackoverflow.com. The problem is that I do not know which one is hex and which is Base64, and because of that I do not know which decoding method to apply.
Is it possible to determine the encoding method from the encoded text alone? If yes, how can it be done in PHP?
There is no way to know for sure whether the string is in Base64/HEX just by looking at it. You will have to include an additional bit with the string indicating which one it is, and then read that in your code and decode as required.
If, by chance, the string contains a letter after 'f', you can be sure that it is Base64; but it may be Base64 even if it does not contain one, so there is no way to be sure without some kind of header before the string telling you what the encoding is.
If you can guarantee only those two encodings, the Base64 will usually end with an = (the padding is absent only when the input byte length is a multiple of three) and the hex will only include [a-fA-F0-9].
This should not be too difficult. The valid set of characters for hex is [0-9a-f], while the valid set for Base64 is more like [a-zA-Z0-9\+/] possibly with one or two trailing = characters for padding. You should be able to use a regex to discriminate between one and the other.
Of course, there may be some instances where a string appears to be valid in both encodings, so there is no sure-fire way to test based just upon the string itself. Generally speaking, however, it would be fairly rare for a non-trivial input string encoded in Base64 to result in an output string that includes only valid hexadecimal characters and no padding characters. Fairly rare, but not impossible.
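A rough heuristic along those lines (a sketch, not a guarantee; strings valid in both alphabets classify as hex here, which is an arbitrary tie-break):
function guessDecoding($s) {
    if (preg_match('/^[0-9a-fA-F]+$/', $s) && strlen($s) % 2 === 0) {
        return 'hex'; // even-length string of hex digits
    }
    if (preg_match('#^[A-Za-z0-9+/]+={0,2}$#', $s) && strlen($s) % 4 === 0) {
        return 'base64'; // Base64 alphabet, optional = padding, length a multiple of 4
    }
    return 'unknown';
}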

Strange behaviour of mb_detect_order() in PHP

I would like to detect the encoding of some text (using PHP).
For that purpose I use the mb_detect_encoding() function.
The problem is that the function returns different results if I change the order of possible encodings with the mb_detect_order() function.
Consider the following example
$html = <<< STR
ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります。特に商売をされている個人の方、法人の方は気をつけるようにしてください
STR;
mb_detect_order(array('UTF-8','EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html);
die($originalEncoding); // $originalEncoding = 'UTF-8'
However, if you change the order of encodings in mb_detect_order(), the results will be different:
mb_detect_order(array('EUC-JP','UTF-8', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html);
die($originalEncoding); // $originalEncoding = 'EUC-JP'
So my questions are:
Why is that happening ?
Is there a way in PHP to correctly and unambiguously detect encoding of text ?
That's what I would expect to happen.
The detection algorithm probably just keeps trying, in order, the encodings you specified in mb_detect_order and then returns the first one under which the bytestream would be valid.
Something more intelligent requires statistical methods (I think machine learning is commonly used).
EDIT: See e.g. this article for more intelligent methods.
Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast, but the implementation applies many domain specific knowledges in case-by-case basis. As opposed to their methods, we aimed at a simple algorithm which can be uniformly applied to every charset, and the algorithm is based on well-established, standard machine learning techniques. We also studied the relationship between language and charset detection, and compared byte-based algorithms and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).
Not really. Different encodings often have large areas of overlap, and if the string you are testing exists entirely inside that overlap, then both encodings are acceptable.
For example, UTF-8 and ISO-8859-1 encode the letters a-z identically. The string "hello" would have an identical sequence of bytes in both encodings.
This is exactly why there is an mb_detect_order() function in the first place: it allows you to say what you would prefer to happen when these clashes occur. Would you like "hello" to be UTF-8 or ISO-8859-1?
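A short sketch of that preference in action (plain ASCII bytes are valid in both encodings, so the order decides):
mb_detect_order(array('UTF-8', 'ISO-8859-1'));
var_dump(mb_detect_encoding('hello')); // string(5) "UTF-8"
mb_detect_order(array('ISO-8859-1', 'UTF-8'));
var_dump(mb_detect_encoding('hello')); // string(10) "ISO-8859-1"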
Keep in mind that mb_detect_encoding() does not know what encoding the data is in. You may see a string, but the function itself only sees a stream of bytes. Going by that, it needs to guess the encoding - e.g. it would guess ASCII if all bytes are in the 0-127 range, UTF-8 if bytes above 127 appear only in valid multi-byte sequences, and so forth.
As you can imagine, given that context, it's quite difficult to detect an encoding reliably.
Like rihk said, this is what the mb_detect_order() function is for - you're basically supplying your best guess at what the data is likely to be. Do you work with UTF-8 files frequently? Then chances are your stuff isn't likely to be UTF-16, even if mb_detect_encoding() could guess it as that.
You might also want to check out Artefacto's link for a more in-depth view.
Example case: Internet Explorer uses some interesting encoding guessing if nothing is specified (#link, Section: 'To automatically detect a website's language'), which has caused strange behaviour on websites that took encoding for granted in the past. You can probably find some amusing stuff on that if you google around. It makes a nice showcase of how even statistical methods can backfire horribly, and why encoding-guessing in general is problematic.
mb_detect_encoding() looks at the first charset entry in your mb_detect_order() and then loops through your input $html, checking character by character whether each character falls within the valid set for that charset. If every character matches, it returns that charset; if any character fails, it moves on to the next charset in the mb_detect_order() and tries again.
The wikipedia list of charsets is a good place to see the characters that make up each charset.
Because these charsets overlap (the same byte sequence can be valid in both 'UTF-8' and 'EUC-JP'), a string can be considered a match even though it represents totally different characters in each character set. So unless some character value exists in one charset but not in the other, mb_detect_encoding() can't identify which of the charsets is invalid, and it will return the first charset from your array list which could be valid.
As far as I'm aware, there is no surefire way of identifying a charset. PHP's "best guess" method can be helped if you have a reasonable idea of what charsets you are likely to encounter, and order your list accordingly based on the gaps (invalid characters) in each charset.
The best solution is to "know" the charset. If you are scraping your html from another page, look for the charset identifier in the header of that page.
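A simplistic sketch of that idea (the regex is only illustrative, not a robust HTML parser):
// prefer the page's declared charset over guessing
if (preg_match('/charset=["\']?([A-Za-z0-9_-]+)/i', $html, $m)) {
    $charset = $m[1];
}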
If you really want to be clever, you can try and identify the language in which the html is written, perhaps using trigrams or n-grams or similar as described in this article on PHP/ir.

Is PHP's json_encode guaranteed to produce ASCII string?

Well, the subject says everything. I'm using json_encode to convert some UTF-8 data to JSON, and I need to transfer it to some layer that is currently ASCII-only. So I wonder whether I need to make it UTF-8 aware, or can leave it as it is.
Looking at the JSON RFC, UTF-8 is also a valid charset in JSON output, although not recommended, i.e. some implementations can leave UTF-8 data inside. The question is whether PHP's implementation dumps everything as ASCII or opts to leave something as UTF-8.
Unlike JSON support in some other languages, json_encode() escapes everything down to ASCII by default (since PHP 5.4, the JSON_UNESCAPED_UNICODE flag can change that).
According to the JSON article in Wikipedia, Unicode characters in strings are always
double-quoted Unicode with backslash escaping
The examples in the PHP Manual on json_encode() seem to confirm this.
So any UTF-8 character outside ASCII/ANSI should be escaped like this: \u0027 (note, as #Ignacio points out in the comments, that this is the recommended way to deal with those characters, not a required one)
However, I suppose json_decode() will convert the characters back to their byte values? You may get in trouble there.
If you need to be sure, take a look at iconv(), which could convert your UTF-8 string into ASCII (dropping any unsupported characters) beforehand.
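A small sketch of both points (output shown as comments; the iconv() call is one possible pre-cleaning step, not the only one):
echo json_encode(array('name' => 'héllo')); // {"name":"h\u00e9llo"} - pure ASCII
$ascii = iconv('UTF-8', 'ASCII//IGNORE', 'héllo'); // "hllo" - unsupported characters dropped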
Well, json_encode returns a string. According to the PHP documentation for string:
A string is series of characters. Before PHP 6, a character is the same as a byte. That is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode. See utf8_encode() and utf8_decode() for some basic Unicode functionality.
So for the time being you do not need to worry about making it UTF-8 aware. Of course you still might want to think about this anyway, to future-proof your code.
