PHP: parsing ascii string safely when running in multibyte mode

PHP: parsing ascii string safely when running in multibyte mode - php

In my PHP config file I have
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
To ensure UTF8 support. I have read that one should also use the multibyte string manipulation functions throughout if you have set these settings. I am currently altering a library which parses an excel file, and I need to split the one attribute value in the form N12 to determine the spreadsheet size. I know for a fact that the value cannot have values outside of ascii range. Do I need to use the multibyte string manipulation functions to parse the 12 out of N12 or can I use the normal ones. I am asking as I would like to keep the solution general and maybe submit the solution back to the library. If I need to use the correct function depending on whether current mode is utf8 or not, what is the best way to check for this?

UTF-8 is a pure superset of ASCII. If your functions can handle UTF-8, they by definition can also handle ASCII. The core PHP string functions mostly expect single-byte encodings, but that doesn't mean they won't work with other encodings; for example: Multibyte trim in PHP?.
So it depends on what exactly you're trying to do. Possibly core PHP string functions will already work fine regardless of encoding. If they do not, and your operation would break when using multi-byte strings, then you can use the appropriate MB function instead which by definition will also handle ASCII just fine when treating the input as UTF-8.

Related

Detect encoding in PHP without multibyte extension?

Is there a way to detect the encoding of a string in PHP without having the mbstring extension loaded? I know it is possible to do so with mb_detect_encoding(), but is there an equivalent, non-multibyte function?
If not, what would it take to implement a detect_encoding() function that would at least detect UTF-8?

Strings in PHP are just byte sequences, they carry no encoding information with them. mb_detect_encoding doesn't actually detect the string's encoding, it tries to make an educated guess by running the byte sequence against a series of identification functions, one per encoding (by default those given by mb_detect_order), and returns the first one in which the sequence matches. These functions are very basic and don't even exist for many popular encodings.
There is no way, with or without the mbstring extension, to ascertain the encoding of a string - only to maybe rule some out, which you could only do if the string happens to contain byte sequences that would be invalid in those particular encodings.
You will never know whether "\xC2\xA4" is supposed to be the UTF-8 ¤ or ISO-8859-1 Â¤ just by looking at it - because they're the exact same bytes.
For more information see: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

There's always iconv, which is generally enabled in PHP by default
<pre>
<?php
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "ISO-8859-1");
var_dump(iconv_get_encoding('all'));
?>
</pre>

What does PHP's mb_internal_encoding actually do?

According to the PHP website it does this:
encoding is the character encoding name used for the HTTP input
character encoding conversion, HTTP output character encoding
conversion, and the default character encoding for string functions
defined by the mbstring module. You should notice that the internal
encoding is totally different from the one for multibyte regex.
Can someone please explain this in simpler terms?
HTTP input character encoding conversion
HTTP output character encoding conversion
default character encoding for string functions
What is meant by “internal encoding is totally different from the one for multibyte regex”?
My guess is that
means GET and POST are treated as that encoding.
means it outputs to that encoding.
means it uses that encoding for all multibyte string functions.
I have no idea about. Why would regex be different to normal string functions?
If point 2 is correct would you need to do:
ini_set('default_charset', 'UTF-8');
If I understand 3 correctly does that mean if you do:
mb_internal_encoding('UTF-8')
You don't need to do:
mb_strtolower($str, 'UTF-8');
Just:
mb_strtolower($str);
I did read on another SO post that mb_strtolower($str) should no be trusted and that you need to set the encoding for each multibyte string function. Is this true?

The mbstring extension added the glorious idea (</sarcasm>) to automatically convert all incoming data and all output data from some encoding to another. See mbstring HTTP Input and Output. It's configured with the mbstring.http_input ini setting and by using the mb_output_handler. mb_internal_encoding influences this conversion. IMO you should leave those settings off and never touch them; I have yet to find any problem that can elegantly be solved by this and it sounds like a terrible idea overall to have implicit encoding conversions going on. Especially if it's all controlled via one global flag (mb_internal_encoding) which is used in a variety of different contexts.
So that's 1. and 2.
For 3., yes indeed, mb_internal_encoding basically sets the default value for all mb_ functions which accept an $encoding parameter. Essentially it just sets a global variable (internally) which other functions read from, that's all.
The last part refers to the fact that there's a separate mb_regex_encoding function to set the internal encoding for mb_ereg_ functions.
I did read on another SO post that mb_strtolower($str) should no be trusted and that you need to set the encoding for each multibyte string function. Is this true?
I'd agree to this insofar as all global state cannot be trusted. This is pretty trustworthy:
mb_internal_encoding('UTF-8');
mb_strtolower($string);
However, this is not really:
mb_strtolower($string);
See the difference? If you rely on global state being set correctly elsewhere, you can never be sure it actually is correct. You just need to make a call to some third party library which sets mb_internal_encoding to something else without you knowing, and your mb_strtolower call will suddenly behave very differently.

When should I use mb_strpos(); over strpos();?

Huh, looking at all those string functions, sometimes I get confused. One is using all the time mb_ functions, the other - plain ones, so the question is simple...
When should I use mb_strpos(); and when should I go with the plain one (strpos();)?
And, yes, I'm aware about that mb_ functions stand for multi-byte, but does it really mean, that if I'm working with only utf-8 encoded strings, I should stick with mb_ functions?
Thanks in advance!

You should use the mb_ functions whenever you expect to work with text that's not pure ASCII. I.e. you can work with the regular string functions, even if you're using UTF-8, as long as all the strings you're using them on only contain ASCII characters.
strpos('foobar', 'foo') // fine in any (ASCII-compatible) encoding, including UTF-8
strpos('ふーばー', 'ふー') // won't work as expected, use mb_strpos instead

Yes, if working with UTF-8 (which is a multi-byte encoding : one character can use more than one byte), you should use the mb_* functions.
The non-mb functions will work on bytes, and not characters -- which is fine when 1 character == 1 byte ; but that's not the case with (for example) UTF-8.

I'd say yes, here's the description from the php documentation:
mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience....
If you're not sure that the mb extension is loaded, you should check before because mb-string is a non-default extension.

Is there any downside to save all my source code files in UTF-8?

If that's relevant (it very well could be), they are PHP source code files.

There are a few pitfalls to take care of:
PHP is not aware of the BOM character certain editors or IDEs like to put at the very beginning of UTF-8 files. This character indicates the file is UTF-8, but it is not necessary, and it is invisible. This can cause "headers already sent out" warnings from functions that deal with HTTP headers because PHP will output the BOM to the browser if it sees one, and that will prevent you from sending any header. Make sure your text editor has a UTF-8 (No BOM) encoding; if you're not sure, simply do the test. If <?php header('Content-Type: text/html') ?> at the beginning of an otherwise empty file doesn't trigger a warning, you're fine.
Default string functions are not multibyte encodings-aware. This means that strlen really returns the number of bytes in the string, not the actual number of characters. This isn't too much of a problem until you start splicing strings of non-ASCII characters with functions like substr: when you do, indices you pass to it refer to byte indices rather than character indices, and this can cause your script to break non-ASCII characters in two. For instance, echo substr("é", 0, 1) will return an invalid UTF-8 character because in UTF-8, é actually takes two bytes and substr will return only the first one. (The solution is to use the mb_ string functions, which are aware of multibyte encodings.)
You must ensure that your data sources (like external text files or databases) return UTF-8 strings too, because PHP makes no automagic conversion. To that end, you may use implementation-specific means (for instance, MySQL has a special query that lets you specify in which encoding you expect the result: SET CHARACTER SET UTF8 or something along these lines), or if you couldn't find a better way, mb_convert_encoding or iconv will convert one string into another encoding.

It's actually usually recommended that you keep all sources in UTF8. It won't matter size of regular code with latin characters at all, but will prevent glitches with any special characters.

If you are using any special chars in e.g string values, the size is a little bit bigger, but that shouldn't matter.
Nevertheless my suggestion is, to always leave the default format. I spent so many hours because there was an error with the format saving and all characters changed.
From a technical point of few, there isn't a difference!

Very relevant, the PHP parser may start to output spurious characters, like a funky unside-down questionmark. Just stick to the norm, much preferred.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medÃºlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt

To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.

I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.