Is mb_* necessary to replace single-byte characters from a multibyte string? - php

Let's say I have a UTF-8 text like this:
âàêíóôõ <br> âàêíóôõ <br> âàêíóôõ
I want to replace <br> with <br />. Do I need to use mb_str_replace, or can I use str_replace, considering that <, b, r, / and > are all single-byte characters?

Since str_replace is binary-safe and UTF-8 is a bijective encoding, you can use str_replace even if the search string or the replacement contains multi-byte characters, as long as all three parameters are encoded as UTF-8.
That's why there isn't an mb_str_replace function in the first place.
If your encoding is not bijective - i.e. there are multiple representations of the same character, for example < in UTF-7, which can be expressed both as '+ADw-' and as '<' - you should convert all strings to the same (bijective) encoding, apply str_replace, and then convert the result back to the target encoding.
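A minimal sketch of both cases, assuming the example text from the question and that the mbstring extension is available for the UTF-7 round trip:
// All three arguments are UTF-8, so plain str_replace is safe:
$text  = 'âàêíóôõ <br> âàêíóôõ <br> âàêíóôõ';
$fixed = str_replace('<br>', '<br />', $text);
// Non-bijective encoding (UTF-7 here): normalize to UTF-8, replace, convert back.
$utf7 = mb_convert_encoding($text, 'UTF-7', 'UTF-8'); // stand-in for real UTF-7 input
$utf8 = mb_convert_encoding($utf7, 'UTF-8', 'UTF-7');
$back = mb_convert_encoding(str_replace('<br>', '<br />', $utf8), 'UTF-7', 'UTF-8');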

Reference for manipulating UTF-8 strings safely in PHP (archive). There is no hard-and-fast rule: some native PHP string functions can operate safely on UTF-8, some can with care, and some cannot.
There is no mb_str_replace(). Notice the section "UTF-8 Safe Functionality": explode() and str_replace() are safe as long as all arguments passed to them are valid UTF-8 strings.
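For example, a small sketch (assuming valid UTF-8 input throughout) showing that explode() with an ASCII delimiter leaves every piece as valid UTF-8:
$parts = explode(' <br> ', 'âàêíóôõ <br> âàêíóôõ <br> âàêíóôõ');
foreach ($parts as $piece) {
    var_dump(mb_check_encoding($piece, 'UTF-8')); // bool(true) for every piece
}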

Related

PHP intval of multibyte strings

How does intval() change when using UTF-8 multibyte strings as opposed to regular one-byte-per-character strings? Is it the same?
PHP doesn't distinguish string encodings internally. A string is simply an array of bytes. If you pass a UTF-8 string to intval, the function only sees the bytes of the encoded UTF-8 string. Given the nature of the UTF-8 encoding, intval will treat any non-ASCII character as a non-digit. So it doesn't make a difference whether you pass an ASCII, Latin-1, or UTF-8 string.
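A quick illustration (hypothetical values, standard intval behaviour assumed):
var_dump(intval('42 apples')); // int(42) - leading ASCII digits are parsed
var_dump(intval('４２'));       // int(0)  - fullwidth digits are multi-byte, treated as non-digits
var_dump(intval('abc123'));    // int(0)  - no leading digits at all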

Replacing low ASCII characters in UTF-16-encoded string using PHP's str_replace function

I have some PHP code that I use for text filtering. During filtering, some ASCII characters such as ampersand (&) and tilde (~) are temporarily converted to low ASCII characters (such as decimal code-points 4 and 5). Just before the final filtered output is generated, the conversion is reverted.
$temp = str_replace(array('&', '~'), array("\x04", "\x05"), $input);
... some filtering code to work with $temp ...
$out = str_replace(array("\x04", "\x05"), array('&', '~'), $temp);
This works well with input text of character encodings that use 8-bit code units such as UTF-8 and ISO 8859-1. But I am not sure about input encoded in larger code units, such as UTF-16 or UTF-32. Will the first conversion step mangle the well-formedness of the input text? Will there be some conflict during the reversion step because of some pre-existing characters of the input? The PHP setup does not overload multi-byte string functions.
Can anyone comment? Thanks.
str_replace works fine, as long as all strings passed to it are in the same encoding. It just does a binary compare/replace of data, so the actual encoding doesn't really matter.
That's why there's no mb_str_replace in the mbstring function list.
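If the input really is UTF-16, "in the same encoding" means the search and replacement strings have to be converted as well. A sketch, assuming UTF-16LE input (the variable $utf16Input is hypothetical) and that mbstring is available:
$search  = array(mb_convert_encoding('&', 'UTF-16LE', 'UTF-8'),
                 mb_convert_encoding('~', 'UTF-16LE', 'UTF-8'));
$replace = array(mb_convert_encoding("\x04", 'UTF-16LE', 'UTF-8'),
                 mb_convert_encoding("\x05", 'UTF-16LE', 'UTF-8'));
$temp = str_replace($search, $replace, $utf16Input);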

php preg_replace: unicode modifier for ascii strings

I need to handle strings in my PHP script using regular expressions. But there is a problem: different strings have different encodings. If a string contains just ASCII symbols, mb_detect_encoding returns 'ASCII'. But if a string contains Russian symbols, for example, mb_detect_encoding returns 'UTF-8'. It's not a good idea to check the encoding of each string manually, I suppose.
So the question is: is it correct to use preg_replace (with the Unicode modifier) on ASCII strings? Is it right to write code such as preg_replace("/[^_a-z]/u", "", $string); for both ASCII and UTF-8 strings?
This would be no problem if the two choices were "UTF-8" or "ASCII", but that's not the case.
If PHP doesn't use UTF-8, it uses ISO-8859-1, which is NOT ASCII but a superset of it (the first 128 characters are identical). Some characters, for example the Swedish å, ä and ö, can be represented in both ISO-8859-1 and UTF-8, but with different byte representations. I don't think this matters much for the preg_* functions, so it may not be applicable to your question, but please keep it in mind when working with different encodings.
You should really, really try to know which character set your strings are in, without the magic of mb_detect_encoding (mb_detect_encoding is not a guarantee, just a good guess). For example, strings fetched through HTTP usually have a character set specified in the HTTP headers.
Yes, sure, you can always use the Unicode modifier, and it will affect neither the results nor the performance.
The 7-bit ASCII character set is encoded identically in UTF-8. If you have an ASCII string, you should be able to use the PCRE "u" modifier on it.
However, if you have a "supplemented" 8-bit ASCII character set such as ISO-8859-1, Windows-1252 or HP-Roman8, the characters with the leftmost bit set (values 0x80-0xFF) are not encoded the same way in UTF-8, and it would not be appropriate to use the "u" modifier.
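A small sketch of the difference, assuming the bytes shown (preg_replace returns NULL when the u modifier is used and the subject is not valid UTF-8):
$ascii  = 'hello_world';
$latin1 = "sm\xF6rg\xE5sbord"; // ISO-8859-1 bytes for "smörgåsbord"
var_dump(preg_replace('/[^_a-z]/u', '', $ascii));  // string "hello_world" - ASCII is valid UTF-8
var_dump(preg_replace('/[^_a-z]/u', '', $latin1)); // NULL - invalid UTF-8 under the u modifier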

Would this regex be multibyte safe?

I'm using the following regex to check an image filename only contains alphanumeric, underscore, hyphen, decimal point:
preg_match('!^[\w.-]*$!',$filename)
This works OK, but I have concerns about multibyte characters. Should I handle them specifically to prevent unexpected errors, or will this regex reject multibyte filenames reliably?
PHP does not have "native" support for multibyte characters; you need to use the "mbstring" extension (which may or may not be available). Furthermore, it would appear that there is no way to create a "multibyte-character string" as such -- rather, one chooses to treat a native string as a multibyte-character string by using special "mbstring" functions. In other words, a PHP string does not know its own character encoding -- you have to keep track of it manually.
You may be able to get away with it so long as you use UTF-8 (or a similar) encoding. UTF-8 always encodes multibyte characters to "high" bytes (for instance, ß is encoded as 0xC3 0x9F), so PHP will probably treat them just like any other character. You would not be able to use an encoding that might potentially encode a multibyte character into "special" PHP bytes, such as 0x22, the "double-quote" symbol.
The only regular expression functions in PHP that know how to deal with specific multibyte characters across a range of character sets are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.
PCRE-based regular expression functions like preg_match support UTF-8 via the u modifier (PCRE_UTF8).
But of course, as described above, PHP strings don't know their own encoding, so for the mb_ereg* functions you first need to instruct the "mbstring" library using the mb_regex_encoding function. Note that this function specifies the encoding of the string you're matching, not the string containing the regular expression itself.
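Putting both options together for the filename check, a sketch assuming UTF-8 data and that the mbstring extension is installed:
// PCRE with the u modifier, so the pattern and subject are treated as UTF-8:
$ok = preg_match('!^[\w.-]*$!u', $filename);
// mbstring's own regex engine: tell it the subject encoding first:
mb_regex_encoding('UTF-8');
$ok = mb_ereg_match('^[\w.-]*$', $filename);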

Can str_replace be safely used on a UTF-8 encoded string if it's only given valid UTF-8 encoded strings as arguments?

PHP's str_replace() was intended only for ANSI strings and as such can mangle UTF-8 strings. However, given that it's binary-safe would it work properly if it was only given valid UTF-8 strings as arguments?
Edit: I'm not looking for a replacement function, I would just like to know if this hypothesis is correct.
Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.
In UTF-8, any non-ASCII byte sequence representing a valid character always begins with a byte in the range \xC0-\xFF. This byte may not appear anywhere else in the sequence, so you can't make a valid UTF-8 sequence that matches part of a character.
This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
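To make that concrete, a sketch (the bytes shown are the standard Shift_JIS encoding of the katakana character ソ):
$sjis   = "\x83\x5C";                    // ソ in Shift_JIS; its second byte equals '\' (0x5C)
$broken = str_replace("\\", "/", $sjis); // now "\x83\x2F" - no longer a valid Shift_JIS character
// The same replacement on a UTF-8 string cannot break anything, because
// \x5C never appears inside a UTF-8 multi-byte sequence.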
It's correct because UTF-8 multibyte characters consist exclusively of non-ASCII bytes (values 128 and above), beginning with a lead byte that defines how many bytes follow, so you can't accidentally match part of one UTF-8 multibyte character against another.
To visualise (abstractly):
a for an ASCII character
2x for a 2-byte character
3xx for a 3-byte character
4xxx for a 4-byte character
If you're matching, say, a2x3xx (where a stands for a byte in the ASCII range), then since a < x, and 2x cannot be a subset of 3xx or 4xxx, et cetera, you can be sure that your UTF-8 will match correctly, given the prerequisite that all strings are definitely valid UTF-8.
Edit: See bobince's answer for a less abstract explanation.
Well, I do have a counter-example: I have a UTF-8 encoded settings .ini file specifying application settings like the email sender name. It says something like:
email_from = Märta
and I read it from there into the variable $sender_name. Now I replace it into the message body (UTF-8 again), which contains:
regards
{sender}
$message = str_replace("{sender}",$sender_name,$message);
The email is absolutely correct in every respect, but the sender name is totally broken. There are other cases (like explode()) where something goes wrong with a UTF-8 string: it is healthy before the conversion but not after it. Sorry to say, there seems to be no way of correcting this behaviour.
Edit: Actually, explode() is involved in parsing the .ini file, so the problem may well lie in that very function, and str_replace() may well be innocent.
No, you cannot.
From practice I can tell you: if you have some multibyte symbols like ◊ etc., and others that are single-byte, it won't work correctly, because some symbols take 2-4 bytes each.
str_replace works on raw bytes and replaces them as such, so the result can end up as trash that doesn't form any valid symbols.
Yes, I think this is correct, at least I couldn't find any counter-example.
