I've been searching for UTF-8-safe alternatives to the string manipulation functions, and I've found many different opinions and suggestions. I would like to ask whether the following functions can cause problems with UTF-8 and, if they do, what I should use instead. I know the list of mb_-prefixed functions in the PHP manual, but it does not cover all the functions I am using.
Functions are: implode, explode, str_replace, preg_match, preg_replace
Thank you
explode just looks for an identical byte sequence and separates the string at that point. Since UTF-8 is safely backwards compatible with ASCII, there's no concern and it will work fine. implode just assembles strings together, which works fine as well due to the properties of UTF-8. str_replace works for the same reasons. The preg_ functions work fine as long as you are using the /u modifier.
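To illustrate the point above, here is a minimal sketch. It assumes the source file itself is saved as UTF-8 and that the input is valid UTF-8; the sample strings are made up for the example.

```php
<?php
// Assumption: this file is saved as UTF-8 and the data is valid UTF-8.
$csv = 'café,naïve,日本語';

// explode() splits on the exact byte sequence of the delimiter.
$parts = explode(',', $csv);              // ['café', 'naïve', '日本語']

// A multibyte delimiter works too: it is matched as a whole byte sequence.
$joined = implode(' → ', $parts);         // 'café → naïve → 日本語'
$again  = explode(' → ', $joined);

// str_replace() with a multibyte search string.
$plain = str_replace('é', 'e', $csv);     // 'cafe,naïve,日本語'

var_dump($again === $parts);              // bool(true)
echo $plain, PHP_EOL;
```

This works because a complete UTF-8 sequence can never match in the middle of another character, so byte-wise searching does not split characters.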
If you need to manipulate UTF-8 characters safely, you can do it like this:
mb_internal_encoding('UTF-8');
preg_replace( '`...`u', '...', $string ) // with the u (unicode) modifier
I know that if I use multibyte (UTF-8) characters in the pattern, I have to use the mb_ functions or add the u option to the pattern of the preg_ functions.
But when multibyte (UTF-8) characters appear only in the subject and the pattern contains only ASCII characters, do the preg_ functions work correctly without the u option?
I know that in this case I have to use an mb_ function or add the u option to the pattern:
$str = preg_replace("/$utf8_multibyte_pattern/", '', $str);
I want to know whether this code (without the u option) is safe:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!#`{}[]:;+*/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
I may have found the answer myself.
But if someone knows character encodings well, please comment on this answer or post another one.
According to Wikipedia, the bytes that make up a multibyte UTF-8 character never coincide with ASCII bytes.
http://en.wikipedia.org/wiki/UTF-8#Advantages
The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take bytes strings but only treat a small number of ASCII codes specially. This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than any other Unicode encoding.
I think this means that a preg_ function with an ASCII-only pattern and no u option is safe for a multibyte (UTF-8) subject.
And this code (without the u option)
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
and this code (with u option)
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
are equivalent.
Both work correctly.
Am I correct?
As far as I know it is safe, as long as you use the Unicode modifier (/u), like so:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!#`{}[]:;+*/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
To see more information on unicode characters, see here
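A small sketch of how the two variants compare. The character class below is a shortened, hypothetical stand-in for the one in the question, and the sample strings are made up. As far as I can tell, with an ASCII-only pattern the results match for valid UTF-8 input; the difference shows up when the subject is not valid UTF-8.

```php
<?php
// Hypothetical, shortened ASCII-only character class.
$asciiClass = '[a-zA-Z0-9!]';

$subject  = 'abc 日本語 123 é!';
$withoutU = preg_replace("/$asciiClass/", '', $subject);
$withU    = preg_replace("/$asciiClass/u", '', $subject);
var_dump($withoutU === $withU);           // bool(true)

// A lone 0xE9 byte (ISO-8859-1 "é") is not valid UTF-8.
$invalid = "abc\xE9";
$lenient = preg_replace("/$asciiClass/", '', $invalid);   // the byte is kept
$strict  = preg_replace("/$asciiClass/u", '', $invalid);  // fails, returns NULL
var_dump($strict === null);                               // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);      // bool(true)
```

So the u option does not change the result here, but it does make the call fail on invalid input instead of passing the bytes through.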
I have the problem described in the title.
If I use
preg_match_all('/\pL+/u', $_POST['word'], $new_word);
and I type hello à and ì the new_word returned is *hello and *
Why?
Someone advised me to specify all the characters I want to accept, like this
preg_match_all('/\pL+/u', $_POST['word'], 'aäeëioöuáéíóú');
, but I want my application to work with all existing accents (for a multilanguage website).
Can you help me?
Thanks.
EDIT: To be specific, I use this regex to strip punctuation. It strips all punctuation correctly, but the Unicode characters come back wrong; in fact they are not returned at all.
EDIT 2: Sorry, I explained this very badly.
The problem is not in preg_match_all but in
str_word_count($my_key, 2, 'aäáàeëéèiíìoöóòuúù');
I had to specify the accented characters manually, but I think there are many others, right?
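\pL matches any Unicode letter (it does not match spaces, which is why \pL+ splits the input into words). Make sure that $_POST['word'] is a UTF-8 encoded string. If it is not, try utf8_encode() before matching (it assumes ISO-8859-1 input and is deprecated as of PHP 8.2), or check the encoding of your HTML form. In my tests, your example works like a charm.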
You may use this together with count() to get the number of words. Then you need not care about the possible characters. \pL will do this for you. This should do the trick:
$string = "áll thât words wíth ìntérnâtiønal çhårs";
preg_match_all('/\pL+/u', $string, $words);
echo count($words[0]); // returns: 6
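If you also need the word positions, like str_word_count() with format 2, here is a rough sketch. The function name utf8_word_positions is made up for this example, and note two differences: the keys are byte offsets, and \pL+ does not treat apostrophes or hyphens as part of a word the way str_word_count() does.

```php
<?php
// Hypothetical helper: returns [byteOffset => word] for every run of letters.
function utf8_word_positions(string $text): array
{
    $matches = [];
    preg_match_all('/\pL+/u', $text, $matches, PREG_OFFSET_CAPTURE);

    $words = [];
    // $matches[0] is missing if the call failed (e.g. invalid UTF-8).
    foreach ($matches[0] ?? [] as [$word, $byteOffset]) {
        $words[$byteOffset] = $word;
    }
    return $words;
}

$words = utf8_word_positions('hello à and ì');
print_r($words);   // keys 0, 6, 9, 13 => hello, à, and, ì
```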
Try using mb_ereg_match() (instead of preg_match()) from Multibyte String PHP library. It is specially made for working with multibyte strings.
Huh, looking at all those string functions, I sometimes get confused. Some people use the mb_ functions all the time, others the plain ones, so the question is simple...
When should I use mb_strpos(); and when should I go with the plain one (strpos();)?
And yes, I'm aware that mb_ stands for multi-byte, but does that really mean that if I'm working only with UTF-8 encoded strings, I should stick with the mb_ functions?
Thanks in advance!
You should use the mb_ functions whenever you expect to work with text that's not pure ASCII. I.e. you can work with the regular string functions, even if you're using UTF-8, as long as all the strings you're using them on only contain ASCII characters.
strpos('foobar', 'foo') // fine in any (ASCII-compatible) encoding, including UTF-8
strpos('ふーばー', 'ばー') // returns the byte offset 6, not the character offset 2; use mb_strpos instead
Yes, if working with UTF-8 (which is a multi-byte encoding : one character can use more than one byte), you should use the mb_* functions.
The non-mb functions will work on bytes, and not characters -- which is fine when 1 character == 1 byte ; but that's not the case with (for example) UTF-8.
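A quick sketch of the bytes-versus-characters difference, assuming the mbstring extension is loaded and the source file is saved as UTF-8. Each character in the sample string takes 3 bytes.

```php
<?php
// Assumption: mbstring is available; the string is valid UTF-8.
$s = 'ふーばー';

$bytes   = strlen($s);                        // 12 (bytes)
$chars   = mb_strlen($s, 'UTF-8');            // 4  (characters)

$bytePos = strpos($s, 'ばー');                // 6  (byte offset)
$charPos = mb_strpos($s, 'ばー', 0, 'UTF-8'); // 2  (character offset)

// substr() counts bytes and can cut a character in half;
// mb_substr() counts characters.
$broken = substr($s, 0, 4);                   // 'ふ' plus one stray byte
$whole  = mb_substr($s, 0, 2, 'UTF-8');       // 'ふー'

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
```

The byte offsets are still fine if you only feed them back into other byte-based functions such as substr(); the trouble starts when they are treated as character counts.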
I'd say yes, here's the description from the php documentation:
mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience....
If you're not sure that the mbstring extension is loaded, you should check first, because mbstring is a non-default extension.
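One way to do that check, as a sketch. The helper name utf8_length and the PCRE-based fallback are my own assumptions for illustration, not something from the manual.

```php
<?php
// Hypothetical helper: counts characters with mbstring when available,
// otherwise falls back to counting code points with PCRE in /u mode.
function utf8_length(string $s): int
{
    if (extension_loaded('mbstring')) {
        return mb_strlen($s, 'UTF-8');
    }

    // "." with /s matches any single code point, including newlines.
    $count = preg_match_all('/./su', $s);
    if ($count === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $count;
}

echo utf8_length('héllo'), PHP_EOL;   // 5
```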
we have this code:
$value = preg_replace("/[^\w]/", '', $value);
where $value is in UTF-8. After this transformation, the first byte of each multibyte character is stripped. How can I make \w cover UTF-8 characters completely?
Sorry, I am not very good at PHP.
You could try with the /u modifier:
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
If that won't do, try
mb_ereg_replace - Replace regular expression with multibyte support
instead.
There is this nasty u modifier to pcre patterns in PHP. It states that the regex is encoded in UTF8, but I found that it treats the input as UTF8, too.
Append u to regex, to turn on the multibyte unicode mode of PCRE:
$value = preg_replace("/[^\w]/u", '', $value);
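A short sketch of what that looks like in practice, assuming valid UTF-8 input and a reasonably recent PHP where /u also makes \w Unicode-aware. The explicit class [^\pL\pN_] is shown as an alternative that spells out letters, digits and underscore.

```php
<?php
// Assumption: $value is valid UTF-8 (sample string made up for the example).
$value = 'héllo, wörld!';

$clean    = preg_replace('/[^\w]/u', '', $value);       // 'héllowörld'
$explicit = preg_replace('/[^\pL\pN_]/u', '', $value);  // 'héllowörld'

// In /u mode the call returns NULL if the subject is not valid UTF-8.
if ($clean === null) {
    echo 'preg error code: ', preg_last_error(), PHP_EOL;
} else {
    echo $clean, PHP_EOL;
}
```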
Corollary
In Unicode mode, PCRE expects the subject to be valid UTF-8, and if it is not, the preg_ call fails and returns NULL. Therefore, to convert anything to UTF-8 (and drop any unconvertible junk), we first use:
$value = iconv( 'ISO-8859-1', 'UTF-8//IGNORE//TRANSLIT', $i );
to clean and prep the input.
Because any byte sequence can be interpreted as ISO-8859-1 (even if some obscure characters then appear incorrectly), and since most web browsers default to 8859 (unless told to use UTF-8), we've found this function to be a general, effective way to 'take anything, drop any junk, and convert into UTF-8'. Note that input which is already UTF-8 gets double-encoded by this conversion, so it only suits input that is not UTF-8 yet.
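To avoid converting input that is already UTF-8, one option is to validate first. This is only a sketch: the function name to_utf8 is made up, it assumes the mbstring and iconv extensions are available, and it assumes anything that is not valid UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper: leaves valid UTF-8 alone, otherwise treats the
// bytes as ISO-8859-1 (an assumption) and converts them.
function to_utf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    $converted = iconv('ISO-8859-1', 'UTF-8', $input);
    return $converted === false ? '' : $converted;
}

$fromLatin1 = to_utf8("caf\xE9");        // ISO-8859-1 bytes in, UTF-8 out
$unchanged  = to_utf8("caf\u{00E9}");    // already UTF-8, returned as-is

var_dump($fromLatin1 === $unchanged);    // bool(true)
```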
The POSIX ereg_* functions are deprecated as of 5.3.0, so those are not the right way to go; the mb_ereg_* functions from mbstring are not affected.
Try this function instead: http://php.net/manual/en/function.mb-ereg-replace.php
Use [^\w]+ instead of [^\w]
You can also use \W in place of [^\w]
Why do PHP's str_replace and many other string functions mess up strings with special characters such as 'é' and 'à'? And how can I fix this problem?
str_replace is not multi-byte (Unicode) aware. Use the corresponding mb_* functions instead.
In your case mb_ereg_replace sounds like the right option. You could also just use the PCRE regex functions with the u flag.
PHP wasn't developed from the ground up to support UTF-8 natively. Instead of writing the character literal, it may help to specify its code point escape in your replacement, e.g. "\u{3094}" in a double-quoted string (PHP 7 and later) or \x{3094} in a regex with the u flag, and replace that; I think it's more consistently supported.
Though it would help seeing your direct issue at hand, with more code.
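For what it's worth, a sketch of the escape-based approach. It assumes PHP 7 or later for the "\u{...}" syntax; the sample string is made up, and U+00E9 is "é".

```php
<?php
// "\u{00E9}" in a double-quoted string expands to the UTF-8 bytes of é,
// regardless of how the source file itself is encoded.
$word = "caf\u{00E9}";

// Byte-wise replacement with an escaped search string.
$viaStrReplace = str_replace("\u{00E9}", 'e', $word);     // 'cafe'

// In a regex, \x{00E9} is interpreted by PCRE and needs the u flag.
$viaPreg = preg_replace('/\x{00E9}/u', 'e', $word);       // 'cafe'

var_dump($viaStrReplace === $viaPreg);                    // bool(true)
```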