php multi byte strings regex - php

We have a regex to strip out non-alphanumeric characters except for '#', '&' and '-'. Here is what it looks like:
preg_replace('/[^a-zA-Z0-9#&-*]/', '', strtolower($title));
Now we need to support traditional Chinese strings, and the above function won't work. How can I implement similar functionality for traditional Chinese?
Thanks,

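Use the u modifier, so that the pattern and the subject are treated as UTF-8, and add the characters you want to keep to the class. Quote the pattern with single quotes (backticks execute a shell command in PHP) and put the hyphen last so it is matched literally instead of forming the range &-*:
preg_replace('/[^a-zA-Z0-9#&*诶-]/u', '', $string);
By the way, don't use strtolower() on multibyte text: it is not UTF-8 aware and can corrupt non-ASCII characters or leave them unconverted. Use mb_strtolower():
mb_strtolower($string, 'UTF-8');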
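A minimal sketch of how the two suggestions combine, assuming valid UTF-8 input and that the mbstring extension is loaded. The function name clean_title is hypothetical, and \p{Han} (the Unicode script property covering Chinese characters) is swapped in here instead of listing individual characters; it is not part of the original answer.

```php
<?php
// Hypothetical helper: keep ASCII letters, digits, '#', '&', '-' and Han characters.
function clean_title(string $title): string
{
    // mb_strtolower() is multibyte-aware, unlike strtolower().
    $lower = mb_strtolower($title, 'UTF-8');

    // The hyphen sits last in the class so it is a literal, not a range.
    // preg_replace() returns null on error (e.g. invalid UTF-8), so fall back to ''.
    return preg_replace('/[^a-z0-9#&\p{Han}-]/u', '', $lower) ?? '';
}

echo clean_title('Hello, 世界! #PHP & C-3'), "\n";
```

If other scripts need to survive as well, \p{L} (any letter) may be a better fit than \p{Han}.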

Have you tried mb_ereg_replace() instead of preg_replace()? That might do the trick.
http://www.php.net/manual/en/function.mb-ereg-replace.php

Related

PHP - replace all non-alphanumeric chars for all languages supported

Hi, I'm trying to replace all the non-alphanumeric characters in a string like this:
mb_ereg_replace('/[^a-z0-9\s]+/i','-',$string);
The first problem is that it doesn't replace characters like "." in the string.
Second, I would like to add multibyte support for all users' languages to this method.
How can I do that?
Any help appreciated, thanks a lot.
Try the following:
preg_replace('/[^\p{L}0-9\s]+/u', '-', $string);
When the u flag is used on a regular expression, \p{L} matches any character in any of the Unicode letter categories. (Perl's long synonym \p{Letter} is not supported by PCRE.)
It should replace . with -, you're probably mixing up your data in the first place.
As for the multi-byte support, add the u modifier and look into the PCRE Unicode properties, namely \p{L} (letter):
$replaced = preg_replace('~[^0-9\p{L}]+~iu', '-', $string);
The shortest way is:
$result = preg_replace('~\P{Xan}++~u', '-', $string);
\p{Xan} contains numbers and letters in all languages, thus \P{Xan} contains all that is not a letter or a number.
This expression does replace dots. For multibyte use u modifier (UTF-8).
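To make the answers above concrete, here is a small sketch, assuming UTF-8 input. The function name replace_non_alnum is hypothetical, and unlike the first answer it does not keep whitespace.

```php
<?php
// Hypothetical helper: replace every run of non-letters/non-numbers with '-'.
function replace_non_alnum(string $s): string
{
    // \p{L} = any letter, \p{N} = any number; /u treats pattern and subject as UTF-8.
    // preg_replace() returns null on error (e.g. invalid UTF-8), so fall back to ''.
    $out = preg_replace('/[^\p{L}\p{N}]+/u', '-', $s) ?? '';

    // Drop the dashes produced by leading or trailing punctuation.
    return trim($out, '-');
}

echo replace_non_alnum('Hello. Wörld, 你好!'), "\n";
```

The \P{Xan} form from the last answer should behave the same way for the replacement step, since Xan is PCRE's shorthand for "letter or number".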

Regex to remove non alphanumeric characters from UTF8 strings

How can I remove characters, like punctuation, commas, dashes etc from a string, in a multibyte safe manner?
I will be working with input from many different languages and I am wondering if there is something that can help me with this
Thanks
There are the unicode character class thingys that you can use:
http://www.regular-expressions.info/unicode.html
http://php.net/manual/en/regexp.reference.unicode.php
To match any non-letter symbols you can just use \PL+, the negation of \p{L}. To not remove spaces, use a charclass like [^\pL\s]+. Or really just remove punctuation with \pP+
Well, and obviously don't forget the regex /u modifier.
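The three patterns mentioned above differ only in what they leave behind. A short sketch, assuming UTF-8 input (the variable names are illustrative only):

```php
<?php
$s = 'Привет, мир! 123';

// \PL+ : remove every run of non-letters (spaces, digits and punctuation all go).
$lettersOnly = preg_replace('/\PL+/u', '', $s);

// [^\pL\s]+ : remove non-letters but keep whitespace.
$lettersAndSpaces = preg_replace('/[^\pL\s]+/u', '', $s);

// \pP+ : remove punctuation only; letters, digits and spaces stay.
$noPunctuation = preg_replace('/\pP+/u', '', $s);

var_dump($lettersOnly, $lettersAndSpaces, $noPunctuation);
```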
I used this:
$clean = preg_replace( "/[^\p{L}\p{N}]+/u", " ", $raw ); // no "|" inside the class: there it would be a literal pipe and would be kept
$clean = preg_replace( "/[\p{Z}]{2,}/u", " ", $clean );
Similar post
Remove non-utf8 characters from string
I'm not sure if this covers all characters though.
According to this post on the dreamincode forum
http://www.dreamincode.net/forums/topic/78179-regular-expression-to-remove-non-ascii-characters/
this should work
/[^\x{21}-\x{7E}\s\t\n\r]/
Maybe this will be useful?
$newstring = preg_replace('/[^0-9a-zA-Z\s]/', '', $oldstring);

preg_replace or mb_ereg_replace in this case?

I have this RegEx for matching whitespace in Unicode:
/^[\pZ\pC]+|[\pZ\pC]+$/u
I'm not even sure of what it does, but it seems to work. Now, in this case, which function applies better and why?
$str = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
or
$str = mb_ereg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
The first one works. The second one doesn't.
Tried it out again, mb_ereg_replace doesn't actually support those Unicode char escapes. And it doesn't use regex delimiters. (See Oniguruma)
preg_replace uses the PCRE regex engine, which supports both.
Anyway, there is no such thing as a "better" application. It's either functioning, or not.
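For reference, the preg_replace variant can be wrapped as a Unicode-aware trim. This is only a sketch, assuming UTF-8 input; unicode_trim is a hypothetical name.

```php
<?php
// Hypothetical helper: trim Unicode separators and control characters from both ends.
function unicode_trim(string $s): string
{
    // \pZ = separators (spaces, including no-break and ideographic space),
    // \pC = "other" characters (control, format, unassigned, ...).
    // preg_replace() returns null on error (e.g. invalid UTF-8), so fall back to ''.
    return preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $s) ?? '';
}

// No-break space and ideographic space in front, tab and newline at the end.
var_dump(unicode_trim("\u{00A0}\u{3000} hello world \t\n"));
```

Interior whitespace is left alone because each alternative is anchored to the start or the end of the string.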

PHP -Multibyte regular expression to remove all but chinese characters...please help

I am trying to take a UTF-8 string that looks something like:
&q| 艝隭)R墢Lq28}徫廵g'Y鑽妽踒F
and strip out everything except the Chinese characters, which are in the hex range 4E00-9FA5; I would like to keep only those characters in the string. I have tried changing this line, which leaves only printable US-ASCII characters:
preg_replace('/[^\x20-\x7E]/', '', $str);
to this:
preg_replace('/[^\x4E00-\x9FA5]/u', '', $str);
but it outputs nothing....am I missing something? I am not very good with regular expressions
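You were very close! Without braces, \x reads at most two hex digits, so code points above FF have to be written as \x{...}: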
preg_replace('/[^\x{4E00}-\x{9FA5}]/u', '', $str);
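A small sketch of the corrected pattern in use, assuming UTF-8 input (keep_cjk is a hypothetical name). The range is the one given in the question.

```php
<?php
// Hypothetical helper: keep only characters in the range U+4E00..U+9FA5.
function keep_cjk(string $s): string
{
    // \x{4E00} is a full code point; the /u modifier is required for values above FF.
    // preg_replace() returns null on error (e.g. invalid UTF-8), so fall back to ''.
    return preg_replace('/[^\x{4E00}-\x{9FA5}]/u', '', $s) ?? '';
}

echo keep_cjk('abc 中文 123 字!'), "\n";
```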

PHP: URL friendly strings [duplicate]

Possible Duplicate:
How to handle diacritics (accents) when rewriting 'pretty URLs'
I want to replace special characters, such as Å Ä Ö Ü é, with "normal" characters (those between a-z and 0-9). And spaces should certainly be replaced with dashes, but that's not really a problem.
In other words, I want to turn this:
en räksmörgås
into this:
en-raksmorgas
What's the best way to do this?
Thank you in advance.
You can use iconv for the string replacement...
$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
Basically, it'll transliterate the characters it can, and drop those it can't (that are not in the ASCII character set)...
Then, just replace the spaces with str_replace:
$string = str_replace(' ', '-', $string);
Or, if you want to get fancy, you can replace all consecutive white-space characters with a single dash using a simple regex:
$string = preg_replace('/\\s+/', '-', $string);
Edit: As @Robert Ros points out, you need to set the locale prior to using iconv (depending on the defaults of your system). Just execute this line prior to the iconv line:
setlocale(LC_CTYPE, 'en_US.UTF8');
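Putting those steps together might look like the sketch below. The function name ascii_slug is hypothetical, and the exact transliteration depends on the iconv implementation and the locales installed on the system, so the result for accented input is not guaranteed to be identical everywhere.

```php
<?php
// Hypothetical helper: transliterate to ASCII, then turn everything else into dashes.
function ascii_slug(string $s): string
{
    // Harmless if the locale is missing: setlocale() then just returns false.
    setlocale(LC_CTYPE, 'en_US.UTF8');

    // iconv() returns false on failure; fall back to the untouched input in that case.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $s);
    if ($ascii === false) {
        $ascii = $s;
    }

    // Lowercase, collapse every run of other characters into one dash, trim the ends.
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower($ascii)) ?? '';
    return trim($slug, '-');
}

echo ascii_slug('en räksmörgås'), "\n";
```

The final preg_replace also cleans up any punctuation that some iconv builds emit while transliterating.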
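Check out http://php.net/manual/en/function.strtr.php. For UTF-8 strings, use the array form, because the three-argument form maps single bytes and would mangle multibyte characters:
<?php
$addr = strtr($addr, ['ä' => 'a', 'å' => 'a', 'ö' => 'o']);
?>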
A clever hack often used for this is calling htmlentities(), then running
preg_replace('/&(\w)(acute|uml|circ|tilde|ring|grave);/', '\1', $str);
to get rid of the diacritics. A more complete (but often unnecessarily complicated) solution is using a Unicode decomposition algorithm to split diacritics, then dropping everything that is not an ASCII letter or digit.
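The decomposition approach can be sketched with the Normalizer class from PHP's intl extension, assuming that extension is available. The function name strip_diacritics is hypothetical; it returns null when intl is missing or normalization fails.

```php
<?php
// Hypothetical helper: decompose, drop combining marks, keep ASCII letters, digits, spaces.
function strip_diacritics(string $s): ?string
{
    if (!class_exists('Normalizer')) {
        return null; // intl extension not loaded
    }

    // NFD splits e.g. "ä" into "a" followed by a combining diaeresis.
    $decomposed = Normalizer::normalize($s, Normalizer::FORM_D);
    if ($decomposed === false) {
        return null; // input was not valid UTF-8
    }

    // \p{Mn} matches the non-spacing combining marks left over from decomposition.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed) ?? '';

    // Finally drop anything that is still not an ASCII letter, digit or space.
    return preg_replace('/[^A-Za-z0-9 ]+/', '', $stripped);
}

var_dump(strip_diacritics('en räksmörgås'));
```

Characters without a decomposition (for example "ø" or "ß") are simply removed by the last step, which is one reason this approach is often combined with an explicit replacement table.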
