I am trying to take a UTF-8 string that looks something like:
&q| 艝隭)R墢Lq28}徫廵g'Y鑽妽踒F
and strip out everything except the Chinese characters. They are in the hex range 4E00-9FA5, and I would like to keep only those characters in the string. I have tried changing this line, which leaves only printable US-ASCII characters:
preg_replace('/[^\x20-\x7E]/', '', $str);
to this:
preg_replace('/[^\x4E00-\x9FA5]/u', '', $str);
but it outputs nothing. Am I missing something? I am not very good with regular expressions.
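You were very close! Without braces, \x takes at most two hex digits, so \x4E00 is read as \x4E followed by the literal characters 00. Wrap each code point in braces instead: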
preg_replace('/[^\x{4E00}-\x{9FA5}]/u', '', $str);
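As a quick check of the corrected pattern, here is a minimal sketch. The helper name keep_cjk is made up for this example, and treating a failed replacement as an empty string is an assumption about what the caller wants. With the u modifier, preg_replace returns null when the subject is not valid UTF-8.

```php
<?php
// Hypothetical helper: keep only code points U+4E00..U+9FA5.
function keep_cjk(string $str): string
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8,
    // and \x{...} allows code points with more than two hex digits.
    $result = preg_replace('/[^\x{4E00}-\x{9FA5}]/u', '', $str);

    // preg_replace returns null on error (for example, malformed UTF-8).
    return $result ?? '';
}

echo keep_cjk("abc中文123"), "\n"; // prints 中文
```

If the helper returns an empty string for input that visibly contains Chinese characters, preg_last_error() is worth checking, since the input may not be valid UTF-8.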
Related
I have this string:
b🤵♀️🤵♀️b
After removing the smilies and special chars:
$str = preg_replace('/[^ -\x{2122}]\s+|\s*[^ -\x{2122}]/u','',$str);
$str = trim($str);
...
strlen($str);
gives me 8 instead of 2. Why, and how do I fix this?
The regular expression is not sufficient to remove all special characters. A special debugger shows which characters are still present after the preg_replace.
"b\u{200d}\u{200d}b"
or as 8 bytes
"b\xe2\x80\x8d\xe2\x80\x8db"
The \u{200d} characters (zero-width joiners) sit between the code points that make up each emoji in the original string, and each one takes three bytes in UTF-8. Removing them for the specific example here is not difficult.
$str = preg_replace('/[^ -\x{2122}]\s+|\s*[^ -\x{2122}]|\x{200d}/u','',$str);
However, this is not a solution if other special characters can also occur.
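To see where the 8 comes from without a debugger, a sketch like the following could help. It builds the string from explicit code points, assuming each emoji is the sequence U+1F935, U+200D, U+2640, U+FE0F (which is consistent with the leftover shown above), and compares the two patterns. strlen counts bytes, not characters.

```php
<?php
// Assumed composition of each emoji:
// person in tuxedo + zero-width joiner + female sign + variation selector.
$emoji = "\u{1F935}\u{200D}\u{2640}\u{FE0F}";
$str = "b{$emoji}{$emoji}b";

$original = '/[^ -\x{2122}]\s+|\s*[^ -\x{2122}]/u';
$extended = '/[^ -\x{2122}]\s+|\s*[^ -\x{2122}]|\x{200d}/u';

// U+200D lies inside the range " "..U+2122, so the original pattern keeps it.
$left = trim(preg_replace($original, '', $str));
$clean = trim(preg_replace($extended, '', $str));

echo bin2hex($left), ' ', strlen($left), "\n";   // 62e2808de2808d62 8
echo bin2hex($clean), ' ', strlen($clean), "\n"; // 6262 2
```

bin2hex is used here only to make the invisible bytes visible.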
Hi, I'm trying to replace all the non-alphanumeric characters in a string like this:
mb_ereg_replace('/[^a-z0-9\s]+/i','-',$string);
The first problem is that it doesn't replace characters like "." in the string.
Second, I would like to add multibyte support for all users' languages to this method.
How can I do that?
Any help appreciated, thanks a lot.
Try the following:
preg_replace('/[^\p{L}0-9\s]+/u', '-', $string);
When the u flag is used on a regular expression, \p{L} (and \p{Letter}) matches any character in any of the Unicode letter categories.
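A minimal sketch of that pattern on a made-up sample string follows. Note that \s stays inside the negated class, so whitespace is kept rather than replaced. As a side note, mb_ereg_replace takes its pattern without delimiters, so the '/.../i' wrapper in the question is treated there as literal characters; preg_replace is the function that expects delimiters.

```php
<?php
// Sample input is invented for illustration.
$string = "Hello, wörld. 42!";

// Replace every run of characters that is not a Unicode letter,
// an ASCII digit, or whitespace with a single dash.
$replaced = preg_replace('/[^\p{L}0-9\s]+/u', '-', $string);

echo $replaced, "\n"; // Hello- wörld- 42-
```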
It should replace . with -; you're probably mixing up your data in the first place.
As for the multi-byte support, add the u modifier and look into PCRE properties, namely \p{Letter}:
$replaced = preg_replace('~[^0-9\p{Letter}]+~iu', '-', $string);
The shortest way is:
$result = preg_replace('~\P{Xan}++~u', '-', $string);
\p{Xan} contains numbers and letters in all languages, thus \P{Xan} contains all that is not a letter or a number.
This expression does replace dots. For multibyte use u modifier (UTF-8).
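The \P{Xan} variant can be tried the same way. In this sketch the helper name dash_non_alnum and the trimming of leading and trailing dashes are additions for illustration, not part of the answers. Unlike a class that lists \s, this pattern also replaces whitespace.

```php
<?php
// Hypothetical helper: collapse every run of non-alphanumeric
// characters (in the Unicode sense) into one dash.
function dash_non_alnum(string $s): string
{
    // \P{Xan} matches anything that is neither a letter nor a number;
    // ++ is a possessive quantifier, so the run is consumed without backtracking.
    $result = preg_replace('~\P{Xan}++~u', '-', $s);

    // Null means the input was not valid UTF-8; fall back to an empty string.
    return trim($result ?? '', '-');
}

echo dash_non_alnum("héllo, wörld. 你好!"), "\n"; // héllo-wörld-你好
```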
I'm using the following regex to remove all invisible characters from a UTF-8 string:
$string = preg_replace('/\p{C}+/u', '', $string);
This works fine, but how do I alter it so that it removes all invisible characters EXCEPT newlines? I tried some stuff using [^\n] etc. but it doesn't work.
Thanks for helping out!
Edit: newline character is '\n'
Use a "double negation":
$string = preg_replace('/[^\P{C}\n]+/u', '', $string);
Explanation:
\P{C} is the same as [^\p{C}].
Therefore [^\P{C}] is the same as \p{C}.
Since we now have a negated character class, we can subtract other characters like \n from it.
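A small sketch of the double negation on an invented input, containing a zero-width space (U+200B), a newline, and a bell character (0x07):

```php
<?php
// Invented sample: "a", zero-width space, "b", newline, "c", bell, "d".
$input = "a\u{200B}b\nc\x07d";

// [^\P{C}\n] matches characters that are in category C and are not \n.
$output = preg_replace('/[^\P{C}\n]+/u', '', $input);

var_dump($output === "ab\ncd"); // bool(true)
```

A carriage return is also in category C, so Windows-style line endings would lose their \r unless \r is added to the class as well.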
By using a negative lookahead assertion, you can match a character class except for what the assertion matches (the u modifier is needed for UTF-8 input):
$res = preg_replace('/(?!\n)\p{C}/u', '', $input);
(PHP's dialect of regular expressions doesn't support character class subtraction which would, otherwise, be another approach: [\p{C}-[\n]].)
Before you do it, replace newlines (I suppose you are using something like \n) with a random string like ++++++++ (any string that will not be removed by your regular expression and does not naturally occur in your string in the first place), then run your preg_replace, then replace ++++++++ with \n again.
// Double quotes are required: '\n' in single quotes is a literal backslash followed by n.
$string=str_replace("\n",'++++++++',$string); //Replace \n
$string=preg_replace('/\p{C}+/u', '', $string); //Use your regexp
$string=str_replace('++++++++',"\n",$string); //Insert \n again
That should do it. If you are using <br/> instead of \n, simply use nl2br to preserve line breaks and replace <br/> instead of \n.
We have a regex to strip out non alpha numeric characters except for '#', '&' and '-'. Here is what it looks like:
preg_replace('/[^a-zA-Z0-9#&-*]/', '', strtolower($title));
Now we need to support traditional Chinese strings and the above function won't work. How can I implement similar functionality for traditional Chinese?
Thanks,
Use u modifier:
preg_replace('/[^a-zA-Z0-9#&-*诶]/u', '', $string);
By the way, don't use strtolower(), because it can corrupt multibyte UTF-8 strings. Use mb_strtolower():
mb_strtolower($string, 'UTF-8');
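One way to combine the two suggestions is sketched below. It swaps in the PCRE script property \p{Han} instead of listing individual characters, which is not what the answer shows, and it moves the hyphen to the end of the class so it is taken literally. The function name clean_title is hypothetical, and the sketch assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: lowercase the title, then keep only ASCII letters,
// digits, '#', '&', '-', and Han characters.
function clean_title(string $title): string
{
    $lower = mb_strtolower($title, 'UTF-8');

    // \p{Han} covers both traditional and simplified Chinese characters.
    // The trailing hyphen is literal because it is the last item in the class.
    $result = preg_replace('/[^a-z0-9#&\p{Han}-]/u', '', $lower);

    return $result ?? '';
}

echo clean_title("繁體 Title #1 & Co-op!"), "\n"; // 繁體title#1&co-op
```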
Have you tried mb_ereg_replace() instead of preg_replace()? That might do the trick.
http://www.php.net/manual/en/function.mb-ereg-replace.php
Possible Duplicate:
How to handle diacritics (accents) when rewriting 'pretty URLs'
I want to replace special characters, such as Å Ä Ö Ü é, with "normal" characters (those between a-z and 0-9). And spaces should certainly be replaced with dashes, but that's not really a problem.
In other words, I want to turn this:
en räksmörgås
into this:
en-raksmorgas
What's the best way to do this?
Thank you in advance.
You can use iconv for the string replacement...
$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
Basically, it'll transliterate the characters it can, and drop those it can't (that are not in the ASCII character set)...
Then, just replace the spaces with str_replace:
$string = str_replace(' ', '-', $string);
Or, if you want to get fancy, you can replace all consecutive white-space characters with a single dash using a simple regex:
$string = preg_replace('/\\s+/', '-', $string);
Edit: As @Robert Ros points out, you need to set the locale before using iconv (depending on your system's defaults). Just execute this line before the iconv line:
setlocale(LC_CTYPE, 'en_US.UTF8');
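Put together, the steps above might look like the following sketch. The function name slug_via_iconv, the lowercasing, and the empty-string fallback when iconv fails are assumptions added for illustration. The exact transliteration depends on the iconv implementation and the installed locales, so no expected output is shown.

```php
<?php
// Hypothetical helper combining setlocale, iconv, and the whitespace regex.
function slug_via_iconv(string $text): string
{
    // Locale name taken from the answer; it may not be installed everywhere,
    // in which case setlocale() returns false and the current locale is kept.
    setlocale(LC_CTYPE, 'en_US.UTF8');

    // Transliterate what can be transliterated and drop the rest.
    // Some iconv implementations do not support //TRANSLIT and return false.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
    if ($ascii === false) {
        return '';
    }

    // Collapse runs of whitespace into single dashes.
    $slug = preg_replace('/\s+/', '-', trim($ascii));

    return strtolower($slug);
}

echo slug_via_iconv('en räksmörgås'), "\n";
```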
Check out http://php.net/manual/en/function.strtr.php
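<?php
// The string form of strtr() maps byte by byte, which mangles multibyte
// UTF-8 input, so use the array form to replace whole characters.
$addr = strtr($addr, ['ä' => 'a', 'å' => 'a', 'ö' => 'o']);
?>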
A clever hack often used for this is calling htmlentities, then running
preg_replace('/&(\w)(acute|uml|circ|tilde|ring|grave);/', '\1', $str);
to get rid of the diacritics. A more complete (but often unnecessarily complicated) solution is using a Unicode decomposition algorithm to split diacritics, then dropping everything that is not an ASCII letter or digit.
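The htmlentities trick described above could be sketched like this, using the sample string from the question. Entities whose names do not end in one of the listed suffixes (for example &szlig;) are left as they are, so this is an illustration rather than a complete solution.

```php
<?php
$str = 'en räksmörgås';

// Step 1: turn accented letters into named entities,
// giving "en r&auml;ksm&ouml;rg&aring;s".
$entities = htmlentities($str, ENT_NOQUOTES, 'UTF-8');

// Step 2: keep only the base letter of each matching entity.
$plain = preg_replace('/&(\w)(acute|uml|circ|tilde|ring|grave);/', '\1', $entities);

// Step 3: spaces become dashes.
$slug = str_replace(' ', '-', $plain);

echo $slug, "\n"; // en-raksmorgas
```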