PHP - replace all non-alphanumeric chars for all languages supported


Hi, I'm trying to replace all the non-alphanumeric chars in a string like this:
mb_ereg_replace('/[^a-z0-9\s]+/i','-',$string);
The first problem is that it doesn't replace chars like "." in the string.
Second, I would like to add multibyte support for all users' languages to this method.
How can I do that?
Any help appreciated, thanks a lot.

Try the following:
preg_replace('/[^\p{L}0-9\s]+/u', '-', $string);
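When the u flag is used on a regular expression, \p{L} matches any character in any of the Unicode letter categories. (Perl also spells this \p{Letter}, but the PCRE documentation says the long property names are not supported, so stick with \p{L} in PHP.)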
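A minimal sketch of the pattern from this answer wrapped in a function. The helper name replace_non_alnum and the sample string are made up for illustration, and the input is assumed to be valid UTF-8 (with the u modifier, preg_replace returns null otherwise).

```php
<?php
// Hypothetical helper name; assumes $string is valid UTF-8.
function replace_non_alnum(string $string, string $replacement = '-'): string
{
    // \p{L} = any Unicode letter, 0-9 = ASCII digits, \s = whitespace.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/[^\p{L}0-9\s]+/u', $replacement, $string);

    // preg_replace returns null on error, e.g. when the subject is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

// "\u{00E9}" is the letter e with an acute accent.
echo replace_non_alnum("Caf\u{00E9}.au.lait 2024!"), "\n";
```

Note that whitespace is kept here because \s is inside the negated class; drop it if spaces should become dashes too.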

With preg_replace that pattern does replace . with -. mb_ereg_replace takes its pattern without the / delimiters and the trailing i, so there they are matched literally.
As for the multi-byte support, add the u modifier and look into the PCRE Unicode properties, namely \p{L} (the letter category):
$replaced = preg_replace('~[^0-9\p{L}]+~u', '-', $string);

The shortest way is:
$result = preg_replace('~\P{Xan}++~u', '-', $string);
\p{Xan} contains numbers and letters in all languages, thus \P{Xan} contains all that is not a letter or a number.
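A short sketch of the \P{Xan} variant on a mixed-script string. The sample text is invented, and the file is assumed to be saved as UTF-8 with precomposed characters.

```php
<?php
// Sample input is an assumption for illustration; the file must be UTF-8.
$string = 'naïve... привет, 世界 42';

// \P{Xan} matches anything that is neither a Unicode letter nor a Unicode number.
// ++ is a possessive quantifier: each run is replaced by a single dash.
$result = preg_replace('~\P{Xan}++~u', '-', $string);

echo $result, "\n";
```

Unlike the [^\p{L}0-9\s] pattern, this one also replaces whitespace, and it keeps number characters from every script, not only 0-9.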

This expression does replace dots. For multibyte use u modifier (UTF-8).

Related

How to remove special characters and keep letters of any language in PHP?

I know this should remove any characters from string and keep only numbers and ENGLISH letters.
$txtafter = preg_replace("/[^a-zA-Z 0-9]+/","",$txtbefore);
but I wish to remove any special characters and keep any letter of any language like Arabic or Japanese.

Probably this will work for you:
$repl = preg_replace('/[^\w\s]+/u','' ,$txtbefore);
This will remove all non-word and non-space characters from your text. /u flag is there for unicode support.

You can use the \p{L} pattern to match any letter and \p{N} to match any numeric character. You should also use the u modifier, like this: /\p{L}+/u
Your final regex may look like: /[^\p{L}\p{N}]/u
Also be sure to check this question:
Regular expression \p{L} and \p{N}
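The two suggestions above differ in one visible way: \w also keeps the underscore. A small sketch with an invented sample string, assuming UTF-8 input; \s is added to the second pattern here so that both keep spaces.

```php
<?php
// Sample input is an assumption for illustration; the file must be UTF-8.
$txtbefore = 'snake_case! 日本語 #1';

// \w with the u modifier: letters, numbers and the underscore survive.
$withWord = preg_replace('/[^\w\s]+/u', '', $txtbefore);

// Explicit properties: only letters and numbers (plus whitespace) survive.
$withProps = preg_replace('/[^\p{L}\p{N}\s]+/u', '', $txtbefore);

echo $withWord, "\n";
echo $withProps, "\n";
```

Scripts that rely on combining marks (Arabic diacritics, for instance) may also need \p{M} in the class, since those marks are neither \p{L} nor \p{N}.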

PHP: remove small words from string ignoring german characters in the words

I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove small words less than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
so, the German character ö is getting removed from the string
and German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.

I assume your string is not UTF-8. With umlauts and other non-ASCII characters in a regex, I think it's easier to encode to UTF-8 first and then, after applying the regex with the u modifier (Unicode), decode again if you need the original encoding (according to your locale). Note that utf8_encode assumes the input is ISO-8859-1. So you would start with:
$kw = utf8_encode(strtolower($kw));
Now you can use the regex-unicode functionalities. \p{L} is for letters and \p{N} for numbers. If you consider all letters and numbers as word-characters (up to you) your boundary would be the opposite:
[^\p{L}\p{N}]
You want all word-characters:
[\p{L}\p{N}]
You want the word, if there is a start ^ or boundary before. You can use a positive lookbehind for that:
(?<=[^\p{L}\p{N}]|^)
Replace max 2 "word-characters" followed by a boundary or the end:
[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)
So your regex could look like this:
'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
And decode to your locale's encoding, if you like:
echo utf8_decode($kw);
Good luck! Robert
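Put together, the steps from this answer might look like the sketch below. It assumes the string is already valid UTF-8 and that the mbstring extension is available; for ISO-8859-1 input, mb_convert_encoding($kw, 'UTF-8', 'ISO-8859-1') could stand in for utf8_encode, which is deprecated as of PHP 8.2. The empty replacement string is an assumption, since the answer does not state one.

```php
<?php
// Assumes UTF-8 input; "\u{00F6}" is the letter o with an umlaut.
$kw = "Test-Tes-Te-T-Sch\u{00F6}nheit-Test";

// mb_strtolower is multibyte-aware, unlike strtolower.
$kw = mb_strtolower($kw, 'UTF-8');

// A word of 1-2 letters/numbers that starts after a boundary (or at the start)
// and is followed by a boundary character (or the end of the string).
$pattern = '/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u';
$kw = preg_replace($pattern, '', $kw);

// Collapse repeated dashes and trim them from both ends.
$kw = trim(preg_replace('/-+/', '-', $kw), '-');

echo $kw, "\n";
```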

Your \b word boundaries trip over the ö, because by default it doesn't count as an alphanumeric character: PCRE works on ASCII letters unless told otherwise.
Assuming the input string is UTF-8, use the /u Unicode modifier so that non-English letters are treated as letters:
$kw = preg_replace("/\b[^-]{1,2}\b/u", "-", $kw);
I would use preg_replace_callback btw (the /e modifier was the older alternative, but it was removed in PHP 7), and search for [A-Z] instead for the replacing. And strtr for the dashes, or just [-+]+ for collapsing consecutive ones.
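To see the effect of the modifier, the same pattern can be run with and without /u. This is a sketch on the question's lowercased string, assuming UTF-8 input; tidy_dashes is a hypothetical helper, and the byte-wise result depends on PCRE's default character tables.

```php
<?php
// Hypothetical helper: collapse runs of dashes and trim them from the ends.
function tidy_dashes(string $s): string
{
    return trim(preg_replace('/-+/', '-', $s), '-');
}

// "\u{00F6}" is the letter o with an umlaut, stored as two bytes in UTF-8.
$kw = "test-tes-te-t-sch\u{00F6}nheit-test";

// Without /u, \b works byte-wise and treats the bytes of the umlaut as
// non-word characters, which splits the word as described in the question.
$bytewise = tidy_dashes(preg_replace('/\b[^-]{1,2}\b/', '-', $kw));

// With /u, the umlaut counts as a letter and the word stays whole.
$unicode = tidy_dashes(preg_replace('/\b[^-]{1,2}\b/u', '-', $kw));

echo $bytewise, "\n";
echo $unicode, "\n";
```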

preg_replace - regular expression php for two different characters

Is this a correct syntax for preg_replace (regular expression) to remove ?ajax=true or &ajax=true from a string?
echo preg_replace('/(\?|&)ajax=true/', '', $string);
So for example /hello/hi?ajax=true will give me /hello/hi and /hello/hi?ajax=true will give me /hello/hi
Do I need to escape &?

Why don't you try it?
You don't need to escape "&". It is not a special character in regex.
Your expression should be working, that is an alternation that you are using. But if you have only single characters in your alternation, it is more readable, if you use a character class.
echo preg_replace('/[?&]ajax=true/', '', $string);
[?&] is a character class, it will match one character out of the characters listed between the square brackets.
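A small sketch of the character-class version in use. The function name strip_ajax_flag and the sample URLs are made up for illustration.

```php
<?php
// Hypothetical helper name. Removes "?ajax=true" or "&ajax=true" wherever it occurs.
function strip_ajax_flag(string $url): string
{
    // [?&] matches exactly one character: either ? or &. Neither needs escaping
    // inside a character class. Fall back to the input if preg_replace fails.
    return preg_replace('/[?&]ajax=true/', '', $url) ?? $url;
}

echo strip_ajax_flag('/hello/hi?ajax=true'), "\n";
echo strip_ajax_flag('/hello/hi?page=2&ajax=true'), "\n";

// Caveat: when the flag comes first and other parameters follow, the leading ?
// is removed as well, so '/hello/hi?ajax=true&page=2' becomes '/hello/hi&page=2'.
```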

I think your expression is OK. You can add (?i) to make the match case-insensitive. The result would be something like:
echo preg_replace('/(\?|&)(?i)ajax=true/', '', $string);

remove whatever i want from string

I got a few keywords, symbols, letters etc I want to remove from my php string. I'm trying to add it but it doesn't work too well.
$string = preg_replace("/(?![=$'%-mp4mp3])\p{P}/u","", $check['title']);
Pretty much, I want to remove the words mp3, mp4, ./, apples from the string.
Please help guide me, thanks in advance!

First: [] in a regular expression introduces a character class, and a hyphen between two symbols represents a character range. So the reason your regular expression would make too many erasures (as I suppose) is that [=$'%-mp4mp3] means =, $, ', everything from % to m (73 characters actually!), p, 3, 4.
Second: your regular expression doesn't grab the "bad" characters/keywords. Actually, you erase punctuation that is not one of those characters, because a negative lookahead is a zero-width assertion (it is not included in the match).
Change your regex to:
"/[=$'%-]|mp3|mp4/u"

You don't need regex for that.
$string = "Your original string here";
$keywords = array('mp3', 'mp4');
echo str_replace($keywords, '', $string);
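If the regex route is preferred, the alternation from the earlier answer can be built from a list instead of typed by hand. This is only a sketch: the token list comes from the question, remove_unwanted is a hypothetical name, and for plain literals like these it does the same job as str_replace.

```php
<?php
// Hypothetical list, taken from the tokens named in the question.
$unwanted = ['mp3', 'mp4', './', 'apples', '=', '$', "'", '%', '-'];

function remove_unwanted(string $title, array $unwanted): string
{
    // preg_quote escapes regex metacharacters (and the / delimiter),
    // so every token is matched literally.
    $quoted = array_map(
        fn(string $token): string => preg_quote($token, '/'),
        $unwanted
    );
    $pattern = '/' . implode('|', $quoted) . '/u';

    // Fall back to the input if preg_replace reports an error.
    return preg_replace($pattern, '', $title) ?? $title;
}

echo remove_unwanted('apples-and-pears.mp3', $unwanted), "\n";
```

The regex form pays off once the list needs flags such as i for case-insensitive matching.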

Regex to remove non alphanumeric characters from UTF8 strings

How can I remove characters, like punctuation, commas, dashes etc from a string, in a multibyte safe manner?
I will be working with input from many different languages and I am wondering if there is something that can help me with this
Thanks

There are the unicode character class thingys that you can use:
http://www.regular-expressions.info/unicode.html
http://php.net/manual/en/regexp.reference.unicode.php
To match any non-letter symbols you can just use \PL+, the negation of \p{L}. To not remove spaces, use a charclass like [^\pL\s]+. Or really just remove punctuation with \pP+
Well, and obviously don't forget the regex /u modifier.
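The three patterns mentioned above behave differently on the same input. A sketch with an invented sample string, assuming UTF-8:

```php
<?php
// Sample input is an assumption for illustration; the file must be UTF-8.
$raw = 'Héllo, wörld 42!';

// \PL+ : remove everything that is not a letter (spaces and digits included).
$lettersOnly = preg_replace('/\PL+/u', '', $raw);

// [^\pL\s]+ : remove everything except letters and whitespace.
$keepSpaces = preg_replace('/[^\pL\s]+/u', '', $raw);

// \pP+ : remove punctuation only; letters, digits and spaces stay.
$noPunct = preg_replace('/\pP+/u', '', $raw);

var_dump($lettersOnly, $keepSpaces, $noPunct);
```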

I used this:
$clean = preg_replace( "/[^\p{L}\p{N}]+/u", " ", $raw ); // no | here: inside [...] it would be a literal pipe
$clean = preg_replace( "/[\p{Z}]{2,}/u", " ", $clean );

Similar post
Remove non-utf8 characters from string
I'm not sure if this covers all characters though.
According to this post on the dreamincode forum
http://www.dreamincode.net/forums/topic/78179-regular-expression-to-remove-non-ascii-characters/
this should work
/[^\x{21}-\x{7E}\s\t\n\r]/

Maybe this will be useful? (It keeps ASCII letters and digits only.)
$newstring = preg_replace('/[^0-9a-zA-Z\s]/', '', $oldstring);
