Regex to remove non alphanumeric characters from UTF8 strings - php

How can I remove characters, like punctuation, commas, dashes etc from a string, in a multibyte safe manner?
I will be working with input from many different languages and I am wondering if there is something that can help me with this
Thanks

There are Unicode character properties that you can use:
http://www.regular-expressions.info/unicode.html
http://php.net/manual/en/regexp.reference.unicode.php
To match any non-letter symbols you can just use \PL+, the negation of \p{L}. To not remove spaces, use a charclass like [^\pL\s]+. Or really just remove punctuation with \pP+
Well, and obviously don't forget the regex /u modifier.
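A minimal sketch of the two patterns mentioned above, assuming the source file is saved as UTF-8 and PHP 7+ (the sample string is made up for illustration):

```php
<?php
// Made-up sample input mixing Latin, Cyrillic, digits and a math symbol.
$raw = "Olá, мир! 42+1";

// \pP matches punctuation only, so letters, digits, spaces and "+" survive.
$noPunct = preg_replace('/\pP+/u', '', $raw);

// [^\pL\s] matches anything that is neither a letter nor whitespace.
$lettersOnly = preg_replace('/[^\pL\s]+/u', '', $raw);

var_dump($noPunct);     // "Olá мир 42+1"
var_dump($lettersOnly); // "Olá мир " (note the trailing space)
```

Note that "+" is a symbol (\pS), not punctuation, which is why \pP+ alone leaves it in place.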

I used this:
// Inside [...] a | is a literal pipe, not "or", so it is left out of the class.
$clean = preg_replace( "/[^\p{L}\p{N}]+/u", " ", $raw );
$clean = preg_replace( "/[\p{Z}]{2,}/u", " ", $clean );

Similar post
Remove non-utf8 characters from string
I'm not sure if this covers all characters though.
According to this post on the dreamincode forum
http://www.dreamincode.net/forums/topic/78179-regular-expression-to-remove-non-ascii-characters/
this should work, although it keeps only printable ASCII and whitespace, so it also strips every non-ASCII letter:
/[^\x{21}-\x{7E}\s\t\n\r]/

Maybe this will be useful? Note that it only keeps ASCII letters, digits and whitespace:
$newstring = preg_replace('/[^0-9a-zA-Z\s]/', '', $oldstring);

Related

How to correctly replace multiple white spaces with a single white space in PHP?

I was scouring through SO answers and found that the solution that most gave for replacing multiple spaces is:
$new_str = preg_replace("/\s+/", " ", $str);
But in many cases the whitespace includes Unicode characters such as line feed, form feed, carriage return, non-breaking space, etc. This wiki says that Unicode defines twenty-five characters as whitespace.
So how do we replace all these characters as well using regular expressions?
When passing u modifier, \s becomes Unicode-aware. So, a simple solution is to use
$new_str = preg_replace("/\s+/u", " ", $str);
See the PHP online demo.
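A small sketch of that behaviour, assuming PHP 7+ (for the "\u{...}" string escapes) and a made-up input string:

```php
<?php
// U+00A0 is a no-break space and U+2003 is an em space; both are Unicode
// whitespace but neither is an ASCII whitespace byte.
$str = "one\u{00A0}\u{00A0}two \u{2003}\tthree";

// With /u, \s also matches the Unicode space separators.
$new_str = preg_replace('/\s+/u', ' ', $str);

var_dump($new_str); // "one two three"
```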
The first thing to do is to read this explanation of how Unicode can be treated in regex. Coming specifically to PHP, we first need to include the PCRE modifier 'u' so that the engine treats the pattern and the subject as UTF-8. So this would be:
$pattern = "/<our-pattern-here>/u";
The next thing to note is that in a PHP pattern a Unicode character can be written as \x{00A0}, where 00A0 is the hexadecimal code point of the non-breaking space. So if we want to replace consecutive non-breaking spaces with a single space we would have:
$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
And if we were to include other types of spaces mentioned in the wiki like:
\x{000D} carriage return
\x{000C} form feed
\x{0085} next line
Our pattern becomes:
$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";
Note that a character class followed by + is matched in a single linear pass, so this pattern is not inherently slow and works as written.
Another way to do it is to replace every occurrence of each of these characters with a normal space first, and then replace multiple spaces with a single normal space. We remove the [ ]+ and instead separate the characters with the or operator | :
$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space
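The character class [\pZ\pC] matches every Unicode whitespace character, but it is broader than whitespace: \pC also covers control, format, unassigned and private-use code points, so invisible characters such as the zero-width joiner (U+200D) are matched too. Here is a unit test to prove it.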
If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So to answer your question that would be:
$new_str = preg_replace("/[\pZ\pC]+/u", " ", $str);
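To see how the two approaches differ, here is a short comparison, assuming PHP 7+ and a made-up input that contains a zero-width joiner (U+200D, a format character rather than whitespace):

```php
<?php
// NBSP (U+00A0), line separator (U+2028), CR/LF/tab, and a zero-width joiner.
$str = "a\u{00A0}\u{2028}b\r\n\tc\u{200D}d";

// \pZ covers the separators, \pC covers control and format characters.
$viaZC = preg_replace('/[\pZ\pC]+/u', ' ', $str);

// \s with /u covers whitespace only and leaves the joiner alone.
$viaS = preg_replace('/\s+/u', ' ', $str);

var_dump($viaZC);                     // "a b c d"
var_dump($viaS === "a b c\u{200D}d"); // bool(true)
```

Which one is preferable depends on the input: some scripts and emoji sequences rely on joiners, so stripping \pC can change how text renders.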

Regex for removing special characters on a multilingual string

The most common regex suggested for removing special characters seems to be this -
preg_replace( '/[^a-zA-Z0-9]/', '', $string );
The problem is that it also removes non-English characters.
Is there a regex that removes special characters in all languages? Or is the only solution to explicitly match each special character and remove it?
You can use instead:
preg_replace('/\P{Xan}+/u', '', $string );
\p{Xan} is all that is a number or a letter in any alphabet of the unicode table.
\P{Xan} is all that is not a number or a letter. It is a shortcut for [^\p{Xan}]
You can use:
$string = preg_replace( '/[^\p{L}\p{N}]+/u', '', $string );
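One caveat worth checking against your own data: combining marks (\p{M}) are neither letters nor numbers, so both patterns above strip them. A sketch assuming PHP 7+, with a made-up Hindi sample written as escapes so the code points are unambiguous:

```php
<?php
// "हिंदी" spelled out: HA, vowel sign I, anusvara, DA, vowel sign II.
$string = "Hindi: \u{0939}\u{093F}\u{0902}\u{0926}\u{0940}, 2024!";

// Letters and numbers only: the vowel signs and anusvara are removed.
$strict = preg_replace('/[^\p{L}\p{N}]+/u', '', $string);

// Also keeping marks preserves the word as written.
$withMarks = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '', $string);

var_dump($strict === "Hindi\u{0939}\u{0926}2024");                             // bool(true)
var_dump($withMarks === "Hindi\u{0939}\u{093F}\u{0902}\u{0926}\u{0940}2024"); // bool(true)
```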

PHP - replace all non-alphanumeric chars for all languages supported

Hi, I'm trying to replace all the non-alphanumeric chars in a string like this:
mb_ereg_replace('/[^a-z0-9\s]+/i','-',$string);
The first problem is that it doesn't replace chars like "." in the string.
Second, I would like to add multibyte support for all users' languages to this method.
How can I do that?
Any help appreciated, thanks a lot.
Try the following:
preg_replace('/[^\p{L}0-9\s]+/u', '-', $string);
When the u flag is used on a regular expression, \p{L} (and \p{Letter}) matches any character in any of the Unicode letter categories.
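It should replace . with - once you switch to preg_replace. mb_ereg_replace does not take /.../ delimiters or trailing flags, so the slashes and the i in your pattern were being matched literally, which is why the dots were left alone.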
As for the multi-byte support, add the u modifier and look into PCRE properties, namely \p{L} (the short form listed in the PHP manual; the long form \p{Letter} is not accepted by every PCRE version):
$replaced = preg_replace('~[^0-9\p{L}]+~u', '-', $string);
The shortest way is:
$result = preg_replace('~\P{Xan}++~u', '-', $string);
\p{Xan} contains numbers and letters in all languages, thus \P{Xan} contains all that is not a letter or a number.
This expression does replace dots. For multibyte use u modifier (UTF-8).
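Putting the pieces together, a small helper might look like the sketch below. The function name slugify is hypothetical, and trimming the leading and trailing dashes is an addition not asked for in the question:

```php
<?php
// Hypothetical helper: replaces every run of non-alphanumeric characters
// (in any script) with a single dash and trims dashes at the edges.
function slugify(string $text): string
{
    $slug = preg_replace('~\P{Xan}++~u', '-', $text);
    if ($slug === null) {
        // preg_replace returns null on failure, e.g. for invalid UTF-8 input.
        return '';
    }
    return trim($slug, '-');
}

echo slugify("Привет, мир... 2024!"), "\n"; // Привет-мир-2024
```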

PHP regex, replace all trash symbols

I can't get my head around a solid RegEx for doing this, still very new at all this RegEx magic. I had some limited success, but I feel like there is a simpler, more efficient way.
I would like to purify a string of all non-alphanumeric characters, and turn all those invalid subsets into one single underscore, but trim them at the edges. For example, the string <<+ćThis?//String_..! should be converted to This_String
Any thoughts on doing this all in one RegEx? I did it with regular str_replace, and then regexed the multi-underscores out of the way, and then trimmed the last underscores from the edges, but it seems like overkill and like something RegEx could do in one go. Kind of going for max speed/efficiency here, even if it is milliseconds I'm dealing with.
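$string = trim(preg_replace('<\W+>', "_", $string), "_");
The uppercase \W escape here matches "non-word" characters, meaning everything but letters, numbers and the underscore. Existing underscores are therefore left in place and can end up doubled next to a replacement. To remove the leftover outer underscores I would still use trim.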
Yes, you could do this:
preg_replace("/[^a-zA-Z0-9]+/", "_", $myString);
Then you would trim leading and trailing underscores, maybe by doing this:
preg_replace("/^_+|_+$/", "", $myReplacedString);
It's not one regex, but it's cleaner than str_replace and a bunch of regex.
$output = preg_replace('/([^0-9a-z])/i', ' ', '<<+ćThis?//String_..!');
$output = preg_replace('!\s+!', '_', trim($output));
echo $output;
This_String
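For comparison, the replace and the trim can be folded into one expression. This is a sketch using the question's own sample string; the second input, 'a_.b', is made up to show where \W+ behaves differently:

```php
<?php
$string = '<<+ćThis?//String_..!';

// Collapse every run of non-ASCII-alphanumerics to "_" and trim the edges.
$clean = trim(preg_replace('/[^a-z0-9]+/i', '_', $string), '_');
echo $clean, "\n"; // This_String

// \W does not match "_", so an existing underscore ends up doubled.
$viaW     = trim(preg_replace('/\W+/', '_', 'a_.b'), '_');        // a__b
$viaClass = trim(preg_replace('/[^a-z0-9]+/i', '_', 'a_.b'), '_'); // a_b
```

The explicit class is ASCII-only on purpose here, since the question wants "ć" dropped from the result.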

What is the best way to remove punctuation marks, symbols, diacritics, special characters?

I use these lines of code to remove all punctuation marks, symbols, etc as you can see them in the array,
$pattern_page = array("+",",",".","-","'","\"","&","!","?",":",";","#","~","=","/","$","£","^","(",")","_","<",">");
$pg_url = str_replace($pattern_page, ' ', strtolower($pg_url));
but I want to make it simpler as it looks silly to list all the stuff I want to remove in the array as there might be some other special characters I want to remove.
I thought of using the regular expression below,
$pg_url = preg_replace("/\W+/", " ", $pg_url);
but it doesn't remove the underscore (_).
What is the best way to remove all of these? Can a regular expression do that?
Depending on how greedy you'd like to be, you could do something like:
$pg_url = preg_replace("/[^a-zA-Z 0-9]+/", " ", $pg_url);
This will replace anything that isn't a letter, number or space.
Use classes:
preg_replace('/[^[:alpha:]]/', '', $input);
This would remove anything that's not considered a letter by the currently set locale. If it's punctuation you seek to eliminate, the class would be [:punct:].
\W means "any non-word character" and is the opposite of \w which includes underscores (_).
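Since \W leaves underscores alone, one option is to add the underscore to the class explicitly. A sketch with a made-up input; the /u modifier is an assumption that the input is UTF-8:

```php
<?php
$pg_url = "Hello_World & Friends!";

// [\W_] matches any non-word character plus the underscore itself.
$pg_url = trim(preg_replace('/[\W_]+/u', ' ', strtolower($pg_url)));

var_dump($pg_url); // "hello world friends"
```

strtolower only lowercases ASCII letters; for other scripts mb_strtolower (from the mbstring extension) would be the usual choice.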
