Regex for removing special characters on a multilingual string - php

The most common regex suggested for removing special characters seems to be this -
preg_replace( '/[^a-zA-Z0-9]/', '', $string );
The problem is that it also removes non-English characters.
Is there a regex that removes special characters in all languages? Or is the only solution to explicitly match each special character and remove it?

You can use instead:
preg_replace('/\P{Xan}+/u', '', $string );
\p{Xan} matches any character that is a letter or a number in any script of the Unicode table.
\P{Xan} matches anything that is not a letter or a number. It is shorthand for [^\p{Xan}]. The u modifier makes PCRE treat the pattern and the subject as UTF-8.

You can use:
$string = preg_replace( '/[^\p{L}\p{N}]+/u', '', $string );
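To make the difference concrete, here is a small sketch comparing the ASCII-only pattern from the question with the two Unicode-aware ones. The sample string is made up for illustration, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical sample input mixing Latin, Cyrillic, CJK, digits and punctuation.
$string = 'Straße, мир! 你好 123?';

// ASCII-only class: also strips ß, the Cyrillic and the CJK letters.
$ascii = preg_replace('/[^a-zA-Z0-9]/', '', $string);

// Unicode-aware variants: keep letters and digits from any script.
$xan = preg_replace('/\P{Xan}+/u', '', $string);
$ln  = preg_replace('/[^\p{L}\p{N}]+/u', '', $string);

echo $ascii, "\n"; // Strae123
echo $xan, "\n";   // Straßeмир你好123
echo $ln, "\n";    // Straßeмир你好123
```

Both Unicode-aware patterns give the same result here; \p{Xan} is a PCRE-specific shorthand, while \p{L}\p{N} is spelled the same way in most other regex flavors.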

Related

creating regex with letters and accents

I need to create a regular expression that matches word, whitespace, word. It can't start with whitespace or have more than one whitespace between the words. Each word must allow letters and accented characters. I'm using this pattern:
^([^\+\*\.\|\(\)\[\]\{\}\?\/\^\s\d\t\n\r<>ºª!#"·#~½%¬&=\'¿¡~´,;:_®¥§¹×£µ€¶«²¢³\$\-\\]+\s{0,1}?)*$/
Examples:
-Graça+whitespace+anotherWord -> match
-whitespace+Graça+whitespace+anotherWord -> don't match
-Graça+whitespace+whitespace+anotherword -> don't match
In general, it is a validation to allow firstname+whitespace+lastname with accents chars and a-z chars
and i have to exclude all specials chars like /*-+)(!/($=
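You can try this pattern with the u modifier: ^[\x{0041}-\x{02B3}]+\s[\x{0041}-\x{02B3}]+$
Explanation: since you are using characters not matched by \w, you have to define your own range of word characters. \x{0041} is just the character with Unicode code point 0041. Code points above \x{FF} are only accepted in PHP when the u modifier is set. Note that this range also contains some non-letters (for example [ \ ] ^ _ ` and the × and ÷ signs), so [\p{L}] is a stricter alternative.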
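As a sketch of the stricter alternative, the check below uses \p{L} instead of a code point range. The function name isFullName is made up for this example, \p{M} is included on the assumption that accented input may arrive in decomposed form, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: one word, exactly one whitespace, one word; letters only.
function isFullName(string $name): bool
{
    // \p{L} = any letter, \p{M} = combining marks; \z anchors at the very end.
    return preg_match('/^[\p{L}\p{M}]+\s[\p{L}\p{M}]+\z/u', $name) === 1;
}

var_dump(isFullName('Graça anotherWord'));   // bool(true)
var_dump(isFullName(' Graça anotherWord'));  // bool(false), leading whitespace
var_dump(isFullName('Graça  anotherWord'));  // bool(false), two whitespaces
var_dump(isFullName('Graça another+Word'));  // bool(false), special character
```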
For just spaces, use str_replace:
$string = str_replace(' ', '', $string);
For all whitespace, use preg_replace:
$string = preg_replace('/\s+/', '', $string);

Remove all special chars, but not non-Latin characters

I'm using this PHP function for SEO URLs. It works fine with Latin words, but my URLs are in Cyrillic. This regex, /[^a-z0-9_\s-]/, does not work with Cyrillic characters. Please help me make it work with non-Latin characters.
function seoUrl($string) {
    // Lower case everything
    $string = strtolower($string);
    // Make alphanumeric (removes all other characters)
    $string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
    // Clean up multiple dashes or whitespaces
    $string = preg_replace('/[\s-]+/', ' ', $string);
    // Convert whitespaces and underscore to dash
    $string = preg_replace('/[\s_]/', '-', $string);
    return $string;
}
You need a Unicode script property for the Cyrillic alphabet, which PHP's PCRE fortunately supports as \p{Cyrillic}. You also have to set the u (Unicode) flag so that the engine treats the pattern and the subject as UTF-8. You may also need the i flag so that a-z matches A-Z as well:
~[^\p{Cyrillic}a-z0-9_\s-]~ui
You don't need to double escape \s.
PHP code:
preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);
To learn more about Unicode Regular Expressions see this article.
\p{L} or \p{Letter} matches any kind of letter from any language.
To match only Cyrillic characters, use \p{Cyrillic}
Since Cyrillic characters are not standard ASCII characters, you have to use u flag/modifier, so regex will recognize Unicode characters as needed.
Be sure to use mb_strtolower instead of strtolower, because strtolower is not multibyte-aware and will not lower-case Cyrillic letters in a UTF-8 string.
Because you convert all characters to lowercase, you don't have to use i regex flag/modifier.
The following PHP code should work for you:
function seoUrl($string) {
    // Lower case everything (explicit encoding, so the result does not depend on ini settings)
    $string = mb_strtolower($string, 'UTF-8');
    // Make alphanumeric (removes all other characters)
    $string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
    // Clean up multiple dashes or whitespaces
    $string = preg_replace('/[\s-]+/', ' ', $string);
    // Convert whitespaces and underscore to dash
    $string = preg_replace('/[\s_]/', '-', $string);
    return $string;
}
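Furthermore, please note that block properties such as \p{InCyrillic} and \p{InCyrillic_Supplementary} come from other regex flavors; PHP's PCRE does not support them. \p{Cyrillic} is a script property and already covers the Cyrillic Supplement block (U+0500 to U+052F). To target a single block, use a code point range such as [\x{0500}-\x{052F}] together with the u modifier.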
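If the URLs may contain scripts other than Cyrillic, the same idea can be generalized with \p{L} and \p{N}. This is only a sketch: the name seoUrlAnyScript is made up, it assumes valid UTF-8 input and the mbstring extension, and it additionally trims leading and trailing dashes, which the original function does not do.

```php
<?php
// Hypothetical variant of seoUrl(); assumes UTF-8 input and the mbstring extension.
function seoUrlAnyScript(string $string): string
{
    // Lower-case with an explicit encoding so multibyte letters are handled.
    $string = mb_strtolower($string, 'UTF-8');
    // Keep letters and digits of any script, plus underscore, whitespace and dash.
    $string = preg_replace('/[^\p{L}\p{N}_\s-]+/u', '', $string);
    // Collapse runs of whitespace, underscores and dashes into a single dash.
    $string = preg_replace('/[\s_-]+/u', '-', $string);
    // Trim dashes left over at either end.
    return trim($string, '-');
}

echo seoUrlAnyScript('Привет, Мир! Hello_World'), "\n"; // привет-мир-hello-world
```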

Regex to strip specific characters

I have been using the following regex to replace all punctuation in a string:
preg_replace('/[^\w\s]/', '', $tweet);
with \w being shorthand for [a-zA-Z0-9_] and \s used to keep whitespace. I learned this wisdom here: Strip punctuation in an address field in PHP. But now I need the regex to strip all characters except:
a-z and A-Z
{ and }
So it should strip out all dots, commas, numbers etc. What is the correct regex for this?
preg_replace('/[^a-zA-Z{} ]/', '', $tweet);
Possibly faster variant as proposed by FakeRainBrigand in a comment, thanks:
preg_replace('/[^a-zA-Z{} ]+/', '', $tweet);
preg_replace('/[^a-z{}]/i', '', $tweet);
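The variants above differ in whether the space character survives. A quick sketch on a made-up tweet shows the difference:

```php
<?php
// Hypothetical tweet text used only for illustration.
$tweet = 'Hi {name}, you won 1st prize!';

// Keeps ASCII letters, braces and the space character.
$withSpaces = preg_replace('/[^a-zA-Z{} ]+/', '', $tweet);
// Keeps ASCII letters and braces only; spaces are removed as well.
$noSpaces = preg_replace('/[^a-z{}]/i', '', $tweet);

echo $withSpaces, "\n"; // Hi {name} you won st prize
echo $noSpaces, "\n";   // Hi{name}youwonstprize
```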

Regex to remove non alphanumeric characters from UTF8 strings

How can I remove characters, like punctuation, commas, dashes etc from a string, in a multibyte safe manner?
I will be working with input from many different languages and I am wondering if there is something that can help me with this
Thanks
There are the unicode character class thingys that you can use:
http://www.regular-expressions.info/unicode.html
http://php.net/manual/en/regexp.reference.unicode.php
To match any non-letter symbols you can just use \PL+, the negation of \p{L}. To not remove spaces, use a charclass like [^\pL\s]+. Or really just remove punctuation with \pP+
Well, and obviously don't forget the regex /u modifier.
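The three options mentioned above behave differently on the same input. A short sketch, using a made-up string and assuming it is valid UTF-8:

```php
<?php
// Hypothetical input: Cyrillic and CJK letters, punctuation and a math symbol.
$raw = 'Привет, мир+мир... 你好!';

// \PL+ removes everything that is not a letter, including whitespace.
$lettersOnly = preg_replace('/\PL+/u', '', $raw);
// [^\pL\s]+ removes everything except letters and whitespace.
$lettersAndSpaces = preg_replace('/[^\pL\s]+/u', '', $raw);
// \pP+ removes punctuation only; "+" is a symbol (\pS), so it stays.
$noPunctuation = preg_replace('/\pP+/u', '', $raw);

echo $lettersOnly, "\n";      // Приветмирмир你好
echo $lettersAndSpaces, "\n"; // Привет мирмир 你好
echo $noPunctuation, "\n";    // Привет мир+мир 你好
```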
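I used this. The first call replaces every run of characters that are neither letters nor numbers with a space; the second collapses repeated separators into one space:
$clean = preg_replace( "/[^\p{L}\p{N}]+/u", " ", $raw );
$clean = preg_replace( "/[\p{Z}]{2,}/u", " ", $clean );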
Similar post
Remove non-utf8 characters from string
I'm not sure if this covers all characters though.
According to this post on the dreamincode forum
http://www.dreamincode.net/forums/topic/78179-regular-expression-to-remove-non-ascii-characters/
this should work
/[^\x{21}-\x{7E}\s\t\n\r]/
Maybe this will be useful?
$newstring = preg_replace('/[^0-9a-zA-Z\s]/', '', $oldstring);

What is a regular expression to filter any special characters in PHP?

What is a regular expression to filter any special characters?
I want to remove any characters except 0-9 a-z A-Z and standard universal alphabet (arabic).
For example remove these characters: `~!##$%^&*()_+=-\][{}|';lL:"/.,<>? and any others.
$result = preg_replace('~[^A-Za-z0-9]~', '', $text);
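how about (using \p{L}, since \p{Alphabetic} is not recognized by every PCRE version, and + instead of *, so the pattern never matches an empty string):
preg_replace('/[^\p{L}\p{Arabic}\p{N}]+/u', '', $str);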
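A sketch of how a Unicode-aware class behaves on mixed Arabic and Latin input. The sample text is made up; \p{M} is added here on the assumption that Arabic diacritics, which are combining marks rather than letters, should be kept.

```php
<?php
// Hypothetical input, built piecewise so the logical character order is unambiguous.
$text = 'مرحبا' . '، Hello! ' . '١٢٣' . ' 123 #@';

// Keep letters, combining marks and digits of any script; drop everything else.
$result = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '', $text);

// The Arabic comma, "!", "#", "@" and all spaces are removed.
echo $result, "\n";
```

Both the Arabic-Indic digits and the ASCII digits survive, because \p{N} covers numbers in every script.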
