Remove all special chars, but not non-Latin characters - php

I'm using this PHP function for SEO URLs. It works fine with Latin words, but my URLs are in Cyrillic. The regex /[^a-z0-9_\s-]/ does not work with Cyrillic characters. Please help me make it work with non-Latin characters.
function seoUrl($string) {
// Lower case everything
$string = strtolower($string);
// Make alphanumeric (removes all other characters)
$string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
// Clean up multiple dashes or whitespaces
$string = preg_replace('/[\s-]+/', ' ', $string);
// Convert whitespaces and underscore to dash
$string = preg_replace('/[\s_]/', '-', $string);
return $string;
}

You need a Unicode script property for the Cyrillic alphabet, which PHP's PCRE fortunately supports as \p{Cyrillic}. You also have to set the u (Unicode) flag so that the pattern and the subject are treated as UTF-8 rather than as single bytes. You may also need the i flag for case-insensitivity, so that a-z also matches A-Z:
~[^\p{Cyrillic}a-z0-9_\s-]~ui
You don't need to double escape \s.
PHP code:
preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);

To learn more about Unicode Regular Expressions see this article.
\p{L} or \p{Letter} matches any kind of letter from any language.
To match only Cyrillic characters, use \p{Cyrillic}
Since Cyrillic characters are not ASCII characters, you have to use the u flag/modifier so that the regex engine treats the pattern and the subject as UTF-8.
Be sure to use mb_strtolower instead of strtolower, because strtolower does not reliably lower-case multi-byte (Unicode) characters.
Because you convert all characters to lowercase first, you don't need the i regex flag/modifier.
The following PHP code should work for you:
function seoUrl($string) {
// Lower case everything
$string = mb_strtolower($string, 'UTF-8');
// Make alphanumeric (removes all other characters)
$string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
// Clean up multiple dashes or whitespaces (the u flag keeps \s from matching bytes inside UTF-8 characters)
$string = preg_replace('/[\s-]+/u', ' ', $string);
// Convert whitespaces and underscore to dash
$string = preg_replace('/[\s_]/u', '-', $string);
return $string;
}
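Furthermore, please note that \p{Cyrillic} is a script property, so it also covers Cyrillic letters outside the basic block, such as those in the Cyrillic Supplement. The block syntax \p{InCyrillic} and \p{InCyrillic_Supplementary} comes from other regex flavors (Perl, Java); PCRE does not support it, so it cannot be used with preg_replace.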
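As a quick check of the approach above, here is a minimal self-contained sketch. The function name seoUrlCyrillic and the sample titles are hypothetical, and the sketch assumes the mbstring extension is loaded and the input is valid UTF-8.

```php
<?php
// Hypothetical helper mirroring the answer above; assumes mbstring and UTF-8 input.
function seoUrlCyrillic(string $string): string {
    // Lower-case with an explicit encoding so Cyrillic letters are handled
    $string = mb_strtolower($string, 'UTF-8');
    // Keep Cyrillic letters, a-z, digits, whitespace, underscore and dash
    $string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
    // Collapse runs of whitespace and dashes into a single space
    $string = preg_replace('/[\s-]+/u', ' ', $string);
    // Turn the remaining whitespace and underscores into dashes
    return preg_replace('/[\s_]/u', '-', $string);
}

$slugCyrillic = seoUrlCyrillic('Привет, Мир! PHP 8'); // привет-мир-php-8
$slugLatin    = seoUrlCyrillic('Hello -- World_2');   // hello-world-2
echo $slugCyrillic, "\n", $slugLatin, "\n";
```

Leading or trailing whitespace in the input would end up as a leading or trailing dash here, so trimming the title first may be worth considering.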

Related

Allow English characters, Chinese, Japanese

How can I replace only the symbols via PHP, but not digits (0-9) or English, Chinese or Japanese characters? Is there any way to do this in PHP?
I use preg_replace to allow English characters and numbers, but any Japanese, Chinese or Russian characters that are found are deleted automatically.
I tried this command too, but it is still not working:
$Data = preg_replace('/[^\p{L}\p{N}]/u', '-', $Data);
Maybe this code will help you.
<?php
$string = "年m月d日ASDFdfdfd4545$##$#$#";
$newString = preg_replace('/[^\p{L} 0-9]/u', "_", $string);
echo $newString;
Output:
年m月d日ASDFdfdfd4545_______
\p{L} matches any kind of letter from any language.
/u is the Unicode modifier; you need it if you want to handle Unicode characters.
Live demo: http://sandbox.onlinephpfunctions.com/code/a81db5a33e910799f995046104d38898c1203756
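If a pattern with the u modifier still appears "not to work", one thing worth ruling out is the input encoding: with /u, preg_replace() returns null when the subject is not valid UTF-8. A minimal sketch of that check follows; the variable names and the Windows-1251 sample are hypothetical, and it assumes the mbstring extension is available.

```php
<?php
// "мир" encoded as Windows-1251 bytes, i.e. NOT valid UTF-8 (hypothetical sample)
$legacy = "\xEC\xE8\xF0";

// With the u modifier, an invalid UTF-8 subject makes preg_replace() return null
$result = preg_replace('/[^\p{L}\p{N}]/u', '-', $legacy);
$failed = ($result === null && preg_last_error() === PREG_BAD_UTF8_ERROR);

// Convert to UTF-8 first, then the same pattern behaves as expected
$utf8  = mb_convert_encoding($legacy, 'UTF-8', 'Windows-1251');
$clean = preg_replace('/[^\p{L}\p{N}]/u', '-', $utf8 . ' #1');

var_dump($failed, $clean);
```

Checking preg_last_error() after a null result is a cheap way to tell an encoding problem apart from a pattern problem.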

Regex for removing special characters on a multilingual string

The most common regex suggested for removing special characters seems to be this -
preg_replace( '/[^a-zA-Z0-9]/', '', $string );
The problem is that it also removes non-English characters.
Is there a regex that removes special characters in all languages? Or is the only solution to explicitly match each special character and remove it?
You can use instead:
preg_replace('/\P{Xan}+/u', '', $string );
\p{Xan} matches any character that is a letter or a number in any script of the Unicode table (it is a PCRE-specific property).
\P{Xan} matches any character that is not a letter or a number. It is a shortcut for [^\p{Xan}].
You can use:
$string = preg_replace( '/[^\p{L}\p{N}]+/u', '', $string );
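The two suggestions above should give the same result on typical input. A small sketch to compare them, using a hypothetical mixed-script sample string and assuming valid UTF-8 input:

```php
<?php
// Hypothetical sample mixing Latin, Cyrillic, digits and Japanese with punctuation
$input = 'Hello, мир! 123 日本語?';

// PCRE-specific shortcut: remove everything that is not a letter or a number
$viaXan = preg_replace('/\P{Xan}+/u', '', $input);

// Portable spelling with the general categories Letter and Number
$viaClass = preg_replace('/[^\p{L}\p{N}]+/u', '', $input);

echo $viaXan, "\n";   // Helloмир123日本語
echo $viaClass, "\n"; // Helloмир123日本語
```

Note that both patterns also remove combining marks (\p{M}), so text stored with separate accent characters would lose its accents; adding \p{M} to the character class is one possible way around that.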

PHP preg_match with ï

How can I use preg_match to accept all normal characters and only the i accented character (ï)?
preg_match('#^[ a-zA-Z0-9\[\]()-.!?~*]+$#', $string);
Thanks!
If you are matching Unicode you need to set the /u (Unicode) flag, then include the Unicode character in your character class.
preg_match('#^[ \x{00EF} a-z A-Z 0-9 \[\]()-.!?~*]+$#u', $string);
There is a full list of unicode characters here
Simply add it to the end of the existing character class...
<?php
// without the accented i
// returns 0, no match :(
$string = "abcï";
echo preg_match('#^[ a-zA-Z0-9\[\]()-.!?~*]+$#', $string);
// adding the accented i to the end of the character class, with the u modifier
// so that the two-byte UTF-8 sequence for ï is treated as one character
// returns 1, a match!
$string = "abcï";
echo preg_match('#^[ a-zA-Z0-9\[\]()-.!?~*ï]+$#u', $string);
?>
Use Unicode properties (the u modifier is required for them to work on UTF-8 strings):
preg_match('#^[\pL\pN \[\]().!?~*-]+$#u', $string);
\pL stands for any letter in any language
\pN stands for any number in any language
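If only ï should be accepted in addition to the ASCII set, a narrower check might look like the sketch below. The function name isAllowedText and the sample strings are hypothetical. The dash is moved to the end of the class so that it is a literal; in the original pattern, ")-." forms a range that also lets characters such as + and , through.

```php
<?php
// Hypothetical validator: ASCII letters/digits, the listed punctuation, and ï (U+00EF)
function isAllowedText(string $s): bool {
    // The dash sits last in the class so it is a literal, not a range operator
    return preg_match('#^[ a-zA-Z0-9\[\]().!?~*\x{00EF}-]+$#u', $s) === 1;
}

var_dump(isAllowedText('naïve (ok)!')); // ï is explicitly allowed
var_dump(isAllowedText('naève'));       // è is not in the class
var_dump(isAllowedText('a+b'));         // + is rejected once the dash is a literal
```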

Match some strange characters in a `\w` match

preg_match("/\w+/", $s, $matches);
I have the PHP code above. I use it to match words in a string. It works great, except in one case.
Example:
'This is a word' should match {'This','is','a','word'}
'Bös Tüb' should match {'Bös','Tüb'}
The first example works, but the second does not. Instead it returns {'B','s','T','b'}; it does not see ö and ü as word characters.
Question
How to match the ö and ü and any other characters that are normally used in names (they can be strange, this is about German and Turkish names)? Should I add them all manually (/[a-zA-Z and all others as unicode]/)?
EDIT
As I of course forgot to mention, there are a lot of \n, \r and ' ' characters in between the words. This is why I am using Regex.
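You can use the u modifier so that \w also matches Unicode letters (the subject must be valid UTF-8). The matches are returned as UTF-8; decode them with utf8_decode() only if your output has to be ISO-8859-1, and note that utf8_decode() is deprecated as of PHP 8.2.
$s = 'Bös Tüb';
preg_match("/\w+/u", $s, $matches); // use the 'u' modifier; preg_match finds the first word only
var_dump($matches[0]); // the first word, Bös, still UTF-8 encoded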
If you only need to split on spaces, you can use PHP's explode() function:
$some_string = 'test some words';
$words_arr = explode(' ', $some_string);
var_dump($words_arr);
This works no matter which characters the string contains, although it splits only on the single space character, not on \n or \r.
EDIT:
You can try:
preg_match("/\w+/u", $s, $matches);
for unicode.
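Since the question asks for every word and mentions \n and \r between them, preg_match_all() (or preg_split() on whitespace) may be a closer fit than a single preg_match(). A minimal sketch, assuming UTF-8 input; the sample string is hypothetical:

```php
<?php
// Hypothetical input with mixed line breaks and extra spaces between names
$s = "Bös\r\n  Tüb\nÖzgür";

// Collect every run of word characters; the u modifier makes \w Unicode-aware
preg_match_all('/\w+/u', $s, $matches);
print_r($matches[0]); // Bös, Tüb, Özgür

// Alternative: split on any run of whitespace and drop empty pieces
$words = preg_split('/\s+/u', $s, -1, PREG_SPLIT_NO_EMPTY);
print_r($words);
```

The two approaches differ on punctuation: \w+ drops it, while splitting on whitespace keeps it attached to the words.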

regex to match any UTF character excluding punctuation

I'm preparing a function in PHP to automatically convert a string to be used as a filename in a URL (*.html). Although ASCII should be used to be on the safe side, for SEO reasons I need to allow the filename to be in any language, but I don't want it to include punctuation other than a dash (-) and an underscore (_); characters like *%$##"' shouldn't be allowed.
Spaces should be converted to dashes.
I think that using Regex will be the easiest way, but I'm not sure how to handle UTF-8 strings.
My ASCII function looks like this:
function convertToPath($string)
{
$string = strtolower(trim($string));
$string = preg_replace('/[^a-z0-9-]/', '-', $string);
$string = preg_replace('/-+/', "-", $string);
return $string;
}
Thanks,
Roy.
I think that for SEO needs you should stick to ASCII characters in the URL.
In theory, many more characters are allowed in URLs. In practice, most systems only parse ASCII reliably.
Also, many automagically-parse-the-link scripts choke on non-ASCII characters. So allowing non-ASCII characters in your URLs drastically reduces the chance of your link showing up (correctly) in user-generated content.
(If you want an example of such a script, take a look at the stackoverflow script; it chokes on parentheses, for example.)
You could also take a look at:
How to handle diacritics (accents) when rewriting ‘pretty URLs’
The accepted solution there is to transliterate the non-ASCII characters:
<?php
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
?>
Hope this helps
If UTF-8 mode is selected (the u modifier), you can match all non-letters (according to the Unicode general category; please refer to "Regular Expression Details" in the PHP documentation) by using
/\P{L}+/u
so I'd try the following (untested):
function convertToPath($string)
{
$string = mb_strtolower(trim($string), 'UTF-8');
$string = preg_replace('/\P{L}+/u', '-', $string);
$string = preg_replace('/-+/', "-", $string);
return $string;
}
Be aware that you'll get problems with strtolower() on UTF-8 strings, as it can mangle your multi-byte characters; use mb_strtolower() instead.
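The pattern \P{L}+ also replaces digits and underscores, which the question wanted to keep. A variant that keeps letters, digits, dashes and underscores could look like the sketch below. The function name convertToPathUnicode and the sample input are hypothetical, and it assumes the mbstring extension and valid UTF-8 input.

```php
<?php
// Hypothetical Unicode-aware variant; assumes mbstring and valid UTF-8 input
function convertToPathUnicode(string $string): string {
    $string = mb_strtolower(trim($string), 'UTF-8');
    // Replace every run of characters that are not letters, digits, "_" or "-"
    $string = preg_replace('/[^\p{L}\p{N}_-]+/u', '-', $string);
    // Collapse repeated dashes, then drop dashes at either end
    $string = preg_replace('/-+/', '-', $string);
    return trim($string, '-');
}

$path  = convertToPathUnicode('  Привет, мир! *%$# 2024  '); // привет-мир-2024
$ascii = convertToPathUnicode('my_file-name');               // my_file-name
echo $path, "\n", $ascii, "\n";
```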
