How to match with regex unicode text ignoring diacritics on characters (Á É Í)

I want to use preg_replace to highlight the searched string in search suggestions while ignoring diacritics, spaces and apostrophes. For example, when I search for ha, the suggestions should look like this (with the matching part highlighted):
O'Hara
Ó an Cháintighe
H'aSOMETHING
I have done a lot of research but have not come up with any code yet. My only idea is to somehow convert the characters with diacritics (e.g. Á, É...) into a base character plus a modifier (A+´, E+´), but I am not sure how to do it.

I finally found a working solution thanks to Tibor's answer here: Regex to ignore accents? PHP
My function highlights text while ignoring diacritics, spaces, apostrophes and dashes:
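function highlight($pattern, $string)
{
    if ($pattern === '') {
        return $string; // nothing to highlight
    }
    // split into UTF-8 characters; str_split() would cut multibyte characters into bytes
    $array = preg_split('//u', $pattern, -1, PREG_SPLIT_NO_EMPTY);
    // escape regex metacharacters typed by the user
    $array = array_map(function ($char) { return preg_quote($char, '/'); }, $array);
    //add or remove characters to be ignored
    $pattern = implode('[\s\'\-]*', $array);
    //list of letters with diacritics
    $replacements = array("a" => "[áa]", "e" => "[ée]", "i" => "[íi]", "o" => "[óo]", "u" => "[úu]", "A" => "[ÁA]", "E" => "[ÉE]", "I" => "[ÍI]", "O" => "[ÓO]", "U" => "[ÚU]");
    $pattern = str_replace(array_keys($replacements), $replacements, $pattern);
    //instead of <u> you can use <b>, <i> or even <div> or <span> with css class
    return preg_replace("/(" . $pattern . ")/ui", "<u>\\1</u>", $string);
}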
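The idea from the question (turning Á into A plus a separate modifier) can also be tried directly with Unicode normalization. The following is only a sketch, assuming the intl extension is installed so that the Normalizer class is available; the function name highlightNormalized is made up for this example. It decomposes both strings to NFD, lets every letter of the search term be followed by any number of combining marks (\p{Mn}), and recomposes the result at the end.

```php
<?php
// Sketch only: requires the intl extension (Normalizer class).
// highlightNormalized() is a hypothetical name, not part of the original answer.
function highlightNormalized(string $needle, string $haystack): string
{
    // NFD decomposition: "Á" becomes "A" followed by U+0301 (combining acute accent)
    $needle = Normalizer::normalize($needle, Normalizer::FORM_D);
    $haystack = Normalizer::normalize($haystack, Normalizer::FORM_D);

    // Drop combining marks from the search term so "há" and "ha" behave the same
    $needle = preg_replace('/\p{Mn}+/u', '', $needle);
    $chars = preg_split('//u', $needle, -1, PREG_SPLIT_NO_EMPTY);
    if (!$chars) {
        return Normalizer::normalize($haystack, Normalizer::FORM_C);
    }
    $chars = array_map(function ($c) { return preg_quote($c, '/'); }, $chars);

    // After each letter allow combining marks, then ignorable spaces, apostrophes, dashes
    $pattern = implode('\p{Mn}*[\s\'\-]*', $chars) . '\p{Mn}*';
    $marked = preg_replace('/' . $pattern . '/iu', '<u>$0</u>', $haystack);

    // Recompose so the output uses precomposed characters again
    return Normalizer::normalize($marked, Normalizer::FORM_C);
}

echo highlightNormalized('ha', "O'Hara"), "\n";          // O'<u>Ha</u>ra
echo highlightNormalized('ha', 'Ó an Cháintighe'), "\n"; // Ó an C<u>há</u>intighe
```

Compared with a hand-written list of accented letters, this covers any diacritic that Unicode can decompose, at the cost of depending on intl.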

Related

Detecting non english characters in a string?

I'm trying to remove cached profiles which have non English letters in their description. I'm fine with dashes, symbols, special characters, underscores all that I just don't want foreign characters in my string.
The issue is that my code below treats strings containing á as ASCII even though á isn't an English character. Is matching against ASCII the right way to do this?
if (!mb_detect_encoding($this->removeEmojis(str_replace(" ", "", $cacheItem->description), 'ASCII', true)))
{
$cacheItem->delete(); // laravel
}
Value of $cacheItem->description
Welcome to my profile<br> Londrina-Paraná
The letter á is a non English character.
The description can also contain dots, symbols, special characters, but I want to detect foreign characters like Latin.
Descriptions can also contain emojis so I try to remove them with this function
private function removeEmojis($text){
// theres lots more inside the preg_replace I truncated it for readability
return preg_replace('/[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0077}\x{E006C}\x{E0073}\x{E007F})|[\x{1F3F4}]/u', ' ', $text);
}

You can detect any character that is not printable ASCII by using this regexp:
[^\x20-\x7E]+
See an ASCII table for the range.
Replace the matches with an empty string to get an ASCII-only string, and then apply your emoji replacement.
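As a small sketch of the two uses of that character class (detection and removal), using the sample description from the question; the variable names are just for illustration:

```php
<?php
$description = 'Welcome to my profile<br> Londrina-Paraná';

// 1 if at least one byte falls outside printable ASCII (space to ~), otherwise 0
$hasNonAscii = preg_match('/[^\x20-\x7E]/', $description);

// Strip every byte outside that range (no "u" flag, so this works byte by byte)
$asciiOnly = preg_replace('/[^\x20-\x7E]+/', '', $description);

var_dump($hasNonAscii); // int(1)
echo $asciiOnly, "\n";  // Welcome to my profile<br> Londrina-Paran
```

Note that this also removes emojis, tabs and newlines, so for the "delete the profile" use case the detection call is probably the one you want.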

You can use preg_match to check whether all the characters in the string are in the range <space> to ~, which is the printable ASCII range:
$description = 'Welcome to my profile<br> Londrina-Paraná';
var_dump(preg_match('/^[ -~]*$/', $description));
$description = 'Welcome to my profile<br> Londriná-Parana';
var_dump(preg_match('/^[ -~]*$/', $description));
$description = 'Welcome to my profile<br> Londrina-Parana';
var_dump(preg_match('/^[ -~]*$/', $description));
Output:
int(0)
int(0)
int(1)
Demo on 3v4l.org

sanitize string using whitelist regex php

I want to sanitize a $string using the following whitelist:
It includes a-z, A-Z, 0-9 and some characters commonly used in posts: []=+-¿?¡!<>$%^&*'"()/##*,.:;_|.
It also includes Spanish accented letters such as á,é,í,ó,ú and ÁÉÍÓÚ
WHITE LIST
abcdefghijklmnñopqrstuvwxyzñáéíóúABCDEFGHIJKLMNÑOPQRSTUVWXYZÁÉÍÓÚ0123456789[]=+-¿?¡!<>$%^&*'"()/##*,.:;_|
I want to sanitize this string
$string="//abcdefghijklmnñopqrstuvwxyzñáéíóúABCDEFGHIJKLMNÑOPQRSTUVWXYZÁÉÍÓÚ0123456789[]=+-¿?¡!<>$%^&*'()/##*,.:;_| |||||||||| ] ¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶¸¹º»¼½ mmmmm onload onclick='' [ ? / < ~ # ` ! # $ % ^ & * ( ) + = } | : ; ' , > { space !#$%&'()*+,-./:;<=>?#[\]^_`{|}~ <html>sdsd</html> ** *`` `` ´´ {} {}[] ````... ;;,,´'¡'!!!!¿?ña ñaña ÑA á é´´ è ´ 8i ó ú à à` à è`ì`ò ù & > < ksks < wksdsd '' \" \' <script>alert('hi')</script>";
I tried this regex but it doesn't work:
//$regex = '/[^\w\[\]\=\+\-\¿\?\¡\!\<\>\$\%\^\&\*\'\"\(\)\/\#\#\*\,\.\/\:\;\_\|]/i';
//preg_replace($regex, '', $string);
Does anyone have a clue how to sanitize this string according to the whitelist values?

If you know your whitelist characters, use the whitelist in the regex instead of trying to enumerate a blacklist. The blacklist could be really big, especially if the encoding is something like UTF-8 or UTF-16.
There are a lot of ways to do this. One is to write a regex that matches runs of the allowed characters (including spaces and newlines) and to build a new string from the matches.
Also take care that some of the characters are reserved regex characters and need to be escaped, like "[ ? +".
You could test a regex like:
$string = "Your test string";
// the "u" flag makes PCRE treat pattern and subject as UTF-8, so ñ, á, ¿ count as single characters
$pattern = '/[a-zA-Z0-9\[\]=+\-¿?¡!<>$%\^&*\'"\sñÑáéíóúÁÉÍÓÚ]+/u';
preg_match_all($pattern, $string, $matches);
// $matches[0] holds the full matches; joining them keeps only the whitelisted characters
$newString = join('', $matches[0]);
This is only a simple example of how to apply the whitelist with a regex.
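The same whitelist can also be applied the other way round, by deleting everything that is not in it. A minimal sketch, assuming a hypothetical helper name sanitizeWhitelist and assuming that a plain space should be allowed (the question's list does not say so explicitly); only a single # is included because the ## in the question is ambiguous:

```php
<?php
// Hypothetical helper: removes every character that is not on the whitelist.
function sanitizeWhitelist(string $input): string
{
    // Allowed characters, written for use inside a character class.
    // [ ] - ^ / are escaped; the space after Ú is an assumption.
    $allowed = 'a-zA-Z0-9ñÑáéíóúÁÉÍÓÚ \[\]=+\-¿?¡!<>$%\^&*\'"()\/#,.:;_|';

    // The "u" flag is needed so multibyte letters are treated as single characters.
    // preg_replace() returns null on invalid UTF-8, hence the fallback to ''.
    return preg_replace('/[^' . $allowed . ']/u', '', $input) ?? '';
}

echo sanitizeWhitelist("ña <b>¢£</b> ~ok~"), "\n"; // ña <b></b> ok
```

Because < > and / are on the whitelist, tags such as <script> survive this filter, so the result still needs htmlspecialchars() (or similar) before being printed into HTML.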

regex to also match accented characters

I have the following PHP code:
$search = "foo bar que";
$search_string = str_replace(" ", "|", $search);
$text = "This is my foo text with qué and other accented characters.";
$text = preg_replace("/$search_string/i", "<b>$0</b>", $text);
echo $text;
Obviously, "que" does not match "qué". How can I change that? Is there a way to make preg_replace ignore all accents?
The characters that have to match (Spanish):
á,Á,é,É,í,Í,ó,Ó,ú,Ú,ñ,Ñ
I don't want to replace all accented characters before applying the regex, because the characters in the text should stay the same:
"This is my foo text with qué and other accented characters."
and not
"This is my foo text with que and other accented characters."

The solution I finally used:
$search_for_preg = str_ireplace(["e","a","o","i","u","n"],
["[eé]","[aá]","[oó]","[ií]","[uú]","[nñ]"],
$search_string);
$text = preg_replace("/$search_for_preg/iu", "<b>$0</b>", $text)."\n";

$search = str_replace(
    ['a','e','i','o','u','n'],
    ['[aá]','[eé]','[ií]','[oó]','[uú]','[nñ]'],
    $search);
This, plus the same for upper case, will do what you asked. A side note: the ñ replacement sounds wrong to me, as 'niño' is totally different from 'nino'.

If you want to use the captured text in the replacement string, you have to use character classes in your $search variable (anyway, you set it manually):
$search = "foo bar qu[eé]";
And so on.

You could try defining an array like this:
$vowel_replacements = array(
    "e" => "eé",
    // Other letters mapped to their other versions
);
Then, before your preg_replace call, do something like this:
foreach ($vowel_replacements as $vowel => $replacements) {
    // str_replace() takes (search, replace, subject) and returns the new string
    $search_string = str_replace($vowel, "[$replacements]", $search_string);
}
If I'm remembering my PHP right, that should replace your vowels with a character class of their accented forms, while the text itself stays untouched. It also lets you change the search string far more easily; you don't have to remember to replace the vowels with their character classes. All you have to remember is to use the non-accented form in your search string.
(If there's some special syntax I'm forgetting that does this without a foreach, please comment and let me know.)
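One way to do this without a foreach is strtr() with an array, which replaces all keys in a single pass (so text inserted by one replacement is never touched by another). A rough sketch, assuming the search term is typed in lower case without accents; the $classes map and variable names are only illustrative:

```php
<?php
// Illustrative map from base letters to accent-insensitive character classes
$classes = [
    'a' => '[aá]', 'e' => '[eé]', 'i' => '[ií]',
    'o' => '[oó]', 'u' => '[uú]', 'n' => '[nñ]',
];

$search = "foo bar que";
$text = "This is my foo text with qué and other accented characters.";

// Escape regex metacharacters in each word before building the alternation
$words = array_map(function ($w) { return preg_quote($w, '/'); }, explode(' ', $search));

// strtr() with an array performs all replacements in one pass
$pattern = '/' . strtr(implode('|', $words), $classes) . '/iu';

$highlighted = preg_replace($pattern, '<b>$0</b>', $text);
echo $highlighted, "\n";
// This is my <b>foo</b> text with <b>qué</b> and other accented characters.
```

The text keeps its original accents because only the pattern is changed, never the subject.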

Trying to generate url slugs with PHP regex, Japanese characters not going through

So I'm trying to generate slugs to store in my DB. My locales include English, some European languages and Japanese.
I allow \d, \w, European characters are transliterated, Japanese characters are untouched. Period, plus and dash (-) are kept. Leading/trailing whitespace is removed, while the whitespace in between is replaced by a dash.
Here is some code: (please feel free to improve it, given my conditions above as my regex-fu is currently white belt tier)
function ToSlug($string, $separator='-') {
$url = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
$url = preg_replace('/[^\d\w一-龠ぁ-ゔァ-ヴー々〆〤.+ -]/', '', $url);
$url = strtolower($url);
$url = preg_replace('/[ ' . $separator . ']+/', $separator, $url);
return $url;
}
I'm testing this function, but my JP characters are not getting through; they are simply replaced by ''. I suspect it's the //IGNORE that's taking them out, but I need it there or else my German and French transliterations will not work. Any ideas on how I can fix this?
EDIT: I'm not sure if Japanese Kanji covers all of Simplified Chinese, but I'm going to need that and Korean as well. If anyone knows the regex off the bat, please let me know; it will save me some time searching. Thanks.

Note: I am not familiar with the Japanese writing system.
Looking at the function the iconv call appears to remove all the Japanese characters. Instead of using iconv to transliterate, it may be easier to just create a function that does it:
function _toSlugTransliterate($string) {
// Lowercase equivalents found at:
// https://github.com/kohana/core/blob/3.3/master/utf8/transliterate_to_ascii.php
$lower = [
'à'=>'a','ô'=>'o','ď'=>'d','ḟ'=>'f','ë'=>'e','š'=>'s','ơ'=>'o',
'ß'=>'ss','ă'=>'a','ř'=>'r','ț'=>'t','ň'=>'n','ā'=>'a','ķ'=>'k',
'ŝ'=>'s','ỳ'=>'y','ņ'=>'n','ĺ'=>'l','ħ'=>'h','ṗ'=>'p','ó'=>'o',
'ú'=>'u','ě'=>'e','é'=>'e','ç'=>'c','ẁ'=>'w','ċ'=>'c','õ'=>'o',
'ṡ'=>'s','ø'=>'o','ģ'=>'g','ŧ'=>'t','ș'=>'s','ė'=>'e','ĉ'=>'c',
'ś'=>'s','î'=>'i','ű'=>'u','ć'=>'c','ę'=>'e','ŵ'=>'w','ṫ'=>'t',
'ū'=>'u','č'=>'c','ö'=>'o','è'=>'e','ŷ'=>'y','ą'=>'a','ł'=>'l',
'ų'=>'u','ů'=>'u','ş'=>'s','ğ'=>'g','ļ'=>'l','ƒ'=>'f','ž'=>'z',
'ẃ'=>'w','ḃ'=>'b','å'=>'a','ì'=>'i','ï'=>'i','ḋ'=>'d','ť'=>'t',
'ŗ'=>'r','ä'=>'a','í'=>'i','ŕ'=>'r','ê'=>'e','ü'=>'u','ò'=>'o',
'ē'=>'e','ñ'=>'n','ń'=>'n','ĥ'=>'h','ĝ'=>'g','đ'=>'d','ĵ'=>'j',
'ÿ'=>'y','ũ'=>'u','ŭ'=>'u','ư'=>'u','ţ'=>'t','ý'=>'y','ő'=>'o',
'â'=>'a','ľ'=>'l','ẅ'=>'w','ż'=>'z','ī'=>'i','ã'=>'a','ġ'=>'g',
'ṁ'=>'m','ō'=>'o','ĩ'=>'i','ù'=>'u','į'=>'i','ź'=>'z','á'=>'a',
'û'=>'u','þ'=>'th','ð'=>'dh','æ'=>'ae','µ'=>'u','ĕ'=>'e','ı'=>'i',
];
return str_replace(array_keys($lower), array_values($lower), $string);
}
So, with some modifications, it could look something like this:
function toSlug($string, $separator = '-') {
// Work around this...
#$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
$string = _toSlugTransliterate($string);
// Remove unwanted chars + trim excess whitespace
// I got the character ranges from the following URL:
// https://stackoverflow.com/questions/6787716/regular-expression-for-japanese-characters#10508813
$regex = '/[^一-龠ぁ-ゔァ-ヴーａ-ｚＡ-Ｚ０-９a-zA-Z0-9々〆〤.+ -]|^\s+|\s+$/u';
$string = preg_replace($regex, '', $string);
// mb_strtolower() is multibyte-aware, whereas strtolower() works byte by byte
$string = mb_strtolower($string, 'UTF-8');
// Same as before
$string = preg_replace("/[ {$separator}]+/", $separator, $string);
return $string;
}
$x = ' æøå!this.ís-a test-ゔヴ ーァ ';
echo toSlug($x);
In regex you can use unicode "scripts" to match letters of various languages. There is no "Japanese" one, but there are Hiragana, Katakana and Han. As I have no idea how Japanese is written, and how one could use these, I am not even going to try.
Using these scripts, however, would be done something like this:
'/[\p{Hiragana}\p{Katakana}\p{Han}]+/u'
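To make that a bit more concrete, here is a rough sketch of a slug function built on script properties. The function name slugKeepCjk is made up, \p{Hangul} is added on the assumption that it covers the Korean mentioned in the question's edit, and the prolonged sound mark ー is listed explicitly because it may not be matched by \p{Katakana} on every PCRE version. It does not transliterate European letters; that step would still have to happen separately.

```php
<?php
// Hypothetical sketch: keeps Latin, Han, Hiragana, Katakana and Hangul letters,
// digits, and the characters . + space - ; everything else is dropped.
// Assumes $separator is a single character.
function slugKeepCjk(string $text, string $separator = '-'): string
{
    // The "u" flag is required for \p{...} to work on UTF-8 input
    $text = preg_replace(
        '/[^\p{Latin}\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}ー0-9.+ \-]/u',
        '',
        $text
    ) ?? '';

    // Remove leading/trailing whitespace, then lowercase in a multibyte-safe way
    $text = mb_strtolower(trim($text), 'UTF-8');

    // Collapse runs of spaces and separators into one separator
    return preg_replace('/[ ' . preg_quote($separator, '/') . ']+/u', $separator, $text) ?? '';
}

echo slugKeepCjk('  Hello 東京 タワー!  '), "\n"; // hello-東京-タワー
```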

Remove garbage characters in arabic

I needed to remove all non Arabic characters from a string and eventually with the help of people from stack-overflow was able to come up with the following regex to get rid of all characters which are not Arabic.
preg_replace('/[^\x{0600}-\x{06FF}]/u','',$string);
The problem is that the above removes white space too. I have also discovered that I need to keep the characters A-Z, a-z, 0-9 and !##$%^&*(). How do I need to modify the regex?
Thanking you

Add the ones you want to keep to your character class:
preg_replace('/[^\x{0600}-\x{06FF}A-Za-z0-9 !##$%^&*()]/u','', $string);

Assume you have this string:
$str = "Arabic Text نص عربي test 123 و,.m,............ ~~~ ٍ،]ٍْ}~ِ]ٍ}";
This will keep only Arabic characters and spaces:
echo preg_replace('/[^أ-ي ]/ui', '', $str);
This will keep only Arabic and English characters, numbers and spaces:
echo preg_replace('/[^أ-يA-Za-z0-9 ]/ui', '', $str);
This will answer your question literally:
echo preg_replace('/[^أ-يA-Za-z0-9 !##$%^&*()]/ui', '', $str);

Here is a more detailed version of the example above. Suppose this is your string:
$string = '<div>This..</div> <a>is<a/> <strong>hello</strong> <i>world</i> ! هذا هو مرحبا العالم! !##$%^&&**(*)<>?:";p[]"/.,\|`~1##$%^&^&*(()908978867564564534423412313`1`` "Arabic Text نص عربي test 123 و,.m,............ ~~~ ٍ،]ٍْ}~ِ]ٍ}"; ';
Code:
echo preg_replace('/[^\x{0600}-\x{06FF}A-Za-z0-9 !##$%^&*().]/u','', strip_tags($string));
Allows: English letters, Arabic letters, 0 to 9 and characters !##$%^&*().
Removes: All html tags, and special characters other than above
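Instead of a hard-coded code point range, the Unicode script property can also be used. This is only a sketch under a few assumptions: the helper name keepArabicAndAscii is invented, and the list of extra symbols uses a single # because the ## in the question is ambiguous.

```php
<?php
// Hypothetical helper: keeps Arabic-script characters, ASCII letters, digits,
// spaces and the symbols ! # $ % ^ & * ( ) ; removes everything else.
function keepArabicAndAscii(string $text): string
{
    // \p{Arabic} matches characters whose Unicode script is Arabic ("u" flag required).
    // preg_replace() returns null on invalid UTF-8, hence the fallback to ''.
    return preg_replace('/[^\p{Arabic}A-Za-z0-9 !#$%^&*()]/u', '', $text) ?? '';
}

echo keepArabicAndAscii('Text نص عربي~ 123?'), "\n"; // Text نص عربي 123
```

Unlike the \x{0600}-\x{06FF} range, the script property follows whatever the PCRE build considers Arabic, so the exact set of kept characters can differ slightly between the two approaches.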
