Detecting non english characters in a string? - php

I'm trying to remove cached profiles which have non-English letters in their description. I'm fine with dashes, symbols, special characters, underscores and the like; I just don't want foreign characters in my string.
The issue is that my code below detects strings with á as ASCII even though á isn't an English character. Is matching against ASCII the right way?
if (!mb_detect_encoding($this->removeEmojis(str_replace(" ", "", $cacheItem->description), 'ASCII', true)))
{
$cacheItem->delete(); // laravel
}
Value of $cacheItem->description
Welcome to my profile<br> Londrina-Paraná
The letter á is a non English character.
The description can also contain dots, symbols, special characters, but I want to detect foreign characters like Latin.
Descriptions can also contain emojis so I try to remove them with this function
private function removeEmojis($text){
// there's a lot more inside the preg_replace; I truncated it for readability
return preg_replace('/[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0077}\x{E006C}\x{E0073}\x{E007F})|[\x{1F3F4}]/u', ' ', $text);
}

You can detect any character that is not printable ASCII by using this regexp:
[^\x20-\x7E]
See ASCII table
Replace the matches with an empty string to get a purified string, and then apply your emoji replacement.
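As a minimal sketch of the approach above, the pattern can be used either to detect or to strip such characters. The function names hasNonPrintableAscii and stripNonPrintableAscii are hypothetical, not from the question. Note that tabs and newlines also fall outside \x20-\x7E, so they are flagged and stripped as well.

```php
<?php
// Hypothetical helpers; the names are illustrative only.

// True when $text contains any byte outside printable ASCII (space to ~).
function hasNonPrintableAscii(string $text): bool
{
    return preg_match('/[^\x20-\x7E]/', $text) === 1;
}

// Removes every byte outside printable ASCII.
function stripNonPrintableAscii(string $text): string
{
    return preg_replace('/[^\x20-\x7E]/', '', $text);
}

var_dump(hasNonPrintableAscii('Welcome to my profile<br> Londrina-Paraná'));
var_dump(hasNonPrintableAscii('Welcome to my profile<br> Londrina-Parana'));
```

Because the pattern runs without the /u modifier it works byte by byte, which is enough here: every byte of a multi-byte UTF-8 character lies outside the printable ASCII range.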

You can use preg_match to check if all the characters in the string are in the range <space> to ~, which is the printable ASCII character range:
$description = 'Welcome to my profile<br> Londrina-Paraná';
var_dump(preg_match('/^[ -~]*$/', $description));
$description = 'Welcome to my profile<br> Londriná-Parana';
var_dump(preg_match('/^[ -~]*$/', $description));
$description = 'Welcome to my profile<br> Londrina-Parana';
var_dump(preg_match('/^[ -~]*$/', $description));
Output:
int(0)
int(0)
int(1)
Demo on 3v4l.org
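If emojis and other non-letter symbols should be tolerated and only non-English letters flagged, one possible variation is to look for a Unicode letter that is not A-Z or a-z. This is an assumption about the intent, not part of the answers above; the function name hasNonEnglishLetter is hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: flags letters outside A-Z / a-z, ignores emojis and symbols.
function hasNonEnglishLetter(string $text): bool
{
    // \p{L} is any Unicode letter; the negative lookahead excludes ASCII letters.
    // The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
    return preg_match('/(?![A-Za-z])\p{L}/u', $text) === 1;
}

var_dump(hasNonEnglishLetter('Welcome to my profile<br> Londrina-Paraná'));
var_dump(hasNonEnglishLetter('Welcome <br> Londrina-Parana 🎉 ~!'));
```

With this check the emoji removal step would not be needed before testing. One limitation: an accent stored as a separate combining mark after a plain ASCII letter is not a letter itself, so it would not be caught by \p{L}.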

Related

Preg replace utf8 charset issue with à

I'm trying to add a special string '|||' after newlines, blankspaces and other characters. I'm doing this because I want to split my text into an array. So I was thinking to do it like this:
$result = preg_replace("/<br>/", "<br>|||", preg_replace("/\s/", " |||", preg_replace("/\r/", "\r|||", preg_replace("/\n/", "\n|||", preg_replace("/’/", "’|||", preg_replace("/'/", "'|||", $text))))));
$result = preg_split("/[|||]+/", $result);
It works for every word except those that contain the character à, which gets replaced by �.
I'm sure the problem is here, because my string $text displays the à correctly.
Since your pattern deals with a Unicode string, pass the /u modifier.
Also, you do not need so many chained regex replacements: group the first patterns and use a backreference in the replacement.
Use
preg_replace("/(<br>|[\s’'])/u", "$1|||", $text)
Note that \s matches spaces, carriage returns and newlines.
Details:
(<br>|[\s’']) - Group 1 capturing either a
<br> - character sequence
| - or
[\s’'] - a whitespace, ’ or '.
See the PHP demo:
$text = "Voilà. C'est vrai.";
echo preg_replace("/(<br>|[\s’'])/u", "$1|||", $text);
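Since the goal in the question was to split the text into an array, a minimal sketch of the full round trip might look like this. It assumes the source file and $text are UTF-8 and that the literal marker ||| does not already occur in the text.

```php
<?php
$text = "Voilà. C'est vrai.";

// Insert the marker after <br>, any whitespace, ’ or ' (Unicode-aware thanks to /u).
$marked = preg_replace("/(<br>|[\s’'])/u", "$1|||", $text);

// Split on the literal marker; explode avoids regex metacharacter issues with |.
$parts = explode('|||', $marked);

print_r($parts);
```

Using explode instead of preg_split("/[|||]+/", ...) is a design choice of this sketch: the marker is a fixed string, so no regex is needed for the split.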

regex to also match accented characters

I have the following PHP code:
$search = "foo bar que";
$search_string = str_replace(" ", "|", $search);
$text = "This is my foo text with qué and other accented characters.";
$text = preg_replace("/$search_string/i", "<b>$0</b>", $text);
echo $text;
Obviously, "que" does not match "qué". How can I change that? Is there a way to make preg_replace ignore all accents?
The characters that have to match (Spanish):
á,Á,é,É,í,Í,ó,Ó,ú,Ú,ñ,Ñ
I don't want to replace all accented characters before applying the regex, because the characters in the text should stay the same:
"This is my foo text with qué and other accented characters."
and not
"This is my foo text with que and other accented characters."
The solution I finally used:
$search_for_preg = str_ireplace(["e","a","o","i","u","n"],
["[eé]","[aá]","[oó]","[ií]","[uú]","[nñ]"],
$search_string);
$text = preg_replace("/$search_for_preg/iu", "<b>$0</b>", $text)."\n";
$search = str_replace(
['a','e','i','o','u','ñ'],
['[aá]','[eé]','[ií]','[oó]','[uú]','[nñ]'],
$search);
This, together with the same for upper case, will accomplish what you ask. A side note: the ñ replacement sounds invalid to me, as 'niño' is totally different from 'nino'.
If you want to use the captured text in the replacement string, you have to use character classes in your $search variable (anyway, you set it manually):
$search = "foo bar qu[eé]";
And so on.
You could try defining an array like this:
$vowel_replacements = array(
"e" => "eé",
// Other letters mapped to their other versions
);
Then, before your preg_replace call, do something like this:
foreach ($vowel_replacements as $vowel => $replacements) {
// str_replace takes (search, replace, subject) and returns the new string
$search_string = str_replace($vowel, "[$replacements]", $search_string);
}
If I'm remembering my PHP right, that should replace your vowels with a character class of their accented forms, so the accented characters in the text stay in place. It also lets you change the search string far more easily; you don't have to remember to replace the vowels with their character classes. All you have to remember is to use the non-accented form in your search string.
(If there's some special syntax I'm forgetting that does this without a foreach, please comment and let me know.)
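The answers above can be combined into one small helper. This is a minimal sketch, not any answerer's exact method: the function name accentInsensitivePattern is hypothetical, it uses strtr with an array so that all letters are expanded in a single pass without a foreach over the map, and it adds preg_quote so that regex metacharacters in the search words are escaped. It assumes a non-empty, lower-case search string; the map only has lower-case keys.

```php
<?php
// Hypothetical helper: builds a case- and accent-insensitive alternation
// from a space-separated search string.
function accentInsensitivePattern(string $search): string
{
    $map = [
        'a' => '[aá]', 'e' => '[eé]', 'i' => '[ií]',
        'o' => '[oó]', 'u' => '[uú]', 'n' => '[nñ]',
    ];
    $alternatives = [];
    foreach (preg_split('/\s+/', trim($search)) as $word) {
        // Escape regex metacharacters first, then expand the letters in one pass.
        $alternatives[] = strtr(preg_quote($word, '/'), $map);
    }
    // i: case-insensitive, u: treat pattern and subject as UTF-8.
    return '/' . implode('|', $alternatives) . '/iu';
}

$text = "This is my foo text with qué and other accented characters.";
$highlighted = preg_replace(accentInsensitivePattern("foo bar que"), '<b>$0</b>', $text);
echo $highlighted, "\n";
```

The text itself is never modified before matching, so qué keeps its accent inside the <b> tags.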

How to match with regex unicode text ignoring diacritics on characters (Á É Í)

I want to use preg_replace to highlight the searched string in suggestions while ignoring diacritics, spaces and apostrophes. For example, when I search for ha, my search suggestions should look like this:
O'Hara
Ó an Cháintighe
H'aSOMETHING
I have done loads of research but have not come up with any code yet. I just have an idea that I could somehow convert the characters with diacritics (e.g.: Á, É...) to a character and a modifier (A+´, E+´), but I am not sure how to do it.
I finally found a working solution thanks to Tibor's answer here: Regex to ignore accents? PHP
My function highlights text ignoring diacritics, spaces, apostrophes and dashes:
function highlight($pattern, $string)
{
$array = str_split($pattern);
//add or remove characters to be ignored
$pattern=implode('[\s\'\-]*', $array);
//list of letters with diacritics
$replacements = Array("a" => "[áa]", "e"=>"[ée]", "i"=>"[íi]", "o"=>"[óo]", "u"=>"[úu]", "A" => "[ÁA]", "E"=>"[ÉE]", "I"=>"[ÍI]", "O"=>"[ÓO]", "U"=>"[ÚU]");
$pattern=str_replace(array_keys($replacements), $replacements, $pattern);
//instead of <u> you can use <b>, <i> or even <div> or <span> with css class
return preg_replace("/(" . $pattern . ")/ui", "<u>\\1</u>", $string);
}
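A possible hardened variant of the function above, as a sketch only: it splits the search term by character with mb_str_split (PHP 7.4+, mbstring extension) instead of by byte, and escapes every other character with preg_quote. The name highlightSafe is hypothetical, and a non-empty search term is assumed.

```php
<?php
// Hypothetical variant of highlight(): multibyte-safe split plus regex escaping.
function highlightSafe(string $needle, string $haystack): string
{
    $map = ['a' => '[aá]', 'e' => '[eé]', 'i' => '[ií]', 'o' => '[oó]', 'u' => '[uú]'];
    $parts = [];
    foreach (mb_str_split(mb_strtolower($needle, 'UTF-8'), 1, 'UTF-8') as $char) {
        // Known vowels become a character class; everything else is escaped literally.
        $parts[] = $map[$char] ?? preg_quote($char, '/');
    }
    // Allow ignorable characters (whitespace, apostrophe, dash) between letters.
    $pattern = '/(' . implode("[\\s'\\-]*", $parts) . ')/iu';
    return preg_replace($pattern, '<u>$1</u>', $haystack);
}

foreach (["O'Hara", 'Ó an Cháintighe', "H'aSOMETHING"] as $suggestion) {
    echo highlightSafe('ha', $suggestion), "\n";
}
```

Because the i and u modifiers are both set, the lower-case classes also match their upper-case accented forms, so a separate upper-case map is not needed in this sketch.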

Regex to match a string that may contain Chinese characters

I'm trying to write a regular expression which could match a string that possibly includes Chinese characters. Examples:
hahdj5454_fd.fgg"
example.com/list.php?keyword=关键字
example.com/list.php?keyword=php
I am using this expression:
$matchStr = '/^[a-z 0-9~%.:_\-\/[^x7f-xff]+$/i';
$str = "http://example.com/list.php?keyword=关键字";
if ( ! preg_match($matchStr, $str)){
exit('WRONG');
}else{
echo "RIGHT";
}
It matches plain English strings like that dasdsdsfds or http://example.com/list.php, but it doesn't match strings containing Chinese characters. How can I resolve this?
Assuming you want to extend the set of letters that this regex matches from ASCII to all Unicode letters, then you can use
$matchStr = '#^[\pL 0-9~%.:_/-]+$#u';
I've removed the [^x7f-xff part which didn't make any sense (in your regex, it would have matched an opening bracket, a caret, and some ASCII characters that were already covered by the a-z and 0-9 parts of that character class).
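One caveat worth checking: the example URLs in the question contain ? and =, which are not in the character class above, so the pattern as given would still reject them. A minimal sketch, assuming those characters (and &) should be allowed too; the extended pattern is an assumption, not part of the answer:

```php
<?php
// Pattern from the answer above: Unicode letters plus a fixed set of symbols.
$answerPattern = '#^[\pL 0-9~%.:_/-]+$#u';

// Assumed extension: also allow the query-string characters ?, = and &.
$extendedPattern = '#^[\pL 0-9~%.:_/?=&-]+$#u';

$str = "http://example.com/list.php?keyword=关键字";

var_dump(preg_match($answerPattern, $str));   // int(0): ? and = are rejected
var_dump(preg_match($extendedPattern, $str)); // int(1)
```

The hyphen stays at the end of the character class so that it is read as a literal rather than as a range operator.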
This works:
$str = "http://mysite/list.php?keyword=关键字";
if (preg_match('/[\p{Han}]/simu', $str)) {
echo "Contains Chinese Characters";
}else{
exit('WRONG'); // Doesn't contain Chinese characters
}

how to transform japanese english character to normal english character?

I have some Japanese-style English characters.
These characters are not a normal English string.
Characters: Ｇａｍｅ
How can I transform them into normal English characters in PHP?
Subtract 65248 from the ordinal value of each character. In other words:
$str = "Ｇａｍｅ some other text by ヴィックサ";
$str = preg_replace_callback(
"/[\x{ff01}-\x{ff5e}]/u",
function($c) {
// convert the 3-byte UTF-8 sequence to its code point
$code = ((ord($c[0][0])&0xf)<<12)|((ord($c[0][1])&0x3f)<<6)|(ord($c[0][2])&0x3f);
// 65248 is 0xfee0, the offset between fullwidth forms and ASCII
return chr($code-0xfee0);
},
$str);
This will replace all of the "Fullwidth" characters with their normal width equivalents.
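The same subtraction can be sketched with mb_ord (PHP 7.2+, mbstring extension), which avoids decoding the UTF-8 bytes by hand. The function name fullwidthToAscii is hypothetical.

```php
<?php
// Hypothetical helper: maps fullwidth forms U+FF01..U+FF5E to ASCII 0x21..0x7E.
function fullwidthToAscii(string $text): string
{
    return preg_replace_callback(
        '/[\x{FF01}-\x{FF5E}]/u',
        function (array $m): string {
            // Fullwidth forms sit exactly 0xFEE0 (65248) above their ASCII counterparts.
            return chr(mb_ord($m[0], 'UTF-8') - 0xFEE0);
        },
        $text
    );
}

echo fullwidthToAscii("Ｇａｍｅ some other text by ヴィックサ"), "\n";
```

Characters outside the fullwidth range, such as the katakana in the example, are left untouched because the pattern does not match them.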
It would be easier to use mb_convert_kana:
$string = 'Characters: Ｇａｍｅ';
$newString = mb_convert_kana($string,'a');
I'm sure there is a much easier answer, but couldn't you make a dictionary (an associative array) with the special character as the key and the character you want as the value,
then just do a simple find and replace?
