Remove garbage characters in arabic - php

I needed to remove all non-Arabic characters from a string and, with help from people on Stack Overflow, came up with the following regex, which strips every character that is not Arabic.
preg_replace('/[^\x{0600}-\x{06FF}]/u','',$string);
The problem is that the above removes whitespace too. I have also discovered that I need to keep the characters A-Z, a-z, 0-9 and !##$%^&*(). How do I need to modify the regex?
Thanking you

Add the characters you want to keep (including the space and the digits) to your character class:
preg_replace('/[^\x{0600}-\x{06FF}A-Za-z0-9 !##$%^&*()]/u','', $string);
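If the list of extra symbols changes often, one option is to build the class from a plain string and let preg_quote() handle the escaping. This is only a sketch: the $symbols list and the sample input are assumptions chosen for illustration, not values from the question.

```php
<?php
// Assumed list of extra symbols to keep; adjust it to your needs.
$symbols = '!#$%^&*()';

// preg_quote() escapes characters that are special in a regex, so the
// symbols can be placed inside the character class as literals.
$pattern = '/[^\x{0600}-\x{06FF}A-Za-z0-9 ' . preg_quote($symbols, '/') . ']/u';

// Hypothetical sample input: Arabic text, a comma, ASCII text and a tilde.
$input = 'نص عربي, test 123~ (ok)!';
$clean = preg_replace($pattern, '', $input);

echo $clean, "\n"; // the comma and the tilde are removed, everything else stays
```

Keeping the symbols in one string makes it harder to forget a backslash when a new character is added.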

Assume you have this string:
$str = "Arabic Text نص عربي test 123 و,.m,............ ~~~ ٍ،]ٍْ}~ِ]ٍ}";
This keeps only Arabic letters and spaces:
echo preg_replace('/[^أ-ي ]/ui', '', $str);
This keeps only Arabic and English letters, digits and spaces:
echo preg_replace('/[^أ-يA-Za-z0-9 ]/ui', '', $str);
This answers your question literally:
echo preg_replace('/[^أ-يA-Za-z0-9 !##$%^&*()]/ui', '', $str);
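Note that the literal range أ-ي covers U+0623 to U+064A, which is narrower than the \x{0600}-\x{06FF} block used in the question. The sketch below, with a made-up sample string, shows one letter (آ, U+0622) and one diacritic (fatha, U+064E) that the narrow range drops and the block range keeps.

```php
<?php
// U+0622 (آ) sits just below the أ-ي range; U+064E (fatha) sits just above it.
$sample = "\u{0622}\u{0645}\u{064E}\u{0646}";

// Same range as أ-ي, written with code points to avoid encoding ambiguity.
$narrow = preg_replace('/[^\x{0623}-\x{064A}]/u', '', $sample);

// The whole Arabic block, as in the question.
$block = preg_replace('/[^\x{0600}-\x{06FF}]/u', '', $sample);

var_dump($narrow === "\u{0645}\u{0646}"); // bool(true): only م and ن survive
var_dump($block === $sample);             // bool(true): nothing is removed
```

Which range is right depends on whether diacritics and the less common letters should survive the cleanup.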

In more detail, building on the example above, suppose this is your string:
$string = '<div>This..</div> <a>is<a/> <strong>hello</strong> <i>world</i> ! هذا هو مرحبا العالم! !##$%^&&**(*)<>?:";p[]"/.,\|`~1##$%^&^&*(()908978867564564534423412313`1`` "Arabic Text نص عربي test 123 و,.m,............ ~~~ ٍ،]ٍْ}~ِ]ٍ}"; ';
Code:
echo preg_replace('/[^\x{0600}-\x{06FF}A-Za-z0-9 !##$%^&*().]/u','', strip_tags($string));
Allows: English letters, Arabic letters, digits 0 to 9, spaces, dots and the characters !##$%^&*()
Removes: all HTML tags and any special characters other than the above

Related

Detecting non english characters in a string?

I'm trying to remove cached profiles that have non-English letters in their description. I'm fine with dashes, symbols, special characters and underscores; I just don't want foreign characters in my string.
The issue is that my code below treats strings containing á as ASCII even though á isn't an English character. Is matching against ASCII the right approach?
if (!mb_detect_encoding($this->removeEmojis(str_replace(" ", "", $cacheItem->description), 'ASCII', true)))
{
$cacheItem->delete(); // laravel
}
Value of $cacheItem->description
Welcome to my profile<br> Londrina-Paraná
The letter á is a non English character.
The description can also contain dots, symbols, special characters, but I want to detect foreign characters like Latin.
Descriptions can also contain emojis so I try to remove them with this function
private function removeEmojis($text)
{
    // There's a lot more inside the preg_replace; I truncated it for readability
    return preg_replace('/[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0077}\x{E006C}\x{E0073}\x{E007F})|[\x{1F3F4}]/u', ' ', $text);
}
You can detect any character that is not printable ASCII by using this regex:
[^\x20-\x7E]
See the ASCII table.
Replace the matches with an empty string to get a cleaned-up string, and then apply your emoji replacement.
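A minimal sketch of that detect-and-strip step, assuming the description is UTF-8. The variable names are illustrative. Because the pattern has no u modifier, it works on bytes, so both bytes of á are matched and removed.

```php
<?php
$description = "Welcome to my profile<br> Londrina-Paran\u{00E1}";

// True when at least one byte falls outside the printable ASCII range.
$hasNonAscii = preg_match('/[^\x20-\x7E]/', $description) === 1;

// Strip every byte outside the printable ASCII range.
$cleaned = preg_replace('/[^\x20-\x7E]/', '', $description);

var_dump($hasNonAscii); // bool(true)
echo $cleaned, "\n";    // Welcome to my profile<br> Londrina-Paran
```

Note that stripping this way also removes emojis, since they are outside the ASCII range too.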
You can use preg_match to check whether all the characters in the string are in the range from <space> to ~, which is the printable ASCII range:
$description = 'Welcome to my profile<br> Londrina-Paraná';
var_dump(preg_match('/^[ -~]*$/', $description));
$description = 'Welcome to my profile<br> Londriná-Parana';
var_dump(preg_match('/^[ -~]*$/', $description));
$description = 'Welcome to my profile<br> Londrina-Parana';
var_dump(preg_match('/^[ -~]*$/', $description));
Output:
int(0)
int(0)
int(1)
Demo on 3v4l.org
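If emojis and symbols should be tolerated and only non-English letters flagged, one possible variation (not taken from the answers above) is to look for any Unicode letter that is not A-Z or a-z. The helper name below is hypothetical, and the sketch assumes valid UTF-8 input.

```php
<?php
// Hypothetical helper: true when the text contains a letter outside A-Z/a-z.
function hasNonEnglishLetters(string $text): bool
{
    // (?![A-Za-z]) rejects ASCII letters; \p{L} then accepts any other letter.
    return preg_match('/(?![A-Za-z])\p{L}/u', $text) === 1;
}

var_dump(hasNonEnglishLetters("Londrina-Paran\u{00E1}"));      // bool(true)
var_dump(hasNonEnglishLetters("Welcome <br> \u{1F600} !?_-")); // bool(false)
```

With this check the emoji-stripping step is not needed before the test, because emoji code points are symbols rather than letters.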

How to match with regex unicode text ignoring diacritics on characters (Á É Í)

What I am trying to achieve: I want to use preg_replace to highlight the searched string in suggestions while ignoring diacritics, spaces and apostrophes. For example, when I search for ha, my search suggestions should look like this:
O'Hara
Ó an Cháintighe
H'aSOMETHING
I have done loads of research but have not come up with any code yet. My only idea is to somehow convert the characters with diacritics (e.g. Á, É...) into a character plus a modifier (A+´, E+´), but I am not sure how to do it.
I finally found a working solution thanks to Tibor's answer here: Regex to ignore accents? PHP
My function highlights text while ignoring diacritics, spaces, apostrophes and dashes:
function highlight($pattern, $string)
{
    // Split into characters (not bytes) and escape regex metacharacters
    $array = array_map(function ($char) {
        return preg_quote($char, '/');
    }, preg_split('//u', $pattern, -1, PREG_SPLIT_NO_EMPTY));
    // Add or remove characters to be ignored between letters
    $pattern = implode('[\s\'\-]*', $array);
    // List of letters with diacritics
    $replacements = array("a" => "[áa]", "e" => "[ée]", "i" => "[íi]", "o" => "[óo]", "u" => "[úu]", "A" => "[ÁA]", "E" => "[ÉE]", "I" => "[ÍI]", "O" => "[ÓO]", "U" => "[ÚU]");
    $pattern = str_replace(array_keys($replacements), $replacements, $pattern);
    // Instead of <u> you can use <b>, <i> or even <div> or <span> with a CSS class
    return preg_replace("/(" . $pattern . ")/ui", "<u>\\1</u>", $string);
}
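The decomposition idea mentioned in the question (turning Á into A plus a combining accent) can also be sketched with the Normalizer class. This assumes the intl extension is installed. The helper name is hypothetical, and the sketch only produces an accent-free copy for comparison, not the highlighting itself.

```php
<?php
// Hypothetical helper: strip combining marks after canonical decomposition.
// Requires the intl extension, which provides the Normalizer class.
function stripDiacritics(string $text): string
{
    // FORM_D splits "Á" into "A" followed by U+0301 COMBINING ACUTE ACCENT.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // normalization failed (e.g. invalid UTF-8); leave input untouched
    }
    // \p{Mn} matches nonspacing marks such as the combining accent.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

echo stripDiacritics("\u{00D3} an Ch\u{00E1}intighe"), "\n"; // O an Chaintighe
```

Matching against the stripped copy is simple, but mapping match positions back to the original string takes extra work, which is one reason the character-class approach above may be more convenient for highlighting.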

Strip non alphanumeric characters from Arabic UTF8 + English string

I want to remove all non-Arabic, non-English and non-numeric characters from a string, except for dashes (-).
I managed to do it for non-English alphanumeric characters like this:
$slug = ereg_replace('[^A-Za-z0-9-]', '', $string);
But for non-Arabic alphanumeric characters I tried to do it like this:
$slug = ereg_replace('\p{InArabic}', '', $string);
but it didn't strip the non-alphanumeric characters! I also tried this answer, but it didn't work either; it always returns '0'!
$slug = preg_replace('/[^\x{0600}-\x{06FF}A-Za-z0-9-]/u','', $string);
Hopefully someone can help me.
Try the following:
$slug = preg_replace('/[^\p{Arabic}\da-z-]/ui', '', $string);
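As a hedged sketch, the pattern can be wrapped in a small function that also checks for failure. With the u modifier, preg_replace() returns null when the input is not valid UTF-8, which is worth ruling out when the result looks wrong. The function name and sample input are assumptions. Also note that ereg_replace() was removed in PHP 7.0, so preg_replace() is the way to go on current versions.

```php
<?php
// Hypothetical helper name; the pattern mirrors the answer above.
function arabicSlug(string $text): string
{
    $slug = preg_replace('/[^\p{Arabic}A-Za-z0-9-]/u', '', $text);
    if ($slug === null) {
        // With the u modifier, preg_replace() returns null on malformed UTF-8.
        throw new InvalidArgumentException('preg error code: ' . preg_last_error());
    }
    return $slug;
}

// Made-up sample: the space and the exclamation mark are removed.
echo arabicSlug('مرحبا-world 2024!'), "\n";
```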

Matching alphanumeric characters separated by spaces

Okay, I'm stuck. PHP, Regex. I have a string:
Это кириллические 23 78these are56 45latin76 letters here98 85 буквы.
And I want to use preg_replace() to enclose a substring containing latin letters, numbers and spaces with <b> tags. A substring is not merely a word but a set of words as long as the next word contains Latin characters:
Это кириллические 23 <b>78these are56 45latin76 letters here98</b> 85 буквы.
My best shot was:
$text = 'Это кириллические 23 78these are56 45latin76 letters here98 85 буквы.';
$regex = "/\d*\p{Latin}+(\d|\s|\p{Latin})*/iu";
preg_replace($regex, '<b>$0</b>', $text);
But it grabs not only "here98" but also the following "85":
Это кириллические 23 <b>78these are56 45latin76 letters here98 85 </b>буквы.
I understand why it is so but fail to figure out the correct Regex.
You need to not only match Latin-and-digit words, but also look one word ahead and one word behind.
AFAIK, variable-length look-behinds are not possible, so you should use a capturing group for the preceding word (put back unchanged in the replacement, otherwise it would be dropped from the output) and a positive look-ahead (?=...) for the following word:
$regex = "/([\p{Latin}\d]+ )([\p{Latin}\d ]+)(?= [\p{Latin}\d]+)/iu";
preg_replace($regex, '$1<b>$2</b>', $text);
PS: Aaaah! Russian mafia! ;-)
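Another way to express the rule from the question, offered only as a sketch, is to define a "word" as a run of Latin letters and ASCII digits that contains at least one Latin letter, and then match one or more such words separated by single spaces. The $word variable is a name introduced here for illustration. Tokens that mix Cyrillic and Latin letters are not handled specially.

```php
<?php
$text = 'Это кириллические 23 78these are56 45latin76 letters here98 85 буквы.';

// A "word": letters and digits with at least one Latin letter somewhere in it.
$word = '[\p{Latin}0-9]*\p{Latin}[\p{Latin}0-9]*';

// One word, followed by any number of space-separated words of the same kind.
$regex = '/' . $word . '(?: ' . $word . ')*/u';

$result = preg_replace($regex, '<b>$0</b>', $text);
echo $result, "\n";
// Это кириллические 23 <b>78these are56 45latin76 letters here98</b> 85 буквы.
```

Because pure-digit tokens such as 23 and 85 contain no Latin letter, they can neither start nor extend a match, so no look-around is needed.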

PHP preg_replace oddity with £ pound sign and ã

I am applying the following function
<?php
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâäåìíîïùúûüýÿ]/", "", $string);
return $new_string;
}
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿ";
echo replaceChar($string);
?>
This works fine, but if I add ã to the preg_replace pattern, like this:
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâãäåìíîïùúûüýÿ]/", "", $string);
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿã";
it conflicts with the pound sign £, which is replaced by the unknown-character symbol (a question mark on a black background).
This is not critical, but does anyone know why this happens?
Thank you,
Barry
UPDATE: Thank you all. I changed the function to add the u modifier (pt2.php.net/manual/en/…), as suggested by Artefacto, and it works a treat:
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõøöàáâãäåìíîïùúûüýÿ]/u", "", $string);
return $new_string;
}
If your string is in UTF-8, you must add the u modifier to the regex. Like this:
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâäåìíîïùúûüýÿ]/u", "", $string);
return $new_string;
}
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿ";
echo replaceChar($string);
Chances are that your string is UTF-8, but preg_replace() is working on bytes.
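The byte-level effect can be reproduced in isolation. In UTF-8, £ is the bytes C2 A3 and ã is the bytes C3 A3. Without the u modifier, putting ã into the negated class whitelists the byte A3, so only the C2 half of £ is removed and an orphaned A3 byte is left behind. The sketch below uses a reduced character class for illustration.

```php
<?php
$pound  = "\xC2\xA3"; // UTF-8 bytes of £
$aTilde = "\xC3\xA3"; // UTF-8 bytes of ã

// Byte mode: the class holds the bytes C3 and A3, so A3 of £ is kept.
$byteResult = preg_replace('/[^a-z' . $aTilde . ']/', '', $pound);
echo bin2hex($byteResult), "\n"; // a3 (a lone continuation byte, invalid UTF-8)

// UTF-8 mode: the class holds the character ã, so £ is removed as a whole.
$utf8Result = preg_replace('/[^a-z' . $aTilde . ']/u', '', $pound);
var_dump($utf8Result); // string(0) ""
```

The lone A3 byte is what the browser renders as the unknown-character symbol.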
That code is valid. Maybe you should try a Central European character encoding:
<?php
header ('Content-type: text/html; charset=ISO-8859-2');
?>
You might want to take a look at mb_ereg_replace(). As Mark mentioned, preg_replace works at the byte level unless the u modifier is used, so by default it does not work well with multibyte character encodings.
Cheers,
Fabian
