I needed to remove all non-Arabic characters from a string. With help from people on Stack Overflow, I came up with the following regex, which strips every character that is not Arabic:
preg_replace('/[^\x{0600}-\x{06FF}]/u','',$string);
The problem is that the above also removes whitespace. I have also discovered that I need to keep the characters A-Z, a-z, 0-9 and !##$%^&*(). How should I modify the regex?
Thanking you
Add the ones you want to keep to your character class:
preg_replace('/[^\x{0600}-\x{06FF}A-Za-z0-9 !##$%^&*()]/u','', $string);
Assume you have this string:
$str = "Arabic Text نص عربي test 123 و,.m,............ ~~~ ٍ،]ٍْ}~ِ]ٍ}";
This keeps only Arabic characters and spaces:
echo preg_replace('/[^أ-ي ]/ui', '', $str);
This keeps only Arabic and English characters, digits and spaces:
echo preg_replace('/[^أ-يA-Za-z0-9 ]/ui', '', $str);
This answers your question literally:
echo preg_replace('/[^أ-يA-Za-z0-9 !##$%^&*()]/ui', '', $str);
In more detail, building on the example above, suppose this is your string:
$string = '<div>This..</div> <a>is<a/> <strong>hello</strong> <i>world</i> ! هذا هو مرحبا العالم! !##$%^&&**(*)<>?:";p[]"/.,\|`~1##$%^&^&*(()908978867564564534423412313`1`` "Arabic Text نص عربي test 123 و,.m,............ ~~~ ٍ،]ٍْ}~ِ]ٍ}"; ';
Code:
echo preg_replace('/[^\x{0600}-\x{06FF}A-Za-z0-9 !##$%^&*().]/u','', strip_tags($string));
Allows: English letters, Arabic letters, digits 0 to 9, spaces, the characters !##$%^&*() and the dot.
Removes: all HTML tags and any special characters other than those listed above.
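The whitelist approach above can be wrapped in a small helper. This is only a sketch: the function name keepArabicAndAscii is hypothetical, and it assumes the input is UTF-8. With the u modifier, preg_replace() returns null when the subject is not valid UTF-8, so the sketch falls back to an empty string in that case.

```php
<?php
// Hypothetical helper: strip HTML tags, then keep only the Arabic block
// (U+0600-U+06FF), ASCII letters, digits, spaces and a few symbols.
function keepArabicAndAscii(string $input): string
{
    $clean = preg_replace(
        '/[^\x{0600}-\x{06FF}A-Za-z0-9 !#$%^&*().]/u',
        '',
        strip_tags($input)
    );
    // preg_replace() returns null on error, e.g. invalid UTF-8 with /u.
    return $clean ?? '';
}

// \u{0646}\u{0635} are two Arabic letters, written as escapes for clarity.
echo keepArabicAndAscii("<b>Hi</b> \u{0646}\u{0635} ~[ok]~ 42!"), "\n";
```

The tags, tildes and square brackets are dropped; the Arabic letters, Latin letters, digits, spaces and the exclamation mark survive.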
I'm trying to remove cached profiles which have non-English letters in their description. I'm fine with dashes, symbols, special characters and underscores; I just don't want foreign characters in my string.
The issue is that my code below detects strings containing á as ASCII, even though á isn't an English character. Is matching against ASCII the right way to do this?
if (!mb_detect_encoding($this->removeEmojis(str_replace(" ", "", $cacheItem->description), 'ASCII', true)))
{
$cacheItem->delete(); // laravel
}
Value of $cacheItem->description
Welcome to my profile<br> Londrina-Paraná
The letter á is a non-English character.
The description can also contain dots, symbols and special characters, but I want to detect foreign characters such as accented Latin letters.
Descriptions can also contain emojis, so I try to remove them with this function:
private function removeEmojis($text){
// there's a lot more inside the preg_replace; I truncated it for readability
return preg_replace('/[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0077}\x{E006C}\x{E0073}\x{E007F})|[\x{1F3F4}]/u', ' ', $text);
}
You can detect any character that is not printable ASCII by using this regexp:
[^\x20-\x7E]+
See the ASCII table.
Replace the matches with an empty string to get a purified string, and then apply your emoji replacement.
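As a rough sketch of how that pattern could be used, both to detect and to strip such characters (the helper name hasNonPrintableAscii is made up for illustration, and the pattern is deliberately run without the u modifier, so it works byte by byte):

```php
<?php
// Hypothetical helper: true if $text contains any byte outside the
// printable ASCII range 0x20-0x7E (this includes tabs and newlines).
function hasNonPrintableAscii(string $text): bool
{
    return preg_match('/[^\x20-\x7E]/', $text) === 1;
}

$withAccent = "Londrina-Paran\u{00E1}"; // ends with "á" (bytes C3 A1 in UTF-8)

var_dump(hasNonPrintableAscii($withAccent));        // bool(true)
var_dump(hasNonPrintableAscii('Londrina-Parana'));  // bool(false)

// Stripping instead of detecting: both bytes of "á" are removed.
echo preg_replace('/[^\x20-\x7E]+/', '', $withAccent), "\n"; // Londrina-Paran
```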
You can use preg_match to check whether all the characters in the string are in the range <space> to ~, which is the printable ASCII range:
$description = 'Welcome to my profile<br> Londrina-Paraná';
var_dump(preg_match('/^[ -~]*$/', $description));
$description = 'Welcome to my profile<br> Londriná-Parana';
var_dump(preg_match('/^[ -~]*$/', $description));
$description = 'Welcome to my profile<br> Londrina-Parana';
var_dump(preg_match('/^[ -~]*$/', $description));
Output:
int(0)
int(0)
int(1)
Demo on 3v4l.org
What I am trying to achieve: I want to use preg_replace to highlight the searched string in suggestions while ignoring diacritics, spaces and apostrophes. For example, when I search for ha, my search suggestions should look like this (with the matching part highlighted):
O'Hara
Ó an Cháintighe
H'aSOMETHING
I have done loads of research but have not come up with any code yet. My only idea is to somehow convert characters with diacritics (e.g. Á, É...) into a base character plus a modifier (A+´, E+´), but I am not sure how to do it.
I finally found a working solution thanks to Tibor's answer here: Regex to ignore accents? PHP
My function highlights text while ignoring diacritics, spaces, apostrophes and dashes:
function highlight($pattern, $string)
{
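//split into characters (multibyte-safe) and escape regex metacharacters
$array = array_map(function ($c) { return preg_quote($c, '/'); }, preg_split('//u', $pattern, -1, PREG_SPLIT_NO_EMPTY));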
//add or remove characters to be ignored
$pattern=implode('[\s\'\-]*', $array);
//list of letters with diacritics
$replacements = Array("a" => "[áa]", "e"=>"[ée]", "i"=>"[íi]", "o"=>"[óo]", "u"=>"[úu]", "A" => "[ÁA]", "E"=>"[ÉE]", "I"=>"[ÍI]", "O"=>"[ÓO]", "U"=>"[ÚU]");
$pattern=str_replace(array_keys($replacements), $replacements, $pattern);
//instead of <u> you can use <b>, <i> or even <div> or <span> with css class
return preg_replace("/(" . $pattern . ")/ui", "<u>\\1</u>", $string);
}
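The idea from the question, turning Á into A plus a combining accent, can be sketched with Unicode normalization. This assumes the intl extension is loaded (it provides the Normalizer class); the helper name stripDiacritics is hypothetical. It only produces an accent-free copy of the text for matching; mapping match positions back onto the original string for highlighting is a separate step not shown here.

```php
<?php
// Assumption: ext-intl is available, which provides Normalizer.
// Hypothetical helper: decompose to NFD, then drop the combining marks.
function stripDiacritics(string $text): string
{
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // normalization failed, e.g. invalid UTF-8
    }
    // \p{Mn} matches non-spacing marks such as the combining acute accent.
    return preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $text;
}

// "Ó an Cháintighe", written with escapes for the accented letters.
echo stripDiacritics("\u{00D3} an Ch\u{00E1}intighe"), "\n"; // O an Chaintighe
```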
I want to remove all non-Arabic, non-English and non-numeric characters from a string, except for dashes (-).
I managed to do it for non-English alphanumeric characters like this:
$slug = ereg_replace('[^A-Za-z0-9-]', '', $string);
But for non-Arabic alphanumeric characters I tried to do it like this:
$slug = ereg_replace('\p{InArabic}', '', $string);
but it didn't strip the non-alphanumeric characters! I also tried this answer, but it didn't work either; it always returns '0'!
$slug = preg_replace('/[^\x{0600}-\x{06FF}A-Za-z0-9-]/u','', $string);
Hopefully someone can help me.
Try the below:
$slug = preg_replace('/[^\p{Arabic}\da-z-]/ui', '', $string);
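A small sketch of that pattern in use. The function name makeSlug is hypothetical, and it spells the class out as A-Za-z0-9 instead of relying on the i flag and \d; the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: keep Arabic-script characters, ASCII letters,
// ASCII digits and dashes; remove everything else (including spaces).
function makeSlug(string $title): string
{
    $slug = preg_replace('/[^\p{Arabic}A-Za-z0-9-]/u', '', $title);
    // null means the input was not valid UTF-8 (or another PCRE error).
    return $slug ?? '';
}

// Five Arabic letters, a dash, then Latin text with a space and "!".
$title = "\u{0645}\u{0631}\u{062D}\u{0628}\u{0627}-World 2024!";
echo makeSlug($title), "\n";
```

The Arabic letters, the dash, "World" and "2024" are kept; the space and the exclamation mark are removed.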
Okay, I'm stuck. PHP, Regex. I have a string:
Это кириллические 23 78these are56 45latin76 letters here98 85 буквы.
And I want to use preg_replace() to enclose a substring containing latin letters, numbers and spaces with <b> tags. A substring is not merely a word but a set of words as long as the next word contains Latin characters:
Это кириллические 23 <b>78these are56 45latin76 letters here98</b> 85 буквы.
My best shot was:
$text = 'Это кириллические 23 78these are56 45latin76 letters here98 85 буквы.';
$regex = "/\d*\p{Latin}+(\d|\s|\p{Latin})*/iu";
preg_replace($regex, '<b>$0</b>', $text);
But it grabs not only "here98" but also the following "85":
Это кириллические 23 <b>78these are56 45latin76 letters here98 85 </b>буквы.
I understand why it is so but fail to figure out the correct Regex.
You need not just to match Latin+digits words, but also to look one word ahead and one word behind.
AFAIK, variable-length look-behinds are not possible, so match the preceding word in a capturing group and put it back in the replacement, and use a positive look-ahead (?=...) for the following word:
$regex = "/([\p{Latin}\d]+ )([\p{Latin}\d ]+)(?= [\p{Latin}\d]+)/iu";
echo preg_replace($regex, '$1<b>$2</b>', $text);
PS: Aaaah! Russian mafia! ;-)
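An alternative sketch that does not depend on the neighbouring words: define a "Latin word" as optional digits, at least one Latin letter, then any mix of Latin letters and digits, and match a run of such words separated by whitespace. This is one possible reading of the requirement, not the approach from the answer above; the variable names are made up.

```php
<?php
$text = 'Это кириллические 23 78these are56 45latin76 letters here98 85 буквы.';

// A "Latin word": optional leading digits, one Latin letter, then letters/digits.
$word = '\d*\p{Latin}[\p{Latin}\d]*';

// One such word, followed by any number of whitespace-separated ones.
$regex = '/' . $word . '(?:\s+' . $word . ')*/u';

$result = preg_replace($regex, '<b>$0</b>', $text);
echo $result, "\n";
// Это кириллические 23 <b>78these are56 45latin76 letters here98</b> 85 буквы.
```

Purely numeric tokens such as 23 and 85 contain no Latin letter, so they never start or extend a run.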
I am applying the following function
<?php
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâäåìíîïùúûüýÿ]/", "", $string);
return $new_string;
}
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿ";
echo replaceChar($string);
?>
which works fine, but if I add ã to the preg_replace like this:
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâãäåìíîïùúûüýÿ]/", "", $string);
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿã";
it conflicts with the pound sign £ and replaces it with the unknown-character symbol (a question mark in a black square).
This is not critical, but does anyone know why this happens?
Thank you,
Barry
UPDATE: Thank you all. I changed the function to add the u modifier (pt2.php.net/manual/en/…), as suggested by Artefacto, and it works a treat:
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõøöàáâãäåìíîïùúûüýÿ]/u", "", $string);
return $new_string;
}
If your string is in UTF-8, you must add the u modifier to the regex. Like this:
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâäåìíîïùúûüýÿ]/u", "", $string);
return $new_string;
}
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿ";
echo replaceChar($string);
Chances are that your string is UTF-8, but preg_replace() is working on bytes.
That code is valid ...
Maybe you should try a Central European character encoding:
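A short sketch of why adding ã breaks £ when the pattern is matched byte by byte. In UTF-8, £ is the bytes C2 A3 and ã is C3 A3. Without the u modifier, putting ã in the character class allows the single bytes C3 and A3, so only the C2 half of £ is stripped and a stray A3 byte is left behind, which browsers render as the replacement symbol. The reduced character class and the variable names here are illustrative only.

```php
<?php
// Explicit byte escapes so the demo does not depend on the file's encoding.
$pound   = "\xC2\xA3"; // "£" in UTF-8
$aTilde  = "\xC3\xA3"; // "ã" in UTF-8
$subject = 'a' . $pound;

// Without /u the class is a set of single bytes: a-z, 0xC3 and 0xA3.
$byteResult = preg_replace('/[^a-z' . $aTilde . ']/', '', $subject);
echo bin2hex($byteResult), "\n"; // 61a3: 0xC2 stripped, stray 0xA3 left behind

// With /u the class holds whole characters, so "£" is removed cleanly.
$utf8Result = preg_replace('/[^a-z' . $aTilde . ']/u', '', $subject);
echo $utf8Result, "\n"; // a
```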
<?php
header ('Content-type: text/html; charset=ISO-8859-2');
?>
You might want to take a look at mb_ereg_replace(). As Mark mentioned, preg_replace works at the byte level unless the u modifier is set, so without it, it does not handle multibyte character encodings well.
Cheers,
Fabian