Regex to match string with and without special/accented characters? - php

Is there a regular expression to match a specific string with and without special characters? Special characters-insensitive, so to speak.
Like céra will match cera, and vice versa.
Any ideas?
Edit: I want to match specific strings with and without special/accented characters. Not just any string/character.
Test example:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
if (strpos($compareClientName, $this->search) !== false)
{
$clientName = preg_replace('/(.*?)('.$this->search.')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $clientName);
}
Output: <span class="highlight">céra</span>
As you can see, I want to highlight the specific search string. However, I still want to display the original (accented) characters of the matched string.
I'll have to combine this with Michael Sivolobov's answer somehow, I guess.
I think I'll have to work with a separate preg_match() and preg_replace(), right?

You can use the \p{L} pattern to match any letter.
Source
You have to use the u modifier after the regular expression to enable unicode mode.
Example : /\p{L}+/u
Edit :
Try something like this. It replaces every accented letter in the search string with a group of alternatives containing the accented letter (both as a single code point and as a base letter plus combining mark) and the unaccented letter. You can then use the resulting search pattern to highlight your text.
function mbStringToArray($string)
{
$array = array(); // initialise so that an empty string returns an empty array
$strlen = mb_strlen($string, "UTF-8");
while($strlen)
{
$array[] = mb_substr($string, 0, 1, "UTF-8");
$string = mb_substr($string, 1, $strlen, "UTF-8");
$strlen = mb_strlen($string, "UTF-8");
}
return $array;
}
// I had to use this ugly function to remove accents as iconv didn't work properly on my test server.
function stripAccents($stripAccents){
return utf8_encode(strtr(utf8_decode($stripAccents),utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'),'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY'));
}
$clientName = 'céra';
$clientNameNoAccent = stripAccents($clientName);
$clientNameArray = mbStringToArray($clientName);
foreach($clientNameArray as $pos => &$char)
{
$charNA =$clientNameNoAccent[$pos];
if($char != $charNA)
{
$char = "(?:$char|$charNA|$charNA\p{M})";
}
}
$clientSearchPattern = implode($clientNameArray); // c(?:é|e|e\p{M})ra
$text = 'the client name is Céra but it could be Cera or céra too.';
$search = preg_replace('/(.*?)(' . $clientSearchPattern . ')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $text);
echo $search; // the client name is <span class="highlight">Céra</span> but it could be <span class="highlight">Cera</span> or <span class="highlight">céra</span> too.

To check whether a letter carries an accent or another mark, match it against the pattern \p{M}.
UPDATE
You need to convert every accented letter in your pattern to a group of alternatives:
E.g. céra -> c(?:é|e|e\p{M})ra
Why did I add e\p{M}? Because the letter é can be a single Unicode character or a combination of two characters (e followed by a combining acute accent). e\p{M} matches the two-character form: e followed by a combining mark.
Once your pattern is converted this way, you can use it in preg_match.
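The conversion described above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name accentInsensitivePattern and the $variants map are assumptions, and the map covers just a handful of Latin letters. Unlike the example above, it expands every base letter, so searching for cera also finds céra.

```php
<?php
// Hypothetical helper: builds a regex fragment that matches $word with or without accents.
// $variants is an illustrative, incomplete map of base letter => precomposed accented forms.
function accentInsensitivePattern(string $word): string
{
    $variants = [
        'a' => 'àáâãä', 'c' => 'ç', 'e' => 'èéêë', 'i' => 'ìíîï',
        'n' => 'ñ', 'o' => 'òóôõö', 'u' => 'ùúûü', 'y' => 'ýÿ',
    ];

    // Reverse lookup: accented letter => base letter.
    $baseOf = [];
    foreach ($variants as $base => $accented) {
        foreach (preg_split('//u', $accented, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
            $baseOf[$ch] = $base;
        }
    }

    $pattern = '';
    $chars = preg_split('//u', mb_strtolower($word, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $ch) {
        $base = $baseOf[$ch] ?? $ch;
        if (isset($variants[$base])) {
            // Base letter optionally followed by combining marks, or any precomposed variant.
            $pattern .= '(?:' . $base . '\p{M}*|[' . $variants[$base] . '])';
        } else {
            // Anything else is matched literally.
            $pattern .= preg_quote($ch, '/');
        }
    }
    return $pattern;
}

$pattern = accentInsensitivePattern('cera');
$text = 'the client name is Céra but it could be Cera or céra too.';
// $0 is the text as it appears in $text, so the original accents are kept.
echo preg_replace('/' . $pattern . '/iu', '<span class="highlight">$0</span>', $text), "\n";
```

Because the replacement uses $0, the highlighted text keeps whatever accents the original string had, which is what the question asks for.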

As you marked in one of the comments, you don't need a regular expression for that as the goal is to find specific strings. Why don't you use explode? Like that:
$clientName = 'céra';
$this->search = 'cera';
$compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
$this->search = strtolower($this->search);
$pieces = explode($compareClientName, $this->search);
if (count($pieces) > 1)
{
$clientName = implode('<span class="highlight">'.$clientName.'</span>', $pieces);
}
Edit:
If your $search variable may contain special characters too, why don't you transliterate it and use mb_strpos with $offset? Like this:
$offset = 0;
$highlighted = '';
$len = mb_strlen($compareClientName, 'UTF-8');
// mb_strpos() returns false (not -1) when the needle is not found
while(($pos = mb_strpos($this->search, $compareClientName, $offset, 'UTF-8')) !== false) {
$highlighted .= mb_substr($this->search, $offset, $pos-$offset, 'UTF-8').
'<span class="highlight">'.
mb_substr($this->search, $pos, $len, 'UTF-8').'</span>';
$offset = $pos + $len;
}
// pass null as the length so that 'UTF-8' is taken as the encoding argument
$highlighted .= mb_substr($this->search, $offset, null, 'UTF-8');
Update 2:
It is important to use the mb_ functions instead of plain strlen etc., because accented characters are stored using two or more bytes in UTF-8. Also always make sure that you use the right encoding; take a look at this for example:
echo strlen('é');
> 2
echo mb_strlen('é');
> 2
echo mb_internal_encoding();
> ISO-8859-1
echo mb_strlen('é', 'UTF-8');
> 1
mb_internal_encoding('UTF-8');
echo mb_strlen('é');
> 1

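A POSIX equivalence class matches characters with the same collating order. Inside a bracket expression it is written like this:
[[=a=]]
This will match á and ä as well as a, depending on your locale. Note that this only works in POSIX regex engines: PCRE, which PHP's preg_ functions use, recognizes the syntax but does not support it and reports an error.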

Related

Replace illegal characters in a text by underscore in PHP

I need to replace illegal characters with an underscore (_).
For example:
If the user-supplied text is "imageЙ ййé.png", the characters Й йй must be replaced by _ __, so the overall output must be image_ __é.png. This replacement must not affect French characters. Here is the code I have so far; please help me get the expected output.
<?php
$allowed_char_array=array("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z","à","á","â","ã","ä","å","æ","ç","è","é","ê","ë","ì","í","î","ï","ñ","ò","ó","ô","õ","ö","ð","ø","œ","š","Þ","ù","ú","û","ü","ý","ÿ","ž","0","1","2","3","4","5","6","7","8","9"," ","(",")","-","_",".","#","#","$","%","*","¢","ß","¥","£","™","©","®","ª","×","÷","±","+","-","²","³","¼","½","¾","µ","¿","¶","·","¸","º","°","¯","§","…","¤","¦","≠","¬","ˆ","¨","‰");
$word = 'imageЙ ййé.png';
$file_name = url_rewrite(trim($word));
$file_name2 = strtolower($file_name);
$split = str_split($file_name2);
if(is_array($split) && is_array($allowed_char_array)){
$result=array_diff($split,$allowed_char_array);
echo '<pre>';
print_r($split);
echo '<pre>';
print_r($allowed_char_array);
echo '<pre>';
print_r($result);
}
function url_rewrite($chaine) {
// On va formater la chaine de caractère
// On remplace pour ne plus avoir d'accents
$accents = array('é','à','è','À','É','È');
$sans = array('é','à','è','À','É','È');
$chaine = str_replace($accents, $sans, $chaine);
return $chaine;
}
?>
I would build a regex (character class, to be exact) using your whitelisted characters, and then remove any character which matches the negation of that class.
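$allowed_char_array = array("a","b","c","d","e"); // and others
// preg_quote() escapes characters such as "-", "(" and "." so they stay literal inside the class
$chars = preg_quote(implode("", $allowed_char_array), "/");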
$regex = "/[^" . $chars . "]/u";
$input = "imageЙ ййé.png";
echo $regex . "\n";
$output = preg_replace($regex, "_", $input);
echo $input . "\n" . $output;
imageЙ ййé.png
image_ __é.png
If the above is not clear, here is what the actual call to preg_replace would look like:
preg_replace("/[^abcdefghijklmnopqrstuv]/u", "_", $input);
That is, any non whitelisted character would be replaced with just underscore. I did not bother to list out the entire character class, because you already have that in your source code.
Note that the /u flag in the regex is critical here, because your input string is a UTF-8 string. UTF-8 characters may consist of more than one byte, and using preg_replace on them without /u may have unexpected results.
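As a rough, self-contained sketch of the whitelist approach: the helper name replaceDisallowed and the shortened whitelist are assumptions made for illustration, not the asker's full list. It assumes the whitelist may contain regex metacharacters such as - or ), which is why the joined string goes through preg_quote() before it is placed in the character class.

```php
<?php
// Hypothetical helper: replaces every character that is not in $allowed.
function replaceDisallowed(string $input, array $allowed, string $replacement = '_'): string
{
    // Escape metacharacters so that e.g. ")-_" is not read as a range inside [...].
    $class = preg_quote(implode('', $allowed), '/');
    // The u flag makes the class work on UTF-8 characters rather than single bytes.
    return preg_replace('/[^' . $class . ']/u', $replacement, $input);
}

// Shortened stand-in for the full whitelist from the question.
$allowed = array_merge(
    range('a', 'z'),
    range('0', '9'),
    [' ', '(', ')', '-', '_', '.', 'é', 'à', 'è']
);

echo replaceDisallowed('imageЙ ййé.png', $allowed), "\n"; // image_ __é.png
echo replaceDisallowed('a+b[1].png', $allowed), "\n";     // a_b_1_.png
```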
You will want to use mb_strtolower() to convert multibyte characters to lowercase safely.
My solution uses strtr() to convert your French accented letters to your preferred form.
Since all characters are lowercased from the onset, you can halve your white list of French characters.
Using pathinfo() helps you to dissect your filename.
Code: (Demo)
$word = 'imageЙ ййé.png';
$parts = pathinfo($word);
$filename = strtr(mb_strtolower($parts['filename']), ['é' =>'é', 'à' => 'à','è' => 'è']);
echo preg_replace('~[^ a-zéàè]~u', '_', $filename) , "." , $parts['extension'];
Output:
image_ __é.png

php regex replace each character with asterisk

I am trying to do something like this.
Hiding users except for first 3 characters.
EX)
apple -> app**
google -> goo***
abc12345 ->abc*****
I am currently using php like this:
$string = "abcd1234";
$regex = '/(?<=^(.{3}))(.*)$/';
$replacement = '*';
$changed = preg_replace($regex,$replacement,$string);
echo $changed;
and the result be like:
abc*
But I want to make a replacement to every single character except for first 3 - like:
abc*****
How should I do?
Don't use regex, use substr_replace:
$var = "abcdef";
$charToKeep = 3;
echo strlen($var) > $charToKeep ? substr_replace($var, str_repeat ( '*' , strlen($var) - $charToKeep), $charToKeep) : $var;
Keep in mind that regexes are good for matching patterns in strings, but there are a lot of functions already designed for string manipulation.
Will output:
abc***
Try this function. You can specify how much chars should be visible and which character should be used as mask:
$string = "abcd1234";
echo hideCharacters($string, 3, "*");
function hideCharacters($string, $visibleCharactersCount, $mask)
{
if(strlen($string) < $visibleCharactersCount)
return $string;
$part = substr($string, 0, $visibleCharactersCount);
return str_pad($part, strlen($string), $mask, STR_PAD_RIGHT);
}
Output:
abc*****
Your regex matches all the characters after the first 3 as a single match, so you replace them all with one hard-coded *.
You can use
'~(^.{3}|(?!^)\G)\K.~'
And replace with *. See the regex demo
This regex matches the first 3 characters (with ^.{3}) or the end of the previous successful match (with \G; the (?!^) lookahead excludes the start of the string, where \G would also match), then omits the characters matched so far from the match value (with \K), and finally matches any character but a newline with . (the dot).
See IDEONE demo
$re = '~(^.{3}|(?!^)\G)\K.~';
$strs = array("aa","apple", "google", "abc12345", "asdddd");
foreach ($strs as $s) {
$result = preg_replace($re, "*", $s);
echo $result . PHP_EOL;
}
Another possible solution is to concatenate the first three characters with a string of * repeated the correct number of times:
$text = substr($string, 0, 3).str_repeat('*', max(0, strlen($string) - 3));
max() is needed to keep str_repeat() from complaining about a negative argument (a warning in PHP 7, a ValueError in PHP 8). This situation happens when the length of $string is less than 3.
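The answers above all count bytes with strlen() and substr(), which is fine for ASCII user names. If names may contain multibyte characters, a variant along these lines could be used instead. This is a sketch that assumes UTF-8 input and the mbstring extension; the function name maskAfter is made up for the example.

```php
<?php
// Hypothetical helper: keeps the first $visible characters and masks the rest.
function maskAfter(string $value, int $visible = 3, string $mask = '*'): string
{
    // Count characters, not bytes.
    $length = mb_strlen($value, 'UTF-8');
    if ($length <= $visible) {
        // Nothing to hide; this also avoids a negative repeat count.
        return $value;
    }
    return mb_substr($value, 0, $visible, 'UTF-8') . str_repeat($mask, $length - $visible);
}

echo maskAfter('apple'), "\n";    // app**
echo maskAfter('кирилица'), "\n"; // кир*****
echo maskAfter('ab'), "\n";       // ab
```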

str_word_count() for non-latin words?

I'm trying to count the number of words in a variable written in a non-Latin language (Bulgarian). But it seems that str_word_count() does not count non-Latin words. The encoding of the PHP file is UTF-8.
$str = "текст на кирилица";
echo 'Number of words: '.str_word_count($str);
//this returns 0
You may do it with regex:
$str = "текст на кирилица";
echo 'Number of words: '.count(preg_split('/\s+/', $str));
Here I'm defining the word delimiter as whitespace characters. If something else should be treated as a word delimiter, you'll need to add it to your regex.
Also note that since there are no UTF-8 characters in the regex itself (as opposed to the string), the /u modifier isn't required. But if you want some UTF-8 characters to act as delimiters, you'll need to add this modifier.
Update:
If you want only Cyrillic letters to count as parts of words, you may use:
$str = "текст
на 12453
кирилица";
echo 'Number of words: '.count(preg_split('/[^А-Яа-яЁё]+/u', $str));
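A possible alternative to splitting is to count runs of letters directly, which works for any script and does not produce empty elements when the string starts or ends with a delimiter. This is a sketch under the assumption that a "word" is simply a run of Unicode letters and combining marks; the function name countWords is made up for the example, and hyphenated words or words with apostrophes would be counted as two.

```php
<?php
// Hypothetical helper: counts runs of Unicode letters (with combining marks).
function countWords(string $text): int
{
    // preg_match_all() returns the number of full matches (or false on error).
    return (int) preg_match_all('/[\p{L}\p{M}]+/u', $text);
}

echo countWords("текст на кирилица"), "\n";        // 3
echo countWords("текст\nна 12453\nкирилица"), "\n"; // 3 (the number is not a word)
echo countWords(''), "\n";                          // 0
```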
And here is the solution that came to my mind:
$var = "текст на кирилица с пет думи";
$array = explode(" ", $var);
$i = 0;
foreach($array as $item)
{
if(strlen($item) > 2) $i++ ;
}
echo $i; // prints 5: strlen() counts bytes, so the one-letter word "с" (2 bytes in UTF-8) is not counted
As stated in the str_word_count description:
'word' is defined as a locale dependent string
Specify Bulgarian locale before calling str_word_count
setlocale(LC_ALL, 'bg_BG');
echo str_word_count($content);
Read more about setlocale here.
The best solution I found is to provide a list of characters for word count function:
$text = 'текст на кирилице and on english too';
$count = str_word_count($text, 0, 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя');
echo $count; // => 7

Compare a symbol from multibyte string with one in ASCII

I want to detect a space or a hyphen in a multibyte string.
First I split the string into an array of chars:
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
Then I try to compare those symbols with a hyphen or a space
foreach ($chrArray as $char) {
if ($char == '-' || $char == ' ') {
// Do something
}
}
Oh, this one doesn't work. Ok, why? Maybe because those symbols in ASCII?
echo mb_detect_encoding('-'); // ASCII
Okay, I'll try to handle it.
$encoding = mb_detect_encoding($str); // UTF-8
$dash = mb_convert_encoding('-', $encoding);
$space = mb_convert_encoding(' ', $encoding);
Oh, but it doesn't work too. Wait a second...
echo mb_detect_encoding($dash); // ASCII
!!! What's happening??? How could I do what I want?
I've come to using regexes. This one
"/(?<=-| |^)([\w]*)/u"
finds all words in unicode that have either a hyphen, or a space, or nothing (first in a line) at previous position. Instead of iterating chars array I'm using the preg_replace_callback (in PHP >= 5.4.1 the mb_ereg_replace_callback can be used).
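To illustrate how that pattern can be combined with preg_replace_callback, here is a minimal sketch. The callback's job (title-casing each word) and the function name titleCaseParts are assumptions chosen only for the example, since the post does not say what "Do something" was; it also assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper: upper-cases the first letter of every word that starts
// the string or follows a hyphen or a space.
function titleCaseParts(string $str): string
{
    return preg_replace_callback(
        '/(?<=-| |^)([\w]*)/u',   // the pattern from the post, unchanged
        function (array $m): string {
            // $m[1] is the word found after a hyphen, a space, or at the start.
            return mb_convert_case($m[1], MB_CASE_TITLE, 'UTF-8');
        },
        $str
    );
}

echo titleCaseParts('анна-мария петрова'), "\n"; // Анна-Мария Петрова
```

As a side note, ASCII is a subset of UTF-8, so mb_detect_encoding() reporting ASCII for '-' is expected and does not by itself explain a failed comparison.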

Condensed function to strip double letters away from a string (PHP)

I need to remove every double-letter occurrence from a word. (E.g. "attached" has to become "aached".)
I wrote this function:
function strip_doubles($string, $positions) {
for ($i = 0; $i < strlen($string); $i++) {
$stripped_word[] = $string[$i];
}
foreach($positions['word'] as $position) {
unset($stripped_word[$position], $stripped_word[$position + 1]);
}
$returned_string= "";
foreach($stripped_word as $key => $value) {
$returned_string.= $stripped_word[$key];
}
return $returned_string;
}
where $string is the word to be stripped and $positions is an array containing the positions of any first double letter.
It perfectly works but how would a real programmer write the same function... in a more condensed way? I have a feeling it could be possible to do the same thing without three loops and so much code.
Non-regex solution, tested:
$string = 'attached';
$stripped = '';
for ($i=0,$l=strlen($string);$i<$l;$i++) {
$matched = '';
// if current char is the same as the next, skip it
while (substr($string, $i, 1)==substr($string, $i+1, 1)) {
$matched = substr($string, $i, 1);
$i++;
}
// if current char is NOT the same as the matched char, append it
if (substr($string, $i, 1) != $matched) {
$stripped .= substr($string, $i, 1);
}
}
echo $stripped;
You should use a regular expression. It matches on certain characteristics and can replace the matched occurrences with some other string(s).
Something like
$result = preg_replace('#([a-zA-Z]{1})\1#i', '', $string);
Should work. It tells the regexp to match one letter from a-z or A-Z followed by the match itself, thus effectively two identical characters after each other. The # marks delimit the start and end of the regexp. If you want more characters than just a-z and A-Z, you could use other classes like [a-zA-Z0-9]{1}, or .{1} for any character, or, for Unicode letters only (including combined characters), \p{L}\p{M}* together with the u flag.
The i flag after the last # means 'case insensitive' and will instruct the regexp to also match combinations with different cases, like 'tT'. If you want only combinations in the same case, so 'tt' and 'TT', then remove the 'i' from the flags.
The '' tells the regexp to replace the matched occurrences (the two identical characters) with an empty string.
See http://php.net/manual/en/function.preg-replace.php and http://www.regular-expressions.info/
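A Unicode-aware variant of the same idea might look like the sketch below. The function name stripDoubleLetters and the optional case-insensitivity parameter are assumptions added for illustration; the pattern itself is the \p{L}\p{M}* form mentioned above with a backreference.

```php
<?php
// Hypothetical helper: removes every pair of identical adjacent letters.
function stripDoubleLetters(string $word, bool $caseInsensitive = false): string
{
    // u enables UTF-8 mode; i additionally treats "Tt" as a double letter.
    $flags = $caseInsensitive ? 'iu' : 'u';
    // (\p{L}\p{M}*) captures one letter with its combining marks, \1 requires it again.
    return preg_replace('/(\p{L}\p{M}*)\1/' . $flags, '', $word);
}

echo stripDoubleLetters('attached'), "\n";       // aached
echo stripDoubleLetters('коллега'), "\n";        // коега
echo stripDoubleLetters('Aardvark', true), "\n"; // rdvark
```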
