regex to also match accented characters - php

I have the following PHP code:
$search = "foo bar que";
$search_string = str_replace(" ", "|", $search);
$text = "This is my foo text with qué and other accented characters.";
$text = preg_replace("/$search_string/i", "<b>$0</b>", $text);
echo $text;
Obviously, "que" does not match "qué". How can I change that? Is there a way to make preg_replace ignore all accents?
The characters that have to match (Spanish):
á,Á,é,É,í,Í,ó,Ó,ú,Ú,ñ,Ñ
I don't want to replace all accented characters before applying the regex, because the characters in the text should stay the same:
"This is my foo text with qué and other accented characters."
and not
"This is my foo text with que and other accented characters."

The solution I finally used:
$search_for_preg = str_ireplace(["e","a","o","i","u","n"],
["[eé]","[aá]","[oó]","[ií]","[uú]","[nñ]"],
$search_string);
$text = preg_replace("/$search_for_preg/iu", "<b>$0</b>", $text)."\n";
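One caveat with building the pattern straight from user input is that regex metacharacters in the search terms are not escaped. A minimal sketch of the same idea wrapped in a function, assuming the source file and the input are UTF-8; the name highlight_terms is hypothetical and not from the original post:

```php
<?php
// Hypothetical helper (not from the original post): builds an
// accent-insensitive alternation from a space-separated search string
// and wraps every match in <b> tags.
function highlight_terms(string $search, string $text): string
{
    // Plain letter => character class, for both lower and upper case keys.
    $map = [];
    foreach (['a' => 'aá', 'e' => 'eé', 'i' => 'ií', 'o' => 'oó', 'u' => 'uú', 'n' => 'nñ'] as $plain => $variants) {
        $map[$plain] = "[$variants]";
        $map[strtoupper($plain)] = "[$variants]";
    }

    $alternatives = [];
    foreach (preg_split('/\s+/', trim($search), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // Escape regex metacharacters first, then expand the plain letters.
        $alternatives[] = strtr(preg_quote($word, '/'), $map);
    }
    if ($alternatives === []) {
        return $text; // nothing to search for
    }

    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    $pattern = '/' . implode('|', $alternatives) . '/iu';
    return preg_replace($pattern, '<b>$0</b>', $text) ?? $text;
}

echo highlight_terms('foo bar que', 'This is my foo text with qué and other accented characters.'), "\n";
// This is my <b>foo</b> text with <b>qué</b> and other accented characters.
```

The original text is left untouched apart from the added tags, because only the pattern is widened, never the subject.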

$search = str_replace(
['a','e','i','o','u','n'],
['[aá]','[eé]','[ií]','[oó]','[uú]','[nñ]'],
$search);
This, plus the same for upper case, will accomplish your request. A side note: the ñ replacement sounds invalid to me, as 'niño' is totally different from 'nino'.

If you want to use the captured text in the replacement string, you have to use character classes in your $search variable (anyway, you set it manually):
$search = "foo bar qu[eé]";
And so on.

You could try defining an array like this:
$vowel_replacements = array(
"e" => "eé",
// Other letters mapped to their other versions
);
Then, before your preg_replace call, do something like this:
foreach ($vowel_replacements as $vowel => $replacements) {
$search_string = str_replace($vowel, "[$replacements]", $search_string);
}
If I'm remembering my PHP right, that should replace your vowels with a character class of their accented forms, so the accented text stays in place. It also lets you change the search string far more easily; you don't have to remember to replace the vowels with their character classes. All you have to remember is to use the non-accented form in your search string.
(If there's some special syntax I'm forgetting that does this without a foreach, please comment and let me know.)
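One possible way to do this without a foreach, as a sketch: strtr() accepts an array of search => replacement pairs and applies them in a single pass. The map below stores the full character class for each vowel, which is a small change from the array above; the sample search string is taken from the question.

```php
<?php
// Each unaccented vowel maps to a character class of its variants.
$vowel_replacements = [
    'a' => '[aá]',
    'e' => '[eé]',
    'i' => '[ií]',
    'o' => '[oó]',
    'u' => '[uú]',
];

$search_string = 'foo|bar|que';

// strtr() with an array replaces all keys in one pass over the subject,
// so text that was already inserted is never scanned again.
$pattern = strtr($search_string, $vowel_replacements);

echo $pattern, "\n"; // f[oó][oó]|b[aá]r|q[uú][eé]

$text = 'This is my foo text with qué and other accented characters.';
echo preg_replace("/$pattern/iu", '<b>$0</b>', $text), "\n";
```

str_replace() also accepts arrays, but it applies the pairs one after another, so strtr() is the safer choice if a replacement could ever contain one of the later search keys.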

Related

How to match alphanumeric and symbols using PHP?

I'm working with text content in UTF8 encoding stored in variable $title.
Using preg_replace, how do I append an extra space if the $title string is ending with:
upper/lower case character
digit
symbol, eg. ? or !
This should do the trick:
preg_replace('/^(.*[\w?!])$/', "$1 ", $string);
In essence what it does is if the string ends in one of your unwanted characters it appends a single space.
If the string doesn't match the pattern, then preg_replace() returns the original string - so you're still good.
If you need to expand your list of unwanted endings you can just add them into the character block [\w?!]
Using a positive lookbehind before the end of the line.
And replace with a space.
$title = preg_replace('/(?<=[A-Za-z0-9?!])$/',' ', $title);
You may want to try this Pattern Matching below to see if that does it for you.
<?php
// THE REGEX BELOW MATCHES THE ENDING LOWER & UPPER-CASED CHARACTERS, DIGITS
// AND SYMBOLS LIKE "?" AND "!" AND EVEN A DOT "."
// HOWEVER YOU CAN IMPROVISE ON YOUR OWN
$rxPattern = "#([\!\?a-zA-Z0-9\.])$#";
$title = "What is your name?";
var_dump($title);
// AND HERE, YOU APPEND A SINGLE SPACE AFTER THE MATCHED STRING
$title = preg_replace($rxPattern, "$1 ", $title);
var_dump($title);
// THE FIRST var_dump($title) PRODUCES:
// 'What is your name?' (length=18)
// AND THE SECOND var_dump($title) PRODUCES
// 'What is your name? ' (length=19) <== NOTICE THE LENGTH FROM ADDED SPACE.
Cheers...
You need
$title=preg_replace("/.*[\w?!]$/", "\\0 ", $title);
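Since the question says $title is UTF-8, note that without the u modifier \w and [A-Za-z0-9] only cover ASCII, so a title ending in é or я would not get the space. A hedged sketch of a Unicode-aware variant, where pad_title is a hypothetical name:

```php
<?php
// Hypothetical helper: appends one space when the title ends with a
// letter, a digit, "?" or "!" in any script.
function pad_title(string $title): string
{
    // (?<=...) is a lookbehind, \z is the very end of the string,
    // \p{L} / \p{N} are Unicode letters / numbers, and the u modifier
    // makes PCRE treat the subject as UTF-8.
    return preg_replace('/(?<=[\p{L}\p{N}?!])\z/u', ' ', $title) ?? $title;
}

var_dump(pad_title('What is your name?')); // space appended
var_dump(pad_title('Café'));               // space appended
var_dump(pad_title('Done.'));              // unchanged, "." is not in the class
```

If the input is not valid UTF-8, preg_replace() returns null under the u modifier; the sketch falls back to the original string in that case.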

How to match with regex unicode text ignoring diacritics on characters (Á É Í)

What I am trying to achieve: I want to use preg_replace to highlight the searched string in suggestions while ignoring diacritics, spaces, and apostrophes. For example, when I search for ha, my search suggestions should look like this:
O'Hara
Ó an Cháintighe
H'aSOMETHING
I have done loads of research but have not come up with any code yet. My only idea is to somehow convert the characters with diacritics (e.g. Á, É...) to a base character plus modifier (A+´, E+´), but I am not sure how to do it.
I finally found a working solution thanks to Tibor's answer here: Regex to ignore accents? PHP
My function highlights text ignoring diacritics, spaces, apostrophes and dashes:
function highlight($pattern, $string)
{
    //split the pattern into characters (UTF-8 safe; str_split() would cut multibyte characters into bytes)
    $array = preg_split('//u', $pattern, -1, PREG_SPLIT_NO_EMPTY);
    //add or remove characters to be ignored
    $pattern = implode('[\s\'\-]*', $array);
    //list of letters with diacritics
    $replacements = array(
        "a" => "[áa]", "e" => "[ée]", "i" => "[íi]", "o" => "[óo]", "u" => "[úu]",
        "A" => "[ÁA]", "E" => "[ÉE]", "I" => "[ÍI]", "O" => "[ÓO]", "U" => "[ÚU]"
    );
    $pattern = str_replace(array_keys($replacements), $replacements, $pattern);
    //instead of <u> you can use <b>, <i> or even <div> or <span> with css class
    return preg_replace("/(" . $pattern . ")/ui", "<u>\\1</u>", $string);
}

PHP Regex: Remove words less than 3 characters

I'm trying to remove all words of less than 3 characters from a string, specifically with RegEx.
The following doesn't work because it is looking for double spaces. I suppose I could convert all spaces to double spaces beforehand and then convert them back after, but that doesn't seem very efficient. Any ideas?
$text='an of and then some an ee halved or or whenever';
$text=preg_replace('# [a-z]{1,2} #',' ',' '.$text.' ');
echo trim($text);
Removing the Short Words
You can use this:
$replaced = preg_replace('~\b[a-z]{1,2}\b~', '', $yourstring);
In the demo, see the substitutions at the bottom.
Explanation
\b is a word boundary: it matches a position where one side is a word character (a letter, digit, or underscore) and the other side is not (for instance a space character, or the beginning of the string)
[a-z]{1,2} matches one or two letters
\b another word boundary
Replace with the empty string.
Option 2: Also Remove Trailing Spaces
If you also want to remove the spaces after the words, we can add \s* at the end of the regex:
$replaced = preg_replace('~\b[a-z]{1,2}\b\s*~', '', $yourstring);
Reference
Word Boundaries
You can use the word boundary anchor \b:
Replace: \b[a-z]{1,2}\b with ''
Use this
preg_replace('/(\b.{1,2}\s)/','',$your_string);
Although some of the solutions here worked, they had a problem with my language's "multichar characters", such as "ch". A simple explode and implode worked for me.
$maxWordLength = 3;
$string = "my super string";
$exploded = explode(" ", $string);
foreach($exploded as $key => $word) {
if(mb_strlen($word) < $maxWordLength) unset($exploded[$key]);
}
$string = implode(" ", $exploded);
echo $string;
// outputs "super string"
To me, it seems that this works fine with most PHP versions:
$string2 = preg_replace("/\b[a-zA-Z0-9]{1,2}\b/i", "", trim($string1));
Where [a-zA-Z0-9] is the accepted range of letters and digits.
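The patterns above only count ASCII letters, so accented words are not handled as whole words. A minimal sketch of a Unicode-aware variant, assuming UTF-8 input; remove_short_words and its $minLength parameter are hypothetical names:

```php
<?php
// Hypothetical helper: drops words shorter than $minLength letters.
function remove_short_words(string $text, int $minLength = 3): string
{
    if ($minLength < 2) {
        return $text; // no word can be shorter than one letter
    }
    $max = $minLength - 1;
    // (?<!\p{L}) and (?!\p{L}) act as Unicode-aware word edges;
    // \s* also consumes the whitespace that followed the short word.
    $pattern = '/(?<!\p{L})\p{L}{1,' . $max . '}(?!\p{L})\s*/u';
    return trim(preg_replace($pattern, '', $text) ?? $text);
}

echo remove_short_words('an of and then some an ee halved or or whenever'), "\n";
// and then some halved whenever
echo remove_short_words('él es un niño muy ágil'), "\n";
// niño muy ágil
```

This counts code points, so it does not solve the "ch" digraph case mentioned above; that still needs language-specific handling.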

PHP remove everything except letters and a hyphen (-)

I'm making a form that asks for the user's first and last name, and I don't want them entering
$heil4
I would like them to enter
Sheila
I know how to filter out everything except letters, but I'm aware that some names can have
Sheila-McDonald
So how would I remove everything from a string apart from letters and a hyphen?
Simply use
$s = preg_replace("/[^a-z-]/i", "", $s);
or if you want to convert some non-ascii characters to ascii, such as Jean-Rémy to Jean-Remy, then use
$s = preg_replace("/[^a-z-]/i", "", iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $s));
Instead of replacing with nothing, have some fun: that way you can decode a name that consists mainly of numbers ;p
$name = '$h3il4-McD0nald';
$find = array(0,1,3,4,5,6,7,'$');
$replace = array('o','l','e','a','s','g','t','s');
$name = str_replace($find,$replace,$name);
//Sheila-McDonald
echo ucfirst(preg_replace('/[^a-z-]/i', '', $name));
$new = preg_replace('#[^A-Z-]#iu', '', $data);
But instead of removing characters (and thus modifying the user's input), it is better to validate the input and show an error if it is not valid. This way the user will know that what they entered is exactly the value you have:
if(!preg_match('#^[A-Z-]+$#iu', $data)) echo 'invalid';
Use this to strip out everything except whitespace, dashes, and letters, including accented Latin letters (À-ÿ). The u modifier is needed so that the À-ÿ range is read as characters rather than bytes.
$strtochange = preg_replace("/[^\s\p{Pd}a-zA-ZÀ-ÿ]/u", '', $strtochange);
Note: this will turn $heil4 into heil.
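For the validation approach, a hedged sketch that accepts letters from any script and allows hyphens only between letter groups; is_valid_name is a hypothetical name and the exact rules are an assumption about what a form might want:

```php
<?php
// Hypothetical validator: one or more letter groups joined by single hyphens.
function is_valid_name(string $name): bool
{
    // \p{L} matches any Unicode letter; \z anchors at the very end of the
    // string (unlike $, it does not tolerate a trailing newline).
    return preg_match('/^\p{L}+(?:-\p{L}+)*\z/u', $name) === 1;
}

var_dump(is_valid_name('Sheila-McDonald')); // bool(true)
var_dump(is_valid_name('Jean-Rémy'));       // bool(true)
var_dump(is_valid_name('$heil4'));          // bool(false)
var_dump(is_valid_name('-Sheila'));         // bool(false)
```

Real names can also contain spaces and apostrophes, so the character set would likely need widening before use in a real form.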

get words from string using preg_split in php

I'm trying to get words from string in php using preg_split like this:
$result = preg_split('/[^A-Za-z]+/', $text);
but this doesn't work, some words are split,
what am I doing wrong?
Edit: the fact is it doesn't work with Russian text:
$text = "фыва ывафы фываф";
$result = preg_split('/[^А-яа-я]+/', $text);
[^A-Za-z] only takes ASCII letters into account. You need to split on Unicode non-letters:
$result = preg_split('/\P{L}+/u', $subject);
[^А-Яа-я]+ won't work either because in the Unicode character set, А (0x0410) is not the first Cyrillic letter, and я (0x044F) is not the last one: the Cyrillic block runs from 0x0400 to 0x04FF, so letters such as Ё (0x0401) and ё (0x0451) fall outside that range. I don't know Russian at all, so I can't speculate on why this is so.
You can check this easily using your character map program.
$str ="As sdf fdasf";
$result = preg_split('/[\b ]/', $str);
edit:
$result = preg_split('/\b\s+/', $str); //this is not for Unicode
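To see the \P{L} split suggested above on the Russian sample, here is a small sketch. The punctuation in the sample and the PREG_SPLIT_NO_EMPTY flag are additions for illustration; the flag drops the empty strings that leading or trailing non-letters would otherwise produce.

```php
<?php
$text = 'фыва ывафы, фываф!';

// \P{L} is "anything that is not a letter"; the u modifier switches the
// pattern and subject to UTF-8 mode so Cyrillic letters count as letters.
$words = preg_split('/\P{L}+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($words); // three elements: фыва, ывафы, фываф
```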
