php regex match possible accented characters - php

I found a lot of questions about this, but none of them helped with my specific problem. The situation: I want to search for a string such as "blablebli" and match every accented variation of it ("blablebli", "blábleblí", "blâblèbli", etc.) in a text.
I already have a workaround that does the opposite (finding the unaccented form of a word that I typed with accents), but I can't figure out a way to implement what I want.
Here is my working code (the relevant part; it sits inside a foreach, so we are only looking at a single word search):
$word="something";
$word = preg_quote(trim($word)); //Just in case
$word2 = $this->removeAccents($word); // Removed all accents
if(!empty($word)) {
$sentence = "/(".$word.")|(".$word2.")/ui"; // Now I'm checking with and without accents.
if (preg_match($sentence, $content)){
echo "found";
}
}
And my removeAccents() function (I'm not sure whether that preg_replace() covers every possible accent. So far it's working. I would appreciate it if someone could check whether I'm missing anything):
function removeAccents($string)
{
return preg_replace('/[\`\~\']/', '', iconv('UTF-8', 'ASCII//TRANSLIT', $string));
}
What I'm trying to avoid:
I know I could go through my $word and replace every a with [aàáãâä], and
do the same for the other letters, but I don't know... it seems a little
overkill.
And sure, I could use my own removeAccents() function in my if
statement to check the $content without accents, something like:
if (preg_match($sentence, $content) || preg_match($sentence, removeAccents($content)))
But my problem with that second option is that I want to highlight the word found after the match, so I can't change my $content.
Is there any way to improve my preg_match() to include possible accented characters? Or should I use the first option above?

I would decompose the string first; that makes it easier to remove the offending characters. Something along these lines:
<?php
// Convert unicode input to NFKD form.
$str = Normalizer::normalize("blábleblí", Normalizer::FORM_KD);
// Remove all combining characters (https://en.wikipedia.org/wiki/Combining_character).
var_dump(preg_replace('/[\x{0300}-\x{036f}]/u', "", $str));

Thanks for the help, everyone, but I will end up using the first suggestion I made in my question. And thanks again, @CasimiretHippolyte, for your patience and for making me realize that it isn't as much overkill as I thought.
Here is the final code I'm using (first the functions):
function removeAccents($string)
{
return preg_replace('/[\x{0300}-\x{036f}]/u', '', Normalizer::normalize($string, Normalizer::FORM_KD));
}
function addAccents($string)
{
$array1 = array('a', 'c', 'e', 'i' , 'n', 'o', 'u', 'y');
$array2 = array('[aàáâãäå]','[cçćĉċč]','[eèéêë]','[iìíîï]','[nñ]','[oòóôõö]','[uùúûü]','[yýÿ]');
return str_replace($array1, $array2, strtolower($string));
}
And:
$word="something";
$word = preg_quote(trim($word)); //Just in case
$word2 = $this->addAccents($this->removeAccents($word)); //check all possible accents
if(!empty($word)) {
$sentence = "/(".$word.")|(".$word2.")/ui"; // Now I'm checking my normal word and the possible variations of it.
if (preg_match($sentence, $content)){
echo "found";
}
}
By the way, I'm covering all the accents used in my country (and some others). Check whether you need to extend the addAccents() function before using it.
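Since the original goal was to highlight the match without changing $content, here is a minimal sketch of how the character-class idea could be used for that. The function name accentInsensitivePattern() and the <mark> wrapper are assumptions made for illustration, and the sketch assumes the search word is already lowercase and accent-free (for example after removeAccents()) and that the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): builds a regex that matches
// $word with or without accents. Assumes $word is lowercase, accent-free ASCII.
function accentInsensitivePattern(string $word): string
{
    // Same character classes as addAccents() above.
    $classes = [
        'a' => '[aàáâãäå]', 'c' => '[cçćĉċč]', 'e' => '[eèéêë]', 'i' => '[iìíîï]',
        'n' => '[nñ]', 'o' => '[oòóôõö]', 'u' => '[uùúûü]', 'y' => '[yýÿ]',
    ];
    $pattern = '';
    foreach (str_split($word) as $char) {
        // Use the accent class when one exists, otherwise the escaped literal character.
        $pattern .= $classes[$char] ?? preg_quote($char, '/');
    }
    // "u" treats pattern and subject as UTF-8, "i" ignores case.
    return '/' . $pattern . '/iu';
}

$content = 'Ele disse blábleblí e depois blablebli.';
// $0 is the text exactly as it appears in $content, so the accents are preserved.
echo preg_replace(accentInsensitivePattern('blablebli'), '<mark>$0</mark>', $content), "\n";
```

Escaping each character separately, rather than running preg_quote() on the whole word first, keeps the inserted brackets from being escaped or mixed up with the backslashes that preg_quote() adds.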

Related

Create a function to find a specific word in the title

I have the following title formation on my website:
It's no use going back to yesterday, because at that time I was... Lewis Carroll
The format is always: The phrase… (author).
I want to delete everything after the ellipsis (…), leaving only the sentence as the title. I thought of creating a function in php that would take the parts of the titles, throw them in an array and then I would work each part, identifying the only pattern I have in the title, which is the ellipsis… and then delete everything. But when I do that, in the X space of my array, it returns the following:
was...
In position 8 of the array comes the word and the ellipsis and I don't know how to find a pattern to delete the author of the title, my pattern was the ellipsis. Any idea?
<?php
$a = get_the_title(155571);
$search = '... ';
if(preg_match("/{$search}/i", $a)) {
echo 'true';
}
?>
I tried with the code above and found the ellipsis, but I needed to bring it into an array to delete the part I need. I tried something like this:
<?php
define('WP_USE_THEMES', false);
require('./wp-blog-header.php');
global $wpdb;
$title_array = explode(' ', get_the_title(155571));
$search = '... ';
if (array_key_exists("/{$search}/i",$title_array)) {
echo "true";
}
?>
I started doing it this way, but it doesn't work, any ideas?
Thanks,
If you use a regex, you need to escape the search string, as preg_quote() would do, because a dot is a metacharacter in the pattern.
But in your simple case I would not use a regex at all; I would just search for the three dots from the end of the string.
Note: if the ellipsis is added by the browser rather than stored in the title, there is no way to detect it in PHP.
$title = 'The phrase... (author).';
echo getPlainTitle($title);
function getPlainTitle(string $title) {
$rpos = strrpos($title, '...');
return ($rpos === false) ? $title : substr($title, 0, $rpos);
}
will output
The phrase
First of all, since you're working with regular expressions, you need to remember that . has a special meaning there: it means "any character". So /... / just means "any three characters followed by a space", which isn't what you want. To match a literal . you need to escape it as \.
Secondly, rather than searching or splitting, you could achieve what you want by replacing part of the string. For instance, you could find everything after the ellipsis, and replace it with an empty string. To do that you want a pattern of "dot dot dot followed by anything", where "anything" is spelled .*, so \.\.\..*
$title = preg_replace('/\.\.\..*/', '', $title);
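The question shows the ellipsis both as three dots (...) and as the single character …, so a pattern that accepts either form may be useful. This is a minimal sketch under that assumption; the helper name stripAfterEllipsis() is made up for illustration and is not part of the answers above.

```php
<?php
// Hypothetical helper (not from the thread): drops everything from the first
// ellipsis onward, whether it is written as "..." or as the single "…" (U+2026).
function stripAfterEllipsis(string $title): string
{
    // "u" treats pattern and subject as UTF-8; "s" lets "." also match newlines.
    // preg_replace() returns null on error (e.g. invalid UTF-8), so fall back to the input.
    $result = preg_replace('/(?:\.{3}|\x{2026}).*/su', '', $title) ?? $title;
    return rtrim($result);
}

echo stripAfterEllipsis("It's no use going back to yesterday, because at that time I was... Lewis Carroll"), "\n";
```

Unlike the strrpos() version, this cuts at the first ellipsis in the title rather than the last one, which only matters if a title contains more than one.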

php ucfirst ucwords discussion [duplicate]

This question already has answers here:
Make all words lowercase and the first letter of each word uppercase
(3 answers)
Closed 1 year ago.
I just wanted to share my experience of needing a language-independent version of ucfirst.
The problem arises when you mix English text with Japanese, Chinese, or other languages (in my case sometimes Swedish, with ÅÄÖ): the traditional ucfirst has trouble capitalizing the string.
Some time ago, however, I stumbled across the following code snippet here on Stack Overflow:
function myucfirst($str) {
$fc = mb_strtoupper(mb_substr($str, 0, 1));
return $fc.mb_substr($str, 1);
}
It works fine in most cases, but recently I also needed the translations to autogenerate text in dynamic PDFs using TCPDF.
This is when I banged my head over why TCPDF had issues with the text. I had no problems anywhere else; the character encoding was UTF-8, but it still broke.
When showing kanji for Japanese, I simply skipped the above function, but then with Swedish I hit the same failure when I needed to capitalize ÅÄÖ.
That led me to realize that the problem with the function above is that it only looks at the first character. Each of Å, Ä, and Ö takes up 2 bytes, and Chinese or Japanese kanji take up 3 bytes; the function above did not account for that, which broke TCPDF.
To give more context: when generating PDF documents with TCPDF, the font ends up producing errors because the general mb_string function turns the first character into "?�"vrigt for the Swedish word Övrigt and, for instance, "?��"のととろ for the Japanese 隣のトトロ (My Neighbour Totoro). This makes the font translation for the � fail. You need to convert ÅÄÖ using the first two bytes, substr($str, 0, 2), to convert the letter properly.
Also, I am not sure whether you can see the code examples I gave, but since neither Chinese nor Japanese uses upper-case letters, I am excluding every character that takes 3 bytes, since those have no upper/lower case at all. I don't really want to exclude them, but passing them through mb_string leads to similar errors in TCPDF, so my examples are a workaround for now, unless someone has a better solution.
So my approach was to solve the above problem with the following function:
function myucfirst($str) {
if ($str[0] !== "?"){
for($i = 1; $i <= 3; $i++){
$first = substr($str, 0, $i);
$first = mb_convert_case($first, MB_CASE_UPPER, "UTF-8");
if ($first !== '?'){
$rest = substr($str, $i);
break;
}
}
if ($i < 3){
$ret_string = $first . $rest;
} else {
$ret_string = $str;
}
} else {
$ret_string = $str;
}
return $ret_string;
}
Thanks to Steven Penny's help below, this is the solution that works with both Swedish and Japanese/Chinese special characters, even when the string is passed to the TCPDF library to create PDFs dynamically:
function myucfirst($str) {
$ret_string = mb_convert_case($str, MB_CASE_TITLE, 'UTF-8');
return $ret_string;
}
and the following is a similar fix for ucwords:
function myucwords($str){
$str = trim($str);
// Capitalize each space-separated word with myucfirst() and join them again.
// A string without spaces is handled too: explode() returns it as a single word.
return implode(' ', array_map('myucfirst', explode(' ', $str)));
}
The myucwords function uses myucfirst to capitalize each word.
I am not that experienced as a developer or as a Stack Overflow contributor. You should be able to see 3 code examples, and I would really appreciate hearing about better ways to write these functions, but for now, for anyone with a similar problem, please enjoy!
/Chris
The examples you gave are poor, as with Övrigt the input is exactly the same
as the output. So I modified the example so they can be useful. See below:
<?php
# example 1
$s1 = mb_convert_case('åäö', MB_CASE_TITLE);
# example 2
$s2 = mb_convert_case('övrigt', MB_CASE_TITLE);
# example 3
$s3 = mb_convert_case('隣のトトロ', MB_CASE_TITLE);
# print
var_dump($s1 == 'Åäö', $s2 == 'Övrigt', $s3 == '隣のトトロ');
Note that you will need this in your php.ini, if it's not already there:
extension = mbstring
https://php.net/function.mb-convert-case
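One thing to keep in mind is that MB_CASE_TITLE capitalizes every word and lowercases the remaining letters, so it behaves more like ucwords than ucfirst. If only the first character should change, the original mb_substr() approach can still be used. The sketch below assumes (this is not confirmed in the thread) that the garbled first character came from the mbstring functions using a non-UTF-8 internal encoding, so it passes 'UTF-8' explicitly; the name mbUcfirst() is hypothetical. It requires the mbstring extension.

```php
<?php
// Hypothetical helper (not from the thread): uppercases only the first
// character and leaves the rest of the string untouched.
function mbUcfirst(string $str, string $encoding = 'UTF-8'): string
{
    // Passing the encoding explicitly avoids depending on the configured internal encoding.
    $first = mb_substr($str, 0, 1, $encoding);
    $rest  = mb_substr($str, 1, null, $encoding);
    return mb_strtoupper($first, $encoding) . $rest;
}

echo mbUcfirst('övrigt text'), "\n";                                // Övrigt text
echo mb_convert_case('övrigt text', MB_CASE_TITLE, 'UTF-8'), "\n";  // Övrigt Text
echo mbUcfirst('隣のトトロ'), "\n";                                   // unchanged: no upper case exists
```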

Preg Match circumflex with ^ in php

I know I am going to get a lot of asinine comments, but I cannot figure this out no matter what I do. I have a function here
$filter = mysql_query("SELECT * FROM `filter`");
$fil = mysql_fetch_array($filter);
$bad = $fil['filter'];
$bword = explode(",", $bad);
function wordfilter($output,$bword){
$badWords = $bword;
$matchFound = preg_match_all("/(" . implode($badWords,"|") . ")/i",$output,$matches);
if ($matchFound) {
$words = array_unique($matches[0]);
foreach($words as $word) {
$output = preg_replace("/$word/","*****",$output);
}
}
return $output;
}
I know bad word filters are frowned upon, but my client has requested this.
I have a list in the database; here are a few entries:
^ass$,^asses$,^asshopper,^cock$,^coon,^cracker$,^cum$,^dick$,^fap$,^heeb$,^hell$,^homo$,^humping,^jap$,^mick$,^muff$,^paki$,^phap$,^poon$,^spic$,^tard$,^tit$,^tits$,^twat$,^vag$,ass-hat,ass-pirate,assbag
As you can see, I am using a circumflex and dollar signs for certain words.
The problem is with the first three words beginning with ass: the filter blocks the word even if I write something like glasses or grasshoppers, while everything after the first three works fine. I tried adding three entries before these in case that was the problem, but unfortunately it isn't.
Is there something wrong with how I have this written?
Extending from comment:
Try to use \b to detect words:
// Separator first: the legacy implode($array, $separator) order was removed in PHP 8.
$matchFound = preg_match_all('/\b(' . implode('|', $badWords) . ')\b/i', $output, $matches);
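As a rough illustration of the \b suggestion, the whole filter can be reduced to a single preg_replace() call. This sketch assumes the word list is stored as plain words without the ^ and $ anchors; the function name maskWholeWords() is invented for the example.

```php
<?php
// Hypothetical helper (not from the thread): masks whole-word matches only.
// Assumes $badWords holds plain words, without ^ or $ anchors.
function maskWholeWords(string $text, array $badWords): string
{
    if ($badWords === []) {
        return $text; // nothing to filter
    }
    // Escape each word so characters such as "-" or "." are matched literally.
    $escaped = array_map(fn(string $word): string => preg_quote($word, '/'), $badWords);
    // \b on both sides requires a word boundary, so "ass" no longer matches inside "glasses".
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/iu';
    return preg_replace($pattern, '*****', $text);
}

echo maskWholeWords('He wore glasses; what an ass.', ['ass', 'asses', 'ass-hat']), "\n";
```

Doing the replacement directly with the bounded pattern also avoids the second loop in the original function, which re-ran each matched word as an unanchored pattern.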

php string replace by str_replace issue

I made a function that replaces words in a string with new words taken from an array.
This is my code:
function myseo($t){
$a = array('me','lord');
$b = array('Mine','TheLord');
$theseotext = $t;
$theseotext = str_replace($a,$b, $theseotext);
return $theseotext;
}
echo myseo('This is me Mrlord');
The output is
This is Mine MrTheLord
which is wrong. It should print
This is Mine Mrlord
because the word (Mrlord) is not included in the array.
I hope I explained my issue well. Any help is appreciated.
Regards
According to the code it is correct, but you want it to isolate by word. You could simply do this:
function myseo($t){
$a = array(' me ',' lord ');
$b = array(' Mine ',' TheLord ');
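// Pad with spaces so the first and last words can match, then drop the padding again.
return substr(str_replace($a, $b, ' '.$t.' '), 1, -1);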
}
echo myseo('This is me Mrlord');
Keep in mind that this is kind of a cheap hack, since I surround the input string with spaces to ensure both sides get considered. It wouldn't work for punctuated strings. The alternative would be to break the string apart and replace each word individually.
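A regex with word boundaries is one way to replace each word individually while still coping with punctuation. The following is only a sketch of that alternative; the function name myseoWholeWords() is made up, and the word map is copied from the question.

```php
<?php
// Hypothetical helper (not from the thread): replaces only whole words.
function myseoWholeWords(string $text): string
{
    $map = ['me' => 'Mine', 'lord' => 'TheLord'];
    // Build an alternation of the escaped search words: me|lord
    $alternatives = implode('|', array_map(
        fn(string $word): string => preg_quote($word, '/'),
        array_keys($map)
    ));
    // \b requires a word boundary on both sides, so "lord" inside "Mrlord" is skipped.
    return preg_replace_callback(
        '/\b(?:' . $alternatives . ')\b/',
        fn(array $match): string => $map[$match[0]],
        $text
    );
}

echo myseoWholeWords('This is me Mrlord'), "\n";
```

The match is case-sensitive, like str_replace() in the question, so every matched word is guaranteed to be a key of the map.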
str_replace doesn't look at full words only - it looks at any matching sequence of characters.
Thus, lord matches the latter part of Mrlord.
use str_ireplace instead, it's case insensitive.

preg_replace custom ranges

The user input is stored in the variable $input.
I want to use preg_replace to swap the letters in the user input, which range from a-z, with my own custom alphabet.
The code I am trying is below:
preg_replace('/([a-z])/', "y,p,l,t,a,v,k,r,e,z,g,m,s,h,u,b,x,n,c,d,i,j,f,q,o,w", $input)
This code, however, doesn't work.
If anyone has any suggestions on how I can get this working, that would be great. Thanks
Don't jump for preg, when str is enough:
$regular = range('a', 'z');
$custom = explode(',', "y,p,l,t,a,v,k,r,e,z,g,m,s,h,u,b,x,n,c,d,i,j,f,q,o,w");
$output = str_replace($regular, $custom, $input);
Using str_replace makes a lot more sense in this case:
str_replace(
range("a", "z"), // Creates an array with all lowercase letters
explode(",", "y,p,l,t,a,v,k,r,e,z,g,m,s,h,u,b,x,n,c,d,i,j,f,q,o,w"),
$input
);
You could instead use strtr(), which avoids the problem of replacing values that have already been replaced.
echo strtr($input, 'abcdefghijklmnopqrstuvwxyz', 'ypltavkrezgmshubxncdijfqow');
With $input as yahoo the output is oyruu, as expected.
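Because strtr() translates each character exactly once and the custom alphabet is a permutation of a-z, the mapping can also be reversed by swapping the two arguments. A small sketch, with constant and function names chosen only for illustration:

```php
<?php
// Names below are hypothetical; the two alphabets are taken from the question.
const PLAIN_ALPHABET  = 'abcdefghijklmnopqrstuvwxyz';
const CUSTOM_ALPHABET = 'ypltavkrezgmshubxncdijfqow';

function encodeCustom(string $input): string
{
    // Each character is translated once; replaced characters are never re-scanned.
    return strtr($input, PLAIN_ALPHABET, CUSTOM_ALPHABET);
}

function decodeCustom(string $input): string
{
    // Swapping the alphabets inverts the mapping.
    return strtr($input, CUSTOM_ALPHABET, PLAIN_ALPHABET);
}

$encoded = encodeCustom('yahoo');
echo $encoded, "\n";               // oyruu
echo decodeCustom($encoded), "\n"; // yahoo
```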
A potential problem with the str_replace() solutions given above is that multiple replacements can occur for each character, e.g. 'a' gets replaced by 'y', and later in the same call 'y' gets replaced by 'o'. So, with those examples, 'aaa' becomes 'ooo', not the 'yyy' that might be expected, and 'yyy' becomes 'ooo' as well. The resulting string is essentially garbage. You'd never be able to convert it back, if that was a requirement.
You could get around this using two replacements.
On the first replacement you replace the $regular chars with an intermediate set of character sequences that don't exist in $input. eg. 'a' to '[[[a]]]', 'b' to '[[[b]]]', etc.
Then replace the intermediate character sequences with your $custom set of chars. eg. '[[[a]]]' to 'y', '[[[b]]]' to 'p', etc.
Like so...
$regular = range('a', 'z');
$custom = explode(',', 'y,p,l,t,a,v,k,r,e,z,g,m,s,h,u,b,x,n,c,d,i,j,f,q,o,w');
// Create an intermediate set of char (sequences) that don't exist anywhere else in the $input
// eg. '[[[a]]]', '[[[b]]]', ...
// create_function() was removed in PHP 8, so an anonymous function is used instead.
$intermediate = array_map(function ($char) { return "[[[$char]]]"; }, $regular);
// Replace the $regular chars with the $intermediate set
$output = str_replace($regular, $intermediate, $input);
// Replace the $intermediate chars with our custom set
$output = str_replace($intermediate, $custom, $output);
EDIT:
Leaving this solution for reference, but @salathe's solution to use strtr() is much better!
