preg replace would ignore non-letter characters when detecting words - php

I have an array of words and a string, and I want to add a hashtag to every word in the string that also appears in the array. I use this loop to find and replace the words:
foreach($testArray as $tag){
$str = preg_replace("~\b".$tag."~i","#\$0",$str);
}
Problem: let's say I have the words "is" and "isolated" in my array. I get ##isolated in the output, which means the word "isolated" is matched once for "is" and once for "isolated". The pattern also ignores the fact that "#isolated" no longer starts with "is"; it starts with "#".
Here is an example, but it is only an example: I don't want to solve just this case but every other possibility as well:
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
Output will be:
this #is ##isolated #is an example of this and that

You may build a regex with an alternation group enclosed with word boundaries on both ends and replace all the matches in one pass:
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
echo preg_replace('~\b(?:' . implode('|', $testArray) . ')\b~i', '#$0', $str);
// => this #is #isolated #is an example of this and that
The regex will look like
~\b(?:is|isolated|somethingElse)\b~
If you want to make your approach work, you might add a negative lookbehind after \b: "~\b(?<!#)".$tag."~i","#\$0". The lookbehind fails every match that is preceded by #.
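If the words may contain regex metacharacters, the alternation can be built more defensively. The following is only a minimal sketch: tagWords is a hypothetical helper name, not part of the answer above, and it assumes every word starts and ends with a word character so that \b behaves as expected.

```php
<?php
// Hypothetical helper: prefixes every whole-word match with '#'.
function tagWords(string $str, array $words): string
{
    if ($words === []) {
        return $str; // nothing to tag; avoids building an empty alternation
    }
    // Escape each word so characters such as '.' or '+' are matched literally;
    // '~' is passed as the delimiter so it gets escaped too.
    $escaped = array_map(function ($w) {
        return preg_quote($w, '~');
    }, $words);
    $pattern = '~\b(?:' . implode('|', $escaped) . ')\b~i';
    return preg_replace($pattern, '#$0', $str);
}

echo tagWords("this is isolated is an example of this and that",
              array('is', 'isolated', 'somethingElse'));
// => this #is #isolated #is an example of this and that
```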

One way to do that is to split your string into words and to build an associative array from your original array of words (to avoid calling in_array):
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
$hash = array_flip(array_map('strtolower', $testArray));
$parts = preg_split('~\b~', $str);
for ($i=1; $i<count($parts); $i+=2) {
$low = strtolower($parts[$i]);
if (isset($hash[$low])) $parts[$i-1] .= '#';
}
$result = implode('', $parts);
echo $result;
This way, your string is processed only once, whatever the number of words in your array.
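The same single-pass idea can also be written with preg_replace_callback instead of preg_split. This is a different technique from the split-and-rejoin loop above, shown only as a sketch; tagWordsByLookup is a hypothetical name.

```php
<?php
// Hypothetical helper: one pass over the string, one hash lookup per word.
function tagWordsByLookup(string $str, array $words): string
{
    // Lower-cased words become keys so isset() can replace in_array()
    $hash = array_flip(array_map('strtolower', $words));
    // \w+ matches each maximal run of word characters, i.e. each whole word
    return preg_replace_callback('~\w+~', function ($m) use ($hash) {
        return isset($hash[strtolower($m[0])]) ? '#' . $m[0] : $m[0];
    }, $str);
}

echo tagWordsByLookup("this is isolated is an example of this and that",
                      array('is', 'isolated', 'somethingElse'));
// => this #is #isolated #is an example of this and that
```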

Related

PHP explode string at first alphanumeric character

I have strings like this.
$str = "-=!#?Bob-Green_Smith";
$str = "-_#!?1241482";
How can I explode them at the first alphanumeric match?
e.g.:
$str = "-=!#?Bob-Green_Smith";
becomes:
$val[0] = "-=!#?";
$val[1] = "Bob-Green_Smith";
Quick thought: sometimes the string won't contain the initial run of characters,
so I'd need to check whether the first character is alphanumeric or not; otherwise Bob-Green_Smith would get split when he shouldn't.
Thanks
You can use preg_match.
The first group matches zero or more "non-word characters"; the second group matches the rest.
Note that \W does not match an underscore, so a leading _ ends the first group.
The result array has three items. The first is the full match, so I use array_shift to remove it.
$str = "-=!#?Bob-Green_Smith";
preg_match("/(\W*)(.*)/", $str, $val);
array_shift($val); // remove the full match, leaving the two groups
var_dump($val);
https://3v4l.org/m2MCg
You can do it like this:
$str = "-=!#?1Bob-Green_Smith";
preg_match('~[a-z0-9]~i', $str, $match, PREG_OFFSET_CAPTURE);
echo $subString = substr($str, $match[0][1]);
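To get both halves in one call, and to handle strings that have no leading symbols, the two ideas can be combined. A minimal sketch, assuming "alphanumeric" means ASCII letters and digits only; splitAtFirstAlnum is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns [leading non-alphanumerics, rest of string].
function splitAtFirstAlnum(string $str): array
{
    // Group 1: zero or more characters that are not ASCII letters or digits.
    // Group 2: everything from the first letter or digit onwards.
    // The 's' flag lets '.' match newlines, so the pattern always matches.
    preg_match('/^([^a-z0-9]*)(.*)$/is', $str, $m);
    return array($m[1], $m[2]);
}

var_dump(splitAtFirstAlnum("-=!#?Bob-Green_Smith")); // "-=!#?" and "Bob-Green_Smith"
```

Unlike \W, the negated class [^a-z0-9] treats the underscore as a symbol, so "-_#!?1241482" is split just before the first digit.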

Only last element of array being used when replacing text

I am trying to remove some "common" words from a large block of text, but only the last word in the array is being used. Can you see where I'm going wrong?
Thanks
$glue = strtolower ($glue);//make all lower case
//remove common words
$Maffwordlist = array('the','to','for');
foreach($Maffwordlist as $Maffword)
$filtered = preg_replace("/\s". $Maffword ."\s/", " ", $glue);
The extract above only removes 'for' from the text, 'the' and 'to' are still included.
Any help appreciated.
The problem is that the subject of your preg_replace() is always $glue, which itself never changes, so each iteration overwrites $filtered with the result of removing just one word. Before iterating over your list of words, assign the starting contents of $glue to $filtered, then use $filtered as the subject so the replacements accumulate in it.
// $filtered is the string you'll be modifying...
$filtered = strtolower($glue); // make all lower case
$Maffwordlist = array('the','to','for');
foreach($Maffwordlist as $Maffword) {
// Use $filtered (not $glue) as the subject so each pass builds on the last
$filtered = preg_replace("/\s". $Maffword ."\s/", " ", $filtered);
}
But we can do better.
A regular expression can be constructed to handle all the replacements without a loop by using an (a|b|c) grouping.
// Stick the words together with pipes
$pattern = implode("|", $Maffwordlist);
// And surround with regex delimiters and ()
// so the whole regex looks like /\s(the|to|for)\s/
$pattern = '/\s(' . $pattern . ')\s/';
// And do the operation in one go:
$filtered = preg_replace($pattern, " ", $filtered);
I'll note you may wish to use \b word boundaries instead of delimiting the words with \s whitespace. That way, you would get proper replacements in a sentence like "You should not end a sentence with for." where one of your list words appears but is not bounded by whitespace.
Finally, you'll end up with multiple consecutive spaces in places where replacements have taken place. You can collapse those into single spaces with something like the following.
// Replace multiple spaces with a single space
$filtered = preg_replace('/\s+/', ' ', $filtered);
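Putting the steps together with \b boundaries might look like the sketch below. removeCommonWords is a hypothetical helper name, and the preg_quote call is an addition that only matters if the word list could contain regex metacharacters.

```php
<?php
// Hypothetical helper combining the steps: lower-case, strip words, tidy spaces.
function removeCommonWords(string $text, array $words): string
{
    $filtered = strtolower($text);
    if ($words !== []) {
        $escaped = array_map(function ($w) {
            return preg_quote($w, '/');
        }, $words);
        // \b matches at word boundaries, so punctuation next to a word is fine
        $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/';
        $filtered = preg_replace($pattern, ' ', $filtered);
    }
    // Collapse runs of whitespace left behind by the removals
    return trim(preg_replace('/\s+/', ' ', $filtered));
}

echo removeCommonWords('Go to the shop for milk', array('the', 'to', 'for'));
// => go shop milk
```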

regex for matching three specific character

While attempting a question on SO, I tried to write a regular expression that matches strings containing all three of the given characters.
I am following the answer to "Regular Expressions: Is there an AND operator?"
<?php
$words = "systematic,gear,synthesis,mysterious";
$words=explode(",",$words);
$your_array = preg_grep("/^(^s|^m|^e)/", $words);
print_r($your_array);
?>
The output should be systematic and mysterious, but I am getting synthesis as well.
Why is that? What am I doing wrong?
** I don't want a new solution :)
You can do this:
$wordlist = 'systematic,gear,synthesis,mysterious';
$words = explode(',', $wordlist);
foreach($words as $word) {
if (preg_match('~(?=[^s]*s)(?=[^m]*m)(?=[^e]*e)~', $word))
echo '<br/>' . $word;
}
//or
$res = preg_grep('~(?=[^s]*s)(?=[^m]*m)(?=[^e]*e)~', $words);
print_r($res);
To test for the presence of a character in the string, I use (?=[^s]*s).
[^s]*s means anything that is not an "s", zero or more times, followed by an "s".
(?=..) is a lookahead assertion and means "followed by". It is only a check: a lookahead contributes no characters to the match result, and the main benefit of this feature is that you can check the same substring several times.
What is wrong with your pattern?
/^(^s|^m|^e)/ will give you only words that begin with "s" or "m" or "e", because ^ is an anchor that means "start of the string". In other words, your pattern is the same as /^([sme])/.
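The lookahead pattern can be generated for any list of required characters. A minimal sketch, where grepContainingAll is a hypothetical helper name and the leading ^ anchor is an addition so that every lookahead is checked from the start of the word:

```php
<?php
// Hypothetical helper: keeps the words that contain every listed character.
function grepContainingAll(array $words, array $chars): array
{
    $pattern = '~^';
    foreach ($chars as $c) {
        $q = preg_quote($c, '~');
        // One lookahead per character: "somewhere ahead there is this character"
        $pattern .= '(?=[^' . $q . ']*' . $q . ')';
    }
    $pattern .= '~';
    // preg_grep preserves the original keys, so re-index the result
    return array_values(preg_grep($pattern, $words));
}

print_r(grepContainingAll(
    explode(',', 'systematic,gear,synthesis,mysterious'),
    array('s', 'm', 'e')
));
// systematic and mysterious
```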

Php replace exact word

Here is my problem:
Using preg_replace('#\b(word)\b#','****',$text);
Where in text I have word\word and word, the preg_replace above replaces both word\word and word so my resulting string is ***\word and ***.
I want my string to look like : word\word and ***.
Is this possible? What am I doing wrong???
LATER EDIT
I have an array with urls, I foreach that array and preg_replace the text where url is found, but it's not working.
For instance, I have http://www.link.com and http://www.link.com/something
If I have http://www.link.com it also replaces http://www.link.com/something.
You are effectively specifying that you don't want certain characters to count as a word boundary. Therefore you need to specify the "boundaries" yourself, something like this:
preg_replace('#(^|[^\w\\\\])word([^\w\\\\]|$)#', '$1***$2', $text);
This searches for the word surrounded by string boundaries or by non-word characters other than the backslash \. Therefore it will match .word, but not .word\ and not \word. The surrounding characters are captured and put back by $1 and $2, and the backslash is written as \\\\ because both the PHP string and the regex need it escaped. If you need to exclude other characters from matching, just add them inside the brackets.
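The custom boundaries can also be expressed as lookarounds, which check the neighbouring characters without consuming them. This is a sketch only: replaceStandaloneWord and its $extra parameter are hypothetical names, and the default of treating the backslash as part of a word follows the question.

```php
<?php
// Hypothetical helper: replaces $word only when it is not touching a word
// character or any character listed in $extra (a backslash by default).
function replaceStandaloneWord(string $text, string $word, string $replacement, string $extra = '\\'): string
{
    // Characters that must NOT appear directly before or after the word
    $class = '[\w' . preg_quote($extra, '#') . ']';
    $pattern = '#(?<!' . $class . ')' . preg_quote($word, '#') . '(?!' . $class . ')#';
    // A callback keeps '$' or '\' in $replacement from being read as references
    return preg_replace_callback($pattern, function () use ($replacement) {
        return $replacement;
    }, $text);
}

echo replaceStandaloneWord('word\word and word', 'word', '***');
// => word\word and ***
```

For the URL case from the question, passing '/' as $extra keeps http://www.link.com/something from being matched when the search term is http://www.link.com.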
You could just use str_replace("word\word", "word\word and"); I don't really see why you would need preg_replace in the case given above.
Here is a simple solution that doesn't use a regex. It will ONLY replace occurrences of 'word' where it stands alone as a word.
<?php
$text = "word\word word cat dog";
$new_text = "";
$words = explode(" ",$text); // split the string into separate 'words'
$inc = 0; // loop counter
foreach($words as $word){
if($word == "word"){ // if the current word in the array of words matches the criteria, replace it
$words[$inc] = "***";
}
$new_text.= $words[$inc]." ";
$inc ++;
}
echo $new_text; // gives 'word\word *** cat dog'
?>

Identifying a random repeating pattern in a structured text string

I have a string that has the following structure:
ABC_ABC_PQR_XYZ
Where PQR has the structure:
ABC+JKL
and
ABC itself is a string that can contain alphanumeric characters and a few other characters like "_", "-", "+", "." and follows no set structure:
eg.qWe_rtY-asdf or pkl123
so, in effect, the string can look like this:
qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ
My goal is to find out what string constitutes ABC.
I was initially just using
$arrString = explode("_",$string);
to return $arrString[0] before I was made aware that ABC ($arrString[0]) itself can contain underscores, thus rendering it incorrect.
My next attempt was exploding it on "_" anyway and then comparing each of the exploded string parts with the first string part until I get a semblance of a pattern:
function getPatternABC($string)
{
$count = 0;
$pattern ="";
$arrString = explode("_", $string);
foreach($arrString as $expString)
{
if(strcmp($expString,$arrString[0])!==0 || $count==0)
{
$pattern = $pattern ."_". $arrString[$count];
$count++;
}
else break;
}
return substr($pattern,1);
}
This works great - but I wanted to know if there was a more elegant way of doing this using regular expressions?
Here is the regex solution:
'^([a-zA-Z0-9_+-]+)_\1_\1\+'
What this does is match (starting from the beginning of the string) the longest possible sequence consisting of the characters inside the square brackets (edit that per your spec). The sequence must appear exactly twice, each time followed by an underscore, and then must appear once more followed by a plus sign (this is actually the first half of PQR with the delimiter before JKL). The rest of the input is ignored.
You will find ABC captured as capture group 1.
So:
$input = 'qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ';
$result = preg_match('/^([a-zA-Z0-9_+-]+)_\1_\1\+/', $input, $matches);
if ($result) {
echo $matches[1]; // capture group 1 holds ABC
}
Sure, just make a regular expression that matches your pattern. In this case, something like this:
preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+JKL_XYZ$/', $string, $match);
Your ABC is in $match[1].
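Wrapped as a function, the backreference approach could look like the sketch below. findAbc is a hypothetical name, and the character class is an assumption based on the characters listed in the question; adjust it to the real spec.

```php
<?php
// Hypothetical helper: returns the ABC part, or null if the layout differs.
function findAbc(string $input): ?string
{
    // Group 1 is ABC; \1 requires the very same text to repeat twice more,
    // the last time followed by the '+' that starts "+JKL".
    if (preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+/', $input, $m)) {
        return $m[1];
    }
    return null;
}

echo findAbc('qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ');
// => qWe_rtY-asdf
```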
If the presence of underscores in these strings has a low frequency, it may be worth checking to see if a simple explode() will do it before bothering with regex.
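<?php
$str = 'ABC_ABC_PQR_XYZ';
if (substr_count($str, '_') == 3) {
// Exactly three underscores means ABC itself contains none.
// Index the array directly: reset() expects a variable, not a function result.
$abc = explode('_', $str)[0];
} else {
// regexy_function() is a placeholder for the regex approach
$abc = regexy_function($str);
}
?>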
