PHP checking multiple words in a long one word string - php

using strpos, I can find one word in a string, but how can I find many words?
I know how to do it when the string contains words that are separated with space, but for example, if I have array('hi','how','are','you') and a string $string = 'hihowareyoudoingtoday?'
How can I return the total amount found to match?

You can use preg_match_all that returns the number of matches:
$words = array('hi','how','are','you');
$nb = preg_match_all('~' . implode('|', $words) . '~', $string, $m);
print_r($m[0]);
echo "\n$nb";
preg_match_all is a function that searches all occurences of a pattern in a string. In the present example, the pattern is:
~hi|how|are|you~
the ~ is only a pattern delimiter.
| is a logical OR
Note that the search is performed form left to right character by character in the string and each alternatives are tested until one matches. So the first word that matches in a position is stored as a match result (into $m).
Understanding this mechanism is important. For example, if you have the string baobab and the pattern ~bao|baobab~, the result will be only bao since it is the first tested alternative.
In other words, if you want to obtain the largest words first, you need to sort the array by size before.

Yes, you can use strpos() in this case:
Example:
$haystack = 'hihowareyoudoingtoday?';
$needles = array('hi','how','are','you');
$matches = 0;
foreach($needles as $needle) { // create a loop, foreach string
if(strpos($haystack, $needle) !== false) { // use stripos and compare it in the parent string
$matches++; // if it matches then increment.
}
}
echo $matches; // 4

Related

Replace whole words from blacklist array instead of partial matches

I have an array of words
$banned_names = array('about','access','account');
The actual array is very long a contains bad words so at risk of breaking any rule I just added an example, the issue I'm having is the following:
$title = str_ireplace($filterWords, '****', $dn1['title']);
This works however, one of my filtered words is 'rum' and if I was to post the word 'forum' it will display as 'fo****'
So I need to only replace the word with **** if it matches the exact word from the array, if I was to give an example the phrase "Lets check the forum and see if anyone has rum", would be "Lets check the forum and see if anyone has ****".
Similar to the other answers but this uses \b in regex to match word boundaries (whole words). It also creates the regex-compatible banned list on the fly before passing to preg_replace_callback().
$dn1['title'] = 'access forum';
$banned_names = array('about','access','account','rum');
$banned_list = array_map(function($r) { return '/\b' . preg_quote($r, '/') . '\b/'; }, $banned_names);
$title = preg_replace_callback($banned_list, function($m) {
return $m[0][0].str_repeat('*', strlen($m[0])-1);
}, $dn1['title']);
echo $title; //a***** forum
You can use regex with \W to match a "non-word" character:
var_dump(preg_match('/\Wrum\W/i', 'the forum thing')); // returns 0 i.e. doesn't match
var_dump(preg_match('/\Wrum\W/i', 'the rum thing')); // returns 1 i.e. matches
The preg_replace() method takes an array of filters like str_replace() does, but you'll have to adjust the list to include the pattern delimiters and the \W on both sides. You could store the full patterns statically in your list:
$banlist = ['/\Wabout\W/i','/\Waccess\W/i', ... ];
preg_replace($banlist, '****', $text);
Or adjust the array on the fly to add those bits.
You can use preg_replace() to look for your needles with a beginning/end of string tag after converting each string in your haystack to an array of strings, so you'll be matching on full words. Alternatively you can add spaces and continue to use str_ireplace() but that option would fail if your word is the first or last word in the string being checked.
Adding spaces (will miss first/last word, not reccomended):
You'll have to modify your filtering array first of course. And yes the foreach could be simpler, but I hope this makes clear what I'm doing/why.
foreach($filterWords as $key => $value){
$filterWords[$key] = " ".$value." ";
}
str_ireplace ( $filterWords, "****", $dn1['title'] );
OR
Breaking up long string (recommended):
foreach($filterWords as $key => $value){
$filterWords[$key] = "/^".$value."$/i"; //add regex for beginning/end of string value
}
preg_replace ( $filterWords, "****", explode(" ", $dn1['title']) );

preg_match how to return matches?

According to PHP manual "If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on."
How can I return a value from a string with only knowing the first few characters?
The string is dynamic and will always change whats inside, but the first four character will always be the same.
For example how could I return "Car" from this string "TmpsCar". The string will always have "Tmps" followed by something else.
From what I understand I can return using something like this
preg_match('/(Tmps+)/', $fieldName, $matches);
echo($matches[1]);
Should return "Car".
Your regex is flawed. Use this:
preg_match('/^Tmps(.+)$/', $fieldName, $matches);
echo($matches[1]);
$matches = []; // Initialize the matches array first
if (preg_match('/^Tmps(.+)/', $fieldName, $matches)) {
// if the regex matched the input string, echo the first captured group
echo($matches[1]);
}
Note that this task could easily be accomplished without regex at all (with better performance): See startsWith() and endsWith() functions in PHP.
"The string will always have "Tmps" followed by something else."
You don't need a regular expression, in that case.
$result = substr($fieldName, 4);
If the first four characters are always the same, just take the portion of the string after that.
An alternative way is using the explode function
$fieldName= "TmpsCar";
$matches = explode("Tmps", $fieldName);
if(isset($matches[1])){
echo $matches[1]; // return "Car"
}
Given that the text you are looking in, contains more than just a string, starting with Tmps, you might look for the \w+ pattern, which matches any "word" char.
This would result in such an regular expression:
/Tmps(\w+)/
and altogether in php
$text = "This TmpsCars is a test";
if (preg_match('/Tmps(\w+)/', $text, $m)) {
echo "Found:" . $m[1]; // this would return Cars
}

Search for matching words without false positivis

I found this link and am working off of it, but I need to extend it a little further.
Check if string contains word in array
I am trying to create a script that checks a webpage for known bad words. I have one array with a list of bad words, and it compares it to the string from file_get_contents.
This works at a basic level, but returns false positives. For example, if I am loading a webpage with the word "title" it returns that it found the word "tit".
Is my best bet to strip all html and punctuation, then explode it based on spaces and put each individual word into an array? I am hoping there is a more efficient process then that.
Here is my code so far:
$url = 'http://somewebsite.com/';
$content = strip_tags(file_get_contents($url));
//list of bad words separated by commas
$badwords = 'tit,butt,etc'; //this will eventually come from a db
$badwordList = explode(',', $badwords);
foreach($badwordList as $bad) {
$place = strpos($content, $bad);
if (!empty($place)) {
$foundWords[] = $bad;
}
}
print_r($foundWords);
Thanks in advance!
You can just use a regex with preg_match_all():
$badwords = 'tit,butt,etc';
$regex = sprintf('/\b(%s)\b/', implode('|', explode(',', $badwords)));
if (preg_match_all($regex, $content, $matches)) {
print_r($matches[1]);
}
The second statement creates the regex which we are using to match and capture the required words off the webpage. First, it splits the $badwords string on commas, and join them with |. This resulting string is then used as the pattern like so: /\b(tits|butt|etc)\b/. \b (which is a word boundary) will ensure that only whole words are matched.
This regex pattern would match any of those words, and the words which are found in the webpage, will be stored in array $matches[1].

PHP Replace a word with a random word from a list?

Using PHP, I'm looking to take a piece of text and search through it and replace certain words with another word from the list.
e.g.
Search through the text to find any word in this list:
pretty,beautiful,gorgeous,lovely,attractive,appealing
and then replace this word with another from the same list
(but not selecting the same word).
Hope this makes sense!
thanks in advance.
you could use preg_replace_callback:
$random_string = '…';
$needle = array('pretty', 'beautiful', 'gorgeous', 'lovely', 'attractive', 'appealing');
$new_string = preg_replace_callback(
array_map(
function($v) { return '/'.preg_quote($v).'/'; }, // assuming $needle does not contain '/'
$needle),
function($matches) use($needle) {
do {
$new = $needle[rand(0, count($needle)-1)];
while($new != $matches[0]) {
return $new;
},
$random_string);
to make sure your $needle array does not contain characters which have a special meaning in a regular expression, we call preg_quote on each item of the array before searching.
instead of doing a do{}while() loop you could also copy the array and remove the matched word (pretty much depends on the actual data: few items → copy&remove, many items → pick one random until it's different from the match)

Identifying a random repeating pattern in a structured text string

I have a string that has the following structure:
ABC_ABC_PQR_XYZ
Where PQR has the structure:
ABC+JKL
and
ABC itself is a string that can contain alphanumeric characters and a few other characters like "_", "-", "+", "." and follows no set structure:
eg.qWe_rtY-asdf or pkl123
so, in effect, the string can look like this:
qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ
My goal is to find out what string constitutes ABC.
I was initially just using
$arrString = explode("_",$string);
to return $arrString[0] before I was made aware that ABC ($arrString[0]) itself can contain underscores, thus rendering it incorrect.
My next attempt was exlpoding it on "_" anyway and then comparing each of the exploded string parts with the first string part until I get a semblance of a pattern:
function getPatternABC($string)
{
$count = 0;
$pattern ="";
$arrString = explode("_", $string);
foreach($arrString as $expString)
{
if(strcmp($expString,$arrString[0])!==0 || $count==0)
{
$pattern = $pattern ."_". $arrString[$count];
$count++;
}
else break;
}
return substr($pattern,1);
}
This works great - but I wanted to know if there was a more elegant way of doing this using regular expressions?
Here is the regex solution:
'^([a-zA-Z0-9_+-]+)_\1_\1\+'
What this does is match (starting from the beginning of the string) the longest possible sequence consisting of the characters inside the square brackets (edit that per your spec). The sequence must appear exactly twice, each time followed by an underscore, and then must appear once more followed by a plus sign (this is actually the first half of PQR with the delimiter before JKL). The rest of the input is ignored.
You will find ABC captured as capture group 1.
So:
$input = 'qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ';
$result = preg_match('/^([a-zA-Z0-9_+-]+)_\1_\1\+/', $input, $matches);
if ($result) {
echo $matches[2];
}
See it in action.
Sure, just make a regular expression that matches your pattern. In this case, something like this:
preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+JKL_XYZ$/', $string, $match);
Your ABC is in $match[1].
If the presence of underscores in these strings has a low frequency, it may be worth checking to see if a simple explode() will do it before bothering with regex.
<?php
$str = 'ABC_ABC_PQR_XYZ';
if(substr_count($str, '_') == 3)
$abc = reset(explode('_', $str));
else
$abc = regexy_function($str);
?>

Categories