Search for matching words without false positives - php

I found this link and am working off of it, but I need to extend it a little further.
Check if string contains word in array
I am trying to create a script that checks a webpage for known bad words. I have one array with a list of bad words, and it compares it to the string from file_get_contents.
This works at a basic level, but returns false positives. For example, if I am loading a webpage with the word "title" it returns that it found the word "tit".
Is my best bet to strip all HTML and punctuation, then explode the text on spaces and put each individual word into an array? I am hoping there is a more efficient process than that.
Here is my code so far:
$url = 'http://somewebsite.com/';
$content = strip_tags(file_get_contents($url));
//list of bad words separated by commas
$badwords = 'tit,butt,etc'; //this will eventually come from a db
$badwordList = explode(',', $badwords);
foreach ($badwordList as $bad) {
    // strpos() returns 0 for a match at the very start, so compare against false
    if (strpos($content, $bad) !== false) {
        $foundWords[] = $bad;
    }
}
print_r($foundWords);
Thanks in advance!

You can just use a regex with preg_match_all():
$badwords = 'tit,butt,etc';
$regex = sprintf('/\b(%s)\b/', implode('|', explode(',', $badwords)));
if (preg_match_all($regex, $content, $matches)) {
    print_r($matches[1]);
}
The second statement creates the regex used to match and capture the required words from the webpage. First, it splits the $badwords string on commas and joins the pieces with |. The resulting string is then used as the pattern, like so: /\b(tit|butt|etc)\b/. \b (a word boundary) ensures that only whole words are matched.
This pattern matches any of those words, and the words found in the webpage are stored in the array $matches[1].
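As a minimal sketch of the same idea, the pattern can be wrapped in a helper that also escapes each word with preg_quote() and ignores case. The function name findBadWords and the sample text are assumptions for illustration, not part of the original answer, and the arrow functions assume PHP 7.4 or later.

```php
<?php
// Hypothetical helper: returns the distinct bad words that occur as whole words.
function findBadWords(string $content, array $badwordList): array
{
    if ($badwordList === []) {
        return []; // an empty alternation would match at every word boundary
    }
    // Escape each word so characters such as "." or "/" cannot alter the pattern.
    $escaped = array_map(fn($w) => preg_quote($w, '/'), $badwordList);
    $regex = '/\b(' . implode('|', $escaped) . ')\b/i';

    if (!preg_match_all($regex, $content, $matches)) {
        return [];
    }
    // Normalise case and drop duplicates.
    return array_values(array_unique(array_map('strtolower', $matches[1])));
}

$found = findBadWords('The title of this page mentions a butt and a Butt.', ['tit', 'butt', 'etc']);
print_r($found); // only "butt" is reported; "title" does not trigger "tit"
```

Escaping matters once the list comes from a database, since an entry containing a regex metacharacter would otherwise change the meaning of the pattern.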

Related

Replace whole words from blacklist array instead of partial matches

I have an array of words
$banned_names = array('about','access','account');
The actual array is very long and contains bad words, so to avoid breaking any rules I have only included an example. The issue I'm having is the following:
$title = str_ireplace($filterWords, '****', $dn1['title']);
This works; however, one of my filtered words is 'rum', and if I post the word 'forum' it is displayed as 'fo****'.
So I need to replace a word with **** only if it exactly matches a word from the array. For example, the phrase "Lets check the forum and see if anyone has rum" should become "Lets check the forum and see if anyone has ****".
Similar to the other answers but this uses \b in regex to match word boundaries (whole words). It also creates the regex-compatible banned list on the fly before passing to preg_replace_callback().
$dn1['title'] = 'access forum';
$banned_names = array('about','access','account','rum');
$banned_list = array_map(function($r) { return '/\b' . preg_quote($r, '/') . '\b/'; }, $banned_names);
$title = preg_replace_callback($banned_list, function($m) {
    return $m[0][0].str_repeat('*', strlen($m[0])-1);
}, $dn1['title']);
echo $title; //a***** forum
You can use regex with \W to match a "non-word" character:
var_dump(preg_match('/\Wrum\W/i', 'the forum thing')); // returns 0 i.e. doesn't match
var_dump(preg_match('/\Wrum\W/i', 'the rum thing')); // returns 1 i.e. matches
The preg_replace() method takes an array of filters like str_replace() does, but you'll have to adjust the list to include the pattern delimiters and the \W on both sides. You could store the full patterns statically in your list:
$banlist = ['/\Wabout\W/i','/\Waccess\W/i', ... ];
preg_replace($banlist, '****', $text);
Or adjust the array on the fly to add those bits.
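Building the patterns on the fly might look like the sketch below. Note that it swaps \W for the zero-width lookarounds (?<!\w) and (?!\w): a literal \W needs an actual character on each side, so it misses a word at the very start or end of the string and also replaces the neighbouring characters. The sample list and text are assumptions for illustration.

```php
<?php
$banned = ['about', 'access', 'rum']; // example list, not the real one

// Wrap every word in delimiters and lookarounds; preg_quote() escapes metacharacters.
$patterns = array_map(
    fn($w) => '/(?<!\w)' . preg_quote($w, '/') . '(?!\w)/i',
    $banned
);

$text = 'Rum on the forum: access denied';
$clean = preg_replace($patterns, '****', $text);
echo $clean; // **** on the forum: **** denied
```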
You can use preg_replace() with start-of-string and end-of-string anchors (^ and $) after splitting the haystack into an array of individual words, so each pattern has to match a full word. Alternatively, you can pad the filter words with spaces and continue to use str_ireplace(), but that option fails if the word is the first or last word in the string being checked.
Adding spaces (will miss first/last word, not recommended):
You'll have to modify your filtering array first of course. And yes the foreach could be simpler, but I hope this makes clear what I'm doing/why.
foreach ($filterWords as $key => $value) {
    $filterWords[$key] = " " . $value . " ";
}
$title = str_ireplace($filterWords, "****", $dn1['title']);
OR
Breaking up long string (recommended):
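foreach ($filterWords as $key => $value) {
    // anchor each pattern so it must match a whole exploded word
    $filterWords[$key] = "/^" . preg_quote($value, "/") . "$/i";
}
// preg_replace() returns an array here, so join the words back into a string
$title = implode(" ", preg_replace($filterWords, "****", explode(" ", $dn1['title'])));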

preg_match how to return matches?

According to PHP manual "If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on."
How can I return a value from a string with only knowing the first few characters?
The string is dynamic and its contents will always change, but the first four characters will always be the same.
For example how could I return "Car" from this string "TmpsCar". The string will always have "Tmps" followed by something else.
From what I understand I can return using something like this
preg_match('/(Tmps+)/', $fieldName, $matches);
echo($matches[1]);
Should return "Car".
Your regex is flawed. Use this:
preg_match('/^Tmps(.+)$/', $fieldName, $matches);
echo($matches[1]);
$matches = []; // Initialize the matches array first
if (preg_match('/^Tmps(.+)/', $fieldName, $matches)) {
    // if the regex matched the input string, echo the first captured group
    echo($matches[1]);
}
Note that this task could easily be accomplished without regex at all (with better performance): See startsWith() and endsWith() functions in PHP.
"The string will always have "Tmps" followed by something else."
You don't need a regular expression, in that case.
$result = substr($fieldName, 4);
If the first four characters are always the same, just take the portion of the string after that.
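If the prefix cannot be fully trusted, a small guard can be added before cutting it off. This is only a sketch; the helper name stripPrefix is hypothetical, and it uses strncmp() so it does not depend on the PHP 8 str_starts_with() function.

```php
<?php
// Hypothetical helper: returns the text after $prefix, or null if $value does not start with it.
function stripPrefix(string $value, string $prefix): ?string
{
    // strncmp() compares only the first strlen($prefix) bytes.
    if (strncmp($value, $prefix, strlen($prefix)) !== 0) {
        return null;
    }
    return substr($value, strlen($prefix));
}

echo stripPrefix('TmpsCar', 'Tmps'); // Car
```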
An alternative way is using the explode function
$fieldName= "TmpsCar";
$matches = explode("Tmps", $fieldName);
if (isset($matches[1])) {
    echo $matches[1]; // outputs "Car"
}
If the text you are searching contains more than just the string starting with Tmps, you can use the \w+ pattern, which matches one or more "word" characters.
This gives a regular expression like this:
/Tmps(\w+)/
and altogether in PHP:
$text = "This TmpsCars is a test";
if (preg_match('/Tmps(\w+)/', $text, $m)) {
    echo "Found:" . $m[1]; // prints "Found:Cars"
}

Split regex in to two regex: whole words only & words with substring matches only

I have the code below, which removes whole words that contain any of the patterns:
$patterns = ["are", "finite", "get", "er"];
$string = "You are definitely getting better today";
$re = '\S*('.implode('|', $patterns).')\S*';
$string = preg_replace('#'.$re.'#', '', $string);
$string = preg_replace('#\h{2,}#', ' ', $string);
echo $string;
The output of the above code is:
You today
I want to split this code into two functions, so that the first function removes only whole words that appear in the pattern list, and the second function removes only words that contain a pattern as a substring.
I expect this output from the first function, which removes only whole words:
You definitely getting better today (**are** is removed)
and this output from the second function, which removes whole words that contain a pattern:
You are today (**definitely getting better** are removed)
The first part is basic: match only whole keywords (there are dozens of Q&As covering this):
\b(?:are|finite|get|er)\b
Which can be applied to your code like this: $re = '\b('.implode('|', $patterns).')\b';
The second part is a bit more involved: you still expand substring matches to cover the entire word, but you want to exclude words that are themselves whole keywords.
We can use a negative lookahead to achieve this:
(?!\b(?:are|finite|get|er)\b)\S*(?:are|finite|get|er)\S*
Sample Code:
$patterns = ["are", "finite", "get", "er"];
$string = "You are definitely getting better today";
$alternations = ''.implode('|', $patterns);
$re = '(?!\b(?:'.$alternations.')\b)\S*(?:'.$alternations.')\S*';
$string = preg_replace('#'.$re.'#', '', $string);
If the \b does not work for you and you'd like to go with space as word boundary use lookarounds:
(?<=\s)(?:are|finite|get|er)(?=\s)
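Putting both parts together, the two functions the question asks for could be sketched as follows. The function names and the shared buildAlternation helper are assumptions for illustration; the patterns themselves are the ones given above, with preg_quote() added so keywords containing regex metacharacters stay literal.

```php
<?php
// Hypothetical helper: joins the keywords into an escaped alternation.
function buildAlternation(array $patterns): string
{
    return implode('|', array_map(fn($p) => preg_quote($p, '#'), $patterns));
}

// Removes only words that are exactly one of the keywords.
function removeWholeWords(string $string, array $patterns): string
{
    $alt = buildAlternation($patterns);
    $string = preg_replace('#\b(?:' . $alt . ')\b#', '', $string);
    return trim(preg_replace('#\h{2,}#', ' ', $string)); // collapse leftover spaces
}

// Removes words that contain a keyword but are not themselves a whole keyword.
function removeContainingWords(string $string, array $patterns): string
{
    $alt = buildAlternation($patterns);
    $re = '(?!\b(?:' . $alt . ')\b)\S*(?:' . $alt . ')\S*';
    $string = preg_replace('#' . $re . '#', '', $string);
    return trim(preg_replace('#\h{2,}#', ' ', $string));
}

$patterns = ["are", "finite", "get", "er"];
$string = "You are definitely getting better today";
echo removeWholeWords($string, $patterns), "\n";      // You definitely getting better today
echo removeContainingWords($string, $patterns), "\n"; // You are today
```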

Find words from the array in the text received through file_get_contents

I fetch a remote page:
$page = file_get_contents ('http://sayt.ru/');
There is an array of words:
$word = array ("word", "second");
How do I count how many times the words in the array occur in the text of the page?
I started digging in this direction:
$matches = array ();
$count_words = preg_match_all ('/'. $word. '/ i',$page, $matches);
But this is clearly the wrong direction, because the count is always zero. Also, preg_match_all searches for a single word, not the entire array. :(
You have to either check each word in the array separately or use a regexp like this:
$searchWords = array_map(function($w){ return preg_quote($w,'/'); }, $word);
$search = implode('|', $searchWords);
$count_words = preg_match_all('/\b(?:'.$search.')\b/i', $page, $matches);
I added a few modifications for better results: all words are escaped so they can't break the expression, and word boundaries (\b) make each word match as a whole word, not as part of another word such as "swords".
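If a count per word is wanted in addition to the total, the captured matches can be tallied afterwards. The sketch below uses a short literal string as a stand-in for the fetched page; that sample text and the $perWord variable are assumptions for illustration.

```php
<?php
$page = 'A word, a second Word, and swords.'; // stand-in for file_get_contents()
$word = ['word', 'second'];

// Escape the words and join them into one whole-word, case-insensitive pattern.
$searchWords = array_map(fn($w) => preg_quote($w, '/'), $word);
$regex = '/\b(?:' . implode('|', $searchWords) . ')\b/i';

// preg_match_all() returns the total number of matches.
$total = preg_match_all($regex, $page, $matches);

// Lower-case the matches so "Word" and "word" are counted together.
$perWord = array_count_values(array_map('strtolower', $matches[0]));

echo $total, "\n"; // 3
print_r($perWord); // word => 2, second => 1
```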

Identifying a random repeating pattern in a structured text string

I have a string that has the following structure:
ABC_ABC_PQR_XYZ
Where PQR has the structure:
ABC+JKL
and
ABC itself is a string that can contain alphanumeric characters and a few other characters like "_", "-", "+", "." and follows no set structure:
e.g. qWe_rtY-asdf or pkl123
so, in effect, the string can look like this:
qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ
My goal is to find out what string constitutes ABC.
I was initially just using
$arrString = explode("_",$string);
to return $arrString[0] before I was made aware that ABC ($arrString[0]) itself can contain underscores, thus rendering it incorrect.
My next attempt was exploding it on "_" anyway and then comparing each of the exploded parts with the first part until I get a semblance of a pattern:
function getPatternABC($string)
{
    $count = 0;
    $pattern = "";
    $arrString = explode("_", $string);
    foreach ($arrString as $expString)
    {
        if (strcmp($expString, $arrString[0]) !== 0 || $count == 0)
        {
            $pattern = $pattern . "_" . $arrString[$count];
            $count++;
        }
        else break;
    }
    return substr($pattern, 1);
}
This works great - but I wanted to know if there was a more elegant way of doing this using regular expressions?
Here is the regex solution:
'^([a-zA-Z0-9_+-]+)_\1_\1\+'
What this does is match (starting from the beginning of the string) the longest possible sequence consisting of the characters inside the square brackets (edit that per your spec). The sequence must appear exactly twice, each time followed by an underscore, and then must appear once more followed by a plus sign (this is actually the first half of PQR with the delimiter before JKL). The rest of the input is ignored.
You will find ABC captured as capture group 1.
So:
$input = 'qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ';
$result = preg_match('/^([a-zA-Z0-9_+-]+)_\1_\1\+/', $input, $matches);
if ($result) {
    echo $matches[1]; // capture group 1 holds ABC
}
Sure, just make a regular expression that matches your pattern. In this case, something like this:
preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+JKL_XYZ$/', $string, $match);
Your ABC is in $match[1].
If the presence of underscores in these strings has a low frequency, it may be worth checking to see if a simple explode() will do it before bothering with regex.
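<?php
$str = 'ABC_ABC_PQR_XYZ';
if (substr_count($str, '_') == 3) {
    // exactly three underscores means ABC itself contains none
    $abc = explode('_', $str)[0];
} else {
    $abc = regexy_function($str); // placeholder for the regex approach
}
?>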
