Is this efficient coding for anti-spam?

if(strpos($string, "A Bad Word") != false){
echo 'This word is not allowed';
}
if(strpos($string, "A Bad Word") != false){
echo 'This word is not allowed';
}
Okay, so I am trying to check the submitted data to see if it contains inappropriate words. Instead of writing five near-identical checks, is there a more efficient way?

I'm sure there's a more clever way to do this in general.
If you just want to be more concise, then it's probably best to loop over some bad words, instead of adding repetitive, almost identical, conditionals (ifs):
<?php
$banned = array('bad','words','like','these');
$looksLikeSpam = false;
foreach ($banned as $naughty) {
    // strict !== so that a match at position 0 still counts
    if (strpos($string, $naughty) !== false) {
        $looksLikeSpam = true;
    }
}
if ($looksLikeSpam) {
    echo "You're GROSS! Just... ew!";
    die();
}
Edit: Also, note that in your question's code you test strpos() != false. You really want !==, since strpos() returns 0 when the match is at the very start of the string (if the first word is, say, PENIS), and 0 == false under loose comparison. See where I'm going here?
Also, you probably want to use stripos() to be case-insensitive (unless you only care if people SHOUT offensive words) :-)
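To see the pitfall concretely, here is a minimal sketch. The sample string and the helper name containsWord() are made up for illustration and are not part of the question's code:

```php
<?php
// Hypothetical sample input: the searched word sits at offset 0.
$string = 'spam is the first word here';

$pos = strpos($string, 'spam');      // int 0, a valid match position

// Loose comparison: 0 == false, so the match is reported as "not found".
$looseSaysFound = ($pos != false);

// Strict comparison tells int 0 apart from bool false.
$strictSaysFound = ($pos !== false);

// Hypothetical helper: case-insensitive containment check via stripos().
function containsWord(string $haystack, string $needle): bool
{
    return stripos($haystack, $needle) !== false;
}
```

Here $looseSaysFound ends up false even though the word is present, while $strictSaysFound is true.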

Yes, you could make an array of badwords and build a regex out of it. This would also make handling case-insensitivity easy.
$badwords = array('staircase', 'tuna', 'pillow');
$badwords_regex = '/' . implode('|', $badwords) . '/i';
$contains_badwords = preg_match($badwords_regex, $text);

No, it's crap. There is a whole branch of computing science concerning string searching algorithms. Heck, Knuth even dedicated half of TAOCP Volume 3 to it.
Boyer-Moore is a good algorithm for finding a single needle; for searching for multiple needles in a haystack, look at multi-pattern algorithms such as Aho-Corasick.

You need to be careful with word boundaries, or else people will complain about not being able to enter words like "shuttlecock".
I hope you (or your client) realises that automatic "naughty word" filtering does not remove the need for moderating. There are lots of ways to be offensive without using any of the supposedly naughty words. Even deciding what is or is not offensive depends on the cultural context.
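As a rough sketch of the word-boundary point, the regex approach can be anchored with \b so that a banned word only matches as a whole word. The function name containsBannedWord() is an assumption for illustration; the word list reuses the placeholder words from the regex answer above:

```php
<?php
// Hypothetical helper: true if $text contains any banned word as a whole word.
function containsBannedWord(string $text, array $banned): bool
{
    if (count($banned) === 0) {
        return false; // nothing to look for
    }

    // Escape each word so characters like "." or "+" are matched literally.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $banned);

    // \b anchors the alternation to word boundaries; "i" ignores case.
    $regex = '/\b(?:' . implode('|', $escaped) . ')\b/i';

    return preg_match($regex, $text) === 1;
}
```

With 'tuna' on the list, "I ate TUNA today" is flagged, while "How fortunate" passes even though it contains the letters t-u-n-a.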

You could combine them into a single regular expression and then use preg_match() (or preg_grep(), if the input is an array of strings) to confirm their existence.

Use an array of values and iterate over the array, checking the submitted word each time. If a match is found break out of the loop and return true.
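A minimal sketch of that loop, assuming a case-insensitive substring check is what you want. The function name looksLikeSpam() and the sample words are hypothetical:

```php
<?php
// Hypothetical helper: returns true as soon as one banned word is found.
function looksLikeSpam(string $text, array $banned): bool
{
    foreach ($banned as $word) {
        if (stripos($text, $word) !== false) {
            return true; // stop at the first hit, no need to check the rest
        }
    }
    return false;
}
```

Returning from inside the loop has the same effect as break followed by a return, and it skips the remaining words once a match is known.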

You might use PHP in_array function rather than a loop, if you're checking one word. A regex would be better if you're checking a whole sentence though.
http://us2.php.net/manual/en/function.in-array.php
$bad_word_array=array('weenis','dolt','wanker');
$passed=in_array($suspected_word,$bad_word_array);

Related

alternative to if(preg_match() and preg_match())

I want to know if we can replace if(preg_match('/boo/', $anything) and preg_match('/poo/', $anything)) with a single regex. For example:
$anything = 'I contain both boo and poo!!';

From what I understand of your question, you're looking for a way to check if BOTH 'poo' and 'boo' exist within a string using only one regex. I can't think of a more elegant way than this:
preg_match('/(boo.*poo)|(poo.*boo)/', $anything);
This is the only way I can think of to ensure both patterns exist within a string regardless of order. Of course, if you knew they were always supposed to be in the same order, that would make it simpler =]
EDIT
After reading through the post linked to by MisterJ in his answer, it would seem that a simpler regex could be:
preg_match('/(?=.*boo)(?=.*poo)/', $anything);
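If the list of required terms is not fixed, the same lookahead idea can be generated from an array. This is only a sketch; matchesAllTerms() is a made-up name, and the "^" anchor and "s" modifier are my additions rather than part of the answer above:

```php
<?php
// Hypothetical helper: builds /^(?=.*term1)(?=.*term2).../s from a list of terms.
function matchesAllTerms(string $subject, array $terms): bool
{
    $lookaheads = '';
    foreach ($terms as $term) {
        // preg_quote() keeps regex metacharacters inside a term literal.
        $lookaheads .= '(?=.*' . preg_quote($term, '/') . ')';
    }

    // "^" avoids retrying at every offset; "s" lets "." cross newlines.
    return preg_match('/^' . $lookaheads . '/s', $subject) === 1;
}
```

Each lookahead is checked independently from the start of the string, so the order in which the terms appear does not matter.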

By using a pipe:
if(preg_match('/boo|poo/', $anything))

You can use the logical OR as mentioned by @sroes:
if(preg_match('/(boo)|(poo)/', $anything))
The problem there is that you don't know which one matched.
This one will match "I contain boo", "I contain poo" and "I contain boo and poo".
If you want to match only "I contain boo and poo", the problem is much harder (see "Regular Expressions: Is there an AND operator?"), and it seems that you will have to stick with the PHP test.

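To take the conditions literally:
if(preg_match('/[bp]oo.*[bp]oo/', $anything))
Note that this also matches a string that contains boo twice (or poo twice), so it is looser than the original two-call test.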

You can achieve this by altering your regular expression, as others have pointed out in other answers. However, if you want to use an array instead, so you do not have to list a long regex pattern, then use something like this:
// Default matches to false
$matches = false;
// Set the pattern array
$pattern_array = array('boo','poo');
// Loop through the patterns to match
foreach($pattern_array as $pattern){
// Test if the string is matched
if(preg_match('/'.$pattern.'/', $anything)){
// Set matches to true
$matches = true;
}
}
// Proceed if matches is true
if($matches){
// Do your stuff here
}
Alternatively, if you are only trying to match strings then it would be much more efficient if you were to use strpos like so:
// Default matches to false
$matches = false;
// Set the strings to match
$strings_to_match = array('boo','poo');
foreach($strings_to_match as $string){
if(strpos($anything, $string) !== false){
// Set matches to true
$matches = true;
}
}
Try to avoid regular expressions where a plain substring check will do, as they are generally slower!

Match array values against text [duplicate]

I have an array full of patterns that I need matched. Is there any way to do that other than a for() loop? I'm trying to do it in the least CPU-intensive way, since I will be doing dozens of these every minute.
A real-world example: I'm building a link status checker, which will check links to various online video sites to ensure that the videos are still live. Each domain has several "dead keywords"; if these are found in the HTML of a page, that means the file was deleted. These are stored in the array. I need to match the contents of the array against the HTML output of the page.

First of all, if you literally are only doing dozens every minute, then I wouldn't worry terribly about the performance in this case. These matches are pretty quick, and I don't think you're going to have a performance problem by iterating through your patterns array and calling preg_match separately like this:
$matches = false;
foreach ($pattern_array as $pattern)
{
if (preg_match($pattern, $page))
{
$matches = true;
}
}
You can indeed combine all the patterns into one using the or operator like some people are suggesting, but don't just slap them together with a |. This will break badly if any of your patterns contain the or operator.
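I would recommend at least grouping your patterns using parentheses, like:
$grouped_patterns = array();
foreach ($patterns as $pattern)
{
    // assumes each $pattern is a bare expression without delimiters
    $grouped_patterns[] = "(" . $pattern . ")";
}
// separator first: implode(array, string) is not accepted by PHP 8
$master_pattern = "/" . implode("|", $grouped_patterns) . "/";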
But... I'm not really sure if this ends up being faster. Something has to loop through them, whether it's the preg_match or PHP. If I had to guess I'd guess that individual matches would be close to as fast and easier to read and maintain.
Lastly, if performance is what you're looking for here, I think the most important thing to do is pull out the non regex matches into a simple "string contains" check. I would imagine that some of your checks must be simple string checks like looking to see if "This Site is Closed" is on the page.
So doing this:
foreach ($strings_to_match as $string_to_match)
{
if (strpos($page, $string_to_match) !== false)
{
// etc.
break;
}
}
foreach ($pattern_array as $pattern)
{
if (preg_match($pattern, $page))
{
// etc.
break;
}
}
and avoiding as many preg_match() as possible is probably going to be your best gain. strpos() is a lot faster than preg_match().
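Put together, the two loops could be wrapped in one function along these lines. This is a sketch under assumptions: pageLooksDead() and the sample keywords are hypothetical, and each regex is assumed to be a complete pattern including delimiters:

```php
<?php
// Hypothetical helper: cheap substring checks first, regexes only if needed.
function pageLooksDead(string $page, array $plainStrings, array $regexes): bool
{
    foreach ($plainStrings as $needle) {
        if (strpos($page, $needle) !== false) {
            return true; // a plain "dead keyword" was found
        }
    }

    foreach ($regexes as $regex) {
        // Assumed to be a full pattern such as '/video (was )?removed/i'.
        if (preg_match($regex, $page) === 1) {
            return true;
        }
    }

    return false;
}
```

Because the function returns at the first hit, pages that contain a plain keyword never reach the preg_match() calls at all.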

// assuming you have something like this
$patterns = array('a','b','\w');
// converts the array into a regex friendly or list
$patterns_flattened = implode('|', $patterns);
if ( preg_match('/'. $patterns_flattened .'/', $string, $matches) )
{
}
// PS: that's off the top of my head, I didn't check it in a code editor

If your patterns don't contain much whitespace, another option would be to eschew the arrays and use the /x modifier. Now your list of regular expressions would look like this:
$regex = "/
pattern1| # search for occurrences of 'pattern1'
pa..ern2| # wildcard search for occurrences of 'pa..ern2'
pat[ ]tern| # search for 'pat tern', whitespace is escaped
mypat # Note that the last pattern does NOT have a pipe char
/x";
With the /x modifier, whitespace is completely ignored, except when in a character class or preceded by a backslash. Comments like above are also allowed.
This would avoid the looping through the array.
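For illustration, here is how such an extended-mode pattern might be used with preg_match(). The test strings are invented, and the comments inside the pattern must not contain the "/" delimiter:

```php
<?php
// Extended-mode pattern: whitespace and "#" comments are ignored by the engine.
$regex = '/
    pattern1   |  # literal text
    pa..ern2   |  # two wildcard characters in the middle
    pat[ ]tern |  # a literal space, kept because it sits in a character class
    mypat         # last alternative, no trailing pipe
/x';

// Hypothetical inputs to show a hit and a miss.
$hit  = preg_match($regex, 'this has a pat tern inside') === 1;
$miss = preg_match($regex, 'nothing relevant') === 1;
```

A single-quoted string is used here so that PHP does not try to interpolate or unescape anything inside the pattern.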

If you're merely searching for the presence of a string in another string, use strpos as it is faster.
Otherwise, you could just iterate over the array of patterns, calling preg_match each time.

If you have a bunch of patterns, what you can do is concatenate them in a single regular expression and match that. No need for a loop.

What about doing a str_replace() on the HTML you get, using your array, and then checking if the result is equal to the original? This would be very fast:
$sites = array(
'you_tube' => array('dead', 'moved'),
...
);
foreach ($sites as $site => $deadArray) {
// get $html
if ($html == str_replace($deadArray, '', $html)) {
// video is live
}
}

You can combine all the patterns from the list into a single regular expression using PHP's implode() function, then test your string in one go using preg_match().
$patterns = array(
'abc',
'\d+h',
'[abc]{6,8}\-\s*[xyz]{6,8}',
);
$master_pattern = '/(' . implode(')|(', $patterns) . ')/';
if(preg_match($master_pattern, $string_to_check))
{
//do something
}
Of course there could be even less code using implode() inline in "if()" condition instead of $master_pattern variable.

Check for word in string

What is the best way to search for a word in a string?
preg_match("/word/",$string)
stripos($string,"word")
Or is there a better way?

One benefit of using a regexp for this job is the ability to use \b (the regexp word boundary) and other variations. If you are only looking for that sequence of letters anywhere in a string, stripos() is likely to be a little faster.
$tests = array("word", "worded", "This also has the word.", "Words are not the same", "Word capitalized should match");
foreach ($tests as $string)
{
echo "Testing \"$string\": Regexp:";
echo preg_match("/\bword\b/i", $string) ? "Matched" : "Failed";
echo " stripos:";
echo stripos($string, "word") !== false ? "Matched" : "Failed";
echo "\n";
}
Results:
Testing "word": Regexp:Matched stripos:Matched
Testing "worded": Regexp:Failed stripos:Matched
Testing "This also has the word.": Regexp:Matched stripos:Matched
Testing "Words are not the same": Regexp:Failed stripos:Matched
Testing "Word capitalized should match": Regexp:Matched stripos:Matched

Like it says in the Notes for preg_match:
Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.

If you are simply looking for a substring, stripos() or strpos() and friends are much better than using the preg family of functions.

For simple string matching the PHP string functions offer more performance. Regex is more heavyweight and therefore has lower performance.
Having said that, in most cases, the performance difference is small enough to go unnoticed, unless you're looping over an array with hundreds of thousands of elements or more.
Of course, as soon as you start needing "cleverer" matching, regex becomes the only game in town.

There is also substr_count($haystack, $needle), which just returns the number of substring occurrences. It has the added bonus that you don't have to worry about 0 equating to false, as you do with stripos() when the first occurrence is at position 0, although that's not a problem if you use strict comparison (!==). Note that substr_count() is case-sensitive.
http://php.net/manual/en/function.substr-count.php
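A small sketch of the substr_count() approach, with made-up sample strings. Lowercasing both sides with strtolower() is one possible way to get a case-insensitive count and is my assumption, not something the answer above prescribes:

```php
<?php
// Hypothetical sample text: the word appears at position 0 and once more later.
$haystack = 'word at the start, and another word later';

$count = substr_count($haystack, 'word'); // case-sensitive count
$found = $count > 0;                      // no 0-versus-false ambiguity here

// Case-insensitive variant: normalise the haystack before counting.
$countInsensitive = substr_count(strtolower('Word and WORD'), 'word');
```

A count of 0 simply means "not found", so a plain greater-than-zero test is enough.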

How can I match ALL terms using Regular Expressions in PHP?

I figured out how to check an OR case, preg_match( "/(word1|word2|word3)/i", $string );. What I can't figure out is how to match an AND case. I want to check that the string contains ALL the terms (case-insensitive).

It's possible to do an AND match in a single regex using lookahead, eg.:
preg_match('/^(?=.*word1)(?=.*word2)(?=.*word3)/i', $string)
however, it's probably clearer and maybe faster to just do it outside regex:
preg_match('/word1/i', $string) && preg_match('/word2/i', $string) && preg_match('/word3/i', $string)
or, if your target strings are as simple as word1:
stripos($string, 'word1')!==FALSE && stripos($string, 'word2')!==FALSE && stripos($string, 'word3')!==FALSE

I am thinking about a situation in your question that may cause problems for the AND case.
This is the situation:
words = "abcd","cdef","efgh"
and they all have to match in the string:
string = "abcdefgh"
The words overlap, so maybe you should not be using regular expressions.

If you know the order that the terms will appear in, you could use something like the following:
preg_match("/(word1).*(word2).*(word3)/i", $string);
If the order of terms isn't defined, you will probably be best using 3 separate expressions and checking that they all matched. A single expression is possible but likely complicated.

preg_match("/word1.*word2.*word3/i", $string);
This works, but the terms must appear in the stated order. You could of course list every ordering as an alternative:
preg_match("/(word1.*word2.*word3|word1.*word3.*word2|word2.*word3.*word1|word2.*word1.*word3|word3.*word2.*word1|word3.*word1.*word2)/i", $string);
But that's pretty horrendous and you'd have to be crazy! It would be nicer to just use strpos($haystack, $needle) in a loop, or multiple regex matches.
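The loop version might look roughly like this. The name containsAllTerms() is hypothetical, and stripos() is used instead of strpos() because the question asks for a case-insensitive match:

```php
<?php
// Hypothetical helper: true only if every term occurs somewhere in $haystack.
function containsAllTerms(string $haystack, array $needles): bool
{
    foreach ($needles as $needle) {
        if (stripos($haystack, $needle) === false) {
            return false; // one missing term is enough to fail
        }
    }
    return true;
}
```

Each term is looked up independently, so order does not matter and overlapping terms (such as "abcd", "cdef" and "efgh" in "abcdefgh") are all found.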
