I have a list of words in an array. What is the fastest way to check if any of these words exist in a string?
Currently, I am checking for each array element one by one with stripos() in a foreach loop. I am curious whether there is a faster method, like what we can do with str_replace() using an array.
Regarding your additional comment: you could split your string into single words using explode() or preg_split() and then check the resulting array against the needles array using array_intersect(). That way all the work is done only once.
<?php
$haystack = "Hello Houston, we have a problem";
$haystacks = preg_split("/\b/", $haystack);
$needles = array("Chicago", "New York", "Houston");
$intersect = array_intersect($haystacks, $needles);
$count = count($intersect);
var_dump($count, $intersect);
I would expect array_intersect() to be fairly fast, but it depends on what you really want (matching whole words, matching fragments, ...).
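A minimal sketch of the same idea with case-insensitive, whole-word matching. The helper name matchingWords() is made up for illustration, and splitting on \W+ is an assumption about what counts as a word. Note that a multi-word needle such as "New York" can never match this way, because the haystack is compared one word at a time.

```php
<?php
// Hypothetical helper: returns the (lowercased) needles that occur as whole words in $haystack.
function matchingWords(string $haystack, array $needles): array
{
    // Split on runs of non-word characters and drop empty pieces.
    $words = preg_split('/\W+/', strtolower($haystack), -1, PREG_SPLIT_NO_EMPTY);
    $lowerNeedles = array_map('strtolower', $needles);
    // array_intersect() preserves the keys of its first argument, so reindex the result.
    return array_values(array_unique(array_intersect($lowerNeedles, $words)));
}

var_dump(matchingWords("Hello Houston, we have a problem", array("Chicago", "New York", "Houston")));
```

For this input the only match is "houston".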
my personal function:
function wordsFound($haystack,$needles) {
return preg_match('/\b('.implode('|',$needles).')\b/i',$haystack);
}
//> Usage:
if (wordsFound('string string string',array('words')))
Note 1: if you work with UTF-8 strings, you need to replace \b with a UTF-8-aware word boundary.
Note 2: be sure to put only a-z0-9 characters in $needles (thanks to MonkeyMonkey); otherwise you need to preg_quote() each needle first.
Note 3: this function is case-insensitive thanks to the i modifier.
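The notes above could be folded into the function roughly like this. It is a sketch, not the author's code: wordsFoundQuoted() is a hypothetical name, each needle is passed through preg_quote(), and the u modifier is added so that pattern and subject are treated as UTF-8. Whether \b then follows Unicode word rules depends on how PCRE was built, so test that with your own data.

```php
<?php
// Hypothetical variant of wordsFound() that escapes each needle before building the regex.
function wordsFoundQuoted(string $haystack, array $needles): bool
{
    if ($needles === array()) {
        return false; // an empty alternation would match everywhere
    }
    $escaped = array_map(function ($needle) {
        return preg_quote($needle, '/'); // second argument also escapes the delimiter
    }, $needles);
    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    return preg_match('/\b(?:' . implode('|', $escaped) . ')\b/iu', $haystack) === 1;
}

var_dump(wordsFoundQuoted('string string string', array('words')));
var_dump(wordsFoundQuoted('I paid $5.00 today', array('5.00')));
```

Comparing the result with === 1 also treats a preg_match() failure (for example invalid UTF-8 in the subject) as "not found".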
In general, regular expressions are slower than basic string functions like stripos(). But I think it really depends on the situation. If you really need the maximum performance, I suggest running some tests with real-world data.
In PHP I have string with nested brackets:
bar[foo[test[abc][def]]bar]foo
I need a regex that matches the inner bracket-pairs first, so the order in which preg_match_all finds the matching bracket-pairs should be:
[abc]
[def]
[test[abc][def]]
[foo[test[abc][def]]bar]
All texts may vary.
Is this even possible with preg_match_all ?
This is not possible with regular expressions. No matter how complex your regex, it will always return the left-most match first.
At best, you'd have to use multiple regexes, but even then you're going to have trouble because regexes can't really count matching brackets. Your best bet is to parse this string some other way.
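One way to "parse this string some other way" is a single pass with a stack of open-bracket positions. The sketch below is only an illustration (bracketPairsInnerFirst() is a made-up name); it reports each pair when its closing bracket is reached, which naturally lists inner pairs before the pairs that enclose them. Unmatched closing brackets are silently skipped, which is an assumption about how malformed input should be handled.

```php
<?php
// Hypothetical helper: lists bracket pairs in the order their closing bracket appears.
function bracketPairsInnerFirst(string $s): array
{
    $stack = array();  // positions of '[' that have not been closed yet
    $pairs = array();
    $length = strlen($s);
    for ($i = 0; $i < $length; $i++) {
        if ($s[$i] === '[') {
            $stack[] = $i;
        } elseif ($s[$i] === ']' && $stack !== array()) {
            $start = array_pop($stack); // the most recent unclosed '['
            $pairs[] = substr($s, $start, $i - $start + 1);
        }
    }
    return $pairs;
}

print_r(bracketPairsInnerFirst('bar[foo[test[abc][def]]bar]foo'));
```

For the string from the question this yields [abc], [def], [test[abc][def]] and [foo[test[abc][def]]bar], in that order.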
It is not evident from your question what kind of "structure of matches" you want, but simple arrays are enough. Try
preg_match_all('#\[([a-z\)\(]+?)\]#',$original,$m);
that, for $original = 'bar[foo[test[abc][def]]bar]foo' returns an array with "abc" and "def", the inner ones.
For your output, you need a loop for the "parsing task".
PCRE with preg_replace_callback is better for parsing.
Perhaps this loop is a good clue for your problem,
$original = 'bar[foo[test[abc][def]]bar]foo';
for( $aux=$oldAux=$original;
$oldAux!=($aux=printInnerBracket($aux));
$oldAux=$aux
);
print "\n-- $aux";
function printInnerBracket($s) {
return preg_replace_callback(
'#\[([a-z\)\(]+?)\]#', // the only one regular expression
function($m) {
print "\n$m[0]";
return "($m[1])";
},
$s
);
}
Result (the callback print):
[abc]
[def]
[test(abc)(def)]
[foo(test(abc)(def))bar]
-- bar(foo(test(abc)(def))bar)foo
See also this related question.
I have an array full of patterns that I need matched. Is there any way to do that other than a for() loop? I'm trying to do it in the least CPU-intensive way, since I will be doing dozens of these every minute.
A real-world example: I'm building a link status checker, which will check links to various online video sites to ensure that the videos are still live. Each domain has several "dead keywords"; if these are found in the HTML of a page, that means the file was deleted. These are stored in the array. I need to match the contents of the array against the HTML output of the page.
First of all, if you literally are only doing dozens every minute, then I wouldn't worry terribly about the performance in this case. These matches are pretty quick, and I don't think you're going to have a performance problem by iterating through your patterns array and calling preg_match separately like this:
$matches = false;
foreach ($pattern_array as $pattern)
{
if (preg_match($pattern, $page))
{
$matches = true;
}
}
You can indeed combine all the patterns into one using the or operator like some people are suggesting, but don't just slap them together with a |. This will break badly if any of your patterns contain the or operator.
I would recommend at least grouping your patterns using parentheses like:
foreach ($patterns as $pattern)
{
$grouped_patterns[] = "(" . $pattern . ")";
}
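// Separator first: the legacy implode($array, $separator) order was removed in PHP 8.
// The result still needs regex delimiters before it is passed to preg_match().
$master_pattern = implode("|", $grouped_patterns);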
But... I'm not really sure if this ends up being faster. Something has to loop through them, whether it's the preg_match or PHP. If I had to guess I'd guess that individual matches would be close to as fast and easier to read and maintain.
Lastly, if performance is what you're looking for here, I think the most important thing to do is pull out the non regex matches into a simple "string contains" check. I would imagine that some of your checks must be simple string checks like looking to see if "This Site is Closed" is on the page.
So doing this:
foreach ($strings_to_match as $string_to_match)
{
if (strpos($page, $string_to_match) !== false)
{
// etc.
break;
}
}
foreach ($pattern_array as $pattern)
{
if (preg_match($pattern, $page))
{
// etc.
break;
}
}
and avoiding as many preg_match() as possible is probably going to be your best gain. strpos() is a lot faster than preg_match().
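Wrapped up as one function, the two loops might look like the sketch below. The name pageLooksDead() and the split into $literals and $patterns are assumptions for illustration; the patterns are expected to be complete regexes including delimiters.

```php
<?php
// Hypothetical helper: plain substrings are checked first, regexes only if none of them hit.
function pageLooksDead(string $page, array $literals, array $patterns): bool
{
    foreach ($literals as $literal) {
        // Skip empty needles; strict comparison so a match at position 0 counts.
        if ($literal !== '' && strpos($page, $literal) !== false) {
            return true;
        }
    }
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $page) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(pageLooksDead('<h1>This Site is Closed</h1>', array('This Site is Closed'), array()));
```

Returning as soon as something matches plays the same role as the break statements above.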
// assuming you have something like this
$patterns = array('a','b','\w');
// converts the array into a regex friendly or list
$patterns_flattened = implode('|', $patterns);
if ( preg_match('/'. $patterns_flattened .'/', $string, $matches) )
{
}
// PS: that's off the top of my head, I didn't check it in a code editor
If your patterns don't contain much whitespace, another option would be to eschew the arrays and use the /x modifier. Now your list of regular expressions would look like this:
$regex = "/
pattern1| # search for occurrences of 'pattern1'
pa..ern2| # wildcard search for occurrences of 'pa..ern2'
pat[ ]tern| # search for 'pat tern'; the space is kept because it is inside a character class
mypat # Note that the last pattern does NOT have a pipe char
/x";
With the /x modifier, whitespace is completely ignored, except when in a character class or preceded by a backslash. Comments like above are also allowed.
This would avoid the looping through the array.
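A small runnable sketch of the /x approach. The keywords are invented for illustration, and the i modifier is added here only so that case does not matter.

```php
<?php
// Invented keywords; in /x mode unescaped whitespace and # comments are ignored by the regex engine.
$regex = '/
    deleted        # plain keyword
  | not[ ]found    # the space sits in a character class, so it is kept
  | removed\ by    # a backslash-escaped space is kept as well
/xi';

var_dump(preg_match($regex, 'Video Not Found'));  // int(1)
var_dump(preg_match($regex, 'notfound'));         // int(0)
var_dump(preg_match($regex, 'removed by user'));  // int(1)
```

Keep the delimiter character out of the comments, since it would end the pattern early.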
If you're merely searching for the presence of a string in another string, use strpos as it is faster.
Otherwise, you could just iterate over the array of patterns, calling preg_match each time.
If you have a bunch of patterns, what you can do is concatenate them in a single regular expression and match that. No need for a loop.
What about doing a str_replace() on the HTML you get, using your array, and then checking whether the result is equal to the original? If nothing was replaced, none of the keywords is present. This should be quite fast:
$sites = array(
'you_tube' => array('dead', 'moved'),
...
);
foreach ($sites as $site => $deadArray) {
// get $html
if ($html == str_replace($deadArray, '', $html)) {
// video is live
}
}
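A variation on the same trick, sketched under the assumption that only a yes/no answer is needed: str_replace() accepts a fourth, by-reference argument that receives the number of replacements, so the string comparison can be skipped. The function name containsDeadKeyword() is made up for illustration.

```php
<?php
// Hypothetical helper: true if any of $deadKeywords occurs in $html (case-sensitive).
function containsDeadKeyword(string $html, array $deadKeywords): bool
{
    $count = 0;
    str_replace($deadKeywords, '', $html, $count); // $count receives the number of replacements
    return $count > 0;
}

var_dump(containsDeadKeyword('<p>This video was moved</p>', array('dead', 'moved')));
```

Unlike a loop with break, this always scans for every keyword, so it does not stop at the first hit.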
You can combine all the patterns from the list into a single regular expression using PHP's implode() function, then test your string in one go using preg_match().
$patterns = array(
'abc',
'\d+h',
'[abc]{6,8}\-\s*[xyz]{6,8}',
);
$master_pattern = '/(' . implode(')|(', $patterns) . ')/';
if(preg_match($master_pattern, $string_to_check))
{
//do something
}
Of course there could be even less code using implode() inline in "if()" condition instead of $master_pattern variable.
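A slightly more defensive sketch of the same idea, using non-capturing groups so that no capture numbers are introduced. buildMasterPattern() is a hypothetical name, and the code assumes that the fragments carry no delimiters of their own and never contain the ~ character used as the delimiter here.

```php
<?php
// Hypothetical helper: joins delimiter-free pattern fragments into one regex.
function buildMasterPattern(array $fragments): string
{
    $grouped = array();
    foreach ($fragments as $fragment) {
        $grouped[] = '(?:' . $fragment . ')'; // non-capturing group keeps each fragment isolated
    }
    return '~' . implode('|', $grouped) . '~';
}

$master = buildMasterPattern(array('abc', '\d+h', '[abc]{6,8}\-\s*[xyz]{6,8}'));
var_dump(preg_match($master, 'it took 12h'));
```

Here the second fragment matches "12h", so preg_match() returns 1.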
What is the best way to search for a word in a string
preg_match("/word/",$string)
stripos($string,"word")
Or is there a better way
One benefit to using regexp for this job is the ability to use \b (Regexp word boundary) in the regexp, and other random derivations. If you are only looking for that sequence of letters in a string stripos is likely to be a little better.
$tests = array("word", "worded", "This also has the word.", "Words are not the same", "Word capitalized should match");
foreach ($tests as $string)
{
echo "Testing \"$string\": Regexp:";
echo preg_match("/\bword\b/i", $string) ? "Matched" : "Failed";
echo " stripos:";
echo stripos($string, "word") !== false ? "Matched": "Failed"; // haystack comes first; compare strictly against false
echo "\n";
}
Results:
Testing "word": Regexp:Matched stripos:Matched
Testing "worded": Regexp:Failed stripos:Matched
Testing "This also has the word.": Regexp:Matched stripos:Matched
Testing "Words are not the same": Regexp:Failed stripos:Matched
Testing "Word capitalized should match": Regexp:Matched stripos:Matched
Like it says in the Notes for preg_match:
Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.
If you are simply looking for a substring, stripos() or strpos() and friends are much better than using the preg family of functions.
For simple string matching the PHP string functions offer more performance. Regex is more heavyweight and therefore has lower performance.
Having said that, in most cases, the performance difference is small enough to go unnoticed, unless you're looping over an array with hundreds of thousands of elements or more.
Of course, as soon as you start needing "cleverer" matching, regex becomes the only game in town.
There is also substr_count($haystack, $needle), which just returns the number of substring occurrences. It has the added bonus that you do not have to worry about 0 equating to false, as you do with stripos() when the first occurrence is at position 0, although that is not a problem if you use strict comparison. Note that substr_count() is case-sensitive.
http://php.net/manual/en/function.substr-count.php
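The position-0 pitfall mentioned above, shown on a small example (the sample string is arbitrary):

```php
<?php
$haystack = 'Words are not the same';

var_dump(stripos($haystack, 'word'));             // int(0): found at the very start
var_dump(stripos($haystack, 'word') == false);    // bool(true): loose comparison misreads 0 as "not found"
var_dump(stripos($haystack, 'word') !== false);   // bool(true): strict comparison reports the match
var_dump(substr_count(strtolower($haystack), 'word')); // int(1): lowercased first, as substr_count() is case-sensitive
```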
Okay, here's what I'm trying to do: I'm trying to use PHP to develop what's essentially a tiny subset of a markdown implementation, not worth using a full markdown class.
I essentially need to do a str_replace, but alternate the replacement string for every occurrence of the needle, so as to handle the opening and closing HTML tags.
For example, italics are a pair of asterisks like *this*, and code blocks are surrounded by backticks like `this`.
I need to replace the first occurrence in each pair of these characters with the corresponding opening HTML tag, and the second with the closing tag.
Any ideas on how to do this? I figured some sort of regular expression would be involved...
Personally, I'd loop through each occurrence of * or ` with a counter, and replace the character with the appropriate HTML tag based on the count (for example, if the count is even and you hit an asterisk, replace it with <em>; if it's odd, replace it with </em>, etc).
But if you're sure that you only need to support a couple simple kinds of markup, then a regular expression for each might be the easiest solution. Something like this for asterisks, for example (untested):
preg_replace('/\*([^*]+)\*/', '<em>\\1</em>', $text);
And something similar for backticks.
What you're looking for is more commonly handled by a state machine or lexer/parser.
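A very small state-machine sketch of that idea: one open/closed flag per delimiter, flipped each time the delimiter is seen. renderInline() and the mapping of * to em and ` to code are assumptions for illustration. The sketch does no escaping, still converts asterisks inside code spans, and leaves a tag open if a delimiter is never closed.

```php
<?php
// Hypothetical mini-renderer: toggles a tag each time its delimiter is seen.
function renderInline(string $text): string
{
    $tags = array('*' => 'em', '`' => 'code');  // delimiter => HTML element (an assumption)
    $open = array('*' => false, '`' => false);  // whether each element is currently open
    $out = '';
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        if (isset($tags[$char])) {
            // Emit a closing tag if the element is open, an opening tag otherwise.
            $out .= ($open[$char] ? '</' : '<') . $tags[$char] . '>';
            $open[$char] = !$open[$char];
        } else {
            $out .= $char;
        }
    }
    return $out;
}

echo renderInline('Some *italic* text and `code` here');
// prints: Some <em>italic</em> text and <code>code</code> here
```

Walking the string byte by byte is safe for UTF-8 input here because both delimiters are single ASCII bytes.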
This is ugly but it works. Catch: only for one pattern type at a time.
$input = "Here's some \\italic\\ text and even \\some more\\ wheee";
$output = preg_replace_callback( '/\\\\/', 'replacer', $input ); // the regex \\ matches one literal backslash
echo $output;
function replacer( $matches )
{
static $toggle = 0;
if ( $toggle )
{
$toggle = 0;
return "</em>";
}
$toggle = 1;
return "<em>";
}
I created an alternative to str_replace, because the PHP manual for str_replace says that:
If search and replace are arrays, then str_replace() takes a value
from each array and uses them to search and replace on subject.
If replace has fewer values than search, then an empty string is used
for the rest of replacement values.
If search is an array and replace is a string, then this replacement
string is used for every value of search.
The converse would not make sense, though.
But the converse DOES make sense if the same needle appears several times in your haystack, such as '?' in a prepared statement (e.g. PHP's MySQLi extension), and you need to write a log or diagnostic report of what's going on as it runs through the parameters, substituting the parameters in the query string to make a 'cut and paste' version of the query for testing elsewhere.
Occurrences of needle are replaced left-to-right with the values in the replace array. If there are more occurrences of needle than there are replacements, it resets the replace array pointer. This means that for the OP's use, the needle would be "*", and the replacement would be an array with two values, "<I>" and "</I>".
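function str_replace_seriatim(string $needle, array $replace, string $haystack) {
$occurrences = substr_count($haystack, $needle);
$offset = 0; // resume after each substitution so inserted text is never rescanned
for ($i = 0; $i < $occurrences; $i++) {
$substitute = current($replace);
$pos = strpos($haystack, $needle, $offset);
if ($pos === FALSE) break;
$haystack = substr_replace($haystack, $substitute, $pos, strlen($needle));
$offset = $pos + strlen($substitute);
// wrap around to the first replacement once the array is exhausted
if ((next($replace) === FALSE)) reset($replace);
}
return $haystack;
}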
To do the whole lot in one function call, I suppose that one could expand on this a little, taking an array ($pincushion) of needles and a multidimensional array as the replacement, but I'm not sure if that isn't more work than just multiple function calls.
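For comparison, the same cycling behaviour can be sketched with preg_replace_callback() and a counter captured by reference. This is an alternative illustration, not the function above; replaceCycling() is a hypothetical name, and the needle is treated as a literal string via preg_quote().

```php
<?php
// Hypothetical alternative: cycles through $replacements for successive occurrences of $needle.
function replaceCycling(string $needle, array $replacements, string $haystack): string
{
    $replacements = array_values($replacements); // ensure 0-based keys
    $n = count($replacements);
    if ($n === 0 || $needle === '') {
        return $haystack; // nothing sensible to do
    }
    $i = 0;
    return preg_replace_callback(
        '/' . preg_quote($needle, '/') . '/',
        function ($m) use (&$i, $replacements, $n) {
            return $replacements[$i++ % $n]; // wrap around when replacements run out
        },
        $haystack
    );
}

echo replaceCycling('*', array('<I>', '</I>'), 'a *b* c *d*');
// prints: a <I>b</I> c <I>d</I>
```

The same call with '?' as the needle and a list of parameter values gives the "cut and paste" query string described above.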