Match words in file with regex php - php

I'm new to regex and PHP. I know this is quite simple, but I just can't get it. I have a file, words.txt, that contains:
happy
sad
laugh
I want to check this sentence against the words in words.txt:
I am happy
So far I've tried this, but it doesn't work because it compares the whole sentence rather than individual words (I haven't added a regex yet because I'm confused):
$input0= "I am happy";
$handle = fopen('words.txt', 'r');
$valid = false;
while (($buffer = fgets($handle)) !== false) {
if (strpos($buffer, $input0) !== false) { // here's the problem
$valid = TRUE;
break;
}
}
if($valid == TRUE){
//print the matches word
}
fclose($handle);
Can you help me?

Depending on your final goal, you may not even need a regex here, since you want to match whole words with no variable part.
If you are happy to loop over your keywords, a simple str_replace() can replace each word with an emphasized version, for instance, or a simple if (strpos($input0, $word) !== false) can check whether the word occurs in the sentence and give its position. Note that the sentence is the haystack and the word is the needle, which is the reverse of the call in your code.
If you want to avoid a loop, especially when you have many words, preg_match_all() will do what you need, as Zanderwar said.
Here is an example:
$input0 = "I am happy but sometimes quite pretty sad. It depends but I prefer to be happy in general.\nMy paragraph also continues on multiple lines\nand it makes me laugh and rejoice. I am so happy. HAPPY?";
// $contents = file_get_contents('words.txt');
$contents = "happy\nsad\nlaugh";
// Split on any line ending, drop empty lines, and escape regex metacharacters
$words = preg_split('~\R~', $contents, -1, PREG_SPLIT_NO_EMPTY);
$words_list = implode('|', array_map(function ($w) { return preg_quote($w, '~'); }, $words));
if (preg_match_all("~($words_list)~si", $input0, $matches))
{
    print_r($matches);
    // Do what you want
}
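The i flag makes the match case-insensitive, if you need that.
The s flag makes . match newlines as well. This pattern contains no dot, so the flag has no effect here and can be dropped; multi-line input is matched either way.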
[EDIT] to add more details on the regex
The pattern needs a delimiter. ~ is a good choice because it rarely appears in the text or words you want to match, so you don't need to escape / as you would with the / delimiter.
I am joining your words as ~(sad|joy|happy)~, which captures the matched word. If you don't need the capture, use a non-capturing group instead: (?:sad|joy|happy).
The | means "or".
You can replace ~($words_list)~si with ~(?:$words_list)~si if you don't need capturing, and here you don't: $matches then has a single level, and position [0] always holds the full matches. There are no more complex patterns to match here, so there is nothing to capture.
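To tie the pieces above together, here is a minimal sketch of one way to do it, not the only one. The helper name buildWordPattern() is made up for illustration, and I am assuming you want whole-word matches only, which is why \b is added around the group.

```php
<?php
// Hypothetical helper: builds one case-insensitive, whole-word pattern from a word list.
function buildWordPattern(array $words): string
{
    // Trim each entry and drop empty ones (e.g. a trailing blank line in words.txt).
    $words = array_filter(array_map('trim', $words), 'strlen');
    // Escape regex metacharacters and the ~ delimiter in every word.
    $escaped = array_map(fn($w) => preg_quote($w, '~'), $words);
    return '~\b(?:' . implode('|', $escaped) . ')\b~i';
}

// In real use: $words = file('words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$words = ['happy', 'sad', 'laugh'];
$pattern = buildWordPattern($words);

preg_match_all($pattern, 'I am happy, not unhappy. HAPPY? Sad.', $matches);
print_r($matches[0]); // happy, HAPPY, Sad ("unhappy" is skipped because of \b)
```

Without the \b anchors, "unhappy" would also produce a match for "happy", so drop them if substring matches are what you want.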

Related

Selecting certain links with a REGEX

I'm building a "Wiki Game" in PHP, and I'd like to match all the links in a string that start with /wiki/something, for example /wiki/Chiffrement_RSA or /wiki/OSS_117_:_Le_Caire,_nid_d%27espions. I only know a little about regex, so I'm stuck. If someone could help me, that would be nice.
For the moment, I just have \/wiki\/*...
Thanks for your help!
You can do it with a regex or with strpos:
<?php
$originalLink = '/wiki/Chiffrement_RSA';
$find = '/wiki/';
$statusLink = strpos($originalLink, $find);
// Note our use of ===. Simply == would not work as expected
// because '/wiki/' is found at position 0 here, and 0 == false is true.
if ($statusLink === false) {
    echo "Not the link that you want";
} else {
    echo "You found the link";
}
// or by explode: '/wiki/Foo' splits into '', 'wiki', 'Foo'
$link = explode('/', $originalLink);
if (isset($link[1], $link[2]) && $link[1] === 'wiki') {
    // is your link
}
?>
I don't use pure regex much unless it's really necessary.
You can reduce your output array size by 50% by using \K in your pattern. It eliminates the need for a capture group and puts your desired substrings in the full-match array ($matches[0]).
Pattern:
\/wiki\/\K[^ ]+
\K says "start the full-string match from here", so no second array is built for a capture group. It may be a micro-improvement, but I believe it is best practice and I think more people should use it.
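A small sketch comparing the two patterns, in case it helps. The sample string is made up, and ~ is used as the delimiter so the slashes don't need escaping.

```php
<?php
$text = 'See /wiki/Chiffrement_RSA and /wiki/OSS_117_:_Le_Caire,_nid_d%27espions for details.';

// With a capture group: $withGroup[0] holds the full matches, $withGroup[1] the captured part.
preg_match_all('~/wiki/([^ ]+)~', $text, $withGroup);

// With \K: everything matched before \K is dropped from the reported match.
preg_match_all('~/wiki/\K[^ ]+~', $text, $withK);

print_r($withGroup); // two sub-arrays
print_r($withK);     // one sub-array containing only the page names
```

Both versions rely on [^ ]+ stopping at the first space, so on raw HTML they would also swallow a closing quote; on href values taken from DOMDocument that is not a concern.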
I finally chose Cody.code's answer with this regex: \/wiki\/([^ ]+).
I will use this code to decide whether to keep a link in an array (I will parse my HTML with DOMDocument and get all the <a> elements, which is faster), so the preg_match() solution suits me better than strpos.
Thanks for your help!

alternative to if(preg_match() and preg_match())

I want to know if we can replace if(preg_match('/boo/', $anything) and preg_match('/poo/', $anything))
with a single regex, for example with:
$anything = 'I contain both boo and poo!!';
From what I understand of your question, you're looking for a way to check whether BOTH 'poo' and 'boo' exist within a string using only one regex. I can't think of a more elegant way than this:
preg_match('/(boo.*poo)|(poo.*boo)/', $anything);
This is the only way I can think of to ensure both patterns exist within a string regardless of order. Of course, if you knew they were always supposed to be in the same order, that would make it simpler =]
EDIT
After reading through the post linked to by MisterJ in his answer, it would seem that a simpler regex could be:
preg_match('/(?=.*boo)(?=.*poo)/', $anything);
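The lookahead idea extends to any number of words. Here is a rough sketch; containsAll() is a name I made up, and the ^ anchor and s modifier are my additions (the anchor so the check runs once from the start, s so that . can cross newlines).

```php
<?php
// Hypothetical helper: true only when every literal word occurs somewhere in $text.
function containsAll(string $text, array $words): bool
{
    $lookaheads = '';
    foreach ($words as $word) {
        // One lookahead per word; each one scans from the start of the string.
        $lookaheads .= '(?=.*' . preg_quote($word, '/') . ')';
    }
    return preg_match('/^' . $lookaheads . '/s', $text) === 1;
}

var_dump(containsAll('I contain both boo and poo!!', ['boo', 'poo'])); // bool(true)
var_dump(containsAll('I contain boo only', ['boo', 'poo']));           // bool(false)
var_dump(containsAll("poo first,\nboo later", ['boo', 'poo']));        // bool(true)
```

Because each lookahead is independent, the order of the words in the string does not matter.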
By using a pipe:
if(preg_match('/boo|poo/', $anything))
You can use the logical or, as mentioned by @sroes:
if(preg_match('/(boo)|(poo)/', $anything))
The problem there is that you don't know which one matched.
That pattern matches "I contain boo", "I contain poo" and "I contain boo and poo".
If you want to match only "I contain boo and poo", the problem is much harder; see the question "Regular Expressions: Is there an AND operator?".
It seems that you will have to stick with the PHP test.
To take conditions literally
if(preg_match('/[bp]oo.*[bp]oo/', $anything))
Note that this also matches a string that contains the same word twice, such as "boo and boo".
You can achieve this by altering your regular expression, as others have pointed out in other answers. However, if you would rather use an array, so that you do not have to write out a long regex pattern, then use something like this:
// Assume everything matches until a pattern fails
$matches = true;
// Set the pattern array
$pattern_array = array('boo','poo');
// Loop through the patterns to match
foreach($pattern_array as $pattern){
    // Test if the string is matched
    if(!preg_match('/'.$pattern.'/', $anything)){
        // One pattern is missing, so stop checking
        $matches = false;
        break;
    }
}
// Proceed if every pattern matched
if($matches){
    // Do your stuff here
}
Alternatively, if you are only trying to match literal strings, then it would be more efficient to use strpos, like so:
// Assume everything matches until a string is missing
$matches = true;
// Set the strings to match
$strings_to_match = array('boo','poo');
foreach($strings_to_match as $string){
    if(strpos($anything, $string) === false){
        // One string is missing, so stop checking
        $matches = false;
        break;
    }
}
Prefer plain string functions over regular expressions when you are only matching literal strings, as they are usually faster!

Match array values against text [duplicate]

I have an array full of patterns that I need matched. Is there any way to do that other than a for() loop? I'm trying to do it in the least CPU-intensive way, since I will be doing dozens of these every minute.
A real-world example: I'm building a link status checker, which will check links to various online video sites to ensure that the videos are still live. Each domain has several "dead keywords"; if these are found in the HTML of a page, that means the file was deleted. These are stored in the array. I need to match the contents of the array against the HTML output of the page.
First of all, if you literally are only doing dozens every minute, then I wouldn't worry terribly about the performance in this case. These matches are pretty quick, and I don't think you're going to have a performance problem by iterating through your patterns array and calling preg_match separately like this:
$matches = false;
foreach ($pattern_array as $pattern)
{
if (preg_match($pattern, $page))
{
$matches = true;
}
}
You can indeed combine all the patterns into one using the or operator like some people are suggesting, but don't just slap them together with a |. This will break badly if any of your patterns contain the or operator.
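I would recommend at least grouping your patterns using parentheses, like:
$grouped_patterns = array();
foreach ($patterns as $pattern)
{
    // Each entry must be a bare pattern here, without its own delimiters
    $grouped_patterns[] = "(" . $pattern . ")";
}
// implode() takes the separator first; the combined pattern still needs delimiters
$master_pattern = "~" . implode("|", $grouped_patterns) . "~";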
But... I'm not really sure if this ends up being faster. Something has to loop through them, whether it's the preg_match or PHP. If I had to guess I'd guess that individual matches would be close to as fast and easier to read and maintain.
Lastly, if performance is what you're looking for here, I think the most important thing to do is pull out the non regex matches into a simple "string contains" check. I would imagine that some of your checks must be simple string checks like looking to see if "This Site is Closed" is on the page.
So doing this:
foreach ($strings_to_match as $string_to_match)
{
    if (strpos($page, $string_to_match) !== false)
    {
        // etc.
        break;
    }
}
foreach ($pattern_array as $pattern)
{
    if (preg_match($pattern, $page))
    {
        // etc.
        break;
    }
}
and avoiding as many preg_match() as possible is probably going to be your best gain. strpos() is a lot faster than preg_match().
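Wrapped up as a function, the "cheap checks first" idea could look roughly like this. The function name and the sample markers are invented for the example; the patterns are assumed to be complete regexes with their own delimiters.

```php
<?php
// Hypothetical helper: returns true as soon as any dead-link marker is found.
function pageLooksDead(string $page, array $literals, array $patterns): bool
{
    // Plain substring checks first, since they need no regex engine.
    foreach ($literals as $literal) {
        if (strpos($page, $literal) !== false) {
            return true;
        }
    }
    // Only fall back to full regex patterns when no literal was found.
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $page) === 1) {
            return true;
        }
    }
    return false;
}

// Made-up markers for illustration only.
$literals = ['This Site is Closed'];
$patterns = ['/video (?:was|has been) (?:removed|deleted)/i'];

var_dump(pageLooksDead('<p>This Site is Closed</p>', $literals, $patterns));
var_dump(pageLooksDead('<p>The video has been deleted.</p>', $literals, $patterns));
var_dump(pageLooksDead('<p>Enjoy the video</p>', $literals, $patterns));
```

Returning early plays the same role as the break statements above: once one marker is found there is no reason to test the rest.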
// assuming you have something like this
$patterns = array('a','b','\w');
// converts the array into a regex friendly or list
$patterns_flattened = implode('|', $patterns);
if ( preg_match('/'. $patterns_flattened .'/', $string, $matches) )
{
}
// PS: that's off the top of my head, I didn't check it in a code editor
If your patterns don't contain much whitespace, another option would be to eschew the arrays and use the /x modifier. Now your list of regular expressions would look like this:
$regex = "/
pattern1| # search for occurrences of 'pattern1'
pa..ern2| # wildcard search for occurrences of 'pa..ern2'
pat[ ]tern| # search for 'pat tern'; the space is kept by putting it in a character class
mypat # Note that the last pattern does NOT have a pipe char
/x";
With the /x modifier, whitespace is completely ignored, except when in a character class or preceded by a backslash. Comments like the ones above are also allowed, as long as they do not contain the delimiter.
This would avoid the looping through the array.
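Here is a runnable variation on the /x idea, with made-up dead-link markers standing in for real ones. I added the i modifier as well, which is an assumption about what you would want.

```php
<?php
// Made-up dead-link markers, one alternative per line.
$regex = '/
    file \s+ not \s+ found   |  # whitespace has to be written as \s in x mode
    video [ ] removed        |  # a literal space, kept inside a character class
    error \s* 404               # last alternative: no trailing pipe
/xi';

var_dump(preg_match($regex, 'Error: File   not found')); // int(1)
var_dump(preg_match($regex, 'video removed'));           // int(1)
var_dump(preg_match($regex, 'video  removed'));          // int(0), two spaces
var_dump(preg_match($regex, 'all good'));                // int(0)
```

The string is single-quoted so that PHP leaves the backslashes alone before the pattern reaches the regex engine.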
If you're merely searching for the presence of a string in another string, use strpos as it is faster.
Otherwise, you could just iterate over the array of patterns, calling preg_match each time.
If you have a bunch of patterns, what you can do is concatenate them in a single regular expression and match that. No need for a loop.
What about doing a str_replace() on the HTML you get, using your array, and then checking whether the original HTML is equal to the result? This would be very fast:
$sites = array(
'you_tube' => array('dead', 'moved'),
...
);
foreach ($sites as $site => $deadArray) {
// get $html
if ($html == str_replace($deadArray, '', $html)) {
// video is live
}
}
You can combine all the patterns from the list to single regular expression using implode() php function. Then test your string at once using preg_match() php function.
$patterns = array(
'abc',
'\d+h',
'[abc]{6,8}\-\s*[xyz]{6,8}',
);
$master_pattern = '/(' . implode(')|(', $patterns) . ')/';
if(preg_match($master_pattern, $string_to_check))
{
//do something
}
Of course there could be even less code using implode() inline in "if()" condition instead of $master_pattern variable.
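One thing the combined pattern can still tell you is which sub-pattern matched. This is only a sketch under some assumptions: firstMatchingPattern() is a made-up name, the sub-patterns must contain no capture groups of their own and no ~ character, and PREG_UNMATCHED_AS_NULL needs PHP 7.2 or later.

```php
<?php
// Hypothetical helper: returns the index of the sub-pattern that matched, or null.
function firstMatchingPattern(array $patterns, string $subject): ?int
{
    // Re-index so that capture group n+1 always belongs to $patterns[n].
    $patterns = array_values($patterns);
    $master = '~(' . implode(')|(', $patterns) . ')~';
    if (preg_match($master, $subject, $m, PREG_UNMATCHED_AS_NULL) !== 1) {
        return null;
    }
    // Only the alternative that took part in the match is set and non-null.
    foreach ($patterns as $i => $unused) {
        if (isset($m[$i + 1])) {
            return $i;
        }
    }
    return null;
}

$patterns = array(
    'abc',
    '\d+h',
    '[abc]{6,8}\-\s*[xyz]{6,8}',
);

var_dump(firstMatchingPattern($patterns, 'waited 12h')); // int(1)
var_dump(firstMatchingPattern($patterns, 'xabcx'));      // int(0)
var_dump(firstMatchingPattern($patterns, 'nothing'));    // NULL
```

If the sub-patterns do contain their own groups, the numbering shifts and this mapping no longer holds, so in that case looping over the patterns individually is simpler.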

Searching for words in a string

What's the best way to search a string in php and find a case insensitive match?
For example:
$SearchString = "This is a test";
From this string, I want to find the word test, or TEST or Test.
Thanks!
EDIT
I should also mention that I want to search the string and if it contains any of the words in my blacklist array, stop processing it. So an exact match of "Test" is important, however, the case is not
If you want to find whole words, so that "FU" is forbidden but "fun" is not, you can use a regular expression with \b, which marks the start and end of a word.
So if you search for "\bfu\b", it is not going to match "fun".
If you add an "i" after the closing delimiter, the search is case-insensitive.
If you have a list of words like "fu", "foo" and "bar", your pattern can look like
"#\b(fu|foo|bar)\b#i", or you can use a variable:
if(preg_match("#\b{$needle}\b#i", $haystack))
{
return FALSE;
}
Edit: added a multi-word example with character escaping, as requested in the comments:
/* load the list somewhere */
$stopWords = array( "word1", "word2" );
/* escape special characters, including the # delimiter */
foreach($stopWords as $row_nr => $current_word)
{
    $stopWords[$row_nr] = preg_quote($current_word, '#');
}
/* create a pattern of all words (using # instead of / as / can be used in urls) */
$pattern = "#\b(" . implode('|', $stopWords) . ")\b#i";
/* execute the search */
if(!preg_match($pattern, $images))
{
    /* no stop words */
}
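For the blacklist case from the question, the same pattern can be wrapped so that it reports which word triggered the block. findBlockedWord() is a hypothetical name, and the sketch assumes every blacklisted word starts and ends with a letter, digit or underscore, since \b only works at those edges.

```php
<?php
// Hypothetical helper: returns the first blacklisted word found in $text, or null.
function findBlockedWord(string $text, array $blacklist): ?string
{
    // preg_quote() escapes regex metacharacters and, via its second argument, the # delimiter.
    $escaped = array_map(fn($w) => preg_quote($w, '#'), $blacklist);
    $pattern = '#\b(?:' . implode('|', $escaped) . ')\b#i';
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$blacklist = ['fu', 'foo', 'bar'];
var_dump(findBlockedWord('That was fun', $blacklist));       // NULL
var_dump(findBlockedWord('Meet me at the BAR', $blacklist)); // string(3) "BAR"
```

The returned word keeps the casing it had in the text, which can be handy for logging why a string was rejected.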
You can do one of a few things, but I tend to use one of these:
You can use stripos()
if (stripos($SearchString,'test') !== FALSE) {
    echo 'I found it!';
}
You can convert the string to one specific case, and search it with strpos()
if (strpos(strtolower($SearchString),'test') !== FALSE) {
    echo 'I found it!';
}
I do both and have no preference - one may be more efficient than the other (I suspect the first is better) but I don't actually know.
As a couple of more horrible examples, you could:
Use a regex with the i modifier
Do if (count(explode('test',strtolower($searchString))) > 1)
stripos, I would assume. Presumably it stops searching when it finds a match, and I guess internally it converts to lower (or upper) case, so that's about as good as you'll get.
http://us3.php.net/manual/en/function.preg-match.php
It depends on whether you just want to know that there is a match.
In that case you would do:
$SearchString = "This is a test";
// The i modifier makes the match case-insensitive
$pattern = '/test/i';
preg_match($pattern, $SearchString);
I wasn't reading the question properly. As stated in other answers, stripos or a preg_match function will do exactly what you're looking for.
I originally offered the stristr function as an answer, but you actually should NOT use this if you're just looking to find a string within another string, as it returns the rest of the string in addition to the search parameter.
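To make the difference concrete, here is a short comparison of what the two functions return. The sample strings are only for illustration.

```php
<?php
$searchString = 'This is a test';

// stripos() returns the position of the first case-insensitive match, or false.
var_dump(stripos($searchString, 'TEST')); // int(10)
var_dump(stripos($searchString, 'nope')); // bool(false)

// A match at position 0 is why the comparison must be !== false and not a loose check.
var_dump(stripos('Test case', 'test'));   // int(0)

// stristr() returns the rest of the string from the first match onwards.
var_dump(stristr($searchString, 'IS'));   // string(12) "is is a test"
```

Both functions find the match, but stripos() does so without building a new substring.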

