Alternative strings in regular expression - php

Sadly I have to ask this question but after noodling on this problem the whole morning, I give up. Searching online, man pages, documents, none of it seems to give me a conclusive answer to what I try to do.
Looking for a regular expression for the PHP function preg_match to match a string against a pattern. Now that pattern is what gives me headaches.
The pattern should express the following: string starts with "_MG_" or "IMG_" or "DSC_", followed by four digits, followed by an optional "-N" where N is another digit. For example, "IMG_0123" or "DSC_9876-3" are valid. Everything else should be rejected.
I came up with various patterns, but none of them seems to work. For example, I tried
(_MG_|IMG_|DSC_)[0-9]{4}(-[0-9])?
and this in different variations with ( ) and apostrophes around various sub-expressions and using ? vs {0,1} and whatnot. (I experimented using grep, but got no matches still.) Yes, I know I need to add "/.../" for PHP, but here I left it out for readability's sake.
Can I even express this in a single expression, or will I have to call the matching function several times? If several matches are required, I might be better off writing a small parser for this particular string matching myself.
Thanks!
EDIT: Here is the code that I'm working with
// Iterate over all images in this gallery folder.
if ($h = opendir($dir)) {
while (($f = readdir($h)) !== false) {
// Skip images whose name doesn't match the requirement.
if (0 == preg_match("/(_MG_|IMG_|DSC_)[0-9]{4}(-[0-9]){0,1}/", $f)) {
continue;
}
...
}
}
And this also allows image names like "_MG_7020-1-2.jpg" or "_MG_7444-5-6.2.jpg" or "IMG_6543_2_4_tonemapped.jpg" but that's not what I want to allow.

<?php
$array = array('IMG_0123', 'DSC_9876-3', '_MG_1234', 'DSC_fail');
foreach($array as $arr) {
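// Group the alternatives and anchor both ends so the prefix, the four digits and the optional "-N" are all enforced.
if(preg_match("/^(?:_MG_|IMG_|DSC_)[0-9]{4}(?:-[0-9])?$/", $arr)) {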
echo $arr . ' => TRUE <br />';
} else {
echo $arr . ' => FALSE <br />';
}
}
?>
The above works as expected for me.

I ran this as well:
<?php
$matches = array();
preg_match('/(_MG_|IMG_|DSC_)[0-9]{4}(-[0-9])?/','IMG_0123-3',$matches );
var_dump($matches);
Output:
array(3) {
[0]=>
string(10) "IMG_0123-3"
[1]=>
string(4) "IMG_"
[2]=>
string(2) "-3"
}
Seems ok, unless I'm missing something, or unless what you're referring to is that preg_match returns false if not all your matchers () match.
Note the return type for preg_match from the php doc:
preg_match() returns the number of times pattern matches. That will be either 0 times (no match) or 1 time because preg_match() will stop searching after the first match. preg_match_all() on the contrary will continue until it reaches the end of subject. preg_match() returns FALSE if an error occurred.
So you may in fact be looking for preg_match_all().

According to this refiddle, you seem to have it solved just fine. You can use their "unit" test functionality to add additional "should" and "should not" match scenarios. Granted, that refiddle is using JavaScript's regex, but I find them to be effectively identical until you get into backreferences and lookarounds.

Here is your original pattern with start and end of string anchor as well as some edits to reduce the pattern length.
Code: (Demo)
var_export(
preg_grep(
'/^(?:DSC|[_I]MG)_\d{4}(?:-\d)?$/',
$array
)
);
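The file names in the question carry an extension such as ".jpg", which the end-of-string anchor above would reject. A minimal sketch, assuming the extension should simply be ignored: strip it with pathinfo() before matching. The helper name isCameraFilename is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: true when the base name (without extension) is
// _MG_/IMG_/DSC_ + four digits + an optional "-N" suffix, and nothing else.
function isCameraFilename(string $filename): bool
{
    // Assumption: the extension is irrelevant, so only the base name is tested.
    $base = pathinfo($filename, PATHINFO_FILENAME);
    return preg_match('/^(?:DSC|[_I]MG)_\d{4}(?:-\d)?$/', $base) === 1;
}

$names = ['IMG_0123.jpg', 'DSC_9876-3.jpg', '_MG_7020-1-2.jpg', 'IMG_6543_2_4_tonemapped.jpg'];
foreach ($names as $name) {
    echo $name, ' => ', isCameraFilename($name) ? 'accepted' : 'rejected', PHP_EOL;
}
```

Inside the readdir() loop from the question, the check would then read if (!isCameraFilename($f)) { continue; }.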

Related

Getting titles out of string

I'm really stuck with this one program...
I'm learning how to program and I'm starting with PHP right now.
I need to get titles out of articles.
I already asked this question, and I managed to get the first title of the text in many ways. For example, if the text was:
Hello
I'm learning how
to write this code.
then I got the "Hello" part, for example like this:
<?php
$string = "Hello
I'm learning how
to write this code.";
$str=strstr($string,"\n",true);
echo $str . "<br />";
?>
However, there can be a lot of titles in the article, each one of them separated by blank lines from the text above and below, and I cannot manage to get all of these titles.
Here's what I tried:
<?php
$string="
Good text

Good text is good but I have no idea
how to code this.

Another title

I need to get you,
but don't know how.";
$get = substr($string, strpos($string, $finda), -1);
$finda="\n";
$getFinal=strstr($get, $finda, true);
echo $getFinal;
?>
But this doesn't work because there are "\n" after every line. How to identify only those blank lines? I tried to find them:
$getRow = explode("\n", $string);
foreach($getRow as $row){
if(strlen($row) <= 1){
but I don't know what to do next.
Do you have any ideas? Can you help?
Thank you in advance:)
You can use a regular expression like this:
<?php
$string="
Good text

Good text is good but I have no idea
how to code this.

Another title

I need to get you,
but don't know how.";
preg_match_all('/^\n(.+?)\n\n/m', $string, $matches);
var_dump($matches[1]);
?>
Outputs:
array(2) {
[0] =>
string(9) "Good text"
[1] =>
string(13) "Another title"
}
Explanation of the regular expression
Regular expressions are a compact way to describe constraints on a string, either to check that it matches a given pattern or to capture some of its parts. In this case, we want to capture some parts of the string (the titles).
'/^\n(.+?)\n\n/m' is the regular expression used to solve your problem. The actual expression is between the slashes, while the trailing m is a modifier. It enables multiline mode, so that ^ matches at the start of every line rather than only at the start of the string.
We are left with ^\n(.+?)\n\n which can be read from left to right.
^ indicates the beginning of a line and \n represents the "new line" character. Coupled (^\n), they represent an empty line.
Parentheses indicate what we want to capture. In this case, the title, which can be any number of any characters. The . represents any character except a newline, and the + indicates that we want one or more occurrences of it (the * can be used instead to allow zero occurrences). The ? makes the quantifier lazy: rather than capturing as much as possible, it stops at the first position where the remaining part of the regular expression can match.
Then, the two \n represent the end of the title line and the end of the empty line following it.
As we used preg_match_all instead of preg_match, every occurrence of the pattern will be matched instead of the first one only.
Regular expressions are really powerful and I invite you to learn them further.
While iterating over the lines, you could have a variable that stores what you are currently doing. What I mean is that you could have 3 states: processing_text, expecting_title, got_title.
Each time you find that $row == "" (meaning there was an empty line, only containing a \n), you set your variable to expecting_title. If the var==expecting_title, you store/echo the next row you encounter and set the variable to got_title. This way, when you encounter the next empty line, you won't set the variable to expecting_title, but to processing_text.
Some pseudocode to get you started:
foreach ($getRow as $row)
    if (state == expecting_title)
        processTitle($row)
        state = got_title
    if ($row == "")
        if (state == processing_text)
            state = expecting_title
        else
            state = processing_text
Or, you can always use regex, as the other answer mentioned, but that's another story.
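The state-based idea above can be sketched in runnable PHP roughly as follows. This is only an illustration under assumptions: the function name extractTitles is hypothetical, the input is expected to follow the layout from the question (every title has a blank line above and below it), and, like the pseudocode, it does not verify that a blank line really follows the candidate title.

```php
<?php
// Hypothetical helper: walks the lines and collects each line that directly
// follows a blank line while we are not already inside a title block.
function extractTitles(string $text): array
{
    $titles = [];
    $state = 'processing_text';
    foreach (explode("\n", $text) as $row) {
        $row = rtrim($row, "\r"); // tolerate Windows line endings
        if ($row === '') {
            // A blank line right after a title closes it; otherwise a title may follow.
            $state = ($state === 'got_title') ? 'processing_text' : 'expecting_title';
            continue;
        }
        if ($state === 'expecting_title') {
            $titles[] = $row;
            $state = 'got_title';
        }
    }
    return $titles;
}

$string = "\nGood text\n\nGood text is good but I have no idea\nhow to code this.\n\n"
        . "Another title\n\nI need to get you,\nbut don't know how.";
print_r(extractTitles($string));
```

Unlike the pseudocode, this version keeps waiting for a title when several blank lines follow each other, which is a design choice rather than something the answer specifies.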

Extracting substrings between curly brackets inside a string into an array using PHP

I need help extracting all the substrings between curly brackets that are found inside a specific string.
I found some solutions in javascript but I need it for PHP.
$string = "www.example.com/?foo={foo}&test={test}";
$subStrings = HELPME($string);
print_r($subStrings);
The result should be:
array( [0] => foo, [1] => test )
I tried playing with preg_match but I got confused.
I'd appreciate it if whoever manages to get it to work with preg_match could also explain the logic behind it.
You could use this regex to capture the strings between {}
\{([^}]*)\}
Explanation:
\{ Matches a literal {
([^}]*) Captures zero or more characters that are not }. So it captures everything up to the next } symbol.
\} Matches a literal }
Your code would be,
<?php
$regex = '~\{([^}]*)\}~';
$string = "www.example.com/?foo={foo}&test={test}";
preg_match_all($regex, $string, $matches);
var_dump($matches[1]);
?>
Output:
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(4) "test"
}
Regex Pattern: \{(\w+)\}
Get all the matches that are captured by the parentheses (). The pattern captures any run of word characters that is enclosed by {...}.
Sample code:
$regex = '/\{(\w{1,})\}/';
$testString = ''; // Fill this in
preg_match_all($regex, $testString, $matches);
// the $matches variable contains the list of matches
If you want to capture any type of character inside the {...} then try below regex pattern.
Regex : \{(.*?)\}
Sample code:
$regex = '/\{(.{0,}?)\}/';
$testString = ''; // Fill this in
preg_match_all($regex, $testString, $matches);
// the $matches variable contains the list of matches
<?php
$string = "www.example.com/?foo={foo}&test={test}";
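// preg_match() stops after the first match, so use preg_match_all() to collect every {...} value.
$found = preg_match_all('/\{([^}]*)\}/', $string, $subStrings);
if($found){
// index 1 holds the contents of the capture group, without the braces
print_r($subStrings[1]);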
}else{
echo 'NOPE !!';
}
You could also look at the function parse_url(), which parses a URL and returns its components, including the query string.
Try This:
preg_match_all("/\{.*?\}/", $string, $subStrings);
var_dump($subStrings[0]);
Good Luck!
You can use the expression (?<=\{).*?(?=\}) to match any string of text enclosed in {}.
$string = "www.example.com/?foo={foo}&test={test}";
preg_match_all("/(?<=\{).*?(?=\})/",$string,$matches);
print_r($matches[0]);
Regex explained:
(?<=\{) is a positive lookbehind, asserting that the match is preceded by a {.
Similarly, (?=\}) is a positive lookahead asserting that it is followed by a }. .* matches 0 or more characters of any type, and the ? in .*? makes it match the fewest characters possible (meaning that in {foo} and {bar} it matches foo and bar separately, as opposed to foo} and {bar).
$matches[0] contains an array of all the matched strings.
I see answers here using regular expressions with capture groups, lookarounds, and lazy quantifiers. All of these techniques will slow down the pattern -- granted, the performance is very unlikely to be noticeable in the majority of use cases. Because we are meant to offer solutions that are suitable to more scenarios than just the posted question, I'll offer a few solutions that deliver the expected result and explain the differences using the OP's www.example.com/?foo={foo}&test={test} string assigned to $url. I have prepared a php DEMO of the techniques to follow. For information about the function calls, please follow the links to the php manual. For an in depth breakdown of the regex patterns, I recommend using regex101.com -- a free online tool that allows you to test patterns against strings, see the results as both highlighted text and a grouped list, and provides a technique breakdown character-by-character of how the regex engine is interpreting your pattern.
#1 Because your input string is a url, a non-regex technique is appropriate because php has native functions to parse it: parse_url() with parse_str(). Unfortunately, your requirements go beyond extracting the query string's values, you also wish to re-index the array and remove the curly braces from the values.
parse_str(parse_url($url, PHP_URL_QUERY), $assocArray);
$values = array_map(function($v) {return trim($v, '{}');}, array_values($assocArray));
var_export($values);
While this approach is deliberate and makes fair use of native functions that were built for these jobs, it ends up making longer, more convoluted code which is somewhat unpleasant in terms of readability. Nonetheless, it provides the desired output array and should be considered as a viable process.
#2 preg_match_all() is a brief and efficient technique to extract the values. One drawback of using regular expressions is that the regex engine is completely "unaware" of any special meanings that a formatted input string may have. In this case, I don't see any negative impacts, but when hiccups do arise, often the solution is to use a parser that is "format/data-type aware".
var_export(preg_match_all('~\{\K[^}]*~', $url, $matches) ? $matches[0] : []);
Notice that my pattern does not need capture groups or lookarounds; nor does my answer suffer from the use of a lazy quantifier. \K is used to "restart the fullstring match" (in other words, forget any characters matched up to that point). All of these features mean that the regex engine can traverse the string efficiently. If there are downsides to using the function, they are:
that a multi-dimensional array is generated while you only want a one-dimensional array
that the function creates a reference variable instead of returning the results
#3 preg_split() most closely aligns with the plain-English intent of your task AND it provides the exact output as its return value.
var_export(preg_split('~(?:(?:^|})[^{]*{)|}[^{]*$~', $url, 0, PREG_SPLIT_NO_EMPTY));
My pattern, while admittedly unsavoury to the novice regex pattern designer AND slightly less efficient because it is making "branched" matches (|), basically says: "Split the string at the following delimiters:
from the start of the string or from a }, including all non-{ characters, then the first encountered { (this is the end of the delimiter).
from the last }, including all non-{ characters until the end of the string."
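To compare the three techniques side by side, the snippets above can be collected into one script. This is a sketch that only reuses the calls already shown in this answer; the variable names $viaParser, $viaMatchAll and $viaSplit are made up for the illustration.

```php
<?php
$url = 'www.example.com/?foo={foo}&test={test}';

// #1: parse the query string, then strip the braces and re-index the values.
parse_str((string) parse_url($url, PHP_URL_QUERY), $assocArray);
$viaParser = array_map(function ($v) {
    return trim($v, '{}');
}, array_values($assocArray));

// #2: match "{", forget it with \K, then take everything up to the next "}".
$viaMatchAll = preg_match_all('~\{\K[^}]*~', $url, $matches) ? $matches[0] : [];

// #3: split on everything that is NOT a value and drop the empty pieces.
$viaSplit = preg_split('~(?:(?:^|})[^{]*{)|}[^{]*$~', $url, 0, PREG_SPLIT_NO_EMPTY);

var_export($viaParser);
var_export($viaMatchAll);
var_export($viaSplit);
```

For this particular input all three variables should hold the same two values, foo and test; inputs with nested or unbalanced braces may make the techniques diverge.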

Match array values against text [duplicate]

I have an array full of patterns that I need matched. Is there any way to do that other than a for() loop? I'm trying to do it in the least CPU-intensive way, since I will be doing dozens of these every minute.
A real-world example: I'm building a link status checker, which will check links to various online video sites to ensure that the videos are still live. Each domain has several "dead keywords"; if these are found in the HTML of a page, that means the file was deleted. These are stored in the array. I need to match the contents of the array against the HTML output of the page.
First of all, if you literally are only doing dozens every minute, then I wouldn't worry terribly about the performance in this case. These matches are pretty quick, and I don't think you're going to have a performance problem by iterating through your patterns array and calling preg_match separately like this:
$matches = false;
foreach ($pattern_array as $pattern)
{
if (preg_match($pattern, $page))
{
$matches = true;
}
}
You can indeed combine all the patterns into one using the or operator like some people are suggesting, but don't just slap them together with a |. This will break badly if any of your patterns contain the or operator.
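I would recommend at least grouping your patterns using parentheses like:
$grouped_patterns = array();
foreach ($patterns as $pattern)
{
$grouped_patterns[] = "(" . $pattern . ")";
}
// implode() takes the separator first; the reversed argument order is no longer accepted since PHP 8.0
$master_pattern = implode("|", $grouped_patterns);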
But... I'm not really sure if this ends up being faster. Something has to loop through them, whether it's the preg_match or PHP. If I had to guess I'd guess that individual matches would be close to as fast and easier to read and maintain.
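For what it is worth, here is a minimal sketch of the grouped master pattern in use. It rests on assumptions that are not in the answer above: the fragments are stored without delimiters, none of them contains the "~" character chosen as delimiter here, and the helper name buildMasterPattern as well as the sample keywords are invented for the illustration.

```php
<?php
// Hypothetical helper: wraps every delimiter-less fragment in a non-capturing
// group, joins the groups with "|" and adds "~" delimiters around the result.
// Assumption: no fragment contains an unescaped "~".
function buildMasterPattern(array $fragments): string
{
    $grouped = array_map(function (string $fragment): string {
        return '(?:' . $fragment . ')';
    }, $fragments);
    return '~' . implode('|', $grouped) . '~';
}

$deadKeywords = ['file (?:was )?deleted', 'removed|unavailable', 'error\s+404'];
$page = '<p>Sorry, this video is unavailable.</p>';

if (preg_match(buildMasterPattern($deadKeywords), $page) === 1) {
    echo 'dead link', PHP_EOL;
}
```

Non-capturing groups are used instead of plain parentheses only because the capture contents are not needed for a yes/no check.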
Lastly, if performance is what you're looking for here, I think the most important thing to do is pull out the non regex matches into a simple "string contains" check. I would imagine that some of your checks must be simple string checks like looking to see if "This Site is Closed" is on the page.
So doing this:
foreach ($strings_to_match as $string_to_match)
{
if (strpos($page, $string_to_match) !== false)
{
// etc.
break;
}
}
foreach ($pattern_array as $pattern)
{
if (preg_match($pattern, $page))
{
// etc.
break;
}
}
and avoiding as many preg_match() as possible is probably going to be your best gain. strpos() is a lot faster than preg_match().
// assuming you have something like this
$patterns = array('a','b','\w');
// converts the array into a regex friendly or list
$patterns_flattened = implode('|', $patterns);
if ( preg_match('/'. $patterns_flattened .'/', $string, $matches) )
{
}
// PS: that's off the top of my head, I didn't check it in a code editor
If your patterns don't contain much whitespace, another option would be to eschew the arrays and use the /x modifier. Now your list of regular expressions would look like this:
$regex = "/
pattern1| # search for occurrences of 'pattern1'
pa..ern2| # wildcard search for occurrences of 'pa..ern2'
pat[ ]tern| # search for 'pat tern', whitespace is escaped
mypat # Note that the last pattern does NOT have a pipe char
/x";
With the /x modifier, whitespace is completely ignored, except when in a character class or preceded by a backslash. Comments like above are also allowed.
This would avoid the looping through the array.
If you're merely searching for the presence of a string in another string, use strpos as it is faster.
Otherwise, you could just iterate over the array of patterns, calling preg_match each time.
If you have a bunch of patterns, what you can do is concatenate them in a single regular expression and match that. No need for a loop.
What about doing a str_replace() on the HTML you get using your array and then checking if the result is equal to the original HTML? If nothing was replaced, none of the dead keywords is present. This would be very fast:
$sites = array(
'you_tube' => array('dead', 'moved'),
...
);
foreach ($sites as $site => $deadArray) {
// get $html
if ($html == str_replace($deadArray, '', $html)) {
// video is live
}
}
You can combine all the patterns from the list into a single regular expression using the implode() PHP function, then test your string in one go using preg_match().
$patterns = array(
'abc',
'\d+h',
'[abc]{6,8}\-\s*[xyz]{6,8}',
);
// separator first, array second; the statement also needs its closing semicolon
$master_pattern = '/(' . implode(')|(', $patterns) . ')/';
if(preg_match($master_pattern, $string_to_check))
{
//do something
}
Of course there could be even less code using implode() inline in "if()" condition instead of $master_pattern variable.

Searching for words in a string

What's the best way to search a string in php and find a case insensitive match?
For example:
$SearchString = "This is a test";
From this string, I want to find the word test, or TEST or Test.
Thanks!
EDIT
I should also mention that I want to search the string and if it contains any of the words in my blacklist array, stop processing it. So an exact match of "Test" is important, however, the case is not
If you want to find whole words, and want to forbid "FU" but not "fun", you can use regular expressions with \b, where \b marks the start and end of a word,
so if you search for "\bfu\b" it is not going to match "fun",
if you add an "i" after the closing delimiter, the search is case insensitive,
if you have a list of words like "fu" "foo" "bar" your pattern can look like:
"#\b(fu|foo|bar)\b#i", or you can use a variable:
if(preg_match("#\b{$needle}\b#i", $haystack))
{
return FALSE;
}
Edit, added multiword example with char escaping as requested in comments:
/* load the list somewhere */
$stopWords = array( "word1", "word2" );
/* escape special characters */
foreach($stopWords as $row_nr => $current_word)
{
$stopWords[$row_nr] = preg_quote($current_word, '#'); // escapes regex metacharacters and the # delimiter
}
/* create a pattern of all words (using # instead of / as the delimiter, since / can be used in URLs) */
$pattern = "#\b(" . implode('|', $stopWords) . ")\b#i";
/* execute the search */
if(!preg_match($pattern, $images))
{
/* no stop words */
}
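As a self-contained illustration of the word-boundary approach, here is a small sketch. The function name containsStopWord and the sample words are hypothetical; the sketch assumes the stop words are plain words, so that \b behaves as expected at both ends.

```php
<?php
// Hypothetical helper: true when $haystack contains any stop word as a whole
// word, ignoring case. Stop words are escaped so they are matched literally.
function containsStopWord(string $haystack, array $stopWords): bool
{
    if ($stopWords === []) {
        return false; // an empty alternation would match at every word boundary
    }
    $escaped = array_map(function (string $word): string {
        return preg_quote($word, '#');
    }, $stopWords);
    $pattern = '#\b(?:' . implode('|', $escaped) . ')\b#i';
    return preg_match($pattern, $haystack) === 1;
}

var_dump(containsStopWord('This is a TEST', ['test', 'fu'])); // bool(true)
var_dump(containsStopWord('This is fun', ['test', 'fu']));    // bool(false)
```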
You can do one of a few things, but I tend to use one of these:
You can use stripos()
if (stripos($searchString,'test') !== FALSE) {
echo 'I found it!';
}
You can convert the string to one specific case, and search it with strpos()
if (strpos(strtolower($searchString),'test') !== FALSE) {
echo 'I found it!';
}
I do both and have no preference - one may be more efficient than the other (I suspect the first is better) but I don't actually know.
As a couple of more horrible examples, you could:
Use a regex with the i modifier
Do if (count(explode('test',strtolower($searchString))) > 1)
stripos, I would assume. Presumably it stops searching when it finds a match, and I guess internally it converts to lower (or upper) case, so that's about as good as you'll get.
http://us3.php.net/manual/en/function.preg-match.php
It depends on whether you just want to test for a match.
In that case you would do:
$SearchString= "This is a test";
$pattern = '/test/i'; // the i modifier makes the match case insensitive
preg_match($pattern, $SearchString);
I wasn't reading the question properly. As stated in other answers, stripos or a preg_match function will do exactly what you're looking for.
I originally offered the stristr function as an answer, but you actually should NOT use this if you're just looking to find a string within another string, as it returns the rest of the string in addition to the search parameter.
