PHP Regular Expression - Repeating Match of a Group - php

I have a string that may look something like this:
$r = 'Filed under: <a>Group1</a>, <a>Group2</a>';
Here is the regular expression I am using so far:
preg_match_all("/Filed under: (?:<a.*?>([\w|\d|\s]+?)<\/a>)+?/", $r, $matches);
I want the regular expression inside the () to keep making matches, as designated by the +? at the end. But it just won't do it. ::sigh::
Any ideas? I know there has to be a way to do this in one regular expression instead of breaking it up.

Just for fun here's a regex that will work with a single preg_match_all:
'%(?:Filed under:\s*+|\G</a>)[^<>]*+<a[^<>]*+>\K[^<>]+%'
Or, in a more readable format:
'%(?:
Filed\ under: # your sentinel string (space escaped because /x ignores literal whitespace)
|
\G # NEXT MATCH POSITION
</a> # an end tag
)
[^<>]*+ # some non-tag stuff
<a[^<>]*+> # an opening tag
\K # RESET MATCH START
[^<>]+ # the tag's contents
%x'
\G matches the position where the next match attempt would start, which is usually the spot where the previous successful match ended (but if the previous match was zero-length, it bumps ahead one more). That means the regex won't match a substring starting with </a> until after it's matched one starting with Filed under: at least once.
After the sentinel string or an end tag has been matched, [^<>]*+<a[^<>]*+> consumes everything up to and including the next start tag. Then \K spoofs the start position so the match (if there is one) appears to start after the <a> tag (it's like a positive lookbehind, but more flexible). Finally, [^<>]+ matches the tag's contents and brings the match position up to the end tag so \G can match.
But, as I said, this is just for fun. If you don't have to do the job in one regex, you're better off with a multi-step approach like the one @codaddict used; it's more readable, more flexible, and more maintainable.
\K reference
\G reference
EDIT: Although the references I gave are for the Perl docs, these features are supported by PHP, too--or, more accurately, by the PCRE lib. I think the Perl docs are a little better, but you can also read about this stuff in the PCRE manual.
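For what it's worth, here is a minimal sketch of how the single-call pattern might be used from PHP. The variable names and the second sample string (with an unrelated link placed before the sentinel) are illustrative assumptions, not part of the original answer.

```php
<?php
// Pattern from the answer above: \G chains each match to the end of the
// previous one, and \K drops everything matched so far from the result.
$pattern = '%(?:Filed under:\s*+|\G</a>)[^<>]*+<a[^<>]*+>\K[^<>]+%';

$r = 'Filed under: <a>Group1</a>, <a>Group2</a>';
preg_match_all($pattern, $r, $matches);
print_r($matches[0]); // Group1, Group2

// Hypothetical input: a link that appears before the sentinel is skipped.
$r2 = '<a>Home</a> Filed under: <a>Group1</a>, <a>Group2</a>';
preg_match_all($pattern, $r2, $matches2);
print_r($matches2[0]); // Group1, Group2
```

Because of \K, the wanted text is in $matches[0] rather than in a capture group.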

Try:
<?php
$r = 'Filed under: <a>Group1</a>, <a>Group2</a>, <a>Group3</a>, <a>Group4</a>';
if (preg_match_all("/<a.*?>([^<]*?)<\/a>/", $r, $matches)) {
    var_dump($matches[1]);
}
?>
output:
array(4) {
[0]=>
string(6) "Group1"
[1]=>
string(6) "Group2"
[2]=>
string(6) "Group3"
[3]=>
string(6) "Group4"
}
EDIT:
Since you want to include the string 'Filed under' in the search to uniquely identify the match, you can try the following. I'm not sure whether it can be done using a single call to preg_match.
// Since you want to match everything after 'Filed under'
if (preg_match("/Filed under:(.*)$/", $r, $matches)) {
    if (preg_match_all("/<a.*?>([^<]*?)<\/a>/", $matches[1], $matches)) {
        var_dump($matches[1]);
    }
}

$r = 'Filed under: <a>Group1</a>, <a>Group2</a>';
$s = explode("</a>", $r);
foreach ($s as $k) {
    if ($k) {
        $k = explode("<a>", $k);
        print "$k[1]\n";
    }
}
output
$ php test.php
Group1
Group2

I want the regular expression inside the () to keep making matches, as designated by the +? at the end.
+? is a lazy quantifier - it will match as few times as possible. In other words, just once.
If you want to match several times, you want a greedy quantifier - +.
Also note that your regex doesn't quite work - the match fails as soon as it encounters the comma between the tags, because you haven't accounted for it. That likely needs correcting.
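One related point, sketched below: even with a greedy +, a capture group that repeats keeps only the text of its last iteration. The optional (?:, )? used to get past the comma is my assumption about how the pattern might be corrected, not something from the question.

```php
<?php
$r = 'Filed under: <a>Group1</a>, <a>Group2</a>';

// Greedy + on the outer group; (?:, )? is an assumed fix for the comma.
$pattern = '/Filed under: (?:<a[^>]*>([^<]+)<\/a>(?:, )?)+/';

preg_match($pattern, $r, $m);
echo $m[0], "\n"; // the whole string is matched
echo $m[1], "\n"; // Group2: only the last repetition of the group is kept
```

So the repetition itself works, but collecting every group still needs preg_match_all or a second step.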

Related

PHP - Preg match with multiple conditions

I have the following string:
"#(admin/pages|admin/pages/add|admin/pages/[0-9]+)#"
And this string to compare it to:
"admin/pages/1"
What I need is to return "admin/pages/1" when comparing the 2 strings using preg_match(). I have the following code and it's not working:
if(preg_match("#(admin/pages|admin/pages/add|admin/pages/[0-9]+)#", "admin/pages/1", $matches)) {
var_dump($matches);
}
This is the output I get:
array(2) { [0]=> string(11) "admin/pages" [1]=> string(11) "admin/pages" }
Can anybody help out?
Use the following short regex pattern:
"#admin/pages(/add|/[0-9]+)?#"
(/add|/[0-9]+)? - optional alternative group, matches either /add or /<number> at the end of searched substring if occurs
Change your regex to:
"#(admin/pages(?:/\d+)?|admin/pages/add)#"
You don't need both variants (admin/pages|admin/pages/[0-9]+) if you put the digits in the first pattern and make them optional.
Question marks and repetitions are greedy by default, that's why it will always include the digits in the match for my version.
On the other hand, if you have an alternation, it will always pick the first match. Since your first alternation does not include digits, they are not matched.
If you're also wondering why you get your match two times, that's because of the way preg_match works.
Quote from the documentation:
$matches[0] will contain the text that matched the full pattern,
$matches[1] will have the text that matched the first captured
parenthesized subpattern, and so on.
You can remove the outer parentheses if the whole match is enough:
"#admin/pages(?:/\d+)?|admin/pages/add#"
Just use $matches[0].
And, as @RomanPerekhrest has written and I shamelessly include in this answer, you can shorten your pattern. You don't need to include admin/pages multiple times:
"#admin/pages(?:/add|/\d+)?#"
Try changing the order of components:
"#(admin/pages/[0-9]+|admin/pages/add|admin/pages)#"
The regular expression is satisfied as soon as something matches. In your case, it stopped as soon as it found admin/pages without looking any further.
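To make the difference concrete, here is a small sketch that runs the original pattern and the shortened one against the same string. The variable names $before and $after are just for illustration.

```php
<?php
$path = 'admin/pages/1';

// Original order: the first alternative already matches, so the engine stops.
preg_match('#(admin/pages|admin/pages/add|admin/pages/[0-9]+)#', $path, $before);

// Shortened pattern with an optional suffix group.
preg_match('#admin/pages(?:/add|/\d+)?#', $path, $after);

var_dump($before[0]); // string(11) "admin/pages"
var_dump($after[0]);  // string(13) "admin/pages/1"
```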

Regex negative match in middle of string

I have these strings (checking every line separately):
adminUpdate
adminDeleteEx
adminEditXe
adminExclude
listWebsite
listEx
Now I want to match anything that starts with admin but does not end with Ex (case-insensitive).
So after applying regex I must match:
adminUpdate
adminEditXe
adminExclude
My current regex is /^admin[a-z]+(?!ex)$/gi but it matches anything that starts with admin
Just a slight change:
/^admin[a-z]+(?<!ex)$/gi
               ^
Turn your look-ahead into a look-behind.
It is quite hard to explain in the current form. Basically, you need to reach the end of the string $ for the regex to match, and when it happens, (?!ex) is at the end of the string, so it can't see anything ahead. However, since we are at the end of the string, we can use a look-behind (?<!ex) to check whether the string ends with ex or not.
Since look-around is zero-width, we can swap the position of (?<!ex) and $ without changing the meaning (it does change how the engine searches for the matched string, though):
/^admin[a-z]+$(?<!ex)/gi
It is counter-intuitive to write it this way, but easier to see where my argument goes.
Another way to look at it is: because (?!ex) and $ are both zero-width assertions, they are checked at the same position, and being at the end of the string $ implies you won't see anything ahead.
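Here is a rough PHP sketch of the two variants side by side. PHP has no /g flag, so this assumes preg_grep() over an array of the sample names; the variable names are illustrative.

```php
<?php
$names = ['adminUpdate', 'adminDeleteEx', 'adminEditXe', 'adminExclude', 'listWebsite', 'listEx'];

// preg_grep() applies the pattern to each element and keeps the ones that match.
$lookahead  = array_values(preg_grep('/^admin[a-z]+(?!ex)$/i', $names));
$lookbehind = array_values(preg_grep('/^admin[a-z]+(?<!ex)$/i', $names));

print_r($lookahead);  // all four admin* names, including adminDeleteEx
print_r($lookbehind); // adminUpdate, adminEditXe, adminExclude
```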
^(?!.*Ex$)admin.*$
Try this. See the demo:
https://regex101.com/r/sH8aR8/23
$re = '/^(?!.*Ex$)admin.*$/mi';
$str = "adminUpdate\nadminDeleteEx\nadminEditXe\nadminExclude\nlistWebsite\nlistEx";
preg_match_all($re, $str, $matches);

Extracting substrings between curly brackets inside a string into an array using PHP

I need help extracting all the substrings between curly brackets that are found inside a specific string.
I found some solutions in javascript but I need it for PHP.
$string = "www.example.com/?foo={foo}&test={test}";
$subStrings = HELPME($string);
print_r($subStrings);
The result should be:
array( [0] => foo, [1] => test )
I tried playing with preg_match but I got confused.
I'd appreciate it if whoever manages to get it to work with preg_match would also explain the logic behind it.
You could use this regex to capture the strings between {}
\{([^}]*)\}
Explanation:
\{ Matches a literal {
([^}]*) Captures any characters other than }, zero or more times. So it captures up to the next } symbol.
\} Matches a literal }
Your code would be,
<?php
$regex = '~\{([^}]*)\}~';
$string = "www.example.com/?foo={foo}&test={test}";
preg_match_all($regex, $string, $matches);
var_dump($matches[1]);
?>
Output:
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(4) "test"
}
Regex Pattern: \{(\w+)\}
Get all the matches captured by the parentheses (). The pattern says that one or more word characters enclosed by {...} are captured.
Sample code:
$regex = '/\{(\w{1,})\}/';
$testString = ''; // Fill this in
preg_match_all($regex, $testString, $matches);
// the $matches variable contains the list of matches
If you want to capture any type of character inside the {...} then try below regex pattern.
Regex : \{(.*?)\}
Sample code:
$regex = '/\{(.{0,}?)\}/';
$testString = ''; // Fill this in
preg_match_all($regex, $testString, $matches);
// the $matches variable contains the list of matches
<?php
$string = "www.example.com/?foo={foo}&test={test}";
$found = preg_match('/\{([^}]*)\}/',$string, $subStrings);
if($found){
print_r($subStrings);
}else{
echo 'NOPE !!';
}
The function parse_url() parses a URL and returns its components, including the query string.
Try This:
preg_match_all("/\{.*?\}/", $string, $subStrings);
var_dump($subStrings[0]);
Good Luck!
You can use the expression (?<=\{).*?(?=\}) to match any string of text enclosed in {}.
$string = "www.example.com/?foo={foo}&test={test}";
preg_match_all("/(?<=\{).*?(?=\})/",$string,$matches);
print_r($matches[0]);
Regex explained:
(?<=\{) is a positive lookbehind, asserting that the match is preceded by a {.
Similarly, (?=\}) is a positive lookahead asserting that it is followed by a }. .* matches 0 or more characters of any type, and the ? in .*? makes it match the fewest characters possible. (Meaning that in {foo} and {bar} it matches foo and bar, as opposed to foo} and {bar.)
$matches[0] contains an array of all the matched strings.
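As a quick illustration of why the ? matters here, this sketch runs the lazy pattern and a greedy variant on the same string. The greedy variant and the variable names $lazy and $greedy are added only for comparison.

```php
<?php
$string = 'www.example.com/?foo={foo}&test={test}';

// Lazy: stop at the first position that is followed by }.
preg_match_all('/(?<=\{).*?(?=\})/', $string, $lazy);

// Greedy: run as far as possible, then back off to the last }.
preg_match_all('/(?<=\{).*(?=\})/', $string, $greedy);

print_r($lazy[0]);   // foo, test
print_r($greedy[0]); // foo}&test={test
```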
I see answers here using regular expressions with capture groups, lookarounds, and lazy quantifiers. All of these techniques will slow down the pattern -- granted, the performance is very unlikely to be noticeable in the majority of use cases. Because we are meant to offer solutions that are suitable to more scenarios than just the posted question, I'll offer a few solutions that deliver the expected result and explain the differences using the OP's www.example.com/?foo={foo}&test={test} string assigned to $url. I have prepared a php DEMO of the techniques to follow. For information about the function calls, please follow the links to the php manual. For an in depth breakdown of the regex patterns, I recommend using regex101.com -- a free online tool that allows you to test patterns against strings, see the results as both highlighted text and a grouped list, and provides a technique breakdown character-by-character of how the regex engine is interpreting your pattern.
#1 Because your input string is a url, a non-regex technique is appropriate because php has native functions to parse it: parse_url() with parse_str(). Unfortunately, your requirements go beyond extracting the query string's values, you also wish to re-index the array and remove the curly braces from the values.
parse_str(parse_url($url, PHP_URL_QUERY), $assocArray);
$values = array_map(function($v) {return trim($v, '{}');}, array_values($assocArray));
var_export($values);
While this approach is deliberate and makes fair use of native functions that were built for these jobs, it ends up making longer, more convoluted code which is somewhat unpleasant in terms of readability. Nonetheless, it provides the desired output array and should be considered as a viable process.
#2 preg_match_all() is a super brief and highly efficient technique to extract the values. One drawback of using regular expressions is that the regex engine is completely "unaware" of any special meanings that a formatted input string may have. In this case, I don't see any negative impacts, but when hiccups do arise, often the solution is to use a parser that is "format/data-type aware".
var_export(preg_match_all('~\{\K[^}]*~', $url, $matches) ? $matches[0] : []);
Notice that my pattern does not need capture groups or lookarounds; nor does my answer suffer from the use of a lazy quantifier. \K is used to "restart the fullstring match" (in other words, forget any matched characters up to that point). All of these features mean that the regex engine can traverse the string with peak efficiency. If there are downsides to using the function, they are:
that a multi-dimensional array is generated while you only want a one-dimensional array
that the function creates a reference variable instead of returning the results
#3 preg_split() most closely aligns with the plain-English intent of your task AND it provides the exact output as its return value.
var_export(preg_split('~(?:(?:^|})[^{]*{)|}[^{]*$~', $url, 0, PREG_SPLIT_NO_EMPTY));
My pattern, while admittedly unsavoury to the novice regex pattern designer AND slightly less efficient because it is making "branched" matches (|), basically says: "Split the string at the following delimiters:
from the start of the string or from a }, including all non-{ characters, then the first encountered { (this is the end of the delimiter).
from the last }, including all non-{ characters until the end of the string."

pregmatch between characters and any numeric

I'm stuck writing a preg_match
I have a string:
XPMG_ar121023.txt
and need to extract the 2 letters between XPMG_ and the first digit - be it a 0-9
$str = 'XPMG_ar121023.txt';
preg_match('/('XPMG_')|[0-9\,]))/', $str, $match);
print_r($match);
Maybe this isn't the best option: My characters will always be
You can just do
$str = "XPMG_ar121023.txt" ;
preg_match('/_([a-z]+)/i', $str, $match);
var_dump($match[1]);
Output
string 'ar' (length=2)
This is too simple for a regular expression. Just $match = substr($str, 5, 2) would get what you're asking for.
Let me walk through this step by step so as to help you solve similar problems in the future. Suppose we have the following format for our filenames:
XPMG_ar121023.txt
We know what we want to capture, we want the "ar" right after the _ and just before the numbers begin. So our expression would look something like this:
_[a-z]+
This is pretty straight-forward. We're starting by looking for an underscore, followed by any number of letters between a and z. The square brackets define a character class. Our class consists of the alphabet, but you can push specific numbers in there and more if you like.
Now because we want to capture only the letters, we need to put parentheses around that part of the pattern:
_([a-z]+)
In the result we will now have access to only that subpattern. Next we put our delimiters in place to specify where our pattern begins, and ends:
/_([a-z]+)/
And lastly, after our closing delimiter we can add some modifiers. As it is written, our pattern only looks for lower-case letters. We can add the i modifier to make this case-insensitive:
/_([a-z]+)/i
Voila, we're done. Now we can pass it into preg_match to see what it spits out:
preg_match( "/_([a-z]+)/i", "XPMG_ar121023.txt", $match );
This function takes a pattern as the first parameter, a string to match it against as the second, and lastly a variable to spit the results into. When all is said and done, we can check $match for our data.
The results of this operation follow:
array(2) {
[0]=> string(3) "_ar"
[1]=> string(2) "ar"
}
This is the contents of $match. Notice our full pattern is found in the first index of the array, and our captured portion is provided in the second index of the array.
echo $match[1]; // ar
Hope this helps.
Well, why not:
$letters = $str[5].$str[6];
:)
After all, you'll always need the 2 chars after the fixed prefix, there are many ways that do not require a regexp (substr() being the best anyway)
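For comparison, a small sketch of both routes. The stricter regex is my own assumed variant (it requires the XPMG_ prefix, exactly two letters, and a digit straight after them); it is not taken from any of the answers above.

```php
<?php
$str = 'XPMG_ar121023.txt';

// Fixed-position approach: skip the 5-character prefix, take 2 characters.
$bySubstr = substr($str, 5, 2);

// Assumed stricter regex: prefix, exactly two letters, then a digit.
$byRegex = preg_match('/^XPMG_([a-z]{2})\d/i', $str, $m) ? $m[1] : null;

var_dump($bySubstr); // string(2) "ar"
var_dump($byRegex);  // string(2) "ar"
```

The substr() version is shorter, while the regex version returns null if a filename doesn't follow the expected shape.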

Regular expression anchor text for a link

I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This is matching of course the complete link.
PHP's preg functions use PCRE (Perl Compatible Regular Expressions), which is very close to Perl's regex syntax. If you want to know a lot about regex, visit perlretut.org. Also, look into regex tools like Expresso.
For your use, know that regex is greedy. That means that when you specify that you want something, followed by anything (any repetitions), followed by something, it will keep on going until the last place where that second something can still be matched.
to be more clear, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
Beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups (capture anything inside of parentheses as a group for reference later, also called back references).
I would also look into named subpatterns, too - with those, you can reference your choice with a human readable string rather than an array index. Syntax for those in PHP are (?P<name>pattern) where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
print $matches['target'];
?>
//This should output "i_want_this"
Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags that we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).
I'm sure someone will probably have a more elegant solution, but I think this will do what you want done.
Where:
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
Option 2:
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2); // note: ereg() was removed in PHP 7; prefer Option 1
print($matches2[4]);
Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
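A minimal sketch of the DOM route in PHP, assuming the dom extension is available. The markup in $html is an assumption reconstructed from the regex in the question, and the variable names are illustrative.

```php
<?php
// Assumed markup: the anchor is reconstructed from the regex in the question.
$html = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Gather the text content of every <a> element.
$texts = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $texts[] = $link->textContent;
}

print_r($texts); // i_want_this
```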
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
The thing to remember is that a regex returns everything you searched for if it matches. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs '<a href="/en/browse/file/variable_text">i_want_this</a>'
If you specify what you want in parentheses, you can reference it:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.
I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
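#<a href="/en/browse/file/[^"]*">(.*?)</a>#
I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.
If you want to limit to numbers instead, you can use:
#<a href="/en/browse/file/[0-9]+">(.*?)</a>#
If it should have just 5 numbers:
#<a href="/en/browse/file/[0-9]{5}">(.*?)</a>#
If it should have between 3 and 6 numbers:
#<a href="/en/browse/file/[0-9]{3,6}">(.*?)</a>#
If it should have more than 2 numbers:
#<a href="/en/browse/file/[0-9]{3,}">(.*?)</a>#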
This should work:
<a href="[^"]*">([^<]*)
This says: take EVERYTHING you find until you meet "
[^"]*
Same again: take everything until you meet <
[^<]*
The parentheses around [^<]*
([^<]*)
group it, so you can collect that data in PHP! If you look in the PHP manual on preg_match you will see many fine examples there!
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some examples...
.*?
can be extremely slow! You shouldn't use that if you can use
[^<]*
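Speed aside, the negated classes also stop a match from running across several links. A rough sketch, using a hypothetical two-link string ($two) that is not from the question:

```php
<?php
// Hypothetical input with two links on one line.
$two = '<a href="/a">one</a> <a href="/b">two</a>';

// Negated classes cannot run past the closing quote or the next tag.
preg_match_all('#<a href="[^"]*">([^<]*)</a>#', $two, $all);
print_r($all[1]); // one, two

// Greedy .* in the href swallows the first link entirely.
preg_match('#<a href=".*">(.*)</a>#', $two, $g);
echo $g[1], "\n"; // two
```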
You should use the tool Expresso for creating regular expression... Pretty handy..
http://www.ultrapico.com/Expresso.htm
