I am a beginner learning PHP. I am trying out various combinations with regex. I am referring to this and this and trying out examples on this.
My doubts:
1. Why does preg_match_all('((a|b)*c)', 'ababac', $arr, PREG_PATTERN_ORDER); give both ababac and a as output? Shouldn't it be just ababac?
2. Assuming that I'm trying to read all possible outputs in $arr, how do I do that without having to echo each element of the array? Currently, I'm using foreach to iterate over the elements in $arr. Is this right, or is there a better way to do it?
3. Why does preg_match_all() give a 2D array with elements $arr[$i][0] and in some cases $arr[0][$i], where $i is the number of possible matches? Is there a way to know how the output will be stored?
Sorry for posting too many questions in one post. Please tell me if I need to change it.
1. The (a|b) is a captured subpattern, therefore it is returned as part of the result array. You can make it non-capturing by using (?:a|b), but it would be far better as a character class, [ab].
2. foreach is probably the best way to go about it.
3. As mentioned in 1, you will get your subpatterns captured and returned in the array.
You may want to look into preg_replace_callback(), as this will apply a given callback (usually an anonymous function) to each match.
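To make the array layout concrete, here is a minimal sketch comparing the two ordering flags of preg_match_all(). The subject string 'abc bac' is made up for illustration so that there are two matches, and the pattern is written with / delimiters rather than the parentheses used in the question.

```php
<?php
// Hypothetical subject with two matches: 'abc' and 'bac'.
$subject = 'abc bac';

// PREG_PATTERN_ORDER (the default): index 0 holds all full matches,
// index 1 holds all captures of group 1, and so on.
preg_match_all('/(a|b)*c/', $subject, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: one sub-array per match, each holding the full
// match followed by that match's captured groups.
preg_match_all('/(a|b)*c/', $subject, $bySet, PREG_SET_ORDER);

// With a character class there is no capture group, so only index 0 exists.
preg_match_all('/[ab]*c/', $subject, $noGroups);

// print_r() dumps a whole nested array without a manual loop.
print_r($byPattern);
print_r($bySet);
print_r($noGroups);
```

A repeated group such as (a|b)* only keeps its last iteration, which is why the capture is a single letter rather than the whole run.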
I'm having an issue with preg_match_all. I have this string:
$product_req = "ACTIVE-6,CATEGORY-ACTIVE-8,CATEGORY-ACTIVE-4,ACTIVE-9";
I need to get the numbers preceded by "ACTIVE-" but not by "CATEGORY-ACTIVE-", so in this case the result should be 6,9. I used the statement below:
preg_match_all("/ACTIVE-(\d+)/", $product_req, $this_act);
However, this returns all the numbers, because all of them are in fact preceded by "ACTIVE-". That's not what I meant: I need to leave out those preceded by "CATEGORY-ACTIVE-". How can I configure preg_match_all to do it? Or maybe there is some other function that can do the job?
EDIT:
I tried this:
preg_match_all("/CATEGORY-ACTIVE-(\d+)/", $product_req, $this_cat_act);
preg_match_all("/ACTIVE-(\d+)/", $product_req, $this_act);
$act_cat = str_replace($this_cat_act[1],"",$this_act[1]);
It kind of works, but I guess there is a better and cleaner way to do it. Besides, the output is kind of weird too.
Thank you.
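One possible approach (not taken from the thread itself) is a negative lookbehind, which lets "ACTIVE-" match only when it is not immediately preceded by "CATEGORY-". A minimal sketch, reusing the variable names from the question:

```php
<?php
$product_req = "ACTIVE-6,CATEGORY-ACTIVE-8,CATEGORY-ACTIVE-4,ACTIVE-9";

// (?<!CATEGORY-) is a fixed-length negative lookbehind: the match is
// rejected when the text just before "ACTIVE-" is "CATEGORY-".
preg_match_all('/(?<!CATEGORY-)ACTIVE-(\d+)/', $product_req, $this_act);

// Index 1 holds the captured numbers for every accepted match.
$numbers = $this_act[1];

echo implode(',', $numbers), "\n";
```

This needs a single call and avoids the str_replace() step, which leaves empty strings behind in the array.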
I've got a regex here to pull out all of the <tr> tags in a page. It looks like this:
preg_match_all('%<tr[^>]++>(.*?)</tr>%s', $pageText, $rows);
The problem is that, while it does find all of the <tr> tags on the page, the return value is actually a multidimensional array, where each entry of the outer array contains an array of all of the matches. In other words, it hands me multiple near-identical copies of the first array, i.e. the one I actually want.
Help please?
EDIT: Also relevant: I'm not allowed to use DOM for this application despite it being a significantly easier (and better) way of going about things.
What you're actually asking about is the $rows[0] list, which redundantly contains the <tr>...</tr> blob again. If you just care about the (.*?) inner data, then use \K to reset the full match.
preg_match_all('=<tr\b[^>]*+>(.*?)</tr>\K=s', $pageText, $rows);
It's not possible to get rid of $rows[0] completely. You'll have to ignore it, and use $rows[1] alone.
Try this one:
preg_match_all('~<tr(?:\\s+[^>]*)?>(.*?)</tr>~si', $pageText, $rows);
var_dump($rows[1]);
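As a small illustration of what ends up where, here is a sketch that runs the pattern above on a made-up two-row table. The sample markup in $pageText is an assumption for demonstration only.

```php
<?php
// Hypothetical input: two rows, one with an attribute, one spanning lines.
$pageText = "<table>\n<tr class=\"odd\"><td>1</td></tr>\n<tr><td>2</td>\n</tr>\n</table>";

preg_match_all('~<tr(?:\s+[^>]*)?>(.*?)</tr>~si', $pageText, $rows);

// $rows[0] holds the full <tr>...</tr> matches,
// $rows[1] holds only what the (.*?) group captured.
$inner = $rows[1];

foreach ($inner as $i => $cells) {
    echo $i, ': ', trim($cells), "\n";
}
```

The outer array has one entry for the full match plus one per capture group, so a pattern with a single group always yields exactly two lists.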
Don't use % to wrap regexes. It is the format-specifier character in printf()-like functions, so with %s or %i at the end of your pattern it can be quite confusing to read.
I've got a problem with regexp function, preg_replace(), in PHP.
I want to get the viewstate value from an HTML input element, but it doesn't work properly.
This code:
$viewstate = preg_replace('/^(.*)(<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value=")(.*[^"])("\s+name="__VIEWSTATE">)(.*)$/u','^\${3}$',$html);
Returns this:
%0D%0A%0D%0A%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.0+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-transitional.dtd%22%3E%0D%0A%0D%0A%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+%3E%0D%0A%3Chead%3E%3Ctitle%3E%0D%0A%09Strava.cz%0D%0A%3C%2Ftitle%3E%3Clink+rel%3D%22shortcut+icon%22+href%3D%22..%2FGrafika%2Ffavicon.ico%22+type%3D%22image%2Fx-icon%22+%2F%3E%3Clink+rel%3D%22stylesheet%22+type%3D%22text%2Fcss%22+media%3D%22screen%22+href%3D%22..%2FStyly%2FZaklad.css%22+%2F%3E%0D%0A++++%3Cstyle+type%3D%22text%2Fcss%22%3E%0D%0A++++++++.style1%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+47px%3B%0D%0A++++++++%7D%0D%0A++++++++.style2%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+64px%3B%0D%0A++++++++%7D%0D%0A++++%3C%2Fstyle%3E%0D%0A%0D%0A%3Cscript+type%3D%22text%2Fjavascript%22%3E%0D%0A%0D%0A++var+_gaq+%3D+_gaq+%7C%7C+%5B%5D%3B%0D%0A++_gaq.push%28%5B
EDIT: Sorry, I left this question for a long time. Finally I used DOMDocument.
To be sure, I'd split this match into two phases:
Find the relevant input element
Get the value
Because you cannot be certain what the attributes order in the element will be.
if (preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $html, $match)) {
    // $match[0] is the whole <input ...> tag; keep only its value attribute
    $value = preg_replace('/.*value="([^"]*)".*/i', '$1', $match[0]);
}
And, of course, always consider DOM and DOMXpath over regex for parsing html/xml.
You should only capture when you're planning on using the data, so most of the () in that pattern are unnecessary. Not a cause for failure, but I thought I'd mention it.
Instead of using [^"] to mark that you don't want that character, you could use the non-greedy modifier, ?. This makes sure the pattern matches as little as it can. Since you have name="__VIEWSTATE" following the value, this should be safe.
Let's put this in practice and simplify the pattern some. This works as you want:
'/.*<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="(.+?)"\s+name="__VIEWSTATE">.*/'
I would strongly recommend checking out an alternative to regex for DOM operations. This makes certain your code also works if the attributes change order. Plus it's so much nicer to work with.
The main mistake was the use of the function preg_replace, which returns the subject string (with any replacements applied), not the matched text or the replacement on its own. Thank you for your ideas and for the recommendation of DOMDocument. m93a
http://www.php.net/manual/en/function.preg-replace.php#refsect1-function.preg-replace-returnvalues
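To illustrate the difference, here is a minimal sketch contrasting preg_replace(), which hands back the subject, with preg_match(), which fills an array of captures. The sample $html and its value are invented for the example; the two-phase lookup follows the suggestion above so that attribute order does not matter.

```php
<?php
// Hypothetical page fragment; the value is made up.
$html = '<form><input id="__VIEWSTATE" type="hidden" value="dDwtMTA4" name="__VIEWSTATE"></form>';

// When the pattern does not match, preg_replace() returns the subject unchanged.
$unchanged = preg_replace('/no-such-pattern/', 'x', $html);

// Phase 1: find the input tag. Phase 2: pull the value attribute out of it.
$viewstate = null;
if (preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $html, $tag)
    && preg_match('/\bvalue="([^"]*)"/i', $tag[0], $m)) {
    $viewstate = $m[1]; // text captured by the first group
}

var_dump($viewstate);
```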
I am attempting to match a string which is composed of HTML. Basically it is an image gallery so there is a lot of similarity in the string. There are a lot of <dl> tags in the string, but I am looking to match the last <dl>(.?)+</dl> combo that comes before a </div>.
The way I've devised to do this is to make sure that there aren't any <dl's inside the <dl></dl> combo I'm matching. I don't care what else is there, including other tags and line breaks.
I decided I had to do it with regular expressions because I can't predict how long this substring will be or anything that's inside it.
Here is my current regex, which only returns an array with two NULL indices:
preg_match_all('/<dl((?!<dl).)+<\/dl>(?=<\/div>)/', $foo, $bar)
As you can see I use negative lookahead to try and see if there is another <dl> within this one. I've also tried negative lookbehind here with the same results. I've also tried using +? instead of just + to no avail. Keep in mind that there's no pattern <dl><dl></dl> or anything, but that my regex is either matching the first <dl> and the last </dl> or nothing at all.
Now I realize . won't match line breaks, but I've tried everything I could imagine there and it still either gives me the NULL indices or nearly the whole string (from the very first occurrence of <dl to </dl></div>, which includes several other occurrences of <dl>, exactly what I didn't want). I honestly don't know what I'm doing incorrectly.
Thanks for your help! I've spent over an hour just trying to straighten out this one problem and it's about driven me to pulling my hair out.
Don't use regular expressions for irregular languages like HTML. Use a parser instead. It will save you a lot of time and pain.
I would suggest using tidy instead. You can easily extract all the desired tags with their contents, even from broken HTML.
In general, I would not recommend writing a parser using regex.
See http://www.php.net/tidy
As crazy as it is, about 2 minutes after I posted this question, I found a way that worked.
preg_match_all('/<dl([^\z](?!<dl))+?<\/dl>(?=<\/div>)/', $foo, $bar);
The [^\z] craziness is just a way I used to say "match all characters, even line breaks"
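A more conventional way to let . match line breaks is the s (DOTALL) modifier. The following is a sketch of that variant, not the pattern from the post: it checks the negative lookahead before each character, and the sample markup in $foo is made up for illustration.

```php
<?php
// Hypothetical gallery markup: two <dl> blocks, the second one multi-line
// and directly followed by </div>.
$foo = "<div>\n<dl><dt>one</dt></dl>\n<dl><dt>two</dt>\n<dd>x</dd></dl></div>";

// (?:(?!<dl\b).)*? consumes one character at a time, but only while the
// next characters do not start another <dl. The s modifier lets . match
// newlines, and the lookahead requires </div> (after optional whitespace).
preg_match_all('/<dl\b(?:(?!<dl\b).)*?<\/dl>(?=\s*<\/div>)/s', $foo, $bar);

print_r($bar[0]);
```

Because the inner part can never cross another <dl, the match cannot start at the first block and run through to the last </dl>.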
I need some help with creating a regex for my php script. Basically, I have an associative array containing my data, and I want to use preg_replace to replace some place-holders with real data. The input would be something like this:
<td>{{address}}</td><td>{{fixDate}}</td><td>{{measureDate}}</td><td>{{builder}}</td>
I don't want to use str_replace, because the array may hold many more items than I need.
If I understand correctly, preg_replace is able to take the text that it finds from the regex, and replace it with the value of that key in the array, e.g.
<td>{{address}}</td>
get replaced with the value of $replace['address']. Is this true, or did I misread the php docs?
If it is true, could someone please help show me a regex that will parse this for me (would appreciate it if you also explain how it works, since I am not very good with regexes yet).
Many thanks.
Use preg_replace_callback(). It's incredibly useful for this kind of thing.
$replace_values = array(
'test' => 'test two',
);
$result = preg_replace_callback('!\{\{(\w+)\}\}!', 'replace_value', $input);
function replace_value($matches) {
global $replace_values;
return $replace_values[$matches[1]];
}
Basically this says find all occurrences of {{...}} containing word characters and replace that value with the value from a lookup table (being the global $replace_values).
For well-formed HTML/XML parsing, consider using the Document Object Model (DOM) in conjunction with XPath. It's much more fun to use than regexes for that sort of thing.
To not have to use global variables and gracefully handle missing keys you can use
function render($template, $vars) {
return \preg_replace_callback("!{{\s*(?P<key>[a-zA-Z0-9_-]+?)\s*}}!", function($match) use($vars){
return isset($vars[$match["key"]]) ? $vars[$match["key"]] : $match[0];
}, $template);
}
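A short usage sketch for the closure-based approach, applied to a row like the one in the question. The render() function is restated here (with / delimiters and escaped braces) so the example is self-contained, and the data in $replace is invented for illustration.

```php
<?php
// Restated helper: replaces {{key}} with $vars['key'], leaving unknown
// placeholders untouched.
function render($template, array $vars) {
    return preg_replace_callback(
        '/\{\{\s*(?P<key>[a-zA-Z0-9_-]+)\s*\}\}/',
        function (array $match) use ($vars) {
            return array_key_exists($match['key'], $vars)
                ? (string) $vars[$match['key']]
                : $match[0]; // no such key: keep the placeholder as-is
        },
        $template
    );
}

// Hypothetical data; 'unused' shows that extra keys are simply ignored.
$replace = ['address' => '12 Main St', 'builder' => 'Acme', 'unused' => 'ignored'];
$row = '<td>{{address}}</td><td>{{fixDate}}</td><td>{{ builder }}</td>';

$out = render($row, $replace);
echo $out, "\n";
```

Only placeholders that actually appear in the template trigger a lookup, so the size of the array does not matter. If the values can contain markup, escaping them (for example with htmlspecialchars()) before insertion would be a sensible addition.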