PHP Regex string returns two identical arrays - php

I've got a Regex query here to pull out all of the tags in a page. It looks like this:
preg_match_all('%<tr[^>]++>(.*?)</tr>%s', $pageText, $rows);
Problem is that while it does find all of the tags on the page in the return array it actually returns a multidimensional array, where each entry of the first array contains an array of all of the matches. In other words, it hands me multiple identical copies of the first array, IE the one I actually want.
Help please?
EDIT: Also relevant: I'm not allowed to use DOM for this application despite it being a significantly easier (and better) way of going about things.

What you're actually asking about is the $row[0] list, which redundantly contains the <tr>...</tr> blob again. If you just care about the (.*?) inner data, then use \K to reset the full match.
preg_match_all('=<tr\b[^>]*+>(.*?)</tr>\K=s', $pageText, $rows);
It's not possible to get rid of $row[0] completely. You'll have to ignore it, and use $row[1] alone.

Try this one:
preg_match_all('~<tr(?:\\s+[^>]*)?>(.*?)</tr>~si', $pageText, $rows);
var_dump($rows[1]);
Don't use % to wrap RegExps. It's a character somehow reserved for printf() like functions and with %s or %i at the end of your Pattern, it can be quite confusing.

Related

PHP regex examples

I am a beginner learning PHP. I am trying out various combinations with regex. I am referring to this and this and trying out examples on this.
My doubts:
Why does preg_match_all('((a|b)*c)', 'ababc', $arr, PREG_PATTERN_ORDER); give both ababac and a as output. Shouldn't it be just ababac?
Assuming that I'm trying to read all possible outputs in $arr, how do I do that without having to echo each element of the array? Currently, I'm using foreach to iterate the elements in $arr. Is this right / is there a better way to do it?
Why does the preg_match_all() give a 2D array with elements $arr[$i][0] and in some cases $arr[0][$i] where $i is no of of possible matches. Is there a way to know how the output will be stored?
Sorry for posting too many questions in one post. Please tell me if I need to change it.
The (a|b) is a captured subpattern, therefore it is returned as part of the result array. You can make it non-capturing by using (?:a|b), but it would be far better as a character class, [ab].
foreach is probably the best way to go about it.
As mentioned in 1, you will get your subpatterns captured and returned in the array.
You may want to look into preg_replace_callback(), as this will apply a given callback (usually an anonymous function) to each match.

Regex replace matched subexpression (and nothing else)?

I've used regex for ages but somehow I managed to never run into something like this.
I'm looking to do some bulk search/replace operations within a file where I need to replace some data within tag-like elements. For example, converting <DelayEvent>13A</DelayEvent> to just <DelayEvent>X</DelayEvent> where X might be different for each.
The current way I'm doing this is such:
$new_data = preg_replace('|<DelayEvent>(\w+)</DelayEvent>|', '<DelayEvent>X</DelayEvent>', $data);
I can shorten this a bit to:
$new_data = preg_replace('|(<DelayEvent>)(\w+)(</DelayEvent>)|', '${1}X${2}', $data);
But really all I want to do is simulate a "replace text between tags T with X".
Is there a way to do such a thing? In essence I'm trying to prevent having to match all the surrounding data and reassembling it later. I just want to replace a given matched sub-expression with something else.
Edit: The data is not XML, although it does what appear to be tag-like elements. I know better than parsing HTML and XML with RegEx. ;)
It is possible using lookarounds:
$new_data = preg_replace('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', 'X', $data);
See it working online: ideone

PHP dealing with huge string

I have to replace xmlns with ns in my incomming xml in order to fix SimpleXMLElements xpath() function. Most functions do not have a performance problem. But there allways seems to be an overhead as the string grows.
E.g. preg_replace on a 2 MB string takes 50ms to process, even if I limit the replaces to 1 and the replace is done at the very beginning.
If I substr the first few characters and just replace that part it is slightly faster. But not really that what I want.
Is there any PHP method that would perform better in my problem? And if there is no option, could a simple php extension help, that just does Replace => SimpleXMLElement in C?
If you know exactly where the offending "x", "m" and "l" are, you can just use something like $xml[$x_pos] = ' '; $xml[$m_pos] = ' '; $xml[$l_pos] = ' ' to transform them into spaces. Or transform them into ns___ (where _ = space).
You're always going to get an overhead when trying to do this - you're dealing with a char array and trying to do replace multiple matching elements of the array (i.e. words).
50ms is not much of an overhead, unless (as I suspect) you're trying to do this in a loop?
50ms sounds pretty reasonable to me, for something like this. The requirement itself smells of something being wrong.
Is there any particular reason that you're using regular expressions? Why do people keep jumping to the overkill regex solution?
There is a bog-standard string replace function called str_replace that may do what you want in a fraction of the time (though whether this is right for you depends on how complex your search/replace is).
From the PHP source, as we can see, for example here:
http://svn.php.net/repository/php/php-src/branches/PHP_5_2/ext/standard/string.c
I don`t see, any copies, but I'm not expert in C. From the other hand we can see there many convert to string calls, which at 1st sight could copy values. If they copy values, then we in trouble here.
Only if we in trouble
Try to invent some str_replace wheel here with the help of string-by-char processing. For example we have string $somestring = "somevalue". In PHP we could work with it's chars by indexes as echo $somestring{0}, which will give us "s" or echo $somestring{2} which will give us "m". I'm not sure in this way, but it's possible, if official implimentations don't use references, as they should use.

php - regex - catch string inside multiple tags

still on regex! ;-)))
Assuming we have an html file with a lot of <tr> rows with same structure like this below, where (.*?) is the content i need to extract!
<tr align=center><th width=5%><a OnClick="(.*?)"href=#>(.*?)</a><td width=5%>(.*?)<td width=5% align=center >(.*?)</td></tr>
UPDATED
maybe with a nice preg_match_all() ?
i need something like this result
match[0] . match[1] . match[2] . match[3]
just in case someone need someting similar!
THE SOLUTION to my little problem is
/<a\s*OnClick=\"(.*?)\"href=#>(.*?)<\/a><td[^>]+>(.*?)<td[^>]+>(.*?)<\/td><\/tr>/m
thanks for the time!
Luca Filosofi!
Wildly guessing here without actual sample data to match the regex against - also quite unhappy with having to use a regex here. Unless your tables always look exactly alike, I doubt you'll have much fun with regexes.
Anyway, all the caveats aside, this might work:
<tr[^>]+><th[^>]+><a OnClick="([^"]+)"\s*href="([^"]+)">([^<]+)</a><td[^>]+>([^<]+)<td[^>]+>([^<]+)</td></tr>
It expects the tags (and the attributes within the <a> tag) exactly in this order, no angle brackets within quoted strings, no escaped quotes within quoted strings etc. etc. (all those things that you wouldn't have to worry about if you used a parser).
In PHP:
preg_match_all('%<tr[^>]+><th[^>]+><a OnClick="([^"]+)"\s*href="([^"]+)">([^<]+)</a><td[^>]+>([^<]+)<td[^>]+>([^<]+)</td></tr>%', $subject, $result, PREG_PATTERN_ORDER);
$result then is an array where $result[0] contains the entire match, $result[1] contains capturing group no. 1, etc.

How to match anything except a pattern between two tags

I am attempting to match a string which is composed of HTML. Basically it is an image gallery so there is a lot of similarity in the string. There are a lot of <dl> tags in the string, but I am looking to match the last <dl>(.?)+</dl> combo that comes before a </div>.
The way I've devised to do this is to make sure that there aren't any <dl's inside the <dl></dl> combo I'm matching. I don't care what else is there, including other tags and line breaks.
I decided I had to do it with regular expressions because I can't predict how long this substring will be or anything that's inside it.
Here is my current regex that only returns me an array with two NULL indicies:
preg_match_all('/<dl((?!<dl).)+<\/dl>(?=<\/div>)/', $foo, $bar)
As you can see I use negative lookahead to try and see if there is another <dl> within this one. I've also tried negative lookbehind here with the same results. I've also tried using +? instead of just + to no avail. Keep in mind that there's no pattern <dl><dl></dl> or anything, but that my regex is either matching the first <dl> and the last </dl> or nothing at all.
Now I realize . won't match line breaks but I've tried anything I could imagine there and it still either provides me with the NULL indicies or nearly the whole string (from the very first occurance of <dl to </dl></div>, which includes several other occurances of <dl>, exactly what I didn't want). I honestly don't know what I'm doing incorrectly.
Thanks for your help! I've spent over an hour just trying to straighten out this one problem and it's about driven me to pulling my hair out.
Don't use regular expressions for irregular languages like HTML. Use a parser instead. It will save you a lot of time and pain.
I would suggest to use tidy instead. You can easily extra all the desired tags with their contents, even for broken HTML.
In general I would not recommend to write a parser using regex.
See http://www.php.net/tidy
As crazy as it is, about 2 minutes after I posted this question, I found a way that worked.
preg_match_all('/<dl([^\z](?!<dl))+?<\/dl>(?=<\/div>)/', $foo, $bar);
The [^\z] craziness is just a way I used to say "match all characters, even line breaks"

Categories