php preg_match and ereg syntax difference - php

I found that the syntax of preg_match() differs from that of the deprecated ereg().
For example:
I thought that
preg_match('/^<div>(.*)<\/div>$/', $content);
means the same as
ereg('^<div>(.*)</div>$', $content);
but I was wrong. In preg_match() the dot doesn't match special characters such as newlines the way it does in ereg().
So I started to use this syntax:
preg_match('/^<div>([^<]*)<\/div>$/', $content);
but it isn't exactly what I need.
Can anyone suggest how to solve this problem without using deprecated functions?

For parsing HTML I'd suggest reading this question and choosing a built in PHP extension.
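As a rough illustration of the built-in route, here is a minimal sketch that uses the DOM extension (DOMDocument). The answer does not name a specific extension, so this choice is an assumption, and the sample markup in $html is made up for the example.

```php
<?php
// Hypothetical sample input: a <div> whose content spans two lines
// and contains a nested tag.
$html = "<div>first line\n<b>bold</b> text</div>";

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// First <div> in the document.
$div = $doc->getElementsByTagName('div')->item(0);

// Text content only, with nested tags stripped.
$text = $div->textContent;

// Inner markup, rebuilt by serializing each child node.
$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $text, "\n";
echo $inner, "\n";
```

Unlike the regex versions, this does not care about newlines or nested tags inside the <div>.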
If for some reason you need or want to use RegEx to do it you should know that:
- preg_match() is a greedy little bugger: (.*) will try to eat everything it can until it gets sick (meaning it hits recursion or backtracking limits). You can change this with the U modifier [1].
- by default the dot doesn't match newlines, and ^ and $ anchor to the whole subject rather than to each line. You can change this with the s modifier (dot matches newlines) or the m modifier (^ and $ match at line boundaries) [1].
- your 'not a < character' ([^<]*) hack does a good job, as it forces the engine to stop at the first < char, but it will work only if the <div> doesn't contain other tags inside!
[1] PCRE Pattern Modifiers
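To make the modifiers concrete, here is a small sketch. The sample strings are invented for the example, and the slash in </div> is escaped because / is the delimiter.

```php
<?php
// Hypothetical input with a newline inside the <div>.
$content = "<div>first line\nsecond line</div>";

// Without modifiers the dot stops at the newline, so nothing matches.
$plain = preg_match('/^<div>(.*)<\/div>$/', $content, $m1);

// With the s modifier the dot also matches newlines.
$dotall = preg_match('/^<div>(.*)<\/div>$/s', $content, $m2);

// Greedy vs. lazy on a string with two <div> elements.
$two = '<div>a</div><div>b</div>';
preg_match('/<div>(.*)<\/div>/s', $two, $greedy);   // runs to the last </div>
preg_match('/<div>(.*?)<\/div>/s', $two, $lazy);    // stops at the first </div>
preg_match('/<div>(.*)<\/div>/sU', $two, $ungreedy); // U flips greediness

var_dump($plain, $dotall);
echo $greedy[1], "\n";   // a</div><div>b
echo $lazy[1], "\n";     // a
echo $ungreedy[1], "\n"; // a
```

The lazy quantifier (.*?) and the U modifier give the same result here; the quantifier form is usually easier to read because the behaviour is visible next to the group it affects.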

Related

Exploiting preg_replace /e modifier in PHP?

I'm currently tinkering with a hacking challenge where the goal is to output phpinfo() on the page. After some poking around I've found that the injection point is the search page as it runs preg_replace with the 'e' modifier on the search query. I've been able to trigger errors with inputs such as ") blah" and "b|exit(phpinfo());" but unfortunately I'm not sure how to phrase my injection so that preg_replace actually runs it.
The confusing part I haven't wrapped my head entirely around is that the regular expression seems to be matching names that correspond to those listed on the page, but only if said input is 3 characters or greater in length. I deduced that I would have to find a way to both match the regular expression and then append some code to the end so it evaluates phpinfo() as a command instead of a string.
Does anyone have more insight into what is going on, and how to crack this?
The /e modifier allows a second argument to be evaluated as a PHP expression.
So if you were to do something like:
$string = "phpinfo()";
print preg_replace('/^(.*)/e', 'strtoupper(\\1)', $string);
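The backreference is substituted into the replacement before it is evaluated, so the code that runs is strtoupper(phpinfo()), which fires the function and prints the PHP info. Depending on how the search is set up, you can modify the input so that it prints properly. Note that the /e modifier was deprecated in PHP 5.5.0 and removed in PHP 7.0.0. Hope this helps.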
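To see what /e would hand to the evaluator without actually evaluating anything, the sketch below runs the same substitution with the modifier left off. The preg_replace_callback() call is added as an assumption about what a non-deprecated equivalent would look like; it is not part of the answer above.

```php
<?php
$string = 'phpinfo()';

// Same pattern and replacement, minus /e: this is the code string
// that /e would have passed on for evaluation.
$code = preg_replace('/^(.*)/', 'strtoupper($1)', $string);

// With a callback, the captured text is only ever treated as data.
$safe = preg_replace_callback(
    '/^(.*)/',
    function ($m) {
        return strtoupper($m[1]);
    },
    $string
);

echo $code, "\n"; // strtoupper(phpinfo())
echo $safe, "\n"; // PHPINFO()
```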

Difference in laziness of lookahead assertions between JavaScript and PHP

I'm confused by a difference I found between the way JavaScript and PHP handle the following regex.
In JavaScript,
'foobar'.replace(/(?=(bar))/ , '$1');
'foobar'.replace(/(?=(bar))?/ , '$1');
'foobar'.replace(/(?:(?=(bar)))?/, '$1');
results in, respectively,
foobarbar
foobar
foobar
as shown in this jsFiddle.
However, in PHP,
echo preg_replace('/(?=(bar))/', '$1', "foobar<br/>");
echo preg_replace('/(?=(bar))?/', '$1', "foobar<br/>");
echo preg_replace('/(?:(?=(bar)))?/', '$1', "foobar<br/>");
results in,
foobarbar
Warning: preg_replace() [function.preg-replace]: Compilation failed: nothing to repeat at offset 9 in /homepages/26/d94605010/htdocs/lz/writecodeonline.com/php/index.php(201) : eval()'d code on line 2
foobarbar
I'm not so much worried about the warning. But it appears that in JavaScript, lookahead assertions are somehow "lazier" than in PHP. Why the difference? Is this a bug in one of the engines? Which is theoretically more "correct"?
The real difference is actually very simple:
In JavaScript, replace will only replace the first match, unless using the /g flag (global).
In PHP, preg_replace replaces all matches.
The third pattern, (?:(?=(bar)))?, can match the empty string in every position, and captures "bar" in some positions. Without the /g flag, it only matches once, at the beginning of the string.
You would have easily seen the difference had you used a more visible replacement string, like [$1].
PHP Example: http://ideone.com/8Mjg6
JavaScript Example, no /g: http://jsfiddle.net/qKb4b/3/
JavaScript Example, with /g: http://jsfiddle.net/qKb4b/2/
I would also note that "laziness" is a different concept in regular expressions, not related to this question.
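A short PHP sketch of the point about the visible replacement string: the fourth argument of preg_replace() limits the number of replacements, so a limit of 1 mimics JavaScript's replace without /g.

```php
<?php
$subject = 'foobar';
$pattern = '/(?:(?=(bar)))?/';

// Default: replace every match, including the empty ones.
$all = preg_replace($pattern, '[$1]', $subject);

// Limit of 1: only the first match, like JavaScript without /g.
$first = preg_replace($pattern, '[$1]', $subject, 1);

echo $all, "\n";   // []f[]o[]o[bar]b[]a[]r[]
echo $first, "\n"; // []foobar
```

The pattern matches the empty string at every position, but the capture group is only filled at the position right before "bar".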

Help converting PHP eregi to preg_match

I am wondering if someone could please help me convert a piece of PHP code that is now deprecated.
Here is the single line I am trying to convert:
eregi("<text>(.*)TYPE[ \r\n]*(OF|or)[ \r\n]*REPORTING[ \r\n]*PERSON",$string,$outp);
When I convert to the following:
preg_match("/<text>(.*)TYPE[ \r\n]*(OF|or)[ \r\n]*REPORTING[ \r\n]*PERSON/i",$string,$outp);
It matched nothing. The original eregi function works well.
You need the i and s modifiers together (/.../is) at the end of the regex.
The reason is that the preg_ functions do not match line breaks with .* unless the s modifier is set, whereas the old ereg functions did so by default.
Otherwise your regular expression should work unchanged with PCRE.
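A minimal sketch of the difference, using the pattern from the question. The contents of $string are a made-up sample, chosen so that the text after <text> spans several lines.

```php
<?php
// Hypothetical input: line breaks between <text> and the closing phrase.
$string = "<text>\nline one\nline two\nTYPE OF\r\nREPORTING PERSON";

$pattern = '<text>(.*)TYPE[ \r\n]*(OF|or)[ \r\n]*REPORTING[ \r\n]*PERSON';

// /i only: the dot cannot cross the line breaks, so there is no match.
$withoutS = preg_match('/' . $pattern . '/i', $string, $ignored);

// /is: the dot matches line breaks as well.
$withS = preg_match('/' . $pattern . '/is', $string, $outp);

var_dump($withoutS, $withS);
var_dump($outp[1], $outp[2]);
```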

Grubers new and improved URL recognising regex

I've been trying to use grubers latest url matching regex in a php project.
To test it I threw together something very simple:
$regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:"'.,<>?«»“”‘’]))";
preg_match_all($regex, $theblockofurltext, $array);
print_r($array);
The first problem was that the " inside the regex would end the string early, depending on which quote character I wrapped the regex with, so I just removed it. This is for personal use and I will never have a " anywhere near a URL anyway. This left me with a new regex:
$regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'.,<>?«»“”‘’]))";
Raring to go I then ran my little script and it gave me the following error:
Warning: preg_split() [function.preg-split]: Unknown modifier '\' in D:\wwwroot\xxx\index.php on line 14
Unfortunately my regex class at school didn't go anywhere near the level this regex requires, and I have no idea where to begin fixing this for use with PHP. Any help would be greatly appreciated. No doubt I'm doing something stupid too, so please go easy on me :)
Jon
Add # before and after your RE.
$regex = "#(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'.,<>?«»“”‘’]))#";
If you use PCRE, the regular expression must be enclosed in delimiters. Parentheses () can also be delimiters, which is why the engine thinks your expression is only (?i) and interprets the following \ as a modifier.
You could use ~ as delimiter:
$regex = "~(?i)\b...]))~";
Update:
PCRE does accept inline modifiers such as (?i), so leaving it in place works. Since you apply it to the whole expression anyway, you can equally remove it and put the modifier after the closing delimiter instead:
$regex = "~\b...]))~i";
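A quick way to check the delimiter and modifier placement is a much smaller pattern. The one below is a simplified stand-in built from a fragment of the full regex, not the full regex itself, and $text is a made-up sample.

```php
<?php
// Hypothetical sample text.
$text = 'Visit WWW.Example.com or www2.example.org today';

// Modifier after the closing delimiter.
preg_match_all('~\bwww\d{0,3}[.][a-z0-9.\-]+~i', $text, $trailing);

// Same pattern with the inline (?i) form.
preg_match_all('~(?i)\bwww\d{0,3}[.][a-z0-9.\-]+~', $text, $inline);

print_r($trailing[0]);
print_r($inline[0]);
```

Both calls find the same two host names, so either placement of the case-insensitive flag works once the delimiters are in place.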

Weird error using preg_match and unicode

if (preg_match('(\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+)', '2010/02/14/this-is-something'))
{
// do stuff
}
The above code works. However this one doesn't.
if (preg_match('/\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+/u', '2010/02/14/this-is-something'))
{
// do stuff
}
Maybe someone could shed some light as to why the one below doesn't work. This is the error that is being produced:
A PHP Error was encountered
Severity: Warning
Message: preg_match() [function.preg-match]: Unknown modifier '\'
Try this (delimit the regex with #):
if (preg_match('#\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+#', '2010/02/14/this-is-something'))
{
// do stuff
}
Edited
The modifier u is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32.
Also, as nvl observed, you are using / as the delimiter and you are not escaping the / characters present in the regex. So you'll have to use:
/\p{Nd}{4}\/\p{Nd}{2}\/\p{Nd}{2}\/\p{L}+/u
To avoid this escaping you can use a different set of delimiters like:
#\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+#
As a tip, if your delimiter is present in your regex, it's better to choose a different delimiter that does not occur in the regex. This keeps the regex clean and short.
In the second regex you're using / as the regex delimiter, but you're also using it in the regex. The compiler is trying to interpret this part as a complete regex:
/\p{Nd}{4}/
It thinks the next character after the second / should be a modifier like 'u' or 'm', but it sees a backslash instead, so it emits that cryptic warning.
In the first regex you're using parentheses as regex delimiters; if you wanted to add the u modifier, you would put it after the closing paren:
'(\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+)u'
Although it's legal to use parentheses or other bracketing characters ({}, [], <>) as regex delimiters, it's not a good idea IMO. Most people prefer to use one of the less common punctuation characters. For example:
'~\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+~u'
'%\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+%u'
Of course, you could also escape the slashes in the regex with backslashes, but why bother?
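For comparison, a short sketch that runs the variants discussed above against the subject string from the question. The array keys are just labels for the example.

```php
<?php
$subject = '2010/02/14/this-is-something';

// Three spellings of the same regex, each with the u modifier.
$patterns = [
    'escaped slashes' => '/\p{Nd}{4}\/\p{Nd}{2}\/\p{Nd}{2}\/\p{L}+/u',
    'tilde delimiter' => '~\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+~u',
    'paren delimiter' => '(\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+)u',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // Store the matched text, or null when the pattern does not match.
    $results[$label] = preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
    echo $label, ': ', var_export($results[$label], true), "\n";
}
```

All three match the same text; \p{L}+ stops at the first hyphen because a hyphen is not a letter.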
