PHP regex: what is "class at offset 0"?

I'm trying to strip all punctuation out of a string using a simple regular expression and PHP's preg_replace function, but I get the following error:
Compilation failed: POSIX named classes are supported only within a class at offset 0
I guess this means I can't use POSIX named classes outside of a class at offset 0. My question is, what does it mean when it says "within a class at offset 0"?
$string = "I like: perl";
if (eregi('[[:punct:]]', $string))
$new = preg_replace('[[:punct:]]', ' ', $string); echo $new;

The preg_* functions expect Perl compatible regular expressions with delimiters. So try this:
preg_replace('/[[:punct:]]/', ' ', $string)

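NOTE: The g modifier is not needed with PHP's PCRE implementation! preg_replace() already replaces every match, and PHP rejects g as an unknown modifier.
Original answer, kept for context: in addition to Gumbo's answer, use the g modifier to replace all occurrences of punctuation:
preg_replace('/[[:punct:]]/g', ' ', $string)
// ^
From Johnathan Lonowski (see comments):
> [The g modifier] means "Global" -- i.e., find all existing matches. Without it, regex functions will stop searching after the first match.
That description fits flavors such as JavaScript and Perl, not PHP's preg_* functions.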

An explanation of why you're getting that error: PCRE uses Perl's loose definition of what a delimiter is. Your outer []s look like valid delimiters to it, causing it to read [:punct:] as the regex part.
(Oh, and avoid the ereg functions if you can: they are deprecated as of PHP 5.3 and were removed in PHP 7.0.)
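To make the delimiter point concrete, here is a minimal sketch comparing the undelimited pattern from the question with the delimited one. The variable names $broken and $fixed are only for illustration, and the @ operator is there just to suppress the compile warning so the return value can be inspected.

```php
<?php
$string = "I like: perl";

// PHP takes the outer [ ] as the delimiter pair, so PCRE only sees [:punct:],
// a POSIX class name outside a character class, and compilation fails.
$broken = @preg_replace('[[:punct:]]', ' ', $string);
var_dump($broken); // NULL

// With explicit / delimiters, PCRE sees the full [[:punct:]] class.
$fixed = preg_replace('/[[:punct:]]/', ' ', $string);
var_dump($fixed); // "I like  perl" (the colon became a space)
```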

I just added g to the regexp as suggested in one of the answers. It did the opposite of what was expected and DIDN'T filter out the punctuation. It turns out preg_replace doesn't require g, as it's global in the first place.
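As a small illustration of that behaviour (a sketch with a made-up sample string), preg_replace() replaces every match by default, and its optional fourth argument, $limit, is how you ask for fewer replacements:

```php
<?php
$string = "a.b,c!";

// Default limit is -1: all punctuation characters are replaced.
$all = preg_replace('/[[:punct:]]/', ' ', $string);
echo $all, "\n";   // "a b c "

// A limit of 1 stops after the first replacement.
$first = preg_replace('/[[:punct:]]/', ' ', $string, 1);
echo $first, "\n"; // "a b,c!"
```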

Related

Symfony2 url validation : "preg_match(): Compilation failed: range out of order in character class" [duplicate]

I'm getting this odd error in the preg_match() function:
Warning: preg_match(): Compilation failed: range out of order in character class at offset 54
The line which is causing this is:
preg_match("/<!--GSM\sPER\sNUMBER\s-\s$gsmNumber\s-\sSTART-->(.*)<!--GSM\sPER\sNUMBER\s-\s$gsmNumber\s-\sEND-->/s", $fileData, $matches);
What this regular expression does is parse an HTML file, extracting only the part between:
<!--GSM PER NUMBER - 5550101 - START-->
and:
<!--GSM PER NUMBER - 5550101 - END-->
Do you have a hint about what could be causing this error?
Hi I got the same error and solved it:
Warning: preg_match(): Compilation failed: range out of order in character class at offset <N>
Research Phase:
"Range out of order": so there is a range defined which can't be used.
"at offset N": I had a quick look at my regex pattern. Position N was the "-". It's used to define ranges like "a-z" or "0-9" etc.
Solution
I simply escaped the "-".
\-
Now it is interpreted as the character "-" and not as range!
If $gsmNumber contains a square bracket, backslash or various other special characters it might trigger this error. If that's possible, you might want to validate that to make sure it actually is a number before this point.
Edit 2016:
There exists a PHP function that can escape special characters inside regular expressions: preg_quote().
Use it like this:
preg_match(
'/<!--GSM\sPER\sNUMBER\s-\s' .
preg_quote($gsmNumber, '/') . '\s-\sSTART-->(.*)<!--GSM\sPER\sNUMBER\s-\s' .
preg_quote($gsmNumber, '/') . '\s-\sEND-->/s', $fileData, $matches);
Obviously in this case because you've used the same string twice you could assign the quoted version to a variable first and re-use that.
This error is caused by an incorrect range, for example 9-0 or a-Z.
To correct this, change 9-0 to 0-9 and a-Z to a-zA-Z.
In your case the "-" is not escaped, so if it ends up inside a character class, preg_match tries to parse it as a range and fails.
Escaping the "-" should solve your problem.
I was receiving this error with the following sequence:
[/-.]
Simply moving the . to the beginning fixed the problem:
[./-]
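The reason the reordering helps can be sketched as follows: "/" is 0x2F and "." is 0x2E, so "/-." reads as a descending range. The delimiter ~ and the sample subject are arbitrary choices for this illustration, and @ only hides the compile warning.

```php
<?php
// "/-." is a range from 0x2F down to 0x2E, so the class fails to compile.
$bad = @preg_match('~[/-.]~', 'a-b');
var_dump($bad);         // bool(false)

// A hyphen placed last in the class is a literal hyphen.
$hyphenLast = preg_match('~[./-]~', 'a-b');
var_dump($hyphenLast);  // int(1)

// An escaped hyphen is also literal, wherever it sits.
$hyphenEscaped = preg_match('~[/\-.]~', 'a-b');
var_dump($hyphenEscaped); // int(1)
```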
While the other answers are correct, I'm surprised to see that no-one has suggested escaping the variable with preg_quote() before using it in a regex. So if you're looking to match an actual bracket or anything else that means something in regex, that'll be converted to a literal token:
$escaped = preg_quote($gsmNumber, '/'); // pass the delimiter so "/" is escaped too
preg_match( '/<!--GSM\sPER\sNUMBER\s-\s'.$escaped.'\s-\sSTART-->(.*)<!--GSM\sPER\sNUMBER\s-\s'.$escaped.'\s-\sEND-->/s', $fileData, $matches);
You probably have people enter mobile numbers including +, -, ( and/or ) characters and then use these as-is in your preg_match, so you might want to sanitize the data provided before using it (i.e. by stripping these characters out completely).
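Putting the sanitizing and quoting advice together might look like the sketch below. The helper name normalizeGsmNumber() and the sample $fileData are hypothetical, and stripping every non-digit is only one possible sanitizing rule.

```php
<?php
// Hypothetical helper: keep digits only, dropping +, -, spaces and parentheses.
function normalizeGsmNumber(string $raw): string
{
    return preg_replace('/\D+/', '', $raw);
}

$gsmNumber = normalizeGsmNumber('555-0101'); // "5550101"

// Quote once, reuse twice; '/' is passed because it is the delimiter.
$quoted = preg_quote($gsmNumber, '/');
$pattern = '/<!--GSM\sPER\sNUMBER\s-\s' . $quoted . '\s-\sSTART-->(.*)'
         . '<!--GSM\sPER\sNUMBER\s-\s' . $quoted . '\s-\sEND-->/s';

// Made-up sample data for the illustration.
$fileData = '<!--GSM PER NUMBER - 5550101 - START-->hello<!--GSM PER NUMBER - 5550101 - END-->';

preg_match($pattern, $fileData, $matches);
echo $matches[1], "\n"; // hello
```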
This is a bug in several versions of PHP, as I have just verified for the current 5.3.5 version, as packaged with XAMPP 1.7.4 on Windows XP home edition.
Even some very simple examples exhibit the problem, e.g.,
$pattern = '/^[\w_-. ]+$/';
$uid = 'guest';
if (preg_match($pattern, $uid)) echo
("<style> p { text-decoration:line-through } </style>");
The PHP folks have known about the bug since 1/10/2010.
See http://pear.php.net/bugs/bug.php?id=18182.
The bug is marked "closed" yet persists.
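Whatever the status of that report, it may be worth checking the pattern itself first. In [\w_-. ] the hyphen sits between "_" and ".", and "_" has the higher code point, so PCRE would read it as a descending range. A minimal sketch of that check, with the hyphen moved to the end of the class as one possible fix:

```php
<?php
// Code points of the characters on either side of the hyphen.
printf("%d %d\n", ord('_'), ord('.')); // 95 46

// With the hyphen last in the class it is a literal, and the pattern compiles.
$pattern = '/^[\w_. -]+$/';
$uid = 'guest';
var_dump(preg_match($pattern, $uid)); // int(1)
```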

Fixed-length regex lookbehind complains of variable-length lookbehind

Here is the code I am trying to run:
$str = 'a,b,c,d';
return preg_split('/(?<![^\\\\][\\\\]),/', $str);
As you can see, the regexp being used here is:
/(?<![^\\][\\]),/
Which is a simple fixed-length negative lookbehind for "preceded by something that isn't a backslash, then something that is!".
This regex works just fine on http://www.phpliveregex.com
But when I go and actually attempt to run the above code, I am spat back the error:
Warning: preg_split() [function.preg-split]: Compilation failed: lookbehind assertion is not fixed length at offset 13
To make matters worse, a fellow programmer tested the code on his 5.4.24 PHP server, and it worked fine.
This leads me to believe that my issues are related to the configuration of my server, which I have very little control over. I am told that my PHP version is 5.2.*
Are there any workarounds/alternatives to preg_split() that might not have this issue?
The problem is caused by the bug fixed in PCRE 6.7. Quoting the changelog:
A negated single-character class was not being recognized as
fixed-length in lookbehind assertions such as (?<=[^f]), leading to an
incorrect compile error "lookbehind assertion is not fixed length"
PCRE 6.7 was introduced in PHP 5.2.0, in Nov 2006. As you still have this bug, your server is apparently running an older PCRE, so for a preg_split-based workaround you have to use a pattern without a negated character class. For example:
$patt = '/(?<!(?<!\\\\)\\\\),/';
// or...
$patt = '/(?<![\x00-\x5b\x5d-\xFF]\x5c),/';
However, I find the whole approach a bit weird: what if the , symbol is preceded by exactly three backslashes? Or five? Or any odd number of them? The comma in this case should be considered 'escaped', but obviously you cannot create a lookbehind expression of variable length to cover these cases.
On second thought, one can use preg_match_all instead, with a common alternation trick to cover the escaped symbols:
$str = 'e ,a\\,b\\\\,c\\\\\\,d\\\\';
preg_match_all('/(?:[^\\\\,]|\\\\(?:.|$))+/', $str, $matches);
var_dump($matches[0]);
I really think I covered all the issues here, those trailing slashes were a killer )
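For experimenting with this approach, the preg_match_all call can be wrapped in a small function. This is only a sketch: the name splitOnUnescapedCommas() is made up here, and because the pattern collects non-empty runs, empty fields (as in 'a,,b') are dropped.

```php
<?php
// Hypothetical helper: returns the fields separated by unescaped commas.
// Each field is a run of non-backslash, non-comma characters or backslash pairs.
function splitOnUnescapedCommas(string $str): array
{
    preg_match_all('/(?:[^\\\\,]|\\\\(?:.|$))+/', $str, $matches);
    return $matches[0];
}

$plain = splitOnUnescapedCommas('a,b,c,d');
print_r($plain);   // a / b / c / d

$escaped = splitOnUnescapedCommas('e ,a\\,b\\\\,c\\\\\\,d\\\\');
print_r($escaped); // "e " / "a\,b\\" / "c\\\,d\\"
```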
A way to avoid the negated character class (I write \x5c instead of a lot of backslashes to be clearer):
$result = preg_split('/(?<!(?!\x5c).\x5c),/s', $str);
About the approach itself:
If you are trying to split on commas that are not escaped, a lookbehind is the wrong way to go, since you can't check an undefined number of backslashes before the comma. You have several possibilities to solve this problem:
$result = preg_split('/(?:[^\x5c]|\A)(?:\x5c.)*\K,/s', $str);
or
$result = preg_split('/(?<!\x5c)(?:\x5c.)*\K,/s', $str);
or for PHP > 5.2.4
$result = preg_split('/\x5c.(*SKIP)(?!)|,/s', $str); // skip every backslash pair, then split on any remaining comma
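As a quick check of the \K variant with the (?<!\x5c) lookbehind, here is a small sketch on a made-up input containing one escaped comma and one escaped backslash followed by a real separator:

```php
<?php
// The PHP literal below is the string: a\,b,c\\,d
$str = 'a\\,b,c\\\\,d';

// (?:\x5c.)* consumes whole backslash pairs; \K then drops them from the
// reported match, so only the comma itself acts as the separator.
$result = preg_split('/(?<!\x5c)(?:\x5c.)*\K,/s', $str);
print_r($result); // "a\,b" / "c\\" / "d"
```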
I think you are using an older PHP version, since your error arises on PHP 5.1.6 or lower.
On the other hand, it works for PHP 5.2.16 or higher.

Difference in laziness of lookahead assertions between JavaScript and PHP

I'm confused by a difference I found between the way JavaScript and PHP handle the following regex.
In JavaScript,
'foobar'.replace(/(?=(bar))/ , '$1');
'foobar'.replace(/(?=(bar))?/ , '$1');
'foobar'.replace(/(?:(?=(bar)))?/, '$1');
results in, respectively,
foobarbar
foobar
foobar
as shown in this jsFiddle.
However, in PHP,
echo preg_replace('/(?=(bar))/', '$1', "foobar<br/>");
echo preg_replace('/(?=(bar))?/', '$1', "foobar<br/>");
echo preg_replace('/(?:(?=(bar)))?/', '$1', "foobar<br/>");
results in,
foobarbar
Warning: preg_replace() [function.preg-replace]: Compilation failed: nothing to repeat at offset 9 in /homepages/26/d94605010/htdocs/lz/writecodeonline.com/php/index.php(201) : eval()'d code on line 2
foobarbar
I'm not so much worried about the warning. But it appears that in JavaScript, lookahead assertions are somehow "lazier" than in PHP. Why the difference? Is this a bug in one of the engines? Which is theoretically more "correct"?
The real difference is actually very simple:
In JavaScript, replace will only replace the first match, unless using the /g flag (global).
In PHP, preg_replace replaces all matches.
The third pattern, (?:(?=(bar)))?, can match the empty string in every position, and captures "bar" in some positions. Without the /g flag, it only matches once, at the beginning of the string.
You would have easily seen the difference had you used a more visible replacement string, like [$1].
PHP Example: http://ideone.com/8Mjg6
JavaScript Example, no /g: http://jsfiddle.net/qKb4b/3/
JavaScript Example, with /g: http://jsfiddle.net/qKb4b/2/
I would also note that "laziness" is a different concept in regular expressions, not related to this question.
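The effect of a more visible replacement string can be sketched in PHP as follows. The $limit argument of 1 is used here only to imitate JavaScript's replace without /g; the variable names are illustrative.

```php
<?php
$subject = 'foobar';

// Default limit (-1): every match is replaced, including the empty ones.
// Only the match at position 3 has "bar" in group 1.
$all = preg_replace('/(?:(?=(bar)))?/', '[$1]', $subject);
echo $all, "\n";   // []f[]o[]o[bar]b[]a[]r[]

// Limit 1: only the first (empty) match at the start of the string is replaced.
$first = preg_replace('/(?:(?=(bar)))?/', '[$1]', $subject, 1);
echo $first, "\n"; // []foobar
```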

Gruber's new and improved URL recognising regex

I've been trying to use Gruber's latest URL matching regex in a PHP project.
To test it I threw together something very simple:
$regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:"'.,<>?«»“”‘’]))";
$array = preg_match_all($regex, $theblockofurltext);
print_r($array);
The first problem was that the " would terminate the string, depending on which quote I wrapped the regex with, so I just removed it. The use of this is personal and I will never have " anywhere near a URL anyway. This left me with a new regex.
$regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'.,<>?«»“”‘’]))";
Raring to go I then ran my little script and it gave me the following error:
Warning: preg_split() [function.preg-split]: Unknown modifier '\' in D:\wwwroot\xxx\index.php on line 14
Unfortunately my regex class at school wasn't taught to anywhere near the level this regex requires, and I have no idea where to begin fixing this for use with PHP. Any help would be greatly appreciated. No doubt I'm probably doing something stupid too, so please go easy on me :)
Jon
Add # before and after your RE.
$regex = "#(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'.,<>?«»“”‘’]))#";
If you use PCRE, the regular expression must be enclosed in delimiters. Parentheses () can also be delimiters, which is why the engine thinks your expression is only (?i) and interprets the following \ as a modifier.
You could use ~ as delimiter:
$regex = "~(?i)\b...]))~";
Update:
PCRE, and therefore PHP, does support inline modifiers such as (?i), so this step is optional. Since you apply it to the whole expression anyway, you can also remove it and put the modifier after the delimiter instead:
$regex = "~\b...]))~i";
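To show the delimiter and trailing modifier in action without reproducing the whole expression, here is a sketch with a much simpler stand-in pattern (not Gruber's regex) and a made-up sample text. Single quotes are used so PHP itself does not interpret the backslashes or stumble over a " in the pattern.

```php
<?php
// Simplified stand-in pattern, delimited with ~ and made case-insensitive with i.
$regex = '~\b(?:https?://|www\d{0,3}[.])[^\s()<>]+~i';

$text = 'See http://example.com/a and WWW.example.org for details.';

// preg_match_all returns the number of matches and fills $matches by reference.
$count = preg_match_all($regex, $text, $matches);
print_r($matches[0]); // http://example.com/a and WWW.example.org
```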
