Fixed-length regex lookbehind complains of variable-length lookbehind - php

Here is the code I am trying to run:
$str = 'a,b,c,d';
return preg_split('/(?<![^\\\\][\\\\]),/', $str);
As you can see, the regexp being used here is:
/(?<![^\\][\\]),/
Which is a simple fixed-length negative lookbehind for "preceded by something that isn't a backslash, then something that is!".
This regex works just fine on http://www.phpliveregex.com
But when I go and actually attempt to run the above code, I am spat back the error:
Warning: preg_split() [function.preg-split]: Compilation failed: lookbehind assertion is not fixed length at offset 13
To make matters worse, a fellow programmer tested the code on his 5.4.24 PHP server, and it worked fine.
This leads me to believe that my issues are related to the configuration of my server, which I have very little control over. I am told that my PHP version if 5.2.*
Are there any workarounds/alternatives to preg_replace() that might not have this issue?

The problem is caused by the bug fixed in PCRE 6.7. Quoting the changelog:
A negated single-character class was not being recognized as
fixed-length in lookbehind assertions such as (?<=[^f]), leading to an
incorrect compile error "lookbehind assertion is not fixed length"
PCRE 6.7 was introduced in PHP 5.2.0, in Nov 2006. As you still have this bug, it means it's not still there at your server - so for a preg-split based workaround you have to use a pattern without a negative character class. For example:
$patt = '/(?<!(?<!\\\\)\\\\),/';
// or...
$patt = '/(?<![\x00-\x5b\x5d-\xFF]\x5c),/';
However, I find the whole approach a bit weird: what if , symbol is preceded by exactly three backslashes? Or five? Or any odd number of them? The comma in this case should be considered 'escaped', but obviously you cannot create a lookbehind expression of variable length to cover these cases.
On the second thought, one can use preg_match_all instead, with a common alternation trick to cover the escaped symbols:
$str = 'e ,a\\,b\\\\,c\\\\\\,d\\\\';
preg_match_all('/(?:[^\\\\,]|\\\\(?:.|$))+/', $str, $matches);
var_dump($matches[0]);
Demo.
I really think I covered all the issues here, those trailing slashes were a killer )

Way to avoid the negated character class (I write \x5c instead of a lot of backslashes to be more clear)
$result = preg_split('/(?<!(?!\x5c).\x5c),/s', $str);
About the approach itself:
If you are trying to split on comma that are not escaped, you are in the wrong way with a lookbehind since you can't check and undefined number of backslash before the comma. You have several possibilities to solve this problem:
$result = preg_split('/(?:[^\x5c]|\A)(?:\x5c.)*\K,/s', $str);
or
$result = preg_split('/(?<!\x5c)(?:\x5c.)*\K,/s', $str);
or for PHP > 5.2.4
$result = preg_split('/\x5c{2}(*SKIP)(?!)|(?<!\x5c),/s', $str);

I think you are using an older php version since I your error rises on PHP 5.1.6 or lower.
You can check a non working demo here
On the other hand it works for PHP 5.2.16 or higher:
Working demo

Related

Weird PHP Regex Preg_Match Bug?

My PHP version is PHP 7.2.24-0ubuntu0.18.04.7 (cli). However it looks like this problem occurs with all versions I've tested.
I've encountered a very weird bug when using preg_match. Anyone know a fix?
The first section of code here works, the second one doesn't. But the regex itself is valid. For some reason the something_happened word is causing it to fail.
$one = ' (branch|leaf)';
echo "ONE:\n";
preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $one, $matches, PREG_OFFSET_CAPTURE);
print_r($matches); // this works
$two = 'something_happened (branch|leaf)';
echo "\nTWO:\n";
preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $two, $matches2, PREG_OFFSET_CAPTURE);
print_r($matches2); // this doesn't work
It seems somehow related to the word something_happened. If I change this word it works.
The regex is matching 2 or more type names separated by | that may or may not be surrounded in (), and each type name may or may not be preceded by any number of [] (or [some number] or [!some number]) and *.
Try it and see for yourself! Please let me know if you know how to fix it!
The problem lies in the (?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+ group: the + quantifier quantifies a group with many subsequent optional patterns, and that creates too many options to match a string before the subsequent patterns.
In PHP, you can workaround the problem by using either
Possessive quantifier:
'/(?:\(\ ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)++(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/'
Note the ++ at the end of the group mentioned.
2. Atomic group:
'/(?:\(\ ?)?((?>(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/'
See this regex demo. Note the (?>...) syntax.
Also, note how the regex is formatted here, it is very convenient to use the x (extended) flag to break the regex into several lines, format it, so that it could be easier to track down the issue. It is required to escape all literal whitespace and # chars, but it is a minor inconvenience when it comes to debugging long patterns like this.

From Postgress regexp replace match in PHP language

i need some help.
I have PostgreSQL regexp_replace pattern, like:
regexp_replace(lower(institution_title),'[[:cntrl:]]|[[[:digit:]]|[[:punct:]]|[[:blank:]]|[[:space:]|„|“|“|”"]','','g')
and i need this one alternative in PHP language
Because one half is from postgress db, and i have to compare strings from php aswell.
You may use the same POSIX character classes with PHP PCRE regex:
preg_replace('/[[:cntrl:][:digit:][:punct:][:blank:][:space:]„““”"]+/', '', strtolower($institution_title))
See demo
Besides, there are Unicode category classes in PCRE. Thus, you may also try
preg_replace('/[\p{Cc}\d\p{P}\s„““”"]+/u', '', mb_strtolower($institution_title, 'UTF-8'))
Where \p{Cc} stands for Control characters, \d for digits, \p{P} for punctuation, and \s for whitespace.
I am adding /u modifier to handle Unicode strings, too.
See a regex demo
hanks guys, but i bumped to another problem, i cannot match strings, if there is specifik symbols,
here is my output of postgres sql:
SQL:
select regexp_replace(lower(title),'[[:cntrl:]]|[[[:digit:]]|[[:punct:]]|[[:blank:]]|[[:space:]|„|“|“|”"]','','g')
from cls_institutions
Output:
"oxforduniversity"
"šiauliųuniversitetas"
"harwarduniversity"
"internationalbusinessschool"
"vilniuscollege"
"žemaitijoskolegija"
"worldhealthorganization"
But in PHP is output is a little bit different: I got my array with institutions:
$institutions[] = "'".preg_replace('/[[:cntrl:][:digit:][:punct:][:blank:][:space:]„““”"]+/', '', strtolower($data[0]))."'";
And PHP outputs like this:
"oxforduniversity",
"Šiauliųuniversitetas",
"harwarduniversity",
"internationalbusinessschool",
"vilniuscollege",
"Žemaitijoskolegija",
"worldhealthorganization"
First letter is not lowered case, somehow... I am missing something?

Symfony2 url validation : "preg_match(): Compilation failed: range out of order in character class" [duplicate]

I'm getting this odd error in the preg_match() function:
Warning: preg_match(): Compilation failed: range out of order in character class at offset 54
The line which is causing this is:
preg_match("/<!--GSM\sPER\sNUMBER\s-\s$gsmNumber\s-\sSTART-->(.*)<!--GSM\sPER\sNUMBER\s-\s$gsmNumber\s-\sEND-->/s", $fileData, $matches);
What this regular expression does is parse an HTML file, extracting only the part between:
<!--GSM PER NUMBER - 5550101 - START-->
and:
<!--GSM PER NUMBER - 5550101 - END-->
Do you have a hint about what could be causing this error?
Hi I got the same error and solved it:
Warning: preg_match(): Compilation failed: range out of order in character class at offset <N>
Research Phase:
.. Range out of order .. So there is a range defined which can't be used.
.. at offset N .. I had a quick look at my regex pattern. Position N was the "-". It's used to define ranges like "a-z" or "0-9" etc.
Solution
I simply escaped the "-".
\-
Now it is interpreted as the character "-" and not as range!
If $gsmNumber contains a square bracket, backslash or various other special characters it might trigger this error. If that's possible, you might want to validate that to make sure it actually is a number before this point.
Edit 2016:
There exists a PHP function that can escape special characters inside regular expressions: preg_quote().
Use it like this:
preg_match(
'/<!--GSM\sPER\sNUMBER\s-\s' .
preg_quote($gsmNumber, '/') . '\s-\sSTART-->(.*)<!--GSM\sPER\sNUMBER\s-\s' .
preg_quote($gsmNumber, '/') . '\s-\sEND-->/s', $fileData, $matches);
Obviously in this case because you've used the same string twice you could assign the quoted version to a variable first and re-use that.
This error is caused for an incorrect range. For example: 9-0 a-Z
To correct this, you must change 9-0 to 0-9 and a-Z to a-zA-Z
In your case you are not escaping the character "-", and then, preg_match try to parse the regex and fail with an incorrect range.
Escape the "-" and it must solve your problem.
I was receiving this error with the following sequence:
[/-.]
Simply moving the . to the beginning fixed the problem:
[./-]
While the other answers are correct, I'm surprised to see that no-one has suggested escaping the variable with preg_quote() before using it in a regex. So if you're looking to match an actual bracket or anything else that means something in regex, that'll be converted to a literal token:
$escaped = preg_quote($gsmNumber);
preg_match( '/<!--GSM\sPER\sNUMBER\s-\s'.$escaped.'\s-\sSTART-->(.*)<!--GSM\sPER\sNUMBER\s-\s'.$escaped.'\s-\sEND-->/s', $fileData, $matches);
You probably have people insert mobile numbers including +, -, ( and/or ) characters and just use these as is in your preg_match, so you might want to sanitize the data provided before using it (ie. by stripping these characters out completely).
This is a bug in several versions of PHP, as I have just verified for the current 5.3.5 version, as packaged with XAMPP 1.7.4 on Windows XP home edition.
Even some very simple examples exhibit the problem, e.g.,
$pattern = '/^[\w_-. ]+$/';
$uid = 'guest';
if (preg_match($pattern, $uid)) echo
("<style> p { text-decoration:line-through } </style>");
The PHP folks have known about the bug since 1/10/2010.
See http://pear.php.net/bugs/bug.php?id=18182.
The bug is marked "closed" yet persists.

Range out of order in character class

I'm getting this odd error in the preg_match() function:
Warning: preg_match(): Compilation failed: range out of order in character class at offset 54
The line which is causing this is:
preg_match("/<!--GSM\sPER\sNUMBER\s-\s$gsmNumber\s-\sSTART-->(.*)<!--GSM\sPER\sNUMBER\s-\s$gsmNumber\s-\sEND-->/s", $fileData, $matches);
What this regular expression does is parse an HTML file, extracting only the part between:
<!--GSM PER NUMBER - 5550101 - START-->
and:
<!--GSM PER NUMBER - 5550101 - END-->
Do you have a hint about what could be causing this error?
Hi I got the same error and solved it:
Warning: preg_match(): Compilation failed: range out of order in character class at offset <N>
Research Phase:
.. Range out of order .. So there is a range defined which can't be used.
.. at offset N .. I had a quick look at my regex pattern. Position N was the "-". It's used to define ranges like "a-z" or "0-9" etc.
Solution
I simply escaped the "-".
\-
Now it is interpreted as the character "-" and not as range!
If $gsmNumber contains a square bracket, backslash or various other special characters it might trigger this error. If that's possible, you might want to validate that to make sure it actually is a number before this point.
Edit 2016:
There exists a PHP function that can escape special characters inside regular expressions: preg_quote().
Use it like this:
preg_match(
'/<!--GSM\sPER\sNUMBER\s-\s' .
preg_quote($gsmNumber, '/') . '\s-\sSTART-->(.*)<!--GSM\sPER\sNUMBER\s-\s' .
preg_quote($gsmNumber, '/') . '\s-\sEND-->/s', $fileData, $matches);
Obviously in this case because you've used the same string twice you could assign the quoted version to a variable first and re-use that.
This error is caused for an incorrect range. For example: 9-0 a-Z
To correct this, you must change 9-0 to 0-9 and a-Z to a-zA-Z
In your case you are not escaping the character "-", and then, preg_match try to parse the regex and fail with an incorrect range.
Escape the "-" and it must solve your problem.
I was receiving this error with the following sequence:
[/-.]
Simply moving the . to the beginning fixed the problem:
[./-]
While the other answers are correct, I'm surprised to see that no-one has suggested escaping the variable with preg_quote() before using it in a regex. So if you're looking to match an actual bracket or anything else that means something in regex, that'll be converted to a literal token:
$escaped = preg_quote($gsmNumber);
preg_match( '/<!--GSM\sPER\sNUMBER\s-\s'.$escaped.'\s-\sSTART-->(.*)<!--GSM\sPER\sNUMBER\s-\s'.$escaped.'\s-\sEND-->/s', $fileData, $matches);
You probably have people insert mobile numbers including +, -, ( and/or ) characters and just use these as is in your preg_match, so you might want to sanitize the data provided before using it (ie. by stripping these characters out completely).
This is a bug in several versions of PHP, as I have just verified for the current 5.3.5 version, as packaged with XAMPP 1.7.4 on Windows XP home edition.
Even some very simple examples exhibit the problem, e.g.,
$pattern = '/^[\w_-. ]+$/';
$uid = 'guest';
if (preg_match($pattern, $uid)) echo
("<style> p { text-decoration:line-through } </style>");
The PHP folks have known about the bug since 1/10/2010.
See http://pear.php.net/bugs/bug.php?id=18182.
The bug is marked "closed" yet persists.

PHP regex: what is "class at offset 0"?

I'm trying to strip all punctuation out of a string using a simple regular expression and the php preg_replace function, although I get the following error:
Compilation failed: POSIX named classes are supported only within a class at offset 0
I guess this means I can't use POSIX named classes outside of a class at offset 0. My question is, what does it means when it says "within a class at offset 0 "?
$string = "I like: perl";
if (eregi('[[:punct:]]', $string))
$new = preg_replace('[[:punct:]]', ' ', $string); echo $new;
The preg_* functions expect Perl compatible regular expressions with delimiters. So try this:
preg_replace('/[[:punct:]]/', ' ', $string)
NOTE: The g modifier is not needed with PHP's PCRE implementation!
In addition to Gumbo's answer, use the g modifier to replace all occurances of punctuation:
preg_replace('/[[:punct:]]/g', ' ', $string)
// ^
From Johnathan Lonowski (see comments):
> [The g modifier] means "Global" -- i.e., find all existing matches. Without it, regex functions will stop searching after the first match.
An explanation of why you're getting that error: PCRE uses Perl's loose definition of what a delimiter is. Your outer []s look like valid delimiters to it, causing it to read [:punct:] as the regex part.
(Oh, and avoid the ereg functions if you can - they're not going to be included in PHP 5.3.)
I just added g to the regexp as suggested in one of the anwers, it did the opposite of wahts expected and DIDN'T filter out the punctuation, turns out preg_replace doesnt require g as it's global/recursive in the first place

Categories