Compile regex in PHP - php

Is there a way in PHP to compile a regular expression, so that it can then be compared to multiple strings without repeating the compilation process? Other major languages can do this -- Java, C#, Python, Javascript, etc.

The Perl-Compatible Regular Expressions library may already be optimized for your use case, without providing a Regex class like other languages do:
This extension maintains a global per-thread cache of compiled regular expressions (up to 4096).
PCRE Introduction
This is how the study modifier which Imran described can store the compiled expression between calls.

preg regexes can use the uppercase S (study) modifier, which is probably the thing you're looking for.
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
S
When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

The "thread" is the thread that the script is currently running in. After first use, the compiled regexp is cached, and the next time it is used PHP does not compile it again.
Simple test:
<?php
function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
// test string
$text = 'The big brown <b>fox</b> jumped over a lazy <b>cat</b>';
$testTimes = 10;

// regexp with caching: the pattern string is identical on every call
$avg = 0;
for ($x = 0; $x < $testTimes; $x++) {
    $start = microtime_float();
    for ($i = 0; $i < 10000; $i++) {
        preg_match_all('/<b>(.*)<\/b>0?/', $text, $m);
    }
    $end = microtime_float();
    $avg += (float)$end - $start;
}
echo 'Regexp with caching avg ' . ($avg / $testTimes);

// regexp without caching: the pattern string changes on every call
$avg = 0;
for ($x = 0; $x < $testTimes; $x++) {
    $start = microtime_float();
    for ($i = 0; $i < 10000; $i++) {
        $pattern = '/<b>(.*)<\/b>' . $i . '?/';
        preg_match_all($pattern, $text, $m);
    }
    $end = microtime_float();
    $avg += (float)$end - $start;
}
echo '<br/>Regexp without caching avg ' . ($avg / $testTimes);
Regexp with caching avg 0.1
Regexp without caching avg 0.8
Caching a regexp makes it 8 times faster!

As another commenter has already said, PCRE regexes are already compiled and cached without your having to specifically request it: PCRE keeps an internal hash indexed by the original pattern string you provided.
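In practice this means that reusing the same pattern string is all that is needed. The following is a minimal sketch of that idea; the pattern and the sample strings are made up for illustration, and it assumes PHP 5.4 or later for the short array syntax.

```php
<?php
// Illustrative pattern and subjects (not from the question).
$pattern  = '/^[a-z]+\d{2}$/';
$subjects = ['abc12', 'nope', 'xy99', '12ab'];

// The pattern string is identical on every call, so the per-thread cache
// described above can hand back the already compiled expression.
$matched = [];
foreach ($subjects as $subject) {
    if (preg_match($pattern, $subject) === 1) {
        $matched[] = $subject;
    }
}

// preg_grep() applies one pattern to a whole array in a single call.
// It preserves the original keys, hence array_values().
$viaGrep = array_values(preg_grep($pattern, $subjects));
```

Both $matched and $viaGrep end up holding the same two strings; which form reads better is mostly a matter of taste.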

I'm not positive that you can. If you check out Mastering Regular Expressions, some PHP-specific optimization techniques are discussed in Chapter 10: PHP, specifically the use of the S pattern modifier to cause the regex engine to "study" the regular expression before it applies it. Depending on your pattern and your text, this could give you some speed improvements.
Edit: you can take a peek at the contents of the book using books.google.com.

Related

Fixed-length regex lookbehind complains of variable-length lookbehind

Here is the code I am trying to run:
$str = 'a,b,c,d';
return preg_split('/(?<![^\\\\][\\\\]),/', $str);
As you can see, the regexp being used here is:
/(?<![^\\][\\]),/
Which is a simple fixed-length negative lookbehind for "preceded by something that isn't a backslash, then something that is!".
This regex works just fine on http://www.phpliveregex.com
But when I go and actually attempt to run the above code, I am spat back the error:
Warning: preg_split() [function.preg-split]: Compilation failed: lookbehind assertion is not fixed length at offset 13
To make matters worse, a fellow programmer tested the code on his 5.4.24 PHP server, and it worked fine.
This leads me to believe that my issues are related to the configuration of my server, which I have very little control over. I am told that my PHP version is 5.2.*
Are there any workarounds/alternatives to preg_split() that might not have this issue?
The problem is caused by a bug that was fixed in PCRE 6.7. Quoting the changelog:
A negated single-character class was not being recognized as fixed-length in lookbehind assertions such as (?<=[^f]), leading to an incorrect compile error "lookbehind assertion is not fixed length"
PCRE 6.7 was introduced in PHP 5.2.0, in Nov 2006. As you still have this bug, it means that version is not yet on your server, so for a preg_split-based workaround you have to use a pattern without a negated character class. For example:
$patt = '/(?<!(?<!\\\\)\\\\),/';
// or...
$patt = '/(?<![\x00-\x5b\x5d-\xFF]\x5c),/';
However, I find the whole approach a bit weird: what if the , symbol is preceded by exactly three backslashes? Or five? Or any odd number of them? The comma in this case should be considered 'escaped', but obviously you cannot create a lookbehind expression of variable length to cover these cases.
On second thought, one can use preg_match_all instead, with a common alternation trick to cover the escaped symbols:
$str = 'e ,a\\,b\\\\,c\\\\\\,d\\\\';
preg_match_all('/(?:[^\\\\,]|\\\\(?:.|$))+/', $str, $matches);
var_dump($matches[0]);
Demo.
I really think I covered all the issues here, those trailing slashes were a killer )
A way to avoid the negated character class (I write \x5c instead of a lot of backslashes to be clearer):
$result = preg_split('/(?<!(?!\x5c).\x5c),/s', $str);
About the approach itself:
If you are trying to split on commas that are not escaped, a lookbehind is the wrong tool, since you can't check an undefined number of backslashes before the comma. You have several possibilities to solve this problem:
$result = preg_split('/(?:[^\x5c]|\A)(?:\x5c.)*\K,/s', $str);
or
$result = preg_split('/(?<!\x5c)(?:\x5c.)*\K,/s', $str);
or for PHP > 5.2.4
$result = preg_split('/\x5c{2}(*SKIP)(?!)|(?<!\x5c),/s', $str);
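If none of these patterns is available on an old PCRE build, the same rule (a backslash escapes whatever character follows it) can also be applied without any regex. The sketch below is only an illustration: split_unescaped() is a hypothetical helper, not a PHP built-in, and it keeps the escape characters in the output rather than stripping them.

```php
<?php
// Hypothetical helper (not part of PHP): splits $str on every $delim that is
// not escaped, treating $esc plus the character after it as one unit.
function split_unescaped($str, $delim = ',', $esc = '\\')
{
    $parts = array();
    $current = '';
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $ch = $str[$i];
        if ($ch === $esc) {
            // Copy the escape character and the character it protects.
            $current .= $ch;
            if ($i + 1 < $len) {
                $current .= $str[++$i];
            }
        } elseif ($ch === $delim) {
            // Unescaped delimiter: close the current field.
            $parts[] = $current;
            $current = '';
        } else {
            $current .= $ch;
        }
    }
    $parts[] = $current;
    return $parts;
}

// Same test string as in the preg_match_all answer above.
$parts = split_unescaped('e ,a\\,b\\\\,c\\\\\\,d\\\\');
```

Because the scanner consumes backslashes in pairs as it goes, any odd or even run of backslashes before a comma is handled without a variable-length lookbehind. Unlike the preg_match_all approach, it also keeps empty fields.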
I think you are using an older PHP version, since your error arises on PHP 5.1.6 or lower.
You can check a non working demo here
On the other hand it works for PHP 5.2.16 or higher:
Working demo

JavaScript equivalent of PHP preg_split()

Is there an equivalent of the PHP function preg_split for JavaScript?
Any string in javascript can be split using the string.split function, e.g.
"foo:bar".split(/:/)
where split takes as an argument either a regular expression or a literal string.
You can use regular expressions with split.
The problem is the escape characters in the string: the (? opens a non-capturing group, but there is no corresponding ) to close the non-capturing group, so it identifies the string to look for as '
If you want support for all of the preg_split arguments see https://github.com/kvz/phpjs/blob/master/_workbench/pcre/preg_split.js (though not sure how well tested it is).
Just bear in mind that JavaScript's regex syntax is somewhat different from PHP's (mostly less expressive). We would like to integrate XRegExp at some point, as that makes up for some of the PHP regex features that JavaScript lacks (as well as fixing the many browser reliability problems with functions like String.split()).

Expected lifespan of ereg, migrating to preg [duplicate]

I work on a large PHP application (>1 million lines, 10 yrs old) which makes extensive use of ereg and ereg_replace - currently 1,768 unique regular expressions in 516 classes.
I'm very aware why ereg is being deprecated but clearly migrating to preg could be highly involved.
Does anyone know how long ereg support is likely to be maintained in PHP, and/or have any advice for migrating to preg on this scale? I suspect automated translation from ereg to preg is impossible/impractical?
I'm not sure when ereg will be removed but my bet is as of PHP 6.0.
Regarding your second issue (translating ereg to preg), it doesn't seem that hard. If your application has > 1 million lines, surely you must have the resources to get someone to do this job for a week at most. I would grep all the ereg_ instances in your code and set up some macros in your favorite IDE (simple stuff like adding delimiters, modifiers and so on).
I bet most of the 1768 regexes can be ported using a macro, and the others, well, a good pair of eyes.
Another option might be to write wrappers around the ereg functions if they are not available, implementing the changes as needed:
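if (function_exists('ereg') !== true)
{
    // $regs is optional in the original ereg(), so it gets a default here too
    function ereg($pattern, $string, &$regs = null)
    {
        // Escape the ~ delimiter so a literal ~ in the pattern stays literal
        return preg_match('~' . addcslashes($pattern, '~') . '~', $string, $regs);
    }
}
if (function_exists('eregi') !== true)
{
    function eregi($pattern, $string, &$regs = null)
    {
        // Same as above, with the i modifier for case-insensitive matching
        return preg_match('~' . addcslashes($pattern, '~') . '~i', $string, $regs);
    }
}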
You get the idea. Also, PEAR package PHP Compat might be a viable solution too.
Differences from POSIX regex
As of PHP 5.3.0, the POSIX Regex extension is deprecated. There are a number of differences between POSIX regex and PCRE regex. This page lists the most notable ones that are necessary to know when converting to PCRE.
The PCRE functions require that the pattern is enclosed by delimiters.
Unlike POSIX, the PCRE extension does not have dedicated functions for case-insensitive matching. Instead, this is supported using the /i pattern modifier. Other pattern modifiers are also available for changing the matching strategy.
The POSIX functions find the longest of the leftmost match, but PCRE stops on the first valid match. If the string doesn't match at all it makes no difference, but if it matches it may have dramatic effects on both the resulting match and the matching speed. To illustrate this difference, consider the following example from "Mastering Regular Expressions" by Jeffrey Friedl. Using the pattern one(self)?(selfsufficient)? on the string oneselfsufficient with PCRE will result in matching oneself, but using POSIX the result will be the full string oneselfsufficient. Both (sub)strings match the original string, but POSIX requires that the longest be the result.
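The PCRE side of the three points quoted above can be reproduced in a few lines. This is only a small illustration of the documented behaviour, using / as the delimiter (any other valid delimiter would do) and variable names chosen for this example.

```php
<?php
// Delimiters are mandatory: the pattern is wrapped in / ... /.
// PCRE stops at the first valid match, so the optional (self) group is taken
// and the second optional group then matches nothing.
preg_match('/one(self)?(selfsufficient)?/', 'oneselfsufficient', $m);
$pcreMatch = $m[0]; // 'oneself', as the manual excerpt describes

// There is no eregi()-style function; the i modifier makes the same
// pattern syntax case-insensitive.
$caseInsensitive = preg_match('/ONESELF/i', 'oneselfsufficient');
```

Any ereg pattern that relied on the POSIX longest-match rule is therefore worth re-testing after conversion, since PCRE may return a shorter match.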
My intuition says that they are never going to remove ereg on purpose. PHP still supports really old and deprecated stuff like register globals. There are simply too many outdated apps out there. There is, however, a small chance that the extension has to be removed because someone finds a serious vulnerability and there's just nobody to fix it.
In any case, it's worth noting that:
You are not forced to upgrade your PHP installation. It's pretty common to keep outdated servers to run legacy apps.
The PHP_Compat PEAR package offers plain PHP version of some native functions. If ereg disappears, it's possible that it gets added.
BTW... In fact, PHP 6 is dead. They realised that their approach to make PHP fully Unicode compliant was harder than they thought and they are rethinking it all. The conclusion is: you can never make perfect predictions.
I had this problem on a much smaller scale - an application more like 10,000 lines. In every case, all I needed to do was switch to preg_replace() and put delimiters around the regex pattern.
Anyone should be able to do that - even a non-programmer can be given a list of filenames and line numbers.
Then just run your tests to watch for any failures that can be fixed.
ereg functions will be removed from PHP6, by the way - http://jero.net/articles/php6.
All ereg functions will be removed as of PHP 6, I believe.

foreach and preg_match on a heavy amount of data not working properly

I have two files: one is full of keyword sequences (~20k lines), the other is full of regular expressions (~2.5k).
I want to test each keyword with each regexp and print the one that matches. I tested my files and that makes around 22 750 000 tests. I am using the following code :
$count = 0;
$countM = 0;
foreach ($arrayRegexp as $r) {
    foreach ($arrayKeywords as $k) {
        $count++;
        if (preg_match($r, $k, $match)) {
            $countM++;
            echo $k.' matched with keywords '.$match[1].'<br/>';
        }
    }
}
echo "$count tests with $countM matches.";
Unfortunately, after computing for a while, only parts of the actual matches are displayed and the final line keeping the counts never displays. What is even more weird is that if I comment the preg section to keep only the two foreach and the count display, everything works fine.
I believe this is due to an excessive amount of data to be processed, but I would like to know if there are any recommendations I didn't follow for this kind of operation. The regular expressions I use are very complicated and I cannot change to something else.
Ideas anyone?
There are two optimization options:
Regular expressions can usually be combined into alternatives /(regex1|regex2|...)/. Oftentimes PCRE can evaluate alternatives faster than PHP can execute a loop.
I'm not sure if this is faster at all (it modifies the subjects), but you could use the keywords array as a parameter to preg_replace_callback() directly, thus eliminating the second loop.
As an example:
$rx = implode("|", $arrayRegexp); // if it hasn't /regexp/ enclosures
preg_replace_callback("#($rx)#", "print", $arrayKeywords);
But define a custom print function to output and count the results, and let it just return e.g. an empty string.
Come to think of it, preg_replace_callback would also take an array of regular expressions. Not sure if it cross-checks each regex on each string though.
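A fuller sketch of the combined-alternation idea might look like the following. The sample patterns and keywords are invented for illustration, and it assumes the patterns are stored without delimiters, contain no # character, and that PHP 5.4 or later is available (closures and short array syntax).

```php
<?php
// Illustrative data standing in for the two files from the question.
$arrayRegexp   = ['red\s+\w+', 'blue\s+\w+'];
$arrayKeywords = ['a red car', 'nothing here', 'the blue sky'];

// One pattern made of all alternatives, wrapped in a non-capturing group.
$rx = '#(?:' . implode('|', $arrayRegexp) . ')#';

$countM = 0;
$hits   = [];

// The callback runs once per match in each subject string. It records the
// matched text and returns an empty replacement, which is discarded anyway.
preg_replace_callback($rx, function (array $m) use (&$countM, &$hits) {
    $countM++;
    $hits[] = $m[0];
    return '';
}, $arrayKeywords);
```

Note that joining patterns this way shifts the numbering of any capture groups they contain, so code that reads $match[1] from an individual pattern would need adjusting.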
Increase execution time
Use this line in .htaccess:
php_value max_execution_time 80000

Excluding all characters NOT in list AND NOT in a list of phrases

I'm attempting to make a function in PHP that will evaluate a mathematical expression -- including functions such as sin, cos, etc. My approach is to delete all characters in the phrase that are not numbers, mathematical operators, or mathematical functions and then use that string in an eval(). The problem is that I don't know enough about regular expressions to negate both characters and phrases in the same expression.
So far, this is what I've got:
$input = preg_replace("/[^0-9+\-.*\/()sincota]/", "", $input);
Obviously, the characters for sin, cos, and tan can be used in any order in the input expression (rather than only allowing the phrases sin, cos, and tan). If I further expand this function to include even more characters and functions, that presents an even bigger security risk as a malicious user would be able to execute just about any PHP command through clever interaction with the app.
Can anyone tell me how to fix my regex and eliminate this problem?
I'm attempting to make a function in PHP that will evaluate a mathematical expression -- including functions such as sin, cos, etc
Might I suggest taking advantage of the years of work that has been put into PHPExcel, which includes a formula parser already.
It includes cos, sin and hundreds of others.
Otherwise, rather than negating, you can look for positive matches:
$matches = array();
preg_match_all("#([0-9,/\*()+\s\.-]|sin|cos)+#", 'sin(12) + cos(13.5/2) evddal * (4-1)', $matches);
echo implode('', $matches[0]);
/* output:
sin(12) + cos(13.5/2) * (4-1)
*/
You could try this :
preg_replace("/(sin|cos|tan)?[^0-9+\\.*\/()-]/", "$1", $input);
code on ideone.com
But if you're trying to parse an expression in order to evaluate it, I suggest you actually parse it rather than simply pass it through a regular expression pattern.
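As a step in that direction, the input can be rejected outright unless it consists only of known tokens, instead of deleting the unknown parts. The sketch below is an assumption-laden illustration: is_allowed_expression() is a hypothetical helper, the token list (numbers, + - * / ( ), whitespace, sin, cos, tan) is just the one from the question, and it checks vocabulary only, not whether the expression is well-formed.

```php
<?php
// Hypothetical helper: returns true only if $input is made up entirely of
// whitelisted tokens. Possessive quantifiers (++) keep matching linear.
function is_allowed_expression($input)
{
    $token = '\d++(?:\.\d++)?'   // integer or decimal number
           . '|[-+*\/()]'        // operators and parentheses
           . '|\s++'             // whitespace
           . '|sin|cos|tan';     // allowed function names
    return preg_match('/^(?:' . $token . ')+$/', $input) === 1;
}

$ok  = is_allowed_expression('sin(12) + cos(13.5/2) * (4-1)');
$bad = is_allowed_expression('sin(12) + evddal * (4-1)');
```

Here $ok is true and $bad is false. A string that passes can still be syntactically invalid (for example "sin cos"), so this is a filter in front of a real parser rather than a replacement for one.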
