Expected lifespan of ereg, migrating to preg [duplicate] - php

This question already has answers here:
How can I convert ereg expressions to preg in PHP?
(4 answers)
Closed 3 years ago.
I work on a large PHP application (>1 million lines, 10 yrs old) which makes extensive use of ereg and ereg_replace - currently 1,768 unique regular expressions in 516 classes.
I'm very aware why ereg is being deprecated but clearly migrating to preg could be highly involved.
Does anyone know how long ereg support is likely to be maintained in PHP, and/or have any advice for migrating to preg on this scale? I suspect automated translation from ereg to preg is impossible/impractical?

I'm not sure when ereg will be removed but my bet is as of PHP 6.0.
Regarding your second issue, translating ereg to preg doesn't seem that hard. If your application has more than 1 million lines, surely you have the resources to put someone on this job for a week at most. I would grep for all the ereg and ereg_replace calls in your code and set up some macros in your favorite IDE (simple stuff like adding delimiters, modifiers and so on).
I bet most of the 1768 regexes can be ported using a macro, and the others, well, a good pair of eyes.
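To make the grep step concrete, here is a minimal sketch of a scanner that lists ereg-family calls line by line. The function name find_ereg_calls and the sample source are made up for illustration. It only looks at text, so it will also flag matches inside comments or strings.

```php
<?php
// Hypothetical helper (not part of PHP): report lines that call
// ereg(), eregi(), ereg_replace() or eregi_replace().
function find_ereg_calls($source)
{
    $hits = array();
    foreach (explode("\n", $source) as $index => $line) {
        // \b keeps names such as my_ereg( from matching.
        if (preg_match('/\beregi?(?:_replace)?\s*\(/', $line)) {
            $hits[$index + 1] = trim($line); // key is the 1-based line number
        }
    }
    return $hits;
}

// Made-up sample source; single quotes keep $name from being interpolated.
$sample = implode("\n", array(
    '<?php',
    'if (ereg("^[a-z]+$", $name)) { }',
    '$ok = preg_match("/^[a-z]+$/", $name);',
    '$clean = eregi_replace("[^a-z]", "", $name);',
));

foreach (find_ereg_calls($sample) as $lineNumber => $code) {
    echo $lineNumber . ': ' . $code . "\n";
}
```

Running something like this over each file gives you the list of filenames and line numbers to work through, whether by macro or by hand.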
Another option might be to write wrappers around the ereg functions if they are not available, implementing the changes as needed:
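// Note: ereg() returned the match length (or 1) on success and false on failure;
// preg_match() returns 1 or 0, so truthiness checks keep working.
if (function_exists('ereg') !== true)
{
    // $regs is optional in the original ereg(), so it gets a default here too.
    function ereg($pattern, $string, &$regs = null)
    {
        return preg_match('~' . addcslashes($pattern, '~') . '~', $string, $regs);
    }
}
if (function_exists('eregi') !== true)
{
    function eregi($pattern, $string, &$regs = null)
    {
        // The i modifier replaces the dedicated case-insensitive function.
        return preg_match('~' . addcslashes($pattern, '~') . '~i', $string, $regs);
    }
}
You get the idea. Also, the PEAR package PHP_Compat might be a viable solution too.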
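In the same spirit, an ereg_replace() wrapper needs one extra step, because preg_replace() accepts two-digit references such as \15 while ereg_replace() only knew \0 to \9. The following is a rough sketch, not a drop-in replacement: it assumes the patterns do not already contain an escaped ~ and that the replacement strings only use plain \0 to \9 references.

```php
<?php
if (function_exists('ereg_replace') !== true)
{
    function ereg_replace($pattern, $replacement, $string)
    {
        // Rewrite \0..\9 as ${0}..${9} so that a digit following the
        // reference is not read as part of a two-digit reference.
        $replacement = preg_replace_callback(
            '/\\\\([0-9])/',
            function ($m) {
                return '${' . $m[1] . '}';
            },
            $replacement
        );

        return preg_replace('~' . addcslashes($pattern, '~') . '~', $replacement, $string);
    }
}

// Made-up usage: the replacement string is \1 followed by 5, then \2.
$i = 5;
echo ereg_replace('^(.*)%%number%%(.*)$', "\\1$i\\2", 'a%%number%%b'), "\n";
```

Without the rewriting step, the replacement above would reach preg_replace() as \15\2, and the reference to a non-existent group 15 would silently produce an empty string.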
Differences from POSIX regex
As of PHP 5.3.0, the POSIX Regex extension is deprecated. There are a number of differences between POSIX regex and PCRE regex. This page lists the most notable ones that are necessary to know when converting to PCRE.
The PCRE functions require that the pattern is enclosed by delimiters.
Unlike POSIX, the PCRE extension does not have dedicated functions for case-insensitive matching. Instead, this is supported using the /i pattern modifier. Other pattern modifiers are also available for changing the matching strategy.
The POSIX functions find the longest of the leftmost match, but PCRE stops on the first valid match. If the string doesn't match at all it makes no difference, but if it matches it may have dramatic effects on both the resulting match and the matching speed. To illustrate this difference, consider the following example from "Mastering Regular Expressions" by Jeffrey Friedl. Using the pattern one(self)?(selfsufficient)? on the string oneselfsufficient with PCRE will result in matching oneself, but using POSIX the result will be the full string oneselfsufficient. Both (sub)strings match the original string, but POSIX requires that the longest be the result.
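The Friedl example is easy to reproduce on the PCRE side. The snippet below is only an illustration of that last point. The second pattern is one possible way to get the longer match back by listing the longer alternative first, not something the manual prescribes.

```php
<?php
$subject = 'oneselfsufficient';

// PCRE takes the first valid match: (self)? succeeds, (selfsufficient)? is skipped.
preg_match('/one(self)?(selfsufficient)?/', $subject, $first);
echo $first[0], "\n"; // oneself

// Putting the longer alternative first recovers the full-string match.
preg_match('/one(selfsufficient|self)?/', $subject, $longest);
echo $longest[0], "\n"; // oneselfsufficient
```

Patterns with optional or alternated groups like this are the ones worth checking by eye after a mechanical conversion.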

My intuition says that they are never going to remove ereg on purpose. PHP still supports really old and deprecated stuff like register globals. There are simply too many outdated apps out there. There is, however, a small chance that the extension has to be removed because someone finds a serious vulnerability and there's just nobody to fix it.
In any case, it's worth noting that:
You are not forced to upgrade your PHP installation. It's pretty common to keep outdated servers to run legacy apps.
The PHP_Compat PEAR package offers plain PHP versions of some native functions. If ereg disappears, it's possible that it gets added.
BTW... In fact, PHP 6 is dead. They realised that their approach to make PHP fully Unicode compliant was harder than they thought and they are rethinking it all. The conclusion is: you can never make perfect predictions.

I had this problem on a much smaller scale - an application more like 10,000 lines. In every case, all I needed to do was switch to preg_replace() and put delimiters around the regex pattern.
Anyone should be able to do that - even a non-programmer can be given a list of filenames and line numbers.
Then just run your tests to watch for any failures that can be fixed.
ereg functions will be removed from PHP6, by the way - http://jero.net/articles/php6.

All ereg functions will be removed as of PHP 6, I believe.

Related

php preg_match and ereg syntax difference

I found that syntax of preg_match() and the deprecated ereg() is different.
For example:
I thought that
preg_match('/^<div>(.*)<\/div>$/', $content);
means the same as
ereg('^<div>(.*)</div>$', $content);
but I was wrong. With preg_match(), the . does not match special characters such as a newline (enter), as it does with ereg().
So I started to use this syntax:
preg_match('/^<div>([^<]*)<\/div>$/', $content);
but it isn't exactly what I need.
Can anyone suggest how to solve this problem without using deprecated functions?
For parsing HTML I'd suggest reading this question and choosing a built in PHP extension.
If for some reason you need or want to use RegEx to do it you should know that:
preg_match() is a greedy little bugger and it will try to eat everything your (.*) can match till it gets sick (meaning it hits recursion or backtracking limits). You change this with the U modifier [1].
By default the . does not match a newline, so the engine effectively expects to be fed a single line. You change this with the m or s modifiers [1].
Using your 'not a < character' ([^<]*) hack does a good job as it forces the engine to stop at the first < char, but it will work only if the <div> doesn't contain other tags inside!
[1] PCRE Pattern Modifiers
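A small sketch of what those modifiers change, using a made-up two-line $content. The ~ delimiter is chosen here only so that the / in </div> needs no escaping.

```php
<?php
$content = "<div>line one\nline two</div>";

// Without "s", "." stops at the newline, so nothing matches.
$plain = preg_match('~^<div>(.*)</div>$~', $content);

// With "s", "." also matches newlines.
$dotAll = preg_match('~^<div>(.*)</div>$~s', $content, $m);

// Greedy versus ungreedy (U) on two sibling blocks.
$two = '<div>a</div><div>b</div>';
preg_match_all('~<div>(.*)</div>~s', $two, $greedy);
preg_match_all('~<div>(.*)</div>~sU', $two, $lazy);

var_dump($plain, $dotAll, $m[1], $greedy[1], $lazy[1]);
```

The greedy run captures everything between the first <div> and the last </div> as one match, while the U run captures a and b separately. Neither copes with nested <div> elements, which is why a real parser is the better tool for HTML.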

JavaScript equivalent of PHP preg_split()

Is there an equivalent of the PHP function preg_split for JavaScript?
Any string in javascript can be split using the string.split function, e.g.
"foo:bar".split(/:/)
where split takes as an argument either a regular expression or a literal string.
You can use regular expressions with split.
The problem is the escape characters in the string: the (? opens a non-capturing group, but there is no corresponding ) to close it, so it identifies the string to look for as '
If you want support for all of the preg_split arguments see https://github.com/kvz/phpjs/blob/master/_workbench/pcre/preg_split.js (though not sure how well tested it is).
Just bear in mind that JavaScript's regex syntax is somewhat different from PHP's (mostly less expressive). We would like to integrate XRegExp at some point, as that makes up for some of the PHP regex features that JavaScript is missing (as well as fixing the many browser reliability problems with functions like String.split()).
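For reference, these are the preg_split() arguments on the PHP side that a JavaScript port would have to cover. The inputs are made up for illustration.

```php
<?php
// Plain split, comparable to "foo:bar".split(/:/) in JavaScript.
$basic = preg_split('/:/', 'foo:bar');

// The limit argument: the last element keeps the rest of the string.
$limited = preg_split('/,/', 'a,b,c', 2);

// PREG_SPLIT_NO_EMPTY drops empty pieces.
$noEmpty = preg_split('/,/', 'a,,b', -1, PREG_SPLIT_NO_EMPTY);

// PREG_SPLIT_DELIM_CAPTURE also returns the captured delimiters.
$withDelims = preg_split('/([,;])/', 'a,b;c', -1, PREG_SPLIT_DELIM_CAPTURE);

print_r(array($basic, $limited, $noEmpty, $withDelims));
```

The limit in particular is worth testing in any port, since PHP keeps the remainder of the string in the last element.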

php equals regular expression

I know I can use preg_match but I was wondering if php had a way to evaluate to a regular expression like:
if(substr($example, 0, 1) == /\s/){ echo 'whitespace!'; }
PHP does not have first-class regular expressions.
You will need to use the functions provided by the default PCRE extension. Sorry. It's a backslash-escaping nightmare, but it's all we've got.
(There's also the now-deprecated POSIX regex extension, but you should not use it any longer. Its functions are slower, less featureful, and most important, they aren't Unicode-safe. Modern PCRE versions understand Unicode very well, even if PHP itself is ignorant about it.)
With regard to the backslash-escaping nightmare, you can keep the horror to a minimum by using single quotes to enclose the string containing the regex instead of doubles, and picking an appropriate delimiter. Compare:
"/^http:\\/\\/www.foo.bar\\/index.html\\?/"
versus
'!^http://www.foo.bar/index.html\?!'
Inside single quotes, you only need to backslash-escape backslashes and single quotes, and picking a different delimiter avoids needing to escape the delimiter inside the regex.
:)
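A quick check that both spellings compile to the same pattern, plus preg_quote() as a further option when the literal text comes from elsewhere. The test URLs are made up.

```php
<?php
// The same pattern written both ways.
$double = "/^http:\\/\\/www.foo.bar\\/index.html\\?/";
$single = '!^http://www.foo.bar/index.html\?!';

// preg_quote() escapes regex metacharacters (and the given delimiter) in literal text.
$quoted = '!^' . preg_quote('http://www.foo.bar/index.html?', '!') . '!';

$url   = 'http://www.foo.bar/index.html?page=2'; // made-up test URL
$sloppy = 'http://wwwXfoo.bar/index.html?';      // "X" where a dot should be

var_dump(preg_match($double, $url));    // int(1)
var_dump(preg_match($single, $url));    // int(1)
var_dump(preg_match($quoted, $url));    // int(1)
var_dump(preg_match($single, $sloppy)); // int(1), because the unescaped . matches any character
var_dump(preg_match($quoted, $sloppy)); // int(0)
```

The last two lines show a side benefit of preg_quote(): the dots in the hostname are escaped as well, so they only match a literal dot.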
if(substr($example, 0, 1) == " "){ echo 'whitespace!';}
You should not be using regexp when it is not needed.
There would also be the microoptimization option:
if (strstr(" \t\r\n", $example[0]) !== false) { echo 'whitespace!'; }
$example[0] gets the first character (the older $example{0} spelling does the same but is deprecated as of PHP 7.4 and removed in PHP 8.0). strstr simply checks whether that character is contained in the list of whitespace characters. Another option would be strspn, at least in your example case.
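For comparison, here is a sketch of the check written with preg_match and with strspn. Both function names are made up for this example, and unlike the $example[0] version they accept an empty string without a warning.

```php
<?php
// Hypothetical helper: regex version, \s covers space, tab, CR, LF and a few more.
function starts_with_whitespace_regex($example)
{
    return preg_match('/^\s/', $example) === 1;
}

// Hypothetical helper: strspn() returns the length of the leading run
// made up only of characters from the mask.
function starts_with_whitespace_strspn($example)
{
    return strspn($example, " \t\r\n") > 0;
}

foreach (array(' leading space', "\tleading tab", 'no whitespace', '') as $example) {
    var_dump(starts_with_whitespace_regex($example), starts_with_whitespace_strspn($example));
}
```

There is still no way to compare a value against a regex literal with ==, so wrapping the preg_match call in a small function is about as close as PHP gets.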

Replace ereg_match with preg_match [duplicate]

Dear Sir/Ma'am,
How can I replace the deprecated ereg_replace with preg_replace or str_replace
and still have the same functionality as in the code below?
return ereg_replace("^(.*)%%number%%(.*)$","\\1$i\\2",$number);
// this doesn't work
return preg_replace("^(.*)%%number%%(.*)$","\\1$i\\2",$number);
Does anyone have a clue?
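Try this:
return ereg_replace("^(.*)%%number%%(.*)$","\\1$i\\2",$number);
becomes
return preg_replace("/^(.*)%%number%%(.*)$/", '${1}' . $i . '${2}', $number);
Note the / around the regex. The replacement changes too: preg_replace() accepts two-digit references, so if $i starts with a digit, "\\1$i" would be read as a reference such as \15 instead of \1 followed by 5. Writing ${1} keeps the reference unambiguous.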
I'll go with a read the fabulous manual approach.
The PHP Manual has a section for moving from POSIX Regex to PCRE.
The PCRE functions require that the pattern is enclosed by delimiters.
Unlike POSIX, the PCRE extension does not have dedicated functions for case-insensitive matching. Instead, this is supported using the /i pattern modifier. Other pattern modifiers are also available for changing the matching strategy.
The POSIX functions find the longest of the leftmost match, but PCRE stops on the first valid match. If the string doesn't match at all it makes no difference, but if it matches it may have dramatic effects on both the resulting match and the matching speed. To illustrate this difference, consider the following example from "Mastering Regular Expressions" by Jeffrey Friedl. Using the pattern one(self)?(selfsufficient)? on the string oneselfsufficient with PCRE will result in matching oneself, but using POSIX the result will be the full string oneselfsufficient. Both (sub)strings match the original string, but POSIX requires that the longest be the result.
Good luck,
Alin
Perl Compatible Regular Expressions, used by the preg_ functions in PHP, require a delimiter character in the pattern string. The delimiters mark where the actual pattern starts and ends, and the modifiers for extra functionality, such as case insensitivity, go after the closing delimiter.
For example:
$pattern = "/dog/i"; // Search pattern for "dog", case insensitive.
$replace = "cat";
$subject = "Dogs are cats.";
$result = preg_replace($pattern, $replace, $subject);

Compile regex in PHP

Is there a way in PHP to compile a regular expression, so that it can then be compared to multiple strings without repeating the compilation process? Other major languages can do this -- Java, C#, Python, Javascript, etc.
The Perl-Compatible Regular Expressions library may already be optimized for your use case without providing a Regex class like other languages do:
This extension maintains a global per-thread cache of compiled regular expressions (up to 4096).
PCRE Introduction
This is how the study modifier which Imran described can store the compiled expression between calls.
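In practice this means that reusing the exact same pattern string is all it takes to benefit from the cache. Below is a minimal sketch. The constant name DATE_PATTERN, the function filter_dates and the sample data are made up for illustration.

```php
<?php
// One pattern string, defined once and reused for every call.
const DATE_PATTERN = '/^\d{4}-\d{2}-\d{2}$/';

// Hypothetical helper: keep only the strings that look like YYYY-MM-DD.
function filter_dates(array $candidates)
{
    $valid = array();
    foreach ($candidates as $candidate) {
        // Same pattern string each time, so PHP can reuse the compiled regex.
        if (preg_match(DATE_PATTERN, $candidate) === 1) {
            $valid[] = $candidate;
        }
    }
    return $valid;
}

$candidates = array('2020-01-02', 'nope', '1999-12-31');

print_r(filter_dates($candidates));

// preg_grep() applies one pattern to a whole array in a single call
// (it preserves the original keys, hence array_values()).
print_r(array_values(preg_grep(DATE_PATTERN, $candidates)));
```

There is no compiled-regex object to hold on to; the pattern string itself acts as the cache key.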
preg regexes can use the uppercase S (study) modifier, which is probably the thing you're looking for.
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
S
When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.
The "thread" here is the thread that the script is currently running in. After its first use, a compiled regexp is cached, and the next time it is used PHP does not compile it again.
Simple test:
<?php
// Test string
$text = 'The big brown <b>fox</b> jumped over a lazy <b>cat</b>';
$testTimes = 10;

// Regexp with caching: the same pattern string is used on every iteration,
// so the compiled regexp is taken from the cache.
$avg = 0;
for ($x = 0; $x < $testTimes; $x++) {
    $start = microtime(true); // current time in seconds, as a float
    for ($i = 0; $i < 10000; $i++) {
        preg_match_all('/<b>(.*)<\/b>0?/', $text, $m);
    }
    $avg += microtime(true) - $start;
}
echo 'Regexp with caching avg ' . ($avg / $testTimes);

// Regexp without caching: the pattern string differs on every iteration,
// so each one has to be compiled.
$avg = 0;
for ($x = 0; $x < $testTimes; $x++) {
    $start = microtime(true);
    for ($i = 0; $i < 10000; $i++) {
        $pattern = '/<b>(.*)<\/b>' . $i . '?/';
        preg_match_all($pattern, $text, $m);
    }
    $avg += microtime(true) - $start;
}
echo '<br/>Regexp without caching avg ' . ($avg / $testTimes);
Regexp with caching avg 0.1
Regexp without caching avg 0.8
Caching a regexp makes it 8 times faster!
As another commenter has already said, PCRE regexes are already compiled without your having to specifically reference them as such; PCRE keeps an internal hash indexed by the original string you provided.
I'm not positive that you can. If you check out Mastering Regular Expressions, some PHP-specific optimization techniques are discussed in Chapter 10: PHP, specifically the use of the S pattern modifier to cause the regex engine to "study" the regular expression before it applies it. Depending on your pattern and your text, this could give you some speed improvements.
Edit: you can take a peek at the contents of the book using books.google.com.
