I have noticed in the PHP regex library there is a choice between ereg and preg. What is the difference? Is one faster than the other and if so, why isn't the slower one deprecated?
Are there any situations where it is better to use one over the other?
Visiting php.net/ereg displays the following:
Warning
This function has been DEPRECATED as of PHP 5.3.0 and REMOVED as of PHP 6.0.0. Relying on this feature is highly discouraged.
Down the page just a bit further and we read this:
Note: preg_match(), which uses a Perl-compatible regular expression syntax, is often a faster alternative to ereg().
Note my emphasis.
preg is the Perl Compatible Regex library
ereg is the POSIX compliant regex library
They have a slightly different syntax and preg is in some cases slightly faster. ereg is deprecated (and it is removed in PHP 6), so I wouldn't recommend using it.
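To illustrate the syntax difference, here is a minimal sketch of how a typical ereg-style check translates to preg. The main changes are that PCRE patterns need delimiters (here /) and that case-insensitive matching uses the i modifier instead of a separate eregi() function. The sample input and variable names are made up for illustration, and the old calls appear only in comments because they do not run on PHP builds without the ereg extension.

```php
<?php
// Hypothetical input for illustration.
$date = '2009-07-15';

// Old POSIX style (ereg extension; shown for comparison only):
//   ereg('^([0-9]{4})-([0-9]{2})-([0-9]{2})$', $date, $regs);
//   eregi('^php', 'PHP regex functions');

// PCRE equivalent: the same pattern, wrapped in "/" delimiters.
$matched = preg_match('/^([0-9]{4})-([0-9]{2})-([0-9]{2})$/', $date, $regs);

// Case-insensitive matching uses the "i" modifier instead of eregi().
$startsWithPhp = preg_match('/^php/i', 'PHP regex functions');

printf("%d: year=%s month=%s day=%s\n", $matched, $regs[1], $regs[2], $regs[3]);
printf("%d\n", $startsWithPhp);
```

Simple patterns like this one carry over unchanged apart from the delimiters; patterns that contain the delimiter character itself need it escaped, or a different delimiter chosen.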
There is much discussion about which is faster and better.
If you plan on someday advancing to PHP6 your decision is made. Otherwise:
The general consensus is that PCRE is the better all around solution, but if you have a specific page with a lot of traffic, and you don't need PHP6 it may be worth some testing.
For example, from the PHP manual comments:
Deprecating POSIX regex in PHP for Perl searching is like substituting wooden boards and brick for a house with pre-fabricated rooms and walls. Sure, you may be able to mix and match some of the parts but it's a lot easier to modify with all the pieces laid out in front of you.
PCRE faster than POSIX RE? Not always. In a recent search-engine project here at Cynergi, I had a simple loop with a few cute ereg_replace() functions that took 3min to process data. I changed that 10-line loop into a 100-line hand-written code for replacement and the loop now took 10s to process the same data! This opened my eyes to what can IN SOME CASES be very slow regular expressions.
Lately I decided to look into Perl-compatible regular expressions (PCRE). Most pages claim PCRE are faster than POSIX, but a few claim otherwise. I decided on benchmarks of my own. My first few tests confirmed PCRE to be faster, but... the results were slightly different than others were getting, so I decided to benchmark every case of RE usage I had on a 8000-line secure (and fast) Webmail project here at Cynergi to check it out. The results? Inconclusive! Sometimes PCRE are faster (sometimes by a factor greater than 100x faster!), but some other times POSIX RE are faster (by a factor of 2x).
I still have to find a rule on when are one or the other faster. It's not only about search data size, amount of data matched, or "RE compilation time" which would show when you repeated the function often: one would always be faster than the other. But I didn't find a pattern here. But truth be said, I also didn't take the time to look into the source code and analyse the problem.
I can give you some examples, though. The POSIX RE ([0-9]{4})/([0-9]{2})/([0-9]{2})[^0-9]+ ([0-9]{2}):([0-9]{2}):([0-9]{2}) is 30% faster in POSIX than when converted to PCRE (even if you use \d and \D and non-greedy matching). On the other hand, a similarly complex PCRE pattern /[0-9]{1,2}[ \t]+[a-zA-Z]{3}[ \t]+[0-9]{4}[ \t]+[0-9]{1,2}:[0-9]{1,2}(:[0-9]{1,2})?[ \t]+[+-][0-9]{4}/ is 2.5x faster in PCRE than in POSIX RE. Simple replacement patterns like ereg_replace( "[^a-zA-Z0-9-]+", "", $m ); are 2x faster in POSIX RE than PCRE. And then we get confused again because a POSIX RE pattern like (^|\n|\r)begin-base64[ \t]+[0-7]{3,4}[ \t]+...... is 2x faster as POSIX RE, but the case-insensitive PCRE /^Received[ \t]*:[ \t]by[ \t]+([^ \t]+)[ \t]/i is 30x faster than its POSIX RE version!
When it comes to case sensitivity, PCRE has so far seemed to be the best option. But I found some really strange behaviour from ereg/eregi. On a very simple POSIX RE (^|\r|\n)mime-version[ \t]: I found eregi() taking 3.60s (just a number in a test benchmark), while the corresponding PCRE took 0.16s! But if I used ereg() (case-sensitive) the POSIX RE time went down to 0.08s! So I investigated further. I tried to make the POSIX RE case-insensitive itself. I got as far as this: (^|\r|\n)[mM][iI][mM][eE]-vers[iI][oO][nN][ \t]*: This version also took 0.08s. But if I try to apply the same rule to any of the 'v', 'e', 'r' or 's' letters that are not changed, the time is back to the 3.60s mark, and not gradually, but immediately so! The test data didn't have any "vers" in it, other "mime" words in it or any "ion" that might be confusing the POSIX parser, so I'm at a loss.
Bottom line: always benchmark your PCRE / POSIX RE to find the fastest! Tests were performed with PHP 5.1.2 under Windows, from the command line.
Pedro Freire cynergi.com
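The "always benchmark" advice from that comment can be sketched as a small timing loop. This is only an illustration under assumptions: benchmark_pattern() is a hypothetical helper, the sample header text is made up, and because ereg cannot run on current PHP versions the sketch compares two PCRE formulations of the mime-version pattern rather than POSIX against PCRE. Absolute numbers will vary by machine and PHP build.

```php
<?php
// Hypothetical helper: time $iterations calls of preg_match() for one pattern.
function benchmark_pattern(string $pattern, string $subject, int $iterations = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject);
    }
    return microtime(true) - $start;
}

// Made-up sample data resembling mail headers.
$subject = "Received: by mail.example.com\r\nMIME-Version: 1.0\r\n";

// Two ways of matching the header name case-insensitively.
$candidates = [
    'i modifier'   => '/(^|\r|\n)mime-version[ \t]*:/i',
    'char classes' => '/(^|\r|\n)[mM][iI][mM][eE]-[vV][eE][rR][sS][iI][oO][nN][ \t]*:/',
];

$timings = [];
foreach ($candidates as $label => $pattern) {
    $timings[$label] = benchmark_pattern($pattern, $subject);
    printf("%-14s %.4fs\n", $label, $timings[$label]);
}
```

Run it against data that resembles your real input; as the comment shows, results on one data set do not necessarily carry over to another.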
Even though ereg is deprecated in PHP 5.3, the mb_ereg* functions are not. I believe the main reason is that PHP6 is rebuilding all MB/Unicode support, so the old "regular" ereg methods are useless since mb_ereg will be newer/better.
I know it doesn't answer the question regarding speed, but it does allow you to continue using both POSIX and PCRE.
Well, ereg and its derivative functions (ereg_replace, eregi, etc.) are deprecated in PHP 5 and being removed in PHP 6, so you're probably best going with the preg family instead.
preg is for Perl-style regular expressions, while ereg is standard POSIX regex.
Related
I am working with XML and using preg_replace_callback().
A question that comes up to my mind is: How do we write preg_replace_callback() patterns?
For example, "/{{{\s*"\s*(.+)\s*"\s*}}}/" or "/(<a .*)href=(")([^"]*)"([^>]*)>/U": what do these patterns do? How are these patterns written? What do the characters here mean?
Welcome to Stack Overflow. This seems to be a pretty fundamental question to me, so I'd like to answer it with some off-site resources, as you have already learned about Stack Overflow and can search it on your own (it is a FAQ-like Question & Answer site):
These are, as you have rightly identified yourself, patterns.
More specifically, these are regular expression patterns (or at least that is what they are called).
Concretely, in PHP with preg_replace() they are in the Perl Compatible Regular Expression (PCRE) dialect.
The preg_* in front of preg_replace stands for (at least this is how I explain it to myself) Perl REGular expressions or as the PHP Manual words it "Regular Expressions (Perl-Compatible)", see https://www.php.net/pcre .
How is this pattern written?
Like outlined in the PHP manual. (no pun intended)
What do the characters here mean?
The PHP manual has a short, introductory version of the PCRE syntax. Start reading there (see PCRE Patterns (PHP Manual)). The full syntax is documented with the PCRE library itself, which comes bundled with PHP; see, for example, the PCRE2 manpage on the library's homepage.
If you're using a supported PHP version, this is PCRE2. If not, it can vary. Nevertheless, the basics of those patterns are explained both in the PHP manual and on the PCRE website. That should provide enough reading material.
Remember that you can always create yourself small PHP scripts to try out things in PHP (and therefore with such patterns).
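As one such small script, here is a sketch that runs the second pattern from the question through preg_replace_callback(). The input HTML and the base URL are invented for illustration; the comments describe what each capture group holds for this particular input.

```php
<?php
// Made-up input for illustration.
$html = '<p><a class="nav" href="/docs" title="Docs">Docs</a></p>';

// The second pattern from the question. With the U modifier, quantifiers
// are ungreedy by default, so ".*" stops at the first "href=".
$pattern = '/(<a .*)href=(")([^"]*)"([^>]*)>/U';

$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // $m[0] is the whole match; $m[1]..$m[4] are the capture groups:
        // $m[1] = '<a ' plus anything before href, $m[2] = the opening quote,
        // $m[3] = the URL, $m[4] = the remaining attributes.
        $url = 'https://example.com' . $m[3]; // hypothetical base URL
        return $m[1] . 'href="' . $url . '"' . $m[4] . '>';
    },
    $html
);

echo $result, "\n";
```

The callback receives the match array for every occurrence and returns the replacement text, which is what makes preg_replace_callback() more flexible than a fixed replacement string.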
Additionally, there is literature available to study, even when offline and (out of electrical energy) at daylight (very comfortable!).
If you're out of luck finding literature, benefit from being online and having a printer at hand and print out the manual pages.
If you're lucky, your operating system ships with the documentation at pcre(3) or similar.
Good luck and thanks for asking!
I have been reading this article about two different types of regular expression algorithms (Perl 5.8.7 and Thompson NFA), the latter being ~1,000,000 times faster than the former, according to the article. I use PHP daily, and use regex quite a lot, so I wanted to know which algorithm PHP uses.
I found this question, however it's only for JavaScript. One of the answers states that JavaScript uses the Thompson NFA algorithm, but that will of course vary from implementation to implementation. I think PHP may have switched to using the faster algorithms when it moved to its PCRE set of functions, deprecating the ereg_* stuff.
I've looked at the PHP PCRE documentation and, as far as I could see, it tells me nothing as to what algorithm it uses. The acronym PCRE, to me, tells me that it uses Perl Compatible Regular Expressions, so I assume it uses the Perl style algorithm.
Which regular expression algorithm does PHP use? Is it "Perl 5.8.7 style", or does it use the much faster Thompson NFA algorithm, or another one entirely? Could it even use a Perl backend to run its expressions?
If PHP does use a Perl style algorithm, what exactly is it? I'm looking for an abstract definition/explanation in relation to other algorithms.
From the manual:
http://www.php.net/pcre:
Regular Expressions (Perl-Compatible)
http://www.php.net/manual/en/intro.pcre.php:
The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics as
Perl 5, with just a few differences (see below). The current
implementation corresponds to Perl 5.005.
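Since PCRE follows Perl's semantics, its standard matcher is a backtracking one rather than a Thompson NFA simulation. A rough way to observe this from PHP is to time a pattern of the form (a?){n}a{n} against n copies of "a", the kind of pathological case such algorithm comparisons tend to use. This is only a sketch: time_pathological() is a hypothetical helper, the values of n are arbitrary and kept small, timings depend on the PHP build and on whether PCRE's JIT is enabled, and much larger n may hit pcre.backtrack_limit and make preg_match() return false.

```php
<?php
// Time a pattern of the form (a?){n}a{n} against a string of n "a"s.
// A backtracking matcher may try many ways of distributing the "a"s
// among the optional parts before it finds one that lets a{n} match.
function time_pathological(int $n): array
{
    $pattern = '/^(a?){' . $n . '}a{' . $n . '}$/';
    $subject = str_repeat('a', $n);

    $start = microtime(true);
    $result = preg_match($pattern, $subject);
    $elapsed = microtime(true) - $start;

    return [$result, $elapsed];
}

$results = [];
foreach ([4, 8, 12] as $n) { // arbitrary small sizes
    [$result, $elapsed] = time_pathological($n);
    $results[$n] = $result;
    printf("n=%2d result=%s time=%.6fs\n", $n, var_export($result, true), $elapsed);
}
```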
I just wanted to adapt my code to be compatible with PHP 5.3 (6.0).
So I wanted to replace all the calls to the ereg functions with the corresponding preg functions.
But then I saw that the mb_ereg functions haven't been marked as deprecated. So I am just wondering if it is safe to rely on them. Is anything known about whether they will also be declared deprecated soon, or is it even a flaw in the documentation?
I wouldn't depend on them. The preg functions are faster, more efficient, much more powerful and natively support UTF-8. I would recommend using the preg functions for all of your regex needs.
But to directly answer your question, it does not appear that mb_ereg is deprecated...
You can replace all of your ereg calls with mb_ereg if you want a quick solution that saves time. mb_ereg is not marked as deprecated and it is a direct replacement for ereg.
You can rely on it for a certain time or longer, we don't know. But if you have some free time, I think it would be better, as ircmaxell suggests, to replace all of your mb_ereg calls with preg.
mb_ereg is not deprecated, but I wouldn't rely on it because it probably is going to be. Besides, PCRE supports UTF-8 via the u modifier. See this answer.
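A minimal sketch of what the u modifier changes, using a made-up one-character subject: without it PCRE treats the subject as a sequence of bytes, with it pattern and subject are treated as UTF-8.

```php
<?php
// "é" encoded as UTF-8 is two bytes (0xC3 0xA9).
$text = "\xC3\xA9";

// Without /u the subject is treated as bytes, so "." matches one byte
// and the anchored single-character pattern fails.
$bytes = preg_match('/^.$/', $text);

// With /u, pattern and subject are treated as UTF-8, so "." matches
// the whole code point.
$chars = preg_match('/^.$/u', $text);

printf("without u: %d, with u: %d\n", $bytes, $chars);
```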
Are regular expressions the same for PHP, MySQL, JavaScript, Perl, and so on? If so, is there a chart or tutorial that explains regular expressions?
No, there often are subtle differences in supported features (mostly of the pretty advanced kind [1]). For example, JavaScript regular expressions lacked lookbehind until ECMAScript 2018. PHP uses either POSIX extended regular expressions or PCRE (Perl-compatible regex), the latter being close to Perl's feature set. In fact, Perl is probably the ancestor of many advanced features in today's regular expression engines.
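As an illustration of one such advanced feature, here is a small lookbehind example using PHP's preg_match(). The sample text is invented; the point is only that the lookbehind constrains where the match may start without becoming part of the match.

```php
<?php
// Lookbehind: match digits only when they are preceded by a dollar sign,
// without making the dollar sign part of the match.
$found = preg_match('/(?<=\$)\d+/', 'Total: $42 for 3 items', $m);

printf("%d %s\n", $found, $m[0]);
```

An engine without lookbehind would need a capture group instead, for example \$(\d+), and then read the digits from the group rather than from the whole match.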
As for tutorials and comparisons the site http://regular-expressions.info is a very good resource.
Once you get used to writing and applying them, it is often helpful to just quickly try things out. I have found a REPL to be quite handy; I usually use Windows PowerShell, but Ruby or Python are also pretty popular.
[1] Thanks, Dancrumb.
In theory, a regular expression is a language for pattern matching, but there are small differences from language to language. My advice is to use a tool like Regex Coach for building and learning regular expressions.
There is an excellent write-up of Perl regular expressions compared to "classic", POSIX and GNU grep in Chapter 3 of the book "Minimal Perl" by Tim Maher. And I think it's a good read for any of them, not just Perl.
And what do you know, "Sample Chapter 3" is available as a download from this page: Minimal Perl book!
Is regexp the same between languages?
For example, if I want to use it in JavaScript, would I have to search specifically for regexp for JavaScript? I ask because I got some cheat sheets that just say "regular expression".
I wonder if I could use these in all languages: PHP, JavaScript and so on.
The basics are mostly the same, but there are some discrepancies depending on which engine powers the language. PHP and JavaScript differ, since PHP uses PCRE (Perl Compatible Regular Expressions).
PHP also has the POSIX-compatible regex engine (ereg_* functions), but that is deprecated.
If you don't already use it, I suggest you try RegexBuddy. It can convert between several Regex engines.
You can find alternatives for RegexBuddy on Mac here.
You might want to start out by looking here. That's my Bible when I do regexping!
Now, regex should be the same everywhere, at least the fundamentals; however, there are cases where it differs from compiler to compiler (or interpreter, if you will).
Those differences could be in how you search for a specific pattern. Let's take \w as an example: it means any alphanumeric or underscore character in C#, but the pattern in JavaScript might be different.
When you come to a special case like this, you might want to revisit the link provided above.
Regular expression syntax varies slightly between languages, but for the most part the details are the same. Some regex implementations differ slightly in how they process patterns as well as in what certain special character sequences mean.
Google is your best friend. Google for regex in the language of your choice.
One of the biggest variations in regex is how special characters are escaped / interpreted.
For instance, grep, vim and Perl regexes differ in how they handle things like ( ) for grouping / capturing a pattern for back-referencing in search & replace. IIRC, Perl uses them straight while grep and vim require them to be escaped.
Also, Perl regex may support more features than earlier regex engines. Regexes that would have been simple in Perl were a major PITA in grep.
I'm not completely sure if this is a correct way to sum it up, but there are basically two major classes of regex - Posix ( grep and similar tools ) and Perl compatible ( with minor variations ).
One tool I've found useful is The Regex Coach - interactive regular expressions.