regular expressions - same for all languages? - php

Is regular expression syntax the same across languages?
For example, if I want to use a regex in JavaScript, do I have to look for a JavaScript-specific reference? I have some cheat sheets, but they just say "regular expression".
I wonder whether I can use them in all languages: PHP, JavaScript, and so on.

The basics are mostly the same, but there are discrepancies depending on which engine powers the language. PHP and JavaScript differ, for instance, because PHP uses PCRE (Perl Compatible Regular Expressions) while JavaScript has its own engine defined by the ECMAScript specification.
PHP also had a POSIX-compatible regex engine (the ereg_* functions), but it was deprecated in PHP 5.3 and removed in PHP 7.0.
If you don't already use it, I suggest you try RegexBuddy. It can convert between several Regex engines.
You can find alternatives for RegexBuddy on Mac here.

You might want to start out by looking here. That's my Bible when I do regexping!
Regex should be the same everywhere, at least in the fundamentals, but there are cases where it differs from engine to engine.
One difference is what a given shorthand matches. Take \w as an example: in C# it matches any alphanumeric or underscore character (including Unicode letters by default), whereas in JavaScript it matches only ASCII letters, digits, and the underscore.
When you come across a special case like this, you might want to revisit the link above.
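To make the \w point concrete, here is a minimal sketch in Python (chosen only because it is easy to run; Python's re module is neither the C# nor the JavaScript engine). Python 3 treats \w as Unicode-aware by default for text patterns, and the re.ASCII flag restricts it to [a-zA-Z0-9_], which is the JavaScript meaning. The sample string is made up for illustration.

```python
import re

# Sample text with accented letters, written with escapes so the
# code points are unambiguous (U+00EF and U+00E9).
text = "na\u00efve caf\u00e9 user_1"

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only means [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

The same pattern splits the accented words apart in ASCII mode, which is the kind of surprise you can hit when moving a regex between engines.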

Regular expression syntax varies slightly between languages, but for the most part the details are the same. Some implementations differ slightly in how they process a pattern and in what certain special character sequences mean.
Google is your best friend. Google for regex in the language of your choice.

One of the biggest variations in regex is how special characters are escaped / interpreted.
For instance, grep, vim, and Perl regexes differ in how they handle things like ( ) for grouping and capturing a pattern for back-referencing in search and replace. IIRC, Perl uses them unescaped, while grep (in its default basic mode) and vim require them to be escaped.
Also, Perl regex supports more features than earlier regex engines. Regexes that would have been simple in Perl were a major pain in grep.
I'm not completely sure this is the correct way to sum it up, but there are basically two major classes of regex: POSIX (grep and similar tools) and Perl-compatible (with minor variations).
One tool I've found useful is The Regex Coach - interactive regular expressions.
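As a rough illustration of the escaping difference, the sketch below converts grouping and interval syntax from POSIX basic style (\( \) \{ \}) to Perl style (( ) { }). The helper name bre_to_perl is hypothetical and the conversion is deliberately simplified: it ignores bracket expressions and every other difference between the two dialects. It is written in Python only so that the result can be tried out directly.

```python
import re

# Characters whose escaped/unescaped meaning is swapped between
# POSIX basic regex (grep's default) and Perl-style syntax.
SWAPPED = set("(){}")

def bre_to_perl(pattern: str) -> str:
    """Hypothetical helper: swap \\( \\) \\{ \\} with ( ) { }.

    Simplified on purpose: bracket expressions such as [(] and all
    other dialect differences are not handled.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \( in basic syntax is a group, so emit a bare ( instead.
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        elif ch in SWAPPED:
            # A bare ( in basic syntax is a literal, so escape it.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

perl_style = bre_to_perl(r"\(ab\)\{2\}c")
print(perl_style)                               # (ab){2}c
print(bool(re.fullmatch(perl_style, "ababc")))  # True
```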

Related

How to write preg_replace_callback() patterns?

I am working with XML and using preg_replace_callback().
A question that comes to mind is: how do we write preg_replace_callback() patterns?
For example, "/{{{\s*"\s*(.+)\s*"\s*}}}/" or "/(<a .*)href=(")([^"]*)"([^>]*)>/U": what do these patterns do? How are they written? What do the characters mean?
Welcome to Stack Overflow. This seems to be a pretty fundamental question to me, so I'd like to answer it with some off-site resources, since you already know about Stack Overflow and can search it on your own (it is a FAQ-like question-and-answer site):
These are, as you have rightly identified on your own, patterns.
More specifically, these are regular expression patterns.
Concretely, in PHP with preg_replace() they are in the Perl Compatible Regular Expression (PCRE) dialect.
The preg_* in front of preg_replace stands for (at least this is how I explain it to myself) Perl REGular expressions or as the PHP Manual words it "Regular Expressions (Perl-Compatible)", see https://www.php.net/pcre .
How is this pattern written?
As outlined in the PHP manual. (no pun intended)
What do the characters here mean?
The PHP manual has a short, introductory version of the PCRE syntax. Start reading there (see PCRE Patterns in the PHP manual). The full syntax is documented with the PCRE library itself, which is bundled with PHP; see, for example, the PCRE2 man page on the library's homepage.
If you're using a supported PHP version, this is PCRE2. If not, it can vary. Nevertheless, the basics of these patterns are explained both in the PHP manual and on the PCRE website. That should provide enough to read.
Remember that you can always create yourself small PHP scripts to try out things in PHP (and therefore with such patterns).
Additionally, literature is available to study, even offline and (when out of electrical energy) in daylight (very comfortable!).
If you're out of luck finding literature, benefit from being online and having a printer at hand and print out the manual pages.
If you're lucky, your operating system ships with the documentation at pcre(3) or similar.
Good luck and thanks for asking!
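As a starting point for reading such patterns, here is a minimal sketch that takes the first pattern from the question apart piece by piece. It uses Python's re.sub() with a function as the replacement, which plays roughly the role of preg_replace_callback(); Python's re is not PCRE, but the constructs used in this pattern behave the same way. The names render and values and the sample strings are hypothetical, and the PHP / delimiters are dropped because Python does not use them.

```python
import re

# The first pattern from the question, without PHP's / delimiters:
#   \{\{\{   three literal opening braces
#   \s*"\s*  optional whitespace, a double quote, optional whitespace
#   (.+)     capture group 1: one or more characters (greedy)
#   \s*"\s*  the closing quote, with optional whitespace around it
#   \}\}\}   three literal closing braces
PLACEHOLDER = re.compile(r'\{\{\{\s*"\s*(.+)\s*"\s*\}\}\}')

def render(template, values):
    """Replace each {{{ "key" }}} with values[key] via a callback."""
    def callback(match):
        key = match.group(1)
        # Leave unknown placeholders untouched.
        return values.get(key, match.group(0))
    return PLACEHOLDER.sub(callback, template)

print(render('Hello {{{ "name" }}}!', {"name": "World"}))

# Greedy (.+) runs to the LAST closing quote on the line, so two
# placeholders on one line are captured as a single match.
two = '{{{ "a" }}} {{{ "b" }}}'
greedy = PLACEHOLDER.findall(two)
lazy = re.findall(r'\{\{\{\s*"\s*(.+?)\s*"\s*\}\}\}', two)
print(greedy)
print(lazy)
```

The second pattern in the question ends in the PCRE U modifier, which makes quantifiers ungreedy by default. Python has no such flag, so the sketch writes the lazy quantifier (.+?) explicitly instead.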

String match regex with one typo in PHP [duplicate]

I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos: sometimes those writing instructions for their team in the game will misspell a building name, and obviously the regex will then not pick it up (e.g. spelling "University" as "Unversity").
Are there any suggestions on making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and runs on a local machine that can handle a lot more load, so as a last resort I could algorithmically create versions of each word with a letter missing and others with letters added.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein Distance, a measure of the difference between strings as the number of transformations needed to convert one string to the other.
It's also built into PHP.
So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regexes for each special case.
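The approach above can be sketched as follows. PHP ships levenshtein() as a built-in; this Python version implements the distance by hand so that the sketch is self-contained. The building list, the function name find_buildings, and the threshold of 2 are assumptions for illustration, not part of the original answer.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Edit distance with insert, delete and substitute each costing 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

# Hypothetical building list; in the question it comes from a database.
BUILDINGS = ["University", "Barracks", "Farm"]

def find_buildings(text, max_distance=2):
    """Return (word, building) pairs whose distance is within the threshold."""
    hits = []
    for word in re.split(r"\W+", text):
        if not word:
            continue
        for building in BUILDINGS:
            if levenshtein(word.lower(), building.lower()) <= max_distance:
                hits.append((word, building))
    return hits

print(find_buildings("Upgrade the Unversity, then build a Frm"))
```

A fixed threshold is crude for short names: with a limit of 2, a word like "arm" would also count as "Farm". Scaling the threshold with the length of the building name is one possible refinement.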
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex()? (comment by Teifion)
A soundex is similar to the levenshtein function Triptych mentions. It is a means of comparing strings. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D
Back in the day we sometimes used Soundex() for these problems.
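PHP has soundex(), metaphone() and similar_text() built in. Python's standard library has no Soundex, so the sketch below swaps in difflib, which scores similarity by matching blocks of characters and is closer in spirit to similar_text() than to a phonetic code. The building list, the helper name guess_building, and the cutoff of 0.8 are assumptions chosen for illustration.

```python
import difflib

# Hypothetical building list for illustration.
BUILDINGS = ["University", "Barracks", "Farm"]

def guess_building(word, cutoff=0.8):
    """Return the closest building name, or None if nothing is similar enough."""
    lowered = {b.lower(): b for b in BUILDINGS}
    matches = difflib.get_close_matches(
        word.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None

# Similarity ratio between 0.0 (nothing shared) and 1.0 (identical).
ratio = difflib.SequenceMatcher(None, "unversity", "university").ratio()
print(round(ratio, 3))
print(guess_building("Unversity"))
print(guess_building("Castle"))
```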
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep, originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.

Which regular expression algorithm does PHP use?

I read this article about two different types of regular expression algorithms (Perl 5.8.7 and Thompson NFA), the latter being ~1,000,000 times faster than the former according to the article. I use PHP daily and use regex quite a lot, so I wanted to know which algorithm PHP uses.
I found this question, however it's only for JavaScript. One of the answers states that JavaScript uses the Thompson NFA algorithm, but that will of course vary from implementation to implementation. I think PHP may have switched to using the faster algorithms when it moved to its PCRE set of functions, deprecating the ereg_* stuff.
I've looked at the PHP PCRE documentation and, as far as I could see, it tells me nothing as to what algorithm it uses. The acronym PCRE, to me, tells me that it uses Perl Compatible Regular Expressions, so I assume it uses the Perl style algorithm.
Which regular expression algorithm does PHP use? Is it "Perl 5.8.7 style", or does it use the much faster Thompson NFA algorithm, or another one entirely? Could it even use a Perl backend to run its expressions?
If PHP does use a Perl style algorithm, what exactly is it? I'm looking for an abstract definition/explanation in relation to other algorithms.
From the manual:
http://www.php.net/pcre:
Regular Expressions (Perl-Compatible)
http://www.php.net/manual/en/intro.pcre.php:
The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics as
Perl 5, with just a few differences (see below). The current
implementation corresponds to Perl 5.005.
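To get a feel for what "Perl-style" means in practice, here is a small timing sketch. It uses Python's re module, which, like Perl and PCRE's standard matching function, is a backtracking matcher; it is not PHP, so the absolute numbers say nothing about preg_match(). The pattern family (n optional a's followed by n required a's, matched against n a's) is one that is often used to illustrate the problem; the function names are hypothetical.

```python
import re
import time

def pathological(n):
    """Build 'a?' repeated n times plus 'a' repeated n times, and n a's of text."""
    pattern = "a?" * n + "a" * n
    text = "a" * n
    return pattern, text

def time_match(n):
    """Return (matched, seconds) for one full match attempt."""
    pattern, text = pathological(n)
    compiled = re.compile(pattern)
    start = time.perf_counter()
    matched = compiled.fullmatch(text) is not None
    return matched, time.perf_counter() - start

# The match always succeeds, but a backtracking engine first tries
# many ways of letting the optional a's consume the text.
for n in (10, 14, 18):
    matched, seconds = time_match(n)
    print(f"n={n:2d} matched={matched} seconds={seconds:.4f}")
```

On a backtracking engine the time roughly doubles with each step of n, whereas a Thompson-style simulation tracks all possible states at once and grows far more slowly. That gap is what the article's large speed-up figure refers to.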

How similar are the Regular expressions engines for PHP, MySQL, JavaScript, Perl, etc?

Are regular expressions the same for PHP, MySQL, JavaScript, Perl, and so on? If so, is there a chart or tutorial that explains regular expressions?
No, there are often subtle differences in supported features (mostly of the pretty advanced kind [1]). For example, JavaScript regular expressions originally had no lookbehind (it was only added in ECMAScript 2018). PHP uses either POSIX extended regular expressions or PCRE (Perl-compatible regex), which is close to Perl's feature set. In fact, Perl is probably the ancestor of many advanced features in today's regular expression engines.
As for tutorials and comparisons the site http://regular-expressions.info is a very good resource.
Once you get used to writing and applying them, it is often helpful to just quickly try things out. I have found a REPL to be quite handy; I usually use Windows PowerShell, but Ruby or Python are also pretty popular.
1 Thanks, Dancrumb.
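Lookbehind is a good feature to test when moving a pattern between engines. The sketch below shows it in Python, whose re module supports lookbehind but only with a fixed width; the sample text is made up, and the behaviour of other engines is not shown here.

```python
import re

text = "cost $42, quantity 17"

# Lookbehind: match digits only when preceded by a dollar sign,
# without including the dollar sign in the match.
prices = re.findall(r"(?<=\$)\d+", text)
print(prices)

# Python's re accepts only fixed-width lookbehind, so this pattern,
# which some other engines accept, is rejected at compile time.
try:
    re.compile(r"(?<=\$\s*)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

Trying a small case like this in a REPL is usually quicker than hunting for the exact rule in a comparison chart.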
In theory, a regular expression is a language for pattern matching, but there are small differences from language to language. My advice is to use a tool like Regex Coach for building and learning regular expressions.
There is an excellent write-up of Perl regular expressions compared to "classic", POSIX and GNU grep in Chapter 3 of the book "Minimal Perl" by Tim Maher. And I think it's a good read for any of them, not just Perl.
And what do you know, "Sample Chapter 3" is available as a download from this page: Minimal Perl book!

PHP ereg vs. preg

I have noticed in the PHP regex library there is a choice between ereg and preg. What is the difference? Is one faster than the other and if so, why isn't the slower one deprecated?
Are there any situations where it is better to use one over the other?
Visiting php.net/ereg displays the following:
Warning
This function has been DEPRECATED as of PHP 5.3.0 and REMOVED as of PHP 6.0.0. Relying on this feature is highly discouraged.
Down the page just a bit further and we read this:
Note: preg_match(), which uses a Perl-compatible regular expression syntax, is often a faster alternative to ereg().
Note my emphasis.
preg is the Perl Compatible Regex library
ereg is the POSIX-compliant regex library
They have slightly different syntax, and preg is in some cases slightly faster. ereg is deprecated (and it is removed in PHP 6), so I wouldn't recommend using it.
There is much discussion about which is faster and better.
If you plan on someday advancing to PHP6 your decision is made. Otherwise:
The general consensus is that PCRE is the better all around solution, but if you have a specific page with a lot of traffic, and you don't need PHP6 it may be worth some testing.
For example, from the PHP manual comments:
Deprecating POSIX regex in PHP for
Perl searching is like substituting
wooden boards and brick for a house
with pre-fabricated rooms and walls.
Sure, you may be able to mix and match
some of the parts but it's a lot
easier to modify with all the pieces
laid out in front of you.
PCRE faster than POSIX RE? Not always.
In a recent search-engine project here
at Cynergi, I had a simple loop with a
few cute ereg_replace() functions that
took 3min to process data. I changed
that 10-line loop into a 100-line
hand-written code for replacement and
the loop now took 10s to process the
same data! This opened my eye to what
can IN SOME CASES be very slow
regular expressions. Lately I decided
to look into Perl-compatible regular
expressions (PCRE). Most pages claim
PCRE are faster than POSIX, but a few
claim otherwise. I decided on
benchmarks of my own. My first few
tests confirmed PCRE to be faster,
but... the results were slightly
different than others were getting, so
I decided to benchmark every case of
RE usage I had on a 8000-line secure
(and fast) Webmail project here at
Cynergi to check it out. The results?
Inconclusive! Sometimes PCRE are
faster (sometimes by a factor greater
than 100x faster!), but some other
times POSIX RE are faster (by a factor
of 2x). I still have to find a rule on
when are one or the other faster. It's
not only about search data size,
amount of data matched, or "RE
compilation time" which would show
when you repeated the function often:
one would always be faster than the
other. But I didn't find a pattern
here. But truth be said, I also didn't
take the time to look into the source
code and analyse the problem. I can
give you some examples, though. The
POSIX RE
([0-9]{4})/([0-9]{2})/([0-9]{2})[^0-9]+
([0-9]{2}):([0-9]{2}):([0-9]{2}) is
30% faster in POSIX than when
converted to PCRE (even if you use \d
and \D and non-greedy matching). On
the other hand, a similarly PCRE
complex pattern /[0-9]{1,2}[
\t]+[a-zA-Z]{3}[ \t]+[0-9]{4}[
\t]+[0-9]{1,2}:[0-9]{1,2}(:[0-9]{1,2})?[
\t]+[+-][0-9]{4}/ is 2.5x faster in
PCRE than in POSIX RE. Simple
replacement patterns like
ereg_replace( "[^a-zA-Z0-9-]+", "", $m
); are 2x faster in POSIX RE than
PCRE. And then we get confused again
because a POSIX RE pattern like
(^|\n|\r)begin-base64[ \t]+[0-7]{3,4}[
\t]+...... is 2x faster as POSIX RE,
but the case-insensitive PCRE
/^Received[ \t]*:[ \t]by[ \t]+([^
\t]+)[ \t]/i is 30x faster than its
POSIX RE version! When it comes to
case sensitivity, PCRE has so far
seemed to be the best option. But I
found some really strange behaviour
from ereg/eregi. On a very simple
POSIX RE (^|\r|\n)mime-version[ \t]:
I found eregi() taking 3.60s (just a
number in a test benchmark), while the
corresponding PCRE took 0.16s! But if
I used ereg() (case-sensitive) the
POSIX RE time went down to 0.08s! So I
investigated further. I tried to make
the POSIX RE case-insensitive itself.
I got as far as this:
(^|\r|\n)[mM][iI][mM][eE]-vers[iI][oO][nN][
\t]*: This version also took 0.08s.
But if I try to apply the same rule to
any of the 'v', 'e', 'r' or 's'
letters that are not changed, the time
is back to the 3.60s mark, and not
gradually, but immediately so! The
test data didn't have any "vers" in
it, other "mime" words in it or any
"ion" that might be confusing the
POSIX parser, so I'm at a loss. Bottom
line: always benchmark your PCRE /
POSIX RE to find the fastest! Tests
were performed with PHP 5.1.2 under
Windows, from the command line. Pedro
Freire cynergi.com
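The "always benchmark" advice can be followed with a few lines of code. This is a minimal sketch in Python using timeit, not a reproduction of the ereg/preg measurements in the quoted comment: it compares two spellings of the same case-insensitive match, modelled on the mime-version example, within one engine. The sample data and names are made up, and the timings will differ by machine and engine.

```python
import re
import timeit

# Two spellings of the same case-insensitive header match, modelled on
# the mime-version example in the quoted comment.
WITH_FLAG = re.compile(r"(^|\r|\n)mime-version[ \t]*:", re.IGNORECASE)
BY_HAND = re.compile(
    r"(^|\r|\n)[mM][iI][mM][eE]-[vV][eE][rR][sS][iI][oO][nN][ \t]*:"
)

# Made-up sample data: some filler headers, then the line to find.
SAMPLE = "X-Filler: value\r\n" * 200 + "MIME-Version: 1.0\r\n"

def bench(compiled, repeat=200):
    """Total seconds for `repeat` searches over SAMPLE."""
    return timeit.timeit(lambda: compiled.search(SAMPLE), number=repeat)

# Check that both variants agree before comparing their speed.
same_result = WITH_FLAG.search(SAMPLE).span() == BY_HAND.search(SAMPLE).span()
print("same result:", same_result)
print("flag   :", bench(WITH_FLAG))
print("by hand:", bench(BY_HAND))
```

Checking that both variants return the same match before timing them matters as much as the timing itself, because a faster pattern that matches something different is not an optimization.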
Even though ereg is deprecated in PHP 5.3, the mb_ereg* functions are not. I believe the main reason is that PHP6 is rebuilding all MB/Unicode support, so the old "regular" ereg functions are useless since mb_ereg will be newer and better.
I know it doesn't answer the question regarding speed, but it does allow you to continue using both POSIX and PCRE.
Well, ereg and its related functions (eregi, ereg_replace, etc.) are deprecated in PHP 5 and being removed in PHP 6, so you're probably best going with the preg family instead.
preg is for Perl-style regular expressions, while ereg is standard POSIX regex.
