I want to know how regex patterns are written. What are they based on, and what do I need to know in order to write them? I have absolutely no idea how to write my own patterns, and I need to find a particular match in some of my code.
Is this some math-based subject? Give me all possible information about patterns please :)
I would say you need to read the PCRE Patterns section of the manual -- and, more specifically, the Pattern Syntax sub-section ;-)
Pretty much everything about PCRE is in there [1].
Considering that PCRE are, after all, Perl-Compatible Regular Expressions, you might also want to read some Perl-related documentation; for example: perlre.
[1] Well, once again, the PHP manual is so great, it wasn't possible to not link to it...
Pascal Martin already pointed to some useful online resources. If you're looking for a book, I can suggest Mastering Regular Expressions.
I am working with XML and using preg_replace_callback().
A question that comes to my mind is: How do we write preg_replace_callback() patterns?
For example, "/{{{\s*"\s*(.+)\s*"\s*}}}/" or "/(<a .*)href=(")([^"]*)"([^>]*)>/U": what do these patterns do? How are these patterns written? What do the characters here mean?
Welcome to Stack Overflow. This seems like a pretty fundamental question to me, so I'd like to answer it with some off-site resources, as you already know about Stack Overflow and can search it on your own (it is a FAQ-like question and answer site):
These are, as you have rightly identified on your own, patterns.
More specifically, these are regular expression patterns (or at least that is what they are called).
Concretely, in PHP with preg_replace() they are in the Perl Compatible Regular Expression (PCRE) dialect.
The preg_ prefix in preg_replace stands for (at least this is how I explain it to myself) Perl REGular expressions or, as the PHP Manual words it, "Regular Expressions (Perl-Compatible)", see https://www.php.net/pcre .
How is this pattern written?
Like outlined in the PHP manual. (no pun intended)
What do the characters here mean?
The PHP manual has a short, introductory version of the PCRE syntax. Start reading there (see PCRE Patterns in the PHP manual). The full syntax is documented with the PCRE library itself, which comes bundled with PHP; see, for example, the PCRE2 man pages on the library's homepage.
If you're using a supported PHP version, this is PCRE2. If not, it can vary. Nevertheless, the basics of those patterns are explained both in the PHP manual and on the PCRE website. That should provide enough text to read.
Remember that you can always write small PHP scripts to try things out in PHP (and therefore with such patterns).
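As an illustration of such a try-it-out script, here is a minimal sketch that runs the first pattern from the question against a made-up input string. The sample text and the uppercase callback are assumptions chosen only for the demo, not anything from the original code.

```php
<?php
// Pattern copied from the question; the input string is invented for this demo.
$pattern = '/{{{\s*"\s*(.+)\s*"\s*}}}/';
$input   = 'Hello {{{ "world" }}}!';

// {{{ and }}}  literal braces (a brace that is not a valid quantifier is taken literally)
// \s*          zero or more whitespace characters
// "            a literal double quote
// (.+)         capture group 1: one or more of any character except newline
preg_match($pattern, $input, $m);
echo $m[0], "\n"; // the whole match
echo $m[1], "\n"; // capture group 1

// preg_replace_callback() hands the same match array to the callback
// and replaces the whole match with the callback's return value.
$result = preg_replace_callback(
    $pattern,
    function (array $match): string {
        return strtoupper(trim($match[1]));
    },
    $input
);
echo $result, "\n";
```

Changing the input or removing one piece of the pattern at a time and re-running is a quick way to see what each part contributes.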
Additionally, there is literature available to study, even when offline and (out of electrical energy) in daylight (very comfortable!).
If you're out of luck finding literature, take advantage of being online with a printer at hand and print out the manual pages.
If you're lucky, your operating system ships with the documentation at pcre(3) or similar.
Good luck and thanks for asking!
I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos: sometimes those writing instructions for their team in the game will misspell a building name, and obviously the regex will then not pick it up (e.g. spelling "University" as "Unversity").
Are there any suggestions on making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and run on a local machine that's able to handle a lot more load, so as a last resort I could algorithmically create versions of each word with a letter missing and others with letters added in.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein Distance, a measure of the difference between two strings as the minimum number of single-character edits (insertions, deletions, substitutions) needed to convert one string into the other.
It's also built into PHP as levenshtein().
So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regexes for each special case.
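A minimal sketch of that approach, assuming a made-up building list, a made-up instruction text, and a threshold of 2 (all three are illustrative choices, not values from the question):

```php
<?php
// Hypothetical building names and threshold, for illustration only.
$buildings   = ['University', 'Barracks', 'Blacksmith'];
$maxDistance = 2;

// Return the building closest to $word, or null if none is within $maxDistance edits.
function closestBuilding(string $word, array $buildings, int $maxDistance): ?string
{
    $best         = null;
    $bestDistance = $maxDistance + 1;
    foreach ($buildings as $name) {
        // Compare case-insensitively by lowercasing both sides.
        $distance = levenshtein(strtolower($word), strtolower($name));
        if ($distance < $bestDistance) {
            $best         = $name;
            $bestDistance = $distance;
        }
    }
    return $best;
}

$text  = 'Build a Unversity next to the Baracks';
// Split on runs of non-word characters, dropping empty pieces.
$words = preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY);

$found = [];
foreach ($words as $word) {
    $match = closestBuilding($word, $buildings, $maxDistance);
    if ($match !== null) {
        $found[$word] = $match;
    }
}
print_r($found);
```

A fixed threshold can produce false positives for very short building names, so scaling it with the name's length may work better. Multi-word building names would also need the text split into phrases rather than single words.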
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex()? – Teifion
Soundex is, like the levenshtein function Triptych mentions, a means of comparing strings: it reduces a word to a short phonetic key, so that words which sound alike get the same key. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D
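To give a feel for how these three built-ins behave, here is a small sketch that compares the misspelling from the question with the correct name. How to combine the results into a decision is left open; the variable names are only illustrative.

```php
<?php
$correct = 'University';
$typo    = 'Unversity';

// soundex() and metaphone() each reduce a word to a phonetic key;
// two words are treated as "sounding alike" when their keys are equal.
$sameSoundex   = soundex($correct) === soundex($typo);
$sameMetaphone = metaphone($correct) === metaphone($typo);

// similar_text() returns the number of matching characters and,
// through its third argument, a similarity percentage.
$common = similar_text($correct, $typo, $percent);

var_dump($sameSoundex, $sameMetaphone, $common, round($percent, 1));
```

Phonetic keys only help when the typo still sounds like the original word, so they are probably best used alongside an edit-distance check rather than instead of one.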
Back in the day we sometimes used Soundex() for these problems.
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep, originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.
I came across some regular expressions that I've never seen before, and I can't find any information on what they do. Here's an example:
/[\p{Z}\p{Cc}\p{Cf}\p{Cs}\p{Pi}\p{Pf}]/u
I'm looking for a full reference for regex.
P.S. I think the example provided only works in certain languages. It works in PHP but not JavaScript.
The complete reference for PHP PCRE (Perl Compatible Regular Expression) is in the PHP docs.
What you're looking at are Unicode character properties, also in the PHP docs, as well as the regular expression modifiers for the u at the end of the regex.
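As a rough sketch of what that pattern does in practice, the snippet below strips every character the class matches from a made-up sample string. The sample input is an assumption for illustration.

```php
<?php
// Pattern from the question. Each \p{..} names a Unicode general category:
//   Z  = separators (spaces, line and paragraph separators)
//   Cc = control characters, Cf = format characters, Cs = surrogates
//   Pi = initial (opening) quote punctuation, Pf = final (closing) quote punctuation
// The trailing u modifier makes PCRE treat pattern and subject as UTF-8.
$pattern = '/[\p{Z}\p{Cc}\p{Cf}\p{Cs}\p{Pi}\p{Pf}]/u';

// Invented sample: « (U+00AB, Pi), » (U+00BB, Pf), a no-break space (U+00A0, Z)
// and a tab (Cc) wrapped around ordinary text.
$input = "\u{00AB}Hello\u{00BB}\u{00A0}world\t!";

$clean = preg_replace($pattern, '', $input);
var_dump($clean);
```

Only the quote marks, the no-break space and the tab are removed here; the letters and the exclamation mark belong to other categories and are left alone.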
Mastering Regular Expressions, 3rd edition, is your best choice.
Can somebody point me to a good regular expression resource (for PHP, if it matters)?
I am looking for a book on Amazon now but don't know which one is better. It would be great to find something that is simple to understand and makes learning fast and interesting.
You can start at http://www.regular-expressions.info/
The site also has a list of regular expressions books.
PHP supports several regular expression variants, but the most important is PCRE (Perl-compatible regular expressions).
The mb_ereg family, besides supporting several encodings, also supports several variants:
j Java (Sun java.util.regex)
u GNU regex
g grep
c Emacs
r Ruby
z Perl
b POSIX Basic regex
d POSIX Extended regex
Mastering Regular Expressions by Jeffrey E.F. Friedl is considered the bible of regular expressions books.
Also, as Artefacto mentioned: http://www.regular-expressions.info/ is a terrific resource with clear and simple explanations.
But the best way to learn is to play with them using a regex tool like Reggy (a Mac tool).
Along with http://www.regular-expressions.info, you might find it useful to get an interactive regex editor.
The Regex Coach is a good, light-weight editor.
RegexBuddy is amazing, but costly.
regexpal.com is a simple online tester.
This is the only place I've found an intro to regular expressions that's suitable for beginners. It's part of an online book called "Practical PHP".
http://www.tuxradar.com/practicalphp/4/8/0
These are the only regex books you'll ever need:
Mastering Regular Expressions
Regular Expressions Cookbook
Both are great tools for learning regexes in general, but they both have lots of information specific to PHP as well.
Is regexp the same between languages?
For example, if I want to use it in JavaScript, would I have to search for regexp for JavaScript specifically? I ask because I have some cheat sheets that just say "regular expression".
I wonder if I could use this in all languages: PHP, JavaScript and so on.
The basics are mostly the same, but there are some discrepancies depending on which engine powers the language. PHP and JavaScript differ because PHP uses PCRE (Perl Compatible Regular Expressions), while JavaScript has its own engine.
PHP also had a POSIX-compatible regex engine (the ereg_* functions), but that was deprecated in PHP 5.3 and removed in PHP 7.0.
If you don't already use it, I suggest you try RegexBuddy. It can convert between several Regex engines.
You can find alternatives for RegexBuddy on Mac here.
You might want to start out by looking here. That's my Bible when I do regexping!
Now, regex should be the same everywhere, at least the fundamentals; however, there are cases where it differs from engine to engine.
Those could be in how you search for a specific pattern. Let's take \w as an example: that's any alphanumeric or underscore character in C#, but what it matches in JavaScript might be different.
When you come to a special case like this, you might want to revisit the link provided above.
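The same kind of difference can even show up inside a single engine. As a hedged sketch in PHP (the sample strings are made up), \w can give different answers for the same text depending on whether the u modifier is used:

```php
<?php
// Invented sample strings.
$ascii    = 'abc_123';
$accented = "\u{00E9}t\u{00E9}"; // the French word for "summer", with two accented letters

// ASCII letters, digits and the underscore are word characters either way.
$asciiMatches = preg_match('/^\w+$/', $ascii);

// Without the u modifier the subject is processed byte by byte, so whether the
// bytes of the accented letters count as word characters depends on the
// character tables in use (derived from the locale).
$withoutU = preg_match('/^\w+$/', $accented);

// With the u modifier the subject is read as UTF-8, and PHP also enables
// Unicode properties for \w, so accented letters are word characters.
$withU = preg_match('/^\w+$/u', $accented);

var_dump($asciiMatches, $withoutU, $withU);
```

So when a cheat sheet says what \w means, it is worth checking which engine and which flags it has in mind.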
Regular expression syntax varies slightly between languages, but for the most part the details are the same. Some regex implementations differ slightly in how they process patterns as well as in what certain special character sequences mean.
Google is your best friend. Google for regex in the language of your choice.
One of the biggest variations in regex is how special characters are escaped / interpreted.
For instance, grep, vim and Perl regexes differ in how they handle things like ( ) for grouping / capturing a pattern for back-referencing in search & replace. IIRC, Perl uses them straight while grep and vim require them to be escaped.
Also, Perl regex may support more features than earlier regex engines. Regexes that would have been simple in Perl were a major PITA in grep.
I'm not completely sure if this is a correct way to sum it up, but there are basically two major classes of regex: POSIX (grep and similar tools) and Perl-compatible (with minor variations).
One tool I've found useful is The Regex Coach - interactive regular expressions.