Explain Regular Expression - php

1. (.*?)
2. (*)
3. #regex#
4. /regex/
A. What do the above symbols mean?
B. What is the different between # and /?
I have the cheat-sheet, but didn't full get it yet. What i know * gets
all characters, so what .*? is for!
The above patterns are used in PHP preg_match and preg_replace.

. matches any character (roughly).
*? is a so-called quantifier, matching the previous token at least zero times (and only as often as needed to complete a match – it's lazy, hence the ?).
(...) create a capturing group you can refer to in either the regex or the match. They also are used for limiting the reach of the | alternation to only parts of the regex (just like parentheses in math make precedence clear).
/.../ and #...# are delimiters for the entire regex, in PHP at least. Technically they're not part of the regex syntax. What delimiter you use is up to you (but I think you can't use \), and mostly changes what characters you need to escape in the regex. So / is a bad choice when you're matching URIs that might contain a lot of slashes. Compare the following two varaints for finding end-of-line comments in C++ style:
preg_match('/\/\/.*$/', $text);
preg_match('#//.*$#', $text);
The latter is easier to read as you don't have to escape slashes within the regex itself. # or # are commonly used as delimiter because they stands out and aren't that frequent in text, but you can use whatever you like.
Technically you don't need this delimiter at all. This is probably mostly a remnant of PHP's Perl heritage (in Perl regexes are delimited, but are not contained in a string). Other languages that use strings (because they have no native regex literals), such as Java, C# or PowerShell do well without the delimiter. In PHP you can add options after the closing delimiter, such as /a/i which matches a or A (case-insensitively), but the regex (?i)a does exactly the same and doesn't need delimiters.
And next time you take the time to read through Regular-Expressions.info, it's an awesome reference on regex basics and advcanced topics, explaining many things very well and thoroughly. And please also take a look at the PHP documentation in this regard.

Well, please stick to one actual question per ... question.
This is an answer to question 3+4, as the other questions have allready been answered.
Regexpes are generally delimited by /, e.g. /abc123/ or /foo|bar/i. In php, you can use whatever character for this you want. You are not limited to /, i.e. you can use e.g. # or %, #/usr/local/bin#.

Related

Regex PHP. Reduce steps: limited by fixed width Lookbehind

I have a regex that will be used to match #users tags.
I use lokarround assertions, letting punctuation and white space characters surround the tags.
There is an added complication, there are a type of bbcodes that represent html.
I have two types of bbcodes, inline (^B bold ^b) and blocks (^C center ^c).
The inline ones have to be passed thru to reach for the previous or next character.
And the blocks are allowed to surround a tag, just like punctuation.
I made a regex that does work. What I want to do now is to lower the number of steps that it does in every character that’s not going to be a match.
At first I thought I could do a regex that would just look for #, and when found, it would start looking at the lookarrounds, that worked without the inline bbcodes, but since lookbehind cannot be quantifiable, it’s more difficult since I cannot add ((\^[BIUbiu])++)* inside, producing much more steps.
How could I do my regex more efficient with fewer steps?
Here is a simplified version of it, in the Regex101 link there is the full regex.
(?<=[,\.:=\^ ]|\^[CJLcjl])((\^[BIUbiu])++)*#([A-Za-z0-9\-_]{2,25})((\^[BIUbiu])++)*(?=[,\.:=\^ ]|\^[CJLcjl])
https://regex101.com/r/lTPUOf/4/
A rule of thumb:
Do not let engine make an attempt on matching each single one character if
there are some boundaries.
The quote originally comes from this answer. Following regular expression reduces steps in a significant manner because of the left side of the outermost alternation, from ~20000 to ~900:
(?:[^#^]++|[#^]{2,}+)(*SKIP)(*F)
|
(?<=([HUGE-CHARACTER-CLASS])|\^[cjleqrd])
(\^[34biu78])*+#([a-z\d][\w-.]{0,25}[a-z\d])(\^[34biu78])*+(?=(?1))
Actually I don't care much about the number of steps being reported by regex101 because that wouldn't be true within your own environment and it is not obvious if some steps are real or not or what steps are missed. But in this case since the logic of regex is clear and the difference is a lot it makes sense.
What is the logic?
We first try to match what probably is not desired at all, throw it away and look for parts that may match our pattern. [^#^]++ matches up to a # or ^ symbols (desired characters) and [#^]{2,}+ prevents engine to take extra steps before finding out it's going nowhere. So we make it to fail as soon as possible.
You can use i flag instead of defining uppercase forms of letters (this may have a little impact however).
See live demo here

PHP Regex detect repeated character in a word

(preg_match('/(.)\1{3}/', $repeater))
I am trying to create a regular expression which will detect a word that repeats a character 3 or more times throughout the word. I have tried this numerous ways and I can't seem to get the correct output.
If you don't need letters to be contiguous, you can do it with this pattern:
\b\w*?(\w)\w*?\1\w*?\1\w*
otherwise this one should suffice:
\b\w*?(\w)\1{2}\w*
Try this regex instead
(preg_match('/(.)\1{2,}/', $repeater))
This should match 3 or more times, see example here http://regexr.com/3fk80
Strictly speaking, regular expressions that include \1, \2, ... things are not mathematical regular expressions and the scanner that parses them is not efficient in the sense that it has to modify itself to include the accepted group, in order to be used to match the discovered string, and in case of failure it has to backtrack for the length of the matched group.
The canonical way to express a true regular expression that accepts word characters repeated three or more times is
(A{3,}|B{3,}|C{3,}|...|Z{3,}|a{3,}|b{3,}|...|z{3,})
and there's no associativity of the operator {3,} to be able to group it as you shown in your question.
For the pedantic, the pure regular expression should be:
(AAAA*|BBBB*|CCCC*|...|ZZZZ*|aaaa*|bbbb*|cccc*|...|zzzz*)
again, this time, you can use the fact that AAAA* is matched as soon as three As are found, so it would be valid also the regex:
AAA|BBB|CCC|...|ZZZ|aaa|bbb|ccc|...|zzz
but the first version allow you to capture the \1 group that delimits the actual matching sequence.
This approach will be longer to write but is by far much more efficient when parsing the data string, as it has no backtrack at all and visits each character only once.

Convert regex from gskinner to PHP

I know that I'd likely hear "Don't parse HTML with regex", so let me say that this question is just academic at this point because I actually solved my problem using the DOM, but on my road to a solution, I ran across this pattern that works on the gskinner website, but I can't figure out how to make it work in PHP preg_match().
(?<=href\=")[^]+?(?=")
I think that the [^] is causing the problem, but I'm not certain what to do about it.
What it is intended to do is pull the substring from between the quotes of an href. (One would expect it to be a web-address or at least part of one.)
[^] is a difficult construct. Basically it is an empty negated character class. But what should it match? That depends on the implementation. Some languages are interpreting it as negation of nothing, so it will match every character, that is what gskinner (means ActionScript 3) seems to be doing.
I would never use this, because it is ambiguous.
The most readable way is to use ., the meta character that matches every character (without newlines), if newlines are also wanted, just add the modifier s that enables the dotall mode, this would be exactly what you wanted to achieve with [^].
A workaround that is sometimes used is to use a character class something like this [\s\S] or [\w\W]. Those will also match every character (including newlines), because they are matching some predefined character class and their negation.

Need to switch from ereg() to preg_match() [duplicate]

This question already has answers here:
How can I convert ereg expressions to preg in PHP?
(4 answers)
Closed 9 years ago.
I need to know what this line of code does, tried to figure it out because i have to build it with preg_match() but I didn't understand it completely:
ereg("([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})", $date)
I know it checks a date, but i don't know in which way.
thanks for some help
Let's break this down:
([0-9]{1,2})
This looks for numbers zero through nine (- indicates a range when used in brackets []) and there can be 1 or two of them.
.
This looks for any single character
([0-9]{1,2})
This looks for numbers zero through nine and there can be 1 or two of them (again)
.
This looks for any single character (again)
([0-9]{4})
This looks for numbers zero through nine and there must be four of them in a row
So it is looking for a date in any of the following formats:
04 18 1973
04-18-1973
04/18/1973
04.18.1973
More will fit that pattern so it isn't a very good regex for what it is supposed to validate against. There are lots of sample regex patterns for matting dates in this format so if you google it you'll have a PCRE in no time.
It's a relatively simple regular expression (regex). If you're going to be working with regex, then I suggest taking a bit of time to learn the syntax. A good starting place to learn is http://regular-expressions.info.
"Regular expressions" or "regex" is a pattern matching language used for searching through strings. There are a number of dialects, which are mostly fairly similar but have some differences. PHP started out with the ereg() family of functions using one particular dialect and then switched to the preg_xx() functions to use a slightly different regex dialect.
There are some differences in syntax between the two, which it is helpful to learn, but they're fairly minor. And in fact the good news for you is that the pattern here is pretty much identical between the two.
Beyond the patterns themselves, the only other major difference you need to know about is that patterns in preg_match() must have a pair of delimiting characters at either end of the pattern string. The most commonly used characters for this are slashes (/).
So in this case, all you need to do is swap ereg for preg_match, and add the slashes to either end of the pattern:
$result = preg_match("/([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})/", $date);
^ ^
slash here and here
It would still help to get an understanding of what the pattern is doing, but for a quick win, that's probably all you need to do in this case. Other cases may be more complex, but most will be as simple as that.
Go read the regular-expressions.info site I linked earlier though; it will help you.
One thing I would add, however, is that the pattern given here is actually quite poorly written. It is intending to match a date string, but will match a lot of things that it probably didn't intend to.
You could fix it up by finding a better regex expression for matching dates, but it is quite possible that the code could be written without needing regex at all -- PHP has some perfectly good date handling functionality built into it. You'd need to consider the code around it and understand what it's doing, but it's perfectly possible that the whole thing could be replaced with something like this:
$dateObject = DateTime::CreateFromFormat($date, 'd.M.Y');
It looks like it would be pretty much agnostic in its matching.
You could interpret it either as mm.dd.yyyy or dd.mm.yyyy. I would consider modifying it if you were in fact trying to match/verify a date as 00.00.0000 would be a match but is an invalid data, outside of possible historic context.
Edit: I forget '.' in this case would match any character without escaping.
this do the same, i have only replace [0-9] by \d, and the dot (that match all) by \D (a non digit, but can replace it by \. or [.- ])
preg_match("~\d{2}\D\d{2}\D\d{4}~", $date)

Regular expression .*? vs .*

I came across a php article about regular expressions which used (.*?) in its syntax. As far I can see it behaves just like (.*)
Is there any advantage of using (.*?) ? I can't really see why someone would use that.
in most flavours of regex, the *? production is a non-greedy repeat. This means that the .*? production matches first the empty string, and then if that fails, one character, and so on until the match succeeds. In contrast, the greedy production .* first attempts to match the entire input, and then if that fails, tries one character less.
This concept only applies to regular expression engines that use recursive backtracking to match ambiguous expressions. In theory, they match exactly the same sentances, but since they try different things first, it's likely that one will be much quicker than the other.
This can also be useful when capture groups (in recursive and NFA style engines equally) are used to extract information from the matching action. For instance, an expression like
"(.*?)"
can be used to capture a quoted string. Since the subgroup is non-greedy, you can be sure that no quotes will be captured, and the subgroup contains only the desired content.
.* is greedy, .*? is not. It only makes sense in context though. Given the pattern:
<br/>(.*?)<br/> and <br/>(.*)<br/>, and the input <br/>test<br/>test2<br/>,
.* will match <br/>test<br/>test2<br/>,
.*? will only match <br/>test<br/>.
Note: don't ever use regex to parse complex html.

Categories