What do these characters mean? - php

Could you please explain the statement below? I think it's called regex, but I'm really not sure.
~<p>(.*?)</p>~si
What does si and (.*?) stand for?

It finds everything between <p> and </p>, case-insensitively (i), so <P> works too, and possibly spanning multiple lines (s).

Actually, it's called regex, short for Regular Expression, and has a syntax that doesn't look familiar at first, but becomes second-nature quickly enough.
si are flags: s stands for "dotall", which makes the . (which I'll explain in a bit) match every single character, including newlines. The i stands for "case-insensitive", which is self-explanatory.
The (.*?) part says this: "match 0 or more repetitions (*) of any character (.), and make it lazy (?), i.e. match as few characters as possible". The parentheses capture whatever that part matched.
The "matching" happens when you check a string against the regex. For example, you say that <p>something</p> matches the given regex.
You'll find @Mchl's link a great source of information on regex.
Hope this helps.
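To see both flags and the lazy quantifier at work, here is a minimal sketch. The sample HTML string is made up for illustration; only the pattern comes from the question.

```php
<?php
// Made-up input: mixed-case tags, and a paragraph that spans two lines.
$html = "<p>first</p>\n<P>second\nline</P>";

// s: the dot also matches newlines; i: <P> matches as well as <p>.
// The lazy (.*?) stops at the nearest closing tag instead of the last one.
preg_match_all('~<p>(.*?)</p>~si', $html, $matches);

// $matches[1] holds the text captured by the parentheses, one entry per match.
print_r($matches[1]);
```

Without the s flag the second paragraph would not match, because the dot would stop at the newline inside it.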

It's called regex - short for regular expressions, which is a standard for string parsing, manipulation, and validation. Look at the reference section on the site I linked to and you'll be able to work out what that regex does.

It's a lazy regular expression: it will try to match as LITTLE as possible with that pattern (lazy), whereas by default a quantifier tries to match as much as it can (greedy).
Check out this resource for a better, more complete explanation:
http://www.regular-expressions.info/repeat.html#greedy

Related

Regular expression for templating captures too much

With PHP I'm trying to get my regular expression to match both template references below. The problem is, it also grabs the </ul> from the first block of text. When I remove the /s flag, it only catches the second reference. What am I doing wrong?
/{{\%USERS}}(.*)?{{\%\/USERS}}/s
Here is my string.
<ul class="users">
{{%USERS}}
<li>{%}</li>
{{%/USERS}}
</ul>
{{%USERS}} hello?!{{%/USERS}}
Why is my expression catching too much or too little?
You probably need to use non-greedy quantifiers.
* and + are "greedy". They'll match as many characters as they can.
*? and +? are "non-greedy". They'll match only as many characters as required to move on to the next part of the regex.
So in the following test string:
<alpha><bravo>
<.+> will capture <alpha><bravo> (because . matches >< as well!).
<.+?> will capture <alpha>.
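The same comparison as a quick PHP check (a small sketch; the variable names are only for illustration):

```php
<?php
$subject = '<alpha><bravo>';

// Greedy: .+ runs as far as it can, so the match ends at the last >.
preg_match('/<.+>/', $subject, $greedy);

// Lazy: .+? gives up as soon as the following > can match.
preg_match('/<.+?>/', $subject, $lazy);

echo $greedy[0], "\n"; // <alpha><bravo>
echo $lazy[0], "\n";   // <alpha>
```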
Why is my expression catching too much or too little?
It's catching too much because the quantifiers are greedy by default (see Li-aung Yip's answer, +1 for that).
If you remove the modifier s it matches only the second occurrence, because that modifier makes the . also match newline characters, so without it, it's not possible to match the first part, because there are newlines in between.
See the non greedy answer
{{\%USERS}}(.*?){{\%\/USERS}}
here on Regexr, a good place to test regular expressions.
Btw, I removed the ? after the capturing group. It's not needed: * also matches the empty string, so there is no need to make the group optional as well.
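Here is a small sketch of the non-greedy pattern run against the string from the question with preg_match_all. The variable names are just for illustration.

```php
<?php
// The template from the question, kept verbatim in a nowdoc.
$template = <<<'TPL'
<ul class="users">
{{%USERS}}
<li>{%}</li>
{{%/USERS}}
</ul>
{{%USERS}} hello?!{{%/USERS}}
TPL;

// Lazy (.*?) stops at the first closing tag; the s flag lets . cross newlines.
preg_match_all('/{{\%USERS}}(.*?){{\%\/USERS}}/s', $template, $matches);

// One entry per template block, holding only the text between the tags.
print_r($matches[1]);
```

This yields two captures, the <li> block (with its surrounding newlines) and " hello?!", and the </ul> is no longer swallowed.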
Here is your regexp (use it with preg_match_all, since PHP has no g modifier, and escape the / inside the pattern):
/{{%USERS}}([^{]+({%[^{]+)?){{%\/USERS}}/

Regex/PHP check if group of characters appears only once

I am trying to validate an input in PHP with REGEX. I want to check whether the input has the %s character group inside it and that it appears only once. Otherwise, the rule should fail.
Here's what I've tried:
preg_match('|^[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $value); (there are also some other rules besides this; I've tried the (%s){1} part and it doesn't work).
I believe it is a very easy solution to this, but I'm not really into REGEX's...Thank you for your help!
If I understand your question, you need a positive lookahead. The lookahead causes the expression to only match if it finds a single %s.
preg_match('|^(?=[^%s].*?[%s][^%s]*$)[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $value);
I'll explain how each part works
^(?=[^%s].*?[%s][^%s]*$) is a zero-width assertion -- (?=regex) a positive lookahead -- (meaning it must match, but does not "eat" any characters). It means that the whole line can have only 1 %s.
[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$ The remaining part of the regex also looks at the entire string and ensures that the whole string is composed only of the characters in the character class (like your original regex).
I managed to do this with PHP's substr_count() function, following Johnsyweb's suggestion to use an alternate way to perform the validation, and because the suggested regexes seem pretty complicated.
Thank you again!
Alternatively, you can use preg_match_all with your pattern and check the number of matches. If it's 1, then you're ok - something like this:
$result = (preg_match_all('|^[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $value) == 1);
Try this:
'|^(?=(?:(?!%s).)*%s(?:(?!%s).)*$)[0-9_\s:;,.?!()\p{L}%-]+$|u'
The (%s){1} sequence inside the square brackets probably doesn't do what you think it does, but never mind, the solution is more complex. In fact, {1} should never appear anywhere in a regex. It doesn't ensure that there's only one of something, as many people assume. As a matter of fact, it doesn't do anything; it's pure clutter.
EDIT (in answer to the comment): To ensure that only one of a particular sequence is present in a string, you have to actively examine every single character, classifying it as either part-of-%s or not part-of-%s. To that end, (?:(?!%s).)* consumes one character at a time, after the negative lookahead has confirmed that the character is not the start of %s.
When that part of the lookahead expression quits matching, the next thing in the string has to be %s. Then the second (?:(?!%s).)*$ kicks in to confirm that there are no more %s sequences until the end of the string.
And don't forget that the lookahead expression must be anchored at both ends. Because the lookahead is the first thing after the main regex's start anchor you don't need to add another ^. But the lookahead must end with its own $ anchor.
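To make the counting part concrete, here is a minimal sketch that uses only the lookahead, without the character-class validation. The function name is hypothetical, and the s flag is my addition so that the dot also steps over newlines.

```php
<?php
// Hypothetical helper: true when $value contains the literal "%s" exactly once.
function hasExactlyOnePlaceholder(string $value): bool
{
    // (?:(?!%s).)* consumes one character at a time, but never steps onto "%s".
    // After the single required %s, the same construct must reach the end ($).
    $pattern = '/^(?=(?:(?!%s).)*%s(?:(?!%s).)*$)/s';

    return preg_match($pattern, $value) === 1;
}

var_dump(hasExactlyOnePlaceholder('Hello %s!'));   // bool(true)
var_dump(hasExactlyOnePlaceholder('Hello there')); // bool(false)
var_dump(hasExactlyOnePlaceholder('%s and %s'));   // bool(false)
```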
If you're not "into" regular expressions, why not solve this with PHP?
One call to the builtin strpos() will tell you if the string has a match. A second call will tell you if it appears more than once.
This will be easier for you to read and for others to maintain.
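A minimal sketch of the non-regex route, assuming you only need to count the literal %s. The function name is made up for illustration; substr_count() is the one-call shortcut for the same check.

```php
<?php
// Hypothetical helper: true when $needle occurs exactly once in $haystack.
function containsExactlyOnce(string $haystack, string $needle): bool
{
    // First call: is it there at all?
    $first = strpos($haystack, $needle);
    if ($first === false) {
        return false;
    }

    // Second call: search again, starting right after the first hit.
    return strpos($haystack, $needle, $first + strlen($needle)) === false;
}

var_dump(containsExactlyOnce('Welcome, %s!', '%s')); // bool(true)
var_dump(containsExactlyOnce('%s meets %s', '%s'));  // bool(false)

// The same check in one call.
var_dump(substr_count('Welcome, %s!', '%s') === 1);  // bool(true)
```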

PHP Regex Question

I am developing an application using PHP, but I am new to regular expressions and could not find a solution to my problem. I want to replace all occurrences of #word with a link. I have written a preg_replace for this:
$text=preg_replace('~#([\p{L}|\p{N}]+)~u', '#$1', $text);
The problem is, this regular expression also matches the HTML character codes like &#039; and gives corrupt output. I need to exclude the words starting with &# but I do not know how to do that using regular expressions.
Thanks for your help.
'~(?<!&)#([\p{L}|\p{N}]+)~u'
That's a negative lookbehind assertion: http://www.php.net/manual/en/regexp.reference.assertions.php
Matches # only if not preceded by &
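A quick sketch of the lookbehind in use. The sample text and the link markup are assumptions for illustration, since the replacement string in the question is not shown in full. I also dropped the | from the character class, because inside [...] it is a literal pipe, not an "or".

```php
<?php
// Made-up input containing both an HTML entity and a real #tag.
$text = 'Tom&#039;s #php tip';

// (?<!&) rejects any # that directly follows an ampersand, so &#039; is skipped.
// The /tag/ URL in the replacement is a hypothetical placeholder.
$linked = preg_replace(
    '~(?<!&)#([\p{L}\p{N}]+)~u',
    '<a href="/tag/$1">#$1</a>',
    $text
);

echo $linked, "\n"; // Tom&#039;s <a href="/tag/php">#php</a> tip
```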
http://gskinner.com/RegExr/
Use this online regular expression constructor. It has an explanation for every flag you may want to use, and you will see the matches highlighted in the example text.
And yes, use [a-zA-Z].
You would need to add a [A-Za-z] rule in your regular expression statement so that it only limits itself to letters and no numbers.
I will edit with an example later on.

recursive regular expression to process nested strings enclosed by {| and |}

In a project I have a text with patterns like that:
{| text {| text |} text |}
more text
I want to get the first part with brackets. For this I use preg_match recursively. The following code works fine already:
preg_match('/\{((?>[^\{\}]+)|(?R))*\}/x',$text,$matches);
But if I add the symbol "|", I got an empty result and I don't know why:
preg_match('/\{\|((?>[^\{\}]+)|(?R))*\|\}/x',$text,$matches);
I can't use the first solution because in the text something like { text } can also exist. Can somebody tell me what I do wrong here? Thx
Try this:
'/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/'
In your original regex you use the character class [^{}] to match anything except a delimiter. That's fine when the delimiters are only one character, but yours are two characters. To not-match a multi-character sequence you need something like this:
(?:(?!\{\||\|\}).)++
The dot matches any character (including newlines, thanks to the (?s)), but only after the lookahead has determined that it's not part of a {| or |} sequence. I also dropped your atomic group ((?>...)) and replaced it with a possessive quantifier (++) to reduce clutter. But you should definitely use one or the other in that part of the regex to prevent catastrophic backtracking.
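Here is that pattern run against the sample text from the question, as a small sketch. The second subject string is my own addition, to show that a plain { text } inside the block does not get in the way.

```php
<?php
$pattern = '/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/';

// Sample from the question: one nested block followed by more text.
$text = "{| text {| text |} text |}\nmore text";
preg_match($pattern, $text, $matches);
echo $matches[0], "\n"; // {| text {| text |} text |}

// Single braces are ordinary characters here, because the lookahead
// only rejects the two-character sequences {| and |}.
$withBraces = '{| a { b } c |} tail';
preg_match($pattern, $withBraces, $braceMatches);
echo $braceMatches[0], "\n"; // {| a { b } c |}
```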
You've got a few suggestions for working regular expressions, but if you're wondering why your original regexp failed, read on. The problem lies when it comes time to match a closing "|}" tag. The (?>[^{}]+) (or [^{}]++) sub expression will match the "|", causing the |} sub expression to fail. With no backtracking in the sub expression, there's no way to recover from the failed match.
See PHP - help with my REGEX-based recursive function
To adapt it to your use
preg_match_all('/\{\|(?:^(\{\||\|\})|(?R))*\|\}/', $text, $matches);

Can you rely on the order that regular expression syntax is interpreted?

(The background for this question is that I thought it would be fun to write something that parses wiki creole markup. Anyway the problem that I think I have a solution to is differentiating between // in a url and as opening/closing syntax for italic text)
My question is slightly compound so I've tried to break it up under the headings
If there is a substring (S1) that can contain any one of a series of substrings separated by |, does the regular expression interpreter simply match the first substring within 'S1' and then move on to the regular expression after 'S1'? Or will it in some instances try to find the best/greediest match?
Here is an example to try and make my question more clear:
String to search within: String
Regex: /(?:(Str|Strin).*)/ (the 'S1' in my question refers to the non-capturing substring)
I think that the matches from the above should be:
$0 will be String
$1 will be Str and not Strin
Will this always happen, or are there instances (e.g. maybe 'S1' being matched greedily using *) where another matching substring will be used, i.e. Strin in my example?
If the above is correct, then can I/should I rely on this behaviour?
Real world example
/^\/\/(\b((https?|ftp):\/\/|mailto:)([^\s~]*?(?:~(.|$))?)+?(?=\/\/|\s|$)|~(.|$)|[^/]|\/([^/]|$))*\/\//
Should correctly match:
//Some text including a http//:url//
With $1 == Some text including a http//:url
Note: I've tried to make this relatively language agnostic but I will be using php
PHP uses the PCRE regex engine. By default, and the way PHP uses it, the PCRE engine runs in leftmost-first mode (Perl-style backtracking). This mode returns the first match it finds, trying the alternatives from left to right. So yes, you can rely on the order in which PHP interprets a regex.
The other mode, provided by the pcre_dfa_exec() function, evaluates all possible matches and returns the longest possible match.
In PHP, using preg extension, you can choose between greedy and non greedy operators (usually appending '?' to them).
By the way, in the example you gave, if you want Strin to match, you must invert your cases: /(?:(Strin|Str).*)/. I think you should put the most generic expression at the end of the regex.
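You can check the ordering yourself with a short sketch like this one (the variable names are only for illustration):

```php
<?php
// Alternatives are tried left to right; the first one that lets the
// overall match succeed wins, even if a longer alternative also fits.
preg_match('/(?:(Str|Strin).*)/', 'String', $shortFirst);
preg_match('/(?:(Strin|Str).*)/', 'String', $longFirst);

echo $shortFirst[0], ' / ', $shortFirst[1], "\n"; // String / Str
echo $longFirst[0], ' / ', $longFirst[1], "\n";   // String / Strin
```

In both cases $0 is the whole word, because the trailing .* picks up whatever the chosen alternative left behind.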
FYI, with preg engine,
alternation operator is neither greedy nor lazy but ordered
Mastering regular expressions, J. Friedl, p175
If you want a greedy engine, you must use a POSIX-compliant engine (ereg, but it is deprecated and was removed in PHP 7).
