What's the difference between these perl compatible regular expressions? - php

An answer from another question piqued my curiosity.
Consider:
$string = "asfasdfasdfasdfasdf[[sometextomatch]]asfkjasdfjaskldfj";
$regex = "/\[\[(.+?)\]\]/";
preg_match($regex, $string, $matches);
$regex = "/\[\[(.*)\]\]/";
preg_match($regex, $string, $matches);
I asked what the difference between the two regexes is. The answer I got was that ".*" matches any character 0 or more times, as many times as possible,
and ".+?" matches any character 1 or more times, as few times as possible.
I read those regexes differently, so I did some experimenting on my own but didn't come to any conclusions. PHP.net says "?" is equivalent to {0,1}, so you could rewrite
"/\[\[(.+?)\]\]/"
as
"/\[\[((.+){0,1})\]\]/"
or as
"/\[\[(.{0,})\]\]/"
or as
"/\[\[(.*)\]\]/"
Will they capture different text? Is the difference that one is less expensive? Am I being anal?

Stand-alone, ? does mean {0,1}, however, when it follows something like *, +, ?, or {3,6} (for example), ? means something else entirely, which is that it does minimal matching. So, no, you can't rewrite /\[\[(.+?)\]\]/ as /\[\[((.+){0,1})\]\]/. :-)
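To see that the rewrite changes behavior, here is a minimal sketch in PHP. The subject string is an illustrative assumption, not taken from the question; any string with two bracketed sections would show the same effect.

```php
<?php
// Illustrative subject with two [[...]] sections (an assumption for this demo).
$subject = 'foo [[bar]] baz [[quux]]';

// ? directly after + makes the quantifier lazy: stop at the first ]] that works.
preg_match('/\[\[(.+?)\]\]/', $subject, $lazy);

// {0,1} applied to the group (.+) leaves .+ greedy: run to the last ]].
preg_match('/\[\[((.+){0,1})\]\]/', $subject, $rewritten);

echo $lazy[1], "\n";      // bar
echo $rewritten[1], "\n"; // bar]] baz [[quux
```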

Just take an example where you get different results:
foo [[bar]] baz [[quux]]
Your first regular expression will match [[bar]] and [[quux]] separately, while the second will produce a single match, [[bar]] baz [[quux]].
The reason is that a lazy quantifier (suffixed with ?) matches the minimum possible number of repetitions, while the normal greedy mode matches the maximum:
However, if a quantifier is followed by a question mark, then it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern /\*.*?\*/ does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in \d??\d which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches.
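The behavior described above can be checked with a short PHP sketch. It runs both patterns from the question over the sample string with preg_match_all, then tries the doubled question mark from the quoted passage; the two-digit test string '12' is an assumption chosen for illustration.

```php
<?php
$subject = 'foo [[bar]] baz [[quux]]';

// Lazy quantifier: each [[...]] pair is matched on its own.
preg_match_all('/\[\[(.+?)\]\]/', $subject, $lazy);

// Greedy quantifier: one match from the first [[ to the last ]].
preg_match_all('/\[\[(.*)\]\]/', $subject, $greedy);

print_r($lazy[0]);   // two matches: [[bar]] and [[quux]]
print_r($greedy[0]); // one match: [[bar]] baz [[quux]]

// \d??\d prefers a single digit...
preg_match('/\d??\d/', '12', $oneDigit);
// ...but takes two when the rest of the pattern (here the $ anchor) requires it.
preg_match('/^\d??\d$/', '12', $twoDigits);

echo $oneDigit[0], "\n";  // 1
echo $twoDigits[0], "\n"; // 12
```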

Normally, ? means "match the preceding thing 0 or 1 times". However, when used after a * or +, a ? modifies the meaning of the * or +. Normally, * and + mean "match 0 (1 for +) or more times, and match as many as possible". Adding the ? changes that meaning to "match 0 (1 for +) or more times, but match as few as possible". By default those quantifiers are "greedy"; ? makes them non-greedy.

On its own, ? matches the preceding item at most once ({0,1} means 0 to 1 times), whereas * matches it as many times as it occurs in the string. Placed after + or *, ? instead makes that quantifier lazy.
From this page:
If you take <.+> and use it on The <em>Big</em> Dog. it will give <em>Big</em>, whereas <.+?> will only match <em>.
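The example above translates directly to PHP. This is a small sketch using preg_match; the colou?r pattern is an added illustration of the stand-alone ?, not something from the quoted page.

```php
<?php
$html = 'The <em>Big</em> Dog.';

// Greedy .+ runs to the last > in the string.
preg_match('/<.+>/', $html, $greedy);

// Lazy .+? stops at the first > it reaches.
preg_match('/<.+?>/', $html, $lazy);

echo $greedy[0], "\n"; // <em>Big</em>
echo $lazy[0], "\n";   // <em>

// Stand-alone ? (illustrative pattern): the u is optional, so both spellings match.
$matchesBoth = preg_match('/^colou?r$/', 'color') === 1
    && preg_match('/^colou?r$/', 'colour') === 1;
```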

/.*/ === /.{0,}/
/.+/ === /.{1,}/
/.?/ === /.{0,1}/
"aaaaaa" =~ /a*/; # "aaaaaa"
"aaaaaa" =~ /a*?/; # ""

Related

Detect phone number with preg_replace with some specifics

It's a basic preg_replace that detects phone numbers (and just long numbers). My problem is I want to avoid detecting numbers between double "", single '' and forward slashes //
$text = preg_replace("/(\+?[\d-\(\)\s]{8,25}[0-9]?\d)/", "<strong>$1</strong>", $text);
I poked around but nothing is working for me. Your help will be appreciated.
I predict that your pattern is going to let you down more than it is going to satisfy you (or you are very comfortable with "over-matching" within the scope of your project).
While my suggestion really blows out the pattern length, a (*SKIP)(*FAIL) technique will serve you well enough by consuming and discarding the substrings that require disqualification. There may be a way of dictating the pattern logic with lookaround instead, but with an initial pattern with so many potential holes in it and no sample data, there are just too many variables to make a confident suggestion.
Regex101 Demo
Code: (Demo)
$text = <<<TEXT
A number 555555555 then some more text and a quoted number "(123)4567890" and
then 1 2 3 4 6 (54) 3 -2 and forward slashed /+--------0/ versus
+--------0 then something more realistic '234 588 9191' no more text.
This is not closed by the same character on both
ends: "+012345678901/ which of course is a _necessary_ check?
TEXT;
echo preg_replace(
'~([\'"/])\+?[\d()\s-]{8,25}\d{1,2}\1(*SKIP)(*FAIL)|((?!\s)\+?[\d()\s-]{8,25}\d{1,2})~',
"<strong>$2</strong>",
$text);
Output:
A number <strong>555555555</strong> then some more text and a quoted number "(123)4567890" and
then <strong>1 2 3 4 6 (54) 3 -2</strong> and forward slashed /+--------0/ versus
<strong>+--------0</strong> then something more realistic '234 588 9191' no more text.
This is not closed by the same character on both
ends: "<strong>+012345678901</strong>/ which of course is a _necessary_ check?
For the technical breakdown, see the Regex101 link.
Otherwise, this is effectively checking for "phone numbers" (by your initial pattern) and if they are wrapped by ', ", or / then the match is ignored and the regex engine continues looking for matches AFTER that substring. I have added (?!\s) at the start of the second usage of your phone pattern so that leading spaces are omitted from the replacement.
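As a reduced illustration of the same (*SKIP)(*FAIL) idea, here is a sketch with a deliberately simplified pattern. The ten-digit rule and the sample text are assumptions made for the demo, not the pattern from the answer; only double quotes are handled.

```php
<?php
// Assumed sample text: one bare number and one double-quoted number.
$text = 'call 5551234567 or "5559876543" today';

// First alternative: consume a whole "..." substring, then force a failure.
// (*SKIP) tells the engine to resume searching after the consumed substring.
// Second alternative: the simplified "phone number" (exactly ten digits here).
$out = preg_replace(
    '~"[^"]*"(*SKIP)(*FAIL)|\d{10}~',
    '<strong>$0</strong>',
    $text
);

echo $out, "\n";
// call <strong>5551234567</strong> or "5559876543" today
```

The quoted number is left alone because the engine never gets to try the second alternative inside the quotes.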
If you're not validating, you might be trying to write an expression with fewer boundaries, such as:
^\+?[0-9()\s-]{8,25}[0-9]$
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also see in this link how it would match against some sample inputs.

Combine two regular expressions for php

I have these two regular expression
^(((98)|(\+98)|(0098)|0)(9){1}[0-9]{9})+$
^(9){1}[0-9]{9}+$
How can I combine these phrases together?
valid phone :
just start with : 0098 , +98 , 98 , 09 and 9
sample :
00989151855454
+989151855454
989151855454
09151855454
9151855454
You haven't provided what passes and what doesn't, but I think this will work if I understand correctly...
/^\+?0{0,2}98?/
Live demo
^ Matches the start of the string
\+? Matches 0 or 1 plus symbols (the backslash is to escape)
0{0,2} Matches between 0 and 2 (0, 1, and 2) of the 0 character
9 Matches a literal 9
8? Matches 0 or 1 of the literal 8 characters
Looking at your second regex, it looks like you want the first part, ((98)|(\+98)|(0098)|0), of your first regex to be optional. Put ? after it and it will also accept the numbers allowed by the second regex. Change this,
^(((98)|(\+98)|(0098)|0)(9){1}[0-9]{9})+$
to,
^(?:98|\+98|0098|0)?9[0-9]{9}$
The ? after the non-capturing group makes the whole group of alternations optional.
I've made a few more corrections to the regex. {1} is redundant, since matching once is the default with or without it, and there is no need to group parts of the regex unless you need the groups. I've also removed the outermost parentheses and the + after them, as they are not needed.
Demo
This regex
^(?:98|\+98|0098|0)?9[0-9]{9}$
matches
00989151855454
+989151855454
989151855454
09151855454
9151855454
Demo: https://regex101.com/r/VFc4pK/1/
Note, however, that this requires a 9 as the first digit after the country code or the 0.
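A quick way to confirm the combined pattern is to run it over the sample numbers from the question. The sketch below does that in PHP; the function name isValidPhone and the two rejected inputs are hypothetical, added only for illustration.

```php
<?php
// Hypothetical helper wrapping the combined pattern from the answers above.
function isValidPhone(string $number): bool
{
    return preg_match('/^(?:98|\+98|0098|0)?9[0-9]{9}$/', $number) === 1;
}

// The five samples from the question: all should be accepted.
$valid = ['00989151855454', '+989151855454', '989151855454', '09151855454', '9151855454'];

// Assumed counter-examples: wrong leading digit, and one digit too short.
$invalid = ['8151855454', '915185545'];

foreach (array_merge($valid, $invalid) as $number) {
    echo $number, ' => ', isValidPhone($number) ? 'valid' : 'invalid', "\n";
}
```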

PHP Regexp capturing repeating group of chars, e.g. hahaha jajajaja hihihi

As title, is there a way in PHP, with preg_match_all to catch all the repetitions of chars group?
For instance, catch
hahahaha
jajajaj
hihihi
It's fine to catch repetition of any char, like abababab, acacacacac.
Also, is there a way to count the number of repetition?
The idea is to catch all this "forms" of smiling on social media.
I figured out that there are also other cases, such as misspelled instances like ahahhahaah (where you have two consecutive a or h). Any ideas?
How about this:
preg_match_all('/((?i)[a-z])((?i)[a-z])(\1\2)+/', $str, $m);
$matches = $m[0]; //$matches will contain an array of matches
It's a bit complicated, but it does work. To explain, the first subpattern, ((?i)[a-z]), matches any letter from a to z, regardless of case. The second subpattern, ((?i)[a-z]), does the same thing. The third subpattern, (\1\2)+, matches one or more repetitions of those two letters, in the same case as they first appeared. This regular expression also assumes the match consists of whole two-letter pairs (an even number of characters). If you don't want that, you can add \1? at the end, meaning that, as long as there is at least one repetition, the match can end with the first character (for instance, hahah and ikikikik would both be valid, but not asa).
To retrieve the number of repetitions for a specific match, you can do:
$numb = strlen($matches[$index])/2 - 1; //-1 because the first two letters aren't repetitions
For the shortest repetition (e.g. ha gets repeated multiple times in hahahaha):
(.+?)\1+
See demo.
For the longest repetition (e.g. haha gets repeated in hahahaha):
(.+)\1+
Counting Repetitions
The non-regex solution is to compare the lengths of Group 1 (the repeated token) and the overall match.
With pure regex, in .NET, you could simply do (.+?)(\1)+ and look at the number of captures in the Group 1 CaptureCollection object.
In PHP, that's not possible, but there are some hacks. See, for instance, this question about matching a line number—it's the same technique. This is for "study purposes" only—you wouldn't want to use that in real life.
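The length comparison can be sketched in PHP as follows. The function name countRepetitions and the sample sentence are assumptions for illustration; the pattern is the shortest-repetition one shown above, and strlen counts bytes, so this sketch assumes single-byte (ASCII) input.

```php
<?php
// Hypothetical helper: find repeated groups and count how often each unit occurs.
function countRepetitions(string $text): array
{
    $found = [];
    // (.+?)\1+ : shortest unit, followed by one or more copies of itself.
    preg_match_all('/(.+?)\1+/', $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $found[] = [
            'match' => $m[0],
            'unit'  => $m[1],
            // Whole match length divided by unit length = number of occurrences.
            'count' => intdiv(strlen($m[0]), strlen($m[1])),
        ];
    }
    return $found;
}

$result = countRepetitions('hahahaha and jajaja');
print_r($result);
// hahahaha: unit "ha", count 4; jajaja: unit "ja", count 3
```

Note that this pattern also matches any doubled character (such as the "oo" in "good"), so real input would likely need an extra filter on the count or the unit length.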

Regex - matching all between second set of brackets ([])

I have the following string, and I need to match only the last seven digits between the [] brackets. The string looks like this
[15211Z: 2012-09-12] ([5202900])
I only need to match 5202900 in the string contained between ([]), a similar number could appear anywhere in the string so something like this won't work (\d{7})
I also tried the following regex
([[0-9]{1,7}])
but this includes the [] in the string?
If you just want the 7 digits, not the brackets, but want to make sure that the digits are surrounded with brackets:
(?<=\[)\d{7}(?=\])
FYI: This is called a positive lookahead and positive lookbehind.
Good source on the topic: http://www.regular-expressions.info/lookaround.html
Try matching \(\[(\d{7})\]\), so you match this whole regular expression, then you take group 1, the one between unescaped parentheses. You can replace {7} with a '*' for zero or more, + for 1 or more or a precise range like you already showed in your question.
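Both suggestions so far can be tried side by side in PHP. This is a minimal sketch using the sample string from the question; the variable names are arbitrary.

```php
<?php
$line = '[15211Z: 2012-09-12] ([5202900])';

// Lookaround: the brackets are checked but not included in the match itself.
preg_match('/(?<=\[)\d{7}(?=\])/', $line, $look);

// Capture group: the whole match includes ([ and ]), group 1 holds the digits.
preg_match('/\(\[(\d{7})\]\)/', $line, $cap);

echo $look[0], "\n"; // 5202900
echo $cap[0], "\n";  // ([5202900])
echo $cap[1], "\n";  // 5202900
```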
You can try to use
\[(\d{1,7})\]
If the surrounding text looks like yours (not only digits), then this should work to extract a group of digits surrounded by brackets, like ([123]):
\(\[(\d+)\]\)
From your details, lookbehind and lookahead seem to be a good choice. You can also use this one:
(\d{7})\]\)$
Since the seven-digit pattern is expected at the end of the line, the engine needs to do less work to find the match.
Hope it helps!
Here is a benchmark (in Perl, but I think it is much the same in PHP) that compares the lookaround approach and the capture group:
use Benchmark qw(:all);
my $str = q/[15211Z: 2012-09-12] ([5202900])/;
my $count = -3;
cmpthese($count, {
    'lookaround' => sub {
        $str =~ /(?<=\[)\d{7}(?=\])/;
    },
    'capture group' => sub {
        $str =~ /\[(\d{7})\]/;
    },
});
result:
                  Rate lookaround capture group
lookaround    274914/s         --          -70%
capture group 931043/s       239%            --
As we can see, capture is more than 3 times faster than lookaround.

Regular expression to match an exact number of occurrence for a certain character

I'm trying to check if a string has a certain number of occurrences of a character.
Example:
$string = '123~456~789~000';
I want to verify if this string has exactly 3 instances of the character ~.
Is that possible using regular expressions?
Yes
/^[^~]*~[^~]*~[^~]*~[^~]*$/
Explanation:
^ ... $ means the whole string in many regex dialects
[^~]* a string of zero or more non-tilde characters
~ a tilde character
The string can have as many non-tilde characters as necessary, appearing anywhere in the string, but must have exactly three tildes, no more and no less.
As a single character is technically a substring, and the task is to count its occurrences, I suppose the most efficient approach is to use a dedicated PHP function, substr_count:
$string = '123~456~789~000';
if (substr_count($string, '~') === 3) {
// string is valid
}
Obviously, this approach won't work if you need to count pattern matches (for example, while you can count the number of '0' characters in your string with substr_count, you are better off using preg_match_all to count digits).
Yet for this specific question it should be faster overall, as substr_count is optimized for one specific goal, counting substrings, whereas preg_match_all is more general-purpose.
I believe this should work for a variable number of characters:
^(?:[^~]*~[^~]*){3}$
The advantage here is that you just replace 3 with however many you want to check.
To make it more efficient, it can be written as
^[^~]*(?:~[^~]*){3}$
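To make the "just replace 3" point concrete, here is a sketch that builds the more efficient pattern for any count. The function name hasExactlyTildes and the extra test strings are assumptions for illustration.

```php
<?php
// Hypothetical helper: true when $s contains exactly $n tilde characters.
function hasExactlyTildes(string $s, int $n): bool
{
    // Leading run of non-tildes, then exactly $n groups of "~ plus non-tildes".
    $pattern = '/^[^~]*(?:~[^~]*){' . $n . '}$/';
    return preg_match($pattern, $s) === 1;
}

var_dump(hasExactlyTildes('123~456~789~000', 3)); // bool(true)
var_dump(hasExactlyTildes('123~456', 3));         // bool(false): too few
var_dump(hasExactlyTildes('~~~~', 3));            // bool(false): too many
```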
This is what you are looking for:
EDIT based on comment below:
<?php
$string = '123~456~789~000';
$total = preg_match_all('/~/', $string);
echo $total; // Shows 3
