Php, Regular expression - php

I have this pattern (I am using PHP):
'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?)\]/i'
When I search for this string: http://phpquest.zapto.org/users/register.php
The matches are (in order, 0-5):
'[link=http://phpquest.zapto.org/users/register.php]'
'http://phpquest.zapto.org/users/register.php'
'http://'
'phpquest.zapto'
org
''
When I replace the * with + inside the last subpattern, like this:
'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]+)*\/?)\]/i'
The matches are (in order, 0-5):
'[link=http://phpquest.zapto.org/users/register.php]'
'http://phpquest.zapto.org/users/register.php'
'http://'
'phpquest.zapto'
org
'/users/register.php'
If anyone can help me understand why that is, I would be very thankful. Thank you all and have a nice day.

A simpler example may make this clearer. The regexes to compare are:
(a*)*
and
(a+)*
And the test string is aaaaaa.
What happens is that after capturing the main group (in the example I provided, the series of a's), the engine attempts to match more, but cannot. But wait! It can also match nothing, because * means 0 or more times!
Therefore, after matching all the a's, it matches and captures an empty string, and since only the last captured iteration is stored, you get '' as the result of the capture group.
In (a+)*, after matching and capturing aaaaaa, it cannot match or capture anything more (+ prevents it from matching nothing, as opposed to *), and hence aaaaaa is the last capture.
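The difference can be checked directly in PHP. The following is a minimal sketch and not part of the original answer; the variable names $star and $plus are only illustrative.

```php
<?php
// Same subject string, two patterns that differ only in the inner quantifier.
$subject = 'aaaaaa';

// (a*)* : after the a's are consumed, the outer loop runs once more with an
// empty inner match, so the last stored iteration of group 1 is empty.
preg_match('/(a*)*/', $subject, $star);

// (a+)* : the inner group cannot match the empty string, so the last stored
// iteration of group 1 is the full run of a's.
preg_match('/(a+)*/', $subject, $plus);

var_dump($star[0], $star[1]); // whole match, then group 1
var_dump($plus[0], $plus[1]); // whole match, then group 1
```

With PHP's PCRE engine, group 1 should come back empty for the first pattern and as aaaaaa for the second, which is the same behavior the question shows for its fifth group.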

This can be simplified with the following pattern.
/\[link=(https?:\/\/)(([a-z0-9]+\.?)+)((\/[^\/]+)+)\/?\]/i
Both * and + are greedy; the difference is that * is also allowed to match nothing. With + in the second attempt, every iteration of the inner group must consume at least one character, so the last iteration holds all the path components and that is what gets captured. In the first attempt with *, the outer repetition runs one more time with an empty inner match, and because a capture group keeps only its last iteration, you end up with nothing in it.

Related

Preg_replace() to add to string using non-capturing group

I have a piece of HTML markup, for which I need to add a specific CSS rule to it. The HTML is like this:
<tr>
<td style="color:#555555;padding-top: 3px;padding-bottom: 20px;">In order to stop receiving similar emails, simply remove the relevant saved search from your account.</td>
</tr>
As you can see, the td already contains a style attribute, so my idea is to match the last ; of it and replace it with a ; plus the rule I need to add...
The problem is that, although I used the appropriate non-capturing group, I still can't figure out how to do this properly... Take a look at this experiment please: https://regex101.com/r/qlVq6A/1
(<td.*style=".*)(;)(".*>)(?:In order to stop receiving)
On the other hand, when I assign a capturing group to the last part (the text in English that's there just to identify which td I'm interested in) it works OK, but I feel like this is an indirect way to make this work... Take a look at this experiment: https://regex101.com/r/qhVatN/1
(<td.*style=".*)(;)(".*>In order to stop receiving)
Can someone explain to me why the first route doesn't work? Basically, why the non-capturing group still captures the text inside of it...
In your second pattern, you use 3 capture groups, and the replacement inserts the style that you want to add. The 3rd group contains In order to stop receiving, so that text is still present after you use group 3 in the replacement.
In your first pattern, you use a non-capturing group (?:...). It still matches (and therefore consumes) the text, but because nothing is captured, the text cannot be put back by the replacement and is lost.
Note that when you use a non-capturing group like that, you can omit it altogether: grouping by itself, without for example a quantifier or an alternation, serves no additional purpose.
You can use a pattern for the example string, but this can be error prone and using a DOM parser would be a better option.
A way to write the pattern with just 2 capture groups:
(<td[^>]*\bstyle="[^"]*;)([^"]*">In order to stop receiving)
In the replacement use:
$1font-size: 80%;$2
Explanation
( Capture group 1
<td[^>]* Match <td and then optionally repeat any char except >
\bstyle="[^"]*; Match style=" and then optionally repeat matching any char except " and then match the last semicolon (note that it is part of group 1 now)
) Close group 1
( Capture group 2
[^"]*">In order to stop receiving Optionally repeat matching any char except " and then match "> followed by the expected text
) Close group 2
See a regex demo.
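The two-group pattern and its replacement can be tried in PHP as follows. This is a sketch under assumptions: the $html sample is a shortened copy of the markup from the question, and font-size: 80% is simply the rule used in the answer.

```php
<?php
// Sample markup from the question (text shortened for the sketch).
$html = '<td style="color:#555555;padding-top: 3px;padding-bottom: 20px;">'
      . 'In order to stop receiving similar emails</td>';

// Group 1 ends at the last semicolon inside style="...";
// group 2 holds the rest, up to and including the identifying text.
$pattern = '/(<td[^>]*\bstyle="[^"]*;)([^"]*">In order to stop receiving)/';

// ${1} and ${2} are the braced forms of $1 and $2; the braces keep the
// group number clearly separated from the text that follows it.
$result = preg_replace($pattern, '${1}font-size: 80%;${2}', $html);

echo $result, PHP_EOL;
```

Because both halves of the match are captured and written back, only the new rule is added and nothing from the original markup is lost.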
Another option is to write the pattern without capture groups, using \K to forget what has been matched so far and a positive lookahead (?=...) to assert the expected text to the right:
<td[^>]*\bstyle="[^"]*;\K(?=[^"]*">In order to stop receiving)
See another regex demo.
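A sketch of the \K variant in PHP, again using a shortened copy of the question's markup as an assumed input:

```php
<?php
// Sample markup from the question (text shortened for the sketch).
$html = '<td style="color:#555555;padding-top: 3px;padding-bottom: 20px;">'
      . 'In order to stop receiving similar emails</td>';

// \K drops everything matched so far from the reported match, and the
// lookahead consumes nothing, so the reported match is an empty string
// located right after the last semicolon of the style attribute.
$pattern = '/<td[^>]*\bstyle="[^"]*;\K(?=[^"]*">In order to stop receiving)/';

// Replacing an empty match at that position is effectively an insertion.
$result = preg_replace($pattern, 'font-size: 80%;', $html);

echo $result, PHP_EOL;
```

The replacement string needs no group references here, since nothing of the original text is part of the reported match.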

PHP Regex Not Quite Working

I am using the following regex:
^[0-9.,]*(([.,][-])|([.,][0-9]{2}))?\$
I use this regex to check for valid prices -- so it catches/rejects things like xxx, or llddd or 34.23dsds
and allows things like 100 or 120.00
The problem with it seems to be if it is blank(empty) it passes as valid which it should not -- any ideas how to change this??
Thanks
One thing to watch is the dot: outside a character class it stands for "any character", so a literal dot must be escaped as \. (inside a class such as [.,] it is already literal, so escaping it there is optional).
Also, you should require at least one digit, so replace the asterisk * with a + for "one or more".
As written, the pattern also accepts input such as .,.,.,.,.,.,- unless you remove the comma and dot from the first character class:
^[0-9]+(([\.,][-])|([\.,][0-9]{2}))?$
Taking your regex and just solving the "don't match blanks" problem:
^[0-9.,]+(([.,][-])|([.,][0-9]{2}))?$
The * allows 0 or more, while the + allows 1 or more; thus the * allowed blanks but the + will not. There must be at least one character from [0-9.,] (note that this still accepts input with no digits, such as ",,,").
EDIT:
You should clean this regex up a bit to be
^[0-9]+(?:[.,-](?:[0-9]{2})?)?$
This solves the matching of ",,,"
http://www.regextester.com/?fam=95185
EDIT 2: @Fuzzzzel pointed out that this did not match the case "50,-", which we assume you would like to match, and that removing capturing groups is presumptuous. Here's the latest iteration of my suggested regex:
^[0-9]+([.,-](-|([0-9]{2}))?)?$
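To see how this last pattern behaves, it can be run against the inputs mentioned in the question. This is only a sketch: the helper name isValidPrice is hypothetical, and the list of candidates is taken from the examples discussed above.

```php
<?php
// Hypothetical helper wrapping the final pattern from the answer.
function isValidPrice(string $value): bool
{
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match('/^[0-9]+([.,-](-|([0-9]{2}))?)?$/', $value) === 1;
}

// Inputs taken from the question and the answers.
$candidates = ['', '100', '120.00', '50,-', '34.23dsds', 'xxx', ',,,'];

foreach ($candidates as $candidate) {
    printf("%-12s %s\n", "'" . $candidate . "'", isValidPrice($candidate) ? 'valid' : 'invalid');
}
```

Requiring [0-9]+ at the start is what rejects the empty string; whether inputs such as a trailing separator with nothing after it should pass is a separate decision that this sketch does not settle.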

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.
When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
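The two steps might look like this in PHP. It is a minimal sketch: the variable names are illustrative, and the handling of an empty list (returning an empty array) is an assumption about what the asker wants.

```php
<?php
$input = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

$version = null;
$items = [];

// Step 1: one simple regex for the version and the raw list.
if (preg_match('/SomeText\/([\d.]*) \(([^)]*)\)/', $input, $m) === 1) {
    $version = $m[1] ?? '';
    $list = $m[2] ?? '';

    // Step 2: split the list on commas and trim the surrounding spaces.
    // explode() on an empty string yields [''], so handle that case explicitly.
    $items = $list === '' ? [] : array_map('trim', explode(',', $list));
}

print_r($items);
```

This works for any number of entries, since the splitting is no longer tied to the number of capture groups in the pattern.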
Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G) the lookbehind is the part where matches should be glued to
\G matches where the previous match ended (?!^) but don't match start
[(,]? * matches an optional opening parenthesis or comma followed by any number of spaces
\K resets beginning of the reported match
[^,)(]+ matches the wanted characters, that are none of ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one, without lookbehind, is a bit more accurate (it also requires the opening parenthesis), performs better (needs fewer steps), and is probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101
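A sketch of how the capture-group variant could be consumed with preg_match_all in PHP. The loop and variable names are illustrative assumptions; only the pattern comes from the answer.

```php
<?php
$input = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

// First alternative: entry point, version captured in group 1.
// Second alternative: \G glues each match to the end of the previous one,
// and group 2 captures one list entry.
$pattern = '/SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)/';

preg_match_all($pattern, $input, $sets, PREG_SET_ORDER);

$version = null;
$entries = [];
foreach ($sets as $set) {
    if (isset($set[2]) && $set[2] !== '') {
        $entries[] = $set[2];       // a list entry matched via \G
    } else {
        $version = trim($set[1]);   // group 1 includes the space before "("
    }
}

print_r($entries);
```

Each list entry arrives as a separate match rather than as a repeated group, which is why all of them survive instead of only the last one.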
Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.
I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!

PHP Regexp capturing repeating group of chars, e.g. hahaha jajajaja hihihi

As the title says, is there a way in PHP, with preg_match_all, to catch all repetitions of a group of chars?
For instance, catch
hahahaha
jajajaj
hihihi
It's fine to catch repetition of any char, like abababab, acacacacac.
Also, is there a way to count the number of repetitions?
The idea is to catch all this "forms" of smiling on social media.
I figured out that there are also other cases, such as misspelled instances like ahahhahaah (where you have two consecutive a or h). Any ideas?
How about this:
preg_match_all('/((?i)[a-z])((?i)[a-z])(\1\2)+/', $str, $m);
$matches = $m[0]; //$matches will contain an array of matches
A bit complicated, but it does work. To explain: the first subpattern, ((?i)[a-z]), matches any letter between a and z, regardless of case. The second subpattern, ((?i)[a-z]), does the same thing. The third subpattern, (\1\2)+, matches one or more repetitions of the first two letters, in the same case as they originally appeared. This regular expression also assumes that the match has an even number of letters. If you don't want that, you can add \1? at the end, meaning that (as long as there are one or more repetitions) the match can end with the first character (for instance, hahah and ikikikik would both be valid, but not asa).
To retrieve the number of repetitions for a specific match, you can do:
$numb = strlen($matches[$index])/2 - 1; //-1 because the first two letters aren't repetitions
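Put together, the two snippets above could be used like this. The sample string and the $counts array are assumptions made for the sketch.

```php
<?php
// Sample input; "ok" is there to show that non-repeating text is skipped.
$str = 'hahaha ok hihihihi';

preg_match_all('/((?i)[a-z])((?i)[a-z])(\1\2)+/', $str, $m);
$matches = $m[0]; // the full matched strings

$counts = [];
foreach ($matches as $match) {
    // Each unit is two letters; subtract 1 because the first two
    // letters are the original, not a repetition.
    $counts[$match] = intdiv(strlen($match), 2) - 1;
}

print_r($counts);
```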
For the shortest repetition (e.g. ha gets repeated multiple times in hahahaha):
(.+?)\1+
See demo.
For the longest repetition (e.g. haha gets repeated in hahahaha):
(.+)\1+
Counting Repetitions
The non-regex solution is to compare the lengths of Group 1 (the repeated token) and the overall match.
With pure regex, in .NET, you could simply do (.+?)(\1)+ and look at the number of captures in the Group 1 CaptureCollection object.
In PHP, that's not possible, but there are some hacks. See, for instance, this question about matching a line number—it's the same technique. This is for "study purposes" only—you wouldn't want to use that in real life.
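The length comparison is straightforward in PHP. The following sketch uses the shortest-repetition pattern; the sample text and variable names are illustrative, and strlen counts bytes, so it assumes single-byte (ASCII) input.

```php
<?php
$text = 'hahahaha jajaja hihihi';

// (.+?) captures the shortest repeated unit, \1+ requires it to repeat.
preg_match_all('/(.+?)\1+/', $text, $sets, PREG_SET_ORDER);

$repetitions = [];
foreach ($sets as $set) {
    [$whole, $unit] = $set; // overall match, then Group 1

    // Total length divided by unit length = number of occurrences,
    // counting the first occurrence as well.
    $repetitions[$whole] = intdiv(strlen($whole), strlen($unit));
}

print_r($repetitions);
```

Note that this counts every occurrence of the unit, including the first one, so the numbers are one higher than a count of "extra" repetitions only.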

RegEx with character set inside positive lookbehind, Is it possible?

I need to match "name" only after "listing", but of course those words could be any url directory or page.
mydomain.com/listing/name
so the only thing I can "REGuest" (request) is to be some parent directory there.
In other words, I want to match the "position" i.e. whatever comes 2nd after the domain.
I'm trying something like
(?<=mydomain\.com/[^/\?&]+/)[^/\?&]+(?:/)?
But the character set won't work inside the positive lookbehind, or at least it is set up to match only ONE character. As soon as I try to match more or fewer than one (e.g. by modifying it with +, ? or *), it just stops working.
I'm obviously missing the positive lookbehind syntax and it seems not intended for what I'm trying.
How can I match that 2nd level filename?
Thanks.
Regular-expressions.info states that
The bad news is that most regex flavors do not allow you to use just
any regex inside a lookbehind, because they cannot apply a regular
expression backwards. Therefore, the regular expression engine needs
to be able to figure out how many steps to step back before checking
the lookbehind...
(Read further, they even mention Perl, Python and Java.)
I think the quantifier is the problem: a quantified character class makes the lookbehind variable-length. I found this on Stack Overflow and only skimmed it.
Wouldn't it be possible to just match the whole path, and use a group for the second level filename:
mydomain\.com\/[^\/\?&]+\/([^\/\?&]+)(?:\/)?
(note: I had to escape the / for my tests...)
The result of this would be something like:
Array
(
[0] => mydomain.com/listing/name
[1] => name
)
Now, because I don't know the context of your problem, I just assumed you would be able to postprocess the results and get the group 1 (index 1) from the result. If not, I unfortunately don't know...
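In PHP, reading group 1 from the result could look like the sketch below. The URL is the sample from the question, and the variable names are illustrative.

```php
<?php
$url = 'mydomain.com/listing/name';

// Match the whole path and capture the second-level segment in group 1,
// instead of trying to express the prefix as a lookbehind.
$pattern = '/mydomain\.com\/[^\/\?&]+\/([^\/\?&]+)(?:\/)?/';

$second = null;
if (preg_match($pattern, $url, $m) === 1) {
    $second = $m[1]; // index 0 is the whole match, index 1 is the group
}

var_dump($second);
```

This sidesteps the lookbehind restriction entirely, at the cost of having to pick group 1 out of the result afterwards.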
