getting url to work with or without a subexpression

I'm trying to rewrite a URL like
page.php?sort=66&search=s&category=2,3,4&archive=june&page=3
to
page-sort-66-search(s)-category-2,3,4-archive(june)-page3
but each of these subexpressions may or may not be in the URL every time this page is called, so I had to wrap each one in "( )?" so that the regex works with or without them:
^page
(-sort-([0-9]*))?
(-search\(([a-z]*)\))?
(-category-([0-9,]*))?
.............
you get the idea
Now the problem is that mod_rewrite treats each of these parenthesized subexpressions as an actual variable.
(-sort-([0-9]*))? this is how mod_rewrite interprets this => $1 = -sort-66 , $2 = 66
So for each subexpression I get 2 capture groups, and that's more than 10 for a link with 5-6 variables,
and mod_rewrite has a limit of 9 backreferences ($1 through $9).
Is there a replacement for "()?"?

I have to admit, I'm a bit confused by some parts of your question, so I apologize in advance if my answer is totally off-base . . . but it sounds like what you need is a non-capturing group. This:
(...)
matches ... and creates a capture-group (what you're calling a "variable"); this:
(?:...)
just matches ..., without creating a capture-group. So it's useful if all you want to do is group a subexpression, without saving all of its contents to be used as $1.
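
To make that concrete, here is a rough sketch in PHP of the pattern from the question with the outer groups made non-capturing. The archive and page parts are my guess from the example URL (the question elides them), and I've used `+` where an empty value makes no sense, so treat the exact pattern as an assumption. mod_rewrite also uses PCRE, so the same `(?:...)` syntax should carry over to a RewriteRule.

```php
<?php
// Outer optional groups are non-capturing, so only the values get numbers.
// The archive and page parts are assumed from the example URL.
$pattern = '/^page'
    . '(?:-sort-([0-9]+))?'          // $1 = sort
    . '(?:-search\(([a-z]*)\))?'     // $2 = search
    . '(?:-category-([0-9,]+))?'     // $3 = category
    . '(?:-archive\(([a-z]*)\))?'    // $4 = archive
    . '(?:-page([0-9]+))?'           // $5 = page
    . '$/';

$url = 'page-sort-66-search(s)-category-2,3,4-archive(june)-page3';
preg_match($pattern, $url, $m);

// $m[0] is the whole match; $m[1] to $m[5] are the five values.
print_r(array_slice($m, 1));

// A URL with only some of the parts still matches.
$partialMatched = preg_match($pattern, 'page-category-2,3,4', $partial);
```

That is five capture groups for five variables, which stays under the limit of 9.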

Related

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. `1.0.4`
2. `debug`
3. `OS X 10.11.2`
4. `Macbook Pro Retina`
As far as I can see, the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.

When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
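
A minimal sketch of those two steps in PHP, assuming the input always looks like the sample string. The variable names are mine, and the trim() call is an addition to drop the space after each comma.

```php
<?php
$input = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

$version = null;
$items = [];

// Step 1: grab the version and the raw contents of the parentheses.
if (preg_match('/SomeText\/([\d.]*) \(([^)]*)\)/', $input, $m)) {
    $version = $m[1];
    // Step 2: split the list on commas; an empty list gives no items.
    $items = $m[2] === '' ? [] : array_map('trim', explode(',', $m[2]));
}

print_r($items);
```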

Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G(?!^)) the lookbehind is the part that matches are glued to
\G matches where the previous match ended; (?!^) keeps it from matching at the start of the string
[(,]? * matches an optional opening parenthesis or comma followed by any number of spaces
\K resets the beginning of the reported match
[^,)(]+ matches the wanted characters, i.e. anything other than ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one, without the lookbehind, is a bit more accurate (it also requires the opening parenthesis), performs better (needs fewer steps) and is probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101
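
For what it's worth, this is roughly how the capture-group version could be used from PHP with preg_match_all. The loop and variable names are my own assumptions. Note that `[^(]*` also picks up the space before the parenthesis, hence the trim().

```php
<?php
$input = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';
$pattern = '/SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)/';

$version = null;
$items = [];

// PREG_SET_ORDER gives one array per match, in the order they were found.
preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    if (isset($m[2]) && $m[2] !== '') {
        // Second alternative: a list entry glued to the previous match.
        $items[] = $m[2];
    } else {
        // First alternative: the entry anchor, with the version in group 1.
        $version = trim($m[1]);
    }
}

print_r($items);
```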

Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.

I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
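
As far as I understand it, the reason is that a capture group inside a repeated group does not collect one value per repetition: each pass overwrites the previous one, so only the last value survives. A small sketch with a made-up subject string (not the one from the question):

```php
<?php
$subject = 'a,b,c';

// The inner group runs three times, but it stores only its last value.
preg_match('/^(?:(\w+),?)*$/', $subject, $m);
$lastOnly = $m[1];

// To get every item, match the items themselves rather than a repetition.
preg_match_all('/\w+/', $subject, $all);
$every = $all[0];

print_r([$lastOnly, $every]);
```

This is why the long version with six separate groups works while the shortened one does not.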

This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!

Capturing key value pairs from a url string with a regex pattern

I'm trying to use regex to parse a string like the below:
/subject=hello±#text=something that may contain\#hello.com or a normal sla/sh±#date=blah/somethingelseI don't want to capture after the first/
into:
subject = hello
text = something that may contain\#hello.com or a normal sla/sh
date = blah
Ideally I'd like to be able to split the string after the first '/' by something like '±#' - and only that combination in that order.
I've looked around and at the minute have the below:
([^/±#,= ]+)=([^±#,= ]+)
But this doesn't split only on '±#' - it splits on either # or ± alone.
It also doesn't cope with the escaped #. (Instead I get: text= something that may contain\ ).
Is there a better way to do this?
Thanks

Try this:
(?:\/|(?<=±#))(.*?=.*?)(?:±#|$|\/(?!.*±#))
See live demo
An important part is the negative look ahead after the trailing slash /(?!.*±#) - this means "match a slash, but only if ±# doesn't appear in the input after it".
Given this input:
/subject=hello±#text=something that may contain\#hello.com or a normal sla/sh±#date=blah/somethingelseI don't want to capture after the first/
It produces matches whose group 1 are:
subject=hello
text=something that may contain\#hello.com or a normal sla/sh
date=blah
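
If it helps, here is a sketch of using that pattern from PHP and then splitting each pair on its first `=`. The `u` flag and the variable names are my additions, and I'm assuming the source file is saved as UTF-8.

```php
<?php
$input = '/subject=hello±#text=something that may contain\#hello.com or a normal sla/sh±#date=blah/somethingelseI don\'t want to capture after the first/';
$pattern = '/(?:\/|(?<=±#))(.*?=.*?)(?:±#|$|\/(?!.*±#))/u';

$pairs = [];
preg_match_all($pattern, $input, $matches);

foreach ($matches[1] as $pair) {
    // Limit of 2 so that any "=" inside the value is left alone.
    [$key, $value] = explode('=', $pair, 2);
    $pairs[$key] = $value;
}

print_r($pairs);
```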

Php, Regular expression

I have this pattern (I am using PHP):
'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?)\]/i'
When I search for this string: http://phpquest.zapto.org/users/register.php
The matches are (the order is 0-5):
'[link=http://phpquest.zapto.org/users/register.php]'
'http://phpquest.zapto.org/users/register.php'
'http://'
'phpquest.zapto'
'org'
''
When I replace the * with + inside the last subpattern, like this:
'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]+)*\/?)\]/i'
The matches are (the order is 0-5):
'[link=http://phpquest.zapto.org/users/register.php]'
'http://phpquest.zapto.org/users/register.php'
'http://'
'phpquest.zapto'
'org'
'/users/register.php'
If anyone can help me understand why that is, I will be very thankful. Thank you all and have a nice day.

Maybe a simpler example helps here.
Compare these two regexes:
(a*)*
and
(a+)*
And the test string is aaaaaa.
What happens is that after capturing the main group (in the example I provided, the series of a's) it attempts to match more, but cannot. But wait! It can also match nothing, because * means 0 or more times!
Therefore, after matching all the a's, it will match and catch a 'nothing' and since only the last captured part is stored, you get '' as result of the capture group.
In (a+)*, after matching and catching aaaaaa, it cannot match or catch anything more (+ prevents it from matching nothing, as opposed to *) and hence, aaaaaa is the last match.
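
The two cases can be checked side by side with a short PHP sketch. The variable names are mine, and the `?? ''` is only there as a guard in case the group comes back unset.

```php
<?php
// Inner * : after the a's are consumed the group matches once more with
// nothing, and that empty string is what ends up stored.
preg_match('/(a*)*/', 'aaaaaa', $star);
$starGroup = $star[1] ?? '';

// Inner + : an empty pass is not allowed, so the group keeps the a's.
preg_match('/(a+)*/', 'aaaaaa', $plus);
$plusGroup = $plus[1] ?? '';

var_dump($starGroup, $plusGroup);
```

In both cases the overall match ($star[0] and $plus[0]) is the full aaaaaa; only the stored group differs.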

This can be simplified with the following pattern.
/\[link=(https?:\/\/)(([a-z0-9]+\.?)+)((\/[^\/]+)+)\/?\]/i
Both * and + are greedy; the difference is that * also accepts an empty match. In the second attempt, the inner + forces every pass of the group to consume at least one character, so the path components are matched and the group keeps them. In the first attempt, the inner * lets the group match one more time with nothing once the path has been consumed, and since a repeated group stores only its last pass, you captured the empty string.

regex matching url

In PHP, the Klein router will match as many routes as it can.
2 routes I have set up are conflicting. They are:
$route1: '/websites/[i:websiteId]/users/[i:id]?'
and
$route2: '/websites/[i:websiteId]/users/[a:filename].[json|csv:extension]?'
This is the URL I'm trying to match, which I think should match the first route and not the second:
/api/v1-test/websites/100/users/4
The regex produced for these two are:
$regex1: `^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))/users(?:/(?P<id>[0-9]++))?$`
$regex2: `^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))/users(?:/(?P<filename>[0-9A-Za-z]++))(?:\.(?P<extension>json|csv))?$`
I mean for it not to match if there is no '.csv' or '.json'. The problem is that it is matching both routes. For the second, the resulting filename is '4' and the extension is blank.
Sending /api/v1-test/websites/100/users/users.csv works correctly and only matches the second route.
I only have control over the route, not the regex or the matching.
Thanks.

This bit here
(?:\.(?P<extension>json|csv))?
at the end of your second regex causes it to match whether or not there's an extension, due to the ? at the very end. A question mark means 0 or 1 of the previous expression. Get rid of it and, at the least, strings will only match this regex when they have the extension.
To make this change, just remove the question mark from your second route, like so:
$route2: '/websites/[i:websiteId]/users/[a:filename].[json|csv:extension]'
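
To see the effect without going through the router, the generated regex from the question can be tested directly, with and without the trailing question mark. I'm assuming here that dropping the `?` from the route simply drops it from the generated regex; I haven't checked what Klein actually produces.

```php
<?php
// Regex from the question, and the same regex without the final "?"
// (the second one is an assumption about what the router would generate).
$optionalExt = '~^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))/users(?:/(?P<filename>[0-9A-Za-z]++))(?:\.(?P<extension>json|csv))?$~';
$requiredExt = '~^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))/users(?:/(?P<filename>[0-9A-Za-z]++))(?:\.(?P<extension>json|csv))$~';

$urls = [
    '/api/v1-test/websites/100/users/4',
    '/api/v1-test/websites/100/users/users.csv',
];

$results = [];
foreach ($urls as $url) {
    $results[$url] = [
        'optional' => preg_match($optionalExt, $url) === 1,
        'required' => preg_match($requiredExt, $url) === 1,
    ];
}

var_dump($results);
```

With the extension required, /users/4 no longer matches the second regex, while /users/users.csv still does.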

The problem is that the match_type is defined really... weirdly:
$match_types = array(
'i' => '[0-9]++',
'a' => '[0-9A-Za-z]++',
[...]
As such, you can't really capture a sequence corresponding to [a-zA-Z] only... The only option I see would be to use 3 routes:
$route1: '/websites/[i:websiteId]/users/[i:id]?'
$route2: '/websites/[i:websiteId]/users/[a:filename]'
$route3: '/websites/[i:websiteId]/users/[a:filename].[json|csv:extension]'
And to assign the same actions for routes 2 and 3. Then you would have this:
/api/v1-test/websites/100/users/ is matched by 1
/api/v1-test/websites/100/users/4 is matched by 1
/api/v1-test/websites/100/users/test is matched by 2
/api/v1-test/websites/100/users/test.csv is matched by 3
Which seems like the behavior you wanted.
Another (easier) solution would be to take advantage of this bit of the documentation:
Routes automatically match the entire request URI.
If you need to match only a part of the request URI
or use a custom regular expression, use the @ operator.
You can then define your routes like this:
$route1: '@/websites/[0-9]+/users/[0-9]*$'
$route2: '@/websites/[0-9]+/users/[a-zA-Z]+(\.[a-zA-Z]+)?$'

RegEx with character set inside positive lookbehind, Is it possible?

I need to match "name" only after "listing", but of course those words could be any url directory or page.
mydomain.com/listing/name
so the only thing I can "REGuest" (request) is to be some parent directory there.
In other words, I want to match the "position" i.e. whatever comes 2nd after the domain.
I'm trying something like
(?<=mydomain\.com/[^/\?&]+/)[^/\?&]+(?:/)?
But the character set won't work inside the positive lookbehind, at least it's setup to match only ONE character. As soon as I try to match other than one (e.g. modify it with +, ? or *) it just stops working.
I'm obviously missing the positive lookbehind syntax and it seems not intended for what I'm trying.
How can I match that 2nd level filename?
Thanks.

Regular-expressions.info states that
The bad news is that most regex flavors do not allow you to use just
any regex inside a lookbehind, because they cannot apply a regular
expression backwards. Therefore, the regular expression engine needs
to be able to figure out how many steps to step back before checking
the lookbehind...
(Read further, they even mention Perl, Python and Java.)
I think the quantifier might be the problem. I found this on stackoverflow and briefly flew over it.
Wouldn't it be possible to just match the whole path, and use a group for the second level filename:
mydomain\.com\/[^\/\?&]+\/([^\/\?&]+)(?:\/)?
(note: I had to escape the / for my tests...)
The result of this would be something like:
Array
(
[0] => mydomain.com/listing/name
[1] => name
)
Now, because I don't know the context of your problem, I just assumed you would be able to postprocess the results and get the group 1 (index 1) from the result. If not, I unfortunately don't know...
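
In PHP that could look like the sketch below. The second pattern is an alternative I'm adding on my own: if the flavor is PCRE (as it is in PHP), `\K` drops everything matched so far from the reported match, which gives the effect of the lookbehind without needing one. The `~` delimiter is just there to avoid escaping the slashes.

```php
<?php
$url = 'mydomain.com/listing/name';

// Capture-group version: the second-level name ends up in index 1.
preg_match('~mydomain\.com/[^/?&]+/([^/?&]+)/?~', $url, $withGroup);

// \K version (PCRE only): the reported match itself is just the name.
preg_match('~mydomain\.com/[^/?&]+/\K[^/?&]+~', $url, $withK);

print_r([$withGroup[1], $withK[0]]);
```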
