regex matching url - php

In PHP, the Klein router will match as many routes as it can.
Two routes I have set up are conflicting. They are:
$route1: '/websites/[i:websiteId]/users/[i:id]?'
and
$route2: '/websites/[i:websiteId]/users/[a:filename].[json|csv:extension]?'
This is the URL I'm trying to match, which I think should match the first route and not the second:
/api/v1-test/websites/100/users/4
The regexes produced for these two are:
$regex1: `^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))/users(?:/(?P<id>[0-9]++))?$`
$regex2: `^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))/users(?:/(?P<filename>[0-9A-Za-z]++))(?:\.(?P<extension>json|csv))?$`
I mean for it not to match if there is no '.csv' or '.json'. The problem is that it is matching both routes. For the second, the resulting filename is '4' and the extension is blank.
Sending /api/v1-test/websites/100/users/users.csv works correctly and only matches the second route.
I only have control over the route, not the regex or the matching.
Thanks.

This bit here
(?:\.(?P<extension>json|csv))?
at the end of your second regex causes it to match whether or not there's an extension, because of the ? at the very end. A question mark means 0 or 1 of the previous expression. Get rid of it and strings will only match this regex when they have the extension.
To make this change, just remove the question mark from your second route, like so:
$route2: '/websites/[i:websiteId]/users/[a:filename].[json|csv:extension]'
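The effect of that trailing ? can be checked directly with preg_match. This is only a sketch: it assumes Klein compiles the edited route to the regex from the question minus the final ?, the variable names are made up for illustration, and ~ is used as the delimiter simply because the pattern contains slashes.

```php
<?php
// Regex from the question for route 2 (extension optional).
$withOptionalExt = '~^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))/users(?:/(?P<filename>[0-9A-Za-z]++))(?:\.(?P<extension>json|csv))?$~';

// Assumed regex once the trailing ? is removed from the route.
$withRequiredExt = '~^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))/users(?:/(?P<filename>[0-9A-Za-z]++))(?:\.(?P<extension>json|csv))$~';

$plain = '/api/v1-test/websites/100/users/4';
$file  = '/api/v1-test/websites/100/users/users.csv';

// preg_match returns 1 on a match and 0 on no match.
$optionalMatchesPlain = preg_match($withOptionalExt, $plain);
$requiredMatchesPlain = preg_match($withRequiredExt, $plain);
$requiredMatchesFile  = preg_match($withRequiredExt, $file, $m);

var_dump($optionalMatchesPlain, $requiredMatchesPlain, $requiredMatchesFile);
```

With the extension required, the plain /users/4 URL no longer satisfies the second pattern, while /users/users.csv still does.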

The problem is that the match_type is defined really... weirdly:
$match_types = array(
'i' => '[0-9]++',
'a' => '[0-9A-Za-z]++',
[...]
As such, you can't really capture a sequence corresponding to [a-zA-Z] only... The only option I see would be to use 3 routes:
$route1: '/websites/[i:websiteId]/users/[i:id]?'
$route2: '/websites/[i:websiteId]/users/[a:filename]'
$route3: '/websites/[i:websiteId]/users/[a:filename].[json|csv:extension]'
And to assign the same actions for routes 2 and 3. Then you would have this:
/api/v1-test/websites/100/users is matched by 1
/api/v1-test/websites/100/users/4 is matched by 1 (and also by 2, because [a] accepts digits)
/api/v1-test/websites/100/users/test is matched by 2
/api/v1-test/websites/100/users/test.csv is matched by 3
Which is close to the behavior you wanted.
Another (easier) solution would be to take advantage of this bit of the documentation:
Routes automatically match the entire request URI.
If you need to match only a part of the request URI
or use a custom regular expression, use the @ operator.
You can then define your routes like this:
$route1: '@/websites/[0-9]+/users/[0-9]*$'
$route2: '@/websites/[0-9]+/users/[a-zA-Z]+(\.[a-zA-Z]+)?$'
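To see how these two custom expressions divide up the URLs, here is a small check using plain preg_match as a stand-in for the router. It is a sketch, not Klein itself: only the regex part after the custom-regex operator is used, wrapped in ~ delimiters, and the variable names are hypothetical.

```php
<?php
// Regex bodies of the two custom routes, wrapped in ~ delimiters for preg_match.
$idRoute   = '~/websites/[0-9]+/users/[0-9]*$~';
$fileRoute = '~/websites/[0-9]+/users/[a-zA-Z]+(\.[a-zA-Z]+)?$~';

$urls = [
    '/api/v1-test/websites/100/users/',
    '/api/v1-test/websites/100/users/4',
    '/api/v1-test/websites/100/users/test',
    '/api/v1-test/websites/100/users/test.csv',
];

// Record, per URL, which of the two patterns matches.
$result = [];
foreach ($urls as $url) {
    $result[$url] = [
        'id'   => preg_match($idRoute, $url) === 1,
        'file' => preg_match($fileRoute, $url) === 1,
    ];
}

print_r($result);
```

Because the first pattern accepts only digits and the second only letters, no URL in this list matches both.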

Related

Match first occurrence of a word and ignore duplicates in a named group

I'm writing a PHP router engine for practice and I'm currently doing the regular expressions for it.
A mapped URL can have parameter patterns, which are written like {type:varName}. I don't want to allow multiple occurrences of the variable name, which is varName in this case.
I've currently got this regex for it:
{(?<key>[a-zA-Z]{1,4}):(?<name>[a-zA-Z_]\w*\b)(?!.*\1\b)}
The problem is that it only checks for duplicates on the <key> group and not on the <name> group. It also finds the last occurrence instead of the first one.
How do I write this regular expression so that it only matches the first occurrence of the <name> group and does not match the duplicates of this first match?
Example
When you have a pattern like this:
{s:varName}-{i:varName}-{s:varName}
Only the first {s:varName} should match, the other 2 shouldn't be matched.
When there is a pattern like this:
{i:varName1}-{d:varName1}-{i:varName2}-{i:varName3}-{m:varName3}
{i:varName1}, {i:varName2} and {i:varName3} should match.
Update
Thanks to @sln I ended up with this regular expression:
{(?<key>[a-zA-Z]{1,4}):(?<name>[a-zA-Z_]+\b)}(?:(?!.*{[a-zA-Z_]{1,4}:\2))
The only problem with this is that it doesn't match the first occurrence but the last one found.
What am I doing wrong here?
You can use a workaround: give each occurrence a unique proxy name (so that no group name is duplicated) and pick out what you want in code.
If you want this pattern:
{s:varName}-{i:varName}-{s:varName}
write it out as:
{s:varName-1}-{i:varName-2}-{s:varName-3}
And write some logic:
get all groups for varName-* (varName-1, varName-2, varName-3),
get what you want (for example, the first occurrence is varName-1).
For this pattern:
{i:varName1}-{d:varName1}-{i:varName2}-{i:varName3}-{m:varName3}
write:
{i:varName1-1}-{d:varName1-2}-{i:varName2-1}-{i:varName3-1}-{m:varName3-2}
And the same logic:
get all groups for varName1-* (varName1-1, varName1-2), for varName2-*, varName3-*, etc.
get *-1 from all multi-groups (varName1-1, varName2-1, varName3-1)
I use this workaround because some other languages (for example Java) don't support duplicate group names.
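The "sort it out in code" idea can also be sketched in PHP without renaming anything: collect every placeholder with preg_match_all and keep only the first one seen per name. The function name firstPlaceholders is hypothetical, and the placeholder regex is a simplified form of the one in the question with the duplicate check removed.

```php
<?php
/**
 * Return the first {type:name} placeholder for each distinct name,
 * in the order the names first appear in the pattern.
 */
function firstPlaceholders(string $pattern): array
{
    preg_match_all(
        '/\{(?<key>[a-zA-Z]{1,4}):(?<name>[a-zA-Z_]\w*)\}/',
        $pattern,
        $matches,
        PREG_SET_ORDER
    );

    $firstByName = [];
    foreach ($matches as $m) {
        // Later duplicates of an already seen name are skipped.
        if (!isset($firstByName[$m['name']])) {
            $firstByName[$m['name']] = $m[0];
        }
    }

    return array_values($firstByName);
}

print_r(firstPlaceholders('{i:varName1}-{d:varName1}-{i:varName2}-{i:varName3}-{m:varName3}'));
```

Doing the duplicate check in code sidesteps the lookahead, which is what made the regex prefer the last occurrence.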

Regular expression for exploding route in Laravel4

How to explode a Laravel
Route::getCurrentRoute()->getActionName() // IndexController@getRegister
into an array of: namespace, controller, method, action
I created the expr:
(.*)\\(.*)Controller@(get|post|put|delete|patch)(.*)
But it doesn't work for all of the routes
Admin\IndexController@getIndex
Admin\IndexController@postIndex
Other\Namespace\Admin\IndexController@putIndex
Admin\IndexController@deleteGetAjaxSuper
Admin\IndexController@patchIndex
IndexController@getRegister
Other_IndexController@getRegister
IndexController@getRegister
IndexController@getRegister
http://regexr.com?389rh
It doesn't work for the last 4 items.
It isn't matching the last 4 items because they do not contain a \ before the controller name. Your regexp requires one, so those items do not match. The answer is to make the backslash optional by using a ? (0 or 1 occurrence), like so:
(.*)\\?(.*)Controller@(get|post|put|delete|patch)(.*)
On a side note, are you sure that a controller always has a specific name? A controller just called "Controller" will match your regexp as well. To disallow that (and only allow [something]Controller) you can change the 2nd .* to .+ (so from 0 or more characters to 1 or more). Like so:
(.*)\\?(.+)Controller@(get|post|put|delete|patch)(.*)
And the same thing for your method (get|post|etc.):
(.*)\\?(.+)Controller@(get|post|put|delete|patch)(.+)
Assuming you will not have any whitespace before Controller:
\S+Controller@(get|post|put|delete|patch)(.*)
Using .* instead of \S+ will work as well.
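For the optional-namespace idea, here is a hedged PHP sketch. It assumes the separator in the action name is @, which is what Laravel's getActionName() returns. It also departs slightly from the patterns above: with a bare optional backslash the greedy first group can swallow part of the controller name, so this version makes the whole "namespace plus backslash" prefix optional and keeps backslashes out of the controller group. The function name splitAction is made up for illustration.

```php
<?php
/**
 * Split "Namespace\FooController@getBar" into its parts.
 * Returns null when the string does not look like a controller action.
 */
function splitAction(string $action): ?array
{
    // In a single-quoted PHP string, \\\\ yields the regex \\ (one literal backslash).
    $pattern = '~^(?:(.*)\\\\)?([^\\\\]+)Controller@(get|post|put|delete|patch)(.*)$~';

    if (preg_match($pattern, $action, $m) !== 1) {
        return null;
    }

    return [
        'namespace'  => $m[1], // empty string when there is no namespace
        'controller' => $m[2],
        'method'     => $m[3],
        'action'     => $m[4],
    ];
}

print_r(splitAction('Other\Namespace\Admin\IndexController@putIndex'));
print_r(splitAction('IndexController@getRegister'));
```

How a name such as Other_IndexController should be split is not specified in the question, so the sketch simply treats everything after the last backslash as the controller name.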

"OR" operator in RegEx syntax

OK, I've worked with RegEx numerous times but this is one of the things I honestly can't get my head around. And it looks as if I'm missing something rather simple...
So, let's say we want to match "AB" or "AC". In other words, "A" followed by either "B" OR "C".
This would be expressed like A[BC] or A[B|C] or A(B|C) and so on.
Now, what if A,B,C are not just single letters but sub-expressions?
Please, have a look at this example here (well, I admit it doesn't look that... simple! lol) : http://regexr.com?382a4
I'm trying to match capital = (and its variations) followed by either :
Pattern 1
Pattern 2
Why is it that using the | operator only works on the latter part (my regex also matches "Pattern 2" withOUT preceding capital =). Please note that I've also tried using positive look-arounds, but without any success.
Any ideas?
Your original regex could be summarized as:
capital = (ABC)|(DEF)
This matches capital = ABC or DEF. Add an extra pair of () that wraps the | clause properly.
I suppose this regexp:
capital = (ABC|XYZ)
should work (if I did correctly understand your request...)
Actually [B|C] is incorrect, (B|C) is correct.
Character classes
In RegEx jargon [] is called a character class and it is used to represent one (single) character according to the options listed between the brackets.
In your case [B|C] matches either B or | or C. We can correct this by using [BC] to match either B or C. This matches exactly one character either B or C.
Capturing groups
In RegEx jargon () is called a capturing group. It is used to create boundaries between adjacent groups and whatever it matches will be present in the output array of a preg_match or as a variable in preg_replace.
Within that group you can use the | operator to specify that you want to match either whatever's before or whatever's after the operator.
This can be used to match strings with more than one character such as (Ana|Maria) or various structures such as ([a-zA-Z]+|[0-9]+).
You can also use the | outside of a capturing group such as (group-1)|(group-2) and you can also use subgrouping such as ((group-1)|(group-2)).
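The difference between the two placements of |, and between [B|C] and (B|C), can be checked with a few preg_match calls. This is a minimal sketch using the ABC/DEF placeholders from the answers above; the variable names are illustrative only.

```php
<?php
// Alternation at the top level splits the WHOLE pattern in two:
// "capital = ABC"  OR  "DEF".
$loose = '/capital = (ABC)|(DEF)/';

// Alternation confined to the group: "capital = " followed by ABC or DEF.
$tight = '/capital = (ABC|DEF)/';

$looseMatchesBareDef = preg_match($loose, 'DEF');
$tightMatchesBareDef = preg_match($tight, 'DEF');
$tightMatchesFull    = preg_match($tight, 'capital = DEF');

// Inside a character class, | is just a literal character.
$classMatchesPipe = preg_match('/^A[B|C]$/', 'A|');
$groupMatchesPipe = preg_match('/^A(B|C)$/', 'A|');

var_dump($looseMatchesBareDef, $tightMatchesBareDef, $tightMatchesFull);
var_dump($classMatchesPipe, $groupMatchesPipe);
```

The loose pattern accepts a bare DEF, which is the behavior the question describes; the grouped version does not.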

PHP regex to detect pagination

I'm rewriting a route handling class for an MVC based site in PHP and need a regex to detect a pagination string in the URL. The pagination string is formed of three different parts:
Page number detection: /page/[NUMERIC]/
Items per page detection: /per_page/[NUMERIC]/
Ordering detection: /sort/[ALMOST_ANY_CHARACTER]/[asc or desc]/
Due to the way it was previously developed, these three parts can be in any order. There are a number of existing links which I need to keep working plus the code used to handle pagination (no plans for a re-write yet) - so changing the pagination code to always generate a consistent url isn't possible.
Therefore, I need to build a regex pattern to detect every possible combination of the pagination structure. I have three patterns to detect each part, which are as follows:
Page number detection: (page/\d+)
Items per page detection: (per_page/\d+)
Ordering detection: (sort/([a-zA-Z0-9\.\-_%=]+)/(asc|desc))
Being new to writing complex (well, this is complex to me anyway!) regex patterns, the only way I can think of doing it is to combine the three patterns I have for each of the URL structures (e.g. /pagenum/ordering/perpage/, /pagenum/perpage/ordering/), using the | operator as an 'or' statement.
Is there a better / more efficient way of doing this?
I am running the regex using preg_match.
You could use lookaheads. After a lookahead has completely matched, the position of the regex engine jumps back to where it started (that's why it's called *look*ahead; it doesn't actually advance the position in the subject string or include anything in the match). Since you don't know where the desired part occurs, start all three lookaheads from the beginning of the string, and prepend the capturing groups with .* to allow an arbitrary position:
^(?=.*(page/\d+))(?=.*(per_page/\d+))(?=.*(sort/([a-zA-Z0-9\.\-_%=]+)/(asc|desc)))
You can maybe even switch around the capturing groups a bit:
preg_match(
'~^(?=.*page/(\d+))(?=.*per_page/(\d+))(?=.*sort/([a-zA-Z0-9\.\-_%=]+)/(asc|desc))~',
$input,
$match
);
Now the captures will be:
$match[1] => page number
$match[2] => items per page
$match[3] => sort key
$match[4] => sort order
If any of these can be optional, you can simply make the entire lookahead optional with ?.
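A runnable sketch of the lookahead approach follows. It makes two changes that are assumptions of this example rather than part of the answer: each keyword is preceded by a slash (/page/ instead of page/), because a greedy .*page/ can also land on the page/ inside per_page/, and the .* is made lazy. The function name parseWithLookaheads is hypothetical.

```php
<?php
/**
 * Extract page, per_page and sort information regardless of segment order.
 * Returns null unless all three parts are present.
 */
function parseWithLookaheads(string $url): ?array
{
    // Each lookahead scans from the start of the string, so order does not matter.
    $pattern = '~^(?=.*?/page/(\d+))'
             . '(?=.*?/per_page/(\d+))'
             . '(?=.*?/sort/([a-zA-Z0-9.\-_%=]+)/(asc|desc))~';

    if (preg_match($pattern, $url, $m) !== 1) {
        return null;
    }

    return [
        'page'     => (int) $m[1],
        'per_page' => (int) $m[2],
        'sort'     => $m[3],
        'order'    => $m[4],
    ];
}

print_r(parseWithLookaheads('/per_page/10/sort/name/asc/page/2/'));
```

Since the pattern consists only of lookaheads after ^, the overall match itself is empty; all useful data is in the capture groups.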
You could use lookaheads, but unless I'm missing something, I don't think it's necessary here -- you probably can just use the OR operator:
(/(page/\d+)|/(per_page/\d+)|/(sort/([a-zA-Z0-9\.\-_%=]+)/(asc|desc)))+
The outer group here searches for 1 or more instances of any group 1 OR group 2 OR group 3.
More URL routing tips:
This general approach may actually allow you to simplify things a bit, too. Rather than defining all the rules for your route in the regex, first check for certain types of actions, then handle them in code. The simplest version:
(/(page|per_page)/(\d+))+
Now (for each outer-group match) you'll get a match list containing an "action" and a "value". Switch on the action, process the value accordingly.
To handle sort as you've spec'd it (two value parameters instead of one), we'll add another layer. To make it more interesting, let's say you decide to add a fourth action, search, which searches a specific field for some content:
(/(page|per_page)/(\d+)|/(sort|search)/([^/]+)/([^/]+))+
Again, when evaluating your match list, check for the action first -- depending on which action it is, you'll know how many successive match values to process.
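One way to get a match list per segment in PHP is sketched below. Note the deliberate difference from the pattern above: preg_match keeps only the last repetition of a repeated group, so this sketch drops the outer (...)+ and lets preg_match_all collect one match per action instead. The function name parseActions and the shape of the returned array are assumptions for illustration.

```php
<?php
/**
 * Collect pagination actions from a URL, in whatever order they appear.
 * Single-value actions map to an int, two-value actions to a 2-element array.
 */
function parseActions(string $url): array
{
    $pattern = '~/(page|per_page)/(\d+)|/(sort|search)/([^/]+)/([^/]+)~';
    preg_match_all($pattern, $url, $matches, PREG_SET_ORDER);

    $params = [];
    foreach ($matches as $m) {
        if (!empty($m[3])) {
            // Second alternative matched: action with two values.
            $params[$m[3]] = [$m[4], $m[5]];
        } else {
            // First alternative matched: action with one numeric value.
            $params[$m[1]] = (int) $m[2];
        }
    }

    return $params;
}

print_r(parseActions('/page/2/per_page/10/sort/name/asc'));
```

Checking the action first, as the answer suggests, is what tells the loop how many values to read from each match.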
Hope that's helpful.
Don't use regular expressions. Just because you're operating on a string doesn't mean that a regex is the way to go.
Split apart your path on / into an array and then deal with each part of the path as an individual element of the array.
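$parts = explode( '/', $path );
// explode() returns strings, so test for digits with ctype_digit() rather than is_integer()
if ( ( $parts[0] == 'page' ) && isset( $parts[1] ) && ctype_digit( $parts[1] ) ) {
....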
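A fuller version of the split-on-slash idea might look like the following. It is a sketch under assumptions: the function name parsePagination, the returned array shape, and the choice to silently skip unrecognized segments are all made up for illustration and are not part of the answer above.

```php
<?php
/**
 * Walk the path segments and pick up page, per_page and sort in any order.
 */
function parsePagination(string $path): array
{
    $parts  = explode('/', trim($path, '/'));
    $count  = count($parts);
    $params = [];

    for ($i = 0; $i < $count; $i++) {
        $next = $parts[$i + 1] ?? '';

        if (($parts[$i] === 'page' || $parts[$i] === 'per_page') && ctype_digit($next)) {
            // Keyword followed by a purely numeric segment.
            $params[$parts[$i]] = (int) $next;
            $i += 1;
        } elseif ($parts[$i] === 'sort'
            && isset($parts[$i + 2])
            && in_array($parts[$i + 2], ['asc', 'desc'], true)) {
            // Keyword followed by a sort key and a direction.
            $params['sort'] = [$next, $parts[$i + 2]];
            $i += 2;
        }
    }

    return $params;
}

print_r(parsePagination('/sort/name/desc/page/3/per_page/25/'));
```

Each recognized keyword consumes its value segments, so the order of the three parts in the URL does not matter.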

RegEx with character set inside positive lookbehind, Is it possible?

I need to match "name" only after "listing", but of course those words could be any url directory or page.
mydomain.com/listing/name
so the only thing I can "REGuest" (request) is to be some parent directory there.
In other words, I want to match the "position" i.e. whatever comes 2nd after the domain.
I'm trying something like
(?<=mydomain\.com/[^/\?&]+/)[^/\?&]+(?:/)?
But the character set won't work inside the positive lookbehind, at least it's setup to match only ONE character. As soon as I try to match other than one (e.g. modify it with +, ? or *) it just stops working.
I'm obviously missing the positive lookbehind syntax and it seems not intended for what I'm trying.
How can I match that 2nd level filename?
Thanks.
Regular-expressions.info states that
The bad news is that most regex flavors do not allow you to use just
any regex inside a lookbehind, because they cannot apply a regular
expression backwards. Therefore, the regular expression engine needs
to be able to figure out how many steps to step back before checking
the lookbehind...
(Read further, they even mention Perl, Python and Java.)
I think the quantifier might be the problem. I found this on stackoverflow and briefly flew over it.
Wouldn't it be possible to just match the whole path, and use a group for the second level filename:
mydomain\.com\/[^\/\?&]+\/([^\/\?&]+)(?:\/)?
(note: I had to escape the / for my tests...)
The result of this would be something like:
Array
(
[0] => mydomain.com/listing/name
[1] => name
)
Now, because I don't know the context of your problem, I just assumed you would be able to postprocess the results and get the group 1 (index 1) from the result. If not, I unfortunately don't know...
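In PHP the capture-group approach looks roughly like this. The second variant uses PCRE's \K, which discards everything matched so far from the reported match; it is shown here as an alternative worth trying when only the overall match can be used, not as something from the answers above. The variable names are illustrative.

```php
<?php
$url = 'mydomain.com/listing/name';

// Variant 1: match the whole path and read the second segment from group 1.
preg_match('~mydomain\.com/[^/?&]+/([^/?&]+)/?~', $url, $withGroup);

// Variant 2: \K resets the start of the reported match,
// so element 0 holds only the second segment.
preg_match('~mydomain\.com/[^/?&]+/\K[^/?&]+~', $url, $withK);

print_r($withGroup);
print_r($withK);
```

Neither variant needs a variable-length lookbehind, which is the construct PCRE rejects.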
