PHP regex to detect pagination - php

I'm rewriting a route handling class for an MVC-based site in PHP and need a regex to detect a pagination string in the URL. The pagination string is formed of three different parts:
Page number detection: /page/[NUMERIC]/
Items per page detection: /per_page/[NUMERIC]/
Ordering detection: /sort/[ALMOST_ANY_CHARACTER]/[asc or desc]/
Due to the way it was previously developed, these three parts can be in any order. There are a number of existing links which I need to keep working plus the code used to handle pagination (no plans for a re-write yet) - so changing the pagination code to always generate a consistent url isn't possible.
Therefore, I need to build a regex pattern to detect every possible combination of the pagination structure. I have three patterns to detect each part, which are as follows:
Page number detection: (page/\d+)
Items per page detection: (per_page/\d+)
Ordering detection: (sort/([a-zA-Z0-9\.\-_%=]+)/(asc|desc))
Being new to writing complex (well, this is complex to me anyway!) regex patterns, the only way I can think of doing it is to combine the three patterns I have for each of the URL structures (e.g. /pagenum/ordering/perpage/, /pagenum/perpage/ordering/), using the | operator as an 'or' statement.
Is there a better / more efficient way of doing this?
I am running the regex using preg_match.

You could use lookaheads. After a lookahead has matched, the regex engine jumps back to the position where the lookahead started (that's why it's called *look*ahead; it doesn't actually advance the position in the subject string or include anything in the match). Since you don't know where each part occurs, start all three lookaheads from the beginning of the string, and prepend .* to each one to allow an arbitrary position. Keep the leading slash in front of each keyword, otherwise page/ also matches inside per_page/ and the greedy .* can hand you the per_page value as the page number:
^(?=.*/(page/\d+))(?=.*/(per_page/\d+))(?=.*/(sort/([a-zA-Z0-9\.\-_%=]+)/(asc|desc)))
You can also move the capturing groups around so that they capture only the values:
preg_match(
'~^(?=.*/page/(\d+))(?=.*/per_page/(\d+))(?=.*/sort/([a-zA-Z0-9\.\-_%=]+)/(asc|desc))~',
$input,
$match
);
Now the captures will be:
$match[1] => page number
$match[2] => items per page
$match[3] => sort key
$match[4] => sort order
If any of these can be optional, you can simply make the entire lookahead optional with ?.
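As a rough sketch of how the lookahead pattern could be wrapped for use in a router, assuming every part is preceded by a slash (which also stops page/ from matching inside per_page/). The function name parsePagination and the returned array keys are made up for illustration:

```php
<?php
// Hypothetical helper: returns the pagination values, or null when any part is missing.
function parsePagination(string $path): ?array
{
    // Three lookaheads, all anchored at the start, so the parts may come in any order.
    $pattern = '~^(?=.*/page/(\d+))'
             . '(?=.*/per_page/(\d+))'
             . '(?=.*/sort/([a-zA-Z0-9.\-_%=]+)/(asc|desc))~';

    if (preg_match($pattern, $path, $m) !== 1) {
        return null;
    }

    return [
        'page'     => (int) $m[1],
        'per_page' => (int) $m[2],
        'sort'     => $m[3],
        'order'    => $m[4],
    ];
}

// The same values come back regardless of the order of the parts.
print_r(parsePagination('/users/page/2/per_page/10/sort/name/asc/'));
print_r(parsePagination('/users/sort/name/asc/per_page/10/page/2/'));
```

Because all three lookaheads are required here, a URL that lacks one of the parts yields null.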

You could use lookaheads, but unless I'm missing something, I don't think it's necessary here -- you probably can just use the OR operator:
(/(page/\d+)|/(per_page/\d+)|/(sort/([a-zA-Z0-9\.\-_%=]+)/(asc|desc)))+
The outer group here searches for 1 or more instances of any group 1 OR group 2 OR group 3.
More URL routing tips:
This general approach may actually allow you to simplify things a bit, too. Rather than defining all the rules for your route in the regex, first check for certain types of actions, then handle them in code. The simplest version:
(/(page|per_page)/(\d+))+
Now (for each outer-group match) you'll get a match list containing an "action" and a "value". Switch on the action, process the value accordingly. Note that with preg_match a repeated group only keeps the captures of its last repetition, so to see every action, run the pattern without the outer (...)+ through preg_match_all.
To handle sort as you've spec'd it (two value parameters instead of one), we'll add another layer. To make it more interesting, let's say you decide to add a fourth action, search, which searches a specific field for some content:
(/(page|per_page)/(\d+)|/(sort|search)/([^/]+)/([^/]+))+
Again, when evaluating your match list, check for the action first -- depending on which action it is, you'll know how many successive match values to process.
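A minimal sketch of that "check the action, then process its values" idea. It assumes the inner alternation is run through preg_match_all (without the outer repeated group) so every action is reported; the function name parseActions is hypothetical:

```php
<?php
// Hypothetical helper: collects every action found in the path, in order of appearance.
function parseActions(string $path): array
{
    $pattern = '~/(page|per_page)/(\d+)|/(sort|search)/([^/]+)/([^/]+)~';
    preg_match_all($pattern, $path, $matches, PREG_SET_ORDER);

    $actions = [];
    foreach ($matches as $m) {
        if ($m[1] !== '') {
            // One-value actions: page, per_page.
            $actions[$m[1]] = (int) $m[2];
        } else {
            // Two-value actions: sort, search.
            $actions[$m[3]] = [$m[4], $m[5]];
        }
    }

    return $actions;
}

print_r(parseActions('/page/2/sort/name/asc/per_page/10'));
```

Group 1 is an empty string whenever the second alternative matched, which is what the branch relies on.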
Hope that's helpful.

Don't use regular expressions. Just because you're operating on a string doesn't mean that a regex is the way to go.
Split apart your path on / into an array and then deal with each part of the path as an individual element of the array.
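$parts = explode( '/', trim( $path, '/' ) );
// explode() returns strings, so is_integer() would always be false here; ctype_digit() checks for digits only
if ( ( $parts[0] == 'page' ) && isset( $parts[1] ) && ctype_digit( $parts[1] ) ) {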
....
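One way the elided part might be fleshed out is to walk the segments and consume one or two values after each keyword. This is only a sketch under the question's URL layout; the function name and the result keys are assumptions:

```php
<?php
// Hypothetical helper: scans path segments for page, per_page and sort in any order.
function parsePaginationSegments(string $path): array
{
    // Drop the empty segments produced by leading, trailing or doubled slashes.
    $parts = array_values(array_filter(
        explode('/', $path),
        static fn (string $s): bool => $s !== ''
    ));

    $result = [];
    $count = count($parts);
    for ($i = 0; $i < $count; $i++) {
        $key = $parts[$i];
        if (($key === 'page' || $key === 'per_page')
            && isset($parts[$i + 1]) && ctype_digit($parts[$i + 1])) {
            $result[$key] = (int) $parts[$i + 1];
            $i += 1; // skip the value just consumed
        } elseif ($key === 'sort' && isset($parts[$i + 2])
            && in_array($parts[$i + 2], ['asc', 'desc'], true)) {
            $result['sort'] = $parts[$i + 1];
            $result['order'] = $parts[$i + 2];
            $i += 2; // skip the sort key and direction
        }
    }

    return $result;
}

print_r(parsePaginationSegments('/users/per_page/10/sort/name/desc/page/4/'));
```

Segments that are not part of the pagination string (such as users above) are simply ignored.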

Related

Problems with Router / URL Matcher class

Since regular expressions aren't really my specialty, I need help with this little problem (in PHP).
I want to match a given url with an array of defined routes, e.g.:
$definedRoute = '/admin/user/[:id]/edit';
$url = '/admin/user/37/edit';
In my class, it would be like this, I imagine (getRoutes() returns an array of defined routes):
foreach ($this->getRoutes() as $route) {
$pattern = '~' . preg_replace('~\[\:[a-z]+\]~', '[a-z0-9]+',
str_replace('/', '\/', $route['definition'])) . '~';
if (preg_match($pattern, $url)) {
$parameters = $this->getRouteParameters($route['definition']);
(new $route['class']())->{$route['method']}($parameters);
// die? break?
}
}
I went about it like this: replace every occurrence of a named parameter like [:id] with a regex for lowercase letters and numbers, e.g. [a-z0-9]+.
This actually works, but in some cases it matches multiple routes and therefore the wrong ones. Also, ~\/~ matches in most cases. But every URL should be matched only once.
Edit #1: the problem is: routes get matched multiple times. How can I prevent this?
Can someone enlighten me?
I don't know if this will cover every conceivable case, but you can use preg_match_all or preg_match on one combined pattern instead of iterating over the patterns. It should also improve performance.
This makes the match order (left to right) significant, which is awkward to achieve with an array and a loop (you can, but it's uglier). Then we can sort the routes by complexity, like this:
//this is intentionally in the opposite order of what I want it.
$routes = ['definition' => ['\/', '\/admin\/user\/[a-z0-9]+\/edit']];
//the more / separators a route has, the closer to the beginning we want it, so the more complex regexes go first.
uasort($routes['definition'], function($a,$b){
//count the number of / in the route
//note the <=> spaceship (as it's called) is only available in PHP7+
return substr_count($b, '/') <=> substr_count($a, '/');
});
$url = '/admin/user/37/edit';
//in regex the pipe | is OR
preg_match_all('~^('.implode('|', $routes['definition']).')~i', $url, $matches);
print_r($matches);
Sandbox
Output:
Array
(
[0] => Array
(
[0] => /admin/user/37/edit
)
[1] => Array
(
[0] => /admin/user/37/edit
)
)
Which is correct in this case. Even if you do get multiple matches, you can compare their lengths with strlen and take the longest as the "best" match. This is pretty simple, probably by sorting on length, so I will leave it up to you.
However I wouldn't call this a guarantee of it working 100% of the time, it's just the first thing that came to me.
Another Idea
Another idea: you are not anchoring the match to the start and end of the string. In theory the route should match the entire string, so with my example above you can add ^ and $ here:
preg_match_all('~^('.implode('|', $routes['definition']).')$~i', $url, $matches);
This will ensure a full match and in this case ~\/~ will not match even if the array is not sorted, as you can see in the sandbox below.
Sandbox
That said it's not inconceivable you would only have/need a partial match. This is up to you and how you build your routes and URLs. You can of course just use the ^ start as with a begins with type of match, but you would need to sort them in that case.
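To illustrate the fully anchored variant, here is a small sketch that turns each [:name] placeholder from the question into a named group and stops at the first route matching the whole URL. The function name matchRoute and the shape of its return value are assumptions, not part of the original class:

```php
<?php
// Hypothetical helper: returns the first definition that matches the whole URL, plus its parameters.
function matchRoute(array $definitions, string $url): ?array
{
    foreach ($definitions as $definition) {
        // Build the regex segment by segment so literal text is always escaped.
        $parts = array_map(function ($segment) {
            if (preg_match('~^\[:([a-z]+)\]$~', $segment, $m)) {
                return '(?P<' . $m[1] . '>[a-z0-9]+)';
            }
            return preg_quote($segment, '~');
        }, explode('/', $definition));

        // ^ and $ force a full match, so "/" no longer matches every URL.
        $regex = '~^' . implode('/', $parts) . '$~';

        if (preg_match($regex, $url, $m) === 1) {
            // Keep only the named captures (string keys).
            $params = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
            return ['definition' => $definition, 'params' => $params];
        }
    }

    return null;
}

print_r(matchRoute(['/', '/admin/user/[:id]/edit'], '/admin/user/37/edit'));
```

Returning from inside the loop is what guarantees that a URL is dispatched at most once, even if a later definition would also match.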
Preg Match vs Preg Match All
preg_match will also work, but it only returns the first match. So if the pattern matches multiple times you cannot compare the matches to find the best one. This may be fine if you use ^ and $.
Hope it helps.

Match first occurrence of a word and ignoring duplicates in a named group

I'm writing a PHP router engine for practice and I'm currently doing the regular expressions for it.
A mapped URL can have parameter patterns, which are written like {type:varName}. I don't want to allow multiple occurrences of the variable name, which is varName in this case.
I've currently got this regex for it:
{(?<key>[a-zA-Z]{1,4}):(?<name>[a-zA-Z_]\w*\b)(?!.*\1\b)}
(live version here)
The problem is that it only checks for duplicates of the <key> group and not of the <name> group. It also finds the last occurrence instead of the first one.
How do I write this regular expression so that it only matches the first occurrence of the <name> group and does not match the duplicates of this first match?
Example
When you have a pattern like this:
{s:varName}-{i:varName}-{s:varName}
Only the first {s:varName} should match, the other 2 shouldn't be matched.
When there is an pattern like this:
{i:varName1}-{d:varName1}-{i:varName2}-{i:varName3}-{m:varName3}
{i:varName1}, {i:varName2} and {i:varName3} should match.
Update
Thanks to @sln I ended up with this regular expression:
{(?<key>[a-zA-Z]{1,4}):(?<name>[a-zA-Z_]+\b)}(?:(?!.*{[a-zA-Z_]{1,4}:\2))
The only problem with this is that it doesn't match the first occurrence but the last one found.
What am I doing wrong here?
You can use a workaround: give each group a unique proxy name (so there are no duplicates) and pick out what you want in code.
If you want regexp:
{s:varName}-{i:varName}-{s:varName}
write out:
{s:varName-1}-{i:varName-2}-{s:varName-3}
And write some logic:
get all groups for varName-* (varName-1, varName-2, varName-3),
get what you want (for example, the first occurrence is varName-1).
For this regexp:
{i:varName1}-{d:varName1}-{i:varName2}-{i:varName3}-{m:varName3}
write:
{i:varName1-1}-{d:varName1-2}-{i:varName2-1}-{i:varName3-1}-{m:varName3-2}
And the same logic:
get all groups for varName1-* (varName1-1, varName1-2), for varName2-*, varName3-*, etc.
get *-1 from all multi-groups (varName1-1, varName2-1, varName3-1)
I use this workaround because some other languages (for example Java) don't support duplicate group names.
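A different take on the same "resolve it in code" idea, sketched in PHP: match every placeholder with a plain pattern and keep only the first one seen for each name. This is not the renaming scheme described above, just a simpler variant; the function name firstPlaceholders is hypothetical:

```php
<?php
// Hypothetical helper: returns the first {type:name} placeholder for each distinct name.
function firstPlaceholders(string $pattern): array
{
    preg_match_all(
        '~\{(?<key>[a-zA-Z]{1,4}):(?<name>[a-zA-Z_]\w*)\}~',
        $pattern,
        $matches,
        PREG_SET_ORDER
    );

    $seen = [];
    foreach ($matches as $m) {
        // Later duplicates of a name are ignored; the first one wins.
        if (!isset($seen[$m['name']])) {
            $seen[$m['name']] = $m[0];
        }
    }

    return array_values($seen);
}

print_r(firstPlaceholders('{i:varName1}-{d:varName1}-{i:varName2}-{i:varName3}-{m:varName3}'));
```

The regex in the question keeps the last occurrence because its negative lookahead rejects every placeholder that is followed by a duplicate, and only the final one has none after it.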

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.
When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
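The two steps could look roughly like this in PHP. The function name parseUserAgent and the returned keys are made up for the example, and trim is added on the assumption that the spaces after the commas are not wanted:

```php
<?php
// Hypothetical helper: step 1 extracts version and list, step 2 splits the list.
function parseUserAgent(string $line): ?array
{
    if (preg_match('~SomeText/([\d.]*) \(([^)]*)\)~', $line, $m) !== 1) {
        return null;
    }

    // An empty pair of parentheses yields an empty list rather than [''].
    $items = $m[2] === '' ? [] : array_map('trim', explode(',', $m[2]));

    return ['version' => $m[1], 'items' => $items];
}

print_r(parseUserAgent('SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)'));
```

This handles zero, three or any other number of entries without changing the pattern.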
Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G) the lookbehind is the part that matches should be glued to
\G matches where the previous match ended; (?!^) prevents it from matching at the start of the string
[(,]? * matches an optional opening parenthesis or comma followed by any number of spaces
\K resets the beginning of the reported match
[^,)(]+ matches the wanted characters, i.e. anything that is not ( ) or ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one, without the lookbehind, is a bit more accurate (it also requires the opening parenthesis), performs better (needs fewer steps) and is probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101
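For reference, a sketch of how the capture-group version of the \G pattern might be consumed from PHP with preg_match_all. The function name and result keys are assumptions; the version is trimmed because [^(]* also picks up the space before the parenthesis:

```php
<?php
// Hypothetical helper: the first match carries the version, each following match one list entry.
function parseWithContinuation(string $line): array
{
    $pattern = '~SomeText/([^(]*)\(|\G(?!^),? *([^,)]+)~';
    preg_match_all($pattern, $line, $matches, PREG_SET_ORDER);

    $version = null;
    $items = [];
    foreach ($matches as $m) {
        if (($m[2] ?? '') !== '') {
            // Second alternative: an entry glued to the previous match via \G.
            $items[] = $m[2];
        } else {
            // First alternative: the entry anchor with the version in group 1.
            $version = trim($m[1]);
        }
    }

    return ['version' => $version, 'items' => $items];
}

print_r(parseWithContinuation('SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)'));
```

Once the closing parenthesis is reached no further match can start, because \G only holds directly after the previous match.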
Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.
I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!

regex matching url

In PHP, the klein routing will match as many routes as it can.
2 routes I have set up are conflicting. They are:
$route1: '/websites/[i:websiteId]/users/[i:id]?'
and
$route2: '/websites/[i:websiteId]/users/[a:filename].[json|csv:extension]?'
This is the URL I'm trying to match, which I think should match the first route and not the second:
/api/v1-test/websites/100/users/4
The regex produced for these two are:
$regex1: `^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))/users(?:/(?P<id>[0-9]++))?$`
$regex2: `^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))/users(?:/(?P<filename>[0-9A-Za-z]++))(?:\.(?P<extension>json|csv))?$`
I mean for it not to match if there is no '.csv' or '.json'. The problem is that it is matching both routes. For the second, the resulting filename is '4' and the extension is blank.
Sending /api/v1-test/websites/100/users/users.csv works correctly and only matches the second route.
I only have control over the route, not the regex or the matching.
Thanks.
This bit here
(?:\.(?P<extension>json|csv))?
at the end of your second regex causes it to match whether or not there's an extension, due to the ? at the very end. A question mark means 0 or 1 of the previous expression. Get rid of that and strings will only match this regex when they have the extension.
To make this change, just remove the question mark from your second route, like so:
$route2: '/websites/[i:websiteId]/users/[a:filename].[json|csv:extension]'
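The effect can be checked outside klein with plain preg_match. The first pattern below is the regex from the question; the second is only an assumption about what klein would generate once the trailing ? is removed from the route:

```php
<?php
// Regex 2 as quoted in the question (extension group is optional).
$withOptionalExtension = '~^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))'
    . '/users(?:/(?P<filename>[0-9A-Za-z]++))(?:\.(?P<extension>json|csv))?$~';

// Assumed result of dropping the "?" from the route: the extension group becomes mandatory.
$withRequiredExtension = '~^/api(?:/(v1|v1-test))/websites(?:/(?P<websiteId>[0-9]++))'
    . '/users(?:/(?P<filename>[0-9A-Za-z]++))(?:\.(?P<extension>json|csv))$~';

$plainUrl = '/api/v1-test/websites/100/users/4';
$csvUrl   = '/api/v1-test/websites/100/users/users.csv';

$results = [
    'optional_plain' => preg_match($withOptionalExtension, $plainUrl),
    'optional_csv'   => preg_match($withOptionalExtension, $csvUrl),
    'required_plain' => preg_match($withRequiredExtension, $plainUrl),
    'required_csv'   => preg_match($withRequiredExtension, $csvUrl),
];

print_r($results);
```

Only the optional variant accepts the extension-less URL, which is the double match described in the question.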
The problem is that the match_type is defined really... weirdly:
$match_types = array(
'i' => '[0-9]++',
'a' => '[0-9A-Za-z]++',
[...]
As such, you can't really capture a sequence corresponding to [a-zA-Z] only... The only option I see would be to use 3 routes:
$route1: '/websites/[i:websiteId]/users/[i:id]?'
$route2: '/websites/[i:websiteId]/users/[a:filename]'
$route3: '/websites/[i:websiteId]/users/[a:filename].[json|csv:extension]'
And to assign the same actions for routes 2 and 3. Then you would have this:
/api/v1-test/websites/100/users/ is matched by 1
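/api/v1-test/websites/100/users/4 is matched by 1 (and also by 2, since [a:filename] accepts digits as well)
/api/v1-test/websites/100/users/test is matched by 2
/api/v1-test/websites/100/users/test.csv is matched by 3
Which is close to the behavior you wanted, apart from the overlap on purely numeric names.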
Another (easier) solution would be to take advantage of this bit of the documentation:
Routes automatically match the entire request URI.
If you need to match only a part of the request URI
or use a custom regular expression, use the @ operator.
You can then define your routes like this:
$route1: '@/websites/[0-9]+/users/[0-9]*$'
$route2: '@/websites/[0-9]+/users/[a-zA-Z]+(\.[a-zA-Z]+)?$'

RegEx with character set inside positive lookbehind, Is it possible?

I need to match "name" only after "listing", but of course those words could be any url directory or page.
mydomain.com/listing/name
so the only thing I can "REGuest" (request) is to be some parent directory there.
In other words, I want to match the "position" i.e. whatever comes 2nd after the domain.
I'm trying something like
(?<=mydomain\.com/[^/\?&]+/)[^/\?&]+(?:/)?
But the character set won't work inside the positive lookbehind, at least it's setup to match only ONE character. As soon as I try to match other than one (e.g. modify it with +, ? or *) it just stops working.
I'm obviously missing the positive lookbehind syntax and it seems not intended for what I'm trying.
How can I match that 2nd level filename?
Thanks.
Regular-expressions.info states that
The bad news is that most regex flavors do not allow you to use just
any regex inside a lookbehind, because they cannot apply a regular
expression backwards. Therefore, the regular expression engine needs
to be able to figure out how many steps to step back before checking
the lookbehind...
(Read further, they even mention Perl, Python and Java.)
I think the quantifier might be the problem. I found this on Stack Overflow and only skimmed it.
Wouldn't it be possible to just match the whole path, and use a group for the second level filename:
mydomain\.com\/[^\/\?&]+\/([^\/\?&]+)(?:\/)?
(note: I had to escape the / for my tests...)
The result of this would be something like:
Array
(
[0] => mydomain.com/listing/name
[1] => name
)
Now, because I don't know the context of your problem, I just assumed you would be able to postprocess the results and get the group 1 (index 1) from the result. If not, I unfortunately don't know...
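In PHP the capture-group approach could be sketched as follows. The function name secondLevelSegment is hypothetical, and ~ is used as the delimiter so the slashes need no escaping:

```php
<?php
// Hypothetical helper: returns the second path segment after the domain, or null if there is none.
function secondLevelSegment(string $url): ?string
{
    // Group 1 captures the segment that follows the first directory.
    if (preg_match('~mydomain\.com/[^/?&]+/([^/?&]+)/?~', $url, $m) === 1) {
        return $m[1];
    }

    return null;
}

var_dump(secondLevelSegment('mydomain.com/listing/name'));
```

Since the variable-length part sits in the main pattern rather than in a lookbehind, the fixed-length restriction no longer applies.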
