CakePHP route regex not matching correctly - php

After updating my CakePHP version from 2.2.2 to 2.6.2, one of my routes stopped working properly.
Router::connect('/articles/:keywords', array('action' => 'search', 'controller' => 'Articles', 'keywords' => null), array('pass' => array('keywords'), 'keywords' => '[A-Za-z0-9\+_]+'));
It takes input such as "World" and "World+News" through a URL such as website.com/articles/World+News and passes whatever comes after articles/ to the search function in the Articles controller. This was working fine up until the update. Now the router skips this route and falls through to my "cannot find route" route whenever the input contains anything other than alphanumeric characters. It's as if the regex isn't matching properly: "World" and "World123" work, but "World+News" does not.
Things I have tried:
Changing the regex to .* just to see if it works. It does.
Changing the route from :keywords to * just to see if it works. It does.
Trying something I know will fail such as excluding anything with letters in the match. It fails to use this route as expected.
I've been scouring everywhere and trying all sorts of regex combinations (the ones I have match successfully in a regex tester) to find out why this route won't match. This was working fine before the update, and I can't find anything in the CakePHP documentation that would explain why it stopped working. As far as I know the expressions are right, and I have confirmed that they fully match using a regex tester. Any help would be appreciated, thanks!

Actually the regex is the problem, or at least part of it.
In earlier versions, CakePHP matched the route regex against the raw URL. That was rather problematic, as it could require very ugly regexes, especially for non-ASCII characters. Now the URL-decoded variant is passed as the matching subject:
https://github.com/cakephp/.../commit/d5283af818b59c5d96355d6e42bbd77e1322d8cb
Since + has a special meaning in URL encoding and actually stands for a space, the decoded subject contains "World News", and your regex no longer matches. The fix is easy: match a space instead of the +, for example [A-Za-z0-9 _]+.
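The effect can be reproduced outside CakePHP. The following is a minimal sketch, assuming the router decodes the path with PHP's urldecode() before matching, as the answer describes; matchesKeywords() is a hypothetical helper and not part of CakePHP.

```php
<?php
// Hypothetical helper (not a CakePHP API): applies a route element pattern
// to a path segment the way the newer router is described to do it,
// i.e. against the URL-decoded value.
function matchesKeywords(string $elementPattern, string $rawSegment): bool
{
    $decoded = urldecode($rawSegment); // "World+News" becomes "World News"
    return preg_match('#^(?:' . $elementPattern . ')$#', $decoded) === 1;
}

// Pattern from the question: no longer matches, the subject contains a space.
var_dump(matchesKeywords('[A-Za-z0-9\+_]+', 'World+News'));

// Pattern with a space instead of the plus sign: matches again.
var_dump(matchesKeywords('[A-Za-z0-9 _]+', 'World+News'));
```

If a literal plus sign (sent as %2B) should also be accepted, the character class would need to keep \+ alongside the space.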

Related

Matching a complicated route with a regular expression

I'm currently writing a request router for a large PHP-based website, but I'm getting stuck trying to use a custom form of expression for my routes.
While I know there are pre-made alternatives and routers that could make my life easier, and would have the same features (in fact, I've been looking at their source code to try and solve this), I'm still a programming student and learning how to create my own can only be a good thing!
Examples:
Here's an example of one of my route expressions:
<protocol (https?)>://<wildcard>.example.com/<controller>/{<lang=en (en|de|pl)>/}<name ([a-zA-Z0-9_-]{8})>
This could match either of these equally well:
http://www.example.com/test/en/hello_123
https://subdomain.example.com/another_test/hello_45
Returning me a nice, handy array like this (for the latter):
array(
    'protocol' => 'https',
    'wildcard' => 'subdomain',
    'controller' => 'another_test',
    'lang' => 'en',
    'name' => 'hello_45'
)
I can also include an array in the first place, with default values that would be overridden by the values found by the router. So, for example, I could leave out the <controller> variable and just write test instead, and then use the array, adding "controller"=>"test".
Here are the rules:
If there's no match, there's no match. Variables have to exist, and if they don't, the route is skipped. Goodbye. Optional sections don't have to exist, luckily.
Anything between <> is a variable. Escaped \<\> are ignored, even when between. The area matching in the URL should be saved to the result array, with the variable name as the key.
Curly braces {} mark a section as optional, and can never be inside a variable <>. Anything between them can be ignored in the target - however, if there is a default value specified for any variables in between, that variable must be added to the result array, using the name as the key, and with the default value as the value. Escaped braces are ignored.
A variable doesn't have to have a default value, but if you add one, it needs to be after an =, like <name=default>.
Regex rules can be added, separated by a space after the name or default value, and encased in brackets (). Escaped brackets are ignored, of course.
Lastly, you can just put regex rules, in brackets, anywhere, if you don't mind matching anything and not getting a result. So, I could just replace <controller> with ([^\/]+), but then I'd have to use the array to set a value for it instead.
What I've Tried:
I've been reading the source code of every Router I can find.
So far, I've written a couple of nasty little regular expressions, but I realised I was completely confused about how to combine and extend them.
This matches the curly braces, ignoring escaped ones: {([^{\\]*(?:\\.[^}\\]*)*)}
This matches a variable, with or without the default value: <([^<\\]*(?:\\.[^>\\]*)*)(?:=?([^<>\\]*))>
This is a kind of unholy hell, the like of which made me write this post: <([^<\\]*(?:\\.[^>\\]*)*)(?:=?([^<>\\]*))(?: ?)(\([^{}<>\(\)\\]+\))?>
(It does, however, match the variables and the Regex sections.)
Can anybody give me any hints, or even example source code from libraries that offer similar functionality? And if this is really near impossible to code myself, is there a library good enough to use?
If you are trying to match the domain, this regex101 demo should match those portions with the individual sections named.
On the other hand, if you are trying to match the route expression, this other regex101 demo is able to parse the tokens you specified so far.
I may have missed some specifications, but you can always leave feedback and explain where it falls short (or even update the regex on that site itself and save a newer version).
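One way to approach the rules from the question is to compile the route expression into a single PCRE pattern with named groups instead of parsing it with one large regex. The following is a rough sketch under simplifying assumptions, not a complete implementation: escaped \< \> \{ \} are not handled, bare bracketed regexes outside variables are treated as literal text, inline regexes must not contain the # delimiter, and a variable without a regex matches one path segment ([^/]+). The names compileRoute() and matchRoute() are hypothetical.

```php
<?php
// Hypothetical helper: turns a route expression into [pattern, names, defaults].
function compileRoute(string $route): array
{
    // Split into variables (<...>), optional-section braces, and literal text.
    $tokens = preg_split('/(<[^<>]+>|[{}])/', $route, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $regex = '';
    $names = [];
    $defaults = [];
    foreach ($tokens as $token) {
        if ($token === '{') {
            $regex .= '(?:';                      // open optional section
        } elseif ($token === '}') {
            $regex .= ')?';                       // close optional section
        } elseif ($token[0] === '<') {
            // <name>, <name=default>, <name (regex)> or <name=default (regex)>
            if (!preg_match('/^<(\w+)(?:=([^\s>]*))?(?:\s+\((.+)\))?>$/', $token, $m)) {
                throw new InvalidArgumentException("Malformed variable: $token");
            }
            $names[] = $m[1];
            if (($m[2] ?? '') !== '') {
                $defaults[$m[1]] = $m[2];
            }
            $inner = ($m[3] ?? '') !== '' ? $m[3] : '[^/]+';
            $regex .= '(?P<' . $m[1] . '>' . $inner . ')';
        } else {
            $regex .= preg_quote($token, '#');    // literal text
        }
    }
    return ['#^' . $regex . '$#', $names, $defaults];
}

// Hypothetical helper: returns the variable array, or null if the URL does not match.
function matchRoute(string $route, string $url, array $defaults = []): ?array
{
    [$pattern, $names, $routeDefaults] = compileRoute($route);
    if (!preg_match($pattern, $url, $m)) {
        return null;
    }
    $result = $defaults;                          // caller-supplied defaults first
    foreach ($names as $name) {
        if (isset($m[$name]) && $m[$name] !== '') {
            $result[$name] = $m[$name];           // value found in the URL
        } elseif (isset($routeDefaults[$name])) {
            $result[$name] = $routeDefaults[$name]; // default from <name=default>
        }
    }
    return $result;
}

$route = '<protocol (https?)>://<wildcard>.example.com/<controller>/'
       . '{<lang=en (en|de|pl)>/}<name ([a-zA-Z0-9_-]{8})>';
print_r(matchRoute($route, 'https://subdomain.example.com/another_test/hello_45'));
```

Because the compiled pattern is anchored at both ends, the name hello_123 from the first example URL would be rejected by this sketch: it is nine characters long, while the route asks for exactly eight.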

Can't rewrite URL because parameters are dynamic

I'm making an API, everything is handled inside the file, so here's what an example URL might look like.
https://website.com/api/?type=search&user=bob
And I'd want that to turn into
https://website.com/api/search/bob
But now here's the other part to this issue. I have another type, which is CSRF
https://website.com/api/?type=csrf
And that would be
https://website.com/api/csrf/
Note that it's one parameter short, yet it still works off the same file. Nothing I've tried seems to work correctly. Additionally, there always seems to be a \ added to the api file. I've already removed the .php from there.
So when I try this, it doesn't work. Any ideas?
rewrite ^/([a-zA-Z0-9_-]+)/([0-9]+)$ /api/?type=$1&user=$2;
Your problem seems to be that you use $2 for the username, and $2 corresponds to ([0-9]+) in your regular expression.
That means the username would have to consist of digits only.
Change your expression to:
rewrite ^/([a-zA-Z0-9\_\-]+)/([a-zA-Z0-9\_\-]+)$ /api/?type=$1&user=$2;
And your rules should work.
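Since nginx evaluates rewrite patterns with PCRE, candidate patterns can be tried out with PHP's preg_match() before touching the server configuration. The sketch below goes a step beyond the rule above and is only an assumption about what the asker needs: it anchors the pattern on /api/ and makes the second segment optional, so that /api/csrf/ is accepted as well. The function name apiParams() is hypothetical.

```php
<?php
// Hypothetical helper: maps a request path to the query parameters
// the API script expects, or returns null when the path does not match.
function apiParams(string $path): ?array
{
    // First segment: type. Second segment (optional): user.
    $pattern = '#^/api/([A-Za-z0-9_-]+)(?:/([A-Za-z0-9_-]+))?/?$#';
    if (!preg_match($pattern, $path, $m)) {
        return null;
    }
    return ['type' => $m[1], 'user' => $m[2] ?? null];
}

print_r(apiParams('/api/search/bob'));
print_r(apiParams('/api/csrf/'));

// The pattern from the question rejects a non-numeric user segment.
var_dump(preg_match('#^/([a-zA-Z0-9_-]+)/([0-9]+)$#', '/search/bob'));
```

If the pattern behaves as intended here, the same expression can be carried over to the rewrite directive, with $1 and $2 as the captured type and user.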

PHP Regex match url path

I have a url in the following formats
/fixed1/term1/fixed2/term2
/fixed1/term1/term2/fixed2/term3
/fixed1/term1/term2/...termN.../fixed2/termN+1
in all cases I need the regex to return me all the terms (not including the fixedN).
term can be anything (as long as it's a valid url)
I managed to get until
fixed1\/([^\/]+)\/fixed2\/(.*)
which works fine for
/fixed1/term1/fixed2/term2
but does not work properly on the other cases (when I have multiple terms between the two fixed words)
Any suggestions?
Your regex
[^\/]+
stops at the first forward slash, which is why it finds term1 just fine but cannot span several terms. As a lazy first stab at a solution, I would try
fixed1\/(.+?)\/fixed2\/(.*)
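With that pattern the first group holds all terms between the two fixed words as one slash-separated string, so it still has to be split. A minimal sketch of this step, where extractTerms() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns every term, skipping the fixed1/fixed2 markers.
function extractTerms(string $path): array
{
    if (!preg_match('#fixed1/(.+?)/fixed2/(.*)#', $path, $m)) {
        return [];
    }
    // $m[1] holds "term1/term2/...", $m[2] holds the term after fixed2.
    return array_merge(explode('/', $m[1]), [$m[2]]);
}

print_r(extractTerms('/fixed1/term1/fixed2/term2'));
print_r(extractTerms('/fixed1/term1/term2/fixed2/term3'));
```

Using # as the delimiter avoids having to escape every forward slash in the pattern.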

Trying to stop regex at a tag

I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#\[\]@!$&'()*+,;=%]*
Some of these characters are only allowed in specific places (such as ?), so if you want better validation you will need something more elaborate.
Instead of \S, use a negated class that excludes whitespace and the opening tag character:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^\s<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
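The difference between \S* and a class that excludes both whitespace and < can be checked with a short script. This is a sketch with made-up sample text; extractUrls() is a hypothetical helper that plugs a path character class into the URL pattern from the question.

```php
<?php
// Hypothetical helper: finds URLs in $text, using $pathClass as the
// character class for everything after the first slash of the path.
function extractUrls(string $text, string $pathClass): array
{
    $pattern = '/(?:https?|ftps?):\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(?:\/'
             . $pathClass . '*)?/';
    preg_match_all($pattern, $text, $m);
    return $m[0];
}

$text = 'Read http://example.com/page<br>then continue';

// \S keeps going through "<br>then" until the next whitespace.
print_r(extractUrls($text, '\S'));

// Excluding whitespace and "<" stops right before the tag.
print_r(extractUrls($text, '[^\s<]'));
```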
You could use the one presented here:
http://regex101.com/r/zV1uI7
At the bottom of that page, the pattern is explained step by step.

PHP Regex URL parsing issues preg_replace

I have a custom markup parsing function that has been working very well for many years. I recently discovered a bug that I hadn't noticed before, and I haven't been able to fix it. If anyone can help me with this, that'd be awesome. I have a custom-built forum and text-based MMORPG, and every input is sanitized and parsed for BBCode-like markup. It also picks out URLs and turns them into proper links that go to an exit page with a disclaimer that you're leaving the site. The issue is that when a user posts multiple URLs in a text box (let's say \n delimited), it only converts every other URL into a link. Here's the parser for URLs:
$markup = preg_replace("/(^|[^=\"\/])\b((\w+:\/\/|www\.)[^\s<]+)" . "((\W+|\b)([\s<]|$))/ei", '"$1".shortURL("$2")."$4"', $markup);
As you can see, it calls a PHP function, but that's not the issue here. The entire text block is passed into this preg_replace at once, rather than line by line or by any other means.
If there's a simpler way of writing this preg_replace, please let me know
If you can figure out why this is only parsing every other URL, that's my ultimate goal here
Example INPUT:
http://skylnk.co/tRRTnb
http://skylnk.co/hkIJBT
http://skylnk.co/vUMGQo
http://skylnk.co/USOLfW
http://skylnk.co/BPlaJl
http://skylnk.co/tqcPbL
http://skylnk.co/jJTjRs
http://skylnk.co/itmhJs
http://skylnk.co/llUBAR
http://skylnk.co/XDJZxD
Example OUTPUT:
http://skylnk.co/tRRTnb
<br>http://skylnk.co/hkIJBT
<br>http://skylnk.co/vUMGQo
<br>http://skylnk.co/USOLfW
<br>http://skylnk.co/BPlaJl
<br>http://skylnk.co/tqcPbL
<br>http://skylnk.co/jJTjRs
<br>http://skylnk.co/itmhJs
<br>http://skylnk.co/llUBAR
<br>http://skylnk.co/XDJZxD
<br>
The e flag in preg_replace is deprecated as of PHP 5.5 and was removed in PHP 7.0. You can use preg_replace_callback to get the same functionality.
The i flag does very little here: \w already matches both upper case and lower case, and there is no backreference in your pattern. It only affects the literal www, so keep it if WWW. should be recognised too.
I set the m flag, which makes ^ and $ match at the beginning and the end of each line rather than only at the beginning and the end of the entire string. Without it, each match consumes the newline after the URL, so (^|[^=\"\/]) has nothing left to match in front of the next URL and that URL is skipped. This should fix your weird problem of matching every other line.
I also made some of the groups non-capturing, (?:pattern), since the bigger capturing groups have captured the text already.
The code below is not tested. I only tested the regex in a regex tester.
$markup = preg_replace_callback(
    "/(^|[^=\"\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/m",
    function ($m) {
        // $m[1]: text before the URL, $m[2]: the URL, $m[3]: text after it
        return $m[1] . shortURL($m[2]) . $m[3];
    },
    $markup
);
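The effect of the m flag can be checked with a small script. This is a sketch only: shortURL() below is a stand-in stub, since the asker's real function is not shown, and linkify() is a hypothetical wrapper that takes the pattern flags as a parameter so both variants can be compared.

```php
<?php
// Stand-in stub for the asker's shortURL(); the real implementation is unknown.
function shortURL(string $url): string
{
    return '<a href="' . $url . '">' . $url . '</a>';
}

// Hypothetical wrapper: runs the replacement with the given pattern flags.
function linkify(string $markup, string $flags): string
{
    $pattern = '/(^|[^="\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/' . $flags;
    return preg_replace_callback(
        $pattern,
        function ($m) {
            return $m[1] . shortURL($m[2]) . $m[3];
        },
        $markup
    );
}

$input = "http://a.example/one\nhttp://a.example/two\nhttp://a.example/three";

// Without m, the second URL is skipped; with m, every line is converted.
echo substr_count(linkify($input, ''), '<a href='), "\n";
echo substr_count(linkify($input, 'm'), '<a href='), "\n";
```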
