Matching a complicated route with a regular expression - php

I'm currently writing a request router for a large PHP-based website, but I'm getting stuck trying to use a custom form of expression for my routes.
While I know there are pre-made alternatives and routers that could make my life easier, and would have the same features (in fact, I've been looking at their source code to try and solve this), I'm still a programming student and learning how to create my own can only be a good thing!
Examples:
Here's an example of one of my route expressions:
<protocol (https?)>://<wildcard>.example.com/<controller>/{<lang=en (en|de|pl)>/}<name ([a-zA-Z0-9_-]{8})>
This could match either of these equally well:
http://www.example.com/test/en/hello_123
https://subdomain.example.com/another_test/hello_45
Returning me a nice, handy array like this (for the latter):
array(
'protocol' => 'https',
'wildcard' => 'subdomain',
'controller' => 'another_test',
'lang' => 'en',
'name' => "hello_45"
)
I can also include an array in the first place, with default values that would be overridden by the values found by the router. So, for example, I could leave out the <controller> variable and just write test instead, and then use the array, adding "controller"=>"test".
Here are the rules:
If there's no match, there's no match. Variables have to exist, and if they don't, the route is skipped. Goodbye. Optional sections don't have to exist, luckily.
Anything between <> is a variable. Escaped \<\> are ignored, even when between. The area matching in the URL should be saved to the result array, with the variable name as the key.
Curly braces {} mark a section as optional, and can never be inside a variable <>. Anything between them can be ignored in the target - however, if there is a default value specified for any variables in between, that variable must be added to the result array, using the name as the key, and with the default value as the value. Escaped braces are ignored.
A variable doesn't have to have a default value, but if you add one, it needs to be after an =, like <name=default>.
Regex rules can be added, separated by a space after the name or default value, and encased in brackets (). Escaped brackets are ignored, of course.
Lastly, you can just put Regex rules, in brackets, anywhere, if you don't mind matching anything and not getting a result. So, I could just replace <controller> with ([^\/]+), but then I'd have to use the array to set a value for it instead.
What I've Tried:
I've been reading the source code of every Router I can find.
So far, I've written a couple of nasty little regular expressions, but I realised I was completely confused about how to combine and extend them.
This matches the curly braces, ignoring escaped ones: {([^{\\]*(?:\\.[^}\\]*)*)}
This matches a variable, with or without the default value: <([^<\\]*(?:\\.[^>\\]*)*)(?:=?([^<>\\]*))>
This is a kind of unholy hell, the like of which made me write this post: <([^<\\]*(?:\\.[^>\\]*)*)(?:=?([^<>\\]*))(?: ?)(\([^{}<>\(\)\\]+\))?>
(It does, however, match the variables and the Regex sections.)
Can anybody give me any hints, or even example source code from libraries that offer similar functionality? And if this is really near impossible to code myself, is there a library good enough to use?

If you are trying to match the domain, this regex101 demo should match those portions with the individual sections named.
On the other hand, if you are trying to match the route expression, this other regex101 demo is able to parse the tokens you specified so far.
I may have missed some specifications, but you can always leave feedback and explain where it falls short (or even update the regex on that site itself and save a newer version).
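For the route-expression side, one possible approach is to tokenize the expression and translate each token into a piece of a PCRE pattern with named groups. The following is only a minimal sketch under several assumptions: the function names compileRoute and matchRoute are made up for illustration, variable regexes contain no nested parentheses, variables without a rule default to [^/]+, and PHP 7.4 or later is available.

```php
<?php
// Sketch only: compileRoute() and matchRoute() are hypothetical helper names.
// Assumes PHP 7.4+, and that (regex) rules contain no nested parentheses.
function compileRoute(string $route): array
{
    $token = '~' . implode('|', [
        '\\\\(?<esc>.)',                                   // escaped character, taken literally
        '<(?<name>\w+)(?:=(?<default>[^\s<>()]*))?(?:\s*\((?<rule>[^()]*)\))?>', // <name=default (rule)>
        '(?<open>\{)',                                     // start of an optional section
        '(?<close>\})',                                    // end of an optional section
        '\((?<bare>[^()]*)\)',                             // bare (regex): matched, not captured
        '(?<text>[^<>{}()\\\\]+)',                         // literal text
    ]) . '~';

    preg_match_all($token, $route, $tokens, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);

    $regex = '';
    $defaults = [];
    foreach ($tokens as $t) {
        if ($t['esc'] !== null) {
            $regex .= preg_quote($t['esc'], '~');
        } elseif ($t['name'] !== null) {
            // Variables without a rule match one path segment (an assumption).
            $regex .= '(?<' . $t['name'] . '>' . ($t['rule'] ?? '[^/]+') . ')';
            if ($t['default'] !== null) {
                $defaults[$t['name']] = $t['default'];
            }
        } elseif ($t['open'] !== null) {
            $regex .= '(?:';
        } elseif ($t['close'] !== null) {
            $regex .= ')?';
        } elseif ($t['bare'] !== null) {
            $regex .= '(?:' . $t['bare'] . ')';
        } else {
            $regex .= preg_quote($t['text'], '~');
        }
    }
    return ['~^' . $regex . '$~', $defaults];
}

function matchRoute(string $route, string $url, array $defaults = []): ?array
{
    [$regex, $routeDefaults] = compileRoute($route);
    if (!preg_match($regex, $url, $m, PREG_UNMATCHED_AS_NULL)) {
        return null; // no match: the route is skipped
    }
    // Keep only named groups that actually took part in the match.
    $found = array_filter($m, fn($v, $k) => is_string($k) && $v !== null, ARRAY_FILTER_USE_BOTH);
    // Matched values win over route defaults, which win over the supplied array.
    return $found + $routeDefaults + $defaults;
}

$route = '<protocol (https?)>://<wildcard>.example.com/<controller>/{<lang=en (en|de|pl)>/}<name ([a-zA-Z0-9_-]{8})>';
$result = matchRoute($route, 'https://subdomain.example.com/another_test/hello_45');
```

With the route above, $result should hold protocol, wildcard, controller and name from the URL, plus lang set to its default of en. Note that hello_123 is nine characters long, so under a strict {8} rule the first example URL in the question would not match.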

Related

Using wild cards in str_replace URL

I'm trying to use str_replace to edit URLs in a CSV import in WordPress, using WP All Import.
This code works
[str_replace("https://oldsite.com/wp-content/uploads/2022/10/Image-Download.jpg", "http://newsite.com/wp-content/uploads/", {title_slider[1]})]
The problem is that not all the uploads on the old site are in 2022/10, so I was wondering if there is any way to use a wildcard to replace 2022 with any year and 10 with any month.
I tried uploads/*
I hoped that it might accept that, but what is being produced is a mixed URL of both oldsite and newsite.
I know this would not work in a browser to navigate to the file, but I only need it to work in str_replace.
Current outcome https://oldsite.com/newsite.com/wp-content/uploads/2022/10/Image.jpg
Looking up the documentation for that plugin confirms that the syntax you're using runs an ordinary PHP function.
So to find the direct answer to your question, we can look up the PHP manual for the str_replace function. It's rather straight-forward:
This function returns a string or an array with all occurrences of search in subject replaced with the given replace value.
No special syntax for wildcards or pattern matching. However, that page does link a couple of times to another function, preg_replace, which:
Searches subject for matches to pattern and replaces them with replacement.
That sounds more promising, right? Specifically, it uses regular expressions, which are a common and very powerful way to specify text patterns.
If you haven't heard of "regular expressions" before, the details might be confusing, but there are lots of tutorials and cheat sheets online.
For example, to match "exactly two digits" you can write \d\d, or \d{2}. In PHP, you would write that like this - note the special / or # delimiters around the pattern, inside the quotes; they can be pretty much any character, but if it appears in the pattern, it has to be "escaped", so best to choose something that doesn't appear:
$new_value = preg_replace('/\d\d/', '99', $old_value);
// or
$new_value = preg_replace("/\d\d/", "99", $old_value);
// or
$new_value = preg_replace('#\d\d#', '99', $old_value);
// or
$new_value = preg_replace("#\d\d#", "99", $old_value);
Translating that example into the plugin's syntax, and noting this from the plugin manual:
You must use double quotes when executing PHP functions from within WP All Import.
You would have this:
[preg_replace("/\d\d/", "99", {something[1]})]
// or
[preg_replace("#\d\d#", "99", {something[1]})]
Hopefully you can work out the rest from there.
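As a rough illustration of where that leads, here is a minimal sketch in plain PHP. It assumes the goal is to swap the host and drop the year/month folders; the function names and the sample URL are made up for the example, and the second variant shows how to keep the folders instead.

```php
<?php
// Sketch only: function names and the sample URL are assumptions for illustration.

// Replaces the old uploads prefix, with any 4-digit year and 2-digit month, by the new prefix.
function dropDateFolders(string $url): string
{
    return preg_replace(
        '#https://oldsite\.com/wp-content/uploads/\d{4}/\d{2}/#',
        'http://newsite.com/wp-content/uploads/',
        $url
    );
}

// Same idea, but captures the year and month and puts them back with $1 and $2.
function keepDateFolders(string $url): string
{
    return preg_replace(
        '#https://oldsite\.com/wp-content/uploads/(\d{4})/(\d{2})/#',
        'http://newsite.com/wp-content/uploads/$1/$2/',
        $url
    );
}

$old = 'https://oldsite.com/wp-content/uploads/2021/03/Image-Download.jpg';
$flat = dropDateFolders($old);
$dated = keepDateFolders($old);
```

Here $flat should come out as http://newsite.com/wp-content/uploads/Image-Download.jpg and $dated as http://newsite.com/wp-content/uploads/2021/03/Image-Download.jpg. Inside the plugin, the same pattern would go into the bracketed preg_replace form shown above, with double quotes.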

Setting a route depending on a String in the URL with Symfony

I am pretty new to Symfony.
I need to check if a URL contains a certain word, which can be in any position, and if it does, redirect to a certain page.
For example www.example.com/mystring/, www.example.com/content/mystring or www.example.com/001/content/mystring/ should redirect you to www.example.com/mystring because it contains mystring in the URL
This could be easy:
$routes->add('mystring', '/mystring/')
->controller([MyStringController::class, 'show'])
;
$routes->add('mystring', '{number}/{content}/mystring')
->controller([MyStringController::class, 'show'])
;
And so on. The problem is that mystring could be anywhere in the URL.
I already have a workaround in the controller that redirects you if the string is present; however, I would like a clean solution in the routing file.
So the question is:
Is there any way to set a route depending if a URL contains certain string that can be anywhere and in any order?
Theoretically, yes. A route placeholder usually isn't allowed to contain a /, because it makes things more difficult. However, there are ways to allow it. BUT, there are other problems arising, unless that additional magic string always takes priority over anything else that happens.
To allow a slash inside a placeholder, see https://symfony.com/doc/current/routing/slash_in_parameter.html
So essentially, to allow a slash, you'd have a route with
'/{url}', and requirement 'url' => '.+'
Just a '.+' is not enough for your purpose, though. I'm not absolutely certain about escaping in this case, but it would probably be something like
'url' => '.*\bmystring\b.*'
If \b is allowed, it matches a word boundary (which is probably what you want).
Otherwise '(.+/)*mystring(.+/|$)+' should do the trick.
Also, you shouldn't give multiple routes the same name, and this kind of route definition won't give you the other placeholders you have.
If your special route should extend only existing routes, though, you should probably find a way to cycle through existing routes and add your magic string. But that's a different question ;o)
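To sanity-check the requirement before putting it into a route, it can be tried in isolation with preg_match. This is only a sketch: the wrapping pattern below is a hand-written stand-in for what a '/{url}' route might compile to, not Symfony's actual compiled regex, and the function name is made up.

```php
<?php
// Sketch only: $pattern imitates a '/{url}' route by hand; it is not Symfony's compiled regex.
$requirement = '.*\bmystring\b.*';
$pattern = '#^/(?P<url>' . $requirement . ')$#';

// Hypothetical helper: true when the path satisfies the requirement.
function containsMagicString(string $path, string $pattern): bool
{
    return preg_match($pattern, $path) === 1;
}

$paths = ['/mystring/', '/content/mystring', '/001/content/mystring/', '/content/mystrings', '/content/other'];
$results = [];
foreach ($paths as $path) {
    $results[$path] = containsMagicString($path, $pattern);
}
```

The first three paths should pass and the last two should not, since \b keeps mystrings from counting as a match.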

Recursive Regex in PHP with variable names

I'm trying to make a BBCode-ish engine for my website. The catch is that it isn't known in advance which codes are available, because the codes are defined by the users. On top of that, the whole thing has to be recursive.
For example:
Hello my name is [name user-id="1"]
I [bold]really[/bold] like cheeseburgers
These are the easy ones, and I managed to make them work.
Now the problem is what happens when two of those codes follow each other:
I [bold]really[/bold] like [bold]cheeseburgers[/bold]
Or inside each other
I [bold]really like [italic]cheeseburgers[/italic][/bold]
These codes can also have attributes
I [bold strengh="600"]really like [text font-size="24px"]cheeseburgers[/text][/bold]
The following pattern worked quite well, but lacks the recursive part (?R):
(?P<code>\[(?P<code_open>\w+)\s?(?P<attributes>[a-zA-Z-0-1-_=" .]*?)](?:(?P<content>.*?)\[\/(?P<code_close>\w+)\])?)
I just don't know where to put the (?R) recursion.
Also the system has to know that in this string here
I [bold]really like [italic]cheeseburgers[/italic][/bold] and [bold]football[/bold]
are 2 "code-objects":
1. [bold]really like [italic]cheeseburgers[/italic][/bold]
and
2. [bold]football[/bold]
... and the content of the first one is
really like [italic]cheeseburgers[/italic]
which again has a code in it
[italic]cheeseburgers[/italic]
which content is
cheeseburgers
I have searched the web for two days now and I can't figure it out.
I thought of something like this:
Look for something like [**** attr="foo"] where the attributes are optional and store it in a capturing group
Look up whether there is a closing tag somewhere (it can be optional too)
If a closing tag exists, everything between the two tags should be stored as a "content"-capturing group - which then has to go through the same procedure again.
I hope there are some regex specialists who are willing to help me. :(
Thank you!
EDIT
As this might be difficult to understand, here is an input and an expected output:
Input:
[heading icon="rocket"]I'm a cool heading[/heading][textrow][text]<p>Hi!</p>[/text][/textrow]
I'd like to have an array like
array[0][name] = heading
array[0][attributes][icon] = rocket
array[0][content] = I'm a cool heading
array[1][name] = textrow
array[1][content] = [text]<p>Hi!</p>[/text]
array[1][0][name] = text
array[1][0][content] = <p>Hi!</p>
Having written multiple BBCode parsing systems, I can suggest NOT using regexes only. Instead, you should actually parse the text.
How you do this is up to you, but as a general idea you would want to use something like strpos to locate the first [ in your string, then check what comes after it to see if it looks like a BBCode tag and process it if so. Then, search for [ again starting from where you ended up.
This has certain advantages, such as being able to examine each code and skip it if it's invalid, as well as enforcing proper tag closing order ([bold][italic]Nesting![/bold][/italic] should be considered invalid) and being able to provide meaningful error messages to the user if something is wrong (invalid parameter, perhaps) because the parser knows exactly what is going on, whereas a regex would output something unexpected and potentially harmful.
It might be more work (or less, depending on your skill with regex), but it's worth it.
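A minimal sketch of that idea, using strpos to find each [ and a stack to track open tags. It rests on assumptions: the function name parseBbcode and the node layout are made up, every tag must be closed (standalone tags such as [name user-id="1"] would need extra handling, for example a list of self-closing names), and a small anchored regex is used only to recognise a single tag at the current position.

```php
<?php
// Sketch only: parseBbcode() and the node layout are assumptions for illustration.
// Each node is ['name' => string, 'attributes' => array, 'children' => array of strings and nodes].
function parseBbcode(string $text): array
{
    // The stack starts with a nameless root node that collects top-level content.
    $stack = [['name' => null, 'attributes' => [], 'children' => []]];
    $pos = 0;
    $len = strlen($text);

    while ($pos < $len) {
        $top = count($stack) - 1;
        $open = strpos($text, '[', $pos);
        if ($open === false) {
            $open = $len; // no more brackets: the rest is plain text
        }
        if ($open > $pos) {
            $stack[$top]['children'][] = substr($text, $pos, $open - $pos);
            $pos = $open;
            continue;
        }
        // Anchored check (\G) for a tag at exactly this position.
        if (!preg_match('/\G\[(\/?)(\w+)((?:\s+[\w-]+="[^"]*")*)\s*\]/', $text, $m, 0, $pos)) {
            $stack[$top]['children'][] = '['; // a literal bracket, not a tag
            $pos++;
            continue;
        }
        $pos += strlen($m[0]);
        if ($m[1] === '') {
            // Opening tag: collect its attributes and push a new node.
            preg_match_all('/([\w-]+)="([^"]*)"/', $m[3], $pairs, PREG_SET_ORDER);
            $attributes = [];
            foreach ($pairs as $pair) {
                $attributes[$pair[1]] = $pair[2];
            }
            $stack[] = ['name' => $m[2], 'attributes' => $attributes, 'children' => []];
        } elseif ($top > 0 && $stack[$top]['name'] === $m[2]) {
            // Closing tag matches the innermost open tag: attach the node to its parent.
            $node = array_pop($stack);
            $stack[$top - 1]['children'][] = $node;
        } else {
            throw new InvalidArgumentException('Unexpected closing tag [/' . $m[2] . ']');
        }
    }
    if (count($stack) > 1) {
        throw new InvalidArgumentException('Unclosed tag [' . $stack[count($stack) - 1]['name'] . ']');
    }
    return $stack[0]['children'];
}

$input = '[heading icon="rocket"]I\'m a cool heading[/heading][textrow][text]<p>Hi!</p>[/text][/textrow]';
$tree = parseBbcode($input);
```

For the input from the question, $tree should contain two top-level nodes, heading and textrow, with the text node nested inside textrow. Badly ordered input such as [bold][italic]Nesting![/bold][/italic] raises an exception at the first mismatched closing tag, which is the place to produce a meaningful error message.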

CakePHP route regex not matching correctly

After updating my CakePHP version from 2.2.2 to 2.6.2, one of my routes stopped working properly.
Router::connect('/articles/:keywords', array('action' => 'search', 'controller' => 'Articles', 'keywords' => null), array('pass' => array('keywords'), 'keywords' => '[A-Za-z0-9\+_]+'));
It takes input such as "World" and "World+News" through a url such as website.com/articles/World+News and passes whatever is after articles/ to the search function in the Articles controller. This was working fine up until the update. Now it will pass up the route and go to my "cannot find route" route if there is anything other than alphanumeric characters. It's like the regex isn't matching properly. e.g. "World" and "World123" will work but "World+News" will not.
Things I have tried:
Changing the regex to .* just to see if it works. It does.
Changing the route from :keywords to * just to see if it works. It does.
Trying something I know will fail such as excluding anything with letters in the match. It fails to use this route as expected.
I've been scouring everywhere, trying all sorts of regex combinations (the ones I have match successfully in the tester), and just generally trying to find out why this route will match but I cannot. This was working fine before the update and I can't find anything in the CakePHP documentation that would suggest why this isn't working. As far as I know the expressions have been right and I have confirmed that they fully match using a regex tester. Any help would be appreciated, thanks!
Actually the problem is the regex, at least it's part of the problem.
In earlier versions, CakePHP passed the raw URL as the matching subject, which was rather problematic, as it could require very ugly regexes, especially for non-ASCII characters. Now the URL-decoded variant is passed:
https://github.com/cakephp/.../commit/d5283af818b59c5d96355d6e42bbd77e1322d8cb
Since + has a special meaning in URL encoding and actually stands for a space, your regex won't match anymore. It's rather easy to fix: just match a space instead of the +.
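The effect can be reproduced outside the framework. The sketch below uses PHP's urldecode as a stand-in for the decoding step (that it mirrors what the router does internally is an assumption) and compares the original character class with one that allows a space.

```php
<?php
// Sketch only: urldecode() stands in for the decoding the router applies before matching.
$decoded = urldecode('World+News'); // the plus sign becomes a space

$oldPattern = '/^[A-Za-z0-9\+_]+$/'; // character class from the question
$newPattern = '/^[A-Za-z0-9 _]+$/';  // same class, with a space instead of the plus

$oldMatches = preg_match($oldPattern, $decoded) === 1;
$newMatches = preg_match($newPattern, $decoded) === 1;
```

After decoding, the subject is "World News", so $oldMatches is false and $newMatches is true. In the route definition this would correspond to changing the 'keywords' requirement to '[A-Za-z0-9 _]+'.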

PHP - I need some help with my Regex

I've created a simple template 'engine' in PHP to substitute PHP-generated data into the HTML page. Here's how it works:
In my main template file, I have variables like so:
<title><!-- %{title}% --></title>
I then assign data into those variables for the main page load
$assign = array (
'title' => 'my website - '
);
I then have separate template blocks that get loaded for the content pages. The above really just handles the header and the footer. In one of these 'content template files', I have variables like so:
<!-- %{title=content page}% -->
Once this gets executed, the main template data is edited to include the content page variables resulting in:
<title>my website - content page</title>
It does this with the following code:
if (preg_match('/<!-- %{title=\s*(.*?)}% -->/s', $string, $matches)) {
// Find variable names in the form of %{varName=new data to append}%
// If found, append that new data to the existing data
$string = preg_replace('/<!-- %{title=\s*(.*?)}% -->/s', '', $string);
$varData[$i] .= $matches[1];
}
This basically removes the template variables and then assigns the variable data to the existing variable. Now, this all works fine. What I'm having issues with is nesting template variables. If I do something like:
<!-- %{title=content page (author: <!-- %{name}% -->)}% -->
The pattern, at times, messes up the opening and closing tags of each variable.
How can I fix my regular expression to prevent this?
Thank you.
The answer is you don't do this with regex. Regular expressions describe regular languages. Once you start nesting things, the language is no longer regular. It is, at a minimum, a context-free language ("CFL"). CFLs can only be processed (assuming they're unambiguous) with a stack.
Specifically, regular languages can be represented with a finite state machine ("FSM"). CFLs require a pushdown automaton ("PDA").
An example of the difference is nested tags in HTML:
<div>
<div>inner</div>
</div>
My advice is don't write your own template language. That's been done. Many times. Use Smarty or something in Zend, Kohana or whatever. If you do write your own, do it properly. Parse it.
Why are you rolling your own template engine? If you want this kind of complexity, there's a lot of places that have already come up with solutions for it. You should just plug in Smarty or something like that.
If you're asking what I think you're asking, it's literally impossible. If I read your question correctly, you want to match arbitrarily-nested <!-- ... --> sequences with particular things inside. Unfortunately, regular expressions can only match certain classes of strings; any regular expression can match only a regular language. One well-known language which is not regular is the language of balanced parentheses (also known as the Dyck language), which is exactly what you're trying to match. In order to match arbitrarily-nested comment strings, you need a more powerful tool. I'm fairly sure there are pre-existing PHP template engines; you might look into one of those.
To resolve your problem you should
replace preg_match() with preg_match_all();
find the pattern, and replace them from the last one to the first one;
use a more restrictive pattern like '/<!-- %{title=\s*([^}]*?)}% -->/s'.
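Put together, those three steps might look like the sketch below. The function name extractTitleParts and the sample string are made up for illustration, and the pattern is the restrictive one from the last step.

```php
<?php
// Sketch only: extractTitleParts() is a hypothetical helper name.
// Returns the string with all title variables removed, plus their values in document order.
function extractTitleParts(string $string): array
{
    $pattern = '/<!-- %{title=\s*([^}]*?)}% -->/s';
    preg_match_all($pattern, $string, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $parts = [];
    // Work from the last match to the first so earlier offsets stay valid.
    foreach (array_reverse($matches) as $match) {
        [$fullText, $offset] = $match[0];
        $string = substr_replace($string, '', $offset, strlen($fullText));
        array_unshift($parts, $match[1][0]);
    }
    return [$string, $parts];
}

[$cleaned, $parts] = extractTitleParts('A<!-- %{title=one}% -->B<!-- %{title= two}% -->C');
```

For the sample string, $cleaned should be ABC and $parts should hold one and two. Because [^}] stops at the first closing brace, this handles several variables in a row but still does not resolve variables nested inside one another.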
I've done something similar in the past, and I have encountered the same nesting issue you did. In your case, what I would do is repeatedly search your text for matches (rather than searching once and looping through the matches) and extract the strings you want by searching for anything that doesn't include your closing string.
In your case, it would probably look like this:
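/(<!--((?:(?!<!--|-->).)*)-->)/s
Regexes like this are a nightmare to explain, but basically, ((?:(?!<!--|-->).)*) matches any run of characters that contains neither an opening nor a closing tag (let's call that AAA). It sits inside a matching group that is, itself, your innermost template tag, (<!--AAA-->). Note that a character class such as [^(-->)] would not do this, because it excludes individual characters rather than the sequence -->.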
I'm convinced this sort of templating method is the wrong way to do things, but I've never known enough to do it better. It's always bothered me in ASP and ColdFusion that you had to nest your scripting tags inside HTML and when I started to do it myself, I considered it a personal failure.
Most Regexes I do now are in JavaScript and so I may be missing some of the awesome nuances PHP has via Perl. I'd be happy if someone can write this more cleanly.
I too have run into this problem in the past, although I didn't use regular expressions.
If instead you search from right to left for the opening tag, <!-- %{ in your syntax, using strrpos (PHP5+), then search forwards for the first occurrence of the next closing tag, and then replace that chunk first, you will end up replacing the inner-most nested variables first. This should resolve your problem.
You can also do it the other way around and find the first occurrence of a closing tag, and work backwards to find its corresponding opening tag.
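Here is a minimal sketch of the right-to-left approach. It assumes every variable is written as <!-- %{...}% --> (including the closing }% on nested ones), that name=value appends to a variable and plain names are substituted, and that values do not themselves contain template tags; the function name resolveTemplate and the sample data are made up.

```php
<?php
// Sketch only: resolveTemplate() and the sample variables are assumptions for illustration.
function resolveTemplate(string $tpl, array &$vars): string
{
    $openTag = '<!-- %{';
    $closeTag = '}% -->';

    // The last opening tag in the string always belongs to an innermost variable.
    while (($start = strrpos($tpl, $openTag)) !== false) {
        $end = strpos($tpl, $closeTag, $start);
        if ($end === false) {
            break; // unbalanced tag: leave the rest untouched
        }
        $inner = substr($tpl, $start + strlen($openTag), $end - $start - strlen($openTag));
        if (strpos($inner, '=') !== false) {
            // name=value: append the value to the variable and remove the tag.
            [$name, $value] = explode('=', $inner, 2);
            $vars[$name] = ($vars[$name] ?? '') . $value;
            $replacement = '';
        } else {
            // plain name: substitute the current value.
            $replacement = $vars[$inner] ?? '';
        }
        $tpl = substr($tpl, 0, $start) . $replacement . substr($tpl, $end + strlen($closeTag));
    }
    return $tpl;
}

$vars = ['title' => 'my website - ', 'name' => 'Jane'];
$content = resolveTemplate('<!-- %{title=content page (author: <!-- %{name}% -->)}% -->', $vars);
$page = resolveTemplate('<title><!-- %{title}% --></title>', $vars);
```

The inner name variable is replaced first, so by the time the outer title tag is processed its value already reads "content page (author: Jane)", and $page should end up as <title>my website - content page (author: Jane)</title>.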
