Since regular expressions aren't really my specialty, I need help with this little problem (in PHP).
I want to match a given url with an array of defined routes, e.g.:
$definedRoute = '/admin/user/[:id]/edit';
$url = '/admin/user/37/edit';
In my class, it would be like this, I imagine (getRoutes() returns an array of defined routes):
foreach ($this->getRoutes() as $route) {
$pattern = '~' . preg_replace('~\[\:[a-z]+\]~', '[a-z0-9]+',
str_replace('/', '\/', $route['definition'])) . '~';
if (preg_match($pattern, $url)) {
$parameters = $this->getRouteParameters($route['definition']);
(new $route['class']())->{$route['method']}($parameters);
// die? break?
}
}
I went about it like this: replace every occurrence of a named parameter like [:id] with a character class for lowercase letters and digits, e.g. [a-z0-9]+.
This works, but in some cases it matches more than one route, and therefore the wrong one. In particular, the pattern ~\/~ for the root route matches almost every URL. Every URL should match exactly one route.
Edit #1: the problem is: routes get matched multiple times. How can I prevent this?
Can someone enlighten me?
I don't know if this will cover every conceivable case, but you can combine the routes into a single pattern and call preg_match_all or preg_match once, instead of iterating over the patterns. It should also improve performance.
Joining the routes with | makes the order of the alternatives matter, because the regex engine tries them from left to right and takes the first one that matches. With an array and a loop that is harder to do (it is possible, but uglier). So we sort the routes by complexity first, like this:
//this is intentionally in the opposite order of what I want it.
$routes = ['definition' => ['\/', '\/admin\/user\/[a-z0-9]+\/edit']];
//the more / separators a route has, the closer to the beginning we want it, so the more complex regexes go first.
uasort($routes['definition'], function($a, $b){
    //count the number of / in the route
    //note the <=> spaceship (as it's called) is only available in PHP7+
    return substr_count($b, '/') <=> substr_count($a, '/');
});
$url = '/admin/user/37/edit';
//in regex the pipe | is OR
preg_match_all('~^('.implode('|', $routes['definition']).')~i', $url, $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => /admin/user/37/edit
)
[1] => Array
(
[0] => /admin/user/37/edit
)
)
Which is correct in this case. Even if you do get multiple matches, you can compare their lengths with strlen and take the longest, or "best", match. This is pretty simple with strlen and perhaps a sort by length, so I will leave it up to you.
However, I wouldn't call this a guarantee that it works 100% of the time; it's just the first thing that came to me.
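Picking the longest match could look roughly like the sketch below. It is only an illustration: the $routePatterns list and the $best variable are made up for this example, and the patterns are anchored at the start only, so shorter routes still match as prefixes.

```php
<?php
// Hypothetical route list; each entry is already a regex fragment.
$routePatterns = ['/', '/admin/user/[a-z0-9]+', '/admin/user/[a-z0-9]+/edit'];
$url = '/admin/user/37/edit';

$best = null;
foreach ($routePatterns as $fragment) {
    // Anchor at the start only, so several routes may match the same URL.
    if (preg_match('~^' . $fragment . '~i', $url, $m)) {
        // Keep the candidate whose matched text is the longest so far.
        if ($best === null || strlen($m[0]) > strlen($best['match'])) {
            $best = ['pattern' => $fragment, 'match' => $m[0]];
        }
    }
}
print_r($best);
```

All three patterns match here, and the one that covers the whole URL wins because its matched text is the longest.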
Another Idea
Another issue is that you are not anchoring the match to the start and end of the string. In theory a route should match the entire URL, so in my example above you can add ^ and $ here:
preg_match('~^('.implode('|', $routes['definition']).')$~i', $url, $matches);
This ensures a full match, and in this case ~\/~ will not match even if the array is not sorted.
That said, it's not inconceivable that you would only have or need a partial match. This is up to you and how you build your routes and URLs. You can of course use only the ^ anchor for a "begins with" type of match, but then you need to sort the routes.
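Applied to the [:id] style definitions from the question, a fully anchored matcher might be sketched as follows. routeToRegex() is a hypothetical helper, not something from the original class, and the [a-z0-9]+ parameter pattern is simply carried over from the question. Arrow-free code is used so it should run on PHP 7 as well.

```php
<?php
/**
 * Turn a definition such as '/admin/user/[:id]/edit' into an anchored regex.
 * Hypothetical helper, shown only to illustrate the anchoring idea.
 */
function routeToRegex(string $definition): string
{
    // Split into literal text (even indexes) and parameter names (odd indexes).
    $parts = preg_split('~\[:([a-z]+)\]~', $definition, -1, PREG_SPLIT_DELIM_CAPTURE);
    $regex = '';
    foreach ($parts as $i => $part) {
        $regex .= ($i % 2 === 0)
            ? preg_quote($part, '~')              // literal text, safely escaped
            : '(?<' . $part . '>[a-z0-9]+)';      // named group for the parameter
    }
    // ^ and $ force the route to cover the whole URL.
    return '~^' . $regex . '$~i';
}

$definitions = ['/', '/admin/user/[:id]/edit'];
$url = '/admin/user/37/edit';
$matched = [];
foreach ($definitions as $definition) {
    if (preg_match(routeToRegex($definition), $url, $m)) {
        // Keep only the named groups (string keys) as route parameters.
        $params = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
        $matched[] = ['definition' => $definition, 'params' => $params];
        break; // stop at the first route that matches the whole URL
    }
}
print_r($matched);
```

Because of the $ anchor, the root route '/' no longer matches longer URLs, so the order of the definitions stops mattering for this case.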
Preg Match vs Preg Match All
preg_match will also work, but it only returns the first match. So if the pattern matches multiple times, you cannot compare the matches to find the best one. This may be fine if you use ^ and $.
Hope it helps.
Related
I need to extract from a string 2 parts and place them inside an array.
$test = "add_image_1";
I need to make sure that this string starts with "add_image" and ends with "_1" but only store the number part at the very end. I would like to use preg_split as a learning experience as I will need to use it in the future.
I don't know how to use this function to find an exact word (I tried using "\b" and failed) so I've used "\w+" instead:
$result = preg_split("/(?=\w+)_(?=\d)/", $test);
print_r($result);
This works fine except it also accepts a bunch of other invalid formats such as:
"add_image_1_2323". I need to make sure it only accepts this format. The last digit can be larger than 1 digit long.
Result should be:
Array (
[0] => add_image
[1] => 1
)
How can I make this more secure?
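The following regex requires add_image immediately before the underscore and nothing but digits between the underscore and the end of the string.
Regex: (?<=add_image)_(?=\d+$)
Explanation:
(?<=add_image) looks behind for add_image
(?=\d+$) looks ahead for a number that runs to the end of the string; the _ in between is what gets matched.
Note that the lookbehind alone does not pin add_image to the start of the string. To reject any prefix in front of it, put the anchor inside the lookbehind: (?<=^add_image)_(?=\d+$)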
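A small sketch of how this could be used with preg_split follows. The variant with ^ inside the lookbehind is used here, and treating "exactly two parts" as the validity check is an assumption of this example, not something preg_split does for you.

```php
<?php
// '^' inside the lookbehind pins add_image to the start of the string.
$pattern = '/(?<=^add_image)_(?=\d+$)/';

$results = [];
foreach (['add_image_1', 'add_image_42', 'add_image_1_2323', 'xadd_image_1'] as $test) {
    $parts = preg_split($pattern, $test);
    // A valid name splits into exactly two parts; anything else is rejected.
    $results[$test] = count($parts) === 2 ? $parts : null;
}
print_r($results);
```

When the pattern does not match, preg_split returns the whole input as a single element, which is why counting the parts works as a check.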
Another regex question. I use PHP and have a string: fdjkaljfdlstopfjdslafdj. You can see there is a stop in the middle. I want to replace everything except that stop. I tried [^stop], but it also skips the s later in the string.
My Solution
Thanks for everyone's help here.
I also figured out a solution using pure regex (within the scope of my regex knowledge; PCRE verbs are too advanced for me), but it needs two steps. I don't want to mix PHP functions in, because sometimes the job lies outside of code, e.g. multi-renaming files in Total Commander.
Take the string xxxfooeoropwfoo,skfhlk;afoofsjre,jhgfs,vnhufoolsjunegpq. Say I want to keep every foo in this string and replace each run of non-foo text with ---.
First, I match the non-foo text between each pair of foos: (?<=foo).+?(?=foo).
The string turns into xxxfoo---foo---foo---foolsjunegpq; only the non-foo text at both ends is left now.
Then I use [^-]+(?=foo)|(?<=foo)[^-]+.
This time the result is ---foo---foo---foo---foo---. Everything but foo has been turned into ---.
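For anyone who does want to try the two steps in PHP, they can be reproduced with two preg_replace calls, as in this sketch. It assumes the non-foo text contains no - characters and that no two foos sit directly next to each other; otherwise the second step behaves differently.

```php
<?php
$s = 'xxxfooeoropwfoo,skfhlk;afoofsjre,jhgfs,vnhufoolsjunegpq';

// Step 1: collapse everything that sits between two foos.
$step1 = preg_replace('/(?<=foo).+?(?=foo)/', '---', $s);

// Step 2: collapse the remaining text before the first and after the last foo.
$step2 = preg_replace('/[^-]+(?=foo)|(?<=foo)[^-]+/', '---', $step1);

echo $step1, "\n";
echo $step2, "\n";
```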
I just don't want to include "stop"...
You can skip it by using the PCRE verbs (*SKIP)(*F). Try it like this:
stop(*SKIP)(*F)|.
or sequence: (stop)(*SKIP)(*F)|(?:(?!(?1)).)+
or for words: stop(*SKIP)(*F)|\w+
[^stop] doesn't mean any text that is not stop. It means any single character that is not one of the four characters inside [...], which in this case are s, t, o and p.
Better to split on the text you don't want to match:
$s = 'fdjkaljfdlstopfjdslafdjstopfoobar';
php> $arr = preg_split('/stop/', $s);
php> print_r($arr);
Array
(
[0] => fdjkaljfdl
[1] => fjdslafdj
[2] => foobar
)
You can generalize this to any pattern:
(?<neg>stop)(*SKIP)(*FAIL)|(?s:.)+?(?=\Z|(?&neg))
Just put the pattern you don't want in the neg group.
This regex will try to do the following for any character position:
Match the pattern you don't want. If it matches, discard it with (*SKIP)(*FAIL) and restart another match at this position.
If the pattern you don't want doesn't match at a particular position, then match anything, until either:
You reach the end of the input string (\Z)
Or the pattern you don't want immediately follows the current matching position ((?&neg))
This approach is slower than manually tuning the expression. You can get better performance at the cost of repeating yourself, which avoids the subroutine call:
stop(*SKIP)(*FAIL)|(?s:.)+?(?=\Z|stop)
But of course, the best approach would be to use the features provided by your language: match the string you don't want, then use code to discard it and keep everything else.
In PHP, you can use the PREG_OFFSET_CAPTURE flag to tell the preg_match_all function to provide you the offsets of each match.
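A minimal sketch of that idea: match the unwanted text, then use the reported offsets to keep everything around it. The $kept and $cursor variables are just names chosen for this example, and the array destructuring in foreach needs PHP 7.1 or later.

```php
<?php
$s = 'fdjkaljfdlstopfjdslafdjstopfoobar';

// With PREG_OFFSET_CAPTURE each match is a [text, byte offset] pair.
preg_match_all('/stop/', $s, $matches, PREG_OFFSET_CAPTURE);

$kept = [];
$cursor = 0;
foreach ($matches[0] as [$text, $offset]) {
    // Keep the text between the previous match and this one.
    if ($offset > $cursor) {
        $kept[] = substr($s, $cursor, $offset - $cursor);
    }
    $cursor = $offset + strlen($text);
}
// Keep whatever follows the last match.
if ($cursor < strlen($s)) {
    $kept[] = substr($s, $cursor);
}
print_r($kept);
```

For this input the result is the same as the preg_split example above, but the offsets also tell you where each kept piece came from.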
I have this string authors[0][system:id] and I need a regex that returns:
array('authors', '0', 'system:id')
Any ideas?
Thanks.
Just use PHP's preg_split(), which returns an array of elements much like explode() does, but splits on a regex.
Split the string on [ or ] and then remove the last element of the resulting array, $tokens, which is an empty string.
EDIT: Also remove the third element with array_splice($array, $offset, $length), since this item is also an empty string.
The regex /[\[\]]/ just means: match any [ or ] character.
$string = "authors[0][system:id]";
$tokens = preg_split("/[\]\[]/", $string);
array_pop($tokens);
array_splice($tokens, 2, 1);
//rest of your code using $tokens
Here is the format of $tokens after this has run:
Array ( [0] => authors [1] => 0 [2] => system:id )
Taking the most simplistic approach, we would just match the three individual parts. So first of all we'd look for the token that is not enclosed in brackets:
[a-z]+
Then we'd look for the brackets and the value in between:
\[[^\]]+\]
And then we'd repeat the second step.
You'd also need to add capture groups () to extract the actual values that you want.
So when you put it all together you get something like:
([a-z]+)\[([^\]]+)\]\[([^\]]+)\]
That expression could then be used with preg_match(), and the values you want would be extracted into the array passed by reference as the third argument. But you'll notice the above expression is quite a difficult-to-read collection of punctuation, and also that the resulting array has an extra element on it that we don't want - preg_match() places the whole matched string into the first index of the output array. We're close, but it's not ideal.
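For reference, the preg_match() variant could look like this sketch. The ^ and $ anchors and the array_shift() call are additions for this example; they are not part of the expression above.

```php
<?php
$str = 'authors[0][system:id]';
$m = [];

// Same three groups as above, anchored so the whole string has to fit.
if (preg_match('/^([a-z]+)\[([^\]]+)\]\[([^\]]+)\]$/', $str, $m)) {
    // Index 0 holds the full match; drop it to keep only the three parts.
    array_shift($m);
    print_r($m);
}
```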
However, as @AlienHoboken correctly points out and almost correctly implements, a simpler solution would be to split the string up based on the position of the brackets. First let's take a look at the expression we'd need (or at least, the one that I would use):
(?:\[|\])+
This looks for at least one occurrence of either [ or ] and uses that block as the delimiter for the split. This seems like exactly what we need, except when we run it we'll find we have a small issue:
array('authors', '0', 'system:id', '')
Where did that extra empty string come from? Well, the last character of the input string matches your delimiter expression, so it's treated as a split position - with the result that an empty string gets appended to the results.
This is quite a common issue when splitting based on a regular expression, and luckily PHP knows this and provides a simple way to avoid it: the PREG_SPLIT_NO_EMPTY flag.
So when we do this:
$str = 'authors[0][system:id]';
$expr = '/(?:\[|\])+/';
$result = preg_split($expr, $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
...you will see the result you want.
I have a regular expression that I'm using in PHP:
$word_array = preg_split(
'/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
urldecode($path), -1, PREG_SPLIT_NO_EMPTY
);
It works great. It takes a chunk of URL parameters like:
/2009/06/pagerank-update.html
and returns an array like:
array(4) {
[0]=>
string(4) "2009"
[1]=>
string(2) "06"
[2]=>
string(8) "pagerank"
[3]=>
string(6) "update"
}
The only thing I need is for it to also not return strings that are less than 3 characters. So the "06" string is garbage and I'm currently using an if statement to weed them out.
The magic of the split. My original assumption was technically not correct (although it made for an easier solution). So let's check your split pattern:
(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)
I rearranged it a bit. The outer parentheses are not necessary, and I moved the single characters into a character class at the end:
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]
That is just some tidying up front. Let's call this the split pattern, s for short, and define it as a named subpattern.
You want to match every part that does not contain the split pattern and is at least three characters long.
I could achieve this with the following pattern, which supports the correct split sequences and Unicode.
$pattern = '/
(?(DEFINE)
(?<s> # define subpattern which is the split pattern
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
[\\/._=?&%+-] # a little bit optimized with a character class
)
)
(?:(?&s)) # consume the subpattern (URL starts with \/)
\K # capture starts here
(?:(?!(?&s)).){3,} # ensure this is not the skip pattern, take 3 characters minimum
/ux';
Or in smaller:
$path = '/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml';
$subject = urldecode($path);
$pattern = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
$word_array = preg_match_all($pattern, $subject, $m) ? $m[0] : [];
print_r($word_array);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
The same principle can be used with preg_split as well. It's a little bit different:
$pattern = '/
(?(DEFINE) # define subpattern which is the split pattern
(?<s>
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
[\/._=?&%+-]
)
)
(?:(?!(?&s)).){3,}(*SKIP)(*FAIL) # three or more is okay
|(?:(?!(?&s)).){1,2}(*SKIP)(*ACCEPT) # two or one is none
|(?&s) # otherwise split at the split pattern
/ux';
Usage:
$word_array = preg_split($pattern, $subject, 0, PREG_SPLIT_NO_EMPTY);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
These routines work as asked for, but they come at a price in performance. The cost is similar to that of the old answer.
Related questions:
Antimatch with Regex
Split string by delimiter, but not if it is escaped
Old answer, doing a two-step processing (first splitting, then filtering)
Because you are using a split routine, it will split - regardless of the length.
So what you can do is filter the result. You can do that with another regular expression (preg_filter), for example one that drops everything shorter than three characters:
$word_array = preg_filter(
'/^.{3,}$/', '$0',
preg_split(
'/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
urldecode($path),
-1,
PREG_SPLIT_NO_EMPTY
)
);
Result:
Array
(
[0] => 2009
[2] => pagerank
[3] => update
)
I'm guessing you're building a URL router of some kind.
Detecting which parameters are useful and which are not should not be part of this code. It may vary per page whether a short parameter is relevant.
In this case, couldn't you just ignore the element at index 1? Your page (or 'handler') should know which parameters it wants to be called with, so it should do the triage.
I would think that if you were trying to derive meaning from the URLs, you would want to write clean URLs in such a way that you don't need a complex regex to derive the values.
In many cases this involves using server redirect rules and a front controller or request router.
So what you build are clean URLs like
/value1/value2/value3
Without any .html,.php, etc. in the URL at all.
It seems to me that you are not addressing the problem adequately at the point of entry into the system (i.e. the web server), which is what would make your URL parsing as simple as it should be.
How about trying preg_match_all() instead of preg_split()?
The pattern (using a lookbehind assertion):
/([a-z0-9]{3,})(?<!htm|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org)/iu
The function call:
$pattern = '/([a-z0-9]{3,})(?<!htm|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org)/iu';
$subject = '/2009/06/pagerank-update.html';
preg_match_all($pattern, $subject, $matches);
print_r($matches);
You can try the function here: functions-online.com/preg_match_all.html
Hope this helps
Don't use a regex to break apart that path. Just use explode.
$dirs = explode( '/', urldecode($path) );
Then, if you need to break apart an individual element of the array, do that, like on your "pagerank-update" element at the end.
EDIT:
The key is that you have two different problems. First you want to break apart the path elements on slashes. Then, you want to break up the filename into smaller parts. Don't try to cram everything into one regex that tries to do everything.
Three discrete steps:
$dirs = explode...
Weed out arguments < 3 chars
Break up file argument at the end
It is far clearer if you break up your logic into discrete logical chunks rather than trying to make the regex do everything.
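The three steps might be sketched like this. The order is slightly rearranged (the file name is split before the short entries are weeded out), pathinfo() is used to drop the extension, and the arrow function needs PHP 7.4 or later; all of that is this example's choice rather than part of the advice above.

```php
<?php
$path = '/2009/06/pagerank-update.html';

// Step 1: break the path apart on slashes.
$dirs = explode('/', urldecode($path));

// Step 2: take the file name off the end, drop its extension,
// and break the rest up on dashes and underscores.
$file = array_pop($dirs);
$fileWords = preg_split('/[-_]/', pathinfo($file, PATHINFO_FILENAME), -1, PREG_SPLIT_NO_EMPTY);

// Step 3: weed out everything shorter than three characters.
$words = array_values(array_filter(
    array_merge($dirs, $fileWords),
    fn (string $word): bool => strlen($word) >= 3
));
print_r($words);
```

Each step can be tested and changed on its own, which is the point of keeping them apart.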
Let's say for an instance I have this string:
var a=23434,bc=3434,erd=5656,ddfeto='dsf3df34dff3',eof='sdfwerwer34',wer=4554;
How should I match all the initializations assigned as integer? Here's my current try, but I don't understand why it's matching everything.
$pattern = '/var (.*=\d)/';
preg_match_all($pattern,$page,$matches);
EDIT: I'm trying to match each initialization:
1 => a=23434
2 => bc=3434
and so on...
EDIT: Here's an update on my try:
$pattern = '/[^v^a^r] (.*=\d+),/';
preg_match_all($pattern,$page,$matches);
0 => 'var a=23434,bc=3434,erd=5656,'
1 => 'a=23434,bc=3434,erd=5656'
The pattern is using "greedy" matching. You don't want that. In PHP, you can either follow your quantifier with a ? to make it non-greedy, as in:
$pattern = '/var (.*?=\d)/';
or use the U modifier (PCRE_UNGREEDY), as in:
$pattern = '/var (.*=\d)/U';
which will make all wildcards use non-greedy matching.
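To see the difference, the three variants can be run side by side on the string from the question. This is only a quick comparison; the variable names are chosen for this example.

```php
<?php
$page = "var a=23434,bc=3434,erd=5656,ddfeto='dsf3df34dff3',eof='sdfwerwer34',wer=4554;";

// Greedy: .* runs to the end and backtracks to the LAST '=' followed by a digit.
preg_match('/var (.*=\d)/', $page, $greedy);

// Lazy quantifier: .*? stops at the FIRST '=' followed by a digit.
preg_match('/var (.*?=\d)/', $page, $lazy);

// U modifier: inverts greediness for every quantifier in the pattern.
preg_match('/var (.*=\d)/U', $page, $ungreedy);

echo $greedy[1], "\n";   // ends with wer=4
echo $lazy[1], "\n";     // a=2
echo $ungreedy[1], "\n"; // a=2
```

Note that the non-greedy versions still capture only one digit, because \d has no quantifier.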
EDIT: Also, since you're including "var", you would probably need to change it to
$pattern = '/var (.*?=\d)*/';
or
$pattern = '/var (.*=\d)*/U';
to match any number of (.*=\d) patterns.
EDIT: Update per discussion:
PHP
$page = "var a=23434,bc=3434,erd=5656,ddfeto='dsf3df34dff3',eof='sdfwerwer34',wer=4554;";
$pattern = '/([a-zA-Z]+=\d+)/';
preg_match_all($pattern,$page,$matches);
print_r($matches[1]);
Produces
Array
(
[0] => a=23434
[1] => bc=3434
[2] => erd=5656
[3] => wer=4554
)
Note: This filters out the entries that have the RHS enclosed in single quotes. If you don't want that, let us know.
EDIT #2: My answer to your question exceeded the size of the comment box so I edited my answer.
The [a-zA-Z]+ expression matches only alphabetical characters of either case. Note that the updated code also dropped the non-greedy modifier, so the match is greedy again. A greedy . would "eat" too much. Go ahead and play around with the code: see what happens when you change [a-zA-Z]+ to .*; it is a good opportunity to get more familiar with regex.
Since the . "eats" too much, we need to restrict it from matching all characters to matching the ones we want. We could have used something like
$pattern = '/([^\s,]*=\d+)/';
where the [^\s,]* would match any number of non-whitespace, non-comma characters. This would also have worked for your test cases.
But in this case, we can say confidently what the characters we want to include are, so instead of "blacklisting" characters, we'll "whitelist" them. In this case we specify that we want to match any alphabetical character of either case.
As is the case with many things, especially in programming, there are many ways to skin a cat. There are a number of alternative regex patterns that would have also worked for your test cases. It's up to you to understand the limits of each, how they will perform on edge cases, and how maintainable they are, and make a decision.
You don't have to use regex:
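$string = rtrim(substr($string, 4), ';'); // remove the leading 'var ' and the trailing ';'
$pairs = explode(',', $string); // split using the comma
foreach ($pairs as $pair) {
    list($key, $value) = explode('=', $pair, 2);
    // explode() returns strings, so is_int() would always be false here;
    // ctype_digit() checks that the string consists only of digits
    if (ctype_digit($value)) {
        // this is an integer
    } else {
        // not an integer
    }
}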
Try this regex
$pattern = '/([a-zA-Z]+=\d+)/';