Confused about the behavior of regex in a url routing script - php

I just finished learning about regex and I thought that I should put it into something useful, so I created a small url routing script with php and the following regex:
^(?:/(\w+)?)*$
(The PHP code currently doesn't do anything; it just prints out the matching groups from preg_match.)
Currently, if given the URL /foobar/foo/bar, the matching groups are the entire string (normal behavior) and the last part of the URL (in this case: bar).
Obviously, this is a problem.
I think this is caused by the use of a single capture group, which only captures the last matching string, but I'm not sure. Any advice on the real cause and/or a solution would be greatly appreciated.
Thanks in advance!

You have diagnosed the problem correctly - on each repetition of the surrounding group, the previously matched contents of the capturing group are "overwritten" by the new match.
It's not quite clear what you expected to happen. I guess you would have liked each part of the path to be "remembered" as its own group? This is something you can't do with repeated groups in PHP (only a few regex dialects, such as Perl 6 and .NET, allow something like this).
In your case, you're probably better off using your regex to validate the URL and then splitting it along the slashes:
$result = preg_split('%/%', $subject);
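As a minimal sketch of that validate-then-split approach: route_segments is a hypothetical helper name, and the PREG_SPLIT_NO_EMPTY flag is an addition here so that the leading slash does not produce an empty first element.

```php
<?php
// Hypothetical helper: validate the path with the pattern from the question,
// then split it on slashes. Returns null when the path does not validate.
function route_segments(string $path): ?array
{
    // The question's pattern is used only for validation here.
    if (preg_match('%^(?:/(\w+)?)*$%', $path) !== 1) {
        return null;
    }
    // PREG_SPLIT_NO_EMPTY drops the empty string before the leading slash.
    return preg_split('%/%', $path, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(route_segments('/foobar/foo/bar')); // ['foobar', 'foo', 'bar']
```

Every path segment is then available by index, which is what the single repeated capture group could not provide.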

Related

Extract all words between two phrases using regex [duplicate]

I'm trying to extract all the words between two phrases using the following regex:
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b(.*)\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
The documents I'm running this regex on are 10-K filings. The filings are too long to post here (see regex101 url below for example), but basically they are something like this:
ITEM 1. BUSINESS
lots of words
ITEM 2. PROPERTIES
lots of words
ITEM 3. LEGAL PROCEEDINGS
I want to extract all the words between ITEM 1 and ITEM 3. Note that the subtitles for each ITEM may be slightly different for each 10-K filing, hence I'm allowing for a few words between each word.
I keep getting a catastrophic backtracking error, and I cannot figure out why. For example, please see https://regex101.com/r/zgTiyb/1.
What am I doing wrong?
Catastrophic backtracking has essentially one main cause:
A possible match is found but can't finish.
You made too many positions available for the regex engine to try, which hits the backtracking limit in PCRE. A quick workaround is to replace the only dot-star in the regex with a restrictive quantifier, e.g.
.{0,200}
See live demo here
But the better approach is re-constructing the regular expression:
\bitem\b.*?\b(?:1|one)\b(*COMMIT)\W+(?:\w+\W+){0,2}?business\b\h*\R+(?:(?!item\h+(?:3|three)\b)[\s\S])*+item\h+(?:3|three)\b\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings\b
See live demo here
Your own regex needs ~45K steps on the given input string to find those two matches. In contrast, this modified regex needs ~8K steps to accomplish the task. That's a huge improvement.
The latter doesn't need the s flag (and it shouldn't be enabled). I used the (*COMMIT) backtracking verb to cause an early failure if a possible match is found but is unlikely to finish.
@Sebastian Proske's solution matches three sub-strings, but I don't think the third match is an expected match. This huge third match is the only reason for your regex to break.
Please read this answer to have a better insight into this problem.
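In PHP itself, running into the PCRE backtracking limit mentioned above does not throw: preg_match returns false and preg_last_error() reports PREG_BACKTRACK_LIMIT_ERROR. A small sketch of checking for that, where checked_match is a hypothetical wrapper and not part of PHP:

```php
<?php
// Hypothetical wrapper: separates "no match" (null) from a PCRE failure (exception).
function checked_match(string $pattern, string $subject): ?array
{
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false) {
        $code = preg_last_error();
        $reason = $code === PREG_BACKTRACK_LIMIT_ERROR
            ? 'backtrack limit exhausted (see pcre.backtrack_limit in php.ini)'
            : 'PCRE error code ' . $code;
        throw new RuntimeException($reason);
    }
    // preg_match returns 1 on a match and 0 when nothing matched.
    return $result === 1 ? $matches : null;
}
```

Without such a check, a failed match on a long filing is easy to mistake for "the pattern found nothing".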
This isn't really catastrophic backtracking, just a whole lot of text and a comparatively low backtracking limit in regex101. In this scenario the use of .* isn't optimal, as it will match the whole remainder of the text file once it is reached and then backtrack character by character to match the parts after it, which means a lot of characters to process.
It seems you can stick to \w+\W+ there as well and use lazy matching instead of greedy matching to get your result, like
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b\W+(?:\w+\W+)*?\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
Note that the PCRE engine optimizes (?:\w+\W+) to (?>\w++\W++), so it works in word/non-word chunks instead of single characters.
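A rough sketch of using the lazy pattern from PHP to pull out the text between the two headings. The sample text is a made-up stand-in for a filing, and the capture group around the middle part and the i flag are additions for this example:

```php
<?php
// Hypothetical sample standing in for a 10-K filing.
$filing = "ITEM 1. BUSINESS\nWe sell widgets.\nITEM 2. PROPERTIES\nWe own one office.\nITEM 3. LEGAL PROCEEDINGS\nNone.";

// Lazy pattern from above; the middle part is wrapped in a capture group
// so the text between the two headings ends up in $m[1].
$pattern = '~\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b'
         . '(\W+(?:\w+\W+)*?)'
         . '\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b~i';

$section = null;
if (preg_match($pattern, $filing, $m) === 1) {
    // Strip the line breaks that surround the captured section.
    $section = trim($m[1]);
}
echo $section, "\n";
```

On this sample the captured section runs from "We sell widgets." through "We own one office.", including the ITEM 2 heading in between.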

changing www*.com to a clickable URL with REGEX

I'm working on a web page and regex keeps coming up as the best way to handle string manipulation for an issue I'm trying to resolve. Unfortunately, regex is not exactly trivial and I've been having trouble. Any help is appreciated.
I would like to make strings entered from a PHP form into clickable links. I've received help with my first challenge: how to make strings starting with http, https or ftp into clickable links.
function make_links_clickable($message){
    return preg_replace('!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9@:%_+.~#?&;//=]+)!i', '<a href="$1">$1</a>', $message);
}
$message = make_links_clickable($message);
And this works well. When I look at it (and do some research), the best that I can glean from the syntax is that the first piece matches ftp, http, and https, then : and //, followed by a wide range of characters. I would like to know how I can:
1) Make links starting with www, or ending with .com/.net/.org/etc clickable (like google.com, or www.google.com - leaving out the http://)
2) Change youtube links like
"https://www.youtube.com/watch?v=examplevideo"
into
"<iframe width="560" height="315" src="//www.youtube.com/embed/examplevideo" frameborder="0" allowfullscreen></iframe>"
I think these two cases are basically doing the same kind of thing, but figuring it out is not intuitive. Any help would be deeply appreciated.
The first regular expression there is made to match almost everything that follows ftp://, http://, or https://, so it might be best to implement the others as separate expressions, since they'll only be matching hostnames.
For number 1, you'll need to decide how strictly you wish to match different TLDs (.com/.net/etc). For example, you can explicitly match them like this:
(www\.)?[a-z0-9\-]+\.(com|net|org)
However, that will only match URLs that end in .com, .net, or .org. If you want all top-level domains and only the valid ones, you'll need to manually write them all into the end of that. Alternatively, you can do something like this,
(www\.)?[a-z0-9\-]+\.[a-z]{2,6}
which will accept anything that looks like a URL and ends with a dot followed by any combination of 2 to 6 letters (long enough for .museum and .travel). However, this will also match strings like "fgs.fds". Depending on your application, you may need to add more characters to [a-z] to support extended alphabets.
Edit (2 Aug 14): As pointed out in the comments below, this won't match TLDs like .co.uk. Here's one that will:
(www\.)?[a-z0-9\-]+\.([a-z]{2,3}(\.?[a-z]{2,3})?)
Instead of any string of two to six characters (following a period), this will match any two to three letters, then another two to three (if present), with or without a dividing period.
It'd be redundant, but you could instead remove the question mark after www on the second option, then do both tests; that way, you can match any string ending in a common TLD, or a string that begins with "www." and is followed by any characters with one period separating them, such as "gpspps.cobg". It would still match sites that might not actually exist, but at least it would look like a URL.
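A sketch of how the explicit-TLD pattern could be turned into links. The helper name link_bare_domains, the added word boundaries, the i flag, and the choice of http:// as the default scheme are all assumptions for this example; it also assumes the text contains no links that already start with a scheme, since those would be matched a second time.

```php
<?php
// Hypothetical helper: wraps bare .com/.net/.org hostnames in anchor tags.
function link_bare_domains(string $message): string
{
    // Explicit-TLD pattern from above, with \b boundaries and the i flag added.
    // $0 in the replacement refers to the whole matched hostname.
    return preg_replace(
        '~\b(?:www\.)?[a-z0-9\-]+\.(?:com|net|org)\b~i',
        '<a href="http://$0">$0</a>',
        $message
    );
}

echo link_bare_domains('Visit www.google.com or example.org today.'), "\n";
```

Running the scheme-based replacement and this one on the same text would need some care about ordering, which is one reason to keep them as separate expressions.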
For the YouTube one, I went a little question mark crazy.
(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,}?v\=))([a-zA-Z0-9_\-]{11}){0,}?v\=))(?i)([a-zA-Z0-9_\-]{11})
EDIT: I just tried to use the above regex in one of my own projects, but I encountered some errors with it. I changed it a little and I think this version may be better:
(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,})?)(?:v=)?([a-zA-Z0-9_\-]{11})
For those not familiar with regular expressions: parts of a pattern wrapped in parentheses, ( ...regex... ), are stored as groups, which can be selectively picked out of matched strings. Groups that begin with ?:, like most of the ones up there, e.g. (?:www\.), are not captured. Because the end of that regex was left as a normal ("captured") group, ([a-zA-Z0-9_\-]{11}), if you use the $matches argument of functions like preg_match, you can use $matches[1] to get the YouTube ID of the video, 'examplevide', then work with it however you'd like. Also note that the regex only matches 11 characters for the ID.
This regex will match pretty much any of the current YouTube URL formats, including unusual capitalization and parameters out of their normal order:
http://youtu.be/dQw4w9WgXcQ
https://www.youtube.com/watch?v=dQw4w9WgXcQ
http://www.youtube.com/watch?v=dQw4w9WgXcQ&feature=featured
http://www.youtube.com/watch?feature=featured&v=dQw4w9WgXcQ
http://WWW.YouTube.Com/watch?v=dQw4w9WgXcQ
http://YouTube.Com/watch?v=dQw4w9WgXcQ
www.youtube.com/watch?v=dQw4w9WgXcQ
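To connect this back to the iframe from the question, here is a sketch that uses the second version of the regex. The names YOUTUBE_PATTERN and youtube_embed are hypothetical, and ~ is used as the delimiter because the pattern itself contains slashes:

```php
<?php
// Second version of the pattern from above, wrapped in ~ delimiters.
const YOUTUBE_PATTERN = '~(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,})?)(?:v=)?([a-zA-Z0-9_\-]{11})~';

// Hypothetical helper: returns the embed markup, or null if no video ID is found.
function youtube_embed(string $url): ?string
{
    if (preg_match(YOUTUBE_PATTERN, $url, $matches) !== 1) {
        return null;
    }
    // $matches[1] is the 11-character video ID from the only capturing group.
    $id = htmlspecialchars($matches[1], ENT_QUOTES);
    return '<iframe width="560" height="315" src="//www.youtube.com/embed/'
        . $id . '" frameborder="0" allowfullscreen></iframe>';
}

echo youtube_embed('http://www.youtube.com/watch?feature=featured&v=dQw4w9WgXcQ'), "\n";
```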

Basic Regular Expression for

For some reason I always get stuck making anything past extremely basic regular expressions.
I'm trying to make a regular expression that kind of looks like a URL. I only want basic checking.
I would like it to match the following patterns where X is "something".
X://X.X
X://X.X... etc.
X.X
X.X... etc
If the string contains one of these patterns, it is sufficient checking for me. This way a url like www.example.com:8888 will still match. I have tried many different REGEX combinations with preg_match and cannot seem to get any to behave the way I want it to. I have consulted many other related REGEX questions on SO but my readings have not helped me.
Any help? I will be happy to provide more information if you would like but I don't know what else you would need.
It takes practice but here is one that I made using a regex tester (http://www.regextester.com/) to check my pattern:
^.+(:\/\/|\.)([a-zA-Z0-9]+\.)+.+
My approach is to slowly build my pattern from the beginning and add on one piece at a time. This cheatsheet is extremely helpful for remembering what everything is: http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
Basically the pattern starts at the beginning of the string and checks for any characters followed by either :// or . then checks for groupings of letters and numbers followed by a . ending with any number of characters.
The pattern could probably be improved with groupings to not pass on invalid characters. But this one was quick and dirty. You could replace the first and last . with the characters that would be valid.
UPDATE
Per the comments here is an updated pattern:
^.+?(:\/\/|\.)?([a-zA-Z0-9]+?\.)+.+
/^(.+:\/\/)?[^.]+\.[^.\/]+([.\/][^.\/]+)*$/
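A quick way to try the last pattern above against a few inputs with preg_match. The helper name looks_like_url and the sample strings are made up for illustration:

```php
<?php
// Hypothetical helper using the last pattern above:
// optional "X://" prefix, then at least two dot-separated parts.
function looks_like_url(string $candidate): bool
{
    return preg_match('/^(.+:\/\/)?[^.]+\.[^.\/]+([.\/][^.\/]+)*$/', $candidate) === 1;
}

foreach (['http://example.com', 'www.example.com:8888', 'localhost', 'example.'] as $candidate) {
    printf("%-22s %s\n", $candidate, looks_like_url($candidate) ? 'match' : 'no match');
}
```

The first two inputs match; "localhost" has no dot and "example." has nothing after the dot, so both are rejected.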

Discard character in matching group

I have a couple of matching groups one after another in a long Regex pattern. Around the middle I have
...(?<number>(?:/(?:digit|num))?\d+|)...
which should match something like /num9, /digit9 or 9 or blank (because I need the named group to appear in the resulting associative array even if it's empty).
The pattern works, but is it possible to discard the / character if one of the first two cases is matched? I tried a positive lookahead, but it seems that you can't use those if you have expressions before the lookahead.
Is what I'm trying to accomplish possible using Regex?
Based on your input, I think that you need to match the / at some point anyway, otherwise your whole regex fails. At the same time you want to ignore it, so it cannot be part of your named group. Therefore, by putting it outside the group and making it optional, while ensuring that a digit is not directly preceded by a /, you get the desired result:
^/?(?<number>(?:(?:digit|num))?(?<!/)\d+|)$
However, without more complete input and the full regex, I am not 100% sure this will work for all your cases.
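A small sketch for checking how that pattern behaves on the inputs from the question. The helper extract_number is hypothetical, and the pattern is tested in isolation here rather than as part of the longer regex:

```php
<?php
// Hypothetical helper: returns the contents of the "number" group,
// or null when the input does not match at all.
function extract_number(string $input): ?string
{
    // Pattern from the answer above, wrapped in ~ delimiters.
    if (preg_match('~^/?(?<number>(?:(?:digit|num))?(?<!/)\d+|)$~', $input, $m) !== 1) {
        return null;
    }
    return $m['number'];
}

foreach (['/num9', '/digit9', '9', '', '/9'] as $input) {
    var_dump(extract_number($input));
}
```

The slash is dropped from the group while "num" or "digit" stays in it, the empty input still yields an empty "number" entry, and "/9" is rejected by the lookbehind.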

Finding correct php regex for this complex element

I'm trying to get a regex which is able to find the following part in a string.
[TABLE|head,border|{
#TEXT|TEXT|TEXT#
TEXT|TEXT|TEXT
TEXT|TEXT|TEXT
TEXT|TEXT|TEXT
}]
It's from a simple self-made WYSIWYG editor, which makes it possible to add tables. The "syntax" for a table should be as simple as the one above.
Now, as there can be many of these table definitions, I need to find all of them with PHP's preg_match_all to replace them with the well-known <table> tag in HTML.
The regex I'm trying to use is the following:
/\[TABLE\|(.*)\|\{(.*)\}\]/si
The \x0A stands for a newline; as my app is running on Linux, this is enough (it works fine with simpler regexes).
I use the online regex tester on functions-online.com.
The matches it gets are not really useful. And if I have more than one TABLE definition like the one above, the matches are completely useless: because of the (.*), the match covers everything from "head,border" to the very last "|" character in the second TABLE definition.
I would like to get a list of matches giving me the complete table command one by one.
This is because by default .* is a greedy match (assuming your code works correctly for input containing only a single table). Placing a question mark after each of the two .*'s should prevent greediness from being an issue.
/\[TABLE\|(.*?)\|\{(.*?)\}\]/si
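A short sketch of the lazy version with preg_match_all. The sample input is made up, and PREG_SET_ORDER is a choice made here so that each table's captures are grouped together:

```php
<?php
// Hypothetical input with two table definitions.
$text = "[TABLE|head,border|{\n#A|B#\n1|2\n}]\nsome text\n[TABLE|border|{\n3|4\n}]";

// With PREG_SET_ORDER, $tables[n][1] holds the options and $tables[n][2] the body of table n.
preg_match_all('/\[TABLE\|(.*?)\|\{(.*?)\}\]/si', $text, $tables, PREG_SET_ORDER);

foreach ($tables as $table) {
    $rows = explode("\n", trim($table[2]));
    echo 'options: ', $table[1], ', rows: ', count($rows), "\n";
}
```

Each element of $tables is one complete table command, so the replacement with <table> markup can be done one definition at a time.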
