How to preg_match for a specific pattern in a url? - php

I would like to use preg_match in PHP to test the format of a URL. The URL looks like this:
/2013/09/05/item-01.html
I'm trying to get this to work :
if (preg_match("/([0-9]{4})\/([0-9]{2})\/([0-9]{2})/[.]+[.html]$", $_SERVER['REQUEST_URI'])) :
echo "match";
endif;
But something's not quite right. Any ideas?

Try:
if (preg_match('!\d{4}/\d{2}/\d{2}/.*\.html$!', $_SERVER['REQUEST_URI'])) {
echo 'match';
}
\d is short for [0-9] and you can use different start/end delimiters (I use ! in this case) to make the regexp more readable when you're trying to match slashes.

It looks like you are nearly correct with what you have, though there are some minor problems:
you forgot to escape your last "/" before the page.html
the [.]+ should be [^.]+: you aren't looking for one or more periods, you are looking for anything that is not a period.
You shouldn't be using [] to match the html, but rather () or nothing at all
if (preg_match("/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/([^.]+\.html)$/", $_SERVER['REQUEST_URI'])) :
echo "match";
endif;
Also, you should probably learn when to use (): parentheses capture whatever is matched inside them. In your case I'm not sure whether you want to capture every directory up to the file or not.

My guess is you had a working expression for a file path, and it stopped working when you tried to add the file name part.
preg_match() requires a pair of delimiter characters to be specified; one at each end of the expression. It looks like you have these, but you've put an extra bit of the expression (ie the file name) at the end of the string outside of the delimiters. This is invalid.
"/([0-9]{4})\/([0-9]{2})\/([0-9]{2})/[.]+[.html]$"
 ^                                  ^
 your start delimiter               your end delimiter
You need to move the expression code [.]+[.html]$ that is currently after the end delimiter so that it is inside it.
That should solve the problem.
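For illustration, here is a minimal sketch with the file name part moved inside the delimiters. The helper name parseDatedUrl, the ^ anchor, the array keys, and the use of [^/]+ for the file name are my own assumptions, not something taken from the question.

```php
<?php
// Hypothetical helper: returns the date parts and file name, or null if the path does not match.
function parseDatedUrl(string $path): ?array
{
    // '!' is used as the delimiter, so the slashes in the path need no escaping.
    // Everything, including the file name part, sits between the two delimiters.
    $pattern = '!^/(\d{4})/(\d{2})/(\d{2})/([^/]+\.html)$!';

    if (preg_match($pattern, $path, $m) !== 1) {
        return null;
    }

    return ['year' => $m[1], 'month' => $m[2], 'day' => $m[3], 'file' => $m[4]];
}

var_dump(parseDatedUrl('/2013/09/05/item-01.html'));
```

Note that $_SERVER['REQUEST_URI'] can include a query string, so you may want to pass it through parse_url($uri, PHP_URL_PATH) first.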

Related

PHP regex last occurrence of words

My string is: /var/www/domain.com/public_html/foo/bar/folder/another/..
I want to remove the root folder from this string, to get only public folder, because some servers have multiple websites inside.
My actual regex is: /^(.*?)(www|public_html|public|html)/s
My actual result is: /domain.com/public_html/foo/bar/folder/another/..
But I want to remove everything up to the last occurrence, and get something like this: /foo/bar/folder/another/..
Thanks!
You have to use a greedy quantifier and to check if the alternative is enclosed between slashes using lookarounds:
/^.*(?<![^\/])(?:www|public(?:_html)?|html)(?![^\/])/
About the lookarounds: I use negative lookarounds with a negated character class to check if there is a slash or the limit of the string at the same time. This way you are sure that for instance html is a folder and not the part of another folder name.
I removed the s modifier that is useless. I removed the capture groups too since the goal is to replace all with an empty string.
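A short sketch of how the pattern above could be used with preg_replace. The function name stripDocumentRoot is hypothetical, and I switched the delimiter to ~ so the slashes need no escaping; otherwise the pattern is unchanged.

```php
<?php
// Hypothetical helper: removes everything up to and including the last
// root-like folder (www, public, public_html, html) from a path.
function stripDocumentRoot(string $path): string
{
    // Greedy .* runs to the end and backtracks to the LAST folder that matches.
    // The lookarounds require a slash (or the string boundary) on both sides.
    $pattern = '~^.*(?<![^/])(?:www|public(?:_html)?|html)(?![^/])~';

    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '', $path) ?? $path;
}

echo stripDocumentRoot('/var/www/domain.com/public_html/foo/bar/folder/another/..'), "\n";
```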
The ? makes your expression non-greedy which is not actually what you want here. Try:
^(.*)(www|public_html|public|html)
which should keep going until the last match.
Demo: https://regex101.com/r/v5WbB3/1/

Regex Including the next occurence of word

The regex works perfectly but the problem is it also include the next occurrence instead of ending with the first occurrence then start again from the
Regex : (?=<appView)\s{0,1}(.*)(?<=<\/appView>)
String: <appView></appView> <appView></appView>
But my problem is that it matches the whole string, like
(Match 1)<appView></appView> <appView></appView>
I want it to match each group separately, but I can't make it work.
Desired output : (Match 1) <appView></appView> (Match 2)<appView></appView>
\s{0,1} is equivalent to \s?. You need to use (.*?) to make the match lazy, instead of (.*).
Use this pattern: ~(?=<appView)\s?(.*?)(?<=</appView>)~
*note, you don't have to escape / in the closing tag if you use something other than a slash as your pattern delimiter. I am using ~ at the beginning and end of my pattern to avoid escaping.
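As a quick check, the lazy pattern can be run against the sample string from the question. This is only a sketch; the variable names are mine.

```php
<?php
$input = '<appView></appView> <appView></appView>';

// (.*?) is lazy, so each match stops at the first closing tag it can reach.
preg_match_all('~(?=<appView)\s?(.*?)(?<=</appView>)~', $input, $matches);

foreach ($matches[0] as $i => $match) {
    printf("Match %d: %s\n", $i + 1, $match);
}
```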
I strongly recommend switching from regex to an actual sequential XML parser. Regex is awful for parsing XML-based files, for example because of the problems below.
That said, you can "fix" your regex by using ([^<>]*). This will match all characters without < or >, which will make sure that no other tags are nested inside. If done with all tags, you cannot match something like <appview><unclosedTag></appView>, because it is invalid. If you can be certain that the structure is correct, this is slightly less of an issue.
Another problem your approach has is that if you have nested tags like so: <appView> something <appView> something else </appView> else </appView>, your approach will make you end up with [replaced] else </appView>.

my regexp does not work for a simple word match

I want to see if the current request is on the localhost or not. For doing this, I use the following regular expression:
return ( preg_match("/(^localhost).*/", $url) == true ||
preg_match("/^({http|ftp|https}://localhost).*/", $url) == true )
? true : false;
And here is the var_dump() of $url:
string 'http://localhost/aone/public/' (length=29)
Which keeps returning false though. What is the problem of this regular expression?
You are currently using the forward slash (/) as the delimiter, but you aren't escaping it inside your pattern string. This will result in an error and will cause your preg_match() statement to not work (if you don't have error reporting enabled).
Also, you are using alternation incorrectly. If you want to match either foo or bar, you'd write (foo|bar), and not {foo|bar}.
The updated preg_match() should look like:
preg_match("/^(http|ftp|https):\/\/localhost.*/", $url)
Or with a different delimiter (so you don't have to escape all the / characters):
preg_match("#^(http|ftp|https)://localhost.*#", $url)
Curly braces have a special meaning in a regex: they are used to quantify the preceding character(s).
So:
/^({http|ftp|https}://localhost).*/
Should probably be something like:
#^((http|ftp|https)://localhost).*#
Edit: changed the delimiters so that the forward slash does not need to be escaped
This
{http|ftp|https}
is wrong.
I suppose you mean
(http|ftp|https)
Also, if you only want to group and not capture, add ?: at the start of the group:
(?:http|ftp|https)
I would change your current code to:
return preg_match("~^(?:(?:https?|ftp)://)?localhost~", $url);
You were using { and } for grouping, when those are used for quantifying and otherwise mean literal { and } characters.
A couple of things to add:
you can use https? instead of (http|https);
you can use other delimiters for the regex when your pattern contains the character you would otherwise use as the delimiter. This saves you from excessive escaping;
you can combine the two regex, since one part is optional (the (?:https?|ftp):// part) and doing so would make the later comparator unnecessary;
the .* at the end is not required.
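Putting those points together, a minimal sketch could look like this. The function name isLocalhostUrl is hypothetical, and comparing against 1 is my choice to get a real boolean back.

```php
<?php
// Hypothetical helper: true when the URL points at localhost,
// with or without an http/https/ftp scheme in front.
function isLocalhostUrl(string $url): bool
{
    // '~' as delimiter avoids escaping the slashes in '://'.
    // preg_match returns 1 on match, 0 on no match, false on error.
    return preg_match('~^(?:(?:https?|ftp)://)?localhost~', $url) === 1;
}

var_dump(isLocalhostUrl('http://localhost/aone/public/'));
```

Keep in mind that this pattern only checks the prefix, so a host name that merely starts with localhost would also pass.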

How to remove backpath/parentpath from the URL?

Input:
http://foo/bar/baz/../../qux/
Desired Output:
http://foo/qux/
This can be achieved using regular expression (unless someone can suggest a more efficient alternative).
If it was a forward look-up, it would be as simple as:
/\.\.\/[^\/]+/
Though I am not familiar with how to make a backward lookup for the first "/" (i.e. not doing /[a-z0-9-_]+\/\.\./).
One of the solutions I thought of is to use strrev then apply forward look up regex (first example) and then do strrev. Though I am sure there is a more efficient way.
Not the clearest question I've ever seen, but if I understand what you're asking, I think you only need to switch around what you have like this:
/[^\/]+/\.\./
...then replace that with a /
Do that until no replacements are made and you should have what you want
EDIT
Your attempt seems to try to match a forward slash / and two dots \.\. followed by a slash / (or \/ - they should both match the same thing), then one or more non-slash characters [^/]+, terminated by a slash /. Flipping it around, you want to find a slash followed by one or more non-slash characters and a terminating slash, then two dots and a final slash.
You may be confused into thinking that the regex engine parses and consumes things as it goes (so you wouldn't want to consume a directory name that is not followed by the correct number of dots), but that's not how it typically works - a regex engine matches the entire expression before it replaces or returns anything. So, you can have two dots followed by a directory name, or a directory name followed by two dots - it doesn't make a difference to the engine.
If your attempt is using the slash-enclosed Perl-style syntax, then you would of course need to use \/ for any slashes you're trying to match such as the middle one, but I would also recommend matching and replacing the enclosing slashes in the url as well: I think the PHP would be something like
preg_replace('/\/[^\/]+\/\.\.\//', '/', $input)
(??)
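A sketch of the "repeat until no replacements are made" idea. The function name collapseParentSegments is hypothetical, and the (?!\.\./) guard, which stops ".." itself from being treated as a directory name, is my own addition.

```php
<?php
// Hypothetical helper: repeatedly removes "/dir/../" pairs until none are left.
function collapseParentSegments(string $url): string
{
    // A slash, one directory name that is not "..", a slash, two dots, a slash.
    $pattern = '~/(?!\.\./)[^/]+/\.\./~';

    do {
        // $count receives the number of replacements made in this pass.
        $url = preg_replace($pattern, '/', $url, -1, $count);
    } while ($count > 0);

    return $url;
}

echo collapseParentSegments('http://foo/bar/baz/../../qux/'), "\n";
```

This is purely textual: it does not know about schemes or host names, so it is only a sketch for inputs shaped like the example in the question.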
Technically, what you want is to replace segments like '/path1/path2/../../' with '/'. To do that you need to match 'pathx/'^n '../'^n, which is definitely NOT a regular language (it is context-free). Most regex libraries, however, support some non-regular languages and can (with a lot of effort) handle this kind of language.
An easy way to solve it is to stay within regular expressions and loop several times, replacing matches of [^./]+/\.\./ with ''
If you still want to do it in a single step, lookahead and grouping are needed, but it will be hard to write (I'm not very used to it, but I will try).
EDIT:
I've found a solution that uses a single regex, but it requires PCRE:
([^/.]+/(?1)?\.\./)
I've based my solution on the following link:
Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)
(Note that dots are "forbidden" in the first section, so you cannot have path.1/path.2/. Allowing them is quite a bit more complex, because you would have to admit dots while still forbidding '../' as a valid name in the first section.)
This subexpression admits path names like 'path1/':
[^/.]+/
This subexpression admits the double dots:
\.\./
you can test the regexp in
https://www.debuggex.com/
(remember to set it in PCRE mode)
Here is a working copy:
https://eval.in/52675
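For reference, the recursive pattern above can be tried directly with preg_replace, which uses PCRE. This is just a sketch with the input from the question; the variable names are mine.

```php
<?php
$input = 'http://foo/bar/baz/../../qux/';

// (?1) recurses into group 1, so each directory name gets paired with its own "../".
$result = preg_replace('~([^/.]+/(?1)?\.\./)~', '', $input);

echo $result, "\n";
```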

Parse block with php regex

I'm trying to write a (I think) pretty simple RegEx with PHP but it's not working.
Basically I have a block defined like this:
%%%%blockname%%%%
stuff goes here
%%%%/blockname%%%%
I'm not any good at RegEx, but this is what I tried:
preg_match_all('/^%%%%(.*?)%%%%(.*?)%%%%\/(.*?)%%%%$/i',$input,$matches);
It returns an array with 4 empty entries.
I guess it also, apart from actually working, needs some sort of pointer for the third match because it should be equal to the first one?
Please enlighten me :)
You need to allow the dot to match newlines, and to allow ^ and $ to match at the start and end of lines (not just the entire string):
preg_match_all('/^%%%%(.*?)%%%%(.*?)%%%%\/(.*?)%%%%$/sm',$input,$matches);
The s (single-line) option makes the dot match any character including newlines.
The m (multi-line) option allows ^ and $ to match at the start and end of lines.
The i option is unnecessary in your regex since there are no letters in it.
Then, to answer the second part of your question: If blockname is the same in both cases, then you can make that explicit by using a backreference to the first capturing group:
preg_match_all('/^%%%%(.*?)%%%%(.*?)%%%%\/\1%%%%$/sm',$input,$matches);
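As a small sketch of how the backreference version could be used to collect blocks by name. The sample input mirrors the block from the question; the $blocks array and the use of PREG_SET_ORDER and trim() are my own choices.

```php
<?php
$input = "%%%%blockname%%%%\nstuff goes here\n%%%%/blockname%%%%";

// \1 forces the closing tag to carry the same name as the opening tag.
$pattern = '/^%%%%(.*?)%%%%(.*?)%%%%\/\1%%%%$/sm';

// PREG_SET_ORDER groups the results per match: [full match, name, content].
preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

$blocks = [];
foreach ($matches as $m) {
    $blocks[$m[1]] = trim($m[2]); // trim() drops the newlines around the content
}

print_r($blocks);
```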
I'm pretty sure you can't since these operations would need to save a variable and you can't in regex. You should try to do this using PHP's built-in token parser. http://php.net/manual/en/function.token-get-all.php
