Regex for detecting a word enclosed but in order - php

I have a regex that needs to match specific URLs and load some configuration based on the match.
Basically, what I have is /*[^(search)].php/. This regex needs to match every URL that ends in .php, possibly followed by parameters (something.php?t=19), except for URLs that contain search.php.
For example
1. http://www.example.com/discuss/viewtopic.php?t=19
2. http://www.example.com/discuss/viewforum.php?f=8
3. http://www.example.com/discuss/search.php?f=8
Among the three URLs above, the regular expression needs to match 1 and 2 but not 3.
Any help is much appreciated thanks.
EDITED
However, it should not match any URL that does not include .php.
www.example.com/something should not be matched.

Try this:
/.+?(?<!search)\.php(?<params>.*)/
The key is the non-greedy "anything" .+?: it crawls up the string one character at a time, always checking that what follows is .php not preceded by "search", i.e. (?<!search)\.php. After that comes the named group params, which captures the optional query string.
Note that this simple regex is pretty permissive, and assumes that .php alone signifies a "php URL" - you could get crazy complicated validating URLs.
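As a quick sanity check, the pattern can be run against the URLs from the question. This is only a sketch in Python rather than PHP: Python's re module spells the named group (?P<params>...) instead of (?<params>...), and the names PHP_URL and query_params are made up for illustration.

```python
import re

# Same pattern as above, with Python's (?P<name>...) spelling for the named group.
PHP_URL = re.compile(r".+?(?<!search)\.php(?P<params>.*)")


def query_params(url):
    """Return the text after .php if the URL matches, otherwise None."""
    match = PHP_URL.match(url)
    return match.group("params") if match else None


for url in (
    "http://www.example.com/discuss/viewtopic.php?t=19",
    "http://www.example.com/discuss/viewforum.php?f=8",
    "http://www.example.com/discuss/search.php?f=8",
    "www.example.com/something",
):
    print(url, "->", query_params(url))
```

Keep in mind that the lookbehind rejects any file name ending in "search", so a URL such as research.php would be excluded as well.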

Just use two different location blocks (make sure the search.php block comes before the general \.php block):
location ~ /search\.php$ {
# config for phusion-passenger
}
location ~ \.php$ {
# config for php-fpm
}
Nginx strips the request parameters (the query string) before searching for a matching location, so you don't have to care about them.

Problem in /*[^(search)].php/
[^ ] is a negated character class,
so [^(search)] here matches any single character other than (, s, e, a, r, c, h or ).
Solution
You can use a negative lookbehind assertion:
^.*(?<!search\.php)\?([^=]=)+\d+$
This will match 1 and 2.
Example : http://regex101.com/r/bJ3vG1/4
What does it do?
(?<!search\.php) is a negative lookbehind. It asserts that the current position is not preceded by search.php.
\?([^=]=)+\d+ matches the parameters: the ?, a single-character name followed by =, then a numeric value.
Edit
If the parameter part is optional, a lengthier regex would serve the purpose:
.*(?<!search\.php)\?([^=]=)+\d+$|^[^?]*(?<!search\.php)$
Example : http://regex101.com/r/bJ3vG1/3

Related

.htaccess rewrite url if starts with numbers and add the complete part (incl. text) as parameter for another URL

I know there are many similar questions out there, but I did not find a working solution for me.
I have incoming URLs like: https://example.com/123-frank-street
and want them to rewrite to: https://example.com/street/index.php?name=123-frank-street
I tried dozens of versions and my closest is the following
RewriteRule ^[0-9]+([A-Za-z0-9-]+)/?$ /street/index_test.php?street=$1 [NC,L]
This only rewrites to: https://example.com/street/index.php?name=-frank-street
The 123 is missing and somehow does not get forwarded, only the rest!
What am I missing here or doing wrong?
Thanks in advance.
Rewrite rules are really quite simple once you understand their structure:
On the left is a regular expression which determines which URLs from the browser the rule should match
Inside that regular expression, you can "capture" parts of the URL with parentheses ()
Next, is the URL you want to serve instead; it can include parts that you captured, using numbered placeholders $1, $2, etc. It's entirely up to you where you put these, and Apache won't guess
Finally, there are flags which act as options; in your example, you're using "NC" for "Not Case-sensitive", and "L" for "Last rule, stop here if this matches"
In your example, the pattern you are matching is ^[0-9]+[A-Za-z0-9-]+/?$, which is "from start of URL, 1 or more digits, one or more letters/digits/hyphens, 0 or 1 trailing slash, end of URL".
The only part you're capturing is ([A-Za-z0-9-]+), the "one or more letters/digits/hyphens" part; so that is being put into $1. So the rest of the URL is being "discarded" simply because you haven't told Apache you want to put it anywhere.
If you want to capture other parts, just move the parentheses, or add more. For instance, if you write ^([0-9]+)([A-Za-z0-9-]+)/?$ then $1 will contain the "one or more digits" part, and $2 will contain the "one or more letters/digits/hyphens" part.
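The effect of moving the parentheses can be checked outside Apache. The sketch below uses Python's re module, which treats these particular patterns the same way as Apache's regex engine; the variable names and the third "whole path" variant are illustrative assumptions, not part of the original rule.

```python
import re

path = "123-frank-street"

# Original rule: only the part after the leading digits is captured.
original = re.compile(r"^[0-9]+([A-Za-z0-9-]+)/?$")

# Two groups: the digits land in group 1 ($1), the rest in group 2 ($2).
two_groups = re.compile(r"^([0-9]+)([A-Za-z0-9-]+)/?$")

# One group around both parts: group 1 ($1) holds the complete path.
whole_path = re.compile(r"^([0-9]+[A-Za-z0-9-]+)/?$")

print(original.match(path).group(1))    # -frank-street
print(two_groups.match(path).groups())  # ('123', '-frank-street')
print(whole_path.match(path).group(1))  # 123-frank-street
```

With the last variant, a substitution ending in ?name=$1 would receive the full 123-frank-street, which is what the question asks for.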

regex to match a URL with optional 'www' and protocol

I'm trying to write a regexp.
Some background info: I am trying to see if the REQUEST_URI of my website's URL contains another URL, like this:
http://mywebsite.com/google.com/search=xyz
However, the URL won't always contain the 'http' or the 'www', so the pattern should also match strings like:
http://mywebsite.com/yahoo.org/search=xyz
http://mywebsite.com/www.yahoo.org/search=xyz
http://mywebsite.com/msn.co.uk
http://mywebsite.com/http://msn.co.uk
There are a bunch of regexps out there to match URLs, but none I have found do an optional match on the http and www.
I'm wondering if the pattern to match could be something like:
^([a-z]).(com|ca|org|etc)(.)
I thought another option might be to just match any string that has a dot (.) in it, as the other REQUEST_URIs in my application typically won't contain dots.
Does this make sense to anyone?
I'd really appreciate some help with this; it's been blocking my project for weeks.
Thank you very much
-Tim
I suggest using a simple approach, essentially building on what you said: match anything with a dot in it, but take the forward slashes into account too. That way you capture everything and don't miss unusual URLs. So something like:
^((?:https?:\/\/)?[^./]+(?:\.[^./]+)+(?:\/.*)?)$
It reads as:
optional http:// or https://
non-dot-or-forward-slash characters
one or more sets of a dot followed by non-dot-or-forward-slash characters
optional forward slash and anything after it
Capturing the whole thing to the first grouping.
It would match, for example:
nic.uk
nic.uk/
http://nic.uk
http://nic.uk/
https://example.com/test/?a=bcd
Verifying they are valid URLs is another story! It would also match:
index.php
It would not match:
directory/index.php
The minimal match is basically something.something, with no forward slash in it, unless it comes at least one character past the dot. So just be sure not to use that format for anything else.
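The examples listed above can be verified with a short script. This is a sketch in Python rather than PHP (the pattern itself is unchanged); the helper embedded_url is hypothetical, and it assumes that REQUEST_URI starts with a slash, which has to be removed first because the pattern does not allow a leading slash.

```python
import re

# Pattern from the answer, unchanged.
EMBEDDED_URL = re.compile(r"^((?:https?:\/\/)?[^./]+(?:\.[^./]+)+(?:\/.*)?)$")


def embedded_url(request_uri):
    """Return the URL embedded in a request URI, or None if there is none."""
    # Assumption: REQUEST_URI begins with "/", which the pattern would reject.
    candidate = request_uri[1:] if request_uri.startswith("/") else request_uri
    match = EMBEDDED_URL.match(candidate)
    return match.group(1) if match else None


for uri in (
    "/google.com/search=xyz",
    "/www.yahoo.org/search=xyz",
    "/http://msn.co.uk",
    "/index.php",
    "/directory/index.php",
):
    print(uri, "->", embedded_url(uri))
```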
To match an optional part, you use a question mark ?, see Optional Items.
For example to match an optional www., capture the domain and the search term, the regular expression could be
(www\.)?(.+?)/search=(.+)
Note, however, that the question mark in .+? plays a different role: there it makes the quantifier non-greedy, see http://www.regular-expressions.info/repeat.html.
You might try starting your regex with
^(http://)?(www\.)?
And then the rules to match the rest of a URL.
$re = '/http:\/\/mywebsite\.com\/((?:http:\/\/)?[0-9A-Za-z]+(?:-+[0-9A-Za-z]+)*(?:\.[0-9A-Za-z]+(?:-+[0-9A-Za-z]+)*)+(?:\/.*)?)/';
https://regex101.com/r/x6vUvp/1
This obeys the DNS rule that hyphens must be surrounded by letters or digits. Replace http with https? to allow https URLs as well.
According to the list of TLDs at Wikipedia there are at least 1519 of them and it's not constant so you may want to give the domain its own capture group so it can be verified with an online API or a file listing them all.
Here is my two cents:
$regex = "/http:\/\/mywebsite\.com\/((http:\/\/|www\.)?[a-z]*(\.org|\.co\.uk|\.com).*)/";
See the working example
But I'm sure you can do better!
Hope it helps.

How to remove backpath/parentpath from the URL?

Input:
http://foo/bar/baz/../../qux/
Desired Output:
http://foo/qux/
This can be achieved using a regular expression (unless someone can suggest a more efficient alternative).
If it was a forward look-up, it would be as simple as:
/\.\.\/[^\/]+/
However, I am not familiar with how to make a backward look-up for the first "/" (i.e. not doing /[a-z0-9-_]+\/\.\./).
One of the solutions I thought of is to use strrev, apply the forward look-up regex (first example) and then strrev again, though I am sure there is a more efficient way.
Not the clearest question I've ever seen, but if I understand what you're asking, I think you only need to switch around what you have like this:
/[^\/]+/\.\./
...then replace that with a /
Do that until no replacements are made and you should have what you want
EDIT
Your attempt seems to try to match a forward slash /, two dots \.\. and another slash / (or \/ - they should both match the same thing), then one or more non-slash characters [^/]+, terminated by a slash /. Flipping it around, you want to find a slash followed by one or more non-slash characters and a terminating slash, then two dots and a final slash.
You may be confused into thinking that the regex engine parses and consumes things as it goes (so you wouldn't want to consume a directory name that is not followed by the correct number of dots), but that's not how it typically works - a regex engine matches the entire expression before it replaces or returns anything. So, you can have two dots followed by a directory name, or a directory name followed by two dots - it doesn't make a difference to the engine.
If your attempt is using the slash-enclosed Perl-style syntax, then you would of course need to use \/ for any slashes you're trying to match such as the middle one, but I would also recommend matching and replacing the enclosing slashes in the url as well: I think the PHP would be something like
preg_replace('/\/[^\/]+\/\.\.\//', '/', $input)
(??)
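The "replace until nothing changes" loop described above might look like the following. It is a minimal sketch in Python instead of PHP, using the same pattern as the preg_replace call; the function name collapse_parent_paths is invented for this example.

```python
import re

# A slash, one directory name, a slash, two dots and a closing slash.
PARENT_SEGMENT = re.compile(r"/[^/]+/\.\./")


def collapse_parent_paths(url):
    """Replace every 'dir/../' pair with '/', repeating until none are left."""
    while True:
        url, replacements = PARENT_SEGMENT.subn("/", url)
        if replacements == 0:
            return url


print(collapse_parent_paths("http://foo/bar/baz/../../qux/"))  # http://foo/qux/
```

Each pass removes the innermost directory together with the ".." that cancels it, so nested back-references need one pass per level.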
Technically, what you want is to replace segments like '/path1/path2/../../' with '/'. That requires matching 'pathx/'^n '../'^n, which is definitely NOT a regular language (it is context-free), but most regex libraries support some non-regular languages and can, with a lot of effort, handle this kind of language.
An easy way to solve it is to stay within regular expressions and loop several times, replacing '/[^./]+/\.\./' with '/'.
If you still want to do it in a single step, lookahead and grouping are needed, and it will be hard to write (I'm not so used to it, but I will try).
EDIT:
I've found a solution that uses only one regex, but it requires PCRE:
([^/.]+/(?1)?\.\./)
I've based my solution on the following link:
Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)
(Note that dots are "forbidden" in the first section, so you cannot have path.1/path.2/. Allowing them is quite a bit more complex, because the first section would have to admit dots while still rejecting '../' as a path name.)
This subexpression admits path names like 'path1/':
[^/.]+/
This subexpression admits the double dots:
\.\./
You can test the regexp at
https://www.debuggex.com/
(remember to set it in PCRE mode)
Here is a working copy:
https://eval.in/52675

Regex matching a string that includes a substring, but does not include a substring

I'm trying to write a regex that will match a string that contains a certain substring, but fails if it also contains different substring. I've found this answer, but I'm not sure how to get it to work for my needs. In the interest of being specific as possible:
Yes, it has to do this as part of the expression. I do not have access to the code that will be processing this.
Yes, it needs to be one expression.
It needs to work with PHP's regex flavor. I'm pretty sure it's being evaluated using preg.
To give an idea of what I'm trying to do, I have a set of URLs I'm trying to filter. URLs that have "/somedir" in them I want to match, but I don't want it to match if it also has "somestring" in the URL.
So,
www.somesite.com/somedir/index.html
www.somesite.com/somedir/somotherdir/index.html
www.somesite.com/somedir/somepage.html
would all match, but,
www.somesite.com/somedir/somestring.html
www.somesite.com/somedir/somestring/index.html
would both fail.
You need a regex that will accept a certain pattern only if it is not surrounded by another pattern:
~
(?(DEFINE)
(?<ACCEPT> must-contain-pattern)
(?<REFUSE> must-not-contain-pattern)
)
^
(?:(?!(?&REFUSE)).)*
(?&ACCEPT)
(?:(?!(?&REFUSE)).)*
$
~ux
In the DEFINE block define the ACCEPT and REFUSE pattern according to your needs and this should work.
Edit: The pattern from above, tailored to your case by defining the two named sub-patterns:
~
(?(DEFINE)
(?<ACCEPT> \Q/somedir\E)
(?<REFUSE> \Qsomestring\E)
)
^
(?:(?!(?&REFUSE)).)*
(?&ACCEPT)
(?:(?!(?&REFUSE)).)*
$
~ux
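The same idea can be tried outside PHP. Python's re module has no (?(DEFINE)...) blocks, so the sketch below inlines the two sub-patterns instead of naming them; the guard construct (?:(?!REFUSE).)* is the one used above. The function name contains_but_not is hypothetical.

```python
import re


def contains_but_not(accept, refuse):
    """Build a pattern matching strings that contain accept but never refuse."""
    # Consume one character at a time, but only where refuse does not start.
    guard = rf"(?:(?!{re.escape(refuse)}).)*"
    return re.compile(rf"^{guard}{re.escape(accept)}{guard}$")


url_filter = contains_but_not("/somedir", "somestring")

for url in (
    "www.somesite.com/somedir/index.html",
    "www.somesite.com/somedir/somotherdir/index.html",
    "www.somesite.com/somedir/somepage.html",
    "www.somesite.com/somedir/somestring.html",
    "www.somesite.com/somedir/somestring/index.html",
):
    print(url, "->", bool(url_filter.match(url)))
```

re.escape plays the role of \Q...\E here: both treat the two substrings as literal text.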

Why doesn't this URL pattern match?

I'm using a pattern as described by John Gruber in this daringfireball article to auto link URLs in user comments.
I'm using it with PHP to match URLs, and want it to match a single TLD with no www and no trailing slash, but it doesn't seem to be working.
Here's the pattern (and can be seen in more detail at the article above):
$pattern = '#(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4})(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))#';
Specifically I'm looking at this particular subpattern: [a-z0-9.\-]+[.][a-z]{2,4}
This subpattern works separately, but as a part of the larger pattern, it doesn't match google.com.
[a-z0-9.\-]+[.][a-z]{2,4} works as you expect, but the rest of the pattern requires at least 1 following character:
google.com/
google.com?lang=en-us
google.com#!foo/bar
etc.
You can try allowing the tail to be optional, but it may in turn give you false-positives rather than excluding false-negatives:
$pattern = '#...“”‘’])?)#'; # '...' for brevity
#                     ^
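The difference between the two versions can be observed with a small test. This is a sketch in Python rather than PHP: the pattern from the question is split into three pieces (the names HEAD, BODY and TAIL are made up here), and the only change in the relaxed version is the ? after the final group.

```python
import re

# The question's pattern, split into three parts for readability.
HEAD = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4})"
BODY = r"(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+"
TAIL = r"(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])"

original = re.compile(HEAD + BODY + TAIL + ")")
relaxed = re.compile(HEAD + BODY + TAIL + "?)")  # final group made optional

for text in ("google.com", "google.com/"):
    for name, pattern in (("original", original), ("relaxed", relaxed)):
        match = pattern.search(text)
        print(name, repr(text), "->", match.group(1) if match else None)
```

The original pattern needs at least one character after the domain, so it finds nothing in a bare google.com, while the relaxed one accepts it.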
Works for me:
http://regexr.com?2uica
Are you sure there is nothing in your PHP that is tripping you up?
EDIT
It's because the full pattern expects to find something before the domain name, like http:// or www
