Simple regex question (php)

I've been using this line in a routes file:
$route['^(?!home|members).*'] = "pages/view/$0";
The string in the array on the left of the expression ( ^(?!home|members).* ) is what I'm trying to figure out.
Basically any url that is not:
/home or /home/ or /members or /members/ should be true. The problem I have is when the URL is something like /home-asdf. This counts as being in my list of excluded URLs (which in my example only has 'home' and 'members').
Ideas on how to fix this?

Try this modification:
^(?!(home|members)([/?]|$)).*
This filters out URLs beginning with home or members only if those names are immediately followed by a slash or question mark ([/?]), or the end of the string ($).
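To check the behaviour of the modified lookahead outside the routes file, here is a minimal sketch. The helper name isPageRoute is made up for illustration, and it assumes the URI is tested without a leading slash:

```php
<?php
// Hypothetical helper (not part of any framework): returns true when the
// URI is NOT one of the excluded sections and should go to pages/view.
function isPageRoute(string $uri): bool
{
    // Negative lookahead: reject "home" or "members" only when followed
    // by "/", "?" or the end of the string.
    return preg_match('#^(?!(home|members)([/?]|$)).*#', $uri) === 1;
}

foreach (['home', 'home/', 'members?tab=1', 'home-asdf', 'about'] as $uri) {
    printf("%-14s %s\n", $uri, isPageRoute($uri) ? 'pages/view' : 'excluded');
}
```

With this pattern, home-asdf is no longer treated as an excluded URL, while home, home/ and members?tab=1 still are.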

http://www.regular-expressions.info/
The dot . operator matches all characters. The * operator means the previous pattern will be repeated 0 or more times. That means the end of your route matches any character any number of times after the word home or members. If you only want to match one or zero slashes, then change .* to /?.

As an aside, I use this all the time and it works wonders: http://www.rubular.com/
It's predominantly for Ruby, but it works well for working out general regex for PHP etc. too.

Related

regex to match a URL with optional 'www' and protocol

I'm trying to write a regexp.
Some background info: I am trying to see if the REQUEST_URI of my website's URL contains another URL, like these:
http://mywebsite.com/google.com/search=xyz
However, the URL won't always contain the 'http' or the 'www', so the pattern should also match strings like:
http://mywebsite.com/yahoo.org/search=xyz
http://mywebsite.com/www.yahoo.org/search=xyz
http://mywebsite.com/msn.co.uk
http://mywebsite.com/http://msn.co.uk
There are a bunch of regexps out there to match URLs, but none I have found do an optional match on the http and www.
I'm wondering if the pattern to match could be something like:
^([a-z]).(com|ca|org|etc)(.)
I thought maybe another option was to just match any string that has a dot (.) in it (as the other REQUEST_URIs in my application typically won't contain dots).
Does this make sense to anyone?
I'd really appreciate some help with this; it's been blocking my project for weeks.
Thank you very much
-Tim

I suggest using a simple approach, essentially building on what you said (just anything with a dot in it), but working with the forward slashes too, so that it captures everything and does not miss unusual URLs. So something like:
^((?:https?:\/\/)?[^./]+(?:\.[^./]+)+(?:\/.*)?)$
It reads as:
optional http:// or https://
non-dot-or-forward-slash characters
one or more sets of a dot followed by non-dot-or-forward-slash characters
optional forward slash and anything after it
Capturing the whole thing to the first grouping.
It would match, for example:
nic.uk
nic.uk/
http://nic.uk
http://nic.uk/
https://example.com/test/?a=bcd
Verifying they are valid URLs is another story! It would also match:
index.php
It would not match:
directory/index.php
The minimal match is basically something.something, with no forward slash in it, unless it comes at least one character past the dot. So just be sure not to use that format for anything else.
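The examples listed above can be checked with a short script. This is only a sketch: looksLikeUrl is a hypothetical helper name, and it assumes the candidate string has already been extracted from REQUEST_URI (for instance with any leading slash removed).

```php
<?php
// Hypothetical helper: true when the string has the "something.something"
// shape described above, with an optional http:// or https:// prefix.
function looksLikeUrl(string $candidate): bool
{
    // "~" is used as the delimiter so forward slashes need no escaping.
    $pattern = '~^((?:https?://)?[^./]+(?:\.[^./]+)+(?:/.*)?)$~';
    return preg_match($pattern, $candidate) === 1;
}

$candidates = [
    'nic.uk',
    'http://nic.uk/',
    'https://example.com/test/?a=bcd',
    'index.php',
    'directory/index.php',
];

foreach ($candidates as $candidate) {
    echo $candidate, ' => ', looksLikeUrl($candidate) ? 'match' : 'no match', "\n";
}
```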

To match an optional part, you use a question mark ?, see Optional Items.
For example, to match an optional www. and capture the domain and the search term, the regular expression could be
(www\.)?(.+?)/search=(.+)
Note that the question mark in .+? plays a different role: there it makes the quantifier non-greedy, see http://www.regular-expressions.info/repeat.html.
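As a rough illustration of the optional group, the sketch below applies that expression with preg_match. The function name parseEmbeddedSearch and the array keys are invented for this example, and the input is assumed to be the part of the URI after the site's own host.

```php
<?php
// Hypothetical helper: splits "[www.]domain/search=term" into its parts,
// or returns null when the string does not have that shape.
function parseEmbeddedSearch(string $uri): ?array
{
    if (preg_match('~(www\.)?(.+?)/search=(.+)~', $uri, $m) !== 1) {
        return null;
    }
    return [
        // Group 1 is an empty string when the optional "www." is absent.
        'has_www' => $m[1] !== '',
        'domain'  => $m[2],
        'term'    => $m[3],
    ];
}

print_r(parseEmbeddedSearch('www.yahoo.org/search=xyz'));
print_r(parseEmbeddedSearch('yahoo.org/search=xyz'));
```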

You might try starting your regex with
^(http://)?(www\.)?
And then the rules to match the rest of a URL.

$re = '/http:\/\/mywebsite\.com\/((?:http:\/\/)?[0-9A-Za-z]+(?:-+[0-9A-Za-z]+)*(?:\.[0-9A-Za-z]+(?:-+[0-9A-Za-z]+)*)+(?:\/.*)?)/';
https://regex101.com/r/x6vUvp/1
Obeys the DNS rule that hyphens must be surrounded by alphanumeric characters. Replace http with https? to allow https URLs as well.
According to the list of TLDs at Wikipedia there are at least 1519 of them, and the list is not constant, so you may want to give the domain its own capture group so that it can be verified with an online API or a file listing them all.

Here are my two cents:
$regex = "/http:\/\/mywebsite\.com\/((http:\/\/|www\.)?[a-z]*(\.org|\.co\.uk|\.com).*)/";
See the working example
But I'm sure you can do better !
Hope it helps.

How do I extract one group from a URL using regex for use in a redirect?

I've read the Best RegEx Trick Ever and tried to wrap my head around the other answers here on Stack Exchange and just can't seem to get it right. Take these three strings:
http://www.test.com/newyork/class-schedule
http://www.test.com/location/newyork/class-schedule
http://www.test.com/location/newyork/training
I need a regex that will extract the newyork from the first string and save it for a replace later, but will NOT match any part of the other strings. Also, for obscure reasons, I can not include http://www.test.com as a condition for matching (so I can't use anything before the slash that precedes newyork). Note that in this scenario, newyork could easily be chicago, atlanta, or any other city name with no spaces or punctuation.
The only thing I've been able to figure out that isolates only newyork in the first string is the following:
/.*\.com\/(.[^\/]*)\/class-schedule/g
However, this relies on using the URL first which I can't use.
Any ideas on how to achieve this WITHOUT using the URL?
[EDIT]
To clarify what I'm looking for, I'm trying to take the results from the first string and add "location" to it, still using regex. So:
http://www.test.com/newyork/class-schedule
would become
http://www.test.com/location/newyork/class-schedule
using something like
http://www.test.com/location/$1/class-schedule

Try this: ~/(\w+)/[-a-z]+?/?(?:\?.*?)*(:?\s|$)~gm
See it working here: https://regex101.com/r/4VMazZ/3.
So it uses the end of the URL instead of the beginning and matches only the word between the second and third slashes from the end. A query string may be present and it will still work.
[EDIT 1]
I had swapped two characters at the end by mistake, so it was capturing one extra group. The corrected pattern is /(\w+)/[-a-z]+?/?(?:\?.*?)*(?:\s|$). See it here: https://regex101.com/r/4VMazZ/4
If you use preg_match($pattern, $string, $matches); the result you want (newyork) will be in $matches[1];, $matches[0] contains everything.
You can see the captures in 'MATCH INFORMATION' panel on regex101 in my example!
[EDIT 2] after your comment.
If you want to replace the whole URL, you have to match the whole URL; something like .*?/(\w+)/[-a-z]+?/?(?:\?.*?)*(?:\s|$) will do in this example. See it working here: https://regex101.com/r/4VMazZ/5
[EDIT 3] Add capturing of last part for replacement.
Since you want to reuse the last part, you need to add capturing parentheses: .*?/(\w+)/([-a-z]+?)/?(?:\?.*?)*(?:\s|$).
See it working here: https://regex101.com/r/4VMazZ/6
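A small sketch of the preg_match usage described above, using the corrected pattern from EDIT 1. The helper name extractCity is hypothetical. Note that PHP's preg functions have no g modifier (preg_match_all plays that role), so only m is kept here.

```php
<?php
// Hypothetical helper: returns the path segment that sits just before the
// last segment of the URL, or null when the pattern does not match.
function extractCity(string $url): ?string
{
    $pattern = '~/(\w+)/[-a-z]+?/?(?:\?.*?)*(?:\s|$)~m';
    if (preg_match($pattern, $url, $matches) !== 1) {
        return null;
    }
    // $matches[0] is the whole match, $matches[1] the captured segment.
    return $matches[1];
}

echo extractCity('http://www.test.com/newyork/class-schedule'), "\n";
echo extractCity('http://www.test.com/chicago/class-schedule?x=1'), "\n";
```

Because this pattern works by position from the end of the URL, it also captures newyork from http://www.test.com/location/newyork/class-schedule, so telling the /location/ URLs apart would need an additional check.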

Could this work? See it here.
(?<=location\/|\.\w{3}\/|\.\w{2}\/)(?!location).*?(?=\/|$)
It matches everything following .xxx/ or .xx/ or location/. I don't know if one-letter domains exist; if they do, you can add |\.\w\/ to the lookbehind at the start of the regex.
(?<=location\/|\.\w{3}\/|\.\w{2}\/) is a lookbehind, so it matches the following pattern only if preceded by location/ or .xxx/ or .xx/
.*? matches every character (lazy)
(?=\/|$) end match if next character is / or on line end
Note: If location is counted as part of the url, I don't think what you are asking is possible in regex, as the city name could be anywhere in string. If so, then you could have a list of cities and check what part of the url matches one of them.
EDIT: You need the multiline m flag so $ also matches end of line

How to remove backpath/parentpath from the URL?

Input:
http://foo/bar/baz/../../qux/
Desired Output:
http://foo/qux/
This can be achieved using a regular expression (unless someone can suggest a more efficient alternative).
If it was a forward look-up, it would be as simple as:
/\.\.\/[^\/]+/
Though I am not familiar with how to make a backward look-up for the first "/" (i.e. not doing /[a-z0-9-_]+\/\.\./).
One of the solutions I thought of is to use strrev, then apply the forward look-up regex (first example), and then strrev again. Though I am sure there is a more efficient way.

Not the clearest question I've ever seen, but if I understand what you're asking, I think you only need to switch around what you have like this:
/[^\/]+/\.\./
...then replace that with a /
Do that until no replacements are made and you should have what you want
EDIT
Your attempt seems to try to match a forward slash / and two dots \.\. followed by a slash / (or \/ - they should both match the same thing), then one or more non-slash characters [^/]+, terminated by a slash /. Flipping it around, you want to find a slash followed by one or more non-slash characters and a terminating slash, then two dots and a final slash.
You may be confused into thinking that the regex engine parses and consumes things as it goes (so you wouldn't want to consume a directory name that is not followed by the correct number of dots), but that's not how it typically works - a regex engine matches the entire expression before it replaces or returns anything. So, you can have two dots followed by a directory name, or a directory name followed by two dots - it doesn't make a difference to the engine.
If your attempt is using the slash-enclosed Perl-style syntax, then you would of course need to use \/ for any slashes you're trying to match such as the middle one, but I would also recommend matching and replacing the enclosing slashes in the url as well: I think the PHP would be something like
preg_replace('/\/[^\/]+\/\.\.\//', '/', $input)
(??)
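The "repeat until no replacements are made" step could look roughly like this. It is a sketch only: the function name collapseParentSegments is made up, and "#" is used as the delimiter so the slashes do not need escaping.

```php
<?php
// Hypothetical helper: repeatedly removes "/name/../" pairs until the
// URL no longer changes.
function collapseParentSegments(string $url): string
{
    $pattern = '#/[^/]+/\.\./#';
    do {
        // The fifth argument receives the number of replacements made.
        $url = preg_replace($pattern, '/', $url, -1, $count);
    } while ($count > 0);

    return $url;
}

echo collapseParentSegments('http://foo/bar/baz/../../qux/'), "\n";
```

The first pass turns the input into http://foo/bar/../qux/, the second into http://foo/qux/, and the third makes no replacement, which ends the loop.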

Technically, what you want is to replace segments like '/path1/path2/../../' with '/'. To do that you need to match 'pathx/'^n'../'^n, which is definitely NOT a regular language (it is a context-free language)... but most regex libraries support some non-regular languages and can (with a lot of effort) handle this kind of language.
An easy way to solve it is to stay within regular expressions and loop several times, replacing '[^./]+/\.\./' with ''
If you still want to do it in a single step, lookahead and grouping are needed, but it will be hard to write (I'm not so used to it, but I will try)
EDIT:
I've found a solution using only one regex... but it requires PCRE
([^/.]+/(?1)?\.\./)
I've based my solution on the following link:
Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)
(Note that dots are "forbidden" in the first section, so you cannot have path.1/path.2/. Allowing them is considerably more complex, because the first section would have to admit dots while still rejecting '../' as a path name.)
This subexpression admits path names like 'path1/':
[^/.]+/
This subexpression admits the double dots:
\.\./
You can test the regexp at
https://www.debuggex.com/
(remember to set it to PCRE mode)
Here is a working copy:
https://eval.in/52675
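For reference, a minimal sketch of the single-pass version using PHP's preg_replace, which is PCRE-based and therefore understands the recursive (?1) call. The function name collapseInOnePass is hypothetical, and the same restriction applies: path names containing dots are not handled.

```php
<?php
// Hypothetical helper: removes balanced "name/.../../" runs in one call
// by recursing into capture group 1.
function collapseInOnePass(string $url): string
{
    // [^/.]+/   one path name followed by a slash
    // (?1)?     optionally, a nested balanced run (recursion into group 1)
    // \.\./     the "../" that cancels the path name
    return preg_replace('#([^/.]+/(?1)?\.\./)#', '', $url);
}

echo collapseInOnePass('http://foo/bar/baz/../../qux/'), "\n";
```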

Why doesn't this URL pattern match?

I'm using a pattern as described by John Gruber in this daringfireball article to auto link URLs in user comments.
I'm using it with PHP to match URLs, and want it to match a single TLD with no www and no trailing slash, but it doesn't seem to be working.
Here's the pattern (and can be seen in more detail at the article above):
$pattern = '#(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4})(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))#';
Specifically I'm looking at this particular subpattern: [a-z0-9.\-]+[.][a-z]{2,4}
This subpattern works separately, but as a part of the larger pattern, it doesn't match google.com.

[a-z0-9.\-]+[.][a-z]{2,4} works as you expect, but the rest of the pattern requires at least 1 following character:
google.com/
google.com?lang=en-us
google.com#!foo/bar
etc.
You can try allowing the tail to be optional, but it may in turn give you false-positives rather than excluding false-negatives:
$pattern = '#...“”‘’])?)#'; # '...' for brevity
#                     ^

Works for me:
http://regexr.com?2uica
Are you sure there is nothing in your PHP that is tripping you up?
EDIT
It's because the full pattern expects to find something before the domain name, like http:// or www

php regex question for matching google searchterms in url

I'm extracting search words from Google request URLs.
I'm using
preg_match("/[q=](.*?)[&]/", $requesturl, $match);
but it fails when the 'q' parameter is the last parameter of the string.
So I need to fetch everything that comes after 'q=', but the match must stop if it finds '&'.
How do I do that?
EDIT:
I eventually landed on this for matching google request url:
/[?&]q=([^&]+)/
because sometimes there is a parameter whose name ends with q, like 'aq=0'.

You need /q=([^&]+)/. The trick is to match everything except & in the query.
To build on your query, this is a slightly modified version that will (almost) do the trick, and it's the closest to what you have there: /q=(.*?)(&|$)/. It puts the q= out of the brackets, because inside the brackets it will match either of them, not both together, and at the end you need to match either & or the end of the string ($). There are, though, a few problems with this:
1. Sometimes you will have an extra & at the end of the match; you don't need it. To solve this problem you can use a lookahead query: (?=&|$)
2. It introduces an extra group at the end (not necessarily bad, but can be avoided) -- actually, this is fixed by 1.
So, if you want a slightly longer query to expand what you have there, here it is: /q=(.*?)(?=&|$)/
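A quick sketch of the lookahead version in use. The helper name googleQuery and the sample URLs are invented for illustration; the pattern is the one given above.

```php
<?php
// Hypothetical helper: returns the raw value of the q parameter, stopping
// at the next "&" or at the end of the string, or null if q= is absent.
function googleQuery(string $requestUrl): ?string
{
    // The lookahead (?=&|$) checks for "&" or end of string without
    // consuming it, so no extra group or trailing "&" ends up in the match.
    if (preg_match('/q=(.*?)(?=&|$)/', $requestUrl, $match) !== 1) {
        return null;
    }
    return $match[1];
}

echo googleQuery('http://www.google.com/search?hl=en&q=php+regex'), "\n";
echo googleQuery('http://www.google.com/search?q=php+regex&hl=en'), "\n";
```

Both calls yield php+regex, whether q is the last parameter or not.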

Try this:
preg_match("/q=([^&]+)/", $requesturl, $match);
A little explaining:
[q=] will search for either q or =, but not one after another.
[&] is not needed as there is only one character. & is fine.
the ? operator in regex tells it to match 0 or 1 occurrences of the preceding character. When it follows another quantifier, as in .*?, it makes that quantifier lazy instead.
[^&] will tell it to match any character except for &. Which means you'll get all the query string until it hits &.
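To see the character-class point in practice, this sketch runs the original pattern and the suggested one against the same string. The sample query string and the helper name firstCapture are assumptions made up for the example.

```php
<?php
// Hypothetical helper: returns capture group 1 of the first match, or null.
function firstCapture(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

$requesturl = 'search?hl=en&q=php&start=10';

// [q=] matches a single "q" OR a single "=", so the first hit is the "="
// in "hl=en" and the captured text is "en", not the search term.
var_dump(firstCapture('/[q=](.*?)[&]/', $requesturl));

// "q=" as a literal sequence, then everything up to the next "&".
var_dump(firstCapture('/q=([^&]+)/', $requesturl));
```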
