Does anyone know an up-to-date regular expression for validating URLs? I found a few on Google, but in testing they all allowed junk URLs, e.g. www.google_com.
My regular expression knowledge is not so vast, so I would hate to put something together that would fail under pressure.
Thanks.
You can use the filter functions in PHP:
$filtered = filter_var($url, FILTER_VALIDATE_URL);
http://uk3.php.net/manual/en/function.filter-var.php
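As a rough sketch of how this behaves on the junk input from the question: the helper name isValidUrl is made up for illustration. Note that FILTER_VALIDATE_URL expects a scheme, so a bare host name is rejected regardless of the underscore.

```php
<?php
// Hypothetical helper (not part of PHP): turns filter_var()'s result into a strict boolean.
function isValidUrl(string $url): bool
{
    // filter_var() returns the filtered URL on success and false on failure.
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(isValidUrl('http://www.google.com/search?q=php')); // bool(true)
var_dump(isValidUrl('www.google_com'));                     // bool(false): no scheme
```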
Not every problem should be answered with a regex.
http://php.net/manual/en/function.parse-url.php
I have this regular expression:
preg_match_all("/<a\s.*?href\s*=\s*['|\"](.*?)(?=#|\"|')/si", $data, $matches);
to find all URLs. It works fine, but how can I modify it to find ONLY URLs that contain a question mark?
Example:
0123
And preg_match_all will return:
http://site.com/index.php?id=1
http://site.com/calc/index.php?id=1&scheme=Venus
preg_match_all("#<a\s*href\s*=[\'\"]([^\'\"]+\?[^\'\"]+)[\'\"]#si", $data, $matches);
Try this.
Don't try to make everything happen in one regex. Use your existing method, and then separately check the URL that you get back to see if it has a question mark in it.
That said, don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged.
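A minimal sketch of that two-step approach, assuming the DOM extension is available: parse the HTML with DOMDocument, then check each href for a question mark separately. The function name linksWithQuery and the sample markup in $data are invented for illustration, since the question's original example did not survive.

```php
<?php
// Hypothetical helper: returns the href of every <a> tag whose URL contains "?".
function linksWithQuery(string $html): array
{
    // Collect parser warnings internally instead of emitting them for sloppy HTML.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Second, separate step: keep only URLs that have a question mark.
        if (strpos($href, '?') !== false) {
            $links[] = $href;
        }
    }
    return $links;
}

// Assumed sample input, not taken from the question.
$data = '<a href="http://site.com/index.php?id=1">0</a>'
      . '<a href="http://site.com/about.php">1</a>'
      . '<a href="http://site.com/calc/index.php?id=1&amp;scheme=Venus">2</a>';

print_r(linksWithQuery($data));
```

Because the question mark check is an ordinary string test, it keeps working when attribute order, quoting, or whitespace in the markup changes.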
Andy Lester's answer describes the right thing to do.
Here's your regex though:
<a\s.*?href\s*=\s*['|\"](.*?\?.*?)(?=#|\"|')
as seen here:
http://rubular.com/r/LHi11VMMR9
Over the years, I have used this regex to validate URLs, and it's done an 'ok' job. The problem is that it won't validate anything after the .com part: it only accepts http://www.domain.com. Anything longer fails validation.
function theUrl($rUrl)
{
if (preg_match('/^((http|https):\/{2})([w]{3})([\.]{1})([a-zA-Z0-9-]+)([\.]{1})((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|(c[acdfghiklmnorsuvxyz]|cat|co.in|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|(m[acdghklmnopqrstuvwxyz]|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])$/i', $rUrl))
{
return true;
}
return false;
}
Can you help me with how the part after the .com should be for best results?
I would encourage you to use one of the native PHP functions instead of a custom regular expression, such as:
parse_url()
filter_var() using the FILTER_VALIDATE_URL type.
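One way the two functions could be combined, as a sketch rather than a drop-in replacement for theUrl(): filter_var() checks the overall syntax, including any path or query string after the domain, and parse_url() is then used to restrict the scheme. The name isHttpUrl and the http/https restriction are assumptions based on the original regex.

```php
<?php
// Hypothetical replacement for the regex-based check; accepts paths and queries.
function isHttpUrl(string $url): bool
{
    // Step 1: general URL syntax check.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Step 2: only allow the schemes the original regex allowed.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

var_dump(isHttpUrl('http://www.domain.com'));                    // bool(true)
var_dump(isHttpUrl('http://www.domain.com/some/page.php?id=1')); // bool(true)
var_dump(isHttpUrl('www.domain.com'));                           // bool(false): no scheme
```

Unlike the regex, this does not check the top-level domain against a fixed list, so it will not go stale as new TLDs appear.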
I always find regular expressions a headache, and googling didn't really help. I'm currently using the following expression (preg_match): /^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/
However, if I want to allow emails with plus symbols, this obviously won't work, e.g. foo+bar@domain.com
How would I need to change my expression to allow it? Thanks in advance for all the help!
You should just use PHP's built-in regex for email validation, because it covers all the cases:
filter_var($email, FILTER_VALIDATE_EMAIL)
See filter_var and FILTER_VALIDATE_EMAIL (or https://github.com/php/php-src/blob/master/ext/filter/logical_filters.c#L499 for the actual beast).
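A short sketch showing that the built-in filter already accepts a plus sign in the local part. The wrapper name isValidEmail and the sample addresses are only for illustration.

```php
<?php
// Hypothetical wrapper: converts filter_var()'s string-or-false result to a boolean.
function isValidEmail(string $email): bool
{
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

foreach (['foo+bar@domain.com', 'foo@domain.com', 'not-an-email'] as $email) {
    echo $email, ' => ', isValidEmail($email) ? 'valid' : 'invalid', PHP_EOL;
}
```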
Your wrong regex can be changed to another wrong regex:
/^[\w+-]+(\.[\w+-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/
which allows for the + character where you want it. But it's wrong anyway.
Try adding + to the character class []:
/^[_a-z0-9+-]+(\.[_a-z0-9+-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/
Regex is my bête noire. Can anyone help me isolate a string from a URL?
I want to get the page name from a URL which could appear in any of the following ways from an input form:
https://www.facebook.com/PAGENAME?sk=wall&filter=2
http://www.facebook.com/PAGENAME?sk=wall&filter=2
www.facebook.com/PAGENAME
facebook.com/PAGENAME?sk=wall
... and so on.
I can't seem to find a way to isolate the string after .com/ but before ? (if present at all). Is it preg_match, replace or split?
If anyone can recommend a particularly clear and introductory regex guide they found useful, it'd be appreciated.
You can use the parse_url function and then take the last segment of the URL's path:
$parts=parse_url($url);
$path_parts=explode("/", $parts["path"]);
$page=$path_parts[count($path_parts)-1];
For learning and testing regexes I found RegExr, an online tool, very useful: http://gskinner.com/RegExr/
But as others mentioned, parsing the url with appropriate functions might be better in this case.
I think you can use the PHP function parse_url directly instead of a regex.
Use something like:
substr(parse_url('https://www.facebook.com/PAGENAME?sk=wall&filter=2', PHP_URL_PATH), 1);
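One caveat with the scheme-less inputs from the question: without a scheme, parse_url() treats the host as part of the path, so www.facebook.com/PAGENAME comes back with a path of www.facebook.com/PAGENAME. A possible sketch that handles all four forms prepends a scheme when none is present and then takes the first path segment. The function name extractPageName is hypothetical.

```php
<?php
// Hypothetical helper: returns the first path segment of a URL, or null if there is none.
function extractPageName(string $url): ?string
{
    // Add a scheme if the input lacks one, so the host is not parsed as a path.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }

    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no path at all, or a URL that could not be parsed
    }

    // Drop the empty pieces produced by leading or trailing slashes.
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));
    return $segments[0] ?? null;
}

echo extractPageName('https://www.facebook.com/PAGENAME?sk=wall&filter=2'), PHP_EOL;
echo extractPageName('facebook.com/PAGENAME?sk=wall'), PHP_EOL;
```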
I am looking for help in creating a Regular Expression that validates a URL of very long length (eg. a very long Google Maps address). I am new to regular expressions and I am using it in PHP with preg_match().
I have used the expression:
preg_match('/^(http|https|ftp):\/\/([A-Z0-9][A-Z0-9_-]*(?:\.[A-Z0-9][A-Z0-9_-]*)+):?(\d+)?\/?/i', $url)
but this doesn't work for the very long URLs.
My knowledge of Regular Expressions is virtually non-existent, so if there's a simple change that would help, feel free to point it out.
Here's an example of the errors that I'm receiving:
If the link is originally:
http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=park&sll=43.882057,-108.852539&sspn=4.204424,9.876709&ie=UTF8&hq=park&hnear=&z=7
Validation reproduces:
http://maps.google.com/maps?f=q
To check whether it's a valid URL:
$valid = filter_var($url, FILTER_VALIDATE_URL);
To check whether the required GET variables are set, parse them into an associative array and use isset():
parse_str(parse_url($url,PHP_URL_QUERY),$get_variables);
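Put together with the Google Maps URL from the question, the two steps might look like the sketch below. Which parameters count as required is an assumption here (q and sll are picked purely as an example), and the variable names are illustrative.

```php
<?php
$url = 'http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=park'
     . '&sll=43.882057,-108.852539&sspn=4.204424,9.876709&ie=UTF8&hq=park&hnear=&z=7';

// Step 1: syntax check on the full URL, however long it is.
$isValid = filter_var($url, FILTER_VALIDATE_URL) !== false;

// Step 2: split the query string into an associative array.
// The (string) cast covers URLs without a query, where parse_url() returns null.
parse_str((string) parse_url($url, PHP_URL_QUERY), $getVariables);

// Assumed requirement for this example: both "q" and "sll" must be present.
$hasRequired = isset($getVariables['q'], $getVariables['sll']);

var_dump($isValid, $hasRequired, $getVariables['q']);
```

No part of the URL is truncated this way, because nothing depends on how far a pattern happens to match.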