Regular Expression to Validate Long URLs in PHP

I am looking for help creating a regular expression that validates very long URLs (e.g. a long Google Maps address). I am new to regular expressions and am using this one in PHP with preg_match().
I have used the expression:
preg_match('/^(http|https|ftp):\/\/([A-Z0-9][A-Z0-9_-]*(?:\.[A-Z0-9][A-Z0-9_-]*)+):?(\d+)?\/?/i', $url)
but this doesn't work for the very long URLs.
My knowledge of Regular Expressions is virtually non-existent, so if there's a simple change that would help, feel free to point it out.
Here's an example of the errors that I'm receiving:
If the link is originally:
http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=park&sll=43.882057,-108.852539&sspn=4.204424,9.876709&ie=UTF8&hq=park&hnear=&z=7
After validation, I get:
http://maps.google.com/maps?f=q

To check whether it is a valid URL:
$valid = filter_var($url, FILTER_VALIDATE_URL);
To check whether the required GET variables are set, parse the query string into an associative array and use isset():
parse_str(parse_url($url,PHP_URL_QUERY),$get_variables);
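Putting the two calls together, here is a minimal sketch using the Google Maps URL from the question. The keys 'q' and 'sll' are just picked for illustration; which variables count as "required" is an assumption on my part.

```php
<?php
// Example URL taken from the question.
$url = 'http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=park'
     . '&sll=43.882057,-108.852539&sspn=4.204424,9.876709&ie=UTF8&hq=park&hnear=&z=7';

// filter_var() returns the URL string on success and false on failure.
$valid = filter_var($url, FILTER_VALIDATE_URL) !== false;

// Parse the query string into an associative array.
$get_variables = [];
parse_str((string) parse_url($url, PHP_URL_QUERY), $get_variables);

// 'q' and 'sll' are illustrative choices, not requirements from the question.
$hasRequired = isset($get_variables['q'], $get_variables['sll']);
```

Note that isset() is still true for a variable that is present but empty (such as geocode above), so compare against '' as well if empty values should be rejected.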

Related

using regex to find invalid postcode

I'm new to PHP and I'm trying to write a function to find an invalid postcode. This is an option, however I've been told this isn't the ideal format:
function postcode_valid($postcode) {
return preg_match('/\w{2,3} \d\w{2}/', $postcode);
}
//more accurate
//[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}
I understand the first function, but I don't know how to write the 'ideal' solution as a function, please can you advise?
If the regular expression you provided in the comment field is the correct one and you don't know how to use it in PHP, here is the solution:
function postcode_valid($postcode) {
return preg_match('/^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$/', $postcode);
}
The regular expression has to be wrapped in delimiters (here one slash in front and one at the end) and passed as a PHP string. I would also highly recommend putting ^ at the beginning and $ at the end of the expression to anchor it to the start and end of the string; otherwise the check passes whenever any part of the input matches, so a longer string containing a valid postcode would be accepted.
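To see what the anchors change, here is a small sketch that runs the same pattern with and without ^ and $. The input string is made up for the demonstration.

```php
<?php
// Same pattern, once without and once with the ^...$ anchors.
$unanchored = '/[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}/';
$anchored   = '/^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$/';

// Hypothetical input: a valid-looking postcode surrounded by junk.
$input = 'xxAB12 3CDxx';

$partial = preg_match($unanchored, $input);     // finds "AB12 3CD" inside the junk
$strict  = preg_match($anchored, $input);       // whole string must match, so no match
$exact   = preg_match($anchored, 'AB12 3CD');   // clean input still matches
```

One caveat: in PCRE, $ also matches just before a trailing newline, so if that matters add the D modifier or use \z instead of $.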
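If you are looking to validate a UK postcode, the following regex is often cited instead. Be aware that it uses .NET-style character class subtraction (e.g. [A-Z-[QVX]]), which preg_match() does not interpret that way, so it has to be rewritten with explicit ranges before it can be used in PHP: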
(GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKPSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})
If you are looking for something else, please provide a comment below.
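For use with preg_match(), the UK postcode expression above could be translated roughly as follows. This is a sketch, not a verified postcode validator: the function name is hypothetical, the explicit ranges are my own hand expansion of subtractions such as [A-Z-[QVX]], and I have added ^ and $ anchors that the original lacks.

```php
<?php
// Hypothetical helper. Hand expansion of the .NET-style subtractions:
//   [A-Z-[QVX]]    -> [A-PR-UWYZ]
//   [A-Z-[IJZ]]    -> [A-HK-Y]
//   [A-Z-[CIKMOV]] -> [ABD-HJLNP-UW-Z]
function uk_postcode_valid(string $postcode): bool
{
    // Outward code: the four alternatives from the original expression.
    $outward = '[A-PR-UWYZ][0-9][0-9]?'
             . '|[A-PR-UWYZ][A-HK-Y][0-9][0-9]?'
             . '|[A-PR-UWYZ][0-9][A-HJKPSTUW]'
             . '|[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]';
    // Inward code: one digit followed by two letters.
    $inward  = '[0-9][ABD-HJLNP-UW-Z]{2}';

    $pattern = '/^(?:GIR 0AA|(?:' . $outward . ') ' . $inward . ')$/';

    return preg_match($pattern, $postcode) === 1;
}
```

The pattern expects upper-case input with a single space, so normalise with strtoupper() and trim() first if users type the postcode themselves.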

Modifying this URL validation regex to accept what's after the .com part

Over the years, I used this regex to validate urls, and it's done an 'ok' job. The problem is, it won't validate after the .com part. It'll only validate http://www.domain.com. Anything more, and it'll throw an error.
function theUrl($rUrl)
{
if (preg_match('/^((http|https):\/{2})([w]{3})([\.]{1})([a-zA-Z0-9-]+)([\.]{1})((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|(c[acdfghiklmnorsuvxyz]|cat|co.in|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|(m[acdghklmnopqrstuvwxyz]|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])$/i', $rUrl))
{
return true;
}
}
Can you help me with how the part after the .com should be for best results?
I would encourage you to use one of the native PHP functions instead of a custom regular expression, such as:
parse_url()
filter_var() using the FILTER_VALIDATE_URL type.
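As a rough sketch of how the two functions could replace the regex in theUrl(): the version below accepts any syntactically valid http or https URL, including whatever comes after the .com part. It deliberately drops the www requirement and the TLD whitelist from the original, which is an assumption about what you actually need.

```php
<?php
// Sketch of theUrl() built on native functions instead of a custom regex.
function theUrl(string $rUrl): bool
{
    // filter_var() returns false when the string is not a well-formed URL.
    if (filter_var($rUrl, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // Restrict to http/https, as the original regex did.
    $scheme = strtolower((string) parse_url($rUrl, PHP_URL_SCHEME));

    return in_array($scheme, ['http', 'https'], true);
}
```

Unlike the original function, this one always returns a boolean rather than falling through without a return value when the URL is rejected.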

Parsing link from javascript function

I'm trying to parse a direct link out of a javascript function within a page. I'm able to parse the html info I need, but am stumped on the javascript part. Is this something that is achievable with php and possibly regex?
function videoPoster() {
document.getElementById("html5_vid").innerHTML =
"<video x-webkit-airplay='allow' id='html5_video' style='margin-top:"
+ style_padding
+ "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' "
+ "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}
What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!
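#http://.*\.mp4# will give you all characters between http:// and .mp4, inclusive. (If you keep / as the delimiter in PHP, the slashes inside the pattern have to be escaped: /http:\/\/.*\.mp4/.)
If you need the session id, use something like #http://.*\.mp4\?sessionid=\d+# (the ? has to be escaped, otherwise it just makes the 4 optional).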
In general, no. Nothing short of a full javascript parser will always extract urls, and even then you'll have trouble with urls that are computed nontrivially.
In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:
['"](http://[^'"]*)['"]
If you have to enter that regexp as a string, beware of escaping.
If you ever have unescaped quotation marks in urls, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the urls in javascript a pain.
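On the escaping point, here is one way the quoted-URL pattern could be written as a PHP string. It is a sketch under a couple of assumptions: # is chosen as the delimiter so the slashes in http:// need no escaping, and $js is a shortened stand-in for the JavaScript in the question.

```php
<?php
// In a single-quoted PHP string only the single quote itself needs a backslash.
$pattern = '#[\'"](http://[^\'"]*)[\'"]#';

// Shortened stand-in for the JavaScript source from the question.
$js = "+ \"<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>\";";

// Group 1 holds the URL without the surrounding quotes.
$url = preg_match($pattern, $js, $m) === 1 ? $m[1] : null;
```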
For your specific case, this should work, provided that none of the characters in the URL are escaped.
preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];
See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).
(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)
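The error handling suggested above might look something like this. The function name is hypothetical, and the two extra checks (http:// prefix, .mp4 somewhere in the URL) are just the ones mentioned in this answer, not a complete validation.

```php
<?php
// Hypothetical helper; $html is assumed to hold the fetched page source.
function extract_video_url(string $html): ?string
{
    // preg_match() returns 1 on a match, 0 on no match and false on error.
    if (preg_match("/src='([^']*)'/", $html, $matches) !== 1) {
        return null;
    }

    $url = $matches[1];

    // Extra sanity checks: expected scheme prefix and an .mp4 in the URL.
    if (strpos($url, 'http://') !== 0 || strpos($url, '.mp4') === false) {
        return null;
    }

    return $url;
}
```

Returning null for every failure case lets the caller tell "nothing usable found" apart from a real URL with a single check.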
The following captures every http or https URL that appears in a src attribute in your HTML:
$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
print_r($matches['urls']);
}
If you want to do the same in JavaScript, you could use this:
var matches;
if (matches=html.match(/src=["'](https?:\/\/[^"']+)["']/g)){
//gives you all matches, but they are still including the src=" and " parts, so you would
//have to run every match again against the regex without the g modifier
}

URL Validation?

Does anyone know an up-to-date regular expression for validating URLs? I found a few on Google, but in testing they all allowed junk URLs (e.g. www.google_com).
My regular expression knowledge is not so vast, so I would hate to put something together that would fail under pressure.
Thanks.
You can use the filter functions in PHP
$filtered = filter_var($url, FILTER_VALIDATE_URL);
http://uk3.php.net/manual/en/function.filter-var.php
Not every problem should be answered with a regex.
http://php.net/manual/en/function.parse-url.php
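As a quick check against the junk input from the question, a minimal sketch (the second URL is just an illustrative well-formed one):

```php
<?php
// The junk input has no scheme, so FILTER_VALIDATE_URL rejects it.
$junk = filter_var('www.google_com', FILTER_VALIDATE_URL);

// A well-formed URL is returned unchanged.
$good = filter_var('http://www.google.com/', FILTER_VALIDATE_URL);
```

Keep in mind that this is a syntax check only; it says nothing about whether the host exists or the page is reachable.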

PHP server-side validation regular expression match

I have the following part of a validation script:
$invalidEmailError .= "<br/>» You did not enter a valid E-mail address";
$match = "/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/";
That's the expression, here is the validation:
if ( !(preg_match($match,$email)) ) {
$errors .= $invalidEmailError; // checks validity of email
}
I think that's enough info, let me know if more is needed.
Basically, what happens is the message "You did not enter a valid E-mail address" gets echoed no matter what. Whether a correct email address or an incorrect email address is entered.
Does anyone have any idea or a clue as to why?
EDIT: I'm running this on localhost (using Apache), could that be the reason as to why the preg_match ain't working?
Thanks!
Amit
Your regex only includes [A-Z], not [a-z]. Try
$match = "/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i";
to make the regex case-insensitive.
You can test this live on http://regexpal.com.
However, I'd advise you to try one of the expressions on the page mentioned by strager: http://fightingforalostcause.net/misc/2006/compare-email-regex.php. They have been perfected over time and will probably behave better. But Gmail users will be satisfied with yours, since they'll be able to use plus aliases which are rejected incorrectly by many validators.
You likely got the regular expression you're using from regular-expressions.info. On that page, the author states (emphasis added):
If you want to use the regular expression above, there's two things you need to understand. First, long regexes make it difficult to nicely format paragraphs. So I didn't include a-z in any of the three character classes. This regex is intended to be used with your regex engine's "case insensitive" option turned on. (You'd be surprised how many "bug" reports I get about that.) Second, the above regex is delimited with word boundaries, which makes it suitable for extracting email addresses from files or larger blocks of text. If you want to check whether the user typed in a valid email address, replace the word boundaries with start-of-string and end-of-string anchors, like this: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$.
To solve this problem, add the i PCRE flag after your regular expression.
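A small sketch of the difference the flag makes, using the anchored form of the pattern. The address is made up for the example.

```php
<?php
// Anchored pattern, with and without the case-insensitive i flag.
$withFlagPattern    = '/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i';
$withoutFlagPattern = '/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/';

// Hypothetical lower-case address, as most users would type it.
$email = 'amit@example.com';

$withFlag    = preg_match($withFlagPattern, $email);          // lower case accepted
$withoutFlag = preg_match($withoutFlagPattern, $email);       // only A-Z allowed, so rejected
$garbage     = preg_match($withFlagPattern, 'not-an-email');  // no @, rejected either way
```

Note that {2,4} still rejects addresses whose top-level domain is longer than four letters, such as .museum.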
You can always try debugging your regex using a simpler tool (I'm quite fond of using Notepad++ for this purpose) and performing iterative tests, i.e. making the expression more or less complicated and seeing whether that fixes or breaks things.
