Validate a domain name with GET parameters using a REGEX

I am trying to use preg_match and a regex to validate that a domain has GET parameters, which I require it to have for my purposes.
What I have working so far is validating a domain without GET parameters, like so:
if (preg_match("/^[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}$/", 'domain.com')) {
echo 'true';
} else {
echo 'false';
}
I get true for this test.
So far so good. What I am having trouble with is adding the GET parameters. Among the many regexes I have tried without luck is the following:
if (preg_match("/^[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}([/?].*)?$/", 'domain.com?test=test')) {
echo 'true';
} else {
echo 'false';
}
Here I get false, so I am not able to validate a domain with the GET parameters, which are required.
Any assistance will be much appreciated ^^
Regards

This code is not tested, but I think it should work:
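$pattern = "([a-z0-9.-]*)\.([a-z]{2,3})"; // Host
$pattern .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // Query string (GET parameters)
// preg_match() needs delimiters; the anchors make the pattern cover the whole input
if (preg_match("/^$pattern$/i", 'domain.com?test=test')) {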
echo 'true';
} else {
echo 'false';
}
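
One likely reason the second pattern in the question returns false: PHP locates the closing delimiter before it compiles the pattern, so the unescaped / inside [/?] ends the pattern early and the rest is read as modifiers. A minimal sketch, assuming ~ as the delimiter and a hypothetical helper name domainHasQuery(); it also makes the query part mandatory, since the question says the GET parameters are required:

```php
<?php
// Hypothetical helper, not part of the original question.
// Uses "~" as the delimiter so "/" needs no escaping inside the pattern.
function domainHasQuery(string $input): bool
{
    // Domain part is the asker's original pattern; "\?.+" requires a
    // question mark followed by at least one character.
    $pattern = '~^[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}\?.+$~';

    return preg_match($pattern, $input) === 1;
}

echo domainHasQuery('domain.com?test=test') ? 'true' : 'false';
```

Escaping the slash as \/ and keeping / as the delimiter should work just as well; which one reads better is a matter of taste.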

What is the advantage of using a REGEX?
Why not just
<?php
$xGETS = count($_GET);
if(!$xGETS)
{
echo 'false';
} else {
echo 'true';
}
// PHP 5.2+
$xGETS = filter_var('http://domain.com?test=test', FILTER_VALIDATE_URL, FILTER_FLAG_QUERY_REQUIRED);
if(!$xGETS)
{
echo 'false';
} else {
echo 'true';
}

Your first regular expression will reject some valid domain names (e.g. from the museum and travel TLDs and domain names that include upper case letters) and will recognize some invalid domain names (e.g. where a label or the whole domain name is too long).
If this is fine with you, you might just as well search for the first question mark and treat the prefix as domain name and the suffix as "GET parameters" (actually called query string).
If this is not fine with you, a simple regular expression will not suffice to validate domain names, because of the length constraints of domain names and labels.
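
The "split at the first question mark" idea could be sketched as follows. The function name splitDomainAndQuery() and the returned array shape are assumptions made for illustration; the domain check reuses the pattern from the question, with all the limitations described above:

```php
<?php
// Hypothetical helper: splits "domain.com?test=test" at the first "?".
// Returns null when there is no (or an empty) query string or when the
// domain part does not match the asker's original pattern.
function splitDomainAndQuery(string $input): ?array
{
    $pos = strpos($input, '?');
    if ($pos === false || $pos === strlen($input) - 1) {
        return null;
    }

    $domain = substr($input, 0, $pos);
    $query = substr($input, $pos + 1);

    if (!preg_match('/^[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}$/', $domain)) {
        return null;
    }

    // parse_str() turns "a=1&b=2" into an associative array
    parse_str($query, $params);

    return ['domain' => $domain, 'params' => $params];
}

var_dump(splitDomainAndQuery('domain.com?test=test'));
```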

Related

How to convert random domain names into lowercase consistent urls?

I have this function in a class:
protected $supportedWebsitesUrls = ['www.youtube.com', 'www.vimeo.com', 'www.dailymotion.com'];
protected function isValid($videoUrl)
{
$urlDetails = parse_url($videoUrl);
if (in_array($urlDetails['host'], $this->supportedWebsitesUrls))
{
return true;
} else {
throw new \Exception('This website is not supported yet!');
return false;
}
}
It basically extracts the host name from any random url and then checks if it is in the $supportedWebsitesUrls array to ensure that it is from a supported website. But if I add say: dailymotion.com instead of www.dailymotion.com it won't detect that url. Also if I try to do WWW.DAILYMOTION.COM it still won't work. What can be done? Please help me.

You can use preg_grep function for this. preg_grep supports regex matches against a given array.
Sample use:
$supportedWebsitesUrls = array('www.dailymotion.com', 'www.youtube.com', 'www.vimeo.com');
$s = 'DAILYMOTION.COM';
if ( empty(preg_grep('/' . preg_quote($s, '/') . '/i', $supportedWebsitesUrls)) )
echo "This website is not supported yet!\n";
else
echo "found a match\n";
Output:
found a match

You can run a few checks on it.
For lower case vs. upper case, the PHP function strtolower() will sort you out.
As for matching the host with or without the www. at the beginning, you can add an extra check to your if clause:
if (in_array($urlDetails['host'], $this->supportedWebsitesUrls) || in_array('www.'.$urlDetails['host'], $this->supportedWebsitesUrls))
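
Both suggestions can be combined by normalizing the host before the lookup. This is only a sketch: normalizeHost() and isSupported() are hypothetical names, and it assumes the supported list is stored in lower case without the www. prefix:

```php
<?php
// Hypothetical helpers, not part of the original class.
// Lower-cases the host and strips one leading "www." so that
// "WWW.DAILYMOTION.COM" and "dailymotion.com" compare equal.
function normalizeHost(string $videoUrl): ?string
{
    $host = parse_url($videoUrl, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host found, e.g. when the scheme is missing
    }

    $host = strtolower($host);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }

    return $host;
}

function isSupported(string $videoUrl, array $supportedHosts): bool
{
    $host = normalizeHost($videoUrl);

    return $host !== null && in_array($host, $supportedHosts, true);
}

$supported = ['youtube.com', 'vimeo.com', 'dailymotion.com'];
var_dump(isSupported('http://WWW.DAILYMOTION.COM/video/x1', $supported));
```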

PHP Auto-correcting URLs

I don't want to reinvent the wheel, but I couldn't find any library that does this perfectly.
In my script users can save URLs. When they give me a list like:
google.com
www.msn.com
http://bing.com/
and so on...
I want to be able to save them in the database in the "correct format".
What I do now is check whether a protocol is present, add one if it is missing, and then validate the URL against a regex.
For PHP's parse_url, any URL that contains a protocol is valid, so it didn't help much.
How are you doing this? Do you have an idea you would like to share with me?
Edit:
I want to filter out invalid URLs from user input (a list of URLs) and, more importantly, try to auto-correct URLs that are invalid (e.g. ones that don't contain a protocol). Once the user enters the list, it should be validated immediately (there is no time to open the URLs to check whether they really exist).
It would be great to extract parts from the URL, like parse_url does, but the problem with parse_url is that it doesn't work well with invalid URLs. I tried to parse the URL with it and add defaults for required parts that are missing (e.g. no protocol, add http). But parse_url for "google.com" won't return "google.com" as the hostname; it returns it as the path.
This looks like a really common problem to me, but I could not find a ready-made solution on the internet (I found some libraries that standardize a URL, but they won't fix the URL if it is invalid).
Is there some "smart" solution to this, or should I stick with my current approach:
Find the first occurrence of :// and check whether the text before it is a valid protocol; add a protocol if it is missing
Find the next occurrence of / and check whether the hostname is in a valid format
For good measure, validate the whole URL once more via regex
I just have a feeling I will reject some valid URLs with this, and for me it is better to have a false positive than a false negative.

I had the same problem with parse_url as the OP. This is my quick and dirty solution to auto-correct URLs (keep in mind that the code is in no way perfect and does not cover all cases):
Results:
http:/wwww.example.com/lorum.html => http://www.example.com/lorum.html
gopher:/ww.example.com => gopher://www.example.com
http:/www3.example.com/?q=asd&f=#asd => http://www3.example.com/?q=asd&f=#asd
asd://.example.com/folder/folder/ => http://example.com/folder/folder/
.example.com/ => http://example.com/
example.com => http://example.com
subdomain.example.com => http://subdomain.example.com
function url_parser($url) {
// multiple /// messes up parse_url, replace 2+ with 2
$url = preg_replace('/(\/{2,})/','//',$url);
$parse_url = parse_url($url);
if(empty($parse_url["scheme"])) {
$parse_url["scheme"] = "http";
}
if(empty($parse_url["host"]) && !empty($parse_url["path"])) {
// Strip slash from the beginning of path
$parse_url["host"] = ltrim($parse_url["path"], '\/');
$parse_url["path"] = "";
}
$return_url = "";
// Check if scheme is correct
if(!in_array($parse_url["scheme"], array("http", "https", "gopher"))) {
$return_url .= 'http'.'://';
} else {
$return_url .= $parse_url["scheme"].'://';
}
// Check if the right amount of "www" is set.
$explode_host = explode(".", $parse_url["host"]);
// Remove empty entries
$explode_host = array_filter($explode_host);
// And reassign indexes
$explode_host = array_values($explode_host);
// Contains subdomain
if(count($explode_host) > 2) {
// Check if subdomain only contains the letter w(then not any other subdomain).
if(substr_count($explode_host[0], 'w') == strlen($explode_host[0])) {
// Replace with "www" to avoid "ww" or "wwww", etc.
$explode_host[0] = "www";
}
}
$return_url .= implode(".",$explode_host);
if(!empty($parse_url["port"])) {
$return_url .= ":".$parse_url["port"];
}
if(!empty($parse_url["path"])) {
$return_url .= $parse_url["path"];
}
if(!empty($parse_url["query"])) {
$return_url .= '?'.$parse_url["query"];
}
if(!empty($parse_url["fragment"])) {
$return_url .= '#'.$parse_url["fragment"];
}
return $return_url;
}
echo url_parser('http:/wwww.example.com/lorum.html'); // http://www.example.com/lorum.html
echo url_parser('gopher:/ww.example.com'); // gopher://www.example.com
echo url_parser('http:/www3.example.com/?q=asd&f=#asd'); // http://www3.example.com/?q=asd&f=#asd
echo url_parser('asd://.example.com/folder/folder/'); // http://example.com/folder/folder/
echo url_parser('.example.com/'); // http://example.com/
echo url_parser('example.com'); // http://example.com
echo url_parser('subdomain.example.com'); // http://subdomain.example.com

It's not 100% foolproof, but it is a one-liner.
$URL = (((strpos($URL,'https://') === false) && (strpos($URL,'http://') === false))?'http://':'' ).$URL;
EDIT
There was apparently a problem with my initial version if the hostname contains http.
Thanks Trent
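
A slightly more defensive variant of the same idea, sketched under a few assumptions: the helper names addDefaultScheme() and normalizeUrlList() are made up for illustration, http is assumed as the default scheme, and filter_var() is used as the final validity check instead of a hand-written regex. It only looks for a scheme at the very start of the string, so an http:// that appears later in the URL does not suppress the prefix:

```php
<?php
// Hypothetical helpers, not part of the original answers.
// Prepends a default scheme only when the input does not already
// start with "something://".
function addDefaultScheme(string $url, string $default = 'http'): string
{
    $url = trim($url);
    if (preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        return $url;
    }

    return $default . '://' . $url;
}

// Auto-corrects each entry and keeps only those that pass
// FILTER_VALIDATE_URL afterwards.
function normalizeUrlList(array $urls): array
{
    $valid = [];
    foreach ($urls as $url) {
        $candidate = addDefaultScheme($url);
        if (filter_var($candidate, FILTER_VALIDATE_URL) !== false) {
            $valid[] = $candidate;
        }
    }

    return $valid;
}

print_r(normalizeUrlList(['google.com', 'www.msn.com', 'http://bing.com/']));
```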

PHP: need explanation using [a-zA-Z0-9]

I am new to PHP (not to programming overall) and am having problems with this simple line of code. I want to check whether an input field has been filled in as anysymbolornumber@anysymbolornumber, just to check whether a correct email was typed. I don't get any error, but the whole check doesn't work. Here is my code, and thanks!
if ($email = "[a-zA-Z0-9]@[a-zA-Z0-9]")
{

Since you're new to PHP, I suggest you buy a book or read a tutorial or two.
For email validation you should use filter_var, a built-in function that comes with PHP 5.2 and up:
<?php
if(!filter_var("someone@example....com", FILTER_VALIDATE_EMAIL)){
echo("E-mail is not valid");
}else{
echo("E-mail is valid");
}
?>

You can use other functions instead of regular expressions:
if(filter_var($email,FILTER_VALIDATE_EMAIL)){
echo "Valid email";
}else{
echo "Not a valid email";
}

As correctly pointed out in the comments, the regex you are using isn't actually a very good way of validating the email. There are much better ways, but if you are just wanting to get a look at how regular expressions work, it is a starting point. I am not an expert in regex, but this will at least get your if statement working :)
if(preg_match("/[a-zA-Z0-9]@[a-zA-Z0-9]/", $email))
{
// Your stuff
}

It looks like you're trying to verify that an email address matches a certain pattern. But you're not using the proper function. You probably want something like preg_match( $pattern, $target ).
Also, your regex isn't doing what you would want anyway. In particular, you need some quantifiers, or else your email addresses will only be able to consist of one character ahead of the @ and one after. And you need anchors at the beginning and end of the pattern so that you're matching against the entire address, not just the two characters closest to the @.
Consider this:
if( preg_match("/^[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+$/", $email ) ) {
// Whatever
}
Keep in mind, however, that this is really a poor man's approach to validating an email address. Email addresses can contain a lot more characters than those listed in the character class I provided. Furthermore, it would also be possible to construct an invalid email address with those same character classes. It doesn't even begin to deal with Unicode. Using a regex to validate an email address is quite difficult. Friedl takes a shot at it in Mastering Regular Expressions (O'Reilly), and his effort takes a 2KB regular expression pattern. At best, this is only a basic sanity check. It's not a secure means of verifying an email address. At worst, it misses valid addresses and still matches invalid ones.
There is the mailparse_rfc822_parse_addresses function which is more reliable in detecting and matching email addresses.
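
To make the point about quantifiers and anchors concrete, here is a small sketch that compares the two patterns. The function names are hypothetical, and, as said above, neither pattern is a real email validator:

```php
<?php
// Hypothetical helpers for illustration only.
// No anchors, no quantifiers: succeeds as soon as any
// "x@y" pair appears anywhere in the input.
function unanchoredMatch(string $email): bool
{
    return preg_match('/[a-zA-Z0-9]@[a-zA-Z0-9]/', $email) === 1;
}

// Anchored with "+" quantifiers: the whole input must consist of
// allowed characters, one "@", and more allowed characters.
function anchoredMatch(string $email): bool
{
    return preg_match('/^[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+$/', $email) === 1;
}

var_dump(unanchoredMatch('!!a@b!!')); // the junk around "a@b" is ignored
var_dump(anchoredMatch('!!a@b!!'));   // the junk makes the match fail
```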

You need to use preg_match to run the regular expression.
Right now you're assigning the regular expression string to $email.
It could look like:
if ( preg_match("/[a-zA-Z0-9]@[a-zA-Z0-9]/", $email ))
Also keep in mind that when comparing in an if you must use the == operator; a single = is an assignment.
I believe best practice would be to use filter_var instead, like:
if( ! filter_var( $email , FILTER_VALIDATE_EMAIL )) {
// Failed.
}

Another way, taken from: http://www.linuxjournal.com/article/9585 (note that it relies on ereg(), which was deprecated in PHP 5.3 and removed in PHP 7, so on current versions it has to be ported to preg_match())
function check_email_address($email) {
// First, we check that there's one @ symbol,
// and that the lengths are right.
if (!ereg("^[^@]{1,64}@[^@]{1,255}$", $email)) {
// Email invalid because wrong number of characters
// in one section or wrong number of @ symbols.
return false;
}
// Split it into sections to make life easier
$email_array = explode("@", $email);
$local_array = explode(".", $email_array[0]);
for ($i = 0; $i < sizeof($local_array); $i++) {
if (!ereg("^(([A-Za-z0-9!#$%&'*+/=?^_`{|}~-][A-Za-z0-9!#$%&'*+/=?^_`{|}~\.-]{0,63})|(\"[^(\\|\")]{0,62}\"))$", $local_array[$i])) {
return false;
}
}
// Check if domain is IP. If not,
// it should be valid domain name
if (!ereg("^\[?[0-9\.]+\]?$", $email_array[1])) {
$domain_array = explode(".", $email_array[1]);
if (sizeof($domain_array) < 2) {
return false; // Not enough parts to domain
}
for ($i = 0; $i < sizeof($domain_array); $i++) {
if (!ereg("^(([A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9])|([A-Za-z0-9]+))$", $domain_array[$i])) {
return false;
}
}
}
return true;
}

check if a string is a URL [duplicate]

This question already has answers here:
Best way to check if a URL is valid
(12 answers)
Closed 7 years ago.
I've seen many questions but wasn't able to understand how it works
as I want a more simple case.
If we have text, whatever it is, I'd like to check if it is a URL or not.
$text = "something.com"; //this is a url
if (!IsUrl($text)){
echo "No it is not url";
exit; // die well
}else{
echo "Yes it is url";
// my else codes goes
}
function IsUrl($url){
// ???
}
Is there any other way rather than checking with JavaScript in the case JS is blocked?

The code below worked for me:
if(filter_var($text, FILTER_VALIDATE_URL))
{
echo "Yes it is url";
exit; // die well
}
else
{
echo "No it is not url";
// my else codes goes
}
You can also specify RFC compliance and other requirements on the URL using flags. See PHP Validate Filters for more details.

PHP's filter_var function is what you need. Look for FILTER_VALIDATE_URL. You can also set flags to fine-tune your implementation.
No regex needed....
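
As a sketch, the IsUrl() stub from the question could be filled in with filter_var like this. The second parameter is an assumption added here to show one of the flags (FILTER_FLAG_QUERY_REQUIRED); note that a bare "something.com" is rejected because it has no scheme:

```php
<?php
// Sketch of the IsUrl() stub from the question; the $requireQuery
// parameter is an addition for illustration.
function IsUrl(string $text, bool $requireQuery = false): bool
{
    $flags = $requireQuery ? FILTER_FLAG_QUERY_REQUIRED : 0;

    // filter_var() returns the filtered string on success, false on failure
    return filter_var($text, FILTER_VALIDATE_URL, $flags) !== false;
}

var_dump(IsUrl('http://example.com/'));
var_dump(IsUrl('something.com'));
var_dump(IsUrl('http://example.com/?a=b', true));
```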

http://www.php.net/manual/en/function.preg-match.php#93824
<?php
function isUrl($url)
{
    $regex = "((https?|ftp)\:\/\/)?"; // Scheme
    $regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and password
    $regex .= "([a-z0-9.-]*)\.([a-z]{2,3})"; // Host or IP
    $regex .= "(\:[0-9]{2,5})?"; // Port
    $regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
    $regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET query
    $regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor
    // The `i` modifier makes the match case-insensitive
    return preg_match("/^$regex$/i", $url) === 1;
}
?>
But your example URL is oversimplified; (\w+)\.(\w+) would match it. Somebody else mentioned filter_var, which is simply filter_var($url, FILTER_VALIDATE_URL), but it doesn't seem to like non-ASCII characters, so beware.

Check if it is a valid url (example.com IS NOT a valid URL)
function isValidURL($url)
{
// The pattern must stay on one line; the dot between labels is escaped
return preg_match('|^http(s)?://[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $url);
}
How to use the function:
if(!isValidURL($fldbanner_url))
{
$errMsg .= "* Please enter valid URL including http://<br>";
}
Source: http://phpcentral.com/208-url-validation-in-php.html

Regexes are a poor way to validate something as complex as a URL.
PHP's filter_var() function offers a much more robust way to validate URLs. Plus, it's faster, since it's native code.

I don't think there is a definitive answer to this. Examples of strings that could be valid URLs:
localhost
http://xxx.xxx.xxx/alkjnsdf
abs.com
If you only have a small amount of text, you can check it by making a cURL request and seeing whether that returns a valid response. Otherwise, if I enter localhost, it could be a link or it could be something else, and you wouldn't be able to tell.

You could use the following regex pattern to check whether your variable is a URL or not (the surrounding slashes are the delimiters that preg_match requires):
$pattern = "/\b(([\w-]+:\/\/?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|\/)))/";

Something like this might work for you:
$arr = array('abc.com/foo',
'localhost',
'abc+def',
'how r u',
'https://how r u',
'ftp://abc.com',
'a.b');
foreach ($arr as $u) {
$url = $u;
if (!preg_match('#^(?:https?|ftp)://#', $url, $m))
$url = 'http://' . $url;
echo "$u => ";
var_dump(filter_var($url, FILTER_VALIDATE_URL));
}
OUTPUT:
abc.com/foo => string(18) "http://abc.com/foo"
localhost => string(16) "http://localhost"
abc+def => string(14) "http://abc+def"
how r u => bool(false)
https://how r u => bool(false)
ftp://abc.com => string(13) "ftp://abc.com"
a.b => string(10) "http://a.b"
So basically wherever you notice false as return value that is an INVALID URL for you.

Only execute script if entered email is from a specific domain

I am trying to create a script that will only execute its actions if the email address the user enters is from a specific domain. I created a regex that seems to work when I test it in a regex utility, but when it's used in my PHP script, it tells me that valid emails are invalid. In this case, I want any email that is from @secondgearsoftware.com, @secondgearllc.com or asia.secondgearsoftware.com to echo success and all others to be rejected.
$pattern = '/\b[A-Z0-9\._%+-]+@((secondgearsoftware|secondgearllc|euro\.secondgearsoftware|asia\.secondgearsoftware)+\.)+com/';
$email = urldecode($_POST['email']);
if (preg_match($pattern, $email))
{
echo 'success';
}
else
{
echo 'opposite success';
}
I am not really sure what's futzed with the pattern. Any help would be appreciated.

Your regular expression is a bit off (it will allow foo@secondgearsoftwaresecondgearsoftware.com) and can be simplified:
$pattern = '/@((euro\.|asia\.)?secondgearsoftware|secondgearllc)\.com$/i';
I've made it case-insensitive and anchored it to the end of the string.
There doesn't seem to be a need to check what's before the "@" - you should have a proper validation routine for that if necessary, but it seems you just want to check if the email address belongs to one of these domains.

You probably need to use /\b[A-Z0-9\._%+-]+@((euro\.|asia\.)?secondgearsoftware|secondgearllc)\.com/i (note the i at the end) in order to make the regex case-insensitive. I also dropped the +s, as they allow for infinite repetition, which doesn't make sense in this case, and made the euro./asia. prefix optional so that the bare domain still matches.

Here's an easy to maintain solution using regular expressions
$domains = array(
'secondgearsoftware',
'secondgearllc',
'euro\.secondgearsoftware',
'asia\.secondgearsoftware'
);
preg_match("`@(" .implode("|", $domains). ")\.com$`i", $userProvidedEmail);
Here's a couple of tests:
$tests = array(
'bob@secondgearsoftware.com',
'bob@secondgearllc.com',
'bob@Xsecondgearllc.com',
'bob@secondgearllc.net',
'bob@euro.secondgearsoftware.org',
'bob@euro.secondgearsoftware.com',
'bob@euroxsecondgearsoftware.com',
'bob@asia.secondgearsoftware.com'
);
foreach ( $tests as $test ) {
echo preg_match("`@(" .implode("|", $domains). ")\.com$`i", $test),
" <- $test\n";
}
Result (1 is passing of course)
1 <- bob@secondgearsoftware.com
1 <- bob@secondgearllc.com
0 <- bob@Xsecondgearllc.com
0 <- bob@secondgearllc.net
0 <- bob@euro.secondgearsoftware.org
1 <- bob@euro.secondgearsoftware.com
0 <- bob@euroxsecondgearsoftware.com
1 <- bob@asia.secondgearsoftware.com

I suggest you drop the regex and simply use stristr to check if it matches. Something like this should work:
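<?php
// Fill out as needed
$domains = array('secondgearsoftware.com', 'secondgearllc.com');
$email = urldecode($_POST['email']);
$found = false;
foreach ($domains as $domain) {
    // stristr() returns the rest of $email starting at the match, or false
    $rest = stristr($email, $domain);
    // Compare case-insensitively: the address must end with the domain
    if ($rest !== false && strcasecmp($rest, $domain) === 0) {
        $found = true;
        break;
    }
}
if ($found) {
    echo 'success';
}
?>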
The function stristr returns the e-mail address from the point where it found a match to the end, which should be the same as the match in this case. Technically there could be something prior to the domain (for example fkdskjfsdksfks.secondgearsoftware.com), but you can just put "@domainneeded.com" in the list to prevent this. This code is also slightly longer, but it is easily extended with new domains without worrying about regex.
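
Another regex-free variant, sketched with a hypothetical helper name: take everything after the last "@", lower-case it, and look it up in a list of full domains. This assumes the allow-list is written in lower case and that an exact domain match is what is wanted:

```php
<?php
// Hypothetical helper, not from the original answers.
function emailDomainAllowed(string $email, array $allowedDomains): bool
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return false; // no "@" at all
    }

    // Domain is everything after the last "@", compared in lower case
    $domain = strtolower(substr($email, $at + 1));

    return in_array($domain, $allowedDomains, true);
}

$allowed = [
    'secondgearsoftware.com',
    'secondgearllc.com',
    'euro.secondgearsoftware.com',
    'asia.secondgearsoftware.com',
];
echo emailDomainAllowed('bob@asia.secondgearsoftware.com', $allowed) ? 'success' : 'opposite success';
```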
