Check if a given domain name is present in a set of URLs (PHP)

I have a URL whose format may be any of the following:
www.discover.com
http://discover.com
http://www.discover.com
http://www.abcd.discover.com
discover.com
And I have another URL, which may be in any of the formats below:
www.discover.com/something/smoething
http://discover.com/something/smoething
http://www.discover.com/something/smoething
http://www.abcd.discover.com/something/smoething
discover.com/something/smoething
Now I want to compare these two URLs to check whether the domain name "discover.com" is present in the second URL.
I am using the code below:
$domain1 = str_ireplace('www.', '', parse_url($urlItem1, PHP_URL_HOST));
$domain2= str_ireplace('www.', '', parse_url($urlItem2, PHP_URL_HOST));
if(strstr($domain2, $domain1))
{
return $domain2;
}
Solution :
function url_comparison($url1, $url2) {
$domain1 = parse_url($url1,PHP_URL_HOST);
$domain2 = parse_url($url2,PHP_URL_HOST);
$domain1 = isset($domain1) ? str_ireplace('www.', '',$domain1) : str_ireplace('www.', '',$url1);
$domain2 = isset($domain2) ? str_ireplace('www.', '',$domain2) : str_ireplace('www.', '',$url2);
if(strstr($domain2, $domain1))
{
return true;
}
else
{
return false;
}
}
$url1 = "discover.com";
$url2 = "https://www.abcd.discover.com/credit-cards/resources/balance-transfer.shtml";
if(url_comparison($url1, $url2))
{
echo "Same Domain";
}
else
{
echo "Different Domain";
}
Thanks.

Make use of the documentation for parse_url().
Then look at the host name and test it with strpos(). Note that parse_url() only returns a host when the URL has a scheme, so add one if it is missing.
$url = parse_url('http://www.discover.com/something/smoething');
if (isset($url['host']) && strpos($url['host'], 'discover.com') !== false) {
// do your thing
}
strpos() returns 0 for a match at the start of the string, which is a valid result, so the strict !== or === comparison is needed.
To check whether two domains are equal you need to set some rules: is www.example.com the same as example.com, and is https the same as http?
function url_comparison($url_1, $url_2, $www = false, $scheme = false) {
$url_part_1 = parse_url($url_1);
$url_part_2 = parse_url($url_2);
if ($scheme && $url_part_1['scheme'] !== $url_part_2['scheme']) {
return false;
}
if ($www && $url_part_1['host'] !== $url_part_2['host']) {
// strict mode: the hosts must match exactly, including any 'www.'
return false;
} elseif(!$www && strpos($url_part_1['host'], $url_part_2['host']) === false && strpos($url_part_2['host'], $url_part_1['host']) === false) {
// loose mode: one host must contain the other
return false;
}
return true;
}
The function above should point you in the right direction; it is untested, so it may need tweaking. The first two arguments should be URLs. $www is a boolean that controls whether the 'www.' prefix must match, and if $scheme is true the scheme (http or https) must also be the same.
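A substring test such as strstr() or strpos() also matches hosts like notdiscover.com. A minimal sketch of a stricter check, comparing the host against the domain on a dot boundary, might look like the following. The helper names host_of() and host_matches_domain() are hypothetical and not part of the answers above, and the sketch assumes that a leading 'www.' on the domain should be ignored.

```php
<?php
// Hypothetical helper: extract a lowercase host even when the scheme is missing.
function host_of(string $url): ?string
{
    // parse_url() only fills in the host when a scheme is present.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : null;
}

// Hypothetical helper: true when the host of $url is $domain itself
// or one of its subdomains.
function host_matches_domain(string $url, string $domain): bool
{
    $host = host_of($url);
    $domain = host_of($domain);
    if ($host === null || $domain === null) {
        return false;
    }
    // Assumption: www.example.com and example.com name the same domain.
    if (strpos($domain, 'www.') === 0) {
        $domain = substr($domain, 4);
    }
    if ($host === $domain) {
        return true;
    }
    // Require a dot before the domain so "notdiscover.com" does not match.
    $suffix = '.' . $domain;
    return substr($host, -strlen($suffix)) === $suffix;
}
```

With this, host_matches_domain('www.discover.com/something/smoething', 'http://www.discover.com') accepts the URL, while a host such as discover.com.evil.org is rejected.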

Related

PHP - Get Website Title From User Site Input

I'm trying to get the title of a website that is entered by the user.
Text input: website link, entered by user is sent to the server via AJAX.
The user can input anything: an actual existing link, or just single word, or something weird like 'po392#*#8'
Here is a part of my PHP script:
// Make sure the url is on another host
if(substr($url, 0, 7) !== "http://" AND substr($url, 0, 8) !== "https://") {
$url = "http://".$url;
}
// Extra confirmation for security
if (filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED)) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
// Make sure there is a dot in the url
if (strpos($url, '.') !== false) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
// Retrieve title if no title is entered
if($title == "" AND $urlIsValid == "1") {
function get_http_response_code($theURL) {
$headers = get_headers($theURL);
if($headers) {
return substr($headers[0], 9, 3);
} else {
return 'error';
}
}
if(get_http_response_code($url) != "200") {
$urlIsValid = "0";
} else {
$file = file_get_contents($url);
$res = preg_match("/<title>(.*)<\/title>/siU", $file, $title_matches);
if($res === 1) {
$title = preg_replace('/\s+/', ' ', $title_matches[1]);
$title = trim($title);
$title = addslashes($title);
}
// If title is still empty, make title the url
if($title == "") {
$title = $url;
}
}
}
However, there are still errors occurring in this script.
It works perfectly when an existing URL such as 'https://www.youtube.com/watch?v=eB1HfI-nIRg' is entered, and when a non-existing page such as 'https://www.youtube.com/watch?v=NON-EXISTING' is entered, but it doesn't work when the user enters something like 'twitter.com' (without http) or something like 'yikes'.
I tried literally everything: cURL, DOMDocument...
The problem is that when an invalid link is entered, the AJAX call never completes (it keeps loading), while it should set $urlIsValid = "0" whenever an error occurs.
I hope someone can help me - it's appreciated.
Nathan
You have a relatively simple problem but your solution is too complex and also buggy.
These are the problems that I've identified with your code:
// Make sure the url is on another host
if(substr($url, 0, 7) !== "http://" AND substr($url, 0, 8) !== "https://") {
$url = "http://".$url;
}
This does not ensure that the URL is on another host (it could still be localhost). You should remove this code.
// Make sure there is a dot in the url
if (strpos($url, '.') !== false) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
This code overwrites the result of the check above it, where you validate that the string is indeed a valid URL, so remove it.
The definition of the additional function get_http_response_code is pointless. You can use file_get_contents alone to get the HTML of the remote page and compare the result against false to detect an error.
Also, from your code I conclude that if the (externally defined) variable $title is not empty, you won't perform any external fetch, so why not check that first?
To sum it up, your code should look something like this:
if('' === $title && filter_var($url, FILTER_VALIDATE_URL))
{
//@ means we suppress warnings as we won't need them
//this could be done with error_reporting(0) or similar side-effect method
$html = getContentsFromUrl($url);
if(false !== $html && preg_match("/<title>(.*)<\/title>/siU", $html, $title_matches))
{
$title = preg_replace('/\s+/', ' ', $title_matches[1]);
$title = trim($title);
$title = addslashes($title);
}
// If title is still empty, make title the url
if($title == "") {
$title = $url;
}
}
function getContentsFromUrl($url)
{
//if not full/complete url
if(!preg_match('#^https?://#ims', $url))
{
$completeUrl = 'http://' . $url;
$result = @file_get_contents($completeUrl);
if(false !== $result)
{
return $result;
}
//we try with https://
$url = 'https://' . $url;
}
return @file_get_contents($url);
}
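The title extraction step can be isolated from the network fetch, which makes it easier to test. A minimal sketch, assuming the HTML has already been fetched into a string (or is false on failure): the helper names extract_title() and title_or_url() are hypothetical, and decoding entities is an addition that the code above does not perform.

```php
<?php
// Hypothetical helper: pull the <title> text out of an HTML string.
function extract_title(string $html): ?string
{
    // Allow attributes on the tag. Flags: s = dot matches newlines,
    // i = case-insensitive, U = ungreedy.
    if (!preg_match('/<title\b[^>]*>(.*)<\/title>/siU', $html, $m)) {
        return null;
    }
    // Decode entities, then collapse runs of whitespace into single spaces.
    $title = html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $title = trim(preg_replace('/\s+/', ' ', $title));
    return $title === '' ? null : $title;
}

// Hypothetical wrapper: fall back to the URL when no title can be found.
// $html may be false, as returned by a failed file_get_contents() call.
function title_or_url(string $url, $html): string
{
    $title = is_string($html) ? extract_title($html) : null;
    return $title ?? $url;
}
```

Escaping for storage or output (the addslashes() call above) is left to the caller here, since the right escaping depends on where the title ends up.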

Validating a URL with PHP and returning only the host domain name

I want to validate a domain name and then return the main domain stripped down, e.g. http://www.example.co.uk/path/ to example.co.uk
I have made a start with help from various sources and can do this with .com, .net, .org, .info & all the .uk’s.
$targetUrl = 'http://sub.example.uk/test/';
$host = filter(get_domain($targetUrl));
function filter($domain){
if($domain){
$domain_array = explode(".", $domain);
$domain_count = count($domain_array);
$domain_last = end($domain_array);
$domain_first = $domain_array[0];
$domain_second = $domain_array[1];
$domain_second_last = array_slice($domain_array, -2, 1);
$domain_second_last = $domain_second_last[0];
$domain_third_last = array_slice($domain_array, -3, 1);
$domain_third_last = $domain_third_last[0];
// UK Validation
$uk_second = array('ac', 'co', 'gov', 'judiciary', 'ltd', 'me', 'mod', 'net', 'nhs', 'nic', 'org', 'parliament', 'plc', 'police', 'sch');
if($domain_last == 'uk'){
if($domain_count == '2'){
// if domain.uk
return $domain;
}elseif(in_array($domain_second, $uk_second)){
//if domain.$uk_second.uk
return $domain;
}elseif(in_array($domain_second_last, $uk_second)){
// if subdomain on 2 dd.dd.co.uk rename to dd.co.uk
$domain = $domain_third_last.'.'.$domain_second_last.'.'.$domain_last;
return $domain;
}else{
// finaly it must be a dsd.sds.uk so lets remove the subdomain
$domain = $domain_second_last.'.'.$domain_last;
return $domain;
}
}
// END .UK
// SImple Single TLDs
$single_tlds = array('com', 'net', 'org', 'info');
if(in_array($domain_last, $single_tlds)){
if($domain_count == '2'){
// simple is it a ddd.com
return $domain;
}else{
$domain = $domain_second_last.'.'.$domain_last;
return $domain;
}
}
}//if domain
}
function get_domain($domain) {
$domain = strtolower($domain);
if (!filter_var($domain, FILTER_VALIDATE_URL) === false) {
$urlParts = parse_url($domain);
$domain = $urlParts['host'];
$domain = str_ireplace('www.','',$domain);
$original = $domain = strtolower($domain);
if (filter_var($domain, FILTER_VALIDATE_IP)) { return $domain; }
$arr = array_slice(array_filter(explode('.', $domain, 4), function($value){
return $value !== 'www'; }), 0); //rebuild array indexes
if (count($arr) > 2) {
$count = count($arr);
$_sub = explode('.', $count === 4 ? $arr[3] : $arr[2]);
if (count($_sub) === 2) { // two level TLD
$removed = array_shift($arr);
if ($count === 4) // got a subdomain acting as a domain
$removed = array_shift($arr);
}
elseif (count($_sub) === 1){ // one level TLD
$removed = array_shift($arr); //remove the subdomain
if (strlen($_sub[0]) === 2 && $count === 3) // TLD domain must be 2 letters
array_unshift($arr, $removed);
else{
// non country TLD according to IANA
$tlds = array( 'aero', 'arpa', 'asia', 'biz', 'cat', 'com', 'coop', 'edu', 'gov', 'info', 'jobs', 'mil', 'mobi', 'museum', 'name', 'net', 'org', 'post', 'pro', 'tel', 'travel', 'xxx', );
if (count($arr) > 2 && in_array($_sub[0], $tlds) !== false) {//special TLD don't have a country
array_shift($arr);
}
}
}
else { // more than 3 levels, something is wrong
for ($i = count($_sub); $i > 1; $i--)
$removed = array_shift($arr);
}
}
elseif (count($arr) === 2) {
$arr0 = array_shift($arr);
if (strpos(join('.', $arr), '.') === false
&& in_array($arr[0], array('localhost','test','invalid')) === false) // not a reserved domain
{
// seems invalid domain, restore it
array_unshift($arr, $arr0);
}
}
return join('.', $arr);
}
}
It’s just not very scalable: I’m going to have to go through all the domain suffixes and add them. I’m sure there must be a simpler way. Would someone be so kind as to help out? Maybe there is some way of loading the list from https://publicsuffix.org/list/public_suffix_list.dat
So, the inputs and the results I would expect to see are:
http://subdomain.example.co.uk/path/site.php -> example.co.uk
http://subdomain.example.uk/path/site.php -> example.uk
www.subdomain.example.uk/path/site.php -> example.uk
subdomain.example.uk -> example.uk
http://gobble.gobble.notavalidsuffix -> false
The function below validates a URL, strips the scheme, the path and the leading host label from it, and passes the remaining string to gethostbyname(). This queries a DNS server for the given root domain: if the lookup succeeds you get an IP address back, and if it fails the input string is returned unchanged. I then pass this result through a filter that validates IP strings. If that succeeds, the function returns the domain. Just make sure you are pointing at a DNS provider that does not resolve every lookup. For example, my ISP in the UK answers every failed DNS lookup with a valid A record, which in turn resolves to a web page saying "No Such Webpage". Google DNS works fine, so use that if you can.
function validDom($url) {
$newUrl = (filter_var($url, FILTER_VALIDATE_URL)) ? $url : FALSE;
if ($newUrl === FALSE) {
return FALSE;
}
$urlSplit = explode('/', $newUrl);
foreach ($urlSplit as $k=>$v) {
if(substr_count($v, '.') >= 2) {
$newUrl = $v;
}
}
$cleanDomain = substr_replace($newUrl, '', 0, strpos($newUrl, '.')+1);
$chkDNS = gethostbyname($cleanDomain);
if (filter_var($chkDNS, FILTER_VALIDATE_IP) !== FALSE) {
return $cleanDomain;
}
return false;
}
Test Domains
$domainArr = [
'https://www.facebook.com',
'https://www.care.org.uk',
'https://www.facebook.co.uk',
'https://www.google.com/dfsdfsdfsd/sdfsdf',
'https://sub.fsdfsdfsdfsdfsd.co.uk/dfsdfsdf',
'https://www.nhs.uk/dfsdfsdfsdfsd?fgfg=fgfg',
'javascript://comment%0Aalert(1)"hello',
];
foreach($domainArr as $k=>$v) {
var_dump(validDom($v));
echo '<br>';
}
Output:
string(12) "facebook.com"
string(11) "care.org.uk"
string(14) "facebook.co.uk"
string(10) "google.com"
bool(false)
string(6) "nhs.uk"
bool(false)
Edit:
This function also works around the issue of malicious input bypassing FILTER_VALIDATE_URL, because javascript://comment%0Aalert(1)"hello does not resolve via DNS and therefore fails validation.
The truth is that validating a URL in PHP is a complex task.
You could use the built-in parse_url() and filter_var() functions, but as a number of user comments on PHP.net, and even the documentation, point out, they're not very reliable.
For one, they don't support internationalized domain names (URLs containing non-ASCII, e.g. Unicode characters).
Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail.
For another, they pass a lot of false positives. The documentation states:
Beware a valid URL may not specify the HTTP protocol http:// so further validation may be required to determine the URL uses an expected protocol, e.g. ssh:// or mailto:.
They also don't have a list of valid name extensions. This means something like asdf://asdf.asdf gets passed by filter_var. I tried it, and it really did pass.
filter_var could also be a potential XSS vulnerability, because it passes something like javascript://comment%0Aalert(1)"hello as valid.
Sorry to be the bearer of bad tidings, but that's the truth. I did spot a number of PHP validation libraries that cover URLs, but they all still build upon parse_url or filter_var. I'm also not confident a regex could do the job.
However, (plug time:) I'm working on a PHP library that should be able to achieve what you want, and I hope to get it done in a couple of days. 😊
Here you are:
function filterUrl ($url) {
if (filter_var($url, FILTER_VALIDATE_URL)) {
$host = parse_url($url, PHP_URL_HOST);
$parts = explode('.', $host);
$lastParts = array_slice($parts, -3, 3);
return implode('.', $lastParts);
} else {
return false;
}
}
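The question suggests driving the check from the public suffix list rather than hard-coding each TLD. A minimal sketch of that idea follows, assuming the suffixes are already available as a plain array. The function name registrable_domain() and the short $suffixes array are illustrative only: a real implementation would load the entries from public_suffix_list.dat and would also need to handle the wildcard and exception rules in that file, which this sketch ignores.

```php
<?php
// Hypothetical helper: return the registrable domain of $url, or false when
// the host does not end in one of the known suffixes.
function registrable_domain(string $url, array $suffixes)
{
    // parse_url() needs a scheme to identify the host.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }
    $host = strtolower($host);
    // A bare suffix such as "co.uk" is not a registrable domain.
    if (in_array($host, $suffixes, true)) {
        return false;
    }
    $labels = explode('.', $host);
    $count = count($labels);
    // Try the longest candidate suffix first so that co.uk wins over uk.
    for ($i = 1; $i < $count; $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (in_array($suffix, $suffixes, true)) {
            // Registrable domain = one label to the left of the suffix.
            return $labels[$i - 1] . '.' . $suffix;
        }
    }
    return false;
}

// Illustrative subset only; not the real public suffix list.
$suffixes = ['com', 'net', 'org', 'info', 'uk', 'co.uk', 'org.uk'];
```

Called with the sample inputs from the question, registrable_domain('http://subdomain.example.co.uk/path/site.php', $suffixes) gives example.co.uk, and http://gobble.gobble.notavalidsuffix gives false.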

PHP: check file size via the Internet [duplicate]

How can I check the size of a file over the Internet? The sample below is my code, which does not work:
echo filesize('http://localhost/wordpress-3.1.2.zip');
echo filesize('http://www.wordpress.com/wordpress-3.1.2.zip');
The filesize function is used to get the size of files stored locally.* For remote files you must find another solution, for example:
<?php
function getSizeFile($url) {
if (substr($url,0,4)=='http') {
$x = array_change_key_case(get_headers($url, 1),CASE_LOWER);
if ( strcasecmp($x[0], 'HTTP/1.1 200 OK') != 0 ) { $x = $x['content-length'][1]; }
else { $x = $x['content-length']; }
}
else { $x = @filesize($url); }
return $x;
}
?>
Source: see the first user comment at the link below
http://php.net/manual/en/function.filesize.php
*Well, to be honest, since PHP 5 there are some wrappers for file functions, see here:
http://www.php.net/manual/en/wrappers.php
You can find a lot more examples, even here on SO. This one should satisfy your needs: PHP: Remote file size without downloading file
Try to use the search function before asking a question next time!
Try this function:
<?php
function remotefsize($url) {
$sch = parse_url($url, PHP_URL_SCHEME);
if (($sch != "http") && ($sch != "https") && ($sch != "ftp") && ($sch != "ftps")) {
return false;
}
if (($sch == "http") || ($sch == "https")) {
$headers = get_headers($url, 1);
if ((!array_key_exists("Content-Length", $headers))) { return false; }
return $headers["Content-Length"];
}
if (($sch == "ftp") || ($sch == "ftps")) {
$server = parse_url($url, PHP_URL_HOST);
$port = parse_url($url, PHP_URL_PORT);
$path = parse_url($url, PHP_URL_PATH);
$user = parse_url($url, PHP_URL_USER);
$pass = parse_url($url, PHP_URL_PASS);
if ((!$server) || (!$path)) { return false; }
if (!$port) { $port = 21; }
if (!$user) { $user = "anonymous"; }
if (!$pass) { $pass = "phpos@"; }
switch ($sch) {
case "ftp":
$ftpid = ftp_connect($server, $port);
break;
case "ftps":
$ftpid = ftp_ssl_connect($server, $port);
break;
}
if (!$ftpid) { return false; }
$login = ftp_login($ftpid, $user, $pass);
if (!$login) { return false; }
$ftpsize = ftp_size($ftpid, $path);
ftp_close($ftpid);
if ($ftpsize == -1) { return false; }
return $ftpsize;
}
}
?>
I think that's probably not possible. The best way is to download the file via file_get_contents and then use filesize over the file. You can later delete the file too!
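Both header-based answers above depend on reading Content-Length out of the array returned by get_headers($url, 1). That lookup can be sketched on its own, without any network access. The helper name content_length_from_headers() is hypothetical; the sketch assumes that when the header appears several times (for example after a redirect) the last value belongs to the final response.

```php
<?php
// Hypothetical helper: read the final Content-Length from an array shaped
// like the result of get_headers($url, 1). Returns null when it is absent
// or not a plain non-negative integer.
function content_length_from_headers(array $headers): ?int
{
    // Header names are case-insensitive, so normalise the keys first.
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['content-length'])) {
        return null;
    }
    $value = $headers['content-length'];
    // Repeated headers are collected into an array; take the last entry.
    if (is_array($value)) {
        $value = end($value);
    }
    $value = trim((string) $value);
    return ctype_digit($value) ? (int) $value : null;
}
```

A caller would pass it something like get_headers($url, 1) ?: []. A null result does not mean the file is missing, since a server is not required to send Content-Length.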

PHP: checking if YT url is valid and if video exists

The function's purpose is to validate the URL of a YouTube video and check whether the video exists. This is a snippet of my actual code. I manipulate the string into my desired format and then check whether it is valid and exists. If it passes the test, I echo the results. The problem is that I am not calling the function correctly.
I am getting this output even though the video does exist:
The video does not exist or invalid url
Edited: added the isValidURL function.
Code for checking whether the video exists or the URL is invalid:
if($_POST)
{
// After applying url manipulation and getting the url in a proper format result = $formatted_url
function isValidURL($formatted_url) {
$formatted_url = trim($formatted_url);
$isValid = true;
if (strpos($formatted_url, 'http://') === false && strpos($formatted_url, 'https://') === false) {
$formatted_url = 'http://'.$formatted_url;
}
//first check with php's FILTER_VALIDATE_URL
if (filter_var($formatted_url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED) === false) {
$isValid = false;
} else {
//not all invalid URLs are caught by FILTER_VALIDATE_URL
//use our own mechanism
$host = parse_url($formatted_url, PHP_URL_HOST);
$dotcount = substr_count($host, '.');
//the host should contain at least one dot
if ($dotcount > 0) {
//if the host contains one dot
if ($dotcount == 1) {
//and it start with www.
if (strpos($host, 'www.') === 0) {
//there is no top level domain, so it is invalid
$isValid = false;
}
} else {
//the host contains multiple dots
if (strpos($host, '..') !== false) {
//dots can't be next to each other, so it is invalid
$isValid = false;
}
}
} else {
//no dots, so it is invalid
$isValid = false;
}
}
//return false if host is invalid
//otherwise return true
return $isValid;
}
$isValid = getYoutubeVideoID($formatted_url);
function isYoutubeVideo($formatted_url) {
$isValid = false;
//validate the url, see: http://snipplr.com/view/50618/
if (isValidURL($formatted_url)) {
//code adapted from Moridin: http://snipplr.com/view/19232/
$idLength = 11;
$idOffset = 3;
$idStarts = strpos($formatted_url, "?v=");
if ($idStarts !== FALSE) {
//there is a videoID present, now validate it
$videoID = substr($formatted_url, $idStarts + $idOffset, $idLength);
$http = new HTTP("http://gdata.youtube.com");
$result = $http->doRequest("/feeds/api/videos/".$videoID, "GET");
//returns Array('headers' => Array(), 'body' => String);
$code = $result['headers']['http_code'];
//did the request return a http code of 2xx?
if (substr($code, 0, 1) == 2) {
$isValid = true;
}
}
}
return $isValid;
}
$isValid = isYoutubeVideo($formatted_url);
parse_str($parsed_url['query'], $parsed_query_string);
$v = $parsed_query_string['v'];
if ( $isValid == true ) {
//Iframe code
echo htmlentities ('<iframe src="http://www.youtube.com/embed/'.$v.'" frameborder="0" width="'.$wdth.'" height="'.$hth.'"></iframe>');
//Old way to embed code
echo htmlentities ('<embed src="http://www.youtube.com/v/'.$v.'" width="'.$wdth.'" height="'.$hth.'" type="application/x-shockwave-flash" wmode="transparent" embed="" /></embed>');
}
else {
echo ("The video does not exist or invalid url");
}
}
?>
You are missing the isValidURL() function. Try changing this line:
if (isValidURL($formatted_url)) {
to
if(preg_match('/http:\/\/www\.youtube\.com\/watch\?v=[^&]+/', $formatted_url, $result)) {
or
$test = parse_url($formatted_url);
if($test['host']=="www.youtube.com"){
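The host check and the ID lookup can be combined with parse_url() and parse_str(), so that the position of v= in the query string does not matter. The following is a minimal sketch: the function name youtube_video_id(), the list of accepted hosts and the 11-character ID pattern are assumptions for illustration, not something YouTube guarantees.

```php
<?php
// Hypothetical helper: return the video ID from a YouTube watch URL, or null.
function youtube_video_id(string $url): ?string
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['host'], $parts['query'])) {
        return null;
    }
    // Assumption: these are the hosts we want to accept.
    $hosts = ['youtube.com', 'www.youtube.com', 'm.youtube.com'];
    if (!in_array(strtolower($parts['host']), $hosts, true)) {
        return null;
    }
    // parse_str() decodes the query string into an array of parameters.
    parse_str($parts['query'], $query);
    $id = $query['v'] ?? '';
    // Assumption: IDs are 11 characters from [A-Za-z0-9_-].
    if (is_string($id) && preg_match('/\A[A-Za-z0-9_-]{11}\z/', $id)) {
        return $id;
    }
    return null;
}
```

This only validates the shape of the URL. Whether the video actually exists still has to be confirmed with a request to YouTube.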

Don't check the domain name only

Soooo it's me again with this function..
I have the function working.
function http_file_exists($url)
{
$f = @fopen($url,"r");
if($f)
{
fclose($f);
return true;
}
return false;
}
And this is the usage :
if ($submit || $preview || $refresh)
{
$post_data['your_url'] = "http://www.google.com/this"; //remove the equals and url value if using in real post
$your_url = $post_data['your_url'];
$your_url_exists = (isset($your_url)) ? true : false;
$your_url = preg_replace(array('#&\#46;#','#&\#58;#','/\[(.*?)\]/'), array('.',':',''), $your_url);
if ($your_url_exists && http_file_exists($your_url) == true)
{
trigger_error('exists!');
}
How do I make it check the whole URL and not only the domain name? For example http://www.google.com/this
The URL tested is http://www.google.com/abadurltotest
Source of the code below: What is the fastest way to determine if a URL exists in PHP?
function http_file_exists($url)
{
//$url = preg_replace(array('#&\#46;#','#&\#58;#','/\[(.*?)\]/'), array('.',':',''), $url);
$url_data = parse_url ($url);
if (!$url_data) return FALSE;
$errno="";
$errstr="";
$fp=fsockopen($url_data['host'],80,$errno,$errstr,30);
// fsockopen() returns false on failure
if($fp===false) return FALSE;
// parse_url() already includes the leading slash in 'path'
$path = isset($url_data['path']) ? $url_data['path'] : '/';
if (isset( $url_data['query'])) $path .= '?' .$url_data['query'];
$out="GET $path HTTP/1.1\r\n";
$out.="Host: {$url_data['host']}\r\n";
$out.="Connection: Close\r\n\r\n";
fwrite($fp,$out);
$content=fgets($fp);
$code=trim(substr($content,9,4)); //get http code
fclose($fp);
// if http code is 2xx or 3xx url should work
return ($code[0] == 2 || $code[0] == 3) ? TRUE : FALSE;
}
Add the code above to functions_posting.php, replacing the previous function, and use it like this:
if ($submit || $preview || $refresh)
{
$post_data['your_url'] = " http://www.google.com/abadurltotest";
$your_url = $post_data['your_url'];
$your_url_exists = (request_var($your_url, '')) ? true : false;
$your_url = preg_replace(array('#&\#46;#','#&\#58;#','/\[(.*?)\]/'), array('.',':',''), $your_url);
if ($your_url_exists === true && http_file_exists($your_url) === false)
{
trigger_error('A bad url was entered, Please push the browser back button and try again.');
}
Use cURL and check the HTTP status code. If it's not 200, most likely the URL doesn't exist or is inaccessible.
Also note that
$your_url_exists = (isset($your_url)) ? true : false;
makes no sense. It seems you want
$your_url_exists = (bool)$your_url;
or just check $your_url instead of $your_url_exists
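Whichever transport is used, the decision comes down to the status code of the response. With cURL that code is available from curl_getinfo($ch, CURLINFO_HTTP_CODE); with get_headers() or a raw socket it has to be parsed from the status line. A minimal sketch of that parsing step follows, with hypothetical helper names status_code_from_line() and status_means_exists(), and assuming, like the socket example above, that 2xx and 3xx both count as "exists".

```php
<?php
// Hypothetical helper: extract the numeric status code from an HTTP status
// line such as "HTTP/1.1 404 Not Found". Returns null for anything else.
function status_code_from_line(string $line): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', trim($line), $m)) {
        return (int) $m[1];
    }
    return null;
}

// Assumption: 2xx and 3xx responses mean the URL can be reached.
function status_means_exists(?int $code): bool
{
    return $code !== null && $code >= 200 && $code < 400;
}
```

For example, the first element of the array returned by get_headers($url), or the first line read from the socket with fgets(), can be passed to status_code_from_line().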
