I'm working on a little web crawler as a side project at the moment. Basically, it collects all hrefs on a page and then subsequently parses those. My problem is:
How can I only get the actual page results? At the moment I'm using the following:
foreach($page->getElementsByTagName('a') as $link)
{
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "")
    {
        $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
    }
    elseif ( @$base_url['host'] == @$compare_url['host'] )
    {
        $links[] = $link->getAttribute('href');
    }
}
As you can see, this will bring in jpegs, exe files, etc. I only need to pick up the web pages, like .php, .html, .asp, etc.
I'm not sure if there is some function able to work this out, or if it will need to be a regex against some sort of master list?
Thanks
Since the URL string alone isn't connected with the resource behind it in any way, you will have to go out and ask the webserver about it. For this there's an HTTP method called HEAD, so you won't have to download everything.
You can implement this with cURL in PHP like this:
function curl_head($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    $content = curl_exec($curl);
    curl_close($curl);
    // redirected heads just pile up one after another
    $parts = explode("\r\n\r\n", trim($content));
    // return only the last one
    return end($parts);
}

function is_html($url) {
    $header = curl_head($url);
    // look for the content-type part of the header response
    return preg_match('/content-type\s*:\s*text\/html/i', $header);
}
var_dump(is_html('http://github.com'));
This version only accepts text/html responses and doesn't check whether the response is a 404 or another error (it does, however, follow redirects up to 5 jumps). You can tweak the regexp or add some error handling, either from the curl response or by matching against the header string's first line.
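For instance, a minimal sketch of such a status check might look like this (the is_ok_html() name is made up here; it reuses the curl_head() helper from above):

    function is_ok_html($url) {
        $header = curl_head($url);
        // first line of the (final) response header, e.g. "HTTP/1.1 200 OK"
        $status_line = strtok($header, "\r\n");
        // require a 2xx status and a text/html content type
        return preg_match('/^HTTP\/\d(\.\d)?\s+2\d\d/', $status_line)
            && preg_match('/content-type\s*:\s*text\/html/i', $header);
    }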
Note: Webservers will run scripts behind these URLs to give you responses. Be careful not to overload hosts with probing, or to grab "delete" or "unsubscribe" type links.
To check if a page has a valid extension (html, php, ...), use this function:
function check($url){
    $extensions = array("php", "html"); //Add extensions here
    foreach($extensions as $ext){
        if(substr($url, -(strlen($ext)+1)) == ".".$ext){
            return 1;
        }
    }
    return 0;
}
foreach($page->getElementsByTagName('a') as $link) {
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "") {
        if(check($link->getAttribute('href'))){
            $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
        }
    }
    elseif ( @$base_url['host'] == @$compare_url['host'] ) {
        if(check($link->getAttribute('href'))){
            $links[] = $link->getAttribute('href');
        }
    }
}
Consider using preg_match to check the type of the link (application, picture, HTML file) and, based on the results, decide what to do.
Another (and simpler) option is to use explode and look at the last part of the URL, which comes after a . (the extension).
For instance:
//If the URL has any one of the following extensions, ignore it.
$forbid_ext = array('jpg','gif','exe');
foreach($page->getElementsByTagName('a') as $link) {
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "")
    {
        if(check_link_type($link->getAttribute('href')))
            $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
    }
    elseif ( @$base_url['host'] == @$compare_url['host'] )
    {
        if(check_link_type($link->getAttribute('href')))
            $links[] = $link->getAttribute('href');
    }
}
function check_link_type($url)
{
    global $forbid_ext;
    $parts = explode(".", $url);
    $ext = end($parts);
    if(in_array($ext, $forbid_ext))
        return false;
    return true;
}
UPDATE (instead of checking 'forbidden' extensions, let's look for good ones)
$good_ext = array('html','php','asp');
function check_link_type($url)
{
    global $good_ext;
    $parts = explode(".", $url);
    $ext = end($parts);
    if($ext == "" || in_array($ext, $good_ext))
        return true;
    return false;
}
Related
I'm trying to create a simple script that'll let me know if a website is based off WordPress.
The idea is to check whether I'm getting a 404 from a URL when trying to access its wp-admin like so:
https://www.audi.co.il/wp-admin (which returns "true" because it exists)
When I try to input a URL that does not exist, like "https://www.audi.co.il/wp-blablabla", PHP still returns "true", even though Chrome, when I paste this link into its address bar, shows a 404 in the network tab.
Why is it so and how can it be fixed?
This is the code (based on another user's answer):
<?php
$file = 'https://www.audi.co.il/wp-blabla';
$file_headers = @get_headers($file);
if(!$file_headers || strpos($file_headers[0], '404 Not Found')) {
    $exists = "false";
}
else {
    $exists = "true";
}
echo $exists;
You can try to find the wp-admin page, and if it is not there then there's a good chance it's not WordPress.
function isWordPress($url)
{
    $ch = curl_init();
    // set URL and other appropriate options
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    // fetch the URL (the body is returned rather than echoed because of RETURNTRANSFER)
    curl_exec($ch);
    $httpStatus = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    // close cURL resource, and free up system resources
    curl_close($ch);
    if ( $httpStatus == 200 ) {
        return true;
    }
    return false;
}
if ( isWordPress("http://www.example.com/wp-admin") ) {
// This is WordPress
} else {
// Not WordPress
}
This may not be one hundred percent accurate as some WordPress installations protect the wp-admin URL.
I'm probably late to the party, but another way you can easily determine whether a site runs WordPress is by crawling /wp-json. If you're using Guzzle in PHP, you can do this:
function isWordpress($url) {
    try {
        $http = new \GuzzleHttp\Client();
        $response = $http->get(rtrim($url, "/")."/wp-json");
        $contents = json_decode($response->getBody()->getContents());
        if($contents) {
            return true;
        }
    } catch (\Exception $exception) {
        //...
    }
    return false;
}
I've got a web application in Drupal that is basically acting as a proxy to a multi-page HTML form somewhere else. I am able to retrieve the page with cURL and parse it with DOMDocument, then embed the contents of the <form> inside a Drupal form:
<?php
function proxy_get_dom($url, $method = 'get', $arguments = array()) {
// Keep a static cURL resource for speed.
static $web = NULL;
if (!is_resource($web)) {
$web = curl_init();
// Don't include any HTTP headers in the output.
curl_setopt($web, CURLOPT_HEADER, FALSE);
// Return the result as a string instead of echoing directly.
curl_setopt($web, CURLOPT_RETURNTRANSFER, TRUE);
}
// Add any GET arguments directly to the URL.
if ($method == 'get' && !empty($arguments)) {
$url .= '?' . http_build_query($arguments, 'n', '&');
}
curl_setopt($web, CURLOPT_URL, $url);
// Include POST data.
if ($method == 'post' && !empty($arguments)) {
curl_setopt($web, CURLOPT_POST, TRUE);
curl_setopt($web, CURLOPT_POSTFIELDS, http_build_query($arguments));
}
else {
curl_setopt($web, CURLOPT_POST, FALSE);
}
$use_errors = libxml_use_internal_errors(TRUE);
try {
$dom = new DOMDocument();
$dom->loadHTML(curl_exec($web));
}
catch (Exception $e) {
// Error handling...
return NULL;
}
if (!isset($dom)) {
// Error handling...
return NULL;
}
libxml_use_internal_errors($use_errors);
return $dom;
}
function FORM_ID($form, &$form_state) {
// Set the initial URL if it hasn't already been set.
if (!isset($form_state['remote_url'])) {
$form_state['remote_url'] = 'http://www.example.com/form.faces';
}
// Get the DOMDocument
$dom = proxy_get_dom($form_state['remote_url'], 'post', $_POST);
if (!isset($dom)) {
return $form;
}
// Pull out the <form> and insert it into $form['embedded'].
$nlist = $dom->getElementsByTagName('form');
// assert that $nlist->length == 1
$form['embedded']['#markup'] = '';
foreach ($nlist->item(0)->childNodes as $childnode) {
// It would be better to use $dom->saveHTML but it does not accept the
// $node parameter until php 5.3.6, which we are not guaranteed to be
// using.
$form['embedded']['#markup'] .= $dom->saveXML($childnode);
}
// Apply some of the attributes from the <form> element onto our <form>
// element.
$form_element = $nlist->item(0);
if (isset($form_element->attributes)) {
foreach ($form_element->attributes as $attr) {
if ($attr->nodeName == 'action') {
$form_state['remote_action'] = $attr->nodeValue;
}
elseif ($attr->nodeName == 'class') {
$form['#attributes']['class'] = explode(' ', $attr->nodeValue);
}
elseif ($attr->nodeName != 'method') {
$form['#attributes'][$attr->nodeName] = $attr->nodeValue;
}
}
}
return $form;
}
function FORM_ID_submit($form, &$form_state) {
// Use the remote_action as the remote_url, if set.
if (isset($form_state['remote_action'])) {
$form_state['remote_url'] = $form_state['remote_action'];
}
// Rebuild the form.
$form_state['rebuild'] = TRUE;
}
?>
However, the embedded form will not move past the first step. The issue seems to be that the page behind the proxy is setting a session cookie which I am ignoring in the above code. I can store the cookies with CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR, but I'm not sure where the file should be. For one thing it should definitely be a different location for each user, and it definitely should not be a publicly accessible location.
My question is: How do I store and send cookies from cURL per-user in Drupal?
Assuming you're using sessions, use the user's session ID to name the cookie files. E.g.
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
would give EVERYONE the same cookie file and they'd end up sharing the same cookies. But doing
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie-' . session_id() . '.txt');
will produce a unique cookie file for every user. You will have to remove those files manually, otherwise you're going to end up with a HUGE cookie file repository. And if you change session IDs (e.g. with session_regenerate_id()), you'll "lose" the cookie file because the session ID won't be the same anymore.
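A minimal sketch of how this could look inside the question's proxy_get_dom() (using its $web handle; the temp-dir path and file naming are just examples, and CURLOPT_COOKIEJAR is added so cookies get written back):

    // Hypothetical per-user cookie file; any non-public, writable directory works.
    $cookie_file = sys_get_temp_dir() . '/proxy-cookies-' . session_id() . '.txt';
    // Read cookies from, and write them back to, the same per-session file.
    curl_setopt($web, CURLOPT_COOKIEFILE, $cookie_file);
    curl_setopt($web, CURLOPT_COOKIEJAR, $cookie_file);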
Is it possible to pull text data from another domain (not currently owned) using PHP? If not, is there any other method? I've tried using iframes, and because my page is a mobile website, things just don't look good. I'm trying to show a marine forecast for a specific area. Here is the link I'm trying to display.
Update...........
This is what I ended up using. Maybe it will help someone else. However I felt there was more than one right answer to my question.
<?php
$ch = curl_init("http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
echo $content;
?>
This works as I think you want it to, except it depends on the weather site keeping the same format (and on "Outlook" being displayed).
<?php
//define the URL of the resource
$url = 'http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1';
//function from http://stackoverflow.com/questions/5696412/get-substring-between-two-strings-php
function getInnerSubstring($string, $boundstring, $trimit=false)
{
    $res = false;
    $bstart = strpos($string, $boundstring);
    if($bstart !== false)
    {
        $bend = strrpos($string, $boundstring);
        if($bend !== false && $bend > $bstart)
        {
            $res = substr($string, $bstart+strlen($boundstring), $bend-$bstart-strlen($boundstring));
        }
    }
    return $trimit ? trim($res) : $res;
}
//if the URL is reachable
if($source = file_get_contents($url))
{
$raw = strip_tags($source,'<hr>');
echo '<pre>'.substr(strstr(trim(getInnerSubstring($raw,"<hr>")),'Outlook'),7).'</pre>';
}
else{
echo 'Error';
}
?>
If you need any revisions, please comment.
Try using a user agent as shown below. Then you can use SimpleXML to parse the contents and extract the text you want (see the PHP manual for more info on SimpleXML).
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"User-agent: www.example.com"
)
);
$content = file_get_contents($url, false, stream_context_create($opts));
$xml = simplexml_load_string($content);
You may use cURL for that. Have a look at http://www.php.net/manual/en/book.curl.php
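For completeness, a minimal cURL fetch might look like this (essentially what the update at the top of this question ended up doing; the timeout is just a sensible extra):

    $ch = curl_init('http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $content = curl_exec($ch);
    curl_close($ch);
    if ($content !== false) {
        echo $content;
    }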
I have implemented a function that runs on each page that I want to restrict from non-logged-in users. The function automatically redirects the visitor to the login page if he or she is not logged in.
I would like to make a PHP function, run from an external server, that iterates through a number of set URLs (an array with the URL of each protected page) to see whether they are redirected or not. That way I could easily make sure the protection is up and running on every page.
How could this be done?
Thanks.
$urls = array(
'http://www.apple.com/imac',
'http://www.google.com/'
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
foreach($urls as $url) {
curl_setopt($ch, CURLOPT_URL, $url);
$out = curl_exec($ch);
// line endings is the wonkiest piece of this whole thing
$out = str_replace("\r", "", $out);
// only look at the headers
$headers_end = strpos($out, "\n\n");
if( $headers_end !== false ) {
$out = substr($out, 0, $headers_end);
}
$headers = explode("\n", $out);
foreach($headers as $header) {
if( substr($header, 0, 10) == "Location: " ) {
$target = substr($header, 10);
echo "[$url] redirects to [$target]<br>";
continue 2;
}
}
echo "[$url] does not redirect<br>";
}
I use cURL and fetch only the headers; afterwards I compare my URL with the final URL reported in cURL's info:
$url="http://google.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, '60'); // in seconds
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$res = curl_exec($ch);
if(curl_getinfo($ch)['url'] == $url){
echo "not redirect";
}else {
echo "redirect";
}
You could always try adding:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
since 302 means it moved, allow the curl call to follow it and return whatever the moved url returns.
Getting the headers with get_headers() and checking if Location is set is much simpler.
$urls = [
"https://example-1.com",
"https://example-2.com"
];
foreach ($urls as $key => $url) {
$is_redirect = does_url_redirect($url) ? 'yes' : 'no';
echo $url . ' is redirected: ' . $is_redirect . PHP_EOL;
}
function does_url_redirect($url){
$headers = get_headers($url, 1);
if (!empty($headers['Location'])) {
return true;
} else {
return false;
}
}
I'm not sure whether this really makes sense as a security check.
If you are worried about files being called directly without your "is the user logged in?" check being run, you could do what many big PHP projects do: in the central include file (where the security check is done), define a constant BOOTSTRAP_LOADED or whatever, and in every file check whether that constant is set.
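A minimal sketch of that pattern (the constant name and file layout are just examples):

    // index.php -- the central entry point where the login check lives
    define('BOOTSTRAP_LOADED', true);
    // ... session start, "is the user logged in?" check, redirect if not ...
    require 'protected-page.php';

    // protected-page.php -- every protected file starts with this guard
    if (!defined('BOOTSTRAP_LOADED')) {
        // The file was requested directly, bypassing the bootstrap and its checks.
        die('Direct access not permitted');
    }
    // ... page content ...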
Testing is great and security testing is even better, but I'm not sure what kind of flaw you are looking to uncover with this? To me, this idea feels like a waste of time that will not bring any real additional security.
Just make sure your script die()s after the header("Location: ...") redirect. That is essential to stop additional content from being displayed after the header command (a missing die() wouldn't be caught by your idea, by the way, as the redirect header would still be issued...).
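In other words, a guarded redirect might look like this (the paths and session key are only illustrative):

    if (!isset($_SESSION['user_id'])) {
        header('Location: /login.php');
        // Stop execution so the protected content below is never sent.
        exit;
    }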
If you really want to do this, you could also use a tool like wget and feed it a list of URLs. Have it fetch the results into a directory, and check (e.g. by looking at the file sizes that should be identical) whether every page contains the login dialog. Just to add another option...
Do you want to check the HTTP code to see if it's a redirect?
$params = array('http' => array(
'method' => 'HEAD',
'ignore_errors' => true
));
$context = stream_context_create($params);
foreach(array('http://google.com', 'http://stackoverflow.com') as $url) {
$fp = fopen($url, 'rb', false, $context);
$result = stream_get_contents($fp);
if ($result === false) {
throw new Exception("Could not read data from {$url}");
} else if (! strstr($http_response_header[0], '301')) {
// Do something here
}
}
I hope it will help you:
function checkRedirect($url)
{
    // fetch the headers as an associative array so the Location header can be read directly
    $headers = get_headers($url, 1);
    if ($headers && isset($headers[0])) {
        // a 301/302 status on the first response means the URL redirects
        if (strpos($headers[0], ' 301') !== false || strpos($headers[0], ' 302') !== false) {
            // this is the URL where it's redirecting
            $location = isset($headers['Location']) ? $headers['Location'] : false;
            return is_array($location) ? end($location) : $location;
        }
    }
    return false;
}
$isRedirect = checkRedirect($url);
if(!$isRedirect )
{
echo "URL Not Redirected";
}else{
echo "URL Redirected to: ".$isRedirect;
}
You can use sessions: if the session array is not set, redirect the user to a login page.
I modified Adam Backstrom's answer and implemented chiborg's suggestion (download only the HEAD). It has one more thing: it will check whether the redirection goes to a page on the same server or to an outside one. Example: terra.com.br redirects to terra.com.br/portal. PHP will consider that a redirect, which is correct, but I only wanted to list URLs that redirect away to another host. My English is not good, so if someone finds something really difficult to understand and can edit this, you're welcome.
function RedirectURL() {
$urls = array('http://www.terra.com.br/','http://www.areiaebrita.com.br/');
foreach ($urls as $url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// chiborg suggestion
curl_setopt($ch, CURLOPT_NOBODY, true);
// ================================
// READ URL
// ================================
curl_setopt($ch, CURLOPT_URL, $url);
$out = curl_exec($ch);
// line endings is the wonkiest piece of this whole thing
$out = str_replace("\r", "", $out);
echo $out;
$headers = explode("\n", $out);
foreach($headers as $header) {
if(substr(strtolower($header), 0, 9) == "location:") {
// read the URL to check whether it redirects to a page on the same server or to another one.
// terra.com.br redirects to terra.com.br/portal: that is valid.
// but areiaebrita.com.br redirects to bwnet.com.br, and that is invalid.
// what we want is to check whether the address is still terra.com.br or has changed. if it changed, print it on the page.
// if it contains http, we will check whether the URL changed or not.
// some servers, when redirecting to a folder available on them, give only the folder in the Location header. Example: net11.com.br redirects only to /heiden
// only execute if the Location contains http
if ( strpos(strtolower($header), "http") !== false) {
$address = explode("/", $header);
print_r($address);
// $address['0'] = http
// $address['1'] =
// $address['2'] = www.terra.com.br
// $address['3'] = portal
echo "url (address from array) = " . $url . "<br>";
echo "address[2] = " . $address['2'] . "<br><br>";
// url: terra.com.br
// address['2'] = www.terra.com.br
// check if the string terra.com.br is still present in www.terra.com.br. That indicates the server did not redirect to a page away from here.
if(strpos(strtolower($address['2']), strtolower(parse_url($url, PHP_URL_HOST))) !== false) {
echo "URL NOT REDIRECT";
} else {
// not the same. (areiaebrita)
echo "SORRY, URL REDIRECT WAS FOUND: " . $url;
}
}
}
}
}
}
function unshorten_url($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, $url);
$out = curl_exec($ch);
$real_url = $url;//default.. (if no redirect)
if (preg_match("/location: (.*)/i", $out, $redirect))
$real_url = $redirect[1];
if (strstr($real_url, "bit.ly"))//the redirect is another shortened url
$real_url = unshorten_url($real_url);
return $real_url;
}
I have just made a function that checks if a URL exists or not
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
function url_exists($url, $ch) {
curl_setopt($ch, CURLOPT_URL, $url);
$out = curl_exec($ch);
// line endings is the wonkiest piece of this whole thing
$out = str_replace("\r", "", $out);
// only look at the headers
$headers_end = strpos($out, "\n\n");
if( $headers_end !== false ) {
$out = substr($out, 0, $headers_end);
}
//echo $out."====<br>";
$headers = explode("\n", $out);
//echo "<pre>";
//print_r($headers);
foreach($headers as $header) {
//echo $header."---<br>";
if( strpos($header, 'HTTP/1.1 200 OK') !== false ) {
return true;
}
}
return false;
}
Now I have used an array of URLs to check if a URL exists, as follows:
$my_url_array = array('http://howtocode.pk/result', 'http://google.com/jobssss', 'https://howtocode.pk/javascript-tutorial/', 'https://www.google.com/');
for($j = 0; $j < count($my_url_array); $j++){
if(url_exists($my_url_array[$j], $ch)){
echo 'This URL "'.$my_url_array[$j].'" exists. <br>';
}
}
I can't understand your question.
You have an array with URLs and you want to know if the user is coming from one of the listed URLs?
If I'm right in understanding your question:
$urls = array('http://url1.com','http://url2.ru','http://url3.org');
if(in_array($_SERVER['HTTP_REFERER'],$urls))
{
echo 'FROM ARRAY';
} else {
echo 'NOT FROM ARR';
}
The best I could find, an if fclose fopen type thing, makes the page load really slowly.
Basically what I'm trying to do is the following: I have a list of websites, and I want to display their favicons next to them. However, if a site doesn't have one, I'd like to replace it with another image rather than display a broken image.
You can instruct curl to use the HTTP HEAD method via CURLOPT_NOBODY.
More or less
$ch = curl_init("http://www.example.com/favicon.ico");
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_exec($ch);
$retcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
// $retcode >= 400 -> not found, $retcode = 200, found.
curl_close($ch);
Anyway, you only save the cost of the HTTP transfer, not the TCP connection establishment and closing. And favicons being small, you might not see much improvement.
Caching the result locally seems a good idea if it turns out to be too slow.
HEAD also returns the modification time of the file in the headers. You can do like browsers do and get the CURLINFO_FILETIME of the icon.
In your cache you can store the URL => [favicon, timestamp]. You can then compare the timestamp and reload the favicon.
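A rough sketch of that idea (the cache is just an in-memory array here; a real implementation would persist it somewhere):

    // Ask cURL for the remote file's modification time along with the HEAD response.
    function favicon_mtime($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);
        curl_setopt($ch, CURLOPT_FILETIME, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        $mtime = curl_getinfo($ch, CURLINFO_FILETIME); // -1 if unknown
        curl_close($ch);
        return $mtime;
    }

    // cache: url => array(favicon bytes, timestamp)
    $cache = array();
    $url = 'http://stackoverflow.com/favicon.ico';
    $mtime = favicon_mtime($url);
    if (!isset($cache[$url]) || ($mtime > 0 && $mtime > $cache[$url][1])) {
        // (re)download only when missing or when the remote file is newer
        $cache[$url] = array(file_get_contents($url), $mtime);
    }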
As Pies says, you can use cURL. You can get cURL to only give you the headers, and not the body, which might make it faster. A bad domain could always take a while because you will be waiting for the request to time out; you could probably change the timeout length using cURL.
Here is example:
function remoteFileExists($url) {
$curl = curl_init($url);
//don't fetch the actual page, you only want to check the connection is ok
curl_setopt($curl, CURLOPT_NOBODY, true);
//do request
$result = curl_exec($curl);
$ret = false;
//if request did not fail
if ($result !== false) {
//if request was ok, check response code
$statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
if ($statusCode == 200) {
$ret = true;
}
}
curl_close($curl);
return $ret;
}
$exists = remoteFileExists('http://stackoverflow.com/favicon.ico');
if ($exists) {
echo 'file exists';
} else {
echo 'file does not exist';
}
CoolGoose's solution is good but this is faster for large files (as it only tries to read 1 byte):
if (false === file_get_contents("http://example.com/path/to/image",0,null,0,1)) {
$image = $default_image;
}
This is not an answer to your original question, but a better way of doing what you're trying to do:
Instead of actually trying to get the site's favicon directly (which is a royal pain given it could be /favicon.png, /favicon.ico, /favicon.gif, or even /path/to/favicon.png), use google:
<img src="http://www.google.com/s2/favicons?domain=[domain]">
Done.
A complete function of the most voted answer:
function remote_file_exists($url)
{
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); # handles 301/2 redirects
curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if( $httpCode == 200 ){ return true; }
return false;
}
You can use it like this:
if(remote_file_exists($url))
{
//file exists, do something
}
If you are dealing with images, use getimagesize. Unlike file_exists, this built-in function supports remote files. It will return an array that contains the image information (width, height, type, etc.). All you have to do is check the first element in the array (the width). Use print_r to output the contents of the array:
$imageArray = getimagesize("http://www.example.com/image.jpg");
if($imageArray[0])
{
echo "it's an image and here is the image's info<br>";
print_r($imageArray);
}
else
{
echo "invalid image";
}
if (false === file_get_contents("http://example.com/path/to/image")) {
$image = $default_image;
}
Should work ;)
This can be done by obtaining the HTTP status code (404 = not found), which is possible with file_get_contents making use of context options. The following code takes redirects into account and will return the status code of the final destination (Demo):
$url = 'http://example.com/';
$code = FALSE;
$options['http'] = array(
'method' => "HEAD",
'ignore_errors' => 1
);
$body = file_get_contents($url, NULL, stream_context_create($options));
foreach($http_response_header as $header)
sscanf($header, 'HTTP/%*d.%*d %d', $code);
echo "Status code: $code";
If you don't want to follow redirects, you can do it similarly (Demo):
$url = 'http://example.com/';
$code = FALSE;
$options['http'] = array(
'method' => "HEAD",
'ignore_errors' => 1,
'max_redirects' => 0
);
$body = file_get_contents($url, NULL, stream_context_create($options));
sscanf($http_response_header[0], 'HTTP/%*d.%*d %d', $code);
echo "Status code: $code";
Some of the functions, options and variables in use are explained with more detail on a blog post I've written: HEAD first with PHP Streams.
PHP's built-in functions may not work for checking a URL if the allow_url_fopen setting is set to Off for security reasons. cURL is a better option, as we would not need to change our code at a later stage. Below is the code I used to verify a valid URL:
$url = str_replace(' ', '%20', $url);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if($httpcode>=200 && $httpcode<300){ return true; } else { return false; }
Kindly note the CURLOPT_SSL_VERIFYPEER option, which allows the check to also work for URLs starting with HTTPS (it disables peer certificate verification).
To check for the existence of images, exif_imagetype should be preferred over getimagesize, as it is much faster.
To suppress the E_NOTICE, just prepend the error control operator (@).
if (@exif_imagetype($filename)) {
    // Image exists
}
As a bonus, with the returned value (IMAGETYPE_XXX) from exif_imagetype we could also get the mime-type or file-extension with image_type_to_mime_type / image_type_to_extension.
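For instance, something along these lines (the URL is just an example):

    $filename = 'http://www.example.com/image.jpg';
    $type = @exif_imagetype($filename);       // e.g. IMAGETYPE_JPEG, or false on failure
    if ($type !== false) {
        echo image_type_to_mime_type($type);  // "image/jpeg"
        echo image_type_to_extension($type);  // ".jpeg"
    }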
A radical solution would be to display the favicons as background images in a div above your default icon. That way, all overhead would be placed on the client while still not displaying broken images (missing background images are ignored in all browsers AFAIK).
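A rough sketch of that markup, generated from PHP (the inline styles and the default icon path are made up; the remote favicon sits in a layer on top, and if it fails to load, the default underneath stays visible):

    function favicon_html($domain) {
        $remote = 'http://' . htmlspecialchars($domain, ENT_QUOTES) . '/favicon.ico';
        return '<span style="position:relative;display:inline-block;width:16px;height:16px;'
             . 'background:url(/images/default-favicon.png)">'
             . '<span style="position:absolute;top:0;left:0;width:16px;height:16px;'
             . 'background:url(' . $remote . ')"></span></span>';
    }

    echo favicon_html('stackoverflow.com');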
You could use the following:
$file = 'http://mysite.co.za/images/favicon.ico';
$file_exists = (@fopen($file, "r")) ? true : false;
Worked for me when trying to check if an image exists on the URL
function remote_file_exists($url){
return (bool)preg_match('~HTTP/1\.\d\s+200\s+OK~', @current(get_headers($url)));
}
$ff = "http://www.emeditor.com/pub/emed32_11.0.5.exe";
if(remote_file_exists($ff)){
echo "file exist!";
}
else{
echo "file not exist!!!";
}
This works for me to check if a remote file exist in PHP:
$url = 'https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico';
$header_response = get_headers($url, 1);
if ( strpos( $header_response[0], "404" ) !== false ) {
echo 'File does NOT exist';
} else {
echo 'File exists';
}
You can use:
$url = getimagesize("http://www.flickr.com/photos/27505599@N07/2564389539/");
if(!is_array($url))
{
    $default_image = "…/directoryFolder/junal.jpg";
}
If you're using the Laravel framework or the Guzzle package, there is also a much simpler way using the Guzzle client; it also works when links are redirected:
$client = new \GuzzleHttp\Client(['allow_redirects' => ['track_redirects' => true]]);
try {
$response = $client->request('GET', 'your/url');
if ($response->getStatusCode() != 200) {
// not exists
}
} catch (\GuzzleHttp\Exception\GuzzleException $e) {
// not exists
}
More in Document : https://docs.guzzlephp.org/en/latest/faq.html#how-can-i-track-redirected-requests
You should issue HEAD requests, not GET ones, because you don't need the URI contents at all. As Pies said above, you should check the status code (in the 200-299 range, and you may optionally follow 3xx redirects).
The answers to this question contain a lot of code examples which may be helpful: PHP / Curl: HEAD Request takes a long time on some sites
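A minimal sketch of that combination, as a rough guide (the remote_exists() name is only for illustration):

    function remote_exists($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request, no body
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // optionally follow 3xx redirects
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        return $code >= 200 && $code < 300;
    }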
There's an even more sophisticated alternative. You can do the checking all client-side using a jQuery trick.
$('a[href^="http://"]').filter(function(){
return this.hostname && this.hostname !== location.hostname;
}).each(function() {
var link = jQuery(this);
var faviconURL =
link.attr('href').replace(/^(http:\/\/[^\/]+).*$/, '$1')+'/favicon.ico';
var faviconIMG = jQuery('<img src="favicon.png" alt="" />')['appendTo'](link);
var extImg = new Image();
extImg.src = faviconURL;
if (extImg.complete)
faviconIMG.attr('src', faviconURL);
else
extImg.onload = function() { faviconIMG.attr('src', faviconURL); };
});
From http://snipplr.com/view/18782/add-a-favicon-near-external-links-with-jquery/ (the original blog is presently down)
All the answers here that use get_headers() are doing a GET request.
It's much faster/cheaper to just do a HEAD request.
To make sure that get_headers() does a HEAD request instead of a GET, you should add this:
stream_context_set_default(
array(
'http' => array(
'method' => 'HEAD'
)
)
);
so to check if a file exists, your code would look something like this:
stream_context_set_default(
array(
'http' => array(
'method' => 'HEAD'
)
)
);
$headers = get_headers('http://website.com/dir/file.jpg', 1);
$file_found = stristr($headers[0], '200');
$file_found will return either false or true, obviously.
If the file is not hosted externally, you might translate the remote URL to an absolute path on your webserver. That way you don't have to call cURL or file_get_contents, etc.
function remoteFileExists($url) {
$root = realpath($_SERVER["DOCUMENT_ROOT"]);
$urlParts = parse_url( $url );
if ( !isset( $urlParts['path'] ) )
return false;
if ( is_file( $root . $urlParts['path'] ) )
return true;
else
return false;
}
remoteFileExists( 'https://www.yourdomain.com/path/to/remote/image.png' );
Note: Your webserver must populate DOCUMENT_ROOT to use this function
I don't know if this one (is_file()) is any faster when the file does not exist remotely, but you could give it a shot.
$favIcon = 'default FavIcon';
if(is_file($remotePath)) {
$favIcon = file_get_contents($remotePath);
}
If you're using the Symfony framework, there is also a much simpler way using the HttpClientInterface:
private function remoteFileExists(string $url, HttpClientInterface $client): bool {
$response = $client->request(
'GET',
$url //e.g. http://example.com/file.txt
);
return $response->getStatusCode() == 200;
}
The docs for the HttpClient are also very good and maybe worth looking into if you need a more specific approach: https://symfony.com/doc/current/http_client.html
You can use the filesystem:
use Symfony\Component\Filesystem\Filesystem;
use Symfony\Component\Filesystem\Exception\IOExceptionInterface;
and check
$fileSystem = new Filesystem();
if ($fileSystem->exists('path_to_file')) {
    // ...
}
Please check this URL. I believe it will help you. They provide two ways to overcome this with a bit of explanation.
Try this one.
// Remote file url
$remoteFile = 'https://www.example.com/files/project.zip';
// Initialize cURL
$ch = curl_init($remoteFile);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_exec($ch);
$responseCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
// Check the response code
if($responseCode == 200){
echo 'File exists';
}else{
echo 'File not found';
}
or visit the URL
https://www.codexworld.com/how-to/check-if-remote-file-exists-url-php/#:~:text=The%20file_exists()%20function%20in,a%20remote%20server%20using%20PHP.