Check for files (robots.txt, favicon.ico) on a website with PHP

I would like to check whether a remote website contains certain files, e.g. robots.txt or favicon.ico. Of course, the files should be accessible (readable).
So if the website is http://www.example.com/, I would like to check whether http://www.example.com/robots.txt exists.
I tried fetching a URL like http://www.example.com/robots.txt, and sometimes you can tell the file is missing because you get a "page not found" error in the headers.
But some websites handle this error themselves, and all you get is some HTML saying the page cannot be found, with headers carrying status code 200.
So, does anybody have an idea how to check whether the file really exists or not?
Thanks,
Granit

I use a quick function with cURL to do this; so far it handles things fine even if the URL's server tries to redirect:

function remoteFileExists($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true); // HEAD request; we only need the status code
    $result = curl_exec($curl);
    $ret = false;
    if ($result !== false) {
        $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        if ($statusCode == 200) {
            $ret = true;
        }
    }
    curl_close($curl);
    return $ret;
}

$url = "http://www.example.com";
$exists = remoteFileExists("$url/robots.txt");
if ($exists) {
    $robottxt = file_get_contents("$url/robots.txt");
} else {
    $robottxt = "none";
}

If they serve an error page with HTTP 200, I doubt you have a reliable way of detecting this. Needless to say, it's extremely stupid to serve error pages that way...
You could try:
Issuing a HEAD request, which yields only the headers for the requested resource. Maybe you get more reliable status codes that way.
Checking the Content-Type header. If it's text/html, you can assume it's a custom error page instead of a robots.txt (which should be served as text/plain); likewise for favicons. But I think simply checking for text/html is the most reliable way here, as in the sketch below.
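A minimal sketch combining both checks (the function name is mine and the URL illustrative):

// HEAD request via cURL, then check both status code and Content-Type.
function looksLikeRobotsTxt($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);         // send a HEAD request
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // don't echo the response
    if (curl_exec($curl) === false) {
        curl_close($curl);
        return false;
    }
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    $type   = curl_getinfo($curl, CURLINFO_CONTENT_TYPE); // e.g. "text/plain; charset=UTF-8"
    curl_close($curl);
    // A real robots.txt should come back 200 as text/plain;
    // text/html strongly suggests a custom error page.
    return $status == 200 && strpos((string) $type, 'text/plain') === 0;
}

var_dump(looksLikeRobotsTxt('http://www.example.com/robots.txt'));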

Well, if the website gives you an error page with a success status code, there is not much you can do about it.
Naturally, if you're just after robots.txt or favicon.ico or something else very specific, you can simply check whether the response document is in the correct format: robots.txt should be text/plain containing only the directives robots.txt is allowed to contain, and favicon.ico should be an image file.

The Content-Type header for a .txt file should be text/plain, so if you receive text/html it's not a plain text file.
To check that a picture really is a picture, you would likewise retrieve the Content-Type, which will usually be image/png, image/gif, or similar. There is also the possibility of using PHP's GD library to check whether it is in fact a valid image, as sketched below.
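For instance, a small sketch using GD, assuming the GD extension is installed (the URL is illustrative; note GD cannot parse .ico files, so this suits PNG/GIF/JPEG checks):

// Fetch the data and let GD try to parse it; failure means it isn't a usable image.
$data = @file_get_contents('http://www.example.com/logo.png');
$isImage = ($data !== false) && (@imagecreatefromstring($data) !== false);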

Related

Get Set-Cookie header from redirect in PHP

Okay, I haven't been able to find a solution to this as of yet, and I need to start asking questions on SO so I can get my reputation up and hopefully help out others.
I am making a WordPress plugin that retrieves a JSON list of items from a remote site. Recently, the site added a redirecting cookie check.
Upon the first request without the cookie, a 302 response is returned, pointing to a second page which also returns a 302 redirect pointing to the homepage. On this second page, however, the Set-Cookie headers are also provided, which prevents the homepage from redirecting yet again.
When I make a cURL request to a URL on the site, however, it fails in a redirect loop.
Now, obviously the easiest solution would be to fix this on the remote server. It should not be implementing that redirect for API routes. But that is not an option for me at the moment.
I have found how to retrieve the Set-Cookie header value from a 2xx response, but I cannot seem to figure out how to access that value when a 302 is returned and cURL gives nothing but an error.
Is there a way to access the headers even when the maximum number of redirects (20) is reached?
Is it possible to stop execution after a set number of redirects?
How can I get this cookie's value so I can provide it in a final request?
If you use the cURL option CURLOPT_HEADER, the data you get back from curl_exec will include the headers from each response, including the 302s.
If you enable cookie handling in cURL, it should pick up the cookie set by the 302 response just fine, unless you prefer to handle it manually.
I often do something like this when there could be multiple redirects:

$ch = curl_init($some_url_that_302_redirects);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow the redirect chain
curl_setopt($ch, CURLOPT_COOKIEFILE, ''); // enable curl cookie handling
$response = curl_exec($ch);
// $response contains the headers from each response, plus the body of the last response
$info = curl_getinfo($ch); // tells us how many redirects were followed
for ($i = 0; $i < intval($info['redirect_count']); ++$i) {
    // peel the headers of one intermediate response off the front
    list($headers, $response) = explode("\r\n\r\n", $response, 2);
    // DO SOMETHING WITH $headers HERE
    // If there was a redirect, $headers holds all headers from that response,
    // including Set-Cookie headers
}
list($headers, $body) = explode("\r\n\r\n", $response, 2);
// Now $headers are the headers from the final response
// and $body is the content of the final response
You already had problems before you started trying to add cookies into the mix. Doing a single redirect is bad for performance. Using a 302 response as a means of dissociating data presentation from data retrieval under HTTP/1.1 or later is bad: it works, but it violates the protocol, and you should be using a 303 if you really must redirect.
Trying to set a cookie in a 3xx response will not work consistently across browsers, and neither will setting a cookie in an Ajax response.
It should not be implementing that redirect for API routes
Maybe the people at the remote site are trying to prevent you from leeching their content?
You could fetch the homepage first in an iframe to populate the cookie, and record a flag for your domain in the browser.
I actually found another SO question, of course after I posted, that led me in the right direction to make this possible: HERE.
I used the WebGet class to make the cURL request. It has not been maintained for three years, but it still works fine.
It has a function that makes the cURL request without following through on the redirect loop.
There are a lot of cURL options set in that function, and cURL does not return an error with it, so I'm sure the exact solution could be simpler. HERE is a list of cURL options for anyone who would like to delve deeper.
Here is how I handle each of the responses to get the final response:

$w = new WebGet();

$cookie_file = 'cookie.txt';
if (!file_exists($cookie_file)) {
    $cookie_file_inter = fopen($cookie_file, "w");
    fclose($cookie_file_inter);
}
$w->cookieFile = $cookie_file; // must exist and be writable

// First request; this hits the 302 pointing at the cookie-setting page
$w->requestContent($url);
$headers = $w->responseHeaders;

// Follow up to two redirects by hand, refreshing $headers each time
if ($w->responseStatusCode == 302 && isset($headers['LOCATION'])) {
    $w->requestContent($headers['LOCATION']);
    $headers = $w->responseHeaders;
}
if ($w->responseStatusCode == 302 && isset($headers['LOCATION'])) {
    $w->requestContent($headers['LOCATION']);
}

$response = $w->cachedContent;
Of course, this is all extremely bad practice, and has severe performance implications, but there may be some rare use cases that find themselves needing to do this.

Is it possible to tell the type of image just from a remote path?

I'm working on what is effectively a watermarking system, where the user supplies a URL something like this:
https://api.example.com/overlay?image_path=http://www.theuserssite.com/image.ext&other=parameters
I'm wondering: if the user's image_path is something like www.theuserssite.com/image?image_id=123, there's no extension in the URL, so I can't tell what type of file the image is.
Is there any way I can get the file type just from a remote path?
No, not without requesting the image and loading it to check its type. Try PHP's getimagesize function.
Example:

// https://api.example.com/overlay?image_path=http://www.theuserssite.com/image.ext
$var = "http://www.theuserssite.com/image.ext";
$output = getimagesize($var);
// $output[2] is an IMAGETYPE_* constant; the MIME type is in $output['mime'].
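A slightly fuller sketch (assumes allow_url_fopen is enabled so getimagesize can read a remote URL; the URL is illustrative):

$info = @getimagesize('http://www.theuserssite.com/image?image_id=123');
if ($info !== false) {
    echo $info['mime'];                     // e.g. "image/png"
    echo image_type_to_extension($info[2]); // maps the IMAGETYPE_* constant, e.g. ".png"
}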
Martin is right: PHP must download the image before it can verify the image type.
You could give it a try by requesting only the headers of the HTTP response, but those can easily be wrong or spoofed by the other side.
You can try to use cURL (if it exists on your system): make the request and parse the response:

$ch = curl_init('www.theuserssite.com/image?image_id=123');
curl_setopt($ch, CURLOPT_HEADER, true); // include response headers in the output
curl_setopt($ch, CURLOPT_NOBODY, true); // HEAD request; the headers are all we need
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);

$results = explode("\n", trim($result)); // split() was removed in PHP 7; use explode
foreach ($results as $row) {
    if (strtok($row, ':') == 'Content-Type') {
        $exploded = explode(":", $row);
        echo trim($exploded[1]);
    }
}
As stated here:
"The HTTP server responds with a status line (indicating if things went well), response headers and most often also a response body."
Thus there is no need to store the actual file; just interpret the result.
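A hedged sketch of that idea using get_headers, which performs the request and parses the headers for you (assumes allow_url_fopen is enabled; the URL is illustrative):

// Make the default request method HEAD so only headers travel over the wire.
stream_context_set_default(array('http' => array('method' => 'HEAD')));
$headers = @get_headers('http://www.theuserssite.com/image?image_id=123', 1);
if ($headers !== false && isset($headers['Content-Type'])) {
    // With redirects there is one Content-Type entry per response; take the last.
    $type = is_array($headers['Content-Type']) ? end($headers['Content-Type']) : $headers['Content-Type'];
    echo $type; // e.g. "image/png"
}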

Check if remote images exist in PHP

I'm using the last.fm API to get recent tracks and to search albums and artists, etc.
When images are returned from the API, they sometimes don't exist. An empty URL string is easily replaced with a placeholder image, but when an image URL is given and it returns a 404, that's where my problem comes in.
I tried using fopen($url, 'r') to check if images are available, but sometimes this gives me the following error:
Warning: fopen(http://ec1.images-amazon.com/images/I/31II3Cn67jL.jpg) [function.fopen]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in file.php on line 371
Also, I don't want to use cURL, because there are a lot of images to check and it slows the website down a lot.
What would the best solution be for checking images?
I'm now using the following solution:
<img src="..." onerror='this.src="core/img/no-image.jpg"' alt="..." title="..." />
Is this useful?
Any help is appreciated.
You can use getimagesize; since you are dealing with images, it also returns the MIME type of the image:

$imageInfo = @getimagesize("http://www.remoteserver.com/image.jpg");

You can also use cURL to check the HTTP response code of an image or any URL:

$ch = curl_init("http://www.remoteserver.com/image.jpg");
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
curl_exec($ch);
if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
    // Found image
}
curl_close($ch);
function fileExists($path) {
    return (@fopen($path, "r") == true);
}

(from the manual for file_exists())
Depending on the number of images and the frequency of failures, it's probably best to stick with your current client-side approach. Also, it looks like the images are served through Amazon CloudFront; in that case, use the client-side approach, because it could just be a propagation issue with a single edge server.
Applying a server-side approach will be network-intensive and slow (a waste of resources), especially in PHP, because you'll need to check each image sequentially (though the curl_multi sketch below can at least run the checks in parallel).
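If a server-side check is unavoidable, curl_multi lets the HEAD requests run concurrently rather than one by one. A rough sketch (the function name is mine, and the URLs you pass in are up to you):

function checkImagesExist(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD: the status code is enough
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Drive all transfers concurrently until none are still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);
        }
    } while ($running && $status == CURLM_OK);
    $exists = array();
    foreach ($handles as $url => $ch) {
        $exists[$url] = (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $exists; // map of URL => bool
}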
It might also be useful to check the response headers, using PHP's get_headers function, as follows:

$url = "http://www.remoteserver.com/image.jpg";
$imgHeaders = @get_headers(str_replace(" ", "%20", $url))[0];
if ($imgHeaders == 'HTTP/1.1 200 OK') {
    // image exists
} elseif ($imgHeaders == 'HTTP/1.1 404 Not Found') {
    // image doesn't exist
}
The following function tries to fetch the headers of any online resource (image, PDF, etc.) given by URL using get_headers, then searches the status line for the string 'Not Found' with strpos. If that string is present, meaning the resource is not available, the function returns FALSE; otherwise TRUE.

function isResourceAvailable($url)
{
    // strpos() === false means 'Not Found' never appears in the status line
    return strpos(@get_headers($url)[0], 'Not Found') === false;
}
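Usage would look something like this (the URL is illustrative):

$url = "http://www.remoteserver.com/image.jpg";
echo isResourceAvailable($url) ? 'exists' : 'missing';

Note that matching on the reason phrase is fragile, since servers are free to send a different phrase with a 404; checking the status line for the code itself (e.g. strpos($statusLine, ' 404 ') !== false) would be slightly more robust.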

How do I know if a URL returns an image in PHP?

I have the following problem: I use cURL to download images from a server. I generate the image names automatically and download them, but some of the generated names do not return an image. How can I tell whether a URL returns an image?
Thanks in advance!
$content_type = curl_getinfo($curl_obj, CURLINFO_CONTENT_TYPE);
http://www.php.net/manual/en/function.curl-getinfo.php
Vytautas' answer is correct but incomplete:

$url = 'http://test.com/test/test/something';
$c = curl_init($url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true); // keep the body out of the output buffer
// Here you would want to set more curl settings, such as
// enabling redirection and setting a valid user agent.
curl_exec($c);
$t = curl_getinfo($c, CURLINFO_CONTENT_TYPE);

After that, $t should contain the MIME type (together with some other stuff, such as the charset), except when an error occurs, at which point you get NULL.
That said, there are three points you should check to ensure the returned data is of a certain type:
the file extension
the Content-Type header
the file's magic number
I'd suggest you keep streaming your data to your server like you already do. To check the data before you try to convert it, you could use finfo_file (or finfo_buffer) from the Fileinfo extension, which inspects the file's magic number.
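A minimal sketch of that approach, assuming the fileinfo extension is enabled and that $downloadedData holds the bytes you have already streamed in (both names are illustrative):

$finfo = new finfo(FILEINFO_MIME_TYPE);
$mime = $finfo->buffer($downloadedData); // sniffs the magic number, e.g. "image/jpeg"
$isImage = strpos($mime, 'image/') === 0;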
A more complex way would be to set the Accept header, e.g. when using cURL, so that you only get the response if the header matches. Example:

curl_setopt($cURL, CURLOPT_HTTPHEADER, array(
    "Accept: application/json"
));
Or use CURLINFO_CONTENT_TYPE, see http://ch.php.net/manual/en/function.curl-setopt.php

file_get_contents() GET request not showing up in my webserver log

I've got a simple PHP script to ping some of my domains using file_get_contents(); however, I have checked my logs and they are not recording any GET requests.
I have

$result = file_get_contents($url);
echo $url . " pinged ok\n";

where $url for each of the domains is just a simple string of the form http://mydomain.com/ (echo verifies this). Manual requests made by myself do show up.
Why would the GET requests not be showing in my logs?
Actually, I've got it to register the hit when I send $result to the browser. I guess this means the webserver only records browser requests? Is there any way to mimic such a request in PHP?
OK, I tried cURL in PHP:

// create a curl resource
$ch = curl_init();
// set the url
curl_setopt($ch, CURLOPT_URL, "getcorporate.co.nr");
// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close the curl resource to free up system resources
curl_close($ch);

Same effect, though: no hit registered in the logs. So far it only registers when I feed the HTTP response back from my script to the browser. Obviously that will only work for a single request, not a bunch, as is the purpose of my script.
If something else is going wrong, what debugging output can I look at?
Edit: D'oh! See the comments below the accepted answer for an explanation of my erroneous thinking.
If the request were actually being made, it would show up in the logs; your example code could be failing silently.
What happens if you do:

<?php
if ($result = file_get_contents($url)) {
    echo "Success";
} else {
    echo "Epic Fail!";
}
If that's failing, you'll want to turn on some error reporting or logging and try to figure out why; a debugging sketch follows below.
Note: if you're in safe mode, or otherwise have allow_url_fopen (the fopen URL wrappers) disabled, file_get_contents() will not grab a remote page. This is the most likely reason things would be failing (assuming there's no typo in the contents of $url).
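A quick way to surface the failure (purely a debugging sketch; $url is whatever your script already uses):

error_reporting(E_ALL);
ini_set('display_errors', '1');
var_dump(ini_get('allow_url_fopen')); // must be "1" for remote file_get_contents()
$result = file_get_contents($url);
if ($result === false) {
    // PHP populates $http_response_header after a request through the HTTP wrapper
    var_dump(isset($http_response_header) ? $http_response_header : 'no HTTP response');
}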
Use curl instead?
That's odd. Maybe there is some caching afoot? Have you tried changing the URL dynamically ($url = $url . "?timestamp=" . time(), for example)?
