Check if remote images exist PHP - php

I'm using the last.fm API to get the recent tracks and to search albums and artists etc.
When images are returned from the API, they sometimes don't exist. An empty URL string is easily replaced with a placeholder image, but my problem starts when an image URL is given and it returns a 404.
I tried using fopen($url, 'r') for checking if images are available, but sometimes this gives me the following error:
Warning: fopen(http://ec1.images-amazon.com/images/I/31II3Cn67jL.jpg) [function.fopen]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in file.php on line 371
Also, I don't want to use cURL, because there are a lot of images to check and it slows the website down a lot.
What would the best solution be for checking images?
I'm now using the following solution:
<img src="..." onerror='this.src="core/img/no-image.jpg"' alt="..." title="..." />
Is this useful?
Any help is appreciated

You can use getimagesize(); since you are dealing with images, it also returns the MIME type of the image:
$imageInfo = @getimagesize("http://www.remoteserver.com/image.jpg");
You can also use cURL to check the HTTP response code of an image or any URL:
$ch = curl_init("http://www.remoteserver.com/image.jpg");
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
curl_exec($ch);
if(curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200)
{
// Found Image
}
curl_close($ch);

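function fileExists($path){
$handle = @fopen($path, "r"); // @ suppresses the warning on failure
if ($handle) { fclose($handle); } // don't leak the open handle
return $handle !== false;
}
from the manual for file_exists()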

Depending on the number of images and frequency of failures, it's probably best to stick with your current client-side approach. Also, it looks like the images are served through Amazon CloudFront - in that case, use the client-side approach because it could just be a propagation issue with a single edge server.
A server-side approach will be network-intensive and slow (a waste of resources), especially in PHP, because you'll need to check each image sequentially.
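If a server-side check is still wanted, the requests do not have to run strictly one after another: PHP's curl_multi_* functions can issue them concurrently. The following is a minimal sketch, not a tuned implementation. The function name checkUrlsInParallel and the 2-second timeout are assumptions for illustration, and duplicate URLs collapse into one array key.

```php
<?php
// Sketch: issue HEAD requests for many URLs concurrently with curl_multi.
// Returns an array mapping each URL to true (final status 200) or false.
function checkUrlsInParallel(array $urls, int $timeoutSeconds = 2): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request, no body
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
        curl_multi_remove_handle($multi, $ch);
    }
    return $results;
}
```

Caching the results (for example keyed by URL) would avoid repeating the check on every page view.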

It might also be useful to check the response headers, using PHP's get_headers() function as follows:
$url = "http://www.remoteserver.com/image.jpg";
$headers = @get_headers( str_replace(" ", "%20", $url) );
// $headers[0] is the status line, e.g. "HTTP/1.1 200 OK"; characters 9-11 are the code
$status = $headers ? (int) substr($headers[0], 9, 3) : 0;
if( $status == 200 ) {
//img exists
}
elseif( $status == 404 ) {
//img doesn't exist
}

The following function checks any online resource (image, PDF, etc.) given by URL: it fetches the response headers with get_headers and uses strpos to search the status line for the string 'Not Found'. If the string is found, the resource is not available and the function returns FALSE; otherwise it returns TRUE.
function isResourceAvailable($url)
{
$headers = @get_headers($url); // false if the request itself failed
return $headers !== false && strpos($headers[0], 'Not Found') === false;
}
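One caveat with reading only the first element: when a redirect is followed, get_headers() returns one status line per hop, so the first line may be a 301 or 302 rather than the final result. A hedged sketch that takes the last status line instead (the names finalStatusCode and isResourceReachable are made up for this example):

```php
<?php
// Return the last HTTP status code found in a get_headers()-style list,
// or null when the list contains no status line at all.
function finalStatusCode(array $headerLines): ?int
{
    $code = null;
    foreach ($headerLines as $line) {
        if (is_string($line) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#i', $line, $m)) {
            $code = (int) $m[1]; // keep overwriting so the final hop wins
        }
    }
    return $code;
}

// Treat only a final 200 as "reachable"; a failed request counts as unreachable.
function isResourceReachable(string $url): bool
{
    $headers = @get_headers($url);
    return $headers !== false && finalStatusCode($headers) === 200;
}
```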

Related

Gateway Timeout 504 on multiple requests. Apache

I have an XML file locally. It contains data from a marketplace.
It roughly looks like this:
<offer id="2113">
<picture>https://anotherserver.com/image1.jpg</picture>
<picture>https://anotherserver.com/image2.jpg</picture>
</offer>
<offer id="2117">
<picture>https://anotherserver.com/image3.jpg</picture>
<picture>https://anotherserver.com/image4.jpg</picture>
</offer>
...
What I want is to save the images in the <picture> nodes locally.
There are about 9,000 offers and about 14,000 images.
When I iterate through them, I can see images being copied from the other server, but at some point I get a 504 Gateway Timeout.
The thing is, sometimes the error occurs after 2,000 images, sometimes after many more or fewer.
I tried fetching a single image 12,000 times from that server (i.e. only https://anotherserver.com/image3.jpg), but it still gave the same error.
From what I've read, the other server is blocking my requests after a certain number.
I tried calling PHP's sleep(20) after every 100th image, but it still gave me the same error (sleep(180): same). When I tried a local image with a full path, it didn't give any errors. I tried a second (non-local) server and the same thing occurred.
I use PHP's copy() function to fetch the images from that server.
I've also tried file_get_contents() for testing purposes, but got the same error.
I have
set_time_limit(300000);
ini_set('default_socket_timeout', 300000);
as well but no luck.
Is there any way to do this without chunking requests?
Does this error occur on one particular image? It would be great to catch this error, or to keep track of the response delay and send another request after some time, if that can be done.
Is there a fixed number of seconds that I have to wait in order to get those requests rolling again?
And please give me non-cURL answers if possible.
UPDATE
cURL and exec(wget) didn't work either. Both ran into the same error.
Can the remote server be tweaked so it doesn't block me (if it does)?
P.S. If I do echo "<img src='https://anotherserver.com/image1.jpg' />"; in a loop for all 12,000 images, they show up just fine.
Since you're accessing content on a server you have no control over, only the server administrators know the blocking rules in place.
But you have a few options, as follows:
Run batches of 1000 or so, then sleep for a few hours.
Split the request up between computers that are requesting the information.
Maybe even something as simple as changing the requesting user agent info every 1000 or so images would be good enough to bypass the blocking mechanism.
Or some combination of all of the above.
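The "pause and try again" idea can be sketched as a small retry helper with exponential backoff. This is only an illustration under assumptions: the name fetchWithBackoff, the 30-second base delay, and the injectable $sleep callback (there so the helper can be exercised without really waiting) are not from the question, and the right delays depend on the remote server's unknown rules.

```php
<?php
// Call $fetch until it returns something other than false, waiting
// baseDelay, 2*baseDelay, 4*baseDelay, ... seconds between attempts.
// Returns the fetched value, or false if every attempt failed.
function fetchWithBackoff(callable $fetch, int $maxAttempts = 5, int $baseDelay = 30, ?callable $sleep = null)
{
    $sleep = $sleep ?? 'sleep'; // default to PHP's built-in sleep()
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch();
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            $sleep($baseDelay * (2 ** ($attempt - 1)));
        }
    }
    return false;
}

// Possible usage (no cURL involved):
// $bytes = fetchWithBackoff(function () use ($url) {
//     return @file_get_contents($url);
// });
```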
I would suggest trying the following:
1. Reuse the previously opened connection using cURL:
$imageURLs = array('https://anotherserver.com/image1.jpg', 'https://anotherserver.com/image2.jpg' /* ... */);
$notDownloaded = array();
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
foreach ($imageURLs as $URL) {
$filepath = parse_url($URL, PHP_URL_PATH);
$fp = fopen(basename($filepath), "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_URL, $URL);
curl_exec($ch);
fclose($fp);
if (curl_getinfo($ch, CURLINFO_RESPONSE_CODE) == 504) {
$notDownloaded[] = $URL;
}
}
curl_close($ch);
// check to see if $notDownloaded is empty
2. If the images are accessible via both https and http, try http instead (this should at least speed up the download).
3. Check the response headers when the 504 is returned, as well as when you load the URL in your browser. Make sure there are no X-RateLimit-* headers. By the way, what are the response headers, actually?
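To make the header check concrete, here is a small sketch that picks rate-limit hints out of a list of raw header lines, such as the array returned by get_headers(). The function name rateLimitHints is hypothetical. X-RateLimit-* headers are a common convention rather than a standard, so the server in question may not send any of them; Retry-After is a standard HTTP header.

```php
<?php
// Collect Retry-After and X-RateLimit-* headers from raw header lines.
// Keys are lower-cased header names, values are the trimmed header values.
function rateLimitHints(array $headerLines): array
{
    $hints = [];
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue; // status line or malformed line
        }
        [$name, $value] = explode(':', $line, 2);
        $name = strtolower(trim($name));
        if ($name === 'retry-after' || strpos($name, 'x-ratelimit-') === 0) {
            $hints[$name] = trim($value);
        }
    }
    return $hints;
}
```

An empty result does not prove there is no rate limiting; it only means the server did not announce it in these headers.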

Is it possible to tell the type of image just from a remote path?

I'm working on what is effectively a watermarking system, where the user has a URL something like this.
https://api.example.com/overlay?image_path=http://www.theuserssite.com/image.ext&other=parameters
What if the user's image_path is something like www.theuserssite.com/image?image_id=123? There's no extension in the URL, so I can't tell what type of file the image is.
Is there any way I can get the file type just from a remote path?
No, not without fetching the image to check its type. Try PHP's getimagesize function.
example:
//https://api.example.com/overlay?image_path=http://www.theuserssite.com/image.ext
$var = "http://www.theuserssite.com/image.ext";
$output = getimagesize($var);
// $output['mime'] holds the image MIME type; $output[2] is one of the IMAGETYPE_* constants.
Martin is right.
PHP must download the image before it is able to verify the image type.
You could give it a try by requesting the headers of the http request.
But these can be easily wrong / spoofed by the other side.
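Because headers can lie, the image bytes themselves are the more trustworthy signal, and only the first few bytes are needed. A minimal sketch that recognises three common formats by their signatures (sniffImageMime is a made-up name, and the list of formats is deliberately short; getimagesizefromstring() or the fileinfo extension would cover far more):

```php
<?php
// Guess an image MIME type from the leading "magic" bytes of the data.
// Returns null when none of the known signatures match.
function sniffImageMime(string $bytes): ?string
{
    $signatures = [
        'image/png'  => "\x89PNG\r\n\x1a\n",
        'image/jpeg' => "\xFF\xD8\xFF",
        'image/gif'  => 'GIF8', // matches both GIF87a and GIF89a
    ];
    foreach ($signatures as $mime => $magic) {
        if (strncmp($bytes, $magic, strlen($magic)) === 0) {
            return $mime;
        }
    }
    return null;
}
```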
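You can try using cURL (if it is available on your system): make the request and parse the response headers:
$ch = curl_init('www.theuserssite.com/image?image_id=123');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true); // include the response headers in the output
curl_setopt($ch, CURLOPT_NOBODY, true); // HEAD request: skip the body
$result = curl_exec($ch);
$results = explode("\n", trim($result)); // split() was removed in PHP 7
foreach ($results as $row)
{
if (strcasecmp(strtok($row, ':'), 'Content-Type') == 0) // header names are case-insensitive
{
$exploded = explode(":", $row, 2);
echo trim($exploded[1]);
}
}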
As stated here
The HTTP server responds with a status line (indicating if things went well), response headers and most often also a response body.
Thus there is no need to store the actual file just interpret the result.

Amazon CloudSearch throws HTTP 403 on document upload

I am trying to integrate Amazon CloudSearch into SilverStripe. What I want to do is when the pages are published I want a CURL request to send the data about the page as a JSON string to the search cloud.
I am using http://docs.aws.amazon.com/cloudsearch/latest/developerguide/uploading-data.html#uploading-data-api as a reference.
Every time I try to upload it returns me a 403. I have allowed the IP address in the access policies for the search domain as well.
I am using this as a code reference: https://github.com/markwilson/AwsCloudSearchPhp
I think the problem is the AWS does not authenticate correctly. How do I correctly authenticate this?
If you are getting the following error
403 Forbidden, Request forbidden by administrative rules.
and if you are sure you have appropriate rules in effect, I would check the api url you are using. Make sure you are using the correct endpoint. If you are doing batch upload the api endpoint should look like below
your-search-doc-endpoint/2013-01-01/documents/batch
Notice 2013-01-01, that is a required part of the url. That is the api version you will be using. You cannot do the following even though it might make sense
your-search-doc-endpoint/documents/batch <- Won't work
To search you would need to hit the following api
your-search-endpoint/2013-01-01/search?your-search-params
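The two URL shapes above can be sketched as small helpers so the version segment is never forgotten. The function names and the example hostname in the comment are placeholders, not real endpoints.

```php
<?php
// API version segment that must appear in every CloudSearch request path.
const CLOUDSEARCH_API_VERSION = '2013-01-01';

// Build the batch-upload URL from a document endpoint.
function cloudSearchBatchUrl(string $docEndpoint): string
{
    return rtrim($docEndpoint, '/') . '/' . CLOUDSEARCH_API_VERSION . '/documents/batch';
}

// Build a search URL from a search endpoint and query parameters.
function cloudSearchSearchUrl(string $searchEndpoint, array $params): string
{
    return rtrim($searchEndpoint, '/') . '/' . CLOUDSEARCH_API_VERSION
        . '/search?' . http_build_query($params);
}

// Example with a placeholder host:
// cloudSearchBatchUrl('https://doc-example.us-east-1.cloudsearch.amazonaws.com');
```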
After a lot of searching and trial and error, I put together a small code block, assembled from snippets found in various places, that uploads a "file" to AWS CloudSearch using cURL and PHP.
The most important thing is to make sure your data is prepared correctly to be sent in JSON format.
Note: for CloudSearch you're not uploading a file, you're posting a stream of JSON data. That is why many of us have problems uploading data.
In my case I wanted to upload data to my search domain on CloudSearch. It seems simple, and it is, but example code is lacking. Most people tell you to go to the documentation, which usually has examples, but only for the AWS CLI. The PHP SDK has a learning curve, and instead of keeping things simple it takes 20 steps to do one task. On top of that, you're required to pull in other libraries that are just wrappers around native PHP functions, which sometimes makes things more complicated rather than simpler.
So, back to how I did it: first I pull the data from my database as an array and serialize it to save it to a file.
$rows = $database_data;
$data2 = array();
foreach ($rows as $key => $row) {
$data = array(); // start each document from a clean array
$data['type'] = 'add';
$data['id'] = $row->id;
$data['fields']['title'] = $row->title;
$data['fields']['content'] = $row->content;
$data2[] = $data;
}
// now save your data to a file and make sure
// to serialize() it
$fp = fopen($path_to_file, $mode);
flock($fp, LOCK_EX);
fwrite($fp, serialize($data2));
flock($fp, LOCK_UN);
fclose($fp);
Now that you have your data saved we can play with it
$aws_doc_endpoint = '{Your AWS CloudSearch Document Endpoint URL}';
// Let's read the data
$data = file_get_contents($path_to_file);
// Now let's unserialize() it and encode it in JSON format
$data = json_encode(unserialize($data));
// Finally, let's use cURL
$ch = curl_init($aws_doc_endpoint);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST");
// Set both headers in one call: a second CURLOPT_HTTPHEADER call would replace the first
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json', 'Content-Length: ' . strlen($data)));
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$response = curl_exec($ch);
curl_close($ch);
$response = json_decode($response);
if ($response->status == 'success')
{
return TRUE;
}
return FALSE;
And like I said, there is nothing to it. Most answers I encountered said "use Guzzle, it's really easy". Well, yes, it is, but for a simple task like this you don't need it.
Aside from that, if you still get an error, make sure to check the following:
Your JSON data is well formed.
You have access to the endpoint.
Well, I hope someone finds this code helpful.
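Since the upload is just a JSON body, the intermediate serialize()/unserialize() step is not strictly needed for the request itself. A minimal sketch that builds the batch JSON directly from the rows, assuming each row is an object with id, title and content properties as in the loop above (buildCloudSearchBatch is a hypothetical name):

```php
<?php
// Turn database rows into a CloudSearch "add" batch encoded as JSON.
function buildCloudSearchBatch(array $rows): string
{
    $batch = [];
    foreach ($rows as $row) {
        $batch[] = [
            'type'   => 'add',
            'id'     => (string) $row->id, // send the document id as a string
            'fields' => [
                'title'   => $row->title,
                'content' => $row->content,
            ],
        ];
    }
    return json_encode($batch);
}
```

The returned string can be passed as CURLOPT_POSTFIELDS in the request shown earlier.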
To diagnose whether it's an access policy issue, have you tried a policy that allows all access to the upload? Something like the following opens it up to everything:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Principal": {
"AWS": "*"
},
"Action": "cloudsearch:*"
}
]
}
I noticed that if you just go to the document upload endpoint in a browser (mine looks like "doc-YOURDOMAIN-RANDOMID.REGION.cloudsearch.amazonaws.com") you'll get the 403 "Request forbidden by administrative rules" error, even with open access, so as @dminer said you'll need to make sure you're posting to the correct full URL.
Have you considered using a PHP SDK? Like http://docs.aws.amazon.com/aws-sdk-php/guide/latest/service-cloudsearchdomain.html. It should take care of making correct requests, in which case you could rule out transport errors.
This never worked for me. I used the CloudSearch terminal to upload files and PHP cURL to search them.
Try adding "cloudsearch:document" to CloudSearch's access policy under Actions

file_get_contents() GET request not showing up on my webserver log

I've got a simple php script to ping some of my domains using file_get_contents(), however I have checked my logs and they are not recording any get requests.
I have
$result = file_get_contents($url);
echo $url . " pinged ok\n"; // double quotes so \n is an actual newline
where $url for each of the domains is just a simple string of the form http://mydomain.com/, echo verifies this. Manual requests made by myself are showing.
Why would the get requests not be showing in my logs?
Actually I've got it to register the hit when I send $result to the browser. I guess this means the webserver only records browser requests? Is there any way to mimic such in php?
ok tried curl php:
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "getcorporate.co.nr");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
same effect though - no hit registered in logs. So far it only registers when I feed the http response back from my script to the browser. Obviously this will only work for a single request and not a bunch as is the purpose of my script.
If something else is going wrong, what debugging output can I look at?
Edit: D'oh! See comments below accepted answer for explanation of my erroneous thinking.
If the request is actually being made, it would be in the logs.
Your example code could be failing silently.
What happens if you do:
<?PHP
if (($result = file_get_contents($url)) !== false){ // an empty body is still a successful request
echo "Success";
}else{
echo "Epic Fail!";
}
If that's failing, you'll want to turn on some error reporting or logging and try to figure out why.
Note: if you're in safe mode (on old PHP versions), or otherwise have the fopen URL wrappers disabled (allow_url_fopen is off in php.ini), file_get_contents() will not fetch a remote page. This is the most likely reason things would be failing (assuming there's no typo in the contents of $url).
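A rough sketch of how to surface the reason for a silent failure: check allow_url_fopen for http(s) URLs first, then report the suppressed warning via error_get_last(). The helper name fetchOrExplain and its return shape are invented for this example, and a custom error handler installed elsewhere could change what error_get_last() reports.

```php
<?php
// Fetch a URL (or local path) and return the body plus a failure reason.
function fetchOrExplain(string $url): array
{
    $isRemote = (bool) preg_match('#^https?://#i', $url);
    if ($isRemote && !filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return ['ok' => false, 'body' => null, 'error' => 'allow_url_fopen is disabled'];
    }

    error_clear_last();                  // forget any earlier, unrelated error
    $body = @file_get_contents($url);    // warning is suppressed but still recorded
    if ($body === false) {
        $last = error_get_last();
        return ['ok' => false, 'body' => null, 'error' => $last['message'] ?? 'unknown error'];
    }
    return ['ok' => true, 'body' => $body, 'error' => null];
}
```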
Use curl instead?
That's odd. Maybe there is some caching afoot? Have you tried changing the URL dynamically ($url = $url."?timestamp=".time() for example)?
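Appending "?timestamp=..." blindly breaks if the URL already has a query string. A small sketch that picks the right separator (addCacheBuster is a made-up name; URLs containing a #fragment are not handled here):

```php
<?php
// Append a timestamp parameter so each request has a unique URL.
function addCacheBuster(string $url, ?int $timestamp = null): string
{
    $timestamp = $timestamp ?? time();
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $separator . 'timestamp=' . $timestamp;
}
```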

Check for files (robots.txt, favicon.ico) to a website php

I would like to check whether a remote website contains certain files, e.g. robots.txt or favicon.ico. Of course, the files should be accessible (readable).
So if the website is http://www.example.com/, I would like to check whether http://www.example.com/robots.txt exists.
I tried fetching a URL like http://www.example.com/robots.txt. Sometimes you can tell whether the file is there because you get a "page not found" error in the headers.
But some websites handle this error themselves, and all you get is some HTML saying that the page cannot be found.
You get headers with status code 200.
So does anybody have an idea how to check whether the file really exists or not?
Thanx,
Granit
I use a quick function with cURL to do this; so far it handles things fine, even if the URL's server tries to redirect:
function remoteFileExists($url){
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_NOBODY, true);
$result = curl_exec($curl);
$ret = false;
if ($result !== false) {
$statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
if ($statusCode == 200) {
$ret = true;
}
}
curl_close($curl);
return $ret;
}
$url = "http://www.example.com";
$exists = remoteFileExists("$url/robots.txt");
if($exists){
$robottxt = file_get_contents("$url/robots.txt");
}else{
$robottxt = "none";
}
If they serve an error page with HTTP 200 I doubt you have a reliable way of detecting this. Needless to say that it's extremely stupid to serve error pages that way ...
You could try:
Issuing a HEAD request which yields you only the headers for the requested resource. Maybe you get more reliable status codes that way
Check the Content-Type header. If it's text/html you can assume that it's a custom error page instead of a robots.txt (which should be served as text/plain). For favicons likewise. But I think simply checking for text/html would be the most reliable way here.
Well, if the website gives you an error page with a success status code, there is not much you can do about it.
Naturally, if you're just after robots.txt or favicon.ico or something else very specific, you can simply check if the response document is in correct format... like robots.txt should be text/plain containing stuff that robots.txt is allowed to contain and favicon.ico should be an image file.
The header content-type for a .txt file should be text/plain, so if you receive text/html it's not a simple text file.
To check if a picture is a picture you would need to retrieve the content-type as it will usually be image/png or image/gif. There is also the possibility of using PHP's GD library to check if it is in fact an image.
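The Content-Type heuristic from the answers above can be sketched as a single check. This is a heuristic under assumptions, not a guarantee: the name looksLikeRequestedFile and the extension-to-type mapping are illustrative, and a server can still send a misleading Content-Type.

```php
<?php
// Decide whether a 200 response plausibly is the requested file,
// based on the path's extension and the Content-Type header value.
function looksLikeRequestedFile(string $path, ?string $contentTypeHeader): bool
{
    if ($contentTypeHeader === null) {
        return false; // no header, nothing to judge by
    }
    // Strip parameters such as "; charset=utf-8" and normalise case.
    $mediaType = strtolower(trim(explode(';', $contentTypeHeader, 2)[0]));
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    switch ($extension) {
        case 'txt':
            return $mediaType === 'text/plain';
        case 'ico':
        case 'png':
        case 'gif':
            return strpos($mediaType, 'image/') === 0;
        default:
            // Unknown extension: only rule out an obvious HTML error page.
            return $mediaType !== 'text/html';
    }
}
```

The header value could come from curl_getinfo($curl, CURLINFO_CONTENT_TYPE) after a request like the one in remoteFileExists().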
