Making PHP cURL skip binary data like images, video, etc

Making PHP cURL skip binary data like images, video, etc - php

Setting up curl like this:
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$this->domain);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,3);
curl_setopt($ch,CURLOPT_FAILONERROR,TRUE);
curl_setopt($ch,CURLOPT_USERAGENT,"Useragent");
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,TRUE);
curl_setopt($ch,CURLOPT_MAXREDIRS,1);
$str = curl_exec($ch);
return $str;
$str = $this->cURL();
Pass the url to an html page and all is well - but pass a link direct to a .jpg for example and it returns a load of garbled data.
I'd like to ensure that if a page, say, redirects to a .jpg or .gif, etc - it's ignored and only html pages are returned.
I can't seem to find a setopt for curl that does this.
Any ideas?
-The Swan.

Curl doesn't care if the content's text (html) or binary garbage (a jpg), it'll just return what you tell it to fetch. You've told curl to follow redirects with the "CURLOPT_FOLLOWLOCATION" option, so it'll just follow the chain of redirects until it hits the regular limit, or gets something to download
If you don't know what the URL might contain ahead of time, you'd have to do some workarounds, such as issuing a custom HEAD request, which would return the URL's normal http headers, from which you can extract the mime type (Content-type: ...) of the response and decide if you want to fetch it.
Or just fetch the URL and then keep/toss the data based on the mime type in the full response's headers.

My idea - use HEAD request, check if content-type is interesting( eg. another HTML ) and after this make GET request for data.
set CURLOPT_NOBODY for HEAD request

Related

CURL POST: Ignore the response-body like it happens with FOLLOWLOCATION

How can I not download request's body?
Using CURLOPT_NOBODY it performs HEAD request not sending POST data.
Uing FOLLOWLOCATION=0 I make request+body+request+body.
Using FOLLOWLOCATION=1 it makes request+request+body if it redirects to url I need.
If it redirects to the page I don't need I make request+request+body+request+body
I need: Request ignoring body + request with body.
Something like 3rd option but with redirect to url I really need (that obviously I can't control).

curl - Check content type before deciding what to do with payload

I am working with an API that has a DownloadFile method. Upon a successful request to the method, the response will have a content type of application/octet-stream and contain a file. Upon an unsuccessful request to the method, the response will have a content type of text/xml and contain the appropriate error.
The files I am requesting are archives containing multiple photos and could get very large. Therefore, I am using CURLOPT_FILE to write the payload directly to a file rather than storing it in memory.
My question is, is there any way to check the content type of the response and then decide what to do with the payload? I only want to write the payload to a file if the content type is application/octet-stream. Otherwise I just want to get the error from the XML response and return that to the user.
Thanks.

You could send a HEAD http request before download and parse the response for the content-type field. For this, you will have to use CURLOPT_NOBODY and CURLOPT_HEADER options with maybe CURLOPT_RETURNTRANSFER so you get the full http HEAD sesction as a string from curl_exec().

How do I know if a url returns an image PHP?

I have folowing problem. I use curl for downloading images from server. I generating automaticly name for images and downloaded it. Some of the generated image names are not returned image. How do I know if a url returns an image?
Sorry for my "broken" English.
Thank's in advance!

$content_type = curl_getinfo ( $curl_obj, CURLINFO_CONTENT_TYPE);
http://www.php.net/manual/en/function.curl-getinfo.php

Vytautas' answer is correct, but incomplete.
$url = 'http://test.com/test/test/something';
$c = curl_init($url);
// Here you would want to set more curl settings, such as
// enabling redirection and setting a valid user agent.
curl_exec($c);
$t = curl_getinfo($c, CURLINFO_CONTENT_TYPE);
After that, $t should contain the mimetype (together with some other stuff), except when an error occurs, at which point you get a NULL.
That said, there are 3 points you should check to ensure the returned data is of a certain type:
file extension
content-type header
file's magic number

I'd suggest you keep streaming your data to your server like you already do. To check before you try to convert, you could use finfo-file.
A more complex way could be to set the Accept header when e.g. using cUrl to only get the response if the header matches. Example:
curl_setopt($cURL,CURLOPT_HTTPHEADER,array (
"Accept: application/json"
));
Or use CURLINFO_CONTENT_TYPE, see http://ch.php.net/manual/en/function.curl-setopt.php

How to display images when using cURL?

When scraping page, I would like the images included with the text.
Currently I'm only able to scrape the text. For example, as a test script, I scraped Google's homepage and it only displayed the text, no images(Google logo).
I also created another test script using Redbox, with no success, same result.
Here's my attempt at scraping the Redbox 'Find a Movie' page:
<?php
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result= curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
the page was broken, missing box art, missing scripts, etc.
Looking at FF's Firebug's Extension 'Net' tool(allows me to check headers and file paths), I discovered that Redbox's images and css files were not loaded/missing (404 not found). I noticed why, it was because my browser was looking for Redbox's images and css files in the wrong place.
Apperently the Redbox images and css files are located relative to the domain, likewise for Google's logo. So if my script above is using its domain as the base for the files path, how could I change this?
I tried altering the host and referer request headers with the script below, and I've googled extensively, but no luck.
My fix attempt:
<?php
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$referer = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Host: www.redbox.com") );
curl_setopt ($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result= curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
I hope I made sense, if not, let me know and I'll try to explain it better.
Any help would be great! Thanks.
UPDATE
Thanks to everyone(especially Marc, and Wyatt), your answers helped me figure out a method to implement.
I was able to succesfully test by following the steps below:
Download the page and its requisites via Wget.
Add <base href="..." /> to downloaded page's header.
Upload the revised downloaded page and its original requisites via Wput to a temporary server.
Test uploaded page on temporary server via browser
If the uploaded page is not displayed properly, some of the requisites might be missing still(css,jss,ect). View which are missing via a tool that lets you view header responses(eg. the 'net' tool from FF's Firebug Addon). After locating the missing requisites, visit original page that the uploaded page is based on, take note of proper requisite locations that were missing, then revise the downloaded page from step 1 to
accommodate the new proper locations and begin at step 3 again. Else, if page is rendered properly, then success!
Note: When revising the downloaded page I manually edited the code, I'm sure you could use regEX or a parsing library on cUrl's request to automate the process.

When you scrape a URL, you're retrieving a single file, be it html, image, css, javascript, etc... The document you see displayed in a browser is almost always the result of MULTIPLE files: the original html, each seperate image, each css file, each javascript file. You enter only a single address, but fully building/displaying the page will require many HTTP requests.
When you scrape the google home page via curl and output that HTML to the user, there's no way for the user to know that they're actually viewing Google-sourced HTML - it appears as if the HTML came from your server, and your server only. The user's browser will happily suck in this HTML, find the images, and request the images from YOUR server, not google's. Since you're not hosting any of google's images, your server responds with a properly 404 "not found" error.
To make the page work properly, you've got a few choices. The easiest is to parse the HTML of the page and insert a <base href="..." /> tag into the document's header block. This will tell any viewing browsers that "relatively" links within the document should be fetched from this 'base' source (e.g. google).
A harder option is to parse the document and rewrite any references to external files (images ,css, js, etc...) and put in the URL of the originating server, so the user's browser goes to the original site and fetches from there.
The hardest option is to essentially set up a proxy server, and if a request comes in for a file that doesn't exist on your server, to try and fetch the corresponding file from Google via curl and output it to the user.

If the site you're loading is using relative paths for its resource URLs (i.e. /images/whatever.gif instead of http://www.site.com/images/whatever.gif), you're going to need to do some rewriting of those URLs in the source you get back, since cURL won't do that itself, though Wget (official site seems to be down) does (and will even download and mirror the resources for you), but does not provide PHP bindings.
So, you need to come up with a methodology to scrape through the resulting source and change relative paths into absolute paths. A naive way would be something like this:
if (!preg_match('/src="https?:\/\/"/', $result))
$result = preg_replace('/src="(.*)"/', "src=\"$MY_BASE_URL\\1\"", $result);
where $MY_BASE_URL is the base URL you want to rewrite, i.e. http://www.mydomain.com. That won't work for everything, but it should get you started. It's not an easy thing to do, and you might be better off just spawning off a wget command in the background and letting it mirror or rewrite the HTML for you.

Try obtaining the images by having the raw output returned, using the CURLOPT_BINARYTRANSFER option set to true, as below
curl_setopt($ch,CURLOPT_BINARYTRANSFER, true);
I've used this successfully to obtain images and audio from a webpage.

Make cURL behave like exactly like form

I have a form on my site which sends data to some remote site - simple html form.
What I want to do is to use data user enters into form for statistical purposes.
So I instead of sending data to the remote page I send it first to my script which resends it the remote site.
The thing is I need it to behave in exact way the usual form would behave taking user to the remote site and displaying resources.
When I use this code it kinda works but not in the way I want it to:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $action);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
curl_close($ch);
Problem is that it displays response in the same script. For example if $action is for example:
somesite.com/processform.php and my script name is mysqcript.php it would display the response of "somesite.com/processform.php" inside "mysqcript.php" so all the relative links are not working.
How do I make it to send the user to "somesite.com/processform.php"? Same thing that pressing the button would do?
Leonti

I think you will have to do this on your end, as translating relative paths is the client's job. It should be simple: Just take the base directory of the request you made
http://otherdomain.com/my/request/path.php
and add it in front of every outgoing link that does not begin with "/" or a protocol ("http://", "ftp://").
Detecting all the outgoing links is hard, but I am 100% sure there are ready-made PHP classes that do that. Check for example this article and the getLinks() function in the user comments. I am not 100% sure whether this is what you need but it certainly goes to the right direction.

Here are a couple of possible solutions, which I post separately so they don't get mixed up with the one I recommend:
1 - keep using cURL, parse the response and add a <base/> tag to it. It should work for pretty much everything on that page.
<base href="http://realsite.com/form_url.php" />
2 - do not alter the submit URL. Submit the form to the real URL, but capture its content using some Javascript library (YUI does that) and send it to your script via XHR. It's still kind of hacky though.

There are several ways to do that. Here's one of the easiest: just use a 307 redirect.
header('Location: http://realsite.com/form_url.php', true, 307');
You can do your logging and stuff either before or after header() but if you do it after calling header() you will need to start your script with
ignore_user_abort(true);
Note that browsers are supposed to notify the user that their form is being redirected.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.