cURL: get a file only if it has been modified (PHP)

How can I tell whether a remote file has been modified before opening the stream with cURL?
(If it has, I can then open it with file_get_contents.)
Thanks.

Check for CURLINFO_FILETIME:
$ch = curl_init('http://www.mysite.com/index.php');
curl_setopt($ch, CURLOPT_FILETIME, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
$exec = curl_exec($ch);
$fileTime = curl_getinfo($ch, CURLINFO_FILETIME);
if ($fileTime > -1) {
echo date("Y-m-d H:i", $fileTime);
}

Try sending a HEAD request first to get the Last-Modified header of the target URL and compare it with your cached version. You could also send the GET request with an If-Modified-Since header set to the time your cached version was created, so the server can answer with 304 Not Modified.
Sending a HEAD request with curl looks something like this:
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_NOBODY, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HTTP_VERSION , CURL_HTTP_VERSION_1_1);
$content = curl_exec($curl);
curl_close($curl);
$content will now contain the returned HTTP headers as one long string. You can look for last-modified: in it like this:
if (preg_match('/^last-modified:\s*(?<date>[^\r\n]+)/im', $content, $m)) {
// the last-modified header is found
if (filemtime('your-cached-version') >= strtotime($m['date'])) {
// your cached version is as new as or newer than the remote content, no re-fetch required
}
}
You should handle the Expires header the same way: extract the value from the header string and check whether it is in the future.
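The If-Modified-Since idea can also be expressed with cURL's time-condition options, which let the library send the header for you. The following is only a minimal sketch under assumptions: the function names fetchIfModified and shouldReplaceCache are hypothetical, and an empty 200 response is treated as "not modified" because libcurl discards the body when the time condition is not met.

```php
<?php
// Hypothetical helper: decide whether a response should overwrite the cache.
// libcurl drops the body when the time condition is not met, so an empty
// body is treated like "not modified" (assumption: real content is never empty).
function shouldReplaceCache(int $status, $body): bool
{
    return $status === 200 && is_string($body) && $body !== '';
}

// Hypothetical function: re-download $url only if it changed after the
// cached copy was written. Returns true when the cache file was refreshed.
function fetchIfModified(string $url, string $cacheFile): bool
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if (file_exists($cacheFile)) {
        // Sends "If-Modified-Since: <mtime of the cached copy>".
        curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
        curl_setopt($ch, CURLOPT_TIMEVALUE, filemtime($cacheFile));
    }
    $body = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if (!shouldReplaceCache($status, $body)) {
        return false; // 304, error, or nothing newer: keep the cached copy
    }
    return file_put_contents($cacheFile, $body) !== false;
}
```

After a call such as fetchIfModified($url, 'your-cached-version'), the cache file can be read with file_get_contents whether or not it was refreshed.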

Related

Using cURL with PHP to get the HTML of a website. How do I get all of it?

I'm using a library that parses HTML for particular data. It also offers a convenient fetch function. However, it has a weird line that I don't understand. Here's the code:
function fetch($url, &$curlInfo=null) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
$response = curl_exec($ch);
$info = $curlInfo = curl_getinfo($ch);
curl_close($ch);
if (strpos(strtolower($info['content_type']), 'html') === false) {
// The content was not delivered as HTML, do not attempt to parse it.
return null;
}
$html = mb_substr($response, $info['header_size']);
return parse($html, $url);
}
The penultimate line currently ends up chopping off the first n bytes of the actual HTML. Has cURL changed behavior since this was first written?
What's the correct way to use cURL to get the HTML of a website?
CURLOPT_HEADER determines whether or not the HTTP headers will be included in the output of curl_exec.
Since you have set it to 0, they will not be included, yet you are still cutting off a number of characters equal to the size of the headers from the content.
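For comparison, here is a minimal sketch of how the split would look if the headers were actually requested with CURLOPT_HEADER enabled. The helper name splitResponse is hypothetical. It uses substr rather than mb_substr because header_size is a byte count, not a character count.

```php
<?php
// Hypothetical helper: split a raw cURL response into headers and body.
// Only meaningful when the request was made with CURLOPT_HEADER enabled;
// $headerSize is the 'header_size' value from curl_getinfo(), in bytes.
function splitResponse(string $response, int $headerSize): array
{
    return [
        'headers' => (string) substr($response, 0, $headerSize),
        'body'    => (string) substr($response, $headerSize),
    ];
}
```

In the fetch function above, CURLOPT_HEADER is 0, so the simpler fix is to pass $response to parse() unchanged.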

Getting executed URL from CURL

I have an affiliate URL like http://track.abc.com/?affid=1234
Opening this link redirects to http://www.abc.com
I want to request http://track.abc.com/?affid=1234 using cURL.
How can I get the final URL, http://www.abc.com,
with cURL?
If you want cURL to follow redirect headers from the responses it receives, you need to set that option with:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
You may also want to limit the number of redirects it follows using:
curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
So you'd use something similar to this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://track.abc.com/?affid=1234");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$data = curl_exec($ch);
Edit: Question wasn't exactly clear but from the comment below, if you want to get the redirect location, you need to get the headers from cURL and parse them for the Location header:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://track.abc.com/?affid=1234");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, true);
$data = curl_exec($ch);
This will give you the headers returned by the server in $data, simply parse through them to get the location header and you'll get your result. This question shows you how to do that.
I wrote a function that will extract any header from a cURL header response.
function getHeader($headerString, $key) {
preg_match('#\s\b' . $key . '\b:\s.*\s#', $headerString, $header);
return substr($header[0], strlen($key) + 3, -2);
}
In this case, you're looking for the value of the header Location. I tested the function by retrieving headers from a TinyURL, that redirects to http://google.se, using cURL.
$url = "http://tinyurl.com/dtrkv";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
$location = getHeader($data, 'Location');
var_dump($location);
Output from the var_dump.
string(16) "http://google.se"
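As an alternative to a per-header regular expression, the header block can be turned into an array once and then queried by name. This is a minimal sketch, not part of the original answer: the function name parseHeaders is hypothetical, it expects only the header portion of the response, and when a header repeats the last value wins.

```php
<?php
// Hypothetical helper: turn a raw header block into a lowercase-keyed array.
// If a header occurs more than once, the last occurrence wins.
function parseHeaders(string $rawHeaders): array
{
    $headers = [];
    foreach (preg_split('/\r\n|\n|\r/', $rawHeaders) as $line) {
        $pos = strpos($line, ':');
        if ($pos === false || $pos === 0) {
            continue; // status line or blank line: no "Name: value" pair
        }
        $name = strtolower(trim(substr($line, 0, $pos)));
        $headers[$name] = trim(substr($line, $pos + 1));
    }
    return $headers;
}
```

Because the keys are lowercased, parseHeaders($data)['location'] finds the header regardless of how the server capitalizes it.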

Pass header/content-type from cURL-request to current output/header

I don't know how to write a better title. Feel free to edit. Somehow I didn't find anything on this:
I have a cURL request from PHP which returns a QuickTime file. This works fine if I want to output the stream in the browser's window. But I want to send it as if it were a real file. How can I pass the headers through and send them with the script's output, without having to store everything in a variable?
The script looks like this:
if (preg_match('/^[\w\d-]{36}$/',$key)) {
// create url
$url = $remote . $key;
// init cURL request
$ch = curl_init($url);
// set options
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, false);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_BUFFERSIZE, 256);
if (null !== $username) {
curl_setopt($ch, CURLOPT_USERPWD, $username . ':' . $password);
}
// execute request
curl_exec($ch);
// close
curl_close($ch);
}
I can see the header and content like this, so the request itself is working fine:
HTTP/1.1 200 OK
X-Powered-By: Servlet/3.0 JSP/2.2 (GlassFish Server Open Source Edition 3.1.2 Java/Oracle Corporation/1.7)
Server: GlassFish Server Open Source Edition 3.1.2
Content-Type: video/quicktime
Transfer-Encoding: chunked
Get the Content-Type from your curl query:
$info = curl_getinfo($ch);
$contentType = $info['content_type'];
And send it to the client:
header("Content-Type: $contentType");
Try this:
header ('Content-Type: video/quicktime');
before outputting the content
So with the help of the previous answers I got it to work. It still makes one request too many in my opinion, but maybe someone has a better approach.
The problems that occurred were:
1.) When using cURL like this:
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
the header didn't return the content-type, but only */*.
2.) Using curl_setopt($ch, CURLOPT_NOBODY, false); got the right content-type but also the whole content itself. So I could store everything in a variable, read the header, send the content. Not really an option somehow.
So I had to request the header once using get_headers($url, 1); before getting the content.
3.) Finally, there was the problem that neither the HTML5 video tag nor the jwPlayer wanted to play 'index.php'. So with mod_rewrite, mapping 'name.mov' to 'index.php', it worked:
RewriteRule ^(.*)\.mov$ index.php?_route=$1 [QSA]
This is the result:
if (preg_match('/^[\w\d-]{36}$/',$key)) {
// create url
$url = $remote . $key;
// get header
$header = get_headers($url, 1);
if ( 200 == intval(substr($header[0], 9, 3)) ) {
// create url
$url = $remote . $key;
// init cURL request
$ch = curl_init($url);
// set options
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, false);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_BUFFERSIZE, 256);
if (null !== $username) {
curl_setopt($ch, CURLOPT_USERPWD, $username . ':' . $password);
}
// set header
header('Content-Type: ' . $header['Content-Type']);
// execute request
curl_exec($ch);
// close
curl_close($ch);
exit();
}
}
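One way to avoid the extra request might be to forward the Content-Type from inside a header callback while the body streams straight to the output. This is only a sketch under assumptions: the names proxyStream and contentTypeFromHeaderLine are hypothetical, and it relies on cURL delivering all response headers to CURLOPT_HEADERFUNCTION before the first body bytes are echoed.

```php
<?php
// Hypothetical helper: return the value if $line is a Content-Type
// header line, otherwise null.
function contentTypeFromHeaderLine(string $line): ?string
{
    if (preg_match('/^Content-Type:\s*(.+?)\s*$/i', $line, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical function: stream $url to the client in a single request,
// forwarding the upstream Content-Type before the body is sent.
function proxyStream(string $url): void
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, false);         // keep headers out of the body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, false); // echo the body as it arrives
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $line) {
        $type = contentTypeFromHeaderLine($line);
        if ($type !== null && !headers_sent()) {
            header('Content-Type: ' . $type);
        }
        return strlen($line); // tell cURL the whole line was consumed
    });
    curl_exec($ch);
}
```

Authentication via CURLOPT_USERPWD and the $key check from the script above would still need to be added around the call.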

PHP cURL, read remote file and write contents to local file

I want to connect to a remote file and write its output to a local file. This is my function:
function get_remote_file_to_cache()
{
$the_site="http://facebook.com";
$curl = curl_init();
$fp = fopen("cache/temp_file.txt", "w");
curl_setopt ($curl, CURLOPT_URL, $the_site);
curl_setopt($curl, CURLOPT_FILE, $fp);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_exec ($curl);
$httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
touch('cache/404_err.txt');
}else
{
touch('cache/'.rand(0, 99999).'--all_good.txt');
}
curl_close ($curl);
}
It creates the two files in the "cache" directory, but the problem is it does not write the data into the "temp_file.txt", why is that?
Actually, the suggestion to use fwrite is only partially right.
In order to avoid memory overflow problems with large files (Exceeded maximum memory limit of PHP), you'll need to setup a callback function to write to the file.
NOTE: I would recommend creating a class specifically to handle file downloads and file handles etc. rather than EVER using a global variable, but for the purposes of this example, the following shows how to get things up and running.
so, do the following:
# setup a global file pointer
$GlobalFileHandle = null;
function saveRemoteFile($url, $filename) {
global $GlobalFileHandle;
set_time_limit(0);
# Open the file for writing...
$GlobalFileHandle = fopen($filename, 'w+');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FILE, $GlobalFileHandle);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, "MY+USER+AGENT"); //Make this valid if possible
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); # optional
curl_setopt($ch, CURLOPT_TIMEOUT, 0); # optional: 0 = no timeout, 3600 = 1 hour
curl_setopt($ch, CURLOPT_VERBOSE, false); # Set to true to see all the innards
# Only if you need to bypass SSL certificate validation
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
# Assign a callback function to the CURL Write-Function
curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'curlWriteFile');
# Execute the download - note we DO NOT put the result into a variable!
curl_exec($ch);
# Close CURL
curl_close($ch);
# Close the file pointer
fclose($GlobalFileHandle);
}
function curlWriteFile($cp, $data) {
global $GlobalFileHandle;
$len = fwrite($GlobalFileHandle, $data);
return $len;
}
You can also create a progress callback to show how much / how fast you're downloading, however that's another example as it can be complicated when outputting to the CLI.
Essentially, this will take each block of data downloaded, and dump it to the file immediately, rather than downloading the ENTIRE file into memory first.
Much safer way of doing it!
Of course, you must make sure the URL is correct (convert spaces to %20 etc.) and that the local file is writeable.
Cheers,
James.
Let's try sending a GET request to http://facebook.com:
$ curl -v http://facebook.com
* Rebuilt URL to: http://facebook.com/
* Hostname was NOT found in DNS cache
* Trying 69.171.230.5...
* Connected to facebook.com (69.171.230.5) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: facebook.com
> Accept: */*
>
< HTTP/1.1 302 Found
< Location: https://facebook.com/
< Vary: Accept-Encoding
< Content-Type: text/html
< Date: Thu, 03 Sep 2015 16:26:34 GMT
< Connection: keep-alive
< Content-Length: 0
<
* Connection #0 to host facebook.com left intact
What happened? It appears that Facebook redirected us from http://facebook.com to the secure https://facebook.com/. Note the length of the response body:
Content-Length: 0
It means that zero bytes will be written to temp_file.txt. This is why the file stays empty.
Your solution is absolutely correct:
$fp = fopen('file.txt', 'w');
curl_setopt($handle, CURLOPT_FILE, $fp);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
All you need to do is change URL to https://facebook.com/.
Regarding other answers:
@JonGauthier: No, there is no need to use fwrite() after curl_exec()
@doublehelix: No, you don't need CURLOPT_WRITEFUNCTION for such a simple operation as copying contents to a file.
@ScottSaunders: touch() creates an empty file if it doesn't exist. I think that was the OP's intention.
Seriously, three answers and every single one is invalid?
You need to explicitly write to the file using fwrite, passing it the file handle you created earlier:
if ( $httpCode == 404 ) {
...
} else {
$contents = curl_exec($curl);
fwrite($fp, $contents);
}
curl_close($curl);
fclose($fp);
In your question you have
curl_setopt($curl, CURLOPT_FILE, $fp);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
but from PHP's curl_setopt documentation notes...
It appears that setting CURLOPT_FILE before setting CURLOPT_RETURNTRANSFER doesn't work, presumably because CURLOPT_FILE depends on CURLOPT_RETURNTRANSFER being set.
So do this:
<?php
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FILE, $fp);
?>
not this:
<?php
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
?>
...stating "CURLOPT_FILE depends on CURLOPT_RETURNTRANSFER being set".
Reference: https://www.php.net/manual/en/function.curl-setopt.php#99082
To avoid memory leak problems:
I was confronted with this problem as well. It's really stupid to say but the solution is to set CURLOPT_RETURNTRANSFER before CURLOPT_FILE!
it seems CURLOPT_FILE depends on CURLOPT_RETURNTRANSFER.
$curl = curl_init();
$fp = fopen("cache/temp_file.txt", "w+");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_FILE, $fp);
curl_setopt($curl, CURLOPT_URL, $url);
curl_exec ($curl);
curl_close($curl);
fclose($fp);
The touch() function doesn't do anything to the contents of the file. It just updates the modification time. Look at the file_put_contents() function.
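Putting the points from the answers together (write with CURLOPT_FILE, follow the redirect, detect HTTP errors), a download helper could look roughly like this. It is a minimal sketch, not a definitive implementation: the function name downloadToFile and the ".part" temporary suffix are hypothetical choices.

```php
<?php
// Hypothetical function: download $url to $path, following redirects.
// Data goes to a temporary ".part" file that is renamed only on success,
// so a failed transfer never leaves a truncated cache file behind.
function downloadToFile(string $url, string $path): bool
{
    $tmp = $path . '.part';
    $fp = fopen($tmp, 'w');
    if ($fp === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);            // write the body straight to the file
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // e.g. http:// -> https://
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // HTTP >= 400 makes curl_exec() fail
    $ok = curl_exec($ch);
    unset($ch);   // release the handle before closing the file
    fclose($fp);

    if ($ok === false) {
        unlink($tmp);
        return false;
    }
    return rename($tmp, $path);
}
```

A call such as downloadToFile('http://facebook.com', 'cache/temp_file.txt') would then follow the 302 shown above instead of saving its empty body.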

Curl, follow location but only get header of the new location?

I know that when I set CURLOPT_FOLLOWLOCATION to true, cURL will follow the Location header and redirect to new page. But is it possible only to get header of the new page without actually redirecting there? Or is it not possible?
Appears to be a duplicate of PHP cURL: Get target of redirect, without following it
However, this can be done in 3 easy steps:
Step 1. Initialise curl
$ch = curl_init(); //initialise the curl handle
//COOKIESESSION is optional, use if you want to keep cookies in memory
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
Step 2. Get the headers for $url
curl_setopt($ch, CURLOPT_URL, $url); //specify your URL
curl_setopt($ch, CURLOPT_HEADER, true); //include headers in http data
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); //don't follow redirects
$http_data = curl_exec($ch); //hit the $url
$curl_info = curl_getinfo($ch);
$headers = substr($http_data, 0, $curl_info["header_size"]); //split out header
Step 3. Parse the headers to get the new URL
preg_match("!\r\n(?:Location|URI): *(.*?) *\r\n!", $headers, $matches);
$url = $matches[1];
Once you have the new URL you can then repeat steps 2-3 as often as you like.
No. You'd have to disable FOLLOWLOCATION, extract the redirect URL from the response, and then issue a new HEAD request with that URL.
Set CURLOPT_FOLLOWLOCATION as false and CURLOPT_HEADER as true, and get the "Location" from the response header.
Yes, you can set it to follow the redirect until you get the last location on the header response.
The function to get the last redirect:
function get_redirect_final_target($url)
{
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
curl_setopt($ch, CURLOPT_AUTOREFERER, 1); // set referer on redirect
curl_setopt($ch,CURLOPT_HEADER,false); // if you want to print the header response change false to true
$response = curl_exec($ch);
$target = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
if ($target)
return $target; // the location you want
return false;
}
You can get the redirect URL directly with curl_getinfo:
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIESESSION, false);
curl_setopt($ch, CURLOPT_URL, $url); //specify your URL
curl_setopt($ch, CURLOPT_HEADER, true); //include headers in http data
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); //don't follow redirects
$http_data = curl_exec($ch); //hit the $url
$redirect = curl_getinfo($ch)['redirect_url'];
curl_close($ch);
return $redirect;
To analyze the headers, you can use CURLOPT_HEADERFUNCTION.
Make sure you set CURLOPT_HEADER to true to get the headers in the response; otherwise the response is returned as a blank string.
