I am using cURL and PHP to find out information about a given URL (e.g. HTTP status code, MIME type, HTTP redirect location, page title, etc.):
$ch = curl_init($url);
$useragent = "Mozilla/5.0 (X11; U; Linux x86_64; ga-GB) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.307.11 Safari/532.9";
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "Accept: application/rdf+xml;q=0.9, application/json;q=0.6, application/xml;q=0.5, application/xhtml+xml;q=0.3, text/html;q=0.2, */*;q=0.1"
));
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$content = curl_exec($ch);
$chinfo = curl_getinfo($ch);
curl_close($ch);
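For reference, the individual pieces of information come out of curl_getinfo() under keys like these (key names as documented for PHP's cURL extension; the page title still has to be parsed out of $content separately):
echo $chinfo['http_code'];       // HTTP status code of the final response
echo $chinfo['content_type'];    // Content-Type header, i.e. the MIME type
echo $chinfo['url'];             // effective URL after any redirects were followed
echo $chinfo['redirect_count'];  // how many redirects were followed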
This generally works well. However, if the URL points to a large file, I get a fatal error:
Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to allocate 14421576 bytes)
Is there any way of preventing this? For example, by telling cURL to give up if the file is too large, or by catching the error?
As a workaround, I've added
curl_setopt($ch, CURLOPT_TIMEOUT, 3);
which assumes that any file large enough to exhaust the allowed memory will take longer than 3 seconds to load, but this is far from satisfactory.
Have you tried using CURLOPT_FILE to save the file directly to disk instead of using memory? You can even specify /dev/null to put it nowhere at all...
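For example, a minimal sketch of that approach (assuming a Unix-like host where /dev/null exists):
// Send the body straight to /dev/null so nothing accumulates in PHP's memory.
$ch = curl_init($url);
$null = fopen('/dev/null', 'w');
curl_setopt($ch, CURLOPT_FILE, $null);       // no CURLOPT_RETURNTRANSFER here
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_exec($ch);
$chinfo = curl_getinfo($ch);                 // status code, MIME type etc. still available
curl_close($ch);
fclose($null);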
Or, you can use CURLOPT_WRITEFUNCTION to set a custom data-writing function. Have the function just scan the headers and then throw away the actual data.
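Something along these lines might do it (a rough sketch; the 1 MB cap is an arbitrary choice, and returning fewer bytes than were passed in is what makes cURL abort the transfer with a write error):
// Give up once more than ~1 MB has arrived, discarding everything along the way.
$received = 0;
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $data) use (&$received) {
    $received += strlen($data);
    if ($received > 1048576) {
        return 0;              // short return value aborts the transfer
    }
    return strlen($data);      // pretend we "wrote" the chunk, then forget it
});
curl_exec($ch);                // returns false with a write error once aborted
$chinfo = curl_getinfo($ch);   // status code, content type etc. are still available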
Alternatively, give PHP some more memory via php.ini.
If you only need header information, why not make a HEAD request? That avoids the memory cost of pulling the whole page into the 16 MiB memory limit.
curl_setopt($ch, CURLOPT_NOBODY, true);  // makes cURL send a HEAD request instead of GET
curl_setopt($ch, CURLOPT_HEADER, true);  // include the response headers in the output
Then, for the page title, use file_get_contents() instead, as it's much better with its native memory allocation.
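If you do go the file_get_contents() route for the title, it can also be told to stop after a fixed number of bytes via its length parameter (a sketch; the 64 KB cap is an arbitrary choice):
// Fetch at most the first 64 KB, which is usually enough to reach <title>.
$head = file_get_contents($url, false, null, 0, 65536);
if ($head !== false && preg_match('#<title>(.*?)</title>#is', $head, $m)) {
    $title = trim($m[1]);
}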
Related
I have audio files on a remote server that are streamed / chunked to the user. This all works great in the client's browser.
But when I try to download and save the files locally from another server using cURL, it only seems to be able to download small files (less than 10 MB) successfully; anything larger and it seems to download only the headers.
I assume this is because of the chunking, so my question is how do I make curl download the larger (chunked) files?
With wget on the CLI on Linux this is as simple as:
wget -cO - https://example.com/track?id=460 > mytrack.mp3
This is the function I have written using cURL in PHP, but as I say, it only downloads the headers on large files:
private function downloadAudio($url, $fn){
    $ch = curl_init($url);
    $path = TEMP_DIR . $fn;
    $fp = fopen($path, 'wb');
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_exec($ch);
    curl_close($ch);
    fclose($fp);
    if (file_exists($path)) {
        return true;
    }
    return false;
}
In my case it was failing as I had forgotten to increase the default PHP memory_limit on the origin server.
It turned out after posting this question that it was actually successfully downloading any files below roughly the 100 MB mark, not 10 MB as I had stated in the question. As soon as I realised this I checked the memory_limit and, lo and behold, it was set to the default 128M.
I hadn't noticed any problems client-side, as the audio was being chunked, but when the server tried to grab an entire 300 MB file in less than a second, the memory limit must have been reached.
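For anyone hitting the same thing, checking and raising the limit is a one-liner (the 512M value is only an example; streaming straight to disk with CURLOPT_FILE, as the function above already does, avoids buffering the whole file in PHP at all):
echo ini_get('memory_limit');     // see the current limit, e.g. "128M"
ini_set('memory_limit', '512M');  // or raise memory_limit in php.ini instead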
Downloading an image using cURL
https://cdni.rt.com/deutsch/images/2018.04/article/5ac34e500d0403503d8b4568.jpg
When saving this image manually from the browser to the local PC, the size shown by the system is 139,880 bytes.
When downloading it using cURL, the file seems to be damaged and is not recognised as a valid image.
Its size, when downloaded using cURL, is 139,845 bytes, which is smaller than the size of the manual download.
Digging into the issue further, I found that the server returns the content length in the response headers as:
content-length: 139845
This length is identical to what cURL downloaded, so I suspect that cURL closes the transfer once it reaches the (possibly wrong) length reported by the server.
Is there any way to make cURL download the file completely even if the Content-Length header is wrong?
Used code:
//curl init
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
curl_setopt($ch, CURLOPT_REFERER, 'http://www.bing.com/');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8');
curl_setopt($ch, CURLOPT_MAXREDIRS, 5); // Good leeway for redirections.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // Many login forms redirect at least once.
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
//curl get
$x = 'error';
$url = 'https://cdni.rt.com/deutsch/images/2018.04/article/5ac34e500d0403503d8b4568.jpg';
curl_setopt($ch, CURLOPT_HTTPGET, 1);
curl_setopt($ch, CURLOPT_URL, trim($url));
$exec = curl_exec($ch);
$x = curl_error($ch);
$fp = fopen('test.jpg', 'x');
fwrite($fp, $exec);
fclose($fp);
The server has a buggy implementation of the Accept-Encoding compressed-transfer mechanism.
The response is ALWAYS gzip-compressed, but the server won't tell the client so unless the client sends an Accept-Encoding: gzip header in the request. When the server doesn't tell the client that the body is gzipped, the client won't decompress it before saving it, hence your corrupted download. Tell cURL to offer gzip compression by setting CURLOPT_ENCODING:
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
Then the server will tell cURL that the response is gzip-compressed, and cURL will decompress it for you before handing it to PHP.
You should probably tell the server admin about this; it's a serious bug in their web server that corrupts downloads.
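A quick way to confirm this diagnosis is to look at the first two bytes of the "damaged" file, which will be the gzip magic number (a small sketch; gzdecode() needs PHP 5.4+ with zlib):
// gzip streams always start with the magic bytes 0x1f 0x8b
$raw = file_get_contents('test.jpg');
if (substr($raw, 0, 2) === "\x1f\x8b") {
    echo "not corrupt, just gzip-compressed\n";
    file_put_contents('test_decoded.jpg', gzdecode($raw)); // recover the real image
}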
libcurl has an option for that called CURLOPT_IGNORE_CONTENT_LENGTH. Unfortunately it is not natively exposed in PHP, but you can trick PHP into setting the option anyway by using the correct magic number (which, at least on my system, is 136):
if (!defined('CURLOPT_IGNORE_CONTENT_LENGTH')) {
    define('CURLOPT_IGNORE_CONTENT_LENGTH', 136);
}
if (!curl_setopt($ch, CURLOPT_IGNORE_CONTENT_LENGTH, 1)) {
    throw new \RuntimeException('failed to set CURLOPT_IGNORE_CONTENT_LENGTH! - ' . curl_errno($ch) . ': ' . curl_error($ch));
}
You can find the correct number for your system by compiling and running the following C++ code:
#include <iostream>
#include <curl/curl.h>
int main(){
    std::cout << CURLOPT_IGNORE_CONTENT_LENGTH << std::endl;
}
But it's probably 136.
Lastly, a pro tip: file_get_contents() ignores the Content-Length header altogether and just keeps downloading until the server closes the connection (which is potentially much slower than cURL). Also, you should probably contact the server operator and let them know that something is wrong/bugged with their server.
I seem to be in a bit of a predicament. As far as I am aware, there have been no changes to PHP or Apache, yet code that has worked for almost 6 months just stopped working today at 2pm.
The code is:
function ls_record($prospectid, $campid){
    $api_post = "method=NewProspect&prospect_id=" . $prospectid . "&campaign_id=" . $campid;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_POST, TRUE);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $api_post);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_URL, "http://XXXXX/XXXXXX/store.php");
    $x = print_r(curl_exec($ch), TRUE);
    return $x;
}
It returns NULL. I tried using file_get_contents(), which also returns NULL. I checked the Apache error logs and saw nothing... I need some help on this one.
Do you have access to the command line of the server? It could be that the destination has blocked you somehow.
If you have command-line access, try this:
wget http://XXXXX/XXXXXX/store.php
That should at least return something (if not the headers).
Use curl_getinfo() to check the status of your cURL execution; it may be that the server you are trying to extract content from requires cURL to send a user agent, as some sites check the user agent to block unwanted cURL access.
Below is the user agent I use to disguise my cURL request as a desktop Chrome browser:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36');
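A minimal sketch of the curl_getinfo() check (the key names are the ones documented for PHP's cURL extension):
$body = curl_exec($ch);
$info = curl_getinfo($ch);
if ($body === false || $info['http_code'] != 200) {
    // the transfer failed or the server refused us; curl_error() explains transport-level failures
    error_log('curl failed: HTTP ' . $info['http_code'] . ' - ' . curl_error($ch));
}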
I faced the same problem on my server because of low internet speed. The internet speed dropped for some time and cURL took so long to execute that it returned a timeout error. After a few minutes it was working fine again without any changes on the server.
I am using Windows 7. I have installed XAMPP 1.8.0 and also enabled the cURL functionality. But I am still not able to crawl web pages.
I used the following code segment:
<?php
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.google.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
curl_exec($ch);
// close cURL resource, and free up system resources
curl_close($ch);
?>
And when I executed the program, the error message was:
Fatal error: Maximum execution time of 30 seconds exceeded in
C:\xampp\htdocs\testing\curl.php on line 14
Line 14: curl_close($ch)
Why am I getting such an error? How can I debug it? Please help me.
Add set_time_limit($seconds); to your page.
Also add curl_setopt($ch, CURLOPT_TIMEOUT, 60); and try again.
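Putting both suggestions together, a sketch (the 120 and 60 second values are just examples):
<?php
set_time_limit(120);                          // allow the script itself to run longer than 30 seconds
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.google.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // give up connecting after 10 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 60);        // give up on the whole transfer after 60 seconds
curl_exec($ch);
curl_close($ch);
?>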
We've written a script that pulls data from an external server. If the server goes down, we don't want our server waiting for the data, since we process a lot of data and we don't want it bogged down. To address this, we're trying to time out our cURL calls if they take more than a couple hundred milliseconds.
I found some documentation saying that CURLOPT_TIMEOUT_MS and CURLOPT_CONNECTTIMEOUT_MS should be available in my version of php and libcurl, but it does not seem to be timing out, even if I set the timeout to 1ms.
$url = "http://www.cnn.com;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER,0); //Change this to a 1 to return headers
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 1);
$data = curl_exec($ch);
curl_close($ch);
Does anyone know what we're doing wrong or another way to do this?
I saw this in "unresponsive dns server and curl multi timeouts not working":
"...We have had some times where a
site that we pull information has had
dns server become unresponsive. When
this happens the timeouts set in curl
(php bindings) do not work as
expected. It times out after 1min 14
sec with "Could not resolve host:
www.yahoo.com (Domain name not found)"
To make this happen in test env we
modify /etc/resolv.conf to have a
nameserver that does not exist
(nameserver 1.1.1.1). No mater what
they are set at
(CURLOPT_CONNECTTIMEOUT, CURLOPT_CONNECTTIMEOUT_MS
, CURLOPT_TIMEOUT, CURLOPT_TIMEOUT_MS)
they don't timeout when we cant get
to the DNS server. I use curl_multi
because i we have multiple sources
that we pull info from at the same
time. The example below makes one
call for example simplicity. And as a
side note curl_errno does not return
an error code even though there was an
error. Not sure why..."