Using file_get_contents or cURL - PHP

I was asked at work to use a simple Facebook API to return the number of likes or shares for a link, which returns a JSON string.
Since I am going to do this for a very large number of links, which one is better:
file_get_contents or cURL?
Both of them seem to return the same results, and cURL seems more complicated to use, but what is the difference between them? Why do most people recommend using cURL over file_get_contents?
Before I run the script, which might take a whole day to process, I would like to have some feedback.

A few years ago I benchmarked the two, and cURL was faster. With cURL you create one handle that can be reused for every request, and it maps directly to the very fast libcurl library. With file_get_contents you have the overhead of the protocol wrappers and their initialization code being executed for every single request.
I will dig out my benchmark script and run it on PHP 5.3, but I suspect that cURL will still be faster.
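In the meantime, here is a rough sketch of what such a benchmark might look like (the URL and iteration count are placeholders, not from my original script):
$url  = 'http://example.com/';   // placeholder target
$runs = 100;                     // placeholder iteration count

// cURL: one handle reused for every request
$start = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
for ($i = 0; $i < $runs; $i++) {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_exec($ch);
}
curl_close($ch);
echo 'cURL: ', microtime(true) - $start, " seconds\n";

// file_get_contents: wrapper setup happens on every single call
$start = microtime(true);
for ($i = 0; $i < $runs; $i++) {
    file_get_contents($url);
}
echo 'file_get_contents: ', microtime(true) - $start, " seconds\n";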

cURL supports HTTPS requests more widely than file_get_contents, and it's not terribly complicated. Although the one-line file_get_contents solution sure is clean looking, its behind-the-scenes overhead is larger than cURL's.
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $feedURL);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);     // give up connecting after 2 seconds
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true);  // return the response instead of printing it
curl_setopt($curl_handle, CURLOPT_SSL_VERIFYPEER, false); // note: this disables SSL certificate verification
$buffer = curl_exec($curl_handle);
curl_close($curl_handle);
This is what I use to make Facebook API calls, as many of them require an access_token and Facebook will only accept access_token information over a secure connection. I've also noticed a large difference in execution time (cURL is much faster).
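Since the API returns JSON, you would then run $buffer through json_decode; a rough sketch (the field names are placeholders and depend on the endpoint and API version you actually call):
$data = json_decode($buffer, true); // decode the JSON response into an associative array
if (is_array($data) && isset($data['share']['share_count'])) {
    // 'share' / 'share_count' are placeholder field names; check the
    // response of the endpoint you actually call
    echo $data['share']['share_count'];
}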

Related

Curl with multithreading

I am scraping data from a URL using cURL:
for ($i = 0; $i < 1000000; $i++) {
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, 'http://example.com?page='.$i);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($curl_handle);
    curl_close($curl_handle);
    // some code to save the HTML page ($html) on HDD
}
Is there some way I could speed up the process? Maybe multithreading? How could I do it?
cURL Multi does not make parallel requests; it makes asynchronous requests.
The documentation was wrong until 5 minutes ago; it will take some time for the corrected documentation to be deployed and translated.
Asynchronous I/O (using something like the cURL Multi API) is the simplest thing to do; however, it only makes the requests asynchronously. Processing the data once it is downloaded, for example writing it to disk, would still cause lots of blocking I/O, and further processing of the data (parsing JSON, for example) would still happen synchronously, in a single thread of execution.
Multi-threading is the other option; it requires a thread-safe build of PHP and the pthreads extension installed.
Multi-threading has the advantage that the download and all subsequent processing of each item can happen in parallel, fully utilizing all the available CPU cores.
What is best depends largely on how much processing of the downloaded data your code must perform, and even then it can be considered a matter of opinion.
You're looking for the curl_multi_* set of functions: "Allows the processing of multiple cURL handles in parallel".
Take a look at the complete example on the curl_multi_init() page.
Check out these articles for more information about how curl_multi_exec() works:
http://technosophos.com/2012/10/26/php-and-curlmultiexec.html
http://www.somacon.com/p537.php
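For illustration, here is a minimal curl_multi sketch along the lines of the curl_multi_init() example (the URLs are placeholders and error handling is omitted):
$urls = array('http://example.com?page=1', 'http://example.com?page=2');
$mh = curl_multi_init();
$handles = array();

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// run all handles until every transfer has finished
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $i => $ch) {
    $html = curl_multi_getcontent($ch); // the downloaded page
    // ... save $html to disk here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);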

best value for curl timeout and connection timeout

Greetings everyone
I am working on a small crawling engine and am using cURL to request pages from various websites. The question is: what do you suggest I set my connection timeout and timeout values to? The pages I would normally be crawling have lots of images and text.
cURL knows two different timeouts.
For CURLOPT_CONNECTTIMEOUT it doesn't matter how much text the site contains or how many other resources (such as images) it references, because this is a connection timeout; even the server cannot know the size of the requested page until the connection is established.
For CURLOPT_TIMEOUT it does matter. Even large pages need only a few packets on the wire, but the server may need more time to assemble the output. Also, the number of redirects and other things (e.g. proxies) can significantly increase the response time.
Generally speaking, the "best value" for timeouts depends on your requirements and on the conditions of the networks and servers involved. Those conditions are subject to change, so there is no single best value.
I recommend using rather short timeouts and retrying failed downloads later.
By the way, cURL does not automatically download resources referenced in the response. You have to do this manually with further calls to curl_exec (with fresh timeouts).
If you set it too high, your script will be slow, as a single URL that is down will take the full CURLOPT_TIMEOUT to finish processing. If you are not using proxies, you can just set the following values:
CURLOPT_TIMEOUT = 3
CURLOPT_CONNECTTIMEOUT = 1
Then you can go through the failed URLs at a later time to double-check them.
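A rough sketch of that approach (assuming $urls holds the list of URLs to crawl; the values are the ones suggested above):
$failed = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); // give up connecting after 1 second
    curl_setopt($ch, CURLOPT_TIMEOUT, 3);        // give up on the whole request after 3 seconds
    $body = curl_exec($ch);
    if ($body === false) {
        $failed[] = $url; // queue for a later retry pass
    }
    curl_close($ch);
}
// later: loop over $failed again, possibly with slightly larger timeouts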
The best answer is rik's.
I have a proxy checker, and in my benchmarks I saw that most working proxies take less than 10 seconds to connect.
So I use 10 seconds for both the connection timeout and the timeout, but that's my case; you have to decide how much time you want to allow, so start with big values, use curl_getinfo to see the timing benchmarks, and then decrease the values.
Note: a proxy that takes more than 5 or 10 seconds to connect is useless to me; that's why I use those values.
Yes. If your target is a proxy that queries another site, such a cascading connection will require a fairly long period, like these values, to execute the cURL calls.
Especially when you encounter intermittent cURL problems, check these values first.
I use
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 60);

PHP Multiple Curl Requests

I'm currently using cURL in PHP a lot. It takes a lot of time to get results for about 100 pages each time. For every request I'm using code like this:
$ch = curl_init();
// get source
curl_close($ch);
What are my options to speed things up?
How should I use the multi_init() etc?
Reuse the same cURL handle ($ch) without calling curl_close. This will speed it up just a little bit.
Use curl_multi_init to run the processes in parallel. This can have a tremendous effect.
Use curl_multi - it is far better. Save the handshakes - they are not needed every time!
When I use the code given at http://php.net/curl_multi_init, the responses of the 2 requests conflict.
But the code in the link below returns each response separately (in array format):
https://stackoverflow.com/a/21362749/3177302
Or use pcntl_fork to fork some new processes that execute curl_exec. But it's not as good as curl_multi.
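A rough sketch of the pcntl_fork approach (it needs the pcntl extension and the CLI SAPI; $urls and the worker count are placeholders):
$workers = 4;
$chunks  = array_chunk($urls, (int) ceil(count($urls) / $workers));
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === 0) {                      // child process
        foreach ($chunk as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            $html = curl_exec($ch);        // save $html somewhere here
            curl_close($ch);
        }
        exit(0);                           // never fall through to the parent code
    }
}
while (pcntl_wait($status) > 0);           // parent waits for all children to finish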

Faster alternative to file_get_contents()

Currently I'm using file_get_contents() to submit GET data to an array of sites, but upon execution of the page I get this error:
Fatal error: Maximum execution time of 30 seconds exceeded
All I really want the script to do is start loading the webpage, and then leave. Each webpage may take up to 5 minutes to load fully, and I don't need it to load fully.
Here is what I currently have:
foreach ($sites as $s) // Create one line to read from a wide array
{
    file_get_contents($s['url']); // Send to the shells
}
EDIT: To clear any confusion, this script is being used to start scripts on other servers that return no data.
EDIT: I'm now attempting to use cURL to do the trick, by setting a timeout of one second to make it send the data and then stop. Here is my code:
$ch = curl_init($s['url']); // load the URL
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); // only send the data, don't wait
curl_exec($ch); // execute
curl_close($ch); // close it off
Perhaps I've set the option wrong. I'm looking through some manuals as we speak. Just giving you an update. Thank you to all of you who are helping me thus far.
EDIT: Ah, found the problem. I was using CURLOPT_CONNECTTIMEOUT instead of CURLOPT_TIMEOUT. Whoops.
However, now the scripts aren't triggering. They each use ignore_user_abort(TRUE); so I can't understand the problem.
Hah, scratch that. It works now. Thanks a lot, everyone.
There are many ways to solve this.
You could use cURL with its curl_multi_* functions to execute the requests asynchronously. Or use cURL the common way but with 1 as the timeout limit, so your script will make the request and then time out while the remote request still gets executed.
If you don't have cURL installed, you could keep using file_get_contents but fork processes (not so cool, but it works) using something like ZendX_Console_Process_Unix, so you avoid waiting between requests.
As Franco mentioned, and I'm not sure it was picked up on, you specifically want to use the curl_multi functions, not the regular curl ones. This packs multiple cURL handles into a curl_multi handle and executes them simultaneously, returning (or not, in your case) the responses as they arrive.
Example at http://php.net/curl_multi_init
Re your update that you only need to trigger the operation:
You could try using file_get_contents with a timeout. This would lead to the remote script being called, but the connection being terminated after n seconds (e.g. 1).
If the remote script is configured so it continues to run even if the connection is aborted (in PHP that would be ignore_user_abort), it should work.
Try it out. If it doesn't work, you won't get around increasing your time_limit or using an external executable. But from what you're saying - you just need to make the request - this should work. You could even try setting the timeout to 0, but I wouldn't trust that.
From here:
<?php
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 1
    )
));
file_get_contents("http://example.com/", false, $ctx);
?>
To be fair, Chris's answer already includes this possibility: curl also has a timeout switch.
It is not file_get_contents() that consumes that much time, but the network connection itself.
Consider not submitting GET data to an array of sites, but creating an RSS feed and letting them fetch the RSS data.
I don't fully understand the purpose of your script.
But here is what you can do:
To quickly avoid the fatal error, you can just add set_time_limit(120) at the beginning of the file. This will allow the script to run for 2 minutes. Of course, you can use any number you want, and 0 for no limit.
If you just need to call the URL and you don't "care" about the result, you should use cURL asynchronously. In that case a call to the URL will not wait until it has finished, and you can fire them all off very quickly.
BR.
If the remote pages take up to 5 minutes to load, your file_get_contents will sit and wait for that 5 minutes. Is there any way you could modify the remote scripts to fork into a background process and do the heavy processing there? That way your initial hit will return almost immediately, and not have to wait for the startup period.
Another possibility is to investigate if a HEAD request would do the trick. HEAD does not return any data, just headers, so it may be enough to trigger the remote jobs and not wait for the full output.
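If you try the HEAD route, here is a minimal sketch that stays with file_get_contents (the 1-second timeout is just an example):
$ctx = stream_context_create(array(
    'http' => array(
        'method'  => 'HEAD', // ask for headers only, no body
        'timeout' => 1
    )
));
foreach ($sites as $s) {
    file_get_contents($s['url'], false, $ctx); // response headers end up in $http_response_header
}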

Improving cURL performance (PHP Library)

Here is a brief overview of what I am doing; it is quite simple really:
Go out and fetch records from a database table.
Walk through all those records, and for each column that contains a URL, go out (using cURL) and make sure the URL is still valid.
For each record, a column is updated with the current timestamp indicating when it was last checked, and some other DB processing takes place.
Anyhow, all this works well and does exactly what it is supposed to. The problem is that I think performance could be greatly improved in terms of how I am validating the URLs with cURL.
Here is a brief (over simplified) excerpt from my code which demonstrates how cURL is being used:
$ch = curl_init();
while ($dbo = pg_fetch_object($dbres))
{
    // for each iteration set the URL to the db record's URL
    curl_setopt($ch, CURLOPT_URL, $dbo->url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch); // perform a cURL session
    $ihttp_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
    // do checks on $ihttp_code and update db
}
// do other stuff here
curl_close($ch);
As you can see, I am just reusing the same cURL handle the entire time, but even if I strip out all of the other processing (database or otherwise), the script still takes incredibly long to run. Would changing any of the cURL options help improve performance? Tuning timeout values, etc.? Any input would be appreciated.
Thank you,
Nicholas
Set CURLOPT_NOBODY to 1 (see the cURL documentation) to tell cURL not to ask for the body of the response. This will contact the web server and issue a HEAD request. The response code will tell you whether the URL is valid, without transferring the bulk of the data back.
If that's still too slow, then you'll likely see a vast improvement by running N threads (or processes) each doing 1/Nth of the work. The bottleneck may not be in your code, but in the response times of the remote servers. If they're slow to respond, then your loop will be slow to run.
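Applied to the loop above, that could look roughly like this (a sketch; the timeout values are just examples):
$ch = curl_init();
curl_setopt($ch, CURLOPT_NOBODY, true);          // issue HEAD requests: status line and headers only
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // example connection timeout
curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // example total timeout
while ($dbo = pg_fetch_object($dbres))
{
    curl_setopt($ch, CURLOPT_URL, $dbo->url);
    curl_exec($ch);
    $ihttp_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
    // do checks on $ihttp_code and update db
}
curl_close($ch);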
