I have a web portal that needs to download many separate JSON files and display their contents in a sort of form view. By many I mean 32 separate files minimum.
I've tried cURL with brute-force iteration and it's taking ~12.5 seconds.
I've tried curl_multi_exec as demonstrated here http://www.php.net/manual/en/function.curl-multi-init.php with the function below, and it's taking ~9 seconds. A little better, but still terribly slow.
function multiple_threads_request($nodes){
    $mh = curl_multi_init();
    $curl_array = array();
    foreach($nodes as $i => $url)
    {
        $curl_array[$i] = curl_init($url);
        curl_setopt($curl_array[$i], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $curl_array[$i]);
    }
    $running = NULL;
    do {
        curl_multi_exec($mh, $running);
    } while($running > 0);
    $res = array();
    foreach($nodes as $i => $url)
    {
        $res[$url] = curl_multi_getcontent($curl_array[$i]);
    }
    foreach($nodes as $i => $url)
    {
        curl_multi_remove_handle($mh, $curl_array[$i]);
    }
    curl_multi_close($mh);
    return $res;
}
I realize this is an inherently expensive operation, but does anyone know any other alternatives that might be faster?
EDIT: In the end, my system was limiting curl_multi_exec; moving the code to a production machine saw dramatic improvements.
You should definitely look into benchmarking your cURL calls to see which one has the slowdown. This was too lengthy for a comment, so let me know whether it helps:
// revert to "cURLing with brute force iteration" as you described it :)
$curl_timer = array();
foreach($curlsite as $row)
{
    $start = microtime(true);
    /**
     * curl code
     */
    $curl_timer[] = (microtime(true) - $start);
}
echo '<pre>'.print_r($curl_timer, true).'</pre>';
Related
I have an API that I call at least 10 times at the same time with different information.
This is the function I am currently using.
$mh = curl_multi_init();
$arr = array();
$rows = array();
while ($row = mysqli_fetch_array($query)) {
    array_push($arr, initiate_curl($row, $mh));
    array_push($rows, $row);
}
$running = null;
for(;;){
    curl_multi_exec($mh, $running);
    if(!$running){
        break;
    }
    curl_multi_select($mh);
    usleep(1);
}
sleep(1);
foreach($arr as $curl) {
    curl_multi_remove_handle($mh, $curl);
}
curl_multi_close($mh);
foreach($arr as $key => $curl) {
    $result = curl_multi_getcontent($curl);
    $dat = simplexml_load_string($result);
    check_time($dat, $rows[$key], $fp);
}
It works fine when the number of requests is small, but as it grows some of the curl handles do not bring back the appropriate data; i.e., they return null, and I am guessing it's because the script moves on before those transfers have finished.
What can I do to make this work? I am inexperienced with PHP and servers and am having a hard time going through the documentation.
If I create another PHP file that curls the API and does the processing there, and multi_curl that PHP file instead, would it work better? (In that case it won't matter so much that some of the calls return no data.) Would that overload my server constantly?
This is not a curl issue; it is a server utilisation issue, so try upgrading the server.
Useful links: link1, link2
Note: Be careful when using an infinite loop [for(;;)], and regularly monitor CPU utilisation.
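If you do keep a polling loop, a select-based version is far gentler on the CPU than a bare for(;;). A minimal sketch, assuming $mh is the multi handle from the code above:
<?php
// Block in curl_multi_select() until a transfer has activity, instead
// of spinning; the CPU stays idle while the downloads run.
$running = null;
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh, 1.0); // wait up to 1s for socket activity
    }
} while ($running > 0);
?>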
I am trying to use curl_multi_exec() in PHP with roughly 4,000 POST calls, collecting the JSON they return. However, after 234 records my print_r starts showing nothing. I can change the POST URLs, since each of my URLs has a different postfield, but I would still get 234 results. Does anybody know whether curl_multi_exec() has any limits? I am using a XAMPP server on my computer to retrieve the JSON from a remote server. Is an option in my XAMPP install preventing more results, or is there a server-side limit on my connections?
Thanks. My code for the function is below. The function takes an input $opt, which is an array of the curl options.
$ch = array();
$results = array();
$mh = curl_multi_init();
foreach($opt as $handler => $array)
{
    //print_r($array);
    //echo "<br><br>";
    $ch[$handler] = curl_init();
    curl_setopt_array($ch[$handler], $array);
    curl_multi_add_handle($mh, $ch[$handler]);
}
$running = null;
do {
    curl_multi_exec($mh, $running);
} while ($running > 0);
// Get content and remove handles.
foreach ($ch as $key => $val) {
    $results[$key] = json_decode(curl_multi_getcontent($val), true);
    curl_multi_remove_handle($mh, $val);
}
curl_multi_close($mh);
return $results;
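One way to find out what is actually happening to the transfers past record 234 is to ask curl itself: curl_multi_info_read() reports a result code for every completed handle. A hedged diagnostic sketch; run it after the exec loop and before the handles are removed:
<?php
// Each completed transfer leaves a message on the multi handle.
// CURLE_OK (0) means success; anything else names the failure
// (timeout, connect error, too many open files, etc.).
while ($info = curl_multi_info_read($mh)) {
    if ($info['result'] !== CURLE_OK) {
        echo 'Transfer failed: ' . curl_strerror($info['result']) . "\n";
    }
}
?>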
Well, I am attempting to reuse the handles I've spawned in the initial process, but after the first run it simply stops working. If I remove the handles and add them again (or recreate the entire handler), it works fine. What could be the culprit?
My code currently looks like this:
<?php
echo 'Handler amount: ';
$threads = (int) trim(fgets(STDIN));
if($threads < 1) {
    $threads = 1;
}
$s = microtime(true);
$url = 'http://mywebsite.com/some-script.php';
$mh = curl_multi_init();
$ch = array();
for($i = 0; $i < $threads; $i++) {
    $ch[$i] = curl_init($url);
    curl_setopt_array($ch[$i], array(
        CURLOPT_USERAGENT => 'Mozilla/5.0 (X11; Linux i686; rv:21.0) Gecko/20130213 Firefox/21.0',
        CURLOPT_REFERER => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_NOBODY => true
    ));
    curl_multi_add_handle($mh, $ch[$i]);
}
while($mh) {
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while($running > 0);
    $e = microtime(true);
    $totalTime = number_format($e - $s, 2);
    if($totalTime >= 1) {
        echo floor($threads / $totalTime) . ' requests per second (total time '.$totalTime.'s)' . "\r";
        $s = microtime(true);
    }
}
foreach($ch as $handler) {
    curl_multi_remove_handle($mh, $handler);
    curl_close($handler);
}
curl_multi_close($mh);
?>
When I have CURLOPT_VERBOSE set to true, I see many "additional stuff not fine transfer.c:1037: 0 0" messages; I read about them in a different question, and it seems they are caused by some obvious things:
Too fast
Firewall
ISP restricting
AFAIK, this is not it, because if I recreate the handles every time, they successfully complete at about 79 requests per second (about 529 bytes each).
My process for reusing the handles:
Create the multi handler, and add the specified number of handles to the multi handler
While the multi handler is working, execute all the handles
After the while loop has stopped (it seems very unlikely that it will), close all the handles and the multi curl handler
It executes all handles once and then stops.
This is really stumping me. Any ideas?
I ran into the same problem (using C++ though) and found out that I need to remove the curl easy handle(s) and add it back in again. My solution was to remove all handles at the end of the curl_multi_perform loop and add them back in at the beginning of the outer loop in which I reuse existing keep-alive connections:
for(;;) // loop using keep-alive connections
{
    curl_multi_add_handle(...)
    while ( stillRunning ) // curl_multi_perform loop
    {
        ...
        curl_multi_perform(...)
        ...
    }
    curl_multi_remove_handle(...)
}
Perhaps this also applies to your PHP scenario. Remember: don't curl_easy_cleanup or curl_easy_init the curl handle in between.
If you turn on CURLOPT_VERBOSE you can follow along in the console and verify that your connections are indeed reused. That solved this problem for me.
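In PHP the equivalent pattern might look like the sketch below (untested against the exact script above; $mh and $ch are the handles from the question, and the batch count of 10 is arbitrary):
<?php
// Re-add handles before each batch and remove them afterwards so the
// multi handle will execute them again, while the underlying
// keep-alive connections stay open.
for ($batch = 0; $batch < 10; $batch++) {
    foreach ($ch as $handle) {
        curl_multi_add_handle($mh, $handle);
    }
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);
    foreach ($ch as $handle) {
        // Remove, but do NOT curl_close(): closing would drop the connection.
        curl_multi_remove_handle($mh, $handle);
    }
}
?>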
I'm trying to take a list of 20,000+ domain names and check if they are "alive". All I really need is a simple HTTP code check, but I can't figure out how to get that working with curl_multi. In a separate script I have the following function, which simultaneously checks a batch of 1000 domains and returns the JSON response code. Maybe this can be modified to just get the HTTP response code instead of the page content? (A sketch of that modification follows the code below.)
$dotNetRequests = array(/* ... list of domains ... */);
$NetcurlRequest = array(); // collected results
// loop through the domains in batches of 1000
foreach(array_chunk($dotNetRequests, 1000) as $Netrequests) {
    $results = checkDomains($Netrequests);
    $NetcurlRequest = array_merge($NetcurlRequest, $results);
}
function checkDomains($data) {
    // array of curl handles
    $curly = array();
    // data to be returned
    $result = array();
    // multi handle
    $mh = curl_multi_init();
    // loop through $data and create curl handles
    // then add them to the multi-handle
    foreach ($data as $id => $d) {
        $curly[$id] = curl_init();
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        curl_setopt($curly[$id], CURLOPT_URL, $url);
        curl_setopt($curly[$id], CURLOPT_HEADER, 0);
        curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);
        // post?
        if (is_array($d)) {
            if (!empty($d['post'])) {
                curl_setopt($curly[$id], CURLOPT_POST, 1);
                curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
            }
        }
        curl_multi_add_handle($mh, $curly[$id]);
    }
    // execute the handles
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while($running > 0);
    // get content and remove handles
    foreach($curly as $id => $c) {
        // $result[$id] = curl_multi_getcontent($c);
        // if($result[$id]) {
        if (curl_multi_getcontent($c)){
            //echo "yes";
            $netName = $data[$id];
            $dName = str_replace(".net", ".com", $netName);
            $query = "Update table1 SET dotnet = '1' WHERE Domain = '$dName'";
            mysql_query($query);
        }
        curl_multi_remove_handle($mh, $c);
    }
    // all done
    curl_multi_close($mh);
    return $result;
}
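A sketch of the modification asked about above, as a hypothetical checkAlive() helper: CURLOPT_NOBODY makes curl issue a HEAD-style request so no body is downloaded, and curl_getinfo() reads the status code. (Some servers refuse HEAD, so treat this as a starting point rather than a drop-in replacement.)
<?php
// Hedged sketch: batch-check domains by HTTP status code only.
// Returns array of url => status code (0 = no HTTP response at all).
function checkAlive(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $i => $url) {
        $handles[$i] = curl_init($url);
        curl_setopt($handles[$i], CURLOPT_NOBODY, true); // headers only
        curl_setopt($handles[$i], CURLOPT_TIMEOUT, 10);  // cap dead hosts
        curl_multi_add_handle($mh, $handles[$i]);
    }
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);
    $codes = array();
    foreach ($urls as $i => $url) {
        $codes[$url] = curl_getinfo($handles[$i], CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($mh, $handles[$i]);
        curl_close($handles[$i]);
    }
    curl_multi_close($mh);
    return $codes;
}
?>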
In any other language you would thread this kind of operation ...
https://github.com/krakjoe/pthreads
And you can in PHP too :)
I would suggest a few workers rather than 20,000 individual threads. Not that 20,000 threads is out of the realms of possibility (it isn't), but that wouldn't be a good use of resources. I would do as you are now and have 20 workers getting the results of 1,000 domains each. I assume you don't need me to give an example of getting a response code; I'm sure curl would give it to you, but it's probably overkill to use curl here since you don't require its threading capabilities. I would fsockopen port 80, fprintf a minimal "GET / HTTP/1.0\r\n\r\n" request, fgets the first line, and close the connection. If you're going to be doing this all the time, I would also send Connection: close so that the receiving machines are not holding connections unnecessarily. A sketch of that socket approach follows.
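A minimal sketch of that approach, with a hypothetical firstStatusLine() helper (the 5-second timeout and HEAD verb are my choices, not anything from the answer above):
<?php
// Hedged sketch: check a domain by opening port 80 directly, sending a
// minimal HTTP/1.0 request, and reading only the status line.
function firstStatusLine($domain) {
    $fp = @fsockopen($domain, 80, $errno, $errstr, 5); // 5s connect timeout
    if (!$fp) {
        return false; // host down or connection refused
    }
    fwrite($fp, "HEAD / HTTP/1.0\r\nHost: $domain\r\nConnection: close\r\n\r\n");
    $line = fgets($fp); // e.g. "HTTP/1.0 200 OK"
    fclose($fp);
    return trim($line);
}

var_dump(firstStatusLine('example.com'));
?>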
This script works great for handling bulk simultaneous cURL requests using PHP.
I'm able to parse through 50k domains in just a few minutes using it!
https://github.com/petewarden/ParallelCurl/
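If it helps, rough usage based on that repository's README (verify the exact signatures against parallelcurl.php itself; the callback name and options here are illustrative):
<?php
// Illustrative ParallelCurl usage, from memory of the project README.
require_once 'parallelcurl.php';

// Called once per completed request.
function on_request_done($content, $url, $ch, $user_data) {
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    echo "$url -> $code\n";
}

// 10 = maximum number of requests in flight at once.
$parallel_curl = new ParallelCurl(10, array(CURLOPT_RETURNTRANSFER => true));
foreach ($domains as $domain) { // $domains: your list from the question
    $parallel_curl->startRequest("http://$domain/", 'on_request_done', null);
}
$parallel_curl->finishAllRequests(); // block until everything completes
?>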
My current code (see below) uses 147MB of virtual memory!
My provider has allocated 100MB by default, and the process is killed once it exceeds that, causing an internal error.
The code uses curl multi and must be able to loop through more than 150 iterations whilst still minimising virtual memory. The code below is set at only 150 iterations and still causes the internal server error; at 90 iterations the issue does not occur.
How can I adjust my code to lower its resource use / virtual memory?
Thanks!
<?php
function udate($format, $utimestamp = null) {
    if ($utimestamp === null)
        $utimestamp = microtime(true);
    $timestamp = floor($utimestamp);
    $milliseconds = round(($utimestamp - $timestamp) * 1000);
    return date(preg_replace('`(?<!\\\\)u`', $milliseconds, $format), $timestamp);
}
$url = 'https://www.testdomain.com/';
$curl_arr = array();
$master = curl_multi_init();
for($i = 0; $i < 150; $i++)
{
    $curl_arr[$i] = curl_init();
    curl_setopt($curl_arr[$i], CURLOPT_URL, $url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYHOST, FALSE);
    curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_multi_add_handle($master, $curl_arr[$i]);
}
do {
    curl_multi_exec($master, $running);
} while($running > 0);
for($i = 0; $i < 150; $i++)
{
    $results = curl_multi_getcontent($curl_arr[$i]);
    $results = explode("<br>", $results);
    echo $results[0];
    echo "<br>";
    echo $results[1];
    echo "<br>";
    echo udate('H:i:s:u');
    echo "<br><br>";
    usleep(100000);
}
?>
As per your last comment...
Download RollingCurl.php.
Hopefully this will sufficiently spam the living daylights out of your API.
<?php
$url = '________';
$fetch_count = 150;
$window_size = 5;

require("RollingCurl.php");

function request_callback($response, $info, $request) {
    list($result0, $result1) = explode("<br>", $response);
    echo "{$result0}<br>{$result1}<br>";
    //print_r($info);
    //print_r($request);
    echo "<hr>";
}

$urls = array_fill(0, $fetch_count, $url);

$rc = new RollingCurl("request_callback");
$rc->window_size = $window_size;
foreach ($urls as $url) {
    $request = new RollingCurlRequest($url);
    $rc->add($request);
}
$rc->execute();
?>
Looking through your questions, I saw this comment:
"If the intention is domain snatching, then using one of the established services is a better option. Your script implementation is hardly as important as the actual connection and latency."
I agree with that comment.
Also, you seem to have posted the "same question" approximately seven hundred times:
https://stackoverflow.com/users/558865/icer
https://stackoverflow.com/users/516277/icer
How can I adjust the server to run my PHP script quicker?
How can I re-code my php script to run as quickly as possible?
How to run cURL once, checking domain availability in a loop? Help fixing code please
Help fixing php/api/curl code please
How to reduce virtual memory by optimising my PHP code?
Overlapping HTTPS requests?
Multiple https requests.. how to?
Doesn't the fact that you have to keep asking the same question over and over tell you that you're doing it wrong?
This comment of yours:
"#mario: Cheers. I'm competing against 2 other companies for specific ccTLD's. They are new to the game and they are snapping up those domains in slow time (up to 10 seconds after purge time). I'm just a little slower at the moment."
I'm fairly sure that PHP on a shared hosting account is the wrong tool to use if you are seriously trying to beat two companies at snapping up expired domain names.
The result of each of the 150 queries is being stored in PHP memory, and by your evidence this is insufficient. The only conclusion is that you cannot keep 150 queries in memory. You must have a method of streaming to files instead of memory buffers, or simply reduce the number of queries and process the list of URLs in batches.
To use streams you must set CURLOPT_RETURNTRANSFER to 0 and implement a callback for CURLOPT_WRITEFUNCTION; there is an example in the PHP manual:
http://www.php.net/manual/en/function.curl-setopt.php#98491
function on_curl_write($ch, $data)
{
    global $fh;
    $bytes = fwrite($fh, $data, strlen($data));
    return $bytes;
}
curl_setopt($curl_arr[$i], CURLOPT_WRITEFUNCTION, 'on_curl_write');
Getting the correct file handle in the callback is left as a problem for the reader to solve.
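If wiring the correct handle into that callback proves awkward, one hedged alternative (the result_$i.txt filenames are made up): give each easy handle its own file via CURLOPT_FILE, so curl streams every response straight to disk and nothing accumulates in PHP memory.
<?php
// Sketch: per-handle output files instead of in-memory buffers.
// Reuses $curl_arr from the question's loop above.
$files = array();
for ($i = 0; $i < 150; $i++) {
    $files[$i] = fopen("result_$i.txt", 'w');             // one file per request
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 0);
    curl_setopt($curl_arr[$i], CURLOPT_FILE, $files[$i]); // stream body to disk
}
// ... run the curl_multi loop as before, then:
foreach ($files as $fh) {
    fclose($fh);
}
?>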
<?php
echo str_repeat(' ', 1024); //to make flush work

$url = 'http://__________/';
$fetch_count = 15;
$delay = 100000; //0.1 second
//$delay = 1000000; //1 second

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);

for ($i = 0; $i < $fetch_count; $i++) {
    $start = microtime(true);
    $result = curl_exec($ch);
    list($result0, $result1) = explode("<br>", $result);
    echo "{$result0}<br>{$result1}<br>";
    flush();
    $end = microtime(true);
    // convert elapsed time to microseconds and never sleep a negative amount
    $sleeping = max(0, $delay - ($end - $start) * 1000000);
    echo 'sleeping: ' . ($sleeping / 1000000) . ' seconds<hr />';
    usleep((int)$sleeping);
}
curl_close($ch);
?>