How many URLs can I download at one time using cURL in PHP?

I've tested this cURL code to download multiple pages simultaneously, but I want to know what the maximum permissible limit is, if any, for simultaneous downloads:
<?php
class Footo_Content_Retrieve_HTTP_CURLParallel
{
    /**
     * Fetch a collection of URLs in parallel using cURL. The results are
     * returned as an associative array, with the URLs as the keys and the
     * content of the URLs as the values.
     *
     * @param array<string> $addresses An array of URLs to fetch.
     * @return array<string> The content of each URL that we've been asked to fetch.
     */
    public function retrieve($addresses)
    {
        $multiHandle = curl_multi_init();
        $handles = array();
        $results = array();

        foreach ($addresses as $url) {
            $handle = curl_init($url);
            $handles[$url] = $handle;
            curl_setopt_array($handle, array(
                CURLOPT_HEADER => false,
                CURLOPT_RETURNTRANSFER => true,
            ));
            curl_multi_add_handle($multiHandle, $handle);
        }

        // execute the handles
        $result = CURLM_CALL_MULTI_PERFORM;
        $running = false;

        // set up and make any requests..
        while ($result == CURLM_CALL_MULTI_PERFORM) {
            $result = curl_multi_exec($multiHandle, $running);
        }

        // wait until data arrives on all sockets
        while ($running && ($result == CURLM_OK)) {
            if (curl_multi_select($multiHandle) > -1) {
                $result = CURLM_CALL_MULTI_PERFORM;
                // while we need to process sockets
                while ($result == CURLM_CALL_MULTI_PERFORM) {
                    $result = curl_multi_exec($multiHandle, $running);
                }
            }
        }

        // clean up
        foreach ($handles as $url => $handle) {
            $results[$url] = curl_multi_getcontent($handle);
            curl_multi_remove_handle($multiHandle, $handle);
            curl_close($handle);
        }
        curl_multi_close($multiHandle);

        return $results;
    }
}
Original source:
http://css.dzone.com/articles/retrieving-urls-parallel-curl

There is no hard limit, but you must consider your server's internet connection, bandwidth, memory usage, CPU load, and so on.
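In practice it is usually worth capping how many transfers run at once instead of adding thousands of handles in one go. A minimal sketch, assuming the class above, your URL list in $addresses, and an arbitrary batch size of 50:

$retriever = new Footo_Content_Retrieve_HTTP_CURLParallel();
$results = array();
foreach (array_chunk($addresses, 50) as $batch) {
    // each call runs up to 50 downloads in parallel; results stay keyed by URL
    $results += $retriever->retrieve($batch);
}

On recent PHP/libcurl versions you can alternatively let cURL throttle itself by calling curl_multi_setopt($multiHandle, CURLMOPT_MAX_TOTAL_CONNECTIONS, 50) inside retrieve().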

Related

Access multiple URLs at once with cURL in PHP

I am working on an API that returns a single currency record per request. One request takes 0.5-1 seconds to respond, so 15 requests take 7-15 seconds.
As far as I know, the server can handle hundreds of requests per second.
I want to send 15 requests to the server at once so that the server responds in 1-2 seconds instead of 15, and returns all the data in one single array to save loading time.
Check my code below.
I am using a loop, and the loop waits until the previous cURL request has completed. How can I tell the loop to keep going and not wait for the response?
$time_Start = microtime(true);
$ids = array(1, 2, 11, 15, 20, 21); // 6 ids in demo, 15+ ids in real
$response = array();

foreach ($ids as $key => $id) {
    $response[$id] = get_data($id);
}

echo "Time: " . (microtime(true) - $time_Start) . "sec";
// output: 5 seconds for 6 requests

function get_data($id) {
    $fcs_api_key = "API_KEY";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "https://fcsapi.com/api/forex/indicators?id=" . $id . "&period=1d&access_key=" . $fcs_api_key);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $buffer = curl_exec($ch);
    curl_close($ch);
    return $buffer;
}
You can use PHP multi cURL: https://www.php.net/manual/en/function.curl-multi-init.php
Below is code that opens parallel requests.
$time_Start = microtime(true);
$ids = array(1, 2, 3, 4, 5, 6); // your forex currency ids
$response = php_curl_multi($ids);
echo "Time: " . (microtime(true) - $time_Start) . "sec";
// Time: 0.7 sec
Function
function php_curl_multi($ids) {
    $parameters = "/api/forex/indicators?period=1d&access_key=API_KEY&id="; // id is appended dynamically
    $url = "https://fcsapi.com" . $parameters;
    $ch_index = array(); // store all curl handles
    $response = array();

    // create a cURL handle for each id
    foreach ($ids as $key => $id) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url . $id);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $ch_index[] = $ch;
    }

    // create the multi cURL handle
    $mh = curl_multi_init();

    // add the handles
    foreach ($ch_index as $key => $ch) {
        curl_multi_add_handle($mh, $ch);
    }

    // execute the multi handle
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);
        }
    } while ($active && $status == CURLM_OK);

    // detach and close the handles
    foreach ($ch_index as $key => $ch) {
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    // collect all responses (in the same order as $ids)
    foreach ($ch_index as $key => $ch) {
        $response[] = curl_multi_getcontent($ch);
    }

    return $response;
}
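Because $ch_index is built in the same order as $ids, the responses come back in that order too. A small follow-up sketch (assuming, as in the question, that the API returns JSON) keys each decoded result by its currency id:

$ids = array(1, 2, 11, 15, 20, 21);
$raw = php_curl_multi($ids);
$response = array();
foreach (array_values($ids) as $i => $id) {
    // match each response to the id that produced it
    $response[$id] = json_decode($raw[$i], true);
}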

Nested loop with recursive function?

I need to do a recursive loop over every result suggested by Google, up to a user-defined depth, and save the results in a multidimensional array to be explored later.
I want to get this result:
google
google app
google app store
google app store games
google app store games free
google maps
google maps directions
google maps directions driving
google maps directions driving canada
...
Currently, my recursive function returns duplicated results from the second nesting level onward.
google
google app
google app
google app store
google app store
google app
google app store
google app store
google app store
...
I think the problem comes from the array of parent results that I pass as an argument to recursive_function() in each nested loop.
$child = recursive_function($parent[0][1], $depth, $inc+1);
Recursive function
// keywords, one per line or space-separated
$keywords = explode("\n", trim("facebook"));
$result = recursive_function($keywords, 2);

function recursive_function($query, $depth, $inc = 1)
{
    $urls = preg_filter('/^/', 'http://suggestqueries.google.com/complete/search?client=firefox&q=', array_map('urlencode', $query));
    $parent = curl_multi_function($urls);
    array_multisort($parent[0][1]);

    $out = array();
    if (count($parent[0][1]) === 0 || $inc >= $depth) {
        $out[] = $parent[0][1];
    } else {
        $child = recursive_function($parent[0][1], $depth, $inc + 1);
        $out[] = $child;
    }
    return $out;
}
cURL function
function curl_multi_function($data, $options = array())
{
    // array of curl handles
    $curly = array();
    // data to be returned
    $result = array();

    // multi handle
    $mh = curl_multi_init();

    // loop through $data and create curl handles,
    // then add them to the multi-handle
    foreach ($data as $id => $d) {
        $curly[$id] = curl_init();
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        curl_setopt($curly[$id], CURLOPT_URL, $url);
        curl_setopt($curly[$id], CURLOPT_HEADER, 0);
        curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curly[$id], CURLOPT_SSL_VERIFYPEER, 0);

        // post?
        if (is_array($d)) {
            if (!empty($d['post'])) {
                curl_setopt($curly[$id], CURLOPT_POST, 1);
                curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
            }
        }

        // extra options?
        if (!empty($options)) {
            curl_setopt_array($curly[$id], $options);
        }

        curl_multi_add_handle($mh, $curly[$id]);
    }

    // execute the handles
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);

    // get content and remove handles
    foreach ($curly as $id => $c) {
        $result[$id] = curl_multi_getcontent($c);
        // decode json result
        $result[$id] = json_decode(utf8_encode($result[$id]));
        curl_multi_remove_handle($mh, $c);
    }

    // all done
    curl_multi_close($mh);
    return $result;
}
Thanks!
I've changed your recursive_function a little bit:
function recursive_function($query, $depth, $inc = 1)
{
    $urls = preg_filter('/^/', 'http://suggestqueries.google.com/complete/search?client=firefox&q=', array_map('urlencode', $query));
    $parent = curl_multi_function($urls);
    $out = [];

    foreach ($parent as $key => $value) {
        array_multisort($value[1]);
        $words = explode(' ', $value[0]);
        $lastWord = end($words);

        if (count($value[1]) === 0 || $inc >= $depth) {
            $out[$lastWord] = [];
        } else {
            unset($value[1][0]);
            $child = recursive_function($value[1], $depth, $inc + 1);
            $out[$lastWord] = $child;
        }
    }
    return $out;
}
It generates an array like this:
[
    google =>
    [
        app =>
        [
            store =>
            [
                games =>
                [
                    free => []
                ]
            ]
        ]
        ...
    ]
]
Is that what you want?
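If you still want the flat, one-suggestion-per-line list from the question, a small helper can walk the nested array shown above. This is just a sketch on top of the answer, and print_suggestions is a hypothetical name, not part of the original code:

function print_suggestions(array $tree, $prefix = '')
{
    foreach ($tree as $word => $children) {
        $phrase = trim($prefix . ' ' . $word);
        echo $phrase . "\n";                    // e.g. "google", then "google app", ...
        print_suggestions($children, $phrase); // recurse into the deeper suggestions
    }
}

print_suggestions($result);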

Duplicate detection code not working

I have a fairly simple piece of code here: I add a bunch of links to the database, then check each link for a 200 OK.
<?php
function check_alive($url, $timeout = 10) {
    $ch = curl_init($url);

    // Set request options
    curl_setopt_array($ch, array(
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_NOBODY => true,
        CURLOPT_TIMEOUT => $timeout,
        CURLOPT_USERAGENT => "page-check/1.0"
    ));

    // Execute request
    curl_exec($ch);

    // Check if an error occurred
    if (curl_errno($ch)) {
        curl_close($ch);
        return false;
    }

    // Get HTTP response code
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // Page is alive if 200 OK is received
    return $code === 200;
}

if (isset($_GET['cron'])) {
    // database connection
    $c = mysqli_connect("localhost", "paydayci_gsa", "", "paydayci_gsa");

    //$files = scandir('Links/');
    $files = glob("Links/*.{*}", GLOB_BRACE);
    foreach ($files as $file) {
        $json = file_get_contents($file);
        $data = json_decode($json, true);
        if (!is_array($data)) continue;

        foreach ($data as $platform => $urls) {
            foreach ($urls as $link) {
                //echo $link;
                $lnk = parse_url($link);
                $resUnique = $c->query("SELECT * FROM `links_to_check` WHERE `link_url` like '%".$lnk['host']."%'");
                // If no duplicate, insert in database
                if (!$resUnique->num_rows) {
                    $i = $c->query("INSERT INTO `links_to_check` (link_id,link_url,link_platform) VALUES ('','".$link."','".$platform."')");
                }
            }
        }
        // at the very end delete the file
        unlink($file);
    }

    // check if the urls are alive
    $select = $c->query("SELECT * FROM `links_to_check` ORDER BY `link_id` ASC");
    while ($row = $select->fetch_array()) {
        $alive = check_alive($row['link_url']);
        $live = "";
        if ($alive == true) {
            $live = "Y";
            $lnk = parse_url($row['link_url']);
            // Check for duplicate
            $resUnique = $c->query("SELECT * FROM `links` WHERE `link_url` like '%".$row['link_url']."%'");
            echo $resUnique;
            // If no duplicate, insert in database
            if (!$resUnique->num_rows) {
                $i = $c->query("INSERT INTO links (link_id,link_url,link_platform,link_active,link_date) VALUES ('','".$row['link_url']."','".$row['link_platform']."','".$live."',NOW())");
            }
        }
        $c->query("DELETE FROM `links_to_check` WHERE link_id = '".$row['link_id']."'");
    }
}
?>
I'm trying not to add duplicate URLs to the database, but they are still getting in. Have I missed something obvious in my code? I have looked over it a few times and can't see anything staring out at me.
If you are trying to enforce unique values in a database, you should rely on the database itself to enforce that constraint. You can add a unique index (assuming you are using MySQL or a variant, which the syntax suggests) like this:
ALTER TABLE `links` ADD UNIQUE INDEX `idx_link_url` (`link_url`);
One thing to be aware of is extra spaces as prefixes/suffixes, so use trim() on the values; you should also strip trailing slashes with rtrim() to keep everything consistent (so you don't get dupes).
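Putting those two points together, a minimal sketch (assuming the unique index above, plus a matching unique index on links_to_check if you insert there too, and the question's existing mysqli connection $c): normalize the URL first, then let the database reject duplicates with INSERT IGNORE, using a prepared statement instead of string-concatenated SQL:

$url = rtrim(trim($link), '/'); // normalize before inserting so near-duplicates compare equal
$stmt = $c->prepare("INSERT IGNORE INTO `links_to_check` (link_url, link_platform) VALUES (?, ?)");
$stmt->bind_param("ss", $url, $platform);
$stmt->execute(); // rows that violate the unique index are silently skipped
$stmt->close();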

Multithreading PHP Function

Currently, when I execute this function with, say, 60 URLs, I get an HTTP 504 error. Is there any way to multithread this so that I no longer get a 504 error and can iterate through the entire list of URLs?
<?php
namespace App\Http\Controllers;

use Request;
use App\Http\Controllers\Controller;

class MainController extends Controller
{
    public function parse()
    {
        $input = Request::all();
        $csv = $input['laraCsv'];
        $new_csv = trim(preg_replace('/\s\s+/', ',', $csv));
        $headerInfo = [];
        //$titles = [];
        $csvArray = str_getcsv($new_csv, ",");
        $csvLength = count($csvArray);
        $i = 0;

        while ($i < $csvLength) {
            if (strpos($csvArray[$i], '.pdf') !== false) {
                print_r($csvArray[$i]);
            } else {
                array_push($headerInfo, get_headers($csvArray[$i], 1));
            }
            //sleep(3);
            //echo file_get_contents($csvArray[$i]);
            $i++;
        }

        return view('csvViewer')->with('data', $headerInfo)->with('urls', $csvArray);
    }
}
I've used DigitalOcean in the past, but I'm not sure what error codes they give if you run out of time (also, set_time_limit(0); should already be in your code).
See if this works:
<?php
function getHeaders($data) {
    $curly = array();
    $result = array();
    $mh = curl_multi_init();

    foreach ($data as $id => $url) {
        $curly[$id] = curl_init();
        curl_setopt($curly[$id], CURLOPT_URL, $url);
        curl_setopt($curly[$id], CURLOPT_HEADER, true);
        curl_setopt($curly[$id], CURLOPT_NOBODY, true);
        curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $curly[$id]);
    }

    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);

    foreach ($curly as $id => $c) {
        $result[$id] = array_filter(explode("\n", curl_multi_getcontent($c)));
        curl_multi_remove_handle($mh, $c);
    }
    curl_multi_close($mh);

    return $result;
}

$urls = array(
    'http://google.com',
    'http://yahoo.com',
    'http://doesnotexistwillitplease.com'
);

$r = getHeaders($urls);

echo '<pre>';
print_r($r);
So once you've gotten all your URLs into an array, run it like getHeaders($urls);.
If it doesn't work, try it with only 3 or 4 URLs first. Also add set_time_limit(0); at the top, as mentioned before.
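To fold this back into the controller from the question, one possible sketch (only a sketch: it reuses $csvArray and the .pdf check from the original parse() method, and note that getHeaders() returns raw header lines rather than the associative array get_headers() produces, so the view may need a small adjustment):

// inside parse(), replacing the while loop
$urls = array_filter($csvArray, function ($u) {
    return strpos($u, '.pdf') === false; // skip PDFs, as the original loop did
});
$headerInfo = getHeaders($urls); // one parallel pass instead of many serial get_headers() calls
return view('csvViewer')->with('data', $headerInfo)->with('urls', $csvArray);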
Are you sure it is because of your code? It could also be the server configuration. About HTTP 504:
This problem is entirely due to slow IP communication between back-end computers, possibly including the Web server. Only the people who set up the network at the site which hosts the Web server can fix this problem.

Simultaneous HTTP requests in PHP with cURL

I'm trying to take a rather large list of domains and query the rank of each using the compete.com API, as seen here: https://www.compete.com/developer/documentation
The script I wrote takes a database of domains I populated and initiates a cURL request to compete.com for the rank of each website. I quickly realized that this was very slow because the requests were being sent one at a time. I did some searching and came across this post: http://www.phpied.com/simultaneuos-http-requests-in-php-with-curl/ which explains how to perform simultaneous HTTP requests in PHP with cURL.
Unfortunately, that script will take an array of 25,000 domains and try to process them all at once. I found that batches of 1,000 work quite well.
Any idea how to send 1,000 queries to compete.com, wait for completion, and send the next 1,000 until the array is empty? Here's what I'm working with thus far:
<?php
//includes
include('includes/mysql.php');
include('includes/config.php');

//get domains
$result = mysql_query("SELECT * FROM $tableName");
while ($row = mysql_fetch_array($result)) {
    $competeRequests[] = "http://apps.compete.com/sites/" . $row['Domain'] . "/trended/rank/?apikey=xxx&start_date=201207&end_date=201208&jsonp=";
}

//first batch
$curlRequest = multiRequest($competeRequests);

$j = 0;
foreach ($curlRequest as $json) {
    $j++;
    $json_output = json_decode($json, TRUE);
    $rank = $json_output['data']['trends']['rank'][0]['value'];

    if ($rank) {
        //Create mysql query
        $query = "UPDATE $tableName SET Rank = '$rank' WHERE ID = '$j'";
        //Execute the query
        mysql_query($query);
        echo $query . "<br/>";
    }
}

function multiRequest($data) {
    // array of curl handles
    $curly = array();
    // data to be returned
    $result = array();

    // multi handle
    $mh = curl_multi_init();

    // loop through $data and create curl handles,
    // then add them to the multi-handle
    foreach ($data as $id => $d) {
        $curly[$id] = curl_init();
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        curl_setopt($curly[$id], CURLOPT_URL, $url);
        curl_setopt($curly[$id], CURLOPT_HEADER, 0);
        curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);

        // post?
        if (is_array($d)) {
            if (!empty($d['post'])) {
                curl_setopt($curly[$id], CURLOPT_POST, 1);
                curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
            }
        }
        curl_multi_add_handle($mh, $curly[$id]);
    }

    // execute the handles
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);

    // get content and remove handles
    foreach ($curly as $id => $c) {
        $result[$id] = curl_multi_getcontent($c);
        curl_multi_remove_handle($mh, $c);
    }

    // all done
    curl_multi_close($mh);
    return $result;
}
?>
Instead of
//first batch
$curlRequest = multiRequest($competeRequests);
$j = 0;
foreach ($curlRequest as $json){
You can do:
$curlRequest = array();
foreach (array_chunk($competeRequests, 1000) as $requests) {
    $results = multiRequest($requests);
    $curlRequest = array_merge($curlRequest, $results);
}

$j = 0;
foreach ($curlRequest as $json) {
    $j++;
    // ...
This will split the large array into chunks of 1,000 and pass each chunk of 1,000 values to your multiRequest function, which uses cURL to execute those requests.
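If holding all 25,000 responses in memory becomes a problem, a variation (just a sketch reusing the question's own update query and multiRequest function) is to process and discard each batch before fetching the next:

$j = 0;
foreach (array_chunk($competeRequests, 1000) as $requests) {
    foreach (multiRequest($requests) as $json) {
        $j++;
        $json_output = json_decode($json, TRUE);
        $rank = $json_output['data']['trends']['rank'][0]['value'];
        if ($rank) {
            // update the row, then let this batch's responses be garbage-collected
            mysql_query("UPDATE $tableName SET Rank = '$rank' WHERE ID = '$j'");
        }
    }
}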
https://github.com/webdevelopers-eu/ShadowHostCloak
This does exactly what you want. Just pass an empty argument to new Proxy() to bypass the proxy and make direct requests.
You can stuff 1,000 requests into it and call $proxy->execWait(); it will process all the requests simultaneously and return from that method when everything is done. Then you can repeat.
