I am using cURL multi to get data from some websites. Here is the code:
function getURL($ids)
{
    global $mh;
    $curl = array();
    $response = array();
    $n = count($ids);
    for ($i = 0; $i < $n; $i++) {
        $id = $ids[$i];
        $url = 'http://www.domain.com/?id=' . $id;
        // Init cURL
        $curl[$i] = curl_init($url);
        curl_setopt($curl[$i], CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl[$i], CURLOPT_CONNECTTIMEOUT, 30);
        curl_setopt($curl[$i], CURLOPT_USERAGENT, 'Googlebot/2.1 (http://www.googlebot.com/bot.html)');
        //curl_setopt($curl[$i], CURLOPT_FORBID_REUSE, true);
        //curl_setopt($curl[$i], CURLOPT_HEADER, false);
        curl_setopt($curl[$i], CURLOPT_HTTPHEADER, array(
            'Connection: Keep-Alive',
            'Keep-Alive: 300'
        ));
        // Add to multi cURL
        curl_multi_add_handle($mh, $curl[$i]);
    }
    // Execute
    do {
        curl_multi_exec($mh, $flag);
    } while ($flag > 0);
    // Get responses
    for ($i = 0; $i < $n; $i++) {
        $id = $ids[$i];
        $response[] = array(
            'id' => $id,
            'data' => curl_multi_getcontent($curl[$i])
        );
        // Remove handle
        //curl_multi_remove_handle($mh, $curl[$i]);
    }
    // Response
    return $response;
}
The problem is that cURL opens too many sockets to the web server: it creates a new socket for every connection. I want the current connection to be kept alive and reused for the next request. I don't want cURL to open 100 sockets just to handle 100 URLs :(
Please help me. Thanks so much!
So don't open that many sockets. Modify your code to only open X sockets, and then repeatedly use those sockets until all of your $ids have been consumed. That or pass fewer $ids into the function to begin with.
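A minimal sketch of that batching idea, assuming getURL() keeps the signature above and also removes its handles from the multi handle after reading them (the commented-out curl_multi_remove_handle() call); the batch size of 10 is an arbitrary choice:
// Process the ids in fixed-size batches so only a handful of sockets
// are ever open at the same time.
$response = array();
foreach (array_chunk($ids, 10) as $batch) {
    $response = array_merge($response, getURL($batch));
}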
I know this is old, but the correct answer has not been given yet, IMHO.
Please have a look at the CURLMOPT_MAX_TOTAL_CONNECTIONS option, which should solve your problem:
https://curl.se/libcurl/c/CURLMOPT_MAX_TOTAL_CONNECTIONS.html
Also make sure that multiplexing via HTTP/2 is not accidentally disabled:
https://curl.se/libcurl/c/CURLMOPT_PIPELINING.html
Classical HTTP/1 pipelining is no longer supported by cURL, but cURL can still re-use an existing HTTP/1 connection to send a new request once the current request has finished on that connection.
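A minimal sketch of how those two options can be set in PHP (assuming a PHP build where the CURLMOPT_MAX_TOTAL_CONNECTIONS constant is available, i.e. PHP 7.0.7+; the limit of 10 is arbitrary):
$mh = curl_multi_init();
// Cap how many connections this multi handle may keep open at once.
curl_multi_setopt($mh, CURLMOPT_MAX_TOTAL_CONNECTIONS, 10);
// Allow HTTP/2 multiplexing so several transfers can share one connection.
curl_multi_setopt($mh, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX);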
I have a custom cURL function that has to download a huge number of images from a remote server. I was banned a couple of times before, when I used file_get_contents(). I found that curl_multi_init() is a better option, as it can download, for example, 20 images at once over one connection.
I made a custom function that uses curl_init(), and I am trying to figure out how to implement curl_multi_init(), so that in my LOOP, where I grab the list of 20 URLs from the database, I can call my custom function and only call curl_close() on the last iteration. At the moment my function opens a new connection for each URL in the LOOP. Here is the function:
function downloadUrlToFile($remoteurl, $newfileName)
{
    $errors = 0;
    $options = array(
        CURLOPT_FILE => fopen('../images/products/' . $newfileName, 'w'),
        CURLOPT_TIMEOUT => 28800,
        CURLOPT_URL => $remoteurl,
        CURLOPT_RETURNTRANSFER => 1
    );
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $imageString = curl_exec($ch);
    $image = imagecreatefromstring($imageString);
    if ($imageString !== false AND !empty($imageString)) {
        if ($image !== false) {
            $width_orig = imagesx($image);
            if ($width_orig > 1000) {
                $saveimage = copy_and_resize_remote_image_product($image, $newfileName);
            } else {
                $saveimage = file_put_contents('../images/products/' . $newfileName, $imageString);
            }
        } else $errors++;
    } else $errors++;
    curl_close($ch);
    return $errors;
}
There has to be a way to use curl_multi_init() and my function downloadUrlToFile because:
I need to change the file name on the fly
In my function I am also checking several things about the remote image. In the sample function I only check the size and resize the image if necessary, but the real function does much more (I cut that part out for brevity; I also use the function to pass more variables).
How should the code be changed so that the LOOP connects to the remote server only once?
Thanks in advance
Try this pattern for multi cURL:
$urls = array($url_1, $url_2, $url_3);
$content = array();
$ch = array();
$mh = curl_multi_init();
foreach ($urls as $index => $url) {
    $ch[$index] = curl_init();
    curl_setopt($ch[$index], CURLOPT_URL, $url);
    curl_setopt($ch[$index], CURLOPT_HEADER, 0);
    curl_setopt($ch[$index], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch[$index]);
}
$active = null;
for (;;) {
    curl_multi_exec($mh, $active);
    if ($active < 1) {
        // all downloads completed
        break;
    } else {
        // sleep-wait for more data to arrive on the socket.
        // (without this, we would be wasting 100% of 1 CPU core while downloading;
        // with it, we use roughly 1-2% of 1 core instead.)
        curl_multi_select($mh, 1);
    }
}
foreach ($ch as $index => $c) {
    $content[$index] = curl_multi_getcontent($c);
    curl_multi_remove_handle($mh, $c);
    // You can add some functions here and use $content[$index]
}
curl_multi_close($mh);
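To tie this back to the original requirements (per-URL file names and the image checks), the generic foreach over $ch above could be replaced by something along these lines; a sketch, assuming a $fileNames array indexed the same way as $urls and the copy_and_resize_remote_image_product() helper from the question:
// Reuse the question's per-image checks on each downloaded body.
foreach ($ch as $index => $c) {
    $imageString = curl_multi_getcontent($c);
    curl_multi_remove_handle($mh, $c);
    if ($imageString === false || empty($imageString)) {
        continue; // download failed
    }
    $image = imagecreatefromstring($imageString);
    if ($image === false) {
        continue; // not a valid image
    }
    if (imagesx($image) > 1000) {
        copy_and_resize_remote_image_product($image, $fileNames[$index]);
    } else {
        file_put_contents('../images/products/' . $fileNames[$index], $imageString);
    }
}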
I can't allow file_get_contents() to run for more than 1 second; if that is not possible, I need to skip to the next iteration.
for ($i = 0; $i <= 59; ++$i) {
    $f = file_get_contents('http://example.com');
    // if it finished within 1 second: do something and continue with the next iteration;
    // otherwise: skip file_get_contents(), do something else, and continue with the next iteration.
}
Is it possible to make a function like this?
Actually I'm using curl_multi, and I can't figure out how to set a timeout on a WHOLE curl_multi request.
If you are working with HTTP URLs only, you can do the following:
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 1
    )
));
for ($i = 0; $i <= 59; $i++) {
    file_get_contents("http://example.com/", 0, $ctx);
}
However, this is just the read timeout, meaning the time between two read operations (or the time before the first read operation). If data keeps arriving at a steady rate, there are no such gaps, and the download can still take as long as an hour.
If you want the whole download to take no more than a second, you can't use file_get_contents() anymore. I would encourage you to use cURL in this case, like this:
// create curl resource
$ch = curl_init();
for ($i = 0; $i < 59; $i++) {
    // set url
    curl_setopt($ch, CURLOPT_URL, "example.com");
    // set timeout for the whole request
    curl_setopt($ch, CURLOPT_TIMEOUT, 1);
    // return the transfer as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // $output contains the output string
    $output = curl_exec($ch);
}
// close curl resource to free up system resources
// (outside the loop, so the handle can be reused for every iteration)
curl_close($ch);
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 1
    )
));
file_get_contents("http://example.com/", 0, $ctx);
I'm trying to take a list of 20,000+ domain names and check if they are "alive". All I really need is a simple HTTP code check, but I can't figure out how to get that working with curl_multi. In a separate script I have the following function, which simultaneously checks a batch of 1000 domains and returns the JSON response code. Maybe this can be modified to just get the HTTP response code instead of the page content?
$dotNetRequests = array(/* ... list of domains ... */);
$NetcurlRequest = array();
// loop through the domains in chunks of 1000
foreach (array_chunk($dotNetRequests, 1000) as $Netrequests) {
    $results = checkDomains($Netrequests);
    $NetcurlRequest = array_merge($NetcurlRequest, $results);
}
function checkDomains($data)
{
    // array of curl handles
    $curly = array();
    // data to be returned
    $result = array();
    // multi handle
    $mh = curl_multi_init();
    // loop through $data and create curl handles,
    // then add them to the multi handle
    foreach ($data as $id => $d) {
        $curly[$id] = curl_init();
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        curl_setopt($curly[$id], CURLOPT_URL, $url);
        curl_setopt($curly[$id], CURLOPT_HEADER, 0);
        curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);
        // post?
        if (is_array($d)) {
            if (!empty($d['post'])) {
                curl_setopt($curly[$id], CURLOPT_POST, 1);
                curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
            }
        }
        curl_multi_add_handle($mh, $curly[$id]);
    }
    // execute the handles
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);
    // get content and remove handles
    foreach ($curly as $id => $c) {
        // $result[$id] = curl_multi_getcontent($c);
        // if($result[$id]) {
        if (curl_multi_getcontent($c)) {
            //echo "yes";
            $netName = $data[$id];
            $dName = str_replace(".net", ".com", $netName);
            $query = "UPDATE table1 SET dotnet = '1' WHERE Domain = '$dName'";
            mysql_query($query);
        }
        curl_multi_remove_handle($mh, $c);
    }
    // all done
    curl_multi_close($mh);
    return $result;
}
In any other language you would thread this kind of operation ...
https://github.com/krakjoe/pthreads
And you can in PHP too :)
I would suggest a few workers rather than 20,000 individual threads ... not that 20,000 threads is out of the realms of possibility - it isn't ... but that wouldn't be a good use of resources. I would do as you are now and have 20 workers getting the results of 1,000 domains each. I assume you don't need me to give an example of getting a response code; I'm sure curl would give it to you, but it's probably overkill to use curl here, since you don't require its threading capabilities: I would fsockopen port 80, fprintf "GET / HTTP/1.0\r\n\r\n", fgets the first line and close the connection. If you're going to be doing this all the time, I would also send "Connection: close" so that the receiving machines are not holding connections open unnecessarily.
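A minimal sketch of that fsockopen approach (assumptions: plain HTTP on port 80, a 5-second connect timeout, and a hypothetical checkDomain() helper; only the status line is read):
// Open port 80, send a minimal request, read only the status line, and disconnect.
function checkDomain($host)
{
    $fp = @fsockopen($host, 80, $errno, $errstr, 5); // 5-second connect timeout
    if (!$fp) {
        return false; // could not connect: treat the domain as "dead"
    }
    fwrite($fp, "GET / HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $statusLine = fgets($fp); // e.g. "HTTP/1.1 200 OK"
    fclose($fp);
    return $statusLine;
}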
This script works great for handling bulk simultaneous cURL requests using PHP.
I'm able to parse through 50k domains in just a few minutes using it!
https://github.com/petewarden/ParallelCurl/
I'm trying to find a way to just quickly access a file and then disconnect immediately.
So I've decided to use cURL, since it's the fastest option for me. But I can't figure out how I should "disconnect" cURL.
With the code below, Apache's access log says that the file I tried accessing was indeed accessed, but I'm feeling a little iffy about this, because when I just run the while loop without breaking out of it, it keeps looping. Shouldn't the loop stop when cURL has finished fetching the file? Or am I just being silly; is the loop just restarting constantly?
<?php
$Resource = curl_init();
curl_setopt($Resource, CURLOPT_URL, '...');
curl_setopt($Resource, CURLOPT_HEADER, 0);
curl_setopt($Resource, CURLOPT_USERAGENT, '...');
while (curl_exec($Resource)) {
    break;
}
curl_close($Resource);
?>
I tried setting the CURLOPT_CONNECTTIMEOUT_MS / CURLOPT_CONNECTTIMEOUT options to very small values, but it didn't help in this case.
Is there a more "proper" way of doing this?
This statement is superfluous:
while (curl_exec($Resource)) {
    break;
}
Instead just keep the return value for future reference:
$result = curl_exec($Resource);
The while loop does not accomplish anything. Now to your question: you can tell cURL that it should only take some bytes from the body and then quit. That can be achieved by reducing CURLOPT_BUFFERSIZE to a small value and using a callback function to tell cURL to stop:
$withCallback = array(
    CURLOPT_BUFFERSIZE => 20, # ~ value of bytes you'd like to get
    CURLOPT_WRITEFUNCTION => function($handle, $data) {
        echo "WRITE: (", strlen($data), ") $data\n";
        return 0;
    },
);
$handle = curl_init("http://stackoverflow.com/");
curl_setopt_array($handle, $withCallback);
curl_exec($handle);
curl_close($handle);
Output:
WRITE: (10) <!DOCTYPE
Another alternative is to make a HEAD request by using CURLOPT_NOBODY which will never fetch the body. But it's not a GET request.
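A minimal sketch of that CURLOPT_NOBODY variant (the URL is just an example); the status code can then be read with curl_getinfo():
// Send a HEAD request: only the headers are transferred, never the body.
$handle = curl_init("http://stackoverflow.com/");
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_exec($handle);
$status = curl_getinfo($handle, CURLINFO_HTTP_CODE); // e.g. 200
curl_close($handle);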
The connect timeout settings control how long cURL waits for the connection to be established. The connect phase lasts until the server accepts the connection and cURL knows that it has; it's not related to the phase in which cURL fetches data from the server. That phase is governed by:
CURLOPT_TIMEOUT: The maximum number of seconds to allow cURL functions to execute.
You will find a long list of available options in the PHP Manual: curl_setopt.
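A minimal sketch showing the two timeouts side by side (the values and URL are arbitrary):
$ch = curl_init("http://example.com/");
// Give up if the connection is not established within 2 seconds ...
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
// ... and abort the whole transfer if it takes longer than 5 seconds in total.
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
curl_close($ch);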
Perhaps that might be helpful?
$GLOBALS["dataread"] = 0;
define("MAX_DATA", 3000); // how many bytes should be read?
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.php.net/");
curl_setopt($ch, CURLOPT_WRITEFUNCTION, "handlewrite");
curl_exec($ch);
curl_close($ch);
function handlewrite($ch, $data)
{
$GLOBALS["dataread"] += strlen($data);
echo "READ " . strlen($data) . " bytes\n";
if ($GLOBALS["dataread"] > MAX_DATA) {
return 0;
}
return strlen($data);
}
My current code (see below) uses 147MB of virtual memory!
My provider allocates 100MB by default, and the process is killed as soon as the script runs, causing an internal error.
The code uses curl multi and must be able to run more than 150 iterations while still minimizing virtual memory use. The code below is set to only 150 iterations and still causes the internal server error; at 90 iterations the issue does not occur.
How can I adjust my code to lower the resource use / virtual memory?
Thanks!
<?php
function udate($format, $utimestamp = null) {
    if ($utimestamp === null)
        $utimestamp = microtime(true);
    $timestamp = floor($utimestamp);
    $milliseconds = round(($utimestamp - $timestamp) * 1000);
    return date(preg_replace('`(?<!\\\\)u`', $milliseconds, $format), $timestamp);
}

$url = 'https://www.testdomain.com/';
$curl_arr = array();
$master = curl_multi_init();

for ($i = 0; $i < 150; $i++) {
    $curl_arr[$i] = curl_init();
    curl_setopt($curl_arr[$i], CURLOPT_URL, $url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYHOST, FALSE);
    curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

do {
    curl_multi_exec($master, $running);
} while ($running > 0);

for ($i = 0; $i < 150; $i++) {
    $results = curl_multi_getcontent($curl_arr[$i]);
    $results = explode("<br>", $results);
    echo $results[0];
    echo "<br>";
    echo $results[1];
    echo "<br>";
    echo udate('H:i:s:u');
    echo "<br><br>";
    usleep(100000);
}
?>
As per your last comment:
Download RollingCurl.php.
Hopefully this will sufficiently spam the living daylights out of your API.
<?php
$url = '________';
$fetch_count = 150;
$window_size = 5;

require("RollingCurl.php");

function request_callback($response, $info, $request) {
    list($result0, $result1) = explode("<br>", $response);
    echo "{$result0}<br>{$result1}<br>";
    //print_r($info);
    //print_r($request);
    echo "<hr>";
}

$urls = array_fill(0, $fetch_count, $url);

$rc = new RollingCurl("request_callback");
$rc->window_size = $window_size;
foreach ($urls as $url) {
    $request = new RollingCurlRequest($url);
    $rc->add($request);
}
$rc->execute();
?>
Looking through your questions, I saw this comment:
"If the intention is domain snatching, then using one of the established services is a better option. Your script implementation is hardly as important as the actual connection and latency."
I agree with that comment.
Also, you seem to have posted the "same question" approximately seven hundred times:
https://stackoverflow.com/users/558865/icer
https://stackoverflow.com/users/516277/icer
How can I adjust the server to run my PHP script quicker?
How can I re-code my php script to run as quickly as possible?
How to run cURL once, checking domain availability in a loop? Help fixing code please
Help fixing php/api/curl code please
How to reduce virtual memory by optimising my PHP code?
Overlapping HTTPS requests?
Multiple https requests.. how to?
Doesn't the fact that you have to keep asking the same question over and over tell you that you're doing it wrong?
This comment of yours:
"#mario: Cheers. I'm competing against 2 other companies for specific ccTLD's. They are new to the game and they are snapping up those domains in slow time (up to 10 seconds after purge time). I'm just a little slower at the moment."
I'm fairly sure that PHP on a shared hosting account is the wrong tool to use if you are seriously trying to beat two companies at snapping up expired domain names.
The result of each of the 150 queries is stored in PHP memory, and by your evidence this is insufficient. The only conclusion is that you cannot keep 150 queries in memory. You must either stream to files instead of memory buffers, or reduce the number of queries and process the list of URLs in batches.
To use streams you must set CURLOPT_RETURNTRANSFER to 0 and implement a callback for CURLOPT_WRITEFUNCTION; there is an example in the PHP manual:
http://www.php.net/manual/en/function.curl-setopt.php#98491
function on_curl_write($ch, $data)
{
    global $fh;
    $bytes = fwrite($fh, $data, strlen($data));
    return $bytes;
}

curl_setopt($curl_arr[$i], CURLOPT_WRITEFUNCTION, 'on_curl_write');
Getting the correct file handle in the callback is left as a problem for the reader to solve.
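One possible way to solve that (a sketch, assuming one output file per request and PHP 5.3+ closures; the file names are illustrative):
// Give each handle its own output file via a closure that captures the file pointer.
$files = array();
for ($i = 0; $i < 150; $i++) {
    $fp = fopen("result_$i.txt", 'w'); // illustrative file name
    $files[$i] = $fp;                  // keep it so it can be fclose()d later
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 0);
    curl_setopt($curl_arr[$i], CURLOPT_WRITEFUNCTION, function ($ch, $data) use ($fp) {
        return fwrite($fp, $data); // returning the byte count keeps the transfer going
    });
}
// ... run curl_multi as before, then: foreach ($files as $fp) { fclose($fp); }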
<?php
echo str_repeat(' ', 1024); // to make flush work

$url = 'http://__________/';
$fetch_count = 15;
$delay = 100000; // 0.1 second
//$delay = 1000000; // 1 second

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);

for ($i = 0; $i < $fetch_count; $i++) {
    $start = microtime(true);
    $result = curl_exec($ch);
    list($result0, $result1) = explode("<br>", $result);
    echo "{$result0}<br>{$result1}<br>";
    flush();
    $end = microtime(true);
    // $delay is in microseconds, while microtime() differences are in seconds
    $sleeping = max(0, $delay - (int)(($end - $start) * 1000000));
    echo 'sleeping: ' . ($sleeping / 1000000) . ' seconds<hr />';
    usleep($sleeping);
}
curl_close($ch);
?>