I am looking for a cURL function which can open a particular number of webpages at a time; having no output (or a return value of false) would be even better. I need to access 5-10 URLs at the same time. I have heard about cURL multi threading but don't have a proper function or class to use it.
I found some by searching, but most of them seem to be loops, meaning they do not use concurrent connections, just one after another! I want something which can make multiple connections at a time, not one by one!
I made one :
function mutload($url){
if(!is_array($url)){
exit;
}
for($i=0;$i<count($url);$i++){
// create a cURL resource for each URL
$ch[$i] = curl_init();
// set URL and other appropriate options
curl_setopt($ch[$i], CURLOPT_URL, $url[$i]);
curl_setopt($ch[$i], CURLOPT_HEADER, 0);
curl_setopt($ch[$i], CURLOPT_RETURNTRANSFER, 0);
}
//create the multiple cURL handle
$mh = curl_multi_init();
for($i=0;$i<count($url);$i++){
//add each handle
curl_multi_add_handle($mh,$ch[$i]);
}
$active = null;
//execute the handles
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
if (curl_multi_select($mh) != -1) {
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
//close the handles
for($i=0;$i<count($url);$i++){
curl_multi_remove_handle($mh, $ch[$i]);
}
curl_multi_close($mh);
}
OK! But I am confused: will it connect to all the URLs at a time, or one by one? Moreover, I am getting the content as well; I only want to connect or send a request to the site, I do not need any content from there. I set CURLOPT_RETURNTRANSFER to false but that didn't work. Please help me, thanks!
You're looking for the curl_multi_* family of functions. Have a look at curl_multi_exec.
Set CURLOPT_NOBODY to prevent curl from downloading any content.
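For example, a minimal sketch of setting it on each easy handle before the handle goes into the multi handle (the URL is illustrative):
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_NOBODY, true); // issue a HEAD request, so no body is transferred
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // and keep any response data off stdout
curl_multi_add_handle($mh, $ch);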
I didn't test your code, but curl_multi adds items to a queue from a loop and processes them in parallel. Sometimes there can be issues if you are trying to load hundreds of URLs, but it should be fine for a few. If you have long DNS lookups or slow servers, all your results will have to wait for the slowest request.
This code is tested and should work; it is somewhat similar to yours:
http://www.onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blocking/
I'm using curl_multi to process multiple API requests in parallel.
However, I've noticed there is a lot of fluctuation in the time it takes to complete the requests.
Is this related to the speed of the APIs themselves, or the timeout I set on curl_multi_select? Right now it is 0.05. Should it be less? How can I know this process is finishing the requests as fast as possible without wasted time in between checks to see if they're done?
<?php
// Build the multi-curl handle, adding each curl handle
$handles = array(/* Many curl handles*/);
$mh = curl_multi_init();
foreach($handles as $curl){
curl_multi_add_handle($mh, $curl);
}
$running = null;
do {
curl_multi_exec($mh, $running);
curl_multi_select($mh, 0.05); // Should this value be less than 0.05?
} while ($running > 0);
// Close the handles
foreach($handles as $curl){
curl_multi_remove_handle($mh, $curl);
}
curl_multi_close($mh);
?>
The current implementation of curl_multi_select() in PHP doesn't block and doesn't respect the timeout parameter; maybe it will be fixed later. The proper way of waiting is not implemented in your code: it has to be two loops. I will post some tested code from my bot as an example:
$running = 1;
while ($running)
{
# execute request
if ($a = curl_multi_exec($this->murl, $running)) {
throw BotError::text("curl_multi_exec[$a]: ".curl_multi_strerror($a));
}
# check finished
if (!$running) {
break;
}
# wait for activity
while (!$a)
{
if (($a = curl_multi_select($this->murl, $wait)) < 0)
{
throw BotError::text(
($a = curl_multi_errno($this->murl))
? "curl_multi_select[$a]: ".curl_multi_strerror($a)
: 'system select failed'
);
}
usleep($wait * 1000000);# wait for some time <1sec
}
}
Doing
$running = null;
for(;;){
curl_multi_exec($mh, $running);
if($running <1){
break;
}
curl_multi_select($mh, 1);
}
should be better; then you'll avoid a useless select() when nothing is running.
I have a script which takes a text file, reads the links in it, and reports whether my website's backlink is there or not. The problem is that it is very slow, and I want to increase its speed. Is there any way to increase its speed?
<?php
ini_set('max_execution_time', 3000);
$source = file_get_contents("your-backlinks.txt");
$needle = "http://www.submitage.com"; //without http as I have imploded the http later in the script
$new = explode("\n",$source);
foreach ($new as $check) {
$a = file_get_contents(trim($check));
if (strpos($a,$needle)) {
$found[] = $check;
} else {
$notfound[] = $check;
}
}
echo "Matches that were found: \n ".implode("\n",$found)."\n";
echo "Matches that were not found \n". implode("\n",$notfound);
?>
Your biggest bottleneck is the fact that you are executing the HTTP requests in sequence, not in parallel. curl is able to perform multiple requests in parallel. Here's an example from the documentation, heavily adapted to use a loop and actually collect the results. I cannot promise it's correct; I only promise I've followed the documentation correctly:
$mh = curl_multi_init();
$handles = array();
foreach($new as $check){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $check);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // required for curl_multi_getcontent() below
curl_multi_add_handle($mh,$ch);
$handles[$check]=$ch;
}
// verbatim from the demo
$active = null;
//execute the handles
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
if (curl_multi_select($mh) != -1) {
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
// end of verbatim code
foreach($handles as $check => $ch){
$a = curl_multi_getcontent($ch);
...
}
You won't be able to squeeze any more speed out of the operation by optimizing the PHP, except maybe some faux-multithreading solution.
However, you could create a queue system that would allow you to run the check as a background task. Instead of checking the URLs as you iterate through them, add them to the queue instead. Then write a cron script that grabs unchecked URLs from the queue one by one, checks if they contain a reference to your domain and saves the result.
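A rough sketch of what that cron worker could look like; the url_queue table, its columns, and the PDO wiring are all hypothetical, just to illustrate the shape of the approach:

<?php
// Hypothetical cron worker: claim one unchecked URL per run,
// check it for the backlink, and store the result.
$pdo    = new PDO('mysql:host=localhost;dbname=checker', 'user', 'pass');
$needle = 'http://www.submitage.com';

$row = $pdo->query("SELECT id, url FROM url_queue WHERE status = 'unchecked' LIMIT 1")
           ->fetch(PDO::FETCH_ASSOC);
if ($row) {
    $body   = @file_get_contents(trim($row['url']));
    $status = ($body !== false && strpos($body, $needle) !== false) ? 'found' : 'notfound';
    $stmt = $pdo->prepare("UPDATE url_queue SET status = ? WHERE id = ?");
    $stmt->execute(array($status, $row['id']));
}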
I'm trying to take a list of 20,000+ domain names and check if they are "alive". All I really need is a simple HTTP code check, but I can't figure out how to get that working with curl_multi. In a separate script I have the following function, which simultaneously checks a batch of 1000 domains and returns the JSON response code. Maybe this can be modified to just get the HTTP response code instead of the page content?
$dotNetRequests = array(/* ... list of domains ... */);
$NetcurlRequest = array();
//loop through the domains in batches of 1000
foreach(array_chunk($dotNetRequests, 1000) as $Netrequests) {
$results = checkDomains($Netrequests);
$NetcurlRequest = array_merge($NetcurlRequest, $results);
}
function checkDomains($data) {
// array of curl handles
$curly = array();
// data to be returned
$result = array();
// multi handle
$mh = curl_multi_init();
// loop through $data and create curl handles
// then add them to the multi-handle
foreach ($data as $id => $d) {
$curly[$id] = curl_init();
$url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
curl_setopt($curly[$id], CURLOPT_URL, $url);
curl_setopt($curly[$id], CURLOPT_HEADER, 0);
curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);
// post?
if (is_array($d)) {
if (!empty($d['post'])) {
curl_setopt($curly[$id], CURLOPT_POST, 1);
curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
}
}
curl_multi_add_handle($mh, $curly[$id]);
}
// execute the handles
$running = null;
do {
curl_multi_exec($mh, $running);
} while($running > 0);
// get content and remove handles
foreach($curly as $id => $c) {
// $result[$id] = curl_multi_getcontent($c);
// if($result[$id]) {
if (curl_multi_getcontent($c)){
//echo "yes";
$netName = $data[$id];
$dName = str_replace(".net", ".com", $netName);
$query = "Update table1 SET dotnet = '1' WHERE Domain = '$dName'";
mysql_query($query);
}
curl_multi_remove_handle($mh, $c);
}
// all done
curl_multi_close($mh);
return $result;
}
In any other language you would thread this kind of operation ...
https://github.com/krakjoe/pthreads
And you can in PHP too :)
I would suggest a few workers rather than 20,000 individual threads ... not that 20,000 threads is out of the realms of possibility - it isn't ... but that wouldn't be a good use of resources. I would do as you are now and have 20 workers getting the results of 1,000 domains each. I assume you don't need me to give an example of getting a response code; I'm sure curl would give it to you, but it's probably overkill to use curl since you do not require its threading capabilities. I would fsockopen port 80, fprintf a request line like GET / HTTP/1.0 followed by a blank line, fgets the first line and close the connection. If you're going to be doing this all the time, I would also use Connection: close so that the receiving machines are not holding connections unnecessarily.
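A minimal sketch of that socket check, with the function name, the 5-second timeout and the example host as my own illustrative assumptions:

<?php
// Sketch of the socket-based status check described above.
function checkAlive($host)
{
    $sock = @fsockopen($host, 80, $errno, $errstr, 5);
    if (!$sock) {
        return false; // connection failed: treat the domain as dead
    }
    // Minimal request; Connection: close tells the server not to hold the socket.
    fwrite($sock, "GET / HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $status = fgets($sock); // first line only, e.g. "HTTP/1.0 200 OK"
    fclose($sock);
    return ($status !== false && preg_match('#^HTTP/\S+ (\d{3})#', $status, $m))
        ? (int) $m[1]
        : false;
}

var_dump(checkAlive('example.com')); // int(200) if the host answers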
This script works great for handling bulk simultaneous cURL requests using PHP.
I'm able to parse through 50k domains in just a few minutes using it!
https://github.com/petewarden/ParallelCurl/
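Usage follows this general shape (adapted from the project's README; the URL list and the CURLOPT_NOBODY status-only option are my own illustrative choices, so double-check against the current README):

<?php
require_once 'parallelcurl.php'; // from the ParallelCurl repository

// ParallelCurl calls this back as each transfer finishes.
function on_request_done($content, $url, $ch, $user_data) {
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    echo "$url -> $code\n";
}

// Run at most 10 transfers at once; the options array applies to every handle.
$parallel_curl = new ParallelCurl(10, array(
    CURLOPT_NOBODY => true, // status check only, skip the body
    CURLOPT_FOLLOWLOCATION => true,
));

foreach (array('http://example.com', 'http://example.org') as $url) {
    $parallel_curl->startRequest($url, 'on_request_done', null);
}

$parallel_curl->finishAllRequests(); // blocks until everything completes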
When I run the code below, it seems to me that curl_multi_select and curl_multi_info_read are contradicting each other. As I understand it, curl_multi_select is supposed to block until curl_multi_exec has a response, but I haven't seen that actually happen.
$url = "http://google.com";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_VERBOSE, TRUE);
$mc = curl_multi_init();
curl_multi_add_handle($mc, $ch);
do {
$exec = curl_multi_exec($mc, $running);
} while ($exec == CURLM_CALL_MULTI_PERFORM);
$ready=curl_multi_select($mc, 100);
var_dump($ready);
$info = curl_multi_info_read($mc,$msgs);
var_dump($info);
This returns:
int 1
boolean false
which seems to contradict itself. How can it be ready and not have any messages?
The PHP version I'm using is 5.3.9.
Basically curl_multi_select blocks until there is something to read or send with curl_multi_exec. If you loop around curl_multi_exec without using curl_multi_select this will eat up 100% of a CPU core.
So curl_multi_info_read is used to check if any transfer has ended (correctly or with an error).
Code using the multi handle should follow the following pattern:
do
{
$mrc = curl_multi_exec($this->mh, $active);
}
while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK)
{
curl_multi_select($this->mh);
do
{
$mrc = curl_multi_exec($this->mh, $active);
}
while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($info = curl_multi_info_read($this->mh))
{
$this->process_ch($info);
}
}
See also: Doing curl_multi_exec the right way.
From the spec:
Ask the multi handle if there are any messages or information from the individual transfers. Messages may include information such as an error code from the transfer or just the fact that a transfer is completed.
The 1 could mean there is activity, but not necessarily a message waiting: in this case probably that some of your download data is available, but not all. The example in the curl_multi_select doc explicitly tests for false values back from curl_multi_info_read.
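In practice that means treating false from curl_multi_info_read as "no completed transfer yet" rather than as an error, and draining the messages in a loop once transfers finish. A short sketch against the question's $mc handle:

// false just means no finished transfer is waiting yet; keep running
// curl_multi_exec/curl_multi_select and drain the queue when transfers end.
while ($info = curl_multi_info_read($mc)) {
    if ($info['msg'] === CURLMSG_DONE) {
        $code = curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
        echo "transfer finished with HTTP $code\n";
    }
}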
So the original problem is that we run an "industry standard" Java-based web application on WebSphere App Servers, with around 100 million visits per year. The issue is that after a restart of these app servers, we need to hit a few of the key pages so that the main servlets get compiled before we let the public onto them ... otherwise they tend to crash in the initial crush.
On some clusters it's about 6 pages that need to be hit, once for each of 35+ markets ... 200-ish URLs!
The script I am working on has all the hard work done of putting together all these URLs, and at the end of it all is a list of 200 URLs in an array ... now, how to hit them?
We were using CGI for this earlier, and its main problem was that it was synchronous ... taking a looooong time. Now I am trying to make a simple url.php which will hit one single URL, which I can then call from jQuery in an asynchronous way. I don't want to hit all 200 at once, of course; batches of 5 should mean roughly a 500% speed increase :)
So, on to url.php. I haven't used PHP much in the past, so sockets are a bit new to me. What I have cobbled together so far is this:
function checkUrl($url,$port) {
set_time_limit(20);
ob_start();
header("Content-Type: text/plain");
$u = $url;
$p = $port;
$post = "HEAD / HTTP/1.1\r\n";
$post .= "Host: $u\r\n";
$post .= "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.2) Gecko/20060308 Firefox/1.5.0.2\r\n";
$post .= "Keep-Alive: 200\r\n";
$post .= "Connection: keep-alive\r\n\r\n";
$sock = fsockopen($u, $p, $errno, $errstr, 10);
if (!$sock) {
echo "$errstr ($errno)<br />\n";
} else {
fwrite($sock, $post, strlen($post));
while (!feof($sock)){
echo fgets($sock);
}
ob_end_flush();
}
}
This works great if the URL is simply someserver.somedomain.com, but if there is a URI path tacked on the end (e.g. someserver.somedomain.com/gb/en) it fails.
As I understand it, all I have done with the code so far is open the socket connection ... but how can I get it to handle the path separately?
The only output I need from this in the end is the HTTP status code (200, 404, 301, etc.), though it is important that it does fetch the complete page first in order for it to be compiled properly.
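I'm guessing something along the lines of parse_url is needed to split the host from the path before opening the socket; a rough, untested sketch of what I mean (the URL is illustrative):

$full  = 'http://someserver.somedomain.com/gb/en'; // illustrative URL
$parts = parse_url($full);
$host  = $parts['host'];
$path  = isset($parts['path']) ? $parts['path'] : '/';

$post  = "HEAD $path HTTP/1.1\r\n"; // the path goes on the request line
$post .= "Host: $host\r\n";
$post .= "Connection: close\r\n\r\n";
// then fsockopen($host, $port, ...) and fwrite($sock, $post) as before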
Maybe I'm missing something, but do you have the curl extension available? No need to get jQuery in the mix; you can run asynchronous queries straight from PHP with ease. You'll also be able to control batch size easily, and put in delays and whatnot per your needs. Also, I'm not sure why you would need to use a raw socket to hit the JSP pages. Hopefully this makes your life easier!
Here's a quick test script I have, based on code from php.net I'm sure:
<?php
// create both cURL resources
$ch1 = curl_init();
$ch2 = curl_init();
// set URL and other appropriate options
curl_setopt($ch1, CURLOPT_URL, "http://news.php.net/php.general/255000");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "http://news.php.net/php.general/255001");
curl_setopt($ch2, CURLOPT_HEADER, 0);
//create the multiple cURL handle
$mh = curl_multi_init();
//add the two handles
curl_multi_add_handle($mh,$ch1);
curl_multi_add_handle($mh,$ch2);
$active = null;
//execute the handles
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
if (curl_multi_select($mh) != -1) {
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
//close the handles
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
?>