How does cURL keep-alive work? - php

I've been working on this and haven't been able to understand it fully.
I have this code:
<?php
function get2($url) {
    // Create a handle.
    $handle = curl_init($url);
    // Set options...
    // Do the request.
    $ret = curlExecWithMulti($handle);
    // Do stuff with the results...
    // Destroy the handle.
    curl_close($handle);
}

function curlExecWithMulti($handle) {
    // In real life this is a class variable.
    static $multi = NULL;
    // Create a multi if necessary.
    if (empty($multi)) {
        $multi = curl_multi_init();
    }
    // Add the handle to be processed.
    curl_multi_add_handle($multi, $handle);
    // Do all the processing.
    $active = NULL;
    do {
        $ret = curl_multi_exec($multi, $active);
    } while ($ret == CURLM_CALL_MULTI_PERFORM);
    while ($active && $ret == CURLM_OK) {
        if (curl_multi_select($multi) != -1) {
            do {
                $mrc = curl_multi_exec($multi, $active);
            } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        }
    }
    // Remove the handle from the multi processor.
    curl_multi_remove_handle($multi, $handle);
    return TRUE;
}
?>
The above script does this: when I run the PHP, it creates a new TCP connection, returns the data, and then closes the connection.
The server is using HTTP/1.1 with Connection: keep-alive.
What I want is this: when I run the script it creates the connection, returns the data and does NOT close the connection, and when I run the script again it reuses that same connection (assuming the connection hasn't expired due to the server's timeout).
Is that possible with cURL? Am I misunderstanding how multi works in cURL?

When a program exits, all of its open sockets (indeed, all open files) are closed. There is no way to reuse a connection from one instance to another(*). You must re-open a new connection or loop within your application.
If you want to use HTTP Keep-Alive, your program must not exit.
(*) There are ways to keep a socket open inside one process and pass it to others via Unix domain sockets but that is an advanced topic I recommend against; I mention it only for completeness.
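To make that concrete, here is a minimal sketch (placeholder URLs, not the poster's code): as long as the same easy handle stays alive inside one running process, libcurl keeps the TCP connection in its pool and reuses it for subsequent requests to the same host.
<?php
// Minimal sketch: reuse one cURL handle within a single process.
// CURLINFO_NUM_CONNECTS reports how many new connections the last transfer
// opened; 0 means an existing pooled connection was reused.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

foreach (array('http://example.com/a', 'http://example.com/b') as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_exec($ch);
    echo $url, ' -> new connections: ', curl_getinfo($ch, CURLINFO_NUM_CONNECTS), "\n";
}

curl_close($ch); // the pooled connection is gone no later than process exit
?>
Within one run, the second request will typically report 0 new connections; two separate runs of the script will each report 1, which is exactly the limitation described above.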

Related

What are the best options for CURLOPT_ENCODING in PHP?

I am facing an issue when fetching data from a big file containing many web links.
In my code I use:
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
I used it to speed up the cURL requests.
I tested my code on a few links and it works fine, but after using the big file I get bad results.
As I understand it, this option compresses the data while retrieving it, which is why I used it to speed up the cURL requests.
Could it cause errors in the retrieved data, for example when a site does not use gzip?
My code is also using fork with no sleep time; maybe the problem comes from the fork?
$pid = @pcntl_fork();
$execute++;
if ($execute >= $maxproc)
{
    while (pcntl_waitpid(0, $status) != -1)
    {
        $status = pcntl_wexitstatus($status);
        $execute = 0;
        //usleep(250000);
        //sleep(1);
        //echo " [$ipkey] Child $status completed\n";
    }
}
if (!$pid)
{
    // child process: do the cURL work for this chunk, then exit
    exit;
}
Does anyone have an idea where the problem is?
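For what it's worth, a common alternative (a sketch with a placeholder URL, not the poster's code) is to pass an empty string to CURLOPT_ENCODING: libcurl then advertises every encoding it supports and decodes the response itself, so a site that does not serve gzip simply replies uncompressed and nothing should be corrupted on that account.
$ch = curl_init('http://example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, ''); // '' = accept any encoding libcurl understands
$body = curl_exec($ch);
if ($body === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
}
curl_close($ch);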

PHP multi-cURL works for a while, then gives a 500 error on GoDaddy shared hosting

Alright, so I have a script that uses a rolling curl for multiple requests. For the past two days it works a couple of times, but then eventually it starts returning a 500 error. I try a regular curl_init and that works, so I know I'm getting a connection to the site I'm connecting to. And if I wait until tomorrow, it will work again. So I imagine there is some leak or something going on. I can't figure out how to check 500 errors on GoDaddy. But is there a way I can check cURL to see what the issue is, or stop it from looping infinitely when it's not connecting? And just to clarify: the very same script works at first, but after a certain number of tries it stops.
while (($execrun = curl_multi_exec($this->multi_handle, $running)) ==
CURLM_CALL_MULTI_PERFORM) {
;
}
if ($execrun != CURLM_OK) {
break;
}
//not entering this loop
while ($info = curl_multi_info_read($this->multi_handle)) {
}
Edit: I'm doing about 30 requests at a time. It works very fast for the first 4-5 times I load the script, but then eventually it just stops and I'll have to wait until the next day. Here's the while loop where the only occurrences of closing the handles are. I've never run into this problem before and have no clue where to look. It just won't connect to the website and enter this loop.
while ($info = curl_multi_info_read($this->multi_handle)) {
$ch = $info['handle'];
$ch_array_key = (int)$ch;
if (!isset($this->outstanding_requests[$ch_array_key])) {
die("Error - handle wasn't found in requests: '$ch' in ".
print_r($this->outstanding_requests, true));
}
$request = $this->outstanding_requests[$ch_array_key];
$url = $request['url'];
$content = curl_multi_getcontent($ch);
$callback = $this->curl['CALLBACK'];
$user_data = $request['user_data'];
if(curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200)
{
call_user_func($callback, $content, $url, $ch, $user_data, $user_data);
unset($this->outstanding_requests[$ch_array_key]);
curl_multi_remove_handle($this->multi_handle, $ch);
curl_close($ch);
}
else
{
//unset the outstanding request so it doesn't get stuck in a loop
unset($this->outstanding_requests[$ch_array_key]);
curl_multi_remove_handle($this->multi_handle, $ch);
curl_close($ch);
//these come back as 0's, so not found. Restart the request
self::startRequest($url);
}
//self::msg('USER END');
}
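One way to narrow this down (a diagnostic sketch, not part of the original class; the property name just mirrors the excerpt above) is to log the per-handle result inside the curl_multi_info_read() loop, so a cURL-level failure or the 500 itself ends up somewhere readable:
while ($info = curl_multi_info_read($this->multi_handle)) {
    $ch   = $info['handle'];
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($info['result'] !== CURLE_OK) {
        // DNS, connect or timeout failures never reach the 200 branch
        error_log('cURL error ' . $info['result'] . ': ' . curl_error($ch));
    } elseif ($code != 200) {
        error_log('HTTP ' . $code . ' from ' . curl_getinfo($ch, CURLINFO_EFFECTIVE_URL));
    }
    // ...then hand off to the existing success/retry logic...
}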

multi-thread, multi-curl crawler in PHP

Hi everyone once again!
We need some help to develop and implement multi-curl functionality in our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.
Let's use some pseudo code to understand the logic:
1) While ($links_to_be_scanned > 0).
2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
3) Scan_the_link() and run some other functions.
4) Extract the new links from the xdom.
5) Push the new links into $links_to_be_scanned.
6) Push the current link into $links_already_scanned.
7) Remove the current link from $links_to_be_scanned.
Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.
I understand that we're gonna have to create a $links_being_scanned or some kind of queue.
I'm really not sure how to approach this problem to be honest, if anyone could provide some snippet or idea to solve it, it would be greatly appreciated.
Thanks in advance!
Chris;
Extended:
I just realized that the multi-curl itself is not the tricky part, but the amount of operations done with each link after the request.
Even with multi-curl, I would eventually have to find a way to run all these operations in parallel. The whole algorithm described below would have to run in parallel.
So now rethinking, we would have to do something like this:
While (There's links to be scanned)
Foreach ($links_to_be_scanned as $link)
If (There's less than 10 scanners running)
Launch_a_new_scanner($link)
Remove the link from $links_to_be_scanned array
Push the link into $links_on_queue array
Endif;
And each scanner does (This should be run in parallel):
Create an object with the given link
Send a curl request to the given link
Create a dom and an Xdom with the response body
Perform other operations over the response body
Remove the link from the $links_on_queue array
Push the link into the $links_already_scanned array
I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process?
Since even using multi-curl, I would eventually have to wait in a regular foreach loop for the other processes.
I assume I would have to approach this using fsockopen or pcntl_fork.
Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!
Thanks a lot!
DISCLAIMER: This answer links to an open-source project with which I'm involved. There. You've been warned.
The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.
Limiting the number of concurrent connections is easily accomplished. Consider:
<?php
use Artax\Client, Artax\Response;
require dirname(__DIR__) . '/autoload.php';
$client = new Client;
// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);
$requests = array(
'so-home' => 'http://stackoverflow.com',
'so-php' => 'http://stackoverflow.com/questions/tagged/php',
'so-python' => 'http://stackoverflow.com/questions/tagged/python',
'so-http' => 'http://stackoverflow.com/questions/tagged/http',
'so-html' => 'http://stackoverflow.com/questions/tagged/html',
'so-css' => 'http://stackoverflow.com/questions/tagged/css',
'so-js' => 'http://stackoverflow.com/questions/tagged/javascript'
);
$onResponse = function($requestKey, Response $r) {
echo $requestKey, ' :: ', $r->getStatus();
};
$onError = function($requestKey, Exception $e) {
echo $requestKey, ' :: ', $e->getMessage();
};
$client->requestMulti($requests, $onResponse, $onError);
IMPORTANT: In the above example the Client::requestMulti method makes all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client will open new connections for the first two requests and subsequently reuse those same sockets for the remaining requests, queuing requests until one of the two sockets becomes available.
You could try something like this. I haven't checked it, but you should get the idea:
$request_pool = array();
function CreateHandle($url) {
$handle = curl_init($url);
// set curl options here
return $handle;
}
function Process($data) {
global $request_pool;
// do something with data
array_push($request_pool , CreateHandle($some_new_url));
}
function RunMulti() {
global $request_pool;
$multi_handle = curl_multi_init();
$active_request_pool = array();
$running = 0;
$active_request_count = 0;
$active_request_max = 10; // adjust as necessary
do {
$waiting_request_count = count($request_pool);
while(($active_request_count < $active_request_max) && ($waiting_request_count > 0)) {
$request = array_shift($request_pool);
curl_multi_add_handle($multi_handle , $request);
$active_request_pool[(int)$request] = $request;
$waiting_request_count--;
$active_request_count++;
}
curl_multi_exec($multi_handle , $running);
curl_multi_select($multi_handle);
while($info = curl_multi_info_read($multi_handle)) {
$curl_handle = $info['handle'];
call_user_func('Process' , curl_multi_getcontent($curl_handle));
curl_multi_remove_handle($multi_handle , $curl_handle);
curl_close($curl_handle);
$active_request_count--;
}
} while($active_request_count > 0 || $waiting_request_count > 0);
curl_multi_close($multi_handle);
}
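Hypothetical usage of the sketch above: seed the pool with a starting URL (the URL below is a placeholder), then let RunMulti() drain it while Process() keeps pushing new handles in.
array_push($request_pool, CreateHandle('http://example.com/')); // placeholder starting URL
RunMulti();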
You should look for a more robust solution to your problem. RabbitMQ
is a very good solution that I have used. There is also Gearman, but the choice is yours;
I prefer RabbitMQ.
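To illustrate the queue idea, here is a rough sketch using RabbitMQ through the php-amqplib client (the client choice, queue name and credentials are assumptions for illustration, not something this answer prescribes). One producer pushes links, and you run as many worker processes like this one as you want parallel scanners.
<?php
// Rough sketch: queue name, credentials and the php-amqplib client are assumptions.
require_once __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();
$channel->queue_declare('links_to_scan', false, true, false, false);

// Producer side: push a link onto the queue.
$channel->basic_publish(new AMQPMessage('http://example.com/'), '', 'links_to_scan');

// Worker side: take one link at a time, scan it, publish any new links found.
$channel->basic_qos(null, 1, null);
$channel->basic_consume('links_to_scan', '', false, false, false, false, function ($msg) {
    $url = $msg->body;
    // ...fetch $url with cURL, parse it, basic_publish() the newly found links...
    $msg->delivery_info['channel']->basic_ack($msg->delivery_info['delivery_tag']);
});

while (count($channel->callbacks)) {
    $channel->wait();
}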
I will share with you my code, which I have used to collect email addresses from a certain website.
You can modify it to fit your needs.
There were some problems with relative URLs there.
And I do not use cURL here.
<?php
error_reporting(E_ALL);
$home = 'http://kharkov-reklama.com.ua/jborudovanie/';
$writer = new RWriter('C:\parser_13-09-2012_05.txt');
set_time_limit(0);
ini_set('memory_limit', '512M');
function scan_page($home, $full_url, &$writer) {
static $done = array();
$done[] = $full_url;
// Scan only internal links. Do not scan all the internet!))
if (strpos($full_url, $home) === false) {
return false;
}
$html = @file_get_contents($full_url);
if (empty($html) || (strpos($html, '<body') === false && strpos($html, '<BODY') === false)) {
return false;
}
echo $full_url . '<br />';
preg_match_all('/([A-Za-z0-9_\-]+\.)*[A-Za-z0-9_\-]+@([A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]\.)+[A-Za-z]{2,4}/', $html, $emails);
if (!empty($emails) && is_array($emails)) {
foreach ($emails as $email_group) {
if (is_array($email_group)) {
foreach ($email_group as $email) {
if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
$writer->write($email);
}
}
}
}
}
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);
if (is_array($matches)) {
foreach($matches as $match) {
if (!empty($match[2]) && is_scalar($match[2])) {
$url = $match[2];
if (!filter_var($url, FILTER_VALIDATE_URL)) {
$url = $home . $url;
}
if (!in_array($url, $done)) {
scan_page($home, $url, $writer);
}
}
}
}
}
class RWriter {
private $_fh = null;
private $_written = array();
public function __construct($fname) {
$this->_fh = fopen($fname, 'w+');
}
public function write($line) {
if (in_array($line, $this->_written)) {
return;
}
$this->_written[] = $line;
echo $line . '<br />';
fwrite($this->_fh, "{$line}\r\n");
}
public function __destruct() {
fclose($this->_fh);
}
}
scan_page($home, 'http://kharkov-reklama.com.ua/jborudovanie/', $writer);

mysql connection from daemon written in php

I have written a daemon to fetch some stuff from MySQL and make some cURL requests based on the info from MySQL. Since I'm fluent in PHP, I've written this daemon in PHP using System_Daemon from PEAR.
This works fine, but I'm curious about the best approach for connecting to MySQL. It feels weird to create a new MySQL connection every couple of seconds; should I try a persistent connection? Any other input? Keeping potential memory leaks to a minimum is of the essence...
I cleaned up the script, attached below. I removed the MySQL stuff for now and am using a dummy array to keep this unbiased:
#!/usr/bin/php -q
<?php
require_once "System/Daemon.php";
System_Daemon::setOption("appName", "smsq");
System_Daemon::start();
$runningOkay = true;
while(!System_Daemon::isDying() && $runningOkay){
$runningOkay = true;
if (!$runningOkay) {
System_Daemon::err('smsq() produced an error, '.
'so this will be my last run');
}
$messages = get_outgoing();
$messages = call_api($messages);
#print_r($messages);
System_Daemon::iterate(2);
}
System_Daemon::stop();
function get_outgoing(){ # get 10 rows from a mysql table
# dummycode, this should come from mysql
for($i=0;$i<5;$i++){
$message->msisdn = '070910507'.$i;
$message->text = 'nr'.$i;
$messages[] = $message;
unset($message);
}
return $messages;
}
function call_api($messages=array()){
if(count($messages)<=0){
return false;
}else{
foreach($messages as $message){
$message->curlhandle = curl_init();
curl_setopt($message->curlhandle,CURLOPT_URL,'http://yadayada.com/date.php?text='.$message->text);
curl_setopt($message->curlhandle,CURLOPT_HEADER,0);
curl_setopt($message->curlhandle,CURLOPT_RETURNTRANSFER,1);
}
$mh = curl_multi_init();
foreach($messages as $message){
curl_multi_add_handle($mh,$message->curlhandle);
}
$running = null;
do{
curl_multi_exec($mh,$running);
}while($running > 0);
foreach($messages as $message){
$message->api_response = curl_multi_getcontent($message->curlhandle);
curl_multi_remove_handle($mh,$message->curlhandle);
unset($message->curlhandle);
}
curl_multi_close($mh);
}
return $messages;
}
Technically, if it's a daemon, it runs in the background and doesn't stop until you ask it to. There is no need to use a persistent connection in that case, and you probably shouldn't use one. I'd expect the connection to close when I kill the daemon.
Basically, you should open a connection on startup and close it on shutdown, and that's about it. However, you should put some error trapping in there in case the connection drops unexpectedly while the daemon is running, so that it either shuts down gracefully (logging the connection drop somewhere) or retries the connection later.
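A rough sketch of that "connect once, reconnect on failure" idea, grafted onto the poster's loop (the mysqli credentials and database are placeholders, and mysqli_ping() is just one way to notice a dropped connection):
#!/usr/bin/php -q
<?php
// Sketch only: open the connection once at startup, reconnect if it drops,
// close once at shutdown. Credentials/database below are placeholders.
require_once "System/Daemon.php";

System_Daemon::setOption("appName", "smsq");
System_Daemon::start();

$db = mysqli_connect('localhost', 'user', 'pass', 'smsq'); // open once at startup

while (!System_Daemon::isDying()) {
    // if the server closed the connection while we slept, reconnect
    if (!$db || !mysqli_ping($db)) {
        System_Daemon::err('MySQL connection lost, reconnecting');
        $db = mysqli_connect('localhost', 'user', 'pass', 'smsq');
    }
    if ($db) {
        // ...fetch the outgoing rows here and fire the cURL requests...
    }
    System_Daemon::iterate(2);
}

if ($db) {
    mysqli_close($db); // close once at shutdown
}
System_Daemon::stop();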
Maybe just add mysql_pconnect before the while statement, but I don't know anything about PHP daemons...

How can I use cURL to open multiple URLs simultaneously with PHP?

Here is my current code:
$SQL = mysql_query("SELECT url FROM urls") or die(mysql_error()); //Query the urls table
while($resultSet = mysql_fetch_array($SQL)){ //Put all the urls into one variable
// Now for some cURL to run it.
$ch = curl_init($resultSet['url']); //load the urls
curl_setopt($ch, CURLOPT_TIMEOUT, 2); //No need to wait for it to load. Execute it and go.
curl_exec($ch); //Execute
curl_close($ch); //Close it off
} //While loop
I'm relatively new to cURL. By relatively new, I mean this is my first time using cURL. Currently it loads one URL for two seconds, then loads the next one for two seconds, then the next. However, I want to make it load ALL of them at the same time. I'm sure it's possible, I'm just unsure how. If someone could point me in the right direction, I'd appreciate it.
You set up each cURL handle in the same way, then add them to a curl_multi handle. The functions to look at are the curl_multi_* functions documented here. In my experience, though, there were issues with trying to load too many URLs at once (though I can't find my notes on it at the moment), so the last time I used curl_multi_, I set it up to do batches of 5 URLs at a time.
Edit: Here is a reduced version of the code I have using curl_multi_:
Edit: Slightly rewritten with lots of added comments, which hopefully will help.
// -- create all the individual cURL handles and set their options
$curl_handles = array();
foreach ($urls as $url) {
$curl_handles[$url] = curl_init();
curl_setopt($curl_handles[$url], CURLOPT_URL, $url);
// set other curl options here
}
// -- start going through the cURL handles and running them
$curl_multi_handle = curl_multi_init();
$i = 0; // count where we are in the list so we can break up the runs into smaller blocks
$block = array(); // to accumulate the curl_handles for each group we'll run simultaneously
foreach ($curl_handles as $a_curl_handle) {
$i++; // increment the position-counter
// add the handle to the curl_multi_handle and to our tracking "block"
curl_multi_add_handle($curl_multi_handle, $a_curl_handle);
$block[] = $a_curl_handle;
// -- check to see if we've got a "full block" to run or if we're at the end of our list of handles
if (($i % BLOCK_SIZE == 0) or ($i == count($curl_handles))) {
// -- run the block
$running = NULL;
do {
// track the previous loop's number of handles still running so we can tell if it changes
$running_before = $running;
// run the block or check on the running block and get the number of sites still running in $running
curl_multi_exec($curl_multi_handle, $running);
// if the number of sites still running changed, print out a message with the number of sites that are still running.
if ($running != $running_before) {
echo("Waiting for $running sites to finish...\n");
}
} while ($running > 0);
// -- once the number still running is 0, curl_multi_ is done, so check the results
foreach ($block as $handle) {
// HTTP response code
$code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
// cURL error number
$curl_errno = curl_errno($handle);
// cURL error message
$curl_error = curl_error($handle);
// output if there was an error
if ($curl_error) {
echo(" *** cURL error: ($curl_errno) $curl_error\n");
}
// remove the (used) handle from the curl_multi_handle
curl_multi_remove_handle($curl_multi_handle, $handle);
}
// reset the block to empty, since we've run its curl_handles
$block = array();
}
}
// close the curl_multi_handle once we're done
curl_multi_close($curl_multi_handle);
Given that you don't need anything back from the URLs, you probably don't need a lot of what's there, but this is how I chunked the requests into blocks of BLOCK_SIZE, waited for each block to run before moving on, and caught errors from cURL.
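One small note: BLOCK_SIZE is a constant the snippet above assumes but never defines in the excerpt; something like this would need to sit before the loop (5 simply matches the batches of 5 URLs mentioned earlier).
define('BLOCK_SIZE', 5); // how many URLs to run per curl_multi batch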
