multi-thread, multi-curl crawler in PHP

multi-thread, multi-curl crawler in PHP - php

Hi everyone once again!
We need some help to develop and implement a multi-curl functionality into our crawler. We have a huge array of "links to be scanned" and we loop throw them with a Foreach.
Let's use some pseudo code to understand the logic:
1) While ($links_to_be_scanned > 0).
2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
3) Scan_the_link() and run some other functions.
4) Extract the new links from the xdom.
5) Push the new links into $links_to_be_scanned.
5) Push the current link into $links_already_scanned.
6) Remove the current link from $links_to_be_scanned.
Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.
I understand that we're gonna have to create a $links_being_scanned or some kind of queue.
I'm really not sure how to approach this problem to be honest, if anyone could provide some snippet or idea to solve it, it would be greatly appreciated.
Thanks in advance!
Chris;
Extended:
I just realized that is not the multi-curl itself the tricky part, but the amount of operations done with each link after the request.
Even after the muticurl, I would eventually have to find a way to run all this operations in parallel. The whole algorithm described below would have to run in parallel.
So now rethinking, we would have to do something like this:
While (There's links to be scanned)
Foreach ($Link_to_scann as $link)
If (There's less than 10 scanners running)
Launch_a_new_scanner($link)
Remove the link from $links_to_be_scanned array
Push the link into $links_on_queue array
Endif;
And each scanner does (This should be run in parallel):
Create an object with the given link
Send a curl request to the given link
Create a dom and an Xdom with the response body
Perform other operations over the response body
Remove the link from the $links_on_queue array
Push the link into the $links_already_scanned array
I assume we could approach this creating a new PHP file with the scanner algorithm, and using pcntl_fork() for each parallel proccess?
Since even using multi-curl, I would eventually have to wait looping on a regular foreach structure for the other processes.
I assume I would have to approach this using fsockopen or pcntl_fork.
Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!
Thanks a lot!

DISCLAIMER: This answer links an open-source project with which I'm involved. There. You've been warned.
The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.
Limiting the number of concurrent connections is easily accomplished. Consider:
<?php
use Artax\Client, Artax\Response;
require dirname(__DIR__) . '/autoload.php';
$client = new Client;
// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);
$requests = array(
'so-home' => 'http://stackoverflow.com',
'so-php' => 'http://stackoverflow.com/questions/tagged/php',
'so-python' => 'http://stackoverflow.com/questions/tagged/python',
'so-http' => 'http://stackoverflow.com/questions/tagged/http',
'so-html' => 'http://stackoverflow.com/questions/tagged/html',
'so-css' => 'http://stackoverflow.com/questions/tagged/css',
'so-js' => 'http://stackoverflow.com/questions/tagged/javascript'
);
$onResponse = function($requestKey, Response $r) {
echo $requestKey, ' :: ', $r->getStatus();
};
$onError = function($requestKey, Exception $e) {
echo $requestKey, ' :: ', $e->getMessage();
}
$client->requestMulti($requests, $onResponse, $onError);
IMPORTANT: In the above example the Client::requestMulti method is making all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client will open up new connections for the first two requests and subsequently reuse those same sockets for the other requests, queuing requests until one of the two sockets become available.

you could try something like this, haven't checked it, but you should get the idea
$request_pool = array();
function CreateHandle($url) {
$handle = curl_init($url);
// set curl options here
return $handle;
}
function Process($data) {
global $request_pool;
// do something with data
array_push($request_pool , CreateHandle($some_new_url));
}
function RunMulti() {
global $request_pool;
$multi_handle = curl_multi_init();
$active_request_pool = array();
$running = 0;
$active_request_count = 0;
$active_request_max = 10; // adjust as necessary
do {
$waiting_request_count = count($request_pool);
while(($active_request_count < $active_request_max) && ($waiting_request_count > 0)) {
$request = array_shift($request_pool);
curl_multi_add_handle($multi_handle , $request);
$active_request_pool[(int)$request] = $request;
$waiting_request_count--;
$active_request_count++;
}
curl_multi_exec($multi_handle , $running);
curl_multi_select($multi_handle);
while($info = curl_multi_info_read($multi_handle)) {
$curl_handle = $info['handle'];
call_user_func('Process' , curl_multi_getcontent($curl_handle));
curl_multi_remove_handle($multi_handle , $curl_handle);
curl_close($curl_handle);
$active_request_count--;
}
} while($active_request_count > 0 || $waiting_request_count > 0);
curl_multi_close($multi_handle);
}

You should look for some more robust solution to your problem. RabbitMQ
is a very good solution I used. There is also Gearman but I think it is your choice.
I prefer RabbitMQ.

I will share with you my code which I have used to collect email addresses from certain website.
You can modify it to fit your needs.
There were some problems with relative URL's there.
And I do not use CURL here.
<?php
error_reporting(E_ALL);
$home = 'http://kharkov-reklama.com.ua/jborudovanie/';
$writer = new RWriter('C:\parser_13-09-2012_05.txt');
set_time_limit(0);
ini_set('memory_limit', '512M');
function scan_page($home, $full_url, &$writer) {
static $done = array();
$done[] = $full_url;
// Scan only internal links. Do not scan all the internet!))
if (strpos($full_url, $home) === false) {
return false;
}
$html = #file_get_contents($full_url);
if (empty($html) || (strpos($html, '<body') === false && strpos($html, '<BODY') === false)) {
return false;
}
echo $full_url . '<br />';
preg_match_all('/([A-Za-z0-9_\-]+\.)*[A-Za-z0-9_\-]+#([A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]\.)+[A-Za-z]{2,4}/', $html, $emails);
if (!empty($emails) && is_array($emails)) {
foreach ($emails as $email_group) {
if (is_array($email_group)) {
foreach ($email_group as $email) {
if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
$writer->write($email);
}
}
}
}
}
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);
if (is_array($matches)) {
foreach($matches as $match) {
if (!empty($match[2]) && is_scalar($match[2])) {
$url = $match[2];
if (!filter_var($url, FILTER_VALIDATE_URL)) {
$url = $home . $url;
}
if (!in_array($url, $done)) {
scan_page($home, $url, $writer);
}
}
}
}
}
class RWriter {
private $_fh = null;
private $_written = array();
public function __construct($fname) {
$this->_fh = fopen($fname, 'w+');
}
public function write($line) {
if (in_array($line, $this->_written)) {
return;
}
$this->_written[] = $line;
echo $line . '<br />';
fwrite($this->_fh, "{$line}\r\n");
}
public function __destruct() {
fclose($this->_fh);
}
}
scan_page($home, 'http://kharkov-reklama.com.ua/jborudovanie/', $writer);

Related

Retrieving results from background Gearman job/task

Subject is quite self-explanatory, but I definitely need a fresh pair of eyes on this.
I am using mmoreram/GearmanBundle Symfony2 bundle to send jobs to execute. So, far I have managed to send a job, execute it and return results. That part works as expected.
However, I am trying to the same with background job/tasks. I know that, in this scenario, client does not wait for job to complete, but I was hoping that job handle can help me with that (e.g. retrieve job status).
$gearman = $this->get('gearman');
$jobId = $gearman->doHighBackgroundJob("CsvWorker~parseCsv", json_encode(["foo", "bar", "123"]));
sleep(3);
// At this point, job has completed for sure (it's very simple)
var_dump($jobId);
var_dump($gearman->getJobStatus($jobId));
This outputs the following:
string 'H:localhost.localdomain:10' (length=26)
object(Mmoreram\GearmanBundle\Module\JobStatus)[410]
private 'known' => boolean false
private 'running' => boolean false
private 'completed' => int 0
private 'completionTotal' => int 0
The known => false, in particular, really puzzles me. During the job execution, I made sure to invoke correctly sendStatus and sendComplete methods.
So, I guess, a general question would be: once the job has completed, is it still known to Gearman?
UPDATE:
I managed to add some code changes to the bundle which allowed me to listen for data being returned by job. That way, I may be able to persist that in database, however, my client (job creator) is still pretty much left in the dark on whether the job has actually finished.

I found here such an option for solving the problem.
It is convenient when you need to complete a task, and the answer is needed only for a while.
Worker
$gmworker = new GearmanWorker();
$gmworker->addServer();
$gmworker->addFunction("long_running_task", "long_running_task_fn");
print "Waiting for job...\n";
while($gmworker->work()) {
if ($gmworker->returnCode() != GEARMAN_SUCCESS) {
echo "return_code: " . $gmworker->returnCode() . "\n";
break;
}
}
function long_running_task_fn($job) {
$mc = memcache_connect('localhost', 11211);
$result = 1;
$n = $job->workload();
for ($i = 1; $i <= $n; $i++) {
$result *= $i;
$job->sendStatus($i, $n);
sleep(1);
}
memcache_set($mc, $job->handle(), $result);
}
Client
<?php
if ($_POST['start']) {
$gmc = new GearmanClient();
$gmc->addServer();
$handle = $gmc->doBackground('long_running_task', '10');
header('Location: /client.php?handle='.urlencode($handle));
}
if ($_GET['handle']) {
$handle = $_GET['handle'];
$gmc = new GearmanClient();
$gmc->addServer();
$status = $gmc->jobStatus($handle);
}
function get_result($handle) {
$mc = memcache_connect('localhost', 11211);
$reply = memcache_get($mc, $handle);
memcache_close($mc);
return $reply;
}
?>

As described in PHP Manual, as long as the job is known to the server, it is not completed.

call the exec() php function in the web browser

I have a php function called getServerAddres() and I am trying to execute the exec() from the web browser. I understand this is not the proper way of using the function, I was just a task to exploit a web server using remote code injection. Any help on how to do remote code injection using the exec() through the web browser would be greatly appreciated.
Lets say the login in screen is: https://www.10.10.20.161/test/
function getServerAddress() {
if(isset($_SERVER["SERVER_ADDR"]))
return $_SERVER["SERVER_ADDR"];
else {
// Running CLI
if(stristr(PHP_OS, 'WIN')) {
// Rather hacky way to handle windows servers
exec('ipconfig /all', $catch);
foreach($catch as $line) {
if(eregi('IP Address', $line)) {
// Have seen exec return "multi-line" content, so another hack.
if(count($lineCount = split(':', $line)) == 1) {
list($t, $ip) = split(':', $line);
$ip = trim($ip);
} else {
$parts = explode('IP Address', $line);
$parts = explode('Subnet Mask', $parts[1]);
$parts = explode(': ', $parts[0]);
$ip = trim($parts[1]);
}
if(ip2long($ip > 0)) {
echo 'IP is '.$ip."\n";
return $ip;
} else
; // to-do: Handle this failure condition.
}
}
} else {
$ifconfig = shell_exec('/sbin/ifconfig eth0');
preg_match('/addr:([\d\.]+)/', $ifconfig, $match);
return $match[1];
}
}
}
The php script came from the login.php file.

You dont seem to understand the exec function....
First thing, read the documentation here.
This function gets executed on the server side, and thus cannot be executed on the client side.
If what you want is the information of the host machine, then you can run the command there, and output the result.
Create this file: example.php, and enter this code:
<?php
echo exec('whoami');
?>
Now, upload this file to the host, and make a request:
www.YOURHOST.smg/example.php
And read the result

Libevent timeout loop exit

I'm having some difficulties getting the PHP libevent extension to break out of a loop on a timeout. Here's what I've got so far based on the demos on the PHP.net docs pages:
// From here: http://www.php.net/manual/en/libevent.examples.php
function print_line($fd, $events, $arg) {
static $max_requests = 0;
$max_requests++;
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), fgets($fd));
if ($max_requests == 10) {
// exit loop after 3 writes
echo " [EXIT]\n";
event_base_loopexit($arg[1]);
}
}
// create base and event
$base = event_base_new();
$event = event_new();
getTimer(); // Initialise time
$fd = STDIN;
event_set($event, $fd, EV_READ | EV_PERSIST, "print_line", array($event, $base));
event_base_set($event, $base);
event_add($event, 2000000);
event_base_loop($base);
// extract flags from bitmask
function getEventFlags ($ebm) {
$expFlags = array('EV_TIMEOUT', 'EV_SIGNAL', 'EV_READ', 'EV_WRITE', 'EV_PERSIST');
$ret = array();
foreach ($expFlags as $exf) {
if ($ebm & constant($exf)) {
$ret[] = $exf;
}
}
return $ret;
}
// Used to track time!
function getTimer () {
static $ts;
if (is_null($ts)) {
$ts = microtime(true);
return "Timer initialised";
}
$newts = microtime(true);
$r = sprintf("Delta: %.3f", $newts - $ts);
$ts = $newts;
return $r;
}
I can see that the timeout value passed to event_add effects the events passed to print_line(), if these events are any more than 2 seconds apart I get an EV_TIMEOUT instead of an EV_READ. What I want however is for libevent to call print_line as soon as the timeout is reached rather than waiting for the next event in order to give me the timeout.
I've tried using event_base_loopexit($base, 2000000), this causes the event loop to exit immediately without blocking for events. I've also tried passing EV_TIMEOUT to event_set, this seems to have no effect at all.
Has anyone managed to get this working before? I know that the event_buffer_* stuff works with timeouts, however I want to use the standard event_base functions. One of the PECL bugs talks about event_timer_* functions and these functions do exist on my system, however they're not documented at all.

Problem is in fgets() in:
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), fgets($fd));
This blocks processing and waits for data from STDIN (but there are none on timeout)
Change to something like that:
$text = '';
if ($events & EV_READ) {
$text = fgets($fd);
}
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), $text);

Gearman with multiple servers and php workers

I'm having a problem with gearman workers running on multiple servers which i can't seem to solve.
The problem occurs when a worker server is taken offline, rather than the worker process being cancelled, and causes all other worker processes to error and fail.
Example with just 1 client and 2 workers -
Client:
$client = new GearmanClient ();
$client->addServer ('192.168.1.200');
$client->addServer ('192.168.1.201');
$job = $client->do ('generate_tile', serialize ($arrData));
Worker:
$worker = new GearmanWorker ();
$worker->addServer ('192.168.1.200');
$worker->addServer ('192.168.1.201');
$worker->addFunction ('generate_tile', 'generate_tile');
while (1)
{
if (!$worker->work ())
{
switch ($worker->returnCode ())
{
default:
echo "Error: " . $worker->returnCode () . ': ' . $worker->error () . "\n";
break;
}
}
}
function generate_tile ($job) { ... }
The worker code is being run on 2 separate servers. When every server is up and running both workers execute jobs as expected. When one of the worker processes is cancelled, the other worker executes all jobs as expected.
However, when the server with the cancelled worker process is shutdown and taken completely offline, requests to the client script hang and the remaining worker process does not pick up any jobs.
I get the following set of errors from the remaining worker process:
Error: 46: gearman_con_wait:timeout reached
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:110
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
....
When i start-up the other server, not starting the worker process on it, the remaining worker process immediately jumps into life and executes any remaining jobs.
It seems clear to me that i need some code in the worker process to cope with any servers that may be offline, however i cannot see how to do this.
Many thanks,
Andy

Our tests with multiple gearman servers shows that if the last server in the list (192.168.1.201 in your case) is taken down, the workers stop executing the way you are describing. (Also, the workers grab jobs from the last server. They process jobs on .200 only if on .201 there are no jobs).
It seems that this is a bug with the linked list in the gearman server, which is reported to be fixed multiple times, but with all available versions of gearman, the bug persist. Sorry, I know that's not a solution, but we had the same problem and didn't found a solution. (if someone can provide working solution for this problem, I agree to give large bounty)

Further to #Darhazer 's comment above. We found that as well and solved like thus :-
// Gearman workers show a strong preference for servers at the end of a list so randomize the order
$worker = new GearmanWorker();
$s2 = explode(",", Configure::read('workers.servers'));
shuffle($s2);
$servers = implode(",", $s2);
$worker->addServers($servers);
We run 6 to 10 workers at any time, and expire them after they've completed x requests.

I use this class, which keep track of which jobs work on which servers. It hasn't been thoroughly tested, just wrote it now. I've pasted an edited version, so there might be a typo or somesuch, but otherwise appears to solve the issue.
<?
class MyGearmanClient {
static $server = "server1,server2,server3";
static $server_array = false;
static $workingServers = false;
static $gmclient = false;
static $timeout = 5000;
static $defaultTimeout = 5000;
static function randomServer() {
return self::$server_array[rand(0, count(self::$server_array) -1)];
}
static function getServer($job = false) {
if (self::$server_array == false) {
self::$server_array = explode(",", self::$server);
self::$workingServers = array();
}
$serverList = array();
if ($job) {
if (array_key_exists($job, self::$workingServers)) {
foreach (self::$server_array as $server) {
if (array_key_exists($server, self::$workingServers[$job])) {
if (self::$workingServers[$job][$server]) {
$serverList[] = $server;
}
} else {
$serverList[] = $server;
}
}
if (count($serverList) == 0) {
# All servers have failed, need to insert all the servers again and retry.
$serverList = self::$workingServers[$job] = self::$server_array;
}
return $serverList[rand(0, count($serverList) - 1)];
} else {
return self::randomServer();
}
} else {
return self::randomServer();
}
}
static function serverWorked($server, $job) {
self::$workingServers[$job][$server] = $server;
}
static function serverFailed($server, $job) {
self::$workingServers[$job][$server] = false;
}
static function Connect($server = false, $job = false) {
if ($server) {
self::$server = self::getServer();
}
self::$gmclient= new GearmanClient();
self::$gmclient->setTimeout(self::$timeout);
# add the default job server
self::$gmclient->addServer($server = self::getServer($job));
return $server;
}
static function Destroy() {
self::$gmclient = false;
}
static function Client($name, $vars, $timeout = false) {
if (is_int($timeout)) {
self::$timeout = $timeout;
} else {
self::$timeout = self::$defaultTimeout;
}
do {
$server = self::Connect(false, $name);
$value = self::$gmclient->do($name, $vars);
$return_code = self::$gmclient->returnCode();
if (!$value) {
$error_message = self::$gmclient->error();
if ($return_code == 47) {
self::serverFailed($server, $name);
if (count(self::$server_array) > 1) {
// ADDED SINGLE SERVER LOOP AVOIDANCE // echo "Timeout on server $server, trying another server...\n";
continue;
} else {
return false;
}
}
echo "ERR: $error_message ($return_code)\n";
}
# printf("Worker has returned\n");
$short_value = substr($value, 0, 80);
switch ($return_code)
{
case GEARMAN_WORK_DATA:
echo "DATA: $short_value\n";
break;
case GEARMAN_SUCCESS:
self::serverWorked($server, $name);
break;
case GEARMAN_WORK_STATUS:
list($numerator, $denominator)= self::$gmclient->doStatus();
echo "Status: $numerator/$denominator\n";
break;
case GEARMAN_TIMEOUT:
// self::Connect();
// Fall through
default:
echo "ERR: $error_message " . self::$gmclient->error() . " ($return_code)\n";
break;
}
}
while($return_code != GEARMAN_SUCCESS);
$rv = unserialize($value);
return $rv["rv"];
}
}
# Example usage:
# $rv = MyGearmanClient::Client("Function", $args);
?>

since 'addServer' from gearman client is not working properly this code can choose a jobserver randomly and if fails try the next one, this way you can balance the load.
// job servers
$jobservers = array('192.168.1.1','192.168.1.2');
// prepare gearman client
$gmclient = new GearmanClient();
// shuffle job servers (deliver jobs equally by server)
shuffle($jobservers);
// add job servers
foreach($jobservers as $jobserver) {
// add random jobserver
$gmclient->addServer($jobserver);
// check server state if ok end foreach
if (#$gmclient->ping('ping')) break;
// if connections fails reset client
$gmclient = new GearmanClient();
}

Solution tested and working ok.
$client = new GearmanClient();
if(!$client->addServer("11.11.65.73",4730))
$client->addServer("11.11.65.79",4730);

Using PHP's IMAP library triggers Kaspersky's Antivirus

I just started today working with PHP's IMAP library, and while imap_fetchbody or imap_body are called, it is triggering my Kaspersky antivirus. The viruses are Trojan.Win32.Agent.dmyq and Trojan.Win32.FraudPack.aoda. I am running this off a local development machine with XAMPP and Kaspersky AV.
Now, I am sure there are viruses there since there is spam in the box (who doesn't need a some viagra or vicodin these days?). And I know that since the raw body includes attachments and different mime-types, bad stuff can be in the body.
So my question is: are there any risks using these libraries?
I am assuming that the IMAP functions are retrieving the body, caching it to disk/memory and the AV scanning it sees the data.
Is that correct? Are there any known security concerns using this library (I couldn't find any)? Does it clean up cached message parts perfectly or might viral files be sitting somewhere?
Is there a better way to get plain text out of the body than this? Right now I am using the following code (credit to Kevin Steffer):
function get_mime_type(&$structure) {
$primary_mime_type = array("TEXT", "MULTIPART","MESSAGE", "APPLICATION", "AUDIO","IMAGE", "VIDEO", "OTHER");
if($structure->subtype) {
return $primary_mime_type[(int) $structure->type] . '/' .$structure->subtype;
}
return "TEXT/PLAIN";
}
function get_part($stream, $msg_number, $mime_type, $structure = false, $part_number = false) {
if(!$structure) {
$structure = imap_fetchstructure($stream, $msg_number);
}
if($structure) {
if($mime_type == get_mime_type($structure)) {
if(!$part_number) {
$part_number = "1";
}
$text = imap_fetchbody($stream, $msg_number, $part_number);
if($structure->encoding == 3) {
return imap_base64($text);
} else if($structure->encoding == 4) {
return imap_qprint($text);
} else {
return $text;
}
}
if($structure->type == 1) /* multipart */ {
while(list($index, $sub_structure) = each($structure->parts)) {
if($part_number) {
$prefix = $part_number . '.';
}
$data = get_part($stream, $msg_number, $mime_type, $sub_structure,$prefix . ($index + 1));
if($data) {
return $data;
}
} // END OF WHILE
} // END OF MULTIPART
} // END OF STRUTURE
return false;
} // END OF FUNCTION
$connection = imap_open($server, $login, $password);
$count = imap_num_msg($connection);
for($i = 1; $i <= $count; $i++) {
$header = imap_headerinfo($connection, $i);
$from = $header->fromaddress;
$to = $header->toaddress;
$subject = $header->subject;
$date = $header->date;
$body = get_part($connection, $i, "TEXT/PLAIN");
}

Your guess seems accurate. IMAP itself is fine. What you do with the contents is what's dangerous.
What's dangerous about virus e-mails is that users might open a .exe attachment or something, so bad attachments and potentially evil HTML are what's being checked. As long as your code handling attachments doesn't tell the user to open them and this is just automatic processing or whatever, you're good to go. If you're planning on outputting HTML contents, be sure to use something like HTML Purifier.

The AV is detecting these signatures as they pass through the networking stack, most likely. You should be able to tell the source of the detection from the messages Kaspersky is giving you.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.