Downloading pages in parallel using PHP - php

I have to scrap a web site where i need to fetch multiple URLs and then process them one by one. The current process somewhat goes like this.
I fetch a base URL and get all secondary URLs from this page, then for each secondary url I fetch that URL, process found page, download some photos (which takes quite a long time) and store this data to database, then fetch next URL and repeat the process.
In this process, I think I am wasting some time in fetching secondary URL at the start of each iteration. So I am trying to fetch next URLs in parallel while processing first iteration.
The solution in my mind is, from main process call a PHP script, say downloader, which will download all the URL (with curl_multi or wget) and store them in some database.
My questions are
How to call such downloder asynchronously, I don't want my main script to wait till downloder completes.
Any location to store downloaded data, such as shared memory. Of course, other than database.
There any chances that data gets corrupt while storing and retrieving, how to avoid this?
Also, please guide me know if anyone have a better plan.

When I hear someone uses curl_multi_exec it usually turns out they just load it with, say, 100 urls, then wait when all complete, and then process them all, and then start over with the next 100 urls... Blame me, I was doing so too, but then I found out that it is possible to remove/add handles to curl_multi while something is still in progress, And it really saves a lot of time, especially if you reuse already open connections. I wrote a small library to handle queue of requests with callbacks; I'm not posting full version here of course ("small" is still quite a bit of code), but here's a simplified version of the main thing to give you the general idea:
public function launch() {
$channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
$activeJobs = array();
$running = 0;
do {
// pick jobs for free channels:
while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
// take free channel, (re)init curl handle and let
// queued object set options
$chId = key($freeChannels);
if (empty($channels[$chId])) {
$channels[$chId] = curl_init();
}
$job = array_pop($this->jobQueue);
$job->init($channels[$chId]);
curl_multi_add_handle($this->master, $channels[$chId]);
$activeJobs[$chId] = $job;
unset($freeChannels[$chId]);
}
$pending = count($activeJobs);
// launch them:
if ($pending > 0) {
while(($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
// poke it while it wants
curl_multi_select($this->master);
// wait for some activity, don't eat CPU
while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
// some connection(s) finished, locate that job and run response handler:
$pending--;
$chId = array_search($info['handle'], $channels);
$content = curl_multi_getcontent($channels[$chId]);
curl_multi_remove_handle($this->master, $channels[$chId]);
$freeChannels[$chId] = NULL;
// free up this channel
if ( !array_key_exists($chId, $activeJobs) ) {
// impossible, but...
continue;
}
$activeJobs[$chId]->onComplete($content);
unset($activeJobs[$chId]);
}
}
} while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
}
In my version $jobs are actually of separate class, not instances of controllers or models. They just handle setting cURL options, parsing response and call a given callback onComplete.
With this structure new requests will start as soon as something out of the pool finishes.
Of course it doesn't really save you if not just retrieving takes time but processing as well... And it isn't a true parallel handling. But I still hope it helps. :)
P.S. did a trick for me. :) Once 8-hour job now completes in 3-4 mintues using a pool of 50 connections. Can't describe that feeling. :) I didn't really expect it to work as planned, because with PHP it rarely works exactly as supposed... That was like "ok, hope it finishes in at least an hour... Wha... Wait... Already?! 8-O"

You can use curl_multi: http://www.somacon.com/p537.php
You may also want to consider doing this client side and using Javascript.
Another solution is to write a hunter/gatherer that you submit an array of URLs to, then it does the parallel work and returns a JSON array after it's completed.
Put another way: if you had 100 URLs you could POST that array (probably as JSON as well) to mysite.tld/huntergatherer - it does whatever it wants in whatever language you want and just returns JSON.

Aside from the curl multi solution, another one is just having a batch of gearman workers. If you go this route, I've found supervisord a nice way to start a load of deamon workers.

Things you should look at in addition to CURL multi:
Non-blocking streams (example: PHP-MIO)
ZeroMQ for spawning off many workers that do requests asynchronously
While node.js, ruby EventMachine or similar tools are quite great for doing this stuff, the things I mentioned make it fairly easy in PHP too.

Try execute from PHP, python-pycurl scripts. Easier, faster than PHP curl.

Related

Async PHP | Processing data into several systems (Advice)

I'm building an integration that communicates data to several different systems via API (REST). I need to process data as quickly as possible. This is a basic layout:
Parse and process data (probably into an array as below)
$data = array( Title => "Title", Subtitle => "Test", .....
Submit data into service (1) $result1 = $class1->functionservice1($data);
Submit data into service (2) $result2 = $class2->functionservice2($data);
Submit data into service (3) $result3 = $class3->functionservice3($data);
Report completion echo "done";
Run in a script as above I'll need to wait for each function to finish before it starts the next one (taking 3 times longer).
Is there an easy way to run each service function asynchronously but wait for all to complete before (5) reporting completion. I need to be able to extract data from each $result and return that as one post to a 4th service.
Sorry if this is an easy question - I'm a PHP novice
Many thanks, Ben
Yes, there are multiple ways.
The most efficient is to use an event loop that leverages non-blocking I/O to achieve concurrency and cooperative multitasking.
One such event loop implementation is Amp. There's an HTTP client that works with Amp, it's called Artax. An example is included in its README. You should have a look at how promises and coroutines work. There's Amp\wait to mix synchronous code with async code.
<?php
Amp\run(function() {
$client = new Amp\Artax\Client;
// Dispatch two requests at the same time
$promises = $client->requestMulti([
'http://www.google.com',
'http://www.bing.com',
]);
try {
// Yield control until all requests finish
list($google, $bing) = (yield Amp\all($promises));
var_dump($google->getStatus(), $bing->getStatus());
} catch (Exception $e) {
echo $e;
}
});
Other ways include using threads and or processes to achieve concurrency. Using multiple processes is the easiest way if you want to use your current code. However, spawning processes isn't cheap and using threads in PHP isn't really a good thing to do.
You can also put your code in another php file and call it using this :
exec("nohup /usr/bin/php -f your script > /dev/null 2>&1 &");
If you want to use asynchronicity like you can do in other languages ie. using threads, you will need to install the pthreads extension from PECL, because PHP does not support threading out of the box.
You can find an explaination on how to use threads with this question :
How can one use multi threading in PHP applications

What's wrong with my concurrent programming logic?

I wrote a web spider to spider pages concurrently. For each link that the spider finds, I want to fork off a new child that starts the process all over again.
I don't want to overload the target server so I created a static array that all objects can access. Each child can add their PID to the array, and either parent or child should check the array to see if $maxChildren have been met, and if so, patiently wait until any child finishes.
As you see, I have $maxChildren set to 3. I am expecting to see 3 simultaneous processes at any given time. However, that's not the case. The linux top command shows 12 to 30 processes at any given time. In concurrent programming, how can I regulate the number of simultaneous processes? My logic is currently inspired by how Apache handles it's max children, but I'm not exactly sure how that works.
As pointed out in one of the answers, globally accessing the static variable brings up issues with race conditions. To deal with this, the $children array takes the unique $PID of the process as both the key and it's value, thereby creating a unique value. My thinking is that since any object can only deal with one $children[$pid] value, locking is not necessary. Is this not true? Is there a chance that two processes could try to unset or add the same value at some point?
private static $children = array();
private $maxChildren = 3;
public function concurrentSpider($url) {
// STEP 1:
// Download the $url
$pageData = http_get($url, $ref = '');
if (!$this->checkIfSaved($url)) {
$this->save_link_to_db($url, $pageData);
}
// STEP 2:
// extract all hyperlinks from this url's page data
$linksOnThisPage = $this->harvest_links($url, $pageData);
// STEP 3:
// Check the links array from STEP 2 to see if this page has
// already been saved or is excluded because of any other
// logic from the excluded_link() function
$filteredLinks = $this->filterLinks($linksOnThisPage);
shuffle($filteredLinks);
// STEP 4: loop through each of the links and
// repeat the process
foreach ($filteredLinks as $filteredLink) {
$pid = pcntl_fork();
switch ($pid) {
case -1:
print "Could not fork!\n";
exit(1);
case 0:
if ($this->checkIfSaved($filteredLink)) {
exit();
}
//$pid = getmypid();
print "In child with PID: " . getmypid() . " processing $filteredLink \n";
$var[$pid]->concurrentSpider($filteredLink);
sleep(2);
exit(1);
default:
// Add an element to the children array
self::$children[$pid] = $pid;
// If the maximum number of children has been
// achieved, wait until one or more return
// before continuing.
while (count(self::$children) >= $this->maxChildren) {
//print count(self::$children) . " children \n";
$pid = pcntl_waitpid(-1, $status);
unset(self::$children[$pid]);
}
}
}
}
This is written in PHP. I know that the pcntl_waitpid function with argument of -1 waits for any child to complete regardless of the parent (http://php.net/manual/en/function.pcntl-waitpid.php).
What's wrong with my logic and how can I correct it so that only $maxChildren processes are running simultaneously? I'm also open to improving the logic in general if you have suggestions.
First thing to note: if this is truly a global being shared among multiple threads, it's possible that multiple threads are adding to it at once and you're running afoul of a race condition. You need some sort of concurrency control to ensure that only one process is accessing your global array at once.
Also, try the simple debugging trick of having each process write out (to the console or to a file) its PID and the full contents of the global array each time a new spider is forked. It will help you to check your assumptions (which are plainly wrong at some point) and figure out what's going wrong.
EDIT: (In response to the comments)
I'm not a PHP developer, but if I had to guess, based on the fact that you're using an OS tool that counts OS-level processes, I'd guess that your fork is spawning multiple processes, but your static array is global within the current process. Implementing system-wide shared memory is a lot more complicated!
If you just want to count something and ensure that instances of a shared resource don't grow out of control, look into semaphores, and see if you can find a way in PHP to create a named semaphore object that can be shared between multiple instances of your spider.
Use a real programming language ;)
Step 1 is kind of bad why are you downloading if it might be in the db. Put that inside the if and see if you can put a mutex around it. Maybe so something in sql to imitate one.
I hope harvest_links uses a proper html processor with css selector support (i like fizzler for .NET). I guess regular expression would be fine if its just to get links but it is possible to mess up.
I see step 4 and i don't think its bad but personally i'd do it a different way.
I'd have something like step one to insert url,page,flag into a db. Then i'd have another process or the same one ask the db for unprocessed pages and set the flag to some value if it errors and another if its successful. This is so if something fails of the process exits (shutdown, crash, power out, etc) it can pick it up easily and don't need to scan every page to find where it left off. It just ask the database for the next link and redoes what it didnt finish
PHP doesn't support multithreading, therefore it doesn't support mutexes or any other synchronization methods. As others have said in their answers, this will lead to a race condition.
You'll have to write a wrapper in C or bash. That way, the PHP script can submit targets to the wrapper, and the wrapper will handle scheduling.
Another approach is to rewrite your spider in Python or Ruby, both of which support multithreading. That will eliminate the need for interprocess communication.
Edit: On second thought, the best way is to write the wrapper in Python or Ruby and reuse your existing PHP code as a black box. That's a compromise of the solutions above.
If the spider is for practical purposes, you might want to google "curl multithread"
cURL Multi Threading with PHP

Non Blocking Curl requests in PHP

So I have this function for making non-blocking curl requests. It works fine on what I've tested so far (small amounts of requests). But I need this to scale up to thousands of requests (maybe max 10,000). My issue is that I don't want to run into issues with too many parallel requests running at once.
What would you suggest to rate-limit the requests? Usleep? Requests in batches? The function is below:
function poly_curl($requests){
$queue = curl_multi_init();
$curl_array = array();
$count = 0;
foreach($requests as $request)
{
$curl_array[$count] = curl_init($request);
curl_setopt($curl_array[$count], CURLOPT_RETURNTRANSFER, true);
curl_multi_add_handle($queue, $curl_array[$count]);
$count++;
}
$running = NULL;
do {
curl_multi_exec($queue,$running);
} while($running > 0);
$res = array();
$count = 0;
foreach($requests as $request)
{
$res[$count] = curl_multi_getcontent($curl_array[$count]);
$count++;
}
$count = 0;
foreach($requests as $request){
curl_multi_remove_handle($queue, $curl_array[$count]);
$count++;
}
curl_multi_close($queue);
return $res;
}
I think curl_multi_exec is bad for this purpose, because even if you use batches in groups of 100, 99 request could be finished and still will have to wait for the last request completion.
But you need 100 parallel requests and when one finishes, another is immediately started. So you cannot use curl_multi_exec at all.
I would use normal producer-consumer algorithm with multiple (constant number) consumers with every consumer processing only one url. For example php-resque and COUNT=100 php resque.php
You may want to implement something that is called Exponential Backoff (wikipedia).
Basically, it is an algorithm that allows you dynamically scale your processes depending on some feedback.
You define a rate in your application, and on the first time-out, error, or anything you decide, you decrease this rate until the request finishes.
You may implement it easily using the HTTP response code for example.
Last time i was doing something like this it was including downloading and "parsing" files. Was able to proceed only 4 subpages at a time limited by very weak hardware processor (2 cores with HT). What time i ended up with two queuest: 1 for waiting, 2 for in-process. Every time a task gone from 2nd queue, new was taken from 1st one.
May saund complicated, but ended in two loops inside eachother and simple count()'s
Btw, considering so hight rate i would think of using Node.js - for simplicity - or anything more nonblocking and more suitable for deamons than PHP.. As long as threads are PHP weakpoint, it just does not suit there.
PS: nice & useful bit of code, thanks.
We used to face the same problem with C++ connection pooling code. The approach is those days involved some serious analysis.
But, the essence was that, we created a pool and requests would get processed depending on number of available requests. What we also did was assign a maximum number of connection pools.[This was determined by testing].
What you really need is a method to determine how many requests are being processed and put a limit to it. In your case that is $count
Just compare $count to a maximum value[say, $max] to it and stop there. define the value depending on the system the program runs. $max could be hardcoded or dynamic.

PHP multithread to call several soap servers

I have to call a php function wich takes one second to response, in a "for" loop :
for ($i=0; $i<count($wsdlTab); $i++)
{
$serverLoadTab[$i] = $this->getServerLoad($wsdlTab[$i]);
}
My problem is that I would like to call my getServerLoad($wsdlTab[$i]) function simultaneous for each row of my $wsdlTab[$i], to not have to wait one second on each loop.
That is the reason why I need to call that function in a thread.
I have seen various ways to "emulate" threads, but I have not found any way with my limitations :
I have to get the return value of my getServerLoad($wsdlTab[$i]), and put in in an array
The Apache server is on Windows
Thanks in advance for your responses.
You should check out Gearman for parallel processing: http://gearman.org/
PHP doesn't really have asynchronous or threading built-in, as you've discovered.
What I might do in a case like this is separate the script I need to execute in a parallel, putting it into its own, small and self-contained PHP file. Then I'd execute that in a separate thread, storing the result somewhere I could monitor in the original thread. Once all the scripts have returned and filled the results, or with some given timeout, I would then continue with processing.
So, for instance,
prepareResults(); // something like clearing a db row or zeroing out a file or whatever
for ($i=0; $i<count($wsdlTab); $i++) {
exec('./doServerLoad.php ' . $wsdlTab[$i] . ' &');
}
while (!waitingForResults()) { // checking the results table/row/file
}
$serverLoadTab = parseTheResults();
May be request WSDLs in multi threads/forks using pcntl_fork() or curl_multi, and save them to the local drive. Then just parse them:
$soapClient = new SoapClient('WSDLSTemp/wsdl1.wsdl');
// .. do whatever you want...

Use non-blocking streams to paralellize REST-Api requests in PHP?

Consider the following scenario:
http://www.restserver.com/example.php returns some content that I want to work with in my web-application.
I don't want to load it using ajax (SEO issues etc.)
My page takes 100ms to generate, the REST resource also takes 100ms to be loaded.
We assume that the 100ms generation time of my website occour before I begin working with the REST resource. What comes after that can be neglected.
Example Code:
Index.php of my website
<?
do_some_heavy_mysql_stuff(); // takes 100 ms
get_rest_resource(); // takes 100 ms
render_html_with_data_from_mysql_and_rest(); // takes neglectable amount of time
?>
Website will take ~200ms to generate.
I want to turn this into:
<?
Restclient::initiate_rest_loading(); // takes 0ms
do_some_heavy_mysql_stuff(); // takes 100 ms
Restclient::get_rest_resource(); // takes 0 ms because 100 ms have already passed since initiation
render_html_with_data_from_mysql_and_rest(); // takes neglectable amount of time
?>
Website will take ~100ms to generate.
To accomplis this I thought about using something like this:
(I am pretty sure this code will not work because this question is all about asking how to accomplish this, and whether its possible. I just thought some naive code could demonstrate it best)
class Restclient {
public static $buffer;
public static function initiate_rest_loading() {
// open resource
$handle = fopen ("http://www.restserver.com/example.php", "r");
// set to non blocking so fgets will return immediately
stream_set_blocking($handle,0);
// initate loading, but return immediately to continue website generation
fgets($handle, 40960);
}
public static function get_rest_resource() {
// set stream to blocking again because now we really want the data
stream_set_blocking($handle,1);
// get the data and save it so templates can work with it
self::$buffer = fgets($handle, 40960); templates
}
}
So final question:
Is this possible and how?
What do I have to keep an eye on (internal buffer overflows, stream lengths etc.)
Are there better methods?
Does this well work with http resources?
Any input is appriciated!
I hope I explained it understandable. If anything is unclear, please leave a comment, so I can rephrase it!
As "any input is appreciated", here is mine:
What you want is called asynchronous (you want to something while something else is being done "in the background").
To solve your problem, I thought on this:
Separate do_some_heavy_mysql_stuff and get_rest_resource in two different PHP scripts.
Use cURL "multi" ability to do simultaneous requests. Please, check:
curl_multi_init and related PHP functions
Simultaneous HTTP requests in PHP with cURL
This way, you can perform both scripts at the same time. Using cURL multi features, you can call http://example.com/do_some_heavy_mysql_stuff.php and http://example.com/get_rest_resource.php at the same time, and then play with the results as soon as they're available.
These are my first thoughts, and Iim sharing them with you. Maybe there are different and more interesting approaches... Good luck!

Categories