Comparing two large files is taking over four hours - php

I have an online store with about 15,000 products that gets updated every day. Currently I upload the new list every day, but that causes problems (downtime being the biggest), so I wanted to come up with an alternative.
I created a script that moves yesterday's products list aside and downloads today's list. Then I go line by line and compare the two files to see what needs to be deleted, modified, or created. This lets me perform the update with a minimal amount of work and no downtime, since everything happens behind the scenes via a cron job, which is how it should be done.
The problem is that the process takes over four hours, and I'm not sure that what I'm doing is the most efficient way. My first thought was to write something in C++, but I'm not sure how much faster that would be compared to PHP.
My questions are:
• Is this the most efficient way to do this?
• Is PHP the best language to do this?
Here's my script I wrote that handles the download and comparison:
public function __construct($url, $user, $pass)
{
$this->logger = new KLogger("/opt/lampp/htdocs/lea/logs/master.log" , KLogger::INFO);
/* increase execution time and server memory limit */
ini_set('max_execution_time', 14400);
ini_set('memory_limit', '-1');
/* set variables */
$this->ftp = ftp_connect($url);
$this->login = ftp_login($this->ftp, $user, $pass);
$this->old = file('/opt/lampp/htdocs/lea/products/new/temp/rsr_inventory.txt');
$this->new = file('/opt/lampp/htdocs/lea/products/new/rsr_inventory.txt');
$this->list = array();
$this->start_time = date('Hi');
$this->counter = 0;
}
public function download($to, $from)
{
// move current file to new location to get new file ready
$this->logger->LogInfo('move yesterday\'s products list');
rename('/opt/lampp/htdocs/lea/products/new/temp/rsr_inventory.txt', '/opt/lampp/htdocs/lea/products/new/rsr_inventory.txt');
// get list from rsr
$this->logger->LogInfo('get new list from rsr');
if(ftp_get($this->ftp, $to, $from, FTP_BINARY))
{
return true;
}
return false;
}
public function update()
{
// initialize process
$this->logger->LogInfo('update process initialized');
for($i = 0; $i < count($this->new); $i++)
{
$new[$i] = explode(';', $this->new[$i]);
$response = $this->_match($new[$i]);
if($response[0])
{
if(trim($response[2]) != trim($new[$i][5]) || trim($response[3]) != trim($new[$i][8]))
{
$this->list[$this->counter][0] = $response[1];
$this->list[$this->counter][1] = 'update';
$this->list[$this->counter][2] = trim($response[2]);
$this->list[$this->counter][3] = trim($response[3]);
$this->counter++;
}
}
else
{
// no match in the old list, so _match() returned NULLs; take the values from the new line
$this->list[$this->counter][0] = $new[$i][0];
$this->list[$this->counter][1] = 'create';
$this->list[$this->counter][2] = trim($new[$i][5]);
$this->list[$this->counter][3] = trim($new[$i][8]);
$this->counter++;
}
}
if(count($this->list) > 0)
{
//csv
$this->logger->LogInfo('create update.csv');
$updates = fopen('/opt/lampp/htdocs/lea/products/new/updates.csv', 'w');
foreach($this->list as $fields)
{
fputcsv($updates, $fields);
}
fclose($updates);
}
$this->logger->LogInfo('product update process complete');
$this->__mail();
}
private function _match($item)
{
for($j = 0; $j < count($this->old); $j++)
{
$old[$j] = explode(';', $this->old[$j]);
if($item[0] === $old[$j][0])
{
return array(true, $item[0], $old[$j][5], $old[$j][8]);
}
}
return array(false, NULL, NULL, NULL);
}
Here is an example of the products.txt file I get every day. I'm only showing a handful of products, but there are roughly 15,000. A lot of fields are missing (prices, quantities, etc.); I shortened everything since they don't matter here:
511-10010-019-L-XL;844802282208;5.11 RECON ANKLE SOCK BLK L/XL;
511-10010-036-L-XL;844802282246;5.11 RECON ANKLE SOCK SHADOW L/XL;
511-10010-132-LXL;844802334662;5.11 RECON ANKLE SOCK TIMBER L/XL;
511-10010-200-L-XL;844802282222;5.11 RECON ANKLE SOCK FATIGUE L/XL;
511-10011-019-L-XL;844802276382;5.11 COLD WEATHER OTC SOCK BLK L/XL;
511-10012-019-L-XL;844802276429;5.11 COLD WEATHER CREW SOCK BLK L/XL;
511-30012-019-M;844802269650;5.11 WOMENS HOLSTER SHIRT BLK M;
511-40011-010-L;844802016148;5.11 HOLSTER SHIRT L WHITE;
511-40011-010-M;844802016131;5.11 HOLSTER SHIRT M WHITE;
511-40011-010-XL;844802016155;5.11 HOLSTER SHIRT XL WHITE;
511-40011-010-XXL;844802016162;5.11 HOLSTER SHIRT 2XL WHITE;

I think your problem is that you are doing 15000 x 15000 comparisons (so 225 million operations on the data).
Instead, create a map (in other words, an array in PHP) for both the old and the new list, with some unique identifier as the index. That is 30K operations. Then iterate over one list, checking whether the other contains the same key. That's another 15K operations, for a total of 45K operations rather than 225M.
I'm not saying the suggestion to use a database is a bad idea, but the excessive time it takes is clearly caused by a poor choice of algorithm + data structure.
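A minimal sketch of that map-based comparison, assuming (as in the question's code) that field 0 is the unique product identifier and that fields 5 and 8 are the ones compared. The helper names indexByKey and diffInventories are made up for illustration and are not part of the original script.

```php
<?php
// Hypothetical helper: index the lines of a ';'-delimited file by their first field.
function indexByKey(array $lines): array
{
    $map = [];
    foreach ($lines as $line) {
        $fields = array_map('trim', explode(';', $line));
        if ($fields[0] !== '') { // skip blank lines
            $map[$fields[0]] = $fields;
        }
    }
    return $map;
}

// Hypothetical helper: one pass over each map, using constant-time key lookups
// instead of rescanning the whole old list for every new line.
function diffInventories(array $oldLines, array $newLines, array $compare = [5, 8]): array
{
    $old = indexByKey($oldLines);
    $new = indexByKey($newLines);
    $changes = [];
    foreach ($new as $key => $fields) {
        if (!isset($old[$key])) {
            $changes[] = [(string) $key, 'create'];
            continue;
        }
        foreach ($compare as $i) {
            if (($fields[$i] ?? '') !== ($old[$key][$i] ?? '')) {
                $changes[] = [(string) $key, 'update'];
                break;
            }
        }
    }
    foreach ($old as $key => $fields) {
        if (!isset($new[$key])) {
            $changes[] = [(string) $key, 'delete'];
        }
    }
    return $changes;
}
```

The two file() arrays from the constructor could be passed in directly, for example diffInventories($this->old, $this->new); the rows it returns can then be written out with fputcsv as before.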

This is a job for MySQL. Importing your data will be a substantial investment up front, but will be worth it in the long run. Databases are designed to update, merge, delete, and insert data efficiently. This sort of job would take seconds in MySQL. You could keep PHP as your scripting language.

Related

Solution for calling a function doing lots of stuff in it by Cron?

function cronProcess() {
# > 100,000 users
$users = $this->UserModel->getUsers();
foreach ($users as $user) {
# Do lots of database Insert/Update/Delete, HTTP request stuff
}
}
The problem happens when the number of users reaches ~ 100,000.
I called the function by CURL via CronTab.
So what is the best solution for this?
I do a lot of bulk tasks in CakePHP, some processing millions of records. It's certainly possible; the key, as others suggested, is small batches in a loop.
If this is something you're calling from cron, it's probably easier to use a Shell (up to v3.5) or the newer Command class (v3.6+) than cURL.
Here's generally how I paginate large batches, including some helpful optional things like a progress bar, turning off hydration to speed things up slightly, and showing how many users/second the script was able to process:
<?php
namespace App\Command;
use Cake\Console\Arguments;
use Cake\Console\Command;
use Cake\Console\ConsoleIo;
class UsersCommand extends Command
{
public function execute(Arguments $args, ConsoleIo $io)
{
// I'd guess a Finder would be a more Cake-y way of getting users than a custom "getUsers" function:
// See https://book.cakephp.org/3.0/en/orm/retrieving-data-and-resultsets.html#custom-finder-methods
$usersQuery = $this->UserModel->find('users');
// Get a total so we know how many we're gonna have to process (optional)
$total = $usersQuery->count();
if ($total === 0) {
// Command::abort() takes an exit code, not a message, so print the message first
$io->err("No users found, stopping..");
$this->abort();
}
// Hydration takes extra processing time & memory, which can add up in bulk. Optionally if able, skip it & work with $user as an array not an object:
$usersQuery->enableHydration(false);
$io->info("Grabbing $total users for processing");
// Optionally show the progress so we can visually see how far we are in the process
$progress = $io->helper('Progress')->init([
'total' => $total
]);
// Tune this page value to a size that solves your problem:
$limit = 1000;
$offset = 0;
// Simply drawing the progress bar every loop can slow things down, optionally draw it only every n-loops,
// this sets it to 1/5th the page size:
$progressInterval = $limit / 5;
// Optionally track the rate so we can evaluate the speed of the process, helpful tuning limit and evaluating enableHydration effects
$startTime = microtime(true);
do {
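// Apply the page size as well as the offset, otherwise every pass loads all remaining rows
$users = $usersQuery->limit($limit)->offset($offset)->toArray();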
$count = count($users);
$index = 0;
foreach ($users as $user) {
$progress->increment(1);
// Only draw occasionally, for speed
if ($index % $progressInterval === 0) {
$progress->draw();
}
$index++; // advance the per-page counter so the interval check above works
### WORK TIME
# Do your lots of database Insert/Update/Delete, HTTP request stuff etc. here
###
}
$progress->draw();
$offset += $limit; // Increment your offset to the next page
} while ($count > 0);
$totalTime = microtime(true) - $startTime;
$io->out("\nProcessed an average " . ($total / $totalTime) . " Users/sec\n");
}
}
Check out these sections in the CakePHP docs:
Console Commands
Command Helpers
Using Finders & Disabling Hydration
Hope this helps!

Restrict function to maximum 100 executions per minute

I have a script that makes multiple POST requests to an API. Rough outline of the script is as follows:
define("MAX_REQUESTS_PER_MINUTE", 100);
function apirequest ($data) {
// post data using cURL
}
while ($data = getdata ()) {
apirequest($data);
}
The API is throttled: it allows users to post up to 100 requests per minute. Additional requests return an HTTP error plus a Retry-After response until the window resets. Note that the server can take anywhere between 100 milliseconds and 100 seconds to process a request.
I need to make sure that my function does not execute more than 100 times per minute. I have tried the usleep function to introduce a constant delay of 0.66 seconds, but this simply adds one extra minute per minute. An arbitrary value such as 0.1 second results in an error sooner or later. I log all requests in a database table along with the time; the other solution I tried is to probe the table and count the number of requests made within the last 60 seconds.
I need a solution that wastes as little time as possible.
I've put Derek's suggestion into code.
class Throttler {
private $maxRequestsPerMinute;
private $getdata;
private $apirequest;
private $firstRequestTime = null;
private $requestCount = 0;
public function __construct(
int $maxRequestsPerMinute,
$getdata,
$apirequest
) {
$this->maxRequestsPerMinute = $maxRequestsPerMinute;
$this->getdata = $getdata;
$this->apirequest = $apirequest;
}
public function run() {
while ($data = call_user_func($this->getdata)) {
if ($this->requestCount >= $this->maxRequestsPerMinute) {
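// sleep() expects a non-negative integer; skip it if the window has already passed
$wait = (int) ceil($this->firstRequestTime + 60 - microtime(true));
if ($wait > 0) {
sleep($wait);
}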
$this->firstRequestTime = null;
$this->requestCount = 0;
}
if ($this->firstRequestTime === null) {
$this->firstRequestTime = microtime(true);
}
++$this->requestCount;
call_user_func($this->apirequest, $data);
}
}
}
$throttler = new Throttler(100, 'getdata', 'apirequest');
$throttler->run();
UPD. I've put its updated version on Packagist so you can use it with Composer: https://packagist.org/packages/ob-ivan/throttler
To install:
composer require ob-ivan/throttler
To use:
use Ob_Ivan\Throttler\JobInterface;
use Ob_Ivan\Throttler\Throttler;
class SalmanJob implements JobInterface {
private $data;
public function next(): bool {
$this->data = getdata();
return (bool)$this->data;
}
public function execute() {
apirequest($this->data);
}
}
$throttler = new Throttler(100, 60);
$throttler->run(new SalmanJob());
Please note there are other packages providing the same functionality (I haven't tested any of them):
https://packagist.org/packages/franzip/throttler
https://packagist.org/packages/andrey-mashukov/throttler
https://packagist.org/packages/queryyetsimple/throttler
I would start by recording the initial time when the first request is made, and then count how many requests are made. Once 100 requests (the limit) have been made, make sure the current time is at least 1 minute after the initial time. If not, usleep for however long is left until the minute is up. When the minute is reached, reset the count and the initial time.
Here is my go at this:
define("MAX_REQUESTS_PER_MINUTE", 100);
function apirequest() {
static $startingTime;
static $requestCount;
if ($startingTime === null) {
$startingTime = time();
}
if ($requestCount === null) {
$requestCount = 0;
}
$consumedTime = time() - $startingTime;
if ($consumedTime >= 60) {
$startingTime = time();
$requestCount = 0;
} elseif ($requestCount === MAX_REQUESTS_PER_MINUTE) {
sleep(60 - $consumedTime);
$startingTime = time();
$requestCount = 0;
}
$requestCount++;
echo sprintf("Request %3d, Range [%d, %d)", $requestCount, $startingTime, $startingTime + 60) . PHP_EOL;
file_get_contents("http://localhost/apirequest.php");
// the above script sleeps for 200-400ms
}
for ($i = 0; $i < 1000; $i++) {
apirequest();
}
I've tried the naive solutions of static sleeps, counting requests, and doing simple math, but they tended to be quite inaccurate and unreliable, and generally introduced far more sleeping than was necessary when they could have been doing work. What you want is something that only starts issuing consequential sleeps when you're approaching your rate limit.
Lifting my solution from a previous problem for those sweet, sweet internet points:
I used some math to figure out a function that would sleep for the correct sum of time over the given request, and allow me to ramp it up exponentially towards the end.
If we express the sleep as:
y = e^( (x-A)/B )
where A and B are arbitrary values controlling the shape of the curve, then the sum of all sleeps, M, from 0 to N requests would be:
M = ∫_0^N e^( (x-A)/B ) dx
This is equivalent to:
M = B * e^(-A/B) * ( e^(N/B) - 1 )
and can be solved with respect to A as:
A = B * ln( -1 * (B - B * e^(N/B)) / M )
While solving for B would be far more useful, since specifying A lets you define at what point the graph ramps up aggressively, the solution to that is mathematically complex, and I've not been able to solve it myself or find anyone else who can.
/**
* @param int $period M, window size in seconds
* @param int $limit N, number of requests permitted in the window
* @param int $used x, current request number
* @param int $bias B, "bias" value
*/
protected static function ratelimit($period, $limit, $used, $bias=20) {
$period = $period * pow(10,6);
$sleep = pow(M_E, ($used - self::biasCoeff($period, $limit, $bias))/$bias);
usleep((int) $sleep); // usleep() takes whole microseconds
}
protected static function biasCoeff($period, $limit, $bias) {
$key = sprintf('%s-%s-%s', $period, $limit, $bias);
if( ! key_exists($key, self::$_bcache) ) {
self::$_bcache[$key] = $bias * log( -1 * ( ($bias - $bias * pow(M_E, $limit/$bias)) / $period ) );
}
return self::$_bcache[$key];
}
With a bit of tinkering I've found that B = 20 seems to be a decent default, though I have no mathematical basis for it. Something something slope mumble mumble exponential bs bs.
Also, if anyone wants to solve that equation for B for me I've got a bounty up on math.stackexchange.
Though I believe that our situations differ slightly in that my API provider's responses all included the number of available API calls, and the number still remaining within the window. You may need additional code to track this on your side instead.
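The closed form above can be checked numerically. The sketch below works in seconds rather than the microseconds used by ratelimit(), and the values M = 60, N = 100 are assumed example inputs; biasOffset and sleepFor are illustrative names, not part of the original class.

```php
<?php
// Offset A chosen so that the sleeps over requests 0..N integrate to M
// (the same expression as in biasCoeff above).
function biasOffset(float $M, int $N, float $B): float
{
    return $B * log(($B * exp($N / $B) - $B) / $M);
}

// Sleep assigned to request number $x: y = e^((x - A) / B).
function sleepFor(int $x, float $A, float $B): float
{
    return exp(($x - $A) / $B);
}

$M = 60.0; // window in seconds (assumed example value)
$N = 100;  // requests permitted per window (assumed example value)
$B = 20.0; // the default bias suggested in the answer

$A = biasOffset($M, $N, $B);

// The closed-form integral B * e^(-A/B) * (e^(N/B) - 1) should give back M.
$integral = $B * exp(-$A / $B) * (exp($N / $B) - 1);

// The per-request sleeps form a discrete sum, so they only approximate M.
$sum = 0.0;
for ($x = 0; $x < $N; $x++) {
    $sum += sleepFor($x, $A, $B);
}

printf("A = %.3f, integral = %.3f, discrete sum = %.3f\n", $A, $integral, $sum);
```

Because the sleep grows with each request, summing it at the start of every step lands slightly below the integral, so the total time slept comes out a little under the window rather than over it.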

php RRD graph separation

I am trying to create RRD graphs with PHP in order to keep track of the inoctets, outoctets and counter of a server.
So far the script operates as expected, but my problem comes when I try to produce two or more separate graphs (hourly, weekly, etc.). I thought a loop would solve my problem, since I have split the RRAs into hours and days. Unfortunately I end up with two graphs that update simultaneously, as expected, but plot the same thing. Has anyone encountered a similar problem? I have written the same program in Perl with RRD::Simple, where it is extremely easy and everything is adjusted almost automatically.
Below is a working example of my code with the minimum possible data, because the code is a bit long:
<?php
$file = "snmp-2";
$rrdFile = dirname(__FILE__) . "/snmp-2.rrd";
$in = "ifInOctets";
$out = "ifOutOctets";
$count = "sysUpTime";
$step = 5;
$rounds = 1;
$output = array("Hourly","Daily");
while (1) {
sleep (6);
$options = array(
"--start","now -15s", // Now -10 seconds (default)
"--step", "".$step."",
"DS:".$in.":GAUGE:10:U:U",
"DS:".$out.":GAUGE:10:U:U",
"DS:".$count.":ABSOLUTE:10:0:4294967295",
"RRA:MIN:0.5:12:60",
"RRA:MAX:0.5:12:60",
"RRA:LAST:0.5:12:60",
"RRA:AVERAGE:0.5:12:60",
"RRA:MIN:0.5:300:60",
"RRA:MAX:0.5:300:60",
"RRA:LAST:0.5:300:60",
"RRA:AVERAGE:0.5:300:60",
);
if ( !isset( $create ) ) {
$create = rrd_create(
"".$rrdFile."",
$options
);
if ( $create === FALSE ) {
echo "Creation error: ".rrd_error()."\n";
}
}
$t = time();
$ifInOctets = rand(0, 4294967295);
$ifOutOctets = rand(0, 4294967295);
$sysUpTime = rand(0, 4294967295);
$update = rrd_update(
"".$rrdFile."",
array(
"".$t.":".$ifInOctets.":".$ifOutOctets.":".$sysUpTime.""
)
);
if ($update === FALSE) {
echo "Update error: ".rrd_error()."\n";
}
$start = $t - ($step * $rounds);
foreach ($output as $test) {
$final = array(
"--start","".$start." -15s",
"--end", "".$t."",
"--step","".$step."",
"--title=".$file." RRD::Graph",
"--vertical-label=Byte(s)/sec",
"--right-axis-label=latency(min.)",
"--alt-y-grid", "--rigid",
"--width", "800", "--height", "500",
"--lower-limit=0",
"--alt-autoscale-max",
"--no-gridfit",
"--slope-mode",
"DEF:".$in."_def=".$file.".rrd:".$in.":AVERAGE",
"DEF:".$out."_def=".$file.".rrd:".$out.":AVERAGE",
"DEF:".$count."_def=".$file.".rrd:".$count.":AVERAGE",
"CDEF:inbytes=".$in."_def,8,/",
"CDEF:outbytes=".$out."_def,8,/",
"CDEF:counter=".$count."_def,8,/",
"COMMENT:\\n",
"LINE2:".$in."_def#FF0000:".$in."",
"COMMENT:\\n",
"LINE2:".$out."_def#0000FF:".$out."",
"COMMENT:\\n",
"LINE2:".$count."_def#FFFF00:".$count."",
);
$outputPngFile = rrd_graph(
"".$test.".png",
$final
);
if ($outputPngFile === FALSE) {
echo "<b>Graph error: </b>".rrd_error()."\n";
}
} /* End of foreach function */
$debug = rrd_lastupdate (
"".$rrdFile.""
);
if ($debug === FALSE) {
echo "<b>Graph result error: </b>".rrd_error()."\n";
}
var_dump ($debug);
$rounds++;
} /* End of while loop */
?>
A couple of issues.
Firstly, your definition of the RRD has a step of 5 seconds and RRAs with steps of 12x5s=1min and 300x5s=25min. They also have a length of only 60 rows, so they cover 1hr and 25hr respectively. You'll never get a weekly graph this way! You need to add more rows; also, the step seems rather short, and you might need a smaller-step RRA for hourly graphs and a larger-step one for weekly graphs.
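The coverage arithmetic can be written out as a short sketch. The function names are made up for illustration; the numbers are the ones from the question's RRA definitions.

```php
<?php
// Time span covered by one RRA: base step * steps per row * number of rows.
function rraCoverageSeconds(int $step, int $stepsPerRow, int $rows): int
{
    return $step * $stepsPerRow * $rows;
}

// Rows needed for an RRA of the given resolution to span $seconds.
function rraRowsFor(int $step, int $stepsPerRow, int $seconds): int
{
    return (int) ceil($seconds / ($step * $stepsPerRow));
}

$step = 5; // seconds, as in the question

echo rraCoverageSeconds($step, 12, 60) / 3600, " h\n";  // RRA:...:12:60 covers 1 h
echo rraCoverageSeconds($step, 300, 60) / 3600, " h\n"; // RRA:...:300:60 covers 25 h
echo rraRowsFor($step, 300, 7 * 86400), " rows\n";      // rows for one week at 25-minute resolution
```

For a week at the 25-minute resolution this gives 404 rows, compared with the 60 currently defined.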
Secondly, it is not clear how you're calling the graph function. You seem to be specifying:
"--start","".$start." -15s",
"--end", "".$t."",
"--step","".$step."",
... which would force it to use the 5s interval (unavailable, so the 1min one would always get used) and for the graph to be only for the time window from the start to the last update, not a 'hourly' or 'daily' as you were asking.
Note that the RRAs you have defined do not define the time window of the graph you are asking for. Also, just because you have more than one RRA defined, it doesn't mean you'll get more than one graph unless you call the graph function twice with different arguments.
If you want an hourly graph, use
"--start","end - 1 hour",
"--end",$t,
Do not specify a step as the most appropriate available will be used anyway. For a daily graph, use
"--start","end - 1 day",
"--end",$t,
Similarly, no need to specify a step.
Hopefully this will make it a little clearer. Most of the RRD graph options have sensible defaults, and RRDTool is pretty good at picking the correct RRA to use based on the graph size, time window, and DEF statements.
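As a rough sketch of calling the graph function once per time window: the helper graphOptions below is hypothetical, the file and data source names are taken from the question, and the drawing step is skipped unless the rrd extension and the RRD file are present.

```php
<?php
// Hypothetical helper: build the rrd_graph() option list for one time window.
// No --step is passed, so RRDtool is left to pick the RRA itself.
function graphOptions(string $rrdFile, string $ds, string $window, int $end): array
{
    return [
        '--start', 'end-' . $window,
        '--end', (string) $end,
        '--title=' . $ds . ' (' . $window . ')',
        'DEF:' . $ds . '_def=' . $rrdFile . ':' . $ds . ':AVERAGE',
        'LINE2:' . $ds . '_def#FF0000:' . $ds,
    ];
}

// One entry per output image, each with its own window.
$windows = ['Hourly' => '1h', 'Daily' => '1d'];

foreach ($windows as $name => $window) {
    $options = graphOptions('snmp-2.rrd', 'ifInOctets', $window, time());
    // Only attempt to draw when the rrd extension and the file are available.
    if (function_exists('rrd_graph') && is_file('snmp-2.rrd')) {
        if (rrd_graph($name . '.png', $options) === false) {
            echo 'Graph error: ' . rrd_error() . "\n";
        }
    }
}
```

The only thing that differs between the two calls is the start of the window, which is what makes the two images plot different ranges.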

pthreads stopping already running thread once a condition has been met

I'm still relatively new to PHP and trying to use pthreads to solve an issue. I have 20 threads running processes that end at varying times. Most finish in under 10 seconds or so. I don't need all 20, just 10 detected. Once I get to 10, I would like to kill the remaining threads, or continue on to the next step.
I have tried using set_time_limit to about 20 seconds for each of the threads, but they ignore it and keep running. I am looping through the jobs looking for the join because I didn't want the rest of the program to run but I'm stuck until the slowest one has finished. While pthreads has reduced the time from around a minute to about 30 seconds, I can shave even more time since the first 10 run in about 3 seconds.
Thanks for any help and here is my code:
$count = 0;
foreach ( $array as $i ) {
$imgName = $this->smsId."_$count.jpg";
$name = "LocalCDN/".$imgName;
$stack[] = new AsyncImageModify($i['largePic'], $name);
$count++;
}
// Run the threads
foreach ( $stack as $t ) {
$t->start();
}
// Check if the threads have finished; push the coordinates into an array
foreach ( $stack as $t ) {
if($t->join()){
array_push($this->imgArray, $t->data);
}
}
class AsyncImageModify extends \Thread{
public $data;
public function __construct($arg, $name) {
$this->arg = $arg;
$this->name = $name;
}
public function run() {
//tried putting the set_time_limit() here, didn't work
if ($this->arg) {
// Get the image
$didWeGetTheImage = Image::getImage($this->arg, $this->name);
if($didWeGetTheImage){
$timestamp1 = microtime(true);
print_r("Starting face detection $this->arg" . "\n");
print_r(" ");
$j = Image::process1($this->name);
if($j){
// lets go ahead and do our image manipulation at this point
$userPic = Image::process2($this->name, $this->name, 200, 200, false, $this->name, $j);
if($userPic){
$this->data = $userPic;
print_r("Back from process2; the image returned is $userPic");
}
}
$endTime = microtime(true);
$td = $endTime-$timestamp1;
print_r("Finished face detection $this->arg in $td seconds" . "\n");
print_r($j);
}
}
}
}
It is difficult to guess the functionality of Image::* methods, so I can't really answer in any detail.
What I can say, is that there are very few machines I can think of that are suitable to run 20 concurrent threads in any case. A more suitable setup would be the worker/stackable model. A Worker thread is a reuseable context, and can execute task after task, implemented as Stackables; execution in a multi-threaded environment should always use the least amount of threads to get the most work done possible.
Please see pooling example and other examples that are distributed with pthreads, available on github, additionally, much information regarding usage is contained in past bug reports, if you are still struggling after that ...

Improving HTML scraper efficiency with pcntl_fork()

With the help of two previous questions, I now have a working HTML scraper that feeds product information into a database. What I am now trying to do is improve efficiency by wrapping my brain around getting my scraper working with pcntl_fork.
If I split my php5-cli script into 10 separate chunks, I improve total runtime by a large factor so I know I am not i/o or cpu bound but just limited by the linear nature of my scraping functions.
Using code I've cobbled together from multiple sources, I have this working test:
<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0);
ini_set('max_input_time', 0);
set_time_limit(0);
$hrefArray = array("http://slashdot.org", "http://slashdot.org", "http://slashdot.org", "http://slashdot.org");
function doDomStuff($singleHref,$childPid) {
$html = new DOMDocument();
$html->loadHtmlFile($singleHref);
$xPath = new DOMXPath($html);
$domQuery = '//div[@id="slogan"]/h2';
$domReturn = $xPath->query($domQuery);
foreach($domReturn as $return) {
$slogan = $return->nodeValue;
echo "Child PID #" . $childPid . " says: " . $slogan . "\n";
}
}
$pids = array();
foreach ($hrefArray as $singleHref) {
$pid = pcntl_fork();
if ($pid == -1) {
die("Couldn't fork, error!");
} elseif ($pid > 0) {
// We are the parent
$pids[] = $pid;
} else {
// We are the child
$childPid = posix_getpid();
doDomStuff($singleHref,$childPid);
exit(0);
}
}
foreach ($pids as $pid) {
pcntl_waitpid($pid, $status);
}
// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();
Which raises the following questions:
1) Given my hrefArray contains 4 urls - if the array was to contain say 1,000 product urls this code would spawn 1,000 child processes? If so, what is the best way to limit the amount of processes to say 10, and again 1,000 urls as an example split the child work load to 100 products per child (10 x 100).
2) I've learned that pcntl_fork creates a copy of the process and all variables, classes, etc. What I would like to do is replace my hrefArray variable with a DOMDocument query that builds the list of products to scrape, and then feed them off to child processes to do the processing, spreading the load across 10 child workers.
My brain is telling me I need to do something like the following (obviously this doesn't work, so don't run it):
<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0);
ini_set('max_input_time', 0);
set_time_limit(0);
$maxChildWorkers = 10;
$html = new DOMDocument();
$html->loadHtmlFile('http://xxxx');
$xPath = new DOMXPath($html);
$domQuery = '//div[@id="productDetail"]/a';
$domReturn = $xPath->query($domQuery);
$hrefsArray[] = $domReturn->getAttribute('href');
function doDomStuff($singleHref) {
// Do stuff here with each product
}
// To figure out: Split href array into $maxChilderWorks # of workArray1, workArray2 ... workArray10.
$pids = array();
foreach ($workArray(1,2,3 ... 10) as $singleHref) {
$pid = pcntl_fork();
if ($pid == -1) {
die("Couldn't fork, error!");
} elseif ($pid > 0) {
// We are the parent
$pids[] = $pid;
} else {
// We are the child
$childPid = posix_getpid();
doDomStuff($singleHref);
exit(0);
}
}
foreach ($pids as $pid) {
pcntl_waitpid($pid, $status);
}
// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();
But what I can't figure out is how to build my hrefsArray[] in the master/parent process only and feed it off to the child process. Currently everything I've tried causes loops in the child processes. I.e. my hrefsArray gets built in the master, and in each subsequent child process.
I am sure I am going about this all totally wrong, so would greatly appreciate just general nudge in the right direction.
Introduction
pcntl_fork() is not the only way to improve the performance of an HTML scraper. While it might be a good idea to use a message queue, as Charles suggested, you still need a fast, effective way to pull those requests in your workers.
Solution 1
Use curl_multi_init. curl is actually faster, and using multi curl gives you parallel processing.
From the PHP docs:
curl_multi_init allows the processing of multiple cURL handles in parallel.
So instead of using $html->loadHtmlFile('http://xxxx'); to load the files one at a time, you can use curl_multi_init to load multiple URLs at the same time.
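A minimal sketch of that idea, assuming the curl extension is installed. The function name fetchAll and the 30-second timeout are illustrative choices, not something the answer specifies.

```php
<?php
// Hypothetical helper: fetch several URLs concurrently.
// Returns the response bodies under the same keys as the input array.
function fetchAll(array $urls, int $timeout = 30): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$i] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity rather than spinning
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $i => $ch) {
        $bodies[$i] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    // The handles are released when they go out of scope.
    return $bodies;
}
```

Each returned body could then be handed to DOMDocument::loadHTML() and queried with DOMXPath as in the question, so the downloads overlap while the parsing stays sequential.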
Here are some Interesting Implementations
php - Fastest way to check presence of text in many domains (above 1000)
php get all the images from url which width and height >=200 more quicker
How to prevent server from overloading during Curl requests in PHP
Solution 2
You can use pthreads to use multi-threading in PHP
Example
// Number of threads you want
$threads = 10;
// Threads storage
$ts = array();
// Your list of URLS // range just for demo
$urls = range(1, 50);
// Group Urls
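// Round the group size up so there are never more groups than threads, and never a size of 0
$urlsGroup = array_chunk($urls, max(1, (int) ceil(count($urls) / $threads)));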
printf("%s:PROCESS #load\n", date("g:i:s"));
$name = range("A", "Z");
$i = 0;
foreach ( $urlsGroup as $group ) {
$ts[] = new AsyncScraper($group, $name[$i ++]);
}
printf("%s:PROCESS #join\n", date("g:i:s"));
// wait for all Threads to complete
foreach ( $ts as $t ) {
$t->join();
}
printf("%s:PROCESS #finish\n", date("g:i:s"));
Output
9:18:00:PROCESS #load
9:18:00:START #5592 A
9:18:00:START #9620 B
9:18:00:START #11684 C
9:18:00:START #11156 D
9:18:00:START #11216 E
9:18:00:START #11568 F
9:18:00:START #2920 G
9:18:00:START #10296 H
9:18:00:START #11696 I
9:18:00:PROCESS #join
9:18:00:START #6692 J
9:18:01:END #9620 B
9:18:01:END #11216 E
9:18:01:END #10296 H
9:18:02:END #2920 G
9:18:02:END #11696 I
9:18:04:END #5592 A
9:18:04:END #11568 F
9:18:04:END #6692 J
9:18:05:END #11684 C
9:18:05:END #11156 D
9:18:05:PROCESS #finish
Class Used
class AsyncScraper extends Thread {
public function __construct(array $urls, $name) {
$this->urls = $urls;
$this->name = $name;
$this->start();
}
public function run() {
printf("%s:START #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
if ($this->urls) {
// Load with CURL
// Parse with DOM
// Do some work
sleep(mt_rand(1, 5));
}
printf("%s:END #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
}
}
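The same grouping idea can also be applied to the pcntl_fork approach from the question, which is not what this answer proposes but addresses its first point directly. The sketch below assumes the pcntl extension and a CLI run; splitWork and runInChildren are hypothetical names. The work list is built once in the parent before any fork, and each child exits after its own group so it never loops back into forking.

```php
<?php
// Hypothetical helper: split a work list into at most $workers groups of near-equal size.
function splitWork(array $items, int $workers): array
{
    if ($items === [] || $workers < 1) {
        return [];
    }
    return array_chunk($items, (int) ceil(count($items) / $workers));
}

// Hypothetical helper: fork one child per group; each child handles only its own group.
function runInChildren(array $items, int $workers, callable $job): void
{
    $pids = [];
    foreach (splitWork($items, $workers) as $group) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            fwrite(STDERR, "Couldn't fork\n");
            break;
        }
        if ($pid === 0) {
            // Child: work through this group, then exit so it never re-enters the loop.
            foreach ($group as $item) {
                $job($item);
            }
            exit(0);
        }
        $pids[] = $pid; // Parent: remember the child and carry on forking.
    }
    // Parent: wait for every child that was started.
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
    }
}
```

With 1,000 URLs and 10 workers this gives 10 children with 100 URLs each, for example runInChildren($hrefsArray, 10, 'doDomStuff').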
It seems like I suggest this daily, but have you looked at Gearman? There's even a well documented PECL class.
Gearman is a work queue system. You'd create workers that connect and listen for jobs, and clients that connect and send jobs. The client can either wait for the requested job to be completed, or it can fire it and forget. At your option, workers can even send back status updates, and how far through the process they are.
In other words, you get the benefits of multiple processes or threads, without having to worry about processes and threads. The clients and workers can even be on different machines.
