PHP Parallel Processing for a Metasearch Engine

I have developed a metasearch engine and one of the optimisations I would like to make is to process the search APIs in parallel. Imagine that results are retrieved from Search Engine A in 0.24 seconds, SE B in 0.45 seconds and SE C in 0.5 seconds. With other overheads the metasearch engine can return aggregated results in about 1.5 seconds, which is viable. What I would like to do now is send those requests in parallel rather than in series, as at present, and get that time down to under a second. I have investigated exec, forking and threading, and all of them, for various reasons, have failed. I have only spent a day or two on this, so I may have missed something. Ideally I would like to implement this on a WAMP stack on my development machine (localhost) and look at deploying it to a Linux web server afterwards. Any help appreciated.
Let's take a simple example: say we have two files we want to run simultaneously. File 1:
<?php
// file1.php
echo 'File 1 - Test 1'.PHP_EOL;
$sleep = mt_rand(1, 5);
echo 'Start Time: '.date("g:i:sa").PHP_EOL;
echo 'Sleep Time: '.$sleep.' seconds.'.PHP_EOL;
sleep($sleep);
echo 'Finish Time: '.date("g:i:sa").PHP_EOL;
?>
Now imagine file two is the same... the idea is that, if they run in parallel, both files should report the same start time in the command-line output. For example:
File 1 - Test 1
Start Time: 9:30:43am
Sleep Time: 4 seconds.
Finish Time: 9:30:47am
But whether I use exec, popen or whatever, I just cannot get this to work in PHP!

I would use socket_select(). That way only the connection time is cumulative, as you can read from the sockets in parallel. This will give you a big performance boost.
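A minimal sketch of the idea, using stream_select() (the stream-based twin of socket_select()) and placeholder hosts; the connects below are blocking and happen in series - the "cumulative" part - while the responses are read in parallel:
<?php
// Placeholder search endpoints; swap in the real API hosts and paths.
$targets = array(
    'a' => array('host' => 'engine-a.example.com', 'path' => '/search?q=php'),
    'b' => array('host' => 'engine-b.example.com', 'path' => '/search?q=php'),
);
$streams = array();
$results = array();
foreach ($targets as $id => $t) {
    $s = stream_socket_client('tcp://' . $t['host'] . ':80', $errno, $errstr, 5);
    if ($s === false) {
        continue;
    }
    fwrite($s, "GET {$t['path']} HTTP/1.0\r\nHost: {$t['host']}\r\n\r\n");
    stream_set_blocking($s, false);
    $streams[$id] = $s;
    $results[$id] = '';
}
// Wait on all sockets at once and read from whichever has data.
while ($streams) {
    $read = array_values($streams);
    $write = $except = null;
    if (stream_select($read, $write, $except, 5) === false) {
        break;
    }
    foreach ($read as $s) {
        $id = array_search($s, $streams, true);
        $chunk = fread($s, 8192);
        if ($chunk === '' || $chunk === false) {
            if (feof($s)) {
                fclose($s);
                unset($streams[$id]);
            }
        } else {
            $results[$id] .= $chunk;
        }
    }
}
// $results now holds the raw HTTP responses, fetched concurrently.
?>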

There is one viable approach: make a CLI PHP file that takes what it has to do as arguments and prints whatever result is produced, serialized.
In your main app you can popen() as many of these workers as you need and then collect their output in a simple loop:
[edit] I used your worker example; I just had to chmod +x it and add a #!/usr/bin/php line at the top:
#!/usr/bin/php
<?php
echo 'File 1 - Test 1'.PHP_EOL;
$sleep = mt_rand(1, 5);
echo 'Start Time: '.date("g:i:sa").PHP_EOL;
echo 'Sleep Time: '.$sleep.' seconds.'.PHP_EOL;
sleep($sleep);
echo 'Finish Time: '.date("g:i:sa").PHP_EOL;
?>
I also modified the run script a little bit - ex.php:
#!/usr/bin/php
<?php
$pha = array();
$res = array();

// popen() returns immediately, so both workers start at (almost) the same time.
$pha[1] = popen("./file1.php", "r");
$res[1] = '';
$pha[2] = popen("./file2.php", "r");
$res[2] = '';

// Collect each worker's output in turn (each() is deprecated; use foreach).
foreach ($pha as $id => $ph) {
    while (!feof($ph)) {
        $res[$id] .= fread($ph, 8192);
    }
    pclose($ph);
}

echo $res[1].$res[2];
Here is the result when tested from the CLI (it's the same when ex.php is called from the web, but then the paths to file1.php and file2.php must be absolute):
$ time ./ex.php
File 1 - Test 1
Start Time: 11:00:33am
Sleep Time: 3 seconds.
Finish Time: 11:00:36am
File 2 - Test 1
Start Time: 11:00:33am
Sleep Time: 4 seconds.
Finish Time: 11:00:37am
real 0m4.062s
user 0m0.040s
sys 0m0.036s
As seen in the result, one script takes 3 seconds to execute and the other takes 4; run in parallel, together they finish in 4 seconds.
[end edit]
In this way the slow operations run in parallel and you only collect the results serially.
In total it takes (slowest worker time) + (collection time) to execute. Since the time to collect and unserialize the results is negligible, you get all the data in roughly the time of the slowest request.
As a side note, you may try the igbinary serializer (a PECL extension), which is much faster than the built-in one.
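If the extension is installed, swapping it in is mechanical (a sketch; $result stands for whatever the worker produced):
$packed = igbinary_serialize($result);    // instead of serialize($result)
$result = igbinary_unserialize($packed);  // instead of unserialize($packed)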
As noted in comments:
worker.php is executed outside of the web request, so you have to pass all of its state via arguments. Argument passing brings escaping and security concerns, so a simple (if not especially efficient) way to handle them is to base64-encode the payload.
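A sketch of that base64 route (worker.php and run_search() are placeholders, not from the original answer):
<?php
// Parent: pack the task, start the worker, read back the serialized result.
$task = array('engine' => 'A', 'query' => 'metasearch');
$arg  = base64_encode(serialize($task));
$ph   = popen('./worker.php ' . escapeshellarg($arg), 'r');
$out  = stream_get_contents($ph);
pclose($ph);
$result = unserialize($out);
// Worker (worker.php): unpack the argument, do the job, print the result.
//   $task = unserialize(base64_decode($argv[1]));
//   echo serialize(run_search($task));
?>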
A major drawback in this approach is that it is not easy to debug.
It can be further improved by using stream_select() instead of blocking fread() calls, so the data is also collected in parallel.
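The shape of that improvement, sketched against the two workers above (note that stream_select() on pipes is reliable on Unix-like systems but not on Windows):
// After the two popen() calls in ex.php:
foreach ($pha as $ph) {
    stream_set_blocking($ph, false);
}
while ($pha) {
    $read = array_values($pha);
    $write = $except = null;
    stream_select($read, $write, $except, 5);   // wait on BOTH pipes at once
    foreach ($read as $ph) {
        $id = array_search($ph, $pha, true);
        $chunk = fread($ph, 8192);
        if ($chunk !== '' && $chunk !== false) {
            $res[$id] .= $chunk;
        } elseif (feof($ph)) {
            pclose($ph);
            unset($pha[$id]);
        }
    }
}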

Related

Performance degrades when php getmxrr() is called from inside shell for loop

I noticed a big performance difference when I tried to fetch the MX records of gmail.com 100,000 times using PHP and using a shell script.
The PHP script takes around 1.5 minutes:
<?php
$time = time();
for ($i = 1; $i <= 100000; $i++) {
    getmxrr('gmail.com', $hosts, $mxweights);
    unset($hosts, $mxweights);
}
$runtime = time() - $time;
echo "Time Taken : $runtime Sec.";
?>
But the same thing done inside a shell for loop is almost 10 times slower:
time for i in {1..100000}; do (php -r 'getmxrr("gmail.com", $mxhosts, $mxweight);');done
I am curious to know the reasons why the shell script takes so much more time to complete exactly the same thing that the PHP script does very quickly.
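One thing worth noting (my observation, not from the original post): the shell loop pays PHP's interpreter start-up cost on every one of the 100,000 iterations, while the single long-running PHP script pays it only once. That cost can be sampled on its own:
time php -r ''    # an empty program: everything measured is start-up overhead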

trying to fix a crontab file duplicate by PID table

I'm trying to develop a crontab task that checks my email every 5 seconds. Normally I could request it every minute instead of every 5 seconds, but while reading other posts with no solution I found one with the same problem as mine: the script stops after a period of time. This is not a real problem, because I can configure a crontab task and use sleep(5). I also have the same 1and1 server as the other question, which I include here.
PHP script stops running arbitrarily with no errors
The real problem I had when trying to solve this via crontab is that every minute a new PID was created, so within an hour I could have almost 50 processes running at the same time, all doing the same thing.
Here I include the .php file called by crontab every minute:
<?php
date_default_timezone_set('Europe/Madrid');
require_once($_SERVER['DOCUMENT_ROOT'] . '/folder1/path.php');
require_once(CLASSES . 'Builder.php');

$UIModules = Builder::getUIModules();
$UIModules->getfile();
So I looked for a solution by checking the process table. The idea: if two PHP processes show up, the previous one is still working, so the new one should simply exit without doing anything; if only one shows up, the previous one has finished and the new one can do the work. The approach looks something like this:
// Note: exec() returns only the LAST line of output, as a string; to capture
// every line, pass an output array: exec("ps -A | grep php", $lines);
$var_aux = exec("ps -A | grep php");
if (!is_array($var_aux)) {   // "isarray" was a typo; this test is always true
    date_default_timezone_set('Europe/Madrid');
    require_once($_SERVER['DOCUMENT_ROOT'] . '/folder1/path.php');
    require_once(CLASSES . 'Builder.php');
    $UIModules = Builder::getUIModules();
    $UIModules->getfile();
}
I'm not sure about the is_array($var_aux) condition, because $var_aux only ever holds the last line of the ps output, a string of 28 characters in my case. Since we want to detect more than one process, the condition could instead be something like if (strlen($var_aux) < 34). Note: I've given the length some margin, because PIDs can go beyond 9999, which adds one more character.
The main problem I found is that the exec call only gives me the last process line; in other words, it always returns a 28-character string (the line for the script itself).
I don't know if what I've proposed is a crazy idea, but is it possible to get the whole process table with PHP?
You can use a much simpler solution than emulating crontab in PHP: use crontab itself.
Make multiple entries that all run every minute, staggered with sleep, so your PHP program is called every 5 seconds.
A good description of how to set up crontab to perform subminute action can be found here:
https://usu.li/how-to-run-a-cron-job-every-x-seconds
This solution requires at most 12 processes started per minute.
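The linked approach boils down to staggered entries like these (paths here are placeholders):
* * * * * /usr/bin/php /path/to/check_mail.php
* * * * * sleep 5; /usr/bin/php /path/to/check_mail.php
* * * * * sleep 10; /usr/bin/php /path/to/check_mail.php
... and so on through sleep 55, for 12 entries in total.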

php timeout - set_time_limit(0); - doesn't work

I'm having a problem with a PHP file that takes more than 30 seconds to execute.
After searching, I added set_time_limit(0); at the start of the code, but the file still times out with a 500 error after 30 seconds.
log: PHP Fatal error: Maximum execution time of 30 seconds exceeded in /xxx/xx/xxx.php
safe-mode : off
Check the php.ini, or raise the limit from the script itself:
ini_set('max_execution_time', 300); // 300 seconds = 5 minutes
ini_set('max_execution_time', 0);   // 0 = no limit
This is an old thread, but I thought I would post this link, as it helped me quite a bit with this issue. Essentially, the server configuration can override the PHP configuration. From the article:
For example mod_fastcgi has an option called "-idle-timeout" which controls the idle time of the script. So if the script does not output anything to the fastcgi handler for that many seconds then fastcgi would terminate it. The setup is somewhat like this:
Apache <-> mod_fastcgi <-> php processes
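For instance, the idle timeout could be raised with a line like the following in the Apache configuration (a sketch based on the mod_fastcgi documentation; verify the exact directive against your setup):
FastCgiConfig -idle-timeout 300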
The article has other examples and further explanation. Hope this helps somebody else.
I usually call set_time_limit(30) within the main loop, so each loop iteration is limited to 30 seconds rather than the whole script.
I do this in multiple database update scripts, which routinely take several minutes to complete but less than a second for each iteration - keeping the 30 second limit means the script won't get stuck in an infinite loop if I am stupid enough to create one.
I must admit that my choice of 30 seconds for the limit is somewhat arbitrary - my scripts could actually get away with 2 seconds instead, but I feel more comfortable with 30 seconds given the actual application - of course you could use whatever value you feel is suitable.
Hope this helps!
Use this:
ini_set('max_execution_time', 300);
Check out this note from the PHP manual; it may help you:
If you're using the PHP CLI SAPI and getting the error "Maximum execution time of N seconds exceeded", where N is an integer value, try to call set_time_limit(0) every M seconds or every iteration. For example:
<?php
require_once('db.php');        // provides $db and $sql (not shown here)
$stmt = $db->query($sql);
while ($row = $stmt->fetchRow()) {
    set_time_limit(0);         // reset the timer on every iteration
    // your code here
}
?>
I think you must tell PHP the time limit for execution; try this:
ini_set('max_execution_time', 0);

Build a time out to load the page slower

Is there a way to make the loading of a page go slower? Some processes happen too fast to get a grip on, and I would like to watch them a bit more slowly.
Is there anything I can do to slow down the loading time of a page?
I need this because there is one CSS selector on which I need to change something, but I can't catch it with Firebug, because the page loads too fast.
You can just use sleep() in PHP to make it delay the loading.
Here is an example from the PHP Manual:
<?php
// current time
echo date('h:i:s') . "\n";
// sleep for 10 seconds
sleep(10);
// wake up !
echo date('h:i:s') . "\n";
?>
http://uk.php.net/sleep
You can use sleep($seconds), but I suspect your application design needs improvement if you have to rely on it...
Solution 1 (seconds based)
You could use
sleep($seconds);
where $seconds, as the variable name suggests, is the number of seconds the script has to wait.
Solution 2 (microseconds based)
You can also use
usleep($microseconds);
to delay the execution in microseconds instead of seconds.
References
sleep()
usleep()
sleep().

how big is the performance impact of using system('hostname') in PHP?

I saw some existing code in the PHP scripts doing a
system('hostname');
How big can the performance impact on the server be when using this method?
Running external processes can be a real performance hit when you have thousands of clients trying to connect to your web server. That's why people ended up ditching CGI (common gateway interface, the act of web servers calling external processes to dynamically create content) and incorporating code directly into their web servers, such as mod_perl.
You won't notice it when you're testing your little web application at home but, when the hordes that make up the Internet swarm down to your site, it will collapse under the load.
You'd be far better off trying to figure out a way to cache this information within PHP itself (how often does it change, really?). For your particular example, you could use the php_uname('n') call to retrieve the full name (e.g., "localhost.example.com") and (optionally) strip off the domain part, but I've assumed you want the question answered in a more general sense.
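A minimal sketch of that caching idea (the helper name is mine, not from the original answer):
<?php
function cached_hostname() {
    static $name = null;
    if ($name === null) {
        $name = php_uname('n');   // looked up once per process, no fork/exec
    }
    return $name;
}
echo cached_hostname() . "\n";
?>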
Update:
Since someone has requested benchmarks, here's a C program that does three loops of 1,000 iterations each. The first does nothing within the loop, the second gets an environment variable (another possibility for having PHP get its hostname), and the third runs a system() command to execute hostname:
#include <stdio.h>
#include <stdlib.h>   /* for getenv() and system() */
#include <time.h>

#define LOOP1 1000

int main (int argc, char *argv[]) {
    time_t t1, t2, t3, t4;
    int i;
    t1 = time(NULL);
    for (i = 0; i < LOOP1; i++) {
        /* empty loop: measurement baseline */
    }
    t2 = time(NULL);
    for (i = 0; i < LOOP1; i++) {
        getenv("xxhostname");             /* environment lookup */
    }
    t3 = time(NULL);
    for (i = 0; i < LOOP1; i++) {
        system("hostname >/dev/null");    /* spawn an external process */
    }
    t4 = time(NULL);
    printf("Loop 1 took %d seconds\n", (int)(t2 - t1));
    printf("Loop 2 took %d seconds\n", (int)(t3 - t2));
    printf("Loop 3 took %d seconds\n", (int)(t4 - t3));
    return 0;
}
The results are:
Cygwin (gcc):
Loop 1 took 0 seconds
Loop 2 took 0 seconds
Loop 3 took 103 seconds
Linux on System z (gcc):
Loop 1 took 0 seconds
Loop 2 took 0 seconds
Loop 3 took 5 seconds
Linux on Intel (gcc):
Loop 1 took 0 seconds
Loop 2 took 0 seconds
Loop 3 took 5 seconds
Linux on Power (gcc):
Loop 1 took 0 seconds
Loop 2 took 0 seconds
Loop 3 took 4 seconds
Windows on Intel (VS2008, and using "ver >nul:", not "hostname"):
Loop 1 took 0 seconds
Loop 2 took 0 seconds
Loop 3 took 45 seconds
However you slice'n'dice it, that's quite a discrepancy on loop number 3. It probably won't cause any problems if you're getting one hit a week on your site but, if you hold any hope of surviving in the real world under load, you'd be best to avoid system() calls as much as possible.
Errm, you are aware of php_uname and posix_uname, right?
<?php
echo "php_uname: " . php_uname('n') . "\n";
$ar = posix_uname();
echo "posix_uname: $ar[nodename]\n";
?>
should both work. In PHP 5.3, there is also gethostname().
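And the 5.3 variant, for completeness:
<?php
echo "gethostname: " . gethostname() . "\n";
?>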
I used to have a site that was using exec, system and passthru in PHP to execute grep, sed and other tools on text files.
This made it so that each page view would result in not just one process, but two at once. I ran into problems with the process limit on my shared hosting - more than 6 processes at once, and people got 503 errors.
This wasn't a problem until the site became popular. I had to rewrite the page to use PHP functions instead of calling external programs, and it was faster, and fixed the 503 errors. This might not be a problem if you have a less busy site, or a dedicated/virtual server.
