How to design the backend to put requests into a queue? - php

I want to design a system where users can submit files. After a file is submitted I will run some scripts on it. I want to process the files in order, so I want to maintain a queue of requests. How can I do this with PHP? Is there any open source library for this?
Thanks!

I would use a database.
When a user submits a file, add a record pointing to its location in the file system to the database.
Have a cron job that checks for new submissions and processes them in order; when a file is done, mark it as processed in the database.
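A minimal sketch of that approach, assuming a hypothetical uploads table with id, path and processed columns and a processFile() helper (none of these names come from the answer):

// run from cron, e.g. every minute: * * * * * php /path/to/process_queue.php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// fetch unprocessed uploads oldest-first so they are handled in submission order
$rows = $pdo->query("SELECT id, path FROM uploads WHERE processed = 0 ORDER BY id")
            ->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    processFile($row['path']); // your script that works on the submitted file

    // mark it as processed so the next cron run skips it
    $pdo->prepare("UPDATE uploads SET processed = 1 WHERE id = ?")
        ->execute([$row['id']]);
}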

I would use Redis. Redis is a very fast key-value store; its response time is usually in the double-digit microseconds (10-99 microseconds).
Redis transactions are atomic (a transaction either happens or it doesn't), and you can have a job constantly running without using cron.
To use Redis with PHP, you can use Predis.
Once Redis is installed and Predis is set up to work with your script, I would do something like this when the file is uploaded:
// 'hostname' and 'port' is the hostname and the port
// where Redis is installed and listening to.
$client = new Predis\Client('tcp://hostname:port');
// assuming the path to the file is stored in $pathToFile
$client->lpush('queue:files', $pathToFile);
Then the script that needs to work with the files just needs to do something like:
$client = new Predis\Client('tcp://hostname:port');
while (true) {
    // pop the oldest file path off the queue; rpop returns null when the list is empty
    $pathToFile = $client->rpop('queue:files');
    if (!$pathToFile) {
        // list is empty; sleep briefly so we don't hammer Redis, then check again.
        usleep(100000);
        continue;
    }
    // there was something in the list, do whatever you need to do with it.
    // if there's an exception or an error, you can always use break; or exit; to terminate the loop.
}
Take into consideration that PHP tends to use a lot of memory, so I would explicitly collect garbage (via gc_enable() and gc_collect_cycles()) and unset() variables as you go.
Alternatively, you can use software like supervisord to have the script process just one pass and exit, and start it up again as soon as it ends.
In general, I would stay away from using a database and cron to implement queues. It can lead to serious problems.
Assume, for example, you use a table as a queue.
In the first run, your script pulls a job from the database and starts doing its thing.
Then for some reason, your script takes longer to run, and the cron job kicks in again, and now you have 2 scripts working the same file. This could either have no consequences, or could have serious consequences. That depends on what your application is actually doing.
So unless you're working with a very small dataset, and you know for sure that each run will finish before the next cron invocation kicks in, so there will be no collisions, you should be fine. Otherwise, stay away from that approach.

Third-party library? This is too simple to need an entire library. You could use Redis (see the answer by AlanChavez) if you want to waste time and resources and then have to worry about garbage collection, when the real solution is not to bring garbage into the mix in the first place.
Your queue is a text file. When a file is uploaded, the name of the file is appended to the queue.
$q= fopen('queue.txt','a');
The 'a' mode is important. It automatically moves the write pointer to the end of the file for append writes. But the reason it really matters here is that if the file does not exist, a new one is created.
fwrite($q,"$filename\n");
fclose($q);
If there are simultaneous append writes to this file, the OS will arbitrate the conflict without error. No need for file locking, cooperative multitasking, or transactional processing.
When your script that processes the queue begins to run, it renames the live queue to a working queue.
if (!file_exists('q.txt')) {
    if (!file_exists('queue.txt')) { exit; }
    rename('queue.txt', 'q.txt');
    $q = fopen('q.txt', 'r');
    while (($filename = fgets($q, 4096)) !== false) {
        process(rtrim($filename, "\n")); // fgets keeps the trailing newline, so strip it
    }
    fclose($q);
    unlink('q.txt');
}
else {
    echo 'Houston, we have a problem';
}
Now you see why the 'a' mode was important. We rename the queue and when the next upload occurs, the queue.txt is automatically created.
If the file is being written to as it is being renamed, the OS will sort it out without error. The rename is so fast that the chance of a simultaneous write is astronomically small. And it is a basic OS feature to sort out file system contention. No need for file locking, cooperative multitasking, or transactional processing.
This is a bullet proof method I have been using for many years.
Replace the Apollo 13 quote with an error recovery routine. If q.txt exists the previous processing did not complete successfully.
That was too easy.
Because it is so simple, and we have plenty of memory to spare since we are so efficient, let's have some fun.
Let's see if writing to the queue is faster than AlanChavez's "super fast" Redis with its double-digit-microsecond response time.
The time in seconds to add a file name to the queue = 0.000014537, or about 14.5 µs. That is at the fast end of Redis's quoted 10-99 µs response time, with no extra server to run and no garbage to worry about.

Related

Keeping a file handler open over a long period of time in PHP

I am using PHP 7.1 (if it makes any difference), and I am looking for possible issues with opening a file handler with fopen, but closing that file handler a long time after (potentially 30+ minutes).
The scenario comes down to a long running script that logs its actions periodically. Right now, I use a simple process of using file_put_contents with FILE_APPEND. It works. But it slows down over time as the file gets bigger and bigger.
Using fopen and fwrite means I can avoid that slow-down. The only issue is that I wouldn't be calling fclose until the end of the script's execution. I assume constantly fopen/fclose'ing the file would give me the same poor performance as my current implementation.
Any one have experience with the internals in this regard? I have no need to read the file, only write.
EDIT:
More info: I am currently running this in a Linux VM on my laptop (VirtualBox). I do not expect top notch performance from this setup. That being said, I still notice the slow-down as the log file gets bigger and bigger.
My code logic is as simple as:
while (true)
{
    $result = someFunction();
    if (!$result)
    {
        logSomething();
    }
    else
    {
        break;
    }
}
The frequency of writes is several times a second.
And files can become several GBs in size.
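For reference, a minimal sketch of the approach the question describes (open the handle once, fwrite on every iteration, fclose at the end); someFunction() and the logging behaviour come from the snippet above, and the log path is an assumption:

$log = fopen('/var/log/myscript.log', 'a'); // open once, in append mode; path is illustrative
if ($log === false) {
    die('could not open log file');
}
while (true)
{
    $result = someFunction();
    if (!$result)
    {
        // reuse the already-open handle instead of reopening the file on every write
        fwrite($log, date('c') . " something happened\n");
    }
    else
    {
        break;
    }
}
fclose($log); // close once, at the end of the run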

How can I lock parts of the code in an inter-request manner?

Consider a PHP script (possibly calling functions in other scripts). I want to make sure some part of it can only be executed by one request at a time. For example:
doSomething();
doSomethingElse(); // Lock this: Can only be executed by one request at a time
yetAnotherThing();
So if request A is currently 'inside' doSomethingElse(), I want any further requests to be queued before the line of code calling this function.
I haven't found a solution to this online, because I'm talking about a lock between requests, not a lock for separate threads executing as part of the same request. I am using an Apache server.
You would need to guard the execution by setting up a flag to indicate what should happen or not happen.
You can store the guard status in any storage that is persistent across requests: database, session, flat file...
The most basic thing you could do is to write a flag file.
This will exclude all subsequent requests from processing doSomethingElse(), while the file exists. But, when the file is gone, the next request will exec doSomethingElse() again.
You might use flock() (http://php.net/manual/en/function.flock.php)
or your own flat-file locking approach. Just for the concept:
Add file_put_contents(__DIR__.'/doSomethingElse.processing.flag', 'processing');
at the start of the doSomethingElse() and remove it at the end of the function.
Then wrap the execution into a condition check:
doSomething();
if (!is_file(__DIR__.'/doSomethingElse.processing.flag')) {
    doSomethingElse();
}
yetAnotherThing();
Building a Queue
Well, you could expand the given idea or use a prepared library/tool for the job.
For building a "queue" you would need to expand the lock-idea and:
add a processing ID,
add a stack to lookup the current lowest ID for processing,
and the lookup and processing logic itself
including "locking" for the resource processed
A queue locks the resource to process and only allows the lowest ID.
The lock is often called a semaphore. (It could actually be the highest ID; that depends on your processing logic - it's basically LIFO or FIFO processing.)
Create a queue and put the job in, then add a worker running as a cron job or daemon. The worker takes jobs off the queue, processes them and writes back the result with the status flag "done". You might then periodically poll the queue to see if jobs have finished. You can use a database for the queue; pick one that supports locking.
while (1) {
    begin new transaction;
    remove item from queue;
    process item;
    save new state of item;
    commit;
}
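A minimal PHP sketch of that loop, assuming a MySQL/InnoDB jobs table with id, payload and done columns and a processItem() helper, using SELECT ... FOR UPDATE inside the transaction for the locking (all names are illustrative):

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

while (true) {
    $pdo->beginTransaction();
    // lock the oldest unfinished job so no other worker can grab it at the same time
    $job = $pdo->query("SELECT id, payload FROM jobs WHERE done = 0 ORDER BY id LIMIT 1 FOR UPDATE")
               ->fetch(PDO::FETCH_ASSOC);
    if (!$job) {
        $pdo->commit();
        sleep(5); // queue is empty, wait a bit before polling again
        continue;
    }
    processItem($job['payload']); // your processing logic
    $pdo->prepare("UPDATE jobs SET done = 1 WHERE id = ?")->execute([$job['id']]);
    $pdo->commit(); // releases the row lock
}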
Not sure where you are heading, but you have a lot of options to implement it:
For a file based queue-ing mechanism see this basic tutorial: http://squirrelshaterobots.com/programming/php/building-a-queue-server-in-php-part-1-understanding-the-project/
You could rely on SPLQueue and combine it with the locking idea.
PHP has support for semaphores, too: http://php.net/manual/en/ref.sem.php
Then there are real job-queue systems like Gearman, Beanstalk, Redis or any message queue, like RabbitMQ, ZeroMQ.
See the gearman examples http://php.net/manual/en/gearman.examples-reverse.php
http://laravel.com/docs/5.1/queues
You can put the doSomethingElse() function in another file and then use flock (http://php.net/manual/en/function.flock.php) on that file to prevent simultaneous access.
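A minimal sketch of that idea, using a dedicated lock file next to the script (the file name is illustrative). Because flock() with LOCK_EX blocks, later requests wait their turn instead of being skipped, which matches the "queued" behaviour asked for:

doSomething();

$lock = fopen(__DIR__ . '/doSomethingElse.lock', 'c'); // 'c' creates the file if it does not exist
if (flock($lock, LOCK_EX)) {   // blocks until no other request holds the lock
    doSomethingElse();         // only one request at a time executes this
    flock($lock, LOCK_UN);     // release so the next waiting request can proceed
}
fclose($lock);

yetAnotherThing();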

Running a PHP app on windows - daemon or cron?

I need some implementation advice. I have a MySQL DB that will be written to remotely with tasks to process locally, and I need my application, which is written in PHP, to execute these tasks immediately as they come in.
But of course my PHP app needs to be told when to run. I thought about using cron jobs, but my app is on a Windows machine. Secondly, I need to be checking constantly, every few seconds, and cron can only run every minute.
I thought of writing a PHP daemon, but I am getting confused about how it's going to work and whether it's even a good idea!
I would appreciate any advice on the best way to do this.
pyCron is a good CRON alternative for Windows:
Since this task is quite simple I would just set up pyCron to run the following script every minute:
set_time_limit(60);        // one minute, same as CRON ;)
ignore_user_abort(false);  // you might wanna set this to true

while (true)
{
    $jobs = getPendingJobs();

    if ((is_array($jobs) === true) && (count($jobs) > 0))
    {
        foreach ($jobs as $job)
        {
            if (executeJob($job) === true)
            {
                markCompleted($job);
            }
        }
    }

    sleep(1); // avoid eating unnecessary CPU cycles
}
This way, if the computer goes down, you'll have a worst case delay of 60 seconds.
You might also want to look into semaphores or some kind of locking strategy like using an APC variable or checking for the existence of a locking file to avoid race conditions, using APC for example:
set_time_limit(60);        // one minute, same as CRON ;)
ignore_user_abort(false);  // you might wanna set this to true

if (apc_exists('lock') === false) // not locked
{
    apc_add('lock', true, 60); // lock with a TTL of 60 secs, same as set_time_limit

    while (true)
    {
        $jobs = getPendingJobs();

        if ((is_array($jobs) === true) && (count($jobs) > 0))
        {
            foreach ($jobs as $job)
            {
                if (executeJob($job) === true)
                {
                    markCompleted($job);
                }
            }
        }

        sleep(1); // avoid eating unnecessary CPU cycles
    }
}
If you're set on the PHP daemon idea, do yourself a favor and drop it; use Gearman instead.
EDIT: I asked a related question once that might interest you: Anatomy of a Distributed System in PHP.
I'll suggest something out of the ordinary: you said you need to run the task at the point the data is written to MySQL. That implies MySQL "knows" something should be executed.
It sounds like perfect scenario for MySQL's UDF sys_exec.
Basically, it would be nice if MySQL could invoke an external program once something happened to it.
If you use the mentioned UDF, you can execute a php script from within - let's say, INSERT or UPDATE trigger.
On the other hand, you can make it more resource-friendly and create a MySQL event (assuming you're using an appropriate version) that uses sys_exec to invoke a PHP script that does certain updates at predefined intervals - that removes the need for cron or any similar program that executes something at predefined intervals.
I would definitely not advise using cron jobs for this.
Cron jobs are a good thing, very useful and easy for many purposes, but as you describe your needs, I think they can produce more complications than good. Here are some things to consider:
What happens if jobs overlap, or one takes longer than a minute to execute? Are there any shared resources, deadlocks or temp files? The most common method is to use a lock file and stop execution right at the start of the program if it is occupied, but then the program also has to look for further jobs right before it completes. This can also get complicated on Windows machines because, AFAIK, they don't support write locks out of the box.
Cron jobs are a pain in the ass to maintain. If you want to monitor them you have to implement additional logic, like a check for when the program last ran. This can get difficult if your program should run only on demand. The best way would be some sort of "job completed" field in the database, or deleting rows that have been processed.
On most Unix-based systems cron jobs are pretty stable now, but there are a lot of situations where you can break your cron job setup, most of them based on human error. For example, a sysadmin not exiting the crontab editor properly in edit mode can cause all cron jobs to be deleted. A lot of companies also have no proper monitoring system, for the reasons stated above, and only notice once their services experience problems. At that point often nobody has written down or put under version control which cron jobs should run, and wild guessing and reconstruction work begins.
Cron job maintenance can be further complicated when external tools are used and the environment is not a native Unix system. Sysadmins have to learn more programs, each of which brings its own potential errors.
I honestly think just a small script that you start from the console and leave running is just fine.
<?php
while (true) {
    $job = fetch_from_db();
    if (!$job) {
        sleep(10);
    } else {
        $job->process();
    }
}
You can also touch a file (update its modification timestamp) in every loop, and write a Nagios check that alerts when that timestamp gets out of date, so you know your job is still running...
If you want it to start up with the system, I recommend a daemon.
PS: At the company where I work there is a lot of background activity for our website (crawling, update processes, calculations, etc.), and the cron jobs were a real mess when I started there. They were spread over different servers responsible for different tasks. Databases were accessed wildly across the internet. A ton of NFS filesystems, Samba shares, etc. were in place to share resources. The place was full of single points of failure and bottlenecks, and something constantly broke. There were so many technologies involved that it was very difficult to maintain, and when something didn't work it took hours of tracking down the problem and another hour of figuring out what that part was even supposed to do.
Now we have one unified update program that is responsible for literally everything. It runs on several servers, and they have a config file that defines the jobs to run. Everything gets dispatched from one parent process running an infinite loop. It's easy to monitor, customize and synchronize, and everything runs smoothly. It is redundant, it is synchronized and the granularity is fine, so it runs in parallel and we can scale up to as many servers as we like.
I really suggest sitting down for enough time to think about everything as a whole and get a picture of the complete system, then investing the time and effort to implement a solution that will serve you well in the future and doesn't spread tons of different programs throughout your system.
PPS:
I read a lot about the minimum interval of 1/5 minutes for cron jobs/tasks. You can easily work around that with a wrapper script that subdivides that interval:
// run every 5 minutes = 300 secs
// desired interval: 30 secs
$interval = 30;
$runs = 300 / $interval; // the parent interval needs to be a multiple of the desired interval
for ($i = 0; $i < $runs; $i++) {
    $start = time();
    system('myscript.php');
    // compensate for the time the script needed to run; you still have to implement some
    // logic for cases where the script takes longer than your interval (problem described above)
    $wait = $interval - (time() - $start);
    if ($wait > 0) { sleep($wait); }
}
This looks like a job for a job server ;) Have a look at Gearman. The additional benefit of this approach is that it is triggered by the remote side, when and only when there is something to do, instead of polling. Especially at intervals smaller than (let's say) 5 minutes, polling is not very effective any more, depending on the tasks the job performs.
The quick and dirty way is to create a loop that continuously checks if there is new work.
Pseudo-code:
ini_set("max_execution_time", "3600000000");
$keeplooping = true;
while ($keeplooping) {
    if (check_for_work()) {
        process_work();
    } else {
        sleep(5);
    }
    // some way to change $keeplooping to false
    // you don't want to just kill the process, because it might still be doing something
}
Have you tried Windows Task Scheduler (it comes with Windows by default)? You will need to provide the path to the PHP executable and the path to your PHP file. It works well.
Can't you just write a Java/C++ program that polls for you at a set time interval? You can include it in the list of startup programs so it's always running. Once a task is found, it can even handle it on a separate thread while processing more requests and marking others complete.
The simplest way is to use the built-in Windows Task Scheduler.
Run your script with the PHP CLI executable and a php.ini that loads the extensions your script needs.
But I should say that in practice you don't need such a short time interval to run your scheduled jobs. Just run some tests to find the best interval for your case. It is not recommended to set an interval of less than 1 minute.
And another little piece of advice: create a lock file at the beginning of your script (PHP's flock function) and check whether you can lock it for writing, to prevent two or more copies working at the same time; at the end of your script, unlink it after unlocking.
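A minimal sketch of that guard, with an illustrative lock file name:

$lockFile = fopen(__DIR__ . '/myscript.lock', 'c'); // 'c' creates the file if it is missing
if (!flock($lockFile, LOCK_EX | LOCK_NB)) {
    // another copy already holds the lock, so bail out instead of running twice
    exit;
}

// ... do the scheduled work here ...

flock($lockFile, LOCK_UN);              // unlock first
fclose($lockFile);
unlink(__DIR__ . '/myscript.lock');     // then remove the lock file, as suggested above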
If you have to write the output result to the DB, try to use MySQL TRIGGERS instead of PHP, or use MySQL events.

Processing large amounts of data in PHP without a browser timeout

I have an array of mobile numbers, around 50,000. I'm trying to process them and send bulk SMS to these numbers using a third-party API, but the browser freezes for several minutes. I'm looking for a better option.
Processing the data involves checking the mobile number type (e.g. CDMA), assigning unique IDs to all the numbers for further referencing, checking for network/country-specific charges, etc.
I thought of queuing the data in the database and using cron to send about 5k by batch every minute, but that will take time if there are many messages. What are my other options?
I'm using Codeigniter 2 on XAMPP server.
I would write two scripts:
File index.php:
<iframe src="job.php" frameborder="0" scrolling="no" width="1" height="1"></iframe>
<script type="text/javascript">
function progress(percent){
document.getElementById('done').innerHTML=percent+'%';
}
</script><div id="done">0%</div>
File job.php:
set_time_limit(0);          // ignore php timeout
ignore_user_abort(true);    // keep on going even if user pulls the plug*
while (ob_get_level()) ob_end_clean(); // remove output buffers
ob_implicit_flush(true);    // output stuff directly
// * This absolutely depends on whether you want the user to stop the process
//   or not. For example: You might create a stop button in index.php like so:
//   Stop!
//   Start
//   But of course, you will need that line of code commented out for this feature to work.
function progress($percent){
    echo '<script type="text/javascript">parent.progress('.$percent.');</script>';
}
$total = count($mobiles);
echo '<!DOCTYPE html><html><head></head><body>'; // webkit hotfix
foreach ($mobiles as $i => $mobile) {
    // send sms
    progress($i / $total * 100);
}
progress(100);
echo '</body></html>'; // webkit hotfix
I'm assuming these numbers are in a database, if so you should add a new column titled isSent (or whatever you fancy).
This next paragraph you typed should be queued and possibly done nightly/weekly/whenever appropriate. Unless you have a specific reason to, it shouldn't be done in bulk on demand. You can even add a column to the DB recording when each number was last checked, so that if a number hasn't been checked in at least X days you can perform a check on that number on demand.
Processing of the data involves checking mobile number type (e.g CDMA), assigning unique ids to all the numbers for further referencing, check for network/country unique charges, etc.
But that still leads you back to the same question of how to do this for 50,000 numbers at once. Since you mentioned cron jobs, I'm assuming you have SSH access to your server which means you don't need a browser. These cron jobs can be executed via the command line as such:
/usr/bin/php /home/username/example.com/myscript.php
My recommendation is to process 1,000 numbers at a time every 10 minutes via cron and to time how long this takes, then save it to a DB. Since you're using a cron job, it doesn't seem like these are time-sensitive SMS messages so they can be spread out. Once you know how long it took for this script to run 50 times (50*1000 = 50k) then you can update your cron job to run more/less frequently.
$time_start = microtime(true);
set_time_limit(0);
doSendSMS($phoneNum, $msg, $blah); // send this batch of messages via the third-party API
$time_end = microtime(true);
$time = $time_end - $time_start;
saveTimeRequiredToSendMessagesInDB($time);
Also, you might have noticed set_time_limit(0); this tells PHP not to time out after the default 30 seconds. If you are able to modify the php.ini file then you don't need this line of code, but even then I would recommend not changing the setting globally, since you probably want other pages to time out.
http://php.net/manual/en/function.set-time-limit.php
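A minimal sketch of one such cron batch, assuming the isSent column suggested above plus id and number columns in a recipients table, and a doSendSMS() wrapper around the third-party API (all names are illustrative):

$time_start = microtime(true);
set_time_limit(0);

$pdo = new PDO('mysql:host=localhost;dbname=sms', 'user', 'pass');

// grab the next 1,000 numbers that have not been sent yet
$rows = $pdo->query("SELECT id, number FROM recipients WHERE isSent = 0 ORDER BY id LIMIT 1000")
            ->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    doSendSMS($row['number'], 'your message here', null); // third-party API wrapper
    $pdo->prepare("UPDATE recipients SET isSent = 1 WHERE id = ?")->execute([$row['id']]);
}

saveTimeRequiredToSendMessagesInDB(microtime(true) - $time_start);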
If this isn't a one-off type of situation, consider engineering a better solution.
What you basically want is a queue that your browser-bound process can write to, and that 1-N worker processes can read from and update.
Putting work in the queue should be rather inexpensive - perhaps a bunch of simple INSERT statements to a SQL RDBMS.
Then you can have a daemon or two (or 100, distributed across multiple servers) that read from the queue and process stuff. You'll want to be careful here and avoid two workers taking on the same task, but that's not hard to code around.
So your browser-bound workflow is: click some button that causes a bunch of stuff to get added to the queue, then redirect to some "queue status" interface, where the user can watch the system chew through all their work.
A system like this is nice, because it's easy to scale horizontally quite a ways.
EDIT: Christian Sciberras' answer is going in this direction, except the browser ends up driving both sides (it adds to the queue, then drives the worker process)
A cron job would be your best bet; I don't see why it would take any longer than doing it in the browser if your only problem at the moment is the browser timing out.
If you insist on doing it via the browser, then the other solution would be doing it in batches of, say, 1,000 and redirecting to the same script, with a reference in a $_GET variable to where it got up to last time.
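A minimal sketch of that redirect-batching idea, assuming a hypothetical loadNumbers() helper that returns the full array and a sendSMS() wrapper for the API:

$batchSize = 1000;
$offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;

$numbers = loadNumbers();                          // e.g. pulled from the database
$batch   = array_slice($numbers, $offset, $batchSize);

foreach ($batch as $number) {
    sendSMS($number, 'your message here');         // third-party API wrapper
}

if ($offset + $batchSize < count($numbers)) {
    // hand off to the next batch before the browser times out
    header('Location: ' . $_SERVER['PHP_SELF'] . '?offset=' . ($offset + $batchSize));
    exit;
}

echo 'All messages sent.';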

php multithreading, mysql

I have a php script which I use to make about 1 mil. requests every day to a specific web service.
The problem is that in a "normal" workflow the script works almost the whole day to complete the job.
Therefore I've worked on an additional component. Basically I developed a script which accesses the main script using a multi-curl GET request to generate a random tempid for each 500 records, and finally makes another multi-curl request using POST with all the generated tempids.
However, I don't feel this is the right way to do it, so I would like some advice/solutions for adding multithreading capabilities to the main script without using additional/external applications (e.g. the curl script that I'm currently using).
Here is the main script : http://pastebin.com/rUQ6pwGS
If you want to do it right you should install a message queue. My preference goes to Redis because it is a "data structure server since keys can contain strings, hashes, lists, sets and sorted sets". Also, Redis is extremely fast.
Use blpop (spawning a couple of worker processes with php <yourscript> to process work concurrently) to listen for new messages (work), and rpush to push new messages onto the queue. Spawning processes is (relatively) expensive, and with a message queue it only has to be done once, when the worker is created.
I would go for phpredis if you can (you need to be able to install/compile the extension) because it is an extension written in C and therefore a lot faster than the pure PHP clients. Otherwise, Predis is also a pretty mature library you could use.
You could also use this brpop/rpush as some sort of lock (if you need to). This is because:
Multiple clients can block for the same key. They are put into a queue, so the first to be served will be the one that started to wait earlier, in a first-BLPOP first-served fashion.
I would advise you to have a look at Simon's redis tutorial to get an impression of the sheer power that redis has to offer.
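A minimal sketch of such a worker using Predis (the queue name, the 30-second timeout and the processJob() helper are illustrative):

$client = new Predis\Client('tcp://127.0.0.1:6379');

while (true) {
    // BLPOP blocks for up to 30 seconds waiting for work instead of busy-polling;
    // it returns array(queueName, payload), or null when the timeout expires
    $item = $client->blpop('queue:jobs', 30);
    if ($item === null) {
        continue; // timed out with no work, go back to waiting
    }
    processJob($item[1]); // your processing logic for one unit of work
}

The producing side just does $client->rpush('queue:jobs', $payload); for each unit of work, and you can start several copies of the worker to process jobs concurrently.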
This is a background process, correct? In that case, you should not run it via a web server. Run it from the command line, either as a daemon or as a cron job.
My preference is a "cron" job because you get automatic restart for free. Be sure that you don't have more instances of the program running than desired (You can achieve this by locking a file in the filesystem, doing something atomic in a database etc).
Then you just need to start the number of processes you want, and have them read work from a queue.
Normally the pattern for doing this is to have a table containing columns that record who is currently executing a given task:
CREATE TABLE sometasks (
ID of some kind,
Other info required to do task,
some data we need to know if the task is due yet or complete,
locked_by_host VARCHAR(64) NULL,
locked_by_pid INT NULL
)
Then the process will do the following pseudo-query to lock a set of tasks (batch_size is how many per batch; it can be 1):
UPDATE sometasks SET locked_by_host=my_hostname, locked_by_pid=my_pid
WHERE not_done_already AND locked_by_host IS NULL ORDER BY ID LIMIT batch_size
Then select the rows back out using a select to find the current process's tasks. Then process the tasks, and update them as being "done" and clear out the lock.
I'd opt for a cron job with a controller process which starts up N child processes and monitors them. The child processes could periodically die (remember PHP does not have good GC, so it can easily leak memory) and be respawned to prevent resource leaks.
If the work is all done, the parent could quit, and wait to be respawned by cron (the next hour or something).
NB: locked_by_host can store the host name (PIDs aren't unique across hosts) to allow for distributed processing, but maybe you don't need that, so you can omit it.
You can make this design more robust by adding a locked_time column and detecting when a task has been taking too long - you can alert, kill the process, and try again or something.
