Temporary storage for collecting data prior to sending

I'm working on a Composer package for PHP apps. The goal is to send some data after requests, queue jobs, and other actions. My initial (and working) idea is to use register_shutdown_function to do it. There are a couple of issues with this approach, firstly, this increases the page response time, meaning that there's the overhead of computing the request, plus sending the data via my API. Another issue is that long-running processes, such as queue workers, do not execute this method for a long time, therefore there might be massive gaps between when the data was created and when it's sent and processed.
My thought is that I could use some sort of temporary storage to store the data and have a cronjob to send it every minute. The only issue I can see with this approach is managing concurrency under high I/O. Because many processes will be writing to the file every (n) ms, there's an issue with reading the file and removing lines that have already been sent.
Another option, which I'm desperately trying to avoid, is using the client's database. This could potentially cause performance issues.
What would be the preferred way to do this?
Edit: the package is essentially a monitoring agent.

"There are a couple of issues with this approach, firstly, this increases the page response time, meaning that there's the overhead of computing the request, plus sending the data via my API"
I'm not sure you can get around this: doing more work within the context of a web request always adds overhead. I feel that a job-queue-based (asynchronous) system minimizes this for the client. Whether you choose a local file system write or a socket write, you'll have that extra overhead, but you'll be able to return to the client immediately and not block on the processing of that request.
"Another issue is that long-running processes, such as queue workers, do not execute this method for a long time, therefore there might be massive gaps between when the data was created and when it's sent and processed."
Isn't this the whole point? :p The idea is to return to your client immediately and then complete the job asynchronously at some point in the future. Using a job queue allows you to decouple your worker pool from your webservers and scale them separately. Your webservers can be pretty lean because the heavy lifting is deferred to the workers.
"My thought is that I could use some sort of temporary storage to store the data and have a cronjob to send it every minute."
I would definitely recommend looking at a job queue as opposed to rolling your own. This is pretty much a solved problem, and there are many extremely popular open source projects that handle it (any of the MQs). Will the minute cron job be doing the computation for the client? How do you scale that? If a file has 1000 entries, or you scale 10x and it has 10000, will you be able to do all those computations in less than a minute? What happens if a server dies? How do you recover? What about inter-process concurrency? Will you need to manage locks for each process? Will you use a separate file for each process and each minute to bucket events? What happens if you want runs more often than once a minute?
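To make the locking questions above concrete, here is a minimal sketch of the file-based approach from the question. The helper names spool_append() and spool_drain() and the JSON-lines format are assumptions made for illustration. Writers and the cron-side reader take the same exclusive lock, so a line is never removed while it is being written.

```php
<?php
// Sketch only: spool_append(), spool_drain(), and the JSON-lines format are
// illustrative assumptions, not part of any existing package.

/** Append one event as a JSON line while holding an exclusive lock. */
function spool_append(string $spoolFile, array $event): void
{
    $line = json_encode($event, JSON_THROW_ON_ERROR) . "\n";
    // LOCK_EX serializes concurrent writers; FILE_APPEND writes at the end.
    if (file_put_contents($spoolFile, $line, FILE_APPEND | LOCK_EX) === false) {
        throw new RuntimeException("Could not write to $spoolFile");
    }
}

/** Read and clear the spool under the same lock, returning decoded events. */
function spool_drain(string $spoolFile): array
{
    // 'c+' opens for read/write, creates the file if missing, never truncates.
    $handle = fopen($spoolFile, 'c+');
    if ($handle === false) {
        throw new RuntimeException("Could not open $spoolFile");
    }
    try {
        if (!flock($handle, LOCK_EX)) {
            throw new RuntimeException("Could not lock $spoolFile");
        }
        $contents = (string) stream_get_contents($handle);
        ftruncate($handle, 0); // writers blocked on the lock append after this
        flock($handle, LOCK_UN);
    } finally {
        fclose($handle);
    }

    $events = [];
    foreach (explode("\n", $contents) as $line) {
        if ($line !== '') {
            $events[] = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
        }
    }
    return $events;
}
```

Note that this sketch clears the file before anything is sent, so a crash between draining and sending loses that batch. That is exactly the kind of durability gap a real job queue is designed to close.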
Durability Guarantees
What sort of guarantees are you offering your clients? If a request returns, can the client be sure that the job is persisted and that it will be completed at some time in the future?
I would definitely recommend choosing a worker queue and having your webserver processes write to it. It's an extremely common problem, with many resources on how to scale it and with clear durability and performance guarantees.

Related

Managing workers with RabbitMQ

I have implemented RabbitMQ in my current PHP application to handle asynchronous jobs that are processed by workers. My current problem is how to monitor the workers and scale them up or down. I also want to add error handling in case all the workers die. I have thought of the following two ways but don't know which one is better:
At the producer end, I would check the RabbitMQ queue size. If the queue size (the list of pending tasks) is above a threshold, I would create one new worker every time the producer script executes, but before that I would check the server load (using the Linux command uptime). A new worker would be created only if the server load is below a threshold. At the consumer end (in worker.php), I would apply the same method to scale up the workers, and I would also check whether the script has been idle for a given time (i.e. there is no pending task in the RabbitMQ queue), in which case it would automatically die (to automate scaling down of workers).
The second method is to use a background process or cron to monitor the workers and scale them up or down. But I don't want to rely on cron (as I have had very bad experiences with it) or on a background process, because if the background process crashes for some reason, there is no way to recover from it.
Please help.

I wouldn't recommend bothering to scale them down to nothing when there's no work to be done. The worker that's left (if you want to scale back to 1) will simply wait for something else to consume and it's not an expensive operation.
In terms of determining whether to scale up, I'd recommend leveraging the RabbitMQ Management HTTP API (http://hg.rabbitmq.com/rabbitmq-management/raw-file/3646dee55e02/priv/www-api/help.html). You can use the queue related aspects via a GET operation to get information about queues, including how many entries are currently waiting to be processed.
With that info, you can decide to scale if it either hits a certain threshold, or keeps increasing with every check for a certain amount of time, or something similar. This can be done from the consumer side.
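As a rough illustration of that decision, the sketch below takes a history of queue depths and applies both rules: a hard threshold, or growth on every one of the last few checks. The function names and sample numbers are assumptions. The depth is read from the `messages` field of the JSON object the management API returns for a single queue; the HTTP call itself is left out.

```php
<?php
// Sketch only: function names and thresholds are illustrative assumptions.

/** Pull the pending-message count out of a management API queue response. */
function queue_depth_from_json(string $json): int
{
    $queue = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    return (int) ($queue['messages'] ?? 0);
}

/**
 * Decide whether to add a worker.
 *
 * @param int[] $depthSamples queue depths from successive checks, oldest first
 * @param int   $threshold    scale up as soon as the latest depth reaches this
 * @param int   $growthChecks scale up if depth grew on each of the last N checks
 */
function should_scale_up(array $depthSamples, int $threshold, int $growthChecks): bool
{
    if ($depthSamples === [] || $growthChecks < 1) {
        return false;
    }
    $latest = $depthSamples[count($depthSamples) - 1];
    if ($latest >= $threshold) {
        return true;
    }
    // N consecutive increases need N + 1 samples.
    $window = array_slice($depthSamples, -($growthChecks + 1));
    if (count($window) < $growthChecks + 1) {
        return false;
    }
    for ($i = 1; $i < count($window); $i++) {
        if ($window[$i] <= $window[$i - 1]) {
            return false;
        }
    }
    return true;
}
```

A consumer could call should_scale_up() after each poll and keep only the most recent samples, so that one short burst does not start extra workers.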
In terms of error handling, I would recommend encapsulating the RabbitMQ connection aspect of your workers such that if a RabbitMQ exception occurs the connection is re-established from scratch and continues.
If it's a more serious type of exception that isn't RabbitMQ-related, you may need to catch it at such a level where the worker basically spawns a new worker before it dies. Then of course there are other types of exceptions (out of memory conditions, for example), where it really isn't feasible to try to continue and your program should just completely die.

It is very difficult to answer your question with any degree of accuracy since there are many aspects of the context which are not included.
How long do the tasks take to execute?
Why do you want to scale up/down? Why don't you have threads waiting for load in the first place?
That being said, coming from the world of Erlang (the functional language that powers RabbitMQ), I would like to suggest the concept of a SUPERVISOR thread. This thread would have the following responsibilities:
Spawn threads depending on the load/qty of requests
Discard threads depending on the load/qty of requests
Monitor the child threads and re-launch them as required, reprocessing the same messages if necessary or discarding them
The Supervisor thread should be as simple as possible and should be built in such a way that it simply loops, sleeps, and checks whether all the threads that need to be alive actually are. It can then check the load and spawn or kill off workers as needed.
You could easily use an exchange to send messages to both the supervisor and the worker queues. The supervisor would then be able to keep a count of the messages in the queue without any code that polls the server; it would simply listen to its own queue. You can increment and decrement the counter from the supervisor thread and manage everything from there.
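A minimal sketch of that supervisor loop body in PHP, using child processes instead of threads. The function name reconcile_workers() and the worker command are assumptions; a real supervisor would call this from a loop with a sleep between passes and derive the desired count from the load.

```php
<?php
// Sketch only: reconcile_workers() and the worker command are hypothetical.

/**
 * One supervisor pass: forget dead workers, start missing ones, stop surplus.
 *
 * @param resource[] $workers handles returned by earlier proc_open() calls
 * @param int        $desired number of workers that should be alive
 * @param string[]   $command worker command, e.g. [PHP_BINARY, 'worker.php']
 * @return resource[] handles of the workers alive after this pass
 */
function reconcile_workers(array $workers, int $desired, array $command): array
{
    // Monitor: drop handles whose process has exited.
    foreach ($workers as $key => $process) {
        if (!proc_get_status($process)['running']) {
            proc_close($process);
            unset($workers[$key]);
        }
    }

    // Spawn: start workers until the desired count is reached.
    while (count($workers) < $desired) {
        $process = proc_open($command, [], $pipes);
        if ($process === false) {
            throw new RuntimeException('Could not start worker');
        }
        $workers[] = $process;
    }

    // Discard: terminate surplus workers and wait for them to exit.
    while (count($workers) > $desired) {
        $process = array_pop($workers);
        proc_terminate($process);
        proc_close($process);
    }

    return array_values($workers);
}
```

Keeping the pass this small follows the advice above: the less work the supervisor does itself, the less likely it is to crash.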
Hope this helps.
See: http://docs.dotcloud.com/guides/daemons/
Regretfully, I don't program in PHP and therefore cannot give you PHP-specific assistance. This is, however, the programming pattern that I recommend you use. If PHP doesn't allow multi-threaded programming, then I would highly recommend that you use a language that does, since you will not be able to scale and use the full power of the local machine unless you use multiple threads. As for the supervisor crashing, if you keep minimal work in the supervisor and delegate all responsibilities to child threads, then the risk of a supervisor crash is minimal.
Perhaps this will help:
Philosophy:
http://soapatterns.org/design_patterns/service_agent
PHP-specific:
http://www.quora.com/PHP-programming-language-1/Is-there-an-actor-framework-for-php

BigQuery streaming 'insertAll' performance with PHP

We're streaming a high volume of data server-side into BigQuery using the google-api-php-client library. The streaming works fine apart from the performance.
Our load testing is giving us an average time of 1000ms (1 sec) to stream one row into BigQuery. We can't have the client waiting for more than 200ms. We've tested with smaller payloads and the time remains the same. Async calls on the client side are not an option for us.
The 'bottleneck' line of code is:
$service->tabledata->insertAll(PROJECT_NUMBER, DATA_SET, TABLE, $request);
Having looked under the hood of the library, the call to insert the row is simply a cURL request (Curl.php in the library).
Is there any way to modify insertAll() to make it faster? We don't care about the result, so fire-and-forget would work for us. We've tried setting CURLOPT_CONNECTTIMEOUT_MS and CURLOPT_TIMEOUT_MS in the underlying cURL request, but it does not work.

Having read all your comments and side notes: the approach you've chosen does not scale, and won't scale. You need to rethink it with async processes.
Processing IO-bound or CPU-bound tasks in the background is now common practice in most web applications. There's plenty of software to help build background jobs, some of it based on a messaging system like Beanstalkd.
Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. That's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a JSON representation of the row. This was a killer feature for our use case. So we have an API which gets the rows and places them on a tube. This takes just a few milliseconds, so you can achieve a fast response time.
On the other side, you now have a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can release a job, Beanstalkd will push this job back in the tube, and make it available for another client.
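The job lifecycle described above (put, reserve, then delete, bury, or release) can be illustrated with a small in-memory model. This is not a Beanstalkd client and ignores priorities, delays, and timeouts; the class name TubeModel and its methods are assumptions made only to show how a job moves between states.

```php
<?php
// Sketch only: an in-memory model of one tube, NOT a Beanstalkd client.

final class TubeModel
{
    private array $ready = [];    // id => payload, waiting to be reserved
    private array $reserved = []; // id => payload, held by a consumer
    private array $buried = [];   // id => payload, set aside for inspection
    private int $nextId = 1;

    /** Producer side: queue a payload and return its job id. */
    public function put(string $payload): int
    {
        $id = $this->nextId++;
        $this->ready[$id] = $payload;
        return $id;
    }

    /** Consumer side: take the next ready job, or null if none is waiting. */
    public function reserve(): ?array
    {
        if ($this->ready === []) {
            return null;
        }
        $id = array_key_first($this->ready);
        $this->reserved[$id] = $this->ready[$id];
        unset($this->ready[$id]);
        return ['id' => $id, 'payload' => $this->reserved[$id]];
    }

    /** Success: the job is done and disappears. */
    public function delete(int $id): void
    {
        unset($this->reserved[$id]);
    }

    /** Give up for now: the job becomes available to other consumers. */
    public function release(int $id): void
    {
        $this->move($id, $this->ready);
    }

    /** Failure: keep the job out of circulation for later inspection. */
    public function bury(int $id): void
    {
        $this->move($id, $this->buried);
    }

    /** Number of jobs in each state. */
    public function counts(): array
    {
        return [
            'ready' => count($this->ready),
            'reserved' => count($this->reserved),
            'buried' => count($this->buried),
        ];
    }

    private function move(int $id, array &$target): void
    {
        if (isset($this->reserved[$id])) {
            $target[$id] = $this->reserved[$id];
            unset($this->reserved[$id]);
        }
    }
}
```

In the BigQuery case, the web request would only call the equivalent of put() with the JSON row, and a separate consumer would reserve jobs, call insertAll(), and then delete or bury them depending on the outcome.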
Beanstalkd clients can be found in most common languages, and a web interface can be useful for debugging.

Cross server MySQL connection and requests

I'm going to use Node.js to process some CPU-intensive loop operations that send emails to registered users, because PHP was using too many resources while it ran and froze the site.
One catch is that Node.js will be on a different server and will query MySQL over an external connection.
I've heard that an external DB connection is bad for performance.
Is this true? And are there any pros and cons of doing this?

Keep in mind that when you run a CPU-intensive operation in Node, the whole application blocks, because it runs in a single thread. If you're going to run a CPU-intensive operation in Node, make sure you spawn it off into a child process whose only job is to run the calculation and then return the result to the primary application. This will ensure your Node app is able to continue responding to incoming requests while the data is being processed.
Now, onto your question. Having the database on a different server is extremely common and typically is a good practice to have. Where you can run into performance problems is if your database is in a different data center entirely. The further (physically) your database server is from your application server, the more latency there will be per request.
If these requests are seriously CPU intensive, you should consider looking into a queueing mechanism for a couple reasons. One, it ensures that even in the event of an application crash, you don't lose a request that is being processed. Two, you can monitor the queue, and scale the number of workers processing the queue in the event that the operations are piling to the point that a single application can't finish processing one before another comes in.

sending batch requests

I have a daemon that does the following
retrieves site members from a mysql database (I used LIMIT 1000 to retrieve 1000 rows at a time)
send information about these members to a third party server
flag each member as having been processed
Sleep for 2 seconds
Retrieve the next batch of 1000 "unprocessed" members and send to third party server.
and so on.
I am wondering whether a PHP daemon (I am using the System Daemon library) is the best way to accomplish the task outlined above.
I am worried about wasting too much memory (as PHP is known for that).
I am also worried about sending multiple requests to the third-party server, because on a high-traffic day there can be a lot of nonreceipts.
Is there a tool other than a daemon I can use to accomplish this task? What methods can I implement to make this efficient, considering there is a possibility of having to process over 100K rows in the MySQL table and the task is time-sensitive? Also, at what point should I consider adding more servers?
Thanks!

A cron job should be a very good option for syncing with a third-party server.
Consider the following 'improvements':
1) A lock file to prevent multiple jobs from starting in parallel and taking resources from the other processes you have running. It also avoids duplicate processing of data.
2) If you haven't already, implement an 'information update' and 'sync time' check on your side. For example, if user A hasn't changed since he was last synced, you don't sync him again.
3) Consider how often you need the data to be synced, and if it doesn't have to be real time, factor that into the selection query. Combined with user/time distribution and other factors, you might end up having periods of time when your script doesn't sync that many accounts.
4) Do your own memory cleanup: unset variables, unlink files, and even reuse the same variables so you don't have garbage variables that are used only once inside the scripts. Be careful with this, as it might lead to obfuscated code.
Also consider using smaller datasets when you send them to PHP for processing. Databases love big datasets; PHP doesn't.
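Point 1 could look something like the sketch below: the cron script tries to take a non-blocking exclusive lock and exits quietly if a previous run still holds it. The function names and the lock file path are assumptions for illustration.

```php
<?php
// Sketch only: acquire_run_lock(), release_run_lock(), and the lock file
// path are illustrative assumptions.

/**
 * Try to take an exclusive lock without waiting.
 *
 * @return resource|null the open handle (keep it until the job ends),
 *                       or null when another run already holds the lock
 */
function acquire_run_lock(string $lockFile)
{
    // 'c' creates the file if needed and never truncates it.
    $handle = fopen($lockFile, 'c');
    if ($handle === false) {
        throw new RuntimeException("Could not open $lockFile");
    }
    // LOCK_NB makes flock() return false immediately instead of blocking.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null;
    }
    return $handle;
}

/** Release the lock so the next cron run can start. */
function release_run_lock($handle): void
{
    flock($handle, LOCK_UN);
    fclose($handle);
}

// Typical use at the top of the cron script:
//
//   $lock = acquire_run_lock('/tmp/member-sync.lock');
//   if ($lock === null) {
//       exit(0); // previous run still in progress
//   }
//   // ... fetch a batch, send it, flag rows as processed ...
//   release_run_lock($lock);
```

Because the operating system drops the lock when the process exits, a crashed run does not leave a stale lock behind, which is a common problem with "create a file, delete it at the end" schemes.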

I would suggest using Perl, as it is more efficient in memory and performance and has more features for integrating with the system and running as a daemon.
As for when it's time to add more servers: I am assuming that the third-party server has enough resources to process many records. So if you are running out of resources on your side, I would suggest using MySQL replication to replicate your DBs to other server(s) and running the above-mentioned daemon there.

Does PHP proc_nice leave Apache threads at new priority setting?

When executing proc_nice(), is it actually nice'ing Apache's thread?
If so, and if the current user (a non-superuser) can't renice to its original priority, is killing the Apache thread (apache_child_terminate) appropriate on an Apache 2.0x server?
The issue is that I am trying to limit the impact of an app that allows the user to run ad hoc queries. The queries can be massive, and the resulting transform on the data requires a lot of memory and CPU.
I've already rewritten the process to be more stream-based, which helps with memory consumption, but I would also like the process to run at a lower priority. However, I can't leave the Apache thread at low priority, as we have a lot of high-priority web services running on this same box.
TIA

In that kind of situation, a solution is often to not do that kind of heavy work within the Apache processes, but either:
run an external PHP process, using something like shell_exec, for instance -- this is if you must work in synchronous mode (i.e., if you cannot execute the task a couple of minutes later)
push the task to a FIFO system, and immediately return a message to the user saying "your task will be processed soon"
and have some other process (launched via a crontab every minute, for instance) check that FIFO queue
and do the processing if there is something in the queue
That process, itself, can run in low priority mode.
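For the first option, one way to get the lower priority without touching the Apache process is to start the external PHP process through the `nice` utility. The sketch below only builds the command string; the function name and the script path in the usage note are assumptions, and it presumes a POSIX system where `nice` is on the PATH.

```php
<?php
// Sketch only: build_low_priority_command() is a hypothetical helper.

/**
 * Build a shell command that runs a PHP script at reduced CPU priority.
 *
 * @param string   $script   path to the PHP script that does the heavy work
 * @param string[] $args     arguments passed to that script
 * @param int      $niceness value for `nice -n` (higher means lower priority)
 */
function build_low_priority_command(string $script, array $args = [], int $niceness = 10): string
{
    $parts = [
        'nice',
        '-n',
        (string) $niceness,
        escapeshellarg(PHP_BINARY),
        escapeshellarg($script),
    ];
    foreach ($args as $arg) {
        // Escape every argument so user input cannot inject shell syntax.
        $parts[] = escapeshellarg((string) $arg);
    }
    return implode(' ', $parts);
}
```

The result can be passed to shell_exec(), for example `shell_exec(build_low_priority_command('/path/to/report.php', ['--query-id', '42']))`. The Apache process keeps its own priority; only the child runs niced.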
As often as possible, especially if the heavy calculations take some time, I would go for the second solution:
It allows users to get some feedback immediately: "the server has received your request, and will process it soon"
It doesn't keep Apache's processes "working" for long: the heavy stuff is done by other processes
If, one day, you need such an amount of processing power that one server is not enough anymore, this kind of system will be easier to scale: just add a second server that'll pick from the same FIFO queue
If your server is really too loaded, you can stop processing from the queue, at least for some time, so the load can get better -- for instance, this can be useful if your critical web services are used a lot in a specific time frame.
Another (nice-looking, but I haven't tried it yet) solution would be to use some kind of tool like, for instance, Gearman :
"Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates."
