I have a cron job script that runs every 60 seconds to process and store results in a database. That’s a maximum of 1,440 new database entries per day.
I need many millions of database entries, so doing this with just one instance of the script is impractical. I'm looking for a minimum of a 50x speed-up, and ideally 300x to 500x if the cost is reasonable.
It seems like I need a server farm, but I have to use Amazon Web Services to process this data. How can I set this script up to run many simultaneous instances, while storing the data in a single, unified database?
Do I need to create completely separate server instances every time I want to run this script, multiplying the cost?
Thank you for your help!
A serverless approach, using a remote Lambda function triggered by a queue system to execute your job, solves your problem both technically and at the pricing level.
https://aws.amazon.com/lambda/
For example, you can trigger Lambda function executions from a single centralized script (e.g. one cron job) that enqueues a message to a queue system for each entry you need to compute, so the work runs asynchronously and concurrently.
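A minimal sketch of the enqueueing side, assuming the AWS SDK for PHP, SQS as the queue, and a hypothetical queue URL (each message would then trigger one Lambda invocation via an SQS event source):

<?php
// Sketch: enqueue one SQS message per work item (names are hypothetical).
// Assumes the AWS SDK for PHP is installed via Composer.
require 'vendor/autoload.php';

use Aws\Sqs\SqsClient;

$sqs = new SqsClient([
    'region'  => 'us-east-1',   // assumption: use your own region
    'version' => '2012-11-05',
]);

$queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/jobs'; // hypothetical

// $workItems: however you decide what needs computing this minute.
foreach ($workItems as $item) {
    $sqs->sendMessage([
        'QueueUrl'    => $queueUrl,
        'MessageBody' => json_encode($item),
    ]);
}

Each Lambda worker then reads one message, computes its entry, and writes the result to the single shared database.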
The Serverless Framework can help you avoid AWS lock-in:
https://serverless.com/
Related
I'm about to undertake a large project, where I'll need scheduled tasks (cron jobs) to run a script that will loop through my entire database of entities and make calls to multiple APIs such as Facebook, Twitter & Foursquare every 10 minutes. I need this application to be scalable.
I can already foresee a few potential pitfalls...
Fetching data from APIs is slow.
With thousands of records (constantly increasing) in my database, it's going to take too much time to process every record within 10 minutes.
Some shared servers stop scripts after they've been running for only 30 seconds.
Server strain from intensive scripts running constantly.
My question is: how should I structure my application?
Could I create multiple cron jobs to handle small segments of my database (this would have to be automated)?
This could require potentially thousands of cron jobs. Is that sustainable?
How can I bypass the 30-second limit on some servers?
Is there a better way to go about this?
Thanks!
I'm about to undertake a large project, where I'll need scheduled tasks (cron jobs) to run a script that will loop through my entire database of entities and make calls to multiple APIs such as Facebook, Twitter & Foursquare every 10 minutes. I need this application to be scalable.
Your best option is to design the application to make use of a distributed database, and deploy it on multiple servers.
You can design it to work in two "ranks" of servers, not unlike the map-reduce approach: lightweight servers that only perform queries and "pre-digest" some data ("map"), and servers that aggregate the data ("reduce"). A minimal sketch of such a mapper worker appears after the list below.
Once you do that, you can establish a performance baseline and calculate that, say, if you can generate 2000 queries per minute and handle as many responses, then you need a new server for every 20,000 users. In that "generate 2000 queries per minute" figure you need to factor in:
data retrieval from the database
traffic bandwidth from and to the control servers
traffic bandwidth to Facebook, Foursquare, Twitter etc.
necessity to log locally (and maybe distill and upload log digests to Command and Control)
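Purely as an illustration of the lightweight "map" rank, a worker could look something like this (every name, table, and the reducer endpoint are hypothetical):

<?php
// Illustrative mapper: query an external API for a slice of entities,
// pre-digest the responses, and ship the digest to an aggregating server.
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$digest = [];
foreach ($db->query('SELECT id, api_user_id FROM entities LIMIT 500') as $row) {
    $raw  = file_get_contents('https://api.example.com/users/' . urlencode($row['api_user_id']));
    $data = json_decode($raw, true);
    // "Pre-digest": keep only what the reducer actually needs.
    $digest[$row['id']] = isset($data['followers']) ? $data['followers'] : 0;
}

// Hand the whole digest to the "reduce" rank in one request.
$ch = curl_init('https://reducer.internal/ingest'); // hypothetical aggregator
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($digest));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);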
An advantage of this architecture is that you can start small - a testbed can be built with a single machine running both Connector, Mapper, Reducer, Command and Control and Persistence. When you grow, you just outsource different services to different servers.
On several distributed computing platforms, this also allows you to run queries faster by judiciously allocating Mappers geographically or connectivity-wise, and to reduce the traffic costs between your various platforms by playing with, e.g., Amazon "zones" (Amazon also has a message service that you might find valuable for communication between the tasks).
One note: I'm not sure that PHP is the right tool for this whole thing. I'd rather think Python.
At the 20,000 users-per-instance traffic level, though, I think you'd better take this up with the guys at Facebook, Foursquare, etc. At a minimum you might glean some strategies, such as running the connector scripts as independent tasks, each connector sorting its queue based on that service's user IDs to leverage what little data locality there might be, and taking advantage of pipelining to squeeze more bandwidth out of less server load. At most, they might point you to bulk APIs or different protocols, or buy you for one trillion bucks :-)
See http://php.net/manual/en/function.set-time-limit.php to bypass the 30-second limit.
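For example, at the top of the script (note that some shared hosts disable this, and it has no effect in safe mode):

<?php
// Remove the default execution time limit for this script.
// An argument of 0 means "no limit".
set_time_limit(0);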
For scheduling jobs in PHP look at:
http://www.phpjobscheduler.co.uk/
http://www.zend.com/en/products/server/zend-server-job-queue
I personally would look at a more robust framework that handles job scheduling (see Grails with Quartz) instead of reinventing the wheel and writing your own job scheduler. Don't forget that you are probably going to need to be checking on the status of tasks from time to time so you will need a logging solution around the tasks.
I have a daemon that does the following:
retrieves site members from a MySQL database (I used LIMIT 1000 to retrieve 1000 rows at a time)
send information about these members to a third party server
flag each member as having been processed
Sleep for 2 seconds
Retrieve the next batch of 1000 "unprocessed" members and send to third party server.
and so on.
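In code, that loop looks roughly like the following sketch (the table, columns, and third-party endpoint are hypothetical):

<?php
// Sketch of the batch loop described above.
$db = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

while (true) {
    $rows = $db->query('SELECT id, email FROM members WHERE processed = 0 LIMIT 1000')
               ->fetchAll(PDO::FETCH_ASSOC);
    if (!$rows) {
        break; // nothing left to process
    }

    // Send information about these members to the third-party server.
    $ch = curl_init('https://thirdparty.example.com/import'); // hypothetical
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($rows));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);

    // Flag each member as processed.
    $ids = implode(',', array_map('intval', array_column($rows, 'id')));
    $db->exec("UPDATE members SET processed = 1 WHERE id IN ($ids)");

    sleep(2);
}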
I am wondering whether a PHP daemon (I am using the System_Daemon library) is the best way to accomplish the task delineated above.
I am worried about wasting too much memory (as PHP is known for that).
I am also worried about sending multiple requests to the third-party server, because on a high-traffic day there can be a lot of non-receipts.
Is there a tool other than a daemon I can use to accomplish this task? What methods can I implement to make this efficient, considering there is a possibility of having to process over 100K rows in the MySQL table, and the task is time sensitive? Also, at what point should I consider adding more servers?
Thanks!
A cron job should be a very good option for doing a sync job with a third-party server.
Consider the following improvements:
1) Use a lock file to prevent multiple jobs from starting in parallel, taking resources from other processes you have running, and processing the same data twice (see the sketch after this list).
2) If you haven't already, implement an 'information updated' and 'last sync time' check on your side. For example, if user A hasn't changed since he was last synced, don't sync him again.
3) Consider how often the data needs to be synced, and if it doesn't have to be real time, factor that into the selection query. Combined with user/time distribution and other factors, you might end up with periods when your script doesn't have many accounts to sync.
4) Do your own memory cleanup: unset variables, unlink files, and even reuse the same variables so you don't have single-use garbage variables inside the scripts. Be careful with this, as it can obfuscate the code.
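For point 1, a common PHP pattern is an flock()-based lock, which has the advantage of releasing itself if the script dies (the lock path is arbitrary):

<?php
// Take an exclusive, non-blocking lock; if another run holds it, bail out.
$fp = fopen('/tmp/sync-job.lock', 'c'); // 'c' creates the file if missing
if (!flock($fp, LOCK_EX | LOCK_NB)) {
    exit("Another instance is already running\n");
}

// ... do the sync work here ...

flock($fp, LOCK_UN);
fclose($fp);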
Also consider using smaller datasets when you send them to PHP for processing. Databases love big datasets; PHP doesn't.
I would suggest using Perl, as it is more memory- and performance-efficient, and it has more features for integrating with the system and running as a daemon.
And now, about when it's time to add more servers: I am assuming the third-party server has enough resources to process many records. So if you are running out of resources on your side, I would suggest using MySQL replication to replicate your DBs to other server(s) and running the above-mentioned daemon there.
I need to run some tasks continuously. These tasks consist, mainly, of retrieving specific records from the DB, analyzing them, and saving the results. This is a non-trivial analysis, which might take several seconds (perhaps more than a minute).
I do not know how frequently will new records be saved in the DB waiting for analysis (there's another cronjob for that).
Should I retrieve records one by one, calling the same analysis function again once it finishes (recursively), and try to keep the cron job running until there are no more unanalyzed records?
Or should I retrieve a fixed number of new records on each cron job run, and run the cron job every set number of minutes?
A job queue server may work well for this scenario (see ActiveMQ or MemcacheQ, for example). Rather than adding the un-analyzed records directly to the database, send them to a queue for processing. Then your cron job could retrieve some items from the queue for processing, and if one job takes so long that the cron job is triggered again, the next run will grab the next items in the queue.
Personally, I would have the cron job retrieve a fixed number of records for processing, just to make sure the script doesn't get stuck processing for a very long time if new records keep getting added faster than the processor can keep up. It would probably finish everything eventually, but you could end up with a single run that continues for a very long time.
You may consider creating a lock file as well that the job can look for to see if the task processor is already running. For example, when the cron job starts, check for the existence of a file (e.g. processor.lock): if it exists, exit; if not, create the file, process some records, and delete the file.
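In PHP that check could be as simple as the following sketch (note there is a small race window between the file_exists() test and the touch(); an flock()-based lock avoids it):

<?php
$lock = '/tmp/processor.lock'; // hypothetical path

if (file_exists($lock)) {
    exit; // a previous run is still processing
}
touch($lock);

// ... pull some items from the queue and process them ...

unlink($lock);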
Hope that helps.
Or should I retrieve a fixed number of new records on each cron job run, and run the cron job every set number of minutes?
That. And you'll have to do some trial-and-error measurement first to decide on an optimal fixed amount.
Of course it heavily depends on what you are actually doing, how many DB-intensive cron jobs you are running simultaneously, and what kind of setup you have. I recently spent a day looking for a Heisenbug in a very intensive script that migrated images from the DB to S3 (and created a few thumbs while migrating). The problem was that, due to an undocumented behaviour in our ORM, the connection to the database was lost at some point, as posting to S3 plus thumbnail generation for certain images took a little bit longer than the connection time limit. It was an ugly situation that would probably have cost more than a day to identify in a recursive do-it-all scheme.
You'd be better off with the safe approach, even if it means a little time lost between cron executions.
Instead of using a cron job, I would use The Fat Controller to run and repeat tasks. It is basically a daemon which can run any script or application and restart it after it finishes, optionally with a delay between runs.
You can additionally specify a timeout so that long-running scripts are stopped. This way you don't need to worry about locking, long-running processes, error handling and so on. It helps keep your business logic clean.
There's more examples and use cases on the website:
http://fat-controller.sourceforge.net/
I am trying to write a client-server app.
Basically, there is a Master program that needs to maintain a MySQL database that keeps track of the processing done on the server side, and a Slave program that queries the database to see what to do to keep in sync with the Master. There can be many slaves at the same time.
All the programs must be able to run from anywhere in the world.
For now, I have tried setting up a MySQL database on a shared hosting server to host the DB, and made C++ programs for the master and slave that use the cURL library to make requests to a PHP file (e.g. www.myserver.com/check.php) located on my hosting server.
The master program calls the URL every second and some PHP code is executed to keep the database up to date. I did a test with a single slave program that also calls the URL every second and executes PHP code that queries the database.
With that setup, however, my web host suspended my account and told me that I was 'using too much CPU resources' and that I would need to use a dedicated server ($200 per month rather than $10) based on their analysis of the CPU resources needed. And that was with one Master and only one Slave, so no more than 5-6 MySQL queries per second. What would it be with 10 slaves, then?
Am I missing something?
Would there be a better setup than what I was planning to use in order to achieve the syncing mechanism that I need between two and more far apart programs?
I would use Google App Engine for storing the data. You can read about free quotas and pricing here.
I think the syncing approach you are taking is probably fine.
The more significant question you need to ask yourself is: what is the maximum acceptable time between syncs? If you truly need virtually realtime syncing between two databases on opposite sides of the world, then you will be using significant bandwidth, and you will unfortunately have to pay for it, as your host pointed out.
Figure out what is acceptable to you in terms of time. Is it okay for the databases to only sync once a minute? Once every 5 minutes?
Also, when running syncs like this in rapid succession, it is important to make sure you are not overlapping them: before a sync starts, test whether a sync is already in progress and has not finished yet. If one is still happening, don't start another; if not, go ahead. This will prevent a lot of unnecessary overhead and syncs happening on top of each other.
Are you using a shared web host? What you are doing sounds like excessive use for a shared (cPanel-type) host; use a VPS instead. You can get an unmanaged VPS with 512MB for 10-20 USD per month depending on spec.
Edit: if your bottleneck is CPU rather than bandwidth, have you tried bundling updates inside a transaction? Let us say you are getting 10 updates per second, and you decide you are happy with a propagation delay of 2 seconds. Rather than opening a connection and a transaction for each of those 20 statements, bundle them together in a single transaction that executes every two seconds. That would substantially reduce your CPU usage.
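A sketch of that bundling with PDO (the table, columns, and fetchPendingUpdates() helper are hypothetical; the point is one connection and one transaction for the whole batch):

<?php
// Collect ~2 seconds' worth of updates, then apply them in one transaction
// instead of opening a connection and transaction per statement.
$db = new PDO('mysql:host=dbhost;dbname=sync', 'user', 'pass');

$pending = fetchPendingUpdates(); // hypothetical: the ~20 queued updates

$stmt = $db->prepare('UPDATE tasks SET state = ? WHERE id = ?');
$db->beginTransaction();
foreach ($pending as $update) {
    $stmt->execute([$update['state'], $update['id']]);
}
$db->commit();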
Greetings All!
I am having some trouble working out how to execute thousands upon thousands of requests to a web service (eBay). I have a limit of 5 million calls per day, so there are no problems on that end.
However, I'm trying to figure out how to process 1,000 - 10,000 requests every minute to every 5 minutes.
Basically the flow is:
1) Get list of items from database (1,000 to 10,000 items)
2) Make an API POST request for each item
3) Accept return data, process data, update database
Obviously a single PHP instance running this in a loop would be impossible.
I am aware that PHP is not a multithreaded language.
I tried the CURL solution, basically:
1) Get list of items from database
2) Initialize multi curl session
3) For each item add a curl session for the request
4) Execute the multi curl session
So you can imagine 1,000-10,000 GET requests occurring...
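For reference, that multi-curl attempt boils down to something like this sketch (the endpoint and table are hypothetical):

<?php
$items = $db->query('SELECT id FROM items')->fetchAll(PDO::FETCH_COLUMN);

$mh = curl_multi_init();
$handles = [];
foreach ($items as $id) {
    $ch = curl_init('https://api.example.com/items/' . $id); // hypothetical
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$id] = $ch;
}

// Drive all transfers to completion.
$running = null;
do {
    curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of spinning
    }
} while ($running);

foreach ($handles as $id => $ch) {
    $response = curl_multi_getcontent($ch);
    // ... process $response and update the database for item $id ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);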
This was OK; around 100-200 requests were occurring in about a minute or two. However, only 100-200 of the 1,000 items actually processed, so I am thinking I'm hitting some sort of Apache or MySQL limit?
But this does add latency; it's almost like performing a DoS attack on myself.
I'm wondering how you would handle this problem? What if you had to make 10,000 web service requests and 10,000 MySQL updates from the data returned by the web service, and this needed to be done within 5 minutes?
I am using PHP and MySQL with the Zend Framework.
Thanks!
I've had to do something similar, but with Facebook: updating 300,000+ profiles every hour. As suggested by grossvogel, you need to use many processes to speed things up, because the script spends most of its time waiting for a response.
You can do this with forking, if your PHP install has support for forking, or you can just execute another PHP script via the command line.
exec('nohup /path/to/script.php >> /tmp/logfile 2>&1 & echo $!', $processId);
You can pass parameters to the PHP script on the command line (read them with getopt) to tell it which "batch" to process. You can have the master script do a sleep/check cycle to see if the scripts are still running by checking for their process IDs. I've tested up to 100 scripts running at once in this manner, at which point the CPU load can get quite high.
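A sketch of that master loop (the worker path, batch count, and log locations are hypothetical):

<?php
// Spawn N workers, each told which batch to process via the command line,
// then sleep/check until all of them have exited.
$batches = 10;
$pids = [];

for ($i = 0; $i < $batches; $i++) {
    $out = [];
    exec("nohup php /path/to/worker.php --batch=$i >> /tmp/worker-$i.log 2>&1 & echo $!", $out);
    $pids[$i] = (int) $out[0]; // PID printed by "echo $!"
}

// posix_kill() with signal 0 only tests whether the process still exists.
while ($pids) {
    sleep(5);
    foreach ($pids as $i => $pid) {
        if (!posix_kill($pid, 0)) {
            unset($pids[$i]); // that worker has finished
        }
    }
}

Inside worker.php, getopt('', array('batch:')) would read the --batch parameter.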
Combine multiple processes with multi-curl, and you should easily be able to do what you need.
My two suggestions are (a) do some benchmarking to find out where your real bottlenecks are and (b) use batching and caching wherever possible.
Mysqli allows multiple-statement queries, so you could definitely batch those database updates.
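For example (a sketch; the table and the shape of $results are hypothetical):

<?php
// Batch many UPDATEs into a single round-trip with mysqli::multi_query().
$mysqli = new mysqli('localhost', 'user', 'pass', 'shop');

$sql = '';
foreach ($results as $id => $price) { // $results: data returned by the API
    $sql .= sprintf('UPDATE items SET price = %.2f WHERE id = %d;', $price, $id);
}

$mysqli->multi_query($sql);
// Drain every result set, or later queries on this connection will fail.
do {
    if ($res = $mysqli->store_result()) {
        $res->free();
    }
} while ($mysqli->more_results() && $mysqli->next_result());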
The http requests to the web service are more likely the culprit, though. Check the API you're using to see if you can get more info from a single call, maybe? To break up the work, maybe you want a single master script to shell out to a bunch of individual processes, each of which makes an api call and stores the results in a file or memcached. The master can periodically read the results and update the db. (Careful to rotate the data store for safe reading and writing by multiple processes.)
To understand your requirements better: must you implement your solution only in PHP, or can you interface a PHP part with another part written in another language?
If you cannot go for another language, try performing this update with a PHP script that runs in the background (from the command line) rather than through Apache.
You can follow Brent Baisley's advice for a simple use case.
If you want to build a robust solution, then you need to:
set up a representation of the actions in a database table that will be your process queue;
set up a script that pops items from this queue and processes your actions (a minimal sketch follows this list);
set up a cron daemon that runs this script every x minutes.
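A minimal sketch of that queue-popping script (table and column names are illustrative; FOR UPDATE assumes InnoDB):

<?php
// Claim a slice of the queue so parallel runs don't grab the same rows.
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$db->beginTransaction();
$rows = $db->query("SELECT id, item_id FROM action_queue
                    WHERE status = 'pending'
                    ORDER BY priority DESC LIMIT 50 FOR UPDATE")
           ->fetchAll(PDO::FETCH_ASSOC);
foreach ($rows as $row) {
    $db->exec("UPDATE action_queue SET status = 'running' WHERE id = {$row['id']}");
}
$db->commit();

foreach ($rows as $row) {
    // ... make the eBay request for $row['item_id'], store the response ...
    $db->exec("UPDATE action_queue SET status = 'done' WHERE id = {$row['id']}");
}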
This way you can have 1000 PHP scripts running, using your OS's parallelism capabilities and not hanging when eBay is taking too long to respond.
The real advantage of this system is that you can fully control the firepower you throw at your task by adjusting :
the number of requests one PHP script makes;
the order / number / type / priority of the actions in the queue;
the number of scripts the cron daemon runs.
Thanks everyone for the awesome and quick answers!
The advice from Brent Baisley and e-satis works nicely. Rather than executing the sub-processes using cURL like I did before, forking takes a massive load off; it also nicely gets around the issue of maxing out my Apache connection limit.
Thanks again!
It is true that PHP is not multithreaded, but it can certainly be set up with multiple processes.
I have created a system that resembles the one you are describing. It runs in a loop and is basically a background process. It uses up to 8 processes for batch processing and a single control process.
It is somewhat simplified because I do not need any communication between the processes. Everything resides in a database, so each process is spawned with the full context taken from the database.
Here is a basic description of the system.
1. Start control process
2. Check database for new jobs
3. Spawn child process with the job data as a parameter
4. Keep a table of the child processes to be able to control the number of simultaneous processes.
Unfortunately it does not appear to be a widespread idea to use PHP for this type of application, and I really had to write wrappers for the low-level functions.
The manual has a whole section on these functions, and it appears that there are methods for allowing IPC as well.
PCNTL has the functions to control forking/child processes, and Semaphore covers IPC.
The interesting part of this is that I'm able to fork off actual PHP code, not execute other programs.
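A stripped-down sketch of that pattern with PCNTL (the fetchNextJob() and processJob() functions are hypothetical stand-ins):

<?php
// Control process: fork up to 8 children, each running real PHP code.
$maxChildren = 8;
$children = [];

while (true) {
    $job = fetchNextJob(); // hypothetical: reads the next job row from the DB
    if ($job === null) {
        sleep(5);
        continue;
    }

    $pid = pcntl_fork();
    if ($pid === 0) {
        processJob($job);   // hypothetical worker; takes full context from the DB
        exit(0);            // the child must exit, or it becomes another control loop
    }
    $children[$pid] = true; // parent: remember the child

    // Enforce the concurrency limit by reaping finished children.
    while (count($children) >= $maxChildren) {
        $done = pcntl_waitpid(-1, $status); // blocks until any child exits
        unset($children[$done]);
    }
}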