I have a daemon that does the following
retrieves site members from a mysql database (I used LIMIT 1000 to retrieve 1000 rows at a time)
send information about these members to a third party server
flag each member as having been processed
Sleep for 2 seconds
Retrieve the next batch of 1000 "unprocessed" members and send to third party server.
and so on.
I am wondering whether a php daemon (I am using the system Daemon library), is the best way to accomplish this task delineated above.
I am worried of wasting too much memory (as PHP is known for that)
I am also worried about sending multiple requests to third party server, because on a high traffic day, there can be a lot of nonreceipts.
Is there a tool other than daemon I can use to accomplish this task? What methods can I implement to make this efficient considering there is a possibility of having to process over 100K rows in the mysql table, and the task is time sensitive. Also, at what point should I consider adding more servers?
Thanks!
A cron should be a very good option for doing a sync job with a third party server.
Consider the following 'improvments':
1) A lock file to prevent multiple jobs from starting in parallel and taking extra resources from other processes you have running. And also to avoid duplicate processing of data.
2) If you don't have already implement an 'information update' and 'sync time' check on your side. For example if user A hasn't suffered any changes since he was sync you don't sync him again.
3) Consider how often you need data to be sync and if it doesn't have to be real time factor that into the selection query. Combined with user/time distribution and other factors you migth end up having periods of time when your script doesn't sync that many accounts.
4) Do your own memory cleanup unsetting variables, unlinking files and even reusing the same variables so you don't have garbage variables that are a 1 time use only inside the scripts. Carefull with this as it might lead to obfuscating the code.
Also consider using smaller datasets when you send them to php for processing. Databases love big datasets, php doesn't.
I would suggest you using Perl, as it is more memory and performance efficient and it has more features for integrating with system and running as daemon.
And now about when it's time for adding more servers. I am assuming that third party server has enough resources for processing many records. So if you are running out of resources on your side I would suggest using MySQL replication to replicate your DBs to other server(s) and running above mentioned daemon there.
Related
I'm working on a composer package for PHP apps. The goal is to send some data after requests, queue jobs, other actions that are taken. My initial (and working) idea is to use register_shutdown_function to do it. There are a couple of issues with this approach, firstly, this increases the page response time, meaning that there's the overhead of computing the request, plus sending the data via my API. Another issue is that long-running processes, such as queue workers, do not execute this method for a long time, therefore there might be massive gaps between when the data was created and when it's sent and processed.
My thought is that I could use some sort of temporary storage to store the data and have a cronjob to send it every minute. The only issue I can see with this approach is managing concurrency on hight IO. Because many processes will be writing to the file every (n) ms, there's an issue with reading the file and removing lines that had been already sent.
Another option which I'm trying to desperately avoid is using the client database. This could potentially cause performance issues.
What would be the preferred way to do this?
Edit: the package is essentially a monitoring agent.
There are a couple of issues with this approach, firstly, this increases the page response time, meaning that there's the overhead of computing the request, plus sending the data via my API
I'm not sure you can get around this, there will be additional overhead to doing more work within the context of a web request. I feel like using a job-queue based/asynchronous system is minimizing this for the client. Whether you choose a local file system write, or a socket write you'll have that extra overhead, but you'll be able to return to the client immediately and not block on the processing of that request.
Another issue is that long-running processes, such as queue workers, do not execute this method for a long time, therefore there might be massive gaps between when the data was created and when it's sent and processed.
Isn't this the whole point?? :p To return to your client immediately, and then asynchronously complete the job at some point in the future? Using a job queue allows you to decouple and scale your worker pool and webserver separately. Your webservers can be pretty lean because heavy lifting is deferred to the workers.
My thought is that I could use some sort of temporary storage to store the data and have a cronjob to send it every minute.
I would def recommend looking at a job queue opposed to rolling your own. This is pretty much solved and there are many extremely popular open source projects to handle this (any of the MQs) Will the minute cron job be doing the computation for the client? How do you scale that? If a file has 1000 entries, or you scale 10x and has 10000 will you be able to do all those computations in less than a minute? What happens if a server dies? How do you recover? Inter-process concurrency? Will you need to manage locks for each process? Will you use a separate file for each process and each minute? To bucket events? What happens if you want less than 1 minute runs?
Durability Guarantees
What sort of guarantees are you offering your clients? If a request returns can the client be sure that the job is persisted and it will be completed at sometime in the future?
I would def recommend choosing a worker queue, and having your webserver processes write to it. It's an extremely popular problem with so many resources on how to scale it, and with clear durability and performance guarantees.
I'm about to undertake a large project, where I'll need scheduled tasks (cron jobs) to run a script that will loop through my entire database of entities and make calls to multiple API's such as Facebook, Twitter & Foursquare every 10 minutes. I need this application to be scalable.
I can already foresee a few potential pitfalls...
Fetching data from API's is slow..
With thousands of records (constantly increasing) in my database, it's going to take too much time to process every record within 10 minutes.
Some shared servers only stop scripts running after 30 seconds.
Server issues due to constant intensive scripts running.
My question is how to structure my application...?
Could I create multiple cron jobs to handle small segments of my database (this will have to be automated)?
This will require potentially thousands of cron jobs.. Is that sustainable?
How to bypass the 30 sec issue with some servers?
Is there a better way to go about this?
Thanks!
I'm about to undertake a large project, where I'll need scheduled
tasks (cron jobs) to run a script that will loop through my entire
database of entities and make calls to multiple API's such as
Facebook, Twitter & Foursquare every 10 minutes. I need this
application to be scalable.
Your best option is to design the application to make use of a distributed database, and deploy it on multiple servers.
You can design it to work in two "ranks" of servers, not unlike the map-reduce approach: lightweight servers that only perform queries and "pre-digest" some data ("map"), and servers that aggregate the data ("reduce").
Once you do that, you can establish a performance baseline and calculate that, say, if you can generate 2000 queries per minute and you can handle as many responses, then you need a new server every 20,000 users. In that "generate 2000 queries per minute" you need to factor in:
data retrieval from the database
traffic bandwidth from and to the control servers
traffic bandwidth to Facebook, Foursquare, Twitter etc.
necessity to log locally (and maybe distill and upload log digests to Command and Control)
An advantage of this architecture is that you can start small - a testbed can be built with a single machine running both Connector, Mapper, Reducer, Command and Control and Persistence. When you grow, you just outsource different services to different servers.
On several distributed computing platforms, this also allows you to run queries faster by judiciously allocating Mappers geographically or connectivity-wise, and reduce the traffic costs between your various platforms by playing with, e.g. Amazon "zones" (Amazon has also a message service that you might find valuable for communicating between the tasks)
One note: I'm not sure that PHP is the right tool for this whole thing. I'd rather think Python.
At the 20,000 users-per-instance traffic level, though, I think that you'd better take this up with the guys at Facebook, Foursquare etc. . At a minimum you might glean some strategies such as running the connector scripts as independent tasks, each connector sorting its queue based on that service's user IDs, to leverage what little data locality there might be, and taking advantage of pipelining to squeeze more bandwidth with less server load. At the most, they might point you to bulk APIs or different protocols, or buy you for one trillion bucks :-)
See http://php.net/manual/en/function.set-time-limit.php to bypass the 30 second limit.
For scheduling jobs in PHP look at:
http://www.phpjobscheduler.co.uk/
http://www.zend.com/en/products/server/zend-server-job-queue
I personally would look at a more robust framework that handles job scheduling (see Grails with Quartz) instead of reinventing the wheel and writing your own job scheduler. Don't forget that you are probably going to need to be checking on the status of tasks from time to time so you will need a logging solution around the tasks.
I am trying to write a client-server app.
Basically, there is a Master program that needs to maintain a MySQL database that keeps track of the processing done on the server-side,
and a Slave program that queries the database to see what to do for keeping in sync with the Master. There can be many slaves at the same time.
All the programs must be able to run from anywhere in the world.
For now, I have tried setting up a MySQL database on a shared hosting server as where the DB is hosted
and made C++ programs for the master and slave that use CURL library to make request to a php file (ex.: www.myserver.com/check.php) located on my hosting server.
The master program calls the URL every second and some PHP code is executed to keep the database up to date. I did a test with a single slave program that calls the URL every second also and execute PHP code that queries the database.
With that setup however, my web hoster suspended my account and told me that I was 'using too much CPU resources' and I that would need to use a dedicated server (200$ per month rather than 10$) from their analysis of the CPU resources that were needed. And that was with one Master and only one Slave, so no more than 5-6 MySql queries per second. What would it be with 10 slaves then..?
Am I missing something?
Would there be a better setup than what I was planning to use in order to achieve the syncing mechanism that I need between two and more far apart programs?
I would use Google App Engine for storing the data. You can read about free quotas and pricing here.
I think the syncing approach you are taking is probably fine.
The more significant question you need to ask yourself is, what is the maximum acceptable time between sync's that is acceptable? If you truly need to have virtually realtime syncing happening between two databases on opposite sites of the world, then you will be using significant bandwidth and you will unfortunately have to pay for it, as your host pointed out.
Figure out what is acceptable to you in terms of time. Is it okay for the databases to only sync once a minute? Once every 5 minutes?
Also, when running sync's like this in rapid succession, it is important to make sure you are not overlapping your syncs: Before a sync happens, test to see if a sync is already in process and has not finished yet. If a sync is still happening, then don't start another. If there is not a sync happening, then do one. This will prevent a lot of unnecessary overhead and sync's happening on top of eachother.
Are you using a shared web host? What you are doing sounds like excessive use for a shared (cPanel-type) host - use a VPS instead. You can get an unmanaged VPS with 512M for 10-20USD pcm depending on spec.
Edit: if your bottleneck is CPU rather than bandwidth, have you tried bundling up updates inside a transaction? Let us say you are getting 10 updates per second, and you decide you are happy with a propagation delay of 2 seconds. Rather than opening a connection and a transaction for 20 statements, bundle them together in a single transaction that executes every two seconds. That would substantially reduce your CPU usage.
I've a problem which is giving me some hard time trying to figure it out the ideal solution and, to better explain it, I'm going to expose my scenario here.
I've a server that will receive orders
from several clients. Each client will
submit a set of recurring tasks that
should be executed at some specified
intervals, eg.: client A submits task
AA that should be executed every
minute between 2009-12-31 and
2010-12-31; so if my math is right
that's about 525 600 operations in a
year, given more clients and tasks
it would be infeasible to let the server process all these tasks so I
came up with the idea of worker
machines. The server will be developed
on PHP.
Worker machines are just regular cheap
Windows-based computers that I'll
host on my home or at my workplace,
each worker will have a dedicated
Internet connection (with dynamic IPs)
and a UPS to avoid power outages. Each
worker will also query the server every
30 seconds or so via web service calls,
fetch the next pending job and process it.
Once the job is completed the worker will
submit the output to the server and request
a new job and so on ad infinitum. If
there is a need to scale the system I
should just set up a new worker and the
whole thing should run seamlessly.
The worker client will be developed
in PHP or Python.
At any given time my clients should be
able to log on to the server and check
the status of the tasks they ordered.
Now here is where the tricky part kicks in:
I must be able to reconstruct the
already processed tasks if for some
reason the server goes down.
The workers are not client-specific,
one worker should process jobs for
any given number of clients.
I've some doubts regarding the general database design and which technologies to use.
Originally I thought of using several SQLite databases and joining them all on the server but I can't figure out how I would group by clients to generate the job reports.
I've never actually worked with any of the following technologies: memcached, CouchDB, Hadoop and all the like, but I would like to know if any of these is suitable for my problem, and if yes which do you recommend for a newbie is "distributed computing" (or is this parallel?) like me. Please keep in mind that the workers have dynamic IPs.
Like I said before I'm also having trouble with the general database design, partly because I still haven't chosen any particular R(D)DBMS but one issue that I've and I think it's agnostic to the DBMS I choose is related to the queuing system... Should I precalculate all the absolute timestamps to a specific job and have a large set of timestamps, execute and flag them as complete in ascending order or should I have a more clever system like "when timestamp modulus 60 == 0 -> execute". The problem with this "clever" system is that some jobs will not be executed in order they should be because some workers could be waiting doing nothing while others are overloaded. What do you suggest?
PS: I'm not sure if the title and tags of this question properly reflect my problem and what I'm trying to do; if not please edit accordingly.
Thanks for your input!
#timdev:
The input will be a very small JSON encoded string, the output will also be a JSON enconded string but a bit larger (in the order of 1-5 KB).
The output will be computed using several available resources from the Web so the main bottleneck will probably be the bandwidth. Database writes may also be one - depending on the R(D)DBMS.
It looks like you're on the verge of recreating Gearman. Here's the introduction for Gearman:
Gearman provides a generic application
framework to farm out work to other
machines or processes that are better
suited to do the work. It allows you
to do work in parallel, to load
balance processing, and to call
functions between languages. It can be
used in a variety of applications,
from high-availability web sites to
the transport of database replication
events. In other words, it is the
nervous system for how distributed
processing communicates.
You can write both your client and the back-end worker code in PHP.
Re your question about a Gearman Server compiled for Windows: I don't think it's available in a neat package pre-built for Windows. Gearman is still a fairly young project and they may not have matured to the point of producing ready-to-run distributions for Windows.
Sun/MySQL employees Eric Day and Brian Aker gave a tutorial for Gearman at OSCON in July 2009, but their slides mention only Linux packages.
Here's a link to the Perl CPAN Testers project, that indicates that Gearman-Server can be built on Win32 using the Microsoft C compiler (cl.exe), and it passes tests: http://www.nntp.perl.org/group/perl.cpan.testers/2009/10/msg5521569.html But I'd guess you have to download source code and build it yourself.
Gearman seems like the perfect candidate for this scenario, you might even want to virtualize you windows machines to multiple worker nodes per machine depending on how much computing power you need.
Also the persistent queue system in gearman prevents jobs getting lost when a worker or the gearman server crashes. After a service restart the queue just continues where it has left off before crash/reboot, you don't have to take care of all this in your application and that is a big advantage and saves alot of time/code
Working out a custom solution might work but the advantages of gearman especially the persistent queue seem to me that this might very well be the best solution for you at the moment. I don't know about a windows binary for gearman though but i think it should be possible.
A simpler solution would be to have a single database with multiple php-nodes connected. If you use a proper RDBMS (MSql + InnoDB will do), you can have one table act as a queue. Each worker will then pull tasks from that to work on and write it back into the database upon completion, using transactions and locking to synchronise. This depends a bit on the size of input/output data. If it's large, this may not be the best scheme.
I would avoid sqlite for this sort of task, although it is a very wonderful database for small apps, it does not handle concurrency very well, it has only one locking strategey which is to lock the entire database and keep it locked until a sinlge transaction is complete.
Consider Postgres which has industrial strength concurrency and lock management and can handle multiple simultanious transactions very nicely.
Also this sounds like a job for queuing! If you were in hte Java world I would recommend a JMS based archictecture for your solution. There is a 'dropr' project to do something similar in php but its all fairly new so it might not be suitable for your project.
Whichever technoligy you use you should go for a "free market" solution where the worker threads consume available "jobs" as fast as they can, rather than a "command economy" where a central process allocates tasks to choosen workers.
The setup of a master server and several workers looks right in your case.
On the master server I would install MySQL (Percona InnoDB version is stable and fast) in master-master replication so you won't have a single point of failure.
The master server will host an API which the workers will pull at every N seconds. The master will check if there is a job available, if so it has to flag that the job has been assigned to the worker X and return the appropriate input to the worker (all of this via HTTP).
Also, here you can store all the script files of the workers.
On the workers, I would strongly suggest you to install a Linux distro. On Linux it's easier to set up scheduled tasks and in general I think it's more appropriate for the job.
With Linux you can even create a live cd or iso image with a perfectly configured worker and install it fast and easy on all the machines you want.
Then set up a cron job that will RSync with the master server to update/modify the scripts. In this way you will change the files in just one place (the master server) and all the workers will get the updates.
In this configuration you don't care of the IPs or the number of workers because the workers are connecting to the master, not vice-versa.
The worker job is pretty easy: ask the API for a job, do it, send back the result via API. Rinse and repeat :-)
Rather than re-inventing the queuing wheel via SQL, you could use a messaging system like RabbitMQ or ActiveMQ as the core of your system. Each of these systems provides the AMQP protocol and has hard-disk backed queues. On the server you have one application that pushes new jobs into a "worker" queue according to your schedule and another that writes results from a "result" queue into the database (or acts on it some other way).
All the workers connect to RabbitMQ or ActiveMQ. They pop the work off the work queue, do the job and put the response into another queue. After they have done that, they ACK the original job request to say "its done". If a worker drops its connection, the job will be restored to the queue so another worker can do it.
Everything other than the queues (job descriptions, client details, completed work) can be stored in the database. But anything realtime should be put somewhere else. In my own work I'm streaming live power usage data and having many people hitting the database to poll it is a bad idea. I've written about live data in my system.
I think you're going in the right direction with a master job distributor and workers. I would have them communicate via HTTP.
I would choose C, C++, or Java to be clients, as they have capabilities to run scripts (execvp in C, System.Desktop.something in Java). Jobs could just be the name of a script and arguments to that script. You can have the clients return a status on the jobs. If the jobs failed, you could retry them. You can have the clients poll for jobs every minute (or every x seconds and make the server sort out the jobs)
PHP would work for the server.
MySQL would work fine for the database. I would just make two timestamps: start and end. On the server, I would look for WHEN SECONDS==0
I'm working an image processing website, instead of having lengthy jobs hold up the users browser I want all commands to return fast with a job id and have a background task do the actual work. The id could then be used to check for status and results (ie a url of the processed image). I've found a lot of distributed queue managers for ruby, java and python but I don't know nearly enough of any of those languages to be able to use them.
My own tests have been with shared mysql database to queue jobs, lock them to a worker, and mark them as completed (saving the return data in the db). It was just a messy prototype, and the entire time I felt as if I was reinventing the wheel (and not very elegantly). Does something exist in php (or that I can talk to RESTfully?) that I could use?
Reading around a bit more, I've found that what I'm looking for is a queuing system that has a php api, it doesn't have to be written in php. I've only found classes for use with Amazon's SQS, but not only is that not free, it's also quite latent sometimes (over a minute for a message to show up).
Have you tried ActiveMQ? It makes mention of supporting PHP via the Stomp protocol. Details are available on the activemq site.
I've gotten a lot of mileage out of the database approach your describing though, so I wouldn't worry too much about it.
Do you have full control over server?
MySQL queue could be fine in such case. Have a PHP script that is running constantly (in endless while loop), querying the MySQL database for new "tasks" and sleep()ing in between to reduce the load in idle time.
When each task is completed, mark it in the database and move to the next one.
To prevent that whole thing stops if your script crashes/exists (PHP memory overflow, etc.) you can, for example, place it in inittab (if you use Linux as a server) and init will restart it automatically.
Zend_Framework has a queue class, with a number of implementations of Mysql-backed, SQS and some other back-ends.
Personally, I've had excellent results with BeanstalkD recently, which also has a PHP client. I'm just serialising some data with JSON to throw into it, which gets decoded and run on the worker(s).