I have a service that needs to query 40 external services (APIs) on each user request. For example, a user searches for some information, my service asks 40 external partners for it, aggregates the results in one database (MySQL) and displays them to the user.
At the moment I have a multi-cURL solution that runs 10 partner requests at a time; whenever one partner finishes, the software adds another partner from the remaining 30 to the multi-cURL queue, until all 40 requests are done and the results are in the DB.
The problem with this solution is that it cannot scale across many servers. I want a solution where I can fire all 40 requests at once, for example spread over 2-3 servers, and wait only as long as the slowest partner takes to deliver its results ;-) That means if the slowest partner takes 10 seconds, I will have the results of all 40 partners in 10 seconds. With multi-cURL I run into trouble when there are more than 10-12 requests at a time.
What kind of solution can you offer that uses as few resources as possible, can run many, many processes on one server, and is scalable? My software is written in PHP, so I need a good way to connect to the solution via a framework or an API.
I hope you understand my problem and my needs. Please ask if something is not clear.
One possible solution would be to use a message queue system such as beanstalkd, Apache ActiveMQ, MemcacheQ, etc.
A high level example would be:
User makes request to your service for information
Your service adds the requests to the queue (presumably one for each of the 40 services you want to query)
One or more job servers continuously poll the queue for work
A job server gets a message from the queue to do some work, adds the data to the DB and deletes the item from the queue.
In this model, the single task of performing 40 requests is now distributed and no longer part of one "process", so the next piece of the puzzle is figuring out how to mark a set of work as completed. This part may be easy, or it may introduce a new challenge (it depends on the data and your application). For example, you could use another cache/DB row holding a counter set to the number of jobs a particular request needs in order to complete; as each queue worker finishes a job, it decrements the counter by 1. Once the counter reaches 0, you know the request has been completed. If you do that, you need to make sure the counter actually reaches 0 and doesn't get stuck for some reason.
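As a sketch of that counter idea (table and column names here are invented, and an in-memory SQLite database stands in for MySQL so the example is self-contained; the same SQL works through PDO against MySQL), the decrement can be done atomically in SQL so two workers never race on a read-modify-write:

```php
<?php
// Hypothetical schema: one row per user search, tracking how many
// partner jobs are still outstanding.
function createSchema(PDO $db): void {
    $db->exec('CREATE TABLE search_request (
        id INTEGER PRIMARY KEY,
        remaining INTEGER NOT NULL
    )');
}

// Called once when the user request fans out into N partner jobs.
function startRequest(PDO $db, int $id, int $jobCount): void {
    $db->prepare('INSERT INTO search_request (id, remaining) VALUES (?, ?)')
       ->execute([$id, $jobCount]);
}

// Called by each worker when it finishes one partner job.
// Returns true exactly once: for the worker that completes the last job.
function finishJob(PDO $db, int $id): bool {
    // Atomically claim the "last job" slot: this UPDATE only matches
    // while exactly one job is left, so at most one worker gets true.
    $last = $db->prepare(
        'UPDATE search_request SET remaining = 0 WHERE id = ? AND remaining = 1');
    $last->execute([$id]);
    if ($last->rowCount() === 1) {
        return true;
    }
    // Otherwise just decrement the outstanding-job counter.
    $db->prepare('UPDATE search_request SET remaining = remaining - 1
                  WHERE id = ? AND remaining > 1')->execute([$id]);
    return false;
}

// Demo: a request that fans out into 3 partner jobs.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
createSchema($db);
startRequest($db, 1, 3);
$done = [finishJob($db, 1), finishJob($db, 1), finishJob($db, 1)];
// Only the third call reports completion: [false, false, true]
```

The `WHERE remaining = 1` guard is what makes exactly one worker observe completion, which is the "doesn't get stuck" property mentioned above.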
That's one way at least, hope that helps you a little or opens the door for more ideas.
Related
This is a question I have wanted to ask for a while, and now seems like the right time.
I am building a system in PHP that processes bookings for holiday facilities.
Every facility (hotel, motel, hostel, bed & breakfast, etc.) has its own login and is able to manage its own bookings.
Right now it runs on a system with a single database that separates user data based on hostname, user and login.
The system is also equipped with the option to import bookings provided by partners/resellers.
I have created a specific page for doing this.
Let's say :
cron.examplesystem.com
This URL is called by the Cron task every 5 minutes to check if there are any new bookings or any pending changes / cancellations or confirmations.
Now the issue
This worked just fine until now.
However, we had one instance for every client, and each call generally had to process somewhere between 3 and 95 bookings.
I have now updated my system from one instance for every client to one instance for all clients.
So in the past :
myclient.myholidaybookings.example.com
I am now going to :
myholidaybookings.example.com
With one server handling all clients instead of a separate server for each client.
This will put a lot of stress on the server, but that in itself is not my worry, since I have worked hard to make the system manageable and scalable.
But I have no clue how to approach this.
Because, let's say we have about 100 clients with 3 - 95 bookings each (average 49), we'll have around 4,900 bookings or updates to process at a time.
For sure we'll be hitting a timeout sooner or later.
And this is what I want to prevent.
All kinds of creative solutions can be found, but what's best practice? I want to create a solution that is solid and doesn't have to be reworked halfway after going live.
Summary :
problem : I have a system that processes many API feeds in one call, and I am sure we'll hit a timeout during processing once the system gets populated with users
desired solution : a best-practice approach in PHP for handling and processing many API feeds without worrying about a timeout as the user database grows.
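One common pattern for this (a sketch, not the only best practice; table and column names are invented, and in-memory SQLite stands in for MySQL to keep the example self-contained) is to keep one job row per client feed and let each cron run claim only a bounded batch, so no single PHP request ever does enough work to time out:

```php
<?php
// Each cron run picks up at most $limit pending feeds and marks them as
// running; several cron runs (or several servers) can work in parallel.
// On MySQL you would wrap this in a transaction with SELECT ... FOR UPDATE
// so two workers cannot claim the same feed.
function claimBatch(PDO $db, int $limit): array {
    $stmt = $db->prepare(
        "SELECT id, client FROM feed_job WHERE status = 'pending'
         ORDER BY id LIMIT ?");
    $stmt->bindValue(1, $limit, PDO::PARAM_INT);
    $stmt->execute();
    $jobs = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $mark = $db->prepare("UPDATE feed_job SET status = 'running' WHERE id = ?");
    foreach ($jobs as $job) {
        $mark->execute([$job['id']]);
    }
    return $jobs;
}

// Self-contained demo with five hypothetical client feeds.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE feed_job (
    id INTEGER PRIMARY KEY,
    client TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
)");
foreach (['hotelA', 'hotelB', 'hotelC', 'hotelD', 'hotelE'] as $client) {
    $db->prepare('INSERT INTO feed_job (client) VALUES (?)')->execute([$client]);
}

// Each cron run claims at most 2 feeds, so its runtime stays bounded.
$first  = claimBatch($db, 2);
$second = claimBatch($db, 2);
```

The batch size becomes the knob that keeps each run safely under the PHP timeout, no matter how many clients are in the database.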
I am creating a project management system and need to push notifications when an activity takes place.
Question : If I use jQuery to refresh and fetch notifications from the MySQL database, say every 30 seconds, will there be a huge impact on the server? What are the minimum requirements?
So basically, I'm looking at 10 notifications/day for 20 employees.
Assuming you're talking about an AJAX request to the server in order to update DOM elements, most basic web servers would very much be able to handle a few requests every 30 seconds or so. More important is how well-optimized the server-side code that finds & returns the notifications is. Assuming you'll have a few clients requesting every 30 seconds, I would suggest making sure the code only takes a few seconds to process the request and respond with the updated data.
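To sketch what that server-side code might look like (all names here are illustrative): have the client send the ID of the last notification it has seen, and return only newer rows, so the query stays cheap no matter how large the table grows:

```php
<?php
// Hypothetical polling endpoint logic: with an index on (user_id, id),
// this query touches only the handful of rows newer than $sinceId.
function fetchNewNotifications(PDO $db, int $userId, int $sinceId): array {
    $stmt = $db->prepare(
        'SELECT id, message FROM notification
         WHERE user_id = ? AND id > ? ORDER BY id');
    $stmt->execute([$userId, $sinceId]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Self-contained demo against an in-memory SQLite DB (the same SQL
// works on MySQL via PDO).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE notification (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL,
    message TEXT NOT NULL
)');
$db->exec('CREATE INDEX idx_user_id ON notification (user_id, id)');
$ins = $db->prepare('INSERT INTO notification (user_id, message) VALUES (?, ?)');
$ins->execute([7, 'Task assigned']);
$ins->execute([7, 'Comment added']);
$ins->execute([8, 'Other user event']);

// The endpoint would emit this as JSON for the jQuery poller:
$payload = json_encode(fetchNewNotifications($db, 7, 0));
```

The jQuery side would just `setInterval` an AJAX GET with the last seen ID and append whatever comes back.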
I am putting together an interface for our employees to upload a list of products for which they need industry stats (we currently do them manually, one at a time).
Each product will then be served up to our stats engine via a web service API.
My API will be replying; the stats engine will be requesting the "next victim" from it.
Each list the users upload will have between 50 and 1000 products, and will be its own queue.
For now, queues/lists will likely be added (and removed on completion) approximately 10-20 times per day.
If successful, traffic will probably rev up after a few months to something like 700-900 lists per day.
We're just planning to go with a simple round-robin approach to direct the traffic evenly across queues.
The multiplexer would grab the top item off of List A, then List B, then List C and so on until looping back around to List A again ... keeping in mind that lists/queues can be added/removed at any time.
The issue I'm facing is just conceptualizing the management of this.
I thought about storing each queue as a flat file and managing the rotation via relational DB (MySQL). Thought about doing it the reverse. Thought about going either completely flat-file or completely relational DB ... bottom line, I'm flexible.
Regardless, my brain is just vapor locking when I try to statelessly meld a variable list of participants with a circular rotation (I just got back from a quick holiday, and I don't think my brain's made it home yet ;)
Has anyone done something like this?
How did you handle it?
What would you improve if you had to do it again?
Any & all tips/suggestions/advice are welcome.
NOTE: Since each request from our stats engine/tool will be separated by many seconds, if not a couple of minutes, I need to keep this stateless.
List data should be stored in a database, for sure. Your PHP side should have a view giving the status of the system, and the form to add lists.
Since each request becomes its own queue, and all the request-queues are considered equal in priority, the ideal number of tables is probably three. One to list requests and their priority relative to another (to determine who goes next in the round-robin) and processing status, another to list the contents (list-items) of each request that are yet to be processed, and a third table to list the processed items from each queue.
You will also need a script that does the actual processing, that is not driven by a user request, but instead by a system-scheduled job that executes periodically (throttled to whatever you desire). This can of course also be in PHP. This is where you would set up your 10-at-a-time list checks and updates.
The processing would be something like:
Select the next set of at most 10 items from the highest-priority queue.
Process them, updating their DB status as they complete.
Update the priority of the above queue so that it is now the lowest priority.
And if new queues are added, they would be added with lowest priority.
Priority could be represented with an integer.
Your users would need to wait patiently for their list to be processed and then view or download the result. You might setup an auto-refresh script for this on your view page.
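The rotation described above could be sketched like this (table names are hypothetical; an in-memory SQLite database keeps the example self-contained, and the same SQL works on MySQL):

```php
<?php
// Integer priority: lower value = served sooner. After a queue is
// served, it is moved behind all others by giving it MAX(priority) + 1,
// which is exactly the "update to lowest priority" step above.
function nextQueue(PDO $db): ?int {
    $id = $db->query('SELECT id FROM queue ORDER BY priority LIMIT 1')
             ->fetchColumn();
    return $id === false ? null : (int)$id;
}

function demoteQueue(PDO $db, int $id): void {
    // Move the just-served queue to the back of the rotation.
    $db->prepare('UPDATE queue
                  SET priority = (SELECT MAX(priority) + 1 FROM queue)
                  WHERE id = ?')->execute([$id]);
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE queue (id INTEGER PRIMARY KEY, priority INTEGER NOT NULL)');
foreach ([[1, 1], [2, 2], [3, 3]] as [$id, $p]) {
    $db->prepare('INSERT INTO queue VALUES (?, ?)')->execute([$id, $p]);
}

// Serve four turns: queues 1, 2, 3, then back around to 1.
$served = [];
for ($i = 0; $i < 4; $i++) {
    $id = nextQueue($db);
    $served[] = $id;
    demoteQueue($db, $id);
}
```

New queues inserted with `MAX(priority) + 1` automatically join the back of the rotation, and deleting a row removes a queue mid-rotation without disturbing the others.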
It sounds like you're trying to implement something that Gearman already does very well. For each upload / request, you can simply send off a job to the Gearman server to be queued.
Gearman can be configured to be persistent (just in case things go to hell), which should eliminate the need for you logging requests in a relational database.
Then, you can start as many workers as you'd like. I know you suggest running all jobs serially, which you can still do, but you can also parallelize the work, so that your user isn't sitting around quite as long as they would've been if all jobs had been processed in a serial fashion.
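A rough sketch of that setup, assuming the PECL gearman extension (the job name `process_list` and the host/port here are made up for illustration; nothing below opens a connection until the functions are called):

```php
<?php
// Producer side: called from the upload handler. doBackground() returns
// immediately; the job waits on the gearmand server (which can be
// started with a persistent queue backend so jobs survive restarts).
function enqueueList(string $listId, array $productIds): void {
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);
    $client->doBackground('process_list',
        json_encode(['list' => $listId, 'items' => $productIds]));
}

// Worker side: run as many of these processes as you want; Gearman
// hands each queued job to exactly one free worker.
function runWorker(callable $handler): void {
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);
    $worker->addFunction('process_list', function (GearmanJob $job) use ($handler) {
        $payload = json_decode($job->workload(), true);
        $handler($payload['list'], $payload['items']);
    });
    while ($worker->work()) {
        // Loop forever, handling one job per iteration.
    }
}
```

Starting one worker gives the serial behaviour described in the question; starting several gives parallelism with no code change.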
After a good night's sleep, I now have my wits about me (I hope :).
A simple solution is a flat file for the priorities.
Have a text file simply with one List/Queue ID on each line.
Feed from one end of the list, and add to the other ... simple.
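That flat-file scheme might look like this in code (a sketch; the `flock()` call guards against two stats-engine requests rotating the file at the same moment, which keeps the stateless scheme safe):

```php
<?php
// One queue/list ID per line. Each call takes the ID at the top of the
// file, appends it at the bottom, and returns it — a circular rotation
// with no state held between requests.
function rotateQueueFile(string $path): ?string {
    $fh = fopen($path, 'c+');
    if ($fh === false || !flock($fh, LOCK_EX)) {
        return null;
    }
    $lines = array_values(array_filter(
        array_map('trim', explode("\n", stream_get_contents($fh))), 'strlen'));
    $next = array_shift($lines);   // feed from one end ...
    if ($next !== null) {
        $lines[] = $next;          // ... add to the other
    }
    ftruncate($fh, 0);
    rewind($fh);
    fwrite($fh, implode("\n", $lines) . ($lines ? "\n" : ''));
    flock($fh, LOCK_UN);
    fclose($fh);
    return $next;
}

// Demo with a temporary file holding three queue IDs.
$path = tempnam(sys_get_temp_dir(), 'rot');
file_put_contents($path, "listA\nlistB\nlistC\n");
$turns = [rotateQueueFile($path), rotateQueueFile($path),
          rotateQueueFile($path), rotateQueueFile($path)];
unlink($path);
```

Adding or removing a queue is just adding or deleting a line, which any process holding the lock can do safely.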
Criticisms are welcome ;o)
Thanks @Trylobot and @Chris_Henry for the feedback.
We have a web app that uses IMAP to conditionally insert messages into users' mailboxes at user-defined times.
Each of these 'jobs' is stored in a MySQL DB with a timestamp for when the job should run (which may be months into the future). Jobs can be cancelled at any time by the user.
The problem is that making IMAP connections is a slow process, and before we insert the message we often have to conditionally check whether there is a reply from someone in the inbox (or similar), which adds considerable processing overhead to each job.
We currently have a system with a cron script running every minute or so that gets all the jobs from the DB that need delivering in the next X minutes. It then splits them into batches of Z jobs, and for each batch performs an asynchronous POST request back to the same server with all the data for those Z jobs (to achieve 'fake' multithreading). The server then processes each batch of Z jobs that comes in via HTTP.
The reason we use an async HTTP POST for multithreading, rather than something like pcntl_fork(), is so that we can add other servers, have the data POSTed to them instead, and have them run the jobs rather than the current server.
So my question is - is there a better way to do this?
I appreciate work queues like beanstalkd are available to use, but do they fit with the model of having to run jobs at specific times?
Also, because we need to keep the jobs in the DB anyway (because we need to provide the users with a UI for managing the jobs), would adding a work queue in there somewhere actually be adding more overhead rather than reducing it?
I'm sure there are better ways to achieve what we need - any suggestions would be much appreciated!
We're using PHP for all this so a PHP-based/compatible solution is really what we are looking for.
Beanstalkd would be a reasonable way to do this. It has the concept of put-with-delay, so you can regularly fill the queue from your primary store with a message that will be able to be reserved, and run, in X seconds (time you want it to run - the time now).
The workers would then run as normal, connecting to the beanstalkd daemon and waiting for a new job to be reserved. It would also be a lot more efficient without the overhead of an HTTP connection. As an example, I used to post messages to Amazon SQS (over HTTP). This could barely do 20 QPS at the very most, but Beanstalkd accepted over a thousand per second with barely any effort.
Edited to add: You can't delete a job without knowing its ID, though you could store that elsewhere. OTOH, do users have to be able to delete jobs at any time up to the last minute? You don't have to put a job into the queue weeks or months in advance; you could still have a single DB reader that runs every, say, 1 to 5 minutes to put the next few jobs into the queue, and still have as many workers as you need, with the efficiencies they bring.
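A sketch of that put-with-delay feed using the Pheanstalk client library (the exact API differs between Pheanstalk versions, so treat the calls as illustrative; the tube name and numbers are made up):

```php
<?php
// The delay is simply "time it should run" minus "now", floored at zero
// so overdue jobs become ready immediately.
function delayFor(int $runAtTimestamp, int $now): int {
    return max(0, $runAtTimestamp - $now);
}

// Run from cron every few minutes: pull the jobs due in the next window
// out of MySQL (which stays the source of truth for the management UI)
// and queue each one with the right delay.
function feedQueue(Pheanstalk\Pheanstalk $queue, array $dueJobs, int $now): void {
    foreach ($dueJobs as $job) {
        $queue->useTube('imap-jobs');
        $queue->put(
            json_encode($job),               // payload the worker needs
            1024,                            // priority
            delayFor($job['run_at'], $now),  // becomes ready at run_at
            120                              // TTR: seconds a worker may hold it
        );
    }
}
```

Because only the next few minutes of jobs are ever in the queue, a user cancelling a job weeks ahead only has to touch the DB row, never the queue.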
Ultimately, it depends on the number of DB read/writes that you are doing, and how the database server is able to handle them.
If what you are doing is not a problem now, and won't become so with additional load, then carry on.
I'm new to PHP, so I need some guidance as to which would be the simplest and/or elegant solution to the following problem:
I'm working on a project which has a table with as many as 500,000 records. At user-specified times, a background task must be started that invokes a command-line application on the server which does the magic. The problem is that every minute or so I need to check all 500,000 records (and counting) to see whether something needs to be done.
As the title says, it is time-critical. This means that at most 1 minute of delay is allowed between the time expected by the user and the time the task is executed; of course, the less delay the better.
So far, I can only think of a rather dirty option: have a simple utility app running on the server that, every minute, makes multiple requests to the server, for example:
check records between 1 and 100,000;
check records between 100,000 and 200,000;
etc. you get the point;
and the server basically starts a task for each chunk of 100,000 records or fewer. But it seems to me that there must be a faster approach, something similar to Facebook's notifications.
Additional info:
server is Windows 2008
using apache + php
EDIT 1
users have an average of 3 tasks per day at roughly 6-8 hour intervals
more than half of the tasks can be executed at the same time, at least once per day [!]
Any suggestion is highly appreciated!
The easiest approach would be using a persistent task that runs the whole time and receives notification about records that need to be processed. Then it could process them immediately or, in case it needs to be processed at a certain time, it could sleep until either that time is reached or another notification arrives.
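To sketch that idea (the schema is hypothetical; in-memory SQLite stands in for the real database, and the SQL works on MySQL too): with an index on the due time, finding due tasks never scans all 500,000 rows, and the persistent task can sleep until the next job is due instead of re-checking every record each minute:

```php
<?php
// Only rows whose run_at has passed are touched; the index makes this a
// range scan over the few due rows, not a walk over 500,000 records.
function dueTasks(PDO $db, int $now): array {
    $stmt = $db->prepare(
        'SELECT id FROM task WHERE done = 0 AND run_at <= ? ORDER BY run_at');
    $stmt->execute([$now]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// How long the persistent task may sleep before anything is due.
function secondsUntilNext(PDO $db, int $now): ?int {
    $next = $db->query('SELECT MIN(run_at) FROM task WHERE done = 0')
               ->fetchColumn();
    return $next === null ? null : max(0, (int)$next - $now);
}

// Self-contained demo with three tasks and a fixed "now".
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE task (
    id INTEGER PRIMARY KEY,
    run_at INTEGER NOT NULL,
    done INTEGER NOT NULL DEFAULT 0
)');
$db->exec('CREATE INDEX idx_run_at ON task (run_at)');
$now = 1000;
foreach ([[1, 990], [2, 1000], [3, 1030]] as [$id, $runAt]) {
    $db->prepare('INSERT INTO task (id, run_at) VALUES (?, ?)')->execute([$id, $runAt]);
}

$due = dueTasks($db, $now); // tasks 1 and 2 are due
// The real daemon would launch the command-line app for each due task,
// mark it done, then: sleep(secondsUntilNext($db, time()));
```

The loop wakes either when the sleep expires or when a notification about a newly added or changed task arrives, so the 1-minute bound is met without any bulk scanning.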
I think I have given this question more than enough time. I will stick with a utility application (sitting on the server) that makes requests to a URL accessible only from the server's IP, which spawns a new thread for each task when multiple tasks need to be executed at the same time. It's not really scalable, but it will have to do for now.