For the past few days I've been googling how I should handle long-running tasks on a web server. I have found a lot of good answers about how to run them in a specific language, but not about which language to choose for this kind of job.
So, I have a web server running some kind of custom e-commerce platform. There is another server where a products distributor provides access to its data through an API. I have to sync the products list across these two servers. The product base is pretty big (about 100,000 products).
My idea was to write a PHP script that collects data from various API endpoints and updates the database accordingly. But it is going to take a long time, so without heavy modifications and deep-diving into PHP itself it will time out.
Now I'm thinking maybe I should write a Python script that goes through the API endpoints and collects data about each product. After the data about a product is collected, the Python script could initiate a PHP script that updates the database for that particular product.
What are your thoughts about it? What would be the best way to handle it?
This sounds like something you would be better off doing from the server side with cron, for example, and not from a browser. Unless, of course, it needs to be run manually at random times by people who have no terminal access.
If it has to be PHP, you can run it with cron or from the terminal and disable the timeout (see: http://php.net/manual/en/function.set-time-limit.php ), and even leave it as a background process if needed with &. This way you wouldn't be limited by the time limits of Apache (or whatever server you are using).
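For illustration, a minimal sketch of such a CLI script (sync.php is a made-up name; the nohup line shows one way to background it):

<?php
// sync.php (hypothetical name) - run from the command line or cron, not
// through Apache, so web server time limits don't apply.
set_time_limit(0);          // remove PHP's own execution time limit
ignore_user_abort(true);    // keep going even if the caller disconnects

// ... fetch data from the API endpoints and update the database here ...

// From a terminal (or a cron entry), leave it running in the background:
//   nohup php /path/to/sync.php >> /var/log/sync.log 2>&1 &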
Related
To work around request limits, I want to fetch data from an API endpoint and provide it to my users from a third-party hosting platform. Such platforms usually support PHP, so I was thinking of using it. The data should update about once a minute or every two minutes. The fetching process itself could be as simple as possible, e.g. like this:
$json = file_get_contents('http://abc.com/xyz');
file_put_contents('example.json', $json);
Like this, an endpoint would be fetched and written into a local file. But to repeat this step continuously and keep the data updated, the script would need to run permanently or be executed frequently. The only way I found was to use cron jobs, but would that be recommendable for keeping files updated? Or are there better methods to do this?
I know that there are better setups to solve this, like handling it with node.js, but I'm considering a platform like this so I only have to manage the communication between the API and the server, not between server and clients. I didn't find another way to do so, but I'm open to other suggestions!
While it can be done differently (like with the node.js you mentioned, or other methods), I believe that a system cron job run every X minutes (depending on how long it takes the API to respond) will suffice and keep things simple.
Provided, of course, that you are able to set up system cron jobs on your web server.
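For example, assuming the snippet above is saved as fetch.php (a made-up name), a crontab entry like this would run it every minute:

# fetch the endpoint every minute; adjust the path and the interval as needed
* * * * * php /path/to/fetch.php >> /var/log/fetch.log 2>&1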
This side of PHP is rather new to me.
I am interested in firing off a large number (25-50) of separate processes from a parent script. I would like the parent script not to wait for these other scripts to complete, AND I would like these other scripts to run in parallel.
Each script would run for a specified amount of time, calling a web service.
Can anyone give me some direction with this? I'm not asking for a coded answer specifically, but I just need some guidance.
Much thanks.
It really depends on what you want to achieve. @Julien's forking method could work, but it is not preferable if your web service calls are data intensive. I am not saying that forking is bad, on the contrary it works, but with the amount of different web services you want to call you should have a way to manage things better.
Another thing you can do is base this on cron jobs. For example, if you're calling these web services for some users in your app, create a queue: a DB table to which you add records that need to be processed. If you are using Cake, use Cake Shells. Then set up cron jobs that call the shells that process these records every now and then. Divide the services into separate queues, at least for those that are very different in logic. This way you also divide your risk, because a failure in one of the web service calls won't jeopardise the others. Give each queue its own logging, which will let you quickly track down problems. With consuming web services, problems are very often external to your application.
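Illustration only: one possible shape for such a queue table and the enqueue side, in plain PHP rather than a Cake shell (table/column names and credentials are invented; the processing side would live in a shell run by cron):

<?php
//   CREATE TABLE job_queue (
//     id         INT AUTO_INCREMENT PRIMARY KEY,
//     service    VARCHAR(32),   -- which web service this row targets
//     payload    TEXT,          -- what needs to be done
//     status     ENUM('pending','running','done','failed') DEFAULT 'pending',
//     created_at DATETIME
//   );

// Enqueue one record per unit of work instead of calling the service inline.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare(
    "INSERT INTO job_queue (service, payload, status, created_at)
     VALUES (?, ?, 'pending', NOW())"
);
$stmt->execute(['facebook', json_encode(['user_id' => 123])]);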
I'm putting together my first commercial PHP application, it's nothing really huge as I'm still eagerly learning PHP :)
Right now I'm still in the conceptual stage of planning my application, but I run into one problem all the time: the application is supposed to be self-hosted by my customers, on their own servers, and will include some very long-running scripts, depending on how much data every customer enters in his application.
Now I think I have two options: either use cron jobs, for example let one or more cron jobs run at times every customer can set himself, OR make the whole processing of data run as daemons in the background...
My question is, since it's a self-hosted application (and every server is different)... is it even recommended to try to write PHP that starts background processes on a customer's server, or is this something you can do reliably only on your own server...?
Or should I use cronjobs for these long running processes?
(depending on the amount of data my customers will enter in the application, a process could run 3+ hours)
Is that even a problem that can be solved, reliably, with PHP...? Excuse me if this is a weird question; I'm really not experienced with PHP daemons and/or long-running cron jobs created by PHP.
So to recap everything:
Commercial self-hosted application, including long-running processes: cron jobs or daemons? And is either, or maybe both, a reliable solution for a paid application that you can give to your customers with a clear conscience because you know it will work reliably on all kinds of different servers...?
EDIT:
PS: Sorry, I forgot to mention that the application targets only Linux servers, so everything like Debian, Ubuntu, etc.
Short answer: no, don't go for background processes if this will be a client-hosted solution. If you go toward the ASP concept (Application Service Provider... not Active Server Pages ;)) then you can do some wacky stuff with background processes and external apps connecting to your SQL servers and processing stuff for you.
What I suggest is to create a strong task-management backbone and link that to a solid task-processing infrastructure. I recommend you read an old post I wrote quite some time ago regarding background processes and a strategy I adopted to fix long-running processes:
Start & Stop PHP Script from Backend Administrative Webpage
Happy reading...
UPDATE
I realize that my old post is far from easy to understand so here goes:
You need 2 models, Job and JobQueue, and 2 controllers, JobProcessor and XYZController.
JobProcessor is called either by a user when a page triggers it, or via a cron job, as you wish. JobProcessor::process() is the key method that starts the whole processing or continues it. It loads the JobQueues and asks them if there is work to do. If there is, it asks the job queue to start/continue its job.
JobQueue model: used to queue several JOBS one behind the other; it controls which job is currently active by keeping some kind of ID and STATE about the running job.
Job model: represents exactly what needs to be done; it contains, for example, the name of the controller that will process the data, the function to call to process the data, and a serialized configuration property that describes what must be done.
XYZController: the one that contains the processing method. When the processing method is called, the controller must load everything it needs into memory and then process each individual unit of work as fast as possible.
Example:
Call of index.php
Index.php creates a JobProcessor controller
Index.php calls the JobProcessor's process()
JobProcessor::Process() loads all the queues and processes them
For each JobQueue::Process(), the job queue loads its possible Jobs and detects whether one is currently running. If none is running, it starts the next one by calling Job::Process();
Job::Process() creates the XYZController that will work on the task at hand. For example, my old system had an InvoicingController and a MassmailingController that worked hand in hand.
Job::Process() calls XYZController::Prepare() so that it loads its information to process. (For example, load a batch of emails to process, load a batch of invoices to create)
Job::Process() calls XYZController::RunWorkUnit() so that it processes a single unit of work (For example, create one invoice, send one email)
Job::Process() asks JobProcessor::DoIStillHaveTimeToProcess() and, if so, continues processing the next element.
Job::Process() runs out of time and calls XYZController::Cleanup() so that all resources are released
JobQueue::Process() ends and returns to JobProcessor
JobProcessor::Process() is about to end? Open a socket, call myself back so I can start another round of processing until I don't have anything left to do
Handle the request from the user that started all this at step #1.
Ultimately, you can instead open a socket each time and ask the processor to do something, or you can queue a cron job to call your processor. This way your users won't get stuck waiting for the 3-4 work units to complete each time.
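A very stripped-down, hypothetical skeleton of that flow (class and method names follow the description above; persistence, the socket "call myself back" trick and error handling are omitted or stubbed):

<?php
class JobProcessor
{
    private $startTime;

    public function process()
    {
        $this->startTime = time();
        foreach (JobQueue::loadAll() as $queue) {     // hypothetical loader
            $queue->process($this);
        }
    }

    public function doIStillHaveTimeToProcess()
    {
        return (time() - $this->startTime) < 25;      // stay under a ~30 second budget
    }
}

class JobQueue
{
    public static function loadAll()
    {
        return [];                                    // load the queues from the DB here
    }

    public function process(JobProcessor $processor)
    {
        $job = $this->currentOrNextJob();             // hypothetical helper
        if ($job) {
            $job->process($processor);
        }
    }

    private function currentOrNextJob()
    {
        return null;                                  // check the stored ID/STATE here
    }
}

class Job
{
    public function process(JobProcessor $processor)
    {
        $controller = new InvoicingController();      // one example of an XYZController
        $controller->prepare();                       // load a batch into memory
        while ($processor->doIStillHaveTimeToProcess()) {
            if (!$controller->runWorkUnit()) {        // e.g. create one invoice
                break;                                // nothing left to do
            }
        }
        $controller->cleanup();                       // release resources
    }
}

class InvoicingController
{
    public function prepare()     { /* e.g. load a batch of invoices to create */ }
    public function runWorkUnit() { return false; /* create one invoice */ }
    public function cleanup()     { /* release resources */ }
}

(new JobProcessor())->process();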
It's worth noting that, in addition to running daemons or cron jobs, you can kick off long-running processes from a web request (but note that they must run outside of the webserver process group), and of course there's asynchronous message processing (which is essentially a variant on the batch approach).
All four of these approaches are very different in terms of how they behave and how concurrency and timing are managed. The factors which make them all different are the same ones you omitted from your question, so it's not really possible to answer.
Unfortunately, all of them rely on facilities which are very different between MS Windows and POSIX systems, so although PHP will run on both, if you want to sell your app on both platforms it's going to need two versions.
Maybe you should talk to your potential customer base and ask them what they want?
I'm currently developing a PHP daemon for connecting to and retrieving data from social networks like Facebook and Twitter. This script already works, but I have some concerns about it.
It's possible to create an unlimited number of accounts that the script has to process, and (right now) it runs every 5 minutes to create a 'near' real-time experience. My concern is that when, let's say, 5,000 accounts have been created and have to be monitored, the script will slow down and maybe run longer than the 5-minute interval. Is there any way to work around this problem? And better, is there any good way (with PHP, possibly with JavaScript) to create a better 'near' real-time experience?
Any advice will be great!
Thanks in advance
One option would be to spawn multiple daemons and share duties between them. Perhaps have a single central job queue and have the daemons consume that. It's really a server-side issue, and JavaScript has very little to do with such tasks, as long as it's not server-side JS.
If the number of monitored subjects is going into the thousands, PHP is not really a viable choice since it's neither inherently multi-threaded nor does it have synchronization features. In mass monitoring scenarios, a dedicated server running a J2EE, .NET or custom multithreaded application is pretty much a must.
For most sites you can retrieve a stream containing all that data (in real time). For example:
1. Twitter: Site Streams allows services, such as web sites or mobile push services, to receive real-time updates for a large number of users without any of the hassles of managing REST API rate limits.
2. Facebook: The Graph API supports real-time updates to enable your application using Facebook to subscribe to changes in data from Facebook.
When using these streams you can process the data in real time and don't have to do any (or hardly any) polling.
P.S: I would most definitely code this in node.js.
Set the max execution time to zero and enclose your script in an infinite loop:
set_time_limit(0);
while (true) {
    // your code
}
You should however include some way to end the process gracefully.
Some popular ways to do this are checking whether an environment variable was set or whether a specific file exists.
set_time_limit(0);
while (true) {
    // your code
    if (file_exists(KILL_SWITCH_FILE)) break;
}
Another approach would be setting a flag somewhere (in a file, in a SQL database, ...) while your script is running and removing it when you're done.
That way you can check if another instance of your script is still running.
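A minimal sketch of the file-flag variant (the lock file path is arbitrary):

<?php
// Refuse to start if another instance of this script is already running.
$lockFile = '/tmp/my-daemon.lock';                 // arbitrary path
if (file_exists($lockFile)) {
    exit("Already running\n");
}
file_put_contents($lockFile, getmypid());

// ... long-running work ...

unlink($lockFile);                                 // remove the flag when done

A more robust variant is to hold flock() on the file instead, since that lock is released automatically if the process dies.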
Greetings All!
I am having some trouble working out how to execute thousands upon thousands of requests to a web service (eBay). I have a limit of 5 million calls per day, so there are no problems on that end.
However, I'm trying to figure out how to process 1,000 - 10,000 requests every minute to every 5 minutes.
Basically the flow is:
1) Get list of items from database (1,000 to 10,000 items)
2) Make an API POST request for each item
3) Accept return data, process data, update database
Obviously a single PHP instance running this in a loop would be impossible.
I am aware that PHP is not a multithreaded language.
I tried the CURL solution, basically:
1) Get list of items from database
2) Initialize multi curl session
3) For each item add a curl session for the request
4) Execute the multi curl session
So you can imagine 1,000-10,000 GET requests occurring...
This was OK; around 100-200 requests were occurring in about a minute or two. However, only 100-200 of the 1,000 items actually processed, so I'm thinking I'm hitting some sort of Apache or MySQL limit.
But this does add latency; it's almost like performing a DoS attack on myself.
I'm wondering how you would handle this problem? What if you had to make 10,000 web service requests and 10,000 MySQL updates from the data returned by the web service... and it needs to be done within 5 minutes at most.
I am using PHP and MySQL with the Zend Framework.
Thanks!
I've had to do something similar, but with Facebook, updating 300,000+ profiles every hour. As suggested by grossvogel, you need to use many processes to speed things up because the script spends most of its time waiting for a response.
You can do this with forking, if your PHP install has support for forking, or you can just execute another PHP script via the command line.
exec('nohup /path/to/script.php >> /tmp/logfile 2>&1 & echo $!', $processId);
You can pass parameters (getopt) to the PHP script on the command line to tell it which "batch" to process. You can have the master script do a sleep/check cycle to see if the scripts are still running by checking for the process IDs. I've tested up to 100 scripts running at once in this manner, at which point the CPU load can get quite high.
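A rough sketch of that master/worker pattern (the paths, batch count and worker.php itself are invented for illustration; the PID check assumes the posix extension):

<?php
// Master script: launch several workers in the background, then sleep/poll
// their PIDs until they all finish.
$pids = [];
for ($batch = 0; $batch < 10; $batch++) {
    $out = [];
    exec("nohup php /path/to/worker.php --batch=$batch >> /tmp/worker.log 2>&1 & echo $!", $out);
    $pids[] = (int) end($out);                     // the last output line is the PID
}

while ($pids) {
    sleep(5);
    foreach ($pids as $i => $pid) {
        if (!posix_kill($pid, 0)) {                // signal 0: "does this PID still exist?"
            unset($pids[$i]);                      // that worker has finished
        }
    }
}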
Combine multiple processes with multi-curl, and you should easily be able to do what you need.
My two suggestions are (a) do some benchmarking to find out where your real bottlenecks are and (b) use batching and caching wherever possible.
Mysqli allows multiple-statement queries, so you could definitely batch those database updates.
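For example, a rough sketch of batching the updates with mysqli::multi_query so you don't pay one round trip per row (table/column names, credentials and the sample data are invented):

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'shop');
$items  = [['id' => 1, 'price' => '9.99'], ['id' => 2, 'price' => '4.50']]; // sample data

$sql = '';
foreach ($items as $item) {
    $id    = (int) $item['id'];
    $price = $mysqli->real_escape_string($item['price']);
    $sql  .= "UPDATE products SET price = '$price' WHERE id = $id;";
}

if ($mysqli->multi_query($sql)) {
    // drain all result sets so the connection is usable afterwards
    while ($mysqli->more_results() && $mysqli->next_result()) {
        // nothing to fetch for UPDATE statements
    }
}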
The HTTP requests to the web service are more likely the culprit, though. Check the API you're using to see if you can get more info from a single call, maybe? To break up the work, maybe you want a single master script to shell out to a bunch of individual processes, each of which makes an API call and stores the results in a file or memcached. The master can periodically read the results and update the db. (Be careful to rotate the data store for safe reading and writing by multiple processes.)
To understand your requirements better: must you implement your solution only in PHP, or can you interface a PHP part with another part written in another language?
If you cannot go for another language, try to perform this update as a PHP script that runs in the background and not through Apache.
You can follow Brent Baisley advice for a simple use case.
If you want to build a robust solution, then you need to:
set up a representation of the actions in a database table that will be your process queue;
set up a script that pops this queue and processes your actions;
set up a cron daemon that runs this script every X minutes.
This way you can have 1000 PHP scripts running, using your OS's parallelism capabilities and not hanging when eBay is taking too long to respond.
The real advantage of this system is that you can fully control the firepower you throw at your task by adjusting:
the number of requests one PHP script makes;
the order / number / type / priority of the actions in the queue;
the number of scripts the cron daemon runs.
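As a rough illustration of the pop-and-process script run by cron every X minutes (the job_queue table, its columns, the batch size and the callEbayApi() helper are all invented):

<?php
$batchSize = 50;                                   // requests per script run - tune this
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$rows = $pdo->query(
    "SELECT id, payload FROM job_queue
     WHERE status = 'pending'
     ORDER BY priority DESC, id ASC
     LIMIT $batchSize"
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    $pdo->exec("UPDATE job_queue SET status = 'running' WHERE id = {$row['id']}");
    $ok = callEbayApi(json_decode($row['payload'], true));   // hypothetical helper
    $status = $ok ? 'done' : 'failed';
    $pdo->exec("UPDATE job_queue SET status = '$status' WHERE id = {$row['id']}");
}

// stand-in for the real eBay POST request + result processing
function callEbayApi(array $params)
{
    return true;
}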
Thanks everyone for the awesome and quick answers!
The advice from Brent Baisley and e-satis works nicely. Rather than executing the sub-processes using cURL like I did before, forking takes a massive load off; it also nicely gets around the issue of maxing out my Apache connection limit.
Thanks again!
It is true that PHP is not multithreaded, but it can certainly be set up with multiple processes.
I have created a system that resembles the one you are describing. It runs in a loop and is basically a background process. It uses up to 8 processes for batch processing and a single control process.
It is somewhat simplified because I do not need any communication between the processes. Everything resides in a database, so each process is spawned with the full context taken from the database.
Here is a basic description of the system.
1. Start control process
2. Check database for new jobs
3. Spawn child process with the job data as a parameter
4. Keep a table of the child processes to be able to control the number of simultaneous processes.
Unfortunately it does not appear to be a widespread idea to use PHP for this type of application, and I really had to write wrappers for the low-level functions.
The manual has a whole section on these functions, and it appears that there are methods for allowing IPC as well.
PCNTL has the functions to control forking/child processes, and Semaphore covers IPC.
The interesting part of this is that I'm able to fork off actual PHP code, not execute other programs.
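A stripped-down sketch of that fork pattern using PCNTL (requires the pcntl extension and the CLI; loadNewJobs() and processJob() are placeholders, not part of the original system):

<?php
// Control process: fork up to 8 children, each handling one job.
$maxChildren = 8;
$children    = [];

foreach (loadNewJobs() as $job) {
    while (count($children) >= $maxChildren) {
        $finished = pcntl_wait($status);           // block until any child exits
        unset($children[$finished]);
    }

    $pid = pcntl_fork();
    if ($pid === 0) {
        processJob($job);                          // child: do the work, then exit
        exit(0);
    }
    $children[$pid] = true;                        // parent: keep a table of children
}

while ($children) {                                // wait for the remaining children
    $finished = pcntl_wait($status);
    unset($children[$finished]);
}

function loadNewJobs()
{
    return [];                                     // check the database for new jobs here
}

function processJob(array $job)
{
    // each child is spawned with the full context taken from the database
}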