I have about 35 cron jobs right now. Most of them are PHP scripts that either scrape or do some calculations. The scripts also loop over 10-20 different servers to do those scrapes. (They are in different countries, so they have to be separate calls.)
So we have 30 scripts, each of which loops over 20 servers and therefore takes about 5-15 minutes to run. I have each script spaced out right now.
But is it better to have 80 individual scripts run instead of 35 scripts that loop and take a while? Each script would take maybe 1-2 minutes instead of 10-15 minutes.
That would of course spawn a ton more PHP processes. Is there any issue or limit with 10-15 or more PHP processes running at once?
I'm running a performance cloud server on Rackspace.
Personally, if the jobs need to complete in a certain order, I would keep it as linear as possible. It might take longer, but I always err on the side of data accuracy.
It depends.
If you are creating more processes that will be running at the same time, you are going to increase your overall memory footprint. Each process carries its own memory overhead for the process to run and for loading any libraries it needs (on top of whatever memory the job itself uses). You will also have more than twice as many scripts to monitor to make sure they are all running successfully.
However, by creating more processes you will be able to speed things up, since you are essentially multi-processing: one process can continue while another is blocked waiting on I/O.
If each script doesn't have a dependency on another, breaking them into smaller scripts should be fine. If you can handle monitoring more scripts, and the server can handle it, then I would do it.
If the scripts do have dependencies, or if you would have to run so many at the same time that your server usage maxes out, keep them together.
That being said, I would also try to optimize the scripts; make sure there isn't something you can do to make them faster without creating more processes.
Depending on how you have the servers set up, I would run them all at once. I would also run them at night, off hours when the web servers aren't in use, and not during business operations unless your web app depends on it. If you're on a Cloud Server on Rackspace I wouldn't worry about bandwidth, although needing more RAM could become an issue further down the road.
Spawning a ton more PHP processes shouldn't be a worry if you have a sufficient amount of RAM; there is no practical limit on the Linux side at that scale.
a) Figure out which cron needs to run in which order
b) Schedule the crons to run at night, around midnight
c) Run and fire off the 80 scripts at once
It would also be a good idea to have cron email you the results, or a report that everything went through successfully, per batch rather than per individual cron; a rough sketch of that follows.
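Something along these lines could tie (c) and the batch report email together. The paths and email address are made-up examples, not anything specific to your setup:

<?php
// run_batch.php - fired by a single cron entry around midnight.
// Starts every script at once, waits for them all, then sends ONE email
// report for the whole batch.
$scripts = glob('/var/www/crons/*.php');
$procs   = array();

foreach ($scripts as $script) {
    // Launch each script concurrently; output goes to a shared log file.
    $cmd = 'php ' . escapeshellarg($script) . ' >> /var/log/cron-batch.log 2>&1';
    $procs[$script] = proc_open($cmd, array(), $pipes);
}

// Wait for the whole batch and collect exit codes.
$failed = array();
foreach ($procs as $script => $proc) {
    if (proc_close($proc) !== 0) {
        $failed[] = basename($script);
    }
}

$summary = $failed
    ? 'Batch finished with failures: ' . implode(', ', $failed)
    : 'Batch finished, all ' . count($procs) . ' scripts succeeded';
mail('you@example.com', 'Nightly cron batch report', $summary);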
Is it that much more memory-consuming to have a daemon written in PHP (which listens to and processes a queue) compared to the crontab way of executing background tasks?
I have ~600 shops on one server under one engine. Some tasks a shop owner runs take a lot of time, so it is reasonable to fork them off. Putting a task into cron works well; I just don't like the up-to-59-second delay before it starts (a restriction of cron). So I'd like to try a queue system. I'm just afraid it will force me to run 600 PHP threads to listen to and process those queues (the shops belong to different customers, so I can't make a common daemon). Doesn't that automatically require some 600-1000 MB more memory, which would make it a poor choice compared to cron (which only loads a process when one is scheduled)?
Instead of putting them into cron with up to a 59-second delay, why not run them using the "at" daemon? You can simply use "at now" and they'll run immediately. See, for example:
http://unixhelp.ed.ac.uk/CGI/man-cgi?at
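Queuing a task with "at" from PHP could look like this; the script path is just a placeholder:

<?php
// Hand the job to the "at" daemon so it starts right away instead of
// waiting for the next cron minute. The task path is a made-up example.
$task = '/var/www/shops/tasks/rebuild_catalog.php';
$cmd  = 'echo ' . escapeshellarg('php ' . $task) . ' | at now 2>&1';
echo shell_exec($cmd);   // "at" reports the queued job number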
I certainly wouldn't consider running 600 threads in PHP as daemons simultaneously.
I've previously built queue-runners that ran as many as 75-100 separate PHP processes, using supervisor to start as many as I wanted. Since they share so much common code, much of that memory is also shared by the OS rather than duplicated per process.
I'd run a few dozen, or more, maybe with some kind of high-priority queue for the small, fast jobs and a subset of the workers reserved for the large, slow ones; a sketch of such a worker is below.
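Each worker can be a very small PHP loop; supervisor just keeps N copies of it alive and restarts them when they exit. A minimal sketch, using a directory per queue purely as an illustrative stand-in for a real queue backend:

<?php
// worker.php - one queue-runner; supervisor keeps N copies of this alive.
set_time_limit(0);

function next_job($dir) {
    $files = glob($dir . '/*.job');
    return $files ? $files[0] : null;
}

while (true) {
    // Small, fast jobs first; fall back to the slow queue only when idle.
    $job = next_job('/var/spool/queue/high');
    if ($job === null) {
        $job = next_job('/var/spool/queue/slow');
    }
    if ($job === null) {
        sleep(1);                 // nothing to do, don't spin the CPU
        continue;
    }

    $payload = file_get_contents($job);
    unlink($job);                 // naive claim; see the note below
    // ... do the actual work with $payload here ...
}

In a real setup the claim needs to be atomic (rename() into a per-worker directory, or a proper queue server) so two workers can't grab the same job.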
I've written on the subject at my tech blog, phpscaling.com.
I built an app in PHP where a feature analyzes about 10,000 text files, extracts stuff from them, and puts it into a MySQL database. The code itself is just a for loop where every file is loaded through file_get_contents() and, at the end of that iteration, unset() from memory. The file analysis is a cron job, and a single PHP file does all this processing; roughly, it looks like the sketch below.
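(Simplified; the directory, table, and the extraction step itself are placeholders, not the real code.)

<?php
// Hourly cron: analyze every text file and insert the results into MySQL.
$pdo   = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt  = $pdo->prepare('INSERT INTO extracted (filename, data) VALUES (?, ?)');
$files = glob('/var/data/texts/*.txt');       // ~10,000 files

foreach ($files as $file) {
    $contents = file_get_contents($file);
    $data     = substr($contents, 0, 255);    // stand-in for the real extraction
    $stmt->execute(array($file, $data));
    unset($contents, $data);                  // freed before the next iteration
}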
The app was (initially) built entirely on a shared server, and everything worked really well; I didn't notice any delays or major lag, and neither did users. However, in order for it to handle more load, I moved everything to an EC2 server (a micro instance).
The problem I am having now is that every time I run the cron job (processing the files on an hourly basis) it slows the entire server down so much that a normal page takes about 5-8 seconds to load, which sort of defeats the purpose of moving to EC2.
The cron itself is a very long process. Here are some test results from the script's hourly run:
SQL Insertion Time: 23.138303995132 seconds
Memory Used: 10.05 MB
Execution: 411.00507092476 seconds
But at the top of every hour the server slows down badly for about 7 minutes, despite having more dedicated hardware than a shared server (I think, at least). The graphs on the EC2 dashboard show that CPU usage is close to 100%, but I don't understand how it gets to that level.
Can anyone help me determine why this could be happening? I notice not even the slightest lag when the cron runs on the shared server, but the case is completely different on EC2.
Please feel free to ask me anything I missed mentioning.
Micro instances are pretty slow. If you use a larger instance, it'll run a lot faster.
We use EC2 for all of our production boxes. I can't say enough good things about that platform. I'll never go back to another host.
Also, if you want to write your code in C++, it'll run A LOT faster. I wrote a simple MySQL insert with the code here; it's multi-threaded, so you can asynchronously run MySQL updates or inserts.
Please let me know if you need any help with it, but I'm sure you'll be able to just use a micro instance still and get great speeds.
Hope that helps...
PS. I'd be willing to help you write a C++ version for your uses... just because it's fun! :-)
Well EC2 is designed to be scalable.
Since your code runs in one loop, opening each file one after another, it does not make for a scalable design.
Try changing your code to break the work up so that the files are handled concurrently by different instances of the PHP script. That way, each copy of the script runs in a process of its own. If you have multiple servers (or server instances in EC2), you can run them on different machines to speed it up even more. A sketch of the splitting is below.
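One plain-PHP way to do the split is to have the cron entry point chunk the file list and launch one background worker per chunk. The paths, the worker count, and process_chunk.php itself are hypothetical:

<?php
// dispatch.php - run hourly by cron. Splits the file list into chunks and
// hands each chunk to a separate PHP process so they run concurrently.
$files  = glob('/var/data/texts/*.txt');
$chunks = array_chunk($files, max(1, (int) ceil(count($files) / 4)));   // 4 workers

foreach ($chunks as $i => $chunk) {
    $list = '/tmp/chunk_' . $i . '.txt';
    file_put_contents($list, implode("\n", $chunk));
    // Launch a worker in the background with its own slice of the files;
    // process_chunk.php (hypothetical) would do the file_get_contents work.
    exec('php /var/www/process_chunk.php ' . escapeshellarg($list)
         . ' > /dev/null 2>&1 &');
}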
I have created a PHP/MySQL scraper, which is running fine, and have no idea how to most efficiently run it as a cron job.
There are 300 sites, each with between 20 - 200 pages being scraped. It takes between 4 - 7 hours to scrape all the sites (depending on network latency and other factors). The scraper needs to do a complete run once daily.
Should I run this as 1 cron job which runs for the entire 4 - 7 hours, or run it every hour 7 times, or run it every 10 minutes until complete?
The script is set up to run from the cron like this:
$starttime = time();                 // when this cron invocation began
while ($starttime + 600 > time()) {  // keep going until 600 seconds have passed
    do_scrape();                     // scrapes 10 URLs per call, 5-60 seconds each
}
This will run the do_scrape() function, which scrapes 10 URLs at a time, until (in this case) 600 seconds have passed. do_scrape() can take between 5 and 60 seconds to run.
I am asking here as I can't find any information on the web about how to run this, and am kind of wary about getting it running daily, as PHP isn't really designed to be run as a single script for 7 hours.
I wrote it in vanilla PHP/MySQL, and it is running on a cut-down Debian VPS with only lighttpd/MySQL/PHP5 installed. I have run it with a timeout of 6000 seconds (100 minutes) without any issue (the server didn't fall over).
Any advice on how to go about this task is appreciated. What should I be watching out for? Or am I going about executing this all wrong?
Thanks!
There's nothing wrong with running a well-written PHP script for long periods. I have some scripts that have literally been running continuously for months. Just watch your memory usage, and you should be fine.
That said, your architecture is pretty basic and is unlikely to scale very well.
You might consider moving from a big monolithic script to a divide-and-conquer strategy. For instance, it sounds like your script is making synchronous requests for every URL it scrapes. If that's true, then most of that 7-hour run time is spent idly waiting for a response from some remote server.
In an ideal world, you wouldn't write this kind of thing in PHP. A language that handles threads and can easily do asynchronous HTTP requests with callbacks would be much better suited.
That said, if I were doing this in PHP, I'd be aiming at having a script that kicks off N children that grab data from URLs and stick the response data in some kind of work queue, plus another script that pretty much runs all the time, processing any work it finds in the queue.
Then you just cron your fetcher-script-manager to run once an hour; it manages some worker processes that fetch the data (in parallel, so latency doesn't kill you) and stick the work on the queue. Then the queue-cruncher sees the work on the queue and crunches it.
Depending on how you implement the queue, this could scale pretty well. You could have multiple boxes fetching remote data, and sticking it on some central queue box (with a queue implemented in mysql, or memcache, or whatever). You could even conceivably have multiple boxes taking work from the queue and doing the work.
Of course, the devil is in the details, but this design is generally more scalable and usually more robust than a single-threaded fetch-process-repeat script.
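A bare-bones sketch of the fetcher half, assuming a MySQL table plays the role of the queue (the table, paths, and worker count are illustrative, and it needs the pcntl extension, so CLI only):

<?php
// fetch_manager.php - cron'd hourly. Forks N workers that fetch URLs in
// parallel and push the raw responses onto a queue table, e.g.:
//   CREATE TABLE work_queue (id INT AUTO_INCREMENT PRIMARY KEY,
//     url VARCHAR(255), body MEDIUMTEXT, done TINYINT NOT NULL DEFAULT 0);
$urls    = file('/var/www/scraper/urls.txt', FILE_IGNORE_NEW_LINES);
$workers = 8;
$slices  = array_chunk($urls, max(1, (int) ceil(count($urls) / $workers)));

foreach ($slices as $slice) {
    $pid = pcntl_fork();
    if ($pid === 0) {                          // child: fetch its slice, then exit
        $pdo  = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
        $stmt = $pdo->prepare('INSERT INTO work_queue (url, body) VALUES (?, ?)');
        foreach ($slice as $url) {
            $stmt->execute(array($url, file_get_contents($url)));
        }
        exit(0);
    }
}
while (pcntl_wait($status) > 0);               // wait for every child to finish

The queue-cruncher is then just another script (cron'd or long-running) that SELECTs rows with done = 0, processes them, and marks them done.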
You shouldn't have a problem running it once a day to completion. That's the way I would do it. Timeouts are a big issue if PHP is being served through a web server, but since you are interpreting directly through the PHP executable this is fine. I would advise you to use Python or something else that is more task-friendly, though.
I have a PHP script running on my server via a cron job. The job runs every minute. In the PHP script I have a loop that executes, then waits one second, and loops again, essentially making the script run once every second.
Now I'm wondering: if I make the cron job run only once per hour and have the script loop for an entire hour, or possibly an entire day, would this have any impact on the server's CPU and/or memory, and if so, would it be positive or negative?
I spot a design flaw.
You can always have a PHP script permanently running in a loop performing whatever functionality you require, without dependency upon a webserver or clients.
You are obviously checking something with this script; any insight into what? There may be better solutions for you. For example, if it is a database, consider SQL triggers.
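For reference, such a permanently running script is nothing more than a small CLI loop, started once (for example from supervisord or an init script) rather than from cron. The check itself is a placeholder here, since you haven't said what it is:

<?php
// checker.php - started once and left running; no cron or webserver involved.
set_time_limit(0);

// Placeholder for whatever the per-second check actually does
// (poll a table, look at a directory, hit an API, ...).
function check_whatever() {
}

while (true) {
    check_whatever();
    sleep(1);            // roughly once per second
}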
In my opinion it would have a negative impact, since the script keeps using resources.
cron is driven by a time-based scheduler that is already running on the server anyway.
But a cron job can only run once a minute at most.
Another thing: if the script times out, fails, or crashes for whatever reason, you end up not running it for up to an hour. That would have a positive impact on server load, but it's not what you're looking for, I guess? :)
Maybe run it every 2 or even 5 minutes to spare server load?
Or maybe change the script so it does not wait but just executes once, and call it from the cron job; that should have a positive impact on server load. A sketch of that is below.
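That last option might look like this: the script does one pass and exits, with a lock file so overlapping runs can't pile up (the lock path is arbitrary):

<?php
// run_once.php - called by cron every minute. Does a single pass and exits;
// the lock prevents two invocations from overlapping.
$lock = fopen('/tmp/run_once.lock', 'c');
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0);             // the previous run is still going, skip this one
}

// ... do the single pass of work here ...

flock($lock, LOCK_UN);
fclose($lock);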
I think you should change the script logic if possible.
If the tasks your script executes are not periodic but are triggered by some events, then you can use a message queue (like Gearman).
Otherwise your solution is OK. Memory leaks can occur, but in newer PHP versions (5.3.x) the garbage collector is pretty good. Some extensions can lead to memory leaks, or your application design can lead to hungry memory usage (like Doctrine ORM's loaded-objects cache).
But you can control the script's memory usage with tools like monit, restarting the script when memory reaches some limit, or starting it again when it unexpectedly shuts down.
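The script can also police itself alongside monit, for example (the 128 MB threshold is arbitrary):

<?php
// Inside the daemon's main loop: bail out cleanly once memory usage grows
// past a threshold and let monit (or supervisord) start a fresh process.
$limit = 128 * 1024 * 1024;

while (true) {
    // ... handle one unit of work here ...

    if (memory_get_usage(true) > $limit) {
        exit(0);         // clean exit; the watchdog restarts the script
    }
    sleep(1);
}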
When executing proc_nice(), is it actually nice'ing Apache's thread?
If so, and if the current user (a non-superuser) can't renice back to the original priority, is killing the Apache thread (apache_child_terminate) appropriate on an Apache 2.0.x server?
The issue is that I am trying to limit the impact of an app that allows the user to run ad-hoc queries. The queries can be massive, and the resulting transform on the data requires a lot of memory and CPU.
I've already rewritten the process to be more stream-based, which helps with memory consumption, but I would also like the process to run at a lower priority. However, I can't leave the Apache thread at low priority, as we have a lot of high-priority web services running on this same box.
TIA
In that kind of situation, a solution is often not to do that kind of heavy work within the Apache processes, but either:
run an external PHP process, using something like shell_exec, for instance -- this is if you must work in synchronous mode (i.e., if you cannot execute the task a couple of minutes later)
push the task to a FIFO system, and immediately return a message to the user saying "your task will be processed soon"
and have some other process (launched via a crontab every minute, for instance) check that FIFO queue
and do the processing if there is something in the queue
That process, itself, can run in low priority mode.
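For the first option, the external process can be started at low priority right from the shell command, so the Apache worker's own priority is never touched; the script path and argument are made-up examples:

<?php
// Inside the web request: run the heavy transform in a separate PHP process
// at low priority, and wait for its output (synchronous mode).
$arg    = escapeshellarg('12345');     // e.g. the id of the ad-hoc query
$output = shell_exec('nice -n 19 php /var/www/jobs/run_query.php ' . $arg);
// $output holds whatever run_query.php printed; the Apache process itself
// stayed at its normal priority the whole time.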
As often as possible, especially if the heavy calculations take some time, I would go for the second solution:
It allows users to get some feedback immediately: "the server has received your request, and will process it soon"
It doesn't keep Apache's processes "working" for long: the heavy stuff is done by other processes
If, one day, you need such an amount of processing power that one server is not enough anymore, this kind of system will be easier to scale: just add a second server that'll pick from the same FIFO queue
If your server is really too loaded, you can stop processing from the queue, at least for some time, so the load can recover -- for instance, this can be useful if your critical web services are used a lot in a specific time frame.
Another (nice-looking, but I haven't tried it yet) solution would be to use some kind of tool like, for instance, Gearman :
Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.
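With the pecl/gearman extension, the hand-off looks roughly like this; the function name and payload are illustrative, and the worker is a separate CLI process that could call proc_nice() on itself:

<?php
// Client side (e.g. inside the web request): queue the heavy job and return.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('heavy_transform', json_encode(array('query_id' => 42)));

// Worker side (a separate, long-running CLI process):
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('heavy_transform', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... run the expensive query/transform here ...
});
while ($worker->work());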