I have code that needs to run in 5 parts, each 10 minutes apart. I know I could run 5 different cron jobs, but the script lends itself to being one script with 10-minute sleep()s at different points.
So I have:
set_time_limit(3600);
//code
sleep(600);
//continues
sleep(600);
//etc
Is doing this highly inefficient, or should I find a way to have it split into 5 different cron jobs run 10 minutes apart?
sleep() doesn't consume CPU time, but the process will keep consuming RAM while it sleeps because the PHP engine needs to stay running. It shouldn't be a problem if you have a lot of free RAM, but I would still suggest splitting it into separate crons.
Personally, I've used long sleeps (10-20 minutes) in web crawlers I've written in PHP, running them from my local 4 GB RAM machine with no problem.
It depends on the task you have, but generally speaking it is bad, because the script holds on to resources it doesn't need for a long time and has a high risk of being interrupted (by a system crash/reboot, or by external changes to the resources the script operates on).
I'd recommend using a job queue daemon with delay features, such as RabbitMQ, so that after each block you enqueue the next one to run in 10 minutes. That will save resources and increase stability.
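For what it's worth, a rough sketch of "enqueue the next block with a 10-minute delay", assuming the php-amqplib client and RabbitMQ's delayed-message-exchange plugin are installed (the exchange and queue names here are made up), might look like this:

<?php
// A sketch only: requires php-amqplib and the rabbitmq_delayed_message_exchange plugin.
// The next "block" of the script would be a separate CLI consumer reading this queue.
require 'vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;
use PhpAmqpLib\Wire\AMQPTable;

$conn = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$ch   = $conn->channel();

// An x-delayed-message exchange holds messages until their x-delay header expires.
$ch->exchange_declare('jobs.delayed', 'x-delayed-message', false, true, false, false, false,
    new AMQPTable(['x-delayed-type' => 'direct']));
$ch->queue_declare('script_parts', false, true, false, false);
$ch->queue_bind('script_parts', 'jobs.delayed', 'script_parts');

// After finishing part 1, enqueue part 2 to be delivered in 10 minutes.
$msg = new AMQPMessage(json_encode(['part' => 2]), [
    'application_headers' => new AMQPTable(['x-delay' => 600000]), // delay in milliseconds
]);
$ch->basic_publish($msg, 'jobs.delayed', 'script_parts');

$ch->close();
$conn->close();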
Related
I have about 35 cron jobs right now. Most of them are PHP scripts that either scrape or do some calculations. The scripts also loop over 10-20 different servers to do those scrapes. (They are in different countries, so they have to be separate calls.)
So we have 30 scripts; each loops over 20 servers and therefore takes about 5-15 minutes to run. I have the scripts spaced out right now.
But is it better to run 80 individual scripts instead of 35 scripts that loop and take a while? Each script would take maybe 1-2 minutes instead of 10-15 minutes.
That would of course spawn a ton more PHP processes. Is there any issue or limit with 10-15 or more PHP processes running at once?
I'm running a Performance Cloud Server on Rackspace.
Personally, if the jobs need to complete in a certain order, I would make it as linear as possible. It might take longer, but I always err on the side of data accuracy.
It depends.
If you are creating more processes that will run at the same time, you are going to increase your overall memory footprint. Each process carries its own memory overhead to run and to load whatever libraries it needs (aside from the memory it needs for its actual work). You will also have more than twice as many scripts to monitor to make sure they are all running successfully.
However, by creating more processes you will be able to speed things up, since you are essentially multi-processing: one process can continue while another is blocked waiting on I/O.
If each script doesn't have a dependency on another, breaking them into smaller scripts should be fine. If you can handle monitoring more scripts, and the server can handle it, then I would do it.
If the scripts do have dependencies, or if running so many at the same time would max out your server, keep them together.
That being said, I would also try to optimize the scripts; make sure there isn't something you can do to make them faster without creating more processes.
Depending on how you have the servers set up, I would run them all at once. In addition, I would run them at night, during off hours when the web servers aren't in use and not during business operations, unless your web app depends on it. If you're on a Cloud Server at Rackspace I wouldn't worry about bandwidth, although increasing your RAM could be an issue further down the road.
Spawning a ton more PHP processes shouldn't be a worry if you have a sufficient amount of RAM; there is no hard limit on the Linux side.
a) Figure out which crons need to run in which order
b) Schedule the crons to run at night, around midnight
c) Fire off the 80 scripts at once
It would also be a good idea to send yourself an email with the cron results, or a report that everything went through successfully, per batch rather than per individual cron.
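As a rough illustration of that batch-level report (not the author's exact setup; the directory, address and the sequential loop below are placeholders, and the scripts could just as well be fired off in the background as suggested above):

<?php
// Run each job, collect exit codes, and send one summary email for the whole batch.
$scripts = glob('/var/www/jobs/*.php');   // made-up location of the individual scripts
$report  = [];

foreach ($scripts as $script) {
    $start  = microtime(true);
    $output = [];
    exec('php ' . escapeshellarg($script) . ' 2>&1', $output, $code);
    $report[] = sprintf('%-30s %s (%.1fs)',
        basename($script),
        $code === 0 ? 'OK' : "FAILED ($code)",
        microtime(true) - $start);
}

mail('ops@example.com', 'Nightly cron batch report', implode("\n", $report));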
Is a daemon written in PHP (which listens to and processes a queue) really that memory-hungry compared with the crontab way of executing background tasks?
I have ~600 shops on one server under one engine. Some tasks shop owners run require a lot of time, so it is reasonable to fork them. Putting a task into cron works well; I just don't like the up-to-59-second start delay (a restriction of cron), so I'd like to try a queue system. I'm just afraid it will force me to run 600 PHP threads to listen to and process those queues (the shops belong to different customers, so I can't make a common daemon). Doesn't that automatically require some 600-1000 MB more memory, which would make it a poor choice compared to cron (which only loads a process when one is scheduled)?
Instead of putting them into cron with an up-to-59-second delay, why not run them using the "at" daemon? You can simply use "at now" and they'll run immediately. See, for example:
http://unixhelp.ed.ac.uk/CGI/man-cgi?at
I certainly wouldn't consider running 600 threads in PHP as daemons simultaneously.
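A minimal sketch of handing a task to at from PHP, assuming the at daemon is installed and the PHP user is permitted to use it (the script path and argument below are invented):

<?php
// Pipe the command into `at now`; at runs it immediately in the background,
// detached from the web request, with no cron-minute delay.
$task = 'php /var/www/shops/tasks/rebuild_feed.php shop_42';   // made-up path and argument
shell_exec('echo ' . escapeshellarg($task) . ' | at now 2>&1');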
I've previously built queue-runners that ran as many as 75-100 separate PHP processes, using supervisor to start as many as I wanted. Since they share so much common code, that code is also shared by the OS rather than duplicated in memory.
I'd run a few dozen, or more, perhaps with some kind of high-priority queue for the small, fast jobs and a subset of the workers that can happily run the large, slow ones.
I've written on the subject at my tech blog, phpscaling.com.
I have created a PHP/MySQL scraper, which is running fine, and I have no idea how to most efficiently run it as a cron job.
There are 300 sites, each with between 20 - 200 pages being scraped. It takes between 4 - 7 hours to scrape all the sites (depending on network latency and other factors). The scraper needs to do a complete run once daily.
Should I run this as 1 cron job which runs for the entire 4 - 7 hours, or run it every hour 7 times, or run it every 10 minutes until complete?
The script is set up to run from the cron like this:
while ($starttime + 600 > time()) {
    do_scrape();
}
This will run the do_scrape() function, which scrapes 10 URLs at a time, until (in this case) 600 seconds have passed. Each do_scrape() call can take between 5 - 60 seconds to run.
I am asking here as I can't find any information on the web about how to run this, and I'm kind of wary about getting it running daily, as PHP isn't really designed to run as a single script for 7 hours.
I wrote it in vanilla PHP/MySQL, and it is running on a cut-down Debian VPS with only lighttpd/MySQL/PHP5 installed. I have run it with a timeout of 6000 seconds (100 minutes) without any issue (the server didn't fall over).
Any advice on how to go about this task is appreciated. What should I be watching out for, etc.? Or am I going about executing this all wrong?
Thanks!
There's nothing wrong with running a well-written PHP script for long periods. I have some scripts that have literally been running continuously for months. Just watch your memory usage, and you should be fine.
That said, your architecture is pretty basic and is unlikely to scale very well.
You might consider moving from a big monolithic script to a divide-and-conquer strategy. For instance, it sounds like your script is making synchronous requests for every URL it scrapes. If that's true, then most of that 7-hour run time is spent idly waiting for a response from some remote server.
In an ideal world, you wouldn't write this kind of thing in PHP. A language that handles threads and can easily make asynchronous HTTP requests with callbacks would be much better suited.
That said, if I were doing this in PHP, I'd aim for one script that kicks off N children which grab data from URLs and stick the response data in some kind of work queue, and another script that pretty much runs all the time, processing any work it finds in the queue.
Then you just cron your fetcher-script manager to run once an hour; it manages some worker processes that fetch the data (in parallel, so latency doesn't kill you) and stick the work on the queue. Then the queue-cruncher sees the work on the queue and crunches it.
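As a sketch of the "fetch in parallel" half, one option in plain PHP is curl_multi; the URL list and the enqueue_work() call below are placeholders for your own site list and queue:

<?php
// Fetch a batch of URLs concurrently with curl_multi instead of one at a time.
$urls = ['http://site-a.example/page1', 'http://site-b.example/page1'];  // placeholder list

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until they finish, waiting on socket activity in between.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status == CURLM_OK);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    enqueue_work($url, $html);          // placeholder: push the response onto your work queue
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);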
Depending on how you implement the queue, this could scale pretty well. You could have multiple boxes fetching remote data, and sticking it on some central queue box (with a queue implemented in mysql, or memcache, or whatever). You could even conceivably have multiple boxes taking work from the queue and doing the work.
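And a rough sketch of the queue-cruncher half, assuming a MySQL table work_queue(id, payload, status) that the fetch workers fill; the table, its columns and process_payload() are illustrative, not from the original post:

<?php
// Claim one pending job at a time from the shared MySQL queue and process it.
$db = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

while (true) {
    $db->beginTransaction();
    $row = $db->query("SELECT id, payload FROM work_queue
                       WHERE status = 'pending'
                       ORDER BY id LIMIT 1 FOR UPDATE")->fetch(PDO::FETCH_ASSOC);
    if (!$row) {
        $db->commit();
        sleep(5);           // nothing to do, idle cheaply
        continue;
    }
    $db->prepare("UPDATE work_queue SET status = 'working' WHERE id = ?")
       ->execute([$row['id']]);
    $db->commit();

    process_payload($row['payload']);   // placeholder for your existing parsing/insert code

    $db->prepare("UPDATE work_queue SET status = 'done' WHERE id = ?")
       ->execute([$row['id']]);
}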
Of course, the devil is in the details, but this design is generally more scalable and usually more robust than a single-threaded fetch-process-repeat script.
You shouldn't have a problem running it once a day to completion. That's the way I would do it. Timeouts are a big issue if PHP is being served through a web server, but since you are running directly through the php executable this is OK. I would advise you to use Python or something else that is more task-friendly, though.
I'm running a few PHP jobs which fetch hundreds of thousands of records from a web service and insert them into a database. These jobs take up a lot of the server's CPU.
My question is, how much is considered high?
When I run the "top" command on the Linux server, it shows around 77%. It goes up to more than 100% if I run more jobs simultaneously. That seems high to me (does more than 100% mean it is running on a second CPU core?).
28908 mysql 15 0 152m 43m 5556 S 77.6 4.3 2099:25 mysqld
7227 apache 15 0 104m 79m 5964 S 2.3 7.8 4:54.81 httpd
This server also has web pages/projects hosted on it. The hourly job seems to be affecting the server as well as the other web projects' loading times.
If it is high, is there any way of making it more CPU-efficient?
Can anyone enlighten me?
A better indicator is the load average; to simplify, it is the number of tasks waiting because of insufficient resources.
You can see it with the uptime command, for example: 13:05:31 up 6 days, 22:54, 5 users, load average: 0.01, 0.04, 0.06. The three numbers at the end are the load averages for the last minute, the last 5 minutes and the last 15 minutes. If it reaches 1.00 per core, something is waiting.
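The same numbers are also available from inside PHP via sys_getloadavg(), so a heavy job could check the load before starting (the 1.0 threshold below is arbitrary):

<?php
// sys_getloadavg() returns the 1, 5 and 15 minute load averages.
list($load1, $load5, $load15) = sys_getloadavg();
if ($load1 > 1.0) {
    // Server is already busy; postpone the heavy import.
    exit("Load is already $load1, trying again later\n");
}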
I'd say 77% is definitely high.
There are probably many ways to make the job more efficient (recursive import), but not much info is given.
A quick fix would be to invoke the script with the nice command and add a few sleeps to stretch the load out over time. I guess you are also saturating the network during the import, so if you can split up the job, that would keep your site from stalling.
You can always nice your tasks
http://unixhelp.ed.ac.uk/CGI/man-cgi?nice
With the nice command you can give processes more or less priority.
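A small sketch that combines the two suggestions, lowering the script's own priority with proc_nice() (available in CLI PHP on POSIX systems) and pacing the inserts with sleep(); $records and insert_batch() are made-up placeholders for your own data and insert logic:

<?php
// Lower this process's CPU priority (+10 niceness), then pace the work.
if (function_exists('proc_nice')) {
    proc_nice(10);
}

foreach (array_chunk($records, 1000) as $batch) {   // $records: data already fetched (placeholder)
    insert_batch($batch);   // your existing insert logic (placeholder name)
    sleep(2);               // give mysqld and the web pages some breathing room
}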
These jobs take up a lot of the server's CPU.
My question is, how much is considered high?
That is entirely subjective. On computing nodes, the CPU usage is pretty much 100% per core all the time. Is that high? No, not at all, it is proper use of hardware that has been bought for money.
Nice won't help much, since it's mysqld that's occupying your CPU; putting nice on the PHP client, as in
nice -10 php /home/me/myjob.php
won't make any significant difference.
Better to split the job up into smaller parts, call your PHP script from cron, and build it like this:
<?php
ini_set("max_execution_time", "600");

// 1. Get the file from the remote server, in chunks to avoid network saturation
$fp  = fopen('http://example.org/list.txt', 'r');
$fp2 = fopen('local.txt', 'w');
while (!feof($fp)) {
    fwrite($fp2, fread($fp, 10000));
    sleep(5);
}
fclose($fp);
fclose($fp2);

// 2. Process the local copy in batches, pausing between them
$fp = fopen('local.txt', 'r');
while (!feof($fp)) {
    // read 1000 lines
    // do insert...
    sleep(10);
}
fclose($fp);
// finished, now rename to .bak, log success or whatever...
I have a loop that runs for approx. 25 minutes, i.e. 1500 seconds [100 loops with sleep(15)].
The execution time of the statements inside the loop is very small.
My scripts are hosted on GoDaddy. I am sure they have some kind of limit on execution time.
My question is, are they concerned with the total CPU execution time or the total running time?
They will be concerned with the CPU Execution Time, not the total running time unless connections are an issue and you're using a lot of them (which it doesn't sound like you are).
Running time, as in a stopwatch, doesn't matter much to a shared host: if your loop runs for 3 years but only uses 0.01% CPU doing it, it doesn't impact their ability to host. However, if you ran for 3 years at 100% CPU, that directly impacts how many other applications/VMs/whatever can run on the same hardware. That would mean more servers to host the same number of people, which means money, and that is what they care about.
For the question in the title: they are very different. With sleep() and the same amount of total time, the actual work the CPU does is much less, because it can do the work, sleep/idle, and still finish in the same amount of time. When you call sleep() you're not taxing the CPU; it's a very low-power operation for it to keep the timer going until your code is called again.
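A tiny way to see the difference for yourself is to compare wall-clock time with CPU time around a couple of sleep() calls using getrusage():

<?php
// Runs for ~30 seconds of real time but consumes almost no CPU time,
// because sleep() just asks the OS to wake the process up later.
$wallStart  = microtime(true);
$usageStart = getrusage();

for ($i = 0; $i < 2; $i++) {
    sleep(15);
}

$wall  = microtime(true) - $wallStart;
$usage = getrusage();
$cpu   = ($usage['ru_utime.tv_sec'] - $usageStart['ru_utime.tv_sec'])
       + ($usage['ru_utime.tv_usec'] - $usageStart['ru_utime.tv_usec']) / 1e6;

printf("Wall time: %.1fs, CPU time: %.4fs\n", $wall, $cpu);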
This is the typical time limit:
http://es2.php.net/manual/en/info.configuration.php#ini.max-execution-time
It can normally be altered on a per-script basis with ini_set(), e.g.:
ini_set('max_execution_time', 20*60); // 20 minutes (in seconds)
In any case, the exact time limits probably depend on how PHP is running (Apache module, FastCGI, CLI...).