PHP Scripts with a huge amount of Execution Time - php

I am writing a PHP script which downloads pictures from Internet. As the data is huge the execution time for the script varies from 10-15 minutes. Are there any better ways to handle such a situation or should i simple execute the script and let it take the time it takes?

Your script appears to be essentially I/O bound. Short of getting more bandwidth, there's little you can do.
You can improve user experience (if any) by increasing interactivity. For example you can save the filenames you intend to download in a session, and redisplay the page (and refresh it, or go AJAX) after each one, showing expected completion time, current speed, and percentage of completion.
Basically, the script will save in session the array of URLs, and at each iteration pop some of them and download them, maybe checking the time it takes (if you download one file in half a second, it's worth it to download another).
Since the script is executed several times, not only one, you need not worry about its timeout. You do, however, have to deal with the possibility of the user aborting the whole process.

I would've recommended multiple threads to do it faster if there are not any bandwidth restrictions. but the closest thing php has is process control.
Alternatively sometime ago I wrote a similar scraper, and to execute it faster I used the exec functions to instantiate multiple threads of the same file. Hence you also need to create a repository and locking mechanism. Sounds and looks dirty, but works!

If optimisation is worth investing time and if substantial part of execution time is consumed on image processing then calling a shell script that spins a few processes might be an option

Related

Processing a queue of jobs via PHP, but faster than cron's 1-minute minimum?

Many users upload photos to my site, and at peak times the load of resizing and saving the photos causes my server to run out of memory and MySQL to crash.
I figure I can detach the processing from the uploading, and instead of having many concurrent reloads happening I can just have one job doing them one-by-one, which should hugely reduce my memory needs.
But how best to implement this?
My first thought was just a cronjob that runs repeatedly and checks for any unprocessed files. But a cronjob can only (I think) run every minute, so sometimes there would be a large delay even if a user uploaded just one photo at a quiet time on the site.
I expect this is a fairly common problem, so what's the solution?
(I'd rather do this myself with minimal code rather than use a huge third party library, if possible.)
Thanks

PHP execution time: Factor to consider in determining the speed to execution

As all my requests goes through an index script, I tried to time the respond time of all my requests.
Its simply the difference between the start time (start of the script) and end time (end of the script).
As I cache my data on memcached and user are all served using memcached.
I mostly get less than a second respond time but at times there's wierd spike of more than a seconds. the worse case can go up to 200+ seconds.
I was wondering if mobile users had a slow connection, does that reflect on my respond time?
I am serving primary mobile users.
Thanks!
No, it's the runtime of your script. It does not count the latency to the user, that's something the underlying web server is worrying about. Something in your script just takes very long. I recommend you profile your script to find what that is. Xdebug is a good way to do so.
If you're measuring in PHP (which it sounds like you are), thats the time it takes for the page to be generated on the server side, not the time it takes to be downloaded.
Drop timers in throughout the page, and try and break it down to a section that is causing the huge delay of 200+ seconds.
You could even add a small script that will email you details of how long each section took to load if it doesn't happen often enough to see it yourself.
It could be that the script cannot finish because a client downloads the results very-very slowly. If you don't use a front-end server like nginx, the first thing to do is to try it.
Someone already mentioned xdebug, but normally you would not want to run xdebug in production. I would suggest using xhprof to profile pages on development/staging/production. You can turn on xhprof conditionally, which makes it really easy to run on production.

Cron job for php script that requires VERY long execution time

I have a php script run as a cron job that executes a set of simple tasks that loops for each user in the database and takes about 30 mins to complete. This process starts over every hour and needs to be as fast and efficient as possible. The problem Im having, is like with any server script, execution time varies and I need to figure out the best cron time settings.
If I run cron every minute, I need to stop the last loop of the script 20 seconds before the end of the minute to make sure that the current loop finishes in time. Over the course of the hour this adds up to a lot of wasted time.
Im wondering if its a bad idea to simple remove the php execution time limit and run the script once an hour and let it run to completion.... is this a bad idea?
Instead of setting the max_execution_time you could also use set_time_limit() to reset the counter on every loop. This will ensure your script is never running out of time unless there is something serious hanging within the current loop (and taking longer than the max_execution_time).
Basically this should make your script run as long as it needs while giving it a 30 seconds timeout between two set_time_limit() calls.
Assuming you'd like the work done ASAP, don't use cron. Cron is good for things that need to happen at specific times. It's often abused to simulate a background process that would ideally process work as soon as work appears. You should probably write a daemon that runs continuously. (Note: you could also look at a message/work-queue type system, there are nice libraries out there to do this too)
You can write a daemon from scratch using the pcntl functions (since you don't care about multiple worker processes, it's super-easy to get a process running in the background.), or cheat and just make a script that runs forever and run it via screen, or leverage some solid library code like PEAR's System:Daemon or nanoserv
Once the daemonization stuff is taken care of, all you really care about is having a loop that runs forever. You'll want to take care that your script doesn't leak memory, or consume too many resources.
Generally, you can do something like:
<?PHP
// some setup code
while(true){
$todo = figureOutIfIHaveWorkToDo();
foreach($todo as $something){
//do stuff with $something
//remember to clean up resources so you don't leak memory!
usleep(/*some integer*/);
}
usleep(/* some other integer */);
}
And it'll work pretty well.
Setting the time limit to 0 and letting it do its thing is fairly typical of PHP based cronjobs (in my experience), but this is also the point when you should ask yourself a few important questions, such as "Should I rewrite this job in a compiled language?" and "Am I using all of my tools (database, etc) to their maximum efficiency?"
That said, maybe better than completely removing the time limit would be to set it to the upper limit you actually want. If that means 48 minutes, then set_time_limit(48 * 60);
I really think you shouldn't set the time out to 0, that is just looking for trouble. At most, set it to 59*60 seconds, but setting it to 0 might cause security problems, if a script hangs, it will hang almost forever until the server host stops the execution. It is considered bad practice to do so.
I have used the php command-line interface for similar long running tasks in the past. You probably do not want to remove the execution time limit for any request.
Sounds like a great idea if there's little chance that it will take more than an hour. Note, however, that the wrong bug can be a really good way of making it take longer than expected..
To avoid all sorts of nasty problems, you should have a guard file with the process ID of the script. On startup, you should check to make sure the file doesn't exist, or if it does that the process ID in the file doesn't exist (through a kill( pid, 0 ) call). If these conditions are met, create a new file with the script's PID and delete the file when you're done.
This is the same trick that many daemons use to ensure it isn't already running. If the daemon was killed suddenly, the file will still exist but the PID of the process therein is unlikely to be running.
Depending on what your script does, it can lead to problems if you remove the time limit. If per example, you are polling an external server that is unresponsive while the job is running, and that your cron takes 2 hours instead of 30 minutes to complete, you may get a stack of PHP processes being fired up even if the previous ones haven't completed yet. This can cause system instability and crashes.
You probably have two options:
Make sure that no other instance of your script is running beforehand, otherwise exit() on start.
Consider changing your cronjob into a daemon.
Does it have to run hourly like clockwork?
If not split the job (you mentioned it was more than one simple task) do each task every hour?
Or split it per user, do A-M on hour, then N-Z the next?

fopen file locking in PHP (reader/writer type of situation)

I have a scenario where one PHP process is writing a file about 3 times a second, and then several PHP processes are reading this file.
This file is esentially a cache. Our website has a very insistent polling, for data that changes constantly, and we don't want every visitor to hit the DB every time they poll, so we have a cron process that reads the DB 3 times per second, processes the data, and dumps it to a file that the polling clients can then read.
The problem I'm having is that, sometimes, opening the file to write to it takes a long time, sometimes even up to 2-3 seconds. I'm assuming that this happens because it's being locked by reads (or by something), but I don't have any conclusive way of proving that, plus, according to what I understand from the documentation, PHP shouldn't be locking anything.
This happens every 2-5 minutes, so it's pretty common.
In the code, I'm not doing any kind of locking, and I pretty much don't care if that file's information gets corrupted, if a read fails, or if data changes in the middle of a read.
I do care, however, if writing to it takes 2 seconds, esentially, because the process that has to happen thrice a second now skipped several beats.
I'm writing the file with this code:
$handle = fopen(DIR_PUBLIC . 'filename.txt', "w");
fwrite($handle, $data);
fclose($handle);
And i'm reading it directly with:
file_get_contents('filename.txt')
(it's not getting served directly to the clients as a static file, I'm getting a regular PHP request that reads the file and does some basic stuff with it)
The file is about 11kb, so it doesn't take a lot of time to read/write. Well under 1ms.
This is a typical log entry when the problem happens:
Open File: 2657.27 ms
Write: 0.05984 ms
Close: 0.03886 ms
Not sure if it's relevant, but the reads happen in regular web requests, through apache, but the write is a regular "command line" PHP execution made by Linux's cron, it's not going through Apache.
Any ideas of what could be causing this big delay in opening the file?
Any pointers on where I could look to help me pinpoint the actual cause?
Alternatively, can you think of something I could do to avoid this? For example, I'd love to be able to set a 50ms timeout to fopen, and if it didn't open the file, it just skips ahead, and lets the next run of the cron take care of it.
Again, my priority is to keep the cron beating thrice a second, all else is secondary, so any ideas, suggestions, anything is extremely welcome.
Thank you!
Daniel
I can think of 3 possible problems:
the file get locked when read/writing to it by lower php system calls without you knowing it. This should block the file for 1/3s max. You get periods longer that that.
the fs cache starts a fsync() and the whole system blocks read/writes to the disk until that is done. As a fix you may try installing more RAM or upgrading the kernel or using a faster hard disk
your "caching" solution is not distributed, and it hits the worst acting piece of hardware in the whole system many times a second... meaning that you cannot scale it further by simply adding more machines, only by increasing the hdd speed. You should take a look at memcache or APC, or maybe even shared memory http://www.php.net/manual/en/function.shm-put-var.php
Solutions i can think of:
put that file in a ramdisk http://www.cyberciti.biz/faq/howto-create-linux-ram-disk-filesystem/ . This should be the simplest way to avoid hitting the disk so often, without other major changes.
use memcache. Its very fast when used locally, it scales well, used by "big" players. http://www.php.net/manual/en/book.memcache.php
use shared memory. It was designed for what you are trying to do here... having a "shared memory zone"...
change the cron scheduler to update less, maybe implement some kind of event system, so it would only update the cache when necessary, and not on a time basis
make the "writing" script write to 3 files different, and make the "readers" read from one of the files, randomly. This may allow a "distribution" of the loking across more files and it may reduce the chances that a certain file is locked when writing to it... but i doubt it would bring any noticeable benefits.
You should use something really fast solution if you want to guarantee constant low open times. Maybe your OS is doing disk syncs, database file commits, or other things that you can not work around.
I suggest using memcached, redis, or even mongoDB for such tasks. You might even write your own caching daemon, even in php (however this is totally unnecessary, and can be tricky).
If you are absolutely, positively sure that you can only solve this task by this file cache, and you are under Linux, try to use different disk I/O scheduler, like deadline, OR (cfq AND decrease PHP process priority to -3 / -4).

Is PHP suitable for very large projects? Can it be transaction-safe?

That question may appear strange.
But every time I made PHP projects in the past, I encountered this sort of bad experience:
Scripts cancel running after 10 seconds. This results in very bad database inconsistencies (bad example for an deleting loop: User is about to delete an photo album. Album object gets deleted from database, and then half way down of deleting the photos the script gets killed right where it is, and 10.000 photos are left with no reference).
It's not transaction-safe. I've never found a way to do something securely, to ensure it's done. If script gets killed, it gets killed. Right in the middle of a loop. It gets just killed. That never happened on tomcat with java. Java runs and runs and runs, if it takes long.
Lot's of newsletter-scripts try to come around that problem by splitting the job up into a lot of packages, i.e. sending 100 at a time, then relading the page (oh man, really stupid), doing the next one, and so on. Most often something hangs or script will take longer than 10 seconds, and your platform is crippled up.
But then, I hear that very big projects use PHP like studivz (the german facebook clone, actually the biggest german website). So there is a tiny light of hope that this bad behavior just comes from unprofessional hosting companies who just kill php scripts because their servers are so bad. What's the truth about this? Can it be configured in such a way, that scripts never get killed because they take a little longer?
Is PHP suitable for very large projects?
Whenever I see a question like that, I get a bit uneasy. What does very large mean? What may be large to you, may be small to me or vice versa. And that is even assuming that we use the same metric. Are you measuring time to build the project, complete life-cycle of the project, money that are involved, number of people using it, number of developers to build/maintain it, etc. etc.
That said, the problems you're describing sounds like you don't know your technology good enough. That would be a problem for you regardless of which technology you picked. For example, use database transactions to ensure atomicity. And use asynchronous offline jobs to process long running tasks (Such as dispatching a mailing list).
A lot if the bad behaviour is covered in good frameworks like the Zend Framework.
Anything that takes longer the 10 seconds is really messed up but you can always raise the execution time with http://de3.php.net/set_time_limit
A lot of big sites are writen in PHP: Facebook, Wikipedia, StudiVZ, Digg.com etc.. a lot of the things you are talking about are just configuration things maybe you should look into that?
Are you looking for set_time_limit() and ignore_user_abort()?
Performance is not a feature you can just throw in after most of the site is done.
You have to design the site for heavy load.
If a database task is normally involving 10K rows, you should be prepared not just the execution time issues, but other maintenance questions.
Worst case: make a consistency tool to check and fix those errors.
Better: instead of phisically delete the images, just flag them and let background services to take care of the expensive maneuvers.
Best: you can utilize a job queue service and add this job to the queue.
If you do need to do transactions in php, you can just do:
mysql_query("BEGIN");
/// do your queries here
mysql_query("COMMIT");
The commit command will just complete the transaction.
If any errors occur, you can just rollback with:
mysql_query("ROLLBACK");
Edit: Note this will only work if you are using a database that supports transactions, such as InnoDB
You can configure how much time is allowed for executing a script, either in the php.ini setting or via ini_set/set_time_limit
Instead of studivz (the German Facebook clone), you could look at the actual Facebook which is entirely PHP. Or Digg. Or many Yahoo sites. Or many, many others.
ignore_user_abort is probably what you're looking for, but you could also add another layer in terms of scheduled maintenance jobs. They basically run on a specified interval and do various things to make sure your data/filesystem are in a state that you want... deleting old/unlinked files is just one of many things you can do.
For these large loops like deleting photo albums or sending 1000's of emails your looking for ignore_user_abort and set_time_limit.
Something like this:
ignore_user_abort(true); //users leaves webpage will not kill script
set_time_limit(0); //script can take as long as it wants
for(i=0;i<10000;i++)
costly_very_important_operation();
Be carefull however that this could potentially run the script forever:
ignore_user_abort(true); //users leaves webpage will not kill script
set_time_limit(0); //script can take as long as it wants
while(true)
do_something();
That script will never die, unless you restart your server.
Therefore it is best to never set the time_limit the 0.
Technically no programming language is transaction safe, it's the database that needs to be transaction safe. So if the script/code running dies or disconnects, for whatever reason, the transaction will be rolled back.
Putting queries in a loop is a very bad idea unless it is specifically design to be running in batches and breaking a much larger set into smaller pieces. Adjusting PHP timers and limits is generally a stop gap solution, you are still dependent on the client browser if using the web to kick off a script.
If I have a long process that needs to be kicked off by a browser, I "disconnect" the process from the browser and web server so control is returned to the user while the script runs. PHP scripts run from the command line can run for hours if you want. You can then use AJAX, or reload the page, to check on the progress of the long running script.
There are security concern with this code, but to "disconnect" a process from PHP running under something like Apache:
exec("nohup /usr/bin/php -f /path/to/script.php > /dev/null 2>&1 &");
But that really has nothing to do with PHP being suitable for large projects or being transaction safe. PHP can be used for large projects, but since by default there is no code that remains "resident" between hits, it can get slow if not designed right. Also, since there is no namespace support, you want to plan ahead if you have a large development team.
It's fine for a Java based system to take a few minutes to startup, initialize and load all the default objects. But this is unacceptable with PHP. PHP will take more planning for larger systems. The question is, when does the time saved in using PHP get wasted by the additional planning time required for a large system?
The reason you most likely experienced bad database consistencies in the past is because you were using the MyISAM engine for mysql (which DOES NOT support transactions). Use InnoDB instead, it supports transactions and performs row level locking.
Or use postgreSQL.
Many, many software sites are made in PHP. However, you will not hear about millions of web pages made in PHP that do not exist anymore because they were abandoned. Those pages may have burned all company money for dealing with PHP mess, or maybe they bankrupted because their soft was so crappy that customer did not want it… PHP seems good at the startup, but it does not scale very well. Yes, there are many huge web sites made in PHP, but they are rather exceptions, than a norm.

Categories