How to run PHP process for longer time - php

I am working on web scraping with php and curl to scrap a whole website
but it takes more than one day to complete the process of scraping
I have even used
ignore_user_abort(true);
set_error_handler(array(&$this, 'customError'));
set_time_limit (0);
ini_set('memory_limit', '-1');
I have also cleared memory after scraping a page I am using simple html DOM
to get the scraping details from a page
But still process runs and works fine for some amount of links after that it stops although process keeps circulating the browser and no error log is generated
Could not understand what seems to be the problem.
Also I need to know if PHP can
run process for two or three days?
thanks in advance

PHP can run for as long as you need it to, but the fact it stops after what seems like the same point every time indicates there is an issue with your script.
You said you have tried ignore_user_abort(true);, but then indicated you were running this via a browser. This setting only works in command line as closing a browser window for a script of this type will not terminate the process anyway.
Do you have xDebug? simplehtmlDOM will throw some rather interesting errors with malformed html (a link within a broken link for example). xDebug will throw a MAX_NESTING_LEVEL error in a browser, but will not throw this in a console unless you have explicitly told it to with the -d flag.
There are lots of other errors, notices, warnings etc which will break/stop your script without writing anything to error_log.
Are you getting any errors?
When using cURL in this way it is important to use multi cURL to parallel process URLs - depending on your environment, 150-200 URLs at a time is easy to achieve.
If you have truly sorted out the memory issue and freed all available space like you have indicated, then the issue must be with a particular page it is crawling.
I would suggest running your script via a console and finding out exactly when it stops to run that URL separately - at least this will indicate if it is a memory issue or not.
Also remember that set_error_handler(array(&$this, 'customError')); will NOT catch every type of error PHP can throw.
When you next run it, debug via a console to show progress, and keep a track of actual memory use - either via PHP (printed to console) or via your systems process manager. This way you will be closer to finding out what the actual issue with your script is.

Even if you set an unlimited memory, there exists a physical limit.
If you call recursively the URLs, the memory can be fullfilled.
Try to do a loop and work with a database:
scan a page, store the founded links if there aren't in the database yet.
when finish, do a select, and get the first unscanned URL
{loop}

Related

PHP script stops running after some time run by browser

I have made a PHP script which probably would take about 3 hours to complete. I run it from browser and after about 45minutes it stops doing anything. I know this since its polling certain web addresses and then saves some data to database. So it basically stops putting any data to database which lead me to conclusion that it has stopped. It still shows in browser like it would be loading the page though but its neverending.
There arent any errors so it probably is some kind of timeout... But where it occurs is mystery or how can I prevent it from happening. In my case I cant use the CLI, I must user browser client to initiate the script.
I have tried to put
set_time_limit(0);
But it had no apparent effect. Any suggestions what could cause the timeout and a fix for it?
Try this:
set_time_limit(0);
ignore_user_abort(true);
ini_set('max_execution_time', 0);
Most webhosts kill processes that run for a certain length of time. This is intended as a failsafe against infinite loops.
Ask your host about this and see if there's any way it can be disabled for this particular script of yours. In some cases, the killer doesn't apply to Cron tasks, or processes run by SSH. However, this varies from host to host.
Might be the browser that's timing out, not sure if browsers do that or not, but then I've also never had a page require so much time.
Suggestion: I'm assuming you are running a loop. Load a page then run each iteration of the loop in an ajax call to another page, not firing the next iteration until the previous one returns.
There's a setting in PHP to kill processes after sometime. This is especially important for shared servers (you do not want that one process slows up the whole server).
You need to ask your host if you can make modifications to php.ini (through .htaccess). In particular the max_execution_time setting.
If you are using session, then you would need to look at 'session.cookie_lifetime' and not set_time_limit. If you are using an array, the array size might also fill up.
Without more info on how your script handles the task, it would be difficult to identify.

How do I avoid this PHP Script causing a server standstill?

I'm currently running a Linux based VPS, with 768MB of Ram.
I have an application which collects details of domains and then connect to a service via cURL to retrieve details of the pagerank of these domains.
When I run a check on about 50 domains, it takes the remote page about 3 mins to load with all the results, before the script can parse the details and return it to my script. This causes a problem as nothing else seems to function until the script has finished executing, so users on the site will just get a timer / 'ball of death' while waiting for pages to load.
**(The remote page retrieves the domain details and updates the page by AJAX, but the curl request doesnt (rightfully) return the page until loading is complete.
Can anyone tell me if I'm doing anything obviously wrong, or if there is a better way of doing it. (There can be anything between 10 and 10,000 domains queued, so I need a process that can run in the background without affecting the rest of the site)
Thanks
A more sensible approach would be to "batch process" the domain data via the use of a cron triggered PHP cli script.
As such, once you'd inserted the relevant domains into a database table with a "processed" flag set as false, the background script would then:
Scan the database for domains that aren't marked as processed.
Carry out the CURL lookup, etc.
Update the database record accordingly and mark it as processed.
...
To ensure no overlap with an existing executing batch processing script, you should only invoke the php script every five minutes from cron and (within the PHP script itself) check how long the script has been running at the start of the "scan" stage and exit if its been running for four minutes or longer. (You might want to adjust these figures, but hopefully you can see where I'm going with this.)
By using this approach, you'll be able to leave the background script running indefinitely (as it's invoked via cron, it'll automatically start after reboots, etc.) and simply add domains to the database/review the results of processing, etc. via a separate web front end.
This isn't the ideal solution, but if you need to trigger this process based on a user request, you can add the following at the end of your script.
set_time_limit(0);
flush();
This will allow the PHP script to continue running, but it will return output to the user. But seriously, you should use batch processing. It will give you much more control over what's going on.
Firstly I'm sorry but Im an idiot! :)
I've loaded the site in another browser (FF) and it loads fine.
It seems Chrome puts some sort of lock on a domain when it's waiting for a server response, and I was testing the script manually through a browser.
Thanks for all your help and sorry for wasting your time.
CJ
While I agree with others that you should consider processing these tasks outside of your webserver, in a more controlled manner, I'll offer an explanation for the "server standstill".
If you're using native php sessions, php uses an exclusive locking scheme so only a single php process can deal with a given session id at a time. Having a long running php script which uses sessions can certainly cause this.
You can search for combinations of terms like:
php session concurrency lock session_write_close()
I'm sure its been discussed many times here. I'm too lazy to search for you. Maybe someone else will come along and make an answer with bulleted lists and pretty hyperlinks in exchange for stackoverflow reputation :) But not me :)
good luck.
I'm not sure how your code is structured but you could try using sleep(). That's what I use when batch processing.

Best practice to find source of PHP script termination

I have a PHP script that grabs a chunk of data from a database, processes it, and then looks to see if there is more data. This processes runs indefinitely and I run several of these at a time on a single server.
It looks something like:
<?php
while($shouldStillRun)
{
// do stuff
}
logThatWeExitedLoop();
?>
The problem is, after some time, something causes the process to stop running and I haven't been able to debug it and determine the cause.
Here is what I'm using to get information so far:
error_log - Logging all errors, but no errors are shown in the error log.
register_shutdown_function - Registered a custom shutdown function. This does get called so I know the process isn't being killed by the server, it's being allowed to finish. (or at least I assume that is the case with this being called?)
debug_backtrace - Logged a debug_backtrace() in my custom shutdown function. This shows only one call and it's my custom shutdown function.
Log if reaches the end of script - Outside of the loop, I have a function that logs that the script exited the loop (and therefore would be reaching the end of the source file normally). When the script dies randomly, it's not logging this, so whatever kills it, kills it while it's in the middle of processing.
What other debugging methods would you suggest for finding the culprit?
Note: I should add that this is not an issue with max_execution_time, which is disabled for these scripts. The time before being killed is inconsistent. It could run for 10 seconds or 12 hours before it dies.
Update/Solution: Thank you all for your suggestions. By logging the output, I discovered that when a MySql query failed, the script was set to die(). D'oh. Updated it to log the mysql errors and then terminate. Got it working now like a charm!
I'd log memory usage of your script. Maybe it acquires too much memory, hits memory limit and dies?
Remember, PHP has a variable in the ini file that says how long a script should run. max-execution-time
Make sure that you are not going over this, or use the set_time_limit() to increase execution time. Is this program running through a web server or via cli?
Adding: My Bad Experiences with PHP. Looking through some background scripts I wrote earlier this year. Sorry, but PHP is a terrible scripting language for doing anything for long lengths of time. I see that the newer PHP (which we haven't upgraded to) adds the functionality to force the GC to run. The problem I've been having is from using too much memory because the GC almost never runs to clean up itself. If you use things that recursively reference themselves, they also will never be freed.
Creating an array of 100,000 items makes memory, but then setting the array to an empty array or splicing it all out, does NOT free it immediately, and doesn't mark it as unused (aka making a new 100,000 element array increases memory).
My personal solution was to write a perl script that ran forever, and system("php my_php.php"); when needed, so that the interpreter would free completely. I'm currently supporting 5.1.6, this might be fixed in 5.3+ or at the very least, now they have GC commands that you can use to force the GC to cleanup.
Simple script
#!/usr/bin/perl -w
use strict;
while(1) {
if( system("php /to/php/script.php") != 0 ) {
sleep(30);
}
}
then in your php script
<?php
// do a single processing block
if( $moreblockstodo ) {
exit(0);
} else {
// no? then lets sleep for a bit until we get more
exit(1);
}
?>
I'd log the state of the function to a file in a few different places in each loop.
You can get the contents of most variables as a string with var_export, using the var_export($varname,true) form.
You could just log this to a certain file, and keep an eye on it. The latest state of the function before the log ends should provide some clues.
Sounds like whatever is happening is not a standard php error. You should be able to throw your own errors using a try... catch statement that should then be logged. I don't have more details other than that because I'm on my phone away from a pc.
I've encountered this before on one of our projects at work. We have a similar setup - a PHP script checks the DB if there are tasks to be done (such as sending out an email, updating records, processing some data as well). The PHP script has a while loop inside, which is set to
while(true) {
//do something
}
After a while, the script will also be killed somehow. I've already tried most of what has been said here like setting max_execution_time, using var_export to log all output, placing a try_catch, making the script output ( php ... > output.txt) etc and we've never been able to find out what the problem is.
I think PHP just isn't built to do background tasks by itself. I know it's not answering your question (how to debug this) but the way we worked this is that we used a cronjob to call the PHP file every 5 minutes. This is similar to Jeremy's answer of using a perl script - it ensures that the interpreter if free after the execution is done.
If this is on Linux, try to look into system logs - the process could be killed by the OOM (out-of-memory) killer (unlikely, you'd also see other problems if this was happening), or a segmentation fault (some versions of PHP don't like some versions of extensions, resulting in weird crashes).

why does apache/server die when i make a typo and cause a never ending loop?

this is more of a fundamental question at how apache/threading works.
in this hypothetical (read: sometimes i suck and write terrible code), i write some code that enters the infinite-recursion phases of it's life. then, what's expected, happens. the serve stalls.
even if i close the tab, open up a new one, and hit the site again (locally, of course), it does nothing. even if i hit a different domain i'm hosting through a vhost declaration, nothing. i normally have to wait a number of seconds before apache can begin handling traffic again. most of the time i just get tired and restart the server manually.
can someone explain this process to me? i have the php runtime setting 'ignore_user_abort' set to true to allow ajax calls that are initiated to keep running even if they close their browser, but would this being set to false affect it?
any help would be appreciated. didn't know what to search for.
thanks.
ignore_user_abort() allows your script (and Apache) to ignore a user disconnecting (closing browser/tab, moving away from page, hitting ESC, esc..) and continue processing. This is useful in some cases - for instance in a shopping cart once the user hits "yes, place the order". You really don't want an order to die halfway through the process, e.g. order's in the database, but the charge hasn't been sent to the payment facility yet. Or vice-versa.
However, while this script is busilly running away in "the background", it will lock up resources on the server, especially the session file - PHP locks the session file to make sure that multiple parallel requests won't stomp all over the file, so while your infinite loop is running in the background, you won't be able to use any session-enabled other part of the site. And if the loop is intensive enough, it could tie up the CPU enough that Apache is unable to handle any other requests on other hosted sites, where the session lock might not apply.
If it is an infinite loop, you'll have to wait until PHP's own maximum allowed run time (set_time_limit() and max_execution_time()) kicks in and kills the script. There's also some server-side limiters, like Apache's RLimitCPU and TimeOut that can handle situations like this.
Note that except on Windows, PHP doesn't count "external" time in the set_time_limit. So if your runaway process is doing database stuff, calling external programs via system() and the like, the time spent running those external calls is NOT accounted for in the parent's time limit.
If you write code that causes an (effectively) neverending loop, then apache will execute that, and be unable to respond to any additional new requests for a page, because it's trying to determine the page content (for the served page which caused the neverending loop) by executing the (non-terminating) php code.
Solution: don't write code that doesn't terminate (in a reasonable amount of time). Understand loop invariants.

How to Increase the time till a read timeout error occurs?

I've written in PHP a script that takes a long time to execute [Image processing for thousands of pictures]. It's a meter of hours - maybe 5.
After 15 minutes of processing, I get the error:
ERROR
The requested URL could not be retrieved
The following error was encountered while trying to retrieve the URL: The URL which I clicked
Read Timeout
The system returned: [No Error]
A Timeout occurred while waiting to read data from the network. The network or server may be down or congested. Please retry your request.
Your cache administrator is webmaster.
What I need is to enable that script to run for much longer.
Now, here are all the technical info:
I'm writing in PHP and using the Zend Framework. I'm using Firefox. The long script that is processed is done after clicking a link. Obviously, since the script is not over I see the web page on which the link was and the web browser writes "waiting for ...".
After 15 minutes the error occurs.
I tried to make changes to Firefox threw about:config but without any success. I don't know, but the changes might be needed somewhere else.
So, any ideas?
Thanks ahead.
set_time_limit(0) will only affect the server-side running of the script. The error you're receiving is purely browser-side. You have to send SOMETHING to keep the browser from deciding the connection's dead - even a single character of output (followed by a flush() to make sure it actually get sent out over the wire) will do. Maybe once every image that's processed, or on a fixed time interval (if last char sent more than 5 minutes ago, output another one).
If you don't want any intermediate output, you could do ignore_user_abort(TRUE), which will allow the script to keep running even if the connection gets shut down from the client side.
If the process runs for hours then you should probably look into batch processing. So you just store a request for image processing (in a file, database or whatever works for you) instead of starting the image processing. This request is then picked up by a scheduled (cron) process running on the server, which will do the actual processing (this can be a PHP script, which calls set_time_limit(0)). And when processing is finished you could signal the user (by mail or any other way that works for you) that the processing is finished.
use set_time_limit
documentation here
http://nl.php.net/manual/en/function.set-time-limit.php
If you can split your work in batches, after processing X images display the page with some javascript (or META redirects) on it to open the link http://server/controller/action/nextbatch/next_batch_id.
Rinse and repeat.
batching the entire process also has the added benefit that once something goes wrong, you don't have to start out the entire thing anew.
If you're running on a server of your own and can get out of safe_mode, then you could also fork background processes to do the actual heavy lifting, independent of your browser view of things. If you're in a multicore or multiprocessor environment, you can even schedule more than one running process at any time.
We've done something like that for large computation scripts; synchronization of the processes happened over a shared database---but luckily enough, they processes were so independent that the only thing we needed to see was their completion or termination.

Categories