I have a php script that scrapes the web and inserts the scraped data into a database.
This php script runs for a very long time(about a couple of hours).
Sometimes, after the script runs for a long time, the php script just stops executing and shows no error.
The problem isn't caused due to the amount of execution time of the script because i set the php script to an unlimited amount of execution time.
ini_set('max_execution_time', 0);
I also set the php script to show all errors.
ini_set('display_errors',1);
error_reporting(E_ALL);
But I get no error after the php script stops execution.
I also ran the script in several other computers and i still encounter the same problem, so the problem isn't due to server restrictions either.
I researched the issue and apparently it's a networking problem.
The php script stops communicating with the server and disconnects from it(probably because the php script sent a http request and didn't receive any response).
My question is this:
Is there any way I can check for network disconnections through the php script, and resume the script and try to reconnect if there was a network disconnection?
First "set_time_limit" is a better way to use that function and I would check your server's rules and regulations. If it is your private server, then great. If you are renting a VPS or other type of server, it is highly possible that they have restrictions to prevent misuse, like SPAMing and such. Make sure that you are within their rules and go from there.
Related
I am working on web scraping with php and curl to scrap a whole website
but it takes more than one day to complete the process of scraping
I have even used
ignore_user_abort(true);
set_error_handler(array(&$this, 'customError'));
set_time_limit (0);
ini_set('memory_limit', '-1');
I have also cleared memory after scraping a page I am using simple html DOM
to get the scraping details from a page
But still process runs and works fine for some amount of links after that it stops although process keeps circulating the browser and no error log is generated
Could not understand what seems to be the problem.
Also I need to know if PHP can
run process for two or three days?
thanks in advance
PHP can run for as long as you need it to, but the fact it stops after what seems like the same point every time indicates there is an issue with your script.
You said you have tried ignore_user_abort(true);, but then indicated you were running this via a browser. This setting only works in command line as closing a browser window for a script of this type will not terminate the process anyway.
Do you have xDebug? simplehtmlDOM will throw some rather interesting errors with malformed html (a link within a broken link for example). xDebug will throw a MAX_NESTING_LEVEL error in a browser, but will not throw this in a console unless you have explicitly told it to with the -d flag.
There are lots of other errors, notices, warnings etc which will break/stop your script without writing anything to error_log.
Are you getting any errors?
When using cURL in this way it is important to use multi cURL to parallel process URLs - depending on your environment, 150-200 URLs at a time is easy to achieve.
If you have truly sorted out the memory issue and freed all available space like you have indicated, then the issue must be with a particular page it is crawling.
I would suggest running your script via a console and finding out exactly when it stops to run that URL separately - at least this will indicate if it is a memory issue or not.
Also remember that set_error_handler(array(&$this, 'customError')); will NOT catch every type of error PHP can throw.
When you next run it, debug via a console to show progress, and keep a track of actual memory use - either via PHP (printed to console) or via your systems process manager. This way you will be closer to finding out what the actual issue with your script is.
Even if you set an unlimited memory, there exists a physical limit.
If you call recursively the URLs, the memory can be fullfilled.
Try to do a loop and work with a database:
scan a page, store the founded links if there aren't in the database yet.
when finish, do a select, and get the first unscanned URL
{loop}
I have a PHP script on my Apache web server, which starts another several hours running PHP script. Right after the long-lasting script is started no other PHP script requests are handled. The browser just hangs eternally.
The background script crawls other sites and gathers data from ones. Therefore it takes quite long time.
At the same time static pages are got without problems. Also at the same time any PHP script started locally on the server from bash are executed without problems.
CPU and RAM usage are low. In fact it's test server and my requests are only ones being handled.
I tried to decrease Apache processes in order to be able to trace all of them to see where requests are hung. But when I decreased amount of processes to 2 the problem has gone.
I found no errors neither in syslog nor in apache/error.log
What else can I check?
Though I didn't find the reason of Apache hanging I have solved the task in a different way.
I've set a schedule to run a script every 5 minutes. From web script I'm just creating a file with necessary parameters. Script check existence of the file and if it exists it reads its content and deletes to prevent further scheduled start.
I have a script that updates my database with listings from eBay. The amount of sellers it grabs items from is always different and there are some sellers who have over 30,000 listings. I need to be able to grab all of these listings in one go.
I already have all the data pulling/storing working since I've created the client side app for this. Now I need an automated way to go through each seller in the DB and pull their listings.
My idea was to use CRON to execute the PHP script which will then populate the database.
I keep getting Internal Server Error pages when I'm trying to execute a script that takes a very long time to execute.
I've already set
ini_set('memory_limit', '2G');
set_time_limit(0);
error_reporting(E_ALL);
ini_set('display_errors', true);
in the script but it still keeps failing at about the 45 second mark. I've checked ini_get_all() and the settings are sticking.
Are there any other settings I need to adjust so that the script can run for as long as it needs to?
Note the warnings from the set_time_limit function:
This function has no effect when PHP is running in safe mode. There is no workaround other than turning off safe mode or changing the time limit in the php.ini.
Are you running in safe mode? Try turning it off.
This is the bigger one:
The set_time_limit() function and the configuration directive max_execution_time only affect the execution time of the script itself. Any time spent on activity that happens outside the execution of the script such as system calls using system(), stream operations, database queries, etc. is not included when determining the maximum time that the script has been running. This is not true on Windows where the measured time is real.
Are you using external system calls to make the requests to eBay? or long calls to the database?
Look for particularly long operations by profiling your php script, and looking for long operations (> 45 seconds). Try to break those operations into smaller chunks.
Well, as it turns out, I overlooked the fact that I was testing the script through the browser. Which means Apache was handling the PHP process, which was executed with mod_fcgid, which had a timeout of exactly 45 seconds.
Executing the script directly from shell and CRON works just fine.
I have tried searching for this forever, but unfortunately I could not find the answer.
I am calculating a whole lot of Pearson correlations on huge matrixes on my server. I do this by opening example.org/testscript.php.
The script itself will terminate about a 2 days after it has started and will perform many INSERT INTO databases for recommendation purposes.
I was wondering when I closed the window of my browser, whether the PHP script would stop or not. I am assuming not; however I am not a 100% sure.
P.S. I have noticed that in some browsers on some computers I will receive an internal server error (500) when starting the script after about 10 minutes. The script itself however was still running as it was still inserting rows in my database.
On this computer however I have not received such error and therefore I was wondering what would happen when I closed the tab.
The php script will terminate after reaching the timeout. You can change the timeout: http://php.net/manual/en/function.set-time-limit.php
The browser will however timeout much sooner, if you are not producing any output. The browser timeout is of course browser-specific.
As Dagon suggested, the correct way would be to execute the php script on the server in the background.
I currently have a website that has twice been suspended by my hosting provider for "overusing system resources". In each case, there were 300 - 400 crashed copies of one of my PHP scripts left running on the server.
The scripts themselves pull and image from a web camera at home and copy it to the server. They make use of file locks to ensure only one can write at a time. The scripts are called every 3 seconds by any client viewing the page.
Initially I was confused, as I had understood that a PHP script either completes (returning the result), or crashes (returning the internal server error page). I am, however, informed that "defunct scripts" are a very common occurrence.
Would anyone be able to educate me? I have Googled this to death but I cannot see how a script can end up in a crashed state. Would it not time out when it reaches the max execution time?
My hosting provider is using PHP set up as CGI on a Linux platform. I believe that I have actually identified the problem with my script in that I did not realise that flock was a blocking function (and I am not using the LOCK_NB mask). I am assuming that somehow hundreds of copies of my script end up blocked waiting for a resource to become available and this leads to a crash? Does this sound plausible? I am reluctant to re-enable the site for fear of it being suspended again.
Any insights greatly appreciated.
Probably the approach I would recommend is to use tempnam() first and write the contents inside (which may take a while). Once done, you do the file locking, etc.
Not sure if this happens when a PUT request is being done; typically PHP will handle file uploads first before handing over the execution to your script.
Script could crash on these two limitations
max_execution_time
memory_limit
while working with resources, unless you have no other errors in script / check for notice errors too