Background info - I created an online shop a while ago, dropshipping products; I built the website and added all the product information by hand. Now that I have some knowledge of PHP, I have created a scraper/spider to get all the required info without doing anything by hand.
Question - My script runs on my local server, collecting all the links from the site's sitemap.xml and uploading them to my database. Once that is complete, it starts going through the links and extracting the data needed (picture, price, name, description, etc.). The site I am scraping is not happy that I am doing it, because of human/computer errors that can only be spotted by a human, but they have allowed it. Anyway, my script sometimes throws an error when an item cannot be scraped for some unknown reason, so I have put a die() where the script throws this error.
This is placed inside the MySQL while loop for the links. I have noticed a few times that when an error does occur, the script stops loading and shows me the exact error, but when I shut down the browser it carries on deleting queries and extracting information; I need to manually restart the server before it stops.
How is this possible, and what can I do to prevent it? Does the die() statement just kill the client-side output while the server-side script keeps running?
So you are running PHP locally to gather data from a remote site: you start a PHP script in your local browser, and the script does not stop when the browser is closed.
Of course, stopping the local server will stop the script.
However, PHP can also be run from the command line (on Linux and Windows alike); output then goes to the console, and the process can simply be killed there.
Another solution: inside the loop, check for the (non-)existence of a signal file and die() when it is gone. A second PHP script, callable in a second browser tab, then creates/removes that signal file.
(The file might serve as a lock too, so you do not start the data gathering twice.)
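A minimal sketch of that idea, where the signal-file name, table and credentials are my own assumptions:

<?php
// scraper.php - minimal sketch of the signal-file approach (names are assumptions)
$signalFile = __DIR__ . '/scraper.running';

if (file_exists($signalFile)) {
    die("Another run appears to be active.\n"); // the file doubles as a lock
}
touch($signalFile);

$db    = new mysqli('localhost', 'user', 'pass', 'shop'); // placeholder credentials
$links = $db->query('SELECT id, url FROM links WHERE scraped = 0');

while ($row = $links->fetch_assoc()) {
    if (!file_exists($signalFile)) {
        die("Stop signal received - exiting cleanly.\n");
    }
    // ... scrape $row['url'] and store picture, price, name, description ...
}

unlink($signalFile); // release the lock when the run finishes

The second script only needs a single line, for example unlink(__DIR__ . '/scraper.running');, to make the loop stop on its next iteration.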
Related
I have the PHP Drupal framework running on an Apache server. One REST call sent to my Drupal site triggers a block of PHP code. This block of code keeps running for three days in the background, analysing data stored in a local database and writing the analysis results to a text file.
Note that this block of code also runs under the Drupal framework; it is not running independently. When the Apache error.log rolls over, the block of code stops working. To continue my task, I have to send the REST call again to trigger it.
When error.log rolls over, the log content is moved to error.log.1 and a new error.log file is created. My code keeps writing to the log, and I can't see anything abnormal at the end of error.log.1; the script just literally stopped.
That is a lot of detail, but I just wanted to explain my problem clearly.
Question: How do I handle this? Should I change some configuration of the Apache server, or do I have to use an independent script that doesn't depend on Apache?
I found this page explaining how Apache and the system do the log rotation. It shows me the files I need to look at if I want to resolve my issue:
https://www.digitalocean.com/community/tutorials/how-to-configure-logging-and-log-rotation-in-apache-on-an-ubuntu-vps
I am working on scraping a whole website with PHP and cURL, but it takes more than one day to complete the scraping process.
I have even used:
ignore_user_abort(true);
set_error_handler(array(&$this, 'customError'));
set_time_limit (0);
ini_set('memory_limit', '-1');
I also clear memory after scraping each page, and I am using Simple HTML DOM to get the scraped details from a page.
But still, the process runs and works fine for some number of links, and after that it stops, even though the browser keeps spinning and no error log is generated.
I cannot understand what the problem is.
Also, I need to know: can PHP run a process for two or three days?
Thanks in advance.
PHP can run for as long as you need it to, but the fact that it stops at what seems like the same point every time indicates there is an issue with your script.
You said you have tried ignore_user_abort(true);, but then indicated you were running this via a browser. That setting only works on the command line, as closing a browser window will not terminate a script of this type anyway.
Do you have Xdebug installed? Simple HTML DOM will throw some rather interesting errors with malformed HTML (a link within a broken link, for example). Xdebug will throw a MAX_NESTING_LEVEL error in a browser, but will not throw it in a console unless you have explicitly told it to with the -d flag.
There are lots of other errors, notices, warnings, etc. that will break/stop your script without writing anything to the error log.
Are you getting any errors?
When using cURL in this way, it is important to use multi cURL to process URLs in parallel - depending on your environment, 150-200 URLs at a time is easy to achieve.
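A rough sketch of what batched curl_multi fetching might look like; the batch size, options and helper name below are my own choices:

<?php
// Fetch a batch of URLs in parallel and return their bodies keyed like the input.
function fetch_batch(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 30,
        ));
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Run all handles until every transfer has finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($active && $status === CURLM_OK);

    $results = array();
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}

// Example: process the link list in batches of 150.
// foreach (array_chunk($allLinks, 150) as $batch) {
//     $pages = fetch_batch($batch);
//     // ... parse each page and free memory before the next batch ...
// }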
If you have truly sorted out the memory issue and freed all available space as you have indicated, then the issue must be with a particular page it is crawling.
I would suggest running your script from a console and finding out exactly where it stops, then running that URL separately - at the very least this will tell you whether or not it is a memory issue.
Also remember that set_error_handler(array(&$this, 'customError')); will NOT catch every type of error PHP can throw.
When you next run it, debug via a console so you can see its progress, and keep track of actual memory use - either via PHP (printed to the console) or via your system's process manager. This way you will be closer to finding out what the actual issue with your script is.
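For example, a simple progress/memory trace printed to the console might look like this (the link list and loop body are placeholders):

<?php
$links = array('https://example.com/page1', 'https://example.com/page2'); // placeholder list

foreach ($links as $i => $url) {
    // ... scrape $url ...

    if ($i % 50 === 0) {
        printf(
            "[%s] %d/%d done, memory: %.1f MB (peak %.1f MB)\n",
            date('H:i:s'),
            $i,
            count($links),
            memory_get_usage(true) / 1048576,
            memory_get_peak_usage(true) / 1048576
        );
    }
}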
Even if you set unlimited memory, there is still a physical limit.
If you call the URLs recursively, memory can fill up.
Try a loop instead and work with a database, as sketched below:
scan a page and store the links found, if they are not in the database yet;
when finished, do a SELECT and get the first unscanned URL;
{loop}
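A rough sketch of that loop, where the table and column names (links, url, scanned) and credentials are my own assumptions:

<?php
$db = new mysqli('localhost', 'user', 'pass', 'scraper');

// Seed the queue with the start page (the url column is assumed to be UNIQUE).
$startUrl = 'https://example.com/';
$stmt = $db->prepare('INSERT IGNORE INTO links (url, scanned) VALUES (?, 0)');
$stmt->bind_param('s', $startUrl);
$stmt->execute();

while (true) {
    // Get the first unscanned URL.
    $result = $db->query('SELECT id, url FROM links WHERE scanned = 0 LIMIT 1');
    $row = $result->fetch_assoc();
    if (!$row) {
        break; // nothing left to scan
    }

    $html = file_get_contents($row['url']); // or a cURL fetch
    // ... extract the data and any new links from $html,
    //     INSERT IGNORE each new link with scanned = 0 ...

    $db->query('UPDATE links SET scanned = 1 WHERE id = ' . (int) $row['id']);

    unset($html); // keep memory flat between iterations
}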
If I start my browser and run a PHP program (on another server) and then close the browser, the program will still keep running on the server, right?
What if you run the program and then remove its folder on the server (while the program is running)? Assuming it's a single PHP file, will it crash? Is the whole PHP file read into memory before running, or does the system access the file periodically?
First off, when the server receives a request, it will continue to process that request until it finishes its response, even if the browser that made the request closes.
The PHP file is loaded into memory and processed, so deleting the file in the middle of processing will not cause anything to crash.
If, however, halfway through your PHP it references another file that has been deleted BEFORE that code is reached, then it may crash (depending on your error handling).
Note, however, that causing PHP to crash will not crash the whole web server.
According to the PHP Connection Handling Page:
http://php.net/manual/en/features.connection-handling.php
You can decide whether or not you want a client disconnect to cause your script to be aborted. Sometimes it is handy to always have your scripts run to completion even if there is no remote browser receiving the output.
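A minimal sketch of that behaviour, with a placeholder work loop:

<?php
// Keep running even if the client disconnects; the loop body is a placeholder.
ignore_user_abort(true);
set_time_limit(0);

$noticed = false;
for ($i = 0; $i < 1000; $i++) {
    // ... one unit of scraping/processing work ...

    echo '.';   // PHP only notices a disconnect when it tries to send output
    flush();

    if (connection_aborted() && !$noticed) {
        $noticed = true;
        error_log("Client disconnected at iteration $i; continuing anyway.");
    }
}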
Of course you can delete the file or the folder containing the PHP file, as long as it is not directly in use/open on the server.
Otherwise you could never delete files on a web server, as they might always be in use :-)
I have a PHP script that downloads videos from various locations.
The video files can be anywhere from 20 MB to 100 MB+.
I currently have PHP saving the video file to a directory using CURLOPT_FILE. This is working fine with no problems.
Because of the large files being downloaded, I've set the cURL timeout to 45 minutes to allow the file to download. I have also set set_time_limit(0) so that the PHP page continues processing after the download has completed, and I've set ini_set("memory_limit","500M");.
When the download completes, it should echo "Downloaded" and then update a MySQL record stating that the file has been downloaded.
What is happening, though, is that the video file is downloaded correctly by cURL, but it is not displaying "Downloaded" in the browser, BUT it is updating MySQL.
Why is this? I've tried to come up with a solution myself, but I cannot work out what the issue is...
If you're in a browser environment, the browser will time out after a certain amount of time and stop listening for output from the script, even though the script will continue to run. The exact time varies across browsers, but the number I've seen is 30 seconds.
To overcome this problem, you should send some output (even something meaningless like echo "<!--empty comment-->";) every so often.
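One way to emit output during a long, blocking cURL download is cURL's progress callback; this is only a sketch, and the URL, file path and 20-second interval are my own assumptions:

<?php
set_time_limit(0);
ignore_user_abort(true);

$fp = fopen('/tmp/video.mp4', 'w');                    // placeholder path
$ch = curl_init('https://example.com/big-video.mp4');  // placeholder URL
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_TIMEOUT, 2700); // 45 minutes, as in the question

// Emit a harmless comment every ~20 seconds so the browser keeps listening.
curl_setopt($ch, CURLOPT_NOPROGRESS, false);
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION, function () {
    static $last = 0;              // progress arguments are ignored here
    if (time() - $last >= 20) {
        echo "<!-- still downloading -->";
        @ob_flush();
        flush();
        $last = time();
    }
    return 0; // non-zero would abort the transfer
});

curl_exec($ch);
curl_close($ch);
fclose($fp);

echo 'Downloaded';
// ... update the MySQL record here ...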
I recently had a similar problem, and I dealt with it by not outputting any content from the script at all, and instead polling from the browser every so often using AJAX to see if it was done.
Or don't use the browser environment at all (it is not ideally suited to this problem) and use a command-line prompt instead, as it does not have (to my knowledge) these timeouts.
I created a script that gets data from some web services and our database, formats a report, then zips it and makes it available for download. When I first started, I made it a command-line script so I could watch the output as it was produced and get around the script timeout you hit when viewing it in a browser. But because I don't want my users to have to use the command line or run PHP on their own computers, I want to make this run from our web server instead.
Because this script could take minutes to run, I need a way to let it process in the background and then start the download once the file has been created successfully. What's the best way to let this script run without triggering the timeout? I've attempted this before (using backticks to run the script separately, and so on) but gave up, so I'm asking here. Ideally, the user would click the submit button on the form to start the request and then be returned to the page instead of having to stare at a blank browser window. When the zip file exists (meaning the process has finished), it should notify them (via AJAX? a reloaded page? I don't know yet).
This is on windows server 2007.
You should run it in a different process. Make a daemon that runs continuously, hits the database and looks for a flag, such as "ShouldProcessData". Then, when that website endpoint is hit, switch the flag to true. Your daemon process will see the flag on its next iteration and begin the processing, then stick the results into the database. Use the database as the communication mechanism between the website and the long-running process.
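A bare-bones sketch of that pattern; the jobs table, status values and paths are all assumptions:

<?php
// worker.php - started separately (scheduled task / service), not through Apache.
$db = new PDO('mysql:host=localhost;dbname=reports', 'user', 'pass');

while (true) {
    // The website sets this flag when the user clicks submit.
    $job = $db->query("SELECT id FROM jobs WHERE status = 'ShouldProcessData' LIMIT 1")
              ->fetch(PDO::FETCH_ASSOC);

    if ($job) {
        $db->exec("UPDATE jobs SET status = 'Processing' WHERE id = " . (int) $job['id']);

        // ... gather data from the web services and database, build and zip the report ...
        $zipPath = 'C:/reports/report-' . $job['id'] . '.zip'; // placeholder path

        $stmt = $db->prepare("UPDATE jobs SET status = 'Done', file = ? WHERE id = ?");
        $stmt->execute(array($zipPath, $job['id']));
    }

    sleep(10); // poll every 10 seconds
}

The form handler then only inserts a row with status = 'ShouldProcessData', and the page polls (or reloads) until it sees status = 'Done'.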
In PHP you have to say what timeout you want for your process.
See set_time_limit() in the PHP manual.
You may have another problem: the timeout of the browser itself (which could be around 1-2 minutes). While that timeout should be changeable within the browser (for each browser), you can usually prevent the client-side timeout from being triggered by sending some data to the browser every 20 seconds or so (such as the headers for the download; you can then send other headers, like encoding, etc.).
Gearman is very handy for this (create a background task and let JavaScript poll for progress). It does of course require having Gearman installed and workers created. See: http://www.php.net/gearman
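For what it's worth, a bare-bones Gearman sketch might look like the following, assuming a gearmand server on localhost:4730 and the PECL gearman extension; the function name and payload are my own:

<?php
// --- client side (the page the user submits to) ---
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$jobHandle = $client->doBackground('build_report', json_encode(array('user_id' => 42)));
// Store $jobHandle (or your own job id) so JavaScript can poll for progress.

// --- worker side (worker.php, kept running on the server) ---
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('build_report', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... build and zip the report, record completion in the database ...
});

while ($worker->work()) {
    // loops forever, handling one job per iteration
}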
Why don't you make an AJAX call from the page where you want to offer the download and then just wait for the AJAX call to return, with set_time_limit(0) on the other page?
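The "other page" the AJAX call hits could be as small as this sketch (the path and response shape are assumptions):

<?php
// build_report.php - runs the long job and answers once it is done.
set_time_limit(0);       // let the report build run as long as it needs
ignore_user_abort(true); // optionally finish even if the user navigates away

// ... gather data, format the report, zip it ...
$zipPath = '/reports/report-' . uniqid() . '.zip'; // placeholder

header('Content-Type: application/json');
echo json_encode(array('download' => $zipPath));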