How to execute a PHP spider/scraper but without it timing out

How to execute a PHP spider/scraper but without it timing out - php

Basically I need to get around max execution time.
I need to scrape pages for info at varying intervals, which means calling the bot at those intervals, to load a link form the database and scrap the page the link points to.
The problem is, loading the bot. If I load it with javascript (like an Ajax call) the browser will throw up an error saying that the page is taking too long to respond yadda yadda yadda, plus I will have to keep the page open.
If I do it from within PHP I could probably extend the execution time to however long is needed but then if it does throw an error I don't have the access to kill the process, and nothing is displayed in the browser until the PHP execute is completed right?
I was wondering if anyone had any tricks to get around this? The scraper executing by itself at various intervals without me needing to watch it the whole time.
Cheers :)

Use set_time_limit() as such:
set_time_limit(0);
// Do Time Consuming Operations Here

"nothing is displayed in the browser until the PHP execute is completed"
You can use flush() to work around this:
flush()
(PHP 4, PHP 5)
Flushes the output buffers of PHP and whatever backend PHP is using (CGI, a web server, etc). This effectively tries to push all the output so far to the user's browser.

take a look at how Sphider (PHP Search Engine) does this.
Basically you will just process some part of the sites you need, do your thing, and go on to the next request if there's a continue=true parameter set.

run via CRON and split spider into chunks, so it will only do few chunks at once. call from CRON with different parameteres to process only few chunks.

Related

JavaScript back and forth with running program

Problem:
I'm trying to see if I can have a back and forth between a program running on the server-side and JavaScript running on the client-side. All the outputs from the program are sent to JavaScript to be displayed to the user, and all the inputs from the user are sent from JavaScript to the program.
Having JavaScript receive the output and send the input is easily done with AJAX. The problem is that I do not know how to access an already running program on the server.
Attempt:
I tried to use PHP, but ran into some hurdles I couldn't leap over. Now, I can execute a program with PHP without any issue using proc_open. I can hook into the stdin and stdout streams, and I can get output from the program and send it input as well. But I can do this only once.
If the same PHP script is executed(?) again, I end up running the program again. So all I ever get out of multiple executions is whatever the program writes to stdout first, multiple times.
Right now, I use proc_open in the script which is supposed to only take care of input and output because I do not know how to access the stdout and stdin streams of an already running program. The way I see it, I need to maintain the state of my program in execution over multiple executions of the same PHP script; maintain the resource returned by proc_open and the pipes hooked into the stdin and stdout streams.
$_SESSION does NOT work. I cannot use it to maintain resources.
Is there a way to have such a back and forth with a program? Any help is really appreciated.

This sounds like a job for websockets
Try something like http://socketo.me/ or http://code.google.com/p/phpwebsocket/
I've always used Node for this type of thing, but from the above two links and a few others, it looks like there's options for PHP as well.

There may be a more efficient way to do it, but you could get the program to write it's output to a text file, and read the contents of that text file in with php. That way you'd have access to the full stream of data from the running program. There are issues with managing the size of the file, and handling requests from multiple clients, but it's a simple approach that might be good enough for your needs.

You are running the same program again, because it's the way PHP works. In your case client does a HTTP request and runs the script. Second request will run the script again. I'm not sure if continuous interaction is possible, so I would suggest making your script able to handle discrete transactions.
In order to figure different steps of the same "interaction", you will have to save data about previous ones in database. Basically, you need to give some unique hash to every client to identify them in your script, then it will know who does the request and will be able to differ consecutive requests from one user from requests of different users.
If your script is heavy and runs for a long time, consider making two script - one heavy and one for interaction (AJAX will query second one). In this case, second script will fill data into database and heavy script will simply fetch it from there.

Creating a long TCPDF document without timeout (so long running php process)

I'm building a feature of a site that will generate a PDF (using TCPDF) into a booklet of 500+ pages. The layout is very simple but just due to the number of records I think it qualifies as a "long running php process". This will only need to be done a handful of times per year and if I could just have it run in the background and email the admin when done, that would be perfect. Considered Cron but it is a user-generated type of feature.
What can I do to keep my PDF rendering for as long as it takes? I am "good" with PHP but not so much with *nix. Even a tutorial link would be helpful.

Honestly you should avoid doing this entirely from a scalability perspective. I'd use a database table to "schedule" the job with the parameters, have a script that is continuously checking this table. Then use JavaScript to poll your application for the file to be "ready", when the file is ready then let the JavaScript pull down the file to the client.
It will be incredibly hard to maintain/troubleshoot this process while you're wondering why is my web server so slow all of a sudden. Apache doesn't make it easy to determine what process is eating up what CPU.
Also by using a database you can do things like limit the number of concurrent threads, or even provide faster rendering time by letting multiple processes render each PDF page and then re-assemble them together with yet another process... etc.
Good luck!

What you need is to change the allowed maximum execution time for PHP scripts. You can do that by several means from the script itself (you should prefer this if it would work) or by changing php.ini.
BEWARE - Changing execution time might seriously lower the performance of your server. A script is allowed to run only a certain time (30sec by default) before it is terminated by the parser. This helps prevent poorly written scripts from tying up the server. You should exactly know what you are doing before you do this.
You can find some more info about:
setting max-execution-time in php.ini here http://www.php.net/manual/en/info.configuration.php#ini.max-execution-time
limiting the maximum execution time by set_time_limit() here http://php.net/manual/en/function.set-time-limit.php
PS: This should work if you use PHP to generate the PDF. It will not work if you use some stuff outside of the script (called by exec(), system() and similar).

This question is already answered, but as a result of other questions / answers here, here is what I did and it worked great: (I did the same thing using pdftk, but on a smaller scale!)
I put the following code in an iframe:
set_time_limit(0); // ignore php timeout
//ignore_user_abort(true); // optional- keep on going even if user pulls the plug*
while(ob_get_level())ob_end_clean();// remove output buffers
ob_implicit_flush(true);
This avoided the page load timeout. You might want to put a countdown or progress bar on the parent page. I originally had the iframe issuing progress updates back to the parent, but browser updates broke that.

Page won't display until PHP functions have completed execution

My website runs simplexml commands to pull data from 2 different websites, and doesn't finish loading the page until after the functions have their responses.
This is really only 1-2 seconds, but it is noticable when regular webpages take milliseconds to load.
Since this code is already in PHP functions, how can I most efficiently load the page and execute the code after? I'm assuming that by the time the page loads, the functions will have executed as well, its just that the browser itself won't refresh and finish loading til execution completes.
Hope this makes sense to you.

Unfortunately, php runs on the server side before the page is loaded. That is what allows it to provide dynamically generated content to the page. If you want to load the page and then run the php functions, you should check out AJAX.
Ajax uses javascript to call external functions and change content on the page without a reload.

Create a webpage without calling any of these functions. Add some JavaScript to that page to make AJAX requests to PHP scripts that call the functions, then adds the returned results to the page.

You have a few options.
AJAX call -- once the important stuff loads, have JS send word to the server that it needs to do some process to complete loading. (rennekon and Dan Grossman seem to have already suggested this).
iframe similar to AJAX, but it does not require JS. Placed at the bottom of the HTML it can let the server know something needs to finish without worrying about any other rendering. (this can actually also be accomplished by any number of tags which make HTTP requests. img attacks are notorious for allowing this with vulnerable sites.)
Spawn a new thread. This is a bit more difficult/annoying, but it does not rely on user feedback to finish processing. You also may not be able to do this on most servers, but it is one way to finish processing in the background.

You can create a cron that would talk to the 2 different websites and store the data you need periodically and then when your page runs it would talk to the local version that cron stored for you taking the communication off of page render time

PHP display progress messages on the fly

I am working in a tool in PHP that processes a lot of data and takes a while to finish. I would like to keep the user updated with what is going on and the current task processed.
What is in your opinion the best way to do it? I've got some ideas but can't decide for the most effective one:
The old way: execute a small part of the script and display a page to the user with a Meta Redirect or a JavaScript timer to send a request to continue the script (like /script.php?step=2).
Sending AJAX requests constantly to read a server file that PHP keeps updating through fwrite().
Same as above but PHP updates a field in the database instead of saving a file.
Does any of those sound good? Any ideas?
Thanks!

Rather than writing to a static file you fetch with AJAX or to an extra database field, why not have another PHP script that simply returns a completion percentage for the specified task. Your page can then update the progress via a very lightweight AJAX request to said PHP script.
As for implementing this "progress" script, I could offer more advice if I had more insight as to what you mean by "processes a lot of data". If you are writing to a file, your "progress" script could simply check the file size and return the percentage complete. For more complex tasks, you might assign benchmarks to particular processes and return an estimated percentage complete based on which process has completed last or is currently running.
UPDATE
This is one suggested method to "check the progress" of an active script which is simply waiting for a response from a request. I have a data mining application that I use a similar method for.
In your script that makes the request you're waiting for (the script you want to check the progress of), you can store (either in a file or a database, I use a database as I have hundreds of processes running at any time which all need to track their progress, and I have another script that allows me to monitor progress of these processes) a progress variable for the process. When the process begins, set this to 1. You can easily select an arbitrary number of 'checkpoints' the script will pass and calculate the percentage given the current checkpoint. For a large request, however, you might be more interested in knowing the approximate percent the request has completed. One possible solution would be to know the size of the returned content and set your status variable according to the percentage received at any moment. I.e. if you receive the request data in a loop, each iteration you could update the status. Or if you are downloading to a flat file you could poll the size of the file. This could be done less accurately with time (rather than file size) if you know the approximate time the request should take to complete and simply compare against the script's current execution time. Obviously neither of these are perfect solutions, but I hope they'll give you some insight into your options.

I suggest using the AJAX method, but not using a file or a database. You could probably use session values or something like that, that way you don't have to create a connection or open a file to do anything.

In the past, I've just written messages out to the page and used flush() to flush the output buffer. Very simple, but it may not work correctly on every web server or with every web browser (as they may do their own internal buffering).
Personally, I like your second option the best. Should be reliable and fairly simple to implement.

I like option 2 - using AJAX to read a status file that PHP writes to periodically. This opens up a lot of different presentation options. If you write a JSON object to the file, you can easily parse it and display things like a progress bar, status messages, etc...

A 'dirty' but quick-and-easy approach is to just echo out the status as the script runs along. So long as you don't have output buffering on, the browser will render the HTML as it receives it from the server (I know WordPress uses this technique for it's auto-upgrade).
But yes, a 'better' approach would be AJAX, though I wouldn't say there's anything wrong with 'breaking it up' use redirects.
Why not incorporate 1 & 2, where AJAX sends a request to script.php?step=1, checks response, writes to the browser, then goes back for more at script.php?step=2 and so on?

if you can do away with IE then use server sent events. its the ideal solution.

Showing percentage complete of PHP script

I have a PHP script that takes about 10 minutes to complete.
I want to give the user some feedback as to the completion percent and need some ideas on how to do so.
My idea is to call the php page with jquery and the $.post command.
Is there a way to return information from the PHP script without ending the script?
For example, from my knowledge of this now, if I return the variable, the PHP script will stop running.
My idea is to split the script into multiple PHP files and have the .post run each after a return from the previous is given.
But this still will not give an accurate assessment of time left because each script will be a different size.
Any ideas on a way to do this?
Thanks!

You can echo and flush() output, but that's suboptimal and rather fragile solution.
For long operations it might be good idea to launch script in the background and store/updte script status in shared location.
e.g. you could lanuch script using fopen('http://… call, proc_open PHP CLI process or even just openg long-running script in an <iframe>.
You could store status in the database or in shared memory (using apc_store()).
This will let user to check status of the script at any time (by refreshing page, or using AJAX) and user won't lose track of the script if browser's connection times out.
It also lets you avoid starting same long script twice.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.