I am writing a script that will probably need half a day, because it fetches data from about 14,000 pages of a website.
To find out whether it is making progress, is there any way to observe its execution, i.e. the outgoing connections to each of the scraped pages, from the macOS shell?
I am using curl to get the page contents, if that is of any help.
Thanks a lot!
Charles
EDIT
The script is written in php and executed from localhost.
When writing custom scripts it is very helpful to output some sort of status to stdout.
This can be done in a uniform way using printf: http://www.php.net/manual/en/function.sprintf.php
What you log to stdout depends on what information you need to see. For a curl request I would log the URL, the response code, and maybe the start and end time. It's really up to you; just make sure you can verify the script's status/progress.
printf("%40s | %11s\n", 'URL', 'Status Code');
printf("%40s | %11s\n", $the_url, $status_code);
(Double quotes are needed for "\n" to be an actual newline, and the second column is widened so the header fits.)
If you are running this via a web browser, output is not seen until the PHP has finished executing. However, file_put_contents() can append data to a logfile which you can look at.
An example line of code would be: file_put_contents("filename.txt", "\nWebsite abc was successfully scraped", FILE_APPEND);. You must pass the FILE_APPEND flag, or PHP will just overwrite the file each time.
php.net Reference
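The two suggestions above (a status line per request, appended to a logfile) can be combined into a small sketch. The filename, column widths, and the example URL are illustrative choices, not anything the answers above mandate; you can watch the file from another terminal with `tail -f scrape.log`.

```php
<?php
// Hedged sketch: one log line per scraped URL, appended to a logfile.
// Filename and column widths are arbitrary placeholders.

function format_log_line(string $url, int $status): string
{
    // Left-align the URL in a 40-character column, right-align the status code.
    return sprintf("%-40s | %5d\n", $url, $status);
}

function log_progress(string $url, int $status, string $logfile = 'scrape.log'): void
{
    // FILE_APPEND keeps earlier lines; without it each call overwrites the file.
    file_put_contents($logfile, format_log_line($url, $status), FILE_APPEND);
}

log_progress('https://example.com/page/1', 200);
```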
To preface, I know this isn't a great question and it will be hard to explain.
I have a PHP script that takes 5-10 minutes to run. I don't want the user to have to wait for it. If I "trigger" the script using a jQuery AJAX call, and the user then navigates away from that page or closes the browser (without waiting for the response, if any, which will come much later), will the script still execute fully (assuming there are no errors, etc.)?
Thanks!
Once the server receives the AJAX request along with its data, it will process it as usual, even if you close the page or the window. If you close the browser window before the server receives the AJAX request, the processing is not going to happen.
Furthermore, if the AJAX request returns any kind of data or displays messages, it is advisable to leave the window open, so that there is some page "listening" for the server's response.
In your PHP script you could call the ignore_user_abort(true) function, which causes the script to keep running regardless of whether the user closes the page.
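A minimal sketch of that pattern, assuming the work itself is a placeholder: ignore_user_abort() and set_time_limit() are the two real PHP calls, while do_heavy_work() merely stands in for your own 5-10 minute job.

```php
<?php
// Keep running even if the client disconnects mid-request.
ignore_user_abort(true);
// Lift the default max_execution_time so the long job isn't killed.
set_time_limit(0);

function do_heavy_work(int $steps): int
{
    $done = 0;
    for ($i = 0; $i < $steps; $i++) {
        // ...one unit of the real long-running job would go here...
        $done++;
    }
    return $done;
}

do_heavy_work(5);
```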
You could use the command line if you have access to one:
php /loc/to/file.php
This does not have a timeout and might be faster than going through port 80 (a browser calling a PHP file).
Or call the main PHP file from another file via PHP's exec():
<?php
exec("php /loc/to/file.php > /loc/to/result.txt");
?>
You might want to use shell_exec() instead.
The directory of result.txt and the file itself need to be writable.
The 'greater than' sign writes the output of the PHP file to result.txt: if the script would echo 123, that would be the contents of result.txt.
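One caveat with the exec() call above: as written, it blocks until file.php finishes. A common variation is to redirect output and append "&" so the shell runs the job in the background and the caller returns immediately. The /loc/to/... paths are the same placeholders used above; this sketch only builds the command string.

```php
<?php
// Build a backgrounded shell command for a long-running PHP script.
function background_command(string $script, string $logfile): string
{
    // 2>&1 folds stderr into the same logfile; the trailing & detaches the job.
    return sprintf('php %s > %s 2>&1 &', escapeshellarg($script), escapeshellarg($logfile));
}

// Usage (commented out so nothing actually runs here):
// exec(background_command('/loc/to/file.php', '/loc/to/result.txt'));
```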
5 to 10 minutes is a very long time, so you might want to check your code for improvements. If you are using a database, add an index on the columns you use a lot; that can save huge amounts of time.
I'm fetching pages with cURL in PHP. Everything works fine, but I'm fetching some parts of the page that are calculated with JavaScript a fraction of a second after the page is loaded. cURL already sends the page's source back to my PHP script before the JavaScript calculations are done, resulting in wrong end results. The calculations on the site are fetched via AJAX, so I can't reproduce them in an easy way. Also, I have no access to the target page's code, so I can't tweak the target page to fit my (cURL) fetching needs.
Is there any way I can tell cURL to wait until all dynamic traffic is finished? It might be tricky, due to some JavaScripts that keep sending data back to another domain, which might result in long hangs. But at least I could then test whether I get the correct results back.
The Developer toolbar in Safari indicates the page is done in about 1.57s. Maybe I can just tell cURL statically to wait for 2 seconds?
I wonder what the possibilities are :)
cURL does not execute any JavaScript or download any files referenced in the document, so cURL alone is not the solution to your problem.
You'll have to use a browser on the server side, tell it to load the page, wait for X seconds and then ask it to give you the HTML.
Look at: http://phantomjs.org/ (it is a standalone headless browser that you script in JavaScript; I'm not aware of any PHP solutions).
With Peter's advice and some research I have found a solution. It's late, but I hope someone finds it helpful.
All you need to do is request the ajax call directly. First, load the page that you want to get in chrome, go to Network tab, filter XHR.
Now you have to find the ajax call that you want. Check the response to verify it.
Right click on the name of the AJAX call and select Copy -> "Copy as cURL (bash)".
Go to https://reqbin.com/curl, paste the Curl and click Run. Check the response content.
If it's what you want then move to the next step.
Still in the reqbin window, click Generate code and choose the language you want it translated to, and you will get the desired code. Now integrate it into your own code however you want.
Some tips: if a test run on your own server returns a 400 error or nothing at all, set POSTFIELDS to empty. If it returns 301 Moved Permanently, check whether your URL should be https or not.
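The last step could look something like this in PHP. The endpoint URL and header list are placeholders (substitute whatever "Copy as cURL (bash)" gave you in the Network tab), and the options mirror the tips above: empty POSTFIELDS and following redirects for the 301 case. Drop CURLOPT_POSTFIELDS entirely if the call you copied was a plain GET.

```php
<?php
// Build the cURL option set; kept as a separate function for clarity.
function curl_options(array $headers): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow a 301 if the site redirects to https
        CURLOPT_POSTFIELDS     => '',     // the "empty POSTFIELDS" tip from above
        CURLOPT_HTTPHEADER     => $headers,
    ];
}

// Replay the copied AJAX request and return status code plus body.
function fetch_ajax(string $url, array $headers = []): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, curl_options($headers));
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return ['status' => $status, 'body' => $body];
}

// Usage (placeholder URL, so left commented out):
// $result = fetch_ajax('https://example.com/ajax/endpoint');
```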
I don't know a lot about the page you are retrieving or the calculations you want to include, but it could be an option to cURL straight to the URL serving those AJAX requests. Use something like Firebug to inspect the AJAX calls being made on your target page, and you can figure out the URL and any parameters passed. If you do need the full web page, maybe you can cURL both the web page and the AJAX URL and combine the two in your PHP code, but then it starts to get messy.
There is one quite tricky way to achieve this using PHP. If you really need it to work in PHP, you could use a Codeception setup in conjunction with Selenium, driving the Chrome browser via its WebDriver in headless mode.
Here are some general steps to have it working.
Make sure you have Codeception in your PHP project:
https://codeception.com
Download chrome webdriver:
https://chromedriver.chromium.org/downloads
Download selenium:
https://www.seleniumhq.org/download/
Configure it accordingly, following the Codeception framework's documentation.
Write a Codeception test in which you can use an expression like $I->wait(5) to wait 5 seconds, or $I->waitForJs('js expression here') to wait for a JS expression to complete on the page.
Run the test written in the previous step with the command php vendor/bin/codecept run path/to/test
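A hedged sketch of what such a test could look like. The class name, target URL, wait times, and output filename are all made up for illustration; $I is the AcceptanceTester actor that Codeception generates for an acceptance suite backed by the WebDriver module, and amOnPage(), wait(), waitForJS(), and grabPageSource() are that module's methods.

```php
<?php
// Illustrative Codeception "Cest" class: load a page, let its JS finish,
// then save the rendered HTML. Needs a configured acceptance suite to run.
class DynamicPageCest
{
    public function grabRenderedHtml($I)
    {
        $I->amOnPage('/target-page');    // placeholder path on the target site
        $I->wait(2);                     // fixed wait, like the 2s idea above
        $I->waitForJS('return document.readyState === "complete";', 10);
        // Rendered HTML, after the browser has executed the page's JavaScript.
        file_put_contents('rendered.html', $I->grabPageSource());
    }
}
```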
I have some PHP scripts that can be called either from the command line or as a webpage (where the arguments are passed from other web pages using $GET or $POST).
They can take a while to execute, let’s say 5 minutes.
The scripts include some "echo" and "print" calls which let me know what is happening during the execution in real time.
The problem is that, in webpage mode, those echo calls don't print anything in the browser until the end of the script's execution. Or sometimes half the echoes appear after 2 minutes and the rest at the end.
Is there a simple way to make my print()/echo() calls appear in real time when my scripts are called in "webpage mode"?
Thanks in advance.
flush() may or may not work depending on the browser and size of the output (see: PHP Flush() not working in Chrome)
Apache can also buffer output if mod_gzip is enabled.
Your best bet is to log into a db/session/fs and have JS on client side polling for updates.
Use ob_flush() to force output to be sent to the browser before script execution completes.
I assume you are not using output buffering, since your script outputs fine on the console. Therefore, use flush() to explicitly tell PHP it should send output to the browser.
I would suggest a flush every xxx outputs instead of flushing after every echo or print if they appear in short intervals.
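A sketch of the pattern the answers above describe: drop PHP's own output buffers once, then flush after each chunk of progress. The step count, message text, and one-second sleep are placeholders for the real work; whether the browser displays each line immediately still depends on compression and browser-side buffering, as noted above.

```php
<?php
// One line of progress output per unit of work.
function progress_line(int $step): string
{
    return "step $step done<br>\n";
}

// End any active output buffers so echo goes straight to the server.
while (ob_get_level() > 0) {
    ob_end_flush();
}

for ($i = 1; $i <= 5; $i++) {
    echo progress_line($i);
    flush();      // push what we have to the web server right away
    sleep(1);     // stands in for a chunk of the real long-running job
}
```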
Is it possible to show interactive shell command output on a webpage without refreshing the webpage?
For example, can we update a webpage with the latest snapshot of the (Linux) Top command's output every 1 second without refreshing the webpage?
It will be very helpful if anyone can teach me how to take a text-based snapshot of the latest output of an interactive shell command.
Thanks.
Yes, you need to use an AJAX-style request to refresh the contents periodically. That's a broad subject though, so not an easy one to cover in a few minutes here!
Have a look at some of the more popular JavaScript libraries like jQuery or MooTools and you will see methods for making AJAX requests. JSON is about the easiest format to transfer data in for this type of work, since both PHP and JavaScript support it natively. I.e., you can encode your data as JSON in one line of PHP and then decode it in JavaScript simply by eval'ing it.
Edit: now that I re-read your question, I see I've missed half the point! I'm not sure off-hand about the interactive shell output question. I tried this just now: the command didn't terminate, unfortunately, but it did write output to test.txt.
top > test.txt
Perhaps there is a way to make it non-interactive.
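There is: on Linux, `top -b -n 1` runs top once in non-interactive "batch" mode and exits, which avoids the never-terminating command tried above (macOS top uses `top -l 1` instead). A sketch of the tiny PHP endpoint an AJAX poll could hit every second, with an arbitrary 15-line cap:

```php
<?php
// Return one plain-text snapshot of `top` output.
function top_snapshot(int $lines = 15): string
{
    // -b = batch output (no terminal control codes), -n 1 = one iteration.
    $out = shell_exec('top -b -n 1 2>/dev/null | head -n ' . (int) $lines);
    return $out ?? '';   // shell_exec() returns null on failure
}

header('Content-Type: text/plain');
echo top_snapshot();
```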
If I have something like this in PHP:
$foo = 0;
while ($foo < 20) {
    echo "hello";
    usleep(1000000);
    $foo = $foo + 1;
}
and I make an AJAX request to that PHP file, can I do anything with the data while the request is in progress?
I mean, the script echoes "hello" every second, and I saw that the request only shows its data when the whole loop is finished. So isn't there a way I can access each "hello" as it's echoed out?
Look at the Firebug extension for Firefox.
There are a few reasons why you can't see it.
The content coming from the AJAX request is processed by the server like any other http/php request.
What is happening is that the data is being cached in the PHP output buffer; when the script is done, the buffer is flushed to the output, which Apache then delivers to you.
There is so little data that there is no need to flush the buffer before the process is done, so you are only seeing the final result.
If you had output so much data that the buffer was flushed beforehand, then you might get it incrementally.
The other problem is going to be your AJAX request handler. I'm pretty sure the onComplete (or similar) method that you (and everyone else) are using is only called when the output from the server request is finished and your browser has the full data.
It may be possible to use a different event, or perhaps to write the AJAX code yourself (without using something like jQuery), but I'm not even sure that would solve your problem, as this might also be down to the XMLHttpRequest implementation.
May I ask what you are trying to do this for? There may be an easier solution for your actual problem (I'm assuming this isn't the code you actually use on your site).
Dan
If you execute flush() in PHP, you will send content. If you're compressing at the server level, you may need to pad the output to fill up a packet before it is sent.
flush();
Here's an example: http://example.preinheimer.com/flush.php
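A sketch of the padding trick mentioned above: when gzip/deflate buffers small writes, emitting a kilobyte or so of filler (spaces inside an HTML comment here, so nothing visible renders) can push the compressed block over a packet boundary. The 1024-byte size is a guess; some setups need 4096 or more.

```php
<?php
// Build an invisible filler string of roughly $bytes spaces.
function padding(int $bytes): string
{
    return '<!-- ' . str_repeat(' ', $bytes) . ' -->';
}

echo "Working...<br>\n";
echo padding(1024);   // filler to defeat server-side compression buffering
flush();              // then ask PHP/the server to send it
```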
The correct answer is that you CAN see the content while it's being returned.
The other answers were partially correct in mentioning that the PHP output buffer keeps the output "bottled up"... but the output buffer can be disabled.
Once you disable the output buffer, you need to show the response before the request completes; you do this by updating the page from JavaScript while the connection to the server is still active. This concept is called "Comet" or "long polling".
See these questions:
Comet and jQuery
How do I implement basic "Long Polling"?
Comet In PHP
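For the server side of long polling, one common shape is an endpoint that holds the connection open until new data appears, then responds immediately. A hedged sketch, watching a status file's modification time; the filename, 30-second cap, and 250 ms poll interval are all arbitrary choices:

```php
<?php
// Block until $file's mtime is newer than $since, or until the timeout.
function wait_for_change(string $file, int $since, int $timeoutSeconds = 30): int
{
    $deadline = time() + $timeoutSeconds;
    while (time() < $deadline) {
        clearstatcache(true, $file);
        $mtime = @filemtime($file);     // false if the file is missing
        if ($mtime !== false && $mtime > $since) {
            return $mtime;              // something changed: respond now
        }
        usleep(250000);                 // re-check four times a second
    }
    return $since;                      // timed out: client simply polls again
}

// Usage in an endpoint (placeholder filename, so left commented out):
// echo json_encode(['mtime' => wait_for_change('status.txt', (int) ($_GET['since'] ?? 0))]);
```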