Web scraper that handles JavaScript [duplicate] - php

Possible Duplicate: Make a JavaScript-aware Crawler
Closed 10 years ago.
I'm trying to figure out what to use as the basis for a PHP-based web scraper that can handle pages that render using JavaScript. Many of the scrape attempts I handle now fail unless the JavaScript in those pages is executed; the pages are not built to gracefully fall back to no-script implementations. This includes sites that make heavy use of AJAX.
Would anyone have suggestions for where to start with the development of a web scraper that can handle modern and heavily JavaScript dependent web pages?
Something that can be used by PHP would be best.

It's possible to use a web browser engine in headless mode to load the page and analyze the DOM. Some googling pointed me at http://phantomjs.org/
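For illustration, here's a minimal sketch of that approach from PHP: write a small PhantomJS script to disk and shell out to the phantomjs binary. It assumes PhantomJS is installed and on the PATH; the URL and file paths are placeholders.

    <?php
    // Minimal sketch: render a JS-heavy page with PhantomJS and read the final DOM from PHP.
    // Assumes the `phantomjs` binary is installed and on the PATH; URL and paths are placeholders.
    $url = 'http://example.com/js-heavy-page';

    // PhantomJS script: load the page, wait briefly for AJAX to settle, dump the rendered HTML.
    $phantomScript = <<<'JS'
    var page = require('webpage').create();
    page.open(phantom.args[0], function (status) {
        if (status !== 'success') { phantom.exit(1); }
        window.setTimeout(function () {   // crude wait for AJAX content to finish
            console.log(page.content);    // fully rendered DOM as HTML
            phantom.exit();
        }, 2000);
    });
    JS;

    file_put_contents('/tmp/render.js', $phantomScript);
    $html = shell_exec('phantomjs /tmp/render.js ' . escapeshellarg($url));

    // $html now contains the post-JavaScript DOM, ready for DOMDocument/XPath parsing.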

For sites with heavy AJAX usage, just call the same URLs the page calls and build your content from those responses, rather than requesting the page itself.
For sites with heavy document.write usage (or a framework equivalent), you could probably strip whitespace or match the relevant tags or content with a simple regex, and again request the script responsible rather than the page that includes it.
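A minimal sketch of the first idea, assuming you've identified the XHR endpoint in your browser's network tab (the URL and response shape below are hypothetical) and that it returns JSON:

    <?php
    // Hit the AJAX endpoint the page itself calls, instead of scraping the rendered page.
    // The endpoint URL and response fields are hypothetical; find the real ones in the
    // browser's network tab.
    $ch = curl_init('http://example.com/api/items?page=1');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
    $json = curl_exec($ch);
    curl_close($ch);

    $data = json_decode($json, true);   // work with structured data directly
    foreach ($data['items'] as $item) {
        echo $item['title'], "\n";
    }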

You could use Selenium, which is a browser automation tool, and then use one of the PHP bindings here, here, or here so you can automate Selenium from PHP.
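As an illustration, here is a rough sketch using the facebook/php-webdriver binding (one option; not necessarily one of the bindings linked above) against a Selenium server running locally. Names and URLs are placeholders.

    <?php
    // Sketch: drive a real browser through a local Selenium server using facebook/php-webdriver.
    // Assumes Selenium is running on localhost:4444 and the binding is installed via Composer.
    require 'vendor/autoload.php';

    use Facebook\WebDriver\Remote\RemoteWebDriver;
    use Facebook\WebDriver\Remote\DesiredCapabilities;

    $driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::firefox());
    $driver->get('http://example.com/js-heavy-page');   // placeholder URL

    $html = $driver->getPageSource();   // DOM after JavaScript has run
    $driver->quit();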

You would have to have a JavaScript engine in PHP, or some headless WebKit on the command line. And even then it would get hugely complicated. So the short answer would be: no, sorry, you can't do that.

PHP has a binding for the V8 engine, so I guess you could hand the JavaScript over to V8. Not a pretty thing to do, though; I would use something other than straight PHP for this.
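For what it's worth, the extension in question is V8Js; a minimal sketch of handing a script to V8 from PHP looks like the following. Note that it only runs standalone JavaScript, so it won't give you a DOM or AJAX on its own.

    <?php
    // Sketch: run JavaScript inside PHP with the V8Js extension (pecl install v8js).
    // V8Js is a bare JS engine, not a browser -- no DOM, no XMLHttpRequest.
    $v8 = new V8Js();
    $result = $v8->executeString('var x = 6 * 7; x;');   // value of the last expression
    echo $result;   // 42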

Related

cURL PHP - load a page fully

I am currently trying to load an HTML page via cURL. I can retrieve the HTML content, but part of it is loaded later via scripting (an AJAX POST). I cannot retrieve that part of the HTML (it is a table).
Is it possible to load a page entirely?
Thank you for your answers
No, you cannot do this.
cURL does nothing more than download a file from a URL -- it doesn't care whether it's HTML, JavaScript, an image, a spreadsheet, or any other arbitrary data; it just downloads. It doesn't run anything or parse anything or display anything; it just downloads.
You are asking for something more than that. You need to download, parse the result as HTML, then run some Javascript that downloads something else, then run more Javascript that parses that result into more HTML and inserts it into the original HTML.
What you're basically looking for is a full-blown web browser, not CURL.
Since your goal involves "running some Javascript code", it should be fairly clear that it is not achievable without having a Javascript interpreter available. This means that it is obviously not going to work inside of a PHP program (*). You're going to need to move beyond PHP. You're going to need a browser.
The solution I'd suggest is to use a very specialised browser called PhantomJS. This is actually a full Webkit browser, but without a user interface. It's specifically designed for automated testing of websites and other similar tasks. Your requirement fits it pretty well: write a script to get PhantomJS to open your URL, wait for the table to finish rendering, and grab the finished HTML code.
You'll need to install PhantomJS on your server, and then use a library like this one to control it from your PHP code.
I hope that helps.
(*) Yes, I'm aware of the PHP extension that provides a JS interpreter inside of PHP, and it would provide a way to solve the problem, but it's experimental, unfinished, and would still be difficult to implement as a solution, and I don't think it's a particularly good idea anyway, so let's not consider it for the purposes of this answer.
No, the only way you can do that is to make a separate cURL request to the AJAX endpoint and put the two results together afterwards.
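A rough sketch of that approach, assuming you know the AJAX URL, the POST fields it expects, and where its output belongs in the page (all hypothetical here):

    <?php
    // Sketch: fetch the page, replay the AJAX POST it would have made, and merge the two.
    // The endpoint URL, POST fields, and placeholder marker are all hypothetical.
    function curl_fetch($url, array $post = null) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        if ($post !== null) {
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
        }
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }

    $page  = curl_fetch('http://example.com/report');
    $table = curl_fetch('http://example.com/ajax/table.php', array('report_id' => 123));

    // Insert the AJAX-generated table where the page would have placed it.
    $full = str_replace('<div id="table-placeholder"></div>', $table, $page);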

How to include the static HTML results of a dynamic Javascript page in PHP?

I have a small script that pulls HTML from another site using Javascript.
I want to include that static HTML that gets pulled in a PHP page without any of the Javascript code appearing in the final PHP page that gets displayed.
I tried doing an include of the file with the Javascript code in the PHP page, but it just included the actual Javascript and not the results of the Javascript.
So how would I go about doing this?
You would need to fetch the page, execute the JavaScript in it, then extract the data you wanted from the generated DOM.
The usual approach to this is to use a web automation tool such as Selenium.
You simply can't.
You need to understand that PHP and JavaScript operate in different places: PHP on the server and JavaScript on the client.
Your only solution is to change the way all this is done and use file_get_contents(url) from PHP to get the same content your JavaScript used to get. This way, there is no JavaScript anymore, and you can still pre-process your page with the remote content.
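A small sketch of what that might look like; the URL is a placeholder for whatever the JavaScript was requesting:

    <?php
    // Sketch: fetch the HTML fragment the JavaScript was pulling, directly from PHP.
    // The URL is a placeholder for the resource the script was requesting.
    $fragment = file_get_contents('http://other-site.example/widget.html');
    echo $fragment;   // include the remote HTML in the page, no client-side JS involved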
You wouldn't be able to do this directly from within PHP, since you'd need to run Javascript code.
I'd suggest passing the URL (and any required actions, such as click events) to a headless browser such as Phantom or Zombie, and capturing the DOM from it once the JS engine has done its work.
You could also use a real browser, but of course you don't need a UI in your case, and it might actually get in the way of what you're trying to do, so a headless browser might be better.
This sort of thing would normally be used for automated testing of a site (i.e. functional testing).
There is a PHP tool named Mink which can run these sorts of scripts from within a PHP program. It is aimed at writing test scripts, but I would imagine you could use it for your purposes.
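For illustration, a rough Mink sketch; the driver choice, browser, and URL are placeholders, and the exact API may differ slightly between Mink versions:

    <?php
    // Rough sketch: use Mink with a Selenium-backed driver to grab the DOM after the JS has run.
    // Assumes mink and mink-selenium2-driver are installed via Composer; names/URLs are placeholders.
    require 'vendor/autoload.php';

    use Behat\Mink\Session;
    use Behat\Mink\Driver\Selenium2Driver;

    $session = new Session(new Selenium2Driver('firefox'));
    $session->start();
    $session->visit('http://example.com/js-heavy-page');

    $html = $session->getPage()->getContent();   // rendered document
    $session->stop();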
Hope that helps.

Language for web scraping JAVASCRIPT content

I think the title asks the question: I usually use PHP for parsing/web scraping, but I have a really bad time scraping JavaScript; in most cases I can't do it.
For example: parsing a div that only appears once some JavaScript has executed.
I have read that Ruby has a parser library for JavaScript, so the question is: which language is effective for scraping JavaScript-generated content? Is there a library for PHP, like the one for Ruby, for parsing JavaScript content?
There are a handful of strategies for this. Depending on your needs, consider programmatically instantiating a browser instance that you can hook into and read the page from.
The idea is, let the browser do the work, as the page is made for a browser and not your bot. You can then tap in and scrape away using a browser plugin that feeds data to your primary application running things.
This may be way overkill for what you need though. I'll leave it up to you to decide.
You should look at some GUI-less/headless browsers. There are some written for Java; I didn't find one for PHP.
Look at:
HTMLUnit
Golf
You can try using something like Selenium, which allows you to automate browser tasks.
On the other hand, you can go into details on what happens when the js code is executed. For example, if the js code is requesting something from the server by POSTing some data, you could emulate that in the regular fashion.
You should look at PhantomJS and CasperJS (headless browsers).
In the Ruby world, the gem for driving PhantomJS would be Poltergeist.
There is another article about some of the options you have in Ruby here too (however, they are not all JS capable).

Inline PHP in a .JSP?

I am relatively new to web-development and am encountering an issue using some inline PHP code.
The page is a JavaServer Page (.jsp) and I am trying to implement a JFormer form.
When I add my JFormer PHP code to my .jsp page, it just displays as plain-text and refuses to cooperate with me (even when using demo code from the site's documentation). Is this because of some sort of incompatibility between using PHP on a .jsp page?
If that is the case, what are some workarounds that I could use? Should I use an iframe?
I need to preserve the use of the .jsp page and would prefer very much to use JFormer, but if I have to I can toss it.
Example of something similar to what I am doing can be found at: http://www.jformer.com/documentation/getting-started/installation/
JSP and PHP are both server-side languages. As such, all scripting in a given file must be processed by the required engines on the server to produce the necessary HTML output.
I suppose it is possible to rig multiple engines inline to process first JSP, then PHP, but that seems cumbersome and error prone.
Instead, consider using an iframe (as you suggested) or load the PHP content via an AJAX call.
PHP is executed by a PHP interpreter and outputs HTML. JSP is compiled and executed by a Java VM, and also outputs HTML. You can't execute PHP inside JSP code (and vice versa). It's as if you put Chinese words in the middle of an English speech: nobody can understand it.
I think the point of this is that the examples for jFormer use PHP for the server side logic. If you want to integrate jFormer into your JSP project, learn how to code the equivalent PHP functionality in JSP. You may need to create a Servlet for portions of the logic.
It looks like JFormer requires PHP so you can't make this work on a JSP page easily. You can rewrite the JFormer PHP code in Java/JSP but this may be a lot of work.
The container (like Tomcat) you're using may be able to run PHP scripts as CGI scripts. If you do this you can't easily share session information between PHP and Java. Javascript could be used to accomplish this, but beware of security issues. If you still want to use JSP you could make an iframe that points to the PHP page, as you said.
Here's an article on setting that up for Tomcat:
http://wiki.apache.org/tomcat/UsingPhp
Disclaimer: I don't know JFormer.

Convert HTML page to an image

I want to convert my HTML page to an image. Is there a way in PHP to convert or save an HTML page as an image?
This is not easy; as NullUserException says in his comment, you would need to render the HTML page on the server side, which is not something PHP (or any other server-side language) has built in.
The approach that comes to mind would be to write a program (probably not in PHP, but rather something like C# or C++) that runs on your server, fires up a web browser, and does a series of screen captures (possibly combined with page scrolls). As this is a very nontrivial and bug-prone process, I would suggest looking into third-party components that are capable of doing this.
You would then execute this program from PHP, and when it's done running, display the results from the file it output.
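One concrete (hedged) example of such an external renderer is the wkhtmltoimage command-line tool, which wraps a WebKit engine; if it happens to be installed on the server, PHP can drive it along these lines (the URL and output path are placeholders):

    <?php
    // Sketch: render an HTML page to a PNG with the wkhtmltoimage CLI tool (WebKit-based).
    // Assumes wkhtmltoimage is installed on the server; URL and output path are placeholders.
    $url = 'http://example.com/page-to-capture';
    $out = '/tmp/capture.png';

    shell_exec('wkhtmltoimage ' . escapeshellarg($url) . ' ' . escapeshellarg($out));

    // Serve the resulting image.
    header('Content-Type: image/png');
    readfile($out);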
I would advise you to use an external service with an api. This list might be a good start: http://blogs.sitepoint.com/2008/07/10/9-ways-to-put-site-screenshots-in-your-web-app/
Thumbalizr seems great; they also provide a PHP script so you can cache the images locally:
http://www.thumbalizr.com/apitools.php
Try taking a look at browsershots.org - source code is available for it if you want to install it locally. Essentially it uses a browser to take screenshots, and can be controlled via an XML-RPC interface, which you can call from PHP.
As others have said this is not a simple job, and not something you can do directly in PHP, so use an external service.
(I'm not affiliated with browsershots.org in any way)
