As the title says: I usually use PHP for parsing/web scraping, but I have a really hard time scraping JavaScript; in most cases I can't do it.
For example: parsing a div that only appears after some JavaScript has executed.
I read that Ruby has a library for parsing JavaScript, so my question is: which language should I use to write a web scraper that can effectively scrape JavaScript-generated content? Is there a library for PHP, like the one for Ruby, for parsing JavaScript content?
There are a handful of strategies for this. Depending on your needs, consider programmatically instantiating a browser instance that you can hook into and read the page from.
The idea is, let the browser do the work, as the page is made for a browser and not your bot. You can then tap in and scrape away using a browser plugin that feeds data to your primary application running things.
This may be way overkill for what you need though. I'll leave it up to you to decide.
You should look at some GUI-less/headless browsers. There are some written for Java; I didn't find one for PHP.
Look at:
HTMLUnit
Golf
You can try using something like Selenium, which allows you to automate browser tasks.
Alternatively, you can dig into the details of what happens when the JS code is executed. For example, if the JS code requests something from the server by POSTing some data, you could emulate that request in the usual way.
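A minimal sketch of emulating such a POST with PHP's cURL extension (the endpoint URL and the field names here are made-up placeholders; you would find the real ones with your browser's network inspector):

    <?php
    // Replicate the AJAX POST the page's JS makes, instead of loading the page itself.
    $ch = curl_init('http://example.com/ajax/endpoint.php'); // placeholder endpoint
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        'id'   => 123,   // placeholder fields copied from the real request
        'page' => 1,
    )));
    $response = curl_exec($ch);
    curl_close($ch);

    // $response is whatever the page's JS would have received
    // (often JSON or an HTML fragment) -- parse it however fits.
    echo $response;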
You should look at PhantomJS and CasperJS (headless browsers).
In the Ruby world, the gem for driving PhantomJS would be Poltergeist.
There is another article about some of the options you have in Ruby here too (however, they are not all JS-capable).
I am currently trying to load an HTML page via cURL. I can retrieve the HTML content, but part of it is loaded later via scripting (an AJAX POST), and I cannot recover that part (it is a table).
Is it possible to load a page entirely?
Thank you for your answers
No, you cannot do this.
cURL does nothing more than download a file from a URL -- it doesn't care whether it's HTML, JavaScript, an image, a spreadsheet, or any other arbitrary data; it just downloads. It doesn't run anything or parse anything or display anything; it just downloads.
You are asking for something more than that. You need to download, parse the result as HTML, then run some Javascript that downloads something else, then run more Javascript that parses that result into more HTML and inserts it into the original HTML.
What you're basically looking for is a full-blown web browser, not CURL.
Since your goal involves running some JavaScript code, it should be fairly clear that it is not achievable without a JavaScript interpreter available. This means that it is obviously not going to work inside a PHP program (*). You're going to need to move beyond PHP. You're going to need a browser.
The solution I'd suggest is to use a very specialised browser called PhantomJS. This is actually a full Webkit browser, but without a user interface. It's specifically designed for automated testing of websites and other similar tasks. Your requirement fits it pretty well: write a script to get PhantomJS to open your URL, wait for the table to finish rendering, and grab the finished HTML code.
You'll need to install PhantomJS on your server, and then use a library like this one to control it from your PHP code.
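A rough sketch of the PHP side, assuming PhantomJS is installed and that you have a small PhantomJS script (here called render.js, a placeholder name) that opens a URL, waits for the page to finish rendering, and prints the resulting HTML to stdout:

    <?php
    // Hypothetical wrapper: let PhantomJS render the page, then parse the result.
    $url  = 'http://example.com/page-with-ajax-table';       // placeholder URL
    $html = shell_exec('phantomjs render.js ' . escapeshellarg($url));

    if (!$html) {
        die('PhantomJS returned no output');
    }

    // Parse the fully rendered HTML, e.g. with DOMDocument.
    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // suppress warnings from imperfect markup
    $tables = $dom->getElementsByTagName('table');
    echo $tables->length . " table(s) found\n";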
I hope that helps.
(*) Yes, I'm aware of the PHP extension that provides a JS interpreter inside PHP, and it would provide a way to solve the problem, but it's experimental, unfinished, would still be difficult to implement as a solution, and I don't think it's a particularly good idea anyway, so let's not consider it for the purposes of this answer.
No; the only way you can do this is to make a separate cURL request to the AJAX endpoint and put the two results together afterwards.
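Roughly like this (a sketch only; the second URL and the container markup are assumptions you would adapt to the actual site):

    <?php
    // Fetch the static page and the AJAX-loaded table separately, then splice them together.
    function fetch($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }

    $page  = fetch('http://example.com/page.html');        // placeholder URLs
    $table = fetch('http://example.com/ajax/table.php');   // the endpoint the page's JS calls

    // Assume the page has an empty container that the JS normally fills.
    $full = str_replace('<div id="table-container"></div>',
                        '<div id="table-container">' . $table . '</div>',
                        $page);
    echo $full;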
I have a huge list of URLs from a client which I need to run through so I can get content from the pages. This content is in different tags within the page.
I am looking to create an automated service to do this which I can leave running to complete.
I want the automated process to load each page and get the content from particular HTML tags, then process some of this content to ensure the HTML is correct.
If possible I want to generate one XML or JSON file, but I can settle for an XML or JSON file per page.
What is the best way to do this? Preferably something I can run on a Mac or a Linux server.
The URLs in the list all point to an external site.
Is there something I can already use, or an example somewhere, that will help me?
Thanks
This is a perfect application of BeautifulSoup, IMHO. Here is a tutorial on a similar process. It is certainly a head start.
Scrapy is an excellent framework for spidering and scraping.
I think you'll find it involves a little more learning overhead than the Requests + Beautiful Soup or lxml tutorial mentioned by tim-cook in his answer. However, if you're writing a lot of scraping/parsing logic, it should guide you toward a pretty well-factored (readable, maintainable) codebase.
So, if it's a one-off run I'd go with Beautiful Soup + Requests. If it'll be re-used, extended and maintained over time then Scrapy would be my pick.
I'm trying to figure out what to use as the basis for a PHP based web scraper that can handle pages that render using JavaScript. Many web site scrape attempts (at least the ones I handle) now fail unless the JS in those pages is executed. The pages are not built to gracefully fall back to no-script implementations. This includes those that make heavy use of AJAX.
Would anyone have suggestions for where to start with the development of a web scraper that can handle modern and heavily JavaScript dependent web pages?
Something that can be used by PHP would be best.
It's possible to use a web browser engine in headless mode to load the page and analyze the DOM. Some googling pointed me at http://phantomjs.org/
For sites that make heavy use of AJAX, just call the same URLs the page does, and build your content from those responses rather than requesting the page itself.
For sites that rely heavily on document.write (or a framework equivalent), you could probably just strip whitespace or match tags or relevant content using a simple regex, and again request the script responsible rather than the page that includes it...
You could use Selenium which is a browser automation tool and then use one of the PHP bindings here, here, or here so you can automate Selenium from PHP.
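As a rough sketch of what that can look like, assuming the facebook/php-webdriver bindings and a Selenium server already running on localhost:4444 (adjust to whichever binding you choose):

    <?php
    require 'vendor/autoload.php';

    use Facebook\WebDriver\Remote\RemoteWebDriver;
    use Facebook\WebDriver\Remote\DesiredCapabilities;

    // Connect to the running Selenium server (placeholder address).
    $driver = RemoteWebDriver::create('http://localhost:4444/wd/hub',
                                      DesiredCapabilities::firefox());

    // Let a real browser execute the page's JavaScript ...
    $driver->get('http://example.com/js-heavy-page');   // placeholder URL

    // ... then read back the rendered DOM and scrape it as usual.
    $html = $driver->getPageSource();
    $driver->quit();

    echo strlen($html) . " bytes of rendered HTML\n";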
You would have to have a JavaScript engine in PHP. Or some headless Webkit on the command line. And even then it would get hugely complicated. So the short answer would be: No, sorry, you can't do that.
PHP can embed the V8 engine, so I guess you could pass the JavaScript over to V8. Not a pretty thing to do, though; I would use something other than straight PHP for this.
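For completeness, the extension in question is V8Js (PECL). A trivial sketch of what it gives you -- note this only evaluates standalone JavaScript; there is no DOM, so it won't run a page's scripts for you:

    <?php
    // Requires the V8Js PECL extension.
    $v8 = new V8Js();

    // Evaluate a small, self-contained snippet of JavaScript;
    // the value of the last expression is returned to PHP.
    $result = $v8->executeString('var x = 2 + 3; x * 10;');
    var_dump($result);   // should print int(50)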
I want to render my HTML page as an image. Is there a way in PHP to convert or save an HTML page as an image?
This is not easy; as NullUserException says in his comment, you would need to render the HTML page on the server side, which is not something PHP (or any other server-side language) has built in.
The approach that comes to mind would be to write a program (probably not in PHP, but rather something like C# or C++) that runs on your server, fires up a web browser, and does a series of screen captures (possibly combined with page scrolls). As this is a very nontrivial and bug-prone process, I would suggest looking into third-party components that are capable of doing this.
You would then execute this program from PHP, and when it's done running, display the results from the file it output.
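As one concrete (and purely illustrative) way to wire that up, you could shell out to an existing command-line renderer such as wkhtmltoimage rather than writing your own capture program -- this is my assumption of a workable setup, not something the above requires:

    <?php
    // Hypothetical: render a URL to a PNG with wkhtmltoimage, then serve the file.
    $url = 'http://example.com/';       // placeholder
    $out = '/tmp/screenshot.png';

    exec('wkhtmltoimage ' . escapeshellarg($url) . ' ' . escapeshellarg($out), $output, $status);

    if ($status === 0 && is_file($out)) {
        header('Content-Type: image/png');
        readfile($out);
    } else {
        echo 'Rendering failed';
    }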
I would advise you to use an external service with an api. This list might be a good start: http://blogs.sitepoint.com/2008/07/10/9-ways-to-put-site-screenshots-in-your-web-app/
Thumbalizr seems great; they also provide a PHP script so you can cache the images locally:
http://www.thumbalizr.com/apitools.php
Try taking a look at browsershots.org - source code is available for it if you want to install it locally. Essentially it uses a browser to take screenshots, and can be controlled via an XML-RPC interface, which you can call from PHP.
As others have said this is not a simple job, and not something you can do directly in PHP, so use an external service.
(I'm not affiliated with browsershots.org in any way)
I have a PHP script that parses an RSS feed and gives me the data in a known pattern. I'm very new to ASP, JavaScript and jQuery, so I don't have any idea how to auto-update the script and display the new data with a smooth animation (see this example; that's exactly what I want). Thanks for the support, and if you know a good script for this I would appreciate it.
Seems like you're looking for this:
http://leftlogic.com/lounge/articles/jquery_spy2/
It's PHP (not ASP), so that might be an issue, though the code is SUPER easy to implement (I've written my own implementation on three separate occasions).
The site itself has some decent documentation on getting things up and running, but if you need some extra help, comment and I'll point you in the right direction :)
Good luck!
The resources people have linked here are helpful and merely mentioning jQuery means you're probably headed in the right direction. But if you're new to this it might still be worth mentioning some of the concepts you'll be looking to play with here.
First of all, you'll probably want to stick with one language on the client side and one on the server side. This means choosing either PHP or ASP -- this isn't clear from your question but I'll assume you're dealing with PHP since that's the language I use for this kind of thing. JavaScript + jQuery is the right choice for the browser (client) side of things.
Like Luca points out, you'll have to set up some JavaScript code that goes live on page load and "polls" the server at a set interval. In JavaScript you do this using something called XMLHttpRequest (or "XHR"), and it's pretty complicated. You could use a combination of jQuery and a library like the one Matt points to in his answer, or just jQuery -- sample code abounds, but it's basically a loop with a function call and a sleep timer.
That function call is going to be one of the more difficult parts if you're trying to emulate the Twitter World Cup site. But here's the basic idea: you need to populate a list using jQuery and a data standard like JSON. Since the RSS feed you'll be parsing is written in XML, you'll have to write a server-side (PHP/ASP) script that fetches, parses and converts the feed to JSON. In PHP, this is best done with cURL (file_get_contents() if you're lazy), SimpleXML and json_encode(), respectively.
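Putting that server-side step together, a minimal sketch (the feed URL is a placeholder, and real feeds vary in which fields they provide):

    <?php
    // fetch.php -- fetch an RSS feed and convert its <item>s to JSON for the client side.
    $ch = curl_init('http://example.com/feed.rss');   // placeholder feed URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $xml = curl_exec($ch);
    curl_close($ch);

    $rss   = simplexml_load_string($xml);
    $items = array();
    foreach ($rss->channel->item as $item) {
        $items[] = array(
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
            'date'  => (string) $item->pubDate,
        );
    }

    header('Content-Type: application/json');
    echo json_encode($items);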
Your JavaScript should load the list based on JSON. To do this, and display any new items, what you'll do is load the JSON from the client (browser) side using a jQuery method like getJSON(). Then you spin through the array object and add any new items to the list by adding new <li> elements to the "DOM." The same jQuery code that does this can easily also do the cross dissolve with something like fadeIn().
It looks like the script on that example page runs an Ajax request every so many seconds.
You could simply have your PHP script return the RSS data (in JSON format say) and let JavaScript parse it and generate some HTML with it.
If all of this doesn't make sense to you, I advise reading a little about JavaScript and PHP... there are plenty of good books.