How is it possible to grab the web page source from an AJAX-type web page? curl doesn't seem to be able to get the AJAX-generated source.
Sorry if this is a duplicate, but looking through the existing questions I didn't find an answer.
If the page you want to grab uses AJAX to compose different parts of itself, then the content does not exist until all that loading is done.
You can't do this with curl alone: curl acts as a client requesting only the URL you give it, and it has no JavaScript engine to interpret the scripts and load the other parts of the page.
If the content you are looking for is in one of the parts loaded through AJAX, open the Chrome inspector -> Network tab, find the exact URL of the loaded resource, then fetch that URL with curl.
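For instance, once the Network tab reveals the endpoint, a minimal PHP sketch might look like the following. The URL, the query parameters, and the X-Requested-With header are all assumptions; adjust them to whatever the inspector actually shows.

```php
<?php
// Hypothetical AJAX endpoint as discovered in the Network tab.
function buildAjaxUrl(string $base, array $params): string {
    return $base . '?' . http_build_query($params);
}

// Fetch it directly with cURL; some endpoints check for the
// X-Requested-With header that browsers send on XHR requests.
function fetchAjax(string $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['X-Requested-With: XMLHttpRequest']);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body; // false on failure
}

$url = buildAjaxUrl('http://www.example.com/data.php', ['page' => 1]);
// $html = fetchAjax($url);
```

The network call itself is left commented out; the point is only that once you know the request the browser makes, you can rebuild it yourself.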
Related
Please let me know: is it possible to scrape some info after it is loaded via AJAX, using PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for any advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick; a quick Google search turns up numerous examples of how to set it up.
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an AJAX call using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools console.
You should then see the dev tools window. Hit the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site, and you can filter for specific requests.
For example, if you are interested in AJAX calls you can select XHR.
You can then click on any of the listed items in the table to get more information.
file_get_contents on an AJAX call
Depending on how robust the API behind these AJAX calls is, you could do something like the following.
<?php
// Hit the AJAX endpoint directly; file_get_contents returns false on failure.
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
if ($content === false) {
    die("Request to the AJAX endpoint failed");
}
?>
If the response is JSON, then add:
$data = json_decode($content);
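A slightly more defensive sketch: json_decode just returns null on malformed input, so it pays to check json_last_error. The response string here is a made-up sample standing in for whatever the endpoint returned.

```php
<?php
// Sample AJAX response body; in practice this comes from file_get_contents.
$content = '{"items":[{"id":1,"name":"first"}]}';

// Decode to associative arrays and check for malformed JSON explicitly.
$data = json_decode($content, true);
if (json_last_error() !== JSON_ERROR_NONE) {
    die('Bad JSON: ' . json_last_error_msg());
}
echo $data['items'][0]['name']; // first
```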
However, you are going to have to do this for each AJAX request on a site. Beyond that you are going to have to use a solution similar to the ones presented [here].
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific AJAX calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses AJAX to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM.
Finally, I worked around my problem. I just took the POST URL, with all its parameters, from the AJAX call and made the same request using the SIMPLE_HTML_DOM class.
I am getting the source code for a particular web page, but I am not getting some dynamic content.
Is there any way to wait until the page loads and then get the source code?
I need a solution in PHP.
I am getting the source code for a particular web page, but I am not getting some dynamic content.
Dynamic content is mostly loaded only after an AJAX call is made from the page. If you want to get that data using curl, you should inspect the network call being made from the page and replicate that call in curl.
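If the call turns out to be a POST, replicating it from PHP's cURL might look like the sketch below. The URL, the form fields, and the X-Requested-With header are placeholders for whatever the inspector actually shows.

```php
<?php
// Build the cURL options that mimic the browser's XHR POST.
function ajaxPostOptions(string $url, array $fields): array {
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        // Many endpoints check for this header before responding.
        CURLOPT_HTTPHEADER     => ['X-Requested-With: XMLHttpRequest'],
    ];
}

// Usage (the network call is commented out so the sketch stays self-contained):
$opts = ajaxPostOptions('http://www.example.com/load.php', ['page' => 2]);
// $ch = curl_init();
// curl_setopt_array($ch, $opts);
// $response = curl_exec($ch);
// curl_close($ch);
```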
Is there any way to wait until the page loads and then get the source code?
Even if you get the full source code of the page via curl, you won't get the dynamically loaded content.
Alternatively, you can use tools like Selenium to get that data.
I'm looking to get information from an external website by taking it from a div in their code. But using the file_get_contents() method doesn't work, because the information isn't in the source code for the page; it only shows up after the page loads (it's available if you use inspect element in the web browser).
Is there a way to do this, or am I just out of luck?
If you look at a page like this: http://www.fieg.nl/ias-demo#/
You can see that it uses Ajax to dynamically add content when you scroll to the bottom, similar to how Google Images works.
If there was a page like this that I wanted to capture for parsing, I would do something like:
$page = file_get_contents("http://www.fieg.nl/ias-demo#/");
But this only gets everything that initially loads, before any AJAX happens. Is it possible to use PHP, cURL, or any other tool to capture the entire page, automatically loading the AJAX content and capturing that as well?
Also, if there happens to be a page that never stops loading things and literally goes on forever, I'm not sure how the tool would handle that, because it would never find the end of the DOM in that situation.
Those pages work by sending an AJAX request to fetch more data as the user scrolls towards the bottom of the page. The JavaScript then writes the response of the AJAX request into the bottom of the page.
You need to run a tool like Firebug to analyse the requests that are made to the server to retrieve the next page of content (using the Net panel). Once you have found the request URL, you need to emulate those requests in your PHP script.
Unfortunately, SO is not the place for people to write your scripts to spider websites, but that is the theory anyway.
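As a sketch of that theory only: the paging loop below takes the fetch step as a callable, so the stopping condition is visible (and testable) without any real site. The "empty response means done" rule and the hard page cap (which guards against a site that literally never stops serving content) are assumptions about how such endpoints usually behave.

```php
<?php
// Keep requesting successive "pages" of an infinite-scroll endpoint
// until it returns nothing, with a hard cap so a site that never
// stops serving content can't loop forever.
function fetchAllPages(callable $fetchPage, int $maxPages = 50): array {
    $items = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $batch = $fetchPage($page);
        if (empty($batch)) {
            break; // the site signalled "no more content"
        }
        $items = array_merge($items, $batch);
    }
    return $items;
}

// A stand-in fetcher; a real one would cURL the discovered request URL.
$fake = function (int $page): array {
    return $page <= 3 ? ["item$page"] : [];
};
$all = fetchAllPages($fake);
```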
I'm new to YQL, and just trying to learn how to do some fairly simple tasks.
Let's say I have a list of URLs and I want to get their HTML source as a string in JavaScript (so I can later insert it into a database via AJAX). How would I go about getting this info back in JavaScript? Or would I have to do it in PHP? I'm fine with either, really; whatever works.
Here's the example query I'd run on their console:
select * from html where url="http://en.wikipedia.org/wiki/Baroque_music"
And the goal is to essentially save the HTML or maybe just the text or something, as a string.
How would I go about doing this? I somewhat understand how the querying works, but not really how to integrate it with JavaScript and/or PHP (say I have a list of URLs and I want to loop through them, getting the HTML at each one and saving it somewhere).
Thanks.
You can't read other pages with JavaScript due to a built-in security feature in web browsers called the same-origin policy.
The usual method is to scrape the content of these sites from the server using PHP.
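A minimal server-side sketch with PHP's DOMDocument, pulling one element out of fetched markup; the HTML string here is a made-up stand-in for whatever file_get_contents or cURL returned.

```php
<?php
// Pull one element out of fetched markup by its id attribute.
function extractDivById(string $html, string $id): ?string {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate real-world, imperfect HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    $div = $doc->getElementById($id);
    return $div ? trim($doc->saveHTML($div)) : null;
}

$html  = '<html><body><div id="price">42</div></body></html>';
$price = extractDivById($html, 'price');
```

DOMDocument::loadHTML marks id attributes as IDs, which is why getElementById works here without any extra setup.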
There is another option with JavaScript, called a bookmarklet.
You can add the bookmarklet to your bookmarks bar, and each time you want the content of a site, click the bookmark.
A script will be loaded into the host page; it can read the content and post it back to your server.
Oddly enough, the same-origin policy does not prevent you from POSTing data from this host page to your domain: you POST a form to an iframe whose source is hosted on your domain.
You won't be able to read the response you get back from the POST, but you can poll with setInterval, making a JSONP call to your domain, to find out whether the POST was successful.