I have to scrape this page using PHP cURL. When the user scrolls down, more items are loaded via AJAX. Can I call the URL that the AJAX script is calling? If so, how do I figure out that URL? I know a bit of AJAX, but the code there is rather complex for me.
Here is the relevant JS code (pastebin).
Alternatively, can someone suggest another method of scraping that page? PS: I'm doing this for a good cause.
Edit: I figured it out using Live HTTP Headers. Question can be closed. Downvoted to oblivion.
You can use Firebug for that. Switch to the Console tab and then trigger the AJAX request in the page.
This is what should see after scrolling to the bottom of the page: http://www.flipkart.com/computers/components/ram-20214?_l=m56QC%20tQahyMi46nTirnSA--&_r=11FxOYiYfpMxmANj4kGJzg--&_pop=flyout&response-type=json&inf-start=20
and if you scroll further: http://www.flipkart.com/computers/components/ram-20214?_l=m56QC%20tQahyMi46nTirnSA--&_r=11FxOYiYfpMxmANj4kGJzg--&_pop=flyout&response-type=json&inf-start=40
The tokens always seem to stay the same: _l=m56QC%20tQahyMi46nTirnSA-- and _r=11FxOYiYfpMxmANj4kGJzg--, as does the _pop parameter (_pop=flyout). So let's have a look at the other parameters:
This one was for the main page:
//no additional parameters...
this one for the first 'reload':
&response-type=json&inf-start=20
and this one for the second 'reload':
&response-type=json&inf-start=40
So, apparently you just have to append &response-type=json&inf-start=$offset to your initial URI to get the results in JSON format. You can also inspect the responses in Firebug, which should make them very easy to work with.
Here's a screenshot:
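Putting the observations above together, here is a minimal PHP sketch that builds the paginated JSON URLs. The tokens are copied verbatim from the requests shown above and may well have expired by the time you read this, so treat the whole URL as an example:

```php
<?php
// Sketch only: builds the JSON page URLs observed above. The _l/_r
// tokens were copied from the answer and may have expired.
function flipkart_page_url($offset) {
    $base = 'http://www.flipkart.com/computers/components/ram-20214'
          . '?_l=m56QC%20tQahyMi46nTirnSA--&_r=11FxOYiYfpMxmANj4kGJzg--'
          . '&_pop=flyout';
    return $base . '&response-type=json&inf-start=' . $offset;
}

// Fetch pages at offsets 20, 40, 60, ... until a request fails.
// (Left commented out so the sketch runs without network access.)
// for ($offset = 20; $offset <= 100; $offset += 20) {
//     $json = file_get_contents(flipkart_page_url($offset));
//     if ($json === false) break; // blocked, offline, or URL expired
//     $data = json_decode($json, true);
//     // ... process $data ...
// }
```

The loop just mirrors what the page's infinite scroll does: same base URI, offset stepped by 20.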
Please let me know: is it possible to scrape some info that is loaded via AJAX, using PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for any advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick. A quick Google search turns up numerous examples of how to set it up. Here is one.
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an AJAX call using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools.
You should then see the dev tools window. Open the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter out specific requests.
For example, if you are interested in AJAX calls, you can select the XHR filter.
You can then click on any of the listed items in the tabled section to get more information.
Using file_get_contents on an AJAX call
Depending on how robust the APIs behind these AJAX calls are, you could do something like the following.
<?php
// Hypothetical AJAX endpoint; replace with the real URL of the call.
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the response is JSON, then add
$data = json_decode($content);
However, you are going to have to do this for each AJAX request on the site. Beyond that, you are going to have to use a solution similar to the ones presented [here].
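If you need more control than file_get_contents gives you (headers, redirects, or a server where allow_url_fopen is disabled), the same fetch can be done with cURL. A minimal sketch, reusing the placeholder URL from above:

```php
<?php
// Sketch: fetch one AJAX endpoint with cURL and decode a JSON reply.
// The URL passed in is whatever endpoint you found in the dev tools.
function fetch_json($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    curl_close($ch);
    // null signals a failed request; otherwise a decoded array (or null
    // again if the body wasn't valid JSON).
    return ($body === false) ? null : json_decode($body, true);
}

// $data = fetch_json('http://www.example.com/test.php?ajax=call');
```

As above, you would call this once per AJAX endpoint you discovered in the Network tab.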
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific AJAX calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses AJAX to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM.
Finally, I worked around my problem. I just took the POST URL with all its parameters from the AJAX call and made the same request using the SIMPLE_HTML_DOM class.
As many of you have noticed, when you click a link to switch from page to page on Google+ or Facebook, the URL changes and the body changes, but some parts of the page don't, like the chat box. I believe AJAX can change a specific part of the page by requesting a PHP page from the server and getting some results, but that won't change the URL.
Actually, I didn't know exactly what to search for on Google, so any keywords/names/links would be greatly appreciated.
I'm using the jQuery library for JavaScript and the Symfony2 framework for PHP, if that helps.
Look at the jQuery load method.
http://api.jquery.com/load/
All you need to do is use a selector (note the # for an id):
$('#mydiv').load('newcontent.html');
Very powerful function. Look it up!
edit:
Sorry, I missed that URL change. The trick a lot of the time with the URL is the hash. If you look closely at the URL there will be a "#" pound symbol in there somewhere. This allows the site to store its current state without a reload.
Currently there is no way to change the URL in the browser, save for the bit after the hash, without fully reloading the page.
You can either use an iframe or AJAX to keep part of your page static. To change the URL, you can use the hash hack:
window.location.hash = "pageidentifier"
or you can use the HTML5 trick (history.pushState) described in the URL provided by arxanas.
For one project, I need to get the Facebook source page (the HTML) via a PHP application.
I have tried lots of methods like cURL, file_get_contents, changing my ini_set, etc., but Facebook never lets me get the HTML result.
Can anyone help?
For example, this page:
ini_set('user_agent', $_SERVER['HTTP_USER_AGENT']);
$data = file_get_contents("http://apps.facebook.com/is_cool/?cafe_action=album&view=scroll", 0);
print strip_tags($data, "");
Thanks a lot.
Damien
Comment 1:
- I need to create two applications. I want to parse the HTML code to get some information from one to the other. I don't want to duplicate or take the Facebook code. I just want to do a "view source" (like IE or Firefox) and put it in a file, without asking my users. When a user is logged in to my first application, I just want to use their credentials to get the other content.
The reason you're having problems is that the majority of the facebook homepage content is loaded via AJAX. The data is not hardcoded into what your browser renders.
You should think of a different way to accomplish your goals. If you tell us a little more about what you're trying to do, we can probably help you find an alternate method.
I have to load an HTML file using jQuery. When I searched Google, I got this snippet:
$("#feeds").load("feeds.html");
But I don't want to load the content of feeds.html into the #feeds element. Instead I need to load that page entirely. How do I load that page? Please help.
If you're not wanting to load() some HTML into an element on the existing page, maybe you mean that you want to redirect to another page?
url = "feeds.html";
window.location = url;
Or maybe you just want to fill an entire body? You could load() into the body tag if you wanted.
$("body").load("feeds.html");
$.get()
Here's a summary:
Load a remote page using an HTTP GET request. This is an easy way to send a simple GET request to a server without having to use the more complex $.ajax function. It allows a single callback function to be specified that will be executed when the request is complete (and only if the response has a successful response code). If you need to have both error and success callbacks, you may want to use $.ajax. $.get() returns the XMLHttpRequest that it creates. In most cases you won't need that object to manipulate directly, but it is available if you need to abort the request manually.
Are you sure feeds.html is located under the same domain as your script?
If so, that line is OK, and you should look for the problem elsewhere. Try debugging it with Firebug's Net panel.
If not, you can only send JSON(P) requests to URLs on other domains. But you can easily write a proxy script in PHP (I noticed the php tag on your question :) ) to load the content with cURL. Take a look at point 9 of the CatsWhoCode tutorial; hope it helps.
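The proxy idea above can be sketched in a few lines. The browser requests this PHP script on your own domain; the script fetches the remote page server-side with cURL and echoes it back, so the content looks same-origin to your JavaScript. The target URL here is a made-up placeholder; hard-code or whitelist the target rather than proxying arbitrary user-supplied URLs:

```php
<?php
// Minimal same-origin proxy sketch (e.g. proxy.php on your domain).
// 'remote.example.com/feeds.html' is a hypothetical target.
function proxy_fetch($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    curl_close($ch);
    return $body; // false on failure
}

// echo proxy_fetch('http://remote.example.com/feeds.html');
```

Client-side you would then call $("#feeds").load("proxy.php") against your own domain.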
Hi, I'm using AJAX to load all the pages into the main page, but I'm not able to control refresh: if somebody refreshes, the page returns to the main page. Can anybody give me any solutions? I would really appreciate the help.
You could add an anchor (#something) to your URL and change it, on every AJAX event, to something you can decode back into a particular page state.
Then in body.onload, check the anchor and decode it to that state.
The back button (at least in Firefox) will work correctly too. If you want the back button to work in IE6, you have to add some iframe magic.
Check the various JavaScript libraries designed to support the back button or history in an AJAX environment; this is probably what you really need. For example, the jQuery history plugin.
You can rewrite the current url so it gives pointers to where the user was - see Facebook for examples of this.
I always store the 'current' state in the PHP session.
That way, the user can refresh at any time and the page will still be the same.
if somebody refreshes the page returns back to the main page can anybody give me any solutions
This is a feature, not a bug in the browser. You need to change the URL for different pages. Nothing is worse than websites that use some kind of magic, either on the client side or the server side, that causes a bunch of completely different pages to share the same URL. Why? How am I going to link to a specific page? What if I like something and want to copy and paste the URL into an IM window?
In other words, consider the use cases. What constitutes a "page"? For example, if you have a website for stock quotes, should each stock have a unique URL? Yes. Should you have a unique URL for every variation you can make to the graph (logarithmic vs. linear, etc.)? It depends; if you don't, at least provide a "share this" link like Google Maps does, so there is some kind of URL you can share.
That all said, I agree with the suggestion to mess with the #anchor and parse it out. Probably the most elegant solution.