I'm trying to get the HTML code of a Facebook album; once I have it, I want to extract the photo links.
But as we know, the photos are not all loaded at once: more load as you scroll down.
When I use cURL, I get only the links of the photos that are loaded at first.
Is there a way to get the whole loaded HTML code through PHP?
Please bear with my English.
Thanks
EDIT: Actually I want to do the same on any page that has this kind of behavior, not only on Facebook. Thanks anyway
You will not be able to reliably fetch an album from Facebook via this method.
A small number of photos are sent with the original request, and the rest are loaded via AJAX. Using cURL, there is no way that I am aware of to simulate the events that lead to that AJAX load. You could call the original page and then subsequently call the other AJAX endpoints, but this method is likely to break the second Facebook changes anything.
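For illustration only, calling one of those endpoints with cURL would look roughly like this; the endpoint URL and its cursor parameter are entirely hypothetical and would have to be discovered by watching the real requests in your browser's dev tools:
<?php
// Hypothetical AJAX endpoint; find the real one in your browser's dev tools.
$ch = curl_init("https://www.facebook.com/ajax/hypothetical_album_pager?cursor=25");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0"); // look like a browser
$response = curl_exec($ch);
curl_close($ch);
?>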
As ceejayoz said, you should be using the Facebook API for this sort of thing.
Please let me know: is it possible to scrape some info that is loaded via AJAX with PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for the advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the page and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick. A quick Google search turns up numerous examples of how to set it up. Here is one.
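As a rough sketch, here is what that looks like with the php-webdriver bindings (my choice here; the linked example may use a different client), assuming a Selenium server is already running:
<?php
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// Assumes a Selenium server listening on the default port 4444.
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());
$driver->get('http://www.example.com/');
// getPageSource() returns the markup after the browser has run the page's JavaScript.
$html = $driver->getPageSource();
$driver->quit();
?>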
Scraping JUST the AJAX calls
Though I wouldn't consider this scraping, you can always examine an AJAX call using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools console.
You should then see the dev tools window. Hit the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter out specific requests.
For example, if you are interested in AJAX calls, you can select the XHR filter.
You can then click on any of the listed items in the table to get more information.
file_get_contents on an AJAX call
Depending on how robust the APIs behind these AJAX calls are, you could do something like the following.
<?php
// Call the AJAX endpoint directly, just as the browser would.
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the response is JSON, then add:
$data = json_decode($content);
However, you are going to have to do this for each AJAX request on a site. Beyond that, you will have to use a solution similar to the ones presented [here].
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific AJAX calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses AJAX to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM.
Finally, I worked around my problem. I just took the POST URL with all its parameters from the AJAX call and made the same request using the SIMPLE_HTML_DOM class.
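A minimal sketch of that workaround; the endpoint URL and POST parameters are placeholders for the ones copied out of the dev tools Network tab:
<?php
include 'simple_html_dom.php';

// Replay the POST request that the page's AJAX call makes.
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => http_build_query(array('page' => 2, 'view' => 'scroll')),
    ),
));
$response = file_get_contents("http://www.example.com/ajax/endpoint", false, $context);

// Parse the returned fragment with SIMPLE_HTML_DOM as usual.
$html = str_get_html($response);
foreach ($html->find('img') as $img) {
    echo $img->src . "\n";
}
?>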
I am creating a web app in PHP, and I am loading content through AJAX-based requests.
When I click on a hyperlink, the corresponding page gets fetched through AJAX and the current content is replaced by the fetched page.
Now the issue is, I need a physical href so that I can implement Facebook "like" functionality and also maintain the browser history. I cannot do an old-school postback to the PHP page, as I am doing a transition animation in which the current page slides away and the new page slides in.
Is there a way I can keep the animation and still have a valid physical href and history?
The design of the application is as follows:
The app grabs an RSS feed.
It creates the DOM for those RSS items.
Upon clicking any headline, the page animates and takes you to the full story of that item.
I need to create a "like" button on the full story page, but I don't have a valid URL.
While Alexander's answer works great on the client side, Facebook's linter tool does not run JavaScript, so it will get the old content. Neither of the two links provides a solution to this.
What amit needs to implement is server-side parsing of the URL. See http://www.php.net/manual/en/function.parse-url.php. The fragment is what the server sees as the hash tag value. In your PHP code, render the correct og: tags based upon the fragment.
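A minimal sketch of that idea, assuming you have the full URL (fragment included) available as a string; the URL shape is hypothetical, and lookup_story() is a hypothetical helper standing in for your own data access:
<?php
// Hypothetical URL shape; the fragment identifies the story.
$url = "http://yourdomain.com/reader#story-42";
$parts = parse_url($url);
$fragment = isset($parts['fragment']) ? $parts['fragment'] : '';

// lookup_story() is a hypothetical helper that loads the article data.
$story = lookup_story($fragment);

echo '<meta property="og:title" content="' . htmlspecialchars($story['title']) . '" />';
echo '<meta property="og:url" content="' . htmlspecialchars($url) . '" />';
?>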
Firstly, if you need a URL for facebook then think up a structure that gives you one, such that your server-side code will load the correct page when given that URL. This could be something like http://yourdomain.com/page.php?feed=<feedname>&link=<linknumber>, which would allow you to check the parameters using the PHP $_GET array. If you don't have the parameters then load the index page; if you do then load the relevant article.
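A minimal sketch of that routing, using the hypothetical page.php?feed=<feedname>&link=<linknumber> structure above; render_article() and render_index() are hypothetical helpers standing in for your own page rendering:
<?php
if (isset($_GET['feed'], $_GET['link'])) {
    // Parameters present: load the relevant article.
    render_article($_GET['feed'], (int) $_GET['link']);
} else {
    // No parameters: load the index page.
    render_index();
}
?>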
Secondly, use something like history.js to give you cross-browser support for the HTML5 pushState() functionality so that you can set the page URL when you do the AJAX call, without requiring the browser to do a full reload.
You have to implement hash navigation.
Here is a short tutorial.
Here is a more conceptual introduction.
If you're using jQuery, I can recommend BBQ for hash navigation:
http://benalman.com/projects/jquery-bbq-plugin/
This actually sounds pretty straightforward to me.
You have the URLs as usual; using the hash (#), you can extract the info on both the client and the server side.
There is only one thing missing: on the server side, before you return the content, check the user agent string and compare it to the Facebook bot's (if I'm not mistaken, it's something like "facebookexternalhit"). If it turns out to be the Facebook bot, return whatever describes the URL for a like/share (Open Graph meta data); for any other user agent string, return the content as usual.
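A minimal sketch of that check; the og: values are placeholders for your own story data:
<?php
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (stripos($agent, 'facebookexternalhit') !== false) {
    // Facebook's crawler: return only the Open Graph meta data for this URL.
    echo '<html><head>';
    echo '<meta property="og:title" content="Story title here" />';
    echo '<meta property="og:description" content="Story summary here" />';
    echo '</head><body></body></html>';
    exit;
}

// Any other user agent: fall through and return the content as usual.
?>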
I want to fire Google Analytics code after people sign up through an AJAX contact form.
If I put the code in the PHP file that gets hit by the AJAX POST request, do you think it will trigger the JavaScript when it runs?
In theory, you could create a results page that contains your Analytics code. On getting results back from the AJAX call, dynamically create an iframe that loads that results page. Set the iframe's display attribute to none (e.g. style="display:none;").
This should make the results page show up in your Analytics stats without visitors to the site seeing it.
Edited to add: Google's official approach is documented here: https://support.google.com/analytics/bin/answer.py?hl=en&answer=1033068
It's less of a hack than the above :-)
I'm new to YQL, and just trying to learn how to do some fairly simple tasks.
Let's say I have a list of URLs and I want to get their HTML source as a string in JavaScript (so I can later insert it into a database via AJAX). How would I go about getting this info back in JavaScript? Or would I have to do it in PHP? I'm fine with either, really; whatever works.
Here's an example query I'd run on their console:
select * from html where url="http://en.wikipedia.org/wiki/Baroque_music"
And the goal is essentially to save the HTML, or maybe just the text, as a string.
How would I go about doing this? I somewhat understand how the querying works, but not really how to integrate it with JavaScript and/or PHP (say I have a list of URLs and I want to loop through them, getting the HTML at each one and saving it somewhere).
Thanks.
You can't read other pages with JavaScript due to a built-in security feature in web browsers called the same-origin policy.
The usual method is to scrape the content of these sites from the server using PHP.
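For example, here is a sketch of looping over your URL list and running the YQL query above from PHP; this assumes YQL's public REST endpoint and that the response shape matches what the console shows:
<?php
$urls = array(
    "http://en.wikipedia.org/wiki/Baroque_music",
);

foreach ($urls as $url) {
    $yql      = 'select * from html where url="' . $url . '"';
    $endpoint = "http://query.yahooapis.com/v1/public/yql?q=" . urlencode($yql) . "&format=json";

    $data = json_decode(file_get_contents($endpoint), true);
    // The scraped markup lives under query->results; save it wherever you like.
    $results = isset($data['query']['results']) ? $data['query']['results'] : null;
}
?>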
There is another option with JavaScript called a bookmarklet.
You add the bookmarklet to your bookmarks bar, and each time you want the content of a site, you click the bookmark.
A script is then loaded into the host page; it can read the content and post it back to your server.
Oddly enough, the same-origin policy does not prevent you from POSTing data from this host page to your domain. You need to POST a form to an iframe whose source is hosted on your domain.
You won't be able to read the response you get back from the POST.
But you can poll with setInterval, making a JSONP call to your domain, to find out whether the POST was successful.
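On your domain, the polling endpoint could be a small JSONP responder along these lines; was_post_received() and the id parameter are hypothetical, standing in for however you record that the form data arrived:
<?php
// In production, validate the callback name before echoing it back.
$callback = isset($_GET['callback']) ? $_GET['callback'] : 'callback';
$status   = array('received' => was_post_received($_GET['id']));

header('Content-Type: application/javascript');
// JSONP: wrap the JSON payload in the caller-supplied function name.
echo $callback . '(' . json_encode($status) . ');';
?>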
For one project, I need to get the source of a Facebook page (the HTML) via a PHP application.
I have tried lots of methods like cURL, file_get_contents, changing my ini_set, etc., but Facebook never lets me get the resulting HTML file.
Can anyone help?
For example, for this page:
// Forward the visitor's user agent so the request looks like it comes from a browser.
ini_set('user_agent', $_SERVER['HTTP_USER_AGENT']);
$data = file_get_contents("http://apps.facebook.com/is_cool/?cafe_action=album&view=scroll", false);
print strip_tags($data);
Thanks a lot.
Damien
Comment 1:
- I need to create 2 applications, and I want to parse the HTML code to get some information from one into the other. I don't want to duplicate or take the Facebook code. I just want to do a "view source" (like in IE or Firefox) and put it in a file, without asking my users. When a user is logged in to my first application, I just want to use their credentials to get the other content.
The reason you're having problems is that the majority of the Facebook page's content is loaded via AJAX. The data is not hardcoded into what your browser renders.
You should think of a different way to accomplish your goals. If you tell us a little more about what you're trying to do, we can probably help you find an alternate method.