How to extract images from a webpage as Facebook does? - php

If I post a link like this on my wall:
http://blog.bonsai.tv/news/il-nuovo-vezzo-della-lega-nord-favorire-i-lombardi-alluniversita/
then Facebook extracts the image belonging to the post content, not simply the first image on the webpage (not the site logo or other small images, for example).
How does Facebook do that?

Hm, impossible to say without more information about the algorithm they use.
However, looking at the page's source code, you can see that while the image of Bossi is not the first image on the page, it is the first one inside the divs "page_content" and "post_content". Maybe Facebook knows the HTML IDs that the blogging system (WordPress in this case) uses, and uses them to find the first image that is actually part of the page content.
That would actually be a good idea, and is essentially an implementation of the "semantic web"...
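A rough sketch of that guessed heuristic in PHP (this is speculation about Facebook's behaviour, not its actual code; the container ID is taken from the blog above):

<?php
// Guessed heuristic: take the first <img> inside a known content
// container such as id="page_content". Not Facebook's actual code.
$html = file_get_contents('http://blog.bonsai.tv/news/il-nuovo-vezzo-della-lega-nord-favorire-i-lombardi-alluniversita/');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$imgs  = $xpath->query('//div[@id="page_content"]//img');

if ($imgs->length > 0) {
    echo $imgs->item(0)->getAttribute('src');
}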

As others have said, we have no idea how Facebook decides what to choose in the absence of any relevant metadata (though Sleske's guesses seem reasonable; I'd also guess that they look at the first big image). You can avoid that guesswork entirely by going the correct route and simply giving Facebook (and similar services) additional metadata about your page using Open Graph Protocol tags. For example, if you want to specify a particular image to use for a Facebook Like, you'd include this in your head tag:
<meta property="og:image" content="<your image URL>" />
OGP is also used by LinkedIn, Google+ and many others.
If you're on WordPress you can control these tags with an Open Graph plugin; other systems can add them manually or via their own plugins.
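As a rough sketch, a PHP page template might emit a fuller set of OGP tags like this (the og:* property names are standard; the variable values are placeholders):

<?php
// Hypothetical values; replace with your page's real data.
$pageTitle = 'My Article';
$pageUrl   = 'http://www.example.com/article.php?id=42';
$imageUrl  = 'http://www.example.com/images/preview.jpg';
?>
<head>
  <meta property="og:title" content="<?php echo htmlspecialchars($pageTitle); ?>" />
  <meta property="og:type"  content="article" />
  <meta property="og:url"   content="<?php echo htmlspecialchars($pageUrl); ?>" />
  <meta property="og:image" content="<?php echo htmlspecialchars($imageUrl); ?>" />
</head>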

I can imagine that the Facebook crawler can identify the actual content part of a page and select an image from it. Safari's Reader feature uses similar content-detection logic. It probably helps that the software used is WordPress, the most popular blogging software: it's a quick win for Facebook to add specific support for it.

My guess is that Facebook has built algorithms for distinguishing the actual content from the other data in an HTML page. The page you provided is an easy case, since the HTML element that contains the page content has id="page_content", which is self-explanatory.

Related

AJAX page fetch design requires a physical address

I am creating a web app in PHP that loads content through AJAX requests.
When I click on a hyperlink, the corresponding page is fetched through AJAX and the current content is replaced by the fetched page.
The issue is that I need a physical href so that I can implement Facebook Like functionality and also maintain browser history. I cannot do an old-school postback to the PHP page, as I am doing a transition animation in which the current page slides away and the new page slides in.
Is there a way I can keep the animation and still have a valid physical href and history?
The design of the application is as follows:
the app grabs an RSS feed;
it creates the DOM for those RSS items;
upon clicking any headline, the page animates and takes you to the full story of that item.
I need to create a "Like" button on the full-story page, but I don't have a valid URL.
While Alexander's answer works great on the client side, Facebook's linter tool does not run JavaScript, so it will get the old content. Neither of the two links provides a solution to this.
What amit needs to implement is server-side parsing of the URL; see http://www.php.net/manual/en/function.parse-url.php. The fragment component is the part after the hash. In your PHP code, render the correct og: tags based upon the fragment.
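A minimal sketch of parse_url() (the URL is a placeholder; note that browsers do not send the fragment in HTTP requests, so you only get it from a URL string you already have, e.g. one passed along explicitly):

<?php
$url = 'http://example.com/app.php#story/42';   // hypothetical URL

$parts = parse_url($url);
echo $parts['fragment'];                        // prints "story/42"
// Use the fragment to decide which og: tags to render.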
Firstly, if you need a URL for Facebook, think up a structure that gives you one, such that your server-side code will load the correct page when given that URL. This could be something like http://yourdomain.com/page.php?feed=<feedname>&link=<linknumber>, which would allow you to check the parameters using the PHP $_GET array. If you don't have the parameters, load the index page; if you do, load the relevant article.
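A minimal routing sketch under those assumptions (page.php, the parameter names and the render functions are hypothetical):

<?php
// page.php -- hypothetical router; renderArticle()/renderIndex()
// are placeholders for your own output code.
if (isset($_GET['feed'], $_GET['link'])) {
    renderArticle($_GET['feed'], (int) $_GET['link']);  // specific article
} else {
    renderIndex();                                      // no parameters: index page
}

function renderArticle($feed, $link)
{
    echo 'Full story ' . (int) $link . ' from feed ' . htmlspecialchars($feed);
}

function renderIndex()
{
    echo 'Feed index page';
}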
Secondly, use something like history.js to get cross-browser support for the HTML5 pushState() functionality, so that you can set the page URL when you do the AJAX call without requiring the browser to do a full reload.
You have to implement hash navigation.
Here is a short tutorial.
Here is a more conceptual introduction.
If you're using jQuery, I can recommend BBQ for hash navigation:
http://benalman.com/projects/jquery-bbq-plugin/
This actually sounds pretty straightforward to me.
You have the URLs as usual; using the hash (#) you can extract the info on both the client and the server side.
There is only one thing missing: on the server side, before you return the content, check the user-agent string and compare it to the Facebook bot's (if I'm not mistaken, it's something like "facebookexternalhit"). If it turns out to be the Facebook bot, return whatever describes the URL for a Like/share (Open Graph metadata); for any other user-agent string, return the content as usual.
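A minimal sketch of that check, assuming the crawler still identifies itself with "facebookexternalhit" in its User-Agent header (the og: values are placeholders):

<?php
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (stripos($ua, 'facebookexternalhit') !== false) {
    // Facebook's crawler: serve only the Open Graph metadata.
    echo '<head>';
    echo '<meta property="og:title" content="Story title" />';
    echo '<meta property="og:image" content="http://example.com/thumb.jpg" />';
    echo '</head>';
} else {
    // A normal visitor: serve the regular AJAX-driven page.
    include 'index.php';   // hypothetical entry point
}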

Facebook link inspector

I'm building a website and am looking for a way to implement a certain feature that Facebook has. The feature I'm looking for is the link inspector; I'm not sure that's what it's actually called. It's best I give you an example so you know exactly what I mean.
When you post a link on Facebook, for example a link to a YouTube video (or any other website for that matter), Facebook automatically inspects the page it leads to and imports information like the page title, favicon, and some other images, and then adds them to your post as a way of giving (what I think is) a brief preview of the page to anyone reading the post.
I already have a feature that allows users to share a link (or URLs). What I want is to do something useful with the URL, to display something other than just a plain link, giving someone viewing the shared link (in the form of a post) some useful insight into the page it leads to.
What I'm looking for is a script, or tutorial, or at the very least someone to point me in the right direction, so that I can accomplish this (using PHP preferably).
I've tried googling it, but I don't know exactly what such a feature would be called, and Google isn't helpful when you don't know exactly what you're looking for.
I figure someone out there, in this vast knowledge basket called Stack Overflow, can help me with this. Can anyone help me?
You would first scan the posted text for URLs using a regex, then fetch and parse the pages those links reference with PHP's DOMDocument. You can use the parsed document to obtain any information you need from the webpage.
DOMDocument:
http://php.net/manual/en/class.domdocument.php
DOMDocument->load (loads an XML file; for an HTML webpage, loadHTMLFile() or loadHTML() is usually the better fit):
http://php.net/manual/en/domdocument.load.php
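A minimal sketch of that pipeline (the post text is a placeholder; loadHTMLFile() fetching a remote URL assumes allow_url_fopen is enabled):

<?php
// 1) Scan the posted text for URLs with a deliberately simple regex.
$post = 'Check this out: http://www.example.com/article.html';
preg_match_all('#https?://[^\s"<>]+#i', $post, $matches);

foreach ($matches[0] as $url) {
    // 2) Fetch and parse the referenced page.
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate messy HTML
    $doc->loadHTMLFile($url);
    libxml_clear_errors();

    // 3) Pull out whatever you need, e.g. the title and all images.
    $titleNodes = $doc->getElementsByTagName('title');
    $title = $titleNodes->length ? $titleNodes->item(0)->textContent : '';

    $images = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $images[] = $img->getAttribute('src');
    }

    echo $title . "\n";
    print_r($images);
}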
The link goes through http://www.facebook.com/l.php.
You pass a URL to it and Facebook filters it.

Paste a link and get thumbnails

I want to add a feature to my PHP/MySQL/jQuery website.
The feature is that if a user pastes a link into an input box,
the server will retrieve all representative pictures from it,
just as Facebook does.
Is there any PHP project or jQuery plugin that satisfies this demand?
There are lots of services.
Take a look at websnapr, for example, or just google it.
It is not hard to write your own from scratch.
Facebook uses the Open Graph Protocol: it retrieves the page and then looks for special meta tags that describe the images associated with it (og:image).
I guess you could write a basic HTML parser that does the same.
EDIT: Someone has already written an Open Graph parser
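A rough sketch of such a parser for the og:image case, using DOMXPath (the URL is a placeholder; error handling omitted):

<?php
$html = file_get_contents('http://www.example.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//meta[@property="og:image"]/@content');

if ($nodes->length > 0) {
    echo 'Preview image: ' . $nodes->item(0)->value;
} else {
    echo 'No og:image tag found; fall back to scanning <img> tags.';
}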

Adding a custom image to the Facebook Like feature

This may seem like a duplicate question, or one easily answered by the links above, yet I remain utterly befuddled: this seems like an easy feature to implement, yet I'm having enormous trouble adding my image and finding a solution that works. I have been scouring to no avail. The two links provided above show the only two options I have found, which are:
<meta property="og:image" content="http://www.example.com/styles/images/fb_cl_logo.png"/>
and
<link rel="canonical" href="http://www.example.com/Blogs_fb_build.php?id=<?php echo $blog; ?>"/>
Needless to say, this is not working whatsoever!
Thanks in advance.
Here are two probable reasons:
The URL to the image is invalid, or the URL is inaccessible to Facebook's crawler.
Facebook has cached the thumbnail for your page.
If Facebook has cached the results for your page, use the Facebook Debugger and enter your page's URL; this will usually break the cache (it can be good to add a cache-breaking query string if it misbehaves).
If it isn't a cache problem, you will see what information the crawler sees about your page, which will probably solve your problem.
Another thing worth mentioning: breaking the cache in the debugger doesn't always carry over to "normal" Facebook posting instantly.
Because you didn't link to the image it's hard to be sure, but is your image within the 3:1 aspect ratio that Facebook supports? Many people aren't aware of that restriction. If the image is too wide or too tall, Facebook's Debug Tool will still detect it, but it won't be rendered in the news feed.

How to create a URL extractor like Facebook share

I need to extract data from a URL:
the title, the description, and any videos or images at the given URL,
like the Facebook share button does,
like this:
http://www.facebook.com/sharer.php?u=http://www.wired.com&t=Test
Regards
Embed.ly has a nice API for exactly this purpose. Their API returns the site's oEmbed data if available; otherwise, it attempts to extract a summary of the page, as Facebook does.
Use something like cURL to get the page and then something like Simple HTML DOM to parse it and extract the elements you want.
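For the fetching step, a minimal cURL sketch (the URL and the User-Agent string are placeholders):

<?php
$ch = curl_init('http://www.wired.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'MyLinkPreviewBot/1.0');  // hypothetical UA
$html = curl_exec($ch);
curl_close($ch);
// $html can now be handed to a parser such as Simple HTML DOM or DOMDocument.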
If the web site has support for oEmbed, that's easier and more robust than scraping HTML:
oEmbed is a format for allowing an embedded representation of a URL on third party sites. The simple API allows a website to display embedded content (such as photos or videos) when a user posts a link to that resource, without having to parse the resource directly.
oEmbed is supported by sites like YouTube and Flickr.
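As a sketch, YouTube's public oEmbed endpoint can be queried like this (the video URL is just an example; title, thumbnail_url and html are standard fields of an oEmbed video response):

<?php
$video = 'http://www.youtube.com/watch?v=dQw4w9WgXcQ';
$api   = 'http://www.youtube.com/oembed?format=json&url=' . urlencode($video);

$data = json_decode(file_get_contents($api), true);

echo $data['title'] . "\n";           // video title
echo $data['thumbnail_url'] . "\n";   // preview image
echo $data['html'];                   // embeddable player markup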
I am working on a project for this issue; it is not as easy as writing an HTML parser and expecting sites to be 'semantic'. Extracting videos and finding auto-play parameters, in particular, are killers. You can check the project at http://www.embedify.me, which also has an fb-style URL preview script. As I see it, embed.ly and oEmbed are passive parsers: they need sites to support them as so-called providers, which is quite a different approach from Facebook's.
While I was looking for similar functionality, I came across a jQuery + PHP demo of the URL-extraction feature of Facebook messages:
http://www.99points.info/2010/07/facebook-like-extracting-url-data-with-jquery-ajax-php/
Instead of using an HTML DOM parser, it works with simple regular expressions that look for the title, description and img tags. Hence, the image extraction doesn't perform well on the many websites that deliver images via CSS. Also, Facebook looks first at its own meta tags and only then at the classic HTML description tag, but the demo illustrates the principle well.
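For illustration, the regex approach boils down to something like this (title tag only; a DOM parser is more robust in practice):

<?php
$html = file_get_contents('http://www.example.com/');   // placeholder URL

if (preg_match('/<title>(.*?)<\/title>/is', $html, $m)) {
    echo trim($m[1]);
}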
