I need to extract data from a URL: the title, the description, and any videos or images on the given page, like the Facebook share button does. For example:
http://www.facebook.com/sharer.php?u=http://www.wired.com&t=Test
Regards
Embed.ly has a nice API for exactly this purpose. Their API returns the site's oEmbed data if available; otherwise, it attempts to extract a summary of the page, much like Facebook does.
Use something like cURL to fetch the page, then something like Simple HTML DOM to parse it and extract the elements you want.
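If it helps, here is a minimal sketch of that approach, using cURL plus PHP's built-in DOMDocument/DOMXPath rather than Simple HTML DOM; the URL and the fallback to the first <img> are just examples:

```php
<?php
// Minimal sketch, assuming the target URL is reachable: fetch the page with
// cURL, then pull the title, meta description and og:image out of it with
// PHP's built-in DOMDocument/DOMXPath. The URL is just an example.
$url = 'http://www.wired.com';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$title       = $xpath->evaluate('string(//title)');
$description = $xpath->evaluate('string(//meta[@name="description"]/@content)');
$image       = $xpath->evaluate('string(//meta[@property="og:image"]/@content)');

// No og:image tag? Fall back to the first <img> on the page.
if ($image === '') {
    $image = $xpath->evaluate('string((//img/@src)[1])');
}

echo "Title: $title\nDescription: $description\nImage: $image\n";
```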
If the web site has support for oEmbed, that's easier and more robust than scraping HTML:
oEmbed is a format for allowing an embedded representation of a URL on third party sites. The simple API allows a website to display embedded content (such as photos or videos) when a user posts a link to that resource, without having to parse the resource directly.
oEmbed is supported by sites like YouTube and Flickr.
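As a rough sketch of how little the consumer side needs (assuming allow_url_fopen is enabled), this asks YouTube's oEmbed endpoint for the data behind a video URL; the video URL is just an example:

```php
<?php
// Rough sketch of the consumer side of oEmbed: ask YouTube's oEmbed endpoint
// for data about a video URL. The video URL is just an example.
$videoUrl = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ';
$endpoint = 'https://www.youtube.com/oembed?format=json&url=' . urlencode($videoUrl);

$data = json_decode(file_get_contents($endpoint), true);

echo $data['title'] . "\n";         // video title
echo $data['thumbnail_url'] . "\n"; // preview image
echo $data['html'] . "\n";          // ready-made embed markup (an <iframe>)
```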
I am working on a project for exactly this issue; it is not as easy as writing an HTML parser and expecting sites to be 'semantic'. Extracting videos and finding their auto-play parameters is especially hard. You can check the project at http://www.embedify.me, which also includes an fb-style URL preview script. As far as I can see, embed.ly and oEmbed are passive parsers: they need the sites (the so-called providers) to support them, which is quite a different approach from what Facebook does.
While I was looking for a similar functionality, I came across a jQuery + PHP demo of the url extract feature of Facebook messages:
http://www.99points.info/2010/07/facebook-like-extracting-url-data-with-jquery-ajax-php/
Instead of using an HTML DOM parser, it works with simple regular expressions. It looks for the title, description and img tags. Consequently, the image extraction doesn't perform well on the many websites that use CSS for images. Also, Facebook looks first at its own meta tags and then falls back to the classic HTML description tag, but the demo illustrates the principle well.
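For reference, that regex approach boils down to something like this sketch; the URL is an example, and as noted this breaks easily, so treat it as illustration only:

```php
<?php
// Rough sketch of the regex approach used in the demo: grab the title, the
// meta description and the first <img> with regular expressions. This is
// fragile; a DOM parser is more robust. The URL is just an example.
$html = file_get_contents('http://www.wired.com');

$title = $description = $image = '';

if (preg_match('/<title[^>]*>(.*?)<\/title>/si', $html, $m)) {
    $title = trim($m[1]);
}
// Assumes name="description" appears before the content attribute.
if (preg_match('/<meta[^>]+name=["\']description["\'][^>]+content=["\'](.*?)["\']/si', $html, $m)) {
    $description = trim($m[1]);
}
if (preg_match('/<img[^>]+src=["\'](.*?)["\']/si', $html, $m)) {
    $image = $m[1]; // first image on the page, often not the "right" one
}

echo "$title\n$description\n$image\n";
```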
Related
Is there a way to define the facebook Open Graph meta tags (OG) with a link?
For example I have a single page website with several anchors.
At each anchor I have a sub heading and an image
The url would be : www.mywebsite.com/#page2
When using the Facebook share function I'd like to define that URL with a defined sub-heading as the title, and the image in the sub-heading as the image that is shared on Facebook.
Is this possible? Is it possible with PHP or JS?
I would advise you to implement the Google AJAX crawling specification. It specifies how single-page websites should be treated when they are indexed by the Google search engine, but Facebook also uses it when indexing Open Graph information.
You can read more on the specification here: https://developers.google.com/webmasters/ajax-crawling/docs/specification
It basically works by mapping http://www.example.com/#!section to http://www.example.com/?_escaped_fragment_=section. This way you can respond, on the server side, to the _escaped_fragment_ parameter when it is given.
When using this method, please be aware that the Facebook crawler will still follow meta og:url and link rel=canonical tags as the final url. So, if you return with the _escaped_fragment_ method, you can remove these tags (or set them to the current url).
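A minimal sketch of the server side, assuming you switch your anchors to the #! form; the section names, titles and image URLs are made up for illustration:

```php
<?php
// Minimal sketch, assuming the anchors are switched to the #! form so that
// Facebook requests e.g. ?_escaped_fragment_=page2 instead of /#!page2.
// The section names, titles and image URLs below are made up for illustration.
$sections = array(
    'page2' => array(
        'title' => 'Sub heading of page 2',
        'image' => 'http://www.mywebsite.com/images/page2.jpg',
    ),
    // ...one entry per anchor/section
);

$fragment = isset($_GET['_escaped_fragment_']) ? $_GET['_escaped_fragment_'] : '';

echo "<html><head>\n";
if ($fragment !== '' && isset($sections[$fragment])) {
    $s = $sections[$fragment];
    // Section-specific Open Graph tags for the crawler.
    echo '<meta property="og:title" content="' . htmlspecialchars($s['title']) . "\" />\n";
    echo '<meta property="og:image" content="' . htmlspecialchars($s['image']) . "\" />\n";
    echo '<meta property="og:url" content="http://www.mywebsite.com/#!' . htmlspecialchars($fragment) . "\" />\n";
}
echo "</head><body>...rest of the single-page site...</body></html>\n";
```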
I have read a lot of articles explaining how to parse an HTML file with PHP, but in Twitter's case the page uses iframes in which the text is hidden. How can I parse the Twitter HTML?
I know it is very easy to use the API, the .rss page or JSON to get the tweets as strings, but I want to be able to work with the Twitter HTML page directly. Is there any way I could find the tweets using their HTML page?
The best way would be to use something like Simple HTML DOM. With it you can use CSS selectors, as with jQuery, to find the elements on the page you are looking for. However, Twitter pages use a lot of JavaScript and Ajax, so you may be stuck with using the API, or you could try the mobile site.
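For completeness, the Simple HTML DOM usage looks roughly like this (file_get_html() and find() come from that library); the CSS selector is purely hypothetical, and because Twitter builds its pages with JavaScript, the server-side HTML will most likely not contain the tweets at all:

```php
<?php
// Sketch of the Simple HTML DOM approach. file_get_html() and find() come
// from the simplehtmldom library. The CSS selector below is purely
// hypothetical; Twitter's real markup is built by JavaScript, so the HTML
// fetched server-side will most likely not contain the tweets.
include 'simple_html_dom.php';

$html = file_get_html('https://twitter.com/someuser'); // example URL

foreach ($html->find('p.tweet-text') as $tweet) { // hypothetical selector
    echo trim($tweet->plaintext) . "\n";
}
```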
If I post a link like this on my wall:
http://blog.bonsai.tv/news/il-nuovo-vezzo-della-lega-nord-favorire-i-lombardi-alluniversita/
then Facebook extracts the image from the post itself, and not the first image on the webpage (not the logo or other small images, for example)!
How does Facebook do that?
Hm, impossible to say without more information about the algorithm they use.
However, from looking at the page's source code you can see that while the image of Bossi is not the first image on the page, it is the first one inside the divs "page_content" and "post_content". Maybe Facebook knows the HTML IDs that the blogging system (WordPress in this case) uses, and uses them to find the first image that is actually part of the page content.
That would actually be a good idea, and is essentially an implementation of the "semantic web"...
As others have said, we have no idea how Facebook decides what to choose in the absence of any relevant metadata (though Sleske's guesses seem reasonable; I'd also guess that they look at the first big image). But you can avoid the guesswork by going the correct route and simply giving Facebook (and similar services) additional metadata about your page using Open Graph Protocol tags. For example, if you want to specify a particular image to use for a Facebook Like, you'd include this in your head tag:
<meta property="og:image" content="<your image URL>" />
OGP is also used by LinkedIn, Google+ and many others.
If you're using WordPress you can control these tags with an Open Graph plugin. Other systems can do it manually or via their own plugins.
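For completeness, a page typically carries a small set of these tags in its head; all values here are placeholders:
<meta property="og:title" content="Page title" />
<meta property="og:description" content="Short description of the page" />
<meta property="og:image" content="http://example.com/preview.jpg" />
<meta property="og:url" content="http://example.com/article" />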
I can imagine that the Facebook crawler can identify the actual content part, and select an image from it. Similar functionality is used by the Safari Reader functionality. It probably helps that the software used is Wordpress, which is the most popular blogging software. It's a quick win for Facebook to add specific support for this software.
My guess is that Facebook has built some algorithms for distinguishing the actual content from the other data in an HTML page. For the page you provided it's quite easy, since the HTML element that contains the page content has id="page_content", which is self-explanatory.
Does anyone have any idea how to generate an excerpt from any given article page (so it could source from many types of sites)? Something like what Facebook does when you paste a URL into a post. Thank you.
What you're looking to do is called web scraping. The basic method is to capture the page (you can fetch a URL with file_get_contents) and then parse it for the content you want (i.e. pull the content out of the <body> tag).
In order to parse the returned HTML, you should use a DOM parser. PHP has its own DOM classes which you can use.
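A minimal sketch of that, assuming allow_url_fopen is enabled; the URL and the 300-character limit are just examples:

```php
<?php
// Minimal sketch of the approach above: fetch the page with file_get_contents
// and build a plain-text excerpt from the first paragraphs using PHP's DOM
// classes. The URL and the 300-character limit are just examples.
$html = file_get_contents('http://example.com/some-article');

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate invalid real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$excerpt = '';
foreach ($doc->getElementsByTagName('p') as $p) {
    $excerpt .= ' ' . trim($p->textContent);
    if (strlen($excerpt) > 300) {
        break;
    }
}

echo substr(trim($excerpt), 0, 300) . '...';
```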
Here is a video tutorial about how to do that:
http://net.tutsplus.com/tutorials/php/how-to-create-blog-excerpts-with-php/
How is it possible to generate a list of all the pages of a given website programmatically using PHP?
What I'm basically trying to achieve is to generate something like a sitemap, as a nested unordered list with links to all the pages contained in a website.
If all pages are linked to one another, then you can use a crawler or spider to do this.
If there are pages that are not all linked you will need to come up with another method.
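If it helps, here is a very small crawler sketch in PHP; it has no URL normalisation, robots.txt handling or rate limiting, and the start URL is just an example:

```php
<?php
// Very small crawler sketch: starting from one URL, follow <a href> links on
// the same host and collect every page reached. It is only meant to show the idea.
function crawl($startUrl)
{
    $scheme = parse_url($startUrl, PHP_URL_SCHEME);
    $host   = parse_url($startUrl, PHP_URL_HOST);
    $queue  = array($startUrl);
    $seen   = array();

    while ($queue) {
        $url = array_shift($queue);
        if (isset($seen[$url])) {
            continue;
        }
        $seen[$url] = true;

        $html = @file_get_contents($url);
        if ($html === false) {
            continue;
        }

        $doc = new DOMDocument();
        libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();

        foreach ($doc->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            if ($href === '' || $href[0] === '#') {
                continue;
            }
            // Turn relative links into absolute ones (very naive resolution).
            if (strpos($href, 'http') !== 0) {
                $href = $scheme . '://' . $host . '/' . ltrim($href, '/');
            }
            if (parse_url($href, PHP_URL_HOST) === $host && !isset($seen[$href])) {
                $queue[] = $href;
            }
        }
    }

    return array_keys($seen); // every page reachable from the start URL
}

print_r(crawl('http://www.example.com/'));
```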
You can try this:
- Add an "image bug / web beacon / web bug" to each page you want to track (see the sketch after this list), OR alternatively add a JavaScript function to each page that makes a call to /scripts/logger.php. You can use any of the JavaScript libraries that make this super simple, like jQuery, MooTools, or YUI.
- Create the logger.php script and have it save the request's originating URL somewhere, like a file or a database.
Pros:
- Fairly simple
Cons:
- Requires edits to each page
- Pages that aren't visited don't get logged
Some other techniques that don't really fit your need to do it programmatically but may be worth considering include:
- Create a spider or crawler
- Use a ripper such as cURL or Teleport Plus
- Use Google Analytics (similar to the image bug technique)
- Use a log analyzer like Webstats or a freeware UNIX webstats analyzer
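A rough sketch of the beacon approach from the first item, assuming the beacon on each tracked page is something like <img src="/scripts/logger.php" width="1" height="1" alt="" />; the logger script then records where each request came from (the log file name is illustrative):

```php
<?php
// /scripts/logger.php - record which page requested the beacon.
// Appends the referring URL to a flat file; a database would work just as well.
$page = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : 'unknown';

file_put_contents(
    __DIR__ . '/visited-pages.log',
    date('c') . ' ' . $page . "\n",
    FILE_APPEND
);

// Answer with a 1x1 transparent GIF so the <img> tag gets a valid image back.
header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
```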
You can easily list the files with the glob function... But if the pages use includes/requires and other tricks to mix multiple files into "one page", you'll need to import the Google "site:mysite.com" search results, or just create a table with the URL of every page. :P
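A quick sketch of the glob() idea; note that it only finds files on disk, not the URLs they are actually served under:

```php
<?php
// List the .php files directly under the document root.
foreach (glob($_SERVER['DOCUMENT_ROOT'] . '/*.php') as $file) {
    echo basename($file) . "\n";
}
```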
Maybe this can help:
http://www.xml-sitemaps.com/ (SiteMap Generator)