Facebook-like on-demand meta content scraper - PHP

You know how FB scrapes a link right after you paste it into the link field (status, message etc.) and displays various metadata: the page title, a thumbnail, a selection of images from the linked page, or a video thumb for a video-related link (like YouTube)?
Any ideas how one would replicate this function? I'm thinking about a couple of Gearman workers, or even better just JavaScript that does an XHR request and parses the content based on regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :)
Thanks!

FB scrapes the meta tags from the HTML.
I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.
As for the selection of thumbnails, I think FB only picks images that exceed certain dimensions, i.e. it skips over button graphics, 1px spacers, etc.
Edit: I don't know exactly what you're looking for, but here's a PHP function for scraping the relevant data from a page.
It uses the Simple HTML DOM library from http://simplehtmldom.sourceforge.net/
I've had a look at how FB does it, and the scraping appears to be done server-side.
require_once 'simple_html_dom.php'; // from http://simplehtmldom.sourceforge.net/

class ScrapedInfo
{
    public $url;
    public $title;
    public $description;
    public $imageUrls;
}

function scrapeUrl($url)
{
    $info = new ScrapedInfo();
    $info->url = $url;
    $html = file_get_html($info->url);

    // Grab the page title
    $info->title = trim($html->find('title', 0)->plaintext);

    // Grab the page description
    foreach ($html->find('meta') as $meta) {
        if ($meta->name == "description") {
            $info->description = trim($meta->content);
        }
    }

    // Grab the image URLs
    $imgArr = array();
    foreach ($html->find('img') as $element) {
        $rawUrl = $element->src;
        // Turn any relative URLs into absolutes (naive: simply prepends the page URL)
        if (substr($rawUrl, 0, 4) != "http") {
            $imgArr[] = $url . $rawUrl;
        } else {
            $imgArr[] = $rawUrl;
        }
    }
    $info->imageUrls = $imgArr;

    return $info;
}
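A quick usage sketch (not part of the original answer; it assumes simple_html_dom.php sits alongside the script):
$info = scrapeUrl('http://www.example.com/');
echo $info->title . "\n";
echo $info->description . "\n";
print_r($info->imageUrls);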

Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones, but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing, you could always fall back to a website thumbnail generation service.
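A minimal sketch of that check (not from the original answer) using DOMDocument and XPath; the og:image fallback is my own addition:
$doc = new DOMDocument();
@$doc->loadHTMLFile($url); // @ suppresses warnings from sloppy real-world markup
$xpath = new DOMXPath($doc);

// Prefer the thumbnail the page author declared...
$thumb = $xpath->evaluate('string(//link[@rel="image_src"]/@href)');

// ...then try Open Graph, before resorting to scanning <img> tags.
if ($thumb === '') {
    $thumb = $xpath->evaluate('string(//meta[@property="og:image"]/@content)');
}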

As I am developing a project like that, I can tell you it is not as easy as it seems: encoding issues, content rendered with JavaScript, and the existence of so many non-semantic websites are some of the big problems I encountered. Extracting video info and trying to get auto-play behavior is especially tricky, and sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behavior as on FB.

Related

A reliable way to scrape title, description and keywords

Currently I'm using CURL to scrape a website. I want to reliably get the title, description and keywords.
//Parse for the title, description and keywords
if (strlen($link_html) > 0)
{
    $tags = get_meta_tags($link); // see the PHP documentation for get_meta_tags()
    $link_keywords = $tags['keywords'];
    $link_description = $tags['description'];
}
The only problem is that people are now using all kinds of meta tags, such as Open Graph: <meta property="og:title" content="The Rock" />. They also vary the tag casing a lot: <title>, <Title>, <TITLE>, <tiTle>. It's very difficult to get these reliably.
I really need some code that will extract these variables consistently: if some title, keywords and description are provided at all, it should find them. Right now it seems very hit and miss.
Perhaps a way to extract all titles into a titles array? Then the scraping web developer can choose the best one to record in their database. The same applies to keywords and descriptions.
This is not a duplicate. I have searched through Stack Overflow and nowhere is there a solution that places all "title", "keywords" and "description" type tags into arrays.
Generally, get_meta_tags() should get you most of what you need; you just need to set up a series of cascading checks that sample the required field from each metadata system until one is found. For example, something like this:
function get_title($url) {
    $tags  = get_meta_tags($url);
    $props = get_meta_props($url); // not a built-in PHP function; see the note below
    return $tags["title"] ?: $props["og:title"] ?: /* ...further fallbacks... */ null;
}
The above implementation is obviously not efficient (if we implement all the getters like this, the URL gets reloaded for each getter), and I didn't implement get_meta_props() - which is problematic to implement correctly using pcre_* and tedious to implement using DOMDocument.
Still, a correct implementation, while conceptually simple, is a lot of work - which is a classic scenario for letting an external library solve the problem! Fortunately, there is one for just that, simply called "Embed". You can find it on GitHub, or with Composer just run
composer require embed/embed
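For completeness, a rough sketch of what using it might look like (the Embed API has changed across major versions; this assumes the older v3-style static entry point, so check the README of the version you installed):
require 'vendor/autoload.php';

use Embed\Embed;

$info = Embed::create('http://example.com/some-page');

echo $info->title;       // best title found across <title>, og:title, twitter:title, ...
echo $info->description; // best description found
echo $info->image;       // best image/thumbnail candidate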

PHP - Parsing iFrame from an HTML Page

Hey guys, so I have a basic PHP application that loads a page with a video on it from one of those free TV streaming sites. You set the episode, season and show you wish to view, and the server-side application parses out the HTML tag of the iframe that contains that video. The application works by parsing the HTML page with preg_match_all() to get all occurrences of the iframe tag, using the pattern "/iframe .*\>/". This works for about half of the video players on the site, but for some reason comes up dry with all of the others.
For example, the video at http://www.free-tv-video-online.me/player/novamov.php?id=huv5cpp5k8cia, which is hosted on a video site called novamov, is easily parsed. However, the video at http://www.free-tv-video-online.me/player/gorillavid.php?id=8rerq4qpgsuw, which is hosted on gorillavid, is not found by preg_match_all(), despite the iframe clearly being visible when the element is inspected in Chrome. Why is my script not returning the proper results, and why does this behaviour depend on the video player being used? Could someone please explain?
Try:
$dom = new DOMDocument;
@$dom->loadHTML(file_get_contents('yourURLHere')); // loadHTML() expects an HTML string, not a URL; @ suppresses warnings from malformed markup
$iframes = $dom->getElementsByTagName('iframe');
foreach ($iframes as $ifr) {
    $ifrA[] = $ifr->getAttribute('src');
}
Now the $ifrA array should hold your iframe src values.
Note that if an iframe is injected by JavaScript after the page loads (inspecting the element in Chrome would still show it), it will not be present in the raw HTML that PHP downloads; that would explain why some players come up empty.

JS, PHP Dynamic Content and Google Crawlers

I have a series of about 25 static sites I created that share the same info, and I was having to change inane bits of copy here and there, so I wrote this JavaScript so that all the sites pull the content from one location (shortened to one example):
var dataLoc = "<?=$resourceLocation?>";
$("#listOne").load(dataLoc+"resources.html #listTypes");
When the page loads, it finds the div with id listOne and replaces its contents with the contents of the div labeled listTypes inside the file resources.html.
My question: Google is not crawling this dynamic content at all. I am told Google will crawl dynamically imported information, so I'm curious to find out what it is I'm currently doing that needs to be improved.
I assumed JS was simply skipped by the Google spider, so I used PHP to access the same HTML file used before, and it is working slightly, but not how I need it. The code below returns the text, but I need the markup as well: the <li>, <p>, <img> tags, and so on. Perhaps I could tweak this? (I am not a developer, so I have just tried a few dozen things I read in the PHP online help, and this is as close as I got.)
function parseContents($divID)
{
    $page = file_get_contents('content/resources.html');
    $doc = new DOMDocument();
    @$doc->loadHTML($page); // @ suppresses warnings from imperfect HTML
    $divs = $doc->getElementsByTagName('div');
    foreach ($divs as $div)
    {
        if ($div->getAttribute('id') === $divID)
        {
            echo $div->nodeValue; // nodeValue returns plain text only, which is the problem described above
        }
    }
}
parseContents('listOfStuff');
Thanks for your help in understanding this a little better, let me know if I need to explain it any better :)
See Making AJAX Applications Crawlable published by Google.
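As a side note on the markup issue raised in the question (this is not part of the original answer): DOMDocument::saveHTML() accepts a node argument as of PHP 5.3.6, so the helper could echo the div's full HTML instead of just its text, roughly like this:
function parseContents($divID)
{
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents('content/resources.html'));
    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('id') === $divID) {
            echo $doc->saveHTML($div); // keeps the <li>, <p>, <img> markup intact
        }
    }
}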

How to replace unlinked images with a link to the image?

In PHP:
I'm converting a large number of blog posts into something more mobile friendly. As these blog posts seem to have a lot of large images in them, I would like to use some regex or similar to go through each article's HTML and replace any image that isn't currently linked with a link to that image. This allows the mobile browser to display the article without the images, with a link in place of each one, thus reducing the page download size.
Alternatively, if anyone knows any PHP classes/functions that can make the job of formatting these posts easier, please suggest them.
Any help would be brilliant!
To parse HTML, use an HTML parser; regex is not the correct tool for parsing HTML or XML documents. PHP's DOM extension offers the loadHTML() method (see http://in2.php.net/manual/en/domdocument.loadhtml.php); use the DOM's methods to access and modify the img elements.
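A minimal sketch of that approach (not from the original answer; $articleHtml is assumed to hold one post's HTML):
$doc = new DOMDocument();
@$doc->loadHTML($articleHtml); // @ suppresses warnings from sloppy blog markup

// Copy the node list first: modifying the DOM while iterating a live
// DOMNodeList can skip elements.
foreach (iterator_to_array($doc->getElementsByTagName('img')) as $img) {
    $link = $doc->createElement('a', 'View image');
    $link->setAttribute('href', $img->getAttribute('src'));
    $img->parentNode->replaceChild($link, $img);
}

$mobileHtml = $doc->saveHTML();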
How about doing it in jQuery instead of PHP? That way it will work across different blogging software.
You can do something like...
$(document).ready(function () {
    $('#content img').each(function () {
        var imageUrl = $(this).attr('src');
        // Now write code to remove the image and insert a link in its place. :)
    });
});
With a regexp, something like this:
$c = 0;
while (preg_match('/<img src="(.*?)">/', $html, $match) && $c++ < 100) {
    $html = str_replace($match[0], '<a href="' . $match[1] . '">Image ' . $c . '</a>', $html);
}
You can also use preg_replace (which saves the loop), but the loop allows for easy extensions, e.g. using the image functions to create thumbnails.
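For reference, the loop-free variant might look like this (a sketch of my own, not from the original answer):
$c = 0;
$html = preg_replace_callback('/<img src="(.*?)">/', function ($m) use (&$c) {
    $c++;
    return '<a href="' . $m[1] . '">Image ' . $c . '</a>';
}, $html);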

Integrating tumblr blog with website

I would like to integrate my Tumblr feed into my website. It seems that Tumblr has an API for this, but I'm not quite sure how to use it. From what I understand, I request the page and Tumblr returns an XML file with the contents of my blog. But how do I then turn this XML into meaningful HTML? Must I parse it with PHP, turning the relevant tags into headers and so on? I tell myself it cannot be that painful. Anyone have any insights?
There's a JavaScript include that does this now, available from Tumblr (you have to log in to see it): http://www.tumblr.com/developers
It winds up being something like this:
<script type="text/javascript" src="http://{username}.tumblr.com/js"></script>
You can use PHPTumblr, an API wrapper written in PHP which makes retrieving posts a breeze.
If you go to http://yourblog.tumblr.com/api/read, where "yourblog" should be replaced with the name of your blog (careful: if you host your Tumblr blog on a custom domain, like I do, use that), you'll see the XML version of your blog. It comes up really messy for me in Firefox for some reason, so I use Chrome; try a couple of different browsers - it helps to see the XML file well-formed and indented.
Once you're looking at the XML version of your blog, notice that each post has a bunch of data in attribute="value" form. Here's an example from my blog:
<post id="11576453174" url="http://wamoyo.com/post/11576453174" url-with-slug="http://wamoyo.com/post/11576453174/100-year-old-marathoner-finishes-race" type="link" date-gmt="2011-10-17 18:01:27 GMT" date="Mon, 17 Oct 2011 14:01:27" unix-timestamp="1318874487" format="html" reblog-key="E2Eype7F" slug="100-year-old-marathoner-finishes-race" bookmarklet="true">
So, there are lots of ways to do this. I'll show you the one I used and drop my code at the bottom of this post so you can tailor it to your needs. Notice the type="link" part? Or the id="11576453174"? These are the values you're going to use to pull data into your PHP script.
Here's the example:
<!-- The Latest Text Post -->
<?php
$request_url = "http://wamoyo.com/api/read?type=regular"; // get the XML file
$xml = simplexml_load_file($request_url);                 // load it
$title = $xml->posts->post->{'regular-title'};            // post title
$post = $xml->posts->post->{'regular-body'};              // post body
$link = $xml->posts->post['url'];                         // URL of the blog post
$small_post = substr($post, 0, 350);                      // shorten post body to 350 characters

// Spit that baby out with some stylish HTML
echo '<div class="panel" style="width:220px;margin:0 auto;text-align:left;">
    <h1 class="med georgia bold italic black">' . $title . '</h1>'
    . '<br />'
    . '<span>' . $small_post . '</span>' . '...'
    . '<br /><br /><div style="text-align:right;"><a class="bold italic blu georgia" href="' . $link . '">Read More...</a></div>
</div>
<img style="position:relative;top:-6px;" src="pic/shadow.png" alt="" />
';
?>
So, this is actually fairly simple. The PHP script places data (like the post title and post text) from the XML file into PHP variables, and then echoes those variables out along with some HTML to create a div featuring a snippet from a blog post. This one features the most recent text post. Feel free to use it; just change the first URL to your own blog and then pick whatever values you want from your XML file.
For example, let's say you want not the most recent, but the second most recent "photo" post. You change the request_url to this:
$request_url = "http://wamoyo.com/api/read?type=photo&start=1";
Or let's say you want the most recent post with a specific tag:
$request_url = "http://wamoyo.com/api/read?tagged=events";
Or let's say you want a specific post - just use the id:
$request_url = "http://wamoyo.com/api/read?id=11576453174";
So all you have to do is tack on the ? with whatever parameter, and use an & if you have multiple parameters.
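For instance, combining parameters might look like this (the num parameter limits how many posts come back in the v1 read API; the values here are just illustrative):
$request_url = "http://wamoyo.com/api/read?type=photo&tagged=events&num=2"; // two most recent photo posts tagged "events"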
If you want to do something fancier, you'll need the tumblr api docs here: http://www.tumblr.com/docs/en/api/v2
Hope this was helpful!
There are two main ways to do this. First, you can parse the XML, pulling out the content from the tags you need (there are a few ways to do this depending on whether you use a SAX or DOM parser). This is the quick and dirty solution.
You can also use an XSLT transformation to convert the XML source directly into the HTML you want. This is more involved, since you have to learn the syntax for XSLT templates, which is a bit verbose.
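For the XSLT route, here is a minimal sketch of the PHP side using the xsl extension (the stylesheet name tumblr-to-html.xsl is a placeholder for a stylesheet you would write yourself):
$xml = new DOMDocument();
$xml->load('http://yourblog.tumblr.com/api/read'); // the blog's XML feed

$xsl = new DOMDocument();
$xsl->load('tumblr-to-html.xsl'); // your own stylesheet mapping <post> elements to HTML

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
echo $proc->transformToXML($xml); // outputs the generated HTML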
