How to get Open Graph Protocol of a webpage by php? - php

PHP has a simple function for reading a webpage's meta tags (get_meta_tags), but it only works for meta tags with name attributes. However, the Open Graph Protocol is becoming more and more popular these days. What is the easiest way to get the Open Graph values from a webpage? For example:
<meta property="og:url" content="">
<meta property="og:title" content="">
<meta property="og:description" content="">
<meta property="og:type" content="">
The basic way I see is to fetch the page via cURL and parse it with regex. Any ideas?

Really simple and well done:
Using https://github.com/scottmac/opengraph
$graph = OpenGraph::fetch('http://www.avessotv.com.br/bastidores-pantene-institute-experience-pg.html');
print_r($graph);
Will return
OpenGraph Object
(
[_values:OpenGraph:private] => Array
(
[type] => article
[video] => http://www.avessotv.com.br/player/flowplayer/flowplayer-3.2.7.swf?config=%7B%27clip%27%3A%7B%27url%27%3A%27http%3A%2F%2Fwww.avessotv.com.br%2Fmedia%2Fprogramas%2Fpantene.flv%27%7D%7D
[image] => /wp-content/thumbnails/9025.jpg
[site_name] => Programa Avesso - Bastidores
[title] => Bastidores “Pantene Institute Experience” P&G
[url] => http://www.avessotv.com.br/bastidores-pantene-institute-experience-pg.html
[description] => Confira os bastidores do Pantene Institute Experience, da Procter & Gamble. www.pantene.com.br Mais imagens:
)
[_position:OpenGraph:private] => 0
)

When parsing data from HTML, you really shouldn't use regex. Take a look at the DOMXPath Query function.
Now, the actual code could be :
[EDIT] A better query for XPath was given by Stefan Gehrig, so the code can be shortened to :
libxml_use_internal_errors(true); // Collect libxml warnings internally instead of suppressing them with @
$doc = new DomDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
$property = $meta->getAttribute('property');
$content = $meta->getAttribute('content');
$rmetas[$property] = $content;
}
var_dump($rmetas);
Instead of :
$doc = new DomDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
$property = $meta->getAttribute('property');
$content = $meta->getAttribute('content');
if(!empty($property) && preg_match('#^og:#', $property)) {
$rmetas[$property] = $content;
}
}
var_dump($rmetas);
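One detail the snippets above do not cover: a page may repeat a property (several og:image tags, for instance), and assigning into $rmetas[$property] keeps only the last value. Below is a minimal sketch that collects repeated properties into a list, assuming the HTML is already available as a string. The function name extractOpenGraph and the sample markup are illustrative only, not part of any library.

```php
<?php
// Hypothetical helper: returns og:* properties; repeated properties become arrays.
function extractOpenGraph(string $html): array
{
    // Collect libxml warnings internally so sloppy HTML does not emit notices.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tags = [];
    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        $property = $meta->getAttribute('property');
        $content  = $meta->getAttribute('content');
        if (!array_key_exists($property, $tags)) {
            $tags[$property] = $content;
        } else {
            // Second and later occurrences turn the entry into a list.
            $tags[$property] = array_merge((array) $tags[$property], [$content]);
        }
    }
    return $tags;
}

// Sample markup (made up for this example).
$sample = '<html><head>'
    . '<meta property="og:title" content="Example">'
    . '<meta property="og:image" content="https://example.com/a.jpg">'
    . '<meta property="og:image" content="https://example.com/b.jpg">'
    . '</head><body></body></html>';

print_r(extractOpenGraph($sample));
```

Whether a single value should stay a string or always be wrapped in an array is a design choice; always returning arrays is more uniform for callers, at the cost of a noisier result.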

How about:
preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $str, $matches);
So yes: fetch the page any way you can and parse it with a regex.

This function does the job without dependency and DOM parsing:
function getOgTags($html)
{
$pattern='/<\s*meta\s+property="og:([^"]+)"\s+content="([^"]*)/i';
if(preg_match_all($pattern, $html, $out))
return array_combine($out[1], $out[2]);
return array();
}
test code:
// A nowdoc is used because the sample contains single quotes, which would end a '...' string early.
$x = <<<'HTML'
 <title>php - Using domDocument, and parsing info, I would like to get the 'href' contents of an 'a' tag - Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="referrer" content="origin" />
<meta property="og:type" content="website"/>
<meta property="og:url" content="https://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of"/>
<meta property="og:image" itemprop="image primaryImageOfPage" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon#2.png?v=73d79a89bded" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:domain" content="stackoverflow.com"/>
<meta name="twitter:title" property="og:title" itemprop="title name" content="Using domDocument, and parsing info, I would like to get the 'href' contents of an 'a' tag" />
<meta name="twitter:description" property="og:description" itemprop="description" content="Possible Duplicate:
Regular expression for grabbing the href attribute of an A element
This displays the what is between the a tag, but I would like a way to get the href contents as well.
Is..." />
HTML;
echo '<pre>';
var_dump(getOgTags($x));
and you get:
array(3) {
["type"]=>
string(7) "website"
["url"]=>
string(119) "https://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of"
["image"]=>
string(85) "https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon#2.png?v=73d79a89bded"
}
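Note that the result has three entries although the sample contains five og: properties. The pattern requires property= to come directly after <meta, so the og:title and og:description tags, which begin with name=, are skipped. The following sketch illustrates the difference on a shortened, made-up sample by comparing the regex with a lookup through PHP's built-in DOM extension:

```php
<?php
// Shortened sample: the second tag puts name= before property=.
$html = '<html><head>'
    . '<meta property="og:type" content="website"/>'
    . '<meta name="twitter:title" property="og:title" content="Some title"/>'
    . '</head><body></body></html>';

// Regex from the answer: only matches when property= directly follows "<meta".
preg_match_all('/<\s*meta\s+property="og:([^"]+)"\s+content="([^"]*)/i', $html, $out);
$byRegex = array_combine($out[1], $out[2]);

// DOM lookup: attribute order does not matter.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$byDom = [];
foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
    // Strip the leading "og:" so the keys match the regex version.
    $byDom[substr($meta->getAttribute('property'), 3)] = $meta->getAttribute('content');
}

var_dump(array_keys($byRegex)); // only "type"
var_dump(array_keys($byDom));   // "type" and "title"
```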

With this method you get a key/value array of the Facebook Open Graph tags.
$url="http://fbcpictures.in";
$site_html= file_get_contents($url);
$matches=null;
preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $site_html,$matches);
$ogtags=array();
for($i=0;$i<count($matches[1]);$i++)
{
$ogtags[$matches[1][$i]]=$matches[2][$i];
}

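Here is what I am using to extract OG tags. It relies on two helper functions, file_get_contents_curl() and addhttp(), that are not shown here.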
function get_og_tags($get_url = "", $ret = 0)
{
if ($get_url != "") {
$title = "";
$description = "";
$keywords = "";
$og_title = "";
$og_image = "";
$og_url = "";
$og_description = "";
$full_link = "";
$image_urls = array();
$og_video_name = "";
$youtube_video_url="";
$get_url = $get_url;
$ret_data = file_get_contents_curl($get_url);
//$html = file_get_contents($get_url);
$html = $ret_data['curlData'];
$full_link = $ret_data['full_link'];
$full_link = addhttp($full_link);
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
if ($nodes->length == 0) {
$title = $get_url;
} else {
$title = $nodes->item(0)->nodeValue;
}
//get and display what you need:
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++) {
$meta = $metas->item($i);
if ($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
if ($meta->getAttribute('name') == 'keywords')
$keywords = $meta->getAttribute('content');
}
$og = $doc->getElementsByTagName('og');
for ($i = 0; $i < $metas->length; $i++) {
$meta = $metas->item($i);
if ($meta->getAttribute('property') == 'og:title')
$og_title = $meta->getAttribute('content');
if ($meta->getAttribute('property') == 'og:url')
$og_url = $meta->getAttribute('content');
if ($meta->getAttribute('property') == 'og:image')
$og_image = $meta->getAttribute('content');
if ($meta->getAttribute('property') == 'og:description')
$og_description = $meta->getAttribute('content');
// for sociotube video share
if ($meta->getAttribute('property') == 'og:video_name')
$og_video_name = $meta->getAttribute('content');
// for sociotube youtube video share
if ($meta->getAttribute('property') == 'og:youtube_video_url')
$youtube_video_url = $meta->getAttribute('content');
}
//if no image found grab images from body
if ($og_image != "") {
$image_urls[] = $og_image;
} else {
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img"); // find your image
$imgCount = 0;
for ($i = 0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i); // gets the 1st image
if (isset($node->attributes->getNamedItem('src')->nodeValue)) {
$src = $node->attributes->getNamedItem('src')->nodeValue;
}
if (isset($node->attributes->getNamedItem('src')->value)) {
$src = $node->attributes->getNamedItem('src')->value;
}
if (isset($src)) {
if (!preg_match('/blank.(.*)/i', $src) && filter_var($src, FILTER_VALIDATE_URL)) {
$image_urls[] = $src;
if ($imgCount == 10) break;
$imgCount++;
}
}
}
}
$page_title = ($og_title == "") ? $title : $og_title;
if(!empty($og_video_name)){
// for sociotube video share
$page_body = $og_video_name;
}else{
// for post share
$page_body = ($og_description == "") ? $description : $og_description;
}
$output = array('title' => $page_title, 'images' => $image_urls, 'content' => $page_body, 'link' => $full_link,'video_name'=>$og_video_name,'youtube_video_url'=>$youtube_video_url);
if ($ret == 1) {
return $output; //output JSON data
}
echo json_encode($output); //output JSON data
die;
} else {
$data = array('error' => "Url not found");
if ($ret == 1) {
return $data; //output JSON data
}
echo json_encode($data);
die;
}
}
Usage of the function (pass 1 as the second argument to get the array back instead of echoing JSON and exiting):
$url = "https://www.alectronics.com";
$tagsArray = get_og_tags($url, 1);
print_r($tagsArray);
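get_og_tags() calls file_get_contents_curl() and addhttp(), neither of which is defined in the answer. The sketch below shows what such helpers might look like. It is a guess based only on how they are used above (an array with curlData and full_link keys, and a URL that gets a scheme prefix), so the bodies, the timeout value, and the redirect handling are all assumptions.

```php
<?php
// Hypothetical stand-ins for the two helpers get_og_tags() calls;
// the original implementations were not shown.

// Prefix a scheme when the URL has none.
function addhttp($url)
{
    if (!preg_match('~^(?:f|ht)tps?://~i', $url)) {
        $url = 'http://' . $url;
    }
    return $url;
}

// Fetch a page with cURL and return the body plus the final URL after redirects.
function file_get_contents_curl($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed limit: give up after 10 seconds
    $body = curl_exec($ch);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    return array(
        'curlData'  => $body === false ? '' : $body, // empty string when the request failed
        'full_link' => $finalUrl,
    );
}
```

A caller would probably want to check for an empty curlData before handing it to DOMDocument::loadHTML(), since recent PHP versions reject an empty string there.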

The more XMLish way would be to use XPath:
$xml = simplexml_load_file('http://ogp.me/');
$xml->registerXPathNamespace('h', 'http://www.w3.org/1999/xhtml');
$result = array();
foreach ($xml->xpath('//h:meta[starts-with(@property, \'og:\')]') as $meta) {
$result[(string)$meta['property']] = (string)$meta['content'];
}
print_r($result);
Unfortunately, the namespace registration is needed if the HTML document declares a namespace in its <html> tag.
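simplexml_load_file() expects well-formed XML, so it is likely to fail on ordinary HTML pages. One possible workaround, sketched here with made-up sample markup, is to parse with DOMDocument::loadHTML() and pass the result to simplexml_import_dom(). As far as I know, elements parsed this way carry no namespace, so the XPath needs no prefix; treat that as an assumption to verify against your own pages.

```php
<?php
// Sample page (made up); note the unclosed <p>, which strict XML parsing would reject.
$html = '<html><head>'
    . '<meta property="og:title" content="Open Graph protocol">'
    . '<meta property="og:type" content="website">'
    . '<meta name="description" content="not an og tag">'
    . '</head><body><p>Unclosed paragraph</body></html>';

libxml_use_internal_errors(true); // keep HTML parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);

// Hand the parsed DOM tree to SimpleXML.
$xml = simplexml_import_dom($doc);

$result = array();
foreach ($xml->xpath('//meta[starts-with(@property, "og:")]') as $meta) {
    $result[(string) $meta['property']] = (string) $meta['content'];
}
print_r($result);
```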

With native PHP function get_meta_tags().
https://php.net/get_meta_tags
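As the question points out, get_meta_tags() only picks up meta tags that have a name attribute, so on its own it does not return og: properties declared with property=. A small sketch to check this locally, using a made-up page written to a temporary file (get_meta_tags() takes a filename or URL rather than a string):

```php
<?php
// Sample page (made up): one name-based tag and one property-based tag.
$html = '<html><head>'
    . '<meta name="description" content="Plain description">'
    . '<meta property="og:title" content="Open Graph title">'
    . '</head><body></body></html>';

// get_meta_tags() reads from a file or URL, so write the sample to a temporary file.
$file = tempnam(sys_get_temp_dir(), 'og');
file_put_contents($file, $html);
$tags = get_meta_tags($file);
unlink($file);

var_dump(isset($tags['description'])); // the name-based tag is picked up
var_dump(isset($tags['og:title']));    // the property-based tag is not
```

Pages that repeat the Open Graph values in name-based tags would still yield something useful here, but that depends entirely on the page.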

Related

Getting link tag via DOMDocument

I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
echo 'url: '. $item->getElementsByTagName("link")->item(0)->nodeValue;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.
I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element, check for the right rel value, and then take the href attribute from it:
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
$url = "";
// Look for the 'right' link tag and extract URL from that
foreach ( $item->getElementsByTagName("link") as $link ) {
if ( $link->getAttribute("rel") == "self" ) {
$url = $link->getAttribute("href");
break;
}
}
echo 'url: '. $url;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
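For comparison, here is a sketch of the XPath route that the answer sets aside. It assumes the converted feed is well-formed XML loaded with loadXML(), and that the link elements live in the Atom namespace; the sample feed below is invented for illustration and may not match the real atom2rss.xsl output.

```php
<?php
// Invented sample feed: an RSS item carrying namespaced Atom links.
$rss = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <item>
      <title>Example post</title>
      <atom:link rel="alternate" href="https://example.com/post.html"/>
      <atom:link rel="self" href="https://example.com/feeds/1"/>
    </item>
  </channel>
</rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($rss);

$xpath = new DOMXPath($feed);
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

$rows = array();
foreach ($xpath->query('//item') as $item) {
    $rows[] = array(
        // string() yields '' when no matching node exists, so no null checks are needed.
        'url'   => $xpath->evaluate('string(atom:link[@rel="self"]/@href)', $item),
        'title' => $xpath->evaluate('string(title)', $item),
    );
}
print_r($rows);
```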
function get_links($link)
{
$ret = array();
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents($link));
$dom->preserveWhiteSpace = false;
$links = $dom->getElementsByTagName('a');
foreach ($links as $tag){
$ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
}
return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url . "\n";
}

parse tei domxpath get text child tag inside evaluate loop

From a string that contains a TEI file, I generate an index for navigating its blocks. I retrieve all the div tags, and I also want to get the content of the <head> tag inside the current div, if present.
Example tei file:
<div type="lib" n="1"><head>LIBER I</head>...
<div type="pr">...</div>
<div type="cap" n="1"><head>CAP EX</head><p><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div>
<div type="cap" n="2"><head>CAP EX</head><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div>
</div>
I tried this, but it doesn't work:
//source file:
$fulltext = '<div type="lib" n="1"><head>LIBER I</head>...<div type="pr">...</div><div type="cap" n="1"><head>CAP EX</head><p><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div><div type="cap" n="2"><head>CAP EX</head><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div></div>';
$dom = new DOMDocument();
@$dom->loadHTML($fulltext);
$domx = new DOMXPath($dom);
$entries = $domx->evaluate("//div");
echo '<ul>';
foreach ($entries as $entry){
$title = '';
$type = $entry->getAttribute( 'type' );
$n = $entry->getAttribute( 'n' );
$head = $domx->evaluate("string(./head[1])",$entry);
if( $head != '' ) $title = $head; else $title = $n;
echo '<li><a href="#'.$type.'-'.$n.'">'.$title.'</a></li>';
}
echo '</ul>';
The line that doesn't work:
$head = $domx->evaluate("string(./head[1])",$entry);
Error returned:
DOMDocument::loadHTML(): htmlParseStartTag: misplaced <head> tag in Entity, line: 3
The purpose of this line is to get the text of the child tag head inside the loop (in this example "LIBER I")
Using the @ symbol on the load can hide all sorts of issues. If you take it out, you get errors about your document.
If however you change the line to
$dom->loadXML($fulltext);
The output gives you what you're after.
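One caveat: loadXML() needs well-formed input, and the $fulltext string in the question has a <p> that is never closed, so that exact string may still be rejected. A self-contained sketch of the approach on a shortened, well-formed TEI fragment (the sample text and the $index structure are illustrative):

```php
<?php
// Shortened, well-formed TEI fragment (every element is closed).
$tei = '<div type="lib" n="1"><head>LIBER I</head>'
     . '<div type="pr">text</div>'
     . '<div type="cap" n="1"><head>CAP I</head><p>text</p></div>'
     . '</div>';

$dom = new DOMDocument();
$dom->loadXML($tei); // XML parser: <head> is just another element here

$xpath = new DOMXPath($dom);
$index = array();
foreach ($xpath->query('//div') as $div) {
    // Text of the first <head> child of this div, or '' when there is none.
    $head = $xpath->evaluate('string(./head[1])', $div);
    $n    = $div->getAttribute('n');
    $index[] = array(
        'type'  => $div->getAttribute('type'),
        'label' => $head !== '' ? $head : $n, // fall back to the n attribute
    );
}
print_r($index);
```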
Resolved using XMLReader:
$level = 0;
$indici_bc = array();
$indici_head = array();
$passed_milestone = false;
$xml = new XMLReader();
$xml->open($pathTei);
//$xml->xml($testo);
while ($xml->read()){
if($xml->nodeType == XMLReader::END_ELEMENT && $xml->name == 'div'){
$level--;
$last_blocco = $xml->name;
if($passed_milestone){ $level--; $passed_milestone = false; }
}
if($xml->nodeType == XMLReader::ELEMENT && ($xml->name == 'div' || $xml->name == 'milestone' )){
$blocco = $xml->name;
$type = $xml->getAttribute('type');
$n = $xml->getAttribute('n');
$unit = $xml->getAttribute('unit') ?? ''; // getAttribute() returns null when the attribute is missing
//here I get the child node
$node = new SimpleXMLElement($xml->readOuterXML());
$head = $node->head ? (string)$node->head : '';
$indici_head[] = $head;
if($last_blocco != 'milestone') $level++;
if($blocco == 'div') $bc[$level] = $n; else $bc[($level+1)] = $n;
$bc_str = '';
for($j=1;$j<$level;$j++){
if( $bc_str != '' ) $bc_str.='.';
$bc_str.=$bc[$j];
}
if( $bc_str != '' ) $bc_str.='.';
$bc_str.=$n;
$last_blocco = $xml->name;
if( $blocco == 'milestone' ) $passed_milestone = true;
$indici_bc[]=$bc_str;
}
}
$xml->close();

get value using DOMDocument

I am trying to fetch a value from the following html snippet using DOMDocument:
<h3>
<meta itemprop="priceCurrency" content="EUR">€
<meta itemprop="price" content="465.0000">465
</h3>
I need to fetch the value 465 from this code snippet. To avail this I am using the following code:
foreach($dom->getElementsByTagName('h3') as $h) {
foreach($h->getElementsByTagName('meta') as $p) {
if($h->getAttribute('itemprop') == 'price') {
foreach($h->childNodes as $child) {
$name = $child->nodeValue;
echo $name;
$name = preg_replace('/[^0-9\,]/', '', $name);
// $name = number_format($name, 2, ',', ' ');
if (strpos($name,',') == false)
{
$name = $name .",00";
}
}
}
}
}
But this code is not fetching the value...can anyone please help me on this.
You have an invalid HTML. Where is the closing tag for meta? This is why you get the results you see.
To find what you are looking for you can use xpath:
$doc = new \DOMDocument();
$doc->loadXML($yourHTML);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//meta[@itemprop='price']");
echo $elements->item(0)->textContent;
Inside your loop, you're pointing in the wrong object:
foreach($h->childNodes as $child) {
// ^ its not supposed to be `$h`
You should point to $p instead.
After that just use your current condition, if it satisfies, then loop all the child nodes:
$price = '';
foreach($dom->getElementsByTagName('h3') as $h) {
foreach($h->getElementsByTagName('meta') as $p) {
if($p->getAttribute('itemprop') === 'price') {
foreach($h->childNodes as $c) {
if($c->nodeType == XML_TEXT_NODE) {
$price .= trim($c->textContent);
}
}
if(strpos($price, ',') === false) {
$price .= ',00';
}
}
}
}
Another way is to use xpath queries:
$xpath = new DOMXpath($dom);
$meta = $xpath->query('//h3/meta[@itemprop="price"]');
if($meta->length > 0) { // found
$price = trim($xpath->evaluate('string(./following-sibling::text()[1])', $meta->item(0)));
if(strpos($price, ',') === false) { $price .= ',00'; }
$currency = $xpath->evaluate('string(./preceding-sibling::meta[@itemprop="priceCurrency"]/following-sibling::text()[1])', $meta->item(0));
$price = "{$currency} {$price}";
echo $price;
}
Use jQuery, like this:
var priceCurrency = $('meta[itemprop="priceCurrency"]').attr("content");
var price = $('meta[itemprop="price"]').attr("content");
alert(priceCurrency + " " + price);
Outputs:
EUR 465.0000
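The same attribute-based lookup can be done server-side. The sketch below reads the content attributes with DOMXPath instead of the text nodes and formats the amount with a comma as the decimal separator. The sample markup is copied from the question without the € sign, and the output format is an assumption about what the asker wants.

```php
<?php
// Markup from the question, minus the currency symbol.
$html = '<h3>'
    . '<meta itemprop="priceCurrency" content="EUR">'
    . '<meta itemprop="price" content="465.0000">465'
    . '</h3>';

libxml_use_internal_errors(true); // keep HTML parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Read the machine-readable values straight from the content attributes.
$currency = $xpath->evaluate('string(//meta[@itemprop="priceCurrency"]/@content)');
$amount   = $xpath->evaluate('string(//meta[@itemprop="price"]/@content)');

// Two decimals, comma as decimal separator, space as thousands separator.
$price = number_format((float) $amount, 2, ',', ' ');
echo $currency . ' ' . $price;
```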

Quote issue in PHP

I have scraped data from a Telugu site. When I get a string such as "Suriya’s ‘24’ in legal tangle", the curly quotes are not recognized by PHP and are converted into different characters.
Code:
//
include "simple_html_dom.php";
// Get news from telugu site
$url = "http://www.123telugu.com/category/mnews";
$html = file_get_html($url);
$divs = $html->find('div.leading');
$result = array();
$status = FALSE;
$i = 0;
foreach ($divs as $d) {
$status = TRUE;
$title = $d->find('a', 0)->plaintext;
$result[$i]['Title'] = $title;
$link = $d->find('a', 0)->href;
$result[$i]['Link'] = $link;
$title = trim(mysql_real_escape_string($title)); // code for title
$html = file_get_html($link);
// code for image
$image = '';
foreach ($html->find('div.post-content') as $im) {
$image = $im->find('img', 0)->src; // code for image
}
$image = trim(str_replace('//', '', $image));
$result[$i]['Image'] = $image;
// code for content
$content = '';
foreach ($html->find('div.post-content p') as $co) {
$content.= $co->plaintext; // code for content
}
$result[$i]['Content'] = $content;
$i++;
}
echo json_encode(array('Status' => $status, 'Data' => $result));
Adding the following tag at the top of the page solves the issue:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Solution:
$string= iconv('utf-8', 'us-ascii//TRANSLIT', $string);
htmlspecialchars_decode() might be the function that you are looking for. Just run the final output from the scraper with this function. It should decode all the special HTML encoded characters.
Check out: http://php.net/htmlspecialchars_decode
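Note that htmlspecialchars_decode() only converts a handful of entities (such as &amp;amp;, &amp;quot;, &amp;lt; and &amp;gt;), so it would leave a numeric reference like &amp;#8217; untouched. If the scraped title really contains such references, which is an assumption since the raw scraper output is not shown, html_entity_decode() may be the better fit. A minimal sketch:

```php
<?php
// Assumed input: a title as a scraper might return it, with numeric character references.
$title = 'Suriya&#8217;s &#8216;24&#8217; in legal tangle';

// Decode named and numeric entities into real UTF-8 characters.
$decoded = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// JSON_UNESCAPED_UNICODE keeps the curly quotes readable instead of \uXXXX escapes.
echo json_encode(array('Title' => $decoded), JSON_UNESCAPED_UNICODE);
```

When the JSON is served over HTTP, sending a Content-Type header with charset=utf-8 helps the client interpret the bytes correctly.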

Parsing xml with simplexml_load

I am trying to parse an xml but I get a problem while I am trying to fetch image url.
My xml is:
<entry>
<title>The Title</title>
<id>http://example.com/post/367327.html</id>
<summary>Some extra text</summary>
<link rel="enclosure" href="http://example.com/photos/f_0px_30px/image687.jpg" type="image/jpeg" length="" />
</entry>
So far I am using the code below to fetch the other data:
$url = "http://msdssite.com/feeds/xml/myxml.xml";
$xml = simplexml_load_file($url);
foreach($xml->entry as $PRODUCT)
{
$my_title = trim($PRODUCT->title);
$url = trim($PRODUCT->id);
$myimg = $PRODUCT->link;
}
How can I parse the href from this: <link rel="enclosure" href="http://example.com/photos/f_0px_30px/image687.jpg" type="image/jpeg" length="" />
Since it seems that your entries can contain several link tags, you need to check that the type attribute has the value image/jpeg to be sure to obtain a link to an image:
ini_set("display_errors", "On");
$feedURL = 'http://OLDpost.gr/feeds/xml/category-takhs-xatzhs.xml';
$feed = simplexml_load_file($feedURL);
$results = array();
foreach($feed->entry as $entry) {
$result = array('title' => (string)$entry->title,
'url' => (string)$entry->id);
$links = $entry->link;
foreach ($links as $link) {
$linkAttr = $link->attributes();
if (isset($linkAttr['type']) && $linkAttr['type']=='image/jpeg') {
$result['img'] = (string)$linkAttr['href'];
break;
}
}
$results[] = $result;
}
print_r($results);
Note that using simplexml like that (the foreach loop to find the good link tag) isn't very handy. It's better to use an XPath query:
foreach($feed->entry as $entry) {
$entry->registerXPathNamespace('e', 'http://www.w3.org/2005/Atom');
$results[] = array(
'title' => (string)$entry->title,
'url' => (string)$entry->id,
'img' => (string)$entry->xpath('e:link[@type="image/jpeg"]/@href')[0][0]
);
}
If that's the exact XML, actually there is no need for a foreach. Try this:
$xml = simplexml_load_file($url);
$my_title = (string) $xml->title;
$myimg = (string) $xml->link->attributes()['href']; // 5.4 or above
echo $myimg; // http://example.com/photos/f_0px_30px/image687.jpg
Try:
foreach($xml->entry as $PRODUCT)
{
$my_title = trim($PRODUCT->title[0]);
$url = trim($PRODUCT->id[0]);
$myimg = $PRODUCT->link[0];
}
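To try these ideas without the remote feed, here is a self-contained sketch that parses an inline copy of the entry and selects the enclosure link by its rel attribute. The wrapping <feed> element, the extra alternate link, and the absence of a namespace are assumptions made for the example; a real Atom feed declares a namespace and would need registerXPathNamespace() as shown above.

```php
<?php
// Inline sample: the entry from the question wrapped in an assumed <feed> root.
$feedXml = <<<'XML'
<feed>
  <entry>
    <title>The Title</title>
    <id>http://example.com/post/367327.html</id>
    <summary>Some extra text</summary>
    <link rel="alternate" href="http://example.com/post/367327.html" type="text/html" />
    <link rel="enclosure" href="http://example.com/photos/f_0px_30px/image687.jpg" type="image/jpeg" length="" />
  </entry>
</feed>
XML;

$xml = simplexml_load_string($feedXml);

$results = array();
foreach ($xml->entry as $entry) {
    // Relative XPath: href attributes of this entry's enclosure links.
    $hrefs = $entry->xpath('link[@rel="enclosure"]/@href');
    $results[] = array(
        'title' => (string) $entry->title,
        'url'   => (string) $entry->id,
        'img'   => $hrefs ? (string) $hrefs[0] : null, // null when no enclosure exists
    );
}
print_r($results);
```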
