Extracting META using PHP simple_html_dom.php


I have read and tried every post about this but cannot get it to work.
This is the HTML:
<meta itemprop="interactionCount" content="UserPlays:4635">
<meta itemprop="interactionCount" content="UserLikes:4">
<meta itemprop="interactionCount" content="UserComments:0">
I need to extract the '4635' bit.
Code:
<?php
$html = file_get_html($url);
foreach($html->find("meta[name=interactionCount]")->getAttribute('content') as $element) {
$val = $element->innertext;
echo '<br>Value is: '.$val;
}
I get nothing back?

$metaData= '<meta itemprop="interactionCount" content="UserPlays:4635">
<meta itemprop="interactionCount" content="UserLikes:4">
<meta itemprop="interactionCount" content="UserComments:0">';
$dom = new DOMDocument();
$dom->loadHtml($metaData);
$metas = $dom->getElementsByTagName('meta');
foreach($metas as $el) {
list($user_param,$value) = explode(':',$el->getAttribute('content'));
// here check what you need
print $user_param.' '.$value.'<br/>';
}
// OUTPUT
UserPlays 4635
UserLikes 4
UserComments 0

include 'simple_html_dom.php';
$url = '...';
$html = file_get_html($url);
foreach ($html->find('meta[itemprop="interactionCount"]') as $element) {
list($key, $value) = explode(':', strval($element->content));
echo 'Value:'.$value."\n";
}
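For comparison, the same extraction can be sketched with PHP's built-in DOM extension instead of simple_html_dom. This is only a minimal sketch, assuming the markup shown in the question; the helper name extractInteractionCounts is made up for illustration.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): collects every
// interactionCount meta tag into a name => count array.
function extractInteractionCounts(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $counts = [];
    foreach ($xpath->query('//meta[@itemprop="interactionCount"]') as $meta) {
        // The content attribute looks like "UserPlays:4635".
        $parts = explode(':', $meta->getAttribute('content'), 2);
        if (count($parts) === 2) {
            $counts[$parts[0]] = (int) $parts[1];
        }
    }
    return $counts;
}

$html = '<meta itemprop="interactionCount" content="UserPlays:4635">
<meta itemprop="interactionCount" content="UserLikes:4">
<meta itemprop="interactionCount" content="UserComments:0">';

$counts = extractInteractionCounts($html);
echo $counts['UserPlays'], "\n"; // prints 4635
```

Keying the result by the part before the colon means the play count can be read directly, without scanning the list again.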

Related

How to display data from json url using php decode?

I can't display the data from this URL using PHP's json_decode:
https://api.mymemory.translated.net/get?q=Hello%20World!&langpair=en|it
Here is the data provider's documentation:
https://mymemory.translated.net/doc/spec.php
What I want is to set up a form that submits words and gets the translation back from their API.
Here is my code sample:
<?php
$json = file_get_contents('https://api.mymemory.translated.net/get?q=Hello%20World!&langpair=en|it');
// parse the JSON
$data = json_decode($json);
// show the translation
echo $data;
?>

My guess is that you want to loop over the data and use if statements to display it as you wish:
Test
$json = file_get_contents('https://api.mymemory.translated.net/get?q=Hello%20World!&langpair=en|it');
$data = json_decode($json, true);
if (isset($data["responseData"])) {
foreach ($data["responseData"] as $key => $value) {
// This if is to only display the translatedText value //
if ($key == 'translatedText' && !is_null($value)) {
$html = $value;
} else {
continue;
}
}
} else {
echo "Something is not right!";
}
echo $html;
Output
Ciao Mondo!
<?php
$html = '
<!DOCTYPE html>
<html lang="en">
<head>
<title>read JSON from URL</title>
</head>
<body>
';
$json = file_get_contents('https://api.mymemory.translated.net/get?q=Hello%20World!&langpair=en|it');
$data = json_decode($json, true);
foreach ($data["responseData"] as $key => $value) {
// This if is to only display the translatedText value //
if ($key == 'translatedText' && !is_null($value)) {
$html .= '<p>' . $value . '</p>';
} else {
continue;
}
}
$html .= '
</body>
</html>';
echo $html;
?>
Output
<!DOCTYPE html>
<html lang="en">
<head>
<title>read JSON from URL</title>
</head>
<body>
<p>Ciao Mondo!</p>
</body>
</html>

After a lot of research I got it working this way:
$json = file_get_contents('https://api.mymemory.translated.net/get?q=Map&langpair=en|it');
$obj = json_decode($json);
echo $obj->responseData->translatedText;
Thank you all.
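Reading the nested property directly works as long as the request succeeds. A slightly more defensive variant might look like the sketch below. It assumes only the responseData.translatedText shape shown above; the helper name extractTranslation and the trimmed sample body are made up for illustration.

```php
<?php
// Hypothetical helper: pulls translatedText out of a MyMemory-style
// response body and returns null when anything is missing.
function extractTranslation(string $json): ?string
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        return null; // body was not a JSON object or array
    }
    $text = $data['responseData']['translatedText'] ?? null;
    return is_string($text) ? $text : null;
}

// Trimmed sample body, assumed for illustration; the real response
// carries more fields.
$sample = '{"responseData":{"translatedText":"Ciao Mondo!"}}';
echo extractTranslation($sample) ?? 'No translation found', "\n";
```

In real use, the string returned by file_get_contents() would be passed in after checking that it is not false.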

How to appendXML(fragment) with empty attribute in PHP DOMDocument

I'm trying to add a piece of HTML code that contains an attribute like {{ some_attr }}, i.e. one with an empty value. For example:
<?php
$pageHTML = '<!doctype html>
<html>
<head>
</head>
<body>
<div id="root">Initial content</div>
</body>
</html>';
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($pageHTML);
libxml_use_internal_errors(false);
$tmplCode = '<div {{ some_attr }}>New content</div>';
foreach($dom->getElementsByTagName('body')[0]->getElementsByTagName('*') as $node) {
if($node->getAttribute('id') == 'root') {
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($tmplCode);
$node->appendChild($fragment);
}
}
echo $dom->saveHTML((new \DOMXPath($dom))->query('/')->item(0));
?>
Since appendXML() doesn't accept an empty attribute, I don't get my div with "New content".
I've tried
$dom->loadHTML($pageHTML, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
and
foreach (libxml_get_errors() as $error) {
// Ignore unknown tag errors
if ($error->code === 801) continue;
throw new Exception("Could not parse template");
}
libxml_clear_errors();
before saveHTML() as described by the link https://stackoverflow.com/a/39671548
I've also tried
@$fragment = $dom->createDocumentFragment();
@$fragment->appendXML($tmplCode);
as mentioned by the link https://stackoverflow.com/a/15998516
But none of these solutions work.
Is it possible to append code with an empty attribute using appendXML()?

OK, I've just found a solution at https://stackoverflow.com/a/4401089/3208225
...
if($node->getAttribute('id') == 'root') {
$tmpDoc = new DOMDocument();
$tmpDoc->loadHTML($tmplCode);
foreach ($tmpDoc->getElementsByTagName('body')->item(0)->childNodes as $newNode) {
$newNode = $dom->importNode($newNode, true);
$node->nodeValue = '';
$node->appendChild($newNode);
}
}
...
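The underlying reason is that appendXML() expects well-formed XML, and in XML every attribute must have a value, whereas the HTML parser is lenient. The sketch below contrasts the two routes. It uses a plain valueless attribute (hidden) rather than the {{ some_attr }} placeholder, because how the HTML parser treats the braces is not something this sketch tries to establish.

```php
<?php
libxml_use_internal_errors(true); // collect parser errors instead of emitting warnings

$dom = new DOMDocument();
$dom->loadHTML('<!doctype html><html><body><div id="root"></div></body></html>');
$root = $dom->getElementsByTagName('div')->item(0);

$snippet = '<div hidden>New content</div>'; // attribute without a value

// XML route: the fragment parser rejects the valueless attribute.
$fragment = $dom->createDocumentFragment();
$xmlOk = @$fragment->appendXML($snippet);

// HTML route: parse leniently in a scratch document, then import the node.
$tmp = new DOMDocument();
$tmp->loadHTML($snippet);
$parsed = $tmp->getElementsByTagName('div')->item(0);
$imported = $dom->importNode($parsed, true);
$root->appendChild($imported);

libxml_clear_errors();
var_dump($xmlOk, $imported->hasAttribute('hidden'));
```

The import step is needed because a node created by one DOMDocument cannot be appended to another directly.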

Extract the data from content of HTML

I'm trying to extract data from HTML. I did it with curl, but all I need is to pass the title to another variable:
<meta property="og:url" content="https://example.com/">
How do I extract this, and is there a better way?

You should use a parser to pull values out of HTML files, strings, or documents. Here's an example using DOMDocument.
$string = '<meta property="og:url" content="https://example.com/">';
$doc = new DOMDocument();
$doc->loadHTML($string);
$metas = $doc->getElementsByTagName('meta');
foreach($metas as $meta) {
if($meta->getAttribute('property') == 'og:url') {
echo $meta->getAttribute('content');
}
}
Output:
https://example.com/

If you are loading the HTML from a remote location and not a local string you can use DOM for this using something like:
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile('https://evernote.com');
libxml_clear_errors();
$xp = new DOMXpath($dom);
$nodes = $xp->query('//meta[@property="og:url"]');
if(!is_null($nodes->item(0)->attributes)) {
foreach ($nodes->item(0)->attributes as $attr) {
if($attr->value!="og:url") {
print $attr->value;
}
}
}
This outputs the expected value:
https://evernote.com/
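The attribute loop above can be shortened by letting XPath select the content attribute itself. This is a minimal sketch, assuming a local HTML string instead of a remote page so that it runs offline; the variable names are illustrative.

```php
<?php
$html = '<html><head><meta property="og:url" content="https://example.com/"></head><body></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
// Select the content attribute node of the matching meta tag.
$attrs = $xp->query('//meta[@property="og:url"]/@content');
$ogUrl = $attrs->length > 0 ? $attrs->item(0)->nodeValue : null;

echo $ogUrl ?? 'og:url not found', "\n";
```

Checking the length before calling item(0) avoids reading a property on null when the page has no og:url tag.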

Can't get HTML from a webpage

file_get_html returns false, but if I fetch the data as a string, everything is OK.
$url = "http://www.dkb-handball-bundesliga.de/de/dkb-hbl/spielplan/spielplan-chronologisch/";
$output = file_get_contents($url);
print_r($output); // this returns a string
$html = file_get_html($url);
print_r($html); // this returns false
I tried with cURL, but the result is the same.
If I change the URL, for example to the one below, everything works:
$url='http://www.dkb-handball-bundesliga.de/de/s/spiele/2014-2015/dkb-handball-bundesliga/1--spieltag--bergischer-hc-vs-sg-bbm-bietigheim/';

This will get you the data:
<?php
// put your code here
include_once './simple_html_dom.php';
$html = file_get_html("http://www.dkb-handball-bundesliga.de/");
$links = array();
foreach($html->find('a') as $a) {
$links[] = $a->href;
}
print_r($links);
?>
<html>
<head>
<title>TODO supply a title</title>
<meta charset="ISO-8859-1">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
<div>TODO write content</div>
<table>
<tr>
<th>My elements</th>
Hello world 1
Hello world 2
Hello world 3
Hello world 4
Hello world 5
</tr>
</table>
</body>
</html>
<?php
// You need to know the location of the file that you are calling.
include_once './simple_html_dom.php';
$html = file_get_html("http://localhost/PhpHelpers/examples.html");
$links = array();
foreach($html->find('a') as $key => $val) {
$links[$key] = $val;
}
print_r($links);
?>

How to get Open Graph Protocol of a webpage by php?

PHP has a simple function for getting the meta tags of a webpage (get_meta_tags), but it only works for meta tags with name attributes. However, the Open Graph Protocol is becoming more and more popular these days. What is the easiest way to get the OGP values from a webpage? For example:
<meta property="og:url" content="">
<meta property="og:title" content="">
<meta property="og:description" content="">
<meta property="og:type" content="">
The basic way I see is to get the page via cURL and parse it with regex. Any ideas?

Really simple and well done:
Using https://github.com/scottmac/opengraph
$graph = OpenGraph::fetch('http://www.avessotv.com.br/bastidores-pantene-institute-experience-pg.html');
print_r($graph);
Will return
OpenGraph Object
(
[_values:OpenGraph:private] => Array
(
[type] => article
[video] => http://www.avessotv.com.br/player/flowplayer/flowplayer-3.2.7.swf?config=%7B%27clip%27%3A%7B%27url%27%3A%27http%3A%2F%2Fwww.avessotv.com.br%2Fmedia%2Fprogramas%2Fpantene.flv%27%7D%7D
[image] => /wp-content/thumbnails/9025.jpg
[site_name] => Programa Avesso - Bastidores
[title] => Bastidores “Pantene Institute Experience” P&G
[url] => http://www.avessotv.com.br/bastidores-pantene-institute-experience-pg.html
[description] => Confira os bastidores do Pantene Institute Experience, da Procter & Gamble. www.pantene.com.br Mais imagens:
)
[_position:OpenGraph:private] => 0
)

When parsing data from HTML, you really shouldn't use regex. Take a look at the DOMXPath Query function.
Now, the actual code could be:
[EDIT] A better query for XPath was given by Stefan Gehrig, so the code can be shortened to:
libxml_use_internal_errors(true); // Yeah if you are so worried about using @ with warnings
$doc = new DomDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
$property = $meta->getAttribute('property');
$content = $meta->getAttribute('content');
$rmetas[$property] = $content;
}
var_dump($rmetas);
Instead of:
$doc = new DomDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
$property = $meta->getAttribute('property');
$content = $meta->getAttribute('content');
if(!empty($property) && preg_match('#^og:#', $property)) {
$rmetas[$property] = $content;
}
}
var_dump($rmetas);

How about:
preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $str, $matches);
So, yes, grab the page any way you can and parse it with a regex.

This function does the job without dependency and DOM parsing:
function getOgTags($html)
{
// [^>]*? allows other attributes (e.g. itemprop) between property and content
$pattern='/<\s*meta\s+property="og:([^"]+)"[^>]*?\s+content="([^"]*)/i';
if(preg_match_all($pattern, $html, $out))
return array_combine($out[1], $out[2]);
return array();
}
Test code:
$x=' <title>php - Using domDocument, and parsing info, I would like to get the \'href\' contents of an \'a\' tag - Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="referrer" content="origin" />
<meta property="og:type" content="website"/>
<meta property="og:url" content="https://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of"/>
<meta property="og:image" itemprop="image primaryImageOfPage" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:domain" content="stackoverflow.com"/>
<meta name="twitter:title" property="og:title" itemprop="title name" content="Using domDocument, and parsing info, I would like to get the \'href\' contents of an \'a\' tag" />
<meta name="twitter:description" property="og:description" itemprop="description" content="Possible Duplicate:
Regular expression for grabbing the href attribute of an A element
This displays the what is between the a tag, but I would like a way to get the href contents as well.
Is..." />';
echo '<pre>';
var_dump(getOgTags($x));
and you get:
array(3) {
["type"]=>
string(7) "website"
["url"]=>
string(119) "https://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of"
["image"]=>
string(85) "https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded"
}

With this method you get a key/value array of Facebook Open Graph tags.
$url="http://fbcpictures.in";
$site_html= file_get_contents($url);
$matches=null;
preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $site_html,$matches);
$ogtags=array();
for($i=0;$i<count($matches[1]);$i++)
{
$ogtags[$matches[1][$i]]=$matches[2][$i];
}

Here is what I am using to extract OG tags.
function get_og_tags($get_url = "", $ret = 0)
{
if ($get_url != "") {
$title = "";
$description = "";
$keywords = "";
$og_title = "";
$og_image = "";
$og_url = "";
$og_description = "";
$full_link = "";
$image_urls = array();
$og_video_name = "";
$youtube_video_url="";
$get_url = $get_url;
$ret_data = file_get_contents_curl($get_url);
//$html = file_get_contents($get_url);
$html = $ret_data['curlData'];
$full_link = $ret_data['full_link'];
$full_link = addhttp($full_link);
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
if ($nodes->length == 0) {
$title = $get_url;
} else {
$title = $nodes->item(0)->nodeValue;
}
//get and display what you need:
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++) {
$meta = $metas->item($i);
if ($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
if ($meta->getAttribute('name') == 'keywords')
$keywords = $meta->getAttribute('content');
}
$og = $doc->getElementsByTagName('og');
for ($i = 0; $i < $metas->length; $i++) {
$meta = $metas->item($i);
if ($meta->getAttribute('property') == 'og:title')
$og_title = $meta->getAttribute('content');
if ($meta->getAttribute('property') == 'og:url')
$og_url = $meta->getAttribute('content');
if ($meta->getAttribute('property') == 'og:image')
$og_image = $meta->getAttribute('content');
if ($meta->getAttribute('property') == 'og:description')
$og_description = $meta->getAttribute('content');
// for sociotube video share
if ($meta->getAttribute('property') == 'og:video_name')
$og_video_name = $meta->getAttribute('content');
// for sociotube youtube video share
if ($meta->getAttribute('property') == 'og:youtube_video_url')
$youtube_video_url = $meta->getAttribute('content');
}
//if no image found grab images from body
if ($og_image != "") {
$image_urls[] = $og_image;
} else {
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img"); // find your image
$imgCount = 0;
for ($i = 0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i); // gets the 1st image
if (isset($node->attributes->getNamedItem('src')->nodeValue)) {
$src = $node->attributes->getNamedItem('src')->nodeValue;
}
if (isset($node->attributes->getNamedItem('src')->value)) {
$src = $node->attributes->getNamedItem('src')->value;
}
if (isset($src)) {
if (!preg_match('/blank.(.*)/i', $src) && filter_var($src, FILTER_VALIDATE_URL)) {
$image_urls[] = $src;
if ($imgCount == 10) break;
$imgCount++;
}
}
}
}
$page_title = ($og_title == "") ? $title : $og_title;
if(!empty($og_video_name)){
// for sociotube video share
$page_body = $og_video_name;
}else{
// for post share
$page_body = ($og_description == "") ? $description : $og_description;
}
$output = array('title' => $page_title, 'images' => $image_urls, 'content' => $page_body, 'link' => $full_link,'video_name'=>$og_video_name,'youtube_video_url'=>$youtube_video_url);
if ($ret == 1) {
return $output; // return the data as an array
}
echo json_encode($output); //output JSON data
die;
} else {
$data = array('error' => "Url not found");
if ($ret == 1) {
return $data; // return the error as an array
}
echo json_encode($data);
die;
}
}
Usage of the function:
$url = "https://www.alectronics.com";
$tagsArray = get_og_tags($url);
print_r($tagsArray);

The more XMLish way would be to use XPath:
$xml = simplexml_load_file('http://ogp.me/');
$xml->registerXPathNamespace('h', 'http://www.w3.org/1999/xhtml');
$result = array();
foreach ($xml->xpath('//h:meta[starts-with(@property, \'og:\')]') as $meta) {
$result[(string)$meta['property']] = (string)$meta['content'];
}
print_r($result);
Unfortunately the namespace registration is needed if the HTML document uses a namespace declaration in the <html>-tag.

With native PHP function get_meta_tags().
https://php.net/get_meta_tags
