This question already has answers here:
How to get Open Graph Protocol of a webpage by php?
(8 answers)
Closed 8 years ago.
I am trying to retrieve some metadata contained in a SimpleXMLElement. I am using XPath and I am struggling to get the values that interest me.
Here is an extract of the webpage header (from http://www.wayfair.de/CleverFurn-Couchtisch-Abby-69318X2-MFE2223.html):
Do you know how I could retrieve all the og: metadata into an array containing:
1) og:type
2) og:url
3) og:image
....
x) og:upc
<meta xmlns:og="http://opengraphprotocol.org/schema/" property="og:title" content="CleverFurn Couchtisch &quot;Abby&quot;" />
And here's my PHP code:
<?php
$html = file_get_contents("http://www.wayfair.de/CleverFurn-Couchtisch-Abby-69318X2-MFE2223.html");
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>".$html."</body></html>"); // @ suppresses warnings about malformed HTML
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*/meta[@property='og:url']");
if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[". $element->nodeName. "]";
        var_dump($element);
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue. "\n";
        }
    }
}
?>
Just found the answer:
How to get Open Graph Protocol of a webpage by php?
<?php
$html = file_get_contents("http://www.wayfair.de/CleverFurn-Couchtisch-Abby-69318X2-MFE2223.html");
libxml_use_internal_errors(true); // yeah, if you are so worried about using @ with warnings
$doc = new DomDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array(); // initialize so the array exists even when no og: tags are found
foreach ($metas as $meta) {
    $property = $meta->getAttribute('property');
    $content = $meta->getAttribute('content');
    $rmetas[$property] = $content;
}
var_dump($rmetas);
?>
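Once $rmetas is filled, individual properties can be read with a fallback for pages that omit a tag (which keys exist depends on what the page actually declares):

$ogTitle = isset($rmetas['og:title']) ? $rmetas['og:title'] : null;
$ogImage = isset($rmetas['og:image']) ? $rmetas['og:image'] : null;
echo $ogTitle . ' / ' . $ogImage;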
Related
I'm trying to parse a page with XPath, but I can't manage to get the body class.
Here is what I'm trying:
<?php
$url = 'http://figurinepop.com/mickey-paintbrush-disney-funko';
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//link[@rel="canonical"]/@href');
foreach ($nodes as $node) {
    $canonical = $node->nodeValue;
}
$nodes = $xpath->query('//html/body/@class');
foreach ($nodes as $node) {
    $bodyclass = $node->nodeValue;
}
$output['canonical'] = $canonical;
$output['bodyclass'] = $bodyclass;
echo '<pre>'; print_r($output); echo '</pre>';
?>
Here is what I get:
Array
(
    [canonical] => http://figurinepop.com/mickey-paintbrush-disney-funko
    [bodyclass] =>
)
It works with many elements (title, canonical, div...) but not with the body class.
I've tested the XPath query with a Chrome extension and it seems well formed.
What is wrong?
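One workaround worth trying (a sketch, untested against that specific page): select the body element itself and read the attribute with getAttribute() instead of addressing the attribute node in XPath. If the page's HTML is malformed, libxml's recovery may be rearranging the tree, so suppressing the errors and querying the element directly is more forgiving:

$url = 'http://figurinepop.com/mickey-paintbrush-disney-funko';
libxml_use_internal_errors(true); // the page's HTML may be malformed; keep libxml quiet
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents($url));
$xpath = new DOMXpath($doc);
$bodyclass = null;
foreach ($xpath->query('//body') as $node) {
    // read the attribute from the element instead of selecting the attribute node
    $bodyclass = $node->getAttribute('class');
}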
I'm trying to extract data from HTML. I did it with curl, but all I need is to pass the title to another variable:
<meta property="og:url" content="https://example.com/">
How to extract this, and is there a better way?
You should use a parser to pull values out of HTML files/strings/documents. Here's an example using DOMDocument.
$string = '<meta property="og:url" content="https://example.com/">';
$doc = new DOMDocument();
$doc->loadHTML($string);
$metas = $doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
    if ($meta->getAttribute('property') == 'og:url') {
        echo $meta->getAttribute('content');
    }
}
Output:
https://example.com/
If you are loading the HTML from a remote location rather than a local string, you can use DOM for this with something like:
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile('https://evernote.com');
libxml_clear_errors();
$xp = new DOMXpath($dom);
$nodes = $xp->query('//meta[@property="og:url"]');
if (!is_null($nodes->item(0)->attributes)) {
    foreach ($nodes->item(0)->attributes as $attr) {
        // skip the property attribute's value ("og:url") and print the other one (the content)
        if ($attr->value != "og:url") {
            print $attr->value;
        }
    }
}
This outputs the expected value:
https://evernote.com/
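Since the matched node is a DOMElement, the attribute loop can also be replaced with a direct getAttribute() call — a slightly simpler sketch of the same idea:

$node = $nodes->item(0);
if ($node !== null) {
    // read the content attribute directly instead of looping over all attributes
    print $node->getAttribute('content');
}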
This question already has answers here:
SimpleXML: Selecting Elements Which Have A Certain Attribute Value
(2 answers)
Closed 8 years ago.
I'm getting an XML document using the file_get_contents function and then creating a SimpleXMLElement from it.
The xml file can be seen here: http://ws.audioscrobbler.com/2.0/?method=artist.getinfo&artist=Nirvana&api_key=0ca5b0824b7973303c361510e7dbfced
The problem is that I need to get the value of lfm->artist->image[@size='small'] and I can't find out how to do it.
You should use DOMXPath for this: http://php.net/manual/en/class.domxpath.php
This XPath query would work for your XML:
//lfm/artist/image[@size='small']
As follows:
$doc = new DOMDocument();
$doc->load($url); // load() fetches and parses the URL; loadXML() would expect an XML string, not a URL
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//lfm/artist/image[@size='small']");
if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[". $element->nodeName. "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue. "\n";
        }
    }
}
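Since the question starts from a SimpleXMLElement, the same query also works without switching to DOM, via SimpleXMLElement::xpath() — a minimal sketch, assuming the feed declares no default namespace:

$xml = new SimpleXMLElement(file_get_contents($url));
$images = $xml->xpath("/lfm/artist/image[@size='small']");
if (!empty($images)) {
    echo (string) $images[0]; // the image URL is the element's text content
}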
This question already has answers here:
Getting title and meta tags from external website
(21 answers)
Closed 9 years ago.
I am trying to show RSS feed links on my website. All is going well, but it takes a long time to fetch the og:image property using file_get_contents(). Is there another way to fetch meta tag properties?
Would Python be any faster at getting these tags?
This is how I used to get all the og:tags:
libxml_use_internal_errors(true);
$doc = new DomDocument();
$doc->loadHTML(file_get_contents($url));
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
foreach ($metas as $meta) {
    $property = $meta->getAttribute('property');
    $content = $meta->getAttribute('content');
    echo '<h1>Meta '.$property.' <span>'.$content.'</span></h1>';
}
<?php
$page_content = file_get_contents('http://example.com');
$dom_obj = new DOMDocument();
$dom_obj->loadHTML($page_content);
$meta_val = null;
foreach ($dom_obj->getElementsByTagName('meta') as $meta) {
    if ($meta->getAttribute('property') == 'og:image') {
        $meta_val = $meta->getAttribute('content');
    }
}
echo $meta_val;
?>
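If file_get_contents() feels slow, the bottleneck is usually the remote download rather than the parsing, so switching languages is unlikely to help. A cURL request with explicit timeouts at least bounds the wait — a sketch, not a guaranteed speed-up:

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // give up connecting after 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // hard cap on the whole request
$page_content = curl_exec($ch);
curl_close($ch);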
This question already has answers here:
convert part of dom element to string with html tags inside of them
(2 answers)
Closed 9 years ago.
I'm trying to echo HTML using PHP DOM:
$doc = new \DomDocument('1.0', 'UTF-8');
$doc->loadHTMLFile("http://www.nu.nl");
$tags = $doc->getElementsByTagName('a');
echo $doc->saveHTML($tags);
This is getting me a blank page. I also tried:
$doc = new DOMDocument();
$doc->loadHTMLFile("http://www.nu.nl");
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    echo $link->getAttribute('href') . '<br />';
}
This is getting me the "href" as plain text. I have Googled for hours now and tried many things but I can't figure out how to output HTML as HTML.
Here is a fix that will prepend the root URL to relative links:
$pageurl = "http://www.nu.nl";
$html = file_get_contents($pageurl);
$html = str_replace('&', '&amp;', $html); // escape bare ampersands so the parser does not choke
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed HTML
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    $myLink = $link->getAttribute('href');
    if (substr($myLink, 0, 7) == 'http://') {
        // already absolute: output as-is
        echo '<a href="'.$myLink.'">'.$myLink.'</a><br/>';
    } else {
        // relative: prepend the root URL
        echo '<a href="'.$pageurl.$myLink.'">'.$myLink.'</a><br/>';
    }
}
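The substr() test only recognizes http:// links; if https:// links can occur too, a parse_url() based check is more robust. A sketch with a made-up helper name:

function makeAbsolute($base, $link) {
    // any URL that already carries a scheme (http, https, ...) is left alone
    if (parse_url($link, PHP_URL_SCHEME) !== null) {
        return $link;
    }
    // simplified join; protocol-relative URLs (//host/path) would need extra handling
    return rtrim($base, '/') . '/' . ltrim($link, '/');
}

echo makeAbsolute($pageurl, $myLink);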
You probably want to do something like this:
$doc = new DOMDocument();
$doc->loadHTMLFile("http://www.nu.nl");
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    $thelinks[] = '<a href="' . $link->getAttribute('href') . '">' . trim(preg_replace('/\s{2,}/', '', $link->textContent)) . '</a>';
}
var_dump($thelinks);
In the foreach, use:
echo $doc->saveHTML($link);
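Putting that together, a minimal sketch that prints each anchor as real HTML markup:

$doc = new DOMDocument();
@$doc->loadHTMLFile("http://www.nu.nl"); // @ suppresses warnings about malformed HTML
foreach ($doc->getElementsByTagName('a') as $link) {
    // saveHTML() with a node argument serializes just that node, tags included
    echo $doc->saveHTML($link) . '<br/>';
}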