How do I obtain the canonical value using PHP DomDocument? - php

<link rel='canonical' href='http://test.com/asdfsdf/sdf/' />
I need to get the canonical href value using Dom. How do I do this?

There are multiple ways to do this.
Using XML:
<?php
$html = "<link rel='canonical' href='http://test.com/asdfsdf/sdf/' />";
$xml = simplexml_load_string($html);
$attr = $xml->attributes();
print_r($attr);
?>
which outputs:
SimpleXMLElement Object
(
[@attributes] => Array
(
[rel] => canonical
[href] => http://test.com/asdfsdf/sdf/
)
)
or, using Dom:
<?php
$html = "<link rel='canonical' href='http://test.com/asdfsdf/sdf/' />";
$dom = new DOMDocument;
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('link');
foreach ($nodes as $node)
{
if ($node->getAttribute('rel') === 'canonical')
{
echo($node->getAttribute('href'));
}
}
?>
which outputs:
http://test.com/asdfsdf/sdf/
In both examples, more code is required if you're parsing an entire HTML file, but they demonstrate most of the structure that you'll need.
Code modified from this answer and the documentation on Dom.
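For a whole page rather than a one-line fragment, an XPath query can stand in for the loop. The following is only a minimal sketch: the helper name getCanonicalHref and the sample markup are made up for illustration, and libxml warnings are collected internally on the assumption that real-world HTML is rarely valid.

```php
<?php
// Hypothetical helper (not part of any library): returns the canonical URL or null.
function getCanonicalHref(string $html): ?string
{
    $dom = new DOMDocument;
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match <link> elements whose rel attribute is exactly "canonical".
    $links = $xpath->query('//link[@rel="canonical"]');
    if ($links === false || $links->length === 0) {
        return null;
    }
    return $links->item(0)->getAttribute('href');
}

// Made-up sample page for illustration.
$page = '<html><head><title>Test</title>'
      . "<link rel='stylesheet' href='/style.css' />"
      . "<link rel='canonical' href='http://test.com/asdfsdf/sdf/' />"
      . '</head><body><p>Hello</p></body></html>';

echo getCanonicalHref($page), "\n";
```

The exact-match predicate is a deliberate simplification; a page that writes rel="Canonical" or lists several rel values would need a looser comparison.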

Related

How to count all nodes in DOMDocument

Using PHP 7.1 I want to count the number of nodes in the root of this string:
<p>Lorem</p>
<p>Ipsum</p>
<div>Dolores</div>
<b>Amet</b>
Using following PHP:
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHTML($content);
$root = $dom->documentElement;
$children = $root->childNodes;
var_dump($children);
Returns:
object(DOMNodeList)#4 (1) {
["length"]=>
int(1)
}
I don't understand why the string of HTML only returns as 1 node. Additionally, I am unable to iterate through the nodes.
After a nice conversation in chat with @bart, we found a solution.
$content = "
<p>Lorem</p>
<p>Ipsum</p>
<div>Dolores</div>
<b>Amet</b>
";
$dom = new DOMDocument;
$dom->loadHTML($content);
$allElements = $dom->getElementsByTagName('*');
echo $allElements->length;
echo "<br />";
$node = array();
foreach($allElements as $element) {
if(array_key_exists($element->tagName, $node)) {
$node[$element->tagName] += 1;
} else {
$node[$element->tagName] = 1;
}
}
print_r($node);
PS: the html and body tags are added by default and counted as well, which increases the result by 2.
For the record (and despite the other answer being accepted), here is the correct way to list the child nodes. This includes the text nodes, which people forget are there!
<?php
$content = "
<p>Lorem</p>
<p>Ipsum</p>
<div>Dolores</div>
<b>Amet</b>
";
$dom = new DOMDocument;
$dom->loadHTML($content);
$nodes=[];
$bodyNodes = $dom->getElementsByTagName('body'); // returns DOMNodeList object
foreach($bodyNodes[0]->childNodes as $child) // assuming 1 <body> node
{
$nodes[]=$child->nodeName;
}
print_r($nodes);
Outputs this, illustrating the point...:
Array
(
[0] => p
[1] => #text
[2] => p
[3] => #text
[4] => div
[5] => #text
[6] => b
[7] => #text
)
Well I was already typing this answer up so I'll add it here anyway.
You have to iterate through the contents of a DOMNodeList object; it's not an array structure whose contents can be seen with var_dump() and friends. When iterating with foreach, each item is an instance of DOMNode. The number of nodes in the DOMNodeList is stored in its length property.
$content = "
<p>Lorem</p>
<p>Ipsum</p>
<div>Dolores</div>
<b>Amet</b>
";
$dom = new DomDocument();
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$allElements = $dom->getElementsByTagName('*');
echo "We found $allElements->length elements\n";
foreach ($allElements as $element) {
echo "$element->tagName = $element->nodeValue\n";
}
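To answer the original question directly (how many nodes sit at the top level of the fragment), one option is to count only the element children of the generated body. This is a sketch that assumes loadHTML() wraps the fragment in html and body tags, as noted above; the variable names are illustrative.

```php
<?php
$content = "
<p>Lorem</p>
<p>Ipsum</p>
<div>Dolores</div>
<b>Amet</b>
";
$dom = new DOMDocument;
$dom->loadHTML($content);

// loadHTML() adds <html> and <body>; the fragment ends up inside <body>.
$body = $dom->getElementsByTagName('body')->item(0);

$topLevel = [];
foreach ($body->childNodes as $child) {
    // Skip whitespace text nodes and keep element nodes only.
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $topLevel[] = $child->nodeName;
    }
}

echo count($topLevel), "\n";
print_r($topLevel);
```

Filtering on nodeType is what separates this from the childNodes listing above, where the text nodes between the tags are included.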

PHP - RSS Parser XML

Question: How to Parse <media:content URL="IMG" /> from XML?
OK. This is like asking why 1+1 = 2. And 2+2=Not Available.
Original Link:
How to Parse XML With SimpleXML and PHP // By: John Morris.
https://www.youtube.com/watch?v=_1F1Iq1IIS8
Using his method I can easily reach items on RSS FEED New York Times
With Following Code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>How to Parse XML with SimpleXML and PHP</title>
</head>
<body>
<?php
$url = 'http://rss.nytimes.com/services/xml/rss/nyt/Sports.xml';
$xml = simplexml_load_file($url) or die("Can't connect to URL");
?><pre><?php //print_r($xml); ?></pre><?php
foreach ($xml->channel->item as $item) {
printf('<li><a href="%s">%s</a></li>', $item->link, $item->title);
}
?>
</body>
</html>
GIVES:
Sparky Lyle in Monument Park? Fans Say Yes, but He Disagrees
The Thickly Accented American Behind the N.B.A. in France
On Pro Basketball: ‘That Got Ugly in a Hurry’: More Playoff Pain Delivered by the Spurs
...
BUT
to reach media:content you cannot use simplexml_load_file as-is, because it doesn't seem to grab any media:content tags.
So, yes, I searched around on the web.
I found this example on StackOverflow:
get media:description and media:content url from xml
But using the Code:
<?php
function feeds()
{
$url = "http://rss.nytimes.com/services/xml/rss/nyt/Sports.xml"; // xmld.xml contains above data
$feeds = file_get_contents($url);
$rss = simplexml_load_string($feeds);
foreach($rss->channel->item as $entry) {
if($entry->children('media', true)->content->attributes()) {
$md = $entry->children('media', true)->content->attributes();
print_r("$md->url");
}
}
}
?>
Gave me no errors. But also a blank page.
And it seems most people (judging from googling) have little to no idea how to really use media:content. So I have to turn to Stack Overflow and hope someone can provide an answer. I'm even willing to not use SimpleXML.
What I want is to grab the media:content url images and use them on an external site.
Also, if possible, I would like to put the parsed XML items into a SQL database.
I came up with this:
<?php
$url = "http://rss.nytimes.com/services/xml/rss/nyt/Sports.xml"; // xmld.xml contains above data
$feeds = file_get_contents($url);
$rss = simplexml_load_string($feeds);
$items = [];
foreach($rss->channel->item as $entry) {
$image = 'N/A';
$description = 'N/A';
foreach ($entry->children('media', true) as $k => $v) {
$attributes = $v->attributes();
if ($k == 'content') {
if (property_exists($attributes, 'url')) {
$image = $attributes->url;
}
}
if ($k == 'description') {
$description = $v;
}
}
$items[] = [
'link' => $entry->link,
'title' => $entry->title,
'image' => $image,
'description' => $description,
];
}
print_r($items);
?>
Giving:
Array
(
[0] => Array
(
[link] => SimpleXMLElement Object
(
[0] => https://www.nytimes.com/2017/04/17/sports/basketball/a-court-used-for-playing-hoops-since-1893-where-paris.html?partner=rss&emc=rss
)
[title] => SimpleXMLElement Object
(
[0] => A Court Used for Playing Hoops Since 1893. Where? Paris.
)
[image] => SimpleXMLElement Object
(
[0] => https://static01.nyt.com/images/2017/04/05/sports/basketball/05oldcourt10/05oldcourt10-moth-v13.jpg
)
[description] => SimpleXMLElement Object
(
[0] => The Y.M.C.A. in Paris says its basketball court, with its herringbone pattern and loose slats, is the oldest one in the world. It has been continuously functional since the building opened in 1893.
)
)
.....
And you can iterate over the result:
foreach ($items as $item) {
printf('<img src="%s">', $item['image']);
printf('<a href="%s">%s</a>', $item['link'], $item['title']);
}
Hope this helps.
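If plain strings are more convenient than SimpleXMLElement objects (for example, to bind them to a prepared SQL statement later), the values can be cast while they are collected. The sketch below runs against a small made-up feed instead of the live URL, so the item titles and example.com addresses are assumptions; only the media namespace URI is the one Media RSS feeds normally declare.

```php
<?php
// Made-up sample feed for illustration.
$feed = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
      <media:content url="https://example.com/first.jpg" medium="image"/>
      <media:description>About the first story</media:description>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($feed);
$items = [];
foreach ($rss->channel->item as $entry) {
    // Children with the "media" prefix; true means the first argument is a prefix, not a URI.
    $media = $entry->children('media', true);

    $image = 'N/A';
    if (isset($media->content)) {
        // The url attribute itself carries no prefix.
        $attributes = $media->content->attributes();
        if (isset($attributes['url'])) {
            $image = (string) $attributes['url'];
        }
    }

    // A missing element casts to an empty string, so fall back to 'N/A'.
    $description = (string) $media->description;

    $items[] = [
        'link'        => (string) $entry->link,
        'title'       => (string) $entry->title,
        'image'       => $image,
        'description' => $description !== '' ? $description : 'N/A',
    ];
}
print_r($items);
```

With every value cast to a string, print_r shows a flat array per item instead of nested SimpleXMLElement objects.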

php DomXPath - how to strip html tags and its contents from nodeValue?

In this code
<root>
<main>
<cont>
<p>hello<a>world</a></p>
<p>hello</p>
<p>hello<a>world</a></p>
</cont>
</main>
</root>
I need to get only the text inside the <cont> tag, without the <a> tags and their contents,
so the result will be hello hello hello, without world.
You can select the text nodes that are direct children of each <p> tag:
$dom = new DOMDocument;
$dom->loadXml($xmlData);
$xpath = new DOMXpath($dom);
foreach ($xpath->query('//cont/p/text()') as $text) {
echo $text->textContent, "\n";
}
A simplexml_load_string() or simplexml_load_file() should be enough:
$xml_string = '<root> <main> <cont> <p>hello<a>world</a></p> <p>hello</p> <p>hello<a>world</a></p> </cont> </main></root>';
$xml = simplexml_load_string($xml_string);
$p = $xml->main->cont->p;
$paragraphs = [];
foreach($p as $value) {
$paragraphs[] = (string) $value;
}
echo '<pre>';
print_r($paragraphs);
Should show something like:
Array
(
[0] => hello
[1] => hello
[2] => hello
)
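For completeness, here is a self-contained sketch of the XPath approach that joins the pieces into the single string the question asks for. The variable names $parts and $result are illustrative, and trimming each text node is an assumption about how surrounding whitespace should be treated.

```php
<?php
$xmlData = '<root><main><cont>'
         . '<p>hello<a>world</a></p><p>hello</p><p>hello<a>world</a></p>'
         . '</cont></main></root>';

$dom = new DOMDocument;
$dom->loadXML($xmlData);
$xpath = new DOMXPath($dom);

$parts = [];
// text() selects only the text nodes directly under each <p>, so <a> contents are skipped.
foreach ($xpath->query('//cont/p/text()') as $text) {
    $trimmed = trim($text->nodeValue);
    if ($trimmed !== '') {
        $parts[] = $trimmed;
    }
}

$result = implode(' ', $parts);
echo $result, "\n"; // hello hello hello
```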

A regexp to retrieve either og:url meta or link rel="canonical"

I'm trying to write a script to scrape the canonical URL from a remote URL.
I'm not a professional developer, so if something is ugly in my code, any explanation would (and will) be appreciated.
What I'm trying to do is either look for:
<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />
<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />
... and extract the URL out of it.
My code so far :
$content = file_get_contents($url);
$content = strtolower($content);
$content = preg_replace("'<style[^>]*>.*</style>'siU",'',$content); // strip css
$content = preg_replace("'<script[^>]*>.*</script>'siU",'',$content); // strip js
$split = explode("\n",$content); // Separate each line
foreach ($split as $k => $v) // For each line
{
if (strpos(' '.$v,'<meta') || strpos(' '.$v,'<link')) // If contains a <meta or <link
{
// Check with regex and if found, return what I need (the URL)
}
}
return $split_content;
I've been fighting with regex for hours, trying to figure out how to do so, but it seems it's well above my knowledge.
Would someone know how I need to define this rule?
Plus, does my script seem okay to you, or is there room for improvement?
Thanks a bunch !
Using DOMDocument, this is how you can get the property and content:
$html = '<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />';
$dom = new DOMDocument;
$dom->loadHTML($html);
$attr = array();
foreach ($dom->getElementsByTagName('meta') as $meta) {
if ($meta->hasAttributes()) {
foreach ($meta->attributes as $attribute) {
$attr[$attribute->nodeName] = $attribute->nodeValue;
}
}
}
print_r($attr);
Output:
Array
(
[property] => og:url
[content] => http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html
)
You can get the same for the second URL:
$html = '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />';
$dom = new DOMDocument;
$dom->loadHTML($html);
$attr = array();
foreach ($dom->getElementsByTagName('link') as $link) {
if ($link->hasAttributes()) {
foreach ($link->attributes as $attribute) {
$attr[$attribute->nodeName] = $attribute->nodeValue;
}
}
}
print_r($attr);
Output:
Array
(
[rel] => canonical
[href] => http://www.another-canonical-url.com/is-here
)
Consider using DOMDocument: simply load your HTML into a DOMDocument object, call getElementsByTagName, and loop over the results until one of them has the right attributes, much as you would in JavaScript.
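A minimal sketch of that loop, covering both tags from the question, might look like the following. The helper name findCanonicalUrl, the order of preference (link first, then og:url) and the sample markup are assumptions made for illustration, not something the question prescribes.

```php
<?php
// Hypothetical helper: prefers <link rel="canonical">, falls back to <meta property="og:url">.
function findCanonicalUrl(string $html): ?string
{
    $dom = new DOMDocument;
    // Real pages are rarely valid HTML, so collect parser warnings internally.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'canonical') {
            return $link->getAttribute('href');
        }
    }
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('property')) === 'og:url') {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Made-up sample pages for illustration.
$withLink = '<html><head><title>a</title>'
          . '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />'
          . '</head><body></body></html>';
$withMeta = '<html><head><title>b</title>'
          . '<meta property="og:url" content="http://www.my-canonical-url.com/page.html" />'
          . '</head><body></body></html>';

echo findCanonicalUrl($withLink), "\n";
echo findCanonicalUrl($withMeta), "\n";
```

Because only the attribute used for matching is lowercased, the URL keeps its original case, which the strtolower() call on the whole page in the question would not preserve.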

PHP XML - Find out the path to a known value

Here is an XML bit:
[11] => SimpleXMLElement Object
(
[@attributes] => Array
(
[id] => 46e8f57e67db48b29d84dda77cf0ef51
[label] => Publications
)
[section] => Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[id] => 9a34d6b273914f18b2273e8de7c48fd6
[label] => Journal Articles
[recordId] => 1a5a5710b0e0468e92f9a2ced92906e3
)
I know the value "46e8f57e67db48b29d84dda77cf0ef51" but its location varies across files. Can I use XPath to find the path to this value? If not what could be used?
Latest trial that does not work:
$search = $xml->xpath("//text()=='047ec63e32fe450e943cb678339e8102'");
while(list( , $node) = each($search)) {
echo '047ec63e32fe450e943cb678339e8102',$node,"\n";
}
PHP's DOMNode objects have a method for that: DOMNode::getNodePath()
$xml = <<<'XML'
<root>
<child key="1">
<child key="2"/>
<child key="3"/>
</child>
</root>
XML;
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate('//child');
foreach ($nodes as $node) {
var_dump($node->getNodePath());
}
Output:
string(11) "/root/child"
string(20) "/root/child/child[1]"
string(20) "/root/child/child[2]"
SimpleXML works on the same underlying libxml nodes as DOM, and there is a function that lets you get the DOMNode for a SimpleXMLElement: dom_import_simplexml.
$xml = <<<'XML'
<root>
<child key="1">
<child key="2"/>
<child key="3"/>
</child>
</root>
XML;
$structure = simplexml_load_string($xml);
$elements = $structure->xpath('//child');
foreach ($elements as $element) {
$node = dom_import_simplexml($element);
var_dump($node->getNodePath());
}
To fetch an element by its attribute, XPath can be used.
Select all element nodes anywhere in the document using the wildcard:
//*
Filter them by the id attribute:
//*[@id = "46e8f57e67db48b29d84dda77cf0ef51"]
$dom = new DOMDocument();
$dom->loadXml('<node id="46e8f57e67db48b29d84dda77cf0ef51"/>');
$xpath = new DOMXpath($dom);
foreach ($xpath->evaluate('//*[@id = "46e8f57e67db48b29d84dda77cf0ef51"]') as $node) {
var_dump(
$node->getNodePath()
);
}
Is this string always in the @id attribute? Then a valid and distinct path is always //*[@id='46e8f57e67db48b29d84dda77cf0ef51'], no matter where it is.
To construct a path to a given node, use $node->getNodePath(), which returns an XPath expression for the current node. Also take into account this answer on constructing XPath expressions using @id attributes, similar to what Firebug does.
For SimpleXML you will have to do everything by hand. This code only supports element nodes; if you need to support attributes and other paths, you will have to add that yourself.
$results = $xml->xpath("/highways/route[66]");
foreach($results as $result) {
$path = "";
while (true) {
// Is there an @id attribute? Shorten the path.
if ($id = $result['id']) {
$path = "//".$result->getName()."[@id='".(string) $id."']".$path;
break;
}
// Determine preceding and following elements, build a position predicate from it.
$preceding = $result->xpath("preceding-sibling::".$result->getName());
$following = $result->xpath("following-sibling::".$result->getName());
$predicate = (count($preceding) + count($following)) > 0 ? "[".(count($preceding)+1)."]" : "";
$path = "/".$result->getName().$predicate.$path;
// Is there a parent node? Then go on.
$result = $result->xpath("parent::*");
if (count($result) > 0) $result = $result[0];
else break;
}
echo $path."\n";
}
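If it is not known which attribute holds the value, the search can be widened to every attribute in the document. The following is a sketch under assumptions: the element names root and group are invented (the question's dump does not show them), and the value is concatenated into the expression, which is only safe here because it contains no quotes.

```php
<?php
// Made-up document; only the id, label and recordId values come from the question.
$xml = <<<'XML'
<root>
  <group id="46e8f57e67db48b29d84dda77cf0ef51" label="Publications">
    <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles"
             recordId="1a5a5710b0e0468e92f9a2ced92906e3"/>
  </group>
</root>
XML;

$needle = '46e8f57e67db48b29d84dda77cf0ef51';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$paths = [];
// //@* selects every attribute node; the predicate keeps those whose value equals the needle.
foreach ($xpath->query('//@*[. = "' . $needle . '"]') as $attribute) {
    // Path of the element that owns the attribute, plus the attribute name.
    $paths[] = $attribute->ownerElement->getNodePath() . '/@' . $attribute->nodeName;
}
print_r($paths);
```

Each entry in $paths is itself an XPath expression, so it can be passed back to $xpath->query() to reach the same attribute again.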
