Xpath with html5lib in PHP


I have this basic code, but it doesn't work. How can I use XPath with html5lib in PHP, or with HTML5 in any other way?
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url);
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
//$elements = $dom->getElementsByTagName('h1');
foreach ($elements as $element)
{
var_dump($element);
}
No elements are found. Using $xpath->query('.') works for getting the root element (xpath in general seems to work). $dom->getElementsByTagName('h1') is working.

Use the disable_html_ns option:
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5(array(
'disable_html_ns' => true, // add `disable_html_ns` option
));
$dom = $html5->loadHTML($response);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
foreach ($elements as $element) {
var_dump($element);
}
https://github.com/Masterminds/html5-php#options
disable_html_ns (boolean): Prevents the parser from automatically assigning the HTML5 namespace to the DOM document. This is for non-namespace aware DOM tools.

So it looks like html5lib is setting us up with a default namespace.
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
echo $de->namespaceURI . "\n";
}
This outputs:
http://www.w3.org/1999/xhtml
To query against namespaced nodes with xpath you need to register the namespace and use the prefix in the query.
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('n', $de->namespaceURI);
$elements = $xpath->query('//n:h1');
foreach ($elements as $element)
{
echo $element->nodeValue;
}
This outputs PHP.
Generally I find it tedious to prefix everything in xpath queries when there's a default namespace involved, so I just strip it.
$de = $dom->documentElement;
$de->removeAttributeNS($de->getAttributeNode("xmlns")->nodeValue,"");
$dom->loadXML($dom->saveXML()); // reload the existing dom, now sans default ns
After that you can use your original xpath and it'll work just fine.
$elements = $xpath->query('//h1');
foreach ($elements as $element)
{
echo $element->nodeValue;
}
This now outputs PHP as well.
So the modified version of the example would be something like:
Example:
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
$de->removeAttributeNS($de->getAttributeNode("xmlns")->nodeValue,"");
$dom->loadXML($dom->saveXML());
}
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
foreach ($elements as $element)
{
var_dump($element);
}
Output:
class DOMElement#11 (18) {
public $tagName =>
string(2) "h1"
public $schemaTypeInfo =>
NULL
public $nodeName =>
string(2) "h1"
public $nodeValue =>
string(3) "PHP"
...
public $textContent =>
string(3) "PHP"
}
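The namespace behaviour described above can be reproduced without html5lib or a network request. The following is a minimal sketch, assuming a hypothetical inline document that mimics what the HTML5 parser produces (every element in the XHTML namespace). The prefix name "n" is arbitrary.

```php
<?php
// Hypothetical inline document standing in for the parser's output:
// every element lives in the XHTML namespace.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><h1>PHP</h1></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);

// Unprefixed names in XPath 1.0 only match elements in no namespace,
// so this finds nothing.
$plain = $xpath->query('//h1');

// Registering a prefix for the namespace URI lets the query match.
$xpath->registerNamespace('n', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//n:h1');

// local-name() ignores the namespace entirely, at the cost of verbosity.
$byLocalName = $xpath->query('//*[local-name() = "h1"]');

printf("%d %d %d\n", $plain->length, $prefixed->length, $byLocalName->length);
```

Which variant to prefer is a judgment call: a registered prefix keeps queries explicit, while stripping the namespace or setting disable_html_ns keeps them short.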

Related

How to parse body class with Xpath?

I'm trying to parse a page with XPath, but I can't get the body class.
Here is what I'm trying:
<?php
$url = 'http://figurinepop.com/mickey-paintbrush-disney-funko';
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//link[@rel="canonical"]/@href');
foreach($nodes as $node) {
$canonical = $node->nodeValue;
}
$nodes = $xpath->query('//html/body/@class');
foreach($nodes as $node) {
$bodyclass = $node->nodeValue;
}
$output['canonical'] = $canonical;
$output['bodyclass'] = $bodyclass;
echo '<pre>'; print_r ($output); echo '</pre>';
?>
Here is what I get :
Array
(
[canonical] => http://figurinepop.com/mickey-paintbrush-disney-funko
[bodyclass] =>
)
It works for many elements (title, canonical, div...) but not for the body class.
I've tested the XPath query with a Chrome extension and it seems to be well formed.
What is wrong?
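One way to narrow this down is to run the same queries against markup you control. The sketch below uses a hypothetical inline page (the URL and class names are made up) and attribute steps written with @. If the query works here but not on the fetched page, the tree that DOMDocument built from the real markup probably differs from what the browser shows, and dumping $doc->saveHTML() or the collected libxml errors may reveal why. That diagnosis is an assumption, not a confirmed cause.

```php
<?php
// Hypothetical stand-in for the fetched page.
$html = '<!DOCTYPE html><html><head>'
      . '<link rel="canonical" href="http://example.com/page">'
      . '</head><body class="single product"><p>Hello</p></body></html>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$errors = libxml_get_errors(); // inspect these when a query comes back empty
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// string() makes evaluate() return the attribute value directly.
$output = [
    'canonical' => $xpath->evaluate('string(//link[@rel="canonical"]/@href)'),
    'bodyclass' => $xpath->evaluate('string(/html/body/@class)'),
];

print_r($output);
```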

Parsing HTML to extract array of DIV content by class

$html = file_get_contents("https://www.wireclub.com/chat/room/music");
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = array();
foreach($xpath->evaluate('//div[@class="message clearfix"]/node()') as $childNode) {
$result[] = $dom->saveHtml($childNode);
}
echo '<pre>'; var_dump($result);
I would like the content of each individual DIV in an array to be processed individually.
This code is clumping every DIV together.

You could retrieve all the div elements and read the nodeValue of each one:
$dom = new DOMDocument();
$dom->loadHTML($html);
$myDivs = $dom->getElementsByTagName('div');
foreach($myDivs as $key => $value) {
$result[] = $value->nodeValue;
}
var_dump($result);
To select by class instead, you could use XPath in your code ($classname holds the class name to look for):
$xpath = new DOMXPath($dom);
$myElem = $xpath->query("//*[contains(@class, '$classname')]");
foreach($myElem as $key => $value) {
$result[] = $value->nodeValue;
}
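To keep each DIV's content in its own array entry, one option is to select the DIV elements themselves and serialize their child nodes inside the loop. This is a minimal sketch on hypothetical inline markup. It also swaps the plain contains() test for a token-aware class match, so that a class such as "messages-footer" is not picked up when looking for "message".

```php
<?php
// Hypothetical markup standing in for the chat page.
$html = '<div class="message clearfix"><b>alice</b>: hi</div>'
      . '<div class="message clearfix"><b>bob</b>: hello</div>'
      . '<div class="messages-footer">ignored</div>';

libxml_use_internal_errors(true); // fragments without <html> raise warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$classname = 'message';

// Pad the class attribute with spaces so only whole class tokens match.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

$result = [];
foreach ($xpath->query($query) as $div) {
    // Build the inner HTML of this one div from its child nodes.
    $inner = '';
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    $result[] = $inner;
}

print_r($result);
```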

PHP Dom Getting Multiple href From Class

Could someone please help me out?
I'm trying to get multiple hrefs from a page, for example:
The page
<div class="link__ttl">
<a href="/watch-link-53767-934537">Version 1</a>
</div>
<div class="link__ttl">
<a href="/watch-link-53759-934537">Version 1</a>
</div>
PHP Dom
$data = array();
$data['links'] = array();
$page = $this->curl->get($page);
$dom = new DOMDocument();
@$dom->loadHTML($page);
$divs = $dom->getElementsByTagName('div');
for($i=0;$i<$divs->length;$i++){
if ($divs->item($i)->getAttribute("class") == "link__ttl") {
foreach ($divs as $div) {
$link = $div->getElementsByTagName('a');
$data['links'][] = $link->getAttribute("href");
}
}
}
But this doesn't seem to work, and I get an error:
Call to undefined method DOMNodeList::getAttribute()
Could someone help me out here, please? Thanks.

You're testing the divs for the link__ttl class, but then looping over all the divs again with foreach. Take only the anchors from the divs that have the class.
You're also calling getAttribute() on a DOMNodeList; you need to get the underlying DOMNode (for example with item(0)) to read the attribute.
$divs = $dom->getElementsByTagName('div');
for($i=0;$i<$divs->length;$i++){
$div = $divs->item($i);
if ($div->getAttribute("class") == "link__ttl") {
$link = $div->getElementsByTagName('a');
$data['links'][] = $link->item(0)->getAttribute("href");
}
}
Another solution is to use XPath:
$path = new DOMXPath($dom);
$as = $path->query('//div[@class="link__ttl"]/a');
for($i=0;$i<$as->length;$i++){
$data['links'][] = $as->item($i)->getAttribute("href");
}
http://codepad.org/pX5qA1BB

$link = $div->getElementsByTagName('a'); retrieves a LIST of items, and you can't read the attribute value "href" from a list.
Try using $link[0] (or $link->item(0)) instead of $link.

Every part of a DOM is a node. Attributes are nodes too, not just elements. Using XPath you can fetch a list of href attribute nodes directly.
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);
$result = [];
foreach ($xpath->evaluate('//div[@class = "link__ttl"]/a/@href') as $href) {
$result[] = $href->value;
}
var_dump($result);
Output: https://eval.in/150202
array(2) {
[0]=>
string(24) "/watch-link-53767-934537"
[1]=>
string(24) "/watch-link-53759-934537"
}

Trouble extracting data from an XML document using XPath

I'm trying to extract all of the "name" and "form13FFileNumber" values from xpath "//otherManagers2Info/otherManager2/otherManager" in this document:
https://www.sec.gov/Archives/edgar/data/1067983/000095012314002615/primary_doc.xml
Here is my code. Any idea what I am doing wrong here?
$xml = file_get_contents($url);
$dom = new DOMDocument();
$dom->loadXML($xml);
$x = new DOMXpath($dom);
$other_managers = array();
$nodes = $x->query('//otherManagers2Info/otherManager2/otherManager');
if (!empty($nodes)) {
$i = 0;
foreach ($nodes as $n) {
$i++;
$other_managers[$i]['form13FFileNumber'] = $x->evaluate('form13FFileNumber', $n)->item(0)->nodeValue;
$other_managers[$i]['name'] = $x->evaluate('name', $n)->item(0)->nodeValue;
}
}

As you posted in the comment, you can just register the namespace with your own prefix for XPath. Namespace prefixes are just aliases. There is no default namespace in XPath 1.0, so you always have to register and use a prefix.
Location path expressions always return a traversable node list, so you can use foreach to iterate them. query() and evaluate() take a context node as the second argument, and expressions are relative to that context. Finally, evaluate() can return scalar values directly. This happens if you cast the node list into a scalar type in XPath (like a string) or use a function like count().
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
$xpath->registerNamespace('e13', 'http://www.sec.gov/edgar/thirteenffiler');
$xpath->registerNamespace('ecom', 'http://www.sec.gov/edgar/common');
$result = [];
$nodes = $xpath->evaluate('//e13:otherManagers2Info/e13:otherManager2/e13:otherManager');
foreach ($nodes as $node) {
$result[] = [
'form13FFileNumber' => $xpath->evaluate('string(e13:form13FFileNumber)', $node),
'name' => $xpath->evaluate('string(e13:name)', $node),
];
}
var_dump($result);
Demo: https://eval.in/125200
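The difference between node-list results and scalar results from evaluate() can be seen on a small document. This is a sketch only: the XML, the namespace URI urn:example:filer and the prefix "f" are hypothetical and do not come from the SEC document.

```php
<?php
// Hypothetical XML with a default namespace, loosely shaped like the filing.
$xml = '<doc xmlns="urn:example:filer">'
     . '<manager><name>Alpha</name><fileNumber>28-1</fileNumber></manager>'
     . '<manager><name>Beta</name><fileNumber>28-2</fileNumber></manager>'
     . '</doc>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('f', 'urn:example:filer');

// A location path yields a DOMNodeList.
$nodes = $xpath->evaluate('//f:manager');

// count() and string() make evaluate() return scalars (float and string).
$count = $xpath->evaluate('count(//f:manager)');
$first = $xpath->evaluate('string(//f:manager[1]/f:name)');

// The second argument is the context node for relative expressions.
$names = [];
foreach ($nodes as $manager) {
    $names[] = $xpath->evaluate('string(f:name)', $manager);
}

var_dump($count, $first, $names);
```

Note that query() only returns node lists, so scalar expressions such as count() need evaluate().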

extracting and printing an html element by it's class using DOMDocument

What I want to do is get an element by its class name and show it as an actual HTML element, not its nodes or its inner data.
Here is my code:
$html = file_get_contents("www.site.com");
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
$element = $dom->getElementById('myid');
$string = $element->C14N();
This is how I do it using an ID, but I want to know if there is a way to do this using a class. Apparently there is no getElementByClass method.

There is no straightforward method in php dom to do this. You will have to walk all the elements and check if their class attribute contains the class name you need...
$html = file_get_contents("www.site.com");
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('div') as $element) {
if (strpos($element->getAttribute('class'), 'yourClassNameHere') !== false) {
$string = $element->C14N();
}
}
You can also use DOMXpath:
$xpath = new DOMXpath($dom);
foreach ($xpath->query("//div[@class='yourClassNameHere']") as $element) {
$string = $element->C14N();
}
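If the goal is the element's markup as HTML rather than canonical XML, passing the node to saveHTML() is an alternative to C14N(). The sketch below assumes hypothetical inline markup and class names, and matches the class as a whole token so elements with several classes are found too.

```php
<?php
// Hypothetical markup; the id and class names are made up.
$html = '<div id="myid" class="box highlight"><p>Hello</p></div>'
      . '<div class="box">Other</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$classname = 'highlight';

// Match any element whose class attribute contains the token.
$nodes = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]"
);

$markup = [];
foreach ($nodes as $node) {
    // saveHTML($node) serializes the element itself, tags included.
    $markup[] = $dom->saveHTML($node);
}

echo implode("\n", $markup), "\n";
```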
