DOM xpath isn't working correctly - php

I don't quite understand what's wrong with my XPath code. It's not returning any results. First, here's the code:
$url = 'http://someurl.com';
$dom = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false;
$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$query = '//div[@id="main"]//ul[@id="tags"]//li//a';
$result = $xpath->query($query);
foreach($result as $node){
$val = $node->getAttribute('href');
echo $val."<br/>";
}
Here is the HTML:
<div id="main">
<ul id="tags">
<li class="tag_col_0">somevalue</li>
<li class="tag_col_0">somevalue1</li>
</ul>
</div>
I'm not quite sure what's wrong here.
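One possible explanation, sketched under assumptions: DOMDocument::loadHTML() parses its argument as markup, so passing the URL string produces a document that contains only that text. In addition, the sample markup has no <a> elements inside the <li> items, so a query ending in //a would match nothing even on the right document. The sketch below inlines the sample markup as a string (an assumption; for a remote page, loadHTMLFile() or file_get_contents() would fetch it first) and reads the <li> text instead. The $values array is introduced only for illustration.

```php
<?php
// Sample markup from the question, inlined so the sketch is self-contained.
$html = <<<HTML
<div id="main">
<ul id="tags">
<li class="tag_col_0">somevalue</li>
<li class="tag_col_0">somevalue1</li>
</ul>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);              // loadHTML() expects markup, not a URL
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Attributes are addressed with @ in XPath; the sample <li> items hold no <a>.
$values = [];
foreach ($xpath->query('//div[@id="main"]//ul[@id="tags"]/li') as $node) {
    $values[] = trim($node->textContent);
}
echo implode("\n", $values), "\n";
```

If the real page does wrap each tag in a link, appending /a to the query and calling getAttribute('href') on each node, as in the original loop, should work the same way.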

Related

How to set class to all text node parents inside of specific block

I need to set a class to parent of each text node inside of specific block on my page.
Here is what I'm trying to do:
$pageHTML = '<html><head></head>
<body>
<header>
<div>
<nav>Menu</nav>
<span>Another text</span>
</div>
</header>
<section>Section</section>
<footer>Footer</footer>
</body>
</html>';
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($pageHTML);
libxml_use_internal_errors(false);
foreach($dom->getElementsByTagName('body')[0]->childNodes as $bodyChild) {
if($bodyChild->nodeName == 'header') {
$blockDoc = new DOMDocument();
$blockDoc->appendChild($blockDoc->importNode($bodyChild, true));
$xpath = new DOMXpath($blockDoc);
foreach($xpath->query('//text()') as $textnode) {
if(preg_match('/\S/', $textnode->nodeValue)) { // exclude non-characters
$textnode->parentNode->setAttribute('class','my_class');
}
}
}
}
echo $dom->saveHTML((new \DOMXPath($dom))->query('/')->item(0));
I need <nav> and <span> inside <header> to end up with the class my_class, but they don't get it.
As far as I understand, I need to put the changed parents back into the DOM after setting the class on them, but how can I do that?
OK, I found the answer myself:
...
$xpath = new DOMXpath($dom);
foreach($dom->getElementsByTagName('body')[0]->childNodes as $bodyChild) {
if($bodyChild->nodeName == 'header') {
foreach($xpath->query('.//text()', $bodyChild) as $textnode) {
if(preg_match('/\S/', $textnode->nodeValue)) { // exclude non-characters
$textnode->parentNode->setAttribute('class','my_class');
}
}
}
}
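For reference, a self-contained variation of this approach, assuming the sample markup from the question. The key point is that the class is set on nodes that belong to $dom itself; the earlier attempt modified copies imported into a second DOMDocument, so the original document never changed. This sketch folds the loop over the body children and the preg_match check into a single XPath expression, which is a swapped-in variation rather than the exact code above. The $classed array is added only for illustration.

```php
<?php
$pageHTML = '<html><head></head><body>
<header><div><nav>Menu</nav><span>Another text</span></div></header>
<section>Section</section>
<footer>Footer</footer>
</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // HTML5 tags such as <header> can trigger libxml warnings
$dom->loadHTML($pageHTML);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$classed = [];                      // illustration only: names of the elements that were changed
// Text nodes under <header> that contain at least one non-whitespace character.
foreach ($xpath->query('//header//text()[normalize-space()]') as $textNode) {
    $textNode->parentNode->setAttribute('class', 'my_class');
    $classed[] = $textNode->parentNode->nodeName;
}
echo $dom->saveHTML();
```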
Try this code. Select the elements by name with getElementsByTagName instead of going through the text nodes.
$pageHTML = '<html>
<head></head>
<body>
<header>
<div>
<nav>Menu</nav>
<span>Another text</span>
</div>
</header>
<section>Section</section>
<footer>Footer</footer>
</body>
</html>';
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($pageHTML);
libxml_use_internal_errors(false);
$elements = $dom->getElementsByTagName('header');
foreach ($elements as $node) {
$nav = $node->getElementsByTagName('nav');
$span = $node->getElementsByTagName('span');
$nav->item(0)->setAttribute('class', 'my_class');
$span->item(0)->setAttribute('class', 'my_class');
}
echo $dom->saveHTML();

How to web-scrape in divs with DOMparser

I am trying to get a div, and to do the same for other pages by putting it in a loop,
but I am running into some trouble.
<div class="article_info">
<ul class="c-result_box">
<li>
<div class="inner cf">
<div class="c-header">
<div class="c-logo">
<img src="/e/designs/31sumai/common/img/logo_08.png" alt="#">
</div>
<p class="c-supplier">三井のマンション</p>
<p class="c-name">
パークリュクス大阪天満
</p>
I'm trying to get the text inside the <a> element. Here is my code; what am I missing?
$start_id = 1501;
while(true){
$url = 'https://www.31sumai.com/mfr/K'.$start_id.'/outline.html';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);
$classname="c-name";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
$MyTable = false;
$insertData = [];
foreach($nodes as $node){
$allNames = [];
foreach($node->getElementsByTagName('a') as $a){
$name = $a->getElementsByTagName('a');
$allProperties[] = [
'names' => $name];
}
}
Thank you for helping!
You can rely on your XPath query to pull all the text nodes that you want, and then just read the nodeValue property within your loop:
$start_id = "1501";
$url = "https://www.31sumai.com/mfr/K$start_id/outline.html";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);
$classname="c-name";
$nodes = $xpath->query("//*[contains(@class, '$classname')]/a/text()");
foreach($nodes as $node){
echo $node->nodeValue;
}

DOMXpath & PHP: how to wrap a bunch of <li> inside an <ul>

I have a html-document with this not-so-nice markup, without the 'ul':
<p>Lorem</p>
<p>Ipsum...</p>
<li class='item'>...</li>
<li class='item'>...</li>
<li class='item'>...</li>
<div>...</div>
I am now trying to "grab" all li-elements and wrap them inside an ul-list which I'd like to place in the same spot, using PHP and DOMXPath. I manage to find and "remove" the li-elements:
$elements = $xpath->query('//li[@class="item"]');
$wrapper = $document->createElement('ul');
foreach($elements as $child) {
$wrapper->appendChild($child);
}
Maybe you can get the parentNode of the first <li> and then use the insertBefore method:
$html = <<<HTML
<p>Lorem</p>
<p>Ipsum...</p>
<li class='item'>...</li>
<li class='item'>...</li>
<li class='item'>...</li>
<div>...</div>
HTML;
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//li[@class="item"]');
$wrapper = $doc->createElement('ul');
$elements->item(0)->parentNode->insertBefore(
$wrapper, $elements->item(0)
);
foreach($elements as $child) {
$wrapper->appendChild($child);
}
echo $doc->saveHTML();
Here's what you need. You may need to tweak the XPath query for your real HTML.
$document = new DOMDocument;
// We don't want to bother with white spaces
$document->preserveWhiteSpace = false;
$html = <<<EOT
<p>Lorem</p>
<p>Ipsum...</p>
<li class='item'>...</li>
<li class='item'>...</li>
<li class='item'>last...</li>
<div>...</div>
EOT;
$document->loadHTML($html);
$xpath = new DOMXPath($document);
$elements = $xpath->query('//li[@class="item"]');
// Save a reference to the node positioned right after our <li> elements
$ref = $xpath->query('//li[@class="item"][last()]')->item(0)->nextSibling;
$wrapper = $document->createElement('ul');
foreach($elements as $child) {
$wrapper->appendChild($child);
}
$ref->parentNode->insertBefore($wrapper, $ref);
echo $document->saveHTML();
Running example: https://repl.it/B3UO/24

How to extract the contents inside a div based on its class?

I tried with this code,
$html= file_get_contents("page.html");
$dom = new DOMDocument;
$dom->loadHTML($html);
$div = $dom->getElementsByClassName('mydiv1');
$result = $dom->saveHTML($div);
echo $result;
page.html
<html>
<body>
<div id="test">
<div class="mydiv1">Hello</div>
<div class="mydiv2">How are you</div>
</div>
</body>
</html>
But when I try it with an id, it works:
$html= file_get_contents("page.html");
$dom = new DOMDocument;
$dom->loadHTML($html);
$div = $dom->getElementById('test');
$result = $dom->saveHTML($div);
echo $result;
How can I get the content based on class ?
Try this code,
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="mydiv1"]');
$div = $div->item(0);
$result = $dom->saveXML($div);
echo $result;
DOMDocument has no getElementsByClassName method (yet), but the same result can be produced with DOMXPath:
$dom = new DOMDocument();
// loadHTMLFile() parses HTML; load() expects well-formed XML
$dom->loadHTMLFile($filePath);
$finder = new DOMXPath($dom);
$nodes = $finder->query('//div[@class="mydiv1"]');
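Note that an equality test on the class attribute only matches when the attribute is exactly that string. For elements carrying several classes, a commonly used XPath 1.0 idiom pads the attribute with spaces and looks for the padded token. The sketch below illustrates it; the helper name findByClass is hypothetical, and the markup (with an extra "highlighted" class) is inlined as an assumption.

```php
<?php
// Hypothetical helper: returns elements whose class attribute contains $class as a whole token.
function findByClass(DOMXPath $xpath, string $class): DOMNodeList
{
    $expr = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class   // assumed to be a plain class name without quotes
    );
    return $xpath->query($expr);
}

$html = '<html><body><div id="test">'
      . '<div class="mydiv1 highlighted">Hello</div>'
      . '<div class="mydiv2">How are you</div>'
      . '</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$texts = [];
foreach (findByClass($xpath, 'mydiv1') as $div) {
    $texts[] = $div->textContent;
}
echo implode("\n", $texts), "\n";
```

Unlike a bare contains(@class, 'mydiv1'), the padded form would not also match a class such as mydiv10.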

Php DOM and Xpath - Replace node but keep children of old node

Consider the following html:
<html>
<title>Xyz</title>
<body>
<div>
<div class='mycls'>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</div>
</div>
<body>
</html>
$dom = new DOMDocument();
$dom->loadHTML([loaded html of remote url through curl]);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('html/body/div[@class="mycls"]');
Up to here it works fine. I need to replace the node to get the following:
<body>
<div>
<span>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</span>
</div>
<body>
Something like the following should work for you:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$oldNode = $xpath->query('//div[@class="mycls"]')->item(0);
$span = $dom->createElement('span');
if ($oldNode->hasChildNodes()) {
$children = [];
foreach ($oldNode->childNodes as $child) {
$children[] = $child;
}
foreach ($children as $child) {
$span->appendChild($child->parentNode->removeChild($child));
}
}
$oldNode->parentNode->replaceChild($span, $oldNode);
echo htmlspecialchars($dom->saveHTML());
Demo: http://codepad.viper-7.com/WNTrR5
Note that in the demo I also have fixed your HTML which was utterly broken :-)
If your demo is really the HTML you are getting back from the cURL call and you cannot change it (you have no control over it), you can do:
$libxmlErrors = libxml_use_internal_errors(true); // at the start
and
libxml_use_internal_errors($libxmlErrors); // at the end
to prevent errors from popping up.
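Put together, the save-and-restore pattern might look like the sketch below. The broken sample string is an assumption made for illustration, and libxml_get_errors() and libxml_clear_errors() are used here to inspect and then discard whatever the parser collected. How many errors are reported depends on the libxml version, so the count is printed rather than assumed.

```php
<?php
// Assumed sample: deliberately malformed markup, similar to the question's (second <body> is never closed).
$html = '<html><title>Xyz</title><body><div><div class="mycls">'
      . '<div>1 Books</div></div></div><body></html>';

$previous = libxml_use_internal_errors(true);  // collect parser errors instead of emitting warnings

$dom = new DOMDocument();
$dom->loadHTML($html);

$errorCount = count(libxml_get_errors());      // inspect the collected errors if needed
libxml_clear_errors();                         // empty the error buffer
libxml_use_internal_errors($previous);         // restore the previous setting

$xpath = new DOMXPath($dom);
$found = $xpath->query('//div[@class="mycls"]')->length;
echo "parser errors: $errorCount, matching nodes: $found\n";
```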
