PHP DOM and XPath - Replace node but keep children of old node

Consider the following html:
<html>
<title>Xyz</title>
<body>
<div>
<div class='mycls'>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</div>
</div>
<body>
</html>
$dom = new DOMDocument();
$dom->loadHTML([loaded html of remote url through curl]);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('html/body/div[@class="mycls"]');
Up to here it works fine. I need to replace the node to get the following:
<body>
<div>
<span>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</span>
</div>
<body>

Something like the following should work for you:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$oldNode = $xpath->query('//div[@class="mycls"]')->item(0);
$span = $dom->createElement('span');
if ($oldNode->hasChildNodes()) {
$children = [];
foreach ($oldNode->childNodes as $child) {
$children[] = $child;
}
foreach ($children as $child) {
$span->appendChild($child->parentNode->removeChild($child));
}
}
$oldNode->parentNode->replaceChild($span, $oldNode);
echo htmlspecialchars($dom->saveHTML());
Demo: http://codepad.viper-7.com/WNTrR5
Note that in the demo I have also fixed your HTML, which was badly broken :-)
If the HTML in your question really is what you get back from the cURL call and you cannot change it (you have no control over it), you can do:
$libxmlErrors = libxml_use_internal_errors(true); // at the start
and
libxml_use_internal_errors($libxmlErrors); // at the end
to prevent the parser warnings from popping up.
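The move-then-replace steps above can also be wrapped in a small helper. This is only a sketch: the function name replaceKeepingChildren and the sample markup are made up for illustration, and it relies on the fact that appendChild() moves a node that is already attached to the document.

```php
<?php
// Hypothetical helper: swap $old for a new element named $newName,
// moving (not copying) every child of $old into the replacement.
function replaceKeepingChildren(DOMElement $old, string $newName): DOMElement
{
    $new = $old->ownerDocument->createElement($newName);
    // firstChild is re-read on every pass, so this drains $old safely.
    while ($old->firstChild !== null) {
        $new->appendChild($old->firstChild);
    }
    $old->parentNode->replaceChild($new, $old);
    return $new;
}

// Sample markup (an assumption, shortened from the question).
$html = '<html><body><div><div class="mycls"><div>1 Books</div><div>2 Papers</div></div></div></body></html>';

$previous = libxml_use_internal_errors(true);  // collect parser warnings quietly
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);         // restore the earlier setting

$xpath = new DOMXPath($dom);
$old = $xpath->query('//div[@class="mycls"]')->item(0);
if ($old instanceof DOMElement) {
    replaceKeepingChildren($old, 'span');
}

$result = $dom->saveHTML($dom->documentElement);
echo $result, PHP_EOL;
```

Draining with firstChild avoids building the intermediate $children array, at the cost of being a little less explicit about what is being moved.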

Related

How to set class to all text node parents inside of specific block

I need to set a class to parent of each text node inside of specific block on my page.
Here is what I'm trying to do:
$pageHTML = '<html><head></head>
<body>
<header>
<div>
<nav>Menu</nav>
<span>Another text</span>
</div>
</header>
<section>Section</section>
<footer>Footer</footer>
</body>
</html>';
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($pageHTML);
libxml_use_internal_errors(false);
foreach($dom->getElementsByTagName('body')[0]->childNodes as $bodyChild) {
if($bodyChild->nodeName == 'header') {
$blockDoc = new DOMDocument();
$blockDoc->appendChild($blockDoc->importNode($bodyChild, true));
$xpath = new DOMXpath($blockDoc);
foreach($xpath->query('//text()') as $textnode) {
if(preg_match('/\S/', $textnode->nodeValue)) { // exclude non-characters
$textnode->parentNode->setAttribute('class','my_class');
}
}
}
}
echo $dom->saveHTML((new \DOMXPath($dom))->query('/')->item(0));
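I need <nav> and <span> inside <header> to end up with the class my_class, but they don't.
As far as I understand, I need to put the changed parents back into the DOM after setting the class on them, but how can I do that?
OK, I've found the answer myself. Query the original document and pass the header as the context node, instead of importing it into a second document: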
...
$xpath = new DOMXpath($dom);
foreach($dom->getElementsByTagName('body')[0]->childNodes as $bodyChild) {
if($bodyChild->nodeName == 'header') {
foreach($xpath->query('.//text()', $bodyChild) as $textnode) {
if(preg_match('/\S/', $textnode->nodeValue)) { // exclude non-characters
$textnode->parentNode->setAttribute('class','my_class');
}
}
}
}
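The difference between an absolute and a relative XPath expression is what makes this work. Below is a minimal sketch with made-up, whitespace-free markup: it compares '//text()' and './/text()' when both are given the header as context node, then tags the parents of the non-blank text nodes under the header.

```php
<?php
libxml_use_internal_errors(true); // older libxml2 builds warn about HTML5 tags
$dom = new DOMDocument();
$dom->loadHTML('<html><body><header><nav>Menu</nav></header><section>Section</section></body></html>');
libxml_clear_errors();

$xpath  = new DOMXPath($dom);
$header = $dom->getElementsByTagName('header')->item(0);

// '//' is absolute: it searches the whole document, even with a context node.
$absolute = $xpath->query('//text()', $header)->length;
// './/' is relative: it searches only below the context node.
$relative = $xpath->query('.//text()', $header)->length;

// normalize-space() filters out whitespace-only text nodes inside XPath itself.
foreach ($xpath->query('.//text()[normalize-space()]', $header) as $text) {
    $text->parentNode->setAttribute('class', 'my_class');
}

$result = $dom->saveHTML($dom->documentElement);
echo $result, PHP_EOL;
```

Because the nodes are changed in the original document rather than in an imported copy, the new attributes show up in saveHTML() without any extra step.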
Try this code. It selects the elements by name with getElementsByTagName instead of searching for text nodes. Note that it only handles the first <nav> and the first <span> in each header:
$pageHTML = '<html>
<head></head>
<body>
<header>
<div>
<nav>Menu</nav>
<span>Another text</span>
</div>
</header>
<section>Section</section>
<footer>Footer</footer>
</body>
</html>';
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($pageHTML);
libxml_use_internal_errors(false);
$elements = $dom->getElementsByTagName('header');
foreach ($elements as $node) {
$nav = $node->getElementsByTagName('nav');
$span = $node->getElementsByTagName('span');
$nav->item(0)->setAttribute('class', 'my_class');
$span->item(0)->setAttribute('class', 'my_class');
}
echo $dom->saveHTML();

Replace content specific HTML tag using PHP

I have HTML code:
<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>
Using PHP, I want to replace all < symbols located inside code elements. For example, I want the code above converted to:
<div>
<h1>Header</h1>
<code>&lt;p>First code&lt;/p></code>
<p>Next example</p>
<code>&lt;b>Second example&lt;/b></code>
</div>
I tried using PHP's DOMDocument class, but without success. Below is my code:
$dom = new DOMDocument();
$dom->loadHTML($content);
$innerHTML= '';
$tmp = '';
if(count($dom->getElementsByTagName('*'))){
foreach ($dom->getElementsByTagName('*') as $child) {
if($child->tagName == 'code'){
$tmp = $child->ownerDocument->saveXML( $child);
$innerHTML .= htmlentities($tmp);
}
else{
$innerHTML .= $child->ownerDocument->saveXML($child);
}
}
}
So, you're iterating over the markup properly, and your use of saveXML() was close to what you want, but nowhere in your code do you try to actually change the contents of the element. This should work:
<?php
$content='<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
foreach ($dom->getElementsByTagName('code') as $child) {
// get the markup of the children
$html = implode(array_map([$child->ownerDocument,"saveHTML"], iterator_to_array($child->childNodes)));
// create a node from the string
$text = $dom->createTextNode($html);
// remove existing child nodes
// (childNodes is a live list, so drain it via firstChild instead of foreach)
while ($child->firstChild) {
$child->removeChild($child->firstChild);
}
// append the new text node - escaping is done automatically
$child->appendChild($text);
}
echo $dom->saveHTML();
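One detail deserves care when a code element has more than one child: childNodes is a live list, so removing nodes while iterating over it with foreach may stop early or skip nodes. A minimal sketch of the same technique, using made-up markup with three children, serializes first and then drains the element through firstChild:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML(
    '<div><code><b>one</b><i>two</i><u>three</u></code></div>',
    LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED
);
$code = $dom->getElementsByTagName('code')->item(0);

// Serialize the children while they are still attached.
$markup = '';
foreach ($code->childNodes as $node) {
    $markup .= $dom->saveHTML($node);
}

// Drain the element: firstChild is re-read on every pass.
while ($code->firstChild !== null) {
    $code->removeChild($code->firstChild);
}

// A text node is escaped automatically when the document is serialized.
$code->appendChild($dom->createTextNode($markup));

$result = $dom->saveHTML();
echo $result;
```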

Getting all elements between html tag in php

I referred to this question.
But I want to iterate over the page and get all the elements inside the html tag.
This is what I did:
$homepage = file_get_contents('http://www.example.com');
Which will print the following
<html>
<body>
<div class = "alpha">hey</div>
<div class = "beta">one</div>
<div class = "beta">two</div>
</body>
</html>
Here I need to get all the elements with the class beta.
How can I do this?
Here's the code that I have tried so far:
$dom = new DOMDocument();
$dom->loadHTML($homepage);
foreach($dom->getAllElements as $element ){
if(!$element->hasClass('beta')){
echo $element;
}
}
But it says: DOMDocument::loadHTML(): Tag nav invalid in Entity
Try this
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML("<html>
<body>
<div class = 'alpha'>hey</div>
<div class = 'beta'>one</div>
<div class = 'beta'>two</div>
</body>
</html>");
libxml_clear_errors();
$classname="beta";
$finder = new DomXPath($dom);
$spaner = $finder->query("//*[contains(@class, '$classname')]");
foreach($spaner as $element ){
print_r($element);
}
?>
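One caveat with contains(@class, ...) is that it matches substrings, so a class such as betamax would also be selected. A common XPath 1.0 idiom pads the attribute with spaces and looks for the padded token instead. The sketch below uses made-up markup to show both queries side by side.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body>
<div class="alpha">hey</div>
<div class="beta">one</div>
<div class="item beta">two</div>
<div class="betamax">three</div>
</body></html>');
$xpath = new DOMXPath($dom);

// Substring match: also picks up class="betamax".
$substring = $xpath->query("//*[contains(@class, 'beta')]");

// Token match: pad the normalized attribute with spaces, then look for ' beta '.
$token = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' beta ')]"
);

$texts = [];
foreach ($token as $element) {
    $texts[] = $element->textContent;
}
echo implode(', ', $texts), PHP_EOL;
```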

How can I add an element into the middle of a text node's text?

Given the following HTML:
$content = '<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
</body>
</html>';
How can I alter it to the following HTML:
<html>
<body>
<div>
<p>During the <span>interim</span> there shall be nourishment supplied</p>
</div>
</body>
</html>
I need to do this using DomDocument. Here's what I've tried:
$dom = new DomDocument();
$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//*[contains(text(),'interim')]");
if (!is_null($elements)) {
foreach ($elements as $element) {
$text = $element->nodeValue;
$element->nodeValue = str_replace('interim','<span>interim</span>',$text);
}
}
echo $dom->saveHTML();
However, this outputs literal html entities so it renders like this in the browser:
During the <span>interim</span> there shall be nourishment supplied
I imagine one should use createElement and appendChild methods instead of assigning nodeValue directly but I can't see how to insert an element in the middle of a textNode string?
Marcus Harrison's answer using splitText is a good one, but it can be simplified and needs to use mb_* methods to work with UTF-8 input:
<?php
$html = <<<END
<html>
<meta charset="utf-8">
<body>
<div>
<p>During € the interim there shall be nourishment supplied</p>
</div>
</body>
</html>
END;
$replace = 'interim';
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query(sprintf('//text()[contains(., "%s")]', $replace));
foreach ($nodes as $node) {
$start = mb_strpos($node->textContent, $replace);
$end = $start + mb_strlen($replace);
$node->splitText($end); // do this first
$node->splitText($start); // do this last
$newnode = $doc->createElement('span');
$node->parentNode->insertBefore($newnode, $node->nextSibling);
$newnode->appendChild($newnode->nextSibling);
}
$doc->encoding = 'UTF-8';
print $doc->saveHTML($doc->documentElement);
Create a new DomDocument with modified element and replace the old one
foreach ($elements as $element) {
$text = $element->nodeValue;
$el = new DomDocument();
$el->loadHTML('<iframe>'. str_replace('interim','<span>interim</span>',$text) . '</iframe>');
$new = $dom->importNode($el->getElementsByTagName('iframe')->item(0), true);
unset($el);
$element->parentNode->replaceChild($new, $element);
}
To do this, you can use the splitText method of DOMText. It accepts an offset, which can be found with strpos:
$dom = new DomDocument();
$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//*[contains(text(),'interim')]");
if (!is_null($elements)) {
foreach ($elements as $element) {
$text = $element->childNodes->item(0);
$text->splitText(strpos($text->textContent, "interim"));
$text2 = $element->childNodes->item(1);
$text2->splitText(strpos($text2->textContent, " "));
$element->removeChild($text2);
$span = $dom->createElement("span");
$span->appendChild($dom->createTextNode("interim"));
$element->insertBefore($span, $element->childNodes->item(1));
}
}
echo $dom->saveHTML();
Edits: having just tested it, I realise I hadn't removed the original "interim" in the second text node. Edited this answer to do that. I have also edited this code to be as compatible with old versions of PHP as I can think of making it: as I don't run an old version of PHP it isn't possible for me to test that.
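The splitText approach can be shortened a little by using its return value, which is the new text node that starts at the offset. The following is a sketch under a few assumptions: the markup is a trimmed version of the question's, the mbstring extension is available, and only the first occurrence in each text node is wrapped.

```php
<?php
$html = '<html><head><meta charset="utf-8"></head><body>'
      . '<p>During the interim there shall be nourishment supplied</p>'
      . '</body></html>';
$needle = 'interim';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// XPath results are a static list, so editing the tree inside the loop is safe.
foreach ($xpath->query(sprintf('//text()[contains(., "%s")]', $needle)) as $node) {
    $start = mb_strpos($node->textContent, $needle, 0, 'UTF-8');
    $match = $node->splitText($start);               // $match now starts at the needle
    $match->splitText(mb_strlen($needle, 'UTF-8'));  // cut off everything after it

    // Put a <span> where the matched text was, then move the text into it.
    $span = $doc->createElement('span');
    $match->parentNode->replaceChild($span, $match);
    $span->appendChild($match);
}

$result = $doc->saveHTML($doc->documentElement);
echo $result, PHP_EOL;
```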

how to use dom php parser

I'm new to DOM parsing in PHP:
I have a HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using php.
How can I use the DOM parser to do this?
Thanks!
First, I have to tell you that you can't use the same id on two different divs; that is what classes are for. Every element's id should be unique.
Code to get the contents of the div with id="interestingbox":
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("//div[@id='interestingbox']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("//div[@class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
I got this to work using simplehtmldom as a start:
$html = file_get_html('example.com');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}
Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
</body>
</html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[@id='foo']");
$dataString = innerXML($node->item(0));
// saveXML() serializes empty elements as <br/>, without the space
$dataArr = explode("<br/>", $dataString);
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
WebExtractor: https://github.com/knyga/webextractor
It can parse pages with CSS, regex, and XPath selectors.
See the package and its tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory;
use WebExtractor\DataExtractor\DataExtractorTypes;
use WebExtractor\Client\Client;
$factory = DataExtractorFactory::getFactory();
$extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
$client = new Client;
$content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content);
$h1 = $extractor->setSelector('h1')->extract();
