Extract and dump a DOM node (and its children) in PHP

I have the following scenario and have already spent hours trying to handle it: I'm developing a WordPress theme (hence PHP) and I want to check whether the content of a post (which is HTML) contains a tag with a certain id/class. If so, I want to extract it from the content and place it somewhere else.
Example: Let's say the text content of the WordPress post is
<?php
/* $content actually comes from WP function get_the_content() */
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
?>
So how can I extract that div by its class (I could also live with giving it an ID), output it (with tags and all) in one place of the template, and output the rest (without the extracted tag, of course) in another place of the template?
I've already tried the DOMDocument class, but it was a p.i.t.a. for me; maybe I'm too stupid.

Try:
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
$dom = new DomDocument;
$dom->loadHtml($content);
$xpath = new DomXpath($dom);
$contents = '';
// Attributes are addressed with "@" in XPath
foreach ($xpath->query('//div[@class="the-wanted-element"]') as $node) {
    $contents = $dom->saveXml($node);
    break;
}
echo $contents;
How to get the remaining XML/HTML:
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
$dom = new DomDocument;
$dom->loadHtml($content);
$xpath = new DomXpath($dom);
// Remove the wanted element from the tree first
foreach ($xpath->query('//div[@class="the-wanted-element"]') as $node) {
    $node->parentNode->removeChild($node);
    break;
}
// Then serialize whatever is left inside <body>
$contents = '';
foreach ($xpath->query('//body/*') as $node) {
    $contents .= $dom->saveXml($node);
}
echo $contents;
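The two snippets above can be combined so the content is parsed only once. The following is a minimal sketch, not a definitive implementation: the function name split_wanted_element() and the wrapper id "split-root" are hypothetical, and it assumes the class name contains no quote characters. It uses saveHTML() rather than saveXml() so the output stays HTML, and it matches the class as one of possibly several space-separated class names.

```php
<?php
/**
 * Hypothetical helper: returns [extracted HTML, remaining HTML].
 * Assumes $class contains no quote characters.
 */
function split_wanted_element(string $content, string $class): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a known container (hypothetical id) so that
    // text nodes between the elements are kept in the remainder as well.
    $dom->loadHTML('<div id="split-root">' . $content . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $root  = $xpath->query('//div[@id="split-root"]')->item(0);
    if ($root === null) {
        return ['', $content];
    }

    // Match the class as a whole word inside the class attribute.
    $query = sprintf(
        './/*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $extracted = '';
    $node = $xpath->query($query, $root)->item(0);
    if ($node !== null) {
        $extracted = $dom->saveHTML($node);
        $node->parentNode->removeChild($node);
    }

    // Serialize everything that is left inside the wrapper.
    $rest = '';
    foreach ($root->childNodes as $child) {
        $rest .= $dom->saveHTML($child);
    }

    return [$extracted, $rest];
}

$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
[$wanted, $remaining] = split_wanted_element($content, 'the-wanted-element');
echo $wanted, "\n";    // the extracted div, for one place in the template
echo $remaining, "\n"; // everything else, for another place
```

If the class is not found, the first value is an empty string and the second is the re-serialized content.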

Related

Limited content break the HTML layout in php

I am facing an issue when I try to limit the length of the description. I have tried this:
<?php
$intDescLt = 400;
$content = $arrContentList[$arr->nid]['description'];
$excerpt = substr($content, 0, $intDescLt);
?>
<div class="three16 DetailsDiv">
<?php echo $excerpt; ?>
</div>
If the description field contains plain content without HTML tags, it works fine. But if the content contains HTML tags and the limit is reached before a closing tag, that tag's style is applied to all the content after it.
How can I resolve this issue?
Ex.
Issue :
$string = "<p><b>Lorem Ipsum</b> is simply dummy text of the printing and typesetting industry.</p>";
echo substr($string, 0, 15);
HTML output in console: <p><b>Lorem Ipsu
As a result, the <b> tag is applied to the rest of the content on the page.
Expected output in console: <p><b>Lorem Ipsu</b>

You can't just use PHP's binary string functions on an HTML string and expect things to work.
$string = "<p><b>Lorem Ipsum</b> is simply dummy text of the printing and typesetting industry.</p>";
First of all you need to decide what kind of excerpt you'd like to create in the HTML context. Let's take an example that limits the actual text length in characters, that is, without counting the size of the HTML tags. All tags should also stay properly closed.
You start by creating a DOMDocument so that you can operate on the HTML fragment you have. The loaded $string becomes the child nodes of the <body> tag, so the code fetches that element for reference as well:
$doc = new DOMDocument();
$result = $doc->loadHTML($string);
if (!$result) {
throw new InvalidArgumentException('String could not be parsed as HTML fragment');
}
$body = $doc->getElementsByTagName('body')->item(0);
Next you need to operate on all the nodes within it in document order. Iterating over these nodes is easy with an XPath query:
$xp = new DOMXPath($doc);
$nodes = $xp->query('./descendant::node()', $body);
Then the logic that creates the excerpt needs to be implemented: text nodes are kept until their combined length exceeds the number of characters left. The node that crosses the limit is split, and once no characters are left the remaining text nodes are removed from their parents:
$length = 0;
foreach ($nodes as $node) {
if (!$node instanceof DOMText) {
continue;
}
$left = max(0, 15 - $length);
if ($left) {
if ($node->length > $left) {
$node->splitText($left);
$node->nextSibling->parentNode->removeChild($node->nextSibling);
}
$length += $node->length;
} else {
$node->parentNode->removeChild($node);
}
}
At the end you need to turn the inner HTML of the body tag into a string to obtain the result:
$buffer = '';
foreach ($body->childNodes as $node) {
$buffer .= $doc->saveHTML($node);
}
echo $buffer;
This will give you the following result:
<p><b>Lorem Ipsum</b> is </p>
Because only text nodes have been altered, the elements are still intact; just the text has been shortened. The Document Object Model lets you do the traversal, the string operations, and the node removal as needed.
As you can imagine, a simpler string function like substr() cannot handle HTML in the same way.
In reality there might be more to do: the HTML in the string might be invalid (check the Tidy extension), you might want to drop HTML attributes and tags (images, scripts, iframes), and you might also want to take the size of the tags into account. The DOM will allow you to do so.
The example in full:
<?php
/**
 * Limited content break the HTML layout in php
 *
 * @link http://stackoverflow.com/a/29323396/367456
 * @author hakre <http://hakre.wordpress.com>
 */
$string = "<p><b>Lorem Ipsum</b> is simply dummy text of the printing and typesetting industry.</p>";
echo substr($string, 0, 15), "\n";
$doc = new DOMDocument();
$result = $doc->loadHTML($string);
if (!$result) {
    throw new InvalidArgumentException('String could not be parsed as HTML fragment');
}
$body = $doc->getElementsByTagName('body')->item(0);
$xp = new DOMXPath($doc);
$nodes = $xp->query('./descendant::node()', $body);
$length = 0;
foreach ($nodes as $node) {
    // Only text nodes count towards the limit
    if (!$node instanceof DOMText) {
        continue;
    }
    $left = max(0, 15 - $length);
    if ($left) {
        if ($node->length > $left) {
            // Keep the first $left characters and drop the tail
            $node->splitText($left);
            $node->nextSibling->parentNode->removeChild($node->nextSibling);
        }
        $length += $node->length;
    } else {
        // Limit already reached: remove the text node entirely
        $node->parentNode->removeChild($node);
    }
}
$buffer = '';
foreach ($body->childNodes as $node) {
    $buffer .= $doc->saveHTML($node);
}
echo $buffer;
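The same steps can be wrapped in a function so that the limit becomes a parameter instead of the hard-coded 15. This is only a sketch under stated assumptions: the function name html_excerpt() is hypothetical, the input is assumed to be ASCII or otherwise in an encoding that loadHTML() interprets correctly, and parser warnings are collected with libxml_use_internal_errors() rather than shown.

```php
<?php
/**
 * Hypothetical helper: shortens the text of an HTML fragment to $limit
 * characters while keeping the surrounding elements intact.
 */
function html_excerpt(string $html, int $limit): string
{
    if (trim($html) === '') {
        return ''; // nothing to parse
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if (!$loaded || $body === null) {
        return '';
    }

    $xp = new DOMXPath($doc);
    $length = 0;
    // Visit the text nodes in document order.
    foreach ($xp->query('./descendant::text()', $body) as $node) {
        $left = max(0, $limit - $length);
        if ($left === 0) {
            // Limit already reached: drop this text node.
            $node->parentNode->removeChild($node);
            continue;
        }
        if ($node->length > $left) {
            // Keep the first $left characters, remove the tail.
            $tail = $node->splitText($left);
            $tail->parentNode->removeChild($tail);
        }
        $length += $node->length;
    }

    // Serialize the inner HTML of <body>.
    $buffer = '';
    foreach ($body->childNodes as $child) {
        $buffer .= $doc->saveHTML($child);
    }
    return $buffer;
}

$string = "<p><b>Lorem Ipsum</b> is simply dummy text of the printing and typesetting industry.</p>";
echo html_excerpt($string, 15), "\n";
```

Elements that end up with no text at all are left in place as empty tags; removing them would be a separate pass.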

Ok, given the example you provided:
$string = "<p><b>Lorem Ipsum</b> is simply dummy text of the printing and typesetting industry.</p>";
$substring = substr((addslashes($string)),0,15);
One possible solution is to use the DOMDocument class if you want to close all unclosed tags:
$doc = new DOMDocument();
$doc->loadHTML($substring);
$yourText = $doc->saveHTML($doc->getElementsByTagName('*')->item(2));
//item(0) = html
//item(1) = body
echo htmlspecialchars($yourText);
//<p><b>Lorem Ips</b></p>

Retrieve data from html page using xpath and php

I know there are similar questions, but while studying PHP I ran into this error and I want to understand why it occurs.
<?php
$url = 'http://aice.anie.it/quotazione-lme-rame/';
echo "hello!\r\n";
$html = new DOMDocument();
@$html->loadHTML($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[@id='table33']/tbody/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}
?>
This prints just "hello!". I want to print the value extracted with the XPath, but the last echo doesn't output anything.

Your code has a few errors:
You try to get the table from the URL http://aice.anie.it/quotazione-lme-rame/, but it is actually in an iframe located at http://www.aiceweb.it/it/frame_rame.asp, so fetch the iframe URL directly.
You use loadHTML(), which loads an HTML string. What you need is loadHTMLFile(), which takes the location of an HTML document as its parameter (see http://www.php.net/manual/fr/domdocument.loadhtmlfile.php).
You assume there is a tbody element on the page, but there is none, so remove it from your query.
Working code:
$url = 'http://www.aiceweb.it/it/frame_rame.asp';
echo "hello!\r\n";
$html = new DOMDocument();
// "@" suppresses the warnings raised for invalid markup
@$html->loadHTMLFile($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[@id='table33']/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}
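Whether a tbody element appears in the parsed tree depends on the markup, so the query can be written to work in both cases. The sketch below is illustrative only: the inline table is a made-up stand-in for the fetched page (its cell values are not real data), and it uses libxml_use_internal_errors() in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical stand-in for the downloaded page; the values are invented.
$markup = '<table id="table33">'
        . '<tr><td>a</td><td>b</td><td><b>header</b></td></tr>'
        . '<tr><td>c</td><td>d</td><td><b>42.5</b></td></tr>'
        . '</table>';

$html = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$loaded = $html->loadHTML($markup);
libxml_clear_errors();
libxml_use_internal_errors($previous);

if (!$loaded) {
    throw new RuntimeException('Markup could not be parsed');
}

$xpath = new DOMXPath($html);
// "//tr[2]" selects the second row of its parent, whether that parent
// is the table itself or a tbody inside it.
$nodelist = $xpath->query("//*[@id='table33']//tr[2]/td[3]/b");

$values = [];
foreach ($nodelist as $n) {
    $values[] = $n->nodeValue;
}
echo implode("\n", $values), "\n";
```

With a real page, loadHTMLFile($url) would take the place of loadHTML($markup); checking its return value shows whether the document was loaded at all.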

Missing html content when using dom->saveHTML in PHP

I am getting data from a website using DOM. I've tested my code on my local server and it works perfectly. However, when I uploaded it to a server and ran it, the script returned HTML tags without any content. My code looks something like this:
$divs = $dom->getElementsByTagName('div');
foreach($divs as $div){
if($div->getAttribute('class') == "content1"){
$dom = new DOMDocument();
$dom->appendChild($dom->importNode($div, true));
$content1 = $dom->saveHTML();
echo "content:".$content1;
}
}
In my localhost, it returns something like so:
<div class="content1">This is my content</div>
However, in the server, I strangely get the empty html tags like so:
<div class="content1"></div>
What are possible causes of this problem? Is there any way I can fix it? Please advise.

For PHP versions below 5.3.6:
create a variable that contains a clone of the current node with all its sub-nodes,
append it as a child,
echo the returned value.
foreach($divs as $div) {
if($div->getAttribute('class') == "content1"){
$dom = new DOMDocument();
$cloned = $div->cloneNode(TRUE);
$dom->appendChild($dom->importNode($cloned,TRUE));
$content1 = $dom->saveHTML();
echo "content:".$content1;
}
}
EDIT: I made a mistake. It was not
$cloned = $element->cloneNode(TRUE);
but
$cloned = $div->cloneNode(TRUE);
Sorry ^^ (I hope it works).
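From PHP 5.3.6 on, saveHTML() accepts a node argument, so a second document is not needed to dump a single element. The following is a minimal sketch with made-up sample markup; it also keeps the source document in $dom untouched instead of reassigning that variable inside the loop.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$dom = new DOMDocument();
$dom->loadHTML('<div class="content1">This is my content</div><p>other</p>');

$found = [];
foreach ($dom->getElementsByTagName('div') as $div) {
    if ($div->getAttribute('class') === 'content1') {
        // Serialize just this node (requires PHP >= 5.3.6).
        $found[] = $dom->saveHTML($div);
    }
}

foreach ($found as $content1) {
    echo "content:" . $content1 . "\n";
}
```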

Dom Document - extract a document id & save

I am trying to extract a specific clump of HTML using dom document.
My code is as follows:
$domd = new DOMDocument('1.0', 'utf-8');
$domd->loadHTML($string);
$this->hook = 'content';
if($this->hook !== '') {
$main = $domd->getElementById($this->hook);
$newstr = "";
foreach($main->childNodes as $node) {
$newstr .= $domd->saveXML($node, LIBXML_NOEMPTYTAG);
}
$domd->loadHTML($newstr);
}
//MORE PARSING USING THE DOMD OBJECT
It works great BUT the foreach is quite slow, and I was wondering if there's a more intelligent way of doing this. I am re-loading the HTML into the $domd so I can keep editing. In the back of my mind I feel I should be saving a fragment, not re-loading the saved $newstr into the object.
Can this be made more elegant or faster?
Thanks!

I'm assuming you want to mutate your existing $domd document, replacing it completely with those child nodes you're grabbing from that content node:
UPDATE: Just realized that since you were reloading using loadHTML, you probably wanted to preserve the html/body nodes that it creates. Code below has been adjusted to empty body and append the fragment there:
$domd = new DOMDocument('1.0', 'utf-8');
$domd->loadHTML($string);
$this->hook = 'content';
if($this->hook !== '') {
$main = $domd->getElementById($this->hook);
$fragment = $domd->createDocumentFragment();
while($main->hasChildNodes()) {
$fragment->appendChild($main->firstChild);
}
$body = $domd->getElementsByTagName("body")->item(0);
while($body->hasChildNodes()) {
$body->removeChild($body->firstChild);
}
$body->appendChild($fragment);
}
//MORE PARSING USING THE DOMD OBJECT

PHP DOMDocument error handling

I'm having trouble trying to write an if statement for the DOM that will check if $html is blank. However, whenever the HTML page does end up blank, it just removes everything that would be below DOM (including what I had to check if it was blank).
$html = file_get_contents("http://example.com/");
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementById('dividhere')->getElementsByTagName('img');
foreach ($links as $link)
{
echo $link->getAttribute('src');
}
All this does is grab an image URL in the specified div, which works perfectly until the page is a blank HTML page.
I've tried using SimpleHTMLDOM, which didn't work either (it didn't even fetch the image on working pages). Did I happen to miss something with this one or am I just missing something in both?
include_once('simple_html_dom.php');
$html = file_get_html("http://example.com/");
foreach($html->find('div[id="dividhere"]') as $div)
{
if(empty($div->src))
{
continue;
}
echo $div->src;
}

Get rid of the $html variable and load the file straight into $dom with @$dom->loadHTMLFile("http://example.com/");, then add an if statement below that to check whether $dom is empty.
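One way to make the check explicit: getElementById() returns null when no element has the given id, and calling getElementsByTagName() on null is a fatal error that stops all further output. The sketch below guards each step; the function name image_sources() is hypothetical, and it takes the HTML as a string so the download step stays separate.

```php
<?php
/**
 * Hypothetical helper: returns the src attributes of all <img> elements
 * inside the element with the given id, or an empty array if the page
 * is blank or the element does not exist.
 */
function image_sources(string $html, string $id): array
{
    // A blank page has nothing to parse.
    if (trim($html) === '') {
        return [];
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return [];
    }

    // null when no element carries this id
    $container = $dom->getElementById($id);
    if ($container === null) {
        return [];
    }

    $sources = [];
    foreach ($container->getElementsByTagName('img') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

$page = '<div id="dividhere"><img src="a.png"><img src="b.png"></div>';
foreach (image_sources($page, 'dividhere') as $src) {
    echo $src, "\n";
}
```

In the original code, the result of file_get_contents() would be passed in after checking that it is not false.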
