This question already has answers here:
Closed 10 years ago.
The community reviewed whether to reopen this question 1 year ago and left it closed:
Original close reason(s) were not resolved
I have this html code:
<html>
<head>
...
</head>
<body>
<div>
<div class="foo" data-type="bar">
SOMECONTENTWITHMORETAGS
</div>
</div>
</body>
I can already get the "foo" element (but only its content) with this function:
private function get_html_from_node($node){
    $html = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $tmp_doc = new DOMDocument();
        $tmp_doc->appendChild($tmp_doc->importNode($child,true));
        $html .= $tmp_doc->saveHTML();
    }
    return $html;
}
But I'd like to return the full HTML of the DOMElement, including its own tag and attributes. How can I do that?
Use the optional argument to DOMDocument::saveHTML: this says "output this element only".
return $node->ownerDocument->saveHTML($node);
Note that the argument is only available from PHP 5.3.6. Before that, you need to use DOMDocument::saveXML instead. The results may be slightly different. Also, if you already have a reference to the document, you can just do this:
$doc->saveHTML($node);
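As a minimal sketch of the difference between the two calls (the markup and the file name x.png are made up for illustration), the snippet below serializes the same node with both saveHTML and saveXML. The most visible difference is usually in void elements such as img, which saveXML writes in self-closing form:

```php
<?php
// Hypothetical sample markup, modelled on the question's structure.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><div class="foo" data-type="bar"><img src="x.png">text</div></div></body></html>');

$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@class="foo"]')->item(0);

// Outer HTML of the node, tag and attributes included (PHP >= 5.3.6).
$asHtml = $node->ownerDocument->saveHTML($node);

// Fallback for older PHP versions: XML serialization of the same node.
$asXml = $node->ownerDocument->saveXML($node);

echo $asHtml, "\n";
echo $asXml, "\n";
```

Both strings start with the div's own opening tag, so either one answers the "outer HTML" question; which one is preferable depends on whether the result will be fed to an HTML or an XML consumer.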
PHP Simple HTML DOM Parser should do the job!
Related
I would like to add html tag to string of HTML in PHP, for example:
<h2><b>Hello World</b></h2>
<p>First</p>
Second
<p>Third</p>
"Second" is not wrapped in any HTML element, so the system should add a p tag around it. Expected result:
<h2><b>Hello World</b></h2>
<p>First</p>
<p>Second</p>
<p>Third</p>
I tried PHP Simple HTML DOM Parser but have no clue how to handle this. Here is a sketch of my idea:
function htmlParser($html)
{
    foreach ($html->childNodes() as $node) {
        if ($node->childNodes()) {
            htmlParser($node);
        }
        // Ideally: wrap the node's innertext in a p tag if it is not wrapped in any tag
    }
    return $html;
}
But childNodes() will not visit "Second" because it is not wrapped in an element, and regex is not recommended for handling HTML tags. Any ideas?
Much appreciated, thanks.
This was a cool question because it prompted some thought about the DOM.
I asked a related question, "How do HTML Parsers process untagged text", which received generous comments from @sideshowbarker. They made me think and improved my knowledge of the DOM, especially about text nodes.
Below is a DOM-based way of finding candidate text nodes and wrapping them in 'p' tags. There are lots of text nodes that we should leave alone, like the spaces, carriage returns and line feeds we use for formatting (which an "uglifier" may strip out).
<?php
$html = file_get_contents("nodeTest.html");         // read the test file
$dom = new DOMDocument();                           // a new DOM object
$dom->loadHTML($html);                              // build the DOM
$bodyNodes = $dom->getElementsByTagName('body');    // returns a DOMNodeList object
foreach ($bodyNodes->item(0)->childNodes as $child) // assuming one <body> node
{
    $text = "";
    // test for an untagged text node that contains more than formatting whitespace
    if (($child->nodeType == XML_TEXT_NODE) && (strlen($text = trim($child->nodeValue)) > 0))
    {   // it's a candidate for adding tags
        $newText = "<p>" . $text . "</p>";
        echo str_replace($text, $newText, $child->nodeValue);
    }
    else
    {   // not a candidate for adding tags
        echo $dom->saveHTML($child);
    }
}
nodeTest.html contains this.
<!DOCTYPE HTML>
<html>
<body>
<h2><b>Hello World</b></h2>
<p>First</p>
Second
<p>Third</p>
fourth
<p>Third</p>
<!-- comment -->
</body>
</html>
The output is shown below. I did not bother echoing the outer tags. Notice that comments and formatting whitespace are handled properly.
<h2><b>Hello World</b></h2>
<p>First</p>
<p>Second</p>
<p>Third</p>
<p>fourth</p>
<p>Third</p>
<!-- comment -->
Obviously, you need to traverse the DOM and repeat the search/replace at each element node if you want to make this more general. This example stops at the body node and processes only its direct children.
I'm not 100% sure this code is the most efficient possible; I may think about it some more and update the answer if I find a better way.
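One way the more general traversal might look is sketched below. Note that it swaps the technique: instead of echoing strings, it replaces each bare text node with a real p element in the DOM and then serializes the result. The function name wrap_bare_text and the list of container tags to recurse into are assumptions made for this illustration, not part of the DOM extension.

```php
<?php
// Hypothetical helper (not part of the DOM extension): wraps bare text nodes
// directly under $parent in <p>, then recurses into the listed container tags.
function wrap_bare_text(DOMNode $parent, array $containers = ['div', 'section']): void
{
    $doc = $parent->ownerDocument;
    // Snapshot the live child list, because replaceChild() changes it.
    foreach (iterator_to_array($parent->childNodes) as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text = trim($child->nodeValue);
            if ($text !== '') {   // skip whitespace-only formatting nodes
                $p = $doc->createElement('p');
                $p->appendChild($doc->createTextNode($text));
                $parent->replaceChild($p, $child);
            }
        } elseif ($child->nodeType === XML_ELEMENT_NODE
            && in_array($child->nodeName, $containers, true)) {
            wrap_bare_text($child, $containers);
        }
    }
}

// Sample input: "Second" and "Inner" are the untagged text nodes.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><h2><b>Hello World</b></h2><p>First</p>Second<div>Inner<p>Third</p></div></body></html>');
$body = $dom->getElementsByTagName('body')->item(0);
wrap_bare_text($body);

// Serialize the children of <body>, skipping the outer tags.
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out, "\n";
```

Restricting the recursion to an explicit list of containers is a deliberate choice here: recursing into every element would also wrap text inside inline elements such as b, which is rarely what is wanted.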
I used a crude way to solve this problem. Here is my code:
function addPTag($html)
{
    $contents = preg_split("/(<\/.*?>)/", $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($contents as &$content) {
        if (substr($content, 0, 1) != '<') {
            $chars = preg_split("/(<)/", $content, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
            $chars[0] = '<p>' . $chars[0] . '</p>';
            $content = implode($chars);
        }
    }
    return implode($contents);
}
I hope there is a more elegant way than this. Thanks.
You can try Simple HTML DOM Parser:
$stringHtml = 'Your received html';
$html = str_get_html($stringHtml);
// Find the necessary element and edit it
$exampleText = $html->find('Your selector here', 0)->last_child()->innertext;
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 5 years ago.
I am trying to get a specific div element (i.e. the one with attribute id="vung_doc") from a website, but I get almost every element. Do you have any idea what's wrong?
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://lightnovelgate.com/chapter/epoch_of_twilight/chapter_300');
$xpath = new DOMXPath($doc);
$query = "//*[@class='vung_doc']";
$entries = $xpath->query($query);
var_dump($entries->item(0)->textContent);
Actually, it appears that that one element, which has both id and class attributes with value vung_doc, has many paragraphs inside its text content. Perhaps you are thinking each paragraph should be in its own div element.
<div id="vung_doc" class="vung_doc" style="font-size: 18px;">
<p></p>
"Mayor song..."
In the screenshot at the bottom of this post, I added an outline style to that element, to show just how many paragraphs are within that element.
If you wanted to separate the paragraphs, you could use preg_split() to split on any new line characters:
$entries = $xpath->query($query);
foreach($entries as $entry) {
    $paragraphs = preg_split("/[\r\n]+/s",$entry->textContent);
    foreach($paragraphs as $paragraph) {
        if (trim($paragraph)) {
            echo '<b>paragraph:</b> '.$paragraph;
            break;
        }
    }
}
See a demonstration of this in this playground example. Note that before loading the HTML file, libxml_use_internal_errors() is called, to suppress the XML errors:
libxml_use_internal_errors(true);
Screenshot of the target div element with outline added:
Change
$query = "//*[@class='vung_doc']";
to
$query = "//*[@id='vung_doc']";
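To see why the attribute used in the predicate matters, here is a small self-contained sketch. The markup is invented for illustration (the real page is not reproduced here): two elements share the class, but only one carries the id.

```php
<?php
// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Hypothetical markup: the class is shared, the id is unique.
$doc->loadHTML('<html><body><div id="menu" class="vung_doc">menu</div><div id="vung_doc" class="vung_doc"><p>chapter text</p></div></body></html>');
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$byClass = $xpath->query("//*[@class='vung_doc']");
$byId    = $xpath->query("//*[@id='vung_doc']");

echo $byClass->length, "\n";             // 2: every element with that class
echo $byId->length, "\n";                // 1: only the element with that id
echo $byId->item(0)->textContent, "\n";  // chapter text
```

Note that [@class='vung_doc'] is an exact string match on the whole attribute value, so an element with class="vung_doc active" would not be selected by it.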
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How can I get an element's serialised HTML with PHP's DOMDocument?
PHP + DOMDocument: outerHTML for element?
I am trying to extract all img tags from a string. I am using:
$domimg = new DOMDocument();
@$domimg->loadHTML($body);
$images_all = $domimg->getElementsByTagName('img');
foreach ($images_all as $image) {
    // do something
}
I want to put the src= values or even the complete img tags into an array or string.
Use saveXML() or saveHTML() on each node to add it to an array:
$img_links = array();
$domimg = new DOMDocument();
$domimg->loadHTML($body);
$images_all = $domimg->getElementsByTagName('img');
foreach ($images_all as $image) {
    // Append the XML or HTML of each to an array
    $img_links[] = $domimg->saveXML($image);
}
print_r($img_links);
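If only the src values are needed rather than the complete tags, DOMElement::getAttribute can be used inside the same loop. A minimal sketch, assuming a made-up $body string:

```php
<?php
// Hypothetical input; in the question, $body comes from elsewhere.
$body = '<p>Intro <img src="a.png" alt="A"> and <img src="b.jpg"></p>';

// Collect libxml warnings quietly instead of using the @ operator.
libxml_use_internal_errors(true);
$domimg = new DOMDocument();
$domimg->loadHTML($body);
libxml_clear_errors();

$srcs = array();
foreach ($domimg->getElementsByTagName('img') as $image) {
    // getAttribute() returns an empty string when the attribute is missing.
    $srcs[] = $image->getAttribute('src');
}
print_r($srcs);
```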
You could try a DOM parser like simplexml_load_string. Take a look at a similar answer I posted here:
Needle in haystack with array in PHP
This question already has answers here:
How to get innerHTML of DOMNode?
(9 answers)
Closed 5 years ago.
What's the simplest way to get the innerHTML (tags and all) of a DOMElement using PHP's DOM functions?
$html = '';
foreach($parentElement->childNodes as $node) {
    $html .= $dom->saveHTML($node);
}
Inner HTML
Try the approach suggested by @trincot:
$html = implode(array_map([$node->ownerDocument,"saveHTML"], iterator_to_array($node->childNodes)));
Outer HTML
Try:
$html = $node->ownerDocument->saveHTML($node);
or in PHP lower than 5.3.6:
$html = $node->ownerDocument->saveXML($node);
This question already has answers here:
How to get innerHTML of DOMNode?
(9 answers)
Closed 2 years ago.
Description of the current situation:
I have a folder full of pages (pages-folder), each page inside that folder has (among other things) a div with id="short-info".
I have a code that pulls all the <div id="short-info">...</div> from that folder and displays the text inside it by using textContent (which is for this purpose the same as nodeValue)
The code that loads the divs:
<?php
$filename = glob("pages-folder/*.php");
sort($filename);
foreach ($filename as $filenamein) {
    $doc = new DOMDocument();
    $doc->loadHTMLFile($filenamein);
    $xpath = new DOMXpath($doc);
    $elements = $xpath->query("*//div[@id='short-info']");
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->textContent;
        }
    }
}
?>
Now the problem is that if the page I am loading has a child, like an image: <div id="short-info"> <img src="picture.jpg"> Hello world </div>, the output will only be Hello world rather than the image and then Hello world.
Question:
How do I make the code display the full html inside the div id="short-info" including for instance that image rather than just the text?
You have to make an undocumented call on the node.
$node->c14n() will give you the markup contained in $node.
Crazy, right? I lost some hair over that one.
http://php.net/manual/en/class.domnode.php#88441
Update
Note that c14n() canonicalizes the markup, so the output may differ from the original HTML. It is better to use
$html = $node->ownerDocument->saveHTML($node);
instead.
You'd want what amounts to 'innerHTML', which PHP's dom doesn't directly support. One workaround for it is here in the PHP docs.
Another option is to take the $node you've found, insert it as the top-level element of a new DOM document, and then call saveHTML() on that new document.
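Both routes can be sketched on the markup from the question. This is only an illustration under the assumption that the div is found with XPath as in the question's code; the variable names $inner, $outer and $tmp are made up here.

```php
<?php
libxml_use_internal_errors(true);   // keep parse warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="short-info"><img src="picture.jpg"> Hello world</div></body></html>');
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$div = $xpath->query("//div[@id='short-info']")->item(0);

// Route 1: "innerHTML" by serializing each child node in place.
$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

// Route 2: copy the node into a fresh document and serialize that document.
// This yields the div itself plus its contents.
$tmp = new DOMDocument();
$tmp->appendChild($tmp->importNode($div, true));
$outer = trim($tmp->saveHTML());

echo $inner, "\n";
echo $outer, "\n";
```

Route 1 keeps the image tag and the text but drops the wrapping div, which matches what the question asks for. Route 2 is closer to an "outerHTML", and on PHP 5.3.6 or later the shorter saveHTML($div) call covers the same case.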