DOMDocument and hr tag losing HTML - php

I'm using PHP's DOMDocument class to parse HTML from the TinyMCE editor. I'm having trouble inserting <hr /> elements into the editor, because DOMDocument drops everything that follows the <hr />.
# Input: <hr /><p> </p><p>test input</p>
$domDoc = new DOMDocument();
$domDoc->loadHTML($input, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
var_dump($domDoc->saveHTML());
// Result: <hr>
I can't find any reason for this, nor an option for loadHTML() that prevents it. What exactly happens, and can I use an hr element here?

The answer was as follows:
substr($domDoc->saveHTML($domDoc->getElementsByTagName('body')->item(0)), 6, -7)
The fix is in the saveHTML() call: I pass it the body node and then strip the surrounding <body> and </body> tags with substr(). Now I get the full HTML out, and it is a one-line solution. Note that a body node only exists when the HTML is loaded without LIBXML_HTML_NOIMPLIED.

It seems that DOMDocument has problems when it encounters an HTML string that is not entirely wrapped in a single element. So if you start with:
<h1>My Title</h1><p>My text</p>
then read it into DOMDocument and use the DOMDocument object to generate the HTML again, you'll get something like:
<h1>My Title<p>My text</p></h1>
For my application the solution was to wrap the entire content in a div before sending it to DOMDocument. This fixes the issue posted by the OP: if there is a leading hr tag, wrapping the entire HTML string in a div preserves both it and the rest of the content.
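The wrapping approach described above can be sketched roughly as follows. This is a minimal illustration rather than the answerer's actual code: the variable names and the simplified input string are assumptions, and the wrapper is removed again by serialising only the children of the temporary div.

```php
<?php
// Simplified version of the input from the question (assumed for illustration).
$input = '<hr /><p>test input</p>';

$doc = new DOMDocument();
// Wrap the snippet in a single div so the parser sees exactly one root element.
$doc->loadHTML(
    '<div>' . $input . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// The temporary div is now the document element.
$wrapper = $doc->documentElement;

// Serialise the wrapper's children one by one, leaving the div itself out.
$out = '';
foreach ($wrapper->childNodes as $child) {
    $out .= $doc->saveHTML($child);
}

echo $out;
```

Serialising the child nodes avoids having to cut the wrapper off with substr(), at the cost of a short loop. Whitespace in the serialised output may vary between libxml versions.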

Related

Stop PHP DOMDocument / LibXML restructuring content

When I load HTML content with DOMDocument, it gets restructured.
I know that p tags are not allowed inside h1, but this is what I have to work with. Although the spec says it's not allowed, everything is still correctly nested (no missing closing tags etc.).
...
<h1>
<p>Nested paragraph</p>
</h1>
...
Then, when I run
$dom = new \DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($content);
the output looks like this:
<h1>
</h1>
<p>Nested paragraph</p>
The p has been moved outside the h1. Is there a way to tell it not to care about matching the spec, and just ensure that tags are closed etc.? How is this going to work with custom elements in the future?

Why does this DOMXPath query merge sibling nodes' values?

Given the following code:
$html = "<h1>foo</h1><h2>bar</h2>";
$document = new DOMDocument();
$document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($document);
$h1Nodes = $xpath->query('//h1');
foreach ($h1Nodes as $h1Node) {
var_dump($h1Node->nodeValue);
}
The h1 tag contains only a text node with the text 'foo'. The text 'bar' is in a sibling heading node (h2). I would expect the output to be 'foo'.
However, the output is 'foobar'.
Why?
Thank you for your comment, hardik solanki.
It led me to the answer: valid markup must have a root element.
The markup I provided doesn't have one, and the flags I used prevent the library from adding one implicitly. So the first tag is treated as the root element, and the result is a bit confusing.
Dropping those flags fixes this issue, but I am using them for a purpose. I just want to manipulate a snippet of HTML, not a whole document, and get this snippet back (after transformations) by calling DOMDocument::saveHTML(), without the doctype/<html>/<body> tags.
I ended up doing this:
1. Add doctype/<html>/<body> tags to the HTML snippet I want to manipulate, so that I temporarily have a valid document
2. Load it with DOMDocument
3. Transform it the way I need
4. Save it with DOMDocument::saveHTML()
5. Get rid of the excess doctype/<html>/<body> markup
It works.
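The five steps above might look something like the sketch below. The helper name transformSnippet() and its callback signature are hypothetical, introduced only for illustration. It also differs from the description in one respect: instead of saving the whole document and then stripping the extra markup, it serialises only the children of the body element, which should give the same result.

```php
<?php
// Hypothetical helper: the name and signature are illustrative, not from the post.
function transformSnippet(string $snippet, callable $transform): string
{
    $doc = new DOMDocument();

    // Wrap the snippet in a temporary, complete document and load it
    // without any LIBXML_HTML_* flags.
    $doc->loadHTML('<!DOCTYPE html><html><body>' . $snippet . '</body></html>');
    $body = $doc->getElementsByTagName('body')->item(0);

    // Let the caller transform the document in place.
    $transform($doc, $body);

    // Serialise only the body's children, so the temporary
    // doctype/<html>/<body> markup never reaches the output.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$seen = [];
$result = transformSnippet(
    '<h1>foo</h1><h2>bar</h2>',
    function (DOMDocument $doc, DOMElement $body) use (&$seen) {
        $xpath = new DOMXPath($doc);
        foreach ($xpath->query('//h1') as $h1) {
            $seen[] = $h1->nodeValue;          // value of the h1 only
            $h1->setAttribute('class', 'title');
        }
    }
);

var_dump($seen);
echo $result;
```

Because the h1 and h2 are now siblings under body, the XPath query sees the h1's own text rather than the merged value.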

Cannot parse into <code> tag - PHP - simple html dom

I am trying to extract the content of a <div> nested inside a <code> tag with PHP Simple HTML DOM Parser, but I always get the error "Trying to get property of non-object in...", as if the parser were finding nothing inside my <div>.
The code I'm using is
include_once('simplehtmldom_1_5/simple_html_dom.php');
// Create a DOM object
$html = new simple_html_dom();
// Load HTML
$html->load('<code><div>hello</div></code>');
// Extract div content
echo $html->find('div',0)->innertext;
But if, instead of <code><div>hello</div></code>, I use <span><div>hello</div></span> as my sample code, it works. It seems I only have problems when looking inside the code tag.
What's wrong with what I'm doing?
Hope you guys can point me in the right direction. Thank you very much for your support!
Simple HTML DOM, among others, strips out preformatted tags.
If you want the code tag to be recognized, delete or comment out line 1076 in *simple_html_dom.php*
According to the source code for Simple HTML DOM, it automagically removes code tags when it loads the HTML into the parser.
If you need the functionality, you'll need to remove the reference to remove_noise() in the load() function within simple_html_dom.php.
This should produce the results you expect, but it may well introduce other issues, depending on the author's reasoning for removing the tags in the first place.

How to use PHP Simple HTML DOM Parser to find the not hyper linked text

I want to parse HTML into a DOM tree and find all the text NOT inside <a> tags. I googled it and found "PHP Simple HTML DOM Parser", which seems able to parse the HTML into a DOM tree. However, I can only find the elements that are inside <a> tags, not the text outside them. P.S. It doesn't support the CSS3 :not selector yet.
Does anyone have experience with this? Thank you.
I hope I'm not misunderstanding the question, but can't you use PHP's built-in DOM functions to find the text inside the <a> tags?
$doc = new DOMDocument();
$doc->loadHTMLFile("http://blahblah.com/blah.html");
$elem_list = $doc->getElementsByTagName("a");
foreach($elem_list as $elem)
echo $elem->textContent;
In that case I would remove all <a> tags and their contents (for example with regular expressions) and then load the resulting HTML into your DOM parser of choice.
Update: Even better, parse the HTML right away and use the built-in functions to remove the <a> tags, or loop through all tags and just skip the <a> tags. Using regex on HTML should be avoided.
I have used this class many times. It's an excellent solution for parsing HTML/DOM in PHP.
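One way to do this with the built-in DOM extension is an XPath query that selects only text nodes with no <a> ancestor. This is a sketch of that variation, not the answerer's code, and the sample HTML string is an assumption.

```php
<?php
// Sample input, assumed for illustration.
$html = '<p>Plain text <a href="#">linked text</a> more plain text</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Select every text node that does not have an <a> element among its ancestors.
$parts = [];
foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
    $parts[] = $textNode->nodeValue;
}

$text = implode('', $parts);
echo $text;
```

The ancestor axis also excludes text nested more deeply inside a link, such as text in a <span> within an <a>.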
$html = new simple_html_dom();
// Load your html as string
$html->load('........ HTML ..........');
$a = $html->find('a');
$text='';
for($i=0;$i<count($a);$i++)
$text.=$a[$i]->innertext;
The variable $text now contains all the text inside the <a> tags.
Hope this helps.

Regex and PHP for extracting contents between tags with several line breaks

How can I extract the content between tags when it contains several line breaks?
I'm a newbie to regex and would like to know how to handle an unknown number of line breaks in my match.
Task: Extract the content between <div class="test"> and the first closing </div> tag.
Original source:
<div class="test">optional text<br/>
content<br/>
<br/>
content<br/>
...
content<br/>Hyperlink</div></div></div>
I've worked out the regex below:
/<div class=\"test\">(.*?)<br\/>(.*?)<\/div>/
I just wonder how to match several line breaks using regex.
There is DOM, but I am not familiar with it.
You should not parse (X)HTML with regular expressions. Use DOM.
I'm a beginner in XPath, but something like this should work:
//div[@class='test']
This selects all divs with the class 'test'. You will need to load your HTML into a DOMDocument object, then create a DOMXPath object for it, and call its query() method to get the results. It will return a DOMNodeList object.
The final code looks something like this:
$domd = new DOMDocument();
$domd->loadHTML($your_html_code);
$domx = new DOMXPath($domd);
$items = $domx->query("//div[@class='test']");
After this, your div is in $items->item(0).
This is untested code, but if I remember correctly, it should work.
Update: I forgot that you need the content.
If you need the text content (no tags), you can simply read $items->item(0)->textContent. If you also need the tags, here's the equivalent of JavaScript's innerHTML for PHP DOM:
function innerHTML($node){
$doc = new DOMDocument();
foreach ($node->childNodes as $child)
$doc->appendChild($doc->importNode($child, true));
return $doc->saveHTML();
}
Call it with $items->item(0) as the parameter.
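Putting the pieces together, a minimal end-to-end sketch might look like this. It uses DOMXPath::query() and the @class attribute test. The sample HTML and variable names are assumptions, and the inner HTML is built by serialising each child with the owner document instead of importing the nodes into a second document.

```php
<?php
// Sample input with line breaks inside the target div (assumed for illustration).
$html = '<div><div class="test">optional text<br/>' . "\n"
      . 'content<br/>' . "\n"
      . 'Hyperlink</div></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$items = $xpath->query("//div[@class='test']");

// The first (and here only) matching div.
$node = $items->item(0);

// Text only, tags removed; line breaks in the source are kept as-is.
$text = $node->textContent;

// Inner HTML: serialise each child node of the div in turn.
$inner = '';
foreach ($node->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $text, "\n", $inner, "\n";
```

Line breaks need no special handling here, because the parser treats them as ordinary characters inside text nodes.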
You could use preg_match_all('/<div class="test">(.*?)<\/div>/si', $html, $matches);. But remember that this will match up to the first closing </div> within the HTML. I.e., if the HTML looks like <div class="test">...aaa...<div>...bbb...</div>...ccc...</div>, then you would get ...aaa...<div>...bbb... as the result in $matches.
So in the end, using a DOM parser would indeed be a better solution.
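The nested-div pitfall is easy to reproduce. The snippet below is a small demonstration using the sample string from the answer; the variable names are illustrative.

```php
<?php
$html = '<div class="test">...aaa...<div>...bbb...</div>...ccc...</div>';

// The s modifier lets "." match newlines, i makes the match case-insensitive,
// and ".*?" is non-greedy, so the match stops at the first closing </div>.
preg_match_all('/<div class="test">(.*?)<\/div>/si', $html, $matches);

// First capture group of the first match.
$captured = $matches[1][0];

var_dump($captured);
```

The capture ends at the inner div's closing tag, so the trailing ...ccc... is lost. The s modifier is what handles the line breaks asked about in the question.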
