Stop PHP DomDocument / LibXML restructuring content - php

When loading HTML content with DomDocument it gets restructured.
I know that p tags are not allowed inside h1 but this is what I have to work with. Whilst the spec says it’s not allowed everything is still correctly nested (no missing closing tag etc.)
...
<h1>
<p>Nested paragraph</p>
</h1>
...
Then, when I run:
$dom = new \DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($content);
the output looks like this:
<h1>
</h1>
<p>Nested paragraph</p>
The p has been moved outside the h1. Is there a way to tell it not to care about matching the spec, but just to ensure that tags are closed and so on? How is this going to work with custom elements in the future?

Related

Why ul element was moved out of pre element in php domdocument? [duplicate]

I have HTML like the following:
<pre>
<code>
some code
<div></div>
</code>
<ul>
<li>1</li>
</ul>
</pre>
others
I parse it with DOMDocument.
After I run this:
$doc = new DOMDocument();
$doc->loadHTML($html);
echo $doc->saveHTML();
the ul element has been moved out of the pre element:
<pre>
<code>
some code
<div></div>
</code>
</pre>
<ul>
<li>1</li>
</ul>
others
Why does this happen, and how can I keep the structure the same?
Why?
According to the specification, a UL element cannot belong to the content of a PRE element.
Permitted content: Phrasing content
Phrasing content is a subset of flow content that defines the text and
the markup it contains, and can be used everywhere flow content is
expected. Runs of phrasing content make up paragraphs.
Elements belonging to this category are <abbr>, <audio>, <b>,
<bdo>, <br>, <button>, <canvas>, <cite>, <code>, <command>,
<data>, <datalist>, <dfn>, <em>, <embed>, <i>, <iframe>,
<img>, <input>, <kbd>, <keygen>, <label>, <mark>, <math>,
<meter>, <noscript>, <object>, <output>, <picture>, <progress>,
<q>, <ruby>, <samp>, <script>, <select>, <small>, <span>,
<strong>, <sub>, <sup>, <svg>, <textarea>, <time>, <u>, <var>,
<video>, <wbr> and plain text (not only consisting of white spaces
characters).
This can be seen in the warning message after calling $doc->loadHTML($html):
Warning: DOMDocument::loadHTML(): Unexpected end tag : pre in Entity
How to keep it the same?
If you still need to work only with a fragment of the DOM structure that does not meet the specification, use the createDocumentFragment and appendXML functions:
$doc = new DOMDocument();
$docFragment = $doc->createDocumentFragment();
$docFragment->appendXML($html);
$doc->appendChild($docFragment);
echo $doc->saveHTML();
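A minimal sketch of the fragment approach applied to the markup from the question. It assumes the fragment is well-formed XML: appendXML() uses the XML parser, so an unclosed tag such as a bare <br> or an undefined entity such as &nbsp; makes it fail. The wrapper div root and the variable names are illustrative assumptions, not part of the original answer; the XPath check only confirms that the ul is still inside the pre.

```php
<?php
$html = "<pre>\n<code>\nsome code\n<div></div>\n</code>\n<ul>\n<li>1</li>\n</ul>\n</pre>\nothers";

$doc = new DOMDocument();
// Illustrative wrapper element so the document has a single root.
$root = $doc->appendChild($doc->createElement('div'));

$fragment = $doc->createDocumentFragment();
// appendXML() returns false if the markup is not well-formed XML.
if (!$fragment->appendXML($html)) {
    throw new RuntimeException('Fragment is not well-formed XML');
}
$root->appendChild($fragment);

// Count ul elements that are still direct children of a pre element.
$xpath = new DOMXPath($doc);
$ulInsidePre = $xpath->query('//pre/ul')->length;

echo $doc->saveHTML($root), "\n";
echo "ul elements still inside pre: ", $ulInsidePre, "\n";
```

Because no HTML parser is involved here, nothing is rearranged to match the HTML content model; the trade-off is that real-world HTML which is not valid XML cannot be loaded this way.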

DOMDocument and hr tag losing HTML

I'm using PHP's DOMDocument class to parse HTML from the TinyMCE editor. I'm having issues inserting <hr /> elements into the editor, because DOMDocument keeps losing the rest of the code.
# Input: <hr /><p> </p><p>test input</p>
$domDoc = new DOMDocument();
$domDoc->loadHTML($input, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
var_dump($domDoc->saveHTML());
// Result: <hr>
I can't find any reason for this, nor an option for loadHTML() to prevent it. What exactly happens, and can I use the hr element here?
The answer was as follows:
substr($domDoc->saveHTML($domDoc->getElementsByTagName('body')->item(0)), 6, -7)
The fix is in the saveHTML() call: I pass it the body node and then cut the surrounding tags off the result (substr(..., 6, -7) drops the leading <body> and the trailing </body>). Now I get the full HTML out. This is also a one-line solution.
It seems that DomDocument has problems when it encounters an HTML string that is not entirely wrapped in a single element. So if you start with:
<h1>My Title</h1><p>My text</p>
then read it into DomDocument and use the DomDocument object to generate the HTML again, you'll get something like:
<h1>My Title<p>My text</p></h1>
For my application the solution was to wrap the entire content in a div before sending it to DomDocument. This fixes the issue posted by the OP - if there is a leading hr tag, wrapping the entire html string in a div will preserve it and the rest of the content.
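The wrapping workaround can be sketched as follows. This is a minimal illustration, not the original poster's code: the variable names and the loop that serializes only the wrapper's children (so the extra div does not appear in the output) are assumptions.

```php
<?php
$input = '<hr /><p> </p><p>test input</p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
// The wrapping <div> gives the parser a single root element to work with.
$doc->loadHTML('<div>' . $input . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Serialize only the children of the root so the wrapper itself is not output.
$output = '';
foreach ($doc->documentElement->childNodes as $child) {
    $output .= $doc->saveHTML($child);
}
echo $output, "\n";
```

Serializing the children one by one avoids having to strip the wrapper with substr() afterwards.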

DOM xpath to find #text nodes and wrap in paragraph tag

I would like to find all root-level #text nodes (or those with div parents) which should be wrapped inside a <p> tag. In the following text there should be three (or even just two) final root <p> tags.
<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.
The idea is to format the text nicer so that text blocks are grouped into paragraphs for HTML display. However, the following xpath I have been working out seems to fail to select the text nodes.
<?php
$html = '<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.';
libxml_use_internal_errors(true);
// loadHTML() is an instance method, so create the document first.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xpath = '//text()[not(parent::p) and normalize-space()]';
foreach ($xp->query($xpath) as $node) {
    // Put a new <p> where the text node was, then move the text into it.
    $element = $dom->createElement('p');
    $node->parentNode->replaceChild($element, $node);
    $element->appendChild($node);
}
print $dom->saveHTML();
OK, so let me rephrase my comment as an answer. If you want to match all text nodes, you should simply remove the //div part from your XPath expression. So it becomes:
//text()[not(parent::p) and normalize-space()]
Your scenario has many edge cases, and the word "should" adds to them. I assume you want the classic "a double line break starts a new paragraph" behavior, but this time inside a parent <div> (or other block elements) as well.
I would let the HTML parser do most of the work, but I would still use text search and replace (alongside XPath). So what follows is a bit hackish, but I think it is pretty stable:
First of all I would select all text-nodes that are of top-level or child of the said div.
(.|./div)/text()
This xpath is relative to an anchor element which is the <body> tag as it represents the root-tag of your HTML fragment when loaded into DOMDocument.
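A minimal sketch of running that relative query against the <body> anchor, using the markup from the question. The normalize-space() predicate (to skip whitespace-only nodes) and the $texts array are additions for illustration only.

```php
<?php
$html = "<div>\nThis text should be wrapped in a p tag.\n</div>\n"
    . "This also should be wrapped.\n<b>And</b> this.";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// The <body> element serves as the anchor (context node) for the relative query.
$anchor = $dom->getElementsByTagName('body')->item(0);

// Text nodes that are children of the anchor itself or of its direct <div> children.
$texts = [];
foreach ($xp->query('(.|./div)/text()[normalize-space()]', $anchor) as $node) {
    $texts[] = trim($node->data);
}
print_r($texts);
```

The text inside <b> is not selected, because its parent is neither the anchor nor a div directly below it.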
If child of a div then I would insert the starting paragraph at the very beginning.
Then in any case I would insert a break-mark (here in form of a comment) at each occurrence of the sequence that starts a new paragraph (that should be "\n\n" because of whitespace normalization, I might be wrong and if it doesn't apply, you would need to do the whitespace-normalization upfront to have this working transparently).
/** @var DOMText[] $result */
$result = $xp->query('(.|./div)/text()', $anchor);
foreach ($result as $i => $node)
{
    // Text directly inside a div gets a paragraph mark at its very beginning.
    if ($node->parentNode->tagName == 'div')
    {
        $insertBreakMarkBefore($node, true);
    }
    // Split the text at every paragraph sequence and mark each split point.
    while (FALSE !== $pos = strpos($node->data, $paragraphSequence))
    {
        $node = $node->splitText($pos + $paragraphSequenceLength);
        $insertBreakMarkBefore($node);
    }
}
These inserted break marks exist only to be replaced with an HTML <p> tag. An HTML parser will turn those into proper <p>...</p> pairs, so I can spare myself writing that algorithm (even though that might be interesting). This basically works the way I once outlined in another answer, which I can no longer find:
After modifying the DOM tree, get the inner HTML of the <body> again.
Replace the marks with "<p>" (here I set a class as well to make this visible).
Load the HTML fragment into the parser again to re-create the DOM with proper <p>...</p> pairs.
Obtain the HTML again from the DOMDocument parser, which is now final.
These outlined steps in code (skipping some of the function definitions for a moment):
$needle = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);
$html = strtr($innerHTML($anchor), array($needle . $needle => $replace, $needle => $replace));
echo "HTML afterwards:\n", $innerHTML($loadHTMLFragment($html));
As this shows, double sequences are replaced with a single one. The one at the end probably needs to be deleted as well (if applicable; you could also trim whitespace here).
The final HTML output:
<div>
<p class="break">
This text should be wrapped in a p tag.
</p>
</div>
<p class="break">
This also should be wrapped.
</p>
<p class="break">
<b>And</b> this.</p>
Some more post-processing for nicer output formatting can be useful, too. I think it is worth doing, as it will help you tweak the algorithm. Note that whitespace normalization probably does not apply here, so use with care.
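The answer uses the helpers $innerHTML and $insertBreakMarkBefore without showing their definitions. The following is only a guess at what they might look like: both closures, the 'break' marker text and the tiny demo document are hypothetical, and the second argument passed to $insertBreakMarkBefore in the answer is not modelled here.

```php
<?php
// Hypothetical marker text; the answer's output suggests "break" but does not define it.
$paragraphComment = 'break';

// Hypothetical helper: serialize the children of a node (its "inner HTML").
$innerHTML = function (DOMNode $node): string {
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
};

// Hypothetical helper: insert a comment node as a break mark before the given node.
$insertBreakMarkBefore = function (DOMNode $node) use ($paragraphComment): void {
    $mark = $node->ownerDocument->createComment($paragraphComment);
    $node->parentNode->insertBefore($mark, $node);
};

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<div>first</div>second');
$body = $dom->getElementsByTagName('body')->item(0);

// Mark the last child of <body>, then look at the resulting inner HTML.
$insertBreakMarkBefore($body->lastChild);
$result = $innerHTML($body);
echo $result, "\n";
```

A comment is a convenient marker because it survives serialization unchanged and is unlikely to collide with real content when it is later replaced by a <p> tag.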
You can also do it with pure JavaScript if you wish:
var content = document.evaluate(
    '//text()',
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
);
for (var i = 0; i < content.snapshotLength; i++) {
    console.log(content.snapshotItem(i).textContent);
}
I know it is not xpath but check this out:
PHP Simple HTML DOM Parser
http://simplehtmldom.sourceforge.net/
Features
A HTML DOM parser written in PHP5+ let you manipulate HTML in a very easy way!
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.

How can I remove all content between <pre> tags in PHP?

I've read and tried to implement variations on about 10 different solutions for this so far from stack overflow, and none of them are working. All I want to do is replace the content between two pre tags (including the tags themselves). I don't care if it's regex or straight up php. Anyone have any suggestions?
An example is:
This is how to remove pre tags and their contents:<br/>
<pre>
<?php
[code here]
?>
</pre>
That's all there is to it.
becomes:
This is how to remove pre tags and their contents:<br/>
That's all there is to it.
This needs to happen before the html is rendered to the page.
I'm not sure DOMDocument will work. The context for my code is that it is happening within a plugin for expression engine (a codeigniter / php based CMS). The plugin truncates the html to a set character length, and renders that back to the parent template to be rendered in the browser - so the domdocument can't render to the browser - it just needs to return the code to the parent template with the tags and content removed.
Regex will work fine if you use assertions (ie lookahead/lookbehind). This should remove anything within pre tags:
$page_content = preg_replace('/<(pre)(?:(?!<\/\1).)*?<\/\1>/s','',$page_content);
If you want to include other tags, just add them to the initial matching group like:
(pre|script|style)
The only real issue with regex tag removal is nested tags of the same type, like:
<div>
    <div>inner closing tag might match beginning outer opening div tag leaving an orphan outer closing tag</div>
</div>
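A short sketch of the pattern with the extended tag group, applied to a sample string modelled on the question. The sample text and variable names are illustrative assumptions.

```php
<?php
$page_content = "This is how to remove pre tags and their contents:<br/>\n"
    . "<pre>\n[code here]\n</pre>\n"
    . "<script>alert(1);</script>\n"
    . "That's all there is to it.";

// \1 refers back to whichever tag name matched, so <pre> is only closed by </pre>.
// The s modifier lets "." match newlines inside the block.
$pattern = '/<(pre|script|style)(?:(?!<\/\1).)*?<\/\1>/s';
$cleaned = preg_replace($pattern, '', $page_content);

echo $cleaned, "\n";
```

The line breaks that surrounded the removed blocks remain in the string, so trimming or collapsing whitespace afterwards may be useful.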
Edit
I tested the example you left in the other comment on the other answer, works fine for me:
$html = 'This is a quick snippet that often comes in handy: <pre>[code]blah blah[/code]</pre>';
$html = preg_replace('/<(pre)(?:(?!<\/?\1).)*?<\/\1>/s',"",$html);
var_dump($html);
result:
string(51) "This is a quick snippet that often comes in handy: "
Use DOMDocument:
$html = '<div id="container">
<div id="test"></div>
<pre>
content
</pre>
</div>';
$dom = new DOMDocument;
$dom->loadXML($html);
$xpath = new DOMXPath($dom);
// Select <pre> elements directly inside the div with id "container".
$query = '//div[@id="container"]/pre';
// $query = '//pre'; // for all <pre>
$entries = $xpath->query($query);
foreach ($entries as $one) {
    // Replace each matched <pre> with a plain text node.
    $newelement = $dom->createTextNode('Some new node!');
    $one->parentNode->replaceChild($newelement, $one);
}
echo $dom->saveHTML();
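Since the question asks for the pre elements to be removed rather than replaced, a variant of the same idea might look like this. It is a sketch under assumptions: it uses loadHTML() so the input does not have to be well-formed XML, the sample string is modelled on the question, and the result is returned as a string by serializing the children of <body>.

```php
<?php
$html = 'This is how to remove pre tags and their contents:<br/>'
    . '<pre>[code here]</pre>'
    . 'That is all there is to it.';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// Wrapped in a div so the fragment starts with an element rather than bare text.
$dom->loadHTML('<div>' . $html . '</div>');

// Remove every <pre> element together with its contents.
$xpath = new DOMXPath($dom);
foreach (iterator_to_array($xpath->query('//pre')) as $pre) {
    $pre->parentNode->removeChild($pre);
}

// Build the result as a string instead of printing the whole document.
$body = $dom->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result, "\n";
```

Nothing here is rendered to the browser; $result is an ordinary string that a plugin could hand back to its parent template.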

php replace \n except where other valid HTML tags appear

I am looking for a function that replaces \n stored in the db via user input into a textfield, except where there are already HTML tags in place. This is for a CMS, so that users have less work to do.
So, for instance, if the user wrote the following into the textfield:
<H1>Title of page</H1>
This is the first paragraph in the page.
<H2>Sub section</H2>
This is a sub-section.
I'd want the function to return:
<H1>Title of page</H1>
<p>This is the first paragraph in the page.</p>
<H2>Sub section</H2>
<p>This is a sub-section.</p>
Can anyone help with something they already have / have found?
I would avoid reinventing the wheel, and you are probably going to run into a ton of special rules you have to handle. Even in your question, the rules are unclear. What does it have to do with \n? I recommend using an html parser. PHP has some:
$dom = new DOMDocument;
$dom->loadHTML($start);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//text()[following-sibling::* or preceding-sibling::*]');
foreach ($nodes as $node) {
    $parent = $node->parentNode;
    $p = $dom->createElement('p', htmlentities($node->nodeValue, ENT_COMPAT, 'UTF-8'));
    $parent->insertBefore($p, $node);
    $parent->removeChild($node);
}
This will wrap all text nodes that are siblings of another node in <p>, including whitespace. An important question is: are there ever text nodes with siblings that don't need to be wrapped?
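If whitespace-only text nodes should be left alone, one possible variant is sketched below. It is an illustration, not the answer's method: it assumes only text directly under <body> needs wrapping, and it builds each paragraph from a text node instead of passing an entity-encoded string to createElement().

```php
<?php
$start = "<H1>Title of page</H1>\nThis is the first paragraph in the page.\n"
    . "<H2>Sub section</H2>\nThis is a sub-section.";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($start);
$xpath = new DOMXPath($dom);

// Only non-blank text nodes that sit directly under <body>.
foreach ($xpath->query('//body/text()[normalize-space()]') as $text) {
    $p = $dom->createElement('p');
    $text->parentNode->replaceChild($p, $text);
    // A text node needs no manual entity encoding; the serializer escapes it.
    $p->appendChild($dom->createTextNode(trim($text->data)));
}

$paragraphCount = $dom->getElementsByTagName('p')->length;

// Serialize the children of <body> to get the fragment back as a string.
$body = $dom->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result, "\n";
```

Text nested inside inline elements such as <b> is deliberately not touched by this query, so mixed inline content would need additional handling.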
