PHP DOM replaceChild, save as HTML and continue parsing

I created a PHP parser for editing the HTML generated by a CMS. The first thing it does is parse a custom tag used for adding modules.
After that, things like links and images are updated or changed where needed. This all works.
Now I noticed that when a custom tag is replaced with the HTML the module generated, that HTML is NOT processed by the rest of the actions.
For example, all links with an href of /pagelink-001 are replaced with the actual link of the current page. This works for the initially loaded HTML, but not for the replaced tag. Below is a short version of the code. I tried saving it with saveHTML() and reloading it with loadHTML(), and similar things.
I'm guessing this is because $doc, which holds the loaded HTML, is not updated as such.
My code:
$html = '<a href="/pagelink-001">Link1</a><customtag></customtag>';
// Load the html (all other settings are not shown to keep it simple. Can be added if this is important)
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Replace custom tag
foreach($xpath->query('//customtag') as $module)
{
// Create fragment
$return = $doc->createDocumentFragment();
// Check the kind of module
switch($module)
{
case 'news':
$html = $this->ZendActionHelperThatReturnsHtml;
// <div class="news"><a href="/pagelink-001">Link2</a></div>
break;
}
// Fill fragment
$return->appendXML($html);
// Replace tag with html
$module->parentNode->replaceChild($return, $module);
}
foreach($doc->getElementsByTagName('a') as $link)
{
// Replace the /pagelink with a correct link
}
In this example the href of Link1 is replaced with the correct value, but the href of Link2 is not. Link2 does appear correctly as a link, and everything else works fine.
Any pointers on how to update $doc with the new HTML, or on whether that is indeed the problem, would be awesome. Or please tell me if I'm completely wrong (and where to look)!
Thanks in advance!!

It turned out I was right: the returned value was handled as a plain string and not as HTML. I found in my code the innerHTML function from @Keyvan that I had implemented at some point (a custom extension, not a built-in property of DOMElement). Using it, my function became this:
// Start with the modules, so all that content can be fixed as well
foreach($xpath->query('//customtag') as $module)
{
// Create fragment
$fragment = $doc->createDocumentFragment();
// Check the kind of module
switch($module)
{
case 'news':
$html = htmlspecialchars_decode($this->ZendActionHelperThatReturnsHtml); // Note htmlspecialchars_decode!
break;
}
// Set contents as innerHtml instead of string
$module->innerHTML = $html;
// Append child
$fragment->appendChild($module->childNodes->item(0));
// Replace tag with html
$module->parentNode->replaceChild($fragment, $module);
}
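For readers without a custom innerHTML helper, the same idea can be sketched with the stock DOM classes: parse the module HTML in a scratch document, import the resulting nodes into the main document, and only then run the link pass. In this sketch `renderModule()`, `htmlToFragment()` and the `/current-page` URL are hypothetical stand-ins, not part of the original code.

```php
<?php
// Hypothetical stand-in for the module renderer (the Zend action helper in the question).
function renderModule(string $kind): string
{
    return '<div class="news"><a href="/pagelink-001">Link2</a></div>';
}

// Parse an HTML string in a scratch document and import its body children
// into $doc as real nodes, collected in a DocumentFragment.
function htmlToFragment(DOMDocument $doc, string $html): DOMDocumentFragment
{
    $scratch = new DOMDocument();
    // The XML prolog is a common trick to make loadHTML() read the input as UTF-8.
    @$scratch->loadHTML('<?xml encoding="UTF-8">' . $html);
    $fragment = $doc->createDocumentFragment();
    $body = $scratch->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            // importNode() copies the node, so the scratch list is not modified.
            $fragment->appendChild($doc->importNode($child, true));
        }
    }
    return $fragment;
}

$doc = new DOMDocument();
@$doc->loadHTML('<a href="/pagelink-001">Link1</a><customtag></customtag>');
$xpath = new DOMXPath($doc);

// Replace every custom tag with real nodes instead of a string.
foreach ($xpath->query('//customtag') as $module) {
    $module->parentNode->replaceChild(htmlToFragment($doc, renderModule('news')), $module);
}

// The link pass now sees the module's links too.
// Copy the list first: getElementsByTagName() returns a live list.
foreach (iterator_to_array($doc->getElementsByTagName('a')) as $link) {
    if ($link->getAttribute('href') === '/pagelink-001') {
        $link->setAttribute('href', '/current-page'); // hypothetical real URL
    }
}

$result = $doc->saveHTML();
```

Because the module output becomes ordinary nodes of $doc, any later pass over the document treats it the same way as the initially loaded HTML.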

Related

php xml DOMDocument close tag element

I am using PHP DOMDocument() to generate an XML file with elements.
I am appending all details to the components tag of a sample XML file, but the closing tag is not written. I want a separate closing tag.
My code produces this:
<component expiresOn="2022-12-31" id="pam" />
I want it to look like this:
<component expiresOn="2022-12-31" id="pam"></component>
My PHP code sample:
$dom = new DOMDocument();
$dom->load("Config.xml");
$components = $dom->getElementsByTagName('components')->item(0);
if(!empty($_POST["pam"])) {
$pam = $_POST["pam"];
$component = $dom->createElement('component');
$component->setAttribute('expiresOn', $expirydate);
$component->setAttribute('id', "pam");
$components->appendChild($component);
}
$dom->save("Config.xml");
I tested the suggestion from the question "Self-closing tags using createElement", but it is not working. The XML/PHP code there is different from mine.
$dom->saveXml($dom,LIBXML_NOEMPTYTAG);
You're trying to use DOMDocument::saveXML to save the new XML back into the original file, but all that function does is return the XML as a string. Since you aren't assigning the result to anything, nothing happens.
If you want to save the XML back to your file, as well as avoiding self-closing tags, you'll need to use the save method as you originally were, and also pass the option:
$dom->save('licenceConfig.xml', LIBXML_NOEMPTYTAG);
See https://3v4l.org/e6N5s for a demo
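The difference between the two serializations can be checked without touching a file. This is a minimal sketch that builds a small document in memory; the `<components/>` root and the attribute values mirror the question, everything else is illustrative.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<components/>');

$component = $dom->createElement('component');
$component->setAttribute('expiresOn', '2022-12-31');
$component->setAttribute('id', 'pam');
$dom->documentElement->appendChild($component);

// Default serialization collapses empty elements into self-closing tags.
$short = $dom->saveXML($dom->documentElement);

// LIBXML_NOEMPTYTAG expands them. saveXML() only returns a string;
// it does not write anything to disk.
$long = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

// To write the file with the same option, pass it to save() instead:
// $dom->save('Config.xml', LIBXML_NOEMPTYTAG);
```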

PHP DOM in WordPress - replace and add attribute

I am trying to change the type attribute of input fields to number (they have type hidden by default) and add readonly="readonly", but it has no effect on the output HTML. The attributes stay as they were.
The function is triggered properly, because before I added the encoding conversion it showed incorrect characters on the page. I have proper CSS to format readonly inputs, so visuals are not a problem. I will also add more conditions to find only specific input tags, but for now I would like to get this code to work properly:
add_filter('the_content', 'acau_lock_input');
function acau_lock_input($content) {
$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
foreach ($dom->getElementsByTagName('input') as $node) {
$node->setAttribute('type', 'number');
$node->setAttribute('readonly', 'readonly');
}
$newHtml = $dom->saveHtml();
return $newHtml;
}
Nigel Ren helped me find the problem. The solution is ridiculous: the content passed to WordPress's the_content filter is simply whatever is placed in the text area of the current page, as visible in the editor.
It has nothing to do with the full HTML structure of the output page. So while a wrong encoding can break the text of the whole page, $content contains only one shortcode, because that is all there is in the page editor.
To learn how to edit the DOM for WordPress, including the output of all templates, check this question:
PHP DOM in WordPress - add attribute in output buffer HTML
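Independent of where the HTML comes from, the attribute change itself can be tried on a plain string. The sketch below is not WordPress-specific; `lockInputs()` is a hypothetical helper name, and it uses an XML prolog instead of `mb_convert_encoding(..., 'HTML-ENTITIES', ...)`, which is deprecated as of PHP 8.2.

```php
<?php
// Set type="number" and readonly on every <input> in an HTML fragment
// and return the fragment without the html/body wrapper that loadHTML() adds.
function lockInputs(string $html): string
{
    $dom = new DOMDocument();
    // Wrap the fragment in a div so there is one known container to serialize from.
    @$dom->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');

    foreach ($dom->getElementsByTagName('input') as $node) {
        $node->setAttribute('type', 'number');
        $node->setAttribute('readonly', 'readonly');
    }

    // The first div in document order is the wrapper added above.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$result = lockInputs('<p>Code: <input type="hidden" name="code" value="42"></p>');
```

If the inputs are produced by a shortcode or a template, they are not in the string the filter receives, so a function like this only helps once it is given the HTML that actually contains them.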

php extract body tag content

I'm trying what should be very easy, but I can't get it to work. Which makes me wonder if I'm using the right workflow.
I have a simple html page which I load in my desktop application as a help file. This page has no menu just the content.
On my website I want to have a more sophisticated help system. So I want to use a PHP file which will show a menu, breadcrumbs, a header and a footer.
To not duplicate my help content I want to load the original HTML help file and add its body content to my enhanced help page.
I'm using this code to extract the title:
function getURLContent($filename){
$url = realpath(dirname(__FILE__)) . DIRECTORY_SEPARATOR . $filename;
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
@$doc->loadHTMLFile($url);
return $doc;
}
function getSingleElementValue($element){
if (!is_null($element)) {
$node = $element->childNodes->item(0);
return $node->nodeValue;
}
}
$doc = getURLContent("test.html");
$title = getSingleElementValue($doc->getElementsByTagName('title')->item(0));
echo $title;
The title is correctly extracted.
Now I try to extract the body:
function getBodyContent($element){
$mock = new DOMDocument;
foreach ($element->childNodes as $child){
$mock->appendChild($mock->importNode($child, true));
}
return $mock->saveHTML();
}
$body = getBodyContent($doc->getElementsByTagName('body')->item(0));
echo $body;
The getBodyContent() function is one of the several options I tried.
All of them return the whole HTML tag, including the HEAD tag.
My question is: Is this a correct workflow or should I use something else?
Thanks.
Update: My final goal is to have a website with multiple pages that has the help files accessible via a menu. These pages will be generated using something like generate.php?page=test.html. I'm not yet at this part. The goal is also to not duplicate the content of test.html because this file will be used in my desktop application (using a web control). In my desktop application I don't need the menu and such.
Update #2: I had to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> to the HTML file I want to read, and now I do get the body content. Unfortunately all tags are stripped. I'll need to fix that as well.
The problem is that saveHTML() will return an actual document. You don't want this. Instead, you want just what you put in.
Thankfully, you can do this much more easily.
function getBodyContent(DOMNode $element) {
$doc = $element->ownerDocument;
$wrapper = $doc->createElement('div');
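// childNodes is a live list, so move the first child until none are left
// (a foreach over the list can skip nodes while they are being moved)
while ($element->firstChild) {
$wrapper->appendChild($element->firstChild);
}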
$element->appendChild($wrapper);
$html = $doc->saveHTML($wrapper);
return substr($html, strlen("<div>"), -strlen("</div>"));
}
This wraps the contents into a single element of known tag representation (the body may have attributes that make it unknown), gets the rendered HTML from that element, and strips off the known tag of the wrapper.
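An alternative that avoids modifying the document is to serialize each child of the body separately, since `saveHTML()` accepts a node argument. This is a minimal sketch; `innerHtml()` is a hypothetical helper name and the sample markup is made up for illustration.

```php
<?php
// Return the inner HTML of a node by serializing each child on its own.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // Passing a node makes saveHTML() output just that subtree.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
@$doc->loadHTML(
    '<html><head><title>Help</title></head>'
    . '<body class="x"><h1>Topic</h1><p>Some <b>bold</b> text.</p></body></html>'
);

$body = innerHtml($doc->getElementsByTagName('body')->item(0));
```

Nothing is moved here, so the live child list stays stable during the loop and the original document can still be used afterwards.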
I'd also like to suggest an improvement to getSingleElementValue:
function getSingleElementValue(DOMNode $element) {
return trim($element->textContent);
}
Note also the use of type hints to ensure that your functions are indeed getting the kind of thing that is expected - this is useful as it means we no longer need to check "does $element->ownerDocument exist? does $element->ownerDocument->saveHTML() do what we think it does?" and other such questions. It ensures we have a DOMNode, so we know it has those things.

How to make a link in a mysql stored variable clickable when rendered on the page

I have a function that enables members on a site to message each other; the message is stored in mysql database.
My question now is this: what is the best way to allow members to include a link in the message so that, when rendered, it is rendered as a click-able link.
I've tried the following:
<a href="testpage.html"> click here</a>
but when I then tried to render it on the page it came out as:
$message = nl2br($this->escapeHtml(trim($this->theMessage[0]['message'])));
echo $message; // click here
The var_dump() value of $message is:
string '<a href="testpage.html"> click here</a>'
HTML markup is complicated: if you display it to users and someone has injected unsavory HTML into it, you have an XSS attack on your hands. Imagine an injected onclick handler, for example. Any data from outside is dangerous.
markup language
This is one of the reasons, why markup languages like BBCode and markdown exist.
You don't want every piece of HTML markup, only clean and safe stuff.
Basically, you want to work with a restricted set of "content".
And one way of allowing data from outside is by using an "intermediate" markup language.
It is intermediate, because it is a custom format, which is later transformed into HTML.
This happens here on Stackoverflow, too:
[link](http://google.com) = link
Tell your users: "to insert a link, use this special syntax".
Save the content to the database.
The content you store in the database is something like:
The message text. And some markdown [link](http://google.com).
When you fetch the message from the database, you process the markdown content:
$messageFromDb = 'The message. [google](http://google.com)';
$parsedown = new Parsedown();
$parsedown->setSafeMode(true); // escape raw HTML in untrusted input (Parsedown 1.7+)
$html = $parsedown->text($messageFromDb);
echo $html; // ready to show
Result: <p>The message. <a href="http://google.com">google</a></p>
There are libraries out there ready for usage, like
http://parsedown.org/
https://github.com/egil/php-markdown-extra-extended
filter html
Another way is to allow HTML, but only a restricted set. You would have to filter the submitted HTML, keeping only the good content and dropping the rest.
PHP Extension Tidy: http://php.net/manual/en/book.tidy.php
Libraries like http://htmlpurifier.org/
DOM based HTML filter
Instead of relying on a filter library, you could also come up with a "little" DOM based HTML filter.
The following example re-creates a clean link from a crappy and bad one.
You should also check the URL attributes to ensure they use known-good schemes like http:, and not troublesome like javascript:.
This lets you whitelist the combination of elements and control the nesting and the content.
<?php
// content from form (the href is an illustrative value)
$html = 'Message <a href="http://google.com"><img src="injectionHell.png" title="The Link" /> Link Text</a>';
$dom = new DOMDocument;
$dom->formatOutput = true;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOXMLDECL);
// filter, then rebuild a clean link
// (copy the live node list first, because nodes are replaced inside the loop)
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $node)
{
// extract the values
$title = $node->nodeValue;
$href = $node->getAttribute('href');
// maybe add a href filter?
// to remove links to bad websites
// and to remove href="javascript:"
// oh boy ... simple questions, resulting in lots of work ;)
// create a new "clean" link element; a text node keeps special characters escaped
$link = $dom->createElement('a');
$link->appendChild($dom->createTextNode($title));
$link->setAttribute('href', $href);
// replace old node
$node->parentNode->replaceChild($link, $node);
}
// drop html, body, get only html fragment
// http://stackoverflow.com/q/11216726/1163786
$html = preg_replace('~<(?:!DOCTYPE|/?(?:html|body|p))[^>]*>\s*~i', '', $dom->saveHTML());
var_dump($html);
Before
Message <a href="http://google.com"><img src="injectionHell.png" title="The Link" /> Link Text</a>
After
Message <a href="http://google.com"> Link Text</a>
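The href filter mentioned in the comments above could look roughly like the following. It is a sketch under assumptions, not a complete sanitizer: `isSafeHref()` is a hypothetical helper, and the allowed scheme list (http, https, mailto) is an illustrative choice.

```php
<?php
// Accept relative URLs and absolute URLs with a whitelisted scheme.
function isSafeHref(string $href): bool
{
    // Browsers ignore whitespace and control characters while reading the
    // scheme, so remove them before inspecting the value.
    $clean = preg_replace('/[\x00-\x20]+/', '', $href);

    // A leading "name:" is a URL scheme; compare it against the whitelist.
    if (preg_match('~^([a-z][a-z0-9+.\-]*):~i', $clean, $m)) {
        return in_array(strtolower($m[1]), ['http', 'https', 'mailto'], true);
    }

    // No scheme at all: treat the value as a relative URL.
    return true;
}

$ok  = isSafeHref('http://google.com');
$rel = isSafeHref('testpage.html');
$bad = isSafeHref(" JaVa\tScRiPt:alert(1)");
```

Inside the loop, a link whose href fails the check could be replaced by a plain text node instead of a new a element.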
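To store "HTML in database"
When storing: use addslashes().
When returning text from DB: apply stripslashes() before rendering.
Note that addslashes() is not a reliable defence against SQL injection. With parameterized queries (PDO or mysqli prepared statements) neither call is needed.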
A simple way to attain your goal is to save the message including the <a> tags.
You can use an HTML sanitizer so that you accept <a> link tags from your users while removing any potentially dangerous tags.
Then you wouldn't escape the saved text when you output it.
Have a look at HTML purifier.
Alternatively, you could use a Markdown parser to convert plain text to HTML.
Your code escapes the HTML tags, so they are displayed as text instead of being rendered. That is what escapeHtml() does.
What you need is a function that removes all HTML tags except the one you want, in this case the link tag <a>.
Here is a function you can add to your code:
function stripme($msg){
$msg = strip_tags($msg,'<a>');
return $msg ;
}
and then call it for your message like this:
$message = nl2br($this->stripme($this->theMessage[0]['message']));
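One caveat worth checking before relying on this: `strip_tags()` keeps every attribute of an allowed tag, so an onclick handler on a link survives. The sketch below shows that, and then removes everything except href with the DOM extension. The sample input is made up for illustration.

```php
<?php
$input = '<a href="testpage.html" onclick="steal()">click here</a><script>alert(1)</script>';

// The <script> tags are removed (their text content stays), but the
// allowed <a> passes through untouched, including its onclick attribute.
$stripped = strip_tags($input, '<a>');

// Second pass: keep only the href attribute on each link.
$dom = new DOMDocument();
@$dom->loadHTML('<?xml encoding="UTF-8"><div>' . $stripped . '</div>');
foreach ($dom->getElementsByTagName('a') as $a) {
    // Copy the attribute map first, because attributes are removed in the loop.
    foreach (iterator_to_array($a->attributes) as $attr) {
        if ($attr->nodeName !== 'href') {
            $a->removeAttribute($attr->nodeName);
        }
    }
}

// Serialize only the contents of the wrapper div added above.
$wrapper = $dom->getElementsByTagName('div')->item(0);
$clean = '';
foreach ($wrapper->childNodes as $child) {
    $clean .= $dom->saveHTML($child);
}
```

The href value itself is not validated here, so a javascript: URL would still get through without an additional check.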

PHP DOMDocument is not working

I am studying HTML parsing in PHP and I am using DOM for this.
I wrote this code in my PHP file:
<?php
$site = new DOMDocument();
$div = $site->createElement("div");
$class = $site->createAttribute("class");
$class->nodeValue = "wrapper";
$div->appendChild($class);
$site->appendChild($div);
$html = $site->saveHTML();
echo $html;
?>
When I run this in the browser and view the page source, only this code comes out:
<div class="wrapper"></div>
I don't know why it is not showing the whole HTML document, as I expected it to. I am using XAMPP v3.2.1.
Please tell me where I went wrong with this. Thanks.
It's showing the whole HTML you created. A div node with a wrapper class attribute.
See the example in the docs. There the html, head, etc. nodes are explicitly created.
PHP only adds missing DOCTYPE, html and body elements when loading HTML, not when saving.
Adding $site->loadHTML($site->saveHTML()); before $html = $site->saveHTML(); will demonstrate this.
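Both points can be seen side by side in a short sketch: the round trip through `loadHTML()` that adds the implied structure, and the explicit construction of html and body nodes. Variable names other than `$site` are illustrative.

```php
<?php
$site = new DOMDocument();

$div = $site->createElement('div');
$div->setAttribute('class', 'wrapper');
$site->appendChild($div);

// Saving outputs exactly what was built: just the div.
$before = $site->saveHTML();

// Re-parsing as HTML lets the parser add the implied html and body elements.
$site->loadHTML($before);
$after = $site->saveHTML();

// Building the structure explicitly gives a full document without a re-parse.
$page = new DOMDocument();
$html = $page->appendChild($page->createElement('html'));
$body = $html->appendChild($page->createElement('body'));
$wrapper = $body->appendChild($page->createElement('div'));
$wrapper->setAttribute('class', 'wrapper');
$explicit = $page->saveHTML();
```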
