DOMElement replace HTML value - php

I have this HTML string in a DOMElement:
<h1>Home</h1>
test{{test}}
I want to replace this content in a way that only
<h1>Home</h1>
test
remains (so I want to remove the {{test}}).
At this moment, my code looks like this:
$node->nodeValue = preg_replace(
'/(?<replaceable>{{([a-z0-9_]+)}})/mi', '' , $node->nodeValue);
This doesn't work because nodeValue doesn't contain the HTML value of the node.
I can't figure out how to get the HTML string of the node other than using $node->C14N(), but by using C14N I can't replace the content.
Any ideas how I can remove the {{test}} in an HTML string like this?

Have you tried the DOMDocument::saveXML function? (http://php.net/manual/en/domdocument.savexml.php)
It has a second argument $node with which you can specify which node to print the HTML/XML of.
So, for example:
<?php
$doc = new DOMDocument('1.0');
// we want a nice output
$doc->formatOutput = true;
$root = $doc->createElement('body');
$root = $doc->appendChild($root);
$title = $doc->createElement('h1', 'Home');
$root->appendChild($title);
$text = $doc->createTextNode('test{{test}}');
$text = $root->appendChild($text);
echo $doc->saveXML($root);
?>
This will give you:
<body>
<h1>Home</h1>
test{{test}}
</body>
If you do not want the <body> tag, you could cycle through all of its childnodes:
<?php
foreach($root->childNodes as $child){
echo $doc->saveXML($child);
}
?>
This will give you:
<h1>Home</h1>test{{test}}
Edit: you can then of course replace {{test}} by the regex that you are already using:
<?php
$xml = '';
foreach ($root->childNodes as $child) {
    $xml .= preg_replace(
        '/(?<replaceable>{{([a-z0-9_]+)}})/mi', '',
        $doc->saveXML($child)
    );
}
?>
This will give you:
<h1>Home</h1>test
Note: I haven't tested the code, but this should give you the general idea.

The issue is mainly around how you navigate the DOM but there's also an issue with your RegExp; XPath actually provides a lot of flexibility when it comes to DOM manipulation so that's my preferred solution.
Assuming you have a DOMDocument built like this (I've attached an XPath):
$dom = new DOMDocument('1.0', 'utf-8');
$xpath = new DOMXPath($dom);
$node = $dom->createElement('div');
$node->appendChild(
$dom->createElement('h1', "Home")
);
$node->appendChild(
$dom->createTextNode("test{{test}}")
);
$dom->appendChild($node);
You can specifically target the text node of that <div> with '/div/text()' in XPath.
So to replace {{test}} within that text node without corrupting the rest of the node, you would do:
$xpath->query('/div/text()')->item(0)->nodeValue = preg_replace(
'/(.*){{[^}]+}}/m',
'$1',
$xpath->query('/div/text()')->item(0)->nodeValue
);
Somewhat convoluted but the output from $dom->saveXML(); is:
<?xml version="1.0" encoding="utf-8"?>
<div><h1>Home</h1>test</div>
{{test}} has been removed leaving the rest intact.
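If the placeholders can appear in more than one text node, the same idea can be applied to every text node under an element. This is a minimal sketch, assuming PHP 7 or later; stripPlaceholders() is a hypothetical helper name, not part of the DOM extension, and the pattern is a simplified version of the one in the question.

```php
<?php
// Hypothetical helper: strips {{placeholder}} tokens from every text node under $context.
function stripPlaceholders(DOMNode $context): void
{
    $doc = $context instanceof DOMDocument ? $context : $context->ownerDocument;
    $xpath = new DOMXPath($doc);
    // .//text() selects all descendant text nodes relative to $context
    foreach ($xpath->query('.//text()', $context) as $textNode) {
        $textNode->nodeValue = preg_replace('/\{\{[a-z0-9_]+\}\}/i', '', $textNode->nodeValue);
    }
}

// Build the same structure as in the answer above
$dom = new DOMDocument('1.0', 'utf-8');
$div = $dom->appendChild($dom->createElement('div'));
$div->appendChild($dom->createElement('h1', 'Home'));
$div->appendChild($dom->createTextNode('test{{test}}'));

stripPlaceholders($div);
$result = $dom->saveXML($div);
echo $result, PHP_EOL; // <div><h1>Home</h1>test</div>
```

Because only text nodes are touched, the element structure (the <h1> here) is never re-parsed or rewritten.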

Related

php export xml CDATA escaped

I am trying to export xml with CDATA tags. I use the following code:
$xml_product = $xml_products->addChild('product');
$xml_product->addChild('mychild', htmlentities("<![CDATA[" . $mytext . "]]>"));
The problem is that the < and > of the CDATA tags get escaped as &lt; and &gt;, like the following:
<mychild>&lt;![CDATA[My some long long long text]]&gt;</mychild>
but I need:
<mychild><![CDATA[My some long long long text]]></mychild>
If I use htmlentities() I get lots of errors like "tag raquo is not defined", even though there are no such tags in my text. Probably htmlentities() tries to parse the text inside CDATA and convert it, but I don't want that either.
Any ideas how to fix that? Thank you.
UPD_1 My function which saves xml to file:
public static function saveFormattedXmlFile($simpleXMLElement, $output_file) {
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadXML(urldecode($simpleXMLElement->asXML()));
$dom->save($output_file);
}
Here is a short example of how to add a CDATA section. Note that it switches to DOMDocument to do so. The code builds a <product> element and creates a new <mychild> element inside $xml_product. This $newNode is then converted to a DOMElement with dom_import_simplexml(). Finally, the DOMDocument createCDATASection() method creates the CDATA section, which is appended to the node.
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><Products />');
$xml_product = $xml->addChild('product');
$newNode = $xml_product->addChild('mychild');
$mytext = "<html></html>";
$node = dom_import_simplexml($newNode);
$cdata = $node->ownerDocument->createCDATASection($mytext);
$node->appendChild($cdata);
echo $xml->asXML();
This example outputs...
<?xml version="1.0" encoding="UTF-8"?>
<Products><product><mychild><![CDATA[<html></html>]]></mychild></product></Products>
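If you need this in several places, the steps above could be wrapped in a small function. This is only a sketch, assuming PHP 7 or later; addChildWithCData() is a hypothetical name, not something SimpleXML provides, and the sample text is made up.

```php
<?php
// Hypothetical helper (not part of SimpleXML): adds a child element
// whose content is a single CDATA section.
function addChildWithCData(SimpleXMLElement $parent, string $name, string $text): SimpleXMLElement
{
    $child = $parent->addChild($name);
    // Switch to the DOM API, which knows how to create CDATA sections
    $node = dom_import_simplexml($child);
    $node->appendChild($node->ownerDocument->createCDATASection($text));
    return $child;
}

$xml = new SimpleXMLElement('<Products/>');
$product = $xml->addChild('product');
addChildWithCData($product, 'mychild', 'Fish & <b>chips</b>');

$result = $xml->asXML();
echo $result;
```

No htmlentities() call is needed here: the text goes into the CDATA section unescaped.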

Is the meta tag content value from an xpath query trustable?

I have a PHP function that extracts meta tags from a URL with XPath queries,
e.g. $xpath->query('/html/head/meta[@name="my_target"]/@content')
My question :
Can I trust the returned value or should I verify it ?
=> Is there any possible XSS exploit ?
=> Should the html content be purified before loading it in the DOMDocument ?
// Other way to say it with some code :
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
libxml_use_internal_errors(true);
// is
$doc->loadHTMLFile($url);
// trustable ??
// or is
$html = file_get_contents($url);
$trust = $purifier->purify($html);
$doc->loadHTML($trust);
// a better practice ??
libxml_use_internal_errors(false);
$xpath = new DOMXPath($doc);
$trustable = $xpath->query('/html/head/meta[@name="my_target"]/@content')->item(0); // ?
===== UPDATE =========================================
Yes, never trust external sources.
use $be_sure = htmlspecialchars($trustable->textContent) or strip_tags($trustable->textContent)
If you pull in HTML content from a source that you don't control, then yes, I would consider that piece of code potentially troublesome!
You could use htmlspecialchars() to convert any special characters to HTML entities. Or if you want to keep parts of the markup, you could use strip_tags(). Another option is to use filter_var(), which gives you more control over its filtering.
Or you could use a library like HTML Purifier but that might be too much for your end. It all depends on the type of content you are working with.
Now, to sanitise the element, you will need to get the string representation of your XPath result first. Apply your filtering and then put it back in. The following example should do what you want:
<?php
// The following HTML is what you fetch from your remote source:
$html = <<<EOL
<html>
<body>
<h1>Foo, bar!</h1>
<div id="my-target">
Here is some <strong>text</strong> <script>javascript:alert('some malicious script!');</script> that we want to sanitize.
</div>
</body>
</html>
EOL;
// We instantiate a DOMDocument so we can work with it:
$original = new DOMDocument("1.0", 'UTF-8');
$original->formatOutput = true;
$original->loadHTML($html);
$body = $original->getElementsByTagName('body')->item(0);
// Find the element we need using Xpath:
$xpath = new DOMXPath($original);
$divs = $xpath->query("//body/div[@id='my-target']");
// The XPath query will return DOMElement objects, so create a string that we can manipulate out of it:
$innerHTML = '';
if ($divs->length > 0)
{
    $div = $divs->item(0);
    // Now get the innerHTML for this element
    foreach ($div->childNodes as $child) {
        $innerHTML .= $original->saveXML($child);
    }
    // Remove it from the original document because we want to replace it anyway
    $div->parentNode->removeChild($div);
}
// Sanitize our string by removing all tags except <strong> and the container <div>
$innerHTML = strip_tags($innerHTML, '<strong>');
// or htmlspecialchars() or filter_var or HTML Purifier ..
// Now re-import the sanitized string into a blank DOMDocument
$sanitized = new DOMDocument("1.0", 'UTF-8');
$sanitized->formatOutput = true;
$sanitized->loadXML('<div id="my-target">' . $innerHTML . '</div>');
// Now add the sanitized DOMElement back into the original document as a child of <body>
$body->appendChild($original->importNode($sanitized->documentElement, true));
echo $original->saveHTML();
Hope that helps.
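For the case in the question itself (a meta tag's content attribute rather than a whole element), escaping the value at output time may be all that is needed. A minimal sketch, assuming the value is going to be echoed into an HTML page; the sample markup is made up and stands in for the remote page.

```php
<?php
// Sample markup standing in for a remote page (an assumption for this sketch).
$html = '<html><head><meta name="my_target" content="&lt;script&gt;alert(1)&lt;/script&gt;"></head><body></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings from sloppy real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($doc);
$attr = $xpath->query('/html/head/meta[@name="my_target"]/@content')->item(0);

// The DOM hands back the decoded attribute value, so escape it again before output.
$raw = $attr !== null ? $attr->nodeValue : '';
$safe = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
echo $safe, PHP_EOL; // &lt;script&gt;alert(1)&lt;/script&gt;
```

Note that $raw contains a real <script> tag even though the source had it entity-encoded, which is why the value should not be echoed as-is.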

How do I use str_replace with DomDocument

I am using DomDocument to pull content from a specific div on a page.
I would then like to replace all instances of links with a path equal to http://example.com/test/ with http://example.com/test.php.
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$doc = new DomDocument('1.0', 'UTF-8');
$doc->loadHtml(file_get_contents($url));
$div = $doc->getElementById('upcoming_league_dates');
foreach ($div->getElementsByTagName('a') as $item) {
$item->setAttribute('href', 'http://example.com/test.php');
}
echo $doc->saveHTML($div);
As you can see in the example above, str_replace causes problems after I target the upcoming_league_dates div with getElementById. I understand this but unfortunately I don't know where to go from here!
I've tried several different ways including executing the str_replace above the getElementById function (I figured I could replace the strings first and then target the specific div), with no luck.
What am I missing here?
EDIT: UPDATED CODE TO SHOW WORKING SOLUTION
You can't just use str_replace on that node. You need to access it properly first. Through the DOMElement class you can use the ->setAttribute() method to make the replacement.
Example:
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom); // use xpath
$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';
// target the link
$links = $xpath->query("//div[@id='upcoming_league_dates']/a[contains(@href, '$needle')]");
foreach($links as $anchor) {
// replacement of those href values
$anchor->setAttribute('href', $replacement);
}
echo $dom->saveHTML();
Update: After your revision, your code now works anyway. This answer just addresses the search-and-replace logic (à la str_replace) from your original question.
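If the links carry anything after the matched prefix (a query string, for example), a per-attribute str_replace keeps that part instead of overwriting the whole href. This is a sketch under assumptions: the sample markup below is made up, and the query uses //a so anchors nested deeper inside the div are matched as well.

```php
<?php
// Sample markup standing in for the remote page (an assumption for this sketch).
$html = '<div id="upcoming_league_dates">'
      . '<a href="http://example.com/test/?id=1">One</a>'
      . '<a href="http://example.com/other/">Two</a>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';

// //a matches anchors at any depth inside the div
foreach ($xpath->query("//div[@id='upcoming_league_dates']//a[contains(@href, '$needle')]") as $anchor) {
    // Replace only the matching part of the existing href
    $anchor->setAttribute('href', str_replace($needle, $replacement, $anchor->getAttribute('href')));
}

// Collect the resulting href values for inspection
$div = $xpath->query("//div[@id='upcoming_league_dates']")->item(0);
$hrefs = [];
foreach ($div->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

echo $dom->saveHTML($div), PHP_EOL;
```

The first link ends up as http://example.com/test.php?id=1, while the second, which does not contain the needle, is left alone.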

How to cut off a portion of a html inside <div> and store it as html string by using xpath and domdocument?

I would like to cut out a portion of HTML. I can select it using XPath and DomDocument, but the problem is that I need the result as an HTML code string. Normally I would use a regular expression for that, but I would rather not write a complicated search pattern that would match the beginning and the end of the tag.
That's the example input:
some html code before
<div>this <b>is</b> what I want</div>
some html after
and the output:
<div>this <b>is</b> what I want</div>
I tried something like this:
$subject = 'some html code before
<div>this <b>is</b> what I want</div>
some html after';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div/*");
echo $result->saveHTML();
but i got only error:
Call to undefined method DOMNodeList::saveHTML()
Does anyone know how to get the result as a html string by using DomDocument and XPath?
Thank you, gentlemen, for pointing out my misunderstanding about calling methods that are not available in a child object. But the line:
echo $doc->saveHTML($result->item(0));
generates only a warning (without the HTML string I want to have). Luckily I found another solution, and here it is:
<?php
$subject = '<html>
<head>
<title>A very short ebook</title>
<meta name="charset" value="utf-8" />
</head>
<body>
<h1 class="bookTitle">A very short ebook</h1>
<p style="text-align:right">Written by Kovid Goyal</p>
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
<h2 class="chapter">Chapter One</h2>
<p>This is a truly fascinating chapter.</p>
<h2 class="chapter">Chapter Two</h2>
<p>A worthy continuation of a fine tradition.</p>
</body>
</html>';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
//echo $doc->saveHTML($result->item(0));
echo domNodeList_to_string($result);
function domNodeList_to_string($DomNodeList) {
    $output = '';
    $doc = new DOMDocument;
    $i = 0; // initialize the index before the loop
    while ( $node = $DomNodeList->item($i) ) {
        // import node
        $domNode = $doc->importNode($node, true);
        // append node
        $doc->appendChild($domNode);
        $i++;
    }
    $output = $doc->saveHTML();
    $output = print_r($output, 1);
    // I added this because xml output and ajax do not like each other
    //$output = htmlspecialchars($output);
    return $output;
}
?>
so if one has a query like that:
$result = $xpath->query("//div");
then will get the raw html string output:
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
if the query is:
$result = $xpath->query("//p");
then output will be:
<p style="text-align:right">Written by Kovid Goyal</p><p>A very short ebook to demonstrate the use of XPath.</p><p>This is a truly fascinating chapter.</p><p>A worthy continuation of a fine tradition.</p>
Does anyone know a simpler (built into PHP) method to get the same result?
Try this:
$subject = 'some html code before
<div>this <b>is</b> what I want</div>
some html after';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
echo $doc->saveHTML($result->item(0)); //echoes what you want :)
The saveHTML function belongs to the DOMDocument object, you can't call it directly on the node (much less on the NodeList, which is what the query returns), but what you can do is pass it the node as a param.
Also, your query was wrong: what you want is the div element (i.e. //div), not its children (//div/*).
As per the PHP manual docs on DOMXPath::query, the function:
Returns a DOMNodeList containing all nodes matching the given XPath
expression. Any expression which does not return nodes will return an
empty DOMNodeList.
This means that the $result in the following code will be a DOMNodeList object. So if you want to get individual HTML code out from inside it you'll need to use methods available with a DOMNodeList object. In this case, the item method:
$result = $xpath->query("//div");
echo $doc->saveHTML($result->item(0));
$result->item(0) returns the first DOMNode in the DOMNodeList created by your xpath query.
Try this :
$subject = 'some html code before<div>this <b>is</b> what I want</div>some html after';
$doc = new DOMDocument('1.0');
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
$docSave = new DOMDocument('1.0');
foreach ( $result as $node ) {
$domNode = $docSave->importNode($node, true);
$docSave->appendChild($domNode);
}
echo $docSave->saveHTML();
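If what you actually want is the content inside the <div> (which is what the original //div/* query was reaching for), you can serialize each child node and concatenate the results. A minimal sketch, assuming PHP 7 or later; innerHTML() is a hypothetical helper name borrowed from JavaScript, not a built-in DOM method.

```php
<?php
// Hypothetical helper: serializes only the children of $node
// (similar to innerHTML in JavaScript).
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() accepts a node argument and returns its markup
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>some html code before</p><div>this <b>is</b> what I want</div><p>some html after</p>');
$xpath = new DOMXPath($doc);
$div = $xpath->query('//div')->item(0);

$inner = innerHTML($div);
echo $inner, PHP_EOL; // this <b>is</b> what I want
```

Unlike //div/*, iterating childNodes also picks up the text nodes, so nothing between the tags is lost.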

XML node with Mixed Content using PHP DOM

Is there a way to create a node that has mixed XML content in it with the PHP DOM?
If I understood you correctly you want something similar to innerHTML in JavaScript. There is a solution to that:
$xmlString = 'some <b>mixed</b> content';
$dom = new DOMDocument;
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($xmlString);
$dom->appendChild($fragment);
// done
To summarize, what you need is:
DOMDocument::createDocumentFragment()
DOMDocumentFragment::appendXML()
Although you didn't ask about it, I'll tell you how to get the string representation of a DOM node as opposed to the whole DOM document:
// for a DOMDocument you have
$dom->save($file);
$string = $dom->saveXML();
$string = $dom->saveHTML();
$dom->saveHTMLFile($file);
// For a DOMElement you have
$node = $dom->getElementById('some-id');
$string = $node->C14N();
$node->C14NFile($file);
Those two methods are currently not documented.
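Putting the two parts together, here is a minimal sketch, assuming you want the mixed content inside an element (a hypothetical <p> root here) and want the string for just that node; saveXML() accepts an optional node argument for this.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
// Hypothetical root element to hold the mixed content
$root = $dom->appendChild($dom->createElement('p'));

$fragment = $dom->createDocumentFragment();
// appendXML() returns false if the string is not well-formed XML
if ($fragment->appendXML('some <b>mixed</b> content')) {
    $root->appendChild($fragment);
}

// Passing a node to saveXML() serializes only that node, without the XML declaration
$string = $dom->saveXML($root);
echo $string, PHP_EOL; // <p>some <b>mixed</b> content</p>
```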
