How to escape all invalid characters from DOM XPath Query? - php

I have the following function that finds values within an HTML DOM.
It works, but when I pass a $value such as: Levi's Baby Overall,
it breaks, because the , and ' characters are not escaped.
How do I escape all invalid characters in a DOM XPath query?
private function extract($file,$url,$value) {
$result = array();
$i = 0;
$dom = new DOMDocument();
@$dom->loadHTMLFile($file); // load the document; @ suppresses warnings about malformed HTML
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
if (($node->nodeValue != null) && ($node->nodeValue === $value)) {
$xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
$result[$i]['url'] = $url;
$result[$i]['value'] = $node->nodeValue;
$result[$i]['xpath'] = $xpath;
$i++;
}
}
}
}
return $result;
}

One shouldn't substitute placeholders in an XPath expression with arbitrary, user-provided strings -- because of the risk of (malicious) XPath injection.
To deal safely with such unknown strings, the solution is to use a pre-compiled XPath expression and to pass the user-provided string as a variable to it. This also completely eliminates the need to deal with nested quotes in the code.
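PHP's DOMXPath, as far as I know, offers no way to bind variables to a compiled expression, so the advice above cannot be applied literally there. One way to keep the user's string out of the expression entirely is to select text nodes with a fixed query and do the comparison in PHP. A minimal sketch of that idea; the function name findElementsContaining and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: returns the node paths of elements that have a text
// node containing $needle. The XPath expression is a constant, so the
// user-supplied string never becomes part of it.
function findElementsContaining(DOMDocument $dom, string $needle): array
{
    $xpath = new DOMXPath($dom);
    $paths = [];
    foreach ($xpath->query('//text()') as $textNode) {
        if (strpos($textNode->nodeValue, $needle) !== false) {
            $paths[] = $textNode->parentNode->getNodePath();
        }
    }
    return $paths;
}

$dom = new DOMDocument();
$dom->loadXML('<shop><item>Levi\'s Baby Overall, blue</item><item>Socks</item></shop>');

// The value contains both a single quote and a comma; neither needs escaping here.
$paths = findElementsContaining($dom, "Levi's Baby Overall,");
print_r($paths);
```

The trade-off is that every text node is visited in PHP, which is likely slower than a targeted XPath query on large documents.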

PHP has no built-in function for escaping/quoting strings for XPath queries. Furthermore, escaping strings for XPath is surprisingly difficult to do; here's more information on why: https://stackoverflow.com/a/1352556/1067003. Here is a PHP port of the C# XPath quote function from that answer:
function xpath_quote(string $value):string{
if(false===strpos($value,'"')){
return '"'.$value.'"';
}
if(false===strpos($value,'\'')){
return '\''.$value.'\'';
}
// if the value contains both single and double quotes, construct an
// expression that concatenates all non-double-quote substrings with
// the quotes, e.g.:
//
// concat("'foo'", '"', "bar")
$sb='concat(';
$substrings=explode('"',$value);
for($i=0;$i<count($substrings);++$i){
$needComma=($i>0);
if($substrings[$i]!==''){
if($i>0){
$sb.=', ';
}
$sb.='"'.$substrings[$i].'"';
$needComma=true;
}
if($i < (count($substrings) -1)){
if($needComma){
$sb.=', ';
}
$sb.="'\"'";
}
}
$sb.=')';
return $sb;
}
Example usage:
$elements = $dom_xpath->query("//*[contains(text()," . xpath_quote($value) . ")]");
Notice that I did not add the quoting characters (") in the XPath itself, because the xpath_quote function adds them for me (or builds the concat() equivalent if needed).
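To see the quoting idea end to end, here is a compact sketch that uses a simpler variant of the same technique: pick whichever quote character the value does not contain, and fall back to concat() when it contains both. The name xpathLiteral and the sample XML are assumptions for illustration, not part of the port above.

```php
<?php
// Hypothetical helper: turns an arbitrary string into an XPath 1.0 string expression.
function xpathLiteral(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Both quote types are present: split on single quotes, wrap each piece in
    // single quotes, and glue the pieces back together with a "'" literal.
    $parts = array_map(
        function ($part) { return "'" . $part . "'"; },
        explode("'", $value)
    );
    return 'concat(' . implode(', "\'", ', $parts) . ')';
}

$dom = new DOMDocument();
$dom->loadXML('<items><item>Levi\'s Baby Overall</item><item>He said "hi", it\'s fine</item></items>');
$xpath = new DOMXPath($dom);

foreach (["Levi's Baby Overall", 'He said "hi", it\'s fine'] as $value) {
    $nodes = $xpath->query('//item[. = ' . xpathLiteral($value) . ']');
    echo $value, ' => ', $nodes->length, " match(es)\n";
}
```

Empty pieces produced by explode() become '' literals, which concat() accepts, so leading, trailing, or doubled quotes need no special handling in this variant.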

Related

set tags in html using domdocument and preg_replace_callback

I try to replace words that are in my dictionary of terminology with an (html)anchor so it gets a tooltip. I get the replace-part done, but I just can't get it back in the DomDocument object.
I've made a recursive function that iterates the DOM, it iterates every childnode, searching for the word in my dictionary and replacing it with an anchor.
I've been using this with an ordinary preg_match on HTML, but that just runs into problems.. when HTML gets complex
The recursive function:
$terms = array(
'example'=>'explanation about example'
);
function iterate_html($doc, $original_doc = null)
{
global $terms;
if(is_null($original_doc)) {
self::iterate_html($doc, $doc);
}
foreach($doc->childNodes as $childnode)
{
$children = $childnode->childNodes;
if($children) {
self::iterate_html($childnode);
} else {
$regexes = '~\b' . implode('\b|\b',array_keys($terms)) . '\b~i';
$new_nodevalue = preg_replace_callback($regexes, function($matches) {
$doc = new DOMDocument();
$anchor = $doc->createElement('a', $matches[0]);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($matches[0])]);
return $doc->saveXML($anchor);
}, $childnode->nodeValue);
$dom = new DOMDocument();
$template = $dom->createDocumentFragment();
$template->appendXML($new_nodevalue);
$original_doc->importNode($template->childNodes, true);
$childnode->parentNode->replaceChild($template, $childnode);
}
}
}
echo iterate_html('this is just some example text.');
I expect the result to be:
this is just some <a class="text-info" data-toggle="tooltip" data-original-title="explanation about example">example</a> text
I don't think building a recursive function to walk the DOM is useful when you can use an XPath query. Also, I'm not sure that preg_replace_callback is a suitable function for this case; I prefer to use preg_split. Here is an example:
$html = 'this is just some example text.';
$terms = array(
'example'=>'explanation about example'
);
// sort by reverse order of key size
// (to be sure that the longest string always wins instead of the first in the pattern)
uksort($terms, function ($a, $b) {
$diff = mb_strlen($b) - mb_strlen($a);
return ($diff) ? $diff : strcmp($a, $b);
});
// build the pattern inside a capture group (to have delimiters in the results with the PREG_SPLIT_DELIM_CAPTURE option)
$pattern = '~\b(' . implode('|', array_map(function($i) { return preg_quote($i, '~'); }, array_keys($terms))) . ')\b~i';
// prevent HTML parsing errors from being displayed
$libxmlInternalErrors = libxml_use_internal_errors(true);
// determine whether the html string already has a root html element; if not, add a fake root.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fakeRootElement = false;
if ( $dom->documentElement->nodeName !== 'html' ) {
$dom->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$fakeRootElement = true;
}
libxml_use_internal_errors($libxmlInternalErrors);
// find all text nodes (not already included in a link or between other unwanted tags)
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::a)][not(ancestor::style)][not(ancestor::script)]');
// replacement
foreach ($textNodes as $textNode) {
$parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$fragment = $dom->createDocumentFragment();
foreach ($parts as $k=>$part) {
if ($k&1) {
$anchor = $dom->createElement('a', $part);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($part)]);
$fragment->appendChild($anchor);
} else {
$fragment->appendChild($dom->createTextNode($part));
}
}
$textNode->parentNode->replaceChild($fragment, $textNode);
}
// building of the result string
$result = '';
if ( $fakeRootElement ) {
foreach ($dom->documentElement->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
} else {
$result = $dom->saveHTML();
}
echo $result;
Feel free to put that into one or more functions/methods, but keep in mind that this kind of editing has a non-negligible cost and should be done each time the html is edited (and not each time the html is displayed).
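The replacement loop in this answer relies on how preg_split() behaves with PREG_SPLIT_DELIM_CAPTURE: because the whole pattern is a single capture group, the returned array alternates between ordinary text (even indexes) and matched terms (odd indexes), which is what the $k&1 test checks. A small sketch with a made-up two-term pattern:

```php
<?php
// Assumed sample pattern and input, only to show the shape of the result.
$pattern = '~\b(example|text)\b~i';
$parts = preg_split($pattern, 'this is just some Example text.', -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $k => $part) {
    // Odd indexes hold the captured delimiters, i.e. the matched terms.
    echo $k, ($k & 1) ? ' term: ' : ' text: ', var_export($part, true), "\n";
}
```

Since the match keeps its original casing ("Example"), the lookup into $terms has to normalise it, which is why the answer calls strtolower() on the part.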

Php variable into a XML request string

I have the code below, which extracts the artist name from an XML file using the artist's reference code.
<?php
$dom = new DOMDocument();
$dom->load('http://www.bookingassist.ro/test.xml');
$xpath = new DOMXPath($dom);
echo $xpath->evaluate('string(//Artist[ArtistCode = "COD Artist"] /ArtistName)');
?>
The code that pulls the artist code based on a search:
<?php echo $Artist->artistCode ?>
My question:
Can I insert the variable generated by the PHP code into the XML request string?
If so, could you please advise where I should start reading?
Thanks
You mean the XPath expression. Yes you can - it is "just a string".
$expression = 'string(//Artist[ArtistCode = "'.$Artist->artistCode.'"]/ArtistName)';
echo $xpath->evaluate($expression);
But you have to make sure that the result is valid XPath and that your value does not break the string literal. I wrote a function for a library some time ago that prepares a string this way.
The problem in XPath 1.0 is that there is no way to escape any special character. If your string contains the quotes you're using in XPath, it breaks the expression. The function uses the quotes not used in the string or, if both are used, splits the string and puts the parts into a concat() call.
public function quoteXPathLiteral($string) {
$string = str_replace("\x00", '', $string);
$hasSingleQuote = FALSE !== strpos($string, "'");
if ($hasSingleQuote) {
$hasDoubleQuote = FALSE !== strpos($string, '"');
if ($hasDoubleQuote) {
$result = '';
preg_match_all('("[^\']*|[^"]+)', $string, $matches);
foreach ($matches[0] as $part) {
$quoteChar = (substr($part, 0, 1) == '"') ? "'" : '"';
$result .= ", ".$quoteChar.$part.$quoteChar;
}
return 'concat('.substr($result, 2).')';
} else {
return '"'.$string.'"';
}
} else {
return "'".$string."'";
}
}
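The function generates the needed XPath literal (call it as $this->quoteXPathLiteral(...) from inside the class, or drop the public keyword to use it as a plain function, as here):
$expression = 'string(//Artist[ArtistCode = '.quoteXPathLiteral($Artist->artistCode).']/ArtistName)';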
echo $xpath->evaluate($expression);

How to remove invalid element from DOM?

We have the following code that lists the xpaths where $value is found.
We have detected, for a given URL (see the picture), a non-standard tag td1 which, in addition, doesn't have a closing tag. The site developers have probably put it there intentionally, as you can see in the screenshot below.
This element creates problems when identifying the correct XPath for nodes.
A broken XPath example:
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/**td1**/td[2]/span/u[1]
(as you see td1 is identified and chained in the Xpath)
We think by removing this element it helps us to build the valid XPath we are after.
A valid example is
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/td[2]/span/u[1]
How can we remove it before loading the document into DOMXPath? Do you have some other approach?
We would like to remove all invalid tags, which may be other than td1, such as h8, diw, etc.
private function extract($url, $value) {
$dom = new DOMDocument();
$file = 'content.txt';
//$current = file_get_contents($url);
$current = CurlTool::downloadFile($url, $file);
//file_put_contents($file, $current);
@$dom->loadHTMLFile($current); // load the document; @ suppresses warnings about malformed HTML
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
var_dump($elements);
if (!is_null($elements)) {
foreach ($elements as $element) {
var_dump($element);
echo "\n1.[" . $element->nodeName . "]\n";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
if( ($node->nodeValue != null) && ($node->nodeValue === $value) ) {
echo '2.' . $node->nodeValue . "\n";
$xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
echo '3.' . $xpath . "\n";
}
}
}
}
}
You could use XPath to find the offending nodes and remove them, while promoting their children into their place in the DOM. Then your paths will be correct.
$dom_xpath = new DOMXpath($dom);
$results = $dom_xpath->query('//td1'); // (or any offending element)
foreach ($results as $invalidNode)
{
$parentNode = $invalidNode->parentNode;
// childNodes is always a (truthy) DOMNodeList object, so test firstChild instead
while ($invalidNode->firstChild)
{
$firstChild = $invalidNode->firstChild;
$parentNode->insertBefore($firstChild,$invalidNode);
}
$parentNode->removeChild($invalidNode);
}
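As a self-contained illustration of the unwrapping step, here is a sketch that runs on a small XML fragment (XML rather than HTML so that the parser does not rearrange the unknown tag). The helper name unwrapElements and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: removes every <$tagName> element but keeps its
// children, moving them into the element's former position.
function unwrapElements(DOMDocument $dom, string $tagName): void
{
    // Copy the live node list to an array first, so removals do not disturb the iteration.
    foreach (iterator_to_array($dom->getElementsByTagName($tagName)) as $node) {
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        $node->parentNode->removeChild($node);
    }
}

$dom = new DOMDocument();
$dom->loadXML('<tr><td1><td>a</td><td><span>b</span></td></td1></tr>');

unwrapElements($dom, 'td1');

$result = $dom->saveXML($dom->documentElement);
echo $result, "\n";

// The node path of the span no longer contains the td1 step.
$span = $dom->getElementsByTagName('span')->item(0);
echo $span->getNodePath(), "\n";
```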
EDIT:
You could also build a list of offending elements by using a list of valid elements and negating it.
// Build list manually from the HTML spec:
// See: http://www.w3.org/TR/html5/section-index.html#elements-1
$validTags = array();
// Convert list to XPath:
$validTagsStr = '';
foreach ($validTags as $tag)
{
if ($validTagsStr)
{ $validTagsStr .= ' or '; }
$validTagsStr .= 'self::'.$tag;
}
$results = $dom_xpath->query('//*[not('.$validTagsStr.')]');
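A runnable sketch of the whitelist idea, assuming a deliberately tiny list of valid tags and a made-up XML fragment; a real list would have to cover every element of the HTML spec, and must not be empty, since not() without an argument is not valid XPath.

```php
<?php
// Assumption: a tiny whitelist for illustration only.
$validTags = ['table', 'tr', 'td', 'span'];

// Turn the whitelist into "self::table or self::tr or ..."
$conditions = array_map(function ($tag) { return 'self::' . $tag; }, $validTags);
$query = '//*[not(' . implode(' or ', $conditions) . ')]';

$dom = new DOMDocument();
$dom->loadXML('<table><tr><td1><td><h8>x</h8></td></td1><td><span>y</span></td></tr></table>');
$xpath = new DOMXPath($dom);

// Collect the names of all elements that are not on the whitelist.
$invalidNames = [];
foreach ($xpath->query($query) as $node) {
    $invalidNames[] = $node->nodeName;
}
print_r($invalidNames);
```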
Sooo... perhaps str_replace("<td1 va-laign=\"top\">", "", $current) could do the trick?

remove script tag from HTML content

I am using HTML Purifier (http://htmlpurifier.org/)
I just want to remove <script> tags only.
I don't want to remove inline formatting or any other things.
How can I achieve this?
One more thing: is there any other way to remove script tags from HTML?
Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:
$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);
However, regular expressions are not for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases it is useful for quickly fixing some markup but, as with all quick fixes, forget about security. Use regex only on content/markup you trust.
Remember, anything that user inputs should be considered not safe.
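To make the warning concrete, here is a small sketch of one input that slips past the pattern above: an end tag with whitespace before the closing ">", which browsers still treat as the end of the script. The sample markup is made up for illustration.

```php
<?php
$pattern = '#<script(.*?)>(.*?)</script>#is';

// The pattern only looks for the exact text "</script>",
// so "</script >" is not recognised as the end of the element.
$html = '<p>before</p><script>alert("hi")</script ><p>after</p>';

$result = preg_replace($pattern, '', $html);

var_dump($result === $html); // the script is still in the output
```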
Better solution here would be to use DOMDocument which is designed for this.
Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:
<?php
$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
$remove = [];
foreach($script as $item)
{
$remove[] = $item;
}
foreach ($remove as $item)
{
$item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
I have removed the HTML intentionally because even this can bork.
Use the PHP DOMDocument parser.
$doc = new DOMDocument();
// load the HTML string we want to strip
$doc->loadHTML($html);
// get all the script tags
$script_tags = $doc->getElementsByTagName('script');
$length = $script_tags->length;
// for each tag, remove it from the DOM; the node list is live, so walk it
// backwards to keep the remaining indexes valid
for ($i = $length - 1; $i >= 0; $i--) {
$script_tags->item($i)->parentNode->removeChild($script_tags->item($i));
}
// get the HTML string back
$no_script_html_string = $doc->saveHTML();
This worked for me using the following HTML document:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>
hey
</title>
<script>
alert("hello");
</script>
</head>
<body>
hey
</body>
</html>
Just bear in mind that the DOMDocument parser requires PHP 5 or greater.
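The list returned by getElementsByTagName() is live: removing a node shrinks it and shifts the remaining indexes, which matters as soon as a document has more than one script tag. A minimal sketch that walks the list backwards; the function name stripScripts and the sample document are assumptions for illustration.

```php
<?php
// Hypothetical helper: removes all <script> elements from an HTML string.
function stripScripts(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $scripts = $doc->getElementsByTagName('script');
    // Iterate from the end so that removals never shift an index we still need.
    for ($i = $scripts->length - 1; $i >= 0; $i--) {
        $node = $scripts->item($i);
        $node->parentNode->removeChild($node);
    }
    return $doc->saveHTML();
}

$clean = stripScripts(
    '<!doctype html><html><head><script>one()</script><script>two()</script></head>'
    . '<body><p>hey</p><script>three()</script></body></html>'
);
echo $clean;
```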
$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
foreach($tags_to_remove as $tag){
$element = $dom->getElementsByTagName($tag);
foreach(iterator_to_array($element) as $item){ // copy the live list so removals don't skip items
$item->parentNode->removeChild($item);
}
}
$html = $dom->saveHTML();
A simple way by manipulating string.
function stripStr($str, $ini, $fin)
{
while (($pos = mb_stripos($str, $ini)) !== false) {
$aux = mb_substr($str, $pos + mb_strlen($ini));
$str = mb_substr($str, 0, $pos);
if (($pos2 = mb_stripos($aux, $fin)) !== false) {
$str .= mb_substr($aux, $pos2 + mb_strlen($fin));
}
}
return $str;
}
Shorter:
$html = preg_replace("/<script.*?\/script>/s", "", $html);
When doing regex things might go wrong, so it's safer to do like this:
$html = preg_replace("/<script.*?\/script>/s", "", $html) ? : $html;
So that when the "accident" happens, we get the original $html instead of an empty string.
This is a merge of both ClandestineCoder's & Binh WPO's answers.
The problem with the script tag arrows is that they can have more than one variant,
e.g. (< = &lt; = &amp;lt;) & (> = &gt; = &amp;gt;),
so instead of creating a pattern array with like a bazillion variants,
imho a better solution would be
return preg_replace('/script.*?\/script/ius', '', $text)
? preg_replace('/script.*?\/script/ius', '', $text)
: $text;
This will remove anything that looks like script.../script regardless of the arrow code/variant, and you can test it here: https://regex101.com/r/lK6vS8/1
Try this complete and flexible solution. It is based in part on some previous answers, but contains additional validation checks and gets rid of the additional implied HTML from the loadHTML(...) function. It is divided into two separate functions (the second depends on the first, so don't reorder them) so you can use it with multiple HTML tags that you would like to remove simultaneously (i.e. not just 'script' tags). For example, the removeAllInstancesOfTag(...) function accepts an array of tag names, or optionally just one as a string. So, without further ado, here is the code:
/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [BEGIN] */
/* Usage Example: $scriptless_html = removeAllInstancesOfTag($html, 'script'); */
if (!function_exists('removeAllInstancesOfTag'))
{
function removeAllInstancesOfTag($html, $tag_nm)
{
if (!empty($html))
{
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'); /* For UTF-8 Compatibility. */
$doc = new DOMDocument();
$doc->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD|LIBXML_NOWARNING);
if (!empty($tag_nm))
{
if (is_array($tag_nm))
{
$tag_nms = $tag_nm;
unset($tag_nm);
foreach ($tag_nms as $tag_nm)
{
$rmvbl_itms = $doc->getElementsByTagName(strval($tag_nm));
$rmvbl_itms_arr = [];
foreach ($rmvbl_itms as $itm)
{
$rmvbl_itms_arr[] = $itm;
}
foreach ($rmvbl_itms_arr as $itm)
{
$itm->parentNode->removeChild($itm);
}
}
}
else if (is_string($tag_nm))
{
$rmvbl_itms = $doc->getElementsByTagName($tag_nm);
$rmvbl_itms_arr = [];
foreach ($rmvbl_itms as $itm)
{
$rmvbl_itms_arr[] = $itm;
}
foreach ($rmvbl_itms_arr as $itm)
{
$itm->parentNode->removeChild($itm);
}
}
}
return $doc->saveHTML();
}
else
{
return '';
}
}
}
/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [END] */
/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [BEGIN] */
/* Prerequisites: 'removeAllInstancesOfTag(...)' */
if (!function_exists('removeAllScriptTags'))
{
function removeAllScriptTags($html)
{
return removeAllInstancesOfTag($html, 'script');
}
}
/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [END] */
And here is a test usage example:
$html = 'This is a JavaScript retention test.<br><br><span id="chk_frst_scrpt">Congratulations! The first \'script\' tag was successfully removed!</span><br><br><span id="chk_secd_scrpt">Congratulations! The second \'script\' tag was successfully removed!</span><script>document.getElementById("chk_frst_scrpt").innerHTML = "Oops! The first \'script\' tag was NOT removed!";</script><script>document.getElementById("chk_secd_scrpt").innerHTML = "Oops! The second \'script\' tag was NOT removed!";</script>';
echo removeAllScriptTags($html);
I hope my answer really helps someone. Enjoy!
An example modifying ctf0's answer. This should only do the preg_replace once, but also checks for errors and blocks the character codes for the forward slash.
$str = '<script> var a - 1; </script>';
$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
$replace = preg_replace($pattern, '', $str);
return ($replace !== null)? $replace : $str;
If you are using php 7 you can use the null coalesce operator to simplify it even more.
$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
return (preg_replace($pattern, '', $str) ?? $str);
function remove_script_tags($html){
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
$remove = [];
foreach($script as $item){
$remove[] = $item;
}
foreach ($remove as $item){
$item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
$html = preg_replace('/<!DOCTYPE.*?<html>.*?<body><p>/ims', '', $html);
$html = str_replace('</p></body></html>', '', $html);
return $html;
}
Dejan's answer was good, but saveHTML() adds unnecessary doctype and body tags, this should get rid of it. See https://3v4l.org/82FNP
I would use BeautifulSoup if it's available. Makes this sort of thing very easy.
Don't try to do it with regexps. That way lies madness.
I had been struggling with this question. I discovered you only really need one function. explode('>', $html); The single common denominator to any tag is < and >. Then after that it's usually quotation marks ( " ). You can extract information so easily once you find the common denominator. This is what I came up with:
$html = file_get_contents('http://some_page.html');
$h = explode('>', $html);
foreach($h as $k => $v){
$v = trim($v);//clean it up a bit
if(preg_match('/^(<script[.*]*)/ius', $v)){//my regex here might be questionable
$counter = $k;//match opening tag and start counter for backtrace
}elseif(preg_match('/([.*]*<\/script$)/ius', $v)){//but it gets the job done
$script_length = $k - $counter;
$counter = 0;
for($i = $script_length; $i >= 0; $i--){
$h[$k-$i] = '';//backtrace and clear everything in between
}
}
}
for($i = 0; $i < count($h); $i++){
if($h[$i] != ''){
$ht[$i] = $h[$i];//clean out the blanks so when we implode it works right.
}
}
$html = implode('>', $ht);//all scripts stripped.
echo $html;
I see this really only working for script tags because you will never have nested script tags. Of course, you can easily add more code that does the same check and gather nested tags.
I call it accordion coding. implode();explode(); are the easiest ways to get your logic flowing if you have a common denominator.
This is a simplified variant of Dejan Marjanovic's answer:
function removeTags($html, $tag) {
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
$item->parentNode->removeChild($item);
}
return $dom->saveHTML();
}
Can be used to remove any kind of tag, including <script>:
$scriptlessHtml = removeTags($html, 'script');
Use the str_replace function to replace them with an empty string or something:
$query = '<script>console.log("I should be banned")</script>';
$badChar = array('<script>','</script>');
$query = str_replace($badChar, '', $query);
echo $query;
//this echoes console.log("I should be banned")
?>

Indentation with DOMDocument in PHP

I'm using DOMDocument to generate a new XML file and I would like for the output of the file to be indented nicely so that it's easy to follow for a human reader.
For example, when DOMDocument outputs this data:
<?xml version="1.0"?>
<this attr="that"><foo>lkjalksjdlakjdlkasd</foo><foo>lkjlkasjlkajklajslk</foo></this>
I want the XML file to be:
<?xml version="1.0"?>
<this attr="that">
<foo>lkjalksjdlakjdlkasd</foo>
<foo>lkjlkasjlkajklajslk</foo>
</this>
I've been searching around looking for answers, and everything that I've found seems to say to try to control the white space this way:
$foo = new DOMDocument();
$foo->preserveWhiteSpace = false;
$foo->formatOutput = true;
But this does not seem to do anything. Perhaps this only works when reading XML? Keep in mind I'm trying to write new documents.
Is there anything built-in to DOMDocument to do this? Or a function that can accomplish this easily?
DOMDocument will do the trick. I personally spent a couple of hours Googling and trying to figure this out, and I noted that if you use
$xmlDoc = new DOMDocument ();
$xmlDoc->loadXML ( $xml );
$xmlDoc->preserveWhiteSpace = false;
$xmlDoc->formatOutput = true;
$xmlDoc->save($xml_file);
in that order, it just doesn't work. But if you use the same code in this order:
$xmlDoc = new DOMDocument ();
$xmlDoc->preserveWhiteSpace = false;
$xmlDoc->formatOutput = true;
$xmlDoc->loadXML ( $xml );
$xmlDoc->save($xml_file);
it works like a charm. Hope this helps.
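A likely explanation for the ordering effect is that preserveWhiteSpace is consulted while parsing, so it has to be set before loadXML(), whereas formatOutput is only consulted when the document is serialised. A minimal sketch with a made-up input that already contains a stray newline:

```php
<?php
// Sample input with whitespace between some elements but not others.
$xml = "<this attr=\"that\">\n<foo>a</foo><foo>b</foo></this>";

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false; // must be set before loadXML() to have any effect
$doc->loadXML($xml);
$doc->formatOutput = true;        // only needs to be set before saving

$formatted = $doc->saveXML();
echo $formatted;
```

If the existing whitespace-only text nodes are kept (preserveWhiteSpace left at its default), the serialiser treats the element as having text content and leaves it alone, which matches the behaviour described in this answer.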
After some help from John and playing around with this on my own, it seems that even DOMDocument's inherent support for formatting didn't meet my needs. So, I decided to write my own indentation function.
This is a pretty crude function that I just threw together quickly, so if anyone has any optimization tips or anything to say about it in general, I'd be glad to hear it!
function indent($text)
{
// Create new lines where necessary
$find = array('>', '</', "\n\n");
$replace = array(">\n", "\n</", "\n");
$text = str_replace($find, $replace, $text);
$text = trim($text); // for the \n that was added after the final tag
$text_array = explode("\n", $text);
$open_tags = 0;
foreach ($text_array AS $key => $line)
{
$tabs = ''; // reset for every line; the first two lines are never indented
if ($key > 1)
{
for ($i = 1; $i <= $open_tags; $i++)
$tabs .= "\t";
}
if ($key != 0)
{
if ((strpos($line, '</') === false) && (strpos($line, '>') !== false))
$open_tags++;
else if ($open_tags > 0)
$open_tags--;
}
$new_array[] = $tabs . $line;
unset($tabs);
}
$indented_text = implode("\n", $new_array);
return $indented_text;
}
I have tried running the code below setting formatOutput and preserveWhiteSpace in different ways, and the only member that has any effect on the output is formatOutput. Can you run the script below and see if it works?
<?php
echo "<pre>";
$foo = new DOMDocument();
//$foo->preserveWhiteSpace = false;
$foo->formatOutput = true;
$root = $foo->createElement("root");
$root->setAttribute("attr", "that");
$bar = $foo->createElement("bar", "some text in bar");
$baz = $foo->createElement("baz", "some text in baz");
$foo->appendChild($root);
$root->appendChild($bar);
$root->appendChild($baz);
echo htmlspecialchars($foo->saveXML());
echo "</pre>";
?>
Which method do you call when printing the xml?
I use this:
$doc = new DOMDocument('1.0', 'utf-8');
$root = $doc->createElement('root');
$doc->appendChild($root);
(...)
$doc->formatOutput = true;
$doc->saveXML($root);
It works perfectly but prints out only the element, so you must print the <?xml ... ?> part manually.
Most answers in this topic deal with xml text flow.
Here is another approach using the dom functionalities to perform the indentation job.
The loadXML() dom method imports indentation characters present in the xml source as text nodes. The idea is to remove such text nodes from the dom and then recreate correctly formatted ones (see comments in the code below for more details).
The xmlIndent() function is implemented as a method of the indentDomDocument class, which is inherited from domDocument.
Below is a complete example of how to use it :
$dom = new indentDomDocument("1.0");
$xml = file_get_contents("books.xml");
$dom->loadXML($xml);
$dom->xmlIndent();
echo $dom->saveXML();
class indentDomDocument extends domDocument {
public function xmlIndent() {
// Retrieve all text nodes using XPath
$x = new DOMXPath($this);
$nodeList = $x->query("//text()");
foreach($nodeList as $node) {
// 1. "Trim" each text node by removing its leading and trailing spaces and newlines.
$node->nodeValue = preg_replace("/^[\s\r\n]+/", "", $node->nodeValue);
$node->nodeValue = preg_replace("/[\s\r\n]+$/", "", $node->nodeValue);
// 2. Resulting text node may have become "empty" (zero length nodeValue) after trim. If so, remove it from the dom.
if(strlen($node->nodeValue) == 0) $node->parentNode->removeChild($node);
}
// 3. Starting from root (documentElement), recursively indent each node.
$this->xmlIndentRecursive($this->documentElement, 0);
} // end function xmlIndent
private function xmlIndentRecursive($currentNode, $depth) {
$indentCurrent = true;
if(($currentNode->nodeType == XML_TEXT_NODE) && ($currentNode->parentNode->childNodes->length == 1)) {
// A text node being the unique child of its parent will not be indented.
// In this special case, we must tell the parent node not to indent its closing tag.
$indentCurrent = false;
}
if($indentCurrent && $depth > 0) {
// Indenting a node consists of inserting before it a new text node
// containing a newline followed by a number of tabs corresponding
// to the node depth.
$textNode = $this->createTextNode("\n" . str_repeat("\t", $depth));
$currentNode->parentNode->insertBefore($textNode, $currentNode);
}
if($currentNode->childNodes) {
$indentClosingTag = false;
foreach($currentNode->childNodes as $childNode) $indentClosingTag = $this->xmlIndentRecursive($childNode, $depth+1);
if($indentClosingTag) {
// If children have been indented, then the closing tag
// of the current node must also be indented.
$textNode = $this->createTextNode("\n" . str_repeat("\t", $depth));
$currentNode->appendChild($textNode);
}
}
return $indentCurrent;
} // end function xmlIndentRecursive
} // end class indentDomDocument
Yo peeps,
just found out that apparently, a root XML element may not contain text children. This is nonintuitive a. f. But apparently, this is the reason that, for instance,
$x = new \DOMDocument;
$x -> preserveWhiteSpace = false;
$x -> formatOutput = true;
$x -> loadXML('<root>a<b>c</b></root>');
echo $x -> saveXML();
will fail to indent.
https://bugs.php.net/bug.php?id=54972
So there you go, h. t. h. et c.
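As far as I can tell, the limitation is about mixed content (an element that has both text and element children) rather than about the root element specifically. The sketch below compares the two cases; the helper name prettyPrint is made up for illustration.

```php
<?php
// Hypothetical helper: parse and re-serialise with formatting switched on.
function prettyPrint(string $xml): string
{
    $doc = new DOMDocument();
    $doc->preserveWhiteSpace = false;
    $doc->formatOutput = true;
    $doc->loadXML($xml);
    return $doc->saveXML();
}

$elementOnly = prettyPrint('<root><b>c</b></root>');  // only element children
$mixed       = prettyPrint('<root>a<b>c</b></root>'); // text and element children

echo $elementOnly, "\n", $mixed;
```

Leaving mixed content untouched is arguably the safe choice for the serialiser, since adding whitespace next to real text would change the text itself.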
header("Content-Type: text/xml");
$str = "";
$str .= "<customer>";
$str .= "<offer>";
$str .= "<opened></opened>";
$str .= "<redeemed></redeemed>";
$str .= "</offer>";
echo $str .= "</customer>";
If you are using any extension other than .xml, then first set the Content-Type header to the correct value.
