XML node with Mixed Content using PHP DOM - php

Is there a way to create a node that has mixed XML content in it with the PHP DOM?

If I understood you correctly you want something similar to innerHTML in JavaScript. There is a solution to that:
$xmlString = 'some <b>mixed</b> content';
$dom = new DOMDocument;
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($xmlString);
$dom->appendChild($fragment);
// done
To summarize, what you need is:
DOMDocument::createDocumentFragment()
DOMDocumentFragment::appendXML()
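As a minimal sketch of those two calls in context, the fragment can also be appended to an element rather than to the document itself. The wrapper element name `note` is only an assumption for illustration; the check on the return value relies on appendXML() returning false when the string is not well-formed.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');

// Hypothetical wrapper element that will hold the mixed content.
$root = $dom->appendChild($dom->createElement('note'));

$fragment = $dom->createDocumentFragment();
if (!$fragment->appendXML('some <b>mixed</b> content')) {
    throw new RuntimeException('The string is not well-formed XML.');
}

// Appending a fragment moves its children (text, <b>, text) into the element.
$root->appendChild($fragment);

$xml = $dom->saveXML($root);
echo $xml; // <note>some <b>mixed</b> content</note>
```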
Although you didn't ask about it, here is how to get the string representation of a single DOM node as opposed to the whole DOM document:
// for a DOMDocument you have
$dom->save($file);
$string = $dom->saveXML();
$string = $dom->saveHTML();
$dom->saveHTMLFile($file);
// For a DOMElement you have
$node = $dom->getElementById('some-id');
$string = $node->C14N();
$node->C14NFile($file);
The C14N() and C14NFile() methods were not documented at the time of writing.

Related

php export xml CDATA escaped

I am trying to export xml with CDATA tags. I use the following code:
$xml_product = $xml_products->addChild('product');
$xml_product->addChild('mychild', htmlentities("<![CDATA[" . $mytext . "]]>"));
The problem is that the < and > of the CDATA tags get escaped as &lt; and &gt;, like this:
<mychild>&lt;![CDATA[My some long long long text]]&gt;</mychild>
but I need:
<mychild><![CDATA[My some long long long text]]></mychild>
If I use htmlentities() I get lots of errors like "tag raquo is not defined", although there are no such tags in my text. htmlentities() probably tries to convert the text inside the CDATA section as well, which I don't want either.
Any ideas how to fix that? Thank you.
UPD_1 My function which saves xml to file:
public static function saveFormattedXmlFile($simpleXMLElement, $output_file) {
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadXML(urldecode($simpleXMLElement->asXML()));
$dom->save($output_file);
}
Here is a short example of how to add a CDATA section. Note that it switches to DOMDocument to do so, because SimpleXML has no method for creating one. The code builds up a <product> element in $xml_product and creates a new, empty <mychild> element in it. This $newNode is then converted to a DOMElement using dom_import_simplexml(). Finally, the createCDATASection() method of the owning DOMDocument creates the CDATA section, which is appended to the node.
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><Products />');
$xml_product = $xml->addChild('product');
$newNode = $xml_product->addChild('mychild');
$mytext = "<html></html>";
$node = dom_import_simplexml($newNode);
$cdata = $node->ownerDocument->createCDATASection($mytext);
$node->appendChild($cdata);
echo $xml->asXML();
This example outputs...
<?xml version="1.0" encoding="UTF-8"?>
<Products><product><mychild><![CDATA[<html></html>]]></mychild></product></Products>
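The same steps can be wrapped in a small helper so that every text value goes through createCDATASection() instead of htmlentities(). This is only a sketch: the function name addChildWithCData() is made up for this example and is not part of SimpleXML.

```php
<?php
// Hypothetical helper: adds <$name> to $parent with $text inside a CDATA section.
function addChildWithCData(SimpleXMLElement $parent, string $name, string $text): SimpleXMLElement
{
    $child = $parent->addChild($name);
    $node = dom_import_simplexml($child);
    $node->appendChild($node->ownerDocument->createCDATASection($text));
    return $child;
}

$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><Products />');
$xml_product = $xml->addChild('product');
addChildWithCData($xml_product, 'mychild', 'Fish & <b>chips</b>');

$output = $xml->asXML();
echo $output;
```

Because the text is never escaped by hand, characters such as & and < end up unchanged inside the CDATA section.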

PHP: Keeping HTML inside XML node without CDATA

I've got an xml like this:
<father>
<son>Text with <b>HTML</b>.</son>
</father>
I'm using simplexml_load_string to parse it into SimpleXmlElement. Then I get my node like this
$xml->father->son->__toString(); //output: "Text with .", but expected "Text with <b>HTML</b>."
I need to handle simple HTML such as:
<b>text</b> or <br/> inside the xml which is sent by many users.
My problem is that I can't just ask them to use CDATA because they won't be able to handle it properly, and they are already used to doing without it.
Also, if possible, I don't want the file to be edited because the information needs to be exactly what the user sent.
The function simplexml_load_string simply erases the HTML node and anything inside it.
How can I keep the information ?
SOLUTION
To handle the problem I used asXML() as explained by @ThW:
$tmp = $xml->father->son->asXml(); //<son>Text with <b>HTML</b>.</son>
I just added a preg_match to erase the node.
A CDATA section is a character node, just like a text node. But it does less encoding/decoding. This is mostly a downside, actually. On the upside something in a CDATA section might be more readable for a human and it allows for some BC in special cases. (Think HTML script tags.)
For an XML API they are nearly the same. Here is a small DOM example (SimpleXML abstracts too much).
$document = new DOMDocument();
$father = $document->appendChild(
$document->createElement('father')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createTextNode('With <b>HTML</b><br>It\'s so nice.')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createCDataSection('With <b>HTML</b><br>It\'s so nice.')
);
$document->formatOutput = TRUE;
echo $document->saveXml();
Output:
<?xml version="1.0"?>
<father>
<son>With &lt;b&gt;HTML&lt;/b&gt;&lt;br&gt;It's so nice.</son>
<son><![CDATA[With <b>HTML</b><br>It's so nice.]]></son>
</father>
As you can see they are serialized very differently - but from the API view they are basically exchangeable. If you're using an XML parser the value you get back should be the same in both cases.
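The claim that a parser returns the same value in both cases can be checked with a short sketch. The sample document below is an assumption made up for this check; it stores the same string once as an escaped text node and once as a CDATA section.

```php
<?php
$xml = <<<'XML'
<father>
  <son>With &lt;b&gt;HTML&lt;/b&gt;</son>
  <son><![CDATA[With <b>HTML</b>]]></son>
</father>
XML;

$document = new DOMDocument();
$document->loadXML($xml);

$sons = $document->getElementsByTagName('son');
$fromText  = $sons->item(0)->textContent; // decoded from the escaped text node
$fromCData = $sons->item(1)->textContent; // read from the CDATA section

var_dump($fromText === $fromCData); // bool(true)
echo $fromText; // With <b>HTML</b>
```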
So the first possibility is just letting the HTML fragment be stored in a character node. It is just a string value for the outer XML document itself.
The other way would be using XHTML. XHTML is XML-compatible HTML. You can mix and match different XML formats, so you could add the XHTML fragment as part of the outer XML.
That seems to be what you're receiving. But SimpleXML has some problems with mixed nodes. So here is an example of how you can read it in DOM.
$xml = <<<'XML'
<father>
<son>With <b>HTML</b><br/>It's so nice.</son>
</father>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('/father/son[1]/node()') as $child) {
$result .= $document->saveXml($child);
}
echo $result;
Output:
With <b>HTML</b><br/>It's so nice.
Basically you need to save each child of the son element as XML.
SimpleXML is based on the same DOM library internally. That allows you to convert a SimpleXMLElement into a DOM node. From there you can again save each child as XML.
$father = new SimpleXMLElement($xml);
$sonNode = dom_import_simplexml($father->son);
$document = $sonNode->ownerDocument;
$result = '';
foreach ($sonNode->childNodes as $child) {
$result .= $document->saveXml($child);
}
echo $result;
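If this is needed in several places, the loop can be pulled into a function. The following is a sketch under that assumption; innerXml() is a hypothetical name, not a DOM or SimpleXML method, and the sample XML is based on the one in the question.

```php
<?php
// Hypothetical helper: serializes all children of $node, without the node's own tags.
function innerXml(DOMNode $node): string
{
    $result = '';
    foreach ($node->childNodes as $child) {
        $result .= $node->ownerDocument->saveXML($child);
    }
    return $result;
}

$father = new SimpleXMLElement('<father><son>Text with <b>HTML</b>.<br/></son></father>');
$inner = innerXml(dom_import_simplexml($father->son));

echo $inner; // Text with <b>HTML</b>.<br/>
```

This avoids the regular expression that would otherwise be needed to strip the outer <son> tags from the asXML() result.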

DOMElement replace HTML value

I have this HTML string in a DOMElement:
<h1>Home</h1>
test{{test}}
I want to replace this content in a way that only
<h1>Home</h1>
test
remains (so I want to remove the {{test}}).
At this moment, my code looks like this:
$node->nodeValue = preg_replace(
'/(?<replaceable>{{([a-z0-9_]+)}})/mi', '' , $node->nodeValue);
This doesn't work because nodeValue doesn't contain the HTML value of the node.
I can't figure out how to get the HTML string of the node other than using $node->C14N(), but by using C14N I can't replace the content.
Any ideas how I can remove the {{test}} in an HTML string like this?
Have you tried the DOMDocument::saveXML function? (http://php.net/manual/en/domdocument.savexml.php)
It has a second argument $node with which you can specify which node to print the HTML/XML of.
So, for example:
<?php
$doc = new DOMDocument('1.0');
// we want a nice output
$doc->formatOutput = true;
$root = $doc->createElement('body');
$root = $doc->appendChild($root);
$title = $doc->createElement('h1', 'Home');
$root->appendChild($title);
$text = $doc->createTextNode('test{{test}}');
$text = $root->appendChild($text);
echo $doc->saveXML($root);
?>
This will give you:
<body>
<h1>Home</h1>
test{{test}}
</body>
If you do not want the <body> tag, you could cycle through all of its childnodes:
<?php
foreach($root->childNodes as $child){
echo $doc->saveXML($child);
}
?>
This will give you:
<h1>Home</h1>test{{test}}
Edit: you can then of course replace {{test}} by the regex that you are already using:
<?php
$xml = '';
foreach($root->childNodes as $child){
$xml .= preg_replace(
'/(?<replaceable>{{([a-z0-9_]+)}})/mi', '',
$doc->saveXML($child)
);
}
?>
This will give you:
<h1>Home</h1>test
Note: I haven't tested the code, but this should give you the general idea.
The issue is mainly around how you navigate the DOM but there's also an issue with your RegExp; XPath actually provides a lot of flexibility when it comes to DOM manipulation so that's my preferred solution.
Assuming you have a DOMDocument built like this (I've attached an XPath):
$dom = new DOMDocument('1.0', 'utf-8');
$xpath = new DOMXPath($dom);
$node = $dom->createElement('div');
$node->appendChild(
$dom->createElement('h1', "Home")
);
$node->appendChild(
$dom->createTextNode("test{{test}}")
);
$dom->appendChild($node);
You can specifically target the text node of that <div> with '/div/text()' in XPath.
So to replace {{test}} within that text node without corrupting the rest of the node, you would do:
$xpath->query('/div/text()')->item(0)->nodeValue = preg_replace(
'/(.*){{[^}]+}}/m',
'$1',
$xpath->query('/div/text()')->item(0)->nodeValue
);
Somewhat convoluted but the output from $dom->saveXML(); is:
<?xml version="1.0" encoding="utf-8"?>
<div><h1>Home</h1>test</div>
{{test}} has been removed leaving the rest intact.
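The same idea extends to placeholders in any text node of the document. The sketch below assumes placeholders have the {{name}} form from the question and uses a made-up sample document; it selects every text node containing "{{" and rewrites only that node's value, so element markup is never touched.

```php
<?php
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadXML('<div><h1>Home</h1>test{{test}}<p>more{{other}}</p></div>');
$xpath = new DOMXPath($dom);

// The XPath result is a snapshot, so changing node values inside the loop is safe.
foreach ($xpath->query('//text()[contains(., "{{")]') as $textNode) {
    $textNode->nodeValue = preg_replace('/\{\{[a-z0-9_]+\}\}/i', '', $textNode->nodeValue);
}

$result = $dom->saveXML($dom->documentElement);
echo $result; // <div><h1>Home</h1>test<p>more</p></div>
```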

simple HTML DOM parser return wrong elements tree

I am having problem with HTML DOM parser. This is what I used:
$url = 'http://topmmanews.com/2013/04/06/ufc-on-fuel-tv-9-results/';
$page = file_get_html($url);
$ret = $page->find("div.posttext",0);
This is supposed to give me count($ret->children()) = 10. However, it only returns 3; all the elements after the 3rd are merged into a single element.
Can anyone tell me whether there is something wrong with my code or whether this is a bug in simple HTML DOM parser?
As Álvaro G. Vicario pointed out, your target HTML is somehow malformed. I tried your code, and it shows three children and 6 other nodes.
But the other way, which might be useful, is to use DOMDocument and DOMXPath like this:
$url = 'http://topmmanews.com/2013/04/06/ufc-on-fuel-tv-9-results/';
$html = file_get_contents($url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom_xpath = new DOMXpath($dom);
// XPATH to return the first DIV with class "posttext"
$elements = $dom_xpath->query("(//div[@class='posttext'])[1]");
Then you can iterate through child nodes and read the values or whatever you want.
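Iterating the child nodes could look like the sketch below. It uses a small inline HTML string instead of the remote page (an assumption, so the example runs offline) and collects only element children, since childNodes also contains text nodes. libxml_use_internal_errors() is used here to keep warnings about malformed markup out of the output.

```php
<?php
$html = '<html><body><div class="posttext"><p>One</p><p>Two</p>loose text<ul><li>x</li></ul></div></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$dom_xpath = new DOMXPath($dom);
$div = $dom_xpath->query("(//div[@class='posttext'])[1]")->item(0);

$names = [];
foreach ($div->childNodes as $child) {
    if ($child instanceof DOMElement) { // skip text nodes and comments
        $names[] = $child->nodeName;
    }
}

echo implode(', ', $names); // p, p, ul
```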
phpquery uses DOM so it's a more reliable parser with bad html:
$html = file_get_contents('http://topmmanews.com/2013/04/06/ufc-on-fuel-tv-9-results/');
$dom = phpQuery::newDocumentHTML($html);
$ret = $dom->find("div.posttext")->eq(0);
echo count($ret->children());
#=> 10

How to remove elements from selection when using PHP Simple HTML Dom library

I am using PHP Simple HTML Dom library. I can obtain the element that I want. But this element contains other elements that I want to remove from selection.
[elem]
include this data
[elem]exclude this data[elem]
[elem]
If it is possible please show an example.
xml
<elem>
include this data
<elem>exclude this data</elem>
</elem>
php -- Pure DOMDocument solution:
$dom = new DOMDocument;
$dom->load('xml.xml');
$node = $dom->getElementsByTagName('elem')->item(0);
$child = $node->getElementsByTagName('elem')->item(0);
$node->removeChild($child);
echo $dom->saveXml();
php -- SimpleXML with DOMDocument
$doc = simplexml_load_file('xml.xml');
$toremove = $doc->elem;
$dom = dom_import_simplexml($toremove);
$dom->parentNode->removeChild($dom);
echo $doc->asXml();
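When more than one nested element has to go, a sketch like the following may help. It assumes an inline sample document with two nested <elem> children. The list returned by getElementsByTagName() is live, so it is copied with iterator_to_array() before anything is removed.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<elem>include this data<elem>exclude this data</elem><elem>and this</elem></elem>');
$root = $dom->documentElement;

// Copy the live node list first; removing nodes while iterating it directly can skip entries.
foreach (iterator_to_array($root->getElementsByTagName('elem')) as $child) {
    $child->parentNode->removeChild($child);
}

$result = $dom->saveXML($root);
echo $result; // <elem>include this data</elem>
```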
