Convert XML to HTML with emphasis in PHP

I have an XML file that contains the following content.
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE article>
<article
xmlns="http://docbook.org/ns/docbook" version="5.0"
xmlns:xlink="http://www.w3.org/1999/xlink" >
<para>
This is an <emphasis role="strong">test</emphasis> sentence.
</para>
</article>
When I use
$xml_data = simplexml_load_string($filedata);
foreach ($xml_data['para'] as $data) {
echo $data;
}
I get This is an sentence. But I want to get This is an <b>test</b> sentence. as the result.

Instead of simplexml_load_string I'd recommend DOMDocument, but that is just a personal preference. A naive implementation might just do a string replacement, and that might totally work for you. However, since you've provided actual XML that even includes a namespace, I'm going to try to keep this as XML-centric as possible, while skipping XPath, which could also be used.
This code loads the XML and walks the child nodes of the root element. If it finds a <para> element, it walks all of the children of that node looking for an <emphasis> node, and if it finds one it replaces it with a new <b> node.
The replacement process is a little complex, however, because if we just used nodeValue we might lose any markup that lives in there, so we need to walk the children of the <emphasis> node and clone those into our replacement node.
Because the source document has a namespace, we also need to remove that from our final HTML. Since we are going from XML to HTML, I think a str_replace is a safe choice here, without going too crazy in XML land.
Hopefully the code has enough comments to make sense.
<?php
$filedata = <<<EOT
<?xml version="1.0" encoding="utf-8" ?>
<article
xmlns="http://docbook.org/ns/docbook" version="5.0"
xmlns:xlink="http://www.w3.org/1999/xlink" >
<para>
This is an <emphasis role="strong">hello <em>world</em></emphasis> sentence.
</para>
</article>
EOT;
$dom = new DOMDocument();
$dom->loadXML($filedata);
foreach($dom->documentElement->childNodes as $node){
if(XML_ELEMENT_NODE === $node->nodeType && 'para' === $node->nodeName){
// Replace any emphasis elements
foreach($node->childNodes as $childNode) {
if(XML_ELEMENT_NODE === $childNode->nodeType && 'emphasis' === $childNode->nodeName){
// This is arguably the most "correct" way to replace, just in case
// there's extra nodes inside. A cheaper way would be to not loop
// and just use the nodeValue however you might lose some HTML.
$newNode = $dom->createElement('b');
foreach($childNode->childNodes as $grandChild){
$newNode->appendChild($grandChild->cloneNode(true));
}
$childNode->replaceWith($newNode);
}
}
// Build our output
$output = '';
foreach($node->childNodes as $childNode) {
$output .= $dom->saveHTML($childNode);
}
// The provided XML has a namespace, and when cloning nodes that NS comes
// along. Since we are going from regular XML to irregular HTML I think
// a string replacement is best.
$output = str_replace(' xmlns="http://docbook.org/ns/docbook"', '', $output);
echo $output;
}
}
Demo here: https://3v4l.org/04Tc3#v8.0.23
NOTE: PHP 8 added replaceWith. If you are using PHP 7 or earlier, call replaceChild on the parent node instead.
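For PHP 7, the replaceChild variant could look roughly like the sketch below. It is a minimal illustration, not the code from the demo: it assumes an input without a default namespace (so no str_replace cleanup is needed), it moves the children instead of cloning them, and it uses getElementsByTagName to find every <emphasis> rather than walking the tree by hand.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML(
    '<article><para>This is an <emphasis role="strong">test</emphasis> sentence.</para></article>'
);

// getElementsByTagName() returns a live list, so take a snapshot before modifying the tree.
$emphasisNodes = iterator_to_array($dom->getElementsByTagName('emphasis'));

foreach ($emphasisNodes as $emphasis) {
    $bold = $dom->createElement('b');
    // Move (rather than clone) every child into the replacement element.
    while ($emphasis->firstChild !== null) {
        $bold->appendChild($emphasis->firstChild);
    }
    // PHP 7 compatible equivalent of $emphasis->replaceWith($bold).
    $emphasis->parentNode->replaceChild($bold, $emphasis);
}

// Serialize only the children of <para>, as in the answer above.
$para = $dom->getElementsByTagName('para')->item(0);
$html = '';
foreach ($para->childNodes as $child) {
    $html .= $dom->saveHTML($child);
}
echo $html, "\n";
```

Note the argument order of replaceChild: the new node comes first, the node being replaced second.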

What if you have the following XML?
<entry>
<para>This is the first text</para>
<emphasis>This is the second text</emphasis>
<para>This is the <emphasis>next</emphasis> text</para>
<itemizedlist>
<listitem>
<para>
This is an paragraph inside a list
</para>
</listitem>
<itemizedlist>
<listitem>
<para>
This is an paragraph inside a list inside a list
</para>
</listitem>
</itemizedlist>
</itemizedlist>
</entry>
using
if(XML_ELEMENT_NODE === $stuff2->nodeType && 'para' === $stuff2->nodeName){
$newNode = $dom->createElement('p');
foreach($stuff2->childNodes as $grandChild){
$newNode->appendChild($grandChild->cloneNode(true));
}
$stuff2->replaceWith($newNode);
}
if (XML_ELEMENT_NODE === $stuff2->nodeType && 'itemizedlist' === $stuff2->nodeName) {
$newNode = $dom->createElement('ul');
foreach($stuff2->childNodes as $grandChild){
$newNode->appendChild($grandChild->cloneNode(true));
}
$stuff2->replaceWith($newNode);
}
if(XML_ELEMENT_NODE === $stuff2->nodeType && 'emphasis' === $stuff2->nodeName){
$newNode = $dom->createElement('b');
foreach($stuff2->childNodes as $grandChild){
$newNode->appendChild($grandChild->cloneNode(true));
}
$stuff2->replaceWith($newNode);
}
if (XML_ELEMENT_NODE === $stuff2->nodeType && 'listitem' === $stuff2->nodeName) {
$newNode = $dom->createElement('li');
foreach($stuff2->childNodes as $grandChild){
$newNode->appendChild($grandChild->cloneNode(true));
}
$stuff2->replaceWith($newNode);
}
only results in
<p>This is the first text</p>
<emphasis>This is the second text</emphasis>
<para>This is the <emphasis>next</emphasis> text</para>
<itemizedlist>
<listitem>
<para>This is an paragraph inside a list</para>
</listitem>
<itemizedlist>
<listitem>
<para>This is an paragraph inside a list inside a list</para>
</listitem>
</itemizedlist>
</itemizedlist>
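One way to handle nesting like this is to stop replacing nodes in place and build a new tree recursively instead. The following is only a sketch under a few assumptions: the TAG_MAP constant and the convertNode function are names made up for this example, unmapped elements such as <entry> fall back to a <div> (an arbitrary choice), and attributes are not copied.

```php
<?php
// Hypothetical DocBook-to-HTML tag mapping; extend as needed.
const TAG_MAP = [
    'para'         => 'p',
    'emphasis'     => 'b',
    'itemizedlist' => 'ul',
    'listitem'     => 'li',
];

// Recursively copies $source into $out, renaming elements along the way.
function convertNode(DOMNode $source, DOMDocument $out): ?DOMNode
{
    if ($source->nodeType === XML_TEXT_NODE || $source->nodeType === XML_CDATA_SECTION_NODE) {
        return $out->createTextNode($source->nodeValue);
    }
    if ($source->nodeType !== XML_ELEMENT_NODE) {
        return null; // comments, processing instructions, etc. are dropped
    }
    // localName ignores any namespace prefix; unknown elements become <div> (assumption).
    $element = $out->createElement(TAG_MAP[$source->localName] ?? 'div');
    foreach ($source->childNodes as $child) {
        $converted = convertNode($child, $out);
        if ($converted !== null) {
            $element->appendChild($converted);
        }
    }
    return $element;
}

$in = new DOMDocument();
$in->loadXML(
    '<entry><para>This is the <emphasis>next</emphasis> text</para>'
    . '<itemizedlist><listitem><para>Item</para></listitem></itemizedlist></entry>'
);

$out = new DOMDocument();
$root = $out->appendChild(convertNode($in->documentElement, $out));
$html = $out->saveHTML($root);
echo $html, "\n";
```

Because the new elements are created without a namespace, this approach also sidesteps the xmlns cleanup when the source document uses the DocBook namespace.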

Related

PHP DOM Why does removing a child node of an element with removeChild interrupt a foreach loop over its child nodes?

I have encountered a puzzling behavior of the DOM method removeChild. When looping over the child nodes of a DOMElement, removing one of these nodes along the way interrupts the loop, i.e., the loop does not iterate over the remaining child nodes.
Here is a minimal example:
$test_string = <<<XML
<test>
<text>A sample text with <i>mixed content</i> of <b>various sorts</b></text>
</test>
XML;
$test_DOMDocument = new DOMDocument();
$test_DOMDocument->loadXML($test_string);
$test_DOMNode = $test_DOMDocument->getElementsByTagName("text");
foreach ($test_DOMNode as $text) {
foreach ($text->childNodes as $node) {
if (preg_match("/text/", $node->nodeValue)) {
echo $node->nodeValue;
$node->parentNode->removeChild($node);
} else {
echo $node->nodeValue;
}
}
}
If I comment out the line $node->parentNode->removeChild($node);, then the output is the entire test string, i.e., A sample text with mixed content of various sorts, as expected. With that line, however, only the first child node is output, i.e., A sample text with. That is, removing the first child node as the loop passes over it apparently interrupts the loop; the remaining child nodes are not processed. Why is that?
Thanks in advance for your help!
Implementing the suggestions of the comments on my question, I came up with the following solution:
$test_string = <<<XML
<test>
<text>A sample text with <i>mixed content</i> of <b>various sorts</b></text>
</test>
XML;
$test_DOMDocument = new DOMDocument();
$test_DOMDocument->loadXML($test_string);
$test_DOMNode = $test_DOMDocument->getElementsByTagName("text");
foreach ($test_DOMNode as $text) {
$child_nodes = $text->childNodes;
for($n = $child_nodes->length-1; $n >= 0; --$n) {
$node = $child_nodes->item($n);
if (preg_match("/text/", $node->nodeValue)) {
echo $node->nodeValue;
$node->parentNode->removeChild($node);
} else {
echo $node->nodeValue;
}
}
}
That is, I go through the child nodes in reverse order, using a method suggested in another posting. In this way, all nodes are processed: The output is various sorts of mixed contentA sample text with. Note the reverse order of the text fragments. In my specific use case, this reversal does not matter because I am not actually echoing the text nodes, but performing another kind of operation on them.
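The loop stops early because childNodes is a live list: removing a node shifts the remaining ones while the foreach is still walking it. If the original order matters, an alternative to iterating in reverse is to copy the list into a plain array first. A minimal sketch of that idea (the $seen array exists only to show what was visited):

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<test><text>A sample text with <i>mixed content</i> of <b>various sorts</b></text></test>'
);

$seen = [];
foreach ($doc->getElementsByTagName('text') as $text) {
    // iterator_to_array() takes a snapshot, so removals no longer affect the loop.
    foreach (iterator_to_array($text->childNodes) as $node) {
        $seen[] = $node->nodeValue;
        if (preg_match('/text/', $node->nodeValue)) {
            $text->removeChild($node);
        }
    }
}

echo implode('', $seen), "\n";
```

All four child nodes are visited in document order, and only the first text node is removed.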

PHP Split XML based on multiple nodes

I honestly tried to find a solution for PHP, but the many similar-sounding threads either do not apply to my case or are for completely different languages.
I want to split an XML file based on nodes. Ideally multiple nodes, but of course one is enough and could be applied multiple times.
E.g. I want to split this by the tags <thingy> and <othernode>:
<root>
<stuff />
<thingy><othernode>one</othernode></thingy>
<thingy><othernode>two</othernode></thingy>
<thingy>
<othernode>three</othernode>
<othernode>four</othernode>
</thingy>
<some other data/>
</root>
Ideally I want to have 4 XML strings of the form:
<root>
<stuff />
<thingy><othernode>CONTENT</othernode></thingy>
<some other data/>
</root>
With CONTENT being one, two, three and four. Plot twist: CONTENT can also be a whole subtree. Of course it can all also be filled with various namespaces and tag prefixes (like <q1:node/>). Formatting is irrelevant for me.
I tried SimpleXML, but it lacks the possibility to write into the DOM easily.
I tried DOMDocument, but everything I do seems to destroy some links/relations between parent and child nodes in some way.
I tried XMLReader/XMLWriter, but that is extremely hard to maintain and combine (at least for me).
So far my best guess is something with DOMDocument, node cloning and removing everything but one node?
Interesting question.
If I get it right, it is a given that <othernode> is always a child of <thingy>, and each split places one <othernode> at the position of the first <thingy> in the original document.
DOMDocument appears useful in this case, as it makes it easy to move nodes around, including all of their children.
The split is driven by a node list (from getElementsByTagName()):
echo "---\n";
foreach ($split($doc->getElementsByTagName('othernode')) as $doc) {
echo $doc->saveXML(), "---\n";
}
The closure below first moves all <othernode> elements into a DOMDocumentFragment of their own, removing each <thingy> parent once it is empty (except the first one, which serves as the anchor). It then temporarily puts each element back under the anchor, one at a time, and yields the DOMDocument:
$split = static function (DOMNodeList $nodes): Generator {
while (($element = $nodes->item(0)) && $element instanceof DOMElement) {
$doc ??= $element->ownerDocument;
$basin ??= $doc->createDocumentFragment();
$anchor ??= $element->parentNode;
[$parent] = [$element->parentNode, $basin->appendChild($element)];
$parent->childElementCount || $parent === $anchor || $parent->parentNode->removeChild($parent);
}
if (empty($anchor)) {
return;
}
assert(isset($basin, $doc));
while ($element = $basin->childNodes->item(0)) {
$element = $anchor->appendChild($element);
yield $doc;
$anchor->removeChild($element);
}
};
This results in the following split:
---
<?xml version="1.0"?>
<root>
<stuff/>
<thingy><othernode>one</othernode></thingy>
<some other="data"/>
</root>
---
<?xml version="1.0"?>
<root>
<stuff/>
<thingy><othernode>two</othernode></thingy>
<some other="data"/>
</root>
---
<?xml version="1.0"?>
<root>
<stuff/>
<thingy><othernode>three</othernode></thingy>
<some other="data"/>
</root>
---
<?xml version="1.0"?>
<root>
<stuff/>
<thingy><othernode>four</othernode></thingy>
<some other="data"/>
</root>
---
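For comparison, the "clone and remove everything but one node" idea from the question can be sketched as follows. This is a simplified variant, not the generator above: splitByElement is a hypothetical helper, it copies the whole document once per match (which may be costly for large files), it keeps each <othernode> in its own <thingy> rather than in the first one, and it assumes the matched elements are not nested inside each other.

```php
<?php
// Returns one DOMDocument per $tagName element, each containing only that element.
function splitByElement(DOMDocument $doc, string $tagName): array
{
    $total = $doc->getElementsByTagName($tagName)->length;
    $parts = [];
    for ($keep = 0; $keep < $total; $keep++) {
        $copy = $doc->cloneNode(true);
        // Snapshot the live list so removals do not disturb the iteration.
        $nodes = iterator_to_array($copy->getElementsByTagName($tagName), false);
        foreach ($nodes as $index => $node) {
            if ($index === $keep) {
                continue;
            }
            $parent = $node->parentNode;
            $parent->removeChild($node);
            // Drop a wrapper element that no longer contains any element.
            if ($parent instanceof DOMElement
                && $parent->getElementsByTagName('*')->length === 0
                && $parent->parentNode !== null
            ) {
                $parent->parentNode->removeChild($parent);
            }
        }
        $parts[] = $copy;
    }
    return $parts;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<root><stuff/><thingy><othernode>one</othernode></thingy>'
    . '<thingy><othernode>two</othernode></thingy>'
    . '<thingy><othernode>three</othernode><othernode>four</othernode></thingy>'
    . '<some other="data"/></root>'
);

$parts = splitByElement($doc, 'othernode');
foreach ($parts as $part) {
    echo $part->saveXML($part->documentElement), "\n";
}
```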

XMLReader differentiating nested nodes with same name

I'm trying to work with an external XML file, which is structured like this:
<?xml version="1.0" encoding="UTF-8"?>
<merchandiser xsi:noNamespaceSchemaLocation="merchandiser.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<header>
<merchantId>44235</merchantId>
<merchantName>Feelunique (UK)</merchantName>
<createdOn>04/27/2020 00:05:33</createdOn>
</header>
<product part_number="99082" manufacturer_name="Sanctuary Spa" sku_number="99082" name="Sanctuary Spa Sleep Dream Easy Pillow Mist 100ml" product_id="15927186808">
<URL>
<product>https://click.linksynergy.com/link?id=y/LyuzvjryY&offerid=687217.15927186808&type=15&murl=https%3A%2F%2Fwww.feelunique.com%2Fp%2FSanctuary-Spa-Sleep-Dream-Easy-Pillow-Mist-100ml%26curr%3DGBP</product>
</URL>
</product>
</merchandiser>
As you can see the node <product> is used twice, and I need to grab an attribute from the first one, and the value in the second.
My code seems to jump straight to the second one by default and allows me to define the $xml->value of the second <product> node, but I can't seem to figure out how to separate the two in my code and get the attribute I need.
while($xml->read()) {
if($xml->nodeType == XMLReader::ELEMENT) {
if($xml->localName == 'header') {
$header = array();
}
if($xml->localName == 'merchantName') {
$xml->read();
$header['merchant'] = addslashes($xml->value);
}
if($xml->localName == 'product') {
$product = array();
$product['merchant'] = $header['merchant'];
$product['title'] = $xml->getAttribute('name');
}
if($xml->localName == 'product') {
$xml->read();
$product['link'] = $xml->value;
}
}
}
Can somebody point me in the right direction as to how I can get both values in my PHP code?
This isn't a complete solution, but just a demonstration of how to reach elements from each of the two product nodes - and you can modify it as needed:
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXpath($doc);
$product = $xpath->evaluate("//product/@name");
$link = $xpath->evaluate("//product//URL//product");
foreach ($product as $node1) {
foreach ($link as $node2){
echo trim($node2->nodeValue), PHP_EOL,trim($node1->nodeValue);
}}
Output:
https://click.linksynergy.com...
Sanctuary Spa Sleep Dream Easy Pillow Mist 100ml
XMLReader will just jump from node to node, and by the time you hit 'product', both of your if statements will evaluate to true.
The only way you can know which product node you are in is to retain information about its parent.
Doing this with one big loop will be a pain. It's probably better to start a new function after the level-1 product opens and create a new loop to parse the 'product' subtree.
I wrote a library to help with this.
XMLReader (and expat) can be a great tool to parse large XML documents fast, but you need to learn how to traverse nested structures effectively. If you find that too hard to grasp, I would recommend a simpler XML parser like DOM or SimpleXML.
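If you want to stay with a single XMLReader loop, one simple way to retain that context is the reader's depth property. The sketch below assumes the structure from the question, where the outer <product> is a direct child of the root (depth 1) and the inner one sits below <URL> (depth 3). The XML is a trimmed copy of the feed, and the shortened example.com URL is a placeholder.

```php
<?php
// Trimmed copy of the feed; the URL is a placeholder, not the real link.
$xml = <<<'XML'
<merchandiser>
  <header><merchantName>Feelunique (UK)</merchantName></header>
  <product name="Sanctuary Spa Sleep Dream Easy Pillow Mist 100ml" product_id="15927186808">
    <URL>
      <product>https://example.com/p/pillow-mist</product>
    </URL>
  </product>
</merchandiser>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$products = [];
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT || $reader->localName !== 'product') {
        continue;
    }
    if ($reader->depth === 1) {
        // Outer <product>: start a new record from its attributes.
        $products[] = ['title' => $reader->getAttribute('name'), 'link' => null];
    } elseif ($products !== []) {
        // Nested <product> (below <URL>): its text is the link of the current record.
        $products[count($products) - 1]['link'] = trim($reader->readString());
    }
}
$reader->close();

print_r($products);
```

For deeper or less regular documents, a stack of open element names is more robust than a fixed depth, at the cost of more bookkeeping.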

SimpleXML: handle CDATA tag presence in node value

I need to preserve the <![CDATA[]]> tag when I parse an XML document.
For example, I have node:
<Dest><![CDATA[some text...]]></Dest>
The XML file may also contain nodes without CDATA.
Then I process all the nodes in a loop:
$dom = simplexml_load_file($path);
foreach($dom->children() as $child) {
$nodeValue = (string) $child;
}
As a result, when I process the node in the example above, I get $nodeValue = some text...
But I need $nodeValue = <![CDATA[some text...]]>
Is there any way to do this?
File example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
<Params>
<param>text</param>
<anotherParam>text</anotherParam>
</Params>
<Content>
<String>
<Source>some another text</Source>
<Dest>some another text 2</Dest>
</String>
<String>
<Source>some another text 3</Source>
<Dest><![CDATA[some text...]]></Dest>
</String>
</Content>
</Root>
As far as a parser like SimpleXML is concerned, the <![CDATA[ is not part of the text content of the XML element; it's just part of the serialization of that content. A similar confusion is discussed here: PHP, SimpleXML, decoding entities in CDATA
What you need to look at is the "inner XML" of that element, which is tricky in SimpleXML (->asXML() will give you the "outer XML", e.g. <Dest><![CDATA[some text...]]></Dest>).
Your best bet here is to use the DOM, which gives you access to the detailed structure of the document rather than just its content, and so distinguishes "text nodes" from "CDATA nodes". However, it's worth double-checking that you do actually need this, as for 99.9% of use cases you shouldn't care whether somebody sent you <foo>bar &amp; baz</foo> or <foo><![CDATA[bar & baz]]></foo>, since by definition they represent the same string.
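As a rough illustration of the DOM approach, the "inner XML" can be rebuilt by serializing each child node separately. The innerXml function is a helper name invented for this sketch, and the XML is a shortened version of the file from the question.

```php
<?php
// Serializes the children of $element, keeping CDATA sections as written.
function innerXml(DOMElement $element): string
{
    $xml = '';
    foreach ($element->childNodes as $child) {
        $xml .= $element->ownerDocument->saveXML($child);
    }
    return $xml;
}

$doc = new DOMDocument();
// Do not pass LIBXML_NOCDATA here, or CDATA sections are merged into text nodes.
$doc->loadXML(
    '<Content><String><Source>some another text 3</Source>'
    . '<Dest><![CDATA[some text...]]></Dest></String></Content>'
);

$source = $doc->getElementsByTagName('Source')->item(0);
$dest = $doc->getElementsByTagName('Dest')->item(0);

echo innerXml($source), "\n";
echo innerXml($dest), "\n";

// The node type can also be checked directly.
var_dump($dest->firstChild instanceof DOMCdataSection);
```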
If you want to add CDATA to all elements that don't have it, you can do this:
$dom = simplexml_load_file($path);
foreach($dom->children() as $child) {
if(strpos((string) $child,'CDATA')){
$nodeValue = (string) $child;
}
else {
$nodeValue = "<![CDATA[".((string) $child)."]]>";
}
}
With that you will have $nodeValue = '<![CDATA[some text...]]>'
If you just want the elements that contain CDATA, you can do this:
$dom = simplexml_load_file($path);
foreach($dom->children() as $child) {
if(strpos((string) $child,'CDATA')){
$nodeValue = (string) $child;
}
}
With that you will have $nodeValue = '<![CDATA[some text...]]>'
If you want the elements without CDATA and want to add it, you can do this:
$dom = simplexml_load_file($path);
foreach($dom->children() as $child) {
if(!strpos((string) $child,'CDATA')){
$nodeValue = "<![CDATA[".((string) $child)."]]>";
}
}
With that you will have $nodeValue = '<![CDATA[some another text 3]]>'

getting the text content of a specific DOMElement

After a little hair-pulling, I discovered that DOMElement->textContent also returns the combined text from the children of that element.
Looking around a bit, I saw people suggesting DOMElement->firstChild->textContent, but this is no good for me: I'm walking through the document following the hierarchy and cues from element attributes, and the data is just as likely to be on a branch as on a leaf, so I would get multiple hits even though only one of them is the correct one.
Is there an actual way to get the text content of this one specific element and none of its children's?
EDIT: nvm, found a way to make sure
function get_text($el) {
if (is_a($el->firstChild, "DOMText")) return $el->firstChild->textContent;
return "";
}
Simply iterate over the child nodes and check whether each one is a text node. You
might want to skip the nodes consisting of only whitespace characters, though:
function getNodeText(DOMNode $node) {
if ($node->nodeType === XML_TEXT_NODE)
return $node->textContent;
$node = $node->firstChild;
while ($node) {
if ($node->nodeType === XML_TEXT_NODE &&
$text = trim($node->textContent))
{
return $text;
}
$node = $node->nextSibling;
}
return '';
}
$xml = <<<'EOXML'
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child>
<x>x text</x>
child text
</child>
root text
</root>
EOXML;
$doc = new DOMDocument();
$doc->loadXML($xml);
var_dump(getNodeText($doc->getElementsByTagName('x')[0]));
var_dump(getNodeText($doc->getElementsByTagName('root')[0]));
var_dump(getNodeText($doc->getElementsByTagName('child')[0]));
Sample output
string(6) "x text"
string(9) "root text"
string(10) "child text"
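Note that getNodeText returns only the first non-blank text node. If an element's own text is split around child elements and you want all of it, a variant along these lines may help. getOwnText is a hypothetical name, and joining the pieces with a single space is an arbitrary choice for this sketch.

```php
<?php
// Collects every direct text child of $element, skipping whitespace-only nodes.
function getOwnText(DOMElement $element): string
{
    $xpath = new DOMXPath($element->ownerDocument);
    $parts = [];
    // text() selects only direct text (and CDATA) children, not descendants.
    foreach ($xpath->query('text()', $element) as $textNode) {
        $trimmed = trim($textNode->nodeValue);
        if ($trimmed !== '') {
            $parts[] = $trimmed;
        }
    }
    return implode(' ', $parts);
}

$doc = new DOMDocument();
$doc->loadXML('<root><p>Hello <b>big</b> world</p></root>');

$ownText = getOwnText($doc->getElementsByTagName('p')->item(0));
var_dump($ownText);
```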
