Complex edit xml file

Complex edit xml file - php

For example, we have this xml:
<x>
<y>some text</y>
<y>[ID] hello</y>
<y>world [/ID]</y>
<y>some text</y>
<y>some text</y>
</x>
and we need to remove the markers "[ID]" and "[/ID]" and the text between them (which we don't know in advance when parsing), of course without damaging the XML formatting.
The only solution I can think of is this:
Find the text in the XML using a regex, for example: "/\[ID\].*?\[\/ID\]/". In our case, the result will be "[ID]hello</y><y>world[/ID]"
In the result from the previous step, find the text outside the XML tags using this regex:
"/(?<=^|>)[^><]+?(?=<|$)/", and delete this text. The result will be "</y><y>"
Apply the change to the original XML by doing something like this:
str_replace($step1string,$step2string,$xml);
Is this the correct way to do it?
I suspect that str_replace is not the best way to edit XML, so maybe you know a better solution?

Removing the specific string is simple:
<?php
$xml = '<x>
<y>some text</y>
<y>[ID] hello</y>
<y>world [/ID]</y>
<y>some text</y>
<y>some text</y>
</x>';
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[(contains(.,\'[ID]\') or contains(.,\'[/ID]\'))]') as $elm){
$elm->nodeValue = preg_replace('/\[\/?ID\]/','',$elm->nodeValue);
}
var_dump($d->saveXML());
?>
When only the text inside a specific tag needs to be removed, one could replace the preg_replace with these two:
$elm->nodeValue = preg_replace('/\[ID\].*$/','',$elm->nodeValue);
$elm->nodeValue = preg_replace('/^.*\[\/ID\]/','',$elm->nodeValue);
Resulting in for your example:
<x>
<y>some text</y>
<y></y>
<y></y>
<y>some text</y>
<y>some text</y>
</x>
However, removing the tags in between without damaging the well-formed XML is quite tricky. Before venturing into a lot of DOM operations, how would you like to handle the following cases?
An [/ID] higher in the DOM-tree:
<foo>[ID] foo
<bar> lorem [/ID] ipsum </bar>
</foo>
An [/ID] lower in the DOM-tree
<foo> foo
<bar> lorem [ID] ipsum </bar>
[/ID]
</foo>
And open/close spanning siblings, as per your example:
<foo> foo
<bar> lorem [ID] ipsum </bar>
<bar> lorem [/ID] ipsum </bar>
</foo>
And a real dealbreaker of a question: is nesting possible, is that nesting well formed, and what should it do?
<foo> foo
<bar> lo [ID] rem [ID] ipsum </bar>
<bar> lorem [/ID] ipsum </bar>
[/ID]
</foo>
Without further knowledge of how these cases should be handled, there is no real answer.
Edit: now that further information has been given, here is the actual, fail-safe solution (i.e. parse the XML, don't use regexes). It is rather long, but it should work in 99.99% of cases (personal typos and brainfarts excluded, of course :) ):
<?php
$xml = '<x>
<y>some text</y>
<y>
<a> something </a>
well [ID] hello
<a> and then some</a>
</y>
<y>some text</y>
<x>
world
<a> also </a>
foobar [/ID] something
<a> these nodes </a>
</x>
<y>some text</y>
<y>some text</y>
</x>';
echo $xml;
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
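foreach($x->query('//text()[contains(.,\'[ID]\')]') as $elm){
//note: strpos() returns byte offsets while the DOM data methods count characters; use mb_strpos()/mb_strlen() for non-ASCII text
$startpos = strpos($elm->nodeValue,'[ID]');
//if this node also contains [/ID], replace and be done:
if(($endpos = strpos($elm->nodeValue,'[/ID]',$startpos))!==false){
$elm->replaceData($startpos, $endpos-$startpos + 5,'');
continue;
}
//no [/ID] in this node: truncate the text to what comes before [ID]
$elm->deleteData($startpos, strlen($elm->nodeValue)-$startpos);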
//delete all following siblings of this textnode until a textnode having [/ID] is found
while($next = $elm->nextSibling){
if(!($next instanceof DOMText) || ($pos = strpos($next->nodeValue,'[/ID]'))===false){
$elm->parentNode->removeChild($next);
} else {
//[/ID] found in same element: keep only the text after it and go to next [ID]
$next->deleteData(0, $pos+5);
continue 2;
}
}
//siblings of textnode deleted, string truncated to before [ID], now let's delete intermediate nodes
while($sibling = $elm->parentNode->nextSibling){ // in case of example: other <y> elements:
//loop though childnodes and search a textnode with [/ID]
while($child = $sibling->firstChild){
//delete if not a textnode
if(!($child instanceof DOMText)){
$sibling->removeChild($child);
continue;
}
//we have text, check for [/ID]
if(($pos = strpos($child->nodeValue,'[/ID]'))!==false){
//add remaining text in textnode:
$elm->appendData(substr($child->nodeValue,$pos+5));
//remove current textnode with match:
$sibling->removeChild($child);
//sanity check: [ID] was in <y>, is [/ID]?
if($sibling->tagName != $elm->parentNode->tagName){
trigger_error('[/ID] found in a different tag than [ID]: '.$sibling->tagName.'<>'.$elm->parentNode->tagName, E_USER_NOTICE);
}
//add remaining childs of sibling to parent of [ID]:
while($sibling->firstChild){
$elm->parentNode->appendChild($sibling->firstChild);
}
//delete the sibling that was found to hold [/ID]
$sibling->parentNode->removeChild($sibling);
//done: end both whiles
break 2;
}
//textnode, but no [/ID], so remove:
$sibling->removeChild($child);
}
//no child, no text, so no [/ID], remove:
$elm->parentNode->parentNode->removeChild($sibling);
}
}
var_dump($d->saveXML());
?>

For your entertainment and edification, you may want to read this: RegEx match open tags except XHTML self-contained tags
The "correct" solution is to use an XML library and search through the nodes to perform the operation. However, it would probably be much easier to just use a str_replace, even if there's a chance of damaging the XML formatting. You have to gauge the likelihood of receiving something like <a href="[ID]"> and the importance of defending against such cases, and weigh those factors against development time.
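For what it's worth, the quick string-based route could look roughly like the sketch below. It assumes the markers never show up inside attributes, comments or CDATA sections, and it adds the "s" modifier so that "." also matches the line breaks between elements. Everything between the markers is dropped except the tags themselves, which keeps the open/close pairs balanced for the sibling case from the question.

```php
<?php
$xml = "<x>\n<y>some text</y>\n<y>[ID] hello</y>\n<y>world [/ID]</y>\n<y>some text</y>\n</x>";

// For each [ID]...[/ID] span, keep only the tags found inside it.
$clean = preg_replace_callback(
    '/\[ID\].*?\[\/ID\]/s',
    static function (array $m): string {
        preg_match_all('/<[^>]+>/', $m[0], $tags);
        return implode('', $tags[0]);
    },
    $xml
);

echo $clean, "\n";
```

Whether this is acceptable depends on how much you trust the input; if a marker can land inside a tag, the DOM-based approach is the safer choice.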

The only other option I can think of is if you could format the xml differently.
<x>
<y>
<z>[ID]</z>

Related

Convert xml to html with emphasis in php

I have an XML file that contains the following content.
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE article>
<article
xmlns="http://docbook.org/ns/docbook" version="5.0"
xmlns:xlink="http://www.w3.org/1999/xlink" >
<para>
This is an <emphasis role="strong">test</emphasis> sentence.
</para>
</article>
When I use
$xml_data = simplexml_load_string($filedata);
foreach ($xml_data->para as $data) {
echo $data;
}
I get "This is an sentence.", but I want "This is an <b>test</b> sentence." as the result.

Instead of simplexml_load_string I'd recommend DOMDocument, but that is just a personal preference. A naïve implementation might just do a string replacement, and that might totally work for you. However, since you've provided actual XML that even includes a namespace, I'm going to try to keep this as XML-centric as possible, while skipping XPath, which could possibly be used, too.
This code loads the XML and walks every node. If it finds a <para> element, it walks all of the children of that node looking for an <emphasis> node, and if it finds one, it replaces it with a new node that is a <b> node.
The replacement process is a little complex, however, because if we just use nodeValue we might lose any HTML that lives in there, so we need to walk the children of the <emphasis> node and clone those into our replacement node.
Because the source document has a namespace, however, we also need to remove that from our final HTML. Since we are going from XML to HTML, I think that is a safe usage of str_replace without going too crazy in XML land for that.
The code should have enough comments to make sense, hopefully.
<?php
$filedata = <<<EOT
<?xml version="1.0" encoding="utf-8" ?>
<article
xmlns="http://docbook.org/ns/docbook" version="5.0"
xmlns:xlink="http://www.w3.org/1999/xlink" >
<para>
This is an <emphasis role="strong">hello <em>world</em></emphasis> sentence.
</para>
</article>
EOT;
$dom = new DOMDocument();
$dom->loadXML($filedata);
foreach($dom->documentElement->childNodes as $node){
if(XML_ELEMENT_NODE === $node->nodeType && 'para' === $node->nodeName){
// Replace any emphasis elements
foreach($node->childNodes as $childNode) {
if(XML_ELEMENT_NODE === $childNode->nodeType && 'emphasis' === $childNode->nodeName){
// This is arguably the most "correct" way to replace, just in case
// there's extra nodes inside. A cheaper way would be to not loop
// and just use the nodeValue however you might lose some HTML.
$newNode = $dom->createElement('b');
foreach($childNode->childNodes as $grandChild){
$newNode->appendChild($grandChild->cloneNode(true));
}
$childNode->replaceWith($newNode);
}
}
// Build our output
$output = '';
foreach($node->childNodes as $childNode) {
$output .= $dom->saveHTML($childNode);
}
// The provided XML has a namespace, and when cloning nodes that NS comes
// along. Since we are going from regular XML to irregular HTML I think
// a string replacement is best.
$output = str_replace(' xmlns="http://docbook.org/ns/docbook"', '', $output);
echo $output;
}
}
Demo here: https://3v4l.org/04Tc3#v8.0.23
NOTE: PHP 8 added replaceWith. If you are using PHP 7 or less you'd use replaceChild and just play around with things a bit.
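For PHP 7, a replaceChild-based variant might look like the following sketch. It is not the code from the demo: the sample document is shortened and has no namespace, the children are moved rather than cloned, and the node list is copied into an array first because getElementsByTagName() returns a live list.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<para>This is an <emphasis role="strong">hello <em>world</em></emphasis> sentence.</para>');

// Snapshot the live node list before modifying the tree.
$targets = iterator_to_array($dom->getElementsByTagName('emphasis'), false);
foreach ($targets as $old) {
    $new = $dom->createElement('b');
    // appendChild() moves each child out of <emphasis> into <b>.
    while ($old->firstChild) {
        $new->appendChild($old->firstChild);
    }
    // replaceChild() takes the new node first, then the node to replace.
    $old->parentNode->replaceChild($new, $old);
}

// Build the output from the children of <para>.
$html = '';
foreach ($dom->documentElement->childNodes as $child) {
    $html .= $dom->saveHTML($child);
}
echo $html, "\n";
```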

What if you have the following XML?
<entry>
<para>This is the first text</para>
<emphasis>This is the second text</emphasis>
<para>This is the <emphasis>next</emphasis> text</para>
<itemizedlist>
<listitem>
<para>
This is an paragraph inside a list
</para>
</listitem>
<itemizedlist>
<listitem>
<para>
This is an paragraph inside a list inside a list
</para>
</listitem>
</itemizedlist>
</itemizedlist>
</entry>
using
if(XML_ELEMENT_NODE === $stuff2->nodeType && 'para' === $stuff2->nodeName){
$newNode = $dom->createElement('p');
foreach($stuff2->childNodes as $grandChild){
$newNode->appendChild($grandChild->cloneNode(true));
}
$stuff2->replaceWith($newNode);
}
if (XML_ELEMENT_NODE === $stuff2->nodeType && 'itemizedlist' === $stuff2->nodeName) {
$newNode = $dom->createElement('ul');
foreach($stuff2->childNodes as $grandChild){
$newNode->appendChild($grandChild->cloneNode(true));
}
$stuff2->replaceWith($newNode);
}
if(XML_ELEMENT_NODE === $stuff2->nodeType && 'emphasis' === $stuff2->nodeName){
$newNode = $dom->createElement('b');
foreach($stuff2->childNodes as $grandChild){
$newNode->appendChild($grandChild->cloneNode(true));
}
$stuff2->replaceWith($newNode);
}
if (XML_ELEMENT_NODE === $stuff2->nodeType && 'listitem' === $stuff2->nodeName) {
$newNode = $dom->createElement('li');
foreach($stuff2->childNodes as $grandChild){
$newNode->appendChild($grandChild->cloneNode(true));
}
$stuff2->replaceWith($newNode);
}
only results in
<p>This is the first text</p>
<emphasis>This is the second text</emphasis>
<para>This is the <emphasis>next</emphasis> text</para>
<itemizedlist>
<listitem>
<para>This is an paragraph inside a list</para>
</listitem>
<itemizedlist>
<listitem>
<para>This is an paragraph inside a list inside a list</para>
</listitem>
</itemizedlist>
</itemizedlist>
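A possible explanation, offered as an assumption because the surrounding loop is not shown: the checks only run against the nodes the loop actually visits, so nested elements are never reached, and replacing nodes while iterating a live childNodes list can end the loop early. A recursive rebuild with a tag map sidesteps both problems. The $map array and the convertNode() function are made up for this sketch, and the sample XML is shortened.

```php
<?php
// Hypothetical mapping from source tags to HTML tags; extend as needed.
$map = ['para' => 'p', 'emphasis' => 'b', 'itemizedlist' => 'ul', 'listitem' => 'li'];

// Rebuild $node (and everything below it) inside the output document $out.
function convertNode(DOMNode $node, DOMDocument $out, array $map): ?DOMNode
{
    if ($node instanceof DOMText) {
        return $out->createTextNode($node->nodeValue);
    }
    if (!($node instanceof DOMElement)) {
        return null; // skip comments, processing instructions, etc.
    }
    // localName ignores any namespace prefix; unmapped tags keep their name.
    $new = $out->createElement($map[$node->localName] ?? $node->localName);
    foreach ($node->childNodes as $child) {
        $converted = convertNode($child, $out, $map);
        if ($converted !== null) {
            $new->appendChild($converted);
        }
    }
    return $new;
}

$src = new DOMDocument();
$src->loadXML(
    '<entry><para>This is the <emphasis>next</emphasis> text</para>'
    . '<itemizedlist><listitem><para>In a list</para></listitem></itemizedlist></entry>'
);

$out = new DOMDocument();
$html = '';
foreach ($src->documentElement->childNodes as $child) {
    $converted = convertNode($child, $out, $map);
    if ($converted !== null) {
        $html .= $out->saveHTML($converted);
    }
}
echo $html, "\n";
```

Because the source tree is only read and never modified, the iteration order cannot be disturbed by the replacements.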

PHP Split XML based on multiple nodes

I honestly tried to find a solution for php, but a lot of threads sound similar, but are not applicable for me or are for completely different languages.
I want to split an xml file based on nodes. Ideally multiple nodes, but of course one is enough and could be applied multiple times.
e.g. I want to split this by the tag <thingy> and <othernode>:
<root>
<stuff />
<thingy><othernode>one</othernode></thingy>
<thingy><othernode>two</othernode></thingy>
<thingy>
<othernode>three</othernode>
<othernode>four</othernode>
</thingy>
<some other data/>
</root>
Ideally I want to have 4 xmlstrings of type:
<root>
<stuff />
<thingy><othernode>CONTENT</othernode></thingy>
<some other data/>
</root>
With CONTENT being one, two, three and four. Plot twist: CONTENT can also be a whole subtree. Of course, it can all also be filled with various namespaces and tag prefixes (like <q1:node/>). Formatting is irrelevant for me.
I tried SimpleXML, but it lacks the possibility to write into the DOM easily.
I tried DOMDocument, but everything I do seems to destroy some links/relations between parent and child nodes in some way.
I tried XMLReader/XMLWriter, but that is extremely hard to maintain and combine (at least for me).
So far my best guess is something with DOMDocument, node cloning and removing everything but one node?

Interesting question.
If I get it right, <othernode> is always a child of <thingy>, and there is one split per <othernode>, each placed at the position of the first <thingy> in the original document.
DOMDocument is useful in this case, as it makes it easy to move nodes around, including all their children.
The split is driven by a node list (from getElementsByTagName()):
echo "---\n";
foreach ($split($doc->getElementsByTagName('othernode')) as $doc) {
echo $doc->saveXML(), "---\n";
}
The $split function moves all <othernode> elements into a DOMDocumentFragment of their own, removing each <thingy> parent once it is emptied (except the first one, which serves as the anchor), and then temporarily puts each element back into the DOMDocument, one at a time:
$split = static function (DOMNodeList $nodes): Generator {
while (($element = $nodes->item(0)) && $element instanceof DOMElement) {
$doc ??= $element->ownerDocument;
$basin ??= $doc->createDocumentFragment();
$anchor ??= $element->parentNode;
[$parent] = [$element->parentNode, $basin->appendChild($element)];
$parent->childElementCount || $parent === $anchor || $parent->parentNode->removeChild($parent);
}
if (empty($anchor)) {
return;
}
assert(isset($basin, $doc));
while ($element = $basin->childNodes->item(0)) {
$element = $anchor->appendChild($element);
yield $doc;
$anchor->removeChild($element);
}
};
This results in the following split:
---
<?xml version="1.0"?>
<root>
<stuff/>
<thingy><othernode>one</othernode></thingy>
<some other="data"/>
</root>
---
<?xml version="1.0"?>
<root>
<stuff/>
<thingy><othernode>two</othernode></thingy>
<some other="data"/>
</root>
---
<?xml version="1.0"?>
<root>
<stuff/>
<thingy><othernode>three</othernode></thingy>
<some other="data"/>
</root>
---
<?xml version="1.0"?>
<root>
<stuff/>
<thingy><othernode>four</othernode></thingy>
<some other="data"/>
</root>
---
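The "clone and remove everything but one node" idea from the question can also be sketched directly. This is a simpler but slower alternative to the generator above, not a rewrite of it: it re-parses the source once per <othernode>, and it leaves the kept node in its original <thingy> instead of moving it to the first one. The sample XML is shortened and the <tail/> element is invented for the example; it also assumes <othernode> elements are not nested inside each other.

```php
<?php
$xml = '<root><stuff/><thingy><othernode>one</othernode></thingy>'
     . '<thingy><othernode>two</othernode><othernode>three</othernode></thingy><tail/></root>';

$source = new DOMDocument();
$source->loadXML($xml);
$count = $source->getElementsByTagName('othernode')->length;

$parts = [];
for ($keep = 0; $keep < $count; $keep++) {
    // Fresh copy of the whole document for every split.
    $copy = new DOMDocument();
    $copy->loadXML($xml);

    // Snapshot the live list, then remove every <othernode> except one.
    $nodes = iterator_to_array($copy->getElementsByTagName('othernode'), false);
    foreach ($nodes as $i => $node) {
        if ($i === $keep) {
            continue;
        }
        $parent = $node->parentNode;
        $parent->removeChild($node);
        // Drop a <thingy> that has no element children left.
        if ($parent->getElementsByTagName('*')->length === 0) {
            $parent->parentNode->removeChild($parent);
        }
    }
    $parts[] = $copy->saveXML($copy->documentElement);
}

foreach ($parts as $part) {
    echo $part, "\n---\n";
}
```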

XML 2 Array Markup in Text Issue [duplicate]

This question already has answers here:
Getting the text portion of a node using php Simple XML
(7 answers)
Closed 3 years ago.
I'm struggling with the following problem. I am trying to convert an XML document to an array in PHP, which is working fine so far. But I do have some special elements which contain text with markup in it. The elements look something like this:
<section>
<name>sectionname</name>
<subsection>
<subsectionname>one</subsectionname>
<element>
<text>some text <xref>a</xref>, <xref>b</xref>, <xref>c</xref></text>
</element>
</subsection>
<subsection>
<subsectionname>two</subsectionname>
<element>
<text>some text <xref>a</xref>, <xref>b</xref>, <xref>c</xref></text>
</element>
</subsection>
</section>
I tried to work with simplexml in the first place:
$xml = simplexml_load_string($string) or die("Error: Cannot create object");
$json = json_encode($xml);
$array = json_decode($json, TRUE);
but this returns an element containing "some text , , and some more" without the content of xref. What I actually want is the whole text "some text a, b, c and some more", but I am afraid I do not know how to achieve this.
I already gave DOMDocument a shot, but had problems there, as it is quite a complex XML document.
Any ideas how I could get what I want?
EDIT: I've added a more complex example of the xml. As you can see I would need to traverse over sections, then subsections and in there, the elements with markup and text.

The problem with SimpleXML is that it tends to group text nodes into one lump. To get the properly split text, you tend to have to use DOMDocument.
The code below loads the document and then uses XPath to find the <text> elements inside <element> (this is just to get to the right point; you can use getElementsByTagName() if you wish). Inside each of those it uses XPath again to find all of the text nodes (using descendant::text()), which yields each piece of text in document order.
For each <text> element this creates a blank $text string, adds the content of every text node to it in the loop and then displays it...
$data = '<section>
<name>sectionname</name>
<subsection>
<subsectionname>one</subsectionname>
<element>
<text>some text <xref>a</xref>, <xref>b</xref>, <xref>c</xref></text>
</element>
</subsection>
<subsection>
<subsectionname>two</subsectionname>
<element>
<text>some text <xref>a</xref>, <xref>b</xref>, <xref>c</xref>d</text>
</element>
</subsection>
</section>';
$dom = new DOMDocument();
$dom->loadXML($data);
$xp = new DOMXPath($dom);
foreach ( $xp->query("//element/text") as $element ) {
$text = '';
foreach ( $xp->query("descendant::text()", $element) as $textNode ) {
$text .= $textNode->textContent;
}
echo $text.PHP_EOL;
}
This displays (I modified the second one to help)...
some text a, b, c
some text a, b, cd
Edit:
As ThW points out, using textContent will fetch all of the text including the child nodes, so you can shorten the inner loop to
foreach ( $xp->query("//element/text") as $element ) {
echo $element->textContent.PHP_EOL;
}
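Since the goal was an array, the same textContent idea could be extended to walk sections and subsections, roughly as sketched here. The shape of the $result array is an assumption made for this example (the question does not say what the array should look like), and the sample data is shortened.

```php
<?php
$data = '<section><name>sectionname</name>'
      . '<subsection><subsectionname>one</subsectionname>'
      . '<element><text>some text <xref>a</xref>, <xref>b</xref></text></element></subsection>'
      . '<subsection><subsectionname>two</subsectionname>'
      . '<element><text>more <xref>c</xref> text</text></element></subsection>'
      . '</section>';

$dom = new DOMDocument();
$dom->loadXML($data);
$xp = new DOMXPath($dom);

// string(...) returns the text of the first matching node as a PHP string.
$result = ['name' => $xp->evaluate('string(/section/name)'), 'subsections' => []];

foreach ($xp->query('/section/subsection') as $sub) {
    $texts = [];
    // Relative path, evaluated with the current <subsection> as context node.
    foreach ($xp->query('element/text', $sub) as $text) {
        $texts[] = $text->textContent; // includes the text of nested <xref> elements
    }
    $result['subsections'][$xp->evaluate('string(subsectionname)', $sub)] = $texts;
}

print_r($result);
```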

It is quite easy to use DOMDocument - if I understood the question correctly you could try like this ~ though as there is only a small snippet of XML this might be wide of the mark
<?php
$strxml='<?xml version="1.0" encoding="UTF-8"?>
<root>
<element>
<text>some text <xref>a</xref>, <xref>b</xref>, <xref>c</xref> and some more</text>
</element>
<element>
<text>a banana <xref>FFF</xref>, <xref>GGG</xref>, <xref>ZZZ</xref> and some more bananas</text>
</element>
</root>';
$dom=new DOMDocument;
$dom->loadXML( $strxml );
$col=$dom->getElementsByTagName('element');
$output=array();
foreach( $col as $node )$output[]=$node->childNodes[1]->nodeValue;
printf('<pre>%s</pre>',print_r( $output, true ) );
?>
Will output
Array
(
[0] => some text a, b, c and some more
[1] => a banana FFF, GGG, ZZZ and some more bananas
)

Prepending raw XML using PHP's SimpleXML

Given a base $xml and a file containing a <something> tag with attributes, children and children of its children, I would like to append it as first child and all of its children as raw XML.
Original XML:
<root>
<people>
<person>
<name>John Doe</name>
<age>47</age>
</person>
<person>
<name>James Johnson</name>
<age>13</age>
</person>
</people>
</root>
XML in file:
<something someval="x" otherthing="y">
<child attr="val" ..> { some children and values ... }</child>
<child attr="val2" ..> { some children and values ... }</child>
...
</something>
Result XML:
<root>
<something someval="x" otherthing="y">
<child attr="val" ..> { some children and values ... }</child>
<child attr="val2" ..> { some children and values ... }</child>
...
</something>
<people>
<person>
<name>John Doe</name>
<age>47</age>
</person>
<person>
<name>James Johnson</name>
<age>13</age>
</person>
</people>
</root>
This tag would contain several children both direct and recursively, so it would not be practical to build the XML via the SimpleXML operations. Besides, keeping it in a file would result in lower maintenance costs.
Technically it would simply be prepending one child. The problem is that this child would have other children and so on.
On the PHP addChild page there's a comment that says:
$x = new SimpleXMLElement('<root name="toplevel"></root>');
$f1 = new SimpleXMLElement('<child pos="1">alpha</child>');
$x->{$f1->getName()} = $f1; // adds $f1 to $x
However, this does not seem to treat my XML as raw XML therefore causing < and > escaped tags to appear. Several warnings concerning namespaces seem to appear as well.
I suppose I could do a quick replace of such tags but I am not sure whether it could cause future problems and it certainly does not feel right.
Manually hacking the XML is not an option and neither is adding children one by one. Choosing a different library could be.
Any clues on how to get this working?
Thanks!

I'm really not sure if this will work, so try it (or downvote it), but I hope it helps. It uses DOMDocument (Reference). Note that createElement() expects a tag name rather than raw XML, so the snippet is parsed into a document fragment instead:
<?php
$xml = new DOMDocument();
$xml->loadXML($yourOriginalXML);
$newNode = $xml->createDocumentFragment();
$newNode->appendXML($someXMLtoPrepend);
$nodeRoot = $xml->getElementsByTagName('root')->item(0);
$nodeOriginal = $xml->getElementsByTagName('people')->item(0);
$nodeRoot->insertBefore($newNode,$nodeOriginal);
$finalXmlAsString = $xml->saveXML();
?>
Sometimes UTF-8 can cause problems; in that case, try encoding the non-ASCII characters as numeric character references first:
<?php
$map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
$xml = new DOMDocument();
$xml->loadXML(mb_encode_numericentity($yourOriginalXML, $map, 'UTF-8'));
$newNode = $xml->createDocumentFragment();
$newNode->appendXML(mb_encode_numericentity($someXMLtoPrepend, $map, 'UTF-8'));
$nodeRoot = $xml->getElementsByTagName('root')->item(0);
$nodeOriginal = $xml->getElementsByTagName('people')->item(0);
$nodeRoot->insertBefore($newNode,$nodeOriginal);
$finalXmlAsString = $xml->saveXML();
?>
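Since the question starts from SimpleXML, here is a sketch of how the raw snippet might be prepended without leaving SimpleXML for good. It relies on dom_import_simplexml(), which exposes the same underlying tree as a DOMElement, and on DOMDocumentFragment::appendXML() to parse the raw XML. The sample strings are shortened stand-ins for the real base document and file contents.

```php
<?php
$sx  = new SimpleXMLElement('<root><people><person><name>John Doe</name></person></people></root>');
$raw = '<something someval="x"><child attr="val">text</child></something>';

// DOM view of the same tree; changes made here are visible through $sx.
$rootDom = dom_import_simplexml($sx);

$fragment = $rootDom->ownerDocument->createDocumentFragment();
if (!$fragment->appendXML($raw)) {
    throw new RuntimeException('The snippet is not well-formed XML.');
}

// Insert the parsed snippet before the current first child of <root>.
$rootDom->insertBefore($fragment, $rootDom->firstChild);

echo $sx->asXML();
```

The raw snippet should not carry its own XML declaration, otherwise appendXML() will reject it.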

XPath Node to String

How can I select the string contents of the following nodes:
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
I have tried a few things
//span/text()
Doesn't get the bold tag
//span/string(.)
is invalid
string(//span)
only selects 1 node
I am using simple_xml in php and the only other option I think is to use //span which returns:
Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => url
)
[b] => test
)
[1] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => url
)
[b] => test2
)
)
*note that it is also dropping the "more words" text from the second span.
So I guess I could then flatten the item in the array using php some how? Xpath is preferred, but any other ideas would help too.

$xml = '<foo>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</foo>';
$dom = new DOMDocument();
$dom->loadXML($xml); //or load an HTML document with loadHTML()
$x= new DOMXpath($dom);
foreach($x->query("//span[@class='url']") as $node) echo $node->textContent;

You don't even need XPath for this:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('span') as $span) {
if(in_array('url', explode(' ', $span->getAttribute('class')))) {
$span->nodeValue = $span->textContent;
}
}
echo $dom->saveHTML();
EDIT after comment below
If you just want to fetch the string, you can do echo $span->textContent; instead of replacing the nodeValue. I understood you wanted to have one string for the span, instead of the nested structure. In this case, you should also consider whether simply running strip_tags on the span snippet wouldn't be the faster and easier alternative.
With PHP 5.3 you can also register arbitrary PHP functions for use as callbacks in XPath queries. The following would fetch the content of all span elements and their child nodes and return it as a single string.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions();
echo $xp->evaluate('php:function("nodeTextJoin", //span)');
// Custom Callback function
function nodeTextJoin($nodes)
{
$text = '';
foreach($nodes as $node) {
$text .= $node->textContent;
}
return $text;
}

Using XMLReader:
$xmlr = new XMLReader;
$xmlr->xml($doc);
while ($xmlr->read()) {
if (($xmlr->nodeType == XmlReader::ELEMENT) && ($xmlr->name == 'span')) {
echo $xmlr->readString();
}
}
Output:
word
test
word
test2
more words

SimpleXML doesn't like mixing text nodes with other elements, that's why you're losing some content there. The DOM extension, however, handles that just fine. Luckily, DOM and SimpleXML are two faces of the same coin (libxml) so it's very easy to juggle them. For instance:
foreach ($yourSimpleXMLElement->xpath('//span') as $span)
{
// will not work as expected
echo $span;
// will work as expected
echo textContent($span);
}
function textContent(SimpleXMLElement $node)
{
return dom_import_simplexml($node)->textContent;
}

//span//text()
This may be the best you can do. You'll get multiple text nodes because the text is stored in separate nodes in the DOM. If you want a single string you'll have to just concatenate the text nodes yourself since I can't think of a way to get the built-in XPath functions to do it.
Using string() or concat() won't work because these functions expect string arguments. When you pass a node-set to a function expecting a string, the node-set is converted to a string by taking the text content of the first node in the node-set. The rest of the nodes are discarded.
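A small sketch of that behaviour with PHP's DOMXPath, assuming the two spans from the question wrapped in a made-up <foo> root: string(//span) only sees the first node, while evaluating normalize-space(.) once per span gives one string for each.

```php
<?php
$xml = '<foo><span class="url"> word <b class=" ">test</b> </span>'
     . '<span class="url"> word <b class=" ">test2</b> more words </span></foo>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xp = new DOMXPath($dom);

// The node-set is converted to a string using its first node only.
$first = $xp->evaluate('normalize-space(string(//span))');

// Evaluating per node, with each span as the context node, covers all of them.
$all = [];
foreach ($xp->query('//span') as $span) {
    $all[] = $xp->evaluate('normalize-space(.)', $span);
}

var_dump($first);
print_r($all);
```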

How can I select the string contents
of the following nodes:
First, I think your question is not clear.
You could select the descendant text nodes, as John Kugelman has answered, with
//span//text()
I recommend using an absolute path (not starting with //).
But with this you would need to process the text nodes to find out which parent span each one belongs to. So it would be better to just select the span elements (for example, //span) and then process their string values.
With XPath 2.0 you could use:
string-join(//span, '.')
Result:
word test. word test2 more words
With XSLT 1.0, this input:
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
With this stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="span[@class='url']">
<xsl:value-of select="concat(substring('.',1,position()-1),normalize-space(.))"/>
</xsl:template>
</xsl:stylesheet>
Output:
word test.word test2 more words

Along the lines of Alejandro's XSLT 1.0 "but any other ideas would help too" answer...
XML:
<?xml version="1.0" encoding="UTF-8"?>
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="span">
<xsl:value-of select="normalize-space(data(.))"/>
</xsl:template>
</xsl:stylesheet>
OUTPUT:
word test
word test2 more words
