detecting nodes in xml - php

for example, we have this xml:
<p>[TAG]
<span>foo 1</span>
<span>foo 2</span>
[/TAG]
<span>bar 1</span>
<span>bar 2</span>
</p>
How can I detect the <span> tags between the words [TAG] and [/TAG] ("foo 1" and "foo 2" in this case)?
UPD: for example, I need to change the nodeValue of each span between [TAG] and [/TAG].

Assuming that you only have one set of [TAG]..[/TAG] per node (as in, if your document has two sets they're within separate <p> elements or whatever), and that they're always siblings:
You can use preceding-sibling and following-sibling to select only elements which are preceded by a [TAG] text node and followed by a [/TAG] text node:
//span[preceding-sibling::text()[normalize-space(.) = "[TAG]"]][following-sibling::text()[normalize-space(.) = "[/TAG]"]]
A full PHP example:
$doc = new DOMDocument();
// test.xml is XML, so parse it with load() rather than the HTML parser
$doc->load('test.xml');
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//span[preceding-sibling::text()[normalize-space(.) = "[TAG]"]][following-sibling::text()[normalize-space(.) = "[/TAG]"]]') as $el) {
    $el->nodeValue = 'Changed!';
}
echo $doc->saveXML();
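To check which elements the expression actually selects, here is a minimal self-contained sketch. It assumes the sample document from the question is well-formed XML and inlines it as a string instead of reading a file; the variable names $query and $matched are only illustrative.

```php
<?php
// Inline copy of the sample document from the question (assumed well-formed).
$xml = <<<'XML'
<p>[TAG]
<span>foo 1</span>
<span>foo 2</span>
[/TAG]
<span>bar 1</span>
<span>bar 2</span>
</p>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Spans that have a "[TAG]" text sibling before them
// and a "[/TAG]" text sibling after them.
$query = '//span[preceding-sibling::text()[normalize-space(.) = "[TAG]"]]'
       . '[following-sibling::text()[normalize-space(.) = "[/TAG]"]]';

// Collect the text of every matched span.
$matched = [];
foreach ($xpath->query($query) as $el) {
    $matched[] = $el->nodeValue;
}
print_r($matched); // contains "foo 1" and "foo 2" only
```

The "bar" spans are skipped because no text sibling after them normalizes to "[/TAG]".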


How to add separators between different numbers of children node values, for each parent node in Xpath

Any ideas if this is possible in PHP DOMXPath: I have for example 3 parent spans, and random number of children within each. First I was collecting data like this:
"//span[@class='parent']" and getting the node value, which was something like "Child TextChildText2Child Text 3" for the first item.
However I'm trying to get something like "Child Text,ChildText2,Child Text 3" with the comma separator.
Any ideas? I want to be able to identify which children belong to which parent as at the same time I am collecting other data within the parents:
<span class="parent">Parent 1
<span class="child">Child Text</span>
<span class="child">ChildText2</span>
<span class="child">Child Text 3</span>
</span>
<span class="parent">Parent 2
<span class="child">Child Text4</span>
<span class="child">ChildText5</span>
</span>
<span class="parent">Parent 3
<span class="child">Child Text6</span>
<span class="child">Child Text7</span>
<span class="child">ChildText8</span>
<span class="child">Child Text 9</span>
</span>
The following PHP is what I am currently using:
$array = [];
$result = $xpath->query("//span[@class='parent']");
for ($x = 0; $x < $result->length; $x++) {
    $array[$x]['children'] = trim($result->item($x)->nodeValue);
}
I'd do it in PHP directly instead of trying to get too fancy with the xpath itself. Just throw the node values into an array and implode() it with a comma as the separator.
Example:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//span[@class="parent"]') as $parent) {
    $childText = [];
    // the second argument makes the query relative to the current parent
    foreach ($xpath->query('span[@class="child"]', $parent) as $child) {
        $childText[] = trim($child->nodeValue);
    }
    echo implode(',', $childText), "\n";
}
Output:
Child Text,ChildText2,Child Text 3
Child Text4,ChildText5
Child Text6,Child Text7,ChildText8,Child Text 9
Using XPath 2.0 or later (PHP's DOMXPath only supports XPath 1.0, so this needs a different processor):
'string-join(//span[@class="parent"]/span, ",")'
Output:
Child Text,ChildText2,Child Text 3,Child Text4,ChildText5,Child Text6,Child Text7,ChildText8,Child Text 9
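Since the question also wants to keep track of which children belong to which parent, the implode() approach can be combined with the array from the question. This is only a sketch: the array keys 'parent' and 'children' and the variable $rows are made up for illustration, and the HTML is a shortened copy of the sample.

```php
<?php
// Shortened copy of the sample markup from the question.
$html = <<<'HTML'
<span class="parent">Parent 1
<span class="child">Child Text</span>
<span class="child">ChildText2</span>
<span class="child">Child Text 3</span>
</span>
<span class="parent">Parent 2
<span class="child">Child Text4</span>
<span class="child">ChildText5</span>
</span>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->query('//span[@class="parent"]') as $i => $parent) {
    // Collect the child texts of this parent only.
    $children = [];
    foreach ($xpath->query('span[@class="child"]', $parent) as $child) {
        $children[] = trim($child->nodeValue);
    }
    $rows[$i] = [
        // first text node of the parent, i.e. its own label
        'parent'   => trim($xpath->evaluate('string(text()[1])', $parent)),
        'children' => implode(',', $children),
    ];
}
print_r($rows);
```

Each entry then holds one parent's own text plus its comma-separated children, so other per-parent data can be added to the same entry.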

Append an entry to an XML file using DOM and PHP

I'm trying to add entries to an XML file using DOM/PHP and I can't for the life of me get them to appear in the XML file.
The XML schema is as follows:
<alist>
<a>
<1>text a</1>
<2>text a</2>
</a>
<a>
<1>text b</1>
<2>text b</2>
</a>
</alist>
and the PHP is:
$xmlFile = "../../data/file.xml";
$dom = DOMDocument::load($xmlFile);
$v1 = "text c";
$v2 = "text c";
//create anchor
$alist = $dom->getElementsByTagName("alist");
//create elements and contents for <1> and <2>
$a1= $dom->createElement("1");
$a1->appendChild($dom->createTextNode($v1));
$a2= $dom->createElement("2");
$a2->appendChild($dom->createTextNode($v2));
//Create element <a>, add elements <1> and <2> to it.
$a= $dom->createElement("a");
$a->appendChild($v1);
$a->appendChild($v2);
//Add element <a> to <alist>
$alist->appendChild($a);
//Append entry?
$dom->save($xmlFile);
getElementsByTagName() returns a list of element nodes with that tag name. You cannot append nodes to the list itself; you can only append them to the element nodes in the list.
You need to check whether the list contains any nodes and then read the first one.
Numeric element names like 1 or 2 are not allowed: a digit cannot be the first character of an XML qualified name. Even numbering them like e1, e2, ... is a bad idea, because it makes definitions difficult. If the number is needed, put it into an attribute value.
$xml = <<<XML
<alist>
<a>
<n1>text a</n1>
<n2>text a</n2>
</a>
<a>
<n1>text b</n1>
<n2>text b</n2>
</a>
</alist>
XML;
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
$dom->formatOutput = TRUE;
$dom->loadXml($xml);
$v1 = "text c";
$v2 = "text c";
// fetch the list
$list = $dom->getElementsByTagName("alist");
if ($list->length > 0) {
    $listNode = $list->item(0);
    // Create element <a>, add it to the list node.
    $a = $listNode->appendChild($dom->createElement("a"));
    $child = $a->appendChild($dom->createElement("n1"));
    $child->appendChild($dom->createTextNode($v1));
    $child = $a->appendChild($dom->createElement("n2"));
    $child->appendChild($dom->createTextNode($v2));
}
echo $dom->saveXml();
Output: https://eval.in/147562
<?xml version="1.0"?>
<alist>
<a>
<n1>text a</n1>
<n2>text a</n2>
</a>
<a>
<n1>text b</n1>
<n2>text b</n2>
</a>
<a>
<n1>text c</n1>
<n2>text c</n2>
</a>
</alist>
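The suggestion above to put the number into an attribute could look roughly like the following. This is a sketch, not the structure from the question: the element name n and the attribute name index are hypothetical, and the starting document is shortened.

```php
<?php
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
// Hypothetical layout: one element name, the number stored in an attribute.
$dom->loadXML('<alist><a><n index="1">text a</n><n index="2">text a</n></a></alist>');

$values = ['text c', 'text c'];

// item(0) returns null when there is no <alist> element
$listNode = $dom->getElementsByTagName('alist')->item(0);
if ($listNode !== null) {
    $a = $listNode->appendChild($dom->createElement('a'));
    foreach ($values as $i => $value) {
        $n = $a->appendChild($dom->createElement('n'));
        $n->setAttribute('index', (string) ($i + 1));
        $n->appendChild($dom->createTextNode($value));
    }
}
echo $dom->saveXML();
```

With this layout a new numbered entry is just another loop iteration, and no new element name has to be defined.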

Getting second p tag inside a specific ID from HTML using DOMDocument

How can I get content from the second <p> tag inside a div with ID mydiv using DOMDocument?
For example, my HTML might look like:
<div class='mydiv'>
<p><img src='xx.jpg'></p>
<p>i need here</p>
<p>lorem ipsum lorem ipsum</p>
</div>
I'm trying to extract the following text:
i need here
How can I do it?
Getting the contents from nth <p> tag:
Use DOMDocument::getElementsByTagName() to get all the <p> tags, and use item() to retrieve the node value of the second tag from the returned DOMNodeList:
$index = 2;
$dom = new DOMDocument;
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('p');
echo $tags->item(($index-1))->nodeValue; // to-do: check if that index exists
Getting the contents from nth <p> tag inside a div with given ID
If you want to retrieve the node value of a <p> tag inside an element with a specific ID, you can use an XPath expression instead of getElementsByTagName(). (The sample HTML above uses class='mydiv'; in that case match on @class instead of @id.)
$index = 2;
$id = 'mydiv';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query(
    sprintf('//div[@id="%s"]/p', $id)
);
echo $tags->item($index - 1)->nodeValue; // to-do: check if that index exists
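The "check if that index exists" to-do can be handled by testing the result of item(), which returns null for an out-of-range index. A minimal sketch, assuming the div really carries id='mydiv' (the sample HTML uses a class) and that $id contains no double quotes; the function name nthParagraphText is made up for this example.

```php
<?php
// Sample HTML, changed to use an id attribute (assumption).
$html = "<div id='mydiv'><p><img src='xx.jpg'></p>"
      . "<p>i need here</p><p>lorem ipsum lorem ipsum</p></div>";

// Returns the text of the nth <p> (1-based) directly inside the div
// with the given id, or null if there is no such paragraph.
function nthParagraphText(string $html, string $id, int $index): ?string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $tags = $xpath->query(sprintf('//div[@id="%s"]/p', $id));
    $node = $tags->item($index - 1); // null when out of range
    return $node === null ? null : $node->nodeValue;
}

var_dump(nthParagraphText($html, 'mydiv', 2)); // the second paragraph
var_dump(nthParagraphText($html, 'mydiv', 5)); // no fifth paragraph: null
```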

PHP DomDocument - How to replace a text from a Node with another node

I have a node:
<p>
This is a test node with Figure 1.
</p>
My goal is to replace "Figure 1" with a child node:
<xref>Figure 1</xref>
So that the final result will be:
<p>
This is a test node with <xref>Figure 1</xref>.
</p>
Thank you in advance.
XPath allows you to fetch the text nodes containing the string from the document. Then you have to split each one into a list of text and element (xref) nodes and insert those nodes before the text node. Finally, remove the original text node.
$xml = <<<'XML'
<p>
This is a test node with Figure 1.
</p>
XML;
$string = 'Figure 1';
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
// find text nodes that contain the string
$nodes = $xpath->evaluate('//text()[contains(., "'.$string.'")]');
foreach ($nodes as $node) {
    // explode the text at the string
    $parts = explode($string, $node->nodeValue);
    // add a new text node with the first part
    $node->parentNode->insertBefore(
        $dom->createTextNode(
            // fetch and remove the first part from the list
            array_shift($parts)
        ),
        $node
    );
    // for every remaining part (one per occurrence of the string)
    foreach ($parts as $part) {
        // add an xref before it
        $node->parentNode->insertBefore(
            $xref = $dom->createElement('xref'),
            $node
        );
        // with the string that we used to split the text
        $xref->appendChild($dom->createTextNode($string));
        // add the part from the list as a new text node
        $node->parentNode->insertBefore(
            $dom->createTextNode($part),
            $node
        );
    }
    // remove the old text node
    $node->parentNode->removeChild($node);
}
echo $dom->saveXml($dom->documentElement);
Output:
<p>
This is a test node with <xref>Figure 1</xref>.
</p>
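You can first use getElementsByTagName() to find the node you're looking for and remove the search text from the nodeValue of that node. Then create the new node, set its nodeValue to the search text, and append the new node to the main node. Because the new node is appended at the end, this only gives the expected result when the search text is the last thing in the paragraph: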
<?php
$dom = new DOMDocument;
$dom->loadHTML('<p>This is a test node with Figure 1</p>');
$searchFor = 'Figure 1';
// replace the searchterm in given paragraph node
$p_node = $dom->getElementsByTagName("p")->item(0);
$p_node->nodeValue = str_replace($searchFor, '', $p_node->nodeValue);
// create the new element
$new_node = $dom->createElement("xref");
$new_node->nodeValue = $searchFor;
// append the child element to paragraph node
$p_node->appendChild($new_node);
echo $dom->saveHTML();
Output:
<p>This is a test node with <xref>Figure 1</xref></p>
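Another option is DOMText::splitText(), which cuts a text node in place so the matched part can be wrapped without rebuilding the surrounding text. The sketch below makes several assumptions: it only wraps the first occurrence in each text node, it needs the mbstring extension for character offsets, and the search string is hard-coded into the XPath expression.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<p>This is a test node with Figure 1.</p>');
$search = 'Figure 1';

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//text()[contains(., "Figure 1")]') as $text) {
    // character offset of the first occurrence in this text node
    $pos = mb_strpos($text->nodeValue, $search);
    // $text keeps everything before the match, $match gets the rest
    $match = $text->splitText($pos);
    // cut off the tail so $match holds exactly the search string
    $match->splitText(mb_strlen($search));
    // put an <xref> where the match was, then move the match into it
    $xref = $dom->createElement('xref');
    $match->parentNode->replaceChild($xref, $match);
    $xref->appendChild($match);
}

$result = $dom->saveXML($dom->documentElement);
echo $result;
```

Unlike the append-based variant, this keeps the trailing "." after the new element.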

XPath Node to String

How can I select the string contents of the following nodes:
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
I have tried a few things
//span/text()
Doesn't get the bold tag
//span/string(.)
is invalid
string(//span)
only selects 1 node
I am using simple_xml in php and the only other option I think is to use //span which returns:
Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => url
)
[b] => test
)
[1] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => url
)
[b] => test2
)
)
*note that it is also dropping the "more words" text from the second span.
So I guess I could then flatten the item in the array using php some how? Xpath is preferred, but any other ideas would help too.
$xml = '<foo>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</foo>';
$dom = new DOMDocument();
$dom->loadXML($xml); //or load an HTML document with loadHTML()
$x= new DOMXpath($dom);
foreach ($x->query("//span[@class='url']") as $node) {
    echo $node->textContent;
}
You don't even need XPath for this:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('span') as $span) {
    if (in_array('url', explode(' ', $span->getAttribute('class')))) {
        $span->nodeValue = $span->textContent;
    }
}
echo $dom->saveHTML();
EDIT after comment below
If you just want to fetch the string, you can do echo $span->textContent; instead of replacing the nodeValue. I understood you wanted to have one string for the span instead of the nested structure. In that case, you should also consider whether simply running strip_tags on the span snippet wouldn't be the faster and easier alternative.
With PHP 5.3 or later you can also register arbitrary PHP functions for use as callbacks in XPath queries. The following fetches the content of all span elements and their child nodes and returns it as a single string.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions();
echo $xp->evaluate('php:function("nodeTextJoin", //span)');
// Custom Callback function
function nodeTextJoin($nodes)
{
    $text = '';
    foreach ($nodes as $node) {
        $text .= $node->textContent;
    }
    return $text;
}
Using XMLReader:
$xmlr = new XMLReader;
$xmlr->xml($doc);
while ($xmlr->read()) {
    if (($xmlr->nodeType == XMLReader::ELEMENT) && ($xmlr->name == 'span')) {
        echo $xmlr->readString();
    }
}
Output:
word
test
word
test2
more words
SimpleXML doesn't like mixing text nodes with other elements, that's why you're losing some content there. The DOM extension, however, handles that just fine. Luckily, DOM and SimpleXML are two faces of the same coin (libxml) so it's very easy to juggle them. For instance:
foreach ($yourSimpleXMLElement->xpath('//span') as $span)
{
    // will not work as expected
    echo $span;
    // will work as expected
    echo textContent($span);
}
function textContent(SimpleXMLElement $node)
{
    return dom_import_simplexml($node)->textContent;
}
//span//text()
This may be the best you can do. You'll get multiple text nodes because the text is stored in separate nodes in the DOM. If you want a single string you'll have to just concatenate the text nodes yourself since I can't think of a way to get the built-in XPath functions to do it.
Using string() or concat() won't work because these functions expect string arguments. When you pass a node-set to a function expecting a string, the node-set is converted to a string by taking the text content of the first node in the node-set. The rest of the nodes are discarded.
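The "first node only" conversion stops being a problem once each span is used as the context node in turn, because the string value of a single element already includes all of its descendant text. A minimal sketch with PHP's DOMXPath (XPath 1.0), using a compacted copy of the sample markup; the variable $strings is only illustrative.

```php
<?php
// Compacted copy of the sample markup, wrapped in a root element.
$xml = '<foo><span class="url"> word <b class=" ">test</b> </span>'
     . '<span class="url"> word <b class=" ">test2</b> more words </span></foo>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$strings = [];
foreach ($xpath->query('//span') as $span) {
    // Evaluated relative to one span, "." is a single node, so
    // normalize-space() sees that span's full text content.
    $strings[] = $xpath->evaluate('normalize-space(.)', $span);
}
print_r($strings);
```

This yields one whitespace-normalized string per span, including the text inside the nested b elements.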
How can I select the string contents
of the following nodes:
First, I think your question is not clear.
You could select the descendant text nodes, as John Kugelman has answered, with
//span//text()
I recommend using an absolute path (one that does not start with //).
But with this you would then need to process the text nodes to find out which parent span each one belongs to. So it would be better to just select the span elements (for example, //span) and then process the string value of each.
With XPath 2.0 you could use:
string-join(//span, '.')
Result:
word test. word test2 more words
With XSLT 1.0, this input:
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
With this stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="span[@class='url']">
<xsl:value-of select="concat(substring('.',1,position()-1),normalize-space(.))"/>
</xsl:template>
</xsl:stylesheet>
Output:
word test.word test2 more words
Along the lines of Alejandro's XSLT 1.0 "but any other ideas would help too" answer...
XML:
<?xml version="1.0" encoding="UTF-8"?>
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="span">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
</xsl:stylesheet>
OUTPUT:
word test
word test2 more words
