How can I select the string contents of the following nodes:
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
I have tried a few things
//span/text()
Doesn't get the bold tag
//span/string(.)
is invalid
string(//span)
only selects 1 node
I am using SimpleXML in PHP, and the only other option I can think of is to use //span, which returns:
Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => url
)
[b] => test
)
[1] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => url
)
[b] => test2
)
)
*Note that it is also dropping the "more words" text from the second span.
So I guess I could then flatten the items in the array using PHP somehow? XPath is preferred, but any other ideas would help too.
$xml = '<foo>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</foo>';
$dom = new DOMDocument();
$dom->loadXML($xml); //or load an HTML document with loadHTML()
$x= new DOMXpath($dom);
foreach($x->query("//span[@class='url']") as $node) echo $node->textContent;
You don't even need XPath for this:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('span') as $span) {
if(in_array('url', explode(' ', $span->getAttribute('class')))) {
$span->nodeValue = $span->textContent;
}
}
echo $dom->saveHTML();
EDIT after comment below
If you just want to fetch the string, you can do echo $span->textContent; instead of replacing the nodeValue. I understood that you wanted one string for the span instead of the nested structure. In this case, you should also consider whether simply running strip_tags on the span snippet wouldn't be the faster and easier alternative.
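As a minimal sketch of the strip_tags alternative mentioned above, assuming the span is already available as a raw string (the $snippet variable is a hypothetical stand-in, not part of the original answer):

```php
<?php
// Hypothetical input: the first span from the question as a raw string.
$snippet = '<span class="url">word <b class=" ">test</b></span>';

// strip_tags() drops the markup and keeps the text between the tags.
$text = strip_tags($snippet);

echo $text, "\n"; // word test
```

Note that strip_tags works on the raw string and does not decode entities such as &amp;, so the DOM approach is still the safer choice when the markup is not under your control.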
With PHP 5.3 you can also register arbitrary PHP functions for use as callbacks in XPath queries. The following fetches the content of all span elements and their child nodes and returns it as a single string.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions();
echo $xp->evaluate('php:function("nodeTextJoin", //span)');
// Custom Callback function
function nodeTextJoin($nodes)
{
$text = '';
foreach($nodes as $node) {
$text .= $node->textContent;
}
return $text;
}
Using XMLReader:
$xmlr = new XMLReader;
$xmlr->xml($doc);
while ($xmlr->read()) {
if (($xmlr->nodeType == XMLReader::ELEMENT) && ($xmlr->name == 'span')) {
echo $xmlr->readString();
}
}
Output:
word
test
word
test2
more words
SimpleXML doesn't like mixing text nodes with other elements, that's why you're losing some content there. The DOM extension, however, handles that just fine. Luckily, DOM and SimpleXML are two faces of the same coin (libxml) so it's very easy to juggle them. For instance:
foreach ($yourSimpleXMLElement->xpath('//span') as $span)
{
// will not work as expected
echo $span;
// will work as expected
echo textContent($span);
}
function textContent(SimpleXMLElement $node)
{
return dom_import_simplexml($node)->textContent;
}
//span//text()
This may be the best you can do. You'll get multiple text nodes because the text is stored in separate nodes in the DOM. If you want a single string you'll have to just concatenate the text nodes yourself since I can't think of a way to get the built-in XPath functions to do it.
Using string() or concat() won't work because these functions expect string arguments. When you pass a node-set to a function expecting a string, the node-set is converted to a string by taking the text content of the first node in the node-set. The rest of the nodes are discarded.
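A rough sketch of that behaviour in PHP, assuming the markup from the question wrapped in a hypothetical <foo> root: string() only sees the first span, while a manual loop over .//text() produces one string per span.

```php
<?php
// Hypothetical document: the question's spans inside a <foo> root.
$xml = '<foo><span class="url">word <b class=" ">test</b></span>'
     . '<span class="url">word <b class=" ">test2</b> more words</span></foo>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// string() on a node-set only uses the first node in document order.
$first = $xpath->evaluate('string(//span)');

// Concatenate the descendant text nodes of each span by hand.
$strings = [];
foreach ($xpath->query('//span') as $span) {
    $text = '';
    foreach ($xpath->query('.//text()', $span) as $textNode) {
        $text .= $textNode->nodeValue;
    }
    $strings[] = $text;
}

echo $first, "\n";  // word test
print_r($strings);  // "word test" and "word test2 more words"
```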
How can I select the string contents
of the following nodes:
First, I think your question is not clear.
You could select the descendant text nodes, as John Kugelman has answered, with
//span//text()
I recommend using an absolute path (not starting with //).
But with this you would need to process the text nodes to find out which parent span each one is a child of. So it would be better to just select the span elements (for example, //span) and then process their string values.
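In PHP, selecting the span elements and processing their string values could look like the sketch below. It assumes the input is wrapped in a <div> root and uses DOMXPath::evaluate() with normalize-space(); the '. ' separator is an arbitrary choice for illustration.

```php
<?php
$xml = <<<'XML'
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Select each span with an absolute path, then take its whitespace-normalized string value.
$values = [];
foreach ($xpath->query('/div/span') as $span) {
    $values[] = $xpath->evaluate('normalize-space(.)', $span);
}

echo implode('. ', $values), "\n"; // word test. word test2 more words
```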
With XPath 2.0 you could use:
string-join(//span, '.')
Result:
word test. word test2 more words
With XSLT 1.0, this input:
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
With this stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="span[@class='url']">
<xsl:value-of select="concat(substring('.',1,position()-1),normalize-space(.))"/>
</xsl:template>
</xsl:stylesheet>
Output:
word test.word test2 more words
Along the lines of Alejandro's XSLT 1.0 "but any other ideas would help too" answer...
XML:
<?xml version="1.0" encoding="UTF-8"?>
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="span">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
</xsl:stylesheet>
OUTPUT:
word test
word test2 more words
Related
I want to extract some data from xml.
I have this xml:
<root>
<p>Some text</p>
<p>Even more text</p>
<span class="bla bla">
<span class="currency">EUR</span> 19.95
</span>
</root>
and then I run this php code
$xml = simplexml_load_string($xmlString);
$json = json_encode($xml);
$obj = json_decode($json);
print_r($obj);
and the result is:
stdClass Object
(
[p] => Array
(
[0] => Some text
[1] => Even more text
)
[span] => stdClass Object
(
[@attributes] => stdClass Object
(
[class] => bla bla
)
[span] => EUR
)
)
How do I get the missing string "19.95"?
Don't convert XML into JSON or an array. It means that you lose information and features.
SimpleXML is limited: it works with basic XML, but it has problems with things like mixed nodes. DOM makes this case easier to handle.
$xml = <<<'XML'
<root>
<p>Some text</p>
<p>Even more text</p>
<span class="bla bla">
<span class="currency">EUR</span> 19.95
</span>
</root>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
foreach($xpath->evaluate('/root/span[@class="bla bla"]') as $span) {
var_dump(
$xpath->evaluate('string(span[@class="currency"][1])', $span),
$xpath->evaluate(
'number(span[@class="currency"][1]/following-sibling::text()[1])',
$span
)
);
}
XPath is an expression language for fetching parts of a DOM (think SQL for XML). PHP has several methods to access it. SimpleXMLElement::xpath() fetches nodes as arrays of SimpleXMLElement objects. DOMXpath::query() fetches node lists. Only DOMXpath::evaluate() can fetch both node lists and scalar values.
In the example, /root/span[@class="bla bla"] fetches all span element nodes that have the given class attribute. For each of those nodes, the second expression fetches the span with the class currency as a string. The third expression fetches the first following sibling text node of the currency span as a number.
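To make the difference between query() and evaluate() concrete, here is a small sketch on a cut-down version of the document (the variable names are only illustrative):

```php
<?php
$document = new DOMDocument();
$document->loadXML('<root><p>Some text</p><p>Even more text</p></root>');
$xpath = new DOMXPath($document);

// query() is meant for location paths and returns a DOMNodeList.
$nodes = $xpath->query('/root/p');

// evaluate() also handles expressions with scalar results.
$count = $xpath->evaluate('count(/root/p)');     // number, returned as a float
$first = $xpath->evaluate('string(/root/p[1])'); // string

var_dump($nodes->length, $count, $first);
```

XPath numbers are always doubles, which is why count() comes back as a float rather than an int.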
Don't trust the debug output, don't convert to JSON or an array, and don't overthink the problem.
Outputting this string is as simple as navigating to the element and echoing it:
echo $xml->span;
Or to get it into a variable, explicitly cast to string:
$foo = (string)$xml->span;
Or if you want to use XPath like in ThW's answer, you could find the span using //span[@class="bla bla"] and echo that (note that ->xpath() returns an array, so you want element 0 of that array):
echo $xml->xpath('//span[@class="bla bla"]')[0];
I'm trying to parse a specific HTML document, some sort of a dictionary, with about 10000 words and descriptions.
It went well until I noticed that entries in a specific format don't get parsed well.
Here is an example:
<?php
$html = '
<p>
<b>
<span>zot; zotz </span>
</b>
<span>Nista; nula. Isto
<b>zilch; zip.</b>
</span>
</p>
';
$xml = simplexml_load_string($html);
var_dump($xml);
?>
Result of var_dump() is:
object(SimpleXMLElement)#1 (2) {
["b"]=>
object(SimpleXMLElement)#2 (1) {
["span"]=>
string(10) "zot; zotz "
}
["span"]=>
string(39) "Nista; nula. Isto
"
}
As you can see, SimpleXML kept the text node inside the tag but left out the child node and the text inside it.
I've also tried:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
with the same result.
As it looked to me like a common problem when parsing HTML, I tried googling it, but the only place that acknowledges this problem is this blog:
https://hakre.wordpress.com/2013/07/09/simplexml-and-json-encode-in-php-part-i/
but it does not offer any solution.
The posts and answers about parsing HTML on SO are just too general.
Is there a simple way of dealing with this?
Or, should I change my strategy?
Your observation is correct: SimpleXML does only offer the child element-node here, not the child text-nodes. The solution is to switch to DOMDocument as it can access all nodes there, text and element children.
// first span element
$span = dom_import_simplexml($xml->span);
foreach ($span->childNodes as $child) {
printf(" - %s : %s\n", get_class($child), $child->nodeValue );
}
This example shows that dom_import_simplexml is used on the more specific <span> element-node, and the traversal is then done over the children of the corresponding DOMElement object.
The output:
- DOMText : Nista; nula. Isto
- DOMElement : zilch; zip.
- DOMText :
The first entry is the first text-node within the <span> element. It is followed by the <b> element (which again contains some text) and then by another text-node that consists of whitespace only.
The dom_import_simplexml function is especially useful when SimpleXMLElement is too simple for more differentiated data access within the XML document. Like in the case you face here.
The example in full:
$html = <<<HTML
<p>
<b>
<span>zot; zotz </span>
</b>
<span>Nista; nula. Isto
<b>zilch; zip.</b>
</span>
</p>
HTML;
$xml = simplexml_load_string($html);
// first span element
$span = dom_import_simplexml($xml->span);
foreach ($span->childNodes as $child) {
printf(" - %s : %s\n", get_class($child), $child->nodeValue );
}
If I have a document like this:
<!-- in doc.xml -->
<a>
<b>
greetings?
<c>hello</c>
<d>goodbye</d>
</b>
</a>
Is there any way to use simplexml (or any php builtin really) to get a string containing:
greetings?
<c>hello</c>
<d>goodbye</d>
Whitespace and such doesn't matter.
Thanks!
I must admit this wasn't as simple as one would think. This is what I came up with:
$xml = new DOMDocument;
$xml->load('doc.xml');
// find just the <b> node(s)
$xpath = new DOMXPath($xml);
$results = $xpath->query('/a/b');
// get entire <b> node as text
$node = $results->item(0);
$text = $xml->saveXML($node);
// remove encapsulating <b></b> tags
$text = preg_replace('#^<b>#', '', $text);
$text = preg_replace('#</b>$#', '', $text);
echo $text;
Regarding the XPath query
The query returns all matching nodes, so if there are multiple matching <b> tags, you can loop through $results to get them all.
My query for '/a/b' assumes that <a> is the root and <b> is its child/immediate descendant. You could alter it for different scenarios. Here's an XPath reference. Some adjustments might include:
'//a/b' –– <b> is child of <a>, but <a> is anywhere, not just in the root
'a//b' –– <b> is a descendant of <a> no matter how deep, not just a direct child
'//b' –– all <b> nodes anywhere in the document
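To see how such path variants differ, the following sketch counts matches on a small made-up document (not the doc.xml from the question) that contains a nested <a> and several <b> elements:

```php
<?php
// Hypothetical test document with nested <a> and <b> elements.
$doc = new DOMDocument();
$doc->loadXML('<a><b>one</b><x><a><b>two</b></a><b>three</b></x></a>');
$xpath = new DOMXPath($doc);

// Without a context node, relative paths are evaluated from the document root.
$counts = [];
foreach (['/a/b', '//a/b', 'a//b', '//b'] as $expr) {
    $counts[$expr] = $xpath->query($expr)->length;
}

print_r($counts); // /a/b => 1, //a/b => 2, a//b => 3, //b => 3
```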
Regarding method of obtaining string contents
I tried using $node->nodeValue or $node->textContent, but both of them strip out the <c> and <d> tags, leaving just the text contents of those. I also tried casting it as a DOMText object, but that didn't directly work and was more trouble than it was worth.
Regarding the use of regular expressions
It could be done without regex, but I found it easiest to use them. I wanted to make sure that I only stripped the <b> and </b> at the very beginning and end of the string, just in case there were other <b> nodes within the contents.
How about this? Since you already know the XML format:
<?php
$xml = simplexml_load_file('doc.xml');
$str = $xml->b;
$str .= "<c>".$xml->b->c."</c>";
$str .= "<d>".$xml->b->d."</d>";
echo $str;
?>
Here's an alternative using DOM (to balance the SimpleXML answers!) that outputs all of the contents of the first <b> element.
$doc = new DOMDocument;
$doc->load('doc.xml');
$bee = $doc->getElementsByTagName('b')->item(0);
$innerxml = '';
foreach ($bee->childNodes as $node) {
$innerxml .= $doc->saveXML($node);
}
echo $innerxml;
for example, we have this xml:
<p>[TAG]
<span>foo 1</span>
<span>foo 2</span>
[/TAG]
<span>bar 1</span>
<span>bar 2</span>
</p>
How can I detect the <span> tags between the words [TAG] and [/TAG] ("foo 1" and "foo 2" in this case)?
UPD: for example, I need to change the nodeValue of each span between [TAG] and [/TAG].
Assuming that you only have one set of [TAG]..[/TAG] per node (as in, if your document has two sets they're within separate <p> elements or whatever), and that they're always siblings:
You can use preceding-sibling and following-sibling to select only elements which are preceded by a [TAG] text node and followed by a [/TAG] text node:
//span[preceding-sibling::text()[normalize-space(.) = "[TAG]"]][following-sibling::text()[normalize-space(.) = "[/TAG]"]]
A full PHP example:
$doc = new DOMDocument();
$doc->loadHTMLFile('test.xml');
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//span[preceding-sibling::text()[normalize-space(.) = "[TAG]"]][following-sibling::text()[normalize-space(.) = "[/TAG]"]]') as $el) {
$el->nodeValue = 'Changed!';
}
echo $doc->saveXML();
For example, we have this xml:
<x>
<y>some text</y>
<y>[ID] hello</y>
<y>world [/ID]</y>
<y>some text</y>
<y>some text</y>
</x>
and we need to remove the words "[ID]" and "[/ID]" and the text between them (which we don't know in advance when parsing), of course without damaging the XML formatting.
The only solution I can think of is this:
Find the text in the XML using a regex, for example: "/\[ID\].*?\[\/ID\]/". In our case, the result will be "[ID]hello</y><y>world[/ID]"
In the result from the previous step, find the text outside the XML tags using this regex:
"/(?<=^|>)[^><]+?(?=<|$)/", and delete this text. The result will be "</y><y>"
Apply the changes to the original XML with something like this:
str_replace($step1string,$step2string,$xml);
Is this the correct way to do this?
I just think that str_replace isn't the best way to edit XML, so maybe you know a better solution?
Removing the specific string is simple:
<?php
$xml = '<x>
<y>some text</y>
<y>[ID] hello</y>
<y>world [/ID]</y>
<y>some text</y>
<y>some text</y>
</x>';
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[(contains(.,\'[ID]\') or contains(.,\'[/ID]\'))]') as $elm){
$elm->nodeValue = preg_replace('/\[\/?ID\]/','',$elm->nodeValue);
}
var_dump($d->saveXML());
?>
When just removing text nodes in a specific tag, one could alter the preg_replace to these two:
$elm->nodeValue = preg_replace('/\[ID\].*$/','',$elm->nodeValue);
$elm->nodeValue = preg_replace('/^.*\[\/ID\]/','',$elm->nodeValue);
Resulting in for your example:
<x>
<y>some text</y>
<y></y>
<y></y>
<y>some text</y>
<y>some text</y>
</x>
However, removing tags in between without damaging well formed XML is quite tricky. Before venturing into lot of DOM actions, how would you like to handle:
An [/ID] higher in the DOM-tree:
<foo>[ID] foo
<bar> lorem [/ID] ipsum </bar>
</foo>
An [/ID] lower in the DOM-tree
<foo> foo
<bar> lorem [ID] ipsum </bar>
[/ID]
</foo>
And open/close spanning siblings, as per your example:
<foo> foo
<bar> lorem [ID] ipsum </bar>
<bar> lorem [/ID] ipsum </bar>
</foo>
And a real dealbreaker of a question: is nesting possible, is that nesting well formed, and what should it do?
<foo> foo
<bar> lo [ID] rem [ID] ipsum </bar>
<bar> lorem [/ID] ipsum </bar>
[/ID]
</foo>
Without further knowledge how these case should be handled there is no real answer.
Edit: well, further information was given. The actual fail-safe solution (i.e. parse the XML, don't use regexes) seems kind of long, but will work in 99.99% of cases (personal typos and brainfarts excluded of course :) ):
<?php
$xml = '<x>
<y>some text</y>
<y>
<a> something </a>
well [ID] hello
<a> and then some</a>
</y>
<y>some text</y>
<x>
world
<a> also </a>
foobar [/ID] something
<a> these nodes </a>
</x>
<y>some text</y>
<y>some text</y>
</x>';
echo $xml;
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[contains(.,\'[ID]\')]') as $elm){
//if this node also contains [/ID], replace and be done:
if(($startpos = strpos($elm->nodeValue,'[ID]'))!==false && $endpos = strpos($elm->nodeValue,'[/ID]',$startpos)){
$elm->replaceData($startpos, $endpos-$startpos + 5,'');
var_dump($d->saveXML($elm));
continue;
}
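//truncate this textnode to the text before [ID]
$elm->nodeValue = substr($elm->nodeValue, 0, $startpos);
//delete all siblings of this textnode that are not text or do not contain [/ID]
while($elm->nextSibling){
if(!($elm->nextSibling instanceof DOMText) || ($pos = strpos($elm->nextSibling->nodeValue,'[/ID]'))===false){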
$elm->parentNode->removeChild($elm->nextSibling);
} else {
//id found in same element, replace and go to next [ID]
$elm->parentNode->appendChild(new DOMTExt(substr($elm->nextSibling->nodeValue,$pos+5)));
$elm->parentNode->removeChild($elm->nextSibling);
continue 2;
}
}
//siblings of textnode deleted, string truncated to before [ID], now let's delete intermediate nodes
while($sibling = $elm->parentNode->nextSibling){ // in case of example: other <y> elements:
//loop though childnodes and search a textnode with [/ID]
while($child = $sibling->firstChild){
//delete if not a textnode
if(!($child instanceof DOMText)){
$sibling->removeChild($child);
continue;
}
//we have text, check for [/ID]
if(($pos = strpos($child->nodeValue,'[/ID]'))!==false){
//add remaining text in textnode:
$elm->appendData(substr($child->nodeValue,$pos+5));
//remove current textnode with match:
$sibling->removeChild($child);
//sanity check: [ID] was in <y>, is [/ID]?
if($sibling->tagName != $elm->parentNode->tagName){
trigger_error('[/ID] found in other tag than [ID]: '.$sibling->tagName.'<>'.$elm->parentNode->tagName, E_USER_NOTICE);
}
//add remaining childs of sibling to parent of [ID]:
while($sibling->firstChild){
$elm->parentNode->appendChild($sibling->firstChild);
}
//delete the sibling that was found to hold [/ID]
$sibling->parentNode->removeChild($sibling);
//done: end both whiles
break 2;
}
//textnode, but no [/ID], so remove:
$sibling->removeChild($child);
}
//no child, no text, so no [/ID], remove:
$elm->parentNode->parentNode->removeChild($sibling);
}
}
var_dump($d->saveXML());
?>
For your entertainment and edification, you may want to read this: RegEx match open tags except XHTML self-contained tags
The "correct" solution is to use an XML library and search through the nodes to perform the operation. However, it would probably be much easier to just use a str_replace, even if there's a chance of damaging the XML formatting. You have to gauge the likelihood of receiving something like <a href="[ID]"> and the importance of defending against such cases, and weigh those factors against development time.
The only other option I can think of is if you could format the xml differently.
<x>
<y>
<z>[ID]</z>