getting the text content of a specific DOMElement - php

After a little hair-pulling, I discovered that DOMElement->textContent also returns the combined text of that element's children.
Looking around a bit, I saw people suggesting DOMElement->firstChild->textContent, but this is no good for me: I'm walking through the document following the hierarchy and cues from element attributes, and the data is just as likely to be on a branch as on a leaf, so I would get multiple hits even though only one of them is correct.
Is there an actual way to get the text content of this one specific element and none of its children?
EDIT: nvm, found a way to make sure
function get_text($el) {
    if (is_a($el->firstChild, "DOMText")) return $el->firstChild->textContent;
    return "";
}

Simply iterate over the child nodes and check whether each one is a text node. You
might want to skip nodes consisting only of whitespace, though:
function getNodeText(DOMNode $node) {
    if ($node->nodeType === XML_TEXT_NODE)
        return $node->textContent;
    $node = $node->firstChild;
    while ($node) {
        // compare against '' so that a text such as "0" is not skipped
        if ($node->nodeType === XML_TEXT_NODE &&
            ($text = trim($node->textContent)) !== '')
        {
            return $text;
        }
        $node = $node->nextSibling;
    }
    return '';
}
$xml = <<<'EOXML'
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child>
<x>x text</x>
child text
</child>
root text
</root>
EOXML;
$doc = new DOMDocument();
$doc->loadXML($xml);
var_dump(getNodeText($doc->getElementsByTagName('x')[0]));
var_dump(getNodeText($doc->getElementsByTagName('root')[0]));
var_dump(getNodeText($doc->getElementsByTagName('child')[0]));
Sample output
string(6) "x text"
string(9) "root text"
string(10) "child text"
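The function above returns only the first non-blank text node. If an element has text both before and after a child element, a variant that collects every direct text child may be closer to what the question asks for. The following is a minimal sketch; the name getOwnText and the choice to join the pieces with a single space are assumptions for illustration, not something from the answers above.

```php
<?php
// Hypothetical helper (not from the thread): joins the trimmed text of all
// direct text and CDATA children, ignoring text inside child elements.
function getOwnText(DOMElement $element): string
{
    $parts = [];
    foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE
            || $child->nodeType === XML_CDATA_SECTION_NODE) {
            $text = trim($child->textContent);
            if ($text !== '') {
                $parts[] = $text;
            }
        }
    }
    return implode(' ', $parts);
}

$doc = new DOMDocument();
$doc->loadXML('<p>before <b>bold</b> after</p>');
$ownText = getOwnText($doc->documentElement);
echo $ownText, "\n"; // before after
```

The text inside <b> is left out because only direct children of the element are inspected.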


PHP DOM Why does removing a child node of an element with removeChild interrupt a foreach loop over its child nodes?

I have encountered a puzzling behavior of the DOM method removeChild. When looping over the child nodes of a DOMElement, removing one of these nodes along the way interrupts the loop, i.e., the loop does not iterate over the remaining child nodes.
Here is a minimal example:
$test_string = <<<XML
<test>
<text>A sample text with <i>mixed content</i> of <b>various sorts</b></text>
</test>
XML;
$test_DOMDocument = new DOMDocument();
$test_DOMDocument->loadXML($test_string);
$test_DOMNode = $test_DOMDocument->getElementsByTagName("text");
foreach ($test_DOMNode as $text) {
    foreach ($text->childNodes as $node) {
        if (preg_match("/text/", $node->nodeValue)) {
            echo $node->nodeValue;
            $node->parentNode->removeChild($node);
        } else {
            echo $node->nodeValue;
        }
    }
}
If I comment out the line $node->parentNode->removeChild($node);, then the output is the entire test string, i.e., A sample text with mixed content of various sorts, as expected. With that line, however, only the first child node is output, i.e., A sample text with. That is, removing the first child node as the loop passes over it apparently interrupts the loop; the remaining child nodes are not processed. Why is that?
Thanks in advance for your help!
Implementing the suggestions of the comments on my question, I came up with the following solution:
$test_string = <<<XML
<test>
<text>A sample text with <i>mixed content</i> of <b>various sorts</b></text>
</test>
XML;
$test_DOMDocument = new DOMDocument();
$test_DOMDocument->loadXML($test_string);
$test_DOMNode = $test_DOMDocument->getElementsByTagName("text");
foreach ($test_DOMNode as $text) {
    $child_nodes = $text->childNodes;
    for($n = $child_nodes->length-1; $n >= 0; --$n) {
        $node = $child_nodes->item($n);
        if (preg_match("/text/", $node->nodeValue)) {
            echo $node->nodeValue;
            $node->parentNode->removeChild($node);
        } else {
            echo $node->nodeValue;
        }
    }
}
That is, I go through the child nodes in reverse order, using a method suggested in another posting. In this way, all nodes are processed: The output is various sorts of mixed contentA sample text with. Note the reverse order of the text fragments. In my specific use case, this reversal does not matter because I am not actually echoing the text nodes, but performing another kind of operation on them.
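If the forward order does matter, a possible alternative is to copy the child nodes into a plain array with iterator_to_array() before looping, so that removing a node no longer affects the iteration. This is a sketch of that idea, not the approach from the comments; the variable $seen is only there to collect the output for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<test><text>A sample text with <i>mixed content</i> of <b>various sorts</b></text></test>'
);

$seen = '';
foreach ($doc->getElementsByTagName('text') as $text) {
    // Snapshot the child nodes; the array does not change when nodes are removed.
    foreach (iterator_to_array($text->childNodes) as $node) {
        $seen .= $node->nodeValue;
        if (preg_match('/text/', $node->nodeValue)) {
            $text->removeChild($node);
        }
    }
}
echo $seen, "\n"; // A sample text with mixed content of various sorts
```

All four child nodes are visited in document order, and only the first text node is removed.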

Read out an XML file and search for a specific word. Then put this word on a page. Code in PHP

I have to read an XML file and search it for 0GEW903KA. This code must then be shown on a page. I can't get it to read the XML file. With a .txt file I can already read the code, but it always outputs the entire line and not just the word. Can you help me, please?
sheet.info
<?xml version="1.0" encoding="utf-8"?>
<configuration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" sheetInfoVersion="1.0" infoType="SheetInfo">
<uid>10000005-e59b-4723-a941-8d5942ff1673</uid>
<sheetType hasTemporaryMOP="false" regularMOPLocation="">ALS</sheetType>
<revision />
<timestamps>
<created>2019-08-26 12:27:30</created>
<lastWrite>2019-08-26 12:27:30</lastWrite>
<lastAccess>2019-08-26 12:27:30</lastAccess>
</timestamps>
<sheetName />
<idCode>0GEW903KA</idCode>
<versionInfo>
<comment>initial revision</comment>
<date>2019-08-26 12:27:30</date>
<increment>0</increment>
<user>LA2TERMSERV/html_eng1</user>
<version>1</version>
<state>Undefined</state>
<creationToolName>HTML_Editor</creationToolName>
<creationToolVersion>2.6.0</creationToolVersion>
<creationToolLibVersion>HTML 2.6.0</creationToolLibVersion>
</versionInfo>
</configuration>
My current messy code, my attempt to read the XML file:
$z = new XMLReader;
$z->open('C:/xampp/htdocs/Flutter/Probe/sheet.xml');
$doc = new DOMDocument;
// move to the first <product /> node
while ($z->read() && $z->name !== 'idCode');
// now that we're at the right depth, hop to the next <product/> until the end of the tree
while ($z->name === 'idCode') {
    // either one should work
    //$node = new SimpleXMLElement($z->readOuterXML());
    $node = simplexml_import_dom($doc->importNode($z->expand(), true));
    // now you can use $node without going insane about parsing
    echo $node;
}
The code does not match the XML example (there is no "product" element), and the XML does not look like it needs XMLReader (which is meant for large XML files).
So with just DOM + Xpath:
$document = new DOMDocument;
$document->loadXML($xmlString);
$xpath = new DOMXpath($document);
var_dump($xpath->evaluate('string(/configuration/idCode)'));
Output:
string(9) "0GEW903KA"
Xpath allows you to fetch nodes and values from the DOM using expressions. The expression here is built up as follows:
Fetch the "configuration" document element node: /configuration
... fetch the "idCode" child elements: /configuration/idCode
... cast the first node found to a string: string(/configuration/idCode)
Your example code never reads the actual string value, and it is an endless loop because it does not move to the next node. Here is a fixed example with the different possibilities after you navigate to the "idCode" element node:
$reader = new XMLReader;
//$reader->open('C:/xampp/htdocs/Flutter/Probe/sheet.xml');
$reader->open('data:text/xml;base64,'.base64_encode($xmlString));
$document = new DOMDocument;
$xpath = new DOMXpath($document);
while ($reader->read() && $reader->localName !== 'idCode');
while ($reader->localName === 'idCode') {
    // expand idCode into prepared document
    $node = $reader->expand($document);
    // importing into SimpleXML is an additional step
    $idCode = simplexml_import_dom($node);
    var_dump(
        [
            // direct read without expand into DOM
            'XMLReader::readString' => $reader->readString(),
            // text content of the expanded node
            'DOMNode::$textContent' => $node->textContent,
            // xpath expression using expanded node as context
            'Xpath expression' => $xpath->evaluate('string(.)', $node),
            // cast the imported SimpleXMLElement instance
            'SimpleXMLElement' => (string)$idCode
        ]
    );
    // look for a sibling "idCode"
    $reader->next('idCode');
}
Output:
array(4) {
["XMLReader::readString"]=>
string(9) "0GEW903KA"
["DOMNode::$textContent"]=>
string(9) "0GEW903KA"
["Xpath expression"]=>
string(9) "0GEW903KA"
["SimpleXMLElement"]=>
string(9) "0GEW903KA"
}

What does LIBXML_NOBLANKS do, exactly?

What is the difference between
$domd=new DOMDocument();
$domd->loadHTML($html, LIBXML_NOBLANKS);
and
$domd=new DOMDocument();
$domd->loadHTML($html, 0);
?
edit: just in case someone wants to remove all empty+whitespace text nodes (which is not exactly what LIBXML_NOBLANKS does), here's a function to do just that,
$removeAnnoyingWhitespaceTextNodes = function (\DOMNode $node) use (&$removeAnnoyingWhitespaceTextNodes): void {
    if ($node->hasChildNodes()) {
        // Warning: it's important to do it backwards; if you do it forwards, the index for DOMNodeList might become invalidated;
        // that's why i don't use foreach() - don't change it (unless you know what you're doing, ofc)
        for ($i = $node->childNodes->length - 1; $i >= 0; --$i) {
            $removeAnnoyingWhitespaceTextNodes($node->childNodes->item($i));
        }
    }
    if ($node->nodeType === XML_TEXT_NODE && !$node->hasChildNodes() && !$node->hasAttributes() && (strlen(trim($node->textContent)) === 0)) {
        //echo "Removing annoying POS";
        // var_dump($node);
        $node->parentNode->removeChild($node);
    } //elseif ($node instanceof DOMText) { echo "not removed"; var_dump($node, $node->hasChildNodes(), $node->hasAttributes(), trim($node->textContent)); }
};
$dom=new DOMDocument();
$dom->loadHTML($html);
$removeAnnoyingWhitespaceTextNodes($dom);
The LIBXML_NOBLANKS parser option removes all text nodes containing only whitespace. Consider the following document, for example:
<doc>
<elem>text</elem>
</doc>
Normally, the element doc has three children: A whitespace text node, the element elem and another whitespace text node. When parsing with LIBXML_NOBLANKS, the doc element will only have a single element child.
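The difference can be checked by counting the children of doc with and without the option. This is a small sketch that uses loadXML() rather than the loadHTML() call from the question, on the assumption that the XML parser is the simpler case to reason about; the HTML parser may behave differently.

```php
<?php
$xml = "<doc>\n  <elem>text</elem>\n</doc>";

// Default parsing keeps the whitespace-only text nodes.
$default = new DOMDocument();
$default->loadXML($xml);

// With LIBXML_NOBLANKS the blank text nodes are dropped while parsing.
$noBlanks = new DOMDocument();
$noBlanks->loadXML($xml, LIBXML_NOBLANKS);

$defaultCount = $default->documentElement->childNodes->length;
$noBlanksCount = $noBlanks->documentElement->childNodes->length;

echo $defaultCount, "\n";  // 3
echo $noBlanksCount, "\n"; // 1
```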
Probably:
LIBXML_NOBLANKS Removes all insignificant whitespace within the document.
However, I found no clear sign that this (borrowed) description fully matches what the PHP documentation says:
LIBXML_NOBLANKS (int)
Remove blank nodes
That made me wonder, and I guess the reference here is to libxml2:
XML_PARSE_NOBLANKS = 256 : remove blank nodes
I found more Q&A at https://stackoverflow.com/a/18521956/367456, and it seems that this is probably different from insignificant whitespace.

SimpleXML: handle CDATA tag presence in node value

I need to preserve the <![CDATA[]]> wrapper when I parse an XML document.
For example, I have this node:
<Dest><![CDATA[some text...]]></Dest>
The XML file may also contain nodes without CDATA.
I then process all the nodes in a loop:
$dom = simplexml_load_file($path);
foreach($dom->children() as $child) {
$nodeValue = (string) $child;
}
As a result, when I process the node in the example above, I get $nodeValue = some text...
But I need $nodeValue = <![CDATA[some text...]]>
Is there any way to do this?
File example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
<Params>
<param>text</param>
<anotherParam>text</anotherParam>
</Params>
<Content>
<String>
<Source>some another text</Source>
<Dest>some another text 2</Dest>
</String>
<String>
<Source>some another text 3</Source>
<Dest><![CDATA[some text...]]></Dest>
</String>
</Content>
</Root>
As far as a parser like SimpleXML is concerned, the <![CDATA[ is not part of the text content of the XML element, it's just part of the serialization of that content. A similar confusion is discussed here: PHP, SimpleXML, decoding entities in CDATA
What you need to look at is the "inner XML" of that element, which is tricky in SimpleXML (->asXML() will give you the "outer XML", e.g. <Dest><![CDATA[some text...]]></Dest>).
Your best bet here is to use the DOM, which gives you more access to the detailed structure of the document rather than just its content, and so distinguishes "text nodes" from "CDATA nodes". However, it's worth double-checking that you do actually need this, as for 99.9% of use cases you shouldn't care whether somebody sent you <foo>bar &amp; baz</foo> or <foo><![CDATA[bar & baz]]></foo>, since by definition they represent the same string.
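As a rough sketch of the DOM route, the "inner XML" of each element can be rebuilt by serializing its child nodes one by one with DOMDocument::saveXML(), which writes CDATA sections back out with their markers. The trimmed-down XML and the $values array are assumptions made for this illustration.

```php
<?php
$xml = <<<'XML'
<Root>
<String><Dest>some another text 2</Dest></String>
<String><Dest><![CDATA[some text...]]></Dest></String>
</Root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$values = [];
foreach ($doc->getElementsByTagName('Dest') as $dest) {
    $inner = '';
    // Serialize each child node; CDATA sections keep their <![CDATA[ ]]> markers.
    foreach ($dest->childNodes as $child) {
        $inner .= $doc->saveXML($child);
    }
    $values[] = $inner;
}

print_r($values);
```

Here the first entry is the plain text and the second keeps its CDATA wrapper. Plain text nodes are serialized with XML escaping, so a literal & would come back as &amp;.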
If you want to add CDATA to every element that doesn't have it, you can do this:
$dom = simplexml_load_file($path);
foreach ($dom->children() as $child) {
    if (strpos((string) $child, 'CDATA') !== false) {
        $nodeValue = (string) $child;
    } else {
        $nodeValue = "<![CDATA[" . ((string) $child) . "]]>";
    }
}
With that you will have $nodeValue = '<![CDATA[some text...]]>'.
If you only want the elements that contain CDATA, you can do this:
$dom = simplexml_load_file($path);
foreach ($dom->children() as $child) {
    if (strpos((string) $child, 'CDATA') !== false) {
        $nodeValue = (string) $child;
    }
}
Be aware that the string cast never contains the <![CDATA[ ]]> markers themselves (see the answer above), so this check only matches when the literal word CDATA appears in the text.
If you want the elements without CDATA and want to add it, you can do this:
$dom = simplexml_load_file($path);
foreach ($dom->children() as $child) {
    if (strpos((string) $child, 'CDATA') === false) {
        $nodeValue = "<![CDATA[" . ((string) $child) . "]]>";
    }
}
For a node whose text is some another text 3, you will have $nodeValue = '<![CDATA[some another text 3]]>'.

how to differentiate these two xml tags with childnodes

I have two tags in my sample XML, as below:
<EmailAddresses>2</EmailAddresses>
<EmailAddresses>
<string>Allen.Patterson01#fantasyisland.com</string>
<string>Allen.Patterson12#fantasyisland.com</string>
</EmailAddresses>
How can I differentiate these two XML tags based on their child nodes? That is, how can I check with PHP DOM that the first tag has no child elements and the second one has?
Hope this meets your requirement. Just copy, paste, and run it, then change or add whatever logic you need.
<?php
$xmlstr = <<<XML
<?xml version='1.0' standalone='yes'?>
<email>
<EmailAddresses>2</EmailAddresses>
<EmailAddresses>
<string>Allen.Patterson01#fantasyisland.com</string>
<string>Allen.Patterson12#fantasyisland.com</string>
</EmailAddresses>
</email>
XML;
$email = new SimpleXMLElement($xmlstr);
foreach ($email as $key => $value) {
    // count() gives the number of child elements; > 0 also covers a single address
    if (count($value) > 0) {
        var_dump($value);
        //write your logic to process email strings
    } else {
        var_dump($value);
        // count of emails
    }
}
?>
You can use ->getElementsByTagName( 'string' ):
foreach( $dom->getElementsByTagName( 'EmailAddresses' ) as $node )
{
    if( $node->getElementsByTagName( 'string' )->length )
    {
        // Code for <EmailAddresses><string/></EmailAddresses>
    }
    else
    {
        // Code for <EmailAddresses>2</EmailAddresses>
    }
}
The text 2 counts as a child node of <EmailAddresses>, so with your XML ->hasChildNodes() always returns true.
You have this problem because of the odd design of your XML structure.
If you have no particular reason to keep this XML syntax, I suggest using only one tag:
<EmailAddresses count="2">
<string>Allen.Patterson01#fantasyisland.com</string>
<string>Allen.Patterson12#fantasyisland.com</string>
</EmailAddresses>
Xpath allows you to do that.
$xml = <<<'XML'
<xml>
<EmailAddresses>2</EmailAddresses>
<EmailAddresses>
<string>Allen.Patterson01#fantasyisland.com</string>
<string>Allen.Patterson12#fantasyisland.com</string>
</EmailAddresses>
</xml>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
var_dump(
$xpath->evaluate('number(//EmailAddresses[not(*)])')
);
foreach ($xpath->evaluate('//EmailAddresses/string') as $address) {
var_dump($address->textContent);
}
Output:
float(2)
string(35) "Allen.Patterson01#fantasyisland.com"
string(35) "Allen.Patterson12#fantasyisland.com"
The Expressions
Fetch the first EmailAddresses node without any element node child as a number.
Select any EmailAddresses element node:
//EmailAddresses
That does not contain another element node as child node:
//EmailAddresses[not(*)]
Cast the first of the fetched EmailAddresses nodes into a number:
number(//EmailAddresses[not(*)])
Fetch the string child nodes of the EmailAddresses element nodes.
Select any EmailAddresses element node:
//EmailAddresses
Get their string child nodes:
//EmailAddresses/string
In your example the first EmailAddresses element seems to hold duplicate information, stored in a weird way. Xpath can count nodes, too. The expression count(//EmailAddresses/string) would return the number of nodes.
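As a short sketch of the count() expression, reusing the sample XML from this answer (the variable $count is only there for illustration):

```php
<?php
$xml = <<<'XML'
<xml>
<EmailAddresses>2</EmailAddresses>
<EmailAddresses>
<string>Allen.Patterson01#fantasyisland.com</string>
<string>Allen.Patterson12#fantasyisland.com</string>
</EmailAddresses>
</xml>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// count() returns an Xpath number, which PHP exposes as a float.
$count = $xpath->evaluate('count(//EmailAddresses/string)');
var_dump($count); // float(2)
```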
