SimpleXML how to get line number of a node? - php

I'm using this in SimpleXML and PHP:
foreach ($xml->children() as $node) {
echo $node->attributes('namespace')->id;
}
That prints the id attribute of all nodes (using a namespace).
But now I want to know the line number that $node is located in the XML file.
I need the line number, because I'm analyzing the XML file, and returning to the user information of possible issues to resolve them. So I need to say something like: "Here you have an error at line X". I'm sure that the XML file would be in a standard format that will have enough line breaks for this to be useful.

It is possible with DOM. DOMNode provides the function getLineNo().
DOM
$xml = <<<'XML'
<foo>
<bar/>
</foo>
XML;
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
var_dump(
$xpath->evaluate('//bar[1]')->item(0)->getLineNo()
);
Output:
int(2)
SimpleXML
SimpleXML is based on DOM, so you can convert SimpleXMLElement objects to DOMElement objects.
$element = new SimpleXMLElement($xml);
$node = dom_import_simplexml($element->bar);
var_dump($node->getLineNo());
And yes, most of the time if you have a problem with SimpleXML, the answer is to use DOM.
XMLReader
XMLReader has the line numbers internally, but here is no direct method to access them. Again you will have to convert it into a DOMNode. It works because both use libxml2. This will read the node and all its descendants into memory, so be careful with it.
$reader = new XMLReader();
$reader->open('data://text/xml;base64,'.base64_encode($xml));
while ($reader->read()) {
if ($reader->nodeType == XMLReader::ELEMENT && $reader->name== 'bar') {
var_dump($reader->expand()->getLineNo());
}
}

Related

Read XML File with DOMDocument in php

I want to read this xml document:
<?xml version="1.0" encoding="UTF-8"?>
<tns:getPDMNumber xmlns:tns="http://www.testgroup.com/TestPDM" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.testgroup.com/TestPDM getPDMNumber.xsd ">
<tns:getPDMNumberResponse>
<tns:requestID>22222</tns:requestID>
<tns:pdmNumber>654321</tns:pdmNumber>
<tns:responseCode>0</tns:responseCode>
</tns:getPDMNumberResponse>
</tns:getPDMNumber>
I tried it this way:
$dom->load('response/17_getPDMNumberResponse.xml');
$nodes = $dom->getElementsByTagName("tns:requestID");
//$nodes = $dom->getElementsByTagName("tns:getPDMNumber");
//$nodes = $dom->getElementsByTagName("tns:getPDMNumberResponse");
foreach($nodes as $node)
{
$response=$node->getElementsByTagName("tns:getPDMNumber");
foreach($response as $info)
{
$test = $info->getElementsByTagName("tns:pdmNumber");
$pdm = $test->nodeValue;
}
}
the code never runs into the foreach loop.
Only for clarification my goal is to read the "tns:pdmNumber" node.
Have anybody a idea?
EDIT: I have also tried the commited lines.
The XML uses a namespace, so you should use the namespace aware methods. They have the suffix _NS.
$tns = 'http://www.testgroup.com/TestPDM';
$document = new DOMDocument();
$document->loadXml($xml);
foreach ($document->getElementsByTagNameNS($tns, "pdmNumber") as $node) {
var_dump($node->textContent);
}
Output:
string(6) "654321"
A better option is to use Xpath expression. They allow a more comfortable access to DOM nodes. In this case you have to register a prefix for the namespace that you can use in the Xpath expression:
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$xpath->registerNamespace('t', 'http://www.testgroup.com/TestPDM');
var_dump(
$xpath->evaluate('string(/t:getPDMNumber/t:getPDMNumberResponse/t:pdmNumber)')
);
This:
$nodes = $dom->getElementsByTagName("tns:requestID");
you find all the requestID nodes, and try to loop on them. That's fine, but then you use that node as a basis to find any getPDMNumber nodes UNDER the requestID - but there's nothing - requestID is a terminal node. So
$response=$node->getElementsByTagName("tns:getPDMNumber");
finds nothing, and the inner loop has nothing to do.
It's like saying "Start digging a hole until you reach china. Once you reach China, keep digging until you reach Australia". But you can't keep digging - you've reached the "bottom", and the only thing deeper than China would be going into orbit.

In DomDocument, reuse of DOMXpath, it is stable?

I am using the function below, but not sure about it is always stable/secure... Is it?
When and who is stable/secure to "reuse parts of the DOMXpath preparing procedures"?
To simlify the use of the XPath query() method we can adopt a function that memorizes the last calls with static variables,
function DOMXpath_reuser($file) {
static $doc=NULL;
static $docName='';
static $xp=NULL;
if (!$doc)
$doc = new DOMDocument();
if ($file!=$docName) {
$doc->loadHTMLFile($file);
$xp = NULL;
}
if (!$xp)
$xp = new DOMXpath($doc);
return $xp; // ??RETURNED VALUES ARE ALWAYS STABLE??
}
The present question is similar to this other one about XSLTProcessor reuse.
In both questions the problem can be generalized for any language or framework that use LibXML2 as DomDocument implementation.
There are another related question: How to "refresh" DOMDocument instances of LibXML2?
Illustrating
The reuse is very commom (examples):
$f = "my_XML_file.xml";
$elements = DOMXpath_reuser($f)->query("//*[#id]");
// use elements to get information
$elements = DOMXpath_reuser($f)->("/html/body/div[1]");
// use elements to get information
But, if you do something like removeChild, replaceChild, etc. (example),
$div = DOMXpath_reuser($f)->query("/html/body/div[1]")->item(0); //STABLE
$div->parentNode->removeChild($div); // CHANGES DOM
$elements = DOMXpath_reuser($f)->query("//div[#id]"); // INSTABLE! !!
extrange things can be occur, and the queries not works as expected!!
When (what DOMDocument methods affect XPath?)
Why we can not use something like normalizeDocument to "refresh DOM" (exist?)?
Only a "new DOMXpath($doc);" is allways secure? need to reload $doc also?
DOMXpath is affected by the load*() methods on DOMDocument. After loading a new xml or html, you need to recreate the DOMXpath instance:
$xml = '<xml/>';
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
var_dump($xpath->document === $dom); // bool(true)
$dom->loadXml($xml);
var_dump($xpath->document === $dom); // bool(false)
In DOMXpath_reuser() you store a static variable and recreate the xpath depending on the file name. If you want to reuse an Xpath object, suggest extending DOMDocument. This way you only need pass the $dom variable around. It would work with a stored xml file as well with xml string or a document your are creating.
The following class extends DOMDocument with an method xpath() that always returns a valid DOMXpath instance for it. It stores and registers the namespaces, too:
class MyDOMDocument
extends DOMDocument {
private $_xpath = NULL;
private $_namespaces = array();
public function xpath() {
// if the xpath instance is missing or not attached to the document
if (is_null($this->_xpath) || $this->_xpath->document != $this) {
// create a new one
$this->_xpath = new DOMXpath($this);
// and register the namespaces for it
foreach ($this->_namespaces as $prefix => $namespace) {
$this->_xpath->registerNamespace($prefix, $namespace);
}
}
return $this->_xpath;
}
public function registerNamespaces(array $namespaces) {
$this->_namespaces = array_merge($this->_namespaces, $namespaces);
if (isset($this->_xpath)) {
foreach ($namespaces as $prefix => $namespace) {
$this->_xpath->registerNamespace($prefix, $namespace);
}
}
}
}
$xml = <<<'ATOM'
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Test</title>
</feed>
ATOM;
$dom = new MyDOMDocument();
$dom->registerNamespaces(
array(
'atom' => 'http://www.w3.org/2005/Atom'
)
);
$dom->loadXml($xml);
// created, first access
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));
$dom->loadXml($xml);
// recreated, connection was lost
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));
The DOMXpath class (instead of XSLTProcessor in your another question) use reference to given DOMDocument object in contructor. DOMXpath create libxml context object based on given DOMDocument and save it to internal class data. Besides libxml context its saves references to originalDOMDocument` given in contructor arguments.
What that means:
Part of sample from ThomasWeinert answer:
var_dump($xpath->document === $dom); // bool(true)
$dom->loadXml($xml);
var_dump($xpath->document === $dom); // bool(false)
gives false after load becouse of $dom already holds pointer to new libxml data but DOMXpath holds libxml context for $dom before load and pointer to real document after load.
Now about query works
If it should return XPATH_NODESET (as in your case) its make a node copy - node by node iterating throw detected node set(\ext\dom\xpath.c from 468 line). Copy but with original document node as parent. Its means that you can modify result but this gone away you XPath and DOMDocument connection.
XPath results provide a parentNode memeber that knows their origin:
for attribute values, parentNode returns the element that carries them. An example is //foo/#attribute, where the parent would be a foo Element.
for the text() function (as in //text()), it returns the element that contains the text or tail that was returned.
note that parentNode may not always return an element. For example, the XPath functions string() and concat() will construct strings that do not have an origin. For them, parentNode will return None.
So,
There is no any reasons to cache XPath. It do not anything besides xmlXPathNewContext (just allocate lightweight internal struct).
Each time your modify your DOMDocument (removeChild, replaceChild, etc.) your should recreate XPath.
We can not use something like normalizeDocument to "refresh DOM" because of it change internal document structure and invalidate xmlXPathNewContext created in Xpath constructor.
Only "new DOMXpath($doc);" is allways secure? Yes, if you do not change $doc between Xpath usage. Need to reload $doc also - no, because of it invalidated previously created xmlXPathNewContext.
(this is not a real answer, but a consolidation of comments and answers posted here and related questions)
This new version of the question's DOMXpath_reuser function contains the #ThomasWeinert suggestion (for avoid DOM changes by external re-load) and an option $enforceRefresh to workaround the problem of instability (as related question shows the programmer must detect when).
function DOMXpath_reuser_v2($file, $enforceRefresh=0) { //changed here
static $doc=NULL;
static $docName='';
static $xp=NULL;
if (!$doc)
$doc = new DOMDocument();
if ( $file!=$docName || ($xp && $doc !== $xp->document) ) { // changed here
$doc->load($file);
$xp = NULL;
} elseif ($enforceRefresh==2) { // add this new refresh mode
$doc->loadXML($doc->saveXML());
$xp = NULL;
}
if (!$xp || $enforceRefresh==1) //changed here
$xp = new DOMXpath($doc);
return $xp;
}
When must to use $enforceRefresh=1 ?
... perhaps an open problem, only little tips and clues...
when DOM submited to setAttribute, removeChild, replaceChild, etc.
...? more cases?
When must to use $enforceRefresh=2 ?
... perhaps an open problem, only little tips and clues...
when DOM was subject to indexes inconsistences, etc. See this question/solution.
...? more cases?

PHP: How to replace existing XML node with XMLWriter

I am editing an XML file and need to populate it with data from a database. DOM works but it is unable to scale to several hundreds of MBs so I am now using XMLReader and XMLWriter which can write the very large XML file. Now, I need to select a node and add children to it but I can't find a method to do it, can someone help me out?
I can find the node I need to add children to by:
if ($xmlReader->nodeType == XMLReader::ELEMENT && $xmlReader->name == 'data')
{
echo 'data was found';
$data = $xmlReader->getAttribute('data');
}
How do I now add more nodes/children to the found node? Again for clarification, this code will read and find the node, so that is done. What is required is how to modify the found node specifically? Is there a way with XMLWriter for which I have not found a method that will do that after reading through the class documentation?
Be default the expanded nodes (missing in your question)
$node = $xmlReader->expand();
are not editable with XMLReader (makes sense by that name). However you can make the specific DOMNode editable if you import it into a new DOMDocument:
$doc = new DOMDocument();
$node = $doc->importNode($node);
You can then perform any DOM manipulation the DOM offers, e.g. for example adding a text-node:
$textNode = $doc->createTextNode('New Child TextNode added :)');
$node->appendChild($textNode);
If you prefer SimpleXML for manipulation, you can also import the node into SimpleXML after it has been imported into the DOMDocument:
$xml = simplexml_import_dom($node);
An example from above making use of my xmlreader-iterators that just offer me some nicer interface to XMLReader:
$reader = new XMLReader();
$reader->open($xmlFile);
$elements = new XMLElementIterator($reader, 'data');
foreach ($elements as $element)
{
$node = $element->expand();
$doc = new DOMDocument();
$node = $doc->importNode($node, true);
$node->appendChild($doc->createTextNode('New Child TextNode added :)'));
echo $doc->saveXML($node), "\n";
}
With the following XML document:
<xml>
<data/>
<boo>
<blur>
<data/>
<data/>
</blur>
</boo>
<data/>
</xml>
The small example code above produces the following output:
<data>New Child TextNode added :)</data>
<data>New Child TextNode added :)</data>
<data>New Child TextNode added :)</data>
<data>New Child TextNode added :)</data>

Get instance of nodes by their name in SimpleXML (PHP)

I'd like to search for nodes with the same node name in a SimpleXML Object no matter how deep they are nested and create an instance of them as an array.
In the HTML DOM I can do that with JavaScript by using getElementsByTagName(). Is there a way to do that in PHP as well?
Yes use xpath
$xml->xpath('//div');
Here $xml is your SimpleXML object.
In this example you will get array of all 'div' elements
$fname = dirname(__FILE__) . '\\xml\\crRoll.xml';
$dom = new DOMDocument;
$dom->load($fname, LIBXML_DTDLOAD|LIBXML_DTDATTR);
$root = $dom->documentElement;
$xpath = new DOMXpath($dom);
$xpath->registerNamespace('cr', "http://www.w3.org/1999/xhtml");
$candidateNodes = $xpath->query("//cr:break");
foreach ($candidateNodes as $child) {
$max = $child->getAttribute('tstamp');
}
This finds all the BREAK nodes (tstamp attr) using XPath ...
Only on DOMDocument::getElementsByTagName,
however, you can import/export SimpleXML into DOMDocument,
or simply use DOMDocument to parse XML.
Another answer mentioned about Xpath,
it will return duplication of node, if you have something like :-
<div><div>1</div></div>

XML DOMDocument optimization

I have a 5MB XML file
I'm using the following code to get all nodeValue
$dom = new DomDocument('1.0', 'UTF-8');
if(!$dom->load($url))
return;
$games = $dom->getElementsByTagName("game");
foreach($games as $game)
{
}
This takes 76 seconds and there are around 2000 games tag. Is there any optimization or other solution to get the data?
I once wrote a blog article about loading huge XML files with XMLReader - you probably can use some of it.
Using DOM or SimpleXML is no option, since both load the whole document into memory.
You shouldn't use the Document Object Model on large XML files, it is intended for human readable documents, not big datasets!
If you want fast access you should use XMLReader or SimpleXML.
XMLReader is ideal for parsing whole documents, and SimpleXML has a nice XPath function for retreiving data quickly.
For XMLReader you can use the following code:
<?php
// Parsing a large document with XMLReader with Expand - DOM/DOMXpath
$reader = new XMLReader();
$reader->open("tooBig.xml");
while ($reader->read()) {
switch ($reader->nodeType) {
case (XMLREADER::ELEMENT):
if ($reader->localName == "game") {
$node = $reader->expand();
$dom = new DomDocument();
$n = $dom->importNode($node,true);
$dom->appendChild($n);
$xp = new DomXpath($dom);
$res = $xp->query("/game/title"); // this is an example
echo $res->item(0)->nodeValue;
}
}
}
?>
The above will output all game titles (assuming you have /game/title XML structure).
For SimpleXML you can use:
$xml = file_get_contents($url);
$sxml = new SimpleXML($xml);
$games = $sxml->xpath('/game'); // returns an array of SXML nodes
foreach ($games as $game)
{
print $game->nodeValue;
}
You can use DOMXpath for querying, which is way faster than the DOMDocument:: getElementsByTagName() method.
<?php
$xpath = new \DOMXpath($dom);
$games = $xpath->query("//game");
foreach ($games as $game) {
// Code here
}
In one of my tests with a fairly large file, this approach took < 1 sec to complete the iteration of 24k elements, whereas the DOMDocument:: getElementsByTagName() method was taking ~27 min (and the time took to iterate to the next object was exponential).

Categories