PHP - From DOMDocument to XMLReader [duplicate] - php

This question already has answers here:
How to use XMLReader in PHP?
(7 answers)
Closed 6 years ago.
I have a PHP function that parses an XML file using DOMDocument, which I am proficient with. I want to do the same with XMLReader, but I don't understand how XMLReader works.
I want to use XMLReader because it is a lightweight tool.
Feel free to ask me other questions about my issue.
function getDatas($filepath)
{
$doc = new DOMDocument();
$xmlfile = file_get_contents($filepath);
$doc->loadXML($xmlfile);
$xmlcars = $doc->getElementsByTagName('car');
$mycars= [];
foreach ($xmlcars as $xmlcar) {
$car = new Car();
$car->setName(
$xmlcar->getElementsByTagName('id')->item(0)->nodeValue
);
$car->setBrand(
$xmlcar->getElementsByTagName('brand')->item(0)->nodeValue
);
array_push($mycars, $car);
}
return $mycars;
}
PS : I'm not a senior PHP dev.
Ahah Thanks.

This is a good example from this topic, I hope it helps you to understand.
$z = new XMLReader;
$z->open('data.xml');
$doc = new DOMDocument;
// move to the first <product /> node
while ($z->read() && $z->name !== 'product');
// now that we're at the right depth, hop to the next <product/> until the end of the tree
while ($z->name === 'product')
{
// either one should work
//$node = new SimpleXMLElement($z->readOuterXML());
$node = simplexml_import_dom($doc->importNode($z->expand(), true));
// now you can use $node without going insane about parsing
var_dump($node->element_1);
// go to next <product />
$z->next('product');
}

XMLReader does not, as far as I can tell, have an equivalent way of filtering by element name. So the closest equivalent would be, as mentioned in rvbarreto's answer, to iterate through all elements using XMLReader->read() and grab the info you need when the element name matches the one you want.
Alternatively, you might want to check out SimpleXML, which supports filtering using XPath expressions, as well as seeking to a node in the XML using the element structure like they are sub-objects of the main object. For instance, instead of using:
$xmlcar->getElementsByTagName('id')->item(0)->nodeValue;
You would use:
$xmlcar->id[0];
Assuming all of your car elements are at the first level of the XML document tree, the following should work as an example:
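function getDatas($filepath) {
    // The third argument TRUE tells SimpleXMLElement that $filepath is a path or URL, not XML text
    $carsData = new SimpleXMLElement($filepath, 0, TRUE);
    $mycars = [];
    foreach ($carsData->car as $xmlcar) {
        $car = new Car();
        // Cast to string so the Car holds plain values rather than SimpleXMLElement objects
        $car->setName((string) $xmlcar->id[0]);
        $car->setBrand((string) $xmlcar->brand[0]);
        $mycars[] = $car;
    }
    return $mycars;
}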
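For the XMLReader version that the question asks for, the loop from the first answer can be adapted to the car data. The following is only a sketch under a few assumptions: the Car class is a hypothetical stand-in for the asker's class, getDatasWithReader is an illustrative name, the car elements are siblings, and the XML is passed as a string (call $reader->open($filepath) instead of $reader->XML($xml) to read from a file).

```php
<?php
// Hypothetical stand-in for the Car class used in the question.
class Car
{
    private $name;
    private $brand;
    public function setName($name) { $this->name = $name; }
    public function setBrand($brand) { $this->brand = $brand; }
    public function getName() { return $this->name; }
    public function getBrand() { return $this->brand; }
}

function getDatasWithReader(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $doc = new DOMDocument();
    $mycars = [];

    // Move to the first <car> element.
    while ($reader->read() && $reader->name !== 'car');

    // Expand one <car> at a time, then jump to the next sibling <car>.
    while ($reader->name === 'car') {
        $node = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $car = new Car();
        $car->setName((string) $node->id);
        $car->setBrand((string) $node->brand);
        $mycars[] = $car;
        $reader->next('car');
    }

    $reader->close();
    return $mycars;
}
```

Only one car element is expanded into memory at a time, which is the main reason to prefer this over loading the whole file with DOMDocument.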

Related

Get children of an XML node with unknown structure

I'm trying to modify an XML document which contains some node that I can identify by name. For example, I might want to modify the <abc>some text</abc> node in a document (which I can identify by the tag name abc)
The problem I'm facing currently is that I don't know the exact structure of this document. I don't know what the root element is called and I don't know which children might contain this <abc> node.
I tried using SimpleXML<...> but this does not allow me to read arbitrary element children:
$xml = new SimpleXMLElement($xmlString);
foreach ($xml->children() as $child) {
// code here doesnt execute
}
I'm considering building my own XML parser which would have this simple functionality, but I cannot believe that simply iterating over all child nodes of a node (eventually recursively) is not something that is supported by PHP. Hopefully someone can tell me what I'm missing. Thanks!
Use DOMDocument
$dom = new DOMDocument();
@$dom->loadXML($xmlString);
foreach($dom->getElementsByTagName('item') as $item) {
if ($item->hasChildNodes()) {
foreach($item->childNodes as $i) {
// your code here
}
}
}
I found the solution moments after posting, after being stuck on it for a while..
SimpleXML<...> does not have these features, but the DOMDocument and associated classes do;
$dom = new DOMDocument();
$dom->loadXml($xmlString);
foreach($dom->childNodes as $child) {
if ($child->nodeName == "abc") {
$child->textContent = "modified text content";
}
}
Documentation for future reference, here: http://php.net/manual/en/book.dom.php
Thanks for your help.
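Since the question says the node may sit anywhere in an unknown structure, a variant that searches the whole tree may be closer to what was asked. This is a minimal sketch, assuming the target is identified only by tag name; replaceTextByTagName is an illustrative name, not something from the answers above.

```php
<?php
function replaceTextByTagName(string $xml, string $tag, string $text): string
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);

    // getElementsByTagName() searches all descendants, whatever the root is called.
    // The live list is copied to an array first, so edits cannot disturb the iteration.
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
        $node->nodeValue = $text;
    }

    return $dom->saveXML();
}
```

For example, replaceTextByTagName($xmlString, 'abc', 'modified text content') would rewrite every abc element regardless of depth.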

Using PHP DOM to extract a certain node value, only if another node value is > 0

I am using PHP DOM to parse XML from another platform, extract certain data from it, and upload it to my own platform. However, I am stuck when it comes to extracting a certain node value only if another node value is greater than 0 within the same 'row' element. In the example below, I would like to iterate over the XML and pull out the 'affcustomid' value only if the CPACommission node value is greater than 0. Does anyone have any ideas how I can do this? The code below is a shortened version; in reality, I would get back hundreds of rows in the same format.
<row>
<rowid>1</rowid>
<currencysymbol>€</currencysymbol>
<totalrecords>2145</totalrecords>
<affcustomid>11159_4498302</affcustomid>
<period>7/1/2014</period>
<impressions>0</impressions>
<clicks>1</clicks>
<clickthroughratio>0</clickthroughratio>
<downloads>1</downloads>
<downloadratio>1</downloadratio>
<newaccountratio>1</newaccountratio>
<newdepositingacc>1</newdepositingacc>
<newaccounts>1</newaccounts>
<firstdepositcount>1</firstdepositcount>
<activeaccounts>1</activeaccounts>
<activedays>1</activedays>
<newpurchases>12.4948</newpurchases>
<purchaccountcount>1</purchaccountcount>
<wageraccountcount>1</wageraccountcount>
<avgactivedays>1</avgactivedays>
<netrevenueplayer>11.8701</netrevenueplayer>
<Deposits>12.4948</Deposits>
<Bonus>0</Bonus>
<NetRevenue>11.8701</NetRevenue>
<TotalBetsHands>4</TotalBetsHands>
<Product1Bets>4</Product1Bets>
<Product1NetRevenue>11.8701</Product1NetRevenue>
<Product1Commission>30</Product1Commission>
<Commission>0</Commission>
<CPACommission>30</CPACommission>
</row>
Thanks in advance!
Mark
The easiest way to fetch data from an XML DOM is Xpath:
$dom = new DOMDocument();
$dom->load('file.xml');
$xpath = new DOMXpath($dom);
var_dump(
$xpath->evaluate('string(//row[CPACommission > 0]/affcustomid)')
);
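Note that string() returns only the first match. With hundreds of rows, it may be more useful to collect every matching value. The following is a sketch, assuming the row elements can appear anywhere in the document; affIdsWithCpa is an illustrative name.

```php
<?php
function affIdsWithCpa(string $xml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);

    $ids = [];
    // The predicate keeps only rows whose CPACommission is numerically greater than 0.
    foreach ($xpath->query('//row[CPACommission > 0]/affcustomid') as $node) {
        $ids[] = $node->textContent;
    }
    return $ids;
}
```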
It would be easier using SimpleXML:
$doc = simplexml_load_file('file.xml');
foreach ($doc->row AS $row) {
if ((float) $row->CPACommission > 0) { // cast explicitly so fractional values such as 0.5 compare correctly
echo $row->affcustomid;
}
}
But if you still need to use DOMDocument:
$doc = new DOMDocument();
$doc->load('file.xml');
foreach ($doc->getElementsByTagName('row') AS $row) {
if($row->getElementsByTagName('CPACommission')->item(0)->textContent > 0){
echo $row->getElementsByTagName('affcustomid')->item(0)->textContent;
}
}

In DOMDocument, is reuse of DOMXpath stable?

I am using the function below, but I am not sure whether it is always stable/safe... Is it?
When and how is it stable/safe to "reuse parts of the DOMXpath preparing procedures"?
To simplify the use of the XPath query() method, we can adopt a function that memorizes the last calls with static variables:
function DOMXpath_reuser($file) {
static $doc=NULL;
static $docName='';
static $xp=NULL;
if (!$doc)
$doc = new DOMDocument();
if ($file!=$docName) {
$doc->loadHTMLFile($file);
$docName = $file; // remember which file is loaded, otherwise every call reloads it
$xp = NULL;
}
if (!$xp)
$xp = new DOMXpath($doc);
return $xp; // ??RETURNED VALUES ARE ALWAYS STABLE??
}
The present question is similar to this other one about XSLTProcessor reuse.
In both questions the problem can be generalized to any language or framework that uses LibXML2 as its DOMDocument implementation.
There is another related question: How to "refresh" DOMDocument instances of LibXML2?
Illustrating
Reuse is very common (examples):
$f = "my_XML_file.xml";
$elements = DOMXpath_reuser($f)->query("//*[@id]");
// use elements to get information
$elements = DOMXpath_reuser($f)->query("/html/body/div[1]");
// use elements to get information
But if you do something like removeChild, replaceChild, etc. (example),
$div = DOMXpath_reuser($f)->query("/html/body/div[1]")->item(0); // STABLE
$div->parentNode->removeChild($div); // CHANGES DOM
$elements = DOMXpath_reuser($f)->query("//div[@id]"); // UNSTABLE!
strange things can occur, and the queries do not work as expected!
When does this happen (which DOMDocument methods affect XPath)?
Why can we not use something like normalizeDocument to "refresh" the DOM (does such a method exist)?
Is only "new DOMXpath($doc);" always safe? Does $doc need to be reloaded as well?
DOMXpath is affected by the load*() methods on DOMDocument. After loading a new xml or html, you need to recreate the DOMXpath instance:
$xml = '<xml/>';
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
var_dump($xpath->document === $dom); // bool(true)
$dom->loadXml($xml);
var_dump($xpath->document === $dom); // bool(false)
In DOMXpath_reuser() you store a static variable and recreate the xpath depending on the file name. If you want to reuse an Xpath object, I suggest extending DOMDocument. This way you only need to pass the $dom variable around. It would work with a stored XML file as well as with an XML string or a document you are creating.
The following class extends DOMDocument with a method xpath() that always returns a valid DOMXpath instance for it. It stores and registers the namespaces, too:
class MyDOMDocument
extends DOMDocument {
private $_xpath = NULL;
private $_namespaces = array();
public function xpath() {
// if the xpath instance is missing or not attached to the document
if (is_null($this->_xpath) || $this->_xpath->document != $this) {
// create a new one
$this->_xpath = new DOMXpath($this);
// and register the namespaces for it
foreach ($this->_namespaces as $prefix => $namespace) {
$this->_xpath->registerNamespace($prefix, $namespace);
}
}
return $this->_xpath;
}
public function registerNamespaces(array $namespaces) {
$this->_namespaces = array_merge($this->_namespaces, $namespaces);
if (isset($this->_xpath)) {
foreach ($namespaces as $prefix => $namespace) {
$this->_xpath->registerNamespace($prefix, $namespace);
}
}
}
}
$xml = <<<'ATOM'
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Test</title>
</feed>
ATOM;
$dom = new MyDOMDocument();
$dom->registerNamespaces(
array(
'atom' => 'http://www.w3.org/2005/Atom'
)
);
$dom->loadXml($xml);
// created, first access
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));
$dom->loadXml($xml);
// recreated, connection was lost
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));
The DOMXpath class (unlike XSLTProcessor in your other question) uses a reference to the DOMDocument object given in its constructor. DOMXpath creates a libxml context object based on the given DOMDocument and saves it in its internal class data. Besides the libxml context, it saves a reference to the original DOMDocument given in the constructor arguments.
What that means:
Part of the sample from ThomasWeinert's answer:
var_dump($xpath->document === $dom); // bool(true)
$dom->loadXml($xml);
var_dump($xpath->document === $dom); // bool(false)
gives false after the load because $dom already holds a pointer to the new libxml data, while DOMXpath holds the libxml context for $dom from before the load and a pointer to the real document after the load.
Now about how query works
If it should return XPATH_NODESET (as in your case), it makes a node copy, iterating node by node through the detected node set (\ext\dom\xpath.c from line 468). It is a copy, but with the original document node as parent. This means that you can modify the result, but doing so loses the connection between your XPath and the DOMDocument.
XPath results provide a parentNode member that knows their origin:
for attribute values, parentNode returns the element that carries them. An example is //foo/@attribute, where the parent would be a foo element.
for the text() function (as in //text()), it returns the element that contains the text or tail that was returned.
note that parentNode may not always return an element. For example, the XPath functions string() and concat() will construct strings that do not have an origin. For them, parentNode will return None.
So,
There is no reason to cache XPath. Its constructor does nothing besides xmlXPathNewContext (it just allocates a lightweight internal struct).
Each time you modify your DOMDocument (removeChild, replaceChild, etc.) you should recreate XPath.
We cannot use something like normalizeDocument to "refresh" the DOM, because it changes the internal document structure and invalidates the xmlXPathNewContext created in the Xpath constructor.
Is only "new DOMXpath($doc);" always safe? Yes, if you do not change $doc between Xpath usages. Does $doc need to be reloaded as well? No, because reloading invalidates the previously created xmlXPathNewContext.
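The advice to recreate the XPath object after changing the document can be sketched as follows. This is only an illustration of that advice, not a claim about when recreation is strictly required; divIdsAfterRemoval and the sample markup in the usage note are hypothetical.

```php
<?php
function divIdsAfterRemoval(string $xml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);

    // First query, then a change to the DOM.
    $xpath = new DOMXPath($dom);
    $div = $xpath->query('/html/body/div[1]')->item(0);
    $div->parentNode->removeChild($div);

    // Following the advice above: build a fresh DOMXPath after the modification.
    $xpath = new DOMXPath($dom);
    $ids = [];
    foreach ($xpath->query('//div[@id]') as $node) {
        $ids[] = $node->getAttribute('id');
    }
    return $ids;
}
```

Called with a body that holds two div elements with ids a and b, the function returns only the id of the div that was not removed.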
(This is not a real answer, but a consolidation of comments and answers posted here and in related questions.)
This new version of the question's DOMXpath_reuser function incorporates @ThomasWeinert's suggestion (to guard against DOM changes caused by an external reload) and an option $enforceRefresh to work around the instability problem (as the related question shows, the programmer must detect when it is needed).
function DOMXpath_reuser_v2($file, $enforceRefresh=0) { //changed here
static $doc=NULL;
static $docName='';
static $xp=NULL;
if (!$doc)
$doc = new DOMDocument();
if ( $file!=$docName || ($xp && $doc !== $xp->document) ) { // changed here
$doc->load($file);
$docName = $file; // remember which file is loaded, otherwise every call reloads it
$xp = NULL;
} elseif ($enforceRefresh==2) { // add this new refresh mode
$doc->loadXML($doc->saveXML());
$xp = NULL;
}
if (!$xp || $enforceRefresh==1) //changed here
$xp = new DOMXpath($doc);
return $xp;
}
When must $enforceRefresh=1 be used?
... perhaps an open problem; only a few tips and clues...
when the DOM has been subjected to setAttribute, removeChild, replaceChild, etc.
...? more cases?
When must $enforceRefresh=2 be used?
... perhaps an open problem; only a few tips and clues...
when the DOM has been subject to index inconsistencies, etc. See this question/solution.
...? more cases?

XML DOMDocument optimization

I have a 5MB XML file
I'm using the following code to get all nodeValue
$dom = new DomDocument('1.0', 'UTF-8');
if(!$dom->load($url))
return;
$games = $dom->getElementsByTagName("game");
foreach($games as $game)
{
}
This takes 76 seconds, and there are around 2000 game tags. Is there any optimization or other solution to get the data?
I once wrote a blog article about loading huge XML files with XMLReader - you probably can use some of it.
Using DOM or SimpleXML is no option, since both load the whole document into memory.
You shouldn't use the Document Object Model on large XML files, it is intended for human readable documents, not big datasets!
If you want fast access you should use XMLReader or SimpleXML.
XMLReader is ideal for parsing whole documents, and SimpleXML has a nice XPath function for retrieving data quickly.
For XMLReader you can use the following code:
<?php
// Parsing a large document with XMLReader with Expand - DOM/DOMXpath
$reader = new XMLReader();
$reader->open("tooBig.xml");
while ($reader->read()) {
switch ($reader->nodeType) {
case (XMLReader::ELEMENT):
if ($reader->localName == "game") {
$node = $reader->expand();
$dom = new DomDocument();
$n = $dom->importNode($node,true);
$dom->appendChild($n);
$xp = new DomXpath($dom);
$res = $xp->query("/game/title"); // this is an example
echo $res->item(0)->nodeValue;
}
}
}
?>
The above will output all game titles (assuming you have /game/title XML structure).
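A variation on the same idea wraps the reader in a generator, so the calling code can use a plain foreach while only one element is held in memory at a time. This is a sketch, not the answerer's code: streamElements is an illustrative name, and it assumes the wanted elements are not nested inside one another.

```php
<?php
function streamElements(string $file, string $name): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Cannot open $file");
    }
    $doc = new DOMDocument();

    // Advance to the first element with the requested name.
    while ($reader->read() && $reader->name !== $name);

    while ($reader->name === $name) {
        // Expand only this element and hand it out as a SimpleXMLElement.
        yield simplexml_import_dom($doc->importNode($reader->expand(), true));
        // next() skips the element's subtree instead of reading it node by node.
        $reader->next($name);
    }

    $reader->close();
}
```

Usage would look like foreach (streamElements('tooBig.xml', 'game') as $game) { echo $game->title; }, again assuming each game has a title child.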
For SimpleXML you can use:
$xml = file_get_contents($url);
$sxml = new SimpleXMLElement($xml);
$games = $sxml->xpath('//game'); // returns an array of SimpleXMLElement nodes
foreach ($games as $game)
{
print $game->title; // assuming each game has a title child, as above
}
You can use DOMXpath for querying, which is much faster than the DOMDocument::getElementsByTagName() method.
<?php
$xpath = new \DOMXpath($dom);
$games = $xpath->query("//game");
foreach ($games as $game) {
// Code here
}
In one of my tests with a fairly large file, this approach took less than 1 second to iterate over 24k elements, whereas the DOMDocument::getElementsByTagName() method took about 27 minutes (and the time taken to move to the next object grew exponentially).

Delete from xml according to attribute [duplicate]

This question already has answers here:
How do I remove a specific node using its attribute value in PHP XML Dom?
(4 answers)
Closed 9 years ago.
My xml file is named cgal.xml
<?xml version="1.0"?>
<item>
<name><![CDATA[<img src="event_pic/pic1.jpg" />CALENDAR]]></name>
<description title="NAM ELIT AGNA, ENDRERIT SIT AMET, TINCIDUNT AC." day="13" month="8" year="2010" id="15"><![CDATA[<img src="events/preview/13p1.jpg" /><font size="8" color="#6c6e74">In Gladiator, victorious general Maximus Decimus Meridias has been named keeper of Rome and its empire by dying emperor Marcus Aurelius, so that rule might pass from the Caesars back to the people and Senate. Marcus\' neglected and power-hungry son, Commodus, has other ideas, however. Escaping an ordered execution, Maximus hurries back to his home in Spain, too l</font>]]></description>
</item>
and my PHP function is:-
$doc = new DOMDocument;
$doc->formatOutput = TRUE;
$doc->preserveWhiteSpace = FALSE;
$doc->simplexml_load_file('../cgal.xml');
foreach($doc->description as $des)
{
if($des['id'] == $id) {
$dom=dom_import_simplexml($des);
$dom->parentNode->removeChild($dom);
}
}
$doc->save('../cgal.xml');
id is passed dynamically
I want to remove node according to id
You don't need to load or import the XML from SimpleXML. You can load it directly with DOM. Also, you can remove the node in the same way as in your question "updatin xml in php". Just change the XPath query to read
$query = sprintf('//description[@id="%s"]', $id);
or
$query = sprintf('/item/description[@id="%s"]', $id);
You can also use getElementById instead of an XPath, if your XML validates against a DTD or Schema that actually defines id as an XML ID. This is explained in Simplify PHP DOM XML parsing - how?.
Well, first off, there's no DomDocument::simplexml_load_file() method. Either use dom document, or don't... So using DomDocument:
$doc = new DomDocument();
$doc->formatOutput = true;
$doc->preserveWhiteSpace = true;
$doc->loadXml(file_get_contents('../cgal.xml'));
$element = $doc->getElementById($id);
if ($element) {
$element->parentNode->removeChild($element);
}
That should do it for you...
Edit:
As Gordon points out, that may not work (I tried it, it doesn't all the time)... So, you could either:
$xpath = new DomXpath($doc);
$elements = $xpath->query('//description[@id="'.$id.'"]');
foreach ($elements as $element) {
$element->parentNode->removeChild($element);
}
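Because $id is passed in dynamically, concatenating it into the XPath string can break the query if the value contains a quote. One way around that, sketched here under the assumption that the elements to delete are always description elements, is to select every description that has an id attribute and compare the value in PHP. removeDescriptionById is an illustrative name.

```php
<?php
function removeDescriptionById(string $xml, string $id): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    // query() returns a static list, so removing nodes inside the loop is safe.
    foreach ($xpath->query('//description[@id]') as $element) {
        if ($element->getAttribute('id') === $id) {
            $element->parentNode->removeChild($element);
        }
    }

    return $doc->saveXML();
}
```

To update the file from the question, the result could be written back with file_put_contents('../cgal.xml', removeDescriptionById(file_get_contents('../cgal.xml'), (string) $id)).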
Or, using SimpleXML, you can recurse over each node (less performant, but more flexible):
$simple = simplexml_load_file('../cgal.xml', 'SimpleXmlIterator');
$it = new RecursiveIteratorIterator($simple, RecursiveIteratorIterator::SELF_FIRST);
foreach ($it as $element) {
if (isset($element['id']) && $element['id'] == $id) {
$node = dom_import_simplexml($element);
$node->parentNode->removeChild($node);
}
}
