I have a 5 MB XML file.
I'm using the following code to read the nodeValue of every game element:
$dom = new DOMDocument('1.0', 'UTF-8');
if (!$dom->load($url)) {
    return;
}
$games = $dom->getElementsByTagName("game");
foreach ($games as $game) {
}
This takes 76 seconds, and there are around 2000 game tags. Is there any optimization or alternative way to get the data?
I once wrote a blog article about loading huge XML files with XMLReader - you probably can use some of it.
Using DOM or SimpleXML is not an option, since both load the whole document into memory.
You shouldn't use the Document Object Model on large XML files; it is intended for human-readable documents, not big datasets!
If you want fast access you should use XMLReader or SimpleXML.
XMLReader is ideal for parsing whole documents, and SimpleXML has a nice XPath function for retrieving data quickly.
For XMLReader you can use the following code:
<?php
// Parsing a large document with XMLReader, using expand() with DOM/DOMXPath
$reader = new XMLReader();
$reader->open("tooBig.xml");
while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            if ($reader->localName == "game") {
                $node = $reader->expand();
                $dom = new DOMDocument();
                $n = $dom->importNode($node, true);
                $dom->appendChild($n);
                $xp = new DOMXPath($dom);
                $res = $xp->query("/game/title"); // this is an example
                echo $res->item(0)->nodeValue;
            }
            break;
    }
}
?>
The above will output all game titles (assuming you have /game/title XML structure).
For SimpleXML you can use:
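$xml = file_get_contents($url);
$sxml = new SimpleXMLElement($xml); // the class is SimpleXMLElement, not SimpleXML
$games = $sxml->xpath('//game'); // returns an array of SimpleXMLElement nodes
foreach ($games as $game) {
    // SimpleXML has no nodeValue property; cast the element to a string instead
    print (string) $game;
}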
You can use DOMXPath for querying, which is way faster than the DOMDocument::getElementsByTagName() method.
<?php
$xpath = new \DOMXPath($dom);
$games = $xpath->query("//game");
foreach ($games as $game) {
    // Code here
}
In one of my tests with a fairly large file, this approach took under 1 second to iterate over 24k elements, whereas the DOMDocument::getElementsByTagName() method took about 27 minutes (and the time taken to advance to the next object grew exponentially).
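A likely reason for the gap, as far as I understand it, is that getElementsByTagName() returns a live list that is re-evaluated while you walk it, whereas DOMXPath::query() hands back a static result. The sketch below only illustrates the XPath route and is not a benchmark; the helper name gameTitles() and the sample games/game/title layout are assumptions made for the example.

```php
<?php
/**
 * Collect the text of every <title> under a <game> with one XPath query.
 * Hypothetical helper; the element names are assumed, not taken from the question.
 */
function gameTitles(DOMDocument $dom): array
{
    $xpath = new DOMXPath($dom);
    $titles = [];
    // query() returns a static DOMNodeList of the matches.
    foreach ($xpath->query('//game/title') as $title) {
        $titles[] = $title->nodeValue;
    }
    return $titles;
}

// Tiny stand-in document; a real file would be loaded with $dom->load($url).
$dom = new DOMDocument();
$dom->loadXML('<games><game><title>Chess</title></game><game><title>Go</title></game></games>');
$titles = gameTitles($dom);
echo implode(', ', $titles), "\n";
```

Selecting the title nodes directly also avoids a second lookup per game inside the loop.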
PHP developers here?
I have a PHP function that parses an XML file using DOMDocument (I'm proficient with this tool). I want to do the same with XMLReader, but I don't understand how XMLReader works.
I want to use XMLReader because it's a lightweight tool.
Feel free to ask me other questions about my issue.
function getDatas($filepath)
{
    $doc = new DOMDocument();
    $xmlfile = file_get_contents($filepath);
    $doc->loadXML($xmlfile);
    $xmlcars = $doc->getElementsByTagName('car');
    $mycars = [];
    foreach ($xmlcars as $xmlcar) {
        $car = new Car();
        $car->setName(
            $xmlcar->getElementsByTagName('id')->item(0)->nodeValue
        );
        $car->setBrand(
            $xmlcar->getElementsByTagName('brand')->item(0)->nodeValue
        );
        array_push($mycars, $car);
    }
    return $mycars;
}
PS : I'm not a senior PHP dev.
Ahah Thanks.
This is a good example from this topic, I hope it helps you to understand.
$z = new XMLReader;
$z->open('data.xml');
$doc = new DOMDocument;
// move to the first <product /> node
while ($z->read() && $z->name !== 'product');
// now that we're at the right depth, hop to the next <product/> until the end of the tree
while ($z->name === 'product')
{
    // either one should work
    //$node = new SimpleXMLElement($z->readOuterXML());
    $node = simplexml_import_dom($doc->importNode($z->expand(), true));
    // now you can use $node without going insane about parsing
    var_dump($node->element_1);
    // go to next <product />
    $z->next('product');
}
XMLReader does not, as far as I can tell, have an equivalent way of filtering by element name. So the closest equivalent would be, as mentioned in rvbarreto's answer, to iterate through all elements using XMLReader::read() and grab the info you need whenever the element name matches the one you want.
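A minimal sketch of that read() loop, applied to the car data from the question. The Car class here is a bare stand-in with public properties (the real class and its setters are not shown in the question), getCarsWithReader() is a hypothetical name, and the sketch assumes every car element has a separate closing tag. It parses a string to stay self-contained; for a file you would call $reader->open($filepath) instead.

```php
<?php
// Hypothetical stand-in for the Car class from the question.
final class Car
{
    public $name;
    public $brand;
}

function getCarsWithReader(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml); // use $reader->open($filepath) to stream a file

    $cars = [];
    $current = null; // the Car being filled, or null when outside a <car>
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            if ($reader->name === 'car') {
                $current = new Car();
            } elseif ($current !== null && $reader->name === 'id') {
                $current->name = $reader->readString();
            } elseif ($current !== null && $reader->name === 'brand') {
                $current->brand = $reader->readString();
            }
        } elseif ($reader->nodeType === XMLReader::END_ELEMENT && $reader->name === 'car') {
            // Closing </car>: the object is complete.
            $cars[] = $current;
            $current = null;
        }
    }
    $reader->close();
    return $cars;
}

$cars = getCarsWithReader(
    '<cars><car><id>A1</id><brand>Fiat</brand></car>'
    . '<car><id>B2</id><brand>Saab</brand></car></cars>'
);
echo count($cars), " cars read\n";
```

Only one car is held in the reader at a time, which is the point of using XMLReader over DOMDocument here.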
Alternatively, you might want to check out SimpleXML, which supports filtering with XPath expressions and also lets you reach a node by following the element structure, as if child elements were properties of the main object. For instance, instead of using:
$xmlcar->getElementsByTagName('id')->item(0)->nodeValue;
You would use:
$xmlcar->id[0];
Assuming all of your car elements are at the first level of the XML document tree, the following should work as an example:
function getDatas($filepath) {
    // third argument TRUE tells SimpleXMLElement that $filepath is a path, not XML data
    $carsData = new SimpleXMLElement($filepath, 0, TRUE);
    $mycars = [];
    foreach ($carsData->car as $xmlcar) {
        $car = new Car();
        $car->setName((string) $xmlcar->id[0]);
        $car->setBrand((string) $xmlcar->brand[0]);
        $mycars[] = $car;
    }
    return $mycars;
}
I'm using this in SimpleXML and PHP:
foreach ($xml->children() as $node) {
    echo $node->attributes('namespace')->id;
}
That prints the id attribute of all nodes (using a namespace).
But now I want to know the line number that $node is located in the XML file.
I need the line number, because I'm analyzing the XML file, and returning to the user information of possible issues to resolve them. So I need to say something like: "Here you have an error at line X". I'm sure that the XML file would be in a standard format that will have enough line breaks for this to be useful.
It is possible with DOM. DOMNode provides the function getLineNo().
DOM
$xml = <<<'XML'
<foo>
<bar/>
</foo>
XML;
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
var_dump(
$xpath->evaluate('//bar[1]')->item(0)->getLineNo()
);
Output:
int(2)
SimpleXML
SimpleXML is based on DOM, so you can convert SimpleXMLElement objects to DOMElement objects.
$element = new SimpleXMLElement($xml);
$node = dom_import_simplexml($element->bar);
var_dump($node->getLineNo());
And yes, most of the time if you have a problem with SimpleXML, the answer is to use DOM.
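Applied to the use case in the question, the conversion could be wrapped roughly as follows. This is only a sketch: linesMissingId() is a hypothetical helper, the sample XML is made up, and it checks a plain (non-namespaced) id attribute rather than the namespaced one from the question.

```php
<?php
/**
 * Return the line numbers of child elements that lack an "id" attribute.
 * Hypothetical check; a plain (non-namespaced) id attribute is assumed.
 */
function linesMissingId(string $xml): array
{
    $root = new SimpleXMLElement($xml);
    $lines = [];
    foreach ($root->children() as $node) {
        if (!isset($node['id'])) {
            // Convert to a DOMElement to reach getLineNo().
            $lines[] = dom_import_simplexml($node)->getLineNo();
        }
    }
    return $lines;
}

// Made-up sample: the second <item> (line 3 of the string) has no id.
$xml = "<items>\n<item id=\"a\"/>\n<item/>\n<item id=\"c\"/>\n</items>";
foreach (linesMissingId($xml) as $line) {
    echo "Missing id attribute at line $line\n";
}
```

The returned numbers can be used directly in a message such as "Here you have an error at line X".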
XMLReader
XMLReader has the line numbers internally, but there is no direct method to access them. Again, you will have to convert the node into a DOMNode. This works because both use libxml2. It will read the node and all its descendants into memory, so be careful with it.
$reader = new XMLReader();
$reader->open('data://text/xml;base64,'.base64_encode($xml));
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'bar') {
        var_dump($reader->expand()->getLineNo());
    }
}
I am using PHP DOM to parse XML from another platform, extract certain data from it, and upload it to my own platform. However, I am stuck on extracting one node's value only when another node's value is greater than 0 within the same 'row' element. In the example below, I would like to iterate over the XML and pull out the 'affcustomid' value only if the CPACommission value is greater than 0. Does anyone have any ideas how I can do this? The XML below is a shortened version; in reality, I would get back hundreds of rows in the same format.
<row>
<rowid>1</rowid>
<currencysymbol>€</currencysymbol>
<totalrecords>2145</totalrecords>
<affcustomid>11159_4498302</affcustomid>
<period>7/1/2014</period>
<impressions>0</impressions>
<clicks>1</clicks>
<clickthroughratio>0</clickthroughratio>
<downloads>1</downloads>
<downloadratio>1</downloadratio>
<newaccountratio>1</newaccountratio>
<newdepositingacc>1</newdepositingacc>
<newaccounts>1</newaccounts>
<firstdepositcount>1</firstdepositcount>
<activeaccounts>1</activeaccounts>
<activedays>1</activedays>
<newpurchases>12.4948</newpurchases>
<purchaccountcount>1</purchaccountcount>
<wageraccountcount>1</wageraccountcount>
<avgactivedays>1</avgactivedays>
<netrevenueplayer>11.8701</netrevenueplayer>
<Deposits>12.4948</Deposits>
<Bonus>0</Bonus>
<NetRevenue>11.8701</NetRevenue>
<TotalBetsHands>4</TotalBetsHands>
<Product1Bets>4</Product1Bets>
<Product1NetRevenue>11.8701</Product1NetRevenue>
<Product1Commission>30</Product1Commission>
<Commission>0</Commission>
<CPACommission>30</CPACommission>
</row>
Thanks in advance!
Mark
The easiest way to fetch data from an XML DOM is Xpath:
$dom = new DOMDocument();
$dom->load('file.xml');
$xpath = new DOMXpath($dom);
var_dump(
$xpath->evaluate('string(//row[CPACommission > 0]/affcustomid)')
);
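Note that the XPath string() function returns the value of the first matching node only. With hundreds of rows you would probably want every match, which could look like the sketch below. The helper name paidAffiliateIds(), the rows wrapper element, and the trimmed-down sample rows are assumptions made for the example.

```php
<?php
/**
 * Return every affcustomid whose row has CPACommission > 0.
 * Hypothetical helper; the <rows> wrapper element is assumed.
 */
function paidAffiliateIds(string $xml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml); // use $dom->load('file.xml') for a file
    $xpath = new DOMXPath($dom);

    $ids = [];
    // The predicate compares the numeric value of CPACommission with 0.
    foreach ($xpath->query('//row[CPACommission > 0]/affcustomid') as $node) {
        $ids[] = $node->textContent;
    }
    return $ids;
}

// Made-up rows, reduced to the two elements that matter here.
$xml = '<rows>'
    . '<row><affcustomid>11159_1</affcustomid><CPACommission>30</CPACommission></row>'
    . '<row><affcustomid>11159_2</affcustomid><CPACommission>0</CPACommission></row>'
    . '<row><affcustomid>11159_3</affcustomid><CPACommission>12.5</CPACommission></row>'
    . '</rows>';
print_r(paidAffiliateIds($xml));
```

Keeping the filter inside the XPath expression means no comparison logic is needed in the PHP loop.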
It would be easier using SimpleXML:
$doc = simplexml_load_file('file.xml');
foreach ($doc->row as $row) {
    // cast explicitly so the comparison is numeric
    if ((float) $row->CPACommission > 0) {
        echo $row->affcustomid;
    }
}
But if you still need to use DOMDocument:
$doc = new DOMDocument();
$doc->load('file.xml');
foreach ($doc->getElementsByTagName('row') as $row) {
    if ($row->getElementsByTagName('CPACommission')->item(0)->textContent > 0) {
        echo $row->getElementsByTagName('affcustomid')->item(0)->textContent;
    }
}
This question uses PHP, but the problems and algorithms apply to many other libxml2 and W3C DOM implementations.
Core problem: there is no $node->replaceThisBy($otherNode). There is only "replace text" (using the nodeValue property) and the replaceChild() method, which is neither obvious nor simple to use.
In the code below, only the second loop works, but I need to copy nodes from one DOM tree (simulated here by a clone) to another.
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->load($fileXML);
$xp = new DOMXpath($doc);
$lst = $xp->query("//td");
$z = clone $lst->item(2); // a big and complex node
// the clone is needed to freeze the node content (before it is changed as well)
// does NOT work:
foreach ($lst as $node)
    $node = $z; // no error messages!
    // error: $node->parentNode->replaceChild($z, $node);
// This works though:
foreach ($lst as $node)
    $node->nodeValue = $z->nodeValue;
Similar questions:
PHP DOM replace element with a new element
PHP DOMDocument question: how to replace text of a node?
The nodeValue property changes only the text value. Changing the tags and contents as well takes quite a few more instructions, and DOMDocument is not friendly here. You need to import a clone and then clone it again inside the loop. Solved!
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadXML($xmlFrag);
$xp = new DOMXpath($doc);
$lst = $xp->query("//p");
$import = $doc->importNode($lst->item(1)->cloneNode(true), TRUE);
foreach ($lst as $node) {
    // clone on every pass: reusing one node would only move it from place to place
    $tmp = clone $import;
    $node->parentNode->replaceChild($tmp, $node);
}
print $doc->saveXML();
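The import-then-clone steps can be folded into a small helper that stands in for the missing replaceThisBy(). This is a sketch under assumptions: replaceWithCopy() is a hypothetical name, and the sample document is made up. It copies the source on every call, so the same source node can be reused for many targets, whether it comes from the same document or another one.

```php
<?php
/**
 * Replace $target with a deep copy of $source and return the copy.
 * Hypothetical helper; $source may live in the same or another document.
 */
function replaceWithCopy(DOMNode $target, DOMNode $source): DOMNode
{
    // cloneNode(true) makes a fresh deep copy; importNode() makes sure the
    // copy belongs to the target's document before it is attached.
    $copy = $target->ownerDocument->importNode($source->cloneNode(true), true);
    $target->parentNode->replaceChild($copy, $target);
    return $copy;
}

$doc = new DOMDocument();
$doc->loadXML('<r><p>one</p><p><b>two</b></p><p>three</p></r>');
$xp = new DOMXPath($doc);
$template = $xp->query('//p')->item(1);

// Snapshot the matches first so the loop does not depend on the list
// staying valid while nodes are being replaced.
foreach (iterator_to_array($xp->query('//p')) as $p) {
    replaceWithCopy($p, $template);
}
$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
```

On PHP 8.0 and later, DOMElement also offers a replaceWith() method, which may make a helper like this unnecessary for the same-document case.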
I am loading HTML into DOM and then querying it using XPath in PHP. My current problem is how do I find out how many matches have been made, and once that is ascertained, how do I access them?
I currently have this dirty solution:
$i = 0;
foreach ($nodes as $node) {
    echo $dom->savexml($nodes->item($i));
    $i++;
}
Is there a cleaner way to find the number of nodes? I have tried count(), but that does not work.
You haven't posted any code related to $nodes so I assume you are using DOMXPath and query(), or at the very least, you have a DOMNodeList.
DOMXPath::query() returns a DOMNodeList, which has a length member. You can access it via (given your code):
$nodes->length
If you just want to know the count, you can also use DOMXPath::evaluate.
Example from PHP Manual:
$doc = new DOMDocument;
$doc->load('book.xml');
$xpath = new DOMXPath($doc);
$tbody = $doc->getElementsByTagName('tbody')->item(0);
// our query is relative to the tbody node
$query = 'count(row/entry[. = "en"])';
$entries = $xpath->evaluate($query, $tbody);
echo "There are $entries english books\n";
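Put together for the loop in the question, both ways of counting might look like this sketch. The sample markup and variable names are made up for illustration. As far as I know, count($nodes) also works from PHP 7.2 on, since DOMNodeList implements Countable there.

```php
<?php
// Made-up sample markup; the question loads HTML, XML is used here for brevity.
$dom = new DOMDocument();
$dom->loadXML('<div><p>first</p><p>second</p></div>');
$xpath = new DOMXPath($dom);

$nodes = $xpath->query('//p');
$viaLength = $nodes->length;                       // property on DOMNodeList
$viaXPath  = (int) $xpath->evaluate('count(//p)'); // XPath count() yields a float

echo "Matches: $viaLength\n";
foreach ($nodes as $node) {
    // Each $node is already the matched DOMNode; no index lookup is needed.
    echo $dom->saveXML($node), "\n";
}
```

Inside foreach the loop variable is the node itself, so the manual $i counter and item($i) call from the question can be dropped.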