Parsing large XML with XMLReader [duplicate]

Parsing large XML with XMLReader [duplicate] - php

I have the following XML file, the file is rather large and i haven't been able to get simplexml to open and read the file so i'm trying XMLReader with no success in php
<?xml version="1.0" encoding="ISO-8859-1"?>
<products>
<last_updated>2009-11-30 13:52:40</last_updated>
<product>
<element_1>foo</element_1>
<element_2>foo</element_2>
<element_3>foo</element_3>
<element_4>foo</element_4>
</product>
<product>
<element_1>bar</element_1>
<element_2>bar</element_2>
<element_3>bar</element_3>
<element_4>bar</element_4>
</product>
</products>
I've unfortunately not found a good tutorial on this for PHP and would love to see how I can get each element content to store in a database.

It all depends on how big the unit of work, but I guess you're trying to treat each <product/> nodes in succession.
For that, the simplest way would be to use XMLReader to get to each node, then use SimpleXML to access them. This way, you keep the memory usage low because you're treating one node at a time and you still leverage SimpleXML's ease of use. For instance:
$z = new XMLReader;
$z->open('data.xml');
$doc = new DOMDocument;
// move to the first <product /> node
while ($z->read() && $z->name !== 'product');
// now that we're at the right depth, hop to the next <product/> until the end of the tree
while ($z->name === 'product')
{
// either one should work
//$node = new SimpleXMLElement($z->readOuterXML());
$node = simplexml_import_dom($doc->importNode($z->expand(), true));
// now you can use $node without going insane about parsing
var_dump($node->element_1);
// go to next <product />
$z->next('product');
}
Quick overview of pros and cons of different approaches:
XMLReader only
Pros: fast, uses little memory
Cons: excessively hard to write and debug, requires lots of userland code to do anything useful. Userland code is slow and prone to error. Plus, it leaves you with more lines of code to maintain
XMLReader + SimpleXML
Pros: doesn't use much memory (only the memory needed to process one node) and SimpleXML is, as the name implies, really easy to use.
Cons: creating a SimpleXMLElement object for each node is not very fast. You really have to benchmark it to understand whether it's a problem for you. Even a modest machine would be able to process a thousand nodes per second, though.
XMLReader + DOM
Pros: uses about as much memory as SimpleXML, and XMLReader::expand() is faster than creating a new SimpleXMLElement. I wish it was possible to use simplexml_import_dom() but it doesn't seem to work in that case
Cons: DOM is annoying to work with. It's halfway between XMLReader and SimpleXML. Not as complicated and awkward as XMLReader, but light years away from working with SimpleXML.
My advice: write a prototype with SimpleXML, see if it works for you. If performance is paramount, try DOM. Stay as far away from XMLReader as possible. Remember that the more code you write, the higher the possibility of you introducing bugs or introducing performance regressions.

For xml formatted with attributes...
data.xml:
<building_data>
<building address="some address" lat="28.902914" lng="-71.007235" />
<building address="some address" lat="48.892342" lng="-75.0423423" />
<building address="some address" lat="58.929753" lng="-79.1236987" />
</building_data>
php code:
$reader = new XMLReader();
if (!$reader->open("data.xml")) {
die("Failed to open 'data.xml'");
}
while($reader->read()) {
if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'building') {
$address = $reader->getAttribute('address');
$latitude = $reader->getAttribute('lat');
$longitude = $reader->getAttribute('lng');
}
$reader->close();

The accepted answer gave me a good start, but brought in more classes and more processing than I would have liked; so this is my interpretation:
$xml_reader = new XMLReader;
$xml_reader->open($feed_url);
// move the pointer to the first product
while ($xml_reader->read() && $xml_reader->name != 'product');
// loop through the products
while ($xml_reader->name == 'product')
{
// load the current xml element into simplexml and we’re off and running!
$xml = simplexml_load_string($xml_reader->readOuterXML());
// now you can use your simpleXML object ($xml).
echo $xml->element_1;
// move the pointer to the next product
$xml_reader->next('product');
}
// don’t forget to close the file
$xml_reader->close();

Most of my XML parsing life is spent extracting nuggets of useful information out of truckloads of XML (Amazon MWS). As such, my answer assumes you want only specific information and you know where it is located.
I find the easiest way to use XMLReader is to know which tags I want the information out of and use them. If you know the structure of the XML and it has lots of unique tags, I find that using the first case is the easy. Cases 2 and 3 are just to show you how it can be done for more complex tags. This is extremely fast; I have a discussion of speed over on What is the fastest XML parser in PHP?
The most important thing to remember when doing tag-based parsing like this is to use if ($myXML->nodeType == XMLReader::ELEMENT) {... - which checks to be sure we're only dealing with opening nodes and not whitespace or closing nodes or whatever.
function parseMyXML ($xml) { //pass in an XML string
$myXML = new XMLReader();
$myXML->xml($xml);
while ($myXML->read()) { //start reading.
if ($myXML->nodeType == XMLReader::ELEMENT) { //only opening tags.
$tag = $myXML->name; //make $tag contain the name of the tag
switch ($tag) {
case 'Tag1': //this tag contains no child elements, only the content we need. And it's unique.
$variable = $myXML->readInnerXML(); //now variable contains the contents of tag1
break;
case 'Tag2': //this tag contains child elements, of which we only want one.
while($myXML->read()) { //so we tell it to keep reading
if ($myXML->nodeType == XMLReader::ELEMENT && $myXML->name === 'Amount') { // and when it finds the amount tag...
$variable2 = $myXML->readInnerXML(); //...put it in $variable2.
break;
}
}
break;
case 'Tag3': //tag3 also has children, which are not unique, but we need two of the children this time.
while($myXML->read()) {
if ($myXML->nodeType == XMLReader::ELEMENT && $myXML->name === 'Amount') {
$variable3 = $myXML->readInnerXML();
break;
} else if ($myXML->nodeType == XMLReader::ELEMENT && $myXML->name === 'Currency') {
$variable4 = $myXML->readInnerXML();
break;
}
}
break;
}
}
}
$myXML->close();
}

Simple example:
public function productsAction()
{
$saveFileName = 'ceneo.xml';
$filename = $this->path . $saveFileName;
if(file_exists($filename)) {
$reader = new XMLReader();
$reader->open($filename);
$countElements = 0;
while($reader->read()) {
if($reader->nodeType == XMLReader::ELEMENT) {
$nodeName = $reader->name;
}
if($reader->nodeType == XMLReader::TEXT && !empty($nodeName)) {
switch ($nodeName) {
case 'id':
var_dump($reader->value);
break;
}
}
if($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'offer') {
$countElements++;
}
}
$reader->close();
exit(print('<pre>') . var_dump($countElements));
}
}

XMLReader is well documented on PHP site. This is a XML Pull Parser, which means it's used to iterate through nodes (or DOM Nodes) of given XML document. For example, you could go through the entire document you gave like this:
<?php
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
die("Failed to open 'data.xml'");
}
while($reader->read())
{
$node = $reader->expand();
// process $node...
}
$reader->close();
?>
It is then up to you to decide how to deal with the node returned by XMLReader::expand().

This Works Better and Faster For Me
<html>
<head>
<script>
function showRSS(str) {
if (str.length==0) {
document.getElementById("rssOutput").innerHTML="";
return;
}
if (window.XMLHttpRequest) {
// code for IE7+, Firefox, Chrome, Opera, Safari
xmlhttp=new XMLHttpRequest();
} else { // code for IE6, IE5
xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
}
xmlhttp.onreadystatechange=function() {
if (this.readyState==4 && this.status==200) {
document.getElementById("rssOutput").innerHTML=this.responseText;
}
}
xmlhttp.open("GET","getrss.php?q="+str,true);
xmlhttp.send();
}
</script>
</head>
<body>
<form>
<select onchange="showRSS(this.value)">
<option value="">Select an RSS-feed:</option>
<option value="Google">Google News</option>
<option value="ZDN">ZDNet News</option>
<option value="job">Job</option>
</select>
</form>
<br>
<div id="rssOutput">RSS-feed will be listed here...</div>
</body>
</html>
**The Backend File **
<?php
//get the q parameter from URL
$q=$_GET["q"];
//find out which feed was selected
if($q=="Google") {
$xml=("http://news.google.com/news?ned=us&topic=h&output=rss");
} elseif($q=="ZDN") {
$xml=("https://www.zdnet.com/news/rss.xml");
}elseif($q == "job"){
$xml=("https://ngcareers.com/feed");
}
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
//get elements from "<channel>"
$channel=$xmlDoc->getElementsByTagName('channel')->item(0);
$channel_title = $channel->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$channel_link = $channel->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
$channel_desc = $channel->getElementsByTagName('description')
->item(0)->childNodes->item(0)->nodeValue;
//output elements from "<channel>"
echo("<p><a href='" . $channel_link
. "'>" . $channel_title . "</a>");
echo("<br>");
echo($channel_desc . "</p>");
//get and output "<item>" elements
$x=$xmlDoc->getElementsByTagName('item');
$count = $x->length;
// print_r( $x->item(0)->getElementsByTagName('title')->item(0)->nodeValue);
// print_r( $x->item(0)->getElementsByTagName('link')->item(0)->nodeValue);
// print_r( $x->item(0)->getElementsByTagName('description')->item(0)->nodeValue);
// return;
for ($i=0; $i <= $count; $i++) {
//Title
$item_title = $x->item(0)->getElementsByTagName('title')->item(0)->nodeValue;
//Link
$item_link = $x->item(0)->getElementsByTagName('link')->item(0)->nodeValue;
//Description
$item_desc = $x->item(0)->getElementsByTagName('description')->item(0)->nodeValue;
//Category
$item_cat = $x->item(0)->getElementsByTagName('category')->item(0)->nodeValue;
echo ("<p>Title: <a href='" . $item_link
. "'>" . $item_title . "</a>");
echo ("<br>");
echo ("Desc: ".$item_desc);
echo ("<br>");
echo ("Category: ".$item_cat . "</p>");
}
?>

Related

How to extract elements with XMLReader

I have a large XML file (4GB) which I'm parsing and importing in to a database. I've been playing with XMLReader but can't seem to get it to work and the PHP documentation doesn't have many examples to work with.
My goal is to extract combinations of "url" and "text" from the following (simplified) version of the XML file I'm working with:
<everything>
<doc>
<field1>...</field2>
<url>www.theurlthatIwant.com</url>
<text>This is some text which I want to extract with the url</text>
<random>
<subrandom> </subrandom>
<subrandom> </subrandom>
<subrandom> </subrandom>
</random>
</doc>
<doc>
<field1>...</field2>
<url>www.anotherurl.com</url>
<text>This is some more text which I want to extract with the url</text>
<random>
<subrandom> ... </subrandom>
<subrandom> ... </subrandom>
<subrandom> ... </subrandom>
</random>
</doc>
...
</everything>
What's the pseudocode for grabbing "url" and "text" and ignoring the rest using XMLReader? I plan on outputting the pairs to a CSV file for further (much easier) processing. Thank you!
Updated:
Figured it out, posting answer below for future readers.

I got it working finally. What I didn't understand is that read() doesn't just move to the next element, it moves to the next TOKEN, which could be text, a closing tag, or any XML element. Here's the working code for future readers:
$xml = new XMLReader;
$xml->open('data.xml');
$xml->read(); // One read to skip the "everything" element
while ($xml->read()) {
if ($xml->nodeType == XMLReader::ELEMENT) {
if ($xml->name == 'url') {
$xml->read();
if ($xml->nodeType == XMLReader::TEXT) {
print 'got url: ' . $xml->value.PHP_EOL;
}
}elseif ($xml->name == 'text') {
$xml->read();
if ($xml->nodeType == XMLReader::TEXT) {
print 'got text: ' . $xml->value.PHP_EOL;
}
}
}
}

How do I process XML string in order using PHP

I have an XML file which contains text with some very simple layout constructs:
<?xml version='1.0'?>
<page>
<section>
<header>Header</header>
<par>Some paragraph</par>
<par>Another paragraph with <emph>formatting</emph></par>
</section>
</page>
In PHP then I read this file using SimpleXML (Note that I intentionally strip other tags!):
$page = file_get_contents("page.xml");
if ($page) {
$stripped = strip_tags($page, "<?xml><page><section><header><par><emph>");
$xml = new SimpleXMLElement($stripped);
}
Now I would like to iterate over the XML elements and print them in order as HTML for my website. The final result should be the following snippet:
<h1>Header</h1>
<p>Some paragraph
<p>Another paragraph with <i>formatting</i>
I've noodled through SimpleXML and XPath and tried to figure out how I can iterate over the XML tree in order so that I can digest the original XML file into HTML output. I can produce a somewhat desired result but the <emph></emph> is just gone; how do I descent further into the tree? My code so far:
foreach ($xml->section as $s) {
echo "<h1>" . $s->header . "</h1>";
foreach ($s->par as $p) {
echo "<p>" . $p;
// Do some magic here to ensure <emph> tags are recognized and responded to properly.
}
}
Any hints and pointers are appreciated! Thanks :-)

Well, without an answer I just had to noodle myself :-) So here is what I did and it worked out just fine.
Turned out that the SimpleXML thing didn't cut it, so I used the XMLReader:
$xml = new XMLReader();
Then I manually parsed the XML string, jumped from element to element and acted upon each of them:
if ($xml->xml($stripped)) { // $stripped here is a string that's been validated (see below).
while (false !== $xml->read()) {
$t = $xml->nodeType;
if ($t === XMLReader::ELEMENT) {
$n = $xml->name;
switch ($n) {
case "page":
case "section":
// Nothing to echo here.
break;
case "header":
// Handle attributes here
echo "<h1>";
break;
case "par":
echo "<p> ";
break;
case "emph";
echo "<i>"; // This can also open a <span> for more flexibility later.
break;
default:
// Nothing should arrive here.
echo "Gah!"
}
}
else if ($t === XMLReader::END_ELEMENT) {
... // Close the opened tags here.
}
else if ($t === XMLReader::TEXT) {
$s = $xml->readString();
echo $s;
}
else {
// Everything else are comments or white spaces.
}
}
}
You get the drift. I basically had to bounce through the XML structure myself and, dependent on the element type, handle attributes and nodes of elements manually.
In fact, this is a two-step process. What you see here assumes a valid XML document. I also have a validator that runs before the above code, and which makes sure that the correct elements are nested properly and that the given XML is "well formed" as per my own definitions of nesting, attributes, whatnot. The validator operates after the exact same principle.
Hope this helps.

PHP + XML - how to rename and delete XML elements using SimpleXML or DOMDocument?

I've had some success due to the help of StackOverflow community to modify a complex XML source for use with jsTree. However now that I have data that is usable, it is only so if i manually edit the XML to do the following :
Rename all <user> tags to <item>
Remove some elements before the first <user> tag
insert an 'encoding=UTF-8' into the XML opener
and lastly modify the <response> (opening XML tag) to <root>
XML File Example : SampleXML
I have read and read through so many pages on here and google but cannot find a method to achieve the above items.
Point (2) I have found out that by loading it via SimpleXML and using UNSET i can delete the portions I do not require, however I am still having troubles with the rest.
I thought I could perhaps modify the source with SimpleXML (that I am more familiar with) and then continue to modify the code via the help I had been provided before.
<?php
$s = file_get_contents('http://www.fluffyduck.com.au/sampleXML.xml');
$doc1 = simplexml_load_string($s);
unset($doc1->row);
unset($doc1->display);
#$moo = $doc1->user;
echo '<textarea>';
echo $doc1->asXML();
echo '</textarea>';
$doc = new DOMDocument();
$doc->loadXML($doc1);
$users = $doc->getElementsByTagName("user");
foreach ($users as $user)
{
if ($user->hasAttributes())
{
// create content node
$content = $user->appendChild($doc->createElement("content"));
// transform attributes into content elements
for ($i = 0; $i < $user->attributes->length; $i++)
{
$attr = $user->attributes->item($i);
if (strtolower($attr->name) != "id")
{
if ($user->removeAttribute($attr->name))
{
if($attr->name == "username") {
$content->appendChild($doc->createElement('name', $attr->value));
} else {
$content->appendChild($doc->createElement($attr->name, $attr->value));
}
$i--;
}
}
}
}
}
$doc->saveXML();
header("Content-Type: text/xml");
echo $doc->saveXML();
?>

Using recursion, you can create a brand new document based on the input, solving all your points at once:
Code
<?php
$input = file_get_contents('http://www.fluffyduck.com.au/sampleXML.xml');
$inputDoc = new DOMDocument();
$inputDoc->loadXML($input);
$outputDoc = new DOMDocument("1.0", "utf-8");
$outputDoc->appendChild($outputDoc->createElement("root"));
function ConvertUserToItem($outputDoc, $inputNode, $outputNode)
{
if ($inputNode->hasChildNodes())
{
foreach ($inputNode->childNodes as $inputChild)
{
if (strtolower($inputChild->nodeName) == "user")
{
$outputChild = $outputDoc->createElement("item");
$outputNode->appendChild($outputChild);
// read input attributes and convert them to nodes
if ($inputChild->hasAttributes())
{
$outputContent = $outputDoc->createElement("content");
foreach ($inputChild->attributes as $attribute)
{
if (strtolower($attribute->name) != "id")
{
$outputContent->appendChild($outputDoc->createElement($attribute->name, $attribute->value));
}
else
{
$outputChild->setAttribute($attribute->name, $attribute->value);
}
}
$outputChild->appendChild($outputContent);
}
// recursive call
ConvertUserToItem($outputDoc, $inputChild, $outputChild);
}
}
}
}
ConvertUserToItem($outputDoc, $inputDoc->documentElement, $outputDoc->documentElement);
header("Content-Type: text/xml; charset=" . $outputDoc->encoding);
echo $outputDoc->saveXML();
?>
Output
<?xml version="1.0" encoding="utf-8"?>
<root>
<item id="41">
<content>
<username>bsmain</username>
<firstname>Boss</firstname>
<lastname>MyTest</lastname>
<fullname>Test Name</fullname>
<email>lalal#test.com</email>
<logins>1964</logins>
<lastseen>11/09/2012</lastseen>
</content>
<item id="61">
<content>
<username>underling</username>
<firstname>Under</firstname>
<lastname>MyTest</lastname>
<fullname>Test Name</fullname>
<email>lalal#test.com</email>
<logins>4</logins>
<lastseen>08/09/2009</lastseen>
</content>
</item>
...

Can not figure out how to save the new outputed document with DOM

With the following script I rename the elements through XSLT of an XML and output the result. However I want to save it in a new file on my browser but the only I managed to do is to save the input XML.
<?php
// create an XSLT processor and load the stylesheet as a DOM
$xproc = new XsltProcessor();
$xslt = new DomDocument;
$xslt->load('stylesheet.xslt'); // this contains the code from above
$xproc->importStylesheet($xslt);
// your DOM or the source XML (copied from your question)
$xml = '<document><item><title>Hide your heart</title><quantity>1</quantity><price>9.90</price></item></document>';
$dom = new DomDocument;
$dom->loadXML($xml);
// do the transformation
if ($xml_output = $xproc->transformToXML($dom)) {
echo $xml_output;
} else {
trigger_error('Oops, XSLT transformation failed!', E_USER_ERROR);
}
?>
I used
echo $dom->saveXML();
$dom->save("write.xml")
and replaced it with xml_output with no luck.

if ($xml_output = $xproc->transformToDoc($dom)) {
$xml_output->save('write.xml');
} else {
trigger_error('Oops, XSLT transformation failed!', E_USER_ERROR);
}
transformToXML() returns a string instead of a DOMDocument. You can also write the string to a file using the common file handling functions
if ($xml_output = $xproc->transformToXML($dom)) {
echo $xml_output;
file_put_contents('write.xml', $xml_output);
} else {
trigger_error('Oops, XSLT transformation failed!', E_USER_ERROR);
}
Update: Just found the third method :) Depending on what you want to achieve this is the one you are looking for
$xproc->transformToURI($doc, 'write.xml');
http://php.net/xsltprocessor.transformtouri
You can find the signature of the whole class at the manual: http://php.net/class.xsltprocessor

How to use XMLReader in PHP?

I have the following XML file, the file is rather large and i haven't been able to get simplexml to open and read the file so i'm trying XMLReader with no success in php
<?xml version="1.0" encoding="ISO-8859-1"?>
<products>
<last_updated>2009-11-30 13:52:40</last_updated>
<product>
<element_1>foo</element_1>
<element_2>foo</element_2>
<element_3>foo</element_3>
<element_4>foo</element_4>
</product>
<product>
<element_1>bar</element_1>
<element_2>bar</element_2>
<element_3>bar</element_3>
<element_4>bar</element_4>
</product>
</products>
I've unfortunately not found a good tutorial on this for PHP and would love to see how I can get each element content to store in a database.

It all depends on how big the unit of work, but I guess you're trying to treat each <product/> nodes in succession.
For that, the simplest way would be to use XMLReader to get to each node, then use SimpleXML to access them. This way, you keep the memory usage low because you're treating one node at a time and you still leverage SimpleXML's ease of use. For instance:
$z = new XMLReader;
$z->open('data.xml');
$doc = new DOMDocument;
// move to the first <product /> node
while ($z->read() && $z->name !== 'product');
// now that we're at the right depth, hop to the next <product/> until the end of the tree
while ($z->name === 'product')
{
// either one should work
//$node = new SimpleXMLElement($z->readOuterXML());
$node = simplexml_import_dom($doc->importNode($z->expand(), true));
// now you can use $node without going insane about parsing
var_dump($node->element_1);
// go to next <product />
$z->next('product');
}
Quick overview of pros and cons of different approaches:
XMLReader only
Pros: fast, uses little memory
Cons: excessively hard to write and debug, requires lots of userland code to do anything useful. Userland code is slow and prone to error. Plus, it leaves you with more lines of code to maintain
XMLReader + SimpleXML
Pros: doesn't use much memory (only the memory needed to process one node) and SimpleXML is, as the name implies, really easy to use.
Cons: creating a SimpleXMLElement object for each node is not very fast. You really have to benchmark it to understand whether it's a problem for you. Even a modest machine would be able to process a thousand nodes per second, though.
XMLReader + DOM
Pros: uses about as much memory as SimpleXML, and XMLReader::expand() is faster than creating a new SimpleXMLElement. I wish it was possible to use simplexml_import_dom() but it doesn't seem to work in that case
Cons: DOM is annoying to work with. It's halfway between XMLReader and SimpleXML. Not as complicated and awkward as XMLReader, but light years away from working with SimpleXML.
My advice: write a prototype with SimpleXML, see if it works for you. If performance is paramount, try DOM. Stay as far away from XMLReader as possible. Remember that the more code you write, the higher the possibility of you introducing bugs or introducing performance regressions.

For xml formatted with attributes...
data.xml:
<building_data>
<building address="some address" lat="28.902914" lng="-71.007235" />
<building address="some address" lat="48.892342" lng="-75.0423423" />
<building address="some address" lat="58.929753" lng="-79.1236987" />
</building_data>
php code:
$reader = new XMLReader();
if (!$reader->open("data.xml")) {
die("Failed to open 'data.xml'");
}
while($reader->read()) {
if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'building') {
$address = $reader->getAttribute('address');
$latitude = $reader->getAttribute('lat');
$longitude = $reader->getAttribute('lng');
}
$reader->close();

The accepted answer gave me a good start, but brought in more classes and more processing than I would have liked; so this is my interpretation:
$xml_reader = new XMLReader;
$xml_reader->open($feed_url);
// move the pointer to the first product
while ($xml_reader->read() && $xml_reader->name != 'product');
// loop through the products
while ($xml_reader->name == 'product')
{
// load the current xml element into simplexml and we’re off and running!
$xml = simplexml_load_string($xml_reader->readOuterXML());
// now you can use your simpleXML object ($xml).
echo $xml->element_1;
// move the pointer to the next product
$xml_reader->next('product');
}
// don’t forget to close the file
$xml_reader->close();

Most of my XML parsing life is spent extracting nuggets of useful information out of truckloads of XML (Amazon MWS). As such, my answer assumes you want only specific information and you know where it is located.
I find the easiest way to use XMLReader is to know which tags I want the information out of and use them. If you know the structure of the XML and it has lots of unique tags, I find that using the first case is the easy. Cases 2 and 3 are just to show you how it can be done for more complex tags. This is extremely fast; I have a discussion of speed over on What is the fastest XML parser in PHP?
The most important thing to remember when doing tag-based parsing like this is to use if ($myXML->nodeType == XMLReader::ELEMENT) {... - which checks to be sure we're only dealing with opening nodes and not whitespace or closing nodes or whatever.
function parseMyXML ($xml) { //pass in an XML string
$myXML = new XMLReader();
$myXML->xml($xml);
while ($myXML->read()) { //start reading.
if ($myXML->nodeType == XMLReader::ELEMENT) { //only opening tags.
$tag = $myXML->name; //make $tag contain the name of the tag
switch ($tag) {
case 'Tag1': //this tag contains no child elements, only the content we need. And it's unique.
$variable = $myXML->readInnerXML(); //now variable contains the contents of tag1
break;
case 'Tag2': //this tag contains child elements, of which we only want one.
while($myXML->read()) { //so we tell it to keep reading
if ($myXML->nodeType == XMLReader::ELEMENT && $myXML->name === 'Amount') { // and when it finds the amount tag...
$variable2 = $myXML->readInnerXML(); //...put it in $variable2.
break;
}
}
break;
case 'Tag3': //tag3 also has children, which are not unique, but we need two of the children this time.
while($myXML->read()) {
if ($myXML->nodeType == XMLReader::ELEMENT && $myXML->name === 'Amount') {
$variable3 = $myXML->readInnerXML();
break;
} else if ($myXML->nodeType == XMLReader::ELEMENT && $myXML->name === 'Currency') {
$variable4 = $myXML->readInnerXML();
break;
}
}
break;
}
}
}
$myXML->close();
}

Simple example:
public function productsAction()
{
$saveFileName = 'ceneo.xml';
$filename = $this->path . $saveFileName;
if(file_exists($filename)) {
$reader = new XMLReader();
$reader->open($filename);
$countElements = 0;
while($reader->read()) {
if($reader->nodeType == XMLReader::ELEMENT) {
$nodeName = $reader->name;
}
if($reader->nodeType == XMLReader::TEXT && !empty($nodeName)) {
switch ($nodeName) {
case 'id':
var_dump($reader->value);
break;
}
}
if($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'offer') {
$countElements++;
}
}
$reader->close();
exit(print('<pre>') . var_dump($countElements));
}
}

XMLReader is well documented on PHP site. This is a XML Pull Parser, which means it's used to iterate through nodes (or DOM Nodes) of given XML document. For example, you could go through the entire document you gave like this:
<?php
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
die("Failed to open 'data.xml'");
}
while($reader->read())
{
$node = $reader->expand();
// process $node...
}
$reader->close();
?>
It is then up to you to decide how to deal with the node returned by XMLReader::expand().

This Works Better and Faster For Me
<html>
<head>
<script>
function showRSS(str) {
if (str.length==0) {
document.getElementById("rssOutput").innerHTML="";
return;
}
if (window.XMLHttpRequest) {
// code for IE7+, Firefox, Chrome, Opera, Safari
xmlhttp=new XMLHttpRequest();
} else { // code for IE6, IE5
xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
}
xmlhttp.onreadystatechange=function() {
if (this.readyState==4 && this.status==200) {
document.getElementById("rssOutput").innerHTML=this.responseText;
}
}
xmlhttp.open("GET","getrss.php?q="+str,true);
xmlhttp.send();
}
</script>
</head>
<body>
<form>
<select onchange="showRSS(this.value)">
<option value="">Select an RSS-feed:</option>
<option value="Google">Google News</option>
<option value="ZDN">ZDNet News</option>
<option value="job">Job</option>
</select>
</form>
<br>
<div id="rssOutput">RSS-feed will be listed here...</div>
</body>
</html>
**The Backend File **
<?php
//get the q parameter from URL
$q=$_GET["q"];
//find out which feed was selected
if($q=="Google") {
$xml=("http://news.google.com/news?ned=us&topic=h&output=rss");
} elseif($q=="ZDN") {
$xml=("https://www.zdnet.com/news/rss.xml");
}elseif($q == "job"){
$xml=("https://ngcareers.com/feed");
}
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
//get elements from "<channel>"
$channel=$xmlDoc->getElementsByTagName('channel')->item(0);
$channel_title = $channel->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$channel_link = $channel->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
$channel_desc = $channel->getElementsByTagName('description')
->item(0)->childNodes->item(0)->nodeValue;
//output elements from "<channel>"
echo("<p><a href='" . $channel_link
. "'>" . $channel_title . "</a>");
echo("<br>");
echo($channel_desc . "</p>");
//get and output "<item>" elements
$x=$xmlDoc->getElementsByTagName('item');
$count = $x->length;
// print_r( $x->item(0)->getElementsByTagName('title')->item(0)->nodeValue);
// print_r( $x->item(0)->getElementsByTagName('link')->item(0)->nodeValue);
// print_r( $x->item(0)->getElementsByTagName('description')->item(0)->nodeValue);
// return;
for ($i=0; $i <= $count; $i++) {
//Title
$item_title = $x->item(0)->getElementsByTagName('title')->item(0)->nodeValue;
//Link
$item_link = $x->item(0)->getElementsByTagName('link')->item(0)->nodeValue;
//Description
$item_desc = $x->item(0)->getElementsByTagName('description')->item(0)->nodeValue;
//Category
$item_cat = $x->item(0)->getElementsByTagName('category')->item(0)->nodeValue;
echo ("<p>Title: <a href='" . $item_link
. "'>" . $item_title . "</a>");
echo ("<br>");
echo ("Desc: ".$item_desc);
echo ("<br>");
echo ("Category: ".$item_cat . "</p>");
}
?>

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Parsing large XML with XMLReader [duplicate] - php

Related

How to extract elements with XMLReader

How do I process XML string in order using PHP

PHP + XML - how to rename and delete XML elements using SimpleXML or DOMDocument?

Can not figure out how to save the new outputed document with DOM

How to use XMLReader in PHP?

Categories

Resources