Open a 2GB xml file to understand the structure [duplicate] - php

I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?

There are only two php APIs that are really suited for processing large files. The first is the old expat api, and the second is the newer XMLreader functions. These apis read continuous streams rather than loading the entire tree into memory (which is what simplexml and DOM does).
For an example, you might want to look at this partial parser of the DMOZ-catalog:
<?php
class SimpleDMOZParser
{
protected $_stack = array();
protected $_file = "";
protected $_parser = null;
protected $_currentId = "";
protected $_current = "";
public function __construct($file)
{
$this->_file = $file;
$this->_parser = xml_parser_create("UTF-8");
xml_set_object($this->_parser, $this);
xml_set_element_handler($this->_parser, "startTag", "endTag");
}
public function startTag($parser, $name, $attribs)
{
array_push($this->_stack, $this->_current);
if ($name == "TOPIC" && count($attribs)) {
$this->_currentId = $attribs["R:ID"];
}
if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
echo $attribs["R:RESOURCE"] . "\n";
}
$this->_current = $name;
}
public function endTag($parser, $name)
{
$this->_current = array_pop($this->_stack);
}
public function parse()
{
$fh = fopen($this->_file, "r");
if (!$fh) {
die("Epic fail!\n");
}
while (!feof($fh)) {
$data = fread($fh, 4096);
xml_parse($this->_parser, $data, feof($fh));
}
}
}
$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();

This is a very similar question to Best way to process large XML in PHP but with a very good specific answer upvoted addressing the specific problem of DMOZ catalogue parsing.
However, since this is a good Google hit for large XMLs in general, I will repost my answer from the other question as well:
My take on it:
https://github.com/prewk/XmlStreamer
A simple class that will extract all children to the XML root element while streaming the file.
Tested on 108 MB XML file from pubmed.com.
class SimpleXmlStreamer extends XmlStreamer {
public function processNode($xmlString, $elementName, $nodeIndex) {
$xml = simplexml_load_string($xmlString);
// Do something with your SimpleXML object
return true;
}
}
$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();

I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
If you have the following file complex-test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
<Object>
<Title>Title 1</Title>
<Name>It's name goes here</Name>
<ObjectData>
<Info1></Info1>
<Info2></Info2>
<Info3></Info3>
<Info4></Info4>
</ObjectData>
<Date></Date>
</Object>
<Object></Object>
<Object>
<AnotherObject></AnotherObject>
<Data></Data>
</Object>
<Object></Object>
<Object></Object>
</Complex>
And wanted to return the <Object/>s
PHP:
require_once('class.chunk.php');
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
while ($xml = $file->read()) {
$obj = simplexml_load_string($xml);
// do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
* Chunk
*
* Reads a large file in as chunks for easier parsing.
*
* The chunks returned are whole <$this->options['element']/>s found within file.
*
* Each call to read() returns the whole element including start and end tags.
*
* Tested with a 1.8MB file, extracted 500 elements in 0.11s
* (with no work done, just extracting the elements)
*
* Usage:
* <code>
* // initialize the object
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
*
* // loop through the file until all lines are read
* while ($xml = $file->read()) {
* // do whatever you want with the string
* $o = simplexml_load_string($xml);
* }
* </code>
*
* #package default
* #author Dom Hastings
*/
class Chunk {
/**
* options
*
* #var array Contains all major options
* #access public
*/
public $options = array(
'path' => './', // string The path to check for $file in
'element' => '', // string The XML element to return
'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk
);
/**
* file
*
* #var string The filename being read
* #access public
*/
public $file = '';
/**
* pointer
*
* #var integer The current position the file is being read from
* #access public
*/
public $pointer = 0;
/**
* handle
*
* #var resource The fopen() resource
* #access private
*/
private $handle = null;
/**
* reading
*
* #var boolean Whether the script is currently reading the file
* #access private
*/
private $reading = false;
/**
* readBuffer
*
* #var string Used to make sure start tags aren't missed
* #access private
*/
private $readBuffer = '';
/**
* __construct
*
* Builds the Chunk object
*
* #param string $file The filename to work with
* #param array $options The options with which to parse the file
* #author Dom Hastings
* #access public
*/
public function __construct($file, $options = array()) {
// merge the options together
$this->options = array_merge($this->options, (is_array($options) ? $options : array()));
// check that the path ends with a /
if (substr($this->options['path'], -1) != '/') {
$this->options['path'] .= '/';
}
// normalize the filename
$file = basename($file);
// make sure chunkSize is an int
$this->options['chunkSize'] = intval($this->options['chunkSize']);
// check it's valid
if ($this->options['chunkSize'] < 64) {
$this->options['chunkSize'] = 512;
}
// set the filename
$this->file = realpath($this->options['path'].$file);
// check the file exists
if (!file_exists($this->file)) {
throw new Exception('Cannot load file: '.$this->file);
}
// open the file
$this->handle = fopen($this->file, 'r');
// check the file opened successfully
if (!$this->handle) {
throw new Exception('Error opening file for reading');
}
}
/**
* __destruct
*
* Cleans up
*
* #return void
* #author Dom Hastings
* #access public
*/
public function __destruct() {
// close the file resource
fclose($this->handle);
}
/**
* read
*
* Reads the first available occurence of the XML element $this->options['element']
*
* #return string The XML string from $this->file
* #author Dom Hastings
* #access public
*/
public function read() {
// check we have an element specified
if (!empty($this->options['element'])) {
// trim it
$element = trim($this->options['element']);
} else {
$element = '';
}
// initialize the buffer
$buffer = false;
// if the element is empty
if (empty($element)) {
// let the script know we're reading
$this->reading = true;
// read in the whole doc, cos we don't know what's wanted
while ($this->reading) {
$buffer .= fread($this->handle, $this->options['chunkSize']);
$this->reading = (!feof($this->handle));
}
// return it all
return $buffer;
// we must be looking for a specific element
} else {
// set up the strings to find
$open = '<'.$element.'>';
$close = '</'.$element.'>';
// let the script know we're reading
$this->reading = true;
// reset the global buffer
$this->readBuffer = '';
// this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
$store = false;
// seek to the position we need in the file
fseek($this->handle, $this->pointer);
// start reading
while ($this->reading && !feof($this->handle)) {
// store the chunk in a temporary variable
$tmp = fread($this->handle, $this->options['chunkSize']);
// update the global buffer
$this->readBuffer .= $tmp;
// check for the open string
$checkOpen = strpos($tmp, $open);
// if it wasn't in the new buffer
if (!$checkOpen && !($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkOpen = strpos($this->readBuffer, $open);
// if it was in there
if ($checkOpen) {
// set it to the remainder
$checkOpen = $checkOpen % $this->options['chunkSize'];
}
}
// check for the close string
$checkClose = strpos($tmp, $close);
// if it wasn't in the new buffer
if (!$checkClose && ($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkClose = strpos($this->readBuffer, $close);
// if it was in there
if ($checkClose) {
// set it to the remainder plus the length of the close string itself
$checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
}
// if it was
} elseif ($checkClose) {
// add the length of the close string itself
$checkClose += strlen($close);
}
// if we've found the opening string and we're not already reading another element
if ($checkOpen !== false && !($store)) {
// if we're found the end element too
if ($checkClose !== false) {
// append the string only between the start and end element
$buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
} else {
// append the data we know to be part of this element
$buffer .= substr($tmp, $checkOpen);
// update the pointer
$this->pointer += $this->options['chunkSize'];
// let the script know we're gonna be storing all the data until we find the close element
$store = true;
}
// if we've found the closing element
} elseif ($checkClose !== false) {
// update the buffer with the data upto and including the close tag
$buffer .= substr($tmp, 0, $checkClose);
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
// if we've found the closing element, but half in the previous chunk
} elseif ($store) {
// update the buffer
$buffer .= $tmp;
// and the pointer
$this->pointer += $this->options['chunkSize'];
}
}
}
// return the element (or the whole file if we're not looking for elements)
return $buffer;
}
}

This is an old post, but first in the google search result, so I thought I post another solution based on this post:
http://drib.tech/programming/parse-large-xml-files-php
This solution uses both XMLReader and SimpleXMLElement :
$xmlFile = 'the_LARGE_xml_file_to_load.xml'
$primEL = 'the_name_of_your_element';
$xml = new XMLReader();
$xml->open($xmlFile);
// finding first primary element to work with
while($xml->read() && $xml->name != $primEL){;}
// looping through elements
while($xml->name == $primEL) {
// loading element data into simpleXML object
$element = new SimpleXMLElement($xml->readOuterXML());
// DO STUFF
// moving pointer
$xml->next($primEL);
// clearing current element
unset($element);
} // end while
$xml->close();

I would suggest using a SAX based parser rather than DOM based parsing.
Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm

This isn't a great solution, but just to throw another option out there:
You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).
e.g., if your doc looks like:
<dmoz>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
...
</dmoz>
You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root level tag, and then load them via simplexml/domxml (I used domxml, when taking this approach).
Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that, you're stuck with either the above chunking strategy, or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.
Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, or URLs, etc., but wouldn't make sense for parsing a large HTML document)

You can combine XMLReader with DOM for this. In PHP both APIs (and SimpleXML) are based on the same library - libxml2. Large XMLs are a list of records typically. So you use XMLReader to iterate the records, load a single record into DOM and use DOM methods and Xpath to extract values. The key is the method XMLReader::expand(). It loads the current node in an XMLReader instance and its descendants as DOM nodes.
Example XML:
<books>
<book>
<title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title>
</book>
<book>
<title isbn="978-0596100506">XML Pocket Reference</title>
</book>
<!-- ... -->
</books>
Example code:
// open the XML file
$reader = new XMLReader();
$reader->open('books.xml');
// prepare a DOM document
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// find the first `book` element node at any depth
while ($reader->read() && $reader->localName !== 'book') {
continue;
}
// as long as here is a node with the name "book"
while ($reader->localName === 'book') {
// expand the node into the prepared DOM
$book = $reader->expand($document);
// use Xpath expressions to fetch values
var_dump(
$xpath->evaluate('string(title/#isbn)', $book),
$xpath->evaluate('string(title)', $book)
);
// move to the next book sibling node
$reader->next('book');
}
$reader->close();
Take note that the expanded node is never appended to the DOM document. It allows the GC to clean it up.
This approach works with XML namespaces as well.
$namespaceURI = 'urn:example-books';
$reader = new XMLReader();
$reader->open('books.xml');
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// register a prefix for the Xpath expressions
$xpath->registerNamespace('b', $namespaceURI);
// compare local node name and namespace URI
while (
$reader->read() &&
(
$reader->localName !== 'book' ||
$reader->namespaceURI !== $namespaceURI
)
) {
continue;
}
// iterate the book elements
while ($reader->localName === 'book') {
// validate that they are in the namespace
if ($reader->namespaceURI === $namespaceURI) {
$book = $reader->expand($document);
var_dump(
$xpath->evaluate('string(b:title/#isbn)', $book),
$xpath->evaluate('string(b:title)', $book)
);
}
$reader->next('book');
}
$reader->close();

I've written a wrapper for XMLReader to (IMHO) make it easier to just get the bits your after. The wrapper allows you to associate a set of paths of the data elements and a callback to be run when this path is found. The path allows regex expressions and also capture groups which can also be passed to the callback.
The library is at https://github.com/NigelRel3/XMLReaderReg and can also be installed using composer require nigelrel3/xml-reader-reg.
An example of how to use it...
$inputFile = __DIR__ ."/../tests/data/simpleTest1.xml";
$reader = new XMLReaderReg\XMLReaderReg();
$reader->open($inputFile);
$reader->process([
'(.*/person(?:\[\d*\])?)' => function (SimpleXMLElement $data, $path): void {
echo "1) Value for ".$path[1]." is ".PHP_EOL.
$data->asXML().PHP_EOL;
},
'(.*/person3(\[\d*\])?)' => function (DOMElement $data, $path): void {
echo "2) Value for ".$path[1]." is ".PHP_EOL.
$data->ownerDocument->saveXML($data).PHP_EOL;
},
'/root/person2/firstname' => function (string $data): void {
echo "3) Value for /root/person2/firstname is ". $data.PHP_EOL;
}
]);
$reader->close();
As can be seen from the example, you can get the data to be passed as a SimpleXMLElement, a DOMElement or the last one is a string. This will represent only the data which matches the path.
The paths also show how capture groups can be used - (.*/person(?:\[\d*\])?) looks for any person element (including arrays of elements) and $path[1] in the callback displays the path where this particular instance is found.
There is an expanded example in the library as well as unit tests.

I tested the following code with 2 GB xml:
<?php
set_time_limit(0);
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
die("Failed to open 'data.xml'");
}
while($reader->read())
{
$node = $reader->expand();
// process $node...
}
$reader->close();
?>

My solution:
$reader = new XMLReader();
$reader->open($fileTMP);
while ($reader->read()) {
if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'xmltag' && $reader->isEmptyElement === false) {
$item = simplexml_load_string($reader->readOuterXML(), null, LIBXML_NOCDATA);
//operations on file
}
}
$reader->close();

Very high performed way is
preg_split('/(<|>)/m', $xmlString);
And after that, only one cycle is needed.

Related

How to get the html of certain tag using php DomDocument

first I let the it find the element
$dom= new \DOMDocument();
$dom->loadHTML($html_string);
$only_p = $dom->getElementsByTagName('p')->item(1);
If I try something like
$only_p->textContent; /* <-- it will only return the text inside the paragraph and even remove all the tags inside of it */
what I need is something like
$only_p = $dom->getElementsByTagName('p')->item(1);
$only_p->outerHTML;
that would return the HTML around it, like
<p class="something"> this is text </p>
instead of just the string "this is text"
This is how I solved it
/**
* Returns the outer HTML of a certain tag.
* You can decide whether the result should be returned as string or inside an array.
*
* #param $html_string
* #param $tagName
* #param string $return_as
* #return array|string
*/
public function getOuterHTMLFromTagName($html_string,$tagName,$return_as='array')
{
//create a new DomDocument based on the first parameter's html
$dom_doc= new \DOMDocument();
$dom_doc->loadHTML($html_string);
//set variables for the result type
$html_results_as_array = array();
$html_results_as_string = "";
// get tags from DocDocument
$elements_in_tag =$dom_doc->getElementsByTagName($tagName);
// loop through found tags
for($a=0; $a < $elements_in_tag->length; $a++)
{
// get tag of current key
$element_in_tag = $dom_doc->getElementsByTagName($tagName)->item($a);
//create a new DomDocument that only contains the tags HTML
$element_doc = new \DOMDocument();
$element_doc->appendChild($element_doc->importNode($element_in_tag,true));
//save the elements HTML in variables
$html_results_as_string .= $element_doc->saveHTML();
array_push($html_results_as_array,$element_doc->saveHTML());
}
//return either as array or string
if($return_as == 'array')
{
return $html_results_as_array;
}
else
{
return $html_results_as_string;
}
}

Parsing very big XML file with PHP [duplicate]

I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?
There are only two php APIs that are really suited for processing large files. The first is the old expat api, and the second is the newer XMLreader functions. These apis read continuous streams rather than loading the entire tree into memory (which is what simplexml and DOM does).
For an example, you might want to look at this partial parser of the DMOZ-catalog:
<?php
class SimpleDMOZParser
{
protected $_stack = array();
protected $_file = "";
protected $_parser = null;
protected $_currentId = "";
protected $_current = "";
public function __construct($file)
{
$this->_file = $file;
$this->_parser = xml_parser_create("UTF-8");
xml_set_object($this->_parser, $this);
xml_set_element_handler($this->_parser, "startTag", "endTag");
}
public function startTag($parser, $name, $attribs)
{
array_push($this->_stack, $this->_current);
if ($name == "TOPIC" && count($attribs)) {
$this->_currentId = $attribs["R:ID"];
}
if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
echo $attribs["R:RESOURCE"] . "\n";
}
$this->_current = $name;
}
public function endTag($parser, $name)
{
$this->_current = array_pop($this->_stack);
}
public function parse()
{
$fh = fopen($this->_file, "r");
if (!$fh) {
die("Epic fail!\n");
}
while (!feof($fh)) {
$data = fread($fh, 4096);
xml_parse($this->_parser, $data, feof($fh));
}
}
}
$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
This is a very similar question to Best way to process large XML in PHP but with a very good specific answer upvoted addressing the specific problem of DMOZ catalogue parsing.
However, since this is a good Google hit for large XMLs in general, I will repost my answer from the other question as well:
My take on it:
https://github.com/prewk/XmlStreamer
A simple class that will extract all children to the XML root element while streaming the file.
Tested on 108 MB XML file from pubmed.com.
class SimpleXmlStreamer extends XmlStreamer {
public function processNode($xmlString, $elementName, $nodeIndex) {
$xml = simplexml_load_string($xmlString);
// Do something with your SimpleXML object
return true;
}
}
$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();
I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
If you have the following file complex-test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
<Object>
<Title>Title 1</Title>
<Name>It's name goes here</Name>
<ObjectData>
<Info1></Info1>
<Info2></Info2>
<Info3></Info3>
<Info4></Info4>
</ObjectData>
<Date></Date>
</Object>
<Object></Object>
<Object>
<AnotherObject></AnotherObject>
<Data></Data>
</Object>
<Object></Object>
<Object></Object>
</Complex>
And wanted to return the <Object/>s
PHP:
require_once('class.chunk.php');
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
while ($xml = $file->read()) {
$obj = simplexml_load_string($xml);
// do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
* Chunk
*
* Reads a large file in as chunks for easier parsing.
*
* The chunks returned are whole <$this->options['element']/>s found within file.
*
* Each call to read() returns the whole element including start and end tags.
*
* Tested with a 1.8MB file, extracted 500 elements in 0.11s
* (with no work done, just extracting the elements)
*
* Usage:
* <code>
* // initialize the object
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
*
* // loop through the file until all lines are read
* while ($xml = $file->read()) {
* // do whatever you want with the string
* $o = simplexml_load_string($xml);
* }
* </code>
*
* #package default
* #author Dom Hastings
*/
class Chunk {
/**
* options
*
* #var array Contains all major options
* #access public
*/
public $options = array(
'path' => './', // string The path to check for $file in
'element' => '', // string The XML element to return
'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk
);
/**
* file
*
* #var string The filename being read
* #access public
*/
public $file = '';
/**
* pointer
*
* #var integer The current position the file is being read from
* #access public
*/
public $pointer = 0;
/**
* handle
*
* #var resource The fopen() resource
* #access private
*/
private $handle = null;
/**
* reading
*
* #var boolean Whether the script is currently reading the file
* #access private
*/
private $reading = false;
/**
* readBuffer
*
* #var string Used to make sure start tags aren't missed
* #access private
*/
private $readBuffer = '';
/**
* __construct
*
* Builds the Chunk object
*
* #param string $file The filename to work with
* #param array $options The options with which to parse the file
* #author Dom Hastings
* #access public
*/
public function __construct($file, $options = array()) {
// merge the options together
$this->options = array_merge($this->options, (is_array($options) ? $options : array()));
// check that the path ends with a /
if (substr($this->options['path'], -1) != '/') {
$this->options['path'] .= '/';
}
// normalize the filename
$file = basename($file);
// make sure chunkSize is an int
$this->options['chunkSize'] = intval($this->options['chunkSize']);
// check it's valid
if ($this->options['chunkSize'] < 64) {
$this->options['chunkSize'] = 512;
}
// set the filename
$this->file = realpath($this->options['path'].$file);
// check the file exists
if (!file_exists($this->file)) {
throw new Exception('Cannot load file: '.$this->file);
}
// open the file
$this->handle = fopen($this->file, 'r');
// check the file opened successfully
if (!$this->handle) {
throw new Exception('Error opening file for reading');
}
}
/**
* __destruct
*
* Cleans up
*
* #return void
* #author Dom Hastings
* #access public
*/
public function __destruct() {
// close the file resource
fclose($this->handle);
}
/**
* read
*
* Reads the first available occurence of the XML element $this->options['element']
*
* #return string The XML string from $this->file
* #author Dom Hastings
* #access public
*/
public function read() {
// check we have an element specified
if (!empty($this->options['element'])) {
// trim it
$element = trim($this->options['element']);
} else {
$element = '';
}
// initialize the buffer
$buffer = false;
// if the element is empty
if (empty($element)) {
// let the script know we're reading
$this->reading = true;
// read in the whole doc, cos we don't know what's wanted
while ($this->reading) {
$buffer .= fread($this->handle, $this->options['chunkSize']);
$this->reading = (!feof($this->handle));
}
// return it all
return $buffer;
// we must be looking for a specific element
} else {
// set up the strings to find
$open = '<'.$element.'>';
$close = '</'.$element.'>';
// let the script know we're reading
$this->reading = true;
// reset the global buffer
$this->readBuffer = '';
// this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
$store = false;
// seek to the position we need in the file
fseek($this->handle, $this->pointer);
// start reading
while ($this->reading && !feof($this->handle)) {
// store the chunk in a temporary variable
$tmp = fread($this->handle, $this->options['chunkSize']);
// update the global buffer
$this->readBuffer .= $tmp;
// check for the open string
$checkOpen = strpos($tmp, $open);
// if it wasn't in the new buffer
if (!$checkOpen && !($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkOpen = strpos($this->readBuffer, $open);
// if it was in there
if ($checkOpen) {
// set it to the remainder
$checkOpen = $checkOpen % $this->options['chunkSize'];
}
}
// check for the close string
$checkClose = strpos($tmp, $close);
// if it wasn't in the new buffer
if (!$checkClose && ($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkClose = strpos($this->readBuffer, $close);
// if it was in there
if ($checkClose) {
// set it to the remainder plus the length of the close string itself
$checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
}
// if it was
} elseif ($checkClose) {
// add the length of the close string itself
$checkClose += strlen($close);
}
// if we've found the opening string and we're not already reading another element
if ($checkOpen !== false && !($store)) {
// if we're found the end element too
if ($checkClose !== false) {
// append the string only between the start and end element
$buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
} else {
// append the data we know to be part of this element
$buffer .= substr($tmp, $checkOpen);
// update the pointer
$this->pointer += $this->options['chunkSize'];
// let the script know we're gonna be storing all the data until we find the close element
$store = true;
}
// if we've found the closing element
} elseif ($checkClose !== false) {
// update the buffer with the data upto and including the close tag
$buffer .= substr($tmp, 0, $checkClose);
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
// if we've found the closing element, but half in the previous chunk
} elseif ($store) {
// update the buffer
$buffer .= $tmp;
// and the pointer
$this->pointer += $this->options['chunkSize'];
}
}
}
// return the element (or the whole file if we're not looking for elements)
return $buffer;
}
}
This is an old post, but first in the google search result, so I thought I post another solution based on this post:
http://drib.tech/programming/parse-large-xml-files-php
This solution uses both XMLReader and SimpleXMLElement :
$xmlFile = 'the_LARGE_xml_file_to_load.xml'
$primEL = 'the_name_of_your_element';
$xml = new XMLReader();
$xml->open($xmlFile);
// finding first primary element to work with
while($xml->read() && $xml->name != $primEL){;}
// looping through elements
while($xml->name == $primEL) {
// loading element data into simpleXML object
$element = new SimpleXMLElement($xml->readOuterXML());
// DO STUFF
// moving pointer
$xml->next($primEL);
// clearing current element
unset($element);
} // end while
$xml->close();
I would suggest using a SAX based parser rather than DOM based parsing.
Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm
This isn't a great solution, but just to throw another option out there:
You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).
e.g., if your doc looks like:
<dmoz>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
...
</dmoz>
You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root level tag, and then load them via simplexml/domxml (I used domxml, when taking this approach).
Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that, you're stuck with either the above chunking strategy, or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.
Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, or URLs, etc., but wouldn't make sense for parsing a large HTML document)
You can combine XMLReader with DOM for this. In PHP both APIs (and SimpleXML) are based on the same library - libxml2. Large XMLs are a list of records typically. So you use XMLReader to iterate the records, load a single record into DOM and use DOM methods and Xpath to extract values. The key is the method XMLReader::expand(). It loads the current node in an XMLReader instance and its descendants as DOM nodes.
Example XML:
<books>
<book>
<title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title>
</book>
<book>
<title isbn="978-0596100506">XML Pocket Reference</title>
</book>
<!-- ... -->
</books>
Example code:
// open the XML file
$reader = new XMLReader();
$reader->open('books.xml');
// prepare a DOM document
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// find the first `book` element node at any depth
while ($reader->read() && $reader->localName !== 'book') {
continue;
}
// as long as here is a node with the name "book"
while ($reader->localName === 'book') {
// expand the node into the prepared DOM
$book = $reader->expand($document);
// use Xpath expressions to fetch values
var_dump(
$xpath->evaluate('string(title/#isbn)', $book),
$xpath->evaluate('string(title)', $book)
);
// move to the next book sibling node
$reader->next('book');
}
$reader->close();
Take note that the expanded node is never appended to the DOM document. It allows the GC to clean it up.
This approach works with XML namespaces as well.
$namespaceURI = 'urn:example-books';
$reader = new XMLReader();
$reader->open('books.xml');
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// register a prefix for the Xpath expressions
$xpath->registerNamespace('b', $namespaceURI);
// compare local node name and namespace URI
while (
$reader->read() &&
(
$reader->localName !== 'book' ||
$reader->namespaceURI !== $namespaceURI
)
) {
continue;
}
// iterate the book elements
while ($reader->localName === 'book') {
// validate that they are in the namespace
if ($reader->namespaceURI === $namespaceURI) {
$book = $reader->expand($document);
var_dump(
$xpath->evaluate('string(b:title/#isbn)', $book),
$xpath->evaluate('string(b:title)', $book)
);
}
$reader->next('book');
}
$reader->close();
I've written a wrapper for XMLReader to (IMHO) make it easier to just get the bits your after. The wrapper allows you to associate a set of paths of the data elements and a callback to be run when this path is found. The path allows regex expressions and also capture groups which can also be passed to the callback.
The library is at https://github.com/NigelRel3/XMLReaderReg and can also be installed using composer require nigelrel3/xml-reader-reg.
An example of how to use it...
$inputFile = __DIR__ ."/../tests/data/simpleTest1.xml";
$reader = new XMLReaderReg\XMLReaderReg();
$reader->open($inputFile);
$reader->process([
'(.*/person(?:\[\d*\])?)' => function (SimpleXMLElement $data, $path): void {
echo "1) Value for ".$path[1]." is ".PHP_EOL.
$data->asXML().PHP_EOL;
},
'(.*/person3(\[\d*\])?)' => function (DOMElement $data, $path): void {
echo "2) Value for ".$path[1]." is ".PHP_EOL.
$data->ownerDocument->saveXML($data).PHP_EOL;
},
'/root/person2/firstname' => function (string $data): void {
echo "3) Value for /root/person2/firstname is ". $data.PHP_EOL;
}
]);
$reader->close();
As can be seen from the example, you can get the data to be passed as a SimpleXMLElement, a DOMElement or the last one is a string. This will represent only the data which matches the path.
The paths also show how capture groups can be used - (.*/person(?:\[\d*\])?) looks for any person element (including arrays of elements) and $path[1] in the callback displays the path where this particular instance is found.
There is an expanded example in the library as well as unit tests.
I tested the following code with 2 GB xml:
<?php
set_time_limit(0);
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
die("Failed to open 'data.xml'");
}
while($reader->read())
{
$node = $reader->expand();
// process $node...
}
$reader->close();
?>
My solution:
$reader = new XMLReader();
$reader->open($fileTMP);
while ($reader->read()) {
if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'xmltag' && $reader->isEmptyElement === false) {
$item = simplexml_load_string($reader->readOuterXML(), null, LIBXML_NOCDATA);
//operations on file
}
}
$reader->close();
Very high performed way is
preg_split('/(<|>)/m', $xmlString);
And after that, only one cycle is needed.

removing all divs with a certain class

I have an HTML string, and I want to remove from it all DIVs whose class is "toremove".
This is trivial to do on the client side with jQuery etc., but I want to do it on the server side with PHP.
A simple aegular expression won't work, because divs may be nested...
You could use the DOM object and xPath to remove the DIVs.
/** UNTESTED **/
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("*/div[#class='yourClasshere']");
foreach($elements as $e){
$doc->removeChild($e);
}
$doc->saveHTMLFile($file);
You can replace the load from file and save to file with load from and save to string if you prefer.
Here's a snippet of code I'm using to remove content from pages:
/**
* A method to remove unwanted parts of an HTML-page. Can remove elements by
* id, tag name and/or class names.
*
* #param string $html The HTML to manipulate
* #param array $partsToRemove An array of arrays, with the keys specifying
* what type of values the array holds. The following keys are used:
* 'elements' - An array of element ids to remove from the html
* 'tags' - An array of tag names to remove from the html
* 'classNames' - An array of class names. Each tag that contains one of the
* class names will be removed from the html.
*
* Also, note that descendants of the removed document will also be removed.
*
* #return string The manipulated HTML content
*
* #example removeHtmlParts($html, array (
* 'elements' => array ('headerSection', 'nav', 'footerSection'),
* 'tags' => array ('form'),
* 'classNames' => array ('promotion')
* ));
*/
--
public function removeHtmlParts ($html, array $toRemove = array())
{
$document = new \DOMDocument('1.0', 'UTF-8');
$document->encoding = 'UTF-8';
// Hack to force DOMDocument to load the HTML using UTF-8.
#$document->loadHTML('<?xml encoding="UTF-8">' . $response->getBody());
$partsToRemove = array ();
if(isset($toRemove['elements']))
{
$partsToRemove['elements'] = $toRemove['element'];
}
if(isset($toRemove['tags']))
{
$partsToRemove['tags'] = $toRemove['tags'];
}
if(isset($toRemove['classNames']))
{
$partsToRemove['classNames'] = $toRemove['classNames'];
}
foreach ($partsToRemove as $type => $content)
{
if($type == 'elements')
{
foreach ($content as $elementId)
{
$element = $document->getElementById($elementId);
if($element)
{
$element->parentNode->removeChild($element);
}
}
}
elseif($type == 'tags')
{
foreach($content as $tagName)
{
$tags = $document->getElementsByTagName($tagName);
while($tags->length)
{
$tag = $tags->item(0);
if($tag)
{
$tag->parentNode->removeChild($tag);
}
}
}
}
elseif($type == 'classNames')
{
foreach ($content as $className)
{
$xpath = new \DOMXPath($document);
$xpathExpression = sprintf(
'//*[contains(#class,"%1")]',
$className
);
$domNodeList = $xpath->evaluate($xpathExpression);
for($i = 0; $i < $domNodeList->length; $i++)
{
$node = $domNodeList->item($i);
if($node && $node->parentNode)
{
$node->parentNode->removeChild($node);
}
}
}
}
}
return $document->saveHTML();
}
Note:
This code has not undergone proper unit testing and probably contains bugs in edge cases
This method should be refactored into a class, and the contents of the method split into separate methods to ease testing.
Based on the short answer of jebbench and the long answer of PatrikAkerstrand, I created a medium function that exactly solves my problem:
/**
* remove, from the given xhtml string, all divs with the given class.
*/
function remove_divs_with_class($xhtml, $class) {
$doc = new DOMDocument();
// Hack to force DOMDocument to load the HTML using UTF-8:
$doc->loadHTML('<?xml encoding="UTF-8">'.$xhtml);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[contains(#class,'$class')]");
foreach ($elements as $element)
$element->parentNode->removeChild($element);
return $doc->saveHTML();
}
/* UNIT TEST */
if (basename(__FILE__)==basename($_SERVER['PHP_SELF'])) {
$xhtml = "<div class='near future'>near future</div><div>start</div><div class='future'>future research</div><div class='summary'>summary</div><div class='a future b'>far future</div>";
$xhtml2 = remove_divs_with_class($xhtml, "future");
print "<h2>before</h2>$xhtml<h2>after</h2>$xhtml2";
}
/* OUTPUT:
before
near future
start
future research
summary
far future
after
start
summary
*/
Never, ever try and use regex to parse XML/HTML. Instead use a parsing library. Apparently, one for PHP is http://sourceforge.net/projects/simplehtmldom/files/

php -- combining arrays

So I'm trying to write a function that does the following: I have about 20 or so XML files (someday I will have over a hundred) and in the header of each file is the name of a person who was a peer review editor <editor role="PeerReviewEditor">John Doe</editor>. I want to run through the directory where these files are stored and capture the name of the Peer-Review-Editor for that file. I want to end up with an variable $reviewEditorNames that contains all of the different names. (I will then use this to display a list of editors, etc.)
Here's what I've got so far. I'm worried about the last part. I feel like the attempt to turn $editorReviewName into $editorReviewNames is not going to combine the individuals for each file, but an array found within a given file (even if there is only one name in a given file, and thus it is an array of 1)
I'm grateful for your help.
function editorlist()
{
$filename = readDirectory('../editedtranscriptions');
foreach($filename as $file)
{
$xmldoc = simplexml_load_file("../editedtranscriptions/$file");
$xmldoc->registerXPathNamespace("tei", "http://www.tei-c.org/ns/1.0");
$reviewEditorName = $xmldoc->xpath("//tei:editor[#role='PeerReviewEditor']");
return $reviewEditorNames[] = $reviewEditorName;
}
}
I would put things more apart, that helps as well when you need to change your code later on.
Next to that, you need to check the return of the xpath, most likely you want to process only the first match (is there one editor per file?) and you want to return it as string.
If you put things into functions of it's own it's more easy to make a function to only do one thing and so it's easier to debug and improve things. E.g. you can first test if a editorFromFile function does what it should and then run it on multiple files:
/**
* get PeerReviewEditor from file
*
* #param string $file
* #return string
*/
function editorFromFile($file)
{
$xmldoc = simplexml_load_file($file);
$xmldoc->registerXPathNamespace("tei", "http://www.tei-c.org/ns/1.0");
$node = $xmldoc->xpath("//tei:editor[#role='PeerReviewEditor'][1]");
return (string) $node[0];
}
/**
* get editors from a path
*
* #param string $path
* #return array
*/
function editorlist($path)
{
$editors = array();
$files = glob(sprintf('%s/*.xml', $path), GLOB_NOSORT);
foreach($files as $file)
{
$editors[] = editorFromFile($file);
}
return $editors;
}
Just a little update:
function editorlist() {
$reviewEditorNames = array(); // init the array
$filename = readDirectory('../editedtranscriptions');
foreach($filename as $file) {
$xmldoc = simplexml_load_file("../editedtranscriptions/$file");
$xmldoc->registerXPathNamespace("tei", "http://www.tei-c.org/ns/1.0");
// add to the array
$result = $xmldoc->xpath("//tei:editor[#role='PeerReviewEditor']");
if (sizeof($result) > 0) {
$reviewEditorNames[] = (string)$result[0];
}
}
// return the array
return $reviewEditorNames;
}

Parsing Huge XML Files in PHP

I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?
There are only two php APIs that are really suited for processing large files. The first is the old expat api, and the second is the newer XMLreader functions. These apis read continuous streams rather than loading the entire tree into memory (which is what simplexml and DOM does).
For an example, you might want to look at this partial parser of the DMOZ-catalog:
<?php
class SimpleDMOZParser
{
protected $_stack = array();
protected $_file = "";
protected $_parser = null;
protected $_currentId = "";
protected $_current = "";
public function __construct($file)
{
$this->_file = $file;
$this->_parser = xml_parser_create("UTF-8");
xml_set_object($this->_parser, $this);
xml_set_element_handler($this->_parser, "startTag", "endTag");
}
public function startTag($parser, $name, $attribs)
{
array_push($this->_stack, $this->_current);
if ($name == "TOPIC" && count($attribs)) {
$this->_currentId = $attribs["R:ID"];
}
if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
echo $attribs["R:RESOURCE"] . "\n";
}
$this->_current = $name;
}
public function endTag($parser, $name)
{
$this->_current = array_pop($this->_stack);
}
public function parse()
{
$fh = fopen($this->_file, "r");
if (!$fh) {
die("Epic fail!\n");
}
while (!feof($fh)) {
$data = fread($fh, 4096);
xml_parse($this->_parser, $data, feof($fh));
}
}
}
$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
This is a very similar question to Best way to process large XML in PHP but with a very good specific answer upvoted addressing the specific problem of DMOZ catalogue parsing.
However, since this is a good Google hit for large XMLs in general, I will repost my answer from the other question as well:
My take on it:
https://github.com/prewk/XmlStreamer
A simple class that will extract all children to the XML root element while streaming the file.
Tested on 108 MB XML file from pubmed.com.
class SimpleXmlStreamer extends XmlStreamer {
public function processNode($xmlString, $elementName, $nodeIndex) {
$xml = simplexml_load_string($xmlString);
// Do something with your SimpleXML object
return true;
}
}
$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();
I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
If you have the following file complex-test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
<Object>
<Title>Title 1</Title>
<Name>It's name goes here</Name>
<ObjectData>
<Info1></Info1>
<Info2></Info2>
<Info3></Info3>
<Info4></Info4>
</ObjectData>
<Date></Date>
</Object>
<Object></Object>
<Object>
<AnotherObject></AnotherObject>
<Data></Data>
</Object>
<Object></Object>
<Object></Object>
</Complex>
And wanted to return the <Object/>s
PHP:
require_once('class.chunk.php');
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
while ($xml = $file->read()) {
$obj = simplexml_load_string($xml);
// do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
* Chunk
*
* Reads a large file in as chunks for easier parsing.
*
* The chunks returned are whole <$this->options['element']/>s found within file.
*
* Each call to read() returns the whole element including start and end tags.
*
* Tested with a 1.8MB file, extracted 500 elements in 0.11s
* (with no work done, just extracting the elements)
*
* Usage:
* <code>
* // initialize the object
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
*
* // loop through the file until all lines are read
* while ($xml = $file->read()) {
* // do whatever you want with the string
* $o = simplexml_load_string($xml);
* }
* </code>
*
* #package default
* #author Dom Hastings
*/
class Chunk {
/**
* options
*
* #var array Contains all major options
* #access public
*/
public $options = array(
'path' => './', // string The path to check for $file in
'element' => '', // string The XML element to return
'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk
);
/**
* file
*
* #var string The filename being read
* #access public
*/
public $file = '';
/**
* pointer
*
* #var integer The current position the file is being read from
* #access public
*/
public $pointer = 0;
/**
* handle
*
* #var resource The fopen() resource
* #access private
*/
private $handle = null;
/**
* reading
*
* #var boolean Whether the script is currently reading the file
* #access private
*/
private $reading = false;
/**
* readBuffer
*
* #var string Used to make sure start tags aren't missed
* #access private
*/
private $readBuffer = '';
/**
* __construct
*
* Builds the Chunk object
*
* #param string $file The filename to work with
* #param array $options The options with which to parse the file
* #author Dom Hastings
* #access public
*/
public function __construct($file, $options = array()) {
// merge the options together
$this->options = array_merge($this->options, (is_array($options) ? $options : array()));
// check that the path ends with a /
if (substr($this->options['path'], -1) != '/') {
$this->options['path'] .= '/';
}
// normalize the filename
$file = basename($file);
// make sure chunkSize is an int
$this->options['chunkSize'] = intval($this->options['chunkSize']);
// check it's valid
if ($this->options['chunkSize'] < 64) {
$this->options['chunkSize'] = 512;
}
// set the filename
$this->file = realpath($this->options['path'].$file);
// check the file exists
if (!file_exists($this->file)) {
throw new Exception('Cannot load file: '.$this->file);
}
// open the file
$this->handle = fopen($this->file, 'r');
// check the file opened successfully
if (!$this->handle) {
throw new Exception('Error opening file for reading');
}
}
/**
* __destruct
*
* Cleans up
*
* #return void
* #author Dom Hastings
* #access public
*/
public function __destruct() {
// close the file resource
fclose($this->handle);
}
/**
* read
*
* Reads the first available occurence of the XML element $this->options['element']
*
* #return string The XML string from $this->file
* #author Dom Hastings
* #access public
*/
public function read() {
// check we have an element specified
if (!empty($this->options['element'])) {
// trim it
$element = trim($this->options['element']);
} else {
$element = '';
}
// initialize the buffer
$buffer = false;
// if the element is empty
if (empty($element)) {
// let the script know we're reading
$this->reading = true;
// read in the whole doc, cos we don't know what's wanted
while ($this->reading) {
$buffer .= fread($this->handle, $this->options['chunkSize']);
$this->reading = (!feof($this->handle));
}
// return it all
return $buffer;
// we must be looking for a specific element
} else {
// set up the strings to find
$open = '<'.$element.'>';
$close = '</'.$element.'>';
// let the script know we're reading
$this->reading = true;
// reset the global buffer
$this->readBuffer = '';
// this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
$store = false;
// seek to the position we need in the file
fseek($this->handle, $this->pointer);
// start reading
while ($this->reading && !feof($this->handle)) {
// store the chunk in a temporary variable
$tmp = fread($this->handle, $this->options['chunkSize']);
// update the global buffer
$this->readBuffer .= $tmp;
// check for the open string
$checkOpen = strpos($tmp, $open);
// if it wasn't in the new buffer
if (!$checkOpen && !($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkOpen = strpos($this->readBuffer, $open);
// if it was in there
if ($checkOpen) {
// set it to the remainder
$checkOpen = $checkOpen % $this->options['chunkSize'];
}
}
// check for the close string
$checkClose = strpos($tmp, $close);
// if it wasn't in the new buffer
if (!$checkClose && ($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkClose = strpos($this->readBuffer, $close);
// if it was in there
if ($checkClose) {
// set it to the remainder plus the length of the close string itself
$checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
}
// if it was
} elseif ($checkClose) {
// add the length of the close string itself
$checkClose += strlen($close);
}
// if we've found the opening string and we're not already reading another element
if ($checkOpen !== false && !($store)) {
// if we're found the end element too
if ($checkClose !== false) {
// append the string only between the start and end element
$buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
} else {
// append the data we know to be part of this element
$buffer .= substr($tmp, $checkOpen);
// update the pointer
$this->pointer += $this->options['chunkSize'];
// let the script know we're gonna be storing all the data until we find the close element
$store = true;
}
// if we've found the closing element
} elseif ($checkClose !== false) {
// update the buffer with the data upto and including the close tag
$buffer .= substr($tmp, 0, $checkClose);
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
// if we've found the closing element, but half in the previous chunk
} elseif ($store) {
// update the buffer
$buffer .= $tmp;
// and the pointer
$this->pointer += $this->options['chunkSize'];
}
}
}
// return the element (or the whole file if we're not looking for elements)
return $buffer;
}
}
This is an old post, but first in the google search result, so I thought I post another solution based on this post:
http://drib.tech/programming/parse-large-xml-files-php
This solution uses both XMLReader and SimpleXMLElement :
$xmlFile = 'the_LARGE_xml_file_to_load.xml'
$primEL = 'the_name_of_your_element';
$xml = new XMLReader();
$xml->open($xmlFile);
// finding first primary element to work with
while($xml->read() && $xml->name != $primEL){;}
// looping through elements
while($xml->name == $primEL) {
// loading element data into simpleXML object
$element = new SimpleXMLElement($xml->readOuterXML());
// DO STUFF
// moving pointer
$xml->next($primEL);
// clearing current element
unset($element);
} // end while
$xml->close();
I would suggest using a SAX based parser rather than DOM based parsing.
Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm
This isn't a great solution, but just to throw another option out there:
You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).
e.g., if your doc looks like:
<dmoz>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
...
</dmoz>
You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root level tag, and then load them via simplexml/domxml (I used domxml, when taking this approach).
Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that, you're stuck with either the above chunking strategy, or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.
Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, or URLs, etc., but wouldn't make sense for parsing a large HTML document)
You can combine XMLReader with DOM for this. In PHP both APIs (and SimpleXML) are based on the same library - libxml2. Large XMLs are a list of records typically. So you use XMLReader to iterate the records, load a single record into DOM and use DOM methods and Xpath to extract values. The key is the method XMLReader::expand(). It loads the current node in an XMLReader instance and its descendants as DOM nodes.
Example XML:
<books>
<book>
<title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title>
</book>
<book>
<title isbn="978-0596100506">XML Pocket Reference</title>
</book>
<!-- ... -->
</books>
Example code:
// open the XML file
$reader = new XMLReader();
$reader->open('books.xml');
// prepare a DOM document
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// find the first `book` element node at any depth
while ($reader->read() && $reader->localName !== 'book') {
continue;
}
// as long as here is a node with the name "book"
while ($reader->localName === 'book') {
// expand the node into the prepared DOM
$book = $reader->expand($document);
// use Xpath expressions to fetch values
var_dump(
$xpath->evaluate('string(title/#isbn)', $book),
$xpath->evaluate('string(title)', $book)
);
// move to the next book sibling node
$reader->next('book');
}
$reader->close();
Take note that the expanded node is never appended to the DOM document. It allows the GC to clean it up.
This approach works with XML namespaces as well.
$namespaceURI = 'urn:example-books';
$reader = new XMLReader();
$reader->open('books.xml');
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// register a prefix for the Xpath expressions
$xpath->registerNamespace('b', $namespaceURI);
// compare local node name and namespace URI
while (
$reader->read() &&
(
$reader->localName !== 'book' ||
$reader->namespaceURI !== $namespaceURI
)
) {
continue;
}
// iterate the book elements
while ($reader->localName === 'book') {
// validate that they are in the namespace
if ($reader->namespaceURI === $namespaceURI) {
$book = $reader->expand($document);
var_dump(
$xpath->evaluate('string(b:title/#isbn)', $book),
$xpath->evaluate('string(b:title)', $book)
);
}
$reader->next('book');
}
$reader->close();
I've written a wrapper for XMLReader to (IMHO) make it easier to just get the bits your after. The wrapper allows you to associate a set of paths of the data elements and a callback to be run when this path is found. The path allows regex expressions and also capture groups which can also be passed to the callback.
The library is at https://github.com/NigelRel3/XMLReaderReg and can also be installed using composer require nigelrel3/xml-reader-reg.
An example of how to use it...
$inputFile = __DIR__ ."/../tests/data/simpleTest1.xml";
$reader = new XMLReaderReg\XMLReaderReg();
$reader->open($inputFile);
$reader->process([
'(.*/person(?:\[\d*\])?)' => function (SimpleXMLElement $data, $path): void {
echo "1) Value for ".$path[1]." is ".PHP_EOL.
$data->asXML().PHP_EOL;
},
'(.*/person3(\[\d*\])?)' => function (DOMElement $data, $path): void {
echo "2) Value for ".$path[1]." is ".PHP_EOL.
$data->ownerDocument->saveXML($data).PHP_EOL;
},
'/root/person2/firstname' => function (string $data): void {
echo "3) Value for /root/person2/firstname is ". $data.PHP_EOL;
}
]);
$reader->close();
As can be seen from the example, you can get the data to be passed as a SimpleXMLElement, a DOMElement or the last one is a string. This will represent only the data which matches the path.
The paths also show how capture groups can be used - (.*/person(?:\[\d*\])?) looks for any person element (including arrays of elements) and $path[1] in the callback displays the path where this particular instance is found.
There is an expanded example in the library as well as unit tests.
I tested the following code with 2 GB xml:
<?php
set_time_limit(0);
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
die("Failed to open 'data.xml'");
}
while($reader->read())
{
$node = $reader->expand();
// process $node...
}
$reader->close();
?>
My solution:
$reader = new XMLReader();
$reader->open($fileTMP);
while ($reader->read()) {
if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'xmltag' && $reader->isEmptyElement === false) {
$item = simplexml_load_string($reader->readOuterXML(), null, LIBXML_NOCDATA);
//operations on file
}
}
$reader->close();
Very high performed way is
preg_split('/(<|>)/m', $xmlString);
And after that, only one cycle is needed.

Categories