php -- combining arrays

I'm trying to write a function that does the following. I have about 20 XML files (someday I will have over a hundred), and the header of each file contains the name of a person who was a peer review editor: <editor role="PeerReviewEditor">John Doe</editor>. I want to run through the directory where these files are stored and capture the name of the peer review editor for each file. I want to end up with a variable $reviewEditorNames that contains all of the different names. (I will then use this to display a list of editors, etc.)
Here's what I've got so far. I'm worried about the last part. I feel like the attempt to turn $reviewEditorName into $reviewEditorNames is not going to combine the names from each file, but will instead return the array found within a single file (even if there is only one name in a given file, and thus it is an array of 1).
I'm grateful for your help.
function editorlist()
{
    $filename = readDirectory('../editedtranscriptions');
    foreach ($filename as $file)
    {
        $xmldoc = simplexml_load_file("../editedtranscriptions/$file");
        $xmldoc->registerXPathNamespace("tei", "http://www.tei-c.org/ns/1.0");
        $reviewEditorName = $xmldoc->xpath("//tei:editor[@role='PeerReviewEditor']");
        return $reviewEditorNames[] = $reviewEditorName;
    }
}

I would separate things more; that also helps when you need to change your code later on.
Beyond that, you need to check the return value of the xpath() call. Most likely you want to process only the first match (is there one editor per file?) and return it as a string.
If you put things into functions of their own, it's easier to make each function do only one thing, and so easier to debug and improve things. E.g. you can first test whether an editorFromFile function does what it should and then run it on multiple files:
/**
 * get PeerReviewEditor from file
 *
 * @param string $file
 * @return string editor name, or an empty string if none was found
 */
function editorFromFile($file)
{
    $xmldoc = simplexml_load_file($file);
    if ($xmldoc === false) {
        return ''; // file could not be parsed
    }
    $xmldoc->registerXPathNamespace("tei", "http://www.tei-c.org/ns/1.0");
    $node = $xmldoc->xpath("//tei:editor[@role='PeerReviewEditor'][1]");
    // xpath() returns an empty array when nothing matches
    return $node ? (string) $node[0] : '';
}
/**
 * get editors from a path
 *
 * @param string $path
 * @return array
 */
function editorlist($path)
{
    $editors = array();
    $files = glob(sprintf('%s/*.xml', $path), GLOB_NOSORT);
    foreach ($files as $file)
    {
        $editors[] = editorFromFile($file);
    }
    return $editors;
}

Just a little update:
function editorlist() {
    $reviewEditorNames = array(); // init the array
    $filename = readDirectory('../editedtranscriptions');
    foreach ($filename as $file) {
        $xmldoc = simplexml_load_file("../editedtranscriptions/$file");
        $xmldoc->registerXPathNamespace("tei", "http://www.tei-c.org/ns/1.0");
        // add to the array
        $result = $xmldoc->xpath("//tei:editor[@role='PeerReviewEditor']");
        if (sizeof($result) > 0) {
            $reviewEditorNames[] = (string)$result[0];
        }
    }
    // return the array
    return $reviewEditorNames;
}
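Since the question asks for "all of the different names", the same editor appearing in several files probably should be listed only once. Here is a minimal sketch of that, assuming names only need exact-match deduplication after trimming. uniqueEditors() and the inline sample documents are made up for illustration; in the real code the strings would come from the files in the directory.

```php
<?php
// Hypothetical helper: collect distinct peer review editor names
// from a list of TEI documents given as strings.
function uniqueEditors(array $xmlStrings): array
{
    $names = [];
    foreach ($xmlStrings as $xmlString) {
        $doc = simplexml_load_string($xmlString);
        if ($doc === false) {
            continue; // skip documents that fail to parse
        }
        $doc->registerXPathNamespace('tei', 'http://www.tei-c.org/ns/1.0');
        $nodes = $doc->xpath("//tei:editor[@role='PeerReviewEditor']");
        // xpath() may return an empty array (or a non-array on error)
        foreach ($nodes ?: [] as $node) {
            $names[] = trim((string) $node);
        }
    }
    // array_unique keeps the first occurrence; array_values reindexes from 0
    return array_values(array_unique($names));
}

// Made-up sample data standing in for three files
$tei = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader>'
     . '<editor role="PeerReviewEditor">%s</editor></teiHeader></TEI>';
$samples = [
    sprintf($tei, 'John Doe'),
    sprintf($tei, 'Jane Roe'),
    sprintf($tei, 'John Doe'),
];
$names = uniqueEditors($samples);
```

This also collects every matching editor in a file rather than only the first, which may or may not be what you want if a file can have several peer review editors.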

Related

Using PHPWord elements in a TemplateProcessor

I have a Word template with several macros (customer name, address, etc) as well as some generated tables, which work fine. One of the macros I need to populate is data from a TinyMCE field on a web form, which generates data as HTML.
I looked at https://github.com/rkorebrits/HTMLtoOpenXML, however the HTML I need to inject comes out of a TinyMCE editor and can include things like tables.
I was able to write the HTML to disk and use https://github.com/unoconv/unoconv to convert it into a .docx file, and then use PHPWord to extract the elements, however I can't seem to inject those elements into a TemplateProcessor as they're the wrong type!
Any help would be appreciated.
/**
* Converts HTML data into XML elements for use in Word
*
* @param string $html
* @return \PhpOffice\PhpWord\Element\Section
*/
public function convertHTML($html)
{
global $config;
$format = "<html><head></head><body>%s</body></html>";
$temphtml = tempnam($config['tmp_dir'], 'PHPWord');
$tempdocx = tempnam($config['tmp_dir'], 'PHPWord');
$elements = [];
// Write the HTML to disk
$fp = fopen($temphtml, 'w');
fprintf($fp, $format, $html);
fclose($fp);
// Convert the HTML to Word
$unoconv = Unoconv::create();
$unoconv->transcode($temphtml, 'docx', $tempdocx);
// Parse the word doc
$word = IOFactory::load($tempdocx);
$sections = $word->getSections();
foreach ($sections as $section) {
$elements = array_merge($elements, $section->getElements());
}
unlink($temphtml);
unlink($tempdocx);
$section = $word->addSection();
foreach ($elements as $element) {
$section->addElement($element);
}
return $section;
}
And then in my template processing, I have something like:
// See if we're dealing with a TinyMCE field
$def = $bean->getFieldDefinition($varArray[0]);
if ($def['type'] === 'tinymce') {
// Convert the HTML from the TinyMCE field into Word elements
$section = $this->convertHTML($bean->{$varArray[0]});
$this->setComplexValue($var, $section);
} else {
// Just treat it like a normal text value
$this->setValue($var, $bean->{$varArray[0]});
}
$template->setComplexValue() fails as the result is of type PhpOffice\PhpWord\Element\Section and setComplexValue only accepts PhpOffice\PhpWord\Writer\Word2007\Element*, so I think I've hit a dead end.

Open a 2GB xml file to understand the structure [duplicate]

I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?

There are only two PHP APIs that are really suited for processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
For an example, you might want to look at this partial parser of the DMOZ-catalog:
<?php
class SimpleDMOZParser
{
protected $_stack = array();
protected $_file = "";
protected $_parser = null;
protected $_currentId = "";
protected $_current = "";
public function __construct($file)
{
$this->_file = $file;
$this->_parser = xml_parser_create("UTF-8");
xml_set_object($this->_parser, $this);
xml_set_element_handler($this->_parser, "startTag", "endTag");
}
public function startTag($parser, $name, $attribs)
{
array_push($this->_stack, $this->_current);
if ($name == "TOPIC" && count($attribs)) {
$this->_currentId = $attribs["R:ID"];
}
if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
echo $attribs["R:RESOURCE"] . "\n";
}
$this->_current = $name;
}
public function endTag($parser, $name)
{
$this->_current = array_pop($this->_stack);
}
public function parse()
{
$fh = fopen($this->_file, "r");
if (!$fh) {
die("Epic fail!\n");
}
while (!feof($fh)) {
$data = fread($fh, 4096);
xml_parse($this->_parser, $data, feof($fh));
}
}
}
$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
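One thing the parser above does not do is check the return value of xml_parse(), so a malformed file would fail silently. The following is only a sketch of how that check could look, using closures as handlers and counting element names instead of echoing links. countElements() and the tiny sample document are made up for illustration; the chunks would normally come from fread().

```php
<?php
// Hypothetical helper: feed XML to the expat-style parser in pieces
// and count how often each element name occurs.
function countElements(array $chunks): array
{
    $counts = [];
    $parser = xml_parser_create('UTF-8');
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attribs) use (&$counts) {
            // case folding is on by default, so $name arrives upper-cased
            $counts[$name] = ($counts[$name] ?? 0) + 1;
        },
        function ($parser, $name) {
            // nothing to do on end tags in this sketch
        }
    );
    foreach ($chunks as $chunk) {
        // xml_parse() returns 0 on failure
        if (!xml_parse($parser, $chunk, false)) {
            throw new RuntimeException(
                xml_error_string(xml_get_error_code($parser))
            );
        }
    }
    // signal the end of input so unclosed tags are reported
    if (!xml_parse($parser, '', true)) {
        throw new RuntimeException(
            xml_error_string(xml_get_error_code($parser))
        );
    }
    return $counts;
}

// Made-up sample, split into 7-byte pieces to mimic chunked reads
$chunks = str_split('<root><item/><item>text</item><other/></root>', 7);
$counts = countElements($chunks);
```

Only one chunk is held in memory at a time, which is the point of this API for multi-gigabyte files.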

This is a very similar question to Best way to process large XML in PHP but with a very good specific answer upvoted addressing the specific problem of DMOZ catalogue parsing.
However, since this is a good Google hit for large XMLs in general, I will repost my answer from the other question as well:
My take on it:
https://github.com/prewk/XmlStreamer
A simple class that will extract all children to the XML root element while streaming the file.
Tested on 108 MB XML file from pubmed.com.
class SimpleXmlStreamer extends XmlStreamer {
public function processNode($xmlString, $elementName, $nodeIndex) {
$xml = simplexml_load_string($xmlString);
// Do something with your SimpleXML object
return true;
}
}
$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();

I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
If you have the following file complex-test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
<Object>
<Title>Title 1</Title>
<Name>It's name goes here</Name>
<ObjectData>
<Info1></Info1>
<Info2></Info2>
<Info3></Info3>
<Info4></Info4>
</ObjectData>
<Date></Date>
</Object>
<Object></Object>
<Object>
<AnotherObject></AnotherObject>
<Data></Data>
</Object>
<Object></Object>
<Object></Object>
</Complex>
And wanted to return the <Object/>s
PHP:
require_once('class.chunk.php');
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
while ($xml = $file->read()) {
$obj = simplexml_load_string($xml);
// do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
* Chunk
*
* Reads a large file in as chunks for easier parsing.
*
* The chunks returned are whole <$this->options['element']/>s found within file.
*
* Each call to read() returns the whole element including start and end tags.
*
* Tested with a 1.8MB file, extracted 500 elements in 0.11s
* (with no work done, just extracting the elements)
*
* Usage:
* <code>
* // initialize the object
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
*
* // loop through the file until all lines are read
* while ($xml = $file->read()) {
* // do whatever you want with the string
* $o = simplexml_load_string($xml);
* }
* </code>
*
* @package default
* @author Dom Hastings
*/
class Chunk {
/**
* options
*
* @var array Contains all major options
* @access public
*/
public $options = array(
'path' => './', // string The path to check for $file in
'element' => '', // string The XML element to return
'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk
);
/**
* file
*
* @var string The filename being read
* @access public
*/
public $file = '';
/**
* pointer
*
* @var integer The current position the file is being read from
* @access public
*/
public $pointer = 0;
/**
* handle
*
* @var resource The fopen() resource
* @access private
*/
private $handle = null;
/**
* reading
*
* @var boolean Whether the script is currently reading the file
* @access private
*/
private $reading = false;
/**
* readBuffer
*
* @var string Used to make sure start tags aren't missed
* @access private
*/
private $readBuffer = '';
/**
* __construct
*
* Builds the Chunk object
*
* @param string $file The filename to work with
* @param array $options The options with which to parse the file
* @author Dom Hastings
* @access public
*/
public function __construct($file, $options = array()) {
// merge the options together
$this->options = array_merge($this->options, (is_array($options) ? $options : array()));
// check that the path ends with a /
if (substr($this->options['path'], -1) != '/') {
$this->options['path'] .= '/';
}
// normalize the filename
$file = basename($file);
// make sure chunkSize is an int
$this->options['chunkSize'] = intval($this->options['chunkSize']);
// check it's valid
if ($this->options['chunkSize'] < 64) {
$this->options['chunkSize'] = 512;
}
// set the filename
$this->file = realpath($this->options['path'].$file);
// check the file exists
if (!file_exists($this->file)) {
throw new Exception('Cannot load file: '.$this->file);
}
// open the file
$this->handle = fopen($this->file, 'r');
// check the file opened successfully
if (!$this->handle) {
throw new Exception('Error opening file for reading');
}
}
/**
* __destruct
*
* Cleans up
*
* @return void
* @author Dom Hastings
* @access public
*/
public function __destruct() {
// close the file resource
fclose($this->handle);
}
/**
* read
*
* Reads the first available occurrence of the XML element $this->options['element']
*
* @return string The XML string from $this->file
* @author Dom Hastings
* @access public
*/
public function read() {
// check we have an element specified
if (!empty($this->options['element'])) {
// trim it
$element = trim($this->options['element']);
} else {
$element = '';
}
// initialize the buffer
$buffer = false;
// if the element is empty
if (empty($element)) {
// let the script know we're reading
$this->reading = true;
// read in the whole doc, cos we don't know what's wanted
while ($this->reading) {
$buffer .= fread($this->handle, $this->options['chunkSize']);
$this->reading = (!feof($this->handle));
}
// return it all
return $buffer;
// we must be looking for a specific element
} else {
// set up the strings to find
$open = '<'.$element.'>';
$close = '</'.$element.'>';
// let the script know we're reading
$this->reading = true;
// reset the global buffer
$this->readBuffer = '';
// this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
$store = false;
// seek to the position we need in the file
fseek($this->handle, $this->pointer);
// start reading
while ($this->reading && !feof($this->handle)) {
// store the chunk in a temporary variable
$tmp = fread($this->handle, $this->options['chunkSize']);
// update the global buffer
$this->readBuffer .= $tmp;
// check for the open string
$checkOpen = strpos($tmp, $open);
// if it wasn't in the new buffer
if (!$checkOpen && !($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkOpen = strpos($this->readBuffer, $open);
// if it was in there
if ($checkOpen) {
// set it to the remainder
$checkOpen = $checkOpen % $this->options['chunkSize'];
}
}
// check for the close string
$checkClose = strpos($tmp, $close);
// if it wasn't in the new buffer
if (!$checkClose && ($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkClose = strpos($this->readBuffer, $close);
// if it was in there
if ($checkClose) {
// set it to the remainder plus the length of the close string itself
$checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
}
// if it was
} elseif ($checkClose) {
// add the length of the close string itself
$checkClose += strlen($close);
}
// if we've found the opening string and we're not already reading another element
if ($checkOpen !== false && !($store)) {
// if we're found the end element too
if ($checkClose !== false) {
// append the string only between the start and end element
$buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
} else {
// append the data we know to be part of this element
$buffer .= substr($tmp, $checkOpen);
// update the pointer
$this->pointer += $this->options['chunkSize'];
// let the script know we're gonna be storing all the data until we find the close element
$store = true;
}
// if we've found the closing element
} elseif ($checkClose !== false) {
// update the buffer with the data upto and including the close tag
$buffer .= substr($tmp, 0, $checkClose);
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
// if we've found the closing element, but half in the previous chunk
} elseif ($store) {
// update the buffer
$buffer .= $tmp;
// and the pointer
$this->pointer += $this->options['chunkSize'];
}
}
}
// return the element (or the whole file if we're not looking for elements)
return $buffer;
}
}

This is an old post, but it is the first Google search result, so I thought I'd post another solution based on this post:
http://drib.tech/programming/parse-large-xml-files-php
This solution uses both XMLReader and SimpleXMLElement :
$xmlFile = 'the_LARGE_xml_file_to_load.xml';
$primEL = 'the_name_of_your_element';
$xml = new XMLReader();
$xml->open($xmlFile);
// finding first primary element to work with
while($xml->read() && $xml->name != $primEL){;}
// looping through elements
while($xml->name == $primEL) {
// loading element data into simpleXML object
$element = new SimpleXMLElement($xml->readOuterXML());
// DO STUFF
// moving pointer
$xml->next($primEL);
// clearing current element
unset($element);
} // end while
$xml->close();

I would suggest using a SAX based parser rather than DOM based parsing.
Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm

This isn't a great solution, but just to throw another option out there:
You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).
e.g., if your doc looks like:
<dmoz>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
...
</dmoz>
You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root level tag, and then load them via simplexml/domxml (I used domxml, when taking this approach).
Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that, you're stuck with either the above chunking strategy, or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.
Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, or URLs, etc., but wouldn't make sense for parsing a large HTML document)
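To make the chunking idea concrete, here is a rough sketch under some loud assumptions: the records are <listing> elements that are never nested, the tag text never appears inside CDATA or comments, and the records don't depend on namespace or entity declarations from the document header. readListings() is a made-up name, and I've used SimpleXML here rather than domxml.

```php
<?php
// Hypothetical generator: read a stream in fixed-size pieces, wrap the
// complete <listing> elements seen so far in an artificial root element,
// and yield them one at a time as SimpleXMLElement objects.
function readListings($handle, int $chunkSize = 1048576): Generator
{
    $buffer = '';
    $closeTag = '</listing>';
    while (!feof($handle)) {
        $data = fread($handle, $chunkSize);
        if ($data === false) {
            break; // read error
        }
        $buffer .= $data;
        // position just past the last complete record in the buffer
        $end = strrpos($buffer, $closeTag);
        if ($end === false) {
            continue; // no complete record yet, keep reading
        }
        $end += strlen($closeTag);
        $complete = substr($buffer, 0, $end);
        $buffer = substr($buffer, $end); // keep the partial tail
        // pick out the records, ignoring anything around them (e.g. <dmoz>)
        preg_match_all('~<listing[\s>].*?</listing>~s', $complete, $matches);
        if (!$matches[0]) {
            continue;
        }
        $wrapped = simplexml_load_string(
            '<root>' . implode('', $matches[0]) . '</root>'
        );
        if ($wrapped === false) {
            throw new RuntimeException('Could not parse chunk');
        }
        foreach ($wrapped->listing as $listing) {
            yield $listing;
        }
    }
}

// Made-up sample held in memory; a real file would use fopen($path, 'r')
$handle = fopen('php://memory', 'w+');
fwrite($handle, '<dmoz><listing>a</listing><listing id="2">b</listing>'
    . '<listing>c</listing></dmoz>');
rewind($handle);
$texts = [];
foreach (readListings($handle, 16) as $listing) {
    $texts[] = (string) $listing;
}
fclose($handle);
```

The tiny chunk size of 16 is only there to force records to straddle chunk boundaries in the sample; for a real file something like the default of one megabyte makes more sense.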

You can combine XMLReader with DOM for this. In PHP both APIs (and SimpleXML) are based on the same library - libxml2. Large XMLs are a list of records typically. So you use XMLReader to iterate the records, load a single record into DOM and use DOM methods and Xpath to extract values. The key is the method XMLReader::expand(). It loads the current node in an XMLReader instance and its descendants as DOM nodes.
Example XML:
<books>
<book>
<title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title>
</book>
<book>
<title isbn="978-0596100506">XML Pocket Reference</title>
</book>
<!-- ... -->
</books>
Example code:
// open the XML file
$reader = new XMLReader();
$reader->open('books.xml');
// prepare a DOM document
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// find the first `book` element node at any depth
while ($reader->read() && $reader->localName !== 'book') {
continue;
}
// as long as there is a node with the name "book"
while ($reader->localName === 'book') {
// expand the node into the prepared DOM
$book = $reader->expand($document);
// use Xpath expressions to fetch values
var_dump(
$xpath->evaluate('string(title/@isbn)', $book),
$xpath->evaluate('string(title)', $book)
);
// move to the next book sibling node
$reader->next('book');
}
$reader->close();
Take note that the expanded node is never appended to the DOM document. It allows the GC to clean it up.
This approach works with XML namespaces as well.
$namespaceURI = 'urn:example-books';
$reader = new XMLReader();
$reader->open('books.xml');
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// register a prefix for the Xpath expressions
$xpath->registerNamespace('b', $namespaceURI);
// compare local node name and namespace URI
while (
$reader->read() &&
(
$reader->localName !== 'book' ||
$reader->namespaceURI !== $namespaceURI
)
) {
continue;
}
// iterate the book elements
while ($reader->localName === 'book') {
// validate that they are in the namespace
if ($reader->namespaceURI === $namespaceURI) {
$book = $reader->expand($document);
var_dump(
$xpath->evaluate('string(b:title/@isbn)', $book),
$xpath->evaluate('string(b:title)', $book)
);
}
$reader->next('book');
}
$reader->close();
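If you need this loop in more than one place, it can be wrapped in a generator so the calling code just iterates over records. This is only a sketch of the non-namespaced variant above; readBooks() is a made-up name, and the temporary file exists purely so the example is self-contained.

```php
<?php
// Hypothetical wrapper: stream `book` records from a file and yield
// the extracted values, one record at a time.
function readBooks(string $path): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }
    $document = new DOMDocument();
    $xpath = new DOMXPath($document);
    try {
        // find the first `book` element node at any depth
        while ($reader->read() && $reader->localName !== 'book') {
            continue;
        }
        while ($reader->localName === 'book') {
            // expand the current record into the prepared DOM
            $book = $reader->expand($document);
            yield [
                'isbn'  => $xpath->evaluate('string(title/@isbn)', $book),
                'title' => $xpath->evaluate('string(title)', $book),
            ];
            // move to the next book sibling node
            $reader->next('book');
        }
    } finally {
        // runs even if the caller stops iterating early
        $reader->close();
    }
}

// Write the example XML from above to a temporary file
$path = tempnam(sys_get_temp_dir(), 'books');
file_put_contents($path, '<books>'
    . '<book><title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title></book>'
    . '<book><title isbn="978-0596100506">XML Pocket Reference</title></book>'
    . '</books>');
$rows = iterator_to_array(readBooks($path), false);
unlink($path);
```

For a really large file you would iterate with foreach instead of iterator_to_array(), otherwise all rows end up in memory again.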

I've written a wrapper for XMLReader to (IMHO) make it easier to just get the bits you're after. The wrapper allows you to associate a set of paths of the data elements with a callback to be run when each path is found. The paths allow regular expressions and also capture groups, which can be passed to the callback.
The library is at https://github.com/NigelRel3/XMLReaderReg and can also be installed using composer require nigelrel3/xml-reader-reg.
An example of how to use it...
$inputFile = __DIR__ ."/../tests/data/simpleTest1.xml";
$reader = new XMLReaderReg\XMLReaderReg();
$reader->open($inputFile);
$reader->process([
'(.*/person(?:\[\d*\])?)' => function (SimpleXMLElement $data, $path): void {
echo "1) Value for ".$path[1]." is ".PHP_EOL.
$data->asXML().PHP_EOL;
},
'(.*/person3(\[\d*\])?)' => function (DOMElement $data, $path): void {
echo "2) Value for ".$path[1]." is ".PHP_EOL.
$data->ownerDocument->saveXML($data).PHP_EOL;
},
'/root/person2/firstname' => function (string $data): void {
echo "3) Value for /root/person2/firstname is ". $data.PHP_EOL;
}
]);
$reader->close();
As can be seen from the example, you can get the data to be passed as a SimpleXMLElement, a DOMElement or the last one is a string. This will represent only the data which matches the path.
The paths also show how capture groups can be used - (.*/person(?:\[\d*\])?) looks for any person element (including arrays of elements) and $path[1] in the callback displays the path where this particular instance is found.
There is an expanded example in the library as well as unit tests.

I tested the following code with 2 GB xml:
<?php
set_time_limit(0);
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
die("Failed to open 'data.xml'");
}
while($reader->read())
{
$node = $reader->expand();
// process $node...
}
$reader->close();
?>

My solution:
$reader = new XMLReader();
$reader->open($fileTMP);
while ($reader->read()) {
if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'xmltag' && $reader->isEmptyElement === false) {
$item = simplexml_load_string($reader->readOuterXML(), null, LIBXML_NOCDATA);
//operations on file
}
}
$reader->close();

A very high-performance way is:
preg_split('/(<|>)/m', $xmlString);
And after that, only one cycle is needed.

Parsing very big XML file with PHP [duplicate]

I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?

There are only two php APIs that are really suited for processing large files. The first is the old expat api, and the second is the newer XMLreader functions. These apis read continuous streams rather than loading the entire tree into memory (which is what simplexml and DOM does).
For an example, you might want to look at this partial parser of the DMOZ-catalog:
<?php
class SimpleDMOZParser
{
protected $_stack = array();
protected $_file = "";
protected $_parser = null;
protected $_currentId = "";
protected $_current = "";
public function __construct($file)
{
$this->_file = $file;
$this->_parser = xml_parser_create("UTF-8");
xml_set_object($this->_parser, $this);
xml_set_element_handler($this->_parser, "startTag", "endTag");
}
public function startTag($parser, $name, $attribs)
{
array_push($this->_stack, $this->_current);
if ($name == "TOPIC" && count($attribs)) {
$this->_currentId = $attribs["R:ID"];
}
if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
echo $attribs["R:RESOURCE"] . "\n";
}
$this->_current = $name;
}
public function endTag($parser, $name)
{
$this->_current = array_pop($this->_stack);
}
public function parse()
{
$fh = fopen($this->_file, "r");
if (!$fh) {
die("Epic fail!\n");
}
while (!feof($fh)) {
$data = fread($fh, 4096);
xml_parse($this->_parser, $data, feof($fh));
}
}
}
$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();

This is a very similar question to Best way to process large XML in PHP but with a very good specific answer upvoted addressing the specific problem of DMOZ catalogue parsing.
However, since this is a good Google hit for large XMLs in general, I will repost my answer from the other question as well:
My take on it:
https://github.com/prewk/XmlStreamer
A simple class that will extract all children to the XML root element while streaming the file.
Tested on 108 MB XML file from pubmed.com.
class SimpleXmlStreamer extends XmlStreamer {
public function processNode($xmlString, $elementName, $nodeIndex) {
$xml = simplexml_load_string($xmlString);
// Do something with your SimpleXML object
return true;
}
}
$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();

I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
If you have the following file complex-test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
<Object>
<Title>Title 1</Title>
<Name>It's name goes here</Name>
<ObjectData>
<Info1></Info1>
<Info2></Info2>
<Info3></Info3>
<Info4></Info4>
</ObjectData>
<Date></Date>
</Object>
<Object></Object>
<Object>
<AnotherObject></AnotherObject>
<Data></Data>
</Object>
<Object></Object>
<Object></Object>
</Complex>
And wanted to return the <Object/>s
PHP:
require_once('class.chunk.php');
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
while ($xml = $file->read()) {
$obj = simplexml_load_string($xml);
// do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
* Chunk
*
* Reads a large file in as chunks for easier parsing.
*
* The chunks returned are whole <$this->options['element']/>s found within file.
*
* Each call to read() returns the whole element including start and end tags.
*
* Tested with a 1.8MB file, extracted 500 elements in 0.11s
* (with no work done, just extracting the elements)
*
* Usage:
* <code>
* // initialize the object
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
*
* // loop through the file until all lines are read
* while ($xml = $file->read()) {
* // do whatever you want with the string
* $o = simplexml_load_string($xml);
* }
* </code>
*
* #package default
* #author Dom Hastings
*/
class Chunk {
/**
* options
*
* #var array Contains all major options
* #access public
*/
public $options = array(
'path' => './', // string The path to check for $file in
'element' => '', // string The XML element to return
'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk
);
/**
* file
*
* #var string The filename being read
* #access public
*/
public $file = '';
/**
* pointer
*
* #var integer The current position the file is being read from
* #access public
*/
public $pointer = 0;
/**
* handle
*
* #var resource The fopen() resource
* #access private
*/
private $handle = null;
/**
* reading
*
* #var boolean Whether the script is currently reading the file
* #access private
*/
private $reading = false;
/**
* readBuffer
*
* #var string Used to make sure start tags aren't missed
* #access private
*/
private $readBuffer = '';
/**
* __construct
*
* Builds the Chunk object
*
* #param string $file The filename to work with
* #param array $options The options with which to parse the file
* #author Dom Hastings
* #access public
*/
public function __construct($file, $options = array()) {
// merge the options together
$this->options = array_merge($this->options, (is_array($options) ? $options : array()));
// check that the path ends with a /
if (substr($this->options['path'], -1) != '/') {
$this->options['path'] .= '/';
}
// normalize the filename
$file = basename($file);
// make sure chunkSize is an int
$this->options['chunkSize'] = intval($this->options['chunkSize']);
// check it's valid
if ($this->options['chunkSize'] < 64) {
$this->options['chunkSize'] = 512;
}
// set the filename
$this->file = realpath($this->options['path'].$file);
// check the file exists
if (!file_exists($this->file)) {
throw new Exception('Cannot load file: '.$this->file);
}
// open the file
$this->handle = fopen($this->file, 'r');
// check the file opened successfully
if (!$this->handle) {
throw new Exception('Error opening file for reading');
}
}
/**
* __destruct
*
* Cleans up
*
* #return void
* #author Dom Hastings
* #access public
*/
public function __destruct() {
// close the file resource
fclose($this->handle);
}
/**
* read
*
* Reads the first available occurence of the XML element $this->options['element']
*
* #return string The XML string from $this->file
* #author Dom Hastings
* #access public
*/
public function read() {
// check we have an element specified
if (!empty($this->options['element'])) {
// trim it
$element = trim($this->options['element']);
} else {
$element = '';
}
// initialize the buffer
$buffer = false;
// if the element is empty
if (empty($element)) {
// let the script know we're reading
$this->reading = true;
// read in the whole doc, cos we don't know what's wanted
while ($this->reading) {
$buffer .= fread($this->handle, $this->options['chunkSize']);
$this->reading = (!feof($this->handle));
}
// return it all
return $buffer;
// we must be looking for a specific element
} else {
// set up the strings to find
$open = '<'.$element.'>';
$close = '</'.$element.'>';
// let the script know we're reading
$this->reading = true;
// reset the global buffer
$this->readBuffer = '';
// this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
$store = false;
// seek to the position we need in the file
fseek($this->handle, $this->pointer);
// start reading
while ($this->reading && !feof($this->handle)) {
// store the chunk in a temporary variable
$tmp = fread($this->handle, $this->options['chunkSize']);
// update the global buffer
$this->readBuffer .= $tmp;
// check for the open string
$checkOpen = strpos($tmp, $open);
// if it wasn't in the new buffer
if (!$checkOpen && !($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkOpen = strpos($this->readBuffer, $open);
// if it was in there
if ($checkOpen) {
// set it to the remainder
$checkOpen = $checkOpen % $this->options['chunkSize'];
}
}
// check for the close string
$checkClose = strpos($tmp, $close);
// if it wasn't in the new buffer
if (!$checkClose && ($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkClose = strpos($this->readBuffer, $close);
// if it was in there
if ($checkClose) {
// set it to the remainder plus the length of the close string itself
$checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
}
// if it was
} elseif ($checkClose) {
// add the length of the close string itself
$checkClose += strlen($close);
}
// if we've found the opening string and we're not already reading another element
if ($checkOpen !== false && !($store)) {
// if we're found the end element too
if ($checkClose !== false) {
// append the string only between the start and end element
$buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
} else {
// append the data we know to be part of this element
$buffer .= substr($tmp, $checkOpen);
// update the pointer
$this->pointer += $this->options['chunkSize'];
// let the script know we're gonna be storing all the data until we find the close element
$store = true;
}
// if we've found the closing element
} elseif ($checkClose !== false) {
// update the buffer with the data upto and including the close tag
$buffer .= substr($tmp, 0, $checkClose);
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
// if we've found the closing element, but half in the previous chunk
} elseif ($store) {
// update the buffer
$buffer .= $tmp;
// and the pointer
$this->pointer += $this->options['chunkSize'];
}
}
}
// return the element (or the whole file if we're not looking for elements)
return $buffer;
}
}

This is an old post, but first in the google search result, so I thought I post another solution based on this post:
http://drib.tech/programming/parse-large-xml-files-php
This solution uses both XMLReader and SimpleXMLElement :
$xmlFile = 'the_LARGE_xml_file_to_load.xml'
$primEL = 'the_name_of_your_element';
$xml = new XMLReader();
$xml->open($xmlFile);
// finding first primary element to work with
while($xml->read() && $xml->name != $primEL){;}
// looping through elements
while($xml->name == $primEL) {
// loading element data into simpleXML object
$element = new SimpleXMLElement($xml->readOuterXML());
// DO STUFF
// moving pointer
$xml->next($primEL);
// clearing current element
unset($element);
} // end while
$xml->close();
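If the same loop is needed in several places, it can be wrapped in a generator. This is only a minimal sketch under a few assumptions: the function name streamElements() is made up for illustration, the element name is matched without regard to namespaces, and the matching elements are assumed to be siblings.

```php
<?php
/**
 * Hypothetical helper (not part of the linked article): yields every
 * <$elementName> element of the file as a SimpleXMLElement, one at a time.
 */
function streamElements(string $path, string $elementName): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open '$path'");
    }
    try {
        // advance to the first matching element
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $elementName) {
                break;
            }
        }
        // hand out one element per iteration; only that element is held in memory
        while ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $elementName) {
            yield new SimpleXMLElement($reader->readOuterXml());
            // jump to the next element with that name, skipping the current subtree
            if (!$reader->next($elementName)) {
                break;
            }
        }
    } finally {
        $reader->close();
    }
}
```

Usage would then be a plain foreach (streamElements($xmlFile, $primEL) as $element) { ... } loop, and the reader is closed even if the loop is left early.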

I would suggest using a SAX based parser rather than DOM based parsing.
Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm
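For reference, a SAX-style parse with PHP's built-in expat functions might look roughly like the sketch below. The function name collectElementText() is hypothetical, and collecting everything into an array is only for demonstration; with a really large file you would process each value inside the end-tag handler instead of storing it.

```php
<?php
/**
 * Hypothetical helper: streams the file through the expat parser and
 * returns the text content of every <$elementName> element.
 */
function collectElementText(string $path, string $elementName): array
{
    $parser = xml_parser_create('UTF-8');
    // keep element names as written instead of upper-casing them
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    $inside = false;
    $current = '';
    $values = [];

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attribs) use (&$inside, &$current, $elementName) {
            if ($name === $elementName) {
                $inside = true;
                $current = '';
            }
        },
        function ($p, $name) use (&$inside, &$current, &$values, $elementName) {
            if ($name === $elementName) {
                $values[] = $current;
                $inside = false;
            }
        }
    );
    // text may arrive in several pieces, so it is accumulated
    xml_set_character_data_handler($parser, function ($p, $data) use (&$inside, &$current) {
        if ($inside) {
            $current .= $data;
        }
    });

    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open '$path'");
    }
    while (!feof($fh)) {
        $data = fread($fh, 8192);
        // the third argument tells the parser whether this is the final chunk
        if (!xml_parse($parser, $data, feof($fh))) {
            $message = xml_error_string(xml_get_error_code($parser));
            fclose($fh);
            throw new RuntimeException("XML error: $message");
        }
    }
    fclose($fh);
    return $values;
}
```

Only one 8 KB block of the file is in memory at a time; the trade-off is that all state (such as "am I inside the element I care about") has to be tracked by hand in the handlers.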

This isn't a great solution, but just to throw another option out there:
You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).
e.g., if your doc looks like:
<dmoz>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
...
</dmoz>
You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root level tag, and then load them via simplexml/domxml (I used domxml, when taking this approach).
Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that, you're stuck with either the above chunking strategy, or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.
Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, or URLs, etc., but wouldn't make sense for parsing a large HTML document)
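A rough sketch of that chunking idea, using SimpleXML instead of the old domxml extension. The function name parseInChunks(), the artificial <chunk> root and the default block size are all assumptions for illustration, and it only works under the restrictions described above: flat, non-nested elements whose opening tag has no attributes.

```php
<?php
/**
 * Hypothetical helper: reads the file block by block and hands every group
 * of complete <$tag> elements to $handle, wrapped in an artificial root.
 */
function parseInChunks(string $path, string $tag, callable $handle, int $blockSize = 1048576): void
{
    $open = "<$tag>";
    $close = "</$tag>";
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open '$path'");
    }
    $pending = '';
    while (!feof($fh)) {
        $pending .= fread($fh, $blockSize);
        $first = strpos($pending, $open);
        $last = strrpos($pending, $close);
        if ($first === false || $last === false || $last < $first) {
            continue; // no complete element buffered yet, read more
        }
        $end = $last + strlen($close);
        // wrap the complete elements in a root tag so they form a valid document
        $handle(new SimpleXMLElement('<chunk>' . substr($pending, $first, $end - $first) . '</chunk>'));
        // keep the unfinished tail for the next round
        $pending = substr($pending, $end);
    }
    fclose($fh);
}
```

The callback receives a small SimpleXMLElement each time and can loop over $chunk->listing as usual. Entities declared in a DTD or elements with attributes would need extra handling that this sketch leaves out.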

You can combine XMLReader with DOM for this. In PHP both APIs (and SimpleXML) are based on the same library, libxml2. Large XML files are typically a list of records. So you use XMLReader to iterate over the records, load a single record into DOM, and use DOM methods and XPath to extract values. The key is the method XMLReader::expand(): it loads the current node of an XMLReader instance, together with its descendants, as DOM nodes.
Example XML:
<books>
<book>
<title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title>
</book>
<book>
<title isbn="978-0596100506">XML Pocket Reference</title>
</book>
<!-- ... -->
</books>
Example code:
// open the XML file
$reader = new XMLReader();
$reader->open('books.xml');
// prepare a DOM document
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// find the first `book` element node at any depth
while ($reader->read() && $reader->localName !== 'book') {
continue;
}
// as long as there is a node with the name "book"
while ($reader->localName === 'book') {
// expand the node into the prepared DOM
$book = $reader->expand($document);
// use Xpath expressions to fetch values
var_dump(
$xpath->evaluate('string(title/@isbn)', $book),
$xpath->evaluate('string(title)', $book)
);
// move to the next book sibling node
$reader->next('book');
}
$reader->close();
Take note that the expanded node is never appended to the DOM document. It allows the GC to clean it up.
This approach works with XML namespaces as well.
$namespaceURI = 'urn:example-books';
$reader = new XMLReader();
$reader->open('books.xml');
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// register a prefix for the Xpath expressions
$xpath->registerNamespace('b', $namespaceURI);
// compare local node name and namespace URI
while (
$reader->read() &&
(
$reader->localName !== 'book' ||
$reader->namespaceURI !== $namespaceURI
)
) {
continue;
}
// iterate the book elements
while ($reader->localName === 'book') {
// validate that they are in the namespace
if ($reader->namespaceURI === $namespaceURI) {
$book = $reader->expand($document);
var_dump(
$xpath->evaluate('string(b:title/@isbn)', $book),
$xpath->evaluate('string(b:title)', $book)
);
}
$reader->next('book');
}
$reader->close();

I've written a wrapper for XMLReader to (IMHO) make it easier to get just the bits you're after. The wrapper lets you associate a set of paths to data elements with callbacks that are run when a path is found. The paths may contain regular expressions, including capture groups, which can also be passed to the callback.
The library is at https://github.com/NigelRel3/XMLReaderReg and can also be installed using composer require nigelrel3/xml-reader-reg.
An example of how to use it...
$inputFile = __DIR__ ."/../tests/data/simpleTest1.xml";
$reader = new XMLReaderReg\XMLReaderReg();
$reader->open($inputFile);
$reader->process([
'(.*/person(?:\[\d*\])?)' => function (SimpleXMLElement $data, $path): void {
echo "1) Value for ".$path[1]." is ".PHP_EOL.
$data->asXML().PHP_EOL;
},
'(.*/person3(\[\d*\])?)' => function (DOMElement $data, $path): void {
echo "2) Value for ".$path[1]." is ".PHP_EOL.
$data->ownerDocument->saveXML($data).PHP_EOL;
},
'/root/person2/firstname' => function (string $data): void {
echo "3) Value for /root/person2/firstname is ". $data.PHP_EOL;
}
]);
$reader->close();
As can be seen from the example, the data can be passed to the callback as a SimpleXMLElement, as a DOMElement or, as in the last callback, as a string. In each case it represents only the data that matches the path.
The paths also show how capture groups can be used - (.*/person(?:\[\d*\])?) looks for any person element (including arrays of elements) and $path[1] in the callback displays the path where this particular instance is found.
There is an expanded example in the library as well as unit tests.

I tested the following code with a 2 GB XML file:
<?php
set_time_limit(0);
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
die("Failed to open 'data.xml'");
}
while($reader->read())
{
$node = $reader->expand();
// process $node...
}
$reader->close();
?>
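One thing to keep in mind with this loop: expand() is called for every node, including the document element, and expanding an element builds its whole subtree as DOM nodes. A variation that only expands the direct children of the document element might look like the sketch below. The function name eachTopLevelRecord() is made up, and it assumes the records you care about sit directly under the root.

```php
<?php
/**
 * Hypothetical helper: calls $process for every element that is a direct
 * child of the document element and returns how many were processed.
 */
function eachTopLevelRecord(string $path, callable $process): int
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Failed to open '$path'");
    }
    $document = new DOMDocument();
    $count = 0;
    while ($reader->read()) {
        // depth 0 is the document element, depth 1 its direct children
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->depth === 1) {
            // expand only this record, as a node belonging to $document
            $node = $reader->expand($document);
            if ($node !== false) {
                $process($node);
                $count++;
            }
        }
    }
    $reader->close();
    return $count;
}
```

Nodes deeper than depth 1 are still visited by read(), but they are skipped by the depth check, so each record is expanded exactly once.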

My solution:
$reader = new XMLReader();
$reader->open($fileTMP);
while ($reader->read()) {
if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'xmltag' && $reader->isEmptyElement === false) {
$item = simplexml_load_string($reader->readOuterXML(), null, LIBXML_NOCDATA);
//operations on file
}
}
$reader->close();

A very fast way is
preg_split('/(<|>)/m', $xmlString);
After that, only a single loop over the resulting array is needed.
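To make it clearer what that split produces, here is a tiny sketch with a made-up input string. Note the assumptions: the whole document has to be in memory as a string already, and a plain split like this does not understand comments, CDATA sections or attribute values that contain ">".

```php
<?php
// Hypothetical input: the whole document must already be in memory as a string.
$xmlString = '<a><b>hi</b></a>';

// Split on every "<" and ">"; PREG_SPLIT_NO_EMPTY drops the empty pieces
// that appear between adjacent delimiters such as "><".
$tokens = preg_split('/(<|>)/', $xmlString, -1, PREG_SPLIT_NO_EMPTY);

// A single loop over the tokens. Tag names and text are not told apart here;
// a leading "/" is the only hint that a token is a closing tag.
foreach ($tokens as $token) {
    echo $token, PHP_EOL;
}
```

For this input the tokens are a, b, hi, /b and /a, in that order, so any structure has to be rebuilt by hand inside the loop.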

OOP problems with instance variables

I've got some problems with a piece of code in my two classes 'File' and 'Folder'. I've created a page which shows me the content of my server space. For that I wrote the class Folder, which contains information about itself such as 'name', 'path' and 'children'. The children property contains an array of 'File' or 'Folder' objects within this folder, so it's a kind of recursive class. To get the whole structure of a given directory I wrote some recursive backtracking algorithms that give me an array of objects for all children, in the same structure as my folder on the server. A second algorithm takes that array and searches for a particular folder. If it finds the folder, the method returns the full path to it; if the folder isn't a subfolder of this directory, it returns false. I've tested all of these methods for the 'Folder' object and they work just fine, but now that I'm using my script more intensively I've found an error.
/**
* find a subfolder within the given directory (recursive)
*/
public function findFolder($name) {
// is this object the object you wanted
if ($this->name == $name) {
return $this->getPath();
}
// getting array
$this->bindChildren();
$result = $this->getChildren();
// backtracking part
foreach($result as $r) {
// skip all 'Files'
if(get_class($r) == 'File') {
continue;
} else {
if($search_res = $r->findFolder($name)) {
return $search_res;
}
}
}
// loop ran out without a match
return false;
}
/**
* stores all children of this folder
*/
public function bindChildren() {
$this->resetContent();
$this->dirSearch();
}
/**
* resets children array
*/
private function resetContent() {
$this->children = array();
}
/**
* storing children of this folder
*/
private function dirSearch() {
$dh = opendir($this->path);
while($file = readdir($dh)) {
if($file !== "" && $file !== "." && $file !== "..") {
if(!is_dir($this->path.$file)) {
$this->children[] = new File($this->path.$file);
} else {
$this->children[] = new Folder($this->path.$file.'/');
}
}
}
}
On my website I first create a new Folder object and then try to find a subfolder of 'doc' which is called 'test', for example. The folder 'test' is located at '/var/www/media/username/doc/test4/test/':
$folder = new Folder('/var/www/media/username/doc/');
$dir = $folder->findFolder('test');
If I print out $dir it returns a path, as I wanted, because the folder 'test' is a subfolder of 'doc', but the returned path is not correct. It should be '/var/www/media/username/doc/test4/test' but the result is '/var/www/media/username/doc/test'. I've tried to debug a bit and found out that the folder list which contains all children holds objects with the right paths, but in the findFolder method, in the first if condition, the object $this doesn't have the correct path. I don't know why, but the
// backtracking part
foreach($result as $r) {
seems to change the object properties. I hope someone can help me, and thanks in advance.

Don't reinvent the wheel. PHP already has a class for that purpose named RecursiveDirectoryIterator.
http://php.net/manual/en/class.recursivedirectoryiterator.php
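A minimal sketch of how the folder search could look with it. The function name findFolderPath() is made up for illustration, and it simply returns the first directory with a matching name that the iterator encounters.

```php
<?php
/**
 * Hypothetical helper: returns the path of the first directory named $name
 * below $root, or false if there is none.
 */
function findFolderPath(string $root, string $name)
{
    $iterator = new RecursiveIteratorIterator(
        // SKIP_DOTS leaves out the "." and ".." entries
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS),
        // SELF_FIRST makes directories themselves show up, not only their files
        RecursiveIteratorIterator::SELF_FIRST
    );
    foreach ($iterator as $item) {
        if ($item->isDir() && $item->getFilename() === $name) {
            return $item->getPathname();
        }
    }
    return false;
}
```

With the layout from the question, findFolderPath('/var/www/media/username/doc', 'test') would be expected to return the path ending in test4/test, provided no other folder called 'test' is found earlier in the traversal.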

Parsing Huge XML Files in PHP

I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?

There are only two PHP APIs that are really suited to processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
For an example, you might want to look at this partial parser of the DMOZ-catalog:
<?php
class SimpleDMOZParser
{
protected $_stack = array();
protected $_file = "";
protected $_parser = null;
protected $_currentId = "";
protected $_current = "";
public function __construct($file)
{
$this->_file = $file;
$this->_parser = xml_parser_create("UTF-8");
xml_set_object($this->_parser, $this);
xml_set_element_handler($this->_parser, "startTag", "endTag");
}
public function startTag($parser, $name, $attribs)
{
array_push($this->_stack, $this->_current);
if ($name == "TOPIC" && count($attribs)) {
$this->_currentId = $attribs["R:ID"];
}
if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
echo $attribs["R:RESOURCE"] . "\n";
}
$this->_current = $name;
}
public function endTag($parser, $name)
{
$this->_current = array_pop($this->_stack);
}
public function parse()
{
$fh = fopen($this->_file, "r");
if (!$fh) {
die("Epic fail!\n");
}
while (!feof($fh)) {
$data = fread($fh, 4096);
xml_parse($this->_parser, $data, feof($fh));
}
// release the file handle once the whole file has been fed to the parser
fclose($fh);
}
}
$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();

This is a very similar question to Best way to process large XML in PHP but with a very good specific answer upvoted addressing the specific problem of DMOZ catalogue parsing.
However, since this is a good Google hit for large XMLs in general, I will repost my answer from the other question as well:
My take on it:
https://github.com/prewk/XmlStreamer
A simple class that will extract all children of the XML root element while streaming the file.
Tested on a 108 MB XML file from pubmed.com.
class SimpleXmlStreamer extends XmlStreamer {
public function processNode($xmlString, $elementName, $nodeIndex) {
$xml = simplexml_load_string($xmlString);
// Do something with your SimpleXML object
return true;
}
}
$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();

I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
If you have the following file complex-test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
<Object>
<Title>Title 1</Title>
<Name>It's name goes here</Name>
<ObjectData>
<Info1></Info1>
<Info2></Info2>
<Info3></Info3>
<Info4></Info4>
</ObjectData>
<Date></Date>
</Object>
<Object></Object>
<Object>
<AnotherObject></AnotherObject>
<Data></Data>
</Object>
<Object></Object>
<Object></Object>
</Complex>
and want to return the <Object/> elements one at a time:
PHP:
require_once('class.chunk.php');
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
while ($xml = $file->read()) {
$obj = simplexml_load_string($xml);
// do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
* Chunk
*
* Reads a large file in as chunks for easier parsing.
*
* The chunks returned are whole <$this->options['element']/>s found within file.
*
* Each call to read() returns the whole element including start and end tags.
*
* Tested with a 1.8MB file, extracted 500 elements in 0.11s
* (with no work done, just extracting the elements)
*
* Usage:
* <code>
* // initialize the object
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
*
* // loop through the file until all lines are read
* while ($xml = $file->read()) {
* // do whatever you want with the string
* $o = simplexml_load_string($xml);
* }
* </code>
*
* @package default
* @author Dom Hastings
*/
class Chunk {
/**
* options
*
* @var array Contains all major options
* @access public
*/
public $options = array(
'path' => './', // string The path to check for $file in
'element' => '', // string The XML element to return
'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk
);
/**
* file
*
* @var string The filename being read
* @access public
*/
public $file = '';
/**
* pointer
*
* @var integer The current position the file is being read from
* @access public
*/
public $pointer = 0;
/**
* handle
*
* @var resource The fopen() resource
* @access private
*/
private $handle = null;
/**
* reading
*
* @var boolean Whether the script is currently reading the file
* @access private
*/
private $reading = false;
/**
* readBuffer
*
* @var string Used to make sure start tags aren't missed
* @access private
*/
private $readBuffer = '';
/**
* __construct
*
* Builds the Chunk object
*
* @param string $file The filename to work with
* @param array $options The options with which to parse the file
* @author Dom Hastings
* @access public
*/
public function __construct($file, $options = array()) {
// merge the options together
$this->options = array_merge($this->options, (is_array($options) ? $options : array()));
// check that the path ends with a /
if (substr($this->options['path'], -1) != '/') {
$this->options['path'] .= '/';
}
// normalize the filename
$file = basename($file);
// make sure chunkSize is an int
$this->options['chunkSize'] = intval($this->options['chunkSize']);
// check it's valid
if ($this->options['chunkSize'] < 64) {
$this->options['chunkSize'] = 512;
}
// set the filename
$this->file = realpath($this->options['path'].$file);
// check the file exists
if (!file_exists($this->file)) {
throw new Exception('Cannot load file: '.$this->file);
}
// open the file
$this->handle = fopen($this->file, 'r');
// check the file opened successfully
if (!$this->handle) {
throw new Exception('Error opening file for reading');
}
}
/**
* __destruct
*
* Cleans up
*
* @return void
* @author Dom Hastings
* @access public
*/
public function __destruct() {
// close the file resource
fclose($this->handle);
}
/**
* read
*
* Reads the first available occurrence of the XML element $this->options['element']
*
* @return string The XML string from $this->file
* @author Dom Hastings
* @access public
*/
public function read() {
// check we have an element specified
if (!empty($this->options['element'])) {
// trim it
$element = trim($this->options['element']);
} else {
$element = '';
}
// initialize the buffer
$buffer = false;
// if the element is empty
if (empty($element)) {
// let the script know we're reading
$this->reading = true;
// read in the whole doc, cos we don't know what's wanted
while ($this->reading) {
$buffer .= fread($this->handle, $this->options['chunkSize']);
$this->reading = (!feof($this->handle));
}
// return it all
return $buffer;
// we must be looking for a specific element
} else {
// set up the strings to find
$open = '<'.$element.'>';
$close = '</'.$element.'>';
// let the script know we're reading
$this->reading = true;
// reset the global buffer
$this->readBuffer = '';
// this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
$store = false;
// seek to the position we need in the file
fseek($this->handle, $this->pointer);
// start reading
while ($this->reading && !feof($this->handle)) {
// store the chunk in a temporary variable
$tmp = fread($this->handle, $this->options['chunkSize']);
// update the global buffer
$this->readBuffer .= $tmp;
// check for the open string
$checkOpen = strpos($tmp, $open);
// if it wasn't in the new buffer
if (!$checkOpen && !($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkOpen = strpos($this->readBuffer, $open);
// if it was in there
if ($checkOpen) {
// set it to the remainder
$checkOpen = $checkOpen % $this->options['chunkSize'];
}
}
// check for the close string
$checkClose = strpos($tmp, $close);
// if it wasn't in the new buffer
if (!$checkClose && ($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkClose = strpos($this->readBuffer, $close);
// if it was in there
if ($checkClose) {
// set it to the remainder plus the length of the close string itself
$checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
}
// if it was
} elseif ($checkClose) {
// add the length of the close string itself
$checkClose += strlen($close);
}
// if we've found the opening string and we're not already reading another element
if ($checkOpen !== false && !($store)) {
// if we've found the end element too
if ($checkClose !== false) {
// append the string only between the start and end element
$buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
} else {
// append the data we know to be part of this element
$buffer .= substr($tmp, $checkOpen);
// update the pointer
$this->pointer += $this->options['chunkSize'];
// let the script know we're gonna be storing all the data until we find the close element
$store = true;
}
// if we've found the closing element
} elseif ($checkClose !== false) {
// update the buffer with the data up to and including the close tag
$buffer .= substr($tmp, 0, $checkClose);
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
// if we've found the closing element, but half in the previous chunk
} elseif ($store) {
// update the buffer
$buffer .= $tmp;
// and the pointer
$this->pointer += $this->options['chunkSize'];
}
}
}
// return the element (or the whole file if we're not looking for elements)
return $buffer;
}
}