Parsing extremely large XML files in PHP


I need to parse XML files of 40GB in size, then normalize the data and insert it into a MySQL database. It is not yet clear how much of the file I need to store in the database, nor do I know the XML structure.
Which parser should I use, and how would you go about doing this?

In PHP, you can read extremely large XML files with XMLReader:
$reader = new XMLReader();
$reader->open($xmlfile);
Extremely large XML files should be stored in a compressed format on disk. This makes sense because XML files have a high compression ratio. For example, the file could be gzipped as large.xml.gz.
PHP supports that quite well with XMLReader via the compression wrappers:
$xmlfile = 'compress.zlib://path/to/large.xml.gz';
$reader = new XMLReader();
$reader->open($xmlfile);
XMLReader lets you operate on the current node only, which means it is forward-only. If you need to keep parser state, you have to build that yourself.
I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on XMLReader, for example iterating over elements or over child elements only. This is outlined in Parse XML with PHP and XMLReader.
See as well:
PHP open gzipped XML
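The iterator idea above can be sketched as a generator. This is a minimal illustration, not the code from the linked article: the function name streamElements and its parameters are made up for this example, and it assumes PHP 7.1+ and that the elements of interest are small enough to hold in memory one at a time.

```php
<?php
// Hypothetical helper: yields each element named $elementName as a
// SimpleXMLElement, while XMLReader streams through the document.
function streamElements(string $uri, string $elementName): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }
    try {
        // read() advances one node at a time; only the current node
        // (plus the subtree we explicitly serialize) is in memory.
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT
                && $reader->name === $elementName) {
                // Serialize just this element and hand it to the caller.
                yield new SimpleXMLElement($reader->readOuterXml());
            }
        }
    } finally {
        $reader->close();
    }
}
```

Because XMLReader::open() takes a URI, the same function should also accept a path such as 'compress.zlib://path/to/large.xml.gz'. Note that this simple loop walks into each yielded element again afterwards, so it would also yield nested elements of the same name.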

It would be nice to know what you actually intend to do with the XML. The way you parse it depends very much on the processing you need to carry out, as well as the size.
If this is a one-off task, then I've started in the past by discovering the XML structure before doing anything else. My DTDGenerator (see saxon.sf.net) was written for this purpose a long time ago and still does the job. There are other tools available now, but I don't know whether they do streamed processing, which is a prerequisite here.
You can write an application that processes the data using either a pull or push streamed parser (SAX or StAX). How easy this is depends on how much processing you have to do and how much state you have to maintain, which you haven't told us. Alternatively you could try streamed XSLT processing, which is available in Saxon-EE.

Related

Parsing a large XML file for MySQL

I've got a very large XML file (1.5GB) that I need to parse and then insert specific values into a MySQL table.
Normally I would parse a DOM using jQuery or PHP Simple Dom Parser, but given the file size I don't think either is suitable here. I need the emphasis to be on performance. I've read a little about SimpleXML and XML Parser for PHP, and it seems each has its advantages, but I'm not sure whether either is suitable for a 1.5GB file.
I've also seen PEAR's XML parser mentioned but, again, I don't know if it is suitable in this situation. From what I've read, it seems I need to load only the required nodes into memory and not the whole tree. Even now I'm having trouble actually viewing the document due to its size. Vim seems to be the only editor that can handle it, but even then scrolling through the document can cause a crash.
If anyone can recommend one of these above the other, or even an entirely different solution that would be great.
That brings me to my SQL inserts, which I was going to do on the fly: after I've parsed a node and pulled the values I require, I will insert them into the database. Again, any advice would be great.

For such a huge XML file, a SAX-based XML parser is recommended. In PHP you can do this with the "XML Parser" extension. It consumes less memory than its peers, and it is also very fast.

SimpleXML and DOM are not meant for big XML files.
Try:
XMLReader: http://php.net/manual/en/book.xmlreader.php
or, even better and faster (but slightly more complicated to use),
XML Parser: http://php.net/manual/en/book.xml.php
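As a rough sketch of the SAX-style approach with the XML Parser extension: the file is fed to the parser in chunks, and handlers collect the text of one element type. The function name eachElementText, the callback parameter, and the 8 KB chunk size are assumptions made for this example; the callback is where an on-the-fly database insert could go.

```php
<?php
// Hypothetical helper: calls $onValue with the text content of every
// element named $elementName, reading the file in fixed-size chunks.
function eachElementText(string $path, string $elementName, callable $onValue): void
{
    $buffer = null; // null while we are outside the target element

    $parser = xml_parser_create();
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$buffer, $elementName) {
            if ($name === $elementName) {
                $buffer = ''; // start collecting text
            }
        },
        function ($p, $name) use (&$buffer, $elementName, $onValue) {
            if ($name === $elementName && $buffer !== null) {
                $onValue($buffer); // hand the value over, then forget it
                $buffer = null;
            }
        }
    );
    xml_set_character_data_handler($parser, function ($p, $data) use (&$buffer) {
        if ($buffer !== null) {
            $buffer .= $data; // text may arrive in several pieces
        }
    });

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (!feof($handle)) {
        $chunk = fread($handle, 8192);
        if (!xml_parse($parser, $chunk, feof($handle))) {
            $message = xml_error_string(xml_get_error_code($parser));
            $line = xml_get_current_line_number($parser);
            fclose($handle);
            throw new RuntimeException("XML error: $message at line $line");
        }
    }
    fclose($handle);
}
```

Only the current chunk and the text of the element being collected are held in memory, which is why this style scales to very large files.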

How to check validity of big xml file?

I have a big XML file, larger than 100mb, and I want to check if the structure of this file is valid.
I can try to load this file with DOMDocument, or I can read it with the PHP XML parser, which "lets you parse, but not validate, XML documents".
Is there any way to do this without fully loading the XML file into memory?

Firstly, you don't say what kind of schema you are using for validation: DTD, XSD, RelaxNG?
Secondly you mention PHP but you don't say whether the solution has to be based on PHP. Could you, for example, use Java?
Generally speaking, validating an XML document against a schema is a streamable operation: it does not require building a tree representation of the XML document in memory. Finding a streaming validator that works in your environment should not be hard, but we need to know what the environment is (and what schema language you are using).

I think you need to look into the XMLReader class. More specifically,
XMLReader::setSchema.
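A minimal sketch of how XMLReader::setSchema could be used for this, assuming the schema is an XSD file. The function name streamValidate and its return shape (a list of error messages, empty when the document is valid) are choices made for this example.

```php
<?php
// Hypothetical helper: validates $xmlUri against the XSD at $xsdPath
// while streaming, and returns the collected libxml error messages.
function streamValidate(string $xmlUri, string $xsdPath): array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $reader = new XMLReader();
    if (!$reader->open($xmlUri)) {
        libxml_use_internal_errors($previous);
        throw new RuntimeException("Cannot open $xmlUri");
    }
    // The schema has to be set after open() and before the first read().
    if (!$reader->setSchema($xsdPath)) {
        $reader->close();
        libxml_use_internal_errors($previous);
        throw new RuntimeException("Cannot load schema $xsdPath");
    }

    // Walking the document triggers validation node by node.
    while ($reader->read()) {
        // nothing to do; errors are collected by libxml
    }
    $reader->close();

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}
```

An empty array means no well-formedness or schema errors were reported while reading.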

Think about what you're saying. You want to do operations on data that is not in memory. That doesn't make sense at all... it will eventually have to be in memory if you want to reference it from operations.
If you don't want to load the data in memory all at once, you could do a divide and conquer approach. If the file is incredibly large, you could run a map reduce job in multiple processes, but this would not decrease the amount of memory used.

If all you want to do is check whether the XML is well-formed, you can use PHP's XML Parser. It will not validate the document against a DTD, which is what the manual means when it says the parser does not validate.
Any of the parser's XML_ERROR_* error codes can be returned if the XML structure is found to be invalid while parsing it.
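A short sketch of such a well-formedness check, feeding the file to the parser in chunks so it never has to be in memory as a whole. The function name checkWellFormed and the 64 KB chunk size are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns null if the file is well-formed XML,
// otherwise a message describing the first error the parser reports.
function checkWellFormed(string $path): ?string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $parser = xml_parser_create();
    $error = null;
    while (!feof($handle)) {
        $chunk = fread($handle, 65536);
        // The third argument tells the parser when the last chunk arrives.
        if (!xml_parse($parser, $chunk, feof($handle))) {
            $error = sprintf(
                '%s (code %d) at line %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_error_code($parser),
                xml_get_current_line_number($parser)
            );
            break;
        }
    }
    fclose($handle);

    return $error;
}
```

No handlers are registered, so the parser only checks the structure and discards the content as it goes.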

How to delete xml elements/nodes from xml file larger than available RAM?

I'm trying to figure out how to delete an element (and its children) from a very large XML file in PHP (latest version).
I know I can use dom and simpleXml, but that will require the document to be loaded into memory.
I am looking at the XML writer/reader/parser functions and googling, but there seems to be nothing on the subject (all answers recommend using dom or simpleXml). That cannot be correct--am I missing something?
The closest thing I've found is this (C#):
You can use an XmlReader to sequentially read your xml (ReadOuterXml might be useful in your case to read a whole node at a time). Then use an XmlWriter to write out all the nodes you want to keep.
( Deleting nodes from large XML files )
Really? Is that the approach? I have to copy the entire huge file?
Is there really no other way?
One approach
As suggested, I could read the data using PHP's XML reader or parser, possibly buffer it, and write or append it to a new file.
But is this approach really practical?
I have experience with splitting huge xml files into smaller pieces, basically using suggested method, and it took a very long time for the process to finish.
My dataset isn’t currently big enough to give me an idea on how this would work out. I could only assume that the results will be the same (a very slow process).
Does anybody have experience of applying this in practice?

There are a couple ways to process large documents incrementally, so that you do not need to load the entire structure into memory at once. In either case, yes, you will need to write back out the elements that you wish to keep and omit those you want to remove.
PHP has an XMLReader implementation of a pull parser. An explanation:
A pull parser creates an iterator that sequentially visits the various
elements, attributes, and data in an XML document. Code which uses
this iterator can test the current item (to tell, for example, whether
it is a start or end element, or text), and inspect its attributes
(local name, namespace, values of XML attributes, value of text,
etc.), and can also move the iterator to the next item. The code can
thus extract information from the document as it traverses it.
Or you could use the SAX XML Parser. Explanation:
Simple API for XML (SAX) is a lexical, event-driven interface in which
a document is read serially and its contents are reported as callbacks
to various methods on a handler object of the user's design. SAX is
fast and efficient to implement, but difficult to use for extracting
information at random from the XML, since it tends to burden the
application author with keeping track of what part of the document is
being processed.
A lot of people prefer the pull method, but either meets your requirement. Keep in mind that large is relative. If the document fits in memory, then it will almost always be easier to use the DOM. But for really, really large documents that simply might not be an option.
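The read-and-write-back approach can be sketched with XMLReader and XMLWriter. This is a simplified illustration under stated assumptions: the function name copyWithoutElements is made up, elements are matched by name only, and comments, processing instructions, and the DOCTYPE are not copied.

```php
<?php
// Hypothetical helper: copies $inUri to $outUri node by node,
// skipping every element named $dropName together with its subtree.
function copyWithoutElements(string $inUri, string $outUri, string $dropName): void
{
    $reader = new XMLReader();
    if (!$reader->open($inUri)) {
        throw new RuntimeException("Cannot open $inUri");
    }
    $writer = new XMLWriter();
    if (!$writer->openUri($outUri)) {
        throw new RuntimeException("Cannot write $outUri");
    }
    $writer->startDocument('1.0', 'UTF-8');

    $more = $reader->read();
    while ($more) {
        $type = $reader->nodeType;
        if ($type === XMLReader::ELEMENT && $reader->name === $dropName) {
            // next() jumps past the whole subtree; the node we land on
            // still has to be processed, so do not call read() here.
            $more = $reader->next();
            continue;
        }
        if ($type === XMLReader::ELEMENT) {
            $isEmpty = $reader->isEmptyElement; // check before visiting attributes
            $writer->startElement($reader->name);
            while ($reader->moveToNextAttribute()) {
                $writer->writeAttribute($reader->name, $reader->value);
            }
            $reader->moveToElement();
            if ($isEmpty) {
                $writer->endElement();
            }
        } elseif ($type === XMLReader::END_ELEMENT) {
            $writer->endElement();
        } elseif ($type === XMLReader::CDATA) {
            $writer->writeCdata($reader->value);
        } elseif ($type === XMLReader::TEXT
            || $type === XMLReader::WHITESPACE
            || $type === XMLReader::SIGNIFICANT_WHITESPACE) {
            $writer->text($reader->value);
        }
        $more = $reader->read();
    }

    $writer->endDocument();
    $writer->flush();
    $reader->close();
}
```

The whole file is still copied once, but only the current node is in memory at any time. For very long runs it may be worth calling flush() periodically inside the loop as well.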

XSLTProcessor xmlSAX2Characters: out of memory

I have a page which loads a 500 MB XML file and transforms it using an XSL template.
The code works perfectly in my local environment, where I am using WAMP.
On the web server, I get this warning:
Warning: DOMDocument::load() [domdocument.load]: (null)xmlSAX2Characters: out of memory in /home/mydomain/public_html/xslt/largeFile.xml, line: 2031052 in /home/mydomain/public_html/xslt/parser_large.php on line 6
My Code is as below, line 6 loads the xml file
<?php
$xslDoc = new DOMDocument();
$xslDoc->load("template.xslt");
$xmlDoc = new DOMDocument();
$xmlDoc->load("largeFile.xml");
$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
echo $proc->transformToXML($xmlDoc);
?>
I have tried copying the php.ini file from the wamp installation to the folder where the above code is located. But this has not helped. The memory limit in this php.ini file is memory_limit = 1000M
Any advice / experience on this would be greatly appreciated

Here is the sad truth. There are two basic ways of working with XML, DOM-based, where the whole XML file is present in memory at once (with considerable overhead to make it fast to traverse), and SAX based where the file goes through memory, but only a small portion of it is present at any given time.
However, with DOM, large memory consumption is pretty much normal.
Now XSLT language in general allows constructions that access any parts of the whole file at any time and it therefore requires the DOM style. Some programming languages have libraries that allow feeding SAX input into an XSLT processor, but this necessarily implies restrictions on the XSLT language or memory consumption not much better than that of DOM. PHP does not have a way of making XSLT read SAX input, though.
That leaves us with alternatives to DOM; there is one, and it is called SimpleXML. SimpleXML is a little tricky to use if your document has namespaces. An ancient benchmark seems to indicate that it is somewhat faster, and probably also less wasteful with memory, than DOM on large files.
And finally, I was in your shoes once in another programming language. The solution was to split the document into small ones based on simple rules. Each small document contained a header copied from the whole document, one "detail" element and a footer, making its format valid against the big XML file's schema. It was processed using XSLT (assuming that processing of one detail element does not look into any other detail element) and the outputs were combined. This works like a charm, but it is not implemented in seconds.
So, here are your options. Choose one.
Parse and process XML using SAX.
Use SimpleXML and hope that it will allow slightly larger files within the same memory.
Execute an external XSLT processor and hope that it will allow slightly larger files within the same memory.
Split and merge XML using this method and apply XSLT on small chunks only. This method is only practical with some schemas.
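The split-and-transform option could look roughly like this in PHP. It is a simplified sketch, not the implementation described above: it wraps each "detail" element in its own small document without copying a header or footer, it assumes the stylesheet only ever looks at one record at a time, and the names transformPerRecord and $recordName are hypothetical.

```php
<?php
// Hypothetical helper: streams the big file with XMLReader and runs the
// stylesheet on one small document per record, yielding each result.
function transformPerRecord(string $xmlUri, string $xslPath, string $recordName): Generator
{
    $xslDoc = new DOMDocument();
    if (!$xslDoc->load($xslPath)) {
        throw new RuntimeException("Cannot load stylesheet $xslPath");
    }
    $proc = new XSLTProcessor();
    $proc->importStylesheet($xslDoc);

    $reader = new XMLReader();
    if (!$reader->open($xmlUri)) {
        throw new RuntimeException("Cannot open $xmlUri");
    }
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->name === $recordName) {
            // Build a small DOM that holds only the current record.
            $chunk = new DOMDocument();
            $chunk->appendChild($chunk->importNode($reader->expand(), true));
            yield $proc->transformToXml($chunk);
        }
    }
    $reader->close();
}
```

The caller is responsible for combining the yielded outputs, for example by appending them to a file. Memory use is then bounded by the size of the largest single record rather than by the whole document.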

Best practices to parse and manipulate XML files, that are minimum of 1000 MBs or more in size [duplicate]

Possible Duplicates:
PHP what is the best approach to using XML? Need to create and parse XML responses
Parse big XML in PHP
Hello Community,
I am writing an application that needs to parse XML files that are at least 1000 MB in size.
I have tried some code that is available on the internet. Because the files are so large, they contain a huge number of XML tags, and loop performance degrades as time passes.
So, I would need a parser:
-> Performance is considerably good as time passes, when doing execution / parsing
-> Doesn't load the whole XML file in memory
I know about following XML parsers, but not sure which to use and why?
XML Parser
SimpleXML
XMLReader
I am using PHP 5.3, so please help me guys and gals, to choose the parser.
You can even suggest me some other options, or classes.
Thanks.
EDIT
I even want to know about SAX (Simple API for XML) and StAX implementation of PHP

First of all, you can't load that much XML in memory. It depends on your machine, but if your XML file is more than 10-20 MB it generally is too much. The server may be able to handle more, but it's not a good idea to fill all the memory with one script. So you can rule out SimpleXML and DOM from the start.
The other two options, XML Parser and XMLReader, will both be good, with XMLReader being a newer extension, so probably better. But as a warning you should take notice that XMLReader also allows you to load everything in memory. Don't do that. Instead use it as a node-by-node parser and read/process your data in small bits.
Your problem may go beyond the scope of choosing a parser if you need most of the data from the XML. You should also make sure that you don't load it all up in memory and use it at the end of the script. Instead, use it as you get it and dispose of it once you no longer need it.
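One way to "use it as you get it" is to process records in small batches and drop each batch once it has been handled. The generator below is a generic sketch; the name inBatches and the batch size are assumptions, and the record source can be any iterable, such as a loop over XMLReader nodes.

```php
<?php
// Hypothetical helper: groups any stream of records into arrays of at
// most $batchSize items, so only one batch is in memory at a time.
function inBatches(iterable $records, int $batchSize): Generator
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1');
    }
    $batch = [];
    foreach ($records as $record) {
        $batch[] = $record;
        if (count($batch) === $batchSize) {
            yield $batch;
            $batch = []; // release the processed records
        }
    }
    if ($batch !== []) {
        yield $batch; // final, possibly smaller batch
    }
}
```

Each yielded batch could, for example, be written to the database in one multi-row insert before the next batch is read.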

Load your giant XML files into an XML database and perform your query and manipulations through their XQuery/XSLT interfaces.
http://www.xml.com/pub/a/2003/10/22/embed.html
