How to check validity of big xml file?

I have a big XML file, larger than 100 MB, and I want to check whether the structure of this file is valid.
I could try to load the file with DOMDocument, or I could read it with the PHP XML parser, which "lets you parse, but not validate, XML documents".
Is there any way to do this without fully loading the XML file into memory?

Firstly, you don't say what kind of schema you are using for validation: DTD, XSD, RelaxNG?
Secondly, you mention PHP but you don't say whether the solution has to be based on PHP. Could you, for example, use Java?
Generally speaking, validating an XML document against a schema is a streamable operation: it does not require building a tree representation of the XML document in memory. Finding a streaming validator that works in your environment should not be hard, but we need to know what the environment is (and what schema language you are using).

I think you need to look into the XMLReader class. More specifically,
XMLReader::setSchema.
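
A minimal sketch of that suggestion, assuming the schema is an XSD file. The function name `validateAgainstXsd` and its return convention are made up for illustration; `XMLReader::open`, `setSchema`, `read` and `isValid` are the real methods. The document is read node by node, so it is never held in memory as a whole.

```php
<?php
// Hypothetical helper: stream-validate $xmlPath against the XSD at $xsdPath.
// Returns a list of error messages; an empty list means no errors were reported.
function validateAgainstXsd(string $xmlPath, string $xsdPath): array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $reader = new XMLReader();
    if (!$reader->open($xmlPath)) {
        libxml_use_internal_errors($previous);
        return ["cannot open $xmlPath"];
    }
    // setSchema() has to be called after open() and before the first read().
    if (!$reader->setSchema($xsdPath)) {
        $reader->close();
        libxml_use_internal_errors($previous);
        return ["cannot load schema $xsdPath"];
    }

    // Reading node by node drives the validator; no full tree is built.
    $valid = true;
    while ($reader->read()) {
        if (!$reader->isValid()) {
            $valid = false;
        }
    }

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    if (!$valid && $errors === []) {
        $errors[] = 'document is not valid against the schema';
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $reader->close();
    return $errors;
}
```

Calling `validateAgainstXsd('big.xml', 'schema.xsd')` (both paths are placeholders) would return an empty array for a valid file. For RelaxNG, `XMLReader::setRelaxNGSchema` could presumably be swapped in the same place.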

Think about what you're saying. You want to do operations on data that is not in memory. That doesn't make sense at all... it will eventually have to be in memory if you want to reference it from operations.
If you don't want to load the data in memory all at once, you could do a divide and conquer approach. If the file is incredibly large, you could run a map reduce job in multiple processes, but this would not decrease the amount of memory used.

If all you want to do is check that the XML structure is valid (that is, well-formed), you can use PHP's XML Parser. It will not validate the document against a DTD, which is what the manual means by "parse, but not validate".
Any of the parser's error codes (the XML_ERROR_* constants) can be returned if the XML structure is found to be invalid while parsing it.
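
A minimal sketch of such a check, feeding the file to the XML Parser in chunks so only one chunk is in memory at a time. The function name `checkWellFormed` and the 64 KB chunk size are assumptions for illustration; the `xml_*` functions are the standard ones.

```php
<?php
// Hypothetical helper: returns null if the file is well-formed,
// otherwise a message with the line number and the parser's error string.
function checkWellFormed(string $path, int $chunkSize = 65536): ?string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return "cannot open $path";
    }

    $parser = xml_parser_create();
    $error = null;

    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            $error = 'read error';
            break;
        }
        // The third argument tells the parser whether this is the last chunk.
        if (xml_parse($parser, $chunk, feof($handle)) !== 1) {
            $error = sprintf(
                'line %d: %s',
                xml_get_current_line_number($parser),
                xml_error_string(xml_get_error_code($parser))
            );
            break;
        }
    }

    fclose($handle);
    return $error;
}
```

This only reports the first error, because the parser stops at the first point where the document is not well-formed.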

Related

Parsing extremely large XML files in php

I need to parse XML files of 40 GB in size, then normalize the data and insert it into a MySQL database. How much of the file I need to store in the database is not clear, nor do I know the XML structure.
Which parser should I use, and how would you go about doing this?

In PHP, you can read extremely large XML files with XMLReader:
$reader = new XMLReader();
$reader->open($xmlfile);
Extremely large XML files should be stored in a compressed format on disk. This makes sense because XML files compress very well. For example, gzipped as large.xml.gz.
PHP supports that quite well with XMLReader via the compression wrappers:
$xmlfile = 'compress.zlib://path/to/large.xml.gz';
$reader = new XMLReader();
$reader->open($xmlfile);
The XMLReader allows you to operate on the current element "only". That means it's forward-only. If you need to keep parser state, you need to build it on your own.
I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on XMLReader like iterating through elements or child-elements only. You find this outlined in Parse XML with PHP and XMLReader.
See as well:
PHP open gzipped XML
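
As a rough illustration of keeping parser state yourself and wrapping the reader in an iterator, the following sketch tracks a stack of open element names and yields the path of every element it meets. The function name `iterateElementPaths` is hypothetical; the URI can be a plain path or a `compress.zlib://` one as above (the latter needs the zlib extension).

```php
<?php
// Hypothetical helper: yields "/root/child/..." for each element, in document order.
function iterateElementPaths(string $uri): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("cannot open $uri");
    }

    $stack = []; // names of the currently open elements: our own parser state

    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $stack[] = $reader->name;
            yield '/' . implode('/', $stack);
            // An empty element such as <b/> produces no END_ELEMENT node.
            if ($reader->isEmptyElement) {
                array_pop($stack);
            }
        } elseif ($reader->nodeType === XMLReader::END_ELEMENT) {
            array_pop($stack);
        }
    }

    $reader->close();
}
```

Counting how often each path occurs while iterating is also a cheap way to discover the structure of an unknown file.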

It would be nice to know what you actually intend to do with the XML. The way you parse it depends very much on the processing you need to carry out, as well as the size.
If this is a one-off task, then I've started in the past by discovering the XML structure before doing anything else. My DTDGenerator (see saxon.sf.net) was written for this purpose a long time ago and still does the job; there are other tools available now, but I don't know whether they do streamed processing, which is a prerequisite here.
You can write an application that processes the data using either a push or pull streamed parser (SAX or StAX). How easy this is depends on how much processing you have to do and how much state you have to maintain, which you haven't told us. Alternatively you could try streamed XSLT processing, which is available in Saxon-EE.

Parsing a large XML file for MySQL

I've got a very large XML file (1.5GB) that I need to parse and then insert specific values into a MySQL table.
Normally I would parse a DOM with jQuery or PHP Simple DOM Parser, but given the file size I don't think either is suitable in this situation. I need the emphasis to be on performance. I've read a little about SimpleXML and XML Parser for PHP, and it seems each has its advantages, but I'm not sure whether either of these is suitable for a 1.5GB file.
I've also seen PEAR's XML parser mentioned but, again, I don't know if it is suitable in this situation. From what I've read, it seems that I need to load only the required nodes into memory and not the whole tree. Even now I'm having trouble actually viewing the document due to its size. Vim seems to be the only editor that can handle it, but even then scrolling through the document can cause a crash.
If anyone can recommend one of these above the other, or even an entirely different solution that would be great.
That then brings me to my SQL inserts, which I was going to do on the fly: after I've parsed a node and pulled out the values I require, I will insert them into the database. Again, any advice would be great.

For such a huge XML file it's recommended to use a SAX-based XML parser. In PHP you can do this with "XML Parser". It consumes less memory than its peers, and it's also very fast.
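
A minimal sketch of the SAX style with PHP's XML Parser, assuming the values of interest are the text contents of one element name. The function name `streamElementText` and the callback parameter are made up for illustration; the callback is where an on-the-fly database insert could go.

```php
<?php
// Hypothetical helper: calls $onValue with the text of every <$element>...</$element>.
function streamElementText(string $path, string $element, callable $onValue): void
{
    $buffer = null; // null while we are outside the target element

    $parser = xml_parser_create();
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$buffer, $element) {
            if ($name === $element) {
                $buffer = '';
            }
        },
        function ($p, string $name) use (&$buffer, $element, $onValue) {
            if ($name === $element && $buffer !== null) {
                $onValue($buffer);
                $buffer = null;
            }
        }
    );
    // Character data may arrive in several pieces, so it is accumulated.
    xml_set_character_data_handler($parser, function ($p, string $data) use (&$buffer) {
        if ($buffer !== null) {
            $buffer .= $data;
        }
    });

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("cannot open $path");
    }
    while (!feof($handle)) {
        $chunk = (string) fread($handle, 65536);
        if (xml_parse($parser, $chunk, feof($handle)) !== 1) {
            fclose($handle);
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    fclose($handle);
}
```

Because the handlers only see one event at a time, any context beyond "inside the target element or not" has to be tracked in variables like `$buffer`.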

SimpleXML and DOM are not meant for big XML files.
Try:
XMLReader: http://php.net/manual/en/book.xmlreader.php
or, even better/faster (but slightly more complicated to use),
XMLParser: http://php.net/manual/en/book.xml.php

How to delete xml elements/nodes from xml file larger than available RAM?

I'm trying to figure out how to delete an element (and its children) from a very large XML file in PHP (latest version).
I know I can use DOM and SimpleXML, but that will require the document to be loaded into memory.
I am looking at the XML writer/reader/parser functions and googling, but there seems to be nothing on the subject (all answers recommend using dom or simpleXml). That cannot be correct--am I missing something?
The closest thing I've found is this (C#):
You can use an XmlReader to sequentially read your xml (ReadOuterXml might be useful in your case to read a whole node at a time). Then use an XmlWriter to write out all the nodes you want to keep.
( Deleting nodes from large XML files )
Really? Is that the approach? I have to copy the entire huge file?
Is there really no other way?
One approach
As suggested,
I could read the data using PHP's XMLReader or XML Parser, possibly buffer it, and write/dump+append it back to a new file.
But is this approach really practical?
I have experience with splitting huge xml files into smaller pieces, basically using suggested method, and it took a very long time for the process to finish.
My dataset isn’t currently big enough to give me an idea on how this would work out. I could only assume that the results will be the same (a very slow process).
Does anybody have experience of applying this in practice?

There are a couple ways to process large documents incrementally, so that you do not need to load the entire structure into memory at once. In either case, yes, you will need to write back out the elements that you wish to keep and omit those you want to remove.
PHP has an XMLReader implementation of a pull parser. An explanation:
A pull parser creates an iterator that sequentially visits the various
elements, attributes, and data in an XML document. Code which uses
this iterator can test the current item (to tell, for example, whether
it is a start or end element, or text), and inspect its attributes
(local name, namespace, values of XML attributes, value of text,
etc.), and can also move the iterator to the next item. The code can
thus extract information from the document as it traverses it.
Or you could use the SAX XML Parser. Explanation:
Simple API for XML (SAX) is a lexical, event-driven interface in which
a document is read serially and its contents are reported as callbacks
to various methods on a handler object of the user's design. SAX is
fast and efficient to implement, but difficult to use for extracting
information at random from the XML, since it tends to burden the
application author with keeping track of what part of the document is
being processed.
A lot of people prefer the pull method, but either meets your requirement. Keep in mind that large is relative. If the document fits in memory, then it will almost always be easier to use the DOM. But for really, really large documents that simply might not be an option.
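
A rough sketch of the pull approach for this case: read with XMLReader, copy everything to a new file with XMLWriter, and skip the unwanted elements together with their children. The function name `removeElements` is hypothetical, and the sketch makes simplifying assumptions: it copies elements, attributes, text, CDATA and comments, but ignores processing instructions and the DOCTYPE, and always writes a UTF-8 declaration.

```php
<?php
// Hypothetical helper: copy $inPath to $outPath, omitting every <$name> subtree.
function removeElements(string $inPath, string $outPath, string $name): void
{
    $reader = new XMLReader();
    if (!$reader->open($inPath)) {
        throw new RuntimeException("cannot open $inPath");
    }
    $writer = new XMLWriter();
    if (!$writer->openUri($outPath)) {
        throw new RuntimeException("cannot write $outPath");
    }
    $writer->startDocument('1.0', 'UTF-8');

    $more = $reader->read();
    while ($more) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            $more = $reader->next(); // jump past the element and its children
            continue;
        }
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                $writer->startElement($reader->name);
                $isEmpty = $reader->isEmptyElement; // read before visiting attributes
                while ($reader->moveToNextAttribute()) {
                    $writer->writeAttribute($reader->name, $reader->value);
                }
                $reader->moveToElement();
                if ($isEmpty) {
                    $writer->endElement(); // <x/> has no END_ELEMENT node
                }
                break;
            case XMLReader::END_ELEMENT:
                $writer->endElement();
                break;
            case XMLReader::TEXT:
            case XMLReader::WHITESPACE:
            case XMLReader::SIGNIFICANT_WHITESPACE:
                $writer->text($reader->value);
                break;
            case XMLReader::CDATA:
                $writer->writeCdata($reader->value);
                break;
            case XMLReader::COMMENT:
                $writer->writeComment($reader->value);
                break;
        }
        $more = $reader->read();
    }

    $writer->endDocument();
    $writer->flush();
    $reader->close();
}
```

The whole file is still read and rewritten once, so the run time grows with the file size, but memory use stays roughly constant.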

How can I handle a huge XML file using SimpleXML but to prevent memory and performance problems?

I am trying to avoid XMLReader for an app I build that has a huge XML file.
SimpleXML is easy to write, and I was wondering if there is any way to use it successfully (without memory and performance issues) on a fairly busy server.
What I will do is echo some data from that XML, mainly in response to a search form.

Ok, if you really want to do this without XMLReader, here's what you could do.
Use fopen to open and read N number of bytes of that file.
Fix the ending: (That's the tough part, but it's perfectly doable.)
You do it by closing anything left unclosed and, if needed, backtracking if you happen to be in the middle of some text.
When that XML chunk is finally valid, you can parse it with SimpleXML.
Process that chunk or save it in its separate XML file
and create another chunk ...until you have all of them.
Obviously if your XML is complex this might get a little painful.
Summary :
By creating your own custom/dirt-cheap xml parser/fixer you can split a huge XML file into multiple smaller files.

If your file is mostly a lot of similar nodes, like a big list of books where the number of books is large but the book record itself is small, you could use a variation of smaura's answer by using XMLReader to walk through each node, then convert the node to an XML string and pass it to SimpleXML. That way, you're using a streaming solution for the big list, but once you get each record you get the benefits of easily accessing the record with SimpleXML.
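
A minimal sketch of that combination, assuming the records are elements with a known name (for example `book`). The function name `iterateRecords` is made up for illustration; only one record at a time is turned into a SimpleXMLElement.

```php
<?php
// Hypothetical helper: yields each <$recordName> element as a SimpleXMLElement.
function iterateRecords(string $uri, string $recordName): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("cannot open $uri");
    }

    $more = $reader->read();
    while ($more) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $recordName) {
            // Only this one record is expanded into a string and parsed.
            $record = simplexml_load_string($reader->readOuterXml());
            if ($record !== false) {
                yield $record;
            }
            $more = $reader->next(); // continue after the record's subtree
        } else {
            $more = $reader->read();
        }
    }

    $reader->close();
}
```

A loop such as `foreach (iterateRecords('books.xml', 'book') as $book) { echo $book->title; }` (the file and element names are placeholders) then gets SimpleXML's convenient access per record.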

Best practices to parse and manipulate XML files, that are minimum of 1000 MBs or more in size [duplicate]

Possible Duplicates:
PHP what is the best approach to using XML? Need to create and parse XML responses
Parse big XML in PHP
Hello Community,
I am writing an application that needs to parse XML files that are at least 1000 MB in size.
I have tried some of the code available on the internet. Because the files are so large, they contain a huge number of XML tags, and loop performance degrades as time elapses.
So I need a parser that:
-> Keeps performing well as time passes during execution / parsing
-> Doesn't load the whole XML file in memory
I know about the following XML parsers, but I'm not sure which to use, and why:
XML Parser
SimpleXML
XMLReader
I am using PHP 5.3, so please help me, guys and gals, to choose a parser.
You can also suggest other options or classes.
Thanks.
EDIT
I would also like to know about SAX (Simple API for XML) and StAX implementations in PHP.

First of all, you can't load that much XML in memory. It depends on your machine, but if your XML file is more than 10-20 MB it generally is too much. The server may be able to handle more, but it's not a good idea to fill all the memory with one script. So you can rule out SimpleXML and DOM from the start.
The other two options, XML Parser and XMLReader, will both be good, with XMLReader being a newer extension, so probably better. But as a warning you should take notice that XMLReader also allows you to load everything in memory. Don't do that. Instead use it as a node-by-node parser and read/process your data in small bits.
Your problem may go beyond the scope of choosing a parser if you need most of the data from the XML. You should also make sure that you don't load it all up in memory and use it at the end of the script. Instead, use it as you get it and dispose of it once you no longer need it.
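
One way to picture "use it as you get it and dispose of it" is a small batching helper: hold a bounded number of records, hand them to whatever consumes them (a multi-row database insert, say), then drop them. The function name `inBatches` is hypothetical and the sketch works on any iterable, including a generator that reads records from a streaming parser.

```php
<?php
// Hypothetical helper: group items from any iterable into arrays of at most $size.
function inBatches(iterable $items, int $size): Generator
{
    if ($size < 1) {
        throw new InvalidArgumentException('batch size must be at least 1');
    }

    $batch = [];
    foreach ($items as $item) {
        $batch[] = $item;
        if (count($batch) === $size) {
            yield $batch;
            $batch = []; // drop the references so the records can be freed
        }
    }
    if ($batch !== []) {
        yield $batch; // final, possibly smaller, batch
    }
}
```

Memory use is then bounded by the batch size rather than by the size of the XML file.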

Load your giant XML files into an XML database and perform your query and manipulations through their XQuery/XSLT interfaces.
http://www.xml.com/pub/a/2003/10/22/embed.html
