Parse big XML in PHP

Parse big XML in PHP - php

I need to parse pretty big XML in PHP (like 300 MB). How can i do it most effectively?
In particular, i need to locate the specific tags and extract their content in a flat TXT file, nothing more.

You can read and parse XML in chunks with an old-school SAX-based parsing approach using PHP's xml parser functions.
Using this approach, there's no real limit to the size of documents you can parse, as you simply read and parse a buffer-full at a time. The parser will fire events to indicate it has found tags, data etc.
There's a simple example in the manual which shows how to pick up start and end of tags. For your purposes you might also want to use xml_set_character_data_handler so that you pick up on the text between tags also.

The most efficient way to do that is to create static XSLT and apply it to your XML using XSLTProcessor. The method names are a bit misleading. Even though you want to output plain text, you should use either transformToXML() if you need is as a string variable, or transformToURI() if you want to write a file.

If it's one or few time job I'd use XML Starlet. But if you really want to do it PHP side then I'd recommend to preparse it to smaller chunks and then processing it. If you load it via DOM as one big chunk it will take a lot of memory. Also use CLI side PHP script to speed things up.

This is what SAX was designed for. SAX has a low memory footprint reading in a small buffer of data and firing events when it encounter elements, character data etc.
It is not always obvious how to use SAX, well it wasn't to me the first time I used it but in essence you have to maintain your own state and view as to where you are within the document structure so generally you will end up with variables describing what section of the document you are in e.g. inFoo, inBar etc which you set when you encounter particular start/end elements.
There is a short description and example of a sax parser here

Depending on your memory requirements, you can either load it up and parse it with XSLT (the memory-consuming route), or you can create a forward-only cursor and walk the tree yourself, printing the values you're looking for (the memory-efficient route).

Pull parsing is the way to go. This way it's memory-efficient AND easy to process. I have been processing files that are as large as 50 Mb or more.

Related

making 'efficient' calls to xml data feed with simplexml

Im using a xml data feed to get information using simplexml and then generate a page using that data.
for this im getting the xml feed using
$xml = simplexml_load_file
Am i right in thinking that to parse the xml data the server has to download it all before it can work with it ?
Obviously this is no such problem with a 2kb file, but some files are nearing 100kb, so for every page load that has to be downloaded first before the php can start generating the page.
On some of the pages were only looking for a 1 attribute of an xml array so parseing the whole document seems unessarcery, normally i would look into caching the feed, but these feeds relate to live makets that are changing frequently so that not ideal as i would always have the up to the minute data.
Is there a better way to make more efficient calls of the xml feed ?

One of the first tactics to optimize XML parsing is by parsing on-the-fly - meaning, don't wait until the entire data arrives, and start parsing immediately when you have something to parse.
This is much more efficient, since the bottleneck is often network connection and not CPU, so if we can find our answer without waiting for all network info, we've optimized quite a bit.
You should google the term XML push parser or XML pull parser
In the article Pull parsing XML in PHP - Create memory-efficient stream processing you can find a tutorial that shows some code on how to do it with PHP using the XMLReader library that is bundled with PHP5
Here's a quote from this page which says basically what I just did in nicer words:
PHP 5 introduced XMLReader, a new class for reading Extensible Markup Language (XML). Unlike SimpleXML or the Document Object Model (DOM), XMLReader operates in streaming mode. That is, it reads the document from start to finish. You can begin to work with the content at the beginning before you see the content at the end. This makes it very fast, very efficient, and very parsimonious with memory. The larger the documents you need to process, the more important this is.
Parsing in streaming mode is a bit different from procedural parsing. Keep in mind that all the data isn't already there. What you usually have to do is supply event handlers that implement some sort of state-machine. If you see tag A, do this, if you see tag B, do that.
Regarding the difference between push parsing and pull parsing take a look at this article. Long story short, both are stream-based parsers. You will probably need a push parser since you want to parse whenever data arrives over the network from your XML feed.
Push parsing in PHP can also be done with xml_parse() (libexpat with a libxml compatibility layer). You can see a code example xml_parse PHP manual page.

php script-wise breakdown of huge XML for parsing

I've got a few huge XML files, and I cut a few rows out, so I could have a manageable-sized file on which to test my parsing script, written in php. There is a lot of nesting in the XML file, there are a lot of columns, and there are a lot of blanks, so writing the script was this huge ordeal. Now, I'm hitting my php memory limit on the full-sized XML files I want to parse.
Now, one thing I've considered is temporarily upping the php memory limit, but I need to rerun this script every well... week or so. Also, I don't have the best system. Running it hot and setting it melt is an all-to-real possibility and one of my "perfect storms".
I also considered attempting to learn a new language, such as perl or python. I probably could use to know one of these languages, anyway. I would prefer to stick with what I have, though, if only in the interest of time.
Isn't there some way to have php break the XML file up into manageable chunks that won't push my machine to its limit? Because every row in the XML file is wrapped by an ID column, it seems like I should be able to cut to the nth row closure, parse what was sliced, and then sleep, or something?
Any ideas?

How to ensure the files is an XML file

I donot know much about files and its related security. I have a LOT of data in XML files which i am planning on parsing to put in the database. I get these XML files from 3rd party people. I will be getting minimum around 1000 files per day. So i will write a script to parse them to enter in our database. Now i have many questions regarding this.
I know how to parse a single file. And i can extend the logic to multiple files in a single loop. But.Is there a better way to do the same? How can i use multi threaded programming to parse the files simultaneously many of them. There will be a script which, given the file, parses the single file and outputs to database. How can i use this script to parse in multiple threads/parallel processing
The File as i said, Comes from a 3rd party site. So how can i be sure that there are no security loop holes. I mean, i dono much about file security. But what are the MINIMUM common basic security checks i need to take.(like sql injection and XSS in web programing are VERY basic)
Again security related: How to ensure that the incoming XML file is XML itself. I mean i can use the extension, But is there a possibility to inject scripts and make them run when i parse these files. And What steps should i take while parsing individual files

You want to validate the XML. This does two things:
Make sure it is "well-formed" - a valid XML document
Make sure it is "valid" - follows a schema, dtd or other definition - it has the elements and you expect to parse.
In php5 the syntax for validating XML documents is:
$dom->validate('articles.dtd');
$dom->relaxNGValidate('articles.rng');
$dom->schemaValidate('articles.xsd');
Of course you need an XSD (XML Schema) or DTD (Document Type Definition) to validate against.

I can't speak to point 1, but it sounds fairly simple - each file can be parsed completely independently.
Points 2 and 3 are effectively about the contents of the file. Simply put, you can check that it's valid XML by parsing it and asking the parser to validate as it goes, and that's all you need to do. If you're expecting it to follow a particular DTD, you can validate it against that. (There are multiple levels of validation, depending on what your data is.)
XML files are just data, in and of themselves. While there are "processing instructions" available as XML, they're not instructions in quite the same way as direct bits of script to be executed, and there should be no harm in just parsing the file. Two potential things a malicious file could do:
Try to launch a denial-of-service attack by referring to a huge external DTD, which will make the parser use large amounts of bandwidth. You can probably disable external DTD resolution if you want to guard against this.
Try to take up significant resources just by being very large. You could always limit the maximum file size your script will handle.

Is it a good idea to use XML for formatting data in communication?

I was going to use XML for communicating some data betwwen my server and the client, then I saw few articles saying using XML at any occation may not be the best idea. The reason given was using XML would increase the size of my message which is quite true, specially for me where nost of my messages are very short ones.
Is it a good idea to send several information segements seperated by a new line? (maximum diffenernt types of data that one message may have is 3 or 4) Or what are the alternative methods that I should look in to.
I have diffenrent types of messages. Ex: one message may contain username and password and the next message may have current location and speed. I'll be using apache server and php.

Serializing data in an XML format can certainly have the negative side effect of bloating it a little (the angle bracket tax), but the incredible extensibility of XML greatly outweighs that consequence, IMO. Also, you can serialize XML in a binary format which greatly cuts down on size, and in most cases the additional bloat would be negligible.

Separating your information segments by newlines could be problematic if your information segments might ever need to include newlines.
JSON is a much lighter weight alternative to XML, and lots of software that supports XML often supports JSON as an alternative. It's pretty easy to use. Since your messages are short, it sounds like they would benefit from using JSON over XML.
http://json.org/

What's the difference between the different XML parsing libraries in PHP5?

The original question is below, but I changed the title because I think it will be easier to find others with the same doubt. In the end, a XHTML document is a XML document.
It's a beginner question, but I would like to know which do you think is the best library for parsing XHTML documents in PHP5?
I have generated the XHTML from HTML files (which where created using Word :S) with Tidy, and know I need to replace some elements from them (like the and element, replace some attributes in tags).
I haven't used XML very much, there seems to be many options for parsing in PHP (Simple XML, DOM, etc.) and I don't know if all of them can do what I need, an which is the easiest one to use.
Sorry for my English, I'm form Argentina. Thanks!
I bit more information: I have a lot of HTML pages, done in Word 97. I used Tidy for cleaning and turning them in XHTML Strict, so now they are all XML compatible. I want to use an XML parser to find some elements and replace them (the logic by which I do this doesn't matter). For example, I want all of the pages to use the same CSS stylesheet and class attributes, for unified appearance. They are all static pages which contains legal documents, nothing strange there. Which of the extensions should I use? Is SimpleXML enough? Should I learn DOM in spite of being more difficult?

You could use SimpleXML, which is included in a default PHP install. This extensions offers easy object-oriented access to XML-structures.
There's also DOM XML. A "downside" to this extension is that it is a bit harder to use and that it is not included by default.

Just to clear up the confusion here. PHP has a number of XML libraries, because php4 didn't have very good options in that direction. From PHP5, you have the choice between SimpleXml, DOM and the sax-based expat parser. The latter also existed in php4. php4 also had a DOM extension, which is not the same as php5's.
DOM and SimpleXml are alternatives to the same problem domain; They læoad the document into memory and let you access it as a tree-structure. DOM is a rather bulky api, but it's also very consistent and it's implemented in many languages, meaning that you can re-use your knowledge across languages (In Javascript for example). SimpleXml may be easier initially.
The SAX parser is a different beast. It treats an xml document as a stream of tags. This is useful if you are dealing with very large documents, since you don't need to hold it all in memory.
For your usage, I would probably use the DOM api.

DOM is a standard, language-independent API for heirarchical data such as XML which has been standardized by the W3C. It is a rich API with much functionality. It is object based, in that each node is an object.
DOM is good when you not only want to read, or write, but you want to do a lot of manipulation of nodes an existing document, such as inserting nodes between others, changing the structure, etc.
SimpleXML is a PHP-specific API which is also object-based but is intended to be a lot less 'terse' than the DOM: simple tasks such as finding the value of a node or finding its child elements take a lot less code. Its API is not as rich than DOM, but it still includes features such as XPath lookups, and a basic ability to work with multiple-namespace documents. And, importantly, it still preserves all features of your document such as XML CDATA sections and comments, even though it doesn't include functions to manipulate them.
SimpleXML is very good for read-only: if all you want to do is read the XML document and convert it to another form, then it'll save you a lot of code. It's also fairly good when you want to generate a document, or do basic manipulations such as adding or changing child elements or attributes, but it can become complicated (but not impossible) to do a lot of manipulation of existing documents. It's not easy, for example, to add a child element in between two others; addChild only inserts after other elements. SimpleXML also cannot do XSLT transformations. It doesn't have things like 'getElementsByTagName' or getElementById', but if you know XPath you can still do that kind of thing with SimpleXML.
The SimpleXMLElement object is somewhat 'magical'. The properties it exposes if you var_dump/printr/var_export don't correspond to its complete internal representation. It exposes some of its child elements as if they were properties which can be accessed with the -> operator, but still preserves the full document internally, and you can do things like access a child element whose name is a reserved word with the [] operator as if it was an associative array.
You don't have to fully commit to one or the other, because PHP implements the functions:
simplexml_import_dom(DOMNode)
dom_import_simplexml(SimpleXMLElement)
This is helpful if you are using SimpleXML and need to work with code that expects a DOM node or vice versa.
PHP also offers a third XML library:
XML Parser (an implementation of SAX, a language-independent interface, but not referred to by that name in the manual) is a much lower level library, which serves quite a different purpose. It doesn't build objects for you. It basically just makes it easier to write your own XML parser, because it does the job of advancing to the next token, and finding out the type of token, such as what tag name is and whether it's an opening or closing tag, for you. Then you have to write callbacks that should be run each time a token is encountered. All tasks such as representing the document as objects/arrays in a tree, manipulating the document, etc will need to be implemented separately, because all you can do with the XML parser is write a low level parser.
The XML Parser functions are still quite helpful if you have specific memory or speed requirements. With it, it is possible to write a parser that can parse a very long XML document without holding all of its contents in memory at once. Also, if you not interested in all of the data, and don't need or want it to be put into a tree or set of PHP objects, then it can be quicker. For example, if you want to scan through an XHTML document and find all the links, and you don't care about structure.

I prefer SimpleXMLElement as it's pretty easy to use to lop through elements.
Edit: It says no version info avaliable but it's avaliable in PHP5, at least 5.2.5 but probably earlier.
It's really personal choice though, there's plenty of XML extensions.
Bear in mind many XML parsers will balk if you have invalid markup - XHTML should be XML but not always!

It's been a long time (2 years or more) since I worked with XML parsing in PHP, but I always had good, usable results from the XML_Parser Pear package. Having said that, I have had minimal exposure to PHP5, so I don't really know if there are better, inbuilt alternatives these days.

I did a little bit of XML parsing in PHP5 last year and decided to use a combination of SimpleXML.
DOM is a bit more useful if you want to create a new XML tree or add to an existing one, its slightly more flexible.

It really depends on what you're trying to accomplish.
For pulling rather large amounts of data, I.E many records of say, product information from a store website, I'd probably use Expat, since its supposedly a bit faster...
Personally, I've has XML's large enough to create a noticeable performance boost.
At those quantities you might as well be using SQL.
I recommend using SimpleXML.
It's pretty intuitive, easy to use/write.
Also, works great with XPath.
Never really got to use DOM much, but if you're using the XML Parser for something as large as you're describing you might want to use it, since its a bit more functional than SimpleXML.
You can read about all three at W3C Schools:
http://www.w3schools.com/php/php_xml_parser_expat.asp
http://www.w3schools.com/php/php_xml_simplexml.asp
http://www.w3schools.com/php/php_xml_dom.asp

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.