I need to process a huge XML file and plan to use XMLReader. Below are three approaches, but I need to know which one is the fastest. How can I find out? The planet.xml file is available at http://trash.chregu.tv/planet-big.xml.bz2 in case you need it. Thank you!
You might want to consider the PHP profiling extension:
http://www.php.net/apd
You can examine the results with pprofp:
http://www.compago.it/php/phpckbk-CHP-21-SECT-3.html
I haven't worked with XML much in PHP, but if you're dealing with a really large file, a streaming parser is the way to go. Reading the whole thing into memory and building a DOM tree is pretty expensive (and may even fail, if the document's too big to hold in memory).
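If installing a profiler is not an option, plain wall-clock timing is often enough to compare a few candidates. The following is only a rough sketch of that idea, not the APD workflow described above: benchmark() and countElements() are hypothetical helper names, the inline document stands in for planet.xml, and hrtime() assumes PHP 7.3 or later.

```php
<?php
// Hypothetical helper: run a callable several times and keep the best wall-clock time.
function benchmark(callable $fn, int $runs = 3): array
{
    $best = INF;
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);                       // monotonic clock, in nanoseconds
        $fn();
        $best = min($best, (hrtime(true) - $start) / 1e9);
    }
    // Peak memory is measured for the whole process, not per candidate.
    return ['best_seconds' => $best, 'peak_bytes' => memory_get_peak_usage(true)];
}

// Stand-in workload: count element nodes with XMLReader.
function countElements(string $xml): int
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $count++;
        }
    }
    $reader->close();
    return $count;
}

$xml = '<planet><entry/><entry/><entry/></planet>';
$result = benchmark(function () use ($xml) {
    countElements($xml);
});
printf("best: %.6f s, peak memory: %d bytes\n", $result['best_seconds'], $result['peak_bytes']);
```

To compare the three variants, wrap each one in its own callable and run every candidate in a separate PHP process so that the peak-memory figures do not bleed into each other.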
Related
I have a project that is done but needs better performance.
The gist of the project is that I'm taking XML and converting it to CSV files. The files represent data to be loaded into a Database.
Right now I'm using PHP to unzip the zip file that contains the XML. Then I parse, convert to CSV, and rezip.
It's been fine until now, but the XML files are getting HUGE, so much so that processing takes a little more than a day. I'm also doing some manipulation of the files along the way, like rearranging columns and trimming values.
What alternatives do you suggest that would help me improve performance?
I've thought about writing this parser in C++, but I'm not sure which route to take. Similar questions have been asked, but this one is more about performance. Should I switch languages for performance, stick with PHP and optimize it, or try to make the parser parallel so that more than one file can be processed at a time?
What would you suggest?
You could give Perl a try if PHP doesn't deliver what you want, but I doubt that's the problem; maybe you are doing something wrong there (logically).
What kind of XML parser are you using? (It had better be a SAX one...)
Also, it would be nice to see some code (how you parse the XMLs...)
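For comparison, here is a minimal sketch of a streaming XML-to-CSV conversion. It assumes a flat list of record elements; the function name xmlToCsv(), the tag names and the sample data are all made up for illustration. XMLReader walks the file, and only one record at a time is handed to SimpleXML, so memory use stays roughly constant however large the file is.

```php
<?php
// Hypothetical helper: stream <record> elements from $reader into CSV rows on $out.
// $columns lists the child elements to export, in the desired output order.
function xmlToCsv(XMLReader $reader, $out, string $recordTag, array $columns): int
{
    $rows = 0;
    // Advance to the first record element.
    while ($reader->read() && $reader->name !== $recordTag) {
        // keep reading
    }
    while ($reader->name === $recordTag) {
        // Only the current record is materialised as a SimpleXML tree.
        $record = simplexml_load_string($reader->readOuterXml());
        $row = [];
        foreach ($columns as $column) {
            $row[] = trim((string) $record->{$column});
        }
        // Empty escape string gives RFC 4180 style quoting (needs PHP 7.4+).
        fputcsv($out, $row, ',', '"', '');
        $rows++;
        $reader->next($recordTag);   // jump to the next sibling record, skipping the subtree
    }
    return $rows;
}

// Sample input; a real script would call $reader->open('/path/to/file.xml') instead.
$xml = '<records>'
     . '<record><id>1</id><name> Ada </name></record>'
     . '<record><id>2</id><name>Linus</name></record>'
     . '</records>';

$reader = new XMLReader();
$reader->XML($xml);
$out = fopen('php://memory', 'w+');
$count = xmlToCsv($reader, $out, 'record', ['name', 'id']);   // columns rearranged
rewind($out);
$csv = stream_get_contents($out);
echo $csv;
```

Since each file is processed independently, several such scripts can run in parallel from the command line, one per file.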
Is there a native PHP wbxml API that can be used platform-independently? Perhaps a loadable module?
I have seen the pecl implementations but I have not been able to successfully work with the builds on win32 platforms.
I am not an expert, but from what I found there are essentially two options.
One, the pecl library that you are having trouble with.
Two, I found WBXML encoder and decoder classes in Horde of all places. They might give you a starting point, and since they are open source, they might meet your needs quite nicely. Here is a link where I found them.
http://phpxref.com/xref/horde/lib/XML/WBXML/index.html
I don't know a huge amount about WBXML, but from what I can gather it's a binary-formatted XML file. I suppose at the simplest you could use the XML modules such as simpleXML to generate your XML document, output it as a string and then use PHP's built in file handling functions (fopen, fwrite, etc) to dump the string as binary data to a file. To reverse the process load the file as a string and have SimpleXML parse it.
However, without knowing the specific details of the WBXML format, I'm sure there's more to it than that. You'd also have to implement the necessary code yourself, but as you could implement it in PHP itself, that should make cross-platform portability a bit simpler to accomplish.
Not really an answer as such, I'm afraid, but I hope it gets you going in the right direction.
I just want to know if there are any alternatives to SimpleXML for parsing XML data with PHP.
For example, if the SimpleXML module is not loaded, or if there is a lib/class out there that performs better than SimpleXML.
Obviously there are a ton of different ways to process XML, both as PHP extensions and as userspace libraries. The problem is that they are all much more complicated than SimpleXML and nowhere near as fast for random access.
I'm not sure what's the goal of your question though. None of those libraries/extensions share a common API so if you want a fallback in case SimpleXML isn't available then you'll have to duplicate your efforts. In actuality though, there's virtually no reason to disable SimpleXML so there's no reason to work on such a contingency plan.
You can use the DOM extension. It has the advantage that many people are already familiar with DOM (coming from e.g. JavaScript). Of course, DOM is very painful.
For reading large XML files, the event model (think SAX) is a necessity. See here.
Well, there is XML_Parser (see http://php.net/manual/en/book.xml.php) as well as XMLReader / XMLWriter ( http://www.php.net/manual/en/book.xmlreader.php / http://www.php.net/manual/en/book.xmlwriter.php ). SimpleXML is compiled into PHP by default (at least since 5.x). I can't tell you much about the performance of XMLReader/XMLWriter or XML_Parser, as I usually stick to SimpleXML.
Cheers,
Fabian
I need to parse a pretty big XML file in PHP (around 300 MB). How can I do it most efficiently?
In particular, I need to locate specific tags and extract their content into a flat TXT file, nothing more.
You can read and parse XML in chunks with an old-school SAX-based parsing approach using PHP's xml parser functions.
Using this approach, there's no real limit to the size of documents you can parse, as you simply read and parse a buffer-full at a time. The parser will fire events to indicate it has found tags, data etc.
There's a simple example in the manual which shows how to pick up start and end of tags. For your purposes you might also want to use xml_set_character_data_handler so that you pick up on the text between tags also.
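As a rough sketch of the approach described above, assuming the goal is to collect the text of every element with a given name: extractTagText() and the sample tags are hypothetical, and a real script would pass a handle from fopen() on the large file rather than an in-memory stream.

```php
<?php
// Hypothetical helper: return the text content of every <$wanted> element in $stream.
function extractTagText($stream, string $wanted): array
{
    $found  = [];
    $inside = false;   // state flag: are we currently inside a wanted element?
    $buffer = '';

    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);   // keep tag names as written

    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$inside, &$buffer, $wanted) {
            if ($name === $wanted) {
                $inside = true;
                $buffer = '';
            }
        },
        function ($p, string $name) use (&$inside, &$buffer, &$found, $wanted) {
            if ($name === $wanted) {
                $inside  = false;
                $found[] = $buffer;
            }
        }
    );
    // Text may arrive in several pieces, so it is appended rather than assigned.
    xml_set_character_data_handler(
        $parser,
        function ($p, string $data) use (&$inside, &$buffer) {
            if ($inside) {
                $buffer .= $data;
            }
        }
    );

    // Feed the parser one buffer-full at a time.
    while (!feof($stream)) {
        $chunk = (string) fread($stream, 8192);
        if (!xml_parse($parser, $chunk, feof($stream))) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    return $found;
}

$stream = fopen('php://memory', 'w+');
fwrite($stream, '<feed><title>One</title><skip>x</skip><title>Two &amp; Three</title></feed>');
rewind($stream);
$titles = extractTagText($stream, 'title');
print_r($titles);
```

Instead of collecting the values in an array, the end-element handler could write each one straight to the output file, which keeps memory use flat for very large inputs.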
The most efficient way to do that is to create a static XSLT stylesheet and apply it to your XML using XSLTProcessor. The method names are a bit misleading: even though you want to output plain text, you should use either transformToXML() if you need it as a string variable, or transformToURI() if you want to write a file.
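A minimal sketch of that idea, assuming the xsl extension is installed and that the tags of interest are called title (a made-up name for illustration). Note that DOMDocument loads the entire source document into memory, so this fits best when the file comfortably fits in RAM.

```php
<?php
// Stylesheet that prints the text of every <title> element, one per line.
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:for-each select="//title">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Sample input; a real script would use $source->load('/path/to/file.xml').
$xml = '<feed><entry><title>One</title></entry><entry><title>Two</title></entry></feed>';

$style = new DOMDocument();
$style->loadXML($xsl);

$source = new DOMDocument();
$source->loadXML($xml);

$processor = new XSLTProcessor();
$processor->importStylesheet($style);

// Despite the name, this returns plain text because of method="text" above.
$text = $processor->transformToXml($source);
echo $text;
```

To write the result directly to a file, transformToUri($source, $path) can be used in place of transformToXml().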
If it's a one-off or occasional job, I'd use XMLStarlet. But if you really want to do it on the PHP side, I'd recommend pre-splitting the file into smaller chunks and then processing those. If you load it via DOM as one big chunk, it will take a lot of memory. Also run the script with the PHP CLI to speed things up.
This is what SAX was designed for. SAX has a low memory footprint: it reads in a small buffer of data and fires events when it encounters elements, character data, etc.
It is not always obvious how to use SAX (it wasn't to me the first time I used it). In essence, you have to maintain your own state and your own view of where you are within the document structure. You will generally end up with variables describing which section of the document you are in, e.g. inFoo, inBar, which you set when you encounter particular start/end elements.
There is a short description and example of a sax parser here
Depending on your memory requirements, you can either load it up and parse it with XSLT (the memory-consuming route), or you can create a forward-only cursor and walk the tree yourself, printing the values you're looking for (the memory-efficient route).
Pull parsing is the way to go. This way it's memory-efficient AND easy to process. I have been processing files that are 50 MB or more.
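A small sketch of what pull parsing with XMLReader could look like for this task. The function name writeTagContents() and the tag name title are assumptions for illustration; a real script would open the 300 MB file with $reader->open() and write to a handle from fopen('output.txt', 'w').

```php
<?php
// Hypothetical helper: write the text of every <$tag> element to $out, one per line.
function writeTagContents(XMLReader $reader, string $tag, $out): int
{
    $written = 0;
    // The reader is a forward-only cursor; only the current node is held in memory.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tag) {
            fwrite($out, trim($reader->readString()) . "\n");
            $written++;
        }
    }
    return $written;
}

$xml = '<feed><entry><title>One</title></entry><entry><title> Two </title></entry></feed>';

$reader = new XMLReader();
$reader->XML($xml);
$out = fopen('php://memory', 'w+');
$count = writeTagContents($reader, 'title', $out);
$reader->close();

rewind($out);
$text = stream_get_contents($out);
echo $text;
```

Unlike the callback style of SAX, the loop here asks for the next node itself, so no separate state flags are needed for a simple extraction like this.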
I have normally hand written xml like this:
<tag><?= $value ?></tag>
Having found tools such as simpleXML, should I be using those instead? What's the advantage of doing it using a tool like that?
Good XML tools will ensure that the resulting XML file properly validates against the DTD you are using.
Good XML tools also save a bunch of repetitive typing of tags.
If you're dealing with a small bit of XML, there's little harm in doing it by hand (as long as you can avoid typos). However, with larger documents you're frequently better off using an editor, which can validate your doc against the schema and protect against typos.
You could use the DOM extension, which can be quite cumbersome to code against. My personal opinion is that the most effective way to write XML documents from the ground up is the XMLWriter extension that comes with PHP and is enabled by default in recent versions.
$w=new XMLWriter();
$w->openMemory();
$w->startDocument('1.0','UTF-8');
$w->startElement("root");
$w->writeAttribute("ah", "OK");
$w->text('Wow, it works!');
$w->endElement();
echo htmlentities($w->outputMemory(true));
using a good XML generator will greatly reduce potential errors due to fat-fingering, lapse of attention, or whatever other human frailty. there are several different levels of machine assistance to choose from, however:
at the very least, use a programmer's text editor that does syntax highlighting and auto-indentation. just noticing that your text is a different color than you expect, or not lining up the way you expect, can tip you off to a typo you might otherwise have missed.
better yet, take a step back and write the XML as a data structure of whatever language you prefer, then convert that data structure to XML. Perl gives you modules such as the lightweight XML::Simple for small jobs or the heftier XML::Generator; using XML::Simple is just a matter of arranging your content into a standard Perl hash of hashes and running it through the appropriate method.
-steve
Producing XML via any sort of string manipulation opens the door for bugs to get into your code. The extremely simple example you posted, for instance, won't produce well-formed XML if $value contains an ampersand.
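To make the ampersand problem concrete, here is a short sketch comparing hand-built markup with the output of XMLWriter. The sample value is made up; the check simply asks SimpleXML whether each string parses.

```php
<?php
$value = 'Tom & Jerry <3';

// Hand-written markup: the raw "&" and "<" end up in the document unescaped.
$byHand = '<tag>' . $value . '</tag>';

// XMLWriter escapes text content automatically.
$w = new XMLWriter();
$w->openMemory();
$w->writeElement('tag', $value);
$generated = $w->outputMemory();

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$handOk = simplexml_load_string($byHand) !== false;
$genOk  = simplexml_load_string($generated) !== false;
libxml_clear_errors();

var_dump($handOk, $genOk);
echo $generated, "\n";
```

If hand-written templates have to stay, passing every value through htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8') before output covers the same characters.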
There aren't a lot of edge cases in XML, but there are enough that it's a waste of time to write your own code to handle them. (And if you don't handle them, your code will unexpectedly fail someday. Nobody wants that.) Any good XML tool will automatically handle those cases.
Use the generator.
The advantage of using a generator is you have consistent markup and don't run the risk of fat-fingering a bracket or quote, or forgetting to encode something. This is crucial because these mistakes will not be found until runtime, unless you have significant tests to ensure otherwise.
Hand-writing isn't always the best practice, because in large XML documents you can write wrong tags, and it can be difficult to find the cause of an error. So I suggest using XML tools to create XML files.
Speed may be an issue... handwritten can be a lot faster.
The XML tools in eclipse are really useful too. Just create a new xml schema and document, and you can easily use most of the graphical tools. I do like to point out that a prior understanding of how schemas work will be of use.
Always use a tool of some kind. XML can be very complex. I know that the PHP guys are used to working with hacky little stuff, but it's a huge code smell in the .NET world if someone doesn't use System.Xml for creating XML.