I have a project that is done but needs better performance.
The gist of the project is that I'm taking XML and converting it to CSV files. The files represent data to be loaded into a Database.
Right now I'm using PHP to unzip the zip file that contains the XML. Then I parse, convert to CSV, and rezip.
It's been fine until now, but the XML files are getting HUGE, so much so that processing takes a little more than a day. Along the way I also manipulate the data, for example rearranging columns and trimming values.
What alternatives do you suggest that would help me improve performance?
I've thought about writing this parser in C++, but I'm not sure what route to take. Similar questions have been asked, but this is more of a performance issue, I suppose. Should I switch languages for performance, stick with PHP and optimize it, or try to make the parser parallel so more than one file can be processed at a time?
What would you suggest?
You could give Perl a try if PHP doesn't deliver what you want, but I doubt PHP is the problem; maybe you are doing something wrong there (logically).
What kind of XML parser are you using? (It had better be a SAX one...)
Also, it would be nice to see some code (how you parse the XMLs...)
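For reference, a SAX-style conversion in PHP might look roughly like the sketch below. It assumes a hypothetical layout in which every record is a <row> element whose direct children are the CSV columns; the tag name, the file paths, and the function name xmlToCsvSax are placeholders, not anything taken from the question.

```php
<?php
// Sketch only: assumes <row> elements whose child elements are the CSV columns.
function xmlToCsvSax(string $xmlPath, string $csvPath, string $rowTag = 'row'): int
{
    $in    = fopen($xmlPath, 'rb');
    $out   = fopen($csvPath, 'wb');
    $row   = null;   // record being collected, or null when outside a record
    $field = null;   // name of the column element currently open
    $count = 0;

    $parser = xml_parser_create('UTF-8');
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$row, &$field, $rowTag) {
            if ($name === $rowTag) {
                $row = [];
            } elseif ($row !== null) {
                $field = $name;
                $row[$field] = '';
            }
        },
        function ($p, $name) use (&$row, &$field, &$count, $out, $rowTag) {
            if ($name === $rowTag) {
                // One finished record becomes one CSV line; nothing else is kept.
                fputcsv($out, array_map('trim', $row), ',', '"', '\\');
                $row = null;
                $count++;
            }
            $field = null;
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$row, &$field) {
            // Text can arrive in several pieces, so append rather than assign.
            if ($row !== null && $field !== null) {
                $row[$field] .= $data;
            }
        }
    );

    // Feed the file to the parser in 64 KB chunks.
    while (!feof($in)) {
        $chunk = fread($in, 65536);
        if (!xml_parse($parser, $chunk, feof($in))) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    fclose($in);
    fclose($out);
    return $count;
}
```

Memory use stays roughly constant regardless of file size, because only the current record is ever held in PHP variables.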
I've got a few huge XML files, and I cut a few rows out so I could have a manageable-sized file on which to test my parsing script, written in PHP. There is a lot of nesting in the XML file, there are a lot of columns, and there are a lot of blanks, so writing the script was a huge ordeal. Now I'm hitting my PHP memory limit on the full-sized XML files I want to parse.
Now, one thing I've considered is temporarily upping the PHP memory limit, but I need to rerun this script every, well, week or so. Also, I don't have the best system. Running it hot and letting it melt is an all-too-real possibility and one of my "perfect storms".
I also considered attempting to learn a new language, such as Perl or Python. I probably could use to know one of these languages, anyway. I would prefer to stick with what I have, though, if only in the interest of time.
Isn't there some way to have PHP break the XML file up into manageable chunks that won't push my machine to its limit? Because every row in the XML file is wrapped by an ID column, it seems like I should be able to cut to the nth row closure, parse what was sliced, and then sleep, or something?
Any ideas?
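One way to get that "one row at a time" behaviour in PHP is XMLReader, which only keeps the current node in memory. The sketch below is a rough illustration, assuming each record is wrapped in an element called <row> (a placeholder for whatever the real wrapping tag is); the function name iterateRows is made up for the example.

```php
<?php
// Sketch only: yields one record at a time so memory use stays flat.
// The element name 'row' is an assumed placeholder for the real record tag.
function iterateRows(string $path, string $rowTag = 'row'): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }
    // Advance to the first record element.
    while ($reader->read() && $reader->name !== $rowTag) {
        // skip everything before the first record
    }
    while ($reader->name === $rowTag) {
        // Materialise just this record as SimpleXML; the rest of the file is untouched.
        yield simplexml_load_string($reader->readOuterXml());
        // Jump to the next record without descending into the current one.
        $reader->next($rowTag);
    }
    $reader->close();
}
```

Each yielded record can be handled with the existing SimpleXML-style parsing code and then discarded before the next one is read, so there should be no need to raise the memory limit or sleep between slices.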
I know this is possibly an obscure use for PHP, but I'm working on an idea to navigate the human genome in a rather interesting way.
The problem is I need to know if I can write a PHP script to parse the freely available data, and if so, how would I start? Do any PHP scripts that do this already exist?
I'd suggest creating a database design (MySQL) that has the subset of data you want to explore in the PHP application.
Then find a way to upload the data into that data schema. For the uploading part you could use a more powerful language of your choice than PHP; it could be C#, F#, Haskell, or whatever.
This separation will help simplify things more than doing it all in PHP.
You'll have to write a parser for that, but that should be fairly simple:
http://jc.unternet.net/genome/2bitformat.html
And an example in Perl: http://www.perlmonks.org/?node_id=672251
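As a rough starting point in PHP, reading the file header and the sequence index could look like the sketch below. It follows the layout as I understand it from the format description linked above (a 16-byte header of four 32-bit fields, then one name/offset entry per sequence); treat that layout as an assumption to check against the page. The function name readTwoBitIndex is made up for the example.

```php
<?php
// Sketch only: field layout assumed from the .2bit format description.
function readTwoBitIndex($fh): array
{
    $header = fread($fh, 16);
    if (strlen($header) !== 16) {
        throw new RuntimeException('File too short for a .2bit header');
    }
    // The signature reveals the byte order the file was written in.
    $fmt = 'V';                                    // little-endian 32-bit
    if (unpack('V', $header)[1] !== 0x1A412743) {
        $fmt = 'N';                                // big-endian 32-bit
        if (unpack('N', $header)[1] !== 0x1A412743) {
            throw new RuntimeException('Not a .2bit file');
        }
    }
    $h = unpack("{$fmt}signature/{$fmt}version/{$fmt}count/{$fmt}reserved", $header);

    $index = [];
    for ($i = 0; $i < $h['count']; $i++) {
        $nameSize = ord(fread($fh, 1));            // one byte: length of the name
        $name     = fread($fh, $nameSize);
        // 32-bit offset of this sequence's record from the start of the file.
        $index[$name] = unpack($fmt, fread($fh, 4))[1];
    }
    return $index;
}
```

With the index in hand, a script can fseek() to a single sequence instead of loading the whole genome file.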
I need to process a huge XML file and plan to use XMLReader. Below are three ways to do it, but I need to know which one is the fastest. How can I find that out? The planet.xml file is located at http://trash.chregu.tv/planet-big.xml.bz2 in case you need it. Thank you!
You might want to consider the PHP profiling extension:
http://www.php.net/apd
You can examine the results with pprofp:
http://www.compago.it/php/phpckbk-CHP-21-SECT-3.html
I haven't worked with XML much in PHP, but if you're dealing with a really large file, a streaming parser is the way to go. Reading the whole thing into memory and building a DOM tree is pretty expensive (and may even fail, if the document's too big to hold in memory).
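If installing a profiler is not an option, a crude wall-clock comparison is often enough to rank a few candidates. The sketch below times arbitrary callables; the helper name timeIt and the labels are made up, and the three XMLReader variants from the question would be passed in as closures.

```php
<?php
// Sketch only: crude wall-clock timing, not a substitute for a profiler.
function timeIt(array $candidates, int $runs = 3): array
{
    $results = [];
    foreach ($candidates as $label => $fn) {
        $best = INF;
        for ($i = 0; $i < $runs; $i++) {
            $start = hrtime(true);             // monotonic clock, in nanoseconds
            $fn();
            $best = min($best, (hrtime(true) - $start) / 1e9);
        }
        $results[$label] = $best;              // best-of-N time in seconds
    }
    asort($results);                           // fastest candidate first
    return $results;
}
```

Calling it as timeIt(['variant A' => $a, 'variant B' => $b, 'variant C' => $c]) returns the labels ordered from fastest to slowest. Taking the best of several runs reduces noise from disk caching, but the numbers are only meaningful relative to each other on the same machine and file.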
Is there a native PHP wbxml API that can be used platform-independently? Perhaps a loadable module?
I have seen the pecl implementations but I have not been able to successfully work with the builds on win32 platforms.
I am not an expert, but what I found out there numbered two options, essentially.
One, the pecl library that you are having trouble with.
Two, I found WBXML encoder and decoder classes in Horde of all places. They might give you a starting point, and since they are open source, they might meet your needs quite nicely. Here is a link where I found them.
http://phpxref.com/xref/horde/lib/XML/WBXML/index.html
I don't know a huge amount about WBXML, but from what I can gather it's a binary-formatted XML file. I suppose at the simplest you could use the XML modules such as SimpleXML to generate your XML document, output it as a string and then use PHP's built-in file handling functions (fopen, fwrite, etc.) to dump the string as binary data to a file. To reverse the process, load the file as a string and have SimpleXML parse it.
However, without knowing the specific details of the WBXML format, I'm sure there's more to it than that. You'd also have to implement the necessary code yourself, but as you could implement it in PHP itself, that should make cross-platform portability a bit simpler to accomplish.
Not really an answer as such, I'm afraid, but I hope it gets you going in the right direction.
I have normally hand written xml like this:
<tag><?= $value ?></tag>
Having found tools such as simpleXML, should I be using those instead? What's the advantage of doing it using a tool like that?
Good XML tools will ensure that the resulting XML file properly validates against the DTD you are using.
Good XML tools also save a bunch of repetitive typing of tags.
If you're dealing with a small bit of XML, there's little harm in doing it by hand (as long as you can avoid typos). However, with larger documents you're frequently better off using an editor, which can validate your doc against the schema and protect against typos.
You could use the DOM extension, which can be quite cumbersome to code against. My personal opinion is that the most effective way to write XML documents from the ground up is the XMLWriter extension that comes with PHP and is enabled by default in recent versions.
$w=new XMLWriter();
$w->openMemory();
$w->startDocument('1.0','UTF-8');
$w->startElement("root");
$w->writeAttribute("ah", "OK");
$w->text('Wow, it works!');
$w->endElement();
echo htmlentities($w->outputMemory(true));
using a good XML generator will greatly reduce potential errors due to fat-fingering, lapse of attention, or whatever other human frailty. there are several different levels of machine assistance to choose from, however:
at the very least, use a programmer's text editor that does syntax highlighting and auto-indentation. just noticing that your text is a different color than you expect, or not lining up the way you expect, can tip you off to a typo you might otherwise have missed.
better yet, take a step back and write the XML as a data structure of whatever language you prefer, then convert that data structure to XML. Perl gives you modules such as the lightweight XML::Simple for small jobs or the heftier XML::Generator; using XML::Simple is just a matter of arranging your content into a standard Perl hash of hashes and running it through the appropriate method.
-steve
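The same "data structure first" idea can be sketched in PHP by walking a nested array with XMLWriter. This is only an illustration: the function name arrayToXml is made up, and it assumes every array key is a valid XML element name and that no attributes are needed.

```php
<?php
// Sketch only: assumes array keys are valid XML element names; no attributes.
function arrayToXml(string $root, array $data): string
{
    $w = new XMLWriter();
    $w->openMemory();
    $w->startDocument('1.0', 'UTF-8');

    // Recursive closure: arrays become nested elements, scalars become text.
    $write = function (string $name, $value) use ($w, &$write): void {
        $w->startElement($name);
        if (is_array($value)) {
            foreach ($value as $childName => $child) {
                $write($childName, $child);
            }
        } else {
            $w->text((string) $value);   // text() escapes &, < and > for us
        }
        $w->endElement();
    };

    $write($root, $data);
    $w->endDocument();
    return $w->outputMemory();
}
```

For example, arrayToXml('person', ['name' => 'Tom & Jerry', 'age' => 7]) produces a <person> element with <name> and <age> children, with the ampersand escaped automatically.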
Producing XML via any sort of string manipulation opens the door for bugs to get into your code. The extremely simple example you posted, for instance, won't produce well-formed XML if $value contains an ampersand.
There aren't a lot of edge cases in XML, but there are enough that it's a waste of time to write your own code to handle them. (And if you don't handle them, your code will unexpectedly fail someday. Nobody wants that.) Any good XML tool will automatically handle those cases.
Use the generator.
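To make the ampersand case concrete, the sketch below builds the same element twice, once by plain string interpolation and once with the value escaped, and checks each result for well-formedness. The helper name isWellFormed and the sample value are made up for the illustration.

```php
<?php
// Sketch only: contrasts naive string building with an escaped version.
function isWellFormed(string $xml): bool
{
    $previous = libxml_use_internal_errors(true);   // collect errors instead of emitting warnings
    $ok = (new DOMDocument())->loadXML($xml) !== false;
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

$value = 'Fish & Chips';

// Hand-written markup: the raw ampersand goes straight into the document.
$naive = "<tag>$value</tag>";

// Same markup with the value escaped for XML first.
$escaped = '<tag>' . htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</tag>';

var_dump(isWellFormed($naive));    // bool(false): a bare & is not allowed
var_dump(isWellFormed($escaped));  // bool(true): the value is now Fish &amp; Chips
```

A generator such as XMLWriter or DOM does this escaping on every value without being asked, which is the whole point of the advice above.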
The advantage of using a generator is you have consistent markup and don't run the risk of fat-fingering a bracket or quote, or forgetting to encode something. This is crucial because these mistakes will not be found until runtime, unless you have significant tests to ensure otherwise.
Hand-writing isn't always the best practice, because in a large XML document you can write wrong tags and it can be difficult to find the cause of an error. So I suggest using XML libraries to create XML files.
Speed may be an issue... handwritten can be a lot faster.
The XML tools in Eclipse are really useful too. Just create a new XML schema and document, and you can easily use most of the graphical tools. I would point out that a prior understanding of how schemas work will be of use.
Always use a tool of some kind. XML can be very complex. I know that the PHP guys are used to working with hacky little stuff, but it's a huge code smell in the .NET world if someone doesn't use System.Xml for creating XML.