How do I parse XML to insert it into a MySQL database? - PHP

How would I parse a very large XML file and insert it into a MySQL database? I know PHP and I know JavaScript.

If it's a very large XML file you might not want to use DOM / SimpleXML as these load the complete XML tree into memory before allowing you to do any manipulation. If you are only interested in read operations you might want to look at XMLReader http://www.php.net/manual/en/class.xmlreader.php
XMLReader works by reading node by node, thus keeping speed up and memory usage down. There are a few interesting examples in the PHP documentation.
You can also look at SAX, an event based parser: http://php.net/xml_parser_create
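As a sketch of the XMLReader route described above: stream the file node by node and insert each record with a prepared statement. The `<products>/<product>` layout, table schema, and connection details below are assumptions for illustration, not part of the original question.

```php
<?php
// Stream a large XML file with XMLReader, hydrating one record at a
// time, and insert each into MySQL with a prepared PDO statement.
// File layout, table schema and credentials are illustrative.
$reader = new XMLReader();
$reader->open('products.xml');

$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'password');
$stmt = $pdo->prepare('INSERT INTO products (name, price) VALUES (?, ?)');

$doc = new DOMDocument();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        // Expand only this one element into a small tree; overall
        // memory use stays bounded no matter how big the file is.
        $node = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $stmt->execute([(string) $node->name, (float) $node->price]);
    }
}
$reader->close();
```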

Another way (available as of MySQL 5.5) is the LOAD XML statement.
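A minimal sketch of the LOAD XML route, issued from PHP. The table name, file path, and row element are assumptions; LOAD XML maps XML tags (or attributes) onto table columns with matching names, and LOCAL requires `local_infile` to be enabled on the server.

```php
<?php
// LOAD XML (MySQL 5.5+) lets the MySQL server parse the file itself.
// Each <row>…</row> element of person.xml is mapped onto the columns
// of the `person` table; names and paths are illustrative.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]);

$pdo->exec("LOAD XML LOCAL INFILE '/path/to/person.xml'
            INTO TABLE person
            ROWS IDENTIFIED BY '<row>'");
```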

You would use an XML parser such as SimpleXML.
$xml = simplexml_load_string($yourXml);
// Do what you need to do...

Related

making 'efficient' calls to xml data feed with simplexml

I'm using an XML data feed to get information using SimpleXML, and then generating a page using that data.
For this I'm getting the XML feed using
$xml = simplexml_load_file
Am I right in thinking that, to parse the XML data, the server has to download it all before it can work with it?
Obviously this is no problem with a 2 KB file, but some files are nearing 100 KB, so for every page load that has to be downloaded before the PHP can start generating the page.
On some of the pages we're only looking for one attribute of the XML, so parsing the whole document seems unnecessary. Normally I would look into caching the feed, but these feeds relate to live markets that change frequently, so that's not ideal, as I always want up-to-the-minute data.
Is there a better way to make more efficient calls to the XML feed?
One of the first tactics to optimize XML parsing is to parse on the fly - meaning, don't wait until the entire document arrives; start parsing as soon as you have something to parse.
This is much more efficient, since the bottleneck is often the network connection and not the CPU, so if we can find our answer without waiting for all the network data, we've optimized quite a bit.
You should google the terms XML push parser and XML pull parser.
In the article Pull parsing XML in PHP - Create memory-efficient stream processing you can find a tutorial that shows how to do this in PHP using the XMLReader extension that is bundled with PHP 5.
Here's a quote from that page which basically says what I just said, in nicer words:
PHP 5 introduced XMLReader, a new class for reading Extensible Markup Language (XML). Unlike SimpleXML or the Document Object Model (DOM), XMLReader operates in streaming mode. That is, it reads the document from start to finish. You can begin to work with the content at the beginning before you see the content at the end. This makes it very fast, very efficient, and very parsimonious with memory. The larger the documents you need to process, the more important this is.
Parsing in streaming mode is a bit different from procedural parsing. Keep in mind that all the data isn't already there. What you usually have to do is supply event handlers that implement some sort of state-machine. If you see tag A, do this, if you see tag B, do that.
Regarding the difference between push parsing and pull parsing take a look at this article. Long story short, both are stream-based parsers. You will probably need a push parser since you want to parse whenever data arrives over the network from your XML feed.
Push parsing in PHP can also be done with xml_parse() (libexpat with a libxml compatibility layer). You can see a code example on the xml_parse() PHP manual page.
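A push-parsing sketch with xml_parse(): the parser is fed chunks as they arrive, so parsing can begin before the download finishes. The feed layout and element names here are assumptions; note that the default parser uppercases element names, and that character data may arrive split across calls, so it must be buffered.

```php
<?php
// Event-based (push) parsing with the expat-style API: feed the parser
// chunk by chunk and collect the text of <price> elements via handlers.
$prices  = [];
$buffer  = '';
$inPrice = false;

$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$inPrice, &$buffer) {
        // Element names are case-folded to uppercase by default.
        if ($name === 'PRICE') { $inPrice = true; $buffer = ''; }
    },
    function ($p, $name) use (&$inPrice, &$buffer, &$prices) {
        if ($name === 'PRICE') { $inPrice = false; $prices[] = $buffer; }
    }
);
xml_set_character_data_handler($parser, function ($p, $data) use (&$inPrice, &$buffer) {
    if ($inPrice) { $buffer .= $data; }
});

// Pretend these chunks arrived one network read at a time; the character
// data split across the chunk boundary is reassembled by the buffer.
$chunks = ['<markets><price>9.', '99</price><price>12.50</price></markets>'];
foreach ($chunks as $i => $chunk) {
    xml_parse($parser, $chunk, $i === count($chunks) - 1);
}
xml_parser_free($parser);

print_r($prices); // 9.99 and 12.50
```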

XML parsing of large amount of data

What is the optimal way of parsing XML (which may be a large amount of data) in PHP?
See XML and PHP 5 in Devzone for a good introduction.
Basically, if you need to process large volumes of XML, you will want to use a stream-based parser, such as XMLReader (pull) or the XML Parser extension (push/SAX), to avoid running into memory issues. Parsers like DOM or SimpleXML read the whole file into memory before you can process it.
One of the most common ways is SimpleXML. It's pretty easy to use and fast.
I've used the SAXY XML parser in the past. Try it.
If you need a way to parse XML data that also works on PHP 4, then you can use the XML parser, or DOM XML (which is a PHP 4-only extension); if you need a solution for PHP 5, then you can use DOM or XMLReader.
It depends on your needs.

Should I use Perl or PHP for parsing a large XML file?

I want to parse a large XML file and I have two options: Perl or PHP. Being new to both languages what would be your suggestion about language of choice for parsing a large XML file?
And what modules are more appropriate for the task at hand?
Use the language that you are most comfortable with.
If you decide to use Perl, please refer back to the "parsing XML using Perl"-questions you asked recently:
What is the best tool for parsing XML and storing data in a database?
How to read data from an XML file and store it into database(MySQL) ?
What is the best way to validate XML against XML Schema, parsing it and storing data back to MySQL Database using Perl ?
What would be your choice of XML Parsers in Perl for parsing > 15 GB Files ?
XML is usually parsed in one of two modes: stream or DOM. DOM is convenient, but unsuitable for large files. XML::Twig from CPAN has mixed mode, which has advantages of both modes.
PHP has a built-in extension called SimpleXML which makes it very easy to handle XML files.
Just off the cuff - I have no knowledge of the specific XML parsing capabilities of either language - I would say if it's parsing, go Perl. Perl's regular expression support is excellent and makes it the language of choice when there is parsing to be done. Your mileage may vary.

Parse big XML in PHP

I need to parse a pretty big XML file in PHP (around 300 MB). How can I do it most efficiently?
In particular, I need to locate specific tags and extract their content into a flat TXT file, nothing more.
You can read and parse XML in chunks with an old-school SAX-based parsing approach using PHP's xml parser functions.
Using this approach, there's no real limit to the size of documents you can parse, as you simply read and parse a buffer-full at a time. The parser will fire events to indicate it has found tags, data etc.
There's a simple example in the manual which shows how to pick up start and end of tags. For your purposes you might also want to use xml_set_character_data_handler so that you pick up on the text between tags also.
The most efficient way to do that is to create a static XSLT stylesheet and apply it to your XML using XSLTProcessor. The method names are a bit misleading: even though you want to output plain text, you should use transformToXML() if you need it as a string variable, or transformToURI() if you want to write a file.
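A sketch of the XSLT route under stated assumptions (the `<books>/<book>/<title>` structure is invented for illustration): a stylesheet with `<xsl:output method="text"/>` pulls out just the element text, and transformToXML() returns plain text despite its name.

```php
<?php
// Apply a static XSLT stylesheet to extract element text as plain text.
$xsl = new DOMDocument();
$xsl->loadXML(<<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//title">
      <xsl:value-of select="."/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL);

$xml = new DOMDocument();
$xml->loadXML('<books><book><title>A</title></book><book><title>B</title></book></books>');

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
$text = $proc->transformToXML($xml); // plain text, one title per line
echo $text;
```

Note that XSLTProcessor still loads the source document into memory via DOM, so for a 300 MB file the stream-based approaches below are the safer route.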
If it's a one-off (or infrequent) job, I'd use XML Starlet. But if you really want to do it on the PHP side, then I'd recommend pre-parsing it into smaller chunks and then processing those. If you load it via DOM as one big chunk, it will take a lot of memory. Also, use a CLI PHP script to speed things up.
This is what SAX was designed for. SAX has a low memory footprint reading in a small buffer of data and firing events when it encounter elements, character data etc.
It is not always obvious how to use SAX (it wasn't to me the first time I used it), but in essence you have to maintain your own state and view of where you are within the document structure. Generally you will end up with variables describing which section of the document you are in, e.g. inFoo, inBar, etc., which you set when you encounter particular start/end elements.
There is a short description and example of a SAX parser here.
Depending on your memory requirements, you can either load it up and parse it with XSLT (the memory-consuming route), or you can create a forward-only cursor and walk the tree yourself, printing the values you're looking for (the memory-efficient route).
Pull parsing is the way to go. This way it's memory-efficient AND easy to process. I have been processing files as large as 50 MB or more.
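As a sketch of the pull-parsing approach for the original task (extracting specific tags into a flat TXT file): walk the document with XMLReader and write out the text of each matching element. The `<description>` element name and file names are assumptions; memory use stays roughly constant regardless of the XML's size.

```php
<?php
// Pull-parse a large file: extract the text of every <description>
// element into a flat text file, one entry per line.
$reader = new XMLReader();
$reader->open('big.xml');
$out = fopen('output.txt', 'w');

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'description') {
        // readString() returns the element's text content without
        // materializing the rest of the document.
        fwrite($out, trim($reader->readString()) . "\n");
    }
}

fclose($out);
$reader->close();
```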

What XML parser do you use for PHP?

I like the XMLReader class for its simplicity and speed. But I also like the xml_parse() family of functions, as they allow better error recovery. It would be nice if the XMLReader class threw exceptions for things like invalid entity references instead of just issuing a warning.
I'd avoid SimpleXML if you can. Though it looks very tempting because it lets you avoid a lot of "ugly" code, it's just what the name suggests: simple. For example, it can't handle mixed content like this:
<p>
Here is <strong>a very simple</strong> XML document.
</p>
Bite the bullet and go to the DOM Functions. The power of it far outweighs the little bit extra complexity. If you're familiar at all with DOM manipulation in Javascript, you'll feel right at home with this library.
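A small sketch of the point above, using the same mixed-content fragment: with DOM, iterating childNodes preserves the interleaving of text and inline elements, which SimpleXML's tree view loses.

```php
<?php
// DOM handles mixed content: text nodes (#text) and element nodes
// arrive in document order when you walk childNodes.
$doc = new DOMDocument();
$doc->loadXML('<p>Here is <strong>a very simple</strong> XML document.</p>');

foreach ($doc->documentElement->childNodes as $child) {
    printf("[%s] %s\n", $child->nodeName, $child->textContent);
}
// With SimpleXML, casting the <p> element to string would lose where
// the <strong> sits relative to the surrounding text.
```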
SimpleXML seems to do a good job for me.
SimpleXML and DOM work seamlessly together, so you can use the same XML interacting with it as SimpleXML or DOM.
For example:
$simplexml = simplexml_load_string("<xml></xml>");
$simplexml->simple = "it is simple.";
$domxml = dom_import_simplexml($simplexml);
$node = $domxml->ownerDocument->createElement("dom", "yes, with DOM too.");
$domxml->ownerDocument->firstChild->appendChild($node);
echo (string)$simplexml->dom;
You will get the result:
"yes, with DOM too."
Because when you import the object (into either SimpleXML or DOM), it uses the same underlying PHP object by reference.
I figured this out when I was trying to correct some of the errors in SimpleXML by extending/wrapping the object.
See http://code.google.com/p/blibrary/source/browse/trunk/classes/bXml.class.inc for examples.
This is really good for small chunks of XML (under 2 MB), as DOM/SimpleXML pull the full document into memory with some additional overhead (think 2x or 3x). For large XML chunks (over 2 MB) you'll want to use XMLReader/XMLWriter to parse SAX-style, with low memory overhead. I've processed 14 MB+ documents successfully with XMLReader/XMLWriter.
There are at least four options when using PHP5 to parse XML files. The best option depends on the complexity and size of the XML file.
There’s a very good 3-part article series titled ‘XML for PHP developers’ at IBM developerWorks.
“Parsing with the DOM, now fully compliant with the W3C standard, is a familiar option, and is your choice for complex but relatively small documents. SimpleXML is the way to go for basic and not-too-large XML documents, and XMLReader, easier and faster than SAX, is the stream parser of choice for large documents.”
I mostly stick to SimpleXML, at least whenever PHP5 is available for me.
http://www.php.net/simplexml
