I'm looking for a fast Python pull parser (maybe some equivalent of PHP's XMLReader?) - is there something like this in Python? It's really a key feature that it would be a pull parser, because I'm gonna process really big xml files...
PHP's XMLReader is nothing but a SAX parser ( http://en.wikipedia.org/wiki/Simple_API_for_XML ).
Python supports SAX parsing. There are many tutorials available, like this one:
http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/
How about the ones built into Python?
http://docs.python.org/library/xml.dom.html ("full DOM implementation")
and http://docs.python.org/library/xml.dom.minidom.html ("lightweight DOM implementation")
source: http://developer.yahoo.com/python/python-xml.html
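Since the question is specifically about streaming through very large files, here is a minimal sketch using the standard library's xml.etree.ElementTree.iterparse, which hands you elements one at a time instead of building the whole tree first. The function name iter_text, the sample document, and the assumption that the records of interest are direct children of the root element are all made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_text(source, tag):
    """Yield the text content of every <tag> element, one at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the "start" of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield "".join(elem.itertext())
            # Assumption: <tag> elements are direct children of the root.
            # Dropping already-processed children keeps memory use low.
            root.clear()


# Hypothetical sample data; a real call would pass a file name or file object.
sample = io.BytesIO(b"<items><item>a</item><item>b<b>c</b></item></items>")
titles = list(iter_text(sample, "item"))
```

The standard library also ships xml.dom.pulldom, whose event loop is closer in spirit to XMLReader's cursor; iterparse is used here only because it makes it easy to discard finished elements.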
I need to read an XML file from top to bottom. While doing my research I found out about SAX parsers, and more exactly I found this link:
http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm which works well and is exactly what I need.
But I can't find an example of a SAX parser written as a PHP class or in a more "PHP 7 way". I was wondering whether it is deprecated, or whether there are better alternatives nowadays?
XMLReader (http://php.net/manual/en/book.xmlreader.php) provides a much more modern approach to processing XML piece by piece. Even the article you refer to has this at the end...
PHP 5.1 comes with XMLReader included by default. This is a wrapper on
libxml2 and mimics the application programming interface (API) of the
C# component for reading XML, XmlTextReader. It is much faster than
SAX and just easier to use.
There are a few examples around of how to use XMLReader:
How to use XMLReader in PHP?
Which is the optimal way of XML parsing(XML may be of large amount of data) in php?
See XML and PHP 5 in Devzone for a good introduction.
Basically, if you need to process large volumes of XML, you will want to use a stream-based parser, such as XMLReader (a pull parser) or XML Parser (an event-based push parser), to avoid running into memory issues. Parsers like DOM or SimpleXML read the whole file into memory before you can process it.
One of the most common ways is SimpleXML. It's pretty easy to use and fast.
I've used the SAXY XML parser in the past. Try it.
If you need a way to parse XML data that also works on PHP4, you can use the XML parser, or DOM XML (which is a PHP4-only extension); if you need a solution for PHP5, you can use DOM or XMLReader.
It depends on your needs.
I am testing various methods to read (possibly large, with very frequent reads) XML configuration files in PHP. No writing is ever needed. I have two successful implementations, one using SimpleXML (which I know is a DOM parser) and one using XMLReader.
I know that a DOM reader must read the whole tree and therefore uses more memory. My tests reflect that. I also know that a SAX parser is an "event-based" parser that uses less memory because it reads each node from the stream without looking ahead at what comes next.
XMLReader also reads from a stream with the cursor providing data about the node it is currently at. So, it definitely sounds like XMLReader (http://us2.php.net/xmlreader) is not a DOM parser, but my question is, is it a SAX parser, or something else? It seems like XMLReader behaves the way a SAX parser does but does not throw the events themselves (in other words, can you construct a SAX parser with XMLReader?)
If it is something else, does the classification it's in have a name?
XMLReader calls itself a "pull parser."
The XMLReader extension is an XML Pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.
It later goes on to say it uses libxml.
This page on Java XML Pull Parsing may be of interest. If XMLReader shares that project's goals and intent, then the answer to your question falls squarely into the "neither" category.
A SAX parser is a parser which implements the SAX API. That is: a given parser is a SAX parser if and only if you can code against it using the SAX API. Same for a DOM parser: this classification is purely about the API it supports, not how that API is implemented. Thus a SAX parser might very well be a DOM parser, too; and hence you cannot be so sure about using less memory or other characteristics.
However, to get to the real question: XMLReader seems the better choice. Since it is a pull parser, you request exactly the data you want, so there should be less overhead involved.
XMLReader is an interface that a SAX2 parser must implement. Thus you could say that you have a SAX parser when you access it through XMLReader and for short that XMLReader is the SAX parser.
See the javadoc of XMLReader.
XMLReader is the interface that an XML parser's SAX2 driver must implement. This interface allows an application to set and query features and properties in the parser, to register event handlers for document processing, and to initiate a document parse.
I think this information is relevant because:
It comes from the official Web site for SAX
Even if the javadoc is for Java, SAX originated in the Java language.
In short, it is neither.
SAX parsers are stream-oriented, event-based push parsers. You register callback functions to handle events such as startElement and endElement, then call parse() to process the entire XML document, one node at a time. To my knowledge, PHP doesn't have a well-maintained SAX parser. However, there is XMLParser, which uses the very similar Expat library.
DOM parsers require you to load the entire XML document into memory, but they provide an object-oriented tree of the XML nodes. Examples of DOM parsers in PHP include SimpleXML and DOM.
The PHP XMLReader is neither of these. It is a stream-oriented "pull parser" that requires you to create a big loop and call the read() function to move the cursor forward, processing one node at a time.
The big benefit of XMLParser and XMLReader vs SimpleXML and DOM is that stream-oriented parsers are memory efficient, only loading the current node into memory. On the other hand, SimpleXML and DOM are easier to use, but they require you to load the entire XML document into memory, and this is bad for very large XML documents.
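The push/pull distinction above is not specific to PHP. As a rough sketch (in Python rather than PHP, with xml.sax standing in for a push parser like XMLParser and xml.dom.pulldom standing in for a pull parser like XMLReader), the two functions below collect the same element names from a small made-up document. The names NameCollector, names_push, names_pull and the sample XML are illustrative assumptions.

```python
import io
import xml.sax
from xml.dom import pulldom

XML = b"<doc><title>Hello</title><p>World</p></doc>"  # hypothetical sample


class NameCollector(xml.sax.ContentHandler):
    """Push style: the parser calls this handler for every start tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def names_push(data):
    handler = NameCollector()
    # parseString drives the whole parse and calls the handler back.
    xml.sax.parseString(data, handler)
    return handler.names


def names_pull(data):
    names = []
    # Pull style: our own loop asks for the next event and decides what to do.
    for event, node in pulldom.parse(io.BytesIO(data)):
        if event == pulldom.START_ELEMENT:
            names.append(node.tagName)
    return names


push_result = names_push(XML)
pull_result = names_pull(XML)
```

In the push version the control flow lives inside the parser, so state has to be kept on the handler object. In the pull version the loop belongs to the caller, which can stop early or skip ahead whenever it likes.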
I need an HTML SAX (not DOM!) parser for PHP that can process even invalid HTML code.
The reason I need it is to filter user-entered HTML (remove all attributes and tags
except allowed ones) and truncate HTML content to a specified length.
Any ideas?
SAX was made to process valid XML and fail on invalid markup. Processing invalid HTML markup requires keeping more state than SAX parsers typically keep.
I'm not aware of any SAX-like parser for HTML. Your best shot is to pass the HTML through tidy first and then use an XML parser, but this may defeat your purpose of using a SAX parser in the first place.
Try to use HTML SAX Parser
Summarizing as two steps:
Use Tidy to transform "free HTML" into "good XHTML".
Use XML Parser to parse XHTML as XML by SAX API.
First use Tidy (!) to transform "free HTML" into XHTML (also when you cannot trust your "supposed XHTML"). See the cleanRepair method. It takes more time, but it works with big files (!)... Set the maximum execution time to a few minutes if the file is very big.
Another option (for working with big files) is to cache your XHTML files once they have been checked or transformed into XHTML. See Tidy's repairFile method.
With a "trusted XHTML", use SAX... How to use SAX with PHP?
Parse the XML with a SAX-style API. In PHP this is implemented by LibXML (see LibXML2 at xmlsoft.org) and exposed through PHP's XML Parser, whose interface is close to the standard SAX API.
Another way to use the "SAX of LibXML2", through a different interface (a PHP iterator instead of the traditional SAX interface), is to use XMLReader. See this explanation about "XMLReader use SAX".
Yes, the terms "SAX" and "SAX API" do not appear in the PHP manual (!!). See this old but good introduction.
I may suggest the pear package here : http://pear.php.net/package/XML_HTMLSax/redirected
I like the XMLReader class for its simplicity and speed. But I also like the xml_parse family of functions, as it allows for better error recovery. It would be nice if the XMLReader class threw exceptions for things like invalid entity refs instead of just issuing a warning.
I'd avoid SimpleXML if you can. Though it looks very tempting by getting to avoid a lot of "ugly" code, it's just what the name suggests: simple. For example, it can't handle this:
<p>
Here is <strong>a very simple</strong> XML document.
</p>
Bite the bullet and go to the DOM Functions. The power of it far outweighs the little bit extra complexity. If you're familiar at all with DOM manipulation in Javascript, you'll feel right at home with this library.
SimpleXML seems to do a good job for me.
SimpleXML and DOM work seamlessly together, so you can work with the same XML document through either the SimpleXML or the DOM interface.
For example:
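// Build a document with SimpleXML and add a child element.
$simplexml = simplexml_load_string("<xml></xml>");
$simplexml->simple = "it is simple.";
// Get the DOM view of the same node (no copy is made).
$domxml = dom_import_simplexml($simplexml);
$node = $domxml->ownerDocument->createElement("dom", "yes, with DOM too.");
$domxml->ownerDocument->firstChild->appendChild($node);
// The element added through DOM is visible through SimpleXML.
echo (string)$simplexml->dom;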
You will get the result:
"yes, with DOM too."
This works because when you import the object (either into SimpleXML or DOM), it uses the same underlying PHP object by reference.
I figured this out when I was trying to correct some of the errors in SimpleXML by extending/wrapping the object.
See http://code.google.com/p/blibrary/source/browse/trunk/classes/bXml.class.inc for examples.
This is really good for small chunks of XML (under 2MB), as DOM/SimpleXML pull the full document into memory with some additional overhead (think x2 or x3). For large XML chunks (over 2MB) you'll want to use XMLReader/XMLWriter to process the document as a stream, with low memory overhead. I've used 14MB+ documents successfully with XMLReader/XMLWriter.
There are at least four options when using PHP5 to parse XML files. The best option depends on the complexity and size of the XML file.
There’s a very good 3-part article series titled ‘XML for PHP developers’ at IBM developerWorks.
“Parsing with the DOM, now fully compliant with the W3C standard, is a familiar option, and is your choice for complex but relatively small documents. SimpleXML is the way to go for basic and not-too-large XML documents, and XMLReader, easier and faster than SAX, is the stream parser of choice for large documents.”
I mostly stick to SimpleXML, at least whenever PHP5 is available for me.
http://www.php.net/simplexml