Parsing markup into element tree

I need to parse markup not unlike XML or JSON into trees of elements, in PHP. I'm certain there exist libraries for doing this kind of thing, but I can't for the life of me find any.
Problem is, this isn't XML or JSON; it's a number of obscure markups for which no specialized parsers exist. Thus I'm looking for a generic parser that can implement any markup in the form of an element tree.
Alternatively, articles on how to write one. I've written a recursive parser before, but am unsure how to approach making a generic, reusable one.

You could try this: http://pear.php.net/package/PHP_ParserGenerator with this: http://pear.php.net/package/PHP_LexerGenerator
There are also some versions of Lemon and JLex with support for emitting PHP here: http://wezfurlong.org/blog/2006/nov/parser-and-lexer-generators-for-php/
And this: https://drupal.org/project/grammar_parser
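
If a full parser generator is more than the markup needs, a small hand-written tree builder can be made reusable by passing the tag syntax in as a regular expression. The following is only a rough sketch under assumptions: the function names parseMarkup and makeNode, the node layout, and the BBCode-like sample markup are all made up for illustration, and it only handles markups with simple open/close tags.

```php
<?php
// Hypothetical helper: a node has a name and an ordered list of children
// (child nodes and plain text strings).
function makeNode($name)
{
    $node = new stdClass();
    $node->name = $name;
    $node->children = array();
    return $node;
}

// Builds an element tree from any markup whose tags one regex can recognise.
// $tagPattern must capture (1) "/" for a closing tag, else "", and (2) the tag name.
function parseMarkup($input, $tagPattern)
{
    $root  = makeNode('#root');
    $stack = array($root);   // open elements, innermost last
    $pos   = 0;

    while (preg_match($tagPattern, $input, $m, PREG_OFFSET_CAPTURE, $pos)) {
        $tag     = $m[0][0];
        $start   = $m[0][1];
        $current = end($stack);

        // Text between the previous tag and this one belongs to the open element.
        if ($start > $pos) {
            $current->children[] = substr($input, $pos, $start - $pos);
        }

        if ($m[1][0] === '/') {
            // Closing tag: step back up, but never pop the root.
            if (count($stack) > 1) {
                array_pop($stack);
            }
        } else {
            // Opening tag: attach a new node and descend into it.
            $node = makeNode($m[2][0]);
            $current->children[] = $node;
            $stack[] = $node;
        }
        $pos = $start + strlen($tag);
    }

    // Trailing text after the last tag.
    if ($pos < strlen($input)) {
        $current = end($stack);
        $current->children[] = substr($input, $pos);
    }
    return $root;
}

// Assumed sample markup with [tag]...[/tag] syntax.
$tree = parseMarkup('[b]bold [i]both[/i][/b] plain', '~\[(/?)(\w+)\]~');
```

Swapping the pattern (for example to match {tag}...{/tag}) reuses the same tree builder. Attributes, self-closing tags and mismatched closers would need extra handling that this sketch leaves out.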

Related

Good, solid documentation of PHP DOM

I've been trying to do some simple DOM parsing of HTML documents and am really shocked at how difficult it is to do.
I've looked into some of the many alternatives to PHP's DOM classes (like simple xml parser and simple HTML DOM). I found a very effective dom2array function too, which is useful for extremely basic parsing where you just want raw values of elements.
None of these alternatives is really compelling though.
PHP documentation of the DOM is typically lacking in detail and largely useless. A lot of the comments are actually really helpful though.
The tutorials I've found online typically cover only the very very basics like writing a 20 line XML document or parsing all the p tags in a document. Meh.
Are there any sites (or books) that go into detail specifically on working with the DOM using PHP's DOM libraries?

The DOM is a language-independent interface and documented in detail by the W3C.
That being said, if your aim is extremely simple parsing of (typically) structured information, XML may not be the correct format in the first place; XML includes a variety of advanced features (namespaces, DTDs, XSLT, distinction between attributes and text, markup instead of structured information). If that's the case, consider JSON, which is extremely easy to parse and generate.

Anything that says "DOM" in the name or claims to support it should support the DOM API as defined by the W3C, and you should consider their documentation normative for everything but the language-specific parts.

I should have titled my post, "Easiest way to parse HTML DOM in PHP". 'Easiest' is not a very good word, I know. It's all relative to what you're trying to do. What I'm doing is pretty straight-forward. I want to parse standalone HTML documents and present the content in a different context.
These are the things I wanted to do:
Parse basic properties like title and body
Alter all file references (images, links, css, js) to point to a valid location
Add/remove attributes from tags (dealing with 1995 HTML here)
Strip inline styles
I ended up going with Simple HTML DOM Parser
It has a very small learning curve and gives easy read/write access to the DOM. End of story. It does seem to choke on nested elements sometimes though.
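
For comparison, the same list of tasks can be roughly sketched with PHP's built-in DOM extension instead of Simple HTML DOM Parser. This is an illustration only: the sample HTML and the '/archive/' base path are assumptions, not anything from the original pages.

```php
<?php
// Hypothetical input: a tiny legacy HTML page.
$html = '<html><head><title>Old page</title></head>'
      . '<body bgcolor="white"><p style="color:red">Hi</p>'
      . '<img src="pics/a.gif"></body></html>';

$doc = new DOMDocument();
// Old markup tends to trigger libxml warnings; collect them instead of printing.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Parse a basic property: the page title.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;

// Alter file references; '/archive/' is an assumed new location.
foreach ($doc->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', '/archive/' . $img->getAttribute('src'));
}

// Strip inline styles and one outdated attribute wherever they occur.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[@style or @bgcolor]') as $el) {
    $el->removeAttribute('style');
    $el->removeAttribute('bgcolor');
}

$result = $doc->saveHTML();
```

Note that saveHTML() re-serializes the whole document, so the output is normalized rather than a byte-for-byte copy of the input.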

Keeping file offsets while parsing HTML with the DOM?

I want to modify <img src=""> attributes in not-too-malformed HTML (WordPress posts). I know I can take the simple way and use regexes, but I'm afraid people in blue furry suits will come haunt me in my sleep.
If I use the DOM parser to read the HTML and modify the <img> tags, I'm afraid I can't reconstruct the post exactly as it was (with only my modification), because the DOM parser will probably do too much cleanup and maybe remove essential data. A SAX parser can probably not handle invalid XML, so this will also not work.
So, is there a middle way, where I can use a DOM parser, but one that knows where each element started, so I can do string replacements or something similar from there? I know some nodes in the DOM tree will not exist in the source document (<b>Some <i>bizarre</b> formatting</i> will probably trigger this), but does this mean it is always impossible? I see there is a DOMNode::getLineNo() function added in PHP 5.3, but I'm using 5.2.x.

If PHP's DOM writes "too clean" results, you could try the string-based SimpleHTMLDOM to see whether it's more lenient.
However, with formatting as bizarre as you show, I would never entirely trust the parser to do it "right". But try it out, maybe it just skips such stuff.
The DOM library's DOMNode class has a getLineNo() method. I don't entirely see how this works though, seeing as it doesn't provide an offset to go with it. Not sure whether that'll help your use case.
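
For what it's worth, here is a minimal sketch of reading line numbers for <img> elements. It needs PHP 5.3 or later, so it won't run on 5.2.x, and the sample HTML is made up.

```php
<?php
// Hypothetical post content spread over several lines.
$html = "<div>\n<p>text</p>\n<img src=\"a.png\">\n</div>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

// Collect the source line reported for each <img> element.
$lines = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $lines[] = $img->getLineNo();
}
```

Since only a line number comes back, you would still have to search within that line of the original string to find the tag itself.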

get text between <tags>in php</tags>

Currently I am using, and have looked through, this code: http://us.php.net/manual/en/function.xml-set-element-handler.php#85970
How do I get the text between tags?
I am using PHP5 and XML Parser.

This should explain it:
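xml_set_character_data_handler ( $parser, 'tagContent' );
function tagContent ( $parser, $content )
{
    // $content holds a chunk of the text found between tags.
    // The handler can fire several times for one element,
    // so append the chunks rather than overwrite.
}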

Don't use PHP's XML Parser unless you know what you are doing - it's a SAX parser which requires you to understand the idea of SAX events. You don't need it unless you need to handle very large XML files very quickly. For 99% of cases, you don't need to use it. Based on your question, you are simply opening up an XML file and searching for a particular tag and extracting the strings contained in that tag. For that, you should use the DOM or SimpleXML parsers. SimpleXML lets you use XPath - see this example code which does broadly what you want. (If you want to use DOM, you'll have to look for it on the PHP site - I can't post hyperlinks as I'm a newbie - just be careful not to confuse it with DOMXML which is the old PHP4 XML parser.)
If I've misread this and you do actually want to learn how to use SAX parsing, Google will help - but the basic theory is this: your code will be processing the complete document as a series of events. Each event will represent something in the document - starting with the beginning of the document, then each element being opened and closed, each attribute being parsed, and each string being parsed. You need to have code that listens to the events you are interested in - when an element gets opened and closed - and then from that, see if the element that's being opened is the one you want - if it is, you need to then tell the event handler that handles the text that it should now listen for text and store whatever text you are looking for into a variable. When you are done, you can then close the stream off and do whatever you want to do with the text.
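
The event flow described above could be sketched roughly like this. The sample document and the element name 'title' are assumptions for illustration, and closures are used for the handlers, which needs PHP 5.3 or later.

```php
<?php
// Hypothetical document; we want the text inside every <title> element.
$xml = '<books><book><title>PHP &amp; XML</title><price>10</price></book>'
     . '<book><title>DOM Basics</title></book></books>';

$wanted = 'title';   // element whose text we collect (assumed name)
$inside = false;     // true while the parser is inside a wanted element
$buffer = '';        // text gathered for the current element
$titles = array();   // finished results

$parser = xml_parser_create();
// Report element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Element opened: start listening if it is the one we want.
    function ($parser, $name, $attrs) use (&$inside, &$buffer, $wanted) {
        if ($name === $wanted) {
            $inside = true;
            $buffer = '';
        }
    },
    // Element closed: store what was gathered and stop listening.
    function ($parser, $name) use (&$inside, &$buffer, &$titles, $wanted) {
        if ($name === $wanted) {
            $titles[] = $buffer;
            $inside = false;
        }
    }
);

// Text event: may arrive in several chunks, so append.
xml_set_character_data_handler(
    $parser,
    function ($parser, $content) use (&$inside, &$buffer) {
        if ($inside) {
            $buffer .= $content;
        }
    }
);

xml_parse($parser, $xml, true);
```

After parsing, $titles holds one string per <title> element.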
Again, based on the question, it sounds like you need DOM or SimpleXML not SAX.
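
As a rough idea of how little code the SimpleXML route takes, here is a minimal sketch; the sample document and the XPath expression '//title' are assumptions for illustration.

```php
<?php
// Hypothetical document; we want the text inside every <title> element.
$xml = '<books><book><title>PHP &amp; XML</title></book>'
     . '<book><title>DOM Basics</title></book></books>';

$doc = simplexml_load_string($xml);

// xpath() returns an array of SimpleXMLElement objects;
// casting one to string yields its text content.
$titles = array();
foreach ($doc->xpath('//title') as $title) {
    $titles[] = (string) $title;
}
```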

What's the difference between the different XML parsing libraries in PHP5?

The original question is below, but I changed the title because I think it will be easier to find for others with the same doubt. In the end, an XHTML document is an XML document.
It's a beginner question, but I would like to know which do you think is the best library for parsing XHTML documents in PHP5?
I have generated the XHTML from HTML files (which were created using Word :S) with Tidy, and now I need to replace some elements from them (like the and element, replace some attributes in tags).
I haven't used XML very much, there seem to be many options for parsing in PHP (Simple XML, DOM, etc.) and I don't know if all of them can do what I need, and which is the easiest one to use.
Sorry for my English, I'm from Argentina. Thanks!
A bit more information: I have a lot of HTML pages, done in Word 97. I used Tidy for cleaning and turning them into XHTML Strict, so now they are all XML compatible. I want to use an XML parser to find some elements and replace them (the logic by which I do this doesn't matter). For example, I want all of the pages to use the same CSS stylesheet and class attributes, for unified appearance. They are all static pages which contain legal documents, nothing strange there. Which of the extensions should I use? Is SimpleXML enough? Should I learn DOM in spite of being more difficult?

You could use SimpleXML, which is included in a default PHP install. This extension offers easy object-oriented access to XML structures.
There's also DOM XML. A "downside" to this extension is that it is a bit harder to use and that it is not included by default.

Just to clear up the confusion here. PHP has a number of XML libraries, because php4 didn't have very good options in that direction. From PHP5, you have the choice between SimpleXml, DOM and the sax-based expat parser. The latter also existed in php4. php4 also had a DOM extension, which is not the same as php5's.
DOM and SimpleXml are alternatives to the same problem domain; they load the document into memory and let you access it as a tree structure. DOM is a rather bulky API, but it's also very consistent and it's implemented in many languages, meaning that you can re-use your knowledge across languages (in JavaScript, for example). SimpleXml may be easier initially.
The SAX parser is a different beast. It treats an xml document as a stream of tags. This is useful if you are dealing with very large documents, since you don't need to hold it all in memory.
For your usage, I would probably use the DOM api.

DOM is a standard, language-independent API for hierarchical data such as XML which has been standardized by the W3C. It is a rich API with much functionality. It is object based, in that each node is an object.
DOM is good when you not only want to read, or write, but you want to do a lot of manipulation of nodes in an existing document, such as inserting nodes between others, changing the structure, etc.
SimpleXML is a PHP-specific API which is also object-based but is intended to be a lot less verbose than the DOM: simple tasks such as finding the value of a node or finding its child elements take a lot less code. Its API is not as rich as DOM's, but it still includes features such as XPath lookups, and a basic ability to work with multiple-namespace documents. And, importantly, it still preserves all features of your document such as XML CDATA sections and comments, even though it doesn't include functions to manipulate them.
SimpleXML is very good for read-only: if all you want to do is read the XML document and convert it to another form, then it'll save you a lot of code. It's also fairly good when you want to generate a document, or do basic manipulations such as adding or changing child elements or attributes, but it can become complicated (but not impossible) to do a lot of manipulation of existing documents. It's not easy, for example, to add a child element in between two others; addChild only inserts after other elements. SimpleXML also cannot do XSLT transformations. It doesn't have things like 'getElementsByTagName' or 'getElementById', but if you know XPath you can still do that kind of thing with SimpleXML.
The SimpleXMLElement object is somewhat 'magical'. The properties it exposes if you var_dump/print_r/var_export don't correspond to its complete internal representation. It exposes some of its child elements as if they were properties which can be accessed with the -> operator, but still preserves the full document internally, and you can do things like access a child element whose name isn't a valid PHP identifier with the ->{'name'} syntax, or read attributes with the [] operator as if the element was an associative array.
You don't have to fully commit to one or the other, because PHP implements the functions:
simplexml_import_dom(DOMNode)
dom_import_simplexml(SimpleXMLElement)
This is helpful if you are using SimpleXML and need to work with code that expects a DOM node or vice versa.
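
As a sketch of how the two can be mixed, the following inserts an element between two existing ones, which addChild alone can't do. The sample <list> document is made up for illustration.

```php
<?php
// Hypothetical document with a gap between the two items.
$list = simplexml_load_string('<list><item>first</item><item>third</item></list>');

// addChild() would append at the end, so switch to the DOM view of the same nodes.
$listNode = dom_import_simplexml($list);
$third    = dom_import_simplexml($list->item[1]);

// Create the new element in the shared document and place it before "third".
$second = $listNode->ownerDocument->createElement('item', 'second');
$listNode->insertBefore($second, $third);

// Back in SimpleXML, the change is visible without re-importing anything.
$items = array();
foreach ($list->item as $item) {
    $items[] = (string) $item;
}
```

Here $items ends up in the order first, second, third.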
PHP also offers a third XML library:
XML Parser (an implementation of SAX, a language-independent interface, but not referred to by that name in the manual) is a much lower level library, which serves quite a different purpose. It doesn't build objects for you. It basically just makes it easier to write your own XML parser, because it does the job of advancing to the next token, and finding out the type of token, such as what the tag name is and whether it's an opening or closing tag, for you. Then you have to write callbacks that should be run each time a token is encountered. All tasks such as representing the document as objects/arrays in a tree, manipulating the document, etc. will need to be implemented separately, because all you can do with the XML parser is write a low level parser.
The XML Parser functions are still quite helpful if you have specific memory or speed requirements. With it, it is possible to write a parser that can parse a very long XML document without holding all of its contents in memory at once. Also, if you are not interested in all of the data, and don't need or want it to be put into a tree or set of PHP objects, then it can be quicker. For example, if you want to scan through an XHTML document and find all the links, and you don't care about structure.
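
The link-scanning case might look roughly like this. It is a sketch under assumptions: the sample XHTML is made up, and it is split into 16-byte chunks only to mimic reading a large file piece by piece.

```php
<?php
// Hypothetical XHTML fragment.
$xhtml = '<html><body><p>See <a href="/one.html">one</a> and '
       . '<a href="/two.html">two</a>.</p></body></html>';

$links  = array();
$parser = xml_parser_create();
// Keep element and attribute names in their original case.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Opening tag: only <a> elements with an href are of interest.
    function ($parser, $name, $attrs) use (&$links) {
        if ($name === 'a' && isset($attrs['href'])) {
            $links[] = $attrs['href'];
        }
    },
    // Closing tags don't matter for this task.
    function ($parser, $name) {
    }
);

// Feed the document in small pieces; only the last call sets the final flag.
$chunks = str_split($xhtml, 16);
$last   = count($chunks) - 1;
foreach ($chunks as $i => $chunk) {
    xml_parse($parser, $chunk, $i === $last);
}
```

With a real file, each chunk would come from fread() instead, so only one chunk is in memory at a time.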

I prefer SimpleXMLElement as it's pretty easy to use to loop through elements.
Edit: It says no version info available but it's available in PHP5, at least 5.2.5 but probably earlier.
It's really personal choice though, there's plenty of XML extensions.
Bear in mind many XML parsers will balk if you have invalid markup - XHTML should be XML but not always!

It's been a long time (2 years or more) since I worked with XML parsing in PHP, but I always had good, usable results from the XML_Parser Pear package. Having said that, I have had minimal exposure to PHP5, so I don't really know if there are better, inbuilt alternatives these days.

I did a little bit of XML parsing in PHP5 last year and decided to use a combination of SimpleXML.
DOM is a bit more useful if you want to create a new XML tree or add to an existing one; it's slightly more flexible.

It really depends on what you're trying to accomplish.
For pulling rather large amounts of data, i.e. many records of, say, product information from a store website, I'd probably use Expat, since it's supposedly a bit faster...
Personally, I've never had XML files large enough to create a noticeable performance boost.
At those quantities you might as well be using SQL.
I recommend using SimpleXML.
It's pretty intuitive, easy to use/write.
Also, works great with XPath.
Never really got to use DOM much, but if you're using the XML Parser for something as large as you're describing you might want to use it, since it's a bit more functional than SimpleXML.
You can read about all three at W3Schools:
http://www.w3schools.com/php/php_xml_parser_expat.asp
http://www.w3schools.com/php/php_xml_simplexml.asp
http://www.w3schools.com/php/php_xml_dom.asp

What XML parser do you use for PHP?

I like the XMLReader class for its simplicity and speed. But I like the xml_parse associated functions as they better allow for error recovery. It would be nice if the XMLReader class would throw exceptions for things like invalid entity refs instead of just issuing a warning.

I'd avoid SimpleXML if you can. Though it looks very tempting by getting to avoid a lot of "ugly" code, it's just what the name suggests: simple. For example, it can't handle this:
<p>
Here is <strong>a very simple</strong> XML document.
</p>
Bite the bullet and go to the DOM Functions. The power of it far outweighs the little bit extra complexity. If you're familiar at all with DOM manipulation in Javascript, you'll feel right at home with this library.
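
As a sketch of what the DOM gives you for that kind of mixed content, the following walks the children of the paragraph in document order (the paragraph is put on one line here to keep whitespace out of the picture):

```php
<?php
$xml = '<p>Here is <strong>a very simple</strong> XML document.</p>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$p = $doc->documentElement;

// Text nodes and element nodes come back interleaved, in document order.
$parts = array();
foreach ($p->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $parts[] = 'text: ' . $child->nodeValue;
    } elseif ($child->nodeType === XML_ELEMENT_NODE) {
        $parts[] = $child->nodeName . ': ' . $child->textContent;
    }
}

// textContent flattens the whole subtree into one string.
$fullText = $p->textContent;
```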

SimpleXML seems to do a good job for me.

SimpleXML and DOM work seamlessly together, so you can use the same XML interacting with it as SimpleXML or DOM.
For example:
$simplexml = simplexml_load_string("<xml></xml>");
$simplexml->simple = "it is simple.";
$domxml = dom_import_simplexml($simplexml);
$node = $domxml->ownerDocument->createElement("dom", "yes, with DOM too.");
$domxml->ownerDocument->firstChild->appendChild($node);
echo (string)$simplexml->dom;
You will get the result:
"yes, with DOM too."
Because when you import the object (either into simplexml or dom) it uses the same underlying PHP object by reference.
I figured this out when I was trying to correct some of the errors in SimpleXML by extending/wrapping the object.
See http://code.google.com/p/blibrary/source/browse/trunk/classes/bXml.class.inc for examples.
This is really good for small chunks of XML (under 2MB), as DOM/SimpleXML pull the full document into memory with some additional overhead (think x2 or x3). For large XML chunks (over 2MB) you'll want to use XMLReader/XMLWriter to parse SAX style, with low memory overhead. I've used 14MB+ documents successfully with XMLReader/XMLWriter.
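
A minimal sketch of the XMLReader style, assuming a made-up <catalog> document; for a real large file you would call open() with a path instead of XML() with a string.

```php
<?php
// Hypothetical catalog; in practice this would be a large file on disk.
$xml = '<catalog><product sku="a1"><name>Lamp</name></product>'
     . '<product sku="b2"><name>Desk</name></product></catalog>';

$reader = new XMLReader();
$reader->XML($xml);

$names = array();
$sku   = null;

// read() advances one node at a time, so memory use stays low.
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }
    if ($reader->name === 'product') {
        // Remember which product the following <name> belongs to.
        $sku = $reader->getAttribute('sku');
    } elseif ($reader->name === 'name') {
        $names[$sku] = $reader->readString();
    }
}
$reader->close();
```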

There are at least four options when using PHP5 to parse XML files. The best option depends on the complexity and size of the XML file.
There’s a very good 3-part article series titled ‘XML for PHP developers’ at IBM developerWorks.
“Parsing with the DOM, now fully compliant with the W3C standard, is a familiar option, and is your choice for complex but relatively small documents. SimpleXML is the way to go for basic and not-too-large XML documents, and XMLReader, easier and faster than SAX, is the stream parser of choice for large documents.”

I mostly stick to SimpleXML, at least whenever PHP5 is available for me.
http://www.php.net/simplexml
