Bi-directional Text Parsing Recommendations


I'm looking at the feasibility of implementing a bi-directional text parsing framework to allow formatted text to be processed using a combination of common paradigms such as Markdown, BBCode, DokuWiki, and so on. Practically speaking, this means that each implementation must be able to translate to and from a common format. That could be HTML, but more realistically an intermediate (more easily parsable) format like XML or YAML.
This will probably use a tokenizer to break the document into its relevant components. Does this sound like the best approach, and can you foresee any significant roadblocks?
Lastly, is anyone aware of any existing implementations (or attempts)?
Note that this is focused on PHP, but other solutions are welcome.

Have a look at the source of an HTML parser such as Nokogiri, Hpricot, BeautifulSoup etc. They will give you some food for thought on constructing a structured text parser.
There's probably no need to translate to an intermediate format, since your tokenised object tree is going to be all you need to build all the output formats.
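As a rough illustration of that point, here is a minimal sketch in which one node list feeds every output format. It only understands bold text, and the SYNTAX table, tokenise() and render() are hypothetical names made up for this example, not part of any existing library.

```php
<?php
// Hypothetical syntax table: each format only knows its own bold delimiters.
const SYNTAX = [
    'markdown' => ['**', '**'],
    'bbcode'   => ['[b]', '[/b]'],
];

// Break the source into nodes of the form ['type' => 'text'|'bold', 'value' => string].
function tokenise(string $src, string $format): array
{
    [$open, $close] = SYNTAX[$format];
    $pattern = '/(' . preg_quote($open, '/') . '.+?' . preg_quote($close, '/') . ')/s';
    $parts = preg_split($pattern, $src, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $nodes = [];
    foreach ($parts as $part) {
        // A part is bold when it is wrapped in the format's delimiters.
        $isBold = strlen($part) > strlen($open) + strlen($close)
            && substr($part, 0, strlen($open)) === $open
            && substr($part, -strlen($close)) === $close;
        $nodes[] = $isBold
            ? ['type' => 'bold', 'value' => substr($part, strlen($open), -strlen($close))]
            : ['type' => 'text', 'value' => $part];
    }
    return $nodes;
}

// Serialise the same node list back into any supported format.
function render(array $nodes, string $format): string
{
    [$open, $close] = SYNTAX[$format];
    $out = '';
    foreach ($nodes as $node) {
        $out .= $node['type'] === 'bold' ? $open . $node['value'] . $close : $node['value'];
    }
    return $out;
}

$bbcode   = render(tokenise('Hello **world**!', 'markdown'), 'bbcode');
$markdown = render(tokenise($bbcode, 'bbcode'), 'markdown');
echo $bbcode, "\n", $markdown, "\n";
```

A real framework would need nested nodes and many more token types, but the shape stays the same: one parser and one renderer per format, with the node tree as the only shared structure.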
If you have specific implementation questions, you should post them too.

Related

Caching web pages using PHP (for offline viewing)

I'm working on a personal project to view web pages offline. The first idea that I came up with is using file_get_contents to get the contents of a specific URL, but this only gets the HTML and not the assets in that page (CSS, images, JavaScript, etc.). So I had to write regexes to get the stylesheets and images in the page:
$css_pattern = '/\S*\.css"/';
$img_src_pattern = '/src=(?:"|\')?.+\.(?:gif|jpg|png|jpeg)(?:"|\')/';
preg_match_all($css_pattern, $contents, $style_matches);
preg_match_all($img_src_pattern, $contents, $img_matches);
This works, but there are also images linked in the CSS, and I'm still thinking about how to deal with those.
There are also projects like ganon https://code.google.com/p/ganon/ and simple html parser that might make my life easier but I prefer using regex because I want to learn more about it.
The question is: is there a better way of doing this project? The app will probably have folders in which to save assets and html for each site and it will probably become unwieldy. I've heard of things like manifest file in html5 but I'm not sure if that's possible if you don't own the site. Any ideas? If there's no other way to do this then maybe you can just help me improve the regex that I have above. I basically have to use str_replace and foreach to get the stylesheets:
$stylesheets = array();
foreach ($style_matches[0] as $match) {
    $stylesheets[] = str_replace(array('href=', '"', "'"), '', $match);
}
Thanks in advance!

I prefer using regex because I want to learn more about it.
Parsing HTML with regex is possible albeit non-trivial. A good introduction is given in the following paper:
REX: XML Shallow Parsing with Regular Expressions
The regular expressions used in that paper (REX) are not the same flavor as the ones used in PHP (PCRE), but they are similar, so you should be able to follow it if you're willing to learn.
Following what that paper outlines and writing regular expressions in PHP on your own with some nice test-cases should be a real training camp for you digging into regular expressions.
Besides the regular expressions, you also need to deal with character encodings, which is another field of its own, and then adapt the parser to an encoding (if you do not re-encode before parsing).
If you're looking specifically for an HTML 5 compatible parser, it is specified as part of the HTML 5 "specification", but you cannot do it precisely with regular expressions any longer in a sane way (at least as far as I know):
12.2 Parsing HTML documents — HTML Living Standard — Updated ca. daily
8.2 Parsing HTML documents — HTML5 — A vocabulary and associated APIs for HTML and XHTML W3C Candidate Recommendation 17 December 2012
To me that type of parsing looks like a large amount of overhead, but peek into the outline of the HTML 5 parser and you get an idea of everything you could take care of for HTML parsing nowadays. It seems like those guys and girls really needed to push in anything they could imagine. The following engines/browsers have an HTML 5 parser:
Gecko 2
Webkit
Chrome 7 (Webkit)
Opera 11.60 (Ragnarök)
IE10
From personal experience, in the PHP ecosystem there are not many SGML based / "loose" / low-level / tag-soup HTML parsers. If I were to write one, I would also use regular expressions for string parsing; the REX shallow parsing article has some good discussion. However, I would probably only use such a low-level HTML parser to make any HTML consumable for DOMDocument or for some other validation/fixing related work, and wouldn't use it for further parsing/document abstraction. DOMDocument is pretty powerful, especially for gathering links as you describe above.
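As a sketch of what gathering links with DOMDocument can look like, the following collects stylesheet and image URLs from an HTML string. The function name collectAssets() and the sample markup are assumptions made for illustration.

```php
<?php
// Collect stylesheet and image URLs from an HTML string using DOMDocument.
function collectAssets(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely clean; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $assets = ['css' => [], 'img' => []];
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'stylesheet') {
            $assets['css'][] = $link->getAttribute('href');
        }
    }
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $assets['img'][] = $img->getAttribute('src');
        }
    }
    return $assets;
}

// Sample markup (made up) mixing double and single quoted attributes.
$html = '<html><head><link rel="stylesheet" href="/css/site.css"></head>'
      . "<body><img src='logo.png'><img src=\"a/b.jpg\"></body></html>";
$assets = collectAssets($html);
print_r($assets);
```

Unlike the regular expressions in the question, this does not care about quoting style or attribute order, because the parser has already normalised those.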
For the rest of your question, you find all the elements you need to bring together outlined in diverse HTTP related RFCs, so you need to decide on your own which link resolving algorithm you want to support and how you re-map the static CSS/image/js files if you save them again. You normally then re-write the HTML as well for which DOMDocument is really handy.
Also, you should store some HTTP headers inside the HTML file via the meta element, especially the encoding, unless you re-encode the file (which can be useful for offline reading anyway). Some of the more general Q&A suggestions for HTML authoring apply to a static cache as well.
The HTML5 manifest file is actually something different. The original server would have to support it. That is likely not the case (or you need to build a parser for it as well and process it). So if you create a mirror, you might also want to point out all static resources that can be stored locally for offline usage. That is a nice idea; I have not yet seen this implemented by tools like wget, so it's probably worth playing with that idea a little.
Instead of the HTML5 manifest file, you might also have been thinking of one of the following container formats:
Mozilla Archive Format - MAFF
MIME HTML - MHTML
Webarchive
Another one of these formats/extensions (here: the SingleFile Chrome extension) makes use of the Data URI scheme according to Wikipedia, which might also be useful in this context, although I would not favor it. I'd say it's better to have an algorithm that is able to re-write URLs to the local file system in a reproducible manner, so that you can dump multiple HTML files with the same assets without fetching the assets multiple times.
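One possible way to make that re-writing reproducible is to derive the local file name from a hash of the absolute URL, so the same asset always lands in the same place. This is only a sketch: localPathFor(), rewriteCssUrls() and the assets/ directory are hypothetical, and only absolute http(s) URLs are handled (resolving relative URLs against a base is left out).

```php
<?php
// Map an absolute asset URL to a stable local path (hypothetical layout: assets/<sha1>.<ext>).
function localPathFor(string $url, string $baseDir = 'assets'): string
{
    $path = (string) parse_url($url, PHP_URL_PATH);
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return $baseDir . '/' . sha1($url) . ($ext !== '' ? '.' . $ext : '');
}

// Re-write absolute url(...) references inside a stylesheet to their local paths.
function rewriteCssUrls(string $css): string
{
    return preg_replace_callback(
        '/url\(\s*([\'"]?)(https?:\/\/[^\'")]+)\1\s*\)/i',
        function (array $m): string {
            return 'url("' . localPathFor($m[2]) . '")';
        },
        $css
    );
}

$url = 'http://example.com/img/bg.png';
$css = 'body{background:url("' . $url . '")}';
echo localPathFor($url), "\n";
echo rewriteCssUrls($css), "\n";
```

Because the path depends only on the URL, two saved pages that share a stylesheet or image point at the same local file, and a file that already exists does not need to be fetched again.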

What is the relation between PHP and XML?

I'm learning about PHP and web coding.
Specifically, the PHP book I'm using that covers PHP 5.3 (by Matt Doyle and published by Wrox), says:
XML ... lets you create text documents that can hold data in a structured way...
XML isn't really a language but rather a specification for creating your own markup languages...
Wikipedia says of XML:
As of 2009, hundreds of XML-based languages have been developed,[8] including RSS, Atom, SOAP, and XHTML. XML-based formats have become the default for many office-productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org and LibreOffice (OpenDocument), and Apple's iWork.[9] XML has also been employed as the base language for communication protocols, such as XMPP.
It sounds like XML is more like a protocol, a standard for allowing computers to communicate and share information.
So XML is like a grammar I can use to create a markup language, but the language I create only formats data?
I want help defining the relationship between PHP and XML.
At what point during the processing of PHP and HTML does XML get parsed?

XML is not a grammar (that's another thing entirely). XML (as the name suggests) is a markup language that essentially defines a set of rules that describe something. The "something" could be a protocol, the structure of a document, or any kind of data. XML is designed to be machine readable and human readable (although in my opinion, with bias towards the former ;)).
XML documents can use something called a schema, which describes the structure of the XML itself, and so you can validate an XML document against a schema to make sure that it is valid (that is, that it follows the expected structure, which goes beyond merely being well-formed).
There is no relation between PHP and XML. XML is something that PHP can consume and produce. There is nowhere during processing that PHP consumes or produces XML unless you explicitly tell PHP to do so.
XML is sometimes used as a sort of "glue" that allows dissimilar or disparate systems to communicate with each other, but even that is just one of its functions. For example, PHP can consume XML produced by a program written in another language entirely, or XML produced by some website. PHP can also produce XML which can then be consumed by a program written in another language, or by some other source. As you found from the Wikipedia article, SOAP uses XML and this allows clients written in different languages to consume data exposed by a SOAP service.
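To make "consume and produce" concrete, here is a small sketch using SimpleXML. The order and receipt documents are invented for the example; nothing here happens unless the script asks for it explicitly.

```php
<?php
// Consume: parse XML that some other system produced (sample data, made up).
$incoming = '<order id="42"><item qty="2">Widget</item></order>';
$order = simplexml_load_string($incoming);
$id   = (int) $order['id'];          // attribute of the root element
$name = (string) $order->item;       // text content of <item>
$qty  = (int) $order->item['qty'];   // attribute of <item>

// Produce: build a reply document for another system to read.
$reply = new SimpleXMLElement('<receipt/>');
$reply->addAttribute('order', (string) $id);
$reply->addChild('line', $qty . ' x ' . $name);
$xml = $reply->asXML();
echo $xml;
```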

XML gives a starting point for a lot of technologies, particularly web technologies.
While PHP can be used for other things, its origins are in the web and it is still most heavily used there. As such, it would be sorely lacking if it couldn't deal with such a core web technology as XML. Likewise, it has support for other key web technologies like URIs, and those heavily used with the web like streams and database connections.

XML is only a data format specification. It was once very hyped but has somewhat faded in favor of JSON. It is still VERY popular because there are many protocols using it as the data interchange format.
PHP is generic enough (as a programming language) to generate XML as well as any other data format.
Since XML is such an important data format, every respectable programming language is expected to easily consume XML as well, and PHP is no exception.

Good, solid documentation of PHP DOM

I've been trying to do some simple DOM parsing of HTML documents and am really shocked at how difficult it is to do.
I've looked into some of the many alternatives to PHP's DOM classes (like simple xml parser and simple HTML DOM). I found a very effective dom2array function too, which is useful for extremely basic parsing where you just want raw values of elements.
None of these alternatives is really compelling though.
PHP documentation of the DOM is typically lacking in detail and largely useless. A lot of the comments are actually really helpful though.
The tutorials I've found online typically cover only the very very basics like writing a 20 line XML document or parsing all the p tags in a document. Meh.
Are there any sites (or books) that go into detail specifically on working with the DOM using PHP's DOM libraries?

The DOM is a language-independent interface and documented in detail by the W3C.
That being said, if your aim is extremely simple parsing of (typically) structured information, XML may not be the correct format in the first place; XML includes a variety of advanced features (namespaces, DTDs, XSLT, distinction between attributes and text, markup instead of structured information). If that's the case, consider JSON, which is extremely easy to parse and generate.

Anything that says "DOM" in the name or claims to support it should support the DOM API as defined by the W3C, and you should consider their documentation normative for everything but the language-specific parts.

I should have titled my post, "Easiest way to parse HTML DOM in PHP". 'Easiest' is not a very good word, I know. It's all relative to what you're trying to do. What I'm doing is pretty straight-forward. I want to parse standalone HTML documents and present the content in a different context.
These are the things I wanted to do:
Parse basic properties like title and body
Alter all file references (images, links, css, js) to point to a valid location
Add/remove attributes from tags (dealing with 1995 HTML here)
Strip inline styles
I ended up going with Simple HTML DOM Parser
It has a very small learning curve and gives easy read/write access to the DOM. End of story. It does seem to choke on nested elements sometimes though.
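For comparison, the four tasks listed above can also be sketched with PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM Parser. This is not the approach taken in this answer; the sample markup and the /archive/ prefix are assumptions for illustration.

```php
<?php
// Sample "1995-style" markup (made up for this sketch).
$html = '<html><head><title>Old page</title></head>'
      . '<body bgcolor="#ffffff"><p style="color:red">Hi <img src="pics/a.gif"></p></body></html>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);   // tolerate sloppy markup
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// 1. Parse basic properties.
$title = $xpath->evaluate('string(//title)');

// 2. Point image references at a new location (hypothetical prefix).
foreach ($xpath->query('//img[@src]') as $img) {
    $img->setAttribute('src', '/archive/' . $img->getAttribute('src'));
}

// 3. Remove a presentational attribute, and 4. strip inline styles.
foreach ($xpath->query('//*[@bgcolor or @style]') as $el) {
    $el->removeAttribute('bgcolor');
    $el->removeAttribute('style');
}

// Serialise only the <body> element.
$body = $doc->saveHTML($doc->getElementsByTagName('body')->item(0));
echo $title, "\n", $body, "\n";
```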

When writing XML, is it better to hand write it, or to use a generator such as simpleXML in PHP?

I have normally hand written xml like this:
<tag><?= $value ?></tag>
Having found tools such as simpleXML, should I be using those instead? What's the advantage of doing it using a tool like that?

Good XML tools will ensure that the resulting XML file properly validates against the DTD you are using.
Good XML tools also save a bunch of repetitive typing of tags.

If you're dealing with a small bit of XML, there's little harm in doing it by hand (as long as you can avoid typos). However, with larger documents you're frequently better off using an editor, which can validate your doc against the schema and protect against typos.

You could use the DOM extension, which can be quite cumbersome to code against. My personal opinion is that the most effective way to write XML documents from the ground up is the XMLWriter extension that comes with PHP and is enabled by default in recent versions.
$w = new XMLWriter();
$w->openMemory();                       // build the document in memory
$w->startDocument('1.0', 'UTF-8');
$w->startElement("root");
$w->writeAttribute("ah", "OK");
$w->text('Wow, it works!');             // text is escaped automatically
$w->endElement();
$w->endDocument();
// htmlentities() only makes the markup visible when echoed into an HTML page
echo htmlentities($w->outputMemory(true));

using a good XML generator will greatly reduce potential errors due to fat-fingering, lapse of attention, or whatever other human frailty. there are several different levels of machine assistance to choose from, however:
at the very least, use a programmer's text editor that does syntax highlighting and auto-indentation. just noticing that your text is a different color than you expect, or not lining up the way you expect, can tip you off to a typo you might otherwise have missed.
better yet, take a step back and write the XML as a data structure of whatever language you prefer, then convert that data structure to XML. Perl gives you modules such as the lightweight XML::Simple for small jobs or the heftier XML::Generator; using XML::Simple is just a matter of arranging your content into a standard Perl hash of hashes and running it through the appropriate method.
-steve

Producing XML via any sort of string manipulation opens the door for bugs to get into your code. The extremely simple example you posted, for instance, won't produce well-formed XML if $value contains an ampersand.
There aren't a lot of edge cases in XML, but there are enough that it's a waste of time to write your own code to handle them. (And if you don't handle them, your code will unexpectedly fail someday. Nobody wants that.) Any good XML tool will automatically handle those cases.
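The ampersand case is easy to reproduce. The sketch below builds the same element three ways (raw templating, manual escaping, and the DOM API) and checks each result by trying to parse it; the sample value is made up.

```php
<?php
$value = 'Tom & Jerry <3';

// Naive string templating: the raw "&" and "<" make the result ill-formed.
$naive = "<tag>$value</tag>";

// Escaping by hand works, but it must never be forgotten.
$escaped = '<tag>' . htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</tag>';

// An XML API escapes text nodes for you.
$doc = new DOMDocument('1.0', 'UTF-8');
$tag = $doc->createElement('tag');
$tag->appendChild($doc->createTextNode($value));
$doc->appendChild($tag);
$generated = $doc->saveXML($tag);

// Check well-formedness by trying to parse each variant.
libxml_use_internal_errors(true);
$naiveOk     = simplexml_load_string($naive) !== false;
$escapedOk   = simplexml_load_string($escaped) !== false;
$generatedOk = simplexml_load_string($generated) !== false;
libxml_clear_errors();

var_dump($naiveOk, $escapedOk, $generatedOk);
```

Only the first variant fails to parse, and it is the one that looks most harmless in a template.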

Use the generator.
The advantage of using a generator is you have consistent markup and don't run the risk of fat-fingering a bracket or quote, or forgetting to encode something. This is crucial because these mistakes will not be found until runtime, unless you have significant tests to ensure otherwise.

Hand writing isn't always the best practice, because in a large XML document you can write wrong tags and it can be difficult to find the cause of an error. So I suggest using XML libraries to create XML files.

Speed may be an issue... handwritten can be a lot faster.

The XML tools in eclipse are really useful too. Just create a new xml schema and document, and you can easily use most of the graphical tools. I do like to point out that a prior understanding of how schemas work will be of use.

Always use a tool of some kind. XML can be very complex. I know that the PHP guys are used to working with hacky little stuff, but it's a huge code smell in the .NET world if someone doesn't use System.Xml for creating XML.

What's the difference between the different XML parsing libraries in PHP5?

The original question is below, but I changed the title because I think it will be easier for others with the same doubt to find. In the end, an XHTML document is an XML document.
It's a beginner question, but I would like to know which you think is the best library for parsing XHTML documents in PHP5?
I have generated the XHTML from HTML files (which were created using Word :S) with Tidy, and now I need to replace some elements from them (like the and element, replace some attributes in tags).
I haven't used XML very much, there seem to be many options for parsing in PHP (SimpleXML, DOM, etc.) and I don't know if all of them can do what I need, and which is the easiest one to use.
Sorry for my English, I'm from Argentina. Thanks!
A bit more information: I have a lot of HTML pages, done in Word 97. I used Tidy for cleaning and turning them into XHTML Strict, so now they are all XML compatible. I want to use an XML parser to find some elements and replace them (the logic by which I do this doesn't matter). For example, I want all of the pages to use the same CSS stylesheet and class attributes, for a unified appearance. They are all static pages which contain legal documents, nothing strange there. Which of the extensions should I use? Is SimpleXML enough? Should I learn DOM in spite of it being more difficult?

You could use SimpleXML, which is included in a default PHP install. This extension offers easy object-oriented access to XML structures.
There's also DOM XML. A "downside" to this extension is that it is a bit harder to use and that it is not included by default.

Just to clear up the confusion here. PHP has a number of XML libraries, because php4 didn't have very good options in that direction. From PHP5, you have the choice between SimpleXml, DOM and the sax-based expat parser. The latter also existed in php4. php4 also had a DOM extension, which is not the same as php5's.
DOM and SimpleXml are alternatives to the same problem domain; they load the document into memory and let you access it as a tree structure. DOM is a rather bulky API, but it's also very consistent and it's implemented in many languages, meaning that you can re-use your knowledge across languages (in JavaScript, for example). SimpleXml may be easier initially.
The SAX parser is a different beast. It treats an xml document as a stream of tags. This is useful if you are dealing with very large documents, since you don't need to hold it all in memory.
For your usage, I would probably use the DOM api.
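A sketch of what that could look like for the pages described in the question follows. The sample document, the shared.css file name and the body-text class are invented; the part worth noting is that XHTML elements live in a namespace, so XPath queries need a registered prefix.

```php
<?php
// Minimal stand-in for a Tidy-cleaned Word export (made up for this sketch).
$xhtml = '<?xml version="1.0" encoding="UTF-8"?>'
       . '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Doc</title>'
       . '<link rel="stylesheet" type="text/css" href="word-export.css"/></head>'
       . '<body><p class="MsoNormal">Text</p></body></html>';

$doc = new DOMDocument();
$doc->loadXML($xhtml);

$xpath = new DOMXPath($doc);
// Without this prefix, //link and //p would match nothing in a namespaced document.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

foreach ($xpath->query('//x:link[@rel="stylesheet"]') as $link) {
    $link->setAttribute('href', 'shared.css');      // hypothetical stylesheet name
}
foreach ($xpath->query('//x:p[@class="MsoNormal"]') as $p) {
    $p->setAttribute('class', 'body-text');         // hypothetical class name
}

$result = $doc->saveXML();
echo $result;
```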

DOM is a standard, language-independent API for hierarchical data such as XML which has been standardized by the W3C. It is a rich API with much functionality. It is object based, in that each node is an object.
DOM is good when you not only want to read or write, but also want to do a lot of manipulation of nodes in an existing document, such as inserting nodes between others, changing the structure, etc.
SimpleXML is a PHP-specific API which is also object-based but is intended to be a lot less verbose than the DOM: simple tasks such as finding the value of a node or finding its child elements take a lot less code. Its API is not as rich as DOM's, but it still includes features such as XPath lookups, and a basic ability to work with multiple-namespace documents. And, importantly, it still preserves all features of your document such as XML CDATA sections and comments, even though it doesn't include functions to manipulate them.
SimpleXML is very good for read-only: if all you want to do is read the XML document and convert it to another form, then it'll save you a lot of code. It's also fairly good when you want to generate a document, or do basic manipulations such as adding or changing child elements or attributes, but it can become complicated (but not impossible) to do a lot of manipulation of existing documents. It's not easy, for example, to add a child element in between two others; addChild only inserts after other elements. SimpleXML also cannot do XSLT transformations. It doesn't have things like 'getElementsByTagName' or 'getElementById', but if you know XPath you can still do that kind of thing with SimpleXML.
The SimpleXMLElement object is somewhat 'magical'. The properties it exposes if you var_dump/print_r/var_export don't correspond to its complete internal representation. It exposes some of its child elements as if they were properties which can be accessed with the -> operator, but still preserves the full document internally, and you can do things like access a child element whose name is not a valid PHP identifier with the ->{'name'} syntax, or read attributes with the [] operator as if it were an associative array.
You don't have to fully commit to one or the other, because PHP implements the functions:
simplexml_import_dom(DOMNode)
dom_import_simplexml(SimpleXMLElement)
This is helpful if you are using SimpleXML and need to work with code that expects a DOM node or vice versa.
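As a small sketch of mixing the two, the example below finds an element with SimpleXML's XPath support and then drops down to DOM for the one step SimpleXML lacks, inserting a new element between two existing ones. The list document is made up.

```php
<?php
$xml = simplexml_load_string('<list><item>first</item><item>third</item></list>');

// XPath lookup with SimpleXML.
$third = $xml->xpath('/list/item[2]')[0];

// Drop down to DOM to insert a new element before an existing sibling.
$thirdNode = dom_import_simplexml($third);
$newNode   = $thirdNode->ownerDocument->createElement('item', 'second');
$thirdNode->parentNode->insertBefore($newNode, $thirdNode);

// Both views share the same underlying document, so SimpleXML sees the change.
$values = [];
foreach ($xml->item as $item) {
    $values[] = (string) $item;
}
echo implode(', ', $values), "\n";
```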
PHP also offers a third XML library:
XML Parser (an implementation of SAX, a language-independent interface, but not referred to by that name in the manual) is a much lower level library, which serves quite a different purpose. It doesn't build objects for you. It basically just makes it easier to write your own XML parser, because it does the job of advancing to the next token, and finding out the type of token, such as what the tag name is and whether it's an opening or closing tag, for you. Then you have to write callbacks that should be run each time a token is encountered. All tasks such as representing the document as objects/arrays in a tree, manipulating the document, etc. will need to be implemented separately, because all you can do with the XML parser is write a low level parser.
The XML Parser functions are still quite helpful if you have specific memory or speed requirements. With it, it is possible to write a parser that can parse a very long XML document without holding all of its contents in memory at once. Also, if you are not interested in all of the data, and don't need or want it to be put into a tree or set of PHP objects, then it can be quicker. For example, if you want to scan through an XHTML document and find all the links, and you don't care about structure.
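The link-scanning case could be sketched roughly as follows. The sample document and the 16-byte chunk size are arbitrary choices for illustration; in practice the chunks would come from fread() on a large file.

```php
<?php
// Small stand-in document (made up); a real one would be streamed from disk.
$xhtml = '<html><body><p><a href="/one">One</a> and <a href="/two">Two</a></p></body></html>';

$links = [];
$parser = xml_parser_create('UTF-8');
// Keep tag and attribute names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    // Called for every opening tag: remember the href of each <a>.
    function ($parser, string $name, array $attrs) use (&$links) {
        if ($name === 'a' && isset($attrs['href'])) {
            $links[] = $attrs['href'];
        }
    },
    // Called for every closing tag: not needed for this task.
    function ($parser, string $name) {
    }
);

// Feed the document in small chunks; no tree is ever built.
foreach (str_split($xhtml, 16) as $chunk) {
    xml_parse($parser, $chunk, false);
}
xml_parse($parser, '', true);   // signal the end of the input

print_r($links);
```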

I prefer SimpleXMLElement as it's pretty easy to use to loop through elements.
Edit: It says no version info available, but it's available in PHP5, at least 5.2.5 but probably earlier.
It's really personal choice though, there's plenty of XML extensions.
Bear in mind many XML parsers will balk if you have invalid markup - XHTML should be XML but not always!

It's been a long time (2 years or more) since I worked with XML parsing in PHP, but I always had good, usable results from the XML_Parser Pear package. Having said that, I have had minimal exposure to PHP5, so I don't really know if there are better, inbuilt alternatives these days.

I did a little bit of XML parsing in PHP5 last year and decided to use a combination of SimpleXML.
DOM is a bit more useful if you want to create a new XML tree or add to an existing one; it's slightly more flexible.

It really depends on what you're trying to accomplish.
For pulling rather large amounts of data, i.e. many records of, say, product information from a store website, I'd probably use Expat, since it's supposedly a bit faster...
Personally, I've never had XML files large enough to create a noticeable performance boost.
At those quantities you might as well be using SQL.
I recommend using SimpleXML.
It's pretty intuitive, easy to use/write.
Also, works great with XPath.
Never really got to use DOM much, but if you're using the XML Parser for something as large as you're describing you might want to use it, since it's a bit more functional than SimpleXML.
You can read about all three at W3Schools:
http://www.w3schools.com/php/php_xml_parser_expat.asp
http://www.w3schools.com/php/php_xml_simplexml.asp
http://www.w3schools.com/php/php_xml_dom.asp
