Parsing of badly formatted HTML in PHP


In my code I convert some styled XLS documents to HTML using OpenOffice.
I then parse the tables using xml_parser_create.
The problem is that OpenOffice creates old-school HTML with unclosed <BR> and <HR> tags; it doesn't emit a doctype and doesn't quote attributes (<TABLE WIDTH=4>).
The PHP parsers I know of don't like this and yield XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast.
Do you know a (hopefully built-in) PHP parser that doesn't care about these kinds of mistakes? Or perhaps a fast way to fix 'broken' HTML?

A solution to "fix" broken HTML could be to use HTMLPurifier (quoting) :
HTML Purifier is a standards-compliant
HTML filter library written in PHP.
HTML Purifier will not only remove
all malicious code (better known as
XSS) with a thoroughly audited,
secure yet permissive whitelist, it
will also make sure your documents are standards compliant
An alternative idea might be to try loading your HTML with DOMDocument::loadHTML (quoting) :
The function parses the HTML contained
in the string source . Unlike loading
XML, HTML does not have to be
well-formed to load.
And if you're trying to load HTML from a file, see DOMDocument::loadHTMLFile.
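As a rough sketch of the loadHTML route for the table case in the question: the sample markup below is made up to resemble OpenOffice output (no doctype, unquoted attribute, unclosed <BR>), and the variable names are only illustrative.

```php
<?php
// Hypothetical sample resembling OpenOffice output; not taken from a real export.
$html = '<TABLE WIDTH=4><TR><TD>a<BR>b</TD><TD>c</TD></TR></TABLE>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The HTML parser lower-cases element names, so look up 'td', not 'TD'.
$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = $td->textContent;
}

print_r($cells);
```

No regex pre-processing is needed here, since the unclosed <BR> and the unquoted WIDTH attribute are accepted by the HTML parser.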

There is SimpleHTML.
For repairing broken HTML, you could use Tidy.
As an alternative, you can use the native XMLReader. Because it acts as a cursor moving forward through the document stream and stopping at each node on the way, it does not need the whole document to be valid up front; it only reports an error once it reaches the malformed part.
See http://www.ibm.com/developerworks/library/x-pullparsingphp.html

Any particular reason you're still using the PHP 4 XML API?
If you can get away with using PHP 5's XML API, there are two possibilities.
First, try the built-in HTML parser. It's really not very good (it tends to choke on poorly formatted HTML), but it might do the trick. Have a look at DOMDocument::loadHTML.
Second option - you could try the HTML parser based on the HTML5 parser specification:
http://code.google.com/p/html5lib/
This tends to work better than the built-in PHP HTML parser. It loads the HTML into a DomDocument object.

A solution is to use DOMDocument.
Example :
$str = "
<html>
<head>
<title>test</title>
</head>
<body>
</div>error.
<p>another error</i>
</body>
</html>
";
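$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about the stray tags out of the output
$doc->loadHTML($str);             // parse the broken markup; libxml repairs it
echo $doc->saveHTML();            // print the repaired document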
Advantage : natively included in PHP, contrary to PHP Tidy.

Related

Ensure valid XHTML from a string in PHP

I'm using the XHTML Transitional doctype for displaying content in a browser. But before the content is displayed, it is passed through an XML parser (DOMDocument) for final touches before being output to the browser.
I use a custom-designed CMS for my website that allows me to make changes to the site. I have a module that allows me to display HTML scripts on my website in a way similar to WordPress widgets.
The problem I am facing right now is that I need to make sure any code provided through this module is valid XHTML, or else the module will need to convert the code to valid XHTML. Currently, if a portion of the input code is not XHTML compliant, my XML parser breaks and throws warnings.
What I am looking for is a solution that encodes the entities present in the URLs and text portions of the input provided via a textarea control. For example, the following string will break the parser with an entity reference error:
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>
Also the following line would cause same error:
<a href="http://www.somesite.com">Books & Cool stuff<a/>
P.S. If I use htmlentities or htmlspecialchars, they also convert the angle brackets of tags, which is not what I want. I just need the URLs and text portions of the string to be escaped/encoded.
Any help would be greatly appreciated.
Thanks and regards,
Waqar Mushtaq

What you'd need to do is generate valid XHTML in the first place. All your attributes must be passed through htmlentities.
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>
should be
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&amp;sumthing"></script>
and
Books & Cool stuff
should be
Books &amp; Cool stuff
It's not easy to always generate valid XHTML. If at all possible I would recommend you find some other way of doing the post processing.
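If the markup cannot be fixed at the source, one possible stopgap for the two examples in the question is to escape only those ampersands that do not already start an entity reference. This is a minimal sketch: the function name escapeBareAmpersands is made up for illustration, and the regex approach is my own suggestion rather than something from the answers here. It also does not know about context, so it will rewrite ampersands inside inline script bodies or CDATA sections too.

```php
<?php
// Hypothetical helper: turn "&" into "&amp;" unless it already begins a
// named (&amp;), decimal (&#38;) or hexadecimal (&#x26;) entity reference.
function escapeBareAmpersands(string $markup): string
{
    $result = preg_replace(
        '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#x[0-9a-fA-F]+);)/',
        '&amp;',
        $markup
    );

    // preg_replace returns null on a PCRE error; fall back to the input then.
    return $result ?? $markup;
}

echo escapeBareAmpersands('<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>'), "\n";
echo escapeBareAmpersands('Books & Cool stuff'), "\n";
```

Tags and angle brackets are left alone, which is the part htmlentities and htmlspecialchars get wrong for this use case.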

As already suggested in a quick comment, you can solve the problem quite comfortably with the PHP tidy extension.
To convert an HTML fragment - even a good tag soup - into something DomDocument or SimpleXML can deal with, you can use something like the following:
$config = array(
'output-xhtml' => 1,
'show-body-only' => 1
);
$fragment = tidy_repair_string($html, $config);
$xhtml = sprintf("<body>%s</body>", $fragment);
Example: Format tag soup HTML as valid XHTML with tidy_repair_string.
Tidy has many options; the two used here are needed for fragments and XHTML compatibility.
The only problem left now is that this XHTML fragment can contain entities that DomDocument or SimpleXML do not understand, for example &nbsp;. This and others are undefined in XML.
As far as DomDocument is concerned (you wrote you use it), it supports loading html instead of xml as well which deals with those entities:
$dom = new DomDocument;
$dom->loadHTML($xhtml);
Example: Loading HTML with DomDocument

HTML Tidy is a computer program and a library whose purpose is to fix invalid HTML and to improve the layout and indent style of the resulting markup.
http://tidy.sourceforge.net/
Examples of what it is able to do:
Fix missing or mismatched end tags and mixed-up tags
Add missing items (some tags, quotes, ...)
Report proprietary HTML extensions
Change the layout of the markup to a predefined style
Transform characters from some encodings into HTML entities

simplexml_load_string() != simplexml_import_dom()?

If I load an HTML page using DOMDocument::loadHTMLFile() and then pass it to simplexml_import_dom(), everything is fine. However, if I use $dom->saveHTML() to get a string representation from the DOMDocument and then use simplexml_load_string(), I get nothing. Actually, if I use a very simple page it will work, but as soon as there is anything more complex, it fails without any errors in the PHP log file.
Can anyone shed light on this?
Is it something to do with HTML not being parsable XML?
I am trying to strip out CR's and newlines from the formatted HTML text before using the contents as they have nothing to do with the content but get inserted into the SimpleXMLElement object, which is rather tedious.

Is it something to do with HTML not being parsable XML?
YES! HTML is a far less strict syntax, so simplexml_load_string will not work with it by itself. SimpleXML expects well-formed XML, while HTML allows things such as omitted end tags. DOMDocument, on the other hand, is designed to read loosely structured HTML, and since DOMDocument can make sense of HTML and SimpleXML can make sense of a DOMDocument (via simplexml_import_dom), you can bridge the proverbial gap there.
<!-- Valid HTML but not valid XML -->
<ul>
<li>foo
<li>bar
</ul>
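To make the bridge concrete, here is a small sketch that feeds the list above through DOMDocument and then hands the result to SimpleXML. The variable names are only illustrative.

```php
<?php
// Valid HTML (end tags for <li> omitted) that simplexml_load_string would reject.
$html = '<ul><li>foo<li>bar</ul>';

libxml_use_internal_errors(true);   // keep any parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);              // the HTML parser closes the <li> elements
libxml_clear_errors();

// Import the already-parsed tree instead of re-parsing a string as XML.
$xml = simplexml_import_dom($dom);

$items = [];
foreach ($xml->body->ul->li as $li) {
    $items[] = trim((string) $li);
}

print_r($items);
```

The round trip through saveHTML() and simplexml_load_string() is what fails, because the serialized string is HTML again, not XML.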

HTML may or may not be valid XML. When you use loadHTMLFile, it doesn't necessarily have to be well-formed XML, because the DOM is an HTML one and different rules apply; but when you pass a string to SimpleXML, it must indeed be well formed.

If I get your question correctly and you simply want no whitespace in your output, then there is no need to use SimpleXML here.
Use: DOMDocument::preserveWhiteSpace
like:
$dom->preserveWhiteSpace = false;
before saveHTML and you're set.

PHP DOMDocument - get html source of BODY

I'm using PHP's DOMDocument to parse and normalize user-submitted HTML using the loadHTML method to parse the content then getting a well-formed result via saveHTML:
$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$well_formed= $dom->saveHTML();
echo($well_formed);
This does a beautiful job of parsing the fragment and adding the appropriate closing tags. The problem is that I'm also getting a bunch of tags I don't want such as <!DOCTYPE>, <html>, <head> and <body>. I understand that every well-formed HTML document needs these tags, but the HTML fragment I'm normalizing is going to be inserted into an existing valid document.

The quick solution to your problem is to use an XPath expression to grab the body.
$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));
A word of warning here. Sometimes loadHTML will throw a warning when it encounters certain poorly formed HTML documents. If you're parsing those kinds of HTML documents, you'll need to find a better HTML parser [self link warning].

In your case, you do not want to work with an HTML document, but with an HTML fragment (a portion of HTML code), which means DOMDocument is not quite what you need.
Instead, I would rather use something like HTMLPurifier (quoting) :
HTML Purifier is a standards-compliant
HTML filter library written in PHP.
HTML Purifier will not only remove all
malicious code (better known as XSS)
with a thoroughly audited, secure yet
permissive whitelist, it will also
make sure your documents are standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
And, if you try your portion of code :
<div><p>Hello World
Using the demo page of HTMLPurifier, you get this clean HTML as an output :
<div><p>Hello World</p></div>
Much better, isn't it ? ;-)
(Note that HTMLPurifier supports a wide range of options, and that taking a look at its documentation might not hurt)

Faced with the same problem, I've created a wrapper around DOMDocument called SmartDOMDocument to overcome this and some other shortcomings (such as encoding problems).
You can find it here: http://beerpla.net/projects/smartdomdocument

This was taken from another post and worked perfectly for my use:
$layout = preg_replace('~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i', '', $layout);

TL;DR: $dom->saveHTML($dom->documentElement->lastChild);
Where $dom->documentElement->lastChild is the body node, but it could be any other available DOMNode of the document.
Actually, the DOMDocument::saveHTML method itself is capable of doing what you want.
It takes a DOMNode-object as the first argument to output a subset of the document.
$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$well_formed= $dom->saveHTML($dom->documentElement->lastChild);
echo($well_formed);
There are several ways of retrieving the body-node. Here are 2:
$bodyNode = $dom->documentElement->lastChild;
$bodyNode = $dom->getElementsByTagName('body')->item(0);
From the PHP Manual
public DOMDocument::saveHTML(?DOMNode $node = null): string|false
Parameters
node
Optional parameter to output a subset of the document.
https://www.php.net/manual/en/domdocument.savehtml.php
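Note that saveHTML($bodyNode) still includes the <body> tag itself. If only the contents are wanted, one option is to serialize each child of the body node in turn. A minimal sketch, where the function name normalizeFragment is made up for illustration:

```php
<?php
// Hypothetical helper: parse a fragment and return only what ends up inside <body>.
function normalizeFragment(string $fragment): string
{
    libxml_use_internal_errors(true);   // suppress warnings about the broken input
    $dom = new DOMDocument();
    $dom->loadHTML($fragment);
    libxml_clear_errors();

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';                      // nothing was parsed into a body
    }

    // Serialize each child node separately so the <body> wrapper is left out.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}

echo normalizeFragment('<div><p>Hello World');
```

This avoids both the regex clean-up and the XPath query, at the cost of one loop.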

Html Parser for PHP like Java

I have been developing Java programs that parse html source code of webpages by using various html parsers like Jericho, NekoHtml etc...
Now I want to develop parsers in PHP. So before starting, I want to know whether there are any HTML parsers available that I can use with PHP to parse HTML code.

Check out DOMDocument.
Example #1 Creating a Document
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();

The builtin class DOM parser does a very good job. There are many other xml parsers, too.

DOM is pretty good for this. It can also deal with invalid markup; however, it will emit warnings for imperfect markup, so I suggest you filter the HTML with HTMLPurifier or some other library before loading it with the DOM.
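If filtering first is not an option, the warnings can also be collected and inspected instead of being printed. A rough sketch using libxml's internal error handling; the sample markup is made up, and which errors are reported (if any) depends on the libxml2 version in use.

```php
<?php
// Hypothetical broken input: a stray </span> and an unclosed <p>.
$html = '<div></span><p>text</div>';

// Switch to internal error collection and remember the previous setting.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Each entry is a LibXMLError with level, code, line and message properties.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $doc->saveHTML();
```

The document is still loaded and repaired; the collected errors are purely informational.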

DOM manipulation in PHP

I am looking for good methods of manipulating HTML in PHP. For example, the problem I currently have is dealing with malformed HTML.
I am getting input that looks something like this:
<div>This is some <b>text
As you noticed, the HTML is missing closing tags. I could use regex or an XML Parser to solve this problem. However, it is likely that I will have to do other DOM manipulation in the future. I wonder if there are any good PHP libraries that handle DOM manipulation similar to how Javascript deals with DOM manipulation.

PHP has a PECL extension that gives you access to the features of HTML Tidy. Tidy is a pretty powerful library that should be able to take code like that and close tags in an intelligent manner.
I use it to clean up malformed XML and HTML sent to me by a classified ad system prior to import.

I've found PHP Simple HTML DOM to be the most useful and straight forward library yet. Better than PECL I would say.
I've written an article on how to use it to scrape myspace artist tour dates (just an example.) Here's a link to the php simple html dom parser.

The DOM library, which is now built in, can solve this problem easily. The loadHTML method will accept malformed markup while the load method will not.
$d = new DOMDocument;
$d->loadHTML('<div>This is some <b>text');
echo $d->saveHTML();
The output will be:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div>This is some <b>text</b></div>
</body>
</html>
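Since the question also asks about JavaScript-style DOM manipulation, here is a short sketch of what that looks like on the repaired tree. The class name and the added element are arbitrary examples.

```php
<?php
libxml_use_internal_errors(true);   // the input is deliberately malformed
$d = new DOMDocument();
$d->loadHTML('<div>This is some <b>text');
libxml_clear_errors();

// Look up elements much like document.getElementsByTagName in JavaScript.
$div = $d->getElementsByTagName('div')->item(0);
$div->setAttribute('class', 'note');

// Create a new element with text content and append it to the div.
$em = $d->createElement('em', 'more');
$div->appendChild($em);

// Remove the <b> element (and its text) from the tree.
$b = $d->getElementsByTagName('b')->item(0);
$b->parentNode->removeChild($b);

// Serialize just the div rather than the whole document.
$result = $d->saveHTML($div);
echo $result, "\n";
```

The method names mirror the W3C DOM API, so most of what works on document in the browser has a direct counterpart here.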

For manipulating the DOM, I think that what you're looking for is this. I've used it to parse HTML documents from the web and it worked fine for me.
