Ensure valid XHTML from a string in PHP

I'm using the XHTML Transitional doctype for displaying content in a browser. But before the content is displayed, it is passed through an XML parser (DOMDocument) for final touches before it is output to the browser.
I use a custom designed CMS for my website, that allows me to make changes to the site. I have a module that allows me to display HTML scripts on my website in a way similar to WordPress widgets.
The problem I am facing right now is that any code provided through this module must be valid XHTML, or else the module needs to convert it to valid XHTML. Currently, if a portion of the input code is not XHTML compliant, my XML parser breaks and throws warnings.
What I am looking for is a solution that encodes the entities present in the URLs and text portions of the input provided via TextArea control. For example the following string will break the parser giving entity reference error:
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>
Also the following line would cause same error:
<a href="http://www.somesite.com">Books & Cool stuff</a>
P.S. If I use htmlentities or htmlspecialchars, they also convert the angle brackets of tags, which is not what I want. I just need the URLs and text portions of the string to be escaped/encoded.
Any help would be greatly appreciated.
Thanks and regards,
Waqar Mushtaq

What you'd need to do is generate valid XHTML in the first place. All your attribute values must be passed through htmlentities.
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>
should be
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&amp;sumthing"></script>
and
Books & Cool stuff
should be
Books &amp; Cool stuff
It's not easy to always generate valid XHTML. If at all possible I would recommend you find some other way of doing the post processing.
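If the only thing breaking the parser is unescaped ampersands, a regular expression can encode just those and leave the tags alone. The following is a minimal sketch, not a full XHTML fixer; the function name escapeBareAmpersands is made up for illustration. It assumes that anything shaped like &name; or &#123; is an intentional entity, and it will also touch ampersands inside inline script bodies, which may not be what you want.

```php
<?php
// Hypothetical helper: encode every "&" that does not already start an
// entity or character reference (&name;, &#123; or &#x1F;).
function escapeBareAmpersands(string $markup): string
{
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/';
    return preg_replace($pattern, '&amp;', $markup);
}

$link = '<a href="http://www.somesite.com">Books & Cool stuff</a>';
echo escapeBareAmpersands($link), "\n";
```

Named entities such as &nbsp; are left as they are, so an XML parser may still reject them unless they are declared or converted separately.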

As already suggested in a quick comment, you can solve the problem quite comfortably with the PHP tidy extension.
To convert an HTML fragment, even real tag soup, into something DomDocument or SimpleXML can deal with, you can use something like the following:
$config = array(
'output-xhtml' => 1,
'show-body-only' => 1
);
$fragment = tidy_repair_string($html, $config);
$xhtml = sprintf("<body>%s</body>", $fragment);
Example: Format tag-soup HTML as valid XHTML with tidy_repair_string.
Tidy has many options; the two used here are needed for fragments and XHTML compatibility.
The only problem left now is that this XHTML fragment can contain entities that DomDocument or SimpleXML do not understand, for example &nbsp;. This and others are undefined in XML.
As far as DomDocument is concerned (you wrote you use it), it supports loading html instead of xml as well which deals with those entities:
$dom = new DomDocument;
$dom->loadHTML($xhtml);
Example: Loading HTML with DomDocument
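If the tidy extension is not available, DomDocument alone can do a similar round trip: parse the fragment leniently as HTML, then serialize the body's children as XML. This is only a sketch under a few assumptions: the function name htmlFragmentToXhtml is invented here, the input is assumed to be plain ASCII (loadHTML needs extra charset handling otherwise), and LIBXML_NOEMPTYTAG is used so an empty script element is not collapsed to a self-closing tag, at the cost of void elements like br being written with an explicit end tag.

```php
<?php
// Hypothetical helper: turn an HTML fragment into well-formed XML markup
// by letting the lenient HTML parser repair it first.
function htmlFragmentToXhtml(string $fragment): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML('<body>' . $fragment . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    $xhtml = '';
    foreach ($body->childNodes as $child) {
        // Serialize each top-level node with XML escaping rules.
        $xhtml .= $doc->saveXML($child, LIBXML_NOEMPTYTAG);
    }
    return $xhtml;
}

echo htmlFragmentToXhtml('<a href="http://www.somesite.com">Books & Cool stuff</a>'), "\n";
```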

HTML Tidy is a computer program and a library whose purpose is to fix invalid HTML and to improve the layout and indent style of the resulting markup.
http://tidy.sourceforge.net/
Examples of bad HTML it is able to fix:
Missing or mismatched end tags, mixed up tags
Adding missing items (some tags, quotes, ...)
Reporting proprietary HTML extensions
Change layout of markup to predefined style
Transform characters from some encodings into HTML entities

Related

PHP return XML string with values added to attributes missing values

I have to parse HTML and "HTML" from emails. I've already managed to create a function that cleans most of the errors such as improper nesting of elements.
I'm trying to determine how best to tackle the issue of HTML attributes that are missing values. We must parse everything ultimately as XML so well-formed HTML is a must as well.
The cleaning function starts off simple enough:
$xml = explode('<', $xml);
We quickly determine opening and closing tags of elements.
However once we get to attributes things get really messy really quickly:
Missing values.
People using single quotes instead of double quotes.
Attribute values may contain single quotes.
Here is an example of an HTML string we have to parse (a p element):
$s = 'p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text';
We do not care what those attributes are; our goal is simply to fix the XML so that it is well-formed as demonstrated by the following string:
$s = 'p obnoxious="true" nonprofessional="true" style="wrong: lulz-immature" dunno="true">Some paragraph text';
We're not interested in attribute="attribute" as that is just extra work (most email is frivolous) so we're simply interested in appending ="true" for each attribute missing a value just to prevent the XML parser on client browsers from failing over the trivialities of someone somewhere else not doing their job.
As I mentioned earlier we only need to fix the attributes which are missing values and we need to return a string. At this point all other issues of malformed XML have been addressed. I'm not sure where I should start as the topic is such a mess. So...
We're open to sending the entire XML string as a whole to be parsed and returned back as a string by some built-in library. If you take this option, presume that the XML is well-formed with a proper XML declaration (<?xml version="1.0" encoding="UTF-8"?>).
We're open to manually creating a function to address whatever we encounter though we're not interested in building a validator as much of the "HTML" we receive screams 1997.
We are working with the XML as a single string or an array (your pick); we are explicitly not dealing with files.
How do we with reasonable effort ensure that an XML string (in part or whole) is returned as a string with values for all attributes?
The DOM extension may solve your problem:
$doc = new DOMDocument('1.0');
$doc->loadHTML('<p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text');
echo $doc->saveXML();
The above code will result in the following output:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p obnoxious="" nonprofessional="" style="wrong: lulz-immature" dunno="">Some paragraph text</p></body></html>
You may replace every ="" with ="true" if you want, but the output is already well-formed XML.
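If you do want ="true" rather than ="", it may be safer to set the values through the DOM than to run a string replace over the serialized output. A rough sketch, assuming it is acceptable that attributes which were explicitly empty (such as alt="") get the same treatment; fillEmptyAttributes is a made-up name:

```php
<?php
// Hypothetical helper: give every empty attribute a default value,
// then return the repaired markup as an XML string.
function fillEmptyAttributes(string $html, string $default = 'true'): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument('1.0');
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select every attribute node whose value is the empty string.
    foreach ($xpath->query('//@*[. = ""]') as $attr) {
        $attr->ownerElement->setAttribute($attr->nodeName, $default);
    }
    // Serialize from the root element, so no XML declaration or doctype is emitted.
    return $doc->saveXML($doc->documentElement);
}

echo fillEmptyAttributes("<p obnoxious nonprofessional style='wrong: lulz-immature' dunno>Some paragraph text"), "\n";
```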

PHP Tidy removes tags

I am making use of the following HTML tags and when I pass it through tidy and view the HTML output those tags have been removed. I have had a look at the list of config options but I can't find one that prevents this from happening.
Tidy removes: <unsubscribe> and <webversion>.
How can I get it to keep HTML tags like these?
PHP Tidy is aimed at correcting HTML, and those tags aren't valid HTML. With the right Tidy configuration you might be able to add them.
If I guessed correctly, those should be block-level elements; Tidy's configuration options let you declare new ones.
Those aren't valid HTML tags, so Tidy will remove them. You might have to massage the text before/after to 'hide' the tags, by changing them to [unsubscribe] and [webversion], possibly, then change back to the <> versions afterwards. Another option would be to process the file as XML, which allows arbitrary tags of that sort. However, you'd have to be producing valid XHTML to begin with, or tidy could nuke other parts of your document.
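The "hide, then restore" idea could look roughly like this. The two helper names are invented for the example, and the sketch only handles tags without attributes; you would call Tidy on the string between the two steps.

```php
<?php
// Hypothetical helpers: swap <tag> / </tag> for [tag] / [/tag] and back,
// so a cleaner that drops unknown elements never sees them.
function hideCustomTags(string $html, array $tags): string
{
    foreach ($tags as $tag) {
        $pattern = '/<(\/?)' . preg_quote($tag, '/') . '>/i';
        $html = preg_replace($pattern, '[${1}' . $tag . ']', $html);
    }
    return $html;
}

function restoreCustomTags(string $html, array $tags): string
{
    foreach ($tags as $tag) {
        $pattern = '/\[(\/?)' . preg_quote($tag, '/') . '\]/i';
        $html = preg_replace($pattern, '<${1}' . $tag . '>', $html);
    }
    return $html;
}

$tags = ['unsubscribe', 'webversion'];
$hidden = hideCustomTags('<p><unsubscribe>Leave the list</unsubscribe></p>', $tags);
// ... run $hidden through Tidy here ...
echo restoreCustomTags($hidden, $tags), "\n";
```

This assumes the bracketed form never occurs naturally in your content; if it can, pick a less likely placeholder.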

Parsing of badly formatted HTML in PHP

In my code I convert some styled xls document to html using openoffice.
I then parse the tables using xml_parser_create.
The problem is that OpenOffice creates old-school HTML with unclosed <BR> and <HR> tags, doesn't create doctypes, and doesn't quote attributes (<TABLE WIDTH=4>).
The PHP parsers I know of don't like this and yield XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast.
Do you know a (hopefully included) php-parser, that doesn't care about these kinds of mistakes? Or perhaps a fast way to fix a 'broken' html?
A solution to "fix" broken HTML could be to use HTMLPurifier (quoting) :
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant
An alternative idea might be to try loading your HTML with DOMDocument::loadHTML (quoting) :
The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.
And if you're trying to load HTML from a file, see DOMDocument::loadHTMLFile.
There is SimpleHTML
For repairing broken HTML, you could use Tidy.
As an alternative you can use the native XMLReader. Because it acts as a cursor moving forward through the document stream and stopping at each node on the way, it does not reject the whole document up front; it only reports an error when it actually reaches a malformed part.
See http://www.ibm.com/developerworks/library/x-pullparsingphp.html
Any particular reason you're still using the PHP 4 XML API?
If you can get away with using PHP 5's XML API, there are two possibilities.
First, try the built-in HTML parser. It's really not very good (it tends to choke on poorly formatted HTML), but it might do the trick. Have a look at DomDocument::LoadHTML.
Second option - you could try the HTML parser based on the HTML5 parser specification:
http://code.google.com/p/html5lib/
This tends to work better than the built-in PHP HTML parser. It loads the HTML into a DomDocument object.
A solution is to use DOMDocument.
Example :
$str = "
<html>
<head>
<title>test</title>
</head>
<body>
</div>error.
<p>another error</i>
</body>
</html>
";
$doc = new DOMDocument();
@$doc->loadHTML($str);
echo $doc->saveHTML();
Advantage : natively included in PHP, contrary to PHP Tidy.

DOM manipulation in PHP

I am looking for good methods of manipulating HTML in PHP. For example, the problem I currently have is dealing with malformed HTML.
I am getting input that looks something like this:
<div>This is some <b>text
As you noticed, the HTML is missing closing tags. I could use regex or an XML Parser to solve this problem. However, it is likely that I will have to do other DOM manipulation in the future. I wonder if there are any good PHP libraries that handle DOM manipulation similar to how Javascript deals with DOM manipulation.
PHP has a PECL extension that gives you access to the features of HTML Tidy. Tidy is a pretty powerful library that should be able to take code like that and close tags in an intelligent manner.
I use it to clean up malformed XML and HTML sent to me by a classified ad system prior to import.
I've found PHP Simple HTML DOM to be the most useful and straightforward library yet. Better than PECL, I would say.
I've written an article on how to use it to scrape myspace artist tour dates (just an example.) Here's a link to the php simple html dom parser.
The DOM library which is now built-in can solve this problem easily. The loadHTML method will accept malformed XML while the load method will not.
$d = new DOMDocument;
$d->loadHTML('<div>This is some <b>text');
echo $d->saveHTML();
The output will be:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div>This is some <b>text</b></div>
</body>
</html>
For manipulating the DOM I think that what you're looking for is this. I've used it to parse HTML documents from the web and it worked fine for me.
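For the "similar to JavaScript" part of the question: the built-in DOM extension exposes much the same methods as the browser DOM (getElementsByTagName, setAttribute, createElement, appendChild). A small sketch using the question's input; the class name and the added paragraph are made up for illustration:

```php
<?php
libxml_use_internal_errors(true); // the input is deliberately malformed
$doc = new DOMDocument();
$doc->loadHTML('<div>This is some <b>text');
libxml_clear_errors();

// Find the first <div>, much like document.getElementsByTagName('div')[0].
$div = $doc->getElementsByTagName('div')->item(0);
$div->setAttribute('class', 'note');

// Create a new element and append it, as you would in the browser.
$paragraph = $doc->createElement('p', 'Added from PHP');
$div->appendChild($paragraph);

// Serialize only the <div> subtree rather than the whole document.
$result = $doc->saveHTML($div);
echo $result, "\n";
```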

Error Tolerant HTML/XML/SGML parsing in PHP

I have a bunch of legacy documents that are HTML-like. As in, they look like HTML, but have additional made up tags that aren't a part of HTML
<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>
I need to parse these files. PHP is the only tool available. The documents don't come close to being well-formed XML.
My original thought was to use the loadHTML methods on PHP's DOMDocument. However, these methods choke on the made-up HTML tags, and will refuse to parse the string/file.
$oDom = new DomDocument();
$oDom->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
//gives us
DOMDocument::loadHTML() [function.loadHTML]: Tag pseud-template invalid in Entity, line: 1 occured in ....
The only solution I've been able to come up with is to pre-process the files with string replacement functions that will remove the invalid tags and replace them with a valid HTML tag (maybe a span with an id of the tag name).
Is there a more elegant solution? A way to let DOMDocument know about additional tags to consider as valid? Is there a different, robust HTML parsing class/object out there for PHP?
(if it's not obvious, I don't consider regular expressions a valid solution here)
Update: The information in the fake tags is part of the goal here, so something like Tidy isn't an option. Also, I'm after something that does some level, if not all, of the well-formedness cleanup for me, which is why I was looking at DomDocument's loadHTML method in the first place.
You can suppress warnings with libxml_use_internal_errors, while loading the document. Eg.:
libxml_use_internal_errors(true);
$doc = new DomDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false);
If, for some reason, you need access to the warnings, use libxml_get_errors
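To show both halves together, here is a sketch that loads the document quietly and still keeps the parser's complaints around for logging. The function name loadLenientHtml is invented for this example; which messages you get back, if any, depends on the libxml version PHP is built against.

```php
<?php
// Hypothetical helper: parse HTML without emitting warnings, returning
// the document plus whatever messages libxml collected along the way.
function loadLenientHtml(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();                  // do not let the buffer grow
    libxml_use_internal_errors($previous);  // restore the caller's setting

    return [$doc, $messages];
}

[$doc, $messages] = loadLenientHtml(
    '<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>'
);
$fake = $doc->getElementsByTagName('pseud-template')->item(0);
echo $fake->textContent, "\n";
```

The made-up element is still in the tree after parsing, so its content stays reachable through the normal DOM methods.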
I wonder if passing the "bad" HTML through HTML Tidy might help as a first pass? Might be worth a look, if you can get the document to be well formed, maybe you could load it as a regular XML file with DomDocument.
@Twan
You don't need a DTD for DOMDocument to parse custom XML. Just use DOMDocument->load(), and as long as the XML is well-formed, it can read it.
Once you get the files to be well-formed, that's when you can start looking at XML parsers; before that you're S.O.L. Like Alejo said, you could look at HTML Tidy, but it looks like that's specific to HTML, and I don't know how it would go with your custom elements.
I don't consider regular expressions a valid solution here
Until you've got well-formedness, that might be your only option. Once you get the documents to that stage, then you're in the clear with the DOM functions.
Take a look at the Parser in the PHP Fit port. The code is clean and was originally designed for loading the dirty HTML saved by Word. It's configured to pull tables out, but can easily be adapted.
You can see the source here:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/Parser.phps
The unit test will show you how to use it:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/test/parser.phps
My quick and dirty solution to this problem was to run a loop that matches my list of custom tags with a regular expression. The regexp doesn't catch tags that have another inner custom tag inside them.
When there is a match, a function to process that tag is called and returns the "processed HTML". If that custom tag was inside another custom tag, then the parent becomes childless because actual HTML was inserted in place of the child, and it will be matched by the regexp and processed at the next iteration of the loop.
The loop ends when there are no childless custom tags to be matched. Overall it's iterative (a while loop) and not recursive.
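The loop described above might be sketched like this. It is an illustration of the idea, not the original code: the function name, the handler map and the tag names are all invented, tags are assumed to carry no attributes, and a handler that itself emits custom tags could keep the loop running forever.

```php
<?php
// Hypothetical sketch: repeatedly expand the innermost custom tags
// until no childless custom tag is left.
function expandCustomTags(string $html, array $handlers): string
{
    if ($handlers === []) {
        return $html;
    }
    $names = implode('|', array_map(
        fn($name) => preg_quote($name, '~'),
        array_keys($handlers)
    ));
    // Match a custom tag whose content contains no opening custom tag.
    $pattern = '~<(' . $names . ')>((?:(?!<(?:' . $names . ')>).)*?)</\1>~s';

    do {
        $html = preg_replace_callback(
            $pattern,
            fn($m) => $handlers[$m[1]]($m[2]), // tag name => processed HTML
            $html,
            -1,
            $count
        );
    } while ($count > 0);

    return $html;
}

$handlers = [
    'inner' => fn($content) => '<em>' . $content . '</em>',
    'outer' => fn($content) => '<div>' . $content . '</div>',
];
echo expandCustomTags('<outer>a <inner>b</inner> c</outer>', $handlers), "\n";
```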
@Alan Storm
Your comment on my other answer got me to thinking:
When you load an HTML file with DOMDocument, it appears to do some level of cleanup re: well-formedness, BUT requires all your tags to be legit HTML tags. I'm looking for something that does the former, but not the latter. (Alan Storm)
Run a regex (sorry!) over the tags, and when it finds one which isn't a valid HTML element, replace it with a valid element that you know doesn't exist in any of the documents (blink comes to mind...), and give it an attribute value with the name of the illegal element, so that you can switch it back afterwards. eg:
$code = str_replace("<pseudo-tag>", "<blink rel=\"pseudo-tag\">", $code);
// and then back again...
$code = preg_replace('/<blink rel="(.*?)">/', '<$1>', $code);
That code is only a rough sketch (closing tags are not handled, for one), but you get the general idea?
