simplexml_load_string() != simplexml_import_dom()? - php

If I load an HTML page using DOMDocument::loadHTMLFile() and then pass it to simplexml_import_dom(), everything is fine. However, if I use $dom->saveHTML() to get a string representation from the DOMDocument and then use simplexml_load_string(), I get nothing. Actually, if I use a very simple page it will work, but as soon as there is anything more complex, it fails without any errors in the PHP log file.
Can anyone shed light on this?
Is it something to do with HTML not being parsable XML?
I am trying to strip out CR's and newlines from the formatted HTML text before using the contents as they have nothing to do with the content but get inserted into the SimpleXMLElement object, which is rather tedious.

Is it something to do with HTML not being parsable XML?
YES! HTML has a far less strict syntax than XML, so simplexml_load_string will not work with it by itself. This is because simplexml is simple and HTML is convoluted. On the other hand, DOMDocument is designed to be able to read the convoluted HTML structure. Since DOMDocument can make sense of HTML and simplexml can make sense of a DOMDocument (via simplexml_import_dom), you can bridge the proverbial gap there.
<!-- Valid HTML but not valid XML -->
<ul>
<li>foo
<li>bar
</ul>

HTML may or may not be valid XML. When you use loadHTMLFile, the input doesn't have to be well-formed XML, because the DOM is an HTML one and different rules apply. When you pass a string to SimpleXML, however, it must be well-formed.
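A minimal sketch of the bridge described above, assuming the input is the non-XML list from the earlier example. The variable names are illustrative only. It parses with the tolerant HTML loader and hands the resulting DOM to SimpleXML; if a string really is needed, it serializes with saveXML() rather than saveHTML(), so that the string is well-formed XML.

```php
<?php
// Valid HTML, but not well-formed XML (unclosed <li> elements).
$html = '<ul><li>foo<li>bar</ul>';

// Collect parser warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);          // tolerant HTML parser
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Bridge 1: import the DOM tree directly into SimpleXML.
$xml = simplexml_import_dom($dom);

$items = [];
foreach ($xml->xpath('//li') as $li) {
    $items[] = trim((string) $li);
}

// Bridge 2: go through a string, serialized as XML rather than HTML.
$viaString = simplexml_load_string($dom->saveXML($dom->documentElement));
$count = count($viaString->xpath('//li'));
```

The difference between the two serializers is the point here: saveHTML() may legally omit closing tags or emit void elements such as <br>, which an XML parser rejects, whereas saveXML() always closes every element.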

If I understand your question correctly and you simply want no whitespace in your output, then there is no need to use simplexml here.
Use: DOMDocument::preserveWhiteSpace
like:
$dom->preserveWhiteSpace = false;
Set it before loading the document (the property is applied at parse time, not when calling saveHTML) and you're set.
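If whitespace-only text nodes still show up after an HTML load, one alternative (a different technique from the property above, shown here only as a sketch) is to remove them explicitly with XPath. The sample markup is made up for illustration, and note that this would also strip significant whitespace inside elements such as <pre>.

```php
<?php
// Made-up sample input with formatting newlines and indentation.
$html = "<div>\n  <p>one</p>\n  <p>two</p>\n</div>";

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Select every text node that consists of whitespace only.
// The XPath result is a static list, so removing nodes while iterating is safe.
foreach ($xpath->query('//text()[normalize-space(.) = ""]') as $node) {
    $node->parentNode->removeChild($node);
}

$div = $dom->getElementsByTagName('div')->item(0);
$childCount = $div->childNodes->length;   // only the two <p> elements remain
```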

Related

Cannot parse into <code> tag - PHP - simple html dom

I am trying to extract the content of a <div> nested inside a <code> tag with PHP Simple HTML DOM Parser but I am always getting the error Trying to get property of non-object in... as if the parser was finding nothing inside my <div>
The code I'm using is
include_once('simplehtmldom_1_5/simple_html_dom.php');
// Create a DOM object
$html = new simple_html_dom();
// Load HTML
$html->load('<code><div>hello</div></code>');
// Extract div content
echo $html->find('div',0)->innertext;
But if, instead of using <code><div>hello</div></code> as my sample code, I use <span><div>hello</div></span>, it works... it seems like I'm only having problems looking inside the code tag.
What's wrong with what I'm doing?
Hope you guys can point me in the right direction, thank you very much for your support!
simplehtmldom among others strips out pre formatted tags.
If you want code tag to be recognized delete or comment out line 1076 in *simple_html_dom.php*
According to the source code for Simple HTML DOM it automagically removes code tags when it loads the HTML into the parser.
If you need the functionality you'll need to remove the reference to remove_noise() in the load() function within simple_html_dom.php.
This should produce the results you expect, but obviously may well introduce other issues, depending on the author's reasoning for removing the tags in the first place.

Extract Table as Text Using Php

I'm looking for a simple method to get the first table of a webpage and put the whole thing into a string, that is all.
So I need to know how to use preg_match or similar to get the first instance of a table from a DOM object and get that whole thing into a string:
I have a class to download webpages as DOM but I cannot convert the html to a string as I need it..
$nodes = $this->bot->QuerySelector($this->download['DOM'], "//table[1][@class='tyebfghjftsdf-ccfkk']");
Please help
I would use Tidy to convert the page to valid XHTML, then read it using XMLReader (not building a DOM), start echoing data when the opening <table> tag is found, and stop at the closing </table> tag. No regular expressions involved.
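If building a DOM is acceptable, a shorter route is possible. This is a sketch using DOMDocument and DOMXPath rather than the Tidy and XMLReader approach above; the sample page is invented for illustration. Passing a node to saveHTML() serializes just that node.

```php
<?php
// Invented sample page containing two tables.
$html = '<html><body><p>intro</p>'
      . '<table class="a"><tr><td>1</td></tr></table>'
      . '<table><tr><td>2</td></tr></table>'
      . '</body></html>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// (//table)[1] is the first table in document order;
// //table[1] would instead match every table that is the first table child of its parent.
$table = $xpath->query('(//table)[1]')->item(0);

// Serialize only that node (and its children) to a string.
$tableHtml = $table !== null ? $dom->saveHTML($table) : '';
```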

PHP DOMDocument to parse xml structure with non-alphabetical characters in tags?

The XML I am trying to parse has a structure similar to this, where there are colons in the tag: <person:type>mean</person:type>
Can PHP DOMDocument parse such a structure? The usual getElementsByTagName does not seem to work.
Sort of, you really want getElementsByTagNameNS. At the beginning of the document, you might notice something like xmlns:person="http://foo.bar.com". That URL would be the first parameter of the method, 'type' would be the second.
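A small sketch of that call, assuming the namespace URI from the example above (http://foo.bar.com) and a made-up wrapper element named people:

```php
<?php
// The root element name "people" is made up; the namespace URI is the example one from the answer.
$xmlString = '<people xmlns:person="http://foo.bar.com">'
           . '<person:type>mean</person:type>'
           . '</people>';

$doc = new DOMDocument();
$doc->loadXML($xmlString);

// First argument: the namespace URI bound to the "person" prefix.
// Second argument: the local name, i.e. the part after the colon.
$nodes = $doc->getElementsByTagNameNS('http://foo.bar.com', 'type');

$value = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
```

The prefix itself (person) is never passed; only the URI it maps to matters, so the lookup keeps working if a document uses a different prefix for the same namespace.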

Parsing of badly formatted HTML in PHP

In my code I convert some styled xls document to html using openoffice.
I then parse the tables using xml_parser_create.
The problem is that openoffice creates old-school HTML with unclosed <BR> and <HR> tags; it doesn't create doctypes and doesn't quote attributes (<TABLE WIDTH=4>).
The PHP parsers I know of don't like this and yield XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast.
Do you know a (hopefully included) php-parser, that doesn't care about these kinds of mistakes? Or perhaps a fast way to fix a 'broken' html?
A solution to "fix" broken HTML could be to use HTMLPurifier (quoting) :
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant
An alternative idea might be to try loading your HTML with DOMDocument::loadHTML (quoting) :
The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.
And if you're trying to load HTML from a file, see DOMDocument::loadHTMLFile.
There is SimpleHTML
For repairing broken HTML, you could use Tidy.
As an alternative you can use the native XMLReader. Because it acts as a cursor going forward on the document stream and stopping at each node on the way, it will not break on invalid XML documents.
See http://www.ibm.com/developerworks/library/x-pullparsingphp.html
Any particular reason you're still using the PHP 4 XML API?
If you can get away with using PHP 5's XML API, there are two possibilities.
First, try the built-in HTML parser. It's really not very good (it tends to choke on poorly formatted HTML), but it might do the trick. Have a look at DOMDocument::loadHTML.
Second option - you could try the HTML parser based on the HTML5 parser specification:
http://code.google.com/p/html5lib/
This tends to work better than the built-in PHP HTML parser. It loads the HTML into a DomDocument object.
A solution is to use DOMDocument.
Example :
$str = "
<html>
<head>
<title>test</title>
</head>
<body>
</div>error.
<p>another error</i>
</body>
</html>
";
$doc = new DOMDocument();
@$doc->loadHTML($str);
echo $doc->saveHTML();
Advantage : natively included in PHP, contrary to PHP Tidy.

Error Tolerant HTML/XML/SGML parsing in PHP

I have a bunch of legacy documents that are HTML-like. As in, they look like HTML, but have additional made-up tags that aren't a part of HTML.
<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>
I need to parse these files. PHP is the only tool available. The documents don't come close to being well-formed XML.
My original thought was to use the loadHTML methods on PHP's DOMDocument. However, these methods choke on the made-up HTML tags, and will refuse to parse the string/file.
$oDom = new DomDocument();
$oDom->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
//gives us
DOMDocument::loadHTML() [function.loadHTML]: Tag pseud-template invalid in Entity, line: 1 occured in ....
The only solution I've been able to come up with is to pre-process the files with string replacement functions that will remove the invalid tags and replace them with a valid HTML tag (maybe a span with an id of the tag name).
Is there a more elegant solution? A way to let DOMDocument know about additional tags to consider as valid? Is there a different, robust HTML parsing class/object out there for PHP?
(if it's not obvious, I don't consider regular expressions a valid solution here)
Update: The information in the fake tags is part of the goal here, so something like Tidy isn't an option. Also, I'm after something that does some level, if not all, of the well-formedness cleanup for me, which is why I was looking at DOMDocument's loadHTML method in the first place.
You can suppress warnings with libxml_use_internal_errors, while loading the document. Eg.:
libxml_use_internal_errors(true);
$doc = new DomDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false);
If, for some reason, you need access to the warnings, use libxml_get_errors.
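Put together, the suppress-then-inspect pattern might look like the sketch below. Whether the parser reports the unknown tag at all depends on the libxml2 version in use, so the code only collects whatever messages exist; the $messages array is an illustrative name.

```php
<?php
// Collect libxml warnings internally; remember the previous setting to restore it later.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>');

// Each entry is a LibXMLError object with level, code, line and message properties.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

libxml_clear_errors();
libxml_use_internal_errors($previous);

// The made-up element is still present in the resulting tree.
$fake = $doc->getElementsByTagName('pseud-template');
$text = $fake->length > 0 ? $fake->item(0)->textContent : null;
```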
I wonder if passing the "bad" HTML through HTML Tidy might help as a first pass? Might be worth a look, if you can get the document to be well formed, maybe you could load it as a regular XML file with DomDocument.
@Twan
You don't need a DTD for DOMDocument to parse custom XML. Just use DOMDocument->load(), and as long as the XML is well-formed, it can read it.
Once you get the files to be well-formed, that's when you can start looking at XML parsers; before that you're S.O.L. Like Alejo said, you could look at HTML Tidy, but it looks like that's specific to HTML, and I don't know how it would go with your custom elements.
I don't consider regular expressions a valid solution here
Until you've got well-formedness, that might be your only option. Once you get the documents to that stage, then you're in the clear with the DOM functions.
Take a look at the Parser in the PHP Fit port. The code is clean and was originally designed for loading the dirty HTML saved by Word. It's configured to pull tables out, but can easily be adapted.
You can see the source here:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/Parser.phps
The unit test will show you how to use it:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/test/parser.phps
My quick and dirty solution to this problem was to run a loop that matches my list of custom tags with a regular expression. The regexp doesn't catch tags that have another inner custom tag inside them.
When there is a match, a function to process that tag is called and returns the "processed HTML". If that custom tag was inside another custom tag, then the parent becomes childless by the fact that actual HTML was inserted in place of the child, and it will be matched by the regexp and processed at the next iteration of the loop.
The loop ends when there are no childless custom tags to be matched. Overall it's iterative (a while loop) and not recursive.
@Alan Storm
Your comment on my other answer got me to thinking:
When you load an HTML file with DOMDocument, it appears to do some level of cleanup re: well-formedness, BUT requires all your tags to be legit HTML tags. I'm looking for something that does the former, but not the latter. (Alan Storm)
Run a regex (sorry!) over the tags, and when it finds one which isn't a valid HTML element, replace it with a valid element that you know doesn't exist in any of the documents (blink comes to mind...), and give it an attribute value with the name of the illegal element, so that you can switch it back afterwards. eg:
$code = str_replace("<pseudo-tag>", "<blink rel=\"pseudo-tag\">", $code);
// and then back again...
$code = preg_replace('/<blink rel="(.*?)">/', '<\1>', $code);
That code is only a rough sketch (it ignores closing tags, for one), but you get the general idea?
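A variation on the placeholder idea, sketched under a few assumptions: the list of made-up tag names is known in advance, a span with a data-pseudo attribute (a name chosen here purely for illustration) is used as the stand-in instead of blink, and the information is read back through XPath rather than by converting the markup back.

```php
<?php
// Hypothetical list of made-up tag names that occur in the legacy documents.
$customTags = ['pseud-template'];

$source = '<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>';

foreach ($customTags as $tag) {
    $quoted = preg_quote($tag, '/');
    // Opening tag: keep any attributes and record the original tag name.
    $source = preg_replace(
        '/<' . $quoted . '(\s[^>]*)?>/i',
        '<span data-pseudo="' . $tag . '"$1>',
        $source
    );
    // Closing tag.
    $source = preg_replace('/<\/' . $quoted . '\s*>/i', '</span>', $source);
}

$doc = new DOMDocument();
$doc->loadHTML($source);

// Find the stand-in elements by the recorded tag name.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//span[@data-pseudo="pseud-template"]');
$text  = $nodes->length > 0 ? $nodes->item(0)->textContent : null;
```

Because every stand-in carries the original name in an attribute, nested custom tags do not need a stack to be told apart later, which is the weak point of switching the markup back with a second regex.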
