Find what part of HTML is invalid with PHP

I tried several methods to find out what part of an HTML string is invalid:
$dom->loadHTML($badHtml);
$tidy->cleanRepair();
simplexml_load_string($badHtml);
None of them makes clear which part of the HTML is invalid. Maybe an extra config option for one of them can fix that. Any ideas?
I need this to manually fix HTML input from users. I don't want to rely on automated processes.

I'd try loading the offending HTML into a DOM Document (as you are already doing) and then using simplexml to fix things. You should be able to run a quick diff to see where the errors are.
error_reporting(0);
$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';
$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML($badHTML);
$goodHTML = simplexml_import_dom($doc)->asXML();

You can compare the cleaned and bad versions with PHP Inline-Diff, found in an answer to that Stack Overflow question.
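If the goal is to see where the parser complains rather than to get a repaired copy, libxml can collect its errors instead of printing them. A minimal sketch, assuming the DOM extension is available; findHtmlErrors is a hypothetical helper name, not part of PHP:

```php
<?php
// Hypothetical helper: returns one human-readable line per libxml parse error.
function findHtmlErrors(string $html): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $report = [];
    foreach ($errors as $error) {
        // Each LibXMLError carries the position and a message.
        $report[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }
    return $report;
}

$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';
foreach (findHtmlErrors($badHTML) as $line) {
    echo $line, PHP_EOL;
}
```

Line and column numbers refer to positions in the input string. libxml's HTML parser is lenient, so it may not flag everything a strict validator would.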

Related

$doc->getElementById('id'), $doc->getElementsByName('id') not working [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
PHP HTML DomDocument getElementById problems
I'm trying to extract info from Google searches in PHP and find that I can read the search urls without problem, but getting anything out of them is a whole different issue. After reading numerous posts, and applicable PHP docs, I came up with the following
// get large panoramas of montana
$url = 'http://www.google.com/search?q=montana+panorama&tbm=isch&biw=1408&bih=409';
$html = file_get_contents($url);
// was getting tons of "entity parse" errors, so added
$html = htmlentities($html, ENT_COMPAT, 'UTF-8', true); // tried false as well
$doc = new DOMDocument();
//$doc->strictErrorChecking = false; // tried both true and false here, same result
$result = $doc->loadHTML($html);
//echo $doc->saveHTML(); this shows that the tags I'm looking for are in fact in $doc
if ($result === true)
{
var_dump($result); // prints 'true'
$tags = $doc->getElementById('center_col');
$tags = $doc->getElementsByTagName('td');
var_dump($tags); // previous 2 lines both print NULL
}
I've verified that the ids and tags I'm looking for are in the html by error_log($html) and in the parsed doc with $doc->saveHTML(). Anyone see what I'm doing wrong?
Edit:
Thanks all for the help, but I've hit a wall with DOMDocument. Nothing in any of the docs, or other threads, works with Google image queries. Here's what I tried:
I looked at the @Jon link, tried all the suggestions there, looked at the getElementById docs and read all the comments there as well. Still getting empty result sets. Better than NULL, but not much.
I tried the xpath trick:
$xpath = new DOMXPath($doc);
$ccol = $xpath->query("//*[@id='center_col']");
Same result, an empty set.
I did a error_log($html) directly after the file read and the document has a doctype "" so it's not that.
I also see there that user "carl2088" says "From my experience, getElementById seem to work fine without any setups if you have loaded a HTML document". Not in the case of Google image queries, it would appear.
In desperation, I tried
echo count(explode('center_col', $html))
to see if for some strange reason it disappears after the initial error_log($html). It's definitely there, the string is split into 4 chunks.
I checked my version of PHP (5.3.15), compiled Aug. 25 2012, so it's not a version too old to support getElementById.
Before yesterday, I had been using an extremely ugly series of "explodes" to get the info, and while it's horrid code, it took 45 minutes to write and it works.
I'd really like to ditch my "explode" hack, but 5 hours to achieve nothing vs 45 minutes to get something that works, makes it really difficult to do things the right way.
If anyone else with experience using DOMDocument has some additional tricks I could try, it would be much appreciated.
Are you using the JavaScript getElementById and getElementsByTagName? If yes, then this is the problem:
$tags = $doc->getElementById('center_col');
$tags = $doc->getElementsByTagName('td');
You will need to validate your document with DOMDocument->validate(), or set DOMDocument->validateOnParse, before calling $doc->getElementById('center_col');
$doc->validateOnParse = true;
$doc->loadHTML($html);
stackoverflow: getelementbyid-problem
http://php.net/manual/de/domdocument.getelementbyid.php
It's in the question @Jon posted in his comment!
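One thing that may explain the empty results in the question (the thread does not confirm it): htmlentities() escapes every < and >, so loadHTML() receives plain text and builds no elements at all. A small sketch with a made-up snippet standing in for the downloaded page; the markup and the center_col id placement are assumptions for illustration only:

```php
<?php
// Stand-in for the downloaded page; the real Google markup is not reproduced here.
$html = '<html><body><div id="center_col"><table><tr><td>cell</td></tr></table></div></body></html>';

libxml_use_internal_errors(true); // keep parse warnings out of the output

// Load the raw markup, NOT passed through htmlentities().
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//*[@id='center_col']"); // note @id, the XPath attribute axis
$cells = $doc->getElementsByTagName('td');

// For comparison: the same markup after htmlentities() is just text to the parser.
$escaped = new DOMDocument();
$escaped->loadHTML(htmlentities($html, ENT_COMPAT, 'UTF-8'));
$escapedCells = $escaped->getElementsByTagName('td');

libxml_clear_errors();

printf(
    "raw: %d id match(es), %d td; escaped: %d td\n",
    $nodes->length,
    $cells->length,
    $escapedCells->length
);
```

If the "entity parse" warnings were the reason for adding htmlentities(), libxml_use_internal_errors(true) silences them without touching the markup.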

Add CDATA to XML element with SimpleXML and WITHOUT dom_import_simplexml()

I have a deadline for this project (Monday). It worked great on the localhost, but when I uploaded it to our web server, I discovered that we do not have all of the DOM package enabled and I cannot use the function dom_import_simplexml(). My server admin is ignoring my requests, probably because of the short notice, and I cannot possibly rewrite the XML system into a database system that quickly.
This appears to be the only error I'm encountering. Please, if you have any alternative ideas, I'd love to hear them. I'm at a loss, because I can't find any other solution. All I have to do is create some XML elements and populate them with CDATA values, and I can't believe that this is not supported by SimpleXML!
Please, are there any alternatives you can think of? I'm open, because I don't know what I can do.
// Add a <pages /> element
$xml->addChild('pages');
// Populate it with data
foreach($pages as $page){
$new = $xml->pages->addChild('page');
$new->addAttribute('pid', $page['pid']);
$new->addAttribute('title', $page['title']);
if(isset($page['ancestor']))
$new->addAttribute('ancestor', $page['ancestor']);
$node = dom_import_simplexml($new);
$no = $node->ownerDocument;
$node->appendChild($no->createCDATASection($page['content']));
}
Those last three lines obviously won't work, and I don't know what else I can do with them!
Thank you so much for any help you can provide.
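That they don't support this is baffling. You have to hack the output stream: SimpleXML escapes the angle brackets on assignment, so you turn the escaped CDATA markers back into real ones afterwards.
$xml->node = '<![CDATA[character data]]>';
echo preg_replace('/\]\]&gt;</', ']]><', preg_replace('/&lt;!\[CDATA/', '<![CDATA', $xml->asXML()));
This is problematic: those strings might show up somewhere in your CDATA, and any < or & inside the content stays escaped. A risk you'll have to take.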
Sorry I couldn't help you with your deadline, but at least anyone annoyed with using DOM or unable to use it will have a solution of sorts.
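A variation on the same string hack that avoids matching on escaped markers: put a unique placeholder into the element, then swap the placeholder for a real CDATA section once asXML() has produced the string. This is a sketch of a different technique from the one above, not something SimpleXML offers; addCdataChild and renderWithCdata are hypothetical helper names, and it assumes PHP 7+ for random_bytes():

```php
<?php
// Hypothetical helper: adds a child whose text is a placeholder token and
// remembers which CDATA section should replace that token later.
function addCdataChild(SimpleXMLElement $parent, string $name, string $content, array &$placeholders): SimpleXMLElement
{
    // Token uses only letters, digits and underscores, so SimpleXML will not escape it.
    $token = '__CDATA_' . count($placeholders) . '_' . bin2hex(random_bytes(8)) . '__';

    // A literal "]]>" would end the section early, so split it across two sections.
    $placeholders[$token] = '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $content) . ']]>';

    return $parent->addChild($name, $token);
}

// Hypothetical helper: serializes the tree and swaps tokens for CDATA sections.
function renderWithCdata(SimpleXMLElement $xml, array $placeholders): string
{
    $out = $xml->asXML();
    return $placeholders === [] ? $out : strtr($out, $placeholders);
}

$xml = new SimpleXMLElement('<site/>');
$placeholders = [];
$pages = $xml->addChild('pages');
addCdataChild($pages, 'page', '<p>Hello & welcome</p>', $placeholders);
echo renderWithCdata($xml, $placeholders);
```

Because the raw content never passes through SimpleXML, characters such as < and & arrive in the CDATA section unescaped. Attributes can still be added to the element that addCdataChild() returns.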

Going crazy getElementsByTagName not working on PHP 5.3.3

I have a very simple code like this:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();
$dom->preserveWhiteSpace = false;
foreach($dom->getElementsByTagName('img') as $img) {
// do something here
}
The variable $data contains HTML from an external URL. If I test this code on my local webserver using PHP 5.3.6, it works and returns ALL img tags in that page, but the same code running on another server with PHP 5.3.3 DOESN'T WORK! It doesn't return all img tags from the SAME $data value... it only returns the first 13 images.
I suspect that this has something to do with encoding, maybe that some characters in $data are badly encoded, but I don't know how to solve it. Is there a known bug in PHP 5.3.3 related to this?
I'd suggest you check out the comments on the php docs page,
looks like there's some useful advice on the DOMDocument usage:
http://de.php.net/manual/en/domdocument.getelementsbytagname.php
And before you ask about (possibly) known bugs on Stack Overflow,
you might want to look for them on https://bugs.php.net/
Edit:
I think I've found the bug related to that behavior:
https://bugs.php.net/bug.php?id=60762
Even though it's flagged 5.4.0 RC 5 I couldn't replicate
the behavior mentioned.
Probably an issue with the HTML data (as you mentioned).

preg_replace returning null when input is html (but not all of the time)

I am reading in html from a few different sources which I have to manipulate. As part of this I have a number of preg_replace() calls where I have to replace some of the information within the html received.
On 90% of the sites I have to do this on, everything works fine, the remaining 10% are returning NULL on each of the preg_replace() calls.
I've tried increasing the pcre.backtrack_limit and pcre.recursion_limit based on other articles I've found which appear to have the same problem, but this has been to no avail.
I have output preg_last_error() which is returning '4' for which the PHP documentation isn't proving very helpful at all, so if anyone can shed any light on this it might start to point me in the right direction, but I'm stumped.
One of the offending examples is:
$html = preg_replace('#<script[^>]*?.*?</script>#siu', '', $html);
but as I said, this works 90% of the time.
Don't parse HTML with regex. Use a real DOM parser:
$dom = new DOMDocument;
$dom->loadHTML($html);
$scripts = $dom->getElementsByTagName('script');
while ($el = $scripts->item(0)) {
$el->parentNode->removeChild($el);
}
$html = $dom->saveHTML();
You have bad utf-8.
/**
* Returned by preg_last_error if the last error was
* caused by malformed UTF-8 data (only when running a regex in UTF-8 mode). Available
* since PHP 5.2.0.
* @link http://php.net/manual/en/pcre.constants.php
*/
define ('PREG_BAD_UTF8_ERROR', 4);
However, you should really not use regex to parse html. Use DOMDocument
EDIT: Also I don't think this answer would be complete without including You can't parse [X]HTML with regex.
Your error 4 is a "PREG_BAD_UTF8_ERROR"; you should check the charset used on the sites which caused this error.
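To make that concrete, here is a rough sketch that checks the input before running a /u regex and re-encodes it when it is not valid UTF-8. It assumes the mbstring extension is loaded; stripScripts is a hypothetical helper name, and the ISO-8859-1 fallback is only a guess that should be replaced by whatever charset the site actually declares:

```php
<?php
// Hypothetical helper: removes <script> blocks with a UTF-8 mode regex,
// converting the input first if it is not valid UTF-8.
function stripScripts(string $html, string $assumedEncoding = 'ISO-8859-1'): string
{
    if (!mb_check_encoding($html, 'UTF-8')) {
        // Assumption: invalid UTF-8 means the page uses $assumedEncoding.
        $html = mb_convert_encoding($html, 'UTF-8', $assumedEncoding);
    }

    $result = preg_replace('#<script\b[^>]*>.*?</script>#siu', '', $html);
    if ($result === null) {
        // Surface the PCRE error code instead of silently returning NULL.
        throw new RuntimeException('preg_replace failed, preg_last_error() = ' . preg_last_error());
    }
    return $result;
}

// "\xE9" is "é" in ISO-8859-1 but an invalid byte sequence in UTF-8.
$bad = "caf\xE9 <script>x()</script> ok";

// Reproduces the symptom from the question.
$direct = preg_replace('#x#u', '', $bad);
$directError = preg_last_error();
var_dump($direct === null, $directError === PREG_BAD_UTF8_ERROR);

echo stripScripts($bad), PHP_EOL;
```

Dropping the u modifier also avoids the error when the pattern itself contains no multibyte characters, at the cost of treating the input as raw bytes.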
It is possible that you exceeded backtrack and/or internal recursion limits. See http://php.net/manual/en/pcre.configuration.php
Try this before preg_replace:
ini_set('pcre.backtrack_limit', '10000000');
ini_set('pcre.recursion_limit', '10000000');

How do I insert HTML into a PHP DOM object? [duplicate]

This question already has answers here:
How to insert HTML to PHP DOMNode?
(5 answers)
Closed 7 years ago.
I am using PHP's DOM object to create HTML pages for my website. This works great for my head, however since I will be entering a lot of HTML into the body (not via DOM), I would think I would need to use DOM->createElement($bodyHTML) to add my HTML from my site to the DOM object.
However, DOM->createElement seems to escape all HTML entities, so my end result displayed the HTML source on the page and not the actual rendered HTML.
I am currently using a hack to get this to work,
$body = $this->DOM
->createComment('DOM Glitch--><body>'.$bodyHTML."</body><!--Woot");
This puts all my site code in a comment; I then break out of the comment and add the <body> tags manually.
Currently this method works, but I believe there should be a more proper way of doing this. Ideally something like DOM->createElement() that will not parse any of the string.
I also tried using DOM->createDocumentFragment(). However, it does not like some of the string, so it errors and does not work (along with taking up extra CPU power to re-parse the body's HTML).
So, my question is, is there a better way of doing this other than using DOM->createComment()?
You use the DOMDocumentFragment object to insert arbitrary HTML chunks into another document.
$dom = new DOMDocument();
@$dom->loadHTML($some_html_document); // @ to suppress a bajillion parse errors
$frag = $dom->createDocumentFragment(); // create fragment
$frag->appendXML($some_other_html_snippet); // insert arbitrary HTML into the fragment (it must be well-formed XML)
$node = $dom->getElementsByTagName('body')->item(0); // for illustration: find whatever node you want to insert the fragment into
$node->appendChild($frag); // stuff the fragment into the original tree
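appendXML() only accepts well-formed XML, which may be why it errors on real-world HTML. One possible workaround, sketched here rather than taken from the answer above, is to parse the snippet with loadHTML() in a scratch document and import the resulting nodes. appendHtml is a hypothetical helper name; it assumes $parent is an element that already belongs to a DOMDocument, and non-ASCII input may additionally need an encoding hint:

```php
<?php
// Hypothetical helper: parses an HTML snippet leniently and appends the
// resulting nodes to $parent.
function appendHtml(DOMNode $parent, string $html): void
{
    $previous = libxml_use_internal_errors(true);

    // Parse inside a wrapper <div> so the snippet's top-level nodes are easy to find.
    $tmp = new DOMDocument();
    $tmp->loadHTML('<div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $tmp->getElementsByTagName('div')->item(0);

    // Copy the child list first; importNode() deep-copies into the target document.
    foreach (iterator_to_array($wrapper->childNodes) as $child) {
        $parent->appendChild($parent->ownerDocument->importNode($child, true));
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>first</p></body></html>');
$body = $dom->getElementsByTagName('body')->item(0);

// Unclosed <br> and HTML entities are fine here, unlike with appendXML().
appendHtml($body, '<p>second &copy;</p><br><i>third</i>');
echo $dom->saveHTML();
```

The snippet is still parsed once, so this does not remove the CPU cost mentioned in the question; it only avoids the XML well-formedness requirement.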
I FOUND THE SOLUTION. It's not a pure PHP solution, but it works very well. A little hack for everybody who lost countless hours, like me, trying to fix this:
$dom = new DomDocument;
// main object
$object = $dom->createElement('div');
// html attribute
$attr = $dom->createAttribute('value');
// ugly html string
$attr->value = "<div> this is a really html string ©</div><i></i> with all the © that XML hates!";
$object->appendChild($attr);
// jquery fix (or javascript as well)
$('div').html($(this).attr('value')); // and it works!
$('div').removeAttr('value'); // to clean-up
loadHTML works just fine.
<?php
$dom = new DOMDocument();
$dom->loadHTML("<font color='red'>Hey there mrlanrat!</font>");
echo $dom->saveHTML();
?>
which outputs Hey there mrlanrat! in red.
or
<?php
$dom = new DOMDocument();
$bodyHTML = "here is the body, a nice body I might add";
$dom->loadHTML("<body> " . $bodyHTML . " </body>");
// this would even work as well.
// $bodyHTML = "<body>here is the body, a nice body I might add</body>";
// $dom->loadHTML($bodyHTML);
echo $dom->saveHTML();
?>
Which outputs:
here is the body, a nice body I might add and inside of your HTML source code, it is wrapped inside body tags.
I spent a lot of time working on Anthony Forloney's answer, but I cannot seem to get the HTML to append to the body without it erroring.
@Mark B: I have tried doing that, but as I said in the comments, it errored on my HTML.
I forgot to add the below, my solution:
I decided to make my HTML object much simpler by not using DOM and just using strings, which allows me to do this.
