DomDocument and HTML parsing using PHP [duplicate] - php

I need to parse some HTML files, but they are not well-formed and PHP prints warnings. I want to suppress or handle these warnings programmatically. Please advise. Thank you!
Code:
// create a DOM document and load the HTML data
$xmlDoc = new DomDocument;
// this dumps out the warnings
$xmlDoc->loadHTML($fetchResult);
This:
@$xmlDoc->loadHTML($fetchResult)
can suppress the warnings, but how can I capture those warnings programmatically?

Call
libxml_use_internal_errors(true);
prior to processing with $xmlDoc->loadHTML().
This tells libxml2 not to send errors and warnings through to PHP. Then, to check for errors and handle them yourself, you can consult libxml_get_last_error() and/or libxml_get_errors() when you're ready:
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$errors = libxml_get_errors();
foreach ($errors as $error) {
// handle the errors as you wish
}
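
As a rough sketch of what "handle the errors as you wish" could look like: each entry returned by libxml_get_errors() is a LibXMLError object with level, code, line and message properties. The helper name collectHtmlParseIssues() and the sample markup are made up for illustration, and which issues libxml2 reports depends on its version.

```php
<?php
// Hypothetical helper: parse HTML quietly and return the collected issues as strings.
function collectHtmlParseIssues(string $html): array
{
    $previous = libxml_use_internal_errors(true); // buffer libxml errors instead of emitting warnings
    libxml_clear_errors();                         // start from an empty buffer

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $issues = [];
    foreach (libxml_get_errors() as $error) {
        // $error is a LibXMLError; level is one of the LIBXML_ERR_* constants
        $label = $error->level === LIBXML_ERR_WARNING ? 'warning' : 'error';
        $issues[] = sprintf('%s %d on line %d: %s', $label, $error->code, $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);         // restore the caller's setting

    return $issues;
}

// Sample input with a non-standard tag; whether it is reported depends on the libxml2 version.
$issues = collectHtmlParseIssues('<html><body><foo>text</foo></body></html>');
foreach ($issues as $issue) {
    echo $issue, PHP_EOL;
}
```

Returning plain strings keeps the caller independent of libxml; for logging you might prefer to return the LibXMLError objects themselves.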

To hide the warnings, you have to give special instructions to libxml which is used internally to perform the parsing:
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
The libxml_use_internal_errors(true) indicates that you're going to handle the errors and warnings yourself and you don't want them to mess up the output of your script.
This is not the same as the @ operator. The warnings get collected behind the scenes and afterwards you can retrieve them by using libxml_get_errors() in case you wish to perform logging or return the list of issues to the caller.
Whether or not you're using the collected warnings you should always clear the queue by calling libxml_clear_errors().
Preserving the state
If you have other code that uses libxml it may be worthwhile to make sure your code doesn't alter the global state of the error handling; for this, you can use the return value of libxml_use_internal_errors() to save the previous state.
// modify state
$libxml_previous_state = libxml_use_internal_errors(true);
// parse
$dom->loadHTML($html);
// handle errors
libxml_clear_errors();
// restore
libxml_use_internal_errors($libxml_previous_state);
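
One possible way to make the save/restore pattern harder to get wrong is to wrap it in a small function, so the previous state is restored even if parsing throws. This is only a sketch: withInternalLibxmlErrors() is a hypothetical name, and the try/finally construct assumes PHP 5.5 or later.

```php
<?php
// Hypothetical wrapper: run $parse with libxml's internal error buffering on,
// then restore whatever setting was active before, even if $parse throws.
function withInternalLibxmlErrors(callable $parse)
{
    $previous = libxml_use_internal_errors(true);
    try {
        return $parse();
    } finally {
        libxml_clear_errors();                  // do not leave stale entries in the buffer
        libxml_use_internal_errors($previous);  // restore the caller's setting
    }
}

$dom = new DOMDocument();
$loaded = withInternalLibxmlErrors(function () use ($dom) {
    // Made-up, deliberately sloppy markup
    return $dom->loadHTML('<p>Unclosed paragraph<div>stray</p>');
});
```

If you need the collected errors, read them with libxml_get_errors() inside the callback, because the wrapper clears the buffer before returning.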

Setting the options LIBXML_NOWARNING and LIBXML_NOERROR also works (loadHTML() accepts an options argument as of PHP 5.4.0):
$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);

You can install a temporary error handler with set_error_handler
class ErrorTrap {
    protected $callback;
    protected $errors = array();

    function __construct($callback) {
        $this->callback = $callback;
    }

    // Invoke the callback with a temporary error handler installed.
    function call() {
        set_error_handler(array($this, 'onError'));
        try {
            return call_user_func_array($this->callback, func_get_args());
        } finally {
            // Always restore the previous handler, even if the callback throws (PHP 5.5+).
            restore_error_handler();
        }
    }

    // Record the error instead of letting PHP print it.
    function onError($errno, $errstr, $errfile, $errline) {
        $this->errors[] = array($errno, $errstr, $errfile, $errline);
    }

    function ok() {
        return count($this->errors) === 0;
    }

    function errors() {
        return $this->errors;
    }
}
Usage:
// create a DOM document and load the HTML data
$xmlDoc = new DomDocument();
$caller = new ErrorTrap(array($xmlDoc, 'loadHTML'));
// this doesn't dump out any warnings
$caller->call($fetchResult);
if (!$caller->ok()) {
var_dump($caller->errors());
}

Related

Catching errors thrown by token_get_all (Tokenizer)

PHP's token_get_all function (which converts PHP source code into tokens) can throw two errors: one if an unterminated multiline comment is encountered, the other if an unexpected character is found.
I would like to catch those errors and throw them as Exceptions.
Problem is: As those errors are parse errors they cannot be handled with an error handling function you would normally specify using set_error_handler.
What I have currently implemented is the following:
// Reset the error message in error_get_last()
@$errorGetLastResetUndefinedVariable;
$this->tokens = @token_get_all($code);
$error = error_get_last();
if (preg_match(
'~^(Unterminated comment) starting line ([0-9]+)$~',
$error['message'],
$matches
)
) {
throw new ParseErrorException($matches[1], $matches[2]);
}
if (preg_match(
'~^(Unexpected character in input:\s+\'(.)\' \(ASCII=[0-9]+\))~s',
$error['message'],
$matches
)
) {
throw new ParseErrorException($matches[1]);
}
It should be obvious that I'm not really excited to use that solution. Especially the fact that I reset the error message in error_get_last by accessing an undefined variable seems pretty unsatisfactory.
So: Is there a better solution to this problem?
Set a custom errorhandler using set_error_handler.
Call token_get_all.
Then unset the error handler by calling restore_error_handler.
This will allow you to catch warnings. Make sure you remove the @ suppressor.
You can for instance register an error handler that is in a class that will just record any warnings for inspection later on.
Untested example code:
class CatchWarnings {
    private $warnings = array();

    public function handler($errno, $errstr, $errfile, $errline) {
        switch ($errno) {
            // Built-in functions raise E_WARNING; E_USER_WARNING only comes from trigger_error()
            case E_WARNING:
                $this->warnings[] = $errstr;
                return true; // cancel error handling bubble
        }
        return false; // error handling as usual
    }

    public function has_warnings() {
        return count($this->warnings) > 0;
    }
}

$cw = new CatchWarnings();
set_error_handler(array($cw, "handler"));
$tokens = token_get_all($code);
restore_error_handler();
Usually validation and execution are two separate things, but it seems like there is no way to validate/lint a piece of PHP code (not since 5.x anyway).
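
On PHP 7.0 and later there is another option worth considering: passing the TOKEN_PARSE flag makes token_get_all() throw a ParseError for code that does not parse, so no error handler or @ is needed. The sketch below assumes PHP 7.0+ with the tokenizer extension; tokenizeOrFail() and the choice of InvalidArgumentException are illustrative only.

```php
<?php
// Hypothetical helper: tokenize PHP source, turning syntax errors into exceptions.
// Assumes PHP 7.0+, where the TOKEN_PARSE flag makes token_get_all() throw ParseError.
function tokenizeOrFail(string $code): array
{
    try {
        return token_get_all($code, TOKEN_PARSE);
    } catch (ParseError $e) {
        // Re-throw as a domain exception; InvalidArgumentException is an arbitrary choice here.
        throw new InvalidArgumentException(
            sprintf('%s on line %d', $e->getMessage(), $e->getLine()),
            0,
            $e
        );
    }
}

try {
    tokenizeOrFail('<?php $a = ;'); // deliberately invalid sample code
} catch (InvalidArgumentException $e) {
    echo 'Invalid code: ', $e->getMessage(), PHP_EOL;
}
```

Keeping the original ParseError as the previous exception preserves the engine's message and line number for debugging.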

loadXML unhandleable error

I'm using PEAR XML_Feed_Parser.
When I give it some bad XML, I get this error:
DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding !
Bytes: 0xE8 0xCF 0xD3 0xD4 in Entity, line: 7
It's actually HTML in the wrong encoding (KOI8-R).
Getting an error is fine, but I can't handle it!
When I create a new XML_Feed_Parser instance with
$feed = new XML_Feed_Parser($xml);
it calls __construct(), which looks like this:
$this->model = new DOMDocument;
if (! $this->model->loadXML($feed)) {
    if (extension_loaded('tidy') && $tidy) {
        /* tidy stuff */
    } else {
        throw new Exception('Invalid input: this is not valid XML');
    }
}
Here we can see that if loadXML() fails, it throws an exception.
I want to catch the error from loadXML() so I can skip bad XML and notify the user, so I wrapped my code in try-catch like this:
try
{
$feed = new XML_Feed_Parser($xml);
/* ... */
}
catch(Exception $e)
{
echo 'Feed invalid: '.$e->getMessage();
return False;
}
But even after that I get that error
DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding !
Bytes: 0xE8 0xCF 0xD3 0xD4 in Entity, line: 7
I've read about loadXML() and found that
If an empty string is passed as the source, a warning will be generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.
But somehow, instead of a warning, I get an error that halts my application. I wrote my own error handler and saw that this really is a warning ($errno is 2).
So I see two solutions:
1. Treat warnings as warnings again, not as errors (Google didn't help me here), and then handle the false returned from loadXML().
2. Somehow catch that error.
Any help?
libxml_use_internal_errors(true) solved my problem. It makes libxml collect its errors internally instead of raising PHP warnings, so I can check for the false returned by loadXML().
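
A minimal sketch of that approach, outside of XML_Feed_Parser: buffer libxml's errors, check the return value of loadXML(), and throw an exception that carries the collected messages. The helper name parseXmlStrict() and the sample input are assumptions for illustration, and the input is assumed to be a non-empty string.

```php
<?php
// Hypothetical helper: returns a DOMDocument, or throws with libxml's messages attached.
function parseXmlStrict(string $xml): DOMDocument
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of raising warnings
    libxml_clear_errors();

    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);                    // false on malformed input

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the caller's setting

    if (!$ok) {
        throw new RuntimeException('Invalid XML: ' . implode('; ', $messages));
    }
    return $doc;
}

try {
    parseXmlStrict('<feed><title>broken</feed>'); // mismatched tags on purpose
} catch (RuntimeException $e) {
    echo 'Feed invalid: ', $e->getMessage(), PHP_EOL;
}
```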
Try this one:
$this->model = new DOMDocument;
$converted = mb_convert_encoding($feed, 'UTF-8', 'KOI8-R');
if (! $this->model->loadXML($converted)) {
    if (extension_loaded('tidy') && $tidy) {
        /* tidy stuff */
    } else {
        throw new Exception('Invalid input: this is not valid XML');
    }
}
Or you can do it without modifying XML_Feed_Parser, like this:
$xml = mb_convert_encoding($loaded_xml, 'UTF-8', 'KOI8-R');
$feed = new XML_Feed_Parser($xml);

PHP DOMDocument error handling

In my application I am loading xml from url in order to parse it.
But sometimes this url may not be valid. In this case I need to handle errors.
I have the following code:
$xdoc = new DOMDocument();
try{
$xdoc->load($url); // This line causes Warning: DOMDocument::load(...)
// [domdocument.load]: failed to open stream:
// HTTP request failed! HTTP/1.1 404 Not Found in ...
} catch (Exception $e) {
$xdoc = null;
}
if($xdoc == null){
// Handle
} else {
// Proceed
}
I know I'm probably doing it wrong, but what's the correct way to handle this kind of error? I don't want to see error messages on my page.
The manual for DOMDocument::load() says:
If an empty string is passed as the filename or an empty file is named, a warning will be generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.
But there is no information on how to handle it.
Thanks.
From what I can gather from the documentation, handling warnings issued by this method is tricky because they are not generated by the libxml extension and thus cannot be handled by libxml_get_last_error(). You could either use the error suppression operator and check the return value for false...
if (@$xdoc->load($url) === false)
// ...handle it
...or register an error handler which throws an exception on error:
function exception_error_handler($errno, $errstr, $errfile, $errline ) {
throw new ErrorException($errstr, 0, $errno, $errfile, $errline);
}
and then catch it.
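
Put together, the handler-plus-catch idea might look like the sketch below. To stay self-contained it parses a string with loadXML() rather than fetching a URL with load(); the same wrapping should apply to load(). The function name loadXmlOrThrow() is hypothetical, and try/finally assumes PHP 5.5 or later.

```php
<?php
// Hypothetical helper: load XML, converting any PHP warning raised during parsing
// into an ErrorException; the previous error handler is restored afterwards.
function loadXmlOrThrow(string $xml): DOMDocument
{
    set_error_handler(function ($errno, $errstr, $errfile, $errline) {
        throw new ErrorException($errstr, 0, $errno, $errfile, $errline);
    });
    try {
        $doc = new DOMDocument();
        $doc->loadXML($xml);
        return $doc;
    } finally {
        restore_error_handler(); // runs on success and when the handler throws
    }
}

try {
    loadXmlOrThrow('<a><b></a>'); // deliberately malformed sample input
} catch (ErrorException $e) {
    echo 'Could not parse: ', $e->getMessage(), PHP_EOL;
}
```

This only works while libxml_use_internal_errors() is off, because otherwise libxml buffers its errors and no PHP warning is raised for the handler to convert.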
set_error_handler(function($number, $error){
if (preg_match('/^DOMDocument::loadXML\(\): (.+)$/', $error, $m) === 1) {
throw new Exception($m[1]);
}
});
$xml = new DOMDocument();
$xml->loadXML($xmlData);
restore_error_handler();
That works for me in PHP 5.3. But if you're not using loadXML, you might need to do some modifications.
To disable throwing errors:
$internal_errors = libxml_use_internal_errors(true);
$dom = new DOMDocument();
// etc...
libxml_use_internal_errors($internal_errors);
From php.net
If an empty string is passed as the filename or an empty file is named, a warning will be generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.
In your production environment you shouldn't have errors displayed to the user. They don't need to see them so taking this into account you can use...
$xdoc = new DOMDocument();
if ( $xdoc->load($url) ) {
// valid
}
else {
// invalid
}
For me, the following did the trick:
$feed = new DOMDocument();
$res = @$feed->load('http://www.astrology.com/horoscopes/daily-extended.rss');
if($res==1){
//do sth
}
