loadXML unhandleable error - php

I'm using PEAR's XML_Feed_Parser.
When I give it malformed XML, I get this error:
DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding !
Bytes: 0xE8 0xCF 0xD3 0xD4 in Entity, line: 7
The input is actually HTML in the wrong encoding (KOI8-R).
Getting an error is fine, but I can't handle it!
When I create a new XML_Feed_Parser instance with
$feed = new XML_Feed_Parser($xml);
it calls __construct(), which looks like this:
$this->model = new DOMDocument;
if (! $this->model->loadXML($feed)) {
    if (extension_loaded('tidy') && $tidy) {
        /* tidy stuff */
    } else {
        throw new Exception('Invalid input: this is not valid XML');
    }
}
So if loadXML() fails, the constructor throws an exception.
I want to catch the error from loadXML() so that I can skip bad XML and notify the user, so I wrapped my code in try/catch like this:
try {
    $feed = new XML_Feed_Parser($xml);
    /* ... */
} catch (Exception $e) {
    echo 'Feed invalid: ' . $e->getMessage();
    return false;
}
But even then I still get the same error:
DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding !
Bytes: 0xE8 0xCF 0xD3 0xD4 in Entity, line: 7
I read the documentation for loadXML() and found this:
If an empty string is passed as the source, a warning will be generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.
But instead of a warning I somehow get an error that halts my application. I wrote my own error handler and saw that it really is a warning ($errno is 2).
So I see two solutions:
1. Turn warnings back into plain warnings so they are not treated as errors (Google didn't help me here), then handle the false returned by loadXML().
2. Somehow catch that error.
Any help?

libxml_use_internal_errors(true) solved my problem. It makes libxml collect its errors internally instead of raising PHP warnings, so loadXML() simply returns false and I can catch the exception that XML_Feed_Parser throws.
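A minimal sketch of that approach, using DOMDocument directly rather than XML_Feed_Parser. The helper name loadFeedXml() and the sample input are assumptions made for this example; the input reuses the KOI8-R bytes from the error message above.

```php
<?php
// Sketch: let libxml collect errors internally so loadXML() returns false
// instead of raising a PHP warning. loadFeedXml() is a hypothetical helper.
function loadFeedXml(string $xml): DOMDocument
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Guard against the empty string, which libxml's error handling does not cover.
    $ok = $xml !== '' && $doc->loadXML($xml);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        $first = $errors ? trim($errors[0]->message) : 'empty or unparsable input';
        throw new RuntimeException('Invalid XML: ' . $first);
    }
    return $doc;
}

// KOI8-R bytes with no encoding declared, so libxml assumes UTF-8 and fails.
$bad = "<title>\xE8\xCF\xD3\xD4</title>";
try {
    loadFeedXml($bad);
    $message = 'parsed';
} catch (RuntimeException $e) {
    $message = 'Feed invalid: ' . $e->getMessage();
}
echo $message, "\n";
```

The previous libxml error mode is restored before the function returns, so other code that relies on PHP warnings keeps working.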

Try this one:
$this->model = new DOMDocument;
$converted = mb_convert_encoding($feed, 'UTF-8', 'KOI8-R');
if (! $this->model->loadXML($converted)) {
    if (extension_loaded('tidy') && $tidy) {
        /* tidy stuff */
    } else {
        throw new Exception('Invalid input: this is not valid XML');
    }
}
Or you can do it without modifying XML_Feed_Parser, like this:
$xml = mb_convert_encoding($loaded_xml, 'UTF-8', 'KOI8-R');
$feed = new XML_Feed_Parser($xml);
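Converting unconditionally would corrupt feeds that are already valid UTF-8. The sketch below converts only when the bytes fail a UTF-8 check. The helper name toUtf8() and the KOI8-R default are assumptions for this example, and it requires the mbstring extension. It also assumes the document does not declare a different encoding in its XML prolog.

```php
<?php
// Sketch: convert only when the input is not already valid UTF-8.
// toUtf8() and its KOI8-R default are hypothetical, chosen for this example.
function toUtf8(string $raw, string $assumedEncoding = 'KOI8-R'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($raw, 'UTF-8', $assumedEncoding);
}

$koi8 = "\xE8\xCF\xD3\xD4";   // the bytes reported in the error message
$utf8 = toUtf8($koi8);
var_dump(mb_check_encoding($utf8, 'UTF-8'));
```

The result can then be passed to new XML_Feed_Parser() as in the snippet above.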

Related

PHP DOMDocument loading custom tags [duplicate]

I need to parse some HTML files. However, they are not well-formed, and PHP prints out warnings. I want to avoid this debugging/warning behavior programmatically. Please advise. Thank you!
Code:
// create a DOM document and load the HTML data
$xmlDoc = new DomDocument;
// this dumps out the warnings
$xmlDoc->loadHTML($fetchResult);
This:
@$xmlDoc->loadHTML($fetchResult);
can suppress the warnings, but how can I capture those warnings programmatically?
Call
libxml_use_internal_errors(true);
prior to processing with $xmlDoc->loadHTML().
This tells libxml2 not to send errors and warnings through to PHP. Then, to check for errors and handle them yourself, you can consult libxml_get_last_error() and/or libxml_get_errors() when you're ready:
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$errors = libxml_get_errors();
foreach ($errors as $error) {
    // handle the errors as you wish
}
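As an illustration of "handle the errors as you wish", here is a small sketch that turns the collected LibXMLError objects into log lines using their level, line and message properties. The helper name describeLibxmlErrors() and the sample markup are assumptions for this example. Which errors libxml reports for a given input depends on the libxml2 version.

```php
<?php
// Sketch: format collected LibXMLError objects as readable log lines.
// describeLibxmlErrors() is a hypothetical helper name.
function describeLibxmlErrors(array $errors): array
{
    $labels = [
        LIBXML_ERR_WARNING => 'warning',
        LIBXML_ERR_ERROR   => 'error',
        LIBXML_ERR_FATAL   => 'fatal',
    ];
    $lines = [];
    foreach ($errors as $error) {
        $lines[] = sprintf(
            '[%s] line %d: %s',
            $labels[$error->level] ?? 'unknown',
            $error->line,
            trim($error->message)
        );
    }
    return $lines;
}

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<p>unclosed <b>markup</p><nosuchtag>x</nosuchtag>');
$report = describeLibxmlErrors(libxml_get_errors());
libxml_clear_errors();
libxml_use_internal_errors($previous);

print_r($report);
```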
To hide the warnings, you have to give special instructions to libxml which is used internally to perform the parsing:
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
Calling libxml_use_internal_errors(true) indicates that you're going to handle the errors and warnings yourself and you don't want them to mess up the output of your script.
This is not the same as the @ operator. The warnings get collected behind the scenes, and afterwards you can retrieve them with libxml_get_errors() in case you wish to log them or return the list of issues to the caller.
Whether or not you use the collected warnings, you should always clear the queue by calling libxml_clear_errors().
Preserving the state
If you have other code that uses libxml it may be worthwhile to make sure your code doesn't alter the global state of the error handling; for this, you can use the return value of libxml_use_internal_errors() to save the previous state.
// modify state
$libxml_previous_state = libxml_use_internal_errors(true);
// parse
$dom->loadHTML($html);
// handle errors
libxml_clear_errors();
// restore
libxml_use_internal_errors($libxml_previous_state);
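If the code between "modify state" and "restore" can throw, the restore step is skipped. One way to guard against that, sketched here under the assumption of PHP 7 or later, is a try/finally block. The helper name parseHtmlQuietly() is hypothetical.

```php
<?php
// Sketch: restore the previous libxml error mode even if parsing code throws.
// parseHtmlQuietly() is a hypothetical helper name.
function parseHtmlQuietly(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true);
    try {
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        return $dom;
    } finally {
        // Runs on both normal return and exception.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}

$before = libxml_use_internal_errors(false);   // force a known starting state
$dom = parseHtmlQuietly('<p>not <b>well-formed</p>');
$after = libxml_use_internal_errors($before);  // read the current state, put the original back
var_dump($after);
```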
Passing the options LIBXML_NOWARNING and LIBXML_NOERROR, combined with a bitwise OR, works too:
$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
You can install a temporary error handler with set_error_handler():
class ErrorTrap {
    protected $callback;
    protected $errors = array();

    function __construct($callback) {
        $this->callback = $callback;
    }

    // Run the callback with a temporary error handler installed.
    function call() {
        $result = null;
        set_error_handler(array($this, 'onError'));
        try {
            $result = call_user_func_array($this->callback, func_get_args());
        } catch (Exception $ex) {
            restore_error_handler();
            throw $ex;
        }
        restore_error_handler();
        return $result;
    }

    // Record each error instead of letting PHP print it.
    function onError($errno, $errstr, $errfile, $errline) {
        $this->errors[] = array($errno, $errstr, $errfile, $errline);
    }

    function ok() {
        return count($this->errors) === 0;
    }

    function errors() {
        return $this->errors;
    }
}
Usage:
// create a DOM document and load the HTML data
$xmlDoc = new DomDocument();
$caller = new ErrorTrap(array($xmlDoc, 'loadHTML'));
// this doesn't dump out any warnings
$caller->call($fetchResult);
if (!$caller->ok()) {
    var_dump($caller->errors());
}

Using additional data in php exceptions

I have PHP code that executes a Python CGI script, and I want to pass the Python trace (returned from the CGI) as extra data on the PHP exception. How can I do this, and how can I read that value in catch (Exception $e) { (it should check whether the extra value exists)?
I have code like this:
$response = json_decode(curl_exec($ch));
if (isset($response->error)) {
    // how to send $response->trace with exception.
    throw new Exception($response->error);
}
return $response->result;
and I use a JSON-RPC library that should return that data to the user:
} catch (Exception $e) {
    // catch all exceptions from user code
    $msg = $e->getMessage();
    echo response(null, $id, array("code"=>200, "message"=>$msg));
}
Do I need to write a new type of exception, or can I do this with a normal Exception? I would like to send everything that was thrown in "data" =>
You need to extend the Exception class:
<?php
class ResponseException extends Exception
{
    private $_data = '';

    public function __construct($message, $data)
    {
        $this->_data = $data;
        parent::__construct($message);
    }

    public function getData()
    {
        return $this->_data;
    }
}
When throwing:
<?php
...
throw new ResponseException($response->error, $someData);
...
And when catching:
catch (ResponseException $e) {
    ...
    $data = $e->getData();
    ...
}
Dynamic Property (not recommended)
Please note that this triggers a deprecation notice as of PHP 8.2 and will stop working in PHP 9, according to the PHP RFC https://wiki.php.net/rfc/deprecate_dynamic_properties
Since the OP asked about doing this without extending the Exception class, you can skip the ResponseException class declaration entirely. I really do not recommend doing it this way unless you have a very strong reason (see this topic for more details: https://softwareengineering.stackexchange.com/questions/186439/is-declaring-fields-on-classes-actually-harmful-in-php)
In throwing section:
...
$e = new Exception('Exception message');
$e->data = $customData; // we're creating object property on the fly
throw $e;
...
and when catching:
catch (Exception $e) {
    $data = $e->data; // Access data property
}
September 2018 edit:
As some readers found this answer useful, I have added a link to another Stack Overflow question which explains the downsides of using dynamically declared properties.
Currently, your code converts the response text directly into an object without any intermediate step. Instead, you could keep the raw JSON text and append it to the end of the Exception message.
$responseText = curl_exec($ch);
$response = json_decode($responseText);
if (isset($response->error)) {
    throw new Exception('Error when fetching resource. Response:'.$responseText);
}
return $response->result;
Then you can recover everything after "Response:" from your error log and either deserialize it or just read it.
As an aside, I would not count on the server sending JSON. You should verify that the response text was actually parseable as JSON and return a separate error if it wasn't.
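The check described above could look roughly like this. It is a sketch, not the JSON-RPC library's API: decodeResponse() and RemoteCallException are hypothetical names, and the exception carries the extra data in the same way as the ResponseException answer.

```php
<?php
// Sketch: reject non-JSON responses before looking for an error field.
// RemoteCallException and decodeResponse() are hypothetical names.
class RemoteCallException extends Exception
{
    private $data;

    public function __construct($message, $data = null)
    {
        parent::__construct($message);
        $this->data = $data;
    }

    public function getData()
    {
        return $this->data;
    }
}

function decodeResponse($responseText)
{
    // curl_exec() returns false on failure; the cast turns that into ''.
    $response = json_decode((string) $responseText);
    if ($response === null && json_last_error() !== JSON_ERROR_NONE) {
        throw new RemoteCallException(
            'Response is not valid JSON: ' . json_last_error_msg(),
            $responseText
        );
    }
    if (isset($response->error)) {
        $trace = isset($response->trace) ? $response->trace : null;
        throw new RemoteCallException($response->error, $trace);
    }
    return isset($response->result) ? $response->result : null;
}

try {
    decodeResponse('{"error":"boom","trace":"Traceback ..."}');
    $caught = null;
} catch (RemoteCallException $e) {
    $caught = array($e->getMessage(), $e->getData());
}
print_r($caught);
```

In the catch block, getData() returns null when no trace was sent, which covers the "check whether the extra value exists" part of the question.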

PHP DOMDocument error handling

In my application I load XML from a URL in order to parse it.
Sometimes this URL is not valid, and in that case I need to handle the error.
I have the following code:
$xdoc = new DOMDocument();
try {
    $xdoc->load($url); // This line causes Warning: DOMDocument::load(...)
                       // [domdocument.load]: failed to open stream:
                       // HTTP request failed! HTTP/1.1 404 Not Found in ...
} catch (Exception $e) {
    $xdoc = null;
}
if ($xdoc == null) {
    // Handle
} else {
    // Proceed
}
I know I'm probably doing it wrong, but what is the correct way to handle this kind of error? I don't want to see error messages on my page.
The manual for DOMDocument::load() says:
If an empty string is passed as the filename or an empty file is named, a warning will be generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.
But there is no information on how to handle it.
Thanks.
From what I can gather from the documentation, handling warnings issued by this method is tricky because they are not generated by the libxml extension and thus cannot be handled by libxml_get_last_error(). You could either use the error suppression operator and check the return value for false...
if (@$xdoc->load($url) === false) {
    // ...handle it
}
...or register an error handler which throws an exception on error:
function exception_error_handler($errno, $errstr, $errfile, $errline) {
    throw new ErrorException($errstr, 0, $errno, $errfile, $errline);
}
set_error_handler('exception_error_handler');
and then catch it.
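Put together, the second option might look like the sketch below. The helper name loadXmlFileOrNull() and the file path are assumptions for this example. The handler is installed only for the duration of the call and removed in a finally block, which assumes PHP 7.1 or later.

```php
<?php
// Sketch: turn PHP warnings into ErrorException for one load() call only.
// loadXmlFileOrNull() is a hypothetical helper name.
function loadXmlFileOrNull(string $url): ?DOMDocument
{
    if ($url === '') {
        return null; // the empty-filename case is handled up front
    }
    set_error_handler(function ($errno, $errstr, $errfile, $errline) {
        throw new ErrorException($errstr, 0, $errno, $errfile, $errline);
    });
    try {
        $doc = new DOMDocument();
        return $doc->load($url) ? $doc : null;
    } catch (ErrorException $e) {
        return null; // a warning was raised: treat it as a failed load
    } finally {
        restore_error_handler();
    }
}

$xdoc = loadXmlFileOrNull('/no/such/file.xml');
if ($xdoc === null) {
    echo "Handle\n";
} else {
    echo "Proceed\n";
}
```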
set_error_handler(function($number, $error){
    if (preg_match('/^DOMDocument::loadXML\(\): (.+)$/', $error, $m) === 1) {
        throw new Exception($m[1]);
    }
});
$xml = new DOMDocument();
$xml->loadXML($xmlData);
restore_error_handler();
That works for me in PHP 5.3. If you're not using loadXML(), you will need to adjust the pattern to match the method you call.
To keep libxml errors from being emitted as PHP warnings:
$internal_errors = libxml_use_internal_errors(true);
$dom = new DOMDocument();
// etc...
libxml_use_internal_errors($internal_errors);
From php.net:
If an empty string is passed as the filename or an empty file is named, a warning will be generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.
In your production environment you shouldn't display errors to the user, since they don't need to see them. Taking this into account, you can simply check the return value:
$xdoc = new DOMDocument();
if ($xdoc->load($url)) {
    // valid
} else {
    // invalid
}
For me, the following did the trick:
$feed = new DOMDocument();
$res = @$feed->load('http://www.astrology.com/horoscopes/daily-extended.rss');
if ($res) {
    // do something
}
