XML Parsing Error: undefined entity - special characters - php

Why does XML raise an error on certain special characters while others are fine?
For instance, the following produces an error:
<?xml version="1.0" standalone="yes"?>
<Customers>
<Customer>
<Name>L&ouml;ic</Name>
</Customer>
</Customers>
but this is ok,
<?xml version="1.0" standalone="yes"?>
<Customers>
<Customer>
<Name>&amp;</Name>
</Customer>
</Customers>
By the way, I convert the special characters in PHP with htmlentities('Löic', ENT_QUOTES).
How can I get around this?
Thanks.
EDIT:
I found that it works fine if I use a numeric character reference such as L&#243;ic.
Now I have to find out how to use PHP to convert special characters into numeric character references!
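One possible way to do that conversion is sketched below. It assumes the input string is UTF-8 and that the mbstring extension is loaded; the helper name to_numeric_refs is made up for this example.

```php
<?php
// Hypothetical helper: replace every non-ASCII character with a decimal
// numeric character reference, which any XML parser understands.
function to_numeric_refs(string $utf8): string
{
    // Convmap: code points 0x80..0x10FFFF, no offset, keep all bits.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

// XML's own reserved characters still need escaping first.
$name = htmlspecialchars("L\u{F6}ic & Co", ENT_XML1 | ENT_QUOTES, 'UTF-8');
echo '<Name>' . to_numeric_refs($name) . '</Name>', "\n";
```

Numeric references work regardless of the encoding the document is eventually saved in, which is why they avoid the undefined entity error.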

There are five entities defined in the XML specification: &amp;, &lt;, &gt;, &apos; and &quot;
There are lots of entities defined in the HTML DTD.
You can't use the ones from HTML in generic XML.
You could use numeric references, but you would probably be better off just getting your character encodings straight, which basically boils down to:
Set your editor to save the data in UTF-8
If you process the data with a programming language, make sure it is UTF-8 aware
If you store the data in a database, make sure it is configured for UTF-8
When you serve up your document, make sure the HTTP headers specify that it is UTF-8 (in the case of XML, UTF-8 is the default, so not specifying anything is almost as good)
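As a rough illustration of the "keep everything in UTF-8" route, the sketch below decodes text that was already run through htmlentities() back to real UTF-8 characters and then escapes only what XML itself reserves. The function name html_text_to_xml is invented for this example, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: turn HTML-escaped text into text that is safe
// inside generic XML, without relying on HTML-only named entities.
function html_text_to_xml(string $htmlEscaped): string
{
    // Decode HTML named entities (&ouml;, &rsquo;, ...) to real characters.
    $plain = html_entity_decode($htmlEscaped, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Re-escape only the characters that XML reserves (&, <, >, quotes).
    return htmlspecialchars($plain, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

echo '<?xml version="1.0" encoding="UTF-8"?>', "\n";
echo '<Name>' . html_text_to_xml('L&ouml;ic &amp; Co') . '</Name>', "\n";
```

The resulting element contains the literal character ö plus &amp;, so it relies only on the predefined XML entities.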

Because it is not a built-in entity. It has to be declared in a DTD before it can be used.

TLDR Solution
You can solve this problem with html_entity_decode() (Source: PHP.net), like so...
$xml_line = '<description>' . html_entity_decode($description) . '</description>';
Full, Working Demo Online
In this demo, I use &rsquo; and a line from the Tao Te Ching to demonstrate the above use of html_entity_decode()...
$title = 'The name you can say isn&rsquo;t the real name.';
$xml_title = html_entity_decode($title);
// Re-escape the characters XML reserves; '&' must come first so the
// ampersands produced for '<' and '>' are not escaped a second time.
$xml_title = str_replace(['&', '<', '>'], ['&amp;', '&lt;', '&gt;'], $xml_title);
$xml_line = '<title>' . $xml_title . '</title>';
print($xml_line);
Don't forget to replace back those &, < and > chars, though!
Working Demo Sandbox
How Do You Know It Worked?
Want to verify it worked just fine? Then head on over to the W3C RSS Feed Validator, and see the above code being approved as just fine.

Related

multilanguage support to xml text search php

I have an XML file which can be in any language (Finnish, Italian, Swedish, Dutch). I have saved the XML using the header
<?xml version="1.0" encoding="ISO-8859-1"?>
The saved XML contains special characters and some HTML codes, such as
&#8271; for a single quote, etc.
Now I want to provide a search text functionality using this xml as source as follows
$xml->xpath("//page[data[contains(., '".strtoupper($string)."')]]")
Where I am struggling is that when I provide the $search_text as a variable from PHP, it does not match these &#8271; codes and produces an error.
For example, the word nell’Esercizio is stored as nell&#8217;Esercizio in the XML, and hence my XPath search result is empty.
I tried htmlentities and htmlspecialchars but no luck. For special characters I tried a utf8_encode(), utf8_decode() combination and it worked (for Finnish), but for these HTML codes it fails.
What is the proper way of searching text in an XML file in different languages via a PHP application?
The XPath expression has to be UTF-8; the encoding of the document is not relevant. DOM uses UTF-8 and converts on load/save. I think your problem is the strtoupper(). You need to use Unicode-safe transliterations:
ext/intl
ext/mbstring
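A minimal sketch of that idea, assuming ext/mbstring is available: the sample document and search term are invented, and the element names follow the question. The parser decodes &#8217; to the real character on load, so a UTF-8 search string matches it directly.

```php
<?php
// Invented sample data: the stored text uses numeric character references,
// which the parser turns into real characters when loading.
$xml = simplexml_load_string(
    '<?xml version="1.0" encoding="UTF-8"?>'
    . '<pages><page><data>NELL&#8217;ESERCIZIO &#200; CHIUSO</data></page></pages>'
);

// The search term is plain UTF-8 (U+2019 quote, U+00E8 e-grave).
$search = "nell\u{2019}esercizio \u{E8}";

// mb_strtoupper() uppercases non-ASCII letters too; strtoupper() does not.
$needle = mb_strtoupper($search, 'UTF-8');

// Double quotes delimit the XPath literal; a term that itself contains
// a double quote would need extra handling.
$matches = $xml->xpath('//page[data[contains(., "' . $needle . '")]]');
echo count($matches), "\n";
```

Uppercasing both sides only works if the stored text is uppercase as well, as it appears to be in the question.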

Encode ’ to be XML safe

I have a string that contains a right single quotation mark:
$str = "David’s Spade";
I am sending the string via XML and need to encode it. I have read that I should encode string using htmlspecialchars, but I have found that XML request still fails whereas htmlentities works.
When I error_log $str:
$str; // David\xe2\x80\x99s Spade
htmlspecialchars($str); // David\xe2\x80\x99s Spade
htmlspecialchars($str, ENT_QUOTES, 'UTF-8'); // David\xe2\x80\x99s Spade
htmlentities($str); // David&rsquo;s Spade
Would it be better to str_replace ’ and then use htmlentities? Are there any other chars htmlentities may miss?
I am sending the string via XML and need to encode it.
No, you don't. If the XML is UTF-8 encoded (it is by default) and as your $str is UTF-8 encoded (as you show by the binary sequences in your question), you do not need to encode it.
This is by the book. Given the technical information about the data you are working with, this is clear and fine.
You then write that some things work and others don't. Whatever you are doing there, the problem lies in the things you have left out of your question.
To make this more explicit:
$str = "David’s Spade"; // "David\xE2\x80\x99s Spade"
is a perfectly valid string, for example to use it with an XML library like Simplexml to add it to an XML document:
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0" encoding="UTF-8"?>
<doc><element>David’s Spade</element></doc>
As you can see, the XML has been encoded by not changing the byte-sequence of the string here because it's UTF-8.
Let's take some ASCII:
$xml = new SimpleXMLElement('<doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0"?>
<doc><element>David&#x2019;s Spade</element></doc>
As this example shows, it depends on the document encoding. This second example is a fall-back of SimpleXML to make the output more robust, but actually it wouldn't be necessary, as UTF-8 would be the default encoding.
In any case, you should not need to be too concerned about the encoding yourself if you use a library that specializes in creating XML documents. PHP has a few for exactly that. Take one of them.

What is the best way to deal with XML that contains invalid characters (PHP)?

I am integrating with Quickbooks using QBXML. I am running a customer query and the XML that Quickbooks returns appears to contain an invalid character (!).
Looking at the source XML that Quickbooks returns, I can see the invalid character (actual name changed for privacy reasons, but I left in the character in question):
<Contact>Ongél Davabond</Contact>
When I try to parse the XML (with the PHP XML parser, starting with xml_parser_create() ), I get an invalid character message.
I noticed that the XML header is just:
<?xml version="1.0" ?>
I tried preg_replacing that with
<?xml version="1.0" encoding="utf-8" ?>
but that didn't make any difference.
Given that I can't change how I receive the XML, how do I best deal with it on my end? Is there a way to have the PHP XML parser accept such characters? Does PHP have a way to convert any invalid characters into their &#nnn; equivalents, without affecting the XML structure, or do I need to go through the whole of the XML character by character looking for invalid characters and replacing them manually? I have no idea what other invalid characters might come up in the future, so I am after a way to deal with all the possibilities in one go, rather than just fixing this one 'é' character.
Although I was expecting UTF-8, the XML returned was ISO-8859-1. Forcing ISO-8859-1 encoding solved the issue.
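One way to force the encoding, sketched here with SimpleXML rather than the xml_parser_* functions from the question: declare ISO-8859-1 in the XML declaration before parsing, so the parser converts the bytes itself. The sample bytes are made up to mirror the question.

```php
<?php
// Made-up input: "é" arrives as the single Latin-1 byte 0xE9, and the
// declaration names no encoding, so a parser would assume UTF-8.
$raw = "<?xml version=\"1.0\" ?>\n<Contact>Ong\xE9l Davabond</Contact>";

// Add encoding="ISO-8859-1" to a declaration that has no encoding.
$declared = preg_replace(
    '/^<\?xml\s+version="1\.0"\s*\?>/',
    '<?xml version="1.0" encoding="ISO-8859-1"?>',
    $raw,
    1
);

// libxml converts to UTF-8 internally, so the value comes back as UTF-8.
$xml = simplexml_load_string($declared);
$contact = (string) $xml;
echo $contact, "\n";
```

Declaring utf-8 instead, as tried in the question, cannot help, because the bytes themselves are not valid UTF-8.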

SimpleXML parse external php feed

I want to parse an external php feed.
The address: http://www.hittadjur.se/feed.php?count=1
The output:
<?xml version="1.0"?>
<annons>
<rubrik>Wilja</rubrik>
<datum>2013-03-22</datum>
<ras>Chihuahua långhår</ras>
<ort>Göteborg</ort><bildurl>http://www.hittadjur.se/images/uploaded/thumbs/1363984467.jpg</bildurl><addurl>http://www.hittadjur.se/index.php?page=case&type=&county=32&subpage=show&case=1363984558</addurl>
</annons>
My PHP code that doesn't work:
$content = utf8_encode(file_get_contents('http://www.hittadjur.se/feed.php?count=1'));
$xml = simplexml_load_file($content);
echo $xml->annons->rubrik;
The reason I use the utf8_encode is that I receive this message if I don't:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xE5 0x6E 0x67 0x68
The error now is:
Warning: simplexml_load_file() [function.simplexml-load-file]: I/O warning : failed to load external entity
Any ideas?
Thanks!
Try passing the full directory path if you are trying to load XML files held on your server:
simplexml_load_file($_SERVER['DOCUMENT_ROOT'].'/example.xml')
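or, if you want to access the XML over HTTP, you will need to set allow_url_fopen to On in php.ini. Setting it from your code, as below, will not work, because allow_url_fopen can only be changed in php.ini or the server configuration:
ini_set('allow_url_fopen', 'On');
You can also fetch the document yourself and parse it as a string: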
$temp = file_get_contents($url);
$XmlObj = simplexml_load_string($temp);
As Alvaro Vicario wrote, the problem is the parameter separators ( & ) in your URLs. In XML, an ampersand is an entity marker (the start of a named symbol or of a numerical representation of a character code point) and must be escaped.
Either replace & with &amp; in your URLs or mark the URLs as literal text (a CDATA section in XML speak): <![CDATA[http://...]]>.
e.g.: <addurl><![CDATA[http://www.hittadjur.se/index.php?page=case&type=&county=32&subpage=show&case=1363984558]]></addurl>
If you are uncomfortable with the explicit UTF-8 conversion in your code and you know the character encoding of your data source, you may extend the XML prologue (ISO-8859-1 contains the offending å/0xE5 of your XML):
<?xml version="1.0" encoding="iso-8859-1"?>
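Both options can be sketched from the side that generates the feed. This is only an illustration, assuming the URL is available as a plain PHP string; the variable names are invented.

```php
<?php
$url = 'http://www.hittadjur.se/index.php?page=case&type=&county=32&subpage=show&case=1363984558';

// Option 1: escape the reserved characters (& becomes &amp;).
$escaped = '<addurl>' . htmlspecialchars($url, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</addurl>';

// Option 2: wrap the raw text in a CDATA section. A literal "]]>" inside
// the text would end the section early, so it is split across two sections.
$cdata = '<addurl><![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $url) . ']]></addurl>';

// Either form parses back to the original URL.
$fromEscaped = (string) simplexml_load_string($escaped);
$fromCdata = (string) simplexml_load_string($cdata, 'SimpleXMLElement', LIBXML_NOCDATA);
echo $fromEscaped === $url ? 'ok' : 'mismatch', "\n";
echo $fromCdata === $url ? 'ok' : 'mismatch', "\n";
```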
I'm afraid that the feed provides malformed XML. Apart from the encoding mess:
<addurl>http://www.hittadjur.se/index.php?page=case&type=&county=32&subpage=show&case=1363984558</addurl>
^
\_ Data not properly escaped
I may be wrong but I don't think you can parse it using regular XML functions because they're designed for valid XML (that's the whole purpose of using XML in the first place).
Perhaps you can try with DOMDocument. It's designed for HTML so it can deal with invalid input but it can also do XML.
Edit: Here's a trick to fix invalid XML but, honestly, I'm not sure it's worth the effort.

Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

I'm getting the error:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20
When trying to process an XML response using simplexml_load_string from a 3rd party source. The raw XML response does declare the content type:
<?xml version="1.0" encoding="UTF-8"?>
Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish and it contains words like Dublín.
I'm unable to get the 3rd party to sort out their XML.
How can I pre-process the XML and fix the encoding incompatibilities?
Is there a way to detect the correct encoding for a XML file?
Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
Now there are a few ways to work around it, which you should only use if you cannot load the XML normally. One of them would be to use utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring, and hope they'll fix it for you. (They won't, but you can at least ignore the invalid characters so you can load your XML.)
Or you can take the long, long road and validate/fix the sequences by yourself. That will take you a while depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know any.
Either way, notify your data provider that they're sending invalid data so that they can fix it.
Here's a partial fix. It will definitely not fix everything, but will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}
function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}
I solved this using
$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
If you are sure that your xml is encoded in UTF-8 but contains bad characters, you can use this function to correct them :
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
We recently ran into a similar issue and were unable to find anything obvious as the cause. There turned out to be a control character in our string, but when we output that string to the browser the character was not visible unless we copied the text into an IDE.
We managed to solve our problem thanks to this post and this:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
Instead of using javascript, you can simply put this line of code after your mysql_connect sentence:
mysql_set_charset('utf8',$connection);
Cheers.
Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.
If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).
If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
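For that simple "UTF-8 versus ISO-8859-1" case, a sketch of a fallback might look like this. It assumes ext/mbstring, and assumes that anything which is not valid UTF-8 is ISO-8859-1, which is a guess about the provider rather than a detection. The helper name ensure_utf8 and the sample bytes are invented. mb_convert_encoding() is used because utf8_encode() is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: leave valid UTF-8 untouched, otherwise treat the
// bytes as the fallback encoding and convert them to UTF-8.
function ensure_utf8(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// Invented sample: declared as UTF-8, but "í" is the Latin-1 byte 0xED.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn, Irlanda</city>";

$xml = simplexml_load_string(ensure_utf8($raw));
$city = (string) $xml;
echo $city, "\n";
```

Because valid UTF-8 passes through unchanged, this avoids the double-encoding that a blanket utf8_encode() call would cause on a feed that is already correct.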
If you download the XML file and open it in, for example, Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)
The string <?xml version="1.0" encoding="UTF-8"?> doesn't set the encoding of the document; it's only information for a validator or another resource.
I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.
I used Sublime to change the file encoding to utf-8, and lxml imported it no issues.
After several tries I found that the htmlentities function works.
$value = htmlentities($value);
What I was facing was solved by what Erik proposed
https://stackoverflow.com/a/4575802/14934277
and it IS, actually, the only way to know if your data is okay to be printed.
And here is a piece of code that could be useful to anyone out there:
$product_desc = ..;
//Filter your $product_desc here. Remove tags, strip, do all you would do to print XML
try{(new SimpleXMLElement('<sth><![CDATA['.$product_desc.']]></sth>'))->asXML();}
catch(Exception $exc) {$product_desc = '';}; //Don't print trash
Note that part.
<![CDATA[]]>
When you try to create an XML out of it, be sure to pass it the final product a browser would see, meaning, having your field wrapped with CDATA
When generating mapping files using doctrine I ran into same issue. I fixed it by removing all comments that some fields had in the database.
