XML Entity not defined in Chrome - php

My browser is telling me:
error on line 2 at column 308899: Entity 'ntilde' not defined
and the specific line is in my xml as:
<LastName>Trevi&ntilde;o</LastName>
the name was originally Treviño, but it was modified via php's htmlentities function.
What can I do to get php and xml to play nicely?
Using Chrome 19 on Mac.

Apparently using htmlspecialchars and htmlentities in tandem does the trick.
htmlspecialchars(htmlentities($value));

Are you actually generating XML, or HTML? They're not the same thing. HTML defines a bunch of entities (IIRC) whereas XML has very few "built-in" - just a few like &amp; and &lt;.
Why bother using the entity when you can just use the text directly? Simply make sure you're consistent about the encoding you use (UTF-8 would be a good bet).
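A minimal sketch of that advice, assuming the PHP source file and the value are both UTF-8 encoded; the element names are taken from the question, and the variable names are only illustrative. Only the characters that are markup in XML get escaped, and the ñ is written as a literal character:

```php
<?php
// Assumption: this file and $lastName are UTF-8 encoded.
$lastName = 'Treviño';

// Escape only the characters that are markup in XML (& < > " ');
// "ñ" is left as a literal UTF-8 character.
$escaped = htmlspecialchars($lastName, ENT_XML1 | ENT_QUOTES, 'UTF-8');

$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<Customer><LastName>' . $escaped . '</LastName></Customer>';

echo $xml, "\n";
```

The encoding named in the XML declaration has to match the bytes that are actually sent.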

You don't need to encode the tilde character in XML; the HTML entity will throw errors. The best thing to do in this case might be to wrap the text in CDATA.
<LastName><![CDATA[Treviño]]></LastName>

Try to specify encoding for htmlentities
htmlentities($string,ENT_QUOTES,'UTF-8');

It should be: Trevi&#241;o.
An Entity may be missing from your DTD file?
something like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Customer
[
<!ENTITY ntilde "ñ">
]>
<Customer>
<LastName>Trevi&ntilde;o</LastName>
</Customer>

Related

How to pass ä in xml using PHP

I am using a payment gateway to send xml via CURL. I am getting the following error when I use an XML Validator:
Errors in the XML document: The entity "auml" was referenced, but not
declared.
So I understand the problem lies with the ä, however I am unsure on how to fix this using PHP.
Here is the xml request I am passing:
<request type='payer-new' timestamp='XXXXXX'>
<merchantid>XXXXXXXXX</merchantid>
<orderid>XXXXXXXXX</orderid>
<payer type='Business' ref='XXXXXXXXXXXXX'>
<firstname>X&auml;xxxx</firstname>
<surname>x&auml;xxxxxx</surname>
<address>
<line1>XXXXXXXXXXXXX</line1>
<line2>XXXXXXXXXXXXX</line2>
<city>XXXXXXXXXXXXXXXX</city>
<postcode>XXXXXXXXXXXX</postcode>
<country code='FI'>Finland</country>
</address>
<phonenumbers>
<home>XXXXXXXXXXXXXXXXXXXXXXXXX</home>
</phonenumbers>
<email>XXXXXXXXXXXXXXXXXXXXXXXXXXXXX</email>
</payer>
<sha1hash>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</sha1hash>
</request>
I wrap htmlentities around all of the variables going into the request like so:
".htmlentities($_SESSION['W_CUSTOMER_FIRSTNAME'], ENT_QUOTES,
"UTF-8")."
Is there a way that will work with all kinds of characters / names / places etc that contain these characters?
Many thanks in advance
&auml; is an HTML entity code, not a generic XML one.
Generic XML only understands five named entities: &amp;, &lt;, &gt;, &quot; and &apos;.
If you want to use any other named entities such as &auml;, those entities must be declared in a DTD for the document. Some standardised XML dialects have DTDs which define named entities, but most do not, and if you don't have one, then you definitely won't be able to use any other named entities.
So instead of using named entities in XML, it is generally better to use numeric entities. These take the form of &#1234;, where 1234 is the character code for the character you want. For an auml character, the code you need is &#228;. Note that these numeric entity codes also work fine in HTML.
You can find a list of some of the more useful character codes here: http://www.econlib.org/library/asciicodes.html
Annoyingly, the core PHP string functions don't include one that produces these numeric XML entities. htmlentities() is not suitable, as it produces HTML named entities, and htmlspecialchars() only escapes the markup characters and leaves characters such as ä untouched. So we have to write our own.
You'll need to use the ord() function to get the character code, but be aware of multi-byte characters. There is actually a reasonable attempt at an xmlentities() function in the comments on the manual page for htmlentities(), which you could try. I know other implementations exist, though.
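One way to sidestep the multi-byte problem with ord(), assuming the mbstring extension is available, is to let mb_encode_numericentity() produce the numeric references. The sketch below is not the xmlentities() function from the manual comments; the name xml_numeric_entities is made up for illustration, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper; requires the mbstring extension and UTF-8 input.
function xml_numeric_entities(string $value): string
{
    // First escape the XML markup characters (& < > " ').
    $escaped = htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');

    // Then turn every code point from U+0080 upward into a decimal reference.
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

echo xml_numeric_entities('Xäxxxx & Söhne'), "\n";
```

The result is plain ASCII, so it survives even if the receiving side gets the charset wrong.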
The fundamental problem is you have an XML document that is using HTML entities to encode things. An XML Validator knows nothing about HTML-specific entities, and so will choke.
I would hope that there is an XSD (schema) for the XML; it should really be declared in the root tag with an xmlns declaration and possibly with a xsi:schemaLocation too. This XSD file would be the right place to xsd:import the html entities that would enable your validator to validate correctly. There should also be an <?xml vers... > tag as the first line.
That said, I suspect that the receiving application won't care what the validator says, and that your response file is probably just fine, assuming the receiver knows about html entities too.
If not, you need to decode the HTML entities into actual UTF-8 characters, but probably only in the text elements of the DOM (e.g. the content of <email>, not the whole text). Doing this with PHP's html_entity_decode() would seem reasonable. If you do this, you definitely need the <?xml> tag to include the file charset.
HTH
I wrap htmlentities around all of the variables going into the request like so: ...
There is your problem. You're creating the XML string "by hand". Not that it wouldn't be possible to do so; it's just easy to make mistakes that way. One hint could be the name of the function you already use: it starts with "html", which is not XML.
Anyway, before discussing in depth to what extent interpolating strings can cause trouble when creating XML and when such problems arise, it's much easier to use an XML library to create the XML.
An XML library allows you to encode all data properly (so you won't see such errors) and with ease. In PHP there are normally three:
SimpleXML
DOM
XMLWriter
Take the one you can work best with.
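As a rough illustration of the third option, here is a small XMLWriter sketch. The element names follow the request in the question; the values are made up, and the string is assumed to be UTF-8. The library does the escaping, so no htmlentities() call is involved:

```php
<?php
// Element names follow the request in the question; the values are made up.
$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'UTF-8');

$writer->startElement('payer');
$writer->writeAttribute('type', 'Business');
$writer->writeElement('firstname', 'Xäxxxx');
$writer->writeElement('surname', 'Müller & Söhne'); // "&" is escaped by the writer
$writer->endElement(); // </payer>

$writer->endDocument();
$xml = $writer->outputMemory();

echo $xml;
```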
Alternatively you can verify that the XML you created "by hand" is well-formed before you send it to the remote service, by use of one of the following XML libraries as they are also XML parsers:
SimpleXML
DOM
Q&A material on how to create an XML document with either of these already exists on this website, including examples and comments on them, so I won't duplicate that content in my answer. The same goes for XML validation.
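For the well-formedness check mentioned above, a sketch along these lines may be enough. The helper name xml_wellformedness_errors is hypothetical; it only reports what the parser says and does not validate against any schema.

```php
<?php
// Hypothetical helper: returns the parser messages, or an empty array
// if the string is well-formed XML.
function xml_wellformedness_errors(string $xml): array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $messages = [];
    if (simplexml_load_string($xml) === false) {
        foreach (libxml_get_errors() as $error) {
            $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

print_r(xml_wellformedness_errors('<firstname>X&auml;xxxx</firstname>'));
print_r(xml_wellformedness_errors('<firstname>X&#228;xxxx</firstname>'));
```

The first call should report the undefined entity; the second should return an empty array.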
Example of an XML preset (pattern) of a request of which some parameters are set. Here with SimpleXML:
$pattern = <<<REQUEST_PATTERN
<request type='payer-new' timestamp='XXXXXX'>
<merchantid>XXXXXXXXX</merchantid>
<orderid>XXXXXXXXX</orderid>
<payer type='Business' ref='XXXXXXXXXXXXX'>
<firstname></firstname>
<surname></surname>
<address>
<line1>XXXXXXXXXXXXX</line1>
<line2>XXXXXXXXXXXXX</line2>
<city>XXXXXXXXXXXXXXXX</city>
<postcode>XXXXXXXXXXXX</postcode>
<country code='FI'>Finland</country>
</address>
<phonenumbers>
<home>XXXXXXXXXXXXXXXXXXXXXXXXX</home>
</phonenumbers>
<email>XXXXXXXXXXXXXXXXXXXXXXXXXXXXX</email>
</payer>
<sha1hash>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</sha1hash>
</request>
REQUEST_PATTERN;
$xml = simplexml_load_string($pattern);
$xml->payer->firstname = 'Äpfel';
$xml->payer->surname = 'Wachsen-Überirdisch';
# ...
// just an assumed way on how you would pass the XML string
// to the API via CURL (here as HTTP POST request body)
curl_setopt($handle, CURLOPT_POSTFIELDS, $xml->asXML());
The XML that would be passed to the remote service would always(*) be XML encoded in a proper way:
<?xml version="1.0"?>
<request type="payer-new" timestamp="XXXXXX">
<merchantid>XXXXXXXXX</merchantid>
<orderid>XXXXXXXXX</orderid>
<payer type="Business" ref="XXXXXXXXXXXXX">
<firstname>&#xC4;pfel</firstname>
<surname>Wachsen-&#xDC;berirdisch</surname>
<address>
<line1>XXXXXXXXXXXXX</line1>
<line2>XXXXXXXXXXXXX</line2>
<city>XXXXXXXXXXXXXXXX</city>
<postcode>XXXXXXXXXXXX</postcode>
<country code="FI">Finland</country>
</address>
<phonenumbers>
<home>XXXXXXXXXXXXXXXXXXXXXXXXX</home>
</phonenumbers>
<email>XXXXXXXXXXXXXXXXXXXXXXXXXXXXX</email>
</payer>
<sha1hash>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</sha1hash>
</request>
(*) there are some rare circumstances where this would not be the case, but they should not play any role for this example: The SimpleXML library requires properly UTF-8 encoded strings to work.
I created this function for safely escaping strings for XML:
/**
* Safe symbol escaping for XML. It's very similar to htmlspecialchars for HTML + MySQL.
* @param string $string
* @return string
*/
public static function xmlentities(string $string): string
{
return htmlspecialchars($string, ENT_XML1, 'UTF-8', true);
}
Usage:
$str = 'Dafür hörten "der Relativen Schw&auml;che&ldquo;-Entwicklung gegen&uuml;ber den Wall Street';
$str = static::xmlentities($str);
echo '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<set>'.$str.'</set>
</urlset>';
Explanation:
ENT_XML1 selects the XML 1 entity table, so only the characters that are special in XML are escaped
'UTF-8' sets the charset, so ä and other German/national characters are passed through as valid characters
true -- the last parameter is double_encode, and true is its default: every & in the input is escaped, including entities that are already there (e.g. after loading from the DB or from an API), so &auml; becomes &amp;auml;. The result is well-formed XML, but a consumer will see the literal text &auml; rather than ä
Remember to put the charset declaration in the document header
<?xml version="1.0" encoding="UTF-8"?>
or to set some other charset for htmlspecialchars
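If the incoming string may already contain HTML entities and the goal is to get the real characters into the XML, one possible approach is to decode first and escape afterwards. This is a sketch, not the function above; the name xml_text_from_html is hypothetical and UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: normalises a value that may already contain HTML entities.
function xml_text_from_html(string $value): string
{
    // 1. Turn HTML entities (&auml;, &ldquo;, ...) back into UTF-8 characters.
    $decoded = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 2. Escape only what XML treats as markup.
    return htmlspecialchars($decoded, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

$str = 'Schw&auml;che &amp; gegen&uuml;ber';

// With double_encode = true every ampersand is escaped again.
echo htmlspecialchars($str, ENT_XML1, 'UTF-8', true), "\n";

// Decoding first yields the real characters.
echo xml_text_from_html($str), "\n";
```

One caveat: a value in which the user really meant the literal text "&auml;" would be decoded as well, so this only fits data known to be HTML-encoded.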
I had the same issue and resolved it like this:
$str = 'Dafür hörten "der Relativen Schw&auml;che&ldquo;-Entwicklung gegen&uuml;ber den Wall Street';
echo htmlspecialchars($str, ENT_XML1, 'UTF-8', true);
I hope this is useful and works for you.

cast simplexmlelement to string to get inner content but keep htmlspecialchars escaped

I have an XML file:
$xml = <<<EOD
<?xml version="1.0" encoding="utf-8"?>
<metaData xmlns="http://www.test.com/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="test">
<qkc6b1hh0k9>testdata&amp;more</qkc6b1hh0k9>
</metaData>
EOD;
Now I loaded it into a SimpleXML object, and later on I wanted to get the inner content of the "qkc6b1hh0k9" node:
$xmlRootElem = simplexml_load_string( $xml );
$xmlRootElem->registerXPathNamespace( 'xmlns', "http://www.test.com/" );
// ...
$xPathElems = $xmlRootElem->xpath( './'."xmlns:qkc6b1hh0k9" );
$var = (string)($xPathElems[0]);
var_dump($var);
I expected to get the string
testdata&amp;more
... but I got
testdata&more
Why is the __toString() method of simplexmlobject converting my escaped specialchars to normal chars? Can I deactivate this behaviour?
I came up with a temp-solution, which I consider as dirty, what do you say?
(strip_tags($xPathElems[0]->asXML()))
May the DOMDocument be an alternative?
Thanks for any help on my questions!
Edit:
Problem solved. The problem was not in the __toString method of SimpleXML; it arose later, when using the string with addChild.
The behaviour described above is totally fine and to be expected, as you can see in the answers.
Problems only came up when the value was added to another XML document via addChild.
Since addChild doesn't escape the ampersand (http://www.php.net/manual/de/simplexmlelement.addchild.php#103587), one has to do it manually.
Why is the __toString() method of simplexmlobject converting my escaped specialchars to normal chars? Can I deactivate this behaviour?
Because those "special" chars are actually the XML encoding of characters. Using the string value gives you these characters verbatim again. That is what an XML parser is made for.
I came up with a temp-solution, which I consider as dirty, what do you say?
Well, shaky. Instead, let me suggest the inverse: XML-encode the string:
$var = htmlspecialchars((string) $xPathElems[0]);
var_dump($var);
May the DOMDocument be an alternative?
No. Like SimpleXML, it is an XML parser, and therefore you get the text decoded as well. This is not fully true (you can do that with DOMDocument by going through all child nodes and picking entity nodes next to character data), but it's much more work than the htmlspecialchars() approach outlined above.
If you create an XML tag, by any sane method, and set it to contain the string "testdata&more", this will be escaped as testdata&amp;more. It is therefore only logical that extracting that string content back out reverses the escaping procedure to give you the text you put in.
The question is, why do you want the XML-escaped representation? If you want the content of the element as intended by the author, then __toString() is doing the right thing; there is more than one way of representing that string in XML, but it is the data being represented that you should normally care about.
If for some reason you really need details of how the XML is constructed in that particular instance, you could use a more complex parsing framework such as DOM, which will separate testdata&amp;more into a text node (containing "testdata"), an entity node (with name "amp"), and another text node (containing "more").
If, on the other hand, all you want is to put it back into another XML (or HTML) document, then let SimpleXML do the unescaping properly, and re-escape it at the appropriate time.
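The round trip described above could look roughly like this. The documents are shortened stand-ins for the ones in the question (no namespace, made-up element names); the point is only that the value is read in decoded form and escaped again by hand when it goes through addChild().

```php
<?php
// Shortened stand-in for the source document in the question.
$source = simplexml_load_string('<metaData><node>testdata&amp;more</node></metaData>');

// Casting to string yields the decoded text: testdata&more
$value = (string) $source->node;

$target = new SimpleXMLElement('<copy/>');

// addChild() does not escape "&" in the value, so escape it manually.
$target->addChild('node', htmlspecialchars($value, ENT_XML1, 'UTF-8'));

echo $target->asXML();
```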

XML Conform Linebreak in PHP

I'm doing an XML generation where the main data is coming from xsl transformation (but that's not the problem, it's just the reason why I'm not using PHP DOM or SimpleXML).
Like this:
$xml = '<?xml version="1.0" encoding="utf-8"?>' . PHP_EOL;
$xml .= '<rootElement>';
foreach($xslRenderings as $rendering) {
$xml .= $rendering;
}
$xml .= '</rootElement>';
The resulting XML validates against its XSD here http://www.freeformatter.com/xml-validator-xsd.html and here http://xsdvalidation.utilities-online.info/.
But fails here: http://www.xmlforasp.net/schemavalidator.aspx,
Unexpected XML declaration. The XML declaration must be the first node in the
document, and no white space characters are allowed to appear before it.
Line 2, position 3.
If I manually remove the line break produced by PHP_EOL and hit return, it validates.
I assume that it's an error in the last schema validator. Or is PHP_EOL (or a manual break in PHP) something that is a problem for some validators? If yes, how do I fix that?
I'm asking because the resulting XML will be sent to a .NET service and the last validator is built with .NET.
EDIT
The XML looks like this; the schema can be found here: http://cb.heimat.de/interface/schema/interfaceformat.xsd
<?xml version="1.0" encoding="utf-8"?>
<dataset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://cb.heimat.de/interface/schema/interfaceformat.xsd">
<production foreignId="1327" id="0" cityId="6062" productionType="3" subCategoryId="7013" keywords="" productionStart="" productionEnd="" url=""><title languageId="1">
...
</production>
You really have to look at the generated XML as a binary stream to understand what is going on. I'll try to explain what you should look at...
I'll show you a dump of an invalid XML (similar to yours) to help illustrate:
The first three bytes are the Byte Order Mark, which may be encountered in text files and streams (in this case UTF-8). Bytes of that kind would never cause a compliant XML parser to trip, since they are used as a hint for determining the encoding scheme.
The next two bytes (0x0D0A) are a new line on the Windows platform. Those should cause any XML parser to reject the document as not well-formed. According to the current XML 1.0 standard, white space is not allowed before the XML declaration.
On .NET you'll get an error such as the one you described. Java (Xerces-based) would say something more cryptic: The processing instruction target matching "[xX][mM][lL]" is not allowed.
Removing any white space before your first < should fix this error message. All you have to do is understand how that white space gets there...
From what you describe it looks as if the XML PI gets somehow dropped before using the XML.
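To see what actually precedes the first < in a generated string, a small sketch like the following can help. The helper name split_xml_prefix and the sample input are made up for illustration; note that it also drops a BOM, which should be harmless for a UTF-8 document with a declared encoding.

```php
<?php
// Hypothetical helper: splits a string into whatever precedes the first "<"
// (as hex, for inspection) and the document starting at that "<".
function split_xml_prefix(string $xml): array
{
    $pos = strpos($xml, '<');
    if ($pos === false) {
        return ['prefix_hex' => bin2hex($xml), 'document' => ''];
    }

    return [
        'prefix_hex' => bin2hex(substr($xml, 0, $pos)),
        'document'   => substr($xml, $pos),
    ];
}

// Sample input: UTF-8 BOM, then a Windows line break, then the declaration.
$xml = "\xEF\xBB\xBF\r\n" . '<?xml version="1.0" encoding="utf-8"?><rootElement/>';

$parts = split_xml_prefix($xml);
echo $parts['prefix_hex'], "\n";
echo $parts['document'], "\n";
```

An empty prefix_hex means nothing precedes the declaration, so the cause of the error lies elsewhere in the string.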

Issue parsing XML, unknown encoding

I'm trying to read an XML feed. I'm not sure the encoding is proper, but it's set to UTF-8, and when I try to parse it in PHP via SimpleXML, it errors on "B&ouml;&eth;Var" (note the special "o" characters).
libxml_use_internal_errors(TRUE);
$XMLOutputXMLObj = simplexml_load_string($xml_string);
if($XMLOutputXMLObj !== FALSE)
{
//do stuff
}
This is all I get for an error:
Entity 'ouml' not defined
Entity 'eth' not defined
I tried using "mb_convert_encoding", in various ways, but that failed.
How can I resolve this issue for any character? I.e. WITHOUT manually replacing &ouml; with &214; (with # of course)?
Even better... is there a way to make it so SimpleXML doesn't care what it is parsing, as long as the tags are intact?
Thanks
Have you tried to escape the XML data in the node using the <![CDATA[ and ]]> tags before and after the node's text/value? E.g.
<?xml version="1.0" encoding="UTF-8"?>
<fmsdata>
<result><![CDATA[Success !##$%^&*()]]></result>
</fmsdata>
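If the feed itself cannot be changed, another option for the "Entity 'ouml' not defined" errors is to replace the HTML named entities with real characters before handing the string to SimpleXML. The sketch below is only one way to do that; html_entities_to_utf8 is a hypothetical name, the feed is assumed to be UTF-8, and entities inside CDATA sections would be rewritten as well.

```php
<?php
// Hypothetical pre-processing step for a feed that uses HTML named entities.
function html_entities_to_utf8(string $xml): string
{
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m): string {
            // Leave XML's own predefined entities alone.
            if (in_array($m[1], ['amp', 'lt', 'gt', 'quot', 'apos'], true)) {
                return $m[0];
            }

            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($decoded === $m[0]) {
                return $m[0]; // unknown name: keep it so the parser still reports it
            }

            // Re-escape in case the entity decoded to a markup character.
            return htmlspecialchars($decoded, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        },
        $xml
    );
}

$feed = '<feed><name>B&ouml;&eth;Var &amp; Co</name></feed>';
$doc  = simplexml_load_string(html_entities_to_utf8($feed));

echo (string) $doc->name, "\n";
```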

XML Generated by DOMDocument with Line Break

I am creating XML files with PHP DOMDocument, and these XML files can not contain line breaks.
But when I use the method "saveXML()", the generated XML comes with a line break between the definition and the initial tag, like this:
<?xml version="1.0" encoding="UTF-8"?>
<NFe xmlns="http://www.portalfiscal.inf.br/nfe"><infNFe...
Can I correct this problem in DOMDocument? Or do I have to do it after I generate the XML?
I'd like to correct this problem to get a result like this:
<?xml version="1.0" encoding="UTF-8"?><NFe xmlns="...
By default, DOMDocument::$preserveWhiteSpace is true. Try setting it to false on the document in question, then calling saveXML again. This may have side effects should any whitespace inside the document actually matter. You should also make sure that DOMDocument::$formatOutput is false.
As said by Gordon, though, there is no logical reason whatsoever for the whitespace restriction. Though seriously, if you don't want any whitespace in there whatsoever, just make sure any CR/LF characters that you want to keep are entity-encoded and then $nonewlines = preg_replace("/[\r\n]/", '', $xml) to yank out the newlines that might remain after turning off Preserve and Format. But again, that's silly.
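Put together, the suggestion above might look like this. The namespace and element names come from the question; the rest is a sketch. Line breaks inside text content would be removed too, unless they are entity-encoded first.

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->preserveWhiteSpace = false;
$doc->formatOutput = false;

$ns   = 'http://www.portalfiscal.inf.br/nfe';
$root = $doc->createElementNS($ns, 'NFe');
$root->appendChild($doc->createElementNS($ns, 'infNFe'));
$doc->appendChild($root);

// saveXML() puts a newline after the XML declaration (and one at the very end).
// Removing CR/LF afterwards joins the declaration and the root tag.
$xml = preg_replace("/[\r\n]/", '', $doc->saveXML());

echo $xml, "\n";
```

An alternative that avoids the regular expression is to write the declaration yourself and append $doc->saveXML($doc->documentElement), which serialises only the root element.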
