cast simplexmlelement to string to get inner content but keep htmlspecialchars escaped - php

I have an XML file:
$xml = <<<EOD
<?xml version="1.0" encoding="utf-8"?>
<metaData xmlns="http://www.test.com/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="test">
<qkc6b1hh0k9>testdata&amp;more</qkc6b1hh0k9>
</metaData>
EOD;
I loaded it into a SimpleXML object and later on wanted to get the inner content of the "qkc6b1hh0k9" node:
$xmlRootElem = simplexml_load_string( $xml );
$xmlRootElem->registerXPathNamespace( 'xmlns', "http://www.test.com/" );
// ...
$xPathElems = $xmlRootElem->xpath( './'."xmlns:qkc6b1hh0k9" );
$var = (string)($xPathElems[0]);
var_dump($var);
I expected to get the string
testdata&amp;more
... but I got
testdata&more
Why is the __toString() method of simplexmlobject converting my escaped specialchars to normal chars? Can I deactivate this behaviour?
I came up with a temp-solution, which I consider as dirty, what do you say?
(strip_tags($xPathElems[0]->asXML()))
May the DOMDocument be an alternative?
Thanks for any help on my questions!
edit
Problem solved. The problem was not in the __toString method of SimpleXML; it came later, when using the string with addChild.
The behaviour described above is totally fine and is to be expected, as you can see in the answers.
Problems only came up when the value was added to another XML document via "addChild".
Since addChild doesn't escape the ampersand (http://www.php.net/manual/de/simplexmlelement.addchild.php#103587), one has to do it manually.

Why is the __toString() method of simplexmlobject converting my escaped specialchars to normal chars? Can I deactivate this behaviour?
Because those "special" chars are actually the XML encoding of characters. Using the string value gives you these characters verbatim again. That is what an XML parser is made for.
I came up with a temp-solution, which I consider as dirty, what do you say?
Well, shaky. Instead, let me suggest the inverse: XML-encode the string:
$var = htmlspecialchars($xPathElems[0]);
var_dump($var);
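As a minimal sketch of that round trip (the document and the element name node are made up for illustration, and the flags ENT_XML1 | ENT_QUOTES are one reasonable choice rather than the only one):

```php
<?php
// Hypothetical one-element document; the parser decodes &amp; to & on read.
$xml = simplexml_load_string('<metaData><node>testdata&amp;more</node></metaData>');

// Casting to string yields the decoded text content.
$text = (string) $xml->node;

// Re-encode for XML output only at the point where markup is produced.
$escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');

echo $text, "\n";    // testdata&more
echo $escaped, "\n"; // testdata&amp;more
```

The decoded value is what the application should work with; the escaped form only exists at the boundary where text becomes markup again.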
May the DOMDocument be an alternative?
No. Like SimpleXML, DOMDocument is an XML parser, so you get the decoded text there as well. Strictly speaking this is not the whole truth (with DOMDocument you can walk all child nodes and pick out entity nodes next to character data), but that is much more work than the htmlspecialchars() call outlined above.

If you create an XML tag, by any sane method, and set it to contain the string "testdata&more", this will be escaped as testdata&amp;more. It is therefore only logical that extracting that string content back out reverses the escaping procedure to give you the text you put in.
The question is, why do you want the XML-escaped representation? If you want the content of the element as intended by the author, then __toString() is doing the right thing; there is more than one way of representing that string in XML, but it is the data being represented that you should normally care about.
If for some reason you really need details of how the XML is constructed in that particular instance, you could use a more complex parsing framework such as DOM, which can separate testdata&amp;more into a text node (containing "testdata"), an entity node (with name "amp"), and another text node (containing "more").
If, on the other hand, all you want is to put it back into another XML (or HTML) document, then let SimpleXML do the unescaping properly, and re-escape it at the appropriate time.
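A small sketch of "re-escape at the appropriate time" when copying a value into another SimpleXML document. The element names (item, viaAddChild, viaProperty) are invented for the example. It assumes the behaviour reported in the question's edit, namely that addChild() does not escape ampersands itself, and contrasts it with property assignment, which takes plain text:

```php
<?php
// Source document: the parser hands back the decoded text "testdata&more".
$source = simplexml_load_string('<a><item>testdata&amp;more</item></a>');
$value  = (string) $source->item;

$target = new SimpleXMLElement('<b/>');

// addChild() treats its value as markup-like text, so escape it first.
$target->addChild('viaAddChild', htmlspecialchars($value, ENT_XML1, 'UTF-8'));

// Property assignment accepts plain text and escapes it on serialisation.
$target->viaProperty = $value;

echo $target->asXML();
```

Both children should serialise with &amp; and read back as the same decoded string.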

Related

PHP return XML string with values added to attributes missing values

I have to parse HTML and "HTML" from emails. I've already managed to create a function that cleans most of the errors such as improper nesting of elements.
I'm trying to determine how best to tackle the issue of HTML attributes that are missing values. We must parse everything ultimately as XML so well-formed HTML is a must as well.
The cleaning function starts off simple enough:
$xml = explode('<', $xml);
We quickly determine opening and closing tags of elements.
However once we get to attributes things get really messy really quickly:
Missing values.
People using single quotes instead of double quotes.
Attribute values may contain single quotes.
Here is an example of an HTML string we have to parse (a p element):
$s = 'p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text';
We do not care what those attributes are; our goal is simply to fix the XML so that it is well-formed as demonstrated by the following string:
$s = 'p obnoxious="true" nonprofessional="true" style="wrong: lulz-immature" dunno="true">Some paragraph text';
We're not interested in attribute="attribute" as that is just extra work (most email is frivolous) so we're simply interested in appending ="true" for each attribute missing a value just to prevent the XML parser on client browsers from failing over the trivialities of someone somewhere else not doing their job.
As I mentioned earlier we only need to fix the attributes which are missing values and we need to return a string. At this point all other issues of malformed XML have been addressed. I'm not sure where I should start as the topic is such a mess. So...
We're open to sending the entire XML string as a whole to be parsed and returned back as a string with some built-in library. If this option is chosen, presume that the XML is well-formed with a proper XML declaration (<?xml version="1.0" encoding="UTF-8"?>).
We're open to manually creating a function to address whatever we encounter though we're not interested in building a validator as much of the "HTML" we receive screams 1997.
We are working with the XML as a single string or an array (your pick); we are explicitly not dealing with files.
How do we with reasonable effort ensure that an XML string (in part or whole) is returned as a string with values for all attributes?
The DOM extension may solve your problem:
$doc = new DOMDocument('1.0');
$doc->loadHTML('<p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text');
echo $doc->saveXML();
The above code will result in the following output:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p obnoxious="" nonprofessional="" style="wrong: lulz-immature" dunno="">Some paragraph text</p></body></html>
You may replace every ="" with ="true" if you want, but the output is already well-formed XML.
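If the ="true" values are wanted, one possible approach is to set them on the DOM instead of doing a string replace on the output. This is only a sketch: the XPath expression //@*[. = ""] selects every attribute with an empty value, so an attribute that was deliberately written as alt="" would be changed as well.

```php
<?php
// Collect parser diagnostics internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument('1.0');
$doc->loadHTML("<p obnoxious nonprofessional style='wrong: lulz-immature' dunno>Some paragraph text");

// Give every attribute that has an empty value the value "true".
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//@*[. = ""]') as $attr) {
    $attr->value = 'true';
}

// Serialise only the <p> element, without the html/body wrapper.
$p = $doc->getElementsByTagName('p')->item(0);
echo $doc->saveXML($p), "\n";
```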

How to pass &auml; in xml using PHP

I am using a payment gateway to send xml via CURL. I am getting the following error when I use an XML Validator:
Errors in the XML document: The entity "auml" was referenced, but not
declared.
So I understand the problem lies with the &auml;, however I am unsure how to fix this using PHP.
Here is the xml request I am passing:
<request type='payer-new' timestamp='XXXXXX'>
<merchantid>XXXXXXXXX</merchantid>
<orderid>XXXXXXXXX</orderid>
<payer type='Business' ref='XXXXXXXXXXXXX'>
<firstname>X&auml;xxxx</firstname>
<surname>x&auml;xxxxxx</surname>
<address>
<line1>XXXXXXXXXXXXX</line1>
<line2>XXXXXXXXXXXXX</line2>
<city>XXXXXXXXXXXXXXXX</city>
<postcode>XXXXXXXXXXXX</postcode>
<country code='FI'>Finland</country>
</address>
<phonenumbers>
<home>XXXXXXXXXXXXXXXXXXXXXXXXX</home>
</phonenumbers>
<email>XXXXXXXXXXXXXXXXXXXXXXXXXXXXX</email>
</payer>
<sha1hash>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</sha1hash>
</request>
I wrap htmlentities around all of the variables going into the request like so:
".htmlentities($_SESSION['W_CUSTOMER_FIRSTNAME'], ENT_QUOTES,
"UTF-8")."
Is there a way that will work with all kinds of characters / names / places etc that contain these characters?
Many thanks in advance
&auml; is an HTML entity code, not a generic XML one.
Generic XML only understands five named entities: &amp;, &lt;, &gt;, &apos; and &quot;.
If you want to use any other named entities such as &auml;, those entities must be declared in a DTD (document type definition) that the document references. Some standardised XML dialects have DTDs which define named entities, but most do not, and if your document has no DTD, then you definitely won't be able to use any named entities.
So instead of using named entities in XML, it is generally better to use numeric entities. These take the form of &#1234;, where 1234 is the character code for the character you want. For an auml character, the code you need is &#228;. Note that these numeric entity codes also work fine in HTML.
You can find a list of some of the more useful character codes here: http://www.econlib.org/library/asciicodes.html
Annoyingly, there isn't a dedicated core PHP function that produces these numeric XML entities. htmlentities() is not suitable, as it produces named entities, and htmlspecialchars() only escapes the XML special characters and leaves characters such as ä untouched. So we have to write our own.
You'll need to use the ord() function to get the character code, but be aware of multi-byte characters. There is actually a reasonable attempt at an xmlentities() function in the comments on the manual page for htmlentities(), which you could try. I know other implementations exist, though.
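One way such a function could look, assuming the mbstring extension is available. The name xml_numeric_escape is made up for this sketch, and it relies on mb_encode_numericentity() rather than on ord(), which sidesteps the multi-byte problem:

```php
<?php
/**
 * Hypothetical helper: escape XML special characters, then turn every
 * non-ASCII character into a decimal numeric character reference.
 */
function xml_numeric_escape(string $text): string
{
    // Step 1: &, <, > and quotes become the entities XML predefines.
    $escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');

    // Step 2: code points U+0080..U+10FFFF become &#NNN; references.
    // Conversion map: [first code point, last code point, offset, mask].
    return mb_encode_numericentity($escaped, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

echo xml_numeric_escape('Xäx & Co'), "\n"; // X&#228;x &amp; Co
```

The result is plain ASCII, so it survives regardless of the encoding the receiving side assumes.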
The fundamental problem is you have an XML document that is using HTML entities to encode things. An XML Validator knows nothing about HTML-specific entities, and so will choke.
I would hope that there is an XSD (schema) for the XML; it should really be declared in the root tag with an xmlns declaration and possibly with an xsi:schemaLocation too. Note, however, that an XSD cannot declare entities: the HTML entities would have to be declared in a DTD referenced from a DOCTYPE before your validator could resolve them. There should also be an <?xml vers... > declaration as the first line.
That said, I suspect that the receiving application won't care what the validator says, and that your response file is probably just fine, assuming the receiver knows about html entities too.
If not, you need to decode the html entities into actual utf8 characters, but probably do so just on the text elements of the DOM (e.g. the content of <email> not the whole text). Doing this with php's html_entity_decode() would seem reasonable. If you do this you definitely need the <?xml> tag to include the file charset.
HTH
I wrap htmlentities around all of the variables going into the request like so: ...
There is your problem. You're creating the XML string "by hand". Not that it is impossible to do so; it's just easy to make mistakes that way. One hint could be the name of the function you already use: it starts with "html", which is not XML.
Anyway, before discussing in depth to what extent interpolating strings can cause trouble when creating XML and when such problems arise, it's much easier to use an XML library to create the XML.
An XML library allows you to encode all data properly (so you won't see such errors) and with ease. In PHP there are normally three:
SimpleXML
DOM
XMLWriter
Take the one you can work best with.
Alternatively you can verify that the XML you created "by hand" is well-formed before you send it to the remote service, by use of one of the following XML libraries as they are also XML parsers:
SimpleXML
DOM
Q&A material on how to create an XML document with either of these already exists on this website - even with examples and comments on them - so I don't duplicate such content in my answer. Same for the XML validation.
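For orientation only, a well-formedness check with SimpleXML and libxml's error handling might look like the sketch below. The function name is_well_formed_xml is hypothetical; it checks well-formedness only, not validity against a schema.

```php
<?php
/**
 * Hypothetical helper: returns true if $xml parses as well-formed XML.
 * Parser messages are collected in $errors instead of raising warnings.
 */
function is_well_formed_xml(string $xml, array &$errors = []): bool
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $ok = simplexml_load_string($xml) !== false;

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = trim($error->message);
    }

    // Restore the previous error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

// The undeclared HTML entity is rejected; the numeric reference is accepted.
var_dump(is_well_formed_xml('<firstname>X&auml;x</firstname>', $errors));
var_dump(is_well_formed_xml('<firstname>X&#228;x</firstname>'));
```

Running the check before sending the request catches the problem locally instead of at the remote validator.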
Example of an XML preset (pattern) of a request of which some parameters are set. Here with SimpleXML:
$pattern = <<<REQUEST_PATTERN
<request type='payer-new' timestamp='XXXXXX'>
<merchantid>XXXXXXXXX</merchantid>
<orderid>XXXXXXXXX</orderid>
<payer type='Business' ref='XXXXXXXXXXXXX'>
<firstname></firstname>
<surname></surname>
<address>
<line1>XXXXXXXXXXXXX</line1>
<line2>XXXXXXXXXXXXX</line2>
<city>XXXXXXXXXXXXXXXX</city>
<postcode>XXXXXXXXXXXX</postcode>
<country code='FI'>Finland</country>
</address>
<phonenumbers>
<home>XXXXXXXXXXXXXXXXXXXXXXXXX</home>
</phonenumbers>
<email>XXXXXXXXXXXXXXXXXXXXXXXXXXXXX</email>
</payer>
<sha1hash>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</sha1hash>
</request>
REQUEST_PATTERN;
$xml = simplexml_load_string($pattern);
$xml->payer->firstname = 'Äpfel';
$xml->payer->surname = 'Wachsen-Überirdisch';
# ...
// just an assumed way on how you would pass the XML string
// to the API via CURL (here as HTTP POST request body)
curl_setopt($handle, CURLOPT_POSTFIELDS, $xml->asXML());
The XML that would be passed to the remote service would always(*) be XML encoded in a proper way:
<?xml version="1.0"?>
<request type="payer-new" timestamp="XXXXXX">
<merchantid>XXXXXXXXX</merchantid>
<orderid>XXXXXXXXX</orderid>
<payer type="Business" ref="XXXXXXXXXXXXX">
<firstname>&#xC4;pfel</firstname>
<surname>Wachsen-&#xDC;berirdisch</surname>
<address>
<line1>XXXXXXXXXXXXX</line1>
<line2>XXXXXXXXXXXXX</line2>
<city>XXXXXXXXXXXXXXXX</city>
<postcode>XXXXXXXXXXXX</postcode>
<country code="FI">Finland</country>
</address>
<phonenumbers>
<home>XXXXXXXXXXXXXXXXXXXXXXXXX</home>
</phonenumbers>
<email>XXXXXXXXXXXXXXXXXXXXXXXXXXXXX</email>
</payer>
<sha1hash>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</sha1hash>
</request>
(*) there are some rare circumstances where this would not be the case, but they should not play any role for this example: The SimpleXML library requires properly UTF-8 encoded strings to work.
I created such function for XML strings safe replacing:
/**
* Safe symbol escaping for XML. It's very similar to htmlspecialchars for html + mysql.
* @param string $string
* @return string
*/
public static function xmlentities(string $string): string
{
return htmlspecialchars($string, ENT_XML1, 'UTF-8', true);
}
Usage:
$str = 'Dafür hörten "der Relativen Schw&auml;che&ldquo;-Entwicklung gegen&uuml;ber den Wall Street';
$str = static::xmlentities($str);
echo '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<set>'.$str.'</set>
</urlset>';
Explanation:
ENT_XML1 sets the list of symbols accepted for XML documents
'UTF-8' sets the charset so that ä and other German/national symbols are accepted as characters
true: the last parameter is double_encode, and true is its default. With it, entities that are already present in the incoming string (e.g. &auml; left over from loading from a DB or from an API) are escaped again to &amp;auml;, so no undeclared named entity ends up in the XML document
Remember to put charset declaration to the document header
<?xml version="1.0" encoding="UTF-8"?>
or to set some other charset for htmlspecialchars
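To illustrate what the double_encode default does to input that already contains HTML entities, here is a short sketch. The sample string is invented, and decoding first with html_entity_decode() is an alternative of my own, not part of the function above:

```php
<?php
$raw = 'Schw&auml;che & Co';

// Default (double_encode = true): the existing entity is escaped again,
// which keeps the XML well-formed but shows "&auml;" literally to readers.
$escapedAgain = htmlspecialchars($raw, ENT_XML1, 'UTF-8');

// Alternative: decode HTML entities to real characters, then escape for XML.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$clean   = htmlspecialchars($decoded, ENT_XML1, 'UTF-8');

echo $escapedAgain, "\n"; // Schw&amp;auml;che &amp; Co
echo $clean, "\n";        // Schwäche &amp; Co
```

Which of the two is right depends on whether the incoming string is meant as text or as HTML-encoded text.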
I suffered from the same issue and resolved it like this:
$str = 'Dafür hörten "der Relativen Schw&auml;che&ldquo;-Entwicklung gegen&uuml;ber den Wall Street';
echo htmlspecialchars($str, ENT_XML1, 'UTF-8', true);
I hope this is useful; it works for me.

PHP and DOM - parsing error an XML with inside entities

I have a xml :
<title>My title</title>
<text>This is a text and I love it <3 </text>
When I try to parse it with DOM, I have an error because of the "<3":
Warning: DOMDocument::loadXML(): StartTag: invalid element name in Entity...
Do you know how can I escape all inside special char but keeping my XML tree ? The goal is to use this method: $document->loadXML($xmlContent);
Thanks a lot for your answers.
EDIT: I forgot to say that I cannot modify the XML. I receive it like that and I have to deal with it...
The symbol "<" is a reserved character in XML and thus cannot be used literally in a text field. It should be replaced with the predefined entity:
&lt;
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
So the input text should be:
<title>My title</title>
<text>This is a text and I love it &lt;3 </text>
An XML built like that should be rejected, and whoever sends it should replace the predefined entities for the allowed values. Doing said task with tools like htmlentities() and htmlspecialchars(), as Y U NO WORK suggests, is easy and straightforward.
Now, if you really need to parse said data, you need to sanitize it prior to parsing. This is not a recommended behaviour, particularly if you are receiving arbitrary text, but if it is a set of known or predictable characters, regular expressions can do the job.
This one, in particular, will remove a single "<" contained in a "text" element composed by characters, numbers or white spaces:
$xmlContent = preg_replace('/(<text>[a-zA-Z 0-9]*)[<]?([a-zA-Z 0-9]*<\/text>)/', '$1&lt;$2', $xmlContent);
It is very specific, but it is done on purpose: regular expressions are really bad at matching nested structures, such as HTML or XML. Applying more arbitrary regular expressions to HTML or XML can have wildly unexpected behaviours.
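For comparison, a slightly more general heuristic is sketched below. It escapes any "<" that is not followed by a character that can start a tag, closing tag, comment, CDATA section or processing instruction. The function name escape_stray_lt and the root wrapper element are assumptions of this example, and it will still misfire on text such as "a <b" where a letter follows the "<".

```php
<?php
/**
 * Hypothetical helper: escape "<" characters that cannot start markup.
 */
function escape_stray_lt(string $xml): string
{
    // Negative lookahead: keep "<" only when a letter, "_", "/", "!" or "?" follows.
    return preg_replace('/<(?![A-Za-z_\/!?])/', '&lt;', $xml);
}

$raw = "<title>My title</title>\n<text>This is a text and I love it <3 </text>";

// The fragment has two top-level elements, so wrap it in a single root.
$doc = new DOMDocument();
$doc->loadXML('<root>' . escape_stray_lt($raw) . '</root>');

$text = $doc->getElementsByTagName('text')->item(0)->textContent;
echo $text, "\n"; // This is a text and I love it <3
```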
XML requires every element name to start with a letter or an underscore; nothing else is allowed, so the parser cannot accept <3 as the start of a tag.
A workaround for this could be htmlentities() or htmlspecialchars(). But even that won't add a valid character to the beginning, so you should think about either:
Manually adding a letter in front of the tag with an if
Reworking your XML so nothing like that can ever happen.
You need to put the content with special chars inside CDATA:
<text><![CDATA[This is a text and I love it <3 ]]></text>

PHP, SimpleXML, decoding entities in CDATA

I'm experiencing the following behavior:
$xml_string1 = "<person><name><![CDATA[ Someone&#039;s Name ]]></name></person>";
$xml_string2 = "<person><name> Someone&#039;s Name </name></person>";
$person = new SimpleXMLElement($xml_string1);
print (string) $person->name; # Someone&#039;s Name
$person = new SimpleXMLElement($xml_string2);
print (string) $person->name; # Someone's Name
$person = new SimpleXMLElement($xml_string1, LIBXML_NOCDATA);
print (string) $person->name; # Someone&#039;s Name
The php docs say that NOCDATA "Merge[s] CDATA as text nodes". To me this means that CDATA will then be treated the same as text nodes - or that the behavior of the 3rd example will now be the same as the 2nd example.
I don't have control over the XML (it's a feed from an external source), otherwise I'd just remove the CDATA tag as it does nothing and ruins the behavior I want.
Why does the above example behave the way that it does? Is there any way to make SimpleXML handle the CDATA nodes in the same way that it handles text nodes? What does "Merge CDATA as text nodes" actually do, since I don't seem to be understanding that option?
I'm currently decoding after I pull out the data, but the above example still doesn't make sense to me.
The purpose of CDATA sections in XML is to encapsulate a block of text "as is" which would otherwise require special characters (in particular, >, < and &) to be escaped. A CDATA section containing the character & is the same as a normal text node containing &amp;.
If a parser were to offer to ignore this, and pretend all CDATA nodes were really just text nodes, it would instantly break as soon as someone mentioned "P&O Cruises" - that & simply can't be there on its own (rather than as &amp;, or &somethingElse;).
The LIBXML_NOCDATA is actually pretty useless with SimpleXML, because (string)$foo neatly combines any sequence of text and CDATA nodes into an ordinary PHP string. (Something which people frequently fail to notice, because print_r doesn't.) This isn't necessarily true of more systematic access methods, such as DOM, where you can manipulate text nodes and CDATA nodes as objects in their own right.
What it effectively does is go through the document, and wherever it encounters a CDATA section, it takes the content, escapes it, and puts it back as an ordinary text node, or "merges" it with any text nodes to either side. The text represented is identical, just stored in the document in a different way; you can see the difference if you export back to XML, as in this example:
$xml_string = "<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>";
$person = new SimpleXMLElement($xml_string);
echo 'CDATA retained: ', $person->asXML();
// CDATA retained: <?xml version="1.0"?>
// <person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>
$person = new SimpleXMLElement($xml_string, LIBXML_NOCDATA);
echo 'CDATA merged: ', $person->asXML();
// CDATA merged: <?xml version="1.0"?>
// <person><name>Welcome aboard this P&amp;O Cruises voyage!</name></person>
If the XML document you're parsing contains a CDATA section which actually contains entities, you need to take that string and unescape it completely independent of the XML. One common reason to do this (other than laziness with poorly understood libraries) is to treat something marked up in HTML as just any old string inside an XML document, like this:
<Comment>
<SubmittedBy>IMSoP</SubmittedBy>
<Text><![CDATA[I'm <em>really</em> bad at keeping my answers brief <tt>;)</tt>]]></Text>
</Comment>
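A minimal sketch of unescaping such content independently of the XML, here with html_entity_decode(). The sample document and the choice of flags are assumptions of this example:

```php
<?php
// Inside CDATA the parser leaves "&#039;" alone: it is just seven characters.
$person = new SimpleXMLElement(
    '<person><name><![CDATA[ Someone&#039;s Name ]]></name></person>'
);

$raw = (string) $person->name;

// Decode the entity as a separate, explicit step after XML parsing.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($raw);     // still contains &#039;
var_dump($decoded); // contains a real apostrophe
```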

simplexml_load_string() != simplexml_import_dom()?

If I load an HTML page using DOMDocument::loadHTMLFile() and then pass it to simplexml_import_dom(), everything is fine. However, if I use $dom->saveHTML() to get a string representation from the DOMDocument and then use simplexml_load_string(), I get nothing. Actually, if I use a very simple page it will work, but as soon as there is anything more complex, it fails without any errors in the PHP log file.
Can anyone shed light on this?
Is it something to do with HTML not being parsable XML?
I am trying to strip out CR's and newlines from the formatted HTML text before using the contents as they have nothing to do with the content but get inserted into the SimpleXMLElement object, which is rather tedious.
Is it something to do with HTML not being parsable XML?
YES! HTML is a far less strict syntax so simplexml_load_string will not work with it by itself. This is because simplexml is simple and HTML is convoluted. On the other hand, DOMDocument is designed to be able to read the convoluted HTML structure, which means that since it can make sense of HTML and simplexml can make sense of it, you can bridge the proverbial gap there.
<!-- Valid HTML but not valid XML -->
<ul>
<li>foo
<li>bar
</ul>
HTML may or may not be valid XML. When you use loadHTMLFile, the input doesn't necessarily have to be well-formed XML, because the DOM is an HTML one with different rules; but when you pass a string to SimpleXML, it must indeed be well-formed.
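The bridge described in these answers can be sketched as follows: let DOMDocument's HTML parser repair the markup, then hand the resulting tree to SimpleXML without going through a string. The sample markup is the list from the answer above.

```php
<?php
// Collect HTML parser diagnostics internally instead of emitting warnings.
libxml_use_internal_errors(true);

// Valid HTML, but not well-formed XML: the <li> elements are never closed.
$html = "<ul>\n<li>foo\n<li>bar\n</ul>";

$dom = new DOMDocument();
$dom->loadHTML($html);

// Import the repaired tree directly; no string round trip is involved.
$sx    = simplexml_import_dom($dom);
$items = $sx->xpath('//li');

foreach ($items as $li) {
    // trim() drops the line breaks that are part of each item's text.
    echo trim((string) $li), "\n";
}
```

Passing the same $html string straight to simplexml_load_string() would fail, because the unclosed elements are a well-formedness error in XML.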
If I understand your question correctly and you simply want no whitespace in your output, then there is no need to use SimpleXML here.
Use: DOMDocument::preserveWhiteSpace
like:
$dom->preserveWhiteSpace = false;
before loading the document (the setting takes effect at parse time) and you're set.
