PHP, SimpleXML, decoding entities in CDATA

I'm experiencing the following behavior:
$xml_string1 = "<person><name><![CDATA[ Someone&#039;s Name ]]></name></person>";
$xml_string2 = "<person><name> Someone&#039;s Name </name></person>";
$person = new SimpleXMLElement($xml_string1);
print (string) $person->name; # Someone&#039;s Name
$person = new SimpleXMLElement($xml_string2);
print (string) $person->name; # Someone's Name
$person = new SimpleXMLElement($xml_string1, LIBXML_NOCDATA);
print (string) $person->name; # Someone&#039;s Name
The php docs say that NOCDATA "Merge[s] CDATA as text nodes". To me this means that CDATA will then be treated the same as text nodes - or that the behavior of the 3rd example will now be the same as the 2nd example.
I don't have control over the XML (it's a feed from an external source), otherwise I'd just remove the CDATA tag as it does nothing and ruins the behavior I want.
Why does the above example behave the way that it does? Is there any way to make SimpleXML handle the CDATA nodes in the same way that it handles text nodes? What does "Merge CDATA as text nodes" actually do, since I don't seem to be understanding that option?
I'm currently decoding after I pull out the data, but the above example still doesn't make sense to me.

The purpose of CDATA sections in XML is to encapsulate a block of text "as is" which would otherwise require special characters (in particular, >, < and &) to be escaped. A CDATA section containing the character & is the same as a normal text node containing &amp;.
If a parser were to offer to ignore this, and pretend all CDATA nodes were really just text nodes, it would instantly break as soon as someone mentioned "P&O Cruises" - that & simply can't be there on its own (rather than as &amp;, or &somethingElse;).
The LIBXML_NOCDATA is actually pretty useless with SimpleXML, because (string)$foo neatly combines any sequence of text and CDATA nodes into an ordinary PHP string. (Something which people frequently fail to notice, because print_r doesn't.) This isn't necessarily true of more systematic access methods, such as DOM, where you can manipulate text nodes and CDATA nodes as objects in their own right.
What it effectively does is go through the document, and wherever it encounters a CDATA section, it takes the content, escapes it, and puts it back as an ordinary text node, or "merges" it with any text nodes to either side. The text represented is identical, just stored in the document in a different way; you can see the difference if you export back to XML, as in this example:
$xml_string = "<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>";
$person = new SimpleXMLElement($xml_string);
echo 'CDATA retained: ', $person->asXML();
// CDATA retained: <?xml version="1.0"?>
// <person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>
$person = new SimpleXMLElement($xml_string, LIBXML_NOCDATA);
echo 'CDATA merged: ', $person->asXML();
// CDATA merged: <?xml version="1.0"?>
// <person><name>Welcome aboard this P&amp;O Cruises voyage!</name></person>
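The claim that the represented text is identical either way can be checked with a small sketch. It reuses the sample document above; the variable names $plain and $merged are only illustrative.

```php
<?php
// Same document as in the example above: text, a CDATA section, then more text.
$xml_string = '<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>';

// Parse once with CDATA nodes retained and once with them merged into text nodes.
$plain  = new SimpleXMLElement($xml_string);
$merged = new SimpleXMLElement($xml_string, LIBXML_NOCDATA);

// The string cast concatenates text and CDATA nodes, so both give the same PHP string.
var_dump((string) $plain->name === (string) $merged->name);
```

Only the serialized form (asXML()) differs between the two objects, not the string you get from the cast.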
If the XML document you're parsing contains a CDATA section which actually contains entities, you need to take that string and unescape it completely independent of the XML. One common reason to do this (other than laziness with poorly understood libraries) is to treat something marked up in HTML as just any old string inside an XML document, like this:
<Comment>
<SubmittedBy>IMSoP</SubmittedBy>
<Text><![CDATA[I'm <em>really</em> bad at keeping my answers brief <tt>;)</tt>]]></Text>
</Comment>
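For the feed in the question, unescaping "independent of the XML" could look roughly like the sketch below. It assumes the CDATA content uses HTML-style entities such as &#039; and that html_entity_decode() with ENT_QUOTES covers the ones the feed actually uses; adjust the flags if that assumption does not hold.

```php
<?php
// Assumed sample: a CDATA section whose content was entity-encoded by the feed's producer.
$xml_string = "<person><name><![CDATA[ Someone&#039;s Name ]]></name></person>";

$person = new SimpleXMLElement($xml_string);

// The XML parser hands back the CDATA content untouched, entity text included.
$raw = (string) $person->name;

// Decode the HTML entities as a separate step, after the XML parsing is done.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $decoded, PHP_EOL;
```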


PHP return XML string with values added to attributes missing values

I have to parse HTML and "HTML" from emails. I've already managed to create a function that cleans most of the errors such as improper nesting of elements.
I'm trying to determine how best to tackle the issue of HTML attributes that are missing values. We must parse everything ultimately as XML so well-formed HTML is a must as well.
The cleaning function starts off simple enough:
$xml = explode('<', $xml);
We quickly determine opening and closing tags of elements.
However once we get to attributes things get really messy really quickly:
Missing values.
People using single quotes instead of double quotes.
Attribute values may contain single quotes.
Here is an example of an HTML string we have to parse (a p element):
$s = 'p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text';
We do not care what those attributes are; our goal is simply to fix the XML so that it is well-formed as demonstrated by the following string:
$s = 'p obnoxious="true" nonprofessional="true" style="wrong: lulz-immature" dunno="true">Some paragraph text';
We're not interested in attribute="attribute" as that is just extra work (most email is frivolous) so we're simply interested in appending ="true" for each attribute missing a value just to prevent the XML parser on client browsers from failing over the trivialities of someone somewhere else not doing their job.
As I mentioned earlier we only need to fix the attributes which are missing values and we need to return a string. At this point all other issues of malformed XML have been addressed. I'm not sure where I should start as the topic is such a mess. So...
We're open to sending the entire XML string as a whole to be parsed and returned back as a string with some built-in library. With this option, presume that the XML is well-formed with a proper XML declaration (<?xml version="1.0" encoding="UTF-8"?>).
We're open to manually creating a function to address whatever we encounter though we're not interested in building a validator as much of the "HTML" we receive screams 1997.
We are working with the XML as a single string or an array (your pick); we are explicitly not dealing with files.
How do we with reasonable effort ensure that an XML string (in part or whole) is returned as a string with values for all attributes?
The DOM extension may solve your problem:
$doc = new DOMDocument('1.0');
$doc->loadHTML('<p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text');
echo $doc->saveXML();
The above code will result in the following output:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p obnoxious="" nonprofessional="" style="wrong: lulz-immature" dunno="">Some paragraph text</p></body></html>
You may replace every ="" with ="true" if you want, but the output is already well-formed XML.
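If you do want ="true", one way to avoid a string replace is to set the value on the attribute nodes themselves. This is only a sketch: the XPath expression and the variable names are my own, and it cannot tell a valueless attribute from one that was explicitly written as attr="", so both end up as "true".

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings about sloppy HTML out of the output

$doc = new DOMDocument('1.0');
$doc->loadHTML("<p obnoxious nonprofessional style='wrong: lulz-immature' dunno>Some paragraph text");

// Select every attribute whose value is the empty string and give it a value.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//@*[. = ""]') as $attr) {
    $attr->value = 'true';
}

// Serialize only the <p> element instead of the whole generated html/body wrapper.
$p = $doc->getElementsByTagName('p')->item(0);
$xmlOut = $doc->saveXML($p);
echo $xmlOut, PHP_EOL;
```

Passing a node to saveXML() returns just that element as a string, which matches the requirement of working with strings rather than files.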

PHP regex, tag which is not in another tag

I need a regex that matches the content of the <cherry> tag only when it is not inside another tag. Unfortunately I can't use the PHP DOM parser because the content of the tag sometimes includes very special chars.
This is an example of the incoming input:
<cherry>test</cherry>
<banana>
<cherry>test</cherry>
some text
</banana>
This is my current regex but it will also match to the <cherry> tag inside the <banana> tag
(<cherry>)(.*?)(<\/cherry>)
How can I exclude the occurrence in other tags?
I have already tried a lot...
Why don't you make use of the DOMDocument class rather than a regex. Simply load your DOM and then use getElementsByTagName to get your tags. This way you can exclude any other tags which you don't want and only get those that you do.
Example
<?php
$xml = <<< XML
<?xml version="1.0" encoding="utf-8"?>
<books>
<book>Patterns of Enterprise Application Architecture</book>
<book>Design Patterns: Elements of Reusable Software Design</book>
<book>Clean Code</book>
</books>
XML;
$dom = new DOMDocument;
$dom->loadXML($xml);
$books = $dom->getElementsByTagName('book');
foreach ($books as $book) {
echo $book->nodeValue, PHP_EOL;
}
?>
Reading Material
DOMDocument
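Applied to the input from the question, the same idea might look like the sketch below. It assumes the input is well-formed once wrapped in a root element; the <root> wrapper and the XPath child step are my additions, used so that only top-level <cherry> elements match.

```php
<?php
// Input from the question; the nested value is changed so the two are distinguishable.
$input = "<cherry>test</cherry>\n<banana>\n<cherry>nested</cherry>\nsome text\n</banana>";

// The fragment has several top-level elements, so wrap it in a hypothetical <root>.
$dom = new DOMDocument;
$dom->loadXML('<root>' . $input . '</root>');

// /root/cherry matches only direct children of the wrapper, not <cherry> inside <banana>.
$xpath = new DOMXPath($dom);
$topLevel = [];
foreach ($xpath->query('/root/cherry') as $cherry) {
    $topLevel[] = $cherry->nodeValue;
}

print_r($topLevel);
```

Note that getElementsByTagName('cherry') would return both elements; the XPath child step is what restricts the match to the top level.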
Under the assumption, that you just need the contents of math tags at top level without anything else and you so far can't do it, because math tags contain invalid xml and therefore any xml-parser gives up ... (as mentioned in question and comments)
The clean approach would probably be, to use some fault-tolerant xml-parser (or fault-tolerant mode) or Tidy up the input before. However, these approaches all might "corrupt" the content.
The hacky and possibly dirty approach would be the following, which might very well have other issues, especially if the remaining xml is also invalid or your math tags are nested (this will lead to the xml-parser failing in step 2):
replace any <math>.*</math> (ungreedy) with a placeholder via preg_replace_callback or something similar (the placeholder should be unique: uniqid might help, but a simple counter is probably enough)
parse the document with a common xml-parser (wrapping it in some root tag as necessary)
fetch all child nodes of root node / all root nodes, see which ones were generated in step 1.
for example:
<math>some invalid xml</math>
<sometag>
<math>more invalid xml</math>
some text
</sometag>
replace with
$replacements = [];
$newcontent = preg_replace_callback(
'/'.preg_quote('<math>','/').'(.*)'.preg_quote('</math>','/').'/siU',
function($hit) use (&$replacements) { // by reference, otherwise the outer array stays empty
$id = uniqid();
$replacements[$id] = $hit[1];
return '<math id="'.$id.'" />';
},
$originalcontent);
which will turn your content into:
<math id="1stuniqid" />
<sometag>
<math id="2nduniqid" />
some text
</sometag>
Now use the XML parser of your choice, select all root-level elements, and look for /math/@id (my XPath is possibly just wrong, adjust as needed). The result should contain all the uniqids, which you can look up in your replacements array.
edit: some preg_quote problems fixed and used more standard delimiters.
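Putting the three steps together, a minimal end-to-end sketch could look like this. It swaps uniqid() for a simple counter, as suggested above, and the <root> wrapper, the "m" id prefix, and the sample content are my own assumptions; it still fails if the rest of the document is not well-formed or if math tags are nested.

```php
<?php
// Sample input: the <math> contents are deliberately not valid XML.
$original = "<math>a < b & c</math>\n<sometag>\n<math>x << y</math>\nsome text\n</sometag>";

// Step 1: swap every <math>...</math> for an empty placeholder and remember the content.
$replacements = [];
$counter = 0;
$clean = preg_replace_callback(
    '/<math>(.*?)<\/math>/s',
    function (array $hit) use (&$replacements, &$counter) {
        $id = 'm' . (++$counter);
        $replacements[$id] = $hit[1];
        return '<math id="' . $id . '"/>';
    },
    $original
);

// Step 2: the remaining markup is well-formed, so a normal parser accepts it.
$dom = new DOMDocument;
$dom->loadXML('<root>' . $clean . '</root>');

// Step 3: collect the ids of top-level placeholders and look up their original content.
$xpath = new DOMXPath($dom);
$topLevel = [];
foreach ($xpath->query('/root/math/@id') as $attr) {
    $topLevel[] = $replacements[$attr->value];
}

print_r($topLevel);
```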

cast simplexmlelement to string to get inner content but keep htmlspecialchars escaped

I have an XML file:
$xml = <<<EOD
<?xml version="1.0" encoding="utf-8"?>
<metaData xmlns="http://www.test.com/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="test">
<qkc6b1hh0k9>testdata&amp;more</qkc6b1hh0k9>
</metaData>
EOD;
I loaded it into a SimpleXML object and later wanted to get the inner content of the "qkc6b1hh0k9" node:
$xmlRootElem = simplexml_load_string( $xml );
$xmlRootElem->registerXPathNamespace( 'xmlns', "http://www.test.com/" );
// ...
$xPathElems = $xmlRootElem->xpath( './'."xmlns:qkc6b1hh0k9" );
$var = (string)($xPathElems[0]);
var_dump($var);
I expected to get the string
testdata&amp;more
... but i got
testdata&more
Why is the __toString() method of simplexmlobject converting my escaped specialchars to normal chars? Can I deactivate this behaviour?
I came up with a temp-solution, which I consider as dirty, what do you say?
(strip_tags($xPathElems[0]->asXML()))
May the DOMDocument be an alternative?
Thanks for any help on my questions!
edit
Problem solved: the problem was not in the __toString method of SimpleXML, but later on, when using the string with addChild.
The behaviour described above was totally fine and is to be expected, as you can see in the answers.
Problems only came up when the value was added to another XML document via addChild.
Since addChild doesn't escape the ampersand (http://www.php.net/manual/de/simplexmlelement.addchild.php#103587) one has to do it manually.
Why is the __toString() method of simplexmlobject converting my escaped specialchars to normal chars? Can I deactivate this behaviour?
Because those "special" chars are actually the XML encoding of characters. Using the string value gives you these characters verbatim again. That is what an XML parser is made for.
I came up with a temp-solution, which I consider as dirty, what do you say?
Well, shaky. Instead let me suggest the inverse: XML-encode the string:
$var = htmlspecialchars($xPathElems[0]);
var_dump($var);
May the DOMDocument be an alternative?
No. Like SimpleXML, it is an XML parser, and therefore you get the text decoded as well. This is not fully true (you can do that with DOMDocument by going through all child nodes and picking entity nodes next to character data), but it's much more work than the htmlspecialchars() approach outlined above.
If you create an XML tag, by any sane method, and set it to contain the string "testdata&more", this will be escaped as testdata&amp;more. It is therefore only logical that extracting that string content back out reverses the escaping procedure to give you the text you put in.
The question is, why do you want the XML-escaped representation? If you want the content of the element as intended by the author, then __toString() is doing the right thing; there is more than one way of representing that string in XML, but it is the data being represented that you should normally care about.
If for some reason you really need details of how the XML is constructed in that particular instance, you could use a more complex parsing framework such as DOM, which will separate testdata&amp;more into a text node (containing "testdata"), an entity node (with name "amp"), and another text node (containing "more").
If, on the other hand, all you want is to put it back into another XML (or HTML) document, then let SimpleXML do the unescaping properly, and re-escape it at the appropriate time.
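For the addChild case mentioned in the question's edit, re-escaping "at the appropriate time" could look like the sketch below. The element names and the target document are made up for illustration, and the ENT_XML1 | ENT_QUOTES flags are one reasonable choice rather than the only one.

```php
<?php
// The decoded string, as returned by the (string) cast on the source element.
$value = 'testdata&more';

// A hypothetical target document that should receive the value.
$target = new SimpleXMLElement('<root/>');

// addChild() does not escape a bare ampersand, so escape the value by hand first.
$target->addChild('item', htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8'));

echo $target->asXML();

// Reading it back yields the original, unescaped text again.
var_dump((string) $target->item);
```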

How to access a node inside CDATA with Simple HTML DOM?

Given a xml
<xml>
<![CDATA[<myNode>aaa</myNode><anotherNode>bbb</anotherNode>]]>
</xml>
How can I access myNode (which is inside a CDATA section) using Simple HTML DOM?
Is it possible, or maybe I should change to another lib?
The contents of a CDATA block are treated as plain text rather than markup by any XML parser, so any XML nodes that you have in CDATA blocks will not be queryable unless you parse the CDATA text as well. In other words:
Parse your original document
Query your CDATA text block. You will get a new xml string.
Parse your new (inner) xml string, and query whatever data you need from it.
Having said all of that, why in the world do you have full xml text inside of CDATA blocks? Sounds like extremely lazy escaping to me.
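The three steps can be sketched as follows. Note the assumptions: this uses PHP's built-in SimpleXML rather than Simple HTML DOM, and the <wrapper> element is a hypothetical name added because the CDATA text holds two sibling elements with no single root.

```php
<?php
$outerXml = '<xml><![CDATA[<myNode>aaa</myNode><anotherNode>bbb</anotherNode>]]></xml>';

// Step 1: parse the original document.
$outer = new SimpleXMLElement($outerXml);

// Step 2: read the CDATA block as a plain string; this is the inner XML text.
$innerText = trim((string) $outer);

// Step 3: parse the inner string as its own document and query it.
$inner = new SimpleXMLElement('<wrapper>' . $innerText . '</wrapper>');

echo (string) $inner->myNode, PHP_EOL;
echo (string) $inner->anotherNode, PHP_EOL;
```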

How to get the nest tags as string rather actual nest tags using PHP(DOM)?

Consider the xml
<data>
<node1>
some text some text <nested-node>nest node content</nested-node> some text
</node1>
</data>
I want to access the <node1> tag (that I can do), but I want to get its content as follows:
some text some text <nested-node>nest node content </nested-node>
some text
How can I achieve this?
The problem is that your XML on top is not well formed.
You cannot use tags in a text sequence. The parser removes such elements.
XML Escaping
List of escape characters
With this:
$xml = new SimpleXMLElement(file_get_contents('xmltest.xml'));
var_dump($xml->node1);
you can read the node.
I got curious about this question. Is it really impossible to do what the OP wants? @Stony noted that malformed XML makes this impossible, and I wouldn't be surprised if the XML functions won't work with malformed XML in the way the OP wants.
Here's an example of nesting element in XML: http://www.featureblend.com/xml-nesting.html
If you form your XML like this:
<data>
<master_node>
<nested1>First nested node text.</nested1>
<nested2>Second nested node text.</nested2>
<nested3>Third nested node text.</nested3>
</master_node>
</data>
You are able to get all the text contents:
$xml = new SimpleXMLElement(file_get_contents('xmltest.xml'));
var_dump($xml->master_node);
Maybe it's possible to do something similar with JavaScript or plain PHP. Just load the file into a variable and parse it with regex or string search functions. Parsing XML with regex, yes I know, it's like parsing HTML with regex, a big no-no...
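As an alternative to regex or string searches, here is a sketch using the DOM extension. The sample document from the question does load with DOMDocument as written (text mixed with child elements is allowed in XML), so the inner markup can be rebuilt by serializing each child node of <node1>. The variable names are illustrative only.

```php
<?php
$xml = '<data><node1>some text some text <nested-node>nest node content</nested-node> some text</node1></data>';

$dom = new DOMDocument;
$dom->loadXML($xml);

$node1 = $dom->getElementsByTagName('node1')->item(0);

// Serialize every child (text nodes and elements alike) and join the pieces.
$inner = '';
foreach ($node1->childNodes as $child) {
    $inner .= $dom->saveXML($child);
}

echo $inner, PHP_EOL;
```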
