Given this XML:
<xml>
<![CDATA[<myNode>aaa</myNode><anotherNode>bbb</anotherNode>]]>
</xml>
How do I access myNode (which is inside a CDATA section) using Simple HTML DOM?
Is it possible, or maybe I should change to another lib?
Any parser treats the content of a CDATA block as plain text rather than markup, so XML nodes inside CDATA blocks are not queryable unless you parse the CDATA text as well. In other words:
Parse your original document
Query your CDATA text block. You will get a new xml string.
Parse your new (inner) xml string, and query whatever data you need from it.
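The three steps above can be sketched as follows. This is a minimal illustration that uses SimpleXML rather than Simple HTML DOM, and the `<wrapper>` element is a hypothetical addition: the CDATA text holds two sibling nodes, so it needs a single root before it can be parsed as XML on its own.

```php
<?php
$xml = '<xml><![CDATA[<myNode>aaa</myNode><anotherNode>bbb</anotherNode>]]></xml>';

// Step 1: parse the original document.
$outer = new SimpleXMLElement($xml);

// Step 2: read the CDATA block as a plain string.
$innerText = (string) $outer;

// Step 3: parse the inner string. The wrapper element is an assumption,
// added only because the inner text has more than one top-level node.
$inner = new SimpleXMLElement('<wrapper>' . $innerText . '</wrapper>');

$myNode = (string) $inner->myNode;
$anotherNode = (string) $inner->anotherNode;

echo $myNode, "\n", $anotherNode, "\n";
```

This only works if the CDATA text is itself well-formed XML; otherwise the second parse throws an exception.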
Having said all of that, why in the world do you have full xml text inside of CDATA blocks? Sounds like extremely lazy escaping to me.
I have to parse HTML and "HTML" from emails. I've already managed to create a function that cleans most of the errors such as improper nesting of elements.
I'm trying to determine how best to tackle the issue of HTML attributes that are missing values. We must parse everything ultimately as XML so well-formed HTML is a must as well.
The cleaning function starts off simple enough:
$xml = explode('<', $xml);
We quickly determine opening and closing tags of elements.
However once we get to attributes things get really messy really quickly:
Missing values.
People using single quotes instead of double quotes.
Attribute values may contain single quotes.
Here is an example of an HTML string we have to parse (a p element):
$s = 'p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text';
We do not care what those attributes are; our goal is simply to fix the XML so that it is well-formed as demonstrated by the following string:
$s = 'p obnoxious="true" nonprofessional="true" style="wrong: lulz-immature" dunno="true">Some paragraph text';
We're not interested in attribute="attribute" as that is just extra work (most email is frivolous) so we're simply interested in appending ="true" for each attribute missing a value just to prevent the XML parser on client browsers from failing over the trivialities of someone somewhere else not doing their job.
As I mentioned earlier we only need to fix the attributes which are missing values and we need to return a string. At this point all other issues of malformed XML have been addressed. I'm not sure where I should start as the topic is such a mess. So...
We're open to sending the entire XML string as a whole to be parsed and returned back as a string with some built-in library. If you choose this option, presume that the XML is well-formed with a proper XML declaration (<?xml version="1.0" encoding="UTF-8"?>).
We're open to manually creating a function to address whatever we encounter though we're not interested in building a validator as much of the "HTML" we receive screams 1997.
We are working with the XML as a single string or an array (your pick); we are explicitly not dealing with files.
How do we with reasonable effort ensure that an XML string (in part or whole) is returned as a string with values for all attributes?
The DOM extension may solve your problem:
$doc = new DOMDocument('1.0');
$doc->loadHTML('<p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text');
echo $doc->saveXML();
The above code will result in the following output:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p obnoxious="" nonprofessional="" style="wrong: lulz-immature" dunno="">Some paragraph text</p></body></html>
You may replace every ="" with ="true" if you want, but the output is already well-formed XML.
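If you do want ="true" instead of ="", one way is to walk the parsed document and fill in the empty values before serializing. The sketch below is only an illustration: `fillEmptyAttributes` is a hypothetical helper name, and it assumes that every empty attribute value should be replaced, which would also rewrite attributes that were deliberately empty (for example alt="").

```php
<?php
// Hypothetical helper: lets the DOM extension repair the markup, then gives
// every attribute that ended up with an empty value a default value.
function fillEmptyAttributes(string $html, string $default = 'true'): string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument('1.0');
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select every element that has at least one attribute.
    foreach ($xpath->query('//*[@*]') as $element) {
        // Copy the attribute list first so it is not modified while iterating.
        foreach (iterator_to_array($element->attributes, false) as $attribute) {
            if ($attribute->value === '') {
                $element->setAttribute($attribute->name, $default);
            }
        }
    }

    // Serialize only the root element, without the XML declaration or doctype.
    return $doc->saveXML($doc->documentElement);
}

$fixed = fillEmptyAttributes(
    "<p obnoxious nonprofessional style='wrong: lulz-immature' dunno>Some paragraph text"
);
echo $fixed, "\n";
```

Working on the parsed tree avoids a blind string replacement of ="" that could also hit text content.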
I have a LARGE XML file. I'm troubleshooting some things, and I would like to extract specific nodes from the XML file. I don't want a SimpleXML object, I want to make a new file with the raw string matching what I want (posting this on bash/sed/php).
<?xml version="1.0" encoding="UTF-8"?>
<definition></definition>
<metadata></metadata>
<nodeToRegex>
<nodeImightwant>
<subnode>
<subsubnode1></subsubnode1>
<subsubnodeToCheck>stringCheck</subsubnodeToCheck>
<subsubnode2></subsubnode2>
</subnode>
</nodeImightwant>
<nodeImightwant></nodeImightwant>
<nodeImightwant></nodeImightwant>
</nodeToRegex>
So from this XML file, I want all lines from every node except the nodeToRegex. From nodeToRegex, I only want the nodeImightwant if the stringCheck string equals "aValidString". Can this be done via regex or should I just copy and paste the stuff out of the file? (my regex skills are subpar)
Don't parse XML with regexes. There is no reason you can't repackage/rearrange the data using SimpleXML, but trying to do it with a regex is a recipe for lots of headaches and, ultimately, broken code.
See this classic example for why parsing XML/HTML/XHTML with regexes is the road to madness.
If you insist on using a regex, just replace the nodes you don't want, like this:
$myxml = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~s', '', $myxml); // the s modifier lets . match newlines
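For comparison, here is a rough parser-based sketch of the filtering the question describes, using DOM and XPath. It rests on two assumptions: the sample is wrapped in a hypothetical `<root>` element (as posted it has several top-level elements, which an XML parser would reject), and subsubnodeToCheck sits at the path shown in the question.

```php
<?php
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <definition></definition>
  <metadata></metadata>
  <nodeToRegex>
    <nodeImightwant>
      <subnode>
        <subsubnodeToCheck>aValidString</subsubnodeToCheck>
      </subnode>
    </nodeImightwant>
    <nodeImightwant>
      <subnode>
        <subsubnodeToCheck>somethingElse</subsubnodeToCheck>
      </subnode>
    </nodeImightwant>
    <nodeImightwant></nodeImightwant>
  </nodeToRegex>
</root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Select every nodeImightwant whose check value is not "aValidString"
// (this also matches nodes that have no check value at all).
$query = '//nodeToRegex/nodeImightwant[not(subnode/subsubnodeToCheck = "aValidString")]';

// Copy the result list first, then remove the unwanted nodes.
foreach (iterator_to_array($xpath->query($query), false) as $node) {
    $node->parentNode->removeChild($node);
}

// Everything outside nodeToRegex is kept untouched.
$result = $doc->saveXML();
echo $result;
```

The resulting string can then be written to a new file, for example with file_put_contents().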
I'm experiencing the following behavior:
$xml_string1 = "<person><name><![CDATA[ Someone&#039;s Name ]]></name></person>";
$xml_string2 = "<person><name> Someone&#039;s Name </name></person>";
$person = new SimpleXMLElement($xml_string1);
print (string) $person->name; # Someone&#039;s Name
$person = new SimpleXMLElement($xml_string2);
print (string) $person->name; # Someone's Name
$person = new SimpleXMLElement($xml_string1, LIBXML_NOCDATA);
print (string) $person->name; # Someone&#039;s Name
The php docs say that NOCDATA "Merge[s] CDATA as text nodes". To me this means that CDATA will then be treated the same as text nodes - or that the behavior of the 3rd example will now be the same as the 2nd example.
I don't have control over the XML (it's a feed from an external source), otherwise I'd just remove the CDATA tag as it does nothing and ruins the behavior I want.
Why does the above example behave the way that it does? Is there any way to make SimpleXML handle the CDATA nodes in the same way that it handles text nodes? What does "Merge CDATA as text nodes" actually do, since I don't seem to be understanding that option?
I'm currently decoding after I pull out the data, but the above example still doesn't make sense to me.
The purpose of CDATA sections in XML is to encapsulate a block of text "as is" which would otherwise require special characters (in particular, >, < and &) to be escaped. A CDATA section containing the character & is the same as a normal text node containing &amp;.
If a parser were to offer to ignore this, and pretend all CDATA nodes were really just text nodes, it would instantly break as soon as someone mentioned "P&O Cruises" - that & simply can't be there on its own (rather than as &amp;, or &somethingElse;).
The LIBXML_NOCDATA option is actually pretty useless with SimpleXML, because (string)$foo neatly combines any sequence of text and CDATA nodes into an ordinary PHP string. (Something which people frequently fail to notice, because print_r doesn't.) This isn't necessarily true of more systematic access methods, such as DOM, where you can manipulate text nodes and CDATA nodes as objects in their own right.
What it effectively does is go through the document, and wherever it encounters a CDATA section, it takes the content, escapes it, and puts it back as an ordinary text node, or "merges" it with any text nodes to either side. The text represented is identical, just stored in the document in a different way; you can see the difference if you export back to XML, as in this example:
$xml_string = "<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>";
$person = new SimpleXMLElement($xml_string);
echo 'CDATA retained: ', $person->asXML();
// CDATA retained: <?xml version="1.0"?>
// <person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>
$person = new SimpleXMLElement($xml_string, LIBXML_NOCDATA);
echo 'CDATA merged: ', $person->asXML();
// CDATA merged: <?xml version="1.0"?>
// <person><name>Welcome aboard this P&amp;O Cruises voyage!</name></person>
If the XML document you're parsing contains a CDATA section which actually contains entities, you need to take that string and unescape it completely independently of the XML. One common reason to do this (other than laziness with poorly understood libraries) is to treat something marked up in HTML as just any old string inside an XML document, like this:
<Comment>
<SubmittedBy>IMSoP</SubmittedBy>
<Text><![CDATA[I'm <em>really</em> bad at keeping my answers brief <tt>;)</tt>]]></Text>
</Comment>
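Unescaping the string independently of the XML could look roughly like this. The sample input and the choice of html_entity_decode() with these flags are assumptions about what the feed contains (HTML-style entities such as &#039; inside CDATA); adjust them to the actual data.

```php
<?php
// Assumed sample: an HTML entity stored verbatim inside a CDATA section.
$xml = '<person><name><![CDATA[ Someone&#039;s Name ]]></name></person>';

$person = new SimpleXMLElement($xml);

// The XML parser hands back the CDATA content unchanged, entity included.
$raw = (string) $person->name;

// Decode the HTML entities as a separate step, after XML parsing is done.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo trim($decoded), "\n";
```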
Consider the xml
<data>
<node1>
some text some text <nested-node>nest node content</nested-node> some text
</node1>
</data>
I want to access the <node1> tag (that I can do), but I want to get its content as follows:
some text some text <nested-node>nest node content </nested-node>
some text
How can I achieve this?
The problem is that your XML at the top is not well-formed.
You cannot use tags in a text sequence. The parser removes such elements.
XML Escaping
List of escape characters
With this:
$xml = new SimpleXMLElement(file_get_contents('xmltest.xml'));
var_dump($xml->node1);
you can read the node.
I got curious about this question. Is it really impossible to do what the OP wants? @Stony noted that malformed XML makes this impossible, and I wouldn't be surprised if the XML functions won't work with malformed XML in the way the OP wants.
Here's an example of nesting element in XML: http://www.featureblend.com/xml-nesting.html
If you form your XML like this:
<data>
<master_node>
<nested1>First nested node text.</nested1>
<nested2>Second nested node text.</nested2>
<nested3>Third nested node text.</nested3>
</master_node>
</data>
You are able to get all the text contents:
$xml = new SimpleXMLElement(file_get_contents('xmltest.xml'));
var_dump($xml->master_node);
Maybe it's possible to do something similar with JavaScript or plain PHP. Just load the file into a variable and parse it with regex or string search functions. Parsing XML with regex, yes I know, it's like parsing HTML with regex, a big no-no...
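Another option worth trying before reaching for regex is the DOM extension. The sketch below assumes the question's sample is accepted by the parser as mixed content (text and child elements side by side) and rebuilds the inner markup of node1 by serializing each child node in turn.

```php
<?php
$xml = '<data><node1>some text some text '
     . '<nested-node>nest node content</nested-node> some text</node1></data>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$node1 = $doc->getElementsByTagName('node1')->item(0);

// Serialize each child (text nodes and elements alike) and join the pieces.
$inner = '';
foreach ($node1->childNodes as $child) {
    $inner .= $doc->saveXML($child);
}

echo $inner, "\n";
```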
When I extract text from an XML file
Here is some text before the
<br/><br/>
line break.
in PHP,
echo $value->description;
I get the text, but without the br tags. How do I get around this?
Thanks.
And from experience, you shouldn't even get any text after the <br/> tags. The reason is that all text nodes in XML are supposed to have < and > replaced with their htmlentities() counterparts, and all other special characters replaced with htmlspecialchars(). I'm fairly certain that it causes an error in your XML DOM parser, or at least makes it a new node, an empty text node with a line break, I think.
The only solution for this is to store the XML into a string, use regex to take out the <br/> tags (well, all the < and > tags for that matter), and replace them with the correct values I noted above.
Or, you can read about CDATA here, and escape the tags instead, but that's if you're the one creating that XML file. You should notify the webmaster for the site that you got the XML from, that the XML is incorrectly created.
First, you can read the XML file into one string, and then replace '<br/>' with '&lt;br/&gt;'. Now, you can load the replaced string as XML data, and process it with XML DOM.
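The escape-then-parse approach might look like this. It is a minimal sketch that assumes the tags always appear literally as `<br/>` and uses a hypothetical `<item>` wrapper for the sample; a real feed may need more variants (such as `<br />`) handled.

```php
<?php
// Assumed sample input; in practice this would come from file_get_contents().
$raw = '<item><description>Here is some text before the <br/><br/> line break.</description></item>';

// Escape the br tags so the parser sees them as text, not as child elements.
$escaped = str_replace('<br/>', '&lt;br/&gt;', $raw);

$item = new SimpleXMLElement($escaped);

// The parser unescapes the entities, so the string holds literal <br/> tags.
$description = (string) $item->description;

echo $description, "\n";
```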