PHP DOMDocument->loadXML with XML containing ampersand/less/greater? - php

I'm trying to parse an XML string containing characters & < and > in the TEXTDATA. Normally, those characters should be htmlencoded, but in my case they aren't so I get the following messages:
Warning: DOMDocument::loadXML() [function.loadXML]: error parsing attribute name in Entity ...
Warning: DOMDocument::loadXML() [function.loadXML]: Couldn't find end of Start Tag ...
I can use the str_replace to encode all the &, but if I do that with < or > I'm doing it for valid XML tags too.
Does anyone know a workaround for this problem??
Thank you!

If you have a < inside text in an XML document, it is not valid XML. Try to encode it, or enclose the text in a <![CDATA[ ... ]]> section.
If that's not possible (because you're not the one outputting this "XML"), I'd suggest trying an HTML parsing library (I haven't used them, but they exist), because they're less strict than XML parsers.
But I'd really try to get valid XML before trying anything else!

I often use @ in front of calls to load() on DOMDocument, mainly because you can never be absolutely sure that what you load is what you expected.
Using @ will suppress errors.
@$dom->loadXml($myXml);

I can use the str_replace to encode all the &, but if I do that with < or > I'm doing it for valid XML tags too.
As a strictly temporary fixup measure you can replace the ones that aren't part of what looks like a tag or entity reference, eg.:
$str= preg_replace('/<(?![a-zA-Z_\/!?])/', '&lt;', $str);
$str= preg_replace('/&(?!([a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)/', '&amp;', $str);
However this isn't watertight and in the longer term you need to fix whatever is generating this bogus markup, or shout at the person who needs to fix it until they get a clue. Grossly-non-well-formed XML like this is simply not XML by definition.
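As a rough illustration of that temporary fixup, here is a minimal sketch that wraps the two replacements in a helper and then loads the result. The function name escapeStrayMarkup() and the sample string are assumptions made for this example, and the lookahead also lets "/" through so that closing tags are left alone.

```php
<?php
// Hypothetical helper (not from the original answer): escape '<' and '&'
// characters that do not look like the start of a tag or entity reference.
function escapeStrayMarkup(string $str): string
{
    // A '<' not followed by a letter, '_', '/', '!' or '?' is treated as text.
    $str = preg_replace('/<(?![a-zA-Z_\/!?])/', '&lt;', $str);
    // A '&' that does not start a named or numeric reference is treated as text.
    $str = preg_replace('/&(?!([a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)/', '&amp;', $str);
    return $str;
}

$broken = '<blah>x & y < 3</blah>';
$fixed  = escapeStrayMarkup($broken);

$dom = new DOMDocument();
$dom->loadXML($fixed);
$text = $dom->documentElement->textContent;
echo $text, "\n";
```

Text such as "a <b" would still be mistaken for a tag, which is why this is only a stopgap.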

Put all your text inside CDATA elements?
<!-- Old -->
<blah>
x & y < 3
</blah>
<!-- New -->
<blah><![CDATA[
x & y < 3
]]></blah>
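If you are the one generating the XML, one way to get such a CDATA section without building the string by hand is DOMDocument::createCDATASection(). This is only a sketch, reusing the element name blah from the example above:

```php
<?php
// Build <blah> with its text wrapped in a CDATA section.
$dom  = new DOMDocument('1.0', 'UTF-8');
$blah = $dom->createElement('blah');
$blah->appendChild($dom->createCDATASection('x & y < 3'));
$dom->appendChild($blah);

// Passing a node to saveXML() serializes just that node, without the XML declaration.
$xml = $dom->saveXML($blah);
echo $xml, "\n";
```

Note that this does not help when the broken XML comes from someone else, because the CDATA markers have to be in place before the parser sees the text.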

Related

PHP and DOM - parsing error an XML with inside entities

I have this XML:
<title>My title</title>
<text>This is a text and I love it <3 </text>
When I try to parse it with DOM, I have an error because of the "<3":
Warning: DOMDocument::loadXML(): StartTag: invalid element name in Entity...
Do you know how I can escape all the special characters inside the text while keeping my XML tree? The goal is to use this method: $document->loadXML($xmlContent);
Thanks a lot for your answers.
EDIT: I forgot to say that I cannot modify the XML. I receive it like that and I have to deal with it...
The symbol "<" is a reserved markup character in XML and thus cannot be used literally in a text field. It should be replaced with its predefined entity:
&lt;
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
So the input text should be:
<title>My title</title>
<text>This is a text and I love it &lt;3 </text>
An XML document built like that should be rejected, and whoever sends it should replace the reserved characters with the predefined entities. Doing said task with tools like htmlentities() and htmlspecialchars(), as Y U NO WORK suggests, is easy and straightforward.
Now, if you really need to parse said data, you need to sanitize it prior to parsing. This is not a recommended behaviour, particularly if you are receiving arbitrary text, but if it is a set of known or predictable characters, regular expressions can do the job.
This one, in particular, will escape a single "<" contained in a "text" element composed of letters, digits or spaces:
$xmlContent = preg_replace('/(<text>[a-zA-Z 0-9]*)<([a-zA-Z 0-9]*<\/text>)/', '$1&lt;$2', $xmlContent);
It is very specific, but it is done on purpose: regular expressions are really bad at matching nested structures, such as HTML or XML. Applying more arbitrary regular expressions to HTML or XML can have wildly unexpected behaviours.
XML says that every tag name has to start with a letter or an underscore, nothing else is allowed, so a tag named <3 is not possible.
A workaround for this could be htmlentities() or htmlspecialchars(). But even that won't add a valid character to the beginning, so you should think about either:
Manually add a letter in front of the tag with if
Rework your XML so nothing like that can ever happen.
You need to put the content with special chars inside CDATA:
<text><![CDATA[This is a text and I love it <3 ]]></text>

Insert CDATA into an XML

I'm in a real hurry right now, and I'm begging REGEX masters for help!
I'm receiving XML through an HTTP request, and I just can't parse it since it contains some special chars that are not wrapped in CDATA sections.
example XML:
<root>
<node>good node</node>
<node>bad node containing &</node>
</root>
Trying to parse this XML with simplexml_load_string($xml) I get:
Warning: simplexml_load_string() [function.simplexml-load-string]:
Entity: line 3: parser error : xmlParseEntityRef: no name in /..../file.php on line ##
Supposing that the bad nodes will not contain > or <, I need a REGEX that will wrap the text in those nodes in CDATA sections. I guess there will be some lookarounds, I just can't do it quickly.
Thank you!
If you can indeed assume that there will be no < or > characters inside the nodes you want to CDATA-ize, then this should work just fine for your situation:
>(?=[^<&]*&)([^<]*)<
replacing with
><![CDATA[\1]]><
This expression only looks for nodes that contain & characters (whether or not they are part of HTML entities), then wraps the contents of those nodes in a CDATA section. If you need to ignore & characters inside entities, that's a considerable bit tougher, but I'd be willing to give it a look.
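In PHP, that replacement could be applied roughly as follows. This is a sketch under the same assumption (no < or > inside the affected nodes); the sample document is the one from the question with the closing root tag written as </root>, and LIBXML_NOCDATA is passed so the CDATA content is read back as plain text.

```php
<?php
$xml = "<root>\n<node>good node</node>\n<node>bad node containing &</node>\n</root>";

// Wrap every text run between '>' and '<' that contains an '&' in a CDATA section.
$fixed = preg_replace('/>(?=[^<&]*&)([^<]*)</', '><![CDATA[$1]]><', $xml);

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$root = simplexml_load_string($fixed, 'SimpleXMLElement', LIBXML_NOCDATA);

$good = (string) $root->node[0];
$bad  = (string) $root->node[1];
echo $good, "\n", $bad, "\n";
```

Keep in mind that a node which already contains a proper entity such as &amp; would also be wrapped, so its text would then read "&amp;" literally.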

Extract text including line breaks from XML with PHP

When I extract text from an XML file
Here is some text before the
<br/><br/>
line break.
in PHP,
echo $value->description;
I get the text, but without the br tags. How do I get around this?
Thanks.
And from experience, you shouldn't even get any text after the <br/> tags. The reason is that in XML text nodes every < and > is supposed to be replaced with its entity counterpart (&lt; and &gt;), which is what htmlspecialchars() does. I'm fairly certain that the unescaped tags either cause an error in your XML DOM parser or are treated as a new node (an empty node with a line break), I think.
The only solution for this is to store the XML into a string, use regex to take out the <br/> tags (well, all the < and > tags for that matter), and replace them with the correct values I noted above.
Or, you can read about CDATA here, and escape the tags instead, but that's if you're the one creating that XML file. You should notify the webmaster for the site that you got the XML from, that the XML is incorrectly created.
First, you can read the XML file into one string, and then replace '<br/>' with '&lt;br/&gt;'. Now you can load the replaced string as XML data and process it with XML DOM.
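A minimal sketch of that last suggestion, assuming the text lives in a description element inside a hypothetical item wrapper (the question does not show the surrounding XML):

```php
<?php
// Sample input; the <item> wrapper is an assumption made for this example.
$xml = '<item><description>Here is some text before the<br/><br/>line break.</description></item>';

// Escape the literal <br/> tags so the parser keeps them as text.
$escaped = str_replace('<br/>', '&lt;br/&gt;', $xml);

$item = simplexml_load_string($escaped);
$description = (string) $item->description;
echo $description, "\n";
```

The parser decodes the entities again when the value is read, so the string that comes back contains the br tags as ordinary characters.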

How to deal with special characters in URLs inside XML

I have an XML element that has a url as one of its children, for example:
http://maps.google.com/FortWorth&Texas,more+url;data
When parsing this, I'm having two issues:
1.) The (&) symbol breaks the entire parse unless replaced with &amp (which breaks the url)
2.) The comma (,) tries to send my parser on to the next child, resulting in an incomplete url.
What can I do to remedy this?
I'm using Javascript and PHP.
Replacing & with &amp; shouldn't break the url. Did you leave out the ;?
A better solution is to wrap it in a CDATA section:
<![CDATA[ http://maps.google.com/FortWorth&Texas,more+url;data ]]>
Which tells the XML parser to treat it as text and not parse the &.
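To see that escaping does not change the URL, here is a small round-trip sketch. The link element name is an assumption for the example; htmlspecialchars() with ENT_XML1 does the escaping, and the parser decodes it again on the way out.

```php
<?php
$url = 'http://maps.google.com/FortWorth&Texas,more+url;data';

// Escape the characters that are special in XML (& < > and both quote types).
$escaped = htmlspecialchars($url, ENT_XML1 | ENT_QUOTES, 'UTF-8');
$xml     = '<link>' . $escaped . '</link>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// The text content comes back unescaped, commas and semicolons included.
$parsed = $dom->documentElement->textContent;
echo $parsed, "\n";
```

The comma needs no special treatment in XML, so if the parse stops at it, the cause is more likely in the code that reads the value than in the XML itself.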
There are certain characters which are not valid in XML - you need to "escape" these in the xml document.
These characters and their "escaped" versions are:
> &gt;
< &lt;
& &amp;
' &apos;
" &quot;

Error Tolerant HTML/XML/SGML parsing in PHP

I have a bunch of legacy documents that are HTML-like. As in, they look like HTML, but have additional made up tags that aren't a part of HTML
<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>
I need to parse these files. PHP is the only tool available. The documents don't come close to being well-formed XML.
My original thought was to use the loadHTML methods on PHP's DOMDocument. However, these methods choke on the made-up HTML tags, and will refuse to parse the string/file.
$oDom = new DomDocument();
$oDom->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
//gives us
DOMDocument::loadHTML() [function.loadHTML]: Tag pseud-template invalid in Entity, line: 1 occured in ....
The only solution I've been able to come up with is to pre-process the files with string replacement functions that will remove the invalid tags and replace them with a valid HTML tag (maybe a span with an id of the tag name).
Is there a more elegant solution? A way to let DOMDocument know about additional tags to consider as valid? Is there a different, robust HTML parsing class/object out there for PHP?
(if it's not obvious, I don't consider regular expressions a valid solution here)
Update: The information in the fake tags is part of the goal here, so something like Tidy isn't an option. Also, I'm after something that does some level, if not all, of the well-formedness cleanup for me, which is why I was looking at DOMDocument's loadHTML method in the first place.
You can suppress warnings with libxml_use_internal_errors, while loading the document. Eg.:
libxml_use_internal_errors(true);
$doc = new DomDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false);
If, for some reason, you need access to the warnings, use libxml_get_errors
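For example, the collected warnings could be inspected like this. It is a sketch only; which messages libxml reports for unknown tags depends on the libxml2 version, so the loop may print nothing on newer builds.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>');

// Each entry is a LibXMLError object with level, code, line and message properties.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The unknown element is still part of the resulting DOM tree.
$fake = $doc->getElementsByTagName('pseud-template')->item(0)->textContent;
echo $fake, "\n";
```

Restoring the previous setting rather than passing false keeps the snippet from changing error handling for the rest of the script.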
I wonder if passing the "bad" HTML through HTML Tidy might help as a first pass? Might be worth a look, if you can get the document to be well formed, maybe you could load it as a regular XML file with DomDocument.
@Twan
You don't need a DTD for DOMDocument to parse custom XML. Just use DOMDocument->load(), and as long as the XML is well-formed, it can read it.
Once you get the files to be well-formed, that's when you can start looking at XML parsers; before that you're S.O.L. Like Alejo said, you could look at HTML Tidy, but it looks like that's specific to HTML, and I don't know how it would go with your custom elements.
I don't consider regular expressions a valid solution here
Until you've got well-formedness, that might be your only option. Once you get the documents to that stage, then you're in the clear with the DOM functions.
Take a look at the Parser in the PHP Fit port. The code is clean and was originally designed for loading the dirty HTML saved by Word. It's configured to pull tables out, but can easily be adapted.
You can see the source here:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/Parser.phps
The unit test will show you how to use it:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/test/parser.phps
My quick and dirty solution to this problem was to run a loop that matches my list of custom tags with a regular expression. The regexp doesn't catch tags that have another inner custom tag inside them.
When there is a match, a function to process that tag is called and returns the "processed HTML". If that custom tag was inside another custom tag, then the parent becomes childless by the fact that actual HTML was inserted in place of the child, and it will be matched by the regexp and processed at the next iteration of the loop.
The loop ends when there are no childless custom tags to be matched. Overall it's iterative (a while loop) and not recursive.
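The loop described above might look roughly like this. The tag list, the processTag() handler and the span output are all assumptions made for illustration; the answer does not show its actual code, and attributes on the custom tags are not handled here.

```php
<?php
// Hypothetical list of custom tags and a hypothetical handler.
$customTags = ['pseud-template', 'pseud-note'];

function processTag(string $tag, string $inner): string
{
    return '<span class="' . $tag . '">' . $inner . '</span>';
}

$names = implode('|', array_map('preg_quote', $customTags));

// Match a custom tag whose content contains no other opening custom tag.
$pattern = '~<(' . $names . ')>((?:(?!<(?:' . $names . ')>).)*?)</\1>~s';

$html = '<strong>A <pseud-template>outer <pseud-note>inner</pseud-note></pseud-template></strong>';

// Repeat until a pass finds no childless custom tag; each pass may turn
// a parent into a childless tag that the next pass picks up.
do {
    $html = preg_replace_callback(
        $pattern,
        fn (array $m): string => processTag($m[1], $m[2]),
        $html,
        -1,
        $count
    );
} while ($count > 0);

echo $html, "\n";
```

Here the inner pseud-note is replaced in the first pass and the enclosing pseud-template in the second, which mirrors the inside-out order the answer describes.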
@Alan Storm
Your comment on my other answer got me thinking:
When you load an HTML file with DOMDocument, it appears to do some level of cleanup re: well-formedness, BUT requires all your tags to be legit HTML tags. I'm looking for something that does the former, but not the latter. (Alan Storm)
Run a regex (sorry!) over the tags, and when it finds one which isn't a valid HTML element, replace it with a valid element that you know doesn't exist in any of the documents (blink comes to mind...), and give it an attribute value with the name of the illegal element, so that you can switch it back afterwards. eg:
$code = str_replace("<pseudo-tag>", "<blink rel=\"pseudo-tag\">", $code);
// and then back again...
$code = preg_replace('/<blink rel="(.*?)">/', '<\1>', $code);
obviously that code won't work, but you get the general idea?
