I'm in a real hurry right now, and I'm begging REGEX masters for help!
I'm receiving XML through an HTTP request, and I can't parse it because it contains some special characters that are not wrapped in CDATA sections.
example XML:
<root>
<node>good node</node>
<node>bad node containing &</node>
</root>
Trying to parse this XML with simplexml_load_string($xml) I get:
Warning: simplexml_load_string() [function.simplexml-load-string]:
Entity: line 3: parser error : xmlParseEntityRef: no name in /..../file.php on line ##
Supposing that the bad nodes will not contain > or <, I need a REGEX that will wrap the text in those nodes in CDATA sections. I guess there will be some lookarounds, I just can't do it quickly.
Thank you!
If you can indeed assume that there will be no < or > characters inside the nodes you want to CDATA-ize, then this should work just fine for your situation:
>(?=[^<&]*&)([^<]*)<
replacing with
><![CDATA[\1]]><
This expression only looks for nodes that contain & characters (whether or not they are part of HTML entities), then wraps the contents of those nodes in a CDATA section. If you need to ignore & characters inside entities, that's considerably tougher, but I'd be willing to give it a look.
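For reference, here is a minimal sketch of how that replacement might be applied in PHP. The function name wrapAmpersandNodesInCdata is made up for illustration, and the sketch keeps the question's assumption that the bad nodes contain no < or > characters. Note that PHP's preg_replace accepts $1 as well as \1 for the captured group.

```php
<?php
// Hypothetical helper: wraps the text of any node containing "&" in CDATA.
// Assumes, as in the question, that node text never contains "<" or ">".
function wrapAmpersandNodesInCdata(string $xml): string
{
    // The match consumes the ">" before the text and the "<" after it,
    // so both are put back in the replacement.
    return preg_replace('/>(?=[^<&]*&)([^<]*)</', '><![CDATA[$1]]><', $xml);
}

$xml = "<root>\n<node>good node</node>\n<node>bad node containing &</node>\n</root>";
$fixed = wrapAmpersandNodesInCdata($xml);
echo $fixed, "\n";
```

The result should then be acceptable to simplexml_load_string(), since the & now sits inside a CDATA section.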
Related
I have to parse HTML and "HTML" from emails. I've already managed to create a function that cleans most of the errors such as improper nesting of elements.
I'm trying to determine how best to tackle the issue of HTML attributes that are missing values. We must parse everything ultimately as XML so well-formed HTML is a must as well.
The cleaning function starts off simple enough:
$xml = explode('<', $xml);
We quickly determine opening and closing tags of elements.
However once we get to attributes things get really messy really quickly:
Missing values.
People using single quotes instead of double quotes.
Attribute values may contain single quotes.
Here is an example of an HTML string we have to parse (a p element):
$s = 'p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text';
We do not care what those attributes are; our goal is simply to fix the XML so that it is well-formed as demonstrated by the following string:
$s = 'p obnoxious="true" nonprofessional="true" style="wrong: lulz-immature" dunno="true">Some paragraph text';
We're not interested in attribute="attribute" as that is just extra work (most email is frivolous) so we're simply interested in appending ="true" for each attribute missing a value just to prevent the XML parser on client browsers from failing over the trivialities of someone somewhere else not doing their job.
As I mentioned earlier we only need to fix the attributes which are missing values and we need to return a string. At this point all other issues of malformed XML have been addressed. I'm not sure where I should start as the topic is such a mess. So...
We're open to sending the entire XML string as a whole to be parsed and returned back as a string with some built-in library. For this option, presume that the XML is well-formed with a proper XML declaration (<?xml version="1.0" encoding="UTF-8"?>).
We're open to manually creating a function to address whatever we encounter though we're not interested in building a validator as much of the "HTML" we receive screams 1997.
We are working with the XML as a single string or an array (your pick); we are explicitly not dealing with files.
How do we with reasonable effort ensure that an XML string (in part or whole) is returned as a string with values for all attributes?
The DOM extension may solve your problem:
$doc = new DOMDocument('1.0');
$doc->loadHTML('<p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text');
echo $doc->saveXML();
The above code will result in the following output:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p obnoxious="" nonprofessional="" style="wrong: lulz-immature" dunno="">Some paragraph text</p></body></html>
You may replace every ="" with ="true" if you want, but the output is already a valid XML.
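If you would rather not do a string replace on the output, a rough alternative is to fix the attributes on the DOM tree itself. This is only a sketch: the XPath expression //@* selects every attribute in the document, and the check treats any empty value as "missing", so an attribute that was deliberately written as attr="" will also become "true".

```php
<?php
// Collect libxml warnings internally; messy email HTML tends to trigger many.
libxml_use_internal_errors(true);

$doc = new DOMDocument('1.0');
$doc->loadHTML('<p obnoxious nonprofessional style=\'wrong: lulz-immature\' dunno>Some paragraph text');
libxml_clear_errors();

// Visit every attribute node and give the empty ones a value.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//@*') as $attr) {
    if ($attr->value === '') {
        $attr->value = 'true';
    }
}

// Serialize only the <p> element rather than the whole wrapper document.
$p = $doc->getElementsByTagName('p')->item(0);
$fragment = $doc->saveXML($p);
echo $fragment, "\n";
```

Passing a node to saveXML() returns just that subtree as a string, which fits the requirement of working with strings rather than files.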
I have a xml :
<title>My title</title>
<text>This is a text and I love it <3 </text>
When I try to parse it with DOM, I have an error because of the "<3":
Warning: DOMDocument::loadXML(): StartTag: invalid element name in Entity...
Do you know how I can escape all the special characters inside while keeping my XML tree? The goal is to use this method: $document->loadXML($xmlContent);
Thanks a lot for your answers.
EDIT: I forgot to say that I cannot modify the XML. I receive it like that and I have to make do with it...
The symbol "<" is reserved for markup in XML and thus cannot be used literally in a text field. It should be replaced with its predefined entity:
&lt;
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
So the input text should be:
<title>My title</title>
<text>This is a text and I love it &lt;3 </text>
An XML built like that should be rejected, and whoever sends it should replace the reserved characters with their predefined entities. Doing said task with tools like htmlentities() and htmlspecialchars(), as Y U NO WORK suggests, is easy and straightforward.
Now, if you really need to parse said data, you need to sanitize it prior to parsing. This is not a recommended behaviour, particularly if you are receiving arbitrary text, but if it is a set of known or predictable characters, regular expressions can do the job.
This one, in particular, will escape a single "<" contained in a "text" element composed of letters, numbers or white spaces:
$xmlContent = preg_replace('/(<text>[a-zA-Z 0-9]*)<([a-zA-Z 0-9]*<\/text>)/', '$1&lt;$2', $xmlContent);
It is very specific, but it is done on purpose: regular expressions are really bad at matching nested structures, such as HTML or XML. Applying more arbitrary regular expressions to HTML or XML can have wildly unexpected behaviours.
XML says that every tag name has to start with a letter (or an underscore), nothing else is allowed, so a tag starting with <3 is not possible.
A workaround for this could be htmlentities() or htmlspecialchars(). But even that won't add a valid character to the beginning, so you should think about either:
Manually add a letter in front of the tag with if
Rework your XML so nothing like that can ever happen.
You need to put the content with special chars inside CDATA:
<text><![CDATA[This is a text and I love it <3 ]]></text>
I have to read a large (about 200MB) XML file, and I am using XMLReader with PHP. There is a URL node with an unescaped ampersand in it. Parsing always stops on the first URL node. I'm using the windows-1250 encoding, the same as is specified in the XML declaration of the file.
I am getting the error: parser error : EntityRef: expecting ';' in
Is it possible to parse an XML with & in value of NODE ?
Thank you for any tips, I can share a code if you need.
Is it possible to parse an XML with & in value of NODE ?
No. That means the file is not well-formed XML, so it does not really qualify as an XML file, and no XML parser can deal with it; a parser that did would not be an XML parser.
However, you can pre-process the data before you pass it to an XML parser and fix the issue (& -> &amp;) on your own.
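A minimal sketch of such a pre-processing step is below. The function names are made up for illustration. It works line by line so a 200MB file never has to sit in memory at once, and it assumes that an entity reference never spans a line break. The pattern only touches ASCII bytes, so it should be safe on windows-1250 data as well. It leaves alone anything that already looks like an entity (&amp;, &#38;, &#x26;), which also means a bare ampersand followed by something like "b;" would be mistaken for one.

```php
<?php
// Hypothetical helper: escape "&" unless it already starts an entity reference.
function escapeBareAmpersands(string $chunk): string
{
    return preg_replace(
        '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#x[0-9a-fA-F]+);)/',
        '&amp;',
        $chunk
    );
}

// Hypothetical helper: copy one stream to another, fixing each line on the way.
function copyWithEscapedAmpersands($in, $out): void
{
    while (($line = fgets($in)) !== false) {
        fwrite($out, escapeBareAmpersands($line));
    }
}

// Demo with in-memory streams; real code would open the source file and a temp file.
$in = fopen('php://memory', 'w+');
fwrite($in, "<URL>http://example.com/?a=1&b=2</URL>\n<name>Tom &amp; Jerry</name>\n");
rewind($in);

$out = fopen('php://memory', 'w+');
copyWithEscapedAmpersands($in, $out);
rewind($out);

$result = stream_get_contents($out);
echo $result;
```

The cleaned copy can then be handed to XMLReader in place of the original file.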
@hakre is correct. In order for any XML to be parsed, you would have to pre-process the data first.
The reason for this is that in XML, the "&" is used for entities only. For example, if you are using XML, the opening '<' and closing '>' are very important, and the following node just doesn't make any sense to a parser:
<object>This object is > than the other object</object>
The parser thinks that the ">" in the middle of the text is trying to close a tag somewhere, but there is no matching opening tag, so it would get confused. To avoid that, you would need to type the following:
<object>This object is &gt; than the other object</object>
Other entities include: &lt; and &amp;.
When I extract text from an XML file
Here is some text before the
<br/><br/>
line break.
in PHP,
echo $value->description;
I get the text, but without the br tags. How do I get around this?
Thanks.
And from experience, you shouldn't even get any text after the <br/> tags. The reason is that all text nodes in XML are supposed to have < and > replaced with their htmlentities() counterparts, and all other special characters replaced with htmlspecialchars(). I'm fairly certain that it causes an error in your XML DOM parser, or at least gets treated as a new node, an empty node with a line break, I think.
The only solution for this is to store the XML into a string, use regex to take out the <br/> tags (well, all the < and > tags for that matter), and replace them with the correct values I noted above.
Or, you can read about CDATA here, and escape the tags instead, but that's if you're the one creating that XML file. You should notify the webmaster for the site that you got the XML from, that the XML is incorrectly created.
First, you can read the XML file into one string, and then replace '<br/>' by '&lt;br/&gt;'. Now, you can load the replaced string as XML data, and process it with XML DOM.
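Roughly, that could look like the sketch below. The wrapping item element and the sample string are assumptions, since the question only shows the description text. It only handles the exact spelling <br/>; variants such as <br /> or <br> would need their own replacements.

```php
<?php
// Sample input (assumed structure): a description with raw <br/> tags inside it.
$raw = '<item><description>Here is some text before the <br/><br/> line break.</description></item>';

// Escape the tags so the parser sees them as plain text, not child elements.
$escaped = str_replace('<br/>', '&lt;br/&gt;', $raw);

$value = simplexml_load_string($escaped);

// The parser decodes the entities again, so the tags come back as literal text.
$description = (string) $value->description;
echo $description, "\n";
```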
I'm trying to parse an XML string containing characters & < and > in the TEXTDATA. Normally, those characters should be htmlencoded, but in my case they aren't so I get the following messages:
Warning: DOMDocument::loadXML() [function.loadXML]: error parsing attribute name in Entity ...
Warning: DOMDocument::loadXML() [function.loadXML]: Couldn't find end of Start Tag ...
I can use the str_replace to encode all the &, but if I do that with < or > I'm doing it for valid XML tags too.
Does anyone know a workaround for this problem??
Thank you!
If you have a < inside text in an XML... it's not valid XML. Try to encode it or to enclose the text in <![CDATA[ ... ]]>.
If that's not possible (because you're not the one outputting this "XML") I'd suggest trying some HTML parsing library (I haven't used them, but they exist) because they're less strict than XML ones.
But I'd really try to get valid XML before trying any other thing!!
I often use @ in front of calls to load() for DomDocument, mainly because you can never be absolutely sure that what you load is what you expected.
Using @ will suppress errors.
@$dom->loadXml($myXml);
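As a side note, the suppression operator hides the reason for a failure as well as the warning. A different technique, libxml_use_internal_errors(), keeps the warnings off the page but still lets you inspect them. This is only a sketch with a made-up sample string:

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$myXml = '<blah>x & y</blah>'; // sample input with a bare ampersand
$dom = new DOMDocument();

// loadXML() returns false when the document cannot be parsed.
$loaded = $dom->loadXML($myXml);
$errors = libxml_get_errors();
libxml_clear_errors();

if (!$loaded) {
    foreach ($errors as $error) {
        // Each entry is a LibXMLError with message, line and column fields.
        echo trim($error->message), ' (line ', $error->line, ")\n";
    }
}
```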
I can use the str_replace to encode all the &, but if I do that with < or > I'm doing it for valid XML tags too.
As a strictly temporary fixup measure you can replace the ones that aren't part of what looks like a tag or entity reference, eg.:
$str = preg_replace('/<(?![a-zA-Z_\/!?])/', '&lt;', $str);
$str = preg_replace('/&(?!([a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)/', '&amp;', $str);
However this isn't watertight and in the longer term you need to fix whatever is generating this bogus markup, or shout at the person who needs to fix it until they get a clue. Grossly-non-well-formed XML like this is simply not XML by definition.
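To illustrate the idea, here is a small sketch that runs both replacements on a sample string and then checks that DOMDocument accepts the result. The function name and the sample are made up, and the patterns carry regex delimiters and allow "/" after "<" so that closing tags survive. As said, this is not watertight: a "<" directly followed by a letter (as in "x<y") still looks like a tag and is left alone.

```php
<?php
// Hypothetical helper: escape "<" and "&" that do not look like markup.
function escapeStrayMarkup(string $str): string
{
    // "<" is kept only when followed by something that can start a tag,
    // a closing tag, a comment/CDATA/doctype, or a processing instruction.
    $str = preg_replace('/<(?![a-zA-Z_\/!?])/', '&lt;', $str);
    // "&" is kept only when it starts a named or numeric entity reference.
    $str = preg_replace('/&(?!([a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)/', '&amp;', $str);
    return $str;
}

$broken = '<blah>x & y < 3</blah>';
$clean = escapeStrayMarkup($broken);
echo $clean, "\n";

$dom = new DOMDocument();
$ok = $dom->loadXML($clean);
$text = $ok ? $dom->documentElement->textContent : null;
```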
Put all your text inside CDATA elements?
<!-- Old -->
<blah>
x & y < 3
</blah>
<!-- New -->
<blah><![CDATA[
x & y < 3
]]></blah>