XML parser error: entity not defined

XML parser error: entity not defined - php

I have searched stackoverflow on this problem and did find a few topics, but I feel like there isn't really a solid answer for me on this.
I have a form that users submit and the field's value is stored in a XML file. The XML is set to be encoded with UTF-8.
Every now and then a user will copy/paste text from somewhere and that's when I get the "entity not defined error".
I realize XML only supports a select few entities and anything beyond that is not recognized - hence the parser error.
From what I gather, there's a few options I've seen:
I can find and replace all and swap them out with or an actual space.
I can place the code in question within a CDATA section.
I can include these entities within the XML file.
What I'm doing with the XML file is that the user can enter content into a form, it gets stored in a XML file, and that content then gets displayed as XHTML on a Web page (parsed with SimpleXML).
Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?
Thanks,
Ryan
UPDATE
I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!
Some textboxes where plain old textboxes, but my textareas were enhanced with TinyMCE. It turns out, while taking a closer look, that the PHP warnings always referenced data from the TinyMCE enhanced textareas. Later I noticed on a PC that all the characters were taken out (because it couldn't read them), but on a MAC you could see little square boxes referencing the unicode number of that character. The reason it showed up in squares on a MAC in the first place, is because I used utf8_encode to encode data that wasn't in UTF to prevent other parsing errors (which is somehow also related to TinyMCE).
The solution to all this was quite simple:
I added this line entity_encoding : "utf-8" in my tinyMCE.init. Now, all the characters show up the way they are supposed to.
I guess the only thing I don't understand is why the characters still show up when placed in textboxes, because nothing converts them to UTF, but with TinyMCE it was a problem.

I agree that it is purely an encoding issue. In PHP, this is how I solved this problem:
Before passing the html-fragment to SimpleXMLElement constructor I decoded it by using html_entity_decode.
Then further encoded it using utf8_encode().
$headerDoc = '<temp>' . utf8_encode(html_entity_decode($headerFragment)) . '</temp>';
$xmlHeader = new SimpleXMLElement($headerDoc);
Now the above code does not throw any undefined entity errors.

You could HTML-parse the text and have it re-escaped with the respective numeric entities only (like: →  ). In any case — simply using un-sanitized user input is a bad idea.
All of the numeric entities are allowed in XML, only the named ones known from HTML do not work (with the exception of &, ", <, >, &apos;).
Most of the time though, you can just write the actual character (ö → ö) to the XML file so there is no need to use an entity reference at all. If you are using a DOM API to manipulate your XML (and you should!) this is your safest bet.
Finally (this is the lazy developer solution) you could build a broken XML file (i.e. not well-formed, with entity errors) and just pass it through tidy for the necessary fix-ups. This may work or may fail depending on just how broken the whole thing is. In my experience, tidy is pretty smart, though, and lets you get away with a lot.

1. I can find and replace all [ ?] and swap them out with [ ?] or an actual space.
This is a robust method, but it requires you to have a table of all the HTML entities (I assume the pasted input is coming from HTML) and to parse the pasted text for entity references.
2. I can place the code in question within a CDATA section.
In other words disable parsing for the whole section? Then you would have to parse it some other way. Could work.
3. I can include these entities within the XML file.
You mean include the entity definitions? I think this is an easy and robust way, if you don't mind making the XML file quite a bit bigger. You could have an "included" file (find one on the web) which is an external entity, which you reference from the top of your main XML file.
One downside is that the XML parser you use has to be one that processes external entities (which not all parsers are required to do). And it must correctly resolve the (possibly relative) URL of the external entity to something accessible. This is not too bad but it may increase constraints on your processing tools.
4. You could forbid non-XML in the pasted content. Among other things, this would disallow entity references that are not predefined in XML (the 5 that Tomalak mentioned) or defined in the content itself. However this may violate the requirements of the application, if users need to be able to paste HTML in there.
5. You could parse the pasted content as HTML into a DOM tree by setting someDiv.innerHTML = thePastedContent;
In other words, create a div somewhere (probably display=none, except for debugging). Say you then have a javascript variable myDiv that holds this div element, and another variable myField that holds the element that is your input text field. Then in javascript you do
myDiv.innerHTML = myField.value;
which takes the unparsed text from myField, parses it into an HTML DOM tree, and sticks it into myDiv as HTML content.
Then you would use some browser-based method for serializing (= "de-parsing") the DOM tree back into XML. See for example this question. Then you send the result to the server as XML.
Whether you want to do this fix in the browser or on the server (as #Hannes suggested) will depend on the size of the data, how quick the response has to be, how beefy your server is, and whether you care about hackers sending not-well-formed XML on purpose.

Use "htmlentities()" with flag "ENT_XML1": htmlentities($value, ENT_XML1);
If you use "SimpleXMLElement" class:
$SimpleXMLElement->addChild($name, htmlentities($value, ENT_XML1));

If you want to convert all characters, this may help you (I wrote it a while back) :
http://www.lautr.com/convert-all-applicable-characters-to-numeric-entities-for-use-in-xml
function _convertAlphaEntitysToNumericEntitys($entity) {
return '&#'.ord(html_entity_decode($entity[0])).';';
}
$content = preg_replace_callback(
'/&([\w\d]+);/i',
'_convertAlphaEntitysToNumericEntitys',
$content);
function _convertAsciOver127toNumericEntitys($entity) {
if(($asciCode = ord($entity[0])) > 127)
return '&#'.$asciCode.';';
else
return $entity[0];
}
$content = preg_replace_callback(
'/[^\w\d ]/i',
'_convertAsciOver127toNumericEntitys', $content);

This question is a general problem for any language that parses XML or JSON (so, basically, every language).
The above answers are for PHP, but a Perl solution would be as easy as...
my $excluderegex =
'^\n\x20-\x20' . # Don't Encode Spaces
'\x30-\x39' . # Don't Encode Numbers
'\x41-\x5a' . # Don't Encode Capitalized Letters
'\x61-\x7a' ; # Don't Encode Lowercase Letters
# in case anything is already encoded
$value = HTML::Entities::decode_entities($value);
# encode properly to numeric
$value = HTML::Entities::encode_numeric($value, $excluderegex);

Related

Escaping DOI links in php - when esc_url() is not enough

I am writing php code that generates html that contains links to documents via their DOI. The links should point to https://doi.org/ followed by the DOI of the document.
As the results is a url, I thought I could simply use php's esc_url() function like in
echo '' . esc_url('https://doi.org/' . $doi)) . '';
as this is what one is supposed to use in text nodes, attribute nodes or anywhere else. Unfortunately things apparenty aren't that easy...
The problem is that DOIs can contain all sorts of special characters that are apparently not handled correctly by esc_url(). A nice example of such a DOI is
10.1002/(SICI)1521-3978(199806)46:4/5<493::AID-PROP493>3.0.CO;2-P
which is supposed to link to
https://doi.org/10.1002/(SICI)1521-3978(199806)46:4/5<493::AID-PROP493>3.0.CO;2-P
With $doi equal to this DOI the above code however produces a link that is displayed and links to https://doi.org/10.1002/(SICI)1521-3978(199806)46:4/5493::AID-PROP4933.0.CO;2-P.
This leads me to the question: If esc_url() is obviously not one-size-fits-all no-brained solution to escaping urls, then what should I use? For this case I can get the result I want with
esc_url(htmlspecialchars('https://doi.org/' . $doi))
but is this really the right way™ of doing it? Does this have any other unwanted side effects? If not, then why does esc_url() not also escape < and >? Would esc_html() be better than htmlspecialchars()? If so, should I nest it into a esc_url()?
I am aware that there are many articles on escaping urls in php on stackoverflow, but I couldn't find one that addresses the issues of < and > signs.

I'm no PHP expert, but I do know about DOIs and SICIs can be really annoying.
URL-encoding and HTML encoding are separate things, so it makes sense to think about them separately. You must escape the angle-brackets to make correct HTML. As for the URL-escaping, you should also do this because there are other characters that might break URLs (such as the # character, which also pops up from time to time).
So I would recommend:
'https://doi.org/' . htmlspecialcharacters(urlencode($doi))
Which will give you:
Click here
Note the order of function application, and the fact that you don't want to encode the https://doi.org resolver!
To the above "dipshit decision" comment... it's certainly inconvenient. But SICIs were around before DOIs and it's one of those annoying things we've had to live with ever since!

Settings that could influence PHP str_replace behaviour

I am currently working on a replacement tool that will dynamically replace certain strings (including html) in a website using a smarty outputfilter.
For the replacement to take place, I am using PHP's str_ireplace method, which reads the code that is supposed to be replaced and the replacement code from a database, and then pass the result to the smarty output (using an output filter), in a similar way as the below.
$tpl_source = str_ireplace($replacements['sourceHTML'], $replacements['replacementHTML'], $tpl_source);
The problem is, that although it works great on my dev server, once uploaded to the live server replacements occasionally fail. The same replacements work just fine on my dev version though. After some examinations and googling there was not much I could find out regarding this issue. So my question is, what could influence str_replace's behavour?
Thanks
Edit with replacement example:
$htmlsource = file_get_contents('somefile.html');
$newstr = str_replace('Some text', 'sometext', $htmlsource); // the text to be replaced does exist in the html source
fails to replace. After some checking, it looks like the combination of "> creates a problem. But just the combination of it. If I try to change only (") it works, if I try to change only (>) it works.

It might be that special chars like umlauts do not display on the live server correctly and so str_replace() would fail, if there are specialchars inside the string you want to replace.

Is the input string identical on both systems? Have you verified this? Are you sure?
Things to check:
Are the HTML attributes in the same order?
Are the attribute values using the same kind quote marks? (eg <a href='#'> vs <a href="#">)
Is there any other stray HTML code getting in there?
Is the entity encoding the same? (eg vs   - same character; different HTML)
Is the character-set the same? (eg utf-8 vs ISO 8859-1: Accented characters will be encoded differently)
Any of these things will affect the result and produce the failures you're describing.

This was a trikcy problem, and it ended up having nothing to do with the str_replace method itself;
We are using smarty as a tamplating system. The str_replace method was used by a smarty ouput filter in order to change the html in some ocassions, just before it was delivered to the user.
Here is the Smarty outputfilter Code:
function smarty_outputfilter_replace($tpl_source, &$smarty)
{
$replacements = Content::getReplacementsForPage();
if (is_array($replacements))
{
foreach ($replacements as $replacementData)
{
$tpl_source = str_replace($replacementData['sourcecode'], $replacementData['replacementcode'], $tpl_source);
}
}
return ($tpl_source);
}
So this code failed now and then for now apparent reason... until I realized that the HTML code in the smarty template was being manipulated by an Apache filter.
This resulted into the source code in the browser (which we were using as the code to be replaced by something else) not being identical to the template code (which smarty was trying to modify). Result? str_replace failed :)

Replace '&' with 'and' on the fly in PHP

Is there a way to replace the character & with and in a PHP web form as the user types it rather than after submitting the form?
When & is inserted into our database our search engine doesn't interpret the & correctly replacing it with & returning an incorrect search result (i.e. not the result that included &).
Here is the field we would like to run this on:
<input type="text" name="project_title" id="project_title" value="<?php echo $project_title; ?>" size="60" class="btn_input2"/>

Is there a way to replace the character & with and in a PHP web form as the user types it rather than after submitting the form?
PHP is on the server, it has no control over anything taking place under any circumstances what-so-ever on the client-side. It sends raw text from the web server, a 100megaton thermonuclear device explodes, and PHP never exists anymore after the content is sent. Just the document received on your client side remains. To work with effects on your client side, you need to work with JavaScript.
To do that, you would pick your favorite JavaScript library and add an event listener for "keyup" events. Replace ampersands with "and", and drop the replacement text back in the box. mugur has posted an answer that shows you how to do this.
This is a horrible solution in practice because your users will be screaming for bloody justice to deliver them from such an awful user experience. What you've ended up doing is replacing the input text with something they didn't want. Other search tools do this, why can't yours? You hit backspace, then what? When you hit in the text, you probably lose your cursor position.
Not only that, you're treating a symptom rather than the cause. Look at why you're doing this:
The reason is when & is inserted into our database our search engine flips out and replaces it with & which then returns an incorrect result (i.e. not the result that included &).
No, your database and search engine do no such thing as "flipping out". You're not aware of what's going on and try to treat symptoms rather than learn the cause and fix it. Your symptom cure will create MORE issues down the road. Don't do it.
& is an HTML Entity Code. Every "special" charecter has one. This means your database also encodes > as > as well as characters with accents in them (such as French, German, or Spanish texts). You get "Wrong" results for all of these.
You didn't show any code so you don't get any code. But here's what your problem is.
Your code is converting raw text into HTML Entity codes where appropriate, you're searching against a non-encoded string.
Option 1: Fix the cause
Encode your search text with HTML entities so that it matches for all these cases. Match accent charecters with their non-accented cousins so searching for "francais" might return "français".
Option 2: Fix one symptom
Do a string replace for ampersands either on the client or server side, your search breaks for all other encodings. Never find texts such as "Bob > Sally". Never find "français".

Before submitting the form you'd need to use JavaScript to change as the user types it in. Not ideal since JS can be turned off.
You'd be much better to "clean" the ampersands after submitting but before inserting into the database.
A simple str_replace should work:
str_replace(' & ',' and ', $_POST['value']);
But as others have pointed out, this isn't a good solution. The best solution would be to encode the ampersands as they go into the database (which seems to be happening just now), then modify your search script to allow for this.

You can do that as they complete the form with jquery like this:
$('#input').change(function() { // edited conforming Icognito suggestion
var some_val = $('#input').val().replace('&', 'and');
$('#input').val( some_val );
});
EDIT: working example (http://jsfiddle.net/4gXZW/13/)
JS:
$('.target').change(function() {
$('.target').val($('.target').val().replace('&', 'and'));
});
HTML:
<input class="target" type="text" value="Field 1" />
Otherwise you can do that in PHP before the insert sql.
$to_insert = str_replace("&", "and", $_POST['your_variable']);

Ampersand issue in w3c validator and search engine

I'm using more than one ampersand in my url, see my link below
http://www.theonlytutorials.com/video.php?cat=55&vid=3975&auth=many
When i try to validate in w3c validator it showed hundreds of error because of this & (ampersand).
After that i read some post in here and i got the solution too.
Instead of using (&) If i use (&) w3c validates fine.
But the problem now is in search Engine. Instead of taking (&). it is taking like the below link
http://www.theonlytutorials.com/video.php?cat=55&vid=3975&auth=many
if you copy paste the above link in the address bar it will take you to the wrong page!. Please help how can i solve it.

There must be an error in your code but since we cannot see any of it I think the most important bit is to understand why the W3C validator complaints about raw &.
The HTML syntax contains two basic elements: tags (e.g. <strong>) and entities (e.g. €). Everything else is displayed as-is.
Browsers are expected to ignore errors.
When you type unknown or invalid tags, the browser will do its best to guess and fix it (you are probably aware of that already):
<p>Hello <i>world</b>!</p>
... will render as:
<p>Hello <i>world</i>!</p>
But the same happens when you type an unknown or invalid entity. In your example, there are two invalid entities:
http://www.theonlytutorials.com/video.php?cat=55&vid=3975&auth=many
^^^^ ^^^^^
However, it works because the browser is clever enough to figure out the real URL. Only the validator complaints because it is a tool specifically designed to find errors.
Now, imagine I want to use HTML to write an HTML tutorial and I want to explain the <strong> tag. If I just type <strong>example</strong>, the browser will display example. I need to encode the < symbol so it no longer has a special meaning:
<strong>example</strong>
Now the browser displays <strong>example</strong>, which is precisely the content I want to show.
The same happens with your URL. Since & is part of the entity syntax, when I want to insert a literal & I need to encode it as well:
Barnes & Noble
... will render as Barnes & Noble. Please note that this is only a syntactic trick to insert plain text into a HTML document. Your document shows Barnes & Noble. to all effects, no matter how you encode it. So when you replace & with & in your URL, you are not changing your URL, you are just encoding it.
If search engines are spidering the wrong URL, that means you have actually changed your URL rather than just encoding it, so the source code is:
http://www.theonlytutorials.com/video.php?cat=55&amp;vid=3975&amp;auth=many
... and renders as:
http://www.theonlytutorials.com/video.php?cat=55&vid=3975&auth=many
This can happen, for instance, if you encode twice:
<?php
$url = 'http://www.theonlytutorials.com/video.php?cat=55&vid=3975&auth=many';
$url = htmlspecialchars($url);
$url = htmlspecialchars($url);
echo $url;
... or:
<?php
$url = 'http://www.theonlytutorials.com/video.php?cat=55&vid=3975&auth=many';
$url = htmlspecialchars($url); // Oops: URL is already encoded!
echo $url;

Seems that you made a typo error, it must be & not &ampamp;

Regular expression to match ">", "<", "&" chars that appear inside XML nodes

I'm trying to write a regular expression using the PCRE library in PHP.
I need a regex to match only &, > and < chars that exist within string part of any XML node and not the tag declaration themselves.
Input XML:
<pnode>
<cnode>This string contains > and < and & chars.</cnode>
</pnode>
The idea is to to a search and replace these chars and convert them to XML entities equivalents.
If I was to convert the entire XML to entities the XML would look like this:
Entire XML converted to entities
<pnode>
<cnode>This string contains > and < and & chars.</cnode>
</pnode>
I need it to look like this:
Correct XML
<pnode>
<cnode>This string contains > and &lt and & chars.</cnode>
</pnode>
I have tried to write a regular expression to match these chars using look-ahaead but I don't know enough to get this to work. My attempt (currently only attempting to match > symbols):
/>(?=[^<]*<)/g
Just to make it clear the XML I'm trying to fix comes from a 3rd party and they seem unable to fix it their end hence my attempt to fix it.

In the end I've opted to use the Tidy library in PHP. The code I used is shown below:
// Specify configuration
$config = array(
'input-xml' => true,
'show-warnings' => false,
'numeric-entities' => true,
'output-xml' => true);
$tidy = new tidy();
$tidy->parseFile('feed.xml', $config, 'latin1');
$tidy->cleanRepair()
This works perfectly correcting all the encoding errors and converting invalid characters to XML entities.

Classic example of garbage in, garbage out. The real solution is to fix the broken XML exporter, but obviously that's out of the scope of your problem. Sounds like you might have to manually parse the XML, run htmlentites() on the contents, then put the XML tags back.

I'm reasonably certain it's simply not possible. You need something that keeps track of nesting, and there's no way to get a regular expression to track nesting. Your choices are to fix the text first (when you probably can use an RE) or use something that's at least vaguely like an XML parser, specifically to the extent of keeping track of how the tags are nested.
There's a reason XML demands that these characters be escaped though -- without that, you can only guess about whether something is really a tag or not. For example, given something like:
<tag>Text containing < and > characters</tag>
you and I can probably guess that the result should be: ...containing < and >... but I'm pretty sure the XML specification allows the extra whitespace, so officially "< and >" should be treated as a tag. You could, I suppose, assume that anything that looks like an un-matched tag really isn't intended to be a tag, but that's going to take some work too.

Would it be possible to intercept the text before it tries to become part of your XML? A few ounces of prevention might be worth pounds of cure.

This should do it for ampersands:
/(\s+)(&)(\s+)/gim
This means you're only looking for those characters when they have whitespace characters on both sides.
Just make sure the replacement expression is "$1$2amp;$3";
The others would go like this, with their replacement expressions on the right
/(\s+)(>)(\s+)/gim "$1>$2"
/(\s+)(<)(\s+)/gim "$1<$2"

As stated by others, regular expressions don't do well with hierarchical data. Besides, if the data is improperly formatted, you can't guarantee that you'll get it right. Consider:
<xml>
<tag>Something<br/>Something Else</tag>
</xml>
Is that <br/> supposed to read <br/>? There's no way to know because it's validly formatted XML.
If you have arbitrary data that you wish to include in your XML tree, consider using a <![CDATA[ ... ]]> block instead. It's treated the same as a text node, and the only thing you don't have to escape is the character sequence ]]>.

What you have there is not, of course, XML. In XML, the characters '<' and '&' may not occur (unescaped) inside text: only inside a comment, CDATA section, or processing instruction. Actually, '>' can occur in text, except as part of the string ']]>'. In well-formed XML, literal '<' and '&' characters signal the start of markup: '<' signals the start of a start tag, end tag, or empty element tag, and '&' signals the start of an entity reference. In both these cases, the next character may NOT be whitespace. So using an RE like Robusto's suggestion would find all such occurrences. You might also need to catch corner cases like '<<', '<\', or '&<'. In this case you don't need to try to parse your input, an RE will work fine.
If the source contains strings like '<something ' where 'something' matches the production for a Name:
Name ::= NameStartChar (NameChar)*
Then you have more of a problem. You are going to have to (try to) parse your input as if it were real XML, and detect the error cases of malformed Names, non-matching start & end tags, malformed attributes, and undefined entity references (to name a few). Unfortunately the error condition isn't guaranteed to happen at the location of the error.
Your best bet may be to use an RE to catch 90% of the error and fix the rest manually. You need to look for a '<' or '&' followed by anything other than a NameStartChar

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.