php output xml produces parse error "’"

php output xml produces parse error "’" - php

Is there any function that I can use to parse any string to ensure it won't cause xml parsing problems? I have a php script outputting a xml file with content obtained from forms.
The thing is, apart from the usual string checks from a php form, some of the user text causes xml parsing errors. I'm facing this "’" in particular. This is the error I'm getting Entity 'rsquo' not defined
Does anyone have any experience in encoding text for xml output?
Thank you!
Some clarification:
I'm outputting content from forms in a xml file, which is subsequently parsed by javascript.
I process all form inputs with: htmlentities(trim($_POST['content']), ENT_QUOTES, 'UTF-8');
When I want to output this content into a xml file, how should I encode it such that it won't throw up xml parsing errors?
So far the following 2 solutions work:
1) echo '<content><![CDATA['.$content.']]></content>';
2) echo '<content>'.htmlspecialchars(html_entity_decode($content, ENT_QUOTES, 'UTF-8'),ENT_QUOTES, 'UTF-8').'</content>'."\n";
Are the above 2 solutions safe? Which is better?
Thanks, sorry for not providing this information earlier.

You take it the wrong way - don't look for a parser which doesn't give you errors. Instead try to have a well-formed xml.
How did you get ’ from the user? If he literally typed it in, you are not processing the input correctly - for example you should escape & to &. If it is you who put the entity there (perhaps in place of some apostrophe), either define it in DTD (<!ENTITY rsquo "&x2019;">) or write it using a numeric notation (’), because almost every of the named entities are a part of HTML. XML defines only a few basic ones, as Gumbo pointed out.
EDIT based on additions to the question:
In #1, you escape the content in the way that if user types in ]]> <°)))><, you have a problem.
In #2, you are doing the encoding and decoding which result in the original value of the $content. the decoding should not be necessary (if you don't expect users to post values like & which should be interpreted like &).
If you use htmlspecialchars() with ENT_QUOTES, it should be ok, but see how Drupal does it.

html_entity_decode($string, ENT_QUOTES, 'UTF-8')

Enclose the value within CDATA tags.
<message><![CDATA[’]]></message>
From the w3schools site:
Characters like "<" and "&" are illegal in XML elements.
"<" will generate an error because the parser interprets it as the start of a new element.
"&" will generate an error because the parser interprets it as the start of an character entity.
Some text, like JavaScript code, contains a lot of "<" or "&" characters. To avoid errors script code can be defined as CDATA.
Everything inside a CDATA section is ignored by the parser.

The problem is that your htmlentities function is doing what it should - generating HTML entities from characters. You're then inserting these into an XML document which doesn't have the HTML entities defined (things like ’ are HTML-specific).
The easiest way to handle this is keep all input raw (i.e. don't parse with htmlentities), then generate your XML using PHP's XML functions.
This will ensure that all text is properly encoded, and your XML is well-formed.
Example:
$user_input = "...<>&'";
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createTextNode($user_input));
$doc->appendChild($element);

I had a similar problem that the data i needed to add to the XML was already being returned by my code as htmlentities() (not in the database like this).
i used:
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createElement('string', htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_XML1, 'UTF-8')));
$doc->appendChild($element);
or if it was not already in htmlentities()
just the below should work
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createElement('string', htmlspecialchars($string, ENT_XML1, 'UTF-8')));
$doc->appendChild($element);
basically using htmlspecialchars with ENT_XML1 should get user imputed data into XML safe data (and works fine for me):
htmlspecialchars($string, ENT_XML1, 'UTF-8');

Use htmlspecialchars() will solve your problem. See the post below.
PHP - Is htmlentities() sufficient for creating xml-safe values?

This worked for me. Some one facing the same issue can try this.
htmlentities($string, ENT_XML1)
With special characters conversion.
htmlspecialchars(htmlentities($string, ENT_XML1))

htmlspecialchars($trim($_POST['content'], ENT_XML1, 'UTF-8');
Should do it.

Related

saveHTML() doesn't output special characters properly

I've looked at other answers (php: using DomDocument whenever I try to write UTF-8 it writes the hexadecimal notation of it, DOMDocument->saveHTML() converting to space) and either they don't apply to my situation, or I'm not understanding them.
I'm feeding some HTML into $dom like this...
$dom = new DOMDocument;
$dom->loadHTML($table_data_for_db);
I then do some stuff with it, then output it like this..
$table_data_for_db = $dom->saveHTML();
echo $table_data_for_db;
The problem is that special characters such as → end up like this â†’.
1.) Is there a way around this?
2.) Is there another way in PHP other than using DOMDocument, loadHTML, etc. to strip out sections of HTML? Like, if I want to remove <style id="fraction_class"> and all of its contents, is there another way?
Thank you.

cast simplexmlelement to string to get inner content but keep htmlspecialchars escaped

i have a xmlfile:
$xml = <<<EOD
<?xml version="1.0" encoding="utf-8"?>
<metaData xmlns="http://www.test.com/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="test">
<qkc6b1hh0k9>testdata&more</qkc6b1hh0k9>
</metaData>
EOD;
now i loaded it into a simplexmlobject and later on i wanted to get the inner of the "qkc6b1hh0k9"-node
$xmlRootElem = simplexml_load_string( $xml );
$xmlRootElem->registerXPathNamespace( 'xmlns', "http://www.test.com/" );
// ...
$xPathElems = $xmlRootElem->xpath( './'."xmlns:qkc6b1hh0k9" );
$var = (string)($xPathElems[0]);
var_dump($var);
I expected to get the string
testdata&more
... but i got
testdata&more
Why is the __toString() method of simplexmlobject converting my escaped specialchars to normal chars? Can I deactivate this behaviour?
I came up with a temp-solution, which I consider as dirty, what do you say?
(strip_tags($xPathElems[0]->asXML()))
May the DOMDocument be an alternative?
Thanks for any help on my questions!
edit
problem solved, problem was not in the __toString method of simplexml, it was later on when using the string with addChild
the behaviour as described above was totaly fine and has to be expected as you can see in the answers...
problems only came up, when the value was added to another xml-document via "addChild".
Since addChild doesn't escape the ampersand (http://www.php.net/manual/de/simplexmlelement.addchild.php#103587) one has to do it manually.

Why is the __toString() method of simplexmlobject converting my escaped specialchars to normal chars? Can I deactivate this behaviour?
Because those "speical" chars are actually XML encoding of characters. Using the string value gives you these characters verbatim again. That is what an XML parser has been made for.
I came up with a temp-solution, which I consider as dirty, what do you say?
Well, shaky. Instead let me suggest you the inverse: XML encode the the string:
$var = htmlspecialchars($xPathElems[0]);
var_dump($var);
May the DOMDocument be an alternative?
No, as SimpleXML it is an XML Parser and therefore you get the text decoded as well. This is not fully true (you can do that with DomDocument by going through all childnodes and picking entity nodes next to character data, but it's much more work as just outlined with htmlspecialchars() above).

If you create an XML tag, by any sane method, and set it to contain the string "testdata&more", this will be escaped as testdata&more. It is therefore only logical that extracting that string content back out reverses the escaping procedure to give you the text you put in.
The question is, why do you want the XML-escaped representation? If you want the content of the element as intended by the author, then __toString() is doing the right thing; there is more than one way of representing that string in XML, but it is the data being represented that you should normally care about.
If for some reason you really need details of how the XML is constructed in that particular instance, you could use a more complex parsing framework such as DOM, which will separate testdata&more into a text node (containing "testdata"), an entity node (with name "amp"), and another text node (containing "more").
If, on the other hand, all you want is to put it back into another XML (or HTML) document, then let SimpleXML do the unescaping properly, and re-escape it at the appropriate time.

The Actual Unicode Characters automatically converted to Numeric values using DOMDocument->saveHTML()

I am using the following function to get the inner html of html string
function DOMinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument('1.0', 'UTF-8');
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML .= trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
my html string also contains unicode character. here is example of html string
$html = '<div>Thats True. Yes it is well defined آپ مجھے تم کہہ کر پکاریں</div>';
When I use the above function
$output = DOMinnerHTML($html);
the output is as below
$output = '<div>Thats True. Yes it is well defined
کے۔سلط&#1575</div>';
the actual unicode characters converted to numeric values.
I have debugged the code and found that in DOMinnerHTML function before the following line
$innerHTML .= trim($tmp_dom->saveHTML());
if I echo
echo $tmp_dom->textContent;
It shows the actual unicode characters but after saving to $innerHTML it outputs the numeric symbols.
Why it is doing that.
Note: please don't suggest me html_entity_decode like functions to convert numeric symbols to real unicode characters because, I also have user formatted data in my html string, that I don't want to convert.
Note: I have also tried by putting the
<meta http-equiv="content-type" content="text/html; charset=utf-8">
before my html string but no difference.

I had a similar problem. After reading the above comment, and after further investigation, I found a very simple solution.
All you have to do is use html_entity_decode() to convert the output of saveHTML(), as follows:
// Create a new dom document
$dom = new DOMDocument();
// .... Do some stuff, adding nodes, ...etc.
// the html_entity_decode function will solve the unicode issue you described
$result = html_entity_decode($dom->saveHTML();
// echo your output
echo $result;
This will ensure that unicode characters are displayed properly

Good question, and you did an excellent job narrowing down the problem to a single line of code that caused things to go haywire! This allowed me to figure out what is going wrong.
The problem is with the DOMDocument's saveHTML() function. It is doing exactly what it is supposed to do, but it's design is not what you wanted.
saveHTML() converts the document into a string "using HTML formatting" - which means that it does HTML entity encoding for you! Sadly, this is not what you wanted. Comments in the PHP docs also indicate that DOMDocument does not handle utf-8 especially well and does not do very well with fragments (as it automatically adds html, doctype, etc).
Check out this comment for a proposed solution by simply using another class: alternative to DOMDocument
After seeing many complaints about certain DOMDocument shortcomings,
such as bad handling of encodings and always saving HTML fragments
with , , and DOCTYPE, I decided that a better solution is
needed.
So here it is: SmartDOMDocument. You can find it at
http://beerpla.net/projects/smartdomdocument/
Currently, the main highlights are:
SmartDOMDocument inherits from DOMDocument, so it's very easy to use - just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing
functionality (see example below).
saveHTMLExact() - DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain
and tags, it adds them automatically (yup, there are no
flags to turn this behavior off). Thus, when you call
$doc->saveHTML(), your newly saved content now has and
DOCTYPE in it. Not very handy when trying to work with code fragments
(XML has a similar problem). SmartDOMDocument contains a new function
called saveHTMLExact() which does exactly what you would want - it
saves HTML without adding that extra garbage that DOMDocument does.
encoding fix - DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output. SmartDOMDocument tries
to work around this problem by enhancing loadHTML() to deal with
encoding correctly. This behavior is transparent to you - just use
loadHTML() as you would normally.

mb_convert_encoding($html,'HTML-ENTITIES','UTF-8');
This worked for me

simplexml_load_string() != simplexml_import_dom()?

If I load an HTML page using DOMDocument::loadHTMLFile() then pass it to simplexml_import_dom() everything is fine, however, if I using $dom->saveHTML() to get a string representation from the DOMDocument then use simplexml_load_string(), I get nothing. Actually, if I use a very simple page it will work, but as soon as there is anything more complex, it fails without any errors in the PHP log file.
Can anyone shed light on this?
Is it something to do with HTML not being parsable XML?
I am trying to strip out CR's and newlines from the formatted HTML text before using the contents as they have nothing to do with the content but get inserted into the SimpleXMLElement object, which is rather tedious.

Is it something to do with HTML not being parsable XML?
YES! HTML is a far less strict syntax so simplexml_load_string will not work with it by itself. This is because simplexml is simple and HTML is convoluted. On the other hand, DOMDocument is designed to be able to read the convoluted HTML structure, which means that since it can make sense of HTML and simplexml can make sense of it, you can bridge the proverbial gap there.
<!-- Valid HTML but not valid XML -->
<ul>
<li>foo
<li>bar
</ul>

HTML may or may not be valid XML. when you use loadHTMLFile it doesnt necessarily have to be well formed xml because the DOM is an HTML one so different rules, but when you pass a string to SimpleXML it must indeed be well formed.

If I get your question correclty and you simply want no whitespace in your output, then there is no need to use simplexml here.
Use: DOMDocument::preservewhitespace
like:
$dom->preserveWhiteSpace = false;
before saveHTML and you're set.

How can I store UTF8 in MySQL with PHP, sanitize it, echo it with XML and transform it with XSLT?

I am developing a MVC application with PHP that uses XML and XSLT to print the views. It need to be fully UTF-8 supported. I also use MySQL right configured with UTF8. My problem is the next.
I have a <input type="text"/> with a value like àáèéìíòóùú"><'##~!¡¿?. This is processed to add it to the database. I use mysql_real_escape_string($_POST["name"]) and then do MySQL a INSERT. This will add a slash \ before " and '.
The MySQL database have a DEFAULT CHARACTER SET utf8 and COLLOCATE utf8_spanish_ci. The table field is a normal VARCHAR.
Then I have to print this on a XML that will be transformed with XSLT. I can use PHP on the XML so I echo it with <?php echo TexUtils::obtainSqlText($value_obtained_from_sql); ?>. The obtainSqlText() function actually returns the same as the $value processed, is waiting for a final structure.
One of the first things that I will need for the selected input is to convert > and < to > and < because this will generate problems with start/end tags. This will be done with <?php htmlspecialchars($string, ENT_QUOTES, "UTF-8"); ?>. This will also converts & to &, " to " and ' to '. This is a big problem: XSLT starts to fail because it doesn't recognize all HTML special characters.
There is another problem. I've talked about àáèéìíòóùú"><'##~!¡¿? input but I will have some text from a CKEditor <textarea /> that the value will look like:
<p>
àáèéìíòóùú"><'##~!¡¿?
</p>
How I've to manage this? At first, if I want to print this second value right I will need to use <xsl:value-of select="value" disable-output-escaping="yes" />. Will "><' print right?
So what I am really looking for is how I need to manage this values and how I've to print. I need to use something if is coming from a VARCHARthat doesn't allows HTML and another if is a TEXT (for example) and allows HTML? I will need to use disable-output-escaping="yes" everytime?
I also want to know if doing this I am really securing the query from XSS attacks.
Thank you in advance!

This will be done with <?php htmlspecialchars($string, ENT_QUOTES, "UTF-8"); ?>.
Fine.
This is a big problem: XSLT starts to fail because it doesn't recognize all HTML special characters.
It shouldn't fail on htmlspecialchars() output, ever. & is a predefined entity in XML and ' is a character reference which is always allowed. htmlspecialchars() should produce XML-compatible output, unlike the usually-a-mistake htmlentities(). What is the error you are seeing?
àáèéìíòóùú"><'##~!¡¿?
Urgh, an HTML rich text editor produced that invalid markup? What a dodgy editor.
If you have to allow users to input arbitrary HTML, it's going to need some processing. Unless you really trust those users, you'll need a purifier (to stop them using dangerous scripting elements and XSS-ing each other), and a tidier (to remove malformed markup either due to crap rich-text-editor output or deliberate sabotage). If you intend to put the content directly into XML you will also need it to convert to XHTML output and replace HTML entity references.
A simple way to do this in PHP would be DOMDocument->loadHTML followed by a walk of the DOM tree removing all but known-good elements/attributes/URL-schemes, followed by DOMDocument->saveXML.
Will "><' print right?
Well, it'll print as in your example, yes. But that's equally invalid as both HTML and XML.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.