Formatting a string for an XML attribute in PHP

I have some strings that are valid in my database but when I include them in an attribute of a UTF-8 XML output they give me the following error:
XML Parsing Error: not well-formed
My current code (simplified):
header('Content-Type: text/xml');
echo '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';
echo '<root attribute="' . htmlentities($string_from_hell) . '">';
How should I format these strings before including them in XML attributes?
A possible value for $string_from_hell:  (don't know if it will show up properly)

Try
htmlspecialchars($string_from_hell, ENT_QUOTES, "UTF-8")
htmlentities won't do because it creates HTML entities that XML does not recognize. You should also specify the charset explicitly: before PHP 5.4 the default was ISO-8859-1, not UTF-8.
You're also missing the quotes (") around the attribute value.
There are also better ways to create XML files that handle escaping for you. See e.g. XMLWriter.
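To make both suggestions concrete, here is a minimal sketch. The helper names xmlAttr() and buildRoot() and the sample value are made up for illustration; ENT_XML1 (available since PHP 5.4) restricts htmlspecialchars to the entities XML predefines, and XMLWriter is assumed to be installed.

```php
<?php
// Hypothetical helper: escape a UTF-8 string for use inside a
// double- or single-quoted XML attribute value.
function xmlAttr(string $value): string
{
    // ENT_XML1 keeps the output to the five entities XML predefines.
    return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper: let XMLWriter do the escaping instead.
function buildRoot(string $value): string
{
    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElement('root');
    $writer->writeAttribute('attribute', $value); // escaped by XMLWriter
    $writer->endElement();
    $writer->endDocument();

    return $writer->outputMemory();
}

$value = 'Tom & "Jerry" <cartoon>';

// Manual escaping:
echo '<root attribute="' . xmlAttr($value) . '"/>', "\n";

// Library escaping:
echo buildRoot($value);
```

Neither approach removes characters that XML forbids outright (most control characters), so a string containing those would still need cleaning first.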

Related

How to dump an XML document's element as a string that has the same encoding as the document?

So for example, take an ISO-8859-1 encoded XML document that contains some characters that are not part of that encoding's character set, let's say the € (euro) symbol. This is possible in XML if the symbol is represented as a unicode character entity, in this case the &#8364; string:
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
<bar>&#8364;</bar>
</foo>
I need to obtain the bar element as a string with the same encoding as the document, which means encoded in ISO-8859-1 (which also means preserving the unicode character entities that are not part of this encoding), i.e. the ISO-8859-1 string <bar>&#8364;</bar>.
I couldn't achieve this by using the saveXML method of the DOMDocument class, since it dumps elements always in UTF-8 (whilst whole documents always in the encoding of their XML declaration):
$DD = new DOMDocument;
$DD -> load('foo.xml');
$dump = $DD -> saveXML($DD -> getElementsByTagName('bar') -> item(0));
The $dump variable resulted in the UTF-8 string <bar>€</bar>.
Notice how elements are also dumped with their unicode character entities translated to actual UTF-8 characters.
So, how would I get the ISO-8859-1 string <bar>&#8364;</bar>? Are XML parsers meant to handle this sort of task, or should I just use regular expressions or something else?
Yes, they will decode entities and if you only save a part of a document it will be UTF-8 because it has no way to specify the encoding - it defaults back to UTF-8.
Here is a demo:
$xml = <<<'XML'
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
<bar>&#8364;</bar>
</foo>
XML;
$source = new DOMDocument();
$source->loadXML($xml);
echo "Document Part:\n";
echo $source->saveXML($source->getElementsByTagName('bar')->item(0));
echo "\n\n";
echo "Whole Document:\n";
echo $source->saveXML();
echo "\n\n";
Output:
Document Part:
<bar>€</bar>
Whole Document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
<bar>&#8364;</bar>
</foo>
You could copy the node into a new document. However the output will include the XML declaration with the encoding:
$target = new DOMDocument('1.0', 'ASCII');
$target->appendChild($target->importNode($source->getElementsByTagName('bar')->item(0), true));
echo "Separated Node:\n";
echo $target->saveXML();
Output:
Separated Node:
<?xml version="1.0" encoding="ASCII"?>
<bar>&#8364;</bar>
It looks like the encoding is not used when saveXML() is used with a node argument. When you set the $encoding property on the DOMDocument class it will be used in the saveXML() function, but only when saving the whole document. By checking the source code of the saveXML() function you will see there is even a comment mentioning the encoding property:
if (nodep != NULL) {
[...]
} else {
[...]
/* Encoding is handled from the encoding property set on the document */
xmlDocDumpFormatMemory(docp, &mem, &size, format);
}
According to the Document Object Model (DOM) Level 3 Load and Save Specification a lot of defined types support setting the encoding (and the PHP implementation has it at least on the DOMDocument class). So I'm not sure if it is a bug in the implementation of DOM in PHP. However, the documentation also states that it uses UTF-8 encoding:
Note:
The DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or iconv for other encodings.
So, the solution would be to use such functions to convert it to the correct result or only save the whole XML document with saveXML() without any arguments given.
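As a rough sketch of the "save the whole document" route applied to a single element: copy the node into a fresh document that declares the desired encoding, save that document, and strip the XML declaration afterwards. The function name dumpNodeInEncoding() is hypothetical, and the exact form of the character references in the result depends on libxml.

```php
<?php
// Hypothetical helper: serialize one node in the given encoding by
// saving a temporary one-node document and dropping its declaration.
function dumpNodeInEncoding(DOMNode $node, string $encoding): string
{
    $target = new DOMDocument('1.0', $encoding);
    $target->appendChild($target->importNode($node, true));

    // saveXML() without arguments honours the document encoding.
    $xml = (string) $target->saveXML();

    // Remove the leading <?xml ... ?> declaration and trailing newline.
    return trim((string) preg_replace('/^<\?xml[^>]*\?>\s*/', '', $xml));
}

$source = new DOMDocument();
$source->loadXML(
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    . '<foo><bar>&#233;&#8364;</bar></foo>'
);

$bar  = $source->getElementsByTagName('bar')->item(0);
$dump = dumpNodeInEncoding($bar, 'ISO-8859-1');

// $dump is now a byte string in ISO-8859-1; characters outside that
// encoding (the euro sign) stay as character references.
echo bin2hex($dump), "\n";
```

The regular expression only touches the declaration, so it avoids the risks of rewriting element content with regexes.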

PHP htmlentities or htmlspecialchars

I'm using fwrite to create an XML file, but I'm losing the special characters.
Example:
$message0 = htmlspecialchars('<?xml version="1.0" encoding="UTF-8"?>');
$file = fopen("test.xml","w");
echo fwrite($file,"$message0");
fclose($file);
The above code gives me the following output
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&lt;JobTemplates&gt;
I need the special characters in order for the xml file to work. If i echo the variables, the special characters appear on the page.
I don't understand why you're encoding HTML characters for this. It's a trusted string, so just put it in single quotes and write it. If any characters are giving you trouble, escape them instead of encoding them.
If there's a reason you must do it this way, then decode inline. But it all seems a bit messy to me.
Here is a tested example; you should not use htmlspecialchars:
$message0 = '<?xml version="1.0" encoding="UTF-8"?><contact><name>foo</name><phone>123456</phone></contact>';
$file = fopen("test.xml","w");
fwrite($file,$message0);
fclose($file);
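If the values inside the elements are not trusted (user input, database fields), only the values should be escaped, never the markup. A minimal sketch using DOMDocument, which does that for you; the function name writeContactXml() and the sample data are invented for illustration:

```php
<?php
// Hypothetical helper: build the <contact> document from the example
// above and write it to $path. Text nodes are escaped by DOM itself.
function writeContactXml(string $path, string $name, string $phone): bool
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $contact = $doc->appendChild($doc->createElement('contact'));

    $nameNode = $contact->appendChild($doc->createElement('name'));
    $nameNode->appendChild($doc->createTextNode($name));

    $phoneNode = $contact->appendChild($doc->createElement('phone'));
    $phoneNode->appendChild($doc->createTextNode($phone));

    // save() returns the number of bytes written, or false on failure.
    return $doc->save($path) !== false;
}

$path = tempnam(sys_get_temp_dir(), 'xml');
writeContactXml($path, 'Fish & Chips <Ltd>', '123456');
$written = file_get_contents($path);
unlink($path);

echo $written;
```

The XML declaration and tags are written verbatim, while the ampersand and angle brackets in the name come out as entities.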

Encode ’ to be XML safe

I have a string that contains a right single quotation mark:
$str = "David’s Spade";
I am sending the string via XML and need to encode it. I have read that I should encode string using htmlspecialchars, but I have found that XML request still fails whereas htmlentities works.
When I error_log $str:
$str; // David\xe2\x80\x99s Spade
htmlspecialchars($str); // David\xe2\x80\x99s Spade
htmlspecialchars($str, ENT_QUOTES, 'UTF-8'); // David\xe2\x80\x99s Spade
htmlentities($str); // David&rsquo;s Spade
Would it be better to str_replace ’ and then use htmlentities? Are there any other chars htmlentities may miss?
I am sending the string via XML and need to encode it.
No, you don't. If the XML is UTF-8 encoded (it is by default) and as your $str is UTF-8 encoded (as you show by the binary sequences in your question), you do not need to encode it.
This is by the book. Given the technical information about the data you are working with, this is clear and fine.
You then write that some things work and others don't. Whatever you are doing there, the problem lies in the parts you've left out of your question.
To make this more explicit:
$str = "David’s Spade"; // "David\xE2\x80\x99s Spade"
is a perfectly valid string, for example to use it with an XML library like Simplexml to add it to an XML document:
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0" encoding="UTF-8"?>
<doc><element>David’s Spade</element></doc>
As you can see, the byte sequence of the string is left unchanged in the XML here, because the document is UTF-8.
Let's take some ASCII:
$xml = new SimpleXMLElement('<doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0"?>
<doc><element>David&#x2019;s Spade</element></doc>
As this example shows, it depends on the document encoding then. This second example is a fall-back of Simplexml to make the output more robust, but actually this wouldn't be necessary as UTF-8 would be the default encoding.
In any case, you should not have to worry much about the encoding yourself if you use a library that specializes in creating XML documents. PHP has a few for exactly that. Pick one of them.

PHP: htmlentities and htmlspecialchars not converting some characters

I'm trying to convert all special chars into HTML safe entities on their way into my database, but I can't seem to get PHP to handle certain characters. For example, if my string contains any of the following: ¡£¢∞§¶ it gets turned into an empty string.
So for example, the following string:
Hello£
Gets turned into an empty string after it's POSTed and processed by the following code:
$workDetails["copy"] = htmlentities($workDetails["copy"], ENT_QUOTES, "UTF-8");
I presume I'm doing something wrong? :(
Maybe it will just be enough if you change the Encoding of your website to UTF-8 via the header() command:
header("Content-Type: text/html; charset=utf-8"); in PHP
or
<?xml version="1.0" encoding="utf-8" ?>; at the top of your HTML template if you use one.
but if you definitely need to convert those chars to its specific html code, you should create your own function to replace the symbols which are not covered by htmlspecialchars() as well.
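One likely cause, stated as an assumption because the question does not show the raw bytes: htmlentities() returns an empty string when the input is not valid in the encoding you name (here UTF-8) and neither ENT_SUBSTITUTE nor ENT_IGNORE is set. That happens when the form is submitted as ISO-8859-1. A small sketch, where toUtf8() is a hypothetical helper and ISO-8859-1 is only a guess at the real source encoding:

```php
<?php
// Hypothetical helper: return $input unchanged if it is already valid
// UTF-8, otherwise convert it from an assumed source encoding.
function toUtf8(string $input, string $assumedSource = 'ISO-8859-1'): string
{
    // The /u modifier makes PCRE reject invalid UTF-8 subjects.
    if (preg_match('//u', $input) === 1) {
        return $input;
    }

    $converted = iconv($assumedSource, 'UTF-8', $input);
    if ($converted === false) {
        throw new RuntimeException('Could not convert input to UTF-8');
    }

    return $converted;
}

// "Hello£" as it would arrive from an ISO-8859-1 form: £ is byte 0xA3.
$posted = "Hello\xA3";

// Invalid UTF-8 in, empty string out:
var_dump(htmlentities($posted, ENT_QUOTES, 'UTF-8'));

// After conversion the entity is produced as expected:
echo htmlentities(toUtf8($posted), ENT_QUOTES, 'UTF-8'), "\n";
```

Serving the page that contains the form as UTF-8, as suggested above, avoids the conversion step altogether.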

XML Parsing Error: undefined entity - special characters

Why does XML display error on certain special characters and some are ok?
For instance, below will create error,
<?xml version="1.0" standalone="yes"?>
<Customers>
<Customer>
<Name>L&ouml;ic</Name>
</Customer>
</Customers>
but this is ok,
<?xml version="1.0" standalone="yes"?>
<Customers>
<Customer>
<Name>&amp;</Name>
</Customer>
</Customers>
I convert the special character through php - htmlentities('Löic',ENT_QUOTES) by the way.
How can I get around this?
Thanks.
EDIT:
I found that it works fine if I use a numeric character reference such as L&#243;ic
Now I have to find out how to use PHP to convert special characters into numeric character references!
There are five entities defined in the XML specification: &amp;, &lt;, &gt;, &apos; and &quot;
There are lots of entities defined in the HTML DTD.
You can't use the ones from HTML in generic XML.
You could use numeric references, but you would probably be better off just getting your character encodings straight (which basically boils down to:
Set your editor to save the data in UTF-8
If you process the data with a programming language, make sure it is UTF-8 aware
If you store the data in a database, make sure it is configured for UTF-8
When you serve up your document, make sure the HTTP headers specify that it is UTF-8 (in the case of XML, UTF-8 is the default, so not specifying anything is almost as good)
)
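If you do want the numeric-reference route that the question's edit asks about, here is one possible sketch. The function name toNumericReferences() is made up, the input is assumed to be valid UTF-8, and iconv is assumed to be available. The XML special characters still have to be escaped separately.

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8
// string with a decimal numeric character reference.
function toNumericReferences(string $utf8): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u', // one non-ASCII code point at a time
        static function (array $match): string {
            // UTF-32BE stores the code point as a 4-byte big-endian integer.
            $codePoint = unpack('N', iconv('UTF-8', 'UTF-32BE', $match[0]))[1];

            return '&#' . $codePoint . ';';
        },
        $utf8
    );

    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $result;
}

// "Löic" written with explicit UTF-8 bytes for ö (0xC3 0xB6).
echo '<Name>' . toNumericReferences("L\xC3\xB6ic") . '</Name>', "\n";
```

The result is plain ASCII, so it parses the same way whatever encoding the XML document declares.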
Because it is not a built-in entity; it is instead an external entity that needs to be declared in a DTD.
TLDR Solution
You can solve this problem with html_entity_decode() (Source: PHP.net), like so...
$xml_line = '<description>' . html_entity_decode($description) . '</description>';
Full, Working Demo Online
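In this demo, I use &rsquo; and a line from the Tao Te Ching to demonstrate the above use of html_entity_decode()...
$title = 'The name you can say isn&rsquo;t the real name.';
// Decode HTML named entities (such as &rsquo;) into real UTF-8 characters.
$xml_title = html_entity_decode($title, ENT_QUOTES, 'UTF-8');
// Re-escape the characters that are markup in XML; '&' must come first.
$xml_title = str_replace(['&', '<', '>'], ['&amp;', '&lt;', '&gt;'], $xml_title);
$xml_line = '<title>' . $xml_title . '</title>';
print($xml_line);
Don't forget to replace back those &, < and > chars, though!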
How Do You Know It Worked?
Want to verify it worked just fine? Then head on over to the W3C RSS Feed Validator, and see the above code being approved as just fine.
