Escape Special Characters without <> and quotes with PHP

I have an XML/SVG. Part of it:
<text id="p6_segmentMainLabel5-outer" class="p6_segmentMainLabel-outer" style="font-size: 11px; font-family: arial; fill: rgb(170, 170, 170);">BüG [349]</text>
There is a special character inside it. How do I clean the entire XML of such special characters without escaping all the "<" and ">" to &lt; and &gt;? I could make an array of all the characters I want to convert, but I would like a method that only excludes <> and quotes to have a clean XML.

Encoding the umlauts does not make your XML "cleaner", but more difficult to read.
There is no need to encode umlauts and other non-ASCII characters, except if you want to create ASCII XML. That is rarely needed.
Use UTF-8 as the encoding for your XML and you will be fine 99% of the time.
If you need ASCII, specify the encoding in the XML API (the default is UTF-8):
$dom = new DOMDocument('1.0', 'ASCII');
$dom
->appendChild($dom->createElement('text'))
->appendChild($dom->createTextNode('ÄÖÜ'));
echo $dom->saveXml();
Output:
<?xml version="1.0" encoding="ASCII"?>
<text>&#196;&#214;&#220;</text>
It is possible to load the XML into a DOM and copy all the nodes to a new DOM defined to use ASCII:
$source = new DOMDocument();
$source->loadXml(
'<?xml version="1.0" encoding="utf-8" ?><text>ÄÖÜ</text>'
);
$target = new DOMDocument('1.0', 'ASCII');
$target->appendChild(
$target->importNode(
$source->documentElement, TRUE
)
);
echo $target->saveXml();
If you generate XML as text, you can use the htmlentities() function to convert a string.
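For the text-generation route, here is a minimal sketch, assuming the document is declared as UTF-8. htmlspecialchars() with ENT_XML1 escapes only the XML markup characters and leaves umlauts untouched, whereas htmlentities() without that flag emits HTML named entities (such as &uuml;) that XML does not define. The helper name xmlText() is made up for this illustration.

```php
<?php
// Hypothetical helper: escape a UTF-8 string for use as XML text content.
function xmlText(string $value): string
{
    // ENT_XML1: XML escaping rules; ENT_COMPAT: also escape double quotes.
    return htmlspecialchars($value, ENT_XML1 | ENT_COMPAT, 'UTF-8');
}

echo '<?xml version="1.0" encoding="UTF-8"?>', "\n";
printf("<text>%s</text>\n", xmlText('BüG [349] <&>'));
```

The umlaut passes through unchanged; only <, >, & and the double quote are replaced.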

Related

How to dump an XML document's element as a string that has the same encoding as the document?

So for example, an ISO-8859-1 encoded XML document that even has some characters that are not part of the character set of that encoding, let's say the € (euro) symbol. This is possible in XML if the symbol is represented as a Unicode character reference, in this case the &#8364; string:
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
<bar>&#8364;</bar>
</foo>
I need to obtain the bar element string with the same encoding as the document, which means encoded in ISO-8859-1 (and also preserving the character references for characters that are not part of this encoding), i.e. the ISO-8859-1 string <bar>&#8364;</bar>.
I couldn't achieve this by using the saveXML method of the DOMDocument class, since it dumps elements always in UTF-8 (whilst whole documents always in the encoding of their XML declaration):
$DD = new DOMDocument;
$DD -> load('foo.xml');
$dump = $DD -> saveXML($DD -> getElementsByTagName('bar') -> item(0));
The $dump variable resulted in the UTF-8 string <bar>€</bar>.
Notice how elements are also dumped with their character references translated to actual UTF-8 characters.
So, how would I get the ISO-8859-1 string <bar>&#8364;</bar>? Are XML parsers meant to handle this sort of task, or should I just use regular expressions or something else?
Yes, they will decode entities and if you only save a part of a document it will be UTF-8 because it has no way to specify the encoding - it defaults back to UTF-8.
Here is a demo:
$xml = <<<'XML'
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
<bar>&#8364;</bar>
</foo>
XML;
$source = new DOMDocument();
$source->loadXML($xml);
echo "Document Part:\n";
echo $source->saveXML($source->getElementsByTagName('bar')->item(0));
echo "\n\n";
echo "Whole Document:\n";
echo $source->saveXML();
echo "\n\n";
Output:
Document Part:
<bar>€</bar>
Whole Document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
<bar>&#8364;</bar>
</foo>
You could copy the node into a new document. However the output will include the XML declaration with the encoding:
$target = new DOMDocument('1.0', 'ASCII');
$target->appendChild($target->importNode($source->getElementsByTagName('bar')->item(0), true));
echo "Separated Node:\n";
echo $target->saveXML();
Output:
Separated Node:
<?xml version="1.0" encoding="ASCII"?>
<bar>&#8364;</bar>
It looks like the encoding is not used when saveXML() is used with a node argument. When you set the $encoding property on the DOMDocument class it will be used in the saveXML() function, but only when saving the whole document. By checking the source code of the saveXML() function you will see there is even a comment mentioning the encoding property:
if (nodep != NULL) {
[...]
} else {
[...]
/* Encoding is handled from the encoding property set on the document */
xmlDocDumpFormatMemory(docp, &mem, &size, format);
}
According to the Document Object Model (DOM) Level 3 Load and Save Specification a lot of defined types support setting the encoding (and the PHP implementation has it at least on the DOMDocument class). So I'm not sure if it is a bug in the implementation of DOM in PHP. However, the documentation also states that it uses UTF-8 encoding:
Note:
The DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or iconv for other encodings.
So, the solution would be to use such functions to convert it to the correct result or only save the whole XML document with saveXML() without any arguments given.
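The conversion route could look like the following minimal sketch. It assumes the mbstring extension is available and that the node dump is valid UTF-8. The helper name fragmentToLatin1() is hypothetical, and mb_encode_numericentity() is used here in place of the utf8_decode() call mentioned in the note, so that characters outside ISO-8859-1 survive as numeric character references.

```php
<?php
// Hypothetical helper: re-encode a UTF-8 XML fragment as ISO-8859-1,
// keeping characters outside that charset as numeric character references.
function fragmentToLatin1(string $utf8): string
{
    // ISO-8859-1 covers U+0000..U+00FF, so only higher code points are
    // turned into references here.
    $map = [0x100, 0x10FFFF, 0, 0x1FFFFF];
    $withEntities = mb_encode_numericentity($utf8, $map, 'UTF-8');

    return mb_convert_encoding($withEntities, 'ISO-8859-1', 'UTF-8');
}

// Stands in for the result of $DD->saveXML($node).
$dump = "<bar>\u{20AC}</bar>";
echo fragmentToLatin1($dump), "\n";
```

Passing the result of saveXML() with a node argument through such a helper yields a fragment in the document's encoding, without the XML declaration.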

Regular Expression issue for XML

I want to write a string into an XML node, but I have to strip any forbidden characters before doing so. I found the following piece to work:
preg_replace("/[^\\x0009\\x000A\\x000D\\x0020-\\xD7FF\\xE000-\\xFFFD]/", "", $var)
However, it removes a lot of characters that I want to keep, such as space, ;, &, <, >, \, and /.
I did some searching and found space to be x0020 so I tried first to allow spaces by changing the above code to:
preg_replace("/[^\\x0009\\x000A\\x000D\\x0021-\\xD7FF\\xE000-\\xFFFD]/", "", $var)
but it still removes spaces. I just want to remove those weird hidden "command" characters. How can I do that?
EDIT: I have previously made $var with htmlspecialchars(), hence why I want to keep & and ;
You don't have to strip them.
If you use an XML API like DOM or XMLWriter it will encode the special characters into entities:
$document = new DOMDocument('1.0', 'UTF-8');
$document
->appendChild($document->createElement('foo'))
->appendChild($document->createTextNode("\x09\x0A\x0D\x20 ä ç <&>"));
echo $document->saveXml();
Output:
<?xml version="1.0" encoding="UTF-8"?>
<foo>	
&#13;  ä ç &lt;&amp;&gt;</foo>
The XML parser will decode them again:
$document = new DOMDocument('1.0', 'UTF-8');
$document->loadXml($xml);
var_dump($document->documentElement->textContent);
Output:
string(14) "
ä ç <&>"
Do you need to add a "u" to the end of your regex, so PHP knows you want Unicode matching? See also UTF-8 in PHP regular expressions.
I also wonder if you might want to replace those characters with spaces, rather than nothing. It depends on what you're doing, but if newlines get dropped, as they do now, words could end up joined across lines.
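A minimal sketch of the "u" modifier suggestion, assuming the input is valid UTF-8. The helper name stripInvalidXmlChars() is made up, and the character class follows the ranges from the question, plus the supplementary planes.

```php
<?php
// Hypothetical helper: drop code points that XML 1.0 does not allow.
function stripInvalidXmlChars(string $value): string
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{...} refers to code points rather than bytes.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, '', $value);

    // preg_replace() returns null if the subject is not valid UTF-8.
    return $clean ?? '';
}

var_dump(stripInvalidXmlChars("a\x00b\x0Bc d &amp; <e>;"));
```

Spaces, ;, &, < and > are inside the allowed ranges, so they are kept. Only control characters such as NUL or vertical tab are removed.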

Euro Currency Symbol breaks XML document

I am adding content to an XML document using PHP's file_put_contents() and then opening that document in Microsoft Word. The problem is that if I add the euro currency symbol (€), the document breaks and I get the following error:
&euro; is not a valid XML entity.
Trying to solve encoding issues with entities is a bad practice. Instead, make sure all your strings are properly UTF-8.
First make sure that your strings actually are UTF-8. The methods and functions in PHP expect UTF-8 input, independent of the output encoding. It is possible to work with other character sets/encodings, but this gets really complex.
If you create the XML using an XML API like DOM or XMLWriter, it will take care of the encoding as needed. In a UTF-8 XML document the € does not need to be encoded.
$document = new DOMDocument('1.0', 'UTF-8');
$document
->appendChild($document->createElement('price'))
->appendChild($document->createTextNode('€ 42.00'));
echo $document->saveXml();
Output:
<?xml version="1.0" encoding="UTF-8"?>
<price>€ 42.00</price>
However, in an ASCII XML document the special character needs to be encoded as a numeric entity. Named entities like &euro; will not work. They are specific to (X)HTML and not XML.
$document = new DOMDocument('1.0', 'ASCII');
$document
->appendChild($document->createElement('price'))
->appendChild($document->createTextNode('€ 42.00'));
echo $document->saveXml();
Output:
<?xml version="1.0" encoding="ASCII"?>
<price>&#8364; 42.00</price>
The same is possible with XMLWriter:
$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'ASCII');
$writer->writeElement("price", '€ 42.00');
$writer->endDocument();
echo $writer->outputMemory();
If you generate the XML as text (usually not the best choice), you will have to take care of the encoding yourself:
echo '<?xml version="1.0" encoding="UTF-8"?>', "\n";
printf('<price>%s</price>', htmlentities('€ 42.00', ENT_XML1 | ENT_COMPAT, "UTF-8"));
Output:
<?xml version="1.0" encoding="UTF-8"?>
<price>€ 42.00</price>
Have you tried using the numeric character reference '&#8364;'? Also make sure you clean up your string using the snippet below:
$currentString = preg_replace('/[^!-~ ]/', '', $currentString);

Encode ’ to be XML safe

I have a string that contains a right single quotation mark:
$str = "David’s Spade";
I am sending the string via XML and need to encode it. I have read that I should encode string using htmlspecialchars, but I have found that XML request still fails whereas htmlentities works.
When I error_log $str:
$str; // David\xe2\x80\x99s Spade
htmlspecialchars($str); // David\xe2\x80\x99s Spade
htmlspecialchars($str, ENT_QUOTES, 'UTF-8'); // David\xe2\x80\x99s Spade
htmlentities($str); // David&rsquo;s Spade
Would it be better to str_replace ’ and then use htmlentities? Are there any other chars htmlentities may miss?
I am sending the string via XML and need to encode it.
No, you don't. If the XML is UTF-8 encoded (it is by default) and as your $str is UTF-8 encoded (as you show by the binary sequences in your question), you do not need to encode it.
This is by the book. Given the technical details of the data you are working with, this is clear and fine.
You then write that some things work and others don't. Whatever you do there, the problem lies within the things you've hidden from your question.
To make this more explicit:
$str = "David’s Spade"; // "David\xE2\x80\x99s Spade"
is a perfectly valid string, for example to use it with an XML library like Simplexml to add it to an XML document:
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0" encoding="UTF-8"?>
<doc><element>David’s Spade</element></doc>
As you can see, the XML has been encoded by not changing the byte-sequence of the string here because it's UTF-8.
Let's take some ASCII:
$xml = new SimpleXMLElement('<doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0"?>
<doc><element>David&#x2019;s Spade</element></doc>
As this example shows, it depends on the document encoding then. This second example is a fall-back of Simplexml to make the output more robust, but actually this wouldn't be necessary as UTF-8 would be the default encoding.
In any case, you should not need to worry much about the encoding yourself if you use a library that specializes in creating XML documents. PHP has several for exactly that. Pick one of them.

Parsing xml with PHP what to do with characters like these

I'm parsing an xml document using php.
When I see the result in my browser I get the following characters:
Ã± instead of Spanish ñ
Ã­ instead of í
Ã¡ instead of á
Ã³ instead of ó
Ã© instead of é
I was going to use a str_replace and replace every odd character for the good ones, but sadly the pattern before happens only sometimes and in general I have a wide collection of odd characters :(
The xml heading is:
<?xml version="1.0" encoding="iso-8859-1"?>
But if I change it to utf-8 it simply won't be printed ..
I load the xml as a string with simplexml_load_string (comes from database like that)
Can you please give me any ideas on how to solve this?
Thanks a lot
You have 2 options:
a) include a header('Content-Type: text/html; charset=iso-8859-1'); before any output in your php file.
b) convert the output to utf-8 with $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
Both should do the trick.
SimpleXML uses UTF-8 to encode stored strings. You can use an XML-File with iso-8859-1, but if you want to print XML values with this encoding, you have to use utf8_decode before.
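A minimal sketch of that conversion, assuming the mbstring and SimpleXML extensions are available. mb_convert_encoding() is used here instead of utf8_decode(), which is deprecated as of PHP 8.2. The helper name toLatin1() and the sample document are made up for illustration.

```php
<?php
// Hypothetical helper: convert a UTF-8 value to ISO-8859-1 for output.
function toLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

// Sample document declared as ISO-8859-1; the ñ is written as a
// character reference so this source file stays plain ASCII.
$xml = simplexml_load_string(
    '<?xml version="1.0" encoding="iso-8859-1"?><doc><name>Espa&#241;a</name></doc>'
);

$value  = (string) $xml->name; // UTF-8 bytes, whatever the declaration says
$latin1 = toLatin1($value);    // single-byte ISO-8859-1 string

echo bin2hex($value), "\n", bin2hex($latin1), "\n";
```

The converted string only displays correctly if the page is also served with a matching charset=iso-8859-1 header. Otherwise it is simpler to keep the UTF-8 value and declare the page as UTF-8.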
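// Remove control characters and all non-ASCII bytes from the string
$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
// Load the XML file (the third argument marks the first one as a path, not XML data)
$xml = new SimpleXMLElement('new.xml', 0, true);
// Display the XML in textual form
echo $xml->asXML();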
