error with unicode and simple xml


I am trying to create an XML object from a string with SimpleXML.
Example (http://ideone.com/L4ztum):
$str = "<aoc> САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";
$movies = new SimpleXMLElement($str);
But it gives a warning:
PHP Warning: SimpleXMLElement::__construct(): Entity: line 1: parser error : PCDATA invalid Char value 2 in /home/nmo2E7/prog.php on line 5
and finally an Exception with the message String could not be parsed as XML.
If I remove two Unicode characters, it works (http://ideone.com/LaMvHN):
$str = "<aoc> САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";
^
`-- two invisible characters have been removed here
How can I remove Unicode from string?

It is not Unicode, but two stray bytes, valued \x01 and \x02. You can filter them out with str_replace:
$s = str_replace("\x01", "", $s);
$s = str_replace("\x02", "", $s);
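If other control characters may turn up as well, the two str_replace calls can be generalised. This is only a minimal sketch, assuming the input is valid UTF-8 and that silently dropping the offending characters is acceptable. The function name stripInvalidXmlChars is made up for illustration, and the character class follows the Char production of XML 1.0.

```php
<?php
// Hypothetical helper: removes every character that XML 1.0 does not allow.
function stripInvalidXmlChars(string $s): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $s
    );
    // preg_replace() returns null if $s is not valid UTF-8; keep the input then.
    return $clean ?? $s;
}

// Shortened version of the string from the question, with the two stray bytes.
$str = "<aoc> САМОЛЕТОМ\x02\x01 ТК Адамант</aoc>";
$movies = new SimpleXMLElement(stripInvalidXmlChars($str));
echo $movies, "\n";
```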

The constructor of SimpleXMLElement needs its first parameter to be well-formed XML.
The string you pass
$str = "<aoc> САМОЛЕТОМ\x02\x01 ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";
is not well-formed XML because it contains characters out of the character-range of XML, namely:
Unicode Character 'START OF TEXT' (U+0002) at binary offset 24
Unicode Character 'START OF HEADING' (U+0001) at binary offset 25
So instead of using SimpleXMLElement to create it from a hand-mangled XML-string (which is error-prone), use it to create the XML you're looking for. Let's give an example.
In the following example I assume you've got the text you want to create the XML element of. This example creates an XML element similar to the one in your question with the difference that the exact same string is passed in as text-content for the document element ("<aoc>").
$text = 'САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12';
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><aoc/>');
$xml->{0} = $text; // set the document-element's text-content to $text
When done this way, SimpleXML will filter any invalid control-characters for you and the SimpleXMLElement remains stable:
$str = $xml->asXML();
$movies = new SimpleXMLElement($str);
print_r($movies);
/* output:
SimpleXMLElement Object
(
[0] => САМОЛЕТОМ ТК Адамант, г.Домодедово, мкр-н Востряково, Центральный просп. д.12
)
*/
So to finally answer your question:
How can I remove Unicode from string?
You don't want to remove Unicode from the string. The SimpleXML library accepts Unicode strings only (in the UTF-8 encoding). What you want is to remove the Unicode characters that are invalid in XML. The SimpleXML library does that for you when you set node values, as it has been designed for.
However, if you try to load non-well-formed XML via the constructor or the constructor functions (simplexml_load_string etc.), it will fail and give you the (important) error.
I hope this clarifies the situation for you and answers your question.
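To inspect that error in code rather than as a PHP warning, libxml's internal error handling can be switched on. A minimal sketch, using a shortened version of the string from the question:

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$str = "<aoc> САМОЛЕТОМ\x02\x01 ТК Адамант</aoc>";
$xml = simplexml_load_string($str);   // returns false on a parse failure
$errors = libxml_get_errors();
libxml_clear_errors();

if ($xml === false) {
    foreach ($errors as $error) {
        // Each entry is a LibXMLError with message, line and column.
        echo trim($error->message), " (line {$error->line}, column {$error->column})\n";
    }
}
```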

Related

XML encoding error, but both XML and input text encoding are utf-8 in php

I'm generating an XML DOM with DOMDocument in PHP, containing some news items with title, date, links and a description. The problem occurs in the description of some news items but not others, even though both contain accents and cedillas.
I create the XML Dom element in UTF-8:
$dom = new \DOMDocument("1.0", "UTF-8");
Then, I retrieve my text from a MySQL database, which is encoded in latin-1, and after I tested the encoding with mb_detect_encoding it returns UTF-8.
I tried the following:
iconv('UTF-8', 'ISO-8859-1', $descricao);
iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $descricao);
iconv('ISO-8859-1', 'UTF-8', $descricao);
iconv('ISO-8859-1//TRANSLIT', 'UTF-8', $descricao);
mb_convert_encoding($descricao, 'ISO-8859-1', 'UTF-8');
mb_convert_encoding($descricao, 'UTF-8', 'ISO-8859-1');
mb_convert_encoding($descricao, 'UTF-8', 'UTF-8'); //that makes no sense, but who knows
I also tried changing the database encoding to UTF-8, and changing the XML encoding to ISO-8859-1.
This is the full method that generates the XML:
$informativos = Informativo::where('inf_ativo','S')->orderBy('inf_data','DESC')->take(20)->get();
$dom = new \DOMDocument("1.0", "UTF-8");
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$rss = $dom->createElement("rss");
$channel = $dom->createElement("channel");
$title = $dom->createElement("title", "Informativos");
$link = $dom->createElement("link", "http://example.com/informativos");
$channel->appendChild($title);
$channel->appendChild($link);
foreach ($informativos as $informativo) {
$item = $dom->createElement("item");
$itemTitle = $dom->createElement("title", $informativo->inf_titulo);
$itemImage = $dom->createElement("image", "http://example.com/".$informativo->inf_ilustracao);
$itemLink = $dom->createElement("link", "http://example.com/informativo/".$informativo->informativo_id);
$descricao = strip_tags($informativo->inf_descricao);
$descricao = str_replace("&nbsp;", " ", $descricao);
$descricao = str_replace("&#13;", " ", $descricao);
$descricao = substr($descricao, 0, 150).'...';
$itemDesc = $dom->createElement("description", $descricao);
$itemDate = $dom->createElement("pubDate", $informativo->inf_data);
$item->appendChild($itemTitle);
$item->appendChild($itemImage);
$item->appendChild($itemLink);
$item->appendChild($itemDesc);
$item->appendChild($itemDate);
$channel->appendChild($item);
}
$rss->appendChild($channel);
$dom->appendChild($rss);
return $dom->saveXML();
Here is an example of successful output:
Segundo a instituição, número de pessoas que vivem na pobreza subiu 7,3 milhões desde 2014, atingindo 21% da população, ou 43,5 milhões de br
And an example that gives the encoding error:
procuradores da Lava Jato em Curitiba, que estão sendo investigados por um
suposto acordo fraudulento com a Petrobras e o Departamento de Justi�...
Everything renders fine, until the problematic description text above, that gives me:
"This page contains the following errors:
error on line 118 at column 20: Encoding error
Below is a rendering of the page up to the first error."
Probably that &#13; is the problem here. Since I can't control whether the text contains it, I need to render these special characters correctly.
UPDATE 2019-04-12: Found out the error on the problematic text and changed the example.

The encoding of the database connection is important. Make sure that it is set to UTF-8. It is a good idea to use UTF-8 most of the time (for your fields). Character sets like ISO-8859-1 cover only a very limited number of characters, so a Unicode string encoded into them might lose data.
The second argument of DOMDocument::createElement() is broken. It only escapes some special characters, but not &. To avoid problems, create and append the content as a separate text node. Since DOMNode::appendChild() returns the appended node, the DOMDocument::create* methods can be nested and chained.
$data = [
[
'inf_titulo' => 'Foo',
'inf_ilustracao' => 'foo.jpg',
'informativo_id' => 42,
'inf_descricao' => 'Some content',
'inf_data' => 'a-date'
]
];
$informativos = json_decode(json_encode($data));
function stripTagsAndTruncate($text) {
$text = strip_tags($text);
$text = str_replace(["&nbsp;", "&#13;"], " ", $text);
return substr($text, 0, 150).'...';
}
$dom = new DOMDocument('1.0', 'UTF-8');
$rss = $dom->appendChild($dom->createElement('rss'));
$channel = $rss->appendChild($dom->createElement("channel"));
$channel
->appendChild($dom->createElement("title"))
->appendChild($dom->createTextNode("Informativos"));
$channel
->appendChild($dom->createElement("link"))
->appendChild($dom->createTextNode("http://example.com/informativos"));
foreach ($informativos as $informativo) {
$item = $channel->appendChild($dom->createElement("item"));
$item
->appendChild($dom->createElement("title"))
->appendChild($dom->createTextNode($informativo->inf_titulo));
$item
->appendChild($dom->createElement("image"))
->appendChild($dom->createTextNode("http://example.com/".$informativo->inf_ilustracao));
$item
->appendChild($dom->createElement("link"))
->appendChild($dom->createTextNode("http://example.com/informativo/".$informativo->informativo_id));
$item
->appendChild($dom->createElement("description"))
->appendChild($dom->createTextNode(stripTagsAndTruncate($informativo->inf_descricao)));
$item
->appendChild($dom->createElement("pubDate"))
->appendChild($dom->createTextNode($informativo->inf_data));
}
$dom->formatOutput = TRUE;
echo $dom->saveXML();
Output:
<?xml version="1.0" encoding="UTF-8"?>
<rss>
<channel>
<title>Informativos</title>
<link>http://example.com/informativos</link>
<item>
<title>Foo</title>
<image>http://example.com/foo.jpg</image>
<link>http://example.com/informativo/42</link>
<description>Some content...</description>
<pubDate>a-date</pubDate>
</item>
</channel>
</rss>
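The point about escaping can be checked in isolation. A small sketch with a made-up title string: the text node receives the raw string, and the serializer escapes &, < and > on output.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');

// Create the element without a value, then append the content as a text node.
$title = $dom->appendChild($dom->createElement('title'));
$title->appendChild($dom->createTextNode('Tom & Jerry <live>'));

// Serialize only the element; prints: <title>Tom &amp; Jerry &lt;live&gt;</title>
echo $dom->saveXML($title), "\n";
```

Passing the same string as the second argument of createElement() instead would trigger an "unterminated entity reference" warning because of the bare &.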
Truncating an HTML fragment can result in broken entities and broken code points (if you don't use a UTF-8 aware string function). Here are two approaches to solve it.
You can use PCRE in UTF-8 mode and match n entities/codepoints:
// have some string with HTML and entities
$text = 'Hello<b>äöü</b>&nbsp;&auml;&#13; foobar';
// strip tags and replace some specific entities with spaces
$stripped = str_replace(['&nbsp;', '&#13;'], ' ', strip_tags($text));
// match 0-10 entities or unicode codepoints
preg_match('(^(?:&[^;]+;|\\X){0,10})u', $stripped, $match);
var_dump($match[0]);
Output:
string(18) "Helloäöü &auml;"
However, I would suggest using DOM. It can load HTML and allows you to use XPath expressions on it.
// have some string with HTML and entities
$text = 'Hello<b>äöü</b>&nbsp;&auml;&#13; foobar';
$document = new DOMDocument();
// force UTF-8 and load
$document->loadHTML('<?xml encoding="UTF-8"?>'.$text);
$xpath = new DOMXpath($document);
// use xpath to fetch the first 10 characters of the text content
var_dump($xpath->evaluate('substring(//body, 1, 10)'));
Output:
string(15) "Helloäöü ä"
DOM in general treats all strings as UTF-8, so code points are not a problem. XPath's substring() works on the text content of the first matched node. The arguments are character positions (not indexes), so they start at 1.
DOMDocument::loadHTML() will add html and body tags and decode entities. The results will be a little cleaner than with the PCRE approach.
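When the text contains no entities, a third and simpler option is a multibyte-aware substring function. This sketch assumes the mbstring extension is available and uses a made-up sample string. It shows how the byte-based substr() can cut a two-byte character in half, which is the kind of damage that produces an encoding error in the output, while mb_substr() counts characters.

```php
<?php
$text = 'Departamento de Justiça';

// Byte-based: 22 bytes ends after the first byte of the two-byte "ç".
$broken = substr($text, 0, 22);

// Character-based: 22 characters, the "ç" stays intact.
$safe = mb_substr($text, 0, 22, 'UTF-8');

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($safe, 'UTF-8'));   // bool(true)
```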

PHP htmlentities and saving the data in xml format

I'm trying to save some data into an XML file using the following PHP script:
<?php
$string = '<a href="google.com/maps">Go to google maps</a> and some special characters ë è & ä etc.';
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$root = $doc->createElement('top');
$root = $doc->appendChild($root);
$title = $doc->createElement('title');
$title = $root->appendChild($title);
$id = $doc->createAttribute('id');
$id->value = '1';
$text = $title->appendChild($id);
$text = $doc->createTextNode($string);
$text = $title->appendChild($text);
$doc->save('data.xml');
echo 'data saved!';
?>
I'm using htmlentities to translate the whole string into HTML format. If I leave this out, the special characters won't be translated to HTML format. This is the output:
<?xml version="1.0" encoding="UTF-8"?>
<top>
<title id="1">&amp;lt;a href=&amp;quot;google.com/maps&amp;quot;&amp;gt;Go to google maps&amp;lt;/a&amp;gt; and some special characters &amp;euml; &amp;egrave; &amp;amp; &amp;auml; etc.</title>
</top>
The ampersand of the HTML tags gets a double HTML code: &amp;lt; and an ampersand becomes: &amp;amp;
Is this normal behavior? Or how can I prevent this from happening? Looks like a double encoding.

Try to remove the line:
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
Because the text passed to createTextNode() is escaped anyway.
Update:
If you want the UTF-8 characters to be escaped, you could leave that line in and try to add the $string directly in createElement().
For example:
$title = $doc->createElement('title', $string);
$title = $root->appendChild($title);
In PHP documentation it says that $string will not be escaped. I haven't tried it, but it should work.

It is htmlentities that turns a & into &amp;.
When working with XML data you should not use htmlentities, as DOMDocument expects a plain & and not &amp;.
As of PHP 5.4 the default encoding of htmlentities is UTF-8, so there is no need to pass it explicitly.

This line:
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
… encodes a string as HTML.
This line:
$text = $doc->createTextNode($string);
… encodes your string of HTML as XML.
This gives you an XML representation of an HTML string. When the XML is parsed you get the HTML back.
how can I prevent this from happening?
If your goal is to store some text in an XML document, remove the line that encodes it as HTML.
Looks like a double encoding.
Pretty much. It is encoded twice, it just uses different (albeit very similar) encoding methods for each of the two passes.
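A short round trip illustrates why the extra htmlentities() pass is unnecessary. This is a sketch with a shortened version of the string from the question: the raw text goes into a text node, the serializer escapes it once, and parsing the result gives back the identical string.

```php
<?php
$raw = '<a href="google.com/maps">Go to google maps</a> and ë è & ä';

// Store the raw text; no htmlentities() beforehand.
$doc = new DOMDocument('1.0', 'UTF-8');
$title = $doc->appendChild($doc->createElement('title'));
$title->appendChild($doc->createTextNode($raw));
$xml = $doc->saveXML();   // markup characters are escaped exactly once here

// Parse the serialized document again and read the text back.
$parsed = new DOMDocument();
$parsed->loadXML($xml);
var_dump($parsed->documentElement->textContent === $raw); // bool(true)
```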

What can be alternate way to load strip out javascript and put it in array for later use

I am using the following code to strip JavaScript out of an HTML DOM string and put it in an array for later use.
What would be a good alternative?
My Problem:
I am getting a problem with Unicode inside the file. When files containing Unicode are parsed, the following error is generated:
Warning: DOMDocument::saveHTML() [domdocument.savehtml]: output
conversion failed due to conv error, bytes 0x97 0xC3 0xA0 0xC2 in
my code:
function loadJSCodeToLast( $strDOM ){
//Find all the <script></script> code and add to $objApp
global $objApp;
$objDOM = new DOMDocument();
//$x = new DOMImplementation();
//$doc = $x->createDocument(NULL,"rootElementName");
//$strDOM = '<kool>'.$strDOM.'</kool>';
$objDOM->preserveWhiteSpace = false;
//$objDOM->formatOutput = true;
#$objDOM->loadHtml( $strDOM );
$xpath = new DOMXPath($objDOM);
$objScripts = $xpath->query('//script');
$totCount = $objScripts->length;
if ($totCount > 0) {
//document contains script tags
foreach($objScripts as $entries){
$strSrc = $entries->getAttribute('src');
if( $strSrc !== ''){
$objApp->AddJSFile( $strSrc );
}else{
$objApp->AddJSScript( $entries->nodeValue );
}
$entries->parentNode->removeChild( $entries );
}
}
//return $objDOM->saveHTML();
//echo $GLOBALS['strTemplateDirAbs'];
return preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $objDOM->saveHTML()));
}

Try converting your string with utf8_encode() before loading it.
$txt = utf8_encode($txt);
var_dump(loadJSCodeToLast($txt));
The XML parser converts the text of an XML document into UTF-8, even
if you have set the character encoding of the XML, for example as a
second parameter of the DOMDocument constructor. After parsing the XML
with the load() command all its texts have been converted to UTF-8.
In case you append text nodes with special characters (e. g. Umlaut)
to your XML document you should therefore use utf8_encode() with your
text to convert it into UTF-8 before you append the text to the
document. Otherwise you will get an error message like "output
conversion failed due to conv error" at the save()
From DOMDocument::save documentation comments.
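Note that utf8_encode() always treats its input as ISO-8859-1 and has been deprecated since PHP 8.2. A hedged alternative, assuming the mbstring extension is available and that any input which is not valid UTF-8 really is ISO-8859-1 (the helper name toUtf8 is made up for illustration):

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, convert everything else
// from ISO-8859-1 (an assumption about the source data).
function toUtf8(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";            // "café" in ISO-8859-1
echo bin2hex(toUtf8($latin1));  // 636166c3a9
```

Checking first avoids converting a string twice, which is what happens when utf8_encode() is applied to text that is already UTF-8.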

PHP invalid character error

I'm getting this error when running this code:
Fatal error: Uncaught exception 'DOMException' with message 'Invalid Character Error' in test.php:29 Stack trace: #0 test.php(29): DOMDocument->createElement('1OhmStable', 'a') #1 {main} thrown in test.php on line 29
The nodes from the original XML file do contain invalid characters, but since I am stripping the invalid characters from the node names, the nodes should be created. What type of encoding do I need to apply to the original XML document? Do I need to decode the saveXML output?
function __cleanData($c)
{
return preg_replace("/[^A-Za-z0-9]/", "",$c);
}
$xml = new DOMDocument('1.0', 'UTF-8');
$xml->load('test.xml');
$xml->formatOutput = true;
$append = array();
foreach ($xml->getElementsByTagName('product') as $product )
{
foreach($product->getElementsByTagName('name') as $name )
{
$append[] = $name;
}
foreach ($append as $a)
{
$nodeName = __cleanData($a->textContent);
$element = $xml->createElement(htmlentities($nodeName) , 'a');
}
$product->removeChild($xml->getElementsByTagName('details')->item(0));
$product->appendChild($element);
}
$result = $xml->saveXML();
$file = "data.xml";
file_put_contents($file,$result);
This is what the original XML looks like:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/v1/xsl/xml_pretty_printer.xsl" type="text/xsl"?>
<products>
<product>
<modelNumber>M100</modelNumber>
<itemId>1553725</itemId>
<details>
<detail>
<name>1 Ohm Stable</name>
<value>600 x 1</value>
</detail>
</details>
</product>
</products>
The new document is supposed to look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/v1/xsl/xml_pretty_printer.xsl" type="text/xsl"?>
<products>
<product>
<modelNumber>M100</modelNumber>
<itemId>1553725</itemId>
<1 Ohm Stable>
</1 Ohm Stable>
</product>
</products>

You simply cannot use an element name that starts with a number:
1OhmStable <-- rename this
_1OhmStable <-- this is fine
php parse xml - error: StartTag: invalid element name
A nice article :- http://www.xml.com/pub/a/2001/07/25/namingparts.html
A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.

You have not written where you get that error. In case it's after you cleaned the value, this is my guess:
preg_replace("/[^A-Za-z0-9]/", "",$c);
This replacement is not written for UTF-8 encoded strings (which are used by DOMDocument). You can make it UTF-8 compatible by using the u modifier (PCRE_UTF8):
preg_replace("/[^A-Za-z0-9]/u", "",$c);
^
It's just a guess, I suggest you make it more precise in your question which part of your code triggers the error.

Even if __cleanData() removes all characters other than the Latin letters a-z and digits, that doesn't guarantee that the result is a valid XML name. Your function can return strings that begin with a digit, but digits are illegal name start characters in XML; they can only appear in a name after the first name character. Spaces are also forbidden in names, so that is another point where your expected XML output would fail.
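One way to handle both problems is to prefix an underscore whenever the cleaned string is empty or starts with a digit. This is a minimal sketch under that assumption; toElementName is a hypothetical replacement for the cleaning function in the question.

```php
<?php
// Hypothetical helper: turn arbitrary text into a usable XML element name.
function toElementName(string $label): string
{
    // Keep ASCII letters, digits and underscores only (this drops spaces too).
    $name = preg_replace('/[^A-Za-z0-9_]/', '', $label);

    // A name must not be empty and must not start with a digit.
    if ($name === '' || preg_match('/^[0-9]/', $name)) {
        $name = '_' . $name;
    }
    return $name;
}

$xml = new DOMDocument('1.0', 'UTF-8');
$element = $xml->createElement(toElementName('1 Ohm Stable'));
echo $element->nodeName, "\n"; // _1OhmStable
```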

Make sure your scripts all have the same encoding. If it's UTF-8, make sure they have no Byte Order Mark (BOM) at the very beginning of the file.
To do that, open your XML file with a text editor like Notepad++ and convert the file to "UTF-8 without BOM".
I had a similar error, but with a JSON file.
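If the file cannot be re-saved, the BOM can also be removed after reading the content. A small sketch, assuming the data is UTF-8; the helper name stripUtf8Bom is made up for illustration.

```php
<?php
// Hypothetical helper: drop a leading UTF-8 byte order mark (EF BB BF), if any.
function stripUtf8Bom(string $s): string
{
    return strncmp($s, "\xEF\xBB\xBF", 3) === 0 ? substr($s, 3) : $s;
}

$content = "\xEF\xBB\xBF<products/>";   // stands in for file_get_contents(...)
$xml = new SimpleXMLElement(stripUtf8Bom($content));
echo $xml->getName(), "\n"; // products
```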

PHP SimpleXML doesn't preserve line breaks in XML attributes

I have to parse externally provided XML that has attributes with line breaks in them. Using SimpleXML, the line breaks seem to be lost. According to another stackoverflow question, line breaks should be valid (even though far less than ideal!) for XML.
Why are they lost? [edit] And how can I preserve them? [/edit]
Here is a demo file script (note that when the line breaks are not in an attribute they are preserved).
PHP File with embedded XML
$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<Rows>
<data Title='Data Title' Remarks='First line of the row.
Followed by the second line.
Even a third!' />
<data Title='Full Title' Remarks='None really'>First line of the row.
Followed by the second line.
Even a third!</data>
</Rows>
XML;
$xml = new SimpleXMLElement( $xml );
print '<pre>'; print_r($xml); print '</pre>';
Output from print_r
SimpleXMLElement Object
(
[data] => Array
(
[0] => SimpleXMLElement Object
(
[#attributes] => Array
(
[Title] => Data Title
[Remarks] => First line of the row. Followed by the second line. Even a third!
)
)
[1] => First line of the row.
Followed by the second line.
Even a third!
)
)

Using SimpleXML, the line breaks seem to be lost.
Yes, that is expected... in fact it is required of any conformant XML parser that newlines in attribute values represent simple spaces. See attribute value normalisation in the XML spec.
If there was supposed to be a real newline character in the attribute value, the XML should have included a &#10; character reference instead of a raw newline.
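The difference is easy to see side by side. A minimal sketch with two made-up rows: the first attribute uses the &#10; character reference, the second a raw newline.

```php
<?php
// In a double-quoted PHP string, &#10; stays literal and \n is a real newline.
$xml = "<Rows>"
     . "<data Remarks='First line.&#10;Second line.' />"
     . "<data Remarks='First line.\nSecond line.' />"
     . "</Rows>";

$rows = new SimpleXMLElement($xml);

// Character reference: the newline survives parsing.
var_dump((string) $rows->data[0]['Remarks'] === "First line.\nSecond line."); // bool(true)

// Raw newline: normalised to a single space by the parser.
var_dump((string) $rows->data[1]['Remarks'] === "First line. Second line.");  // bool(true)
```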

The entity for a new line is &#10;. I played with your code until I found something that did the trick. It's not very elegant, I warn you:
//First remove any indentations:
$xml = str_replace(" ","", $xml);
$xml = str_replace("\t","", $xml);
//Next, unify all new-lines into Unix LF:
$xml = str_replace("\r","\n", $xml);
$xml = str_replace("\n\n","\n", $xml);
//Next, replace all new lines with the character reference:
$xml = str_replace("\n","&#10;", $xml);
Finally, replace any new line entities between >< with a new line:
$xml = str_replace(">&#10;<",">\n<", $xml);
The assumption, based on your example, is that any new lines that occur inside a node or attribute will have more text on the next line, not a < to open a new element.
This of course would fail if your next line had some text that was wrapped in a line-level element.

Assuming $xmlData is your XML string before it is sent to the parser, this should replace all newlines in attributes with the correct entity. I had the issue with XML coming from SQL Server.
$parts = explode("<", $xmlData); //split over <
array_shift($parts); //remove the blank array element
$newParts = array(); //create array for storing new parts
foreach($parts as $p)
{
list($attr,$other) = explode(">", $p, 2); //get attribute data into $attr
$attr = str_replace("\r\n", "&#10;", $attr); //do the replacement
$newParts[] = $attr.">".$other; // put parts back together
}
$xmlData = "<".implode("<", $newParts); // put parts back together prefixing with <
Probably can be done more simply with a regex, but that's not a strong point for me.
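A possible regex version, offered as a sketch rather than a tested drop-in: it rewrites line breaks only inside quoted values that follow an equals sign. It assumes attribute values contain no < character, and it would also touch look-alike text inside comments or CDATA sections. The sample data is made up.

```php
<?php
$xmlData = "<Rows>\n"
         . "<data Title='Data Title' Remarks='First line.\r\nSecond line.\nThird!' />\n"
         . "</Rows>";

$fixed = preg_replace_callback(
    // Group 1: "=" plus optional whitespace; group 2: the quoted value.
    '/(=\s*)(\'[^\'<]*\'|"[^"<]*")/',
    function (array $m): string {
        // Replace every kind of line break inside the value with &#10;.
        return $m[1] . str_replace(["\r\n", "\r", "\n"], '&#10;', $m[2]);
    },
    $xmlData
);

$rows = new SimpleXMLElement($fixed);
echo (string) $rows->data['Remarks'], "\n";
```

Line breaks between elements are left alone because they are not inside a quoted value.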

Here is code to replace the new lines with the appropriate character reference in that particular XML fragment. Run this code prior to parsing.
$replaceFunction = function ($matches) {
return str_replace("\n", "&#10;", $matches[0]);
};
$xml = preg_replace_callback(
"/<data Title='[^']+' Remarks='[^']+'/i",
$replaceFunction, $xml);

This is what worked for me:
First, get the xml as a string:
$xml = file_get_contents($urlXml);
Then do the replacement:
$xml = str_replace(".\xe2\x80\xa9<as:eol/>",".\n\n<as:eol/>",$xml);
The "." and "< as:eol/ >" were there because I needed to add breaks in that case. The new lines "\n" can be replaced with whatever you like.
After replacing, just load the xml-string as a SimpleXMLElement object:
$xmlo = new SimpleXMLElement( $xml );
Et Voilà

Well, this question is old, but like me, someone might come to this page eventually.
I had a slightly different approach, and I think it is the most elegant of those mentioned.
Inside the XML, you put some unique marker which you will use for a new line.
Change the XML to
<data Title='Data Title' Remarks='First line of the row. \n
Followed by the second line. \n
Even a third!' />
Then, once you have the value of the desired node from SimpleXML as a string in $output, write something like this:
$findme = '\n'; // single quotes: a literal backslash followed by "n"
if (strpos($output, $findme) !== false)
{
$output = str_replace($findme, "<br/>", $output);
}
It doesn't have to be '\n'; it can be any unique marker.
