PHP htmlentities and saving the data in xml format - php

I'm trying to save some data into an XML file using the following PHP script:
<?php
$string = '<a href="google.com/maps">Go to google maps</a> and some special characters ë è & ä etc.';
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$root = $doc->createElement('top');
$root = $doc->appendChild($root);
$title = $doc->createElement('title');
$title = $root->appendChild($title);
$id = $doc->createAttribute('id');
$id->value = '1';
$text = $title->appendChild($id);
$text = $doc->createTextNode($string);
$text = $title->appendChild($text);
$doc->save('data.xml');
echo 'data saved!';
?>
I'm using htmlentities to translate the whole string into HTML format; if I leave this out, the special characters won't be translated to HTML entities. This is the output:
<?xml version="1.0" encoding="UTF-8"?>
<top>
<title id="1">&amp;lt;a href=&amp;quot;google.com/maps&amp;quot;&amp;gt;Go to google maps&amp;lt;/a&amp;gt; and some special characters &amp;euml; &amp;egrave; &amp;amp; &amp;auml; etc.</title>
</top>
The ampersand of each HTML entity gets encoded a second time: < ends up as &amp;lt; and a plain ampersand becomes &amp;amp;
Is this normal behavior? If not, how can I prevent it from happening? It looks like double encoding.

Try to remove the line:
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
Because the text passed to createTextNode() is escaped anyway.
Update:
If you want the UTF-8 characters to be escaped, you could keep that line and pass $string directly to createElement().
For example:
$title = $doc->createElement('title', $string);
$title = $root->appendChild($title);
The PHP documentation says that the value passed to createElement() will not be escaped. I haven't tried it, but it should work.

It is htmlentities that turns a & into &amp;
When working with XML data you should not use htmlentities, as DOMDocument expects a plain & and not &amp;.
As of PHP 5.4 the default encoding of htmlentities is UTF-8, so there is no need to specify UTF-8 explicitly.

This line:
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
… encodes a string as HTML.
This line:
$text = $doc->createTextNode($string);
… encodes your string of HTML as XML.
This gives you an XML representation of an HTML string. When the XML is parsed you get the HTML back.
how can I prevent this from happening?
If your goal is to store some text in an XML document, remove the line that encodes it as HTML.
Looks like a double encoding.
Pretty much. It is encoded twice, it just uses different (albeit very similar) encoding methods for each of the two passes.
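The single-encoding approach described above can be sketched as follows. This is a minimal sketch, assuming the original string contained the link markup that shows up in the question's output; the round-trip check at the end is only there for illustration.

```php
<?php
// Raw text, not pre-encoded with htmlentities().
$string = '<a href="google.com/maps">Go to google maps</a> and ë è & ä';

$doc = new DOMDocument('1.0', 'UTF-8');
$title = $doc->appendChild($doc->createElement('title'));
// The text node is escaped exactly once, when the document is serialized.
$title->appendChild($doc->createTextNode($string));

$xml = $doc->saveXML();
echo $xml;

// Parsing the XML again gives back the original text unchanged.
$parsed = new DOMDocument();
$parsed->loadXML($xml);
$roundTrip = $parsed->documentElement->textContent;
var_dump($roundTrip === $string);
```

Because the escaping happens only at serialization time, whatever reads the XML back sees the same string that was stored.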

Related

XML encoding error, but both XML and input text encoding are utf-8 in php

I'm generating an XML DOM with DOMDocument in PHP, containing some news items, each with a title, date, links and a description. The problem occurs in the description of some news items but not others, even though both contain accents and cedillas.
I create the XML Dom element in UTF-8:
$dom = new \DOMDocument("1.0", "UTF-8");
Then I retrieve my text from a MySQL database, which is encoded in latin-1; when I test the encoding with mb_detect_encoding, it returns UTF-8.
I tried the following:
iconv('UTF-8', 'ISO-8859-1', $descricao);
iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $descricao);
iconv('ISO-8859-1', 'UTF-8', $descricao);
iconv('ISO-8859-1//TRANSLIT', 'UTF-8', $descricao);
mb_convert_encoding($descricao, 'ISO-8859-1', 'UTF-8');
mb_convert_encoding($descricao, 'UTF-8', 'ISO-8859-1');
mb_convert_encoding($descricao, 'UTF-8', 'UTF-8'); //that makes no sense, but who knows
I also tried changing the database encoding to UTF-8, and changing the XML encoding to ISO-8859-1.
This is the full method that generates the XML:
$informativos = Informativo::where('inf_ativo','S')->orderBy('inf_data','DESC')->take(20)->get();
$dom = new \DOMDocument("1.0", "UTF-8");
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$rss = $dom->createElement("rss");
$channel = $dom->createElement("channel");
$title = $dom->createElement("title", "Informativos");
$link = $dom->createElement("link", "http://example.com/informativos");
$channel->appendChild($title);
$channel->appendChild($link);
foreach ($informativos as $informativo) {
$item = $dom->createElement("item");
$itemTitle = $dom->createElement("title", $informativo->inf_titulo);
$itemImage = $dom->createElement("image", "http://example.com/".$informativo->inf_ilustracao);
$itemLink = $dom->createElement("link", "http://example.com/informativo/".$informativo->informativo_id);
$descricao = strip_tags($informativo->inf_descricao);
$descricao = str_replace("&nbsp;", " ", $descricao);
$descricao = str_replace("&#13;", " ", $descricao);
$descricao = substr($descricao, 0, 150).'...';
$itemDesc = $dom->createElement("description", $descricao);
$itemDate = $dom->createElement("pubDate", $informativo->inf_data);
$item->appendChild($itemTitle);
$item->appendChild($itemImage);
$item->appendChild($itemLink);
$item->appendChild($itemDesc);
$item->appendChild($itemDate);
$channel->appendChild($item);
}
$rss->appendChild($channel);
$dom->appendChild($rss);
return $dom->saveXML();
Here is an example of successful output:
Segundo a instituição, número de pessoas que vivem na pobreza subiu 7,3 milhões desde 2014, atingindo 21% da população, ou 43,5 milhões de br
And an example that gives the encoding error:
procuradores da Lava Jato em Curitiba, que estão sendo investigados por um
suposto acordo fraudulento com a Petrobras e o Departamento de Justi�...
Everything renders fine, until the problematic description text above, that gives me:
"This page contains the following errors:
error on line 118 at column 20: Encoding error
Below is a rendering of the page up to the first error."
Probably that &#13; is the problem here. Since I can't control whether or not the text has this, I need to render these special characters correctly.
UPDATE 2019-04-12: Found out the error on the problematic text and changed the example.
The encoding of the database connection is important. Make sure that it is set to UTF-8. It is a good idea to use UTF-8 most of the time (for your fields). Character sets like ISO-8859-1 cover only a very limited number of characters, so if a Unicode string gets encoded into them it might lose data.
The second argument of DOMDocument::createElement() is broken. It only escapes some special characters, but not &. To avoid problems, create and append the content as a separate text node. DOMNode::appendChild() returns the appended node, so the DOMDocument::create* calls can be nested and chained.
$data = [
[
'inf_titulo' => 'Foo',
'inf_ilustracao' => 'foo.jpg',
'informativo_id' => 42,
'inf_descricao' => 'Some content',
'inf_data' => 'a-date'
]
];
$informativos = json_decode(json_encode($data));
function stripTagsAndTruncate($text) {
$text = strip_tags($text);
$text = str_replace(["&nbsp;", "&#13;"], " ", $text);
return substr($text, 0, 150).'...';
}
$dom = new DOMDocument('1.0', 'UTF-8');
$rss = $dom->appendChild($dom->createElement('rss'));
$channel = $rss->appendChild($dom->createElement("channel"));
$channel
->appendChild($dom->createElement("title"))
->appendChild($dom->createTextNode("Informativos"));
$channel
->appendChild($dom->createElement("link"))
->appendChild($dom->createTextNode("http://example.com/informativos"));
foreach ($informativos as $informativo) {
$item = $channel->appendChild($dom->createElement("item"));
$item
->appendChild($dom->createElement("title"))
->appendChild($dom->createTextNode($informativo->inf_titulo));
$item
->appendChild($dom->createElement("image"))
->appendChild($dom->createTextNode("http://example.com/".$informativo->inf_ilustracao));
$item
->appendChild($dom->createElement("link"))
->appendChild($dom->createTextNode("http://example.com/informativo/".$informativo->informativo_id));
$item
->appendChild($dom->createElement("description"))
->appendChild($dom->createTextNode(stripTagsAndTruncate($informativo->inf_descricao)));
$item
->appendChild($dom->createElement("pubDate"))
->appendChild($dom->createTextNode($informativo->inf_data));
}
$dom->formatOutput = TRUE;
echo $dom->saveXML();
Output:
<?xml version="1.0" encoding="UTF-8"?>
<rss>
<channel>
<title>Informativos</title>
<link>http://example.com/informativos</link>
<item>
<title>Foo</title>
<image>http://example.com/foo.jpg</image>
<link>http://example.com/informativo/42</link>
<description>Some content...</description>
<pubDate>a-date</pubDate>
</item>
</channel>
</rss>
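The difference between the two ways of setting element content can be seen in a small sketch. The sample value and the element names `a` and `b` are made up for illustration; the manual `htmlspecialchars()` call in the first variant is an assumption about what is needed when the value goes through `createElement()`.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$root = $dom->appendChild($dom->createElement('root'));
$value = 'Tom & Jerry <1940>';

// Variant A: the value goes through createElement(), so it is escaped by hand first.
$a = $root->appendChild(
    $dom->createElement('a', htmlspecialchars($value, ENT_XML1 | ENT_NOQUOTES, 'UTF-8'))
);

// Variant B: the value goes into a text node, which needs no manual escaping.
$b = $root->appendChild($dom->createElement('b'));
$b->appendChild($dom->createTextNode($value));

var_dump($a->textContent === $value, $b->textContent === $value);
echo $dom->saveXML();
```

Variant B keeps the escaping decision inside the DOM library, which is why the answer above builds every element that way.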
Truncating an HTML fragment can result in broken entities and broken code points (if you don't use a UTF-8 aware string function). Here are two approaches to solve it.
You can use PCRE in UTF-8 mode and match n entities/codepoints:
// have some string with HTML and entities
$text = 'Hello<b>äöü</b>&nbsp;&auml;&#13; foobar';
// strip tags and replace some specific entities with spaces
$stripped = str_replace(['&nbsp;', '&#13;'], ' ', strip_tags($text));
// match 0-10 entities or unicode codepoints
preg_match('(^(?:&[^;]+;|\\X){0,10})u', $stripped, $match);
var_dump($match[0]);
Output:
string(18) "Helloäöü &auml;"
However, I would suggest using DOM. It can load HTML and lets you use XPath expressions on it.
// have some string with HTML and entities
$text = 'Hello<b>äöü</b>&nbsp;&auml;&#13; foobar';
$document = new DOMDocument();
// force UTF-8 and load
$document->loadHTML('<?xml encoding="UTF-8"?>'.$text);
$xpath = new DOMXpath($document);
// use xpath to fetch the first 10 characters of the text content
var_dump($xpath->evaluate('substring(//body, 1, 10)'));
Output:
string(15) "Helloäöü ä"
DOM in general treats all strings as UTF-8, so code points are not a problem. XPath's substring() works on the text content of the first matched node. The arguments are character positions (not indexes), so they start at 1.
DOMDocument::loadHTML() will add html and body tags and decode entities. The results will be a little cleaner than with the PCRE approach.
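A third option, not from the answer above, is to decode the entities first and then cut with a multibyte-aware function. This is a minimal sketch, assuming the mbstring extension is available; `truncateText()` is a hypothetical helper name and the sample input is made up.

```php
<?php
// Hypothetical helper: strip markup, decode entities, then cut on character boundaries.
function truncateText(string $html, int $limit = 150): string
{
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Replace non-breaking spaces and line breaks with plain spaces.
    $text = str_replace(["\u{00A0}", "\r", "\n"], ' ', $text);
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;
    }
    // mb_substr() counts characters, so a multi-byte character is never split.
    return mb_substr($text, 0, $limit, 'UTF-8') . '...';
}

$short = truncateText('Justi&ccedil;a&nbsp;<b>é</b> foobar', 6);
var_dump($short, mb_check_encoding($short, 'UTF-8'));
```

With plain substr() and the same limit, the cut would land in the middle of the two-byte "ç" and leave an invalid byte sequence behind, which is the kind of input that makes an XML parser report an encoding error.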

PHP Escaped special characters to html

I have string that looks like this "v\u00e4lkommen till mig" that I get after doing utf8_encode() on the string.
I would like that string to become
v&auml;lkommen till mig
where the character
\u00e4 = ä = &auml;
How can I achieve this in PHP?
Do not use utf8_(de|en)code. It just converts between UTF-8 and ISO-8859-1. ISO-8859-1 does not provide the same characters as ISO-8859-15 or Windows-1252, which are the most used encodings (besides UTF-8). Better use mb_convert_encoding.
"v\u00e4lkommen till mig" looks like a JSON-encoded string, which is already UTF-8 encoded. The Unicode code point of "ä" is U+00E4, hence \u00e4.
Example
<?php
header('Content-Type: text/html; charset=utf-8');
$json = '"v\u00e4lkommen till mig"';
var_dump(json_decode($json)); //It will return a utf8 encoded string "välkommen till mig"
What is the source of this string?
There is no need to replace the ä with its HTML representation &auml; if you print it in a UTF-8 encoded document and tell the browser which encoding is used. If it is necessary, use htmlentities:
<?php
$json = '"v\u00e4lkommen till mig"';
$string = json_decode($json);
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
Edit: Since you want to keep HTML characters, and I now think your source string isn't quite what you posted (I think it is actual unicode, rather than containing \unnnn as a string), I think your best option is this:
$html = str_replace( '&amp;', '&', str_replace( '&gt;', '>', str_replace( '&lt;', '<', htmlentities( $whatever ) ) ) );
(note: no call to utf8-decode)
Original answer:
There is no direct conversion. First, decode it again:
$decoded = utf8_decode( $whatever );
then encode as HTML:
$html = htmlentities( $decoded );
and of course you can do it without a variable:
$html = htmlentities( utf8_decode( $whatever ) );
http://php.net/manual/en/function.utf8-decode.php
http://php.net/manual/en/function.htmlentities.php
To do this by regular expression (not recommended, likely slower, less reliable), you can use the fact that HTML supports &#xnnnn; constructs, where the nnnn is the same as your existing \unnnn values. So you can say:
$html = preg_replace( '/\\\\u([0-9a-f]{4})/i', '&#x$1;', $whatever );
The html_entity_decode worked for me.
$json = '"v\u00e4lkommen till mig"';
echo $decoded = html_entity_decode( json_decode($json) );

What is an alternative way to strip out JavaScript and put it in an array for later use?

I am using the following code to strip JavaScript out of an HTML DOM string and put it in an array for later use.
What would be a good alternative?
My problem:
I am having trouble with Unicode inside the file. When files containing Unicode are parsed, the following error is generated:
Warning: DOMDocument::saveHTML() [domdocument.savehtml]: output
conversion failed due to conv error, bytes 0x97 0xC3 0xA0 0xC2 in
my code:
function loadJSCodeToLast( $strDOM ){
//Find all the <script></script> code and add to $objApp
global $objApp;
$objDOM = new DOMDocument();
//$x = new DOMImplementation();
//$doc = $x->createDocument(NULL,"rootElementName");
//$strDOM = '<kool>'.$strDOM.'</kool>';
$objDOM->preserveWhiteSpace = false;
//$objDOM->formatOutput = true;
#$objDOM->loadHtml( $strDOM );
$xpath = new DOMXPath($objDOM);
$objScripts = $xpath->query('//script');
$totCount = $objScripts->length;
if ($totCount > 0) {
//document contains script tags
foreach($objScripts as $entries){
$strSrc = $entries->getAttribute('src');
if( $strSrc !== ''){
$objApp->AddJSFile( $strSrc );
}else{
$objApp->AddJSScript( $entries->nodeValue );
}
$entries->parentNode->removeChild( $entries );
}
}
//return $objDOM->saveHTML();
//echo $GLOBALS['strTemplateDirAbs'];
return preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $objDOM->saveHTML()));
}
Try converting your string with utf8_encode() before loading it.
$txt = utf8_encode($txt);
var_dump(loadJSCodeToLast($txt));
The XML parser converts the text of an XML document into UTF-8, even
if you have set the character encoding of the XML, for example as a
second parameter of the DOMDocument constructor. After parsing the XML
with the load() command all its texts have been converted to UTF-8.
In case you append text nodes with special characters (e. g. Umlaut)
to your XML document you should therefore use utf8_encode() with your
text to convert it into UTF-8 before you append the text to the
document. Otherwise you will get an error message like "output
conversion failed due to conv error" at the save()
From DOMDocument::save documentation comments.
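One caveat with converting blindly is that a string that is already UTF-8 gets encoded a second time. A minimal sketch of a guarded conversion follows, assuming the mbstring extension is available and that non-UTF-8 input is ISO-8859-1 (the same assumption utf8_encode() makes); `toUtf8()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise convert from an assumed source encoding.
function toUtf8(string $text, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

$latin1 = "caf\xE9";      // "café" as ISO-8859-1 bytes
$utf8   = "caf\u{00E9}";  // "café" as UTF-8 bytes
var_dump(toUtf8($latin1) === $utf8, toUtf8($utf8) === $utf8);
```

The result of toUtf8() could then be passed to loadJSCodeToLast() in place of the utf8_encode() call shown above.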

How to convert some multibyte characters into its numeric html entity using PHP?

Test string:
$s = "convert this: ";
$s .= "–, —, †, ‡, •, ≤, ≥, μ, ₪, ©, ® y ™, ⅓, ⅔, ⅛, ⅜, ⅝, ⅞, ™, Ω, ℮, ∑, ⌂, ♀, ♂ ";
$s .= "but, not convert ordinary characters to entities";
$encoded = mb_convert_encoding($s, 'HTML-ENTITIES', 'UTF-8');
Assuming your input string is UTF-8, this should encode almost everything outside ASCII into numeric entities.
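An alternative sketch uses mb_encode_numericentity(), which takes an explicit conversion map. The shortened test string and the chosen range (everything from U+0080 upward) are assumptions for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
$s = 'convert this: –, ©, ™ but not ordinary characters';

// Conversion map: [first code point, last code point, offset, mask].
// Every character from U+0080 upward becomes a decimal numeric entity.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$encoded = mb_encode_numericentity($s, $map, 'UTF-8');
echo $encoded; // convert this: &#8211;, &#169;, &#8482; but not ordinary characters
```

Narrowing the first two values of the map restricts the conversion to a smaller block of characters.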
Well htmlentities doesn't work correctly. Fortunately someone has posted code on the php website that seems to do the translation of multibyte characters properly
I did work on decoding ascii into html coded text (&#xxxx). https://github.com/hellonearthis/ascii2web

PHP - convert a string with - or + signs to HTML

How do I convert a string that has a - or + sign to a html friendly string?
I mean to convert those characters to HTML notation, like a space is &nbsp; and so on...
ps: htmlentities doesn't work. I still see the -/+
Try this
$string = str_replace('+', '&#43;', $string); // Convert + sign
$string = str_replace('-', '&#45;', $string); // Convert - sign
I don't think there are entities for these symbols, see: http://www.w3schools.com/tags/ref_entities.asp
I tested with
$str = "- and +"; echo htmlentities($str);
and didn't get entities. According to: http://us.php.net/manual/en/function.htmlentities.php
I would expect them to be encoded if there was encoding available.
No idea what you want to accomplish. But this escapes selected characters to html entities:
$html = preg_replace_callback('/[+-]/', function ($m) { return '&#' . ord($m[0]) . ';'; }, $html);
As far as I am aware, - and + are fine in HTML, and don't have an entity equivalent. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Are you sure you're not thinking of URL encoding?
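If URL encoding is what is actually needed, a short sketch shows how PHP's two encoders treat these characters (the sample string is made up):

```php
<?php
// In a query string "+" stands for a space, so a literal plus must be percent-encoded.
$query = urlencode('1 + 1 - 2');
$path  = rawurlencode('1 + 1 - 2');

echo $query, "\n"; // 1+%2B+1+-+2
echo $path, "\n";  // 1%20%2B%201%20-%202
```

In both cases the minus sign is left alone, because it is an unreserved character in URLs.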
Specify that you want it to use unicode as follows:
htmlentities($str, ENT_QUOTES | ENT_IGNORE, "UTF-8");
Have a look at the 2nd comment on this page:
http://www.php.net/manual/en/function.htmlentities.php#100388
This will enable more encoding characters.
If you just want to encode some, then this is a little lighter weight:
<?php
$ent = array(
'+'=>'&#43;',
'-'=>'&#45;'
);
echo strtr('+ and -', $ent);
?>
