Convert a UTF-8 string to/from 7-bit XML in PHP - php

How can UTF-8 strings (i.e. 8-bit string) be converted to/from XML-compatible 7-bit strings (i.e. printable ASCII with numeric entities)?
i.e. an encode() function such that:
encode("“£”") -> "“£”"
decode() would also be useful:
decode("“£”") -> "“£”"
PHP's htmlenties()/html_entity_decode() pair does not do the right thing:
htmlentities(html_entity_decode("“£”")) ->
"“£”"
Laboriously specifying types helps a little, but still returns XML-incompatible named entities, not numeric ones:
htmlentities(html_entity_decode("“£”", ENT_QUOTES, "UTF-8"), ENT_QUOTES, "UTF-8") ->
"“£”"

mb_encode_numericentity does that exactly.

It's a bit of a workaround, but I read a bit about iconv() and i don't think it'll give you numeric entities (not put to the test)
function decode( $string )
{
$doc = new DOMDocument( "1.0", "UTF-8" );
$doc->LoadXML( '<?xml version="1.0" encoding="UTF-8"?>'."\n".'<x />', LIBXML_NOENT );
$doc->documentElement->appendChild( $doc->createTextNode( $string ) );
$output = $doc->saveXML( $doc );
$output = preg_replace( '/<\?([^>]+)\?>/', '', $output );
$output = str_replace( array( '<x>', '</x>' ), array( '', '' ), $output );
return trim( $output );
}
This however, I have put to the test. I might do the reverse later, just don't hold your breath ;-)

Related

XML encoding error, but both XML and input text encoding are utf-8 in php

I'm generating a XML Dom with DomDocument in php, containing some news, with title, date, links and a description. The problem occurs on description of some news, but not on others, and both of them contains accents and cedilla.
I create the XML Dom element in UTF-8:
$dom = new \DOMDocument("1.0", "UTF-8");
Then, I retrieve my text from a MySQL database, which is encoded in latin-1, and after I tested the encoding with mb_detect_encoding it returns UTF-8.
I tried the following:
iconv('UTF-8', 'ISO-8859-1', $descricao);
iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $descricao);
iconv('ISO-8859-1', 'UTF-8', $descricao);
iconv('ISO-8859-1//TRANSLIT', 'UTF-8', $descricao);
mb_convert_encoding($descricao, 'ISO-8859-1', 'UTF-8');
mb_convert_encoding($descricao, 'UTF-8', 'ISO-8859-1');
mb_convert_encoding($descricao, 'UTF-8', 'UTF-8'); //that makes no sense, but who knows
Also tried changing the database encode to UTF-8, and changing the XML encode to ISO-8859-1.
This is the full method that generates the XML:
$informativos = Informativo::where('inf_ativo','S')->orderBy('inf_data','DESC')->take(20)->get();
$dom = new \DOMDocument("1.0", "UTF-8");
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$rss = $dom->createElement("rss");
$channel = $dom->createElement("channel");
$title = $dom->createElement("title", "Informativos");
$link = $dom->createElement("link", "http://example.com/informativos");
$channel->appendChild($title);
$channel->appendChild($link);
foreach ($informativos as $informativo) {
$item = $dom->createElement("item");
$itemTitle = $dom->createElement("title", $informativo->inf_titulo);
$itemImage = $dom->createElement("image", "http://example.com/".$informativo->inf_ilustracao);
$itemLink = $dom->createElement("link", "http://example.com/informativo/".$informativo->informativo_id);
$descricao = strip_tags($informativo->inf_descricao);
$descricao = str_replace(" ", " ", $descricao);
$descricao = str_replace("
", " ", $descricao);
$descricao = substr($descricao, 0, 150).'...';
$itemDesc = $dom->createElement("description", $descricao);
$itemDate = $dom->createElement("pubDate", $informativo->inf_data);
$item->appendChild($itemTitle);
$item->appendChild($itemImage);
$item->appendChild($itemLink);
$item->appendChild($itemDesc);
$item->appendChild($itemDate);
$channel->appendChild($item);
}
$rss->appendChild($channel);
$dom->appendChild($rss);
return $dom->saveXML();
Here is an example of successful output:
Segundo a instituição, número de pessoas que vivem na pobreza subiu 7,3 milhões desde 2014, atingindo 21% da população, ou 43,5 milhões de br
And an example that gives the encoding error:
procuradores da Lava Jato em Curitiba, que estão sendo investigados por um
suposto acordo fraudulento com a Petrobras e o Departamento de Justi�...
Everything renders fine, until the problematic description text above, that gives me:
"This page contains the following errors:
error on line 118 at column 20: Encoding error
Below is a rendering of the page up to the first error."
Probably that 
 is the problem here. Since I can't control whether or not the text have this, I need to render these special characters correctly.
UPDATE 2019-04-12: Found out the error on the problematic text and changed the example.
The encoding of the database connection is important. Make sure that it is set to UTF-8. It is a good idea to use UTF-8 most of the time (for your fields). Character sets like ISO-8859-1 have only a very limited amount of characters. So if a Unicode string gets encoded into them it might loose data.
The second argument of DOMDocument::createElement() is broken. In only encodes some special characters, but not &. To avoid problems create and append the content as an separate text node. However DOMNode::appendChild() returns the append node, so the DOMElement::create* methods can be nested and chained.
$data = [
[
'inf_titulo' => 'Foo',
'inf_ilustracao' => 'foo.jpg',
'informativo_id' => 42,
'inf_descricao' => 'Some content',
'inf_data' => 'a-date'
]
];
$informativos = json_decode(json_encode($data));
function stripTagsAndTruncate($text) {
$text = strip_tags($text);
$text = str_replace([" ", "
"], " ", $text);
return substr($text, 0, 150).'...';
}
$dom = new DOMDocument('1.0', 'UTF-8');
$rss = $dom->appendChild($dom->createElement('rss'));
$channel = $rss->appendChild($dom->createElement("channel"));
$channel
->appendChild($dom->createElement("title"))
->appendChild($dom->createTextNode("Informativos"));
$channel
->appendChild($dom->createElement("link"))
->appendChild($dom->createTextNode("http://example.com/informativos"));
foreach ($informativos as $informativo) {
$item = $channel->appendChild($dom->createElement("item"));
$item
->appendChild($dom->createElement("title"))
->appendChild($dom->createTextNode($informativo->inf_titulo));
$item
->appendChild($dom->createElement("image"))
->appendChild($dom->createTextNode("http://example.com/".$informativo->inf_ilustracao));
$item
->appendChild($dom->createElement("link"))
->appendChild($dom->createTextNode("http://example.com/informativo/".$informativo->informativo_id));
$item
->appendChild($dom->createElement("description"))
->appendChild($dom->createTextNode(stripTagsAndTruncate($informativo->inf_descricao)));
$item
->appendChild($dom->createElement("pubDate"))
->appendChild($dom->createTextNode($informativo->inf_data));
}
$dom->formatOutput = TRUE;
echo $dom->saveXML();
Output:
<?xml version="1.0" encoding="UTF-8"?>
<rss>
<channel>
<title>Informativos</title>
<link>http://example.com/informativos</link>
<item>
<title>Foo</title>
<image>http://example.com/foo.jpg</image>
<link>http://example.com/informativo/42</link>
<description>Some content...</description>
<pubDate>a-date</pubDate>
</item>
</channel>
</rss>
Truncating an HTML fragment can result in broken entities and broken code points (if you don't use a UTF-8 aware string function). Here are two approaches to solve it.
You can use PCRE in UTF-8 mode and match n entities/codepoints:
// have some string with HTML and entities
$text = 'Hello<b>äöü</b> ä
 foobar';
// strip tags and replace some specific entities with spaces
$stripped = str_replace([' ', '
'], ' ', strip_tags($text));
// match 0-10 entities or unicode codepoints
preg_match('(^(?:&[^;]+;|\\X){0,10})u', $stripped, $match);
var_dump($match[0]);
Output:
string(18) "Helloäöü ä"
However I would suggest using DOM. It can load HTML and allow to use Xpath expressions on it.
// have some string with HTML and entities
$text = 'Hello<b>äöü</b> ä
 foobar';
$document = new DOMDocument();
// force UTF-8 and load
$document->loadHTML('<?xml encoding="UTF-8"?>'.$text);
$xpath = new DOMXpath($document);
// use xpath to fetch the first 10 characters of the text content
var_dump($xpath->evaluate('substring(//body, 1, 10)'));
Output:
string(15) "Helloäöü ä"
DOM in general treats all strings as UTF-8. So Codepoints are a not a problem. Xpaths substring() works on the text content of the first matched node. The argument are character positions (not index) so they start with 1.
DOMDocument::loadHTML() will add html and body tags and decode entities. The results will a little bit cleaner then with the PCRE approach.

PHP How to encode text to numeric entity?

I have xml like this:
<formula type="inline">
<default:math xmlns="http://www.w3.org/1998/Math/MathML">
<default:mi>
&Zopf;
</default:mi>
</default:math>
</formula>
My goal is to get rid of all special entities like &Zopf; by replacing them by their numeric entity presentations.
I tried :
$test = <content of the xml>;
$convmap = array(0x80, 0xffff, 0, 0xffff);
$test = mb_encode_numericentity($test, $convmap, 'UTF-8');
But this will not replace the &Zopf; Any idea?
My goal is to get:
ℤ
as shown here: http://www.fileformat.info/info/unicode/char/2124/index.htm
Thank you.
Your converter is converting your LaTeX into MathML, not HTML entities. You need something that converts directly into HTML character references, or a MathML to HTML character reference converter.
You should be able to use htmlentities:
htmlentities($symbolsToEncode, ENT_XML1, 'UTF-8');
http://pt1.php.net/htmlentities
You can change ENT_XML1 to ENT_SUBSTITUTE and it will return Unicode Replacement Characters or Hex character references.
As an alternative, you could use strtr to convert the characters to something you specify:
$chars = array(
"\x8484" => "蒄"
...
);
$convertedXML = strtr($xml, $chars);
http://php.net/strtr
Someone has done something similar on GitHub.
So you need to decode the named entities first:
function decodeNamedEntities($string) {
static $entities = NULL;
if (NULL === $entities) {
$entities = array_flip(
array_diff(
get_html_translation_table(HTML_ENTITIES, ENT_COMPAT | ENT_HTML5, 'UTF-8'),
get_html_translation_table(HTML_ENTITIES, ENT_COMPAT | ENT_XML1, 'UTF-8')
)
);
}
return str_replace(array_keys($entities), $entities, $string);
}
After that you can use htmlentities to encode them in a different format if it is really needed.

PHP Escaped special characters to html

I have string that looks like this "v\u00e4lkommen till mig" that I get after doing utf8_encode() on the string.
I would like that string to become
välkommen till mig
where the character
\u00e4 = ä = ä
How can I achive this in PHP?
Do not use utf8_(de|en)code. It just converts from UTF8 to ISO-8859-1 and back. ISO 8859-1 does not provide the same characters as ISO-8859-15 or Windows1252, which are the most used encodings (besides UTF-8). Better use mb_convert_encoding.
"v\u00e4lkommen till mig" > This string looks like a JSON encoded string which IS already utf8 encoded. The unicode code positiotion of "ä" is U+00E4 >> \u00e4.
Example
<?php
header('Content-Type: text/html; charset=utf-8');
$json = '"v\u00e4lkommen till mig"';
var_dump(json_decode($json)); //It will return a utf8 encoded string "välkommen till mig"
What is the source of this string?
There is no need to replace the ä with its HTML representation ä, if you print it in a utf8 encoded document and tell the browser the used encoding. If it is necessary, use htmlentities:
<?php
$json = '"v\u00e4lkommen till mig"';
$string = json_decode($json);
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
Edit: Since you want to keep HTML characters, and I now think your source string isn't quite what you posted (I think it is actual unicode, rather than containing \unnnn as a string), I think your best option is this:
$html = str_replace( str_replace( str_replace( htmlentities( $whatever ), '<', '<' ), '>', '>' ), '&', '&' );
(note: no call to utf8-decode)
Original answer:
There is no direct conversion. First, decode it again:
$decoded = utf8_decode( $whatever );
then encode as HTML:
$html = htmlentities( $decoded );
and of course you can do it without a variable:
$html = htmlentities( utf8_decode( $whatever ) );
http://php.net/manual/en/function.utf8-decode.php
http://php.net/manual/en/function.htmlentities.php
To do this by regular expression (not recommended, likely slower, less reliable), you can use the fact that HTML supports &#xnnnn; constructs, where the nnnn is the same as your existing \unnnn values. So you can say:
$html = preg_replace( '/\\\\u([0-9a-f]{4})/i', '&#x$1;', $whatever )
The html_entity_decode worked for me.
$json = '"v\u00e4lkommen till mig"';
echo $decoded = html_entity_decode( json_decode($json) );

Migrating data, from latin1 charset to UTF-8

I'm trying to move over some fish species information profiles from a bespoke CMS using latin1 charset to a WordPress customised (custom post type, with numerous meta fields) database which uses UTF-8.
On top of that, the old CMS uses some odd bbCode bits.
Basically, I'm looking for a function which will do this:
Take information from my old database with latin1_swedish_ci collation (and latin1 charset)
Convert all of the non-standard characters (we have characters from languages including but not exclusive of Croatian, Czech, Spanish, French and German) to HTML entities such as á (numbers like &134; fine too).
Convert all of the bbCode (see below) to HTML
Convert ' and " to HTML entities
Return the information with utf-8 charset to my new database
The bbCode to and from are:
$search = array( '[i]', '[/i]', '[b]', '[/b]', '[pl]', '[/pl]' );
$replace = array( '<i>', '</i>', '<strong>', '</strong>', '', '' );
The function that I've tried so far is:
$search = array( '[i]', '[/i]', '[b]', '[/b]', '[pl]', '[/pl]' );
$replace = array( '<i>', '</i>', '<strong>', '</strong>', '', '' );
function _convert($content) {
if(!mb_check_encoding($content, 'UTF-8')
OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8' ), 'UTF-8', 'UTF-32'))) {
$content = mb_convert_encoding($content, 'UTF-8');
if (mb_check_encoding($content, 'UTF-8')) {
return $content;
} else {
echo "<p>Couldn't convert to UTF-8.</p>";
}
}
}
function _clean($content) {
$content = _convert( $content );
/* edited out because otherwise all HTML appears as <html> rather than <html>
//$content = htmlentities( $content, ENT_QUOTES, "UTF-8" );
$content = str_replace( $search, $replace, $content );
return $content;
}
However this is stopping some fields from being imported to the new database and isn't replacing the bbCode.
If I use the following code, it mostly works:
$var = str_replace( $search, $replace, htmlentities( $row["var"], ENT_QUOTES, "UTF-8" ) );
However, certain fields containing what I think are Czech/Croatian characters don't appear at all.
Does anyone have any suggestions for how I can, in the order listed above, successfully convert the information from the "old format" to the new?
I would say if you want to convert all your non-ASCII characters you won't need to do any latin1 to UTF-8 conversion what so ever. Let's say you run a function such as htmlspecialchars or htmlentities on your data, then all non-ASCII characters will be replaced with their corresponding entity code.
Basically, after this step, there shouldn't be any characters left that needs conversion to UTF-8. Also, if you wanted to convert your latin1 encoding string into UTF-8 i strongly suspect utf8_encode will du just fine.
PS. When it comes to converting bbCode into HTML I would recommend using regular expressions instead. For example you could do it all in a line like this:
$html_data = preg_replace('/\[(/?[a-z]+)\]/i', '<$1>', $bb_code_data);

Generating XML document in PHP (escape characters)

I'm generating an XML document from a PHP script and I need to escape the XML special characters.
I know the list of characters that should be escaped; but what is the correct way to do it?
Should the characters be escaped just with backslash (\') or what is the proper way?
Is there any built-in PHP function that can handle this for me?
I created simple function that escapes with the five "predefined entities" that are in XML:
function xml_entities($string) {
return strtr(
$string,
array(
"<" => "<",
">" => ">",
'"' => """,
"'" => "&apos;",
"&" => "&",
)
);
}
Usage example Demo:
$text = "Test & <b> and encode </b> :)";
echo xml_entities($text);
Output:
Test &amp; <b> and encode </b> :)
A similar effect can be achieved by using str_replace but it is fragile because of double-replacings (untested, not recommended):
function xml_entities($string) {
return str_replace(
array("&", "<", ">", '"', "'"),
array("&", "<", ">", """, "&apos;"),
$string
);
}
Use the DOM classes to generate your whole XML document. It will handle encodings and decodings that we don't even want to care about.
Edit: This was criticized by #Tchalvak:
The DOM object creates a full XML document, it doesn't easily lend itself to just encoding a string on it's own.
Which is wrong, DOMDocument can properly output just a fragment not the whole document:
$doc->saveXML($fragment);
which gives:
Test & <b> and encode </b> :)
Test &amp; <b> and encode </b> :)
as in:
$doc = new DOMDocument();
$fragment = $doc->createDocumentFragment();
// adding XML verbatim:
$xml = "Test & <b> and encode </b> :)\n";
$fragment->appendXML($xml);
// adding text:
$text = $xml;
$fragment->appendChild($doc->createTextNode($text));
// output the result
echo $doc->saveXML($fragment);
See Demo
What about the htmlspecialchars() function?
htmlspecialchars($input, ENT_QUOTES | ENT_XML1, $encoding);
Note: the ENT_XML1 flag is only available if you have PHP 5.4.0 or higher.
htmlspecialchars() with these parameters replaces the following characters:
& (ampersand) becomes &
" (double quote) becomes "
' (single quote) becomes &apos;
< (less than) becomes <
> (greater than) becomes >
You can get the translation table by using the get_html_translation_table() function.
Tried hard to deal with XML entity issue, solve in this way:
htmlspecialchars($value, ENT_QUOTES, 'UTF-8')
In order to have a valid final XML text, you need to escape all XML entities and have the text written in the same encoding as the XML document processing-instruction states it (the "encoding" in the <?xml line). The accented characters don't need to be escaped as long as they are encoded as the document.
However, in many situations simply escaping the input with htmlspecialchars may lead to double-encoded entities (for example é would become &eacute;), so I suggest decoding html entities first:
function xml_escape($s)
{
$s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
$s = htmlspecialchars($s, ENT_QUOTES, 'UTF-8', false);
return $s;
}
Now you need to make sure all accented characters are valid in the XML document encoding. I strongly encourage to always encode XML output in UTF-8, since not all the XML parsers respect the XML document processing-instruction encoding. If your input might come from a different charset, try using utf8_encode().
There's a special case, which is your input may come from one of these encodings: ISO-8859-1, ISO-8859-15, UTF-8, cp866, cp1251, cp1252, and KOI8-R -- PHP treats them all the same, but there are some slight differences in them -- some of which even iconv() cannot handle. I could only solve this encoding issue by complementing utf8_encode() behavior:
function encode_utf8($s)
{
$cp1252_map = array(
"\xc2\x80" => "\xe2\x82\xac",
"\xc2\x82" => "\xe2\x80\x9a",
"\xc2\x83" => "\xc6\x92",
"\xc2\x84" => "\xe2\x80\x9e",
"\xc2\x85" => "\xe2\x80\xa6",
"\xc2\x86" => "\xe2\x80\xa0",
"\xc2\x87" => "\xe2\x80\xa1",
"\xc2\x88" => "\xcb\x86",
"\xc2\x89" => "\xe2\x80\xb0",
"\xc2\x8a" => "\xc5\xa0",
"\xc2\x8b" => "\xe2\x80\xb9",
"\xc2\x8c" => "\xc5\x92",
"\xc2\x8e" => "\xc5\xbd",
"\xc2\x91" => "\xe2\x80\x98",
"\xc2\x92" => "\xe2\x80\x99",
"\xc2\x93" => "\xe2\x80\x9c",
"\xc2\x94" => "\xe2\x80\x9d",
"\xc2\x95" => "\xe2\x80\xa2",
"\xc2\x96" => "\xe2\x80\x93",
"\xc2\x97" => "\xe2\x80\x94",
"\xc2\x98" => "\xcb\x9c",
"\xc2\x99" => "\xe2\x84\xa2",
"\xc2\x9a" => "\xc5\xa1",
"\xc2\x9b" => "\xe2\x80\xba",
"\xc2\x9c" => "\xc5\x93",
"\xc2\x9e" => "\xc5\xbe",
"\xc2\x9f" => "\xc5\xb8"
);
$s=strtr(utf8_encode($s), $cp1252_map);
return $s;
}
If you need proper xml output, simplexml is the way to go:
http://www.php.net/manual/en/simplexmlelement.asxml.php
Proper escaping is the way to get correct XML output but you need to handle escaping differently for attributes and elements. (That is Tomas' answer is incorrect).
I wrote/stole some Java code a while back that differentiates between attribute and element escaping. The reason is that the XML parser considers all white space special particularly in attributes.
It should be trivial to port that over to PHP (you can use Tomas Jancik's approach with the above appropriate escaping). You don't have to worry about escaping extended entities if your using UTF-8.
If you don't want to port my Java code you can look at XMLWriter which is stream based and uses libxml so it should be very efficient.
You can use this methods:
http://php.net/manual/en/function.htmlentities.php
In that way all entities (html/xml) are escaped and you can put your string inside XML tags
Based on the solution of sadeghj the following code worked for me:
/**
* #param $arr1 the single string that shall be masked
* #return the resulting string with the masked characters
*/
function replace_char($arr1)
{
if (strpos ($arr1,'&')!== FALSE) { //test if the character appears
$arr1=preg_replace('/&/','&', $arr1); // do this first
}
// just encode the
if (strpos ($arr1,'>')!== FALSE) {
$arr1=preg_replace('/>/','>', $arr1);
}
if (strpos ($arr1,'<')!== FALSE) {
$arr1=preg_replace('/</','<', $arr1);
}
if (strpos ($arr1,'"')!== FALSE) {
$arr1=preg_replace('/"/','"', $arr1);
}
if (strpos ($arr1,'\'')!== FALSE) {
$arr1=preg_replace('/\'/','&apos;', $arr1);
}
return $arr1;
}
function replace_char($arr1)
{
$arr[]=preg_replace('>','&gt', $arr1);
$arr[]=preg_replace('<','&lt', $arr1);
$arr[]=preg_replace('"','&quot', $arr1);
$arr[]=preg_replace('\'','&apos', $arr1);
$arr[]=preg_replace('&','&amp', $arr1);
return $arr;
}

Categories