THE PROBLEM: I need an XML file "fully encoded" in UTF-8; that is, with no entities representing symbols: all symbols encoded as UTF-8, except the only 3 that are XML-reserved, "&" (amp), "<" (lt) and ">" (gt). And I need a built-in function that does it fast: transform entities into real UTF-8 characters (without corrupting my XML).
PS: it is a "real world problem" (!); PMC/journals, for example, has 2.8 MILLION scientific articles encoded with a special XML DTD (also known as the JATS format)... To process them as "usual XML-UTF8-text" we need to change numeric entities into UTF-8 characters.
THE ATTEMPTED SOLUTION: the natural function for this task is html_entity_decode, but it destroys the XML code (!), transforming the 3 XML-reserved symbols.
Illustrating the problem
Suppose
$xmlFrag ='<p>Hello world!&#160;Let A&lt;B and A=&#x222C;dxdy</p>';
Where the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF-8, and the XML-reserved lt must not. After the transformation, the XML text will be
$xmlFrag = '<p>Hello world! Let A&lt;B and A=∬dxdy</p>';
The text "A<B" needs an XML-reserved character, so it MUST stay as A&lt;B.
Frustrated solutions
I tried to use html_entity_decode to solve the problem directly... So I updated my PHP to v5.5 to try the ENT_XML1 option,
$s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
// as I expected
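To make the failure concrete, here is a minimal sketch using a fragment like the one above (the variable names are only for illustration). Even with ENT_XML1, html_entity_decode still decodes the XML-predefined entities, so the result is no longer well-formed XML:

```php
<?php
// Illustration only: a fragment like the one in the question.
$xmlFrag = '<p>Hello world!&#160;Let A&lt;B and A=&#x222C;dxdy</p>';

// ENT_XML1 restricts named entities to the XML-predefined ones,
// but those (including &lt;) are still decoded.
$decoded = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8');

// The numeric entities became UTF-8 characters, which is wanted...
var_dump(strpos($decoded, '∬') !== false);   // bool(true)
// ...but "A&lt;B" became "A<B", which breaks the XML.
var_dump(strpos($decoded, 'A<B') !== false); // bool(true)
```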
Perhaps another question is, "WHY is there no other option that does what I expected?" It is important for many other XML applications (!), not only for me.
I do not need a workaround as an answer... OK, I will show my ugly function; perhaps it helps you understand the problem,
function xml_entity_decode($s) {
    // here an illustration (by user-defined function)
    // of how the hypothetical PHP built-in function MUST work
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s);
    //$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any php version
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+ (ENT_HTML5)
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
} // you see? no benchmark needed:
  // it is not as fast as direct use of html_entity_decode; an
  // XML-safe option would be ideal.
PS: corrected after this answer. The ENT_HTML5 flag is required to really convert all named entities.
This question keeps attracting "false answers" (see answers). Perhaps this is because people do not pay attention, and because there is NO ANSWER: PHP lacks a built-in solution.
... So, let's repeat my workaround (which is NOT an answer!) to avoid creating more confusion:
The best workaround
Pay attention:
The function xml_entity_decode() below is the best (over any other) workaround.
The function below is not an answer to the present question, it is only a workaround.
function xml_entity_decode($s) {
    // illustrating how a (hypothetical) PHP built-in function MUST work
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s);
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+ (ENT_HTML5)
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
}
To demonstrate that you have a better solution, please test it first with this simple benchmark:
$countBchMk_MAX=1000;
$xml = file_get_contents('sample1.xml'); // BIG and complex XML string
$start_time = microtime(TRUE);
for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){
    $A = xml_entity_decode($xml); // 0.0002
    /* 0.0014
    $doc = new DOMDocument;
    $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
    $doc->encoding = 'UTF-8';
    $A = $doc->saveXML();
    */
}
$end_time = microtime(TRUE);
echo "\n<h1>END $countBchMk_MAX BENCHMARKs WITH ",
    ($end_time - $start_time)/$countBchMk_MAX,
    " seconds</h1>";
Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving:
$doc = new DOMDocument;
$doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$doc->save($outputFile);
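A self-contained sketch of the same idea, assuming an inline DTD subset that declares a single entity instead of the real JATS DTD (the nbsp declaration and the sample markup are made up for illustration):

```php
<?php
// Hypothetical sample: an internal DTD subset stands in for the JATS DTD.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE p [ <!ENTITY nbsp "&#160;"> ]>
<p>A&lt;B&nbsp;and A=&#x222C;dxdy</p>
XML;

$doc = new DOMDocument;
// LIBXML_NOENT substitutes entities while parsing.
$doc->loadXML($xml, LIBXML_NOENT);
$doc->encoding = 'UTF-8';

// Serialize only the root element; reserved characters are re-escaped on output.
$out = $doc->saveXML($doc->documentElement);
echo $out, "\n";
```

Because the parser knows which characters are markup, "A&lt;B" stays escaped while the other entities become UTF-8 characters. Note that LIBXML_NOENT also allows external entities to be expanded, so it is best used only on trusted input.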
I had the same problem because someone used HTML templates to create XML, instead of using SimpleXML. sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert #_x_amp#; to &amp;, however unlikely its presence in the source XML.
Note: I'm assuming default encoding is UTF-8
// Search for named entities (strings like "&abc1;").
echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
/* <Foo>€&amp;foo Ç</Foo> */
Also, if you want to replace special characters with numbered entities (in case you don't want a UTF-8 XML), you can easily add a function to the above code:
// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
/* <Foo>&#8364;&amp;foo &#199;</Foo> */
In your case you want it the other way around: decode numbered entities into UTF-8:
// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
// Decodes the (uncaught) numbered entities into UTF-8.
echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
/* <Foo>€&amp;foo Ç</Foo> */
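The two steps can also be folded into a single pass. The following is only a sketch under my own assumptions: the function name xml_entity_decode_strict is hypothetical, and the rule "keep any entity whose decoded form is &, <, > or a quote" is my reading of what "XML-safe" should mean here.

```php
<?php
// Hypothetical helper (not a PHP built-in): decode named and numeric
// entities in one regex pass, leaving XML-significant characters escaped.
function xml_entity_decode_strict(string $s): string
{
    return preg_replace_callback(
        '/&(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i',
        function (array $m): string {
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Keep the original entity if decoding would yield markup or a quote.
            return in_array($decoded, ['&', '<', '>', '"', "'"], true) ? $m[0] : $decoded;
        },
        $s
    );
}

echo xml_entity_decode_strict('<p>A&lt;B&#160;and A=&#x222C;dxdy &euro; &#38; #_x_amp#;</p>'), "\n";
```

Unlike a placeholder swap, this leaves a literal "#_x_amp#;" alone, and it also keeps numeric references to reserved characters (such as &#38;) escaped. It has not been benchmarked against the other variants.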
Benchmark
I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.
<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>
Your method
php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&amp;foo Ç é &amp; ∬</Foo>
=====
Time taken: 2.0397531986237
My method
php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&amp;foo Ç é #_x_amp#; &#8748;</Foo>
=====
Time taken: 4.045273065567
My method (with unicode to numbered entity):
php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>&#8364;&amp;foo &#199; &#233; #_x_amp#; &#8748;</Foo>
=====
Time taken: 5.4407880306244
My method (with numbered entity to unicode):
php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&amp;foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.5400078296661
// Note: relies on the surrounding class ($this->charset) and an is_php()
// helper, as in CodeIgniter.
public function entity_decode($str, $charset = NULL)
{
    if (strpos($str, '&') === FALSE)
    {
        return $str;
    }
    static $_entities;
    isset($charset) OR $charset = $this->charset;
    $flag = is_php('5.4')
        ? ENT_COMPAT | ENT_HTML5
        : ENT_COMPAT;
    do
    {
        $str_compare = $str;
        // Decode standard entities, avoiding false positives
        if ($c = preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches))
        {
            if ( ! isset($_entities))
            {
                $_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));
                // If we're not on PHP 5.4+, add the possibly dangerous HTML 5
                // entities to the array manually
                if ($flag === ENT_COMPAT)
                {
                    $_entities[':'] = '&colon;';
                    $_entities['('] = '&lpar;';
                    $_entities[')'] = '&rpar;';
                    $_entities["\n"] = '&newline;';
                    $_entities["\t"] = '&tab;';
                }
            }
            $replace = array();
            $matches = array_unique(array_map('strtolower', $matches[0]));
            for ($i = 0; $i < $c; $i++)
            {
                if (($char = array_search($matches[$i].';', $_entities, TRUE)) !== FALSE)
                {
                    $replace[$matches[$i]] = $char;
                }
            }
            $str = str_ireplace(array_keys($replace), array_values($replace), $str);
        }
        // Decode numeric & UTF16 two byte entities
        $str = html_entity_decode(
            preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;]))|(?:0*\d{2,4}(?![0-9;])))/iS', '$1;', $str),
            $flag,
            $charset
        );
    }
    while ($str_compare !== $str);
    return $str;
}
For those coming here because a numeric entity in the range 128 to 159 remains a numeric entity instead of being converted to a character:
echo xml_entity_decode('&#128;');
//Output &#128; instead of the expected €
This depends on the PHP version (at least for PHP >= 5.6 the entity remains) and on the affected characters. The reason is that code points 128 to 159 are not printable characters in Unicode (they are C1 control characters). This can happen if the data to be converted mixes in windows-1252 content (where 128 is the € sign).
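If the source really is windows-1252 in disguise, one option is to map just that range by hand before or after the normal decoding. A minimal sketch, assuming the entities are decimal; the helper name decode_cp1252_entities is made up for illustration:

```php
<?php
// Hypothetical helper: map numeric entities 128-159 through Windows-1252.
function decode_cp1252_entities(string $s): string
{
    // Windows-1252 bytes 0x80-0x9F; 129, 141, 143, 144 and 157 are unassigned.
    static $map = [
        128 => '€', 130 => '‚', 131 => 'ƒ', 132 => '„', 133 => '…',
        134 => '†', 135 => '‡', 136 => 'ˆ', 137 => '‰', 138 => 'Š',
        139 => '‹', 140 => 'Œ', 142 => 'Ž', 145 => '‘', 146 => '’',
        147 => '“', 148 => '”', 149 => '•', 150 => '–', 151 => "\u{2014}",
        152 => '˜', 153 => '™', 154 => 'š', 155 => '›', 156 => 'œ',
        158 => 'ž', 159 => 'Ÿ',
    ];

    return preg_replace_callback('/&#([0-9]+);/', function (array $m) use ($map): string {
        // Entities outside the table (or unassigned in cp1252) are left untouched.
        return $map[(int) $m[1]] ?? $m[0];
    }, $s);
}

echo decode_cp1252_entities('&#128;5 &#147;hi&#148; &#65;'), "\n";
```

Entities outside 128 to 159 pass through unchanged, so the function can be combined with any of the decoders above.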
Try this function:
function xmlsafe($s,$intoQuotes=1) {
    if ($intoQuotes)
        return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
    else
        return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), html_entity_decode($s));
}
example usage:
echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>';
also: https://stackoverflow.com/a/9446666/2312709
This code is used in production, and no problems with UTF-8 have appeared so far.
I am trying to write some Urdu text on an image using the imagettftext() function of PHP. It does not display the characters unless I convert the text using the following code:
function convert($text){
    $out="";
    mb_language('uni');
    mb_internal_encoding('UTF-8');
    $text = mb_convert_encoding($text, 'HTML-ENTITIES',"UTF-8");
    $text = html_entity_decode($text,ENT_NOQUOTES, "ISO-8859-1");
    for($i = 0; $i < strlen($text); $i++) {
        $letter = $text[$i];
        $num = ord($letter);
        if($num>127) {
            $out .= "&#$num;";
        } else {
            $out .= $letter;
        }
    }
    return $out;
}
Now, the text e.g. عچں (which contains the three characters ع چ ں) is printed on to the image as separate and full characters instead of cutting and joining the characters to form an Urdu word like عچں.
I have used the characters ا ب ت ث with codes U+0627, U+0628, U+0629 and so on from this page http://en.wikipedia.org/wiki/List_of_Unicode_characters#Arabic
I have shared the code here: https://code.google.com/p/urdu-captcha/downloads/list
Note: I have added space between the characters in the code provided
removing which makes no difference to how the text is displayed on the
image.
How do I make it write the characters joined together to form proper words?
You'll need an additional library to perform Arabic glyph joining. Check out AR-PHP.
I have a website that's in win-1251 encoding and it needs to stay that way. But I also need to be able to echo a few links that contain non-Latin, non-Cyrillic characters like šžāņūī...
I need a function that converts this
"māja un man tā patīk"
to
"m&#257;ja un man t&#257; pat&#299;k"
and that does not touch HTML, so if there is <b> it needs to stay as <b>, not become &gt; or &lt;
And please, no advice about the encoding and how wrong that is.
$str = "<b>Obāchan</b> おばあちゃん";
$str = preg_replace_callback('/./u', function ($matches) {
    $chr = $matches[0];
    if (strlen($chr) > 1) {
        $chr = mb_convert_encoding($chr, 'HTML-ENTITIES', 'UTF-8');
    }
    return $chr;
}, $str);
This expects the original $str to be UTF-8 encoded, i.e. your PHP file should be saved in UTF-8. It encodes all non-ASCII compatible code points to HTML entities. Since all HTML special characters are ASCII characters, they remain untouched. The resulting string is pure ASCII. Since the lower Win-1251 code points are ASCII compatible, the resulting string is also a valid Win-1251 string. The above $str converts to:
<b>Ob&#257;chan</b> &#12362;&#12400;&#12354;&#12385;&#12419;&#12435;
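On newer PHP versions the 'HTML-ENTITIES' pseudo-encoding of mbstring is deprecated (as of PHP 8.2), so the same effect may be easier to get with mb_encode_numericentity. A sketch, assuming a UTF-8 input string; the conversion map values are my own choice to cover all non-ASCII code points:

```php
<?php
$str = "<b>Obāchan</b> おばあちゃん";

// Conversion map: [first code point, last code point, offset, mask].
// Everything from U+0080 upward becomes a decimal numeric entity.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$ascii = mb_encode_numericentity($str, $convmap, 'UTF-8');
echo $ascii, "\n";
```

ASCII characters, including <, > and &, are outside the map and therefore stay untouched.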
The main things you probably don't want to encode are <, > and &. Those are really the only special characters. So how about encoding everything first, and then just decoding <, > and &? I feel you should be fine.
This is untested:
$output =
htmlspecialchars_decode(
htmlentities($input, ENT_NOQUOTES, 'CP-1251')
);
let me know
What Evert suggests looks logical to me too! If you insist, this is a way to do it if there are only two letters that bother you. For more letters the script will not be as effective and needs to change.
<?PHP
function myConvert($str)
{
    $chars['ā']='&#257;';
    $chars['ī']='&#299;';
    // Apply every replacement to the same string so none of them is lost.
    foreach ($chars as $key => $value)
        $str = str_replace($key, $value, $str);
    echo $str;
}
myConvert("māja un man tā patīk");
?>
==================edited==============
For many characters maybe this one can help you:
<?PHP
function myConvert($str)
{
    $final=null;
    $parts = preg_split("/&#[0-9]*;/i", $str);//get all text parts
    preg_match_all("/&#[0-9]*;/i", $str, $delimiters );//get delimiters;
    $delimiters[0][]='';//make arrays equal size
    foreach($parts as $key => $value)
        $final.=$value.mb_convert_encoding
            ($delimiters[0][$key], "UTF-8", "HTML-ENTITIES");
    return $final;
}
$fh = fopen("testFile.txt", 'w') ;
fwrite($fh, myConvert("m&#257;ja un man t&#257; pat&#299;k&#299;"));
fclose($fh);
?>
The desired output is written to the text file. This code, exactly as it is (not merged into some project), does what it claims to do: it converts codes like &#257; to the characters they represent.
I am working on a CakePHP site saved in MacRoman char encoding. I want to change all the files to UTF-8 for internationalisation. For all the other files in the site this works fine. However, in the core.php file there is a security salt, which is a string with special characters ("!:* etc.). When I save this file as UTF-8 the salt gets corrupted. I can roll this back with git, but it's an annoyance.
Does anyone know how I can convert the string from MacRoman to UTF-8?
You don't give enough information to confirm this, but I guess the salt is used in its binary form. In that case, changing the encoding of the file will corrupt the salt if this binary stream is changed, even if the characters are correctly converted.
Since the first 128 characters are identical in UTF-8 and Mac OS Roman, you don't have to worry if the salt is written using only these characters.
Let's say the salt is somewhere:
$salt = "a!c‡Œ";
You could write instead:
$salt = "a!c\xE0\xCE";
You could map all to their hexadecimal representation, as it might be easier to automate:
$salt = "\x61\x21\x63\xE0\xCE";
See the table here.
The following snippet can automate this conversion:
$res = "";
foreach (str_split($salt) as $c) {
    // Always emit two hex digits so bytes below 0x10 stay unambiguous.
    $res .= sprintf('\\x%02X', ord($c));
}
echo $res;
Thanks for the input, pointed me in the right direction. The solution is:
$salt = iconv('UTF-8', 'macintosh', $string);
For those who do not have access to iconv here is a function in PHP:
http://sebastienguillon.com/test/jeux-de-caracteres/MacRoman_to_utf8.txt.php
It will properly convert MacRoman text to UTF-8 and you can even decide how you want to break ligatures.
<?php
function MacRoman_to_utf8($str, $break_ligatures='none')
{
// $break_ligatures : 'none' | 'fifl' | 'all'
// 'none' : don't break any MacRoman ligatures, transform them into their utf-8 counterparts
// 'fifl' : break only fi ("\xDE" => "fi") and fl ("\xDF"=>"fl")
// 'all' : break fi, fl and also AE ("\xAE"=>"AE"), ae ("\xBE"=>"ae"), OE ("\xCE"=>"OE") and oe ("\xCF"=>"oe")
if($break_ligatures == 'fifl')
{
$str = strtr($str, array("\xDE"=>"fi", "\xDF"=>"fl"));
}
if($break_ligatures == 'all')
{
$str = strtr($str, array("\xDE"=>"fi", "\xDF"=>"fl", "\xAE"=>"AE", "\xBE"=>"ae", "\xCE"=>"OE", "\xCF"=>"oe"));
}
    $str = strtr($str, array("\x7F"=>"\x20", "\x80"=>"\xC3\x84", "\x81"=>"\xC3\x85",
        "\x82"=>"\xC3\x87", "\x83"=>"\xC3\x89", "\x84"=>"\xC3\x91", "\x85"=>"\xC3\x96",
        "\x86"=>"\xC3\x9C", "\x87"=>"\xC3\xA1", "\x88"=>"\xC3\xA0", "\x89"=>"\xC3\xA2",
        "\x8A"=>"\xC3\xA4", "\x8B"=>"\xC3\xA3", "\x8C"=>"\xC3\xA5", "\x8D"=>"\xC3\xA7",
        "\x8E"=>"\xC3\xA9", "\x8F"=>"\xC3\xA8", "\x90"=>"\xC3\xAA", "\x91"=>"\xC3\xAB",
        "\x92"=>"\xC3\xAD", "\x93"=>"\xC3\xAC", "\x94"=>"\xC3\xAE", "\x95"=>"\xC3\xAF",
        "\x96"=>"\xC3\xB1", "\x97"=>"\xC3\xB3", "\x98"=>"\xC3\xB2", "\x99"=>"\xC3\xB4",
        "\x9A"=>"\xC3\xB6", "\x9B"=>"\xC3\xB5", "\x9C"=>"\xC3\xBA", "\x9D"=>"\xC3\xB9",
        "\x9E"=>"\xC3\xBB", "\x9F"=>"\xC3\xBC", "\xA0"=>"\xE2\x80\xA0", "\xA1"=>"\xC2\xB0",
        "\xA2"=>"\xC2\xA2", "\xA3"=>"\xC2\xA3", "\xA4"=>"\xC2\xA7", "\xA5"=>"\xE2\x80\xA2",
        "\xA6"=>"\xC2\xB6", "\xA7"=>"\xC3\x9F", "\xA8"=>"\xC2\xAE", "\xA9"=>"\xC2\xA9",
        "\xAA"=>"\xE2\x84\xA2", "\xAB"=>"\xC2\xB4", "\xAC"=>"\xC2\xA8", "\xAD"=>"\xE2\x89\xA0",
        "\xAE"=>"\xC3\x86", "\xAF"=>"\xC3\x98", "\xB0"=>"\xE2\x88\x9E", "\xB1"=>"\xC2\xB1",
        "\xB2"=>"\xE2\x89\xA4", "\xB3"=>"\xE2\x89\xA5", "\xB4"=>"\xC2\xA5", "\xB5"=>"\xC2\xB5",
        "\xB6"=>"\xE2\x88\x82", "\xB7"=>"\xE2\x88\x91", "\xB8"=>"\xE2\x88\x8F", "\xB9"=>"\xCF\x80",
        "\xBA"=>"\xE2\x88\xAB", "\xBB"=>"\xC2\xAA", "\xBC"=>"\xC2\xBA", "\xBD"=>"\xCE\xA9",
        "\xBE"=>"\xC3\xA6", "\xBF"=>"\xC3\xB8", "\xC0"=>"\xC2\xBF", "\xC1"=>"\xC2\xA1",
        "\xC2"=>"\xC2\xAC", "\xC3"=>"\xE2\x88\x9A", "\xC4"=>"\xC6\x92", "\xC5"=>"\xE2\x89\x88",
        "\xC6"=>"\xE2\x88\x86", "\xC7"=>"\xC2\xAB", "\xC8"=>"\xC2\xBB", "\xC9"=>"\xE2\x80\xA6",
        "\xCA"=>"\xC2\xA0", "\xCB"=>"\xC3\x80", "\xCC"=>"\xC3\x83", "\xCD"=>"\xC3\x95",
        "\xCE"=>"\xC5\x92", "\xCF"=>"\xC5\x93", "\xD0"=>"\xE2\x80\x93", "\xD1"=>"\xE2\x80\x94",
        "\xD2"=>"\xE2\x80\x9C", "\xD3"=>"\xE2\x80\x9D", "\xD4"=>"\xE2\x80\x98", "\xD5"=>"\xE2\x80\x99",
        "\xD6"=>"\xC3\xB7", "\xD7"=>"\xE2\x97\x8A", "\xD8"=>"\xC3\xBF", "\xD9"=>"\xC5\xB8",
        "\xDA"=>"\xE2\x81\x84", "\xDB"=>"\xE2\x82\xAC", "\xDC"=>"\xE2\x80\xB9", "\xDD"=>"\xE2\x80\xBA",
        "\xDE"=>"\xEF\xAC\x81", "\xDF"=>"\xEF\xAC\x82", "\xE0"=>"\xE2\x80\xA1", "\xE1"=>"\xC2\xB7",
        "\xE2"=>"\xE2\x80\x9A", "\xE3"=>"\xE2\x80\x9E", "\xE4"=>"\xE2\x80\xB0", "\xE5"=>"\xC3\x82",
        "\xE6"=>"\xC3\x8A", "\xE7"=>"\xC3\x81", "\xE8"=>"\xC3\x8B", "\xE9"=>"\xC3\x88",
        "\xEA"=>"\xC3\x8D", "\xEB"=>"\xC3\x8E", "\xEC"=>"\xC3\x8F", "\xED"=>"\xC3\x8C",
        "\xEE"=>"\xC3\x93", "\xEF"=>"\xC3\x94", "\xF0"=>"\xEF\xA3\xBF", "\xF1"=>"\xC3\x92",
        "\xF2"=>"\xC3\x9A", "\xF3"=>"\xC3\x9B", "\xF4"=>"\xC3\x99", "\xF5"=>"\xC4\xB1",
        "\xF6"=>"\xCB\x86", "\xF7"=>"\xCB\x9C", "\xF8"=>"\xC2\xAF", "\xF9"=>"\xCB\x98",
        "\xFA"=>"\xCB\x99", "\xFB"=>"\xCB\x9A", "\xFC"=>"\xC2\xB8", "\xFD"=>"\xCB\x9D",
        "\xFE"=>"\xCB\x9B", "\xFF"=>"\xCB\x87", "\x00"=>"\x20", "\x01"=>"\x20",
        "\x02"=>"\x20", "\x03"=>"\x20", "\x04"=>"\x20", "\x05"=>"\x20",
        "\x06"=>"\x20", "\x07"=>"\x20", "\x08"=>"\x20", "\x0B"=>"\x20",
        "\x0C"=>"\x20", "\x0E"=>"\x20", "\x0F"=>"\x20", "\x10"=>"\x20",
        "\x11"=>"\x20", "\x12"=>"\x20", "\x13"=>"\x20", "\x14"=>"\x20",
        "\x15"=>"\x20", "\x16"=>"\x20", "\x17"=>"\x20", "\x18"=>"\x20",
        "\x19"=>"\x20", "\x1A"=>"\x20", "\x1B"=>"\x20", "\x1C"=>"\x20",
        "\x1D"=>"\x20", "\x1E"=>"\x20", "\x1F"=>"\x20", "\xF0"=>""));
return $str;
}
?>
Have you tried mb_convert_encoding?
Think it would be:
$str = mb_convert_encoding($str, "macintosh", "UTF-8");
Just curious, have you tried copying the salt, saving as UTF-8 and then pasting the salt back in place and saving again?
I am rebuilding a database whose table structure was crap, so I'm porting some of the data from one table to another. This data appears to have been copy-pasted from an MS Office product, so as I fetch the data I clean it up with HTMLPurifier and some str_replace calls in PHP. Here is the clean function:
function clean_html($html) {
    $config = HTMLPurifier_Config::createDefault();
    $config->set('AutoFormat','RemoveEmpty',true);
    $config->set('HTML','AllowedAttributes','href,src');
    $config->set('HTML','AllowedElements','p,em,strong,a,ul,li,ol,img');
    $purifier = new HTMLPurifier($config);
    $html = $purifier->purify($html);
    $html = str_replace('&nbsp;',' ',$html);
    $html = str_replace("\r",'',$html);
    $html = str_replace("\n",'',$html);
    $html = str_replace("\t",'',$html);
    $html = str_replace('  ',' ',$html);
    $html = str_replace('<p> </p>','',$html);
    $html = str_replace(chr(160),' ',$html);
    return trim($html);
}
However, when I put the results into my new table and output them to the ckeditor I get those three characters.
I then have a JavaScript function that is called to remove special characters from the content of the ckeditor too. It doesn't clean them either:
function remove_special(str) {
var rExps=[ /[\xC0-\xC2]/g, /[\xE0-\xE2]/g,
/[\xC8-\xCA]/g, /[\xE8-\xEB]/g,
/[\xCC-\xCE]/g, /[\xEC-\xEE]/g,
/[\xD2-\xD4]/g, /[\xF2-\xF4]/g,
/[\xD9-\xDB]/g, /[\xF9-\xFB]/g,
/\xD1/,/\xF1/g,
"/[\u00a0|\u1680|[\u2000-\u2009]|u200a|\u200b|\u2028|\u2029|\u202f|\u205f|\u3000|\xa0]/g",
/\u000b/g,'/[\u180e|\u000c]/g',
/\u2013/g, /\u2014/g,
/\xa9/g,/\xae/g,/\xb7/g,/\u2018/g,/\u2019/g,/\u201c/g,/\u201d/g,/\u2026/g];
var repChar=['A','a','E','e','I','i','O','o','U','u','N','n',' ','\t','','-','--','(c)','(r)','*',"'","'",'"','"','...'];
for(var i=0; i<rExps.length; i++) {
str=str.replace(rExps[i],repChar[i]);
}
for (var x = 0; x < str.length; x++) {
charcode = str.charCodeAt(x);
if ((charcode < 32 || charcode > 126) && charcode !=10 && charcode != 13) {
str = str.replace(str.charAt(x), "");
}
}
return str;
}
Does anyone know offhand what I need to do to get rid of them? I think they may be some sort of quote.
Your character encodings are all out of whack. � is indicative to me of a three-byte UTF-8 encoded character.
Some things you need to discover
What was the encoding of the old table?
What is the encoding of the new table?
What is the encoding of the page that displays ckeditor?
It looks like HTMLPurifier's default is UTF-8 so you really need to be aware of the encoding of your data!
Had a similar issue: php remove/identify this symbol �
The character � is the REPLACEMENT CHARACTER (U+FFFD). It is used when there is an error within a UTF-encoded sequence:
FFFD � REPLACEMENT CHARACTER
- used to replace an incoming character whose value
is unknown or unrepresentable in Unicode
In most cases it means that some data is being interpreted as a UTF encoding although it was actually encoded with a different one.
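A quick way to check for this situation in PHP is to validate the bytes as UTF-8 and convert only when validation fails. This is a sketch under the assumption that the fallback source encoding is Windows-1252 (typical for text pasted from Office); the helper name to_utf8 is hypothetical:

```php
<?php
// Hypothetical helper: return $s unchanged if it is already valid UTF-8,
// otherwise treat it as $assumed (Windows-1252 by default) and convert.
function to_utf8(string $s, string $assumed = 'Windows-1252'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    return mb_convert_encoding($s, 'UTF-8', $assumed);
}

// Bytes 0x93/0x94 are curly quotes in Windows-1252 but invalid as UTF-8.
echo to_utf8("\x93quoted\x94"), "\n";
```

The check cannot tell which legacy encoding was really used; it only detects that the bytes are not valid UTF-8, so the assumed encoding still has to come from knowledge of the data source.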
My problem was pasting text from Microsoft Office products into HTML or into a database. The largest offenders seem to be the em dash and smart quotes.
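Once the text is valid UTF-8, those offenders can be normalized to plain ASCII with a simple translation table. A minimal sketch; the function name and the chosen replacements are only one possible convention (they mirror the replacement list in the JavaScript above):

```php
<?php
// Hypothetical helper: replace common "smart" punctuation with ASCII.
function ascii_punctuation(string $s): string
{
    return strtr($s, [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // ellipsis
        "\u{00A0}" => ' ',   // non-breaking space
    ]);
}

echo ascii_punctuation("\u{201C}It\u{2019}s done\u{201D} \u{2014} finally\u{2026}"), "\n";
// "It's done" -- finally...
```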