� in my html after purify - php

I am rebuilding a database whose table structure was poor, so I'm porting some of the data from one table to another. The data appears to have been copy-pasted from an MS Office product, so as I read it I clean it up with HTMLPurifier and some str_replace calls in PHP. Here is the clean function:
function clean_html($html) {
$config = HTMLPurifier_Config::createDefault();
$config->set('AutoFormat','RemoveEmpty',true);
$config->set('HTML','AllowedAttributes','href,src');
$config->set('HTML','AllowedElements','p,em,strong,a,ul,li,ol,img');
$purifier = new HTMLPurifier($config);
$html = $purifier->purify($html);
$html = str_replace('&nbsp;',' ',$html);
$html = str_replace("\r",'',$html);
$html = str_replace("\n",'',$html);
$html = str_replace("\t",'',$html);
$html = str_replace('  ',' ',$html);
$html = str_replace('<p> </p>','',$html);
$html = str_replace(chr(160),' ',$html);
return trim($html);
}
However, when I put the results into my new table and output them to CKEditor, I still get those three characters.
I also have a JavaScript function that is called to remove special characters from the CKEditor content, but it doesn't clean them either:
function remove_special(str) {
var rExps=[ /[\xC0-\xC2]/g, /[\xE0-\xE2]/g,
/[\xC8-\xCA]/g, /[\xE8-\xEB]/g,
/[\xCC-\xCE]/g, /[\xEC-\xEE]/g,
/[\xD2-\xD4]/g, /[\xF2-\xF4]/g,
/[\xD9-\xDB]/g, /[\xF9-\xFB]/g,
/\xD1/,/\xF1/g,
"/[\u00a0|\u1680|[\u2000-\u2009]|u200a|\u200b|\u2028|\u2029|\u202f|\u205f|\u3000|\xa0]/g",
/\u000b/g,'/[\u180e|\u000c]/g',
/\u2013/g, /\u2014/g,
/\xa9/g,/\xae/g,/\xb7/g,/\u2018/g,/\u2019/g,/\u201c/g,/\u201d/g,/\u2026/g];
var repChar=['A','a','E','e','I','i','O','o','U','u','N','n',' ','\t','','-','--','(c)','(r)','*',"'","'",'"','"','...'];
for(var i=0; i<rExps.length; i++) {
str=str.replace(rExps[i],repChar[i]);
}
for (var x = 0; x < str.length; x++) {
charcode = str.charCodeAt(x);
if ((charcode < 32 || charcode > 126) && charcode !=10 && charcode != 13) {
str = str.replace(str.charAt(x), "");
}
}
return str;
}
Does anyone know offhand what I need to do to get rid of them? I think they may be some sort of quote.

Your character encodings are out of sync. To me, � indicates a three-byte UTF-8 encoded character that is being read in a different encoding.
Some things you need to find out:
What was the encoding of the old table?
What is the encoding of the new table?
What is the encoding of the page that displays CKEditor?
It looks like HTMLPurifier's default is UTF-8, so you really need to know the encoding of your data!
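One way to act on those questions before purifying is to check whether each value is valid UTF-8 and convert it if it is not. The following is only a sketch: `to_utf8()` is a hypothetical helper, and Windows-1252 as the legacy encoding is an assumption (it is typical for text pasted from Office, but you have to confirm it for your own table). It also requires the mbstring extension.

```php
<?php
// Hypothetical helper: return $s as UTF-8.
// ASSUMPTION: anything that is not valid UTF-8 is Windows-1252.
function to_utf8(string $s, string $assumedLegacy = 'Windows-1252'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($s, 'UTF-8', $assumedLegacy);
}

// Curly quotes and an en dash stored as single Windows-1252 bytes.
$legacy = "\x93Hello\x94 \x96 world";
$clean  = to_utf8($legacy);
```

Running the result through `to_utf8()` again leaves it unchanged, so the helper is safe to apply to rows that are already correct. A short legacy string can occasionally happen to be valid UTF-8, so the validity check is a heuristic, not proof of the source encoding.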

I had a similar issue: php remove/identify this symbol �
The character � is the REPLACEMENT CHARACTER (U+FFFD). It is used when there was an error while decoding Unicode data:
FFFD � REPLACEMENT CHARACTER
- used to replace an incoming character whose value
is unknown or unrepresentable in Unicode
In most cases it means that some data is being interpreted as a UTF encoding when it was actually encoded differently.
My problem was pasting text from Microsoft Office products into HTML or into a database. The largest offenders seem to be the em dash and smart quotes.
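If the goal is simply to get rid of the Office punctuation once the text is known to be valid UTF-8, a lookup table is usually enough. This is a minimal sketch, not the fix from either thread: `ascii_punctuation()` is a hypothetical name, and the replacements mirror the ones in the asker's JavaScript list (`-`, `--`, `...` and plain quotes).

```php
<?php
// Hypothetical helper: swap common "smart" punctuation for ASCII.
// ASSUMPTION: $utf8 is already valid UTF-8.
function ascii_punctuation(string $utf8): string
{
    return strtr($utf8, [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // horizontal ellipsis
        "\u{00A0}" => ' ',   // no-break space
    ]);
}
```

Because `strtr()` with an array matches whole byte sequences, it never splits a multibyte character the way a byte-oriented `chr(160)` replacement can.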

Related

Does PHP have a function for XML-safe entity decoding? Something like xml_entity_decode?

THE PROBLEM: I need an XML file that is "fully encoded" in UTF-8; that is, with no entities representing symbols and all symbols encoded as UTF-8 characters, except the 3 that are XML-reserved: "&" (amp), "<" (lt) and ">" (gt). I also need a built-in function that does this fast: transform entities into real UTF-8 characters (without corrupting my XML).
PS: it is a "real world problem" (!); PMC/journals, for example, has 2.8 MILLION scientific articles encoded with a special XML DTD (also known as the JATS format)... To process them as "usual XML-UTF8-text" we need to convert numeric entities to UTF-8 characters.
THE ATTEMPTED SOLUTION: the natural function for this task is html_entity_decode, but it destroys the XML code (!) by also transforming the 3 XML-reserved symbols.
Illustrating the problem
Suppose
$xmlFrag ='<p>Hello world! &#160;&#160;&#160;Let A&lt;B and A=&#x222C;dxdy</p>';
Where the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF-8, and the XML-reserved lt must not. After the transformation, the XML text will be:
$xmlFrag = '<p>Hello world!    Let A&lt;B and A=∬dxdy</p>';
The text "A<B" needs an XML-reserved character, so it MUST stay as A&lt;B.
Failed attempts
I tried to use html_entity_decode to solve the problem directly... So I updated my PHP to v5.5 to try the ENT_XML1 option,
$s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
// as I expected
Perhaps another question is, "WHY is there no other option that does what I expected?" It is important for many other XML applications (!), not only for me.
I don't need a workaround as an answer... OK, I'll show my ugly function; perhaps it helps you understand the problem:
function xml_entity_decode($s) {
// here an illustration (by user-defined function)
// about how the hypothetical PHP-build-in-function MUST work
static $XENTITIES = array('&amp;','&gt;','&lt;');
static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
$s = str_replace($XENTITIES,$XSAFENTITIES,$s);
//$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any php version
$s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
$s = str_replace($XSAFENTITIES,$XENTITIES,$s);
return $s;
} // you see? no need for a benchmark:
// it is not as fast as calling html_entity_decode directly; an
// XML-safe option would be ideal.
PS: corrected after this answer. The ENT_HTML5 flag is required to really convert all named entities.
From time to time this question attracts a "false answer" (see answers). This is perhaps because people do not pay attention, and because there is NO ANSWER: PHP lacks a built-in solution.
... So, let me repeat my workaround (which is NOT an answer!) to avoid creating more confusion:
The best workaround
Pay attention:
The function xml_entity_decode() below is the best workaround (over any other).
The function below is not an answer to the present question; it is only a workaround.
function xml_entity_decode($s) {
// illustrating how a (hypothetical) PHP-build-in-function MUST work
static $XENTITIES = array('&amp;','&gt;','&lt;');
static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
$s = str_replace($XENTITIES,$XSAFENTITIES,$s);
$s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
$s = str_replace($XSAFENTITIES,$XENTITIES,$s);
return $s;
}
To test and demonstrate that you have a better solution, please test first with this simple benchmark:
$countBchMk_MAX=1000;
$xml = file_get_contents('sample1.xml'); // BIG and complex XML string
$start_time = microtime(TRUE);
for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){
$A = xml_entity_decode($xml); // 0.0002
/* 0.0014
$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$A = $doc->saveXML();
*/
}
$end_time = microtime(TRUE);
echo "\n<h1>END $countBchMk_MAX BENCHMARKs WITH ",
($end_time - $start_time)/$countBchMk_MAX,
" seconds</h1>";
Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving:
$doc = new DOMDocument;
$doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$doc->save($outputFile);
I had the same problem because someone used HTML templates to create XML instead of using SimpleXML. Sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert #_x_amp#; to &amp;, however unlikely its presence in the source XML.
Note: I'm assuming the default encoding is UTF-8.
// Search for named entities (strings like "&abc1;").
echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&amp;"
// will remain "&amp;" whereas "&euro;" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
/* <Foo>€&amp;foo Ç</Foo> */
Also, if you want to replace special characters with numbered entities (in case you don't want UTF-8 XML), you can easily add a function call to the above code:
// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&amp;"
// will remain "&amp;" whereas "&euro;" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
/* <Foo>&#8364;&amp;foo &#199;</Foo> */
In your case you want it the other way around. Decode numbered entities to UTF-8:
// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&amp;"
// will remain "&amp;" whereas "&euro;" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
// Decodes (uncaught) numbered entities to UTF-8.
echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
/* <Foo>€&amp;foo Ç</Foo> */
Benchmark
I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
Your method
php -r '$q=["&",">","<"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>€&foo Ç é #_x_amp#; ∬</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é & ∬</Foo>
=====
Time taken: 2.0397531986237
My method
php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>€&foo Ç é #_x_amp#; ∬</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 4.045273065567
My method (with unicode to numbered entity):
php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>€&foo Ç é #_x_amp#; ∬</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.4407880306244
My method (with numbered entity to unicode):
php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>€&foo Ç é #_x_amp#;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.5400078296661
public function entity_decode($str, $charset = NULL)
{
if (strpos($str, '&') === FALSE)
{
return $str;
}
static $_entities;
isset($charset) OR $charset = $this->charset;
$flag = is_php('5.4')
? ENT_COMPAT | ENT_HTML5
: ENT_COMPAT;
do
{
$str_compare = $str;
// Decode standard entities, avoiding false positives
if ($c = preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches))
{
if ( ! isset($_entities))
{
$_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));
// If we're not on PHP 5.4+, add the possibly dangerous HTML 5
// entities to the array manually
if ($flag === ENT_COMPAT)
{
$_entities[':'] = '&colon;';
$_entities['('] = '&lpar;';
$_entities[')'] = '&rpar;';
$_entities["\n"] = '&newline;';
$_entities["\t"] = '&tab;';
}
}
$replace = array();
$matches = array_unique(array_map('strtolower', $matches[0]));
for ($i = 0; $i < $c; $i++)
{
if (($char = array_search($matches[$i].';', $_entities, TRUE)) !== FALSE)
{
$replace[$matches[$i]] = $char;
}
}
$str = str_ireplace(array_keys($replace), array_values($replace), $str);
}
// Decode numeric & UTF16 two byte entities
$str = html_entity_decode(
preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;]))|(?:0*\d{2,4}(?![0-9;])))/iS', '$1;', $str),
$flag,
$charset
);
}
while ($str_compare !== $str);
return $str;
}
For those coming here because a numeric entity in the range 128 to 159 remains a numeric entity instead of being converted to a character:
echo xml_entity_decode('&#128;');
// Outputs &#128; instead of the expected €
This depends on the PHP version (at least for PHP >= 5.6 the entity remains) and on the affected characters. The reason is that code points 128 to 159 are non-printable C1 control characters in Unicode. This can happen if the data to be converted mixes in windows-1252 content (where &#128; is meant as the € sign).
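If you know those entities really came from windows-1252 content, one option is to map just that range yourself before (or after) the general decode. The sketch below is an assumption-laden illustration, not part of xml_entity_decode(): `decode_c1_entities()` is a hypothetical name, it only handles decimal entities, and it requires mbstring. Bytes that windows-1252 leaves unassigned (129, 141, 143, 144, 157) are passed to mb_convert_encoding() as well, and what it returns for them may vary, so check those separately if they occur in your data.

```php
<?php
// Hypothetical helper: treat &#128; .. &#159; as Windows-1252 bytes
// and replace them with the matching UTF-8 characters.
// ASSUMPTION: the source really was Windows-1252; other entities are left alone.
function decode_c1_entities(string $s): string
{
    return preg_replace_callback(
        '/&#(12[89]|1[3-5][0-9]);/', // decimal 128..159 only
        static function (array $m): string {
            // chr() builds the single legacy byte, which is then converted.
            return mb_convert_encoding(chr((int) $m[1]), 'UTF-8', 'Windows-1252');
        },
        $s
    );
}

echo decode_c1_entities('&#128;5 &#147;ok&#148;'), "\n";
```

Entities outside the range, such as `&#160;`, are not matched by the pattern and stay exactly as they were.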
Try this function:
function xmlsafe($s,$intoQuotes=1) {
if ($intoQuotes)
return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
else
return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), html_entity_decode($s));
}
Example usage:
echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>';
See also: https://stackoverflow.com/a/9446666/2312709
This code has been used in production and seems to cause no problems with UTF-8.

How to Convert HTML Codes to the Corresponding Unicode Characters

I have googled a lot and explored this forum too, but this is my second day and I could not find a solution.
My problem is that I want to convert the HTML codes
&#1576;&#1575;&#1582;
to their equivalent Unicode characters
خ ا ب
Actually, I do not want to convert all HTML entities to Unicode characters. I only want to convert the Arabic / Urdu HTML codes. The range of these characters is from &#1563; to &#1785;. If there is no PHP function for this, how can I replace the codes with their equivalent Unicode characters in one go?
I think you're looking for:
html_entity_decode('&#1576;&#1575;&#1582;', ENT_QUOTES, 'UTF-8');
When you go from &#1576; to ب, that's called decoding. Doing the opposite is called encoding.
As for replacing only characters from &#1563; to &#1785;, maybe try something like this.
<?php
// Random set of entities, two are outside the 1563 - 1785 range.
$entities = '&#1563;&#1564;&#60;&#1604;&#241;&#1784;&#1785;';
// Matches entities from 1500 to 1799, not perfect, I know.
preg_match_all('/&#1[5-7][0-9]{2};/', $entities, $matches);
$entityRegex = array(); // Will hold the entity code regular expression.
$decodedCharacters = array(); // Will hold the decoded characters.
foreach ($matches[0] as $entity)
{
// Convert the entity to human-readable character.
$unicodeCharacter = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
array_push($entityRegex, "/$entity/");
array_push($decodedCharacters, $unicodeCharacter);
}
// Replace all of the matched entities with the human-readable character.
$replaced = preg_replace($entityRegex, $decodedCharacters, $entities);
?>
That's as close as I can get to solving this. Hopefully, this helps a little. It's 5:00am where I am now, so I'm off to sleep! :)
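Another way to restrict decoding to a code point range is mb_decode_numericentity(), whose conversion map takes a start and end code point. This is a sketch under the assumption that the mbstring extension is available; `decode_arabic_entities()` is a hypothetical wrapper name, and the range 1563 to 1785 is the one given in the question.

```php
<?php
// Hypothetical wrapper: decode only numeric entities for U+061B..U+06F9
// (decimal 1563..1785) and leave every other entity untouched.
function decode_arabic_entities(string $s): string
{
    // convmap = [first code point, last code point, offset, mask]
    $convmap = [1563, 1785, 0, 0xFFFF];
    return mb_decode_numericentity($s, $convmap, 'UTF-8');
}

// &#241; (ñ) and &lt; fall outside the range and are kept as entities.
echo decode_arabic_entities('&#1576;&#1575;&#1582; &#241; &lt;'), "\n";
```

Unlike the regular-expression approach, the range here is exact, so there is no need to approximate it with digit patterns such as 1500 to 1799.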
Did you try declaring the UTF-8 encoding in the HTML head?
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
Try this:
<?php
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
foreach($trans_tbl as $k => $v)
{
$ttr[$v] = utf8_encode($k);
}
$text = 'بب....;&#1582';
$text = strtr($text, $ttr);
echo $text;
?>
For a MySQL solution, you can set the connection character set like this:
$mysqli = new mysqli($host, $user, $pass, $db);
if (!$mysqli->set_charset("utf8")) {
die("error");
}

How to properly write Unicode words to an image using the PHP imagettftext() function

I am trying to write some Urdu text on an image using PHP's imagettftext() function. It does not display the characters unless I convert the text using the following code:
function convert($text){
$out="";
mb_language('uni');
mb_internal_encoding('UTF-8');
$text = mb_convert_encoding($text, 'HTML-ENTITIES',"UTF-8");
$text = html_entity_decode($text,ENT_NOQUOTES, "ISO-8859-1");
for($i = 0; $i < strlen($text); $i++) {
$letter = $text[$i];
$num = ord($letter);
if($num>127) {
$out .= "&#$num;";
} else {
$out .= $letter;
}
}
return $out;
}
Now the text, e.g. عچں (which contains the three characters ع چ ں), is printed onto the image as separate, full characters instead of the characters being cut and joined to form an Urdu word like عچں.
I have used the characters ا ب ت ث with codes U+0627, U+0628, U+062A and so on from this page http://en.wikipedia.org/wiki/List_of_Unicode_characters#Arabic
I have shared the code here: https://code.google.com/p/urdu-captcha/downloads/list
Note: I have added spaces between the characters in the code provided; removing them makes no difference to how the text is displayed on the image.
How do I make it write the characters joined together to form proper words?
You'll need an additional library to perform Arabic glyph joining. Check out AR-PHP.

How to convert UTF8 characters to numeric character entities in PHP

Is a translation of the code below at all possible in PHP?
The code below is written in JavaScript. It returns HTML with numeric character references where needed, e.g. smslån -> smsl&#229;n
I have been unsuccessful at creating a translation. This script looked like it might work, but it does not return &#229; for å as the JavaScript below does.
function toEntity() {
var aa = document.form.utf.value;
var bb = '';
for(i=0; i<aa.length; i++)
{
if(aa.charCodeAt(i)>127)
{
bb += '&#' + aa.charCodeAt(i) + ';';
}
else
{
bb += aa.charAt(i);
}
}
document.form.entity.value = bb;
}
PHP's ord function sounds like it does the same thing as charCodeAt, but it does not. I get 195 for å using ord and 229 using charCodeAt. That, or I am having some incredibly difficult encoding problems.
Use mb_encode_numericentity:
$convmap = array(0x80, 0xffff, 0, 0xffff);
echo mb_encode_numericentity($utf8Str, $convmap, 'UTF-8');
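To see why ord() and charCodeAt() disagree, and what the call above produces for the example word, here is a small sketch. It assumes the string is UTF-8 and that the mbstring extension is loaded (mb_ord() needs PHP 7.2 or later); the variable names are only for illustration.

```php
<?php
$word = "smsl\u{E5}n"; // "smslån" encoded as UTF-8

// ord() reads a single byte; mb_ord() reads a whole code point.
$firstByte = ord("\u{E5}");             // first UTF-8 byte of å
$codePoint = mb_ord("\u{E5}", 'UTF-8'); // the Unicode code point of å

// Encode every code point from U+0080 to U+FFFF as a decimal entity.
$convmap  = [0x80, 0xffff, 0, 0xffff];
$entities = mb_encode_numericentity($word, $convmap, 'UTF-8');

echo $firstByte, ' ', $codePoint, ' ', $entities, "\n";
```

In UTF-8, å is stored as the two bytes 0xC3 0xA5, so ord() reports 195 while mb_ord() reports 229, which matches what charCodeAt() gives for characters in the Basic Multilingual Plane. With this conversion map, characters above U+FFFF are left unencoded.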

How to escape Chinese Unicode characters in URL?

I have Chinese users of my PHP web application who enter products into our system. The information they're entering is, for example, a product title and price.
We would like to use the product title to generate a nice URL slug for those products.
It seems we cannot just use Chinese characters in HREF attributes.
Does anyone know how to handle a title like “婴儿服饰” so that we can generate a clean URL like http://www.site.com/婴儿服饰 ?
Everything works fine for “normal” languages, but languages that need multibyte UTF-8 characters give us problems.
Also, when generating the clean URL, we want to keep SEO in mind, but I have no experience with Chinese in that regard.
If your string is already UTF-8, just use rawurlencode to encode the string properly:
$path = '婴儿服饰';
$url = 'http://example.com/'.rawurlencode($path);
UTF-8 is the preferred character encoding for non-ASCII characters (although only ASCII characters are allowed in URIs which is why you need to use the percent-encoding). The result is the same as in tchrist’s example:
http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0
This code, which uses the CPAN module, URI::Escape:
#!/usr/bin/env perl
use v5.10;
use utf8;
use URI::Escape qw(uri_escape_utf8);
my $url = "http://www.site.com/";
my $path = "婴儿服饰";
say $url, uri_escape_utf8($path);
when run, prints:
http://www.site.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0
Is that what you're looking for?
BTW, those four characters are:
CJK UNIFIED IDEOGRAPH-5A74
CJK UNIFIED IDEOGRAPH-513F
CJK UNIFIED IDEOGRAPH-670D
CJK UNIFIED IDEOGRAPH-9970
Which, according to the Unicode::Unihan database, seems to be yīng ér fú shì, or perhaps just ying er fú shi per Lingua::ZH::Romanize::Pinyin. And maybe even jing¹ jan⁴ fuk⁶ sik¹ or jing˥ jan˨˩ fuk˨ sik˥, using the Cantonese version from Unicode::Unihan.
Use the encoded URL as the href attribute of the link, and keep the original characters as the content of the link.
That way you have a safe URL and the page stays SEO friendly.
// Safely convert url like "http://example.com/婴儿服饰" to valid encoded string
// => http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0
// KEY: a multibyte character occupies more than one byte
function autoEncodeMultibyteChars($url) {
$encoding = 'UTF-8';
$mbLen = mb_strlen($url, $encoding);
$append = '';
for ($idx = 0; $idx < $mbLen; $idx++) {
$char = mb_substr($url, $idx, 1, $encoding);
if (strlen($char) > 1) { // multibyte char
$append .= rawurlencode($char);
} else {
$append .= $char;
}
}
return $append;
}
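Putting the two pieces of advice together (percent-encoded href, original characters as the link text) could look like the sketch below. `product_link()` is a hypothetical helper, not something from the answers above, and it assumes the title is a single path segment in UTF-8.

```php
<?php
// Hypothetical helper: build an anchor whose href is percent-encoded
// while the visible link text keeps the original characters.
// ASSUMPTION: $title is UTF-8 and forms exactly one path segment.
function product_link(string $baseUrl, string $title): string
{
    $href = rtrim($baseUrl, '/') . '/' . rawurlencode($title);

    // Escape both values for HTML output; the percent-encoded URL
    // contains only ASCII, so only the link text can change here.
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">'
        . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . '</a>';
}

echo product_link('http://example.com/', '婴儿服饰'), "\n";
```

Since rawurlencode() is applied to the whole title, characters such as `/` or `?` inside a title are encoded too, which keeps them from being read as URL structure.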
