How do I stop htmlPurifier from automatically decoding html entities?

I have a strange issue. I use CKEditor 4 to collect formatted text from users as HTML. The HTML is then filtered on the server with HTML Purifier.
When the user types quotes like ”, ’ and “, CKEditor converts them into HTML entities like &rdquo;, &rsquo; and &ldquo;, which is fine. The issue is that when I filter the content with HTML Purifier, these entities get automatically decoded. That prevents the content from being presented to the user for later editing, because the quotes end up garbled in strange ways, like â€œ
How do I fix this? I think that if I could stop HTML Purifier from automatically decoding entities, this would work, but I am new to HTML Purifier and can't find a way.
I have tried running htmlentities() before passing the content to HTML Purifier, but that encodes the whole HTML, so HTML Purifier no longer purifies anything.

After CBroe's comment, I found out that my application is not using UTF-8 all the way through, and I can't fix that either.
For those in a similar situation, I found a workaround: HTML Purifier has a configuration option that escapes all non-ASCII characters, with some trade-offs. That is fine in my case (I think).
You can enable the HTML Purifier config Core.EscapeNonASCIICharacters like so:
$config->set('Core.EscapeNonASCIICharacters', true);
This did the trick for me.
This is the full function:
/**
 * Purifies dirty html
 *
 * @param string $dirty_html
 * @return string
 */
function purifyHtml($dirty_html)
{
    $config = HTMLPurifier_Config::createDefault();
    $config->set('Core.Encoding', 'UTF-8');
    $config->set('Core.EscapeNonASCIICharacters', true);
    $config->set('HTML.Doctype', 'HTML 4.01 Transitional');
    $config->set('Cache.SerializerPath', getStoragePath('cache/html-purifier'));
    $htmlPurifier = new HTMLPurifier($config);
    return $htmlPurifier->purify($dirty_html);
}
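The garbled â€œ from the question is what a UTF-8 encoded “ looks like when its bytes are read as Windows-1252, which is why escaping non-ASCII characters sidesteps the problem. Here is a minimal sketch of that effect in plain PHP, without HTML Purifier. It assumes the mbstring extension is available and that the misreading layer uses Windows-1252; the question does not say which charset is actually involved.

```php
<?php
// U+201C LEFT DOUBLE QUOTATION MARK, stored as the UTF-8 bytes E2 80 9C.
$quote = "\u{201C}";

// Simulate a layer that wrongly treats those bytes as Windows-1252
// and re-encodes them to UTF-8 (assumed charset, for illustration only).
$garbled = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');

// A numeric entity is pure ASCII, so the same misreading leaves it intact.
$entity = '&#8220;';
$entityAfter = mb_convert_encoding($entity, 'UTF-8', 'Windows-1252');

echo $garbled, "\n";     // the three characters â, € and œ
echo $entityAfter, "\n"; // still &#8220;
```

Because the escaped form contains only ASCII bytes, it survives storage and transport layers that disagree about the charset.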

Related

converting special characters in HTML into the appropriate coding for PHP

I am making a website where the user fills out a form and it creates a PDF. The user will be able to enter diacritics and special characters. Because of the way I am sending the characters to PHP, they arrive as HTML entities, e.g. &agrave;. I need to convert these into the actual characters so that the PDF maker we use outputs the diacritic character and not the HTML code for it.
I wrote a test to try this out but I haven't been able to figure it out. If I have to, I will end up writing an array of every possible character they can use and translating the incoming string, but I am trying to find an easier solution.
Here is the code of my test:
$title = "Test of Title for use With This Project and it should also wrap because it is s&ograve; long! Acutally it is even longer than previously expected!";
$ti = htmlspecialchars_decode($title);
I have been attempting to use htmlspecialchars_decode() to convert it, but it still comes out as &ograve; and not ò. Is there an easy way to do this?

See the documentation which tells you it won't touch most of the characters you care about and to use html_entity_decode instead.

Use the html_entity_decode function instead of htmlspecialchars_decode (which only decodes the special HTML characters &amp;, &quot;, &lt; and &gt;, not all entities).
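A short sketch of the difference between the two functions, using a shortened version of the title from the question. The flags and the UTF-8 target charset are assumptions about what the PDF library expects.

```php
<?php
$title = 'it is s&ograve; long!';

// Only decodes &amp;, &quot;, &lt; and &gt; (and single quotes with ENT_QUOTES),
// so &ograve; is left untouched.
$special = htmlspecialchars_decode($title);

// Decodes all named and numeric entities into characters of the given charset.
$decoded = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $special, "\n"; // it is s&ograve; long!
echo $decoded, "\n"; // it is sò long!
```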

HTML to plaintext - unknown original encoding

I'm working with PHP, getting html from websites, converting them to plain text and saving them to the database.
They need to be saved to the database in utf-8.
My first problem is that I don't know the original encoding. What's the best way to convert to UTF-8 from an unknown encoding?
The second issue is the HTML to plain text conversion. I tried using html2text but it messed up all the foreign UTF characters.
What is the best approach?
Edit: It seems the part about plain text is not clear enough. What I need is not just to strip the HTML tags. I want to strip the tags while maintaining a kind of document structure: <p> and <li> tags would convert to line breaks, etc., and tags like <script> would be completely removed along with their content.

Use mb_detect_encoding() for encoding detection.
Use strip_tags() to get rid of HTML tags.
The rest, such as formatting the output, depends on your needs.
Edit: I don't know if a complete solution exists, but this link is really helpful for improving existing HTML-to-text PHP scripts on your own.
http://www.phpwact.org/php/i18n/utf-8
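As a rough sketch of the structure-preserving conversion the question asks for, the steps can be combined as below. The function name htmlToText is hypothetical, the list of block-level tags is only a starting point, and the input is assumed to be UTF-8 already. A regex-based approach like this is fragile on malformed markup, so treat it as an illustration rather than a complete solution.

```php
<?php
// Hypothetical helper: strips tags but keeps paragraph and list items on separate lines.
function htmlToText(string $html): string
{
    // Drop <script> and <style> elements together with their content.
    $text = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', '', $html);
    // Turn <br> and the end of common block-level elements into line breaks.
    $text = preg_replace('~<br\s*/?>|</(p|li|div|h[1-6])>~i', "\n", $text);
    // Remove all remaining tags.
    $text = strip_tags($text);
    // Decode entities such as &amp; or &eacute; into UTF-8 characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of line breaks and trim the ends.
    $text = preg_replace("~\n{2,}~", "\n", $text);
    return trim($text);
}

$html = '<p>Caf&eacute;</p><script>alert(1)</script><ul><li>one</li><li>two</li></ul>';
$plain = htmlToText($html);
echo $plain, "\n";
```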

This function may be useful to you:
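<?php
function FixEncoding($x){
    // Leave strings that are already valid UTF-8 untouched.
    if(mb_check_encoding($x, 'UTF-8')){
        return $x;
    }else{
        // Assumes the input is ISO-8859-1, as utf8_encode() did (deprecated since PHP 8.2).
        return mb_convert_encoding($x, 'UTF-8', 'ISO-8859-1');
    }
}
?>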

Converting diacritics to numerical HTML code with HTML Purifier

I'm having trouble finding the correct setting for HTML Purifier 4.3.0 to convert diacritics to numerical HTML code. Is this possible using this library?
So, from încă to &#238;nc&#259;.

As you can see in the demo, by default the answer is no. There isn't a clear description of what it does and doesn't do, but to me HTML Purifier looks like it's meant to strip HTML tags from input.
I think you're better off using htmlentities().

If you're working in UTF-8 mode, as HTML Purifier does by default, there's no need to escape character entities. If you tell HTML Purifier that you're working in ASCII mode, it will do so for you.
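If numeric entities are needed outside HTML Purifier, plain PHP can produce them. This is a minimal sketch that swaps in mb_encode_numericentity() from the mbstring extension; it is not an HTML Purifier feature, and the conversion map below (everything from U+0080 upward) is an assumption about what should be escaped.

```php
<?php
// "încă" written with escapes so the example does not depend on the file's encoding.
$text = "\u{EE}nc\u{103}";

// Conversion map: [first code point, last code point, offset, mask].
// Here every character from U+0080 to U+10FFFF becomes a decimal entity.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$escaped = mb_encode_numericentity($text, $convmap, 'UTF-8');
echo $escaped, "\n"; // &#238;nc&#259;
```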

htmlentities() makes Chinese characters unusable

We have a web application where we allow users to enter their own HTML in a text area. We save that data to our database.
When we load the HTML data into the text area, of course, we run htmlentities() on it first. Otherwise users could save </textarea> inside the textarea and our application would break when loading that into the textarea.
This works great, except when entering Chinese characters (and probably other languages such as Arabic or Japanese).
The htmlentities() makes the chinese text unusable like this: Ã�Â¨Ã�Â³Ã�Â¼Ã�Â§Ã¯
When I remove the htmlentities() call before loading the entered HTML into the text area, Chinese characters show up just fine, but then we have the problem of HTML interfering with our textarea, especially when a user enters </textarea> inside the text area.
I hope that makes sense.
Does anyone know how we can safely and correctly allow languages such as Chinese, Japanese, ... to be used inside our text area, while still being safe for loading any html inside our text area?

Have you tried using htmlspecialchars?
I currently use that in production and it's fine.
$foo = "我的名字叫萨沙";
echo '<textarea>' . htmlspecialchars($foo) . '</textarea>';
Alternately,
$str = "&#20320;&#22909;";
echo mb_convert_encoding($str, 'UTF-8', 'HTML-ENTITIES');
As found on http://www.techiecorner.com/129/php-how-to-convert-iso-character-htmlentities-to-utf-8/

Specify charset, e.g. UTF-8 and it should work.
echo htmlentities($data, ENT_COMPAT, 'UTF-8');
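For the textarea case specifically, a small sketch of escaping with an explicit charset. The sample input is hypothetical: Chinese text followed by a closing tag that would otherwise end the textarea early. htmlspecialchars() is used here because only the markup characters need escaping.

```php
<?php
// Hypothetical user input: "你好" plus markup that would break out of the textarea.
$data = "\u{4F60}\u{597D}</textarea><b>hi</b>";

// Escape only &, <, > and quotes, and tell PHP the input is UTF-8
// so multibyte characters pass through unchanged.
$safe = htmlspecialchars($data, ENT_QUOTES, 'UTF-8');

echo '<textarea>' . $safe . '</textarea>', "\n";
```

The Chinese characters are emitted as-is, while the embedded tags arrive in the browser as harmless text.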

PHP is pretty appalling in terms of framework-wide support for international character sets (although it's slowly getting better, especially in PHP5, but you don't specify which version you're using). There are a few mb_ (multibyte, as in multibyte characters) functions to help you out, though.
This example may help you:
<?php
/**
 * Multibyte equivalent for htmlentities() [lite version :)]
 *
 * @param string $str
 * @param string $encoding
 * @return string
 **/
function mb_htmlentities($str, $encoding = 'utf-8') {
    mb_regex_encoding($encoding);
    $pattern = array('<', '>', '"', '\'');
    $replacement = array('&lt;', '&gt;', '&quot;', '&#39;');
    for ($i = 0; $i < sizeof($pattern); $i++) {
        $str = mb_ereg_replace($pattern[$i], $replacement[$i], $str);
    }
    return $str;
}
?>
Also, make sure your page is specifying the same character set. You can do this with a meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Most likely you're not using the correct encoding. If you already know your output encoding, use the charset argument of the htmlentities() function.
If you haven't settled on an internal encoding yet, take a look at the iconv functions; iconv_set_encoding("internal_encoding", "UTF-8"); might be a good start.

XML character encoding issues with accents

I have had the problem a few times now while working on projects and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from Twitter and uploading them to my DB. However, when I output them to the screen I get these characters:
"moved to dusseldorf.â��"
OR
tambiÃ©n
And if I have Russian characters, I get lots of ugly boxes in their place.
What I would like is for the correct native accents to show under one encoding. I thought this was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities but I really appreciate a solution that works across the board on projects.
Thanks in advance.

Replace this line:
$data = htmlentities($data);
With this:
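$data = htmlentities($data, ENT_COMPAT, "UTF-8");
That way, htmlentities() reads the input as UTF-8 instead of treating each byte as a separate ISO-8859-1 character, so multibyte characters are no longer mangled. For more information see the documentation for htmlentities().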
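A short sketch of what the charset argument changes, using the word from the question. The ISO-8859-1 call is there only to reproduce the broken output.

```php
<?php
// "también" as UTF-8; the é is stored as the two bytes C3 A9.
$data = "tambi\u{E9}n";

// Wrong charset: each byte is treated as its own ISO-8859-1 character.
$wrong = htmlentities($data, ENT_COMPAT, 'ISO-8859-1');

// Correct charset: the two bytes are recognised as a single é.
$right = htmlentities($data, ENT_COMPAT, 'UTF-8');

echo $wrong, "\n"; // tambi&Atilde;&copy;n
echo $right, "\n"; // tambi&eacute;n
```

The first result renders in a browser as tambiÃ©n, which matches the garbled output described in the question.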

You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data in an HTML context.

Make sure that you set your PHP internal encoding to UTF-8 using iconv_set_encoding, and that you call htmlentities with the encoding information as EdoDodo said. Also make sure that your database stores data with UTF-8 encoding, though you say that's already the case.

You can't use htmlentities() in its default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entity codes that are available by default in XML are &lt;, &gt;, &amp;, &quot; and &apos;. All other entities need to be presented using their numeric entity.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
Hope that helps.
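On PHP 5.4 and later, one alternative is htmlspecialchars() with the ENT_XML1 flag, which escapes only the characters XML itself treats as markup and leaves accented characters as raw UTF-8. This is a minimal sketch and not the philsXMLClean() function mentioned above; the sample string is made up.

```php
<?php
// Hypothetical tweet text containing markup characters and an accent.
$tweet = "Tom & \"Jerry\" <tambi\u{E9}n>";

// ENT_XML1 restricts the output to entities that XML predefines;
// non-ASCII characters such as é are passed through unchanged.
$xmlSafe = htmlspecialchars($tweet, ENT_XML1 | ENT_QUOTES, 'UTF-8');

echo $xmlSafe, "\n"; // Tom &amp; &quot;Jerry&quot; &lt;también&gt;
```

This only works across the board if the XML document itself is declared and served as UTF-8.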
