Converting diacritics to numerical HTML code with HTML Purifier - php

I'm having trouble finding the correct setting for HTML Purifier 4.3.0 to convert diacritics to numerical HTML code. Is this possible using this library?
So, from încă to &#238;nc&#259;.

As you can see in the demo, the answer is no, at least not by default. There isn't a clear description of what it does and doesn't do, but to me HTML Purifier looks like it's meant to strip HTML tags from input.
I think you're better off using htmlentities().
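Note that htmlentities() emits named entities where it knows one, not numeric codes. If numeric codes are specifically required, one option outside HTML Purifier is mbstring's mb_encode_numericentity(). A minimal sketch, assuming the mbstring extension is available and the input is UTF-8; the function name toNumericEntities is made up for illustration:

```php
<?php
// Hypothetical helper: encode every non-ASCII character as a decimal
// numeric entity and leave ASCII (including HTML tags) untouched.
function toNumericEntities(string $text): string
{
    // Convmap format: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($text, $convmap, 'UTF-8');
}

echo toNumericEntities("<p>\u{00EE}nc\u{0103}</p>"), "\n";
```

Because only code points from 0x80 upward are mapped, markup such as <p> passes through unchanged.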

If you're working in UTF-8 mode, as HTML Purifier does by default, there's no need to escape character entities. If you tell HTML Purifier that you're working in ASCII mode, it will do so for you.

Related

How do I stop htmlPurifier from automatically decoding html entities?

I have a strange issue. I use CKEditor 4 to collect formatted text from the user in the form of HTML. The HTML content is also filtered on the server using htmlpurifier.
When the user types quotes like ”, ’ and “, CKEditor converts them into HTML entities like &rdquo;, &rsquo; and &ldquo;, which is fine. The issue is that when I filter them using htmlpurifier, these quotes get automatically decoded. This prevents the content from being presented to the user for later editing, as the quotes end up literally encoded in strange ways like “
How do I fix this? I think that if I could stop htmlpurifier from automatically decoding things, this would work, but I am new to htmlpurifier, so I can't find a way.
I have tried using htmlentities before passing the content to htmlpurifier, but that encodes the whole HTML, which stops htmlpurifier from purifying anything at all.
After CBroe's comment, I found out that my application is not using UTF-8 all the way through,
and I can't fix that either. For those who are in a similar situation, I found a workaround. htmlPurifier supports a configuration option that encodes all non-ASCII characters, with some trade-offs. That's fine in my case (I think).
You can enable the htmlpurifier config Core.EscapeNonASCIICharacters like so:
$config->set('Core.EscapeNonASCIICharacters', true);
which did the trick for me.
This is the full function
/**
 * Purifies dirty html
 *
 * @param string $dirty_html
 * @return string
 */
function purifyHtml($dirty_html)
{
    $config = HTMLPurifier_Config::createDefault();
    $config->set('Core.Encoding', 'UTF-8');
    $config->set('Core.EscapeNonASCIICharacters', true);
    $config->set('HTML.Doctype', 'HTML 4.01 Transitional');
    $config->set('Cache.SerializerPath', getStoragePath('cache/html-purifier'));
    $htmlPurifier = new HTMLPurifier($config);
    return $htmlPurifier->purify($dirty_html);
}

using htmlentities with superglobal variables

I'm learning PHP from a book at the moment. The book says I should be careful when using superglobal variables, so it's better to wrap them in htmlentities like this:
$came_from = htmlentities($_SERVER['HTTP_REFERER']);
So I wrote this code:
<?php
$came_from=htmlentities($_SERVER['HTTP_REFERER']);
echo $came_from;
?>
However, the output of the code above was the same as without htmlentities(). It didn't change anything at all. I thought that it would change \ into something else. Did I use it wrong?
So, by default, htmlentities() encodes characters using ENT_COMPAT (converts double quotes and leaves single quotes alone) and ENT_HTML401. (That is the default before PHP 8.1; since 8.1 the default is ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401.) Seeing as the backslash isn't part of the HTML 4.01 entity spec (as far as I can see anyway), it won't be converted.
If you specify the ENT_HTML5 flag, you get a different result:
php > echo htmlentities('abc\123');
abc\123
php > echo htmlentities('abc\123', ENT_HTML5);
abc&bsol;123
This is because backslash is part of the HTML5 spec. See http://dev.w3.org/html5/html-author/charref
Sorry. My previous answer was absolutely wrong. I was confused with something else. My apologies. Let me rephrase my answer:
htmlentities will convert special characters into their HTML entities. "<", for example, will be converted to "&lt;". Your browser will automatically recognise this HTML entity and display it as "<", so you won't notice any difference in the rendered page.
The reason for this is to prevent problems when saving your document in something different than UTF-8 encoding. Any characters not encoded might become screwed up for this reason.
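The difference only shows up in the raw string (or in the page source), and only when the value contains characters that have entities. A minimal sketch, using a made-up referer value in place of $_SERVER['HTTP_REFERER']:

```php
<?php
// Hypothetical referer value containing markup; a real one is sent by the client.
$referer = 'http://example.com/?q=<script>alert("x")</script>';

// Pass the flags and charset explicitly so the result does not depend on the PHP version.
$escaped = htmlentities($referer, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
// http://example.com/?q=&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
```

An ordinary URL with no <, >, & or quote characters comes out unchanged, which would explain why the question's output looked identical.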

How to decode Cyrillic characters without touching html tags

I am fetching some content from remote sources and some of them output Cyrillic characters like this:
&#1065;&#1077;&#1088;&#1082;&#1072;&#1090;&#1072;
Browsers can read this just fine, but there are issues with some programs. After running this through PHP's html_entity_decode() I can get the correct characters and the text looks like this:
Щерката
The problem is that html_entity_decode() also decodes any HTML tags inside the string and I don't want them to be touched.
Is there any way of doing this without affecting the HTML tags?
var_dump(htmlspecialchars(html_entity_decode('&#1065;<b>')));
Gives me:
string(11) "Щ&lt;b&gt;"
(Double-)encode the &lt; and &gt; sequences first with a simple str_replace(), and then do the decode.
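A minimal sketch of that suggestion, assuming UTF-8 output is wanted; the function name decodeEntitiesKeepTags is made up for illustration. Literal tags such as <b> are not entities, so html_entity_decode() leaves them alone; the str_replace() step only protects text that was already escaped as &lt; or &gt;:

```php
<?php
// Hypothetical helper: decode character entities without turning
// escaped &lt; / &gt; sequences into real angle brackets.
function decodeEntitiesKeepTags(string $html): string
{
    // Double-encode the sequences that must survive the decode step.
    $protected = str_replace(['&lt;', '&gt;'], ['&amp;lt;', '&amp;gt;'], $html);

    // Decoding turns &amp;lt; back into &lt; and &#1065; into the actual character.
    return html_entity_decode($protected, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decodeEntitiesKeepTags('&#1065;<b>x</b> &lt;i&gt;'), "\n";
```

This does not cover numeric forms such as &#60;, which would still decode to <; those would need the same treatment if they can occur in the input.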

Print WYSIWYG content with PHP as HTML

How can I print html code that has been created by a WYSIWYG editor with PHP? When I print it with "echo", then it shows the html code on the website only, rather than interpreting it as html tags.
Thanks for your answer.
What you need is htmlspecialchars_decode
htmlspecialchars_decode — Convert special HTML entities back to characters
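A minimal sketch, assuming the editor content was stored in escaped form; the sample string is made up:

```php
<?php
// Hypothetical stored value: WYSIWYG output that was escaped before saving.
$stored = '&lt;p&gt;Hello &lt;strong&gt;world&lt;/strong&gt;&lt;/p&gt;';

// Convert &lt;, &gt;, &amp; and &quot; back to characters so the browser sees real tags.
$html = htmlspecialchars_decode($stored);

echo $html, "\n";
// <p>Hello <strong>world</strong></p>
```

Echoing decoded markup means the browser will interpret whatever tags it contains, so this is only appropriate for content that has been sanitized, for example with HTML Purifier.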

HTML to plaintext - unknown original encoding

I'm working with PHP, getting html from websites, converting them to plain text and saving them to the database.
They need to be saved to the database in utf-8.
My first problem is that I don't know the original encoding. What's the best way to convert to UTF-8 from an unknown encoding?
The second issue is the HTML to plain text conversion. I tried using html2text, but it messed up all the foreign UTF characters.
What is the best approach?
Edit: It seems the part about plain text is not clear enough. What I need is not just to strip the HTML tags. I want to strip the tags while maintaining a kind of document structure: <p> and <li> tags would convert to line breaks, etc., and tags like <script> would be completely removed along with their content.
Use mb_detect_encoding() for encoding detection.
Use strip_tags() to get rid of HTML tags.
The rest, such as formatting the output, depends on your needs.
Edit: I don't know if a complete solution exists but this link is really helpful to improve existing html to text PHP scripts on your own.
http://www.phpwact.org/php/i18n/utf-8
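For the structure-preserving conversion described in the question's edit, a rough sketch built on strip_tags() might look like the following. It assumes the input is already UTF-8, the function name htmlToText and the list of block tags are made up for illustration, and regular expressions are only an approximation of real HTML parsing:

```php
<?php
// Hypothetical helper: HTML to plain text, keeping line structure.
function htmlToText(string $html): string
{
    // Drop <script> and <style> elements together with their content.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);

    // Turn the end of common block elements and <br> into line breaks.
    $html = preg_replace('#</(p|li|div|h[1-6])>|<br\s*/?>#i', "\n", $html);

    // Remove the remaining tags, then decode entities into UTF-8 characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse blank lines and surrounding whitespace.
    return trim(preg_replace("/[ \t]*\n\s*/", "\n", $text));
}

echo htmlToText('<p>Hello</p><script>alert(1)</script><ul><li>One</li><li>Two &amp; three</li></ul>'), "\n";
```

For messy real-world pages, a DOM-based approach would likely be more robust than this.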
This function may be useful to you:
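<?php
// Return $x as UTF-8. Input that is not valid UTF-8 is assumed to be ISO-8859-1,
// which is the same assumption utf8_encode() makes.
function FixEncoding($x){
    if(mb_check_encoding($x, 'UTF-8')){
        return $x;
    }else{
        // utf8_encode() is deprecated as of PHP 8.2; this call is equivalent.
        return mb_convert_encoding($x, 'UTF-8', 'ISO-8859-1');
    }
}
?>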
