HTML to plaintext - unknown original encoding - php

I'm working with PHP, getting html from websites, converting them to plain text and saving them to the database.
They need to be saved to the database in utf-8.
My first problem is that I don't know the original encoding. What's the best way to convert to UTF-8 from an unknown encoding?
The second issue is the HTML to plain text conversion. I tried using html2text but it messed up all the foreign UTF-8 characters.
What is the best approach?
Edit: It seems the part about plain text is not clear enough. What I need is not just to strip the HTML tags. I want to strip the tags while maintaining a kind of document structure: <p> and <li> tags would convert to line breaks, etc., and tags like <script> would be completely removed along with their content.

Use mb_detect_encoding() for encoding detection.
Use strip_tags() to get rid of HTML tags.
The rest, such as formatting the output, depends on your needs.
Edit: I don't know if a complete solution exists, but this link is really helpful for improving existing HTML to text PHP scripts on your own.
http://www.phpwact.org/php/i18n/utf-8
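For the structure-preserving conversion described in the question's edit, here is a minimal sketch using DOMDocument rather than strip_tags(). The function name htmlToText and the list of line-breaking tags are my own assumptions, and it expects input that has already been converted to UTF-8. Non-ASCII characters are turned into numeric entities before parsing so that DOMDocument cannot misread the bytes.

```php
<?php
// Hypothetical helper (not part of PHP or html2text): HTML in, plain text out.
// Assumes $html is valid UTF-8 and that the dom and mbstring extensions are loaded.
function htmlToText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    // Turn every non-ASCII character into a numeric entity so the parser
    // only ever sees ASCII and cannot guess the wrong charset.
    $html = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $previous = libxml_use_internal_errors(true); // keep sloppy markup from raising warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Remove <script> and <style> together with their content.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Put a line break after every element that should end a line.
    $breaking = '//p | //li | //br | //div | //tr | //h1 | //h2 | //h3';
    foreach (iterator_to_array($xpath->query($breaking)) as $node) {
        $node->parentNode->insertBefore($doc->createTextNode("\n"), $node->nextSibling);
    }

    $root = $doc->documentElement;
    $text = $root === null ? '' : $root->textContent;

    // Trim each line and drop the empty ones.
    $lines = array_map('trim', explode("\n", $text));
    return implode("\n", array_filter($lines, 'strlen'));
}
```

Extending the XPath list (for example with //blockquote) is the obvious place to tune which tags produce line breaks.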

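This function may be useful to you:
<?php
// Return $x unchanged if it is already valid UTF-8; otherwise assume
// ISO-8859-1 (the same assumption utf8_encode() makes) and convert it.
function FixEncoding($x){
    if(mb_check_encoding($x, 'UTF-8')){
        return $x;
    }else{
        // utf8_encode() is deprecated as of PHP 8.2
        return mb_convert_encoding($x, 'UTF-8', 'ISO-8859-1');
    }
}
?>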
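If the source is not always Latin-1, a slightly more defensive variant of the same idea is sketched below. The name toUtf8 and the candidate list are assumptions of mine, so adjust the list to the encodings you actually expect. Keep in mind that byte-level detection is only a guess; a charset declared in the HTTP Content-Type header or in a meta tag is usually more trustworthy when it is available.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function toUtf8(string $bytes, array $candidates = ['Windows-1252', 'ISO-8859-1']): string
{
    // Valid UTF-8 is returned untouched.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Strict mode only accepts an encoding in which every byte is valid.
    $from = mb_detect_encoding($bytes, $candidates, true);
    if ($from === false) {
        $from = 'ISO-8859-1'; // last resort: any byte sequence is valid Latin-1
    }

    return mb_convert_encoding($bytes, 'UTF-8', $from);
}
```

Checking for valid UTF-8 first matters because text that is already UTF-8 would otherwise be converted a second time.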

Searching in the column of mysql database table, having the encoded html along with text. I need to search only text in this column using sphinx

I have a table in the database having the column question_description. I have indexed the column in sphinx, and getting the results successfully. Now problem is that the column contains encoded html along with text, and I want from sphinx to only search in text ignoring the encoded html. How can i configure this requirement. Thanks!
Not totally sure what you mean by 'encoded' HTML. Does that mean gzipped or something?
But do see html_strip:
http://sphinxsearch.com/docs/current.html#conf-html-strip
HTML tags are removed, their contents (i.e., everything between <P> and </P>) are left intact by default.
Edited to add (too long for a comment!):
Eek, yes, OK, you do have 'encoded' HTML (in this case using HTML entities). Sphinx DOESN'T have explicit support for decoding that.
It can 'strip' plain HTML, not encoded HTML. Frankly, encoding it like that seems to add lots of extra overhead (your real HTML entities will then be DOUBLE encoded) and means you always need to decode it when 'using' the HTML (be that outputting it in a webpage, or sending it to Sphinx, etc.).
You would have to use an XMLPipe2 index (or another pipe index) to decode the text for indexing (this will be quite complicated, as you will have to decode the htmlspecialchars output but then re-encode it as XML),
or maybe find a MySQL function to decode it during the sql_query (see the question "Is there a MySQL function to decode HTML entities?").
Second edit to add:
Actually, checking http://php.net/manual/en/function.htmlspecialchars.php, it seems htmlspecialchars() only really does 5 transformations.
That might be practical to fix with regexp_filter: you could replace the entities with their unencoded versions.
The regexp filters are applied BEFORE HTML processing...
http://sphinxsearch.com/blog/2014/11/26/sphinx-text-processing-pipeline/
http://sphinxsearch.com/docs/current.html#conf-regexp-filter
regexp_filter = &quot; => "
... etc
regexp_filter = &amp; => &
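If you end up preprocessing in PHP instead (for example in a script that feeds a pipe index), the same reversal is a couple of function calls. This is only a sketch: the sample value is made up, and it assumes the column was written with htmlspecialchars() using ENT_QUOTES.

```php
<?php
// Made-up sample imitating a column that was stored through htmlspecialchars().
$stored = htmlspecialchars('<p>Use "x & y" here</p>', ENT_QUOTES, 'UTF-8');
// $stored is now: &lt;p&gt;Use &quot;x &amp; y&quot; here&lt;/p&gt;

// Reverse the five replacements, then strip the tags that have become real again.
$html = htmlspecialchars_decode($stored, ENT_QUOTES);
$text = strip_tags($html);

echo $text, "\n"; // Use "x & y" here
```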

Passing code through the post variable?

I am coding a small template editor and the problem I am having is that code keeps getting converted into other characters, such as:
&lt;?php
$hello = &quot;hello&quot;;
?&gt;
and it writes exactly that to the file. I want to write the actual code, PHP and HTML.
How can I accomplish this?
In this specific case you should run the contents of the file through the html_entity_decode function.
Description from the documentation -
Convert HTML entities to their corresponding characters
$str = '&lt;?php';
echo html_entity_decode($str);
Outputs - <?php
Your issue is that your PHP is calling htmlspecialchars(). This converts characters that could be an issue (such as <>) into their HTML-safe version. You can resolve this by removing the htmlspecialchars() function (not recommended, as it's probably there for a reason) or calling html_entity_decode() on the code you want to save to a file.
I wouldn't necessarily recommend using html_entity_decode. It's usually a bad idea to fix incorrectly-encoded text by just reversing the encoding. Instead, figure out why it's encoded incorrectly in the first place.

™ gets converted to â„ ¢ DOMDocument XPath

If I have
<p id='test'>TEST&trade;</p>
and I use
document.getElementById('test').innerHTML;
to pass the HTML to a PHP function, where it extracts all of the text nodes using DOMDocument and XPath.
When the PHP gets the content, the &trade; has been converted to ™. I run it through XPath and the text node comes back as:
TESTâ„ ¢
I am not sure what is going wrong, or whether there is a way to fix it, perhaps on the JavaScript side so that it passes &trade; rather than ™.
Any help is appreciated.
Your variable is being passed with the literal ™ character, not with &trade;. Running it through htmlentities() in PHP should take care of it.
You could try using the numeric (Unicode) form instead.
Example:
<p id='test'>&#8482;</p>
Read this page for more examples of the Unicode TM character:
http://www.fileformat.info/info/unicode/char/2122/index.htm
Hope this helps.
You need to be more precise than saying it "comes back as". The ™ appears to have been written somewhere in UTF-8 encoding, and the same bytes have then been read by something that doesn't realise they are in UTF-8 encoding, and is assuming they are Latin-1 or similar. To solve the problem you will need to look very carefully at the configuration of the software that wrote the character and the software that read it.
What Michael said is true; in addition you should be aware that XML processors are basically required to convert character entities (like &trade;) to their actual character values, and will (almost) always produce output with those characters encoded in some prevailing character set. It takes heroic measures to prevent this, and is usually not a "good idea". So you should abandon attempts to do that, and my guess is that you would be better served by making sure that the function you are passing the HTML to is told to interpret it as UTF-8, not some other charset (which may just be the system default).
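To make that misreading concrete, here is a small sketch. The first part reproduces the garbage by deliberately reading the UTF-8 bytes of ™ as Windows-1252. The second part shows one way (my assumption, not necessarily what the asker's code does) to keep DOMDocument from guessing: convert non-ASCII characters to numeric entities before calling loadHTML().

```php
<?php
$tm = "\u{2122}"; // the trademark sign, bytes E2 84 A2 in UTF-8

// Reading those three bytes as Windows-1252 yields three separate characters.
$garbled = mb_convert_encoding($tm, 'UTF-8', 'Windows-1252');

// Feed DOMDocument pure ASCII so no charset guessing is involved.
$html  = "<p id='test'>TEST" . $tm . "</p>";
$ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$doc = new DOMDocument();
$doc->loadHTML($ascii);

// Collect the text nodes, as the question does with XPath.
$xpath = new DOMXPath($doc);
$texts = [];
foreach ($xpath->query('//text()') as $node) {
    $texts[] = $node->nodeValue; // DOM strings come back as UTF-8
}
```

Here $garbled holds the three characters from the question's output, while $texts holds the single intact string with the trademark sign.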

Converting diacritics to numerical HTML code with HTML Purifier

I'm having trouble finding the correct setting for HTML Purifier 4.3.0 to convert diacritics to numerical HTML code. Is this possible using this library?
So, from încă to &#238;nc&#259;.
As you can see in the demo, by default: no. There isn't a clear description of what it does and doesn't do, but to me HTML Purifier looks like it's meant to strip HTML tags from input.
I think you're better off using htmlentities().
If you're working in UTF-8 mode, as HTML Purifier does by default, there's no need to escape character entities. If you tell HTML Purifier that you're working in ASCII mode, it will do so for you.
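If you only need the numeric form and are happy to do it outside HTML Purifier, plain PHP can produce it with mb_encode_numericentity() from the mbstring extension. A minimal sketch, with a helper name of my own choosing:

```php
<?php
// Hypothetical helper name; assumes UTF-8 input.
function toNumericEntities(string $utf8): string
{
    // Map every code point from U+0080 upwards to a decimal entity; ASCII is left alone.
    return mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

echo toNumericEntities("\u{EE}nc\u{103}"), "\n"; // &#238;nc&#259;
```

Note that htmlentities() is not a drop-in replacement here: it emits named entities, and only for characters that have a name in the chosen entity table.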

XML character encoding issues with accents

I have had the problem a few times now while working on projects and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from twitter and uploading them to my DB however when I output them to screen I get these characters:
"moved to dusseldorf.�"
OR
tambiÃ©n
and if I have Russian characters then I get lots of ugly boxes in place.
What I would like is for the correct native accents to show under one encoding. I thought that was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities, but I would really appreciate a solution that works across the board on projects.
Thanks in advance.
Replace this line:
$data = htmlentities($data);
With this:
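$data = htmlentities($data, ENT_COMPAT, "UTF-8");
That way, htmlentities() will read the input as UTF-8 and convert each multi-byte character as a whole, instead of treating its bytes as separate ISO-8859-1 characters. For more information see the documentation for htmlentities().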
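A quick way to see the difference is to run the same UTF-8 string through htmlentities() with both charsets. This is just an illustration, with variable names of my own:

```php
<?php
$word = "tambi\u{E9}n"; // UTF-8 input: the accented letter is the two bytes C3 A9

// Treated as ISO-8859-1, each of the two bytes becomes its own entity.
$wrong = htmlentities($word, ENT_COMPAT, 'ISO-8859-1');

// Treated as UTF-8, the two bytes are read as one character.
$right = htmlentities($word, ENT_COMPAT, 'UTF-8');

echo $wrong, "\n"; // tambi&Atilde;&copy;n
echo $right, "\n"; // tambi&eacute;n
```

The first result is what a browser renders as the two-character garbage in place of the accented letter.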
You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data in an HTML context.
Make sure that you set your PHP internal encoding to UTF-8 using iconv_set_encoding, and that you call htmlentities with the encoding information as EdoDodo said. Also make sure that your database stores data with UTF-8 encoding, though you say that's already the case.
You can't use htmlentities() in its default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entity codes that are available by default in XML are &lt;, &gt;, &amp;, &quot; and &apos;. All other characters need to be presented using their numeric entity.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
Hope that helps.
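As an alternative to a hand-written cleaner, newer PHP versions let htmlspecialchars() target XML directly through the ENT_XML1 flag. A minimal sketch, assuming UTF-8 input and a UTF-8 XML document (the helper name xmlEscape is mine):

```php
<?php
// Hypothetical helper; escapes only the characters XML itself treats as special.
function xmlEscape(string $text): string
{
    // ENT_XML1 restricts the output to entities that XML predefines,
    // so accented characters pass through unchanged as UTF-8.
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

echo xmlEscape('a < b & c'), "\n"; // a &lt; b &amp; c
```

Because the accented characters are left as real UTF-8, this only works if the XML is actually declared and served as UTF-8.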
