htmlentities() makes Chinese characters unusable - php

we have a web application where we allow users to enter their own html in a text area. We save that data to our database.
When we load the html data into the text area, of course, we use htmlentities() before throwing the html data into the textarea. Otherwise users could save inside the textarea and our application would break when loading that into the textarea.
this works great, except when entering Chinese characters (and probably other languages such as Arabic, Japanese).
The htmlentities() makes the chinese text unusable like this: �¨�³�¼�§ï
When I remove the htmlentities() before loading the entered html into the text area, Chinese characters show up just fine, but then we have the problem of HTML interfering with our textarea, especially when a users enters inside the text area.
I hope that makes sense.
Does anyone know how we can safely and correctly allow languages such as Chinese, Japanese, ... to be used inside our text area, while still being safe for loading any html inside our text area?

Have you tried using htmlspecialchars?
I currently use that in production and it's fine.
$foo = "我的名字叫萨沙"
echo '<textarea>' . htmlspecialchars($foo) . '</textarea>';
Alternately,
$str = “你好”;
echo mb_convert_encoding($str, ‘UTF-8′, ‘HTML-ENTITIES’);
As found on http://www.techiecorner.com/129/php-how-to-convert-iso-character-htmlentities-to-utf-8/

Specify charset, e.g. UTF-8 and it should work.
echo htmlentities($data, ENT_COMPAT, 'UTF-8');

PHP is pretty appalling in terms of framework-wide support for international character sets (although it's slowly getting better, especially in PHP5, but you don't specify which version you're using). There are a few mb_ (multibyte, as in multibyte characters) functions to help you out, though.
This example may help you (from here):
<?php
/**
* Multibyte equivalent for htmlentities() [lite version :)]
*
* #param string $str
* #param string $encoding
* #return string
**/
function mb_htmlentities($str, $encoding = 'utf-8') {
mb_regex_encoding($encoding);
$pattern = array('<', '>', '"', '\'');
$replacement = array('<', '>', '"', ''');
for ($i=0; $i<sizeof($pattern); $i++) {
$str = mb_ereg_replace($pattern[$i], $replacement[$i], $str);
}
return $str;
}
?>
Also, make sure your page is specifying the same character set. You can do this with a meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Most likely you're not using the correct encoding. If you already know your output encoding, use the charset argument of the html_entities function.
If you haven't settled on an internal encoding yet, take a look at the iconv functions; iconv_set_encoding("internal_encoding", "UTF-8"); might be a good start.

Related

convert special characters to regular alphabet in php

I'm trying to build a search page for a bunch of menu items in my database which often contain special characters like é (as in sautéed), and so I want to convert both the search query and the database content to regular alphabets, and I'm having trouble. I'm using ISO-8859-1 so that special characters will display properly on the website, and I get the feeling this is hindering my attempts at conversion...
header('Content-Type: text/html; charset=ISO-8859-1');
The search query is sent to search.php using the GET method, so the query "sautéed" will appear like this in the address bar:
search.php?q=saut%E9ed
This is the function I'm trying to build, that's not working:
$q = $_GET['q'];
function clean_str($a) {
$fix = array('é' => 'e');
$str = str_replace(array_keys($fix), array_values($fix), $a);
return $str;
}
$fixed = clean_str($q); // currently has no effect
I'm tried using %29 as the array key, as well as the HTML character code (é). I've tried utf8_encode($q); to no avail. Other characters like ! and + work fine in the clean_str() function, but not special alphabets like é.
Though you might want to reconsider the way you're doing this, as has been suggested, I believe this will get you there.
function clean_str($a) {
$fix = array('é' => 'e');
$str = str_replace(array_keys($fix), array_values($fix), $a);
return $str;
}
$fixed = clean_str(utf8_encode($_GET['q'])); // return an encoded utf8 string.
echo $fixed;
For more on utf8_encode see here.
To wit, é is the regular alphabet in several languages =) While you're suggesting you would like to know how to covert the text to ASCII (which English speakers may consider 'regular') what you really should be doing is working with the modern web's most permissive encoding, which is UTF8.
That way, you will be able to accept input in any language, save it, process it, and serve it back up, without needing to normalise or ill-convert to another codepage.
Serve your pages with <meta charset="utf-8"> in the source code, and an http content header to indicate UTF8 encoding, and things should go a lot smoother. (note that for the now defunct HTML 4.01 or XHTML 1/1.1 you will need to use the older meta tag syntax. Using those flavours for new projects is, however, very much not recommended)

How should I deal with character encodings when storing crawled web content for a search engine into a MySQL database?

I have a crawler that downloads webpages, scrapes specific content and then stores that content into a MySQL database. Later that content is displayed on a webpage when it's searched for ( standard search engine type setup ).
The content is generally of two different encoding types... UTF-8 or ISO-8859-1 or it is not specified. My database tables use cp1252 west european ( latin1 ) encoding. Up until now, I've simply filtered all characters that are not alphanumeric, spaces or punctuation using a regular expression before storing the content to MySQL. For the most part, this has eliminated all character encoding problems, and content is displayed properly when recalled and outputted to HTML. Here is the code I use:
function clean_string( $string )
{
$string = trim( $string );
$string = preg_replace( '/[^a-zA-Z0-9\s\p{P}]/', '', $string );
$string = $mysqli->real_escape_string( $string );
return $string;
}
I now need to start capturing "special" characters like trademark, copyright, and registered symbols, and am having trouble. No matter what I try, I end up with weird characters when I redisplay the content in HTML.
From what I've read, it sounds like I should use UTF-8 for my database encoding. How do I ensure all my data is converted properly before storing it to the database? Remember that my original content comes from all over the web in various encoding formats. Are there other steps I'm overlooking that may be giving me problems?
You should convert your database encoding to UTF-8.
About the content: for every page you crawl, fetch the page's encoding (from HTTP header/
meta charset) and use that encoding to convert to utf-8 like this:
$string = iconv("UTF-8", "THIS STRING'S ENCODING", $string);
Where THIS STRING'S ENCODING is the one you just grabbed as described above.
PHP manual on iconv: http://be2.php.net/manual/en/function.iconv.php
UTF-8 encompasses just about everything. It would definitely be my choice.
As far as storing the data, just ensure the connection to your database is using the proper charset. See the manual.
To deal with the ISO encoding, simply use utf8_encode when you store it, and utf8_decode when you retrieve it.
Try doing the encoding/decoding even when it's supposedly UTF-8 and see if that works for you. I've often seen people say something is UTF-8 when it isn't.
You'll also need to change your database to UTF-8.
Below worked for me when I am scraping and presenting the data on html page.
While scraping the data from external website do an utf8_encode:utf8_encode(trim(str_replace(array("\t","\n\r","\n","\r"),"",trim($th->plaintext))));
Before writing to the HTML page set the charset to utf-8 : <meta charset="UTF-8">
While writing of echoing out on html do an utf8_decode.echo "Menu Item:". utf8_decode ($value['item'])
This helped me to solve problem with my html scraping issues. Hope someone else finds it useful.

PHP problem character set

I have a problem where users upload zipped text files. After I extract text contents I import them in mysql database. But later when I display the text in browser some characters are garbled. I tried to encode them but I am unable to detect the encoding of the text files with PHP and convert to UTF-8 with iconv or mbstring.
Mysql database charset is UTF-8.
header('Content-type: text/html; charset=utf-8');
is added.
Tried with
iconv('UTF-8', 'UTF-8//IGNORE', $text_file_contents)
But it simply removes the garbled chars: � which should be either ' or " when I checked manually with Firefox browser. Firefox showed that is ISO-8859-1 but I can not check for every article they send (articles may be in different character set).
How to convert this characters to UTF-8 ?
EDIT:
This is a modified function I found on
http://php.net/manual/en/function.mb-detect-encoding.php
origanlly written by prgss at bk dot ru .
function myutf8_detect_encoding($string, $default = 'UTF-8', $encode = 0, $encode_to = 'UTF-8') {
static $list = array('UTF-8', 'ISO-8859-1', 'ASCII', 'windows-1250', 'windows-1251', 'latin1', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'ISO-8859-2', 'ISO-8859-3', 'GBK', 'GB2312', 'GB18030', 'MACROMAN', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-11', 'ISO-8859-12', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16');
foreach ($list as $item) {
$sample = iconv($item, $item, $string);
if (md5($sample) == md5($string)) {
if ($encode == 1)
return iconv($item, $encode_to, $string);
else
return $item;
}
}
if ($encode == 1)
return iconv($encode_to, $encode_to . '//IGNORE', $string);
else
return $default;
}
and in my code I use:
myutf8_detect_encoding(trim($description), 'UTF-8', 1)
but it still returns garbled characters of this text “old is gold’’ .
This is indeed tricky.
Detecting an arbitrary string's encoding using detect_encoding... is known to be not very reliable (although it should be able to distinguish between UTF-8 and ISO-8859-1 for example - make sure you give it a try first.)
If the auto-detection doesn't work out, there is the option of displaying the content to the user before it gets submitted, along with a drop-down menu to switch between the most used encodings. Then show a message like
Please check your submission. If you are seeing incorrect or garbled characters, please change the encoding in the drop-down menu until the content is correct.
Whenever the user changes the drop-down value, your script will pull the content again, use iconv() to convert it from the specified encoding to UTF-8, and output the result, until it looks good.
This needs some finesse in designing the User Interface to be understandable for the end user, but it would often be the best option. Especially if you are dealing with users from many different regions or continents with a lot of different encodings.
Having had the same problem of encoding detection, I made a php function that outputs different information about the string and should make it relatively easy to identify the encoding used.
http://php.net/manual/en/function.ord.php (function hex_chars by "manixrock(hat)gmail(doink)com").
It shows the values of the characters inside the string, as well as the values of each individual byte. You look at the output and see which of your suspected encodings matches the bytes. You should first familiarize yourself with the various popular encodings like UTF-8, UTF-16, ISO-8859-X (understand their byte storage). Also make sure you test the string as unaltered as possible (take care how the encoding might change between what PHP outputs and what the browser receives, how the browser displays, or if you get the string from another source like MySQL or a file how that may change the encoding).
This helped me detect that a text had undergone the conversions: (UTF-8 to byte[]) then (ISO-8859-1 to UTF-8). That function helped a lot. Hope it helps you.
Use mb_detect_encoding to find out what encoding is used, then iconv to convert.
Try to insert right after the mysql connection:
mysql_query("SET NAMES utf8");

Getting the title of a page in PHP

When I want to get the title of a remote webiste, I use this script:
function get_remotetitle($urlpage) {
$file = #fopen(($urlpage),"r");
$text = fread($file,16384);
if (preg_match('/<title>(.*?)<\/title>/is',$text,$found)) {
$title = $found[1];
} else {
$title = 'Title N/A';
}
return $title;
}
But when I parase a webiste title with accents, I get "�". But if I look in PHPMyAdmin, I see the accents correctly. What's happening?
This is most likely a character encoding issue. You are probably getting the character correctly but the page that displays it has the wrong character encoding so it doesn't display right.
check out PHP Simple HTML DOM Parser
use it something like:
$html = file_get_html('http://www.google.com/');
$ret = $html->find('title', 0);
The trouble is that the text has a different encoding from what you're using on the page you're displaying it on.
What you want to do is find out what encoding the data is (for instance by looking at what encoding the page you take the text from is using) and converting it to the encoding you're using yourself.
For doing the actual conversion, you can use iconv (for the general case), utf8_decode (UTF8 -> ISO-8859-1), utf8_encode (ISO-8859-1 -> UTF8) or mb_convert_encoding.
To help you find out what the encoding of the source page is, you could for instance put the website through the w3c Validator which automatically detects encoding.
If want an automatic way to determine encoding, you'll have to look at the HTML itself. The ways you can determine the selected charset can be fonud in the HTML 4 specification.
In addition, it's worth having a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for a bit more information on encoding.
I solved it. I added htmlentities($text) and now displays the accents and so.
Try this:
echo iconv('UTF-8', 'ASCII//TRANSLIT', $title);

Best method for pre and post processing multilingual user input for a php/mysql CMS

Okay, there is a ton of stuff out there on sanitizing strings but very little, that I can find, on the best methods to prepare user input (like what I'm typing now) for inserting into a content management system then how to filter it coming out.
I'm building two multilingual (Japanese, English + other Romance languages) CMSs and having a heck of a time with getting both special characters like ®, ™, to display along with Japanese characters.
I continue to get very inconsistent results.
I have everything set to UTF-8:
web page: and
.htaccess file: AddDefaultCharset UTF-8 AND (to force the issue)
after each db connection: mysql_query("SET NAMES 'UTF8'");
each database, table, and field is also set to utf8_general_ci
Magic quotes are off. I preprocess user input first with the default settings of htmlpurifier, then I run this function on it:
function html_encode($var) {
// Encodes HTML safely for UTF-8. Use instead of htmlentities.
$var = htmlentities($var, ENT_QUOTES, 'UTF-8');
// convert pesky special characters to unicode
$look = array('™', '™','®','®');
$safe = array('™', '™', '®', '®');
$var = str_replace($look, $safe, $var);
$var = mysql_real_escape_string($var);
return $var;
}
That get's it in to the database.
I return it from the database by filtering everything with this function:
function decodeit($var) {
return html_entity_decode(stripcslashes($var), ENT_QUOTES, 'UTF-8');
}
Unfortunately, after all this I STILL get inconsistent results. Most often the ® symbols become little diamonds.
I've searched all over for a good tut on this but can't seem to find what are the best methods...
Sorry the web page headers got scrubbed by the wysiwyg editor. For clarity's sake:
Web page headers are:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
And
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Don't put htmlentities in your database! Never call html_entities(), it should be deprecated from php. Use htmlspecialchars but when you display the text, not before you put it in the database. The point is to prevent your data from being treated as html. There is no point in translating trademark symbols or copyright symbols, because they don't cause a risk. The only html you need to worry about is: > < & ' "
http://us3.php.net/utf8_encode
http://us3.php.net/utf8-decode
That should help.
Everything is already encoded utf8. Decoding it to ISO-8859-1 would merely wreck any Japanese.
I once had an issue with encoding that came down to the encoding of the php files themselves. So basically make sure the files themselves are encoded to utf-8. In vim you can do
:e ++enc=

Categories