UTF-8, XML, and htmlentities with PHP / MySQL

I have found a lot of varying / inconsistent information across the web on this topic, so I'm hoping someone can help me out with these issues:
I need a function to cleanse a string so that it is safe to insert into a utf-8 mysql db or to write to a utf-8 XML file. Characters that can't be converted to utf-8 should be removed.
For writing to an XML file, I'm also running into the problem of converting html entities into numeric entities. The htmlspecialchars() works almost all the time, but I have read that it is not sufficient for properly cleansing all strings, for example one that contains an invalid html entity.
Thanks for your help, Brian

You didn't say where the strings were coming from, but if you're getting them from an HTML form submission, see this article:
Setting the character encoding in form submit for Internet Explorer
In short, you need to tell the browser explicitly which charset you want the form submission in. If you specify UTF-8, a well-behaved browser should not send invalid UTF-8. A malicious client can still send arbitrary bytes, so if you want to guard against that, validate the input with iconv:
$utf_8_string = iconv($from_charset, $to_charset, $original_string);
If you specify "utf-8" as both $from_charset and $to_charset, iconv() returns false (and raises a notice) if $original_string contains invalid UTF-8.
If you're getting your strings from a different source and you know the character encoding, you can still use iconv(). Typical encodings in the US are CP-1252 (Windows) and ISO-8859-1 (everything else.)

Something like this?
function cleanse($in) {
    // Replace typographic quotes with their plain ASCII equivalents
    $bad = Array('”', '“', '’', '‘');
    $good = Array('"', '"', '\'', '\'');
    $out = str_replace($bad, $good, $in);
    return $out;
}

You can convert a string from any encoding to UTF-8 with iconv or mbstring:
// With the //IGNORE flag, this will ignore invalid characters
iconv('input-encoding', 'UTF-8//IGNORE', $the_string);
mb_convert_encoding($the_string, 'UTF-8', 'input-encoding');
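For the XML part of the question, the following is a minimal sketch rather than a complete sanitizer. It assumes the input may contain named HTML entities, and the function names strip_invalid_utf8() and xml_text() are made up for illustration. It drops invalid UTF-8 sequences, decodes HTML entities into real characters, removes code points that XML 1.0 does not allow, and then escapes the XML special characters.

```php
<?php
// Hypothetical helpers; a sketch under the assumptions stated above.

/** Drop byte sequences that are not valid UTF-8. */
function strip_invalid_utf8(string $s): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // drop invalid sequences instead of writing '?'
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the previous setting
    return $clean;
}

/** Prepare text (possibly containing HTML entities) for use as XML 1.0 character data. */
function xml_text(string $s): string
{
    $s = strip_invalid_utf8($s);

    // Turn named HTML entities such as &eacute; into real characters first,
    // because XML itself only predefines &lt; &gt; &amp; &quot; &apos;.
    // Unknown entities such as &bogus; are left alone and escaped below.
    $s = html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Remove code points outside the XML 1.0 Char range (most control characters).
    $s = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $s
    ) ?? '';

    return htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}
```

Because the characters are written as UTF-8 rather than as entities, no conversion to numeric entities is needed as long as the XML file declares UTF-8 as its encoding.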


mb_detect_encoding returns both ASCII and UTF8 [duplicate]

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method that looks for Latin1 values and, if none occurred, lets mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encodings don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specify the source encoding explicitly:
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
or pass 'auto' and let mbstring detect it:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
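An alternative to guessing is to validate. Pure ASCII is already valid UTF-8, and Latin1 text with bytes in the 192-255 range will usually fail a UTF-8 validity check, so the check can decide whether a conversion is needed at all. This is only a sketch: the function name to_utf8() is made up, and the fallback encoding is an assumption about where the IPTC data comes from.

```php
<?php
// Sketch: keep valid UTF-8 as it is, convert everything else from an assumed
// fallback encoding. $fallback is an assumption about the data source.
function to_utf8(string $val, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($val, 'UTF-8')) {
        return $val; // plain ASCII also ends up here, untouched
    }
    return mb_convert_encoding($val, 'UTF-8', $fallback);
}

echo bin2hex(to_utf8("caf\xE9")), "\n";     // Latin1 input, converted
echo bin2hex(to_utf8("caf\xC3\xA9")), "\n"; // UTF-8 input, returned unchanged
```

A Latin1 string can happen to be valid UTF-8 by coincidence, so this is a heuristic too, just a more predictable one.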

iconv with ASCII//TRANSLIT triggers ErrorException: "iconv(): Detected an illegal character in input string"

First of all, I should say that I am new to multilingual conversions.
I have strings that I want to lowercase in UTF-8 form if possible (something like a clean URL), and I use
$str = iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($str));
$str = preg_replace("/[^a-zA-Z0-9_]/", "", $str);
$str = mb_strtolower($str);
to achieve my requirements (a UTF-8, lowercase string).
However, when I test that function with "çokGüŞelLl" using CocoaRestClient, I get à as $str (because of my client?) and iconv triggers an error complaining about an illegal character in the input string (Ã).
What is the problem with iconv? The string is already encoded as UTF-8 by utf8_encode($str). How can it contain an illegal character?
I read about #iconv questions here, but I don't think having empty database entries is a good solution.
Thanks to all answers, I will read and try to understand each of them.
The PHP function utf8_encode() expects your string to be ISO-8859-1 encoded. If it isn’t, well, you get funny results.
Ensure that your data is proper UTF-8 before saving it to your database:
// Validate that the input string is valid UTF-8
if (preg_match("//u", $string) === false) {
    throw new \InvalidArgumentException("String contains invalid UTF-8 characters.");
}
// Normalize to Unicode NFC form (recommended by W3C)
$string = \Normalizer::normalize($string);
Now everything is stored the same way in our database and we don't have to care about this problem anymore when receiving data from our database.
$string = $database->getSomeRecordWithUnicode();
echo mb_strtolower($string);
PS: If you want to ensure that your database is using the exact same encoding as PHP either use utf8mb4 as character set (and utf8mb4_unicode_ci as default collation for perfect sorting) or a BLOB (binary) data type.
PPS: Use your database configuration file to force proper encoding of all strings instead of using e.g. $mysqli->set_charset("utf8") or similar.
About HTML forms
Since you asked in the comments of your question: how data is sent to your server has nothing to do with the locale the user has set in their operating system. It depends on the client's browser. All modern browsers default to UTF-8 when sending form data. If you are afraid that some of your clients might be using totally broken browsers, simply tell them that you only accept UTF-8. Drupal does that on all its forms.
<!doctype html>
<form accept-charset="UTF-8">
Now all browsers should encode the data they submit in utf-8.
If you encode çokGüŞelLl as UTF-8 you should get the following bytes:
var_dump( bin2hex('çokGüŞelLl') );
string(26) "c3a76f6b47c3bcc59e656c4c6c"
That's a check you must do. You also have this:
Your string contains Ş, which cannot be represented in ISO-8859-1 to begin with.
So, whatever reason you have to convert your original UTF-8 (as stored in DB) to ISO-8859-1, I'm afraid that it's corrupting your data.
You're double encoding. First you set your database to UTF-8. That means your data is now UTF-8 encoded. Then you use utf8_encode on the iconv-function. But your input is already UTF-8. Try removing your utf8_encode statement from iconv.
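The double encoding can be made visible with a short sketch. The function names latin1_to_utf8() and clean_slug() are hypothetical. The first does what utf8_encode() does, written with mbstring; the second is the original pipeline without the utf8_encode() call. What //TRANSLIT produces for non-ASCII letters depends on the iconv implementation and the locale, so no output is claimed for those.

```php
<?php
// Reinterpret every byte as an ISO-8859-1 character, like utf8_encode() does.
function latin1_to_utf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$utf8 = "\xC3\xBC";                    // "ü", already encoded as UTF-8
echo bin2hex($utf8), "\n";             // c3bc
echo bin2hex(latin1_to_utf8($utf8)), "\n"; // c383c2bc: "Ã" followed by "¼"

// The pipeline from the question, assuming $str is already UTF-8.
function clean_slug(string $str): string
{
    if (!mb_check_encoding($str, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $str);
    if ($ascii === false) {
        $ascii = ''; // iconv could not transliterate the input
    }
    $ascii = preg_replace('/[^a-zA-Z0-9_]/', '', $ascii);
    return strtolower($ascii);
}
```

After the transliteration step the string is plain ASCII, so strtolower() is sufficient and mb_strtolower() is not required.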

How should I deal with character encodings when storing crawled web content for a search engine into a MySQL database?

I have a crawler that downloads webpages, scrapes specific content and then stores that content into a MySQL database. Later that content is displayed on a webpage when it's searched for ( standard search engine type setup ).
The content is generally of two different encoding types... UTF-8 or ISO-8859-1 or it is not specified. My database tables use cp1252 west european ( latin1 ) encoding. Up until now, I've simply filtered all characters that are not alphanumeric, spaces or punctuation using a regular expression before storing the content to MySQL. For the most part, this has eliminated all character encoding problems, and content is displayed properly when recalled and outputted to HTML. Here is the code I use:
function clean_string( $mysqli, $string )
{
    // $mysqli is passed in because it is not visible inside the function otherwise
    $string = trim( $string );
    $string = preg_replace( '/[^a-zA-Z0-9\s\p{P}]/', '', $string );
    $string = $mysqli->real_escape_string( $string );
    return $string;
}
I now need to start capturing "special" characters like trademark, copyright, and registered symbols, and am having trouble. No matter what I try, I end up with weird characters when I redisplay the content in HTML.
From what I've read, it sounds like I should use UTF-8 for my database encoding. How do I ensure all my data is converted properly before storing it to the database? Remember that my original content comes from all over the web in various encoding formats. Are there other steps I'm overlooking that may be giving me problems?
You should convert your database encoding to UTF-8.
About the content: for every page you crawl, fetch the page's encoding (from the HTTP header or the meta charset) and use that as the source encoding when converting to UTF-8, like this:
$string = iconv("THIS STRING'S ENCODING", "UTF-8", $string);
Where THIS STRING'S ENCODING is the one you just grabbed as described above. Note that iconv() takes the source encoding first and the target encoding second.
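A rough sketch of that lookup, under some assumptions: the Content-Type header value has already been extracted by the crawler, the regular expressions only cover the common charset declarations, and the function names and the ISO-8859-1 fallback are made up for illustration.

```php
<?php
/** Return the charset a page declares, or null if none is found. */
function detect_declared_charset(?string $contentType, string $html): ?string
{
    // 1. HTTP header, e.g. "text/html; charset=ISO-8859-1"
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // 2. <meta charset="..."> or <meta http-equiv=... content="...; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

/** Convert a crawled page to UTF-8 using its declared charset. */
function page_to_utf8(string $html, ?string $contentType, string $fallback = 'ISO-8859-1'): string
{
    $charset = detect_declared_charset($contentType, $html) ?? $fallback;
    $converted = @iconv($charset, 'UTF-8', $html);
    if ($converted === false) {
        // Unknown charset name or bytes that do not match the declaration:
        // fall back to the assumed default instead of losing the page.
        $converted = mb_convert_encoding($html, 'UTF-8', $fallback);
    }
    return $converted;
}
```

Pages sometimes declare one encoding and use another, so a failed conversion is worth logging rather than ignoring.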
PHP manual on iconv: http://be2.php.net/manual/en/function.iconv.php
UTF-8 encompasses just about everything. It would definitely be my choice.
As far as storing the data, just ensure the connection to your database is using the proper charset. See the manual.
To deal with the ISO encoding, simply use utf8_encode when you store it, and utf8_decode when you retrieve it.
Try doing the encoding/decoding even when it's supposedly UTF-8 and see if that works for you. I've often seen people say something is UTF-8 when it isn't.
You'll also need to change your database to UTF-8.
The following worked for me when scraping data and presenting it on an HTML page.
While scraping the data from the external website, do a utf8_encode:
utf8_encode(trim(str_replace(array("\t","\n\r","\n","\r"),"",trim($th->plaintext))));
Before writing to the HTML page, set the charset to UTF-8: <meta charset="UTF-8">
When writing or echoing out to HTML, do a utf8_decode:
echo "Menu Item:". utf8_decode($value['item']);
This helped me to solve problem with my html scraping issues. Hope someone else finds it useful.

PHP problem character set

I have a problem where users upload zipped text files. After I extract text contents I import them in mysql database. But later when I display the text in browser some characters are garbled. I tried to encode them but I am unable to detect the encoding of the text files with PHP and convert to UTF-8 with iconv or mbstring.
Mysql database charset is UTF-8.
header('Content-type: text/html; charset=utf-8');
is added.
Tried with
iconv('UTF-8', 'UTF-8//IGNORE', $text_file_contents)
But it simply removes the garbled chars: � which should be either ' or " when I checked manually with Firefox browser. Firefox showed that is ISO-8859-1 but I can not check for every article they send (articles may be in different character set).
How can I convert these characters to UTF-8?
This is a modified version of a function originally written by prgss at bk dot ru:
function myutf8_detect_encoding($string, $default = 'UTF-8', $encode = 0, $encode_to = 'UTF-8') {
    static $list = array('UTF-8', 'ISO-8859-1', 'ASCII', 'windows-1250', 'windows-1251', 'latin1', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'ISO-8859-2', 'ISO-8859-3', 'GBK', 'GB2312', 'GB18030', 'MACROMAN', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-11', 'ISO-8859-12', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16');
    foreach ($list as $item) {
        // A round trip through the same encoding succeeds only if the bytes are valid in it
        $sample = iconv($item, $item, $string);
        if (md5($sample) == md5($string)) {
            if ($encode == 1) {
                return iconv($item, $encode_to, $string);
            }
            return $item;
        }
    }
    if ($encode == 1) {
        return iconv($encode_to, $encode_to . '//IGNORE', $string);
    }
    return $default;
}
and in my code I use:
myutf8_detect_encoding(trim($description), 'UTF-8', 1)
but it still returns garbled characters for this text: “old is gold’’.
This is indeed tricky.
Detecting an arbitrary string's encoding using detect_encoding... is known to be not very reliable (although it should be able to distinguish between UTF-8 and ISO-8859-1 for example - make sure you give it a try first.)
If the auto-detection doesn't work out, there is the option of displaying the content to the user before it gets submitted, along with a drop-down menu to switch between the most used encodings. Then show a message like
Please check your submission. If you are seeing incorrect or garbled characters, please change the encoding in the drop-down menu until the content is correct.
Whenever the user changes the drop-down value, your script will pull the content again, use iconv() to convert it from the specified encoding to UTF-8, and output the result, until it looks good.
This needs some finesse in designing the User Interface to be understandable for the end user, but it would often be the best option. Especially if you are dealing with users from many different regions or continents with a lot of different encodings.
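The server side of such a drop-down could look roughly like this. It is a sketch with a hypothetical function name and an arbitrary candidate list: the same uploaded bytes are converted to UTF-8 once per candidate encoding, and candidates for which the bytes are not valid are skipped.

```php
<?php
/**
 * Convert the same bytes to UTF-8 under each candidate encoding.
 * Returns an array of encoding name => UTF-8 preview text.
 */
function preview_candidates(string $bytes, array $candidates): array
{
    $previews = [];
    foreach ($candidates as $encoding) {
        // Skip encodings in which these bytes are not a valid string.
        if (!mb_check_encoding($bytes, $encoding)) {
            continue;
        }
        $previews[$encoding] = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    }
    return $previews;
}

// Curly quotes as Windows-1252 bytes around the word "old".
$upload = "\x93old\x94";
foreach (preview_candidates($upload, ['UTF-8', 'ISO-8859-1', 'Windows-1252']) as $name => $text) {
    echo $name, ': ', $text, "\n";
}
```

Note that ISO-8859-1 accepts every byte sequence, so it always yields a preview, but here it maps the quote bytes to invisible control characters. That is the kind of difference a user can spot and a detector cannot.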
Having had the same problem of encoding detection, I made a php function that outputs different information about the string and should make it relatively easy to identify the encoding used.
http://php.net/manual/en/function.ord.php (function hex_chars by "manixrock(hat)gmail(doink)com").
It shows the values of the characters inside the string, as well as the values of each individual byte. You look at the output and see which of your suspected encodings matches the bytes. You should first familiarize yourself with the various popular encodings like UTF-8, UTF-16, ISO-8859-X (understand their byte storage). Also make sure you test the string as unaltered as possible (take care how the encoding might change between what PHP outputs and what the browser receives, how the browser displays, or if you get the string from another source like MySQL or a file how that may change the encoding).
This helped me detect that a text had undergone the conversions: (UTF-8 to byte[]) then (ISO-8859-1 to UTF-8). That function helped a lot. Hope it helps you.
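This is not the hex_chars function referred to above, only a much smaller sketch of the same idea with a hypothetical name. It prints the raw bytes of a string so they can be compared against the suspected encodings, and it reproduces the mis-conversion described above (UTF-8 bytes treated as ISO-8859-1 and converted to UTF-8 again).

```php
<?php
/** Show a string as space-separated hex bytes. */
function hex_bytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

$quote = "\xE2\x80\x9C"; // left curly quote, correctly encoded as UTF-8
echo hex_bytes($quote), "\n"; // e2 80 9c

// The same bytes misread as ISO-8859-1 and converted to UTF-8 a second time.
$mangled = mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');
echo hex_bytes($mangled), "\n"; // c3 a2 c2 80 c2 9c

// Reversing that one step recovers the original bytes.
$repaired = mb_convert_encoding($mangled, 'ISO-8859-1', 'UTF-8');
echo hex_bytes($repaired), "\n"; // e2 80 9c
```

Seeing sequences like c3 a2 c2 80 where a single punctuation mark is expected is a strong hint that the text was converted twice.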
Use mb_detect_encoding to find out what encoding is used, then iconv to convert.
Try to insert right after the mysql connection:
mysql_query("SET NAMES utf8");
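The mysql_* functions were removed in PHP 7, so on current PHP versions the connection charset has to be set through mysqli or PDO instead. A minimal sketch with PDO follows; the helper name, host, database name and credentials are placeholders, and utf8mb4 is chosen because MySQL's utf8 only covers characters of up to three bytes.

```php
<?php
/** Build a PDO DSN that makes the connection itself use utf8mb4. */
function mysql_utf8_dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

echo mysql_utf8_dsn('localhost', 'app'), "\n";

// Usage (placeholder credentials):
// $pdo = new PDO(mysql_utf8_dsn('localhost', 'app'), 'user', 'secret');
//
// With mysqli the equivalent is:
// $mysqli = new mysqli('localhost', 'user', 'secret', 'app');
// $mysqli->set_charset('utf8mb4');
```

Setting the charset on the connection only helps if the tables and columns also use a UTF-8 character set.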
