I know there is an age-old issue with character encoding between different characters sets, but I'm stuck on one related to Window's "curly quotes".
We have a client that likes to copy-and-paste data into a text field and then post it out onto our app. That data will often have curly quotes in it. I used to use the following transform them into their normal counterparts:
function convert_smart_quotes($string) {
$badwordchars=array("\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6");
$fixedwordchars=array("'", "'", '"', '"', '-', '--', '...');
return str_replace($badwordchars,$fixedwordchars,$string);
}
This worked great for a few months. Then after some changes (we switches servers, made updates to the system, upgraded PHP, etc., etc.) we learned it doesn't work anymore. So, I take a look and I learn that the "curly quotes" are all changing into a different characters. In this case, they're turning into the following:
“ = ¡È
” = ¡É
‘ = ¡Æ
’ = ¡Ç
These characters then show up as the cursed "black diamond-question mark symbols" when saved in the database. The mySQL database is in latin1_swedish_ci as is the app the messages are received on. So, although I know utf-8 is better, it has to remain in latin1_swedish_ci, or ISO-8859-1, or else we'll have to rebuild everything... and that's out of the question.
My webpage, and form, are both posting in utf-8. If I change it to be in ISO-8859-1, the quotes become question marks instead.
I have tried searching the string for occurrences of "¡È" or "¡É" and replacing them with normal quotes, but I couldn't get that to work. I did it by adding the following to my above function:
$string = str_replace("xa1\xc8", '"', $string);
$string = str_replace("xa1\xc9", '"', $string);
$string = str_replace("xa1\xc6", "'", $string);
$string = str_replace("xa1\xc7", "'", $string);
I've been stuck on this for a couple hours now and haven't been able to find any real help online. As you can imagine, googleing "¡É" doesn't bring a very specific response.
Any guidance is appreciated!
Your problem is that you are accepting UTF-8 input from your user and then inserting it into your database as if it were Latin1 (ISO-8859-1). (Note that latin1_swedish_ci is not an encoding but a collation (for Latin1). See this SO question on the difference. For the purpose of solving your character encoding question, the collation is not important.)
Rather than manually identifying important UTF-8 sequences and replacing them, you should use a robust method for converting your UTF-8 string to Latin1 such as iconv.
Note that this is a lossy conversion: some UTF-8 characters, such as curly quotes, don't exist in Latin1. You can choose to ignore those characters (replacing them with the empty string, or ?, or something else), or you can choose to transliterate them (replacing them with close equivalents, like " for a curly quote... but what do you do if someone puts 金 in your form?
iconv will attempt to transliterate where it can:
// convert from utf8 to latin1, approximating out of range characters
// by the closest latin1 alternative where possible (//TRANSLIT)
$latinString = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $utf8String);
(You can also configure it to ignore all out of range characters — see iconv's documentation for more info.)
If you don't want to mess around with adding a new library, PHP also comes with the utf_decode function:
$latinString = utf_decode($utf8String);
However, PHP was not really designed with multiple character encodings in mind, so I prefer to stay away from the (sometimes buggy) standard library functions that deal with encoding.
You should also consider reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
You can use below code to solve this problem.
$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
or
$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'auto');
more information can be found on php documentation website.
Related
I have a form with a textarea. Users enter a block of text which is stored in a database.
Occasionally a user will paste text from Word containing smart quotes or emdashes. Those characters appear in the database as: –, ’, “ ,â€
What function should I call on the input string to convert smart quotes to regular quotes and emdashes to regular dashes?
I am working in PHP.
Update: Thanks for all of the great responses so far. The page on Joel's site about encodings is very informative: http://www.joelonsoftware.com/articles/Unicode.html
Some notes on my environment:
The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8 (Update:) by explicitly setting the meta content-type.
On those pages the smart quotes and emdashes appear as a diamond with question mark.
Solution:
Thanks again for the responses. The solution was twofold:
Make sure the database and HTML
files were explicitly set to use
UTF-8 encoding.
Use htmlspecialchars() instead of
htmlentities().
This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html
The mysql database is using UTF-8
encoding. Likewise, the html pages
that display the content are using
UTF-8.
The content of the HTML can be in UTF-8, yes, but are you explicitly setting the content type (encoding) of your HTML pages (generated via PHP?) to UTF-8 as well? Try returning a Content-Type header of "text/html;charset=utf-8" or add <meta> tags to your HTMLs:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
That way, the content type of the data submitted to PHP will also be the same.
I had a similar issue and adding the <meta> tag worked for me.
It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.
Here is some info on migrating your database to another character encoding, at least for a MySQL database.
This is an unfortunately all-too-common problem, not helped by PHP's very poor handling of character sets.
What we do is force the text through iconv
// Convert input data to UTF8, ignore any odd (MS Word..) chars
// that don't translate
$input = iconv("ISO-8859-1","UTF-8//IGNORE",$input);
The //IGNORE flag means that anything that can't be translated will be thrown away.
If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded.
We would often use standard string replace functions for that. Even though the nature of ASCII/Unicode in that context is pretty murky, it works. Just make sure your php file is saved in the right encoding format, etc.
In my experience, it's easier to just accept the smart quotes and make sure you're using the same encoding everywhere. To start, add this to your form tag: accept-charset="utf-8"
You could try mb_ convert_encoding from ISO-8859-1 to UTF-8.
$str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
This assumes you want UTF-8, and convert can find reasonable replacements... if not, mb_str_replace or preg_replace them yourself.
This may not be the best solution, but I'd try testing to find out what PHP sees. Let's say it sees "–" (there are a few other possibilities, like simple "“" or maybe "“"). Then do a str_replace to get rid of all of those and replace them with normal quotes, before stuffing the answer in a database.
The better solution would probably involve making the end-to-end data passing all UTF-8, as people are trying to help with in other answers.
You have to be sure your database connection is configured to accept and provide UTF-8 from and to the client (otherwise it will convert to the "default", which is usually latin1).
In practice this means running a query SET NAMES 'utf8';
http://www.phpwact.org/php/i18n/utf-8/mysql
Also, smart quotes are part of the windows-1252 character set, not iso-8859-1 (latin-1). Not very relevant to your problem, but just FYI. The euro symbol is in there as well.
the problem is on the mysql charset, I fixed my issues with this line of code.
mysql_set_charset('utf8',$link);
You have to manually change the collation of individual columns to UTF8; changing the database overall won't alter these.
If you were looking to escape these characters for the web while preserving their appearance, so your strings will appear like this: “It’s nice!” rather than "It's boring"...
You can do this by using your own custom htmlEncode function in place of PHP's htmlentities():
$trans_tbl = false;
function htmlEncode($text) {
global $trans_tbl;
// create translation table once
if(!$trans_tbl) {
// start with the default set of conversions and add more.
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl[chr(130)] = '‚'; // Single Low-9 Quotation Mark
$trans_tbl[chr(131)] = 'ƒ'; // Latin Small Letter F With Hook
$trans_tbl[chr(132)] = '„'; // Double Low-9 Quotation Mark
$trans_tbl[chr(133)] = '…'; // Horizontal Ellipsis
$trans_tbl[chr(134)] = '†'; // Dagger
$trans_tbl[chr(135)] = '‡'; // Double Dagger
$trans_tbl[chr(136)] = 'ˆ'; // Modifier Letter Circumflex Accent
$trans_tbl[chr(137)] = '‰'; // Per Mille Sign
$trans_tbl[chr(138)] = 'Š'; // Latin Capital Letter S With Caron
$trans_tbl[chr(139)] = '‹'; // Single Left-Pointing Angle Quotation Mark
$trans_tbl[chr(140)] = 'Œ'; // Latin Capital Ligature OE
// smart single/ double quotes (from MS)
$trans_tbl[chr(145)] = '‘';
$trans_tbl[chr(146)] = '’';
$trans_tbl[chr(147)] = '“';
$trans_tbl[chr(148)] = '”';
$trans_tbl[chr(149)] = '•'; // Bullet
$trans_tbl[chr(150)] = '–'; // En Dash
$trans_tbl[chr(151)] = '—'; // Em Dash
$trans_tbl[chr(152)] = '˜'; // Small Tilde
$trans_tbl[chr(153)] = '™'; // Trade Mark Sign
$trans_tbl[chr(154)] = 'š'; // Latin Small Letter S With Caron
$trans_tbl[chr(155)] = '›'; // Single Right-Pointing Angle Quotation Mark
$trans_tbl[chr(156)] = 'œ'; // Latin Small Ligature OE
$trans_tbl[chr(159)] = 'Ÿ'; // Latin Capital Letter Y With Diaeresis
ksort($trans_tbl);
}
// escape HTML
return strtr($text, $trans_tbl);
}
Actually the problem is not happening in PHP but it is happening in JavaScript, it is due to copy/paste from Word, so you need to solve your problem in JavaScript before you pass your text to PHP, Please see this answer https://stackoverflow.com/a/6219023/1857295.
So today I was updating some code I made that took some data from a webpage and emailed it to people for convenience. However, I noticed that whoever was typing the text used a program which used some other encoding which had a weird ’ character which was 0xD5 (213) in the Mac Roman set. But when they uploaded it to their website, it came out as Õ. So I used php and did this:
$parsed = str_ireplace("Õ", "'", $parsed);
So I did this and tested it, but it didn't seem to work. Can anyone help me? Thanks!
If this is just a single anomaly you're correcting you can specify it with a hex escape sequence like:
$parsed = str_replace("\xD5", "'", $parsed);
The reason just "Õ" isn't working is the encoding of your PHP file doesn't represent Õ as 0xD5. Strings are just byte sequences and what you're giving str_ireplace don't match. (Well, that and str_ireplace is gonna do funky things with it, str_replace is preferred here.)
More appropriate to handle the problem in general would be to use iconv to convert the input string from whatever its source encoding is into the output encoding you need.
Examples:
$parsed = iconv('MACINTOSH', 'UTF-8', $parsed);
or
$parsed = iconv('MACINTOSH', 'ASCII//TRANSLIT', $parsed);
The //TRANSLIT here means that when a character can't be represented in the target charset, it'll be approximated through one or several similarly looking characters. There's a lot ASCII (and others) can't represent, so transliteration can come in handy if you're not outputting UTF-8 (which would be ideal.)
I am having a problem with  character on my website.
I have a website where users can use a wysiwyg editor (ckeditor) to fill out their profile. The content is ran through htmlpurify before being put into a database (for security reasons).
The database has all tables setup with UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on has a content-type of utf-8 and I also use the header() function to set the content-type and charset as well.
When displaying the text all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing/not showing the  character for some reason and it leaves behind something which is causing all of my regexes to fail (it seems there is a character there but I cannot view it in the source).
How can I prevent and/or remove this character so I can run the regular expression?
EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.
You are probably facing the situation that the string actually is not properly UTF-8 encoded (as you wrote it is, but it ain't). html_entity_decode might then remove any invalid UTF-8 byte sequences (e.g. single-byte-charset encoding of Â) with a substitution character.
Depending on the PHP version you're using you've got more control how to deal with this by making use of the flags.
Additionally to find the character you can't see, create a hexdump of the string.
Since the character you are talking about exists within the ANSI charset, you can do this:
utf8_encode( preg_replace($match, $replace, utf8_decode($utf8_text));
This will however destroy any unicode character not existing within the ANSI charset. To avoid this you can always try using mb_ereg_replace which has multibyte (unicode) support:
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )
I have found a lot of varying / inconsistent information across the web on this topic, so I'm hoping someone can help me out with these issues:
I need a function to cleanse a string so that it is safe to insert into a utf-8 mysql db or to write to a utf-8 XML file. Characters that can't be converted to utf-8 should be removed.
For writing to an XML file, I'm also running into the problem of converting html entities into numeric entities. The htmlspecialchars() works almost all the time, but I have read that it is not sufficient for properly cleansing all strings, for example one that contains an invalid html entity.
Thanks for your help, Brian
You didn't say where the strings were coming from, but if you're getting them from an HTML form submission, see this article:
Setting the character encoding in form submit for Internet Explorer
Long and short, you'll need to explicitly tell the browser what charset you want the form submission in. If you specify UTF-8, you should never get invalid UTF-8 from a browser. If you want to protect yourself against ANY type of malicious attack, you'll need to use iconv:
http://www.php.net/iconv
$utf_8_string = iconv($from_charset, $to_charset, $original_string);
If you specify "utf-8" as both $from_charset and $to_charset, iconv() should return an error if $original_string contains invalid UTF-8.
If you're getting your strings from a different source and you know the character encoding, you can still use iconv(). Typical encodings in the US are CP-1252 (Windows) and ISO-8859-1 (everything else.)
Something like this?
function cleanse($in) {
$bad = Array('”', '“', '’', '‘');
$good = Array('"', '"', '\'', '\'');
$out = str_replace($bad, $good, $in);
return $out;
}
You can convert a string from any encoding to UTF-8 with iconv or mbstring:
// With the //IGNORE flag, this will ignore invalid characters
iconv('input-encoding', 'UTF-8//IGNORE', $the_string);
or
mb_convert_encoding($the_string, 'UTF-8', 'input-encoding');
I have a form with a textarea. Users enter a block of text which is stored in a database.
Occasionally a user will paste text from Word containing smart quotes or emdashes. Those characters appear in the database as: –, ’, “ ,â€
What function should I call on the input string to convert smart quotes to regular quotes and emdashes to regular dashes?
I am working in PHP.
Update: Thanks for all of the great responses so far. The page on Joel's site about encodings is very informative: http://www.joelonsoftware.com/articles/Unicode.html
Some notes on my environment:
The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8 (Update:) by explicitly setting the meta content-type.
On those pages the smart quotes and emdashes appear as a diamond with question mark.
Solution:
Thanks again for the responses. The solution was twofold:
Make sure the database and HTML
files were explicitly set to use
UTF-8 encoding.
Use htmlspecialchars() instead of
htmlentities().
This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html
The mysql database is using UTF-8
encoding. Likewise, the html pages
that display the content are using
UTF-8.
The content of the HTML can be in UTF-8, yes, but are you explicitly setting the content type (encoding) of your HTML pages (generated via PHP?) to UTF-8 as well? Try returning a Content-Type header of "text/html;charset=utf-8" or add <meta> tags to your HTMLs:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
That way, the content type of the data submitted to PHP will also be the same.
I had a similar issue and adding the <meta> tag worked for me.
It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.
Here is some info on migrating your database to another character encoding, at least for a MySQL database.
This is an unfortunately all-too-common problem, not helped by PHP's very poor handling of character sets.
What we do is force the text through iconv
// Convert input data to UTF8, ignore any odd (MS Word..) chars
// that don't translate
$input = iconv("ISO-8859-1","UTF-8//IGNORE",$input);
The //IGNORE flag means that anything that can't be translated will be thrown away.
If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded.
We would often use standard string replace functions for that. Even though the nature of ASCII/Unicode in that context is pretty murky, it works. Just make sure your php file is saved in the right encoding format, etc.
In my experience, it's easier to just accept the smart quotes and make sure you're using the same encoding everywhere. To start, add this to your form tag: accept-charset="utf-8"
You could try mb_ convert_encoding from ISO-8859-1 to UTF-8.
$str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
This assumes you want UTF-8, and convert can find reasonable replacements... if not, mb_str_replace or preg_replace them yourself.
This may not be the best solution, but I'd try testing to find out what PHP sees. Let's say it sees "–" (there are a few other possibilities, like simple "“" or maybe "“"). Then do a str_replace to get rid of all of those and replace them with normal quotes, before stuffing the answer in a database.
The better solution would probably involve making the end-to-end data passing all UTF-8, as people are trying to help with in other answers.
You have to be sure your database connection is configured to accept and provide UTF-8 from and to the client (otherwise it will convert to the "default", which is usually latin1).
In practice this means running a query SET NAMES 'utf8';
http://www.phpwact.org/php/i18n/utf-8/mysql
Also, smart quotes are part of the windows-1252 character set, not iso-8859-1 (latin-1). Not very relevant to your problem, but just FYI. The euro symbol is in there as well.
the problem is on the mysql charset, I fixed my issues with this line of code.
mysql_set_charset('utf8',$link);
You have to manually change the collation of individual columns to UTF8; changing the database overall won't alter these.
If you were looking to escape these characters for the web while preserving their appearance, so your strings will appear like this: “It’s nice!” rather than "It's boring"...
You can do this by using your own custom htmlEncode function in place of PHP's htmlentities():
$trans_tbl = false;
function htmlEncode($text) {
global $trans_tbl;
// create translation table once
if(!$trans_tbl) {
// start with the default set of conversions and add more.
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl[chr(130)] = '‚'; // Single Low-9 Quotation Mark
$trans_tbl[chr(131)] = 'ƒ'; // Latin Small Letter F With Hook
$trans_tbl[chr(132)] = '„'; // Double Low-9 Quotation Mark
$trans_tbl[chr(133)] = '…'; // Horizontal Ellipsis
$trans_tbl[chr(134)] = '†'; // Dagger
$trans_tbl[chr(135)] = '‡'; // Double Dagger
$trans_tbl[chr(136)] = 'ˆ'; // Modifier Letter Circumflex Accent
$trans_tbl[chr(137)] = '‰'; // Per Mille Sign
$trans_tbl[chr(138)] = 'Š'; // Latin Capital Letter S With Caron
$trans_tbl[chr(139)] = '‹'; // Single Left-Pointing Angle Quotation Mark
$trans_tbl[chr(140)] = 'Œ'; // Latin Capital Ligature OE
// smart single/ double quotes (from MS)
$trans_tbl[chr(145)] = '‘';
$trans_tbl[chr(146)] = '’';
$trans_tbl[chr(147)] = '“';
$trans_tbl[chr(148)] = '”';
$trans_tbl[chr(149)] = '•'; // Bullet
$trans_tbl[chr(150)] = '–'; // En Dash
$trans_tbl[chr(151)] = '—'; // Em Dash
$trans_tbl[chr(152)] = '˜'; // Small Tilde
$trans_tbl[chr(153)] = '™'; // Trade Mark Sign
$trans_tbl[chr(154)] = 'š'; // Latin Small Letter S With Caron
$trans_tbl[chr(155)] = '›'; // Single Right-Pointing Angle Quotation Mark
$trans_tbl[chr(156)] = 'œ'; // Latin Small Ligature OE
$trans_tbl[chr(159)] = 'Ÿ'; // Latin Capital Letter Y With Diaeresis
ksort($trans_tbl);
}
// escape HTML
return strtr($text, $trans_tbl);
}
Actually the problem is not happening in PHP but it is happening in JavaScript, it is due to copy/paste from Word, so you need to solve your problem in JavaScript before you pass your text to PHP, Please see this answer https://stackoverflow.com/a/6219023/1857295.