Some rows in my database contain an apostrophe of sorts that, when displayed with PHP, is converted to a diamond with a question mark in the center. Example, if it copies correctly: Captain Jim O’Brien
These "apostrophes" were inserted most likely via TinyMCE, where the user was copying and pasting from Word, or something from a Mac computer perhaps.
How can I display these "apostrophes"? When I view the row in PHPMyAdmin, the apostrophes are displayed (no diamond), so there is obviously a way.
My character encoding is set to UTF-8, and I've tried htmlspecialchars($string) and htmlentities($string), with no luck.
Characters are encoded in different places.
MySQL has a particular character encoding. Before MySQL 8.0 the default is latin1, not UTF-8 (8.0 and later default to utf8mb4).
The HTML document you generate using PHP also has a particular character encoding specified. Finally, the actual bytes in the HTML document are in some particular character encoding, which, if you're not careful, can differ from the encoding you declare for the document.
Verify that your MySQL encoding is set to UTF-8 as a first step. Note that MySQL can have the default character encoding for the database overridden on a per-table or even per-column basis.
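To make the mismatch concrete, here is a minimal sketch of what probably happens to that apostrophe. It assumes the character is U+2019 (the curly apostrophe Word produces), that the bytes reach PHP as Windows-1252, and that the mbstring extension is available; none of that is confirmed by the question.

```php
<?php
// Assumption: the apostrophe is U+2019 (right single quotation mark).
$asCp1252 = "Captain Jim O\x92Brien";         // one byte, 0x92, in Windows-1252
$asUtf8   = "Captain Jim O\xE2\x80\x99Brien"; // three bytes in UTF-8

// A lone 0x92 byte is not valid UTF-8, which is why a page declared as
// UTF-8 shows the replacement character (the diamond) in its place.
$cp1252IsValidUtf8 = mb_check_encoding($asCp1252, 'UTF-8');
$utf8IsValidUtf8   = mb_check_encoding($asUtf8, 'UTF-8');

// Converting from the real source encoding recovers the intended text.
$repaired = mb_convert_encoding($asCp1252, 'UTF-8', 'Windows-1252');

var_dump($cp1252IsValidUtf8, $utf8IsValidUtf8, $repaired === $asUtf8);
```

Note that htmlspecialchars and htmlentities cannot help here, because the bytes are already in the wrong encoding before they reach those functions.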
You may be interested in this related post to get a deeper understanding of character encoding
Character Encoding and the ’ Issue
Update
Something put the data into the MySQL database in the first place. Perhaps that "something" was not using UTF-8 encoding.
Related
I'm currently making a quiz-game with Construct 2 (HTML5), but I'm getting a problem with encoding when using interpunct (·), also known as middle dot. I'm currently generating an XML document with PHP. Encoding set to UTF-8.
All my documents are saved as UTF-8.
I have set MySQL tables to utf8mb4.
All HTML has meta charset set to UTF-8.
XML has encoding UTF-8.
If I set $conn->set_charset("utf8"); in the PHP document, it seems to be OK, but middle dot is returned as · in my game, but normal if I view the XML in browser.
If I don't set MySQL charset, it works in game, but I can't view it in the browser, but gets an encoding error on the middle dot line.
Something is not set to utf8/utf8mb4 (or, outside MySQL, UTF-8). Search for "Mojibake" in Trouble with utf8 characters; what I see is not what I stored for discussion of what probably went wrong.
Or it could be "double encoded". So please follow the tips there (SELECT HEX...) to see which of these you have:
C382C2B7 double encoding - bad
C2B7 utf8 -- should see this
B7 latin1 -- should not see this
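If you want to see where those three hex strings come from, this small sketch reproduces them in PHP for the middle dot. It assumes the mbstring extension is loaded; the variable names are just for illustration.

```php
<?php
// The middle dot (U+00B7) correctly encoded as UTF-8.
$utf8 = "\xC2\xB7";

// The same character as a single latin1 byte.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// "Double encoding": the UTF-8 bytes are mistaken for latin1
// and converted to UTF-8 a second time.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

foreach (['double' => $double, 'utf8' => $utf8, 'latin1' => $latin1] as $label => $bytes) {
    echo str_pad($label, 8), strtoupper(bin2hex($bytes)), "\n";
}
```

Comparing this output with what SELECT HEX(col) returns for a real row tells you which of the three cases applies.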
I have the following problem: on a very simple php-mysqli query:
if ( $result = $mysqli->query( $sqlquery ) )
{
$res = $result->fetch_all();
$result->close();
}
I get strings wrongly encoded as Western-encoded strings, although the database, the table and the column all use the utf8_general_ci collation. The PHP script itself is UTF-8 encoded, and the MySQL-less parts of the script get the correct encoding. So, say, echo "ő" works perfectly, but echo $res[0] from the previous example outputs the bytes EF BF BD (the U+FFFD replacement character) when the page is viewed in the correct UTF-8 encoding. If I manually switch the browser's encoding to Western, the mysqli-sourced strings are decoded correctly, except for the non-Western characters, which are replaced with "?".
What makes it even stranger is that on my development environment this isn't happening, while on my webserver it is. The developer environment is a LAMP stack (The Uniform Server), while the webserver uses nginx.
In this case, I entered the data in the database using phpMyAdmin, and inside phpMyAdmin it displays perfectly. phpMyAdmin's collation is UTF-8 too. I believe the problem must be somewhere around here, because on the same webserver, for another site where I enter data through PHP (using POST), the same problem doesn't happen. In that case, the data is displayed correctly both while entering and while viewing it (I mean in the PHP-generated webpages), but the special characters are not correct in phpMyAdmin.
Can you help me work out where to start debugging? Is it connected to PHP, MySQL, nginx or phpMyAdmin?
Use mysqli_set_charset to change the client encoding to UTF-8 just after you connect:
$mysqli->set_charset("utf8");
The client encoding is what MySql expects your input to be in (e.g. when you insert user-supplied text to a search query) and what it gives you the results in (so it has to match your output encoding in order for echo to display things correctly).
You need to have it match the encoding of your web page to account for the two scenarios above and the encoding of the PHP source file (so that the hardcoded parts of your queries are interpreted correctly).
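As a rough sketch, a connection helper along these lines keeps the charset call next to the connect call so it cannot be forgotten. The function name and parameters are hypothetical, and it assumes the mysqli extension; the default of "utf8" simply mirrors the call above (pass "utf8mb4" if your tables use it).

```php
<?php
// Hypothetical helper: open a mysqli connection and set the client
// encoding immediately, before any query is sent.
function openUtf8Connection(
    string $host,
    string $user,
    string $password,
    string $database,
    string $charset = 'utf8'
): mysqli {
    // Turn connection and query errors into exceptions instead of warnings.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $mysqli = new mysqli($host, $user, $password, $database);

    // Must match the encoding of the page and of the PHP source file.
    $mysqli->set_charset($charset);

    return $mysqli;
}
```

After connecting, $mysqli->character_set_name() reports which client encoding is actually in effect, which is handy when comparing the development machine with the webserver.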
Update: How to convert data inserted using latin-1 to utf-8
Regarding data that have already been inserted using the wrong connection encoding there is a convenient solution to fix the problem. For each column that contains this kind of data you need to do:
ALTER TABLE table_name MODIFY column_name existing_column_type CHARACTER SET latin1;
ALTER TABLE table_name MODIFY column_name BLOB;
ALTER TABLE table_name MODIFY column_name existing_column_type CHARACTER SET utf8;
The placeholders table_name, column_name and existing_column_type should be replaced with the correct values from your database each time.
What this does is
Tell MySql that it needs to store data in that column in latin1. This character set contains only a small subset of utf8 so in general this conversion involves data loss, but in this specific scenario the data was already interpreted as latin1 on input so there will be no side effects. However, MySql will internally convert the byte representation of your data to match what was originally sent from PHP.
Convert the column to a binary type (BLOB) that has no associated encoding information. At this point the column will contain raw bytes that are a proper utf8 character string.
Convert the column to its previous character type, telling MySql that the raw bytes should be considered to be in utf8 encoding.
WARNING: You can only use this indiscriminate approach if the column in question contains only incorrectly inserted data. Any data that has been correctly inserted will be truncated at the first occurrence of any non-ASCII character!
Therefore it's a good idea to do it right now, before the PHP side fix goes into effect.
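To see why the three ALTER statements work, it may help to simulate the byte-level effect in PHP. This is only an illustration of the idea, assuming mbstring and using ISO-8859-1 as a stand-in for MySQL's latin1; it does not touch a database.

```php
<?php
// What PHP originally sent: "é" as UTF-8 (2 bytes).
$sentByPhp = "\xC3\xA9";

// The connection was latin1, so MySQL read those 2 bytes as two latin1
// characters and stored each of them as UTF-8 in the utf8 column.
$storedInColumn = mb_convert_encoding($sentByPhp, 'UTF-8', 'ISO-8859-1');

// Step 1 (CHARACTER SET latin1): convert the stored text back to latin1.
$step1 = mb_convert_encoding($storedInColumn, 'ISO-8859-1', 'UTF-8');

// Step 2 (BLOB): drop the encoding label; the bytes stay exactly as they are.
$step2 = $step1;

// Step 3 (CHARACTER SET utf8): relabel the same bytes as UTF-8.
$step3 = $step2;

echo bin2hex($storedInColumn), ' -> ', bin2hex($step3), "\n";
```

The bytes after step 3 are identical to what PHP sent in the first place, which is the whole point of going through BLOB instead of converting directly.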
Use mysqli::set_charset function.
$mysqli->set_charset('utf8'); //returns false if the encoding was not valid... won't happen
http://php.net/manual/en/mysqli.set-charset.php
I haven't used mysqli for some time, but if things are the same, connections by default use the latin swedish encoding (ISO 8859 1).
I will consider your page is already using utf8 encoding by having:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
Inside the <head> tag.
If you already have strings in the latin swedish encoding, you can use mb_convert_encoding:
http://php.net/manual/en/function.mb-convert-encoding.php
$fixedStr = mb_convert_encoding($wrongStr, 'UTF-8', 'ISO-8859-1');
iconv does something very similar: Truth be told, I don't know the difference, but here's the link to the function reference:
http://php.net/manual/en/function.iconv.php
I just realized that you might have some strings in utf8 and others in latin swedish. You can use mb_detect_encoding for that: http://php.net/manual/en/function.mb-detect-encoding.php
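For the mixed case, one possible sketch is below. Note that it swaps in mb_check_encoding rather than mb_detect_encoding, since checking "is this valid UTF-8?" is a narrower question; the helper name is made up, and it assumes anything that is not valid UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, convert everything else
// on the assumption that it is ISO-8859-1.
function ensureUtf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

$alreadyUtf8 = "caf\xC3\xA9"; // "café" in UTF-8
$stillLatin1 = "caf\xE9";     // "café" in ISO-8859-1

var_dump(ensureUtf8($alreadyUtf8) === $alreadyUtf8);
var_dump(ensureUtf8($stillLatin1) === $alreadyUtf8);
```

This is a heuristic: a latin1 string whose bytes happen to form valid UTF-8 will be left untouched, so spot-check the results.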
You can also dump the database and use iconv (cmd line) if you have it installed:
iconv -f ISO-8859-1 -t UTF-8 < currentdb.sql > fixeddb.sql
I have a MySQL table & fields that are all set to UTF-8. The thing is, a previous PHP script, which was in charge of the database writing, was using some other encoding, not sure whether it is in the script itself, the MySQL connection or somewhere else. The result is that although the table & fields are set to UTF-8, we see the wrong chars instead of Chinese.
It looks like that:
Now, the previous scripts (which were in charge of the writing and corrupted the data) can read it well for some reason, but my new script which all encoded in UTF-8, shows chars like ½©. How can that be fixed?
By the sound of it, you have a utf8 column but you are writing to it and reading from it using a latin1 connection, so what is actually being stored in the table is mis-encoded. Your problem is that when you read from the table using a utf8 connection, you see the data that's actually stored there, which is why it looks wrong. You can fix the mis-encoded data in the table by converting to latin1, then back to utf8 via the binary character set (three steps in total).
The original database was in a Chinese encoding – GB-18030 or similar, not Latin-1 – and the bytes that make up these characters, when displayed in UTF-8, show up as a bunch of Latin diacritics. Read each string as GB-18030, convert it to UTF-8, and save.
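If the data really is GB-18030, the conversion itself could look roughly like this. It is a sketch that assumes mbstring with GB18030 support and that you can get at the raw stored bytes; the sample bytes are my own, not taken from the question.

```php
<?php
// "中文" as raw GB18030 bytes (two bytes per character here).
$gbBytes = "\xD6\xD0\xCE\xC4";

// Read the bytes as GB18030 and re-encode them as UTF-8.
$utf8 = mb_convert_encoding($gbBytes, 'UTF-8', 'GB18030');

echo bin2hex($utf8), "\n"; // e4b8ade69687
```

If the bytes went through a latin1 connection on the way in, you would first have to undo that step (as in the previous answer) before this conversion gives sensible results.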
I moved my PHP application to a new server. I use a MySQL 5 database. When I update or insert something into the database, every " and - sign is changed to ?. I use SET NAMES UTF8 and SET CHARACTER SET, but it doesn't work. Any ideas?
SET NAMES UTF8 should be used on every page, when selecting as well as when updating or inserting.
Actually, this query must be run every time you connect to the database. Just add it to your connection code.
You need UTF-8 all the way through to make smart quotes and dashes (“”—) and other non-ASCII characters work reliably:
(1) Ensure that the browser sends you characters encoded to UTF-8. Do this by declaring the page that includes the form to be UTF-8:
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
...
(Ignore <form accept-encoding>, which doesn't work right in IE.)
(2) PHP deals with raw bytes and doesn't care what encoding they're in, but the database does care, so you have to tell it what encoding the bytes from PHP are coming in. This is what SET NAMES is doing, though mysql_set_charset may be preferable.
(3) Once the proper characters have reached the database, it'll need to store them in a Unicode encoding to make sure all characters can fit. Each column can have a different encoding, but you can use DEFAULT CHARACTER SET utf8 when you CREATE table to make all the text columns in it use UTF-8. You can also set the default character set for a database or the whole server to utf8 if you prefer.
If you have already CREATEd the tables and they use a non-UTF-8 collation, you'll have to recreate or alter the tables. You can check the current collation using SHOW FULL COLUMNS FROM sometable;.
(4) Make sure you HTML-encode text you output from PHP using htmlspecialchars() and not htmlentities(), which by default will mess up non-ASCII characters.
[You can, as an alternative to (2) and (3), just use the default Latin-1 encoding for the connection and the table storage, but put UTF-8 bytes in it nonetheless. The disadvantage of this approach is that it'll look wrong to other tools looking at the database, and lower/upper case characters won't compare against each other in the expected case-insensitive way.]
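As an illustration of step (4), here is a small sketch of the difference. It passes the encoding explicitly to both functions, and the ISO-8859-1 argument to htmlentities is there only to reproduce the behaviour you get when it does not know the text is UTF-8; the sample string is made up.

```php
<?php
// UTF-8 text containing a curly apostrophe (U+2019) and some markup.
$text = "Jim O\xE2\x80\x99Brien <admin>";

// Escapes only the HTML-special characters; the UTF-8 bytes pass through.
$safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Treats every byte as a separate latin1 character, so the first byte of
// the apostrophe turns into an entity and the character is destroyed.
$mangled = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

echo $safe, "\n";
echo $mangled, "\n";
```

Passing 'UTF-8' explicitly, as in the first call, avoids depending on whatever default the installed PHP version uses.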
My guess is you are pasting from some text editor which is transforming the " into an angled pretty quote, and transforming your - into an mdash, which is causing both to be represented as ?.
While you set your database to accept UTF8 characters, you probably did not set your webserver/PHP to accept those characters. Try playing with mbstring functions, but check to make sure you aren't using the slanted quotes or dashes.
I have a HTML form that is sometimes submitted with accented characters: à, è, ì, ò, ù
I have a PHP script that exports these form submissions into CSV format, when I look at the CSV format in a text editor (vim or notepad for example) the characters look fine, but when opened with Open Office or Word, I get some funky results: �����
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
What can I do to ensure portability of my CSV file? What's the proper way to handle the encoding?
My HTML file's content type is set as: Content-Type: text/html; charset=utf-8
Data is being stored in MySQL as latin1_swedish_ci collation.
Total encoding confusion! :-)
The table character set
The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.
If you set it to Latin-1 (aka ISO 8859-1), you will not be able to store international characters in your table.
Importantly, the character set does not affect the encoding MySQL uses when communicating with your PHP script.
The table collation specifies rules for sorting.
The connection character set
The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in).
The encoding is set using SET NAMES, e.g. SET NAMES "utf8".
If this does not match the table encoding, MySQL automatically converts data on the fly.
If this does not match your page character set, you'll have to manually perform character set conversion in PHP, using e.g. utf8_encode or mb_convert_encoding.
Page character set
The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.
As an HTTP header, it is not saved when you save the file from within your browser. The information is thus not available to OpenOffice or other programs.
Recommendations
Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.
However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application to guess the encoding, and as you've seen, the guess will be wrong.
I don't know about OpenOffice, but Microsoft Office will assume the Windows "ANSI" encoding, which usually means Latin-1 (or CP1252 to be specific).
Microsoft Office will also cause problems in countries that use "," as a decimal separator, since Office then switches to using ";" as a field separator for CSV-files.
Your best bet is to use Latin-1 for the CSV-file. I'd still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.
If you use UTF-8 for the connection character set (by executing SET NAMES "utf8" after connecting), you'll need to run the text through utf8_decode to convert to Latin-1.
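A sketch of that export step might look like the following. It uses mb_convert_encoding to Windows-1252 in place of utf8_decode (which only targets ISO-8859-1), writes to php://memory purely so the example is self-contained, and the function name is hypothetical.

```php
<?php
// Hypothetical helper: turn one row of UTF-8 fields into a CSV line
// encoded as Windows-1252, which is what Office tends to assume.
function csvLineForOffice(array $utf8Fields, string $delimiter = ','): string
{
    $converted = array_map(
        fn(string $field): string => mb_convert_encoding($field, 'Windows-1252', 'UTF-8'),
        $utf8Fields
    );

    $handle = fopen('php://memory', 'w+');
    fputcsv($handle, $converted, $delimiter, '"', '\\');
    rewind($handle);
    $line = stream_get_contents($handle);
    fclose($handle);

    return $line;
}

$line = csvLineForOffice(["\xC3\xA0", "\xC3\xA8"]); // "à", "è" as UTF-8
echo bin2hex($line), "\n";
```

Characters that do not exist in Windows-1252 are replaced by mbstring's substitute character ("?" by default), so this only makes sense for Western-language data. Pass ';' as the delimiter for locales where Office expects it.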
That entity problem
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
This sounds like you're passing HTML code in an XML context, and is unrelated to character sets. Try running the text through html_entity_decode.
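Roughly, the decode step could look like this. The sample string is invented, and the re-escaping with ENT_XML1 is my assumption about what an XML consumer would need, not something the Salesforce error confirms.

```php
<?php
// Text that was HTML-encoded somewhere along the way.
$fromForm = 'S&atilde;o Paulo &amp; Bras&iacute;lia';

// Turn named HTML entities back into real UTF-8 characters.
$plain = html_entity_decode($fromForm, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// For an XML payload, escape only what XML itself requires.
$forXml = htmlspecialchars($plain, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $plain, "\n";
echo $forXml, "\n";
```

XML only predefines a handful of entities (such as &amp; and &lt;), which is why a named HTML entity like &Atilde; triggers the "referenced, but not declared" error.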
Also, what document type have you set? Is it this one?
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Try using the htmlentities() function for any text that is not showing correctly.
You may also want to have a look at PHP's Normalizer.
Make sure you are writing the CSV file as UTF-8. See http://www.php.net/manual/en/function.fwrite.php#55054 if you are unsure how to.
(Also, your sql table should be using utf8, not latin1)
It's up to you to decide which charset encoding you'll use for writing your CSV file (but note that it must be a conscious decision on your part).
Which charset encoding to use? CSV does not define a charset encoding, so I'd go for some Unicode charset, presumably UTF8. But some CSV consumers (e.g. Excel) might not be happy with it. If you are restricted to "western" languages, then latin1 or its variants (iso-8859-1 or iso-8859-15) might be more appropriate. But then (in any case, actually) you must think about the conversion from user input to your particular encoding, and what to do if there are invalid characters.
(BTW: same consideration goes for the html-input-to-db conversion - you are using latin1 for your database, have you asked yourself what happens if the user types a non-latin1 character ? eg a japanese char ? ).