I am having great trouble solving this one:
I have a MySQL database using the latin1_swedish_ci collation and a table that stores names and addresses.
I am trying to output a UTF-8 XML file, but I am having problems with the following string:
Otivägen is being output as Otivägen when I open the file in vim. Also, when it is opened in IE I get
"An invalid character was found in text content. Error processing resource"
I have the following code:
function fixEncoding($in_str)
{
    // Pass the string through if it is already valid UTF-8,
    // otherwise convert it (utf8_encode assumes ISO-8859-1 input).
    $cur_encoding = mb_detect_encoding($in_str);
    if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8")) {
        return $in_str;
    }
    return utf8_encode($in_str);
}
header("Content-type: text/plain;charset=utf-8");
$mystring = "Otivägen" // this is actually obtained from database;
$myxml = "<myxml>
....
<node>".$mystring."</node>
....
</myxml>
";
$myxml = fixEncoding($myxml);
The actual XML output is below:
<?xml version="1.0" encoding="UTF-8" ?>
<myxml>
....
<node>Otivägen</node>
....
</myxml>
Any ideas how I can output the file so that in vim it reads Otivägen and not Otivägen?
EDIT:
I ran mysql_client_encoding() and got latin1.
I then called mysql_set_charset(),
and running mysql_client_encoding() again gave utf8 - but I still have the same output issues.
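For reference, this is roughly the sequence I ran (a sketch; the connection variable and credentials are placeholders):

$conn = mysql_connect('localhost', 'user', 'pass'); // hypothetical credentials
mysql_select_db('ftpuser_db', $conn);
echo mysql_client_encoding($conn); // latin1
mysql_set_charset('utf8', $conn);  // switch the connection charset
echo mysql_client_encoding($conn); // utf8 -- but the XML output is still wrong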
Edit 2
I have logged into the command line and run the following query:
SELECT address1 FROM address WHERE id = 1000;
Current database: ftpuser_db
+-------------+
| address1    |
+-------------+
| Otivägen 32 |
+-------------+
1 row in set (0.06 sec)
Thanks in advance!
I think you did everything correctly, except that your terminal is in Latin-1.
The UTF-8 sequence for ä is C3 A4, which is ä if displayed as Latin-1.
Is your MySQL connection encoding properly set to UTF-8?
Check mysql_set_charset() and mysql_client_encoding() for more details.
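If you want to convince yourself that the terminal is the culprit, dump the raw bytes. A minimal sketch, assuming the PHP source file itself is saved as UTF-8:

echo bin2hex("ä");  // c3a4 -- the UTF-8 bytes, which a Latin-1 terminal renders as "ä"
echo bin2hex("ä"); // c383c2a4 -- what you get if the string was encoded twice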
Oh boy. UTF8 issues can be a real pain and they get almost impossible to solve when something is doing re-encodings for you.
You really need to start at one end and make sure every process is UTF-8. That will stop steps in the pipeline from interpreting the data wrongly and 'converting' it for you. Just as importantly, it will make it much easier to spot when something has already mis-encoded text for you (yes, I've had that problem).
And if you have UTF8 data in tables that aren't set to UTF8 and might be mis-encoded, you need to do the tables last, after the data has been re-encoded. Otherwise you will damage your data irretrievably. I've had that problem, too.
First steps:
Check your terminal is UTF8 compliant. Gnome-terminal is. Kterm is. ETerm is not.
Check your LANG setting in your shell. It should probably have .UTF-8 on the end of its value.
Check that vim is picking up the UTF8 setting correctly. You can check with :set encoding
This will mean that your files will be edited in UTF8.
Now we check MySQL.
In the MySQL CLI, do show variables like 'character_set%';. The results will probably be something like:
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
What you're aiming for is to change all those latin1 values (or whatever you're seeing) to utf8.
SET NAMES utf8; will change most of them, and you may need to run it on every new connection to your database, as sketched below. This was the solution I had to adopt in a previous application. The other settings are changed in the my.cnf file, for which I need to direct you to the documentation. It is unlikely you will need to set them all.
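In PHP that typically means issuing the statement right after connecting. A sketch with the old mysql_* API the question uses (connection details are placeholders):

$link = mysql_connect('localhost', 'user', 'pass'); // hypothetical credentials
mysql_select_db('mydb', $link);
mysql_query("SET NAMES utf8", $link); // per connection -- repeat for every new connection
// ... subsequent SELECTs now return UTF-8 to PHP

Note that PHP's mysql_set_charset() is generally preferred over a raw SET NAMES, because it also tells the client library which charset to use when escaping strings.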
I see you're already setting the output headers, so that's good.
Now you can look at the data from the database and see why it's "wrong".
latin1_swedish_ci is a collation, not a charset. Since collations are supposed to match their charset, it suggests that the table is using latin1, but it's not a guarantee.
Strictly speaking, the charset of the tables is irrelevant here, since MySQL can convert input/output. That's what the connection charset (mysql_set_charset) is for. However, for that to work properly, the data needs to be encoded properly in the database, so I would begin by checking that the strings are correct there. The simplest way is to log in on the command line and select a row which has non-ASCII characters in it. Does it look OK?
$mystring = "Otivägen" // this is actually obtained from database;
Watch out: the encoding of the data in $mystring will now depend on the encoding of the PHP file, which may or may not match the encoding of the data coming from the database.
Before output, run the query SET NAMES utf8.
After output you can go back and run SET NAMES latin1.
Look here - I've had the same problem.
It seems you are "double encoding" Otivägen. You get this behaviour if Otivägen is already UTF-8 and you run utf8_encode() on it again. Example:
$str = "Otivägen"; // already an UTF-8 string
echo utf8_encode($str); // outputs Otivägen
I'm not sure where the actual "double encoding" occurs, but it may be due to settings in your editor. Here is my theory. Let's say you are running Aptana Studio and your character set is set to ISO-8859-1 (in Aptana you can check this by right-clicking a file and choosing "Properties"; to set the default character encoding for all projects, choose Preferences from the Aptana main menu -> General -> Workspace). If that's the case, the actual PHP source file where you have $myxml and its string <myxml><node>... is detected as ISO-8859-1, but $mystring received from the database is UTF-8. Your fixEncoding function would then run the else clause, since $myxml as a whole is seen as ISO-8859-1 and not UTF-8. This results in double encoding the results from the database, and may be the cause of your problem.
Check the encoding of your actual source file in your editor, and verify that it is set to UTF-8. Alternatively, experiment with applying or removing fixEncoding/utf8_encode/utf8_decode on $myxml. Observe the results and see what needs to be done to get the value Otivägen right.
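One way to check whether a string has been double encoded is to round-trip it. A rough sketch (utf8_decode assumes the intermediate charset was ISO-8859-1):

$str = "Otivägen";         // suspected double-encoded value
$once = utf8_decode($str);   // undo one encoding pass
if ($once !== $str && mb_check_encoding($once, 'UTF-8')) {
    // still valid UTF-8 after decoding once, so $str was likely encoded twice
    echo $once; // Otivägen
}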
Related
I have a simple (custom) CMS accepting Markdown and displaying it in a web page. It works fine in PHP 5.6 (using the ondrej/php5 PPA on Ubuntu 15.10). The MySQL collation is set to utf8 everywhere.
Upgrading the server to PHP 7.0 (ondrej/php) makes it display garbage characters. I tried migrating the relevant MySQL tables and fields to utf8mb4 / utf8mb4_unicode_ci with no luck.
Downgrading to PHP 5.6 makes it all work fine again.
I have a hunch it is some strange PHP setting I don't know about; php.ini default_collation=UTF-8 was one attempt, and I couldn't find anything else that worked. phpMyAdmin shows garbage no matter which PHP version or server settings, so it is not much help.
What could I try next?
Source text (copied from php5.6 rendered page)
아동 보호 정책에 대한 규정
This Code is part of the
Rendered output (from php7 and phpMyAdmin)
ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì •
This Code is part of the
Use this to change a table to utf8mb4:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
However, if the table was already messed up, then this won't fix it. Do the following to verify:
SELECT col, HEX(col) FROM tbl WHERE ...
For example, 아동 보호 정책에 대한 규정 will show a hex of EC9584 EB8F99 EBB3B4 ED98B8 ECA095 ECB185 EC9790 EB8C80 ED959C EAB79C ECA095. (Please ignore the spaces.)
For Korean text, you should see (mostly) groups of 3 hex bytes of the form Ewxxyy, where w is A or B or C or D, as shown in the example above. Hex 20 (only 1 byte) represents a space.
ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì • is the Mojibake for it. This implies that somewhere latin1 was erroneously involved, probably when you INSERTed the text. In that case, you will see something like C3AC E280A2 E2809E C3AB C28F E284A2 C3AB C2B3 C2B4 C3AD CB9C C2B8 ... -- mostly 2-byte Cwxx hex.
If you see that, an UPDATE of something like this will repair the data:
CONVERT(BINARY(CONVERT(CONVERT(col USING utf8mb4) USING latin1)) USING utf8mb4) (Edit: removed call to UNHEX.)
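Wrapped in an actual statement, it would look something like this (a sketch; tbl and col are placeholders, and you should test against a copy of the table first, since this rewrites data):

UPDATE tbl SET col = CONVERT(BINARY(CONVERT(CONVERT(col USING utf8mb4) USING latin1)) USING utf8mb4);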
Saving data to MySQL via PHP 5.5 on WAMP works fine.
Recalling it to the client via json_encode also worked, until I dropped the old data and entered new data into my DB. Then json_encode returned nothing: no error, no data, nothing in the log file.
The new data has German street names (with umlauts etc.).
I replaced the German street names with ASCII codes,
and json_encode worked as I expected, so the problem is sort of resolved.
How does one resolve the issue going forward?
The data in my MySQL InnoDB tables is saved as latin1.
Do I need to filter the data after reading it from the DB, before calling json_encode? Is there some other way?
Use the code below before your SQL query:
mysql_query('SET CHARACTER SET utf8');
And use the following flags for json_encode:
json_encode($array, JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP | JSON_UNESCAPED_UNICODE)
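Put together, a sketch of the whole flow (the query and column names are placeholders):

mysql_query('SET CHARACTER SET utf8');                 // make MySQL return UTF-8
$result = mysql_query('SELECT street FROM addresses'); // hypothetical query
$rows = array();
while ($row = mysql_fetch_assoc($result)) {
    $rows[] = $row;
}
$json = json_encode($rows, JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP | JSON_UNESCAPED_UNICODE);
if ($json === false) {
    // json_encode returns false on invalid UTF-8 input, which matches
    // the "no error - no data" symptom described above
    echo json_last_error_msg();
}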
I know this has been discussed several times, but I'm still going crazy dealing with this problem. I have a form with a submit.php action. At first I didn't change anything about the charsets and didn't use any UTF-8 header information. The result was that I could read all the ä, ö, ü etc. correctly inside the database. But exporting them to .csv and importing them into Excel as UTF-8 (I also tested all the other charsets) results in broken characters.
Now what I tried:
PHP:
header("Content-Type: text/html; charset=utf-8");
$mysqli->set_charset("utf8");
MySQL:
I dropped my database and created a new one:
create database db CHARACTER SET utf8 COLLATE utf8_general_ci;
create table ...
I changed my my.cnf and restarted my sql server:
[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci
[mysql]
default-character-set=utf8
If I connect to my db via bash I receive the following output:
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
A php test:
var_dump($mysqli->get_charset());
Giving me:
Current character set: utf8 object(stdClass)#3 (8) { ["charset"]=> string(4) "utf8" ["collation"]=> string(15) "utf8_general_ci" ["dir"]=> string(0) "" ["min_length"]=> int(1) ["max_length"]=> int(3) ["number"]=> int(33) ["state"]=> int(1) ["comment"]=> string(13) "UTF-8 Unicode" }
Now I use:
mysql -uroot -ppw db < require.sql > /tmp/test.csv
require.sql is simply a
select * from table;
And again I'm unable to import it as a CSV into Excel, no matter whether I choose UTF-8 or anything else. It always gives me gibberish.
Hopefully someone has a hint about what might have gone wrong here.
Cheers
E: TextMate gives me correct output, so it seems that the conversion actually worked and it's an Excel issue? Using Microsoft Office 2011.
E2: Also tried the same thing with latin1 - same issue; I cannot import special characters into Excel without breaking them. Any hint or workaround?
E3: I found a workaround which works with the Excel import feature but not with double-clicking the .csv:
iconv -f utf8 -t ISO-8859-1 test.csv > test_ISO.csv
Now I'm able to import the CSV into Excel using Windows (ANSI). It's still annoying to have to use the import feature instead of double-clicking. I also really don't get why UTF-8 isn't working, not even via the import feature, with a BOM added and the complete database in UTF-8.
Comma separation turned out to be a mess as well.
1. CONCAT_WS only partly works, because it adds a concat_ws(..) header to the .csv file. Also, "file test.csv" doesn't report "comma separated", which means that even though everything is separated by commas, Excel won't notice it on double click.
2. sed/awk: I found some code snippets, but all of them split the table badly. E.g. the street column "streetname number" came out as 'streetname','number', which made two columns out of one and screwed up the table.
So it seems to me that Excel can only open a .csv with a double click if it
a) is encoded in ISO-8859-1 (and only under Windows, because the standard Mac charset is Macintosh)
b) has the attribute "comma separated". This means that if I create a .csv through Excel itself, the output of
file test1.csv
would be
test1.csv: ISO-8859 text, with CRLF line terminators
while an iconv-converted file with commas added via regex looks like:
test1.csv: ISO-8859 text
Pretty weird behaviour - maybe someone has a working solution.
This is how I save data taken from UTF-8 MySQL tables.
You need to add a BOM first.
Example:
<?php
$fp = fopen(dirname(__FILE__).'/'.$filename, 'wb');
fputs($fp, "\xEF\xBB\xBF"); // UTF-8 BOM so Excel detects the encoding
fputcsv($fp, array($utfstr_1, $utfstr_2));
fclose($fp);
Make sure that you also tell MySQL you're going to use UTF-8:
mysql_query("SET CHARACTER SET utf8");
mysql_query("SET NAMES utf8");
You need to execute this before selecting any data.
It probably won't hurt to also set the locale: setlocale(LC_ALL, "en_US.UTF-8");
Hope it helps.
Thanks everyone for the help. I finally managed to get a working, double-clickable CSV file which opens properly separated and displays the letters correctly.
For those who are interested in a good workflow here we go:
1.) My database is completely using UTF8.
2.) I export a form into my database via PHP. I'm using mysqli with this header information:
header("Content-Type: text/html; charset=ISO-8859-1");
I know this makes everything look garbled inside the database; feel free to use UTF-8 so it displays correctly, but it doesn't matter in my case.
3.) I wrote a script executed by a cron daemon which
a) removes the .csv files which were created previously
rm -f path/to/csv ## I have 3 files due to some renaming, see below
b) creates the new CSV using MySQL (this is still UTF-8)
mysql -hSERVERIP -uUSER -pPASS DBNAME -e "select * from DBTABLE;" > PATH/TO/output.csv
Now you have a tab-separated .csv and (if you exported from PHP in UTF-8) it will display correctly in OpenOffice etc., but not in Excel. Even an import as UTF-8 doesn't work.
c) Make the file SEMICOLON separated (the Excel standard; double-clicking a comma-separated file won't work, at least not with the European version of Excel). I used a small Python script, semicolon.py:
import sys
import csv
tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, delimiter=";")
for row in tabin:
    commaout.writerow(row)
d) Now I had to call the script inside my cron sh file:
/usr/bin/python PATH/TO/semicolon.py < output.csv > output_semi.csv
Make sure you use the full path for every file if you run the script from cron.
e) Change the charset from UTF-8 to ISO-8859-1 (the Windows ANSI Excel standard) with iconv:
iconv -f utf8 -t ISO-8859-1 output_semi.csv > output_final.csv
And that's it. The CSV opens on double click in Excel 2010 on Mac/Windows (tested).
Maybe this helps someone with similar problems. It drove me crazy.
Edit: on some servers you don't need iconv, because the output from the database is already ISO-8859-1. You should check your CSV after executing the mysql command:
file output.csv
Use iconv only if the charset isn't already ISO-8859-1.
Hello, I have a character encoding problem in my application and thought I'd ask for some help, because I couldn't solve the problem even though I was given some guidance. So here goes:
My Ä and Ö characters are shown in the browser as: �
I will also post everything I have done so far trying to solve the problem:
1) Database: I have tried changing the collation of my tables. Here is some of what SHOW TABLE STATUS gives for one of them:
Name = test_groups, Engine = InnoDB, Version = 10, Row_format = Compact,
Collation = utf8_swedish_ci
The database character variables give:
character_set_client = utf8
character_set_connection = utf8
character_set_database = latin1 (I wonder, is this the cause?)
character_set_filesystem = binary
character_set_results = utf8
character_set_server = utf8
character_set_system = utf8
2) In apache httpd.conf I have:
AddDefaultCharset UTF-8
3) In my Zend-application application.ini:
resources.view.encoding = "UTF-8"
4) In my Firefox 14.0.1 browser:
Edit -> Preferences -> Content -> Advanced -> Default character encoding = Unicode (UTF-8)
5) In my php code meta-tag:
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
Now here are also a few other interesting things: when I look at my page and switch Firefox to
View -> Character Encoding -> Western (ISO-8859-1),
the �-characters which came from the MySQL database turn into the correct öä characters, but the öä characters that come from my PHP code turn into ät-characters.
Another thing: when I check the encoding of the data coming from my MySQL database with
mb_detect_encoding($DATA_FROM_MYSQL_DATABASE)
it outputs UTF-8!! Then lastly, if I do this in the code:
utf8_encode($DATA_FROM_MYSQL_DATABASE)
and output the result, the problem disappears, i.e. the �-characters become öä-characters. So what's going on here? All help appreciated.
Are you sending SET NAMES utf8 in your PHP as the first query to MySQL? If not, that could be the cause.
SET NAMES indicates what character set the client will use to send SQL
statements to the server. Thus, SET NAMES 'cp1251' tells the server,
“future incoming messages from this client are in character set
cp1251.” It also specifies the character set that the server should
use for sending results back to the client. (For example, it indicates
what character set to use for column values if you use a SELECT
statement.)
SET NAMES utf8 in MySQL? has more detail about how and why.
Troubleshoot:
Check your database (with phpMyAdmin, for instance). Are the characters stored correctly, or do they look like gibberish?
If the characters in the database are OK, then the problem happens when retrieving them. If they are stored incorrectly (as I would guess they are), then the problem is in the storing.
Check your source code files and verify that they are encoded in UTF-8.
Force the MySQL connection to use UTF-8: mysqli::set_charset('utf8'), or mysql_set_charset('utf8'), or with PDO add the charset to the connection string (charset=utf8), as shown below.
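For reference, the three variants look like this (a sketch; hosts and credentials are placeholders):

// mysqli
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');
$mysqli->set_charset('utf8');

// old mysql_* API
$link = mysql_connect('localhost', 'user', 'pass');
mysql_set_charset('utf8', $link);

// PDO: the charset goes straight into the connection string
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');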
Background:
There is a table, events; this table is declared latin1. Individual columns in this table are set to utf8. The column we will cherry-pick to discuss is 'title', which is one of the utf8 columns. The website is set to UTF-8 both via Apache and the meta tag.
As a test, if I save décor or © into the title field and perform
select title, LENGTH(title) as len, CHAR_LENGTH(title) as chlen
from events where length(title) != char_length(title)
I will get décor or ©, 12, 10 back as a result, which is expected and shows that the data has indeed been saved properly into my utf8 column.
However, upon echoing the title out to a page, it's mangled into d�cor or �, which makes no sense to me since, as mentioned before, the character encoding is set to UTF-8 on the page.
Not sure if this final detail makes a difference, but if I edit the page and resubmit the mangled text, it turns into d%uFFFDcor or %uFFFD, both in the database and when displayed on the page. Further submits cause no change.
Actual Question:
Does anyone have an idea as to what I may be doing wrong? :-P
Well, there's likely one of three problems.
1. MySQL's connection is not using UTF-8
This means that it's converted to another charset (likely Latin-1) before it hits PHP. I've found the best solution is to run the following queries:
SET CHARACTER SET utf8;
SET character_set_database = "utf8";
SET character_set_connection = "utf8";
SET character_set_server = "utf8";
2. The page rendered is not really set to UTF-8
Set both the Content-type header and the <meta> tag content types to UTF-8. Some browsers don't respect one or the other...
header ('Content-Type: text/html; charset=UTF-8');
echo '<meta http-equiv="content-type" content="text/html; charset=utf-8" />';
As noted in the comments, that's not the problem...
3. You're doing something to the string before echoing it
Most of PHP's string functions will not do well with UTF-8. If you're calling a normal function that doesn't accept a $charset parameter, chances are it won't work with UTF-8 strings (such as str_replace). If it does have a $charset parameter (like htmlspecialchars), make sure that you set it.
echo htmlspecialchars($content, ENT_COMPAT, 'UTF-8');
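A quick illustration of that last point (a sketch, assuming the source file is saved as UTF-8):

$s = 'décor';
echo substr($s, 0, 2);             // byte-based: cuts the é in half, leaving invalid UTF-8
echo mb_substr($s, 0, 2, 'UTF-8'); // multibyte-aware: "dé"
echo htmlspecialchars($s, ENT_COMPAT, 'UTF-8'); // safe once the charset is given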