PHP to MySQL to CSV to Excel UTF-8

I know this has been discussed several times, but I'm still going crazy over this problem. I have a form with a submit.php action. At first I didn't change anything about the charsets and didn't send any UTF-8 header information. The result was that I could read all the ä, ö, ü etc. correctly inside the database. But exporting them to .csv and importing them into Excel as UTF-8 (I also tested all the other charsets) results in broken characters.
Now what I tried:
PHP:
header("Content-Type: text/html; charset=utf-8");
$mysqli->set_charset("utf8");
MySQL:
I dropped my database and created a new one:
create database db CHARACTER SET utf8 COLLATE utf8_general_ci;
create table ...
I changed my my.cnf and restarted my sql server:
[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci
[mysql]
default-character-set=utf8
If I connect to my db via bash I receive the following output:
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
A php test:
var_dump($mysqli->get_charset());
Giving me:
Current character set: utf8 object(stdClass)#3 (8) { ["charset"]=> string(4) "utf8" ["collation"]=> string(15) "utf8_general_ci" ["dir"]=> string(0) "" ["min_length"]=> int(1) ["max_length"]=> int(3) ["number"]=> int(33) ["state"]=> int(1) ["comment"]=> string(13) "UTF-8 Unicode" }
Now I use:
mysql -uroot -ppw db < require.sql > /tmp/test.csv
require.sql is simply a
select * from table;
And again I'm unable to import it as a CSV into Excel, no matter whether I choose UTF-8 or anything else. It always gives me garbled characters.
Hopefully someone has a hint about what might have gone wrong here.
Cheers
E: TextMate gives me correct output, so it seems that the conversion actually worked and it's an Excel issue? I'm using Microsoft Office 2011.
E2: Also tried the same stuff with latin1 - same issue, cannot import special characters into excel without breaking them. Any hint or workaround?
E3: I found a workaround which is working with the Excel Import feature but not with double clicking the .csv.
iconv -f utf8 -t ISO-8859-1 test.csv > test_ISO.csv
Now I'm able to import the csv into excel using Windows(ANSI). Still annoying to have to use this feature instead of doubleclicking. Also I really don't get why UTF8 isn't working, not even with the import feature, BOM added and the complete database in UTF8.
Comma separation turned out to be a mess as well.
1. CONCAT_WS works only partly because it adds a concat_ws(..) header to the .csv file. Also, "file test.csv" doesn't report "comma separated". This means that even though everything is separated by commas, Excel won't notice it on double click.
2. sed/awk: I found some code snippets, but all of them separated the table very badly. E.g. the column street, "streetname number", became 'streetname','number', which made two columns out of one and broke the table.
So it seems to me that Excel can only open .csv with a double click which
a) Are encoded with ISO-8859-1 (and only under windows because standard mac charset is Macintosh)
b) File having the attribute "comma separated". This means if I create a .csv through Excel itself the output of
file test1.csv
would be
test1.csv: ISO-8859 text, with CRLF line terminators
while a iconv changed charset with RegEx used for adding commas would look like:
test1.csv: ISO-8859 text
Pretty weird behaviour - maybe someone got a working solution.

That's how I save the data taken from utf-8 mysql tables.
You need to add BOM first.
Example:
<?php
$fp = fopen(dirname(__FILE__).'/'.$filename, 'wb');
fputs($fp, "\xEF\xBB\xBF");
fputcsv($fp, array($utfstr_1, $utfstr_2));
fclose($fp);
Make sure that you also tell MySQL you're going to use UTF-8:
mysql_query("SET CHARACTER SET utf8");
mysql_query("SET NAMES utf8");
You need to execute this before you select any data.
It probably won't hurt to set the locale as well: setlocale(LC_ALL, "en_US.UTF-8");
Hope it helps.
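For comparison, here is a minimal sketch of the same BOM idea in Python rather than PHP. The function name write_csv_with_bom and the sample rows are made up for illustration; the only assumption is that the "utf-8-sig" codec is used to prepend the EF BB BF bytes that the fputs() call above writes by hand.

```python
import csv
import io


def write_csv_with_bom(rows, delimiter=","):
    """Return CSV bytes that start with a UTF-8 BOM (EF BB BF).

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO(newline="")  # keep the writer's line endings untouched
    writer = csv.writer(buffer, delimiter=delimiter)  # ends rows with \r\n by default
    writer.writerows(rows)
    # The "utf-8-sig" codec prepends the BOM when encoding.
    return buffer.getvalue().encode("utf-8-sig")


data = write_csv_with_bom([["name", "street"], ["Jörg", "Otivägen 32"]])
print(data[:3])  # the three BOM bytes
```

Whether Excel then picks up UTF-8 on double click still depends on the Excel version, as the question shows.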

Thanks everyone for the help. I finally managed to get a working, double-clickable CSV file that opens with separated columns and displays the letters correctly.
For those who are interested in a good workflow here we go:
1.) My database is completely using UTF8.
2.) I export a form into my database via php. I'm using mysqli and as header information:
header("Content-Type: text/html; charset=ISO-8859");
I know this makes everything look crappy inside the database, feel free to use utf8 to make it look correctly but it doesn't matter in my case.
3.) I wrote a script executed by a cron daemon which
a) removes the .csv files which were created previously
rm -f path/to/csv ##I have 3 due to some renaming see below
b) creating the new csv using mysql (this is still UTF8)
mysql -hSERVERIP -uUSER -pPASS DBNAME -e "select * from DBTABLE;" > PATH/TO/output.csv
Now you have a tab-separated .csv and (if you exported from PHP in UTF8) it will display correctly in OpenOffice etc. but not in Excel. Even an import as UTF8 isn't working.
c) Making the file SEMICOLON separated (Excel standard; double clicking a comma-separated file won't work, at least not with the European version of Excel). I used a small Python script, semicolon.py:
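import sys
import csv

# Read tab-separated rows from stdin (the output of the mysql client)
tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
# Write the same rows to stdout, separated by semicolons
commaout = csv.writer(sys.stdout, delimiter=";")
for row in tabin:
    commaout.writerow(row)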
d) Now I had to call the script inside my cron sh file:
/usr/bin/python PATH/TO/semicolon.py < output.csv > output_semi.csv
Make sure you use the full path for every file if you run the script from cron.
e) Change the charset from UTF8 to ISO-8859-1 (Windows ANSI Excel standard) with iconv:
iconv -f utf8 -t ISO-8859-1 output_semi.csv > output_final.csv
And that's it. csv opens up on double click on Mac/Windows Excel 2010 (tested).
Maybe this is a help for someone with similar problems. It drove me crazy.
Edit: For some servers you don't need iconv because the output from the database is already ISO8859. You should check your csv after executing the mysql command:
file output.csv
Use iconv only if the charset isn't iso8859-1
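Steps c) and e) could also be combined in one place. This is only a sketch under assumptions, not the workflow I actually run: the function name tsv_to_excel_csv is made up, the input is assumed to be the UTF-8 tab-separated text already decoded to a string, and characters that don't exist in ISO-8859-1 are simply replaced with "?".

```python
import csv
import io


def tsv_to_excel_csv(tsv_text):
    """Convert tab-separated text to semicolon-separated ISO-8859-1 bytes.

    Hypothetical helper; mirrors semicolon.py followed by iconv.
    """
    reader = csv.reader(io.StringIO(tsv_text, newline=""), dialect=csv.excel_tab)
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=";")  # writes CRLF line endings by default
    writer.writerows(reader)
    # errors="replace" turns characters outside ISO-8859-1 into "?"
    return out.getvalue().encode("iso-8859-1", errors="replace")


result = tsv_to_excel_csv("id\tstreet\n1\tOtivägen 32\n")
print(result)
```

The CRLF line endings match what "file" reported for a .csv created by Excel itself.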

Related

utf8 encoding breaks when upgrading from php5.6 to php7.0

I have a simple (custom) CMS accepting markdown and displaying it in a web page. Works fine in php5.6 (using the ondrej/php5 ppa on ubuntu 15.10). Mysql collation set to utf8 everywhere.
Upgrade the server to php7.0 (ondrej/php) and it displays garbage characters. I tried migrating the relevant mysql tables and fields to utf8mb4 / utf8mb4_unicode_ci with no luck.
Downgrade to php5.6 and it all works fine.
I have a hunch it is some strange php setting I don't know about? php.ini default_collation=UTF-8. Couldn't find anything else that worked. phpMyAdmin shows garbage no matter what version of php or server settings, so it is not much help.
What could i try next?
Source text (copied from php5.6 rendered page)
아동 보호 정책에 대한 규정
This Code is part of the
Rendered output (from php7 and phpMyAdmin)
ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì •
This Code is part of the
Use this to change a table to utf8mb4:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
However, if the table was already messed up, then this won't fix it. Do the following to verify:
SELECT col, HEX(col) FROM tbl WHERE ...
For example, 아동 보호 정책에 대한 규정 will show a hex of EC9584 EB8F99 EBB3B4 ED98B8 ECA095 ECB185 EC9790 EB8C80 ED959C EAB79C ECA095. (Please ignore the spaces.)
For Korean text, you should see (mostly) groups of 3 hex bytes of the form Ewxxyy, where w is A or B or C or D, as shown in the example above. Hex 20 (only 1 byte) represents a space.
ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì • is the Mojibake for it. This implies that somewhere latin1 was erroneously involved, probably when you INSERTed the text. In that case, you will see something like C3AC E280A2 E2809E C3AB C28F E284A2 C3AB C2B3 C2B4 C3AD CB9C C2B8 ... -- mostly 2-byte Cwxx hex.
If you see that, an UPDATE of something like this will repair the data:
CONVERT(BINARY(CONVERT(CONVERT(col USING utf8mb4) USING latin1)) USING utf8mb4) (Edit: removed call to UNHEX.)
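The hex patterns described above can be reproduced outside MySQL. This is a small Python sketch for illustration, not something MySQL runs: hex_groups is a made-up helper, and Python's cp1252 codec is used as a stand-in for what MySQL calls latin1 (only the first character is misread here, because Python's cp1252 leaves byte 8F undefined).

```python
def hex_groups(text):
    """Hex of the UTF-8 bytes of each character, like HEX(col) split per character."""
    return [ch.encode("utf-8").hex().upper() for ch in text]


good = "아동"
# Simulate the damage: the UTF-8 bytes of one character misread as cp1252.
bad = "아".encode("utf-8").decode("cp1252")

print(hex_groups(good))  # 3-byte groups starting with E
print(hex_groups(bad))   # one character became three, starting with a 2-byte C3xx group
```

If HEX(col) looks like the second list, the data is double-encoded and needs the repair; if it looks like the first, the table is fine and the problem is in the connection or the page.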

php utf-8 encoding for chinese text

I am doing migration to generate SQL from one DB to another.
I am trying to get the output
But when I did a mb_convert_encoding("Mr.Wang (王老板)", 'UTF-8', 'Windows-1252')
I have the output as
I have those two extra "box". Any idea what am I doing wrong?
phpMyAdmin is able to export my old database containing chinese text in correct format, how do it do that in script?
*updated the images to better show my view
Have you tried setting the header in the script to UTF8? What I normally use is the following:
header('Content-Type: text/html; charset=utf-8');
That has worked for me so far for German characters & some Arabic & Japanese etc.
I found that I actually need to
mysql_query("SET NAMES 'utf8'");
before my select statement. And I do not need to run mb_convert_encoding("Mr.Wang (王老板)", 'UTF-8', 'Windows-1252') at all.
Now if I write my insert sql I got the correct text i wanted.

mysql & UTF8 Issue with arabic

This might look like the usual UTF-8 and Arabic issue with a MySQL database, but I searched for a solution and found none.
My database collation is set to utf8_general_ci.
My PHP pages were encoded as ANSI by default,
and the Arabic text in the database shows as: ãÌÑÈ
Then I changed the pages to UTF-8.
If I add new input to the database now, the Arabic text in the database shows as: زين
I don't care how it shows in the database as long as it shows normally in the PHP page.
After changing the PHP page to UTF-8, when I add input and then retrieve it, it shows the result as it should.
But the old data, which was added before converting the page encoding to UTF-8, shows like this: �����
I tried a lot of methods for fixing this, like using iconv in SSH and PHP, utf8_decode(), utf8_encode() and more, but none worked.
So I was hoping that you have a solution for me here?
Update: the main goal was solved by retrieving the data from a PHP page in the old encoding, 'windows-1256', and then updating it from SSH.
But one issue is left:
I have some text that was inserted as 'windows-1256' and other text that was inserted as 'utf-8'. The windows-1256 text was converted to UTF-8 and works fine, but the original UTF-8 text was converted as well, into something unreadable, by using iconv in PHP with the old page encoding.
So is there a way to check what the original encoding is, in order to decide whether to convert or not?
Try running the query SET NAMES utf8 right after creating the DB connection, before running any other query.
Such as :
$dbh = new PDO('mysql:dbname='.DB_NAME.';host='.DB_HOST, DB_USER, DB_PASSWORD);
$dbh->exec('set names utf8');
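On the follow-up question (how to tell which rows are already UTF-8): one common heuristic is to try a strict UTF-8 decode first and fall back to the old code page only when that fails. A rough sketch in Python, assuming you can get at the raw bytes of each value; decode_mixed is a made-up name, and a short windows-1256 string could in rare cases also happen to be valid UTF-8, so check a sample before running it on everything.

```python
def decode_mixed(raw, legacy="cp1256"):
    """Decode bytes as UTF-8 if they are valid UTF-8, otherwise as the legacy code page.

    Hypothetical helper; a heuristic, not a guarantee.
    """
    try:
        return raw.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        return raw.decode(legacy, errors="replace")


old_row = "زين".encode("cp1256")  # inserted while the page was windows-1256
new_row = "زين".encode("utf-8")   # inserted after the switch to UTF-8

print(decode_mixed(old_row) == decode_mixed(new_row))
```

This way text that is already UTF-8 is left alone instead of being converted a second time.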

MySQL Character encoding (öä) in PHP application

Hello, I have a character encoding problem in my application and thought I'd ask for some help, because I couldn't solve the problem even though I was given some guidance. So here goes:
My Ä and Ö characters are shown in the browser as: �
I will also post all what I have done so far trying to solve the problem:
1) Database: I have tried changing the collation of my tables, here are some info what SHOW TABLE STATUS gives for one of my tables:
Name = test_groups Engine = InnoDB Version = 10 Row_format = Compact
Collation = utf8_swedish_ci
Database character variables gives:
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
(character_set_database is latin1. I wonder, is this the cause?)
2) In apache httpd.conf I have:
AddDefaultCharset UTF-8
3) In my Zend-application application.ini:
resources.view.encoding = "UTF-8"
4) In my firefox 14.0.1 browser
edit->preferences->content->advanced->Default character encoding =
Unicode (UTF-8)
5) In my php code meta-tag:
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
Now here's also few other interesting things: When I look at my page and change from firefox
View->Character encoding->Western (ISO-8859-1)
, the �-characters which came from the MySQL database turn out ok to öä-characters, but the öä-characters that come from my php-code turn into ät-characters.
Another thing when I check the encoding of the data coming from my MySQL-database with
mb_detect_encoding($DATA_FROM_MYSQL_DATABASE)
it outputs UTF-8!! Then lastly if I do in the code:
utf8_encode($DATA_FROM_MYSQL_DATABASE)
and output the result the problem disappears that is �-characters -> öä-characters. So what's going on here x) All help appreciated
Are you sending SET NAMES utf8 in your PHP as the first query to MySQL ? That could be the cause if not.
SET NAMES indicates what character set the client will use to send SQL
statements to the server. Thus, SET NAMES 'cp1251' tells the server,
“future incoming messages from this client are in character set
cp1251.” It also specifies the character set that the server should
use for sending results back to the client. (For example, it indicates
what character set to use for column values if you use a SELECT
statement.)
SET NAMES utf8 in MySQL? has more detail about how and why.
Troubleshoot:
Check your database (with PHPMyAdmin, for instance). Are the characters correctly stored? Or does it seem gibberish?
If the characters in the database are ok, then the problem happens when retrieving. If they are stored incorrectly (as I would guess they are), then the problem is in the "storing".
Check your source code file and verify if they are encoded in UTF-8.
Force mysql connection to use UTF8 (mysqli::set_charset('utf8') or mysql_set_charset('utf8') or PDO: Add charset to the connection string (charset=utf8) )
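To see where the � comes from, here is a small Python sketch (not PHP, just for illustration). It assumes the bytes arriving from MySQL are really Latin-1, which is what the utf8_encode() observation in the question suggests; the variable names are made up.

```python
# Bytes as they would arrive over a latin1 connection.
latin1_bytes = "öä".encode("latin-1")

# A page declared as UTF-8 cannot decode them, so each byte
# is shown as the replacement character U+FFFD.
shown = latin1_bytes.decode("utf-8", errors="replace")

# What utf8_encode() effectively does: treat the bytes as Latin-1, re-encode as UTF-8.
fixed = latin1_bytes.decode("latin-1").encode("utf-8")

print(shown)
print(fixed.decode("utf-8"))
```

Setting the connection charset makes MySQL do this conversion itself, so the utf8_encode() call is no longer needed.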

UTF-8, PHP and XML Mysql

I am having great problems solving this one:
I have a mysql database encoding latin1_swedish_ci and a table that stores names and addresses.
I am trying to output a UTF-8 XML file, but I am having problems with the following string:
Otivägen is being output as OtivÃ¤gen when I open the file in vim. Also, when it is opened in IE I get
"An invalid character was found in text content. Error processing resource"
I have the following code:
function fixEncoding($in_str)
{
    $cur_encoding = mb_detect_encoding($in_str);
    if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8"))
        return $in_str;
    else
        return utf8_encode($in_str);
}
header("Content-type: text/plain;charset=utf-8");
$mystring = "Otivägen"; // this is actually obtained from database
$myxml = "<myxml>
....
<node>".$mystring."</node>
....
</myxml>
";
$myxml = fixEncoding($myxml);
The actual XML output is below:
<?xml version="1.0" encoding="UTF-8" ?>
<myxml>
....
<node>OtivÃ¤gen</node>
....
</myxml>
Any ideas how I can output the file so that in vim it reads Otivägen and not OtivÃ¤gen?
EDIT:
I did mysql_client_encoding() and got latin1
I then did mysql_set_charset()
and again ran mysql_client_encoding() and got utf8, but still the same outputting issues.
Edit 2
I have logged into the command line and run the query SELECT address1 FROM address WHERE id = 1000;
SELECT address1 FROM address WHERE id = 1000;
Current database: ftpuser_db
+-------------+
| address1 |
+-------------+
| Otivägen 32 |
+-------------+
1 row in set (0.06 sec)
Thanks in advance!
I think you did everything correctly, except that your terminal is in Latin-1.
The UTF-8 sequence for ä is C3 A4, which is Ã¤ if displayed as Latin-1.
Is your MySQL connection encoding properly set to UTF-8 ?
Check mysql_set_charset() and mysql_client_encoding() for more details.
Oh boy. UTF8 issues can be a real pain and they get almost impossible to solve when something is doing re-encodings for you.
You really need to start at one end and make sure every process is UTF8. That will remove things in the process from interpreting the data wrong and 'converting' it for you. But significantly, it will also let you much more easily spot when something has already mis-encoded text for you (yes, I've had that problem).
And if you have UTF8 data in tables that aren't set to UTF8 and might be mis-encoded, you need to do the tables last, after the data has been re-encoded. Otherwise you will damage your data irretrievably. I've had that problem, too.
First steps:
Check your terminal is UTF8 compliant. Gnome-terminal is. Kterm is. ETerm is not.
Check your LANG setting in your shell. It should probably have .UTF-8 on the end of its value.
Check that vim is picking up the UTF8 setting correctly. You can check with :set encoding
This will mean that your files will be edited in UTF8.
Now we check MySQL.
In the MySQL CLI, do show variables like 'character_set%';. The results will probably be something like:
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
What you're aiming for is to change all those latin1 values (or whatever you're seeing) to utf8.
set names utf8; will change most of them and you might need to do that with every new connection in your database. This was the solution I had to adopt in a previous application. The other settings to change are in the my.cnf file for which I need to direct you to the documentation. It is unlikely you will need to set them all.
I see you're already setting the output headers, so that's good.
Now you can look at the data from the database and see why it's "wrong".
latin1_swedish_ci is a collation, not a charset. Since collations are supposed to match their charset, it suggests that the table is using latin1, but it's not a guarantee.
Strictly speaking, the charset of tables is irrelevant here, since MySql can convert input/output. That's what the connection charset (mysql_set_charset) is for. However, for that to work properly, the data needs to be encoded properly in the database. I would begin by checking that strings are correct in the database. Simplest thing is to log in on the command line and select a row which has non-ascii characters in it. Does it look OK?
$mystring = "Otivägen"; // this is actually obtained from database
Watch out. The encoding of the data in $mystring will now depend on the encoding of the php file. That may or may not be the same as the data in the database.
before output run query SET NAMES utf8
after output you can go back and run SET NAMES latin1
Look here, I've got the same problem
It seems you are "double encoding" Otivägen. You get this behaviour if Otivägen already is UTF-8, and run utf8_encode() on it again. Example:
$str = "Otivägen"; // already an UTF-8 string
echo utf8_encode($str); // outputs OtivÃ¤gen
I'm not sure where the actual "double encoding" occurs, but it may be due to settings in your editor. My theory: let's say you are running Aptana Studio and your actual character set is set to ISO-8859-1 (in Aptana, you can check this by right clicking on a file and choosing "properties". To set the default character encoding for all projects, choose Preferences from the Aptana main menu -> General -> workspace). If that's the case, the actual PHP source file where you have $myxml and its string <myxml><node>... is detected to be ISO-8859-1, but $mystring received from the database is UTF-8. Your fixEncoding function would then run the else clause, since $myxml as a whole is seen as ISO-8859-1 and not UTF-8. This results in double encoding the results from the database, and may be the cause of your problem.
Check the encoding of your actual source file in your editor, and verify that it is set to UTF-8. Alternatively, experiment with applying or removing fixEncoding/utf8_encode/utf8_decode to $myxml. Observe the results and see what needs to be done to get the value Otivägen right.
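The double encoding itself is easy to reproduce and undo in a few lines. This is a Python sketch for illustration only (the question's code is PHP); undo_double_encoding is a made-up helper, and it assumes the wrong step was "UTF-8 bytes read as Latin-1", which is what utf8_encode() does to a string that is already UTF-8.

```python
def undo_double_encoding(text):
    """Reverse one round of 'UTF-8 bytes misread as Latin-1'.

    Hypothetical helper; returns the text unchanged if it does not
    look double-encoded.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Reproduce the damage: UTF-8 bytes interpreted as Latin-1 characters.
double = "Otivägen".encode("utf-8").decode("latin-1")

print(double)
print(undo_double_encoding(double))
```

A correctly encoded string fails the strict UTF-8 decode and is returned as it is, so running the helper on clean data does no harm in this case.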
