UTF-8 encoding breaks when upgrading from PHP 5.6 to PHP 7.0

I have a simple (custom) CMS accepting Markdown and displaying it in a web page. It works fine on PHP 5.6 (using the ondrej/php5 PPA on Ubuntu 15.10), with the MySQL collation set to utf8 everywhere.
After upgrading the server to PHP 7.0 (ondrej/php) it displays garbage characters. I tried migrating the relevant MySQL tables and fields to utf8mb4 / utf8mb4_unicode_ci with no luck.
Downgrading to PHP 5.6 makes it all work fine again.
I have a hunch it is some strange PHP setting I don't know about? php.ini has default_collation=UTF-8. I couldn't find anything else that worked. phpMyAdmin shows garbage no matter what PHP version or server settings I use, so it is not much help.
What could I try next?
Source text (copied from the PHP 5.6 rendered page):
아동 보호 정책에 대한 규정
This Code is part of the
Rendered output (from PHP 7 and phpMyAdmin):
ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì •
This Code is part of the

Use this to change a table to utf8mb4:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
However, if the table was already messed up, then this won't fix it. Do the following to verify:
SELECT col, HEX(col) FROM tbl WHERE ...
For example, 아동 보호 정책에 대한 규정 will show a hex of EC9584 EB8F99 EBB3B4 ED98B8 ECA095 ECB185 EC9790 EB8C80 ED959C EAB79C ECA095. (Please ignore the spaces.)
For Korean text, you should see (mostly) groups of 3 bytes (6 hex digits) of the form Ewxxyy, where w is A, B, C or D, as shown in the example above. Hex 20 (a single byte) represents a space.
ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì • is the Mojibake for it. This implies that somewhere latin1 was erroneously involved, probably when you INSERTed the text. In that case, you will see something like C3AC E280A2 E2809E C3AB C28F E284A2 C3AB C2B3 C2B4 C3AD CB9C C2B8 ... -- mostly 2-byte Cwxx hex.
If you see that, an UPDATE of something like this will repair the data:
CONVERT(BINARY(CONVERT(CONVERT(col USING utf8mb4) USING latin1)) USING utf8mb4) (Edit: removed call to UNHEX.)
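For reference, a sketch of the full repair statement, using the same tbl and col names as above (test it on a copy of the table first, and add a WHERE clause as needed):
UPDATE tbl
    SET col = CONVERT(BINARY(CONVERT(CONVERT(col USING utf8mb4) USING latin1)) USING utf8mb4);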

Related

PHP Propel 1.6 and MySQL - save() using utf8 not working

I have installed Propel 1.6. I can create tables in MySQL with propel commands.
Below are my Propel settings in the file runtime-config.xml:
<propel>
<datasources default="myProject">
<datasource id="myProject">
<adapter>mysql</adapter>
<connection>
<dsn>mysql:host=localhost;dbname=myDBname</dsn>
<user>myUser</user>
<password>mypass</password>
<charset>utf8</charset>
<collate>utf8_unicode_ci</collate>
</connection>
</datasource>
</datasources>
</propel>
The MySQL database and the table User have collation utf8_unicode_ci (see the screenshot below):
mySql collation screenshot
I create a new Patient object to test that everything is OK, using the following code:
$pat = new Patient();
$pat->setEmail("tg#gmail.com");
$pat->setAddress("Η διεύθυνσή μου");
$pat->setAmka("555555555");
$pat->setBirthdate("1966-01-01");
$pat->setFirstname("Τοόνομάμου");
$pat->setLastname("τοεπώνυμόμου");
$pat->setPhone("2109999999");
$pat->setSex(1);
$pat->save();
I checked in debug mode in NetBeans and the object $pat contains the values in the correct format, so I can read them.
After save(), the Greek values show up in MySQL like this:
mySql values saved screenshot
I would like your help to solve this issue.
Thank you in advance.
Τοόνομάμου, when "Mojibaked", becomes Î¤Î¿ÏŒÎ½Î¿Î¼Î¬Î¼Î¿Ï…. Notice the pattern: often Î followed by a second character, as in your screenshot. Apparently, latin1 was involved at some point.
The question "Trouble with UTF-8 characters; what I see is not what I stored" discusses Mojibake and its causes.
It may be that you have "double encoding", which that link also discusses.
If you choose to fix the data rather than start over, see http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Finally, I found a solution.
In MySQL, I checked my settings using the following command:
show variables like 'char%';
I had to replace the utf8 character set settings with utf8mb4.
Everything works perfectly now!
For more info, see https://mathiasbynens.be/notes/mysql-utf8mb4
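For reference, the server-side part of that switch is roughly these my.cnf settings (a sketch; restart MySQL afterwards and re-check with the SHOW VARIABLES query above):
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
[mysql]
default-character-set = utf8mb4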
You must specify the charset in the Propel connection DSN, in your runtime-config.xml file, like this:
<dsn>mysql:host=localhost;dbname=myDBname;charset=UTF8</dsn>
https://github.com/propelorm/sfPropelORMPlugin/issues/74#issuecomment-2011350

Sphinx search doesn't understand special characters (accents)

I have a MySQL db in utf8_general_ci.
And my sphinx.conf is like this:
source jobs
{
type = mysql
sql_sock = /var/run/mysqld/mysqld.sock
sql_query_pre = SET NAMES utf8
...
}
When I query "système" I would like sphinx to search for "système" & "systeme" in the DB.
AND when I query "systeme" I would like sphinx to search for "système" & "systeme" too.
What it does now is removing all the characters before the accents (including the accents themselves). So "système" becomes "me" and "dév" becomes "v"...
PS : I'm using the sphinxapi.php - which shouldn't be preferred over SphinxQL, I know, but it should still work with the api. And I use EXTENDED match mode.
You need to set up your charset_table to be able to do this:
http://sphinxsearch.com/docs/current.html#charsets
Alas, there is no 'magic' config option that just works with text in all languages; you need to set up charset_table for the language(s) you deal with.
Although this is pretty close:
http://sphinxsearch.com/forum/view.html?id=9312
(i.e. it steals the hard work MySQL has done with collations and mimics it in charset_table)
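For example, here is a minimal charset_table sketch that folds a few common French accented letters onto their base letters, so that "système" and "systeme" index identically; the index name is assumed, and the U+ code points are the standard Unicode values (extend the list for your language):
index jobs
{
    ...
    # fold é/è/ê/ë to e, à/â to a, ç to c
    charset_table = 0..9, A..Z->a..z, _, a..z, \
        U+E9->e, U+E8->e, U+EA->e, U+EB->e, U+E0->a, U+E2->a, U+E7->c
}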

MySQL & UTF-8 issue with Arabic

This might look similar to other UTF-8 and Arabic issues with MySQL databases, but I searched for a solution and found none.
My database encoding is set to utf8_general_ci.
My PHP page was encoded as ANSI by default,
and the Arabic text in the database showed as: ãÌÑÈ
Then I changed the page to UTF-8.
If I add new input to the database, the Arabic text in the database shows as: زين
I don't care how it shows in the database as long as it shows normally on the PHP page.
After changing the PHP page to UTF-8, when I add input and then retrieve it, it shows the result as it should.
But the old data, which was added before converting the page encoding to UTF-8, shows like this: �����
I tried a lot of methods to fix this, like using iconv over SSH and in PHP, utf8_decode(), utf8_encode(), and more, but none worked.
So I was hoping you might have a solution for me here.
Update: the main goal was solved by retrieving the data from the PHP page in the old encoding 'windows-1256', then updating it over SSH.
But one issue is left:
I have some text that was inserted as 'windows-1256' and other text that was inserted as 'utf-8'. The windows-1256 text was converted to UTF-8 and works fine, but the text that was originally UTF-8 was converted as well (using iconv in PHP with the old page encoding) into something unreadable.
So is there a way to check what the original encoding of a row is, in order to decide whether to convert it or not?
Try running the query SET NAMES utf8 after creating the DB connection, before running any other query.
Such as:
$dbh = new PDO('mysql:dbname='.DB_NAME.';host='.DB_HOST, DB_USER, DB_PASSWORD);
$dbh->exec('set names utf8');
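Alternatively, on PHP 5.3.6 and later you can put the charset directly into the DSN (as in the Propel answer above), for example:
$dbh = new PDO('mysql:dbname='.DB_NAME.';host='.DB_HOST.';charset=utf8', DB_USER, DB_PASSWORD);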

PHP to MySQL to CSV to Excel UTF-8

I know this has been discussed several times, but I'm still going crazy dealing with this problem. I have a form with a submit.php action. At first I didn't change anything about the charsets and didn't use any UTF-8 header information. The result was that I could read all the ä, ö, ü etc. correctly inside the database. But exporting them to .csv and importing them into Excel as UTF-8 (I also tested all the other charsets) results in incorrectly displayed characters.
Now what I tried:
PHP:
header("Content-Type: text/html; charset=utf-8");
$mysqli->set_charset("utf8");
MySQL:
I dropped my database and created a new one:
create database db CHARACTER SET utf8 COLLATE utf8_general_ci;
create table ...
I changed my my.cnf and restarted my MySQL server:
[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci
[mysql]
default-character-set=utf8
If I connect to my DB via bash, I receive the following output:
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
A php test:
var_dump($mysqli->get_charset());
Giving me:
Current character set: utf8 object(stdClass)#3 (8) { ["charset"]=> string(4) "utf8" ["collation"]=> string(15) "utf8_general_ci" ["dir"]=> string(0) "" ["min_length"]=> int(1) ["max_length"]=> int(3) ["number"]=> int(33) ["state"]=> int(1) ["comment"]=> string(13) "UTF-8 Unicode" }
Now I use:
mysql -uroot -ppw db < require.sql > /tmp/test.csv
require.sql is simply a
select * from table;
And again I'm unable to import it as a CSV into Excel, no matter whether I choose UTF-8 or anything else. It always gives me garbled text.
Hopefully someone has a hint about what might have gone wrong here.
Cheers
Edit: TextMate gives me correct output, so it seems the conversion actually worked and it's an Excel issue? I'm using Microsoft Office 2011.
Edit 2: I also tried the same thing with latin1; same issue, I cannot import special characters into Excel without breaking them. Any hint or workaround?
Edit 3: I found a workaround which works with the Excel import feature but not with double-clicking the .csv:
iconv -f utf8 -t ISO-8859-1 test.csv > test_ISO.csv
Now I'm able to import the CSV into Excel as Windows (ANSI). It's still annoying to have to use the import feature instead of double-clicking. I also really don't get why UTF-8 isn't working, not even with the import feature, a BOM added, and the complete database in UTF-8.
Comma separation turned out to be a mess as well.
1. CONCAT_WS only partly works, because it adds a concat_ws(..) header to the .csv file. Also, "file test.csv" doesn't report it as "comma separated", which means that even though everything is separated by commas, Excel won't notice it on double-click.
2. sed/awk: I found some code snippets, but all of them separated the table very badly. E.g. the street column "streetname number" ended up as 'streetname','number', which made two columns out of one and screwed up the table.
So it seems to me that Excel can only open a .csv on double-click if it:
a) is encoded as ISO-8859-1 (and only under Windows, because the standard Mac charset is Macintosh);
b) is detected as "comma separated". This means if I create a .csv through Excel itself, the output of
file test1.csv
would be
test1.csv: ISO-8859 text, with CRLF line terminators
while a file whose charset was changed with iconv, with a regex used for adding commas, would look like:
test1.csv: ISO-8859 text
Pretty weird behaviour; maybe someone has a working solution.
This is how I save data taken from UTF-8 MySQL tables.
You need to add a BOM first.
Example:
<?php
$fp = fopen(dirname(__FILE__).'/'.$filename, 'wb');
fputs($fp, "\xEF\xBB\xBF");
fputcsv($fp, array($utfstr_1, $utfstr_2));
fclose($fp);
Make sure that you also tell MySQL you're going to use UTF-8:
mysql_query("SET CHARACTER SET utf8");
mysql_query("SET NAMES utf8");
You need to execute this before selecting any data.
It probably won't hurt to also set the locale: setlocale(LC_ALL, "en_US.UTF-8");
Hope it helps.
Thanks everyone for the help. I finally managed to get a working, double-clickable CSV file which opens properly separated and displays the letters correctly.
For those who are interested in a good workflow, here we go:
1.) My database uses UTF-8 throughout.
2.) I write the form data into my database via PHP. I'm using mysqli, and as header information:
header("Content-Type: text/html; charset=ISO-8859");
I know this makes everything look crappy inside the database; feel free to use UTF-8 to make it look correct, but it doesn't matter in my case.
3.) I wrote a script, executed by a cron daemon, which:
a) removes the .csv files which were created previously:
rm -f path/to/csv  ## I have 3 due to some renaming, see below
b) creates the new CSV using mysql (this is still UTF-8):
mysql -hSERVERIP -uUSER -pPASS DBNAME -e "select * from DBTABLE;" > PATH/TO/output.csv
Now you have a tab-separated .csv and (if you exported from PHP as UTF-8) it will display correctly in OpenOffice etc., but not in Excel. Even an import as UTF-8 doesn't work.
c) makes the file SEMICOLON-separated (the Excel standard; double-clicking a comma-separated file won't work, at least not with the European version of Excel). I used a small Python script, semicolon.py:
import sys
import csv

# read tab-separated rows from stdin, write them out again separated by semicolons
tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, delimiter=";")
for row in tabin:
    commaout.writerow(row)
d) calls the script from my cron sh file:
/usr/bin/python PATH/TO/semicolon.py < output.csv > output_semi.csv
Make sure you use the full path for every file if you run the script via cron.
e) changes the charset from UTF-8 to ISO-8859-1 (Windows ANSI, the Excel standard) with iconv:
iconv -f utf8 -t ISO-8859-1 output_semi.csv > output_final.csv
And that's it. The CSV opens on double-click in Excel 2010 on Mac and Windows (tested).
Maybe this is a help for someone with similar problems. It drove me crazy.
Edit: on some servers you don't need iconv, because the output from the database is already ISO-8859. You should check your CSV after executing the mysql command:
file output.csv
Use iconv only if the charset isn't already ISO-8859-1.

Convert latin1 to UTF8

I have a DB with the table articles.
I want to convert the title and content fields to UTF-8.
Right now all the data looks like this: פורטל רעל נפתח רשמית!
I want it to become normal Hebrew characters.
Thanks
The following MySQL function will return the correct utf8 string after double-encoding:
CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8)
It can be used with an UPDATE statement to correct the fields:
UPDATE tablename SET field = CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8);
If you need to convert the whole database, you can back it up to a databaseback.sql file, then from your command line:
iconv -f latin1 -t utf-8 < databaseback.sql > databaseback.utf8.sql
You can also use http://www.php.net/manual/en/function.iconv.php
to convert each row in PHP, in case you don't have command-line access.
And lastly, don't forget to convert the collation of each field in phpMyAdmin; then you can restore the UTF-8 dump easily.
Update:
If you get "iconv is not recognized", it means that you don't have iconv installed.
A much easier solution is:
Migrating MySQL Data to Unicode
http://daveyshafik.com/archives/166-migrating-mysql-data-to-unicode.html
You can make a mysqldump of this database. Then download something like Notepad++, open the dump file, convert it to UTF-8, and replace all the encoding names in the file with utf-8, including the one in the first SET NAMES statement.
If you dump to a file via phpMyAdmin (with default settings), use the output file encoding ISO-8859-1 instead of the default UTF-8.
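Concretely, the lines to change in the dump look something like this (a sketch, assuming the dump was made as latin1; the table name is taken from the question):
-- before
SET NAMES latin1;
CREATE TABLE articles ( ... ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
-- after
SET NAMES utf8;
CREATE TABLE articles ( ... ) ENGINE=InnoDB DEFAULT CHARSET=utf8;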
You can write a little PHP script which does the conversion. See http://www.php.net/manual/en/function.mb-detect-encoding.php and http://php.net/manual/en/function.mb-convert-encoding.php. This is how I did it.
And remember to use strict mode! http://www.php.net/manual/en/function.mb-detect-encoding.php#102510
In pseudocode it would be something like this:
str = getDataAsString()
if(!isUTF8(str)) {
str = convert2UTF8(str)
}
saveStr2DB()
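A minimal PHP sketch of that idea, using the mb_* functions linked above with strict mode enabled (it assumes the non-UTF-8 rows are ISO-8859-1; getDataAsString() and saveStr2DB() are the placeholder helpers from the pseudocode):
$str = getDataAsString();
// strict check: mb_detect_encoding() returns false unless the bytes are genuinely valid UTF-8
if (mb_detect_encoding($str, 'UTF-8', true) === false) {
    $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}
saveStr2DB($str);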
Try:
ALTER TABLE `tablename` CHANGE `field_name` `field_name` VARCHAR( 200 ) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
