UTF-8 charset issues from MySQL in PHP - php

this is really doing my nut.....
all relevant PHP Output scripts set headers (in this case only one file - the main php script):
header("Content-type: text/html; charset=utf-8");
HTML meta is set in head:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
all Mysql tables and related columns set to:
utf8_unicode_ci Unicode (multilingual), case-insensitive
I have been writing a class to do some translation. When the class writes to a file using fopen, fputs, etc., everything works great: the correct chars appear in my output files (which are written as PHP arrays and saved to the filesystem as .php or .htm files). eval() brings back .htm files correctly, as does just including the .php files when I want to use them. All good.
Prob is when I am trying to create translation entries to my DB. My DB connection class has the following line added directly after the initial connection:
mysql_query("SET NAMES utf8, character_set_results = 'utf8', character_set_client = 'utf8', character_set_connection = 'utf8', character_set_database = 'utf8', character_set_server = 'utf8'");
Instead of seeing the correct chars, I get the usual crud you would expect using the wrong charset in the DB. Eg:
PropriÃ©tÃ©s
instead of:
Propriétés
don't even get me started on Russian, Japanese, etc chars! But then using UTF8 should not make any single language charset an issue...
What have I missed? I know it's not the PHP, as the site shows the correct chars from the included translation .php or .htm files; it's only when I am dealing with the MySQL DB that I have these issues. phpMyAdmin shows the entries with the wrong chars, so I assume it's happening when the PHP "writes" to MySQL. I have checked similar questions here on Stack, but none of the answers (all of which were taken care of) give me any clues...
Also, anyone have thoughts on speed difference using include $filename vs eval(file_get_contents($filename)).

You say that you are seeing "the usual crud you would expect using the wrong charset". But that crud is in fact what you get by calling utf8_encode() on a string that is already UTF-8, so chances are that you are not using the wrong encoding anywhere, but encoding into UTF-8 one time too many.
You may take a look at a library I made to fix this kind of problem:
https://stackoverflow.com/a/3521340/290221
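The double encoding described in this answer can be reproduced without a database. The following is only a sketch, assuming the mbstring extension is available; the variable names are made up for illustration, and mb_convert_encoding() from ISO-8859-1 to UTF-8 stands in for what utf8_encode() does.

```php
<?php
// Correct UTF-8 for "Propriétés" (é is the two bytes C3 A9).
$original = "Propri\u{00E9}t\u{00E9}s";

// Treat the UTF-8 bytes as ISO-8859-1 and encode them to UTF-8 again.
// This mirrors calling utf8_encode() on a string that is already UTF-8.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo exactly one round of encoding to get the original bytes back.
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), "\n";       // hex of the correct string
echo bin2hex($doubleEncoded), "\n";  // hex of the double-encoded string
echo $repaired === $original ? "repaired\n" : "still broken\n";
```

Undoing one round like this is only safe when you know the data was encoded exactly twice; applied to correct UTF-8 it would damage the text instead.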

Here is all you need to make sure you have a good display of those chars :
/* HTTP charset */
header("Content-Type:text/html; charset=UTF-8");
/* Set MySQL communication encoding */
mysql_set_charset("UTF8");
You also need to set the DB encoding to the correct one, as well as each table's encoding AND each field's encoding.
Last but not least, your PHP file's encoding should also match.

PHP's mysql extension has mysql_set_charset('utf8'); for that. Call it right after connecting, before you run any other query.

Related

I get an Ansi string instead of Utf-8 from Utf-8 mysql table [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 4 years ago.
When I moved from php mysql shared hosting to my own VPS I've found that code which outputs user names in UTF8 from mysql database outputs ?�??????� instead of 鬼神❗. My page has utf-8 encoding, and I have default_charset = "UTF-8" in php.ini, and header('Content-Type: text/html; charset=utf-8'); in my php file, as well as <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in html part of it.
My database has collation utf8_bin, and the table has the same. On both the previous and the current hosting, in phpMyAdmin I see this for the database record: 鬼神❗. When I create an ANSI text file in Notepad++, paste 鬼神❗ into it and select the Encoding->Encode in UTF-8 menu, I see 鬼神❗, so I suppose it is a correctly encoded UTF-8 string.
Ok, and then I added
init_connect='SET collation_connection = utf8_general_bin'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_bin
skip-character-set-client-handshake
to my.cnf, and now my page shows 鬼神❗ instead of ?�??????�. This is the same output I get in phpMyAdmin on both hostings, so I'm on the right track. And still, somehow, on my old hosting the same PHP script returns a UTF-8 web page with the name 鬼神❗, while on the new hosting it returns 鬼神❗. It looks like the string is UTF-8 encoded twice: I get a UTF-8 string, I give it to Notepad++ as an ANSI string, and it turns it into the correct UTF-8 string.
However when I try utf8_encode() I get й¬ÑзÒÑвÑâ, and utf8_decode() returns ?�???????. The same result return mb_convert_encoding($name,"UTF-8","ISO-8859-1"); and iconv( "ISO-8859-1","UTF-8", $name);.
So how could I reproduce the same conversion Notepad++ does?
See answer below.
The solution was simple yet not obvious to me, as I never saw my.cnf on that shared hosting. It seems that server had the following settings:
init_connect='SET collation_connection = cp1252'
init_connect='SET NAMES cp1252'
character-set-server=cp1252
So, to do no harm to other code on my new server, I have to place mysql_query("SET NAMES CP1252"); at the top of each PHP script that works with utf8 strings.
The trick is that the script gets the string as is (ANSI) and outputs it unchanged, and since the browser is told that the page is UTF-8 encoded, it simply renders my strings as UTF-8.
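The Notepad++ conversion asked about in the question can be sketched in PHP, under the assumption that the garbled text is UTF-8 bytes that were read as Windows-1252 (which is what "ANSI" usually means on Western Windows systems). The sketch uses only the character 神 from the question and assumes the mbstring extension; the variable names are illustrative. It also suggests why the ISO-8859-1 attempts failed: the garbled text contains ž (U+017E), which exists in Windows-1252 but not in ISO-8859-1.

```php
<?php
// "ç¥ž" is what the three UTF-8 bytes of 神 (E7 A5 9E) look like
// when they are read as Windows-1252 and then stored as UTF-8.
$garbled = "\u{00E7}\u{00A5}\u{017E}";

// Map each character back to its single Windows-1252 byte.
$bytes = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');

// ISO-8859-1 has no code point for U+017E, so this route cannot
// restore the last byte.
$viaLatin1 = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo bin2hex($bytes), "\n";
echo $bytes === "\u{795E}" ? "recovered\n" : "not recovered\n";
echo $viaLatin1 === "\u{795E}" ? "recovered\n" : "not recovered\n";
```

The full name from the question is left out on purpose: one of the bytes of ❗ (9D) has no assigned character in Windows-1252, so how it survives a round trip depends on the tool.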

PHP sudden UTF8 encoding issues previously absent. Application not changed

I run a personal website that always showed accented chars correctly.
Now, suddenly, it doesn't any longer. The funny thing is, even its localhost version doesn't.
The application is unaltered over years in this regard and here it is what it does, in the given order:
mysql database set to collation utf8_general_ci
Application sends these two queries:
"SET NAMES 'utf8' COLLATE 'utf8'"
and
"SET CHARACTER_SET 'utf8'"
Php headers send the following headers before anything is printed:
header('Content-type: text/xml; charset=utf-8'."\r\n");
header('Content-transfer-encoding: utf-8'."\r\n");
Each web page shows a meta tag as follows:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Yet, now, suddenly, chars are shown all wrong.
If I replace the chars manually, they are shown as intended. But I cannot fathom whether or what may have "corrupted" the database. And I certainly cannot fix hundreds of posts manually.
Any idea why this strange thing suddenly happens and suggestions about how to fix it?
Instance of a wrong line:
"Non ho mai avuto l' opportunitÃ  di incontrarti di persona. Non so se Ã¨ perchÃ¨ non ho cercato abbastanza l' occasione o perchÃ¨" etc...
Ã¨ is Mojibake for è. Were you expecting a grave-e? Regardless of what changed or did not change, let's look at fixing it.
See this and look for Mojibake. It says to check/fix these:
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
Also see the technique for checking the HEX of what is stored for è:
utf8 hex is C3A8.
Hex C383C2A8 means you have "double encoding"; that will lead to other issues.
E8 is the latin1 hex -- I doubt if you will see this.
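The hex check above can also be tried from PHP with bin2hex() on a value fetched from the database. The helper below is hypothetical (its name and return labels are not from the answer) and only covers the three cases listed for è.

```php
<?php
// Hypothetical helper: classify the stored bytes of a value that
// should be the single character "è".
function classifyGraveE(string $stored): string
{
    switch (bin2hex($stored)) {
        case 'c3a8':
            return 'utf8';            // correctly stored UTF-8
        case 'c383c2a8':
            return 'double-encoded';  // UTF-8 that was encoded a second time
        case 'e8':
            return 'latin1';          // single latin1 byte
        default:
            return 'unknown';
    }
}

echo classifyGraveE("\u{00E8}"), "\n";          // è as UTF-8
echo classifyGraveE("\u{00C3}\u{00A8}"), "\n";  // "Ã¨" as UTF-8
echo classifyGraveE("\xE8"), "\n";              // raw latin1 byte
```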

Alter language settings for PHP application - WAMP server

Firefox does not recognize the language in my PHP application and replaces letters in my language with very small <?> icons (with appropriate quotes, of course). It does not help to write <html lang=nb-NO> or <html lang=no> at the start of the PHP application file, nor does using <meta charset=iso-8859-1> in the header.
The letter substitution does not occur in PHP-admin or in a locally installed WordPress. In WampServer 3.0.0 most text is in English. The data stored in MySQL has been stored correctly. Right-clicking the WampServer icon in the system tray and choosing the Language menu gives a language list with a check mark before English.
How can the settings be altered so that Firefox/IE/Opera/... will interpret the application correctly and display all characters in the alphabet?
Your problem is not with the language, actually the language does not influence anything. The problem is with charset.
Common encoding issues
It is very common when working with accents that we find strange characters such as:
Like this: Ã© (the é character in Unicode). This happens because the character is encoded as UTF-8, but the page is in iso-8859-1 (or compatible).
The � sign is the opposite case: accented text encoded in "iso-8859-1" sits on a page that is processed as "UTF-8" because of Content-Type: ...; charset=utf-8.
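Both symptoms can be reproduced with a single é. This is a rough sketch that assumes the mbstring extension is installed; the variable names are illustrative.

```php
<?php
$latin1 = "\xE9";      // "é" as a single ISO-8859-1 byte
$utf8   = "\xC3\xA9";  // "é" as two UTF-8 bytes

// A lone E9 byte is not valid UTF-8, which is why a page declared as
// UTF-8 shows the replacement character for it.
var_dump(mb_check_encoding($latin1, 'UTF-8'));
var_dump(mb_check_encoding($utf8, 'UTF-8'));

// Reading the two UTF-8 bytes as ISO-8859-1 yields two characters.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($misread), "\n";
```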
What is needed to use UTF-8
PHP scripts (meaning the files on the server, not their output) must be saved as "UTF-8 without BOM"
Set MySQL (or other database system) to use the utf8 charset
It is recommended to use header('Content-type: text/html; charset=UTF-8'); in PHP scripts (if you use a framework this may not be necessary; the situation varies).
Note: The advantage of UTF-8 is that you can use various "languages" in your page with characters that are not supported by "iso-8859-1".
About ISO-8859-1
I recommend using ISO-8859-1 if your site does not use characters other than Latin ones and you do not need "extra encodings" (such as the "icons" of "UTF-8" or "UTF-16"). However, even if you do not need UTF-8, one good reason to move to it is that in June 2004 the ISO/IEC working group responsible for maintaining ISO-8859-1 declared the end of support for this encoding, focusing on UCS and Unicode instead.
Source: http://en.wikipedia.org/wiki/ISO_8859-1
If you decide to use UTF-8 in your site, I recommend the following steps:
PHP script with UTF-8 (without BOM)
Note: read about BOM in http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
You should save all PHP scripts (including those you use with include, require, etc.) in UTF-8 without BOM. Use programs like Sublime Text or Notepad++ to convert files:
Using notepad++:
Using Sublime Text:
NetBeans: go to Window > Preferences > General > Workspace > Text File Encoding:
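If you are unsure whether an editor really saved a file without a BOM, the first three bytes can be checked in PHP. This is a small sketch; the constant and function names are hypothetical and not part of any library.

```php
<?php
// The UTF-8 byte order mark is the three bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// True when the given file contents start with a UTF-8 BOM.
function hasUtf8Bom(string $contents): bool
{
    return strncmp($contents, UTF8_BOM, 3) === 0;
}

// Returns the contents without a leading BOM, if there was one.
function stripUtf8Bom(string $contents): string
{
    return hasUtf8Bom($contents) ? substr($contents, 3) : $contents;
}

var_dump(hasUtf8Bom(UTF8_BOM . 'Hello'));
echo stripUtf8Bom(UTF8_BOM . 'Hello'), "\n";
```

To check a real script, pass it the result of file_get_contents() for that file.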
MySQL with UTF-8
To create a table in UTF-8 you should use something like:
CREATE TABLE mytable (
    id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title varchar(300) DEFAULT NULL
) ENGINE=InnoDB CHARACTER SET=utf8 COLLATE utf8_unicode_ci;
If the tables already exist, first make a BACKUP of them and then use one of the following commands (as appropriate):
Convert database:
ALTER DATABASE mysdatabase CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Convert specific table:
ALTER TABLE mytable CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
In addition to creating the tables in UTF-8 it is necessary to define the connection as UTF-8.
With PDO use exec:
$conn = new PDO('mysql:host=HOST;dbname=mysdatabase', 'USER', 'PASSWORD');
$conn->exec('SET CHARACTER SET utf8');//set UTF-8
With mysqli use mysqli_set_charset:
$mysqli = new mysqli('HOST', 'USER', 'PASSWORD', 'mysdatabase');
if (false === $mysqli->set_charset('utf8')) {
    printf('Error: %s', $mysqli->error);
}
Setting the page charset
You can use the <meta> tag to set the charset, but it is recommended to set it in the server response by defining the headers (this does not mean that you should not use <meta> as well).
Use the header function. Another reason to use the server response is that the charset is then known as soon as the page starts rendering, and "AJAX" responses also need the charset defined by header();.
Note: header(); should always go at the top of the script, before any echo ...;, print "...";, or other output function.
Example:
<?php
header('Content-Type: text/html; charset=UTF-8');
echo 'Hello World!';

Stuck writing UTF-8 file via PHP's fwrite

I can't figure out what I'm doing wrong. I'm getting file content from the database. When I echo the content, everything displays just fine, when I write it to a file (.html) it breaks. I've tried iconv and a few other solutions, but I just don't understand what I should put for the first parameter, I've tried blanks, and that didn't work very well either. I assume it's coming out of the DB as UTF-8 if it's echoing properly. Been stuck a little while now without much luck.
function file($fileName, $content) {
    if (!file_exists("out/".$fileName)) {
        $file_handle = fopen(DOCROOT . "out/".$fileName, "wb") or die("can't open file");
        fwrite($file_handle, iconv('UTF-8', 'UTF-8', $content));
        fclose($file_handle);
        return TRUE;
    } else {
        return FALSE;
    }
}
Source of the html file looks like.
Comes out of the DB like this:
<h5>Текущая стабильная версия CMS</h5>
goes in file like this
<h5>Ð¢ÐµÐºÑƒÑ‰Ð°Ñ ÑÑ‚Ð°Ð±Ð¸Ð»ÑŒÐ½Ð°Ñ Ð²ÐµÑ€ÑÐ¸Ñ CMS</h5>
EDIT:
Turns out the root of the problem was Apache serving the files incorrectly. Adding
AddDefaultCharset utf-8
To my .htaccess file fixed it. Hours wasted... At least I learned something though.
Edit: The database encoding does not seem to be the issue here, so this part of the answer is retained for information only
I assume it's coming out of the DB as UTF-8
This is most likely your problem, what database type do you use? Have you set the character encoding and collation details for the database, the table, the connection and the transfer.
If I was to hazard a guess, I would say your table is MySQL and that your MySQL collation for the database / table / column should all be UTF8_general_ci ?
However, MySQL's utf8 is not actually full UTF-8: it stores at most 3 bytes per character rather than 4, so it cannot store the whole UTF-8 character set; see UTF-8 all the way through.
So you need to go through every table and column in your MySQL database and change it from utf8_ to utf8mb4_ (available since MySQL 5.5.3), which stores up to 4 bytes per character and so covers the whole range of UTF-8 characters.
Also, if you do any PHP work on the data strings, be aware that you should be using the mb_ PHP functions for multibyte encodings.
And finally, you need to specify a connection character set for the database. Don't run with the default one, as it will almost certainly not be utf8mb4; otherwise you can have the correct data in the database, but that data is repackaged as 3-byte utf8 before being treated as 4-byte UTF-8 by PHP at the other end.
Hope this helps, and if your DB is not MySQL, let us know what it is!
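The 3-byte versus 4-byte distinction is easy to see from PHP. This sketch assumes the mbstring extension; the two sample characters are picked only for illustration.

```php
<?php
$threeByte = "\u{20AC}";   // euro sign, fits in MySQL's 3-byte utf8
$fourByte  = "\u{1F600}";  // an emoji, needs utf8mb4

echo strlen($threeByte), "\n";             // byte length
echo strlen($fourByte), "\n";              // byte length
echo mb_strlen($fourByte, 'UTF-8'), "\n";  // character length
```

strlen() counts bytes while mb_strlen() counts characters, which is the reason for the advice to use the mb_ functions on multibyte data.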
Edit:
function file($fileName, $content) {
    if (!file_exists("out/".$fileName)) {
        $file_handle = fopen(DOCROOT . "out/".$fileName, "wb") or die("can't open file");
        fwrite($file_handle, iconv('UTF-8', 'UTF-8', $content));
        fclose($file_handle);
        return TRUE;
    } else {
        return FALSE;
    }
}
Your $file_handle is only opened inside an if statement that runs when the file does not exist yet, so an existing file is never rewritten. Note also that file_exists() checks "out/" while fopen() uses DOCROOT . "out/", which may not be the same path.
Your iconv call is worthless here, converting from "utf-8" to, er, "utf-8". Character-set detection is extremely haphazard and hard for programs to do correctly, so it's generally advised not to try to guess what a character encoding is; you need to know what it is and tell the function.
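One way the function could look with those points addressed is sketched below. It is not the asker's code: the name writeFileOnce is made up (file() is already a PHP built-in, so it cannot be redeclared), and the $dir parameter stands in for DOCROOT . "out/".

```php
<?php
// Hypothetical rewrite: check and write the same path, and write the
// bytes unchanged instead of passing them through iconv.
function writeFileOnce(string $dir, string $fileName, string $content): bool
{
    $path = rtrim($dir, '/') . '/' . $fileName;

    if (file_exists($path)) {
        return false; // keep the original behaviour: never overwrite
    }

    // PHP strings are byte strings, so UTF-8 in means UTF-8 out.
    return file_put_contents($path, $content) !== false;
}
```

A file written this way contains exactly the bytes that came from the database; how a browser displays it still depends on the charset the server or the HTML declares.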
The comment by Dean is actually very important. The HTML should have a <meta charset="UTF-8"> inside <head>.
That iconv call is actually not useful and, if you are right that you are getting your content as UTF-8, it is not necessary.
You should check the character set of your database connection. Your database can be encoded in UTF-8 but the connection could be in another character set.
Good luck!

Mysql insert text data truncated by weird character encodings

I'm importing data from a CSV file that comes from excel, but i can't seem to insert my data correctly. This data contains french accented characters and if i open the CSV with OpenOffice (i don't use excel) i just select UTF-8 and the data gets converted and shown fine.
If I read that into PHP memory, mb_detect_encoding tells me the strings are UTF-8 encoded. I connect to a database and specify UTF-8 for all the charsets using:
mysql_query('SET character_set_results = "utf8", character_set_client = "utf8", character_set_connection = "utf8", character_set_database = "utf8", character_set_server = "utf8"');
And i can certify that my database contains UTF-8 only fields and tables.
What happens is that my content gets truncated at the first accented character. But that happens only in my php script it seems. I output all my data to the browser and if i copy the INSERT statement, it inserts the whole data.
There might be something going on between php and the browser output but i can certify that it's not in the programming of the script... Thus far, i was able to circumvent this issue by HTMLENTITY'ing all my data, but the problem is that my search engine is going coo-coo-crazy because of that...
Any reason or way you can spare would be really appreciated...
EDIT #1:
I searched for the default Excel encoding of CSV data and found out it was CP1252. I tried iconv('CP1252', 'UTF-8//TRANSLIT', $data) and now the accented characters seem fine. I'm going to try it everywhere in my script to see whether all my accented-character issues are fixed, and post the solution if so...
After countless tries, i was able to fix all my encoding problems but some of them i still don't know why they happen. I hope this will give some help to someone else later:
function fixEncoding($data){
    //Replace
    return iconv('CP1252', 'UTF-8//TRANSLIT', $data);
}
I now use this function to recode my strings correctly. It seems that Excel saves data as CP1252 and NOT UTF-8.
Furthermore, there seems to be a bug with accented characters at the start of a string in a CSV if you use fgetcsv, so I had to forgo fgetcsv and write an alternative. I'm not on PHP 5.3; maybe str_getcsv could have fixed my issue, but I don't have that function. I even looked for ports, and nothing seems to exist and work correctly.
This is my solution, although very ugly, it works for me:
function fgetcsv2($filepointer, $maxlen, $sep, $enc){
    $data = fgets($filepointer, $maxlen);
    if($data === false){
        return false;
    }
    $data = explode($sep, $data);
    return $data;
}
Good luck to all who get similar problems
I also had to work on such a project, and, seriously, PHPExcel was my savior to avoid any brainfuck.
P.S. : also, there is this link to help you getting started (in french).
I have just had a similar problem: although I tested $value with mb_detect_encoding and it said it was UTF-8, the data was still truncated.
Not knowing what to convert from, I couldn't use the iconv function mentioned above.
However I forced it to UTF-8 using utf8_encode($value) and everything works fine now.
Which encoding are you using for your tables?
mb_detect_encoding is not 100% correct all the time, and no encoding detector ever can be.
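Rather than detecting the encoding, one option is to check whether the bytes are valid UTF-8 and convert only when they are not. The sketch below assumes, as in the Excel case in this thread, that anything invalid is CP1252; the helper name toUtf8 is hypothetical, and it needs the mbstring and iconv extensions.

```php
<?php
// Hypothetical helper: keep valid UTF-8 untouched, otherwise treat the
// input as CP1252 (an assumption, not something PHP can know).
function toUtf8(string $data): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }

    $converted = iconv('CP1252', 'UTF-8', $data);

    // iconv() returns false on failure; fall back to the input then.
    return $converted === false ? $data : $converted;
}

echo bin2hex(toUtf8("caf\xE9")), "\n";      // CP1252 input gets converted
echo bin2hex(toUtf8("caf\xC3\xA9")), "\n";  // UTF-8 input is left as is
```

This avoids the double encoding that a blind utf8_encode() produces when the input is already UTF-8, but it still rests on the CP1252 assumption for everything else.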
