I run a personal website that always displayed accented characters correctly.
Now, suddenly, it no longer does. Oddly, even the localhost version is affected.
The application has not changed in this respect for years. Here is what it does, in order:
The MySQL database is set to collation utf8_general_ci.
The application sends these two queries:
"SET NAMES 'utf8' COLLATE 'utf8'"
and
"SET CHARACTER_SET 'utf8'"
The PHP script sends the following headers before anything is printed:
header('Content-type: text/xml; charset=utf-8'."\r\n");
header('Content-transfer-encoding: utf-8'."\r\n");
Each web page shows a meta tag as follows:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Yet now, suddenly, the characters are displayed incorrectly.
If I replace the characters manually, they are shown as intended. But I cannot tell whether or what may have "corrupted" the database. And I certainly cannot fix hundreds of posts by hand.
Any idea why this suddenly happens, and any suggestions on how to fix it?
Example of a broken line:
"Non ho mai avuto l' opportunità di incontrarti di persona. Non so se è perchè non ho cercato abbastanza l' occasione o perchè" etc...
è is Mojibake for è. Were you expecting a grave-e? Regardless of what changed or did not change, let's look at fixing it.
The checklist for Mojibake says to check and fix these:
The bytes to be stored need to be UTF-8-encoded.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4).
HTML should start with <meta charset=UTF-8>.
Also see the technique for checking the HEX of what is stored for è:
utf8 hex is C3A8.
Hex C383C2A8 means you have "double encoding"; that will lead to other issues.
E8 is the latin1 hex -- I doubt if you will see this.
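The three hex signatures above can be reproduced outside MySQL. This is a minimal Python sketch for illustration only; the variable names are mine, and it assumes the "double encoding" step misreads the UTF-8 bytes as latin1:

```python
# Reproduce the three hex signatures for the character è.
ch = "è"

# Correct UTF-8 storage: two bytes.
utf8_hex = ch.encode("utf-8").hex().upper()

# "Double encoding": the UTF-8 bytes are misread as latin1,
# and the resulting two characters are encoded to UTF-8 again.
double_hex = ch.encode("utf-8").decode("latin-1").encode("utf-8").hex().upper()

# Plain latin1 storage: a single byte.
latin1_hex = ch.encode("latin-1").hex().upper()

print(utf8_hex, double_hex, latin1_hex)  # C3A8 C383C2A8 E8
```

Comparing the output of SELECT HEX(col) against these three values tells you which of the cases you are in.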
Related
Here are the hex values of two strings stored in a MySQL database using two different methods.
20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5
and
E0A495E0A4BEE0A49AE0A48220E0A4B6E0A495E0A58DE0A4A8E0A58BE0A4AEE0A58DE0A4AFE0A4A4E0A58DE0A4A4E0A581E0A4AEE0A58D20E0A5A420E0A4A8E0A58BE0A4AAE0A4B9E0A4BFE0A4A8E0A4B8E0A58DE0A4A4E0A4BF20E0A4AEE0A4BEE0A4AEE0A58D20E0A5A5
They represent the string काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥. The former appears to be encoded badly, but works in the application, the latter appears encoded correctly but does not. I need to be able to create the first hex string from the input.
Here comes the long version: I've got a legacy application built in PHP/MySQL. The database connection charset is latin1. The charset of the table is utf8 (don't ask). The input is coerced into being correct utf8 via the ForceUTF8 composer library. Looking directly in the database, the stored value of this string is काचं शकà¥à¤¨à¥‹à¤®à¥à¤¯à¤¤à¥à¤¤à¥à¤®à¥ । नोपहिनसà¥à¤¤à¤¿ मामॠ॥
I am aware that this looks horrendous and appears to me to be badly encoded, however it is out of scope to fix the legacy application. The rest of the application is able to cope with this data as it is and everything else works and displays perfectly well with it.
I have created an external Node application to replace the current insert routine running on Azure. I've set the connection charset to latin1; it connects to the same database and runs the same insert statement. The only part of the puzzle I've not been able to replicate is the ForceUTF8 library, as I could find no equivalent in the npm ecosystem. When the same string is inserted, it renders perfectly when looking at the raw field in PhpStorm, i.e. it looks exactly like the original text above, and the hex value of the string is the latter of the two presented at the top of the question. However, when viewed in the application, the values are corrupted by question marks and black diamonds.
If, within the PHP application, I run SET NAMES utf8 ahead of the rendering data query then the node-inserted values render correctly, and the legacy ones now display as corrupted. Adding set names utf8 to the application for this query is not an acceptable solution since it breaks the appearance of the legacy data, and fixing the legacy data is also not an acceptable solution.
I have tried all sorts of connection charsets and various Iconv functions to make the data exactly match how the legacy app makes it but have not been able to "break it" in exactly the same way.
How can I make "काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥" into a string, the hex value of which is "20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5" using some variation of database connection charset and string conversion?
I'm not familiar with PHP, but I was able to generate the "horrendous" encoding with Python (and it is horrendous...not sure how someone intentionally generated this crap). Hopefully this guides you to a solution:
import re
expected = '20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5'
original = 'काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥'
# Encode in UTF-8 w/ BOM (U+FEFF encoded in UTF-8 as a signature)
step1 = original.encode('utf-8-sig')
# Windows-1252 doesn't define some byte -> codepoint mappings and Python normally
# raises an error on those bytes. Use an error handler to keep the bytes that
# fail, then replace the escape codes with the matching Unicode codepoint.
step2 = step1.decode('cp1252',errors='backslashreplace')
step3 = re.sub(r'\\x([0-9a-f]{2})', lambda x: chr(int(x.group(1),16)), step2)
# There is an extra space before the UTF-8-encoded BOM for some reason
step4 = ' ' + step3
step5 = step4.encode('utf8')
# Format to match expected string
final = step5.hex().upper()
print(final == expected) # True
HEX('काचं') = 'E0A495E0A4BEE0A49AE0A482' -- utf8mb4 to utf8mb4 hex
HEX(CONVERT(CONVERT(BINARY('काचं') USING latin1) USING utf8mb4)) = 'C3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A' -- utf8mb4 to double-encoded
See "double-encoding" in Trouble with UTF-8 characters; what I see is not what I stored
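The same round trip can be checked without a database. A rough Python sketch, assuming that MySQL's latin1 behaves like cp1252 for the bytes in this word (the variable names are illustrative):

```python
original = "काचं"

# Correct UTF-8 bytes, as MySQL would store them in a utf8mb4 column.
utf8_bytes = original.encode("utf-8")

# Misread those bytes as cp1252 (what MySQL calls latin1),
# then store the resulting characters as UTF-8: double encoding.
double_encoded = utf8_bytes.decode("cp1252").encode("utf-8")

print(utf8_bytes.hex().upper())
print(double_encoded.hex().upper())

# The reverse direction recovers the text from the double-encoded bytes.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == original)
```

This works for this word because none of its bytes land on the five positions that cp1252 leaves undefined; Python's codec raises an error on those, which is why the longer Python answer above needs its error handler.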
More
"Double-encoding", as I understand it, is where utf8 bytes (up to 4 bytes per "character") are treated as latin1 (or cpnnnn) and converted to utf8, and then that happens a second time. In this case, each 3-byte Devanagari character is converted twice, leading to between 6 and 9 bytes.
You explained the cause here:
The database connection charset is latin1. The charset of the table is utf8
BOM is, in my opinion, a red herring. It was intended to be a useful clue that a "text" file was encoded in UTF-8, but unfortunately, very few products generate it. Hence, BOM is more of a distraction than a help. (I don't think MySQL has any way to take care of BOM -- after all, most database activity is at the row level, not the file level.)
The solution (for the data flow) in MySQL context is to rip out all "conversion" functions and, instead, configure things so that MySQL will convert at the appropriate places. Your mention of "latin1" was the main "mis-configuration".
The long expression (HEX...) gives a clue of how to fix the data, but it must be coordinated with changes to configuration and changes to code.
When I moved from shared PHP/MySQL hosting to my own VPS, I found that the code which outputs user names in UTF-8 from a MySQL database prints ?�??????� instead of 鬼神❗. My page has UTF-8 encoding, and I have default_charset = "UTF-8" in php.ini, header('Content-Type: text/html; charset=utf-8'); in my PHP file, and <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in its HTML part.
My database has collation utf8_bin, and the table has the same. On both the previous and the current hosting, phpMyAdmin shows this database record as é¬¼ç¥žâ—. When I create an ANSI text file in Notepad++, paste é¬¼ç¥žâ— into it and select Encoding->Encode in UTF-8 from the menu, I see 鬼神❗, so I suppose it is a correctly encoded UTF-8 string.
Ok, and then I added
init_connect='SET collation_connection = utf8_general_bin'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_bin
skip-character-set-client-handshake
to my.cnf and now my page shows é¬¼ç¥žâ— instead of ?�??????�. This is the same output I get in phpMyAdmin on both hostings, so I'm on the right track. Still, on my old hosting the same PHP script somehow returns a UTF-8 web page with the name 鬼神❗, while on the new hosting it shows é¬¼ç¥žâ—. It looks like the string is UTF-8 encoded twice: I get a UTF-8 string, I give it to Notepad++ as an ANSI string, and it encodes it into the correct UTF-8 string.
However, when I try utf8_encode() I get й¬ÑзÒÑвÑâ, and utf8_decode() returns ?�???????. mb_convert_encoding($name,"UTF-8","ISO-8859-1"); and iconv( "ISO-8859-1","UTF-8", $name); return the same result.
So how could I reproduce the same conversion Notepad++ does?
See answer below.
The solution was simple yet not obvious for me, as I never saw my.cnf on that shared hosting: it seems that that server had settings as follows
init_connect='SET collation_connection = cp1252'
init_connect='SET NAMES cp1252'
character-set-server=cp1252
So, to avoid harming other code on my new server, I have to place mysql_query("SET NAMES CP1252"); at the top of each PHP script that works with UTF-8 strings.
The trick is that the script gets the string as is (ANSI) and outputs it unchanged; since the browser is told that the page is UTF-8 encoded, it simply renders my strings as UTF-8.
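The Notepad++ step from the question can be imitated in a few lines. This is only a sketch in Python, under the assumption that "ANSI" means cp1252; it uses just the first two characters of the name, because the third one contains a byte (0x9D) that Python's cp1252 codec leaves undefined:

```python
name = "鬼神"

# What the page shows after one extra conversion:
# every UTF-8 byte is read as a separate cp1252 character.
garbled = name.encode("utf-8").decode("cp1252")

# The Notepad++ step: take the characters as ANSI (cp1252) bytes,
# then read those bytes as UTF-8.
restored = garbled.encode("cp1252").decode("utf-8")

# Conversions that target ISO-8859-1 instead of cp1252 cannot represent
# every garbled character ("ž" here), so it is replaced by "?".
via_latin1 = garbled.encode("latin-1", errors="replace")

print(garbled, restored, via_latin1)
```

That last line is a plausible reason why utf8_decode() and the ISO-8859-1 conversions in the question produced question marks: ISO-8859-1 and cp1252 differ in the 0x80-0x9F range.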
Alter language settings for a PHP application - WAMP server. Firefox does not recognize the language in my PHP application and replaces the letters of my language with very small <?> icons (with appropriate quotes, of course). Writing <html lang=nb-NO> or <html lang=no> at the start of the PHP application file does not help, and neither does leaving out <meta charset=iso-8859-1> in the header.
The letter substitution does not occur in PHP-admin or in a locally installed WordPress. In Wamp server 3.0.0 most text is in English. The data stored in MySQL has been stored correctly. Right-clicking the Wamp server icon in the system tray and choosing Language shows a language list with a check mark next to English.
How can the settings be altered so that Firefox/IE/Opera/... interpret the application correctly and display all characters of the alphabet?
Your problem is not with the language; the language setting does not influence this at all. The problem is the charset.
Common encoding issues
When working with accented characters, it is very common to run into strange characters such as:
Ã© (the é character in Unicode): this happens because the character is encoded in UTF-8, but the page is declared as iso-8859-1 (or compatible).
The � character is an example of the opposite case: it appears when you use accents encoded in "iso-8859-1" on a page that is processed as "UTF-8" because of Content-Type: ...; charset=utf8.
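Both symptoms can be reproduced in isolation. A small Python sketch (for illustration only; the variable names are made up):

```python
text = "é"

# UTF-8 bytes shown by a page that claims ISO-8859-1:
# the two bytes become two wrong characters.
utf8_read_as_latin1 = text.encode("utf-8").decode("iso-8859-1")

# ISO-8859-1 byte shown by a page that claims UTF-8:
# the lone byte is invalid UTF-8, so the decoder substitutes U+FFFD.
latin1_read_as_utf8 = text.encode("iso-8859-1").decode("utf-8", errors="replace")

print(utf8_read_as_latin1)   # Ã©
print(latin1_read_as_utf8)   # �
```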
What is needed to use UTF-8
PHP scripts (this refers to the files on the server, not to the responses they produce) must be saved as "utf-8 without BOM".
Set MySQL (or another database system) to a UTF-8 charset.
It is recommended to use header('Content-type: text/html; charset=UTF-8'); in PHP scripts (with a framework this may not be necessary; the situation varies).
Note: The advantage of UTF-8 is that you can use various "languages" in your page with characters that are not supported by "iso-8859-1".
About ISO-8859-1
I recommend using ISO-8859-1 if your site uses only Latin characters and you do not need "extra encodings" (such as the "icons" of "UTF-8" or "UTF-16"). However, even if you do not need UTF-8, there is one good reason to move to it: in June 2004, the ISO/IEC development group responsible for maintaining ISO-8859-1 declared the end of support for this encoding, focusing on UCS and Unicode instead.
Source: http://en.wikipedia.org/wiki/ISO_8859-1
If you decide to use UTF-8 in your site, I recommend the following steps:
PHP script with UTF-8 (without BOM)
Note: read about BOM in http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
You should save all PHP scripts (including those you load with include, require, etc.) in UTF-8 without BOM. Use programs like Sublime Text or Notepad++ to convert the files:
Using Notepad++:
Using Sublime Text:
NetBeans: go to Window > Preferences > General > Workspace > Text File Encoding:
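If you want to check files without opening an editor, a BOM is easy to detect: it is the three bytes EF BB BF at the very start of the file. A minimal Python sketch (the function name is hypothetical); the reason it matters for PHP is that a BOM before <?php counts as output and can get in the way of header():

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A script saved "with BOM" starts with EF BB BF before the opening tag.
sample = codecs.BOM_UTF8 + b"<?php echo 'ok';"
print(strip_utf8_bom(sample))
```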
MySQL with UTF-8
To create a table in UTF-8 you should use something like:
CREATE TABLE mytable (
id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
title varchar(300) DEFAULT NULL
) ENGINE=InnoDB CHARACTER SET=utf8 COLLATE utf8_unicode_ci;
If the tables already exist, first make a BACKUP of them and then use one of the following commands (as appropriate):
Convert database:
ALTER DATABASE mysdatabase CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Convert specific table:
ALTER TABLE mytable CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
In addition to creating the tables in UTF-8 it is necessary to define the connection as UTF-8.
With PDO use exec:
$conn = new PDO('mysql:host=HOST;dbname=mysdatabase', 'USER', 'PASSWORD');
$conn->exec('SET CHARACTER SET utf8');//set UTF-8
With mysqli use mysqli_set_charset:
$mysqli = new mysqli('HOST', 'USER', 'PASSWORD', 'mysdatabase');
if (false === $mysqli->set_charset('utf8')) {
printf('Error: %s', $mysqli->error);
}
Setting the page charset
You can use the <meta> tag to set the charset, but it is recommended to set it in the server response by defining the headers (this does not mean that you should not use <meta> as well).
Use the header function. Another reason to rely on the server response is that the charset is then known before the page is rendered, and "AJAX" responses also need the charset defined by header();.
Note: header(); should always go at the top of the script, before any echo ...;, print "...";, or other output function.
Example:
<?php
header('Content-Type: text/html; charset=UTF-8');
echo 'Hello World!';
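To see why the header matters, here is a rough Python sketch of what a client does with it: it reads the charset from Content-Type and uses it to decode the body. The helper name and the iso-8859-1 fallback are assumptions made for the illustration, not the behavior of any particular browser:

```python
from email.message import Message

def charset_from_content_type(value: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default

body = "Olá".encode("utf-8")  # bytes as sent by the server

declared = charset_from_content_type("text/html; charset=UTF-8")
print(body.decode(declared))   # decoded with the declared charset

fallback = charset_from_content_type("text/html")
print(body.decode(fallback))   # without the header parameter the text is garbled
```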
I have a problem. Some of the data in my DB is in Latvian (e.g. Valentīna) and I need to display it on my page.
Other data is saved in cp1257 encoding and looks like AÎDA MACIJEVSKA, and it displays as Aīda Macijevska.
So what I have tried...
1 - ucwords(mb_strtolower(iconv("windows-1257", "UTF-8//TRANSLIT", trim($row['pac_name'])), "UTF-8"));
2 - ucwords(mb_strtolower(iconv("windows-1257", "UTF-8", trim($row['pac_name'])), "UTF-8"));
3 - just show without any converting from DB `$row["pac_name"]`;
and all 3 variants display the same result - Valent?na
P.S. The database has utf8_general_ci collation, and I also send a UTF-8 header - header('Content-Type: text/html; charset=utf-8');
So can anyone please help me with my problem?
Assuming you are truly using cp1257 and not utf8, then you need
SET NAMES cp1257 (or some other way for the client to tell mysqld that the bytes are encoded using cp1257)
CHARACTER SET cp1257 on each column (or perhaps defaulting from table definition).
But it sounds like you should go with utf8, not cp1257...
As I see it, Î does not exist in cp1257. Reference: http://en.wikipedia.org/wiki/Windows-1257 . Hence, the code you mentioned is free to screw it up, by either using ī or ?.
If you really need I-hat, go with utf8. Note that the collation utf8_latvian_ci exists. All the i's mentioned here do exist in utf8.
If you have further questions, please provide SELECT HEX(col)... for any text in question. For example (with spaces added for clarity):
In utf8: AÎDA --> 41 C38E 44 41; Aīda --> 41 C4AB 64 61
In cp1257: Aīda --> 41 EE 64 61
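These hex values can be double-checked with a short Python sketch. It also illustrates two guesses of mine that the thread does not confirm: that "AÎDA" is cp1257 data viewed through a latin1 lens, and that "Valent?na" comes from a conversion into a charset that has no ī:

```python
name = "Aīda"
utf8_hex = name.encode("utf-8").hex().upper()      # 41C4AB6461
cp1257_hex = name.encode("cp1257").hex().upper()   # 41EE6461

# Assumption: the stored bytes are cp1257 for the upper-case name.
stored = "AĪDA".encode("cp1257")
as_latin1 = stored.decode("latin-1")   # how a latin1 view shows the bytes
as_cp1257 = stored.decode("cp1257")    # how they were meant to be read

# Assumption: somewhere the text is converted to latin1, which lacks "ī".
lossy = "Valentīna".encode("latin-1", errors="replace")

print(utf8_hex, cp1257_hex, as_latin1, as_cp1257, lossy)
```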
this is really doing my nut.....
all relevant PHP Output scripts set headers (in this case only one file - the main php script):
header("Content-type: text/html; charset=utf-8");
HTML meta is set in head:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
all Mysql tables and related columns set to:
utf8_unicode_ci Unicode (multilingual), case-insensitive
I have been writing a class to do some translation. When the class writes to a file using fopen, fputs, etc., everything works great: the correct chars appear in my output files (which are written as PHP arrays and saved to the filesystem as .php or .htm files). eval() brings back the .htm files correctly, as does simply including the .php files when I want to use them. All good.
The problem arises when I try to create translation entries in my DB. My DB connection class has the following line added directly after the initial connection:
mysql_query("SET NAMES utf8, character_set_results = 'utf8', character_set_client = 'utf8', character_set_connection = 'utf8', character_set_database = 'utf8', character_set_server = 'utf8'");
Instead of seeing the correct chars, I get the usual crud you would expect from using the wrong charset in the DB. E.g.:
PropriÃ©tÃ©s
instead of:
propriétés
Don't even get me started on Russian, Japanese, etc. chars! But then, using UTF-8 should not make any single language's charset an issue...
What have I missed? I know it's not the PHP, as the site shows the correct chars from the included translation .php or .htm files; it's only when I am dealing with the MySQL DB that I have these issues. phpMyAdmin shows the entries with the wrong chars, so I assume it's happening when the PHP "writes" to MySQL. I have checked similar questions here on Stack Overflow, but none of the answers (all of which I had already taken care of) give me any clues...
Also, does anyone have thoughts on the speed difference between include $filename and eval(file_get_contents($filename))?
You say that you are seeing "the usual crud you would expect using the wrong charset". But that crud is in fact what you get from using utf8_encode() on a string that is already UTF-8, so chances are that you are not using the "wrong encoding" anywhere, but are encoding into UTF-8 more times than you should.
You may want to take a look at a library I made to fix this kind of problem:
https://stackoverflow.com/a/3521340/290221
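To illustrate the idea of undoing one extra round of encoding, here is a minimal Python sketch. It is not the linked library's algorithm; the function name is made up, and it assumes the extra round treated the bytes as latin1 (which is what utf8_encode() does):

```python
def undo_double_encoding(text: str) -> str:
    """Remove one layer of latin1-as-UTF-8 double encoding, if present."""
    try:
        # Turn each character back into the single byte it was misread from...
        raw = text.encode("latin-1")
        # ...and read those bytes as the UTF-8 they originally were.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not double-encoded (or not reversible this way): leave it untouched.
        return text

print(undo_double_encoding("PropriÃ©tÃ©s"))  # double-encoded input
print(undo_double_encoding("Propriétés"))      # already correct, returned as is
```

A correct string is left alone because its latin1 bytes are not valid UTF-8. The heuristic can still misfire on text that legitimately contains sequences like "Ã©", so treat it as a diagnostic aid rather than a blanket fix.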
Here is all you need to make sure those chars display correctly:
/* HTTP charset */
header("Content-Type:text/html; charset=UTF-8");
/* Set MySQL communication encoding */
mysql_set_charset("UTF8");
You also need to set the DB encoding to the correct one, as well as each table's encoding AND each field's encoding.
Last but not least, your PHP file's encoding should also match.
There is mysql_set_charset('utf8'); in the mysql extension for that. Call it right after connecting, before you run any other query.