MySQL + PHP encoding

I know there are plenty of questions like this, but I am creating a new one because, in my view, every situation is specific.
So, my page is displayed in UTF-8. The data is taken from MySQL, where the table has the utf8_unicode_ci collation. The data I am displaying is the string - 1 Bröllops-Festkläder.
There are some Unicode characters in here and they should display fine, but they do not. On my page they are just a bunch of hieroglyphs.
Now, the interesting situation:
I am using phpMyAdmin to keep track of what is happening in the database. The website can import CSV documents containing customer data, and each customer can also be modified individually. If I import a CSV document containing these characters, they are written to the database, readable in phpMyAdmin, and not readable on my page. If I use my script to modify the customer information and type those characters in the browser, it is vice versa: they are readable on the page and not readable in phpMyAdmin. So clearly the encoding is different. I spent ages trying to figure out the right combination and could not.
UPDATE: Deceze posted a link below that I am copying here to make it more noticeable. I am sure this will save hours and days for many people facing similar issues - Handling Unicode Front to Back in a Web App

There are a couple of things involved here. If your database encoding is fine and your HTML encoding is fine and you still see artefacts, it is most likely that your DB connection is not using the same encoding, which corrupts the data in transit. If you connect by hand, you can easily enforce UTF-8 by running the query SET NAMES utf8 as the very first thing after you connect() to your database. It is sufficient to do this only once per connection.
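In PHP that step is a one-liner; a minimal sketch, assuming a plain mysqli connection (host, credentials and database name are placeholders):
$mysqli = new mysqli('localhost', 'db_user', 'db_pass', 'my_database');
// Very first thing on this connection: force client, connection and results to UTF-8.
$mysqli->query("SET NAMES utf8");
// With mysqli, $mysqli->set_charset('utf8') is the equivalent call and also keeps
// mysqli_real_escape_string() aware of the connection charset.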
EDIT: one important note, though - depending on how you put your data into the DB, the existing database content may also need fixing, since it can be corrupted if it was written over a broken connection. So, if anyone is facing the same issue: once you have set everything up, make sure you check against a fresh data set, or you may still see things incorrectly even though everything is now fine.

Related

PHP5 to PHP7 upgrade causes encoding troubles in SQL Server database

We have a PHP 5.6 website project and we are about to re-launch it on PHP 7.4.
Let's call them the old environment and the new environment. The old one is still intact. Both are on different server machines.
Charsets (html meta tags) are set to utf-8.
Zend Framework 1 is involved in both. The database is on an SQL Server, shared by both environments. We use the SqlSrv driver to connect to the database (new environment), the old environment has PDO-Sql.
The collation of the database is set to Latin1_General_CI_AS.
Information is inserted into and selected from many tables (INSERT, SELECT). HTML text fields and textareas are in use.
In the old environment, any text with special characters such as umlauts written in the text fields/areas is saved in the database in a corrupted form: instead of ö there is Ã¶ in the database table. On screen, however, after a SELECT statement it is shown as ö (clean!).
That was all fine until now, but now we have the new environment.
Let's say there are old entries saved during the old-environment era and we open the website in the new environment. The content is shown 1:1 as it appears in the database table, in other words: corrupted. That also explains why anything saved through the new environment is shown correctly on screen, since special characters and umlauts are saved unchanged in the database table.
But the entries made with the new environment cannot be seen on the old environment website.
Using utf8_encode or utf8_decode didn't help much; either it looked even worse, or there was no text to be seen on screen at all.
Writing a script that changes the encoding in the table would cause mayhem; since the old environment is still in use, it can't be done that easily.
There are no encoding options among the options used on the Zend_Db_Adapter_Sqlsrv class.
I don't really trust mb_detect_encoding, but we tried it anyway, and it returned UTF-8 for the values coming back from the tables.
So what would people recommend? I might have missed some facts, but I'll provide you with more information if needed.
This sounds very similar to a problem I've solved in the past. Unfortunately I solved it in ASP.NET, so I can only describe what I did and let you translate it into PHP.
So the issue probably arises because your old system is using a non-UTF-8 codepage; in my case the codepage was windows-1252, which was fairly common at the time. The codepage determines the character encoding that your code uses.
So on my more modern system what I had to do was force the codepage back to windows-1252 while I was reading from the database. And then before rendering the page, set the content encoding to UTF-8.
So unless you are able to fix the problem at the source, you basically have to hack your new system to continue operating the same way - which is unfortunate but sometimes necessary.
The ASP.NET code looks like this:
protected void Page_Load(object Sender, EventArgs Args)
{
    // Set the encoding for building and rendering, then switch later to display as utf-8
    Response.Charset = "windows-1252"; // Hmmm... double check this
    Response.ContentEncoding = System.Text.Encoding.GetEncoding("windows-1252");
}

protected override void Render(HtmlTextWriter writer)
{
    // Render the page with the windows-1252 settings first...
    base.Render(writer);
    // ...now that all the character encoding has taken place, switch to utf-8 to force it to display this way
    Response.Charset = "utf-8";
    Response.ContentEncoding = System.Text.Encoding.UTF8;
}
Hopefully that gives you enough to go on... it's been a long time since I did this, but the pain still sticks in my mind!
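In PHP, a rough sketch of the same idea might look like the following. It assumes the legacy rows come back doubly encoded (UTF-8 bytes that were stored in a windows-1252/Latin1 column), so one layer of encoding is stripped before the page is served as UTF-8; the helper name and the column are made up:
// Hypothetical helper: undo one layer of encoding on values from old rows.
// For example, the string "Ã¶" becomes the bytes C3 B6, which a UTF-8 page shows as "ö".
function fix_legacy_text($value)
{
    return mb_convert_encoding($value, 'Windows-1252', 'UTF-8');
}

header('Content-Type: text/html; charset=utf-8');
echo fix_legacy_text($row['description']); // $row: a record written in the old-environment era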
Since it was a project someone else did, we decided to update all tables and correct the values.

PHP/MySQL Encoding

I have a website with Arabic content that has been migrated from a different server. On the old server everything was displaying correctly; supposedly everything was encoded in UTF-8.
On the current server, the data started displaying incorrectly, showing mojibake where نبذة عن and similar strings should appear.
The application is built on the CakePHP framework.
After many trials, I changed the 'encoding' parameter in the MySQL connection array to 'latin1'. For people who don't know CakePHP, this sets MySQL's connection encoding. Setting this value to utf8 did not change anything, even after the steps described below.
Some of the records started showing correctly in Arabic, while others remained gibberish.
I have already gone through all the database and server checks, confirming that:
The database created is UTF-8.
The table is UTF-8.
The columns are not explicitly set to any encoding, thus encoded in UTF-8.
Default Character set in PHP is UTF-8
mysql.cnf settings default to UTF-8
After that, I retrieved my data and looped through it, printing the encoding of each string (from each row) using mb_detect_encoding. The rows that display correctly return UTF-8, while it returns nothing for the rows that are corrupt.
The data on the website has been edited multiple times, possibly with different encodings; this is something I cannot know for sure. What I can confirm, though, is that the only two encodings this data might have passed through are UTF-8 and latin1.
Is there any possible way to recover the data when mb_detect_encoding returns nothing and the current encoding of the data is unknown?
UPDATE: I have found out that while the database was active on the new server, the my.cnf was updated.
The below directive was changed:
character-set-server=utf8
To
default-character-set=utf8
I am not sure how much of a difference this makes, though.
Checking the modified dates, I can conclude with a fair degree of certainty that the data I could recover was not edited on the new server, while the data I couldn't recover has been edited there.
Try to fix the problem from the DB side, not from PHP or the DB connection.
I advise you to go to your old server and export your DB again with character set UTF-8,
then import it into the new server. Make sure you can see the Arabic characters inside the tables (with phpMyAdmin).
If your tables look fine, then you can move on to checking the rest:
the DB connection
the PHP file encoding
the header encoding in the HTML
As far as I know, if the problem is in the DB, there is no way around exporting the data again from the old server.
Edit:
If you do not have access to your old DB, please check this answer; it can help you.
You were expecting نبذة عن? That is Mojibake. See the duplicate for discussion and a solution, including how to recover the data via a pair of ALTER TABLEs.
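For reference, the usual shape of that pair of ALTERs, shown here as a hedged sketch: it applies when the column is really declared latin1 but holds UTF-8 bytes, the table, column and sizes are placeholders, and the table should be backed up first.
$db = new mysqli('localhost', 'db_user', 'db_pass', 'my_database');
// Step 1: go through a binary type so MySQL does not reinterpret the bytes.
$db->query('ALTER TABLE posts MODIFY body VARBINARY(1000)');
// Step 2: re-label the same bytes as UTF-8 text (utf8mb4 also works).
$db->query('ALTER TABLE posts MODIFY body VARCHAR(1000) CHARACTER SET utf8');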
I had a similar problem when migrating database tables encoded with utf8 from a public server to localhost. The resolution was setting the connection encoding on localhost using PHP
$db->set_charset("utf8")
right after the mysqli connection.
Now it works properly.

PHP mysql fixed connection to utf8, but now existing greek data is useless

I have a MySQL database storing some fields in Greek characters. In my HTML I have charset=utf-8 and my database columns are defined with the utf8_general_ci collation. But I was not setting the connection encoding so far. As a result, the data does not display well in the database itself, but when I read it back in PHP it all shows fine.
Now I am trying to do this the right way, so I also added the following to my database functions:
$mysqli->set_charset("utf8");
This works great for new entries.
But for existing entries the problem is that when I read the data in PHP it comes back garbled, since the connection encoding has now changed.
Is there a way to fix my data and make it usable again? I can keep working the old way, but I know it's wrong and may cause more problems in the future.
I solved this issue as follows:
In a PHP script, retrieve the information as I do now, i.e. without setting the connection charset. This way the earlier mistake is inverted and corrected, and in your PHP variables you will have the characters in the correct UTF-8 form.
In the same PHP script, write the information back with the connection set to utf8.
At this point the correct characters are in the database.
Finally, I changed all the read/write functions of my site to set the connection to utf8 from now on.
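A minimal sketch of that one-off repair, assuming a table customers with id and name columns (table, columns and credentials are placeholders): one connection is left at its old default to read, a second one uses set_charset('utf8') to write back.
// Read over a connection WITHOUT set_charset(), so the old double conversion
// is reversed and PHP receives proper UTF-8.
$read = new mysqli('localhost', 'db_user', 'db_pass', 'my_database');

// Write back over a connection WITH the charset set correctly.
$write = new mysqli('localhost', 'db_user', 'db_pass', 'my_database');
$write->set_charset('utf8');

$rows = $read->query('SELECT id, name FROM customers');
$stmt = $write->prepare('UPDATE customers SET name = ? WHERE id = ?');
while ($row = $rows->fetch_assoc()) {
    $stmt->bind_param('si', $row['name'], $row['id']);
    $stmt->execute();
}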

ARC2 (PHP semantic web library) wrongly double-converts UTF-8 file to UTF-8

Using ARC2, textual data gets corrupted.
My RDF input file is in UTF-8. It gets loaded into ARC2, which uses a MySQL backend, through a LOAD <path/to/file.rdf> query. The MySQL database is in UTF-8 too, as a check with phpMyAdmin confirms.
However, the textual data gets corrupted. After several conversion checks, the problem seems to be that the original UTF-8 file is treated as ISO-8859-1 and converted to UTF-8 once again.
Example: "surmonté" → "surmonteÌ".
This "surmonteÌ" is actulally available in UTF-8 in the database.
Is this related to the way ARC2 opens files (digging through the code, not exhaustively but quite deep, did not show anything suspicious), or could this be a more general case with PHP and MySQL?
How can I make sure the imported data is not wrongly re-encoded but taken as the original?
ARC2 uses two functions: $store->setUp(), which CREATEs the TABLEs and DATABASE if need be; and query(LOAD…, as detailed in the question.
It turns out the setUp() part must not be called in the same script as the load part - at least not during the same execution. The solution I took was to make two separate scripts, one to initialise the database and another to load the data, but simply commenting out the init part once it has run also works. In any case, the trick is to make sure the loading does not take place right after the initialisation.
This happens because the SET NAMES utf8 encoding specification on the DB connection is issued only after collation detection, which MySQL does not seem to perform correctly if the database has just been created. I made a pull request with a fix.
As a side note, it is not efficient to use the LOAD <path/to/file.rdf> construct from the question: the path is treated as a relative web address, so the server ends up downloading the file from itself over the network. It is much more efficient to use a construct such as:
$store->query('LOAD <file://' . dirname(__FILE__) . '/path/to/file.rdf>')
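Putting the two points together, a sketch of the two-script split (the ARC2 config keys are the standard ones; host, credentials, store name and paths are placeholders, and the second script needs the same $config array):
// init_store.php - run once, on its own
include_once 'arc/ARC2.php';
$config = array(
    'db_host' => 'localhost',
    'db_name' => 'my_database',
    'db_user' => 'db_user',
    'db_pwd'  => 'db_pass',
    'store_name' => 'rdf_store',
);
$store = ARC2::getStore($config);
$store->setUp(); // creates the tables; do not LOAD anything in this run

// load_data.php - run afterwards, as a separate execution
include_once 'arc/ARC2.php';
$store = ARC2::getStore($config); // same $config as in init_store.php
$store->query('LOAD <file://' . dirname(__FILE__) . '/path/to/file.rdf>');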

Help with multi-lingual text, php, and mysql

I have had no end of problems trying to do what I thought would be relatively simple:
I need to have a form which can accept user input text in a mix of English an other languages, some multi-byte (ie Japanese, Korean, etc), and this gets processed by php and is stored (safely, avoiding SQL injection) in a mysql database. It also needs to be accessed from the database, processed, and used on-screen.
I have it set up fine for Latin characters, but when I add a mix of Latin and multi-byte characters it turns garbled.
I have tried to do my homework but just am banging my head against a wall now.
Magic quotes are off. I have tried using utf8_encode/decode, htmlentities, addslashes/stripslashes, and (in MySQL) both "utf8_general_ci" and "utf8_unicode_ci" for the field in the table.
Part of the problem is that there are so many places where I could be messing it up that I'm not sure where to begin solving the problem.
Thanks very much for any and all help with this. Ideally, if someone has working php code examples and/or knows the right mysql table format, that would be fantastic.
Here is a laundry list of things to check to make sure they are in UTF-8 mode:
MySQL table encoding. You seem to have already done this.
MySQL connection encoding. Do SHOW STATUS LIKE 'char%' and you will see what MySQL is using. You need character_set_client, character_set_connection and character_set_results set to utf8, which you can easily do in your application by issuing SET NAMES 'utf8' at the start of every connection. This is the one most people forget to check, IME.
If you use them, your CLI and terminal settings. In bash, this means LANG=(something).UTF-8.
Your source code (this is not usually a problem unless you have UTF8 constant text).
The page encoding. You seem to have this one right, too, but your browsers debug tools can help a lot.
Once you get all this right, all you will need in your app is mysql_real_escape_string().
Oh, and it is (sadly) possible to successfully store correctly encoded UTF-8 text in a column with the wrong encoding type, or via a connection with the wrong encoding type. And it can come back "correctly", too - until you fix all the bits that aren't UTF-8, at which point it breaks.
I don't think you have any practical alternative to UTF-8. You're going to have to track down where the encoding and/or decoding breaks. Start by checking whether you can round-trip multi-language text to the database from the mysql command line, or perhaps through phpMyAdmin, and track down and eliminate problems at that level. Then move out one more level by simulating input to your PHP and examining the output, again dealing with any problems. Finally, add browsers into the mix.
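For the middle step, a throwaway round-trip test along these lines can help isolate where things break - table name, credentials and the sample string are arbitrary placeholders, and the script file itself must be saved as UTF-8:
// Assumes: CREATE TABLE roundtrip_test (id INT AUTO_INCREMENT PRIMARY KEY, txt VARCHAR(255)) CHARACTER SET utf8;
$db = new mysqli('localhost', 'db_user', 'db_pass', 'test_db');
$db->set_charset('utf8'); // or utf8mb4 on newer setups

$sample = 'English + 日本語 + 한국어';
$stmt = $db->prepare('INSERT INTO roundtrip_test (txt) VALUES (?)');
$stmt->bind_param('s', $sample);
$stmt->execute();

$result = $db->query('SELECT txt FROM roundtrip_test WHERE id = ' . $db->insert_id);
$row = $result->fetch_assoc();
var_dump($row['txt'] === $sample); // bool(true) means the database leg is fine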
First you need to check whether you can add multi-language text to your database directly. If that works, you can do it from your application.
Are you serializing any data by chance? PHP's serialize function has some issues when serializing non-English characters.
Everything you do should be UTF-8 encoded.
One thing you could try is to json_encode() the data when putting it into the database and json_decode() it when it's retrieved.
The problem was caused by my not having the default charset set in the php.ini file, and (possibly) by not having set the charset on the MySQL table (in phpMyAdmin, via the Operations tab).
Setting the default charset to "utf-8" fixed it. Thanks for the help!!
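For anyone hunting for the exact directive, the php.ini setting the poster presumably means is default_charset; a quick way to verify it from PHP:
// php.ini: default_charset = "UTF-8"
var_dump(ini_get('default_charset')); // expect string(5) "UTF-8"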
Check your database connection settings. It also needs to support UTF-8.
