Once again... php-mysql export UTF-8-issues - php

I've developed a PHP/MySQL application in which one table stores names. These names sometimes contain special characters (like é, à, ë, ...).
When creating the table I forgot to set the collation to UTF-8, so it is now set to LATIN1_SWEDISH_CI.
As a result, some data isn't displayed correctly in phpMyAdmin. But when I show the names on a PHP page, those special characters are displayed correctly. Here's an extract from a PHP file where I use UTF-8:
<?php ... ?>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
....
Like I said, the special characters are displayed as they should be. So far... no problem.
But now I would like to export that data into a CSV file, and guess what? The special characters aren't included in the CSV file.
My PHP-export-file contains the following lines of code:
<?php
mysql_query("SET NAMES utf8");
header('Content-Type: text/html; charset=UTF-8');
...
But no special characters are displayed.
Does anyone have a solution for this problem? I find it a little ridiculous to open the CSV in Excel and use 'Find & Replace'.
Using the HTML escape codes is out of the question. That's what UTF-8 is for, isn't it?

You have stored UTF-8 encoded data which MySQL regards as Latin-1 data. MySQL does not complain about this because any arbitrary sequence of bytes is valid Latin-1. Because the connection character set of the connection used to retrieve the data is the same as that used to insert it, the correct data is displayed on your web page. But if you view the data in a utility that takes pains to display the actually stored characters, you will see mis-encoded text, because that is what you actually have stored.
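A minimal sketch of the effect described above, assuming PHP with the mbstring extension. ISO-8859-1 is used here as a stand-in for MySQL's latin1, and the variable names are illustrative only. It shows what a strict tool does with the two UTF-8 bytes of "é" when the column says they are Latin-1:

```php
<?php
// "é" in UTF-8 is the two bytes C3 A9 (written as escapes so that the
// encoding of this source file does not matter).
$utf8 = "\xC3\xA9";

// A tool that trusts the column's latin1 label treats those bytes as two
// separate characters and re-encodes each of them for display.
$asSeenByStrictTool = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";               // c3a9
echo bin2hex($asSeenByStrictTool), "\n"; // c383c2a9, which renders as "Ã©"
```

The web page looks fine only because the bytes travel out over the same mislabelled connection they came in on, so nothing converts them.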
There are two things you need to do: firstly, you need to change your database connection code to make sure that all connections you make to your database are using the UTF-8 character set. This can be accomplished using a settings file or just by issuing a SET NAMES statement every time you connect.
Secondly, you need to correct the mis-encoded data already stored in the database. Do not alter table to change the character set to UTF-8 directly; if you do, you will end up with double-UTF-8-encoded data. Instead, use an alter table query to change the column to the binary character set, and after doing that, alter table again to UTF-8.
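The two-step repair could look roughly like the sketch below. The helper name, the table `people`, the column `name` and the length 255 are all hypothetical, and the VARBINARY/VARCHAR pair is one assumed way to express "binary first, then UTF-8" for a VARCHAR column. Take a backup first, and note that MODIFY drops attributes such as NOT NULL or DEFAULT unless you restate them.

```php
<?php
// Hypothetical helper: builds the two statements for one text column.
function buildRepairStatements(string $table, string $column, int $length): array
{
    return [
        // Step 1: drop the latin1 label without touching the stored bytes.
        "ALTER TABLE `$table` MODIFY `$column` VARBINARY($length)",
        // Step 2: relabel the very same bytes as UTF-8.
        "ALTER TABLE `$table` MODIFY `$column` VARCHAR($length) CHARACTER SET utf8",
    ];
}

// Print the statements so they can be reviewed before being run by hand.
foreach (buildRepairStatements('people', 'name', 255) as $sql) {
    echo $sql, ";\n";
}
```

Going through a binary type matters because MySQL does not convert bytes when a column changes to or from binary, whereas a direct latin1-to-utf8 change would convert them and produce the double encoding warned about above.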

Related

PHP <==> MySQL; storing Cyrillic / Scandinavian characters in the database

There are so many threads dedicated to this topic, that I feel silly having to ask this.
But, I'm at a total loss as to what the problem could be.
I am trying to insert special characters (Cyrillic, Scandinavian, etc.) into a MySQL database via a PHP (HTML) form.
Characters like Ä, Ö, Å, as well as Russian letters, etc.
Based on previous threads in this forum, I have tried all the following (inserted right after the MySQL database-connection string) :
$mysqli->set_charset("utf8");
This didn't work, so I tried the following :
mysqli_query("set names 'utf8'");
mysqli_query("set charset 'utf8'");
These are not recommended by PHP, but I tried them anyway. Still no luck.
(All my databases, tables, and columns are collated as : UTF8_general_ci)
In addition, all my html forms have the following :
<meta charset="utf-8">
So, I'm at a complete loss as to what I'm doing wrong. Once the data is sent to the database, it shows up (in the database itself) as rubbish characters (question marks, and other hieroglyphics).
However, the funny thing is :
(a) When I view the data on my website, it displays correctly;
(b) When the data is sent within the body of an email, it also displays correctly
So..........why is it not displaying correctly within the database itself ??
When dealing with a specific charset (like UTF-8), it's important that every link in the chain uses the same charset. Below are a few pointers on how to do this.
ALL attributes must be set to utf8 (collation is NOT the same as charset in the database)
You should save the document itself as UTF-8 (if you're using Notepad++, it's Format -> Convert to UTF-8 (or UTF-8 w/o BOM); there's a difference, and both or either may work for you)
The header in both PHP and HTML should be set to UTF-8:
HTML: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
PHP: header('Content-Type: text/html; charset=utf-8');
Upon connecting to the database, set the charset to UTF-8, like this:
$connection->set_charset("utf8"); (directly after connecting)
Also make sure your database and tables are set to UTF-8, you can do that by this query (in the database, need only be done once):
ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Remember that EVERYTHING needs to be set to the UTF-8 charset. If something can be set to UTF-8 (or another charset; check the PHP docs at php.net), it should be set to the same charset as everything else.
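Put together, the connection part of these pointers might be wrapped up as in this sketch. The function name and its parameters are hypothetical, and it assumes the mysqli extension. The default of utf8mb4 is also an assumption on my part (it is MySQL's full 4-byte UTF-8); pass 'utf8' to match the statements above exactly.

```php
<?php
// Hypothetical helper: opens a mysqli connection and pins its character set
// before any query is sent.
function connectUtf8(
    string $host,
    string $user,
    string $password,
    string $database,
    string $charset = 'utf8mb4'
): mysqli {
    // Throw exceptions instead of failing silently.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $connection = new mysqli($host, $user, $password, $database);
    $connection->set_charset($charset); // directly after connecting

    return $connection;
}

// Usage (placeholder credentials):
// header('Content-Type: text/html; charset=utf-8');
// $connection = connectUtf8('localhost', 'user', 'secret', 'databasename');
```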
(a) When I view the data on my website, it displays correctly;
(b) When the data is sent within the body of an email, it also displays correctly
This means the data is stored correctly in the db: if the output you get back matches the input, that is the logical conclusion, right?
The other question is: how are you looking into the database, and which client are you using?
phpMyAdmin, some desktop client? The problem will be there, because the data seems to be stored correctly ;)

Why Unicode Data is stored in numeric form in mysql database

I have the following:
$html = <div>ياں ان کي پرائيويٹ ليمٹڈ کمپنياں ہيں</div>
But it is stored in the MySQL database in the following format:
تو يہ اسمب
لي ميں غر
يب کو آنے
نہيں
Actually, when I retrieve the data from the MySQL database and show it on the web page, it is displayed correctly.
But I want to know: is this the standard format for storing Unicode in the database, or should the Unicode data be stored as it is (ياں ان کي پرائيويٹ ليمٹڈ کمپنياں ہيں)?
When you store unicode in your database...
First off, your database has to use a UTF-8 character set (for example with the utf8_general_ci collation), which is not the default. With MySQL, you have to set both the table AND the individual columns to UTF-8. In addition, you have to be sure that your connection is a UTF-8 connection, but how you do that depends on the method you use to store the Unicode text in your database.
To set your connection's char-set, if you are using Mysqli, you would do this:
$c->set_charset('utf8'); where $c is a Mysqli connection.
Still, you have to change your database charsets like I said before.
EDIT: I honestly don't think it matters MUCH how you store it, though I store it as the actual unicode characters, because that way if some user were to input '& #1610;' into the database, it wouldn't be retrieved as a unicode character by mistake.
EDIT: Here is a good example, if you remove that space between & and #1610; in my answer, it will be mistakenly retrieved from the server as a unicode character, unless you want users to be able to create unicode characters by using a code like that.
Not a perfect example since stackoverflow does that on purpose, and it doesn't work like that really, but the concept is the same.
Something is wrong with the data's charset; I don't know exactly what.
Here is a workaround. Apply it before insert/update:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
It looks to me like this is HTML encoding, the way PHP encodes Unicode to make sure it will display OK on the web page, no matter the page encoding.
Did you try to fetch the same data using MySQL Workbench?
It seems that somewhere in your PHP code htmlentities is being used on the text -- instead of htmlspecialchars. The difference with htmlentities is that it escapes a lot of non-ASCII characters in the form you see there. Then the result of that is being stored in the database. It's not MySQL's doing.
In theory this shouldn't be necessary. It should be okay to output the plain characters if you set the character set of the page correctly. Assuming UTF-8, for example, use header('Content-Type: text/html; charset=utf-8'); or <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
This might result in gibberish (mojibake) if you view the database directly (although it will display fine on the web page) unless you also make sure the character set of the database is set correctly. That means setting the table columns, table, database, and connection character set all to, probably, utf8mb4 or utf8 (with a collation such as utf8mb4_bin, utf8_bin, or a ..._general_ci one). In practice, getting it all working can be a bit of a nuisance. If you didn't write this code, then probably someone in your code base decided at some point to use htmlentities on it to convert the exotic characters to ASCII HTML entities, to make storage easier. Or sometimes people use htmlentities out of habit when plain htmlspecialchars would be fine.
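To make the difference between the two functions concrete, here is a small sketch. The sample string is made up for illustration, and the UTF-8 bytes are written as escapes so that the file encoding does not matter:

```php
<?php
$name = "caf\xC3\xA9 <b>"; // "café <b>" as UTF-8 bytes

// htmlspecialchars() escapes only the HTML-significant characters.
$special = htmlspecialchars($name, ENT_COMPAT, 'UTF-8');
// htmlentities() also rewrites every character that has a named entity.
$entities = htmlentities($name, ENT_COMPAT, 'UTF-8');

echo $special, "\n";  // café &lt;b&gt;
echo $entities, "\n"; // caf&eacute; &lt;b&gt;

// The workaround quoted above turns a stored numeric entity back into the
// real character.
$decoded = html_entity_decode('&#1610;', ENT_COMPAT, 'UTF-8');
echo bin2hex($decoded), "\n"; // d98a, the UTF-8 bytes of U+064A
```

If entities are already sitting in the database, decoding them once and storing the plain characters keeps the stored text searchable and readable in any client.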

Retrieving Unicode text from MySQL causing ???? in output

I am storing Unicode text لاہور in MySQL, I have set tables and columns to utf8_general_ci. The text لاہور is displaying correctly in MySQL. However if I echo that with PHP it shows ?????? on the browser window.
One thing to mention here: I have the whole document in Unicode and all words are displaying correctly, but they are written directly i.e. not coming from MySQL.
Even if I try
$p="لاہور";
echo $p;
It displays لاہور in the browser. Things go wrong only when retrieving from MySQL.
One common cause for this is that your PHP script is saved in another encoding (for example ASCII). Make sure your PHP script is also saved as UTF-8, or whatever encoding you use in your database.
Another possible cause is that MySQL is not returning proper Unicode characters to your script. You can run mysql_query("SET NAMES utf8"), or whatever encoding you want to use, before processing your queries. A good way to troubleshoot this problem is to convert both strings to their Unicode code points and compare them to see if they're the same.
It may not always be sufficient to set the content type using meta tags; I usually set it via a header as well, as below.
header('Content-Type: text/html; charset=utf-8');
Most likely your MySQL connection (as opposed to storage) has not been set to UTF-8, causing the UTF-8 data retrieved from MySQL to be converted to Latin1 (or similar), which cannot represent those characters and they are replaced with a ?.
If you are using mysql_:
mysql_set_charset( 'utf8' );
If you are using mysqli_:
$mysqli->set_charset( 'utf8' );
before you make any queries
If you are using PDO, add charset=utf8 to the connection string.
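The lossy conversion described above can be reproduced without a database. This sketch assumes the mbstring extension; the DSN host and database name are placeholders, and the PDO line is left commented out because it needs real credentials:

```php
<?php
// "لاہور" written as code point escapes (PHP 7+), stored as UTF-8 bytes.
$utf8 = "\u{0644}\u{0627}\u{06C1}\u{0648}\u{0631}";

// Use "?" for characters the target encoding cannot represent.
mb_substitute_character(0x3F);
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo $latin1, "\n"; // ?????

// For PDO, the charset goes into the DSN (hypothetical host and dbname).
$dsn = 'mysql:host=localhost;dbname=example;charset=utf8';
// $pdo = new PDO($dsn, $user, $password);
```

Once the question marks have been produced, the original characters are gone from that copy of the string, which is why the fix belongs on the connection rather than in the output code.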

Black Diamonds that are Fixing themselves in MySQL

I am running into a very strange issue with a site that I am working on. The site is basically a job board where the owner or users can create job listings including a description that ends up being stored into a MySQL text field. What we are experiencing is this, whenever listings from certain sources are entered, they initially end up with the "Black Diamond" with a question mark inside character in place of apostrophes and double spaces. This part I know is an encoding issue and can correct. The real question is this, these black diamonds show when the record is displayed in a MySQL admin tool and when the job listing is viewed in a web browser (simple select statement displays the listing in a PHP app), but after the first time it is viewed, then the problem somehow fixes itself. It is like the running the select then displaying the record updates the job description field and fixes the encoding issues. How could this be? Has anyone ever heard of this or anything similar? I cannot understand how a database field would change without running an update statement...
How are the job listings entered? Are they entered via a web page? If so, what character encoding does the web page use? (This should determine the character encoding of the submitted data AFAIK.) What character set is the connection used to communicate with MySQL? What is the character set of the column the data is stored in? Finally, what is the character encoding of the web page(s) on which the entered data is reviewed?
Here is what I do: I declare all of my pages as UTF-8 encoded, using the following tag at the start of the <head> section:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I issue the following command immediately when I connect to MySQL, so as to make sure that MySQL understands the data I send to it will be UTF-8 encoded:
SET NAMES utf8
(Depending on the database abstraction method you use, a special function might be recommended in order to set the connection character set, like mysqli's mysqli_set_charset().)
I also make sure that those columns in which I intend to store UTF-8 data are declared to be UTF-8. You can find out what the character set of a column is by issuing SHOW CREATE TABLE table_name. The character set of the table (which by default is the character set for any column in the table) is displayed at the end. If the character set for the column is different to the default character set for the table then it is displayed as part of the column definition. If you wish to change the character set of a column then you can do so using ALTER TABLE.
If you have not previously taken the steps to handle character sets in your app then you may find that the tables are all using the latin1 character set. If you naively store UTF-8-encoded data (for example) into these columns, you may run into character encoding issues. Changing the column character set using ALTER TABLE does not necessarily fix your old data, because MySQL reads your old data assuming it to be valid latin1-encoded text and converts it to the equivalent UTF-8 (correctly converting what it has read, but not giving the result you want).
The above steps would hopefully mean that future data will be correctly encoded and correctly displayed, but you may have data already mis-encoded in your database, so be aware that if you follow the above steps and still see older data displaying incorrectly, this may be why. Good luck.
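As an illustration of where a "black diamond" can come from, the sketch below assumes the pasted listings arrive in Windows-1252, which the thread does not confirm. The sample string is made up, and the mbstring extension is required:

```php
<?php
// 0x92 is the right single quotation mark in Windows-1252; on its own it is
// not a valid UTF-8 sequence, so a UTF-8 page shows a replacement character.
$pasted = "It\x92s";

var_dump(mb_check_encoding($pasted, 'UTF-8')); // bool(false)

// Converting from the assumed source encoding yields proper UTF-8.
$fixed = mb_convert_encoding($pasted, 'UTF-8', 'Windows-1252');
var_dump(mb_check_encoding($fixed, 'UTF-8'));  // bool(true)
echo bin2hex($fixed), "\n";                    // 4974e2809973
```

A check like mb_check_encoding() at the point of entry is one way to catch such input before it is stored.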
Ran into this problem a few years ago... I remember finding those notorious characters and replacing them in PHP with a single quote or a double quote... Of course with escaping... A simple preg_replace for those characters will do the trick... It's just an encoding issue...
This page, though geared for wordpress might help
http://codex.wordpress.org/Converting_Database_Character_Sets
I had the same issue (MySQL encoding and webpage encoding set to UTF-8, but black diamonds showing up in my query results). I found this snippet while googling but cannot for the life of me find its source to give proper attribution:
if (function_exists('mysql_set_charset')) {
    mysql_set_charset('utf8', $db_connection);
} else {
    mysql_query("SET NAMES 'utf8'", $db_connection);
}
Anyway, it cleared up the issue for me.

mysql and encoding

I moved my PHP application to a new server. I use a MySQL 5 db. When I update or insert something into the db, every " and - sign is changed to ?. I use SET NAMES UTF8 and SET CHARACTER SET, but it doesn't work. Any ideas?
SET NAMES UTF8 should be used on every page, when selecting as well as when updating or inserting.
Actually, this query must be run every time you connect to the database. Just add it to your connection code.
You need UTF-8 all the way through to make smart quotes and dashes (“”—) and other non-ASCII characters work reliably:
(1) Ensure that the browser sends you characters encoded to UTF-8. Do this by declaring the page that includes the form to be UTF-8:
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
...
(Ignore <form accept-charset>, which doesn't work right in IE.)
(2) PHP deals with raw bytes and doesn't care what encoding they're in, but the database does care, so you have to tell it what encoding the bytes from PHP are coming in. This is what SET NAMES is doing, though mysql_set_charset may be preferable.
(3) Once the proper characters have reached the database, it'll need to store them in a Unicode encoding to make sure all characters can fit. Each column can have a different encoding, but you can use DEFAULT CHARACTER SET utf8 when you CREATE table to make all the text columns in it use UTF-8. You can also set the default character set for a database or the whole server to utf8 if you prefer.
If you have already CREATEd the tables and they use a non-UTF-8 collation, you'll have to recreate or alter the tables. You can check the current collation using SHOW FULL COLUMNS FROM sometable;.
(4) Make sure you HTML-encode text you output from PHP using htmlspecialchars() and not htmlentities(), which by default will mess up non-ASCII characters.
[You can, as an alternative to (2) and (3), just use the default Latin-1 encoding for the connection and the table storage, but put UTF-8 bytes in it nonetheless. The disadvantage of this approach is that it'll look wrong to other tools looking at the database, and lower/upper case characters won't compare against each other in the expected case-insensitive way.]
My guess is you are pasting from some text editor which is transforming the " into an angled pretty quote, and transforming your - into an mdash, which is causing both to be represented as ?.
While you set your database to accept UTF-8 characters, you probably did not set your webserver/PHP to accept those characters. Try playing with the mbstring functions, but check to make sure you aren't using the slanted quotes or dashes.
