Problems with unicode characters in PHP and MySQL - php

I have a column in my UTF-8 MySQL table that is datatype 'longtext'. When I display the string on a charset=UTF-8 page in PHP, I get a unicode character (� or U+FFFD) occasionally. Example:
"None of these adjustments affects existing force structure or military capabilities, and the efficiencies will further enable U.S. European Command to resource high priority missions,"� Pentagon Press Secretary Navy Rear Adm. John Kirby said in the release.
I have tried wrapping my string in preg_replace() and html_entity_decode() to replace the unicode character with nothing, but without much luck:
$content = html_entity_decode(preg_replace("/U\+([0-9A-F]{4,5})/", "", $getstory[0]['content']), ENT_NOQUOTES, 'UTF-8');
As a side note, this issue doesn't occur with new data inserted into this table column, only with older data.
Any suggestions?

Try changing the encoding of your PHP file to UTF-8. Most editors have this under something like Tools - Character Encoding.
If you can't find it, open the file in Notepad and go to File - Save As. In the save dialog, below the file name, there is an option to choose the character encoding the file is saved in.
EDIT:
It looks like you want to change your database charset. Go to phpMyAdmin; there you can change it for the database, and for each table separately.
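Since only the older rows are affected, one possibility is that they were stored as Windows-1252 bytes while newer rows are real UTF-8. The sketch below assumes exactly that; `to_utf8()` is a hypothetical helper, not something from the question, and it relies on the mbstring extension.

```php
<?php
// Hypothetical helper: return $text unchanged if it is already valid UTF-8,
// otherwise assume the bytes are Windows-1252 and convert them.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

$old = "missions,\x94 Pentagon";         // 0x94 is a closing curly quote in Windows-1252
$new = "missions,\xE2\x80\x9D Pentagon"; // the same quote, already encoded as UTF-8

echo to_utf8($old) === to_utf8($new) ? "same\n" : "different\n";
```

Because valid UTF-8 passes through untouched, the helper can be applied to old and new rows alike. If the old data turns out to be in some other encoding, the source encoding name would need to change.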

Related

UTF-8 Charset displaying french characters incorrectly.

I am doing the following steps to import data into my e-commerce shop:
convert excel sheet to csv in excel
open csv file in notepad++ and convert to UTF-8
import csv file in phpmyadmin
If I look at the front end of the webpage the french characters are displayed as ?. The charset of the page is utf-8
If I change the charset to iso-8859-1 everything displays correctly.
If I check the item in the phpmyadmin the accents are displayed correctly.
How come utf-8 is not displaying them correctly? I thought it should display é etc.
If I go to the back end of the website and edit the product, the French description displays properly in the WYSIWYG editor. If I then save the product, the French characters show correctly, but only because the WYSIWYG editor converts the characters to HTML entities.
A common issue when collecting Unicode data is leaving the connection and the database/table/column character set configured as ISO-8859-1, but then inserting data that is actually UTF-8. The database is essentially told, "here's some 8859-1-encoded data, store it in this 8859-1 table". It doesn't do any conversion because it doesn't realize the data isn't in 8859-1. So the data is UTF-8, but the database has been told it's 8859-1.
It's an insidious problem because, as you say, the database will convert the characters wrongly if you change your charset to UTF-8: it converts the "8859-1" data (remember, the database thinks it's 8859-1) to UTF-8, a conversion that of course fails, as the data really is UTF-8 already.
So basically the problem is that the database is set to 8859-1: you told it to store the data as 8859-1, told it you were providing 8859-1, and then gave it UTF-8 data. The database thinks it's 8859-1, so the only easy ways to solve the problem are to a) keep acting like it's 8859-1 even though it's not, and hope you never have to deal with sorting, searching, collation, etc. (this may work in your case), or b) pull out the data as 8859-1 (leaving it unconverted), then re-insert it after setting the database and connection to UTF-8, so the database knows what character set the data really is in.
Hope that makes sense. Let me know if it doesn't. This is a hard one to wrap your head around.
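To make the mechanism concrete, here is a small simulation in PHP. It does not talk to a database; it only mimics, with mbstring, the conversion the server would perform, and the variable names are made up for illustration.

```php
<?php
$stored = "\xC3\xA9"; // UTF-8 bytes for "é", sitting in a column labelled ISO-8859-1

// Connection charset ISO-8859-1: no conversion, the bytes pass through as-is.
$viaLatin1Connection = $stored;

// Connection charset UTF-8: the server converts "ISO-8859-1" to UTF-8,
// so each stored byte is expanded into its own two-byte character.
$viaUtf8Connection = mb_convert_encoding($stored, 'UTF-8', 'ISO-8859-1');

echo bin2hex($viaLatin1Connection), "\n"; // c3a9     (a UTF-8 page shows "é")
echo bin2hex($viaUtf8Connection), "\n";   // c383c2a9 (a UTF-8 page shows "Ã©")
```

This is why the data looks fine as long as everything keeps pretending to be 8859-1, and breaks the moment one layer is switched to UTF-8 on its own.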
You might consider opening your csv with PHP (since you mention it in your tags), and use utf8_encode on the fields before saving them with queries.
This question is so old, but changing the encoding of the file from ISO-8859-1 to UTF-8 in various programs such as Excel etc was not working for me.
My issue was that words like intérêt showed up as intÃ©rÃªt in the file.
In case this helps someone, here is what finally worked for me:
Starting with a CSV file, open in Notepad
Click "File > Save As".
In the dialog window that appears - select "ANSI" from the "Encoding" field. Then click "Save".
That's it! Opening this new CSV file using Excel should now show the non-English characters properly.

How can I change the character set of a MySQL database with PHP without changing the correct ones?

I'm currently using a MySQL database, and the previous guy who maintained the database changed the character set from ISO-8859-1 to UTF-8. Now the problem is that every ä turns into Ã¤. I've written code to change all of the records in the entire database, but apparently some words are already written correctly.
So for example you have a word like PÃ¶ytÃ¤krono and a word like Sisäänkirjautuminen.
If I use iconv('UTF-8', 'ISO-8859-1', 'PÃ¶ytÃ¤krono') it will give Pöytäkrono,
but when I use iconv('UTF-8', 'ISO-8859-1', 'Sisäänkirjautuminen') it will give S.
Because the database is quite big I want to do this automatically, but I only want to change the words that are wrong, not the ones that are already spelled correctly.
You can change database storage encoding just like that and it will work in that the database stores strings in UTF-8. This doesn't buy you anything by itself.
But things that also need to change:
Text editor encoding needs to be set to UTF-8. PHP strings directly in source code have the encoding your text editor has been set to.
The database<->php transport encoding, which probably doesn't even exist in your code because it defaults to ISO-8859-1. For UTF-8 you need to explicitly call mysql_set_charset("utf8") before making queries.
The website encoding declaration, also defaults to ISO-8859-1. You need to explicitly call header("Content-Type: text/html; charset=UTF-8") or configure for example apache to do it automatically.
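For the selective repair the question asks about, one heuristic is to undo a single round of mis-encoding only when the result is still valid UTF-8 and converts back to exactly the input. The sketch below assumes the damage was "UTF-8 bytes read as ISO-8859-1" and uses mbstring rather than iconv; `fix_double_encoding()` is a hypothetical name.

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as ISO-8859-1" damage.
// A string is only changed when the repaired form is valid UTF-8 and
// re-encoding it reproduces the input exactly, so correct text is left alone.
function fix_double_encoding(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text; // not valid UTF-8 at all; leave it for manual review
    }
    $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
    $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1');
    if ($candidate !== $text
        && $roundTrip === $text
        && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }
    return $text;
}

$broken  = "P\xC3\x83\xC2\xB6yt\xC3\x83\xC2\xA4krono"; // displays as "PÃ¶ytÃ¤krono"
$correct = "Sis\xC3\xA4\xC3\xA4nkirjautuminen";         // "Sisäänkirjautuminen"

echo fix_double_encoding($broken), "\n";
echo fix_double_encoding($correct), "\n";
```

This stays a heuristic: text that legitimately contains a sequence like "Ã¶" would be rewritten too, and damage that went through Windows-1252 rather than ISO-8859-1 may need a different source encoding. Running it against a copy of the data first is the safer route.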

List of 'messed up characters' in utf8

One of my clients has a website which has been totally messed up by the hosting company forcing a character set on the complete database. We've had trouble with character sets before, but now it's a straight-up drama!
So far I've added charset=utf-8 to the page content type and set the charset for the MySQL connection to utf8. And now it's time to replace all the characters. So far what I've found is:
Ã¶ = ö
Ã« = ë
Ã© = é
The data inside the database is being updated like so:
UPDATE table SET `fieldname` = REPLACE(`fieldname`, 'Ã¶', 'ö');
Now I just need to find a complete list of all the characters that are messed up. I tried a MySQL query searching for field LIKE '%Ã%', but this returns all records in the database.
Google also just turns up a couple of characters (mostly the 3 above) in threads from other people who have had the same trouble. There seems to be no complete list of these characters (or at least the most common ones) that I can use to find and replace all the data for my client.
If anyone knows of such a list or is able to complete mine, I will, in return, create a page containing these characters to help others (unless there's already a list somewhere that I'm not aware of, of course).
// EDIT:
It would be for the most common European characters such as é è ë, á à ä, ö ó ò, ï, ü and perhaps the ringel-S (German double S). Not so much for the Spanish characters like ñ or ã, but if they are in a list somewhere that would be much appreciated as well.
// EDIT 2:
I updated the MySQL database and tables using the 2 ALTER queries from the 1st part of this article: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet. I DID NOT make use of the mb_ functions so far and didn't do any MB configuration as it seems.
The headers are all set to utf-8 in the files (I still have to check the headers for some ajax scripts though; not sure if that's needed, but it won't do any harm). The files are all saved as UTF-8 without BOM. PHPFreakMailer is also updated by setting the charset to utf-8.
Sadly, I'm still getting these weird characters. I didn't think they'd go away by themselves, but it was worth hoping :-) So what's the final step I should take? Keep using the REPLACE query and change all the weird characters manually?
Thanks in advance!
This is a bit crazy; what character set do you think "Ã¶" is in?
It looks like that's actually a correct UTF-8 sequence (since it's two bytes); you're just displaying it as ISO-8859-1.
Edit:
Based on your comment I think the following is going on:
I think (but I'm really not 100% sure) that the correct UTF-8 binary sequence is stored in the database. But the table is marked as ISO-8859-1, and you asked for the character set to be converted automatically. So MySQL thinks the data is ISO-8859-1 (which looks like Ã¶) and then tries to convert that to UTF-8.
You should be able to verify this: strlen('Ã¶') on the string you get back will be 4, not 2. If the length is indeed 2, your browser encoding is somehow off.
To fix this, don't tell MySQL to convert the characters.
Option 2
The data could also be 'double encoded' in the table. To check this, check the string length in the database as well. If the 'Ã¶' is 4 bytes long, this is the issue.
My advice in this case is not to build a big 'messed up character' map. You should simply be able to utf8_decode() the string. Normally this function outputs an ISO-8859-1 string, but in your case it should turn out to be the original valid UTF-8 string.
I hope this works!
Edit2
Ok so effectively what I believe has happened is Option 2. To put it in simple (php) terms:
$output = utf8_encode(utf8_encode('string'));
So one utf8_decode() should be enough.
Do test this before you run your migration scripts though :)
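The length check described above can be tried in isolation. This sketch uses mb_convert_encoding() in place of utf8_encode()/utf8_decode() (for ISO-8859-1 input they perform the same conversion), and the sample string is only an illustration.

```php
<?php
$original = "\xC3\xB6"; // "ö" as UTF-8: 2 bytes

// Encode the already-UTF-8 bytes a second time, as if they were ISO-8859-1.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo strlen($original), "\n"; // 2
echo strlen($double), "\n";   // 4

// One decoding step brings back the original two bytes.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $original); // bool(true)
```

If the value read from the database has the doubled length, a single decode is the fix; if it has the expected length, the problem sits in the connection or page encoding instead.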
If they forced a character set change, why is your database not converted? Are your tables still in the old character set (see the table information in phpMyAdmin)?
Is the data wrong when it shows up in phpMyAdmin, or only on your webpage? -> your names and collation should change, as well as the headers and file type (save the file as UTF-8).
Or try:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
I would start replacing characters only if there are no options from within MySQL left.
Since you've tagged this question with "php", I assume you read the database and its values with PHP? If so, please have a look at mb_convert_encoding if you no longer have control over the database.
The better solution would be to fix the inconsistency between the data and the tables' character set. Back up the database (just in case), and alter all tables and columns to UTF-8. Note: when using MySQL, it is not enough to alter the table's charset; you'll have to do this per column.
Why don't you use: &auml; = ä, &ouml; = ö, ...
Call htmlentities() in PHP and it will convert all special characters into entities. I think this would be the easiest way to do it.

Web Site Character Set Issues

I am somewhat confused by this whole character set thingy. Everything seems fine when the data is entered manually into the web sites and database tables. But when data is entered by copying and pasting, the character sets begin to get screwy.
I asked several clients where they are getting this data from; the majority seems to come either from another web site or from an MS document.
The characters that seem to be messing up are common characters like the following:
‘ © "
What gets inserted instead is the black triangle with the dreaded question mark! On my server I have the following settings:
PHP TIDY to clean the text before input to web page or database - output-encoding > UTF-8
Each web page has meta tag > charset=UTF-8
The database tables default > latin1_swedish_ci
I assumed at first it was a database problem, until I noticed that the same issue occurs with static web pages that are not database driven.
Help?
It's not really a good solution to replace away the smart quotes. If you can't cope with smart quotes or the copyright symbol, you can't cope with any other non-ASCII characters either, leaving you with an ASCII-only application (which these days is a pretty sad thing).
Instead you should ideally ensure that your web application uses UTF-8 throughout, which means:
Serve all your pages as UTF-8 using a header('Content-Type: text/html; charset=utf-8'); and/or a <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>.
Ensure your .php source files are saved as UTF-8, if they contain any non-ASCII characters themselves.
Use mysql_set_charset('utf8') when connecting to the database (MySQL spells the character set name without the hyphen).
Ensure your MySQL tables are created with a UTF-8 CHARACTER SET/COLLATION. They won't be by default if you didn't specify one when you created them. In this case you would need to ALTER TABLE on each text column to change it.
If you use htmlentities() to HTML-escape database content when putting it into the page, you need to pass in utf-8 for the $charset argument or it will mangle all non-ASCII characters by treating them as ISO-8859-1 (which is never the proper encoding). Better: use htmlspecialchars() instead, which doesn't touch non-ASCII characters so doesn't care.
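The difference between the two escaping functions is easy to see on a short string. In this sketch the sample text is made up, and the charset is always passed explicitly so the result does not depend on the PHP version's default.

```php
<?php
$text = "caf\xC3\xA9 <b>"; // "café <b>" encoded as UTF-8

// htmlspecialchars() only escapes the HTML metacharacters.
$safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() with the right charset turns "é" into one entity.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

// htmlentities() with the wrong charset treats each byte of "é" as a character.
$mangled = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

echo $safe, "\n";     // café &lt;b&gt;
echo $entities, "\n"; // caf&eacute; &lt;b&gt;
echo $mangled, "\n";  // caf&Atilde;&copy; &lt;b&gt;
```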

Changing character encoding in MySQL, PHP scripts, HTML

So, I have built on this system for quite some time, and it is currently outputting Latin1 (ISO-8859-1) to the web browser. These are the components:
MySQL - all data is stored with the Latin1 character set
PHP - All PHP text files are stored on disk with Latin1 encoding
HTML - The output has the http-equiv="content-type" content="text/html; charset=iso-8859-1" meta tag
So, I'm trying to understand how the encoding of the different parts come into play in my workflow. If I open a PHP script and change its encoding within the text editor to UTF-8 and save it back to disk and reload the web browser, the text is all messed up - unless the text comes from the DB. If I change the encoding of the DB to UTF-8 and keep the PHP files in latin1 I have to use utf8_decode() for the data to display correctly. And if I change the HTML code the browser will read it incorrectly.
So yeah, I realise that if I want to "upgrade" to UTF8, I have to update all three parts of this setup for it to work correctly, but since it's a huge system with some 180k lines of PHP code and millions of posts in a lot of databases/tables, I don't want to start something like this without understanding everything correctly.
What haven't I thought about? What could mess this up beyond fixing? What are the procedures for changing the encoding of an entire MySQL installation and what's the easiest way to change the encoding of hundreds or thousands of PHP files on disk?
The META tag is luckily added dynamically, so I'll change that in one place only :)
Let me hear about your experiences with this.
It's tricky.
You have to:
change the DB and every table character set/encoding – I don't know much about MySQL, but see here
set the client encoding to UTF-8 in PHP (SET NAMES UTF8) before the first query
change the meta tag and possibly the Content-type header (note the Content-type header has precedence)
convert all the PHP files to UTF-8 w/out BOM – you can easily do that with a loop and iconv.
the trickiest of all: you have to change most of your string function calls. That means mb_strlen instead of strlen, mb_substr instead of substr and $str[index], etc.
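The file conversion step could look roughly like this. Both function names are hypothetical, the source encoding is assumed to be ISO-8859-1 as described in the question, and files that are already valid UTF-8 are skipped so the script can be run twice without double-encoding anything.

```php
<?php
// Hypothetical helper: re-encode one ISO-8859-1 file as UTF-8 in place.
// Returns true only if the file was actually rewritten.
function convert_file_to_utf8(string $path): bool
{
    $bytes = file_get_contents($path);
    if ($bytes === false || mb_check_encoding($bytes, 'UTF-8')) {
        return false; // unreadable, pure ASCII, or already UTF-8
    }
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $bytes); // writes no BOM
    return $utf8 !== false && file_put_contents($path, $utf8) !== false;
}

// Hypothetical helper: walk a source tree and convert every *.php file.
// Returns the number of files that were rewritten.
function convert_tree(string $dir): int
{
    $count = 0;
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if ($file->getExtension() === 'php'
            && convert_file_to_utf8($file->getPathname())) {
            $count++;
        }
    }
    return $count;
}
```

A call such as convert_tree('/path/to/project') would then process the whole code base; the path is a placeholder, and working on a copy or under version control makes the change easy to review and revert.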
Don't convert to UTF8 if you don't have to. It's not worth the trouble.
UTF8 is (becoming) the new standard, so for new projects I can recommend it.
Functions
Certain function calls don't work anymore. For latin1 it's:
echo htmlentities($string);
For UTF8 it's:
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
strlen(), substr(), etc. aren't aware of multibyte characters.
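A short illustration of what "not aware" means in practice, using a made-up sample word and the mbstring counterparts:

```php
<?php
$word = "na\xC3\xAFve"; // "naïve" in UTF-8: 5 characters, 6 bytes

echo strlen($word), "\n";             // 6 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n"; // 5 (characters)

$bytes = substr($word, 0, 3);             // cuts the two-byte "ï" in half
$chars = mb_substr($word, 0, 3, 'UTF-8'); // keeps whole characters

var_dump(mb_check_encoding($bytes, 'UTF-8')); // bool(false)
echo $chars, "\n";                            // naï
```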
MySQL
mysql_set_charset('UTF8') or mysql_query('SET NAMES UTF8') will convert all text coming from the database (SELECTs) to UTF8. It will also convert incoming strings (INSERT, UPDATE) from UTF8 to the encoding of the table.
So for reading from a latin1 table it's not necessary to convert the table encoding.
But certain characters are only available in unicode (like the snowman ☃, iPhone emoticons, etc) and can't be converted to latin1. (The data will be truncated)
Scripts
I try to avoid special characters in my PHP scripts / templates.
I use the &euml; notation instead of ë etc. This way it doesn't matter if the file is saved in latin1 or utf8.
