PHP: Converting utf8 chars from mysql db - php

I have a longtext column in my database that contains some special characters like Ã
How can I convert it to "à"? I've tried utf8_encode and utf8_decode, but neither seems to work.
The document charset is UTF-8, and so is the longtext field's.

This is not an encoding problem but an HTML entities problem; see html_entity_decode: http://php.net/manual/fr/function.html-entity-decode.php

Related

PHP encoding string from latin1 (ISO-8859-1) to UTF8 [duplicate]

My page often shows things like Ã«, Ã, Ã¬, Ã¹ in place of normal characters.
I use UTF-8 for the page header and for the MySQL encoding. How does this happen?
These are utf-8 encoded characters. Use utf8_decode() to convert them to normal ISO-8859-1 characters.
If you see those characters, you probably didn't specify the character encoding properly: they are what results when a UTF-8 multi-byte string is interpreted with a single-byte encoding such as ISO 8859-1 or Windows-1252.
In this case Ã« could be the bytes 0xC3 0xAB, which represent the Unicode character ë (U+00EB) in UTF-8.
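The misreading described above can be reproduced in a few lines. This is only a sketch: it uses iconv() to play the role of a page that assumes ISO-8859-1, and the variable names are made up for the illustration.

```php
<?php
// "ë" (U+00EB) encoded as UTF-8 is the two bytes 0xC3 0xAB.
$utf8 = "\xC3\xAB";

// Treat each byte as one ISO-8859-1 character and re-encode the result as
// UTF-8 for display. This mimics a page that assumes the wrong charset.
$misread = iconv('ISO-8859-1', 'UTF-8', $utf8);

echo bin2hex($utf8), "\n";   // c3ab
echo $misread, "\n";         // Ã«
```

One character became two because each of its bytes was taken to be a character of its own.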
Even though utf8_decode is a useful solution, I prefer to correct the encoding errors in the table itself. In my opinion it is better to fix the bad characters themselves than to add "hacks" to the code. Simply run a replace on the field in the table. To correct the badly encoded characters from the OP:
update <table> set <field> = replace(<field>, "Ã«", "ë")
update <table> set <field> = replace(<field>, "Ã", "à")
update <table> set <field> = replace(<field>, "Ã¬", "ì")
update <table> set <field> = replace(<field>, "Ã¹", "ù")
Where <table> is the name of the MySQL table and <field> is the name of the column in the table. Here is a very good checklist for the typical Windows-1252 to UTF-8 encoding errors -> Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters.
Remember to back up your table before trying to replace any characters with SQL!
[I know this is an answer to a very old question, but I was facing the issue once again. Some old Windows machine didn't encode the text correctly before inserting it into the utf8_general_ci collated table.]
I actually found something that worked for me. It converts the text to binary and then to UTF8.
Source Text that has encoding issues:
If â€˜Yesâ€™, what was your last
SELECT CONVERT(CAST(CONVERT(
(SELECT CONVERT(CAST(CONVERT(english_text USING LATIN1) AS BINARY) USING UTF8) AS res FROM m_translation WHERE id = 865)
USING LATIN1) AS BINARY) USING UTF8) AS 'result';
Corrected Result text:
If ‘Yes’, what was your last
My source was wrongly encoded twice, so I had to do it twice. If yours was wrongly encoded only once, you can use:
SELECT CONVERT(CAST(CONVERT(column_name USING latin1) AS BINARY) USING UTF8) AS res FROM m_translation WHERE id = 865;
Please excuse me for any formatting mistakes
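The same repair can be sketched in PHP for text that has already been fetched. The helper name undoOneRound is hypothetical, and the sketch assumes that MySQL's latin1 behaves like Windows-1252 (CP1252), which is why CP1252 is used instead of ISO-8859-1. It first simulates the double encoding so the example is self-contained.

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 bytes read as Windows-1252".
// Assumption: the damage was done through MySQL latin1, i.e. CP1252.
function undoOneRound(string $garbled): string
{
    // Map every character back to the single byte it came from. The
    // resulting byte string is the text as it was before that round.
    $bytes = iconv('UTF-8', 'CP1252', $garbled);
    if ($bytes === false) {
        throw new RuntimeException('Text is not CP1252-representable');
    }
    return $bytes;
}

$original = "If \u{2018}Yes\u{2019}, what was your last";

// Simulate the damage: misread the bytes as CP1252, twice.
$once  = iconv('CP1252', 'UTF-8', $original);
$twice = iconv('CP1252', 'UTF-8', $once);

// Two rounds of damage need two rounds of repair.
$fixed = undoOneRound(undoOneRound($twice));
var_dump($fixed === $original);
```

Apply the helper only as many times as the text was wrongly encoded; running it on clean text can fail or corrupt it.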

Character Encoding utf8 to latin1, explain these 2 characters

I have a database which uses latin-1 and a PHP application which uses utf-8.
I have strings in the database like this:
'SociÃ©tÃ©' which should be Société
'â‚¬1bn' which should be €1bn.
When I print the faulty characters from the data returned by the db with PHP's ord(), it prints 195 and 226.
Could somebody explain why this is happening (why the strings are saved like this and why the characters are read as they are), and whether I can reverse it?
The WHY:
1) é is Unicode code point 233 (as the browser reads it).
The UTF-8 bytes of é, read as Latin-1 characters, are Ã ©. This is why it appears like this in the database.
Ã © starts with Ã, which is code point 195. That is why you see 195.
2) € is Unicode code point 8364.
The UTF-8 bytes of €, read as Latin-1 characters, are â <82> ¬. Again, this is why it appears like this in the db.
â <82> ¬ starts with â, which is code point 226. Again, that is why you see 226.
That is why you see those values from ord() and why the characters are stored in that manner in a latin-1 database.
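The two ord() values can be checked directly. The snippet below is a small sketch that writes the UTF-8 bytes out explicitly, so it does not depend on how the file itself is saved.

```php
<?php
$eAcute = "\xC3\xA9";       // "é" (U+00E9) encoded as UTF-8
$euro   = "\xE2\x82\xAC";   // "€" (U+20AC) encoded as UTF-8

// ord() only looks at the first byte of the string it is given.
echo ord($eAcute), "\n";    // 195
echo ord($euro), "\n";      // 226

// All bytes of each character, for comparison.
$eAcuteBytes = array_values(unpack('C*', $eAcute));   // 195, 169
$euroBytes   = array_values(unpack('C*', $euro));     // 226, 130, 172
print_r($eAcuteBytes);
print_r($euroBytes);
```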
Reverse:
To reverse it we need Latin-1 char bytes to UTF8 bytes.
If we try it:
â is 226. Converting it from latin-1 to utf8 produces Ã¢.
Ã is 195. Converting it from latin-1 to utf8 produces Ãƒ.
Problem:
The problem is that Latin-1 has far fewer characters than UTF-8.
Latin-1 is a single-byte encoding and UTF-8 is a multi-byte encoding, so one UTF-8 character can turn into up to four Latin-1 characters.
So the UTF-8 to Latin-1 conversion produces faulty characters.
Converting those Latin-1 characters to UTF-8 again does not bring the original back.
Solution:
If you are unable to change the character set of your database, I would suggest storing special characters as their HTML character entities before writing them (so the db can stay latin1 and the app utf8, since both can handle HTML entities), e.g. Ä as &Auml;.
It could be done using PHP's html_entity_decode() combined with mb_detect_encoding() to detect and convert specific characters.
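A minimal sketch of the entity approach follows. The helper names toEntityText and fromEntityText are invented for the illustration, and htmlentities() is used for the writing direction, which the answer does not name explicitly.

```php
<?php
// Hypothetical helper: make UTF-8 text safe for a latin1 column by
// replacing characters that have a named HTML entity.
function toEntityText(string $utf8): string
{
    return htmlentities($utf8, ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper: turn the stored entities back into UTF-8.
function fromEntityText(string $stored): string
{
    return html_entity_decode($stored, ENT_QUOTES, 'UTF-8');
}

$original = "Soci\u{E9}t\u{E9} \u{20AC}1bn";
$stored   = toEntityText($original);

echo $stored, "\n";                  // Soci&eacute;t&eacute; &euro;1bn
echo fromEntityText($stored), "\n";  // the original UTF-8 string again
```

Note that htmlentities() leaves characters without a named entity unchanged and also escapes &, < and quotes, so this is a workaround for common accented characters, not a general substitute for a UTF-8 column.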
References:
See ltg.ed.ac.uk for the utf8 char bytes to latin1 bytes:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%96&mode=char
These are strings in UTF-8 but displayed as if they were latin1. In UTF-8, é is encoded with two bytes and € with three; that's why you see several characters for each when the string is interpreted as latin1. So what you are doing is storing UTF-8 data in a table that was not declared as UTF-8. You should change the encoding of the database* and the connection**; then you will get a consistent presentation of your data.
*) for example see here: https://stackoverflow.com/a/6184788/664108 (case 2)
**) SET NAMES 'utf8' in SQL

Reading Unicode characters from MySQL with PHP

I've inherited a MySQL database which contains a field named Description of type text and collation of latin1_swedish_ci.
The problem with this field is it contains utf-8 data with some Unicode characters, e.g. character 733, etc. Sometimes this character also exists in the field represented as HTML encoded "&#733" as well.
I'm trying to read the table and export the data to a CSV file and I need to represent this character as a double quote.
Reading the HTML encoded character is easy enough. However, it appears that the actual Unicode character is converted to utf-8 before I can do anything with it resulting in a "?".
How do I read in the Unicode character 733 (U+02DD), recognize it and convert it?
Here's a simplified (not tested) version of the code.
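<?php
// Placeholder host, credentials and file name; TESTLIB is taken as the database name.
$testconn = mysqli_connect("localhost", "......", "......", "TESTLIB");
$output = fopen("export.csv", "w");
$query = "SELECT Description FROM TestTable";
$rsWeb = mysqli_query($testconn, $query);
$WebRow = mysqli_fetch_row($rsWeb);
$Desc = $WebRow[0];
// Double any embedded quotes, as CSV requires.
$Desc = str_replace('"', '""', $Desc);
fwrite($output, "\"" . $Desc . "\",\r\n");
fclose($output);
?>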
Also set charset to utf-8 when connecting to SQL server:
http://php.net/manual/en/mysqli.set-charset.php
$mysqli->set_charset("utf8");
I think your connection charset is not utf8, that's why chars are being converted to '?'.
Read this: http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html
Post result for query:
show variables like 'char%';
You really should store only the non-entity (Unicode) version in the database, and entity-decode the rest. However, when you want to use UTF-8 with MySQL, there are a few things to remember:
Your table column's collation should be utf8_bin or similar.
Your table's collation and database collation should also be utf8_bin just in case.
Your connection charset should be UTF8. Do this by executing the "SET NAMES utf8" query.
Also, if you're outputting an HTML page, that should have the UTF8 charset as well. If everything is correct, the UTF8 characters should come out fine.
Good luck!
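Once the connection really delivers UTF-8, the conversion asked about in the question could look roughly like this. The function name toCsvField is hypothetical, and the sketch handles both the literal character U+02DD and the "&#733" entity form mentioned in the question.

```php
<?php
// Hypothetical helper: replace U+02DD and its "&#733" entity form with a
// plain double quote, then quote the value as a CSV field.
function toCsvField(string $text): string
{
    // Entity form, with or without the trailing semicolon. The (?!\d)
    // lookahead avoids touching longer numbers such as &#7330;.
    $text = preg_replace('/&#733(?!\d);?/', '"', $text);

    // Literal character. This only matches if the string is valid UTF-8,
    // i.e. the connection charset did not already turn it into "?".
    $text = str_replace("\u{02DD}", '"', $text);

    // CSV escaping: double every embedded quote and wrap the field.
    return '"' . str_replace('"', '""', $text) . '"';
}

echo toCsvField("5\u{02DD} pipe, 3&#733; pipe"), "\n";
// "5"" pipe, 3"" pipe"
```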

UTF-8 and HTML entities

I am trying to extract text from a Word .DOC file with PHP. Everything seems OK, but the only trouble is that I get something like
&#1057;&#1059;&#1044;&#1054;&#1042;&#1040; &#1041;&#1059;&#1061;&#1043;&#1040;&#1051;&#1058;&#1045;&#1056;&#1030;&#1071;
instead of Russian text. I've tried html_entity_decode and utf8_encode, but they didn't help. Is there a simple solution?
html_entity_decode should work once you pass the proper parameters (on PHP 5.4.0 or later the default charset is already UTF-8):
html_entity_decode($str, ENT_QUOTES, 'UTF-8')
This will convert the character references into UTF-8. Before PHP 5.4.0, the charset parameter's default value was ISO-8859-1. In that case the Cyrillic characters can't be converted, as the ISO 8859-1 character set doesn't contain them.
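As a small check of the call above, here is a sketch that assumes the extracted text contains decimal character references such as &#1057; (the Cyrillic letter С). The sample string is made up for the illustration.

```php
<?php
// Three decimal character references: С (U+0421), У (U+0423), Д (U+0414).
$encoded = '&#1057;&#1059;&#1044;';

// Passing 'UTF-8' explicitly makes the result independent of the PHP
// version's default charset.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";            // СУД
echo bin2hex($decoded), "\n";   // d0a1d0a3d094
```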

php import utf-8 txt file to latin1 database

I have a UTF-8 encoded txt file and I want to import it into a latin1_general_ci table.
The problem is that some characters display as ? in the database and not as they are supposed to.
I tried mb_convert_encoding($str, "ISO-8859-1", "UTF-8"); but that didn't do anything.
What am I doing wrong?
Latin1 doesn't include all Unicode characters. You can use iconv() with //TRANSLIT option to transliterate unknown characters to their closest latin1 equivalents:
iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text)
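Before converting, it can help to see which characters in the file have no ISO-8859-1 equivalent at all, since those are the ones that end up as ? or get transliterated. The function name charsOutsideLatin1 is hypothetical; the sketch relies only on the fact that ISO-8859-1 covers the code points U+0000 to U+00FF.

```php
<?php
// Hypothetical helper: list the distinct characters of a UTF-8 string that
// ISO-8859-1 cannot represent (anything above U+00FF).
function charsOutsideLatin1(string $utf8): array
{
    // The /u flag makes PCRE treat the subject as UTF-8 code points.
    preg_match_all('/[^\x{00}-\x{FF}]/u', $utf8, $matches);
    return array_values(array_unique($matches[0]));
}

$sample = "caf\u{E9} \u{20AC}5 \u{2018}ok\u{2019}";
print_r(charsOutsideLatin1($sample));   // the euro sign and both curly quotes
```

The é is not reported because it exists in ISO-8859-1; only the euro sign and the curly quotes would need transliteration or another workaround.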
I use utf8_decode, it works for me.
You can convert them to binary and then convert them back to latin1:
insert into table values
(convert(binary convert(data using utf8) using latin1))
