Character Encoding utf8 to latin1, explain these 2 characters - php

I have a database which uses latin-1 and a PHP application which is utf-8.
I have strings in the database like this:
'Société' which should be Société
'€1bn' which should be €2bn.
When I print the faulty characters to screen with PHP's ord(), from the returning data in the db, it prints 195 and 226.
Could somebody explain why this is happening (why saving like this and why characters being read as they are) and if I can reverse it.

The WHY:
1) é is unicode 233 (as the browser reads it).
é utf8 bytes converted into latin1 chars bytes is à ©. This is why it appears like this in the database.
à © is recognised as à which is code point 195. Hence why you see that.
2) € is unicode 8364.
€ utf8 bytes converted into latin1 chars bytes is â <82> ¬. Again this is why they appear like this in the db.
â <82> ¬ is recognised as â which is code point 226. Again this is why you see this.
That is why you see those values from ord() and why the characters are stored in that manner in a latin-1 database.
Reverse:
To reverse it we need Latin-1 char bytes to UTF8 bytes.
If we try it:
â is 226. Converted latin-1 to utf8 produces â.
à is 195. Converted latin-1 to utf8 produces Ã.
Problem:
The problem is Latin-1 has less characters than utf-8 (by a long way).
Latin1 single-byte stream and UTF8 multi-byte char stream so 1 char in utf8 could produce up to 4 chars for latin1.
So the UTF-8 to Latin-1 conversion produces faulty characters.
Latin1 back to utf8 is not possible.
Solution:
IF you are unable to change the character set of your database I could suggest encoding special characters in the database in their character entity before writing them (so the db can stay as latin1 and app as utf8 as both can understand html entities) e.g. umlaut as Ä.
It could be done using PHPs html_entity_decode() combined with mb_detect_encoding() to detect and convert specific characters.
References:
See ltf.ed.ac.uk for the utf8 char bytes to latin1 bytes:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%96&mode=char

These are strings in UTF-8 but displayed as if they were latin1. In UTF-8 é and € are encoded with two bytes, that's why you see two characters when the string is interpreted as latin1. So what you are doing is storing UTF-8 data in a table that was not declared as UTF-8. You should change the encoding of the database* and the connection**, then you will get a consistent presentation of your data
*) for example see here: https://stackoverflow.com/a/6184788/664108 (case 2)
**) SET NAMES 'utf8' in SQL

Related

PHP encoding string from latin1 (ISO-8859-1) to UTF8 [duplicate]

My page often shows things like ë, Ã, ì, ù, à in place of normal characters.
I use utf8 for header page and MySQL encode. How does this happen?
These are utf-8 encoded characters. Use utf8_decode() to convert them to normal ISO-8859-1 characters.
If you see those characters you probably just didn’t specify the character encoding properly. Because those characters are the result when an UTF-8 multi-byte string is interpreted with a single-byte encoding like ISO 8859-1 or Windows-1252.
In this case ë could be encoded with 0xC3 0xAB that represents the Unicode character ë (U+00EB) in UTF-8.
Even though utf8_decode is a useful solution, I prefer to correct the encoding errors on the table itself. In my opinion it is better to correct the bad characters themselves than making "hacks" in the code. Simply do a replace on the field on the table. To correct the bad encoded characters from OP :
update <table> set <field> = replace(<field>, "ë", "ë")
update <table> set <field> = replace(<field>, "Ã", "à")
update <table> set <field> = replace(<field>, "ì", "ì")
update <table> set <field> = replace(<field>, "ù", "ù")
Where <table> is the name of the mysql table and <field> is the name of the column in the table. Here is a very good check-list for those typically bad encoded windows-1252 to utf-8 characters -> Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters.
Remember to backup your table before trying to replace any characters with SQL!
[I know this is an answer to a very old question, but was facing the issue once again. Some old windows machine didnt encoded the text correct before inserting it to the utf8_general_ci collated table.]
I actually found something that worked for me. It converts the text to binary and then to UTF8.
Source Text that has encoding issues:
If ‘Yes’, what was your last
SELECT CONVERT(CAST(CONVERT(
(SELECT CONVERT(CAST(CONVERT(english_text USING LATIN1) AS BINARY) USING UTF8) AS res FROM m_translation WHERE id = 865)
USING LATIN1) AS BINARY) USING UTF8) AS 'result';
Corrected Result text:
If ‘Yes’, what was your last
My source was wrongly encoded twice so I had two do it twice. For one time you can use:
SELECT CONVERT(CAST(CONVERT(column_name USING latin1) AS BINARY) USING UTF8) AS res FROM m_translation WHERE id = 865;
Please excuse me for any formatting mistakes

Proper Charset to work with Vietnamese Characters (that isn't Unicode) in PHP [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 9 months ago.
I've searched around for a while and haven't yet found something that'll work for me. I am using a PHP form to submit data into SAP using the SAP DI API. I need to figure out which character set will actually allow me to store and work with Vietnamese characters.
UTF8 seems to work for a lot of the characters but ô becomes ô. More importantly, there are character limits, and UTF-8 breaks character limits. If I have a string of 30 characters it tells the API that it's more than 50. The same is true for storing in MySQL--if there's a varchar character limit, UTF-8 causes the string to go above it.
Unfortunately, when I search, UTF-8 seems to be the only thing people suggest for Vietnamese characters. If I don't encode the characters at all, they get stored as their html character codes. I've also tried ISO-8859-1, converting into UCS-2 or UCS-4... I'm really at a loss. If anyone has experience working with vietnamese characters, your help would be greatly appreciated.
UPDATE
It appears the issue may be with my wampserver on Windows. here's a bit of code that is confusing me:
$str = 'VậTCôNG';
$str1 = utf8_encode($str);
if (mb_detect_encoding($str,"UTF-8",true) == true) {
print_r('yes');
if ($str1 == $str) {
print_r('yes2');
}
}
echo $str . $str1;
This prints "yes" but not "yes2", and $str.str1 = "VậTCôNGVậTCôNG" in the browser.
I have my php.ini file with:
default_charset = "utf-8"
and my httpd.conf file with:
AddDefaultCharset UTF-8
and my php file I'm running has:
header("Content-type: text/html; charset=utf-8");
So I'm now wondering: if the original string was utf-8, why wouldn't it equal a utf8 encoding of itself? and why is the utf8 encoding returning wrong characters? Is something wrong in the wampserver configurations?
ô is the "Mojibake" for ô. That is, you do have UTF-8, but something in the code mangled it.
See Trouble with utf8 characters; what I see is not what I stored and search for Mojibake. It says to check these:
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
It is possible to recover the data in the database, but it depends on details not yet provided.
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Each Vietnamese character take 2-3 bytes for encoding in UTF-8. It is unclear whether the "hard 50" is really a character limit or a byte limit.
If you happen to have Mojibake's sibling "double encoding", then a Vietnamese character will take 4-6 bytes and feel like 2-3 characters. See "Test the data" in the first link.
An example of how to 'undo' Mobibake in MySQL:
CONVERT(BINARY(CONVERT('VậTCôNG' USING latin1)) USING utf8mb4) --> 'VậTCôNG'
"Double encoding" is sort of like Mojibake twice. That is one side treats it as latin1, the other as UTF-8, but twice.
VậTCôNG, as UTF-8, is hex 56e1baad5443c3b44e47. If that hex is treated as character set cp850 or keybcs2, the string is Vß║¡TC├┤NG.
Change it to VISCII.
Input: ô
Output: ô
You can test it at Charset converter.

PHP character encoding � sign instead of à

Hi this is a very strange error I have on some pages of this joomla website:
http://www.pcsnet.it/news
If you go into the details of a specific news the à character is correctly displayed.
Other accented characters seem not affected.
I've checked that UTF-8 encoding is default both in the MySql db and that the text files are in UTF-8 encoding.
Other ideas?
What is very interesting in your case is that it only affects the letter à! So it cannot be an encoding problem.
Here is my take on your problem. The letter à is encoded in two bytes in utf8. The first byte is xC3, which is à in latin-1, the second byte is... non breaking space! (The other accented letters, such as è are encoded by à followed by an other accented letter in latin-1, and they are not affected).
Therefore, my guess is that you have a script, somewhere, that removes, or replaces, the non-breaking space in latin-1, i.e., character xA0. The resulting lonely byte xC3 cannot be displayed properly, so the general placeholder � is displayed instead. (just load your page in latin-1, you will see that I am right).
Find that script that removes non-breaking spaces, and you'll be fine.
Are you by any chance using htmlentities, htmlspecialchars or html_entity_decode on the text you retrieved from the database ? If so, you need to force UTF8 in the third parameter because it's not the default charset of these functions.
Example: htmlentities('£hello', null, 'utf-8');
A � sign is usually an indication that the character the browser is trying to display is not available in the font used. It's probably not an à, if that works on other pages (using the same font).

character set conversion

i wanna convert to original string of “Cool†..Origingal string is cool . (' is backquote)
It seems that you just forgot to specify the character encoding properly.
Because “ is what you get when the character “ (U+201C) encoded in UTF-8 (0xE2809C) is interpreted with a single-byte character encoding like Windows-1252 (default character encoding in some browsers) where 0xE2, 0x80, and 0x9C represent the characters â, €, and œ respectively.
So just make sure to specify your character encoding properly. Or if you actually want to use Windows-1252 as your output character encoding, you can convert your UTF-8 data with mb_convert_encoding, iconv or similar functions.
There's a wide variety of character encoding functions in PHP, especially if you have access to the multibyte string functions. (mb_string is thankfully enabled on most PHP installs.)
What you need to do is convert the encoding of the original string to the encoding you require, but as I don't know what encoding has been used/is required all I can suggest is that you could try using the mb_convert_encoding function, possibly after using mb_detect_encoding on the original string.
Incidentally, I'd highly recommend attempting to keep all data in UTF-8, (text files, HTML encoding, database connections/data, etc.) as you'll make your life a lot easier this way.

DomDocument and special characters written in two bytes

I have a web application, written in PHP, based on UTF-8 (both PHP and MySQL are on UTF-8). Everything is beautiful - no problem with special characters.
However, I had to build an export to XML with encoding ISO-8859-2 (Polish), so I picked DomDocument because it has built in encoding conversion.
But when I had sent the XML to my partner for validation, he said that one of tags have too many characters. It was strange because it had the specific maximum number of characters. Then I have opened the file in HexEditor and saw that every special character has two bytes.
I have tried to convert the result with iconv and mb_convert_encoding.
Iconv says:
iconv() [<a href='function.iconv'>function.iconv</a>]: Detected an illegal character in input string in file application/controllers/report/export.php at 169
mb_convert_encoding is simply deleting all special characters and result is encoded in ASCII.
Is there a way to convert the output of DomDocument to one-byte characters?
Thanks in advance!
One problem when switching between encodings is that, even with transliteration, not all characters are representable in other encodings in a single byte.
For example, consider the EURO SIGN, a character that takes 3 bytes when encoded in UTF-8. If you look at the charset support page, you can see that ISO-8859-2 is not listed.
Since there is not a single character to represent the euro sign, then transliteration does its best to still represent it in the output
echo iconv( 'UTF-8', 'ISO-8859-2//TRANSLIT', '€' ); // EUR
In this example, we still end up with 3 bytes to represent the euro sign after transliterating.
EDIT
P.S. The NOTICE level error you're getting is because you executed iconv() without the transliteration flag. And as I highlighted above, the EURO SIGN doesn't exist in ISO-8859-2, so you clearly have at least one character in your data that also doesn't exist in ISO-8859-2, so you'll have to use transliteration. Just know that it doesn't guarantee that you'll get down to 1 byte/char.

Categories