My page often shows things like ë, Ã, ì, ù, à in place of normal characters.
I use utf8 for header page and MySQL encode. How does this happen?
These are utf-8 encoded characters. Use utf8_decode() to convert them to normal ISO-8859-1 characters.
If you see those characters you probably just didn’t specify the character encoding properly. Because those characters are the result when an UTF-8 multi-byte string is interpreted with a single-byte encoding like ISO 8859-1 or Windows-1252.
In this case ë could be encoded with 0xC3 0xAB that represents the Unicode character ë (U+00EB) in UTF-8.
Even though utf8_decode is a useful solution, I prefer to correct the encoding errors on the table itself. In my opinion it is better to correct the bad characters themselves than making "hacks" in the code. Simply do a replace on the field on the table. To correct the bad encoded characters from OP :
update <table> set <field> = replace(<field>, "ë", "ë")
update <table> set <field> = replace(<field>, "Ã", "à")
update <table> set <field> = replace(<field>, "ì", "ì")
update <table> set <field> = replace(<field>, "ù", "ù")
Where <table> is the name of the mysql table and <field> is the name of the column in the table. Here is a very good check-list for those typically bad encoded windows-1252 to utf-8 characters -> Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters.
Remember to backup your table before trying to replace any characters with SQL!
[I know this is an answer to a very old question, but was facing the issue once again. Some old windows machine didnt encoded the text correct before inserting it to the utf8_general_ci collated table.]
I actually found something that worked for me. It converts the text to binary and then to UTF8.
Source Text that has encoding issues:
If ‘Yes’, what was your last
SELECT CONVERT(CAST(CONVERT(
(SELECT CONVERT(CAST(CONVERT(english_text USING LATIN1) AS BINARY) USING UTF8) AS res FROM m_translation WHERE id = 865)
USING LATIN1) AS BINARY) USING UTF8) AS 'result';
Corrected Result text:
If ‘Yes’, what was your last
My source was wrongly encoded twice so I had two do it twice. For one time you can use:
SELECT CONVERT(CAST(CONVERT(column_name USING latin1) AS BINARY) USING UTF8) AS res FROM m_translation WHERE id = 865;
Please excuse me for any formatting mistakes
Related
I am getting stuck with this, previously i am using php 5 and now i came up with php 7,
the problem is when i am trying to echo value from database it returns weird special character and where in my previous web it returns normal. Is it utf-8 problem? i tried meta tag utf-8, and change collation sql into utf8_unicode_ci, but somehow it doesn't help at all...
it returns like this
 — 
what i want to return
—
What you get from the database is a UTF8 encoded string.
The characters you see is a UTF8 string interpreted with encoding Western (Windows Latin 1).
If you include that string in a web page whose character set is Latin 1 then you'll see the string you posted; if the character set is UTF-8 then you should see the correct characters (without need to convert them into HTML entities).
As the latter is not your case you can proceed as follows:
Let the characters you see are stored in the variable $string: you can get html entities with mb_convert_encoding:
$html = mb_convert_encoding( $string, 'HTML-ENTITIES', 'UTF-8' );
This will result in:
—
As after conversion you get characters in the ASCII range then the resulting string is suitable for any destination character encoding.
Note that, according to the above, even the dash — is converted (into —)
This is just a quick solution to the problem you faced.
I think the comment from Machavity:
"Take a minute and read stackoverflow.com/questions/279170/utf-8-all-the-way-through"
is a good advice.
This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 9 months ago.
I've searched around for a while and haven't yet found something that'll work for me. I am using a PHP form to submit data into SAP using the SAP DI API. I need to figure out which character set will actually allow me to store and work with Vietnamese characters.
UTF8 seems to work for a lot of the characters but ô becomes ô. More importantly, there are character limits, and UTF-8 breaks character limits. If I have a string of 30 characters it tells the API that it's more than 50. The same is true for storing in MySQL--if there's a varchar character limit, UTF-8 causes the string to go above it.
Unfortunately, when I search, UTF-8 seems to be the only thing people suggest for Vietnamese characters. If I don't encode the characters at all, they get stored as their html character codes. I've also tried ISO-8859-1, converting into UCS-2 or UCS-4... I'm really at a loss. If anyone has experience working with vietnamese characters, your help would be greatly appreciated.
UPDATE
It appears the issue may be with my wampserver on Windows. here's a bit of code that is confusing me:
$str = 'VậTCôNG';
$str1 = utf8_encode($str);
if (mb_detect_encoding($str,"UTF-8",true) == true) {
print_r('yes');
if ($str1 == $str) {
print_r('yes2');
}
}
echo $str . $str1;
This prints "yes" but not "yes2", and $str.str1 = "VậTCôNGVáºTCôNG" in the browser.
I have my php.ini file with:
default_charset = "utf-8"
and my httpd.conf file with:
AddDefaultCharset UTF-8
and my php file I'm running has:
header("Content-type: text/html; charset=utf-8");
So I'm now wondering: if the original string was utf-8, why wouldn't it equal a utf8 encoding of itself? and why is the utf8 encoding returning wrong characters? Is something wrong in the wampserver configurations?
ô is the "Mojibake" for ô. That is, you do have UTF-8, but something in the code mangled it.
See Trouble with utf8 characters; what I see is not what I stored and search for Mojibake. It says to check these:
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
It is possible to recover the data in the database, but it depends on details not yet provided.
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Each Vietnamese character take 2-3 bytes for encoding in UTF-8. It is unclear whether the "hard 50" is really a character limit or a byte limit.
If you happen to have Mojibake's sibling "double encoding", then a Vietnamese character will take 4-6 bytes and feel like 2-3 characters. See "Test the data" in the first link.
An example of how to 'undo' Mobibake in MySQL:
CONVERT(BINARY(CONVERT('VáºTCôNG' USING latin1)) USING utf8mb4) --> 'VậTCôNG'
"Double encoding" is sort of like Mojibake twice. That is one side treats it as latin1, the other as UTF-8, but twice.
VậTCôNG, as UTF-8, is hex 56e1baad5443c3b44e47. If that hex is treated as character set cp850 or keybcs2, the string is Vß║¡TC├┤NG.
Change it to VISCII.
Input: ô
Output: ô
You can test it at Charset converter.
I've inherited a MySQL database which contains a field named Description of type text and collation of latin1_swedish_ci.
The problem with this field is it contains utf-8 data with some Unicode characters, e.g. character 733, etc. Sometimes this character also exists in the field represented as HTML encoded "˝" as well.
I'm trying to read the table and export the data to a CSV file and I need to represent this character as a double quote.
Reading the HTML encoded character is easy enough. However, it appears that the actual Unicode character is converted to utf-8 before I can do anything with it resulting in a "?".
How do I read in the Unicode character 733 (U+02DD), recognize it and convert it?
Here's a simplified (not tested) version of the code.
<?
$testconn=odbc_connect ("TESTLIB", "......", "......");
$query="SELECT Description FROM TestTable";
$rsWeb=mysql_query($query));
$WebRow=mysql_fetch_row($rsWeb));
$Desc = $WebRow[0];
$Desc = str_replace('"','""',$Desc);
fwrite($output,"\"".$Desc."\",\r\n");
%>
Also set charset to utf-8 when connecting to SQL server:
http://php.net/manual/en/mysqli.set-charset.php
$mysqli->set_charset("utf8");
I think your connection charset is not utf8, that's why chars are being converted to '?'.
Read this: http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html
Post result for query:
show variables like 'char%';
You really should put only non-entity (Unicode) version in the database, and entity-decode the rest. However, when you want to use UTF-8 with MySQL, there are a few things to remember:
Your table column's collation should be utf8_bin or similar.
Your table's collation and database collation should also be utf8_bin just in case.
Your connection charset should be UTF8. Do this by executing the "SET NAMES utf8" query.
Also, if you're outputting a HTML page, that should have the UTF8 charset as well. If everything is correct, the UTF8 characters should come out fine.
Good luck!
I did the following things:
I have a spreadsheet with data. One of the rows has a ü character in it.
I save this as a CSV file in OpenOffice.org. When it asks me for a character encoding, I choose UTF-8.
I use Navicat to create a MySQL database table, InnoDB with UTF-8 utf8_general encoding and import the CSV.
I try to use PHP function htmlspecialchars($string, ENT_COMPAT, 'UTF-8') where $string is the string containing the special ü character.
It gives me an error: Invalid multibyte sequence in argument. When I change 'UTF-8' with 'ISO8859-1', no error is thrown, but the incorrect character is shown. (The 'unknown character' character, looks like <?>)
If I use an HTML form to update the string in the database, the error disappears and the character is displayed correctly, however, when I then look at the record in Navicat, it looks two characters:
[1/4][A with some thing on top of it]
Some multibyte that isn't seen as one character.`
What is going on, where are things going wrong, and what can I do about it?
Although I don't understand where the "invalid multibyte" error comes from, I'm pretty sure htmlspecialchars() is not your culprit:
For the purposes of this function, the charsets ISO-8859-1, ISO-8859-15, UTF-8, cp866, cp1251, cp1252, and KOI8-R are effectively equivalent, as the characters affected by htmlspecialchars() occupy the same positions in all of these charsets.
In my understanding, htmlspecialchars() should work fine for a UTF-8 string without specifying a character set. My bet would be that either the HTML page containing the form, or the database connection you use is not UTF-8 encoded. For the latter, try sending a
SET NAMES utf8;
to mySQL before doing the insert.
I'm starting out with some XML that looks like this (simplified):
<?xml version="1.0" encoding="UTF-8"?>
<alldata>
<data name="Forsetì" />
</alldata>
</xml>
But after I've parsed it with simplexml_load_string the special character (the i) becomes: ì which is obviously pretty mangled.
Is there a way to prevent this from happening?
I know for a fact the XML is fine, when saved as .txt and viewed in the browser the characters are fine. When I use simplexml_load_string on the XML and then save values as a text file, or to the database, its mangled.
This looks SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (latin-1) or something close like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
You database column is not using UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, a 0xc2 prefix byte is used to access the top half of the "Latin-1 Supplement" block which includes characters such as accented letters, currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space.
However in ISO-8859-1, the byte 0xC2 represents an Â. So when your UTF-8 string is misinterpreted as one of those, then you get  followed by some other nonsense character.
It's very likely that the XML is fine, but the character gets mangled when stored or output.
If you're outputting data on a HTML page: Make sure it's encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix; using UTF-8 is the better option in the long run.
If you're storing the data in a mySQL, you need to have UTF8 selected as the encoding all the way through: As the connection's encoding, in the table, and in the column(s) you insert the data into.
I've also had some problems with this, and it came from the PHP script encoding. Make sure it's set to UTF-8.
If it's still not good, try printing the variable using uft8_encode or utf8_decode.
XML is strict when it comes to entities, like & should be & and ì should ì
So you will need a translation table.
function xml_entity_decode($_string) {
// Set up XML translation table
$_xml=array();
$_xl8=get_html_translation_table(HTML_ENTITIES,ENT_COMPAT);
while (list($_key,)=each($_xl8))
$_xml['&#'.ord($_key).';']=$_key;
return strtr($_string,$_xml);
}
Late to the party... But I've faced this and solved like below.
You have declared encoding in XML so if you load xml file using DOMDocument it won't cause any issue.
But in case it happens in other use case, you can use html_entity_decode like below:
html_entity_decode($xml->saveXML());