MySQL/PHP charset madness: which one is correct?

I imported a list of names into MySQL, directly from a txt file into phpMyAdmin: European names.
My HTML header is set to utf8, and MySQL is set to utf8.
Now names with accents, like Contè, display a <?> instead of the accented character.
If I remove the utf8 meta tag, I can see the accents correctly, but everything else breaks; for instance, when I upload a file like Aleš.jpg, the HTML spits out an unreadable filename.
I'm lost.

It smells like the text file was encoded in latin1 or some other encoding.
Can you provide a hex dump of "Contè" as found in the file? We can help you recognize whether it is utf8 or not. http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings shows that è is hex E8 in latin1 or C3A8 in utf8.
Once the encoding is determined, the tag on the html page can be fixed to agree with it.
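For example, a minimal PHP sketch to inspect the raw bytes (the filename names.txt is only a placeholder):
$raw = file_get_contents('names.txt'); // placeholder filename
// "Contè" ends in E8 if the file is latin1, or C3 A8 if it is utf8
echo bin2hex($raw), "\n";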

Related

How to print UTF-8 data in latin1 database?

My issue is that I have a database which was imported as UTF-8 but has columns that default to latin1. This is obviously a problem: when I set the charset to UTF-8 in PHP, it gives me � instead of the expected ae (æ) character.
Now, when I originally had my encoding as windows-1252 it worked perfectly, but when I validate my file it says that windows-1252 is legacy and shouldn't be used.
Obviously I'm only trying to get rid of the error message, but the problem is that I'm not allowed to change anything in the database at all. Is there any way the data can be output as UTF-8 whilst still being stored as latin1 in the DB?
Some time ago, I used this function to handle printing text on a hellish page with several out-of-control charsets lurking in it xD:
function to_entities($string)
{
    $encoding = mb_detect_encoding($string, array('UTF-8', 'ISO-8859-1')); // and a few encodings more... sigh...
    return htmlentities($string, ENT_QUOTES, $encoding, false);
}
print to_entities('á é í ó ú ñ');
1252 (latin1) can handle æ. It is hex E6 in latin1, and hex C3A6 in utf8.
� usually comes from latin1-encoded bytes being displayed as utf8. So, let's go back to what was stored.
Please provide SHOW CREATE TABLE. I suspect it will say CHARACTER SET latin1, not utf8.
Then, let's see
SELECT col, HEX(col) FROM tbl WHERE ...
to see the hex. (See hex notes above.)
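A minimal diagnostic sketch, assuming mysqli and placeholder connection details, table and column names:
$mysqli = new mysqli('localhost', 'user', 'pass', 'dbname'); // placeholders

// What charset does the table actually declare?
$res = $mysqli->query('SHOW CREATE TABLE tbl');
print_r($res->fetch_row());

// What bytes are actually stored?
$res = $mysqli->query('SELECT col, HEX(col) FROM tbl LIMIT 10');
while ($row = $res->fetch_row()) {
    print_r($row);
}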
Assuming everything is latin1 so far, then the simple (and perhaps expedient) answer is to check the html source. I suspect it says
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Changing to charset=ISO-8859-1 may solve the black diamond problem.
But... latin1 only handles Western European characters. If you need Slavic, Greek, Chinese, etc, then you do need utf8. I'll provide a different answer in that case.
I have figured out how to do this after looking through the link that Fred provided, thanks!
If anyone needs to know what to do: if you have a database connection file, then inside it, just below the mysqli_connect() call, add
mysqli_set_charset($connectvar, "utf8");
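For completeness, here is a minimal sketch of such a connection file; the credentials are placeholders:
// db_connect.php -- sketch only, credentials are placeholders
$connectvar = mysqli_connect('localhost', 'user', 'pass', 'dbname');
if (!$connectvar) {
    die('Connection failed: ' . mysqli_connect_error());
}
// make the connection talk utf8 to MySQL
mysqli_set_charset($connectvar, "utf8");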

Encoding from any language to UTF-8 in PHP

I insert characters from different languages from a CSV.
I apply this to every set of characters:
private function process_elements($element)
{
    return utf8_encode($element);
}
The problem is that when they go into the database, they look like this:
???????? ?? ???????????? ????? ??????? ??? ???????...
When I retrieve them from the database, I also get this.
This happens with Greek. However, when I retrieve Greek pages (through scraping) that are themselves UTF-8 encoded, the characters look like this:
Δες webcam δωμάτια | Gr.ImLive.com
which is okay, because when I use the utf8_encode function, they look normal on the screen.
But when the data is taken from the CSV and put into the database, I get those question marks.
Is there a way to encode from any language to UTF-8? Why does retrieving data from a CSV versus a UTF-8 encoded web page make such a difference? They look the same. How do I address this problem?
Please take a look at this; it will help you:
Handling Unicode Front To Back In A Web App
It's not about "languages", it's about encodings. Text is encoded as bits and bytes. Any one byte is equal to any other byte. If you only have a blob of bytes, you cannot know what encoding it represents. You can guess, but that's not accurate. You have to know what encoding some text is in by reading the accompanying meta data. That may be documentation, a <meta> tag or an HTTP header. Then you need to treat the text in that encoding.
utf8_encode actually converts text from ISO-8859-1 to UTF-8. It does not simply encode anything to UTF-8, because it does not have the means to determine what something is encoded in either. If your text is already UTF-8 encoded or was not ISO-8859-1 encoded to begin with, you're just garbling the text (as you are).
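If you do know what the CSV is actually encoded in (from its documentation, its producer, or by inspecting the bytes), a hedged sketch of an explicit conversion could look like this; ISO-8859-7 (Greek) is only an assumed source encoding, not something derived from your data:
private function process_elements($element)
{
    // sketch only: replace 'ISO-8859-7' with the CSV's real, documented encoding
    return mb_convert_encoding($element, 'UTF-8', 'ISO-8859-7');
}
Also make sure the database connection itself is set to utf8 (e.g. with mysqli_set_charset), otherwise correctly converted text can still turn into question marks on INSERT.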

Arabic Character Encoding Issue: UTF-8 versus Windows-1256

Quick background: I inherited a large SQL dump file containing a combination of English and Arabic text, and (I think) it was originally exported using 'latin1'. I changed all occurrences of 'latin1' to 'utf8' prior to importing the file. The Arabic text didn't appear correctly in phpMyAdmin (which I guess is normal), but when I loaded the text into a web page with the following...
<meta http-equiv='Content-Type' content='text/html; charset=windows-1256'/>
...everything looked good and the arabic text displayed perfectly.
Problem: My client is really really really picky and doesn't want to change his...
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
...to the 'Windows-1256' equivalent. I didn't think this would be a problem, but when I changed the charset value to 'UTF-8', all of the Arabic characters appeared as diamonds with question marks. Shouldn't UTF-8 display Arabic text correctly?
Here are a few notes about my database configuration:
Database charset is 'utf8'
Database connection collation is 'utf8_general_ci'
All databases, tables, and applicable fields have been collated as 'utf8_general_ci'
I've been scouring Stack Overflow and other forums for anything that relates to my issue. I've found similar problems, but none of the solutions seem to work for my specific situation. Hope someone can help!
If the document looks right when declared as windows-1256 encoded, then it most probably is windows-1256 encoded. So it was apparently not exported using latin1—which would have been impossible, since latin1 has no Arabic letters.
If this is just about a single file, then the simplest way is to convert it from windows-1256 encoding to utf-8 encoding, using e.g. Notepad++. (Open the file in it, change the encoding, via File format menu, to Arabic, windows-1256. Then select Convert to UTF-8 in the File format menu and do File → Save.)
Windows-1256 and UTF-8 are completely different encodings, so data gets all messed up if you declare windows-1256 data as UTF-8 or vice versa. Only ASCII characters, such as English letters, have the same representation in both encodings.
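As a small demonstration (a sketch, assuming an iconv build that recognizes the Windows-1256 name), the same Arabic letter alef has completely different bytes in the two encodings:
$utf8 = "\xD8\xA7";                              // alef (U+0627) in UTF-8
$cp1256 = iconv('UTF-8', 'Windows-1256', $utf8); // the same letter in Windows-1256
echo bin2hex($utf8), "\n";   // d8a7
echo bin2hex($cp1256), "\n"; // c7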
We can't find the error in your code if you don't show us your code, so we're very limited in how we can help you.
You told the browser to interpret the document as being UTF-8 rather than Windows-1256, but did you actually change the encoding used from Windows-1256 to UTF-8?
For example,
$ cat a.pl
use strict;
use warnings;
use feature qw( say );
use charnames ':full';
my $enc = $ARGV[0] or die;
binmode STDOUT, ":encoding($enc)";
print <<"__EOI__";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=$enc">
<title>Foo!</title>
</head>
<body dir="rtl">
\N{ARABIC LETTER ALEF}\N{ARABIC LETTER LAM}\N{ARABIC LETTER AIN}\N{ARABIC LETTER REH}\N{ARABIC LETTER BEH}\N{ARABIC LETTER YEH}\N{ARABIC LETTER TEH MARBUTA}
</body>
</html>
__EOI__
$ perl a.pl UTF-8 > utf8.html
$ perl a.pl Windows-1256 > cp1256.html
I think you need to go back to square one. It sounds like you have a database dump in Win-1256 encoding and you want to work with it in UTF-8 from now on. It also sounds like you are using PHP but you have lots of irrelevant tags on your question and are missing the most important one, PHP.
First, you need to convert the text dump into UTF-8 and you should be able to do that with PHP. Chances are that your conversion script will have two steps, first read the Win-1256 bytes and decode them into internal Unicode text strings, then encode the Unicode text strings into UTF-8 bytes for output to a new text file.
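A rough sketch of that conversion in PHP (dump.sql and dump-utf8.sql are placeholder filenames, and it assumes your iconv build knows the Windows-1256 name):
$win1256 = file_get_contents('dump.sql');            // placeholder input file
$utf8    = iconv('Windows-1256', 'UTF-8', $win1256); // decode cp1256 bytes, re-encode as UTF-8
file_put_contents('dump-utf8.sql', $utf8);           // placeholder output file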
Once you have done that, redo the database import as you did before, but now you have correctly encoded the input data as UTF-8.
After that it should be as simple as reading the database and rendering a web page with the correct UTF-8 encoding.
P.S. It is actually possible to reencode the data every time you display it, but that does not solve the problem of having a database full of incorrectly encoded data.
In order to display Arabic characters correctly, you need to convert your PHP file to UTF-8 without BOM.
This happened to me: Arabic characters were displayed as diamonds, but converting to UTF-8 without BOM solved the problem.
It seems that the DB is configured as UTF-8, but the data itself is extended ASCII. If the data is converted to UTF-8, it will display correctly with the content type set to UTF-8.

Problem storing German characters in the MySQL database?

I have a table named "cust_details" which has a column "categories", where I have to store categories like: blockadenlösung, affirmation, beziehungsprobleme lösen
But when I try to save this data into the database, it is stored like:
blockadenlüsung, affirmation, beziehungsprobleme lösen
That is, when umlauts appear in the string, it is not saved in its original form. I tried some charsets for storing these characters, but I am still facing the problem.
What may be the possible reasons?
Thanks in advance.
The data you stored is encoded in UTF-8 (ü for an "ö" is typical for UTF-8), but is not displayed as UTF-8 but rather as ISO-8859-1 or the like.
Make sure that you use the same encoding everywhere:
Deliver your web pages with a Content-Type header that declares charset=utf-8
Use mysql_query("SET NAMES 'utf8'"); to set the encoding to utf-8
Make sure that the encoding of the database is UTF-8 (use HeidiSQL etc. to check)
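A minimal sketch of keeping one encoding end to end; the connection details are placeholders, and mysqli is used here instead of the old mysql_* API:
header('Content-Type: text/html; charset=utf-8');            // what the browser is told
$db = mysqli_connect('localhost', 'user', 'pass', 'dbname'); // placeholders
mysqli_set_charset($db, 'utf8');                              // what the connection uses
// the table/column should also be declared with CHARACTER SET utf8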
Use this when you are inserting the characters:
N'characters here'
The N before the string declaration should enable you to enter it into the DB.
What is the type of the field?
You could specify database/table/field level character-sets. The default latin-1 works in most scenarios.
Otherwise, you would have to use plain text and store unicode strings like &#<4-digit-unicode-value>; into it. Then when you print it out, just dump the unicode into HTML and it will show up as such.
Here is a sample string in Pashto ترافيکي پيښو کې درې تنه مړه او څوارلس نور ټپيان شول. which we store directly into the table. The charset used is latin_charset_ci
Good Luck!
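If you go the entity route, a hedged sketch (assuming the PHP source itself is saved as UTF-8) would be to let mbstring generate the numeric entities before storing:
$text = 'ترافيکي پيښو کې درې تنه مړه';  // UTF-8 text in the script
// turn every non-ASCII character into &#NNNN; -- the result is plain ASCII
$entities = mb_encode_numericentity($text, [0x80, 0x10FFFF, 0, 0x10FFFF], 'UTF-8');
// $entities can be stored in a latin1 column and echoed straight into the HTML
echo $entities;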

Parse XML with special characters (UTF-8)

I'm starting out with some XML that looks like this (simplified):
<?xml version="1.0" encoding="UTF-8"?>
<alldata>
    <data name="Forsetì" />
</alldata>
But after I've parsed it with simplexml_load_string, the special character (the ì) becomes ì, which is obviously pretty mangled.
Is there a way to prevent this from happening?
I know for a fact the XML is fine: when saved as .txt and viewed in the browser, the characters are fine. When I use simplexml_load_string on the XML and then save the values to a text file, or to the database, they are mangled.
It looks like SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (latin-1) or something close to it, like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
Your database column is not using a UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
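A hedged sketch of that direction, assuming a latin1 column and a mysqli connection ($db, table and column names are placeholders):
$forDb = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);      // match the column's charset
$stmt = $db->prepare('INSERT INTO mytable (name) VALUES (?)'); // placeholder table/column
$stmt->bind_param('s', $forDb);
$stmt->execute();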
Explanation
In UTF-8, a 0xc2 prefix byte is used to access the top half of the "Latin-1 Supplement" block which includes characters such as accented letters, currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space.
However, in ISO-8859-1 the byte 0xC2 represents an Â. So when your UTF-8 string is misinterpreted as one of those encodings, you get an Â followed by some other nonsense character.
It's very likely that the XML is fine, but the character gets mangled when stored or output.
If you're outputting data on a HTML page: Make sure it's encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix; using UTF-8 is the better option in the long run.
If you're storing the data in MySQL, you need to have UTF-8 selected as the encoding all the way through: as the connection's encoding, in the table, and in the column(s) you insert the data into.
I've also had some problems with this, and it came from the PHP script encoding. Make sure it's set to UTF-8.
If it's still not good, try printing the variable using utf8_encode or utf8_decode.
XML is strict when it comes to entities: & should be &amp; and ì should be &igrave;.
So you will need a translation table.
function xml_entity_decode($_string) {
    // Set up XML translation table: map numeric entities back to their characters
    $_xml = array();
    $_xl8 = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT);
    foreach ($_xl8 as $_key => $_entity) {
        $_xml['&#' . ord($_key) . ';'] = $_key;
    }
    return strtr($_string, $_xml);
}
Late to the party... but I've faced this and solved it like below.
You have declared the encoding in the XML, so if you load the XML file using DOMDocument it won't cause any issue.
But in case it happens in another use case, you can use html_entity_decode like below:
html_entity_decode($xml->saveXML());
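For reference, a minimal sketch of the DOMDocument route mentioned above; the filename is a placeholder and the element/attribute names follow the question's example:
$doc = new DOMDocument();
$doc->load('alldata.xml'); // placeholder filename; the declared encoding is respected
foreach ($doc->getElementsByTagName('data') as $node) {
    // comes back as a proper UTF-8 string ("Forsetì"), so serve the page as UTF-8
    echo $node->getAttribute('name'), "\n";
}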
