I have a page that loads data from different databases (which may use different charsets).
The problem is that the data comes out with a broken charset instead of proper UTF-8,
and I need to find a way to load it correctly.
My connection is:
$db = new PDO("mysql:host=".DBHOST.";dbname=".DBNAME, DBUSER, DBPASS);
$db->setAttribute(PDO::MYSQL_ATTR_INIT_COMMAND, 'SET NAMES utf8');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
as you can see, I use 'SET NAMES utf8'
I have <meta charset="utf-8"> in <head>
And I have tried some conversions:
error_log("ORGIGINAL: ".$row["title"]);
error_log("ICONV: ".iconv(mb_detect_encoding($row["title"], mb_detect_order(), true), "UTF-8", $row["title"]));
error_log("UTF_ENCODE: ".utf8_encode ($row["title"]));
I believe all my files are saved in UTF-8 too
(I re-saved every file in Notepad, switching from ANSI to UTF-8, then tried this tool for verification: https://nlp.fi.muni.cz/projects/chared/)
Now, here is where the fun begins:
Not only do I get the wrong output, I also get different output in the browser and in the error log.
Original string stored in DB:
FIREFOX reaction:
Original:
utf8_encode:
iconv:
same as utf8_encode
And now, how it was loaded into the PHP error log:
As you can see, the original string gives the best result, while the converted versions come out even more deformed. I also tried changing the error log file's charset to UTF-8 (the original was unknown, probably ANSI), but the text has the same shape in both encodings.
The text is Central European / Czech.
The characters I need are:
á é ý í ó ú ů
ž š č ř ď ť ň ě
So, any ideas where something could be wrong?
Thanks :)
Do not use any conversion functions.
There are two causes for black diamonds; see Trouble with utf8 characters; what I see is not what I stored
The error file is exhibiting Mojibake, or possibly "double encoding". Those are also discussed in the link above.
Check that Firefox is interpreting the page as UTF8. Older versions did not necessarily assume that.
Oh, I just noticed the plain question mark. (Also covered in the link.) You win the prize for the most ways to mangle UTF8 in a single file!
This possibly means that there are multiple errors occurring. Good luck. If you provide HEX of the data at various stages (in PHP, in the database table, etc), I may be able to help in more detail.
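For example, here is a rough sketch of how to capture that hex at two of the stages; the table and column names (articles, title) are placeholders, and $db is the PDO handle from the question:
$stmt = $db->query("SELECT title, HEX(title) AS hex_in_db FROM articles LIMIT 5");
foreach ($stmt as $row) {
    error_log('in DB : ' . $row['hex_in_db']);      // bytes as MySQL stores them
    error_log('in PHP: ' . bin2hex($row['title'])); // bytes as PHP received them
}
// Correctly stored utf8 Czech text shows pairs such as C3A1 (á) or C5BE (ž);
// latin1-mangled data will look different at one of the two stages.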
An issue with the Czech character set is that some characters (those with acute accents) are also found in Western European subsets, hence are more likely to be rendered correctly. The other accents (mostly carons) are specific to Czech and go down a different path. This explains why some of your samples exhibit two different failure cases. (Search for Czech on this forum; you may find more tips.)
After some experimentation...
?eské probably comes from the CHARACTER SET of the column in the table being latin1 (or other "latin"), plus establishing the connection as being latin1 when inserting the data. That can be seen on the browser when it is in Western mode, not utf8.
?esk� shows up if you do the above and also have latin1 as the connection during selecting. That is visible with the browser set to utf8.
Caveat: The analysis may not be the only way to get what you are seeing.
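If the connection charset turns out to be part of the problem, here is a sketch of pinning it to utf8 at connect time, reusing the constants from the question. Note that PDO::MYSQL_ATTR_INIT_COMMAND only takes effect when passed in the constructor's driver-options array, not via setAttribute() afterwards, and on PHP 5.3.6+ the charset can also go straight into the DSN:
$db = new PDO(
    "mysql:host=" . DBHOST . ";dbname=" . DBNAME . ";charset=utf8",
    DBUSER,
    DBPASS,
    array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8") // belt and braces for older setups
);
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);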
Related
I want to work with data from a CSV file, but I realized the letters are not showing correctly. I have tried a million ways to convert the encoding but nothing works. I am working on macOS, PHP 7.4.4.
After executing fgets() or fgetcsv() on the handle variable, I get this (2 rows/lines in the example):
Kód ADM;Kód obce;Název obce;Kód MOMC;Název MOMC;Kód MOP;Název MOP;Kód èásti obce;Název èásti obce;Kód ulice;Název ulice;Typ SO;Èíslo domovní;Èíslo orientaèní;Znak èísla orientaèního;PSÈ;Souøadnice Y;Souøadnice X;Platí Od
1234;1234;HorniDolni;;;;;1234;HorniDolni;;;è.p.;2;;;748790401;4799.98;15893971.21;2013-12-01T00:00:00
It is more or less correct Czech, but the letter č is replaced by è and ř is replaced by ø, neither of which is part of the Czech alphabet. I am confident there will be more misplaced letters in the file.
Running file -I path/to/file returns text/plain; charset=iso-8859-1, which is sad because, as far as Wikipedia is concerned, this charset does not include the Czech alphabet.
None of the following commands converted the misplaced letters:
mb_convert_encoding($line, 'UTF-8', 'ISO8859-1')
iconv('ISO-8859-1', 'UTF-8', $line)
iconv('ISO8859-1', 'UTF-8', $line)
I have noticed that in ISO-8859-1 the letter ø is U+00F8 (byte 0xF8). Windows-1250 (which does include the Czech alphabet) has the correct letter ř (U+0159) at that same byte position, 0xF8. The same goes for č and è, which share the byte position 0xE8. I do not understand encodings very deeply, but it seems the file is encoded in Windows-1250 while the interpreter thinks the encoding is ISO-8859-1 and substitutes whatever letter sits at the same byte position as the original one.
But no conversion (ISO-8859-1 => Windows-1250, ISO-8859-1 => UTF-8, or the other way around) works.
Does anyone have any idea how to solve this? Thanks!
The problem with 8-bit character encodings is that it mostly takes human intelligence to work out the correct codepage.
When you run file on a file, it can work out that the file is mostly made up of printable characters, but as it is only looking at the bytes, it can't easily tell the difference between iso-8859-1 and iso-8859-2. To file, 0x80 is the same as 0x80.
file can only tell that the file is text and likely iso-8859-* or windows-*, because of the use of 0x80-0xFF, i.e. not just ASCII.
(Unicode encodings, like UTF-8 and UTF-16, are easier to detect by their byte sequences or by a Byte Order Mark at the top of the file.)
There are some intelligent character codepage detectors that, with the help of dictionaries from different languages, can estimate the codepage based on character/byte sequences.
The likely conversion you need is simply iso-8859-2 -> UTF-8.
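For instance, a sketch of doing that conversion while reading the file; the path is a placeholder, and you can swap in WINDOWS-1250 if the file turns out to come from a Windows export:
$handle = fopen('path/to/file', 'r');
while (($line = fgets($handle)) !== false) {
    $utf8 = iconv('ISO-8859-2', 'UTF-8', $line); // or 'WINDOWS-1250'
    $row  = str_getcsv($utf8, ';');              // split on the semicolons shown in the sample
    // ... work with $row ...
}
fclose($handle);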
What is important is that you know the original encoding (interpretation) and then, when you validate the result, that you know exactly what encoding you are viewing it in.
For example, PHP will by default set the HTTP charset to iso-8859-1. That means it is quite possible for you to be converting correctly to iso-8859-2, but your browser will then "interpret" it as iso-8859-1.
The best way to validate is to save the file to disk, then open it in a text editor like VS Code with your required encoding already selected.
If you need further help, you will need to edit your question to include the exact code you're using.
I have a .tsv file using Danish letters like Æ Ø Å.
The file is uploaded in PHP with file_get_contents()
and then processed and turned into a mysqli query.
I tried putting <?php header('Content-Type: text/html; charset=utf-8'); ?> at the very top of the code.
I am also using the meta tag <meta charset="UTF-8">,
and in my SQL I have the columns created like:
text COLLATE utf8_danish_ci NOT NULL
and:
PRIMARY KEY (`id`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci AUTO_INCREMENT
and:
$conn->set_charset("utf8");
.... But still no luck.
If I open my .tsv file in Excel, it shows the Æ Ø Å correctly. But when opened with TextEdit on the Mac, the "Æ Ø Å" shows up like "¯ ¯ ¯".
UPDATE - SOLUTION: as the accepted answer points out, I should be using CP1252:
mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "CP1252");
There are many things to consider with UTF-8. But I see this one particular comment of yours...
If I open my .tsv file in Excel, it shows the Æ Ø Å correctly. But when opened with TextEdit on the Mac, the "Æ Ø Å" shows up like "¯ ¯ ¯".
The problem...
If you are talking about Microsoft Excel, then you should know that the characters above exist both in the UTF-8 charset and in the LATIN_1_SUPPLEMENT block (often called CP1252). Take a look: LATIN_1_SUPPLEMENT Block
If you save this document without setting its encoding to UTF-8, then Windows has no reason to convert the text out of CP1252 and into UTF-8. But that is what you need to happen.
Possible solutions...
On your server: You can try to decode any Windows or "unknown" charset from CP1252 to UTF-8; see the sketch below these options. (Since Windows saves documents "according to the system default", this information may have disappeared by the time the file hits your Linux servers.)
On the submitter's computer: You can solve this by having the user adjust their UTF-8 settings in whatever editor is generating the document (to encode their documents as UTF-8, which causes this information to be stored in the document BOM, or "byte-order mark", which your server can read). This second approach may seem user-unfriendly (and it is, sure), but it can help you identify where the data is being corrupted.
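A sketch of the first, server-side option, assuming the upload really is CP1252; the $fileEndEnd name follows the question's update, and the upload field name is a placeholder:
$fileEndEnd = file_get_contents($_FILES['tsv']['tmp_name']);       // raw uploaded bytes
$fileEndEnd = mb_convert_encoding($fileEndEnd, 'UTF-8', 'CP1252'); // decode CP1252, re-encode as UTF-8
// ... then split the tab-separated lines and build the mysqli query from $fileEndEnd ...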
My issue is that I have a database which was imported as UTF-8 but has columns that default to latin1. This is obviously a problem: when I set the charset to UTF-8 in PHP, it gives me � instead of the expected æ character.
Now, when I originally had my encoding set to windows-1252 it worked perfectly, but when I validate my file it says that windows-1252 is legacy and shouldn't be used.
Obviously I'm only trying to get rid of the error message but the only problem is I'm not allowed to change anything in the database at all. Is there any way the data can be output as utf-8 whilst still being stored as latin1 in the DB?
Some time ago, I used this function to sort out printing text on a hellish page of different, lurking, out-of-control charsets xD:
function to_entities($string)
{
    $encoding = mb_detect_encoding($string, array('UTF-8', 'ISO-8859-1')); // and a few encodings more... sigh...
    return htmlentities($string, ENT_QUOTES, $encoding, false);
}
print to_entities('á é í ó ú ñ');
1252 (latin1) can handle æ. It is hex E6; in utf8 it is hex C3A6.
� usually comes from encoding data as latin1, then displaying it as utf8. So let's go back to what was stored.
Please provide SHOW CREATE TABLE. I suspect it will say CHARACTER SET latin1, not utf8.
Then, let's see
SELECT col, HEX(col) FROM tbl WHERE ...
to see the hex. (See hex notes above.)
Assuming everything is latin1 so far, then the simple (and perhaps expedient) answer is to check the html source. I suspect it says
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Changing to charset=ISO-8859-1 may solve the black diamond problem.
But... latin1 only handles Western European characters. If you need Slavic, Greek, Chinese, etc, then you do need utf8. I'll provide a different answer in that case.
I have figured out how to do this after looking through the link that Fred provided, thanks!
If anyone needs to know what to do:
if you have a database connection file, then inside it, underneath the mysqli_connect() call, add
mysqli_set_charset($connectvar, "utf8");
I have a database full of strings containing strange characters such as:
Design Tattoo Ãœbungshaut
Mehrflächiges Biozid Reinigungs- & Desinfektionsmittel
Where the Ãœ and ä should be, as I understand, an Ü and à when in proper UTF-8.
Is there a standard function to revert these multiple characters back to there proper UTF-8 form?
In PHP I have come across $url = iconv('utf-8', 'iso-8859-1', $url); which seems to get close but falls short. Perhaps I have the wrong parameters, but in any case was just wondering how well this issue is know and if there is an established fix?
The original data was taken from the eCommerce system CubeCart which seems to have no problem converting it back to normal text FYI.
The data shown in the example is UTF-8 encoded data mistakenly interpreted as ISO-8859-1 (or windows-1252). The problem combinations are in fact “Ü” and “ä” (“Ā” does not appear in German). So apparently what you need to do is read the data as UTF-8 and display it that way, instead of converting it.
If the database and the output are utf-8, it could be because you're not using utf-8 as the client character set.
If you're using mysqli you can use set_charset(), or run SET NAMES utf8 as a query before fetching data.
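A minimal sketch of that, assuming a mysqli connection (the credentials, table and column names are placeholders):
$mysqli = new mysqli('localhost', 'user', 'pass', 'dbname');
$mysqli->set_charset('utf8');                           // same effect as running SET NAMES utf8
$result = $mysqli->query('SELECT title FROM products');
while ($row = $result->fetch_assoc()) {
    echo htmlspecialchars($row['title'], ENT_QUOTES, 'UTF-8'), "<br>\n";
}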
Quick background: I inherited a large SQL dump file containing a combination of English and Arabic text, and (I think) it was originally exported using 'latin1'. I changed all occurrences of 'latin1' to 'utf8' prior to importing the file. The Arabic text didn't appear correctly in phpMyAdmin (which I guess is normal), but when I loaded the text into a web page with the following...
<meta http-equiv='Content-Type' content='text/html; charset=windows-1256'/>
...everything looked good and the Arabic text displayed perfectly.
Problem: My client is really really really picky and doesn't want to change his...
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
...to the 'Windows-1256' equivalent. I didn't think this would be a problem, but when I changed the charset value to 'UTF-8', all of the Arabic characters appeared as diamonds with question marks. Shouldn't UTF-8 display Arabic text correctly?
Here are a few notes about my database configuration:
Database charset is 'utf8'
Database connection collation is 'utf8_general_ci'
All databases, tables, and applicable fields have been collated as 'utf8_general_ci'
I've been scouring Stack Overflow and other forums for anything that relates to my issue. I've found similar problems, but none of the solutions seem to work for my specific situation. Hope someone can help!
If the document looks right when declared as windows-1256 encoded, then it most probably is windows-1256 encoded. So it was apparently not exported using latin1—which would have been impossible, since latin1 has no Arabic letters.
If this is just about a single file, then the simplest way is to convert it from windows-1256 encoding to utf-8 encoding, using e.g. Notepad++. (Open the file in it, change the encoding, via File format menu, to Arabic, windows-1256. Then select Convert to UTF-8 in the File format menu and do File → Save.)
Windows-1256 and UTF-8 are completely different encodings, so data gets all messed up if you declare windows-1256 data as UTF-8 or vice versa. Only ASCII characters, such as English letters, have the same representation in both encodings.
We can't find the error in your code if you don't show us your code, so we're very limited in how we can help you.
You told the browser to interpret the document as being UTF-8 rather than Windows-1256, but did you actually change the encoding used from Windows-1256 to UTF-8?
For example,
$ cat a.pl
use strict;
use warnings;
use feature qw( say );
use charnames ':full';
my $enc = $ARGV[0] or die;
binmode STDOUT, ":encoding($enc)";
print <<"__EOI__";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=$enc">
<title>Foo!</title>
</head>
<body dir="rtl">
\N{ARABIC LETTER ALEF}\N{ARABIC LETTER LAM}\N{ARABIC LETTER AIN}\N{ARABIC LETTER REH}\N{ARABIC LETTER BEH}\N{ARABIC LETTER YEH}\N{ARABIC LETTER TEH MARBUTA}
</body>
</html>
__EOI__
$ perl a.pl UTF-8 > utf8.html
$ perl a.pl Windows-1256 > cp1256.html
I think you need to go back to square one. It sounds like you have a database dump in Win-1256 encoding and you want to work with it in UTF-8 from now on. It also sounds like you are using PHP but you have lots of irrelevant tags on your question and are missing the most important one, PHP.
First, you need to convert the text dump into UTF-8 and you should be able to do that with PHP. Chances are that your conversion script will have two steps, first read the Win-1256 bytes and decode them into internal Unicode text strings, then encode the Unicode text strings into UTF-8 bytes for output to a new text file.
Once you have done that, redo the database import as you did before, but now you have correctly encoded the input data as UTF-8.
After that it should be as simple as reading the database and rendering a web page with the correct UTF-8 encoding.
P.S. It is actually possible to reencode the data every time you display it, but that does not solve the problem of having a database full of incorrectly encoded data.
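A sketch of that two-step script, assuming the dump fits in memory; the file names are placeholders, and the encoding name is whatever your iconv build understands (commonly WINDOWS-1256 or CP1256):
$win1256 = file_get_contents('dump-windows1256.sql');   // raw Windows-1256 bytes
$utf8    = iconv('WINDOWS-1256', 'UTF-8', $win1256);    // decode, then re-encode as UTF-8
file_put_contents('dump-utf8.sql', $utf8);
// Re-import dump-utf8.sql with the connection and table charsets set to utf8.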
In order to display Arabic characters correctly, you need to convert your PHP file to UTF-8 without BOM.
This happened to me: Arabic characters were displayed as diamonds, but converting to UTF-8 without BOM solved the problem.
It seems that the DB is configured as UTF8, but the data itself is extended ASCII. If the data is converted to UTF8, it will display correctly with the content type set to UTF8.