How to print UTF-8 data in latin1 database?

How to print UTF-8 data in latin1 database? - php

My issue is I have a database which was imported as UTF-8 that has columns that are default latin1. This is obviously an issue so when I set the charset to UTF-8 on php it gives me � instead of the expected ae character.
Now, when I originally had my encoding as windows-1252 it worked perfectly but then when I validate my file it says that windows-1252 is legacy and shouldn't be used.
Obviously I'm only trying to get rid of the error message but the only problem is I'm not allowed to change anything in the database at all. Is there any way the data can be output as utf-8 whilst still being stored as latin1 in the DB?

Time ago, I used this function to resolve printing texts in a hellish page of different lurking out-of-control charsets xD:
function to_entities($string)
{
$encoding = mb_detect_encoding($string, array('UTF-8', 'ISO-8859-1')); // and a few encodings more... sigh...
return htmlentities($string, ENT_QUOTES, $encoding, false);
}
print to_entities('á é í ó ú ñ');

1252 (latin1) can handler æ. It is hex E6. In utf8 it is hex C3A6.
� usually comes from latin1 encodings, then displaying them as utf8. So, let's go back to what was stored.
Please provide SHOW CREATE TABLE. I suspect it will say CHARACTER SET latin1, not utf8.
Then, let's see
SELECT col, HEX(col) FROM tbl WHERE ...
to see the hex. (See hex notes above.)
Assuming everything is latin1 so far, then the simple (and perhaps expedient) answer is to check the html source. I suspect it says
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Changing to charset=ISO-8859-1 may solve the black diamond problem.
But... latin1 only handles Western European characters. If you need Slavic, Greek, Chinese, etc, then you do need utf8. I'll provide a different answer in that case.

I have figured out how to do this after looking through the link that Fred provided, thanks!
if anyone needs to know what to do
if you have a database connection file. inside that, underneath the mysqli_connect command add
mysqli_set_charset($connectvar, "utf8");

Related

php wrong charset from DB

I have a page, that loads data from different databases (which could have different charset).
Problem is that it loads with broken charset to UTF-8.
And I need to find a way, how to load it properly.
My connection is:
$db = new PDO("mysql:host=".DBHOST.";dbname=".DBNAME, DBUSER, DBPASS);
$db->setAttribute(PDO::MYSQL_ATTR_INIT_COMMAND, 'SET NAMES utf8');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
as you can see, I use 'SET NAMES utf8'
I have <meta charset="utf-8"> in <head>
And I have tried some conversions:
error_log("ORGIGINAL: ".$row["title"]);
error_log("ICONV: ".iconv(mb_detect_encoding($row["title"], mb_detect_order(), true), "UTF-8", $row["title"]));
error_log("UTF_ENCODE: ".utf8_encode ($row["title"]));
I believe I have all files loaded in UTF-8 too
(re-saved every file in notepad switching from ANSI to UTF-8. then tried this tool for verification https://nlp.fi.muni.cz/projects/chared/)
now, where the fun begins:
Not only that I got the wrong output, but I also have a different output for the browser and error log.
Original string stored in DB:
FIREFOX reaction:
Original:
utf8_encode:
iconv:
same as utf8_encode
and now, how it was loaded into PHP error file:
As you can see, the output has the best result in the original shape, while if trying to convert, it has a more deformed output. Also tried to change the error log file charset to UTF-8 (original unknown/ANSI probably), but the same shape in both encodings)
The text is central-europe/czech.
needed characters are:
á é ý í ó ú ů
ž š č ř ď ť ň ě
So, any ideas, where can be something wrong?
Thanks :)

Do not use any conversion functions.
There are two causes for black diamonds; see Trouble with utf8 characters; what I see is not what I stored
The error file is exhibiting Mojibake, or possibly "double encoding". Those are also discussed in the link above.
Check that Firefox is interpreting the page as UTF8. Older version did not necessarily assume such.
Oh, I just noticed the plain question mark. (Also covered in the link.) You win the prize for the most number of was to mangle UTF8 in a single file!
This possibly means that there are multiple errors occurring. Good luck. If you provide HEX of the data at various stages (in PHP, in the database table, etc), I may be able to help in more detail.
An issue with the Czech character set is that some characters (those with acute accents) are found in western European subsets, hence are more likely to be rendered correctly. The other accents are mostly specific to Czech (with carons), and go down a different path. This explains why some of your samples exhibit two different failure cases. (Search for Czech on this forum; you may more tips.)
After some experimentation...
?eskÃ© probably comes from the CHARACTER SET of the column in the table being latin1 (or other "latin"), plus establishing the connection as being latin1 when inserting the data. That can be seen on the browser when it is in Western mode, not utf8.
?esk� shows up if you do the above and also have latin1 as the connection during selecting. That is visible with the browser set to utf8.
Caveat: The analysis may not be the only way to get what you are seeing.

Proper Charset to work with Vietnamese Characters (that isn't Unicode) in PHP [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 9 months ago.
I've searched around for a while and haven't yet found something that'll work for me. I am using a PHP form to submit data into SAP using the SAP DI API. I need to figure out which character set will actually allow me to store and work with Vietnamese characters.
UTF8 seems to work for a lot of the characters but ô becomes Ã´. More importantly, there are character limits, and UTF-8 breaks character limits. If I have a string of 30 characters it tells the API that it's more than 50. The same is true for storing in MySQL--if there's a varchar character limit, UTF-8 causes the string to go above it.
Unfortunately, when I search, UTF-8 seems to be the only thing people suggest for Vietnamese characters. If I don't encode the characters at all, they get stored as their html character codes. I've also tried ISO-8859-1, converting into UCS-2 or UCS-4... I'm really at a loss. If anyone has experience working with vietnamese characters, your help would be greatly appreciated.
UPDATE
It appears the issue may be with my wampserver on Windows. here's a bit of code that is confusing me:
$str = 'VậTCôNG';
$str1 = utf8_encode($str);
if (mb_detect_encoding($str,"UTF-8",true) == true) {
print_r('yes');
if ($str1 == $str) {
print_r('yes2');
}
}
echo $str . $str1;
This prints "yes" but not "yes2", and $str.str1 = "VậTCôNGVáºTCÃ´NG" in the browser.
I have my php.ini file with:
default_charset = "utf-8"
and my httpd.conf file with:
AddDefaultCharset UTF-8
and my php file I'm running has:
header("Content-type: text/html; charset=utf-8");
So I'm now wondering: if the original string was utf-8, why wouldn't it equal a utf8 encoding of itself? and why is the utf8 encoding returning wrong characters? Is something wrong in the wampserver configurations?

Ã´ is the "Mojibake" for ô. That is, you do have UTF-8, but something in the code mangled it.
See Trouble with utf8 characters; what I see is not what I stored and search for Mojibake. It says to check these:
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
It is possible to recover the data in the database, but it depends on details not yet provided.
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Each Vietnamese character take 2-3 bytes for encoding in UTF-8. It is unclear whether the "hard 50" is really a character limit or a byte limit.
If you happen to have Mojibake's sibling "double encoding", then a Vietnamese character will take 4-6 bytes and feel like 2-3 characters. See "Test the data" in the first link.
An example of how to 'undo' Mobibake in MySQL:
CONVERT(BINARY(CONVERT('VáºTCÃ´NG' USING latin1)) USING utf8mb4) --> 'VậTCôNG'
"Double encoding" is sort of like Mojibake twice. That is one side treats it as latin1, the other as UTF-8, but twice.
VậTCôNG, as UTF-8, is hex 56e1baad5443c3b44e47. If that hex is treated as character set cp850 or keybcs2, the string is Vß║¡TC├┤NG.

Change it to VISCII.
Input: ô
Output: ô
You can test it at Charset converter.

Converting odd character encoding back to utf-8

I have a database full of strings containing strange characters such as:
Design Tattoo Ãœbungshaut
MehrflÃ¤chiges Biozid Reinigungs- & Desinfektionsmittel
Where the Ãœ and Ã¤ should be, as I understand, an Ü and Ã when in proper UTF-8.
Is there a standard function to revert these multiple characters back to there proper UTF-8 form?
In PHP I have come across $url = iconv('utf-8', 'iso-8859-1', $url); which seems to get close but falls short. Perhaps I have the wrong parameters, but in any case was just wondering how well this issue is know and if there is an established fix?
The original data was taken from the eCommerce system CubeCart which seems to have no problem converting it back to normal text FYI.

The data shown as example is UTF-8 encoded data mistakenly interpreted as ISO-8859-1 (or windows-1252). The problem combinations are in fact “Ü” and “ä” (“Ā” does not appear in German). So apparently what you need to do is to read the data as UTF-8 and display it that way, instead of converting it.

If the database and output is utf-8 it could be because your not using utf-8 as the client character set.
If your using mysqli you can use set_charset or run SET NAMES utf8 as a query before fetching data.

Arabic Character Encoding Issue: UTF-8 versus Windows-1256

Quick Background: I inherited a large sql dump file containing a combination of english and arabic text and (I think) it was originally exported using 'latin1'. I changed all occurrences of 'latin1' to 'utf8' prior to importing the file. The the arabic text didn't appear correctly in phpmyadmin (which I guess is normal), but when I loaded the text to a web page with the following...
<meta http-equiv='Content-Type' content='text/html; charset=windows-1256'/>
...everything looked good and the arabic text displayed perfectly.
Problem: My client is really really really picky and doesn't want to change his...
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
...to the 'Windows-1256' equivalent. I didn't think this would be a problem, but when I changed the charset value to 'UTF-8', all of the arabic characters appeared as diamonds with question marks. Shouldn't UTF-8 display arabic text correctly?
Here are a few notes about my database configuration:
Database charset is 'utf8'
Database connection collation is 'utf8_general_ci'
All databases, tables, and applicable fields have been collated as 'utf8_general_ci'
I've been scouring stack overflow and other forums for anything the relates to my issue. I've found similar problems, but not of the solutions seem to work for my specific situation. Hope someone can help!

If the document looks right when declared as windows-1256 encoded, then it most probably is windows-1256 encoded. So it was apparently not exported using latin1—which would have been impossible, since latin1 has no Arabic letters.
If this is just about a single file, then the simplest way is to convert it from windows-1256 encoding to utf-8 encoding, using e.g. Notepad++. (Open the file in it, change the encoding, via File format menu, to Arabic, windows-1256. Then select Convert to UTF-8 in the File format menu and do File → Save.)
Windows-1256 and UTF-8 are completely different encodings, so data gets all messed up if you declare windows-1256 data as UTF-8 or vice versa. Only ASCII characters, such as English letters, have the same representation in both encodings.

We can't find the error in your code if you don't show us your code, so we're very limited in how we can help you.
You told the browser to interpret the document as being UTF-8 rather than Windows-1256, but did you actually change the encoding used from Windows-1256 to UTF-8?
For example,
$ cat a.pl
use strict;
use warnings;
use feature qw( say );
use charnames ':full';
my $enc = $ARGV[0] or die;
binmode STDOUT, ":encoding($enc)";
print <<"__EOI__";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=$enc">
<title>Foo!</title>
</head>
<body dir="rtl">
\N{ARABIC LETTER ALEF}\N{ARABIC LETTER LAM}\N{ARABIC LETTER AIN}\N{ARABIC LETTER REH}\N{ARABIC LETTER BEH}\N{ARABIC LETTER YEH}\N{ARABIC LETTER TEH MARBUTA}
</body>
</html>
__EOI__
$ perl a.pl UTF-8 > utf8.html
$ perl a.pl Windows-1256 > cp1256.html

I think you need to go back to square one. It sounds like you have a database dump in Win-1256 encoding and you want to work with it in UTF-8 from now on. It also sounds like you are using PHP but you have lots of irrelevant tags on your question and are missing the most important one, PHP.
First, you need to convert the text dump into UTF-8 and you should be able to do that with PHP. Chances are that your conversion script will have two steps, first read the Win-1256 bytes and decode them into internal Unicode text strings, then encode the Unicode text strings into UTF-8 bytes for output to a new text file.
Once you have done that, redo the database import as you did before, but now you have correctly encoded the input data as UTF-8.
After that it should be as simple as reading the database and rendering a web page with the correct UTF-8 encoding.
P.S. It is actually possible to reencode the data every time you display it, but that does not solve the problem of having a database full of incorrectly encoded data.

inorder to display arabic characters correctly , you need to convert your php file to utf-8 without Bom
this happened with me, arabic characters was displayed diamonds, but conversion to utf-8 without bom will solve this problem

I seems that the db is configured as UTF8, but the data itself is extended ascii. If the data is converted to UTF8, it will display correctly in content type set to UTF8

CKEditor charset

I updated my web app to use UTF-8 instead of ANSI.
I did the following measures to define charset:
mysql_set_charset("utf8"); // PHP
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> // HTML
utf8_general_ci // In MySQL
I also edited the CKEditor config to remove htmlentities because I need the correct character (i.e. é and not é) for MySQL fulltext search.
config.entities = false;
config.entities_latin = false;
In the database (phpMyAdmin view) and on normal text fields output (HTML, <input> or <textarea>), everything looks fine (I see é, not é, not Ã©, yay).
However, CKEditor has some trouble with the encoding. See attached image for the same field taken from the database, displayed in a textarea, then in a textarea repalced by CKEditor:
This seems to be in the CKEditor JavaScript code (probably a fixed charset), but I can't find it in the config. Again, since the é displays correctly in normal HTML (real UTF-8 é, not é nor Ã©), I'm quite sure it's not the PHP/MySQL query that's wrong (but I might be mistaken).
EDIT: This seems like a symptom of applying htmlentities, which by default is encoded in Latin-1, on UTF-8 text. There is either a possibility of using htmlspecialchars or to specify the charset ("utf-8"), but I don't know where to modify that in CKEditor.

This thread seems bit dated but answering it to help anyone looking for a response.
To allow CKEditor to process the character é as é and not é set the config for entities_latin to false, as below:
config.entities_latin = false;
Or, you may just want to set following options to false:
config.entities = false;
config.basicEntities = false;

It was my approach that was wrong, not CKEditor's. Was looking in the wrong file and missed the UTF-8 encoding on a htmlspecialchars.

You can also use in your database connection: $connection->query("SET NAMES 'utf8'");
And remember to set db, and/or table Collation to utf8... I prefer utf8_general_ci

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.