Strings containing non-ASCII characters are truncated by PHP/MySQL - php

I have a page with a translation function. My problem is that when I switch the language to French, the words are cut off because the page does not interpret the characters correctly. I checked posts related to my problem, but none of the suggestions work.
In my page, I have the following:
header ('Content-Type:text/html; charset=WINDOWS-1252'); -> This just forces the encoding on start-up. I think it is optional, but I still use it.
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
Equivalent translations are fetched from a database table named labels. The table type is InnoDB with utf8 -- UTF-8 Unicode as the default character set.
Characters after é are being cut. Is there anything that I need to do to display the characters correctly? Thanks!

I don't see any point in using Unicode on the backend and a code page in the frontend of a multilingual application. You either use the same encoding throughout your project, or you manually convert back and forth between UTF-8 and windows-1252.
I don't think you have a problem with reading. The labels come truncated from the DB, otherwise your browser would display garbage characters. So this is not an issue with PHP/HTML, but with MySQL. In the case of èéàòì and the like, MySQL is certainly able to convert from UTF-8 to CP1252 (latin1). However, if this were not the case (as if we try to convert the same string from UTF-8 to CP1251), MySQL would show a question mark ?.
In your case I think it's an input problem, i.e. the labels are already truncated in the DB. How is this possible? You may have UTF-8 PHP and MySQL, but your browser sends windows-1252 strings when it submits a form from a page loaded with that charset. In your PHP script you should transcode this string to UTF-8 before inserting it into the DB, or connect to MySQL with SET NAMES 'latin1' (MySQL's name for CP1252). Since you don't do so, you end up trying to insert a bunch of invalid UTF-8 bytes, so MySQL truncates the string at the first invalid byte and your labels come out cut short. Here is a test case. First, the test table:
CREATE TABLE `test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(128) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8
Here is the PHP part. Note that this script is UTF-8 encoded, so every literal string appearing in it has the same encoding.
// This is a UTF-8 file, so my editor uses UTF-8 and thus each literal
// string is a UTF-8 string, since PHP only has binary strings.
$label = "Référence";
// Now let's translate this string as if it came from a browser submitting
// a form loaded from a cp1252 encoded page
$src = mb_convert_encoding($label, "CP1252", "UTF-8");
// But connect as if I were UTF-8
$db = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'test', 'test');
// Insert the string
$stmt = $db->prepare('INSERT INTO test (name) VALUES ( ? )');
$stmt->bindValue(1, $src);
$stmt->execute();
// Read it
header("content-type: text/plain; charset=windows-1252");
foreach ($db->query('SELECT * FROM test') as $row) {
    echo $row['name'] . "\n";
}
How do you recover? Either you connect to MySQL with the cp1252 charset and let MySQL translate for you, or you transcode the string in your script.
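The second option (transcoding in the script) could look roughly like the sketch below. It is only an illustration: the helper name to_utf8_from_cp1252 is made up, and it assumes that anything that is not valid UTF-8 arrived as windows-1252, which is a heuristic rather than a guarantee.

```php
<?php
// Hypothetical helper: normalise a submitted form value to UTF-8 before it
// is bound to an INSERT. Assumes the form page was served as windows-1252.
function to_utf8_from_cp1252(string $raw): string
{
    // Already valid UTF-8 (this includes plain ASCII): leave it untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise treat the bytes as windows-1252 and transcode them.
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
}

// "Référence" as a cp1252 browser would submit it: é is the single byte 0xE9.
$submitted = "R\xE9f\xE9rence";

var_dump(mb_check_encoding($submitted, 'UTF-8')); // bool(false): not valid UTF-8
$clean = to_utf8_from_cp1252($submitted);
echo bin2hex($clean), "\n"; // 52c3a966c3a972656e6365
```

The cleaned value can then be bound to the INSERT over a utf8 connection, as in the test script above.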
Once the data is stored correctly, you'll have to read it back and put it on an HTML page. This time you'll have the same problem in reverse: showing a UTF-8 string in a CP1252 document. The bytes in the DB are unsuitable as they are, because UTF-8 is a variable-length encoding, whilst in CP1252 a character is exactly 1 byte long. If you put these bytes directly into the page, the browser will show gibberish for the extra bytes. So, again, you either connect to the DB specifying the latin1 (CP1252) charset so that MySQL takes care of the conversion and gives you the right bytes, or you transcode the bytes yourself on the PHP side.
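To make the byte-length point concrete, here is a small sketch of the PHP-side transcoding for output. It assumes the mbstring extension is available and that the page is declared as windows-1252; the variable names are only illustrative.

```php
<?php
// A UTF-8 label as it would come back over a utf8 connection ("Référence").
$label = "R\u{E9}f\u{E9}rence";

// UTF-8 is variable-length: 9 characters, but 11 bytes (each é takes 2 bytes).
echo mb_strlen($label, 'UTF-8'), " chars, ", strlen($label), " bytes\n";

// Transcode for a page declared as windows-1252: now 1 byte per character.
$forPage = mb_convert_encoding($label, 'Windows-1252', 'UTF-8');
echo strlen($forPage), " bytes, hex ", bin2hex($forPage), "\n";
// 9 bytes, hex 52e966e972656e6365
```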
Better yet, do yourself a favor and use the same encoding everywhere. I suggest UTF-8 because it is the right choice today, but you can also opt for CP1252, since it can represent English and French characters (and saves some storage, though I don't consider that an issue).

My suggestion is to use the same encoding through the whole process. Use UTF-8 as the charset both in header and the meta tag.

It seems to me that your data is not stored correctly in the database. If you are working with mysqli, you can try to set the charset of the connection object before reading from or writing to the database.
// tells the mysqli connection to deliver UTF-8 encoded strings.
$db = new mysqli($dbHost, $dbUser, $dbPassword, $dbName);
$db->set_charset('utf8');
For other databases see UTF-8 for PHP and MySQL. It may be necessary to insert the French texts again (with this setting), because the existing texts could be invalid now.
Your linked example page is correctly encoded with UTF-8 (the file format), though your meta tag is a bit incorrect:
<!meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
The <! does not comment the tag out; you would have to write <!-- ... --> instead. The best option is to declare the charset only once, as UTF-8, and remove the other meta tags.

Related

How does php utf8_decode deal with utf8mb4? [duplicate]

This question already has answers here:
PHP DOMDocument loadHTML not encoding UTF-8 correctly
I am working on localhost on Windows 10 with Apache 2.4: Apache/2.4.51 (Win64) OpenSSL/1.1.1l PHP/8.0.11 and Database client version: libmysql - mysqlnd 8.0.11, which uses the server Server version: 10.4.21-MariaDB - mariadb.org binary distribution. It is by default set to _utf8mb4: Server charset: UTF-8 Unicode (utf8mb4).
I made a php script that gets content(including html tags) from a Wikipedia page using loadHTMLFile. I then further use xpath->query to filter the dom and then the data is saved in mysql table as a string after being escaped by mysqli_real_escape_string. Later on, I query the database and save the content in a variable which is passed to loadHTML, then I remove a few dom elements and then pass the modified content to saveHTML and echo it to my webpage.
What happens is some characters are being displayed like:
--> Â
- --> –
€ --> â¬
ευρώ --> ευÏÏŽÂ
All the characters are displayed correctly when I use echo utf8_decode($output). Note that none of the following, used instead of utf8_decode, has any effect:
<meta charset="utf-8"> // in my html file
header('Content-Type: text/html; charset=utf-8'); // before the echo statement
mysqli_query($conn, "SET NAMES utf8"); // before mysql insert into and Select from statements
mysqli_set_charset($conn, "utf8"); // before mysql insert into and Select from statements
Also, both mb_detect_encoding($output) and mb_detect_encoding(utf8_decode($output)) return UTF-8, not utf8mb4. In my Chrome browser's network/headers tab, I always get Content-type as text/html; charset=UTF-8, regardless of whatever changes I make in my server-side PHP/MySQL settings.
My guess is that, the data in the Wikipedia page is in normal UTF-8 form, which is automatically converted by php into utf8mb4 when it's downloaded by loadHTMLFile. Now this data is saved in mysql tables in utf8mb4 format. This data when retrieved later on stays in utf8mb4 format and is seen to the browser in utf8mb4 format. When I use utf8_decode it must convert it to normal utf-8 format.
The problem with my guess is that the php docs about utf8_decode page, mention nothing of utf8mb4, rather it says, multi-byte UTF-8 ISO-8859-1 encoding is converted into single byte UTF-8 ISO-8859-1. Secondly the docs say, ISO-8859-1 charset does not contain the EURO sign. But my webpage successfully shows euro sign after utf8_decode and a browser is capable of parsing multibyte utf-8 characters as well, so if that was the only thing that utf8_decode does, then it should not make any difference with my code.
Edit:
I found the culprit. The following echos correct characters:
$stmt = $conn->prepare("Select ...");
...
$result = $stmt->execute();
...
$row = $stmt->get_result()->fetch_assoc();
echo $row['content']; // gives €ερυώ
Now, $row['content'] is the data directly from my database without any utf8_decode. But I happen to use PHP DOMDocument afterwards, and the following happens:
libxml_use_internal_errors(true); // important
$content = new DOMDocument();
$content->loadHTML($row['content']);
echo $row['content'], $content->saveHTML($content); die();
// The output is: €ερυώ
//â¬ÎµÏÏÏ
The output from the above code in my view source is:
€ερυώ<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>â¬ÎµÏÏÏ</p></body></html>
So please explain: what the heck are loadHTML and saveHTML doing here?
P.S.: My whole code is available in a GitHub repo: https://github.com/AnupamKhosla/crimeWiki and the specific scripts dealing with the encoding of Wikipedia pages are at https://github.com/AnupamKhosla/crimeWiki/blob/main/include/wikipedea_code.php https://github.com/AnupamKhosla/crimeWiki/blob/main/include/post_code.php
The fact that utf8_decode() helps you is incidental. This function should not be used most of the time. If using it helps you, then it can only mean that somehow you have managed to mangle your data.
utf8mb4 is MySQL's charset that represents the full UTF-8 encoding. Therefore, if you are using UTF-8 everywhere in your code, you should never need to use utf8_decode() as it will only damage the data. ISO-8859-1 supports very few characters. It's not what you want.
What seems to have happened here is that you forgot to set $conn->set_charset('utf8mb4') when you opened the connection. Many MySQL servers default to Latin1 when you don't specify the charset, which means that even though your schema might be using utf8mb4 consistently, the connection to the database doesn't and converts the data into garbled up text.
The solution is simple, always set the right connection charset right after opening a new mysqli connection. $conn->set_charset('utf8mb4') will solve your problem and you don't need to use the ridiculous utf8_decode() function that accidentally solved your problem.
Using any encode/decode is a symptom of misconfiguration.
When you connect to mysql, you tell it what encoding is being used in the client.
When you declare the tables, you specify how to store things. CHARACTER SET utf8mb4 is often the best.
Please provide SELECT HEX(col), col ... for a sample. (You cannot trust what the browser displays; it tries to "fix" the encoding.) Once you have the hex, we can discuss how to repair the data. A common problem is "double-encoding", wherein the data has been misconverted twice.
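If you want to do a similar check on the PHP side, something along these lines may help. It is a rough heuristic, not a definitive test: looks_double_encoded is a hypothetical helper, and it assumes the extra conversion went through Windows-1252 (which is what MySQL calls latin1).

```php
<?php
// Hypothetical heuristic: does this UTF-8 string look like UTF-8 that was
// misread as Windows-1252 and then converted to UTF-8 a second time?
function looks_double_encoded(string $s): bool
{
    if (!mb_check_encoding($s, 'UTF-8')) {
        return false; // not even valid UTF-8, a different problem
    }
    // Undo one assumed layer: UTF-8 text -> Windows-1252 bytes.
    $once = mb_convert_encoding($s, 'Windows-1252', 'UTF-8');
    // Redo it: if that does not reproduce the input, characters were lost
    // in the first step, so the string was not double-encoded this way.
    $back = mb_convert_encoding($once, 'UTF-8', 'Windows-1252');

    return $once !== $s && $back === $s && mb_check_encoding($once, 'UTF-8');
}

$good    = "\u{20AC}";                                          // euro sign
$mangled = mb_convert_encoding($good, 'UTF-8', 'Windows-1252'); // simulate the damage

echo bin2hex($good), "\n";    // e282ac
echo bin2hex($mangled), "\n"; // c3a2e2809ac2ac
var_dump(looks_double_encoded($good));    // bool(false)
var_dump(looks_double_encoded($mangled)); // bool(true)
```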
As for your current samples, there are enough inconsistencies that I cannot deduce what went wrong:
-> That is represented as hex 80 by some word processors, not by HTML.
- --> this is a plain dash; it is never mangled. Perhaps you have an n-dash or m-dash?
€ --> mangles to "â‚¬" via "Mojibake" through latin1;
did you omit the "SINGLE LOW-9 QUOTATION MARK" that looks like a comma??
ευρώ --> ευÏÏŽ via "Mojibake" through latin1;
More on Mojibake and other common manglings: Trouble with UTF-8 characters; what I see is not what I stored
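Regarding the edit in the question: per the linked duplicate, DOMDocument::loadHTML() does not assume UTF-8 when the markup carries no encoding declaration, so the UTF-8 bytes get read as a single-byte encoding. A minimal sketch of one common workaround, assuming the dom extension is loaded, is to prepend a charset declaration before parsing. The sample string is made up for illustration.

```php
<?php
// Assumes ext-dom is loaded. Sample content: the euro sign followed by
// the Greek word for "euro", as UTF-8.
$fragment = "<p>\u{20AC}\u{3B5}\u{3C5}\u{3C1}\u{3CE}</p>";

libxml_use_internal_errors(true); // HTML fragments tend to trigger warnings

// Declare the encoding inside the markup itself so the parser does not
// fall back to a single-byte guess.
$declared = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . $fragment;

$doc = new DOMDocument();
$doc->loadHTML($declared);

// DOM strings are exposed as UTF-8, so this should print the original text.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Whether you strip the added meta element again before calling saveHTML() is up to you.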

Display special characters from table in php page

I have a table which contains name of teachers. Some names are like René Visser having special characters. When I write the SQL query for displaying the name, the special characters are replaced by � symbols.
I have tried cast() but it's not working properly. My query is like this.
$qry = mssql_query("SELECT CAST(FirstName_1 AS NVARCHAR(250)) AS Name FROM tbl_teachers");
The FirstName_1 column is nvarchar type. I have tried to cast FirstName_1 to VARBINARY(8000), then casting result to IMAGE like following.
CAST(CAST(FirstName_1 AS VARBINARY(8000)) AS IMAGE) AS Name.
You should have UTF-8 encoding for the SQL server.
Then, make sure you send the encoding headers also from php using :
header('Content-Type: text/html; charset=utf-8');
You will need to specify the charset or, if you already have, set it to Windows-1252. Your page is likely reading the data as UTF-8, which explains the � symbols.
<head>
<meta charset="Windows-1252">
</head>
You most likely have a charset issue. Your query has little to do with this, you don't need to cast it.
You'll need to set the charset of the connection, the PHP and HTML header and the document itself as the same charset. UTF-8 will most likely cover all of the special characters you'll ever need.
Below are some things you could do.
ini_set('mssql.charset', 'UTF-8'); (run this when connecting to your database)
Set both the PHP and HTML headers to UTF-8
PHP: header('Content-Type: text/html; charset=utf-8'); (has to be called before any output)
HTML: <meta charset="utf-8"> (has to be inside the <head> tag)
Save the document in UTF-8 encoding. If you're doing it in Notepad++, it's Format -> Convert to UTF-8 (you may also choose UTF-8 w/o BOM)
The database itself, and its tables, may need to be set to UTF-8. In MySQL this can be done with the queries below (they need to be run only once):
ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Keep in mind that all parts of your application have to use the same charset, or you'll keep seeing this kind of problem.

UTF-8 page to UTF-8 database table showing incorrect

I'm hoping for an explanation of why some UTF-8 text is being saved to a database table incorrectly...
I created an HTML form and the page's meta content is set to UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The PHP and template files are all Unicode/UTF-8.
The form field data is submitted to a utf8_unicode_ci encoded database table.
If I submit the form with characters such as "éçä" (which I created from Windows' Character Map program set to Unicode character set) they show up incorrectly in the database ("Ã©Ã§Ã¤"). I'm viewing the database via phpMyAdmin (which is also set to UTF-8 character encoding).
However, if I run iconv() on the string to convert to ISO-8859-1 before inserting it into the database, then the characters show correctly:
$input = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $input);
What is going on? Shouldn't the fact that everything is UTF-8/Unicode from beginning to end result in it being correct in the DB? What am I doing wrong and why did converting the data to ISO-8859-1 work?
The only other thing done to the data is a FILTER_SANITIZE_MAGIC_QUOTES:
$input = filter_var($input,FILTER_SANITIZE_MAGIC_QUOTES);
Thank you for your time and input.
Two steps you haven't mentioned:
Specify UTF-8 in HTTP Content-Type header
Specify UTF-8 when connecting to MySQL, e.g. specifying charset in PDO
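As for why the iconv() call appeared to fix things: assuming the connection charset was left at latin1, MySQL converts whatever bytes arrive from latin1 to the column's utf8. The sketch below only simulates that conversion in PHP (store_as_utf8 is a hypothetical stand-in for the server, not a real API) to show the three cases.

```php
<?php
// Hypothetical stand-in for the server: bytes arriving on a connection
// declared as $connCharset are converted to the column charset (UTF-8).
function store_as_utf8(string $clientBytes, string $connCharset): string
{
    return mb_convert_encoding($clientBytes, 'UTF-8', $connCharset);
}

$input = "\u{E9}\u{E7}\u{E4}"; // "éçä" as UTF-8, as submitted by the form

// Case 1: UTF-8 bytes sent over a latin1 connection -> double-encoded.
$case1 = store_as_utf8($input, 'Windows-1252');

// Case 2: the workaround converts to ISO-8859-1 first (mb_convert_encoding
// is used here for the same conversion), so the latin1 connection happens
// to interpret the bytes correctly.
$latin1 = mb_convert_encoding($input, 'ISO-8859-1', 'UTF-8');
$case2  = store_as_utf8($latin1, 'Windows-1252');

// Case 3: the connection itself is declared as UTF-8, no workaround needed.
$case3 = store_as_utf8($input, 'UTF-8');

echo bin2hex($case1), "\n"; // c383c2a9c383c2a7c383c2a4
echo bin2hex($case2), "\n"; // c3a9c3a7c3a4
echo bin2hex($case3), "\n"; // c3a9c3a7c3a4
```

Case 3 corresponds to specifying the charset when connecting, which keeps the application code free of conversions.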

Why do I have to utf8_decode() my MySQL column value to get it to display properly?

I'm using CakePHP with App.encoding set to UTF-8, <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> present in my <head> and my MySQL database set to UTF-8 Unicode Encoding and utf8_general_ci collation. I also have "encoding"=>"UTF8" in my database.php connection details.
When I store a '£' symbol in the database table and view it using command line MySQL, the character displays correctly.
If I use CakePHP to fetch the rows from the database table and output them in my website, I see Â£ instead of my intended £ symbol.
However if I then use utf8_decode() to output my data, it displays correctly.
Is this correct? I have tried using htmlentities() to convert the £ symbol into &pound; but it outputs &Acirc;&pound; instead! Even when I use the additional parameters for charset.
Perhaps someone can help - I must have missed something here, but I thought that the characters should display correctly (in things like textarea HTML tags) if all your headers, meta tags etc were consistently UTF-8?
It sounds like the data in your database is wrong: the character £ is actually stored as the two characters Â£. You can confirm this by going to the database and using the hex and charset functions:
select charset(MyColumn), hex(MyColumn) from MyTable;
If the column is encoded in UTF-8, for the value '£' you should see output identical to this:
+---------------+-----------+
| utf8 | C2A3 |
+---------------+-----------+
If you see anything else, like if the charset column reports latin1 or if hex column reports C382C2A3, the data in the table is wrong. It can be fixed though, but the fix depends on the kind of error the data has. What do you get from charset and hex?
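For reference, the two hex values above can be reproduced in PHP, and the same conversion run in reverse undoes one layer of double-encoding. This is a sketch under the assumption that the extra step was a latin1-to-UTF-8 conversion; check the hex of your own data before applying anything like it.

```php
<?php
$correct = "\u{A3}"; // the pound sign in UTF-8

// Simulate the damage: the UTF-8 bytes are taken for latin1 and converted
// to UTF-8 once more.
$double = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

echo strtoupper(bin2hex($correct)), "\n"; // C2A3
echo strtoupper(bin2hex($double)), "\n";  // C382C2A3

// Reverse the extra step: read the text as UTF-8 and write it as latin1,
// which yields the original UTF-8 bytes again.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
echo strtoupper(bin2hex($repaired)), "\n"; // C2A3
```

utf8_decode() performs the same UTF-8 to ISO-8859-1 conversion, which would explain why it appears to fix the output.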
You can use htmlentities with the third parameter to safely encode UTF-8:
htmlentities("£", ENT_COMPAT, "UTF-8")
If all is in UTF8 remove the "encoding"=>"UTF8" in your database.php connection details:
$conn = mysql_connect($server, $username, $password);
//mysql_set_charset("UTF8", $conn); // REMOVED. ;)
mysql_select_db($database, $conn);

Arabic Character Encoding Issue: UTF-8 versus Windows-1256

Quick Background: I inherited a large SQL dump file containing a combination of English and Arabic text and (I think) it was originally exported using 'latin1'. I changed all occurrences of 'latin1' to 'utf8' prior to importing the file. The Arabic text didn't appear correctly in phpMyAdmin (which I guess is normal), but when I loaded the text to a web page with the following...
<meta http-equiv='Content-Type' content='text/html; charset=windows-1256'/>
...everything looked good and the arabic text displayed perfectly.
Problem: My client is really really really picky and doesn't want to change his...
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
...to the 'Windows-1256' equivalent. I didn't think this would be a problem, but when I changed the charset value to 'UTF-8', all of the arabic characters appeared as diamonds with question marks. Shouldn't UTF-8 display arabic text correctly?
Here are a few notes about my database configuration:
Database charset is 'utf8'
Database connection collation is 'utf8_general_ci'
All databases, tables, and applicable fields have been collated as 'utf8_general_ci'
I've been scouring Stack Overflow and other forums for anything that relates to my issue. I've found similar problems, but none of the solutions seem to work for my specific situation. Hope someone can help!
If the document looks right when declared as windows-1256 encoded, then it most probably is windows-1256 encoded. So it was apparently not exported using latin1—which would have been impossible, since latin1 has no Arabic letters.
If this is just about a single file, then the simplest way is to convert it from windows-1256 encoding to utf-8 encoding, using e.g. Notepad++. (Open the file in it, change the encoding, via File format menu, to Arabic, windows-1256. Then select Convert to UTF-8 in the File format menu and do File → Save.)
Windows-1256 and UTF-8 are completely different encodings, so data gets all messed up if you declare windows-1256 data as UTF-8 or vice versa. Only ASCII characters, such as English letters, have the same representation in both encodings.
We can't find the error in your code if you don't show us your code, so we're very limited in how we can help you.
You told the browser to interpret the document as being UTF-8 rather than Windows-1256, but did you actually change the encoding used from Windows-1256 to UTF-8?
For example,
$ cat a.pl
use strict;
use warnings;
use feature qw( say );
use charnames ':full';
my $enc = $ARGV[0] or die;
binmode STDOUT, ":encoding($enc)";
print <<"__EOI__";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=$enc">
<title>Foo!</title>
</head>
<body dir="rtl">
\N{ARABIC LETTER ALEF}\N{ARABIC LETTER LAM}\N{ARABIC LETTER AIN}\N{ARABIC LETTER REH}\N{ARABIC LETTER BEH}\N{ARABIC LETTER YEH}\N{ARABIC LETTER TEH MARBUTA}
</body>
</html>
__EOI__
$ perl a.pl UTF-8 > utf8.html
$ perl a.pl Windows-1256 > cp1256.html
I think you need to go back to square one. It sounds like you have a database dump in Win-1256 encoding and you want to work with it in UTF-8 from now on. It also sounds like you are using PHP but you have lots of irrelevant tags on your question and are missing the most important one, PHP.
First, you need to convert the text dump into UTF-8 and you should be able to do that with PHP. Chances are that your conversion script will have two steps, first read the Win-1256 bytes and decode them into internal Unicode text strings, then encode the Unicode text strings into UTF-8 bytes for output to a new text file.
Once you have done that, redo the database import as you did before, but now you have correctly encoded the input data as UTF-8.
After that it should be as simple as reading the database and rendering a web page with the correct UTF-8 encoding.
P.S. It is actually possible to reencode the data every time you display it, but that does not solve the problem of having a database full of incorrectly encoded data.
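A conversion script along those lines might look like the following. This is a sketch, assuming the iconv extension is installed and knows the WINDOWS-1256 name; convert_dump and the file paths are hypothetical, and it reads the whole file into memory, so a very large dump would need to be processed in chunks.

```php
<?php
// Hypothetical helper: re-encode a dump file from windows-1256 to UTF-8.
function convert_dump(string $inPath, string $outPath): void
{
    $bytes = file_get_contents($inPath);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $inPath");
    }
    // Decode the windows-1256 bytes and re-encode them as UTF-8.
    $utf8 = iconv('WINDOWS-1256', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('Conversion failed');
    }
    if (file_put_contents($outPath, $utf8) === false) {
        throw new RuntimeException("Cannot write $outPath");
    }
}

// The Arabic word from the Perl example above, as windows-1256 bytes
// (one byte per letter).
$cp1256 = "\xC7\xE1\xDA\xD1\xC8\xED\xC9";
$utf8   = iconv('WINDOWS-1256', 'UTF-8', $cp1256);

echo strlen($cp1256), " bytes in, ", strlen($utf8), " bytes out\n"; // 7 bytes in, 14 bytes out
```

After converting, the dump can be imported again with utf8 declared, as described above.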
In order to display Arabic characters correctly, you need to convert your PHP file to UTF-8 without BOM.
This happened to me: Arabic characters were displayed as diamonds, and converting the file to UTF-8 without BOM solved the problem.
It seems that the DB is configured as UTF8, but the data itself is extended ASCII. If the data is converted to UTF8, it will display correctly in a page whose content type is set to UTF8.
