I have some texts in French (containing accented characters such as "é"), stored in a MySQL table whose collation is utf8_unicode_ci (both the table and the columns), that I want to output on an HTML5 page.
The HTML page charset is UTF-8 (< meta charset="utf-8" />) and the PHP files themselves are encoded as "UTF-8 without BOM" (I use Notepad++ on Windows). I use PHP5 to request the database and generate the HTML.
However, on the output page, the special characters (such as "é") appear garbled and are replaced by "�".
When I browse the database (via phpMyAdmin) those same accented characters display just fine.
What am I missing here?
(Note: changing the page encoding (through Firefox's "web developer" menu) to ISO-8859-1 solves the problem... except for the special characters that appears directly in the PHP files, which become now corrupted. But anyway, I'd rather understand why it doesn't work as UTF-8 than changing the encoding without understanding why it works. ^^;)
I experienced that same problem before, and what I did are the following
1) Use notepad++(can almost adapt on any encoding) or eclipse and be sure in to save or open it in UTF-8 without BOM.
2) set the encoding in PHP header, using header('Content-type: text/html; charset=UTF-8');
3) remove any extra spaces on the start and end of my PHP files.
4) set all my table and columns encoding to utf8mb4_general_ci or utf8mb4_unicode_ci via PhpMyAdmin or any mySQL client you have. A comparison of the two encodings are available here
5) set mysql connection charset to UTF-8 (I use PDO for my database connection )
PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"
PDO::MYSQL_ATTR_INIT_COMMAND => "SET CHARACTER SET utf8"
or just execute the SQL queries before fetching any data
6) use a meta tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
7) use a certain language code for French
<meta http-equiv="Content-language" content="fr" />
8) change the html element lang attribute to the desired language
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
and will be updating this more because I really had a hard time solving this problem before because I was dealing with Japanese characters in my past projects
9) Some fonts are not available in the client PC, you need to use Google fonts to include it on your CSS
10) Don't end your PHP source file with ?>
NOTE:
but if everything I said above doesn't work, try to adjust your encoding depending on the character-set you really want to display, for me I set everything to SHIFT-JIS to display all my japanese characters and it really works fine. But using UFT-8 must be your priority
This works for me
Make your database utf8_general_ci
Save your files in N++ as UTF-8 without BOM
Put $mysqli->query('SET NAMES utf8'); after the connection to the database in your PHP file
Put < meta charset="utf-8" /> in your HTML-s
Works perfect.
If your php.ini default_charset is not set to UTF-8, you need to use a Content-type to define your data. Apply the following header at the top of your file(s) :
header("Content-type: text/html; charset=utf-8");
If you have still troubles with encoding, the cause may be one of the following:
a database server charset problem (check encoding of your server)
a database client charset problem (check encoding of your connection)
a database table charset problem (check encoding of your table)
a php default encoding problem (check default_encoding parameter in parameters.ini)
a multibyte missconfigured (see mb_string parameters in parameters.ini)
a <form> charset problem (check that it is sent as utf-8)
a <html> charset problem (where no enctype is set in your html file)
a Content-encoding: problem (where the wrong encoding is sent by Apache).
SET NAMES worked for me.
My issue was in one of my editing pages the field with the foreign characters would not display, on the production web pages there was no problem.
I know you already have an answer. That's great. But strangely none of these answers solved my issue. I'd like to share my answer for the benefit of the others who may encounter the same issues.
I also had the same problems as the OP, with regards to French accents in a multi-lingual application.
But I encountered this issue for the first time when I had to pass (French accented) data as segments in AJAX calls.
Yes, we must have the database set to work with UTF8. But the fact that AJAX calls had query strings (in my case segments, since I'm using CodeIgniter), I had to simply encode the French text.
To do this on the client-side, use the Javascript encodeURI() function with your data.
And to reverse it in PHP, just use urldecode($MyStr) where data was received as parameters.
Hope this helps.
Type something full French signs in your (php) file
Save that file as UTF-8
Paste line beneath into your website header
header('Content-Type: text/html; charset=utf-8');
Page (file) should look good.
If looks good go here for mysql behavior after (SET_NAMES).
Related
This question already has answers here:
PHP DOMDocument loadHTML not encoding UTF-8 correctly
(11 answers)
Closed 1 year ago.
I am working on localhost windows10 apache 2.4: Apache/2.4.51 (Win64) OpenSSL/1.1.1l PHP/8.0.11and Database client version: libmysql - mysqlnd 8.0.11 which uses the server Server version: 10.4.21-MariaDB - mariadb.org binary distribution. It is by default set to _utf8mb4: Server charset: UTF-8 Unicode (utf8mb4).
I made a php script that gets content(including html tags) from a Wikipedia page using loadHTMLFile. I then further use xpath->query to filter the dom and then the data is saved in mysql table as a string after being escaped by mysqli_real_escape_string. Later on, I query the database and save the content in a variable which is passed to loadHTML, then I remove a few dom elements and then pass the modified content to saveHTML and echo it to my webpage.
What happens is some characters are being displayed like:
--> Â
- --> –
€ --> €
ευρώ --> ευÏÏŽÂ
All the characters are displayed correctly, when I use echo utf8_decode($output). Note: that instead of using utf8_decode, any of the following has no effect:
<meta charset="utf-8"> // in my html file
header('Content-Type: text/html; charset=utf-8'); // before the echo statement
mysqli_query($conn, "SET NAMES utf8"); // before mysql insert into and Select from statements
mysqli_set_charset($conn, "utf8"); // before mysql insert into and Select from
statements
Also both mb_detect_encoding($output) and mb_detect_encoding(utf8_decode($output)) returns UTF-8 not utf8mb4. In my chrome browser's network/headers tab, I always get Content-type as text/html; charset=UTF-8 , regardless of whatever changes I make in my server side php/mysql settings.
My guess is that, the data in the Wikipedia page is in normal UTF-8 form, which is automatically converted by php into utf8mb4 when it's downloaded by loadHTMLFile. Now this data is saved in mysql tables in utf8mb4 format. This data when retrieved later on stays in utf8mb4 format and is seen to the browser in utf8mb4 format. When I use utf8_decode it must convert it to normal utf-8 format.
The problem with my guess is that the php docs about utf8_decode page, mention nothing of utf8mb4, rather it says, multi-byte UTF-8 ISO-8859-1 encoding is converted into single byte UTF-8 ISO-8859-1. Secondly the docs say, ISO-8859-1 charset does not contain the EURO sign. But my webpage successfully shows euro sign after utf8_decode and a browser is capable of parsing multibyte utf-8 characters as well, so if that was the only thing that utf8_decode does, then it should not make any difference with my code.
Edit:
I found the culprit. The following echos correct characters:
$stmt = $conn->prepare("Select ...");
...
$result = $stmt->execute();
...
$row = $stmt->get_result()->fetch_assoc()
echo $row['content']; // gives €ερυώ
Now, $row['content'] is the data directly from my database without any utf_decode. But I happen to use php domdocument afterwards and the following happens:
libxml_use_internal_errors(true); // important
$content = new DOMDocument();
$content->loadHTML($row['content']);
echo $row['content'], $content->saveHTML($content); die();
// The output is: €ερυώ
//â¬ÎµÏÏÏ
The output from the above code in my view source is:
€ερυώ<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>â¬ÎµÏÏÏ</p></body></html>
So please explain what the heck does loadHTML and saveHTML is doing here?
P.S: My whole code available on github repo: https://github.com/AnupamKhosla/crimeWiki and the speciic scripts about wikipedea pages encoding at https://github.com/AnupamKhosla/crimeWiki/blob/main/include/wikipedea_code.php https://github.com/AnupamKhosla/crimeWiki/blob/main/include/post_code.php
The fact that utf8_decode() helps you is incidental. This function should not be used most of the time. If using it helps you, then it can only mean that somehow you have managed to mangle your data.
utf8mb4 is MySQL's charset that represents the full UTF-8 encoding. Therefore, if you are using UTF-8 everywhere in your code, you should never need to use utf8_decode() as it will only damage the data. ISO-8859-1 supports very few characters. It's not what you want.
What seems to have happened here is that you forgot to set $conn->set_charset('utf8mb4') when you opened the connection. Many MySQL servers default to Latin1 when you don't specify the charset, which means that even though your schema might be using utf8mb4 consistently, the connection to the database doesn't and converts the data into garbled up text.
The solution is simple, always set the right connection charset right after opening a new mysqli connection. $conn->set_charset('utf8mb4') will solve your problem and you don't need to use the ridiculous utf8_decode() function that accidentally solved your problem.
Using any encode/decode is a symptom of misconfiguration.
When you connect to mysql, you tell it what encoding is being used in the client.
When you declare the tables, you specify how to store things. CHARACTER SET utf8mb4 is often the best.
Please provide SELECT HEX(col), col ... for a sample. (You cannot trust what the browser displays; it tries to "fix" the encoding. Once you have the hex, we can discuss how to repair the data. A common problem is "double-encoding", wherein the data has been misconverted twice.
As for your current samples, there are enough inconsistencies that I cannot deduce what went wrong:
-> That is represented as hex 80 by some word processors, not by HTML.
- --> this is a plain dash; it is never mangled. Perhaps you have an n-dash or m-dash?
€ --> mangles to "€" via "Mojibake" through latin1;
did you omit the "SINGLE LOW-9 QUOTATION MARK" that looks like a comma??
ευρώ --> ευÏÏŽ via "Mojibake" through latin1;
More on Mojibake and other common manglings: Trouble with UTF-8 characters; what I see is not what I stored
I am building a webpage in HTML with PHP and MySQL and I ran into trouble with swedish characters ÅÄÖ when running page. They show up as � instead of Å/Ä/Ö.
I have set the charset to UTF-8 in both HTML meta-tag and via PHP:
<?php
header('Content-type: text/html; charset=UTF-8');
?>
<meta charset="UTF-8">
Also, MySQL runs utf8_general_ci collation on all tables.
All files should also be encoded and saved as UTF-8 without Unicode Signature (BOM) and no normalization form.
All this have worked flawless before, but today, nomather what I try I do end up with � instead of Å/Ä/Ö. Is there a good way to debug this and find the problem?
Is any of my steps unnecessary or have I forgotten anything?
What you need from deceze's article is the part regarding the SET NAMES:
mysql_set_charset('utf8', $connection); //not mysql_query("SET NAMES 'utf8'");
Just add that at the beginning of your php code, after the database connection was started
You may try save your php files in UTF-8 encoding. I assume the files are written in something else (possibly ISO-xxxx or ANSI)
To do that with Notepad++, select all the lines and copy to clipboard, change the coding to UTF-8 without BOM in encoding menu, then paste over everything and save.
Is this for only few records or all records with swedish characters?
You can change the page encoding manually in browser settings - this is how you test it: change it to latin1/iso-8854-1 to see if it displays these correctly that are wrong as utf-8.
Chances are someone is using browser that is not supporting utf8 or fiddled with the encoding manually.
Also make sure you db connection is utf8 too. (set names utf8;)
I'm writing a small web app that will receive and parse tab-delimited text files from users. Those files will either be uploaded via a textarea or a multipart/form-data form. Those files will be in a variety of charsets, including Asian and the like. In consequence I am trying to use utf-8 throughout the app.
The site is entirely (as far as I know) in UTF-8:
Each php file is saved in utf-8 encoding;
I have added default_charset = "utf-8" in my php.ini file;
The HTML header contains the required utf-8 mentions:
header('Content-Type:text/html; charset=UTF-8');
...
<?xml version="1.0" encoding="utf-8" ?>
...
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
The textarea forms contain the accept-charset="UTF-8" tag.
The db is collated in utf-8;
Each connection to the db includes the option 1002 => 'SET NAMES utf8'.
Now, I just discovered that I needed to set mb_regex_encoding to utf-8 manually for one of my parsing function to work (I use mb_split() to identify & replace tabs and new lines). So ...
What else do I need to do to make sure my site is once and for all utf-8 throughout? In particular, are there any other encoding function I should set, such as mb_internal_encoding(), and if so where in the code should I do that (e.g., at the start of the index.php file?
I can think of two more things;
mb_internal_encoding('UTF-8');
...as early as possible in the PHP script, and
mysqli_set_charset($link, 'utf8');
...to set the connection charset, if you're using MySql. For PDO, you can specify it with the connection string:
"mysql:host=$host;dbname=$db;charset=utf8"
I am new here, so I apologize if I am doing anything wrong.
I have a form which submits user input onto another page. User is expected to type ä, ö, é, etc... I have placed all of the following in the document:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
header('Content-Type:text/html; charset=UTF-8');
<form action="whatever.php" accept-charset="UTF-8">
I even tried:
ini_set('default_charset', 'UTF-8');
When the other page loads, I need to check what the user input with something like:
if ( $_POST['field'] == $check ) {
...
}
But if he inputs something like 'München', PHP will compare 'München' with 'München' and will never trigger TRUE even though it should. Since it is specified UTF-8 everywhere, I am guessing that the server is converting to something else (Windows-1252 as I read on another thread) because it does not support or is not configured to UTF-8. I am using Apache on a local server before I load into production; I have not changed (and don't know how to) any of the default settings. I've been working on a Windows 7, editing with Notepad++ enconding my files in ANSI. If I bin2hex('München') I get '4dc3bc6e6368656e'.
If I echo $_POST['field']; it displays 'München' correctly.
I have researched everywhere for an explanation, all I find is that I should include those tags/headings I already have.
Any help is much appreciated.
You are facing many different problems at the same, let's start with the simplest one.
Problem 1) You say that echo $_POST['field']; will display it correctly? What do you mean with "display"? It can be displayed correctly in two cases:
either the field is in UTF-8 and your page has been declared as UTF-8 and the browser is displaying it as UTF-8 or,
the field is in Latin-1 and the browser has decided (through the auto-detection heuristics) that your page is in Latin-1.
So, the fact that echo $_POST['field']; is correct tells you nothing.
Problem 2) You are using
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
header('Content-Type:text/html; charset=UTF-8');
Is this PHP code? If it is, it will be an error because the header must be set before sending out any byte. If you do this you will not set the Content-Type header and PHP should generate a warning.
Problem 3) You are using
<form action="whatever.php" accept-charset="UTF-8">
Some browsers (IE, mostly) ignore accept-charset if they can coerce the data to be sent in ASCII or ISO Latin-1. So the data will be in UTF-8 and declared as ISO Latin-1 or ISO Latin-1 and sent as ISO Latin-1 (but this second case is not your case).
Have a look at https://stackoverflow.com/a/8547004/449288 to see how to solve this problem.
Problem 4) Which strings are you comparing? For example, if you have
$city = "München"
$_POST['city'] == $city
The result of this code will depend on the encoding of the PHP file. If the file is encoded in ISO Latin-1 and the $_POST correctly contains UTF-8 data, the == will compare different bytes and will return false.
Another solution that may be helpful is in Apache, you can place a directive in your configuration file (httpd.conf) or .htacess called AddDefaultCharset. It looks like this:
AddDefaultCharset utf-8
http://httpd.apache.org/docs/2.0/mod/core.html#adddefaultcharset
That will override any other default charsets.
I changed "mbstring.detect_order = pass" in my php.ini file and i worked
I've used Unicode characters in my forms and file many times. I had not any problem up to now.
Try to do these steps and check the result:
Remove header('Content-Type:text/html; charset=UTF-8'); from your HTML form codes.
Use your form just like <form action="whatever.php"> without accept-charset="UTF-8". (It's better to insert the method of sending data in your form tag).
In target page (whatever.php), insert again <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> in a <head> tag.
I always did my project like what I mentioned here and I did not have any problem with Unicode strings.
This is due to the character encoding of the PHP file(s).
The hardcoded München is stored with the character encoding of the source file(s), in this case ANSI and when that value is compared to the UTF-8 encoded value provided in the $_POST variable, the two will, quite naturally, differ.
The solution to your problem is one of:
Serve and process content with the same encoding as that of the source file(s), in this case likely to be windows-1252.
This would, for starters, include changing the content="text/html; charset=UTF-8" to content="text/html; charset=windows-1252" whenever serving HTML data.
Avoid all hardcoded values that could be affected by character encoding issues between UTF-8 and windows-1252, more or less only hardcode values that only includes English letters and numbers.
Any UTF-8 values would have to be read from a source that ensures they are UTF-8 encoded (for instance a database set to use UTF-8 as storage encoding as well as connection encoding).
Wrap all hardcoded assignments in utf8_encode(), for instance $value = utf8_encode ('München');
Change the encoding of the source file(s) to UTF-8.
This can be accomplished in any number of ways, a decent text editor will be able to do it or the outstanding libiconv can be used, especially for batch processing.
Either solution 1 or 4 would be my preferred solution, especially if multiple people are involved in the project.
As a side-note, some text editors (notably Notepad++) has the option of using either UTF-8 or UTF-8 without BOM. The BOM (Byte Order Mark) is pointless in UTF-8 and will cause problems when writing headers in PHP (most often when doing a redirect). This is because the BOM is right in front of the initial <?php, causing the server to send the BOM just as it would had there been any other character in front. The difference is you'd note a character in front, but the BOM isn't displayed.
Rule of thumb: Always use UTF-8 without BOM.
I have a problem in Language Encoding in PHP as my php file should display both English and Arabic Characters.
Some web page parts are static and others are dynamic (data comes from a Sybase database) and the language encoding of database is ok as data is displayed well in it.
My web page has some drop down lists that are dynamic but they display the data in a strange format which is not English or Arabic like squares and unknown symbols.
I checked the possible causes and did many solutions like:-
Changing the encoding of the PHP script:
Saving File with the Name : WebPage1 of Type : PHP and Encoding : ANSI or UTF-8 or Unicode.
Changing the HTML encoding declaration:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=windows-1256" />
Changing the PHP encoding declaration:
header('Content-Type: text/html; charset=UTF-8);
header('Content-Type: text/html; charset=windows-1256');
Changing the database tables font and language:
Arial(Arabic).
The problem still exists and I do not know what I can do to solve that.
Can you suggest any solution?
Check your database connection, make sure the sybase_connect connects with UTF-8 as charset.
See http://php.net/manual/en/function.sybase-connect.php
From the comment that you are using ODBC to connect: There seems to be an issue with PHP/ODBC and UTF8. Some suggestions are mentioned in this thread: Php/ODBC encoding problem
Always use UTF-8.
Your first header is correct. Your first header is correct, except you should use single = instead of ==. Make sure you used header() function before sending any output to browser.
Open your files in a Unicode supporting editor like Editplus, notepad++ and while saving every source code or HTML file, use Save as and choose UTF-8 on the save as screen. If you use eclipse, import your project to eclipse, right click it and go to project settings, apply charset setting as utf-8 to all source code.
If there's something wrong with data coming from MySQL database, then use appropriate collation on any text storing column (varchar, blob etc). Those are the usual suggestions for it. If you use Sybase, then use Google for collation settings.
And don't change your font to Arabic; Arial already supports it.
You seem to confuse something. Neither UTF-8 nor windows-1256 describe languages, they denote character sets/encodings. Although the character sets may contain characters that are typically used in certain languages, their use doesn’t say anything about the language.
Now as the characters of Windows-1256 are contained in Unicode’s character set and thus can be encoded with UTF-8, you should choose UTF-8 for both languages.
And if you want to declare the language for your contents, read the W3C’s tutorial on Declaring Language in XHTML and HTML.
In your case you could declare your primary document language as en (English) and parts of your document as ar (Arabic):
header('Content-Language: en');
header('Content-Type: text/html;charset=UTF-8');
echo '<p>The following is in Arabic: <span lang="ar">العربية</span></p>';
Make sure to use UTF-8 for both.