How to make internal processing encoding change to UTF8 in PHP?


Currently in my application the UTF-8 encoded data is mangled by PHP's internal encoding.
How can I make it consistent with UTF-8?
EDIT: To show examples, please tell me how to output the current internal encoding in PHP.
In php.ini I found the following:
default_charset = "iso-8859-1"
Which means Latin1.
How do I change it to UTF-8? That is, what is the ISO name for UTF-8?

Change it to:
default_charset = "utf-8"
There is no ISO version of UTF-8.
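To see which charset PHP is currently using as its default, and to override it for a single request, something like the following sketch should do. It assumes you are allowed to change the setting at runtime; editing php.ini as shown above is still the permanent fix.

```php
<?php
// Show the default charset PHP picked up from its configuration.
echo 'before: ', ini_get('default_charset'), "\n";

// Override it for the current request only; php.ini itself is not modified.
ini_set('default_charset', 'UTF-8');
echo 'after:  ', ini_get('default_charset'), "\n";
```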
You'll need to be specific with the details since encoding can be mangled at many different areas in your PHP application.
The common problem areas are:
Saving and retrieving from DB:
The database encoding must be the same as that of the strings sent to it from PHP, or you must convert the strings to the DB encoding.
PHP4's single byte string functions:
PHP's functions such as strlen(), str_replace() do not produce the correct results on multibyte encodings such as UTF-8, since they operate on single bytes.
Page encoding:
Make sure the browser knows you are sending it UTF-8.
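To illustrate the point about single-byte string functions, here is a small sketch. It assumes the mbstring extension is loaded, and it spells the ü as explicit UTF-8 bytes so the result does not depend on how the script file itself is saved.

```php
<?php
// "München" written with explicit UTF-8 bytes (ü = C3 BC).
$city = "M\xC3\xBCnchen";

$byteCount = strlen($city);             // counts bytes
$charCount = mb_strlen($city, 'UTF-8'); // counts characters

echo "bytes: $byteCount, characters: $charCount\n"; // bytes: 8, characters: 7
```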

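You can also deal with character encoding in the PHP file itself. htmlentities() does not convert a string to another encoding, but its third argument tells it to treat the input as UTF-8 when escaping it for output:
$new_value = htmlentities($old_value, ENT_COMPAT, "UTF-8");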
and also you can add the following in the html head section
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I hope this will help to solve your problem.

Related

How does php utf8_decode deal with utf8mb4? [duplicate]

I am working on localhost windows10 apache 2.4: Apache/2.4.51 (Win64) OpenSSL/1.1.1l PHP/8.0.11 and Database client version: libmysql - mysqlnd 8.0.11, which uses the server Server version: 10.4.21-MariaDB - mariadb.org binary distribution. It is by default set to _utf8mb4: Server charset: UTF-8 Unicode (utf8mb4).
I made a php script that gets content(including html tags) from a Wikipedia page using loadHTMLFile. I then further use xpath->query to filter the dom and then the data is saved in mysql table as a string after being escaped by mysqli_real_escape_string. Later on, I query the database and save the content in a variable which is passed to loadHTML, then I remove a few dom elements and then pass the modified content to saveHTML and echo it to my webpage.
What happens is some characters are being displayed like:
--> Â
- --> â€“
€ --> â‚¬
ευρώ --> ÎµÏ…ÏÏŽÂ
All the characters are displayed correctly when I use echo utf8_decode($output). Note that none of the following has any effect when used instead of utf8_decode:
<meta charset="utf-8"> // in my html file
header('Content-Type: text/html; charset=utf-8'); // before the echo statement
mysqli_query($conn, "SET NAMES utf8"); // before mysql insert into and Select from statements
mysqli_set_charset($conn, "utf8"); // before mysql insert into and Select from statements
Also, both mb_detect_encoding($output) and mb_detect_encoding(utf8_decode($output)) return UTF-8, not utf8mb4. In my Chrome browser's network/headers tab, I always get Content-type as text/html; charset=UTF-8, regardless of whatever changes I make in my server-side PHP/MySQL settings.
My guess is that, the data in the Wikipedia page is in normal UTF-8 form, which is automatically converted by php into utf8mb4 when it's downloaded by loadHTMLFile. Now this data is saved in mysql tables in utf8mb4 format. This data when retrieved later on stays in utf8mb4 format and is seen to the browser in utf8mb4 format. When I use utf8_decode it must convert it to normal utf-8 format.
The problem with my guess is that the php docs about utf8_decode page, mention nothing of utf8mb4, rather it says, multi-byte UTF-8 ISO-8859-1 encoding is converted into single byte UTF-8 ISO-8859-1. Secondly the docs say, ISO-8859-1 charset does not contain the EURO sign. But my webpage successfully shows euro sign after utf8_decode and a browser is capable of parsing multibyte utf-8 characters as well, so if that was the only thing that utf8_decode does, then it should not make any difference with my code.
Edit:
I found the culprit. The following echos correct characters:
$stmt = $conn->prepare("Select ...");
...
$result = $stmt->execute();
...
$row = $stmt->get_result()->fetch_assoc();
echo $row['content']; // gives €ερυώ
Now, $row['content'] is the data directly from my database without any utf8_decode. But I happen to use PHP's DOMDocument afterwards and the following happens:
libxml_use_internal_errors(true); // important
$content = new DOMDocument();
$content->loadHTML($row['content']);
echo $row['content'], $content->saveHTML($content); die();
// The output is: €ερυώ
//â¬ÎµÏÏÏ
The output from the above code in my view source is:
€ερυώ<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>â¬ÎµÏÏÏ</p></body></html>
So please explain what loadHTML and saveHTML are doing here.
P.S.: My whole code is available in the GitHub repo https://github.com/AnupamKhosla/crimeWiki and the specific scripts dealing with the encoding of Wikipedia pages are at https://github.com/AnupamKhosla/crimeWiki/blob/main/include/wikipedea_code.php https://github.com/AnupamKhosla/crimeWiki/blob/main/include/post_code.php
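The mangled output in the edit above is consistent with loadHTML() not treating input without a declared charset as UTF-8. A widely used workaround, sketched here as an assumption about this setup rather than a confirmed diagnosis, is to prefix the fragment with an XML encoding declaration before parsing. load_utf8_html is a hypothetical helper name, and the example needs the dom extension.

```php
<?php
/**
 * Hypothetical helper: parse an HTML fragment that is known to be UTF-8.
 */
function load_utf8_html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly

    // Assumption: the leading declaration only serves as an encoding hint
    // for the parser; it is not part of the content you care about.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$fragment = "<p>\xE2\x82\xAC</p>"; // "<p>€</p>" as UTF-8 bytes
$doc  = load_utf8_html($fragment);
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

echo bin2hex($text), "\n"; // hex of the text as the DOM sees it
```

If the hint is honoured, the hex printed for the paragraph text is the same three bytes that went in (e282ac), so no utf8_decode() is needed afterwards.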

The fact that utf8_decode() helps you is incidental. This function should not be used most of the time. If using it helps you, then it can only mean that somehow you have managed to mangle your data.
utf8mb4 is MySQL's charset that represents the full UTF-8 encoding. Therefore, if you are using UTF-8 everywhere in your code, you should never need to use utf8_decode() as it will only damage the data. ISO-8859-1 supports very few characters. It's not what you want.
What seems to have happened here is that you forgot to set $conn->set_charset('utf8mb4') when you opened the connection. Many MySQL servers default to Latin1 when you don't specify the charset, which means that even though your schema might be using utf8mb4 consistently, the connection to the database doesn't and converts the data into garbled up text.
The solution is simple, always set the right connection charset right after opening a new mysqli connection. $conn->set_charset('utf8mb4') will solve your problem and you don't need to use the ridiculous utf8_decode() function that accidentally solved your problem.
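As a minimal sketch of that advice, the connection setup could be wrapped like this. open_utf8mb4_connection is a hypothetical helper, and the host, user, password and database name are placeholders you would supply yourself.

```php
<?php
/**
 * Hypothetical helper: open a mysqli connection that uses utf8mb4.
 * All four arguments are placeholders supplied by the caller.
 */
function open_utf8mb4_connection(string $host, string $user, string $pass, string $db): mysqli
{
    // Make mysqli throw exceptions instead of failing silently.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $conn = new mysqli($host, $user, $pass, $db);

    // Set the connection charset before running any query.
    $conn->set_charset('utf8mb4');

    return $conn;
}
```

Every query issued through the returned connection then exchanges text with the server as utf8mb4, for both inserts and selects.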

Using any encode/decode is a symptom of misconfiguration.
When you connect to mysql, you tell it what encoding is being used in the client.
When you declare the tables, you specify how to store things. CHARACTER SET utf8mb4 is often the best.
Please provide SELECT HEX(col), col ... for a sample. (You cannot trust what the browser displays; it tries to "fix" the encoding.) Once you have the hex, we can discuss how to repair the data. A common problem is "double-encoding", wherein the data has been misconverted twice.
As for your current samples, there are enough inconsistencies that I cannot deduce what went wrong:
-> That is represented as hex 80 by some word processors, not by HTML.
- --> this is a plain dash; it is never mangled. Perhaps you have an n-dash or m-dash?
€ --> mangles to "â‚¬" via "Mojibake" through latin1;
did you omit the "SINGLE LOW-9 QUOTATION MARK" that looks like a comma??
ευρώ --> ÎµÏ…ÏÏŽ via "Mojibake" through latin1;
More on Mojibake and other common manglings: Trouble with UTF-8 characters; what I see is not what I stored
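The "Mojibake through latin1" step for the euro sign can be reproduced in a few lines. This is only a sketch: it assumes the mbstring extension and uses Windows-1252 as the single-byte encoding the bytes get misread as.

```php
<?php
$euro = "\xE2\x82\xAC"; // "€" in UTF-8 (three bytes)

// Misread those bytes as Windows-1252 and re-encode the result as UTF-8.
$mojibake = mb_convert_encoding($euro, 'UTF-8', 'Windows-1252');

echo bin2hex($euro), "\n";     // e282ac
echo bin2hex($mojibake), "\n"; // c3a2e2809ac2ac

// The damage is reversible as long as it happened exactly once.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
var_dump($repaired === $euro); // bool(true)
```

The second hex string is what a double-encoded euro sign looks like, which is why comparing HEX(col) output is more reliable than looking at what the browser renders.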

$_POST will convert from utf-8 to Ã¤ Ã¶ Ã¼ etc

I am new here, so I apologize if I am doing anything wrong.
I have a form which submits user input onto another page. User is expected to type ä, ö, é, etc... I have placed all of the following in the document:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
header('Content-Type:text/html; charset=UTF-8');
<form action="whatever.php" accept-charset="UTF-8">
I even tried:
ini_set('default_charset', 'UTF-8');
When the other page loads, I need to check what the user input with something like:
if ( $_POST['field'] == $check ) {
...
}
But if he inputs something like 'München', PHP will compare 'MÃ¼nchen' with 'München' and will never trigger TRUE even though it should. Since it is specified UTF-8 everywhere, I am guessing that the server is converting to something else (Windows-1252 as I read on another thread) because it does not support or is not configured to UTF-8. I am using Apache on a local server before I load into production; I have not changed (and don't know how to) any of the default settings. I've been working on Windows 7, editing with Notepad++ and encoding my files in ANSI. If I bin2hex('München') I get '4dc3bc6e6368656e'.
If I echo $_POST['field']; it displays 'München' correctly.
I have researched everywhere for an explanation, all I find is that I should include those tags/headings I already have.
Any help is much appreciated.

You are facing many different problems at the same time; let's start with the simplest one.
Problem 1) You say that echo $_POST['field']; will display it correctly? What do you mean with "display"? It can be displayed correctly in two cases:
either the field is in UTF-8 and your page has been declared as UTF-8 and the browser is displaying it as UTF-8 or,
the field is in Latin-1 and the browser has decided (through the auto-detection heuristics) that your page is in Latin-1.
So, the fact that echo $_POST['field']; is correct tells you nothing.
Problem 2) You are using
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
header('Content-Type:text/html; charset=UTF-8');
Is this PHP code? If it is, it will be an error because the header must be set before sending out any byte. If you do this you will not set the Content-Type header and PHP should generate a warning.
Problem 3) You are using
<form action="whatever.php" accept-charset="UTF-8">
Some browsers (IE, mostly) ignore accept-charset if they can coerce the data to be sent in ASCII or ISO Latin-1. So the data will either be sent as UTF-8 but declared as ISO Latin-1, or be both declared and sent as ISO Latin-1 (but this second case is not your case).
Have a look at https://stackoverflow.com/a/8547004/449288 to see how to solve this problem.
Problem 4) Which strings are you comparing? For example, if you have
$city = "München";
var_dump($_POST['city'] == $city);
The result of this code will depend on the encoding of the PHP file. If the file is encoded in ISO Latin-1 and the $_POST correctly contains UTF-8 data, the == will compare different bytes and will return false.
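Here is a small sketch of Problem 4 that does not depend on how the script file is saved, because both spellings are written as explicit bytes. It assumes the mbstring extension; the variable names are only illustrative.

```php
<?php
$fromPost   = "M\xC3\xBCnchen"; // "München" as a browser sends it in UTF-8
$fromSource = "M\xFCnchen";     // the same word as stored in a Latin-1/ANSI file

var_dump($fromPost === $fromSource); // bool(false): different bytes

// Converting the Latin-1 literal to UTF-8 makes the comparison succeed.
$converted = mb_convert_encoding($fromSource, 'UTF-8', 'ISO-8859-1');
var_dump($fromPost === $converted);  // bool(true)

echo bin2hex($fromPost), "\n";       // 4dc3bc6e6368656e
```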

Another solution that may be helpful is in Apache: you can place a directive called AddDefaultCharset in your configuration file (httpd.conf) or in .htaccess. It looks like this:
AddDefaultCharset utf-8
http://httpd.apache.org/docs/2.0/mod/core.html#adddefaultcharset
That will override any other default charsets.

I changed "mbstring.detect_order = pass" in my php.ini file and it worked

I've used Unicode characters in my forms and file many times. I had not any problem up to now.
Try to do these steps and check the result:
Remove header('Content-Type:text/html; charset=UTF-8'); from your HTML form codes.
Use your form just like <form action="whatever.php"> without accept-charset="UTF-8". (It's better to insert the method of sending data in your form tag).
In target page (whatever.php), insert again <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> in a <head> tag.
I always did my project like what I mentioned here and I did not have any problem with Unicode strings.

This is due to the character encoding of the PHP file(s).
The hardcoded München is stored with the character encoding of the source file(s), in this case ANSI and when that value is compared to the UTF-8 encoded value provided in the $_POST variable, the two will, quite naturally, differ.
The solution to your problem is one of:
Serve and process content with the same encoding as that of the source file(s), in this case likely to be windows-1252.
This would, for starters, include changing the content="text/html; charset=UTF-8" to content="text/html; charset=windows-1252" whenever serving HTML data.
Avoid all hardcoded values that could be affected by character encoding issues between UTF-8 and windows-1252, more or less only hardcode values that only includes English letters and numbers.
Any UTF-8 values would have to be read from a source that ensures they are UTF-8 encoded (for instance a database set to use UTF-8 as storage encoding as well as connection encoding).
Wrap all hardcoded assignments in utf8_encode(), for instance $value = utf8_encode ('München');
Change the encoding of the source file(s) to UTF-8.
This can be accomplished in any number of ways, a decent text editor will be able to do it or the outstanding libiconv can be used, especially for batch processing.
Either solution 1 or 4 would be my preferred solution, especially if multiple people are involved in the project.
As a side note, some text editors (notably Notepad++) have the option of using either UTF-8 or UTF-8 without BOM. The BOM (Byte Order Mark) is pointless in UTF-8 and will cause problems when writing headers in PHP (most often when doing a redirect). This is because the BOM sits right in front of the initial <?php, causing the server to send the BOM just as it would if there had been any other character in front. The difference is that you would notice a visible character in front, but the BOM isn't displayed.
Rule of thumb: Always use UTF-8 without BOM.
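If you want to check files for a BOM from PHP rather than in the editor, a sketch like the following could work. Both function names are hypothetical helpers; the only fixed fact is that a UTF-8 BOM is the three bytes EF BB BF at the very start of the data.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

/** Hypothetical helper: does this byte string start with a UTF-8 BOM? */
function has_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

/** Hypothetical helper: return the string without a leading BOM. */
function strip_utf8_bom(string $bytes): string
{
    return has_utf8_bom($bytes) ? substr($bytes, 3) : $bytes;
}

var_dump(has_utf8_bom(UTF8_BOM . "<html>")); // bool(true)
var_dump(has_utf8_bom("<html>"));            // bool(false)
echo strip_utf8_bom(UTF8_BOM . "<html>"), "\n"; // <html>
```

In practice you would pass in the result of file_get_contents() for each source file and rewrite the files that report a BOM.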

How to handle character encoding in PHP - Codeigniter?

What is the best way to convert user input to UTF-8?
I have a simple form where a user will pass in HTML, the HTML can be in any language and it can be in any character encoding format.
My question is:
Is it possible to represent everything as UTF-8?
What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
I am trying to work out how to best implement this - advice and links appreciated.
I am making use of Codeigniter and its input class to retrieve post data.
A few points I should make:
I need to convert HTML special characters to their respective entities
It might be a good idea to accept encoding and return it in that same encoding. However, my web app is making use of :
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.

Specify accept-charset in your <form> tag to tell the browser to submit user-entered data encoded in UTF-8:
<form action="foo" accept-charset="UTF-8">...</form>
See here for a complete guide on HOW TO Use UTF-8 Throughout Your Web Stack.

Is it possible to represent everything as UTF-8?
Yes, UTF-8 is a Unicode encoding, so you can use any character defined in Unicode. That's the best you can do with a computer to date.
What can I use to effectively convert any character encoding to UTF-8
iconv lets you convert virtually any encoding to any other encoding. But, for that you have to know what encoding you're dealing with. You can't say "iconv, whatever this is, make it UTF-8!". That's unfortunately not how it works. You can only say "iconv, I have this string here in BIG5, please convert that to UTF-8.".
If you're only dealing with form data in UTF-8 though, you'll probably never need to convert anything.
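As a rough sketch of what "telling iconv the source encoding" looks like: to_utf8 is a hypothetical wrapper, and the ISO-8859-1 input is only an example of an encoding you already know. The same call works for BIG5 or anything else your iconv build supports.

```php
<?php
/**
 * Hypothetical helper: convert $bytes from a *known* encoding to UTF-8.
 * The caller must supply the source encoding; nothing is guessed.
 */
function to_utf8(string $bytes, string $sourceEncoding): string
{
    $converted = iconv($sourceEncoding, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new InvalidArgumentException("Input is not valid $sourceEncoding");
    }
    return $converted;
}

// "München" as ISO-8859-1 bytes, converted to UTF-8.
echo bin2hex(to_utf8("M\xFCnchen", 'ISO-8859-1')), "\n"; // 4dc3bc6e6368656e
```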
so that I can parse it with PHP string functions
"PHP string functions" work on bytes. They don't care about characters or encodings. Depending on what you want to do, working with naive PHP string functions on UTF-8 text will give you bad results. Use encoding-aware string functions in the MB extension for any multi-byte encoding string manipulation.
save it to my database
Just make sure your database stores text in UTF-8 and you have set your database connection to UTF-8 (i.e. the database knows you're sending it UTF-8 data). You should be able to specify that in the CodeIgniter database connection settings.
subsequently echo out using htmlentities?
Just echo htmlentities($text), nothing more you need to do.
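If you would rather not depend on the configured default charset, you can pass it explicitly. This sketch shows the difference between htmlspecialchars() and htmlentities() on the same UTF-8 input; the sample string is made up for illustration.

```php
<?php
$text = "<b>M\xC3\xBCnchen</b>"; // UTF-8 input containing markup

// Escapes only the characters that are special in HTML.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Additionally replaces characters that have a named entity, such as ü.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // &lt;b&gt;München&lt;/b&gt;
echo $entities, "\n"; // &lt;b&gt;M&uuml;nchen&lt;/b&gt;
```

On a page that is served as UTF-8, the htmlspecialchars() form is usually enough, since the ü can be sent as is.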
However, my web app is making use of : <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.
Not at all. It just signals to the browser that your page is encoded in UTF-8. Now you just need to make sure that's actually the case (as you're trying to do anyway). It also implies to the browser that it should send UTF-8 to the server. You can make that explicit with the accept-charset attribute on forms.
May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which might help you understand more.

1) Is it possible to represent everything as UTF-8?
Yes, everything defined in UNICODE. That's the most you can get nowadays, and there is room for the future that UNICODE can support.
2) What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
The only thing you need to know is the actual encoding of your data. If you want your webapplication to support UTF-8 for input and output, the frontend needs to signal that it supports UTF-8. See Character Encodings for a guide regarding your applications user-interface.
Within PHP you need to feed any function with the encoding it supports. Some need to have the encoding specified, for some you need to convert it. Always check the function docs if it supports what you ask for. Additionally check your PHP configuration.
Related:
Preparing PHP application to use with UTF-8
How to detect malformed utf-8 string in PHP?

If you want to change the encoding of a string you can try
$utf8_string = mb_convert_encoding( $yourBadString , 'UTF-8' );

I found out that the only thing that works out for UTF-8 encoding is setting inside my config.php
putenv('LC_ALL=en_US.utf8'); // or whatever language you need
setlocale(LC_ALL, 'en_US.utf8'); // or whatever language you need
bindtextdomain("mydomain", dirname(__FILE__) . "/../language");
textdomain("mydomain");

EDIT :
Is it possible to represent everything as UTF-8?
Yes, this is what you need to ensure:
html : headers/meta-header set to utf-8
all files saved as utf-8
database collation, tables and data encoding to utf-8
What can I use to effectively convert any character encoding to UTF-8
You can use utf8_encode() before saving it into your database, since on a system set up mainly for Western European languages the input will generally be ISO-8859-1 or a close relative of it.
// eg
$name = utf8_encode($this->input->post('name'));
And as I mentioned before, you need to make sure the database collation, tables and data encoding are set to UTF-8. In CI, this is done in your database connection config:
// Make sure have these lines
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';

Browser displays � instead of ´

I have a PHP file which has the following text:
<div class="small_italic">This is what you´ll use</div>
On one server, it appears as:
This is what you´ll use
And on another, as:
This is what you�ll use
Why would there be a difference and what can I do to make it appear properly (as an apostrophe)?
Note to all (for future reference)
I implemented Gordon's / Gumbo's suggestion, except I implemented it on a server level rather than the application level. Note that (a) I had to restart the Apache server and more importantly, (b) I had to replace the existing "bad data" with the corrected data in the right encoding.
/etc/php.ini
default_charset = "iso-8859-1"

You have to make sure the content is served with the proper character set:
Either send the content with a header that includes
<?php header("Content-Type: text/html; charset=[your charset]"); ?>
or - if the HTTP charset headers don't exist - insert a <META> element into the <head>:
<meta http-equiv="Content-Type" content="text/html; charset=[your charset]" />
Like the attribute name suggests, http-equiv is the equivalent of an HTTP response header and user agents should use them in case the corresponding HTTP headers are not set.
Like Hannes already suggested in the comments to the question, you can look at the headers returned by your webserver to see which encoding it serves. There is likely a discrepancy between the two servers. So change the [your charset] part above to that of the "working" server.
For a more elaborate explanation about the why, see Gumbo's answer.

The display of the REPLACEMENT CHARACTER � (U+FFFD) most likely means that you’re specifying your output to be Unicode but your data isn’t.
In this case, if the ACUTE ACCENT ´ is for example encoded using ISO 8859-1, it’s encoded with the byte sequence 0xB4 as that’s the code point of that character in ISO 8859-1. But that byte sequence is illegal in a Unicode encoding like UTF-8. In that case the replacement character U+FFFD is shown.
So to fix this, make sure that you’re specifying the character encoding properly according to your actual one (or vice versa).
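The byte-level claim is easy to verify. A minimal sketch, assuming the mbstring extension:

```php
<?php
$latin1Accent = "\xB4"; // ACUTE ACCENT as a single ISO 8859-1 byte

// On its own, that byte is not a valid UTF-8 sequence.
var_dump(mb_check_encoding($latin1Accent, 'UTF-8')); // bool(false)

// Converted properly, the same character takes two bytes in UTF-8.
$utf8Accent = mb_convert_encoding($latin1Accent, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8Accent), "\n";                     // c2b4
var_dump(mb_check_encoding($utf8Accent, 'UTF-8'));   // bool(true)
```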

To sum it maybe up a little bit:
Make sure the FILE saved on the web server has the right encoding
Make sure the web server also delivers it with the right encoding
Make sure the HTML meta tags is set to the right encoding
Make sure to use "standard" special chars, i.e. use the ' instead of ´ if you want to write something like "Luke Skywalker's code"
For encoding, UTF-8 might be good for you.

The simple solution is to use an HTML character reference for special characters.
The numeric reference for the typographic apostrophe ’ (U+2019) is &#8217;. Try putting this value in your HTML, and it should work properly for you.

Set your browser's character set to a defined value:
For example,
<meta http-equiv="content-type" content="text/html; charset=utf-8" />

This is probably being caused by the data you're inserting into the page with PHP being in a different character encoding from the page itself (the most common iteration is one being Latin 1 and the other UTF-8).
Check the encoding being used for the page, and for your database. Chances are there will be a mismatch.

Create an .htaccess file in the root directory:
AddDefaultCharset utf-8
AddCharset utf-8 *
<IfModule mod_charset.c>
CharsetSourceEnc utf-8
CharsetDefault utf-8
</IfModule>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Switch website encoding from ISO-8859-1 to UTF-8

I am trying to convert my existing PHP webpage to use UTF-8 encoding.
To do so, I have done the following things:
specified UTF-8 as the charset in the meta content tag at the start of my webpage.
change the default_charset to UTF-8 in the php.ini.
specified UTF-8 as the iconv encoding in the php.ini file.
specified UTF-8 in my .htaccess file using: AddDefaultCharset UTF-8.
Yet after all that, when i echo mb_internal_encoding(), it shows as ISO-8859-1. What am I missing here? I know I could use auto_prepend to attach a script that changes the default encoding to UTF-8, but I'm just trying to understand what I'm missing.
Thanks

mb_internal_encoding() doesn't affect the output of your scripts per se; it affects the default encoding when using the multibyte string functions and the conversion of POST and GET inputs.
Simply set with
mbstring.internal_encoding='UTF-8'
in your php.ini file, or programmatically with:
mb_internal_encoding('UTF-8');
Speaking of the mb_ functions, you'll need to rewrite your scripts to use these, e.g. mb_strlen() instead of strlen(), etc.
Also check what HTTP content-type headers are being outputted, though from what you've done it should be ok.
If you're using a database, you'll also have to convert that too, and specify that you're using UTF-8 when connecting to it.
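As a short sketch of the programmatic route, assuming the mbstring extension is loaded: once the internal encoding is set, the mb_ functions use it whenever you leave out their encoding argument.

```php
<?php
mb_internal_encoding('UTF-8');
echo mb_internal_encoding(), "\n"; // UTF-8

$city = "M\xC3\xBCnchen";          // "München" as UTF-8 bytes

echo mb_strlen($city), "\n";       // 7
echo mb_strtoupper($city), "\n";   // MÜNCHEN
```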

The documentation states that you can SET that variable using
/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");
which should get rid of your problem :)
