How can I save UTF-8 characters (Malayalam) to a MySQL database as HTML entities using PHP? I have tried several PHP functions but still can't get it to work, so it would be helpful if someone could point me in the right direction.
Here is what I've done:
Set the field collation to 'utf8_general_ci'.
Set the content-type to utf-8 in the page header.
Used the PHP functions htmlentities() and htmlspecialchars().
Create or change your table so that it uses a UTF-8 collation, and set names to UTF-8 on the connection: http://dev.mysql.com/doc/refman/5.0/en//charset-connection.html
Also use UTF-8 internally on your server and declare your website as UTF-8 with the appropriate tags.
In the end I found the solution myself :)
There is a PHP function that converts the characters to HTML entities:
mb_convert_encoding("$SPECIAL_CHAR",'HTML-ENTITIES', 'UTF-8');
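Note that the 'HTML-ENTITIES' pseudo-encoding is deprecated in mbstring as of PHP 8.2. A minimal sketch of the same idea using mb_encode_numericentity() is shown below; the helper name to_numeric_entities is made up for illustration, and it emits decimal numeric entities rather than named ones.

```php
<?php
// Hypothetical helper: turn every non-ASCII character into a decimal
// numeric entity. The conversion map is [first code point, last code point,
// offset, mask]; this one covers everything from U+0080 upwards.
function to_numeric_entities(string $text): string
{
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($text, $convmap, 'UTF-8');
}

// U+0D2E is MALAYALAM LETTER MA; ASCII input is left untouched.
echo to_numeric_entities("\u{0D2E}"), "\n"; // &#3374;
echo to_numeric_entities('abc'), "\n";      // abc
```

The entities can be turned back into characters with mb_decode_numericentity() and the same conversion map.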
I am working on localhost, Windows 10, Apache 2.4: Apache/2.4.51 (Win64) OpenSSL/1.1.1l PHP/8.0.11, and Database client version: libmysql - mysqlnd 8.0.11, which uses the server Server version: 10.4.21-MariaDB - mariadb.org binary distribution. It is by default set to utf8mb4: Server charset: UTF-8 Unicode (utf8mb4).
I made a php script that gets content(including html tags) from a Wikipedia page using loadHTMLFile. I then further use xpath->query to filter the dom and then the data is saved in mysql table as a string after being escaped by mysqli_real_escape_string. Later on, I query the database and save the content in a variable which is passed to loadHTML, then I remove a few dom elements and then pass the modified content to saveHTML and echo it to my webpage.
What happens is some characters are being displayed like:
--> Â
- --> –
€ --> €
ευρώ --> ευÏÏŽÂ
All the characters are displayed correctly when I use echo utf8_decode($output). Note that, used instead of utf8_decode, none of the following has any effect:
<meta charset="utf-8"> // in my html file
header('Content-Type: text/html; charset=utf-8'); // before the echo statement
mysqli_query($conn, "SET NAMES utf8"); // before mysql insert into and Select from statements
mysqli_set_charset($conn, "utf8"); // before mysql insert into and Select from statements
Also, both mb_detect_encoding($output) and mb_detect_encoding(utf8_decode($output)) return UTF-8, not utf8mb4. In my Chrome browser's network/headers tab, I always get Content-type as text/html; charset=UTF-8, regardless of whatever changes I make in my server-side PHP/MySQL settings.
My guess is that, the data in the Wikipedia page is in normal UTF-8 form, which is automatically converted by php into utf8mb4 when it's downloaded by loadHTMLFile. Now this data is saved in mysql tables in utf8mb4 format. This data when retrieved later on stays in utf8mb4 format and is seen to the browser in utf8mb4 format. When I use utf8_decode it must convert it to normal utf-8 format.
The problem with my guess is that the PHP docs page for utf8_decode mentions nothing about utf8mb4; rather, it says that multi-byte UTF-8 encoding is converted into single-byte ISO-8859-1. Secondly, the docs say that the ISO-8859-1 charset does not contain the euro sign. But my webpage successfully shows the euro sign after utf8_decode, and a browser is capable of parsing multi-byte UTF-8 characters as well, so if that was the only thing that utf8_decode does, then it should not make any difference with my code.
Edit:
I found the culprit. The following echos correct characters:
$stmt = $conn->prepare("Select ...");
...
$result = $stmt->execute();
...
$row = $stmt->get_result()->fetch_assoc();
echo $row['content']; // gives €ερυώ
Now, $row['content'] is the data directly from my database without any utf8_decode. But I happen to use PHP's DOMDocument afterwards, and the following happens:
libxml_use_internal_errors(true); // important
$content = new DOMDocument();
$content->loadHTML($row['content']);
echo $row['content'], $content->saveHTML($content); die();
// The output is: €ερυώ
//â¬ÎµÏÏÏ
The output from the above code in my view source is:
€ερυώ<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>â¬ÎµÏÏÏ</p></body></html>
So please explain what the heck loadHTML and saveHTML are doing here?
P.S.: My whole code is available in a GitHub repo: https://github.com/AnupamKhosla/crimeWiki and the specific scripts dealing with the Wikipedia page encoding are at https://github.com/AnupamKhosla/crimeWiki/blob/main/include/wikipedea_code.php https://github.com/AnupamKhosla/crimeWiki/blob/main/include/post_code.php
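For what it's worth, the view-source output above is consistent with DOMDocument::loadHTML() treating the input as ISO-8859-1 when the markup carries no charset declaration. A commonly used workaround is to prepend an encoding hint before parsing. The sketch below assumes the fragment is already valid UTF-8; load_utf8_html is a hypothetical helper name, not something from the repository.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment with DOMDocument.
// The XML-style prefix is only an encoding hint for libxml; without it the
// bytes of the fragment are interpreted as ISO-8859-1.
function load_utf8_html(string $html): DOMDocument
{
    libxml_use_internal_errors(true); // silence warnings about loose markup
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    return $dom;
}

$dom = load_utf8_html("<p>\u{20AC}</p>"); // U+20AC is the euro sign
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

With the hint in place the text nodes hold the original characters, so utf8_decode() should no longer be needed on the way out.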
The fact that utf8_decode() helps you is incidental. This function should not be used most of the time. If using it helps you, then it can only mean that somehow you have managed to mangle your data.
utf8mb4 is MySQL's charset that represents the full UTF-8 encoding. Therefore, if you are using UTF-8 everywhere in your code, you should never need to use utf8_decode() as it will only damage the data. ISO-8859-1 supports very few characters. It's not what you want.
What seems to have happened here is that you forgot to set $conn->set_charset('utf8mb4') when you opened the connection. Many MySQL servers default to Latin1 when you don't specify the charset, which means that even though your schema might be using utf8mb4 consistently, the connection to the database doesn't and converts the data into garbled up text.
The solution is simple, always set the right connection charset right after opening a new mysqli connection. $conn->set_charset('utf8mb4') will solve your problem and you don't need to use the ridiculous utf8_decode() function that accidentally solved your problem.
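A minimal sketch of that advice, assuming mysqli and placeholder credentials (the function name and its parameters are hypothetical):

```php
<?php
// Hypothetical helper: open a mysqli connection and immediately declare
// that the client talks utf8mb4. Host, user, password and database name
// are supplied by the caller; nothing here is specific to one application.
function open_utf8mb4_connection(string $host, string $user, string $pass, string $db): mysqli
{
    // Make mysqli throw exceptions instead of failing silently.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $conn = new mysqli($host, $user, $pass, $db);
    $conn->set_charset('utf8mb4'); // set the connection charset before any query
    return $conn;
}
```

Calling set_charset() right after connecting means every later INSERT and SELECT on that connection uses the same encoding.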
Using any encode/decode is a symptom of misconfiguration.
When you connect to mysql, you tell it what encoding is being used in the client.
When you declare the tables, you specify how to store things. CHARACTER SET utf8mb4 is often the best.
Please provide SELECT HEX(col), col ... for a sample. (You cannot trust what the browser displays; it tries to "fix" the encoding.) Once you have the hex, we can discuss how to repair the data. A common problem is "double-encoding", wherein the data has been misconverted twice.
As for your current samples, there are enough inconsistencies that I cannot deduce what went wrong:
-> That is represented as hex 80 by some word processors, not by HTML.
- --> this is a plain dash; it is never mangled. Perhaps you have an n-dash or m-dash?
€ --> mangles to "â‚¬" via "Mojibake" through latin1; did you omit the "SINGLE LOW-9 QUOTATION MARK" that looks like a comma?
ευρώ --> ευÏÏŽ via "Mojibake" through latin1;
More on Mojibake and other common manglings: Trouble with UTF-8 characters; what I see is not what I stored
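The mangling can be reproduced in PHP to see what the hex of healthy and double-encoded data looks like. This is only a sketch: mangle and unmangle are made-up names, and it uses Windows-1252 as a stand-in for MySQL's latin1, which is close but not identical for a few undefined bytes.

```php
<?php
// Hypothetical helpers. mangle() treats UTF-8 bytes as if they were
// Windows-1252 and re-encodes them as UTF-8, which is one round of
// Mojibake. unmangle() reverses exactly one such round.
function mangle(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
}

function unmangle(string $mojibake): string
{
    return mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
}

$euro = "\u{20AC}";                  // the euro sign
echo bin2hex($euro), "\n";           // e282ac (correct UTF-8)
echo bin2hex(mangle($euro)), "\n";   // c3a2e2809ac2ac (double-encoded)
```

If HEX(col) shows the longer pattern, the stored data itself is double-encoded; if it shows e282ac, the data is fine and the problem is on the display side.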
So I have programmed a crawler to scrape information and data from a website with charset utf8. But when I tried to store the contents in MySQL, some special characters (such as Spanish letters) did not show correctly in MySQL.
Here is what I have done:
Put header("Content-Type: text/html; charset=utf-8") in PHP
Set every charset/collation in MySQL to utf8_unicode_ci
Have $conn->query("SET NAMES 'utf8'") this upon connection
Double checked that the html I parsed was encoded in utf-8
So what are some potential problems here?
Maybe you coded your crawler using functions which are not supposed to manage multi-byte characters.
For example strlen instead of mb_strlen.
Try putting:
mb_internal_encoding("UTF-8");
as the first line of your PHP code, and then check whether you have to replace some functions with their respective mb_ versions.
Have a look at multibyte string reference
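To illustrate the difference, here is a small sketch with a Spanish word (the sample string is just an example):

```php
<?php
mb_internal_encoding('UTF-8');

$word = "ma\u{00F1}ana"; // "mañana"; the ñ takes two bytes in UTF-8

echo strlen($word), "\n";          // 7, counts bytes
echo mb_strlen($word), "\n";       // 6, counts characters
echo mb_substr($word, 0, 3), "\n"; // "mañ"

// substr() cuts in the middle of the ñ and leaves invalid UTF-8 behind.
var_dump(mb_check_encoding(substr($word, 0, 3), 'UTF-8')); // bool(false)
```

A string truncated like the last one is the kind of value that tends to be rejected or cut short when it is inserted into a utf8 column.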
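As a last resort you can use the iconv function just before inserting the string into MySQL. Note that iconv_get_encoding() returns iconv's configuration settings rather than detecting the encoding of a string, so you have to name the source encoding yourself.
Something like:
$utf8_string = iconv("ISO-8859-1", "UTF-8", $string); // replace ISO-8859-1 with the encoding the string is actually in
should do the trick.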
Start by checking if the data is stored wrong in the database, in which case the problem is with your crawler. Otherwise the problem is in your presentation.
To test this, I would suggest that you use a dedicated mysql client (Such as the command line client) to inspect data.
I remember pulling my hair out in dealing with UTF8 issues until I started adding this to my header:
setlocale(LC_ALL, 'en_US.UTF-8');
What is the best way to convert user input to UTF-8?
I have a simple form where a user will pass in HTML, the HTML can be in any language and it can be in any character encoding format.
My question is:
Is it possible to represent everything as UTF-8?
What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
I am trying to work out how to best implement this - advice and links appreciated.
I am making use of Codeigniter and its input class to retrieve post data.
A few points I should make:
I need to convert HTML special characters to their respective entities
It might be a good idea to accept encoding and return it in that same encoding. However, my web app is making use of :
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.
Specify accept-charset in your <form> tag to tell the browser to submit user-entered data encoded in UTF-8:
<form action="foo" accept-charset="UTF-8">...</form>
See here for a complete guide on HOW TO Use UTF-8 Throughout Your Web Stack.
Is it possible to represent everything as UTF-8?
Yes, UTF-8 is a Unicode encoding, so you can use any character defined in Unicode. That's the best you can do with a computer to date.
What can I use to effectively convert any character encoding to UTF-8
iconv lets you convert virtually any encoding to any other encoding. But, for that you have to know what encoding you're dealing with. You can't say "iconv, whatever this is, make it UTF-8!". That's unfortunately not how it works. You can only say "iconv, I have this string here in BIG5, please convert that to UTF-8.".
If you're only dealing with form data in UTF-8 though, you'll probably never need to convert anything.
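As a sketch of the "you have to know the source encoding" point, the following converts a string that is known to be ISO-8859-1 (the helper name and sample bytes are illustrative only):

```php
<?php
// Hypothetical helper: convert bytes that are known to be ISO-8859-1.
// iconv() cannot guess the source encoding; it has to be stated.
function latin1_to_utf8(string $bytes): string
{
    $out = iconv('ISO-8859-1', 'UTF-8', $bytes);
    if ($out === false) {
        throw new RuntimeException('Conversion failed');
    }
    return $out;
}

$latin1 = "caf\xE9"; // "café" with é as the single byte 0xE9

var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false): not valid UTF-8
echo bin2hex(latin1_to_utf8($latin1)), "\n";   // 636166c3a9
```

mb_check_encoding() can tell you that a string is not valid UTF-8, but not which encoding it is in instead; that still has to come from the form, the HTTP headers, or the document itself.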
so that I can parse it with PHP string functions
"PHP string functions" work on bytes. They don't care about characters or encodings. Depending on what you want to do, working with naive PHP string functions on UTF-8 text will give you bad results. Use encoding-aware string functions in the MB extension for any multi-byte encoding string manipulation.
save it to my database
Just make sure your database stores text in UTF-8 and you have set your database connection to UTF-8 (i.e. the database knows you're sending it UTF-8 data). You should be able to specify that in the CodeIgniter database connection settings.
subsequently echo out using htmlentities?
Just echo htmlentities($text), nothing more you need to do.
However, my web app is making use of : <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.
Not at all. It just signals to the browser that your page is encoded in UTF-8. Now you just need to make sure that's actually the case (as you're trying to do anyway). It also implies to the browser that it should send UTF-8 to the server. You can make that explicit with the accept-charset attribute on forms.
May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which might help you understand more.
1) Is it possible to represent everything as UTF-8?
Yes, everything defined in UNICODE. That's the most you can get nowadays, and there is room for the future that UNICODE can support.
2) What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
The only thing you need to know is the actual encoding of your data. If you want your web application to support UTF-8 for input and output, the frontend needs to signal that it supports UTF-8. See Character Encodings for a guide regarding your application's user interface.
Within PHP you need to feed any function with the encoding it supports. Some need to have the encoding specified, for some you need to convert it. Always check the function docs if it supports what you ask for. Additionally check your PHP configuration.
Related:
Preparing PHP application to use with UTF-8
How to detect malformed utf-8 string in PHP?
If you want to change the encoding of a string you can try
$utf8_string = mb_convert_encoding($yourBadString, 'UTF-8', 'ISO-8859-1'); // third argument: the encoding the string is actually in
I found out that the only thing that works out for UTF-8 encoding is setting inside my config.php
putenv('LC_ALL=en_US.utf8'); // or whatever language you need
setlocale(LC_ALL, 'en_US.utf8'); // or whatever language you need
bindtextdomain("mydomain", dirname(__FILE__) . "/../language");
textdomain("mydomain");
EDIT :
Is it possible to represent everything as UTF-8?
Yes, this is what you need to ensure:
html : headers/meta-header set to utf-8
all files saved as utf-8
database collation, tables and data encoding to utf-8
What can I use to effectively convert any character encoding to UTF-8
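You can use utf8_encode before saving it into your database, provided the input is ISO-8859-1 (on a system set up mainly for Western European languages, it will generally be ISO-8859-1 or a close relation). utf8_encode assumes ISO-8859-1 input, so it will mangle text that is already UTF-8.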
// eg
$name = utf8_encode($this->input->post('name'));
And as I mentioned before, you need to make sure the database collation, tables and data encoding are all UTF-8. In CI, in your database connection config:
// Make sure have these lines
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';
I have a table named "cust_details" which has a column "categories", where I have to store some categories like : blockadenlösung, affirmation, beziehungsprobleme lösen
But when I am trying to save this data into the database it is stored like :
blockadenlÃ¶sung, affirmation, beziehungsprobleme lösen
That is when umlauts are coming in the string it is not saved in its original form. I tried some charset for storing this characters. But I am still facing the problem.....
What may be the possible reasons...?
Thanks In Advance.....
The data you stored is encoded in UTF-8 (Ã¶ for an "ö" is typical for UTF-8), but it is being displayed as ISO-8859-1 or similar rather than as UTF-8.
Make sure that you use the same encoding everywhere:
Deliver your web pages with a Content-Type header that declares charset=utf-8
Use mysql_query("SET NAMES 'utf8'"); to set the encoding to utf-8
Make sure that the encoding of the database is UTF-8 (use HeidiSQL etc. to check)
Use this when you are inserting the characters:
N'characters here'
The N before the string declaration should enable you to enter it into the DB.
What is the type of the field?
You could specify database/table/field level character-sets. The default latin-1 works in most scenarios.
Otherwise, you would have to use plain text and store unicode strings like &#<4-digit-unicode-value>; into it. Then when you print it out, just dump the unicode into HTML and it will show up as such.
Here is a sample string in Pashto ترافيکي پيښو کې درې تنه مړه او څوارلس نور ټپيان شول. which we store directly into the table. The charset used is latin_charset_ci
Good Luck!
I cant seem to get these Chinese punctuation marks to work with my database (utf-8)
when i do an echo of the query the marks look like this
���
in php i have already done
$text=mysql_real_escape_string(htmlentities($text));
so as a result they are not saved into the database correctly what can i do to fix this?
Thanks
Executing mysql_query('SET NAMES utf8'); before any operations with Unicode will do the trick.
Try using the utf8_encode() function while inserting into the db and utf8_decode() while printing the same.
Add the character 'N' before your string value.
E.g. select * from test_table where temp=N'unicode string'
Besides, if you want to use htmlentities, you have to set it to UTF-8 encoding like this:
htmlentities($string,ENT_COMPAT,"UTF-8");
Don't put HTML-encoded data in the database. It should be raw text until the time you spit it onto the page (at which point you should use htmlspecialchars()).
You need to make sure that both your database and your page are using UTF-8:
ensure your tables are CREATEd with a UTF-8 collation;
use mysql_set_charset after connecting to ensure the connection between MySQL and PHP is UTF-8;
set the Content-Type of the page to text/html;charset=utf-8 by header or meta tag.
You can get away with using a different encoding such as the default latin-1 on the database end and the connection if you treat it as bytes, but case-insensitive comparisons won't work if you do, so it's best to stick to UTF-8.
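A short sketch of escaping at output time rather than before storage (the helper name e is just a common shorthand, not anything from the question):

```php
<?php
// Hypothetical helper: escape raw text at the moment it is written into
// HTML. The value stored in the database stays unescaped.
function e(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$fromDatabase = "<b>\"\u{00F6}\"</b>"; // raw text containing markup and an ö

echo '<p>', e($fromDatabase), '</p>', "\n";
// <p>&lt;b&gt;&quot;ö&quot;&lt;/b&gt;</p>

// htmlentities() additionally replaces characters that have named entities:
echo htmlentities("\u{00F6}", ENT_COMPAT, 'UTF-8'), "\n"; // &ouml;
```

With a UTF-8 page, htmlspecialchars() is enough: only the characters that are special in HTML are escaped, and everything else, including Chinese punctuation, passes through unchanged.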