Unicode character is shown as encoded ascii at client side - php

I am trying to show an emoji using its Unicode value (😀), but I am getting the escaped string \u00f0\u0178\u02dc\u20ac, which decodes to 😀.
I am using a MySQL server and PHP 5.4 in my project. In MySQL, the value is stored as 😀. Is there any way to unescape this and return the actual Unicode character from the PHP server?
I tried
iconv('ASCII//TRANSLIT', 'UTF-8', '😀');, mb_convert_encoding($var, "US-ASCII", "UTF-8") and utf8_encode(), but none of them worked.
Thanks

Without knowing the structure of your database I can only guess, but make sure the table uses utf8mb4 as its character set: MySQL's utf8 stores at most three bytes per character and cannot hold emoji. Beyond that, I think the problem may just be on the display side. Try starting your PHP script by sending a header that tells the browser you are going to send UTF-8 rather than a Western encoding (ISO-8859-1).
header('Content-Type: text/html; charset=UTF-8');
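The escaped string in the question is what you would expect if the emoji's four UTF-8 bytes had been read as Windows-1252 somewhere along the way and then encoded as UTF-8 a second time. Here is a minimal sketch of that diagnosis, assuming the mbstring extension is available. The sample value and variable names are illustrative, not taken from the asker's code.

```php
<?php
// Illustrative value: the UTF-8 bytes of the four characters that the
// question reports as \u00f0\u0178\u02dc\u20ac.
$stored = "\xC3\xB0\xC5\xB8\xCB\x9C\xE2\x82\xAC";

// json_encode() escapes non-ASCII characters by default.
$escaped = json_encode($stored);

// Undo one layer of encoding: map each character back to its Windows-1252 byte.
$repaired = mb_convert_encoding($stored, 'Windows-1252', 'UTF-8');

// The four resulting bytes are the UTF-8 encoding of U+1F600.
$hex = bin2hex($repaired);

// JSON_UNESCAPED_UNICODE (PHP 5.4+) emits the character itself instead of \u escapes.
$json = json_encode($repaired, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n", $hex, "\n", $json, "\n";
```

Reversing the conversion like this only repairs rows that are already damaged. The lasting fix is still to use utf8mb4 for both the table and the connection, so that no second encoding step happens.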

Related

How does php utf8_decode deal with utf8mb4? [duplicate]

This question already has answers here:
PHP DOMDocument loadHTML not encoding UTF-8 correctly
(11 answers)
Closed 1 year ago.
I am working on localhost (Windows 10) with Apache/2.4.51 (Win64) OpenSSL/1.1.1l PHP/8.0.11 and the database client libmysql - mysqlnd 8.0.11, which talks to server version 10.4.21-MariaDB - mariadb.org binary distribution. The server is set to utf8mb4 by default: Server charset: UTF-8 Unicode (utf8mb4).
I made a PHP script that gets content (including HTML tags) from a Wikipedia page using loadHTMLFile. I then use xpath->query to filter the DOM, and the data is saved in a MySQL table as a string after being escaped by mysqli_real_escape_string. Later on, I query the database and save the content in a variable which is passed to loadHTML; I then remove a few DOM elements, pass the modified content to saveHTML and echo it to my webpage.
What happens is some characters are being displayed like:
--> Â
- --> –
€ --> €
ευρώ --> ευÏÏŽÂ
All the characters are displayed correctly when I use echo utf8_decode($output). Note that, in place of utf8_decode, none of the following has any effect:
<meta charset="utf-8"> // in my html file
header('Content-Type: text/html; charset=utf-8'); // before the echo statement
mysqli_query($conn, "SET NAMES utf8"); // before mysql insert into and Select from statements
mysqli_set_charset($conn, "utf8"); // before mysql insert into and Select from statements
Also, both mb_detect_encoding($output) and mb_detect_encoding(utf8_decode($output)) return UTF-8, not utf8mb4. In my Chrome browser's network/headers tab, I always get Content-Type as text/html; charset=UTF-8, regardless of whatever changes I make to my server-side PHP/MySQL settings.
My guess is that the data in the Wikipedia page is in normal UTF-8 form, which is automatically converted by PHP into utf8mb4 when it's downloaded by loadHTMLFile. This data is then saved in the MySQL tables in utf8mb4 format, stays in utf8mb4 format when retrieved later on, and is sent to the browser in that format. When I use utf8_decode, it must convert it back to normal UTF-8.
The problem with my guess is that the PHP docs page about utf8_decode mentions nothing of utf8mb4; rather, it says that a string of ISO-8859-1 characters encoded as multi-byte UTF-8 is converted to single-byte ISO-8859-1. Secondly, the docs say that the ISO-8859-1 charset does not contain the euro sign. But my webpage successfully shows the euro sign after utf8_decode, and a browser is capable of parsing multi-byte UTF-8 characters as well, so if that was the only thing utf8_decode does, it should not make any difference to my code.
Edit:
I found the culprit. The following echos correct characters:
$stmt = $conn->prepare("Select ...");
...
$result = $stmt->execute();
...
$row = $stmt->get_result()->fetch_assoc();
echo $row['content']; // gives €ερυώ
Now, $row['content'] is the data directly from my database without any utf8_decode. But I happen to use PHP's DOMDocument afterwards, and the following happens:
libxml_use_internal_errors(true); // important
$content = new DOMDocument();
$content->loadHTML($row['content']);
echo $row['content'], $content->saveHTML($content); die();
// The output is: €ερυώ
//â¬ÎµÏÏÏ
The output from the above code in my view source is:
€ερυώ<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>â¬ÎµÏÏÏ</p></body></html>
So please explain what the heck loadHTML and saveHTML are doing here.
P.S.: My whole code is available in the GitHub repo https://github.com/AnupamKhosla/crimeWiki, and the specific scripts dealing with the encoding of Wikipedia pages are at https://github.com/AnupamKhosla/crimeWiki/blob/main/include/wikipedea_code.php https://github.com/AnupamKhosla/crimeWiki/blob/main/include/post_code.php
The fact that utf8_decode() helps you is incidental. This function should not be used most of the time. If using it helps you, then it can only mean that somehow you have managed to mangle your data.
utf8mb4 is MySQL's charset that represents the full UTF-8 encoding. Therefore, if you are using UTF-8 everywhere in your code, you should never need to use utf8_decode() as it will only damage the data. ISO-8859-1 supports very few characters. It's not what you want.
What seems to have happened here is that you forgot to set $conn->set_charset('utf8mb4') when you opened the connection. Many MySQL servers default to Latin1 when you don't specify the charset, which means that even though your schema might be using utf8mb4 consistently, the connection to the database doesn't and converts the data into garbled up text.
The solution is simple: always set the right connection charset right after opening a new mysqli connection. $conn->set_charset('utf8mb4') will solve your problem, and you don't need the ridiculous utf8_decode() function that accidentally solved it.
Using any encode/decode is a symptom of misconfiguration.
When you connect to mysql, you tell it what encoding is being used in the client.
When you declare the tables, you specify how to store things. CHARACTER SET utf8mb4 is often the best.
Please provide SELECT HEX(col), col ... for a sample. (You cannot trust what the browser displays; it tries to "fix" the encoding.) Once you have the hex, we can discuss how to repair the data. A common problem is "double-encoding", wherein the data has been misconverted twice.
As for your current samples, there are enough inconsistencies that I cannot deduce what went wrong:
-> That is represented as hex 80 by some word processors, not by HTML.
- --> this is a plain dash; it is never mangled. Perhaps you have an n-dash or m-dash?
€ --> mangles to "€" via "Mojibake" through latin1;
did you omit the "SINGLE LOW-9 QUOTATION MARK" that looks like a comma??
ευρώ --> ευÏÏŽ via "Mojibake" through latin1;
More on Mojibake and other common manglings: Trouble with UTF-8 characters; what I see is not what I stored
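The edit to the question points at DOMDocument rather than at the database connection. loadHTML() has no charset information for a bare fragment, and the output shown in the question is consistent with it treating the bytes as a single-byte Western encoding. A commonly used workaround is to declare the charset in the markup that is handed to the parser. The sketch below assumes the stored content is valid UTF-8 and that the dom extension is loaded. The helper name is hypothetical, and behaviour may differ between libxml2 versions.

```php
<?php
// Hypothetical helper (not from the question's repository): parse an HTML
// fragment that is known to be UTF-8.
function load_utf8_html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    // Declaring the charset up front keeps the parser from guessing a
    // single-byte encoding for the bytes that follow.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$doc = load_utf8_html('<p>€ευρώ</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

The prepended meta element ends up in the parsed document's head, so it may need to be removed again before calling saveHTML() if the output should contain only the original markup.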

Saving UTF-8 characters in MySQL db

How can I save UTF-8 character(Malayalam language) to the MySQL database as HTML entity using PHP. I have tried some of the php functions to do the same still I am not able to make it. So it will be helpful if someone point me in the right direction.
Here is what I've done:
Set the field collation to 'utf8_general_ci'.
Set the content-type to utf-8 in the page header.
Used the PHP functions htmlentities() and htmlspecialchars().
Create or change your table so that it uses a utf8 collation, and set the connection charset with SET NAMES utf8: http://dev.mysql.com/doc/refman/5.0/en//charset-connection.html
Also use UTF-8 internally on your server, and declare your website as UTF-8 with the appropriate tags.
At last I got the solution myself :)
There is a php function to convert the characters to HTML entity.
mb_convert_encoding("$SPECIAL_CHAR",'HTML-ENTITIES', 'UTF-8');
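The 'HTML-ENTITIES' pseudo-encoding used above has been deprecated since PHP 8.2. A roughly equivalent sketch with mb_encode_numericentity() is shown below. It assumes mbstring is available and that decimal numeric references (rather than named entities) are acceptable. The helper name is made up for illustration.

```php
<?php
// Hypothetical helper: turn every non-ASCII character into a decimal
// numeric character reference such as &#3374;.
function to_numeric_entities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

$letter = "\xE0\xB4\xAE"; // UTF-8 bytes of U+0D2E, MALAYALAM LETTER MA
echo to_numeric_entities($letter . ' ok'), "\n";
```

ASCII text passes through unchanged, so the result can be stored in a column of any charset. If the table and connection are already UTF-8, storing the raw characters is usually simpler.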

Removing unicode bullet character

I'm having an issue that i believe is related to unicode text. When the user enters a string that has the unicode bullet character, mysql is not able to save that field (the rest of the update query works though). Here's how i've been trying to deal with it.
$str = "· Close up the server";
$str = preg_replace("\u2022", "•", $str);
...however this is still not working.
So many things can go wrong here, because database, form submits and source code string literals are all involved. I'll assume you want to use UTF-8, because with any other typical encoding (CP1252, Latin1) you'll be screwed when you want to use json_ or accept more than ~200 different characters.
The first thing to do is remove any conversion code that was written with the intention of fixing encoding issues, such as utf8_encode, htmlentities, *_replace, whatever.
Source encoding.
$str = "· Close up the server";
When writing the above, the PHP source file needs to be physically encoded in UTF-8. If you are on Windows, you must explicitly do or configure this. UTF-8 doesn't happen magically on Windows.
Form submits
When user submits a form, the payload will be in whatever encoding you declared the page to be. You can declare it like so:
header("Content-Type: text/html; charset=utf-8");
But anyone can actually submit arbitrary bytes to your server, so you should validate the input is in UTF-8 before proceeding. mb_check_encoding is good.
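A minimal sketch of that validation step, assuming mbstring is available. The helper name and the sample strings are illustrative only.

```php
<?php
// Hypothetical helper: accept a submitted field only if it is valid UTF-8.
function utf8_or_null($value)
{
    if (!is_string($value) || !mb_check_encoding($value, 'UTF-8')) {
        return null; // the caller decides how to reject the request
    }
    return $value;
}

$good = utf8_or_null("\xE2\x80\xA2 Close up the server"); // bullet as UTF-8
$bad  = utf8_or_null("\x95 Close up the server");         // bullet as Windows-1252
var_dump($good !== null, $bad === null);
```

In a real form handler the value would come from $_POST. Rejecting invalid input early means everything further down can rely on the string being UTF-8.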
Database
Since at this point your data is coming in as UTF-8, your input strings are in UTF-8. You must specify this after connecting to the database, by specifying a connection encoding.
$mysqli->set_charset("utf8"); // after making the connection, and before any queries
// or, with the legacy mysql extension (removed in PHP 7.0): mysql_set_charset("utf8");
This makes the database read your input in UTF-8, and encode its output in UTF-8. You would also want to set your columns/tables/databases to UTF-8 as well.
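Unicode escape sequences of the form \uxxxx, \uhhhh\ullll or \Uxxxxxxxx are not supported in PHP string literals; only PHP 7.0 and later accept the braced form "\u{2022}".
\u2022 names the code point U+2022 ("Bullet"), which is also its UTF-16 code unit. It is not the UTF-8 encoding, which is the three bytes E2 80 A2.
You might also want to run SET NAMES 'utf8'; or otherwise set the connection charset right after you connect to your database.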
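The preg_replace call in the question cannot work as written: the pattern has no delimiters, and "\u2022" is not an escape in a double-quoted string on older PHP versions. If stripping the bullet is still wanted once the encoding is consistent, one way to do it is sketched below, assuming the input is UTF-8. The helper name is hypothetical.

```php
<?php
// Hypothetical helper: remove U+2022 BULLET and any whitespace right after it.
function strip_bullets(string $utf8): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8, so
    // \x{2022} matches the three-byte sequence E2 80 A2.
    $result = preg_replace('/\x{2022}\s*/u', '', $utf8);
    // preg_replace() returns null on failure, e.g. when the subject is not valid UTF-8.
    return $result === null ? $utf8 : $result;
}

echo strip_bullets("\xE2\x80\xA2 Close up the server"), "\n";
```

With a UTF-8 connection charset the bullet can normally be stored as it is, so removing it is a choice rather than a requirement.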

UTF8 encoded strings not shown correctly in MySQL

So I have programmed a crawler to scrape information and data from a website whose charset is utf8. But when I tried to store the contents in MySQL, some special characters (such as Spanish letters) did not show correctly in MySQL.
Here is what I have done:
Put header("Content-Type: text/html; charset=utf-8") in PHP
Set all charsets and collations in MySQL to utf8_unicode_ci
Run $conn->query("SET NAMES 'utf8'") upon connection
Double checked that the html I parsed was encoded in utf-8
So what are some potential problems here?
Maybe you coded your crawler using functions which are not supposed to manage multi-byte characters.
For example strlen instead of mb_strlen.
Try putting:
mb_internal_encoding("UTF-8");
as the first line of your PHP code, and then check whether you have to switch some functions to their respective mb_ versions.
Have a look at multibyte string reference
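To illustrate the difference, here is a small sketch that compares the byte-oriented functions with their mb_ counterparts on a Spanish word. It assumes mbstring is available, and the sample word is arbitrary.

```php
<?php
$word = "ni\xC3\xB1o"; // "niño" written as UTF-8 bytes: the ñ takes two bytes

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters

$cutBytes = substr($word, 0, 3);             // stops in the middle of the ñ
$cutChars = mb_substr($word, 0, 3, 'UTF-8'); // keeps the ñ whole

var_dump($bytes, $chars, mb_check_encoding($cutBytes, 'UTF-8'), $cutChars);
```

A string cut in the middle of a character is no longer valid UTF-8, which is one way a crawler can end up storing broken text even when the source page and the database are both configured correctly.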
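As a last resort, you can convert the string with iconv just before inserting it into MySQL. Note that iconv_get_encoding() reports iconv's own settings, not the encoding of a string, so the source encoding has to be known or guessed, for example with mb_detect_encoding:
$utf8_string = iconv(mb_detect_encoding($string, array("UTF-8", "ISO-8859-1"), true), "UTF-8", $string);
That should do the trick, provided the guess is right.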
Start by checking if the data is stored wrong in the database, in which case the problem is with your crawler. Otherwise the problem is in your presentation.
To test this, I would suggest that you use a dedicated mysql client (Such as the command line client) to inspect data.
I remember pulling my hair out in dealing with UTF8 issues until I started adding this to my header:
setlocale(LC_ALL, 'en_US.UTF-8');

PHP urlencode for chinese characters

I'm creating a php application that involves sending chinese characters as url parameters.
I have to send query like :
http://xyz.com/?q=新
But the script at xyz.com won't automatically encode the Chinese character, so I need to explicitly send an encoded string as the parameter. It becomes:
http://xyz.com/?q=%E6%96%B0
The problem is, PHP won't encode the chinese character properly.
I've tried urlencode() and rawurlencode(). But they give %D0%C2 (doesn't work for my purpose) instead of %E6%96%B0 (works well with xyz.com) as the output.
I'm using this website to create the latter encoded string.
I've also defined header('Content-Type: text/html; charset=gb2312'); to display chinese characters properly.
Is there anything I can do to urlencode the chinese character properly?
Thanks!
PS: I'm a relatively new programmer and don't understand chinese.
You're URLencoding using the charset you specify in your header. %D0%C2 is 新 in gb2312; %E6%96%B0 is 新 in UTF-8. Switch your charset over to UTF-8 and you should fix this issue and still be able to display Simplified Chinese Han.
In order to reproduce your problem I created a simple PHP file:
<?php
var_dump(urlencode('新'));
?>
First I used UTF8 encoding and got %E6%96%B0. Afterwards I changed to GB2312 and got %D0%C2.
At http://meyerweb.com/eric/tools/dencoder/ they seem to use JavaScript, that's UTF8 capable and therefore returns %E6%96%B0, too.
PS: When changing from GB2312 to UTF8, some editors might break internationalized code. So please make sure to keep a copy of your file before converting!
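If the script itself has to stay in GB2312, another option is to convert only the parameter to UTF-8 before encoding it. The sketch below assumes mbstring is available and uses 'EUC-CN', the usual byte encoding of GB2312, as the source encoding name. The variable names are illustrative.

```php
<?php
$gb   = "\xD0\xC2";     // 新 as GB2312 (EUC-CN) bytes
$utf8 = "\xE6\x96\xB0"; // 新 as UTF-8 bytes

// urlencode() percent-encodes bytes; it does not know about characters.
$fromGb   = urlencode($gb);
$fromUtf8 = urlencode($utf8);

// Convert just the parameter to UTF-8, then encode it.
$converted = urlencode(mb_convert_encoding($gb, 'UTF-8', 'EUC-CN'));

echo $fromGb, ' ', $fromUtf8, ' ', $converted, "\n";
```

This reproduces both results from the answers above and shows that the difference comes entirely from the bytes handed to urlencode().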
