In PHP, json_encode() encodes non-ASCII characters as \uXXXX escape sequences, e.g.
json_encode('中'); // becomes "\u4e2d"
Assume the data "\u4e2d" is now being stored in MySQL, is it possible to convert back from "\u4e2d" to 中 without using PHP, just plain MySQL?
On my configuration, select hex('中'); returns E4B8AD
which is the hex code of the UTF8 bytes. Naturally it
is not the same as the hex of the code point 4e2d, but you can get
that with select hex(cast('中' as char(1) character set utf16));.
Update: The questioner has edited the question into what looks to me like a completely different one. It now appears to be: how do you get '中' from a string containing '\u4e2d', where 4e2d is the code point of 中 and the default character set is utf8? Okay, that is
select cast(char(conv(right('\u4e2d',4),16,10) using utf16) as char(1) character set utf8);
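The difference between the UTF-8 bytes and the code point can be checked outside MySQL as well. The following is a minimal Python sketch, not part of the original answer; the variable names are made up for illustration. It also shows the reverse step (hex digits back to the character), which is what the query above does with conv() and char().

```python
# Illustrative only: variable names are hypothetical, not from the answer.
ch = "中"

# Bytes as they would be stored in a utf8 column (what HEX('中') shows).
utf8_hex = ch.encode("utf-8").hex().upper()

# For characters in the Basic Multilingual Plane, the UTF-16 code unit
# is numerically the same as the code point.
utf16_hex = ch.encode("utf-16-be").hex().upper()
code_point_hex = format(ord(ch), "04X")

# Reverse direction: take the four hex digits after "\u" and build the character.
escaped = r"\u4e2d"
decoded = chr(int(escaped[-4:], 16))

print(utf8_hex, utf16_hex, code_point_hex, decoded)
```

Note that taking the last four characters only works for a single \uXXXX escape; anything more complex needs a real JSON decoder.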
Encoding non-ASCII characters as \uXXXX escape sequences is only one of the things that JSON encoders do, and it isn't actually mandatory:
echo json_encode('中'), PHP_EOL;
echo json_encode('中', JSON_UNESCAPED_UNICODE), PHP_EOL;
echo json_encode('One "Two" Three \中'), PHP_EOL;
"\u4e2d"
"中"
"One \"Two\" Three \\\u4e2d"
Thus the only safe decoding approach is using a dedicated JSON decoder. MySQL bundles the required abilities since 5.7.8:
SET @input = '"One \\"Two\\" Three \\\\\\u4e2d"';
SELECT @input AS json_string, JSON_UNQUOTE(@input) AS original_string;
json_string                  original_string
============================ ===================
"One \"Two\" Three \\\u4e2d" One "Two" Three \中
If you have an older version you'll have to resort to more elaborate solutions (you can search for third-party UDFs).
In any case, I suggest you go back to the drawing board. It's strange that you need JSON data in a context where you don't have a proper JSON decoder available.
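The point about escaping being optional, and about a real decoder being the only safe way back, can be reproduced in any language with a JSON library. Here is a rough Python equivalent of the PHP and MySQL examples above; it is only a sketch, and ensure_ascii=False is Python's counterpart of JSON_UNESCAPED_UNICODE, not something the answer itself uses.

```python
import json

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
default_escaped = json.dumps("中")

# Escaping is optional; this keeps the character as-is.
unescaped = json.dumps("中", ensure_ascii=False)

# Quotes and backslashes are escaped too, so a simple "\uXXXX" search-and-replace
# would not be enough to undo the encoding.
tricky = json.dumps('One "Two" Three \\中')

# A JSON decoder reverses all of those rules at once.
decoded = json.loads(tricky)

print(default_escaped)
print(unescaped)
print(tricky)
print(decoded)
```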
PHP: 7.2.5
Laravel: 7.25
We have a bug where a very small number of users are trying to insert copy with the '' character included. I'm assuming this is because of a copy and paste from a PDF, I have seen them before with line breaks. This produces the following error:
SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF4\x8F\xB0\x80</...' for column 'body' at row 1 (SQL: update `post` set `body` = <p></p>, `body_raw` = , `post`.`updated_at` = 2020-10-06 10:34:22 where `id` = 1)
Character '':
Decimal Character Codes: 56319, 56320
Hexadecimal Character Codes: 0xdbff, 0xdc00
HTML with named character references: ? ?
Looking at Google, a suggestion is that you could update the DB encoding from utf8 to utf8mb4. This is probably the optimal solution, but we have a large database and I'm uneasy amending the encoding (though this may be very safe). I'm concerned about possible data loss/corruption.
As this issue is only appearing with this one character in our bug system, and it's 100% not required, I'm inclined to just remove it before saving it in the database, to create the minimum changes.
I'm inclined to do the following:
str_replace("","", $post);
But if I paste the character '' into any of my code editors it disappears (I'm assuming because of utf8 encoding). What would be the best way to accomplish this?
With great help from @04FS (thanks), I have found a solution. As mentioned, I think converting the database from utf8 to utf8mb4 is probably the best route here. But to avoid amending the database, here is the solution I have found.
The main source of confusion is the character '' itself. As I cannot enter it into my text editors, it was hard to work with, so I relied on third-party sites to encode it. One suggestion was to use chr() to write and match the character, but two different websites gave two different character codes, chr(111) and chr(244). With chr(244), which is only the first byte (0xF4) of the character, I was able to use str_replace, but it only replaced part of the character and broke the SQL query.
@04FS suggested trying urlencode(), which gave me '%F4%8F%B0%80' for that character. This matches the bytes in the database error, so the following solution works correctly:
private function removeSpecialCharacters($str) {
    $str = str_replace(urldecode('%F4%8F%B0%80'), '', $str);
    return $str;
}
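The numbers quoted in the question fit together: the two "character codes" 0xDBFF and 0xDC00 are a UTF-16 surrogate pair, and the four bytes in the error message are the UTF-8 form of the same code point. MySQL's utf8 character set stores at most three bytes per character, which is why the insert fails. The Python sketch below checks that arithmetic and shows a more general filter than the single-character replacement; strip_non_bmp is a hypothetical helper name, not something from the thread.

```python
# Surrogate pair from the question (decimal 56319, 56320).
HIGH, LOW = 0xDBFF, 0xDC00

# Standard UTF-16 decoding of a surrogate pair into one code point.
code_point = 0x10000 + ((HIGH - 0xD800) << 10) + (LOW - 0xDC00)
char = chr(code_point)
utf8_bytes = char.encode("utf-8")


def strip_non_bmp(text: str) -> str:
    """Drop every character above U+FFFF, i.e. every 4-byte UTF-8 sequence."""
    return "".join(c for c in text if ord(c) <= 0xFFFF)


cleaned = strip_non_bmp("<p>" + char + "</p>")

print(hex(code_point), utf8_bytes, cleaned)
```

A filter like this also removes emoji and other supplementary-plane characters, so it is a trade-off; converting the column to utf8mb4 avoids losing them.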
I have a CSV file that looks like this:
http://ideone.com/YWuuWx
I read the file and convert it to an array, which works completely fine, but when I then JSON-encode the array, json_encode doesn't output the real values; it outputs null. Here is the dump of the array and the JSON-encoded array:
http://jave.jecool.net/stackoverflowdemos/csv_to_json_to_arraydump.php
I convert like this: $php_array = json_encode($json_array, JSON_PRETTY_PRINT);
Does anyone know what might cause the problem?
EDIT: I think there is about a 90% chance that it's caused by the latin1 characters. Does anyone know the best workaround?
Assuming that it is in fact an encoding error, and that your data is actually encoded in some ISO-8859 variant (I'm guessing latin2 rather than latin1 based on your use of LATIN SMALL LETTER R WITH CARON), and that it is CONSISTENTLY so, you can use iconv() to re-encode it as UTF-8 before doing json_encode():
$foo = iconv('ISO-8859-2', 'UTF-8', $foo);
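To see why the re-encoding step matters, here is a small Python sketch of the same idea. The sample value "Dvořák" is made up for illustration. Bytes in ISO-8859-2 are generally not valid UTF-8, so a JSON encoder that expects UTF-8 cannot use them until they are converted.

```python
import json

# Hypothetical sample: a value as it might sit in a latin2-encoded CSV file.
raw = "Dvořák".encode("iso-8859-2")

# Treating those bytes as UTF-8 fails, which is the kind of problem
# that makes json_encode() give up on the value.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Decoding with the real source encoding first makes the text usable.
text = raw.decode("iso-8859-2")
encoded = json.dumps({"name": text}, ensure_ascii=False)

print(valid_utf8, encoded)
```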
I have come across some problems when inputting certain characters into my MySQL database using PHP. What I am doing is submitting user-entered text to a database. I cannot figure out what I need to change to allow any kind of character to be put into the database and printed back out through PHP as it's supposed to be.
My MySQL collation is: latin1_swedish_ci
Just before I send the text to the database from my form I use mysql_real_escape_string() on the data.
Example below
this text:
�People are just as happy as they make up their minds to be.�
� Abraham Lincoln
is supposed to look like this:
“People are just as happy as they make up their minds to be.”
― Abraham Lincoln
As mentioned by others, you need to convert to UTF8 from end to end if you want to support "special" characters. This means your web page, PHP, mysql connection and mysql table. The web page is fairly simple, just use the meta tag for UTF8. Ideally your headers would say UTF8 also.
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Set your PHP to use UTF8. Things would probably work anyway, but it's a good measure to do this:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
For mysql, you want to convert your table to UTF8, no need to export/import.
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8
You can, and should, configure MySQL to default to utf8. But you can also run the query:
SET NAMES UTF8
as the first query after establishing a connection and that will "convert" your database connection to UTF8.
That should solve all your character display problems.
The likeliest cause of the problem is that the database connection is set to latin1 but you are feeding it text encoded in UTF-8. The simplest way to solve this is to convert your input into what the client expects:
$quote = iconv("UTF-8", "WINDOWS-1252//TRANSLIT", $quote);
(What MySQL calls latin1 is windows-1252 in the rest of the world.) Note that many characters, such as the quotation dash U+2015 that you use there, cannot be represented in this encoding and will be converted into something else. Ideally you should change the column encoding to utf8.
An alternative solution: set the database connection to utf8. It doesn't matter how the columns are encoded: MySQL internally converts text from the connection encoding into the storage encoding, you can keep the columns as latin1 if you want to. (If you do, the quotation dash U+2015 will be turned into a question mark ? because it's not in latin1)
How to set the connection encoding depends on what library you are using: if you use the deprecated MySQL library it's mysql_set_charset, if MySQLi it's mysqli_set_charset, if PDO add charset=utf8 to the DSN.
If you do this you'll have to set the page encoding to UTF-8 with the Content-Type header.
Otherwise you would be having the same problem with the browser: feeding it text encoded in UTF-8 when it's expecting something else:
header("Content-Type: text/html; charset=utf-8");
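The mismatch described in this answer can be reproduced in a few lines. The Python sketch below is an illustration, not part of the answer: it reads UTF-8 bytes as windows-1252, the way a latin1 connection would, and shows what happens to the quotation dash U+2015 when it is forced into that encoding.

```python
# UTF-8 text interpreted as windows-1252 (MySQL's "latin1") turns into mojibake.
quote = "\u201cPeople"  # opening curly quote followed by plain text
mojibake = quote.encode("utf-8").decode("cp1252")

# U+2015 has no slot in windows-1252 at all.
dash = "\u2015"
try:
    dash.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False

# With a lossy conversion it degrades to a question mark.
fallback = dash.encode("cp1252", errors="replace")

print(mojibake, representable, fallback)
```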
The solutions provided are helpful if starting from scratch. Putting all possible connections to UTF-8 is indeed the safest. UTF-8 is the most used charset on the net for a variety of reasons.
Some suggestions and a word of warning:
copy the tables you want to sanitize with a unique prefix (tmp_)
although your db-connection is forced to utf8, check your General Settings collation, change to utf8_bin if that was not done yet
you need to run this on the local server
the funny char error is mostly due to mixing LATIN1 with UTF-8 configurations. This solution is designed for this. It could work with char-sets other than LATIN1 but I haven't checked this
check these tmp_tables extensively before copying back to the original
Build the two arrays needed for the magic:
$chars = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, "UTF-8");
$LATIN1 = $UTF8 = array();
// each() was removed in PHP 8; foreach walks the same key/value pairs.
foreach ($chars as $key => $val) {
    $UTF8[] = $key;
    $LATIN1[] = $val;
}
Now build up the routines you need: (tables->)rows->fields and at each field call
$row[$field] = mysql_real_escape_string(str_replace($LATIN1 , $UTF8 , $row[$field]));
$q[] = "$field = '{$row[$field]}'";
Finally build up and send the query:
mysql_query("UPDATE $table SET " . implode(" , " , $q) . " WHERE id = '{$row['id']}' LIMIT 1");
Change the MySQL collation to utf8_unicode_ci or utf8_general_ci, including the table and the database.
Yes, you will need to set your database to UTF-8. There are many ways to do it: by changing the config file, via phpMyAdmin, or by calling a PHP function (sorry, memory blank) right before inserting into and updating MySQL.
Unfortunately, I think you will have to re-enter any data you entered before.
One thing you also need to know, from personal experience: make sure all related tables have the same collation or you won't be able to JOIN them.
For reference: http://dev.mysql.com/doc/refman/5.6/en/charset-syntax.html
Also, it can be an Apache setting. We experienced the same issue on a 'free-hosting' server as well as on my brother's server. Once we switched to another server, all the characters displayed correctly. Verify your Apache settings; sorry, but I can't shed more light on Apache's config.
Forget everything else; you just need to follow these three points, and every problem with special language characters will be resolved.
1- Define the collation of your table as utf8_general_ci.
2- Put <meta http-equiv="content-type" content="text/html; charset=utf-8"> in the HTML, right after the opening head tag.
3- Call mysql_set_charset('utf8',$link_identifier); in the file where you connect to the database, right after selecting the database with mysql_select_db. This will let you store and retrieve data properly in whatever language it is.
If your text has been encoded and decoded with the wrong encoding and so the mojibake is actually "solidified" into unicode characters, then the solutions mentioned so far won't work. I ended up having success with the ftfy Python package to automatically detect/fix mojibake:
https://github.com/LuminosoInsight/python-ftfy
https://pypi.org/project/ftfy/
https://ftfy.readthedocs.io/en/latest/
>>> import ftfy
>>> print(ftfy.fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง
Hopefully this helps people who are in a similar situation.
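For the simple case where the text went through exactly one wrong decode (UTF-8 bytes read as windows-1252), the repair can also be sketched by hand by reversing that one step. The helper below is a hypothetical illustration, not ftfy's API, and it is far less robust than ftfy, which detects the encoding and handles mixed or repeated damage.

```python
def undo_mojibake(text: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Reverse one round of 'UTF-8 bytes decoded with the wrong codec'.

    Returns the input unchanged if it does not look like that kind of damage.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = "(à¸‡'âŒ£')à¸‡"
fixed = undo_mojibake(broken)
untouched = undo_mojibake("café")  # valid text is left alone

print(fixed, untouched)
```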
I have a site I want to migrate from ISO to UTF-8.
I have a record in database indexed by the following primary key :
s:22:"Informations générales";
The problem is, now (with UTF-8), when I serialize the string, I get :
s:24:"Informations générales";
(notice the size of the string is now the number of bytes, not string length)
So this is not compatible with non-utf8 previous records !
Did I do something wrong ? How could I fix this ?
Thanks
The behaviour is completely correct. Two strings with different encodings will generate different byte streams, thus different serialization strings.
Dump the database in latin1.
In the command line:
sed -e 's/latin1/utf8/g' -i ./DBNAME.sql
Import the file converted to a new database in UTF-8.
Use a php script to update each field.
Make a query, loop through each field and update the serialized string using this:
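// The /e modifier was removed in PHP 7, so a callback recomputes each byte length instead.
$str = preg_replace_callback('!s:(\d+):"(.*?)";!s', function ($m) { return 's:' . strlen($m[2]) . ':"' . $m[2] . '";'; }, $str);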
After that, I was able to use unserialize() and everything working with UTF-8.
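The 22 versus 24 in the question is simply character count versus UTF-8 byte count, and the length-fixing step above just recomputes the latter. A Python sketch of the same idea follows; fix_serialized_lengths is a hypothetical name, and like the PHP regex it assumes no string value itself contains the sequence `";`.

```python
import re

value = "Informations générales"
char_count = len(value)                  # what the old ISO-8859-1 data recorded
byte_count = len(value.encode("utf-8"))  # what serialize() records for UTF-8


def fix_serialized_lengths(serialized: str) -> str:
    """Rewrite every s:N:"..."; so that N is the UTF-8 byte length of the string."""
    def repl(match):
        body = match.group(1)
        return 's:%d:"%s";' % (len(body.encode("utf-8")), body)

    return re.sub(r's:\d+:"(.*?)";', repl, serialized, flags=re.S)


fixed = fix_serialized_lengths('s:22:"Informations générales";')
print(char_count, byte_count, fixed)
```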
To unserialize a UTF-8 encoded serialized array:
$array = @unserialize($arrayFromDatabase);
if ($array === false) {
    $array = @unserialize(utf8_decode($arrayFromDatabase)); // decode first
    $array = array_map('utf8_encode', $array); // encode the array again
}
PHP 4 and 5 do not have built-in Unicode support; I believe PHP 6 is starting to add more Unicode support although I'm not sure how complete that is.
You did nothing wrong. PHP prior to v6 just isn't Unicode-aware, and as such doesn't support it unless you force it to (i.e., via the mbstring extension or other means).
We here wrote our own wrapper around serialize() to remedy this. You could, too, move to other serialization techniques, like JSON (with json_encode() and json_decode() in PHP since 5.2.0).
I am using Delphi 7 and ICS components to communicate with php script and insert some data in mysql database...
How to post unicode data using http post ?
After using utf8encode from tnt controls I am doing it to post to PHP script
<?php
echo "Note = " . $_POST['note'];
if ($_POST['action'] == 'i')
{
    /*
     * This code will add new notes to the database
     */
    $sql = "INSERT INTO app_notes VALUES ('', '" . mysql_real_escape_string($_POST['username']) . "', '" . mysql_real_escape_string($_POST['note']) . "', NOW(), '')";
    $result = mysql_query($sql, $link) or die('0 - Ins');
    echo '1 - ' . mysql_insert_id($link);
}
?>
Delphi code :
data := Format('date=%s&username=%s&password=%s&hash=%s&note=%s&action=%s',
[UrlEncode(FormatDateTime('yyyymmddhh:nn',now)),
UrlEncode(edtUserName.Text),
UrlEncode(getMd51(edtPassword.Text)),
UrlEncode(getMd51(dataHash)),UrlEncode(Utf8Encode(memoNote.Text)),'i'
]);
// try function StrHtmlEncode (const AStr: String): String; from IdStrings
HttpCli1.SendStream := TMemoryStream.Create;
HttpCli1.SendStream.Write(Data[1], Length(Data));
HttpCli1.SendStream.Seek(0, 0);
HttpCli1.RcvdStream := TMemoryStream.Create;
HttpCli1.URL := Trim(ActionURLEdit.Text);
HttpCli1.PostAsync;
But when I post, the Unicode value is totally different from the original one that I see in the Tnt Memo.
Is there something I am missing?
Also, does anybody know how to do this with Indy?
Thanks.
Your example code shows your data coming from a TNT Unicode control. That value will have type WideString, so to get UTF-8 data, you should call Utf8Encode, which will return an AnsiString value. Then call UrlEncode on that value. Make sure UrlEncode's input type is AnsiString. So, something like this:
var
data, date, username, passhash, datahash, note: AnsiString;
date := FormatDateTime('yyyymmddhh:nn',now);
username := Utf8Encode(edtUserName.Text);
passhash := getMd51(edtPassword.Text);
datahash := getMd51(data);
note := Utf8Encode(memoNote.Text);
data := Format('date=%s&username=%s&password=%s&hash=%s&note=%s&action=%s',
[UrlEncode(date),
UrlEncode(username),
UrlEncode(passhash),
UrlEncode(datahash),
UrlEncode(note),
'i'
]);
There should be no need to UTF-8-encode the MD5 values since MD5 string values are just hexadecimal characters. However, you should double-check that your getMd51 function accepts WideString. Otherwise, you may be losing data before you ever send it anywhere.
Next, you have the issue of receiving UTF-8 data in PHP. I expect there's nothing special you need to do there or in MySQL. Whatever you store, you should get back identically later. Send that back to your Delphi program, and decode the UTF-8 data back into a WideString.
In other words, your Unicode data will look different in your database because you're storing it as UTF-8. In your database, you're seeing UTF-8-encoded data, but in your TNT controls, you're seeing the regular Unicode characters.
So, for instance, if you type the character "ش" into your edit box, that's Unicode character U+0634, Arabic letter sheen. As UTF-8, that's the two-byte sequence 0xD8 0xB4. If you store those bytes in your database, and then view the raw contents of the field, you may see characters interpreted as though those bytes are in some ANSI encoding. One possible interpretation of those bytes is as the two-character sequence "Ø´", which is the Latin capital letter o with stroke followed by an acute accent.
When you load that string back out of your database, it's still encoded as UTF-8, just as it was when you stored it, so you will need to decode it. As far as I can tell, neither PHP nor MySQL does any massaging of your data, so whatever UTF-8 character you give them will be returned to you as-is. If you are using the data in Delphi, then call Utf8Decode, which is the complement to the Utf8Encode function that you called previously. If you are using the data in PHP, then you might be interested in PHP's utf8_decode function, although that converts to ISO-8859-1, which doesn't include our example Arabic character. Stack Overflow already has a few questions related to using UTF-8 in PHP, so I won't attempt to add to them here. For example:
Best practices in PHP and MySQL with international strings
UTF-8 all the way through…
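The Arabic-letter example from this answer can be verified directly. The Python sketch below is an illustration outside Delphi and PHP: it shows the two UTF-8 bytes, how they look when a tool misreads them as a single-byte encoding, and that the stored bytes are still intact and decode back to the original character.

```python
ch = "\u0634"  # Arabic letter sheen

# The bytes that actually get stored and transmitted.
utf8_bytes = ch.encode("utf-8")

# What a viewer shows if it interprets those bytes as ISO-8859-1.
misread = utf8_bytes.decode("latin-1")

# Nothing is lost: going back through the same bytes recovers the character.
restored = misread.encode("latin-1").decode("utf-8")

print(utf8_bytes, misread, restored)
```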
Encode the UTF-8 data as application/x-www-form-urlencoded. This will ensure that the server can read the data over the HTTP connection.
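As a rough sketch of what such a request body looks like, here it is built with Python's standard library rather than Delphi. The field names note and action are taken from the question; the value is just a sample character.

```python
from urllib.parse import parse_qs, urlencode

# Each value is converted to UTF-8 bytes and then percent-encoded.
body = urlencode({"note": "\u0634", "action": "i"})

# The receiving side reverses both steps.
parsed = parse_qs(body)

print(body)
print(parsed)
```

The percent-encoded form contains only ASCII, so it survives transport regardless of what either side assumes about the encoding of the raw body.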
I would expect (without knowing for sure) that you'd have to output them as &#nnnnn; entities (with the number in decimal rather than hex ... I think)
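For what it's worth, here is what such numeric character references look like for one sample character, sketched in Python. HTML accepts both the decimal and the hexadecimal form; whether the receiving script decodes them is a separate question that this sketch does not answer.

```python
import html

ch = "\u0634"  # Arabic letter sheen, used here only as a sample

decimal_ref = "&#%d;" % ord(ch)
hex_ref = "&#x%X;" % ord(ch)

# Python can produce the decimal form for any non-ASCII character directly.
ascii_safe = ch.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Both forms decode back to the same character.
roundtrip = html.unescape(decimal_ref) + html.unescape(hex_ref)

print(decimal_ref, hex_ref, ascii_safe, roundtrip)
```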