Different Variants of UTF-8 Comma? [，] [,] - CURL Response for MySQL Data

Prepping a cURL response so that particular data can be inserted into a MySQL table.
Noticed some special characters in the saved data for certain URLs.
$curldata = curl_exec($curl);
$encoding = mb_detect_encoding($curldata);
brought back ASCII encoding.
Okay, don't want that.
The tables in my database are an InnoDB type with a utf8mb4_unicode_ci collation.
Added this to my curl options:
curl_setopt($curl, CURLOPT_ENCODING, 1);
And an iconv function based on the above mb_detect_encoding / $encoding variable upon save.
$curldata = iconv($encoding, "UTF-8", $curldata);
// save to file to test output
file_put_contents('test.html', $curldata);
Not sure if this is the best way to go about this, but my test.html output no longer has any encoding for special characters, so... (perhaps) mission accomplished.
As I parse through the data, I then notice this character.
，
Not an ordinary comma... [Comparison: ,/，]
But it acts like one. Press Ctrl+F and search for a comma: the search treats the two as the same, and both are reported as UTF-8 - var_dump(mb_detect_encoding('，'));
I look at my table row, and see it as a row inserted as such
8，8
If I try to search for a , it does indeed bring back the instances where ，is present.
Vice versa, if I search for ， it brings back all instances where that and a comma occurs.
Basically for all intents and purposes it is a comma, yet obviously isn't.
This is of course workable, but rather annoying and feels riddled with inconsistency.
Can anyone explain why the two commas are the same, yet obviously different?
Is there a way to prevent these odd characters from entering my cURL response, or further along in my DOM response and PDO insert?
edit:
If relevant,
// dom
$dom = new DOMDocument('1.0', 'utf-8');
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML(mb_convert_encoding($curldata, 'HTML-ENTITIES', 'UTF-8'));
// pdo
$pdoquery = "INSERT INTO `table` (`Attr`) VALUES (?)";
$value = "8，8";
$stmt = $pdo->prepare($pdoquery);
$stmt->execute([$value]);
edit 2:
Well, it appears to be a FULLWIDTH COMMA (U+FF0C).
var_dump(utf8_to_unicode('，'));
string '%uff0c' (length=6)
var_dump(utf8_to_unicode(','));
string '%2c' (length=3)
Starting to make more sense... now to figure out how to prevent such characters from entering the curl response/DOM/database...

You might want the function mb_convert_kana which can convert characters of different widths into a uniform width.
$s = 'This is a string with ，, (commas having different widths)';
echo 'original : ', $s, PHP_EOL;
echo 'converted: ', mb_convert_kana($s, 'a');
result:
original : This is a string with ，, (commas having different widths)
converted: This is a string with ,, (commas having different widths)
PHP documentation: mb_convert_kana
To get an idea what the meaning is, see also http://unicode.org/reports/tr11-2/
By convention, 1/2 Em wide characters of East Asian legacy encodings are called "half-width" (or hankaku characters in Japanese), the others are called correspondingly "full-width" (or zenkaku) characters.
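If only the comma should be normalized and other full-width characters left alone, a narrower replacement may be enough. The sketch below is an assumption about what the asker wants, and normalize_fullwidth_comma is a hypothetical helper name, not something from the thread. It needs PHP 7 or later for the "\u{...}" escape.

```php
<?php
// Hypothetical helper: replace only U+FF0C FULLWIDTH COMMA with the ASCII comma.
// Unlike mb_convert_kana($s, 'a'), other full-width characters are left untouched.
function normalize_fullwidth_comma(string $s): string
{
    return str_replace("\u{FF0C}", ',', $s);
}

$value = "8\u{FF0C}8";                          // the value from the question
$normalized = normalize_fullwidth_comma($value); // ASCII comma between the digits
echo $normalized, PHP_EOL;
```

Running this before the PDO insert would store 8,8 with an ordinary comma.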

With a suitable COLLATION, the two commas are treated as equal:
mysql> SELECT '，' = ',' COLLATE utf8mb4_general_ci;
+----------------------------------------+
| '，' = ',' COLLATE utf8mb4_general_ci |
+----------------------------------------+
| 0 |
+----------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT '，' = ',' COLLATE utf8mb4_unicode_ci;
+----------------------------------------+
| '，' = ',' COLLATE utf8mb4_unicode_ci |
+----------------------------------------+
| 1 |
+----------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT '，' = ',' COLLATE utf8mb4_unicode_520_ci;
+--------------------------------------------+
| '，' = ',' COLLATE utf8mb4_unicode_520_ci |
+--------------------------------------------+
| 1 |
+--------------------------------------------+
1 row in set (0.00 sec)
It is clearer to compare the actual bytes with HEX() than to talk about Unicode:
mysql> SELECT HEX('，'), HEX(',');
+------------+----------+
| HEX('，') | HEX(',') |
+------------+----------+
| EFBC8C     | 2C       |
+------------+----------+
1 row in set (0.00 sec)
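The same byte-level check can be done on the PHP side before the data reaches MySQL. This is a minimal sketch using bin2hex; the variable names are illustrative only.

```php
<?php
// Compare the raw UTF-8 bytes of the two commas.
$fullwidth = "\u{FF0C}"; // FULLWIDTH COMMA, written as an escape so the file encoding does not matter
$ascii     = ',';

$fullwidthHex = strtoupper(bin2hex($fullwidth)); // three bytes
$asciiHex     = strtoupper(bin2hex($ascii));     // one byte

echo $fullwidthHex, ' vs ', $asciiHex, PHP_EOL;
```

The hex strings match what HEX() reports in MySQL, which confirms that the two characters are stored as different byte sequences even when a collation compares them as equal.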

Related

Error on accentuated characters with PHP and MySQL

My problem is that text written directly in PHP shows its accented characters correctly, but when an accented word comes from MySQL, the letters come out like this: �.
I tried setting the HTML charset to ISO-8859-1, which fixed the MySQL letters but broke the others. One way to fix everything is to save my .php files as ISO-8859-1, but I can't do that; I need to keep them encoded as UTF-8.
What can I do?
Current solution: call mysqli_set_charset($link, "utf8"); before the queries (only needed once per connection). I'm still looking for a conclusive solution on the server, not on the client.
EDIT:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-----------------+
| Variable_name | Value |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+----------------------+-----------------+
mysql> show variables like "character_set_database";
+------------------------+-------+
| Variable_name | Value |
+------------------------+-------+
| character_set_database | utf8 |
+------------------------+-------+
1 row in set (0.00 sec)
mysql> show variables like "collation_database";
+--------------------+-----------------+
| Variable_name | Value |
+--------------------+-----------------+
| collation_database | utf8_general_ci |
+--------------------+-----------------+
1 row in set (0.00 sec)
These are the values of my database, but I still cannot make it right.
EDIT2:
<meta charset="utf-8">
...
$con = mysqli_connect('localhost', 'root', 'root00--', 'eicomnor_db');
$query = "SELECT * FROM table";
$result = mysqli_query($con, $query);
while ($row = mysqli_fetch_assoc($result)) {
echo "<tr>";
echo "<td>" . $row['id'] . "</td>";
echo "<td>" . $row['nome'] . "</td>";
echo "</tr>";
}
mysqli_close($con);
Here's the PHP code.

First off, don't move your PHP files toward ISO-8859-1; that's going backwards and may lead to compatibility issues with browsers down the line. Instead, follow the path to UTF-8 from the bottom up.
The easiest thing to check is that you're serving your HTML as UTF-8: AddDefaultCharset utf-8 in your Apache config may help with that, and <meta charset="utf-8"> in your HTML header will as well.
The second thing to check is that the MySQL connection and collation use UTF-8: http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html or http://docs.moodle.org/23/en/Converting_your_MySQL_database_to_UTF8
The final and most annoying step is to convert any data actually in the database to UTF-8. Back up your data with a standard MySQL dump first! There are a few tricks to simplify this process by creating a dump of the database as UTF-8 and then putting it back into the system with the right collation, but be aware that this is a delicate process and be sure you have a solid backup to work with first! http://docs.moodle.org/23/en/Converting_your_MySQL_database_to_UTF8 is a good guide to that process.
Good luck! Charset issues with old databases are often more work than they initially appear.

Have you tried iconv? As you know that the charset used on the DB is ISO-8859-1, you can convert to your charset (I'm assuming UTF-8):
// Assuming that $text is the text coming from the DB
$text = iconv("ISO-8859-1", "UTF-8", $text);

Assuming you send the output to the browser, you need to ensure that the proper charset <meta charset="utf-8" /> is set and that you don't override it in your browser settings (check that it's either "auto" or "utf-8").

Including mysqli_set_charset($link, "utf8"); before the queries (only needed once per connection) resolves the problem.
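As a rough sketch, the call can be folded into whatever code opens the connection so that it cannot be forgotten. The function name open_utf8_connection and its parameters are hypothetical. The "utf8" default mirrors the answer above; MySQL's utf8 covers at most three bytes per character, so utf8mb4 would be needed for four-byte characters.

```php
<?php
// Hypothetical helper: open a mysqli connection and set the connection
// character set once, before any query is sent.
function open_utf8_connection(
    string $host,
    string $user,
    string $pass,
    string $db,
    string $charset = 'utf8'
): mysqli {
    $link = mysqli_connect($host, $user, $pass, $db);
    if ($link === false) {
        // Older PHP versions return false instead of throwing on failure.
        throw new RuntimeException('Connection failed: ' . mysqli_connect_error());
    }
    if (!mysqli_set_charset($link, $charset)) {
        throw new RuntimeException('Could not set charset: ' . mysqli_error($link));
    }
    return $link;
}
```

With this in place, the page code would call open_utf8_connection(...) instead of mysqli_connect(...) and leave the rest of the query loop unchanged.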

Mysql string input with formatting

I am a beginner at MySQL and I am having a little trouble with the correct formatting for a cell in my table.
I have the data type set to TEXT, so there is plenty of space for a few small paragraphs within the cell. My problem is: how do I format a paragraph containing apostrophes, quotes, colons, and other punctuation when inserting it into that cell via the command line (Mac)?
This is what the column values are:
joke TEXT NOT NULL,
I want to insert this example joke into the table:
Husband says: "When I'm gone you'll never find another man like me".
Wife replied: "What makes you think I'd want another man like you!"
How do I write it on the command line so that it keeps the exact formatting, or at least something close? My command line entry looks like this:
INSERT INTO jokes
(date_submitted, source, joke_style, joke)
VALUES
(NOW(), 'www.example.com','Blonde', 'Joke would go here with (: ' ,) and so on.');
Do I escape the apostrophes with \'? What do I do with the rest of the punctuation, and how do I store line breaks?

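Use mysql_real_escape_string() (removed in PHP 7.0 together with the rest of the mysql_* extension; mysqli_real_escape_string() is its successor), or switch to PDO and use PDO::prepare() with placeholders.
Example:
$sth = $pdo->prepare('INSERT INTO jokes ... VALUES (NOW(), :joke, ...)');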
$sth->execute(array(':joke' => $joke));
If you need to write commands manually, you should escape ' as \', and the newline character is \n. For more details about how strings are escaped in MySQL, see the manual section about string literals.
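To illustrate that placeholders need no manual escaping, here is a small runnable sketch. It makes two assumptions that are not in the thread: an in-memory SQLite database stands in for MySQL (so pdo_sqlite must be available), and the date_submitted column is left out because NOW() is MySQL-specific.

```php
<?php
// Assumption: SQLite in memory replaces the MySQL server for this demo.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE jokes (source TEXT, joke_style TEXT, joke TEXT NOT NULL)');

// The joke from the question, with quotes, apostrophes, and a line break.
$joke = "Husband says: \"When I'm gone you'll never find another man like me\".\n"
      . "Wife replied: \"What makes you think I'd want another man like you!\"";

// Values are bound separately from the SQL text, so no escaping is needed.
$sth = $pdo->prepare(
    'INSERT INTO jokes (source, joke_style, joke) VALUES (:source, :style, :joke)'
);
$sth->execute([
    ':source' => 'www.example.com',
    ':style'  => 'Blonde',
    ':joke'   => $joke,
]);

// Read the row back to check that punctuation and the newline survived.
$stored = $pdo->query('SELECT joke FROM jokes')->fetchColumn();
echo $stored, PHP_EOL;
```

The same prepare/execute calls should work against MySQL once the DSN is changed and the date column is added back.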

Here's an excerpt from the manual:
The following SELECT statements demonstrate how quoting and escaping work:
mysql> SELECT 'hello', '"hello"', '""hello""', 'hel''lo', '\'hello';
+-------+---------+-----------+--------+--------+
| hello | "hello" | ""hello"" | hel'lo | 'hello |
+-------+---------+-----------+--------+--------+
mysql> SELECT "hello", "'hello'", "''hello''", "hel""lo", "\"hello";
+-------+---------+-----------+--------+--------+
| hello | 'hello' | ''hello'' | hel"lo | "hello |
+-------+---------+-----------+--------+--------+
mysql> SELECT 'This\nIs\nFour\nLines';
+--------------------+
| This
Is
Four
Lines |
+--------------------+
mysql> SELECT 'disappearing\ backslash';
+------------------------+
| disappearing backslash |
+------------------------+

MySQL VARCHAR(156) not storing 156 Multi-Byte Characters?

I have a multi-byte text of 156 characters encoded in UTF-8 format and verified by PHP function mb_strlen($text, 'UTF-8') to be of 156 length. I was expecting to be able to store all of it with VARCHAR(156). But a good portion of the text got truncated.
This is my original text:
위키백과, 백과사전.
대수(λ -, lambda -)는 함. 1930년대 다. 함수 s(x, y) = x*x + 입력 x 것이다. x ↦ x 와 y
↦ y 는 변수의 이름은. 또한 (x, y) ↦ x*x + y*y 와 (u, v) ↦ u*u + v*v 는.123456
This is what I got in MySQL:
ìœ„í‚¤ë°±ê³¼, ë°±ê³¼ì‚¬ì „.
ëŒ€ìˆ˜(Î» -, lambda -)ëŠ” í•¨. 1930ë…„ëŒ€ ë‹¤. í•¨ìˆ˜ s(x, y) = x*x +
ìž…ë ¥ x ê²ƒì´ë‹¤. x â†¦ x ì™€ y â†¦ y ëŠ” ë³€ìˆ˜ì
This is what is generated upon querying on my web page:
위키백과, 백과사전.
대수(λ -, lambda -)는 함. 1930년대 다. 함수 s(x, y) = x*x + 입력 x 것이다. x ↦ x 와 y
↦ y 는 변수�
There is a similar question on Stack Overflow, but it does not seem to address my problem. Note that the table uses CHARSET=utf8 with the utf8_general_ci collation, and the column collation uses the table default. I am using MySQL version 5.5.14 with system variables as shown:
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+--------------------------+----------------------------------------+
UPDATE:
After running mysqli_query($cxn, "SET NAMES utf8") in my PHP script as suggested by Homer6, the column did take in the full 156 characters, and the stored text matches my original.
But now what is generated on my web page becomes:
????, ????. ??(? -, lambda -)? ?. 1930?? ?. ?? s(x, y) = x*x + ?? x
???. x ? x ? y ? y ? ??? ???. ?? (x, y) ? x*x + y*y ? (u, v) ? u*u +
v*v ?.123456
Can anyone help me?

Can you try quadrupling the size to 624? I think the size is in bytes, not characters. And UTF-8 can be between 1 and 4 bytes.
See http://unicode.org/faq/utf_bom.html
Also, are you setting
SET NAMES 'utf8';
before you run your query?
Or, for Korean, what happens if you set
mysql_query( 'SET NAMES euckr' );
before your query?
http://dev.mysql.com/doc/refman/5.1/en/charset-asian-sets.html

It depends on which version of MySQL you have. Before MySQL 4.1, the length is in bytes. In MySQL 4.1 and later, the length is in characters.
Also, the column needs a utf8 character set (for example with the utf8_unicode_ci collation) for MySQL to count multi-byte characters properly.

I'm pretty sure that mb_strlen returns the number of characters, not the size of the string in bytes.
Although UTF-8 uses 1 byte per ASCII character, this is not true for other languages/character sets. The number of characters up to the 1930 is about 45. This makes sense because Korean characters take 3 bytes per character (I think).
You must also explicitly set the character set to utf8, see http://dev.mysql.com/doc/refman/5.0/en/string-type-overview.html
You can alter the table with:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;
Run SHOW CREATE TABLE [TABLE_NAME]; to see what character set the column has. I.e. it should print out 'column_name' varchar(156) character set utf8 default NULL,
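The difference between character count and byte count is easy to see in PHP. This is a small sketch using the first word of the asker's text; the variable names are illustrative only, and the string is written with code point escapes so the result does not depend on how the source file is saved.

```php
<?php
// The four Hangul syllables that open the sample text, as code point escapes.
$text = "\u{C704}\u{D0A4}\u{BC31}\u{ACFC}";

$chars = mb_strlen($text, 'UTF-8'); // counts characters
$bytes = strlen($text);             // counts bytes

echo "characters: $chars, bytes: $bytes", PHP_EOL;
```

Each Hangul syllable takes three bytes in UTF-8, so a limit that is applied in bytes (or to bytes that were mis-decoded as latin1 on the way in) is reached about three times sooner than the character count suggests.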

UTF8 issues PHP -> MySQL. Getting question marks in database?

OK, I am currently in PHP/MySQL/UTF-8/Unicode hell!
My environment:
MySQL: 5.1.53
Server characterset: latin1
Db characterset: latin1
Client characterset: latin1
Conn. characterset: latin1
PHP: 5.3.3
My PHP files are saved as UTF-8 format, not ASCII files.
In my PHP code when I make the database connection I do the following:
ini_set('default_charset', 'utf-8');
$my_db = mysql_connect(DEV_DB, DEV_USER, DEV_PASS);
mysql_select_db(MY_DB);
// I have tried both of the following utf8 connection functions
// mysql_query("SET NAMES 'utf8'", $my_db);
mysql_set_charset('utf8', $my_db);
// Detect if form value is not UTF-8
if (mb_detect_encoding($_POST['lang_desc']) == 'UTF-8') {
$lang_description = $_POST['lang_desc'];
} else {
$lang_description = utf8_encode($_POST['lang_desc']);
}
$language_sql = sprintf(
'INSERT INTO app_languages (language_id, app_id, description) VALUES (%d, %d, "%s")',
intval($lang_data['lang_id']),
intval($new_app_id),
mysql_real_escape_string($lang_description, $my_db)
);
The format/create of my MySQL database is:
CREATE TABLE IF NOT EXISTS app_languages (
language_id int(10) unsigned NOT NULL,
app_id int(10) unsigned NOT NULL,
description tinytext collate utf8_unicode_ci,
PRIMARY KEY (language_id,app_id)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The SQL statements that are generated from my PHP code look like this:
INSERT INTO app_languages (language_id, app_id, description) VALUES (91, 2055, "阿拉伯体育新闻和信息")
INSERT INTO app_languages (language_id, app_id, description) VALUES (26, 2055, "阿拉伯體育新聞和信息")
INSERT INTO app_languages (language_id, app_id, description) VALUES (56, 2055, "בערבית ספורט חדשות ומידע")
INSERT INTO app_languages (language_id, app_id, description) VALUES (69, 2055, "アラビア語のスポーツニュースと情報")
Yet, the output appears in my database as this:
| 69 | 2055 | ????????????????? |
| 56 | 2055 | ?????? ????? ????? ????? |
| 28 | 2055 | Arapski sportske vijesti i informacije |
| 42 | 2055 | Arabe des nouvelles sportives et d\'information |
| 91 | 2055 | ?????????? |
What am I doing wrong??
P.S. We can use PuTTY to SSH directly to the database server and paste one of the Unicode/multilingual INSERT statements on the command line, and they work successfully!?
Thanks for any light you can shed on this, it's driving me mad.
Cheers, Jason

Try executing the following query after you have selected the DB:
SET NAMES 'utf8'
This query should solve the problem of differing charsets between your files and the DB.
felix

The answer is right in your question. You're using latin1 throughout your database, and it can't handle unicode. You need to change those to UTF-8 as well.

//first make sure your file produce utf-8 chars
header('Content-Type: text/html; charset=utf-8');

mb_detect_encoding is quite useless unless you already know what you are dealing with. You probably should not rely on it unless you specify the second and third argument. Currently it probably does not return what you think it does.
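One way to avoid guessing, sketched here under assumptions, is to validate rather than detect: check whether the input is already valid UTF-8 and convert only if it is not. The helper name to_utf8 and the ISO-8859-1 fallback are hypothetical choices, not something stated in the thread.

```php
<?php
// Hypothetical helper: return the input unchanged when it is valid UTF-8,
// otherwise convert it from an assumed fallback encoding.
function to_utf8(string $s, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8 (plain ASCII passes as well)
    }
    return mb_convert_encoding($s, 'UTF-8', $fallback);
}

$latin1 = "Otiv\xE4gen";     // a-umlaut as a single ISO-8859-1 byte
$utf8   = "Otiv\xC3\xA4gen"; // the same word as UTF-8

echo bin2hex(to_utf8($latin1)), PHP_EOL;
echo bin2hex(to_utf8($utf8)), PHP_EOL;
```

Both calls should print the same hex string, since the second input is passed through untouched instead of being encoded a second time.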

I see that the words shown as ??????? are Arabic words, which must have the collation cp1256_general_ci, not utf8_general_ci. Change that; it may solve the problem.

UTF-8, PHP and XML Mysql

I am having great problems solving this one:
I have a mysql database encoding latin1_swedish_ci and a table that stores names and addresses.
I am trying to output a UTF-8 XML file, but I am having problems with the following string:
Otivägen is being output as OtivÃ¤gen when I open the file in vim. Also, when it is opened in IE I get
"An invalid character was found in text content. Error processing resource"
I have the following code:
function fixEncoding($in_str)
{
$cur_encoding = mb_detect_encoding($in_str) ;
if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
return $in_str;
else
return utf8_encode($in_str);
}
header("Content-type: text/plain;charset=utf-8");
$mystring = "Otivägen"; // this is actually obtained from the database
$myxml = "<myxml>
....
<node>".$mystring."</node>
....
</myxml>
";
$myxml = fixEncoding($myxml);
The actual XML output is below:
<?xml version="1.0" encoding="UTF-8" ?>
<myxml>
....
<node>Otivägen</node>
....
</myxml>
Any ideas how I can output the file so in vim the file reads Otivägen and not OtivÃ¤gen?
EDIT:
I did mysql_client_encoding() and got latin1
I then did mysql_set_charset()
and again ran mysql_client_encoding() and got utf8, but still the same outputting issues.
Edit 2
I have logged into the command line and run the query SELECT address1 FROM address WHERE id = 1000;
SELECT address1 FROM address WHERE id = 1000;
Current database: ftpuser_db
+-------------+
| address1 |
+-------------+
| Otivägen 32 |
+-------------+
1 row in set (0.06 sec)
Thanks in advance!

I think you did everything correctly, except that your terminal is in Latin-1.
The UTF-8 sequence for ä is C3 A4, which is Ã¤ if displayed as Latin-1.
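The byte-level claim can be reproduced in PHP. This is a minimal sketch with illustrative variable names; it uses mb_convert_encoding to mimic a program that reads UTF-8 bytes as if they were Latin-1.

```php
<?php
// a-umlaut (U+00E4) as UTF-8 is the two bytes C3 A4.
$utf8 = "\u{E4}";
$hex  = bin2hex($utf8);

// Treat each of those bytes as a Latin-1 character and re-encode to UTF-8.
// This is what a Latin-1 terminal (or an extra utf8_encode call) effectively does.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Reversing the step recovers the original character.
$repaired = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');

echo $hex, ' ', $misread, ' ', $repaired, PHP_EOL;
```

If the file on disk contains the bytes C3 A4, the data is fine and only the display is wrong; if it contains the longer sequence produced by the misreading step, the text was encoded twice somewhere in the pipeline.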

Is your MySQL connection encoding properly set to UTF-8 ?
Check mysql_set_charset() and mysql_client_encoding() for more details.

Oh boy. UTF8 issues can be a real pain and they get almost impossible to solve when something is doing re-encodings for you.
You really need to start at one end and make sure every process is UTF8. That will remove things in the process from interpreting the data wrong and 'converting' it for you. But significantly, it will also let you much more easily spot when something has already mis-encoded text for you (yes, I've had that problem).
And if you have UTF8 data in tables that aren't set to UTF8 and might be mis-encoded, you need to do the tables last, after the data has been re-encoded. Otherwise you will damage your data irretrievably. I've had that problem, too.
First steps:
Check your terminal is UTF8 compliant. Gnome-terminal is. Kterm is. ETerm is not.
Check your LANG setting in your shell. It should probably have .UTF-8 on the end of its value.
Check that vim is picking up the UTF8 setting correctly. You can check with :set encoding
This will mean that your files will be edited in UTF8.
Now we check MySQL.
In the MySQL CLI, do show variables like 'character_set%';. The results will probably be something like:
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
What you're aiming for is to change all those latin1 values (or whatever you're seeing) to utf8.
set names utf8; will change most of them and you might need to do that with every new connection in your database. This was the solution I had to adopt in a previous application. The other settings to change are in the my.cnf file for which I need to direct you to the documentation. It is unlikely you will need to set them all.
I see you're already setting the output headers, so that's good.
Now you can look at the data from the database and see why it's "wrong".

latin1_swedish_ci is a collation, not a charset. Since collations are supposed to match their charset, it suggests that the table is using latin1, but it's not a guarantee.
Strictly speaking, the charset of tables is irrelevant here, since MySql can convert input/output. That's what the connection charset (mysql_set_charset) is for. However, for that to work properly, the data needs to be encoded properly in the database. I would begin by checking that strings are correct in the database. Simplest thing is to log in on the command line and select a row which has non-ascii characters in it. Does it look OK?
$mystring = "Otivägen"; // this is actually obtained from the database
Watch out. The encoding of the data in $mystring will now depend on the encoding of the php file. That may or may not be the same as the data in the database.

before output run query SET NAMES utf8
after output you can go back and run SET NAMES latin1
Look here, I've got the same problem

It seems you are "double encoding" Otivägen. You get this behaviour if Otivägen already is UTF-8, and run utf8_encode() on it again. Example:
$str = "Otivägen"; // already an UTF-8 string
echo utf8_encode($str); // outputs OtivÃ¤gen
I'm not sure where the actual "double encoding" occurs, but it may be due to settings in your editor. My theory: let's say you are running Aptana Studio and your actual character set is set to ISO-8859-1 (in Aptana, you can check this by right-clicking on a file and choosing "Properties". To set the default character encoding for all projects, choose Preferences from the Aptana main menu -> General -> Workspace). If that's the case, the actual PHP source file where you have $myxml and its string <myxml><node>... is detected to be ISO-8859-1, but $mystring received from the database is UTF-8. Your fixEncoding function would then run the else clause, since $myxml as a whole is seen as ISO-8859-1 and not UTF-8. This double-encodes the results from the database and may be the cause of your problem.
Check the encoding of your actual source file in your editor, and verify that it is set to UTF-8. Alternatively, experiment with applying or removing fixEncoding/utf8_encode/utf8_decode to $myxml. Observe the results and see what needs to be done to get the value Otivägen right.
