MySQL VARCHAR(156) not storing 156 Multi-Byte Characters? - php

I have a multi-byte text of 156 characters encoded in UTF-8 format and verified by PHP function mb_strlen($text, 'UTF-8') to be of 156 length. I was expecting to be able to store all of it with VARCHAR(156). But a good portion of the text got truncated.
This is my original text:
위키백과, 백과사전.
대수(λ -, lambda -)는 함. 1930년대 다. 함수 s(x, y) = xx + 입력 x 것이다. x ↦ x 와 y
↦ y 는 변수의 이름은. 또한 (x, y) ↦ xx + yy 와 (u, v) ↦ uu + v*v 는.123456
This is what I got in MySQL:
위키백과, 백과사전.
대수(λ -, lambda -)는 함. 1930년대 다. 함수 s(x, y) = x*x +
ìž…ë ¥ x 것ì´ë‹¤. x ↦ x 와 y ↦ y 는 변수ì
This is what is generated upon querying on my web page:
위키백과, 백과사전.
대수(λ -, lambda -)는 함. 1930년대 다. 함수 s(x, y) = x*x + 입력 x 것이다. x ↦ x 와 y
↦ y 는 변수�
There is a similar question on Stack Overflow, but it does not seem to address my problem. Note that the table uses CHARSET=utf8, its collation has been changed to utf8_general_ci, and the column collation uses the table default. I am using MySQL version 5.5.14 with the system variables shown below:
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+--------------------------+----------------------------------------+
UPDATE:
After running mysqli_query($cxn, "SET NAMES utf8") in my PHP script, as suggested by Homer6, MySQL did store the full 156 characters, and the stored text matches my original.
But now what is generated on my web page becomes:
????, ????. ??(? -, lambda -)? ?. 1930?? ?. ?? s(x, y) = xx + ?? x
???. x ? x ? y ? y ? ??? ???. ?? (x, y) ? xx + yy ? (u, v) ? uu +
v*v ?.123456
Can anyone help me?

Can you try quadrupling the size to 624? I think the size is in bytes, not characters, and a UTF-8 character can take between 1 and 4 bytes.
See http://unicode.org/faq/utf_bom.html
Also, are you setting
SET NAMES 'utf8';
before you run your query?
Or, for Korean, what happens if you set the connection to the euckr character set (SET NAMES takes a character set name, not a collation name)
mysql_query('SET NAMES euckr');
before your query?
http://dev.mysql.com/doc/refman/5.1/en/charset-asian-sets.html

It depends on which version of MySQL you have. Before MySQL 4.1, the length is in bytes. From MySQL 4.1 onward, the length is in characters.
Also, the column needs to use a utf8 character set (for example with the utf8_unicode_ci collation) for MySQL to count the characters properly.

I'm pretty sure that mb_strlen returns the number of characters, not the size of the string in bytes.
UTF-8 uses 1 byte per ASCII character, but that is not true for other scripts. The text up to "1930" is about 45 characters long. The truncation makes sense if Korean characters take 3 bytes each (I think they do).
You must also explicitly set the character set to utf8; see http://dev.mysql.com/doc/refman/5.0/en/string-type-overview.html
You can alter the table with:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;
Run SHOW CREATE TABLE [TABLE_NAME]; to see what character set the column has. I.e. it should print out 'column_name' varchar(156) character set utf8 default NULL,
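To make the byte/character difference concrete, here is a minimal sketch using the first line of the question's text. It assumes the PHP source file itself is saved as UTF-8 and that the mbstring extension is loaded. The idea that the truncation came from the UTF-8 bytes being counted as single-byte characters is an assumption, although it is consistent with SET NAMES utf8 fixing the storage.

```php
<?php
// Sample taken from the first line of the question's text.
$text = '위키백과, 백과사전.';

// Number of characters when the bytes are decoded as UTF-8.
$chars = mb_strlen($text, 'UTF-8');

// Raw byte count. A single-byte charset such as latin1 would see
// this many "characters" for the same bytes.
$bytes = strlen($text);

echo "characters: $chars\n"; // 11 (8 Hangul syllables + 3 ASCII characters)
echo "bytes:      $bytes\n"; // 27 (8 * 3 bytes + 3 * 1 byte)
```

If the connection treats the data as a single-byte charset, a VARCHAR(156) column would fill up after roughly 156 bytes instead of 156 characters.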

Related

Different Variants of UTF-8 Comma? [，] [,] - CURL Response for MySQL Data

Prepping a Curl Response for particular data to be inserted into a MySQL Table.
Noticed some special characters in the saved data for certain URL's.
$curldata = curl_exec($curl);
$encoding = mb_detect_encoding($curldata);
brought back ASCII encoding.
Okay, don't want that.
The tables in my database are an InnoDB type with a utf8mb4_unicode_ci collation.
Added this to my curl options:
curl_setopt($curl, CURLOPT_ENCODING, 1);
And an iconv function based on the above mb_detect_encoding / $encoding variable upon save.
$curldata = iconv($encoding, "UTF-8", $curldata);
// save to file to test output
file_put_contents('test.html', $curldata);
Not sure if this is the best way to go about this, but my test.html output no longer has any encoding for special characters, so... (perhaps) mission accomplished.
As I parse through the data, I then notice this character.
，
Not an ordinary comma... [Comparison: ，/,]
But it acts like one. Try pressing Ctrl+F and searching for a comma. The search treats them as the same, and both are reported as UTF-8 characters - var_dump(mb_detect_encoding('，'));
I look at my table row, and see it as a row inserted as such
8，8
If I try to search for a , it does indeed bring back the instances where ， is present.
Vice versa, if I search for ， it brings back all instances where either that character or an ordinary comma occurs.
Basically for all intents and purposes it is a comma, yet obviously isn't.
This is of course workable, but rather annoying and feels riddled with inconsistency.
Can anyone explain why the two commas are the same, yet obviously different?
Is there a solution for me to prevent these odd characters from entering my CURL response, or further in within my DOM response and PDO Insert.
edit:
If relevant,
// dom
$dom = new DOMDocument('1.0', 'utf-8');
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML(mb_convert_encoding($curldata, 'HTML-ENTITIES', 'UTF-8'));
// pdo
$pdoquery = "INSERT INTO `table` (`Attr`) VALUES (?)";
$value = "8，8";
// reuse the query string instead of repeating it
$stmt = $pdo->prepare($pdoquery);
$stmt->execute([$value]);
edit 2:
Well, it appears to be a FULLWIDTH COMMA..
var_dump(utf8_to_unicode('，'));
string '%uff0c' (length=6)
var_dump(utf8_to_unicode(','));
string '%2c' (length=3)
Starting to make more sense... now to figure out how to prevent such characters from entering the curl response/DOM/database...
You might want the function mb_convert_kana which can convert characters of different widths into a uniform width.
$s = 'This is a string with ，, (commas having different widths)';
echo 'original : ', $s, PHP_EOL;
echo 'converted: ', mb_convert_kana($s, 'a');
result:
original : This is a string with ，, (commas having different widths)
converted: This is a string with ,, (commas having different widths)
PHP documentation: mb_convert_kana
To get an idea what the meaning is, see also http://unicode.org/reports/tr11-2/
By convention, 1/2 Em wide characters of East Asian legacy encodings are called "half-width" (or hankaku characters in Japanese), the others are called correspondingly "full-width" (or zenkaku) characters.
With a suitable COLLATION, the two commas are treated as equal:
mysql> SELECT '，' = ',' COLLATE utf8mb4_general_ci;
+----------------------------------------+
| '，' = ',' COLLATE utf8mb4_general_ci |
+----------------------------------------+
| 0 |
+----------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT '，' = ',' COLLATE utf8mb4_unicode_ci;
+----------------------------------------+
| '，' = ',' COLLATE utf8mb4_unicode_ci |
+----------------------------------------+
| 1 |
+----------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT '，' = ',' COLLATE utf8mb4_unicode_520_ci;
+--------------------------------------------+
| '，' = ',' COLLATE utf8mb4_unicode_520_ci |
+--------------------------------------------+
| 1 |
+--------------------------------------------+
1 row in set (0.00 sec)
It would be better to talk in terms of HEX, not unicode:
mysql> SELECT HEX('，'), HEX(',');
+------------+----------+
| HEX('，') | HEX(',') |
+------------+----------+
| EFBC8C | 2C |
+------------+----------+
1 row in set (0.00 sec)
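The same byte values can be checked from PHP before the data reaches the database. The sketch below is only an illustration: normalize_commas is a hypothetical helper name, and it handles just the fullwidth comma (U+FF0C), not other fullwidth punctuation.

```php
<?php
$fullwidth = "\u{FF0C}"; // FULLWIDTH COMMA
$ascii     = ',';        // ordinary comma, U+002C

echo bin2hex($fullwidth), "\n"; // efbc8c
echo bin2hex($ascii), "\n";     // 2c

// Hypothetical helper: replace the fullwidth comma with the ASCII comma.
function normalize_commas(string $s): string
{
    return str_replace("\u{FF0C}", ',', $s);
}

$clean = normalize_commas("8\u{FF0C}8");
echo $clean, "\n"; // 8,8
```

Running the value through such a helper before the PDO insert would keep only one kind of comma in the table. Whether that is desirable depends on the data, since fullwidth punctuation is legitimate in East Asian text.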

How to find the length of a chinese phrase in a MySQL database with SQL?

For example, this is my table, which is called example:
--------------------------
| id | en_word | zh_word |
--------------------------
| 1 | Internet| 互联网 |
--------------------------
| 2 | Hello | 你好 |
--------------------------
and so on...
And I tried using this SQL Query:
SELECT * FROM `example` WHERE LENGTH(`zh_word`) = 3
For some reason, it did not return the three-character words; instead it returned a lot of single-character entries.
Why is this? Can this be fixed? I tried this out in PhpMyAdmin.
But when I did it with JavaScript:
"互联网".length == 3; // true
And it seems to work fine. So how come it doesn't work?
You should use CHAR_LENGTH instead of LENGTH.
LENGTH() returns the length of the string measured in bytes.
CHAR_LENGTH() returns the length of the string measured in characters.
LENGTH returns length in bytes (and chinese is multibyte)
Use CHAR_LENGTH to get length in characters
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_char-length
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_length
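PHP has the same split between byte length and character length, which may help to see why LENGTH() = 3 matched single characters. A small sketch, assuming the source file is saved as UTF-8 and mbstring is available; strlen plays the role of LENGTH() and mb_strlen the role of CHAR_LENGTH():

```php
<?php
$words = ['互联网', '你好'];

foreach ($words as $word) {
    // strlen counts bytes, mb_strlen counts characters.
    printf("%s bytes=%d chars=%d\n", $word, strlen($word), mb_strlen($word, 'UTF-8'));
}

// Keep only the words that are exactly three characters long.
$three = array_values(array_filter($words, function ($w) {
    return mb_strlen($w, 'UTF-8') === 3;
}));

echo implode(', ', $three), "\n"; // 互联网
```

Each of these Chinese characters takes 3 bytes in UTF-8, so a byte length of 3 corresponds to a single character.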

Mysql string check on equals is false for the same values

I have a problem with MySQL.
I have a table with information parsed from websites, and I am seeing a strange string interpretation.
The query
select id, address from pagesjaunes_test where address = substr(address,1,length(address)-1)
returns a set of rows instead of none.
At the beginning I ran these functions:
address = replace(address, '\n', '')
address = replace(address, '\t', '')
address = replace(address, '\r', '')
address = replace(address, '\r\n', '')
address = trim(address)
but the problem still persists.
Some values of the 'address' field contain French characters, but the query also returned values that contain only English alphanumeric characters.
Another test: I checked the length of the strings, and strlen() in PHP and LENGTH() in MySQL give different results! Sometimes the difference is 2 characters, sometimes 1, without any apparent rule.
Visually, I can't see any spaces, tabs, or anything else.
After I edited an address manually (I deleted the whole string and typed it again), the problem went away, but I have about 6000 values, so this is not a solution :)
What can the problem be?
I suppose the strings might contain something like an "empty char", but how can I detect and remove it?
Thanks
P.S.
The problem is not just the length. I need to join this table with another one, using a condition that checks whether the values of the 'address' fields are equal. Even though the fields have the same collation and the tables have the same collation, the query reports that no addresses match.
E.g.
For query:
SELECT p.address,char_length(p.address) , r.address, char_length(r.address)
FROM `pagesjaunes_test` p
LEFT JOIN restaurants r on p.name=r.name
WHERE
p.postal_code=r.postal_code
and p.address!=r.address
and p.phone=''
and p.cuisines=''
LIMIT 10
So: p.address!=r.address
The result is:
+-------------------------------------+------------------------+--------------------------+------------------------+
| address | char_length(p.address) | address | char_length(r.address) |
+-------------------------------------+------------------------+--------------------------+------------------------+
| Dupin Marc13 quai Grands Augustins | 34 | 13 quai Grands Augustins | 24 |
| 39 r Montpensier | 16 | 39 r Montpensier | 16 |
| 8 r Lord Byron | 14 | 3 r Balzac | 10 |
| 162 r Vaugirard | 15 | 162 r Vaugirard | 15 |
| 32 r Goutte d'Or | 16 | 32 r Goutte d'Or | 16 |
| 2 r Casimir Périer | 18 | 2 r Casimir Périer | 18 |
| 20 r Saussier Leroy | 19 | 20 r Saussier Leroy | 19 |
| Senes Douglas22 r Greneta | 25 | 22 r Greneta | 12 |
| Ngov Ly Mey44 r Tolbiac | 23 | 44 r Tolbiac | 12 |
| 33 r N-D de Nazareth | 20 | 33 r N-D de Nazareth | 20 |
+-------------------------------------+------------------------+--------------------------+------------------------+
As you can see, "162 r Vaugirard" and "20 r Saussier Leroy" contain only ASCII characters and have the same length, but they are not equal!
Maybe have a look at the encoding of the MySQL text fields. UTF-8 encodes characters with 1 to 4 bytes; only a small subset (the ASCII characters) is encoded with one byte.
MySQL knows UTF-8 and counts correctly.
PHP's plain string functions such as strlen() aren't UTF-8 aware and count bytes instead.
So if PHP counts more than MySQL, this is probably the cause, and you could have a look at utf8_decode.
br from Salzburg!
The official documentation says:
Returns the length of the string str, measured in bytes. A multi-byte character counts as multiple bytes. This means that for a string containing five two-byte characters, LENGTH() returns 10, whereas CHAR_LENGTH() returns 5.
So, use CHAR_LENGTH instead :)
select id, address from pagesjaunes_test
where address = substr(address, 1, char_length(address) - 1)
Finally, I found the problem. After I changed the collation to ascii_general_ci, all non-ASCII characters were transformed to "?". Some spaces were also replaced with "?". When I checked the initial values, MySQL's ORD() function returned 160 (instead of 32) for these spaces. So,
UPDATE pagesjaunes_test SET address = TRIM(REPLACE(REPLACE(address, CHAR(160), ' '), '  ', ' '))
resolved my question.
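The same kind of invisible character can be spotted and cleaned on the PHP side. This is a sketch under assumptions: the sample string is made up, normalize_spaces is a hypothetical helper, and the PHP string is assumed to be UTF-8, where the no-break space (code point 160) is stored as the two bytes C2 A0.

```php
<?php
// Hypothetical sample: looks like "162 r Vaugirard", but the first gap is U+00A0.
$suspect = "162\u{00A0}r Vaugirard";
$plain   = "162 r Vaugirard";

var_dump($suspect === $plain); // bool(false)
echo bin2hex($suspect), "\n";  // the hex dump contains c2a0 where the first gap is

// Hypothetical helper: turn no-break spaces into ordinary spaces,
// collapse runs of spaces, and trim both ends.
function normalize_spaces(string $s): string
{
    $s = str_replace("\u{00A0}", ' ', $s);
    return trim(preg_replace('/ {2,}/', ' ', $s));
}

var_dump(normalize_spaces($suspect) === $plain); // bool(true)
```

A hex dump is usually the quickest way to find such characters, because they render exactly like a normal space.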

PHP TZ setting length

I was wondering what the maximum length of the timezone setting in PHP is. I'm storing the string in a database and would like to keep the length as short as possible, but I see timezones like "Canada/East-Saskatchewan", which go beyond our current limit.
If I could just get a list of all the supported timezone string, I can sort them, but they are currently split on to several different pages.
linky: http://www.php.net/manual/en/timezones.php
Edit (June 2021): The answer is 64. Why? That's the width of the column used in MySQL to store those timezone name strings.
The zoneinfo database behind those time zone strings just added new prefixes. To America/Argentina/ComodRivadavia, formerly the longest timezone string, they added posix/America/Argentina/ComodRivadavia and right/America/Argentina/ComodRivadavia, both with a length of 38. This is up from 32, the previous longest string.
And here is the complete PHP code to find that:
<?php
$timezone_identifiers = DateTimeZone::listIdentifiers();
$maxlen = 0;
foreach ($timezone_identifiers as $id) {
    if (strlen($id) > $maxlen) {
        $maxlen = strlen($id);
    }
}
echo "Max Length: $maxlen";
/*
Output:
Max Length: 32
*/
?>
The Olson database — available from ftp://ftp.iana.org/tz/releases/ or http://www.iana.org/time-zones (but see also http://www.twinsun.com/tz/tz-link.htm* and http://en.wikipedia.org/wiki/Tz_database) — is the source of these names. The documentation in the file Theory includes a description of how the zone names are formed. This would help you establish how long names can be.
The longest 'current' names are 30 characters (America/Argentina/Buenos_Aires,
America/Argentina/Rio_Gallegos, America/North_Dakota/New_Salem); the longest 'backwards compatibility' name is 32 characters (America/Argentina/ComodRivadavia).
* Note that the TwinSun site has not been updated for some time and has some outdated links (such as suggesting that the Olson database is available from ftp://ftp.elsie.nci.nih.gov — it is now available from IANA instead).
From the manual:
<?php
$timezone_identifiers = DateTimeZone::listIdentifiers();
for ($i = 0; $i < 5; $i++) {
    echo "$timezone_identifiers[$i]\n";
}
?>
If you are using MySQL with PHP, consider that MySQL already stores Olson timezone names in the "mysql" system database, in a table called "time_zone_name", in a column called "name". One option is to choose the length of your column to match the length that MySQL uses. In MySQL 5.7, the timezone name length is 64 characters. To determine the length in use on your MySQL installation, follow the example below:
mysql> use mysql
Database changed
mysql> describe time_zone_name;
+--------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------+------+-----+---------+-------+
| Name | char(64) | NO | PRI | NULL | |
| Time_zone_id | int(10) unsigned | NO | | NULL | |
+--------------+------------------+------+-----+---------+-------+
2 rows in set (0.01 sec)
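To compare what your own PHP build knows against that column width, something like the following could be used. It is only a sketch: the 64 is copied from the char(64) column shown above, and DateTimeZone::ALL_WITH_BC is passed so that backwards-compatible names are included as well. The actual longest name depends on the timezone database bundled with your PHP version.

```php
<?php
// Width of mysql.time_zone_name.Name as shown in the DESCRIBE output.
$columnWidth = 64;

// Include backwards-compatible identifiers too.
$ids = DateTimeZone::listIdentifiers(DateTimeZone::ALL_WITH_BC);

$longest = '';
foreach ($ids as $id) {
    if (strlen($id) > strlen($longest)) {
        $longest = $id;
    }
}

printf("Longest identifier: %s (%d bytes)\n", $longest, strlen($longest));
printf("Fits in %d characters: %s\n", $columnWidth, strlen($longest) <= $columnWidth ? 'yes' : 'no');
```

Timezone identifiers are plain ASCII, so strlen() is enough here and no multi-byte function is needed.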

Utf-8 characters displayed as ISO-8859-1

I've got an issue with inserting/reading utf8 content from a db. All verifications I'm doing seem to point to the fact that the content in my DB should be utf8 encoded, however it seems to be latin encoded. The data are initially imported from a PHP script from the CLI.
Configuration:
Zend Framework Version: 1.10.5
mysql-server-5.0: 5.0.51a-3ubuntu5.7
php5-mysql: 5.2.4-2ubuntu5.10
apache2: 2.2.8-1ubuntu0.16
libapache2-mod-php5: 5.2.4-2ubuntu5.10
Verifications:
-mysql:
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-----------------+
| Variable_name | Value |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8_bin |
| collation_server | utf8_general_ci |
+----------------------+-----------------+
-database
created with
CREATE DATABASE mydb CHARACTER SET utf8 COLLATE utf8_bin;
CREATE SCHEMA `mydb` DEFAULT CHARACTER SET utf8 COLLATE utf8_bin ;
mysql> status;
--------------
mysql Ver 14.12 Distrib 5.0.51a, for debian-linux-gnu (i486) using readline 5.2
Connection id: 7
Current database: mydb
Current user: root@localhost
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 5.0.51a-3ubuntu5.7-log (Ubuntu)
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: utf8
Db characterset: utf8
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 9 min 45 sec
-sql: before doing my inserts I run the
SET names 'utf8';
-php: before doing my inserts I use utf8_encode() and mb_detect_encoding(), which gives me 'UTF-8'. After retrieving the content from the db and before sending it to the user, mb_detect_encoding() also gives 'UTF-8'.
Validation test:
the only way for me to have the content displayed properly is to set the content type to latin (If I sniff the traffic I can see the content-type header with ISO-8859-1):
ini_set('default_charset', 'ISO-8859-1');
This test shows that the content comes out as latin. I don't understand why.
Does anybody have any idea?
Thanks.
Well, I've found that SET NAMES isn't really all that great. Take a peek at the docs...
What I typically do is execute 4 queries:
SET CHARACTER SET 'UTF8';
SET character_set_database = 'UTF8';
SET character_set_connection = 'UTF8';
SET character_set_server = 'UTF8';
Give that a shot and see if that does it for you...
Oh, and remember, all UTF-8 characters <= 127 are valid ISO-8859-1 characters as well. So if you only have characters <= 127 in the stream, mb_detect_encoding cannot tell the two apart and will fall back to the higher-prevalence charset (which is by default "UTF-8")...
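Because detection is ambiguous for such strings, it can be more reliable to look at the bytes directly. The sketch below is an illustration only; is_valid_utf8 is a hypothetical helper that relies on PCRE's /u modifier rejecting invalid UTF-8 input.

```php
<?php
// Hypothetical helper: true if $s is a valid UTF-8 byte sequence.
function is_valid_utf8(string $s): bool
{
    // With the /u modifier, preg_match fails on bytes that are not valid UTF-8.
    return preg_match('//u', $s) === 1;
}

$ascii  = 'abc';      // identical bytes in UTF-8 and ISO-8859-1
$utf8   = "\xC3\xA9"; // "é" encoded as UTF-8
$latin1 = "\xE9";     // "é" encoded as ISO-8859-1

var_dump(is_valid_utf8($ascii));  // bool(true)
var_dump(is_valid_utf8($utf8));   // bool(true)
var_dump(is_valid_utf8($latin1)); // bool(false)

echo bin2hex($utf8), ' vs ', bin2hex($latin1), "\n"; // c3a9 vs e9
```

Dumping the stored value with bin2hex() (or HEX() in MySQL) shows which of the two encodings actually ended up in the column.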
What are you doing before retrieval? Also a 'SET NAMES utf8;'? Otherwise, MySQL will silently convert to the charset the connection indicates as used.
If that does not help either, what does SHOW FULL COLUMNS FROM table; show? A table having a default charset does not mean that every column uses it. For example, this is valid:
CREATE TABLE test (
`name` varchar(10) character set latin1
) CHARSET=utf8
