Unicode Characters with MySQL - PHP

I have set my charset to "utf8 turkish ci" in my MySQL database, because I will store some Turkish characters in my project. I can enter Turkish characters properly and see them. But my problem is this:
For example, I define "username" as varchar(20) and the maxlength of the input box is 20. That means a user can't write a username longer than 20 characters. But when the user enters Turkish unicode characters (like ş, i, ü, ğ) I get a "Data too long for column 'username'" error, because those unicode characters are 2 bytes long!
I tried to update my database with phpMyAdmin, but updating the length brings more errors. So do I have to drop all the tables and recreate them with double the length (I mean, if the data will be 20 characters, I define the column as varchar(40))? I have 30 tables and that would be a nightmare. Is there anything else I can do?

MySQL will allow up to 3 bytes to store a character in a VARCHAR specified as utf8 (or up to 4 bytes for utf8mb4).
VARCHAR(10) actually does mean 10 characters, up to 30 bytes. It doesn't mean 10 bytes.

I suspect your <form> needs to include the charset: <form accept-charset="UTF-8">
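A minimal sketch of the usual fix, assuming a hypothetical users table and placeholder credentials: convert the column to a real UTF-8 character set (so the 20 means 20 characters, not bytes), set the connection charset, and validate in characters on the PHP side:
<?php
// Sketch only: table/column names and credentials are placeholders.
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');
$mysqli->set_charset('utf8mb4');   // the connection charset must match too

// One-time schema change instead of doubling every column length:
$mysqli->query("ALTER TABLE users MODIFY username VARCHAR(20)
                CHARACTER SET utf8mb4 COLLATE utf8mb4_turkish_ci");

// Count characters, not bytes, when enforcing the 20-character limit:
$username = $_POST['username'] ?? '';
if (mb_strlen($username, 'UTF-8') > 20) {
    die('Username must be at most 20 characters.');
}

$stmt = $mysqli->prepare('INSERT INTO users (username) VALUES (?)');
$stmt->bind_param('s', $username);
$stmt->execute();
?>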

Related

How to compare Telegram/MySQL?

I send the Russian alphabet with an inline keyboard, and in callback_data I pass the letter that the user selected.
But Telegram returns this letter to me as \xd0\xb3.
I also save the word to compare against in a MySQL DB. It comes back as \u0438\u043c\u043f\u0435\u0440\u0430\u0442\u0438\u0432. The encoding in the database is utf8_general_ci.
And as a result, I need to check if the selected letter is in the word from the database. How can I do that?
MySQL never generates \u0438, a Unicode representation. It will generate the 2-byte character whose hex is D0B3 (which might show as \xd0\xb3), specifically a Cyrillic character. And you should provide that format when INSERTing into a MySQL table.
PHP's json_encode will generate the \uXXXX form unless you pass JSON_UNESCAPED_UNICODE in the second argument, in which case it emits the raw UTF-8 bytes.
To check the database, do something like:
SELECT col, HEX(col) ...
If "correct" you should get something like
г D0B3
(That's a Cyrillic GHE, not a latin r.)
Who knows what telegram is doing to the data. There are over a hundred packages that use MySQL under the covers; I don't know anything about this one.
Terminology: The encoding is utf8 (or could be utf8mb4). The collation, according to what you say, is utf8_general_ci. Encoding is relevant to the question; collation has to do with the ordering of strings in comparisons and sorting.
Another example: Cyrillic small letter I, и = utf8 hex D0B8 = Unicode codepoint U+0438.
HTML is quite happy with Unicode codepoints; it will show и when given &#x438;. Perhaps Telegram is converting to codepoints as it builds the web page?
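A hedged PHP sketch of the comparison (the variable names are mine, and I am assuming the DB value really arrives as a JSON-escaped string): decode the \uXXXX escapes so both sides are raw UTF-8 bytes, then search:
<?php
// Assumption: Telegram hands you raw UTF-8 bytes; "\xd0\xb8" is just how
// the 2-byte Cyrillic letter и prints when escaped.
$callbackLetter = "\xd0\xb8";

// Assumption: the value read from the DB layer arrives JSON-escaped.
// json_decode() turns \uXXXX escapes back into raw UTF-8 bytes.
$word = json_decode('"\u0438\u043c\u043f\u0435\u0440\u0430\u0442\u0438\u0432"');
// $word is now the plain UTF-8 string "императив"

// Both sides are plain UTF-8, so the comparison is straightforward:
if (mb_strpos($word, $callbackLetter, 0, 'UTF-8') !== false) {
    echo "The selected letter is in the word\n";
}

// Going the other way, JSON_UNESCAPED_UNICODE keeps raw UTF-8 on output:
echo json_encode($word, JSON_UNESCAPED_UNICODE), "\n"; // "императив"
?>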

Collation with Least Amount of Characters *Necessary* for Hashed Passwords

I'm trying to figure out the collation I should use for simple user tables that only contain two columns, email and password, whereby the input for password will be the output of password_hash($str, PASSWORD_DEFAULT).
What is the lightest-weight collation necessary for password_hash? Is it ascii_bin? latin1_bin?
Collation performance...
..._bin has the least to do, so those collations are the fastest.
ascii_... only has to check that you are using 7-bit characters, so it is quite fast.
..._general_ci looks at only one character at a time, never combinations of characters. Example: in ..._general_ci, German ß <> 'ss', unlike in most other collations.
utf8_... and utf8mb4_... check the bytes for valid encodings.
Meanwhile, MySQL 8.0 has made the utf8mb4_... collations "orders of magnitude faster" than 5.7.
But I usually find that other considerations are more important in any operation in MySQL.
Another example of that... SELECT ... function(foo) ... -- The cost of evaluating the function is usually insignificant relative to the cost of fetching the row. So, I focus on how to optimize fetching the row(s).
As for hashes, ... It depends on whether the function returns a hexadecimal string or a bunch of bytes...
Hex: Use CHARACTER SET ascii COLLATE ascii_bin (or ascii_general_ci). The ..._ci version will do case folding and thereby be more forgiving; it is probably the 'right' collation for this case.
Bytes: Use the datatype BINARY, which is roughly equivalent to CHAR CHARACTER SET binary.
As for whether to use BINARY versus VARBINARY or CHAR versus VARCHAR, that should be controlled by whether the function returns a fixed length result. For example:
MD5('asdfb') --> '23c42e11237c24b5b4e01513916dab4a' returns exactly 32 hex characters, so CHAR(32) CHARACTER SET ascii COLLATE ascii_general_ci is 'best'.
But, you can save space by using BINARY(16) (no collation) and put UNHEX(MD5('asdfb')) into it.
UUID() --> '161b6a10-e17f-11e8-bcc6-80fa5b3669ce', which has some dashes to get rid of: keep them and use CHAR(36), or strip them, UNHEX, and use BINARY(16).
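Putting that together for the original password_hash() question, a sketch (the users table is hypothetical): PASSWORD_DEFAULT currently produces a 60-character ASCII bcrypt string, but the PHP manual recommends leaving room for longer algorithms, hence VARCHAR(255). ascii_bin rather than ascii_general_ci here, because bcrypt output is case-sensitive:
<?php
// Sketch: `users` is a hypothetical table; credentials are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4',
               'user', 'pass');

$pdo->exec("CREATE TABLE IF NOT EXISTS users (
    email    VARCHAR(255) NOT NULL PRIMARY KEY,
    password VARCHAR(255) CHARACTER SET ascii COLLATE ascii_bin NOT NULL
)");

$hash = password_hash('s3cret', PASSWORD_DEFAULT);  // ASCII-only output
$stmt = $pdo->prepare('INSERT INTO users (email, password) VALUES (?, ?)');
$stmt->execute(['user@example.com', $hash]);
?>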

Optimizing MySQL table for variety of VarChar length between 20 and 4,000 characters

I plan to store strings that have a maximum length of 4,500 characters in a VARCHAR, but MOST of the entries will be under 200 characters. Is MySQL smart enough to optimize for this?
My current solution is to use 5 tables (data_small, data_medium, data_large, etc.) and insert based on the length of the string. The other solution would be to save files to disk, which would mean a second hit to the database but result in a smaller return.
MySQL will do fine, as would nearly every RDBMS for that matter. When you declare a field as CHAR(), the full declared width is always stored, regardless of how many characters are in your string. For instance: if you have a CHAR(64) field and insert 'ABCD', the field still occupies 64 bytes (assuming a single-byte character set).
When using VARCHAR(), however, the cell only uses as many bytes as are in the string, plus the bytes necessary to store the string's length. So: if you have VARCHAR(64) and insert 'ABCD', you will only use 5 bytes: 4 for the characters 'ABCD' and one for the length, 4.
Your extremely varying string lengths are exactly the reason we have VARCHAR(), so feel free to use VARCHAR(4500) and rest assured you will only use as much space as necessary to store the characters in the string, plus a little extra for the length. A sketch of the resulting single-table design follows below.
Somewhat related: this is why it's not a great idea to use VARCHAR() for fields whose values do not vary in length. You waste space storing the size of the string when it is already known. For instance, telephone numbers of the form x-xxx-xxx-xxxx should just use CHAR(14), since they always take up 14 characters and only 14 bytes are necessary. If you used VARCHAR(14) instead, you would actually end up using 15 bytes.
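As promised above, a sketch of the single-table design this implies (table/column names are hypothetical): one VARCHAR(4500) column replaces the five size-bucketed tables, since each row only pays for its own length:
<?php
// Sketch: table/column names and credentials are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4',
               'user', 'pass');
$pdo->exec("CREATE TABLE IF NOT EXISTS data (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    body VARCHAR(4500) NOT NULL
)");

// A 20-character row and a 4,000-character row share the same column;
// the short row does not pay for the declared 4,500-character maximum.
$stmt = $pdo->prepare('INSERT INTO data (body) VALUES (?)');
$stmt->execute([str_repeat('x', 20)]);    // ~22 bytes on disk
$stmt->execute([str_repeat('x', 4000)]);  // ~4,002 bytes on disk
?>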

mysql max length on columns and ensuring I don't go over that limit using utf8_unicode_ci - PHP?

I am using a TEXT column which is utf8_unicode_ci in MySQL to store some data that is scraped from around the internet.
The texts that are gathered are from various sites in different languages.
I am getting confused with the max length of 65535 bytes for a TEXT column.
How can I check that the strings I am inserting into the column do not go over that limit?
At the moment I am using strlen($str) to check the length of the strings, but does this ensure that the data will not be truncated to fit into the column? As I understand it, utf8_unicode_ci can use more than 1 byte per character.
EDIT: The OP can simply use strlen() as it returns bytes, not characters. Witness:
$ cat test.php
#!/usr/bin/php -q
<?php
echo strlen("דותן כהן")."\n";
echo mb_strlen("דותן כהן", "UTF-8")."\n";
?>
$ ./test.php
15
8
Credit goes to deceze in a comment to this post.
Old post below:
The notes in the PHP manual have a handy function for determining how many bytes are in a string. It seems to be the only alternative to using MySQL's built-in functions such as LENGTH() to do the job, which would be cumbersome here.
There are two other possible workarounds. Firstly, you can write the string to a file and check the file's size. Secondly, you can force the ASCII encoding on mb_strlen, and then it will treat each byte as a character, so the number of characters it returns is actually the number of bytes. I haven't tested this, so check it first. Let us know what works for you! (If you need to truncate rather than just measure, see the mb_strcut sketch below.)
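If the goal is not just to measure but to make the string fit, a sketch using mb_strcut(), which cuts on a byte budget without splitting a multi-byte character (the function name fitIntoText is mine):
<?php
// Sketch: clamp a string to the 65,535-byte TEXT limit without cutting
// a UTF-8 character in half. mb_strcut() measures in bytes but respects
// character boundaries.
function fitIntoText(string $str, int $maxBytes = 65535): string
{
    if (strlen($str) <= $maxBytes) {   // strlen() counts bytes
        return $str;
    }
    return mb_strcut($str, 0, $maxBytes, 'UTF-8');
}

$big = str_repeat("דותן כהן ", 5000);   // 80,000 bytes
echo strlen(fitIntoText($big)), "\n";   // <= 65535, no broken characters
?>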
Check out MySQL's LENGTH() function:
Returns the length of the string str, measured in bytes. A multi-byte
character counts as multiple bytes. This means that for a string
containing five two-byte characters, LENGTH() returns 10, whereas
CHAR_LENGTH() returns 5.
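A quick sketch of letting MySQL do the measuring itself (placeholder credentials; the Hebrew sample string is reused from the test above):
<?php
// Sketch: placeholder credentials; MySQL reports both counts directly.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4',
               'user', 'pass');
$stmt = $pdo->prepare('SELECT LENGTH(?) AS bytes, CHAR_LENGTH(?) AS chars');
$stmt->execute(['דותן כהן', 'דותן כהן']);
print_r($stmt->fetch(PDO::FETCH_ASSOC)); // ['bytes' => 15, 'chars' => 8]
?>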

Text datatype in mysql is truncated at 32k

I have a PHP script that inserts articles into a MySQL DB.
The field type in MySQL is TEXT.
Now, when I insert an article larger than 32K, it is truncated to 32K. As far as I know, the max size of a TEXT column in MySQL is 64K.
PS: MySQL version is 5.0.51a-24+lenny5
PHP version is PHP 5.3.2-1ubuntu4.9
MySQL: max_allowed_packet=16M
Any idea why MySQL truncates it, or how to fix it?
** EDIT **
My character set is utf8.
By selecting the HEX() of this field I got 65768 hex digits, and since every two hex digits represent one byte, the actual size here is 65768/2 = 32884 bytes:
mysql> select length(hex(body)), length(body) from articles where article_id=62727;
+-------------------+--------------+
| length(hex(body)) | length(body) |
+-------------------+--------------+
|             65768 |        32884 |
+-------------------+--------------+
Thanks for your help
The TEXT type has a maximum length of 64K bytes, not characters, so if you use a character set that needs more than one byte per character, this error can occur.
Since in your case the string is always truncated around 32K, it looks like you are using a character set, such as UTF-16, that requires two bytes per character.
You have two possibilities:
Use a single-byte character set
Use a larger column type (such as MEDIUMTEXT)
From: http://dev.mysql.com/doc/refman/5.0/en/string-type-overview.html
A TEXT column with a maximum length of 65,535 (2^16 – 1) characters.
The effective maximum length is less if the value contains multi-byte characters.
So the maximum length of 64K characters is only possible with a single-byte charset such as latin1. UTF-8, for example, can use more than one byte to encode a character, and thus less text can be stored in 2^16 bytes.
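If the articles genuinely need more than 64K bytes, a sketch of the larger-column route, reusing the articles/body names from the question (credentials are placeholders):
<?php
// Sketch: placeholder credentials; articles/body come from the question.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');

// MEDIUMTEXT holds up to 16 MB instead of TEXT's 64 KB:
$pdo->exec('ALTER TABLE articles MODIFY body MEDIUMTEXT CHARACTER SET utf8');

// Re-check the row that was being truncated:
$row = $pdo->query('SELECT LENGTH(body) AS bytes, CHAR_LENGTH(body) AS chars
                    FROM articles WHERE article_id = 62727')
           ->fetch(PDO::FETCH_ASSOC);
printf("%d bytes, %d characters\n", $row['bytes'], $row['chars']);
?>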
