I'm trying to figure out the collation I should use for simple user tables that only contain two columns, email and password, whereby the input for password will be the output of password_hash($str, PASSWORD_DEFAULT).
What is the lightest-weight collation necessary for the output of password_hash? Is it ascii_bin? latin1_bin?
Collation performance...
The ..._bin collations have the least work to do, so they are the fastest.
ascii_... checks to see if you are using only 7 bits; so quite fast.
..._general_ci compares single characters only, never combinations of characters. Example: German ß <> 'ss' in ..._general_ci, unlike in most other collations.
utf8_... and utf8mb4_... check the bytes for valid encodings.
Meanwhile, MySQL 8.0 has made the utf8mb4_... collations "orders of magnitude faster" than 5.7.
But I usually find that other considerations are more important in any operation in MySQL.
Another example of that... SELECT ... function(foo) ... -- The cost of evaluating the function is usually insignificant relative to the cost of fetching the row. So, I focus on how to optimize fetching the row(s).
As for hashes, ... It depends on whether the function returns a hexadecimal string or a bunch of bytes...
Hex: Use CHARACTER SET ascii COLLATE ascii_bin (or ascii_general_ci). The ..._ci will do case folding and thereby be more forgiving; that is probably the 'right' collation for this case.
Bytes: Use the datatype BINARY; which is roughly equivalent to CHAR CHARACTER SET binary.
As for whether to use BINARY versus VARBINARY or CHAR versus VARCHAR, that should be controlled by whether the function returns a fixed length result. For example:
MD5('asdfb') --> '23c42e11237c24b5b4e01513916dab4a' returns exactly 32 hex characters, so CHAR(32) CHARACTER SET ascii COLLATE ascii_general_ci is 'best'.
But, you can save space by using BINARY(16) (no collation) and put UNHEX(MD5('asdfb')) into it.
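A sketch of that space saving (the table and column names are illustrative):

```sql
-- Hypothetical table: store the 16 raw digest bytes instead of 32 hex characters
CREATE TABLE digests (digest BINARY(16));
INSERT INTO digests VALUES (UNHEX(MD5('asdfb')));
-- HEX() converts back to the familiar 32-character form when needed
SELECT HEX(digest) FROM digests;
```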
UUID() --> '161b6a10-e17f-11e8-bcc6-80fa5b3669ce', which has some dashes to get rid of. With the dashes it needs CHAR(36); strip them and UNHEX() the rest and it fits in BINARY(16).
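Tying this back to the original question: password_hash($str, PASSWORD_DEFAULT) currently returns a 60-character ASCII string (bcrypt), but the PHP manual recommends allowing for 255 characters because the default algorithm can change in future versions. One possible schema (names are illustrative):

```sql
CREATE TABLE users (
  email         VARCHAR(255) NOT NULL,
  -- password_hash() output is pure ASCII and compared verbatim,
  -- so the cheap ascii_bin collation is all that is needed
  password_hash VARCHAR(255) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  PRIMARY KEY (email)
);
```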
Related
What data type should I use in a MySQL database to store two text files of code, if I intend to compare their similarity later?
It's a MySQL database running on my Windows machine.
Also, can you recommend an API that can compare code for me?
As per the MySQL documentation:
Values in VARCHAR columns are variable-length strings. The length can be specified as a value from 0 to 65,535. The effective maximum length of a VARCHAR is subject to the maximum row size (65,535 bytes, which is shared among all columns) and the character set used.
...
Values in CHAR and VARCHAR columns are sorted and compared according to the character set collation assigned to the column.
So, VARCHAR is stored inline with the table, whilst BLOB and TEXT types are stored off the table, with the database holding only the location of the data. Depending on how long your text is, TEXT can be declared as TINYTEXT, TEXT, MEDIUMTEXT, or LONGTEXT; the only difference between them is the maximum amount of data each holds:
TINYTEXT 255 bytes
TEXT 65,535 bytes
MEDIUMTEXT 16,777,215 bytes
LONGTEXT 4,294,967,295 bytes
To compare the two strings stored in TEXT (or any other string column) you might want to use STRCMP(expr1,expr2)
STRCMP() returns 0 if the strings are the same, -1 if the first argument is smaller than the second according to the current sort order, and 1 otherwise.
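A quick sketch of those return values:

```sql
SELECT STRCMP('text', 'text');    -- 0: identical
SELECT STRCMP('apple', 'banana'); -- -1: first sorts before second
SELECT STRCMP('banana', 'apple'); -- 1: first sorts after second
```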
If you specify the desired output of the comparison, I can edit the answer.
EDIT
To compare two strings and calculate the difference percentage, you might want to use similar_text. As the official documentation states:
This calculates the similarity between two strings as described in Programming Classics: Implementing the World's Best Algorithms by Oliver (ISBN 0-131-00413-1). Note that this implementation does not use a stack as in Oliver's pseudo code, but recursive calls which may or may not speed up the whole process. Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string.
I plan to store strings that have a maximum size of VARCHAR(4500), but MOST of the entries will be under 200 characters. Is MySQL smart enough to optimize for this?
My current solution is to use 5 tables, data_small, data_medium, data_large, etc and insert based on the length of the string. The other solution would be to save files to disk, which would mean a second hit to the database, but result in a smaller return.
MySQL will do fine, as would almost any RDBMS. When you declare a field as CHAR(N), the full N characters are always stored, regardless of how many characters are in your string. For instance: if you have a CHAR(64) field and you insert 'ABCD', the field still occupies 64 bytes (assuming a single-byte character set).
When using VARCHAR(), however, the cell only uses as many bytes as are in the string, plus one or two bytes to store the length of the string. So: if you have VARCHAR(64) and insert 'ABCD', you will only use 5 bytes: 4 for the characters 'ABCD' and one for the length, 4.
Your extremely varying string lengths are exactly the reason we have VARCHAR(), so feel free to use VARCHAR(4500) and rest assured you will only be using as much space as necessary to store the characters in the string, and a little bit extra for the length.
Somewhat related: this is why it's not a great idea to use VARCHAR() for fields whose values never vary in length. You waste space storing the size of the string when it's already known. For instance, telephone numbers of the form x-xxx-xxx-xxxx should just use CHAR(14), since they always take up 14 characters and only 14 bytes are necessary. If you used VARCHAR(14) instead, you would actually end up using 15 bytes.
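A sketch of both choices in one table (byte counts assume a single-byte character set; names are illustrative):

```sql
CREATE TABLE contacts (
  phone CHAR(14),     -- fixed: always 14 bytes, e.g. '1-555-123-4567'
  note  VARCHAR(4500) -- variable: 'ABCD' costs 4 + 2 bytes
                      -- (2-byte length prefix, since the max exceeds 255 bytes)
);
```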
I am using a TEXT column with the utf8_unicode_ci collation in MySQL to store some data that is scraped from around the internet.
The texts that are gathered are from various sites in different languages.
I am getting confused with the max length of 65535 bytes for a TEXT column.
How can I check that the strings I am inserting into the column do not go over that limit?
At the minute I am using strlen($str) to check the length of the strings, but does that guarantee the data will not be truncated to fit into the column? As I understand it, utf8_unicode_ci can use more than 1 byte per character.
EDIT: The OP can simply use strlen() as it returns bytes, not characters. Witness:
$ cat test.php
#!/usr/bin/php -q
<?php
echo strlen("דותן כהן")."\n";
echo mb_strlen("דותן כהן", "UTF-8")."\n";
?>
$ ./test.php
15
8
Credit goes to deceze in a comment to this post.
Old post below:
The notes in the PHP manual have a handy function for determining how many bytes are in a string. It seems to be the only alternative to using MySQL built-in functions such as LENGTH() to do the job, which would be cumbersome here.
There are two other possible workarounds. Firstly, you can write the string to a file and check the file's size. Secondly, you can force the ASCII encoding on mb_strlen and then it will treat each byte as a character, so the amount of characters that it returns is actually the amount of bytes. I haven't tested this, so check it first. Let us know what works for you!
Check out the MySQL function LENGTH():
Returns the length of the string str, measured in bytes. A multi-byte character counts as multiple bytes. This means that for a string containing five 2-byte characters, LENGTH() returns 10, whereas CHAR_LENGTH() returns 5.
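A sketch of the difference on multi-byte data, reusing the Hebrew string from the PHP test above (assumes a utf8/utf8mb4 connection):

```sql
SELECT LENGTH('דותן כהן');      -- 15: seven 2-byte Hebrew letters + one 1-byte space
SELECT CHAR_LENGTH('דותן כהן'); -- 8: characters
```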
Can we declare a variable with fixed length in PHP?
I'm not asking about trimming or conditionally taking a substring.
Can we declare a variable just like the database type CHAR(10)?
The reason I'm asking: I'm doing an export process in which PHP exports data to the DB.
In the DB I have a field with size 100, and from PHP I'm passing a value with length 25.
When I look in the DB, that field shows some extra spaces.
Maybe it's your database that is the problem.
The CHAR datatype always pads the unused remainder of the field with spaces when storing data. If you have CHAR(3) and pass 'hi', it will be stored as 'hi '. This is true for a lot of relational database engines (MySQL, Postgres, DB2, etc.).
This is why some database engines also have the VARCHAR datatype (which is variable, like the name says). This one doesn't pad the content with spaces if the data stored in isn't long enough.
In most cases, you are looking for the VARCHAR datatype. CHAR is mostly useful when you store codes, etc. that always have the same length (e.g.: a CHAR(3) field for storing codes like ADD, DEL, CHG, FIX, etc.).
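A sketch of the padding behavior (note that MySQL strips the trailing spaces again when reading a CHAR column, while DB2 returns them, which is likely what the questioner is seeing):

```sql
CREATE TABLE codes (code CHAR(3));
INSERT INTO codes VALUES ('hi');
-- Stored padded to 3 characters; the brackets make any
-- returned trailing padding visible
SELECT CONCAT('[', code, ']') FROM codes;
```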
No, a string in PHP is always variable length. You could trim the string to see if extra space is still passed to your DB.
Nope. PHP has no provision to limit string size.
You could simulate something in an object using setter and getter variables, though, throwing an error (or cutting off the data) if the incoming value is larger than allowed.
No, but I really don't think you're having a problem with PHP. I think you should check your DB2 configuration; perhaps it automatically pads strings with spaces. How many spaces are added? Are they added before or after the value?
As others have said: No.
I don't understand how it would help anyway. I'm not familiar with DB2, but it sounds like the extra spaces either come in with the variable (and thus it should be trimmed) or DB2 pads the value out to 100 characters. If your input is only 25 characters long and DB2 is doing space padding, it will pad no matter what PHP does.
If you want to store variable length strings in DB2 then go with VARCHAR, if you always want the same length for each string in the column, define the exact length using CHAR (for postal codes, for instance).
Details on character strings is available here: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.sql.ref.doc/doc/r0008470.html with a good summary:
Fixed-length character string (CHAR)
All values in a fixed-length string column have the same length, which is determined by the length attribute of the column. The length attribute must be between 1 and 254, inclusive.
Varying-length character strings
There are two types of varying-length character strings:
A VARCHAR value can be up to 32,672 bytes long.
A CLOB (character large object) value can be up to 2 gigabytes minus 1 byte (2,147,483,647 bytes) long.
Of course it then gets more detailed, depending on what sort of encoding you're using, etc... ( like UTF-16 or UTF-32 )
One of the things that always worries me in MySQL is that my string fields will not be large enough for the data that need to be stored. The PHP project I'm currently working on will need to store strings, the lengths of which may vary wildly.
Not being familiar with how MySQL stores string data, I'm wondering if it would be overkill to use a larger data type like TEXT for strings that will probably often be less than 100 characters. What does MySQL do with highly variable data like this?
See this: http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
VARCHAR(M), VARBINARY(M): L + 1 bytes if column values require 0–255 bytes; L + 2 bytes if values may require more than 255 bytes
BLOB, TEXT: L + 2 bytes, where L < 2^16
So in the worst case, you're using 1 byte per table cell more when using TEXT.
As for indexing: you can create a normal index on a TEXT column, but you must give a prefix length - e.g.
CREATE INDEX part_of_name ON customer (name(10));
and moreover, TEXT columns allow you to create and query fulltext indexes if using the MyISAM engine.
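A sketch of both index types on a TEXT column (names are illustrative; note that as of MySQL 5.6, InnoDB supports FULLTEXT indexes as well):

```sql
CREATE TABLE articles (body TEXT) ENGINE=MyISAM;

-- Prefix index: only the first 10 characters are indexed
CREATE INDEX idx_body_prefix ON articles (body(10));

-- Full-text index and a query that uses it
CREATE FULLTEXT INDEX idx_body_ft ON articles (body);
SELECT * FROM articles WHERE MATCH(body) AGAINST('mysql storage');
```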
On the other hand, TEXT columns are not stored together with the table, so performance could, theoretically, become an issue in some cases (benchmark to see about your specific case).
In recent versions of MySQL, VARCHAR fields can be quite long: up to 65,535 bytes, with the effective character limit depending on the character set and the other columns in the table (the whole row shares that 65,535-byte limit). VARCHAR is very efficient when you have varying-length strings. See:
http://dev.mysql.com/doc/refman/5.1/en/char.html
If you need longer strings than that, you'll probably just have to suck it up and use TEXT.