Is searching through numbers (INT) faster than searching through characters in a MySQL database?
Regards,
Thijs
In practice the difference will be small - it grows with the relative length of the two fields. The question presupposes that the two are interchangeable - which is a bad assumption given that MySQL natively supports an ENUM data type.
If it's that important to you, why not measure it?
Yes, it should be. An INT is only 4 bytes, while a text (VARCHAR) value might be considerably bigger.
Besides, if you have an index on the field you are searching, that index will be smaller. Hence, you might need fewer disk accesses to do an index scan.
If the INT and the VARCHAR consume the same amount of space, the difference should be negligible, even though INT will probably still come out on top.
Databases don't use black magic. They need to physically access data like everyone else. Table rows consume disk space. Reading 100 MB is faster than reading 200 MB. Always.
Therefore, this affects everything. Smaller rows mean more rows per "block". More rows per block mean more rows fetched per disk access. Fewer blocks in total means that a larger percentage of the rows will fit in the various buffer caches.
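If you do want to measure it, a rough sketch with PDO might look like the following. The table and column names (orders, status_id, status_name) are hypothetical; the idea is just to time the same lookup against an INT column and an equivalent VARCHAR column on your own data.

<?php
// Rough benchmark sketch -- assumes a hypothetical table with both an INT
// column (status_id) and an equivalent VARCHAR column (status_name).
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');

function timeQuery(PDO $pdo, string $sql, array $params): float {
    $start = microtime(true);
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    $stmt->fetchAll();                // force the full result set to be read
    return microtime(true) - $start;
}

$intTime = timeQuery($pdo, 'SELECT COUNT(*) FROM orders WHERE status_id = ?', [3]);
$strTime = timeQuery($pdo, 'SELECT COUNT(*) FROM orders WHERE status_name = ?', ['shipped']);

printf("INT lookup: %.4fs, VARCHAR lookup: %.4fs\n", $intTime, $strTime);

Run each query a few times and ignore the first run, so the buffer cache is warm for both cases.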
I have a PHP script which iterates through a JSON file line by line (using JsonMachine) and checks each line against some criteria (in a foreach); if the criteria are met, it checks whether the row is already in a database and then imports or updates it (MySQL 8.0.26). As an example, the last time this script ran it iterated through 65,000 rows and imported 54,000 of them in 24 seconds.
Each JSON row has a UUID as its unique key, and I am importing this as a VARCHAR(36).
I read that it can be advantageous to store UUIDs as BINARY(16) using UUID_TO_BIN and BIN_TO_UUID, so I changed the script to store the UUID as binary, changed the PHP read scripts to decode back to a UUID, and changed the database field to BINARY(16).
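Roughly, the conversion looked like this (a minimal sketch; the table and column names are placeholders rather than my real schema):

<?php
// Minimal sketch of storing/reading a UUID as BINARY(16).
// Table and column names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=import', 'user', 'password');

$uuid = '3f06af63-a93c-11e4-9797-00505690773f';   // example 36-char UUID from the feed

// Write path: MySQL converts the 36-char UUID to 16 bytes on the way in.
$insert = $pdo->prepare('INSERT INTO records (id, payload) VALUES (UUID_TO_BIN(?), ?)');
$insert->execute([$uuid, '{"some":"json"}']);

// Read path: convert back to the readable 36-char form on the way out.
$select = $pdo->prepare('SELECT BIN_TO_UUID(id) AS uuid, payload FROM records WHERE id = UUID_TO_BIN(?)');
$select->execute([$uuid]);
$row = $select->fetch(PDO::FETCH_ASSOC);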
This worked functionally, but the script import time went from 24 seconds to 30 minutes. The server was not CPU-bound during that time, running at 25 to 30% (normally <5%).
The script runs at about 3,000 lines per second without the UUID conversion, and at about 30 lines per second with it.
The question: can anyone with experience on bulk importing using uuid_to_bin comment on performance?
I've reverted to native UUID storage, but I'm interested to hear others' experience.
EDIT with extra info from comments and replies:
The UUID is the primary key
The server is a VM with 12 GB of RAM and 4 assigned cores
The table has 54,000 rows (from the import) and is 70 MB in size
The InnoDB buffer pool size is unchanged from the default of 128 MB: 134,217,728
Oh, bother. UUID_TO_BIN changed the UUID values from being scattered to being roughly chronologically ordered (for type 1 uuids). This helps performance by clustering rows on disk better.
First, let's check the type. Please display one (any one) of the 36-char UUIDs, or the HEX() of the 16-byte binary version. After that, I will continue this answer depending on whether it is type 1 or some other type.
Meanwhile, some other questions (to help me focus on the root cause):
What is the value of innodb_buffer_pool_size?
How much RAM?
How big is the table?
Were the incoming uuids in some particular order?
A tip: Use IODKU (INSERT ... ON DUPLICATE KEY UPDATE) instead of SELECT + (UPDATE or INSERT). That will double the speed.
Then batch them 100 at a time; that may give another 10x speedup (see the sketch below).
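A sketch of the batched IODKU (the table and column names are illustrative only; the point is one multi-row statement instead of thousands of single-row round trips):

<?php
// Batched INSERT ... ON DUPLICATE KEY UPDATE -- illustrative names only.
$pdo = new PDO('mysql:host=localhost;dbname=import', 'user', 'password');

$batch = [];                                 // each entry: [uuid, json payload]
foreach ($rows as $row) {                    // $rows: decoded JSON lines that met the criteria
    $batch[] = [$row['uuid'], json_encode($row)];
    if (count($batch) === 100) {
        flushBatch($pdo, $batch);
        $batch = [];
    }
}
if ($batch) {
    flushBatch($pdo, $batch);                // flush the final partial batch
}

function flushBatch(PDO $pdo, array $batch): void {
    $placeholders = implode(',', array_fill(0, count($batch), '(?, ?)'));
    $sql = "INSERT INTO records (uuid, payload) VALUES $placeholders
            ON DUPLICATE KEY UPDATE payload = VALUES(payload)";
    $pdo->prepare($sql)->execute(array_merge(...$batch));   // flatten to one parameter list
}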
More
Your UUIDs are type 4 -- random. UUID_TO_BIN() changes from one random order to another. (Dropping from 36 bytes to 16 is still beneficial.)
innodb_buffer_pool_size -- 128M is an old, too-small default. If you have more than 4GB of RAM, set it to about 70% of RAM. This change should help performance significantly. Your VM has 12GB, so change the setting to 8G. This will eliminate most of the I/O, which is the slow part of SQL.
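For example (this assumes the 12GB VM mentioned above; adjust the figure to roughly 70% of your own RAM, and note that the change needs a privileged account):

<?php
// One-off resize; the buffer pool can be resized dynamically in MySQL 5.7+.
// To make it permanent, also set innodb_buffer_pool_size = 8G in my.cnf.
$pdo = new PDO('mysql:host=localhost', 'root', 'password');
$pdo->exec('SET GLOBAL innodb_buffer_pool_size = 8589934592');   // 8 GB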
I am designing a database which will store JSON strings of various sizes. I'm considering using different tables (each with two columns, 'id' and 'data') to store different string sizes (TINYTEXT through LONGTEXT). In this case each table would be searched, starting with the table containing the smallest string sizes.
I'm also considering using a single table with a single string size and using multiple rows to store large JSON strings.
...or I could just create a table with a large VARCHAR size and save myself some development time.
There are two points that I am designing around:
In some cases, MySQL stores small pieces of data "in row", which helps performance. What does this mean and how can I take advantage of it?
In some cases, MySQL processes a VARCHAR as its largest possible size. When does this happen and how can I avoid it?
From the database's point of view there is no particular "good" length for a VARCHAR. However, try to keep the maximum row size under 8 KB, including non-clustered indexes. Then you will avoid MySQL storing data out of row, which hampers performance.
use 255
Why historically do people use 255 not 256 for database field magnitudes?
Although, as a side note, if you are working with PHP and trying to insert strings in excess of 1000 characters, you will need to truncate to your max col size on the PHP side before inserting, or you will hit an error.
A web application compares pairs of sets of positive integers. Each set has only unique values, none greater than 210,000,000 (which fits into 28 bits), and up to 5,000,000 values per set.
Comparing sets A & B, I need three result sets: "unique to A", "unique to B", and "common to A & B". The particular task is to answer the question "is number N present in set S?"
So far the project runs within the limited resources of shared hosting, under a LAMP stack. The quick-and-dirty solution I came up with was to outsource the job to the host's MySQL, which has more resources: a temporary table for each set, with the single column of numbers as the primary index. Only rarely are the sets small enough to fit into ENGINE=MEMORY, which is fast. It works, but it is too slow.
I'm looking for a way to keep a set like this in memory, efficient for the task of searching for a particular number within it, while keeping the memory footprint as low as possible.
I came up with the idea of coding each set as a bit mask of 2^28 bits (32 MB): a number present in the set = 1 bit set. 5 million numbers = 5 million bits set out of 210 million. Lots of zeroes -- can this be compressed effectively?
It feels like I'm reinventing the wheel, so please direct me to a "well-known" solution for this particular case of binary compression. I have read about Huffman coding, which does not seem to be the right solution, as its focus is size reduction, while my task requires many searches over a compressed set.
Update: I just found an article on Golomb coding and an example of its application to run-length encoding.
There is a standard compression technique available for representing large sets of integers in a range, which allows for efficient iteration (so it can easily do intersection, union, set difference, etc.) but does not allow random access (so it's no good for "is N in S"). For this particular problem, it will reduce the dataset to around seven bits per element, which would be around 8MB for sets of size 5,000,000. In case it's useful, I'll describe it below.
Bit-vectors of size 210,000,000 bits (26MB each, roughly) are computationally efficient, both to answer the "is N in S" query, and for bitwise operations, since you can do them rapidly with vectorized instructions on modern processors; it's probably as fast as you're going to get for a 5,000,000-element intersection computation. It consumes a lot of memory, but if you've got that much memory, go for it.
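For what it's worth, a raw bit-vector is easy to sketch in PHP as a byte string (purely illustrative; in practice a C extension or something like Redis bitmaps would be much faster):

<?php
// Bit-vector sketch for values 0 .. 209,999,999 (about 26 MB as a byte string).
const MAX_VALUE = 210000000;

function emptySet(): string {
    return str_repeat("\0", intdiv(MAX_VALUE, 8) + 1);
}

function addValue(string &$set, int $n): void {
    $byte = intdiv($n, 8);
    $set[$byte] = chr(ord($set[$byte]) | (1 << ($n % 8)));
}

function contains(string $set, int $n): bool {        // "is N in S?"
    return (ord($set[intdiv($n, 8)]) & (1 << ($n % 8))) !== 0;
}

// Set operations are bytewise string operators in PHP:
//   common to A & B:  $a & $b
//   unique to A:      $a & ~$b
//   unique to B:      $b & ~$a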
The compression technique, which is simple and just about optimal if the sets are uniformly distributed random samples of the specified size, is as follows:
Sort the set (or ensure that it is sorted).
Set the "current value" to 0.
For each element in the set, in order:
a. subtract the "current value" from the element;
b. while that difference is at least 32, output a single 1 bit and subtract 32 from the difference;
c. output a single 0 bit, followed by the difference encoded in five bits.
d. set the "current value" to one more than the element
To justify my claim that the compression will result in around seven bits per element:
It's clear that every element will occupy six bits (a 0 bit plus a five-bit delta); in addition, we have to account for the 1 bits in step 3b. Note, however, that the sum of all the deltas is at most the largest element in the set, which cannot be more than 210,000,000, and consequently we cannot execute step 3b more than 210,000,000/32 times. So step 3b will account for fewer than seven million bits, while step 3c will account for 6 * 5,000,000 bits, for a total of 37 million, or 7.4 bits per element (in practice, it will usually be a bit less than this).
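A minimal sketch of the encoder described above, emitting '0'/'1' characters for readability (a real implementation would pack the bits into bytes):

<?php
// Encoder for the scheme above: unary "skip 32" markers plus 5-bit deltas.
function encodeSet(array $values): string {
    sort($values);
    $bits = '';
    $current = 0;
    foreach ($values as $v) {
        $diff = $v - $current;
        while ($diff >= 32) {          // step 3b: one 1 bit per 32 skipped
            $bits .= '1';
            $diff -= 32;
        }
        // step 3c: a 0 bit followed by the remaining delta in five bits
        $bits .= '0' . str_pad(decbin($diff), 5, '0', STR_PAD_LEFT);
        $current = $v + 1;             // step 3d
    }
    return $bits;
}

// Example: encodeSet([3, 10, 70]) encodes deltas 3, 6, 59 as
//   000011  000110  1 011011   (the last one: one skip of 32, then 27)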
I have a relatively large database (130,000+ rows) of weather data, which is accumulating very fast (a new row is added every 5 minutes). On my website I publish min/max data for the day and for the entire existence of my weather station (which is around 1 year).
Now I would like to know whether I would benefit from creating additional tables in which these min/max data are stored, rather than letting PHP run a MySQL query that searches for the day's min/max data and the min/max data for the entire existence of the station. Would a query for MAX(), MIN() or SUM() (I need SUM() to total rain accumulation for months) take that much longer than a simple query against a table that already holds those min, max and sum values?
That depends on whether your columns are indexed or not. In the case of MIN() and MAX(), the MySQL manual says the following:
MySQL uses indexes for these operations: To find the MIN() or MAX() value for a specific indexed column key_col. This is optimized by a preprocessor that checks whether you are using WHERE key_part_N = constant on all key parts that occur before key_col in the index. In this case, MySQL does a single key lookup for each MIN() or MAX() expression and replaces it with a constant.
In other words, if your columns are indexed you are unlikely to gain much performance benefit from denormalization. If they are NOT, you will definitely gain performance.
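For example (hypothetical table and column names), a composite index lets MySQL answer the per-day MIN()/MAX() with single key lookups, exactly as the quoted passage describes:

<?php
// Hypothetical readings table; the composite index covers (reading_date, temperature).
$pdo = new PDO('mysql:host=localhost;dbname=weather', 'user', 'password');

$pdo->exec('CREATE INDEX idx_day_temp ON readings (reading_date, temperature)');

$stmt = $pdo->prepare(
    'SELECT MIN(temperature), MAX(temperature)
       FROM readings
      WHERE reading_date = ?'
);
$stmt->execute(['2011-06-15']);
[$min, $max] = $stmt->fetch(PDO::FETCH_NUM);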
As for SUM() it is likely to be faster on an indexed column but I'm not really confident about the performance gains here.
Please note that you should not be tempted to index your columns after reading this post. If you add indexes, your update queries will slow down!
Yes, denormalization should help performance a lot in this case.
There is nothing wrong with storing calculations for historical data that will not change in order to gain performance benefits.
While I agree with RedFilter that there is nothing wrong with storing historical data, I don't agree with the performance boost you will get. Your database is not what I would consider a heavy use database.
One of the major advantages of databases is indexes. They use advanced data structures to make data access lightning fast. Just think: every primary key you have is an index. You shouldn't be afraid of them. Of course, it would probably be counterproductive to index all your fields, but that should never really be necessary. I would suggest researching indexes more to find the right balance.
As for the work done when a change happens, it is not that bad. An index is a tree-like representation of your field data, designed to reduce a search down to a small number of near-binary decisions.
For example, think of finding a number between 1 and 100. Normally you would randomly stab at numbers, or you would just start at 1 and count up. This is slow. Instead, it would be much faster if you set it up so that you could ask if you were over or under when you choose a number. Then you would start at 50 and ask if you are over or under. Under, then choose 75, and so on till you found the number. Instead of possibly going through 100 numbers, you would only have to go through around 6 numbers to find the correct one.
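In code, that guessing game is just a binary search; a quick sketch:

<?php
// Binary search over a sorted array -- the same idea an index tree uses
// to cut the search space in half at every step.
function binarySearch(array $sorted, int $target): ?int {
    $low = 0;
    $high = count($sorted) - 1;
    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        if ($sorted[$mid] === $target) {
            return $mid;                    // found: return the position
        } elseif ($sorted[$mid] < $target) {
            $low = $mid + 1;                // guess is under: look higher
        } else {
            $high = $mid - 1;               // guess is over: look lower
        }
    }
    return null;                            // not present
}

echo binarySearch(range(1, 100), 42);       // at most 7 comparisons; prints 41 (0-based index)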
The problem here is when you add 50 numbers and make it out of 1 to 150. If you start at 50 again, your search is less optimized as there are 100 numbers above you. Your binary search is out of balance. So, what you do is rebalance your search by starting at the mid-point again, namely 75.
So the work a database does is just an adjustment to rebalance the mid-point of its index. It isn't actually a lot of work. If you are working on a database that is large and requires many changes a second, you would definitely need a strong strategy for your indexes. In a small database that gets very few changes, like yours, it's not a problem.
One of the things that always worries me in MySQL is that my string fields will not be large enough for the data that need to be stored. The PHP project I'm currently working on will need to store strings, the lengths of which may vary wildly.
Not being familiar with how MySQL stores string data, I'm wondering if it would be overkill to use a larger data type like TEXT for strings that will probably often be less than 100 characters. What does MySQL do with highly variable data like this?
See this: http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
VARCHAR(M), VARBINARY(M): L + 1 bytes if column values require 0-255 bytes; L + 2 bytes if values may require more than 255 bytes
BLOB, TEXT: L + 2 bytes, where L < 2^16
So in the worst case, you're using one extra byte per table cell when using TEXT.
As for indexing: you can create a normal index on a TEXT column, but you must give a prefix length - e.g.
CREATE INDEX part_of_name ON customer (name(10));
and moreover, TEXT columns allow you to create and query fulltext indexes if you are using the MyISAM engine.
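For instance (table and column names are made up; MyISAM assumed here, as in MySQL 5.1, though InnoDB gained fulltext support in 5.6):

<?php
// Fulltext index on a TEXT column, then a MATCH ... AGAINST query.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');

$pdo->exec('CREATE FULLTEXT INDEX ft_body ON articles (body)');

$stmt = $pdo->prepare('SELECT id, title FROM articles WHERE MATCH(body) AGAINST (?)');
$stmt->execute(['mysql storage requirements']);
$matches = $stmt->fetchAll(PDO::FETCH_ASSOC);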
On the other hand, TEXT columns are not stored together with the table, so performance could, theoretically, become an issue in some cases (benchmark to see about your specific case).
In recent versions of MySQL, VARCHAR fields can be quite long - up to 65,535 characters depending on character set and the other columns in the table. It is very efficient when you have varying length strings. See:
http://dev.mysql.com/doc/refman/5.1/en/char.html
If you need longer strings than that, you'll probably just have to suck it up and use TEXT.