I run a small-to-medium car website and we are trying to log how many times a visitor goes to a vehicle's detail page. We do this by MD5-hashing the make, model, and zip of the current vehicle. We then keep a vehicle_count total and increment it when the hashes match.
After running the numbers there appear to be about 50 makes, each make has about 50 models, and our locations db has about 44,000 unique zip codes, so roughly 100 million+ potential unique hashes.
This is the create table:
CREATE TABLE `vehicle_detail_page` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`vehicle_hash` char(32) NOT NULL,
`make` varchar(100) NOT NULL,
`model` varchar(100) NOT NULL,
`zip_code` char(7) DEFAULT NULL,
`vehicle_count` int(6) unsigned DEFAULT '1',
PRIMARY KEY (`id`),
UNIQUE KEY `vehicle_hash` (`vehicle_hash`),
KEY `make` (`make`),
KEY `model` (`model`),
KEY `zip_code` (`zip_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This is the PHP code to insert/update the table:
public function insertUpdate($make, $model, $zip)
{
    // set table
    $table = self::TABLE;
    // create hash
    $hash = md5($make.$model.$zip);
    // insert a new row, or bump the counter if the hash already exists
    try {
        $stmt = $this->db->conn->prepare("INSERT INTO $table
            (vehicle_hash, make, model, zip_code)
            VALUES
            (:vehicle_hash, :make, :model, :zip_code)
            ON DUPLICATE KEY UPDATE
            vehicle_count = vehicle_count + 1;");
        $stmt->bindParam(':vehicle_hash', $hash, PDO::PARAM_STR);
        $stmt->bindParam(':make', $make, PDO::PARAM_STR);
        $stmt->bindParam(':model', $model, PDO::PARAM_STR);
        $stmt->bindParam(':zip_code', $zip, PDO::PARAM_STR);
        $stmt->execute();
    } catch (Exception $e) {
        return FALSE;
    }
    return TRUE;
}
Questions:
Can MySQL handle this many rows?
Does anyone see anything wrong with this code, and is there a better way to do this?
What will querying this data be like?
The big question is: once this table grows to a few million+ rows, how will the PHP function above perform? Can anyone give some insight?
You could also avoid the hash altogether.
CREATE TABLE `vehicle_visits` (
`make` varchar(100) DEFAULT NULL,
`model` varchar(100) DEFAULT NULL,
`zip_code` char(7) DEFAULT NULL,
`vehicle_count` int(11) DEFAULT NULL,
UNIQUE KEY `make_model_zip` (`make`,`model`,`zip_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This avoids having multiple UNIQUE values. Instead of "ID" and "Hash", you can use real world values to create the UNIQUE identifier. Notice how MySQL can use 3 columns to form a unique index.
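The insert-or-update then works directly against the natural key; a minimal sketch (the values are illustrative):
INSERT INTO vehicle_visits (make, model, zip_code, vehicle_count)
VALUES ('Honda', 'Civic', '90210', 1)
ON DUPLICATE KEY UPDATE vehicle_count = vehicle_count + 1;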
Note: to decrease the size of your index, you can decrease the size of the make and model columns, unless of course you are expecting 100-character make and model names. If you are worried about size, you can also create an index using a prefix of each of the columns.
Edit: adding the hash column as an index method
As an alternative to a composite index, you can introduce a column
that is “hashed” based on information from other columns. If this
column is short, reasonably unique, and indexed, it might be faster
than a “wide” index on many columns.
http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html
You will need to do some real-world tests to see which method is quicker. Since the data shows about 50 makes and 50 models, lookups will mostly hinge on the zip_code column. Index order also makes a difference. Also, creating an index using prefixes such as make(10), model(10), zip(7) creates an index of length 27, whereas an md5 column would be 32.
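A sketch of such a prefix index (the lengths are illustrative; note that making a prefix index UNIQUE would enforce uniqueness on the prefixes only):
ALTER TABLE vehicle_visits
    ADD INDEX make_model_zip_prefix (make(10), model(10), zip_code);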
The hash method may help with lookups, but will it really help in real-world use? This table tracks visitors and will most likely have analytics run on it. The composite index will help with SUM() operations (depending on the order of the index). For example, finding the total number of visitors to the "Honda" or "Honda Civic" pages is easy with the multiple-column index.
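For instance, with the index in (make, model, zip_code) order, both of these aggregates can be served from the index:
SELECT SUM(vehicle_count) FROM vehicle_visits WHERE make = 'Honda';
SELECT SUM(vehicle_count) FROM vehicle_visits WHERE make = 'Honda' AND model = 'Civic';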
I cleaned the question a little bit because it was getting very big and unreadable.
Running on my localhost.
The query takes 755.15 ms when selecting from the table Job, which contains 15000 rows (the WHERE conditions match 6650 of them).
The table Company contains 1000 rows.
The table geo__name contains 84300 rows approx and is not giving me any problem, so I believe the problem is the database structure or something.
The structure of these 2 tables is the following:
Table Job is:
CREATE TABLE `job` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
`company_id` int(11) NOT NULL,
`activity_sector_id` int(11) DEFAULT NULL,
`status` int(11) NOT NULL,
`active` datetime NOT NULL,
`contract_type_id` int(11) NOT NULL,
`salary_type_id` int(11) NOT NULL,
`workday_id` int(11) NOT NULL,
`geoname_id` int(11) NOT NULL,
`title` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`minimum_experience` int(11) DEFAULT NULL,
`min_salary` decimal(7,2) DEFAULT NULL,
`max_salary` decimal(7,2) DEFAULT NULL,
`zip_code` int(11) DEFAULT NULL,
`vacancies` int(11) DEFAULT NULL,
`show_salary` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
KEY `created_at` (`created_at`,`active`,`status`) USING BTREE,
CONSTRAINT `FK_FBD8E0F823F5422B` FOREIGN KEY (`geoname_id`) REFERENCES `geo__name` (`id`),
CONSTRAINT `FK_FBD8E0F8398DEFD0` FOREIGN KEY (`activity_sector_id`) REFERENCES `activity_sector` (`id`),
CONSTRAINT `FK_FBD8E0F85248165F` FOREIGN KEY (`salary_type_id`) REFERENCES `job_salary_type` (`id`),
CONSTRAINT `FK_FBD8E0F8979B1AD6` FOREIGN KEY (`company_id`) REFERENCES `company` (`id`),
CONSTRAINT `FK_FBD8E0F8AB01D695` FOREIGN KEY (`workday_id`) REFERENCES `workday` (`id`),
CONSTRAINT `FK_FBD8E0F8CD1DF15B` FOREIGN KEY (`contract_type_id`) REFERENCES `job_contract_type` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=15001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
The table company is:
CREATE TABLE `company` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`logo` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
`website` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`user_id` int(11) NOT NULL,
`phone` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`cifnif` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`type` int(11) NOT NULL,
`subscription_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_4FBF094FA76ED395` (`user_id`),
KEY `IDX_4FBF094F9A1887DC` (`subscription_id`),
KEY `name` (`name`(191)),
CONSTRAINT `FK_4FBF094F9A1887DC` FOREIGN KEY (`subscription_id`) REFERENCES `subscription` (`id`),
CONSTRAINT `FK_4FBF094FA76ED395` FOREIGN KEY (`user_id`) REFERENCES `user` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
The query is the following:
SELECT
j0_.id AS id_0,
j0_.status AS status_1,
j0_.title AS title_2,
j0_.min_salary AS min_salary_3,
j0_.max_salary AS max_salary_4,
c1_.id AS id_5,
c1_.name AS name_6,
c1_.logo AS logo_7,
a2_.id AS id_8,
a2_.name AS name_9,
g3_.id AS id_10,
g3_.name AS name_11,
j4_.id AS id_12,
j4_.name AS name_13,
j5_.id AS id_14,
j5_.name AS name_15,
w6_.id AS id_16,
w6_.name AS name_17
FROM
job j0_
INNER JOIN company c1_ ON j0_.company_id = c1_.id
INNER JOIN activity_sector a2_ ON j0_.activity_sector_id = a2_.id
INNER JOIN geo__name g3_ ON j0_.geoname_id = g3_.id
INNER JOIN job_salary_type j4_ ON j0_.salary_type_id = j4_.id
INNER JOIN job_contract_type j5_ ON j0_.contract_type_id = j5_.id
INNER JOIN workday w6_ ON j0_.workday_id = w6_.id
WHERE
j0_.active >= CURRENT_TIMESTAMP
AND j0_.status = 1
ORDER BY
j0_.created_at DESC
When executing the above query I have these results:
In MYSQL Workbench: 0.578 sec / 0.016 sec
In Symfony profiler: 755.15 ms
The question is: is this query duration normal? If not, how can I improve the speed of the query? It seems too long.
(The original question also included screenshots of the Symfony debug toolbar, the EXPLAIN output, and the timeline; the toolbar confirms that only the data actually needed is being fetched.)
The MySQL server can't handle the load being placed on it. This could be due to resource contention, because it has not been appropriately tuned, or it could be a problem with your hard drive.
First, I would start by adding the MySQL keyword STRAIGHT_JOIN, which tells MySQL to join the tables in the order I have written them, rather than working the relationships out for me. With your dataset being so small, and the query already at about half a second, I don't know if it will help as much, but on larger datasets I have known it to SIGNIFICANTLY improve performance.
Next, you appear to be fetching lookup descriptions via the PK/FK relationships. Not seeing the indexes on those tables, I would suggest covering indexes that contain both the key and the description, so the join can read the description straight from the index pages it already uses for the JOIN, instead of using the index page and then hunting down the actual data pages to get the description.
Last, your job table's index on (created_at, active, status) might perform better if it were reordered as (status, active, created_at), as sketched after the explanation below.
With your existing index, think of it this way: each day of data is put into a single box. Within each day's box, the rows are sorted by the active timestamp (simplified here to the active date), THEN by status.
So, for each day CREATED, you open a box. Inside it you look through secondary boxes, one for each active timestamp (say, one per day). Only within each active-day box can you finally check for the status = 1 records. So you open each active-day box, pick out status = 1, then close the created-day box, go to the next created-day box, and repeat. Consider how labor-intensive it is to open each day's box, and each active box within it.
Now, under the suggested index starting with status, you have a very finite number of boxes, one per status. You open only the one box for status = 1; those are the only records you want to consider, and you don't care about the others. Inside that box, the records are sub-sorted by the ACTIVE timestamp, so you can jump directly to those at or after the current timestamp. From the first matching record to the end of the box, you have all the records that qualify. Done. Since this index ALSO has created_at as a component, it can optimize the descending sort order.
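A sketch of that reordering (the new index name is illustrative):
ALTER TABLE job
    DROP INDEX created_at,
    ADD INDEX status_active_created (status, active, created_at);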
To ensure "covering indexes" for the other lookup tables, if they do not yet exist, I suggest the following (table → index), with a DDL sketch after the list:
company ( id, name, logo )
activity_sector (id, name )
geo__name ( id, name )
job_salary_type ( id, name )
job_contract_type ( id, name )
workday ( id, name )
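A sketch of the DDL for the first two (index names are illustrative; InnoDB secondary indexes implicitly include the primary key, so indexing the description columns alone would cover id as well):
ALTER TABLE company ADD INDEX company_cover_ix (id, name, logo);
ALTER TABLE activity_sector ADD INDEX activity_sector_cover_ix (id, name);
...and likewise for geo__name, job_salary_type, job_contract_type and workday.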
And the MySQL Keyword...
SELECT STRAIGHT_JOIN (rest of query...)
There are several reasons why Symfony is slow.
1. Server fault
First, it could be the server's fault. Server performance may hinder your query time.
2. Data size and deferred rendering
Then comes the data size. On one of my projects, the equivalent query has a 50 MB data size (currently about 20k rows).
Parsing 50 MB into HTML can take some time, mostly because of loops.
Still, there are solutions for this, like deferred rendering.
Deferred rendering is quite simple: instead of parsing the data in your Twig template, you send all the data to a JavaScript variable and use JavaScript to parse/render the data once the DOM is loaded.
3. Query optimisation
As I wrote in a comment, you can check the following question, in which I explained why custom queries are important:
Are Doctrine relations affecting application performance?
In that question, you will read that order matters... It's in fact the most important thing.
While static data in your database is often inserted in the right order, that's rarely the case for dynamic data (data provided by users during the site's life).
This is why using ORDER BY in your query will often speed up the page rendering, as Doctrine won't be doing extra queries on its own.
As an example, one of my sites has about 700 entries displayed on the index.
First, here is the query count while using findAll():
It shows 254 queries (253 duplicates) in 144 ms, plus 39 ms of render time.
Next, using the second parameter of findBy(), ORDER BY, I get this result (the full query was shown in a screenshot in the original answer):
Much better: only 1 query, in 8 ms, and about the same render time.
But here I don't use any fields from the associations.
The moment I do, Doctrine will run extra queries of its own, and the query count and time will skyrocket.
In the end, it turns back into something like findAll().
And last, here is the custom query:
With this custom query, the query time went from 8 ms to 38 ms.
But, unlike the previous query, I get far more data in my result, which prevents Doctrine from doing extra queries.
Again, ORDER BY matters in this query. Without it, I skyrocket back to 84 queries.
4. Partials
When you write a custom query, you can load partial objects instead of full entities.
As you said in your question, the description field seems to slow down your loading speed; with partials, you can avoid loading some fields of the table, which will speed up the query.
First, instead of your regular syntax, this is how you will create the query builder :
$em=$this->getEntityManager();
$qb=$em->createQueryBuilder();
Just in case, I prefer to keep $em as a separate variable (if I want to fetch some class repository for example).
Then you can start your partial select. Careful: the first select can't include any association fields:
$qb->select("partial job.{id, status, title, minimum_experience, min_salary, max_salary, zip_code, vacancies}")
->from(Job::class, "job");
Then you can add your associations :
$qb->addSelect("company")
->join("job.company", "company");
Or even add partial association in case you don't need all the data of the association :
$qb->addSelect("partial activitySector.{id}")
->join("job.activitySector", "activitySector");
$qb->addSelect("partial job.{id, company_id, activity_sector_id, status, active, contract_type_id, salary_type_id, workday_id, geoname_id, title, minimum_experience, min_salary, max_salary, zip_code, vacancies, show_salary");
5. Caches
You could also use various caches, like Zend OPcache for PHP; you will find some advice about it in this question: Why Symfony3 so slow?
There is also the HTTP cache Varnish.
That rounds up about everything I can share to lower your loading time.
I hope it proves useful and that you will be able to solve your problem.
So many keys! Try to minimize the number of keys.
I have this chat table:
CREATE TABLE IF NOT EXISTS `support_chat` (
`id` int(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`from` varchar(255) NOT NULL DEFAULT '',
`to` varchar(255) NOT NULL DEFAULT '',
`message` text NOT NULL,
`sent` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`seen` varchar(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `from` (`from`),
KEY `to` (`to`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 AUTO_INCREMENT=1 ;
Basically I need to run a select all the time (every 3 s per user) to check for new messages:
select id, `from`, message, sent from support_chat where `to` = ? and seen = 0
I have 5 million rows and usually 100 users online at the same time. Can I change something to make this table faster? Are the keys on from and to a good option?
There isn't much you can do by way of indexes to speed up that particular query. You could have a composite index on the to and seen fields, but the improvement will be minimal, if any. Why? Because the seen field has very poor cardinality: you only seem to store 0 or 1 in it, and indexes on such columns are not very useful. It would often be faster for the query optimizer to read the data directly.
But here's what you can do: partitioning, which the manual describes as follows:
... enables you to distribute portions of individual tables across a
file system according to rules which you can set largely as needed. In
effect, different portions of a table are stored as separate tables in
different locations. The user-selected rule by which the division of
data is accomplished is known as a partitioning function,
You can partition your data in such a way that very old data is separated from the new. This will probably give you a big boost. However, be aware that a query which fetches old data as well as new data will be a lot slower.
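A sketch of range-partitioning the chat table by year of the sent date (assumption: you can afford to extend the primary key, since MySQL requires the partitioning column to be part of every unique key):
ALTER TABLE support_chat
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id, sent);
ALTER TABLE support_chat
    PARTITION BY RANGE (TO_DAYS(sent)) (
        PARTITION p2016 VALUES LESS THAN (TO_DAYS('2017-01-01')),
        PARTITION p2017 VALUES LESS THAN (TO_DAYS('2018-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );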
Here is another thing you can do: add a LIMIT clause.
You are probably only showing a limited number of messages at any given time. A LIMIT clause helps because MySQL then knows it can stop looking once it has found N rows.
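A sketch, assuming you display at most 50 messages at a time:
SELECT id, `from`, message, sent
FROM support_chat
WHERE `to` = ? AND seen = 0
ORDER BY sent DESC
LIMIT 50;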
Add a multiple-column index on the to and seen columns, in this particular order (the to column should be the first column in the index). Then run EXPLAIN SELECT ... on your query to see whether the new index is used.
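A sketch of that index and the check ('alice' stands in for a real recipient):
ALTER TABLE support_chat ADD INDEX to_seen_ix (`to`, seen);
EXPLAIN SELECT id, `from`, message, sent
FROM support_chat
WHERE `to` = 'alice' AND seen = 0;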
Assuming that the seen column stores 2 values only ('0' and '1') and that to column stores the recipient of the chat message (email, username), so it can have many more values, I'd use a composite index with seen first and to second:
ALTER TABLE support_chat
ADD INDEX seen_to_ix
(seen, `to`) ;
A composite index with the reversed order (`to`, seen) would be a good choice, too. It might even be better, depending on server load and how often the table is updated. An advantage (if you decide to use the second index) is that you can remove the (`to`) index.
Pick and add one of the two indexes and check the performance of your queries again.
Additional notes:
Using a varchar(1) for what is essentially a boolean value is not optimal, and it's even worse in the utf8mb4 charset: it can use up to 5 bytes (1 length byte plus up to 4 bytes for the single character)!
I'd change the type of that column to tinyint (and store 0 and 1) or bit.
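A sketch of that change (MySQL will coerce the existing '0'/'1' strings to integers):
ALTER TABLE support_chat
    MODIFY seen TINYINT UNSIGNED NOT NULL DEFAULT 0;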
Please avoid using reserved words (eg, from, to) for table and column names.
I am a bit stumped by this weirdness.
I have a gps tracking app that logs gps points into a track_log table.
When I do a basic query on the running log table it takes about 50 seconds to complete:
SELECT * FROM track_log WHERE node_id = '26' ORDER BY time_stamp DESC LIMIT 1
I copied most of the logs to an archive table to reduce the running table to about 1.2 million records.
The archive table is 7.5 million records big.
Yet the exact same query on the archive table runs in 0.1 seconds on the same server, even though it's six times bigger!
What's going on?
Here's the full Create Table schema:
CREATE TABLE `track_log` (
`id_track_log` INT(11) NOT NULL AUTO_INCREMENT,
`node_id` INT(11) DEFAULT NULL,
`client_id` INT(11) DEFAULT NULL,
`time_stamp` DATETIME NOT NULL,
`latitude` DOUBLE DEFAULT NULL,
`longitude` DOUBLE DEFAULT NULL,
`altitude` DOUBLE DEFAULT NULL,
`direction` DOUBLE DEFAULT NULL,
`speed` DOUBLE DEFAULT NULL,
`event_code` INT(11) DEFAULT NULL,
`event_description` VARCHAR(255) DEFAULT NULL,
`street_address` VARCHAR(255) DEFAULT NULL,
`mileage` INT(11) DEFAULT NULL,
`run_time` INT(11) DEFAULT NULL,
`satellites` INT(11) DEFAULT NULL,
`gsm_signal_status` DOUBLE DEFAULT NULL,
`hor_pos_accuracy` double DEFAULT NULL,
`positioning_status` char(1) DEFAULT NULL,
`io_port_status` char(16) DEFAULT NULL,
`AD1` decimal(10,2) DEFAULT NULL,
`AD2` decimal(10,2) DEFAULT NULL,
`AD3` decimal(10,2) DEFAULT NULL,
`battery_voltage` decimal(10,2) DEFAULT NULL,
`ext_power_voltage` decimal(10,2) DEFAULT NULL,
`rfid` char(8) DEFAULT NULL,
`pic_name` varchar(255) DEFAULT NULL,
`temp_sensor_no` char(2) DEFAULT NULL,
PRIMARY KEY (`id_track_log`),
UNIQUE KEY `id_track_log_UNIQUE` (`id_track_log`),
KEY `client_id_fk_idx` (`client_id`),
KEY `track_log_node_id_fk_idx` (`node_id`),
KEY `track_log_event_code_fk_idx` (`event_code`),
KEY `track_log_time_stamp_index` (`time_stamp`),
CONSTRAINT `track_log_client_id` FOREIGN KEY (`client_id`) REFERENCES `clients` (`client_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `track_log_event_code_fk` FOREIGN KEY (`event_code`) REFERENCES `event_codes` (`event_code`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `track_log_node_id_fk` FOREIGN KEY (`node_id`) REFERENCES `nodes` (`id_nodes`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=8632967 DEFAULT CHARSET=utf8
TL;DR
Make sure the indexes are defined on both tables; for this query, a composite index on node_id and time_stamp works well.
Defragment your table: https://dev.mysql.com/doc/refman/5.5/en/innodb-file-defragmenting.html (this could help, but should not make this much of a difference).
Make sure your query is not being blocked by other queries. If data is continuously being inserted into the track_log table, those inserts might block your query. You can prevent this by changing the transaction isolation level; see https://dev.mysql.com/doc/refman/5.5/en/set-transaction.html for more information. Caution: be careful with this!
Indexes
I'm guessing this has something to do with the indexes you defined on the tables. Could you post the SHOW CREATE TABLE track_log output, and the output for your archive table as well? The query you are executing requires an index on node_id and time_stamp for optimal performance.
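A sketch of that compound index (the name is illustrative):
ALTER TABLE track_log ADD INDEX node_time_ix (node_id, time_stamp);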
Defragmentation
Besides the indexes defined on the table, this might have something to do with data fragmentation. I'm assuming you are using InnoDB as your table engine. Depending on your settings, every table in a database is stored in a separate file, or every table in the database is stored in a single file (the innodb_file_per_table variable). Those files never shrink in size. If your track_log table has grown to 8.7 million records, the file on disk still takes up space for all 8.7 million records.
If you have moved records from your track_log table to your archive table, the data might still sit at the beginning and the end of the physical file for track_log. If no index is defined on time_stamp, a full table scan is still required to order by the timestamp, which means reading the complete file from disk. Because the records you deleted still take up space in the file, this could make a difference.
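If fragmentation is the issue, rebuilding the table reclaims the space left by the moved rows; a sketch (note this rewrites the whole table, which takes a while on millions of rows):
OPTIMIZE TABLE track_log;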
Edit:
Transactions
Other transactions might be blocking your SELECT query. This can happen with the InnoDB engine. If you continuously insert a lot of data into your track_log table, those inserts might block your query: it will have to wait until no other transactions are being performed on this table.
There is a way around this, but you should be careful with it. You are able to change the transaction isolation level of your query. By setting it to READ UNCOMMITTED you will be able to read data while the other inserts are running, but it might not always give you the latest data. Whether you want to make that sacrifice depends on your situation. If you are going to alter the data and update it later, you generally do not want to change the transaction isolation level. But, for example, when showing statistics which need not always be accurate and up to date, this can really speed up your query.
I use this myself sometimes when I need to show statistics from large tables which are updated regularly.
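A sketch of the isolation-level change for such a statistics query (it may return uncommitted, slightly stale data):
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM track_log WHERE node_id = '26' ORDER BY time_stamp DESC LIMIT 1;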
This is almost certainly because your archive table has superior indexing to your track_log table.
To satisfy this query efficiently you need a compound index on (node_id, time_stamp) Why does this work? Because InnoDB and MyISAM indexes are so-called BTREE indexes, which means our intuition about searching them in order will work. Your query looks for a specific value of node_id, which means it can jump to that value in the index efficiently. The query then calls for the highest possible value of time_stamp related to that node_id value. Now that's in the same index, and in the right order to access it quickly too. So the row you need can be random-accessed, and MySQL doesn't have to hunt for it by scanning the table row by row. That scanning is almost certainly what's taking the time in your query.
Three things to keep in mind:
One: lots of indexes on single columns can't help a query as much as well-chosen compound indexes. Read this http://use-the-index-luke.com/
Two: SELECT * is usually harmful on a table with as many columns as the one you have shown. Instead, you should enumerate the columns you actually need in your SELECT query (see the sketch after this list). That way MySQL doesn't have to sling as much data.
Three: The DOUBLE datatype is overkill for commercial-grade GPS data. FLOAT is plenty of precision.
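For point two, a sketch with an enumerated column list (the chosen columns are illustrative):
SELECT id_track_log, time_stamp, latitude, longitude, speed
FROM track_log
WHERE node_id = '26'
ORDER BY time_stamp DESC
LIMIT 1;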
Let us analyze your query:
SELECT * FROM track_log WHERE node_id = '26' ORDER BY time_stamp DESC LIMIT 1
The above query first sorts all the data in the table by time_stamp and then returns the top row.
But when this query is executed on the archived table, the ORDER BY clause might be ignored (depending on compression and system settings), and hence it returns the first row it encounters.
You can verify the archived table's output by comparing the result with the actual latest row.
I have a MySQL (5.6.26) database with a large amount of data, and I have a problem with a COUNT select on a table join.
This query takes about 23 seconds to execute:
SELECT COUNT(0) FROM user
LEFT JOIN blog_user ON blog_user.id_user = user.id
WHERE email IS NOT NULL
AND blog_user.id_blog = 1
Table user is MyISAM and contains user data like id, email, name, etc...
CREATE TABLE `user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`username` varchar(50) DEFAULT NULL,
`email` varchar(100) DEFAULT '',
`hash` varchar(100) DEFAULT NULL,
`last_login` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`created` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
UNIQUE KEY `id` (`id`) USING BTREE,
UNIQUE KEY `email` (`email`) USING BTREE,
UNIQUE KEY `hash` (`hash`) USING BTREE,
FULLTEXT KEY `email_full_text` (`email`)
) ENGINE=MyISAM AUTO_INCREMENT=5728203 DEFAULT CHARSET=utf8
Table blog_user is InnoDB and contains only id, id_user and id_blog (a user can have access to more than one blog). id is the PRIMARY KEY and there are indexes on id_blog, id_user and (id_blog, id_user).
CREATE TABLE `blog_user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`id_blog` int(11) NOT NULL DEFAULT '0',
`id_user` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `id_blog_user` (`id_blog`,`id_user`) USING BTREE,
KEY `id_user` (`id_user`) USING BTREE,
KEY `id_blog` (`id_blog`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=5250695 DEFAULT CHARSET=utf8
I deleted all other tables and there is no other connection to MySQL server (testing environment).
What I've found so far:
When I delete some columns from the user table, the query gets faster (about 2 seconds per deleted column).
When I delete all columns from the user table (except id and email), the query takes 0.6 seconds.
When I also change the blog_user table to MyISAM, the query takes 46 seconds.
When I change the user table to InnoDB, the query takes 0.1 seconds.
The question is why is MyISAM so slow executing the command?
First, some comments on your query (after fixing it up a bit):
SELECT COUNT(*)
FROM user u LEFT JOIN
blog_user bu
ON bu.id_user = u.id
WHERE u.email IS NOT NULL AND bu.id_blog = 1;
Table aliases make a query easier both to write and to read. More importantly, you have a LEFT JOIN, but your WHERE clause is turning it into an INNER JOIN. So, write it that way:
SELECT COUNT(*)
FROM user u INNER JOIN
blog_user bu
ON bu.id_user = u.id
WHERE u.email IS NOT NULL AND bu.id_blog = 1;
The difference is important because it affects choices that the optimizer can make.
Next, indexes will help this query. I am guessing that blog_user(id_blog, id_user) and user(id, email) are the best indexes.
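A sketch of those (the blog_user composite already exists as the unique key id_blog_user, so only the user index would be new):
ALTER TABLE user ADD INDEX id_email_ix (id, email);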
The reason why the number of columns affects your original query is because it is doing a lot of I/O. The fewer columns then the fewer pages needed to store the records -- and the faster the query runs. Proper indexes should work better and more consistently.
To answer the real question (why MyISAM is slower than InnoDB), I can't give an authoritative answer.
But it is certainly related to one of the more important differences between the two storage engines: InnoDB supports foreign keys and MyISAM doesn't. Foreign keys are important for joining tables.
I don't know whether defining a foreign key constraint will improve speed further, but it will certainly guarantee data consistency.
Another note: you observe that the time decreases as you delete columns. This indicates that the query requires a full table scan, which can be avoided by creating an index on the email column. user.id and blog_user.id_user hopefully already have indexes; if they don't, that is an error. Columns that participate in a foreign key, explicit or not, should always have an index.
This is a long time after the event to be much use to the OP and all the foregoing suggestions for speeding up the query are entirely appropriate but I wonder why no one has remarked on the output of EXPLAIN. Specifically, why the index on email was chosen and how that relates to the definition for the email column in the user table.
The optimizer has selected an index on email column, presumably because it's included in the where clause. key_len for this index is comparatively long and it's a reasonably large table given the auto_increment value so the memory requirements for this index would be considerably greater than if it had chosen the id column (4 bytes against 303 bytes). The email column is NULLABLE but has a default of the empty string so, unless the application explicitly sets a NULL, you are not going to find any NULLs in this column anyway. Neither will you find more than one record with the default given the UNIQUE constraint. The column DEFAULT and UNIQUE constraint appear to be completely at odds with each other.
Given the above, and the fact we only want the count in the query, I'd then wonder if the email part of the where clause serves any purpose other than slowing the query down as each value is compared to NULL. Without it the optimizer would probably pick the primary key and do a much better job. Better yet would be a query which ignored the user table entirely and took the count based on the covering index on blog_user that Gordon Linoff highlighted.
There's another indexing issue here worth mentioning:
On the user table
UNIQUE KEY `id` (`id`) USING BTREE,
is redundant since id is the PRIMARY KEY and therefore UNIQUE by definition.
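Dropping the redundant key frees space and speeds up writes; the primary key keeps enforcing uniqueness:
ALTER TABLE user DROP INDEX id;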
To answer your last question,
The question is why is MyISAM so slow executing the command?
MyISAM is dependent on the speed of your hard drive; InnoDB, once the data has been read into memory, runs at the speed of RAM. The first time a query runs it may be loading data from disk; the second and later runs avoid the hard drive until the data ages out of RAM.
I have a table like this:
CREATE TABLE UserTrans (
`id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`user_id` int(10) unsigned NOT NULL,
`transaction_id` varchar(255) NOT NULL default '0',
`source` varchar(100) NOT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`)
)
with the InnoDB engine.
The transaction_id is a varchar because sometimes it can be alphanumeric; the id is the primary key.
So here is the thing: I have over 1M records, and there is a query that checks for a duplicate transaction_id for a specified source. Here is my query:
SELECT *
FROM UserTrans
WHERE transaction_id = '212398043'
AND source = 'COMPANY_A';
This query is getting very slow, taking about 2 seconds to run now. Should I index transaction_id and source together?
e.g. KEY join_id (transaction_id, source)
What is the drawback if I do that?
Obviously the benefit is that it will improve the performance of certain queries.
The drawback is that it will take a bit of space to store the index and a bit of work for the RDBMS to maintain the index. The index is especially prone to consume space because your transaction_id is such a wide string.
You might consider whether transaction_id really needs to be up to 255 characters long, or if you could declare its max length to be something shorter.
Or you could use a prefix index to index only the first n characters:
CREATE INDEX join_id ON UserTrans (transaction_id(16), source(16));
@Daniel has a good point that you might get the same benefit and save even more space by indexing only one column. Since you're doing SELECT *, you've ruled out the benefit of a covering index.
Also if you intend transaction_id to be unique, why not constrain it to be unique?
CREATE UNIQUE INDEX uq_transaction_id ON UserTrans (transaction_id(16));
The main drawback is that the new index will take up space on your disks. It will also make inserts and updates a little bit slower (but this is often negligible in most situations).
On the other hand, your query will probably run in just a few milliseconds instead of 2 seconds.
The drawbacks to adding indices are space (since storing indexes does take up space) and insert time (since when you insert new records, they have to be added to the indices).
That said, you may not need to index both fields - just indexing one of them may be enough.
I would think about ditching your id column and using transaction_id as your primary key.
I am assuming that transaction_id is unique.
This means your schema itself prevents you from inserting a transaction id that is already there.
It reduces the amount of data being stored, and also reduces the number of columns needing to be indexed.
If source and transaction_id are in fact a composite key, I would make the two columns the primary key.
Your current schema allows you to put in duplicates, which is an unnecessary evil.
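A sketch of that change (caution: with a utf8 character set this composite key can exceed InnoDB's 767-byte index-prefix limit unless the columns are shortened or large index prefixes are enabled):
ALTER TABLE UserTrans
    DROP COLUMN id,
    ADD PRIMARY KEY (source, transaction_id);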