I'm trying to build out a MySQL database design for a project, and the problem is coming up with the best solution. Basically, in my application I will have to insert approximately 10-30 rows per user. The primary key will be a random CHAR(16) string. There will also be a datetime index, and an additional indexed column called "data".
Day to day, there will only be a heavy amount of inserts and lookups on the table. The lookups will always be joined on the primary key (so joining those 10-30 rows per user).
I will at times need to be able to look at a few specific months (or even a full year) and use MySQL GROUP BY functions on the "data" column as well.
At its current volume and estimates, I would expect the table to grow by 9.3M rows/month, and I do expect this to increase.
So my question comes down to this: MySQL partitions, programmatic table separation, or another solution? And are things best separated by month or by year? We are running on RHEL, so getting MySQL 5.1 may be a bit of work, but if that's a better solution it may be worth going for.
InnoDB has already been selected for this project. Day-to-day performance is the primary concern.
This doesn't answer your question, but it needs to be mentioned...
The primary key will be a random CHAR(16) string.
This is a Bad Idea. Use an UNSIGNED BIGINT column with AUTO_INCREMENT. No need to reinvent the wheel: you won't have to worry about key management or collisions that way.
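A minimal sketch of what that might look like, assuming the random string is still needed as a secondary lookup key (table and column names here are made up for illustration, not from the question):

CREATE TABLE user_data (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- surrogate primary key managed by MySQL
    user_key   CHAR(16) NOT NULL,                        -- the former random "primary key", now just an indexed lookup column
    created_at DATETIME NOT NULL,
    data       VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),
    KEY idx_user_key (user_key),
    KEY idx_created_at (created_at),
    KEY idx_data (data)
) ENGINE=InnoDB;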
Partition the data on the dates (and maybe additionally on the user, if it is per-user data and you have lots of users).
Then create a monthly table with the SUM, COUNT, AVG, etc. that you need and the appropriate GROUP BY. You can partition that table as well (but dates probably won't be a meaningful partition there).
Then create a yearly table like the monthly table.
Populate the monthly and yearly tables with REPLACE INTO ... SELECT ... statements.
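A rough sketch of both ideas, re-using the hypothetical user_data table from the sketch above (partitioning requires MySQL 5.1+; names and date boundaries are illustrative only):

-- MySQL requires the partitioning column to be part of every unique key,
-- so widen the primary key first, then partition by month
ALTER TABLE user_data
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id, created_at);

ALTER TABLE user_data
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2011_01 VALUES LESS THAN (TO_DAYS('2011-02-01')),
        PARTITION p2011_02 VALUES LESS THAN (TO_DAYS('2011-03-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

-- hypothetical monthly summary table; REPLACE INTO overwrites a month's totals on re-run
CREATE TABLE user_data_monthly (
    month_start DATE NOT NULL,
    data        VARCHAR(255) NOT NULL,
    row_count   INT UNSIGNED NOT NULL,
    PRIMARY KEY (month_start, data)
) ENGINE=InnoDB;

REPLACE INTO user_data_monthly (month_start, data, row_count)
SELECT '2011-01-01', data, COUNT(*)
FROM user_data
WHERE created_at >= '2011-01-01' AND created_at < '2011-02-01'
GROUP BY data;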
As part of a project I've been gathering data from a number of sensors deployed in the field. The aim of this is to understand the performance of the devices and any potential problems or bugs that might be present.
I'm storing the data in a single table in a database with the columns id(primary), MAC_address, name, status, timestamp.
MAC_address is the one thing that is guaranteed to always be the same for a physical device and this is what I've been using mostly to extract information from the database.
My aim was to be able to extract data over a specific time period for a specific device whose MAC_address can be selected from a dropdown. Even doing a single SELECT DISTINCT query to get a list of unique MAC addresses was taking forever, but creating an index for that column seemed to speed it up. However, it still takes >30 seconds right now to extract any number of full rows from the database.
What is the best way to go about speeding up queries from a database this large?
Single table...wrong.
One table describes each "device" (or "sensor"). It has an id which is, perhaps, a 2-byte SMALLINT UNSIGNED (range of 0..65K -- you won't have more sensors than that?). Note the MAC_address and name belong in this table. This id is used on the other table...
Another table contains the sensor_id, timestamp, and value. This table should have PRIMARY KEY(sensor_id, timestamp) and be ENGINE=InnoDB. Now it is very efficient for finding all the readings for one sensor over a period of time.
This completely avoids the SELECT DISTINCT since there are no dups in the first table.
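A minimal sketch of that layout (names and types are illustrative assumptions, not from the question):

CREATE TABLE sensor (
    sensor_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- 2 bytes, up to ~65K sensors
    MAC_address CHAR(17) NOT NULL,                          -- e.g. 'aa:bb:cc:dd:ee:ff'
    name        VARCHAR(100) NOT NULL,
    PRIMARY KEY (sensor_id),
    UNIQUE KEY uk_mac (MAC_address)
) ENGINE=InnoDB;

CREATE TABLE reading (
    sensor_id SMALLINT UNSIGNED NOT NULL,
    ts        DATETIME NOT NULL,
    status    VARCHAR(20) NOT NULL,
    PRIMARY KEY (sensor_id, ts)   -- readings for one sensor over a period are a single index range scan
) ENGINE=InnoDB;

-- all rows for one device over a time period, via the id picked from the dropdown
SELECT ts, status
FROM reading
WHERE sensor_id = 42
  AND ts >= '2024-01-01' AND ts < '2024-02-01';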
If you are having a pulldown, then TINYINT UNSIGNED (1-byte, 0..255) is probably plenty big.
OK, I have helped you with two SELECTs; what other ones will you have? Keep in mind that performance is primarily based on the number of rows you need to touch, and that counts the ones you reject.
So I have just set up a database which holds only one table with the following fields:
key_value: holds 6 digit code for a key
redeemed: boolean for if the key is redeemed
redeemed_by: who redeemed it
redeemed_date: when it was redeemed
software_name: name of the software the key relates to
I basically start with an empty database, and when someone purchases through PayPal they get their own key, which is added to the database. After this they open an app that lets them input their code, which is then searched for in the database and marked as redeemed so it can't be used again - this results in both redeemed and unredeemed codes being in one table.
If I was to reach a good few thousand purchases, would this cause the database to slow down majorly, or maybe crash? What if it was a bigger number, say 10,000?
What exactly would be a good solution for this? Even if I had a separate table of redeemed keys, the app would still have to look in that table to see whether a key was redeemed.
Thanks for any answer, I am still learning databases and SQL!
I think your design is sound. You might want to add indexes based on what queries you will be running. key_value sounds like a good primary key which would also serve as an index for updating redeemed.
As noted by Marc B, the hardware is your only likely consideration for performance.
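A minimal sketch of that (column types are assumptions, since the question doesn't give them):

CREATE TABLE license_key (
    key_value     CHAR(6) NOT NULL,            -- the 6-digit code doubles as the primary key
    redeemed      TINYINT(1) NOT NULL DEFAULT 0,
    redeemed_by   VARCHAR(100) NULL,
    redeemed_date DATETIME NULL,
    software_name VARCHAR(100) NOT NULL,
    PRIMARY KEY (key_value)
) ENGINE=InnoDB;

-- redeeming is a single primary-key lookup, fast even at tens of thousands of rows
UPDATE license_key
SET redeemed = 1, redeemed_by = 'some_user', redeemed_date = NOW()
WHERE key_value = '123456' AND redeemed = 0;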
I would use two tables for this: One for what you have spec'ed out, but another as an archive table with a job that migrates over redeemed/expired records on a regular basis.
Reasoning: The primary purpose of the table is for the benefit of redemptions, not for use as an archive. Over time, as more and more redeemed records are found in the table, the performance for lookups of unredeemed records starts getting worse and worse because of all the "deadwood" in the table. (Do you think eBay houses all active and completed auctions in one table?)
If you still absolutely need a "one-table" solution, you can easily create a view that merges the two tables.
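For example, assuming the two tables share the same columns (the table names here are hypothetical):

CREATE OR REPLACE VIEW all_keys AS
    SELECT key_value, redeemed, redeemed_by, redeemed_date, software_name
    FROM license_key              -- active/unredeemed keys
    UNION ALL
    SELECT key_value, redeemed, redeemed_by, redeemed_date, software_name
    FROM license_key_archive;     -- redeemed/expired keys moved over by the archive job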
Also, if you set up a proper primary key, performance will not degrade quickly (for a while, at least), since the key eliminates the table scans you are alluding to as record volumes grow.
I want to create a table with this info:
ID bigint(20) PK AI
FID bigint(20) unique
points int(10) index
birthday date index
current_city varchar(175) index
current_country varchar(100) index
home_city varchar(175) index
home_country varchar(100) index
Engine = MyISAM
In school I learned: create two extra tables, one for cities and one for countries, and reference them with foreign keys when inserting data. The reason I have doubts is:
This table will have around 10M inserts an hour. I'm afraid that if I insert a row and have to look up the city FK and country FK on every insert, I might lose a lot of speed. And is this worth the gain when I am selecting rows, which only happens with WHERE ID = id? There will be around 25M of those selects an hour.
Premature optimization is the root of all evil. Design cleanly first, and optimize later, when you have actual performance data.
A clean design would be a properly normalized schema, i.e. with separate city and country tables.
I'm afraid if I Insert a row and have to lookup the city FK and country FK every insert, I might lose a lot of speed?
Actually, inserting just small IDs instead of raw country/city names in a varchar column may be more efficient:
This will result in fewer disk writes
You have a MyISAM table; so it doesn't have FK support, and doesn't do any foreign key lookup / check
Replacing the varchar columns with integers will put the table in fixed-length row format, which may be faster than the dynamic-length format
Benchmark with real data/workload, and see if de-normalizing is really worth it.
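A sketch of the normalized layout (names and sizes are assumptions; since a city determines its country, only the city ids need to be stored on the big table):

CREATE TABLE country (
    country_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    name       VARCHAR(100) NOT NULL,
    PRIMARY KEY (country_id),
    UNIQUE KEY uk_name (name)
) ENGINE=InnoDB;

CREATE TABLE city (
    city_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    country_id SMALLINT UNSIGNED NOT NULL,
    name       VARCHAR(175) NOT NULL,
    PRIMARY KEY (city_id),
    UNIQUE KEY uk_city (country_id, name)
) ENGINE=InnoDB;

-- the big table stores small integer ids instead of repeated varchar values
CREATE TABLE person (
    ID              BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    FID             BIGINT UNSIGNED NOT NULL,
    points          INT NOT NULL DEFAULT 0,
    birthday        DATE NULL,
    current_city_id INT UNSIGNED NOT NULL,
    home_city_id    INT UNSIGNED NOT NULL,
    PRIMARY KEY (ID),
    UNIQUE KEY uk_fid (FID)
) ENGINE=InnoDB;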
There's a reason why db normalization exists.
Use a table for cities, one for countries and join them with your master table via FK's.
Also, what country do you know of that has 100 characters in its name?
What city do you know of that has 175 characters in its name?
ID can be a BIGINT, but are you sure you need BIGINT(20)? Wouldn't an INT(11) suffice? Either way, AUTO_INCREMENT it and don't put a separate UNIQUE on it; a primary key is already unique.
Also, you have indexes on every column but no composite index. This is wrong for so many reasons. Do not pre-index; index based on your actual queries. Use EXPLAIN to see what needs to be indexed.
Also, don't be afraid to use composite indexes and avoid creating indexes for every column that you have.
Do all the above steps and you will have fast queries (let's hope at least)
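For example, with a table like the person sketch above (the query itself is made up for illustration):

-- suppose the common query filters by current city and sorts by points
EXPLAIN SELECT ID, points
FROM person
WHERE current_city_id = 123
ORDER BY points DESC;

-- one composite index serves both the filter and the sort,
-- instead of separate single-column indexes on each field
ALTER TABLE person ADD INDEX idx_city_points (current_city_id, points);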
The City and Country tables will be small (relatively) and will probably fit nicely in memory, so lookups will be fast.
If that isn't fast enough, try caching the lookups client side (i.e. in your PHP app).
Since your rows will be smaller (INT instead of VARCHAR), you can fit more rows on each page, making index lookups faster.
Try it normalized first; it will probably be fast enough.
And make sure you use InnoDB instead of MyISAM. It has much better locking and your application looks very concurrent.
I am creating a database for keeping track of water usage per person for a city in South Florida.
There are around 40000 users, each one uploading daily readouts.
I was thinking of ways to set up the database, and it would seem easier to give each user a separate table. This should ease the download of data because the server will not have to sort through a table with tens of millions of entries.
Am I false in my logic?
Is there any way to index table names?
Are there any other ways of setting up the DB to both raise the speed and keep the layout simple enough?
-Thank you,
Jared
p.s.
The essential data for the readouts are:
-locationID (table name in my idea)
-Reading
-ReadDate
-ReadTime
p.p.s. During this conversation, I uploaded 5k tables and the server froze. ~.O
Thanks for your help, y'all
Setting up thousands of tables is not a good idea. You should maintain one table and put all entries in that table. MySQL can handle a surprisingly large amount of data. The biggest issue that you will encounter is the number of queries that you can handle at a time, not the size of the database. For columns that hold numbers, use INT with the UNSIGNED attribute; for columns that hold text, use VARCHAR of an appropriate size (or TEXT if the text is large).
Handling users
If you need to identify records with users, set up another table that might look something like this:
user_id INT(10) UNSIGNED AUTO_INCREMENT PRIMARY KEY
name VARCHAR(100) NOT NULL
When you need to link a record to the user, just reference the user's user_id. For the record information I would set up the SQL something like:
id INT(10) UNSIGNED AUTO_INCREMENT PRIMARY KEY
u_id INT(10) UNSIGNED
reading - I'm not sure what your reading looks like; if it's a number use INT, if it's text use VARCHAR
read_time TIMESTAMP
You can also consolidate the date and time of the reading to a TIMESTAMP.
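Put together, the two tables might look something like this (the reading is assumed to be numeric here, and the index is an assumption about your access pattern):

CREATE TABLE user (
    user_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    name    VARCHAR(100) NOT NULL,
    PRIMARY KEY (user_id)
) ENGINE=InnoDB;

CREATE TABLE readout (
    id        INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    u_id      INT(10) UNSIGNED NOT NULL,
    reading   INT UNSIGNED NOT NULL,        -- assumed numeric meter value
    read_time TIMESTAMP NOT NULL,           -- date and time consolidated in one column
    PRIMARY KEY (id),
    KEY idx_user_time (u_id, read_time)     -- fast per-user, per-period lookups
) ENGINE=InnoDB;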
Do NOT create a separate table for each user.
Keep indexes on the columns that identify a user and on any other common constraints such as date.
Think about how you want to query the data at the end. How on earth would you sum up the data from ALL users for a single day?
If you are worried about primary key, I would suggest keeping a LocationID, Date composite key.
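With a single table, that cross-user roll-up is one query; for example, against the hypothetical readout table sketched in the previous answer:

-- total usage across ALL users/locations for a single day
SELECT SUM(reading) AS total_usage
FROM readout
WHERE read_time >= '2012-06-01' AND read_time < '2012-06-02';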
Edit: Lastly, (and I do mean this in a nice way) but if you are asking these sorts of questions about database design, are you sure that you are qualified for this project? It seems like you might be in over your head. Sometimes it is better to know your limitations and let a project pass by, rather than implement it in a way that creates too much work for you and folks aren't satisfied with the results. Again, I am not saying don't, I am just saying have you asked yourself if you can do this to the level they are expecting. It seems like a large amount of users constantly using it. I guess I am saying that learning certain things while at the same time delivering a project to thousands of users may be an exceptionally high pressure environment.
Generally speaking, tables should represent sets of things. In your example, it's easy to identify the sets you have: users and readouts; so the theoretically best structure would be those two tables, where the readout entries have a reference to the id of the user.
MySQL can handle very large amounts of data, so your best bet is to just try the user-readouts structure and see how it performs. Alternatively, you may want to look into a document-based NoSQL database such as MongoDB or CouchDB, since storing readout reports as individual documents could be a good choice as well.
If you create a summary table that contains the monthly total per user, surely that would be the primary usage of the system, right?
Every month, you crunch the numbers and store the totals into a second table. You can prune the log table on a rolling 12-month period, i.e. the old data can be stuffed in the corner to keep the indexes smaller, since you'll only need to access it when the city is accused of fraud.
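A rough sketch of that monthly crunch, assuming the hypothetical readout table from above and a made-up monthly_usage summary table:

CREATE TABLE monthly_usage (
    u_id        INT(10) UNSIGNED NOT NULL,
    month_start DATE NOT NULL,
    total_usage BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (u_id, month_start)
) ENGINE=InnoDB;

-- crunch one month's numbers into the summary table
INSERT INTO monthly_usage (u_id, month_start, total_usage)
SELECT u_id, '2012-06-01', SUM(reading)
FROM readout
WHERE read_time >= '2012-06-01' AND read_time < '2012-07-01'
GROUP BY u_id;

-- then prune (or archive) detail rows that fall out of the rolling 12-month window
DELETE FROM readout WHERE read_time < '2011-07-01';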
So exactly how you store the daily readouts isn't that big of a concern that you need to be freaking out about it. Giving each user his own table is not the proper solution. If you have tons and tons of data, then you might want to consider sharding via something like MongoDB.
What is the purpose of a secondary key? Say I have a table that logs all the check-ins (similar to Foursquare), with columns id, user_id, location_id, post, time, and there can be millions of rows; many people have said to use secondary keys to speed up queries.
Why does this work? And should both user_id and location_id be secondary keys?
I'm using MySQL btw...
Edit: There will be a page that lists/calculates all the check-ins for a particular user, and another page that lists all the users who have checked in to a particular location
MySQL queries
Type 1
SELECT location_id FROM checkin WHERE user_id = 1234
SELECT user_id FROM checkin WHERE location_id = 4321
Type 2
SELECT COUNT(location_id) as num_users FROM checkin
SELECT COUNT(user_id) as num_checkins FROM checkin
The key (also called an index) is for speeding up queries. If you want to see all check-ins for a given user, you need a key on the user_id field. If you want to see all check-ins for a given location, you need an index on the location_id field. You can read more in the MySQL documentation
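For example, assuming the checkin table described in the question:

-- secondary indexes for the two lookup patterns
ALTER TABLE checkin
    ADD INDEX idx_user (user_id),
    ADD INDEX idx_location (location_id);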
I want to comment on your question and your examples.
Let me just suggest strongly to you that since you are using MySQL, you make sure that your tables are using the InnoDB engine, for many reasons you can research on your own.
One important feature of InnoDB is that you get referential integrity. What does that mean? In your checkin table, you have a foreign key, user_id, which is the primary key of the user table. With referential integrity, MySQL will not let you insert a row with a user_id that doesn't exist in the user table. Using MyISAM, it will. That alone should be enough to make you want to use the InnoDB engine.
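A minimal sketch of that constraint (the definitions are simplified and assumed, not taken from your actual schema):

CREATE TABLE user (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE checkin (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id     INT UNSIGNED NOT NULL,
    location_id INT UNSIGNED NOT NULL,
    post        VARCHAR(255) NULL,
    time        DATETIME NOT NULL,
    PRIMARY KEY (id),
    CONSTRAINT fk_checkin_user FOREIGN KEY (user_id) REFERENCES user (id)
) ENGINE=InnoDB;

-- rejected with a foreign key error unless a user with id 1234 already exists
INSERT INTO checkin (user_id, location_id, post, time)
VALUES (1234, 4321, 'first!', NOW());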
To your question about keys/indexes: essentially, when a table is defined and a key is declared for a column or some combination of columns, MySQL will create an index.
Indexes are essential for performance as a table grows with the insert of rows.
All relational databases and document databases depend on an implementation of B-tree indexing. What B-trees are very good at is finding an item (or determining it isn't there) in a predictable number of lookups. So when people talk about the performance of a relational database, the essential building block of that is the use of B-tree indexes, which are created via KEY clauses or with ALTER TABLE or CREATE INDEX statements.
To understand why this is, imagine that your user table was simply a text file, with one line per row, perhaps separated by commas. As you add a row, a new line in the text file gets added at the bottom.
Eventually you get to the point that you have 10,000 lines in the file.
Now you want to find out if you entered a line for one particular person with the last name of Smith. How can you find that out?
Without any sort of ordering of the file, or a separate index, you have but one option: start at the first line in the file and scan through every line looking for a match. Even if you found a Smith, that might not be the only 'Smith' in the table, so you have to read the entire file from top to bottom every time you want to do this search.
Obviously as the table grows the performance of searching gets worse and worse.
In relational database parlance, this is known as a "table scan". The database has to start at the first row and scan through reading every row until it gets to the end.
Without indexes, relational databases still work, but they are highly dependent on IO performance.
With a Btree index, the rows you want to find are found in the index first. The indexes have a pointer directly to the data you want, so the table no longer needs to be scanned, but instead the individual data pages required are read. This is how a database can maintain adequate performance even when there are millions or 10's or 100's of millions of rows.
To really start to gain insight into how MySQL works, you need to get familiar with EXPLAIN EXTENDED ... and start looking at the explain plans for your queries. Simple ones like those you've provided will have simple plans that show you how many rows are being examined to get a result and whether or not they are using one or more indexes.
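For example, against the hypothetical checkin table above:

-- shows the access type, the chosen index (key), and the estimated rows examined
EXPLAIN EXTENDED
SELECT location_id FROM checkin WHERE user_id = 1234;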
For your summary queries, indexes are not helpful because you are doing a COUNT(). The table will need to be scanned when you have no other criteria constraining the search.
I did notice what looks like a mistake in your summary queries. Based on your column aliases, I would think these are the right queries to get what you want:
SELECT COUNT(DISTINCT user_id) as num_users FROM checkin
SELECT COUNT(*) as num_checkins FROM checkin
This is yet another reason to use InnoDB, which when properly configured has a data cache (the InnoDB buffer pool) similar to other RDBMSs like Oracle and SQL Server. MyISAM doesn't cache data at all, so if you are repeatedly running the same sorts of queries that require a lot of IO, MySQL will have to do all that data-reading work over and over, whereas with InnoDB the data could very well be sitting in cache memory and the result returned without having to go back and read from storage.
Primary vs Secondary
There really is no such concept internally. A primary key is special because it allows the database to find one single row. Primary keys must be unique, and to reflect that, the associated B-tree index is unique, which simply means that it will not allow two entries with the same key value to exist in the index.
A unique index is also an excellent tool for maintaining the consistency of your database in many other cases. Let's say you have an 'employee' table with an SS_Number column to store the social security #. It makes sense to have an index on that column if you want the system to support finding an employee by SS number; without an index, you will table scan. But you also want that index to be unique, so that once an employee with a given SS# is inserted, there is no way the database will let you enter a duplicate employee with the same SS#.
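For instance (a simplified, assumed employee table just for illustration):

CREATE TABLE employee (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name      VARCHAR(100) NOT NULL,
    SS_Number CHAR(11) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uk_ssn (SS_Number)   -- fast lookup by SS# and no duplicates allowed
) ENGINE=InnoDB;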
But to demystify this for you, when you declare keys these indexes are just being created for you and used automagically in most cases, when you define the tables.
It's when you aren't dealing with keys (primary or foreign), as in the example of usernames, first and last names, SS#'s, etc., that you also need to be aware of how to create an index, because you are searching (using WHERE clause criteria) on one or more columns that aren't keys.