Splitting data into two tables - php

I want to create a table with this info:
ID bigint(20) PK AI
FID bigint(20) unique
points int(10) index
birthday date index
current_city varchar(175) index
current_country varchar(100) index
home_city varchar(175) index
home_country varchar(100) index
Engine = MyISAM
In school I learned: create two extra tables, one with cities and one with countries, and use foreign keys to those tables when inserting data. The reason I have doubts is:
This table will have around 10M inserts an hour. I'm afraid that if I insert a row and have to look up the city FK and country FK on every insert, I might lose a lot of speed. And is this worth the gain I get when selecting rows, which only happens with WHERE ID = id? There will be around 25M of those selects an hour.

Premature optimization is the root of all evil. Design cleanly first, and optimize later, when you have actual performance data.
A clean design would be a properly normalized table, i.e. with separate city and country tables.
I'm afraid that if I insert a row and have to look up the city FK and country FK on every insert, I might lose a lot of speed?
Actually, inserting just small IDs instead of raw country/city names in a varchar column may be more efficient:
It will result in fewer disk writes
You have a MyISAM table, so it has no FK support and does not do any foreign key lookup/check
Replacing the varchar columns with integers will put the table in fixed-length row format, which may be faster than the dynamic-length format
Benchmark with real data/workload, and see if de-normalizing is really worth it.
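A minimal sketch of what that normalized layout could look like (all table and column names here are illustrative, not from the question; since the question uses MyISAM, no FK constraints are declared because MyISAM would not enforce them anyway):
CREATE TABLE countries (
  country_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (country_id),
  UNIQUE KEY uq_name (name)
) ENGINE=MyISAM;
CREATE TABLE cities (
  city_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  country_id SMALLINT UNSIGNED NOT NULL,
  name VARCHAR(175) NOT NULL,
  PRIMARY KEY (city_id),
  UNIQUE KEY uq_city (country_id, name)
) ENGINE=MyISAM;
-- The main table (name assumed) then stores only small integer ids; with no varchar
-- columns left, MyISAM can use the fixed-length row format mentioned above.
-- Secondary indexes omitted for brevity.
CREATE TABLE user_points (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  fid BIGINT UNSIGNED NOT NULL,
  points INT NOT NULL,
  birthday DATE NULL,
  current_city_id INT UNSIGNED NULL,
  current_country_id SMALLINT UNSIGNED NULL,
  home_city_id INT UNSIGNED NULL,
  home_country_id SMALLINT UNSIGNED NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_fid (fid)
) ENGINE=MyISAM;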

There's a reason why db normalization exists.
Use a table for cities, one for countries and join them with your master table via FK's.
Also, what country do you know of that has 100 characters in its name?
What city do you know of that has 175 characters in its name?
ID can be a bigint, but are you sure you need BIGINT(20); wouldn't an INT(11) suffice? Anyway, AUTO_INCREMENT it and don't UNIQUE it; that doesn't make any sense.
Also, you have indexes on every column but no composite index. This is wrong for so many reasons. Do not pre-index; index depending on your queries, and use EXPLAIN to see what needs to be indexed.
Also, don't be afraid to use composite indexes, and avoid creating an index for every single column you have.
Do all of the above and you will have fast queries (let's hope, at least).
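As an illustration of the "EXPLAIN first, then index for your queries" workflow, with a hypothetical lookup by place (the table and column names are assumed, not from the question):
EXPLAIN SELECT id
FROM user_points
WHERE current_country_id = 5 AND current_city_id = 1234;
-- If EXPLAIN reports a full table scan, one composite index covers both conditions:
ALTER TABLE user_points ADD INDEX ix_current_place (current_country_id, current_city_id);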

The City and Country tables will be (relatively) small and will probably fit nicely in memory, so lookups will be fast.
If that isn't fast enough, try caching the lookup client-side (i.e. in your PHP app).
Since your rows will be smaller (int instead of varchar), you can fit more rows on each page, making index lookups faster.
Try it normalized first; it will probably be fast enough.
And make sure you use InnoDB instead of MyISAM. It has much better locking and your application looks very concurrent.
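For the lookup itself, a sketch of resolving a city name to its id at insert time, assuming a cities table with a unique key on (country_id, name) (all names here are illustrative):
-- Create the city row if it is not there yet, then fetch its id; both statements hit the unique index.
INSERT IGNORE INTO cities (country_id, name) VALUES (1, 'Amsterdam');
SELECT city_id FROM cities WHERE country_id = 1 AND name = 'Amsterdam';
With the small lookup tables cached in memory, that extra round trip per insert is cheap, and the PHP side can additionally memoize ids it has already resolved.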

Related

Load a full list of IDs from the DB or query one record at a time? What's best?

I am migrating a custom-made web site to WordPress. First I have to migrate the data from the previous web site, and then, every day, I have to insert some data using an API.
The data I'd like to insert comes with a unique ID representing a single football game.
In order to avoid inserting the same game multiple times, I made a db table with the following structure:
CREATE TABLE `ss_highlight_ids` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`highlight_id` int(10) unsigned zerofill NOT NULL DEFAULT '0000000000',
PRIMARY KEY (`id`),
UNIQUE KEY `highlight_id_UNIQUE` (`highlight_id`),
KEY `highlight_id_INDEX` (`highlight_id`) COMMENT 'Contains a list with all the highlight IDs. This is used as an index, and disallows the creation of duplicate records.'
) ENGINE=InnoDB AUTO_INCREMENT=2967 DEFAULT CHARSET=latin1
and when I try to insert a new record into my WordPress DB, I would first like to look up this table to see if the ID already exists.
The question now :)
What's preferable? Loading all the IDs using a single SQL query and then using plain PHP to check whether the current game ID exists, or querying the DB for every single row I insert?
I know that MySQL queries are resource-expensive, but on the other hand, I currently have about 3k records in this table, and this will grow past 30-40k in the next few years, so I don't know if it's good practice to load all of those records into PHP.
What is your opinion / suggestion ?
UPDATE #1
I just found that my table is 272 KiB in size with 2,966 rows. This means that in the near future it will grow to roughly 8,000 KiB and keep going.
UPDATE #2
Maybe I have not made it clear enough. For the first insertion, I have to iterate over a CSV file with about 12K records, and after the CSV insertion I will insert about 100-200 records every day. All of those records require a lookup in the table with the IDs.
So the exact question is: is it better to issue 12K MySQL queries at CSV insertion and then about 100-200 MySQL queries every day, or to just load the IDs into server memory and use PHP for the lookup?
Your table has a column id which is AUTO_INCREMENT, which means there is no need to insert anything into that column; it fills itself in.
highlight_id is UNIQUE, so it may as well be the PRIMARY KEY; get rid of id.
A PRIMARY KEY is a UNIQUE key is an INDEX. So this is redundant:
KEY `highlight_id_INDEX` (`highlight_id`)
Back to your question... SQL is designed to do things in batches. Don't defeat that by doing things one row at a time.
How can the table be 272 KiB if it has only two columns and 2,966 rows? If there are more columns in the table, show them; they often give good clues about what you are doing and how to make it more efficient.
2966 rows is 'trivial'; you will have to look closely to see performance differences.
Loading from CSV...
If this is a replacement, use LOAD DATA, building a new table, then RENAME to put it into place. One CREATE, one LOAD, one RENAME, one DROP. Much more efficient than 100 queries of any kind.
If the CSV is updates/inserts, LOAD into a temp table, then do INSERT ... ON DUPLICATE KEY UPDATE ... to perform the updates/inserts into the real table. One CREATE, one LOAD, one IODKU. Much more efficient than 100 queries of any kind.
If the CSV is something else, please elaborate.
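A sketch of the updates/inserts variant, assuming the CSV holds one highlight_id per line (the file path is illustrative):
-- Stage the CSV in a temporary copy of the table.
CREATE TEMPORARY TABLE ss_highlight_ids_stage LIKE ss_highlight_ids;
LOAD DATA LOCAL INFILE '/path/to/highlights.csv'
INTO TABLE ss_highlight_ids_stage
LINES TERMINATED BY '\n'
(highlight_id);
-- One statement inserts the new ids and leaves existing ones untouched.
INSERT INTO ss_highlight_ids (highlight_id)
SELECT s.highlight_id FROM ss_highlight_ids_stage AS s
ON DUPLICATE KEY UPDATE highlight_id = VALUES(highlight_id);
DROP TEMPORARY TABLE ss_highlight_ids_stage;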

How to store 60 Booleans in a MySQL Database?

I'm building a mobile app, and I use PHP & MySQL to write the backend REST API.
I have to store around 50-60 Boolean values in a table called "Reports" (users have to check things in a form). In my mobile app I store the values (0/1) in a simple array. In my MySQL table, should I create a separate column for each Boolean value, or is it enough to store them all in a single string or int, as a "number" like "110101110110111..."?
I get and put the data with JSON.
UPDATE 1: All I have to do is check if everything is 1; if one of them is 0, then that's a "problem". In 2 years this table will have around 15,000-20,000 rows; it has to be very fast and as space-saving as possible.
UPDATE 2: In terms of speed, which solution is faster: separate columns vs. storing it all in a string/binary type? What if I have to check which ones are the 0s? Is it a good solution to store it as a "number" in one column and, if it's not "111..111", send it to the mobile app as JSON, where I parse the value and analyse it on the user's device? Let's say I have to deal with 50K rows.
Thanks in advance.
A separate column per value is more flexible when it comes to searching.
A separate key/value table is more flexible if different rows have different collections of Boolean values.
And, if
your list of Boolean values is more-or-less static
all your rows have all those Boolean values
your performance-critical search is to find rows in which any of the values are false
then using text strings like '1001010010' etc. is a good way to store them. You can search like this:
WHERE flags <> '11111111'
to find the rows you need.
You could use a BINARY column with one bit per flag. But your table will be easier to use for casual queries and eyeball inspection if you use text. The space savings from using BINARY instead of CHAR won't be significant until you start storing many millions of rows.
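A sketch of that layout, with illustrative table and column names and 60 flag characters:
CREATE TABLE reports (
  report_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id INT UNSIGNED NOT NULL,
  flags CHAR(60) NOT NULL,  -- one '0'/'1' character per checkbox
  PRIMARY KEY (report_id)
) ENGINE=InnoDB;
-- Rows with at least one unchecked flag:
SELECT report_id, user_id FROM reports WHERE flags <> REPEAT('1', 60);
-- Position (1-based) of the first unchecked flag in such rows:
SELECT report_id, LOCATE('0', flags) AS first_zero FROM reports WHERE flags <> REPEAT('1', 60);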
Edit: It has to be said: every time I've built something like this with arrays of Boolean attributes, I've later been disappointed at how inflexible it turned out to be. For example, suppose it was a catalog of light bulbs. At the turn of the millennium, the Boolean flags might have been stuff like
screw base
halogen
mercury vapor
low voltage
Then, things change and I find myself needing more Boolean flags, like,
LED
CFL
dimmable
Energy Star
etc. All of a sudden my data types aren't big enough to hold what I need them to hold. When I wrote "your list of Boolean values is more-or-less static" I meant that you don't reasonably expect to have something like the light-bulb characteristics change during the lifetime of your application.
So, a separate table of attributes might be a better solution. It would have these columns:
item_id fk to item table -- pk
attribute_id attribute identifier -- pk
attribute_value
This is ultimately flexible. You can just add new flags. You can add them to existing items, or to new items, at any time in the lifetime of your application. And, every item doesn't need the same collection of flags. You can write the "what items have any false attributes?" query like this:
SELECT DISTINCT item_id FROM attribute_table WHERE attribute_value = 0
But, you have to be careful because the query "what items have missing attributes" is a lot harder to write.
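A sketch of that attribute table together with the two queries just discussed; the items and attributes (definition) tables are assumed, and all names are illustrative:
CREATE TABLE item_attributes (
  item_id INT UNSIGNED NOT NULL,            -- fk to the item table
  attribute_id SMALLINT UNSIGNED NOT NULL,  -- fk to an attribute-definition table
  attribute_value TINYINT(1) NOT NULL,
  PRIMARY KEY (item_id, attribute_id)
) ENGINE=InnoDB;
-- "What items have any false attributes?"
SELECT DISTINCT item_id FROM item_attributes WHERE attribute_value = 0;
-- "What items have missing attributes?" (harder, because absence has to be constructed):
SELECT i.item_id, a.attribute_id
FROM items AS i
CROSS JOIN attributes AS a
LEFT JOIN item_attributes AS ia
  ON ia.item_id = i.item_id AND ia.attribute_id = a.attribute_id
WHERE ia.item_id IS NULL;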
For your specific purpose, when any zero flag is a problem (an exception) and most entries (like 99%) will be "1111...1111", I don't see any reason to store them all. I would rather create a separate table that only stores the unchecked flags. The table could look like: unchecked_flags (user_id, flag_id). In another table you store your flag definitions: flags (flag_id, flag_name, flag_description).
Then your report is as simple as SELECT * FROM unchecked_flags.
Update - possible table definitions:
CREATE TABLE `flags` (
`flag_id` TINYINT(3) UNSIGNED NOT NULL AUTO_INCREMENT,
`flag_name` VARCHAR(63) NOT NULL,
`flag_description` TEXT NOT NULL,
PRIMARY KEY (`flag_id`),
UNIQUE INDEX `flag_name` (`flag_name`)
) ENGINE=InnoDB;
CREATE TABLE `unchecked_flags` (
`user_id` MEDIUMINT(8) UNSIGNED NOT NULL,
`flag_id` TINYINT(3) UNSIGNED NOT NULL,
PRIMARY KEY (`user_id`, `flag_id`),
INDEX `flag_id` (`flag_id`),
CONSTRAINT `FK_unchecked_flags_flags` FOREIGN KEY (`flag_id`) REFERENCES `flags` (`flag_id`),
CONSTRAINT `FK_unchecked_flags_users` FOREIGN KEY (`user_id`) REFERENCES `users` (`user_id`)
) ENGINE=InnoDB;
You may get a better search out of using dedicated columns for each Boolean, but the cardinality is poor, and even if you index each column it will involve a fair bit of traversal or scanning.
If you are just looking for HIGH-VALUES (0xFFF...), then a bitmap is definitely the way to go; it solves your cardinality problem (per the OP update). It's not like you are checking parity. The tree will, however, be heavily skewed towards HIGH-VALUES if that is the normal case, which can create a hot spot prone to node splitting on inserts.
Bit mapping and using bitwise operator masks will save space, but the value needs to be aligned to a byte, so there may be an unused "tip" (provisioning for future fields, perhaps); the mask must therefore be of a maintained length, or the field padded with 1s.
It will also add complexity to your architecture, which may require bespoke coding and bespoke standards.
You need to analyse how important any searching is (you may not ordinarily expect to search all, or even any, of the discrete fields).
This is a very common strategy for denormalising data and also for tuning service requests for specific clients (where some responses are fatter than others for the same transaction).
Case 1: If "problems" are rare.
Have a table Problems with ids and a TINYINT identifying which of the 50-60 problems it is. With suitable indexes on that table you can look up whatever you need.
Case 2: Lots of items.
Use a BIGINT UNSIGNED to hold up to 64 0/1 values. Use an expression like 1 << n to build a mask for the nth (counting from 0) bit. If you know, for example, that there are exactly 55 bits, then the value with all 1s is (1<<55)-1, so you can find the rows with every flag set via WHERE bits = (1<<55)-1, and the rows with a "problem" via WHERE bits <> (1<<55)-1.
Bit Operators and functions
Case 3: You have names for the problems.
SET ('broken', 'stolen', 'out of gas', 'wrong color', ...)
That builds a datatype with (logically) a bit for each problem. See also the function FIND_IN_SET() as a way to check for one particular problem.
Cases 2 and 3 will take about 8 bytes for the full set of problems -- very compact. Most SELECTs that you might perform would scan the entire table, but 20K rows won't take terribly long, and it will be a lot faster than having 60 columns or a row per problem.
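A sketch of Case 2 with 55 flags; the table and column names are illustrative:
CREATE TABLE report_bits (
  report_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  bits BIGINT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (report_id)
) ENGINE=InnoDB;
-- Set flag n (counting from 0), here n = 5, on one row:
UPDATE report_bits SET bits = bits | (1 << 5) WHERE report_id = 42;
-- Rows where all 55 flags are set:
SELECT report_id FROM report_bits WHERE bits = (1 << 55) - 1;
-- Rows where flag 5 is not set:
SELECT report_id FROM report_bits WHERE (bits & (1 << 5)) = 0;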

Purpose of Secondary Key

What is the purpose of a secondary key? Say I have a table that logs all the check-ins (similar to Foursquare), with columns id, user_id, location_id, post, time, and there can be millions of rows. Many people have said to use secondary keys to speed up the process.
Why does this work? And should both user_id and location_id be secondary keys?
I'm using mySQL btw...
Edit: There will be a page that lists/calculates all the check-ins for a particular user, and another page that lists all the users who have checked in to a particular location.
mySQL Query
Type 1
SELECT location_id FROM checkin WHERE user_id = 1234
SELECT user_id FROM checkin WHERE location_id = 4321
Type 2
SELECT COUNT(location_id) as num_users FROM checkin
SELECT COUNT(user_id) as num_checkins FROM checkin
A key (also called an index) is for speeding up queries. If you want to see all check-ins for a given user, you need a key on the user_id field. If you want to see all check-ins for a given location, you need an index on the location_id field. You can read more in the MySQL documentation.
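For the Type 1 queries above, that means two secondary indexes on the checkin table (the index names here are illustrative):
ALTER TABLE checkin
  ADD INDEX ix_user_id (user_id),
  ADD INDEX ix_location_id (location_id);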
I want to comment on your question and your examples.
Let me strongly suggest that, since you are using MySQL, you make sure your tables use the InnoDB engine type, for many reasons you can research on your own.
One important feature of InnoDB is that you have referential integrity. What does that mean? In your checkin table, you have a foreign key of user_id which is the primary key of the user table. With referential integrity, MySQL will not let you insert a row with a user_id that doesn't exist in the user table. Using MyISAM, you can. That alone should be enough to make you want to use the innodb engine.
To your question about keys/indexes, essentially when a table is defined and a key is declared for a column or some combination of columns, mysql will create an index.
Indexes are essential for performance as a table grows with the insert of rows.
All relational databases and document databases depend on an implementation of B-tree indexing. What B-trees are very good at is finding an item (or determining it isn't there) in a predictable number of lookups. So when people talk about the performance of a relational database, the essential building block is the use of B-tree indexes, which are created via KEY statements or with ALTER TABLE or CREATE INDEX statements.
To understand why this is, imagine that your user table was simply a text file, with one line per row, perhaps separated by commas. As you add a row, a new line in the text file gets added at the bottom.
Eventually you get to the point that you have 10,000 lines in the file.
Now you want to find out if you entered a line for one particular person with the last name of Smith. How can you find that out?
Without any sort of ordering of the file, or a separate index, you have but one option: start at the first line in the file and scan through every line looking for a match. Even if you found a Smith, that might not be the only 'Smith' in the table, so you have to read the entire file from top to bottom every time you want to do this search.
Obviously as the table grows the performance of searching gets worse and worse.
In relational database parlance, this is known as a "table scan". The database has to start at the first row and scan through reading every row until it gets to the end.
Without indexes, relational databases still work, but they are highly dependent on IO performance.
With a Btree index, the rows you want to find are found in the index first. The indexes have a pointer directly to the data you want, so the table no longer needs to be scanned, but instead the individual data pages required are read. This is how a database can maintain adequate performance even when there are millions or 10's or 100's of millions of rows.
To really start to gain insight into how mysql works, you need to get familiar with EXPLAIN EXTENDED ... and start looking at the explain plans for queries. Simple ones like those you've provided will have simple plans that show you how many rows are being examined to get a result and whether or not they are using one or more indexes.
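For example, with one of the queries from the question (the columns to watch in the output are key, which index was chosen, and rows, the optimizer's estimate of how many rows must be examined):
EXPLAIN EXTENDED
SELECT location_id FROM checkin WHERE user_id = 1234;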
For your summary queries, indexes are not helpful because you are doing a COUNT(). The table will need to be scanned when you have no other criteria constraining the search.
I did notice what looks like a mistake in your summary queries. Just based on your labels, I would think that these are the right queries to get what you would want given your column alias names.
SELECT COUNT(DISTINCT user_id) as num_users FROM checkin
SELECT COUNT(*) as num_checkins FROM checkin
This is yet another reason to use InnoDB, which when properly configured has a data cache (innodb buffer pool) similar to other rdbms's like oracle and sql server. MyISAM doesn't cache data at all, so if you are repeatedly querying the same sorts of queries that might require a lot of IO, MySQL will have to do all that data reading work over and over, whereas with InnoDB, that data could very well be sitting in cache memory and have the result returned without having to go back and read from storage.
Primary vs Secondary
There really is no such concept internally. A primary key is special because it allows the database to find one single row. Primary keys must be unique, and to reflect that, the associated B-tree index is unique, which simply means it will not allow two keys with the same data to exist in the index.
Whether or not an index is unique is an excellent tool that allows you to maintain the consistency of your database in many other cases. Let's say you have an 'employee' table with the SS_Number column to store social security #. It makes sense to have an index on that column if you want the system to support finding an employee by SS number. Without an index, you will tablescan. But you also want to have that index be unique, so that once an employee with a SS# is inserted, there is no way the database will let you enter a duplicate employee with the same SS#.
But to demystify this for you, when you declare keys these indexes are just being created for you and used automagically in most cases, when you define the tables.
It's when you aren't dealing with keys (primary or foreign), as in the example of usernames, first and last names, SS#'s, etc., that you also need to be aware of how to create an index, because you are searching (using WHERE clause criteria) on one or more columns that aren't keys.
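A sketch of both kinds of non-key index on the employee example above (the last_name column is assumed):
-- Support lookups by last name (non-unique):
ALTER TABLE employee ADD INDEX ix_last_name (last_name);
-- Support lookups by SS# and reject duplicates at the same time:
ALTER TABLE employee ADD UNIQUE INDEX ux_ss_number (SS_Number);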

Which of these 2 methods is more efficient with PHP/MySQL?

I have some location data, which is in a table locations with the key being the unique location_id
I have some user data, which is in a table users with the key being the unique user_id
Two ways I was thinking of linking these two together:
I can put the 'location' in each user's data.
'SELECT user_id FROM users WHERE location = "LOCATIONID";'
//this IS NOT searching with the table's key
//this does not require an explode
//this stores 1 integer per user
I can also put the 'userIDs' as a comma delimited string of ids into each location's data.
'SELECT userIDs FROM locations WHERE location_id = "LOCATIONID";'
//this IS searching with the table's key
//this needs an explode() once the comma delimited list is retrieved
//this stores 1 string of user ids per location
So I wonder which would be more efficient. I'm not really sure how much the size of the stored data could also impact the speed. I want retrievals to be as fast as possible when trying to find out which users are at which location.
This is just an example, and there will be many other tables like locations to compare against the users, so the efficiency, or lack thereof, will be multiplied across the whole system.
Stick with option 1. Keep your database tables normalised as much as possible till you know you have a performance problem.
There's a whole slew of problems with option 2, including not being able to use the user IDs until you pull them into PHP, and then having to fire off lots more SQL queries for each ID. This is extremely inefficient. Do as much inside MySQL as possible; the optimisations the database layer can make while running the query will easily be a lot quicker than anything you write in PHP.
Regarding your point about not searching on the primary key, you should add an index to the location column. All columns that are in a WHERE clause should be indexed as a general rule. This negates the issue of not searching on the primary key, as the primary key is just another type of index for the purposes of performance.
Use the first one to keep your data normalized. You can then query for all users for a location directly from the database without having to go back to the database for each user.
Be sure to add the correct index on your users table too.
CREATE TABLE locations (
locationId INT PRIMARY KEY AUTO_INCREMENT
) ENGINE=INNODB;
CREATE TABLE users (
userId INT PRIMARY KEY AUTO_INCREMENT,
location INT,
INDEX ix_location (location)
) ENGINE=INNODB;
Or to only add the index
ALTER TABLE users ADD INDEX ix_location(location);
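With that schema, fetching every user at a location is a single indexed query (the id value 123 is just an example):
-- All users at location 123, straight off the ix_location index:
SELECT u.userId FROM users AS u WHERE u.location = 123;
-- Or joined, if columns from locations are needed as well:
SELECT u.userId, l.locationId
FROM locations AS l
JOIN users AS u ON u.location = l.locationId
WHERE l.locationId = 123;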
Have you heard of foreign keys?
Get details from multiple tables using a join.
You can use a subquery as well.
As you said, there are two tables, users and locations.
Keep userid as a foreign key in locations and fetch it based on that.
When you store the user IDs as a comma-separated list in a table, that table is not normalized (especially it violates the first normal form, item 4).
It is perfectly valid to denormalize tables for optimization purposes. But only after you have measured that this is where the bottleneck actually is in your specific situation. This, however, can only be determined if you know which query is executed how often, how long they take and whether the performance of the query is critical (in relation to other queries).
Stick with option 1 unless you know exactly why you have to denormalize your table.

Database Design

I'm trying to build out a MySQL database design for a project. The problem is coming up with the best solution. Basically, in my application I will have to insert approximately 10-30 rows per user. The primary key will be a random CHAR(16) string. There will also be a datetime index, and an additional column (with an index) called "data".
Day to day, there will only be a heavy amount of inserts and lookups on the table. The lookups will always be joined on the primary key (so joining those 10-30 rows per user).
I will at times need to be able to look at a few specific months (or a full year even) and use mysql GROUP BY functions on the "data" index as well.
At its current volume and estimates, I would expect the table to grow 9.3m rows/month. I do expect this to increase.
So my question comes down to this: MySQL partitions, programmatic table separation, or another solution? And are things best separated by month or by year? We are running on RHEL, so getting MySQL 5.1 may be a bit of work, but if that's a better solution it may be worth going for.
InnoDB has already been selected for this project. Day-to-day performance is the primary concern.
This doesn't answer your question, but it needs to be mentioned...
The primary key will be a random CHAR(16) string.
This is a Bad Idea. Use an UNSIGNED BIGINT column with AUTO_INCREMENT. No need to reinvent the wheel: you won't have to worry about key management or collisions that way.
Partition the data on the dates (and maybe additionally on the user, if it is per-user data and you have lots of users).
Then create a monthly table with the SUM, COUNT, AVG, etc. that you need and the appropriate GROUP BY. You can partition that table as well (but dates probably won't be a meaningful partition key).
Then create a yearly table like the monthly table.
Populate the monthly and yearly tables with REPLACE INTO ... SELECT ... statements.
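A sketch of those steps under assumed names (the detail table, its created_at datetime column, the data column, and the dates are all illustrative, not from the question):
-- 1) Detail table partitioned by month on the datetime column.
CREATE TABLE detail (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  created_at DATETIME NOT NULL,
  data VARCHAR(32) NOT NULL,
  PRIMARY KEY (id, created_at)  -- the partitioning column must appear in every unique key
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p2013_01 VALUES LESS THAN (TO_DAYS('2013-02-01')),
  PARTITION p2013_02 VALUES LESS THAN (TO_DAYS('2013-03-01')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- 2) Monthly summary table, keyed so REPLACE can refresh a month idempotently.
CREATE TABLE monthly_summary (
  month_start DATE NOT NULL,
  data VARCHAR(32) NOT NULL,
  row_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (month_start, data)
) ENGINE=InnoDB;
-- 3) Populate (or re-populate) one month of the summary.
REPLACE INTO monthly_summary (month_start, data, row_count)
SELECT DATE_FORMAT(created_at, '%Y-%m-01') AS month_start, data, COUNT(*)
FROM detail
WHERE created_at >= '2013-01-01' AND created_at < '2013-02-01'
GROUP BY month_start, data;
A yearly table would follow the same REPLACE INTO ... SELECT pattern, reading from monthly_summary instead of the detail table.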
