Lots of rows (EAV) vs lots of empty columns - PHP

I'm attempting a redesign of the database schema for a website where users fill out a dynamic form based on the category they fit into. The end result of this dynamic form is that I have 100+ columns in the database to store their answers, but in any given case fewer than 10 will actually have a value.
I'm looking to tighten up the database as much as possible and remove NULL values where I can, and one solution I am considering is to use an EAV table to hold the answers to the variable questions.
So table 1 would hold their name and a few other bits, and table 2 would hold one row per question and answer, with a foreign key linking back to table 1.
I've done some slapdash calculations: if they were to get 1,000 forms filled out a day (a high estimate, but not impossible in the longer term), I'm looking at up to 2,600,000 rows in table 2 per year (1,000 forms * 10 answers * 260 working days). My primary key will go up as far as 4 billion, so that isn't a problem, but I'm concerned that after a few years, with 5-10 million records, performance will seriously degrade.
I'm mostly concerned with performance. As far as structure goes, EAV is very much preferable: it will allow my client to add new questions without having to touch the database, and it keeps my schema much tighter. I'm not concerned about how much it complicates my code if it is the right data solution.
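For reference, the EAV layout described above would look something like this (a minimal sketch; all table and column names here are assumptions, not taken from the actual schema):

CREATE TABLE submissions (
    submission_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name          VARCHAR(255) NOT NULL
    -- ... the few other fixed bits ...
) ENGINE=InnoDB;

CREATE TABLE submission_answers (
    submission_id INT UNSIGNED NOT NULL,
    question_id   INT UNSIGNED NOT NULL,
    answer        VARCHAR(255) NOT NULL,
    PRIMARY KEY (submission_id, question_id),
    FOREIGN KEY (submission_id) REFERENCES submissions (submission_id)
) ENGINE=InnoDB;

Note the composite primary key: one row per answered question, no surrogate key to exhaust, and all of one form's answers sit next to each other in the clustered index, so reading a form back is a single short range scan even at millions of rows.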


Storing Large Amounts of Numbers in MySQL Database

I've been thinking this problem over for a while now and have yet to come up with a good way to accomplish it, especially while keeping the tables in some sort of normal form and the data manageable.
I have a 33 x 24 grid map, totaling 792 tiles, and each tile should have an integer value stored with it. Now, it gets slightly more complex, since each of those 792 tiles will in turn hold another 100 small integer values (likely nothing more than two digits), making 79,200 total values to store.
Theoretically there would be a "tblMap" table with a unique mapID and 792 related fields per mapID. Then each of those 792 fields would have a related table holding another 100 fields of data.
This is essentially a large 2D tile map that will be displayed to and updated by a user on a website. The data should not be stored locally for users. I'm using MySQL to store the data and PHP to access and manipulate it.
So what would be the best way to store this, whether in MySQL or possibly by saving and loading a file with the values? And what would the table structure for something like this be? Is it a bad thing to have 792 rows per map in a single table (I kind of think so, but I don't see any normalization rules about it)?
I hope that is clear and makes some sense. I searched around and was able to find related topics, but nothing that was specific to this type of case.
Thanks for any insight!
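One normalized structure that fits this description (a sketch only; every name below is an assumption, not from the question):

CREATE TABLE maps (
    map_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE tiles (
    map_id     INT UNSIGNED NOT NULL,
    x          TINYINT UNSIGNED NOT NULL,  -- 0-32
    y          TINYINT UNSIGNED NOT NULL,  -- 0-23
    tile_value INT NOT NULL,
    PRIMARY KEY (map_id, x, y),
    FOREIGN KEY (map_id) REFERENCES maps (map_id)
) ENGINE=InnoDB;

CREATE TABLE tile_values (
    map_id INT UNSIGNED NOT NULL,
    x      TINYINT UNSIGNED NOT NULL,
    y      TINYINT UNSIGNED NOT NULL,
    slot   TINYINT UNSIGNED NOT NULL,  -- which of the 100 per-tile values this is
    value  TINYINT NOT NULL,           -- the small (two-digit) integer
    PRIMARY KEY (map_id, x, y, slot),
    FOREIGN KEY (map_id, x, y) REFERENCES tiles (map_id, x, y)
) ENGINE=InnoDB;

792 rows per map in tiles (and 79,200 in tile_values) is unremarkable for MySQL; the composite primary keys keep each map's rows physically clustered, so loading a whole map is one range scan per table.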

Efficient way of storing large data set in phpMyAdmin SQL table

I have a large list of exercises that I am currently storing in a SQL table with a unique index value (int) in one column and the exercise name in the other column. I am trying to determine an efficient way of storing sets of exercises as workouts.
As of now, I have two ideas in mind but am not sure which would be the most efficient (from both a storage and a speed standpoint). These are the ideas I am considering:
Solution 1: Create a new table that has a column for every exercise, holding a 1 or 0 depending on whether or not it is in the workout. Each workout would be a row with a unique int id as the index.
Solution 2: Create a new table whose index column identifies the unique workouts, with a second column holding an array of the numbers that correspond to the workout's set of exercises. What I prefer about this option is that it would allow me to preserve the order of exercises within the workout.
I currently have only ~800 exercises and ~400 workouts, making (1) an array of size 800 x 400, whereas (2) would be 400 x 2. However, with the option to create new workouts (let's just say that the exercise list will be fixed), table (1) would grow significantly faster.
Does anyone have suggestions as to how to minimize size and maintain speed in such a context?
I like #2, but if you wanted a more robust solution you could also have 3 tables:
exercises
workouts
workout_exercises
In addition, I generally would not store the exercises per workout as an array. It would be difficult to use SQL to search for a specific exercise within that field, e.g.:
What workouts contain exercise XYZ?
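A sketch of that three-table layout, with a position column added to preserve exercise order (the names here are assumptions, not from the question):

CREATE TABLE exercises (
    exercise_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE workouts (
    workout_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE workout_exercises (
    workout_id  INT UNSIGNED NOT NULL,
    exercise_id INT UNSIGNED NOT NULL,
    position    SMALLINT UNSIGNED NOT NULL,  -- preserves exercise order within the workout
    PRIMARY KEY (workout_id, position),
    FOREIGN KEY (workout_id)  REFERENCES workouts (workout_id),
    FOREIGN KEY (exercise_id) REFERENCES exercises (exercise_id)
) ENGINE=InnoDB;

-- What workouts contain exercise XYZ?
SELECT w.workout_id, w.name
FROM workouts w
JOIN workout_exercises we ON we.workout_id = w.workout_id
JOIN exercises e ON e.exercise_id = we.exercise_id
WHERE e.name = 'XYZ';

This answers the order-preservation concern from solution 2 without giving up the ability to query individual exercises.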
Your table sizes are actually tiny. When you have 10,000 records or 100,000 records you can start to think about size issues.

MySQL recommendation, field vs relational table

I've got a MySQL InnoDB table containing about 2,000,000 rows with 10 fields (table "cars"). It'll keep increasing progressively at a current rate of about 500,000 rows a year. It's a busy table, getting different types of queries on average 2-3 times a second, 24/7.
The situation right now is that I need to expand the information to include an INT field ("country_id"). But this field will, for at least 99% of all rows, hold the default value 1.
My question is: Would there be any specific reasons to do either of the following solutions:
Add the INT field to the table and index it ("cars"."country_id")
Add a relational table ("car_countries") which includes the fields "car_id" and "country_id"
I set up these examples in a test environment and ran a few thousand iterations of queries against the tables to find out:
Database/table size will, due to the index, increase by 19% (~21 MB)
Queries will take on average 16% longer (0.37717 secs vs 0.32431 secs for 1,000 queries each)
I've previously tried to keep tables filled with appropriate information for all fields, adding relational tables where non-mandatory information was needed, but now I've read there's little gain in this as long as there's no need to store arrayed data in the table (which MySQL doesn't handle and PostgreSQL does). In my example a specific car will never be sold to 2 countries, so there will never be a need to add more countries to a specific car.
Almost everything is easier with solution 1, and disk space doesn't really matter. Should I still consider solution 2 anyway? If so, why?
Best regards,
/Thomas
The theoretical answer is that option 1 reflects your underlying relationships: a car can be sold to only one country, so the many-to-many relationship which option 2 suggests is not appropriate. It would confuse future developers and pollute the data model.
The pragmatic answer is that option 2 doesn't appear to have a dramatic performance improvement today, and - crucially - it's likely to introduce complexity into your code. If 99% of the queries don't need the country data, you either have to write the query to include it (thus negating the performance benefit), or build nasty "if I need country THEN query = xxx ELSE query = yyy" logic.
Finally, apropos the indexing question: MySQL generally uses only one index per table in a query, so unless you're writing a query where "country" is in the WHERE clause or being joined on, the index is unlikely to have an impact.
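For concreteness, solution 1 boils down to a single ALTER statement (a sketch; the default of 1 is from the question, and per the summary below the index is optional and probably unhelpful given the skew):

ALTER TABLE cars
    ADD COLUMN country_id INT UNSIGNED NOT NULL DEFAULT 1,
    ADD INDEX idx_country_id (country_id);  -- with 99% of rows at 1, cardinality is too low for this to help much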
Thanks to bwoebi, Raphaƫl Althaus, AgRizzo, Alfons and Ed Gibbs for the input to the question!
Short summary:
Since there can't be two countries to a car and there's only one extra field needed:
Go with Solution 1
Also, an index is probably not needed; check your cardinality and performance in your specific scenario
/Thomas

MySQL Partition Highscore Table

I have a table which stores highscores for a game. The game has many levels, and within each level (identified by a level ID) scores are ordered by score DESC (which is an index). Would partitioning on this level ID column produce the same result as creating many separate level tables (one for each level ID)? I need to separate out the level data somehow, as I'm expecting tens of millions of entries. I hear partitioning could speed this up while leaving my tables normalized.
Also, I have an unknown number of levels in my game (levels may be added or removed at any time). Can I partition on this level ID column and have new partitions automatically created when a new (distinct) level ID is added to the highscore table? I may start with 10 separate levels but end up with 50, with all my data still kept in one table, just in many partitions. Do I have to index the level ID to make this work?
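For what it's worth, MySQL will not auto-create RANGE or LIST partitions as new level IDs appear, but HASH or KEY partitioning maps any new ID onto a fixed set of partitions automatically. A minimal sketch (the column name is an assumption, and note that the partitioning column must be part of every unique key on the table, including the primary key):

ALTER TABLE highscores
    PARTITION BY HASH (level_id)
    PARTITIONS 16;  -- fixed partition count; new level IDs hash into the existing partitions

Queries that filter on a single level_id then prune down to one partition, which approximates the per-level-table idea while keeping one normalized table.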
Thanks in advance for your advice!
Creating an index on a single column is good, but an index containing both columns would be a better solution based on the information you have given. Since the queries filter on a level and then order by score, the level column should come first in the index. I would run:
alter table highscores add index (columnLevel, columnScore);
This will make performance much better. From a database point of view, no matter what highscores you are looking for, the database will know where to search for them.
On that note, if you can (and you are using MyISAM tables), you could also run:
alter table highscores order by columnLevel, columnScore;
which will then group all your data together, so that even though the database KNOWS where each bit is, it can find all the records that belong to one another nearby - which means less hard drive work - and therefore quicker results.
That second operation, too, can make a HUGE difference. My PC at work (a horrible old machine that was top of the range in the nineties) has a database with several million records in it that I built - nothing huge, about 2.5 GB of data including indexes - and performance was dragging, but ordering the data for the indexes improved query time from about 1.5 minutes per query to around 8 seconds. That's JUST due to hard drive speed in being able to get to all the sectors that contain the data.
If you plan to store data for different users, what about having 2 tables: one with all the information about the different levels, and another with one row per user along with their scores in XML/JSON?

To serialize or to keep a separate table?

This question has arisen on many different occasions for me, but it's hard to explain without giving a specific example. So here goes:
Let's imagine for a while that we are creating an issue tracker database in PHP/MySQL. There is a "tasks" table. Now you need to keep track of people who are associated with a particular task (have commented or whatnot). These persons will get an email when a task changes.
There are two ways to solve this type of situation. One is to create a separate table task_participants:
CREATE TABLE IF NOT EXISTS `task_participants` (
`task_id` int(10) unsigned NOT NULL,
`person_id` int(10) unsigned NOT NULL,
UNIQUE KEY `task_id_person_id` (`task_id`,`person_id`)
);
And to query this table with SELECT person_id FROM task_participants WHERE task_id = XXX.
If there are 5,000 tasks and each task has 4 participants on average (the reporter, the subject for whom the task brought benefit, the solver and one commenter), then the task_participants table would hold 5,000 * 4 = 20,000 rows.
There is also another way: add a field to the tasks table and store a serialized array (JSON or PHP serialize()) of person_ids in it. Then there would be no need for this 20,000-row table.
What are your comments, which way would you go?
Go with the multiple records. It promotes database normalization. Normalization is very important. Updating a serialized value is no fun to maintain. With multiple records I can let the database do the work with INSERT, UPDATE and DELETE. Also, you are limiting your future joins by having a multivalued column.
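For instance, adding or removing a participant with the cross-reference table is a one-line statement each (a sketch using the task_participants schema above; the IDs are made up):

-- add person 7 to task 42; IGNORE makes re-adding an existing participant a no-op thanks to the UNIQUE KEY
INSERT IGNORE INTO task_participants (task_id, person_id) VALUES (42, 7);

-- remove them again
DELETE FROM task_participants WHERE task_id = 42 AND person_id = 7;

With a serialized column, each of these becomes read-deserialize-modify-serialize-write in PHP, plus locking to avoid lost updates.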
Definitely do the cross reference table (the first option you listed). Why?
First of all, do not worry about the size of the cross reference table. Relational databases would have been out on their ear decades ago if they could not handle the scale of a simple cross reference table. Stop worrying about 20K or 200K records, etc. In fact, if you're going to worry about something like this, it's better to start worrying about why you've chosen a relational DB instead of a key-value DB. After that, and only when it actually starts to be a problem, then you can start worrying about adding an index or other tuning techniques.
Second, if you serialize the association info, you're probably opaque-ifying a whole dimension of your data that only your specialized JSON-enabled app can query. Serializing data into a single cell in a table typically only makes sense if the embedded structure (a) contains data you would never need to query outside your app, (b) is not something whose internals you need to query efficiently (e.g., avg count(*) of people with tasks), and (c) is just something you either do not have time to model out properly or that is in a prototypical state. I say "probably" above because it's not usually the case that data worth persisting fits these criteria.
Finally, by serializing your data, you are now forced to solve any computation on that serialized data in your code, which is just a big waste of time that you could have spent doing something more productive. Your database already can slice and dice that data any way you need, yet because your data is not in a format it understands, you need to now do that in your code. And now imagine what happens when you change the serialized data structure in V2.
I won't say there aren't use cases for serializing data (I've done it myself), but based on your case above, this probably isn't one of them.
There are a couple of great answers already, but they explain things in rather theoretical terms. Here's my (essentially identical) answer, in plain English:
1) 20k records is nothing to MySQL. If it gets up into the 20 million record range, then you might want to start getting concerned - but it still probably won't be an issue.
2) OK, let's assume you've gone with concatenating all the people involved with a ticket into a single field. Now... Quick! Tell me how many tickets Alice has touched! I have a feeling that Bob is screwing things up and Charlie is covering for him - can you get me a list of tickets that they both worked on, divided up by who touched them last?
With a separate table, MySQL itself can find answers to all kinds of questions about who worked on what tickets and it can find them fast. With everything crammed into a single field, you pretty much have to resort to using LIKE queries to find the (potentially) relevant records, then post-process the query results to extract the important data and summarize it yourself.
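Concretely, both of those questions become one query each against the cross-reference table (a sketch; a persons table with person_id and name columns is assumed here):

-- How many tickets has Alice touched?
SELECT COUNT(*)
FROM task_participants tp
JOIN persons p ON p.person_id = tp.person_id
WHERE p.name = 'Alice';

-- Which tickets did both Bob and Charlie work on?
SELECT tp.task_id
FROM task_participants tp
JOIN persons p ON p.person_id = tp.person_id
WHERE p.name IN ('Bob', 'Charlie')
GROUP BY tp.task_id
HAVING COUNT(DISTINCT p.name) = 2;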
