I'm building a medium-sized table (100,000 entries) in MySQL, and I'm trying to optimize it for speed. The entries contain some data that is transactional in nature; this data will obviously be kept in MySQL. The remainder of the data will not change over the life of the table, nor is it well suited to a table format (i.e. some entries will contain fields that other entries will not, leading to a lot of NULL values). Further, much of the data in this second part will repeat, meaning that there may only be 500-1000 unique sets of data, which are then paired with the entries in the table.
I'm considering three ways of organizing the data.
1) Leave all the data in MySQL in table format.
2) Serialize the non-unique data and save that data in a single MySQL field.
3) Serialize the non-unique data and save it to a file on the hard disk, referenced by a pointer in the MySQL table.
My question is which format would you recommend and why? Which is going to be fastest, given that I will be running many queries on the database?
It sounds like you are describing a normalized database. This is very standard. You would have the "larger" entity as a single table with an id.
For the more voluminous data, you would have a reference to that id, called a foreign key. This is the structure that relational databases were designed for. Part of the meaning of "relational" is relationships between entities.
If you only have a few dozen columns, I wouldn't worry about some values being NULL in some rows and others being NULL in other rows. If you have multiple types of entities, then you can also reflect this in the data structure.
EDIT:
Normalization can have both good and bad effects on performance. In the case where it reduces the size of the data, then the performance is often better than with denormalized data. If you have proper index structures, then normalized data structures usually work pretty well.
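To make that concrete, here is a minimal sketch of the normalized layout for this question, assuming illustrative table and column names (none of these come from the post): the 500-1000 repeating data sets live once in a profiles table, and each of the 100,000 transactional rows simply references one of them via a foreign key.

```sql
-- Sketch only: names and column types are assumptions.
CREATE TABLE profiles (
    profile_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    field_a    VARCHAR(255) NULL,   -- sparse attributes can simply stay NULL
    field_b    VARCHAR(255) NULL
) ENGINE=InnoDB;

CREATE TABLE entries (
    entry_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    profile_id INT UNSIGNED NOT NULL,
    amount     DECIMAL(10,2) NOT NULL,   -- the transactional part
    created_at DATETIME NOT NULL,
    KEY idx_profile (profile_id),
    CONSTRAINT fk_entries_profile
        FOREIGN KEY (profile_id) REFERENCES profiles (profile_id)
) ENGINE=InnoDB;
```

With the index on profile_id, joining the two tables at 100,000 rows is cheap, which is the point about normalized structures performing well.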
Use one of the existing indexing engines, such as Sphinx; don't reinvent the wheel. Sphinx organizes data according to your searching/querying options, it is really fast, and it can handle lots of data. If your database doesn't change often, you only have to run the Sphinx indexer once. One of the cons of this solution is that the Sphinx index files are quite large.
I'm writing up a job recruitment site for ex-military personnel in PHP/MySQL and I'm at a stage where I'm not sure which direction to take.
I have a "candidate" table with all the usual firstname, surname, etc. and a unique candidate_id, and an address table linked by the unique candidate_id.
The client has asked for more data to be captured, such as driving licence type, religion, SIA Level (Security Industry Authority), languages spoken, etc.
My question is, with all this different data, is it worth setting up dedicated tables for each? So for example having a driving licence table, with all the different types of driving licence, each with a unique id, then linked to the candidate table with a driving_licence_id cell?
Or should I just serialize all the extra data as text and put it in one cell in the candidate table?
My question is, with all this different data, is it worth setting up dedicated tables for each?
Yes.
That is what databases are for.
Dedicated tables versus serialized data is called Database Normalization and Denormalization, respectively. In some cases both options are acceptable, but you should really make an educated choice, by reading up on the subject (for example here on about.com).
Personally I usually prefer working with normalized databases, as they are much easier to query for complex aggregated data. Besides, I feel they are easier to maintain as well, since you usually don't have to refactor when adding new fields and tables.
Finally, unless you have a lot of tables, you are unlikely to run into performance problems due to the number of one-to-one joins (the kind of data that's easy to denormalize).
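As a rough illustration of the dedicated-table option (the licence names and exact columns here are assumptions, not from the question), a lookup table plus a foreign-key column keeps aggregated queries trivial compared with a serialized text blob:

```sql
-- Sketch only: names are illustrative.
CREATE TABLE driving_licence (
    driving_licence_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name               VARCHAR(50) NOT NULL   -- e.g. 'B', 'C1', 'HGV Class 1'
);

ALTER TABLE candidate
    ADD COLUMN driving_licence_id INT UNSIGNED NULL;

-- "How many candidates hold each licence type?" stays a one-liner:
SELECT dl.name, COUNT(*) AS candidates
FROM candidate c
JOIN driving_licence dl ON dl.driving_licence_id = c.driving_licence_id
GROUP BY dl.name;
```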
It depends on whether you wish to query this data. If so, keep the data normalised (e.g. in its own logically separated table); otherwise, if it's just metadata to be pulled along for the ride, whatever is simplest seems reasonable.
Neither approach necessarily precludes the other in the future; simple migration scripts can be created to move the data from one format to the other. I would suggest doing what is simplest to enable you to work on other features of the site soonest.
You should always go for normalization, believe me.
I made the mistake of taking the easy way and storing data improperly (not just serialized, but imploded strings of multidimensional arrays); then when the time came I had to redesign the whole thing, and a lot of time was wasted.
I will never go the wrong way again. Clients can say "no" today, but they'll ask for reports (queries) tomorrow.
I am in the process of creating a website where I need to have the activity for a user (similar to your inbox in Stack Overflow) stored in SQL. Currently, my teammates and I are arguing over the most effective way to do this; so far, we have come up with two alternatives:
1) Create a new table for each user and have the table name be theirusername_activity. Then when I need to get their activity (posting, being commented on, etc.) I simply get that table and see the rows in it...
In the end I will have a TON of tables, but it is possibly faster.
2) Have one huge table called activity, with an extra field for their username; when I want to get their activity I simply get the rows from that table "...WHERE username=".$loggedInUser
Fewer tables and cleaner (assuming I index the tables correctly, will this still be slower?).
Any alternate methods would also be appreciated
"Create a new table for each user ... In the end I will have a TON of tables"
That is never a good way to use relational databases.
SQL databases can cope perfectly well with millions of rows (and more), even on commodity hardware. As you have already mentioned, you will obviously need usable indexes to cover all the possible queries that will be performed on this table.
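For instance, a single shared table with a composite index covering the usual "recent activity for one user" query might look like the sketch below (table and column names are assumptions):

```sql
-- Sketch only: one table for all users, indexed by user and time.
CREATE TABLE activity (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    username   VARCHAR(64)  NOT NULL,
    action     VARCHAR(255) NOT NULL,
    created_at DATETIME     NOT NULL,
    KEY idx_user_created (username, created_at)
) ENGINE=InnoDB;

-- The composite index makes this a cheap range scan, even with millions of rows:
SELECT action, created_at
FROM activity
WHERE username = 'some_user'
ORDER BY created_at DESC
LIMIT 20;
```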
Number 1 is just plain crazy. Can you imagine managing it and seeing all those tables?
Can you imagine the backup! Or the dump! That many CREATE TABLE statements... that would be crazy.
Get yourself a good index, and you will have no problem sorting through records.
Here we're talking about MySQL, so why would it be faster to make separate tables?
Query cache efficiency: each insert from one user wouldn't empty the query cache for the others.
Memory & pagination: used tables would fit in buffers, and unused data would simply not get loaded there.
But as everybody here has said, it seems quite crazy in terms of management. And in terms of performance, having a lot of tables adds another problem in MySQL: you may run out of file descriptors or simply wipe out your table cache.
It may be more important here to choose the right engine, like MyISAM instead of InnoDB, as this is an insert-only table. And as #RC said, a good partitioning policy would fix the memory & pagination problem by avoiding the loading of rarely used data into active memory buffers. This should be paired with intelligent application design as well: avoid loading the whole activity history by default; if you reduce it to recent activity and restrict full history-table scans to batch processes and advanced screens, you'll get a nice effect from the partitioning. You can even try a user-based partitioning policy.
For query cache efficiency, you'll get a bigger gain by using an application-level cache (like memcache) with per-user history elements saved there and emptied on each new insert.
You want the second option, and you add the userId (and possibly a separate table for userid, username, etc.).
If you do a lookup on that id on a properly indexed field you'd only need something like log(n) steps to find your rows. This is hardly anything at all. It will be way faster, way clearer and way better than option 1. Option 1 is just silly.
In some cases, the first option is, in spite of not being strictly "the relational way", slightly better, because it makes it simpler to shard your database across multiple servers as you grow. (Doing this is precisely what allows wordpress.com to scale to millions of blogs.)
The key is to only do this with tables that are entirely independent from a user to the next -- i.e. never queried together.
In your case, option 2 makes the most sense: you'll almost certainly want to query the activity across all or some users at some point.
Use option 2, and not only index the username column, but partition (consider a hash partition) on that column as well. Partitioning on username will provide you some of the same benefits as the first option and allow you to keep your sanity. Partitioning and indexing the column this way will provide a very fast and efficient means of accessing data based on the username/user_key. When querying a partitioned table, the SQL engine can immediately lop off the partitions it doesn't need to scan, because it can tell from the username value being queried which partitions could hold that user's rows (in this case only one partition could contain records tied to that user). If you have a need to shard the table across multiple servers in the future, partitioning doesn't hinder that ability.
You will also want to normalize the table by separating the username field (and any other elements in the table related to username) into its own table with a user_key. Ensure a primary key on the user_key field in the username table.
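A minimal sketch of that layout follows (names and partition count are assumptions). Note that HASH partitioning needs an integer expression, so the sketch partitions on the numeric user_key introduced above; for a string username you would use PARTITION BY KEY instead. MySQL also requires the partitioning column to appear in every unique key, which is why user_key is part of the primary key here.

```sql
-- Sketch only: hash-partitioned activity table keyed by user_key.
CREATE TABLE activity (
    activity_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_key    INT UNSIGNED    NOT NULL,
    action      VARCHAR(255)    NOT NULL,
    created_at  DATETIME        NOT NULL,
    PRIMARY KEY (activity_id, user_key),
    KEY idx_user_created (user_key, created_at)
)
PARTITION BY HASH (user_key)
PARTITIONS 16;

-- A query filtered on user_key touches only the one partition that can
-- contain that user's rows (partition pruning).
SELECT action, created_at
FROM activity
WHERE user_key = 42
ORDER BY created_at DESC
LIMIT 20;
```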
This mainly depends on where you need to retrieve the values. If it's a page for a single user, then use the first approach. If you are showing data for all users, you should use a single table. The multiple-table approach is also clean, but in SQL, if the number of records in a single table is very high, data retrieval is very slow.
I'm building a very large website; currently it uses around 13 tables, and by the time it's done it should be about 20.
I came up with an idea to change the preferences table to use ID, Key, Value instead of many columns; however, I have recently thought I could also store other data inside the table.
Would it be efficient / smart to store almost everything in one table?
Edit: Here is some more information. I am building a social network that may end up with thousands of users. MySQL cluster will be used when the site is launched for now I am testing using a development VPS however everything will be moved to a dedicated server before launch. I know barely anything about NDB so this should be fun :)
This model is called EAV (entity-attribute-value).
It is usable for some scenarios; however, it's less efficient due to larger records, a larger number of joins, and the impossibility of creating composite indexes on multiple attributes.
Basically, it's used when entities have lots of attributes which are extremely sparse (rarely filled) and/or cannot be predicted at design time, like user tags, custom fields etc.
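For illustration, a bare-bones EAV table and the kind of self-join it forces when you filter on more than one attribute might look like this (table names and attribute keys are made up):

```sql
-- Sketch only: one row per (entity, attribute) pair.
CREATE TABLE user_attribute (
    user_id  INT UNSIGNED NOT NULL,
    attr_key VARCHAR(64)  NOT NULL,
    attr_val VARCHAR(255) NULL,
    PRIMARY KEY (user_id, attr_key)
);

-- Filtering on two attributes already needs a self-join, and there is no way
-- to build a single composite index over "country" and "language" together:
SELECT a.user_id
FROM user_attribute a
JOIN user_attribute b ON b.user_id = a.user_id
WHERE a.attr_key = 'country'  AND a.attr_val = 'UK'
  AND b.attr_key = 'language' AND b.attr_val = 'French';
```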
Granted, I don't know too much about large database designs, but from what I've seen, even extremely large applications store their data in a very small number of tables (20GB per table).
For me, I would rather have more info in one table, as it means that data is not littered everywhere and I don't have to perform operations on multiple tables. Though one table can also get messy (usually for me, each object has its own table, where an object is something you have in your application logic, like a User class or a BlogPost class).
I guess what I'm trying to say is: do whatever makes sense. Don't put information about the same thing in two different tables, and don't put information about two things in one table. Stick with one table describing only a certain object (this is very difficult to explain, but if you do object-oriented programming, you should understand).
Nope. Preferences should be stored as they are (in the users table).
Private messages, for example, can't be stored in the users table...
You don't have to think about joining different tables...
I would first say that 20 tables is not a lot.
In general (it's hard to say from the limited info you give) the key-value model is not as efficient speed-wise, though it can be more efficient space-wise.
I would definitely not do this. Basically, if you have a large set of data stored in a single table you will see performance issues pretty quickly when constantly querying that same table. Then think about the joins and complexity of queries you're going to need (depending on your site)... not a task I would personally like to undertake.
Using multiple tables splits the data into smaller sets, the resources required for each query are lower, and as an extra bonus it's easier to program!
There are some applications for doing this, but they are rare: more or less when you have a large table with a ton of columns, most of which aren't going to have a value.
I hope this helps :-)
I think 20 tables in a project is not a lot. I do see your point and interest in using EAV but I don't think it's necessary. I would stick to tables in 3NF with proper FK relationships etc and you should be OK :)
The simple answer is that 20 tables won't make it a big DB, and MySQL won't need any optimization for that. So focus on clean DB structures and normalization instead.
I recently started working for a fairly small business, which runs a small website. I overheard a co-worker mention that either our site or our MySQL databases get hit ~87 times a second.
I was also tasked today, with reorganizing some tables in these databases. I have been taught in school that good database design dictates that to represent a many-to-many relationship between two tables I should use a third table as a middle man of sorts. (This third table would contain the id of the two related rows in the two tables.)
Currently we use two separate databases, totalling to a little less than 40 tables, with no table having more than 1k rows. Right now, some PHP scripts use a third table to relate certain rows that has a third column that is used to store a string of comma separated ids if a row in one table relates to more than one row in some other table(s). So if they want to use an id from the third column they would have to get the string and separate it and get the proper id.
When I mentioned that we should switch to using the third table properly like good design dictates they said that it would cause too much overhead for such small tables, because they would have to use several join statements to get the data they wanted.
Finally, my question is would creating stored procedures for these joins mitigate the impact these joins would have on the system?
Thanks a bunch, sorry for the lengthy explanation!
By the sound of things you should really try to redesign your database schema.
two separate databases, totalling to a little less than 40 tables, with no table having more than 1k rows
Sounds like it's not properly normalized - or it has been far too aggressively normalized and would benefit from some polymorphism.
comma separated ids
Oh dear - surrogate keys - not intrinsically bad but often a sign of bad design.
a third table to relate certain rows that has a third column that is used to store a string of comma separated ids
So it's a very long way from normalised - this is really bad.
they said that it would cause too much overhead for such small tables
Time to start polishing up your resume I think. Sounds like 'they' know very little about DBMS systems.
But if you must persevere with this - it's a lot easier to emulate a badly designed database from a well designed one (hint - use views) than vice versa. Redesign the database offline and compare the performance of tuned queries - it will run at least as fast. Add views to allow the old code to run unmodified and compare the amount of code you need to perform key operations.
I don't understand how storing a comma separated list of id's in a single column, and having to parse the list of ids in order to get all associated rows, is less complex than a simple table join.
Moving your queries into a stored procedure normally won't provide any sort of benefit. But if you absolutely have to use the comma separated list of values that represent foreign key associations, then a stored procedure may improve performance. Perhaps in your stored procedure you could declare a temporary table (see Create table variable in MySQL for example), and then populate the temporary table, 1 row for every value contained in your comma separated string.
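A rough sketch of that idea follows (the procedure name and the parsing loop are my own assumptions, not a recommendation): split the comma-separated string into a temporary table that can then be joined against normally.

```sql
-- Sketch only: expand 'id1,id2,id3' into one row per id.
DELIMITER //
CREATE PROCEDURE expand_id_list(IN id_list TEXT)
BEGIN
    DROP TEMPORARY TABLE IF EXISTS tmp_ids;
    CREATE TEMPORARY TABLE tmp_ids (id INT UNSIGNED NOT NULL);

    WHILE LENGTH(id_list) > 0 DO
        -- everything before the first comma is the next id
        INSERT INTO tmp_ids (id)
        VALUES (CAST(SUBSTRING_INDEX(id_list, ',', 1) AS UNSIGNED));

        IF LOCATE(',', id_list) > 0 THEN
            SET id_list = SUBSTRING(id_list, LOCATE(',', id_list) + 1);
        ELSE
            SET id_list = '';
        END IF;
    END WHILE;
END//
DELIMITER ;
```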
I'm not sure what type of performance gain you would get by doing that though, considering like you mentioned there's not a lot of rows in any of the tables. The whole exercise seems a bit silly to do. Ditching the comma separated list of id's would be the best way to go.
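By contrast, the ordinary junction-table design needs no string parsing at all; something like the following sketch (table and column names are placeholders):

```sql
-- Sketch only: one row per association, both directions indexed.
CREATE TABLE a_b_link (
    a_id INT UNSIGNED NOT NULL,
    b_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (a_id, b_id),
    KEY idx_b (b_id)
);

-- "All B rows related to A #123" is a plain indexed join:
SELECT b.*
FROM a_b_link l
JOIN table_b b ON b.id = l.b_id
WHERE l.a_id = 123;
```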
It will be both quicker and simpler to do it in the database than in PHP; that's what database engines are good at. Make indexes on the keys (InnoDB will do this by default) and the joins will be fast; to be honest, with tables that tiny, the joins will almost always be fast.
Stored procedures don't really come into the picture, for two reasons; mainly, they're not going to make any difference to the impact (not that there's any impact anyway - you will be improving app performance by doing this in the DB rather than at the PHP level).
Secondly, avoid MySQL stored procedures like the plague, they're goddamn awful to write and debug if you've ever worked in the stored procedure languages for any other DB at all.
I'm just about to expand the functionality of our own CMS but was thinking of restructuring the database to make it simpler to add/edit data types and values.
Currently, the CMS is quite flat - the CMS requires a field in the database for every type of stored value (manually created).
The first option that comes to mind is simply a table which keeps the data types (ie: Address 1, Suburb, Email Address etc) and another table which holds values for each of these data types. Just like how Wordpress keeps values in the 'options' table, PHP serialize would be used to store an array of values.
The second option is how Drupal works, the CMS creates tables for every data type. Unlike Wordpress, this can be a bit of an overkill but really useful for SQL queries when ordering and grouping by a particular value.
What's everyone's thoughts?
In my opinion, you should avoid serialization where possible. Your relational database should be relational, and thus be structured as such. This would include the 'Drupal Method', e.g. one table per data type. This also keeps your database healthy in the sense that it can be indexed and easily queried.
Unless you plan to have lots of different data types that will be added in the future and are unknown now, this is not really going to help you and would be overkill. If you have very wide tables and lots of holes in your data (i.e. lots of columns that seem to be NULL at random) then that is a pattern that is screaming to maybe have a separate table for data that may only belong to certain entries.
Keep it simple and logical. Don't abstract for the sake of abstraction. Indeed, storing integers is cheaper with regards to storage space but unless that is a problem then don't do it in this case.
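As a closing illustration of the 'Drupal Method' this answer leans toward (table and column names here are assumptions based on the examples in the question), each data type gets its own narrow table keyed by the owning entity, so ordering and grouping stay simple indexed queries, which serialized arrays in an options-style table cannot offer:

```sql
-- Sketch only: one small table per data type.
CREATE TABLE field_suburb (
    entity_id INT UNSIGNED NOT NULL,
    value     VARCHAR(100) NOT NULL,
    PRIMARY KEY (entity_id),
    KEY idx_value (value)
);

CREATE TABLE field_email_address (
    entity_id INT UNSIGNED NOT NULL,
    value     VARCHAR(255) NOT NULL,
    PRIMARY KEY (entity_id)
);

-- Ordering by a stored value is a straightforward indexed query:
SELECT s.entity_id, s.value AS suburb, m.value AS email
FROM field_suburb s
JOIN field_email_address m ON m.entity_id = s.entity_id
ORDER BY s.value;
```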