Given a site like StackOverflow, would it be better to create a num_comments column to store how many comments a submission has and update it whenever a comment is made, or to just query the number of rows with the COUNT function? It seems like the latter would be more readable and elegant but the former would be more efficient. What does SO think?
Definitely use COUNT. Storing the number of comments is a classic de-normalization that produces headaches. It's slightly more efficient for retrieval but makes inserts much more expensive: each new comment requires not only an insert into the comments table, but also a write lock on the row containing the comment count.
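To make the trade-off concrete, here is a minimal sketch of both options side by side. It uses Python's built-in sqlite3 module as a stand-in for a real database server, and the table and column names (posts, comments, num_comments) as well as the helper functions are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        num_comments INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        post_id INTEGER NOT NULL REFERENCES posts(id),
        body TEXT
    );
    CREATE INDEX idx_comments_post ON comments(post_id);
    INSERT INTO posts (id) VALUES (1);
""")

def add_comment(conn, post_id, body):
    # Denormalized variant: every insert also touches the counter row,
    # and both writes share one transaction so they cannot drift apart.
    with conn:
        conn.execute(
            "INSERT INTO comments (post_id, body) VALUES (?, ?)",
            (post_id, body),
        )
        conn.execute(
            "UPDATE posts SET num_comments = num_comments + 1 WHERE id = ?",
            (post_id,),
        )

def count_comments(conn, post_id):
    # Normalized variant: derive the number on demand with COUNT.
    row = conn.execute(
        "SELECT COUNT(*) FROM comments WHERE post_id = ?", (post_id,)
    ).fetchone()
    return row[0]

def stored_count(conn, post_id):
    # Read the denormalized counter instead of counting rows.
    row = conn.execute(
        "SELECT num_comments FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    return row[0]

for text in ("first", "second", "third"):
    add_comment(conn, 1, text)

print(count_comments(conn, 1), stored_count(conn, 1))
```

The extra UPDATE inside add_comment is the write on the counter row that this answer warns about; with COUNT alone, that statement disappears.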
The former is not normalized but will produce better performance (assuming many more reads than writes).
The latter is more normalized, but will require more resources and hence be less performant.
Which is better boils down to application requirements.
I would suggest counting comment records. Although the other method would be faster, counting lends itself to a cleaner database. Adding a count column would be a sort of data duplication, not to mention requiring an additional code step and an extra write.
If you were to expect millions of comments, then you may want to pick the count column approach.
I agree with @Oded. It depends on the app requirements and also on how active the site is. However, here are my two cents:
I would try to avoid the extra writes, which would have to be done by triggers or by UPDATEs to the post table whenever new comments are added.
If you are concerned about reporting the data then don't do that on a transactional system. Create a reporting DB and update that periodically.
The "correct" way to design is to use another table, join it and COUNT. This is consistent with what database normalization teaches.
The problem with normalization is that it does not always scale. There are only so many ways to skin a cat, so if you have millions of queries per day and a lot of them involve table X, database performance will go through the floor, as the server also has to deal with concurrent writes, transactions, etc.
To deal with this problem, a common practice is sharding. Sharding has the side effect that the rows of a table are not stored in the same physical location, and a primary consequence of this is that you cannot JOIN anymore; how can you JOIN against half a table and receive meaningful results? And obviously, trying to JOIN against all partitions of a table and merge the results is going to be worse than the disease.
So you see that not only is the alternative you are considering used in practice to achieve high performance, but there are also even more radical steps that engineers can and do take.
Of course, unless you do have performance issues, sharding or even de-normalizing is just making your life harder for no tangible benefit.
I have an online store, our products come from a 4 table join.
I want to move away from these joins for the following reasons:
too expensive on the database.
when I need to query data, I want to use simpler queries.
I am thinking of offloading the data into a simpler form into another DB and table.
Then, in addition, cache that data coming from the new table.
This gives me:
Good performance
simpler querying when I need to perform on the fly lookups using a DB client.
Can anyone weigh in on whether or not this is a good approach?
Am I overdoing it?
This is not a good approach. What you are doing is denormalizing, and this should only be done as a last resort if you really need to increase performance in your system. I've worked on websites with over 10 million views per month, and even on those sites it was only necessary for some specific use cases.
MySQL joins are very fast, and joining 4 tables is nothing. I've written queries joining 15 tables that ran in less than 0.001s. If your indexes are done right, the difference won't be noticeable.
What you're doing is both premature optimization and query-writing laziness. Unless your online store gets hundreds of thousands (or even millions) of visits every day, you are not focusing on the right things; data integrity and consistency are far more important.
I am building a game site with a lot of queries. For optimisation, which is best: handling the data with a lot of tables and relations, or with fewer tables but many fields?
I would think, especially with regard to inserts and updates, that fewer tables with many fields would be better than many tables. Many tables would mean more queries, or?
I'm trying to figure out the best course, as I am experiencing high load on my server in the evenings when I have a lot of users...
Start off with the database normalized. Ensure that you have the appropriate indexes for the queries/updates/inserts that you are doing. Use optimize table periodically.
If you are still encountering problems do some profiling to find out where the performance is insufficient. Then consider either denormalizing or perhaps rewriting the queries.
In addition make sure that the system cannot have deadlocks. That really messes up performance.
i don't think the number of columns affects anything, really. it's all about how well you've indexed the columns. if you do more updates than selects on a particular field, you might want to drop the index if you have one.
not really an answer, just something i've noticed.
In the company where I came to work, they run a PHP/MySQL relational database. I had always thought that if I needed to pull different info from different tables, that I could just do a simple join to pull in the data such as....
SELECT table_1.id, table_2.id FROM table_1 LEFT JOIN table_2 ON table_1.sub_id = table_2.id
When I got to where I currently work, this is what they do.
<?php
// Outer query: fetch every sub_id from table_1.
$query = mysql_query("SELECT sub_id FROM table_1");
while($rs = mysql_fetch_assoc($query)) {
    // One extra query per outer row.
    $query_2 = mysql_fetch_assoc(mysql_query("SELECT * FROM table_2 WHERE id = '{$rs['sub_id']}'"));
    //blah blah blah more queries
}
?>
When I asked why they did it the second way, they said that it actually ran faster than a join. They manage a database that has millions of records on different tables and some of the tables are a little wide (row-wise). They said that they wanted to avoid joins in case a poorly executed query locked up a table (or several of them). One other thing to keep in mind is that there is a massive report builder attached to this database that a client can use to build their own report, and if they go crazy and build a big report, it could cause some havoc.
I was confused so I thought I'd throw this out there for the general programming public. This could be a matter of opinion, but is it really faster to do the while statement (one larger query to pull a lot of rows, followed by a lot of small tiny sub-queries if you will) or to do a join (pull a larger query one time to get all the data you need). As long as indexes are done properly, does it matter? One other thing to consider is that the current DB is in InnoDB format.
Thanks!
Update 8/28/14
So I thought I'd throw up an update to this one and what has worked more long term. After this discussion I decided to rebuild the report generator here at work. I don't have definitive result numbers, but I thought I'd share what the result was.
I think I went a little overkill, because I turned the entire report (it's pretty dynamic as far as the data that's returned) into a massive join fest. Most of the joins, if not all, join a value to a primary key, so they all run really fast. Previously, if the report had, let's say, 30 columns of data to pull and it pulled 2000 records, every single field ran its own query to fetch the data (because that piece of data could be on a different field). 30 x 2000 = 60000 queries, and even at a sweet query time of 0.0003 seconds per query, that was still 18 seconds of just query time (which is pretty much what I remember it being). Now that I have rebuilt the query as a massive join on a bunch of primary keys (where possible), that same report loads in about 2-3 seconds, and most of that time is downloading the HTML. Each record returned runs between 0 and 4 extra queries depending on the data that's needed (it may not need any if the joins can fetch it, which happens 75% of the time). So the same 2000 records result in an additional 0-8000 queries (much better than 60000).
I would say that the while statement is useful in some cases, but as stated below in the comments, benchmarking is what it's all about. In my case, joins were the better option, but in other areas of my site, a while statement is more useful. In one instance I have a report where a client can request several categories to pull by and only get data for those categories. What happened was that I had a category_id IN(...,...,..,.., etc etc etc) with 50-500 IDs, and the index would choke and die in my arms as I was holding it in its final moments. So I split the IDs into groups of 10 and ran the same query x / 10 times, and my results were fetched way faster than before because the index likes dealing with 10 IDs, not 500. So in that case I saw a great improvement from doing the while statement.
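The "groups of 10" trick from the update can be sketched roughly as follows. This is only an illustration in Python with the built-in sqlite3 module standing in for MySQL; the items table, its columns, and the helper names are hypothetical, and whether chunking actually helps depends on the engine and the data, so it should be benchmarked rather than assumed.

```python
import sqlite3

def chunks(ids, size=10):
    # Yield consecutive slices of at most `size` IDs.
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def fetch_by_categories(conn, category_ids, size=10):
    # Run the same query once per group of IDs and merge the results.
    rows = []
    for group in chunks(category_ids, size):
        placeholders = ",".join("?" * len(group))
        sql = (
            "SELECT id, category_id FROM items "
            f"WHERE category_id IN ({placeholders})"
        )
        rows.extend(conn.execute(sql, group).fetchall())
    return rows

# Hypothetical demo data: 100 items spread over 50 categories.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category_id INTEGER)")
conn.execute("CREATE INDEX idx_items_category ON items(category_id)")
conn.executemany(
    "INSERT INTO items (id, category_id) VALUES (?, ?)",
    [(i, i % 50) for i in range(1, 101)],
)
conn.commit()

wanted = list(range(25))  # 25 IDs become groups of 10, 10 and 5
rows = fetch_by_categories(conn, wanted)
print(len(rows))
```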
If the indexes are properly used, then it is almost always more efficient to use a JOIN. The emphasis is added because best efficiency does not always equal best performance.
There isn't really a one-size-fits all answer, though; you should analyze a query using EXPLAIN to ensure that the indexes are indeed being used, that there is no unnecessary temp table use, etc. In some cases, conditions conspire to create a query that just can't use indexes. In those cases, it might be faster to separate the queries into pieces in the fashion you've indicated.
If I encountered such code in an existing project, I would question it: check the query, think of different ways to perform the query, make sure that these things have been considered, and build a scientific, fact-supported case for or against the practice. Make sure that the original developers did their due diligence, since not using a JOIN superficially points to poor database or query design. In the end, though, the results speak loudly, and if all the optimizations and corrections still result in a slower join than using query fragments provides, then the faster solution prevails. Benchmark and act on the results of the benchmark; there is no case in software design where you should accept poor performance for the sake of adherence to arbitrary rules about what you should or should not do. The best-performing method is the best method.
It should be better to do the big query, if the indexes are well placed.
The logic behind it:
1 query = 1 call to the DB server, which then processes the query (optimizer and all) and finally returns the result. N queries mean N calls to the database, including N calls to the optimizer and, in a bad case, I/O.
MySQL has optimizations which work on JOINs. Those optimizations cannot work if you do a while.
As stated in previous answers, check with EXPLAIN whether there is something which isn't using an index when you use the JOIN. Also, you should check the memory which is given to the InnoDB cache, and the memory given to MySQL to parse a given query. Maybe it's because of those parameters that the database goes slower when doing the JOINs.
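The "1 query versus N queries" point can be illustrated with a small sketch. It reuses the table_1 and table_2 names from the question, but everything else is an assumption: Python's sqlite3 stands in for PHP/MySQL, the label column is invented, and an in-memory database has none of the network round-trip cost that makes the loop expensive in practice, so the sketch counts statements instead of timing them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_2 (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE table_1 (
        id INTEGER PRIMARY KEY,
        sub_id INTEGER REFERENCES table_2(id)
    );
""")
conn.executemany(
    "INSERT INTO table_2 (id, label) VALUES (?, ?)",
    [(i, f"label {i}") for i in range(1, 101)],
)
conn.executemany(
    "INSERT INTO table_1 (id, sub_id) VALUES (?, ?)",
    [(i, (i % 100) + 1) for i in range(1, 2001)],
)
conn.commit()

def with_loop(conn):
    # One outer query, then one lookup per outer row (the "while" style).
    queries = 1
    result = []
    outer = conn.execute("SELECT id, sub_id FROM table_1 ORDER BY id").fetchall()
    for row_id, sub_id in outer:
        match = conn.execute(
            "SELECT label FROM table_2 WHERE id = ?", (sub_id,)
        ).fetchone()
        queries += 1
        result.append((row_id, match[0] if match else None))
    return result, queries

def with_join(conn):
    # The same data fetched in a single statement.
    sql = """
        SELECT t1.id, t2.label
        FROM table_1 AS t1
        LEFT JOIN table_2 AS t2 ON t1.sub_id = t2.id
        ORDER BY t1.id
    """
    return conn.execute(sql).fetchall(), 1

loop_rows, loop_queries = with_loop(conn)
join_rows, join_queries = with_join(conn)
print(loop_queries, join_queries, loop_rows == join_rows)
```

Both functions return identical rows; only the number of statements sent to the database differs.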
I would say the answer is, it depends. Normally, I'd say joins are the answer, and doing multiple queries in a loop is bad practise, however, it depends entirely on what is being done.
Is it the case for you? Without detailed table structures and info on indexes as well as use of foreign keys etc, we can't say for sure. Best idea if you want to check, is try it and see. Get their queries, EXPLAIN them, write your own, and do an EXPLAIN on that, see which is more efficient.
I'm not sure about huge databases, but in my projects I always try to keep the queries to a minimum. Queries use hard drive access and (if not on the same host) network access, which are slow. If there are many entries in that first query, you could be running thousands of queries per page, which is going to be slow.
Benchmark to find out the actual answer.
With the example you provided, it is highly unlikely that (with equivalent data) a join by the database will use more resources than setting up a new connection and performing the exact same operation (after all, you're still connecting the data in the same way as a join, even if it is done externally): if it did, the engine could simply be rewritten to use that external route to improve performance.
When joins use more resources (apart from indexing problems), it mostly comes from the downsides of retrieving the data per row, which means that information of the parent table will be duplicated in every row, even when this is redundant.
This may cause performance problems that can be helped by splitting queries if:
there are many children to one parent AND
you fetch lots of data from the parent (many columns or large fields)
In my experience, reducing the number of queries almost always benefits performance (I've optimized by combining queries far more than picking them apart).
The correct use of indices is good advice of course, but at first sight I don't think it will account for differences between those two scenarios, as the same indices (or lack of) would apply in both cases.
I have a classifieds website, and I am thinking about redesigning the database a bit.
Currently I have 7 tables in the db. One table for each "MAIN CATEGORY".
For example, I have a "VEHICLES" table which holds all information about the following categories of classifieds:
cars
mc
mopeds/scooters
trucks
boats
etc etc
However, users on the website usually search in specific categories. For example, the user chooses the "cars" category to search in, and enters a keyword.
My code today will search the entire VEHICLES table for all records with the field "category" equal to "cars", and then get their details:
"SELECT * FROM vehicles WHERE category='cars' AND a lot of other conditions" // just for example, not tested
I am thinking about making a table now, for each of these "sub-categories".
Ie, one for cars, one for mc, one for trucks etc, so that search isn't done through information which isn't needed.
Will this increase search speed? I ask because I have calculated that I will need at least 30 or so tables for this.
Thanks
With a properly indexed table and a "reasonable" number of rows, you will not gain much speed from this approach. Anything you gain in speed of execution you will lose in time-to-market because your programming will become more complicated.
Do not perform this optimization unless and until you encounter a performance problem in testing with a representative set of data.
It will increase the speed of a search within the same category. It will potentially slow down queries where you need aggregate information from the different categories. You need to decide which is the best option for your site.
How many records do you have in total in the vehicles table? It's quite likely that adding proper indexes will greatly increase the speed of your searches.
Check out the 'EXPLAIN' query option in MySQL. Understanding this will help you optimize your database a lot with indices.
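As a rough illustration of what checking the plan looks like, the sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module. This is a stand-in: MySQL's EXPLAIN prints different columns, and the headline column and index name are assumptions, but the idea of comparing the plan before and after adding an index is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vehicles (id INTEGER PRIMARY KEY, category TEXT, headline TEXT)"
)

def plan(conn, sql, params=()):
    # The last column of each plan row is a readable description of one step.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

query = "SELECT * FROM vehicles WHERE category = ?"

# Without an index on category the engine has to scan the whole table.
before = plan(conn, query, ("cars",))

# With an index it can look up the matching rows directly.
conn.execute("CREATE INDEX idx_vehicles_category ON vehicles(category)")
after = plan(conn, query, ("cars",))

print(before)
print(after)
```

The exact wording of the plan text varies between SQLite versions, so treat the printed strings as indicative only.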
Performance optimization is as much art as science, and to really understand what's the best option requires that you do some benchmarking; anyone offering a definitive answer given the available information is just wrong. That said, a few thoughts on your situation:
You don't say what type your category column is now, but if it's a string type, it's probably using more space than other options, thus making the table larger. Proper indexing can help tremendously with speed, but a larger table with larger indexes will always work to do just the opposite.
As already mentioned by someone else, your queries within a category will be faster in the simple case of a category search. How much faster depends on how much data you have in your current table, and the increases may be negated if you have to join in other tables to satisfy the need for all the other conditions to which you alluded. OTOH, it may actually speed things up in certain join cases (e.g., if you were doing self-joins with your all-encompassing table).
If you're working with a lot of data, splitting into multiple tables can greatly ease backups.
Splitting into multiple tables may also make it easier to shard your data across multiple servers for performance reasons. Similarly, it may make replication setups easier to keep running.
If you're tracking data that's category-specific, separate tables enable you to better normalize your database and likely reap some nice performance gains as a result of using much smaller tables.
Splitting obviously means modifying your code. If your code is of the old, creaky type, you may very well achieve a performance gain from the clean-up. Of course, there's also the risk that you'll break something....
Check your indexes. Bad indexes are a very common cause of poor performance but are relatively easy to fix with a bit of quality time spent on self-education. MySQL's EXPLAIN can tell you whether your queries are using the indexes, and the index stats (look in the docs) can tell you how efficiently your indexes are working.
Finally, speaking of code, check yours. Try experimenting with a few approaches, regardless of how the database is set up. For example, it may be quicker to do a couple of separate queries and join the results in code than to do the join in the database. Likewise, it's often quicker to do things like sorts in code, particularly in cases where a join or something means the database would have to create a temporary file/table. Again, check the EXPLAIN output, and if you can't eliminate a problem area in your queries, see if it helps to simplify the queries and do more work in the code. This can be particularly beneficial in the common case where the web server has more resources to spare than the database server.
There are many more factors to consider. Ultimately, though, the best way to make these decisions is not to spend time pondering theories but to put both methods to the test. Create some test databases and benchmark the sort of queries you'd run most often, with and without simulated load. You'll get your answer.
if you are using PHP, try something like
$query = mysql_query($sql);
$tempvalue = array();
while($row = mysql_fetch_assoc($query)){
$tempvalue[]=$row;
}
and then, to loop over the data, use something like
foreach($tempvalue as $key => $value){
// write the table row here
}
maybe MySQL isn't slow and the problem is in the code
testing doesn't kill anyone =)
Apologies in advance if this is a silly question but I'm wondering which might be faster/better in the following simplified scenario...
I've got registered users (in a users table) and I've got countries (in a countries table) roughly as follows:
USERS TABLE:
user_id (PK, INT) | country_id (FK, TINYINT) | other user-related fields...
COUNTRIES TABLE:
country_id (PK, TINYINT) | country_name (VARCHAR) | other country-related fields...
Now, every time I need to display a user's country, I need to do a MySQL join. However, I often need to do lots of other joins with regard to the users and the big picture seems quite "join-heavy".
I'm wondering what the pros & cons might be of taking the countries out of the database and sticking them into a class as an array, from which I could easily retrieve them with public method calls using country_id? Would there be a speed advantage/disadvantage?
Thanks a lot.
EDIT: Thanks for the all the views, very useful. I'll pick the first answer as the accepted solution although all contributions are valued.
Do you have a serious performance problem now? I recently went through a performance improvement on a PHP/MySQL website I developed for my company. Certain areas were too slow, and it turned out a lot of the fault was with the queries themselves. I used timers to figure out which queries were slow, and I reorganized them (added indexes, etc). In a few cases, it was faster to make two separate queries and join them in PHP (I had some pretty complicated joins).
Do not try to optimize until you know you have a problem. Figure out if you have a problem first by measuring it, and then if you need to rearrange your queries you will be able to know if you made an improvement.
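The "use timers to find the slow queries" step might look something like the sketch below. It is written in Python with sqlite3 purely for illustration (the answer itself describes PHP/MySQL), and the timed_query helper, the timings list, and the orders table are all hypothetical names.

```python
import sqlite3
import time

timings = []  # (elapsed_seconds, sql) pairs collected while the app runs

def timed_query(conn, sql, params=()):
    # Run a query, record how long it took, and return the rows.
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    timings.append((elapsed, sql))
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [(f"customer {i % 100}", float(i)) for i in range(10000)],
)
conn.commit()

timed_query(conn, "SELECT COUNT(*) FROM orders")
timed_query(conn, "SELECT * FROM orders WHERE customer = ?", ("customer 7",))

# Slowest statements first: these are the ones worth looking at.
for elapsed, sql in sorted(timings, reverse=True):
    print(f"{elapsed:.6f}s  {sql}")
```

Measuring before and after a change, as the answer suggests, is what tells you whether the change was an improvement.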
It would ease stress on your MySQL server to have fewer JOIN statements, but not significantly so (there aren't that many countries in the world). However, you'll lose that time again because you'll have to implement the JOIN yourself in PHP. And since you're writing it yourself, you will probably write it less efficiently than the SQL statement, which means that it will take more time. I would recommend keeping it in the SQL server, since the advantages of moving it out are so few (and if the PHP instance and the MySQL instance are on the same box, there are no real advantages).
What you suggest should be faster. Granted, the join probably doesn't cost much, but looking it up in a dictionary should be just about free as far as compute power goes.
This is really just a trade off of memory for speed. The only downsides I could see would of course be the increased memory usage to store the country info and the fact that you would have to invalidate that cache if you ever update the countries table (which is probably not very often).
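A minimal sketch of that memory-for-speed trade, including the invalidation step, could look like this. It is written in Python with sqlite3 rather than PHP/MySQL, and it is a variant of the idea in the question: the CountryCache class (a hypothetical name) loads the names from the countries table instead of hard-coding them, so a rename only needs a reload rather than a redeploy.

```python
import sqlite3

class CountryCache:
    """Keeps country_id -> country_name in memory to avoid a join per lookup."""

    def __init__(self, conn):
        self._conn = conn
        self._names = {}
        self.reload()

    def reload(self):
        # Invalidate and rebuild the cache; call this after the table changes.
        cursor = self._conn.execute("SELECT country_id, country_name FROM countries")
        self._names = dict(cursor)

    def name(self, country_id):
        # Plain dictionary lookup, no database round trip.
        return self._names.get(country_id)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE countries (country_id INTEGER PRIMARY KEY, country_name TEXT)"
)
conn.executemany(
    "INSERT INTO countries (country_id, country_name) VALUES (?, ?)",
    [(1, "Norway"), (2, "Swedn")],  # deliberate typo, fixed below
)
conn.commit()

cache = CountryCache(conn)
stale = cache.name(2)

# Fixing the typo in the table does not reach the cache until it is reloaded.
conn.execute("UPDATE countries SET country_name = 'Sweden' WHERE country_id = 2")
conn.commit()
still_stale = cache.name(2)
cache.reload()
fresh = cache.name(2)
print(stale, still_stale, fresh)
```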
I don't think you'd gain anything from removing the join, as you'd have to iterate over all your result rows and manually lookup the country name, which I doubt would be quicker than MySQL can do.
I also would not consider such an approach for the following reason: if you want to change the name of a country (say you've got a typo), you can do so just by updating a row in the database. But if the names of the countries are in your PHP code, you'd have to redeploy the code in order to make a change. I don't know PHP, but that might not be as straightforward as a DB change in a production system.
So for maintainability reasons, IMHO let the DB do the work.
The general rule in the database world is to NORMALIZE first (which results in more tables) and figure out performance issues later.
You will want to DENORMALIZE only for simplicity of code, not for performance. Use indexes and stored procedures. DBMSs are designed to optimize joins.
The reason not to "normalize as you go" is that you would have to modify the code you have already written almost every time you modify the database design.