Is it ok if I create like 8 indexes inside a table which has 13 columns?
If I select data from it and sort the results by a key, the query is really fast, but if the sort field is not a key it's much slower. Like 40 times slower.
What I'm basically asking is if there are any side effects of having many keys in the database...
Creating indexes on a table slows down all write operations on it a little, but speeds up read operations on the relevant columns a lot. If your application is not going to be doing lots and lots of writes to that table (which is true of most applications) then you are going to be fine.
Don't create indexes that are redundant or unused. But do create indexes you need to optimize the queries you run.
You choose indexes in any table based on your queries. Each query may use a different index, so it pays to analyze your queries carefully. See my presentation MENTOR Your Indexes. I also cover similar information in the chapter on indexing in my book SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
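For example, here is a minimal sketch of matching an index to a query (the orders table, its columns and the index name are hypothetical; always confirm the plan with EXPLAIN):
-- Hypothetical index chosen to match one specific query
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
-- The query this index is meant to serve
SELECT order_id, total
FROM orders
WHERE customer_id = 42
ORDER BY order_date DESC;
-- Check that the optimizer actually picks it
EXPLAIN SELECT order_id, total FROM orders WHERE customer_id = 42 ORDER BY order_date DESC;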
There is no specific rule about how many indexes is too many. In Oracle SQL Tuning Pocket Reference, author Mark Gurry says:
My recommendation is to avoid rules stating a site will not have any more than a certain number of indexes. The bottom line is that all SQL statements must run acceptably. There is ALWAYS a way to achieve this. If it requires 10 indexes on a table, then you should put 10 indexes on the table.
There are a couple of good tools to help you find redundant or unused indexes for MySQL in Percona Toolkit: http://www.percona.com/doc/percona-toolkit/pt-duplicate-key-checker.html and pt-index-usage.
This is a good question that everyone who works with MySQL should know the answer to. It is also asked frequently; here is a link to one such question with a good answer:
Indexing every column in a table
In a nutshell, each new index requires space (especially if you use InnoDB - see the "Disadvantages of clustering" section in this article) and slows down INSERTs, UPDATEs and DELETEs.
Only you are in a position to decide whether the speedup you'll get in SELECTs, weighed against how often those queries run, is worth it. But whatever you eventually decide, make sure you base your decision on measurement, not guessing!
P.S. INSERTs, UPDATEs and DELETEs with WHERE can also be sped-up by index(es), but that's another topic...
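As a rough, hypothetical illustration of that last point (table and column names are made up):
-- A DELETE with a WHERE clause can use an index to find the rows to remove
CREATE INDEX idx_sessions_last_seen ON sessions (last_seen);
DELETE FROM sessions WHERE last_seen < '2014-01-01';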
The cost of an index in disk space is generally trivial. The cost of additional writes to update the index when the table changes is often moderate. The cost in additional locking can be severe.
It depends on the read vs write ratio on the table, and on how often the index is actually used to speed up a query.
Indexes take up disk space and take time to create and maintain. Unused ones don't give any benefit. If there are lots of candidate indexes for a query, the query may be slowed down by the server choosing the "wrong" one.
Use those factors to decide whether you need an index.
It is usually possible to create indexes which will NEVER be used - for example, an index on a (not null) column with only two possible values is almost certainly going to be useless.
You need to EXPLAIN your own application's queries to make sure that the frequently run ones are using sensible indexes where possible, and create no more indexes than required to do that.
You can find out more by following these links:
For mysql:
http://www.mysqlfaqs.net/mysql-faqs/Indexes/What-are-advantages-and-disadvantages-of-indexes-in-MySQL
For DB2:
http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0005052.htm
Indexes improve read performance, but increase size and degrade insert/update performance. Eight indexes seems like a bit too many to me; however, it depends on how often you typically update the table.
Assuming MySQL from the tag, even though the OP makes no mention of it.
You should edit your question and add the fact that you are performing ORDER BY operations as well (from a comment you posted to a solution). ORDER BY will also slow down queries (as will various other MySQL operations) because MySQL may have to create a temporary table to produce the ordered result set (more info here). A lot of times, if the dataset allows it, I will pull the data I need, then order it at the application layer to avoid this penalty.
Your best bet is to EXPLAIN your most used queries, and check your slow query log.
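For instance, EXPLAIN will tell you when MySQL falls back to a temporary table or filesort for the ordering (the items table and index below are hypothetical):
EXPLAIN SELECT * FROM items ORDER BY created_at DESC LIMIT 20;
-- If the Extra column shows "Using temporary" or "Using filesort", consider an index on the sort column
-- (or sort at the application layer, as described above)
CREATE INDEX idx_items_created_at ON items (created_at);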
I'm attempting to write a search functionality for a website, and I've decided upon an approach of using MySQL temporary tables to handle the data input, via the query below:
CREATE TEMPORARY TABLE `patternmatch`
(`pattern` VARCHAR(".strlen($queryLengthHere)."))
INSERT INTO `patternmatch` VALUES ".$someValues
Where $someValues is a set of data with the layout ('some', 'search', 'query') - or basically what the user searched. I then search my main table images based on the data within table patternmatch like so:
SELECT images.* FROM images JOIN patternmatch ON (images.name LIKE patternmatch.pattern)
I then apply a heuristic or scoring system based on how well each result matched the input and display the results by that heuristic etc.
What I'm wondering is how much overhead does creating a temporary table require? I understand that they only exist in session, and are dropped as soon as the session is ended, but if I have hundreds of thousands of searches per second, what sort of performance issues might I encounter? Is there any better way of implementing a search functionality?
What you stated is correct: the temporary table will only be visible to the current user/connection. Still, there is some overhead, along with some other problems:
For each of those thousands of searches you are going to create and fill that table (and drop it later) - not per user, but per search, because each search will most likely re-execute the script, and "per session" does not mean PHP session - it means database session (an open connection).
You will need the CREATE TEMPORARY TABLES privilege, which you might not have.
Also, to be fast that table really should use the MEMORY engine, which consumes more RAM than it looks like: even with VARCHAR columns, MEMORY tables use fixed-length row storage.
If your heuristics later need to refer to that table twice (like SELECT xyz FROM patternmatch AS pm1, patternmatch AS pm2 ...), that is not possible: a TEMPORARY table cannot be referred to more than once in the same query.
Next, it would be easier for you - and also for the database - to add the LIKE '%xyz%' directly to your images table's WHERE clause. It will do the same without the overhead of creating a temporary table and joining it.
In any case - no matter which way you go - that WHERE will be horribly slow. Even if you add an index on images.name you most likely will need LIKE '%xyz%' instead of LIKE 'xyz%', so that index will not get used.
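As a quick, hypothetical illustration of why the leading wildcard hurts:
-- Only a prefix pattern can use an index on images.name
EXPLAIN SELECT * FROM images WHERE name LIKE 'xyz%';   -- can do a range scan on the name index
EXPLAIN SELECT * FROM images WHERE name LIKE '%xyz%';  -- has to scan every row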
I'm asking whether a session-specific temporary table to handle the search input by the user (created on a search, dropped on the end of a session) is an appropriate way of handling a search functionality.
No. :)
Alternative options
MySQL has a built-in full-text search (since 5.6 also available for InnoDB) that can even give you that scoring: I highly recommend giving it a read and a try. You can be sure that the database knows better than you how to do that search efficiently.
If you are going to use MyISAM instead of InnoDB, be aware of the often overlooked limitation that FULLTEXT searches only return anything if the number of results is less than 50% of the total table rows.
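As a minimal sketch of what the built-in approach can look like (the images table and its name column are taken from the question; the index name is made up):
ALTER TABLE images ADD FULLTEXT INDEX ft_images_name (name);
SELECT name,
       MATCH(name) AGAINST ('some search query') AS score
FROM images
WHERE MATCH(name) AGAINST ('some search query')
ORDER BY score DESC;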
Another thing that you might want to look at is, for example, Solr (a nice introduction to the topic is the beginning of http://en.wikipedia.org/wiki/Apache_Solr ). We are using it in our company and it does a great job, but it requires quite a bit of learning.
Summary
The solution to your current problem itself (the search) is to use the FULLTEXT capabilities.
If I have hundreds of thousands of searches per second, what sort of performance issues might I encounter? Is there any better way of implementing a search functionality?
To give you a number, 10,000 calls per second is already far from trivial - with hundreds of thousands of searches per second, the sort of performance issues you will encounter are everywhere in your set-up. You are going to need several servers, load balancing and tons of other supporting tech. And one piece of that will likely be, for example, Solr ;)
Creating temporary tables on disk is relatively expensive. In your scenario it sounds like it'll be slower than it's worth.
It's usually only worthwhile to create temporary tables in memory. But you need to know you have enough memory available at all times. If you plan to support so many searches per second this is not a good solution.
MySQL has full-text searching built-in. It's good for small systems. This would likely perform far better than your temp table and JOIN. But if you want to support thousands of searches per second I would not recommend it. It could consume too much of your overall database performance. Plus you're then forced to use MyISAM for storage which might have its own issues in your scenario.
For so many searches you'll want to offload the work to another system. Plenty of searching systems with scoring already exist. Take a look at ElasticSearch, Solr/Lucene, Redis, etc.
From the code you give, I really don't think tmp tables are needed, nor is FULLTEXT searching. But ... about tmp table performance:
The creation/cleanup of the temporary table is not written to the transaction logs, so the I/O involved will be relatively quick for the OS. If the temporary tables are small and short-lived, and you have plenty of buffers available for the OS, the disk realistically won't even be touched. If you think it will be anyway, get an SSD and more RAM.
But if you are realistic that you are looking at hundreds of thousands of searches per second then you have a big engineering project on hand. Why not just do:
select images.* from images where name in ('some', 'search', 'query')
?
We have an existing PHP/MySQL app which doesn't have indexes configured correctly (monitoring shows that we do 85% table scans, ouch!)
What is a good process to follow to identify where we should be putting our indexes?
We're using PHP (Kohana using ORM for the DB access), and MySQL.
The answer likely depends on many things. For example, your strategy might be different if you want to optimize SELECTs at all costs or whether INSERTs are important to you as well. You might do well to read a MySQL Performance Tuning book or web site. There are several decent-to-great ones.
If you have a Slow Query Log, check it to see if there are particular queries that are causing problems. http://dev.mysql.com/doc/refman/5.5/en/slow-query-log.html
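If it is not enabled yet, here is a minimal sketch of turning it on at runtime (the threshold is just an example):
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;  -- log statements slower than 1 second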
If you know the types of queries you'll be running or have identified problematic queries via the Slow Query Log or other mechanisms, you can then use the EXPLAIN command to get some stats on those queries. http://dev.mysql.com/doc/refman/5.5/en/explain.html
Once you have the output from EXPLAIN, you can use it to optimize. See http://dev.mysql.com/doc/refman/5.5/en/using-explain.html
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
I think this will help you with your database problems.
http://net.tutsplus.com/tutorials/other/top-20-mysql-best-practices/
In the company where I came to work, they run a PHP/MySQL relational database. I had always thought that if I needed to pull different info from different tables, I could just do a simple join to pull in the data, such as...
SELECT table_1.id, table_2.id FROM table_1 LEFT JOIN table_2 ON table_1.sub_id = table_2.id
When I got to where I currently work, this is what they do.
<?php
$query = mysql_query("SELECT sub_id FROM table_1");
while ($rs = mysql_fetch_assoc($query)) {
    $query_2 = mysql_fetch_assoc(mysql_query("SELECT * FROM table_2 WHERE id = '{$rs['sub_id']}'"));
    // blah blah blah more queries
}
?>
When I asked why they did it the second way, they said that it actually ran faster than a join. They manage a database that has millions of records across different tables, and some of the tables are a little wide (row-wise). They said they wanted to avoid joins in case a poorly executed query could lock up a table (or several of them). One other thing to keep in mind is that there is a massive report builder attached to this database that a client can use to build their own reports, and if they go crazy and build a big report, it can cause some havoc.
I was confused so I thought I'd throw this out there for the general programming public. This could be a matter of opinion, but is it really faster to do the while statement (one larger query to pull a lot of rows, followed by a lot of small tiny sub-queries if you will) or to do a join (pull a larger query one time to get all the data you need). As long as indexes are done properly, does it matter? One other thing to consider is that the current DB is in InnoDB format.
Thanks!
Update 8/28/14
So I thought I'd post an update to this one with what has worked better long term. After this discussion I decided to rebuild the report generator here at work. I don't have definitive numbers, but I thought I'd share the result.
I think I went a little overboard, because I turned the entire report (it's pretty dynamic as far as the data that's returned) into a massive join fest. Most of the joins, if not all, join a value to a primary key, so they all run really fast. If the report had, let's say, 30 columns of data to pull and it pulled 2000 records, every single field was running a query to fetch the data (because that piece of data could come from a different field). 30 x 2000 = 60,000, and even at a sweet query time of 0.0003 seconds per query, that was still 18 seconds of just query time (which is pretty much what I remember it being). Now that I've rebuilt the query as a massive join on a bunch of primary keys (where possible), that same report loads in about 2-3 seconds, and most of that time is downloading the HTML. Each record that returns runs between 0 and 4 extra queries depending on the data that's needed (it may not need any extra data if it can be fetched in the joins, which happens 75% of the time). So the same 2000 records produce an additional 0-8000 queries (much better than 60,000).
I would say that the while-loop approach is useful in some cases, but as stated below in the comments, benchmarking is what it's all about. In my case joins were the better option, but in other areas of my site a while loop is more useful. In one instance I have a report where a client can request several categories to pull and only return data for those categories. What happened was I had a category_id IN (...) clause with 50-500 IDs, and the index would choke and die in my arms as I held it in its final moments. So what I did was split the IDs into groups of 10 and run the same query x / 10 times, and my results were fetched way faster than before, because the index copes much better with 10 IDs than with 500, so I saw a great improvement in my queries by using the while loop there.
If the indexes are properly used, then it is almost always more efficient to use a JOIN. I say "almost" because best efficiency does not always equal best performance.
There isn't really a one-size-fits all answer, though; you should analyze a query using EXPLAIN to ensure that the indexes are indeed being used, that there is no unnecessary temp table use, etc. In some cases, conditions conspire to create a query that just can't use indexes. In those cases, it might be faster to separate the queries into pieces in the fashion you've indicated.
If I encountered such code in an existing project, I would question it: check the query, think of different ways to perform the query, make sure that these things have been considered, build a scientific, fact-supported case for or against the practice. Make sure that the original developers did their due diligence, since not using a JOIN superficially points to poor database or query design. In the end, though, the results speak loudly and if all the optimizations and corrections still result in a slower join than using query fragments provides, then the faster solution prevails. Benchmark and act on the results of the benchmark; there is no case in software design that you should trade poor performance for adhesion to arbitrary rules about what you should or should not do. The best-performing method is the best method.
It should be better to do the big query, if the indexes are well placed.
The logic behind it:
1 query = 1 call to the DB server, which then processes the query (optimizer and all) and finally returns the result. N queries mean N calls to the database, including N calls to the optimizer and, in a bad case, N rounds of I/O.
MySQL has optimizations which work on JOINs. Those optimizations cannot kick in if you do the work in a while loop instead.
As stated in previous answers, check with EXPLAIN whether there is something which isn't using an index when you use the JOIN. You should also check the memory given to the InnoDB buffer pool and the memory MySQL is allowed to use to process a given query. Maybe it's because of those parameters that the database is slower when doing the JOINs.
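Using the tables from the question, the check could look like this (the columns to watch in the output are key and rows):
EXPLAIN SELECT t1.id, t2.id
FROM table_1 AS t1
LEFT JOIN table_2 AS t2 ON t1.sub_id = t2.id;
-- If the key column for table_2 is NULL, an index on table_2.id (or table_1.sub_id) is missing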
I would say the answer is, it depends. Normally, I'd say joins are the answer, and doing multiple queries in a loop is bad practice; however, it depends entirely on what is being done.
Is it the case for you? Without detailed table structures and info on indexes as well as use of foreign keys etc, we can't say for sure. Best idea if you want to check, is try it and see. Get their queries, EXPLAIN them, write your own, and do an EXPLAIN on that, see which is more efficient.
I'm not sure about huge databases, but in my projects I always try to keep the number of queries to a minimum. Queries involve hard drive access and (if not on the same host) network access, which are slow. If there are many entries in that first query, you could be running thousands of queries per page, which is going to be slow.
Benchmark to find out the actual answer.
With the example you provided, it is highly unlikely that (with equivalent data) a join done by the database will use more resources than issuing a separate query per row to perform the exact same operation - after all, you're still connecting the data in the same way as a join, just doing it externally. If it did, the engine could simply be rewritten to use that external route to improve performance.
When joins use more resources (apart from indexing problems), it mostly comes from the downsides of retrieving the data per row, which means that information of the parent table will be duplicated in every row, even when this is redundant.
This may cause performance problems that can be helped by splitting queries if:
there are many children to one parent AND
you fetch lots of data from the parent (many columns or large fields)
In my experience, reducing the number of queries almost always benefits performance (I've optimized by combining queries far more than picking them apart).
The correct use of indices is good advice of course, but at first sight I don't think it will account for the differences between those two scenarios, as the same indices (or lack thereof) would apply in both cases.
Given a site like StackOverflow, would it be better to create num_comments column to store how many comments a submission has and then update it when a comment is made or just query the number of rows with the COUNT function? It seems like the latter would be more readable and elegant but the former would be more efficient. What does SO think?
Definitely use COUNT. Storing the number of comments is a classic de-normalization that produces headaches. It's slightly more efficient for retrieval but makes inserts much more expensive: each new comment requires not only an insert into the comments table, but also a write lock on the row containing the comment count.
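To make the trade-off concrete, a hypothetical sketch (the comments/submissions tables and column names are assumed):
-- Normalized: count on demand
SELECT COUNT(*) FROM comments WHERE submission_id = 123;
-- De-normalized: keep a counter on the parent row, updated on every new comment
UPDATE submissions SET num_comments = num_comments + 1 WHERE id = 123;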
The former is not normalized but will produce better performance (assuming many more reads than writes).
The latter is more normalized, but will require more resources and hence be less performant.
Which is better boils down to application requirements.
I would suggest counting the comment records. Although the other method would be faster, counting keeps the database cleaner. Adding a count column would be a form of data duplication, not to mention requiring an additional code step and write on every insert.
If you were to expect millions of comments, then you may want to pick the count column approach.
I agree with @Oded. It depends on the app requirements and also on how active the site is; however, here are my two cents as well:
I would try to avoid the extra writes that would have to be done by triggers or UPDATEs to the posts table whenever new comments are added.
If you are concerned about reporting the data then don't do that on a transactional system. Create a reporting DB and update that periodically.
The "correct" way to design is to use another table, join it and COUNT. This is consistent with what database normalization teaches.
The problem with normalization is that it cannot scale. There are only so many ways to skin a cat, so if you have millions of queries per day and a lot of them involve table X, the database performance is going below ground as the server also has to deal with concurrent writes, transactions, etc.
To deal with this problem, a common practice is sharding. Sharding has the side effect that the rows of a table are not stored in the same physical location, and a primary consequence of this is that you cannot JOIN anymore; how can you JOIN against half a table and receive meaningful results? And obviously, trying to JOIN against all partitions of a table and merge the results is going to be worse than the disease.
So you see that not only is the alternative you're examining used in practice to achieve high performance, but also that there are even more radical steps engineers can and do take.
Of course, unless you do have performance issues, sharding or even de-normalizing is just making your life harder for no tangible benefit.
I have a classifieds website, and I am thinking about redesigning the database a bit.
Currently I have 7 tables in the db. One table for each "MAIN CATEGORY".
For example, I have a "VEHICLES" table which holds all information about the following categories of classifieds:
cars
mc
mopeds/scooters
trucks
boats
etc etc
However, users on the website usually search in specific categories. For example, the user chooses the "cars" category to search in, and enters a keyword.
My code today will search the entire VEHICLES table for all records with the field "category" equal to "cars", and then get their details:
"SELECT * IN vehicles WHERE category='cars' AND alot of other conditions" // just for example, not tested
I am thinking about making a table now, for each of these "sub-categories".
I.e., one for cars, one for mc, one for trucks, etc., so that a search doesn't have to go through information which isn't needed.
Will this increase search speed? I have calculated that I will need at least 30 or so tables for this.
Thanks
With a properly indexed table and a "reasonable" number of rows, you will not gain much speed from this approach. Anything you gain in speed of execution you will lose in time-to-market because your programming will become more complicated.
Do not perform this optimization unless and until you encounter a performance problem in testing with a representative set of data.
It will increase the speed of a search within the same category. It will potentially slow down queries where you need aggregate information from the different categories. You need to decide which is the best option for your site.
How many records do you have in total in the vehicles table? It's quite likely that adding proper indexes will greatly increase the speed of your searches.
Check out the 'EXPLAIN' query option in MySQL. Understanding this will help you optimize your database a lot with indices.
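For example, here is a hypothetical index matched to the search you described (the price column is just a stand-in for one of your "other conditions"):
CREATE INDEX idx_vehicles_category_price ON vehicles (category, price);
EXPLAIN SELECT * FROM vehicles WHERE category = 'cars' AND price < 10000;
-- EXPLAIN should now show the index in the key column instead of a full table scan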
Performance optimization is as much art as science, and to really understand what's the best option requires that you do some benchmarking; anyone offering a definitive answer given the available information is just wrong. That said, a few thoughts on your situation:
You don't say what type your category column is now, but if it's a string type, it's probably using more space than other options, thus making the table larger. Proper indexing can help tremendously with speed, but a larger table with larger indexes will always work to do just the opposite.
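One hypothetical way to shrink a free-text category column (the value list is made up from your examples):
ALTER TABLE vehicles
    MODIFY category ENUM('cars','mc','mopeds','trucks','boats') NOT NULL;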
As already mentioned by someone else, your queries within a category will be faster in the simple case of a category search. How much faster depends on how much data you have in your current table, and the increases may be negated if you have to join in other tables to satisfy the need for all the other conditions to which you alluded. OTOH, it may actually speed things up in certain join cases (e.g., if you were doing self-joins with your all-encompassing table).
If you're working with a lot of data, splitting into multiple tables can greatly ease backups.
Splitting into multiple tables may also make it easier to shard your data across multiple servers for performance reasons. Similarly, it may make replication setups easier to keep running.
If you're tracking data that's category-specific, separate tables enables you to better normalize your database and likely reap some nice performance as a result of using much smaller tables.
Splitting obviously means modifying your code. If your code is of the old, creaky type, you may very well achieve a performance gain from the clean-up. Of course, there's also the risk that you'll break something....
Check your indexes. Bad indexes are a very common cause of poor performance but are relatively easy to fix with a bit of quality time spent on self-education. MySQL's EXPLAIN can tell you whether your queries are using the indexes, and the index stats (look in the docs) can tell you how efficiently your indexes are working.
Finally, speaking of code, check yours. Try experimenting with a few approaches, regardless of how the database is set up. For example, it may be quicker to do a couple of separate queries and join the results in code than to do the join in the database. Likewise, it's often quicker to do things like sorts in code, particularly in cases where a join or something means the database would have to create a temporary file/table. Again, check the EXPLAIN output, and if you can't eliminate a problem area in your queries, see if it helps to simplify the queries and do more work in the code. This can be particularly beneficial in the common case where the web server has more resources to spare than the database server.
There are many more factors to consider. Ultimately, though, the best way to make these decisions is not to spend time pondering theories but to put both methods to the test. Create some test databases and benchmark the sort of queries you'd run most often, with and without simulated load. You'll get your answer.
If you are using PHP, try something like:
$query = mysql_query($sql);
while ($row = mysql_fetch_assoc($query)) {
    $tempvalue[] = $row;
}
and then loop over the data with a foreach statement:
foreach ($tempvalue as $key => $value) {
    // write the table .....
}
Maybe MySQL isn't slow and the problem is in the code.
Testing doesn't kill anyone =)