Apologies in advance if this is a silly question but I'm wondering which might be faster/better in the following simplified scenario...
I've got registered users (in a users table) and I've got countries (in a countries table) roughly as follows:
USERS TABLE:
user_id (PK, INT) | country_id (FK, TINYINT) | other user-related fields...
COUNTRIES TABLE:
country_id (PK, TINYINT) | country_name (VARCHAR) | other country-related fields...
Now, every time I need to display a user's country, I need to do a MySQL join. However, I often need to do lots of other joins with regard to the users and the big picture seems quite "join-heavy".
I'm wondering what the pros & cons might be of taking the countries out of the database and sticking them into a class as an array, from which I could easily retrieve them with public method calls using country_id. Would there be a speed advantage or disadvantage?
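Something like this, for illustration (the class name and entries are made up):

// Hypothetical lookup class replacing the countries table.
class CountryLookup
{
    private static $countries = array(
        1 => 'Afghanistan',
        2 => 'Albania',
        3 => 'Algeria',
        // ... one entry per country_id ...
    );

    // Return the country name for a country_id, or null if unknown.
    public static function getName($countryId)
    {
        return isset(self::$countries[$countryId])
            ? self::$countries[$countryId]
            : null;
    }
}

// Usage: echo CountryLookup::getName($user['country_id']);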
Thanks a lot.
EDIT: Thanks for all the views, very useful. I'll pick the first answer as the accepted solution although all contributions are valued.
Do you have a serious performance problem now? I recently went through a performance improvement on a php/mysql website I developed for my company. Certain areas were too slow, and it turned out a lot of the fault was with the queries themselves. I used timers to figure out which queries were slow, and I reorganized them (added indexes, etc). In a few cases, it was faster to make two separate queries and join them in php (I had some pretty complicated joins).
Do not try to optimize until you know you have a problem. Figure out if you have a problem first by measuring it, and then if you need to rearrange your queries you will be able to know if you made an improvement.
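For example, a crude timer around each query will show where the time goes (a minimal sketch; the 100 ms threshold is arbitrary):

$start = microtime(true);                  // wall-clock time before the query
$result = mysql_query($sql);               // the query under test
$elapsed = microtime(true) - $start;       // seconds this query took

if ($elapsed > 0.1) {                      // log anything slower than 100 ms
    error_log("slow query ({$elapsed}s): $sql");
}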
It would ease stress on your MySQL server to have fewer JOIN statements, but not significantly so (there aren't that many countries in the world). However, you'll make up that time in the fact that you'll have to implement the JOIN yourself in PHP. And since you're writing it yourself, you will probably write it less efficiently than the SQL statement, which means that it will take more time. I would recommend keeping it in the SQL server, since the advantages of moving it out are so few (and if the PHP instance and the MySQL instance are on the same box, there are no real advantages).
What you suggest should be faster. Granted, the join probably doesn't cost much, but looking it up in a dictionary should be just about free as far as compute power goes.
This is really just a trade off of memory for speed. The only downsides I could see would of course be the increased memory usage to store the country info and the fact that you would have to invalidate that cache if you ever update the countries table (which is probably not very often).
I don't think you'd gain anything from removing the join, as you'd have to iterate over all your result rows and manually look up the country name, which I doubt would be quicker than MySQL can do it.
I also would not consider such an approach for the following reason: If you want to change the name of a country (say you've got a typo), you can do so just by updating a row in the database. But if the names of the countries are in your PHP code, you'd have to redeploy the code in order to make a change. I don't know PHP, but that might not be as straightforward as a DB change in a production system.
So for maintainability reasons, IMHO let the DB do the work.
The general rule in the database world is to NORMALIZE first (which results in more tables) and deal with performance issues later.
You will want to DENORMALIZE only for simplicity of code, not for performance. Use indexes and stored procedures. DBMSs are designed to optimize joins.
The reason not to "normalize as you go" is that you would have to modify the code you have already written almost every time you modify the database design.
Related
I'm building a very large website. Currently it uses around 13 tables, and by the time it's done it should be about 20.
I came up with an idea to change the preferences table to use ID, Key, Value instead of many columns. However, I have recently thought I could also store other data inside that table.
Would it be efficient / smart to store almost everything in one table?
Edit: Here is some more information. I am building a social network that may end up with thousands of users. MySQL Cluster will be used when the site is launched. For now I am testing on a development VPS, but everything will be moved to a dedicated server before launch. I know barely anything about NDB so this should be fun :)
This model is called EAV (entity-attribute-value)
It is usable for some scenarios; however, it's less efficient due to larger records, a larger number of joins, and the impossibility of creating composite indexes on multiple attributes.
Basically, it's used when entities have lots of attributes which are extremely sparse (rarely filled) and/or cannot be predicted at design time, like user tags, custom fields etc.
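For illustration, a minimal EAV preferences table and the kind of query it forces on you (all names are made up):

CREATE TABLE user_preferences (
    user_id    INT NOT NULL,           -- the entity
    pref_key   VARCHAR(64) NOT NULL,   -- the attribute
    pref_value VARCHAR(255),           -- the value, stored as a string
    PRIMARY KEY (user_id, pref_key)
);

-- Fetching two preferences takes a self-join (one per extra attribute):
SELECT tz.pref_value AS timezone, lang.pref_value AS language
FROM user_preferences tz
JOIN user_preferences lang
  ON lang.user_id = tz.user_id AND lang.pref_key = 'language'
WHERE tz.user_id = 42 AND tz.pref_key = 'timezone';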
Granted, I don't know too much about large database designs, but from what I've seen, even extremely large applications store their data in a very small number of tables (20 GB per table).
For me, I would rather have more info in one table, as it means the data is not littered everywhere and I don't have to perform operations on multiple tables. Though one table can also get messy (usually, for me, each object has its own table, where an object is something you have in your application logic, like a User class or a BlogPost class).
I guess what I'm trying to say is: do whatever makes sense. Don't put information about the same thing in two different tables, and don't put information about two things in one table. Stick with one table describing one kind of object (this is difficult to explain, but if you do object-oriented programming, you should understand).
Nope. Preferences should be stored as they are (in the users table).
Private messages, for example, can't be stored in the users table anyway...
And you don't have to think about joining different tables...
I would first say that 20 tables is not a lot.
In general (it's hard to say from the limited info you give) the key-value model is not as efficient speed-wise, though it can be more efficient space-wise.
I would definitely not do this. Basically, the reason being that if you have a large set of data stored in a single table, you will see performance issues pretty fast when constantly querying the same table. Then think about the joins and the complexity of the queries you're going to need (depending on your site)... not a task I would personally like to undertake.
Using multiple tables splits the data into smaller sets, the resources required for a query are lower, and as an extra bonus it's easier to program!
There are some applications for doing this, but they are rare: more or less when you have a table with a ton of columns, most of which aren't going to have a value.
I hope this helps :-)
I think 20 tables in a project is not a lot. I do see your point and interest in using EAV but I don't think it's necessary. I would stick to tables in 3NF with proper FK relationships etc and you should be OK :)
The simple answer is that 20 tables won't make it a big DB, and MySQL won't need any optimization for that. So focus on clean DB structure and normalization instead.
Given a site like StackOverflow, would it be better to create num_comments column to store how many comments a submission has and then update it when a comment is made or just query the number of rows with the COUNT function? It seems like the latter would be more readable and elegant but the former would be more efficient. What does SO think?
Definitely use COUNT. Storing the number of comments is a classic de-normalization that produces headaches. It's slightly more efficient for retrieval but makes inserts much more expensive: each new comment requires not only an insert into the comments table, but a write lock on the row containing the comment count.
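For reference, the two approaches look roughly like this (table and column names are assumed, not taken from the question):

-- normalized: count on demand
SELECT COUNT(*) FROM comments WHERE submission_id = 42;

-- de-normalized: maintain a counter, paying an extra write per comment
INSERT INTO comments (submission_id, body) VALUES (42, 'First!');
UPDATE submissions SET num_comments = num_comments + 1 WHERE submission_id = 42;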
The former is not normalized but will produce better performance (assuming many more reads than writes).
The latter is more normalized, but will require more resources and hence be less performant.
Which is better boils down to application requirements.
I would suggest counting comment records. Although the other method would be faster, this lends itself to a cleaner database. Adding a count column would be a sort of data duplication, not to mention require an additional code step on every insert.
If you were to expect millions of comments, then you may want to pick the count column approach.
I agree with @Oded. It depends on the app requirements and also how active the site is; however, here are my two cents:
I would try to avoid the extra writes that would have to be done by triggers, i.e. UPDATEs to the posts table when new comments are added.
If you are concerned about reporting the data then don't do that on a transactional system. Create a reporting DB and update that periodically.
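For completeness, if you did go the trigger route, it would look something like this (a sketch assuming posts/comments tables and a num_comments column):

-- Keep posts.num_comments in sync on every new comment.
CREATE TRIGGER comments_after_insert
AFTER INSERT ON comments
FOR EACH ROW
    UPDATE posts
    SET num_comments = num_comments + 1
    WHERE post_id = NEW.post_id;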
The "correct" way to design is to use another table, join it and COUNT. This is consistent with what database normalization teaches.
The problem with normalization is that it cannot scale. There are only so many ways to skin a cat, so if you have millions of queries per day and a lot of them involve table X, the database performance is going below ground as the server also has to deal with concurrent writes, transactions, etc.
To deal with this problem, a common practice is sharding. Sharding has the side effect that the rows of a table are not stored in the same physical location, and a primary consequence of this is that you cannot JOIN anymore; how can you JOIN against half a table and receive meaningful results? And obviously, trying to JOIN against all partitions of a table and merge the results is going to be worse than the disease.
So you see that not only the alternative you examine is used in practice to achieve high performance, but also that there are even more radical steps that engineers can and do take.
Of course, unless you do have performance issues, sharding or even de-normalizing is just making your life harder for no tangible benefit.
I have a classifieds website, and I am thinking about redesigning the database a bit.
Currently I have 7 tables in the db. One table for each "MAIN CATEGORY".
For example, I have a "VEHICLES" table which holds all information about the following categories of classifieds:
cars
mc
mopeds/scooters
trucks
boats
etc etc
However, users on the website usually search in specific categories. For example, the user chooses the "cars" category to search in, and enters a keyword.
My code today will search the entire VEHICLES table for all records with the field "category" equal to "cars", and then get their details:
"SELECT * IN vehicles WHERE category='cars' AND alot of other conditions" // just for example, not tested
I am thinking about making a table now, for each of these "sub-categories".
I.e., one for cars, one for mc, one for trucks etc, so that the search isn't done through information which isn't needed.
Will this increase search speed? Because I have calculated that I will need at least 30 or so tables for this.
Thanks
With a properly indexed table and a "reasonable" number of rows, you will not gain much speed from this approach. Anything you gain in speed of execution you will lose in time-to-market because your programming will become more complicated.
Do not perform this optimization unless and until you encounter a performance problem in testing with a representative set of data.
It will increase the speed of a search within the same category. It will potentially slow down queries where you need aggregate information from the different categories. You need to decide which is the best option for your site.
How many records do you have in total in the vehicles table? It's quite likely that adding proper indexes will greatly increase the speed of your searches.
Check out the 'EXPLAIN' query option in MySQL. Understanding this will help you optimize your database a lot with indices.
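For example, using the query from the question:

EXPLAIN SELECT * FROM vehicles WHERE category='cars';

If the 'key' column of the output is NULL, MySQL is scanning the whole table rather than using an index.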
Performance optimization is as much art as science, and to really understand what's the best option requires that you do some benchmarking; anyone offering a definitive answer given the available information is just wrong. That said, a few thoughts on your situation:
You don't say what type your category column is now, but if it's a string type, it's probably using more space than other options, thus making the table larger. Proper indexing can help tremendously with speed, but a larger table with larger indexes will always work to do just the opposite.
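For instance, before splitting tables, an index on the category column is a cheap thing to try (this assumes the vehicles table from the question):

ALTER TABLE vehicles ADD INDEX idx_category (category);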
As already mentioned by someone else, your queries within a category will be faster in the simple case of a category search. How much faster depends on how much data you have in your current table, and the increases may be negated if you have to join in other tables to satisfy the need for all the other conditions to which you alluded. OTOH, it may actually speed things up in certain join cases (e.g., if you were doing self-joins with your all-encompassing table).
If you're working with a lot of data, splitting into multiple tables can greatly ease backups.
Splitting into multiple tables may also make it easier to shard your data across multiple servers for performance reasons. Similarly, it may make replication setups easier to keep running.
If you're tracking data that's category-specific, separate tables enable you to better normalize your database and likely reap some nice performance as a result of using much smaller tables.
Splitting obviously means modifying your code. If your code is of the old, creaky type, you may very well achieve a performance gain from the clean-up. Of course, there's also the risk that you'll break something....
Check your indexes. Bad indexes are a very common cause of poor performance but are relatively easy to fix with a bit of quality time spent on self-education. MySQL's EXPLAIN can tell you whether your queries are using the indexes, and the index stats (look in the docs) can tell you how efficiently your indexes are working.
Finally, speaking of code, check yours. Try experimenting with a few approaches, regardless of how the database is set up. For example, it may be quicker to do a couple of separate queries and join the results in code than to do the join in the database. Likewise, it's often quicker to do things like sorts in code, particularly in cases where a join or something means the database would have to create a temporary file/table. Again, check the EXPLAIN output, and if you can't eliminate a problem area in your queries, see if it helps to simplify the queries and do more work in the code. This can be particularly beneficial in the common case where the web server has more resources to spare than the database server.
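As one example of "more work in the code", a sort that would otherwise force MySQL into a temporary table can sometimes be done in PHP (a sketch; the votes column is hypothetical):

// Compare rows by their 'votes' value, highest first.
function cmp_votes($a, $b) {
    return $b['votes'] - $a['votes'];
}

usort($rows, 'cmp_votes');   // $rows = array of associative arrays from the DB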
There are many more factors to consider. Ultimately, though, the best way to make these decisions is not to spend time pondering theories but to put both methods to the test. Create some test databases and benchmark the sort of queries you'd run most often, with and without simulated load. You'll get your answer.
If you are using PHP, try something like:

$result = mysql_query($sql);                // run the query once
$rows = array();                            // buffer the full result set
while ($row = mysql_fetch_assoc($result)) {
    $rows[] = $row;
}
and then loop over the results with foreach:

foreach ($rows as $row) {
    // ... write the table row here ...
}
Maybe MySQL isn't slow and the problem is in the code.
Testing doesn't kill anyone =)
First of all, I am an autodidact, so I don't have great know-how about optimization and stuff. I created a social networking website.
It contains 29 tables right now. I want to extend its functionality by adding things like yellow pages, events etc to make it more like a portal.
Now the question is should I simply add the tables in the same database or should I use a different database?
And in case I create a new database: I also want users to be able to comment on business listings etc., just like reviews. So how will I be able to pull out entries, since the reviews will be in one database and the user details in the other?
Is it possible to join tables across 2 different databases?
You can join tables in separate databases by fully qualifying the name, but the real question is why do you want the information in separate databases? If the information you are storing all relates together, it should go in one database unless there is a compelling (usually performance related) reason against it.
The main reason I could see for separating your YellowPages out is if you wished to have one YellowPages accessible to several different, non-interacting websites. That said, presumably you wouldn't want cross-talk comments on the listings, so comments would need to be stored in the website databases rather than the YellowPages database. And that just sounds like a maintenance nightmare.
Don't Optimize until you need to.
If performance is ok, go for the easiest to maintain solution.
Monitor the performance of your site and if it starts to get slow, figure out exactly what is causing the slowdown and focus on performance on that section only.
You definitely can query and join tables from two different databases - you just need to specify the tables in a dbname.tablename format.
SELECT a.username, b.post_title
FROM dbOne.users a INNER JOIN dbTwo.posts b USING (user_id)
However, it might make management and maintenance a lot more complicated for you. For example, you'll have to track which table belongs in which database, and will continually need to be adding the database names into all your queries. When it comes time to back up the data, your work will increase there as well. MySQL databases can easily contain hundreds of tables so I see no benefit in splitting it up - just stick with one.
You can prove an algorithm is as fast as it can possibly be; math.h and the C standard libraries have been heavily optimized for half a century, and Perl's data structures are another example of very advanced optimization. Just avoid putting everything on one line, to make debugging easier. There are conventions; try to keep every programmer on the team following the same one. Which convention is "right" matters less for optimality than being consequent and consistent. Performance is the last thing you work on; security and intelligibility are the top priorities. Also read about big-O ("ordo") notation: it depends on the software only, whereas suboptimal software can be faster than optimal software when run on different hardware. Totally bug-infested spaghetti code with no structure can respond many times faster than the most provably optimal software, depending on the hardware.
How do I increase the performance of a MySQL database? I have my website hosted on a shared server, and they have suspended my account because of "too many queries".
The staff asked me to "index" or "cache" or trim my database.
I don't know what "index" and "cache" mean, or how to do them in PHP.
thanks
What an index is:
Think of a database table as a library - you have a big collection of books (records), each with associated data (author name, publisher, publication date, ISBN, content). Also assume that this is a very naive library, where all the books are shelved in order by ISBN (primary key). Just as the books can only have one physical ordering, a database table can only have one primary key index.
Now imagine someone comes to the librarian (database program) and says, "I would like to know how many Nora Roberts books are in the library". To answer this question, the librarian has to walk the aisles and look at every book in the library, which is very slow. If the librarian gets many requests like this, it is worth his time to set up a card catalog by author name (index on name) - then he can answer such questions much more quickly by referring to the catalog instead of walking the shelves. Essentially, the index sets up an 'alternative ordering' of the books - it treats them as if they were sorted alphabetically by author.
Notice that 1) it takes time to set up the catalog, 2) the catalog takes up extra space in the library, and 3) it complicates the process of adding a book to the library - instead of just sticking a book on the shelf in order, the librarian also has to fill out an index card and add it to the catalog. In just the same way, adding an index on a database field can speed up your queries, but the index itself takes storage space and slows down inserts. For this reason, you should only create indexes in response to need - there is no point in indexing a field you rarely search on.
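In MySQL terms, setting up such a card catalog is one statement (the books table belongs to the analogy, not to any real schema):

CREATE INDEX idx_author ON books (author_name);

-- This question no longer requires walking every shelf:
SELECT COUNT(*) FROM books WHERE author_name = 'Nora Roberts';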
What caching is:
If the librarian has many people coming in and asking the same questions over and over, it may be worth his time to write the answer down at the front desk. Instead of checking the stacks or the catalog, he can simply say, "here is the answer I gave to the last person who asked that question".
In your script, this may apply in different ways. You can store the results of a database query or a calculation or part of a rendered web page; you can store it to a secondary database table or a file or a session variable or to a memory service like memcached. You can store a pre-parsed database query, ready to run. Some libraries like Smarty will automatically store part or all of a page for you. By storing the result and reusing it you can avoid doing the same work many times.
In every case, you have to worry about how long the answer will remain valid. What if the library got a new book in? Is it OK to use an answer that may be five minutes out of date? What about a day out of date?
Caching is very application-specific; you will have to think about what your data means, how often it changes, how expensive the calculation is, how often the result is needed. If the data changes slowly, it may be best to recalculate and store the result every time a change is made; if it changes often but is not crucial, it may be sufficient to update only if the cached value is more than a certain age.
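As a minimal sketch in PHP, assuming the Memcache extension and a memcached server on localhost (the key name and 5-minute lifetime are arbitrary):

$memcache = new Memcache;
$memcache->connect('localhost', 11211);

$count = $memcache->get('front_page_post_count');  // try the cache first
if ($count === false) {                            // cache miss: do the real work
    $result = mysql_query("SELECT COUNT(*) FROM posts WHERE visible = 1");
    $row = mysql_fetch_row($result);
    $count = $row[0];
    $memcache->set('front_page_post_count', $count, 0, 300); // cache for 5 minutes
}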
Set up a copy of your application locally, enable the MySQL query log, and set up Xdebug or some other profiler. Then start collecting data and testing your application. There are lots of guides and books available about how to optimize things. It is important that you spend time testing and collecting data first, so you optimize the right things.
Using the data you have collected, try to reduce the number of queries per page view. Ideally, you should be able to get everything you need in fewer than 5-10 queries.
Look at the logs and see if you are asking for the same thing twice. It is a bad idea to request a record in one portion of your code, and then request it again from the database a few lines later unless you are sure the value is likely to have changed.
Look for queries embedded in loops, and try to refactor them so you make a single query and simply loop over the results, as in the sketch below.
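A sketch of that refactoring, with a hypothetical users/orders example:

// Slow: one query per user (a query embedded in a loop).
foreach ($user_ids as $id) {
    $result = mysql_query("SELECT * FROM orders WHERE user_id = " . intval($id));
    // ... process ...
}

// Better: a single query for all users, then loop over the results.
$ids = implode(',', array_map('intval', $user_ids));
$result = mysql_query("SELECT * FROM orders WHERE user_id IN ($ids)");
while ($row = mysql_fetch_assoc($result)) {
    // ... process ...
}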
The select * you mention using is an indication you may be doing something wrong. You probably should be listing fields you explicitly need. Check this site or google for lots of good arguments about why select * is evil.
Start looking at your queries, and then use EXPLAIN on them. For queries that are frequently used, make sure they are using a good index and not doing a full table scan. Tweak indexes on your development database and test.
There are a couple things you can look into:
Query Design - look into more advanced and faster solutions
Hardware - throw better and faster hardware at the problem
Database Design - use indexes and practice good database design
All of these are easier said than done, but it is a start.
Firstly, sack your host, get off shared hosting into an environment you have full control over and stand a chance of being able to tune decently.
Replicate that environment in your lab, ideally with the same hardware as production; this includes things like RAID controller.
Did I mention that you need a RAID controller? Yes, you do. You can't achieve decent write performance without one - and it needs a battery-backed cache. If you don't have one, each write needs to physically hit the disc, which is ruinous for performance.
Anyway, back to read performance, once you've got the machine with the same spec RAID controller (and same discs, obviously) as production in your lab, you can try to tune stuff up.
More RAM is usually the cheapest way of achieving better performance - make sure that you've got MySQL configured to use it - which means tuning storage-engine specific parameters.
I am assuming here that you have at least 100G of data; if not, just buy enough RAM that your entire DB fits in RAM, and read performance is essentially solved.
Software changes that others have mentioned such as optimising queries and adding indexes are helpful too, but only once you've got a development hardware environment that enables you to usefully do performance work - i.e. measure performance of your application meaningfully - which means real hardware (not VMs), which is consistent with the hardware environment used in production.
Oh yes - one more thing - don't even THINK about deploying a database server on a 32-bit OS; it's a ruinous waste of good RAM.
Indexing is done on the database tables in order to speed up queries. If you don't know what it means, you have none. At a minimum you should have indexes on every foreign key and on most fields that are used frequently in the WHERE clauses of your queries. Primary keys get indexes automatically, assuming you set them up to begin with, which I would find unlikely in someone who doesn't know what an index is. Are your tables normalized?
BTW, since you are doing a division in your math (why, I haven't a clue), you should Google integer math. You may not be getting correct results.
You should not select * ever. Instead, select only the data you need for that particular call. And what is your intention here?
order by votes*1000+((1440 - ($server_date - date))/60)*2+visites*600 desc
You may have poorly-written queries, and/or poorly written pages that run too many queries. Could you give us specific examples of queries you're using that are run on a regular basis?
Sure.
This query, to fetch the last 3 posts:
select * from posts where visible = 1 and date > ($server_date - 86400) and dont_show_in_frontpage = 0 order by votes*1000+((1440 - ($server_date - date))/60)*2+visites*600 desc limit 3
what do you think?