I have a table which stores highscores for a game. The game has many levels, and within each level ID the scores are ordered by score DESC (which is indexed). Would partitioning on this level ID column give the same result as creating many separate level tables (one for each level ID)? I need to separate out the level data somehow, as I'm expecting tens of millions of entries. I hear partitioning could speed this up whilst leaving my tables normalised.
Also, I have an unknown number of levels in my game (levels may be added or removed at any time). Can I partition on this level ID column and have new partitions created automatically when a new (distinct) level ID is added to the highscore table? I may start with 10 separate levels but end up with 50, with all my data still kept in one table, just spread across many partitions. Do I have to index the level ID to make this work?
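For context, here is roughly the kind of thing I'm imagining (the table and column names below are just placeholders I made up):

-- a sketch of list partitioning on the level column
CREATE TABLE highscores (
    level_id  INT NOT NULL,
    player_id INT NOT NULL,
    score     INT NOT NULL,
    PRIMARY KEY (level_id, player_id),    -- the partition column must appear in every unique key
    KEY idx_level_score (level_id, score)
)
PARTITION BY LIST (level_id) (
    PARTITION p1 VALUES IN (1),
    PARTITION p2 VALUES IN (2),
    PARTITION p3 VALUES IN (3)
);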
Thanks in advance for your advice!
Creating an index on a single column is good, but creating an index that contains two columns would be a better solution based on the information you have given. I would run:
alter table highscores add index(columnScore, columnLevel);
This will make performance much better. From a database point of view, no matter what highscores you are looking for, the database will know where to search for them.
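For example, a per-level leaderboard lookup of the kind described might look like this (the player column is just a placeholder):

SELECT player, columnScore
FROM highscores
WHERE columnLevel = 42
ORDER BY columnScore DESC
LIMIT 10;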
On that note, if you can (and you are using MyISAM tables), you could also run:
alter table highscores order by columnScore, columnLevel;
which will then group all your data together, so that even though the database KNOWS where each bit is, it can find all the records that belong to one another nearby - which means less hard drive work and therefore quicker results.
That second operation, too, can make a HUGE difference. My PC at work (a horrible old machine that was top of the range in the nineties) has a database with several million records in it that I built - nothing huge, about 2.5 GB of data including indexes - and performance was dragging, but ordering the data to match the indexes improved query time from about 1.5 minutes per query to around 8 seconds. That's JUST down to the hard drive being able to reach all the sectors that contain the data.
If you plan to store data for different users, what about having two tables - one with all the information about the different levels, and another with one row per user holding his scores as XML/JSON?
I have a MySQL database with two tables I am interested in querying:
Users: Stores information about users such as userID etc.
Map: A map table containing about 7 million mapIDs (an index referring to a physical lat/long on earth).
Many of these mapIDs are associated with userIDs, so for example user #1 may have 10 mapIDs associated with him, user #2 may have 100, etc.
I am interested in knowing which is more efficient/safer/better practice for counting how many mapIDs belong to a user when I query the database with a userID:
1) Query the Map table to count how many mapIDs belong to the userID, OR
2) Store the number of mapIDs belonging to users in an additional column in the Users table (e.g. mapCount), and only query this value (rather than searching the large Maps table each time).
I know option 2 will be faster, but I am worried about potential problems with synchronization etc. For example, every time a user performs an action (e.g. adding a mapID to his account) I would add the userID to the associated mapID in the Maps table, and also increment the mapCount value in Users so that subsequent searches/actions will be faster. But what if the second query failed for some reason and the mapCount field fell out of sync? Is this worth the risk?
What is generally the best thing to do in this situation?
If you are building the database, start by writing a query that extracts the data you want. You can optimize this query by adding an index on Map(userID). If the performance is adequate, you are done.
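A minimal sketch, using the table and column names from the question (the index name is arbitrary):

ALTER TABLE Map ADD INDEX idx_map_user (userID);

SELECT COUNT(*) FROM Map WHERE userID = 42;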
If performance is not sufficient, then you can consider storing the count separately. Maintaining the count requires triggers on insert and delete and possibly on update.
These triggers will have an effect on performance when adding and modifying data. This is usually small, but it can be important. If you are doing bulk-load operations, then you will need to manually handle the summarization values.
All this maintenance is a lot of work, and you should only go down that path if you really need to do it that way.
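For reference, a minimal sketch of such triggers, assuming tables shaped like Users(userID, mapCount) and Map(mapID, userID):

-- keep Users.mapCount in step with the Map table
CREATE TRIGGER map_after_insert AFTER INSERT ON Map
FOR EACH ROW UPDATE Users SET mapCount = mapCount + 1 WHERE userID = NEW.userID;

CREATE TRIGGER map_after_delete AFTER DELETE ON Map
FOR EACH ROW UPDATE Users SET mapCount = mapCount - 1 WHERE userID = OLD.userID;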
You are facing one of the classic database design trade-offs: speed vs. accuracy/synchronization. If your DBMS supports triggers, you could denormalize the count into the Users table via a trigger on the Map table, in which case you would no longer have to worry about accuracy. This is about as detailed as my answer can be until we know more about your DBMS.
Option 1 reduces the need for an additional write, is easier to implement and maintain, and the read performance difference will be so marginal there's no point in measuring it yet.
I want to love DynamoDB, but the major drawback is the query/scan on the whole DB to pull the results for one query. Would I be better off sticking with MySQL, or is there another solution I should be aware of?
Uses:
Newsfeed items (Pulls most recent items from table where id in x,x,x,x,x)
User profile relationships (users follow and friend each other)
User lists (users can have up to 1,000 items in one list)
I am happy to mix and match database solutions. The main use is lists.
There will be a few million lists eventually, ranging from 5 to 1,000 items per list. The list table is formatted as follows:
list_id (bigint) | order (int(1)) | item_text (varchar(500)) | item_text2 (varchar(12)) | timestamp (int(11))
The main queries on this DB would be on the 'list_relations' table:
SELECT item_text FROM lists WHERE list_id = 539830
I suppose this is my main question: can we get all items for a particular list_id without a slow query/scan? And by 'slow', do people mean a second, or a few minutes?
Thank you
I'm not going to address whether or not it's a good choice or the right choice, but you can do what you're asking. I have a large DynamoDB instance with vehicle VINs as the hash key, something else for my range key, and a secondary index on VIN and a timestamp field, and I am able to make fast queries over thousands of records for specific vehicles across timestamp searches, no problem.
Constructing your schema in DynamoDB requires different considerations than building in MySQL.
You want to avoid scans as much as possible, which means picking your hash key carefully.
Depending on your exact queries, you may also need multiple tables containing the same data, but with different hash keys to suit your different query patterns.
You also did not mention the LSI and GSI features of DynamoDB; these also help your query-ability, but have their own sets of drawbacks. It is difficult to advise further without knowing more details about your requirements.
I have an online iPhone turn-based game, with lots of games running at the same time. I'm in the process of optimizing the code, since both the server and I crashed today.
This is the setup:
Right now I have one table, "matches" (70 fields of data for each row), that keeps track of all the active matches. Every 7 seconds, the iPhone will connect, download all the matches in the "matches" table that the player is active in, and update the UI on the iPhone.
This worked great until about 1,000 people downloaded the game and played. The server crashed.
So to optimize, I figure I can create a new table called "matches_needs_update". This table has 2 columns: name and id. The "id" is the same as the match id in the "matches" table. When a match is updated, it's put in this table.
Now, instead of searching through the whole "matches" table, the query just checks whether the player has any matches that need to be updated, and then gets those matches from the "matches" table.
My question is twofold:
Is this the optimal solution?
If a player is active in, say 10 matches, is there a good way to get those 10 matches from the "matches" table at the same time, or do I need a for loop doing 10 queries, one for each match:
"SELECT * FROM matches WHERE id = ?"
Thanks in advance
You need to get out of the database. Look to memcache or redis.
I suggest APC, as you're on PHP, and I assume you're doing this with a single MySQL database. It's easy to install, and it's slated to be included by default from PHP 6 onwards. Keep this one table in memory and it will fly.
Your database looks really small. A table with 70 fields per row should return within milliseconds, and even hundreds of queries per second should work without any problems.
A couple of traditional pointers:
Make sure you pool your connections. You should never have to open a connection at the moment a customer needs the data.
Make sure there is an index on "user is in match" so that the result will be fetched from the index.
I'm sure you have enough memory to hold the entire structure in the cache and with these small tables no additional config should be needed.
Make sure your schema is normalized: one table for users, one for matches, and one linking users to the matches they are in (see the sketch below).
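A minimal sketch of that layout (all table and column names are only illustrative):

CREATE TABLE users   (id INT PRIMARY KEY, name VARCHAR(50));
CREATE TABLE matches (id INT PRIMARY KEY /* ...plus the ~70 match fields... */);
CREATE TABLE match_players (
    match_id INT NOT NULL,
    user_id  INT NOT NULL,
    PRIMARY KEY (match_id, user_id),
    KEY idx_user (user_id)   -- the "user is in match" index
);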
It's time to start caching things, e.g. with memcached and APC.
As for looping through the matches... that is the wrong way to go about it.
How is a user connected to a match - via an xref table? Or does the match table have something like player1, player2?
Looping through queries is not the way to go; properly indexing your tables and doing a join to pull all the active matches by userId would be more efficient (see the sketch below). Given the number of users, you may also want to split the tables up for active and inactive games (if you haven't already).
If there are 6,000 active games and 3,000,000 inactive ones, it's extremely beneficial to partition these tables.
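As for the join, a minimal sketch, assuming an xref table of the kind asked about above (all names are illustrative):

SELECT m.*
FROM matches AS m
JOIN match_players AS mp ON mp.match_id = m.id
WHERE mp.user_id = 42;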
I pull a range (e.g. LIMIT 72, 24) of games from a database according to which have been voted most popular. I have a separate table for tracking game data, and one for tracking individual votes for a game (a rating from 1 to 5, one vote per user per game). A game is considered "most popular" or "more popular" when it has the highest average rating across all the rating votes for said game. Games with fewer than 5 votes are not considered. Here is what the tables look like (two tables, "games" and "votes"):
games:
gameid(key)
gamename
thumburl
votes:
userid(key)
gameid(key)
rating
Now, I understand that there is something called an "index" which can speed up my queries by essentially pre-querying my tables and constructing a separate table of indices (I don't really know.. that's just my impression).
I've also read that mysql operates fastest when multiple queries can be condensed into one longer query (containing joins and nested select statements, I presume).
However, I am currently NOT using an index, and I am making multiple queries to get my final result.
What changes should be made to my database (if any -- including constructing index tables, etc.)? And what should my query look like?
Thank you.
Your query that calculates the average for every game could look like:
SELECT gamename, AVG(rating)
FROM games INNER JOIN votes ON games.gameid = votes.gameid
GROUP BY games.gameid
HAVING COUNT(*) >= 5
ORDER BY AVG(rating) DESC
LIMIT 0, 25
You must have an index on gameid in both games and votes. (If you have defined gameid as a primary key on the games table, that side is covered.)
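For the votes side, something along these lines would do (the index name is arbitrary):

ALTER TABLE votes ADD INDEX idx_votes_gameid (gameid);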
According to the MySQL documentation, an index is created when you designate a primary key at table creation. This is worth mentioning, because not all RDBMSs function this way.
I think you have the right idea here, with your "votes" table acting as a bridge between "games" and "user" to handle the many-to-many relationship. Just make sure that "userid" and "gameid" are indexed on the "votes" table.
If you have access to use InnoDB storage for your tables, you can create foreign keys on gameid in the votes table which will use the index created for your primary key in the games table. When you then perform a query which joins these two tables (e.g. ... INNER JOIN votes ON games.gameid = votes.gameid) it will use that index to speed things up.
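For instance (assuming InnoDB on both tables; the constraint name is arbitrary):

ALTER TABLE votes
ADD CONSTRAINT fk_votes_game FOREIGN KEY (gameid) REFERENCES games (gameid);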
Your understanding of an index is essentially correct — it basically creates a separate lookup table which it can use behind the scenes when the query is executed.
When using an index it is useful to use the EXPLAIN syntax (simply prepend your SELECT with EXPLAIN to try this out). The output shows you the list of possible keys available for the query as well as which key the query actually uses. This can be very helpful when optimising your query.
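Applied to the query above, for example:

EXPLAIN
SELECT gamename, AVG(rating)
FROM games INNER JOIN votes ON games.gameid = votes.gameid
GROUP BY games.gameid
HAVING COUNT(*) >= 5
ORDER BY AVG(rating) DESC
LIMIT 0, 25;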
An index is a PHYSICAL DATA STRUCTURE used to help speed up retrieval-type queries; it's not simply a table upon a table, though that's a fine starting concept. Another useful analogy is the index at the back of your textbook (the only difference is that a search key in your book can point to multiple pages/matches, whereas with an index a search key points to only one page/match). An index is defined by its data structure, so you could use a B+ tree index, and there are even hash indexes. This is database/query optimization at the physical/internal level of the database - I'm assuming you know that you're normally working at the higher levels of the DBMS, which is easier. An index is rooted in the internal levels, and that makes DB query optimization much more effective and interesting.
I've noticed from your question that you have not even developed the query yet. Focus on the query first. Indexing comes after; as a matter of fact, in any graduate or postgraduate database course, indexing falls under the maintenance of a database, not necessarily its development.
Also, N.B.: I have seen quite a few people state, as a rule, that all primary keys should be indexed. This is not true. There are many instances where a primary key index would slow down the database. In fact, if we were to go with only primary indexes, then we should use hash indexes, since they work better than B+ trees!
In summary, it doesn't make sense to ask one question about both a query and an index. Ask for help with the query first. Then, given your tables (relational schema) and SQL query, then and only then could I advise you on the best index - remember, it's maintenance. We can't do maintenance if there has been no development.
Kind Regards,
N.B. Most questions concerning indexes at the postgraduate level of many computing courses run as follows: we give the students a relational schema (i.e. your tables) and a query, and then ask them to critically suggest a suitable index for that query on those tables. We can't ask a question like this if they don't have a query.
I want to log access to pages in my PHP/MySQL app to implement a view count similar to the one on SO.
My plan is to count the requests by unique IP address on each page. There are about 5,000 different pages with a view count.
(I know counting IPs is not exact but that is OK for my purposes.)
I see two options for organizing the database tables:
Either one large table with the fields “page_id” and “request_ip”. Assuming each page has 50 views by unique IPs on average, I'd get 5,000 x 50 = 250,000 rows. As the views are displayed on the pages, the table will get read and write access on every request for every page.
The other option is to have one table per page with a single column “request_ip”. I'd then have 5,000 tables storing 50 rows each on average. A table would only get accessed when its page is viewed.
Which one is better generally and performance-wise? Or am I completely on the wrong track?
5,000 tables means 5,000 different queries + 5,000 different sets of indexes + 5,000 different sets of data competing for space in the server's caches. Performance will most likely be abysmal.
Multiple tables storing exactly the same data structure is almost ALWAYS a bad design. If you're worried about performance, you can use MySQL's partitioning support to split the table into multiple pieces automatically, and that's done transparently to the end user (i.e. your queries).
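A minimal sketch, assuming a single table built from the two fields in the question:

CREATE TABLE page_views (
    page_id    INT NOT NULL,
    request_ip VARBINARY(16) NOT NULL,   -- big enough for IPv4 or IPv6
    PRIMARY KEY (page_id, request_ip)
)
PARTITION BY HASH (page_id)
PARTITIONS 16;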
Wouldn't a better approach be to have a table that stores the DateTime of access, page id, IP address, etc.? Then every time a page is accessed you simply add a row to the table. That gives you the data at a raw level, and you can then simply aggregate it to answer the questions that you want.
Storing the data in this way also allows you to answer more granular questions, like how many page views were made on a particular day or week - which you wouldn't be able to do with the table structure you have proposed in your question.
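For instance, a per-page unique-view count for one week could be a simple aggregate (table and column names here are only illustrative):

SELECT page_id, COUNT(DISTINCT ip_address) AS unique_views
FROM page_access_log
WHERE accessed_at >= '2013-06-01' AND accessed_at < '2013-06-08'
GROUP BY page_id;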