I want to log access to pages in my PHP/MySQL app to implement a view count similar to the one on SO.
My plan is to count requests by unique IP address on each page. There are about 5,000 different pages with a view count.
(I know counting IPs is not exact but that is OK for my purposes.)
I see two options for organizing the database tables:
Either one large table with the fields “page_id” and “request_ip”. Assuming each page has 50 views by unique IPs on average, I'd get 5000 x 50 = 250,000 rows. As the views are displayed on the pages, the table will be read and written on every request for every page.
The other option is to have one table per page with a single column “request_ip”. I'd then have 5000 tables storing 50 rows on average. A table will only get accessed when its page is viewed.
Which one is better, generally and performance-wise? Or am I completely on the wrong track?
5000 tables means 5000 different queries + 5000 different sets of indexes + 5000 different sets of data competing for space in the server's caches. Performance will most likely be abysmal.
Multiple tables storing exactly the same data structure is almost ALWAYS a bad design. If you're worried about performance, you can use MySQL's partitioning support to split the table into multiple pieces automatically, and that's done transparently to the end user (i.e. your queries don't change).
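For example, MySQL's HASH partitioning can split the single page-views table into multiple physical pieces by page_id while queries keep targeting one logical table (a sketch; the table name and partition count are assumptions, the columns come from the question):

```sql
-- One logical table for all pages, split into 16 physical pieces.
CREATE TABLE page_views (
    page_id    INT UNSIGNED NOT NULL,
    request_ip VARBINARY(16) NOT NULL,  -- fits both IPv4 and IPv6
    PRIMARY KEY (page_id, request_ip)   -- also de-duplicates IPs per page
)
PARTITION BY HASH (page_id)
PARTITIONS 16;

-- The view count for one page only touches one partition:
SELECT COUNT(*) FROM page_views WHERE page_id = 42;
```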
Wouldn't a better approach be to have a table that stores the DateTime of access, page ID, IP address, etc.? Then every time a page is accessed you simply add a row to the table. That gives you the data at the raw level, and you can then aggregate it to answer whatever questions you want.
Storing the data this way also lets you answer more granular questions, like how many page views were made on a particular day or week, which you wouldn't be able to do with the table structure you proposed in your question.
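A sketch of that raw-log approach (the names here are illustrative, not from the question):

```sql
CREATE TABLE page_requests (
    page_id     INT UNSIGNED NOT NULL,
    request_ip  VARBINARY(16) NOT NULL,
    accessed_at DATETIME NOT NULL,
    INDEX (page_id, request_ip)
);

-- Unique-IP view count for one page:
SELECT COUNT(DISTINCT request_ip) FROM page_requests WHERE page_id = 42;

-- Granular questions become simple aggregates, e.g. views per day:
SELECT DATE(accessed_at) AS day, COUNT(*) AS views
FROM page_requests
WHERE page_id = 42
GROUP BY DATE(accessed_at);
```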
Related
I have a MySQL database with two tables I am interested in querying:
Users: Stores information about users such as userID etc.
Map: A map table containing about 7 million mapIDs (an index referring to a physical lat/long on earth).
Many of these mapIDs are associated with userIDs; for example, user #1 may have 10 mapIDs associated with him, user #2 may have 100, etc.
I am interested in knowing what is more efficient/safer/best practice to count how many mapIDs belong to a user when I query the database with a userID:
1) Query the Map table to count how many mapIDs belong to the userID, OR
2) Store the number of mapIDs belonging to users in an additional column in the Users table (e.g. mapCount), and only query this value (rather than searching the large Maps table each time).
I know option 2 will be faster, but I am worried about potential synchronization problems. For example, every time a user performs an action (e.g. adds a mapID to his account) I would add the userID to the associated mapID in the Maps table, and also increment the mapCount value in Users so that subsequent searches/actions will be faster. But what if the second query failed for some reason and the mapCount field fell out of sync? Is this worth the risk?
What is generally the best thing to do in this situation?
If you are building the database, start by writing a query to extract the data you want. You can optimize this query by adding an index on map(userID). If the performance is adequate, you are done.
If performance is not sufficient, then you can consider storing the count separately. Maintaining the count requires triggers on insert and delete and possibly on update.
These triggers will have an effect on performance when adding and modifying data. This is usually small, but it can be important. If you are doing bulk-load operations, then you will need to manually handle the summarization values.
All this maintenance is a lot of work, and you should only go down that path if you really need to do it that way.
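For the query-only route, the whole setup is just this (assuming a map table with a userID column, as described in the question):

```sql
-- With this index the count is an index range scan, not a table scan:
CREATE INDEX idx_map_userID ON map (userID);

SELECT COUNT(*) FROM map WHERE userID = 123;
```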
You are facing one of the classic database design trade offs: speed vs. accuracy / synchronization. If your DBMS supports triggers, you could denormalize the count into the user table via a trigger on the maps table, in which case you would no longer have to worry about accuracy. This is about as detailed as my answer can be until we know more about your DBMS.
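In MySQL, for instance, such triggers could look like the following (the maps/users table and column names are assumptions based on the question):

```sql
-- Keep users.mapCount in sync with the maps table automatically.
CREATE TRIGGER maps_after_insert AFTER INSERT ON maps
FOR EACH ROW
UPDATE users SET mapCount = mapCount + 1 WHERE userID = NEW.userID;

CREATE TRIGGER maps_after_delete AFTER DELETE ON maps
FOR EACH ROW
UPDATE users SET mapCount = mapCount - 1 WHERE userID = OLD.userID;
```

Because a trigger fires in the same transaction as the row change, the counter cannot silently drift the way two separate application-side queries can.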
Option 1 reduces the need for an additional write, is easier to implement and maintain, and the read performance difference will be so marginal there's no point in measuring it yet.
I have a large database of three million articles in a specific category, and I'm launching a few sites from it. My budget is low, so the best option for me is shared hosting, but the problem is that shared-host hardware is weak because it's divided among users. So when I need to fetch a new post for a site (one that hasn't already been posted there), I run into trouble. I used the following method to get new content from the database, but with the growing number of records it now takes more power than a shared host has to return results in reasonable time.
My previous method :
I have a table for content
And a statistics table that records which entries have already been posted to each site.
My query is included below:
SELECT * FROM postlink
WHERE `source` = '$mysource'
  AND NOT EXISTS (SELECT sign FROM `state`
                  WHERE postlink.sign = state.sign AND `cite` = '$mycite')
ORDER BY `postlink`.`id` ASC LIMIT 5
I use MySQL.
I've tested different queries but did not get a good result; showing even a few posts was very time-consuming.
Now I'd like your help with a solution that lets me serve new content to a requesting site in the shortest possible time, given the number of posts and an ordinary shared host.
The problem occurs when the posted-stats table grows too large, but if I empty it I'll have problems with sending duplicate content, so I have no choice but to keep the statistics table.
The statistics table currently has 500 thousand records for 10 sites.
Thanks all in advance.
Are you seriously calling 3 million articles a large database? PostgreSQL won't even start TOASTing rows at that point.
Consider migrating to a more serious database where you can use partial indexes, table partitioning, materialized views, etc.
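Whether or not you migrate, the query in the question can at least be backed by composite indexes that serve the anti-join directly (a sketch based on the query shown; verify the actual plan with EXPLAIN):

```sql
-- Serves the NOT EXISTS lookup on state:
CREATE INDEX idx_state_cite_sign ON state (cite, sign);

-- Serves the outer filter plus the ORDER BY ... LIMIT:
CREATE INDEX idx_postlink_source_id ON postlink (source, id);
```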
I want to love DynamoDB, but the major drawback is the query/scan over the whole DB to pull the results for one query. Would I be better off sticking with MySQL, or is there another solution I should be aware of?
Uses:
Newsfeed items (Pulls most recent items from table where id in x,x,x,x,x)
User profile relationships (users follow and friend each other)
User lists (users can have up to 1,000 items in one list)
I am happy to mix and match database solutions. The main use is lists.
There will be a few million lists eventually, ranging from 5 to 1,000 items per list. The list table is formatted as follows: list_id (bigint) | order (int(1)) | item_text (varchar(500)) | item_text2 (varchar(12)) | timestamp (int(11))
The main queries on this DB would be on the 'list_relations' table:
SELECT `item_text` FROM lists WHERE list_id = 539830
I suppose that's my main question: can we get all items for a particular list_id without a slow query/scan? And by 'slow', do people mean a second, or a few minutes?
Thank you
I'm not going to address whether or not it's a good choice or the right choice, but you can do what you're asking. I have a large DynamoDB instance with vehicle VINs as the hash key, something else as my range key, and a secondary index on VIN plus a timestamp field. I can make fast queries over thousands of records for specific vehicles across timestamp ranges, no problem.
Constructing your schema in DynamoDB requires different considerations than building in MySQL.
You want to avoid scans as much as possible, which means picking your hash key carefully.
Depending on your exact queries, you may also need multiple tables holding the same data, but with different hash keys to suit your querying needs.
You also did not mention DynamoDB's LSI and GSI features; these also help your queryability, but have their own sets of drawbacks. It is difficult to advise further without knowing more details about your requirements.
I have a table which stores high scores for a game. The game has many levels, and scores are ordered by score DESC (which is an index) within a level ID. Would partitioning on this level ID column give the same result as creating many separate level tables (one for each level ID)? I need to separate out the level data somehow, as I'm expecting tens of millions of entries. I hear partitioning could speed this up, whilst leaving my tables normalised.
Also, I have an unknown number of levels in my game (levels may be added or removed at any time). Can I specify to partition on this level ID column and have new partitions automatically get created when a new (distinct) level ID is added to the highscore table? I may start with 10 separate levels but end up with 50, with all my data still kept in one table, just in many partitions. Do I have to index the level ID to make this work?
Thanks in advance for your advice!
Creating an index on a single column is good, but creating an index that contains two columns would be a better solution based on the information you have given. I would run:
ALTER TABLE highscores ADD INDEX (columnScore, columnLevel);
This will make performance much better. From a database point of view, no matter what highscores you are looking for, the database will know where to search for them.
On that note, if you can (and you are using MyISAM tables), you could also run:
ALTER TABLE highscores ORDER BY columnScore, columnLevel;
which will then group all your data together, so that even though the database KNOWS where each bit is, it can find all the records that belong to one another nearby - which means less hard drive work - and therefore quicker results.
That second operation, too, can make a HUGE difference. My PC at work (a horrible old machine that was top of the range in the nineties) has a database with several million records in it that I built - nothing huge, about 2.5 GB of data including indexes - and performance was dragging, but ordering the data for the indexes improved query time from about 1.5 minutes per query to around 8 seconds. That's JUST down to hard drive speed in being able to reach all the sectors that contain the data.
If you plan to store data for different users, what about having 2 tables - one with all the information about the different levels, another with one row per user along with his scores in XML/JSON?
I need to store about 73,200 records per day consisting of 3 points of data: id, date, and integer.
Some members of my team suggest creating tables using months as table names (september_2010), while others suggest having one table with lots of data in it...
Any suggestions on how to deal with this amount of data? Thanks.
Edit: Thank you all for the feedback.
I recommend against that. I call this antipattern Metadata Tribbles. It creates multiple problems:
You need to remember to create a new table every year or else your app breaks.
Querying aggregates against all rows regardless of year is harder.
Updating a date potentially means moving a row from one table to another.
It's harder to guarantee the uniqueness of pseudokeys across multiple tables.
My recommendation is to keep it in one table until and unless you've demonstrated that the size of the table is becoming a genuine problem, and you can't solve it any other way (e.g. caching, indexing, partitioning).
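If the table ever does become a genuine problem, MySQL range partitioning (available since 5.1; RANGE COLUMNS needs 5.5+) keeps it one logical table while physically splitting it by date (a sketch; the table and column names are assumed from the question's description):

```sql
CREATE TABLE readings (
    id           BIGINT UNSIGNED NOT NULL,
    reading_date DATE NOT NULL,
    value        INT NOT NULL,
    PRIMARY KEY (id, reading_date)  -- the partitioning column must be in the key
)
PARTITION BY RANGE COLUMNS (reading_date) (
    PARTITION p2010_09 VALUES LESS THAN ('2010-10-01'),
    PARTITION p2010_10 VALUES LESS THAN ('2010-11-01'),
    PARTITION pmax     VALUES LESS THAN (MAXVALUE)
);
```

Queries are unchanged: a WHERE clause on reading_date simply prunes to the one relevant partition.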
Seems like it should be just fine holding everything in one table. It will make retrieval much easier in the future to maintain 1 table, as opposed to 12 tables per year. At 73,200 records per day it will take you almost 4 years to hit 100,000,000, which is still well within MySQL's capabilities.
Absolutely not.
It will ruin the relationships between tables.
Table relations are built on field values, not table names.
Especially for this very table, which will grow by just ~300 MB/year.
So in 100 days you have 7.3M rows, or about 27M a year. 27M rows isn't a lot anymore. MySQL can handle tables with millions of rows; it really depends on your hardware and your query types and query frequency.
But you should be able to partition that table (MySQL supports partitioning as of 5.1). What you're describing is the old SQL Server method of partitioning: after building those monthly tables you'd build a view that concatenates them together to look like one big table. Native partitioning is essentially the same thing, but it's all under the covers and fully optimized.
Usually this creates more trouble than it's worth: it's more maintenance, your queries need more logic, and it's painful to pull data from more than one period.
We store 200+ million time-based records in one (MyISAM) table, and queries are still blazingly fast.
You just need to ensure there's an index on your time/date column and that your queries make use of it (e.g. a query that wraps a date column in DATE_FORMAT or similar will likely not use the index). I wouldn't put them in separate tables just for the sake of retrieval performance.
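Concretely, the difference is between a predicate the index can serve and one that forces the function to be evaluated for every row (the table name here is illustrative):

```sql
CREATE INDEX idx_created ON records (created_at);

-- Uses the index: the column is compared bare against a range.
SELECT COUNT(*) FROM records
WHERE created_at >= '2010-09-01' AND created_at < '2010-10-01';

-- Likely a full scan: DATE_FORMAT must be applied to every row first.
SELECT COUNT(*) FROM records
WHERE DATE_FORMAT(created_at, '%Y-%m') = '2010-09';
```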
One thing that gets very painful with such a large number of records is deleting old data; this can take a long time (10 minutes to 2 hours to e.g. wipe a month's worth of data from a table with hundreds of millions of rows). For that reason we've partitioned the tables, and use a time_dimension relation table (see e.g. the time_dimension table a bit down here) for managing the periods instead of simple date/datetime columns or strings/varchars representing dates.
Some members of my team suggest creating tables using months as table names (september_2010), while others suggest having one table with lots of data in it...
Don't listen to them. You're already storing a date stamp, what about different months makes it a good idea to split the data that way? The engine will handle the larger data sets just fine, so splitting by month does nothing but artificially segregate the data.
My first reaction is: Aaaaaaaaahhhhhhhhh!!!!!!
Table names should not embed data values. You don't say what the data means, but supposing for the sake of argument it is, I don't know, temperature readings. Just imagine trying to write a query to find all the months in which average temperature increased over the previous month. You'd have to loop through table names. Worse yet, imagine trying to find all 30-day periods -- i.e. periods that might cross month boundaries -- where temperature increased over the previous 30-day period.
Indeed, just retrieving an old record would go from a trivial operation -- "select * where id=whatever" -- to a complex operation requiring the program to generate table names from the date on the fly. If you didn't know the date, you would have to scan through all the tables, searching each one for the desired record. Yuck.
With all the data in one properly-normalized table, queries like the above are pretty trivial. With separate tables for each month, they're a nightmare.
Just make the date part of the index, and the performance penalty of having all the records in one table should be very small. If the size of the table really becomes a performance problem, I could see making one table for archive data with all the old stuff and one for current data with everything you retrieve regularly. But don't create hundreds of tables. Most database engines have ways to partition your data across multiple drives using "table spaces" or the like. Use the sophisticated features of the database if necessary, rather than hacking together a crude simulation.
Depends on what searches you'll need to do. If normally constrained by date, splitting is good.
If you do split, consider naming the tables like foo_2010_09 so the tables will sort alphanumerically.
what is your DB platform?
In SQL Server 2K5+ you can partition on date.
My bad, I didn't notice the tag. @thetaiko is right though: this is well within MySQL's capabilities.
I would say it depends on how the data is used. If most queries are run over the complete data set, it would be an overhead to always stitch the tables back together again.
If you most times only need a part of the data (by date), it is a good idea to segment the tables into smaller pieces.
For the naming I would use tablename_yyyymm.
Edit: For sure you should then also think about another layer between the DB and your app to handle the segmented tables depending on some given date, which can get pretty complicated.
I'd suggest dropping the year and just having one table per month, named after the month. Archive your data annually by renaming all the tables $MONTH_$YEAR and re-creating the month tables. Or, since you're storing a timestamp with your data, just keep appending to the same tables. I assume by virtue of the fact that you're asking the question in the first place, that segregating your data by month fits your reporting requirements. If not, then I'd recommend keeping it all in one table and periodically archiving off historical records when performance gets to be an issue.
I agree that this idea complicates your database needlessly. Use a single table. As others have pointed out, it's not nearly enough data to warrant extraneous handling. Unless you use SQLite, your database will handle it well.
However, it also depends on how you want to access it. If the old entries are really only there for archival purposes, then the archive pattern is an option. It's common for versioning systems to separate out infrequently used data. In your case you'd only want everything older than 1 year moved out of the main table. And this is strictly a database administration task, not application behavior. The application would only join the current list and the _archive list, if at all. Again, this highly depends on the use case: are the old entries generally needed? Is there too much data to process regularly?
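The archive move itself is a plain administrative job, e.g. (table and column names assumed):

```sql
-- Periodically move rows older than a year into the archive table.
-- With InnoDB, wrap both statements in one transaction.
INSERT INTO readings_archive
SELECT * FROM readings
WHERE created_at < DATE_SUB(NOW(), INTERVAL 1 YEAR);

DELETE FROM readings
WHERE created_at < DATE_SUB(NOW(), INTERVAL 1 YEAR);
```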