I have looked at different ways to approach this, but I would like a method that people cannot easily get around. I just need a simple, lightweight way to count the number of views of different news articles, which are stored in a database:
id | title          | body | date       | views
1  | Stack Overflow | ...  | 2010-01-01 | 23
Session
- Couldn't they just clear their browser data and reload the page for another view? Is there any way to stop this?
Database table of IP addresses
- Tons of entries, which may hinder performance
Log file
- Same issue as the database, although I've seen lots of examples of this approach
For a performance-critical system, and for ensuring accuracy, which method should I look into further?
Thanks.
If you're looking to figure out how many unique visitors you have to a given page, then you need to keep information that is unique to each visitor somewhere in your application to reference.
IP addresses are definitely the "safest" way to go, as a user would have to jump through a good many hoops to manually change their IP address. That being said, you would have to store a pretty massive amount of data for each and every page if this is a commercial website.
What is more reasonable is to store the information in a cookie on the client's machine. Sure, if your client doesn't allow cookies you will have a skewed number, and sure, the user can wipe their browser history and skew it further, but overall your number should be relatively accurate.
You could potentially keep this information cached or in session-level variables, but then if your application crashes or restarts you're SOL.
If you REALLY need nearly 100% accurate numbers, then your best bet is to log the IP addresses of each page's unique visitors; that will give you the most accurate count. It's pretty extreme, though, and if you can take a roughly 5% hit in accuracy I would definitely go with the cookies.
I think that to keep it lightweight you should use someone else's processing power, so for that reason you should sign up for Google Analytics and insert their tracking code into the pages you want to track.
If you want more accuracy then track each database request in the database itself; or employ a log reading tool that then drops summaries of page reads into a database or file system each day.
Another suggestion:
When the user visits your website, log their IP address in a table and drop a cookie with a unique ID. Store this unique ID in another table, along with a reference to the IP address record. This way you can arrive at a more accurate count (and make adjustments to your final number).
Set up an automated task to create summary tables, which makes querying the data much faster. This will also allow you to prune the raw data on a regular basis.
If you're happy to sacrifice a little accuracy for speed, then this might be a solution:
This would be the "holding" table, which contains the raw data. It's not the table you'd query; it's just for writing to. You'd run through this whole table on a daily/weekly/monthly basis. Again, you may need indexes depending on how you wish to prune it.
CREATE TABLE `article_views` (
`article_id` int(10) unsigned NOT NULL,
`doy` smallint(5) unsigned NOT NULL,
`ip_address` int(10) unsigned NOT NULL
) ENGINE=InnoDB
You'd then have a summary table, which you would update on a daily/weekly or monthly basis which would be super fast to query.
CREATE TABLE `summary_article_uniques_2011` (
`article_id` int(10) unsigned NOT NULL,
`doy` smallint(5) unsigned NOT NULL,
`unique_count` int(10) unsigned NOT NULL,
PRIMARY KEY (`article_id`,`doy`),
KEY(`doy`)
) ENGINE=InnoDB
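The daily/weekly/monthly update step could then be a single aggregation from the holding table into the summary. A minimal sketch, assuming the roll-up runs once a day for the previous day:

```sql
-- Roll up yesterday's raw rows into per-article unique-visitor counts.
-- ON DUPLICATE KEY UPDATE makes the job safe to re-run.
-- Note: MySQL's DAYOFYEAR() is 1-based, unlike PHP's date('z'),
-- so pick one convention for `doy` and stick to it.
INSERT INTO summary_article_uniques_2011 (article_id, doy, unique_count)
SELECT article_id, doy, COUNT(DISTINCT ip_address)
FROM article_views
WHERE doy = DAYOFYEAR(CURDATE() - INTERVAL 1 DAY)
GROUP BY article_id, doy
ON DUPLICATE KEY UPDATE unique_count = VALUES(unique_count);
```

After the roll-up succeeds, the matching rows in article_views can be deleted as part of the same job.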
Example queries:
Unique count for a specific article on a day:
SELECT unique_count FROM summary_article_uniques_2011 WHERE article_id=? AND doy=" . date('z') . "
Counts per day for a specific article:
SELECT unique_count FROM summary_article_uniques_2011 WHERE article_id=?
Counts across the entire site, most popular articles today:
SELECT article_id FROM summary_article_uniques_2011 WHERE doy=? ORDER BY unique_count DESC LIMIT 10
-- note: this query will not hit an index; if you are going to have a lot of articles, your best bet is to add another summary table/index on unique_count
Related
I am working on a project in which I have voting options Up and Down, similar to Stack Overflow. I am not very experienced in DB design, and I have run into the following issue.
First of all, here is my table structure for voting :
voteId-----AUTO_INCREMENT with PRIMARY KEY.
mediaId----The Media for which user gives an up/down vote.
userId-----The User who Voted.
voteMode---1 for Up Vote and 0 for Down Vote. This is an Integer Field.
In this case, if I have 100 users and 100 media items, I will have 100x100 = 10,000 records in this table in total.
The problem is that the DB is piling up a lot of records, and the vote button is now dead slow to react. This is making my client unhappy and me in trouble.
Can anyone suggest a better model to avoid this huge table?
I am using jQuery.ajax to post my vote to the server, and the project is based on PHP and Zend Framework 1.11. When I click the Up icon, it takes some time to respond, and Firefox would sometimes crash. I tested by inserting lots of junk records (around 15,000) in a loop.
You can try these upgrades to your table schema:
-- All ids are now UNSIGNED, as you do not need negative values
ALTER TABLE `voting`
CHANGE `voteid` `voteid` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
CHANGE `mediaId` `mediaId` INT(11) UNSIGNED NOT NULL,
CHANGE `userId` `userId` INT(11) UNSIGNED NOT NULL,
-- ENUM datatype, as you need only two possible values (0 = down, 1 = up)
CHANGE `voteMode` `voteMode` ENUM('0', '1') NOT NULL;
-- Adding an index will surely gain some speed,
-- but do NOT index columns you do not need
ALTER TABLE `voting` ADD INDEX (`mediaId`, `userId`);
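If each user is meant to vote at most once per media item (an assumption, based on the Stack Overflow-style setup described), a UNIQUE index can be used instead of a plain index; it enforces that rule and serves the duplicate-vote lookup at the same time:

```sql
-- Assumes one vote per (mediaId, userId) pair; the index also speeds up
-- "has this user already voted on this media?" checks.
ALTER TABLE `voting` ADD UNIQUE KEY `uniq_media_user` (`mediaId`, `userId`);
```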
Go through the MySQL index documentation to learn more about indexing.
If you are using the MyISAM storage engine, then I suggest you go for the InnoDB storage engine; its row-level locking helps a lot with frequent concurrent writes such as votes.
And some other hacks that may help you are:
MySQL Query Cache
Prepared Statements in php
COLUMNS Partitioning
Some resources about MySQL database optimization:
MySQL Tuning
Mysql Optimization
Real World Scalability MySQL.
OK, two things. 15k records is nothing, so that can hardly be the problem. I'm using tables with 150M rows and queries still perform well under 0.005s.
I suspect you're using MyISAM and not InnoDB. With MyISAM, each insert (or update) locks the entire table, so while someone is voting the table is locked and others can't read from it. This can become a problem if you have thousands of users.
Make sure you have the right indexes. I'm not sure what queries are slow (and how slow!) but make sure you have an index on the columns you are searching for (probably mediaId).
If you want better advice, post the queries that are slow.
If you want to keep track of what user has voted for x media, and every user votes, then your minimal data amount is users * media.
If you want to have less data, you have to make a concession. Perhaps let users register and vote anonymously? Most users are not very happy if their personal preferences can be distilled from their voting behavior.
I am creating a database for keeping track of water usage per person for a city in South Florida.
There are around 40000 users, each one uploading daily readouts.
I was thinking of ways to set up the database, and it seemed easier to give each user a separate table. This should speed up retrieval of the data because the server will not have to sort through a table with tens of millions of entries.
Is my logic flawed?
Is there any way to index table names?
Are there any other ways of setting up the DB to both raise the speed and keep the layout simple enough?
-Thank you,
Jared
p.s.
The essential data for the readouts are:
-locationID (table name in my idea)
-Reading
-ReadDate
-ReadTime
p.p.s. During this conversation, I uploaded 5k tables and the server froze. ~.O
Thanks for your help, y'all
Setting up thousands of tables is not a good idea. You should maintain one table and put all entries in it. MySQL can handle a surprisingly large amount of data; the biggest issue you will encounter is the number of queries you can handle at a time, not the size of the database. For columns that hold numbers, use INT with the UNSIGNED attribute; for columns that hold text, use a VARCHAR of appropriate size (or TEXT if the text is large).
Handling users
If you need to identify records with users, setup another table that might look something like this:
user_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
name VARCHAR(100) NOT NULL
When you need to link a record to the user, just reference the user's user_id. For the record information I would set up the SQL something like:
id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
u_id INT(10) UNSIGNED
reading -- I'm not sure what your readings look like; if it's a number use INT, if it's text use VARCHAR
read_time TIMESTAMP
You can also consolidate the date and time of the reading to a TIMESTAMP.
Do NOT create a separate table for each user.
Keep indexes on the columns that identify a user and any other common constraints, such as date.
Think about how you want to query the data at the end. How on earth would you sum up the data from ALL users for a single day?
If you are worried about the primary key, I would suggest a composite key on (LocationID, Date).
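Putting the pieces above together, a single readings table might look like this (column names are assumptions based on the fields listed in the question):

```sql
-- One row per reading for all 40,000 users; the composite primary key
-- replaces the table-per-user idea and keeps lookups by location fast.
CREATE TABLE `readings` (
  `location_id` INT UNSIGNED NOT NULL,
  `read_date`   DATE NOT NULL,
  `read_time`   TIME NOT NULL,
  `reading`     INT UNSIGNED NOT NULL,
  PRIMARY KEY (`location_id`, `read_date`, `read_time`),
  KEY (`read_date`)
) ENGINE=InnoDB;
```

The extra index on read_date is there to support "sum up the data from ALL users for a single day" style queries.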
Edit: Lastly (and I do mean this in a nice way), if you are asking these sorts of questions about database design, are you sure you are qualified for this project? It seems like you might be in over your head. Sometimes it is better to know your limitations and let a project pass by, rather than implement it in a way that creates too much work for you and leaves folks unsatisfied with the results. Again, I am not saying don't; I am just saying ask yourself whether you can do this to the level they are expecting. It seems like a large number of users will be using it constantly, and learning certain things while delivering a project to thousands of users can be an exceptionally high-pressure environment.
Generally speaking, tables should represent sets of things. In your example it's easy to identify the sets you have: users and readouts; the theoretically best structure would be those two tables, where each readout entry has a reference to the id of the user.
MySQL can handle very large amounts of data, so your best bet is to just try the user-readouts structure and see how it performs. Alternatively, you may want to look into a document-based NoSQL database such as MongoDB or CouchDB, since storing readout reports as individual documents could be a good choice as well.
If you create a summary table that contains the monthly total per user, surely that would be the primary usage of the system, right?
Every month, you crunch the numbers and store the totals in a second table. You can prune the log table on a rolling 12-month period; the old data can be stuffed in the corner to keep the indexes smaller, since you'll only need it when the city is accused of fraud.
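A sketch of that monthly crunch; the table and column names here (readings, monthly_totals) are assumptions, not from the question:

```sql
-- Roll last month's readings into one total per user, then prune raw
-- rows older than the rolling 12-month window.
INSERT INTO monthly_totals (location_id, month_start, total_usage)
SELECT location_id,
       DATE_FORMAT(read_date, '%Y-%m-01'),
       SUM(reading)
FROM readings
WHERE read_date >= DATE_FORMAT(CURDATE() - INTERVAL 1 MONTH, '%Y-%m-01')
  AND read_date <  DATE_FORMAT(CURDATE(), '%Y-%m-01')
GROUP BY location_id;

DELETE FROM readings WHERE read_date < CURDATE() - INTERVAL 12 MONTH;
```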
So exactly how you store the daily readouts isn't a big enough concern to be freaking out about. Giving each user their own table is not the proper solution. If you have tons and tons of data, then you might want to consider sharding via something like MongoDB.
I'm pretty sure it's not much of a problem for 10,000 MYSQL rows, but what if we have hundreds of thousands, or even millions of rows?
Someone might tell me that cookies could solve the problem, but since I'm a rookie programmer, I figure that using cookies might raise more problems than it solves.
Is there any alternative? Or should I stick to a non-IP-sensitive counter? In my application, this counter is only viewable by the seller of an item, not by the users, some of whom might want to play around with the counter and refresh many times. If they don't see a counter, they won't play around with refreshing.
Thanks in advance,
Regards
IP addresses are basically integers.
Store them as integers and put an index on the corresponding column; queries will be very fast that way. Just keep in mind that IPv6 addresses are too large for 32-bit integers, so you might want to consider using VARBINARY(16) instead and storing the binary representations of your IP addresses.
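MySQL has built-in functions for the integer conversion, for example:

```sql
-- Convert a dotted-quad IPv4 address to an unsigned integer and back.
SELECT INET_ATON('192.168.0.1');  -- 3232235521
SELECT INET_NTOA(3232235521);     -- '192.168.0.1'
-- For IPv6, MySQL 5.6+ offers INET6_ATON/INET6_NTOA, which pair
-- naturally with a VARBINARY(16) column.
```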
Concerning the performance of your application: in my opinion it is always good to use some kind of caching for this kind of statistics. For example, regenerate the statistics only after a certain time interval has passed.
There are lots of different ways of approaching this. One way would be to have a log of item_id, ip_address and date with a unique index across all three columns, then do an INSERT IGNORE into the table:
CREATE TABLE `test`.`view_log` (
`item_id` INTEGER UNSIGNED NOT NULL,
`ip_address` INTEGER UNSIGNED NOT NULL,
`date` DATE NOT NULL,
PRIMARY KEY (`item_id`, `ip_address`, `date`)
);
INSERT IGNORE INTO view_log VALUES ($item_id, INET_ATON('$ip_address'), CURRENT_DATE);
Note: this will only work for IPv4. To support IPv6 you will need to use a different method for storing the IP addresses.
Assuming we have to log all the user activities of a community, I guess that in a short time our database will become very huge; so my question is:
is it an acceptable compromise (to have a huge DB table) in order to offer this kind of service? Or can we do this in a more efficient way?
EDIT:
the kind of activity to be logged is a "classic" social-networking activity log, where people can look at what others are doing or have done, and vice versa; so it will track, for example, when a user edits their profile, posts something, logs in, logs out, etc...
EDIT 2:
my table is already optimized in order to store only ids
log_activity_table(
    id        int
    user      int
    ip        varchar
    event     varchar  # event name
    time      varchar
    callbacks text     # some info from the triggered event
)
I'm actually working on a similar system, so I'm interested in the answers you get.
For my project, having a full historical accounting was not important, so we chose to keep the table fairly lean, much like what you're doing. Our tables look something like this:
CREATE TABLE `activity_log_entry` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`event` varchar(50) NOT NULL,
`subject` text,
`publisher_id` bigint(20) NOT NULL,
`created_at` datetime NOT NULL,
`expires_at` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `event_log_entry_event_idx` (`event`),
KEY `event_log_entry_publisher_id_idx` (`publisher_id`),
CONSTRAINT `event_log_entry_publisher_id_user_id`
FOREIGN KEY (`publisher_id`)
REFERENCES `user` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
We decided that we don't want to store history forever, so we will have a cron job that kills history after a certain time period. We have both created_at and expires_at columns simply out of convenience: when an event is logged, these columns are set automatically by the model, using a simple strftime('%F %T', strtotime($expr)) where $expr is a string like '+30 days' that we pull from configuration.
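The cron job itself can then be a single statement (a sketch, using the expires_at column from the table above):

```sql
-- Delete entries whose expiry has passed; run periodically (e.g. nightly).
-- On a very large table, delete in batches (add LIMIT and loop)
-- to avoid holding locks for too long.
DELETE FROM activity_log_entry WHERE expires_at < NOW();
```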
Our subject column is similar to your callbacks one. We also chose not to directly relate the subject of the activity to other tables, because not all event subjects will necessarily have a table; additionally, it's not even important to hold this relationship, because the only thing we do with this event log is display activity-feed messages. We store a serialized value object of data pertinent to the event, for use in predetermined message templates, and we directly encode what the event pertained to (i.e. profile, comment, status, etc.).
Our events (a.k.a. activities) are simple strings like 'update', 'create', etc. These are used in some queries and, of course, to help determine which message to display to a user.
We are still in the early stages so this may change quite a bit (possibly based on comments and answers to this question) but given our requirements it seemed like a good approach.
Case: all user activities have different tables, e.g. like, comment, post, become a member.
Then these tables should have a key associating each entry with a user. Given a user, you can get recent activities by querying each table by the user key.
Hence, if you don't have a schema yet, or you are privileged to change it, go with having different tables for different activities and searching across them.
Case: some activities are generic and don't have an individual table.
Then have a table for generic activities and search it along with the other activity tables.
Do you need to store the specific activity of each user, or do you just want to log the amount of activity happening over time? If the latter, then you might consider something like RRDtool (or a similar approach) and store the amount of activity over different timesteps in a circular buffer whose size stays constant over time. See http://en.wikipedia.org/wiki/RRDtool.
I'm working on a site which stores individual page views in a 'views' table:
CREATE TABLE `views` (
`view_id` bigint(16) NOT NULL auto_increment,
`user_id` int(10) NOT NULL,
`user_ip` varchar(15) NOT NULL,
`view_url` varchar(255) NOT NULL,
`view_referrer` varchar(255) NOT NULL,
`view_date` date NOT NULL,
`view_created` int(10) NOT NULL,
PRIMARY KEY (`view_id`),
KEY `view_url` (`view_url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
It's pretty basic: it stores the user_id (the user's id on the site), their IP address, the URL (without the domain, to reduce the size of the table a little), the referring URL (not really used right now, and I might get rid of it), the date (YYYY-MM-DD format, of course), and the Unix timestamp of when the view occurred.
The table, of course, is getting rather big (4 million rows at the moment, and it's a rather young site), and running queries on it is slow.
For some basic optimization I've now created a 'views_archive' table:
CREATE TABLE `views_archive` (
`archive_id` bigint(16) NOT NULL auto_increment,
`view_url` varchar(255) NOT NULL,
`view_count` smallint(5) NOT NULL,
`view_date` date NOT NULL,
PRIMARY KEY (`archive_id`),
KEY `view_url` (`view_url`),
KEY `view_date` (`view_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
This ignores the user info (and referring URL) and stores how many times a URL was viewed per day. This is probably how we'll generally want to use the data (how many times a page was viewed on a per-day basis), so it should make querying pretty quick. Even if I use it mainly to replace the 'views' table (right now I imagine I could show page views by hour for the last week or month, then daily views beyond that, so the 'views' table would only need to hold the last week or month of data), it's still a large table.
Anyway, long story short: I'm wondering if you can give me any tips on how best to handle the storage of stats/page views on a MySQL site, the goal being to keep the size of the table(s) as small as possible while still being able to easily (and at least relatively quickly) query the info. I've looked at partitioned tables a little, but the site doesn't have MySQL 5.1 installed. Any other tips or thoughts would be much appreciated.
You probably want to have a table just for pages, and have the user views have a reference to that table. Another possible optimization would be to have the user IP stored in a different table, perhaps some session table information. That should reduce your query times somewhat. You're on the right track with the archive table; the same optimizations should help that as well.
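As for filling the archive table, the daily job could be a single aggregation from the raw table into it; a sketch, assuming it runs shortly after midnight:

```sql
-- Collapse yesterday's raw rows from `views` into one count per URL
-- per day in `views_archive`; old raw rows can then be pruned.
INSERT INTO views_archive (view_url, view_count, view_date)
SELECT view_url, COUNT(*), view_date
FROM views
WHERE view_date = CURDATE() - INTERVAL 1 DAY
GROUP BY view_url, view_date;
```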
MySQL's Archive Storage Engine
http://dev.mysql.com/tech-resources/articles/storage-engine.html
It is great for logs: it is quick to write to, and the one downside, that reading is a bit slower, matters little for log tables.
Assuming your application is a blog and you want to keep track of views for your blog posts, you will probably have a table called blog_posts. In this table, I suggest you create a column called "views" in which you store a static count of how many views each post has. You will still use the views table, but it will only be used to keep track of all the views (and to check whether they are "unique" or not).
Basically, when a user visits a blog post, the application checks the views table to see if a new view should be added. If so, it also increments the "views" field in the corresponding row for the blog post in blog_posts. That way, you can just refer to the "views" field of each post to get a quick peek at how many views it has. You can take this a step further and add redundancy by setting up a CRON job to re-count and verify all the views and update each blog_posts row accordingly at the end of the day. Or, if you prefer, you can perform a re-count on each update if to-the-second accuracy is key.
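The two statements involved might look like this (column names such as post_id are assumptions, since the blog schema isn't given):

```sql
-- On a view that the views table accepted as new/unique,
-- bump the cached counter on the post row:
UPDATE blog_posts SET views = views + 1 WHERE post_id = ?;

-- Nightly CRON re-count to correct any drift in the cached counters
-- (assumes views.post_id references blog_posts.post_id):
UPDATE blog_posts bp
SET bp.views = (SELECT COUNT(*) FROM views v WHERE v.post_id = bp.post_id);
```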
This solution works well if your site is read-intensive and you constantly need a count of how many views each blog post has (again, assuming that is your application :-))