I'm building a classifieds website where I want to store a count of the number of views of each advert, so that later I can display the views in a graph by day, month, etc. for each user and each of their adverts. I'm struggling to decide how best to design the MySQL database to store a potentially large amount of data for each advert.
I am going to create a table for the page views as follows, storing one record per view per advert; for example, if advert (id 1) has 200 views, the table will store 200 records:
Advert_id (unique id of advert)
date_time (date and time of view)
ip_address (unique IP address of person viewing advert)
page_referrer (url of referrer page)
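In SQL the table would look something like this (advert_views is just a placeholder name):

CREATE TABLE advert_views (
    advert_id INT UNSIGNED NOT NULL,
    date_time DATETIME NOT NULL,
    ip_address VARCHAR(15) NOT NULL,
    page_referrer VARCHAR(255),
    KEY idx_advert_date (advert_id, date_time)
);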
As mentioned, I am going to create functionality for each member of the site to view a graph of the view statistics for each of their adverts, so they can see how many total views each advert has had, how many views it has had each day (between two given dates), and how many views per month. I'll do this by grouping on the date_time field.
If my site grows quite large and, for example, has 40,000 adverts averaging 3,000 page views each, the table would hold 120 million records. Is this too large? Would the MySQL queries to produce the graphs be very slow?
Do you think the table and method above is the best way to store these advert view statistics or is there a better way to do this?
Unless you really need to store all that data, it would probably be better to just increment a count when the advert is viewed, so you have just one row for each advert (or even a column in the advert's row).
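A minimal sketch, assuming a view_count column on the adverts table:

UPDATE adverts
SET view_count = view_count + 1
WHERE advert_id = 1;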
Another option is to save this to a text file and process it offline, but it's generally better to process data as you get it and incorporate it into your application's process.
If you really need to save all of that data, then rotating the log table (say weekly, after processing it) would reduce the overhead of storing all of that information indefinitely.
I worked on a website with 50,000 unique visitors per day, and I had the same table as you.
The table was growing by ~200-500 MB/day, but I was able to clean it out every day.
The best option is to make a second table: count the visitors every day, add the result to the second table, and flush the first table.
first table example:
advert_id
date & time
ip address
page referrer
second table example (for graph):
advert_id
date
visitors
unique visitors
Example SQL query to count unique visitors:
SELECT
    advert_id,
    COUNT(DISTINCT ip_address) AS unique_visitors,
    DATE(date_time) AS view_date
FROM
    advert_views
GROUP BY
    advert_id,
    view_date
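The daily rollup itself can then be a single INSERT ... SELECT followed by emptying the log table. A sketch, assuming the two tables are named advert_views and advert_view_stats:

INSERT INTO advert_view_stats (advert_id, view_date, visitors, unique_visitors)
SELECT
    advert_id,
    DATE(date_time),
    COUNT(*),
    COUNT(DISTINCT ip_address)
FROM advert_views
GROUP BY advert_id, DATE(date_time);

TRUNCATE TABLE advert_views;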
The problem is not really performance (MySQL's MyISAM engine is quite smart and fast); the problem is storing such a large amount of data.
90% of statistics tools (even Google Analytics or Webalizer) generate graphs only once per day, not in real time.
It is also quite a good idea to store the IP as an INT, using PHP's ip2long() (or MySQL's INET_ATON()).
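For example (assuming ip_address is an INT UNSIGNED column):

INSERT INTO advert_views (advert_id, date_time, ip_address)
VALUES (1, NOW(), INET_ATON('192.168.0.1'));

-- and convert back for display:
SELECT INET_NTOA(ip_address) FROM advert_views;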
This is more of a general question on how I would approach this.
I have an audio player. No data about songs is stored in the database. I would like to use the database for counting likes for each song. My question is: how would I organize this?
1. If I list all my songs in the database beforehand and add a column for likes, where would I keep track of user IP addresses (not sure if there is a better way than using IP addresses?) so I can prevent the same user from voting multiple times?
2. Every time a user likes a song, I would store {song title, like count, user ip} (maybe something else) in the database. But then I would end up with multiple rows of the same song with different user IP addresses, so how would I keep track of likes for every song in this case?
No data about songs is stored in the database
If you want to store data about each song, you should probably add a table for the songs.
not sure if there is a better way than using ip address?
If you have a table with user IDs, use that because the same user can log on from different IPs, and different users can log on from the same IP.
As for database structure, what you proposed is probably the right answer. Once you have a table set up for songs, have a dedicated column just for likes so you can retrieve it easily.
If you want to know which user liked which song (and when), then you need to make a new table like you suggested in your second point. This is called a junction table, and it is used for many-to-many (m:m) relationships. There is no single unique id; it joins the two tables by their keys (users:songs) and stores each relationship between the two.
Every time a user likes a song, I would store {song title, like count, user ip} (maybe something else) in the database. But then I would end up with multiple rows of the same song with different user IP addresses, so how would I keep track of likes for every song in this case?
Even if you end up with many rows for the same song, they are all unique, since each has a different user liking it (and vice versa: you can get all the songs a specific user liked). You wouldn't need to store a like count here, since you can just count the number of records in the database. {song title, user id, datestamp} would be fine.
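A minimal sketch of such a junction table and the count query (the table and column names are hypothetical):

CREATE TABLE song_likes (
    song_id INT NOT NULL,
    user_id INT NOT NULL,
    liked_at DATETIME NOT NULL,
    PRIMARY KEY (song_id, user_id)  -- one like per user per song
);

SELECT song_id, COUNT(*) AS likes
FROM song_likes
GROUP BY song_id;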
I'm going to be building a page counter for a site to determine views of different pages. Basically, I have users that I'd like to track with respect to page views of about 120 specific pages.
I'm having trouble determining the most efficient way to do this, as well as the logic behind it. I was trying 3 tables, but this would end up in many unnecessary rows. I thought about storing an array in a single field and then updating the array, but I'm not sure how well this would grow, or if it is even possible. Below is my way of limiting the number of requests: basically, every 100th view counts as 100. Any ideas for structuring this?
$sample_rate = 100;
// Record roughly 1 in every $sample_rate views, then weight the recorded
// hit by the sample rate so the totals stay approximately right.
if (mt_rand(1, $sample_rate) == 1) {
    // e.g. UPDATE pages SET view_count = view_count + 100 WHERE page_id = ?
}
Based on our conversation in the comments I would recommend the following simple design:
TblPages
--------
Page_Id (pk)
Page_URL (unique)
TblViews
--------
View_Page_Id
View_User_Id
View_Count
(pk is View_Page_Id + View_User_Id)
Whenever a user views a page, you execute a stored procedure that will either insert a record into the TblViews table if there is no record for that user and page, or update the View_Count of the existing record.
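In MySQL this insert-or-update can also be done in a single statement rather than a stored procedure, since the (View_Page_Id, View_User_Id) primary key rejects duplicates; the ids 42 and 7 below are placeholders:

INSERT INTO TblViews (View_Page_Id, View_User_Id, View_Count)
VALUES (42, 7, 1)
ON DUPLICATE KEY UPDATE View_Count = View_Count + 1;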
To get all the views for a page you use SUM(View_Count) and group by View_Page_Id; to get all the page views by a user you group by View_User_Id, etc.
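For example:

-- total views for each page:
SELECT View_Page_Id, SUM(View_Count) AS Total_Views
FROM TblViews
GROUP BY View_Page_Id;

-- total page views by each user:
SELECT View_User_Id, SUM(View_Count) AS Total_Views
FROM TblViews
GROUP BY View_User_Id;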
I'm trying to create a filter to show certain records that are considered 'trending'. So the idea is to select records that are voted heavily upon but not list them in descending order from most voted to least voted. This is so a user can browse and have a chance to see all links, not just the ones that are at the top. What do you recommend would be the best way to do this? I'm lost as to how I would create a random assortment of trending links, but not have them repeat as a user goes from page to page. Any suggestions? Let me know if any of this is unclear, thanks!
This response assumes you are tracking up votes in a child table on a per row basis for each vote, rather than just +1'ing a counter on the specific item.
Determine the time frame you care about the trending topics. Maybe 1 hour would be good?
Then run a query to determine which item has the highest number of votes in the last hour. Order by this count and you will have a continually updating most upvoted list of items.
SELECT items.name, vote_counts.item_count
FROM items
INNER JOIN
(
    SELECT item_id, COUNT(item_id) AS item_count
    FROM item_votes
    -- only count votes from the last hour, and only upvotes, not downvotes
    WHERE date >= NOW() - INTERVAL 1 HOUR
      AND item_vote > 0
    GROUP BY item_id
) AS vote_counts ON vote_counts.item_id = items.item_id
ORDER BY vote_counts.item_count DESC
You're mentioning that you don't want to repeat items over several pages which means that you can't get random ordering per request. You'll instead need to retrieve the items, order them, and persist them in either a server-wide or session-specific cache.
A server-wide cache would need to be updated every once in a while, a time interval you'll need to define. Users switching pages when this update occurs will see their items scrambled.
A session-specific cache would maintain the items as long as the user browses your website, which means that the items would be outdated if your users never leave. Once again, you'll need to determine a time interval to enforce updates.
I'm thinking that you need a versioned list. You could use the server-wide cache solution and give it a version (date, integer, anything). You then pass this version around while the user browses the latest trends, so the user keeps viewing the same list. Clicking on the Trends menu link sends them to the browsing pages without version information, which grabs the latest list from your cache; you then keep that version for as long as the user is browsing these pages.
I can't get into SQL statements, not because they are hard, but because we don't know your database structure. Do you keep track of individual votes in a separate table? Or are they just aggregated into a single column?
Maybe create a new column to record how many views each item has? Then update it every time someone views it, and order by the items with the largest number of views.
I am rebuilding the background system of a site with a lot of traffic.
This is the core of the application and the way I build this part of the database is critical for a big chunk of code and upcoming work. The system described below will have to run millions of times each day. I would appreciate any input on the issue.
The background is that a user can add what he or she has been eating during the day.
Simplified, the process is more or less this:
The user arrives at the site, and the site lists his/her choices for the day (if entered before, as the steps below describe).
The user can add a meal (consisting of 1 to an unlimited number of different food items and their quantities). The meal is added through a search field and is organized into different types (like 'Breakfast' or 'Lunch').
During the meal-building process a list of the most commonly used food items (primarily by this user, but secondarily also by all users) will be shown for quick selection.
The meals will be stored in a FoodLog table that consists of something like this: id, user_id, date, type, food_data.
What I currently have is a huge database with food items from which the search will be performed. The food items are stored with information on both the common name (like "pork cutlets") and on producer (like "coca cola"), along with other detailed information needed.
Question summary:
My problem is that I do not know the best way to store the data so that it is easily accessible in the way I need, without the database getting out of hand.
Consider 1 million users adding 1 to 7 meals each day. Storing each food item for each meal, each day, and each user would potentially create (1 * avg_num_meals * avg_num_food_items) million rows each day.
Storing the data in some compressed way (e.g. making food_data a json_encoded string) would lessen the number of rows significantly, but at the same time it would make it hard to create the 'most used food items' list and other statistics on the fly.
Should the table be split into several tables? If this is the case, how would they interact?
The site is currently hosted on a mid-range CDN and is using a LAMP (Linux, Apache, MySQL, PHP) backbone.
Roughly, you want a fully normalized data structure for this. You want one table for Users, one table for Meals (one entry per meal, with a reference to the User; you probably also want the date/time of the meal in this table), and a table for MealItems, which is simply an association table between the Meals and Food Items tables.
So when a User comes in and creates an account, you make an entry in the Users table. When a user reports a Meal they've eaten, you create a record in the Meals table, and a record in the MealItems table for every item they reported.
This structure makes it straightforward to have a variable number of items with every meal, without wasting a lot of space. You can determine the representation of items in meals with a relatively simple query, as well as determining just what the total set of items any one user has consumed in any given timespan.
This normalized table structure will support a VERY large number of records, as well as a large number of queries against the database.
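A sketch of that structure in MySQL (all names are illustrative, and columns are trimmed to the essentials):

CREATE TABLE users (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
    -- name, email, etc.
);

CREATE TABLE meals (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    eaten_at DATETIME NOT NULL,
    type VARCHAR(20) NOT NULL
);

CREATE TABLE meal_items (
    meal_id INT UNSIGNED NOT NULL,
    food_item_id INT UNSIGNED NOT NULL,
    quantity DECIMAL(8,2) NOT NULL,
    PRIMARY KEY (meal_id, food_item_id)
);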
First, storing the data in some compressed way (like making food_data a json_encoded string) is not recommended. It will cause you countless headaches in the future as new requirements are added.
You should definitely have a few tables here.
Users
id, etc
Food Items
id, name, description, etc
Meals
id, user_id, category, etc
Meal Items
id, food_item_id, meal_id
The Meal Items table ties the Meals to the Food Items using ids, and the Meals are tied to Users using ids. This makes it simple to use joins to get detailed lists of data: totals, averages, etc. If the fields are properly indexed, this should be a great model for supporting a large number of records.
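For example, the 'most commonly used food items' list from the question becomes a straightforward join (illustrative table and column names, user id 123 as a placeholder):

SELECT fi.id, fi.name, COUNT(*) AS times_used
FROM meal_items mi
JOIN meals m ON m.id = mi.meal_id
JOIN food_items fi ON fi.id = mi.food_item_id
WHERE m.user_id = 123
GROUP BY fi.id, fi.name
ORDER BY times_used DESC
LIMIT 10;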
In addition to what's been said:
Be judicious in your use of indexes. Properly applying these to your database could significantly speed up read access to your tables.
Consider using language-specific features to minimize space. You mention that you're using MySQL; consider using ENUM when appropriate (food types, meal types) to minimize database size and simplify management, as in the sketch below.
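A sketch, assuming a meals table whose category column currently holds free-form strings:

ALTER TABLE meals
MODIFY category ENUM('breakfast', 'lunch', 'dinner', 'snack') NOT NULL;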
I would split your meal table into two tables: one stores a single row for each meal; the second stores one row for each food item used in a meal, with a foreign key reference to the meal it was used in.
After that, just make sure you have indices on any table columns used in joins or WHERE clauses.
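For instance (hypothetical table and column names):

CREATE INDEX idx_meal_items_meal ON meal_items (meal_id);
CREATE INDEX idx_meals_user_date ON meals (user_id, date);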
I am building a database in MySQL that will be accessed by PHP scripts. I have a table that is the activity stream. This includes everything that goes on on the website (following of many different things, liking, upvoting, etc.). From this activity stream I am going to run an algorithm for each user, depending on their activity, and display relevant activity. Should I create another table that stores the activity for each user once the algorithm has been run, or should I run the algorithm on the activity table every time the user accesses the site?
UPDATE: (this is what is above, just rephrased in a hopefully easier-to-understand way)
I have a database table called activity. This table creates a new row every time an action is performed by a user on the website.
Every time a user logs in I am going to run an algorithm on the new rows (since the user's last login) in the activity table that apply to them. For example, if the user is following a user who upvoted a post in the activity stream, that post will be displayed when the user logs in. I also want the user to be able to access previous content that applies to them. Would it be easiest to create another table that saves the rows that have already been run through the algorithm, attached to individual user names? (A row can apply to multiple different users.)
I would start with a single table and appropriate indexes. Using a union statement, you can perform several queries (using different indexes) and then mash all the results together.
As an example, let's assume that you are friends with users 37, 42, and 56, and you are interested in basketball and knitting. And let's assume you have an index on user_id and an index on subject. This query should be quite performant:
SELECT * FROM activity WHERE user_id IN (37, 42, 56)
UNION DISTINCT
SELECT * FROM activity WHERE subject IN ("basketball", "knitting")
ORDER BY created
LIMIT 50
I would recommend tracking your user-specific activities in a separate table; then, upon login, you could more easily show all activities that relate to the user. I.e., if a user is, say, big into baseball and hockey, you could retrieve that from their recent activity, then go to your everything-activities table and grab the relevant items from it.