I have a news site with about 50,000 news items in a MySQL database so far. For each news page I need to build a list of the most interesting and relevant news and remove the items the current user has already viewed (the actual personalization).
I already keep the list of viewed news in cookies, so all I need is the best architectural approach for filtering out the viewed news.
I see only two options:
1. Keep the already calculated full list of most popular news (20-30k items) in memory and remove the viewed ones on each request.
2. Build the list of popular items for the user from scratch each time they open a page.
With option 1 we can cache with APC, Redis etc., but a big array of data is copied into every request, which eats a lot of memory. With option 2 we would have to query the DB every time, which is slow and consumes CPU and DB resources.
So is there any way to avoid using so many resources and still make it fast?
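You can do something like:
SELECT ... article data .. FROM Articles
LEFT JOIN ViewedArticles
ON ViewedArticles.articleId = Articles.articleId
AND ViewedArticles.userId = :id
WHERE ViewedArticles.articleId IS NULL
That selects only the articles that have no row in the ViewedArticles table matching both the articleId and the given userId. The userId condition belongs in the join condition rather than in WHERE: a WHERE filter on the user would discard exactly the rows where the join found no match, which are the rows you want.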
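A minimal runnable sketch of this anti-join, using Python's built-in sqlite3 in place of MySQL. The column names beyond articleId and userId (title, popularity) and the sample rows are assumptions made for illustration only.

```python
import sqlite3

# In-memory database standing in for the real MySQL schema (assumed columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Articles (articleId INTEGER PRIMARY KEY, title TEXT, popularity INTEGER);
    CREATE TABLE ViewedArticles (userId INTEGER, articleId INTEGER,
                                 PRIMARY KEY (userId, articleId));
    INSERT INTO Articles VALUES (1, 'A', 90), (2, 'B', 80), (3, 'C', 70);
    INSERT INTO ViewedArticles VALUES (7, 1), (8, 2);
""")

def unviewed_articles(conn, user_id, limit=10):
    """Return the most popular articles the given user has not viewed yet."""
    return conn.execute("""
        SELECT a.articleId, a.title
        FROM Articles a
        LEFT JOIN ViewedArticles v
               ON v.articleId = a.articleId AND v.userId = ?
        WHERE v.articleId IS NULL          -- keep only rows with no view record
        ORDER BY a.popularity DESC
        LIMIT ?
    """, (user_id, limit)).fetchall()

print(unviewed_articles(conn, 7))
```

Because the primary key on (userId, articleId) doubles as an index, the database can probe for a view record per article instead of copying the whole popular list into each request.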
Related
Background Info :
I'm trying to retrieve images from the people I follow, sorted by latest first. It's like a Twitter news feed, which shows the latest items from your friends.
Plans:
Currently there is only one type of item I need to take into consideration: images. In future I'm planning to analyse users' behavior and add other images they might like to their feed, etc.
http://www.quora.com/What-are-best-practices-for-building-something-like-a-News-Feed
I personally feel that the "pull" model, or fan-out-on-load, where I pull all the info in real time, would be worse than the push model. Imagine I follow 100 users: I would have to fetch all their items and sort them by time on every read. (Let me know if I'm wrong, e.g. if reads are 100x better than writes in the push model.)
The current design of the push model I have in mind is as follows:
Table users_feed(ID, User_ID, Image_ID,datetime)
Option 1 : Store A list of Image_ID
Option 2 : Store one image ID and duplicate rows(More Rows of same User_ID but different Image_ID)
The plan is to limit the number of rows a user can have in this feed, which means there would always be a maximum of 50 images. If they want more than the 50 images in their news feed, they can't get them (I might code an alternative that stores more, so they can view more in future).
Question 1
Whenever a user adds an item to their "collection", I have to push it into the feed of each of their followers. Won't writes become a problem? 200 followers = 200 writes?
Question 2
Which method would be better for me, keeping in mind that I only have one type of data, which is images (feeds of images)?
Question 3
If I choose to store the feed in advance (push method), how do I actually write it into all my friends' feeds?
Insert xxx into feeds whereIn (array of FriendsID)?
Any form of advice would be greatly appreciated. Thanks in advance!
I would recommend the pull method over the push method for the following reasons:
It gives you more freedom for extensibility in the future.
It needs fewer writes (imagine 10M followers: that means 10M writes for just 1 post).
You can get a user's whole feed with a query similar to:
SELECT * FROM users_feed AS a WHERE a.user_id IN ( < select the user_ids of everyone the logged-in user follows > )
(This is not exact syntax, as the table structure for followers is not known.)
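As a rough illustration of the pull model, here is a small sketch using Python's built-in sqlite3. The follows and images tables, their columns and the sample data are all hypothetical, since the real table structure is not known.

```python
import sqlite3

# Hypothetical schema: who follows whom, and who posted which image.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follows (follower_id INTEGER, followed_id INTEGER);
    CREATE TABLE images (image_id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT);
    CREATE INDEX idx_images_user_time ON images (user_id, created_at);
    INSERT INTO follows VALUES (1, 2), (1, 3);
    INSERT INTO images VALUES
        (10, 2, '2014-01-01 10:00:00'),
        (11, 3, '2014-01-01 11:00:00'),
        (12, 4, '2014-01-01 12:00:00'),
        (13, 2, '2014-01-01 13:00:00');
""")

def pull_feed(conn, user_id, limit=50):
    """Fan-out-on-load: read the latest images of everyone user_id follows."""
    rows = conn.execute("""
        SELECT i.image_id
        FROM images i
        JOIN follows f ON f.followed_id = i.user_id
        WHERE f.follower_id = ?
        ORDER BY i.created_at DESC
        LIMIT ?
    """, (user_id, limit)).fetchall()
    return [image_id for (image_id,) in rows]

print(pull_feed(conn, 1))
```

Posting an image is a single insert here, whatever the number of followers; the cost moves to the read, which is why the index on (user_id, created_at) matters.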
I'm building an activity stream for our site, and have made some decent headway with something that works pretty well.
It's powered by two tables:
stream:
id - Unique Stream Item ID
user_id - ID of the user who created the stream item
object_type - Type of object (currently 'seller' or 'product')
object_id - Internal ID of the object (currently either the seller ID or the product ID)
action_name - The action taken against the object (currently either 'buy' or 'heart')
stream_date - Timestamp that the action was created.
hidden - Boolean of if the user has chosen to hide the item.
follows:
id - Unique Follow ID
user_id - The ID of the user initiating the 'Follow' action.
following_user - The ID of the user being followed.
followed - Timestamp that the follow action was executed.
Currently I'm using the following query to pull content from the database:
Query:
SELECT stream.*,
COUNT(stream.id) AS rows_in_group,
GROUP_CONCAT(stream.id) AS in_collection
FROM stream
INNER JOIN follows ON stream.user_id = follows.following_user
WHERE follows.user_id = '1'
AND stream.hidden = '0'
GROUP BY stream.user_id,
stream.action_name,
stream.object_type,
date(stream.stream_date)
ORDER BY stream.stream_date DESC;
This query actually works pretty well, and using a little PHP to parse the data that MySQL returns we can create a nice activity stream with actions of the same type by the same user being grouped together if the time between the actions isn't too great (see below example).
My question is: how do I make this smarter? Currently it groups along one axis, "user" activity: when there are multiple items by a particular user within a certain timeframe, MySQL knows to group them.
How can I make this even smarter and also group along another axis, such as "object_id", so that multiple actions on the same object in sequence are grouped, while keeping the logic we currently have for grouping actions/objects by user? And how can I implement this without data duplication?
Example of multiple objects appearing in sequence:
I understand solutions to problems like this can get very complex, very quickly but I'm wondering if there's an elegant, and fairly simple solution to this (hopefully) in MySQL.
Some observations about your desired results:
Some of the items are aggregated (Jack Sprat hearted seven sellers) and others are itemized (Lord Nelson chartered the Golden Hind). You probably need to have a UNION in your query that pulls together these two classes of items from two separate subqueries.
You use a fairly crude timestamp-nearness function to group your items: DATE(). You may want to use a more sophisticated and tweakable scheme, like this, maybe:
GROUP BY TIMESTAMPDIFF(HOUR, stream_date, NOW()) DIV hourchunk
This will let you group stuff by age chunks. For example, if you use 48 for hourchunk, everything less than 48 hours old is grouped together, then everything 48-96 hours old, and so on. As you add traffic and action to your system you may want to decrease the hourchunk value.
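To make the chunking arithmetic concrete, here is a small Python sketch of the same idea: the age in whole hours, integer-divided by the chunk size. The function name age_chunk and the sample dates are illustrative only.

```python
from datetime import datetime, timedelta

def age_chunk(stream_date, now, hourchunk=48):
    """Return the age bucket of an item: 0 for the newest chunk, 1 for the next, ..."""
    age_hours = int((now - stream_date).total_seconds() // 3600)  # whole hours elapsed
    return age_hours // hourchunk                                 # integer division, like DIV

now = datetime(2014, 1, 10, 12, 0)
for hours_ago in (5, 47, 48, 100):
    print(hours_ago, age_chunk(now - timedelta(hours=hours_ago), now))
```

Items that land in the same bucket are candidates for being grouped into one stream entry; a smaller hourchunk gives finer-grained groups.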
My impression is you need to group by user, as you do, but also, after that grouping, by action.
It looks to me like you need a subquery like this:
SELECT user_groups.*, -- or whatever columns
SUM(actions_in_user_group) AS total_rows_in_group,
GROUP_CONCAT(actions_in_user_collection) AS complete_collection
FROM
( SELECT stream.*, -- or whatever columns
COUNT(stream.id) AS actions_in_user_group,
GROUP_CONCAT(stream.id) AS actions_in_user_collection
FROM stream
INNER JOIN follows
ON stream.user_id = follows.following_user
WHERE follows.user_id = '1'
AND stream.hidden = '0'
GROUP BY stream.user_id,
date(stream.stream_date)
) AS user_groups -- a derived table needs an alias in MySQL
GROUP BY object_id,
date(stream_date)
ORDER BY stream_date DESC;
Your initial query (now the inner one) groups by user, but then the user groups are regrouped by identical actions - that is, identical products bought or sales from one seller would be put together.
Over at Fashiolista we've opensourced our approach to building feed systems.
https://github.com/tschellenbach/Feedly
It's currently the largest open source library aimed at solving this problem. (but written in Python)
The same team which built Feedly also offers a hosted API, which handles the complexity for you. Have a look at getstream.io. There are clients for PHP, Node, Ruby and Python.
https://github.com/tbarbugli/stream-php
It also offers support for custom defined aggregations, which you are looking for.
In addition, have a look at this High Scalability post where we explain some of the design decisions involved:
http://highscalability.com/blog/2013/10/28/design-decisions-for-scaling-your-high-traffic-feeds.html
This tutorial will help you set up a system like Pinterest's feed using Redis. It's quite easy to get started with.
To learn more about feed design I highly recommend reading some of the articles which we based Feedly on:
Yahoo Research Paper
Twitter 2013 Redis based, with fallback
Cassandra at Instagram
Etsy feed scaling
Facebook history
Django project, with good naming conventions. (But database only)
http://activitystrea.ms/specs/atom/1.0/ (actor, verb, object, target)
Quora post on best practises
Quora scaling a social network feed
Redis ruby example
FriendFeed approach
Thoonk setup
Twitter's Approach
We resolved a similar issue with a 'materialized view' approach: we use a dedicated table that gets updated on every insert/update/delete event. All user activities are logged into this table, pre-prepared for simple selection and rendering.
The benefit is simple and fast selection; the drawback is slightly slower insert/update/delete, since the log table has to be updated as well.
If this system is well designed, it is a winning solution.
This is quite easy to implement if you are using an ORM with post insert/update/delete events (like Doctrine).
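A minimal sketch of the idea, assuming database triggers rather than the ORM events mentioned above, and using Python's built-in sqlite3 so it runs anywhere. The purchases and activity_log tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source table (assumed) and the pre-prepared log table that feeds are read from.
    CREATE TABLE purchases (id INTEGER PRIMARY KEY, user_id INTEGER, product_id INTEGER);
    CREATE TABLE activity_log (
        id INTEGER PRIMARY KEY,
        user_id INTEGER, action_name TEXT, object_type TEXT, object_id INTEGER);

    -- Keep the log in step with the source table on insert and delete.
    CREATE TRIGGER purchases_after_insert AFTER INSERT ON purchases
    BEGIN
        INSERT INTO activity_log (user_id, action_name, object_type, object_id)
        VALUES (NEW.user_id, 'buy', 'product', NEW.product_id);
    END;
    CREATE TRIGGER purchases_after_delete AFTER DELETE ON purchases
    BEGIN
        DELETE FROM activity_log
        WHERE user_id = OLD.user_id AND action_name = 'buy'
          AND object_type = 'product' AND object_id = OLD.product_id;
    END;
""")

conn.execute("INSERT INTO purchases (user_id, product_id) VALUES (1, 42)")
conn.execute("INSERT INTO purchases (user_id, product_id) VALUES (2, 43)")
conn.execute("DELETE FROM purchases WHERE user_id = 2")

log = conn.execute(
    "SELECT user_id, action_name, object_type, object_id FROM activity_log"
).fetchall()
print(log)
```

Rendering the stream then only needs a plain SELECT on activity_log; each write to purchases pays for one extra write to the log.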
I am about to build a web shop and need to come up with a solution of tracking user information, and based upon that suggest the users products they may like too and so build an individual user profile (what they like).
Information to be tracked/used for the algorithm, I thought should include:
past orders
wish list/bookmarks/favourites...
search terms entered
products viewed (also tracking and considering the "drop-off" rate, meaning whether a user closes the site/goes back immediately or looks at more pictures/scrolls down (viewport) etc.)
Products are assigned to categories as well as different attributes such as colors, tags etc. The table product has relations with color, category, etc.
product
    id_product
    price
    timestamp_added
color
    id_color
    ...
product_color
    id_product_color
    id_product
    id_color
The questions are:
1) How would you structure a database to track e.g. products viewed? Should it be just like this?:
product_viewed
    id_product_viewed
    id_product
    id_user
    timestamp
2) Suppose I want to calculate e.g. a user's top 3 favourite colors based on the colors of the products they bought, put on their wish list, bookmarked or viewed. From a performance point of view, is it feasible to calculate which products should be recommended to this user by querying the database every single time? Or do you update a user profile from time to time, storing only the favourite colors already calculated from the tracked data, and use that stored data to find matching products?
How do big sites like Facebook, Amazon or Pinterest do this? On Pinterest you get suggestions for items you may like based on the items you clicked on before. How do they handle this?
Yes, your schema for product_viewed is OK.
As for their three favorite colors, try this untested code:
select c.name, count(*) as view_count -- RANK is a reserved word in MySQL 8.0+
from product_viewed pv
JOIN product_color pc on pc.id_product = pv.id_product
JOIN color c on pc.id_color = c.id_color
where pv.id_user = 1
group by c.name
order by view_count desc
limit 3
Given indexes on the ids used to join the tables and a reasonable limit on the number of items viewed, this should have decent performance. Down the road, you might only look at their most recent 100 products, etc., just to keep it from growing forever. (Or, as you suggest, caching).
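The "most recent 100 products" idea can be sketched as follows, using Python's built-in sqlite3 in place of MySQL. The name column on color and the sample rows are assumptions; the other table and column names follow the schema in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE color (id_color INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_color (id_product INTEGER, id_color INTEGER);
    CREATE TABLE product_viewed (
        id_product_viewed INTEGER PRIMARY KEY,
        id_product INTEGER, id_user INTEGER, timestamp TEXT);
    INSERT INTO color VALUES (1, 'red'), (2, 'blue'), (3, 'green');
    INSERT INTO product_color VALUES (100, 1), (101, 1), (102, 2), (103, 3);
    INSERT INTO product_viewed (id_product, id_user, timestamp) VALUES
        (100, 1, '2014-01-01'), (101, 1, '2014-01-02'),
        (102, 1, '2014-01-03'), (100, 1, '2014-01-04'),
        (103, 2, '2014-01-05');
""")

def top_colors(conn, user_id, recent=100, top=3):
    """Favourite colors, counted over the user's `recent` latest product views only."""
    return conn.execute("""
        SELECT c.name, COUNT(*) AS view_count
        FROM (SELECT id_product FROM product_viewed
              WHERE id_user = ?
              ORDER BY timestamp DESC
              LIMIT ?) AS pv                      -- cap the history that is scanned
        JOIN product_color pc ON pc.id_product = pv.id_product
        JOIN color c ON c.id_color = pc.id_color
        GROUP BY c.name
        ORDER BY view_count DESC, c.name          -- name breaks ties deterministically
        LIMIT ?
    """, (user_id, recent, top)).fetchall()

print(top_colors(conn, 1))
```

The result of such a query is also what one would store in a per-user profile if the cached variant from the question is preferred.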
There's no magic to this, so it's probably similar to what those other sites are doing.
Doing it with tables like the ones you just wrote is a good way.
Facebook etc. are doing it that way as well.
For efficiency, such lookups rely on indexes, which in MySQL are typically B-tree structures.
I'm trying to create a filter to show certain records that are considered 'trending'. So the idea is to select records that are voted heavily upon but not list them in descending order from most voted to least voted. This is so a user can browse and have a chance to see all links, not just the ones that are at the top. What do you recommend would be the best way to do this? I'm lost as to how I would create a random assortment of trending links, but not have them repeat as a user goes from page to page. Any suggestions? Let me know if any of this is unclear, thanks!
This response assumes you are tracking up votes in a child table on a per row basis for each vote, rather than just +1'ing a counter on the specific item.
Determine the time frame you care about for trending topics. Maybe 1 hour would be good?
Then run a query to determine which item has the highest number of votes in the last hour. Order by this count and you will have a continually updating most upvoted list of items.
SELECT items.name, item_votes.item_count FROM items
INNER JOIN
(
SELECT item_id, COUNT(item_id) AS item_count
FROM item_votes
WHERE date >= dateAdd(hh, -1, getDate())
-- only count upvotes, not downvotes
AND item_vote > 0
group by item_id
) AS item_votes ON item_votes.item_id = items.item_id
ORDER BY item_votes.item_count DESC
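The query above uses SQL Server date functions. As a runnable illustration of the same windowed count, here is a sketch with Python's built-in sqlite3, where the current time is passed in as a parameter so the result is reproducible. The sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item_votes (item_id INTEGER, item_vote INTEGER, date TEXT);
    INSERT INTO items VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
    INSERT INTO item_votes VALUES
        (1,  1, '2014-01-01 11:30:00'),
        (1,  1, '2014-01-01 11:45:00'),
        (2,  1, '2014-01-01 11:50:00'),
        (2, -1, '2014-01-01 11:55:00'),
        (3,  1, '2014-01-01 09:00:00');
""")

def trending(conn, now, window_hours=1):
    """Items ordered by number of upvotes received within the last window_hours."""
    return conn.execute("""
        SELECT items.name, v.item_count
        FROM items
        JOIN (SELECT item_id, COUNT(*) AS item_count
              FROM item_votes
              WHERE date >= datetime(?, ?)   -- start of the time window
                AND item_vote > 0            -- only count upvotes
              GROUP BY item_id) AS v ON v.item_id = items.item_id
        ORDER BY v.item_count DESC
    """, (now, "-%d hours" % window_hours)).fetchall()

print(trending(conn, "2014-01-01 12:00:00"))
```

Widening window_hours trades freshness for stability: with a longer window, older votes (such as the one for 'gamma' here) start to count again.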
You mention that you don't want items to repeat across pages, which means that you can't use a fresh random ordering on each request. Instead, you'll need to retrieve the items, order them, and persist them in either a server-wide or a session-specific cache.
A server-wide cache would need to be updated every once in a while, at a time interval you'll need to define. Users switching pages when this update occurs will see their items scrambled.
A session-specific cache would keep the items for as long as the user browses your website, which means that the items would become outdated if your users never leave. Once again, you'll need to determine a time interval to enforce updates.
I'm thinking that you need a versioned list. You could use the server-wide cache solution and give the list a version (date, integer, anything). You need to pass this version around when browsing the latest trends, so the user keeps viewing the same list. Clicking on the Trends menu link will send them to the browsing pages without version information, which should grab the latest list from your cache. You then keep that version for as long as the user is browsing these pages.
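A minimal in-process sketch of such a versioned list, in Python. The class name TrendingCache, the integer versions and the number of snapshots kept are all assumptions; a real deployment would more likely keep the snapshots in a shared cache.

```python
import random

class TrendingCache:
    """Keeps shuffled snapshots of the trending list, keyed by an integer version."""

    def __init__(self, keep_versions=3):
        self.keep_versions = keep_versions
        self.snapshots = {}          # version -> shuffled list of item ids
        self.latest = 0

    def refresh(self, trending_ids, seed=None):
        """Store a new shuffled snapshot and return its version number."""
        self.latest += 1
        items = list(trending_ids)
        random.Random(seed).shuffle(items)
        self.snapshots[self.latest] = items
        # Drop snapshots older than the last keep_versions ones.
        for old in [v for v in self.snapshots if v <= self.latest - self.keep_versions]:
            del self.snapshots[old]
        return self.latest

    def page(self, page_no, per_page=10, version=None):
        """Return (version, items); unknown or missing versions fall back to the latest."""
        if version not in self.snapshots:
            version = self.latest
        items = self.snapshots.get(version, [])
        start = page_no * per_page
        return version, items[start:start + per_page]

cache = TrendingCache()
v1 = cache.refresh(range(1, 21))
version, first_page = cache.page(0, per_page=5)              # entry via the Trends link
cache.refresh(range(1, 21))                                  # list is rebuilt meanwhile
_, second_page = cache.page(1, per_page=5, version=version)  # still the old snapshot
print(version, first_page, second_page)
```

The version travels with the pagination links, so a visitor keeps paging through one snapshot without repeats, while anyone arriving without a version gets the newest one.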
I can't get into sql statements, not because they are hard, but we don't know your database structure. Do you keep track of individual votes in a separate table? Are they just aggregated into a single column?
Maybe create a new column to record how many views it has? Then update it every time someone views it and order by threads with the largest number of views.
I'm building a website that constructs both site-wide and user-specific activity feeds. I hope that you can look at the structure below and share your insight as to whether my solution is doing the job. This is complicated by the fact that I have multiple types of users that right now are not stored in one master table. This is because the types of users are quite different, and constructing multiple different tables for user meta-data would, I think, be too much trouble. In addition, there are multiple types of content that can be acted upon, and multiple types of activity (following, submitting, commenting, etc.).
Constructing a site-wide activity feed is simple because everything is logged to the main feed table and I just build out a list. I have a master feed table in MySQL that simply logs:
type of activity;
type of target entity;
id of target entity;
type of source entity (i.e., user or organization);
id of source entity.
(This is just a big reference table that points the script generating the feed to the appropriate table(s) for each feed entry).
In generating the user-specific feed, I'm trying to figure out some way to join the relationship table with the feed table, and using that to parse results. I have a relationships table, comprised of 'following' relationships, that is similar to the feed table. It is simpler though b/c only one type of user is allowed to follow other content types/users.
user/source id;
type of target entity;
id of target entity.
Columns 2 & 3 in the feed and follow tables are the same, and I have been trying to use various JOIN methodologies to match them up, and then limit the results to the relationships in the follow table that the user has. This has not been very successful.
The basic query I am using is:
SELECT *
FROM (`feed` as fe) LEFT OUTER JOIN `follow` as fo
ON `fe`.`feed_target_type` = `fo`.`follow_e_type`
AND fo.follow_e_id = fe.feed_target_id
WHERE `fo`.`follow_u_id` = 1
OR (fe.feed_e_id = 1 AND fe.feed_e_type = 'user')
ORDER BY `fe`.`feed_timestamp` desc LIMIT 10
This query also attempts to grab any content that the user has created (which data is logged in the feed table) that the user is, in effect, following by default.
This query seems to work, but it took me some time to get to it and I am pretty sure I'm missing a more elegant solution. Any ideas?
The first site I made with an activity feed had a notifications table where activities were logged, and friends' actions were then pulled from that. However, a few months down the line this hit millions of records.
The solution I am programming now pulls the latest "friends" activities from separate tables and then orders them by date. The query is at home; I can post the example later if you're interested.