Intelligent MySQL GROUP BY for Activity Streams - php

I'm building an activity stream for our site, and have made some decent headway with something that works pretty well.
It's powered by two tables:
stream:
id - Unique Stream Item ID
user_id - ID of the user who created the stream item
object_type - Type of object (currently 'seller' or 'product')
object_id - Internal ID of the object (currently either the seller ID or the product ID)
action_name - The action taken against the object (currently either 'buy' or 'heart')
stream_date - Timestamp that the action was created.
hidden - Boolean indicating whether the user has chosen to hide the item.
follows:
id - Unique Follow ID
user_id - The ID of the user initiating the 'Follow' action.
following_user - The ID of the user being followed.
followed - Timestamp that the follow action was executed.
Currently I'm using the following query to pull content from the database:
Query:
SELECT stream.*,
COUNT(stream.id) AS rows_in_group,
GROUP_CONCAT(stream.id) AS in_collection
FROM stream
INNER JOIN follows ON stream.user_id = follows.following_user
WHERE follows.user_id = '1'
AND stream.hidden = '0'
GROUP BY stream.user_id,
stream.action_name,
stream.object_type,
date(stream.stream_date)
ORDER BY stream.stream_date DESC;
This query actually works pretty well, and with a little PHP to parse what MySQL returns we can create a nice activity stream, with actions of the same type by the same user grouped together when the time between the actions isn't too great (see the example below).
My question is, how do I make this smarter? Currently it groups along one axis, user activity: when there are multiple items by a particular user within a certain timeframe, MySQL knows to group them.
How can I make this even smarter and group along another axis as well, such as object_id, so that multiple actions against the same object in sequence are grouped, while keeping the existing logic for grouping actions/objects by user? And how can I implement this without duplicating data?
Example of multiple objects appearing in sequence:
I understand solutions to problems like this can get very complex very quickly, but I'm wondering if there's an elegant, and fairly simple, solution to this (hopefully) in MySQL.

Some observations about your desired results:
Some of the items are aggregated (Jack Sprat hearted seven sellers) and others are itemized (Lord Nelson chartered the Golden Hind). You probably need to have a UNION in your query that pulls together these two classes of items from two separate subqueries.
You use a fairly crude timestamp-nearness function to group your items: DATE(). You may want to use a more sophisticated and tweakable scheme, like this, maybe:
GROUP BY TIMESTAMPDIFF(HOUR, stream_date, NOW()) DIV hourchunk
This lets you group items into age chunks. For example, if you use 48 for hourchunk you'll group everything that's 0-48 hours old together. As you add traffic and actions to your system you may want to decrease the hourchunk value.
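For example, dropped into your original query with an hourchunk of 48, the grouping would look something like this (same columns as before; only the date axis changes):
SELECT stream.*,
COUNT(stream.id) AS rows_in_group,
GROUP_CONCAT(stream.id) AS in_collection
FROM stream
INNER JOIN follows ON stream.user_id = follows.following_user
WHERE follows.user_id = '1'
AND stream.hidden = '0'
GROUP BY stream.user_id,
stream.action_name,
stream.object_type,
TIMESTAMPDIFF(HOUR, stream.stream_date, NOW()) DIV 48
ORDER BY stream.stream_date DESC;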

My impression is you need to group by user, as you do, but also, after that grouping, by action.
It looks to me like you need a subquery like this:
SELECT t.*, -- or whatever columns
SUM(t.actions_in_user_group) AS total_rows_in_group,
GROUP_CONCAT(t.actions_in_user_collection) AS complete_collection
FROM
( SELECT stream.*, -- or whatever columns
COUNT(stream.id) AS actions_in_user_group,
GROUP_CONCAT(stream.id) AS actions_in_user_collection
FROM stream
INNER JOIN follows
ON stream.user_id = follows.following_user
WHERE follows.user_id = '1'
AND stream.hidden = '0'
GROUP BY stream.user_id,
DATE(stream.stream_date)
) AS t
GROUP BY t.object_id,
DATE(t.stream_date)
ORDER BY t.stream_date DESC;
Your initial query (now the inner one) groups by user, but then the user groups are regrouped by identical actions - that is, identical products bought or sales from one seller would be put together.

Over at Fashiolista we've open-sourced our approach to building feed systems.
https://github.com/tschellenbach/Feedly
It's currently the largest open-source library aimed at solving this problem (but it's written in Python).
The same team that built Feedly also offers a hosted API, which handles the complexity for you. Have a look at getstream.io; there are clients for PHP, Node, Ruby and Python.
https://github.com/tbarbugli/stream-php
It also offers support for custom defined aggregations, which you are looking for.
In addition, have a look at this High Scalability post where we explain some of the design decisions involved:
http://highscalability.com/blog/2013/10/28/design-decisions-for-scaling-your-high-traffic-feeds.html
This tutorial will help you set up a system like Pinterest's feed using Redis. It's quite easy to get started with.
To learn more about feed design I highly recommend reading some of the articles which we based Feedly on:
Yahoo Research Paper
Twitter 2013 Redis based, with fallback
Cassandra at Instagram
Etsy feed scaling
Facebook history
Django project, with good naming conventions. (But database only)
http://activitystrea.ms/specs/atom/1.0/ (actor, verb, object, target)
Quora post on best practices
Quora scaling a social network feed
Redis ruby example
FriendFeed approach
Thoonk setup
Twitter's Approach

We resolved a similar issue using a 'materialized view' approach - a dedicated table that gets updated on insert/update/delete events. All user activities are logged into this table, pre-prepared for simple selection and rendering.
The benefit is simple and fast selection; the drawback is slightly slower inserts/updates/deletes, since the log table has to be updated as well.
If the system is well designed, it is a winning solution.
This is quite easy to implement if you are using an ORM with post-insert/update/delete events (like Doctrine).
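As a rough illustration of the same idea using database triggers instead of ORM events (the activity_log table and its columns are assumptions, keyed to the stream table from the question):
CREATE TABLE activity_log (
    id          INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id     INT UNSIGNED NOT NULL,
    action_name VARCHAR(32)  NOT NULL,
    object_type VARCHAR(32)  NOT NULL,
    object_id   INT UNSIGNED NOT NULL,
    created_at  TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY idx_user_date (user_id, created_at)   -- supports "latest activity per user" selects
);

CREATE TRIGGER stream_after_insert
AFTER INSERT ON stream
FOR EACH ROW
INSERT INTO activity_log (user_id, action_name, object_type, object_id, created_at)
VALUES (NEW.user_id, NEW.action_name, NEW.object_type, NEW.object_id, NEW.stream_date);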

Related

PHP user personalization

There is a news site - about 50,000 news items in a MySQL db for now. I need to create a list of the most interesting and relevant news for each news page and remove the items the current user has already viewed (the actual personalization).
I already keep a list of viewed news in cookies. So all I need is the best architectural approach for filtering out viewed news.
I see only two options:
Keep an already-calculated full list of the most popular news (20-30k items) in memory and, for each customer request, remove the viewed ones.
Each time a user opens the page, build the list of popular items for them again.
With option 1 we can use caching with APC, Redis, etc., but we always have a big array of data copied into each request, which eats a lot of memory. With option 2 we would have to query the db each time, which would be slow and consume CPU and DB resources.
So is there any way I can avoid using so many resources and make it fast?
You can do something like:
SELECT ... article data ..
FROM Articles
LEFT JOIN ViewedArticles
ON ViewedArticles.articleId = Articles.articleId
AND ViewedArticles.userId = :id
WHERE ViewedArticles.articleId IS NULL
That should select only the articles that don't have a matching row in the ViewedArticles table for that user. (Note that the userId filter has to be part of the JOIN condition, not the WHERE clause, for the anti-join to work.)

What is the best way to handle large recursive queries in mysql?

Using PHP & Mysql-
I have a list of 120,000 employees. Each has a supervisor field with the supervisor employee number.
I am looking to build something that shows the employees in a tree like format. Given that if you click on anyone that you have an option to download all of the employees (with their info) that are under them.
So, two questions: should I write my script to handle the query (which I have, but it is SLOW), or should I create some sort of helper table/view? I am looking for the best practice here.
Also I am sure this has been done a million times. Is there a good class that handles organization hierarchy?
The standard way of doing this is to use one table to store all of the employees, with a primary key field for the employee_id and a supervisor_id field that is a 'self join' - meaning the value in this field points back to the employee_id of this employee's supervisor. As far as displaying the employee tree: for relatively small trees, the entire tree structure can be sent to the client's browser when the page is created, and tree nodes can be displayed from the stored data as they are clicked. For larger trees, it is better to fetch the data as needed, i.e. when the nodes are clicked. With 120,000 employees, you probably want the latter approach.
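If you are on MySQL 8+ you can also push the recursion into the database with a recursive CTE; a minimal sketch, assuming a table named employees with employee_id, supervisor_id and name columns:
WITH RECURSIVE subordinates AS (
    SELECT employee_id, supervisor_id, name
    FROM employees
    WHERE employee_id = ?            -- the employee that was clicked
  UNION ALL
    SELECT e.employee_id, e.supervisor_id, e.name
    FROM employees e
    JOIN subordinates s ON e.supervisor_id = s.employee_id
)
SELECT * FROM subordinates;
On older MySQL versions the usual answer is a helper structure such as a closure table or nested sets, maintained alongside the adjacency list.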

Per-user tag cloud in PHP & MySQL

I am looking to implement a per-user tag/interest cloud feature to a website I am making.
Each user has a profile page, and on that page a tag cloud of their preselected interests will be displayed. Each user can type their interests comma-delimited, with suggestions if a tag has been used before, or creation if it doesn't exist. Interests will be things such as music genres, hobbies, etc.
I'd like to also add basic features such as comparing users tag clouds (shared tags) for finding users that are 'compatible' according to their cloud.
I could use help with the logistics of the database to achieve this. I understand simple database design, but I can't wrap my head around design for the above.
At the moment the database is one single table, with ID/Username/Password/Verification (the last a key for email verification).
The only idea I have come up with for the tag cloud db is two tables - one called tags with a tagid and tagname field, and another users_tags with a tagid and userid field, and an entry for every single tag a user has. However I am unsure if this is best practice.
Hope someone can give me some direction on all this - thanks in advance.
having a table with userid and tagid only sounds like the best route for this.
to find "compatible" users as you mention you can just run a query similar to
SELECT
ut.userid, COUNT(*) ct
FROM
user_tags ut
WHERE
ut.tagid IN (SELECT uta.tagid FROM user_tags uta WHERE uta.userid=24 )
GROUP BY ut.userid ORDER BY ct DESC;
note that the above query will also return the original user, but it's much more efficient than removing him from the query.
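for reference, the underlying tables could be as simple as this (column types are assumptions; user_tags matches the query above, tags holds the names used for suggestions):
CREATE TABLE tags (
    tagid   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    tagname VARCHAR(64) NOT NULL UNIQUE
);

CREATE TABLE user_tags (
    userid INT UNSIGNED NOT NULL,
    tagid  INT UNSIGNED NOT NULL,
    PRIMARY KEY (userid, tagid),
    KEY idx_tagid (tagid)   -- helps the IN (...) lookup in the query above
);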

database table design for some unknown data

So, not having come from a database design background, I've been tasked with designing a web app where the end user will be entering products, and specs for their products. Normally I think I would just create rows for each of the types of spec that they would be entering. Instead, they have a variety of products that don't share the same spec types, so my question is, what's the most efficient and future-proof way to organize this data? I was leaning towards pushing a serialized object into a generic "data" row, but then are you able to do full-text searches on this data? Any other avenues to explore?
split products and specifications into two tables like this:
products
id name
specifications
id name value product_id
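in DDL terms, that could look something like this (column types are just assumptions):
CREATE TABLE products (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE specifications (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    value      VARCHAR(100) NOT NULL,
    product_id INT UNSIGNED NOT NULL,
    FOREIGN KEY (product_id) REFERENCES products(id)
);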
get all the specifications of a product when you know the product id:
SELECT name,
value
FROM specifications
WHERE product_id = ?;
add a specification to a product when you know the product id, the specification's name and the value of said specification:
INSERT INTO specifications(
name,
value,
product_id
) VALUES(
?,
?,
?
);
so before you can add specifications to a product, this product must exist. also, you can't reuse specifications for several products. that would require a somewhat more complex solution :) namely...
three tables this time:
products
id name
specifications
id name value
products_specifications
product_id specification_id
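again as a rough sketch (products stays the same as before; this time specifications has no product_id, and the unique (name, value) pair backs the existence check described below):
CREATE TABLE specifications (
    id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(100) NOT NULL,
    value VARCHAR(100) NOT NULL,
    UNIQUE KEY uq_name_value (name, value)
);

CREATE TABLE products_specifications (
    product_id       INT UNSIGNED NOT NULL,
    specification_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (product_id, specification_id),
    FOREIGN KEY (product_id)       REFERENCES products(id),
    FOREIGN KEY (specification_id) REFERENCES specifications(id)
);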
get all the specifications of a product when you know the product id:
SELECT specifications.name,
specifications.value
FROM specifications
JOIN products_specifications
ON products_specifications.specification_id = specifications.id
WHERE products_specifications.product_id = ?;
now, adding a specification becomes a little bit more tricky, cause you have to check if that specification already exists. so this will be a little heavier than the first way of doing this, since there are more queries on the db, and there's more logic in the application.
first, find the id of the specification:
SELECT id
FROM specifications
WHERE name = ?
AND value = ?;
if no id is returned, this means that said specification doesn't exist, so it must be created:
INSERT INTO specifications(
name,
value
) VALUES(
?,
?
);
next, either use the id from the select query, or get the last insert id to find the id of the newly created specification. use that id together with the id of the product that's getting the new specification, and link the two together:
INSERT INTO products_specifications(
product_id,
specification_id
) VALUES(
?,
?
);
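as an aside, the last insert id mentioned above can be fetched in plain SQL too (your driver probably exposes it directly, e.g. PDO::lastInsertId() or mysqli_insert_id()):
SELECT LAST_INSERT_ID();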
however, this means that you have to create one row for every specific specification. e.g. if you have size for shoes, there would be one row for every known shoe size
specifications
id name value
1 size 7
2 size 7½
3 size 8
and so on. i think this should be enough though.
You could take a look at using an EAV model.
I've never built a products database, but I can point you to a data model for that. It's one of over 200 models available for the taking, at Database Answers. Here is the model
If you don't like this one, you can find 15 different data models for Product oriented databases. Click on "Data Models" to get a list and scroll down to "Products".
You should pick up some good design ideas there.
This is a pretty common problem - and there are different solutions for different scenarios.
If the different types of product and their attributes are fixed and known at development time, you could look at the description in Craig Larman's book (http://www.amazon.com/Applying-UML-Patterns-Introduction-Object-Oriented/dp/0131489062/ref=sr_1_1/002-2801511-2159202?ie=UTF8&s=books&qid=1194351090&sr=1-1) - there's a section on object-relational mapping and how to handle inheritance.
This boils down to "put all the possible columns into one table", "create one table for each sub class" or "put all base class items into a common table, and put sub class data into their own tables".
This is by far the most natural way of working with a relational database - it allows you to create reports, use off-the-shelf tools for object relational mapping if that takes your fancy, and you can use standard concepts such as "not null", indexing etc.
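As a rough sketch of the last of those options (a base table plus one table per sub-class; all the names here are invented for illustration):
CREATE TABLE products (
    id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(100)  NOT NULL,
    price DECIMAL(10,2) NOT NULL
);

CREATE TABLE shoes (
    product_id INT UNSIGNED PRIMARY KEY,
    size       VARCHAR(8) NOT NULL,
    FOREIGN KEY (product_id) REFERENCES products(id)
);

CREATE TABLE books (
    product_id INT UNSIGNED PRIMARY KEY,
    isbn       CHAR(13) NOT NULL,
    FOREIGN KEY (product_id) REFERENCES products(id)
);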
Of course, if you don't know the data attributes at development time, you have to create a flexible database schema.
I've seen 3 general approaches.
The first is the one described by davogotland. I built a solution on similar lines for an ecommerce store; it worked great, and allowed us to be very flexible about the product database. It performed very well, even with half a million products.
Major drawbacks were creating retrieval queries - e.g. "find all products with a price under x, in category y, whose manufacturer is z". It was also tricky bringing in new developers - they had a fairly steep learning curve.
It also forced us to push a lot of relational concepts into the application layer. For instance, it was hard to create foreign keys to other tables (e.g. "manufacturer") and enforce them using standard SQL functionality.
The second approach I've seen is the one you mention - storing the variable data in some kind of serialized format. This is a pain when querying, and suffers from the same drawbacks with the relational model. Overall, I'd only want to use serialization for data you don't have to be able to query or reason about.
The final solution I've seen is to accept that the addition of new product types will always require some level of development effort - you have to build the UI, if nothing else. I've seen applications which use a scaffolding style approach to automatically generate the underlying database structures when a new product type is created.
This is a fairly major undertaking - only really suitable for major projects, though the use of ORM tools often helps.

Optimal MySQL design for user-specific activity feeds

I'm building a website that constructs both site-wide and user-specific activity feeds. I hope that you can see the structure below and share your insight as to whether my solution is doing the job. This is complicated by the fact that I have multiple types of users that, right now, are not stored in one master table; the types of users are quite different, and constructing multiple different tables for user metadata would, I think, be too much trouble. In addition, there are multiple types of content that can be acted upon, and multiple types of activity (following, submitting, commenting, etc.).
Constructing a site-wide activity feed is simple because everything is logged to the main feed table and I just build out a list. I have a master feed table in MySQL that simply logs:
type of activity;
type of target entity;
id of target entity;
type of source entity (i.e., user or organization);
id of source entity.
(This is just a big reference table that points the script generating the feed to the appropriate table(s) for each feed entry).
In generating the user-specific feed, I'm trying to figure out some way to join the relationship table with the feed table and use that to parse results. I have a relationships table, made up of 'following' relationships, that is similar to the feed table. It is simpler, though, because only one type of user is allowed to follow other content types/users.
user/source id;
type of target entity;
id of target entity.
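For reference, the two tables look roughly like this (reconstructed from the column names in the query further down; the activity-type column name and all types are assumptions):
CREATE TABLE feed (
    feed_id          INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    feed_action_type VARCHAR(32)  NOT NULL,  -- type of activity (assumed name)
    feed_target_type VARCHAR(32)  NOT NULL,  -- type of target entity
    feed_target_id   INT UNSIGNED NOT NULL,  -- id of target entity
    feed_e_type      VARCHAR(32)  NOT NULL,  -- type of source entity ('user', 'organization', ...)
    feed_e_id        INT UNSIGNED NOT NULL,  -- id of source entity
    feed_timestamp   TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY idx_source (feed_e_type, feed_e_id),
    KEY idx_target (feed_target_type, feed_target_id)
);

CREATE TABLE follow (
    follow_u_id   INT UNSIGNED NOT NULL,  -- user/source id
    follow_e_type VARCHAR(32)  NOT NULL,  -- type of target entity
    follow_e_id   INT UNSIGNED NOT NULL,  -- id of target entity
    PRIMARY KEY (follow_u_id, follow_e_type, follow_e_id)
);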
Columns 2 & 3 in the feed and follow tables are the same, and I have been trying various JOIN approaches to match them up and then limit the results by whatever relationships the user has in the follow table. This has not been very successful.
The basic query I am using is:
SELECT *
FROM (`feed` as fe) LEFT OUTER JOIN `follow` as fo
ON `fe`.`feed_target_type` = `fo`.`follow_e_type`
AND fo.follow_e_id = fe.feed_target_id
WHERE `fo`.`follow_u_id` = 1
OR (fe.feed_e_id = 1 AND fe.feed_e_type = 'user')
ORDER BY `fe`.`feed_timestamp` desc LIMIT 10
This query also attempts to grab any content that the user has created (which is also logged in the feed table) and is therefore, in effect, following by default.
The query seems to work, but it took me some time to get to it, and I'm pretty sure I'm missing a more elegant solution. Any ideas?
The first site I made with an activity feed had a notifications table where activities were logged, and friends' actions were pulled from that. However, a few months down the line this hit millions of records.
The solution I am programming now pulls the latest friends' activities from separate tables and then orders them by date. The query is at home; I can post the example later if interested.
