Constructing a simple recommendation engine - php

Both users and pages on my website have IDs. When a user goes on a certain page, their userID and the pageID will be written to a MySQL table as such:
userID | pageID
3 | 1
2 | 1
3 | 2
etc...
In this table, called user_pages, I end up with a bunch of raw data that can be turned into a recommendation engine. What I mean by a recommendation engine: I want to analyze the historical data and be able to predict, based on a set of viewed pages, the next pages that a user may like. Let's say there is a strong tendency to visit the page with ID 3 after going to the pages with IDs 4, 9, and 15; if a user goes to pages 4, 9, and 15, then the engine should recommend page 3.
I think I have all of the data-input code necessary for this. How would I write something that analyzes the data for correlations between pages (e.g. almost everyone who visited page 5 also visited page 1), and then uses that to predict the pages a future user may end up liking?

Recommendation systems are a big part of A.I. research. I believe you are interested in a collection of algorithms called collaborative filtering. Since the Netflix Prize in 2007, this field has developed greatly. I would recommend going here and having a read. It explains the basic concepts of recommender systems in a succinct and clear way, and it also provides a link to Java source code for an approach to the Netflix project, MemReader. You could examine this source code and extract the basic algorithms for building a recommendation engine.
Alternatively, if you want a more mathematical explanation of the algorithms employed, go here.
It shouldn't take too long to implement at all.
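Before reaching for MemReader, a bare-bones item-to-item co-occurrence query over the user_pages table from the question already gets you a long way: for pages the user hasn't seen, count how many "similar" users (users who share at least one page with them) visited each one. A rough sketch, assuming a PDO connection in $pdo; the table and column names come from the question, everything else is made up:

<?php
// Simple co-occurrence recommender over user_pages(userID, pageID).
// Not the MemReader approach -- just the most basic form of
// collaborative filtering, done directly in SQL.
function recommendPages(PDO $pdo, int $userId, int $limit = 5): array
{
    $sql = "
        SELECT other.pageID, COUNT(DISTINCT other.userID) AS score
        FROM user_pages AS mine
        JOIN user_pages AS peers
             ON peers.pageID = mine.pageID AND peers.userID <> mine.userID
        JOIN user_pages AS other
             ON other.userID = peers.userID
        WHERE mine.userID = :uid
          AND other.pageID NOT IN (
              SELECT pageID FROM user_pages WHERE userID = :uid2
          )
        GROUP BY other.pageID
        ORDER BY score DESC
        LIMIT " . (int) $limit; // cast instead of binding to avoid LIMIT quoting issues

    $stmt = $pdo->prepare($sql);
    $stmt->execute([':uid' => $userId, ':uid2' => $userId]);

    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // [pageID => score]
}

This won't scale like a proper collaborative-filtering model, but it is enough to validate the idea on your existing data.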

This post posed a similar question: Advanced MySQL: Find correlations between poll responses
I think you would be able to generate a similar result if your primary data table had one additional field in it, specifically the ID of the page the user last visited or visited immediately afterwards.
Something like this:
+------+----------+--------------+----------+
| id | page_id | next_page_id | user_id |
+------+----------+--------------+----------+
| 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 2 | 3 |
| 4 | 1 | 2 | 4 |
| 5 | 2 | 3 | 1 |
| 6 | 2 | 3 | 2 |
| 7 | 2 | 3 | 3 |
| 8 | 2 | 4 | 4 |
| 9 | 3 | 5 | 1 |
+------+----------+--------------+----------+
Then you should be able to use a modified version of one of the SQL queries suggested there to generate a list of high-correlation recommendations between the current page and the next page.
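For completeness, a rough sketch of what that lookup could be, assuming a PDO connection in $pdo and the table sketched above (the table name user_page_transitions is made up, since the answer doesn't name it):

<?php
// For the page the visitor is on now, rank the pages most often visited next.
function nextPageRecommendations(PDO $pdo, int $pageId, int $limit = 3): array
{
    $sql = "
        SELECT next_page_id, COUNT(*) AS times_followed
        FROM user_page_transitions
        WHERE page_id = :pid
        GROUP BY next_page_id
        ORDER BY times_followed DESC
        LIMIT " . (int) $limit;

    $stmt = $pdo->prepare($sql);
    $stmt->execute([':pid' => $pageId]);

    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // [next_page_id => count]
}

The linked question's queries normalize these counts against the total number of transitions from the current page, which is worth doing once you want a real correlation score rather than raw counts.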

Related

PHP, MySQL - efficiently determining access permissions based on distant parent

I'm struggling to come up with an efficient solution to determine user access to a specified folder, using PHP (specifically Laravel) and MySQL. I want to create a system that has Google Drive-esque functionality...
For example, Joe Bloggs creates many folders within folders, e.g. Level 1 > Level 2 > Level 3 > Level 4 > Level 5. Within any of these folders there can be any number of additional sub-files and folders.
This would be the resulting database structure -
Table name: users
| id | name |
| -- | ---------- |
| 1 | Joe Bloggs |
| 2 | John Snow |
Table name: folders
| id | parent_id | author_id | name |
| -- | --------- | --------- | --------- |
| 1 | NULL | 1 | Level 1 |
| 2 | 1 | 1 | Level 2 |
| 3 | 2 | 1 | Level 3 |
| 4 | 3 | 1 | Level 4 |
| 5 | 4 | 1 | Level 5 |
| 6 | 2 | 1 | Level 3.1 |
| 7 | 2 | 1 | Level 3.2 |
Table name: folders_users
| id | folder_id | user_id | owner | read | write |
| -- | --------- | ------- | ----- | ---- | ----- |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 3 | 2 | 0 | 1 | 1 |
So based on record 1 in folders_users, Joe Bloggs should have owner, read & write permissions for all folders underneath Level 1. Joe Bloggs then gives John Snow read & write access to Level 3, which in turn should give John Snow read & write access to Level 3, Level 3.1, Level 3.2 and anything created under any of these in future.
Additionally, it should be possible for a user to star a folder. I'd imagine this can simply be achieved with a separate table that is queried separately -
Table name: starred_folders
| id | folder_id | user_id |
| -- | --------- | ------- |
| 1 | 7 | 2 |
The current solution I have is to create a record in the folders_users table for every folder in the chain that a user has permission to access. I feel like this is overcomplicating things and creating an excessive number of records. This is especially true when it comes to sharing a folder, as I have to recreate the entire tree for that one user. Or imagine a user revoking write access from one of the shared users: the entire tree (potentially hundreds of records) has to be updated for a single flag.
What would be the best way to generate these trees, and to quickly and efficiently determine a user's access level for any given folder? I suspect the only way to do this is recursion, but I'm concerned about its efficiency. Or should I perhaps be using something entirely different from MySQL for this? I've had a brief look into graph databases, but I can't see them being a way forward for us, as we don't have the infrastructure to support them.
Thanks,
Chris.
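The recursion the question mentions doesn't have to happen in PHP: MySQL 8+ can walk the parent chain itself with a recursive CTE. A rough sketch against the folders and folders_users tables above ($pdo, the helper name, and the "nearest ancestor wins" rule are all assumptions):

<?php
// Walk from the folder up to the root and take the permission row from the
// closest ancestor (or the folder itself) that has one for this user.
function getPermissions(PDO $pdo, int $folderId, int $userId): ?array
{
    $sql = "
        WITH RECURSIVE chain AS (
            SELECT id, parent_id, 0 AS depth
            FROM folders WHERE id = :fid
            UNION ALL
            SELECT f.id, f.parent_id, c.depth + 1
            FROM folders f
            JOIN chain c ON f.id = c.parent_id
        )
        SELECT fu.owner, fu.`read`, fu.`write`
        FROM chain
        JOIN folders_users fu ON fu.folder_id = chain.id
        WHERE fu.user_id = :uid
        ORDER BY chain.depth
        LIMIT 1";

    $stmt = $pdo->prepare($sql);
    $stmt->execute([':fid' => $folderId, ':uid' => $userId]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    return $row ?: null; // null = no explicit or inherited permission
}

With this, sharing or revoking only ever touches the one folders_users row where the permission is actually granted.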
I'm writing this as a solution, not necessarily the most efficient one.
You can add a column to your folders table (let's call it access) and store in it the IDs of the people who have access to that folder and its children. I assume that when you want to show information about a folder you must fetch its parents' information from the table as well, so you won't need to add new queries for that.
If you only have a single kind of access you can simply store records in this column like user1,user2,..., and if not you can serialize an array like this:
[
    "read"  => [user1, user2, ...],
    "write" => [user2]
]
Of course you could add a column for each kind of access instead, but if you have many kinds of access the serialized column might be a solution too.
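A minimal sketch of what the permission check would then look like, assuming the access column holds JSON (json_encode() rather than serialize()) shaped like the array above, and that $folderRow is the folder row you already fetched:

<?php
// Returns true if the user appears in the requested permission list of the
// access column, e.g. {"read":[1,2],"write":[2]}.
function canAccess(array $folderRow, int $userId, string $permission): bool
{
    $access = json_decode($folderRow['access'] ?? '{}', true) ?: [];

    return in_array($userId, $access[$permission] ?? [], true);
}

Granting or revoking access then means rewriting that one column, though you still have to decide how (or whether) the value cascades to child folders.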

Best practices linking data in MySQL tables

For an online game, I have a table that contains all the plays, and some information on those plays, like the difficulty setting etc.:
+---------+---------+------------+------------+
| play-id | user-id | difficulty | timestamp |
+---------+---------+------------+------------+
| 1 | abc | easy | 1335939007 |
| 2 | def | medium | 1354833214 |
| 3 | abc | easy | 1354833875 |
| 4 | abc | medium | 1354833937 |
+---------+---------+------------+------------+
In another table, after the game has finished, I store some stats related to that specific game, like the score etc:
+---------+----------------+--------+
| play-id | type | value |
+---------+----------------+--------+
| 1 | score | 201487 |
| 1 | enemies_killed | 17 |
| 1 | gems_found | 4 |
| 2 | score | 110248 |
| 2 | enemies_killed | 12 |
| 2 | gems_found | 7 |
+---------+----------------+--------+
Now, I want to make a distribution graph so users can see in what score percentile they are. So I basically want the boundaries of the percentiles.
If it were at the level of individual scores, I could rank the scores and start from there, but it needs to be at the level of each user's highscore. So mathematically, I would need to sort all the users' highscores and then find the percentiles.
I'm in doubt about the best approach here.
On one hand, constructing an array that holds all the highscores seems like a performance-heavy thing to do, because it needs to cycle through both tables and match the scores to the users (the first table holds around 10M rows).
On the other hand, making a separate table with each user's highscore would make things easier, but it feels like it goes against the rules of avoiding data redundancy.
Another approach that came to mind was doing the performance-heavy work once a week and keeping the result in a separate table, or doing the performance-heavy work on only a (statistically relevant) subset of the data.
Or maybe I'm completely missing the point here and should use a completely different database setup?
What's the best practice here?
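One way to frame the "heavy work once a week" option: a small scheduled job rebuilds a per-user highscore table, and the percentile graph then reads only that. A rough sketch, assuming a PDO connection in $pdo; the table names plays, play_stats and user_highscores are made up, since the question doesn't name them:

<?php
// Rebuild a per-user highscore table from the plays and stats tables.
function rebuildHighscores(PDO $pdo): void
{
    $pdo->exec("
        CREATE TABLE IF NOT EXISTS user_highscores (
            user_id   VARCHAR(32) PRIMARY KEY,
            highscore INT NOT NULL
        )
    ");

    $pdo->beginTransaction();
    $pdo->exec("DELETE FROM user_highscores");
    $pdo->exec("
        INSERT INTO user_highscores (user_id, highscore)
        SELECT p.`user-id`, MAX(s.value)
        FROM plays p
        JOIN play_stats s ON s.`play-id` = p.`play-id`
        WHERE s.type = 'score'
        GROUP BY p.`user-id`
    ");
    $pdo->commit();
}

Percentile boundaries then only have to be computed over one row per user instead of the 10M play rows.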

Testing thousands of similar fields for differences

I have created a privilege system for my application which allows/disallows access to specific pages based on user input.
The table looks something like this:
| page_id | client_id | sys_group_no | name  | friendly_name | viewable |
| ------- | --------- | ------------ | ----- | ------------- | -------- |
| 1       | 4         | 1            | home  | Home          | true     |
| 2       | 4         | 1            | admin | Admin Home    | false    |
So if the user in client_id 4 is of group 1, they are NOT allowed to view 'Admin Home'. It isn't actually quite this simple, but for the sake of this question we can pretend it is.
The problem is that as maintenance goes on this table gets out of date quickly, and when you have a few thousand rows, constantly checking the table against the actual page names (using scandir() and array_diff()) will be expensive. Is there a different paradigm for checking this kind of integrity other than direct comparison? For instance, would hashing my $page_array and comparing the hashes be a better approach?
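The hash idea can at least work as a cheap change detector: store one hash of the directory listing and only run the full scandir()/array_diff() reconciliation against the table when that hash changes. A rough sketch (the directory path and hash-file location are assumptions):

<?php
// Returns true only when the set of page files has changed since the last check.
function pagesHaveChanged(string $pagesDir, string $hashFile): bool
{
    $pages = array_diff(scandir($pagesDir), ['.', '..']);
    sort($pages); // make the hash independent of directory order

    $currentHash  = md5(implode("\n", $pages));
    $previousHash = is_file($hashFile) ? trim(file_get_contents($hashFile)) : '';

    if ($currentHash === $previousHash) {
        return false; // nothing changed, skip the expensive table comparison
    }

    file_put_contents($hashFile, $currentHash);
    return true;
}

Note this only tells you that something changed, not what; you still need the direct comparison to work out which rows to add or remove.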

Like and Unlike System in PHP

I am developing a community site for high school students, and I am trying to implement a like and unlike system using PHP. Here's what I have got:
A table named likes in MySQL with 3 columns, namely app_id VARCHAR(32), user VARCHAR(12), dormant VARCHAR(6), and
UNIQUE(app_id,user)
When a person likes a page on my site, a row is either inserted or updated in the likes table with dormant = false.
When a person unlikes a page, the existing row is updated with dormant = true. This is an alternative to deleting the row, since deleting is a bit too intensive for rapid successions of likes and unlikes.
I want to know whether I should delete the row instead of updating it when someone unlikes the page.
Don't delete the row. Every piece of data you can gather is a valuable data point.
I would say you should also create a new record for every unlike.
This data will be useful to you in the future for figuring out user behaviour.
Some people might like something now, then unlike it, then like it again, and so on.
Maybe in the future you would like to see why so many people who liked an item suddenly unliked it and then liked it again.
So I say gather as much data as you can.
Sounds like premature optimization. Don't do that.
Design your application as you want to use it /as it should work. When it gets busy, find out the bottlenecks and fix them.
If you want to design your application for scalability to the millions, consider using a different database engine / programming platform altogether.
It looks like you aren't recording the number of users who liked or unliked each page. In this case, LIKES should be a many-to-many table, and there should be another table called APPS (or any name you wish) to store the pages:
**USER**
+---------+-------+-----+
| user_id | name | ....|
+---------+-------+-----+
| 1 | ... | ... |
+---------+-------+-----+
| 2 | ... | ... |
+---------+-------+-----+
**APPS**
+---------+-------+-----+
| app_id | name | ....|
+---------+-------+-----+
| 1 | ... | ... |
+---------+-------+-----+
| 2 | ... | ... |
+---------+-------+-----+
**LIKES**
+---------+-------+----------+----------+
| like_id |user_id| app_id | is_liked |
+---------+-------+----------+----------+
| 1 | 1 | 2 | 1 |
+---------+-------+----------+----------+
| 2 | 1 | 3 | 0 |
+---------+-------+----------+----------+
Here you toggle is_liked = 1 when the user clicks like and is_liked = 0 when they click unlike.
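With a UNIQUE(user_id, app_id) key on LIKES (the same constraint the question already uses), the like/unlike click becomes a single upsert. A rough sketch, assuming a PDO connection in $pdo:

<?php
// Insert the like the first time, flip is_liked on every later click.
function setLike(PDO $pdo, int $userId, int $appId, bool $liked): void
{
    $sql = "
        INSERT INTO likes (user_id, app_id, is_liked)
        VALUES (:uid, :aid, :liked)
        ON DUPLICATE KEY UPDATE is_liked = VALUES(is_liked)";

    $pdo->prepare($sql)->execute([
        ':uid'   => $userId,
        ':aid'   => $appId,
        ':liked' => $liked ? 1 : 0,
    ]);
}

If you follow the first answer and also want the full history, you would additionally append each click to a separate log table rather than only overwriting is_liked.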

handle language identification

In a multilanguage website, should I reference the language with numbers or keywords?
For example, let's say an english person selects a service from a list of services, the list of services will be in english, while if a spanish person selects from a list of services, the list will be in spanish.
The list of services is selected from a table in a database, each service has a unique number to identify it, and something to identify in what language the service is written.
What I'm asking is, which is better. To use a number to identify the language, or to use a language code?
Example:
hypothetical table of services:
id | service | service_id | lang
------------------------------------
0 | cooking | 1 | en
1 | driving | 2 | en
2 | singing | 3 | en
3 | running | 4 | en
4 | cocinar | 1 | es
5 | conducir | 2 | es
6 | cantar | 3 | es
7 | correr | 4 | es
VS
id | service | service_id | lang
------------------------------------
0 | cooking | 1 | 1
1 | driving | 2 | 1
2 | singing | 3 | 1
3 | running | 4 | 1
4 | cocinar | 1 | 2
5 | conducir | 2 | 2
6 | cantar | 3 | 2
7 | correr | 4 | 2
Where I give a numerical id to every language
I can see that the language-code approach makes the database more human-readable, but why should that really matter if the server handles it all anyway? Numbers are easier for the server, but then I would have to assign a number to every language.
So which approach do you think is better and why?
I would almost always normalize such things, but this may be a rare exception, for the following reasons:
An nchar(2) column would only occupy 4 bytes, which is the same as an int column. Therefore, performance should not be impacted, especially if you set the collation to a binary (ordinal) one.
The two-character language codes are defined in international standards which are extremely unlikely to ever change, so massive updates should not be an issue.
So the arguments for normalization do not really apply in this case.
There is an ISO-standardized set of language codes. I'd just go with using those, as in example 1. You should probably also have a secondary table that maps the short two- or three-letter codes to the long spelled-out versions.
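A rough sketch of that lookup table and how the service query would use it, assuming the services table from example 1 and a PDO connection in $pdo (the table and column names are assumptions):

<?php
// One small lookup table maps ISO 639-1 codes to display names.
$pdo->exec("
    CREATE TABLE IF NOT EXISTS languages (
        code CHAR(2) PRIMARY KEY,   -- ISO 639-1: 'en', 'es', ...
        name VARCHAR(50) NOT NULL   -- 'English', 'Spanish', ...
    )
");

// Fetch the service list for the visitor's language using the code directly.
$stmt = $pdo->prepare("
    SELECT s.service_id, s.service, l.name AS language
    FROM services s
    JOIN languages l ON l.code = s.lang
    WHERE s.lang = :lang
    ORDER BY s.service
");
$stmt->execute([':lang' => 'es']);
$services = $stmt->fetchAll(PDO::FETCH_ASSOC);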
