Personalized Search Results based on History


What are some of the techniques for providing personalized search results to a logged in user? One way I can think of will be by analyzing the user's browsing history.
Tracking: A log of a user's activities, like pages viewed and 'like' buttons clicked, can be used to bias search results.
Question 1: How do you track a user's browsing history? A table with columns user_id, number_of_hits, page id? If I have 1000 daily visitors, each browsing 10 pages on average, won't there be a large number of records to select each time a personalized recommendation is required? The table will grow at 300K rows a month! It will take longer and longer to select the rows each time a search is made. I guess the table for recording 'likes' will use the same table design.
Question 2: How do you bias the results of a search? For example, if a user has been searching for Apple products, how does the search engine realise that the user likes Apple products and subsequently bias the search towards them? Tag the pages and accumulate a record of tags on the pages visited?

You probably don't want to use a relational database for this type of thing; take a look at MongoDB or Cassandra. That's because you basically want to add a new column to the user's history, so a column-oriented database makes more sense.

300k rows per month is not really that much; in fact, that's almost nothing. It doesn't matter whether you use a relational or non-relational database for this.
A straightforward approach is the following:
Put entries into the table/collection like this:
timestamp, user, action, misc information
(Make sure that you put in as much information as possible, so that you don't need to join this data warehousing table with any other table.)
Partition by timestamp (one partition per month).
Never query this table directly; instead, have, say, daily report jobs run over all the data, compute the necessary statistics, and write them to a summary table.
Review your report queries and add appropriate partition-local indexes.
Only query the summary table from your web frontend.
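The log-then-summarize approach above can be sketched as follows. This is a minimal illustration, not the answerer's implementation: it uses Python's built-in sqlite3 as a stand-in for MySQL (so there is no partitioning), and the table names `activity_log` and `user_tag_summary`, the `tag` column, and the helper functions are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.executescript("""
    -- Raw, append-only log: timestamp, user, action, misc information.
    CREATE TABLE activity_log (
        ts      TEXT    NOT NULL,   -- ISO-8601 timestamp
        user_id INTEGER NOT NULL,
        action  TEXT    NOT NULL,   -- e.g. 'view' or 'like'
        page_id INTEGER NOT NULL,
        tag     TEXT    NOT NULL    -- denormalized so reports need no join
    );
    -- Small table that the web frontend reads instead of the raw log.
    CREATE TABLE user_tag_summary (
        user_id INTEGER NOT NULL,
        tag     TEXT    NOT NULL,
        hits    INTEGER NOT NULL,
        PRIMARY KEY (user_id, tag)
    );
""")

def log_action(ts, user_id, action, page_id, tag):
    """Append one event to the raw log (cheap insert, no reads)."""
    conn.execute(
        "INSERT INTO activity_log VALUES (?, ?, ?, ?, ?)",
        (ts, user_id, action, page_id, tag),
    )

def rebuild_summary():
    """Batch job (e.g. nightly): recompute per-user tag counts from the log."""
    with conn:  # one transaction, so readers never see a half-built summary
        conn.execute("DELETE FROM user_tag_summary")
        conn.execute("""
            INSERT INTO user_tag_summary (user_id, tag, hits)
            SELECT user_id, tag, COUNT(*) FROM activity_log
            GROUP BY user_id, tag
        """)

def top_tags(user_id, limit=3):
    """What the frontend calls: reads only the summary table."""
    cur = conn.execute(
        "SELECT tag, hits FROM user_tag_summary "
        "WHERE user_id = ? ORDER BY hits DESC, tag LIMIT ?",
        (user_id, limit),
    )
    return cur.fetchall()

log_action("2024-01-01T10:00:00", 1, "view", 10, "apple")
log_action("2024-01-01T10:05:00", 1, "like", 11, "apple")
log_action("2024-01-01T10:09:00", 1, "view", 12, "android")
log_action("2024-01-01T11:00:00", 2, "view", 13, "linux")
rebuild_summary()
user_one_tags = top_tags(1)
```

The search code only ever touches the small summary table, so the size of the raw log affects the batch job, not the page load.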

If you stored only the last X results as opposed to everything, it would probably be doable. It might slow things down, but it'd work. Any time you're writing more data and reading more data, there's going to be an impact. Proper DBA methods such as indexing and query optimization can help, but no matter what you use there's going to be an effect.
I'd personally look at storing just a default view for the user in a DB and using the session to keep track of the rest. Sure, when you log in there'd be no history. But you could take advantage of that to highlight a set of special pages that you think are important or relevant to steer the user to. A highlight system of sorts. Faster, easier, and more user-friendly.
As for bias, you could store a set of keywords for each record and sort the result array accordingly. It wouldn't be terribly difficult using PHP.
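One possible reading of the keyword idea, sketched in Python rather than PHP: count the keywords on pages the user has visited, then add a small bonus to each search result for every keyword it shares with that profile. The scoring formula, the `weight` parameter, and the sample data are assumptions for illustration only.

```python
from collections import Counter

def build_profile(visited_pages):
    """Count how often each keyword appears on pages the user viewed."""
    profile = Counter()
    for keywords in visited_pages:
        profile.update(keywords)
    return profile

def personalize(results, profile, weight=0.1):
    """Re-rank (title, base_score, keywords) tuples, biased toward the profile.

    base_score is whatever relevance the normal search produced; the bonus is
    the sum of the user's counts for the result's keywords (hypothetical formula).
    """
    def score(result):
        _title, base_score, keywords = result
        bonus = sum(profile[k] for k in keywords)  # Counter gives 0 if unseen
        return base_score + weight * bonus
    return sorted(results, key=score, reverse=True)

history = [["apple", "phone"], ["apple", "laptop"], ["apple", "tablet"]]
profile = build_profile(history)
results = [
    ("Generic Android phone", 1.0, ["android", "phone"]),
    ("Apple iPhone case", 0.9, ["apple", "phone", "accessory"]),
    ("Garden hose", 0.5, ["garden"]),
]
ranked = [title for title, _, _ in personalize(results, profile)]
```

Here the Apple result overtakes the slightly more relevant Android one because the user's history is dominated by the "apple" keyword; the `weight` value controls how strongly history can override the base ranking.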

I use MySQL with over 2M records (page views) a month, and we run reports on that table daily and often.
The table is partitioned by month (as already suggested) and indexed where needed.
I also clear data older than 6 months from the table by creating a new table called "page_view_YYMM" (YY=year, MM=month) and using UNIONs when necessary.
For the second question, the way I would approach it is by creating a table listing your products, which is simply:
url, description
The description will be the tag-stripped content of your page or item (depending on how you want to influence the search). Then add a full-text index on description and run the search on that table, adding any extra terms you have collected while the user was browsing your site that you think are relevant (for example, category name or brand).

Related

MySQL/PHP - Web Based Game -User specific inventory table or 1 giant table? Another option?

So I'm making a web-based game similar to Torn City, where there could potentially be millions of users.
My issue is regarding user inventories. I started out creating dynamic tables based on each user's id, e.g. table name = [UserID]_Inventory.
From what I've found out, this can open a load of hacker-friendly holes for SQL injection and such because of the dynamic creation.
My only other option seems to be creating 1 giant table holding every item that every player has and all the varied details of each item. This seems like it would take longer and longer to load once user count increases and the user's inventory will likely be accessed often.
Is there another option?
My only idea so far is to create some kind of temporary inventory that grabs only the active player inventories. That helps the database search time issues but still brings me back to creating dynamic tables.
At this stage I don't really need coding help, rather I need database structure help.
Code is appreciated, though.
Cheers.

Use the big table. Index it optimally. It should not give you trouble until you get well past a billion rows.
Here's a trick to optimizing use of such a table. Instead of
PRIMARY KEY(id),
INDEX(user_id)
have
PRIMARY KEY(user_id, id),
INDEX(id)
Since the PK is "clustered" with the data and the data is ordered according to the PK, this puts all of one user's rows next to each other. In huge tables, this cuts back significantly on I/O and hence improves overall speed. It also cuts back on pressure on the buffer_pool. (I assume you are using InnoDB?)
The INDEX(id) is sufficient for AUTO_INCREMENT.
There could be more suggestions, but I need more details. Please provide SHOW CREATE TABLE (as it stands now) and the main SELECTs. I am likely to suggest more changes to the indexes, datatypes, and query formulations.
(Dynamic tables are a mistake, and your troubles in that direction have only begun.)
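A rough sketch of the single-table layout, with loud caveats: it uses Python's sqlite3 instead of MySQL/InnoDB, and a SQLite `WITHOUT ROWID` table as the closest analogue of an InnoDB clustered primary key. The table name `inventory` and its columns are assumptions. SQLite has no AUTO_INCREMENT for such tables, so ids are supplied by hand here, whereas in MySQL the INDEX(id) would support AUTO_INCREMENT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for InnoDB
conn.execute("""
    CREATE TABLE inventory (
        user_id  INTEGER NOT NULL,
        id       INTEGER NOT NULL,   -- item row id (AUTO_INCREMENT in MySQL)
        item     TEXT    NOT NULL,
        quantity INTEGER NOT NULL,
        PRIMARY KEY (user_id, id)    -- rows are stored in (user_id, id) order
    ) WITHOUT ROWID
""")
# Secondary index on id alone, mirroring INDEX(id) from the answer.
conn.execute("CREATE INDEX inventory_id ON inventory (id)")

rows = [
    (2, 1, "sword", 1),
    (1, 2, "potion", 5),
    (2, 3, "shield", 1),
    (1, 4, "map", 1),
]
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?, ?)", rows)

def inventory_for(user_id):
    """Fetch one player's items; the user id is bound as a parameter,
    never spliced into the SQL text or into a table name."""
    cur = conn.execute(
        "SELECT id, item, quantity FROM inventory "
        "WHERE user_id = ? ORDER BY id",
        (user_id,),
    )
    return cur.fetchall()

user_one = inventory_for(1)
```

Because every player shares one table and the user id is only ever a bound parameter, the injection risk that came with building table names from user ids does not arise.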

How to design an efficient Like system?

I'm trying to create a Like/Unlike system akin to Facebook's for an existing comments section of a website, and I need help in designing the system.
Currently, every product on the website has a comments section, and members can post and like comments. I need to know how many comments each member has posted and how many likes each of those comments has received. Of course, I also need to know who liked which comments (partly so that I can prevent a user from liking a comment more than once) and for analytical purposes.
The naive way of adding a Like system to the current comments module is to create a new table in the database that has foreign keys to the CommentID and UserID. Then, for every "like" given to a comment by a user, I would insert a row into this new table with the target comment ID and user ID.
While this might work, the massive amount of comments and users is going to cause this table to grow quickly and retrieving records from and doing counts on this huge table will become slow and inefficient. I can index either one of the columns, but I don't know how effective it would be. The website has over a million comments.
I'm using PHP and MySQL. For a system like this with a huge database, how should I design a Like system so that it is more optimised and stable?

For scalability, do not include the count column in the same table with other things. This is a rare case where "vertical partitioning" is beneficial. Why? The LIKEs/UNLIKEs will come fast and furious. If the code to do the increment/decrement hits a table used for other things (such as the text of the Comment), there will be an unacceptable amount of contention between the two.
This tip is the first of many steps toward being able to scale to Facebook levels. The other tips will come, not from a free forum, but from the team of smart engineers you will have to hire to get to that level. (Hints: Sharding, Buffering, Showing Estimates, etc.)

Your main concern will be a lot of counts, so the easy thing to do is to keep a separate count in your comments table.
Then you can create a TRIGGER that increments/decrements the count based on a like/unlike.
That way you only use the big table to figure out if a user already voted.
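A minimal sketch of the trigger idea, again using Python's sqlite3 as a stand-in for MySQL. It keeps the counter in its own table, as the earlier answer about vertical partitioning suggests, rather than in the comments table. All table, trigger, and function names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.executescript("""
    -- The "big table": who liked what. The primary key blocks double likes.
    CREATE TABLE comment_likes (
        comment_id INTEGER NOT NULL,
        user_id    INTEGER NOT NULL,
        PRIMARY KEY (comment_id, user_id)
    );
    -- Counter kept apart from the comment text to limit contention.
    CREATE TABLE comment_like_counts (
        comment_id INTEGER PRIMARY KEY,
        likes      INTEGER NOT NULL DEFAULT 0
    );
    CREATE TRIGGER like_added AFTER INSERT ON comment_likes
    BEGIN
        UPDATE comment_like_counts SET likes = likes + 1
        WHERE comment_id = NEW.comment_id;
    END;
    CREATE TRIGGER like_removed AFTER DELETE ON comment_likes
    BEGIN
        UPDATE comment_like_counts SET likes = likes - 1
        WHERE comment_id = OLD.comment_id;
    END;
""")

def register_comment(comment_id):
    """Create the counter row when a comment is posted."""
    conn.execute(
        "INSERT INTO comment_like_counts (comment_id) VALUES (?)", (comment_id,)
    )

def like(comment_id, user_id):
    """Return True if the like was recorded, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO comment_likes VALUES (?, ?)", (comment_id, user_id)
    )
    return cur.rowcount == 1

def unlike(comment_id, user_id):
    """Return True if a like was removed."""
    cur = conn.execute(
        "DELETE FROM comment_likes WHERE comment_id = ? AND user_id = ?",
        (comment_id, user_id),
    )
    return cur.rowcount == 1

def like_count(comment_id):
    """Read the precomputed count; no COUNT(*) over the big table."""
    cur = conn.execute(
        "SELECT likes FROM comment_like_counts WHERE comment_id = ?", (comment_id,)
    )
    return cur.fetchone()[0]

register_comment(7)
first = like(7, 100)       # new like
duplicate = like(7, 100)   # same user again: ignored, trigger does not fire
like(7, 101)
unlike(7, 100)
final_count = like_count(7)
```

The big table is consulted only to answer "has this user already liked this comment?", while displaying a count is a single-row lookup.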

MySQL big chunks of data - fast access without recalculation

I would like to store big chunks of data in RAM using Sphinx, Solr, Elasticsearch, or whatever else suits such needs. (The problem is I don't know which tool suits best; I have only heard that people use them.)
I build reports about sales. I get nearly 800-900k lines of sales per month, and the user wants to scroll the page and see them smoothly.
I can't give them all the data at once because the browser will just hang.
At the same time, I can't use LIMIT in MySQL because the queries require merging data across tables.
Recalculating it on the fly is not an option.
Creating a temp table in MySQL is a bad idea because there are a bunch of criteria and more than one user can view the data.
Temporary_table
id product_id product_count order_id order_status.... .....user_id
Having such a table, I would store all results for the current user in the table and hold them there as long as the user doesn't make a new query. But I don't like this solution. There must be something better.
I feel like it's over my head.
Any ideas?

"Drill down", don't "Scroll down" !
When I need to present a million lines of info, I start by thinking of ways to slice and dice it -- subtotals by hour, by region, by product type, by whatever. Each slice might be a hundred lines -- quite manageable, especially with summary tables.
In that hundred lines would be clickable items that take the user to a more detailed page about one of the items. That would also have a hundred lines (or 10 or 1000 -- whatever makes sense; but >1000 is usually unreasonable). That page may have further links to drill further down. And/or links to move laterally.
With suitable slicing and dicing, you are very unlikely to need to send him a million lines; only a few hundred.
With suitable Summary Tables, the "tmp tables", etc, go away.

How do I make a huge table of data load faster on a webpage assuming DB is already indexed?

Say I have a webmail service and I have a table with fields username, IP, location, login_time. Let us say my service is hugely popular, with hundreds of users logging in every minute. At the end of the day, if I want to display this table for today's list of users, there are, say, half a million rows. Now, even after indexing the DB table, it's taking a huge amount of time to load this page. How do I make it faster (or give a feel of a speedy load)? Maybe I can do pagination and load, say, 50 rows at a time as users change pages. What if I do not have that option?

The best approach would be to use a jQuery "Load more" plugin and fetch only a limited amount of data at once. Users can click the "Load more" button and see the whole table if they want.

Use backend pagination, as you said.
Imagine that you have an Excel file containing that many rows. How fast do you think it will open? And, unlike your browser, Excel is specialized software for working with rows of data.
Put it in another perspective: is it helpful in some way for the user to see half a million rows at once? I doubt it. The user can get exactly the same functionality from your software if you offer a paged results list with a search form.
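Backend pagination can be sketched like this. The example uses keyset ("seek") pagination rather than LIMIT/OFFSET, which is a choice made for this sketch and not something the answer specifies: each request remembers the last row it saw and asks for the next 50 after it. Python's sqlite3 stands in for MySQL, and the `logins` table layout, the added `id` column, and the sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.execute("""
    CREATE TABLE logins (
        id         INTEGER PRIMARY KEY,
        username   TEXT NOT NULL,
        ip         TEXT NOT NULL,
        location   TEXT NOT NULL,
        login_time TEXT NOT NULL
    )
""")
# The index matches the sort order, so each page is a short index range scan.
conn.execute("CREATE INDEX logins_time ON logins (login_time, id)")
conn.executemany(
    "INSERT INTO logins (username, ip, location, login_time) VALUES (?, ?, ?, ?)",
    [
        (f"user{i}", "192.0.2.1", "nowhere",
         f"2024-01-01T00:{i // 60:02d}:{i % 60:02d}")
        for i in range(120)
    ],
)

def fetch_page(after=None, page_size=50):
    """Return (rows, cursor). Pass the returned cursor back in as `after`
    to get the following page; cursor is None when there are no rows."""
    sql = "SELECT id, username, login_time FROM logins "
    params = []
    if after is not None:
        last_time, last_id = after
        # Strictly after the last row seen, in (login_time, id) order.
        sql += "WHERE login_time > ? OR (login_time = ? AND id > ?) "
        params += [last_time, last_time, last_id]
    sql += "ORDER BY login_time, id LIMIT ?"
    params.append(page_size)
    rows = conn.execute(sql, params).fetchall()
    cursor = (rows[-1][2], rows[-1][0]) if rows else None
    return rows, cursor

page1, cursor1 = fetch_page()
page2, cursor2 = fetch_page(cursor1)
page3, cursor3 = fetch_page(cursor2)
page4, cursor4 = fetch_page(cursor3)
```

A "Load more" button or a page link only needs to send the cursor back to the server, so the browser never holds more than one page of rows per request.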

I think table partitioning based on the login_time column is the solution. Partitioning lets you store parts of your table in their own logical space. Your query will only have to look at a subset of the data to get a result, and not the whole table, which can make the query several times faster depending on the number of rows. More about partitioning at the link below:
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Once you have partitioned your table, you can use a pagination mechanism since showing all 0.5 million rows to the user would not serve any purpose.

MySQL managing catalogue views

A friend of mine has a catalogue that currently holds about 500 rows (500 items). We are looking at ways to provide reports on the catalogue, including the number of times an item was viewed and the dates when it was viewed.
His site is averaging around 25,000 page impressions per month, and if we assumed for a minute that half of these were catalogue items then we'd assume roughly 12,000 catalogue items viewed each month.
My question is about the best way to manage item views in the database.
The first option is to insert the catalogue ID into a table and then increment the number of times it's viewed. The advantage of this is its compact nature. There will only ever be as many rows in the table as there are catalogue items.
`catalogue_id`, `views`
The disadvantage is that no date information is being held, short of maintaining the last time an item was viewed.
The second option is to insert a new row each time an item is viewed.
`catalogue_id`, `timestamp`
If we continue with the assumed figure of 12,000 item views, that means adding 12,000 rows to the table each month, or 144,000 rows each year. The advantage of this is that we know the number of times the item is viewed, and also the dates when it's viewed.
The disadvantage is the size of the table. Is a table with 144,000 rows becoming too large for MySQL?
Interested to hear any thoughts or suggestions on how to achieve this.
Thanks.

As you have mentioned, the first is a lot more compact but limited. However, look at option 2 in more detail: you may wish to store more than just the view count, for instance entry/exit page, host IP, etc. This information may be invaluable for stats and tracking. The other question is: are these 25,000 impressions unique? If not, you could track by username, IP, or some other unique identifier, which could let you use fewer rows. The answer to your question depends on how much detail you wish to store and how important the data is.
Update:
True, limiting the repeats on a given item within a time interval would be a good solution. Also, knowing whether someone visited the same item could be useful for suggested-item prediction widgets similar to what Amazon does. Knowing that someone visited an item many times also says to me that this is a good item to promote to them or others in a mail-out, newsletter, or popular product page. Tracking unique views will give a more honest view count, which you can choose to display or store. The issue of limiting the value of repeat visitors mainly comes into play depending on what information you display. It is all about framing the information in the way that best suits you.

Your problem statement: We want to be able to track number of views for a particular catalogue item.
Let's review your options.
First Option:
In this option you will be storing the catalogue_id and an integer value of the number of views of the item.
Advantages:
Well, since you really have a one-to-one relationship, the new table is going to be small. If you have 500 items you will have 500 rows. If you choose this route, I would suggest not creating a new table but adding another column to the catalogue table with the number of views.
Disadvantages:
The problem here is that since you are going to be updating this table relatively frequently, it is going to be a very busy little table. For example, say 10 users are viewing the same item. These 10 updates will have to run one after the other. Assuming you are using InnoDB, the first view action would come in, lock the row, update the counter, and release the lock. The other updates would queue behind it. So while the data in the table is small, it could potentially become a bottleneck later on, especially if you start scaling the system.
You are losing granular data, i.e. you are not keeping track of the raw data. For example, let's say the website starts growing and you have an interested investor who wants to see a breakdown of the views per week over the last 6 months. If you use this option, you won't have the data to provide to the investor. Essentially, you are keeping only a summary.
Second Option:
In this option you would create a logging table with at least the following minimal fields: catalogue_id and timestamp. You could expand this to add a username, IP address, or some other information to make it even more granular.
Advantages:
You are keeping granular data. This will allow you to summarise the data in a variety of ways. You could, for example, add an IP address column, store the visitor's IP, and then do a monthly report showing products viewed by country (you could do an IP address lookup to get an idea of which country they were from). Another example would be to see which products were viewed the most over the last quarter. This data is pretty essential in helping you make decisions on how to grow your business. If you want to know what is working and what is not as far as products are concerned, this detail is absolutely critical.
Your new table will be a logging table. It will only see insert operations. Inserts can pretty much happen in parallel. If you go with this option, it will probably scale better as the site grows compared to a constantly updated table.
Disadvantages:
This table will be bigger, probably the biggest table in the database. However, this is not a problem. I regularly deal with tables of 500 000 000+ rows. Some of my tables are over 750GB by themselves and I can still run reporting on them. You just need to understand your queries and how to optimise them. This is really not a problem, as MySQL was designed to handle millions of rows with ease. Just keep in mind that you could archive some information into other tables. Say you archive the data every 3 years: you could move data older than 3 years into another table. You don't have to keep all the data there. Your estimate of 144 000 rows a year means you could probably safely keep 15+ years' worth without ever worrying about the performance of the table.
My suggestion to you is to seriously consider the second option. If you decide to go this route, update your question with the proposed table structures and let us have a look at them. Don't be scared of big data; rather, be scared of BAD design, which is much more difficult to deal with.
However as always the choice is yours.
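The second option can be sketched as an insert-only log plus a report query that produces the per-week breakdown mentioned above. As before, this uses Python's sqlite3 as a stand-in for MySQL; the table name `catalogue_views`, the `viewed_at` column, and the week label format are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.execute("""
    CREATE TABLE catalogue_views (
        catalogue_id INTEGER NOT NULL,
        viewed_at    TEXT    NOT NULL   -- 'YYYY-MM-DD HH:MM:SS'
    )
""")
# Index chosen for per-item reports over a time range.
conn.execute(
    "CREATE INDEX catalogue_views_item_time "
    "ON catalogue_views (catalogue_id, viewed_at)"
)

def record_view(catalogue_id, viewed_at):
    """One insert per page view; existing rows are never updated."""
    conn.execute(
        "INSERT INTO catalogue_views VALUES (?, ?)", (catalogue_id, viewed_at)
    )

def total_views(catalogue_id):
    """Everything option 1 would have stored, derived from the raw log."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM catalogue_views WHERE catalogue_id = ?",
        (catalogue_id,),
    )
    return cur.fetchone()[0]

def views_per_week(catalogue_id):
    """Breakdown by year and week number (weeks start on Monday)."""
    cur = conn.execute(
        "SELECT strftime('%Y-%W', viewed_at) AS week, COUNT(*) "
        "FROM catalogue_views WHERE catalogue_id = ? "
        "GROUP BY week ORDER BY week",
        (catalogue_id,),
    )
    return cur.fetchall()

record_view(1, "2024-01-01 09:00:00")
record_view(1, "2024-01-03 14:30:00")
record_view(1, "2024-01-08 11:15:00")
record_view(2, "2024-01-02 08:00:00")
weekly = views_per_week(1)
total = total_views(1)
```

The total count from the first option is still available as a simple COUNT(*), while the dated rows make reports like the weekly breakdown possible after the fact.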
