Implementing a simple review database/application schema - PHP

I'm new to web development and database design, and I'm somewhat stumped as to how best to build a simple review system for items.
In the current database schema I have a table, call it tbl_item, that has columns for the different properties of items. I want users to be able to review items, and each review in tbl_reviews should be associated with a particular item.
Of course I have a foreign key set up referencing an id column in tbl_item, but I don't know where to go from here. Basically my question is: what should calculate the review average?
Should the application make a SQL call every time a review score is requested for a particular item, so that the DB has to search through all the tbl_reviews rows to find those with a particular item_id?
(That seems wrong.) Should the DB get involved instead, with some kind of calculated field, view, or stored procedure that does the same?
Or should I have a new column in tbl_item that holds the average score and is updated whenever a review for that item is created, updated, or deleted?
If it matters, I'm using Yii (PHP) and MySQL.

Basically you're asking about efficiency and math.
Here's what I would do:
Your DB is relational. Good, you got that. Each review has a numerical value? Like 1 - 10?
Say it does for this example.
I would say that when a review comes in, the review itself is stored in the DB, and an entry is also queued in an action table: something that has the item id and an action type, in this case 'review'.
You then have a cron running in the background every minute or so, checking that action queue. When there is a new review (or set of reviews), it runs an algorithm for each applicable item that collects all of the available review data and returns an educated number based on the standard deviation of the collective data.
This way the math is not run in real time by the user or when a review is sent. For all we know you have tons of items and tons of reviews, so real time would be bad if your intelligence script is heavy.
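To sketch the mechanics only (ignoring the anti-spam weighting described below, and assuming a 1-10 score column on tbl_reviews plus an avg_score column added to tbl_item; both names are illustrative):

-- Queue table: one row per pending recalculation.
CREATE TABLE action_queue (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    item_id INT NOT NULL,
    action  VARCHAR(20) NOT NULL,   -- e.g. 'review'
    created DATETIME NOT NULL
);

-- Cron job, every minute or so: recompute the stored average
-- for every item that has queued actions, then clear the queue.
UPDATE tbl_item AS i
JOIN (
    SELECT item_id, AVG(score) AS avg_score
    FROM tbl_reviews
    GROUP BY item_id
) AS r ON r.item_id = i.id
SET i.avg_score = r.avg_score
WHERE i.id IN (SELECT item_id FROM action_queue);

DELETE FROM action_queue;

In a busy system you would delete only the queue rows you actually processed, but this shows the shape of it.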
As for standard deviation, I check a large variety of things for anti-spam. I store all user data (IP, datetime, and anything else I can) to make sure it's not just one guy logging in with different accounts and reviewing his own things with a 10 rating each time. Can't fall for that.
Plus, if you get 100 legitimate-looking reviews of 10 and a single review with a score of 1, you can discount it as a hater and just ignore it in the results.
You have to understand your request is enormous, so a full implementation is out of the question here.
What I just explained was about 4 months of work for a huge client, with a serious anti-spam calculator.
good luck though

Related

MySQL Database Structure for Time Based Chart and Report Generation

My application will allow users to like or dislike a product and leave short feedback. I have to build functionality that shows a graph and produces a report for different time frames: probably yearly, monthly, weekly, and daily.
I have to show, via a chart, how many users liked or disliked the product over a particular time period, and generate the corresponding report. So my application should be able to produce a daily graph for August 2018, or a monthly graph for the year 2018, for a particular product. The graph should reveal how many users liked or disliked the product on a daily basis if it is a daily graph, and similarly for the weekly, monthly, or yearly time frames.
I am not sure what the database structure should be for this type of application. Here is what I have thought of so far.
products: id, name, descp...etc // products table
users: id, name, email ...etc // users table
user_reactions: id, user_id(foreign key), product_id(foreign key), action(liked or disliked, tinyint), feedback // user_reactions table
data: id, product_id(foreign key), date(Y-m-d), total_like, total_dislike. // data table, will be used to make graph and report
What I am thinking is that I will run a cron job at 23:59:59 every day to count the likes and dislikes of each product, add the results to the last table (the data table mentioned above), and then use this data table to build the graph and report. I am not sure if this database structure is correct or whether it has some unseen problem (maybe in the future?).
Note: My Application will be in PHP and MySQL
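For illustration, here is a sketch of what that nightly cron might run against the tables above. Note that it assumes user_reactions also stores a created_at timestamp; without one, per-day counting isn't possible at all:

INSERT INTO data (product_id, date, total_like, total_dislike)
SELECT product_id,
       CURDATE(),
       SUM(action = 1),          -- liked
       SUM(action = 0)           -- disliked
FROM user_reactions
WHERE DATE(created_at) = CURDATE()
GROUP BY product_id;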
Well, there is no right answer to your question, because any answer to it would be opinion-based. You and I will get enough downvotes for sure. But still, hear me out, my friend, because I was in your state once.
There is a quote by the famous professor Donald Knuth:
Premature optimization is the root of all evil
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
The idea is that you have to start building. As your application progresses, you will face troubles: you will face problems with your database, your system might not scale, or it might not handle a million requests. But until you hit such a problem, you don't have to worry about it.
I am not saying that you should go and build a system blindly, with an infinite loop or a table join that can cause deadlocks. I hope you get my point.
Build a system with your knowledge and understanding, because there is no one straight way to a problem. Build a feature -> you hit an issue -> tweak your app -> rinse and repeat. One day your own experience will show you the right path.
From your description I can't figure out exactly how it will come out, but I am sure it will suffice for your initial days. As you progress, you might find it hard to add new features or additional constraints, but that is a matter for another day; wait for it and ask another question then.
I hope I have answered your question.

What do you think of this approach for logging changes in MySQL to get some kind of audit trail?

I've been reading through several topics now and did some research about logging changes to a MySQL table. First let me explain my situation:
I have a ticket system with a table: 'ticket'
As of now I've created triggers that enter a duplicate entry into my table 'ticket_history', which has "action", "user", and "timestamp" as additional columns. After some weeks of testing I'm not happy with that build, since every change creates a full copy of the row in the history table. I understand that disk space is cheap and I shouldn't worry about it, but retrieving some kind of log or nice-looking history for the user is painful, at least for me. Also, with the trigger I've written, I get a new row in the history even if there is no change, but that is just a design flaw of my trigger!
Here my trigger:
DELIMITER //

CREATE TRIGGER ticket_before_update   -- a trigger name, added so this runs as-is
BEFORE UPDATE ON ticket
FOR EACH ROW
BEGIN
    INSERT INTO ticket_history
    SET idticket        = NEW.idticket,
        time_arrival    = NEW.time_arrival,
        idticket_status = NEW.idticket_status,
        tmp_user        = NEW.tmp_user,
        action          = 'update',
        timestamp       = NOW();
END//

DELIMITER ;
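I suspect that flaw could be fixed with a slight change: compare the OLD and NEW values inside the trigger so an unchanged update inserts nothing (<=> is MySQL's NULL-safe equality). A sketch (untested), replacing the bare INSERT above:

IF NOT (    NEW.time_arrival    <=> OLD.time_arrival
        AND NEW.idticket_status <=> OLD.idticket_status
        AND NEW.tmp_user        <=> OLD.tmp_user) THEN
    INSERT INTO ticket_history
    SET idticket        = NEW.idticket,
        time_arrival    = NEW.time_arrival,
        idticket_status = NEW.idticket_status,
        tmp_user        = NEW.tmp_user,
        action          = 'update',
        timestamp       = NOW();
END IF;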
My new approach in order to avoid having triggers
After spending some time on this topic I came up with an approach I would like to discuss and implement. But first I have some questions about it:
My idea is to create a new table:
id sql_fwd sql_bwd keys values user timestamp
-------------------------------------------------------------------------
1 UPDATE... UPDATE... status 5 14 12345678
2 UPDATE... UPDATE... status 4 7 12345678
The flow would look like this in my mind:
At first I would select something or more from the DB:
SELECT keys FROM ticket;
Then I display the data in 2 input fields:
<input name="key" value="value" />
<input type="hidden" name="key_orig" value="value" /> <!-- keeps the original value for the change check; it needs its own name so both values arrive in the POST -->
Hit submit and give it to my function:
I would start with a SELECT again: SELECT * FROM ticket;
and make sure that the hidden input field == the value from the latest SELECT. If so, I can proceed, knowing that no other user has changed anything in the meantime. If the hidden field does not match, I bring the user back to the form and display a message.
Next I would build the SQL Queries for the action and also the query to undo those changes.
// Note: in real code the id should be bound as a parameter or escaped;
// concatenating request input like this is open to SQL injection.
$sql_fwd = "UPDATE ticket
            SET idticket_status = 1
            WHERE idticket = '" . $c_get['id'] . "';";
$sql_bwd = "UPDATE ticket
            SET idticket_status = 0
            WHERE idticket = '" . $c_get['id'] . "';";
Having that, I run the UPDATE on ticket and insert a new entry into my new table for logging.
With that I can try to catch possible overwrites while two users are editing the same ticket at the same time, and for my history I can simply look up the keys and values and generate some kind of list. Also, having the SQL_BWD, I can simply undo changes.
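The logging insert might look something like this (a sketch; I haven't named the new table above, so call it ticket_log, and `keys`/`values` need backticks because VALUES is a reserved word):

INSERT INTO ticket_log
    (sql_fwd, sql_bwd, `keys`, `values`, `user`, `timestamp`)
VALUES
    (?, ?, 'idticket_status', '1', ?, NOW());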
My questions to that would be:
Would it be noticeable to do an additional SELECT every time I want to update something?
Do I lose some benefits I would have with triggers?
Are there any big disadvantages?
Are there any functions on my MySQL server or in PHP which already do something like this?
Or is there perhaps a much easier way to do something like this?
Would maybe a slight change to the trigger I have now already be enough?
If I understand this right, MySQL only performs an update if the value has changed, but the trigger is executed anyway, right?
If I'm able to change the trigger, can I still somehow prevent the overwriting of data while 2 users try to edit the same ticket at the same time on the MySQL server, or would I do this in PHP anyway?
Thank you for the help already.
Another approach...
When a worker starts to make a change...
Store the time and worker_id in the row.
Proceed to do the tasks.
When the worker finishes, fetch the last worker_id that touched the record; if it is himself, all is well. Clear the time and worker_id.
If, on the other hand, another worker slips in, then some resolution is needed. This gets into your concept that some things can proceed in parallel.
Comments could be added to a different table, hence no conflict.
Changing the priority may not be an issue by itself.
Other things may be messier.
It may be better to have another table for the time & worker_ids (& ticket_id). This would allow for flagging that multiple workers are currently touching a single record.
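A sketch of that flow, with illustrative column names on the ticket row itself:

-- worker 7 starts editing ticket 42:
UPDATE ticket SET editing_worker_id = 7, editing_since = NOW()
WHERE idticket = 42;

-- when worker 7 finishes, fetch who touched it last:
SELECT editing_worker_id FROM ticket WHERE idticket = 42;

-- if it is still 7, save the changes and clear the claim:
UPDATE ticket SET editing_worker_id = NULL, editing_since = NULL
WHERE idticket = 42 AND editing_worker_id = 7;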
As for History versus Current, I (usually) like to have 2 tables:
History -- blow-by-blow list of what changes were made, when, and by whom. This table is only INSERTed into.
Current -- the current status of the ticket. This table is mostly UPDATEd.
Also, I prefer to write the History directly from the "database layer" of the app, not via Triggers. This gives me much better control over the details of what goes into each table and when. Plus the 'transactions' are clear. This gives me confidence that I am keeping the two tables in sync:
BEGIN; INSERT INTO History...; UPDATE Current...; COMMIT;
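Spelled out with illustrative column names, that might be:

START TRANSACTION;

INSERT INTO History (ticket_id, changed_by, changed_at, old_status, new_status)
VALUES (42, 7, NOW(), 3, 4);

UPDATE Current
SET status = 4, last_changed_by = 7, last_changed_at = NOW()
WHERE ticket_id = 42;

COMMIT;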
I've answered a similar question before. You'll see some good alternatives in that question.
In your case, I think you're merging several concerns - one is "storing an audit trail", and the other is "managing the case where many clients may want to update a single row".
Firstly, I don't like triggers. They are a side effect of some other action, and for non-trivial cases, they make debugging much harder. A poorly designed trigger or audit table can really slow down your application, and you have to make sure that your trigger logic is coordinated between lots of developers. I realize this is personal preference and bias.
Secondly, in my experience, the requirement is rarely "show the status of this one table over time" - it's nearly always "allow me to see what happened to the system over time", and if that requirement exists at all, it's usually fairly high priority. With a ticketing system, for instance, you probably want the name and email address of the users who created the ticket and changed its status; the name of the category/classification; perhaps the name of the project, etc. All of those attributes are likely to be foreign keys onto other tables. And when something does happen that requires an audit, the requirement is likely "let me see immediately", not "get a database developer to spend hours trying to piece together the picture from 8 different history tables". In a ticketing system, it's likely a requirement for the ticket detail screen to show this.
If all that is true, then I don't think history tables populated by triggers are a good idea - you have to build all the business logic into two sets of code, one to show the "regular" application, and one to show the "audit trail".
Instead, you might want to build "time" into your data model (that was the point of my answer to the other question).
Since then, a new style of data architecture has come along, known as CQRS. This requires a very different way of looking at application design, but it is explicitly designed for reactive applications; these offer much nicer ways of dealing with the "what happens if someone edits the record while the current user is completing the form" question. Stack Overflow is an example - we can see, whilst typing our comments or answers, whether the question was updated, or other answers or comments are posted. There's a reactive library for PHP.
I understand that disk space is cheap and I shouldn't worry about it, but retrieving some kind of log or nice-looking history for the user is painful, at least for me.
A large history table is not necessarily a problem. Huge tables only use disk space, which is cheap. They slow things down only when you run queries on them. Fortunately, the history is not something you'd use all the time; most likely it is only used to solve problems or for auditing.
It is useful to partition the history table, for example by month or week. This allows you to simply drop very old records, and more importantly, since the history of the previous months has already been backed up, your daily backup schedule only needs to cover the current month. This means a huge history table will not slow down your backups.
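A sketch of monthly partitioning (this assumes the table's primary key includes the timestamp column, which MySQL requires for unique keys on partitioned tables):

ALTER TABLE ticket_history
PARTITION BY RANGE (TO_DAYS(`timestamp`)) (
    PARTITION p2018_07 VALUES LESS THAN (TO_DAYS('2018-08-01')),
    PARTITION p2018_08 VALUES LESS THAN (TO_DAYS('2018-09-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Dropping an old month is nearly instant, unlike a huge DELETE:
ALTER TABLE ticket_history DROP PARTITION p2018_07;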
With that I can try to catch possible overwrites while two users are editing the same ticket at the same time
There is a simple solution:
Add a column "version_number".
When you select with intent to modify, you grab this version_number.
Then, when the user submits new data, you do:
UPDATE ticket
SET idticket_status = 1,                  -- ...and any other modified columns
    version_number = version_number + 1
WHERE idticket = 42                       -- the row being edited
  AND version_number = 7;                 -- the value you got when you SELECTed
If someone came in between and modified it, they will have incremented the version number, so the WHERE clause will not find the row, and the query will return a row count of 0. Thus you know it was modified. You can then SELECT the row, compare the values, and offer conflict-resolution options to the user.
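For example, right after the UPDATE:

SELECT ROW_COUNT();   -- 1 = your update won; 0 = someone else changed the row first

Most drivers expose the same number as the affected-rows count, so you typically check that instead of issuing a separate SELECT.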
You can also add columns like who modified it last, and when, and present this information to the user.
If you want the user who opens the modification page to lock out other users, that can be done too, but it needs a timeout (in case they leave the window open and go home, for example), so it is more complex.
Now, about history:
You don't want to have, say, one large TEXT column called "comments" where everyone enters stuff, because it will need to be copied into the history every time someone adds even a single letter.
It is much better to view it like a forum: each ticket is like a topic, which can have a string of comments (like posts), stored in another table, with the info about who wrote it, when, etc. You can also historize that.
The drawback of using a trigger is that the trigger does not know about the user who is logged in to the application, only the MySQL user. So if you want to record who did what, you will have to add a column with the user_id, as I proposed above. You could also use Rick James' solution; both would work.
Remember, though, that MySQL triggers don't fire on foreign key cascade deletes, so if the row is deleted that way, the trigger won't run. In that case, doing it in the application is better.

How to design an efficient Like system?

I'm trying to create a Like/Unlike system akin to Facebook's for an existing comments section of a website, and I need help in designing the system.
Currently, every product on the website has a comments section, and members can post and like comments. I need to know how many comments each member has posted and how many likes each of his comments has received. Of course, I also need to know who liked which comments (partly so that I can prevent a user from liking a comment more than once) for analytical purposes.
The naive way of adding a Like system to the current comments module is to create a new table in the database with foreign keys to the CommentID and UserID. Then, for every "like" given to a comment by a user, I would insert a row into this new table with the target comment ID and user ID.
While this might work, the massive number of comments and users is going to make this table grow quickly, and retrieving records from and doing counts on this huge table will become slow and inefficient. I can index either one of the columns, but I don't know how effective that would be. The website has over a million comments.
I'm using PHP and MySQL. For a system like this with a huge database, how should I design a Like system so that it is more optimised and stable?
For scalability, do not include the count column in the same table with other things. This is a rare case where "vertical partitioning" is beneficial. Why? The LIKEs/UNLIKEs will come fast and furious. If the code to do the increment/decrement hits a table used for other things (such as the text of the Comment), there will be an unacceptable amount of contention between the two.
This tip is the first of many steps toward being able to scale to Facebook levels. The other tips will come, not from a free forum, but from the team of smart engineers you will have to hire to get to that level. (Hints: Sharding, Buffering, Showing Estimates, etc.)
Your main concern will be a lot of counts, so the easy thing to do is to keep a separate count in your comments table.
Then you can create a TRIGGER that increments/decrements the count based on a like/unlike.
That way you only use the big table to figure out if a user already voted.
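A sketch of the whole thing (comment_likes and comment_stats are illustrative names; keeping the counter out of the main comments table follows the first answer's advice):

-- One row per (comment, user); the primary key prevents double likes.
CREATE TABLE comment_likes (
    comment_id INT NOT NULL,
    user_id    INT NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (comment_id, user_id)
);

-- Counter kept away from the comment text to reduce contention.
CREATE TABLE comment_stats (
    comment_id INT PRIMARY KEY,
    like_count INT NOT NULL DEFAULT 0
);

CREATE TRIGGER trg_like_ins AFTER INSERT ON comment_likes
FOR EACH ROW
    UPDATE comment_stats SET like_count = like_count + 1
    WHERE comment_id = NEW.comment_id;

CREATE TRIGGER trg_like_del AFTER DELETE ON comment_likes
FOR EACH ROW
    UPDATE comment_stats SET like_count = like_count - 1
    WHERE comment_id = OLD.comment_id;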

MySQL managing catalogue views

A friend of mine has a catalogue that currently holds about 500 rows, or 500 items. We are looking at ways to provide reports on the catalogue, including the number of times an item was viewed and the dates when it was viewed.
His site is averaging around 25,000 page impressions per month; if we assume for a minute that half of these were catalogue items, that would be roughly 12,000 catalogue item views each month.
My question is about the best way to manage item views in the database.
The first option is to insert the catalogue ID into a table and then increment the number of times it's viewed. The advantage of this is its compact nature: there will only ever be as many rows in the table as there are catalogue items.
`catalogue_id`, `views`
The disadvantage is that no date information is held, short of maintaining the last time an item was viewed.
The second option is to insert a new row each time an item is viewed.
`catalogue_id`, `timestamp`
If we continue with the assumed figure of 12,000 item views, that means adding 12,000 rows to the table each month, or 144,000 rows each year. The advantage of this is that we know the number of times each item was viewed, and also the dates when it was viewed.
The disadvantage is the size of the table. Is a table with 144,000 rows becoming too large for MySQL?
Interested to hear any thoughts or suggestions on how to achieve this.
Thanks.
As you have mentioned, the first option is a lot more compact but limited. However, look at option 2 in more detail: if you wish to store more than just a view count, for instance the entry/exit page, host IP, etc., this information may be invaluable for stats and tracking. The other question is: are these 25,000 impressions unique? If not, you can track by username, IP, or some other unique identifier, which could let you avoid using as many rows. The answer to your question depends on how much detail you wish to store and how important the data is.
Update:
True, limiting the repeats on a given item within a time interval would be a good solution. Also, knowing that someone visited the same item could be useful for suggested-item prediction widgets similar to what Amazon does. Knowing that someone visited an item many times also tells me that this is a good item to promote to them or others in a mail-out, newsletter, or popular-products page. Tracking unique views will give a more honest view count, which you can choose to display or store. As for the issue of limiting the value of repeat visitors, this mainly comes into play depending on what information you display. It is all about framing the information in the way that best suits you.
Your problem statement: we want to be able to track the number of views of a particular catalogue item.
Let's review your options.
First Option:
In this option you will store the catalogue_id and an integer value for the number of views of the item.
Advantages:
Well, since you really have a one-to-one relationship, the new table is going to be small: if you have 500 items you will have 500 rows. If you choose this route, I would suggest not creating a new table but adding another column to the catalogue table holding the number of views.
Disadvantages:
The problem here is that since you are going to be updating this table relatively frequently, it is going to be a very busy little table. For example, say 10 users are viewing the same item: those 10 updates will have to run one after the other. Assuming you are using InnoDB, the first view action would come in, lock the row, update the counter, and release the lock; the other updates would queue behind it. So while the table holds little data, it could become a bottleneck later on, especially as the system scales.
You are also losing granular data, i.e. you are not keeping the raw data. For example, say the website starts growing and an interested investor wants to see a breakdown of the views per week over the last 6 months. With this option you won't have the data to provide to the investor; essentially you are keeping only a summary.
Second Option:
In this option you would create a logging table with at least the following minimal fields: catalogue_id and timestamp. You could expand this to add a username, IP address, or other information to make it even more granular.
Advantages:
You are keeping granular data, which allows you to summarise it in a variety of ways. You could, for example, add an IP address column, store each visitor's IP, and then run a monthly report showing products viewed by country (using an IP address lookup to get an idea of which country they were from); see the example query below. Another example would be to see which products were viewed the most over the last quarter, etc. This data is pretty essential in helping you make decisions on how to grow the business. If you want to know what is and is not working as far as products are concerned, this detail is absolutely critical.
Your new table will be a logging table with insert operations only, and inserts can pretty much happen in parallel. If you go with this option, it will probably scale better as the site grows than a constantly updated table would.
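For example, a views-per-item-per-month report over the logging table (here called catalogue_views; the name is illustrative):

SELECT catalogue_id,
       DATE_FORMAT(`timestamp`, '%Y-%m') AS month,
       COUNT(*) AS views
FROM catalogue_views
GROUP BY catalogue_id, month
ORDER BY month, views DESC;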
Disadvantages:
This table will be bigger, probably the biggest table in the database. However, this is not a problem. I regularly deal with tables of 500,000,000+ rows; some of my tables are over 750 GB by themselves and I can still run reporting on them. You just need to understand your queries and how to optimise them. This is really not a problem, as MySQL was designed to handle millions of rows with ease. Just keep in mind that you can archive some information into other tables. Say you archive the data every 3 years: you could move data older than 3 years into another table. You don't have to keep all the data there. Your estimate of 144,000 rows a year means you could probably safely keep 15+ years' worth without ever worrying about the performance of the table.
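The archiving could be as simple as this (table names illustrative; in practice compute a single fixed cutoff first so both statements agree on which rows they touch):

INSERT INTO catalogue_views_archive
SELECT * FROM catalogue_views
WHERE `timestamp` < NOW() - INTERVAL 3 YEAR;

DELETE FROM catalogue_views
WHERE `timestamp` < NOW() - INTERVAL 3 YEAR;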
My suggestion to you is to seriously consider the second option. If you decide to go that route, update your question with the proposed table structures and let us have a look at them. Don't be scared of big data; rather, be scared of BAD design, which is much more difficult to deal with.
However as always the choice is yours.

Personalized Search Results based on History

What are some of the techniques for providing personalized search results to a logged-in user? One way I can think of is analyzing the user's browsing history.
Tracking: A log of a user's activities, like pages viewed and 'like' buttons clicked, can be used to bias search results.
Question 1: How do you track a user's browsing history? A table with columns user_id, number_of_hits, page_id? If I have 1,000 daily visitors, each browsing 10 pages on average, won't there be a large number of records to select each time a personalized recommendation is required? The table will grow by 300K rows a month! It will take longer and longer to select the rows each time a search is made. I guess the table for recording 'likes' would take the same design.
Question 2: How do you bias the results of a search? For example, if a user has been searching for Apple products, how does the search engine realise that the user likes Apple products and subsequently bias the search towards them? Tag the pages and accumulate a record of the tags on the pages visited?
You probably don't want to use a relational database for this type of thing; take a look at MongoDB or Cassandra. That's because you basically want to keep adding new columns to a user's history as it grows, so a column-oriented database makes more sense.
300K rows per month is not really that much; in fact, that's almost nothing. It doesn't matter whether you use a relational or non-relational database for this.
Straightforward approach is the following:
put entries into the table/collection like this:
timestamp, user, action, misc information
(make sure that you put as much information as possible, such that you don't need to join this data warehousing table with any other table)
partition by timestamp (one partition per month)
never query this table directly; instead have, say, daily report jobs run over all the data, collect and compute the necessary statistics, and write them to a summary table (see the sketch after this list)
look at your report queries and add appropriate partition-local indexes
only query the summary table from your web frontend
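A sketch of that pipeline, with illustrative names throughout:

-- Raw warehousing table, partitioned by month:
CREATE TABLE activity_log (
    ts      DATETIME NOT NULL,
    user_id INT NOT NULL,
    action  VARCHAR(32) NOT NULL,
    page_id INT NULL,               -- part of the "misc information"
    KEY (user_id, ts)
) PARTITION BY RANGE (TO_DAYS(ts)) (
    PARTITION p2013_01 VALUES LESS THAN (TO_DAYS('2013-02-01')),
    PARTITION p2013_02 VALUES LESS THAN (TO_DAYS('2013-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Daily report job: roll yesterday's raw rows up into a summary
-- table (assumes user_page_hits has a unique key on user_id, page_id):
INSERT INTO user_page_hits (user_id, page_id, hits)
SELECT user_id, page_id, COUNT(*)
FROM activity_log
WHERE ts >= CURDATE() - INTERVAL 1 DAY
  AND ts <  CURDATE()
GROUP BY user_id, page_id
ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits);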
If you stored only the last X results as opposed to everything, it would probably be doable. It might slow things down, but it would work. Any time you're writing more data and reading more data, there's going to be an impact. Proper DBA methods such as indexing and query optimisation can help, but no matter what you use there's going to be an effect.
I'd personally look at storing just a default view for the user in a DB and using the session to keep track of the rest. Sure, when you log in there'd be no history, but you could take advantage of that to highlight a set of special pages that you think are important or relevant, to steer the user: a highlight system of sorts. Faster, easier, and more user-friendly.
As for bias, you could write a set of keywords for each record and array-sort them accordingly. It wouldn't be terribly difficult using PHP.
I use MySQL with over 2M records (page views) a month, and we run reports on that table daily and often.
The table is partitioned by month (as already suggested) and indexed where needed.
I also clear out data that is over 6 months old by moving it into a new table called "page_view_YYMM" (YY = year, MM = month) and using some UNIONs when necessary.
For the second question, the way I would approach it is by creating a table with the list of your products that is simply:
url, description
The description would be the tag-stripped content of your page or item (depending on how you want to influence the search). Then add a full-text index on description and run the search against that table, adding possible extra terms that you have been collecting while the user was surfing your site and that you think are relevant (for example a category name, or a brand).
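A sketch of that, using MySQL's full-text search; the extra terms ('apple laptop' here) would come from whatever you collected from the user's history:

CREATE TABLE products (
    url         VARCHAR(255) NOT NULL,
    description TEXT,
    FULLTEXT KEY ft_desc (description)
) ENGINE=MyISAM;   -- InnoDB supports FULLTEXT indexes from MySQL 5.6

SELECT url,
       MATCH(description) AGAINST('apple laptop') AS score
FROM products
WHERE MATCH(description) AGAINST('apple laptop')
ORDER BY score DESC;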
