Let us imagine the Facebook homepage. There is a list of posts, I report a post, and that post is blocked.
So, in the PHP & MySQL backend, I would do something like:
Fetch all my reported posts with something like reported_posts = GROUP_CONCAT(reported_post_id) and store the result in a cache such as memcached or Redis. This gives me a comma-separated list of post_ids like 123, 234, 45
Fetch all homepage_posts which are NOT IN (reported_posts). This gives us all the post_ids that need to be on the homepage, excluding posts 123, 234 and 45, because I have used NOT IN.
The issue here is that, as time goes by, reported_posts will keep growing (let's assume it grows by 1000 ids). At that point, the NOT IN (reported_posts) clause will take a huge input. Will this affect the performance of the query? What is an alternative solution to this?
I hope I could convey my doubt clearly, please let me know if it needs more clarification, I would edit as such. Thank you.
EDIT
The Reported post is not to be considered Globally, i.e. If I report the post, it should be Hidden only for me, and not for anyone else. So, it's also dependent on the account_id as well.
Assuming that reported_posts contains a list of user-specific blacklisted posts, it would be much better to do an exclusive left join and let the database handle everything:
SELECT *
FROM homepage_posts hp
LEFT JOIN
reported_posts rp
ON hp.id = rp.post_id
AND rp.account_id = 123
WHERE
rp.id IS NULL
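To make the exclusive left join concrete, here is a minimal sketch. It uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, and the table layout, the index name idx_reported_account_post, the helper visible_post_ids and the account ids 1 and 2 are all assumptions made up for the illustration, not part of the original schema.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema and sample data are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE homepage_posts (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE reported_posts (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        post_id INTEGER NOT NULL
    );
    -- Composite index so the join can look up (account, post) pairs directly.
    CREATE INDEX idx_reported_account_post
        ON reported_posts (account_id, post_id);

    INSERT INTO homepage_posts (id, body) VALUES
        (45, 'a'), (123, 'b'), (234, 'c'), (500, 'd'), (501, 'e');
    INSERT INTO reported_posts (account_id, post_id) VALUES
        (1, 45), (1, 123), (1, 234), (2, 500);
''')


def visible_post_ids(conn, account_id):
    """Return ids of homepage posts that this account has not reported."""
    rows = conn.execute(
        '''
        SELECT hp.id
        FROM homepage_posts hp
        LEFT JOIN reported_posts rp
               ON hp.id = rp.post_id
              AND rp.account_id = ?
        WHERE rp.id IS NULL      -- keep only posts with no matching report
        ORDER BY hp.id
        ''',
        (account_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(visible_post_ids(conn, 1))
print(visible_post_ids(conn, 2))
```

The account id is the only parameter, so the query text stays the same size no matter how many posts an account has reported.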
In MySQL, the IN operator works fine if the column is indexed. If the column is not indexed, performance suffers.
First off, I'm pretty new to this site and coding in general so please explain in simple terms as I'm still learning! Thanks
Ok, so I've got a database of results. These are 1-6 ratings. I've already created the ability to retrieve certain results (user, group, all).
But now, alongside retrieving the group and all results, I want to display a mean for each question at the top of the results.
So to start I'm wanting something like this I believe.
SELECT sum(r1), sum(r2), sum(r3) so on,
FROM table
This is where I get confused.
I think I'd need a variable to contain these and then another that counts the amount of entries to divide the total of r1 hence the mean.
Any ideas?..
To calculate a mean, use the AVG function, e.g.
SELECT AVG(r1), AVG(r2)
FROM table
See the MySQL docs.
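As a quick check that AVG does the sum-and-divide for you, here is a small sketch. It runs against an in-memory SQLite database through Python's sqlite3 module rather than MySQL, and the results table and its sample ratings are invented for the example.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the table and its rows are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE results (user_id INTEGER, r1 INTEGER, r2 INTEGER);
    INSERT INTO results VALUES (1, 2, 6), (2, 4, 5), (3, 6, 1);
''')

# AVG computes the mean per column in a single query.
avg_r1, avg_r2 = conn.execute(
    "SELECT AVG(r1), AVG(r2) FROM results"
).fetchone()

# The manual route from the question: total divided by number of entries.
sum_r1, count_r1 = conn.execute(
    "SELECT SUM(r1), COUNT(r1) FROM results"
).fetchone()

print(avg_r1, avg_r2)
print(sum_r1 / count_r1)
```

Both routes give the same mean for r1, so no extra variable or counter is needed on the PHP side.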
Hello again Stackoverflow!
I'm currently working on custom forum software, and one of the things you like to see on a forum is a view counter.
All the approaches for a view counter that I found would just select the topic from the database, retrieve the number from a "views" column, add one and update it.
But here's my thought: if, let's say, 400 people open a topic at the exact same time, the MySQL database probably won't count all the views, because the queries take time to complete, so the last person (of the 400) might overwrite the first person's view.
Of course one could argue that on a normal site this is never going to happen, but if you have ~7 people opening that topic in the exact same second and the server is struggling at that moment, you could have the same problem.
Is there any other good approach to count views?
EDIT
Woah, could the one who voted down specify why?
By "retrieving the number of views and adding one" I meant that I would use SELECT to retrieve the number, add one using PHP (note the tags) and update it using UPDATE. I had no idea of the other methods specified below, that's why I asked.
If, let's say, 400 people open a topic at the exact same time, the MySQL database will count all the views, because this is exactly what databases were invented for.
All the approaches for a view counter that you have found are wrong. To update a field you don't need to retrieve it first; just update it in place:
UPDATE forum SET views = views + 1 WHERE id = ?
So something like that will work:
UPDATE tbl SET cnt = cnt+1 WHERE ...
UPDATE is guaranteed to be atomic. That means no one will be able to alter cnt between the time it is read and the time it is replaced. If you have several concurrent UPDATEs for the same row (InnoDB) or table (MyISAM), they have to wait their turn to update the data.
See Is incrementing a field in MySQL atomic?
and http://dev.mysql.com/doc/refman/5.1/en/ansi-diff-transactions.html
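The difference between the two patterns can be sketched as follows. This is only a single-connection simulation using Python's sqlite3 (not MySQL and not real concurrency): the "two visitors" are interleaved by hand to show what a lost update looks like, and the topics table and helper names are assumptions for the example.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the topics table is assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topics (id INTEGER PRIMARY KEY, views INTEGER NOT NULL)"
)
conn.execute("INSERT INTO topics (id, views) VALUES (1, 0), (2, 0)")


def read_views(conn, topic_id):
    """Return the current view count of a topic."""
    return conn.execute(
        "SELECT views FROM topics WHERE id = ?", (topic_id,)
    ).fetchone()[0]


def increment_views(conn, topic_id):
    """Increment in a single statement, so there is no read/write gap."""
    conn.execute(
        "UPDATE topics SET views = views + 1 WHERE id = ?", (topic_id,)
    )


# Racy SELECT-then-UPDATE: both visitors read 0 before either one writes.
seen_by_a = read_views(conn, 1)
seen_by_b = read_views(conn, 1)
conn.execute("UPDATE topics SET views = ? WHERE id = 1", (seen_by_a + 1,))
conn.execute("UPDATE topics SET views = ? WHERE id = 1", (seen_by_b + 1,))
lost_update_total = read_views(conn, 1)

# Single-statement increment: each visit adds to whatever is stored.
increment_views(conn, 2)
increment_views(conn, 2)
atomic_total = read_views(conn, 2)

print(lost_update_total, atomic_total)
```

With the hand-interleaved reads, topic 1 ends up one view short, while topic 2 records both visits.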
I have recently written a survey application that has done its job and all the data is gathered. Now I have to analyze the data and I'm having some time issues.
I have to find out how many people selected what option and display it all.
I'm using this query, which does do its job:
SELECT COUNT(*)
FROM survey
WHERE users = ? AND `table` = ? AND col = ? AND `row` = ? AND selected = ?
GROUP BY users, `table`, col, `row`, selected
As the "?" placeholders show, I'm using MySQLi (in PHP) to fetch the data when needed, but I fear this is what makes it so slow.
The table consists of all the elements above (+ an unique ID) and all of them are integers.
To explain some of the fields:
Each survey was divided into 3 or 4 tables (sized from 2x3 to 5x5) with a 1 to 10 happiness grade to select from (the questions are on the right and top of the table, and you answer where they intersect).
users - age groups
table, row, col - explained above
selected - dooooh explained above
Now with the surveys complete and around 1 million entries in the table the query is getting very slow. Sometimes it takes like 3 minutes, sometimes (I guess) the time limit expires and you get no data at all. I also don't have access to the full database, just my empty "testing" one since the customer is kinda paranoid :S (and his server seems to be a bit slow)
Now (after the initial essay) my questions are: I left indexing out intentionally because with a lot of data being written during the survey, it would be a bad idea. But since no new data is coming in at this point, would it make sense to index all the fields of a table? How much sense does it make to index integers that never go above 10? (As you can guess, I haven't got a clue about indexes.) Do I need the primary unique ID in this table?
I read somewhere that indexing may help groups, but only if you group by the first columns in a table (and since my ID is first and, from my point of view, useless, can I remove it and gain anything by it?)
Is there another way to write my query that would basically do the same thing but in a shorter period of time?
Thanks for all your suggestions in advance!
Add an index on entries that you "GROUP BY" or do "WHERE". So that's ONE index incorporating users,table,col,row and selected in your case.
Some quick rules:
combine fields to have the WHERE first, and the GROUP BY elements last.
If you have other queries that only use part of it (e.g. users,table,col and selected) then leave the missing value (row, in this example) last.
Don't use too many indexes/indices, as each will marginally slow down updates to the table - so on a really large system you need to balance queries with indexes.
Edit: do you need the GROUP BY user,col,row as these are used in the WHERE. If the WHERE has already filtered them out, you only need group by "selected".
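The advice above (one composite index, WHERE columns first, grouping column last, and grouping only by "selected") can be sketched like this. It is an illustration on an in-memory SQLite database via Python's sqlite3, not MySQL; the index name idx_survey_lookup and the sample rows are made up, and SQLite's EXPLAIN QUERY PLAN is used in place of MySQL's EXPLAIN.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the sample data is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE survey (
        id INTEGER PRIMARY KEY,
        users INTEGER, "table" INTEGER, "col" INTEGER, "row" INTEGER,
        selected INTEGER
    );
    -- One composite index: WHERE columns first, the grouped column last.
    CREATE INDEX idx_survey_lookup
        ON survey (users, "table", "col", "row", selected);
''')
sample_rows = [
    (1, 1, 1, 1, 7), (1, 1, 1, 1, 7), (1, 1, 1, 1, 9),
    (1, 1, 2, 1, 7), (2, 1, 1, 1, 7),
]
conn.executemany(
    'INSERT INTO survey (users, "table", "col", "row", selected) '
    'VALUES (?, ?, ?, ?, ?)',
    sample_rows,
)

# One query per cell returns the count for every grade at once,
# grouping only by the column that the WHERE clause does not pin down.
COUNTS_SQL = '''
    SELECT selected, COUNT(*)
    FROM survey
    WHERE users = ? AND "table" = ? AND "col" = ? AND "row" = ?
    GROUP BY selected
    ORDER BY selected
'''
cell = (1, 1, 1, 1)
counts = dict(conn.execute(COUNTS_SQL, cell).fetchall())

# Ask SQLite which access path it picked; the last column is the detail text.
plan = " ".join(
    str(row[-1]) for row in conn.execute("EXPLAIN QUERY PLAN " + COUNTS_SQL, cell)
)

print(counts)
print(plan)
```

Dropping "selected" from the WHERE clause also cuts the number of round trips from ten per cell to one, which may matter as much as the index on a slow server.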
OK, I know the technical answer is NEVER.
BUT, there are times when it seems to make things SO much easier with less code and seemingly few downsides, so please hear me out.
I need to build a Table called Restrictions to keep track of what type of users people want to be contacted by and that will contain the following 3 columns (for the sake of simplicity):
minAge
lookingFor
drugs
lookingFor and drugs can contain multiple values.
Database theory tells me I should use a join table to keep track of the multiple values a user might have selected for either of those columns.
But it seems that using comma-separated values makes things so much easier to implement and execute. Here's an example:
Let's say User 1 has the following Restrictions:
minAge => 18
lookingFor => 'Hang Out','Friendship'
drugs => 'Marijuana','Acid'
Now let's say User 2 wants to contact User 1. Well, first we need to see if he fits User 1's Restrictions, but that's easy enough EVEN WITH the comma-separated columns, as such:
First I'd get the Target's (User 1) Restrictions:
SELECT * FROM Restrictions WHERE UserID = 1
Now I just put those into respective variables as-is into PHP:
$targetMinAge = $row['minAge'];
$targetLookingFor = $row['lookingFor'];
$targetDrugs = $row['drugs'];
Now we just check if the SENDER (User 2) fits that simple Criteria:
SELECT COUNT(*)
FROM Users
WHERE
Users.UserID = 2 AND
Users.minAge >= $targetMinAge AND
Users.lookingFor IN ($targetLookingFor) AND
Users.drugs IN ($targetDrugs)
Finally, if COUNT == 1, User 2 can contact User 1, else they cannot.
How simple was THAT? It just seems really easy and straightforward, so what is the REAL problem with doing it this way as long as I sanitize all inputs to the DB every time a user updates their contact restrictions? Being able to use MySQL's IN function and already storing the multiple values in a format it will understand (e.g. comma-separated values) seems to make things so much easier than having to create join tables for every multiple-choice column. And I gave a simplified example, but what if there are 10 multiple choice columns? Then things start getting messy with so many join tables, whereas the CSV method stays simple.
So, in this case, is it really THAT bad if I use comma-separated values?
****ducks****
You already know the answer.
First off, your PHP code isn't even close to working because it only works if user 2 has only a single value in LookingFor or Drugs. If either of these columns contains multiple comma-separated values then IN won't work even if those values are in the exact same order as User 1's values. What do you expect IN to do if the right-hand side has one or more commas?
Therefore, it's not "easy" to do what you want in PHP. It's actually quite a pain and would involve splitting user 2's fields into single values, writing dynamic SQL with many ORs to do the comparison, and then doing an extremely inefficient query to get the results.
Furthermore, the fact that you even need to write PHP code to answer such a relatively simple question about the intersection of two sets means that your design is badly flawed. This is exactly the kind of problem (relational algebra) that SQL exists to solve. A correct design allows you to solve the problem in the database and then simply implement a presentation layer on top in PHP or some other technology.
Do it correctly and you'll have a much easier time.
Suppose User 1 is looking for 'Hang Out','Friendship' and User 2 is looking for 'Friendship','Hang Out'
Your code would not match them up, because 'Friendship','Hang Out' is not in ('Hang Out','Friendship')
That's the real problem here.
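For comparison, here is a rough sketch of the join-table version for one of the multi-valued columns. It uses Python's sqlite3 with an in-memory database instead of MySQL, the table names restriction_looking_for and user_looking_for and the helper shares_looking_for are hypothetical, and it assumes "fits" means the sender shares at least one lookingFor value with the target; a stricter rule would need a different query.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table names and data are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    -- One row per (user, value) instead of a comma-separated column.
    CREATE TABLE restriction_looking_for (
        user_id INTEGER, value TEXT, PRIMARY KEY (user_id, value)
    );
    CREATE TABLE user_looking_for (
        user_id INTEGER, value TEXT, PRIMARY KEY (user_id, value)
    );

    INSERT INTO restriction_looking_for VALUES
        (1, 'Hang Out'), (1, 'Friendship');
    INSERT INTO user_looking_for VALUES
        (2, 'Friendship'), (2, 'Hang Out'), (3, 'Dating');
''')


def shares_looking_for(conn, sender_id, target_id):
    """True if the sender has at least one value the target accepts."""
    overlap = conn.execute(
        '''
        SELECT COUNT(*)
        FROM user_looking_for s
        JOIN restriction_looking_for t ON t.value = s.value
        WHERE s.user_id = ? AND t.user_id = ?
        ''',
        (sender_id, target_id),
    ).fetchone()[0]
    return overlap > 0


print(shares_looking_for(conn, 2, 1))
print(shares_looking_for(conn, 3, 1))
```

Because each value is its own row, the order in which User 2 picked 'Friendship' and 'Hang Out' no longer matters, and the whole check stays in one query.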
I am just learning php as I go along, and I'm completely lost here. I've never really used join before, and I think I need to here, but I don't know. I'm not expecting anyone to do it for me but if you could just point me in the right direction it would be amazing, I've tried reading up on joins but there are like 20 different methods and I'm just lost.
Basically, I hand coded a forum, and it works fine but is not efficient.
I have board_posts (for posts) and board_forums (for forums, the categories as well as the sections).
The part I'm redoing is how I get the information for the last post for the index page. The way I set it up is that to avoid using joins, I have it store the info for latest post in the table for board_forums, so say there is a section called "Off Topic" there I would have a field for "forum_lastpost_username/userid/posttitle/posttime" which I would update when a user posts etc. But this is bad, I'm trying to grab it all dynamically and get rid of those fields.
Right now my query is just like:
SELECT * FROM board_forums WHERE forum_parent='$forum_id'
And then I have the stuff where I grab the info for that forum (name, description, etc) and all the data for the last post is there:
$last_thread_title = $forumrow["forum_lastpost_title"];
$last_thread_time = $forumrow["forum_lastpost_time"];
$lastpost_username = $forumrow["forum_lastpost_username"];
$lastpost_threadid = $forumrow["forum_lastpost_threadid"];
But I need to get rid of that, and get it from board_posts. The way it's set up in board_posts is that if it's a thread, post_parentpost is NULL; if it's a reply, then that field has the id of the thread (first post of the topic). So, I need to grab the latest post_date, see which user posted that, THEN see if parentpost is NULL. If it's NULL, the last post is a new thread, so I can get all the info of the title and user there. If it's not, I need to get the info (title, id) of the first post in that thread, which can be found by seeing what post_parentpost is, looking up that ID and getting the title from it.
Does that make any sense? If so please help me out :(
Any help is greatly appreciated!!!!
Updating board_forums whenever a post or a reply is inserted is, performance-wise, not the worst idea. To display the index page you only have to select data from one table, board_forums, which is definitely much faster than also querying a second table for the last post's information, even when using a clever join.
You are better off just updating the stats on each action, New Post, Delete Post etc.
The other instances would not likely require any stats update (deletion of a thread would trigger a forum update, to show one less topic in the topic count).
Think about all the actions the user would do: in most cases you don't need to update any stats, so getting the counts on the fly is very inefficient, and you are right to think so.
It looks like you've already done the right thing.
If you were to join, you'd do it like this:
SELECT * FROM board_forums
JOIN board_posts ON board_posts.forum_id = board_forums.id
WHERE forum_parent = '$forum_id'
The problem with that, is that it gets you every post, which is not useful (and very slow). What you would want to do is something like this
SELECT * FROM board_forums
JOIN board_posts ON board_posts.forum_id = board_forums.id ORDER BY board_posts.id desc LIMIT 1
WHERE forum_parent = '$forum_id'
except SQL doesn't work like that. You can't attach ORDER BY or LIMIT to a join clause, so the straightforward join forces you to fetch every row and then scan them in code (which sucks); getting only the newest row per forum needs a subquery instead.
In short, don't worry. Use joins for the actual case where you do want to load all forums and all posts in one hit.
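If you ever do want the newest post per forum straight from board_posts, a correlated subquery is one way to express it. The sketch below is only an illustration: it runs on an in-memory SQLite database through Python's sqlite3 rather than MySQL, and the column names post_title, post_username and forum_name, along with the sample rows, are assumptions rather than the real schema.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; columns and rows are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE board_forums (
        id INTEGER PRIMARY KEY, forum_parent INTEGER, forum_name TEXT
    );
    CREATE TABLE board_posts (
        id INTEGER PRIMARY KEY,
        forum_id INTEGER,
        post_parentpost INTEGER,  -- NULL for a thread, else the thread's id
        post_title TEXT,
        post_username TEXT,
        post_date INTEGER
    );
    INSERT INTO board_forums VALUES
        (10, 1, 'Off Topic'), (11, 1, 'Help'), (12, 1, 'Empty');
    INSERT INTO board_posts VALUES
        (1, 10, NULL, 'Hello world',   'alice', 100),
        (2, 10, 1,    NULL,            'bob',   200),
        (3, 11, NULL, 'Join question', 'carol', 150);
''')

LAST_POST_SQL = '''
    SELECT f.forum_name,
           p.post_username,
           p.post_date,
           -- A reply points at its thread; a thread is its own first post.
           COALESCE(parent.id, p.id) AS thread_id,
           COALESCE(parent.post_title, p.post_title) AS thread_title
    FROM board_forums f
    LEFT JOIN board_posts p
           ON p.id = (SELECT lp.id
                      FROM board_posts lp
                      WHERE lp.forum_id = f.id
                      ORDER BY lp.post_date DESC, lp.id DESC
                      LIMIT 1)
    LEFT JOIN board_posts parent ON parent.id = p.post_parentpost
    WHERE f.forum_parent = ?
    ORDER BY f.id
'''
last_posts = conn.execute(LAST_POST_SQL, (1,)).fetchall()

for row in last_posts:
    print(row)
```

The subquery runs once per forum row, so this is still more work than reading the denormalized columns; it mainly helps when keeping those columns in sync becomes the bigger problem.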
The simple solution will result in numerous queries, some optional, as you've already discovered.
The classic approach to this is to cache the results, and only retrieve it once in a while. The cache doesn't have to live long; even two or three seconds on a busy site will make a significant difference.
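A short-lived cache of this kind can be sketched in a few lines. This is a minimal in-process illustration in Python, with the class name TTLCache, the key "forum_index" and the loader function all invented for the example; a real site would more likely use memcached or a similar shared cache.

```python
import time


class TTLCache:
    """Keep computed values for a few seconds before recomputing them."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the expiry can be tested
        self._store = {}            # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # still fresh: skip the expensive query
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value


def load_forum_index():
    """Hypothetical placeholder for the expensive last-post queries."""
    return ["Off Topic", "Help"]


cache = TTLCache(ttl_seconds=3)
first = cache.get_or_compute("forum_index", load_forum_index)
second = cache.get_or_compute("forum_index", load_forum_index)  # served from cache
print(first is second)
```

Within the three-second window every page view reuses the stored result, so the database sees at most one set of last-post queries per window instead of one per visitor.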
De-normalizing the data into a table you're already reading anyway will help. This approach saves you figuring out optional queries and can be a bit of a cheap win because it's just one more update when an insert is already happening. But it shifts some data integrity to the application.
As an aside, you might be running into the recursive-query problem with your threads. Relational databases do not store hierarchical data all that well if you use a "simple" algorithm. A better way is something sometimes called 'set trees'. It's a bit hard to Google, unfortunately, so here are some links.