Choosing data pseudo-randomly with even distribution

I'm currently working on a medium-sized web project, and I've run into a problem.
What I want to do is display a question together with an image. I have a (global) list of questions and a (global) list of images, and every question should be asked for every image.
As far as the user can see, the question and image should be chosen at random. However, the statistics from the answers (per question/image pair) will be used for research purposes. This means the question/image pairs must be chosen so that the answers are distributed evenly across all questions and across all images.
A user should only be able to answer a specific question/image pair once.
I am using a MySQL database and PHP. Currently, I have three database tables:
tbl_images (image_id)
tbl_questions (question_id)
tbl_answers (answer_id, image_id, question_id, user_id)
The other columns are not related to this specific problem.
Solution 1:
Track how many times each image/question has been used (add a column in each table). Always choose the image and question that has been asked the least.
Problem:
What I'm actually interested in is an even distribution among questions for each image and vice versa, not that each question is evenly covered globally.
Solution 2:
Add another table, containing all question/image-pairs along with how many times it has been asked. Choose the lowest combination (first row if count column is sorted by ascending order).
Problem:
Does not enforce that the user can only answer a question once. Also does not give the appearance that the choice is random to the user.
Solution 3:
Same as #2, but store question/image/user_id in table.
Problem:
Performance issues (?), a lot of space wasted for each user. There will probably be semi-large amounts of data (thousands of questions/images and at least hundreds of users).
Solution 4:
Choose a question and image at true random from all available. With a large enough amount of answers they will be distributed evenly.
Problem:
If I add a new question or image, it will not get more answers than the others and will therefore never catch up. I want an even amount of statistics for all question/image pairs.
Solution 5:
Weighted random. Choose a number of question/image pairs (say about 10-100) at true random and pick the best (as in, lowest global count) of these that the user has not answered.
Problem:
Does not guarantee that a recently added question or image gets a lot of answers quickly.
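Solution 5 can be sketched roughly as follows. This is only an illustration in Python rather than PHP, and the names pick_pair, counts and answered_by_user are made up for the example. It assumes the per-pair answer counts and the set of pairs the current user has answered are already loaded into memory.

```python
import random

def pick_pair(image_ids, question_ids, counts, answered_by_user,
              sample_size=20, rng=random):
    """Draw random (image, question) pairs, then return the least-asked one
    that this user has not answered yet, or None if the sample has none."""
    # Draw sample_size pairs uniformly at random (duplicates collapse in the set).
    sample = {(rng.choice(image_ids), rng.choice(question_ids))
              for _ in range(sample_size)}
    # Drop pairs the user has already answered.
    eligible = [pair for pair in sample if pair not in answered_by_user]
    if not eligible:
        return None  # the caller may retry with a fresh sample
    # Lowest global count wins; pairs with no answers yet count as 0.
    return min(eligible, key=lambda pair: counts.get(pair, 0))
```

A larger sample_size pushes the choice toward the least-asked pairs (so new questions and images catch up faster), while a smaller one looks more random to the user.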
Solution #5 is probably the best one I've come up with so far.
Your input is very much appreciated, thank you for your time.

From what I understand of your problem, I would go with #1. However, you do not need a new column. I would create an SQL view instead, because it sounds like you'll need to report on things like that anyway. A view is basically a stored SELECT that behaves much like a table (MySQL does not cache its results; the query runs each time the view is read). Thus you would create a view that keeps the total number of answers for each question and image:
DROP VIEW IF EXISTS "main"."view_image_question_count";
CREATE VIEW "view_image_question_count" AS
-- COUNT(*) gives the number of answers per pair; SUM(question_id) would add up the IDs instead
SELECT image_id, question_id, COUNT(*) AS "total"
FROM answer
GROUP BY image_id, question_id;
Then, you need a quick and easy way to get the next best image/question combo to ask:
DROP VIEW IF EXISTS "main"."view_next_best_question";
CREATE VIEW "view_next_best_question" AS
SELECT a.*, user_id
FROM view_image_question_count a
JOIN answer USING( image_id, question_id )
JOIN question USING(question_id)
JOIN image USING(image_id)
ORDER BY total ASC;
Now, if you need to report on your image-to-question performance, you can do so by:
SELECT * FROM view_image_question_count
If you need the next best image+question to ask for a user, you would call:
SELECT * FROM view_next_best_question WHERE user_id != {USERID} LIMIT 1
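The != {USERID} part is meant to prevent getting a question the user has already answered. Note that it only filters out the answer rows recorded by that user, so a pair that other users have also answered can still come back; a NOT EXISTS check against the user's own answers is stricter. The LIMIT means only one row is fetched.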
Disclaimer: There is probably a lot that could be done to optimize this. I just wanted to post something for thought.
Also, here is the database dump I used for testing. http://pastebin.com/yutyV2GU
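As a rough sketch of the same idea in one query, the example below counts answers per image/question pair (including pairs nobody has answered yet) and skips pairs the given user has already answered. It uses Python with SQLite only so that it runs on its own; the table names follow the views above, the sample rows are invented, and in MySQL RANDOM() would be RAND().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image    (image_id INTEGER PRIMARY KEY);
    CREATE TABLE question (question_id INTEGER PRIMARY KEY);
    CREATE TABLE answer   (answer_id INTEGER PRIMARY KEY,
                           image_id INTEGER, question_id INTEGER, user_id INTEGER);
    INSERT INTO image VALUES (1), (2);
    INSERT INTO question VALUES (1), (2);
    INSERT INTO answer (image_id, question_id, user_id) VALUES
        (1, 1, 10), (1, 1, 11), (1, 2, 10), (2, 1, 11);
""")

# CROSS JOIN builds every pair; LEFT JOIN keeps pairs that have no answers (total = 0).
NEXT_PAIR = """
    SELECT i.image_id, q.question_id, COUNT(a.answer_id) AS total
    FROM image AS i
    CROSS JOIN question AS q
    LEFT JOIN answer AS a
           ON a.image_id = i.image_id AND a.question_id = q.question_id
    WHERE NOT EXISTS (SELECT 1 FROM answer AS mine
                      WHERE mine.image_id = i.image_id
                        AND mine.question_id = q.question_id
                        AND mine.user_id = ?)
    GROUP BY i.image_id, q.question_id
    ORDER BY total ASC, RANDOM()   -- least-answered first, ties broken at random
    LIMIT 1
"""

def next_pair(user_id):
    """Return (image_id, question_id, total) for the next pair, or None."""
    return conn.execute(NEXT_PAIR, (user_id,)).fetchone()
```

With thousands of images and questions the cross join gets large, so in practice this would probably need to be combined with sampling or a maintained counter table.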

Related

Storing/retrieving multiple form input submissions by same user to mysql

I have a simple form that asks "how are you doing right now at this moment?" and they select #1-10 from a dropdown.
The challenge: the user will answer this question endlessly over time, and I'd like to, if possible, store their ongoing answers in 1 column of a record with their unique user_id. Since they can potentially have hundreds of submissions to the question, what would be the best way to store and retrieve their stored answers? There will be an option for them to view their past 5, 10, or even 100 answers so they can see a pattern over time in how they're doing. Their info would probably be displayed in a table going across the screen like:
Here's how you've been doing:
2 4 8 9 4 9 4 etc etc
Is there a way, and is it in this case recommended, to save all their submitted answers to the question in 1 single table row column? If so, can you give me an idea of the mysql code to save ... and code to retrieve it? I would create x # of columns to save each answer if there was a known total, but in this case, we don't know how many there will be.
I wasn't able to find a solution online.

Yes. As Jeff suggested, if I were you I would create a table, call it temporary_answer, with the fields
user_id, question_id, answer_id, created_datetime
You will then be able to fetch these answers at any time by filtering on user_id and created_datetime. I did this when I was developing e-learning sites. I hope this answer helps.
CMIIW.
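A minimal sketch of the one-row-per-answer idea, written in Python with SQLite so it is runnable as-is. The table name mood_answer and its columns are assumptions for this example (a simplified version of the fields listed above), and the rows are ordered by the auto-increment id because timestamps with one-second resolution can tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mood_answer (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id    INTEGER NOT NULL,
        score      INTEGER NOT NULL CHECK (score BETWEEN 1 AND 10),
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")
# The index lets "latest N answers for one user" avoid scanning other users' rows.
conn.execute("CREATE INDEX idx_mood_user ON mood_answer (user_id, id)")

def save_answer(user_id, score):
    """Store one submission as its own row."""
    conn.execute("INSERT INTO mood_answer (user_id, score) VALUES (?, ?)",
                 (user_id, score))

def last_answers(user_id, limit=5):
    """Return the user's most recent scores, oldest first for display."""
    rows = conn.execute(
        "SELECT score FROM mood_answer WHERE user_id = ? ORDER BY id DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [score for (score,) in reversed(rows)]

for s in (2, 4, 8, 9, 4, 9, 4):
    save_answer(7, s)
```

Calling last_answers(7, 5), last_answers(7, 10) or last_answers(7, 100) then covers the "past 5, 10, or 100 answers" views without knowing in advance how many rows a user will have.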

Many to many vs one row [duplicate]

This question already has answers here:
Many database rows vs one comma separated values row
(4 answers)
Closed 8 years ago.
I'm interested in how and why a many-to-many relationship is better than storing the information in one row.
Example: I have two tables, Users and Movies (very big data). I need to establish a relationship "view".
I have two ideas:
Make another column in Users table called "views", where I will store the ids of the movies this user has viewed, in a string. for example: "2,5,7...". Then I will process this information in PHP.
Make new table users_movies (many to many), with columns user_id and movie_id. row with user_id=5 and movie_id=7 means that user 5 has viewed movie 7.
I'm interested in which of these methods is better and WHY. Please consider that the data is quite big.

The second method is better in just about every way. Not only will you utilize your DB's indexes to find records faster, it will also make modification far, far easier.

Approach 1) could answer the question "Which movies has user X viewed?" with SQL like "... FIND_IN_SET(movie_id, user_movielist) ...". But the other way round ("Which users have viewed movie X?") is awkward to express in SQL and cannot use an index.
That's why I always would go for approach 2): clear normalized structure, both ways are simple joins.
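To illustrate "both ways are simple joins", here is a small sketch using Python and SQLite as a stand-in for PHP and MySQL. The table and column names follow the question (users, movies, users_movies); the sample rows and the second index are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE users_movies (
        user_id  INTEGER NOT NULL,
        movie_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, movie_id)      -- serves lookups by user
    );
    -- Second index serves lookups in the other direction (by movie).
    CREATE INDEX idx_um_movie ON users_movies (movie_id, user_id);
    INSERT INTO users VALUES (5, 'ann'), (6, 'bob');
    INSERT INTO movies VALUES (2, 'Alien'), (7, 'Heat');
    INSERT INTO users_movies VALUES (5, 2), (5, 7), (6, 7);
""")

def movies_seen_by(user_id):
    """Which movies has this user viewed?"""
    rows = conn.execute(
        "SELECT m.title FROM users_movies AS um "
        "JOIN movies AS m ON m.id = um.movie_id "
        "WHERE um.user_id = ? ORDER BY m.title", (user_id,))
    return [title for (title,) in rows]

def viewers_of(movie_id):
    """Which users have viewed this movie?"""
    rows = conn.execute(
        "SELECT u.name FROM users_movies AS um "
        "JOIN users AS u ON u.id = um.user_id "
        "WHERE um.movie_id = ? ORDER BY u.name", (movie_id,))
    return [name for (name,) in rows]
```

Both functions are the same shape of query; only the column being filtered changes.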

It's just about the needs you have. If you need performance then you must accept redundancy of the information and add a column. If your main goal is to respect the normalization paradigm then you should not have redundancy at all.
When I have to do this type of choice I try to estimate the space loss of redundancy vs the frequency of the query of interest and its performance.

A few more thoughts.
In your first situation if you look up a particular user you can easily get the list of ids for the films they have seen. But then would need a separate query to get the details such as the titles of those movies. This might be one query using IN with the list of ids, or one query per film id. This would be inefficient and clunky.
With MySQL there is a possible fudge to join in this situation using the FIND_IN_SET() function (although a downside of this is that you are straying into non-standard SQL). You could join your table of films to the users using ON FIND_IN_SET(film.id, users.film_id) > 0. However, this is not going to use an index for the join, and it involves a function (which, while quick for what it does, will be slow when performed on thousands of rows).
If you wanted to find all the users who had viewed any film a particular user had viewed, then it is a bit more difficult. You can't just use FIND_IN_SET, as it requires a single string and a comma-separated list. As a single query you would need to join the particular user to the film table to get a lot of intermediate rows, and then join that back against the users again (using FIND_IN_SET) to find the other users.
There are ways in SQL to split up a comma separated list of values, but they are messy and anyone who has to maintain such code will hate it!
These are all fudges. With the 2nd solution these are easy to do, and any resulting joins can easily use indexes (and possibly the whole queries can just use indexes without touching the actual data).
A further issue with the first solution is data integrity. You will have to manually check that a film doesn't appear twice for a user (with the 2nd solution this can easily be enforced using a unique key). You also cannot just add a foreign key to ensure that any film id for a user does actually exist. Further, you will have to manually ensure that nothing enters a character string in your delimited list of ids.
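The integrity points above can be sketched like this, again in Python with SQLite as a stand-in for MySQL. The helper names try_insert and co_viewers are invented for the example. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, whereas MySQL's InnoDB enforces them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite only
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY);
    CREATE TABLE movies (id INTEGER PRIMARY KEY);
    CREATE TABLE users_movies (
        user_id  INTEGER NOT NULL REFERENCES users(id),
        movie_id INTEGER NOT NULL REFERENCES movies(id),
        PRIMARY KEY (user_id, movie_id)   -- a film cannot appear twice for a user
    );
    INSERT INTO users VALUES (5), (6), (8);
    INSERT INTO movies VALUES (2), (7);
    INSERT INTO users_movies VALUES (5, 2), (5, 7), (6, 7);
""")

def try_insert(user_id, movie_id):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO users_movies VALUES (?, ?)",
                         (user_id, movie_id))
        return True
    except sqlite3.IntegrityError:
        return False

def co_viewers(user_id):
    """Users who viewed at least one film that user_id viewed."""
    rows = conn.execute(
        """SELECT DISTINCT other.user_id
           FROM users_movies AS mine
           JOIN users_movies AS other ON other.movie_id = mine.movie_id
           WHERE mine.user_id = ? AND other.user_id <> ?
           ORDER BY other.user_id""", (user_id, user_id))
    return [uid for (uid,) in rows]
```

Here the duplicate row and the row pointing at a non-existent film are both rejected by the database itself, and the "users who viewed any film this user viewed" query is a plain self-join.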

How to make random combination not to repeat?

I'm working on a ready-made Facemash-like script. It shows two pictures, and the user chooses which picture they like better.
I wanted to add a small improvement so that a user is never shown a combination of two pictures they have already voted on.
I tried to do this in two ways, but neither is good enough or comfortable for the user.
First one: the two pictures are chosen at random. After a vote, a new record is created in the database with this specific combination and the value of the vote. If the combination of two pictures already exists as a record in the database, the page shows the historical vote, and after a few seconds the page refreshes and makes another random combination.
Second one: when the names of pictures are added to the database, the script creates all possible combinations as records in the database. This works well, because the script pulls a random record that has no result yet and saves the value after the vote, so repeats are impossible. The main problem with this approach comes when adding new pictures: the database becomes huge from the start, and creating all possible combinations takes forever.
Because of that I'm looking for another solution. Even a small piece of advice that might help me find a way would be welcome.

Your first approach scales better, you just want to avoid showing an historical vote. You need to keep a history of votes anyway, so use that history as a filter. In the SELECT statement you are using to get the random faces, left join on the history table to use the join as a filter.
Example:
-- replace # with the current user's ID
SELECT faces.uid f_uid, votes.uid v_uid FROM faces
LEFT JOIN votes ON votes.user_id=# AND
(faces.uid=votes.face_id1 OR faces.uid=votes.face_id2)
WHERE votes.uid IS NULL
ORDER BY RAND() LIMIT 2
That will make sure they never see the same face twice. It will become slower the more faces a user votes on. It won't be noticeably slower until they have done many hundreds of votes.
That said, you could change the LIMIT to something like 20 and cache that (e.g. in the session). You then have the next 10 pairings (20/2=10) ready to go. That is sort of a combination of 1 & 2.
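The batching idea might look roughly like this. It is a sketch in Python with SQLite (RANDOM() instead of MySQL's RAND()); the function name next_pairings and the sample rows are assumptions, and the returned list stands in for whatever would be cached in the session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE faces (uid INTEGER PRIMARY KEY);
    CREATE TABLE votes (uid INTEGER PRIMARY KEY, user_id INTEGER,
                        face_id1 INTEGER, face_id2 INTEGER);
    INSERT INTO faces VALUES (1), (2), (3), (4), (5), (6);
    INSERT INTO votes (user_id, face_id1, face_id2) VALUES (42, 1, 2);
""")

def next_pairings(user_id, face_limit=20):
    """Fetch up to face_limit random faces the user has not voted on,
    grouped into pairs that can be cached and served one by one."""
    rows = conn.execute(
        """SELECT faces.uid
           FROM faces
           LEFT JOIN votes
                  ON votes.user_id = ?
                 AND (faces.uid = votes.face_id1 OR faces.uid = votes.face_id2)
           WHERE votes.uid IS NULL          -- keep only faces with no matching vote
           ORDER BY RANDOM()
           LIMIT ?""", (user_id, face_limit)).fetchall()
    ids = [uid for (uid,) in rows]
    # Pair consecutive faces; a leftover odd face is dropped.
    return list(zip(ids[0::2], ids[1::2]))
```

Once the cached list is used up, the same query can be run again to refill it.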

MCQ quiz is slow to load and randomization does not work

I am creating an MCQ quiz based on PHP and MySQL. Here are the structures of my main tables:
quiz table: quiz id, quiz_category
category table: id, title...
questions table: id, quiz id, categoryid, title...
answers table: id, question id...
To start things, I have the tables populated with 150+ quizzes, 4 categories, 14000+ questions and the right answer for each.
To save time, for each question the right answer is pulled from the answers table along with 3 other random answers.
Now when I was testing it with just two quizzes, it worked fine. But with 150 quizzes, several problems have cropped up:
the database is slow and for later quizzes takes forever to load questions
the randomization of answers is not working anymore - along with the right answer, the other options show the same entry, making it easy for the user to guess the right answer.
You can see the code I am working with in my previous Stackoverflow query. https://stackoverflow.com/questions/14826573/randomising-questions-and-answers-php-quiz-not-working
Any idea about what the ideal queries should be for the quiz program to work?

I will provide some tips on how to improve performance, however these will be generic and may not be complete.
From briefly looking at your PHP and SQL statements from your previous question, there are a few logical places for an index. To add an index, please refer to the MySQL manual for more information.
$sql4="select * from answers where question_id=".$row2['id'];
question_id should have an index
$sql2="select * from questions where quiz_id=".$_SESSION['quizid'];
quiz_id should have an index
Adding these two indexes will also help this query:
$sql3="select * from answers where question_id in (select id from
questions where quiz_id =$row2[quiz_id]) order by rand()";
This will help as previously you would have been performing a full table scan for each query.
Your other issue is that you have a loop, and on each iteration you send commands to query the database. You should collect all the information at once before the loop and then iterate over that, rather than sending individual queries on each iteration.
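A sketch of both suggestions (the two indexes, and one query before the loop instead of one query per iteration), written in Python with SQLite so it runs standalone. The column names follow the snippets above; the index names, the title and body columns, and the sample rows are assumptions.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, quiz_id INTEGER, title TEXT);
    CREATE TABLE answers   (id INTEGER PRIMARY KEY, question_id INTEGER, body TEXT);
    -- The two indexes suggested above.
    CREATE INDEX idx_questions_quiz   ON questions (quiz_id);
    CREATE INDEX idx_answers_question ON answers (question_id);
    INSERT INTO questions VALUES (1, 100, 'Q1'), (2, 100, 'Q2'), (3, 200, 'Q3');
    INSERT INTO answers VALUES (1, 1, 'A1'), (2, 2, 'A2'), (3, 3, 'A3');
""")

def load_quiz(quiz_id):
    """Load every question of a quiz with its answers in a single query."""
    rows = conn.execute(
        """SELECT q.id, q.title, a.body
           FROM questions AS q
           JOIN answers AS a ON a.question_id = q.id
           WHERE q.quiz_id = ?
           ORDER BY q.id, a.id""", (quiz_id,))
    quiz = defaultdict(lambda: {"title": None, "answers": []})
    # Group the joined rows by question in memory; no further queries are sent.
    for question_id, title, body in rows:
        quiz[question_id]["title"] = title
        quiz[question_id]["answers"].append(body)
    return dict(quiz)
```

The page can then loop over the returned dictionary, and the pool of wrong options can be shuffled in application code rather than with ORDER BY RAND() per question.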

MySql -- How to keep a record of used entries?

I have an application (more or less a quiz app) where I have saved all my 1000 quiz questions in a MySQL database. I want to retrieve a random question from this table when a user requests one, which I can easily do using the RAND() function in MySQL. My problem is that I don't want to give the same question to a user two or more times. How can I keep a record of retrieved questions? Do I have to create a table for each and every user? Won't that increase the load time? Any help would be greatly appreciated.
-regards

If you want it for a short time, use the user's $_SESSION for that.
If you need it long term (say, so the same questions are not asked tomorrow), you'll have to create an additional table, questions2users, where you store the user ID and the questions the user has already been asked.
Retrieving a question in both cases would require a simple NOT IN condition:
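// Short term: IDs already asked live in the session; cast to int before building the list
$asked = implode(",", array_map("intval", $_SESSION["asked"]));
$sql = "SELECT * FROM questions WHERE id NOT IN ($asked)"; // needs at least one ID in $asked
-- Long term: look the IDs up in the questions2users table
SELECT * FROM questions
WHERE id NOT IN (
SELECT question_id FROM questions2users WHERE userid = 123
)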
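For the session variant, one way to build the NOT IN list safely is with bound placeholders. The sketch below uses Python and SQLite purely for illustration; random_unasked is a made-up name and asked_ids stands in for the list kept in the session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT);
    INSERT INTO questions VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

def random_unasked(asked_ids):
    """Return one random question row whose id is not in asked_ids, or None."""
    asked_ids = list(asked_ids)
    sql = "SELECT id, title FROM questions"
    if asked_ids:
        # One placeholder per ID; an empty NOT IN () would be a syntax error,
        # so the clause is skipped when nothing has been asked yet.
        placeholders = ",".join("?" for _ in asked_ids)
        sql += f" WHERE id NOT IN ({placeholders})"
    sql += " ORDER BY RANDOM() LIMIT 1"
    return conn.execute(sql, asked_ids).fetchone()
```

Only the placeholders are interpolated into the SQL string; the IDs themselves are passed as bound parameters.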

my problem is , I don't want to give the same question two or more times to a user,
how can i keep a record of retrieved questions? Do I have to create tables for each
and every users? won't that increase the load time?
Yes, but possibly not so much.
You keep a single extra table with userId, questionId and insert there the questions already asked to the various users.
When you ask question 123 to user 456, you run a single INSERT
INSERT INTO askedQuestions (userId, questionId) VALUES (456, 123);
Then you extract questions from questions with a LEFT JOIN
SELECT questions.* FROM questions
LEFT JOIN askedQuestions ON (questions.id = askedQuestions.questionId AND askedQuestions.userId = {$_SESSION['userId']} )
WHERE askedQuestions.userId IS NULL
ORDER BY RAND() LIMIT 1;
If you keep askedQuestions indexed on (userId, questionId), joining will be very efficient.
Notes on RAND()
Selecting from a table like this should not be done with ORDER BY RAND(), which will retrieve all the rows in the table before outputting one of them. Normally you would choose a questionId at random and select the question with that questionId, which would be way faster. But here you have no guarantee that the question has not already been asked to that user, so the faster query might fail.
When most questions are still free to ask, you can use
WHERE questions.id IN ( R1, R2, R3, ... )
AND askedQuestions.userId IS NULL LIMIT 1
where R1, R2, ... are random integers between 1 and N, the number of questions, generated in PHP beforehand. (MySQL's own RAND(N) would not do this: it treats N as a seed and returns a float between 0 and 1, and RAND() in a WHERE clause is re-evaluated for every row.) Chances are that at least one of the random numbers you extract will still be free. The IN will decrease performance, and you will have to strike a balance with the number of random IDs. When questions are almost all asked, chances of a match decrease, and your query might return nothing even with many random IDs (also because they will start yielding duplicate IDs, in what is known as the Birthday Paradox).
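To get a feel for that balance: if the IDs run from 1 to N without gaps, a of them have already been asked, and k IDs are drawn independently and uniformly, then the chance that at least one drawn ID is still free is 1 - (a/N)^k. The small Python sketch below (function names are made up for the example) evaluates this and inverts it.

```python
import math

def chance_of_free_pick(total, asked, draws):
    """Probability that at least one of `draws` independent uniform IDs
    from 1..total has not been asked yet."""
    return 1 - (asked / total) ** draws

def draws_needed(total, asked, target=0.95):
    """Smallest number of draws whose chance of a free pick reaches `target`."""
    if asked == 0:
        return 1  # every ID is free
    if asked >= total:
        raise ValueError("no free questions left")
    return math.ceil(math.log(1 - target) / math.log(asked / total))
```

With half of 1000 questions asked, three draws already give a 0.875 chance of a hit, while with 990 asked the same three draws succeed only about 3% of the time, which is where a fallback to the full query becomes necessary.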
One way to achieve the best of both worlds could be to fix a maximum number of attempts, say, three (or better still, based on the number of questions left over).
For X times you generate (in PHP) a set of Y random ids between 1 and 1000, and try to retrieve (userId, questionId) from askedQuestions. The table is thin and indexed, so this is really fast. If you fail, then the extracted questionId is random and free, and you can run
SELECT * FROM questions WHERE id = {$tuple['questionId']};
which is also very fast. If you succeed X times, i.e., for X times, all Y random questionIds are registered as being already asked, then you run the full query. Most users will be served almost instantly (two very quick queries), and only a few really dedicated users will require more processing. You might want to set some kind of alerting to warn you of users running out of questions.
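The whole strategy (a few cheap probes, then the full query as a fallback) could be sketched as below. This is Python with SQLite rather than PHP with MySQL; the function names, the default values for attempts and batch, and the assumption that question IDs run from 1 to 1000 without gaps are all choices made for the example.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE askedQuestions (userId INTEGER, questionId INTEGER,
                                 PRIMARY KEY (userId, questionId));
""")
conn.executemany("INSERT INTO questions VALUES (?, ?)",
                 [(i, f"question {i}") for i in range(1, 1001)])

def mark_asked(user_id, question_id):
    """Record that a question was served to a user."""
    conn.execute("INSERT OR IGNORE INTO askedQuestions VALUES (?, ?)",
                 (user_id, question_id))

def pick_question(user_id, attempts=3, batch=5, total=1000, rng=random):
    """Try a few cheap random probes first; fall back to the full anti-join."""
    for _ in range(attempts):
        ids = [rng.randint(1, total) for _ in range(batch)]
        marks = ",".join("?" for _ in ids)
        # Which of the random IDs has this user already been asked?
        taken = {qid for (qid,) in conn.execute(
            f"SELECT questionId FROM askedQuestions "
            f"WHERE userId = ? AND questionId IN ({marks})", [user_id, *ids])}
        free = [qid for qid in ids if qid not in taken]
        if free:
            return conn.execute("SELECT id, title FROM questions WHERE id = ?",
                                (free[0],)).fetchone()
    # Slow path: every probe landed on an already-asked question.
    return conn.execute(
        """SELECT q.id, q.title FROM questions AS q
           LEFT JOIN askedQuestions AS a
                  ON a.questionId = q.id AND a.userId = ?
           WHERE a.userId IS NULL
           ORDER BY RANDOM() LIMIT 1""", (user_id,)).fetchone()
```

A return value of None from the slow path means the user has run out of questions, which is the case the alerting mentioned above would catch.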

One solution is to add an ID column to the question table, and when you serve a question to a user you check that ID against the list of questions you have already served to that user.
You can use an in-memory data structure like a List to keep track of the questions that have been served to a particular user. This way, you only need an array of Lists instead of tables to get the job done.
