For a Facebook Application, I have to store a list of friends of a user in my MySQL database. This list is requested from my db, compared with other data, etc.
Currently, I store this list of friends within my user table: the UIDs of the friends are concatenated into one TEXT field, with '|' as separator. For example:
ID - UID - NAME - FRIENDS => 1 - 123456789 - John Doe - 987654321|123456|765432
My PHP file requests this row and extracts the list of friends by exploding that field on '|'. This all works fine; every 1,000 users take about 5 MB of disk space.
Now the problem:
For an extra feature, I also need to save the names of the friends of the user. I can do this in different ways:
1) Save this data in an extra table. For example:
ID - UID - NAME => 1 - 1234321 - Jane Doe
If I need the name of the friend with ID 1234321, I can request the name from this table. However, the problem is that this table will keep growing until all users on Facebook are indexed (>500 million rows). My webhost is not going to like this! Such a table would take about 25 GB of disk space.
2) Another solution is to extend the data saved in the user table, by adding the name to the UID in the friends field (with an extra separator, let's use ','). For example:
ID - UID - NAME - FRIENDS => 1 - 123456789 - John Doe - 987654321,Mike Jones|123456,Tom Bright|765432,Rick Smith
For this solution I have to alter the script to add another explode (on ','), etc. I'm not sure how much extra disk space this will take... and the data doesn't get any easier to handle this way!
3) A third solution gives a good overview of all the data, but will cause the database to be huge. In this solution we create a table of friends, with a row for every friendship. For example:
ID - UID - FRIENDUID => 1 - 123456789 - 54321
ID - UID - FRIENDUID => 3 - 123456789 - 65432
ID - UID - FRIENDUID => 2 - 987654321 - 54321
ID - UID - FRIENDUID => 4 - 987654321 - 65432
As you can see in this example, it gives a very good overview of all the friendships. However, with about 500 million users and, say, an average of 300 friendships per user, this creates a table with 150 billion rows. My host is definitely not going to like that... and I think this kind of table will take a lot of disk space...
So... How to solve this problem? What do you think, what is the best way to store the UIDs + names of friends of a user on Facebook? How to scale this kind of data? Or do you have another (better) solution than the three possibilities mentioned above?
Hope you can help me!
If I need the name of the friend with ID 1234321, I can request the name from this table. However, the problem is that this table will keep growing, until all users on Facebook are indexed (>500 million rows). My webhost is not going to like this! Such a table will take about 25 GB of disk space.
If storing the names of the users you need really takes 25GB, then it takes 25GB. You can't move data around and expect it to get smaller - and the overhead of a table is not that much. Instead, you need to focus on only storing the data you actually need. It is unlikely that everyone on Facebook uses your application (if it were the case, you shouldn't be using a host where 25GB of space is a worry).
So instead of indexing the entirety of Facebook (which would be difficult regardless), just store the data relevant for the people who actually use your application and their immediate friends, which is a much smaller dataset.
Your first proposed solution is the proper way to do it; it eliminates any potential redundancy in name storage.
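For illustration, a minimal sketch of solution 1 with placeholder names (one row per known Facebook user, with the UID as primary key):

// Hypothetical table: friend_names(uid BIGINT UNSIGNED PRIMARY KEY, name VARCHAR(255))
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder credentials
$stmt = $pdo->prepare('SELECT name FROM friend_names WHERE uid = :uid');
$stmt->execute(array(':uid' => 1234321));
$name = $stmt->fetchColumn(); // "Jane Doe", or false if the uid is unknown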
I agree with Amber, solution 1 is going to be the most efficient way to store this data. If you want to stick with your current approach (similar to solution 2), you may want to consider storing the friendship data as a JSON string. It won't produce the shortest possible string, but it will be very easy to parse.
To save the data:
$friends = array(
    'uid1' => 'John Smith',
    'uid2' => 'Jane Doe'
);
$str = json_encode($friends);
// save $str to the database in the "friends" column
To get the data back:
// get $str from the database
$friends = json_decode($str, true); // true = return an associative array
var_dump($friends);
I really think you should go with the third option; for scalability, that is what you would want.
With the first method you have a LOT of redundant data, because if 1 is friends with 2, then 2 is also friends with 1, yet you are storing both relations.
This also makes the 150 billion row estimate too high. The real count is more likely at most half of that, because the relation table can work both ways!
So the first user will generate 300 rows in the table, but the second user (if he is friends with the first) will generate just 299. Continue like this and the last user won't generate a single relation row, because they are all already present!
Also, when you want to start searching for certain relations, the third option will be much faster, since you'll have an integer index instead of a full-text one, which probably saves another 50% in both storage and processing speed.
If your application will reach 500 million users you will just have to get a better hosting service.
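One common trick to store each friendship only once (an assumption added here, not part of the original answer) is to always put the smaller UID in the first column; every lookup then normalizes the pair the same way:

// Sketch: canonical pair storage, so (A,B) and (B,A) map to the same row.
// Hypothetical table: friendships(uid_lo, uid_hi), PRIMARY KEY (uid_lo, uid_hi).
function areFriends(PDO $pdo, $uidA, $uidB) {
    $lo = min($uidA, $uidB); // smaller UID always goes first
    $hi = max($uidA, $uidB);
    $stmt = $pdo->prepare('SELECT 1 FROM friendships WHERE uid_lo = :lo AND uid_hi = :hi');
    $stmt->execute(array(':lo' => $lo, ':hi' => $hi));
    return (bool) $stmt->fetchColumn();
}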
Let's say that I have 3 different headlines for an article:
"Man Bites Dog"
"This Man Unhinged His Jaw as He Approached A Dog, What Happens Next Will Shock You!"
"Only 90's Kids Will Remember That Time a Man Bit a Dog"
I want to use PHP to randomly display one of these three headlines based on the current user (so they're not getting new headlines each time they refresh), then record the number of clicks for each version of the headline via SQL where I get something similar to:
USER | HEADLINE | CLICK?
   1 |        1 | No
   2 |        3 | Yes
   3 |        2 | Yes
   4 |        3 | No
   5 |        2 | Yes
   6 |        1 | No
Specifically, I'd like advice about:
- Retrieving some sort of variable that's unique to the user (IP address, maybe?)
- Randomly assigning a number (1-3, in the example) based on that unique user variable.
- Displaying different text based on the assigned number.
I can figure out the SQL stuff once I figure this part out. I appreciate any advice you can provide.
You have three problems here:
1. How to identify a user consistently
2. How to count user clicks (actions)
3. How to get the resulting statistics
Showing different headlines on one page is not a problem here, I think.
Problem 1
Basically you can use an IP address, but it is not a constant id for a user. For example, if a user is on a mobile phone and moving around, they can switch between towers or lose the connection and restore it with a different IP.
There are many ways to identify a user on the web, but none of them is 100% reliable without authorization (an active action performed by the user).
For example, you can set a cookie containing a generated ID for the user. Generating an ID is easy. Once the cookie is set and the user comes back, you will know who it is and can do what you need.
For more on user uniqueness, you can also read this article on browser uniqueness.
If you use a cookie, you can also store the assigned subject id in it for your task. If you don't use cookies, I recommend MongoDB for this kind of task (many objects with small data that must be read and inserted very fast, with no updates in your case).
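A minimal sketch of the cookie approach, assuming a random hex ID is acceptable (on PHP 7+ you could use random_bytes() instead):

// Must run before any output is sent, since setcookie() sends a header.
if (isset($_COOKIE['visitor_id'])) {
    $visitorId = $_COOKIE['visitor_id'];
} else {
    $visitorId = bin2hex(openssl_random_pseudo_bytes(16)); // 32 hex chars
    setcookie('visitor_id', $visitorId, time() + 60 * 60 * 24 * 365, '/'); // valid 1 year
}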
Problem 2
You showed a table that has 3 fields: ID, the title used, and whether the title was clicked.
With this kind of table you will lose all non-unique clicks (when a user clicks the subject twice, comes back tomorrow, or refreshes the target page multiple times).
I suggest using the following kind of table instead:
ID - a unique id; an auto-increment field works well here
Date - the measurement period (daily, hourly, or something like that)
SubjectID - id of the subject that was shown
UniqueClicks - count of distinct users who clicked on the subject
Clicks - total count of clicks on the subject
In this case you will have data aggregated by time period, and you can easily display it in an admin panel.
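A possible definition of that table, as a sketch (names and types are placeholders; it matches the statistic_table queried further below):

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder credentials
$pdo->exec(
    'CREATE TABLE statistic_table (
        id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        date         DATETIME NOT NULL,
        subjectId    INT UNSIGNED NOT NULL,
        uniqueClicks INT UNSIGNED NOT NULL DEFAULT 0,
        clicks       INT UNSIGNED NOT NULL DEFAULT 0
    ) ENGINE=InnoDB'
);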
But we still have the problem of collecting this data. The right solution depends on the click volume. If there are more than 1000 clicks per minute, I think you need a logging system: for example, append every event to a file named 'clickLog-' . date('Ymd_H') . '.log' in a fixed format such as:
clientId;SubjectId;
When the hour ends, you can aggregate this data with a shell script (or your own code) and put it into the db:
cat clickLog-20160907_12.log | sort -u | awk -F';' '{print $2}' | sort | uniq -c
After this command you will have two columns of data: the first is the count of unique clicks, the second is the subject id.
By removing the sort -u step, the same pipeline gives you total clicks instead.
If you have several subject ids, you can also loop over them with for; for example, a bash script for unique clicks could look like this:
# Aggregate the hourly log (line format: clientId;SubjectId;)
for i in subj1 subj2 subj3; do
    # unique clicks: dedupe identical clientId;SubjectId; lines before counting
    uniqClicks=$(grep ';'$i';$' clickLog-20160907_12.log | sort -u | wc -l)
    # total clicks: count every matching line
    clicks=$(grep ';'$i';$' clickLog-20160907_12.log | wc -l)
    # save data here
done
After these manipulations you will have aggregated data ready for reporting, and the raw source data preserved for any future processing.
Your db also stays small and fast, while all the source data lives in files.
Problem 3
If you implement the solution from the Problem 2 section, the queries for statistics become so simple that your database will execute them very fast.
For example you can run this query in PostgreSQL:
SELECT
SubjectId,
sum(uniqueClicks) AS uniqueClicks,
sum(clicks) AS clicks
FROM
statistic_table
WHERE
Date BETWEEN '2016-09-01 00:00:00' and '2016-09-08 00:00:00'
GROUP BY
SubjectId
ORDER BY
sum(uniqueClicks) DESC
In this case, with 3 subject ids and hourly aggregation, you will get 504 new rows per week (3 subjects * 24 hours * 7 days), which is a really small amount of data for a database.
Alternatives
You can also use Google Analytics for all the calculations. In that case you need some extra steps, most of them configuration: enable the Google Analytics monitoring script on your site, configure goal support, and attach the subject id as additional data through the GA script API.
You can use the IP of the user or his MAC; if the user is registered on your site, you can use the user id.
For the second part you can use PHP's mt_rand() function:
mt_rand(min, max): if you want a number between 1 and 3, use mt_rand(1, 3);
Then use an array to store the three different headlines and use the randomly generated number to access the array.
Better yet, generate a number between 0 and 2, because arrays start at 0.
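Note that mt_rand() alone returns a new number on every request, so the same visitor would see a different headline each time. To keep it stable per user, you can derive the index from the user's unique value instead. A minimal sketch, assuming $userId holds whatever unique identifier you settled on (cookie ID, user id, IP):

$headlines = array(
    'Man Bites Dog',
    'This Man Unhinged His Jaw as He Approached A Dog, What Happens Next Will Shock You!',
    "Only 90's Kids Will Remember That Time a Man Bit a Dog",
);
// Hash the identifier into a stable bucket 0-2; same user, same headline.
$bucket = abs(crc32((string) $userId)) % count($headlines);
echo $headlines[$bucket];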
Assumptions
If A is a friend of B, B is also a friend of A.
I searched for this question and there are already lots of questions on Stack Overflow. But all of them suggest the same approach.
They create a friend table with three columns: from, to, and status. This serves both purposes: who sent the friend request, and who is friends with whom once the status is accepted.
But this means that if there are m users and each user has n friends, I will have m*n rows in the friends table.
What I was thinking is to store friends list in a text column. For every user I have a single row and a friends column which will have all accepted friends' IDs separated by a character, say | which I can explode to get all friends list. Similarly, I will have another column named pending requests. When a request is accepted, IDs will move from pending requests to friends column.
Now, this should significantly reduce the entries in the table and the search time.
The only overhead will be when I will have to delete a friend, I will have to retrieve the friend string, search the ID of the friend to be deleted, delete the ID and update the column. However, this is almost negligible if I assume a user cannot have more than 2000 friends.
I assume that I will definitely be forgetting some situations or this approach will have certain pitfalls. So please correct if so.
The answer is NO! Do not try to implement this idea; it's a complete disaster.
Let me describe more precisely why:
Relations. You are storing just keys separated with |. What if you want to display a list with the names of friends? You will have to fetch the list, explode it, and make another n queries to the DB. With a relation table (from | to | status) you can do it with one JOIN; see the sketch at the end of this answer.
Deletions. Just horrible.
Inserts. For every insert you will need a SELECT + UPDATE instead of a plain INSERT.
Types. You should keep items in the DB as what they are, so integers as integers. Converting ints to strings and back can cause errors and subtle bugs.
No ORM support. In the future you will probably move from plain PHP to a framework. Keep in mind that none of them will support this idea.
Search time?
Please do some tests: a search with WHERE on a PRIMARY KEY is very fast.
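For illustration, a minimal sketch of the one-JOIN lookup from the Relations point above, assuming placeholder tables users(id, name) and friends(from_id, to_id, status):

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder credentials
$stmt = $pdo->prepare(
    "SELECT u.id, u.name
     FROM friends f
     JOIN users u ON u.id = f.to_id
     WHERE f.from_id = :uid AND f.status = 'accepted'"
);
$stmt->execute(array(':uid' => $userId));
$friendNames = $stmt->fetchAll(PDO::FETCH_ASSOC); // one query, no explode()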
When storing relationship data for a user (potentially a thousand friends per user), would it be faster to create a new row for each relationship, or to concatenate all of their friends into a string and then parse it later?
I.e.
Primary id | Friend1ID | Friend2ID
         1 |       234 |      5789
         2 |      5789 |       234
Where the IDs are references to primary IDs in a 'Users' table.
Or for the 'Users' table to just have a column called friends which may look like this:
Primary id | FriendIDs
       234 | 5789.123.8474
      5789 | 234
I'm of the understanding that string concatenation and parsing is generally quite slow, so I'd be tempted to lean towards the first method. However, as the number of users grows, this becomes a case of selecting one row and parsing it vs. searching millions of rows for those matching the WHERE criteria.
Is one method distinctly faster than the other? Particularly as the number of users grows.
You should use a second table to store the friends.
Users Table
----------
userid | username
1 | Bob
2 | Mike
3 | John
Users Friends Table
--------------------
userid | friend_id
1 | 2
3 | 2
Here you can see that Mike is friends with both Bob and John... This is, of course, a very simple demonstration.
Your second option will not scale. Some people may have hundreds of thousands of friends, and storing every id in a single field is going to cause a headache further down the line: adding friends, removing friends, working out complex relationships between people. Lots of overhead.
Querying millions of records with a WHERE clause on a properly indexed table should take no more than a second, the first option is the better one.
The "correct" way would probably be keeping multiple rows. This allows for much easier statistical analysis and more complex queries (like friends of friends) without any hacky stuff. Integer storage size is also often smaller than string storage, even though you're repeating one ID - especially if you use an appropriately sized integer store (like mediumint).
It's also more maintainable, scalable (if they start getting a damn lot of friends) export and importable. The speed gain from concatenation, if any, wouldn't be worth the rest of the benefits.
If you wanted, for instance, to check whether Bob is a friend of Jane, that is a single-row lookup in the multiple-row implementation. In the single-row implementation you would have to fetch Bob's row, decode the field, and loop through it looking for Jane. DBMS optimization and indexing make the multiple-row implementation much faster here: with the primary key on (id, friendid), the check is pretty much instantaneous, since the table is indexed on exactly that key.
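A minimal sketch of that single-row check, with placeholder names and assuming PRIMARY KEY (id, friendid):

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder credentials
$stmt = $pdo->prepare('SELECT 1 FROM friendship WHERE id = :bob AND friendid = :jane');
$stmt->execute(array(':bob' => $bobId, ':jane' => $janeId));
$areFriends = (bool) $stmt->fetchColumn(); // one lookup on the composite key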
I believe the proper way to do it, which might also be faster, is to use a two-column table:
user | friend
1 | 2
1 | 3
It is simple, makes querying and updating much easier, and lets you have as many relationships as you want.
Don't overcomplicate the problem...
... Asking for the one "correct" way is itself the wrong question.
It depends on the case.
If your web application has a low access rate, having more rows won't change anything; on the other side of the coin (I'm not a native English speaker), for medium and large traffic it may be better to keep database access to the minimum possible.
To achieve that, as you already thought, you can concatenate the values, split them when the user logs in, and put everything into the $_SESSION superglobal.
At least this is what I think.
I have a PHP application and I need to store blacklist data. My site's members can add any user to their blacklist, so they won't see the posts of those users.
Every user's black list is different.
A user can have 1000-1500 users in his/her black list.
User can add/remove anybody from his/her list.
Black list will have member's id and black listed people's ids.
I'm trying to design a database table for this, but I'm not sure what the structure should be.
I have 7-8 MySQL tables but none of them is like this.
Way 1:
--member ID-----black listed people (BLOB)
-----------------------------------------
--1234----------(Some BLOB data)---------
--6789----------(Some BLOB data)---------
I can serialize the blacklisted people's IDs and save them in a BLOB column. When a user wants to edit the list, I fetch the BLOB from the table, remove the unwanted ID, and update the column with the new data. It seems like a slow operation when a user has 1k-2k IDs.
Way 2:
--member ID----black listed ID--------
--------------------------------------
--1234---------113434545--------------
--1234---------444445454--------------
--1234---------676767676--------------
--6789---------534543545--------------
--6789---------353453454--------------
In this way, when a user wants to see his/her blacklist, I return all users in the "black listed ID" column. When editing, I add/remove rows. This operation is fast, but the table can become huge over time.
Way 3:
--member ID----113434545----444445454----676767676---534543545-----353453454
----------------------------------------------------------------------------
--1234--------yes------------yes------------yes------------no------no-------
--6789--------no-------------no-------------no-------------yes------yes------
Yes means blacklisted, No means not blacklisted. I create a new column for each blacklisted person and update that column when a user adds or removes someone.
Way 4:
???
These are my ideas. I'd really appreciate it if you can offer a better one.
Thank you.
What you are creating is a so-called 1-to-n relation table.
3rd version
The 3rd version would require n rows x n columns, where n is the number of registered users. InnoDB has a limit of roughly 1000 columns, breaking your logic as soon as the limit is reached. Not to mention that you don't want to ALTER TABLE for every new user. Forget it.
1st version
The first solution is really slow: BLOB data isn't really indexed, it tends to be stored off-page (effectively doubling disk I/O), it has massive size overhead, sorting and grouping won't happen in RAM, and you have no efficient way to search backwards (how many people have blacklisted user xy?). As general advice, avoid BLOB until absolutely necessary.
2nd version
The second solution is the way to go. MySQL is optimized for this kind of thing, and a table with two numeric, indexed columns is really fast.
Table design
I would create a table consisting of
blocker_id | blocked_id
with no separate auto-increment primary key. Instead I would create a two-column primary key with blocker_id as the first column and blocked_id as the second. That way you save one B-tree (indexes are expensive to maintain), and you can search quickly both for everyone a blocker has blocked (using the left half of the key) and for the existence of a single combination. (The latter will matter most for filtering posts, and is what you should optimize for.)
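A sketch of that design, assuming MySQL/InnoDB and placeholder names:

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder credentials
$pdo->exec(
    'CREATE TABLE blacklist (
        blocker_id INT UNSIGNED NOT NULL,
        blocked_id INT UNSIGNED NOT NULL,
        PRIMARY KEY (blocker_id, blocked_id)
    ) ENGINE=InnoDB'
);
// Existence check when filtering posts; uses the full composite key:
$stmt = $pdo->prepare('SELECT 1 FROM blacklist WHERE blocker_id = :me AND blocked_id = :author');
$stmt->execute(array(':me' => $memberId, ':author' => $authorId));
$isBlocked = (bool) $stmt->fetchColumn();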
I think you should build the blacklist like way 2:
black_list_id | blocker | blocked
When you want to know whom a user blocks, you get it with SELECT * FROM black_list_table WHERE blocker = :user_id.
To get who is blocking the user: SELECT * FROM black_list_table WHERE blocked = :user_id.
You can easily count how many people block a user and how many people a user has blocked, and moreover you can put indexes on all the columns and fetch the other users' data with JOIN statements.
I would like to build a website that has some elements of a social network.
So I have been trying to think of an efficient way to store a friend list (somewhat like Facebook).
And after searching a bit the only suggestion I have come across is making a "table" with two "ids" indicating a friendship.
That might work for small websites, but it doesn't seem efficient at all.
I have a background in Java but I am not proficient enough with PHP.
An idea has crossed my mind which I think could work pretty well, problem is I am not sure how to implement it.
The idea is to store all the IDs of your friends in a tree data structure, where each node represents one digit of a friend's ID.
It starts with a single node, and more nodes are added as the user adds friends.
(A bit like Lempel–Ziv.)
Every node can point to 11 other nodes: 0 to 9 and X,
where X marks the end of an ID.
for example see this tree:
(tree diagram not reproduced here)
In this tree the user has 4 friends with the following "id"s:
0
143
1436
15
Update: in case it was unclear before, the idea is that every user has a tree, in the form of a multidimensional array, in which the mere existence of the pointers encodes the friends' IDs.
If every user had such an array, then checking whether ID y is a friend of mine, deleting ID y from my friend list, or adding ID y to it would all take constant time O(1), independent of how many users the website has. The only drawback is that serializing such a huge array and pushing it into each row of the table just doesn't seem right.
-Is this even possible to implement?
-Would using serializing to insert that tree into a table be practical?
-Is there any better way of doing this?
The benefit that led me to this choice is that even with a really large number of IDs (millions or billions), the search/add/delete time grows only with the number of digits in an ID.
I'd greatly appreciate any help with implementing this or any suggestions for alternative ways to improve or change this method.
I would strongly advise against this.
Storage savings are not significant, and may well be negative. In a real dataset, the actual space savings afforded by this approach are minimal. Computing the average savings is a very difficult problem, but take some real numbers and try a few samples with random IDs: if you have a million users, consider a user with 15 friends; how much data do you save with this approach? You may actually use more space, since tree adjacency models can require significant bookkeeping data.
"Rendering" a list of users requires CPU investment.
Inserts are non-deterministic and non-trivial. When you add a new user to an existing tree, you will have a variety of methods of inserting them. Assuming you don't choose arbitrarily, it is difficult to compute which approach is the best (and would only be based on heuristics).
Those are the big ones that came to mind. Generally, though, I think you are over-thinking this.
You should check out OQGRAPH, the Open Query graph storage engine. It is designed to handle efficient tree and graph storage for MySQL.
You can also check out my presentation Models for Hierarchical Data with SQL and PHP, or my answer to What is the most efficient/elegant way to parse a flat table into a tree? here on Stack Overflow.
I describe a design I call Closure Table, which records all paths between ancestors and descendants in a hierarchy.
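For reference, a minimal sketch of the Closure Table idea with placeholder names (not code from the presentation): one row per (ancestor, descendant) pair in the hierarchy, which keeps "all descendants" queries to a single indexed SELECT:

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder credentials
$pdo->exec(
    'CREATE TABLE tree_paths (
        ancestor   INT UNSIGNED NOT NULL,
        descendant INT UNSIGNED NOT NULL,
        PRIMARY KEY (ancestor, descendant)
    ) ENGINE=InnoDB'
);
// All descendants of node 4, at any depth, in one query:
$stmt = $pdo->prepare('SELECT descendant FROM tree_paths WHERE ancestor = :node');
$stmt->execute(array(':node' => 4));
$descendants = $stmt->fetchAll(PDO::FETCH_COLUMN);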
You say 'using PHP' in the title, but this is really a database question at heart. And believe it or not, the linking table is by far the best way to go, especially if you have millions or billions of users. It is faster to process, easier to handle in the PHP code, and smaller to store.
Update
Users table:
id | name | moreInfo
1 | Joe | stuff
2 | Bob | stuff
3 | Katie | stuff
4 | Harold | stuff
Friendship table:
left | right
1 | 4
1 | 2
3 | 1
3 | 4
In this example Joe knows everyone and Katie knows Harold.
This is of course a simplified example.
I'd love to hear if someone has better logic for left and right, and an explanation of why.
Update
I gave some PHP code in a comment below, but it was marked up incorrectly, so here it is again:
$sqlcmd = sprintf(
    'SELECT IF(`left` = %1$d, `right`, `left`) AS friend
     FROM `friendship`
     WHERE `left` = %1$d OR `right` = %1$d',
    $userid
); // returns the other side of every friendship row involving $userid
A few ideas:
ordered lists - searching an ordered list is fast, though keeping it ordered may be the heavier part;
horizontal partitioning of the data;
getting rid of premature optimizations.