I am building a Facebook application that accesses the user's friends through the Open Graph, and I need to store them in a database.
The database contains the user id, name, and some other info needed by my app. To prevent duplicate entries, I made the user id column unique using phpMyAdmin. This works fine for many values, but at the same time it fails badly.
Let's say these values are unique according to MySQL:
51547XXXX
52160XXXX
52222XXXX
52297XXXX
52448XXXX
But if the ids become
5154716XX
5154716XX
or
5216069673X
521606903XX
then MySQL counts them as the same and discards one of them.
The result: say I am entering my friend list into the table. It should have 830 records, and if I do not use the unique constraint, that's the value I get.
But as soon as unique is activated, I get just 375, which means 455 records are discarded as duplicates of earlier data.
The only solution I can think of is comparing the data with PHP first and then logging it into the database, but with that many queries it would take a very long time. Google cannot answer this, I don't know why. :(
Facebook user ids are too big to fit into MySQL's INT type (which is 32-bit). Out-of-range values are clipped to INT's maximum (2147483647), so every oversized id collapses to the same stored value and the unique constraint rejects all but the first. You need the BIGINT type, which is 64-bit and can therefore handle the full range of ids Facebook uses.
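A minimal sketch of the fix, assuming a hypothetical friends table with a user_id column:

-- INT tops out at 2147483647; larger ids get clipped to that maximum,
-- which is why they all collide on the unique column.
ALTER TABLE friends MODIFY user_id BIGINT UNSIGNED NOT NULL;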
I have many tables in my database. An example is the table fs_user; the following is an extract of its columns (the ones dealing with privacy settings).
4 Columns from the table fs_user:
show_email_to
show_address_to
show_gender_to
show_interested_in_to
Like many social networks, I need to specify not only which data is private and which is public, but also which data is visible to chosen users and which is not.
As I have about 30 fields like the 4 above, I think it would be bad to create one table for every field and make a many-to-many relation with the table fs_user.
This is why I got the idea of saving this data in JSON form in every such column (of type TEXT), for example:
show_email_to => {1:'ALL',2:'BUT',3:'3'}
This data means: show the email to all users, except the user whose id=3.
Another example:
show_email_to => {1:'NONE',2:'BUT',3:'3',4:'80',5:'10'}
This means no user will see the email except the users with id=3, id=80, and id=10.
Of course, the MySQL query will select this data, and PHP/JS will extract what I need from the JSON.
Another point: sometimes a user wants to show data only to his friends, except 3 of them.
This would be:
show_email_to => {1:'FRIENDS',2:'BUT',3:'3'}
This means that the email will be shown to all his friends, except the user with id=3.
My question is: how performant and flexible (for other uses) would this system be, compared to the many-to-many solution (which requires spreading the data across many tables)?
Note: I already know that saving many elements in one column is bad practice, but here I think this is a JSON element and can be considered a single object.
This is a good question. What you propose is, with respect, a very bad idea indeed if you're using any flavor of SQL: you are proposing to denormalize your tables in a way that will defeat every attempt to speed up searching or querying in the future.
What should you do instead? You could take a look at an XML-centric DBMS like MarkLogic. It can create indexes that accelerate various XPath-style queries, so you would be able to search on relationships. If you go that route, I hope you have a big budget.
Or, you could use normalized permission tables.
item_to_show (item id)
order (an integer specifying rule ordering, needed for this)
recipient (user id)
isdenied (0 means recipient is allowed, 1 means she is denied)
In this table, the primary key is a compound key constructed of the first two columns.
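A sketch of that table in MySQL (names are illustrative; ORDER is a reserved word, so I've called that column rule_order):

CREATE TABLE item_permission (
  item_to_show INT NOT NULL,         -- item id
  rule_order   INT NOT NULL,         -- rule ordering
  recipient    INT NOT NULL,         -- user id
  isdenied     TINYINT(1) NOT NULL,  -- 0 = allowed, 1 = denied
  PRIMARY KEY (item_to_show, rule_order)
);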
I'm aware that you have many types of items. You assert that it's bad to have an extra table for each item type in your system. I don't agree that it's inherently bad; I believe your proposed solution is far worse.
You could arrange to give each item a unique id number so that a single permission table suffices. See this for an example of how to do that: Fastest way to generate 11,000,000 unique ids
Or you could have a single permission table with a type id.
item_to_show (item id)
item_type_to_show (item type id)
order (an integer specifying rule ordering, needed for this)
recipient (user id)
isdenied (0 means recipient is allowed, 1 means she is denied)
In this case the primary key is the first three columns.
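The same sketch, extended with the type id (again, names are illustrative):

CREATE TABLE item_permission (
  item_to_show      INT NOT NULL,  -- item id
  item_type_to_show INT NOT NULL,  -- item type id
  rule_order        INT NOT NULL,  -- rule ordering
  recipient         INT NOT NULL,  -- user id
  isdenied          TINYINT(1) NOT NULL,
  PRIMARY KEY (item_to_show, item_type_to_show, rule_order)
);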
Or, you can do what you don't want to do and have a separate permission table for each item type.
You say, "As I have about 30 fields like the 4 above, I think it would be bad to create one table for every field and make a many-to-many relation with the table fs_user."
I agree with the first part of your statement only. You need just one table; for the sake of a name, I'll call it ShowableItems. Its fields would be ShowableItemId (PK) and Item. Some of these items would be email, gender, address, etc.
Then you need a many-to-many table that records which items can be shown to whom. Its three fields would be the id of the person who owns the item, the showable item id, and the id of the person who can see it.
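A rough sketch of the two tables; the many-to-many table's name and column names beyond those mentioned are my own:

CREATE TABLE ShowableItems (
  ShowableItemId INT AUTO_INCREMENT PRIMARY KEY,
  Item           VARCHAR(50) NOT NULL  -- 'email', 'gender', 'address', ...
);

CREATE TABLE UserShowableItems (
  owner_id       INT NOT NULL,  -- who owns the item
  ShowableItemId INT NOT NULL,  -- which item
  viewer_id      INT NOT NULL,  -- who may see it
  PRIMARY KEY (owner_id, ShowableItemId, viewer_id)
);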
I'm making a table (with MySQL) to store some data, but I'm not sure how to do it properly because of the amount of data. For example, suppose it's an address book database.
There is a table for users and a table for contacts. Each user can own hundreds of contacts, and there could be thousands of users. Should I add a new row for every single contact (that will make a lot of rows!), or can I just concatenate all of them into one row with the user id?
This is just an example, but in my case, once contacts are INSERTed they will never be UPDATEd; no modifications, they can only be DELETEd.
To go by the normal forms, you should have three tables
1) Users -> {User_id} (primary key)
2) Contacts -> {Contact_id} (primary key)
3) Users_Contacts -> {User_id, Contact_id} (Compound key)
The junction table Users_Contacts will have one record per contact, meaning one record for each unique value of User_id + Contact_id.
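In MySQL those three tables might look like this (a sketch; the non-key fields are placeholders):

CREATE TABLE Users (
  User_id INT AUTO_INCREMENT PRIMARY KEY
  -- other user fields ...
);

CREATE TABLE Contacts (
  Contact_id INT AUTO_INCREMENT PRIMARY KEY
  -- other contact fields ...
);

CREATE TABLE Users_Contacts (
  User_id    INT NOT NULL,
  Contact_id INT NOT NULL,
  PRIMARY KEY (User_id, Contact_id)  -- one record per user+contact pair
);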
However, in practice it is not always necessary to stick to the rule book. Depending on the use case, it can be advisable to have a denormalized table. The call is yours.
There is also the option of mixing NoSQL techniques with MySQL: for example, the contacts can be serialized into JSON and stored. MySQL 5.7 supports JSON as a native column type. See this for details.
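For illustration, using MySQL 5.7's native JSON type (the table and column names here are mine):

CREATE TABLE user_contacts (
  user_id  INT PRIMARY KEY,
  contacts JSON NOT NULL  -- e.g. '[{"name": "Alice", "phone": "555-1234"}]'
);

-- Pull one field back out on the server side:
SELECT JSON_EXTRACT(contacts, '$[0].name') FROM user_contacts WHERE user_id = 1;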
Say, for example, you add 3 contacts for a single user. Since, as you mentioned, you will be deleting contacts, it's better to insert all three contacts, each in a new row with its user id: if you later want to delete any one of the 3 contacts, it will be easy.
If you concatenate all the contacts for a user into one row, you could run into many issues. What if the requirements change in the future and you need a layout of all the contacts for a user, with edit/delete for individual contacts? So you should have one contact in each row.
You can optimize your query by indexing the columns.
Say user id 1234 has 1000 contacts in the contact table. The primary key of the contact table, idcontact, is indexed by default; if the contact table also has a field called iduser that is indexed as well, then selects by iduser on the contact table will be fast.
This is the best approach when using a MySQL database. Many apps maintain millions of rows this way, so it will be fine to have a contact table with a new row for each contact.
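A sketch of that index and the query it speeds up, using the contact table and column names from above:

CREATE INDEX idx_iduser ON contact (iduser);

-- With the index in place, this only touches the matching rows:
SELECT * FROM contact WHERE iduser = 1234;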
I wouldn't worry about lots of rows. Keep in mind the granularity of control the user will expect (deleting/adding a contact, rearranging the list based on different factors, etc.). It's always better to break things out into their own rows if they will be treated independently from similar items (contacts, users, addresses, etc.). Additionally, if you concatenate your data, reordering it for display or removing parts of it becomes extremely resource intensive, whereas MySQL is designed to do exactly that "on the cheap".
MySQL can easily handle millions of rows of data. If you are worried about speed, just make sure your indexes are in place before your data set grows too big (at a guess, you'll need to index the user ID the contact belongs to and the first/last names). Indexes are a double-edged sword, however: they take up disk space but allow fast querying of large data sets. So don't go overboard and index everything; index only what you'll sort or search by.
(Why on earth will contacts never be updated?...)
In a table of Users, I want to keep track of the time of day each user logs in as running totals. For example
UserID   midnightTo6am   6amToNoon   noonTo6pm   6pmToMidnight
User1    3               2           7           1
User2    4               9           1           8
Note that this is part of a larger table that contains more information about a user, such as address and gender, hair color, etc, etc.
In this example, what is the best way to store this data? Should it be part of the users table, even though not every user will log in during every time slot (a user may never log in between 6am and noon)? Or does this table fail 1NF because of repeating columns that should be moved to a separate table?
If stored as part of the users table, there may be empty cells that never get populated because the user never logs in at that time.
If this is a 1NF failure and the data belongs in a separate table, how would I ensure that a +1 for a certain time slot goes smoothly? Would I search the separate table to see if the user has logged in at that time before, and +1 if so? Or add a row to that table if it is their first login during that time period?
Any clarifications or other solutions are welcome!
I would recommend storing the login events either in a file-based log or in a simple table with just the user id and the DATETIME of the login.
Once a day, or however often you need to report on the data you illustrated in your question, aggregate that data into a table of the shape you want. This way you're not throwing away any raw data and can always re-aggregate for different periods, by hour, etc., at a later date.
Addition: I suspect the fastest way of deriving the aggregated data would be to run a number of range queries, one per aggregation period, so you're searching for (e.g.) login dates in the range 2011-12-25 00:00:00 to 2011-12-25 06:00:00. If you go with that approach, an index on (datetime, user_id) would work well. It seems counter-intuitive, since you want to do everything on a user-centric basis, but the index on the DATETIME field allows the rows to be found quickly, and the trailing user_id column then allows fast grouping.
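A sketch of the raw log and one of those range queries, with hypothetical names:

CREATE TABLE login_events (
  login_at DATETIME NOT NULL,
  user_id  INT NOT NULL,
  INDEX (login_at, user_id)  -- datetime leading, user_id trailing
);

-- Count logins per user for the midnight-to-6am bucket:
SELECT user_id, COUNT(*) AS logins
FROM login_events
WHERE login_at >= '2011-12-25 00:00:00' AND login_at < '2011-12-25 06:00:00'
GROUP BY user_id;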
A couple of things. First, this is not a violation of 1NF; doing it as 4 columns may in fact be acceptable. Second, if you do go with this design, you should not use NULLs; use zero instead (with the possible exception of existing records). Finally, whether you should use this design or split it into another table (or two) depends on your purpose and usage. If your standard use of the table does not touch this information, it should go into another table with a 1-to-1 relationship. If you may need to increase the granularity of the login times, you should also use another table. And if you do split this off into another table with a timestamp, give some consideration to privacy.
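If you keep the four counter columns, the "+1" can be done in one statement rather than a search-then-update. A sketch, assuming a hypothetical user_logins table whose primary key is UserID:

INSERT INTO user_logins (UserID, `midnightTo6am`, `6amToNoon`, `noonTo6pm`, `6pmToMidnight`)
VALUES (42, 1, 0, 0, 0)
ON DUPLICATE KEY UPDATE `midnightTo6am` = `midnightTo6am` + 1;
-- Creates the row with zeros on the user's first login, and increments
-- the appropriate bucket on every later login in that time slot.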
I am thinking of storing online users in memcached.
First I thought about having an array of key => value pairs, where the key is the user id and the value is the timestamp of last access.
My problem is that this array will be quite large when many users are online at once (and there will be many).
As memcached is not built to store large values, how would you solve this? What is the best practice?
Thanks for your input!
The problem with this approach is that memcache is only queryable if you know the key in advance. This means you would have to keep the entire online-user list under a single known key. Each time a user came online or went offline, you would have to read the list, adjust it, and rewrite it. There is serious potential for a race condition there, so you would have to use the check-and-set (CAS) locking mechanism.
I don't think you should do this. Consider keeping a database table of recent user hits:
user_id: int
last_seen: timestamp
Index on last_seen and user_id. Query which friends are online using:
SELECT user_id FROM online WHERE user_id IN (...) AND last_seen > NOW() - INTERVAL 10 MINUTE;
Periodically go through the table and batch remove old timestamp rows.
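A sketch of that table and the periodic cleanup, assuming the names above and one row per user, refreshed on each hit:

CREATE TABLE online (
  user_id   INT NOT NULL PRIMARY KEY,
  last_seen TIMESTAMP NOT NULL,
  INDEX (last_seen)
);

-- Run this periodically to batch-remove stale rows:
DELETE FROM online WHERE last_seen < NOW() - INTERVAL 1 HOUR;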
When your site becomes big you can shard this table on the user_id.
EDIT:
Actually, you can do this if you don't need to query for all users who are currently online, but just need to know whether certain users are online.
When a user hits a page,
$memcache->set("online." . $user_id, true, 600); // 10 minutes, using the Memcached extension
To see if a user is online,
$online = $memcache->get("online." . $user_id);
There is also a way to multi-key query memcache (getMulti in PHP's Memcached extension); look that up. Some strange things can happen when you add a memcache server, but if you use this information just for putting an "online" marker next to user names, that shouldn't be a big deal.
I'm used to building websites with user accounts, where I can simply auto-increment the user id and let users log in while I identify each user by id internally. What I need to do in this case is a bit different: I need to anonymously collect a few rows of data from people and tie those rows together so I can easily tell which data rows belong to which user.
The difficulty I'm having is in generating the id that ties the data rows together. My first thought was to poll the database for the highest user ID in existence and write to the database with that user ID + 1. This fails, however, if two submissions poll the database before either of them writes to it: they will each end up with the same user ID.
Another thought I had was to create a separate user ID table set to auto-increment, insert a new row, then poll that table for the id of the last row created. That also fails for the same reason: if two submissions create a row before either of them polls for the latest user ID, they'll end up sharing an ID.
Any ideas? I get the impression I'm missing something obvious.
I think I'm understanding you right; I was having a similar issue. There's a super handy PHP function for this. After the INSERT query that creates the new row and auto-increments the user ID, do:
$user_id = mysql_insert_id();
That returns the auto-increment value generated by the previous query on the current MySQL connection, so concurrent inserts on other connections cannot interfere. You can read more about it in the PHP manual if you need to.
You can then use this id to populate the second table's rows, certain that nobody will get a duplicate ID from the first table.
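The same thing can be seen on the SQL side with LAST_INSERT_ID(), which is likewise scoped to the current connection (the table names here are hypothetical):

INSERT INTO users () VALUES ();  -- auto-increments the user id
SET @uid = LAST_INSERT_ID();     -- safe: per-connection, so no collisions
INSERT INTO data_rows (user_id, payload)
VALUES (@uid, 'first row'), (@uid, 'second row');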
You need to insert the user, get the auto-generated id, and then use that id as a foreign key in the few rows you need to associate with the parent record. The hat rack must exist before you can hang hats on it.
This is a common issue, and to solve it you would use a transaction. A transaction gives you atomicity: you can do more than one thing, but the whole package either succeeds or fails together. It's an advanced db feature, and implementing it in as fault-tolerant a manner as possible does require some awareness of more advanced programming.
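A minimal sketch, reusing the hypothetical users and data_rows tables from above:

START TRANSACTION;
INSERT INTO users () VALUES ();
SET @uid = LAST_INSERT_ID();
INSERT INTO data_rows (user_id, payload) VALUES (@uid, 'row 1');
INSERT INTO data_rows (user_id, payload) VALUES (@uid, 'row 2');
COMMIT;  -- on any error, ROLLBACK instead, so no half-written user is left behind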