How can I make this SQL query the most effective? - php

I am making a website with a large pool of images added by users.
I want to choose randomly one image out of this pool, and display it to the user, but I want to make sure that this user has never seen this image before.
So I was thinking: when a user views an image, I INSERT a row in MySQL that says "This USER has watched THIS IMAGE at (TIME)" for every view.
But the thing is, since there might be a lot of users and a lot of images, this table can easily grow to tens of thousands of entries quite rapidly.
So alternatively, it might be done like that:
I was thinking of making one row INSERT for every USER, and in ONE field, store an array of all IDs of images that user has watched.
I can even do that to the array:
base64_encode(gzcompress(serialize($array)))
And then:
unserialize(gzuncompress(base64_decode($array)))
What do you think I should do?
Are the encoding/decoding functions fast enough, or at least faster than the conventional way I was describing at the beginning of the post?
Is that compression good enough to store large chunks of data in only ONE database field? (Imagine the user has viewed thousands of images.)
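For reference, the round-trip can be sketched in Python, using json/zlib/base64 as stand-ins for PHP's serialize/gzcompress/base64_encode (the function names here are just for the sketch):

```python
import base64
import json
import zlib

def pack(ids):
    # Stand-in for base64_encode(gzcompress(serialize($ids)))
    return base64.b64encode(zlib.compress(json.dumps(ids).encode()))

def unpack(blob):
    # Reverse: base64-decode, decompress, deserialize
    return json.loads(zlib.decompress(base64.b64decode(blob)))

seen = list(range(1, 5001))      # 5,000 image IDs
blob = pack(seen)
assert unpack(blob) == seen      # the round-trip is lossless
print(len(blob))                 # compressed + encoded size in bytes
```

Running something like this shows how large the single field gets as the view list grows.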
Thanks a lot

in ONE field, store an array of all IDs
In almost all cases, serializing values like this is bad practice. Let the database do what it's designed to do -- efficiently handle large amounts of data. As long as you ensure that your cross table has an index on the user field, retrieving the list of images that a user has seen will not be an expensive operation, regardless of the number of rows in the table. Tens of thousands of entries is nothing.

You should create a new table UserImageViews with columns user_id and image_id (additionally, you could add more information on the view, such as Date/Time, IP and Browser).
That will make queries like "What images the user has (not) seen" much faster.
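A minimal sketch of that design and the "pick an image the user has not seen" query, using SQLite in place of MySQL and placeholder table/column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE images (image_id INTEGER PRIMARY KEY);
    CREATE TABLE user_image_views (
        user_id  INTEGER NOT NULL,
        image_id INTEGER NOT NULL REFERENCES images (image_id),
        PRIMARY KEY (user_id, image_id)  -- also serves as the index on user_id
    );
""")
conn.executemany("INSERT INTO images VALUES (?)", [(i,) for i in range(1, 101)])
# Suppose user 1 has already seen images 1-50.
conn.executemany("INSERT INTO user_image_views VALUES (1, ?)",
                 [(i,) for i in range(1, 51)])

# Pick one random image user 1 has NOT seen yet.
row = conn.execute("""
    SELECT image_id FROM images
    WHERE image_id NOT IN
          (SELECT image_id FROM user_image_views WHERE user_id = ?)
    ORDER BY RANDOM() LIMIT 1
""", (1,)).fetchone()
print(row[0])  # some image_id between 51 and 100
```

The composite primary key makes both the "has this user seen this image?" check and the "what hasn't this user seen?" query index-backed.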

You should use a table. Serializing data into a single field in a database is a bad practice, as the DBMS has no clue what that data represents and cannot be used in ANY queries. For example, if you wanted to see which users had viewed an image, you wouldn't be able to in SQL alone.
Tens of thousands of entries isn't much, BTW. The main application we develop has multiple tables with hundreds of thousands of records, and we're not that big. Some web applications have tables with millions of rows. Don't worry about having "too much data" unless it starts becoming a problem - the solutions for that problem will be complex and might even slow down your queries until you get to that amount of data.
EDIT: Oh yeah, and joins against those 100k+ tables happen in under a second. Just some perspective for ya...

I don't really think that tens of thousands of rows will be a problem for a database lookup. I'd recommend the first approach over the second.

I want to choose randomly one image out of this pool, and display it
to the user, but I want to make sure that this user has never seen
this image before.
For what it's worth, that's not a random algorithm; that's a shuffle algorithm. (Knowing that will make it easier to Google when you need more details about it.) But that's not your biggest problem.
So i was thinking that: when a user views an image, I make a row
INSERT in MYSQL that would say "This USER has watched THIS IMAGE at
(TIME)" for every entry.
Good thought. Using a table that stores the fact that a user has seen a specific image makes sense in your case. Unless I've missed something, you don't need to store the time. (And you probably shouldn't. It doesn't seem to serve any useful business purpose.) Something along these lines should work well.
-- Predicate: User identified by [user_id] has seen image identified by
-- [image_filename] at least once.
create table images_seen (
user_id integer not null references users (user_id),
image_filename varchar(255) not null references images (image_filename),
primary key (user_id, image_filename)
);
Test that and look at the output of EXPLAIN. If you need a secondary index on image_filename . . .
create index images_seen_img_filename on images_seen (image_filename);
This still isn't your biggest problem.
The biggest problem is that you didn't test this yourself. If you know any scripting language, you should be able to generate 10,000 rows for testing in a matter of a couple of minutes. If you'd done that, you'd find that a table like that will perform well even with several million rows.
I sometimes generate millions of rows to test my ideas before I answer a question on Stack Overflow.
Learning to generate large amounts of random(ish) data for testing is a fundamental skill for database and application developers.
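A throwaway generator along those lines might look like this (SQLite for convenience; the row counts and file names are arbitrary):

```python
import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE images_seen (
    user_id        INTEGER NOT NULL,
    image_filename TEXT NOT NULL,
    PRIMARY KEY (user_id, image_filename))""")

# 10,000 random (user, image) pairs; duplicates are possible, so ignore them.
rows = {(random.randint(1, 500), f"img_{random.randint(1, 2000)}.jpg")
        for _ in range(10_000)}
conn.executemany("INSERT OR IGNORE INTO images_seen VALUES (?, ?)", rows)

start = time.perf_counter()
seen = conn.execute(
    "SELECT image_filename FROM images_seen WHERE user_id = ?",
    (42,)).fetchall()
elapsed = time.perf_counter() - start
print(len(seen), f"{elapsed * 1000:.2f} ms")  # the lookup is effectively instant
```

Bump the row counts to millions and re-time the lookup; with the primary key in place it stays fast.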

Related

Design a MySQL table(s) to store data of uploaded files

I'm doing some redesigning of our database: I'm building new tables to hold data about files users have uploaded. The overlying issue is that there are a bunch of different types of files users may upload. They may, for example, upload an mp3 file as a song, a profile picture, a profile cover photo, etc. I'm running into a few design and practical issues, though, and am trying to figure out the best way to do this. At the moment the main design looks something like this:
ID | name | type | amazon_S3_info
ID: auto-increment an ID for every new upload.
name: the name of the upload, i.e. the file name
type: what type of upload it is, for example a profile picture, a cover photo, audio file etc.
amazon_S3_info: I'm storing all the files in S3 and this field holds the data so I can generate the URL. I can't store a URL here, since I'm using signed URLs and they always need to be regenerated from the data stored in this field.
After creating a table like this I could then just make matching tables where I, for example, create the relationships between a user ID and the upload ID of the profile picture they uploaded, etc. which is pretty simple.
My original idea was to break this whole thing up into multiple tables, meaning I would make 1 table for profile pictures, 1 for cover photos, etc. The reason this would become a bit of a headache on the PHP side is that I have one standard function which uses an ID to retrieve the file URL for these files. If I have multiple tables, then each type of upload could share the same ID, rendering my current URL retrieval useless. This is already in use all over the site and would be a nuisance to redo; however, if it needs to be redone, it needs to be.
To be clear, the idea behind breaking it up into a few tables was speed. My logic is that it would be more efficient to break a single table that could reach 2,000,000 rows into 4 tables of 500,000. Would it be quicker to pull data from each of those 500,000-row tables, or is that a false premise?
So my question to you lot is which database design is better, particularly when we are talking about scaling to be quite large?
With databases (and computers in general), you're usually worried about factors of 10, not just 2x or 3x.
So splitting up the table by type into multiple tables, say 5 tables altogether instead of 1, ultimately won't solve your performance issues once the data grows extremely large. And like you said, it's a programming pain. (Basically you'd be sharding manually without an algorithm... if you're going to shard, you might as well use a hash-shard algorithm to find the database/table.)
The design you have is standard many to many. Index the tables correctly and that's the best you can do.
If performance becomes a problem, you need to scale horizontally. Relational datastores don't do this well, but NoSQL datastores do. You can have those types of references in NoSQL as well. If design changes are still possible, look into AWS DynamoDB (NoSQL service).
Edit: to respond to a comment...
#arian1123 In my experience, there's a point (table size) where all of a sudden MySQL starts performing poorly. The more hardware (especially memory) you have, the larger the tables can grow before this happens. (The killer is joins. If you don't join big tables on big tables, then a big table by itself can probably grow very large with adequate hardware; I've dealt with 1 billion+ row tables where only reads were done, with no joins, and it wasn't a problem.)
On your own laptop, you may see 100k-row tables performing fine and 1M-row tables not. If the data isn't going to grow anymore and that's the hardware you'll have in production, then splitting would be a good idea. However, if the table size is always going to increase, like the 50M you mention, then splitting it up would only help if you could split indefinitely (like dividing the table again every 2 million rows). In your case you aren't going to keep dividing 1 table into 4, then 20, then 100... so I think it'd be better to leave it as 1 table, and if it doesn't perform, look into other datastore types.

How to scale mysql tables for growth

So I'm working on site that will replace an older site with a lot of traffic, and I will also have a lot of data in the DB, so my question to you guys is what is the best way to design mysql tables for growth?
I was thinking to split let's say a table with 5 000 000 rows in 5 tables,with 1 000 000 rows/table and create a relationship between the tables, but I guess this isn't a good option since I will spend a lot of resources and time to figure out in what table my data is.
Or can you guys give me some tips, maybe some useful articles?
No, you're absolutely right about the relationships. This technique is called normalization: you define separate tables because those individual tables change over time independently of the other tables.
So if you have a hotel database that keeps track of rooms and guests, then you know normalization is necessary, because rooms and guests are independent of each other.
But you will have foreign keys/surrogate keys in each table (for instance, room_id) that relate a particular guest to a particular room.
Normalization, in your case, could help you optimize those 5,000,000 rows, as it would not be optimal for a loop to go over 5,000,000 elements and retrieve the entire data set.
Here is a strong example for why normalization is essential in database management.
Partitioning as mentioned in a comment is one way to go, but the first path to check out is even determining if you can break down the tables with the large amounts of data into workable chunks based on some internal data.
For instance, let's say you have a huge table of contacts. You can essentially break the data down into contacts whose names start with a-d, e-j, etc. Then when you go to add records you just make sure you add each record to the correct table (I'd suggest checking out stored procedures for handling this, so that the logic is kept in the database). You'd probably also set up stored procedures to get data from the same tables. By doing this, however, you have to realize that auto-incrementing IDs won't work correctly, as you won't be able to maintain unique IDs across all of the tables without doing some of the work yourself.
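A hypothetical routing helper for that alphabetical scheme (the table names and letter ranges here are invented for illustration):

```python
def contacts_table(last_name):
    """Map a contact's last name to the shard table that stores it."""
    first = last_name[0].lower()
    for table, letters in [("contacts_a_d", "abcd"),
                           ("contacts_e_j", "efghij"),
                           ("contacts_k_r", "klmnopqr"),
                           ("contacts_s_z", "stuvwxyz")]:
        if first in letters:
            return table
    return "contacts_other"  # digits, punctuation, non-Latin names

print(contacts_table("Davis"))   # contacts_a_d
print(contacts_table("Smith"))   # contacts_s_z
```

Both inserts and lookups would call the same routing function, which is exactly the logic the answer suggests keeping in stored procedures instead of application code.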
These of course are the simple solutions. There are tons of solutions for large data sets which also includes looking at other storage solutions, clustering, partitioning, etc. Doing some of these things manually yourself can give you a little bit of an understanding on some of the possibly "manual solutions".

Which is faster in SQL: many Many MANY tables vs one huge table?

I am in the process of creating a website where I need to have the activity for a user (similar to your inbox in stackoverflow) stored in sql. Currently, my teammates and I are arguing over the most effective way to do this; so far, we have come up with two alternate ways to do this:
Create a new table for each user and have the table name be theirusername_activity. Then when I need to get their activity (posting, being commented on, etc.) I simply get that table and see the rows in it...
In the end I will have a TON of tables
Possibly Faster
Have one huge table called activity, with an extra field for their username; when I want to get their activity I simply get the rows from that table "...WHERE username=".$loggedInUser
Less tables, cleaner
(assuming I index the tables correctly, will this still be slower?)
Any alternate methods would also be appreciated
"Create a new table for each user ... In the end I will have a TON of tables"
That is never a good way to use relational databases.
SQL databases can cope perfectly well with millions of rows (and more), even on commodity hardware. As you have already mentioned, you will obviously need usable indexes to cover all the possible queries that will be performed on this table.
Number 1 is just plain crazy. Can you imagine having to manage it, and seeing all those tables?
Can you imagine the backup! Or the dump! That many CREATE TABLEs... that would be crazy.
Get yourself a good index, and you will have no problem sorting through records.
Here we're talking about MySQL, so why might it be faster to make separate tables?
Query cache efficiency: each insert from one user wouldn't empty the query cache for the others.
Memory & pagination: used tables would fit in the buffers; unused data would simply not be loaded there.
But as everybody here said, it seems quite crazy in terms of management. And in terms of performance, having a lot of tables adds another problem in MySQL: you may run out of file descriptors or simply wipe out your table cache.
It may be more important here to choose the right engine, like MyISAM instead of InnoDB, as this is an insert-only table. And as #RC said, a good partitioning policy would fix the memory & pagination problem by avoiding loading rarely used data into active memory buffers. This should be paired with intelligent application design as well: avoid loading the entire activity history by default; if you reduce it to recent activity and restrict full history-table parsing to batch processes and advanced screens, you'll get a nice effect from the partitioning. You could even try a user-based partitioning policy.
As for query cache efficiency, you'll get a bigger gain by using an application-level cache (like memcached) with per-user history elements saved there, emptied at each new insert.
You want the second option, and you add the userId (and possibly a separate table for userid, username, etc.).
If you do a lookup on that ID on a properly indexed field, you'd only need something like log(n) steps to find your rows. That is hardly anything at all. It will be way faster, way clearer and way better than option 1. Option 1 is just silly.
In some cases, the first option is, in spite of not being strictly "the relational way", slightly better, because it makes it simpler to shard your database across multiple servers as you grow. (Doing this is precisely what allows wordpress.com to scale to millions of blogs.)
The key is to only do this with tables that are entirely independent from one user to the next -- i.e. never queried together.
In your case, option 2 makes the most sense: you'll almost certainly want to query the activity across all or some users at some point.
Use option 2, and not only index the username column but partition the table (consider a hash partition) on that column as well. Partitioning on username will give you some of the same benefits as the first option while letting you keep your sanity. Partitioning and indexing the column this way provides a very fast and efficient means of accessing data based on the username/user_key. When querying a partitioned table, the SQL engine can immediately lop off partitions it doesn't need to scan, because it can tell from the queried username value which partition could contain matching records (in this case, only one partition can contain records tied to that user). If you need to shard the table across multiple servers in the future, partitioning doesn't hinder that ability.
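The way a hash partition prunes every query for one user down to a single partition can be modeled roughly like this (a simplified sketch, not MySQL's exact hash function):

```python
import zlib

NUM_PARTITIONS = 8  # arbitrary choice for the sketch

def partition_for(username):
    """Deterministically map a username to one of NUM_PARTITIONS buckets,
    so every row for that user, and every lookup on that username,
    lands in the same partition."""
    return zlib.crc32(username.encode()) % NUM_PARTITIONS

# The same username always routes to the same partition.
print(partition_for("alice") == partition_for("alice"))  # True
```

Because the mapping is deterministic, the engine only needs to hash the queried username to know which partition to scan.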
You will also want to normalize the table by separating the username field (and any other elements in the table related to username) into its own table with a user_key. Ensure a primary key on the user_key field in the username table.
This depends mainly on where you need to retrieve the values. If it's a page for a single user, then use the first approach. If you are showing data for all users, you should use a single table. The multiple-table approach is also clean, but in SQL, if the number of records in a single table is very high, data retrieval becomes slow.

Is there a limit to how many tables you can have in a database?

Is there a limit to how many tables you can have in a database? Would this be considered bad programming or whatever? I have a lot of user information and I'm wondering if it would be ok to have many tables?
If you're considering this question, you should probably change the way you are planning to store data. It is generally considered a bad practice to have a database schema where the number of tables grows over time. A better way to store data is to have identifying columns (like a User ID) and use a row for each user. You can have multiple tables for different types of data, but shouldn't have tables for each user.
No, MySQL does not have a limit on the number of tables in a database, although obviously you'll be constrained by how much disk space you have available.
That said, if you're asking this question, your prospective design is probably fairly ugly.
Just found this
http://bobfield.blogspot.com/2006/03/million-tables.html
So if you suspect you will have more than one million tables, you should consider redesigning the database ;) Also note, that this blogpost is from 2006.
Not usually a logical limit, no. But this question begs a discussion: why do you think you might approach a limit? If you will be creating many, many tables, then it feels like maybe you really want to be creating many, many rows instead... Perhaps you could elaborate on your idea so we can provide some schema guidance?
Generally the limit, if there is one, should be large enough not to worry about. If you find yourself worrying about it, you have larger problems. For instance if you were dealing with customers who have orders, you would create a table for customers and a table for orders. You should not be creating a table for each customer.
Yep, there is a limit... but you are unlikely to hit it. 65,000, last I heard: http://forums.mysql.com/read.php?32,100653,100653
I can see a reason why some might want a table per user. If each user is going to have an increasing number of logs/entries/rows over time, and you do not want the code to have to sort through a gigantic list of entries looking for rows matching only one particular userID, then the application would simply look for the table with the given userID, and everything in that table is for that user only. It would improve performance when comparing and sorting data for one particular user. I have used this method, albeit with fewer than one hundred users. I'm not sure what consequences might arise with thousands of users.
Why put users in their own tables? Seems like a waste of time to me. One users table with an identifying ID that increases every time a new row is added would work fine.
The ID could be a foreign key for other tables, such as "Blog_Posts" - each blog post would need to have an author, and so you could use an "AuthorID" column which would correlate to a User ID in your users table.
Saves space and time - plus it's cleaner.
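A minimal sketch of that users/Blog_Posts layout (SQLite here, with invented column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        username TEXT NOT NULL UNIQUE
    );
    CREATE TABLE blog_posts (
        post_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        author_id INTEGER NOT NULL REFERENCES users (user_id),
        title     TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO users (username) VALUES ('alice')")
conn.execute("INSERT INTO blog_posts (author_id, title) VALUES (1, 'Hello')")

# One users table plus a foreign key replaces a table per user.
row = conn.execute("""
    SELECT u.username, p.title
    FROM blog_posts p JOIN users u ON u.user_id = p.author_id
""").fetchone()
print(row)  # ('alice', 'Hello')
```

Every per-user table from option 1 collapses into one join on the author_id foreign key.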

Should I break a larger mysql table into multiple?

I have a pretty large social-network-type site I have been working on for about 2 years (high traffic and hundreds of files). I have been experimenting for the last couple of years with tweaking things for max performance, and I have learned a lot. Now I have a huge task: I am planning to completely re-code my social network, so I am re-designing the MySQL DBs and everything.
Below is a photo I made of a couple of MySQL tables that I have a question about. I currently have the login table, which is used in the login process; once a user is logged into the site they very rarely need to hit that table again, unless they're editing an email or password. I then have a user table which holds basically the user's settings and profile data for the site. This is where I have questions: would it give better performance to split the user table into smaller tables? For example, if you view the user table you will see several fields that I have marked as "setting_"; should I just create a separate settings table? I also have fields marked with "count", which hold the total counts of comments, photos, friends, mail messages, etc. So should I create another table to store just the total count of things?
The reason I have them all in 1 table now is that I was thinking maybe it would be better if I could cut down on MySQL queries: instead of hitting 3 tables to get information on every page load, I could hit 1.
Sorry if this is confusing, and thanks for any tips.
alt text http://img2.pict.com/b0/57/63/2281110/0/800/dbtable.jpg
As long as you don't SELECT * FROM your tables, having 2 or 100 fields won't affect performance.
Just SELECT only the fields you're going to use and you'll be fine with your current structure.
should I just create a separate settings table?
So should I create another table to store just the total count of things?
There is not a single correct answer for this; it depends on what your application is doing.
What you can do is measure and extrapolate the results in a dev environment.
On the one hand, using a separate table will save you some space and the code will be easier to modify.
On the other hand, you may lose some performance (as you already suspect) by having to join information from different tables.
About the count: I think it's fine to have it there. Although it is always said that it's better to calculate this kind of thing, I don't think that in this situation it hurts you at all.
But again, the only way to know what's better for you and your specific app is to measure, profile and find out what the benefit of doing so is. You would probably only gain 2% of improvement.
You'll need to compare performance testing results between the following:
Leaving it alone
Breaking it up into two tables
Using different queries to retrieve the login data and profile data (if you're not doing this already) with all the data in the same table
Also, you could implement some kind of caching strategy on the profile data if the usage data suggests this would be advantageous.
You should consider putting the counter columns and frequently updated timestamps in their own table -- every time you bump them, the entire row is rewritten.
I wouldn't consider your user table terribly large in number of columns; just my opinion. I also wouldn't break that table into multiple tables unless you can find a case for removing redundancy. Perhaps you have a lot of users who have the same settings; that would be a case for breaking the table out.
You should take into account the average size of a single row, in order to find out whether retrieval is expensive. Also, try to use indexes while looking up data.
The most important thing is to design properly, not just to split because "it looks large". Maybe the IP or IPs could go somewhere else... it depends on the data saved there.
Also, since the social network site using this data also handles the authentication and authorization processes (I guess so), the separation between the login and user tables should offer good performance, because the data in login is "short enough", while the profile may be accessed only once, immediately after a successful login. Just do the right tricks to improve DB performance and it's done.
(Remember to visualize tables as entities, and name each one as an entity, not as a collection of them.)
Two things you will want to consider when deciding whether or not you want to break up a single table into multiple tables is:
MySQL likes small, consistent datasets. If you can structure your tables so that they have fixed row lengths, that will help performance at the potential cost of disk space. One approach that, from what I can tell, is common is putting fixed-length data in its own table while the variable-length data goes somewhere else.
Joins are in most cases less performant than not joining. If the data currently in your table will normally be accessed all at the same time, then it may not be worth splitting it up, as you will be slowing down both inserts and quite possibly reads. However, if there is some data in that table that does not get accessed as often, then that would be a good candidate for moving out of the table for performance reasons.
I can't find a resource online to substantiate this next statement but I do recall in a MySQL Performance talk given by Jay Pipes that he said the MySQL optimizer has issues once you get more than 8 joins in a single query (MySQL 5.0.*). I am not sure how accurate that magic number is but regardless joins will usually take longer than queries out of a single table.
