Elegant database design help... (MySQL/PHP)

Elegant database design help... (MySQL/PHP) - php

I'm building a movies website... I need to display info about each movie, including genres, actors, and a lot of info (IMDB.com like)...
I created a 'movies' table including an ID and some basic information.
For the genres I created a 'genres' table including 2 columns: ID and genre.
Then I use a 'genres2movies' table with two columns:movieID and the genreID, to connect between the genres and the movies tables...
This way, for example, if a movie have 5 different genres I get the movieID in 5 different rows of the'genres2movies' table. Its better than including the genre each time for each movie but...
There is a better way for doing this???
I need to do this also for actors, languages and countries so performance and database size is really important.
Thanks!!!

It sounds like you are following proper normalisation rules at the moment, which is exactly what you want.
However, you may find that if performance is a key factor you may want to de-normalise some parts of your data, since JOINs between tables are relatively expensive operations.
It's usually a trade-off between proper/full normalisation and performance

You are in the right track. That's the way to do many-to-many relationships. Database size won't grow much because you use integers and for speed you must set up correct indexes for those IDs. When making SELECt queries check out the EXPLAIN - it helps to find the bottlenecks of speed.

You're on exactly the right track - this is the correct, normalized, approach.
The only thing I would add is to ensure that your index on the join table (genres2movies) includes both genre and movie id and it is generally worthwhile (depending upon the selects used) to define indexes in both directions - ie. two indexes, ordered genre-id,movie-id and movie-id,genre-id. This ensures that any range select on either genre or movie will be able to use an index to retrieve all the data it needs and not have to resort to a full table scan, or even have to access the table rows themselves.

Related

Is DynamoDB the right option for this use case?

I want to love DynamoDB, but the major drawback is the query/scan on the whole DB to pull the results for one query. Would I be better sicking with MySQL or is there another solution I should be aware of?
Uses:
Newsfeed items (Pulls most recent items from table where id in x,x,x,x,x)
User profiles relationships (users follow and friend eachother)
User lists (users can have up to 1,000 items in one list)
I am happy to mix and match database solutions.The main use is lists.
There will be a few million lists eventually, ranging from 5 to 1000 items per list. The list table is formatted as follows: list_id(bigint)|order(int(1))|item_text(varchar(500))|item_text2(varchar(12))|timestamp(int(11))
The main queries on this DB would be on the 'list_relations' table:
Select 'item_text' from lists where list_id=539830
I suppose my main question. Can we get all items for a particular list_id, without a slow query/scan? and by 'slow' do people mean a second? or a few minutes?
Thank you

I'm not going to address whether or not it's a good choice or the right choice, but you can do what you're asking. I have a large dynamoDB instance with vehicle VINs as the Hash, something else for my range, and I have a secondary index on vin and a timestamp field, I am able to make fast queries over thousands of records for specific vehicles over timestamp searches, no problem.

Constructing your schema in DynamoDB requires different considerations than building in MySQL.
You want to avoid scans as much as possible, this means picking your hash key carefully.
Depending on your exact queries, you may also need to have multiple tables that have the same data..but with different hashkeys depending on your querying needs.
You also did not mention the LSI and GSI features of DynamoDB, these also help your query-ability, but have their own sets of drawbacks. It is difficult to advise further without knowing more details about your requirements.

How can I optimize this simple database and query using php and mysql?

I pull a range (e.g. limit 72, 24) of games from a database according to which have been voted most popular. I have a separate table for tracking game data, and one for tracking individual votes for a game (rating from 1 to 5, one vote per user per game). A game is considered "most popular" or "more popular" when that game has the highest average rating of all the rating votes for said game. Games with less than 5 votes are not considered. Here is what the tables look like (two tables, "games" and "votes"):
games:
gameid(key)
gamename
thumburl
votes:
userid(key)
gameid(key)
rating
Now, I understand that there is something called an "index" which can speed up my queries by essentially pre-querying my tables and constructing a separate table of indices (I don't really know.. that's just my impression).
I've also read that mysql operates fastest when multiple queries can be condensed into one longer query (containing joins and nested select statements, I presume).
However, I am currently NOT using an index, and I am making multiple queries to get my final result.
What changes should be made to my database (if any -- including constructing index tables, etc.)? And what should my query look like?
Thank you.

Your query that calculates the average for every game could look like:
SELECT gamename, AVG(rating)
FROM games INNER JOIN votes ON games.gameid = votes.gameid
GROUP BY games.gameid
HAVING COUNT(*)>=5
ORDER BY avg(rating) DESC
LIMIT 0,25
You must have an index on gameid on both games and votes. (if you have defined gameid as a primary key on table games that is ok)

According to the MySQL documentation, an index is created when you designate a primary key at table creation. This is worth mentioning, because not all RDBMS's function this way.
I think you have the right idea here, with your "votes" table acting as a bridge between "games" and "user" to handle the many-to-many relationship. Just make sure that "userid" and "gameid" are indexed on the "votes" table.

If you have access to use InnoDB storage for your tables, you can create foreign keys on gameid in the votes table which will use the index created for your primary key in the games table. When you then perform a query which joins these two tables (e.g. ... INNER JOIN votes ON games.gameid = votes.gameid) it will use that index to speed things up.
Your understanding of an index is essentially correct — it basically creates a separate lookup table which it can use behind the scenes when the query is executed.
When using an index it is useful to use the EXPLAIN syntax (simply prepend your SELECT with EXPLAIN to try this out). The output it gives show you the list of possible keys available for the query as well as which key the query is using. This can be very helpful when optimising your query.

An index is a PHYSICAL DATA STRUCTURE which is used to help speed up retrieval type queries; it's not simply a table upon a table -> good for a concept though. Another concept is the way indexes work at the back of your text book (the only difference is with your book a search key could point to multiple pages / matches whereas with indexes a search key points to only one page/match). An index is defined by data structures so you could use a B+ tree index and there are even hash indexes. It's Database/Query optimization from the physical/internal level of the Database - I'm assuming that you know that you're working at the higher levels of the DBMS which is easier. An index is rooted within the internal levels and that make DB query optimization much more effective and interesting.
I've noticed from your question that you have not even developed the query as yet. Focus on the query first. Indexing comes after, as a matter of a fact, in any graduate or post graduate Database course, indexing falls under the maintenance of a Database and not necessarily the development.
Also N.B. I have seen quite many people say as a rule to make all primary keys indexes. This is not true. There are many instances where a primary key index would slow up the Database. Infact, if we were to go with only primary indexes then should use hash indexes since they work better than B+ trees!
In summary, it doesn't make sense to ask a question for a query and an index. Ask for help with the query first. Then given your tables (relational schema) and SQL query, then and only then could I advice you on the best index - remember its maintenance. We can't do maintanance if there is 0 development.
Kind Regards,
N.B. most questions concerning indexes at the post graduate level of many computing courses are as follows: we give the students a relational schema (i.e. your tables) and a query and then ask: critically suggest a suitable index for the following query on the tables ----> we can't ask a question like this if they dont have a query

mysql: use SET or lots of columns?

I'm using PHP and MySQL. I have records for:
events with various "event types" that are hierarchical (events can have multiple categories and subcategories, but there are a fixed amount of such categories and subcategories) (timestamped)
What is the best way to set up the table? Should I have a bunch of columns (30 or so) with enums for yes or no indicating membership in that category? or should I use MySQL SET datatype?
http://dev.mysql.com/tech-resources/articles/mysql-set-datatype.html
Basically I have performance in mind and I want to be able to retrieve all of the ids of the events for a given category. Just looking for some insight on the most efficient way to do this.

It sounds like you're chiefly concerned with performance.
A couple people have suggested splitting into 3 tables (category table plus either simple cross-reference table or a more sophisticated way of modeling the tree hierarchy, like nested set or materialized path), which is the first thing I thought when I read your question.
With indexes, a fully normalized approach like that (which adds two JOINs) will still have "pretty good" read performance. One issue is that an INSERT or UPDATE to an event now may also include one or more INSERT/UPDATE/DELETEs to the cross-reference table, which on MyISAM means the cross-reference table is locked and on InnoDB means the rows are locked, so if your database is busy with a significant number of writes you're going to have a larger contention problems than if just the event rows were locked.
Personally, I would try out this fully normalized approach before optimizing. But, I'll assume you know what you're doing, that your assumptions are correct (categories never change) and you have a usage pattern (lots of writes) that calls for a less-normalized, flat structure. That's totally fine and is part of what NoSQL is about.
SET vs. "lots of columns"
So, as to your actual question "SET vs. lots of columns", I can say that I've worked with two companies with smart engineers (whose products were CRM web applications ... one was actually events management), and they both used the "lots of columns" approach for this kind of static set data.
My advice would be to think about all of the queries you will be doing on this table (weighted by their frequency) and how the indexes would work.
First, with the "lots of columns" approach you are going to need indexes on each of these columns so that you can do SELECT FROM events WHERE CategoryX = TRUE. With the indexes, that is a super-fast query.
Versus with SET, you must use bitwise AND (&), LIKE, or FIND_IN_SET() to do this query. That means the query can't use an index and must do a linear search of all rows (you can use EXPLAIN to verify this). Slow query!
That's the main reason SET is a bad idea -- its index is only useful if you're selecting by exact groups of categories. SET works great if you'd be selecting categories by event, but not the other way around.
The primary problem with the less-normalized "lots of columns" approach (versus fully normalized) is that it doesn't scale. If you have 5 categories and they never change, fine, but if you have 500 and are changing them, it's a big problem. In your scenario, with around 30 that never change, the primary issue is that there's an index on every column, so if you're doing frequent writes, those queries become slower because of the number of indexes that have to updated. If you choose this approach, you might want to check the MySQL slow query log to make sure there aren't outlier slow queries because of contention at busy times of day.
In your case, if yours is a typical read-heavy web app, I think going with the "lots of columns" approach (as the two CRM products did, for the same reason) is probably sane. It is definitely faster than SET for that SELECT query.
TL;DR Don't use SET because the "select events by category" query will be slow.

It's good that the number of categories is fixed. If it wasn't you couldn't use either approach.
Check the Why You Shouldn't Use SET on the page you linked. I think that should give you a comprehensive guide.
I think the most important one is about indexes. Also, modifying a SET is slightly more complex.

The relationship between events and event types/categories is a many to many relationship, as echo says, but a simple xref table will leave you with a problem: If you want to query for all descendants of any given node, then you must make multiple recursive queries. On a deep tree, that will be very inefficient.
So when you say "retrieve all ids for a given category", if you do mean all descendants, then you want to use a Nested Set Model:
http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
The Nested Set model makes writes updates a bit slower, but makes it very easy to retrieve subtrees:
To get the Televisions sub tree, you query for all categories left >= 2 and right <= 9.
Leaf nodes always have left = right - 1
You can find the count of descendants without pulling those rows: (right - left - 1)/2
Finding inheritance paths and depth is also very easy (single query stuff). See the article for full details.

You might try using a cross-reference (Xref) table, to create a many-to-many relationship between your events and their types.
create table event_category_event_xref
(
event_id int,
event_category_id int,
foreign key(event_id) references event(id),
foreign key (event_category_id) references event_category(id)
);
Event / category membership is defined by records in this table. So if you have a record with {event_id = 3, event_category_id = 52}, it means event #3 is in category #52. Similarly you can have records for {event_id = 3, event_category_id = 27}, and so on.

How to design the user table for an online dating site?

I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version which I guess is 7-8 years old was done probably by someone not very knowledgeable in PHP and MySQL so I have to start over from scratch.
There community has currently 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine I'm a little wary to make a table with 200k rows and 100 columns. My predecessor split the user table in two ... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this lead to big synchronization problems between the two tables.
So, what do you think it's the best way to go about it?

This is not an answer per se, but since few answers here suggested the attribute-value model, I just wanted to jump in and say my life experience.
I've tried once using this model with a table with 120+ attributes (growing 5-10 every year), and adding about 100k+ rows (every 6 months), the indexes is growing so big that it takes for ever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit to any situation) is that you need to put a primary key on user_id,attrib on that second table. Unknowing the potential length of attrib, you would usually use a greater length value, thus increasing the indexes. In my case, attribs could have from 3 to 130 chars. Also, the value most certainly suffer from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attributes (or say at least 50% of them) NEED to exist.
Also, as the OP suggest, the search needs to be done on 30-40 attributes, and I can't just imagine how a 30-40 joins would be efficient, or even a group_concat() due to length limitation.
My only viable solution was to go back to a table with as much columns as there are attributes. My indexes are now greatly smaller, and searches are easier.
EDIT: Also, there are no normalization problems. Either having lookup tables for attribute values or have them ENUM().
EDIT 2: Of course, one could say I should have a look-up table for attribute possible values (reducing index sizes), but I should then make a join on that table.

What you could do is split the user data accross two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
Hope this helps,
Paul.

The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than have 100 and growing columns, add one table with three columns:
user_id, property, value.

In general, you shouldn't sacrifice database integrity for performance.
The first thing that I would do about this is to create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 mln rows is a piece of cake for mysql. So, before trying to solve a problem make sure you actually have it.
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.

In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2 table route. This will significantly reduce the amount of storage, code complexity, and the effort to changing the system to add new attributes.
Assuming that each attribute can be represented by an Ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)....
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in a N-dimensional space, unfortunately most relational databases aren't really setup for this kind of operation (I believe PostgreSQL has support for this). So most people would probably start with something like:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value=current_user.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However this forces the system to compare every available candidate to find the best match. Applying a little heurisitics and you could get a very effective query:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user.attr_value+$tolerance
AND current_user.attr_value-$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(the value of $tolerance will affect the number of rows returned and query performance - if you've got an index on attr_type, attr_value).
This can be further refined into a points scoring system:
SELECT candidate.id,
SUM(1/1+
((candidate_attrs.attr_value - current_user.attr_value)
*(candidate_attrs.attr_value - current_user.attr_value))
) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user.attr_value+$tolerance
AND current_user.attr_value-$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
SUM(1/1+
((candidate_attrs.attr_value - current_user.attr_value)
*(candidate_attrs.attr_value - current_user.attr_value))
) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs,
attribute_subsets s
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value
AND s.subset_name=$required_subset
AND s.attr_type=current_user.attr_type
BETWEEN current_user.attr_value+$tolerance
AND current_user.attr_value-$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
Obviously this does not accomodate non-ordinal data (e.g. birth sign, favourite pop-band). Without knowing a lot more about te structure of the existing data, its rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code nor the database schema - it can be completely data-driven.
Another approach would be to identify sterotypes - i.e. reference points within the N-dimensional space, then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier - then you just need to apply the same approach to find the best match within the subset of candidates whom also have been matched to the stereotype.

Can't really suggest anything without seeing the schema. Generally - Mysql database have to be normalized to at least 3NF or BNCF. It rather sounds like it is not normalized right now with 100 columns in 1 table.
Also - you can easily enforce referential integrity with foreign keys using transactions and INNODB engine.

How to store multi-valued profile details?

I have many fields which are multi valued and not sure how to store them? if i do 3NF then there are many tables. For example: Nationality.
A person can have single or dual nationality. if dual this means it is a 1 to many. So i create a user table and a user_nationality table. (there is already a nationality lookup table). or i could put both nationalities into the same row like "American, German" then unserialize it on run-time. But then i dont know if i can search this? like if i search for only German people will it show up?
This is an example, i have over 30 fields which are multi-valued, so i assume i will not be creating 61 tables for this? 1 user table, 30 lookup tables to hold each multi-valued item's lookups and 30 tables to hold the user_ values for the multi valued items?
You must also keep in mind that some multi-valued fields group together like "colleges i have studied at" it has a group of fields such as college name, degree type, time line, etc. And a user can have 1 to many of these. So i assume i can create a separate table for this like user_education with these fields, but lets assume one of these fields is also fixed list multi-valued like college campuses i visited then we will end up in a never ending chain of FK tables which isn't a good design for social networks as the goal is it put as much data into as fewer tables as possible for performance.

If you need to keep using SQL, you will need to create these tables. you will need to decide on how far you are willing to go, and impose limitations on the system (such as only being able to specify one campus).
As far as nationality goes, if you will only require two nationalities (worst-case scenario), you could consider a second nationality field (Nationality and Nationality2) to account for this. Of course this only applies to fields with a small maximum number of different values.

If your user table has a lot of related attributes, then one possibility is to create one attributes table with rows like (user_id, attribute_name, attribute_value). You can store all your attributes to one table. You can use this table to fetch attributes for given users, also search by attribute names and values.

The simple solution is to stop using a SQL table. This what NoSQL is deigned for. Check out CouchDB or Mongo. There each value can be stored as a full structure - so this whole problem could be reduced to a single (not-really-)table.
The downside of pretty much any SQL based solution is that it will be slow. Either slow when fetching a single user - a massive JOIN statement won't execute quickly or slow when searching (if you decide to store these values as serialized).

You might also want to look at ORM which will map your objects to a database automatically.
http://en.wikipedia.org/wiki/List_of_object-relational_mapping_software#PHP

This is an example, i have over 30
fields which are multi-valued, so i
assume i will not be creating 61
tables for this?
You're right that 61 is the maximum number of tables, but in reality it'll likely be less, take your own example:
"colleges i have studied at"
"college campuses i visited"
In this case you'll probably only have one "collage" table, so there would be four tables in this layout, not five.
I'd say don't be afraid of using lots of tables if the data set you're modelling is large - just make sure you keep an up to date ERD so you don't get lost! Also, don't get caught up too much in the "link table" paradigm - "link tables" can be "entities" in their own rights, for example you could think of the "colleges i have studied at" link table as an "collage enrolments" table instead, give it it's own primary key, and store each of the times you pay your course fees as rows in a (linked) "collage enrolment payments" table.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.