compute manually in php file or through mysql? - php

Excuse me if this has been asked before, but I tried looking for something similar but couldn't find anything.
I have three tables: users, hobbies and user_hobbies (linking the first two). I want to calculate the similarity betweet two users based on their hobbies. For this, I need, first of all, two sets: User A hobbies and user B hobbies, which I can acquire with two simple queries. I have to calculate these two sets for other reasons too, in a php file, so they are available to me, in two arrays, for the next step:
I have to calculate their common hobbies (i.e. the intersection of the sets).
Idea #1: Having two arrays, I can calculate through some method the common elements.
Idea #2: I can make a third query (e.g. SELECT hobby FROM user_hobbies WHERE user_id IN ('uid_A', 'uid_B') GROUP BY hobby HAVING COUNT (*) = 2) and not bother myself.
I suppose my question is about performance. Is it quicker to calculate manually or are mysql queries much faster?

You already have a normalized table to hold the user-hobbies table, so why not go with that?
Generally speaking, SQL will be much faster, at least for the first 100k records or so. Then you'll see a performance drop on queries that vet through columns that aren't indexed, or from queries that use the 'filesort' to order large datasets brought on by the ORDER BY keyword.
For scalability, I recommend using an inner join to narrow down the possibilities for starters.
Think critically about this. Are there any other columns not mentioned could indicate that the user could have more than one hobby? These are the things you consider when looking to scale your application.
Otherwise, you should be fine for starters, lest you should be optimizing prematurely.
I would go with Option #2.

In short: If your operations is NOT a set base operation it is better to be shifted out of the MsSql or any RDBMS.
Because, you can not scale MsSQL easily.

Related

Many to many vs one row [duplicate]

This question already has answers here:
Many database rows vs one comma separated values row
(4 answers)
Closed 8 years ago.
I'm interested how and why many to many relationship is better than storing the information in one row.
Example: I have two tables, Users and Movies (very big data). I need to establish a relationship "view".
I have two ideas:
Make another column in Users table called "views", where I will store the ids of the movies this user has viewed, in a string. for example: "2,5,7...". Then I will process this information in PHP.
Make new table users_movies (many to many), with columns user_id and movie_id. row with user_id=5 and movie_id=7 means that user 5 has viewed movie 7.
I'm interested which of this methods is better and WHY. Please consider that the data is quite big.
The second method is better in just about every way. Not only will you utilize your DBs indexes to find records faster, it will make modification far far easier.
Approach 1) could answer the question "Which movies has User X viewed" by just having an SQL like "...field_in_set(movie_id, user_movielist) ...". But the other way round ("Which user do have viewed movie x") won't work on an sql basis.
That's why I always would go for approach 2): clear normalized structure, both ways are simple joins.
It's just about the needs you have. If you need performance then you must accept redundancy of the information and add a column. If your main goal is to respect the Normalization paradigma then you should not have redundancy at all.
When I have to do this type of choice I try to estimate the space loss of redundancy vs the frequency of the query of interest and its performance.
A few more thoughts.
In your first situation if you look up a particular user you can easily get the list of ids for the films they have seen. But then would need a separate query to get the details such as the titles of those movies. This might be one query using IN with the list of ids, or one query per film id. This would be inefficient and clunky.
With MySQL there is a possible fudge to join in this situation using the FIND_IN_SET() function (although a down side of this is you are straying in to non standard SQL). You could join your table of films to the users using ON FIND_IN_SET(film.id, users.film_id) > 0 . However this is not going to use an index for the join, and involves a function (which while quick for what it does, will be slow when performed on thousands of rows).
If you wanted to find all the users who had view any film a particular user had viewed then it is a bit more difficult. You can't just use FIND_IN_SET as it requires a single string and a comma separated list. As a single query you would need to join the particular user to the film table to get a lot of intermediate rows, and then join that back against the users again (using FIND_IN_SET) to find the other users.
There are ways in SQL to split up a comma separated list of values, but they are messy and anyone who has to maintain such code will hate it!
These are all fudges. With the 2nd solution these easy to do, and any resulting joins can easily use indexes (and possibly the whole queries can just use indexes without touching the actual data).
A further issue with the first solution is data integretity. You will have to manually check that a film doesn't appear twice for a user (with the 2nd solution this can easily be enforced using a unique key). You also cannot just add a foreign key to ensure that any film id for a user does actually exist. Further you will have to manually ensure that nothing enters a character string in your delimited list of ids.

Is DynamoDB the right option for this use case?

I want to love DynamoDB, but the major drawback is the query/scan on the whole DB to pull the results for one query. Would I be better sicking with MySQL or is there another solution I should be aware of?
Uses:
Newsfeed items (Pulls most recent items from table where id in x,x,x,x,x)
User profiles relationships (users follow and friend eachother)
User lists (users can have up to 1,000 items in one list)
I am happy to mix and match database solutions.The main use is lists.
There will be a few million lists eventually, ranging from 5 to 1000 items per list. The list table is formatted as follows: list_id(bigint)|order(int(1))|item_text(varchar(500))|item_text2(varchar(12))|timestamp(int(11))
The main queries on this DB would be on the 'list_relations' table:
Select 'item_text' from lists where list_id=539830
I suppose my main question. Can we get all items for a particular list_id, without a slow query/scan? and by 'slow' do people mean a second? or a few minutes?
Thank you
I'm not going to address whether or not it's a good choice or the right choice, but you can do what you're asking. I have a large dynamoDB instance with vehicle VINs as the Hash, something else for my range, and I have a secondary index on vin and a timestamp field, I am able to make fast queries over thousands of records for specific vehicles over timestamp searches, no problem.
Constructing your schema in DynamoDB requires different considerations than building in MySQL.
You want to avoid scans as much as possible, this means picking your hash key carefully.
Depending on your exact queries, you may also need to have multiple tables that have the same data..but with different hashkeys depending on your querying needs.
You also did not mention the LSI and GSI features of DynamoDB, these also help your query-ability, but have their own sets of drawbacks. It is difficult to advise further without knowing more details about your requirements.

How to design the user table for an online dating site?

I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version which I guess is 7-8 years old was done probably by someone not very knowledgeable in PHP and MySQL so I have to start over from scratch.
There community has currently 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine I'm a little wary to make a table with 200k rows and 100 columns. My predecessor split the user table in two ... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this lead to big synchronization problems between the two tables.
So, what do you think it's the best way to go about it?
This is not an answer per se, but since few answers here suggested the attribute-value model, I just wanted to jump in and say my life experience.
I've tried once using this model with a table with 120+ attributes (growing 5-10 every year), and adding about 100k+ rows (every 6 months), the indexes is growing so big that it takes for ever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit to any situation) is that you need to put a primary key on user_id,attrib on that second table. Unknowing the potential length of attrib, you would usually use a greater length value, thus increasing the indexes. In my case, attribs could have from 3 to 130 chars. Also, the value most certainly suffer from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attributes (or say at least 50% of them) NEED to exist.
Also, as the OP suggest, the search needs to be done on 30-40 attributes, and I can't just imagine how a 30-40 joins would be efficient, or even a group_concat() due to length limitation.
My only viable solution was to go back to a table with as much columns as there are attributes. My indexes are now greatly smaller, and searches are easier.
EDIT: Also, there are no normalization problems. Either having lookup tables for attribute values or have them ENUM().
EDIT 2: Of course, one could say I should have a look-up table for attribute possible values (reducing index sizes), but I should then make a join on that table.
What you could do is split the user data accross two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than have 100 and growing columns, add one table with three columns:
user_id, property, value.
In general, you shouldn't sacrifice database integrity for performance.
The first thing that I would do about this is to create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 mln rows is a piece of cake for mysql. So, before trying to solve a problem make sure you actually have it.
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2 table route. This will significantly reduce the amount of storage, code complexity, and the effort to changing the system to add new attributes.
Assuming that each attribute can be represented by an Ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)....
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in a N-dimensional space, unfortunately most relational databases aren't really setup for this kind of operation (I believe PostgreSQL has support for this). So most people would probably start with something like:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value=current_user.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However this forces the system to compare every available candidate to find the best match. Applying a little heurisitics and you could get a very effective query:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user.attr_value+$tolerance
AND current_user.attr_value-$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(the value of $tolerance will affect the number of rows returned and query performance - if you've got an index on attr_type, attr_value).
This can be further refined into a points scoring system:
SELECT candidate.id,
SUM(1/1+
((candidate_attrs.attr_value - current_user.attr_value)
*(candidate_attrs.attr_value - current_user.attr_value))
) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user.attr_value+$tolerance
AND current_user.attr_value-$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
SUM(1/1+
((candidate_attrs.attr_value - current_user.attr_value)
*(candidate_attrs.attr_value - current_user.attr_value))
) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs,
attribute_subsets s
WHERE current_user_attrs.user_id=$current_user
AND candidate.user_id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user.attr_type
AND candidate_attrs.attr_value
AND s.subset_name=$required_subset
AND s.attr_type=current_user.attr_type
BETWEEN current_user.attr_value+$tolerance
AND current_user.attr_value-$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
Obviously this does not accomodate non-ordinal data (e.g. birth sign, favourite pop-band). Without knowing a lot more about te structure of the existing data, its rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code nor the database schema - it can be completely data-driven.
Another approach would be to identify sterotypes - i.e. reference points within the N-dimensional space, then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier - then you just need to apply the same approach to find the best match within the subset of candidates whom also have been matched to the stereotype.
Can't really suggest anything without seeing the schema. Generally - Mysql database have to be normalized to at least 3NF or BNCF. It rather sounds like it is not normalized right now with 100 columns in 1 table.
Also - you can easily enforce referential integrity with foreign keys using transactions and INNODB engine.

Is naming tables september_2010 acceptable and efficient for large data sets dependent on time?

I need to store about 73,200 records per day consisting of 3 points of data: id, date, and integer.
Some members of my team suggest creating tables using month's as the table name (september_2010), while others are suggesting having one table with lots of data in it...
Any suggestions on how to deal with this amount of data? Thanks.
========== Thank you to all the feedback.
I recommend against that. I call this antipattern Metadata Tribbles. It creates multiple problems:
You need to remember to create a new table every year or else your app breaks.
Querying aggregates against all rows regardless of year is harder.
Updating a date potentially means moving a row from one table to another.
It's harder to guarantee the uniqueness of pseudokeys across multiple tables.
My recommendation is to keep it in one table until and unless you've demonstrated that the size of the table is becoming a genuine problem, and you can't solve it any other way (e.g. caching, indexing, partitioning).
Seems like it should be just fine holding everything in one table. It will make retrieval much easier in the future to maintain 1 table, as opposed to 12 tables per year. At 73,200 records per day it will take you almost 4 years to hit 100,000,000 which is still well within MySQLs capabilities.
Absolutely not.
It will ruin relationship between tables.
Table relations being built based on field values, not table names.
Especially for this very table that will grow by just 300Mb/year
so in 100 days you have 7.3 M rows, about 25M a year or so. 25M rows isn't a lot anymore. MySQL can handle tables with millions of rows. It really depends on your hardware and your query types and query frequency.
But you should be able to partition that table (if MySQL supports partitioning), what you're describing is an old SQL Server method of partition. After building those monthly tables you'd build a view that concatenates them together to look like one big table... which is essentially what partitioning does but it's all under-the-covers and fully optimized.
Usually this creates more trouble than it's worth, it's more maintenance , your queries need more logic, and it's painful to pull data from more than one period.
We store 200+ million time based records in one (MyISAM) table, and queries are blazingly still fast.
You just need to ensure there's an index on your time/date column and that your queries makes use of the index (e.g. a query that messes around with DATE_FORMAT or similar on a date column will likely not use an index. I wouldn't put them in separate tables just for the sake of retreival performance.
One thing that gets very painful with such a large number of records is when you have to delete old data, this can take a long time (10 minutes to 2 hours for e.g. wiping a month worth of data in tables with hundreds of mullions rows). For that reason we've partitioning the tables, and use a time_dimension(see e.g. the time_dimension table a bit down here) relation table for managing the periods instead of simple date/datetime columns or strings/varchars representing dates.
Some members of my team suggest creating tables using month's as the table name (september_2010), while others are suggesting having one table with lots of data in it...
Don't listen to them. You're already storing a date stamp, what about different months makes it a good idea to split the data that way? The engine will handle the larger data sets just fine, so splitting by month does nothing but artificially segregate the data.
My first reaction is: Aaaaaaaaahhhhhhhhh!!!!!!
Table names should not embed data values. You don't say what the data means, but supposing for the sake of argument it is, I don't know, temperature readings. Just imagine trying to write a query to find all the months in which average temperature increased over the previous month. You'd have to loop through table names. Worse yet, imagine trying to find all 30-day periods -- i.e. periods that might cross month boundaries -- where temperature increased over the previous 30-day period.
Indeed, just retrieving an old record would go from a trivial operation -- "select * where id=whatever" -- would become a complex operation requiring you to have the program generate table names from the date on the fly. If you didn't know the date, you would have to scan through all the tables searching each one for the desired record. Yuck.
With all the data in one properly-normalized table, queries like the above are pretty trivial. With separate tables for each month, they're a nightmare.
Just make the date part of the index and the performance penalty of having all the records in one table should be very small. If the size of table really becomes a performance problem, I could dimply comprehend making one table for archive data with all the old stuff and one for current data with everything you retrieve regularly. But don't create hundreds of tables. Most database engines have ways to partition your data across multiple drives using "table spaces" or the like. Use the sophisticated features of the database if necessary, rather than hacking together a crude simulation.
Depends on what searches you'll need to do. If normally constrained by date, splitting is good.
If you do split, consider naming the tables like foo_2010_09 so the tables will sort alphanumerically.
what is your DB platform?
In SQL Server 2K5+ you can partition on date.
My bad, I didnt notice the tag. #thetaiko is right though and this is well within MySQL capabilities to deal with this.
I would say it depends on how the data is used. If most queries are done over the complete data, it would be an overhead to always join the tables back together again.
If you most times only need a part of the data (by date), it is a good idea to segment the tables into smaller pieces.
For the naming i would do tablename_yyyymm.
Edit: For sure you should then also think about another layer between the DB and your app to handle the segmented tables depending on some date given. Which can then get pretty complicated.
I'd suggest dropping the year and just having one table per month, named after the month. Archive your data annually by renaming all the tables $MONTH_$YEAR and re-creating the month tables. Or, since you're storing a timestamp with your data, just keep appending to the same tables. I assume by virtue of the fact that you're asking the question in the first place, that segregating your data by month fits your reporting requirements. If not, then I'd recommend keeping it all in one table and periodically archiving off historical records when performance gets to be an issue.
I agree with this idea complicating your database needlessly. Use a single table. As others have pointed out, it's not nearly enough data to warrent extraneous handling. Unless you use SQLite, your database will handle it well.
However it also depends on how you want to access it. If the old entries are really only there for archival purposes, then the archive pattern is an option. It's common for versioning systems to have the infrequently used data separated out. In your case you'd only want everything >1 year to move out of the main table. And this is strictly an database administration task, not an application behavior. The application would only join the current list and the _archive list, if at all. Again, this highly depends on the use case. Are the old entries generally needed? Is there too much data to process regularily?

MySQL - large query vs Short query with multiple if's

This is a really broad question, but I have come across it a couple of times in the last few weeks and I was wondering what the general consensus is regarding good practice and efficiency.
1)
SELECT COUNT(*) FROM table WHERE id='$id', name='$name', owner='$owner_id'
and then based on if there is one result then the record matches.
2)
SELECT * FROM table WHERE id='$id'
and then a series of if commands to check the results match.
Now obviously there are advantages to the second solution as it allows for accurate error reports as to the field that does not match... but if that is not required which is more efficient, considered better practice and is there a difference to the load on the mySQL server between the two?
Option 1 by a long shot. Let SQL do what it is designed to do best, and better than procedural code. That is, filtering and sorting data.
Also, it is a much more efficient use of resources (bandwidth, DB utilization, etc) to pull down only the data you need from the server.
Use 1). Mysql is very efficient in selecting data based on certain conditions.
Large query can take .1 to 5.1 or more seconds, you need to find, run it and find it. Usually multiple if are way better as PHP is very fast. I did that when I was using it with 5 joins in table with 5 billion products, then I reduce one join and then use if statement to fix it up. Query was taking 4.2 seconds, when I reduced join, it took 3.8s but as you know PHP is way faster.

Categories