Compare/Diff Multiple (>Millions) Arrays - php

I'm not sure if this is possible, but I have millions of "lists" in a MySQL database, and would like to develop a system where I take one of the lists and compare it against all of the other lists in the database and return:
1.) Lists that closely resemble the primary list (some sort of % match would be great)
2.) Given certain items in a list, it would return a list of items that are included in the majority of all the other lists (i.e. autocomplete a list based on popular options).
I would've initially thought this would be possible if I could create some sort of 'loose hash' that would let me compare lists mathematically, but I haven't been able to find a solution that scales (since the number of pairwise comparisons explodes when tackled head-on).
Any new ideas/solutions would be greatly appreciated. Thanks!

Your basic MD5 is a (somewhat) loose hash, supported by both PHP and MySQL and quite fast at this kind of thing. Just take an MD5 of whatever data you have and compare it to the others.
Do it in PHP: store the MD5 of the data as an array key and use isset().
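For what it's worth, a minimal sketch of that array-key approach (the $lists structure and the sorting-for-normalisation step are my assumptions, not part of the question):
<?php
// Minimal sketch: hash each list once and use the hash as an array key.
// $lists is assumed to be [list_id => [item, item, ...]].
$seen = [];
foreach ($lists as $listId => $items) {
    sort($items);                          // make the hash order-insensitive
    $hash = md5(implode(',', $items));
    if (isset($seen[$hash])) {
        echo "List $listId is an exact duplicate of list {$seen[$hash]}\n";
        continue;
    }
    $seen[$hash] = $listId;
}
Note that an MD5 only catches exact duplicates; for "closely resembles" you still need something like the intersection-count query described in the other answers.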

Your part 2) "Given certain items in a list; it would return a list of items that are included in the majority of all the other lists (i.e. autocomplete a list based on popular options)"
is not very clear, but I interpret it as: given a few items, find all lists that contain all or most of those items.
This should be easy once you create an index on your list elements, essentially like a hash table. The exact query will depend on your requirements (length of lists, whether that is a factor in defining the specs, etc.).
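As a rough illustration (the list_items(list_id, value) table, the index on value, and the $pdo connection are all assumptions here, since the question doesn't name its tables):
<?php
// Sketch only: given a few seed items, return the lists that contain most of
// them. Assumes a list_items(list_id, value) table with an index on value.
$items = ['apples', 'bread', 'milk'];
$placeholders = implode(',', array_fill(0, count($items), '?'));

$sql = "SELECT list_id, COUNT(*) AS matched
        FROM list_items
        WHERE value IN ($placeholders)
        GROUP BY list_id
        HAVING matched >= ?
        ORDER BY matched DESC";

$stmt = $pdo->prepare($sql);
// Require at least a majority of the seed items to be present.
$stmt->execute(array_merge($items, [(int) ceil(count($items) / 2)]));
$candidateLists = $stmt->fetchAll(PDO::FETCH_ASSOC);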

If you're saying there are millions of lists, it is really not an option to load them all into a PHP script.
You could get the values of the list you are comparing the others to, and then run an SQL query similar to this:
SELECT list_id, COUNT(value) AS c
FROM lists
WHERE value IN (a, b, c)
GROUP BY list_id
ORDER BY c DESC
I'm not sure the SQL is exactly right, but the idea is to select the ids of the lists that share members with the original list, and then sort the output by the number of list items that intersect with it. The percentage of item correspondence is easily obtained from that count.
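In PHP that might look roughly like this (a sketch only; $pdo, $primaryListId and the lists(list_id, value) layout are assumptions):
<?php
// Sketch: turn the COUNT from the query above into a similarity percentage.
// Assumes a lists(list_id, value) table, a PDO connection in $pdo, and the
// primary list's id in $primaryListId.
$primary = ['a', 'b', 'c'];   // the values of the primary list
$in = implode(',', array_fill(0, count($primary), '?'));

$stmt = $pdo->prepare(
    "SELECT list_id, COUNT(value) AS c
     FROM lists
     WHERE value IN ($in) AND list_id <> ?
     GROUP BY list_id
     ORDER BY c DESC"
);
$stmt->execute(array_merge($primary, [$primaryListId]));

foreach ($stmt as $row) {
    $percent = 100 * $row['c'] / count($primary);   // overlap with the primary list
    printf("list %d: %.1f%% overlap\n", $row['list_id'], $percent);
}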

Related

Many to many vs one row [duplicate]

This question already has answers here:
Many database rows vs one comma separated values row
(4 answers)
Closed 8 years ago.
I'm interested in how and why a many-to-many relationship is better than storing the information in one row.
Example: I have two tables, Users and Movies (very big data). I need to establish a relationship "view".
I have two ideas:
Make another column in the Users table called "views", where I will store the ids of the movies this user has viewed, as a string, for example: "2,5,7...". Then I will process this information in PHP.
Make a new table users_movies (many to many), with columns user_id and movie_id. A row with user_id=5 and movie_id=7 means that user 5 has viewed movie 7.
I'm interested in which of these methods is better and WHY. Please consider that the data is quite big.
The second method is better in just about every way. Not only will you utilize your DB's indexes to find records faster, it will also make modifications far, far easier.
Approach 1) could answer the question "Which movies has user X viewed?" with SQL along the lines of "...FIND_IN_SET(movie_id, user_movielist)...". But the other way round ("Which users have viewed movie X?") won't work well at the SQL level.
That's why I would always go for approach 2): a clear, normalized structure where both directions are simple joins.
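For illustration, both directions against the users_movies(user_id, movie_id) table are plain indexed joins (a sketch; the id columns on users/movies and the $pdo handle are assumptions):
<?php
// Which movies has user 5 viewed?
$stmt = $pdo->prepare(
    "SELECT m.*
     FROM movies m
     JOIN users_movies um ON um.movie_id = m.id
     WHERE um.user_id = ?"
);
$stmt->execute([5]);
$viewedMovies = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Which users have viewed movie 7?
$stmt = $pdo->prepare(
    "SELECT u.*
     FROM users u
     JOIN users_movies um ON um.user_id = u.id
     WHERE um.movie_id = ?"
);
$stmt->execute([7]);
$viewers = $stmt->fetchAll(PDO::FETCH_ASSOC);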
It's just about the needs you have. If you need performance then you must accept redundancy of the information and add a column. If your main goal is to respect the normalization paradigm then you should not have redundancy at all.
When I have to make this type of choice I try to weigh the space cost of the redundancy against the frequency of the query of interest and its performance.
A few more thoughts.
In your first situation, if you look up a particular user you can easily get the list of ids for the films they have seen. But you would then need a separate query to get the details, such as the titles of those movies. This might be one query using IN with the list of ids, or one query per film id. Either way it is inefficient and clunky.
With MySQL there is a possible fudge to do the join in this situation using the FIND_IN_SET() function (although a downside of this is that you are straying into non-standard SQL). You could join your table of films to the users using ON FIND_IN_SET(film.id, users.film_id) > 0. However this is not going to use an index for the join, and it involves a function (which, while quick for what it does, will be slow when performed on thousands of rows).
If you wanted to find all the users who had viewed any film a particular user had viewed, then it is a bit more difficult. You can't just use FIND_IN_SET, as it requires a single string and a comma separated list. As a single query you would need to join the particular user to the film table to get a lot of intermediate rows, and then join that back against the users again (using FIND_IN_SET) to find the other users.
There are ways in SQL to split up a comma separated list of values, but they are messy and anyone who has to maintain such code will hate it!
These are all fudges. With the 2nd solution they are easy to do, and any resulting joins can easily use indexes (possibly the whole query can be satisfied from indexes without touching the actual data).
A further issue with the first solution is data integrity. You will have to manually check that a film doesn't appear twice for a user (with the 2nd solution this can easily be enforced using a unique key). You also cannot just add a foreign key to ensure that every film id stored for a user actually exists. Further, you will have to manually ensure that nothing inserts stray characters into your delimited list of ids.
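To make the integrity point concrete, here is a sketch of the junction-table DDL (using the users_movies naming from the question) that enforces all of that declaratively; InnoDB and INT keys are assumptions, adjust to the actual users/movies definitions:
<?php
$pdo->exec(
    "CREATE TABLE users_movies (
         user_id  INT NOT NULL,
         movie_id INT NOT NULL,
         PRIMARY KEY (user_id, movie_id),               -- no duplicate view records
         FOREIGN KEY (user_id)  REFERENCES users (id),  -- the user must exist
         FOREIGN KEY (movie_id) REFERENCES movies (id)  -- the movie must exist
     ) ENGINE=InnoDB"
);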

Reasons not to use GROUP_CONCAT?

I just discovered this amazingly useful MySQL function GROUP_CONCAT. It appears so useful and over-simplifying that I'm actually afraid of using it, mainly because it's been quite some time since I started in web programming and I've never seen it anywhere. A sample of awesome usage would be the following:
Table clients holds clients (you don't say...), one row per client with unique IDs.
Table currencies has 3 columns client_id, currency and amount.
Now if I wanted to get user 15's name from the clients table and his balances, with the "old" method of array overwriting I would have to use the following SQL:
SELECT id, name, currency, amount
FROM clients LEFT JOIN currencies ON clients.id = client_id
WHERE clients.id = 15
Then in PHP I would have to loop through the result set and do an array overwrite (which I'm really not a big fan of, especially with massive result sets), like:
$result = array();
foreach ($stmt->fetchAll() as $row) {
    $result[$row['id']]['name'] = $row['name'];
    $result[$row['id']]['currencies'][$row['currency']] = $row['amount'];
}
However with the newly discovered function I can use this
SELECT id, name, GROUP_CONCAT(currency) AS currencies, GROUP_CONCAT(amount) AS amounts
FROM clients LEFT JOIN currencies ON clients.id = client_id
WHERE clients.id = 15
GROUP BY clients.id
Then at the application level things are so awesome and pretty:
$results = $stmt->fetchAll();
foreach ($results as $k => $v) {
    $results[$k]['currencies'] = array_combine(explode(',', $v['currencies']), explode(',', $v['amounts']));
}
The question I would like to ask is: are there any drawbacks to using this function, in performance or anything else? To me it just looks like pure awesomeness, which makes me think there must be a reason people don't use it more often.
EDIT:
I want to ask, eventually, what the other options are besides array overwriting to end up with a multidimensional array from a MySQL result set, because if I'm selecting 15 columns it's a really big pain in the neck to write that beast.
Using GROUP_CONCAT() usually invokes the group-by logic and creates temporary tables, which are usually a big negative for performance. Sometimes you can add the right index to avoid the temp table in a group-by query, but not in every case.
As @MarcB points out, the default length limit of a group-concatenated string is pretty short, and many people have been confused by truncated lists. You can increase the limit with group_concat_max_len.
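If you do rely on GROUP_CONCAT, the limit can at least be raised per connection; a sketch (assuming a PDO handle in $pdo):
<?php
// group_concat_max_len defaults to 1024 bytes, so long lists get silently
// truncated; raise it for this session before running the query.
$pdo->exec("SET SESSION group_concat_max_len = 1000000");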
Exploding a string into an array in PHP does not come for free. Just because you can do it in one function call in PHP doesn't mean it's the best for performance. I haven't benchmarked the difference, but I doubt you have either.
GROUP_CONCAT() is a MySQLism. It is not supported widely by other SQL products. In some cases (e.g. SQLite), they have a GROUP_CONCAT() function, but it doesn't work exactly the same as in MySQL, so this can lead to confusing bugs if you have to support multiple RDBMS back-ends. Of course, if you don't need to worry about porting, this is not an issue.
If you want to fetch multiple columns from your currencies table, then you need multiple GROUP_CONCAT() expressions. Are the lists guaranteed to be in the same order? That is, does the third field in one list correspond to the third field in the next list? The answer is no -- not unless you specify the order with an ORDER BY clause inside the GROUP_CONCAT().
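So if you do take the GROUP_CONCAT route with several columns, pin both lists to the same order explicitly; roughly like this, following the tables from the question ($pdo is an assumed PDO connection):
<?php
// Order both concatenated lists by the same column so the Nth currency
// lines up with the Nth amount.
$stmt = $pdo->prepare(
    "SELECT c.id, c.name,
            GROUP_CONCAT(cur.currency ORDER BY cur.currency) AS currencies,
            GROUP_CONCAT(cur.amount   ORDER BY cur.currency) AS amounts
     FROM clients c
     LEFT JOIN currencies cur ON c.id = cur.client_id
     WHERE c.id = ?
     GROUP BY c.id, c.name"
);
$stmt->execute([15]);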
I usually favor your first code format: use a conventional result set, loop over the results, and save them into a new array indexed by client id, appending the currencies to a sub-array. This is a straightforward solution, it keeps the SQL simple and easier to optimize, and it works better if you have multiple columns to fetch.
I'm not trying to say GROUP_CONCAT() is bad! It's really useful in many cases. But trying to make any one-size-fits-all rule to use (or to avoid) any function or language feature is simplistic.
The biggest problem that I see with GROUP_CONCAT is that it is highly specific to MySql: if you want to port your code to run against any other platform, you would have to rewrite all queries that use GROUP_CONCAT. For example, your first query is a lot more portable - you can probably run it against any major RDBMS engine without changing a single character in it.
If you are fine with working only with MySql (say, because you are writing a tool that is meant to be specific to MySql) the queries with GROUP_CONCAT would probably go faster, because the RDBMS would do more work for you, saving on the size of the data transfer.

Speed of SELECT Distinct vs array unique

I am using WordPress with some custom post types (just to give a description of my DB structure - it's WP's).
Each post has custom meta, which is stored in a separate table (postmeta table). In my case, I am storing city and state.
I've added some actions to WP's save_post/trash_post hooks so that the city and state are also stored in a separate table (cities) like so:
ID      postID   city      state
auto    int      varchar   varchar
I did this because I assumed that this table would be faster than querying the rather large postmeta table for a list of available cities and states.
My logic also forced me to add/update cities and states for every post, even though this will cause duplicates (in the city/state fields). This must be so because I must keep track of which states/cities actually have a post associated with them. When a post is added or deleted, its city/state record is added to or removed from the cities table along with it.
This brings me to my question(s).
Does this logic make sense or do I suck at DB design?
If it does make sense, my real question is this: **would it be faster to use MySQL's "SELECT DISTINCT" or just "SELECT *" and then use PHP's array_unique on the results?**
Edits for comments/answers thus far:
The structure of the table is exactly how I typed it out above. There is an index on ID, but the point of this table isn't to retrieve an indexed list, but to retrieve ALL results (that are unique) for a list of ALL available city/state combos.
I think I may go with (I don't know why I didn't think of this before) just adding a serialized list of city/state combos in ONE record in the wp_options table. Then I can just get that record, and filter out the unique records I need.
Can I get some feedback on this? I would imagine that retrieving and filtering a serialized array would be faster than storing the data in a separate table for retrieval.
To answer your question about using SELECT distinct vs. array_unique, I would say that I would almost always prefer to limit the result set in the database assuming of course that you have an appropriate index on the field for which you are trying to get distinct values. This saves you time in transmitting extra data from DB to application and for the application reading that data into memory where you can work with it.
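As a sketch of that, using the cities table from the question (the index name and the $pdo handle are mine):
<?php
// Let the database do the de-duplication: index the pair once, then
// SELECT DISTINCT can read straight from the index.
$pdo->exec("CREATE INDEX idx_city_state ON cities (city, state)");

$combos = $pdo->query(
    "SELECT DISTINCT city, state FROM cities ORDER BY state, city"
)->fetchAll(PDO::FETCH_ASSOC);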
As far as your separate table design, it is hard to speculate whether this is a good approach or not; this would largely depend on how you are actually performing your query (i.e. are you doing two separate queries, one for post info and one for city/state info, or querying across a join?).
There is really only one definitive way to determine which approach is fastest: test both ways in your environment.
1) A fully normalized table (where the main table holds only integer values and the other tables hold one int + varchar each) has an advantage when you don't do full table joins often and do a lot of searching on the normalized fields. The downside is that it requires large join/sort buffers and results in more complex queries, so there is much less chance the query will be auto-optimized by MySQL - you have to optimize your queries yourself.
2) SELECT DISTINCT will be faster in almost all cases. The only case where it will be slower is when you have a small sort buffer configured in /etc/my.cnf and a much larger memory buffer available for PHP.
A DISTINCT select can use indexes, while your PHP code can't.
Also, sending a large amount of data to your application requires a lot of MySQL CPU time and wall-clock time.

Is there any performance reason to store ids as a comma separated string in an RDBMS? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Is storing a delimited list in a database column really that bad?
I have been working on a couple of PHP/MySQL projects where all relationships are stored as comma separated strings.
For example, a common relationship would look something like this
(in pseudocode)
table people
    id    - integer
    name  - string
    age   - integer
    teams - string (CSV of integers, e.g. '1,3,9,21')
table teams
    name - string
    id   - integer
Managing relationships becomes a hassle.
To get all teams for a person:
$person = 'SELECT * FROM People WHERE id= x';
Then in PHP I have been doing something like:
$person['teams'] = SELECT * FROM teams WHERE id IN ($person['teams']);
As I was writing this I realized I could probably combine them into one MySQL query, something like:
SELECT
people.id,
people.name,
people.teams,
teams.name
FROM people
JOIN teams ON FIND_IN_SET(teams.id, people.teams) WHERE people.id=x
With this type of setup I find myself using FIND_IN_SET pretty frequently.
So finally, my question is: is there a performance benefit to creating relationships like this?
In my experience so far, FIND_IN_SET has usually been doing a full table scan. If there is no performance benefit, in which instances is it beneficial to use a comma separated list of integers? It seems that MySQL's designers had something in mind when creating FIND_IN_SET.
You're right, FIND_IN_SET() cannot make use of an index, so it causes a full table scan. Technically, that function is a bogus operation for a relational database, but no doubt there was a lot of demand for it so MySQL implemented it.
Storing data in a comma-separated list is an example of denormalization. Any departure from normalized design can give a performance boost for one type of query, but usually at the expense of all other types of queries against the same data.
For example, if you store players and their teams as a comma-separated list, it makes it very easy to get the list of teams for a given player, without doing a join. That's a performance improvement. But fetching the details for a given player's teams is much more difficult. Likewise searching for all players on a given team.
Use comma-separated lists only if that list is treated as a discrete "black box" piece of data. I.e. your application needs to fetch that list as a whole item, but never a subset of the list, and you never need to write SQL to use elements in that list for searching, joining, sorting, subtotals, etc.
See also my answer to Is storing a delimited list in a database column really that bad?
A table scan can never be considered a benefit.
Moreover it breaks normal form (http://en.wikipedia.org/wiki/Database_normalization), as far as I remember from school.
I think it's good practice to have all primary/foreign key columns indexed for a performance benefit.
The only idea I have in such a situation is to politely ask the architect on the particular project what the idea behind the solution was, and explain to him/her the performance disaster behind it :)

How to design the user table for an online dating site?

I'm working on the next version of a local online dating site, PHP & MySQL based, and I want to do things right. The user table is quite massive and is expected to grow even more with the new version, as there will be a lot of money spent on promotion.
The current version, which I guess is 7-8 years old, was probably done by someone not very knowledgeable in PHP and MySQL, so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine I'm a little wary of making a table with 200k rows and 100 columns. My predecessor split the user table in two... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think is the best way to go about it?
This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my own experience.
I once tried this model with a table of 120+ attributes (growing by 5-10 every year), adding about 100k+ rows every 6 months; the indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit for any situation) is that you need to put a primary key on (user_id, attrib) on that second table. Not knowing the potential length of attrib, you would usually use a generous length, thus increasing the index size. In my case, attribs could be anywhere from 3 to 130 chars. The value column most certainly suffers from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or say at least 50% of them) NEEDS to exist.
Also, as the OP suggests, the search needs to be done on 30-40 attributes, and I just can't imagine how 30-40 joins would be efficient, or even a GROUP_CONCAT() given its length limitation.
My only viable solution was to go back to a table with as many columns as there are attributes. My indexes are now much smaller, and searches are easier.
EDIT: Also, there are no normalization problems. Either have lookup tables for attribute values or make them ENUMs.
EDIT 2: Of course, one could say I should have a lookup table for the possible attribute values (reducing index sizes), but then I would have to join on that table.
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
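A rough sketch of that table and a typical lookup (I've renamed option/value to option_name/option_value because OPTION is a reserved word in MySQL; the column sizes and the $pdo handle are guesses):
<?php
$pdo->exec(
    "CREATE TABLE user_profile (
         user_id      INT NOT NULL,
         option_name  VARCHAR(64)  NOT NULL,
         option_value VARCHAR(255) NOT NULL,
         PRIMARY KEY (user_id, option_name)
     ) ENGINE=InnoDB"
);

// Fetch one user's profile as an [option_name => option_value] array.
$stmt = $pdo->prepare(
    "SELECT option_name, option_value FROM user_profile WHERE user_id = ?"
);
$stmt->execute([1]);
$profile = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);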
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than having 100 (and growing) columns, add one table with three columns:
user_id, property, value.
In general, you shouldn't sacrifice database integrity for performance.
The first thing that I would do about this is to create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 mln rows is a piece of cake for MySQL. So, before trying to solve a problem, make sure you actually have it.
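A throwaway generator for that dummy data might look like this (a sketch; the users(name, age) columns just stand in for whatever schema you settle on, and $pdo is an assumed PDO connection):
<?php
// Quick-and-dirty load generator: one million dummy users, inserted in
// batched transactions to keep it reasonably fast.
$pdo->beginTransaction();
$stmt = $pdo->prepare("INSERT INTO users (name, age) VALUES (?, ?)");
for ($i = 1; $i <= 1000000; $i++) {
    $stmt->execute(['user' . $i, rand(18, 99)]);
    if ($i % 10000 === 0) {       // commit every 10k rows
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();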
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2 table route. This will significantly reduce the amount of storage, the code complexity, and the effort of changing the system to add new attributes.
Assuming that each attribute can be represented by an ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)...
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for it). So most people would probably start with something like:
SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate.id = candidate_attrs.user_id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value = current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However this forces the system to compare every available candidate to find the best match. Apply a little heuristics and you can get a much more effective query:
SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate.id = candidate_attrs.user_id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value
        BETWEEN current_user_attrs.attr_value - $tolerance
            AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(The value of $tolerance will affect the number of rows returned and the query performance, particularly if you've got an index on (attr_type, attr_value).)
This can be further refined into a points scoring system:
SELECT candidate.id,
       SUM(1 / (1 +
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
         * (candidate_attrs.attr_value - current_user_attrs.attr_value)
       )) AS match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate.id = candidate_attrs.user_id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value
        BETWEEN current_user_attrs.attr_value - $tolerance
            AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
       SUM(1 / (1 +
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
         * (candidate_attrs.attr_value - current_user_attrs.attr_value)
       )) AS match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs,
     attribute_subsets s
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate.id = candidate_attrs.user_id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND s.subset_name = $required_subset
  AND s.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value
        BETWEEN current_user_attrs.attr_value - $tolerance
            AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to change your PHP code or the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes - i.e. reference points within the N-dimensional space - then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier, and then you just need to apply the same approach to find the best match within the subset of candidates who have also been matched to the same stereotype.
I can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. It rather sounds like it is not normalized right now, with 100 columns in one table.
Also, you can easily enforce referential integrity with foreign keys by using the InnoDB engine and transactions.
