"Cluster analysis" with MySQL

This is a tough one. There is probably a name for this and I don't know it, so I'll describe the problem exactly.
I have a dataset of user-submitted values. I need to determine which value is the correct one, based on some sort of average or, better, on the "closeness" of the data. For example, if I received the three submissions 4, 10 and 3 from three users, I would know that 3 or 4 is the "correct" value. If I averaged them, I'd get about 5.67, which is not the intended result.
I'm attempting to do this using MySQL and PHP.
tl;dr Need to find a value from a dataset based on "closeness" of relative values (using MySQL/PHP)
Thanks!

Clustering using a database isn't going to be a single query type of procedure. It takes iterations to generate the clusters effectively.
You first need to decide how many clusters you want. If you wanted only one cluster, then obviously everything would go into it. If you want two, then you can write your program to separate the nodes into two groups using some sort of correlation metric.
In other words, I don't think this is a MySQL question so much as a clustering question.
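For one-dimensional data like the question's, the two-group case can be sketched very roughly as follows. This is only an illustration: the "split at the widest gap" rule and the name split_at_largest_gap are assumptions of mine, not something this answer prescribes, and a real clustering method such as k-means would iterate.

```python
def split_at_largest_gap(values):
    """Split 1-D values into two groups at the widest gap between sorted neighbours."""
    ordered = sorted(values)
    if len(ordered) < 2:
        return ordered, []
    # Index i identifies the gap between ordered[i] and ordered[i + 1].
    widest = max(range(len(ordered) - 1),
                 key=lambda i: ordered[i + 1] - ordered[i])
    return ordered[:widest + 1], ordered[widest + 1:]


low, high = split_at_largest_gap([4, 10, 3])
# Treat the more populated group as the one holding the "correct" value.
majority = low if len(low) >= len(high) else high
print(low, high, majority)
```

With the submissions 4, 10, 3 the groups come out as [3, 4] and [10], so the majority group is [3, 4].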

I think that is the kind of thing you're looking for:
-- `table` is a reserved word in MySQL, so the placeholder name needs backticks
SELECT id, MIN(ABS(id - (SELECT AVG(id) FROM `table`))) as min
FROM `table`
GROUP BY id
ORDER BY min
LIMIT 1;
For example, if your data set contains the IDs 3, 4, 10, the average is 5.6667 and the closest value to it is 4. If your data set is 3, 6, 10, 14, the average is 8.25 and the closest value is 10.
This is what this query returns. Hope it helps.
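The same idea can be sketched outside the database. The function name closest_to_mean is mine, chosen for illustration; it simply picks the submitted value with the smallest absolute distance to the average.

```python
def closest_to_mean(values):
    """Return the element of values that lies nearest to their arithmetic mean."""
    mean = sum(values) / len(values)
    return min(values, key=lambda v: abs(v - mean))


print(closest_to_mean([3, 4, 10]))      # 4
print(closest_to_mean([3, 6, 10, 14]))  # 10
```

Note that an outlier still pulls the mean towards itself, so this picks an existing value but does not ignore the outlier.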

I have the impression you are looking for the median.
E.g. in the list 1 2 3 4 100, the median (central value) is 3.
You may want to search for finding the median in SQL: https://stackoverflow.com/search?q=sql+median
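To see how the median behaves on the question's own numbers, here is a small sketch using Python's standard statistics module (doing it in Python rather than SQL is my choice for illustration):

```python
from statistics import mean, median

print(median([1, 2, 3, 4, 100]))    # 3
print(median([4, 10, 3]))           # 4
print(round(mean([4, 10, 3]), 2))   # 5.67
```

The median of 4, 10, 3 is 4, one of the values the asker considers "correct", while the mean is dragged up by the 10.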


Subset Sum floats Eliminations

I will be happy to get some help. I have the following problem:
I'm given a list of numbers and a target number.
subset_sum([11.96,1,15.04,7.8,20,10,11.13,9,11,1.07,8.04,9], 20)
I need an algorithm that finds all groups of numbers that add up to the target number, e.g. 20.
First, find all values that equal 20 on their own.
And next for example the best combinations here are:
11.96 + 8.04
1 + 10 + 9
11.13 + 7.8 + 1.07
9 + 11
Remaining value 15.04.
I need an algorithm that uses each value only once and can combine anywhere from 1 to n values to reach the target number.
I tried some recursion in PHP, but it runs out of memory really fast (50k values), so a solution in Python would help (time/memory wise).
I'd be glad for some guidance here.
One possible solution is this: Finding all possible combinations of numbers to reach a given sum
The only difference is that I need to put a flag on elements already used so it won't be used twice and I can reduce the number of possible combinations
Thanks for anyone willing to help.

There are many ways to think about this problem.
If you do recursion make sure to identify your end cases first, then proceed with the rest of the program.
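This is the first thing that comes to mind.
<?php
// Print every combination of values from $a that sums to $s.
// $c holds the values chosen so far.
function subset_sum($a, $s, $c = array())
{
    $eps = 1e-9; // tolerance: the inputs are floats, so don't test for exactly 0
    if (abs($s) < $eps) {
        print_r($c); // end case: target reached
        return;
    }
    if ($s < 0 || count($a) == 0) {
        return; // end case: overshot the target, or nothing left to try
    }
    foreach ($a as $xd => $xdd) {
        // Drop the current value so it is used at most once and so that
        // earlier values are not retried in a different order.
        unset($a[$xd]);
        subset_sum($a, $s - $xdd, array_merge($c, array($xdd)));
    }
}

subset_sum([11.96,1,15.04,7.8,20,10,11.13,9,11,1.07,8.04,9], 20);
?>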

This is a possible solution, but it's not pretty:
import itertools
import operator
from functools import reduce

def subset_num(array, num):
    # Build every combination of every length, then keep those that sum to num.
    subsets = reduce(operator.add, [list(itertools.combinations(array, r)) for r in range(1, 1 + len(array))])
    return [subset for subset in subsets if sum(subset) == num]

print(subset_num([11.96,1,15.04,7.8,20,10,11.13,9,11,1.07,8.04,9], 20))
Output:
[(20,), (11.96, 8.04), (9, 11), (11, 9), (1, 10, 9), (1, 10, 9), (7.8, 11.13, 1.07)]

DISCLAIMER: this is not a full solution, it is a way to just help you build the possible subsets. It does not help you to pick which ones go together (without using the same item more than once and getting the lowest remainder).
Using dynamic programming you can build all the subsets that add up to the given sum, then you will need to go through them and find which combination of subsets is best for you.
To build this archive you can (I'm assuming we're dealing with non-negative numbers only) put the items in a column, go from top to bottom and for each element compute all the subsets that add up to the sum or a lower number than it and that include only items from the column that are in the place you are looking at or higher. When you build a subset you put in its node both the sum of the subset (which may be the given sum or smaller) and the items that are included in the subset. So in order to compute the subsets for an item [i] you need only look at the subsets you've created for item [i-1]. For each of them there are 3 options:
1) the subset's sum is the given sum ---> Keep the subset as it is and move to the next one.
2) the subset's sum is smaller than the given sum but larger than it if item [i] is added to it ---> Keep the subset as it is and move on to the next one.
3) the subset's sum is smaller than the given sum and it will still be smaller or equal to it if item [i] is added to it ---> Keep one copy of the subset as it is and create another one with item [i] added to it (both as a member and added to the sum of the subset).
When you're done with the last item (item [n]), look at the subsets you've created - each one has its sum in its node and you can see which ones are equal to the given sum (and which ones are smaller - you don't need those anymore).
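The three options above can be sketched roughly as follows. This is a minimal illustration, assuming non-negative numbers as stated; the name build_subsets and the eps tolerance for comparing float sums are my additions. It keeps every partial subset in memory, so it only builds the archive and does nothing about choosing non-overlapping subsets.

```python
def build_subsets(items, target, eps=1e-9):
    """Return every subset of items (as a tuple) whose sum equals target."""
    # Each node stores (sum of the subset, members of the subset).
    nodes = [(0.0, ())]
    for item in items:
        extended = []
        for total, members in nodes:
            complete = total >= target - eps           # option 1: already at the sum
            overshoots = total + item > target + eps   # option 2: item does not fit
            if not complete and not overshoots:
                # Option 3: keep the subset and also add a copy that includes item.
                extended.append((total + item, members + (item,)))
        nodes.extend(extended)
    return [members for total, members in nodes if abs(total - target) <= eps]


print(build_subsets([11.96, 8.04, 9, 11], 20))
```

Because each new item only looks at nodes created for the previous items, no partial sum is recomputed, which is the point of the approach.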
As I wrote at the beginning - now you need to figure out how to take the best combination of subsets that do not have a shared member between any of them.
Basically you're left with a problem that resembles the classic knapsack problem but with another limitation (not every stone can be taken with every other stone). Maybe the limitation actually helps, I'm not sure.
A bit more about the advantage of dynamic programming in this case
The basic idea of dynamic programming instead of recursion is to trade redundancy of operations with occupation of memory space. By that I mean to say that recursion with a complex problem (normally a backtrack knapsack-like problem, as we have here) normally ends up calculating the same thing a fair amount of times because the different branches of calculation have no concept of each other's operations and results. Dynamic programming saves the results and uses them along the way to build "bigger" results, relying on the previous/"smaller" ones. Because the use of the stack is much more straightforward than in recursion, you don't get the memory problem you get with recursion regarding the maintenance of the function's state, but you do need to handle a great deal of memory that you store (sometimes you can optimise that).
So for example in our problem, trying to combine a subset that would add up to the required sum, the branch that starts with item A and the branch that starts with item B do not know of each other's operations. Let's assume item C and item D together add up to the sum, that either of them added alone to A or B would not exceed the sum, and that A doesn't go with B in the solution (we can have sum=10, A=B=4, C=D=5, and there is no subset that sums up to 2, so A and B can't be in the same group). The branch trying to figure out A's group would (after trying and rejecting having B in its group) add C (A+C=9) and then add D, at which point it would reject this group and backtrack (A+C+D=14 > sum=10). The same would happen to B of course (A=B), because the branch figuring out B's group has no information about what just happened in the branch dealing with A. So in fact we've calculated C+D twice and haven't even used it yet (and we're about to calculate it a third time to realise they belong in a group of their own).
NOTE:
Looking around while writing this answer I came across a technique I was not familiar with and might be a better solution for you: memoization. Taken from wikipedia:
memoization is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again.
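A minimal sketch of memoization applied to this kind of problem, using Python's functools.lru_cache. It only counts the subsets that reach the target rather than listing them, and it assumes the values have been converted to positive integers (for example cents) so that they compare exactly and work as cache keys. The name count_subsets is hypothetical.

```python
from functools import lru_cache


def count_subsets(values, target):
    """Count subsets of positive integers in values that sum exactly to target."""
    values = tuple(values)

    @lru_cache(maxsize=None)
    def count(i, remaining):
        # Cached on (i, remaining): branches that arrive at the same state
        # reuse the stored answer instead of recomputing it.
        if remaining == 0:
            return 1
        if remaining < 0 or i == len(values):
            return 0
        # Either take values[i] or skip it.
        return count(i + 1, remaining - values[i]) + count(i + 1, remaining)

    return count(0, target)


print(count_subsets([4, 4, 5, 5], 10))  # 1
print(count_subsets([1, 2, 3], 3))      # 2
```

In the A, B, C, D example above, the branch for A and the branch for B both reach the state "items C and D left, 6 remaining", so the second visit is answered from the cache.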

So I have a possible solution:
from collections import Counter

# compute the difference between two lists, keeping duplicates
def list_difference(a, b):
    count = Counter(a)  # count items in a
    count.subtract(b)   # subtract items that are in b
    diff = []
    for x in a:
        if count[x] > 0:
            count[x] -= 1
            diff.append(x)
    return diff

# return the first combination of numbers that matches target
def subset_sum(numbers, target, partial=[]):
    s = sum(partial)
    # check if the partial sum equals the target
    if s == target:
        print("--------------------------------------------sum_is(%s)=%s" % (partial, target))
        return partial
    else:
        if s >= target:
            return  # if we reach the number why bother to continue
        for i in range(len(numbers)):
            n = numbers[i]
            remaining = numbers[i+1:]
            rest = subset_sum(remaining, target, partial + [n])
            if type(rest) is list:
                # body missing in the post as extracted; returning the match is
                # what repeatUntil below relies on
                return rest

# repeat until the remainder no longer exceeds target or stops changing
def repeatUntil(subset, target):
    currSubset = []
    while sum(subset) > target and currSubset != subset:
        diff = subset_sum(subset, target)
        currSubset = subset
        subset = list_difference(subset, diff)
    return subset
Output:
--------------------------------------------sum_is([11.96, 8.04])=20
--------------------------------------------sum_is([1, 10, 9])=20
--------------------------------------------sum_is([7.8, 11.13, 1.07])=20
--------------------------------------------sum_is([20])=20
--------------------------------------------sum_is([9, 11])=20
[15.04]
Unfortunately this solution only works for a small list. For a big list I am still trying to break the list into small chunks and calculate each one, but the answer is not quite correct. You can see it in a new thread here:
Finding unique combinations of numbers to reach a given sum

Generating unique fixed integer ids from array of ids

So here is the situation... I have an array of objects, each marked with a unique integer id, and for each and every combination of those objects I need to create new ones, each with a unique id. The problem is that the list of objects is dynamic and used in a stateless environment, so the newly generated ids must be the same on every run.
To make it clearer what I need here, consider that array of objects as array of their ids, for example: [10, 7, 23]. And basically, I need to get ids for all the possible combinations:
10, 7
10, 23
7, 23
10, 7, 23
What's important here is that generated ids must be same for each distinct combination (for example: 10 and 7 should always produce same id). Also, newly added objects should not affect previously generated ids. So for example, when some new object is later on added to that list, ids generated from previous combinations must remain the same as before new object was added.
Currently, I have a solution that pretty much comes down to generating new id as a result of the sum of combining ids, so resulting ids are:
17
33
30
40
Of course, this approach can produce duplicate ids, and that's the reason I'm asking for advice for some more sophisticated algorithm. I also tried introducing fixed offset of 1000 for newly generated ids and multiplying sum with number of objects in combination, so that for example resulting ids are 1034 (1000+(10+7)*2), 1066 (1000+(10+23)*2), etc., but I'm not sure that it would save me from duplicates. :)
To be clear, I need this for a PHP project, but since the problem is not language-specific, I hope some good mathematicians can offer a good solution. :)
One useful fact: the ids being combined are in the range 10000-99999, and a combination never contains more than 10 items.
Please note that I do not need solution for how to make all the combinations from array elements, but only that "formula" for producing integer id.
Thanks in advance.

Not really sure what your aim is, but I'll have a go...
Have you tried using character keys? For example 10, 7, 3 becomes a sequence with an underscore. Each sequence will have a unique hash.
$arrayOfKeys = array(10, 7, 3);
$hash = implode('_', $arrayOfKeys);
print $hash;
# 10_7_3
Personally I'd go for this simple approach. If you're using a database and you're not producing, say, 100k records per day, it should be pretty fast using an indexed (primary key or unique) varchar field.
If you have to create numbers, here's a tip: take the length of the largest number and use it as the prefix of your sequence, e.g.:
10, 5, 1 -> 2100501
105, 45, 201 -> 3105045201
The prefix tells you how long each of the following numbers is. I can't think of any way you'd get duplicates... Anyone? ;)
Hope it helps...
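A small sketch of the numeric scheme described above, so the two examples can be checked. The function name encode_ids is hypothetical, and the ids are encoded in the order given, as in the examples; sorting them first would be needed if 10, 7 and 7, 10 must produce the same id.

```python
def encode_ids(ids):
    """Prefix with the digit width of the largest id, then zero-pad each id to that width."""
    width = len(str(max(ids)))
    padded = "".join(str(i).zfill(width) for i in ids)
    return int(str(width) + padded)


print(encode_ids([10, 5, 1]))       # 2100501
print(encode_ids([105, 45, 201]))   # 3105045201
```

With five-digit ids and up to 10 items per combination, the result can be 51 digits long, which does not fit in a 64-bit integer, so in practice it may have to be stored as a string.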

Step 1: Sort the values you get.
eg: if you get 10, 7 or 7, 10 it should result in 7, 10 before going to the ID generator. If you know the range of your numbers, i.e. let's assume [0-100], use radix or counting sort, which will be fast.
Step 2: Represent the numbers as strings, separated by any chosen separator (':' maybe).
eg: for 7, 10 id will become "7:10".
Sorting is being done to avoid generating different ID's for 10, 7 and 7, 10.
BTW What do these numbers represent?
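The two steps can be sketched as follows. The function name combination_key is mine, and the built-in sort is used instead of radix or counting sort for brevity.

```python
def combination_key(ids, separator=":"):
    """Sort the ids, then join them into one string key."""
    return separator.join(str(i) for i in sorted(ids))


print(combination_key([10, 7]))      # 7:10
print(combination_key([7, 10]))      # 7:10
print(combination_key([10, 7, 23]))  # 7:10:23
```

Because the key depends only on the ids in the combination, adding new objects to the list later does not change keys that were generated earlier.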

I don't think this is possible unless you allow labels of increasing length.
Assume you have a maximum of N distinct objects, corresponding to N distinct labels.
If you want to be able to represent all possible pairs, assuming order in a pair does not matter, you potentially need N.(N-1)/2 extra labels, whatever they are, and you need to reserve them all.
And for all triples, N.(N-1).(N-2)/6, for all quads N.(N-1).(N-2).(N-3)/24...
This grows exponentially and will very quickly exceed the capacity of integers.
Any other solution that tries to compress the space of labels, such as hashing, will result in collisions. You can resolve the collisions by maintaining a collision table, but this will break the "generated ids must be same for every run" requirement.
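To put numbers on this for the question's constraints (ids from 10000 to 99999, so N = 90000, and at most 10 items per combination), here is a quick check with math.comb, available from Python 3.8. Comparing against the signed 64-bit limit is my assumption about what "capacity of integers" means.

```python
import math

N = 90000  # distinct ids in the range 10000-99999

pairs = math.comb(N, 2)
# All combinations of 2 to 10 ids.
up_to_ten = sum(math.comb(N, k) for k in range(2, 11))

print(pairs)                    # 4049955000
print(up_to_ten > 2**63 - 1)    # True
```

The pairs alone already need about four billion labels, and the full set of combinations is far beyond what a 64-bit integer can enumerate.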

Lottery number analysis

I'm trying to perform some basic analysis on Lotto results :)
I have a database that looks something like:
id|no|day|dd|mmm|yyyy|n1|n2|n3|n4|n5|n6|bb|jackpot|wins|machine|set
--------------------------------------------------------------------
1 |22|mon|22|aug|1999|01|05|11|29|38|39|04|2003202| 1 | Topaz | 3
2 |23|tue|24|aug|1999|01|06|16|21|25|39|03|2003202| 2 | Pearl | 1
That's just an example. So, n1 to n6 are standard balls in the lottery and bb stands for the bonus ball.
I want to write PHP/SQL code that will display just one random sequence of numbers that has yet to come out. However, if the numbers 01, 04, 05, 11, 29, 38 and 39 have come out, I don't want the code to print those same numbers in a different order, since in theory that set of numbers has already won.
I just can't get my head around the logic of this. I'd appreciate any help.
Thanks in advance

Assuming that the balls are stored in ascending order in your database like the examples you've given, you could just generate a random sequence of 6 numbers, sort them and then generate 1 random bonus number. Once you've done that it would just be a matter of doing a simple SQL query into your database and seeing if it comes back with a result:
$nums=...//generate your 6 main numbers here
sort($nums); //sort only the main numbers
$nums[]=...//then append your bonus number, so it is not sorted in with the others
$mysqli=new mysqli('...','...','...','...');
$stmt=$mysqli->prepare("SELECT * FROM table
WHERE n1=? AND n2=? AND n3=? AND n4=? AND n5=? AND n6=? AND bb=?");
$stmt->bind_param('iiiiiii', $nums[0], $nums[1], $nums[2], $nums[3], $nums[4], $nums[5], $nums[6]);
$stmt->execute();
$stmt->store_result();
if($stmt->num_rows==0) {
    //your numbers have not been drawn before - return them
} else {
    //otherwise loop round and try again
}
As long as both lists of numbers (but not the bonus ball) are sorted, you won't have any problems with a different ordering of an already drawn set of numbers.
This will become less efficient as your database of previous draws gets fuller, but I don't think you'll have to worry about that for a few decades. :-)
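The generate, sort and check loop can be sketched without a database like this. The name undrawn_combination, the 49-ball range and the in-memory set of past draws are assumptions for illustration, and the sketch checks only the six main numbers, leaving out the bonus ball.

```python
import random


def undrawn_combination(drawn, ball_count=49, picks=6, rng=random):
    """Return a sorted tuple of picks distinct numbers that is not in drawn.

    drawn is a set of sorted tuples, one per past draw.
    """
    while True:
        candidate = tuple(sorted(rng.sample(range(1, ball_count + 1), picks)))
        if candidate not in drawn:
            return candidate  # never drawn before
        # otherwise loop round and try again


past_draws = {(1, 5, 11, 29, 38, 39), (1, 6, 16, 21, 25, 39)}
print(undrawn_combination(past_draws))
```

Sorting both the stored draws and the candidate is what makes a reordered copy of an old draw count as already drawn.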

What about sorting each already drawn result (each row) in some order, ascending maybe, and then sorting the set of already drawn results (all rows)? Then you will have an easy-to-search list in which you can see what is left to be drawn.
Say, for example, you want a set that has never been drawn before. You would just have to loop through the list until you spot a "hole", which would be a never before drawn set. If you would like to optimise further, you could also store the index at which you last found a "hole". Then you would never need to loop through the same part of the list twice, and you could even abandon "completed" parts of the list to save disk space. Or, if you would like the new set you come up with to seem random, you could start at a random offset in the list.
To do this effectively you should make an extra column to store the pre-sorted set. For example if you have (5, 3, 6, 4, 1, 2) that column could contain 010203040506. Add in enough zeros so that the numbers occur on a fixed offset basis.
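A rough sketch of both ideas: the fixed-width key for the extra column, and the scan for the first "hole". The names fixed_width_key and first_undrawn and the 49-ball default are assumptions, and the sketch walks the candidate sets in order and checks them against a set, instead of walking a sorted table.

```python
from itertools import combinations


def fixed_width_key(numbers, width=2):
    """Sort the numbers and zero-pad each to width digits, e.g. 010203040506."""
    return "".join(str(n).zfill(width) for n in sorted(numbers))


def first_undrawn(drawn_keys, ball_count=49, picks=6):
    """Return the first combination, in ascending order, whose key is not in drawn_keys."""
    for candidate in combinations(range(1, ball_count + 1), picks):
        if fixed_width_key(candidate) not in drawn_keys:
            return candidate
    return None  # every possible set has been drawn


print(fixed_width_key([5, 3, 6, 4, 1, 2]))             # 010203040506
print(first_undrawn({fixed_width_key([5, 3, 6, 4, 1, 2])}))  # (1, 2, 3, 4, 5, 7)
```

Because every number takes the same number of digits, sorting the keys as strings gives the same order as sorting the sets themselves.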

What is the most efficient way to find the euclidean distance in 3d using mysql?

I have a MySQL table with thousands of data points stored in 3 columns R, G, B. how can I find which data point is closest to a given point (a,b,c) using Euclidean distance?
I'm saving RGB values of colors separately in a table, so the values are limited to 0-255 in each column. What I'm trying to do is find the closest color match by finding the color with the smallest euclidean distance.
I could obviously run through every point in the table to calculate the distance but that wouldn't be efficient enough to scale. Any ideas?

I think the above comments are all true, but they are - in my humble opinion - not answering the original question. (Correct me if I'm wrong). So, let me here add my 50 cents:
You are asking for a select statement. Assuming your table is called 'colors', its columns r, g and b are integers in the range 0..255, and you are looking for the value in your table closest to a given value, let's say rr, gg, bb, I would try the following:
select min(sqrt((rr-r)*(rr-r)+(gg-g)*(gg-g)+(bb-b)*(bb-b))) from colors;
Now, this answer is given with a lot of caveats, as I am not sure I got your question right, so pls confirm if it's right, or correct me so that I can be of assistance.

Since you're looking for the minimum distance and not exact distance you can skip the square root. I think Squared Euclidean Distance applies here.
You've said the values are bounded between 0-255, so you can make an indexed lookup table of squared differences. Since each difference ranges from -255 to 255, the table needs 511 rows.
Here is what I'm thinking in terms of SQL. r0, g0, and b0 represent the target color. The table Vector would hold the squared values from the lookup table described above. This solution would visit all the records, but the result set can be cut down to one row by sorting and selecting only the first row.
select
c.r, c.g, c.b,
mR.dist + mG.dist + mB.dist as squared_dist
from
colors c,
vector mR,
vector mG,
vector mB
where
c.r-r0 = mR.point and
c.g-g0 = mG.point and
c.b-b0 = mB.point
group by
c.r, c.g, c.b
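The point about skipping the square root can be checked with a small sketch outside the database. The name nearest_color and the sample colors are made up for illustration; the sketch still visits every row.

```python
import math


def nearest_color(colors, target):
    """Return the (r, g, b) tuple in colors with the smallest squared distance to target."""
    r0, g0, b0 = target
    return min(colors,
               key=lambda c: (c[0] - r0) ** 2 + (c[1] - g0) ** 2 + (c[2] - b0) ** 2)


colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (250, 10, 10)]
target = (255, 5, 5)

by_squared = nearest_color(colors, target)
# math.dist computes the true Euclidean distance (with the square root).
by_true_distance = min(colors, key=lambda c: math.dist(c, target))
print(by_squared, by_true_distance)
```

Both selections pick the same color, because the square root does not change the ordering of non-negative distances.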

The first level of optimization that I see you can do would be square the distance to which you want to limit the query so that you don't need to perform the square root for each row.
The second level of optimization I would encourage would be some preprocessing to alleviate the need for extraneous squaring for each query (which could possibly create some extra run time for large tables of RGB's). You'd have to do some benchmarking to see, but by substituting in values for a, b, c, and d and then performing the query, you could alleviate some stress from MySQL.
Note that the performance difference between the last two lines may be negligible. You'll have to use test queries on your system to determine which is faster.
I just re-read and noticed that you are ordering by distance. In that case, the d should be removed and everything should be moved to one side. You can still plug in the constants to prevent extra processing on MySQL's end.

I believe there are two options.
You have to either, as you say, iterate across the entire set, comparing each distance against the best one found so far (initially set to a sentinel value such as -1, meaning nothing has been found yet). This runs in linear time, since you're only comparing one point to every point in the set.
I'm still thinking of another option... something along the lines of doing a breadth first search away from the input point until a point is found in the set at the searched point, but this requires a bit more thought (I imagine the 3D space would have to be pretty heavily populated for this to be more efficient on average though).

If you run through every point and calculate the distance, don't use the square root function, it isn't necessary. The smallest sum of squares will be enough.
This is the problem you are trying to solve. (Planar case, select all points sorted by a x, y, or z axis. Then use PHP to process them)
MySQL also has a Spatial Database which may have this as a function. I'm not positive though.

PHP: Compare two sets of numbers, no dupes

I'm creating a lottery contest for my site, and I need to know the easiest way to compare numbers, so that no two people can choose the same numbers. It's 7 sets of numbers, each number is a number between 1 and 30.
For example, if user 1 chooses: 1, 7, 9, 17, 22, 25, 29 how can I make sure that user 2 can't choose those exact same numbers?
I was thinking about throwing all 7 numbers into an array, sort it so the numbers are in order, then join them into one string. Then when another user chooses their 7 numbers, it does the same, then compares the two. Is there a better way of doing it?

What you describe sounds like the best way to me, IF you are dealing with all submissions in the same script - I would trim(implode(',',$array)) the sorted array, store the resulting string in an array and call in_array() to determine whether the value already exists.
HOWEVER I suspect that what you are actually doing is storing the selections in a database table and comparing later submissions against this table. In this case (I am taking a liberty and assuming MySQL here but I would say it is the most common engine used with PHP) you should create a table with 7 columns choice_1, choice_2 ... choice_7(along with whatever other columns you want) and create a unique index across all seven choice_* columns. This means that when you try and insert a duplicate row, the query will fail. This lets MySQL do all the work for you.
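A minimal sketch of the unique-index idea, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The table name picks, the columns c1 to c7 and the function try_pick are assumptions. The numbers are sorted before the insert, because a unique index only catches a duplicate when the same numbers land in the same columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE picks ("
    "user_id INTEGER, c1 INTEGER, c2 INTEGER, c3 INTEGER, c4 INTEGER, "
    "c5 INTEGER, c6 INTEGER, c7 INTEGER, "
    "UNIQUE (c1, c2, c3, c4, c5, c6, c7))"
)


def try_pick(user_id, numbers):
    """Store the pick and return True, or return False if that set is already taken."""
    row = (user_id, *sorted(numbers))
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO picks VALUES (?, ?, ?, ?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False  # the unique index rejected a duplicate set


print(try_pick(1, [1, 7, 9, 17, 22, 25, 29]))   # True
print(try_pick(2, [29, 25, 22, 17, 9, 7, 1]))   # False
```

The second pick contains the same numbers in a different order and is rejected by the database itself, with no comparison logic in the application.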

Try array_diff. There are some really good examples on php.net.
