Finding similar number patterns in table

OK, let's suppose we have a members table with a field called, say, about_member. Every member has a string in it such as 1-1-2-1-2. Suppose member_1 has the string 1-1-2-2-1 and wants to find whoever has the same string, or one as similar as possible. For example, if member_2 has the string 1-1-2-2-1, that is a 100% match, but if member_3 has 2-1-1-2-1, that is a 60% match. The results have to be ordered by match percentage. What is the best way to do this with MySQL and PHP? It's hard to explain what I mean, so ask me if anything is unclear. Thanks.
Edit: Please give me ideas that don't use the Levenshtein method. That answer will get the bounty. Thanks. (The bounty will be announced as soon as I am able to do that.)

Convert your number sequences to bit masks and use BIT_COUNT(column ^ search) as the similarity function, ranging from 0 (= 100% match, the strings are equal) to [bit length] (= 0%, the strings are completely different). To convert this similarity value to a percentage, use
100 * (bit_length - similarity) / bit_length
For example, "1-1-2-2-1" becomes "00110" (assuming you have only two states), 2-1-1-2-1 is "10010", bit_count(00110 ^ 10010) = 2, bit-length = 5, and 100 * (5 - 2) / 5 = 60%.
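The arithmetic above can be checked with a minimal Python sketch (Python is used here only for illustration; the function names are hypothetical and it assumes every answer is either 1 or 2, mapped to bits 0 and 1):

```python
def pattern_to_mask(pattern: str) -> int:
    """Encode '1-1-2-2-1' as an integer bitmask: '1' -> bit 0, '2' -> bit 1."""
    bits = "".join("1" if part == "2" else "0" for part in pattern.split("-"))
    return int(bits, 2)


def match_percent(a: str, b: str) -> float:
    """Percentage of positions where two equal-length patterns agree."""
    bit_length = len(a.split("-"))
    # XOR leaves a 1 wherever the patterns differ; count those bits.
    differing = bin(pattern_to_mask(a) ^ pattern_to_mask(b)).count("1")
    return 100 * (bit_length - differing) / bit_length


print(match_percent("1-1-2-2-1", "2-1-1-2-1"))  # 60.0
```

In MySQL the XOR and the bit count happen inside the query, so only the encoding step would live in application code.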

Jawa posted this idea originally; here is my attempt.
^ is the XOR function. It compares 2 binary numbers bit-by-bit and returns 0 if both bits are the same, and 1 otherwise.
0 1 0 0 0 1 0 1 0 1 1 1 (number 1)
^ 0 1 1 1 0 1 0 1 1 0 1 1 (number 2)
= 0 0 1 1 0 0 0 0 1 1 0 0 (result)
How this applies to your problem:
// In binary...
1111 ^ 0111 = 1000 // (1 bit out of 4 didn't match: 75% match)
1111 ^ 0000 = 1111 // (4 bits out of 4 didn't match: 0% match)
// The same examples, except now in decimal...
15 ^ 7 = 8 (1000 in binary) // (1 bit out of 4 didn't match: 75% match)
15 ^ 0 = 15 (1111 in binary) // (4 bits out of 4 didn't match: 0% match)
How we can count these bits in MySQL:
BIT_COUNT(b'0111') = 3 // Bit count of binary '0111'
BIT_COUNT(7) = 3 // Bit count of decimal 7 (= 0111 in binary)
BIT_COUNT(b'1111' ^ b'0111') = 1 // (1 bit out of 4 didn't match: 75% match)
So to get the similarity...
// First we focus on calculating mismatch.
(BIT_COUNT(b'1111' ^ b'0111') / YOUR_TOTAL_BITS) = 0.25 (25% mismatch)
(BIT_COUNT(b'1111' ^ b'1111') / YOUR_TOTAL_BITS) = 0 (0% mismatch; 100% match)
// Now, getting the proportion of matched bits is easy
1 - (BIT_COUNT(b'1111' ^ b'0111') / YOUR_TOTAL_BITS) = 0.75 (75% match)
1 - (BIT_COUNT(b'1111' ^ b'1111') / YOUR_TOTAL_BITS) = 1.00 (100% match)
If we could just make your about_member field store data as bits (and be represented by an integer), we could do all of this easily! Instead of 1-2-1-1-1, use 0-1-0-0-0, but without the dashes.
Here's how PHP can help us:
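bindec('01000') == 8;
bindec('00001') == 1;
decbin(8) == '1000'; // decbin() drops leading zeros, so pad to the full width:
str_pad(decbin(8), 5, '0', STR_PAD_LEFT) == '01000';
str_pad(decbin(1), 5, '0', STR_PAD_LEFT) == '00001';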
And finally, here's the implementation:
// Setting a member's about_member property...
$about_member = '01100101';
$about_member_int = bindec($about_member);
// $name is assumed to be escaped already; prefer a prepared statement in real code
$query = "INSERT INTO members (name, about_member) VALUES ('$name', $about_member_int)";
// Getting matches...
$total_bits = 8; // The maximum length the about_member field can be (8 in this example)
$my_about_member = '00101100';
$my_about_member_int = bindec($my_about_member);
// MATCH is a reserved word in MySQL, so the alias is called match_ratio
$query = "
SELECT
*,
(1 - (BIT_COUNT(about_member ^ $my_about_member_int) / $total_bits)) AS match_ratio
FROM members
ORDER BY match_ratio DESC
LIMIT 10";
This last query will have selected the 10 members most similar to me!
Now, to recap, in layman's terms,
We use binary because it makes things easier; the binary number is like a long line of light switches. We want to save our "light switch configuration" as well as find members that have the most similar configurations.
The ^ operator, given 2 light switch configurations, does a comparison for us. The result is again a series of switches; a switch will be ON if the 2 original switches were in different positions, and OFF if they were in the same position.
BIT_COUNT tells us how many switches are ON--giving us a count of how many switches were different. YOUR_TOTAL_BITS is the total number of switches.
But binary numbers are still just numbers... and so a string of 1's and 0's really just represents a number like 133 or 94. But it's a lot harder to visualize our "light switch configuration" if we use decimal numbers. That's where PHP's decbin and bindec come in.
Learn more about the binary numeral system.
Hope this helps!

The obvious solution is to look at the Levenshtein distance (there isn't an implementation built into MySQL, but there are other implementations accessible, e.g. this one in PL/SQL, and some extensions). However, as usual, the right way to solve the problem would be to have normalised the data properly in the first place.

One way to do this is to calculate the Levenshtein distance between your search string and the about_member fields for each member. Here's an implementation of the function as a MySQL stored function.
With that you can do:
SELECT name, LEVENSHTEIN(about_member, '1-1-2-1-2') AS diff
FROM members
ORDER BY diff ASC
The % of similarity is related to diff; if diff=0 then it's 100%, if diff is the size of the string (minus the amount of dashes), it's 0%.

Having read the clarification comments on the original question, the Levenshtein distance is not the answer you are looking for.
You are not trying to compute the smallest number of edits to change one string into another.
You are trying to compare one set of numbers with another set of numbers. What you are looking for is the minimum (weighted) sum of the differences between the two sets of numbers.
Place each answer in a separate column (Ans1, Ans2, Ans3, Ans4, .... )
Assume you are searching for similarities to 1-2-1-2.
SELECT UserName, ABS(Ans1 - 1) + ABS(Ans2 - 2) + ABS(Ans3 - 1) + ABS(Ans4 - 2) AS Difference FROM members ORDER BY Difference ASC
Will list users by similarity to answers 1-2-1-2, assuming all questions are weighted evenly.
If you want to make certain answers more important, just multiply each of the terms by a weighting factor.
If the questions will always be yes/no and the number of answers is small enough that all the answers can be fitted into a single integer and all answers are equally weighted, then you could encode all the answers in a single column and use BIT_COUNT as suggested. This would be a faster and more space-efficient implementation.
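As a rough illustration of the weighted sum of differences, here is a small Python sketch. The member names, answers and weights are made up for the example, and in practice the sum would be computed in SQL as shown above:

```python
def weighted_difference(answers, target, weights=None):
    """Sum of |answer - target| per question, optionally weighted."""
    if weights is None:
        weights = [1] * len(target)  # all questions weighted evenly
    return sum(w * abs(a - t) for a, t, w in zip(answers, target, weights))


# Hypothetical data: one list of answers per member.
members = {"alice": [1, 2, 1, 2], "bob": [2, 2, 1, 1], "carol": [2, 1, 2, 1]}
target = [1, 2, 1, 2]

# Smallest difference first, i.e. most similar first.
ranked = sorted(members, key=lambda name: weighted_difference(members[name], target))
print(ranked)  # ['alice', 'bob', 'carol']
```

Passing a weights list such as [3, 1, 1, 1] makes a mismatch on the first question count three times as much as one on the others.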

I would go with the similar_text() PHP built-in. It seems to be exactly what you want:
$percent = 0;
similar_text($string1, $string2, $percent);
echo $percent;
It works as the question expects.

I would go with the Levenshtein distance approach; you can use it within MySQL or PHP.

If you don't have too many fields, you could create an index on the integer representation of about_member. Then you can find the 100% by an exact match on the about_member field, followed by the 80% matches by changing 1 bit, the 60% matches by changing 2 bits, and so on.
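One way to read this suggestion, sketched in Python with a hypothetical helper name: enumerate every value that differs from the search mask in exactly k bits, then look those values up through the index (for example with an IN list). The number of candidates grows quickly with the bit length, so this only pays off for short patterns.

```python
from itertools import combinations


def masks_at_distance(mask: int, bit_length: int, k: int):
    """Yield every bit_length-bit value that differs from mask in exactly k bits."""
    for positions in combinations(range(bit_length), k):
        flipped = mask
        for pos in positions:
            flipped ^= 1 << pos  # flip one chosen bit
        yield flipped


# 6 is 00110 in binary; these are the five 80% matches for a 5-answer pattern.
print(sorted(masks_at_distance(6, 5, 1)))  # [2, 4, 7, 14, 22]
```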

If you represent your answer patterns as bit sequences you can use the formula (100 * (bit_length - similarity) / bit_length).
Following the mentioned example, when we convert "1"s to bit off and "2"s to bit on "1-1-2-2-1" becomes 6 (as base-10, 00110 in binary) and "2-1-1-2-1" becomes 18 (10010b) etc.
Also, I think you should store the answers' bits to the least significant bits, but it doesn't matter as long as you are consistent that the answers of different members align.
Here's a sample script to be run against MySQL.
DROP TABLE IF EXISTS `members`;
CREATE TABLE `members` (
`id` VARCHAR(16) NOT NULL ,
`about_member` INT NOT NULL
) ENGINE = InnoDB;
INSERT INTO `members`
(`id`, `about_member`)
VALUES
('member_1', 6),
('member_2', 18);
SELECT 100 * ( 5 - BIT_COUNT( about_member ^ (
SELECT about_member
FROM members
WHERE id = 'member_1' ) ) ) / 5
FROM members;
The magical 5 in the script is the number of answers (bit_length in the formula above). You should change it according to your situation, regardless of how many bits there are in the actual data type used, as BIT_COUNT doesn't know how many bytes you are using.
BIT_COUNT returns the number of bits set and is explained in MySQL manual. ^ is the binary XOR operator in MySQL.
Here member_1's answers are compared with everybody's, including their own, which naturally results in a 100% match.

Related

Check 2 numbers algorithm

I have 2 fields on db. Minor and Major:
Minor, Major
0,0
1,0
2,0
3,0
4,0
5,0
7,0
8,0
...
65536,0
0,1
1,1
2,1
3,1
4,1
...
65536,1
0,2
What is the best way to compare these? I am doing this with Bookshelf.js, but PHP or Ruby is also welcome. I need to check the current state, take the greatest major, and add minor + 1 if minor is not 65536; otherwise minor becomes 0 and major becomes major + 1.
Thanks in advance.
EDIT:
I have to save major and minor to respective fields. They increment for every user registered.
eg.
Users
id, username,minor,major
1, john , 0, 0
2, mike, 1, 0
....
65537, jeff, 65536,0
Now Tom's major increments because the last minor in the table is 65536.
65538, tom, 0 , 1
I don't know how to explain more.

I'm not at all sure I understand the problem, but here are some ideas about limiting the range of an integer value:
Like many languages, MySQL has a SMALLINT UNSIGNED data type that holds 2-byte values, that is, from 0 to 65535 (not 65536!).
Most programming languages have a "modulus" operator (% in both PHP and MySQL) that gives you the remainder of an integer division. For example, ... % 65536 returns a value between 0 and 65535 inclusive. If you really need a value between 0 and 65536 inclusive, write ... % 65537 instead.
You could use a mask operator ("bitwise and", & in both PHP and MySQL). For example, ... & 0xFFFF keeps only the two least significant bytes of a number, which is equivalent to a "modulo 65536" operation (with a result between 0 and 65535 inclusive).

$magicNumber = 65536;
$sql = "
SELECT
MAX(userIndex) userIndex
FROM (
SELECT
(Minor + (Major * ".$magicNumber.")) AS userIndex
FROM TableName
) AS innerSelect
";
Running the SQL gives you the currently highest userIndex; let's say it is 145323.
Now increment this by one, and you have $newIndex = 145324.
This is the index for the new user. The fields can now be calculated like this:
$major = (int)($newIndex / $magicNumber);
$minor = $newIndex % $magicNumber;
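The same carry logic, as a short Python sketch. It follows this answer's assumption that minor runs from 0 to 65535 (65536 values per major); if minor really has to reach 65536 as in the question, the constant would be 65537. The function name is hypothetical.

```python
MINORS_PER_MAJOR = 65536  # assumption: minor runs from 0 to 65535


def next_major_minor(major: int, minor: int):
    """Return the (major, minor) pair that follows the given pair."""
    index = major * MINORS_PER_MAJOR + minor + 1
    # divmod gives (quotient, remainder), i.e. (major, minor).
    return divmod(index, MINORS_PER_MAJOR)


print(next_major_minor(0, 65535))   # (1, 0)
print(next_major_minor(2, 14251))   # (2, 14252), i.e. index 145323 -> 145324
```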

Bit flags in status. Find status where one flag = 0

Say i have three bit flags in a status, stored in mysql as an integer
Approved Has Result Finished
0|1 0|1 0|1
Now i want to find the rows with status: Finished = 1, Has Result = 1 and Approved = 0.
Then the data:
0 "000"
1 "001"
3 "011"
7 "111"
Should produce
false
false
true
false
Can I do something like? (in mysql)
status & "011" AND ~((bool) status & "100")
Can't quite figure out how to query "Approved = 0".
Or should i completely drop using bit flags, and split these into separate columns?
The reasoning for using bit flags is, in part, for mysql performance.

Use ints instead of binary text. Instead of 011, use 3.
To get approved rows:
SELECT
*
FROM
`foo`
WHERE
(`status` & 4)
or approved and finished rows:
SELECT
*
FROM
`foo`
WHERE
(`status` & 5) = 5
or finished but not approved:
SELECT
*
FROM
`foo`
WHERE
(`status` & 1)
AND
NOT (`status` & 4)
"Finished = 1, Has Result = 1 and Approved = 0" could be as simple as status = 3.

Something I liked to do when I began programming was using powers of 2 as flags given a lack of boolean or bit types:
const FINISHED = 1;
const HAS_RESULT = 2;
const APPROVED = 4;
Then you can check like this:
$status = 5; // 101
if ($status & FINISHED) {
/*...*/
}
EDIT:
Let me expand on this:
Can't quite figure out how to query "Approved = 0".
Or should i completely drop using bit flags, and split these into separate columns?
The reasoning for using bit flags is, in part, for mysql performance.
The issue is, you are not using bitwise flags. You are using a string which "emulates" a bitwise flag and sort of makes it hard to actually do proper flag checking. You'd have to convert it to its bitwise representation and then do the checking.
Store the flag value as an integer and declare the flag identifiers that you will then use to do the checking. A TINYINT should give you 7 possible flags (minus the most significant bit used for sign) and an unsigned TINYINT 8 possible flags.
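To make the "Approved = 0" check concrete, here is a minimal Python sketch using the flag values 1, 2 and 4 from above (the function name is hypothetical). It tests that both required bits are set and that the approved bit is clear, and runs against the four sample statuses from the question:

```python
FINISHED, HAS_RESULT, APPROVED = 1, 2, 4


def finished_with_result_not_approved(status: int) -> bool:
    """True when Finished and Has Result are set and Approved is not."""
    required = FINISHED | HAS_RESULT
    return (status & required) == required and (status & APPROVED) == 0


print([finished_with_result_not_approved(s) for s in (0, 1, 3, 7)])
# [False, False, True, False]
```

The equivalent MySQL condition would be (status & 3) = 3 AND (status & 4) = 0.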

rand function generates 5 digit number, how can I make it generate 11 digit number [duplicate]

I have a table named "message_group", and this table includes the fields "user_one", "user_two" and "hash". Hash is int(11). With the following piece of code I insert values into this table. I am using the rand() function. My problem is that this function only inserts a 5-digit number into the hash field of my table. I want this number to have 11 digits. How can I modify my code to achieve that?
if( isset($_POST['message']) && !empty($_POST['message']) ){
$random_number = rand();
$check_con = mysql_query("SELECT `hash` FROM `message_group` WHERE (`user_one`='$session_user_id' AND `user_two`='$user_id') OR (`user_one`='$user_id' AND `user_two`='$session_user_id')");
if( mysql_num_rows($check_con) == 1 ){
echo"conversation already started";
}else{
mysql_query(" INSERT INTO message_group VALUES ('$session_user_id', '$user_id', '$random_number') ");
echo"conversation started";
}
}

PHP's rand() is only 15 bits on some platforms. You can increase it to 30 bits by using:
((rand() << 15) ^ rand())
This will give you a number in the range 0..1073741823, still evenly distributed. If you need something more specific than just a bigger range, you'll have to do some fancier math after that. You might also consider mt_rand().
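The shift-and-combine trick can be sketched in Python as follows. rand15() is a hypothetical stand-in for a generator that only delivers 15 random bits, as PHP's rand() does on the platforms mentioned:

```python
import random


def rand15() -> int:
    """Stand-in for a generator limited to 15 bits (0..32767)."""
    return random.getrandbits(15)


def rand30() -> int:
    """Combine two 15-bit values into one 30-bit value (0..1073741823)."""
    # The high 15 bits come from the first call, the low 15 bits from the second.
    return (rand15() << 15) ^ rand15()


print(rand30())
```

Because the two halves occupy different bit positions, XOR here simply concatenates them, which is why the result stays evenly distributed.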

From PHP rand()
Note: On some platforms (such as Windows), getrandmax() is only 32767.
If you require a range larger than 32767, specifying min and max will
allow you to create a range larger than this, or consider using
mt_rand() instead.
Therefore, don't rely on the default values; specify min and max:
int rand ( int $min , int $max )

A uniformly distributed random number with an arbitrary number of digits can be generated from a 16-bit RNG by repeated multiply-and-add. Example (pseudocode):
ans=0;
mult = MAXRAND+1;
for(i=0;i<5;i++) ans=mult*ans+rand();
As long as ans can hold the result, you can make any size uniformly distributed random number this way. Obviously you want to make sure you handle overflow, and limit the number of digits at the end.

mysql between question

For the mysql "between" operator, is it necessary for the before and after value to be numerically in order?
like:
BETWEEN -10 AND 10
BETWEEN 10 AND -10
Will both of these work or just the first one?
Also, can I do:
WHERE thing<10 AND thing>-10
Will that work or do I have to use between?
Lastly, can I do:
WHERE -10<thing<10
?

BETWEEN -10 AND 10
This will match any value from -10 to 10, bounds included.
BETWEEN 10 AND -10
This will never match anything.
WHERE thing<10 AND thing>-10
This will match any value from -10 to 10, bounds excluded.
Also, if thing is a non-deterministic expression, it is evaluated once in case of BETWEEN and twice in case of double inequality:
SELECT COUNT(*)
FROM million_records
WHERE RAND() BETWEEN 0.6 AND 0.8;
will return a value around 200,000;
SELECT COUNT(*)
FROM million_records
WHERE RAND() >= 0.6 AND RAND() <= 0.8;
will return a value around 320,000
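The two figures follow from simple probabilities: a single draw lands in [0.6, 0.8] with probability 0.2, while two independent draws satisfy "first >= 0.6 and second <= 0.8" with probability 0.4 * 0.8 = 0.32. A small Python simulation (not MySQL; the function name and the seed are arbitrary) shows the same effect:

```python
import random


def hit_rates(trials: int = 200_000, seed: int = 1):
    """Compare one draw per row (BETWEEN) with two draws per row (two inequalities)."""
    rng = random.Random(seed)
    # One random value, tested against both bounds.
    once = sum(0.6 <= rng.random() <= 0.8 for _ in range(trials))
    # A fresh random value for each inequality.
    twice = sum(rng.random() >= 0.6 and rng.random() <= 0.8 for _ in range(trials))
    return once / trials, twice / trials


once, twice = hit_rates()
print(once, twice)  # roughly 0.2 and 0.32
```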

The min value must come before the max value. Also note that the end points are included, so BETWEEN is equivalent to:
WHERE thing>=-10 AND thing<=10

Please keep it to one question per post. Anyway:
http://dev.mysql.com/doc/refman/5.0/en/comparison-operators.html#operator_between
BETWEEN min AND max, in that order.
from the link:
This is equivalent to the expression (min <= expr AND expr <= max) if
all the arguments are of the same type
The second alternative will also work, of course.

First question:
Will both of these work or just the first one?
Both are valid syntax, but only the first one works as intended; BETWEEN 10 AND -10 just gives an empty result.
Second question:
Will that work or do I have to use between?
It is also valid, but note that < and > exclude the bounds.

Yes, the bounds of your BETWEEN must be in order to return the expected result.
Let's say you have a table with a column called MyNumber that contains 10 rows:
MyNumber
--------
1
2
3
4
5
6
7
8
9
10
So
select * from thistable t where t.MyNumber BETWEEN 1 and 5
will return
1
2
3
4
5
but
select * from thistable t where t.MyNumber BETWEEN 5 and 1
return nothing.
Your 2nd question: yes, it is the same thing, but beware: in your example you would have to use <= and >= to match BETWEEN. If not, in our example, you would get
2
3
4
Hope it helps.

I've already seen such things work with integers:
WHERE -10<thing<10
But it's better to avoid it. One reason is that it doesn't seem to work well with other types, and MySQL doesn't issue any warning.
I've tried it with datetime columns, and the result was wrong.
My request looked like this one:
SELECT *
FROM FACT__MODULATION_CONSTRAINTS constraints
WHERE constraints.START_VALIDITY<= now() < constraints.END_VALIDITY
The result was not as expected. I got twice as many results as the same request with two inequalities (which returned correct results). Only the 1st part of the expression evaluated correctly.
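A likely explanation is that MySQL evaluates a chained comparison left to right: the first comparison yields 1 or 0, and that number is then compared with the second bound. The Python sketch below emulates this reading with a hypothetical helper (Python's own a < b < c chaining behaves differently and would give the mathematically expected answer):

```python
def mysql_chained_less_than(low, value, high) -> bool:
    """Emulate left-to-right evaluation of `low < value < high`."""
    first = int(low < value)  # the first comparison yields 1 or 0
    return first < high       # that 1 or 0 is then compared with `high`


print(mysql_chained_less_than(-10, 500, 10))  # True, although 500 is not below 10
print(-10 < 500 and 500 < 10)                 # False, the intended result
```

With integer bounds like these, the chained form is true for every value, which is why it only looks as if it works.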

How can I create a specified amount of random values that all equal up to a specified number in PHP?

For example, say I enter '10' for the amount of values, and '10000' as a total amount.
The script would need to randomize 10 different numbers that all equal up to 10000. No more, no less.
But it needs to be dynamic, as well. As in, sometimes I might enter '5' or '6' or even '99' for the amount of values, and any number (up to a billion or even higher) as the total amount.
How would I go about doing this?
EDIT: I should also mention that all numbers need to be a positive integer

The correct answer here is unbelievably simple.
Just imagine a white line, let's say 1000 units long.
You want to divide the line in to ten parts, using red marks.
VERY SIMPLY, CHOOSE NINE RANDOM NUMBERS and put a red paint mark at each of those points.
It's just that simple. You're done!
Thus, the algorithm is:
(1) pick nine random numbers between 0 and 1000
(2) put the nine numbers, a zero, and a 1000, in an array
(3) sort the array
(4) using subtraction get the ten "distances" between array values
You're done.
(Obviously if you want to have no zeros in your final set, in part (1) simply rechoose another random number if you get a collision.)
Ideally as programmers, we can "see" visual algorithms like this in our heads -- try to think visually whatever we do!
Footnote - for any non-programmers reading this, just to be clear pls note that this is like "the first thing you ever learn when studying computer science!" i.e. I do not get any credit for this, I just typed in the answer since I stumbled on the page. No kudos to me!
Just for the record another common approach (depending on the desired outcome, whether you're dealing with real or whole numbers, and other constraints) is also very "ah hah!" elegant. All you do is this: get 10 random numbers. Add them up. Remarkably simply, just: multiply or divide them all by some number, so that, the total is the desired total! It's that easy!
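The cut-point algorithm from steps (1) to (4) can be sketched in Python like this. The function name is hypothetical, and the sketch picks distinct cut points strictly between 0 and the total so that every part is a positive integer, as the question's edit requires:

```python
import random


def random_parts(total: int, count: int, rng=random):
    """Split total into count positive integers using count - 1 distinct cut points."""
    # Distinct cut points in 1..total-1 guarantee that no part is zero.
    cuts = sorted(rng.sample(range(1, total), count - 1))
    bounds = [0] + cuts + [total]
    # The parts are the distances between neighbouring marks.
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


parts = random_parts(10000, 10)
print(parts, sum(parts))
```

This needs count to be no larger than total; otherwise there are not enough distinct cut points to choose from.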

maybe something like this:
set max amount remaining to the target number
loop for 1 to the number of values you want - 1
get a random number from 0 to the max amount remaining
set new max amount remaining to old max amount remaining minus the current random number
repeat loop
you will end up with a 'remainder' so the last number is determined by whatever is left over to make up the original total.

Generate 9 random numbers between 0 and 10000.
Sort them from big to small: r0 to r8.
g0 = 10000 - r0
g1 = r0 - r1
...
g8 = r7 - r8
g9 = r8
This will yield 10 random numbers over the full range which add up to 10000.

I believe the answer provided by @JoeBlow is largely correct, but only if the 'randomness' desired requires uniform distribution. In a comment on that answer, @Artefacto said this:
It may be simple but it does not generate uniformly distributed numbers...
It is biased in favor of numbers of size 1000/10 (for a sum of 1000 and 10 numbers).
This raises the question, mentioned previously, of the desired distribution of these numbers. JoeBlow's method does ensure that element 1 has the same chance of being number x as element 2, which means that it must be biased towards numbers of size Max/n. Whether the OP wanted a more likely shot at a single element approaching Max or wanted a uniform distribution was not made clear in the question. [Apologies - I am not sure from a terminology perspective whether that makes a 'uniform distribution', so I refer to it in layman's terms only]
In all, it is incorrect to say that a 'random' list of elements is necessarily uniformly distributed. The missing element, as stated in other comments above, is the desired distribution.
To demonstrate this, I propose the following solution, which contains sequential random numbers of a random distribution pattern. Such a solution would be useful if the first element should have an equal chance at any number between 0-N, with each subsequent number having an equal chance at any number between 0-[Remaining Total]:
[Pseudo code]:
Create Array of size N
Create Integer of size Max
Loop through each element of N Except the last one
N(i) = RandomBetween (0, Max)
Max = Max - N(i)
End Loop
N(N) = Max
It may be necessary to take these elements and randomize their order after they have been created, depending on how they will be used [otherwise, the average size of each element decreases with each iteration].
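The pseudocode above translates to Python roughly as follows (the function name is hypothetical). Note that this variant can produce zeros, and the final shuffle is the optional reordering step just mentioned:

```python
import random


def sequential_parts(total: int, count: int, rng=random):
    """Draw each part from 0..remaining; the last part takes whatever is left."""
    remaining = total
    parts = []
    for _ in range(count - 1):
        part = rng.randint(0, remaining)  # inclusive on both ends
        parts.append(part)
        remaining -= part
    parts.append(remaining)
    # Shuffle so that the typical size does not shrink with position.
    rng.shuffle(parts)
    return parts


result = sequential_parts(10000, 10)
print(result, sum(result))
```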

Update: @Joe Blow has the perfect answer. My answer has the special feature of generating chunks of approximately the same size (or at least a difference no bigger than (10000 / 10)), so I am leaving it in place for that reason.
The easiest and fastest approach that comes to my mind is:
Divide 10000 by 10 and store the values in an array. (10 times the value 1000)
Walk through every one of the 10 elements in a for loop.
From each element, subtract a random number between 0 and (10000 / 10).
Add that number to the following element.
This will give you a number of random values that, when added, will result in the end value (ignoring floating point issues).
Should be half-way easy to implement.
You'll reach PHP's maximum integer limit at some point, though. Not sure how far this can be used for values towards a billion and beyond.

Related: http://www.mathworks.cn/matlabcentral/newsreader/view_thread/141395
See this MATLAB package. It is accompanied with a file with the theory behind the implementation.
This function generates random, uniformly distributed vectors, x = [x1,x2,x3,...,xn]', which have a specified sum s, and for which we have a <= xi <= b, for specified values a and b. It is helpful to regard such vectors as points belonging to n-dimensional Euclidean space and lying in an n-1 dimensional hyperplane constrained to the sum s. Since, for all a and b, the problem can easily be rescaled to the case where a = 0 and b = 1, we will henceforth assume in this description that this is the case, and that we are operating within the unit n-dimensional "cube".
This is the implementation (© Roger Stafford):
function [x,v] = randfixedsum(n,m,s,a,b)
% Rescale to a unit cube: 0 <= x(i) <= 1
s = (s-n*a)/(b-a);
% Construct the transition probability table, t.
% t(i,j) will be utilized only in the region where j <= i + 1.
k = max(min(floor(s),n-1),0); % Must have 0 <= k <= n-1
s = max(min(s,k+1),k); % Must have k <= s <= k+1
s1 = s - [k:-1:k-n+1]; % s1 & s2 will never be negative
s2 = [k+n:-1:k+1] - s;
w = zeros(n,n+1); w(1,2) = realmax; % Scale for full 'double' range
t = zeros(n-1,n);
tiny = 2^(-1074); % The smallest positive matlab 'double' no.
for i = 2:n
tmp1 = w(i-1,2:i+1).*s1(1:i)/i;
tmp2 = w(i-1,1:i).*s2(n-i+1:n)/i;
w(i,2:i+1) = tmp1 + tmp2;
tmp3 = w(i,2:i+1) + tiny; % In case tmp1 & tmp2 are both 0,
tmp4 = (s2(n-i+1:n) > s1(1:i)); % then t is 0 on left & 1 on right
t(i-1,1:i) = (tmp2./tmp3).*tmp4 + (1-tmp1./tmp3).*(~tmp4);
end
% Derive the polytope volume v from the appropriate
% element in the bottom row of w.
v = n^(3/2)*(w(n,k+2)/realmax)*(b-a)^(n-1);
% Now compute the matrix x.
x = zeros(n,m);
if m == 0, return, end % If m is zero, quit with x = []
rt = rand(n-1,m); % For random selection of simplex type
rs = rand(n-1,m); % For random location within a simplex
s = repmat(s,1,m);
j = repmat(k+1,1,m); % For indexing in the t table
sm = zeros(1,m); pr = ones(1,m); % Start with sum zero & product 1
for i = n-1:-1:1 % Work backwards in the t table
e = (rt(n-i,:)<=t(i,j)); % Use rt to choose a transition
sx = rs(n-i,:).^(1/i); % Use rs to compute next simplex coord.
sm = sm + (1-sx).*pr.*s/(i+1); % Update sum
pr = sx.*pr; % Update product
x(n-i,:) = sm + pr.*e; % Calculate x using simplex coords.
s = s - e; j = j - e; % Transition adjustment
end
x(n,:) = sm + pr.*s; % Compute the last x
% Randomly permute the order in the columns of x and rescale.
rp = rand(n,m); % Use rp to carry out a matrix 'randperm'
[ig,p] = sort(rp); % The values placed in ig are ignored
x = (b-a)*x(p+repmat([0:n:n*(m-1)],n,1))+a; % Permute & rescale x
return
