Efficiently get diff of large data set?


I need to diff the results of two queries: first showing the rows that are in the "old" set but not in the "new" set, and then the rows that are in the "new" set but not in the "old" one.
Right now, I'm pulling the results into arrays and running array_diff() on them. But I'm hitting resource and timing issues, as the sets are close to 1 million rows each.
The schema is the same in both result sets (barring the setId number and the table's autoincrement number), so I assume there's a good way to do it directly in MySQL, but I'm not finding how.
Example Table Schema:
rowId,setId,userId,name
Example Data:
1,1,user1,John
2,1,user2,Sally
3,1,user3,Tom
4,2,user1,John
5,2,user2,Thomas
6,2,user4,Frank
What I need to do is figure out the adds/deletes between setId 1 and setId 2.
So, the result of the diff should (for the example) show:
Rows that are in both setId 1 and setId 2:
1,1,user1,John
Rows that are in setId 1 but not in setId 2:
2,1,user2,Sally
3,1,user3,Tom
Rows that are in setId 2 but not in setId 1:
5,2,user2,Thomas
6,2,user4,Frank
I think that's all the details, and I think I got the example correct. Any help would be appreciated. Solutions in MySQL or PHP are fine by me.

You can use EXISTS or NOT EXISTS to get the rows that are in both sets or in only one of them. The queries below use the column names from the question (setId, userId) and assume both sets live in one table, here called set1.
Users in set 1 but not in set 2 (swap the setId values for the opposite direction):
select * from set1 s1
where s1.setId = 1
and not exists (
-- select 1 rather than count(*): an aggregate always returns a row,
-- so NOT EXISTS over count(*) would never be true
select 1 from set1 s2
where s2.setId = 2
and s2.userId = s1.userId
)
Users that are in both sets:
select * from set1 s2
where s2.setId = 2
and exists (
select 1 from set1 s1
where s1.setId = 1
and s2.userId = s1.userId
)
If you only want the distinct users that are in both groups, group by userId:
select min(rowId), userId from set1
where setId in (1,2)
group by userId
having count(distinct setId) = 2
Or, for users that are in group 1 but not in the other:
select min(rowId), userId from set1
where setId in (1,2)
group by userId
having count(case when setId <> 1 then 1 end) = 0

What we ended up doing was adding a checksum column to the tables being diffed. That way, instead of having to select multiple columns for comparison, the diff could be done against a single column (the checksum value).
The checksum value was a simple md5 hash of a serialized array that contained the columns to be diffed. So... it was like this in PHP:
$checksumString = serialize($arrayOfColumnValues);
$checksumValue = md5($checksumString);
That $checksumValue would then be inserted/updated into the tables, and then we can more easily do the joins/unions etc on a single column to find the differences. It ended up looking something like this:
SELECT i.id, i.checksumvalue
FROM SAMPLE_TABLE_I i
WHERE i.checksumvalue not in(select checksumvalue from SAMPLE_TABLE_II)
UNION ALL
SELECT ii.id, ii.checksumvalue
FROM SAMPLE_TABLE_II ii
WHERE ii.checksumvalue not in(select checksumvalue from SAMPLE_TABLE_I);
This runs fast enough for my purposes, at least for now :-)
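The checksum idea can also be sketched outside the database. The following is a minimal illustration in Python, not the code we actually ran: json.dumps stands in for PHP's serialize(), the function names are made up for the example, and rows are assumed to be (rowId, setId, userId, name) tuples where only userId and name take part in the comparison.

```python
import hashlib
import json


def row_checksum(values):
    """MD5 hex digest of the compared column values (the role md5(serialize(...)) plays in PHP)."""
    # json.dumps is an assumption here; any stable serialization would do.
    serialized = json.dumps(list(values), separators=(",", ":"))
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


def diff_sets(old_rows, new_rows):
    """Return (only_in_old, only_in_new, in_both), compared on the checksum of (userId, name)."""
    # Skip rowId and setId (the first two fields), which always differ between sets.
    old_by_sum = {row_checksum(row[2:]): row for row in old_rows}
    new_by_sum = {row_checksum(row[2:]): row for row in new_rows}
    only_old = [row for s, row in old_by_sum.items() if s not in new_by_sum]
    only_new = [row for s, row in new_by_sum.items() if s not in old_by_sum]
    both = [row for s, row in old_by_sum.items() if s in new_by_sum]
    return only_old, only_new, both


old = [(1, 1, "user1", "John"), (2, 1, "user2", "Sally"), (3, 1, "user3", "Tom")]
new = [(4, 2, "user1", "John"), (5, 2, "user2", "Thomas"), (6, 2, "user4", "Frank")]
only_old, only_new, both = diff_sets(old, new)
```

Note that rows with identical compared values inside one set collapse to a single entry in this sketch, because they share a checksum.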

Related

How can I find the leaderboard position of a user based on the amount of points they have using SQL?

In my database I have a profile table which includes the following columns:
Id, profilename,points,lastupdated
Users can be put into the profile table more than once, and for the leaderboard I use the value of the max(lastupdated) column.
Below I have SQL code to try to determine someone's leaderboard position, and it works except for one small problem.
SELECT COUNT( DISTINCT `profilename` )
FROM `profiletable`
WHERE `points` >= integer
If two people have the same points, then they have the same leaderboard position, but this makes the count function work in a way that I did not intend.
For example:
position 1: profilename john points 500
position 2: profilename steve points 300
position 2: profilename alice points 300
position 4: profilename ben points 200
When I run the SQL query with Integer being set to john's points, I get a return value of 1 from the count function, but when I run the SQL query for steve or alice, instead of getting 2, I get 3.
I have also tried using the logic points > integer but this returns 1 for steve or alice.
How can I modify the SQL query so that the correct leaderboard position is returned by the count?

Use greater than in the where clause and add 1 to the count:
SELECT COUNT( DISTINCT `profilename` ) + 1
FROM `profiletable`
WHERE `points` > integer
Update:
If you need to take the updated time field into account as well, you first need to get the record with the latest update time for each profile:
SELECT COUNT(p1.`profilename` ) + 1
FROM `profiletable` p1
LEFT JOIN `profiletable` p2 ON p1.profilename=p2.profilename
and p1.lastupdated<p2.lastupdated
WHERE p1.`points` > integer and p2.profilename is null
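To make the counting rule concrete, here is a small Python sketch of the same logic: keep only the most recent record per profile, then count how many profiles have strictly more points and add 1. The function name and the sample rows (including the lastupdated values) are invented for illustration.

```python
def leaderboard_position(rows, points):
    """rows: iterable of (profilename, points, lastupdated) tuples."""
    latest = {}
    for name, pts, updated in rows:
        # Keep only the most recently updated record of each profile.
        if name not in latest or updated > latest[name][1]:
            latest[name] = (pts, updated)
    # Position = number of profiles with strictly more points, plus one.
    return 1 + sum(1 for pts, _ in latest.values() if pts > points)


rows = [
    ("john", 400, 1), ("john", 500, 2),  # john's newer record has 500 points
    ("steve", 300, 1),
    ("alice", 300, 1),
    ("ben", 200, 1),
]
```

With these rows, 500 points gives position 1, 300 points gives position 2 for both steve and alice, and 200 points gives position 4.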

MYSQL: SELECT * FROM `user_log` WHERE `id` is sequential and `username` == EQUAL in multiple rows

I've searched high and low for an answer to this.
I have a database that collects data whenever a user logs onto our network.
Some users are complaining of disconnections, so I would like to crawl the database, and find any sections where a user is appearing in the database on 3 sequential rows.
Database Structure is:
ID USER
1 MIKE
2 JOHN
3 MIKE
4 MIKE
5 MIKE
6 JOHN
7 JOHN
8 MIKE
I would like the query to return the below (user Mike logged on with 3 sequential IDs):
ID USER
3 MIKE
4 MIKE
5 MIKE
I'm stumped as to how to even attack this.
I'm thinking something like:
SELECT * FROM `user_log` WHERE `id` IS sequential??? and `username` == ???
Possibly a sub-select ?

What you need to do is establish a grouping identifier for each consecutive sequence of users, and then use that as a temporary table to perform a query that groups on that new grouping identifier. From that, we just grab any group that has three or more rows, and can use the min/max values of the id to show your range. We need to use variables to accomplish this.
select min(id), max(id), user
from (
select if(@prev != user, if(@prev := user, @rc := @rc + 1, @rc := @rc + 1), @rc) g,
id, user
from log, (select @prev := -1, @rc := 0) q
order by id desc
) q
group by g
having count(g) >= 3;
This part: (select @prev := -1, @rc := 0) q initialises the variables for us so that we can do it in a single statement.
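The grouping idea (give every run of consecutive rows with the same user its own group, then keep the groups with three or more rows) can be illustrated in Python with itertools.groupby. This is only a sketch of the logic, with a hypothetical function name and the sample data from the question. Like the query, it treats rows that are adjacent in id order as consecutive.

```python
from itertools import groupby


def consecutive_runs(rows, min_len=3):
    """rows: (id, user) tuples sorted by id. Returns (first_id, last_id, user) per qualifying run."""
    runs = []
    # groupby starts a new group every time the user changes from one row to the next.
    for user, group in groupby(rows, key=lambda row: row[1]):
        ids = [row[0] for row in group]
        if len(ids) >= min_len:
            runs.append((ids[0], ids[-1], user))
    return runs


log = [
    (1, "MIKE"), (2, "JOHN"), (3, "MIKE"), (4, "MIKE"),
    (5, "MIKE"), (6, "JOHN"), (7, "JOHN"), (8, "MIKE"),
]
runs = consecutive_runs(log)
```

For the sample data, the only run of three or more is MIKE from id 3 to id 5.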

This alternative doesn't use variables. It creates two temporary tables, a and b, containing names and either the next id number (in table a) or the one after that (in table b), and then checks for each entry in the original table whether there is a corresponding entry in the two temporary tables with matching name.
SELECT
user_log.username, user_log.id-2 FROM user_log,
(SELECT username, id, (id+1) as nxt FROM user_log) as a,
(SELECT username, id, (id+2) as nxtnxt FROM user_log) as b
WHERE
user_log.id=a.nxt and
user_log.username=a.username and
user_log.id=b.nxtnxt and
user_log.username=b.username;
It returns name and the location (id) of the "event". It doesn't return the sequence as you requested, since that seems redundant to me. id-2 is used in the result because the structure natively returns the last id in the triplet, but the last or middle id might be just as useful depending on how you're going to use the result.
One of the things to watch out for is if you have four entries in a row with the same name, it will give you two results.
Anyone searching for longer sequences is better off using pala_'s variable method, but this method is also useful if you want to find other patterns. For example, if you wanted to find sequences like 'Mike', something, 'Mike', something, 'Mike', you could simply replace id+1 with id+2 and id+2 with id+4 in the subqueries.

Random unique Mysql Id

I have a table and want to generate a random unique value for it using one MySQL query.
Table structure like:
id | unique_id
1 | 1
2 | 5
3 | 2
4 | 7
etc
unique_id is integer(10) unsigned
So I want to fill the unique_id field with a unique random value (not 1, 5, 2 or 7 in my example) every time.
The algorithm is:
1. Get 1 unique random value, which will not have duplicates in unique_id field
2. Create new row with unique_id = result of 1 query
I tried
SELECT FLOOR(RAND() * 9) AS random_number
FROM table
HAVING random_number NOT IN (SELECT unique_id FROM table)
LIMIT 1
but it generates values that are not unique.
Note: multiplier = 9 is given just for example, it is easy to reproduce this problem with such multiplier

One way to do this is to use the id column, just as a random permutation:
select id
from table
order by rand()
limit 1
(Your example only returns one value.)
To return a repeatable random number for each id, you can do something like:
select id,
(select count(*) from table t2 where rand(t2.id) < rand(t.id)) as randomnumber
from table t
What this is doing is producing a stable sort order by seeding the random number generator. This guarantees uniqueness, although this is not particularly efficient.
A more efficient alternative that uses variables is:
SELECT id, @curRow := @curRow + 1 AS random_number
FROM table CROSS JOIN (SELECT @curRow := 0) r
order by rand()
Note: this returns random numbers up to the size of the table, not necessarily from the ids. This may be a good thing.
Finally, you can make the approach you were attempting work with a bit of a trick. Calculate an md5 hash, then cast the first four characters as an integer and check back in the table:
SELECT convert(hex(left(md5(rand()), 4)), unsigned) AS random_number
FROM table
HAVING random_number NOT IN (SELECT unique_id FROM table)
LIMIT 1
You need to insert the value back into the table. And, there is no guarantee that you will actually be able to get a value not in the table, but it should work for up to millions of values.
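The generate-and-check-back idea looks like this when written as a retry loop in application code. This is a hedged Python sketch with an invented function name and parameters: the used set stands in for the unique_id values already in the table. In a real application the column should also carry a UNIQUE constraint, since two concurrent callers could otherwise pick the same value.

```python
import random


def pick_unused_id(used, upper, rng=random, max_tries=1000):
    """Return a random integer in [0, upper) that is not in `used`."""
    for _ in range(max_tries):
        candidate = rng.randrange(upper)
        if candidate not in used:
            return candidate
    # As noted above, there is no guarantee a free value is found.
    raise RuntimeError("no unused value found; the range may be exhausted")


used = {1, 5, 2, 7}                  # unique_id values from the question's example
rng = random.Random(0)               # seeded only to make the example repeatable
new_id = pick_unused_id(used, 9, rng)
used.add(new_id)                     # corresponds to inserting the new row
```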

If you are able to use an MD5 hash as the unique_id, go for MD5(NOW()). This will generate a different ID each time as long as no two rows are created within the same second, since NOW() has one-second resolution by default.
Reference: MySQL Forums

Select random row per distinct field value?

I have a MySQL query that results in something like this:
person | some_info
==================
bob | pphsmbf24
bob | rz72nixdy
bob | rbqqarywk
john | kif9adxxn
john | 77tp431p4
john | hx4t0e76j
john | 4yiomqv4i
alex | n25pz8z83
alex | orq9w7c24
alex | beuz1p133
etc...
(This is just a simplified example. In reality there are about 5000 rows in my results).
What I need to do is go through each person in the list (bob, john, alex, etc...) and pull out a row from their set of results. The row I pull out is sort of random but sort of also based on a loose set of conditions. It's not really important to specify the conditions here so I'll just say it's a random row for the example.
Anyways, using PHP, this solution is pretty simple. I make my query and get 5000 rows back and iterate through them pulling out my random row for each person. Easy.
However, I'm wondering if it's possible to get what I would from only a MySQL query so that I don't have to use PHP to iterate through the results and pull out my random rows.
I have a feeling it might involve a BUNCH of subselects, like one for each person, in which case that solution would be more time, resource and bandwidth intensive than my current solution.
Is there a clever query that can accomplish this all in one command?
Here is an SQLFiddle that you can play with.

To get a random value for a distinct name use
SELECT r.name,
(SELECT r1.some_info FROM test AS r1 WHERE r.name=r1.name ORDER BY rand() LIMIT 1) AS 'some_info'
FROM test AS r
GROUP BY r.name ;
Put this query as it stands in your sqlfiddle and it will work
I'm using r and r1 as table alias names. The subquery selects a random some_info for each name.
SQL Fiddle is here
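For comparison, the same result (one random some_info per name) is also short when done in application code. The sketch below is in Python rather than PHP and uses a made-up function name; rows are assumed to be (person, some_info) pairs as in the question.

```python
import random
from collections import defaultdict


def random_row_per_person(rows, rng=random):
    """Pick one some_info value at random for each distinct person."""
    by_person = defaultdict(list)
    for person, info in rows:
        by_person[person].append(info)
    return {person: rng.choice(infos) for person, infos in by_person.items()}


rows = [
    ("bob", "pphsmbf24"), ("bob", "rz72nixdy"), ("bob", "rbqqarywk"),
    ("john", "kif9adxxn"), ("john", "77tp431p4"),
    ("alex", "n25pz8z83"), ("alex", "orq9w7c24"),
]
picked = random_row_per_person(rows, random.Random(42))
```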

My first response would be to use php to generate a random number:
$randId = rand($min, $max);
Then run a SQL query that only gets the record where your index equals $randId.

Here is the solution:
select person, acting from personel where id in (
select lim from
(select count(person) c, min(id) i, cast(rand()*(count(person)-1) +min(id)
as unsigned) lim from personel group by person order by i) t1
)
The table used in the example is below:
create table personel (
id int(11) not null auto_increment,
person char(16),
acting char(19),
primary key(id)
);
insert into personel (person,acting) values
('john','abd'),('john','aabd'),('john','adbd'),('john','abfd'),
('alex','ab2d'),('alex','abd3'),('alex','ab4d'),('alex','a6bd'),
('max','ab2d'),('max','abd3'),('max','ab4d'),('max','a6bd'),
('jimmy','ab2d'),('jimmy','abd3'),('jimmy','ab4d'),('jimmy','a6bd');

You can limit the number of rows returned and order by rand() to get your desired result.
Perhaps if you tried something like this:
SELECT name, some_info
FROM test
WHERE name = 'tara'
ORDER BY rand()
LIMIT 1

Help with PHP MYSQL Select Statement

I can't seem to grasp how I can select records when the records of one user span multiple rows.
Here is the schema of the table.
user_id   key           value
------------------------------------------
1         text          this is sample text
1         text_status   0
2         text          this is sample text
2         text_status   1
From the above table you can see that each user's info spans multiple rows. So in this case, how do I select, say, "all the IDs and text values where text_status is 1"?
And to complicate it one step further, I need the email address of these accounts, which is in another table. How can I write one select statement to pull in the email address as well? I know there is a JOIN statement for this, but it's a bit complicated for me, especially since I can't even figure out the first part.
Added note: I must state that this table schema is the WordPress default table wp_usermeta.

SELECT t1.*
FROM tbl t2
INNER JOIN tbl t1 ON t1.user_id = t2.user_id
AND t1.key = 'text'
WHERE t2.key = 'text_status'
AND t2.value = '1'

I think you've set up your table incorrectly. Make text_status and value exist within the same row.
The way it is right now, you would have to conduct two queries to get to your end result, whereas the correct way needs only one.

This arbitrary key:value list scheme is alluring because of its flexibility. But it complicates queries obviously. Depending on the structure of your second table you could get away with:
SELECT `key`, `value` FROM user_table WHERE user_id=123
UNION ALL
SELECT 'email' AS `key`, email AS `value` FROM email_table WHERE user_id=123
But that pretty much only returns a list still, not a set of fields.

The key and value columns look wrong. SQL already gives you a "key" (the column name) and multiple "values" (the values stored per column in each row).
You've designed your table in a way that contravenes the way Database Management Systems are designed to work, which is leading to your problem. Read about database normalization.
Ideally your table would look something like this:
user_id   text_status   text
------------------------------------------
1         0             this is sample text
2         1             this is sample text
Then your query looks like:
SELECT `user_id`, `text` FROM `table` WHERE `text_status` = '1';
As your table stands now, you'll need something like this (untested). Note the INNER JOIN: a LEFT JOIN would keep every "text" row, whether or not the user has text_status = 1.
SELECT `table`.*
FROM `table` INNER JOIN
(SELECT `user_id`
FROM `table`
WHERE `key` = "text_status"
AND `value` = "1"
) AS `_` USING(`user_id`)
WHERE `table`.`key` = "text"
