Deleting Duplicate Rows from MySql Table - php

I have a script to find duplicate rows in my MySQL table. The table contains 40,000,000 rows, but the script is very slow going. Is there an easier way to find the duplicate records without going in and out of PHP?
This is the script I currently use:
$find = mysql_query("SELECT * FROM pst_nw WHERE ID < 1000");
while ($row = mysql_fetch_assoc($find))
{
    $find_1 = mysql_query("SELECT * FROM pst_nw WHERE add1 = '$row[add1]' AND add2 = '$row[add2]' AND add3 = '$row[add3]' AND add4 = '$row[add4]'");
    if (mysql_num_rows($find_1) > 0) {
        mysql_query("DELETE FROM pst_nw WHERE ID = '$row[ID]'");
    }
}

You have a number of options.
Let the DB do the work
Create a copy of your table with a unique index - and then insert the data into it from your source table:
CREATE TABLE clean LIKE pst_nw;
ALTER IGNORE TABLE clean ADD UNIQUE INDEX (add1, add2, add3, add4);
INSERT IGNORE INTO clean SELECT * FROM pst_nw;
DROP TABLE pst_nw;
RENAME TABLE clean TO pst_nw;
The advantage of doing things this way is you can verify that your new table is correct before dropping your source table. The disadvantage is it takes up twice as much space and is (relatively) slow to execute.
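The copy-with-unique-index approach can be sketched end to end. This is a minimal stand-in using SQLite via Python's sqlite3 (SQLite's INSERT OR IGNORE plays the role of MySQL's INSERT IGNORE); the pst_nw/clean names follow the answer, and the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE pst_nw (id INTEGER PRIMARY KEY, "
            "add1 TEXT, add2 TEXT, add3 TEXT, add4 TEXT)")
rows = [("a", "b", "c", "d"), ("a", "b", "c", "d"), ("x", "y", "z", "w")]
cur.executemany("INSERT INTO pst_nw (add1, add2, add3, add4) "
                "VALUES (?, ?, ?, ?)", rows)

# 1. a copy of the table with a unique index over the four address columns
cur.execute("CREATE TABLE clean (id INTEGER PRIMARY KEY, "
            "add1 TEXT, add2 TEXT, add3 TEXT, add4 TEXT)")
cur.execute("CREATE UNIQUE INDEX clean_addr ON clean (add1, add2, add3, add4)")
# 2. INSERT OR IGNORE: duplicate-key violations are skipped, not fatal
cur.execute("INSERT OR IGNORE INTO clean SELECT * FROM pst_nw")

# 3. verify before swapping the tables
dupes_left = cur.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM clean "
    "GROUP BY add1, add2, add3, add4 HAVING COUNT(*) > 1)").fetchone()[0]
clean_count = cur.execute("SELECT COUNT(*) FROM clean").fetchone()[0]
print(dupes_left, clean_count)  # 0 2
```

Only once the verification passes would you drop the source table and rename the copy.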
Let the DB do the work #2
You can also achieve the result you want by doing:
set session old_alter_table=1;
ALTER IGNORE TABLE pst_nw ADD UNIQUE INDEX (add1, add2, add3, add4);
The first command is required as a workaround for the ignore flag being .. ignored
The advantage here is there's no messing about with a temporary table - the disadvantage is you don't get to check that your update does exactly what you expect before you run it.
Example:
CREATE TABLE `foo` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`one` int(10) DEFAULT NULL,
`two` int(10) DEFAULT NULL,
PRIMARY KEY (`id`)
)
insert into foo values (null, 1, 1);
insert into foo values (null, 1, 1);
insert into foo values (null, 1, 1);
select * from foo;
+----+------+------+
| id | one | two |
+----+------+------+
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
+----+------+------+
3 rows in set (0.00 sec)
set session old_alter_table=1;
ALTER IGNORE TABLE foo ADD UNIQUE INDEX (one, two);
select * from foo;
+----+------+------+
| id | one | two |
+----+------+------+
| 1 | 1 | 1 |
+----+------+------+
1 row in set (0.00 sec)
Don't do this kind of thing outside the DB
Especially with 40 million rows doing something like this outside the db is likely to take a huge amount of time, and may not complete at all. Any solution that stays in the db will be faster, and more robust.

Usually in questions like this the problem is "I have duplicate rows, want to keep only one row, any one".
But judging from the code, what you want is: "if a set of add1, add2, add3, add4 is duplicated, DELETE ALL COPIES WITH ID < 1000". In this case, copying from the table to another with INSERT IGNORE won't do what you want - might even keep rows with lower IDs and discard subsequent ones.
I believe you need to run something like the following to gather all the "bad" IDs (IDs that have a duplicate elsewhere in the table). In this code I used "AND bad.ID < good.ID", so if ID 777 duplicates ID 888, ID 777 will still get deleted. If that is not what you want, change that condition to something like "AND bad.ID < 1000 AND good.ID > 1000".
CREATE TABLE bad_ids AS
SELECT bad.ID FROM pst_nw AS bad JOIN pst_nw AS good
ON ( bad.ID < 1000 AND bad.ID < good.ID
AND bad.add1 = good.add1
AND bad.add2 = good.add2
AND bad.add3 = good.add3
AND bad.add4 = good.add4 );
Then once you have all bad IDs into a table,
DELETE pst_nw.* FROM pst_nw JOIN bad_ids ON (pst_nw.ID = bad_ids.ID);
Performance will benefit greatly from a (non-unique, possibly temporary) index on add1, add2, add3, add4, and ID, in this order.
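The two steps above (collect the bad IDs, then delete by joining against them) can be sketched end to end. This is a minimal SQLite stand-in via Python's sqlite3 with made-up sample rows; SQLite lacks MySQL's multi-table DELETE syntax, so the final join is written as an IN subquery with the same effect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE pst_nw (ID INTEGER PRIMARY KEY, "
            "add1 TEXT, add2 TEXT, add3 TEXT, add4 TEXT)")
cur.executemany("INSERT INTO pst_nw VALUES (?, ?, ?, ?, ?)", [
    (500,  "a", "b", "c", "d"),   # duplicate with a low ID -> should go
    (2000, "a", "b", "c", "d"),   # surviving copy
    (3000, "x", "y", "z", "w"),   # unique row -> untouched
])

# gather the "bad" IDs exactly as in the answer's self-join
cur.execute("""
    CREATE TABLE bad_ids AS
    SELECT bad.ID FROM pst_nw AS bad JOIN pst_nw AS good
    ON ( bad.ID < 1000 AND bad.ID < good.ID
         AND bad.add1 = good.add1
         AND bad.add2 = good.add2
         AND bad.add3 = good.add3
         AND bad.add4 = good.add4 )
""")
# SQLite adaptation of the multi-table DELETE
cur.execute("DELETE FROM pst_nw WHERE ID IN (SELECT ID FROM bad_ids)")
remaining = [r[0] for r in cur.execute("SELECT ID FROM pst_nw ORDER BY ID")]
print(remaining)  # [2000, 3000]
```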

Get the duplicate rows using the GROUP BY clause. Here is a sample you can try:
select id
from table
group by matching_field1,matching_field2....
having count(id) > 1
So you are getting all the duplicate ids. Now delete them using a DELETE query.
Instead of "IN", use the "OR" operator, as "IN" can be slower than "OR" in older MySQL versions.
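The GROUP BY / HAVING idea can be sketched with SQLite standing in for MySQL; the table, column, and sample values below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, matching_field TEXT)")
cur.executemany("INSERT INTO t (matching_field) VALUES (?)",
                [("a",), ("a",), ("b",), ("a",), ("b",)])

# GROUP BY / HAVING reports each duplicated group once
dup_groups = sorted(cur.execute(
    "SELECT matching_field, COUNT(id) FROM t "
    "GROUP BY matching_field HAVING COUNT(id) > 1").fetchall())
print(dup_groups)  # [('a', 3), ('b', 2)]

# one way to delete the duplicates while keeping the lowest id per group
cur.execute("DELETE FROM t WHERE id NOT IN "
            "(SELECT MIN(id) FROM t GROUP BY matching_field)")
survivors = [r[0] for r in cur.execute("SELECT id FROM t ORDER BY id")]
print(survivors)  # [1, 3]
```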

Sure there is. Note however that with 40 million records you will most probably exceed PHP's max execution time. Try the following:
CREATE TABLE temp_pst_nw LIKE pst_nw;
INSERT INTO temp_pst_nw SELECT * FROM pst_nw GROUP BY add1, add2, add3, add4;
Confirm that everything is ok first!!
DROP TABLE pst_nw;
RENAME TABLE temp_pst_nw TO pst_nw;

Try creating a new table that has the same definition, e.g. "my_table_two", then do:
INSERT INTO my_table_two
SELECT DISTINCT unique_col1, col2, col3 [...] FROM my_table;
Maybe that'll sort it out.

Your code will be better if you don't use SELECT *; select only the columns you want to compare (the four address columns). You should also have a LIMIT clause in your SQL, to avoid the script becoming unresponsive when the table has that many rows.

Related

SQL query is very slow for certain parameters (MySQL)

I am making a PHP backend API which executes a query on MySQL database. This is the query:
SELECT * FROM $TABLE_GAMES WHERE
($GAME_RECEIVERID = '$userId' OR $GAME_OTHERID = '$userId')
ORDER BY $GAME_ID LIMIT 1
Essentially, I'm passing $userId as a parameter and getting the row with the smallest $GAME_ID value. It returns a result in less than 100 ms for users that have around 30,000 matching rows in the table. However, I have since added new users that have fewer than 100 matching rows, and the query is painfully slow for them, taking around 20-30 seconds every time.
I'm puzzled why the query is so much slower when it is supposed to return a small number of rows, and extremely fast when it returns a huge number of rows, especially since I have ORDER BY.
I have read about parameter sniffing, but as far as I know, that's the SQL Server thing, and I'm using MySQL.
EDIT
Here is the SHOW CREATE statement:
CREATE TABLE `games` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `SenderID` int(11) NOT NULL,
  `ReceiverID` int(11) NOT NULL,
  `OtherID` int(11) NOT NULL,
  `Timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM AUTO_INCREMENT=17275279 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Here is the output of EXPLAIN
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | games | NULL       | index | NULL          | PRIMARY | 4       | NULL |    1 | 19.00    | Using where |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
I tried prepared statement, but still getting the same result.
Sorry for poor formatting, I'm still noob at this.
You need to use EXPLAIN to analyse the performance of the query.
i.e.
EXPLAIN SELECT * FROM $TABLE_GAMES WHERE
($GAME_RECEIVERID = '$userId' OR $GAME_OTHERID = '$userId')
ORDER BY $GAME_ID LIMIT 1
The EXPLAIN would provide the information about the select query with execution plan.
It is a great tool to identify slowness in the query. Based on the information obtained, you can create indexes for the columns used in the WHERE clause:
CREATE INDEX index_name ON table_name (column_list)
This would definitely increase the performance of the query.
Your query is being slow because it cannot find a matching record fast enough. With users where a lot of rows match, chances of finding a record to return are much higher, all other things being equal.
That behavior appears when $GAME_RECEIVERID and $GAME_OTHERID aren't part of an index, prompting MySQL to use the index on $GAME_ID because of the ordering. However, since newer players have not played the early games, there are literally millions of rows that won't match, but have to be checked nonetheless.
Unfortunately, this is bound to get worse even for old users, as your database grows. Ideally, you will add indexes on $GAME_RECEIVERID and $GAME_OTHERID - something like:
ALTER TABLE games
ADD INDEX receiver (ReceiverID),
ADD INDEX other (OtherID)
PS: Altering a 17 million rows table is going to take a while, so make sure to do it during a maintenance window or similar if this is used in production.
Is this the query after the interpolation? That is, is this what MySQL will see?
SELECT * FROM GAMES
WHERE RECEIVERID = '123'
OR OTHERID = '123'
ORDER BY ID LIMIT 1
Then this will run fast, regardless:
SELECT *
FROM GAMES
WHERE ID = LEAST(
( SELECT MIN(ID) FROM GAMES WHERE RECEIVERID = '123' ),
( SELECT MIN(ID) FROM GAMES WHERE OTHERID = '123' )
);
But, you will need both of these:
INDEX(RECEIVERID, ID),
INDEX(OTHERID, ID)
Your version of the query is scanning the table until it finds a matching row. My version will
make two indexed lookups;
fetch the other columns for the one row.
It will be the same, fast, speed regardless of how many rows there are for USERID.
(Recommend switching to InnoDB.)
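As a sanity check of the LEAST-of-two-MINs rewrite, here is a small SQLite sketch; SQLite's two-argument min() stands in for MySQL's LEAST(), and the table contents are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE games "
            "(id INTEGER PRIMARY KEY, receiverid INTEGER, otherid INTEGER)")
cur.executemany("INSERT INTO games VALUES (?, ?, ?)",
                [(1, 123, 999), (2, 555, 123), (3, 123, 123)])

# two indexed MIN() lookups, then fetch the single winning row
row = cur.execute("""
    SELECT * FROM games
    WHERE id = min(
        (SELECT MIN(id) FROM games WHERE receiverid = 123),
        (SELECT MIN(id) FROM games WHERE otherid = 123)
    )
""").fetchone()
print(row)  # (1, 123, 999): the lowest id matching either column
```

With the two composite indexes the answer recommends, each inner MIN() is a single index probe, which is why the rewrite stays fast regardless of how many rows match.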

Mysql single query with subquery or two separate queries

I've a table with the following structure:
id | storeid | code
where id is the primary key.
I want to insert data into this table in incremental order of code, like this:
id | storeid | code
1 | 2 | 1
2 | 2 | 2
3 | 2 | 3
4 | 2 | 4
I have two solutions for this task.
1) Fire a query to get the last code from the table, increment it by 1 in PHP, and then fire a second query to insert the incremented value into the database.
2) This single query : "INSERT INTO qrcodesforstore (storeid,code) VALUES (2,IFNULL((SELECT t.code+1 FROM (select code from qrcodesforstore order by id desc limit 1) t),1))"
I just want a suggestion on which approach is best for performance, and why.
I'm currently using the second method, but I'm confused about performance as I'm using a three-level subquery.
You can simply use INSERT with SELECT and MAX():
INSERT INTO qrcodesforstore
(storeid, code)
(SELECT 2, IFNULL((MAX(code)+1),1) FROM qrcodesforstore)
Wrapping it up in a trigger:
DELIMITER $$
DROP TRIGGER IF EXISTS bi_qrcodesforstore$$
CREATE TRIGGER bi_qrcodesforstore BEFORE INSERT ON qrcodesforstore
FOR EACH ROW
BEGIN
    DECLARE max_code INT;
    SELECT MAX(code) INTO max_code FROM qrcodesforstore;
    IF max_code IS NULL THEN
        SET max_code := 1;
    ELSE
        SET max_code := max_code + 1;
    END IF;
    SET NEW.code := max_code;
END$$
DELIMITER ;
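The MAX(code)+1 idea above can be sketched with SQLite standing in for MySQL (the qrcodesforstore name follows the question, the storeid value is made up). Note that without table locking, two concurrent inserts could still read the same MAX and collide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE qrcodesforstore "
            "(id INTEGER PRIMARY KEY, storeid INTEGER, code INTEGER)")

def insert_next(storeid):
    # IFNULL(MAX(code)+1, 1): the first row gets code 1, later rows increment
    cur.execute(
        "INSERT INTO qrcodesforstore (storeid, code) "
        "SELECT ?, IFNULL(MAX(code) + 1, 1) FROM qrcodesforstore",
        (storeid,))

for _ in range(3):
    insert_next(2)
codes = [r[0] for r in cur.execute(
    "SELECT code FROM qrcodesforstore ORDER BY id")]
print(codes)  # [1, 2, 3]
```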
If you declare the field as a primary key with AUTO_INCREMENT, the values are incremented automatically.
You can set the code column as AUTO_INCREMENT; you don't need to set it as a primary key.
Anyway, the second solution would be better, only one query is better than two.

Limiting maximum records per key in MySQL table

I've a table for storing products as per the following structure...
id shop_id product_id product_title
Every shop selects a plan, and accordingly it can stored N products in this tables. N is different for every shop.
Problem Statement: While doing insert operation in the table, total number of entries per shop_id can't be more than N.
I can count the number of products before every insert operation, and then decide whether the new entry should go in the table or be ignored. But the operations are triggered by incoming events, which may number in the millions, so that doesn't seem efficient. Performance is the key.
Is there a better way?
I think you should use a stored procedure so you can delegate the validation to MySQL rather than PHP. Here is an example of what you might need; just be sure to replace the table and column names properly.
If you are worried about the performance you should check the indexes of the tables for better performance.
Procedure
DELIMITER //
DROP PROCEDURE IF EXISTS storeProduct//
CREATE PROCEDURE storeProduct(IN shopId INT, IN productId INT, IN productTitle VARCHAR(255))
LANGUAGE SQL MODIFIES SQL DATA SQL SECURITY INVOKER
BEGIN
/* Here we get the plan for the shop */
SET @N = (SELECT plan FROM planTable WHERE shop_id = shopId);
/* Now we count the products already stored with this shop_id */
SET @COUNT = (SELECT COUNT(id) FROM storing_products WHERE shop_id = shopId);
/* Now we check whether we can store the product or not */
IF @COUNT < @N THEN
    /* Yes, we can insert */
    INSERT INTO storing_products(shop_id, product_id, product_title)
    VALUES (shopId, productId, productTitle);
    /* 1 means the insert is allowed by the plan */
    SELECT 1 AS 'RESULT';
ELSE
    /* No, we cannot insert */
    /* 0 means the insert is not allowed by the plan */
    SELECT 0 AS 'RESULT';
END IF;
END//
DELIMITER ;
Now you can call it from PHP just like
<?php
//....
$result = $link->query("CALL storeProduct($shop_id, $product_id, '$product_title');");
//....
?>
or however you run your queries. The result looks like:
+--------+
| RESULT |
+--------+
| 1 |
+--------+
if its ok or
+--------+
| RESULT |
+--------+
| 0 |
+--------+
if not
I hope it will help
Greetings
If you don't want to count at insertion time, you can maintain the count in another table that is consulted while inserting:

shop_id  max_product  product_count  insertion_allowed
------------------------------------------------------
1        1000         50             1
2        2000         101            1
3        100          100            0
Two approaches:
Compare product_count with max_product and insert only when product_count is smaller than max_product. After a successful insertion, increment product_count for the corresponding shop_id.
Alternatively, you may use the insertion_allowed flag to check the condition, and after each successful insertion increment product_count by 1 for the corresponding shop_id.
Hope this will help you.
Please share performance improvement statistics of the approach(if you can). It may help others in choosing better approach.
Another approach without store procedure.
CREATE TABLE `test` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`prod_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=19 DEFAULT CHARSET=utf8;
You can do an INSERT with a SELECT. It's a little more efficient than two separate queries. The trick is that if the SELECT returns no rows, then no insert happens:
$prod_id = 12;
$N = 3;
$qry = "insert into test (prod_id)
(select $prod_id FROM test WHERE prod_id=$prod_id HAVING count(id) < $N )";
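A sketch of the same guarded-insert trick, using SQLite via Python; since SQLite is stricter about HAVING without GROUP BY, the count check is written as a scalar subquery, which has the same effect of producing no row once the cap is reached:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, prod_id INTEGER)")

N = 3  # cap per prod_id

def capped_insert(prod_id):
    # the SELECT yields a row only while the current count is below N,
    # so the INSERT silently becomes a no-op once the cap is reached
    cur.execute(
        "INSERT INTO test (prod_id) "
        "SELECT ? WHERE (SELECT COUNT(*) FROM test WHERE prod_id = ?) < ?",
        (prod_id, prod_id, N))

for _ in range(5):
    capped_insert(12)
count = cur.execute(
    "SELECT COUNT(*) FROM test WHERE prod_id = 12").fetchone()[0]
print(count)  # 3: the cap held even though five inserts were attempted
```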

get ID at time of insert / replicate value into other

Is it possible to know, before inserting, what the ID of the new entry in the MySQL DB is going to be?
Or, even better, to replicate the value of ID into another field?
For example I have:
ID and User_ID and I want them always to be the same value
--------------------
| ID User_ID |
| 1 1 |
| 2 2 |
| 3 3 |
| 4 4 |
| 5 5 |
--------------------
And so on
I find your need quite questionable, and a design smell. Nevertheless, you have the chance to achieve what you're looking for. The key is a trigger.
here you can find TRIGGER SYNTAX
Basically you need to define a proper AFTER INSERT trigger.
Insert the row without the user_id field set, read the autoincrement id inside the trigger and set the user_id value as needed.
I'm not going to post an example because the manual is quite exhaustive, and it's a good read anyway.
Why do I think your idea is bad? Well, you're duplicating data (by your very definition, "you want them always to be the same") without gaining value. The duplicate column is completely useless by all rights and can be removed. If you're using them as a key or foreign-key reference, the two columns are completely interchangeable. Therefore, simply drop one of them and be done with it.
Use mysql_insert_id() after calling your mysql_query.
Or mysqli_insert_id() if using mysqli_ instead of just mysql_.
More information here: http://uk1.php.net/mysql_insert_id. Note that all the mysql_ functions are now deprecated, so I would suggest mysqli_ or PDO instead.
if (mysql_query("INSERT INTO `users` (`name`) VALUES ('name')")) {
    $id = (int)mysql_insert_id();
    mysql_query("UPDATE `users` SET `User_ID` = {$id} WHERE `ID` = {$id}");
}
But this is not good idea from the beginning :)
If you are using auto increment, then what you want is inside:
SELECT AUTO_INCREMENT FROM information_schema.tables
WHERE table_name = 'myTable' AND table_schema = DATABASE( )
so in that case use for example:
INSERT INTO myTable (ID, User_ID) values ( NULL ,
(
SELECT AUTO_INCREMENT FROM information_schema.tables
WHERE table_name = 'myTable' AND table_schema = DATABASE( )
)
);
The Null indicates an auto increment ID. And the subquery will yield the auto increment ID to be used.
If you are not using auto increment, then you can simply do:
INSERT INTO myTable (ID, User_ID) values ( 5 , ID );
... and thus referencing ID to User_ID
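The insert-then-copy variant suggested above (mysql_insert_id followed by an UPDATE) can be sketched with SQLite; lastrowid plays the role of mysql_insert_id(), and the users table is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# hypothetical users table with the mirrored ID / User_ID pair
cur.execute("CREATE TABLE users "
            "(ID INTEGER PRIMARY KEY, User_ID INTEGER, name TEXT)")

def insert_user(name):
    # insert first, then copy the generated id into User_ID
    cur.execute("INSERT INTO users (name) VALUES (?)", (name,))
    new_id = cur.lastrowid  # plays the role of mysql_insert_id()
    cur.execute("UPDATE users SET User_ID = ? WHERE ID = ?", (new_id, new_id))
    return new_id

for n in ("alice", "bob", "carol"):
    insert_user(n)
pairs = cur.execute("SELECT ID, User_ID FROM users ORDER BY ID").fetchall()
print(pairs)  # [(1, 1), (2, 2), (3, 3)]
```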

mysql database structure (unique id)

I think most people have had serious problems with inserting values/data into a database (MySQL). When I insert a value into the database, I assign a unique id (INT) to that row, so when I query the database I can easily read/modify/delete that row.
With a for() loop (in PHP) I can easily read values/data from the database. The problem occurs when I delete a row (in the middle, for example).
E.g:
DB:
ID | column1 | column2 | ... | columnN
--------------------------------------
1 | abcdefgh | asdasda | ... | asdasdN
2 | asdasddd | asdasda | ... | asdasdN
...
N | asdewfew | asddsad | ... | asddsaN
php:
for($i = 0; $i <= $n; $i++){
$sql = mysql_query("SELECT * FROM db WHERE ID = '$i' ");
//Code;
}
*$n = last value of the ID column
Do I need to reorganize the entire database to have a correct "flow" (1, 2, 3, ..., n)? Or do I need to UPDATE each cell?
What you're doing here is unnecessary thanks to AUTO_INCREMENT in mysql. Run this command from PhpMyAdmin (or another DB management system):
ALTER TABLE db MODIFY COLUMN ID INT NOT NULL AUTO_INCREMENT;
Now when you insert a row into db, MySQL will assign the ID for you:
INSERT INTO db (id, column1, column2) VALUES(NULL, 'abc', 'def');
Then SELECT * FROM db; gives:
+--+-------+-------+
|id|column1|column2|
+--+-------+-------+
|1 |old |oldrow |
+--+-------+-------+
|2 |abc |def | <--- Your newly inserted row, with unique ID
+--+-------+-------+
If you delete a row, it is true that there will be an inconsistency in the order of the ID's, but this is ok. IDs are not intended to denote the numeric position of a row in a table. They are intended to uniquely identify each row so that you can perform actions on it and reference it from other tables with foreign keys.
Also if you need to grab a group of ID's (stored in an array, for example), it is much more efficient to perform one query with an IN statement.
$ids = array(1,2,3,4,5,6);
$in = implode(',', $ids);
mysql_query('SELECT * FROM db WHERE id IN ('.$in.')');
However, if you want all rows just use:
SELECT * FROM db;
But be wary of Bobby Tables (SQL injection).
You can select all the rows with query
SELECT * FROM db
and do whatever you want after
You can never ensure that the IDs are continuous. As you noticed yourself, there are gaps after you delete a row, and gaps appear in other cases too, for example when you insert rows within a transaction that is then rolled back because something failed. An ID is an identifier, not a row number.
For example if you want to select X items from somewhere (or such) have a look at LIMIT and OFFSET
SELECT * FROM mytable ORDER BY created DESC LIMIT 10 OFFSET 20;
This selects rows 21 to 30, ordered by their creation time. Note that I order by creation time rather than by id, since (as mentioned) you cannot rely on the ids.
If you really want to fetch rows by their ID you definitely need to fetch the IDs first. You may also fetch a range of IDs like
SELECT * FROM mytable WHERE id IN (1,2,3,4);
But don't assume that you will always receive 4 rows.
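The LIMIT/OFFSET advice above can be sketched quickly with SQLite standing in for MySQL; the table and its gapped ids are made up to mimic a table after deletions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY)")
# ids with gaps, as after deletions
cur.executemany("INSERT INTO mytable (id) VALUES (?)",
                [(i,) for i in (1, 2, 4, 7, 8, 9, 11)])

# page by position with LIMIT/OFFSET; the id gaps don't matter
page = [r[0] for r in cur.execute(
    "SELECT id FROM mytable ORDER BY id LIMIT 3 OFFSET 2")]
print(page)  # [4, 7, 8]: rows 3 to 5, regardless of gaps
```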
Ids are surrogate keys: they are in no way derived from the data in the rest of the columns, and their only significance is that each row has a unique one. There's no need to change them.
If you need a specific range of rows from the table, use BETWEEN to specify them in the query:
SELECT id, col1, col2, ..., coln
FROM `table`
WHERE id BETWEEN ? AND ?
ORDER BY id
If you need all the rows and all columns for that table use:
$query = mysql_query("SELECT * FROM table_name");
and then loop through each row with a while statement:
while ($row = mysql_fetch_array($query))
{
echo $row['column_name'];
}
You do not have to reorganize the entire database in order to keep the index. But if you feel like it, you'd have to update each cell.
BTW, look at mysql_fetch_array(), it will ease the load on the SQL server.
