delete duplicate rows that have blob text / mediumtext mysql - php

I have seen lots of posts on deleting duplicate rows using SQL commands, but I need to check for duplicates on a column that is MEDIUMTEXT.
I keep getting Error Code: 1170. BLOB/TEXT column used in key specification without a key length from solutions such as:
ALTER IGNORE TABLE foobar ADD UNIQUE (title, SID)
My table is simple: I need to check for duplicates in mytext; id is unique and AUTO_INCREMENT.
As a note, the table has about a million rows and all attempts keep timing out, so I would need a solution that performs the actions in batches, such as WHERE id > 0 AND id < 100.
Also, I am using MySQL Workbench on Amazon RDS.
From a table like this
+----+-------+-------+--------+---------+
| id | fname | lname | mytext | morevar |
+----+-------+-------+--------+---------+
| 1  | joe   | min   | abc    | 123     |
| 2  | joe   | min   | abc    | 123     |
| 3  | mar   | kam   | def    | 789     |
| 4  | kel   | smi   | ghi    | 456     |
+----+-------+-------+--------+---------+
I would like to end up with a table like this
+----+-------+-------+--------+---------+
| id | fname | lname | mytext | morevar |
+----+-------+-------+--------+---------+
| 1  | joe   | min   | abc    | 123     |
| 3  | mar   | kam   | def    | 789     |
| 4  | kel   | smi   | ghi    | 456     |
+----+-------+-------+--------+---------+
UPDATE: I forgot to mention this is on Amazon RDS using MySQL Workbench.
My table is very large and I keep getting Error Code: 1205. Lock wait timeout exceeded from this SQL command:
DELETE n1 FROM names n1, names n2 WHERE n1.id > n2.id AND n1.name = n2.name
Also, if anyone else is having issues with MySQL Workbench timing out, the fix is:
Go to Preferences -> SQL Editor and set a bigger value for this parameter:
DBMS connection read time out (in seconds)

OPTION #1: Delete all duplicate records, leaving one of each (e.g. the one with MAX(id))
DELETE
FROM yourTable
WHERE id NOT IN
(
SELECT MAX(id)
FROM yourTable
GROUP BY mytext
)
You may prefer to use MIN(id) instead.
Depending on the engine used, this may not work and instead (as it did in your case) give Error Code: 1093. You can't specify target table 'yourTable' for update in FROM clause. Why? Because deleting one record may change the result of the subquery (the MAX(id) values), which would make the WHERE condition unstable.
In this case, you could try using another subquery as a temporary table:
DELETE
FROM yourTable
WHERE id NOT IN
(
SELECT MAXID FROM
(
SELECT MAX(id) as MAXID
FROM yourTable
GROUP BY mytext
) as temp_table
)
OPTION #2: Use a temporary table, like in this example, or:
First, create a temporary table with the max ids:
CREATE TEMPORARY TABLE tmpTable AS
SELECT MAX(id) AS MAXID
FROM yourTable
GROUP BY mytext;
Then execute the delete:
DELETE
FROM yourTable
WHERE id NOT IN
(
SELECT MAXID FROM tmpTable
);
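If a single DELETE still hits lock-wait timeouts on a large table, as described in the question, the delete can also be run in id batches against the temporary table above. A rough sketch, reusing the example range from the question (advance the boundaries on each run until the whole id range is covered):
DELETE
FROM yourTable
WHERE id > 0 AND id < 100       -- one id range per run
  AND id NOT IN
  (
    SELECT MAXID FROM tmpTable  -- keeper ids computed once, above
  );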

How about this? It will delete the duplicate records from the table, keeping the row with the lowest id for each mytext value:
DELETE t1 FROM foobar t1, foobar t2 WHERE t1.mytext = t2.mytext AND t1.id > t2.id

Related

If I have a MySQL table with multiple column values the same, how do I delete all but two of the most recent entries?

I know this sounds like a duplicate of a few questions, and it may well be, but I've searched through and tried my own implementation of several possible solutions, and all of them seem to result in some form of infinite recursion that just chews 100% CPU and does nothing. That could be because I'm doing it wrong or because they aren't appropriate for me; I don't know.
I have a MySQL table structured as follows :
+--------+------+-----+-------+--------+--------+-------+--------+
| id     | fid  | bid | dec_a | varc_a | varc_b | dec_b | varc_c |
+--------+------+-----+-------+--------+--------+-------+--------+
| 106861 | 4192 | 22  | 1.40  | blah   | blahbr | 0.2   | blahca |
| 108620 | 4192 | 22  | 1.55  | blah   | blahbe | 0.2   | blahca |
| 108621 | 4192 | 22  | 1.55  | blah   | blahbq | 0.2   | blahca |
| 108622 | 4192 | 22  | 1.55  | blah   | blahbw | 0.2   | blahca |
| 108623 | 4192 | 22  | 1.55  | blah   | blahbe | 0.2   | blahca |
| 108624 | 4192 | 22  | 1.55  | blah   | blahbf | 0.2   | blahca |
| 106863 | 4192 | 33  | 1.40  | blah   | blahba | 0.2   | blahca |
+--------+------+-----+-------+--------+--------+-------+--------+
The "id" value is a BIGINT auto-incrementing value and the data is added in proper chronological order from the source, so I am viewing this as the timestamp.
To establish which data is duplicated I am using the "fid", "bid", "varc_a", "dec_b" and "varc_c" columns. From the example above you can see that there are 6 duplicates based on those columns: the first six rows. The seventh row shows variation in the "bid" column, and obviously any variation in any of those columns excludes the row as a duplicate.
I can easily visualise what I want to do: there are potentially millions of entries in the database, and I want to exclude the 2 most recent rows of data (based on the entry id) where the "fid", "bid", "varc_a", "dec_b" and "varc_c" column values are the same, and then sweep away what's left.
For the life of me I can't figure out how to do that using just MySQL and, as I say, all of the questions and answers I've looked at don't seem to be doing what I want to do, or I'm not understanding what's proposed.
I know I can do this with PHP+MySQL by trawling through the data and removing the duplicates, but considering I can do it in such a horribly inefficient way quite easily, I'm thinking that I'm missing something obvious and should be able to do it with MySQL alone?
Note:
Mike's answer is excellent and it did precisely what I needed with a little tweaking given the context of my question. What I ended up using was this:
DROP TEMPORARY TABLE IF EXISTS keepers1, keepers2, keepers_all;
CREATE TEMPORARY TABLE keepers1 (KEY(id)) ENGINE=MEMORY AS
SELECT fid, bid, varc_a, dec_b, varc_c, MAX(id) AS id
FROM market_prices
GROUP BY fid, bid, varc_a, dec_b, varc_c;
CREATE TEMPORARY TABLE keepers2 AS
SELECT fid, bid, varc_a, dec_b, varc_c, MAX(id) AS id
FROM market_prices AS k
WHERE NOT EXISTS (SELECT 1 FROM keepers1 WHERE id = k.id)
GROUP BY fid, bid, varc_a, dec_b, varc_c;
CREATE TEMPORARY TABLE keepers_all (KEY(id)) ENGINE=MEMORY AS
SELECT id FROM keepers1
UNION ALL
SELECT id FROM keepers2;
DELETE k.* FROM market_prices AS k WHERE NOT EXISTS (SELECT 2 FROM keepers_all WHERE id = k.id);
When grouping, be sure to use just the columns that are duplicated. Keeping two records per group comes from unioning keepers1 and keepers2 into keepers_all; the literal in the final EXISTS subquery (SELECT 1, SELECT 2, ...) is arbitrary and doesn't affect the result.
Time to raise a glass to the man of the hour!
This may be a solution for your problem.
However, since there is no date-time column, I am assuming that the id column is the primary key and it is AUTO_INCREMENT. So my assumption is that the larger the number, the newer the record. (This should be true unless you had some old data dumped into the table.)
Make sure you back up your data before you delete, as this will cause permanent data loss. Even better, you can make a copy of the current table into a different table and work on the new table to make sure the logic below is correct, then change the queries that I have below to read from tbl_new instead of tbl.
You can duplicate your table (structure and data) via something like
CREATE TABLE tbl_new LIKE tbl;
INSERT INTO tbl_new SELECT * FROM tbl;
I have left comments for every query
DROP TEMPORARY TABLE IF EXISTS keepers1, keepers2, keepers_all;
-- get the #1 top records
CREATE TEMPORARY TABLE keepers1 (KEY(id)) ENGINE=MEMORY AS
SELECT fid, bid, dec_a, varc_a, varc_b, dec_b, varc_c, MAX(id) AS id
FROM tbl
GROUP BY fid, bid, dec_a, varc_a, varc_b, dec_b, varc_c;
-- get the #2 top records
CREATE TEMPORARY TABLE keepers2 AS
SELECT fid, bid, dec_a, varc_a, varc_b, dec_b, varc_c, MAX(id) AS id
FROM tbl AS k
WHERE NOT EXISTS (SELECT 1 FROM keepers1 WHERE id = k.id)
GROUP BY fid, bid, dec_a, varc_a, varc_b, dec_b, varc_c;
-- create a temp table where you have all he ids that you want to keep
CREATE TEMPORARY TABLE keepers_all (KEY(id)) ENGINE=MEMORY AS
SELECT id FROM keepers1
UNION ALL
SELECT id FROM keepers2;
-- delete all records that you don't want to keep
DELETE k.* FROM tbl AS k WHERE NOT EXISTS (SELECT 1 FROM keepers_all WHERE id = k.id);
If this is a one-time clean-up job then you should be able to execute the queries from the console. But if you are looking for a recurring job, then you should probably take this code and put it in a procedure.
Note: here I am using MEMORY temporary tables for better performance. You may run into an error that says "Table is full"; this happens when you have too many records. In that case you can increase the value of max_heap_table_size (and tmp_table_size) for the session,
something like:
SET SESSION tmp_table_size = 1024 * 1024 * 1024 * 2; -- this will set it to 2G
SET SESSION max_heap_table_size = 1024 * 1024 * 1024 * 2; -- this will set it to 2G
This will give you your current values:
SHOW VARIABLES LIKE 'max_heap_table_size';
SHOW VARIABLES LIKE 'tmp_table_size';
You will need to write a stored procedure. You can create the stored procedure either via PHP, or MySQL directly:
Creating via PHP
$createProc = "DROP PROCEDURE IF EXISTS `remove_dups`;
CREATE DEFINER=`root`@`localhost` PROCEDURE `remove_dups`( IN id varchar(255))
BEGIN
...my code...
END;";
$conn = new PDO("mysql:host=$host;dbname=$dbname", $username, $password);
//create the stored procedure
$stmt = $conn->prepare($createProc);
$stmt->execute();
Create via MySQL GUI
Simply put the create statement in the text box and run it (against the proper DB):
CREATE DEFINER=`root`@`localhost` PROCEDURE `remove_dups`( IN id varchar(255))
BEGIN
...my code...
END;
Then you can call this procedure either from PHP or MySQL.
In your stored proc, you'll want to declare some variables to store the values in, do a check to find rows with the same values (using a cursor), and then check the id against the previous row's. If all the values are the same, delete the one with the lower id.
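For illustration only, here is a sketch of such a procedure along the lines described above, adapted to keep the two newest rows per duplicate group as the question asks. The column names come from the question; the table name tbl, the column types, and the intermediate to_delete table are my assumptions, not the answerer's actual code:
-- Hypothetical sketch; back up the table before running anything like this.
DROP PROCEDURE IF EXISTS remove_dups;
DELIMITER //
CREATE PROCEDURE remove_dups()
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE v_id BIGINT;
    DECLARE v_fid, v_bid, p_fid, p_bid INT;
    DECLARE v_varc_a, v_varc_c, p_varc_a, p_varc_c VARCHAR(255);
    DECLARE v_dec_b, p_dec_b DECIMAL(10,2);
    DECLARE newer_kept INT DEFAULT 0;
    DECLARE cur CURSOR FOR
        SELECT id, fid, bid, varc_a, dec_b, varc_c
        FROM tbl
        ORDER BY fid, bid, varc_a, dec_b, varc_c, id DESC;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

    -- collect the ids to delete first, then delete them in one statement
    DROP TEMPORARY TABLE IF EXISTS to_delete;
    CREATE TEMPORARY TABLE to_delete (id BIGINT PRIMARY KEY) ENGINE=MEMORY;

    OPEN cur;
    read_loop: LOOP
        FETCH cur INTO v_id, v_fid, v_bid, v_varc_a, v_dec_b, v_varc_c;
        IF done THEN LEAVE read_loop; END IF;
        IF  v_fid <=> p_fid AND v_bid <=> p_bid AND v_varc_a <=> p_varc_a
        AND v_dec_b <=> p_dec_b AND v_varc_c <=> p_varc_c THEN
            -- same duplicate group as the previous (newer) row
            SET newer_kept = newer_kept + 1;
            IF newer_kept >= 2 THEN
                INSERT INTO to_delete VALUES (v_id);   -- two newer rows already kept
            END IF;
        ELSE
            SET newer_kept = 0;                        -- new group; this row is the newest
        END IF;
        SET p_fid = v_fid, p_bid = v_bid, p_varc_a = v_varc_a,
            p_dec_b = v_dec_b, p_varc_c = v_varc_c;
    END LOOP;
    CLOSE cur;

    DELETE t FROM tbl AS t JOIN to_delete AS d ON d.id = t.id;
    DROP TEMPORARY TABLE to_delete;
END //
DELIMITER ;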

Returning values even when the result is empty

I have a table similar to the one below.
+---------+---------+---------+---------+
| user_id | point_1 | point_2 | point_3 |
+---------+---------+---------+---------+
| 453123  | 1234    | 32      | 433     |
| 321543  | 1       | 213     | 321     |
+---------+---------+---------+---------+
My query is something like this:
$query = "SELECT * FROM my_table WHERE user_id = 12345 OR user_id = 987654321"
Obviously, this will return nothing, since user_id 12345 and user_id 987654321 do not exist in the table.
But I still want to return something like the table below:
+-----------+---------+---------+---------+
| user_id   | point_1 | point_2 | point_3 |
+-----------+---------+---------+---------+
| 12345     | 0       | 0       | 0       |
| 987654321 | 0       | 0       | 0       |
+-----------+---------+---------+---------+
You could use an inline view as a rowsource for your query. To return a zero in place of a NULL (which would be returned by the outer join when no matching row is found in my_table), you can use the IFNULL function.
e.g.
SELECT s.user_id
, IFNULL(t.point_1,0) AS point_1
, IFNULL(t.point_2,0) AS point_2
, IFNULL(t.point_3,0) AS point_3
FROM ( SELECT 12345 AS user_id
UNION ALL SELECT 987654321
) s
LEFT
JOIN my_table t
ON t.user_id = s.user_id
NOTE: If datatype of user_id column my_table is character, then I'd enclose the literals in the inline view in single quotes. e.g. SELECT '12345' AS user_id. If the characterset of the column doesn't match your client characterset, e.g. database column is latin1, and client characterset is UTF8, you'd want to force the character strings to be a compatible (coercible) characterset... SELECT _latin1'12345' AS user_id
You can't get the result you want using only a select statement. Only rows that exist somewhere will be returned.
The only way I can think to do this is to insert the query values into a temp table and then outer join against that for your query.
So the basic process would be:
create table temp1 (user_id integer);
insert into temp1 values (987654321); -- repeat as needed for query.
select t.user_id, m.* from temp1 t left outer join my_table m on m.user_id = t.user_id;
drop table temp1;
This isn't very efficient though.
Your desired result resembles the result of an OUTER JOIN - when some records exist only in one table and not the other, an OUTER JOIN will show all of the rows from one of the joined tables, filling in missing fields from the other table with NULL values.
To solve your particular problem purely in SQL, you could create a second table that contains a single field with all of the user_id values that you want to be able to show in your result. Something like:
+-----------+
| user_id |
+-----------+
| 1 |
+-----------+
| 2 |
+-----------+
| 3 |
+-----------+
| ... |
+-----------+
| 12344 |
+-----------+
| 12345 |
+-----------+
| 12346 |
+-----------+
| ... |
+-----------+
And so on. If this second table is named all_ids, you could then get your desired result by modifying your query as follows (exact syntax may vary by database implementation):
SELECT
*
FROM
all_ids AS i
LEFT OUTER JOIN
my_table AS t ON i.user_id = t.user_id
WHERE
i.user_id = 12345
OR i.user_id = 987654321;
This should produce the following result set:
+-----------+----------+----------+----------+
| user_id | point_1 | point_2 | point_3 |
+-----------+----------+----------+----------+
| 12345 | NULL | NULL | NULL |
+-----------+----------+----------+----------+
| 987654321 | NULL | NULL | NULL |
+-----------+----------+----------+----------+
It should be noted that this table full of IDs could take up a significant amount of disk space. An integer column in MySQL can hold 4,294,967,296 4-byte values, or 16 GB of data sitting around purely for your convenience in displaying some other data you don't have. So unless you need some smaller range or set of IDs available, or have disk space coming out your ears, this approach simply may not be practical.
Personally, I would not ask the database to do this in the first place. Essentially it's a display issue; you already get all the information you need from the fact that certain rows were not returned by your query. I would solve the display issue outside of the database, which in your case means filling in those zeroes with PHP.

When joining tables, date column is NULL on some entries

I am trying to migrate some custom CMS DB to Wordpress, and so far it's been a living hell.
I am using WP All import plugin, so I need a neat single .csv export that contains data from multiple tables from this custom cms database.
So, these are the columns from two tables that I want to join:
`eo_items`
+--------+-------------------+-------------+
| cat_id | identificator     | create_date |
+--------+-------------------+-------------+
| 1      | Title of the post | 1283786285  |
+--------+-------------------+-------------+
`eo_items_trans`
+---------+-----+-------------------+---------+---------+
| item_id | lid | name              | s_desc  | l_desc  |
+---------+-----+-------------------+---------+---------+
| 1       | 33  | Title of the post | excerpt | content |
+---------+-----+-------------------+---------+---------+
Desired result should be:
+---------+-----+-------------------+---------+---------+--------+-------------+
| item_id | lid | name              | s_desc  | l_desc  | cat_id | create_date |
+---------+-----+-------------------+---------+---------+--------+-------------+
| 1       | 33  | Title of the post | excerpt | content | 1      | Some date   |
+---------+-----+-------------------+---------+---------+--------+-------------+
Here is the script I am using:
SELECT DISTINCT
eo_items_trans.item_id,
eo_items_trans.lid,
eo_items.cat_id,
DATE_FORMAT( eo_items.create_date, '%d.%m.%Y' ) create_date,
eo_items_trans.s_desc,
eo_items_trans.l_desc,
eo_items_trans.name
FROM eo_items_trans
LEFT JOIN eo_items ON ( eo_items_trans.name = eo_items.identificator )
The trouble with this code is that in the resulting table some date columns are NULL, and I don't know if the result is what I need, because the table has around 2000 rows and I don't know how to cross-check that the category IDs are correctly populated.
This is the first time I am doing something like this with MySQL so I am really not sure if the procedure is right for what I am trying to achieve.
If you need any clarifications please ask.
EDIT:
eo_items table has some 300 rows more than eo_items_trans so there are some records there that don't have corresponding records in eo_items_trans. I am guessing this should be reflected in the query as well?
Since you're using a LEFT JOIN, NULLs will be returned for any rows of eo_items_trans that do not have entries in oe_items. This could mean the eo_items.identificator is empty, or doesn't exactly match the name (case sensitivity will apply).
You'll have to investigate and clean up the data for rows in eo_items_trans missing the expected row in eo_items.
Your NULL results for date seem to come either from eo_items_trans records that have no corresponding entry in the eo_items table, or from eo_items records where create_date is NULL.
You can easily cross-check by doing the following.
Are there records in eo_items_trans that have no corresponding entries in eo_items?
SELECT DISTINCT eo_items_trans.name FROM eo_items_trans
where NOT EXISTS (
SELECT * FROM eo_items
where eo_items.identificator = eo_items_trans.name
)
If this yields one or more rows, those will be the eo_items_trans.name records with no correspondent in eo_items. If this is your problem, then do a JOIN, not a LEFT JOIN, in your main query.
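For reference, the main query from the question rewritten with an inner join might look like this sketch (it simply drops the unmatched eo_items_trans rows instead of returning NULLs for them):
SELECT DISTINCT
    eo_items_trans.item_id,
    eo_items_trans.lid,
    eo_items.cat_id,
    -- if create_date holds Unix timestamps (as the sample data suggests),
    -- DATE_FORMAT(FROM_UNIXTIME(eo_items.create_date), '%d.%m.%Y') may be needed here
    DATE_FORMAT( eo_items.create_date, '%d.%m.%Y' ) AS create_date,
    eo_items_trans.s_desc,
    eo_items_trans.l_desc,
    eo_items_trans.name
FROM eo_items_trans
INNER JOIN eo_items ON ( eo_items_trans.name = eo_items.identificator )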
As for empty dates in eo_items, you might want to check like this:
SELECT * FROM eo_items WHERE create_date IS NULL
If you find records here, this is where your NULL values in the main query come from.

How to track a secondary index id

Have a table that will be shared by multiple users. The basic table structure will be:
unique_id | user_id | users_index_id | data_1 | data_2 etc etc
With the id fields being type INT, and unique_id being a primary key with auto increment.
The data will be something like:
unique_id | user_id | users_index_id
1 | 234 | 1
2 | 234 | 2
3 | 234 | 3
4 | 234 | 4
5 | 732 | 1
6 | 732 | 2
7 | 234 | 5
8 | 732 | 3
How do I keep track of 'users_index_id' so that it 'auto increments' specifically for a user_id ?
Any help would be greatly appreciated, as I've searched for an answer but am not sure I'm using the correct terminology to find what I need.
The only way to do this consistently is by using "before insert" and "before update" triggers; MySQL does not support this kind of per-group auto-increment directly. You could wrap all changes to the table in a stored procedure and put the logic there, or use very careful logic when doing an insert:
insert into `table`(user_id, users_index_id)
    select param_user_id, count(*) + 1
    from `table`
    where user_id = param_user_id;
However, this won't keep things in order if you do deletes or certain updates.
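For illustration, a minimal sketch of the trigger approach mentioned above. The table name user_data is a stand-in (the question never names the table), and this is not safe under concurrent inserts for the same user_id without additional locking:
DELIMITER //
CREATE TRIGGER user_data_bi
BEFORE INSERT ON user_data
FOR EACH ROW
BEGIN
    -- assign the next per-user counter; the trigger may read the table it is
    -- defined on, it just cannot modify it
    SET NEW.users_index_id = (
        SELECT IFNULL(MAX(users_index_id), 0) + 1
        FROM user_data
        WHERE user_id = NEW.user_id
    );
END //
DELIMITER ;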
You might find it more convenient to calculate the users_index_id when you query rather than in the database. You can do this using either subqueries (which are probably ok with the right indexes on the table) or using variables (which might be faster but can't be put into a view).
If you have an index on table(user_id, unique_id), then the following query should work pretty well:
select t.*,
       (select count(*)
        from `table` t2
        where t2.user_id = t.user_id and t2.unique_id <= t.unique_id
       ) as users_index_id
from `table` t;
You will need the index for non-abysmal performance.
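The variables-based alternative mentioned above might look roughly like this sketch. It relies on MySQL user variables and on ordering inside a derived table, which works on 5.x but is deprecated behaviour in MySQL 8.0:
-- number rows per user in unique_id order using session variables
SELECT s.unique_id,
       s.user_id,
       @idx := IF(@prev_user = s.user_id, @idx + 1, 1) AS users_index_id,
       @prev_user := s.user_id AS prev_user_tracker
FROM (SELECT unique_id, user_id FROM `table` ORDER BY user_id, unique_id) AS s
CROSS JOIN (SELECT @idx := 0, @prev_user := NULL) AS init;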
You need to find the MAX(users_index_id) and increment it by one. To avoid having to manually lock the table to ensure a unique key you will want to perform the SELECT within your INSERT statement. However, MySQL does not allow you to reference the target table when performing an INSERT or UPDATE statement unless it's wrapped in a subquery:
INSERT INTO users (user_id, users_index_id) VALUES (234, (SELECT IFNULL(id, 0) + 1 FROM (SELECT MAX(users_index_id) id FROM users WHERE user_id = 234) dt))
Query without subselect (thanks Gordon Linoff):
INSERT INTO users (user_id, users_index_id) SELECT 234, IFNULL((SELECT MAX(users_index_id) id FROM users WHERE user_id = 234), 0) + 1;
http://sqlfiddle.com/#!2/eaea9a/1/0

WHERE vs HAVING in generated queries

I know that this title is overused, but it seems that my kind of question is not answered yet.
So, the problem is like this:
I have a table structure made of four tables (tables, rows, cols, values) that I use to recreate the behavior of the information_schema (in a way).
In PHP I am generating queries to retrieve the data, and the result would still look like a normal table:
SELECT
(SELECT value FROM `values` WHERE `col` = "3" and row = rows.id) as "col1",
(SELECT value FROM `values` WHERE `col` = "4" and row = rows.id) as "col2"
FROM rows WHERE `table` = (SELECT id FROM tables WHERE name = 'table1')
HAVING (col2 LIKE "%4%")
OR
SELECT * FROM
(SELECT
(SELECT value FROM `values` WHERE `col` = "3" and row = rows.id) as "col1",
(SELECT value FROM `values` WHERE `col` = "4" and row = rows.id) as "col2"
FROM rows WHERE `table` = (SELECT id FROM tables WHERE name = 'table1')) d
WHERE col2 LIKE "%4%"
Note that the part where I define the columns of the result is generated by a PHP script. It is less important why I am doing this, but I want to extend this algorithm that generates the queries for broader use.
And we get to the core problem: I have to decide whether to generate a WHERE or a HAVING part for the query. I know when to use each of them; the problem is that my algorithm doesn't, and I have to make a few extra checks for this. The two queries above are equivalent, and I can always put any query in a sub-query, give it an alias, and use WHERE on the new derived table. But I wonder if I will have problems with performance, or if this will turn back on me in an unexpected way.
I know how they both work, and how WHERE is supposed to be faster, but this is why I came here to ask. Hopefully I made myself understood; please excuse my English and the long turns of phrase.
EDIT 1
I already know the difference between the two and all that implies. My only dilemma is that using custom columns from other tables, with variable numbers and sizes, while trying to achieve the same result as a normally created table, means I must use HAVING to filter on the derived columns, while also having the option to wrap it all up in a subquery and use WHERE normally; that wrapping will probably create a temporary table that is filtered afterwards. Will this affect performance on a large database? Unfortunately I cannot test this right now, as I cannot afford to fill the database with over 1 billion entries (it would be something like: 1 billion entries in the rows table, 5 billion in the values table since every row has 5 columns, 5 rows in the cols table and 1 row in the tables table, roughly 6,000,000,006 entries in total).
Right now my database looks like this:
`tables`
+----+--------+-----------+------+
| id | name   | title     | dets |
+----+--------+-----------+------+
| 1  | table1 | Table One |      |
+----+--------+-----------+------+
`cols`
+----+-------+------+
| id | table | name |
+----+-------+------+
| 3  | 1     | col1 |
| 4  | 1     | col2 |
+----+-------+------+
where `table` is a foreign key from table `tables`
`rows`
+----+-------+-------+
| id | table | extra |
+----+-------+-------+
| 1  | 1     |       |
| 2  | 1     |       |
+----+-------+-------+
where `table` is a foreign key from table `tables`
`values`
+----+-----+-----+----------+
| id | row | col | value    |
+----+-----+-----+----------+
| 1  | 1   | 3   | 13       |
| 2  | 1   | 4   | 14       |
| 6  | 2   | 4   | 24       |
| 9  | 2   | 3   | asdfghjk |
+----+-----+-----+----------+
where `row` is a foreign key from table `rows`
where `col` is a foreign key from table `cols`
EDIT 2
The conditions are there just for demonstration purposes!
EDIT 3
For only two rows, it seems there is a difference between the two: the query using HAVING takes about 0.0008 s and the one using WHERE takes 0.0014-0.0019 s. I wonder if this will affect performance for large numbers of rows and columns.
EDIT 4
The result of the two queries is identical, and that is:
+----------+------+
| col1     | col2 |
+----------+------+
| 13       | 14   |
| asdfghjk | 24   |
+----------+------+
HAVING is specifically for filtering the grouped results of GROUP BY; WHERE provides conditions on individual rows. See also WHERE vs HAVING
I believe the having clause would be faster in this case, as you're defining specific values, as opposed to reading through the values and looking for a match.
See: http://database-programmer.blogspot.com/2008/04/group-by-having-sum-avg-and-count.html
Basically, WHERE filters out rows before they are passed to an aggregate function, but HAVING filters the aggregate function's results.
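A small illustration with a hypothetical orders table (not from the question):
-- WHERE prunes individual rows before grouping;
-- HAVING prunes whole groups after the aggregate is computed.
SELECT customer_id, COUNT(*) AS paid_orders
FROM orders
WHERE status = 'paid'
GROUP BY customer_id
HAVING COUNT(*) > 5;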
You could do it like this:
WHERE col2 IN (14, 24)
Your code WHERE col2 LIKE "%4%" is a bad idea: what about col2 = 34? It would also be selected.
