I have a table with 8 columns, but over time it has picked up numerous duplicates. I have looked at another question with a similar topic, but it does not solve the issue I am currently having.
+---------------------------------------------------------------------------------------+
| id | market | agent | report_name | producer_code | report_date | entered_date | sync |
+---------------------------------------------------------------------------------------+
What defines a unique entry is based on the market, agent, report_name, producer_code, and report_date fields. What I am looking for is a way to list all the duplicate entries and delete them. Or to just delete the duplicate entries.
I have thought about doing it with a script, but the table contains 2.5 million entries, and the time that would take makes it infeasible.
Could anybody suggest any alternatives? I have seen people get a list of duplicates using the following query, but not sure on how to adapt it to my situation:
SELECT id, count(*) AS n
FROM table_name
GROUP BY id
HAVING n > 1
Here are two strategies you might think about. You will have to adjust the columns used to select duplicates based upon what you actually consider a duplicate. I just included all of your listed columns other than the id column.
The first simply creates a new table without duplicates. Sometimes this is actually faster and easier than trying to delete all the offending rows. Just create a new table, insert the unique rows (I used min(id) for the id of the resulting row), rename the two tables, and (once you are satisfied that everything worked correctly) drop the original table. Of course, if you have any foreign key constraints you'll have to deal with those as well.
create table table_copy like table_name;
insert into table_copy
(id, market, agent, report_name, producer_code, report_date, entered_date, sync)
select min(id), market, agent, report_name, producer_code, report_date,
entered_date, sync
from table_name
group by market, agent, report_name, producer_code, report_date,
entered_date, sync;
RENAME TABLE table_name TO table_old, table_copy TO table_name;
drop table table_old;
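Before dropping the old table, a quick sanity check between the rename and the drop can help; for example (a sketch; note that COUNT(DISTINCT ...) skips rows where any of the listed columns is NULL):
select
    (select count(*) from table_name) as deduped_rows,
    (select count(distinct market, agent, report_name, producer_code,
                    report_date, entered_date, sync)
     from table_old) as expected_rows;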
The second strategy, which just deletes the duplicates, uses a temporary table to hold the information about what rows have duplicates since MySQL won't allow you to select from the same table you are deleting from in a subquery. Simply create a temporary table with the columns that identify the duplicates plus an id column that will actually hold the id to keep and then you can do a multi-table delete where you join the two tables to select just the duplicates.
create temporary table dups
select min(id) as id, market, agent, report_name, producer_code, report_date,
entered_date, sync
from table_name
group by market, agent, report_name, producer_code, report_date,
entered_date, sync
having count(*) > 1;
delete t
from table_name t, dups d
where t.id != d.id
and t.market = d.market
and t.agent = d.agent
and t.report_name = d.report_name
and t.producer_code = d.producer_code
and t.report_date = d.report_date
and t.entered_date = d.entered_date
and t.sync = d.sync;
You can find the dupes, based on your "key" fields, by doing:
select min(id) as keep_id, count(*) as row_count
from table_name
group by market, agent, report_name, producer_code, report_date
having row_count > 1
which you could then use in a delete script. Of course, you'd have to be very careful doing this: the query returns one surviving id (the lowest) per duplicate group, and you'd want to keep that row and delete the rest of each grouping.
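One way such a delete script could look, as a rough sketch: a self-join that keeps the lowest id from each duplicate group (test it on a copy of the table first):
delete t
from table_name t
join table_name k
    on  k.market = t.market
    and k.agent = t.agent
    and k.report_name = t.report_name
    and k.producer_code = t.producer_code
    and k.report_date = t.report_date
    and k.id < t.id;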
Another easy way would be to (a sketch of these steps follows the list):
1. Create a new table
2. Put a UNIQUE index on the fields you need to be unique (a primary key is a special kind of unique index)
3. Use INSERT IGNORE INTO newtable SELECT * FROM oldtable (with an ORDER BY if you want the last/first records to remain, should there be a difference in the other columns)
4. DROP the old table and RENAME the new table to the old table
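A minimal sketch of those four steps, reusing the column names from the question (the index name uq_report and the choice of key columns are assumptions you should adjust):
CREATE TABLE newtable LIKE oldtable;
ALTER TABLE newtable ADD UNIQUE KEY uq_report (market, agent, report_name, producer_code, report_date);
-- ORDER BY id keeps the earliest row from each duplicate group
INSERT IGNORE INTO newtable SELECT * FROM oldtable ORDER BY id;
RENAME TABLE oldtable TO oldtable_backup, newtable TO oldtable;
DROP TABLE oldtable_backup;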
You could also put a UNIQUE index (or a composite primary key) on the columns the unique entries are based on; this will prevent new records with duplicate details from being added.
I have this problem that's been killing me for a couple days now.
So we have a table of all processed orders.
We have a table for all orders that come in.
We need to effectively cross-reference the orders in the new table, which is continually updating, against the orders already completed in the primary table so that we don't complete the same order multiple times.
After we get a batch of new orders, this is the query that I currently run in an attempt to cross reference it with the table of completed orders:
$sql = "DELETE
FROM
`orders_new`
WHERE
`order` IN (
SELECT DISTINCT
`order`
FROM
`orders_all`
)
AND `name` IN (
SELECT DISTINCT
`name`
FROM
`orders_all`
)
AND `jurisdiction` IN (
SELECT DISTINCT
`jurisdiction`
FROM
`orders_all`
)";
As you can probably tell, I want to delete rows from the "orders_new" table where a row with the same order, name, and jurisdiction already exists in the "orders_all" table.
Is this the right way to handle this sort of query?
Well, the right way depends on many things.
But first, I do not like your division into two tables. In that case I would introduce a column identifying the state, which would reference a table of possible states: "new", "in process", "completed". That way each order is stored as only one record, as it should be.
Your query might be OK, but you should check the performance.
Take a look at: https://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join
Not exactly your case but very similar.
Another thing: why do you use DISTINCT? That would imply that "order" is not a unique identifier.
Based on your edit, you identify an order with the composite key "order", "name", "jurisdiction". Is this really the key, the whole key and nothing but the key, so help you Codd? If not, you could delete a bunch of records. But even so, your query would delete all orders for which the order, name and jurisdiction can each be found in orders_all IN DIFFERENT RECORDS. So your query is wrong.
That said, a variant of your query might be
DELETE order_new
FROM
order_new
INNER JOIN
order_all ON order_all.order = order_new.order
AND order_all.name = order_new.name
AND order_all.jurisdiction = order_new.jurisdiction
But, the real problem is your ER model.
No, your query will delete any record where there are any records with the same order, name, and jurisdiction, even if those records are different from one another. In other words, a row in orders_new will be deleted if one row in order_all has the same order, a different one has the same name, and a third one has the same jurisdiction. You are very very likely to delete way more than you want to. Instead, this would be more appropriate:
DELETE FROM `orders_new`
WHERE (`order`, `name`, `jurisdiction`) IN (
SELECT `order`, `name`, `jurisdiction`
FROM `orders_all`
)
or maybe
DELETE FROM `orders_new`
WHERE EXISTS (
SELECT 1
FROM `orders_all` AS oa
WHERE oa.`order` = `orders_new`.`order`
AND oa.`name` = `orders_new`.`name`
AND oa.`jurisdiction` = `orders_new`.`jurisdiction`
)
You should convert that to a DELETE - JOIN construct like
DELETE `orders_new`
FROM `orders_new`
INNER JOIN `orders_all` ON `orders_new`.`order` = `orders_all`.`order`
AND `orders_new`.`name` = `orders_all`.`name`
AND `orders_new`.`jurisdiction` = `orders_all`.`jurisdiction`;
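As with any bulk delete, it can be worth previewing what will be removed by running the same join as a SELECT first:
SELECT `orders_new`.*
FROM `orders_new`
INNER JOIN `orders_all` ON `orders_new`.`order` = `orders_all`.`order`
    AND `orders_new`.`name` = `orders_all`.`name`
    AND `orders_new`.`jurisdiction` = `orders_all`.`jurisdiction`;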
I have a SQL table with two columns:
'id' int Auto_Increment
instancename varchar
The current 114 rows are ordered alphabetically by instancename.
Now I want to insert a new row that fits into that order.
So say it starts with a 'B': it would land at around id 14 and would therefore have to 'push down' all of the rows after id 14. How do I do this?
An SQL table is not inherently ordered! (It is just a set.) You would simply add the new row and view it using something like:
select instancename
from thetable
order by instancename;
I think you're going about this the wrong way. IDs shouldn't be changed. If you have tables that reference these IDs as foreign keys then the DBMS wouldn't let you change them, anyway.
Instead, if you need results from a specific query to be ordered alphabetically, tell SQL to order it for you:
SELECT * FROM table ORDER BY instancename
As an aside, sometimes you want something that can seemingly be a key (read- needs to be unique for each row) but does have to change from time to time (such as something like a SKU in a product table). This should not be the primary key for the same reason (there are undoubtedly other tables that may refer to these entries, each of which would also need to be updated).
Keeping this information distinct will help keep you and everyone else working on the project from going insane.
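As an illustration of that idea, here is a hypothetical products table (the table and column names are made up for the example):
CREATE TABLE products (
    id INT AUTO_INCREMENT PRIMARY KEY,   -- stable surrogate key that other tables reference
    sku VARCHAR(32) NOT NULL UNIQUE,     -- must be unique, but is allowed to change over time
    name VARCHAR(255) NOT NULL
);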
Try using a window function (ROW_NUMBER() OVER) and joining the table to itself:
Update thetable c Join
    ( Select instancename, Row_Number() Over(Order By instancename) As new_id
      From thetable ) r On c.instancename = r.instancename
Set c.id = r.new_id;
This should update the id column to the ordered number (ROW_NUMBER() requires MySQL 8.0 or later). Depending on your setup you may first have to account for the column's AUTO_INCREMENT (identity) property and any foreign keys that reference it.
I have a web application that stores points in a table, and total points in the user table as below:
User Table
user_id | total_points
Points Table
id | date | user_id | points
Every time a user earns a point, the following steps occur:
1. Enter points value to points table
2. Calculate SUM of the points for that user
3. Update the user table with the new SUM of points (total_points)
The values in the user table might get out of sync with the sum in the points table, and I want to be able to recalculate the SUM of all points for every user once in a while (eg. once a month). I could write a PHP script that could loop through each user in the user table and find the sum for that user and update the total_points, but that would be a lot of SQL queries.
Is there a better(efficient) way of doing what I am trying to do?
Thanks...
A more efficient way to do this would be the following:
User Table
user_id
Points Table
id | date | user_id | points
Total Points View
user_id | total_points
A view is effectively a select statement disguised as a table. The select statement would be: SELECT "user_id", SUM("points") AS "total_points" FROM "Points Table" GROUP BY "user_id". To create a view, execute CREATE VIEW "Total Points View" AS <SELECT STATEMENT> where SELECT STATEMENT is the previous select statement.
Once the view has been created, you can treat it as you would any regular table.
P.S.: I don't know that the quotes are necessary unless your table names actually contain spaces, but it's been a while since I worked with MySQL, so I don't remember its idiosyncrasies.
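For reference, here is a minimal sketch of that view in MySQL syntax, assuming the underlying tables are actually named users and points (adjust the names to match your schema):
CREATE VIEW total_points_view AS
SELECT user_id, SUM(points) AS total_points
FROM points
GROUP BY user_id;

-- The view can then be joined back to users like any table;
-- COALESCE covers users with no rows in points yet.
SELECT u.user_id, COALESCE(v.total_points, 0) AS total_points
FROM users u
LEFT JOIN total_points_view v ON v.user_id = u.user_id;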
You can use triggers for this, to keep the user's total_points in sync with the points table. Something like:
Create Trigger UpdateUserTotalPoints AFTER INSERT ON points
FOR EACH ROW Begin
UPDATE users u
INNER JOIN
(
SELECT user_id, SUM(points) totalPoints
FROM points
GROUP BY user_id
) p ON u.user_id = p.user_id
SET u.total_points = p.totalPoints;
END;
Note: as @FireLizzard pointed out, if the records in the points table are frequently updated or deleted, you also need AFTER UPDATE and AFTER DELETE triggers to keep the two tables in sync. In that case FireLizzard's solution will be the better fit.
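For instance, an AFTER DELETE counterpart could simply subtract the removed points; this is only a sketch, and an AFTER UPDATE trigger would adjust by the difference between NEW.points and OLD.points in the same way:
CREATE TRIGGER UpdateUserTotalPointsOnDelete AFTER DELETE ON points
FOR EACH ROW
    UPDATE users u
    SET u.total_points = u.total_points - OLD.points
    WHERE u.user_id = OLD.user_id;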
If you only want to do this once a month, you cannot handle it with MySQL alone. There is too much logic here, and putting too much logic in the database is not the right way to go. Karan Punamiya's trigger could be nice, but it will update the user table on every insert into the points table, and that's not what you seem to want.
As for wanting to be able to remove points, just add new negated rows to the points table rather than removing any row (removing rows would break the history trail).
If you really want it periodically, you can run a cron script that does that, or even call your PHP script ;)
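If you go the periodic route, the recalculation itself can be done in a single statement instead of one query per user from PHP; a sketch, again assuming tables named users and points:
UPDATE users u
LEFT JOIN (
    SELECT user_id, SUM(points) AS total
    FROM points
    GROUP BY user_id
) p ON p.user_id = u.user_id
SET u.total_points = COALESCE(p.total, 0);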
I have a voting script which pulls out the number of votes per user.
Everything is working, except I need to now display the number of votes per user in order of number of votes. Please see my database structure:
Entries:
UserID, FirstName, LastName, EmailAddress, TelephoneNumber, Image, Status
Voting:
item, vote, nvotes
The item field contains vt_img and then the UserID, so for example: vt_img4 and both vote & nvotes display the number of votes.
Any ideas how I can relate those together and display the users in order of the most voted at the top?
Thanks
You really need to change the structure of the voting table so that you can do a normal join. I would strongly suggest adding either a pure userID column, or at the very least not making it a concat of two other columns. Based on an ID you could then easily do something like this:
select
a.userID,
a.firstName,
b.votes
from
entries a
join voting b
on a.userID=b.userID
order by
b.votes desc
The other option is to consider (if it is a one to one relationship) simply merging the data into one table which would make it even easier again.
At the moment, this really is an XY problem, you are looking for a way to join two tables that aren't meant to be joined. While there are (horrible, ghastly, terrible) ways of doing it, I think the best solution is to do a little extra work and alter your database (we can certainly help with that so you don't lose any data) and then you will be able to both do what you want right now (easily) and all those other things you will want to do in the future (that you don't know about right now) will be oh so much easier.
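If you do go down that road, the migration could be as small as something along these lines (a rough sketch, assuming every item value really is 'vt_img' followed by the numeric UserID):
ALTER TABLE Voting ADD COLUMN UserID INT;

UPDATE Voting
SET UserID = CAST(SUBSTRING(item, 7) AS UNSIGNED)
WHERE item LIKE 'vt_img%';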
Edit: It seems like this is a great opportunity to use a Trigger to insert the new row for you. A MySQL trigger is an action that the database will make when a certain predefined action takes place. In this case, you want to insert a new row into a table when you insert a row into your main table. The beauty is that you can use a reference to the data in the original table to do it:
CREATE TRIGGER Entries_Trigger AFTER insert ON Entries
FOR EACH ROW BEGIN
insert into Voting values(new.UserID,0,0);
END;
This will work in the following manner - When a row is inserted into your Entries table, the database will insert the row (creating the auto_increment ID and the like) then instantly call this trigger, which will then use that newly created UserID to insert into the second table (along with some zeroes for votes and nvotes).
Your database is badly designed. It should be:
Voting:
item, user_id, vote, nvotes
Placing the item id and the user id into the same column as a concatenated string with a delimiter is just asking for trouble. This isn't scalable at all. Look up the basics on Normalization.
You could try this:
SELECT *
FROM Entries e
JOIN Voting v ON (CONCAT('vt_img', e.UserID) = v.item)
ORDER BY nvotes DESC
but please notice that this query might be quite slow due to the fact that the join field for Entries table is built at query time.
You should consider changing your database structure so that Voting contains a UserID field in order to do a direct join.
I'm figuring the Entries table is where votes are cast (your database schema doesn't make much sense to me; it seems like you could work it a little better). If the votes are actually in the Votes table and that's connected to a user, then you should have a UserID field in that table too. Either way the example will help.
Let's say you add UserID to the Votes table and this is where a user's votes are stored; then this would be your query:
SELECT Users.id, SUM(Votes.nvotes) AS user_votes
FROM Users
JOIN Votes ON Users.id = Votes.UserID
GROUP BY Users.id
ORDER BY user_votes DESC
USE ORDER BY in your query --
SELECT column_name(s)
FROM table_name
ORDER BY column_name(s) ASC|DESC
I am writing a converter to transfer data from old systems to new systems. I am using php+mysql.
I have one table that contains millions of records with duplicate entries. I want to transfer that data into a new table and remove all the duplicates. I am using the following queries and pseudo code to perform this task:
select *
from table1
insert into table2
ON DUPLICATE KEY UPDATE customer_information = concat('$firstName',',','$lastName')
It takes ages to process one table :(
I am wondering: is it possible to use GROUP BY and get all the grouped records automatically, rather than going through each record and checking for duplicates?
For example
select *
from table1
group by firstName, lastName
insert into table 2 only one record and add all users'
first last name into column ALL_NAMES with comma
EDIT
There are multiple records for each customer, each with different information. A row counts as a duplicate if the user's first and last name are the same. In the new table, we will add just one record per customer, with the products they bought in separate columns (we have only 4 products).
I don't know what you are trying to do with customer_information, but if you just want to transfer the non-duplicated set of data from one table to another, this will work:
INSERT IGNORE INTO table2(field1, field2, ... fieldx)
SELECT DISTINCT field1, field2, ... fieldx
FROM table1;
DISTINCT will take care of rows that are exact duplicates. But if you have rows that are only partial duplicates (like the same last and first names but a different email) then IGNORE can help. If you put a unique index on table2(lastname,firstname) then IGNORE will make sure that only the first record with lastnameX, firstnameY from table1 is inserted. Of course, you might not like which record of a pair of partial duplicates is chosen.
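For that partial-duplicate case, the unique index on table2 could be created like this (a sketch; the index name is arbitrary):
ALTER TABLE table2 ADD UNIQUE KEY uq_name (lastname, firstname);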
ETA
Now that you've updated your question, it appears that you want to put the values of multiple rows into one field. This is, generally speaking, a bad idea, because denormalizing your data this way makes it much less accessible. Also, if you are grouping by (lastname, firstname), there is nothing left to collect in an allnames column, since every row in a group shares the same name; because of this, my example uses allemails instead. In any event, if you really need to do this, here's how:
INSERT INTO table2(lastname, firstname, allemails)
SELECT lastname, firstname, GROUP_CONCAT(email) as allemails
FROM table1
GROUP BY lastname, firstname;
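If you also need the four product columns mentioned in your edit, conditional aggregation inside the same GROUP BY can populate them. The sketch below is only a guess at your schema: it assumes table1 has a product column and that table2 has one flag column per product, so adjust the names and values to your data:
INSERT INTO table2 (lastname, firstname, bought_p1, bought_p2, bought_p3, bought_p4)
SELECT lastname, firstname,
       MAX(product = 'product1') AS bought_p1,
       MAX(product = 'product2') AS bought_p2,
       MAX(product = 'product3') AS bought_p3,
       MAX(product = 'product4') AS bought_p4
FROM table1
GROUP BY lastname, firstname;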
If they are really duplicate rows (every field is the same) then you can use:
select DISTINCT * from table1
instead of :
select * from table1