I'm trying to write a Laravel eloquent statement to do the following.
Query a table and get all the ID's of all the duplicate rows (or ideally all the IDs except the ID of the first instance of the duplicate).
Right now I have the following mysql statement:
select `codes`, count(`codes`) as `occurrences`, `customer_id` from `pizzas`
group by `codes`, `customer_id`
having `occurrences` > 1;
The duplicates are any row that shares a combination of codes and customer_id, example:
codes,customer_id
183665A4,3
183665A4,3
183665A4,3
183665A4,3
183665A4,3
I'm trying to delete all but 1 of those.
This is returning a set of the codes, with their occurrences and their customer_id, as I only want rows that have both.
Currently I think loop through this, and save the ID of the first instance, and then call this again and delete any without that ID. This seems not very fast, as there's about 50 million rows so each query takes forever and we have multiple queries for each duplicate to delete.
// get every order that shares the same code and customer ID
$orders = Order::select('id', 'codes', DB::raw('count(`codes`) as `occurrences`'), 'customer_id')
->groupBy('codes')
->groupBy('customer_id')
->having('occurrences', '>', 1)
->limit(100)
->get();
// loop through those orders
foreach ($orders as $order)
{
// find the first order that matches this duplicate set
$first_order = Order::where('codes', $order->codes)
->where('customer_id', $order->customer_id)
->first();
// delete all but the first
Order::where('codes', $order->codes)
->where('customer_id', $order->customer_id)
->where('id', '!=', $first_order->id)
->delete();
}
There has got to be a more efficient way to track down all rows that share the same code and customer_id, and delete all the duplicates but keep the first instance, right? lol
I'm thinking maybe if I can add a fake column to the results that is an array of every ID, I could at least then remove the first ID and delete the others.
Don't involve PHP
This seems not very fast
The logic in the question is inherently slow because it's lots of queries and for each query there's:
DB<->PHP network roundtrip
PHP ORM logic/overhead
Given the numbers in the question, the whole code needs calling up to 10k times (if there are exactly 2 occurrences for every one of those 2 million duplicate records), for arguments sake let's say there are 1k sets of duplicates, overall that's:
1,000 queries finding duplicates
100,000 queries finding the first record
100,000 delete queries
201,000 queries is a lot and the php overhead makes it an order of magnitude slower (a guess, based on experience).
Do it directly on the DB
Just eliminating php/orm/network (even if it's on the same machine) time would make the process markedly faster, that would involve writing a procedure to mimic the php logic in the question.
But there's a simpler way, the specifics depend on the circumstances. In comments you've said:
The table is 140GB in size
It contains 50 million rows
Approx 2 million are duplicate records
There isn't enough free space to make a copy of the table
Taking these comments at face value the process I suggest is:
Ensure you have a functional DB backup
Before doing anything make sure you have a functional DB backup. If you manage to make a mistake and e.g. drop the table - be sure you can recover without loss of data.
You'll be testing this process on a copy of the database first anyway, right :) ?
Create a table of "ids to keep" and populate it
This is a permutation of removing duplicate with a unique index:
CREATE TABLE ids_to_keep (
id INT PRIMARY KEY,
codes VARCHAR(50) NOT NULL, # use same schema as source table
customer_id INT NOT NULL, # use same schema as source table
UNIQUE KEY derp (codes,customer_id)
);
INSERT IGNORE INTO ids_to_keep
SELECT id, codes, customer_id from pizzas;
Mysql will silently drop the rows conflicting with the unique index, resulting in a table with one id per codes+customer_id tuple.
If you don't have space for this table - make room :). It shouldn't be too large; 140GB and 50M rows means each row is approx 3kb - this temporary table will likely require single-digit % of the original size.
Delete the duplicate records
Before executing any expected-to-be-slow query use EXPLAIN to check if the query will complete in a reasonable amount of time.
To run as a single query:
DELETE FROM
pizzas
WHERE
id NOT IN (SELECT id from ids_to_keep);
If you wish to do things in chunks:
DELETE FROM
pizzas
WHERE
id BETWEEN (0,10000) AND
id NOT IN (SELECT id from ids_to_keep);
Cleanup
Once the table isn't needed any more, get rid of it:
DROP TABLE ids_to_keep;
Make sure this doesn't happen again
To prevent this happening again, add a unique index to the table:
CREATE UNIQUE INDEX ON pizzas(codes, customer_id);
Try this one it will keep only the duplicate and non-duplicate id lastest id:
$deleteDuplicates = DB::table('orders as ord1')
->join('orders as ord2', 'ord1.codes', '<', 'ord2.codes')
->where('ord1.codes', '=', 'ord2.codes') ->delete();
Related
I'm new to sql & php and unsure about how to proceed in this situation:
I created a mysql database with two tables.
One is just a list of users with their data, each having a unique id.
The second one awards certain amounts of points to users, with relevant columns being the user id and the amount of awarded points. This table is supposed to get new entries regularly and there's no limit to how many times a single user can appear in it.
On my php page I now want to display a list of users sorted by their point total.
My first approach was creating a "points_total" column in the user table, intending to run some kind of query that would calculate and update the correct total for each user every time new entries are added to the other table. To retrieve the data I could then use a very simple query and even use sql's sort features.
However, while it's easy to update the total for a specific user with the sum where function, I don't see a way to do that for the whole user table. After all, plain sql doesn't offer the ability to iterate over each row of a table, or am I missing a different way?
I could probably do the update by going over the table in php, but then again, I'm not sure if that is even a good approach in the first place, because in a way storing the point data twice (the total in one table and then the point breakdown with some additional information in a different table) seems redundant.
A different option would be forgoing the extra column, and instead calculating the sums everytime the php page is accessed, then doing the sorting stuff with php. However, I suppose this would be much slower than having the data ready in the database, which could be a problem if the tables have a lot of entries?
I'm a bit lost here so any advice would be appreciated.
To get the total points awarded, you could use a query similar to this:
SELECT
`user_name`,
`user_id`,
SUM(`points`.`points_award`) as `points`,
COUNT(`points`.`points_award`) as `numberOfAwards`
FROM `users`
JOIN `points`
ON `users`.`user_id` = `points`.`user_id`
GROUP BY `users`.`user_id`
ORDER BY `users`.`user_name` // or whatever users column you want.
I have three tables.
Radar data table (with id as primary), also has two columns of violation_file_id, and violation_speed_id.
Violation_speed table (with id as primary)
violation_file table (with id as primary)
I want to select all radar data, limited by 1000, from some start interval to an end interval, joins with violation_speed table. Each radar data must have a violation_speed_id.
I want to then join with the violation_file table, but not each radar records corresponding to violation_file_id, some records just has violation_file_id of 0, means there's no curresponding file.
My current sql is like this,
$results = DB::table('radar_data')
->join('violation_speed', 'radar_data.violation_speed_id', '=', 'violation_speed.id')
->leftjoin('violation_video_file', 'radar_data.violation_video_file_id', '=', 'violation_video_file.id')
->select('radar_data.id as radar_id',
'radar_data.violation_video_file_id',
'radar_data.violation_speed_id',
'radar_data.speed',
'radar_data.unit',
'radar_data.violate',
'radar_data.created_at',
'violation_speed.violation_speed',
'violation_speed.unit as violation_unit',
'violation_video_file.video_saved',
'violation_video_file.video_deleted',
'violation_video_file.video_uploaded',
'violation_video_file.path',
'violation_video_file.video_name')
->where('radar_data.violate', '=', '1')
->orderBy('radar_data.id', 'desc')
->offset($from_id)
->take($max_length)
->get();
It is PHP Laravel. But I think the translation to mysql statement is straight away.
My question is, is it a good way to select data like this? I tried but it seems a bit slow if the radar data grows to a large value.
Thanks.
Assuming you have the proper indices set this is largely the way to go, the only thing that's not 100% clear to me is what the offset() method does, but if it simply adds a WHERE clause than this should give you pretty much the best performance you're going to get. If not, replace it with a where('radar_data.id', '>', $from_id)
The most important indices are the ones on the foreign keys and primary keys here. And make sure not to forget the violate index.
The speed of the query often relies on the use of proper indexing on the joining clause and where clause used.
In your query there are 2 joins and if the joining keys are not indexed then you might need to apply the following
alter table radar_data add index violation_speed_id_idx(violation_speed_id);
alter table radar_data add index violation_video_file_id_idx(violation_video_file_id);
alter table radar_data add index violate_idx(violate);
The ids are primary key hence they are already indexed and should be covered
Assuming I have the following 2 tables:
trips:
driverId (PK), dateTravelled, totalTravelDistance
farthest_trip (same structure)
driverId (PK), dateTravelled, totalTravelDistance
where only the most recent journey is stored:
REPLACE INTO JOURNEYS ('$driverId', '$totalTripDistance');
I want to store the farthest trip travelled for each driverId as well, but unfortunately you can't have a condition on an INSERT... ON DUPLICATE KEY UPDATE statement, or else I'd have a trigger such as:
INSERT INTO farthest_trip(driverId, dateTravelled, totalTravelDistance) ON DUPLICATE KEY UPDATE dateTravelled = new.dateTravelled, totalTravelDistance = new.totalTravelDistance WHERE totalTravelDistance < new.totalTravelDistance;
so what I do is after inserting into the first table, in PHP I check if the current distance is farther than the one previously recorded, and if so, then update farthest_journey. It feels like there should be a better way but I can't quite figure it out, so is there a more efficient way of doing this?
You could create a trigger. Something like
CREATE TRIGGER farhest_trigger BEFORE INSERT ON trips
FOR EACH ROW
BEGIN
UPDATE farthest_trips SET date=NEW.date, distance=NEW.distance WHERE driverId=NEW.driverId AND distance < NEW.distance;
END;
But then you would have code that gets executed "magically" from a PHP perspective.
I think the better solution would be appending new trips to the trips table and selecting the maximum and latest journey with SELECT statements.
I am wondering the best format to lay out my data in a mySQL table so that it can be queried in the fastest manner to gather an array of daily values to be further utilized by php.
So far, I have laid out the table as such:
item_id price_date price_amount
1 2000-03-01 22.4
2 2000-03-01 19.23
3 2000-03-01 13.4
4 2000-03-01 14.95
1 2000-03-02 22.5
2 2000-03-02 19.42
3 2000-03-02 13.4
4 2000-03-02 13.95
with item_id defined as an index.
Also, I am using:
"SELECT DISTINCT price_date FROM table_name"
to get an array containing a unique list of dates.
Furthermore, the part of the code that is within a loop (and the focus of my optimization question) is currently written as:
"SELECT price_amount FROM table_name WHERE item_id = 1 ORDER BY price_date"
This second "SELECT" statement is actually within a loop where I am selecting/storing-in-array the daily prices of each item_id requested.
All is currently functioning and pulling the data from mySQL properly, however, both the above listed "SELECT" statements are taking approx 4-5 seconds to complete per each run, and when looping through 100+ products to create a summary, adds up to a very inefficient/slow information system.
Is there any more-efficient way that I could structure the mySQL table and/or SELECT statements to retrieve the results faster? Perhaps defining a different index on a different column? I have used the EXPLAIN command to return information per the queries but am unsure how to use the EXPLAIN information to increase the efficiency of my queries.
Thanks in advance for any mySQL wizards that may be able to assist.
Single column index
I am using:
"SELECT DISTINCT price_date FROM table_name"
to get an array containing a unique list of dates.
This query can be executed more efficiently if you create an index for the price_date column:
ALTER TABLE table_name ADD INDEX price_idx (price_date);
Mutiple column index
Furthermore, the part of the code that is within a loop (and the focus of my optimization question) is currently written as:
"SELECT price_amount FROM table_name WHERE item_id = 1 ORDER BY price_date"
For the second query, you should create an index covering both the item_id and price_date column:
ALTER TABLE table_name ADD INDEX item_price_idx (item_id, price_date);
I know this is a bit late, but i stumbled across this and thought I would throw my thoughts into the mix.
Indexes used well are very helpful in speeding up queries (Explain shows some really godd results around which indexes are being chosen - if any - for a specific query). However efficient PHP will help even more.
In your case you do not show the PHP, but it looks like you offer a list of dates and then loop through finding all the items in that date to get the prices. It would be more efficient to do something like the following:
Select item_id, price_amount from table_name where price_date= order by item_id, price_amount
with an index (preferably a Unique Index) on price_date,item_id,price_amount
You then have a single loop through the resultant SQL not a loop with multiple SQL connections (this is especially true if your SQL server is separate from the PHP box as an external network connection can have an overhead).
4-5 seconds for a single query though is very slow )by a factor of at least 100x) so it would indicate a problem (very large table with no key to use) or disk issues (potentially).
I have a table in MySQL that I'm accessing from PHP. For example, let's have a table named THINGS:
things.ID - int primary key
things.name - varchar
things.owner_ID - int for joining with another table
My select statement to get what I need might look like:
SELECT * FROM things WHERE owner_ID = 99;
Pretty straightforward. Now, I'd like users to be able to specify a completely arbitrary order for the items returned from this query. The list will be displayed, they can then click an "up" or "down" button next to a row and have it moved up or down the list, or possibly a drag-and-drop operation to move it to anywhere else. I'd like this order to be saved in the database (same or other table). The custom order would be unique for the set of rows for each owner_ID.
I've searched for ways to provide this ordering without luck. I've thought of a few ways to implement this, but help me fill in the final option:
Add an INT column and set it's value to whatever I need to get rows
returned in my order. This presents the problem of scanning
row-by-row to find the insertion point, and possibly needing to
update the preceding/following rows sort column.
Having a "next" and "previous" column, implementing a linked list.
Once I find my place, I'll just have to update max 2 rows to insert
the row. But this requires scanning for the location from row #1.
Some SQL/relational DB trick I'm unaware of...
I'm looking for an answer to #3 because it may be out there, who knows. Plus, I'd like to offload as much as I can on the database.
From what I've read you need a new table containing the ordering of each user, say it's called *user_orderings*.
This table should contain the user ID, the position of the thing and the ID of the thing. The (user_id, thing_id) should be the PK. This way you need to update this table every time but you can get the things for a user in the order he/she wants using ORDER BY on the user_orderings table and joining it with the things table. It should work.
The simplest expression of an ordered list is: 3,1,2,4. We can store this as a string in the parent table; so if our table is photos with the foreign key profile_id, we'd place our photo order in profiles.photo_order. We can then consider this field in our order by clause by utilizing the find_in_set() function. This requires either two queries or a join. I use two queries but the join is more interesting, so here it is:
select photos.photo_id, photos.caption
from photos
join profiles on profiles.profile_id = photos.profile_id
where photos.profile_id = 1
order by find_in_set(photos.photo_id, profiles.photo_order);
Note that you would probably not want to use find_in_set() in a where clause due to performance implications, but in an order by clause, there are few enough results to make this fast.