I have two tables, each containing around 200,000 rows. I have written the query below to retrieve data using a join.
This is the query I have tried:
$dbYTD = DB::table('stdtsum as a')
->join(DB::raw("(select distinct s_id, c_cod, compid from stdcus) b"), function($join){
$join->on('a.compid', '=', 'b.compid')->on('a.c_cod', '=', 'b.c_cod');
})
->select('b.s_id', DB::raw('sum(turnover) as sumturn'))
->whereBetween('date', [$startYTD, $endYTD])
->groupBy('b.s_id')
->get()
->toArray();
This query gives the correct result, but it takes a very long time to run; sometimes it even times out.
Can anybody help me optimize this query?
You'll need to index all the columns used in the join condition. In your case: "compid" and "c_cod" in both tables.
Generally, "Primary Key constraint" columns are automatically indexed in databases, whereas columns you merely join on are not. In many databases you'll have to index the "Foreign Key constraint" columns manually (MySQL's InnoDB adds an index when you declare the constraint, but not for undeclared join columns).
Some Indexing Tips :
Create an index on the field that has the biggest variety of values first. This will result in the “most bang for your buck”, so to speak.
Keep indexes small. It's better to have an index on just zip code or postal code, rather than postal code & country. The smaller the index, the better the response time.
For high frequency functions (thousands of times per day) it can be wise to have a very large index, so the system does not even need the table for the read function.
For small tables an index may be disadvantageous. The same can also be said for any function where the system would be better off by scanning the whole table.
Remember that an index slows down additions, modifications and deletes because the indexes need to be updated whenever the table is. So, the best practice is to add an index for values that are often used for a search, but that do not change much. As such, an index on a bank account number is better than one on the balance.
Tips reference : https://www.databasejournal.com/features/mysql/article.php/3840606/Maximizing-Query-Performance-through-Column-Indexing-in-MySQL.htm
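As a rough illustration of the advice above, the sketch below builds composite indexes on the join columns and compares the query plan before and after. It uses Python's built-in sqlite3 purely as a runnable stand-in for MySQL; the column types, the index names and the simplified join (without the DISTINCT subquery from the question) are assumptions, and MySQL's own EXPLAIN output looks different.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stdtsum (compid INTEGER, c_cod TEXT, date TEXT, turnover REAL);
CREATE TABLE stdcus  (s_id INTEGER, compid INTEGER, c_cod TEXT);
""")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN holds the readable detail text.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Simplified version of the join from the question (no DISTINCT subquery).
query = """
SELECT b.s_id, SUM(a.turnover) AS sumturn
FROM stdtsum a
JOIN stdcus b ON a.compid = b.compid AND a.c_cod = b.c_cod
WHERE a.date BETWEEN '2020-01-01' AND '2020-12-31'
GROUP BY b.s_id
"""

before = plan(query)

# Composite indexes on the join columns; the index names are hypothetical.
conn.execute("CREATE INDEX stdcus_join_idx ON stdcus (compid, c_cod)")
conn.execute("CREATE INDEX stdtsum_join_idx ON stdtsum (compid, c_cod)")

after = plan(query)

print("before:", before)
print("after: ", after)
```

In MySQL the equivalent check would be to run EXPLAIN on the original query after adding the indexes and confirm that the join no longer scans a whole table.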
I'm trying to write a Laravel eloquent statement to do the following.
Query a table and get the IDs of all the duplicate rows (or ideally all the IDs except the ID of the first instance of each duplicate).
Right now I have the following MySQL statement:
select `codes`, count(`codes`) as `occurrences`, `customer_id` from `pizzas`
group by `codes`, `customer_id`
having `occurrences` > 1;
The duplicates are any row that shares a combination of codes and customer_id, example:
codes,customer_id
183665A4,3
183665A4,3
183665A4,3
183665A4,3
183665A4,3
I'm trying to delete all but 1 of those.
This is returning a set of the codes, with their occurrences and their customer_id, as I only want rows that have both.
Currently I loop through this, save the ID of the first instance, and then query again and delete any rows without that ID. This is not very fast: there are about 50 million rows, so each query takes forever, and we run multiple queries for each duplicate to delete.
// get every order that shares the same code and customer ID
$orders = Order::select('id', 'codes', DB::raw('count(`codes`) as `occurrences`'), 'customer_id')
->groupBy('codes')
->groupBy('customer_id')
->having('occurrences', '>', 1)
->limit(100)
->get();
// loop through those orders
foreach ($orders as $order)
{
// find the first order that matches this duplicate set
$first_order = Order::where('codes', $order->codes)
->where('customer_id', $order->customer_id)
->first();
// delete all but the first
Order::where('codes', $order->codes)
->where('customer_id', $order->customer_id)
->where('id', '!=', $first_order->id)
->delete();
}
There has got to be a more efficient way to track down all rows that share the same code and customer_id, and delete all the duplicates but keep the first instance, right? lol
I'm thinking maybe if I can add a fake column to the results that is an array of every ID, I could at least then remove the first ID and delete the others.
Don't involve PHP
This seems not very fast
The logic in the question is inherently slow because it's lots of queries and for each query there's:
DB<->PHP network roundtrip
PHP ORM logic/overhead
Given the numbers in the question, the whole code needs calling up to 10k times (if there are exactly 2 occurrences for every one of those 2 million duplicate records). For argument's sake, let's say it is called 1k times, handling 100 sets of duplicates per call; overall that's:
1,000 queries finding duplicates
100,000 queries finding the first record
100,000 delete queries
201,000 queries is a lot, and the PHP overhead makes it an order of magnitude slower (a guess, based on experience).
Do it directly on the DB
Just eliminating the PHP/ORM/network time (even if it's on the same machine) would make the process markedly faster. That would involve writing a procedure to mimic the PHP logic in the question.
But there's a simpler way, the specifics depend on the circumstances. In comments you've said:
The table is 140GB in size
It contains 50 million rows
Approx 2 million are duplicate records
There isn't enough free space to make a copy of the table
Taking these comments at face value the process I suggest is:
Ensure you have a functional DB backup
Before doing anything make sure you have a functional DB backup. If you manage to make a mistake and e.g. drop the table - be sure you can recover without loss of data.
You'll be testing this process on a copy of the database first anyway, right :) ?
Create a table of "ids to keep" and populate it
This is a variation on removing duplicates with a unique index:
CREATE TABLE ids_to_keep (
id INT PRIMARY KEY,
codes VARCHAR(50) NOT NULL, # use same schema as source table
customer_id INT NOT NULL, # use same schema as source table
UNIQUE KEY derp (codes,customer_id)
);
INSERT IGNORE INTO ids_to_keep
SELECT id, codes, customer_id from pizzas;
MySQL will silently drop the rows conflicting with the unique index, resulting in a table with one id per codes+customer_id tuple.
If you don't have space for this table, make room :). It shouldn't be too large; 140GB and 50M rows means each row is approx. 3 KB, so this temporary table will likely require a single-digit percentage of the original size.
Delete the duplicate records
Before executing any expected-to-be-slow query use EXPLAIN to check if the query will complete in a reasonable amount of time.
To run as a single query:
DELETE FROM
pizzas
WHERE
id NOT IN (SELECT id from ids_to_keep);
If you wish to do things in chunks:
DELETE FROM
pizzas
WHERE
id BETWEEN 0 AND 10000 AND
id NOT IN (SELECT id from ids_to_keep);
Cleanup
Once the table isn't needed any more, get rid of it:
DROP TABLE ids_to_keep;
Make sure this doesn't happen again
To prevent this happening again, add a unique index to the table:
CREATE UNIQUE INDEX codes_customer_id_uniq ON pizzas (codes, customer_id); # the index name is arbitrary
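The whole procedure can be tried end to end on a toy table first. The following is a minimal sketch using Python's built-in sqlite3 as a stand-in for MySQL: `INSERT OR IGNORE` is SQLite's counterpart of MySQL's `INSERT IGNORE`, and the sample rows, column types, the added `ORDER BY id` (to make "keep the first id" explicit) and the index name are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pizzas (
    id INTEGER PRIMARY KEY,
    codes TEXT NOT NULL,
    customer_id INTEGER NOT NULL
);
-- Made-up sample data: ids 1-3 are duplicates of each other.
INSERT INTO pizzas (codes, customer_id) VALUES
    ('183665A4', 3), ('183665A4', 3), ('183665A4', 3),
    ('183665A4', 4), ('AAAA0001', 3);

-- Step 1: table of ids to keep, one per (codes, customer_id) pair.
CREATE TABLE ids_to_keep (
    id INTEGER PRIMARY KEY,
    codes TEXT NOT NULL,
    customer_id INTEGER NOT NULL,
    UNIQUE (codes, customer_id)
);
-- Rows that would violate the unique constraint are skipped silently.
INSERT OR IGNORE INTO ids_to_keep
SELECT id, codes, customer_id FROM pizzas ORDER BY id;

-- Step 2: delete everything that is not in the keep list.
DELETE FROM pizzas WHERE id NOT IN (SELECT id FROM ids_to_keep);

-- Step 3: clean up and prevent new duplicates.
DROP TABLE ids_to_keep;
CREATE UNIQUE INDEX pizzas_codes_customer_uniq ON pizzas (codes, customer_id);
""")

remaining = conn.execute(
    "SELECT id, codes, customer_id FROM pizzas ORDER BY id"
).fetchall()
print(remaining)

# A new duplicate is now rejected by the unique index.
try:
    conn.execute("INSERT INTO pizzas (codes, customer_id) VALUES ('183665A4', 3)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

On the real 50-million-row table the same statements would run in MySQL itself, ideally in id-range chunks as described above.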
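Try this one; for each set of duplicates it keeps only the row with the latest id (non-duplicated rows are left untouched):
$deleteDuplicates = DB::table('orders as ord1')
// self-join: ord2 is a later row (higher id) with the same codes and customer_id
->join('orders as ord2', 'ord1.id', '<', 'ord2.id')
->whereColumn('ord1.codes', 'ord2.codes')
->whereColumn('ord1.customer_id', 'ord2.customer_id')
->delete();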
I have two tables employee and attendance.
employee : empID, empName
attendance: attendanceID, empID, date, inTime, outTime
I need to show these data in a grid where employee name in the left side and then dates. So the column headers would be like Emp Name, 1,2,3,4....,30, With or without data, number of days in the month needs to be printed.
I realized three ways to do this.
Get attendance and employee data in a join query ordered by empID. Then loop through the data and print each record if it matches the current date. This continues until the empID changes in the loop.
Loop through employees, then loop for days in the month, in every record get attendance from the database for particular employee and particular dates.
foreach($employees as $emp)
{
$empID = $emp['empID'];
for($day = 1; $day <= $maxDaysInTheMonth; $day++)
{
$attendance = getAttendanceFromDatabase($empID,$day);
}
}
To improve performance we try to minimize database connections and unnecessary loops. I would like to implement the second way, as it has the fewest conditions and loops and the code is clean. But it queries the database for every employee, for every day. Can someone point out some facts about performance, please?
Fetching the records in a single query and looping through them is better, as it has to call the database server only once. The second way has to call the database server multiple times, which is more costly.
Then make an associative array from the data. The index would be the empID.
After generating the array you can use it as you want.
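A minimal sketch of that idea follows, written in Python with the built-in sqlite3 module as a stand-in for PHP and MySQL. The sample rows, the month (June 2015, 30 days) and the variable names are made up for illustration; the table and column names follow the question.

```python
import sqlite3
from collections import defaultdict

# In-memory SQLite database as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (empID INTEGER PRIMARY KEY, empName TEXT);
CREATE TABLE attendance (
    attendanceID INTEGER PRIMARY KEY,
    empID INTEGER, date TEXT, inTime TEXT, outTime TEXT
);
-- Made-up sample data.
INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO attendance (empID, date, inTime, outTime) VALUES
    (1, '2015-06-01', '09:00', '17:00'),
    (1, '2015-06-03', '09:10', '17:05');
""")

# One query for the whole month; LEFT JOIN keeps employees with no attendance.
rows = conn.execute("""
    SELECT e.empID, e.empName, a.date, a.inTime, a.outTime
    FROM employee e
    LEFT JOIN attendance a
           ON a.empID = e.empID
          AND a.date BETWEEN '2015-06-01' AND '2015-06-30'
    ORDER BY e.empID
""").fetchall()

# Associative array keyed by empID, then by day of month.
names = {}
by_emp = defaultdict(dict)
for emp_id, name, date, in_time, out_time in rows:
    names[emp_id] = name
    if date is not None:
        by_emp[emp_id][int(date[8:10])] = (in_time, out_time)

# Build the grid: one row per employee, one cell per day (None = no record).
days_in_month = 30
grid = {
    names[emp_id]: [by_emp[emp_id].get(day) for day in range(1, days_in_month + 1)]
    for emp_id in names
}

for name, cells in grid.items():
    print(name, ["x" if cell else "-" for cell in cells])
```

The database is queried once, and every cell of the grid is then a dictionary lookup rather than a round trip.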
Try this query
$sql="SELECT employee.empName AS empName, attendance.date AS date FROM employee,attendance WHERE employee.empID=attendance.empID";
As @Sougata suggests, fetching the records in a single query and looping through them is better. But keep in mind that the performance of that query can be improved as follows:
Avoid Multiple Joins in a Single Query
Try to avoid writing a SQL query using multiple joins that include outer joins, cross apply, outer apply and other complex subqueries. This reduces the choices available to the optimizer when deciding the join order and join type. Sometimes the optimizer is forced to use nested loop joins, irrespective of the performance consequences, for queries with excessively complex cross apply or subqueries.
Avoid Use of Non-correlated Scalar Sub Query
You can re-write your query to remove non-correlated scalar sub query as a separate query instead of part of the main query and store the output in a variable, which can be referred to in the main query or later part of the batch. This will give better options to Optimizer, which may help to return accurate cardinality estimates along with a better plan.
Creation and Use of Indexes
We are aware of the fact that Index can magically reduce the data retrieval time but have a reverse effect on DML operations, which may degrade query performance. With this fact, Indexing is a challenging task, but could help to improve SQL query performance and give you best query response time.
Create a Highly Selective Index
Selectivity defines the percentage of qualifying rows in the table (qualifying number of rows / total number of rows). If the ratio of the qualifying number of rows to the total number of rows is low, the index is highly selective and is most useful. A non-clustered index is most useful if the ratio is around 5% or less, which means the index can eliminate 95% of the rows from consideration. If an index returns more than 5% of the rows in a table, it probably will not be used; either a different index will be chosen or created, or the table will be scanned.
Position a Column in an Index
Order or position of a column in an index also plays a vital role to improve SQL query performance. An index can help to improve the SQL query performance if the criteria of the query matches the columns that are left most in the index key. As a best practice, most selective columns should be placed leftmost in the key of a non-clustered index.
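The leftmost-column rule can be observed directly in a query plan. The sketch below uses Python's built-in sqlite3 as a convenient stand-in (the answer above is phrased in SQL Server terms, and other engines print different plan text); the `contacts` table, its columns and the index name are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only to inspect query plans (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, phone TEXT, notes TEXT)"
)
# Composite index with `name` as the leftmost column.
conn.execute("CREATE INDEX name_phone_idx ON contacts (name, phone)")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN holds the readable detail text.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filters on the leftmost column (alone or with the second one) can use the index.
plan_leftmost = plan("SELECT * FROM contacts WHERE name = 'y'")
plan_both = plan("SELECT * FROM contacts WHERE name = 'y' AND phone = 'z'")

# A filter on the second column alone cannot seek into the index.
plan_second_only = plan("SELECT * FROM contacts WHERE phone = 'z'")

print(plan_leftmost)
print(plan_both)
print(plan_second_only)
```

If queries on `phone` alone are common, that column would need its own index or a composite index that starts with it.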
I have three tables.
Radar data table (with id as primary), also has two columns of violation_file_id, and violation_speed_id.
Violation_speed table (with id as primary)
violation_file table (with id as primary)
I want to select all radar data, limited by 1000, from some start interval to an end interval, joins with violation_speed table. Each radar data must have a violation_speed_id.
I then want to join with the violation_file table, but not every radar record has a corresponding violation_file_id; some records just have a violation_file_id of 0, which means there is no corresponding file.
My current sql is like this,
$results = DB::table('radar_data')
->join('violation_speed', 'radar_data.violation_speed_id', '=', 'violation_speed.id')
->leftjoin('violation_video_file', 'radar_data.violation_video_file_id', '=', 'violation_video_file.id')
->select('radar_data.id as radar_id',
'radar_data.violation_video_file_id',
'radar_data.violation_speed_id',
'radar_data.speed',
'radar_data.unit',
'radar_data.violate',
'radar_data.created_at',
'violation_speed.violation_speed',
'violation_speed.unit as violation_unit',
'violation_video_file.video_saved',
'violation_video_file.video_deleted',
'violation_video_file.video_uploaded',
'violation_video_file.path',
'violation_video_file.video_name')
->where('radar_data.violate', '=', '1')
->orderBy('radar_data.id', 'desc')
->offset($from_id)
->take($max_length)
->get();
It is PHP Laravel, but I think the translation to a MySQL statement is straightforward.
My question is: is this a good way to select data? I tried it, but it seems a bit slow once the radar data table grows large.
Thanks.
Assuming you have the proper indices set, this is largely the way to go. The only thing that's not 100% clear to me is what the offset() method does, but if it simply adds a WHERE clause then this should give you pretty much the best performance you're going to get. If not, replace it with a where('radar_data.id', '>', $from_id).
The most important indices are the ones on the foreign keys and primary keys here. And make sure not to forget the violate index.
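To make the suggested replacement concrete: filtering on the id with a WHERE clause lets the database seek straight to the start of a page instead of stepping over skipped rows. Below is a minimal sketch in Python with the built-in sqlite3 module standing in for MySQL; the sample data and the `page` helper are hypothetical. Because the question orders by id descending, the sketch compares with `<` against the last id already seen, which is an assumption about how `$from_id` would be used.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE radar_data (id INTEGER PRIMARY KEY, violate INTEGER, speed REAL)")
# Made-up data: odd ids are violations.
conn.executemany(
    "INSERT INTO radar_data (id, violate, speed) VALUES (?, ?, ?)",
    [(i, i % 2, 50.0 + i) for i in range(1, 21)],
)
conn.execute("CREATE INDEX violate_idx ON radar_data (violate)")

def page(before_id, max_length):
    """Return up to max_length violation ids below before_id, newest first."""
    cursor = conn.execute(
        "SELECT id FROM radar_data "
        "WHERE violate = 1 AND id < ? "
        "ORDER BY id DESC LIMIT ?",
        (before_id, max_length),
    )
    return [row[0] for row in cursor]

# First page: start above any existing id; next page: continue below the last id seen.
first = page(10**9, 3)
second = page(first[-1], 3)
print(first, second)
```

The joins from the original query can be added unchanged; only the paging condition differs.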
The speed of the query often relies on the use of proper indexing on the joining clause and where clause used.
In your query there are 2 joins and if the joining keys are not indexed then you might need to apply the following
alter table radar_data add index violation_speed_id_idx(violation_speed_id);
alter table radar_data add index violation_video_file_id_idx(violation_video_file_id);
alter table radar_data add index violate_idx(violate);
The ids are primary key hence they are already indexed and should be covered
I am indexing all the columns that I use in my Where / Order by, is there anything else I can do to speed the queries up?
The queries are very simple, like:
SELECT COUNT(*)
FROM TABLE
WHERE user = id
AND other_column = 'something'
I am using PHP 5, MySQL client version: 4.1.22 and my tables are MyISAM.
Talk to your DBA. Run your local equivalent of showplan. For a query like your sample, I would suspect that a covering index on the columns id and other_column would greatly speed up performance. (I assume user is a variable or niladic function).
A good general rule is the columns in the index should go from left to right in descending order of variance. That is, that column varying most rapidly in value should be the first column in the index and that column varying least rapidly should be the last column in the index. Seems counter intuitive, but there you go. The query optimizer likes narrowing things down as fast as possible.
If all your queries include a user id then you can start with the assumption that userid should be included in each of your indexes, probably as the first field. (Can we assume that the user id is highly selective? i.e. that any single user doesn't have more than several thousand records?)
So your indexes might be:
user + otherfield1
user + otherfield2
etc.
If your user id is really selective, like several dozen records, then just the index on that field should be pretty effective (sub-second return).
What's nice about a "user + otherfield" index is that mysql doesn't even need to look at the data records. The index has a pointer for each record and it can just count the pointers.
I have a database with over 10,000,000 rows. Querying it right now can take a few seconds just to find some basic information. This isn't preferable, I know that the best way to optimize is to minimize the number of rows which is possible, but right now I don't have the time to do this.
What's the easiest way to optimize a MySQL database so that when querying it, the time taken is short?
I don't mind about the size of the database, that doesn't really matter so any optimizations that increase the size are fine. I'm not very good with optimization, right now I have indexes set up, but I'm not sure how much better I can get from there.
I'll eventually trim down the database properly, but is there a quick temporary solution?
Besides indexing which has already been suggested, you may want to also look into partitioning tables if they are large.
Partitioning in MySQL
It's tough to be specific here, because we have very limited information, but proper indexing along with partitioning can go a very long way. Indexing properly can be a long subject, but in a very general sense you'll want to index columns you query against.
For example, say you have a table of employees, and you have your usual columns of SSN, FNAME, LNAME. In addition to those columns, we'll say that you have an additional 10 columns in the table as well.
Now you have this query:
SELECT FNAME, LNAME FROM EMPLOYEES WHERE SSN = 'blah';
Ignoring the fact that the SSN could likely be the primary key here and may already have a unique index on it, you would likely see a performance benefit by creating another composite index containing the columns (SSN, FNAME, LNAME). The reason this is beneficial is that the database can satisfy this query by simply looking at the composite index, because it contains all the values needed in a sorted and compact space (that is, less I/O). Even though the index on SSN alone is a better access method than doing a full table scan, the database still has to read the data blocks for the index (I/O), find the value(s) which contain pointers to the records needed to satisfy the query, and then read different data blocks (read: more random I/O) in order to retrieve the actual values for fname and lname.
This is obviously very simplified, but using indexes in this way can drastically reduce I/O and increase performance of your database.
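The difference between the two indexes shows up in the query plan. The following sketch uses Python's built-in sqlite3 as a stand-in for MySQL (where EXPLAIN reports a covering index as "Using index" in the Extra column); the extra columns and the index names are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only to inspect query plans (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (ssn TEXT, fname TEXT, lname TEXT, address TEXT, phone TEXT)"
)

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN holds the readable detail text.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT fname, lname FROM employees WHERE ssn = 'blah'"

# Index on SSN only: the index finds the row, then the table is read for the names.
conn.execute("CREATE INDEX ssn_idx ON employees (ssn)")
plan_single = plan(query)

# Composite index: every column the query needs is in the index itself.
conn.execute("DROP INDEX ssn_idx")
conn.execute("CREATE INDEX ssn_name_idx ON employees (ssn, fname, lname)")
plan_composite = plan(query)

print(plan_single)
print(plan_composite)
```

The second plan reports a covering index, meaning the table rows are never touched for this query.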
Some other links here you may find helpful:
MySQL indexes - how many are enough?
When should I use a composite index?
MySQL Query Optimization (Particularly the section on "Choosing Indexes")
As I can see you request 40k rows from the database, this load of data needs time just to be transferred.
Also, never ask "how to improve in general". There is no way of "general" optimization. Optimization is always result of profiling and research of your particular case.
Use indexes on columns you search on very often.
In your example, 'WHERE x=y', if y is a column name, create an index on y as well.
The key with an index is that the number of rows returned by your select query should be around 3% to 5% of the entire table; then the index will make it faster.
Archiving old table data also helps. I do not know how to do this; it is mostly a DBA task.
For a DBA it is a simple task if they have done it before.
If you're doing ordering or complex queries you may need to use multi-column indexes. For example, if you're searching where x.name = 'y' AND x.phone = 'z' it might be worth putting an index on name,phone. Simplified example, but if you need to do this you'll need to research it further anyway :)
Are your queries using your indexes? What does running an EXPLAIN on your select queries tell you?
The first (and easiest) step will be making sure your queries are optimized.