High CPU with mysql query when updating with inner join - php

I have looked around a lot and tried different methods, and I want to improve my import mechanism for big data. Importing new data with INSERT works great; however, I hit an issue when I want to update existing data based on two WHERE conditions.
I first load the data from the source and place it in a CSV file, then use LOAD DATA LOCAL INFILE to import the data into a temp table.
Then I insert from the temp table into the main table as follows, which works as expected: fast, and with a low amount of server resources.
INSERT INTO $table ($fields) SELECT $fields FROM $temptable WHERE (ua,gm_id) NOT IN (SELECT ua,gm_id FROM $table)
I then have the following to update the records. The reason I created this method is that INSERT ... ON DUPLICATE KEY UPDATE did not work for me: it always inserted a new record. I think I either don't understand how that method works or have not used it the right way. Both UA and GM_ID are indexes on both tables, but I can't get it to work. The issue with the script below is that updating 8000 rows uses 200% CPU and takes 5 to 8 minutes, which is of course not great.
$query = "UPDATE $table a INNER JOIN $temptable b ON a.gm_id=b.gm_id AND a.ua=b.ua SET ";
foreach($update_columns as $column => $status){
$query .= "a.$column=b.$column,";
}
$query = trim($query, ",");
$result = $pdo->query($query);
Can someone point me in the right direction as to what I should be using?
I want to update certain columns from the temp table to the main table. This code executes a lot of times during the day. Sometimes it updates just 100 rows, but sometimes 8k or 60k rows, and the columns can change.
I hope the sample codes are clear.
Thanks in advance for assistance.

"Both UA and GM_ID are indexes on both tables" -- Two separate indexes is the wrong approach. You must have a "composite" UNIQUE(UA, GM_ID) (in either order). If that pair is not unique, then you cannot use IODKU.
WHERE .. NOT IN ( SELECT ... ) is very inefficient. WHERE ... NOT EXISTS ( SELECT ... ) is better; LEFT JOIN ... WHERE .. IS NULL is even better. See "SQL #1" in http://mysql.rjweb.org/doc.php/staging_table#normalization
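The LEFT JOIN form of the insert-only step could look like this (same placeholder names as in the sketch above; untested, just illustrating the pattern that blog describes):
INSERT INTO main_table (ua, gm_id, col_a, col_b)
SELECT t.ua, t.gm_id, t.col_a, t.col_b
FROM temp_table AS t
LEFT JOIN main_table AS m
    ON m.ua = t.ua AND m.gm_id = t.gm_id
WHERE m.ua IS NULL;  -- keep only rows that have no match in the main table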
Read the rest of that blog for more tips on high speed ingestion.

Related

Import only non-existing data to database from CSV

I have created a script that reads data from a CSV file, checks if the data already exists in the database, and imports it if it does not. If the data does exist (the code of a specific product), then the rest of the information needs to be updated from the CSV file.
For example: I have a member with code WTW-2LT, named Alex, surname Johnson, in my CSV file. The script checks whether a member with code WTW-2LT, named Alex, surname Johnson, already exists. If it does, the contact details and extra details need to be updated from the script (other details like subject and lecturer also need to be checked; all details are in one line in the CSV). If it doesn't exist, the new member just has to be created.
Here is my script so far, with other checks kept to a minimum to avoid distraction for now:
while ($row = fgetcsv($fp, null, ";")) {
    if ($header === null) {
        $header = $row;
        continue;
    }
    $record = array_combine($header, $row);

    $member = $this->em->getRepository(Member::class)->findOneBy([
        'code' => $record['member_code'],
        'name' => $record['name'],
        'surname' => $record['surname'],
    ]);
    if (!$member) {
        $member = new Member();
        $member->setCode($record['member_code']);
        $member->setName($record['name']);
        $member->setSurname($record['surname']);
    }
    $member->setContactNumber($record['phone']);
    $member->setAddress($record['address']);
    $member->setEmail($record['email']);

    $subject = $this->em->getRepository(Subject::class)->findOneBy([
        'subject_code' => $record['subj_code']
    ]);
    if (!$subject) {
        $subject = new Subject();
        $subject->setCode($record['subj_code']);
    }
    $subject->setTitle($record['subj_title']);
    $subject->setDescription($record['subj_desc']);
    $subject->setLocation($record['subj_loc']);

    $lecturer = $this->em->getRepository(Lecturer::class)->findOneBy([
        'subject' => $subject,
        'name' => $record['lec_name'],
        'code' => $record['lec_code'],
    ]);
    if (!$lecturer) {
        $lecturer = new Lecturer();
        $lecturer->setSubject($subject);
        $lecturer->setName($record['lec_name']);
        $lecturer->setCode($record['lec_code']);
    }
    $lecturer->setEmail($record['lec_email']);
    $lecturer->setContactNumber($record['lec_phone']);

    $member->setLecturer($lecturer);

    $validationErrors = $this->validator->validate($member);
    if (!count($validationErrors)) {
        $this->em->persist($member);
        $this->em->flush();
    } else {
        // ...
    }
}
As you can see, this script has to query the database 3 times to check whether one CSV line exists. In my case I have files with 2000+ lines, so performing 3 queries per line just to check whether that line exists is quite time-consuming.
Unfortunately, I also can't import the rows in batches, because if one subject doesn't exist it will be created repeatedly until the batch is flushed to the database, and then I am left with duplicate records that serve no purpose.
How can I improve performance and speed to the max? For example, first get all records from the database and store them in arrays (memory-consuming?), then do the checks, add each line to the array, and check from there...
Can somebody please help me find a way to improve this (with sample code, please)?
To be honest, I don't find 2000+ rows with 3x that number of queries to be that much. But since you are asking about performance, here are my two cents:
Using a framework will always add overhead, meaning that if you write this code in native PHP it will already run quicker. I'm not familiar with Symfony, but I assume you store your data in a database. In MySQL you can use the command INSERT ... ON DUPLICATE KEY UPDATE. If you have set the 3 fields (code, name, surname) as a primary key (which I assume), you can use that to insert data, but if the key already exists, update the values in the database. MySQL will do the checks for you to see whether the data has changed: if not, no disk write will happen.
I'm quite certain you can write native SQL in Symfony, allowing you to use the security the framework provides while still speeding up your inserts.
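As a rough illustration of that suggestion (untested; the member table and column names are guesses based on the entity setters above, and it assumes a UNIQUE or PRIMARY key over (code, name, surname)):
INSERT INTO member (code, name, surname, contact_number, address, email)
VALUES ('WTW-2LT', 'Alex', 'Johnson', '012-345-6789', '1 Example Street', 'alex@example.com')
ON DUPLICATE KEY UPDATE
    contact_number = VALUES(contact_number),
    address = VALUES(address),
    email = VALUES(email);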
Generally, if you want performance, my best experience has been to dump all the data into the database and then transform it in there using SQL statements. The DBMS will be able to optimize all of your steps this way.
You can import CSV-Files directly into your MySQL database with the SQL command
LOAD DATA INFILE 'data.csv'
INTO TABLE tmp_import
The command has a lot of options where you can specify your CSV file's format, for example:
MySQL Ref on LOAD DATA INFILE
https://stackoverflow.com/a/18941427/1220835
If your data.csv is a full dump containing all old and new rows then you can just replace your current table with the imported one, after you fixed it up a bit.
For example, your CSV file (and import table) might look a bit like:
WTW-2LT, Alex, Johnson, subj_code1, ..., lec_name1, ...
WTW-2LT, Alex, Johnson, subj_code1, ..., lec_name2, ...
WTW-2LT, Alex, Johnson, subj_code2, ..., lec_name3, ...
WTW-2LU, John, Doe, subj_code3, ..., lec_name4, ...
You could then get the distinct rows via grouping:
SELECT member_code, name, surname
FROM tmp_import
GROUP BY member_code, name, surname
If member_code is a key, you can just GROUP BY member_code in MySQL. The DBMS won't complain, even though I believe it's technically against the standard.
To get the rest of your data you do the same:
SELECT subj_code, subj_title, member_code
FROM tmp_import
GROUP BY subj_code
and
SELECT lec_code, lec_name, subj_code
FROM tmp_import
GROUP BY lec_code
assuming subj_code and lec_code are both keys for subjects and lecturers.
To actually save this result as a table you can use MySQL's CREATE TABLE ... SELECT-syntax, for example
CREATE TABLE tmp_import_members
SELECT member_code, name, surname
FROM tmp_import
GROUP BY member_code, name, surname
You can then do the inserts and updates in two queries:
INSERT INTO members (member_code, name, surname)
SELECT member_code, name, surname
FROM tmp_import_members
WHERE tmp_import_members.member_code NOT IN (
SELECT member_code FROM members WHERE member_code IS NOT NULL
);
UPDATE members
JOIN tmp_import_members ON
members.member_code = tmp_import_members.member_code
SET
members.name = tmp_import_members.name,
members.surname = tmp_import_members.surname;
and the same for subjects and lecturers, to your liking.
This all amounts to:
one bulk import of your CSV file, which should be very fast,
3 temporary tables for your members, subjects and lecturers,
3 insert and 3 update statements (one per table),
one DROP TABLE for each temporary table after you're done.
Again: if your CSV file contains all rows, you could just replace your existing tables and save the 3 inserts and 3 updates.
Make sure that you create indexes on the relevant columns of your temporary tables so that MySQL can speed up the NOT IN and JOIN in the above queries.
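For example (assuming the subject and lecturer staging tables were created the same way as tmp_import_members; the table and index names here are only illustrative):
ALTER TABLE tmp_import_members ADD INDEX idx_member_code (member_code);
ALTER TABLE tmp_import_subjects ADD INDEX idx_subj_code (subj_code);
ALTER TABLE tmp_import_lecturers ADD INDEX idx_lec_code (lec_code);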
You can run a custom SQL query first to get the counts from all three tables:
SELECT
(SELECT COUNT(*) FROM member WHERE someCondition) as memberCount,
(SELECT COUNT(*) FROM subject WHERE someCondition) as subjectCount,
(SELECT COUNT(*) FROM lecturer WHERE someCondition) as lecturerCount
Then, based on the counts, you can tell whether the data is already present in your tables or not. You don't have to run the queries multiple times for uniqueness checks if you go with native SQL.
Check out this link to learn how to run custom SQL in Doctrine:
Symfony2 & Doctrine: Create custom SQL-Query

very slow search and update database operation

I have a table "table1" which has almost 400,000 records. There is another table "table2" which has around 450,000 records.
I need to delete all the rows in table1 which are duplicated in table2. I have been trying to do it with PHP, and the script has been running for hours without completing. Does it really take that much time?
Field asin is VARCHAR(20) in table1.
Field ASIN is indexed and CHAR(10) in table2.
$duplicat = 0;
$sql = "SELECT asin FROM asins";
$result = $conn->query($sql);
if ($result->num_rows > 0) {
    while ($row = $result->fetch_assoc()) {
        $ASIN = $row['asin'];
        $sql2 = "SELECT id FROM asins_chukh WHERE ASIN='$ASIN' LIMIT 1";
        $result2 = $conn->query($sql2);
        if ($result2->num_rows > 0) {
            $duplicat++;
            $sql3 = "UPDATE `asins` SET `duplicate` = '1' WHERE `asins`.`asin` = '$ASIN';";
            $result3 = $conn->query($sql3);
            if ($result3) {
                echo "duplicate = $ASIN <br/>";
            }
        }
    }
}
echo "total: $duplicat";
You can run one single SQL command instead of a loop, something like:
update table_2 t2
set t2.duplicate = 1
where exists (
select id
from table_1 t1
where t1.id = t2.id);
Warning! I didn't test the SQL above, so you may need to verify the syntax.
For this kind of database operation, using PHP to loop and join is never a good idea. Most of the time will be wasted on network data transfer between your PHP server and your MySQL server.
If even the above SQL takes too long, you can consider limiting the query set with some range, something like:
update table_2 t2
set t2.duplicate = 1
where exists (
select id
from table_1 t1
where t1.id = t2.id
and t2.id > [range_start] and t2.id < [range_end] );
This way, you can kick off several updates running in parallel.
Yes, processing RBAR (Row By Agonizing Row) is going to be slow. There is overhead associated with each of those individual SELECT and UPDATE statements that get executed... sending the SQL text to the database, parsing the tokens for valid syntax (keywords, commas, expressions), validating the semantics (table references and column references valid, user has required privileges, etc.), evaluating possible execution plans (index range scan, full index scan, full table scan), converting the selected execution plan into executable code, executing the query plan (obtaining locks, accessing rows, generating rollback, writing to the innodb and mysql binary logs, etc.), and returning the results.
All of that takes time. For a statement or two, the time isn't that noticeable, but put thousands of executions into a tight loop, and it's like watching individual grains of sand falling in an hour glass.
MySQL, like most relational databases, is designed to efficiently operate on sets of data. Give the database work to do, and let the database crank, rather than spend time round tripping back and forth to the database.
It's like you've got a thousand tiny items to deliver, all to the same address. You can individually handle each item. Get a box, put the item into the box with a packing slip, seal the package, address the package, weigh the package and determine postage, affix postage, and then put it into the car, drive to the post office, drop the package off. Then drive back, and handle the next item, put it into a box, ... over and over and over.
Or, we could handle a lot of tiny items together, as a larger package, and reduce the amount of overhead work (time) packaging and round trips to and from the post office.
For one thing, there's really no need to run a separate SELECT statement, to find out if we need to do an UPDATE. We could just run the UPDATE. If there are no rows to be updated, the query will return an "affected rows" count of 0.
(Running the separate SELECT is like another round trip in the car to the post office, to check the list of packages that need to be delivered, before each round trip to the post office to drop off a package. Instead of two round trips, we can take the package with us on the first trip.)
So, that could improve things a bit. But it doesn't really get to the root of the performance problem.
The real performance boost comes from getting more work done in fewer SQL statements.
How would we identify ALL of the rows that need to be updated?
SELECT t.asin
FROM asins t
JOIN asins_chukh s
ON s.asin = t.asin
WHERE NOT ( t.duplicate <=> '1' )
(If asin isn't unique, we need to tweak the query a bit, to avoid returning "duplicate" rows. The point is, we can write a single SELECT statement that identifies all of the rows that need to be updated.)
For non-trivial tables, for performance, we need to have suitable indexes available. In this case, we'd want indexes with a leading column of asin. If such an index doesn't exist on asins_chukh, create one, for example (the index name is arbitrary):
CREATE INDEX asins_chukh_asin ON asins_chukh (asin);
If that query doesn't return a huge number of rows, we can handle the UPDATE in one fell swoop:
UPDATE asins t
JOIN asins_chukh s
ON s.asin = t.asin
SET t.duplicate = '1'
WHERE NOT ( t.duplicate <=> '1' )
We need to be careful about the number of rows. We want to avoid holding blocking locks for a long time (impacting concurrent processes that may be accessing the asins table), and we want to avoid generating a huge amount of rollback.
We can break the work up into more manageable chunks.
(Referring back to the shipping tiny items analogy... if we have millions of tiny items, and putting all of those into a single shipment would create a package larger and heaver than a container ship container... we can break the shipment into manageably sized boxes.)
For example, we could handle the UPDATE in "batches" of 10,000 id values. Assuming id is unique (or nearly unique), is the leading column in the cluster key, and the id values are grouped fairly well into mostly contiguous ranges, we can keep the update activity localized to one section of blocks and not have to revisit most of those same blocks again...
The WHERE clause could be something like this:
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 0
AND t.id < 0 + 10000
For the next batch...
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 10000
AND t.id < 10000 + 10000
Then
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 20000
AND t.id < 20000 + 10000
And so on, repeating that until we're past the maximum id value. (We could run a SELECT MAX(id) FROM asins as the first step, before the loop.)
(We want to test these statements as SELECT statements first, before we convert to an UPDATE.)
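Putting the pieces together, the first batch might look like this (a sketch only, following the chunking scheme above; as noted, test it as a SELECT first):
UPDATE asins t
JOIN asins_chukh s
    ON s.asin = t.asin
SET t.duplicate = '1'
WHERE NOT ( t.duplicate <=> '1' )
    AND t.id >= 0
    AND t.id < 0 + 10000;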
Using the id column might not be the most appropriate way to create our batches.
Our objective is to create manageable "chunks" we can put into a loop, where the chunks don't overlap the same database blocks... we won't need to revisit the same block over and over, with multiple statements, to make changes to rows within the same block multiple times.

Pure SQL vs PHP While Loop To Perform an Update Faster

I'm currently working on a system to manage my Magic: The Gathering collection. I've written a script to update pricing for all the cards using a while loop to do the main update, but it takes about 9 hours to update all 28,000 rows on my i5 laptop. I have a feeling the same thing can be accomplished without the while loop, using a single MySQL query, and that it would be faster.
My script starts off by creating a temporary table with the same structure as my main inventory table, and then copies new prices into the temporary table via a CSV file. I then use a while loop to compare the cards in the temp table to the inventory table via card_name and card_set to do the update.
My question is: would a pure MySQL query be faster than using the while loop, and can you help me construct it? Any help would be much appreciated. Here is my code.
<?php
set_time_limit(0);
echo "Prices Are Updating. This can Take Up To 8 Hours or More";
include('db_connection.php');
mysql_query("CREATE TABLE price_table LIKE inventory;");
//Upload Data
mysql_query("LOAD DATA INFILE 'c:/xampp/htdocs/mtgtradedesig/price_update/priceupdate.csv'
INTO TABLE price_table FIELDS TERMINATED BY ',' ENCLOSED BY '\"' (id, card_name, card_set, price)");
echo mysql_error();
//UPDATE PRICING
//SELECT all from table named price update
$sql_price_table = "SELECT * FROM price_table";
$prices = mysql_query($sql_price_table);
//Start While Loop to update prices. Do this by putting everything from price table into an array and one entry at a time match the array value to a value in inventory and update.
while ($cards = mysql_fetch_assoc($prices)) {
    $card_name = mysql_real_escape_string($cards['card_name']);
    $card_set = mysql_real_escape_string($cards['card_set']);
    $card_price = $cards['price'];
    $foil_price = $cards['price'] * 2;
    //Update prices for non-foil cards in inventory
    mysql_query("UPDATE inventory SET price='$card_price' WHERE card_name='$card_name' AND card_set='$card_set' AND foil='0'");
    //Update prices for foil cards in inventory
    mysql_query("UPDATE inventory SET price='$foil_price' WHERE card_name='$card_name' AND card_set='$card_set' AND foil='1'");
}
mysql_query("DROP TABLE price_table");
unlink('c:/xampp/htdocs/mtgtradedesign/price_update/priceupdate.csv');
header("Location: http://localhost/mtgtradedesign/index.php");
?>
The easiest remedy is to perform a join between the tables and update all rows at once. You will then only need to run two queries, one for foil and one for non-foil. You can get it done in one, but that gets a bit more complicated (a sketch of that follows below).
UPDATE inventory i
JOIN price_table pt
ON (i.card_name = pt.card_name AND i.card_set = pt.card_set)
SET i.price = pt.price
WHERE i.foil = 0;
I didn't actually test this, but it should generally be what you're looking for. Also, before running these, try using EXPLAIN to see how bad the join performance will be. You might benefit from adding indexes to the tables.
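For the "get it done in one" variant mentioned above, a sketch using IF() on the foil flag (untested; it assumes foil is stored as 0/1 and that foils are always priced at twice the CSV price, as in the original script):
UPDATE inventory i
JOIN price_table pt
    ON (i.card_name = pt.card_name AND i.card_set = pt.card_set)
SET i.price = IF(i.foil = 1, pt.price * 2, pt.price);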
On a side note (and this isn't really your question): mysql_real_escape_string is deprecated, and in general you should not use any of the built-in php mysql_* functions, as they are all known to be unsafe. The PHP docs recommend using PDO instead.

GROUP BY and ORDER BY too slow. How to make faster?

I've been trying to create some stats for my table, but it has over 3 million rows, so it is really slow.
I'm trying to find the most popular values for the name column and also show how many times each pops up.
I'm using this at the moment, but it doesn't work because it's too slow and I just get errors.
$total = mysql_query("SELECT `name`, COUNT(*) as b FROM `people` GROUP BY `name` ORDER BY `b` DESC LIMIT 0,5;")or die(mysql_error());
As you can see, I'm trying to get all the names and how many times each name has been used, but only show the top 5 to hopefully speed it up.
I would then like to be able to get the values like this:
while ($row = mysql_fetch_array($total)) {
    echo $row['name'] . ': ' . $row['b'] . "\r\n";
}
And it will show things like this:
Bob: 215
Steve: 120
Sophie: 118
RandomGuy: 50
RandomGirl: 50
I don't care much about the ordering of ties, like RandomGirl and RandomGuy being the wrong way round.
I think I have provided enough information. :) I would like the names to be case-insensitive if possible, though. Bob should be the same as BoB, bOb, BOB, and so on.
Thank you for your time
Paul
Limiting the results to the top 5 won't give you much of a speed-up; you'll gain time in the result retrieval, but on the MySQL side the whole table still needs to be scanned (to count).
You will speed up your count query by having an index on the name column, of course, as only the index will be scanned and not the table.
Now, if you really want to speed up the result and avoid scanning the name index each time you need it (which will still be quite slow if you really have millions of rows), then the only other solution is computing the stats when inserting, deleting or updating rows in this table, that is, using triggers on this table to maintain a statistics table alongside it. Then you will only have a simple SELECT query on this statistics table, with only 5 rows read. But you will slow down your insert, delete and update operations (which are already quite slow, especially if you maintain indexes), so if the stats are important you should study this solution.
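A minimal sketch of the trigger idea, assuming a statistics table like the name_stats table defined in the next answer (the trigger name is illustrative, and you would also need matching UPDATE and DELETE triggers to keep the counts exact):
CREATE TRIGGER people_after_insert
AFTER INSERT ON people
FOR EACH ROW
    INSERT INTO name_stats (name, cnt)
    VALUES (NEW.name, 1)
    ON DUPLICATE KEY UPDATE cnt = cnt + 1;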
Do you have an index on name? It might help.
Since you are doing the counting/grouping and then sorting, an index on name doesn't help at all: MySQL has to go through all rows every time, and there is no way to optimize this query directly. You need to have a separate stats table like this:
CREATE TABLE name_stats( name VARCHAR(n), cnt INT, UNIQUE( name ), INDEX( cnt ) )
and you should update this table whenever you add a new row to the 'people' table, like this:
INSERT INTO name_stats VALUES( 'Bob', 1 ) ON DUPLICATE KEY UPDATE cnt = cnt + 1;
Querying this table for the list of top names should give you the results instantaneously.
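The top-5 lookup then becomes a cheap, index-backed read on the small stats table, for example:
SELECT name, cnt
FROM name_stats
ORDER BY cnt DESC
LIMIT 5;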

MySql Temp Tables VS Views VS php arrays

I have currently created a Facebook-like page that pulls notifications from different tables, let's say about 8 tables. Each table has a different structure with different columns, so the first thing that comes to mind is to have a global table, like a table of contents, and refresh it with every new hit. I know inserts are resource-intensive, but I was hoping that since it is a fairly static table and I'd only add maybe one new record every 100 visitors, "MAYBE" I could get away with this. I was wrong: I managed to get deadlocks from just three people hammering the website.
So anyway, now I have to redo it using a different method. Initially I was going to use views, but I have an issue with them: the selected result has to contain the id of a user. Here is an example of a SELECT statement from PHP:
$get_events = "
SELECT id, " . $userId . ", 'admin_events', 0, event_start_time
FROM admin_events
WHERE CURDATE() < event_start_time AND
NOT EXISTS(SELECT id
FROM admin_event_registrations
WHERE user_id = " . $userId . " AND admin_events.id = event_id) AND
NOT EXISTS(SELECT id
FROM admin_event_declines
WHERE user_id = " . $userId . " AND admin_events.id = event_id) AND
event_capacity > (SELECT COUNT(*) FROM admin_event_registrations WHERE event_id = admin_events.id)
LIMIT 1
";
Sorry about the messiness. In any event, as you can see, I need to return the user id from the page as a selected column from the table. I could not figure out how to do that with views, so I don't think views are the way I will be heading, because there are a lot more of these types of queries. I come from an MSSQL background and I love stored procedures, so if there are stored procedures for MySQL, that would be excellent.
Next I started thinking about temp tables. The table will be in memory, it will probably be 150 rows max, and there will be no deadlocks. Is it still very expensive to do inserts on a temp table? Will I end up crashing the server? Right now we have maybe 100 users per day, but I want to try to be future-proof for when we get more users.
After a long think, I figured that the only remaining way is to use PHP and get all the results as an array. The problem is that I'd get something like:
$my_array[0]["date_created"] = <current_date>
The problem with the above is that I have to sort by date_created, but this is a multi-dimensional array.
Anyway, to pull 150 to 200 records max from a database, which approach would you take? Temp table, view, or PHP?
Some thoughts:
Temp Tables:
Temporary tables only last as long as the session is alive. If you run the code in a PHP script, the temporary table will be destroyed automatically when the script finishes executing.
Views:
These are mainly for hiding complexity, in that you create one with a join and then access it like a single table. The underlying code is a SELECT statement.
PHP Array:
A bit more cumbersome than SQL to get data from. PHP does have some functions to make life easier, but no real query language.
Stored Procedures:
There are stored procedures in MySQL - see: http://dev.mysql.com/doc/refman/5.0/en/stored-routines-syntax.html
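The syntax looks roughly like this (a minimal sketch only; the procedure name and parameter are hypothetical, and the body would hold your real notification query with p_user_id in place of the concatenated $userId):
DELIMITER //
CREATE PROCEDURE get_admin_events(IN p_user_id INT)
BEGIN
    -- placeholder body: replace with the actual SELECT from the question
    SELECT id, event_start_time
    FROM admin_events
    WHERE CURDATE() < event_start_time;
END //
DELIMITER ;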
My Recommendation:
First, re-write your query using the MySQL Query Analyzer: http://www.mysql.com/products/enterprise/query.html
Now I would use PDO to put the values into an array using PHP. This still leaves the initial heavy lifting to the DB engine and keeps you from making multiple calls to the DB server.
Try this:
SELECT id, " . $userId . ", 'admin_events', 0, event_start_time
FROM admin_events AS ae
LEFT JOIN admin_event_registrations AS aer
ON ae.id = aer.event_id
LEFT JOIN admin_event_declines AS aed
ON ae.id = aed.event_id
WHERE aed.user_id = ". $userid ."
AND aer.user_id = ". $userid ."
AND aed.id IS NULL
AND aer.id IS NULL
AND CURDATE() < ae.event_start_time
AND ae.event_capacity > (
SELECT SUM(IF(aer2.event_id IS NOT NULL, 1, 0))
FROM admin_event_registrations aer2
JOIN admin_events AS ae2
ON aer2.event_id = ae2.id
WHERE aer2.user_id = ". $userid .")
LIMIT 1
It still has a subquery, but you will find that it is much faster than the other options given. MySQL can join tables easily (they should all be of the same table type though). Also, the last count statement won't respond the way you want it to with null results unless you handle null values. This can all be done in a flash, and with the join statements it should reduce your overall query time significantly.
The problem is that you are using correlated subqueries. I imagine that your query takes a little while to run if it's not in the query cache? That's what would be causing your table to lock and causing contention.
Switching the table type to InnoDB would help, but your core problem is your query.
150 to 200 records is a very small amount. MySQL does support stored procedures, but this isn't something you would need them for. Inserts are not resource-intensive, but a lot of them at once, or in sequence (use the bulk insert syntax), can cause issues.
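For reference, the bulk (multi-row) insert syntax mentioned above looks like this (the table and column names are placeholders):
INSERT INTO notifications_index (user_id, source_table, created_at) VALUES
    (1, 'admin_events', NOW()),
    (2, 'admin_messages', NOW()),
    (3, 'admin_events', NOW());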
