I am working with a MySQL table and I need to increment a value in one column for each row, of which there are over 6.5 million.
The column type is varchar and can contain an integer or a string (e.g. +1). The table type is MyISAM.
I have attempted this with PHP:
$adjust_by = 1;
$adjusted = array(); // map of optionid => list of original/adjusted pairs

foreach ($options as $option) {
    $original_turnaround = $option['turnaround'];
    $adjusted_turnaround = $option['turnaround'];

    if (preg_match('/\+/i', $original_turnaround)) {
        // Value is a string like "+1": increment the numeric part, keep the prefix
        $tmp = intval($original_turnaround);
        $tmp += $adjust_by;
        $adjusted_turnaround = '+'.$tmp;
    } else {
        // Plain integer value
        $adjusted_turnaround += $adjust_by;
    }

    if (!array_key_exists($option['optionid'], $adjusted)) {
        $adjusted[$option['optionid']] = array();
    }

    $adjusted[$option['optionid']][] = array(
        'original_turn' => $original_turnaround,
        'adjusted_turn' => $adjusted_turnaround
    );
}//end fe options

//update turnarounds:
if (!empty($adjusted)) {
    foreach ($adjusted as $opt_id => $turnarounds) {
        foreach ($turnarounds as $turn) {
            $update = "UPDATE options SET turnaround = '".$turn['adjusted_turn']."' WHERE optionid = '".$opt_id."' AND turnaround = '".$turn['original_turn']."'";
            run_query($update);
        }
    }
}
For obvious reasons there are serious performance issues with this approach. Running this in my local dev environment leads to numerous errors and eventually the server crashing.
Another thing I need to consider is when this is run in a production environment. This is for an ecommerce store, and I cannot have a huge update like this lock the database or cause any other issues.
One possible solution I have found is this: Fastest way to update 120 Million records
But creating another table comes with its own issues. The codebase is not in a good state, and similar queries are run on this table in loads of places, so I would have to modify a large number of queries and files to make this approach work.
What are my options (if there are any)?
You can do this task with SQL.
With CAST you can convert a string into an integer.
With IF and SUBSTR you can check whether the string starts with +.
With CONCAT you can put the + back onto the calculated result (merging the two values into one string) where necessary.
Just try this SQL:
$update = "UPDATE `options` SET `turnaround` = CONCAT(IF(SUBSTR(`turnaround`, 1, 1) = '+', '+', ''), CAST(`turnaround` AS SIGNED) + ".$adjust_by.") WHERE 1";
can't you just say
UPDATE whatevertable SET whatever = whatever + 1?
Try it and see, I'm pretty sure it will work!
EDIT: You have strings OR integers? Your DB design is flawed; this probably won't work, but it would have been the correct answer had your DB design been stricter.
You probably don't have, but need, this 'composite' index (in either order):
INDEX(optionid, turnaround)
Please provide SHOW CREATE TABLE.
Another slight performance boost is to explicitly LOCK TABLES ... WRITE before that update loop, and UNLOCK TABLES afterwards. Caution: this only applies to MyISAM.
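A minimal sketch of that locking pattern, using the options table from the question:

LOCK TABLES options WRITE;
-- run the UPDATE statement(s) here
UNLOCK TABLES;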
You would be much better off with InnoDB.
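Whichever engine you end up on, a batched version of the single-statement UPDATE from the first answer keeps any one lock short, which matters on a live ecommerce store. A minimal sketch, assuming optionid is a numeric primary key (the question doesn't confirm this) and reusing the run_query() helper from the question:

$adjust_by  = 1;
$batch_size = 10000;
$max_id     = 6500000; // or fetch via SELECT MAX(optionid) FROM options

for ($start = 0; $start <= $max_id; $start += $batch_size) {
    // Each statement touches at most $batch_size rows, so no single
    // lock spans the whole 6.5 million.
    run_query(
        "UPDATE `options`
         SET `turnaround` = CONCAT(
             IF(SUBSTR(`turnaround`, 1, 1) = '+', '+', ''),
             CAST(`turnaround` AS SIGNED) + ".$adjust_by."
         )
         WHERE optionid >= ".$start." AND optionid < ".($start + $batch_size)
    );
    // optionally sleep briefly here to give other queries room on a busy store
}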
Related
In a table, the primary field is a Char(12) field called ribiid, whose format is RB##########.
It needs to auto-increment itself, and for that I have prepared the following code:
function getid() {
    global $connection;
    // Grab the highest existing ribiid, e.g. "RB0000000042"
    $idquery  = "SELECT ribiid FROM systems ORDER BY ribiid DESC LIMIT 1";
    $idsave   = mysqli_query($connection, $idquery);
    $idresult = mysqli_fetch_assoc($idsave);

    $idalpha   = substr($idresult['ribiid'], 0, 2); // "RB"
    $idnumeric = substr($idresult['ribiid'], 2);    // "0000000042"
    $newidnumeric = $idnumeric + 1;
    $newid = $idalpha . $newidnumeric;
    return $newid;
}
Now for testing I manually entered a row in cmd with id = RB0000000000; the next entry that I submit through my webpage using PHP should have been RB0000000001, but it comes out as RB1.
How can I fix this? This is my first web database. Thanks.
Your problem is that when adding 1 to $idnumeric PHP needs to treat it as a number. Leading zeroes in numbers do not make sense, so they are discarded.
To keep the zeroes you can use sprintf to format the resulting (incremented) number:
$newid = sprintf("%s%010d", $idalpha, $newidnumeric);
However, using code like this is not really a good idea: it's subject to a race condition. Consider what could happen if two instances of the script run in parallel:
Time  Instance A                    Instance B
  |   Reads ribiid RB..001          Reads ribiid RB..001
  |   Generates next id RB..002     Generates next id RB..002
  |   Writes RB..002 to DB
  v                                 Writes RB..002 to DB => OOPS
As you see this situation will result in instance B failing to insert a record due to the use of a duplicate primary key. To solve this problem you need to eliminate the race condition, which you could do in one of several ways:
Use an AUTO_INCREMENT column for the PK instead of manually inserting values. Although this means you can no longer have the "RB" prefix as part of the key, you can move it to a different column and have the PK be a combination of these two columns (see the sketch after this list).
LOCK TABLES systems WRITE while the insertion is taking place (note that the lock needs to cover all of the process, not just the getid function). Locking tables is something you normally want to avoid, but if inserts are not frequent it's a usable, practical solution.
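A minimal sketch of the AUTO_INCREMENT approach, assuming a simplified systems table (everything except the id handling is omitted):

CREATE TABLE systems (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (id)
);

-- MySQL assigns the numeric part atomically, so no race is possible:
INSERT INTO systems () VALUES ();

-- Rebuild the "RB"-prefixed, zero-padded display id from the generated number:
SELECT CONCAT('RB', LPAD(LAST_INSERT_ID(), 10, '0')) AS ribiid;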
You could try something like this:
$newid = $idalpha . str_pad($newidnumeric, 10, '0', STR_PAD_LEFT);
This will add zeros to reach ten characters.
You can pad the numeric string using the following function:
function pad_number($number, $pad = 10) {
    // Prepend zeroes until the string reaches $pad characters
    $pad_zero = $pad - strlen($number . '');
    $nstr = '';
    for ($i = 0; $i < $pad_zero; $i++) {
        $nstr .= "0";
    }
    $nstr .= $number;
    return $nstr;
}
You can use this function in your code as:
$newid = $idalpha . pad_number($newidnumeric);
I have a bunch of photos on a page and am using jQuery UI's Sortable plugin to allow them to be reordered.
When my sortable function fires, it writes a new order sequence:
1030:0,1031:1,1032:2,1040:3,1033:4
Each item of the comma-delimited string consists of the photo ID and the order position, separated by a colon. When the user has completely finished their reordering, I'm posting this order sequence to a PHP page via AJAX, to store the changes in the database. Here's where I get into trouble.
I have no problem getting my script to work, but I'm pretty sure it's the incorrect way to achieve what I want, and will suffer hugely in performance and resources - I'm hoping somebody could advise me as to what would be the best approach.
This is my PHP script that deals with the sequence:
if ($sorted_order) {
    $exploded_order = explode(',', $sorted_order);
    foreach ($exploded_order as $order_part) {
        $exploded_part = explode(':', $order_part);
        $part_count = 0;
        foreach ($exploded_part as $part) {
            $part_count++;
            if ($part_count == 1) {
                $photo_id = $part;
            } elseif ($part_count == 2) {
                $order = $part;
            }
            $SQL = "UPDATE article_photos ";
            $SQL .= "SET order_pos = :order_pos ";
            $SQL .= "WHERE photo_id = :photo_id;";
            ... rest of PDO stuff ...
        }
    }
}
My concerns arise from the nested foreach functions and also running so many database updates. If a given sequence contained 150 items, would this script cry for help? If it will, how could I improve it?
** This is for an admin page, so it won't be heavily abused **
You can use one UPDATE with some clever code, like so:
Create the array $data['order'] in your loop (keyed by sort position, with the photo id as the value), then:
$q = "UPDATE article_photos SET order_pos = (CASE photo_id ";
foreach ($data['order'] as $sort => $id) {
    $q .= " WHEN {$id} THEN {$sort}";
}
$q .= " END ) WHERE photo_id IN (" . implode(",", $data['order']) . ")";
a little clearer perhaps
UPDATE article_photos SET order_pos = (CASE photo_id
WHEN 1 THEN 999
WHEN 2 THEN 1000
WHEN 3 THEN 1001
END)
WHERE photo_id IN (1,2,3)
I use this approach for exactly what you're doing: updating sort orders.
No need for the second foreach: you know it's going to be two parts if your data passes validation (I'm assuming you validated this. If not: you should =) so just do:
if (count($exploded_part) == 2) {
$id = $exploded_part[0];
$seq = $exploded_part[1];
/* rest of code */
} else {
/* error - data does not conform despite validation */
}
As for update hammering: do your DB updates in a transaction. Your db will queue the ops, but not commit them to the main DB until you commit the transaction, at which point it'll happily do the update "for real" at lightning speed.
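A minimal sketch of that pattern, assuming a PDO connection in $db (the variable names are illustrative, not from the original post):

$db->beginTransaction();
try {
    // One prepared statement, executed once per photo, committed as a unit
    $stmt = $db->prepare(
        "UPDATE article_photos SET order_pos = :order_pos WHERE photo_id = :photo_id"
    );
    foreach (explode(',', $sorted_order) as $order_part) {
        list($photo_id, $order) = explode(':', $order_part);
        $stmt->execute(array(
            ':order_pos' => $order,
            ':photo_id'  => $photo_id,
        ));
    }
    $db->commit();
} catch (Exception $e) {
    $db->rollBack(); // undo everything if any single update fails
    throw $e;
}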
I suggest making your script even simpler and changing the names of the variables, so the code will be much more readable.
$parts = explode(',', $sorted_order);
foreach ($parts as $part) {
    list($id, $position) = explode(':', $part);
    // Now you can work with $id and $position
}
More info about list: http://php.net/manual/en/function.list.php
Also, about performance and your data structure:
The way you store your data is not perfect, but you will not suffer any performance issues from it; you send less data this way, so there is less overhead overall.
However, the drawback of your data structure is that you will most probably be unable to establish relationships between tables, make joins, or alter the table structure cleanly.
I've got a script which is supposed to run through a MySQL database and perform a certain 'test' on the cases. Simplified: the database contains records which represent trips that have been made by persons. Each record is a single trip. But I want to use only round trips, so I need to search the database and match two trips to each other: the trip to and the trip from a certain location.
The script is working fine. The problem is that the database contains more than 600,000 cases. I know this should be avoided if possible, but for the purpose of this script and the use of the database records later on, everything has to stick together.
Executing the script takes hours right now, when executing on my iMac using MAMP. Of course I made sure that it can use a lot of memory, et cetera.
My question is: how could I speed things up? What's the best approach to do this?
Here's the script I have right now:
$table = $_GET['table']; // NB: interpolating request input into SQL is unsafe
$output = '';

//Select all cases that have not been marked as invalid in a previous test
$query = "SELECT persid, ritid, vertpc, aankpc, jaar, maand, dag FROM MON.$table WHERE reasonInvalid != '1' OR reasonInvalid IS NULL";
$result = mysql_query($query) or die($output .= mysql_error());

$totalCountValid = 0;
$totalCountInvalid = 0;
$totalCount = 0;

//For each record:
while ($row = mysql_fetch_array($result)) {
    $totalCount += 1;

    //Do another query: get all the rows for this person's ID that share
    //postal codes. Postal codes are reversed between the two trips.
    $persid = $row['persid'];
    $ritid  = $row['ritid'];
    $pcD    = $row['vertpc'];
    $pcA    = $row['aankpc'];
    $jaar   = $row['jaar'];
    $maand  = $row['maand'];
    $dag    = $row['dag'];

    $thecountquery = "SELECT * FROM MON.$table WHERE persid=$persid AND vertpc=$pcA AND aankpc=$pcD AND jaar = $jaar AND maand = $maand AND dag = $dag";
    $thecount = mysql_num_rows(mysql_query($thecountquery));

    if ($thecount >= 1) {
        //No worries, this person ID has multiple trips attached
        $totalCountValid += 1;
    } else {
        //Ow my, the case is invalid!
        $totalCountInvalid += 1;
        //Call markInvalid from functions.php
        markInvalid($table, '2', 'ritid', $ritid);
    }
}

//Echo the result
$output .= 'Total cases: '.$totalCount.'<br>Valid: '.$totalCountValid.'<br>Invalid: '.$totalCountInvalid;
echo $output;
Your basic problem is that you are doing the following.
1) Getting all cases that haven't been marked as invalid.
2) Looping through the cases obtained in step 1).
What you can easily do is combine the queries written for 1) and 2) into a single query and loop over the data. This will speed things up quite a bit.
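A sketch of what that combined query could look like, assuming the column names from the question and a concrete table name in place of $table. It finds, in one pass, every trip that has no matching return trip:

SELECT t1.ritid
FROM MON.trips t1
LEFT JOIN MON.trips t2
       ON t2.persid = t1.persid
      AND t2.vertpc = t1.aankpc   -- return trip departs where t1 arrived
      AND t2.aankpc = t1.vertpc   -- and arrives where t1 departed
      AND t2.jaar   = t1.jaar
      AND t2.maand  = t1.maand
      AND t2.dag    = t1.dag
WHERE (t1.reasonInvalid != '1' OR t1.reasonInvalid IS NULL)
  AND t2.ritid IS NULL;           -- NULL means no return trip was found

Each ritid this returns can then be passed to markInvalid(), or the whole thing can be rewritten as a single UPDATE.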
Also bear in mind the following tips.
1) Selecting all columns is not at all a good thing to do. It takes an ample amount of time for the data to traverse the network. I would recommend replacing the wild-card with only the columns you really need:
SELECT <only_the_columns_you_need> -- instead of SELECT *
2) Use indexes - sparingly, efficiently and appropriately. Understand when to use them and when not to.
3) Use views if you can.
4) Enable MySQL slow query log to understand which queries you need to work on and optimize.
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 1
log-queries-not-using-indexes
5) Use correct MySQL field types and the storage engine (Very very important)
6) Use EXPLAIN to analyze your query. EXPLAIN is a useful command in MySQL which can provide you some great details about how a query is run, which index is used, how many rows it needs to check through, and whether it needs to do file sorts, temporary tables and other nasty things you want to avoid.
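For example, running it against the per-row lookup from the script (the table name here is a placeholder) shows immediately whether an index is being used:

EXPLAIN SELECT * FROM MON.trips
WHERE persid = 123 AND vertpc = 1011 AND aankpc = 2022
  AND jaar = 2012 AND maand = 6 AND dag = 15;

If the key column in the output is NULL and rows is close to the table size, the query is scanning the whole table and a composite index on these columns would help.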
Good luck.
I came across a question of search efficiency for large sets today and I've done my best to boil it down to the most basic case. I feel like this sort of thing probably relates to some classic problem or basic concept I'm missing, so a pointer to that would be great.
Suppose I have a table definition like
CREATE TABLE foo(
    id int,
    type bool,
    reference int,
    PRIMARY KEY(id),
    FOREIGN KEY(reference) REFERENCES foo(id),
    UNIQUE KEY(reference)
) Engine=InnoDB;
Populated with n rows, where n/2 are randomly assigned type=1. Each row references another row of the same type, except for the first, which has reference=null.
Now we want to print all items with type 1. I assume that at some point, it will be faster to recursively call something like
function printFoo1($ref){
    if ($ref == null)
        return;
    $q = 'SELECT id, reference FROM foo WHERE id=' . $ref;
    $arr = mysql_fetch_array(mysql_query($q));
    echo $arr[0];
    printFoo1($arr[1]);
}
As opposed to
function printFoo2(){
    // Fetch every type=1 row in a single query
    $q = 'SELECT id FROM foo WHERE type=1';
    $res = mysql_query($q);
    while ($id = mysql_fetch_array($res)) {
        echo $id[0];
    }
}
The main point here is that function 1 searches on id, which is indexed, whereas function 2 has to make n/2 comparisons that don't result in a hit, but the overhead of multiple queries is going to be significantly greater than the single SELECT.
Is my assumption correct? If so, how large of a data set would we need before function 1 outperforms function 2?
Your example is a bit difficult to parse, but I'll start at the top:
Your first function does not return all of the elements with type = 1. It returns all of the elements that are dependent (based on references) on the element you pass in. From the PHP standpoint, even though the link/handle is already open, there is a non-trivial overhead from your function call with each successive request, not to mention the string concatenation incurring a cost with each execution of that line.
Typically it is better to use the second function styling because it only queries the database one time and will return the elements you are requesting without further work. It will come down to a profiler of course, to determine which is going to return faster, but from my tests the second is hands down the better choice:
This was executed with n = 5000 elements in the db (n/2 = 2500 of type 1), passing in reference = the highest id with type = 1, obtained from a query of the db.
printFoo1: 3.591840 seconds
printFoo2: 0.010340 seconds
It wouldn't really make sense for it to work any other way. If what you propose were faster, that would make JOIN operations less efficient as well.
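One tweak worth noting, separate from the benchmark below: the schema in the question has no index on type, so the single query in printFoo2 has to scan the table. A hypothetical index would let it read only the matching rows:

ALTER TABLE foo ADD INDEX idx_type (type);

Even without it, the single round trip wins by a wide margin, as the timings above show.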
Code
$res = mysql_query('SELECT MAX( id ) as `MAX_ID` FROM `foo` WHERE `type` = 1', $link);
$res2 = mysql_fetch_assoc($res);
$id = $res2['MAX_ID'];
// cleanup result and free resources here
echo "printFoo1: ";
$start = microtime(true);
printFoo1($id);
echo microtime(true) - $start;
echo '<br />';
echo "printFoo2: ";
$start = microtime(true);
printFoo2();
echo microtime(true) - $start;
mysql_close($link);
All of this was tested on PHP 5.2.17 running on Linux
$url = mysql_real_escape_string($_POST['url']);
$shoutcast_url = mysql_real_escape_string($_POST['shoutcast_url']);
$site_name = mysql_real_escape_string($_POST['site_name']);
$site_subtitle = mysql_real_escape_string($_POST['site_subtitle']);
$email_suffix = mysql_real_escape_string($_POST['email_suffix']);
$logo_name = mysql_real_escape_string($_POST['logo_name']);
$twitter_username = mysql_real_escape_string($_POST['twitter_username']);
With all those options in a form, they are pre-filled (from the database), but users can choose to change them, which updates the original database. Would it be better for me to update all the columns despite the chance that some of them have not changed, or to first do an if ($original_db_entry == $possible_new_entry) comparison on each (which would be a query in itself)?
Thanks
I'd say it doesn't really matter either way: the size of the query you send to the server is hardly relevant here, and there is no per-column "last updated" information that an unnecessary update would corrupt, so...
By the way, what I like to do when working with such loads of data is create a temporary array.
$fields = array("url", "shoutcast_url", "site_name", "site_subtitle" , ....);
foreach ($fields as $field)
$$field = mysql_real_escape_string($_POST[$field]);
the only thing to be aware of here is that you have to be careful not to put variable names into $fields that would overwrite existing variables.
Update: Col. Shrapnel makes the correct and valid point that using variable variables is not good practice. While I think it is perfectly acceptable to use variable variables within the scope of a function, it is indeed better not to use them at all. The better way to sanitize all incoming fields and have them in a usable form would be:
$sanitized_data = array();
$fields = array("url", "shoutcast_url", "site_name", "site_subtitle" , ....);
foreach ($fields as $field)
    $sanitized_data[$field] = mysql_real_escape_string($_POST[$field]);
this will leave you with an array you can work with:
$sanitized_data["url"] = ....
$sanitized_data["shoutcast_url"] = ....
Just run a single query that updates all columns:
UPDATE table SET col1='a', col2='b', col3='c' WHERE id = '5'
I would recommend that you execute the UPDATE with all column values. It'd be less costly than trying to confirm that the value is different than what's currently in the database. And that confirmation would be irrelevant anyway, because the values in the database could change instantly after you check them if someone else updates them.
If you issue an UPDATE against MySQL and the values are identical to values already in the database, the UPDATE will be a no-op. That is, MySQL reports zero rows affected.
MySQL knows not to do unnecessary work during an UPDATE.
If only one column changes, MySQL does need to do work. It only changes the columns that are different, but it still creates a new row version (assuming you're using InnoDB).
And of course there's some small amount of work necessary to actually send the UPDATE statement to the MySQL server so it can compare against the existing row. But typically this takes only hundredths of a millisecond on a modern server.
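A quick way to see this behavior in the mysql client (the table and values here are hypothetical):

UPDATE settings SET site_name = 'My Store' WHERE id = 5;
-- If the row already contains these exact values, the client reports:
-- Query OK, 0 rows affected
-- Rows matched: 1  Changed: 0  Warnings: 0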
Yes, it's ok to update every field.
A simple function to produce SET statement:
function dbSet($fields) {
$set='';
foreach ($fields as $field) {
if (isset($_POST[$field])) {
$set.="`$field`='".mysql_real_escape_string($_POST[$field])."', ";
}
}
return substr($set, 0, -2);
}
and usage:
$fields = explode(" ","name surname lastname address zip fax phone");
$query = "UPDATE $table SET ".dbSet($fields)." WHERE id=$id";
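One caveat with that usage line: $id is interpolated into the query raw. If it can come from user input, casting it to an integer (a small adjustment, not part of the original answer) closes that hole:

$query = "UPDATE $table SET ".dbSet($fields)." WHERE id=".intval($id);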