optimizing insertion of data into mysql - php

function generateRandomData() {
    $db = new mysqli('localhost', 'XXX', 'XXX', 'scores');
    if (mysqli_connect_errno()) {
        echo 'Failed to connect to database. Please try again later.';
        exit;
    }
    $query = "insert into scoretable values(?,?,?)";
    for ($a = 0; $a < 1000000; $a++) {
        $stmt = $db->prepare($query);
        $id = rand(1, 75000);
        $score = rand(1, 100000);
        $time = rand(1367038800, 1369630800);
        $stmt->bind_param("iii", $id, $score, $time);
        $stmt->execute();
    }
}
I am trying to populate a data table in mysql with a million rows of data. However, this process is extremely slow. Is there anything obvious I'm doing wrong that I could fix in order to make it run faster?

As hinted in the comments, you need to reduce the number of queries by concatenating as many inserts as possible together. In PHP, that is easy to achieve:
$query = "insert into scoretable values";
for($a = 0; $a < 1000000; $a++) {
$id = rand(1,75000);
$score = rand(1,100000);
$time = rand(1367038800 ,1369630800);
$query .= "($id, $score, $time),";
}
$query[strlen($query)-1]= ' ';
There is a limit on the maximum size of queries you can execute, which is directly related to the max_allowed_packet server setting (this page of the MySQL documentation describes how to tune that setting to your advantage).
Therefore, you will have to reduce the loop count above to keep the query at an appropriate size, and wrap that code in another loop that repeats the process until you reach the total number of rows you want to insert.
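To make the wrapping concrete, here is a rough sketch (the chunk size of 10,000 is an arbitrary assumption; size it against your max_allowed_packet, and $db is the mysqli connection from the question):
$total = 1000000;
$chunk = 10000; // assumed chunk size; tune it so the query stays under max_allowed_packet
for ($done = 0; $done < $total; $done += $chunk) {
    $query = "INSERT INTO scoretable VALUES ";
    $rows = array();
    for ($i = 0; $i < $chunk; $i++) {
        $id = rand(1, 75000);
        $score = rand(1, 100000);
        $time = rand(1367038800, 1369630800);
        $rows[] = "($id, $score, $time)";
    }
    $query .= implode(',', $rows);
    $db->query($query) or die($db->error);
}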
Another practice is to disable key and constraint checks on the table you wish to bulk insert into:
ALTER TABLE yourtablename DISABLE KEYS;
SET FOREIGN_KEY_CHECKS=0;
-- bulk insert comes here
SET FOREIGN_KEY_CHECKS=1;
ALTER TABLE yourtablename ENABLE KEYS;
This practice, however, must be done carefully, especially in your case, since you generate the values randomly. If you have any unique key within the columns you generate, you cannot use that technique with your query as it is, as it may generate a duplicate key insert. You probably want to add an IGNORE clause to it:
$query = "insert INGORE into scoretable values";
This will cause the server to silently ignore duplicate entries on unique keys. To reach the total number of required inserts, just loop as many times as needed to insert the remaining missing rows.
I suppose that the only place where you could have a unique key constraint is on the id column. In that case, you will never be able to reach the number of lines you wish to have, since it is way above the range of random values you generate for that field. Consider raising that limit, or better yet, generate your ids differently (perhaps simply by using a counter, which will make sure every record is using a different key).
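For example, the concatenation loop above could simply use the counter as the id (a sketch; it assumes nothing else relies on the ids being random):
$query = "insert into scoretable values";
for ($id = 1; $id <= 1000000; $id++) {
    $score = rand(1, 100000);
    $time = rand(1367038800, 1369630800);
    // the loop counter guarantees a distinct id, so INSERT IGNORE is no longer needed
    $query .= "($id, $score, $time),";
}
$query[strlen($query) - 1] = ' ';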

You are doing several things wrong. The first thing you have to take into account is which MySQL engine you're using.
The default one is InnoDB; previously the default engine was MyISAM.
I'll write this answer under the assumption that you're using InnoDB, which you should be using for a plethora of reasons.
InnoDB operates in something called autocommit mode. That means that every query you make is wrapped in a transaction.
To translate that into a language us mere mortals can understand - every query you run without an explicit BEGIN WORK; block is its own transaction, ergo MySQL will wait until the hard drive confirms the data has been written.
Knowing that hard drives are slow (mechanical ones are still the most widely used), your inserts will only be as fast as the hard drive allows. Usually, mechanical hard drives can perform about 300 input/output operations per second, so assuming you can do 300 inserts a second - yes, you'll wait quite a bit to insert 1 million records.
So, knowing how things work - you can use them to your advantage.
The amount of data that the HDD will write per transaction is generally very small (4KB or even less), and knowing today's HDDs can write over 100MB/sec - that indicates we should wrap several queries into a single transaction.
That way MySQL will send quite a bit of data and wait for the HDD to confirm it wrote everything and that the whole world is fine and dandy.
So, assuming you have 1M rows you want to populate - you'll execute 1M queries. If you commit 1000 queries per transaction, you will perform only about 1000 write (commit) operations.
That way, your code becomes something like this:
(I am not familiar with mysqli interface so function names might be wrong, and seeing I'm typing without actually running the code - the example might not work so use it at your own risk)
function generateRandomData()
{
    $db = new mysqli('localhost', 'XXX', 'XXX', 'scores');
    if (mysqli_connect_errno()) {
        echo 'Failed to connect to database. Please try again later.';
        exit;
    }
    $query = "insert into scoretable values(?,?,?)";
    // We prepare ONCE, that's the point of prepared statements
    $stmt = $db->prepare($query);
    $start = 0;
    $top = 1000000;
    for ($a = $start; $a < $top; $a++) {
        // If this is the very first iteration, start the transaction
        if ($a == 0) {
            $db->begin_transaction();
        }
        $id = rand(1, 75000);
        $score = rand(1, 100000);
        $time = rand(1367038800, 1369630800);
        $stmt->bind_param("iii", $id, $score, $time);
        $stmt->execute();
        // Commit on every thousandth query
        if (($a % 1000) == 0 && $a != ($top - 1)) {
            $db->commit();
            $db->begin_transaction();
        }
        // If this is the very last query, then we just need to commit and end
        if ($a == ($top - 1)) {
            $db->commit();
        }
    }
}

DB querying involves many interrelated tasks, so it is an 'expensive' process. It is even more 'expensive' when it comes to insertion/update.
Running the query only once is the best way to enhance performance.
You can build the statement in the loop and run it once, e.g.:
$query = "insert into scoretable values ";
for($a = 0; $a < 1000000; $a++)
{
$values = " ('".$?."','".$?."','".$?."'), ";
$query.=$values;
...
}
...
//remove the last comma
...
$stmt = $db->prepare($query);
...
$stmt->execute();

Have a look at this gist I've created. It takes about 5 minutes to insert a million rows on my laptop.

Related

SQL select by using limit or using program to fetch special records

I am using PHP to get specific records from the database.
Which one is better?
1.
Select * From [table] Limit 50000, 10;
while ($row = $stmt->fetch()) {
    // save in array, total 10 times
}
or
2.
Select * From [table];
$start = 50000;
$length = 10;
$i = 0;
while ($row = $stmt->fetch()) {
    if ($i >= $start && $i < $start + $length) {
        // save in array, 10 saves out of 50010+ fetches
    }
    $i++;
}
In this case, which one should I use?
Which one uses fewer database resources?
Which one is better?
Too vague: what is "better"?
Which one uses fewer database resources?
You're much better off with the first approach. It's efficient to select as little data as you need and no more. Selecting the whole table will force your script to use a lot more memory because all that data needs to be kept live.
The best answer you'll get is: test! You can run your queries multiple times in multiple ways and see for yourself. Just use SELECT SQL_NO_CACHE ... instead of the generic SELECT ... to force the DB to redo the work from scratch. Measure how long it takes to run the query and process the results.
function wayOne() {
    // execute your 1st query and loop through results
}
function wayTwo() {
    // execute 2nd query and loop through results
}
// Measures # of milliseconds it takes to execute another function
function timeThis(callable $callback) {
    $start_time = microtime(true);
    call_user_func($callback);
    $seconds = microtime(true) - $start_time; // duration in seconds
    return round($seconds * 1000); // duration in milliseconds
}
$wayOneTime = timeThis('wayOne');
$wayTwoTime = timeThis('wayTwo');
You can then compare the two times. Generally (not always) a process that takes significantly less time uses fewer resources.

How to update thousands of rows in mysql database

I'm trying to update 100,000 rows in my database; the following code should do that, but I always get an error:
Error: Commands out of sync; you can't run this command now
Because it is an update I don't need the results and just want to get rid of them. The $count variable is used so that my database gets chunks of updates instead of one big update (one big update is not working because of some limitations of the database).
I tried a lot of different things like mysqli_free_result and so on... nothing worked.
global $mysqliObject;
$count = 0;
$statement = "";
foreach ($songsArray as $song) {
    $id = $song->getId();
    $treepath = $song->getTreepath();
    $statement = $statement . "UPDATE songs SET treepath='" . $treepath . "' WHERE id=" . $id . "; ";
    $count++;
    if ($count > 10000) {
        $result = mysqli_multi_query($mysqliObject, $statement);
        if (!$result) {
            die('<br/><br/>Error1: ' . mysqli_error($mysqliObject));
        }
        $count = 0;
        $statement = "";
    }
}
Using a prepared query will reduce the CPU load in the mysqld process, as DaveRandom and StevenVI suggest. However, in this case I doubt that using prepared queries will materially impact your runtime. The challenge you have is that you are attempting to update 100K rows in the songs table, and this is going to involve a lot of physical I/O on your disk subsystem. It is these physical delays (say ~10 ms per PIO) that will dominate runtimes. Factors such as what is contained in each row and how many indexes you have on the table (especially any that involve treepath) will all blend into this mix.
The actual CPU costs of preparing a simple statement like
UPDATE songs SET treepath="some treepath" WHERE id=12345;
will be lost in this overall physical I/O delay, and the relative size of this will materially depend on the nature of the physical subsystem where you are storing your data: a single SATA disk; SSD; some NAS with large caches and SSD support ...
You need to rethink your overall strategy here, especially if you are also using the songs table at the same time as a resource for interactive requests through a web front-end. Updating 100K rows is going to take some time: less if you are updating 100K rows out of 100K in storage order, since this will be better aligned to the MYD organisation and the write-through caching will work better; more if you are updating 100K rows in random order out of 1M rows, where the number of PIOs will be a lot higher.
When you are doing this, the overall performance of your D/B is going to degrade badly.
Do you want to minimise the impact on parallel use of your DB, or are you just trying to do this as a dedicated batch operation with other services offline?
Is your goal to minimise the total elapsed time, to keep it reasonably short subject to some overall impact constraint, or just to complete without dying?
I suggest that you've got two sensible approaches: (i) do this as a proper batch activity with the D/B offline to other services. In this case you probably want to take out a lock on the table, and bracket the updates with ALTER TABLE ... DISABLE/ENABLE KEYS. (ii) do this as a trickle update with far smaller update sets and a delay between each set to allow the D/B to flush to disk.
Whatever you choose, I'd drop the batch size. The multi_query essentially optimises away the RPC overheads involved in calling the out-of-process mysqld; a batch of, say, 10 already cuts this by 90%. You get diminishing returns after that, especially since the updates will be physical-I/O intensive.
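For illustration, a sketch of that smaller-batch idea applied to the original multi_query code (the batch size of 10, the 50 ms pause and the runBatch helper are all assumptions; note that every result set is drained, which is also what the "Commands out of sync" error is complaining about):
// Small multi_query batches, results drained each time, with a short breather between batches
$batch = array();
foreach ($songsArray as $song) {
    $batch[] = "UPDATE songs SET treepath='"
             . $mysqliObject->real_escape_string($song->getTreepath())
             . "' WHERE id=" . (int)$song->getId();
    if (count($batch) == 10) {
        runBatch($mysqliObject, implode(';', $batch));
        $batch = array();
        usleep(50000); // 50 ms pause between batches (arbitrary)
    }
}
if ($batch) {
    runBatch($mysqliObject, implode(';', $batch));
}

function runBatch(mysqli $db, $sql) {
    if (!$db->multi_query($sql)) {
        die('Error: ' . $db->error);
    }
    // Drain every result set, otherwise the next multi_query call
    // fails with "Commands out of sync"
    do {
        if ($r = $db->store_result()) {
            $r->free();
        }
    } while ($db->more_results() && $db->next_result());
}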
Try this code using prepared statements:
// Create a prepared statement
$query = "
    UPDATE `songs`
    SET `treepath` = ?
    WHERE `id` = ?
";
$stmt = $GLOBALS['mysqliObject']->prepare($query); // Global variables = bad

// Loop over the array
foreach ($songsArray as $key => $song) {
    // Get data about this song
    $id = $song->getId();
    $treepath = $song->getTreepath();
    // Bind data to the statement
    $stmt->bind_param('si', $treepath, $id);
    // Execute the statement
    $stmt->execute();
    // Check for errors
    if ($stmt->errno) {
        echo '<br/><br/>Error: Key ' . $key . ': ' . $stmt->error;
        break;
    } else if ($stmt->affected_rows < 1) {
        echo '<br/><br/>Warning: No rows affected by object at key ' . $key;
    }
    // Reset the statement
    $stmt->reset();
}

// We're done, close the statement
$stmt->close();
I'd do something like this:
$link = mysqli_connect('host');
if ($stmt = mysqli_prepare($link, "UPDATE songs SET treepath=? WHERE id=?")) {
    foreach ($songsArray as $song) {
        $id = $song->getId();
        $treepath = $song->getTreepath();
        // Bind both parameters in a single call: string treepath, integer id
        mysqli_stmt_bind_param($stmt, 'si', $treepath, $id);
        mysqli_stmt_execute($stmt);
    }
    mysqli_stmt_close($stmt);
}
mysqli_close($link);
Or, of course, your normal mysql_query calls, but enclosed in a transaction.
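A minimal sketch of that transaction variant with the old mysql_* functions (purely illustrative; with mysqli the calls differ but the shape is the same):
mysql_query("START TRANSACTION");
foreach ($songsArray as $song) {
    $id = (int)$song->getId();
    $treepath = mysql_real_escape_string($song->getTreepath());
    mysql_query("UPDATE songs SET treepath='$treepath' WHERE id=$id");
}
mysql_query("COMMIT"); // one commit for the whole batch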
I found another way...
Since this is not a production server, the fastest way to update 100k rows is to delete all of them and insert 100k from scratch with the newly calculated values. It seems a little bit odd to delete everything and insert everything instead of updating, but it is WAYYY faster.
Before: hours. Now: seconds!
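Roughly, the idea looks like this (a sketch that assumes, for illustration, that id and treepath are the only columns to write back; in practice you would include every column you recalculate):
// Rebuild the table instead of updating it row by row
$mysqliObject->query("TRUNCATE TABLE songs");
$rows = array();
foreach ($songsArray as $song) {
    $rows[] = "(" . (int)$song->getId() . ",'"
            . $mysqliObject->real_escape_string($song->getTreepath()) . "')";
}
// one multi-row insert; chunk it if it grows past max_allowed_packet
$mysqliObject->query("INSERT INTO songs (id, treepath) VALUES " . implode(',', $rows));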
I would suggest locking the table and disabling the keys before executing multiple updates.
This keeps the database engine from stalling (at least it did in my case of a 300,000-row update).
LOCK TABLES `TBL_RAW_DATA` WRITE;
/*!40000 ALTER TABLE `TBL_RAW_DATA` DISABLE KEYS */;
UPDATE TBL_RAW_DATA SET CREATION_DATE = ADDTIME(CREATION_DATE,'01:00:00') WHERE ID_DATA >= 1359711;
/*!40000 ALTER TABLE `TBL_RAW_DATA` ENABLE KEYS */;
UNLOCK TABLES;

PDO/MySQL memory consumption with large result set

I'm having a strange time dealing with selecting from a table with about 30,000 rows.
It seems my script is using an outrageous amount of memory for what is a simple, forward only walk over a query result.
Please note that this example is somewhat contrived; it is an absolute bare-minimum example which bears very little resemblance to the real code, and it cannot be replaced with a simple database aggregation. It is intended to illustrate the point that each row does not need to be retained on each iteration.
<?php
$pdo = new PDO('mysql:host=127.0.0.1', 'foo', 'bar', array(
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
));
$stmt = $pdo->prepare('SELECT * FROM round');
$stmt->execute();

function do_stuff($row) {}

$c = 0;
while ($row = $stmt->fetch()) {
    // do something with the object that doesn't involve keeping
    // it around and can't be done in SQL
    do_stuff($row);
    $row = null;
    ++$c;
}

var_dump($c);
var_dump(memory_get_usage());
var_dump(memory_get_peak_usage());
This outputs:
int(39508)
int(43005064)
int(43018120)
I don't understand why 40 meg of memory is used when hardly any data needs to be held at any one time. I have already worked out I can reduce the memory by a factor of about 6 by replacing "SELECT *" with "SELECT home, away", however I consider even this usage to be insanely high and the table is only going to get bigger.
Is there a setting I'm missing, or is there some limitation in PDO that I should be aware of? I'm happy to get rid of PDO in favour of mysqli if it can not support this, so if that's my only option, how would I perform this using mysqli instead?
After creating the connection, you need to set PDO::MYSQL_ATTR_USE_BUFFERED_QUERY to false:
<?php
$pdo = new PDO('mysql:host=127.0.0.1', 'foo', 'bar', array(
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
));
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
// snip
var_dump(memory_get_usage());
var_dump(memory_get_peak_usage());
This outputs:
int(39508)
int(653920)
int(668136)
Regardless of the result size, the memory usage remains pretty much static.
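Since the question also asks how this could be done with mysqli: the equivalent there is an unbuffered result set (a sketch; the database name is an assumption):
$mysqli = new mysqli('127.0.0.1', 'foo', 'bar', 'dbname'); // database name assumed
$result = $mysqli->query('SELECT * FROM round', MYSQLI_USE_RESULT); // unbuffered result
while ($row = $result->fetch_assoc()) {
    do_stuff($row);
}
$result->free();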
Another option would be to do something like:
$i = $c = 0;
$query = 'SELECT home, away FROM round LIMIT 2048 OFFSET %u;';
while (count($rows = codeThatFetches(sprintf($query, $i++ * 2048))) > 0) {
    $c += count($rows);
    foreach ($rows as $row) {
        do_stuff($row);
    }
}
The whole result set (all 30,000 rows) is buffered into memory before you can start looking at it.
You should be letting the database do the aggregation and only asking it for the two numbers you need.
SELECT SUM(home) AS home, SUM(away) AS away, COUNT(*) AS c FROM round
The reality of the situation is that if you fetch all rows and expect to be able to iterate over all of them in PHP, at once, they will exist in memory.
If you really don't think using SQL-powered expressions and aggregation is the solution, you could consider limiting/chunking your data processing (a sketch follows the steps below). Instead of fetching all rows at once do something like:
1) Fetch 5,000 rows
2) Aggregate/Calculate intermediary results
3) unset variables to free memory
4) Back to step 1 (fetch next set of rows)
Just an idea...
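A sketch of that chunked loop with PDO (the 5,000 chunk size and the id ordering column are assumptions borrowed for illustration):
$chunk = 5000;   // step 1: fetch 5,000 rows at a time (size is an assumption)
$offset = 0;
do {
    // an ORDER BY on a unique column keeps the chunks from overlapping
    $stmt = $pdo->query(sprintf(
        'SELECT home, away FROM round ORDER BY id LIMIT %d OFFSET %d', $chunk, $offset));
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    foreach ($rows as $row) {
        do_stuff($row);          // step 2: aggregate / calculate intermediary results
    }
    $fetched = count($rows);
    unset($rows, $stmt);         // step 3: free memory before the next chunk
    $offset += $chunk;           // step 4: back to step 1
} while ($fetched == $chunk);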
I haven't done this before in PHP, but you may consider fetching the rows using a scrollable cursor - see the fetch documentation for an example.
Instead of returning all the results of your query at once back to your PHP script, it holds the results on the server side and you use a cursor to iterate through them getting one at a time.
Whilst I have not tested this, it is bound to have other drawbacks such as utilising more server resources and most likely reduced performance due to additional communication with the server.
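For reference, the PDO API for that looks roughly like this (a sketch based on the fetch documentation; I have not verified that the MySQL driver actually honours a scrollable cursor, so treat it as illustrative only):
$stmt = $pdo->prepare('SELECT home, away FROM round',
    array(PDO::ATTR_CURSOR => PDO::CURSOR_SCROLL));
$stmt->execute();
while ($row = $stmt->fetch(PDO::FETCH_ASSOC, PDO::FETCH_ORI_NEXT)) {
    do_stuff($row); // rows are pulled one at a time via the cursor
}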
Altering the fetch style may also have an impact, as by default the documentation indicates it will store both an associative array as well as a numerically indexed array, which is bound to increase memory usage.
As others have suggested, reducing the number of results in the first place is most likely a better option if possible.

Query optimization : Which SELECT syntax is faster?

Given 5,000 IDs of records to fetch from the database, which query, in your opinion, is faster?
Loop through the 5000 IDs using PHP and perform a SELECT query for each one:
foreach ($ids as $id) {
    // do the query
    $r = mysql_query("SELECT * FROM TABLE WHERE ID = {$id}");
}
Or collect all ids in an array, and use SELECT * FROM TABLE WHERE ID IN (1 up to 5000):
// assuming $ids = array(1, 2, ... up to 5000);
$r = mysql_query("SELECT * FROM TABLE WHERE ID IN (" . join(",", $ids) . ")");
Without a shadow of a doubt, loading them all in one go will be faster. Running 5,000 queries is going to be a lot slower as each query will carry a certain amount of overhead.
Also, to speed it up even more, DON'T use the * operator! Select only the fields you are going to use; if you only need the ID column, specify it! If you want all the columns, list them explicitly, because later fields may be added to the table and SELECT * would then retrieve those new fields for nothing.
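Putting both points together, a rough sketch (the name and score columns are made-up examples of "the fields you are going to use"):
// One IN() query that fetches only the columns actually used
$ids = array_map('intval', $ids); // force integers so the list is safe to interpolate
$sql = "SELECT ID, name, score FROM TABLE WHERE ID IN (" . join(",", $ids) . ")";
$r = mysql_query($sql);
while ($row = mysql_fetch_assoc($r)) {
    // use $row['ID'], $row['name'], $row['score']
}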
option 2 is definitely going to be faster. 5000 separate db queries are going to have huge network connection overhead.
The fastest way is not to request 5000 rows at all.
You barely need 100 to display them on one page; 5000 is way overkill.
Sure, measure it, but I'd certainly recommend letting the database do the job.
It all depends; I hope you're not creating a connection for each call, though.
The loop is faster if you use a prepared statement with bind variables. Declare the statement outside the loop; then, inside the loop, bind the variable for each id.
Do not underestimate the time that goes into SQL parsing, especially on these long-winded things.
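In mysqli terms that looks something like this (a sketch; the name column and the $db connection are assumptions):
// Prepare once outside the loop; bind_param binds $id by reference,
// so reassigning it inside the loop is enough
$id = 0;
$stmt = $db->prepare("SELECT ID, name FROM TABLE WHERE ID = ?");
$stmt->bind_param('i', $id);
foreach ($ids as $id) {
    $stmt->execute();
    $res = $stmt->get_result(); // requires mysqlnd; otherwise use bind_result()/fetch()
    while ($row = $res->fetch_assoc()) {
        // process $row
    }
}
$stmt->close();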
Option 2 is faster. With option 1 you do a full round trip to the server for each iteration.
I'd point out that in this case you might consider using paging to display the data.
Hint: Measure, Measure, Measure. With a code worth 10 minutes of your time you will have the answer right away.
Which is faster for many queries?
Try measuring it, for example like this:
<?php
$start = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    foreach ($ids as $id) {
        // do the query
        $r = mysql_query("SELECT * FROM TABLE WHERE ID = {$id}");
    }
}
$end = microtime(true);
echo 'Time (1): ' . ($end - $start) . ' sec';

$start = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    // assuming $ids = array(1, 2, ... up to 5000);
    $r = mysql_query("SELECT * FROM TABLE WHERE ID IN (" . join(",", $ids) . ")");
}
$end = microtime(true);
echo 'Time (2): ' . ($end - $start) . ' sec';
?>

Batch insertion of data to MySQL database using php

I have thousands of records parsed from a huge XML file to be inserted into a database table using PHP and MySQL. My problem is that it takes too long to insert all the data into the table. Is there a way to split my data into smaller groups so that the insertion is done group by group? How can I set up a script that will process the data by 100, for example? Here's my code:
foreach ($itemList as $key => $item) {
    $download_records = new DownloadRecords();
    // check first if the content exists
    if (!$download_records->selectRecordsFromCondition("WHERE Guid=" . $guid . "")) {
        /* do an insert here */
    } else {
        /* do an update */
    }
}
*note: $itemList is around 62,000 and still growing.
Using a for loop?
But the quickest option to load data into MySQL is to use the LOAD DATA INFILE command; you can create the file to load via PHP and then feed it to MySQL via a different process (or as a final step in the original process).
If you cannot use a file, use the following syntax:
INSERT INTO table (col1, col2) VALUES (val1, val2), (val3, val4), (val5, val6)
so you reduce the total number of statements to run.
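For the LOAD DATA INFILE route the rough shape is: write a delimited file from PHP, then hand it to MySQL in one statement (a sketch; the file path, table and column names are placeholders, and $db is assumed to be a mysqli connection):
// Write the parsed items to a CSV file, then bulk-load it in one statement
$fh = fopen('/tmp/items.csv', 'w'); // path is an assumption
foreach ($itemList as $item) {
    fputcsv($fh, array($item->col1, $item->col2)); // col1/col2 stand in for your real fields
}
fclose($fh);

// LOCAL requires local_infile to be enabled on both client and server
$db->query("LOAD DATA LOCAL INFILE '/tmp/items.csv'
            INTO TABLE your_table
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'
            (col1, col2)");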
EDIT: Given your snippet, it seems you can benefit from the INSERT ... ON DUPLICATE KEY UPDATE syntax of MySQL, letting the database do the work and reducing the number of queries. This assumes your table has a primary key or unique index.
To hit the DB every 100 rows you can do something like this (PLEASE REVIEW IT AND FIX IT FOR YOUR ENVIRONMENT):
$insertOrUpdateStatement1 = "INSERT INTO table (col1, col2) VALUES ";
$insertOrUpdateStatement2 = "ON DUPLICATE KEY UPDATE ";
$counter = 0;
$queries = array();
foreach($itemList as $key => $item){
$val1 = escape($item->col1); //escape is a function that will make
//the input safe from SQL injection.
//Depends on how are you accessing the DB
$val2 = escape($item->col2);
$queries[] = $insertOrUpdateStatement1.
"('$val1','$val2')".$insertOrUpdateStatement2.
"col1 = '$val1', col2 = '$val2'";
$counter++;
if ($counter % 100 == 0) {
executeQueries($queries);
$queries = array();
$counter = 0;
}
}
And executeQueries would grab the array and send a single multiple query:
function executeQueries($queries) {
    $data = "";
    foreach ($queries as $query) {
        $data .= $query . ";\n";
    }
    executeQuery($data);
}
Yes, just do what you'd expect to do.
You should not try to do bulk insertion from a web application if you think you might hit a timeout, etc. Instead, drop the file somewhere and have a daemon or cron job pick it up and run a batch job (if running from cron, be sure that only one instance runs at once).
As said before, you should put the file in a temp directory and have a cron job process it, in order to avoid timeouts (or the user losing their network connection).
Use only the web for uploads.
If you really want to import to the DB on a web request, you can either do a bulk insert or at least use a transaction, which should be faster.
Then limit the inserts to batches of 100 (committing your transaction whenever a counter hits count % 100 == 0) and repeat until all your rows are inserted.
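A sketch of that batched-transaction idea with PDO (table, column and property names are placeholders, $pdo is an assumed PDO connection, and it assumes Guid has a unique index so ON DUPLICATE KEY UPDATE can replace the select-then-insert-or-update check):
// One prepared INSERT ... ON DUPLICATE KEY UPDATE, committed every 100 rows
$stmt = $pdo->prepare(
    "INSERT INTO downloads (guid, title) VALUES (?, ?)
     ON DUPLICATE KEY UPDATE title = VALUES(title)"
);
$pdo->beginTransaction();
$count = 0;
foreach ($itemList as $item) {
    $stmt->execute(array($item->guid, $item->title)); // property names are placeholders
    if (++$count % 100 == 0) {
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit(); // flush the last partial batch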
