Batch insertion of data into a MySQL database using PHP

I have thousands of records parsed from a huge XML file that need to be inserted into a database table using PHP and MySQL. My problem is that it takes too long to insert all the data. Is there a way to split the data into smaller groups so that the insertion happens group by group? How can I set up a script that will process the data 100 rows at a time, for example? Here's my code:
foreach ($itemList as $key => $item) {
    $download_records = new DownloadRecords();
    //check first if the content exists
    if (!$download_records->selectRecordsFromCondition("WHERE Guid=".$guid."")) {
        /* do an insert here */
    } else {
        /* do an update */
    }
}
*note: $itemList is around 62,000 items and still growing.

Using a for loop?
But the quickest option to load data into MySQL is the LOAD DATA INFILE command. You can create the file to load via PHP and then feed it to MySQL via a different process (or as a final step in the original process).
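For example, a rough sketch of the file approach (assuming a mysqli connection in $db, that col1/col2 stand in for your real columns, and that LOCAL INFILE is enabled on both client and server — with mysqli that usually means MYSQLI_OPT_LOCAL_INFILE or mysqli.allow_local_infile):
$fh = fopen('/tmp/items.csv', 'w'); // temporary file path is just an example
foreach ($itemList as $item) {
    // fputcsv takes care of quoting and escaping the values
    fputcsv($fh, array($item->col1, $item->col2));
}
fclose($fh);

$db->query("LOAD DATA LOCAL INFILE '/tmp/items.csv'
            INTO TABLE download_records
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'
            (col1, col2)");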
If you cannot use a file, use the following syntax:
INSERT INTO table (col1, col2) VALUES (val1, val2), (val3, val4), (val5, val6)
so you reduce the total number of statements to run.
EDIT: Given your snippet, it seems you can benefit from MySQL's INSERT ... ON DUPLICATE KEY UPDATE syntax, letting the database do the work and reducing the number of queries. This assumes your table has a primary key or unique index.
To hit the DB every 100 rows you can do something like the following (PLEASE REVIEW IT AND ADAPT IT TO YOUR ENVIRONMENT):
$insertOrUpdateStatement1 = "INSERT INTO table (col1, col2) VALUES ";
$insertOrUpdateStatement2 = "ON DUPLICATE KEY UPDATE ";
$counter = 0;
$queries = array();

foreach ($itemList as $key => $item) {
    $val1 = escape($item->col1); //escape is a function that will make
                                 //the input safe from SQL injection.
                                 //Depends on how you are accessing the DB
    $val2 = escape($item->col2);

    $queries[] = $insertOrUpdateStatement1.
        "('$val1','$val2') ".$insertOrUpdateStatement2.
        "col1 = '$val1', col2 = '$val2'";
    $counter++;

    if ($counter % 100 == 0) {
        executeQueries($queries);
        $queries = array();
        $counter = 0;
    }
}

// don't forget whatever is left in the last, incomplete batch
if (!empty($queries)) {
    executeQueries($queries);
}
And executeQueries would grab the array and send a single multiple query:
function executeQueries($queries) {
    // executeQuery() is assumed to accept several semicolon-separated
    // statements (e.g. by wrapping mysqli_multi_query internally)
    $data = "";
    foreach ($queries as $query) {
        $data .= $query.";\n";
    }
    executeQuery($data);
}

Yes, just do what you'd expect to do.
You should not try to do bulk insertion from a web application if you think you might hit a timeout, etc. Instead, drop the file somewhere and have a daemon or cron job pick it up and run a batch job (if running from cron, be sure that only one instance runs at once).
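A minimal sketch of that "only one instance at once" guard, using an flock()-based lock file (the path is just an illustrative choice):
$lock = fopen('/tmp/import_job.lock', 'c'); // hypothetical lock file path
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    // another instance of the cron job is still running, bail out
    exit(0);
}

// ... run the batch import here ...

flock($lock, LOCK_UN);
fclose($lock);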

As said before, you should put the file in a temp directory and process it with a cron job, in order to avoid timeouts (or the user losing the network connection).
Use the web only for uploads.
If you really want to import to the DB on a web request, you can either do a bulk insert or at least use a transaction, which should be faster.
Then limit the inserts to batches of 100 (committing your transaction when a counter hits count % 100 == 0) and repeat until all your rows are inserted.
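A rough sketch of that batched-transaction idea, assuming a mysqli connection in $db (begin_transaction() needs PHP 5.5+) and the same placeholder escape() helper as in the answer above:
$db->begin_transaction();
$count = 0;

foreach ($itemList as $item) {
    $val1 = escape($item->col1);
    $val2 = escape($item->col2);
    $db->query("INSERT INTO table (col1, col2) VALUES ('$val1','$val2')
                ON DUPLICATE KEY UPDATE col1 = '$val1', col2 = '$val2'");

    if (++$count % 100 == 0) {
        // flush this batch to disk and start a new transaction
        $db->commit();
        $db->begin_transaction();
    }
}

$db->commit(); // commit whatever is left of the last batch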


Optimizing insertion of data into MySQL

function generateRandomData(){
    $db = new mysqli('localhost','XXX','XXX','scores');
    if (mysqli_connect_errno()) {
        echo 'Failed to connect to database. Please try again later.';
        exit;
    }
    $query = "insert into scoretable values(?,?,?)";
    for ($a = 0; $a < 1000000; $a++)
    {
        $stmt = $db->prepare($query);
        $id = rand(1,75000);
        $score = rand(1,100000);
        $time = rand(1367038800,1369630800);
        $stmt->bind_param("iii",$id,$score,$time);
        $stmt->execute();
    }
}
I am trying to populate a data table in mysql with a million rows of data. However, this process is extremely slow. Is there anything obvious I'm doing wrong that I could fix in order to make it run faster?
As hinted in the comments, you need to reduce the number of queries by concatenating as many inserts as possible together. In PHP, it is easy to achieve that:
$query = "insert into scoretable values";
for($a = 0; $a < 1000000; $a++) {
$id = rand(1,75000);
$score = rand(1,100000);
$time = rand(1367038800 ,1369630800);
$query .= "($id, $score, $time),";
}
$query[strlen($query)-1]= ' ';
There is a limit on the maximum size of queries you can execute, which is directly related to the max_allowed_packet server setting (this page of the MySQL documentation describes how to tune that setting to your advantage).
Therefore, you will have to reduce the loop count above to reach an appropriate query size, and repeat the process to reach the total number you want to insert, by wrapping that code with another loop.
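For illustration, the outer loop could look something like this (the 10,000 rows-per-statement figure is an arbitrary assumption to tune against max_allowed_packet, and $db is the mysqli handle from the question):
$total     = 1000000;
$chunkSize = 10000; // rows per INSERT; tune so the statement stays under max_allowed_packet

for ($done = 0; $done < $total; $done += $chunkSize) {
    $query = "insert into scoretable values";
    for ($a = 0; $a < $chunkSize; $a++) {
        $id    = rand(1,75000);
        $score = rand(1,100000);
        $time  = rand(1367038800,1369630800);
        $query .= "($id, $score, $time),";
    }
    $query = rtrim($query, ','); // drop the trailing comma
    $db->query($query);
}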
Another practice is to disable check constraints on the table you wish to do bulk insert:
ALTER TABLE yourtablename DISABLE KEYS;
SET FOREIGN_KEY_CHECKS=0;
-- bulk insert comes here
SET FOREIGN_KEY_CHECKS=1;
ALTER TABLE yourtablename ENABLE KEYS;
This practice however must be done carefully, especially in your case since you generate the values randomly. If you have any unique key within the columns you generate, you cannot use that technique with your query as it is, as it may generate a duplicate key insert. You probably want to add an IGNORE clause to it:
$query = "insert IGNORE into scoretable values";
This will cause the server to silently ignore duplicate entries on unique keys. To reach the total number of required inserts, just loop as many times as needed to fill up the remaining missing lines.
I suppose that the only place where you could have a unique key constraint is on the id column. In that case, you will never be able to reach the number of lines you wish to have, since it is way above the range of random values you generate for that field. Consider raising that limit, or better yet, generate your ids differently (perhaps simply by using a counter, which will make sure every record is using a different key).
You are doing several things wrong. The first thing you have to take into account is which MySQL engine you're using.
The default one is now InnoDB; previously the default engine was MyISAM.
I'll write this answer under the assumption that you're using InnoDB, which you should be using for a plethora of reasons.
InnoDB operates in something called autocommit mode. That means that every query you make is wrapped in a transaction.
To translate that into a language that us mere mortals can understand - every query you run without a BEGIN WORK; block is its own transaction - ergo, MySQL will wait until the hard drive confirms the data has been written.
Knowing that hard drives are slow (mechanical ones are still the most widely used), that means your inserts will only be as fast as the hard drive. Mechanical hard drives can usually perform about 300 input/output operations per second, so assuming you can do 300 inserts a second - yes, you'll wait quite a bit to insert 1 million records.
So, knowing how things work - you can use them to your advantage.
The amount of data that the HDD will write per transaction will be generally very small (4KB or even less), and knowing today's HDDs can write over 100MB/sec - that indicates that we should wrap several queries into a single transaction.
That way MySQL will send quite a bit of data and wait for the HDD to confirm it wrote everything and that the whole world is fine and dandy.
So, assuming you have 1M rows you want to populate - you'll execute 1M queries. If your transactions commit 1000 queries at a time, you should perform only about 1000 write operations.
That way, your code becomes something like this:
(I am not familiar with the mysqli interface, so function names might be wrong, and since I'm typing without actually running the code, the example might not work - use it at your own risk.)
function generateRandomData()
{
    $db = new mysqli('localhost','XXX','XXX','scores');
    if (mysqli_connect_errno()) {
        echo 'Failed to connect to database. Please try again later.';
        exit;
    }

    $query = "insert into scoretable values(?,?,?)";
    // We prepare ONCE, that's the point of prepared statements
    $stmt = $db->prepare($query);

    $start = 0;
    $top = 1000000;

    // Open the first transaction before the loop
    $db->begin_transaction();

    for ($a = $start; $a < $top; $a++)
    {
        $id = rand(1,75000);
        $score = rand(1,100000);
        $time = rand(1367038800,1369630800);
        $stmt->bind_param("iii",$id,$score,$time);
        $stmt->execute();

        // Commit on every thousandth query and open a new transaction
        if (($a + 1) % 1000 == 0)
        {
            $db->commit();
            $db->begin_transaction();
        }
    }

    // Commit whatever is left in the last transaction
    $db->commit();
}
DB querying involves many interrelated tasks. As a result it is an 'expensive' process. It is even more 'expensive' when it comes to insertion/update.
Running the query once is the best way to enhance performance.
You can build up the statement in the loop and run it once.
eg.
$query = "insert into scoretable values ";
for($a = 0; $a < 1000000; $a++)
{
$values = " ('".$?."','".$?."','".$?."'), ";
$query.=$values;
...
}
...
//remove the last comma
...
$stmt = $db->prepare($query);
...
$stmt->execute();
Have a look at this gist I've created. It takes about 5 minutes to insert a million rows on my laptop.

How to update thousands of rows in mysql database

I'm trying to update 100,000 rows in my database; the following code should do that, but I always get an error:
Error: Commands out of sync; you can't run this command now
Because it is an update I don't need the results and just want to get rid of them. The $count variable is used so that my database gets chunks of updates instead of one big update. (One big update is not working because of some limitations of the database.)
I tried a lot of different things like mysqli_free_result and so on... nothing worked.
global $mysqliObject;

$count = 0;
$statement = "";

foreach ($songsArray as $song) {
    $id = $song->getId();
    $treepath = $song->getTreepath();

    $statement = $statement."UPDATE songs SET treepath='".$treepath."' WHERE id=".$id."; ";
    $count++;

    if ($count > 10000) {
        $result = mysqli_multi_query($mysqliObject, $statement);
        if (!$result) {
            die('<br/><br/>Error1: ' . mysqli_error($mysqliObject));
        }
        $count = 0;
        $statement = "";
    }
}
Using a prepared query will reduce the CPU load in the mysqld process, as DaveRandom and StevenVI suggest. However, in this case I doubt that using prepared queries will materially impact your runtime. The challenge is that you are attempting to update 100K rows in the songs table, and this is going to involve a lot of physical I/O on your disk subsystem. It is these physical delays (say ~10 ms per PIO) that will dominate runtimes. Factors such as what is contained in each row and how many indexes you are using on the table (especially those that involve treepath) will all blend into this mix.
The actual CPU costs of preparing a simple statement like
UPDATE songs SET treepath="some treepath" WHERE id=12345;
will be lost in this overall physical I/O delay, and the relative size of this will materially depend on the nature of the physical subsystem where you are storing your data: a single SATA disk; SSD; some NAS with large caches and SSD support ...
You need to rethink your overall strategy here, especially if you are also using the songs table at the same time as a resource for interactive requests through a web front-end. Updating 100K rows is going to take some time -- less if you are updating 100K out of 100K in storage order, since this will be more aligned to the MYD organisation and the write-through caching will be better; more if you are updating 100K rows in random order out of 1M rows, where the number of PIOs will be a lot higher.
When you are doing this, the overall performance of your D/B is going to degrade badly.
Do you want to minimise the impact on parallel use of your DB, or are you just trying to do this as a dedicated batch operation with other services offline?
Is your goal to minimise the total elapsed time, to keep it reasonably short subject to some overall impact constraint, or even just to complete without dying?
I suggest that you've got two sensible approaches: (i) do this as a proper batch activity with the D/B offline to other services. In this case you probably want to take out a lock on the table, and bracket the updates with ALTER TABLE ... DISABLE/ENABLE KEYS. (ii) do this as a trickle update with far smaller update sets and a delay between each set to allow the D/B to flush to disk.
Whatever you choose, I'd drop the batch size. The multi_query essentially optimises the RPC overheads involved in calling the out-of-process mysqld. A batch of, say, 10 cuts this by 90%. You've got diminishing returns after this -- especially since the updates will be physical-I/O intensive.
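As a sketch of that smaller-batch shape -- and of draining the result sets that mysqli_multi_query leaves pending, which is what usually triggers the "Commands out of sync" error -- with the batch size of 10 and the escaping being illustrative rather than a vetted fix:
$batchSize = 10;
$count = 0;
$statement = "";

foreach ($songsArray as $song) {
    $treepath = mysqli_real_escape_string($mysqliObject, $song->getTreepath());
    $statement .= "UPDATE songs SET treepath='".$treepath."' WHERE id=".(int)$song->getId()."; ";

    if (++$count % $batchSize == 0) {
        mysqli_multi_query($mysqliObject, $statement);
        // drain every result set, otherwise the next call fails with
        // "Commands out of sync"
        do {
            if ($res = mysqli_store_result($mysqliObject)) {
                mysqli_free_result($res);
            }
        } while (mysqli_more_results($mysqliObject) && mysqli_next_result($mysqliObject));
        $statement = "";
    }
}

if ($statement !== "") {
    mysqli_multi_query($mysqliObject, $statement);
    do {
        if ($res = mysqli_store_result($mysqliObject)) {
            mysqli_free_result($res);
        }
    } while (mysqli_more_results($mysqliObject) && mysqli_next_result($mysqliObject));
}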
Try this code using prepared statements:
// Create a prepared statement
$query = "
UPDATE `songs`
SET `treepath` = ?
WHERE `id` = ?
";
$stmt = $GLOBALS['mysqliObject']->prepare($query); // Global variables = bad
// Loop over the array
foreach ($songsArray as $key => $song) {
// Get data about this song
$id = $song->getId();
$treepath = $song->getTreepath();
// Bind data to the statement
$stmt->bind_param('si', $treepath, $id);
// Execute the statement
$stmt->execute();
// Check for errors
if ($stmt->errno) {
echo '<br/><br/>Error: Key ' . $key . ': ' . $stmt->error;
break;
} else if ($stmt->affected_rows < 1) {
echo '<br/><br/>Warning: No rows affected by object at key ' . $key;
}
// Reset the statement
$stmt->reset();
}
// We're done, close the statement
$stmt->close();
I'd do something like this:
$link = mysqli_connect('host');

if ( $stmt = mysqli_prepare($link, "UPDATE songs SET treepath=? WHERE id=?") ) {
    foreach ($songsArray as $song) {
        $id = $song->getId();
        $treepath = $song->getTreepath();

        // bind both parameters in one call: 's' for the string, 'i' for the id
        mysqli_stmt_bind_param($stmt, 'si', $treepath, $id);
        mysqli_stmt_execute($stmt);
    }
    mysqli_stmt_close($stmt);
}

mysqli_close($link);
Or of course your normal mysql_query calls, but enclosed in a transaction.
I found another way...
Since this is not a production server - the fastest way to update 100k rows is by deleting all of them and inserting 100k from scratch with the new calculated values. It seems a little bit odd to delete everything and insert everything instead of updating but it is WAYYY faster.
Before: hours Now: seconds!
I would suggest locking the table and disabling the keys before executing multiple updates.
This avoids the database engine stalling (at least it did in my case of a 300,000-row update).
LOCK TABLES `TBL_RAW_DATA` WRITE;
/*!40000 ALTER TABLE `TBL_RAW_DATA` DISABLE KEYS */;
UPDATE TBL_RAW_DATA SET CREATION_DATE = ADDTIME(CREATION_DATE,'01:00:00') WHERE ID_DATA >= 1359711;
/*!40000 ALTER TABLE `TBL_RAW_DATA` ENABLE KEYS */;
UNLOCK TABLES;

Transfer data into optimized MySQL tables from "raw" tables in cron with php

I'm working with an MLS real estate listing provider (RETS). Every 48 hours we will be pulling data from their server in a cron job to an SQL database. I'm charged with the task of writing a php script that will be run after the data from the remote server is dumped into our "raw" tables. In these raw tables, all columns are VARCHAR(255), and we want to move the data into optimized tables. Before I send my script to the guy in charge of setting up the cron job, I wondered if there is a more efficient way to do it so I don't look foolish.
Here's what I'm doing:
There are 8 total tables, 4 raw and 4 optimized - all in the same database. The raw table column names are non-descriptive, like c1, c2, c3, c4, etc. This is intentional because the data that goes in each column may change. The raw table column names are mapped to the correct optimized table columns with PHP, something like this:
$tables['optimized_table_name1']['raw_table'] = 'raw_table_name1';
$tables['optimized_table_name1']['data_map'] = array(
    'c1' => array( // <--- "c1" is the raw table column name
        'column_name' => 'id',
        // I use other values for table creation,
        // but they don't matter to the question.
        // Just explaining why the array looks like this
        //'type' => 'VARCHAR',
        //'max_length' => 45,
        //'primary_key' => FALSE,
        // etc.
    ),
    'c9' => array('column_name' => 'address'),
    'c25' => array('column_name' => 'baths'),
    'c2' => array('column_name' => 'bedrooms') //etc.
);
I'm doing the same thing for each of the 4 tables: SELECT * FROM the raw table, read the config array and create a huge SQL insert statement, TRUNCATE the optimized table, then run the INSERT query.
foreach ($tables as $table_name => $config):
    $raw_table = $config['raw_table'];
    $data_map = $config['data_map'];
    $fields = array();
    $values = array();
    $count = 0;

    // Get the raw data and create an array mapped to the optimized table columns.
    $query = mysql_query("SELECT * FROM dbname.{$raw_table}");
    while ($row = mysql_fetch_assoc($query))
    {
        // Reading column names from my config file on first pass
        // Setting up the array, will only run once per table
        if (empty($fields))
        {
            foreach ($row as $key => $val)
            {// Produces an array with the column names
                $fields[] = $data_map[$key]['column_name'];
            }
        }
        foreach ($row as $key => $val)
        {// Assigns data to an array to be imploded later
            $values[$count][] = $val;
        }
        $count++;
    }

    // Create the INSERT statement string
    $insert = array();
    $sql = "\nINSERT INTO `{$table_name}` (`".implode('`,`', $fields)."`) VALUES\n";
    foreach ($values as $key => $vals)
    {
        foreach ($vals as &$val)
        {
            // Escape the data
            $val = mysql_real_escape_string($val);
        }
        // Using implode for simplicity, could avoid the nested foreach if I wanted to
        $insert[] = "('".implode("','", $vals)."')";
    }
    $sql .= implode(",\n", $insert).";\n";

    // TRUNCATE optimized table and run INSERT query here
endforeach;
Which produces something like this (only larger - about 15,000 records max per table, and one insert statement per table):
INSERT INTO `optimized_table_name1` (`id`,`beds`,`baths`,`town`) VALUES
('50300584','2','1','Fairfield'),
('87560584','3','2','New Haven'),
('76545584','2','1','Bristol');
Now I'll admit, I have been under the wing of an ORM for a long time and am not up on my vanilla mysql/php. This is a pretty simple task and I want to keep the code simple.
My questions:
Is the TRUNCATE/INSERT method a good way to do this?
Is there anything about my code that you can see being a problem? I know you see nested foreach loops and just shudder, but I want to keep the code as small and clean as possible and avoid lots of messy string concatenation (to produce the insert query). Like I said, I also haven't used native PHP functions for SQL in a long time.
I feel like it really doesn't matter if the code is not optimized if it is run at 3 AM every 2 days. Does it matter? Is this code going to perform OK?
Is there a better overall strategy to accomplish this task?
Do I need to be using transactions?
How can I be aware of errors that may occur in cron scripts?
Apologize if I don't use correct cron jargon, it's new to me.
Keep it simple. ORM would be swell for this task.
Answers:
Yes.
Your code is readable. At least I did not have any problems reading it.
We had a script that ran early in the morning. It was not optimized and consumed a lot of memory. After FOUR years it started to consume over 512 MB. I spent 2 hours optimizing it, and now it consumes 7 MB (pretty good optimization, huh? :) ). I personally think it is "ok" that your script is not optimized now. If this script starts failing, you'll figure out what the problem is. Maybe it will exhaust memory, maybe your SQL queries will cause deadlocks... maybe you will later optimize it to READ from slave servers... I don't know, but it works fine now, and that's okay.
I'd do something similar to your code. But I'd probably generate the file first and load the data into the server by running the shell command mysql -u username --password=password < import_file.sql (see the sketch after this list of answers). That way I'd have the file stored somewhere on disk so I can always take a look at it, and maybe even edit it for a one-time corrective load. But you can still do it by writing your SQL statement into a file.
No. It is just one query. If you use the InnoDB engine it is already a transaction.
First, use error_reporting(E_ALL & ~E_NOTICE). Second, use the mysql_error PHP function to ensure your query performed correctly. Third, in your cron job, redirect the error stream into a file, like so: 0 7 * * 0 /path/to/php -c /path/to/php.ini /path/to/script.php 2> /tmp/errors_file. You can then create a SECOND script, running after the first one, that notifies you about errors in script.php by email or... whatever way of notifying you prefer. I'd prefer register_shutdown_function with a callback that checks the error file and, if it is not empty, notifies you and deletes it afterwards.
Just my opinion, but I hope my answer helps.
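A minimal sketch of the file-then-load idea from point 4 above (the paths, credentials and exec() call are illustrative assumptions only, not a vetted implementation):
// write the generated INSERT statements to disk first
file_put_contents('/tmp/import_file.sql', $sql);

// then let the mysql client do the actual load (the file stays on disk for later inspection)
exec('mysql -u username --password=password dbname < /tmp/import_file.sql 2>&1', $output, $exitCode);
if ($exitCode !== 0) {
    // surface the problem to the cron error stream
    fwrite(STDERR, implode("\n", $output) . "\n");
}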

Downloading Large Data Sets -> Text to MySQL or just to MySQL?

I'm downloading large sets of data via an XML query through PHP, with the following scenario:
- Query for records 1-1000, download all parts (1000 parts is roughly 4.5 MB of text), then store those in memory while I query the next 1001-2000, also stored in memory (up to potentially 400k).
I'm wondering if it would be better to write these entries to a text file rather than storing them in memory and, once the complete download is done, try to insert them all into the DB, or whether to write them to the DB as they come in.
Any suggestions would be greatly appreciated.
Cheers
You can run a query like this:
INSERT INTO table (id, text)
VALUES (null, 'foo'), (null, 'bar'), ..., (null, 'value no 1000');
Doing this you'll do the whole thing in one shot, and the parser will be called once. The best you can do is measure it, e.g. with MySQL's BENCHMARK function: compare running a query that inserts 1000 records 1000 times against 1,000,000 single-record inserts.
(Sorry about the prev. answer, I've misunderstood the question).
I think you should write them to the database as soon as you receive them. This will save memory, and you won't have to execute a 400-times-slower query at the end. You will need a mechanism to deal with any problems that may occur in this process, like a disconnection after 399K results.
In my experience it would be better to download everything in a temporary area and then, when you are sure that everything went well, to move the data (or the files) in place.
As you are using a database you may want to dump everything into a table, something like this code:
$error = false;
while ( ($row = getNextRow($db)) && !$error ) {
    // `key` needs backticks because it is a reserved word in MySQL
    $sql = "insert into temptable(`key`, `value`) values ($row[0], $row[1])";
    if (mysql_query($sql)) {
        echo '#';
    } else {
        $error = true;
    }
}
if (!$error) {
    $sql = "insert into myTable (select * from temptable)";
    if (mysql_query($sql)) {
        echo 'Finished';
    } else {
        echo 'Error';
    }
}
Alternatively, if you know the table well, you can add a "new" flag field for newly inserted lines and update everything when you are finished (see the sketch below).
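A rough sketch of that flag idea (the is_new column name and the statements are assumptions for illustration):
// inside the download loop: rows arrive flagged as new
$sql = "insert into myTable (`key`, `value`, is_new) values ($row[0], $row[1], 1)";
mysql_query($sql);

// once the whole download has finished successfully, clear the flag in one go
mysql_query("UPDATE myTable SET is_new = 0 WHERE is_new = 1");

// if something went wrong instead, discard the half-loaded batch
// mysql_query("DELETE FROM myTable WHERE is_new = 1");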

Is it bad to put a MySQL query in a PHP loop?

I often have large arrays, or large amounts of dynamic data in PHP that I need to run MySQL queries to handle.
Is there a better way to run many processes like INSERT or UPDATE without looping through the information to be INSERT-ed or UPDATE-ed?
Example (I didn't use a prepared statement for brevity's sake):
$myArray = array('apple','orange','grape');

foreach ($myArray as $arrayFruit) {
    $query = "INSERT INTO `Fruits` (`FruitName`) VALUES ('" . $arrayFruit . "')";
    mysql_query($query, $connection);
}
OPTION 1
You can actually send multiple statements at once. Note that the old mysql_query() function refuses statements separated by semicolons, so you need mysqli_multi_query() (or PDO) for this:
$queries = '';
foreach ($myArray as $arrayFruit) {
    $queries .= "INSERT....;"; //notice the semicolon
}
mysqli_multi_query($connection, $queries);
This would save on your processing.
OPTION 2
If your insert is that simple for the same table, you can do multiple inserts in ONE query
$fruits = "('".implode("'), ('", $fruitsArray)."')";
mysql_query("INSERT INTO Fruits (Fruit) VALUES $fruits", $connection);
The query ends up looking something like this:
$query = "INSERT INTO Fruits (Fruit)
VALUES
('Apple'),
('Pear'),
('Banana')";
This is probably the way you want to go.
If you have the mysqli class, you can iterate over the values to insert using a prepared statement.
$sth = $dbh->prepare("INSERT INTO Fruits (Fruit) VALUES (?)");

foreach ($fruits as $fruit) {
    $sth->reset(); // make sure we are fresh from the previous iteration
    $sth->bind_param('s', $fruit); // bind one or more variables to the query
    $sth->execute(); // execute the query
}
One thing to note about your original solution over the implosion method of jerebear (which I have used before, and love) is that it is easier to read. The implosion takes more programmer brain cycles to understand, which can be more expensive than processor cycles. Premature optimisation, blah, blah, blah... :)
One thing to note about jerebear's answer with multiple VALUE-blocks in one INSERT:
It can be rather dangerous for really large amounts of data, because most DBMS have an upper limit on the size of the commands they can handle. If you exceed that with too many VALUE-blocks, your insert will fail. On MySQL for example the limit is usually 1MB AFAIK.
So you should figure out what the maximum size is (ideally at runtime, might be available from the database metadata), and make sure you don't exceed it by spreading your lists of values over several INSERTs.
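For illustration, a sketch of that runtime check with mysqli ($connection is assumed to be a mysqli handle here, and the 80% safety margin is an arbitrary choice):
$row = $connection->query("SHOW VARIABLES LIKE 'max_allowed_packet'")->fetch_row();
$maxPacket = (int)$row[1];
$limit = (int)($maxPacket * 0.8); // keep a safety margin below the real limit

$prefix = "INSERT INTO Fruits (Fruit) VALUES ";
$buffer = '';

foreach ($fruits as $fruit) {
    $value = "('" . $connection->real_escape_string($fruit) . "'),";
    // flush the current batch before it would push the statement over the limit
    if ($buffer !== '' && strlen($prefix) + strlen($buffer) + strlen($value) > $limit) {
        $connection->query($prefix . rtrim($buffer, ','));
        $buffer = '';
    }
    $buffer .= $value;
}

if ($buffer !== '') {
    $connection->query($prefix . rtrim($buffer, ','));
}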
I was inspired by jerebear's answer to build something like his second option for one of my current projects. Because of the sheer volume of records I couldn't save and process all the data at once. So I built this to do imports: you add your data, and then call a method when each record is done. After a certain, configurable, number of records the data in memory will be saved with a mass insert like jerebear's second option.
// CREATE TABLE example ( Id INT, Field1 INT, Field2 INT, Field3 INT);
$import = new DataImport($dbh, 'example', 'Id, Field1, Field2, Field3');

foreach ($whatever as $row) {
    // add data in the order of your column definition
    $import->addValue($Id);
    $import->addValue($Field1);
    $import->addValue($Field2);
    $import->addValue($Field3);
    $import->nextRow();
}
$import->lastRow();
