Downloading Large Data Sets -> Text to MySQL or just to MySQL?

I'm downloading large sets of data via an XML query through PHP with the following scenario:
- Query for records 1-1000 and download all parts (1000 parts is roughly 4.5 MB of text), then store those in memory while I query the next batch, 1001-2000, and store that in memory too (up to potentially 400k records).
I'm wondering whether it would be better to write these entries to a text field rather than storing them in memory and, once the complete download is done, trying to insert them all into the DB, or to write them to the DB as they come in.
Any suggestions would be greatly appreciated.
Cheers

You can run a query like this:
INSERT INTO table (id, text)
VALUES (null, 'foo'), (null, 'bar'), ..., (null, 'value no 1000');
Doing this, you'll do the whole thing in one shot, and the parser will be called only once. The best you can do is measure it, for example with MySQL's BENCHMARK function: run a query that inserts 1000 records 1000 times, versus 1,000,000 single-record inserts.
(Sorry about the prev. answer, I've misunderstood the question).
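If you build the multi-row INSERT above from PHP, one way to avoid hand-quoting every value is to generate "?" placeholders and bind the values through a prepared statement. This is only a minimal sketch: buildMultiRowInsert() is a made-up helper name, and the table and column names in the usage line are placeholders, not anything from the question.

```php
<?php
// Hypothetical helper (not from the original answer): builds one multi-row
// INSERT with "?" placeholders plus a flat parameter list for binding.
function buildMultiRowInsert(string $table, array $columns, array $rows): array
{
    if ($rows === []) {
        throw new InvalidArgumentException('No rows to insert');
    }
    // One "(?, ?)" group per row, e.g. "(?, ?), (?, ?)"
    $group = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = sprintf(
        'INSERT INTO `%s` (`%s`) VALUES %s',
        $table,
        implode('`, `', $columns),
        implode(', ', array_fill(0, count($rows), $group))
    );
    // Flatten [[a, b], [c, d]] into [a, b, c, d], keeping row order
    $params = [];
    foreach ($rows as $row) {
        foreach (array_values($row) as $value) {
            $params[] = $value;
        }
    }
    return [$sql, $params];
}

// Assumed table "parts" with columns "id" and "text"
[$sql, $params] = buildMultiRowInsert('parts', ['id', 'text'], [[1, 'foo'], [2, 'bar']]);
echo $sql, PHP_EOL;
```

The returned SQL and parameter list can then be handed to whatever prepared-statement API you use (mysqli or PDO), one call per batch of rows.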

I think you should write them to the database as soon as you receive them. This will save memory, and you don't have to execute a 400 times slower query at the end. You will need a mechanism to deal with any problems that may occur in this process, like a disconnection after 399K results.
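A rough sketch of such a mechanism, assuming you can re-request any page of records: store each page as soon as it arrives and remember where to resume if something fails. The function name and the two callbacks ($fetchPage, $storePage) are hypothetical stand-ins for your XML query and your DB insert; the demo below fakes both with closures.

```php
<?php
// Hypothetical driver: $fetchPage and $storePage stand in for the XML query
// and the database insert; neither exists in the original post.
function downloadInPages(callable $fetchPage, callable $storePage, int $start, int $total, int $pageSize): int
{
    $next = $start;
    while ($next <= $total) {
        $last = min($next + $pageSize - 1, $total);
        try {
            $records = $fetchPage($next, $last);
            $storePage($records); // write immediately instead of keeping it in memory
        } catch (Exception $e) {
            return $next; // first record NOT yet stored; resume from here later
        }
        $next = $last + 1;
    }
    return $next; // $total + 1 means everything was stored
}

// Demo with fake callbacks: the fetch fails once on the page starting at 2001.
$stored = [];
$failOnce = true;
$fetch = function (int $from, int $to) use (&$failOnce): array {
    if ($from === 2001 && $failOnce) {
        $failOnce = false;
        throw new RuntimeException('connection lost');
    }
    return range($from, $to);
};
$store = function (array $records) use (&$stored): void {
    foreach ($records as $record) {
        $stored[] = $record;
    }
};

$resumeAt = downloadInPages($fetch, $store, 1, 5000, 1000);           // stops early
$finishedAt = downloadInPages($fetch, $store, $resumeAt, 5000, 1000); // picks up again
echo "resumed at $resumeAt, finished at $finishedAt", PHP_EOL;
```

For this to be safe, storing one page should be all-or-nothing (for example one transaction per page), otherwise a retry could insert part of a page twice.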

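In my experience it would be better to download everything into a temporary area and then, when you are sure that everything went well, move the data (or the files) into place.
As you are using a database, you may want to dump everything into a table, with something like this code:
// Note: the mysql_* functions were removed in PHP 7; use mysqli or PDO in new code.
$error = false;
while (!$error && ($row = getNextRow($db))) {
    // Escape and quote the values; `key` is a reserved word, so backtick it
    $key = mysql_real_escape_string($row[0]);
    $value = mysql_real_escape_string($row[1]);
    $sql = "insert into temptable(`key`, `value`) values ('$key', '$value')";
    if (mysql_query($sql)) {
        echo '#';
    } else {
        $error = true;
    }
}
if (!$error) {
    // Everything arrived: copy the staged rows into the real table
    $sql = "insert into myTable select * from temptable";
    if (mysql_query($sql)) {
        echo 'Finished';
    } else {
        echo 'Error';
    }
}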
Alternatively, if you know the table well, you can add a "new" flag field for newly inserted lines and update everything when you are finished.

Related

How to update thousands of rows in mysql database

I'm trying to update 100,000 rows in my database. The following code should do that, but I always get an error:
Error: Commands out of sync; you can't run this command now
Because it is an update, I don't need the results and just want to get rid of them. The $count variable is used so that my database gets chunks of updates instead of one big update. (One big update does not work because of some limitations of the database.)
I tried a lot of different things, like mysqli_free_result and so on, but nothing worked.
global $mysqliObject;
$count = 0;
$statement = "";
foreach ($songsArray as $song) {
$id = $song->getId();
$treepath = $song->getTreepath();
$statement = $statement."UPDATE songs SET treepath='".$treepath."' WHERE id=".$id."; ";
$count++;
if ($count > 10000){
$result = mysqli_multi_query($mysqliObject, $statement);
if(!$result) {
die('<br/><br/>Error1: ' . mysqli_error($mysqliObject));
}
$count = 0;
$statement = "";
}
}

Using a prepared query will reduce the CPU load in the mysqld process, as DaveRandom and StevenVI suggest. However, in this case I doubt that using prepared queries will materially impact your runtime. The challenge is that you are attempting to update 100K rows in the songs table, and this is going to involve a lot of physical I/O on your disk subsystem. It is these physical delays (say ~10 ms per physical I/O, or PIO) that will dominate runtimes. Factors such as what each row contains and how many indexes you have on the table (especially any that involve treepath) will all blend into this mix.
The actual CPU costs of preparing a simple statement like
UPDATE songs SET treepath="some treepath" WHERE id=12345;
will be lost in this overall physical I/O delay, and the relative size of this will materially depend on the nature of the physical subsystem where you are storing your data: a single SATA disk; SSD; some NAS with large caches and SSD support ...
You need to rethink your overall strategy here, especially if you are also using the songs table at the same time as a resource for interactive requests through a web front-end. Updating 100K rows is going to take some time -- less if you are updating 100K out of 100K in storage order, since this will be more aligned to the MYD organisation and the write-through caching will be better; more if you are updating 100K rows in random order out of 1M rows, where the number of PIOs will be a lot higher.
While you are doing this, the overall performance of your D/B is going to degrade badly.
Do you want to minimise the impact on parallel use of your DB, or are you just trying to do this as a dedicated batch operation with other services offline?
Is your goal to minimise the total elapsed time, to keep it reasonably short subject to some overall impact constraint, or even just to complete without dying?
I suggest that you've got two sensible approaches: (i) do this as a proper batch activity with the D/B offline to other services. In this case you probably want to take out a lock on the table, and bracket the updates with ALTER TABLE ... DISABLE/ENABLE KEYS. (ii) do this as a trickle update with far smaller update sets and a delay between each set to allow the D/B to flush to disk.
Whatever you choose, I'd drop the batch size. multi_query essentially optimises away the RPC overheads involved in calling the out-of-process mysqld. A batch of 10, say, cuts this by 90%. You get diminishing returns after that -- especially since the updates will be physical I/O intensive.
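Approach (ii) could look roughly like this. It is a sketch only: trickleUpdate() and its $runBatch callback are invented for illustration, and the batch size and pause are arbitrary numbers you would have to tune against your own disk subsystem.

```php
<?php
// Hypothetical helper: $runBatch stands in for whatever executes one small
// set of UPDATEs (for example a prepared statement run inside a transaction).
function trickleUpdate(array $rows, int $batchSize, int $pauseMicroseconds, callable $runBatch): int
{
    $batches = 0;
    foreach (array_chunk($rows, $batchSize) as $batch) {
        $runBatch($batch);
        $batches++;
        usleep($pauseMicroseconds); // give the server time to flush to disk
    }
    return $batches;
}

// Demo: 25 fake rows in batches of 10 with a 1 ms pause between batches.
$seen = [];
$count = trickleUpdate(range(1, 25), 10, 1000, function (array $batch) use (&$seen): void {
    $seen[] = count($batch); // a real callback would run the UPDATEs here
});
echo "$count batches", PHP_EOL;
```

The trade-off is total elapsed time against impact on other users: smaller batches and longer pauses are gentler on interactive traffic but take longer overall.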

Try this code using prepared statements:
// Create a prepared statement
$query = "
UPDATE `songs`
SET `treepath` = ?
WHERE `id` = ?
";
$stmt = $GLOBALS['mysqliObject']->prepare($query); // Global variables = bad
// Loop over the array
foreach ($songsArray as $key => $song) {
// Get data about this song
$id = $song->getId();
$treepath = $song->getTreepath();
// Bind data to the statement
$stmt->bind_param('si', $treepath, $id);
// Execute the statement
$stmt->execute();
// Check for errors
if ($stmt->errno) {
echo '<br/><br/>Error: Key ' . $key . ': ' . $stmt->error;
break;
} else if ($stmt->affected_rows < 1) {
echo '<br/><br/>Warning: No rows affected by object at key ' . $key;
}
// Reset the statement
$stmt->reset();
}
// We're done, close the statement
$stmt->close();

I'd do something like this:
$link = mysqli_connect('host');
if ( $stmt = mysqli_prepare($link, "UPDATE songs SET treepath=? WHERE id=?") ) {
    foreach ($songsArray as $song) {
        $id = $song->getId();
        $treepath = $song->getTreepath();
        // Bind both parameters in one call: 's' for treepath (assuming it's a string), 'i' for id
        mysqli_stmt_bind_param($stmt, 'si', $treepath, $id);
        mysqli_stmt_execute($stmt);
    }
    mysqli_stmt_close($stmt);
}
mysqli_close($link);
Or, of course, use your normal mysql_query calls, but enclosed in a transaction.

I found another way...
Since this is not a production server - the fastest way to update 100k rows is by deleting all of them and inserting 100k from scratch with the new calculated values. It seems a little bit odd to delete everything and insert everything instead of updating but it is WAYYY faster.
Before: hours Now: seconds!

I would suggest locking the table and disabling the keys before executing multiple updates.
This keeps the database engine from stalling (at least it did in my case, a 300,000-row update).
LOCK TABLES `TBL_RAW_DATA` WRITE;
/*!40000 ALTER TABLE `TBL_RAW_DATA` DISABLE KEYS */;
UPDATE TBL_RAW_DATA SET CREATION_DATE = ADDTIME(CREATION_DATE,'01:00:00') WHERE ID_DATA >= 1359711;
/*!40000 ALTER TABLE `TBL_RAW_DATA` ENABLE KEYS */;
UNLOCK TABLES;

How to speed up processing a huge text file?

I have an 800 MB text file with 18,990,870 lines in it (each line is a record). I need to pick out certain records and, if there is a match, write them into a database.
It is taking an age to work through them, so I wondered if there was a way to do it any quicker.
My PHP is reading a line at a time as follows:
$fp2 = fopen('download/pricing20100714/application_price','r');
if (!$fp2) {echo 'ERROR: Unable to open file.'; exit;}
while (!feof($fp2)) {
$line = stream_get_line($fp2,128,$eoldelimiter); //use 2048 if very long lines
if ($line[0] === '#') continue; //Skip lines that start with #
$field = explode ($delimiter, $line);
list($export_date, $application_id, $retail_price, $currency_code, $storefront_id ) = explode($delimiter, $line);
if ($currency_code == 'USD' and $storefront_id == '143441'){
// does application_id exist?
$application_id = mysql_real_escape_string($application_id);
$query = "SELECT * FROM jos_mt_links WHERE link_id='$application_id';";
$res = mysql_query($query);
if (mysql_num_rows($res) > 0 ) {
echo $application_id . "application id has price of " . $retail_price . "with currency of " . $currency_code. "\n";
} // end if exists in SQL
} else
{
// no, application_id doesn't exist
} // end check for currency and storefront
} // end while statement
fclose($fp2);

At a guess, the performance issue is because it issues a query for each application_id with USD and your storefront.
If space and IO aren't an issue, you might just blindly write all 19M records into a new staging DB table, add indices and then do the matching with a filter?

Don't try to reinvent the wheel, it's been done. Use a database to search through the file's content. You can load that file into a staging table in your database and query your data, using indexes for fast access if they add value. Most if not all databases have import/loading tools to get a file into the database relatively fast.

19M rows in a DB will slow it down if the DB was not designed properly. You can still use text files if they are partitioned properly: recreating multiple smaller files based on certain parameters, stored in a properly sorted way, might work.
Anyway, PHP is not the best language for file I/O and processing; it is much slower than Java for this task, while plain old C would be one of the fastest for the job. PHP should be restricted to generating dynamic web output, while core processing should be in Java/C. Ideally a Java/C service would generate the output, with PHP using that feed to generate the HTML.

You are parsing the input line twice by doing two explodes in a row. I would start by removing the first line:
$field = explode ($delimiter, $line);
list($export_date, ...., $storefront_id ) = explode($delimiter, $line);
Also, if you are only using the query to test for a match based on your condition, don't use SELECT * use something like this:
"SELECT 1 FROM jos_mt_links WHERE link_id='$application_id';"
You could also, as Brandon Horsley suggested, buffer a set of application_id values in an array and modify your select statement to use the IN clause thereby reducing the number of queries you are performing.

Have you tried profiling the code to see where it's spending most of its time? That should always be your first step when trying to diagnose performance problems.

Preprocess with sed and/or awk ?

Databases are built and designed to cope with large amounts of data, PHP isn't. You need to re-evaluate how you are storing the data.
I would dump all the records into a database, then delete the records you don't need. Once you have done that, you can copy those records wherever you want.

As others have mentioned, the expense is likely in your database query. It might be faster to load a batch of records from the file (instead of one at a time) and perform one query to check multiple records.
For example, load 1000 records that match the USD currency and storefront at a time into an array and execute a query like:
'select link_id from jos_mt_links where link_id in (' . implode(',', $application_id_array) . ')'
This will return a list of those records that are in the database. Alternatively, you could change the sql to be not in to get a list of those records that are not in the database.
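As a sketch of how that batch lookup might be put together without concatenating values into the SQL, the helper below emits one "?" per ID. buildExistsQuery() is a made-up name, the IDs are invented sample values, and $foundInDb fakes the result set so the example runs without a database.

```php
<?php
// Hypothetical helper: returns the SQL for one batch of IDs, with one "?"
// per ID so the values can be bound rather than concatenated.
function buildExistsQuery(array $applicationIds): string
{
    if ($applicationIds === []) {
        throw new InvalidArgumentException('IN () with no values is invalid SQL');
    }
    $placeholders = implode(',', array_fill(0, count($applicationIds), '?'));
    return "SELECT link_id FROM jos_mt_links WHERE link_id IN ($placeholders)";
}

// Made-up IDs collected from file lines that passed the USD/storefront filter
$candidates = ['281796108', '281940292', '282614216'];
$sql = buildExistsQuery($candidates);

// Pretend result set: the IDs the query returned are the matches,
// and array_diff gives the ones that are not in the database.
$foundInDb = ['281940292'];
$missing = array_values(array_diff($candidates, $foundInDb));
echo $sql, PHP_EOL;
```

Doing the "not in the database" part with array_diff in PHP avoids a second query per batch.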

Batch processing in array using PHP

I have thousands of records in an array that was parsed from XML. My concern is the processing time of my script: will it be affected, given that I have a hundred thousand records to insert into the database? Is there a way to process the insertion of the data into the database in batches?

Syntax is:
INSERT INTO tablename (fld1, fld2) VALUES (val1, val2), (val3, val4)... ;
So you can write something like this (dummy example):
foreach ($data AS $key=>$value)
{
    $data[$key] = "($value[0], $value[1])";
}
$query = "INSERT INTO tablename (fld1, fld2) VALUES ".implode(',', $data);
This works quite fast even on huge datasets, so don't worry about performance as long as your dataset fits in memory.

This is for SQL files, but you can follow its model (if not just use it).
It splits the file up into parts of a size you specify, say 3000 lines, and then inserts them on a timed interval, anywhere from under 1 second to 1 minute or more.
This way a large file is broken into smaller inserts.
This helps you avoid editing the PHP server configuration and worrying about memory limits, script execution time and the like.
New Users can't insert links so Google Search "sql big dump" or if this works goto:
www [dot] ozerov [dot] de [ slash ] bigdump [ dot ] php
So you could even theoretically modify the above script to accept your array as the data source instead of the SQL file. It would take some modification, obviously.
Hope it helps.
-R

It's unlikely to affect the processing time, but you'll need to ensure the DB's transaction logs are big enough to build a rollback segment for 100k rows.

Or with the ADOdb wrapper (http://adodb.sourceforge.net/):
// assuming you have your data in a form like this:
$params = array(
array("key1","val1"),
array("key2","val2"),
array("key3","val3"),
// etc...
);
// you can do this:
$sql = "INSERT INTO `tablename` (`key`,`val`) VALUES ( ?, ? )";
$db->Execute( $sql, $params );

Have you thought about array_chunk? It worked for me in another project
http://www.php.net/manual/en/function.array-chunk.php
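For reference, a small illustration of what array_chunk gives you; the sample array is invented. The last chunk simply holds whatever is left over, and the optional third argument controls whether the original keys survive.

```php
<?php
$records = ['a' => 1, 'b' => 2, 'c' => 3, 'd' => 4, 'e' => 5];

// Default: keys are discarded and each chunk is re-indexed from 0
$plain = array_chunk($records, 2);
// Third argument true: the original keys are kept inside each chunk
$keyed = array_chunk($records, 2, true);

foreach ($plain as $i => $chunk) {
    // Each $chunk is one batch you could hand to a single INSERT
    echo "batch $i: ", implode(',', $chunk), PHP_EOL;
}
```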

Batch insertion of data to MySQL database using php

I have thousands of records parsed from a huge XML file to be inserted into a database table using PHP and MySQL. My problem is that it takes too long to insert all the data into the table. Is there a way to split my data into smaller groups so that the insertion is done group by group? How can I set up a script that will process the data 100 at a time, for example? Here's my code:
foreach($itemList as $key => $item){
$download_records = new DownloadRecords();
//check first if the content exists
if(!$download_records->selectRecordsFromCondition("WHERE Guid=".$guid."")){
/* do an insert here */
} else {
/*do an update */
}
}
*note: $itemList is around 62,000 and still growing.

Using a for loop?
But the quickest option to load data into MySQL is to use the LOAD DATA INFILE command, you can create the file to load via PHP and then feed it to MySQL via a different process (or as a final step in the original process).
If you cannot use a file, use the following syntax:
insert into table(col1, col2) VALUES (val1,val2), (val3,val4), (val5, val6)
so you reduce the total number of statements to run.
EDIT: Given your snippet, it seems you can benefit from the INSERT ... ON DUPLICATE KEY UPDATE syntax of MySQL, letting the database do the work and reducing the amount of queries. This assumes your table has a primary key or unique index.
To hit the DB every 100 rows you can do something like (PLEASE REVIEW IT AND FIX IT TO YOUR ENVIRONMENT)
$insertOrUpdateStatement1 = "INSERT INTO table (col1, col2) VALUES ";
$insertOrUpdateStatement2 = "ON DUPLICATE KEY UPDATE ";
$counter = 0;
$queries = array();
foreach($itemList as $key => $item){
    $val1 = escape($item->col1); //escape is a function that will make
                                 //the input safe from SQL injection.
                                 //Depends on how are you accessing the DB
    $val2 = escape($item->col2);
    $queries[] = $insertOrUpdateStatement1.
        "('$val1','$val2')".$insertOrUpdateStatement2.
        "col1 = '$val1', col2 = '$val2'";
    $counter++;
    if ($counter % 100 == 0) {
        executeQueries($queries);
        $queries = array();
        $counter = 0;
    }
}
// Send whatever is left when the item count is not a multiple of 100
if (count($queries) > 0) {
    executeQueries($queries);
}
And executeQueries would grab the array and send a single multiple query:
function executeQueries($queries) {
$data = "";
foreach ($queries as $query) {
$data.=$query.";\n";
}
executeQuery($data);
}
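A possible variation, sketched under assumptions: instead of one INSERT ... ON DUPLICATE KEY UPDATE per row, MySQL also accepts a single multi-row statement where VALUES(col) refers to the value each row tried to insert. buildUpsert() is a hypothetical helper, and the table name download_records and the Title column are invented for the demo (only Guid appears in the question). As far as I know, VALUES() in this position is deprecated in recent MySQL 8.0 releases in favour of a row alias, so check your server version.

```php
<?php
// Hypothetical helper: one INSERT ... ON DUPLICATE KEY UPDATE for a whole
// batch. Assumes $keyColumn has a PRIMARY KEY or UNIQUE index in MySQL.
function buildUpsert(string $table, array $columns, string $keyColumn, int $rowCount): string
{
    $group = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $updates = [];
    foreach ($columns as $column) {
        if ($column !== $keyColumn) {
            // VALUES(col) refers to the value this row tried to insert
            $updates[] = "`$column` = VALUES(`$column`)";
        }
    }
    return "INSERT INTO `$table` (`" . implode('`, `', $columns) . '`) VALUES '
        . implode(', ', array_fill(0, $rowCount, $group))
        . ' ON DUPLICATE KEY UPDATE ' . implode(', ', $updates);
}

// Assumed table and column names, two rows per statement
$upsertSql = buildUpsert('download_records', ['Guid', 'Title'], 'Guid', 2);
echo $upsertSql, PHP_EOL;
```

With 100 rows per statement this sends one statement per batch rather than 100 statements joined by semicolons, and the values go through placeholders instead of string concatenation.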

Yes, just do what you'd expect to do.
You should not try to do bulk insertion from a web application if you think you might hit a timeout etc. Instead, drop the file somewhere and have a daemon, cron job, etc. pick it up and run a batch job (if running from cron, be sure that only one instance runs at a time).

As said before, you should put it in a temp directory with a cron job to process the files, in order to avoid timeouts (or the user losing their network connection).
Use the web only for uploads.
If you really want to import into the DB on a web request, you can either do a bulk insert or at least use a transaction, which should be faster.
Then, to limit the inserts to batches of 100, commit your transaction whenever a counter reaches a multiple of 100 (count % 100 == 0), and repeat until all your rows are inserted.

Is it bad to put a MySQL query in a PHP loop?

I often have large arrays, or large amounts of dynamic data in PHP that I need to run MySQL queries to handle.
Is there a better way to run many processes like INSERT or UPDATE without looping through the information to be INSERT-ed or UPDATE-ed?
Example (I didn't use a prepared statement, for brevity's sake):
$myArray = array('apple','orange','grape');
foreach($myArray as $arrayFruit) {
$query = "INSERT INTO `Fruits` (`FruitName`) VALUES ('" . $arrayFruit . "')";
mysql_query($query, $connection);
}

OPTION 1
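You can actually run multiple queries at once, but not with mysql_query(), which sends only a single statement per call. With mysqli you can use mysqli_multi_query():
$queries = '';
foreach ($myArray as $arrayFruit) {
    $queries .= "INSERT....;"; //notice the semi colon
}
mysqli_multi_query($connection, $queries); // $connection must be a mysqli link
This saves round trips to the server.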
OPTION 2
If your insert is that simple for the same table, you can do multiple inserts in ONE query
$fruits = "('".implode("'), ('", $fruitsArray)."')";
mysql_query("INSERT INTO Fruits (Fruit) VALUES $fruits", $connection);
The query ends up looking something like this:
$query = "INSERT INTO Fruits (Fruit)
VALUES
('Apple'),
('Pear'),
('Banana')";
This is probably the way you want to go.

If you have the mysqli class, you can iterate over the values to insert using a prepared statement.
$sth = $dbh->prepare("INSERT INTO Fruits (Fruit) VALUES (?)");
foreach($fruits as $fruit)
{
$sth->reset(); // make sure we are fresh from the previous iteration
$sth->bind_param('s', $fruit); // bind one or more variables to the query
$sth->execute(); // execute the query
}

One thing to note about your original solution compared with jerebear's implosion method (which I have used before, and love) is that it is easier to read. The implosion takes more programmer brain cycles to understand, which can be more expensive than processor cycles. Premature optimisation, blah, blah, blah... :)

One thing to note about jerebear's answer with multiple VALUE-blocks in one INSERT:
It can be rather dangerous for really large amounts of data, because most DBMSs have an upper limit on the size of the commands they can handle. If you exceed that with too many VALUE-blocks, your insert will fail. On MySQL, for example, the limit is set by max_allowed_packet, which AFAIK defaulted to 1MB in older versions.
So you should figure out what the maximum size is (ideally at runtime, might be available from the database metadata), and make sure you don't exceed it by spreading your lists of values over several INSERTs.
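A minimal sketch of that splitting, assuming the value tuples are already escaped and quoted. splitInserts() is a hypothetical helper, and the 55-byte limit is just a tiny number to make the demo split; in practice you would read the real limit from the server (on MySQL, SHOW VARIABLES LIKE 'max_allowed_packet') and leave some headroom.

```php
<?php
// Hypothetical helper: groups pre-escaped value tuples such as "('Apple')"
// into INSERT statements that each stay at or under $maxBytes.
function splitInserts(string $prefix, array $tuples, int $maxBytes): array
{
    $statements = [];
    $current = '';
    foreach ($tuples as $tuple) {
        $candidate = $current === '' ? $prefix . $tuple : $current . ',' . $tuple;
        if ($current !== '' && strlen($candidate) > $maxBytes) {
            $statements[] = $current;      // flush the statement that is full
            $candidate = $prefix . $tuple; // start a new one with this tuple
        }
        $current = $candidate;
    }
    if ($current !== '') {
        $statements[] = $current;
    }
    return $statements;
}

$prefix = 'INSERT INTO Fruits (Fruit) VALUES ';
// Demo limit of 55 bytes: the third tuple no longer fits and starts a new INSERT
$statements = splitInserts($prefix, ["('Apple')", "('Pear')", "('Banana')"], 55);
foreach ($statements as $statement) {
    echo $statement, PHP_EOL;
}
```

Note that a single tuple larger than the limit is still emitted as its own statement here, so a real version would need to detect and handle that case.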

I was inspired by jerebear's answer to build something like his second option for one of my current projects. Because of the sheer volume of records, I couldn't hold all the data in memory and save it at once. So I built this to do imports. You add your data, and then call a method when each record is done. After a certain, configurable, number of records, the data in memory will be saved with a mass insert like jerebear's second option.
// CREATE TABLE example ( Id INT, Field1 INT, Field2 INT, Field3 INT);
$import=new DataImport($dbh, 'example', 'Id, Field1, Field2, Field3');
foreach ($whatever as $row) {
// add data in the order of your column definition
$import->addValue($Id);
$import->addValue($Field1);
$import->addValue($Field2);
$import->addValue($Field3);
$import->nextRow();
}
$import->lastRow();
