Batch processing an array using PHP

I have thousands of records inside an array that was parsed from XML. My concern is the processing time of my script: does it suffer because I have a hundred thousand records to insert into the database? Is there a way to process the insertion of the data into the database in batches?

The syntax is:
INSERT INTO tablename (fld1, fld2) VALUES (val1, val2), (val3, val4), ... ;
So you can write something like this (dummy example; the values are escaped on the assumption that they are strings):
foreach ($data as $key => $value)
{
    $data[$key] = "('" . mysql_real_escape_string($value[0]) . "', '" . mysql_real_escape_string($value[1]) . "')";
}
$query = "INSERT INTO tablename (fld1, fld2) VALUES " . implode(',', $data);
This works quite fast even on huge datasets, and you don't need to worry about performance as long as your dataset fits in memory.

This is for SQL files - but you can follow its model (if not just use it) -
It splits the file up into parts of a size you specify, say 3,000 lines, and then inserts them on a timed interval (from under 1 second up to 1 minute or more).
This way a large file is broken into smaller inserts, etc.
This will help you bypass editing the PHP server configuration and worrying about memory limits, script execution time and the like.
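A minimal sketch of that timed-batch idea, applied directly to an array instead of an SQL file (the batch size, the delay, and the $rows variable are illustrative assumptions):
$batchSize = 3000;           // rows per INSERT, tune to taste
$delayMicroseconds = 250000; // pause 0.25 s between batches to spread the load

foreach (array_chunk($rows, $batchSize) as $batch) {
    $values = array();
    foreach ($batch as $row) {
        $values[] = "('" . mysql_real_escape_string($row[0]) . "', '" . mysql_real_escape_string($row[1]) . "')";
    }
    mysql_query("INSERT INTO tablename (fld1, fld2) VALUES " . implode(',', $values));
    usleep($delayMicroseconds); // timed interval between inserts, as the tool does
}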
New users can't insert links, so Google "sql big dump" or, if this works, go to:
http://www.ozerov.de/bigdump.php
So you could even theoretically modify the above script to accept your array as the data source instead of the SQL file. It would take some modification, obviously.
Hope it helps.
-R

It's unlikely to affect the processing time, but you'll need to ensure the DB's transaction logs are big enough to build a rollback segment for 100k rows.

Or with the ADOdb wrapper (http://adodb.sourceforge.net/):
// assuming you have your data in a form like this:
$params = array(
    array("key1", "val1"),
    array("key2", "val2"),
    array("key3", "val3"),
    // etc...
);
// you can do this - when given an array of rows, ADOdb
// prepares the statement once and executes it for each row:
$sql = "INSERT INTO `tablename` (`key`,`val`) VALUES ( ?, ? )";
$db->Execute($sql, $params);

Have you thought about array_chunk()? It worked for me in another project:
http://www.php.net/manual/en/function.array-chunk.php
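For example, a minimal sketch (assuming $parsedData holds the records parsed from the XML; the batch size of 1,000 is arbitrary):
// Split the records into sub-arrays of at most 1,000 records each
$batches = array_chunk($parsedData, 1000);
foreach ($batches as $batch) {
    // build and run one multi-row INSERT per batch,
    // using the implode() technique from the first answer above
}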

Related

Transfer data into optimized MySQL tables from "raw" tables in cron with php

I'm working with an MLS real estate listing provider (RETS). Every 48 hours we will be pulling data from their server in a cron job to an SQL database. I'm charged with the task of writing a PHP script that will run after the data from the remote server is dumped into our "raw" tables. In these raw tables, all columns are VARCHAR(255), and we want to move the data into optimized tables. Before I send my script to the guy in charge of setting up the cron job, I wondered if there is a more efficient way to do it, so I don't look foolish.
Here's what I'm doing:
There are 8 total tables, 4 raw and 4 optimized - all in the same database. The raw table column names are non-descriptive, like c1, c2, c3, c4, etc. This is intentional because the data that goes in each column may change. The raw table column names are mapped to the correct optimized table columns with PHP, something like this:
$tables['optimized_table_name1']['raw_table'] = 'raw_table_name1';
$tables['optimized_table_name1']['data_map'] = array(
    'c1' => array( // <--- "c1" is the raw table column name
        'column_name' => 'id',
        // I use other values for table creation,
        // but they don't matter to the question.
        // Just explaining why the array looks like this
        //'type' => 'VARCHAR',
        //'max_length' => 45,
        //'primary_key' => FALSE,
        // etc.
    ),
    'c9' => array('column_name' => 'address'),
    'c25' => array('column_name' => 'baths'),
    'c2' => array('column_name' => 'bedrooms') // etc.
);
I'm doing the same thing for each of the 4 tables: SELECT * FROM the raw table, read the config array and create a huge SQL insert statement, TRUNCATE the optimized table, then run the INSERT query.
foreach ($tables as $table_name => $config):
    $raw_table = $config['raw_table'];
    $data_map  = $config['data_map'];
    $fields = array();
    $values = array();
    $count  = 0;

    // Get the raw data and create an array mapped to the optimized table columns.
    $query = mysql_query("SELECT * FROM dbname.{$raw_table}");
    while ($row = mysql_fetch_assoc($query))
    {
        // Reading column names from my config file on first pass
        // Setting up the array, will only run once per table
        if (empty($fields))
        {
            foreach ($row as $key => $val)
            {   // Produces an array with the column names
                $fields[] = $data_map[$key]['column_name'];
            }
        }
        foreach ($row as $key => $val)
        {   // Assigns data to an array to be imploded later
            $values[$count][] = $val;
        }
        $count++;
    }

    // Create the INSERT statement string
    $insert = array();
    $sql = "\nINSERT INTO `{$table_name}` (`" . implode('`,`', $fields) . "`) VALUES\n";
    foreach ($values as $key => $vals)
    {
        foreach ($vals as &$val)
        {
            // Escape the data
            $val = mysql_real_escape_string($val);
        }
        // Using implode for simplicity, could avoid the nested foreach if I wanted to
        $insert[] = "('" . implode("','", $vals) . "')";
    }
    $sql .= implode(",\n", $insert) . ";\n";

    // TRUNCATE optimized table and run INSERT query here
endforeach;
Which produces something like this (only larger - about 15,000 records max per table, and one insert statement per table):
INSERT INTO `optimized_table_name1` (`id`,`beds`,`baths`,`town`) VALUES
('50300584','2','1','Fairfield'),
('87560584','3','2','New Haven'),
('76545584','2','1','Bristol');
Now I'll admit, I have been under the wing of an ORM for a long time and am not up on my vanilla mysql/php. This is a pretty simple task and I want to keep the code simple.
My questions:
Is the TRUNCATE/INSERT method a good way to do this?
Is there anything about my code that you can see being a problem? I know you see nested foreach loops and just shudder, but I want to keep the code as small and clean as possible and avoid lots of messy string concatenation (to produce the insert query). Like I said, I also haven't used native PHP functions for SQL in a long time.
I feel like it really doesn't matter if the code is not optimized if it is run at 3 AM every 2 days. Does it matter? Is this code going to perform OK?
Is there a better overall strategy to accomplish this task?
Do I need to be using transactions?
How can I be aware of errors that may occur in cron scripts?
Apologies if I don't use correct cron jargon; it's new to me.
Keep it simple. ORM would be swell for this task.
Answers:
Yes.
Your code is readable. At least I did not have any problems reading it.
We had a script that ran early in the morning. It was not optimized and consumed a lot of memory. After FOUR years it started to consume over 512 MB. I spent 2 hours optimizing it, and now it consumes 7 MB (pretty good optimization, huh? :) ). I personally think it is "ok" that your script is not optimized now. If this script starts failing, you'll figure out what the problem is. Maybe it will exhaust memory, maybe your SQL queries will cause deadlocks... maybe you will later optimize it to READ from slave servers... I don't know. It works fine now, and that's okay.
I'd do something similar to your code, but I'd probably generate a file first and load the data into the server by running the shell command mysql -u username --password=password < import_file.sql. That way the file is stored somewhere on disk so I can always take a look at it, and maybe even edit it for a one-time correction load. You can do that simply by writing your SQL statement into a file.
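A minimal sketch of that approach (the file path and credentials are illustrative assumptions, and exec() must be allowed by your PHP configuration):
// $sql holds the full INSERT statement built as in the question
$file = '/tmp/import_file.sql';
file_put_contents($file, $sql);

// Load the file with the mysql command-line client
exec('mysql -u username --password=password dbname < ' . escapeshellarg($file), $output, $exitCode);
if ($exitCode !== 0) {
    // the load failed; leave the file on disk for inspection
}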
No. It is just one query. If you use the InnoDB engine, it is already a transaction.
First, use error_reporting(E_ALL & ~E_NOTICE). Second, use the mysql_error() PHP function to ensure your query performed correctly. Third, in your cronjob, redirect the error stream into a file, like so:
0 7 * * 0 /path/to/php -c /path/to/php.ini /path/to/script.php 2> /tmp/errors_file
You can then create a SECOND script that runs after the first one and notifies you about errors in script.php by email or... whatever way of notifying you prefer. I'd prefer to register a shutdown function that checks the error file and, if it is not empty, notifies you and deletes it afterwards.
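A minimal sketch of that shutdown-function idea (the error-file path and the recipient address are illustrative assumptions):
function notify_on_errors() {
    $errorFile = '/tmp/errors_file';
    if (is_file($errorFile) && filesize($errorFile) > 0) {
        // something was written to the error stream; send it along and clean up
        mail('you@example.com', 'Cron import errors', file_get_contents($errorFile));
        unlink($errorFile);
    }
}
register_shutdown_function('notify_on_errors');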
Just my opinion, but I hope my answer helps.

php out of memory, array breaking into subarrays

I am using an update function that inserts some 40,000 rows into a MySQL database. While building that array, I am getting an out of memory error (tried to allocate 41 bytes).
The final function is like this:
function Tright($area) {
    foreach ($area as $a1 => &$a2) {
        mysql_query('INSERT INTO 0_right SET section=\'' . $a2['sec_id'] . '\', user_id=\'' . $a2['user_id'] . '\', rank_id=\'' . $a2['rank_id'] . '\', menu_id=\'' . $a2['menu_id'] . '\', droit = 1;');
    }
}
I have two questions. Is it natural that the above workload becomes too much for PHP to handle?
If not, can anyone suggest where I should check? And if yes, is there a way to break that $area array into subarrays and execute the function, so that maybe I won't get the out of memory issue? Any other workaround?
Thanks guys.
Edit: @halfdan, @Patrick Fisher, both of you have spoken about making a single multi-insert query. How do you do that, in this example, please?
First, you should combine all the values into a single INSERT statement, instead of 40,000 different queries.
Second, yes, it is quite natural that you are running out of memory. You can increase this limit at runtime with ini_set(), e.g. ini_set('memory_limit', '16M');
To insert multiple values at once, your SQL should look something like this:
INSERT INTO 0_right (section, user_id, rank_id, menu_id, droit) VALUES
(1,1,1,1,1),
(1,2,1,1,1),
(1,3,1,1,1)
You can build the query like so:
$values = '';
foreach ($area as $a) {
    if ($values != '') {
        $values .= ',';
    }
    // NB: escape the values first (e.g. with mysql_real_escape_string) if they can contain quotes
    $values .= "('{$a['sec_id']}', '{$a['user_id']}', '{$a['rank_id']}', '{$a['menu_id']}', 1)";
}
$sql = "INSERT INTO 0_right (section, user_id, rank_id, menu_id, droit) VALUES $values";
These are all nice answers to get around the problem. If this is a one-time script, just bump up the RAM, make sure the script doesn't time out (max_execution_time in the php.ini file), and you should be fine.
It might run faster as one big insert statement, but then you'd pay the cost of constructing the huge query on the PHP side (so the out of memory issue will still be there, and will be even worse with the string concatenation). But honestly, who cares if you're just running this once?
However, if you're going to perform this operation all the time (e.g. on a webpage), I'd recommend other approaches... like restricting the size of $area, cutting the feature, or storing the data differently.
PHP has a built-in (configurable) memory limit to prevent a single script that goes the wrong way from bogging down the whole machine. Depending on your version, that limit defaults to 8MB (pre-5.2.0), 16MB (5.2.0), or 128MB (5.3.0).
You can change the limit either via ini_set or in the php.ini.

Downloading Large Data Sets -> Text to MySQL or just to MySQL?

I'm downloading large sets of data via an XML query through PHP, with the following scenario:
- Query for records 1-1000 and download all parts (1000 parts is roughly 4.5 MB of text), then store those in memory while I query the next 1001-2000, and so on (up to potentially 400k)
I'm wondering if it would be better to write these entries to a text file, rather than storing them in memory, and once the complete download is done try to insert them all into the DB, or to write them to the DB as they come in.
Any suggestions would be greatly appreciated.
Cheers
You can run a query like this:
INSERT INTO table (id, text)
VALUES (null, 'foo'), (null, 'bar'), ..., (null, 'value no 1000');
Doing this, you'll do the whole thing in one shot, and the parser will be called once. The best you can do is benchmark it, e.g. comparing 1,000 runs of a query that inserts 1,000 records against 1,000,000 single-record inserts.
(Sorry about the prev. answer, I've misunderstood the question).
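One note on the benchmarking suggestion above: MySQL's BENCHMARK() function only evaluates scalar expressions, so it cannot time INSERTs directly. A simple timing harness on the PHP side works instead (a minimal sketch; $multiRowSql and $singleRowSqls are assumed to hold pre-built statements):
$start = microtime(true);
mysql_query($multiRowSql); // strategy A: one multi-row INSERT
echo 'multi-row: ' . (microtime(true) - $start) . "s\n";

$start = microtime(true);
foreach ($singleRowSqls as $sql) { // strategy B: one INSERT per record
    mysql_query($sql);
}
echo 'row-by-row: ' . (microtime(true) - $start) . "s\n";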
I think you should write them to the database as soon as you receive them. This will save memory, and you won't have to execute one 400-times-slower query at the end. You will need a mechanism to deal with any problems that may occur in this process, like a disconnection after 399k results.
In my experience it would be better to download everything into a temporary area and then, when you are sure that everything went well, move the data (or the files) into place.
As you are using a database, you may want to dump everything into a table, with something like this code:
$error = false;
while ( ($row = getNextRow($db)) && !$error ) {
    // NB: `key` is a reserved word in MySQL, so it must be backtick-quoted,
    // and the values are escaped before being interpolated
    $sql = "INSERT INTO temptable (`key`, `value`) VALUES ('"
         . mysql_real_escape_string($row[0]) . "', '"
         . mysql_real_escape_string($row[1]) . "')";
    if (mysql_query($sql)) {
        echo '#';
    } else {
        $error = true;
    }
}
if (!$error) {
    $sql = "INSERT INTO myTable (SELECT * FROM temptable)";
    if (mysql_query($sql)) {
        echo 'Finished';
    } else {
        echo 'Error';
    }
}
Alternatively, if you know the table well, you can add a "new" flag field for newly inserted rows and update everything when you are finished.
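A minimal sketch of that flag-field variant (the column name is an illustrative assumption):
// one-time schema change: add the flag column
mysql_query("ALTER TABLE myTable ADD COLUMN is_new TINYINT(1) NOT NULL DEFAULT 0");

// during the download, insert rows directly with the flag set
mysql_query("INSERT INTO myTable (`key`, `value`, is_new) VALUES ('k', 'v', 1)");

// once everything has arrived safely, clear the flag...
mysql_query("UPDATE myTable SET is_new = 0");
// ...or, if something went wrong, roll back by deleting the new rows
// mysql_query("DELETE FROM myTable WHERE is_new = 1");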

How to speed up processing a huge text file?

I have an 800 MB text file with 18,990,870 lines in it (each line is a record) from which I need to pick out certain records and, if there is a match, write them into a database.
It is taking an age to work through them, so I wondered if there was a way to do it any quicker?
My PHP is reading a line at a time as follows:
$fp2 = fopen('download/pricing20100714/application_price', 'r');
if (!$fp2) { echo 'ERROR: Unable to open file.'; exit; }
while (!feof($fp2)) {
    $line = stream_get_line($fp2, 128, $eoldelimiter); // use 2048 if very long lines
    if ($line[0] === '#') continue; // Skip lines that start with #
    $field = explode($delimiter, $line);
    list($export_date, $application_id, $retail_price, $currency_code, $storefront_id) = explode($delimiter, $line);
    if ($currency_code == 'USD' and $storefront_id == '143441') {
        // does application_id exist?
        $application_id = mysql_real_escape_string($application_id);
        $query = "SELECT * FROM jos_mt_links WHERE link_id='$application_id';";
        $res = mysql_query($query);
        if (mysql_num_rows($res) > 0) {
            echo $application_id . " application id has price of " . $retail_price . " with currency of " . $currency_code . "\n";
        } // end if exists in SQL
    } else {
        // no, application_id doesn't exist
    } // end check for currency and storefront
} // end while statement
fclose($fp2);
At a guess, the performance issue is because it issues a query for each application_id with USD and your storefront.
If space and IO aren't an issue, you might just blindly write all 19M records into a new staging DB table, add indices and then do the matching with a filter?
Don't try to reinvent the wheel; it's been done. Use a database to search through the file's content. You can load that file into a staging table in your database and query your data using indexes for fast access, if they add value. Most if not all databases have import/loading tools to get a file into the database relatively fast.
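For MySQL, that loading tool is LOAD DATA INFILE; a minimal sketch (the file path, the tab delimiter and the staging-table layout are illustrative assumptions):
// bulk-load the raw file into a staging table in one statement
mysql_query("LOAD DATA INFILE '/path/to/application_price'
    INTO TABLE staging_prices
    FIELDS TERMINATED BY '\\t'
    LINES TERMINATED BY '\\n'
    (export_date, application_id, retail_price, currency_code, storefront_id)");

// then match with one indexed, set-based query instead of millions of lookups
$res = mysql_query("SELECT s.application_id, s.retail_price
    FROM staging_prices s
    JOIN jos_mt_links j ON j.link_id = s.application_id
    WHERE s.currency_code = 'USD' AND s.storefront_id = '143441'");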
19M rows in the DB will be slow if the DB is not designed properly. You can still use text files if they are partitioned properly; recreating multiple smaller files, based on certain parameters and stored in properly sorted order, might work.
Anyway, PHP is not the best language for file IO and processing; it is much slower than Java for this task, while plain old C would be one of the fastest for the job. PHP should be restricted to generating dynamic Web output, while core processing should be in Java/C. Ideally it would be a Java/C service that generates the output, with PHP using that feed to generate HTML output.
You are parsing the input line twice by doing two explodes in a row. I would start by removing the first line:
$field = explode ($delimiter, $line);
list($export_date, ...., $storefront_id ) = explode($delimiter, $line);
Also, if you are only using the query to test for a match based on your condition, don't use SELECT *; use something like this:
"SELECT 1 FROM jos_mt_links WHERE link_id='$application_id';"
You could also, as Brandon Horsley suggested, buffer a set of application_id values in an array and modify your select statement to use the IN clause thereby reducing the number of queries you are performing.
Have you tried profiling the code to see where it's spending most of its time? That should always be your first step when trying to diagnose performance problems.
Preprocess with sed and/or awk?
Databases are built and designed to cope with large amounts of data, PHP isn't. You need to re-evaluate how you are storing the data.
I would dump all the records into a database, then delete the records you don't need. Once you have done that, you can copy those records wherever you want.
As others have mentioned, the expense is likely in your database query. It might be faster to load a batch of records from the file (instead of one at a time) and perform one query to check multiple records.
For example, load 1000 records that match the USD currency and storefront at a time into an array and execute a query like:
$query = "SELECT link_id FROM jos_mt_links WHERE link_id IN (" . implode(',', $application_id_array) . ")";
This will return a list of those records that are in the database. Alternatively, you could change the SQL to NOT IN to get a list of those records that are not in the database.
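A minimal sketch of that buffering approach, fitted into the read loop from the question (the batch size of 1,000 is an arbitrary assumption, and only the id check is shown):
$batch = array();
while (!feof($fp2)) {
    $line = stream_get_line($fp2, 128, $eoldelimiter);
    if ($line[0] === '#') continue;
    list($export_date, $application_id, $retail_price, $currency_code, $storefront_id) = explode($delimiter, $line);
    if ($currency_code == 'USD' and $storefront_id == '143441') {
        $batch[] = "'" . mysql_real_escape_string($application_id) . "'";
        if (count($batch) >= 1000) {
            // one query checks 1,000 ids at once
            $res = mysql_query('SELECT link_id FROM jos_mt_links WHERE link_id IN (' . implode(',', $batch) . ')');
            while ($match = mysql_fetch_assoc($res)) {
                echo $match['link_id'] . " exists\n";
            }
            $batch = array();
        }
    }
}
// remember to check the final, partial batch after the loop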

Is it bad to put a MySQL query in a PHP loop?

I often have large arrays or large amounts of dynamic data in PHP that I need MySQL queries to handle.
Is there a better way to run many processes like INSERT or UPDATE without looping through the information to be INSERT-ed or UPDATE-ed?
Example (I didn't use a prepared statement for brevity's sake):
$myArray = array('apple', 'orange', 'grape');
foreach ($myArray as $arrayFruit) {
    $query = "INSERT INTO `Fruits` (`FruitName`) VALUES ('" . $arrayFruit . "')";
    mysql_query($query, $connection);
}
OPTION 1
You can run multiple statements at once. Note, though, that the old mysql_query() refuses multiple statements in a single call, so this needs mysqli_multi_query():
$queries = '';
foreach ($myArray as $arrayFruit) {
    $queries .= "INSERT INTO `Fruits` (`FruitName`) VALUES ('" . $arrayFruit . "');"; // notice the semicolon
}
mysqli_multi_query($connection, $queries);
This would save on your processing.
OPTION 2
If your inserts are that simple and all for the same table, you can do multiple inserts in ONE query:
$fruits = "('" . implode("'), ('", $fruitsArray) . "')"; // escape the values first if they may contain quotes
mysql_query("INSERT INTO Fruits (Fruit) VALUES $fruits", $connection);
The query ends up looking something like this:
$query = "INSERT INTO Fruits (Fruit)
VALUES
('Apple'),
('Pear'),
('Banana')";
This is probably the way you want to go.
If you have the mysqli extension, you can iterate over the values to insert using a prepared statement. (Since bind_param() binds variables by reference, you could also bind $fruit once before the loop.)
$sth = $dbh->prepare("INSERT INTO Fruits (Fruit) VALUES (?)");
foreach ($fruits as $fruit)
{
    $sth->reset();                 // make sure we are fresh from the previous iteration
    $sth->bind_param('s', $fruit); // bind one or more variables to the query
    $sth->execute();               // execute the query
}
One thing to note about your original solution over the implosion method of jerebear (which I have used before, and love) is that it is easier to read. The implosion takes more programmer brain cycles to understand, which can be more expensive than processor cycles. Premature optimisation, blah, blah, blah... :)
One thing to note about jerebear's answer with multiple VALUE blocks in one INSERT:
It can be rather dangerous for really large amounts of data, because most DBMS have an upper limit on the size of the commands they can handle. If you exceed that with too many VALUE blocks, your insert will fail. On MySQL, for example, the limit is the max_allowed_packet setting, usually 1MB by default AFAIK.
So you should figure out what the maximum size is (ideally at runtime; it might be available from the database metadata), and make sure you don't exceed it by spreading your lists of values over several INSERTs.
I was inspired by jerebear's answer to build something like his second option for one of my current projects. Because of the sheer volume of records I couldn't save all the data at once. So I built this to do imports. You add your data, and then call a method when each record is done. After a certain, configurable, number of records, the data in memory will be saved with a mass insert like jerebear's second option.
// CREATE TABLE example ( Id INT, Field1 INT, Field2 INT, Field3 INT);
$import = new DataImport($dbh, 'example', 'Id, Field1, Field2, Field3');
foreach ($whatever as $row) {
    // add data in the order of your column definition
    $import->addValue($Id);
    $import->addValue($Field1);
    $import->addValue($Field2);
    $import->addValue($Field3);
    $import->nextRow();
}
$import->lastRow();
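The DataImport class itself isn't shown in the post; a minimal sketch of what it might look like (a hypothetical reconstruction, assuming a mysqli connection in $dbh and a flush threshold of 500 rows):
class DataImport
{
    private $dbh;
    private $table;
    private $columns;
    private $batchSize;
    private $currentRow = array();
    private $rows = array();

    public function __construct($dbh, $table, $columns, $batchSize = 500)
    {
        $this->dbh       = $dbh;
        $this->table     = $table;
        $this->columns   = $columns;
        $this->batchSize = $batchSize;
    }

    public function addValue($value)
    {
        // escape each value as it is added
        $this->currentRow[] = "'" . $this->dbh->real_escape_string($value) . "'";
    }

    public function nextRow()
    {
        // close out the current record and flush when the batch is full
        $this->rows[] = '(' . implode(',', $this->currentRow) . ')';
        $this->currentRow = array();
        if (count($this->rows) >= $this->batchSize) {
            $this->flush();
        }
    }

    public function lastRow()
    {
        // flush whatever is left after the loop
        $this->flush();
    }

    private function flush()
    {
        if (empty($this->rows)) return;
        $this->dbh->query(
            "INSERT INTO {$this->table} ({$this->columns}) VALUES " . implode(',', $this->rows)
        );
        $this->rows = array();
    }
}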
