I am building a PHP web application that lets a user upload an MS Access database (CSV export) that is then translated and migrated into a MySQL database.
The MS Access database consists of one table called t_product with 100k rows. This table is not designed well. As an example, the following query:
SELECT part_number, model_number FROM t_product
will return:
part_number model_number
100 AX1000, AX1001, AX1002
101 CZ10, CZ220, MB100
As you can see, the model numbers are listed as comma-separated values instead of individual records in another table. There are many more issues of this nature. I'm writing a script to clean this data before importing it into the MySQL database. The script will also map the existing Access columns to a properly designed relational database.
My issue is that my script takes too long to complete. Here's simplified code to explain what I'm doing:
$handle = fopen("MSAccess.csv", "r");
// get each row from the csv
while ($data=fgetcsv($handle, 1000, ","))
{
mysql_query("INSERT INTO t_product (col1, col2 etc...) values ($data[0], $data[1], etc...");
$prodId = mysql_last_insert_id();
// using model as an example, there are other columns
// with csv values that need to be broken up
$arrModel = explode(',', $data[2]);
foreach($arrModel as $modelNumber)
mysql_query("INSERT INTO t_model (product_id, col1, col2 etc...) values ($prodId, $modelNumber[0], $modelNumber[1] etc...");
}
The problem here is that each while-loop iteration makes a tremendous number of calls to the database. For every product record, I have to insert N model numbers, Y part numbers, X serial numbers etc...
I started another approach where I stored the whole CSV in an array. I then write one batch query like
$sql = "INSERT INTO t_product (col1, col2, etc...) values ";
foreach($arrParam as $val)
$sql .= " ($val[0], $val[1], $val[2]), "
But I ran into excessive memory errors with this approach. I increased the max memory limit to 64M and I'm still running out of memory.
What is the best way to tackle this problem?
Maybe I should write all my queries to a *.sql file first, then import the *.sql file into the mysql database?
This may be entirely not the direction you want to go, but you can generate the MySQL creation script directly from MS Access with the free MySQL Migration Toolkit
Perhaps you could allow the user to upload the Access db, and then have your PHP script call the Migration toolkit?
If you're going to try optimizing the code you have there already, I would try aggregating the INSERTS and see if that helps. This should be easy to add to your code. Something like this (C# pseudocode):
int flushCount = 0;
while (!done)
{
// Build next query, concatenate to last set of queries
if (++flushCount == 5)
{
// Flush queries to database
// Reset query string to empty
flushCount = 0;
}
}
// Flush remaining queries to the database
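For reference, the same batching idea in PHP might look roughly like this. It's only a sketch: the two columns, the batch size of 500, and the escaping are placeholders, not taken from the question's real schema.
$handle = fopen("MSAccess.csv", "r");
$values = array();
while (($data = fgetcsv($handle, 1000, ",")) !== false) {
    // collect one VALUES tuple per CSV row, escaping anything that came from the file
    $values[] = "('" . mysql_real_escape_string($data[0]) . "', '"
                     . mysql_real_escape_string($data[1]) . "')";
    // flush every 500 rows so neither the SQL string nor PHP memory grows without bound
    if (count($values) >= 500) {
        mysql_query("INSERT INTO t_product (col1, col2) VALUES " . implode(", ", $values));
        $values = array();
    }
}
// flush whatever is left over after the loop
if (count($values) > 0) {
    mysql_query("INSERT INTO t_product (col1, col2) VALUES " . implode(", ", $values));
}
fclose($handle);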
I decided to write all my queries into a .sql file. This gave me the opportunity to normalize the CSV file into a proper relational database. Afterwards, my PHP script called exec("mysql -h dbserver.com -u myuser -pmypass dbname < db.sql");
This solved my memory problems and it was much faster than multiple queries from php.
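In case it helps anyone else, the rough shape of that approach is below. It's a sketch only: the two columns are placeholders, addslashes() stands in for proper escaping, and the credentials are the same dummy ones as above.
$in  = fopen("MSAccess.csv", "r");
$out = fopen("db.sql", "w");
while (($data = fgetcsv($in, 1000, ",")) !== false) {
    // write the INSERT statements to a file instead of executing them one by one
    fwrite($out, "INSERT INTO t_product (col1, col2) VALUES ('"
        . addslashes($data[0]) . "', '" . addslashes($data[1]) . "');\n");
}
fclose($in);
fclose($out);
// then let the mysql client load the whole file in one go
exec("mysql -h dbserver.com -u myuser -pmypass dbname < db.sql");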
I have looked around a lot and tried different methods, and I want to improve my import mechanism for big data. Importing data on insert works great; however, I hit an issue when I want to update existing data based on two WHERE conditions.
I first load the data from the source and place it in a CSV file, then use LOAD DATA LOCAL INFILE to import the data into a temp table.
Then I insert from the temp table into the main table as follows, which works as expected: fast, and using a low amount of server resources.
INSERT INTO $table ($fields) SELECT $fields FROM $temptable WHERE (ua,gm_id) NOT IN (SELECT ua,gm_id FROM $table)
I then have the following to update the records. The reason I created this method is that the ON DUPLICATE KEY UPDATE approach did not work; it always inserted a new record. I think I either don't understand how that method works or have not used it in the right way. Both UA and GM_ID are indexes on both tables, but I can't get it to work. The issue with the script below is that, if I update 8000 rows, it uses 200% CPU and takes 5 to 8 minutes, which is of course not great.
$query = "UPDATE $table a INNER JOIN $temptable b ON a.gm_id=b.gm_id AND a.ua=b.ua SET ";
foreach($update_columns as $column => $status){
$query .= "a.$column=b.$column,";
}
$query = trim($query, ",");
$result = $pdo->query($query);
Can someone point me in the right direction as to what I should be using?
I want to update certain columns from the temp table to the main table. This code executes a lot of times during the day; sometimes it may update just 100 rows, but sometimes 8k or 60k rows, and the columns can change.
I hope the sample codes are clear.
Thanks in advance for assistance.
"Both UA and GM_ID are indexes on both tables" -- Two separate indexes is the wrong approach. You must have a "composite" UNIQUE(UA, GM_ID) (in either order). If that pair is not unique, then you cannot use IODKU.
WHERE .. NOT IN ( SELECT ... ) is very inefficient. WHERE ... NOT EXISTS ( SELECT ... ) is better; LEFT JOIN ... WHERE .. IS NULL is even better. See "SQL #1" in http://mysql.rjweb.org/doc.php/staging_table#normalization
Read the rest of that blog for more tips on high speed ingestion.
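For the "insert only the new rows" step, the LEFT JOIN form recommended there might look something like this (again a sketch; the column list is illustrative):
$pdo->exec("
    INSERT INTO $table (ua, gm_id, col1, col2)
    SELECT b.ua, b.gm_id, b.col1, b.col2
    FROM $temptable b
    LEFT JOIN $table a ON a.ua = b.ua AND a.gm_id = b.gm_id
    WHERE a.ua IS NULL
");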
I've got a server with a live traffic-log DB that holds a big stats table. Now I need to create a smaller table from it, let's say 30 days back.
This server also has a slave server that copies the data and is 5 sec behind the master.
I created this slave in order to offload SELECT queries, so the master only handles the inserts/updates for the traffic log.
Now I need to copy the last day into the smaller table, and still not touch the "real" DB,
so I need to select from the slave and insert into the real smaller table. (The slave only allows read operations.)
I am working with PHP and I can't solve this with one query that uses the two different databases... If it's possible, please let me know how.
When using two queries I need to hold the last day's data as a PHP MySQL result object. For 300K-650K rows, that starts to become a memory problem. I would select in chunks by ID (by setting the IDs in the WHERE clause), but I don't have an auto-increment ID field and there's no ID for the rows (when storing traffic data, an ID would take a lot of space).
So I am trying this idea and I would like to get a second opinion.
If I take the whole last day at once (300K rows) it will overload PHP's memory.
I can use LIMIT chunks, or a new idea: selecting one column at a time and copying that column to the new real table. But I don't know if the second method is possible. Does INSERT fill the first open space at a column level or at a row level?
The main idea is reducing the size of the select, so is it possible to build a select by columns and then insert them as columns in MySQL?
If this is simply a memory problem in PHP you could try using PDO and fetching one result row at a time instead of all of them at the same time.
From PHP.net for PDO:
<?php
function getFruit($conn) {
$sql = 'SELECT name, color, calories FROM fruit ORDER BY name';
foreach ($conn->query($sql) as $row) {
print $row['name'] . "\t";
print $row['color'] . "\t";
print $row['calories'] . "\n";
}
}
?>
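One caveat: by default the MySQL driver behind PDO buffers the entire result set in PHP memory even if you fetch row by row, so for a 300K-row select you may also want to turn buffering off. A sketch (the SELECT itself is just an example):
<?php
$conn->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
// rows are now streamed from the server instead of being copied into PHP memory first
foreach ($conn->query('SELECT col1, col2 FROM stats WHERE day = CURDATE() - INTERVAL 1 DAY') as $row) {
    // handle one row at a time, e.g. write it to the smaller table
}
?>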
Well, here is where PHP starts to be weird. I took your advice and started to use chunks for the data: I used a loop that advances a LIMIT in 2000-row jumps. What was interesting is that, when I started using PHP's memory usage and memory peak functions, I found out that the reason the chunk method doesn't work at large scale in a loop is that assigning a new value to a variable doesn't release the memory of what was there before the new assignment, so you must use unset() or null in order to keep your PHP memory under control.
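For reference, the chunked version described in that comment looks roughly like this (a sketch: the 2000-row chunk size matches the comment, the table and column names are made up):
$offset = 0;
$chunk  = 2000;
do {
    $stmt = $conn->query("SELECT col1, col2 FROM stats LIMIT $chunk OFFSET $offset");
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    foreach ($rows as $row) {
        // ... insert $row into the smaller table ...
    }
    $count   = count($rows);
    $offset += $chunk;
    // release the chunk explicitly, otherwise PHP's peak memory keeps climbing
    unset($rows, $stmt);
} while ($count == $chunk);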
Before I go on, this is purely a question of intuition. That is, I'm not seeking answers to work out specific bugs in my PHP/MySQL code. Rather, I want to understand what the range of possible issues that I need to consider in resolving my issue. To these ends, I will not post code or attach scripts - I will simply explain what I did and what is happening.
I have written a PHP script that
Reads a CSV text file of X records to be inserted into a MySQL database table and/or update duplicate entries where applicable;
Inserts said records into what I will call a "root" table for that data set;
Selects subset records of specific fields from the "root" table and then inserts those records into a "master" table; and
Creates an output export text file from the master table for distribution.
There are several CSV files that I am processing via separate scheduled cron tasks every 30 minutes. All said, from the various sources, there are an estimated 420,000 insert transactions from file to root table, and another 420,000 insert transactions from root table to master table via the scheduled tasks.
One of the tasks involves a CSV file of about 400,000 records by itself. The processing contains no errors, but here's the problem: of the 400,000 records that MySQL indicates have been successfully inserted into the root table, only about 92,000 of those records actually store in the root table - I'm losing about 308,000 records from that scheduled task.
The other scheduled tasks process about 16,000 and 1,000 transactions respectively, and these transactions process perfectly. In fact, if I reduce the number of transactions from 400,000 to, say, 10,000, then these process just fine as well. Clearly, that's not the goal here.
To address this issue, I have tried several remedies...
Upping the memory of my server (and increasing the max limit in the php.ini file)
Getting a dedicated database with expanded memory (as opposed to a shared VPS database)
Rewriting my code to substantially eliminate stored arrays that suck down memory, processing the fgetcsv() rows on the fly
Using INSERT DELAYED MySQL statements (as opposed to plain INSERT statements)
...and none of these remedies have worked as desired.
What range of remedial actions should be considered at this point, given the lack of success in the actions taken so far? Thanks...
The source data in the CSV may have duplicate records. Even though there are 400,000 records in the CSV, your 'insert or update' logic trims them down to a reduced set. Low memory could lead to exceptions etc., but not to this kind of data loss.
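A quick way to test that theory is to count distinct keys in the file before blaming the import itself. A sketch (column 0 here is a placeholder for whatever column your insert-or-update logic keys on):
<?php
$csv   = fopen('bigfile.csv', 'r');
$keys  = array();
$lines = 0;
while (($row = fgetcsv($csv)) !== false) {
    $lines++;
    $keys[$row[0]] = true; // collapses duplicates the same way a unique key would
}
fclose($csv);
echo $lines . ' lines, ' . count($keys) . ' distinct keys' . PHP_EOL;
?>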
I suspect there are problems in the CSV file.
My suggestion:
Print something for debugging information on each line read from the CSV. This will show you how many lines are processed.
On every insert/update, print any error (if any)
It's something like this:
<?php
$csv = fopen('sample.csv', 'r'); $line = 1;
while (($item = fgetcsv($csv)) !== false) {
echo 'Line ' . $line++ . '... ';
$sql = ''; // your SQL query
mysql_query($sql);
$error = mysql_error();
if ($error == '') {
echo 'OK' . PHP_EOL;
} else {
echo 'FAILED' . PHP_EOL . $error . PHP_EOL;
}
}
So, if there are any errors, you can see them and find the problem (which lines of the CSV have problems).
I'm importing a CSV file into a MySQL DB. I haven't looked into bulk insert yet, but I was wondering: is it more efficient to construct a massive INSERT statement (using PHP) by looping through the values, or is it more efficient to do an individual INSERT for each CSV row?
Inserting in bulk is much faster. I'll typically do something like this which imports data 100 records at a time (The 100 record batch size is arbitrary).
$a_query_inserts = array();
$i_progress = 0;
foreach( $results as $a_row ) {
$i_progress++;
$a_query_inserts[] = "({$a_row['Column1']}, {$a_row['Column2']}, {$a_row['Column3']})";
if( count($a_query_inserts) >= 100 || $i_progress >= $results->rowCount() ) {
$s_query = sprintf("INSERT INTO Table
(Column1,
Column2,
Column3)
VALUES
%s",
implode(', ', $a_query_inserts)
);
db::getInstance()->query($s_query);
// Reset batch
$a_query_inserts = array();
}
}
There is also a way to load the file directly into the database.
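That would be LOAD DATA INFILE. A minimal sketch, assuming a PDO connection and using placeholder file, table, and column names:
// note: LOCAL requires the connection to be created with
// PDO::MYSQL_ATTR_LOCAL_INFILE => true in the driver options
$sql = <<<SQL
LOAD DATA LOCAL INFILE '/path/to/import.csv'
INTO TABLE Table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
(Column1, Column2, Column3)
SQL;
$pdo->exec($sql);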
I don't know the specifics of how PHP makes connections to MySQL, but every insert request is going to have some amount of overhead beyond the data for the insert itself. Therefore I would imagine a bulk insert would be much more efficient than repeated database calls.
It is difficult to give an answer without knowing at least two more elements:
1) Is your DB running on the same server where the PHP code runs?
2) How "big" is the file? I.e. average 20 csv records? 200? 20000?
In general, looping through the CSV file and firing an INSERT statement for each row (please use prepared statements, though, or your DB will spend time parsing the same string every single time; see the sketch below) would be the more "traditional" approach and would be efficient enough unless you have a really slow connection between PHP and the DB.
Even in that case, if the CSV file is more than 20 records long, the single massive INSERT would probably start running into problems with the maximum statement length the SQL parser accepts.
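For the row-by-row approach, "use prepared statements" boils down to something like this (a sketch with made-up column names; the DSN and the transaction around the loop are my own additions, though wrapping the whole file in one transaction usually helps a lot too):
$pdo    = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$handle = fopen('import.csv', 'r');
// parsed once by the server, executed once per row
$stmt = $pdo->prepare('INSERT INTO t_product (col1, col2, col3) VALUES (?, ?, ?)');
$pdo->beginTransaction();
while (($row = fgetcsv($handle, 1000, ',')) !== false) {
    $stmt->execute(array($row[0], $row[1], $row[2]));
}
$pdo->commit();
fclose($handle);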
I'm working on a research project that requires me to process large csv files (~2-5 GB) with 500,000+ records. These files contain information on government contracts (from USASpending.gov). So far, I've been using PHP or Python scripts to attack the files row-by-row, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks to see if the entity named is already in the database (using a combination of string and regex matching); if it is not, it first adds the entity to a table of entities and then proceeds to parse the rest of the record and inserts the information into the appropriate tables. The list of entities is over 100,000.
Here are the basic functions (part of a class) that try to match each record with any existing entities:
private function _getOrg($data)
{
// if name of organization is null, skip it
if($data[44] == '') return null;
// use each of the possible names to check if organization exists
$names = array($data[44],$data[45],$data[46],$data[47]);
// cycle through the names
foreach($names as $name) {
// check to see if there is actually an entry here
if($name != '') {
if(($org_id = $this->_parseOrg($name)) != null) {
$this->update_org_meta($org_id,$data); // updates some information of existing entity based on record
return $org_id;
}
}
}
return $this->_addOrg($data);
}
private function _parseOrg($name)
{
// check to see if it matches any org names
// db class function, performs simple "like" match
$this->db->where('org_name',$name,'like');
$result = $this->db->get('orgs');
if(mysql_num_rows($result) == 1) {
$row = mysql_fetch_object($result);
return $row->org_id;
}
// check to see if matches any org aliases
$this->db->where('org_alias_name',$name,'like');
$result = $this->db->get('orgs_aliases');
if(mysql_num_rows($result) == 1) {
$row = mysql_fetch_object($result);
return $row->org_id;
}
return null; // no matches, have to add new entity
}
The _addOrg function inserts the new entity's information into the db, where hopefully it will match subsequent records.
Here's the problem: I can only get these scripts to parse about 10,000 records / hour, which, given the size, means a few solid days for each file. The way my db is structured requires a several different tables to be updated for each record because I'm compiling multiple external datasets. So, each record updates two tables, and each new entity updates three tables. I'm worried that this adds too much lag time between MySQL server and my script.
Here's my question: is there a way to import the text file into a temporary MySQL table and then use internal MySQL functions (or PHP/Python wrapper) to speed up the processing?
I'm running this on my Mac OS 10.6 with local MySQL server.
Load the file into a temporary/staging table using LOAD DATA INFILE and then use a stored procedure to process the data - it shouldn't take more than 1-2 minutes at the most to completely load and process the data.
you might also find some of my other answers of interest:
Optimal MySQL settings for queries that deliver large amounts of data?
MySQL and NoSQL: Help me to choose the right one
How to avoid "Using temporary" in many-to-many queries?
60 million entries, select entries from a certain month. How to optimize database?
Interesting presentation:
http://www.mysqlperformanceblog.com/2011/03/18/video-the-innodb-storage-engine-for-mysql/
example code (may be of use to you)
truncate table staging;
start transaction;
load data infile 'your_data.dat'
into table staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
(
org_name
...
)
set
org_name = nullif(org_name,'');
commit;
drop procedure if exists process_staging_data;
delimiter #
create procedure process_staging_data()
begin
insert ignore into organisations (org_name) select distinct org_name from staging;
update...
etc..
-- or use a cursor if you have to ??
end#
delimiter ;
call process_staging_data();
Hope this helps
It sounds like you'd benefit the most from tuning your SQL queries, which is probably where your script spends the most time. I don't know how the PHP MySQL client performs, but MySQLdb for Python is fairly fast. Doing naive benchmark tests I can easily sustain 10k/sec insert/select queries on one of my older quad-cores. Instead of doing one SELECT after another to test if the organization exists, using a REGEXP to check for them all at once might be more efficient (discussed here: MySQL LIKE IN()?). MySQLdb lets you use executemany() to do multiple inserts simultaneously; you could almost certainly leverage that to your advantage, and perhaps your PHP client lets you do the same thing.
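To make the REGEXP idea concrete: instead of up to four separate LIKE lookups per record, the candidate names could be checked in one query, roughly like this (a sketch using PDO rather than the question's db class; it assumes the names contain no regex metacharacters):
$names = array_filter(array($data[44], $data[45], $data[46], $data[47]));
// one round trip against orgs (the same idea works for orgs_aliases)
$stmt = $pdo->prepare('SELECT org_id FROM orgs WHERE org_name REGEXP ?');
$stmt->execute(array('^(' . implode('|', $names) . ')$'));
$org_id = $stmt->fetchColumn(); // false if nothing matched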
Another thing to consider: with Python you can use multiprocessing to try to parallelize as much as possible. PyMOTW has a good article about multiprocessing.