I've been trying to import a CSV file into a MySQL database using LOAD DATA INFILE.
Everything is working more or less correctly, but when I use mysqli_info() or mysqli_affected_rows(), they each show that no rows have been imported for the query, even though I can see the rows are being imported correctly.
A simplified version of what I am trying to do (fewer columns than I am actually importing):
$server = 'localhost';
$username = 'root';
$password = 'password123';
$database = 'database_name';
$connect = new mysqli($server, $username, $password, $database);
$files = scandir('imports/');
foreach($files as $file) {
    $import =
        "LOAD DATA INFILE 'imports/$file'
        IGNORE INTO TABLE table_name
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\n'
        IGNORE 1 LINES
        (@id, @name, @address, @product)
        SET id=@id, name=@name, address=@address, product=@product";

    if(! $connect->query($import)) {
        echo 'Failed';
    }

    $connect->query($import);
    echo mysqli_affected_rows($connect);
}
mysqli_affected_rows() returns 0, while mysqli_info() states all rows have been skipped. Any idea why it's not displaying correctly?
Hopefully that's enough information. Thanks!
Edit:
I've been a bit too busy to work on this over the past few days, but although it doesn't specifically answer my question, Drew's answer of importing into a temporary table first seems to make the most sense, so I have decided to go with that.
Further clarification of my comment: I would not rely on $connect->affected_rows as a Rosetta stone for this kind of info; it is broken half the time.
This is one recommendation: perform your LOAD DATA INFILE into a worktable, not your final desired table. Once that is done, you have escaped the limitations of that functionality and can enjoy the benefits of INSERT ... ON DUPLICATE KEY UPDATE (IODKU).
With the latter, when I want to know counts, I get an assigned batch number from a control table. If you need help with that, let me know.
Stepping back to the point where the data is in the worktable and a batch number is in hand: I then perform the IODKU from the worktable into the final table, with the batch number coming along for the ride into a column of the final table (yes, I can tweak my schema to do that; perhaps you cannot). To avoid schema changes to the existing table, and to handle (or detect) a row being hit by multiple batch ids, a simple association table of ids can be used instead - an intersect table, if you will.
If concurrency is an issue (multiple users may be doing this at the same time), then a locking strategy (ideally InnoDB row-level locking) is employed. If used, make it fast.
I then fetch my count off the final table (or intersect table) where batch id = my batch number in hand.
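As a rough sketch of that flow (hypothetical names throughout: worktable, final_table, its unique key, and the added batch_id column are assumptions; 42 stands in for the batch number fetched from the control table, and the id/name/address/product columns mirror the question's LOAD DATA list):
INSERT INTO final_table (id, name, address, product, batch_id)
SELECT w.id, w.name, w.address, w.product, 42
FROM worktable w
ON DUPLICATE KEY UPDATE
    name = VALUES(name),
    address = VALUES(address),
    product = VALUES(product),
    batch_id = VALUES(batch_id);

-- the count for this run, fetched off the final table
SELECT COUNT(*) FROM final_table WHERE batch_id = 42;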
See also this answer from Jan.
Related
I'm currently working on a system to manage my Magic: The Gathering collection. I've written a script to update pricing for all the cards using a WHILE loop for the main update, but it takes about 9 hours to update all 28,000 rows on my i5 laptop. I have a feeling the same thing can be accomplished without the while loop using a MySQL query, and that it would be faster.
My script starts off by creating a temporary table with the same structure as my main inventory table, and then copies new prices into the temporary table via a CSV file. I then use a while loop to compare the cards in the temp table to the inventory table via card_name and card_set and do the update.
My question is: would a pure MySQL query be faster than using the while loop, and can you help me construct it? Any help would be much appreciated. Here is my code.
<?php
set_time_limit(0);
echo "Prices Are Updating. This can Take Up To 8 Hours or More";
include('db_connection.php');
mysql_query("CREATE TABLE price_table LIKE inventory;");
//Upload Data
mysql_query("LOAD DATA INFILE 'c:/xampp/htdocs/mtgtradedesig/price_update/priceupdate.csv'
INTO TABLE price_table FIELDS TERMINATED BY ',' ENCLOSED BY '\"' (id, card_name, card_set, price)");
echo mysql_error();
//UPDATE PRICING
//SELECT all from table named price update
$sql_price_table = "SELECT * FROM price_table";
$prices = mysql_query($sql_price_table);
//Start While Loop to update prices. Do this by putting everything from price table into an array and one entry at a time match the array value to a value in inventory and update.
while($cards = mysql_fetch_assoc($prices)){
    $card_name = mysql_real_escape_string($cards['card_name']);
    $card_set = mysql_real_escape_string($cards['card_set']);
    $card_price = $cards['price'];
    $foil_price = $cards['price'] * 2;
    //Update prices for non-foil in temp_inventory
    mysql_query("UPDATE inventory SET price='$card_price' WHERE card_name='$card_name' AND card_set='$card_set' and foil ='0'");
    //Update prices for foil in temp_inventory
    mysql_query("UPDATE inventory SET price='$foil_price' WHERE card_name='$card_name' AND card_set='$card_set' and foil ='1'");
}
mysql_query("DROP TABLE price_table");
unlink('c:/xampp/htdocs/mtgtradedesign/price_update/priceupdate.csv');
header("Location: http://localhost/mtgtradedesign/index.php");
?>
The easiest remedy is to perform a join between the tables and update all rows at once. You will then only need to run two queries, one for foil and one for non-foil. You can get it down to one, but that gets more complicated.
UPDATE inventory i
JOIN price_table pt
ON (i.card_name = pt.card_name AND i.card_set = pt.card_set)
SET i.price = pt.price WHERE i.foil = 0;
I didn't actually test this, but it should generally be what you're looking for. Also, before running these, try using EXPLAIN to see how bad the join performance will be; you might benefit from adding indexes to the tables.
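For the foil rows, where the script doubles the price, a second UPDATE of the same shape should do it (again untested; it assumes the loaded column in price_table is named price, matching the LOAD DATA column list above):
UPDATE inventory i
JOIN price_table pt
ON (i.card_name = pt.card_name AND i.card_set = pt.card_set)
SET i.price = pt.price * 2 WHERE i.foil = 1;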
On a side note (and this isn't really your question): mysql_real_escape_string is deprecated, and in general you should not use any of the built-in PHP mysql_* functions, as they are all known to be unsafe. The PHP docs recommend using PDO instead.
I've got a server hosting a live traffic-log DB that holds a big stats table. Now I need to create a smaller table from it, going back, say, 30 days.
This server also has a slave that replicates the data and runs about 5 seconds behind the master.
I created this slave in order to take the SELECT load off the master, so the master only handles INSERT/UPDATE for the traffic log.
Now I need to copy the last day into the smaller table without using the "real" DB,
so I need to SELECT from the slave and INSERT into the real smaller table (the slave only allows read operations).
I am working with PHP and I can't see how to solve this with one query that spans the two different databases. If that's possible, please let me know how.
When using two queries I need to hold the last day's rows in a PHP MySQL result object, and for 300K-650K rows that starts to become a memory problem. I would normally select in chunks by ID (setting the IDs in the WHERE clause), but I don't have an auto-increment ID field and there is no ID on the rows (when storing traffic data, an ID would take a lot of space).
So I am trying this idea and I would like to get a second opinion.
If I take the whole last day at once (300K+ rows) it will overload PHP's memory.
I can use LIMIT chunks, or a new idea: selecting one column at a time and copying it to the new real table. But I don't know if the second method is possible. Does INSERT fill the first open space at the column level or at the row level?
The main idea is to reduce the size of the SELECT, so is it possible to build a SELECT column by column and then insert them as columns in MySQL?
If this is simply a memory problem in PHP, you could try using PDO and fetching one result row at a time instead of all of them at once.
From PHP.net for PDO:
<?php
function getFruit($conn) {
    $sql = 'SELECT name, color, calories FROM fruit ORDER BY name';
    foreach ($conn->query($sql) as $row) {
        print $row['name'] . "\t";
        print $row['color'] . "\t";
        print $row['calories'] . "\n";
    }
}
?>
Well, here is where PHP starts to get weird. I took your advice and started to use chunks for the data, with a loop that advances a LIMIT offset in jumps of 2000 rows. The interesting part is that when I started using PHP's memory-usage and memory-peak functions, I found out why the chunk-and-loop method doesn't work at large scale: assigning a new value to a variable doesn't release the memory of what was there before, so you must use unset() or null it out in order to keep your PHP memory under control.
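To make that concrete, here is a minimal sketch of the chunked copy with explicit memory release (hedged: the DSNs and credentials, the stats and stats_30days table names, and the log_time column are all assumptions; the 2000-row chunk size matches the comment above):
<?php
// read from the slave, write to the master (hostnames and credentials are placeholders)
$slave  = new PDO('mysql:host=slave-host;dbname=traffic', 'reader', 'secret');
$master = new PDO('mysql:host=master-host;dbname=traffic', 'writer', 'secret');

$chunk  = 2000;
$offset = 0;

while (true) {
    // pull one chunk of yesterday's rows from the slave
    // (without a unique id there is no stable ORDER BY, so rows may shift between chunks)
    $rows = $slave->query(
        "SELECT * FROM stats
         WHERE log_time >= CURDATE() - INTERVAL 1 DAY AND log_time < CURDATE()
         LIMIT $offset, $chunk"
    )->fetchAll(PDO::FETCH_NUM);
    if (!$rows) break;

    // copy the chunk into the smaller table on the master
    // (assumes stats_30days has the same column layout as stats)
    $placeholders = '(' . implode(',', array_fill(0, count($rows[0]), '?')) . ')';
    $insert = $master->prepare("INSERT INTO stats_30days VALUES $placeholders");
    foreach ($rows as $row) {
        $insert->execute($row);
    }

    // release the chunk before fetching the next one, as noted in the comment above
    unset($rows, $insert);
    $offset += $chunk;
}
?>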
I'm working on a research project that requires me to process large csv files (~2-5 GB) with 500,000+ records. These files contain information on government contracts (from USASpending.gov). So far, I've been using PHP or Python scripts to attack the files row-by-row, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks to see if the entity named is already in the database (using a combination of string and regex matching); if it is not, it first adds the entity to a table of entities and then proceeds to parse the rest of the record and inserts the information into the appropriate tables. The list of entities is over 100,000.
Here are the basic functions (part of a class) that try to match each record with any existing entities:
private function _getOrg($data)
{
    // if name of organization is null, skip it
    if($data[44] == '') return null;

    // use each of the possible names to check if organization exists
    $names = array($data[44], $data[45], $data[46], $data[47]);

    // cycle through the names
    foreach($names as $name) {
        // check to see if there is actually an entry here
        if($name != '') {
            if(($org_id = $this->_parseOrg($name)) != null) {
                $this->update_org_meta($org_id, $data); // updates some information of existing entity based on record
                return $org_id;
            }
        }
    }
    return $this->_addOrg($data);
}

private function _parseOrg($name)
{
    // check to see if it matches any org names
    // db class function, performs simple "like" match
    $this->db->where('org_name', $name, 'like');
    $result = $this->db->get('orgs');
    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    // check to see if matches any org aliases
    $this->db->where('org_alias_name', $name, 'like');
    $result = $this->db->get('orgs_aliases');
    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    return null; // no matches, have to add new entity
}
The _addOrg function inserts the new entity's information into the db, where hopefully it will match subsequent records.
Here's the problem: I can only get these scripts to parse about 10,000 records / hour, which, given the size, means a few solid days for each file. The way my db is structured requires several different tables to be updated for each record, because I'm compiling multiple external datasets. So each record updates two tables, and each new entity updates three tables. I'm worried that this adds too much lag time between the MySQL server and my script.
Here's my question: is there a way to import the text file into a temporary MySQL table and then use internal MySQL functions (or PHP/Python wrapper) to speed up the processing?
I'm running this on my Mac OS 10.6 with local MySQL server.
Load the file into a temporary/staging table using LOAD DATA INFILE and then use a stored procedure to process the data - it shouldn't take more than 1-2 minutes at most to completely load and process the data.
You might also find some of my other answers of interest:
Optimal MySQL settings for queries that deliver large amounts of data?
MySQL and NoSQL: Help me to choose the right one
How to avoid "Using temporary" in many-to-many queries?
60 million entries, select entries from a certain month. How to optimize database?
Interesting presentation:
http://www.mysqlperformanceblog.com/2011/03/18/video-the-innodb-storage-engine-for-mysql/
Example code (which may be of use to you):
truncate table staging;
start transaction;
load data infile 'your_data.dat'
into table staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
(
org_name
...
)
set
org_name = nullif(org_name,'');
commit;
drop procedure if exists process_staging_data;
delimiter #
create procedure process_staging_data()
begin
insert ignore into organisations (org_name) select distinct org_name from staging;
update...
etc..
-- or use a cursor if you have to ??
end#
delimiter ;
call process_staging_data();
Hope this helps
It sounds like you'd benefit the most from tuning your SQL queries, which is probably where your script spends the most time. I don't know how the PHP MySQL client performs, but MySQLdb for Python is fairly fast; doing naive benchmark tests I can easily sustain 10k insert/select queries per second on one of my older quad-cores. Instead of doing one SELECT after another to test whether the organization exists, using a REGEXP to check for them all at once might be more efficient (discussed here: MySQL LIKE IN()?). MySQLdb also lets you use executemany() to do multiple inserts at once; you could almost certainly leverage that to your advantage, and perhaps your PHP client lets you do the same thing.
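For example, the four candidate names that _getOrg() currently cycles through one by one could be checked in a single round trip (a hedged sketch; the table and column names are taken from the question's code, and the literal values stand in for properly escaped parameters):
SELECT org_id FROM orgs
WHERE org_name IN ('name 1', 'name 2', 'name 3', 'name 4');

-- or, closer to the REGEXP idea (watch out for regex metacharacters in the names):
SELECT org_id FROM orgs
WHERE org_name RLIKE 'name 1|name 2|name 3|name 4';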
Another thing to consider: with Python you can use multiprocessing to try to parallelize as much as possible. PyMOTW has a good article about multiprocessing.
I have a script that imports CSV files. What ends up in my database is, among other things, a list of customers and a list of addresses. I have a table called customer and another called address, where address has a customer_id.
One thing that's important to me is not to have any duplicate rows. Therefore, each time I import an address, I do something like this:
$address = new Address();
$address->setLine_1($line_1);
$address->setZip($zip);
$address->setCountry($usa);
$address->setCity($city);
$address->setState($state);
$address = Doctrine::getTable('Address')->findOrCreate($address);
$address->save();
What findOrCreate() does, as you can probably guess, is find a matching address record if it exists, otherwise just return a new Address object. Here is the code:
public function findOrCreate($address)
{
    $q = Doctrine_Query::create()
        ->select('a.*')
        ->from('Address a')
        ->where('a.line_1 = ?', $address->getLine_1())
        ->andWhere('a.line_2 = ?', $address->getLine_2())
        ->andWhere('a.country_id = ?', $address->getCountryId())
        ->andWhere('a.city = ?', $address->getCity())
        ->andWhere('a.state_id = ?', $address->getStateId())
        ->andWhere('a.zip = ?', $address->getZip());
    $existing_address = $q->fetchOne();

    if ($existing_address)
    {
        return $existing_address;
    }
    else
    {
        return $address;
    }
}
The problem with doing this is that it's slow. To save each row in the CSV file (which translates into several INSERT statements on different tables), it takes about a quarter second. I'd like to get it as close to "instantaneous" as possible because I sometimes have over 50,000 rows in my CSV file. I've found that if I comment out the part of my import that saves addresses, it's much faster. Is there some faster way I could do this? I briefly considered putting an index on it but it seems like, since all the fields need to match, an index wouldn't help.
This certainly won't alleviate all of the time spent on tens of thousands of iterations, but why don't you manage your addresses outside of per-iteration DB queries? The general idea:
Get a list of all current addresses (store it in an array)
As you iterate, check membership in the array (e.g. keyed by a checksum of the address fields); if the address doesn't exist yet, store it in the array and save it to the database.
Unless I'm misunderstanding the scenario, this way you're only making INSERT queries if you have to, and you don't need to perform any SELECT queries aside from the first one.
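A minimal sketch of that idea, built on the question's own code (hedged: it reuses the getters from findOrCreate() above, and an md5 of the concatenated fields is just one possible membership key):
<?php
// one SELECT up front: index every existing address by a composite key
$seen = array();
foreach (Doctrine_Query::create()->from('Address a')->execute() as $a) {
    $key = md5(implode('|', array($a->getLine_1(), $a->getLine_2(), $a->getCountryId(),
                                  $a->getCity(), $a->getStateId(), $a->getZip())));
    $seen[$key] = true;
}

// inside the CSV import loop: only hit the database for genuinely new addresses
$key = md5(implode('|', array($address->getLine_1(), $address->getLine_2(), $address->getCountryId(),
                              $address->getCity(), $address->getStateId(), $address->getZip())));
if (!isset($seen[$key])) {
    $seen[$key] = true;
    $address->save();   // INSERT only when this exact address has not been seen before
}
?>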
I recommend that you investigate loading the CSV files into MySQL using LOAD DATA INFILE:
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
In order to update existing rows, you have a couple of options. LOAD DATA INFILE does not have upsert functionality (INSERT ... ON DUPLICATE KEY UPDATE), but it does have a REPLACE option, which you could use to update existing rows. You need an appropriate unique index for that, though, and REPLACE is really just a DELETE plus an INSERT, which is slower than an UPDATE.
Another option is to load the data from the CSV into a temporary table, then merge that table with the live table using INSERT...ON DUPLICATE KEY UPDATE. Again, make sure you have an appropriate unique index, but in this case you're doing an update instead of a delete so it should be faster.
It looks like your duplicate checking is what is slowing you down. To find out why, figure out what query Doctrine is creating and run EXPLAIN on it.
My guess would be that you will need to create some indexes. Searching through the entire table can be very slow, but adding an index to zip would allow the query to only do a full search through addresses with that zip code. The EXPLAIN will be able to guide you to other optimizations.
What I ended up doing, that improved performance greatly, was to use ON DUPLICATE KEY UPDATE instead of using findOrCreate().
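The piece that makes ON DUPLICATE KEY UPDATE viable here is a unique index spanning the columns that define a duplicate. A hedged sketch (the column names follow findOrCreate() above; the id primary key, the sample values, and the index-length limits on wide VARCHAR columns are all things to check against the real schema):
ALTER TABLE address
    ADD UNIQUE KEY uniq_address (line_1, line_2, country_id, city, state_id, zip);

INSERT INTO address (line_1, line_2, country_id, city, state_id, zip)
VALUES ('123 Main St', '', 1, 'Springfield', 5, '12345')
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
-- LAST_INSERT_ID() now returns the id of the inserted row OR the pre-existing duplicate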
I am building a PHP web application that lets a user upload an MS Access database (CSV export), which is then translated and migrated into a MySQL database.
The MS Access database consists of one table called t_product of 100k rows. This table is not designed well. As an example, the following query:
SELECT part_number, model_number FROM t_product
will return:
part_number model_number
100 AX1000, AX1001, AX1002
101 CZ10, CZ220, MB100
As you can see, the model numbers are listed as comma-separated values instead of individual records in another table. There are many more issues of this nature. I'm writing a script to clean this data before importing it into the MySQL database. The script will also map the existing Access columns to a properly designed relational database.
My issue is that my script takes too long to complete. Here's simplified code to explain what I'm doing:
$handle = fopen("MSAccess.csv, "r");
// get each row from the csv
while ($data=fgetcsv($handle, 1000, ","))
{
mysql_query("INSERT INTO t_product (col1, col2 etc...) values ($data[0], $data[1], etc...");
$prodId = mysql_last_insert_id();
// using model as an example, there are other columns
// with csv values that need to be broken up
$arrModel = explode(',', $data[2]);
foreach($arrModel as $modelNumber)
mysql_query("INSERT INTO t_model (product_id, col1, col2 etc...) values ($prodId, $modelNumber[0], $modelNumber[1] etc...");
}
The problem here is that each while-loop iteration makes a tremendous number of calls to the database. For every product record, I have to insert N model numbers, Y part numbers, X serial numbers etc...
I started another approach where I stored the whole CSV in an array. I then write one batch query like
$sql = "INSERT INTO t_product (col1, col2, etc...) values ";
foreach($arrParam as $val)
$sql .= " ($val[0], $val[1], $val[2]), "
But I ran into excessive memory errors with this approach. I increased the max memory limit to 64M and I'm still running out of memory.
What is the best way to tackle this problem?
Maybe I should write all my queries to a *.sql file first, then import the *.sql file into the mysql database?
This may be entirely not the direction you want to go, but you can generate the MySQL creation script directly from MS Access with the free MySQL Migration Toolkit
Perhaps you could allow the user to upload the Access db, and then have your PHP script call the Migration toolkit?
If you're going to try optimizing the code you have there already, I would try aggregating the INSERTS and see if that helps. This should be easy to add to your code. Something like this (C# pseudocode):
int flushCount = 0;
while (!done)
{
    // Build next query, concatenate to last set of queries
    if (++flushCount == 5)
    {
        // Flush queries to database
        // Reset query string to empty
        flushCount = 0;
    }
}
// Flush remaining queries to the database
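In PHP that flushing pattern might look roughly like this (a sketch only; it reuses the question's mysql_* calls and placeholder t_product columns, and the batch size of 500 is arbitrary):
<?php
$batch = array();
$batchSize = 500;

while ($data = fgetcsv($handle, 1000, ",")) {
    // collect one row's VALUES tuple (values must be escaped in real code)
    $batch[] = "('" . mysql_real_escape_string($data[0]) . "','" . mysql_real_escape_string($data[1]) . "')";

    if (count($batch) == $batchSize) {
        // flush: one multi-row INSERT instead of 500 single-row ones
        mysql_query("INSERT INTO t_product (col1, col2) VALUES " . implode(',', $batch));
        $batch = array();   // reset, so memory never holds more than one batch
    }
}

// flush whatever is left over
if ($batch) {
    mysql_query("INSERT INTO t_product (col1, col2) VALUES " . implode(',', $batch));
}
?>
The catch is that the per-product model inserts still need each product's id, so those either stay one-at-a-time or get resolved afterwards with a join, which is part of why writing everything to a .sql file, as described below, can end up simpler.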
I decided to write all my queries into a .sql file. This gave me the opportunity to normalize the CSV file into a proper relational database. Afterwards, my PHP script called exec("mysql -h dbserver.com -u myuser -pmypass dbname < db.sql");
This solved my memory problems and it was much faster than multiple queries from php.