I'm working on a research project that requires me to process large CSV files (~2-5 GB) with 500,000+ records. These files contain information on government contracts (from USASpending.gov). So far, I've been using PHP or Python scripts to attack the files row by row, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks whether the entity named is already in the database (using a combination of string and regex matching); if it is not, it first adds the entity to a table of entities and then proceeds to parse the rest of the record and insert the information into the appropriate tables. The list of entities now numbers over 100,000.
Here are the basic functions (part of a class) that try to match each record with any existing entities:
private function _getOrg($data)
{
    // if name of organization is null, skip it
    if($data[44] == '') return null;

    // use each of the possible names to check if organization exists
    $names = array($data[44], $data[45], $data[46], $data[47]);

    // cycle through the names
    foreach($names as $name) {
        // check to see if there is actually an entry here
        if($name != '') {
            if(($org_id = $this->_parseOrg($name)) != null) {
                $this->update_org_meta($org_id, $data); // updates some information of existing entity based on record
                return $org_id;
            }
        }
    }

    return $this->_addOrg($data);
}
private function _parseOrg($name)
{
    // check to see if it matches any org names
    // db class function, performs simple "like" match
    $this->db->where('org_name', $name, 'like');
    $result = $this->db->get('orgs');
    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    // check to see if matches any org aliases
    $this->db->where('org_alias_name', $name, 'like');
    $result = $this->db->get('orgs_aliases');
    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    return null; // no matches, have to add new entity
}
The _addOrg function inserts the new entity's information into the db, where hopefully it will match subsequent records.
Here's the problem: I can only get these scripts to parse about 10,000 records per hour, which, given the size, means a few solid days for each file. The way my db is structured requires several different tables to be updated for each record, because I'm compiling multiple external datasets. So each record updates two tables, and each new entity updates three tables. I'm worried that this adds too much lag time between the MySQL server and my script.
Here's my question: is there a way to import the text file into a temporary MySQL table and then use internal MySQL functions (or a PHP/Python wrapper) to speed up the processing?
I'm running this on Mac OS X 10.6 with a local MySQL server.
Load the file into a temporary/staging table using LOAD DATA INFILE and then use a stored procedure to process the data - it shouldn't take more than 1-2 minutes at most to completely load and process the data.
you might also find some of my other answers of interest:
Optimal MySQL settings for queries that deliver large amounts of data?
MySQL and NoSQL: Help me to choose the right one
How to avoid "Using temporary" in many-to-many queries?
60 million entries, select entries from a certain month. How to optimize database?
Interesting presentation:
http://www.mysqlperformanceblog.com/2011/03/18/video-the-innodb-storage-engine-for-mysql/
Example code (may be of use to you):
truncate table staging;
start transaction;
load data infile 'your_data.dat'
into table staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
(
org_name
...
)
set
org_name = nullif(org_name,'');
commit;
drop procedure if exists process_staging_data;
delimiter #
create procedure process_staging_data()
begin
insert ignore into organisations (org_name) select distinct org_name from staging;
update...
etc..
-- or use a cursor if you have to ??
end#
delimiter ;
call process_staging_data();
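If you would rather drive this from PHP (the "PHP/Python wrapper" route mentioned in the question) instead of the mysql command-line client, a rough sketch might look like the following - connection details are placeholders, and local_infile has to be enabled on the server for LOAD DATA LOCAL INFILE to work:
$db = mysqli_init();
$db->options(MYSQLI_OPT_LOCAL_INFILE, true);       // allow LOAD DATA LOCAL INFILE
$db->real_connect('localhost', 'dbuser', 'password', 'contracts');

$db->query('TRUNCATE TABLE staging');

// same LOAD DATA statement as above, sent through the API
$db->query("
    LOAD DATA LOCAL INFILE 'your_data.dat'
    INTO TABLE staging
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    (org_name)
    SET org_name = NULLIF(org_name, '')
") or die($db->error);

// one round trip; MySQL does the set-based processing
$db->query('CALL process_staging_data()') or die($db->error);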
Hope this helps
It sounds like you'd benefit the most from tuning your SQL queries, which is probably where your script spends most of its time. I don't know how the PHP MySQL client performs, but MySQLdb for Python is fairly fast. Doing naive benchmark tests I can easily sustain 10k insert/select queries per second on one of my older quad-cores. Instead of doing one SELECT after another to test whether the organization exists, using a REGEXP to check for them all at once might be more efficient (discussed here: MySQL LIKE IN()?). MySQLdb lets you use executemany() to do multiple inserts at once, and you could almost certainly leverage that to your advantage; perhaps your PHP client lets you do the same thing.
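Translating the "check them all at once" idea into PHP terms, here is a rough, untested sketch using mysqli with exact-match IN() lists (the table and column names are taken from the question's code; swap in LIKE/REGEXP matching if you really need fuzzy matches):
// One round trip for all candidate names instead of one LIKE query per name.
// Assumes the orgs / orgs_aliases schema shown in the question.
function findOrgId(mysqli $db, array $names)
{
    $names = array_values(array_filter($names, 'strlen')); // drop empty names
    if (empty($names)) {
        return null;
    }

    $in  = implode(',', array_fill(0, count($names), '?'));
    $sql = "SELECT org_id FROM orgs WHERE org_name IN ($in)
            UNION
            SELECT org_id FROM orgs_aliases WHERE org_alias_name IN ($in)
            LIMIT 1";

    $stmt   = $db->prepare($sql);
    $params = array_merge($names, $names);                // same list for both IN()s
    $stmt->bind_param(str_repeat('s', count($params)), ...$params);
    $stmt->execute();
    $stmt->bind_result($orgId);

    return $stmt->fetch() ? (int) $orgId : null;          // null means "add new org"
}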
Another thing to consider: with Python you can use the multiprocessing module to parallelize as much as possible. PyMOTW has a good article about multiprocessing.
Related
I've been trying to import a csv file into a mysql database using LOAD DATA INFILE.
Everything is working more or less correctly, but when I use mysqli_info() or mysqli_affected_rows() they each show that no rows have been imported for the query. Even though I see the rows are being correctly imported.
A simplified version of what I am trying to do (fewer columns than I am actually importing):
$server = 'localhost';
$username = 'root';
$password = 'password123';
$database = 'database_name';

$connect = new mysqli($server, $username, $password, $database);

$files = scandir('imports/');

foreach($files as $file) {
    $import =
        "LOAD DATA INFILE 'imports/$file'
        IGNORE INTO TABLE table_name
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\n'
        IGNORE 1 LINES
        (@id, @name, @address, @product)
        SET id=@id, name=@name, address=@address, product=@product";

    if(! $connect->query($import)) {
        echo 'Failed';
    }

    $connect->query($import);

    echo mysqli_affected_rows($connect);
}
mysqli_affected_rows() returns 0, while mysqli_info() states all rows have been skipped. Any idea why it's not displaying correctly?
Hopefully that's enough information. Thanks!
Edit:
I've been a bit too busy to work on this over the past few days, but although it doesn't specifically answer my question, Drew's answer of importing into a temporary table first seems to make the most sense, so I have decided to go with that.
Further clarification of my comment: I would never rely on $connect->affected_rows as a Rosetta stone for info. It is broken half the time.
This is one recommendation. Perform your LOAD DATA INFILE into a worktable, not your final desired table. Once that is performed, you have successfully escaped the limitations of that functionality and can enjoy the fruits of Insert on Duplicate Key Update (IODKU).
With the latter, when I want to know counts, I get an assigned batch number from a control table. If you need help with that, let me know.
But I am stepping back to the point of having the data in the worktable now, with a batch number in hand. I then perform the IODKU from the worktable into the final table, with the batch number coming along for the ride into a column in the final table (yes, I can tweak the schema to do that; perhaps you cannot). To get around schema changes to the existing table, and to handle cases where a row has been hit by multiple batch ids, a simple association table with ids can be used - an intersect table, if you will.
If concurrency is an issue (multiple users likely able to do this concurrently), then a locking strategy (ideally INNODB row-level locking) is employed. If used, make it fast.
I then fetch my count off the final table (or intersect table) where batch id = my batch number in hand.
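To make that flow concrete, here is a rough sketch of the steps from PHP - every name in it (batch_control, worktable, final_table, batch_id, code) is a placeholder for illustration, not your actual schema:
$db = new mysqli('localhost', 'dbuser', 'password', 'mydb');

// 1. Get an assigned batch number from a control table.
$db->query("INSERT INTO batch_control (created_at) VALUES (NOW())");
$batchId = $db->insert_id;

// 2. LOAD DATA INFILE has already filled the worktable (not shown); now do the
//    Insert on Duplicate Key Update into the final table, carrying the batch id.
$db->query("
    INSERT INTO final_table (code, name, batch_id)
    SELECT code, name, $batchId FROM worktable
    ON DUPLICATE KEY UPDATE
        name     = VALUES(name),
        batch_id = VALUES(batch_id)
");

// 3. Fetch the count off the final table where batch id = the number in hand.
$row = $db->query("SELECT COUNT(*) AS n FROM final_table WHERE batch_id = $batchId")->fetch_object();
echo $row->n;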
See also this answer from Jan.
I'm trying to figure out the most efficient way to send multiple queries to a MySQL database with PHP. Right now I'm doing two separate queries but I know there are more efficient methods, like using mysqli_multi_query. Is mysqli_multi_query the most efficient method or are there other means?
For example, I could just write a query that puts ALL the data from ALL the tables in the database into a PHP array. Then I could sort the data using PHP, resulting in having only one query no matter what data I needed... and I could put that PHP array into a session variable so the user would never query the database again during that session. Makes sense right? Why not just do that rather than create a new query each time the page is reloaded?
It's really difficult to find resources on this so I'm just looking for advice. I plan to have massive traffic on the site that I am building so I need the code to put as little stress on the server as possible. As far as table size is concerned, we're talking about, let's say 3,000 rows in the largest table. Is it feasible to store that into one big PHP array (advantage being the client would query the database only ONCE on page load)?
$Table1Array = Array();
$Table1_result = mysqli_query($con, "SELECT * FROM Table1 WHERE column1 ='" . $somevariable . "'");
while($row = mysqli_fetch_array($Table1_result))
{
    $Table1Array[] = $row;
}

// query 2
$Table2Array = Array();
$Table2_result = mysqli_query($con, "SELECT * FROM Table2 LIMIT 5");
while($row = mysqli_fetch_array($Table2_result))
{
    $Table2Array[] = $row;
}
There are a few things to address here, hopefully this will make sense / be constructive...
Is mysqli_multi_query the most efficient method or are there other
means?
It depends on the specifics of what you are trying to do for a given page / query. Generally speaking, though, using mysqli_multi_query won't gain you much performance, as MySQL will still execute the queries you give it one after the other. mysqli_multi_query's performance gains come from the fact that fewer round trips are made between PHP and MySQL - a good thing if the two are on different servers, or if you are performing thousands of queries one after the other.
For example, I could just write a query that puts ALL the data from
ALL the tables in the database into a PHP array.
Just. No. In theory you could, but unless you had one page that displayed all of the database contents at once, there would simply be no need.
Then I could sort the data using PHP
If you can sort / filter the data into the correct form using MySQL, do that. Manipulating datasets is one of the things MySQL is very good at.
Why not just [load everything into the session] rather than create a new query each time the page is reloaded?
Because the dataset would be huge, and that session data would be transferred from the client every time they made a request to your server. Apart from sending needlessly huge requests, what about the other challenges this approach would raise? I.e. What would you do if extra data had been added to the db since you created the session-based cache for this particular user? What if the size of the data got too big for a user's session? What experience would I have as a user if I denied your session cookie and thereby forced the monster query to execute on every request?
I plan to have massive traffic on the site that I am building
Don't we all! As the comments above suggest, premature optimization is a Bad Thing. At this stage you should concentrate on getting your domain logic nailed down and building a good, maintainable OO platform on which to base further development.
If I wanted to execute multiple queries on a MySQL database, I would use MySQL stored procedures; then all you have to do is issue a simple call from PHP. A basic example of a procedure would be:
DELIMITER $$
CREATE PROCEDURE multiple_queries()
BEGIN
  SELECT * FROM TBL1 WHERE 1;
  SELECT * FROM TBL2 WHERE 2;
  SELECT * FROM TBL3 LEFT JOIN TBL4 ON TBL3.id = TBL4.id WHERE TBL3.id = '121';
END $$
DELIMITER ;
and in PHP you simply call the procedure, passing any parameters associated with it in the parentheses:
CALL multiple_queries()
Why not use the DB engine as much as possible? It's well capable of handling complex solutions, and we often don't utilize it fully.
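For what it's worth, a rough sketch (mysqli, placeholder connection details) of issuing that CALL from PHP and walking through each result set the procedure returns - a plain query() would only hand you the first one:
$db = new mysqli('localhost', 'dbuser', 'password', 'mydb');

// CALL returns one result set per SELECT plus a final status packet,
// so use multi_query() and loop until all results are consumed.
if ($db->multi_query('CALL multiple_queries()')) {
    do {
        if ($result = $db->store_result()) {
            while ($row = $result->fetch_assoc()) {
                // use $row ...
            }
            $result->free();
        }
    } while ($db->more_results() && $db->next_result());
}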
For example, I could just write a query that puts ALL the data from ALL the tables in
the database into a PHP array. Then I could sort the data using PHP, resulting in having
only one query no matter what data I needed...
I would think this would be inefficient, since you've thrown away the value of the database. When it comes to optimization, MySQL is superior to any PHP code that you could write.
Additionally, you're saying that running one query and pushing the data into a variable for each user may decrease resource usage, but is that really true? If you have massive traffic and this data is kept in session variables, then with 1,000 users currently logged on you will have 1,000 duplications of the entire database on your PHP server - are you sure the server has enough memory for that?
There are 2 ways I use to run multiple queries:
$conn = mysql_connect("host", "dbuser", "password");

$query1 = "select.......";
$result1 = mysql_query($query1) or die (mysql_error()); // execute the query
while($row1 = mysql_fetch_assoc($result1))
{
    // fetch the results from the query
}

$query2 = "select.......";
$result2 = mysql_query($query2) or die (mysql_error()); // execute the query
while($row2 = mysql_fetch_assoc($result2))
{
    // fetch the results from the query i.e. $row2['']
}

mysql_close($conn); // Close the Database connection.
The other way is to employ transactions when there is more than one query that must be either all executed or not executed at all.
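A minimal sketch of that transactional variant with mysqli - the table and column names are just placeholders, and the tables need to use InnoDB for transactions to apply:
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); // make mysqli throw exceptions

$db = new mysqli('localhost', 'dbuser', 'password', 'mydb');
$db->begin_transaction();

try {
    // both statements take effect together, or neither does
    $db->query("UPDATE accounts SET balance = balance - 100 WHERE id = 1");
    $db->query("UPDATE accounts SET balance = balance + 100 WHERE id = 2");
    $db->commit();
} catch (mysqli_sql_exception $e) {
    $db->rollback();
    throw $e;
}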
You could try it. But if the only reason is to have one query in the belief that it will be faster, I would think otherwise. Optimization inside the database is supreme, especially in MySQL.
I am in a debate with a guy telling me that there is no performance hit for using his function that...
Auto index, repair and optimize MySQL tables using PHP class __destruct() on every single page load by every user who runs the page.
He is asking me why I think it is not good for performance, but I do not really know. Can someone tell me why such a thing isn't good?
UPDATE His reasoning...
Optimizing & repairing the database tables eliminates the byte size of overhead that can essentially slow down additional queries when multiple connections and table use are concerned. Even with a performance enhanced database schema with indexing enabled.
Not to mention the amount of execution time to perform these operations are slim to none in memory and processor threading.
Opening, reading, writing, updating and then cleaning up after oneself makes more sense to me then performing the same operations and leaving unnecessary overhead behind waiting for a cron entry to clean up.
Instead of arguing, why not measure? Use a toolkit to profile where you're spending time, such as Instrumentation for PHP. Prove that the optimize step of your PHP request is taking a long time.
Reindexing is an expensive process, at least as costly as doing a table-scan as if you did not have an index. You should build indexes infrequently, so that you serve many PHP requests with the aid of the index for every one time you build the index. If you're building the index on every PHP request, you might as well not define indexes at all, and just run table-scans all the time.
REPAIR TABLE is only relevant for MyISAM tables (and Archive tables). I don't recommend using MyISAM tables. You should just use InnoDB tables. Not only for the sake of performance, but also data safety. MyISAM is very susceptible to data corruption, whereas InnoDB protects against that in most cases by maintaining internal checksums per page.
OPTIMIZE TABLE for an InnoDB table rebuilds all the data and index pages. This is going to be immensely expensive once your table grows to a non-trivial size. Certainly not something you would want to do on every page load. I would even say you should not do OPTIMIZE TABLE during any PHP web request -- do it offline via some script or admin interface.
A table restructure also locks the table. You will queue up all other PHP requests that access the same table for a long time (i.e. minutes or even hours, depending on the size of the table). When each PHP request gets its chance, it'll run another table restructure. It's ridiculous to incur this amount of overhead on every PHP request.
You can also use an analogy: you don't rebuild or optimize an entire table or index during every PHP request for the same reason you don't give your car a tune-up and oil change every time you start it:
It would be expensive and inconvenient to do so, and it would give no extra benefit compared to performing engine maintenance on an appropriate schedule.
Because every single operation (index, repair and optimize) takes considerable time; in fact they are VERY expensive (table locks, disk IO, risk of data loss) if the tables are even slightly big.
Doing this on every page load is definitely not recommended. It should be done only when needed.
REPAIR TABLE can cause data loss, as stated in the documentation, so it requires a prior backup to avoid further problems. Also, it is intended to be run only in case of disaster (something HAS failed).
OPTIMIZE TABLE locks the table under maintenance, so it can cause problems for concurrent users.
My 0.02: Database management operations should not be part of common user transactions as they are expensive in time and resources as your tables grow.
I have put the following code into our scheduled job, which runs in the early morning when users don't access our site frequently (I read that OPTIMIZE locks the affected tables during optimization).
The advantage of this approach is that a single query is composed with all table names comma-separated, instead of executing a lot of queries, one for each table to optimize.
It assumes that you already have a db connection opened and a db selected, so that you can use this code without specifying the db connection, db name, etc.
$q = "SHOW TABLE STATUS WHERE Data_Free > '0'";
$res = mysql_query($q); $TOOPT = mysql_num_rows($res);
$N = 0; // number of optimized tables
if(mysql_num_rows($res) > 0)
{
$N = 1;
while($t = mysql_fetch_array($res))
{
$TNAME = $t['Name']; $TSPACE += $t['Data_free'];
if($N < 2)
{
$Q = "OPTIMIZE TABLE ".$TNAME."";
}
else
{
$Q .= ", ".$TNAME."";
}
$N++;
} // endwhile tables
mysql_query($Q);
} // endif tables found (to optimize)
The docs state...
(optimize reference)
OPTIMIZE TABLE should be used if you have deleted a large part of a
table or if you have made many changes to a table with variable-length
rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns).
Deleted rows are maintained in a linked list and subsequent INSERT
operations reuse old row positions. You can use OPTIMIZE TABLE to
reclaim the unused space and to defragment the data file. After
extensive changes to a table, this statement may also improve
performance of statements that use the table, sometimes significantly.
In other words, after heavy changes have been performed, performance is enhanced by using the 'OPTIMIZE' command.
(flush reference)
FLUSH TABLES has several variant forms. FLUSH TABLE is a synonym for
FLUSH TABLES, except that TABLE does not work with the WITH READ LOCK
variant.
That is, using the 'FLUSH TABLE' form (vs. 'FLUSH TABLES') means no READ LOCK is performed.
(repair reference)
Normally, you should never have to run REPAIR TABLE. However, if
disaster strikes, this statement is very likely to get back all your
data from a MyISAM table. If your tables become corrupted often, you
should try to find the reason for it, to eliminate the need to use
REPAIR TABLE. See Section C.5.4.2, “What to Do If MySQL Keeps
Crashing”, and Section 13.5.4, “MyISAM Table Problems”.
It is my understanding that if the 'REPAIR TABLE' command is run consistently, the condition concerning large record sets would be eliminated, as constant maintenance is performed. If I am wrong I would like to see benchmarks, as my own attempts have not shown anything too detrimental, although the record sets have been under the 10k mark.
Here is the piece of code that is being used and that #codedev is asking about...
class db
{
    protected static $dbconn;

    // rest of database class

    public function index($link, $database)
    {
        $obj = $this->query('SHOW TABLES');
        $results = $this->results($obj);

        foreach ($results as $key => $value){
            if (isset($value['Tables_in_'.$database])){
                $this->query('REPAIR TABLE '.$value['Tables_in_'.$database]);
                $this->query('OPTIMIZE TABLE '.$value['Tables_in_'.$database]);
                $this->query('FLUSH TABLE '.$value['Tables_in_'.$database]);
            }
        }
    }

    public function __destruct()
    {
        $this->index($this->dbconn, $this->configuration['database']);
        $this->close();
    }
}
I have two tables called clients; they are exactly the same but in two different databases. The master always needs to be updated from the secondary one, and all data should always be the same. The script runs once per day. What would be the best way to accomplish this?
I had the following solution, but I think maybe there's a better way to do this:
$sql = "SELECT * FROM client";
$res = mysql_query($conn,$sql);
while($row = mysql_fetch_object($res)){
$sql = "SELECT count(*) FROM clients WHERE id={$row->id}";
$res1 = mysql_query($connSecond,$sql);
if(mysql_num_rows($res1) > 0){
//Update second table
}else{
//Insert into second table
}
}
And then I need a solution to delete all old data in the second table that's not in the master.
Any advice or help would be appreciated.
This is by no means an answer to your PHP code, but you should take a look at MySQL triggers. You should be able to create triggers (on updates / inserts / deletes) and have a trigger (like a stored procedure) update your table.
Going off the description you give, I would create a trigger that would check for changes to the secondary table, then write that change to the primary table, and delete that initial entry (if so required) from the secondary table (a minimal sketch follows the links below).
Triggers are run per conditions that you define.
Hopefully this gives you insight into 'another' way of doing this task.
More references on triggers for mysql:
http://dev.mysql.com/doc/refman/5.0/en/triggers.html
http://www.mysqltutorial.org/create-the-first-trigger-in-mysql.aspx
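As an illustration only, here is a rough sketch of such a trigger created from PHP - the database names (db_master, db_second) and columns (id, name) are placeholders rather than your real schema, and you would add matching UPDATE/DELETE triggers as well:
$db = new mysqli('localhost', 'dbuser', 'password');

// Single-statement trigger body, so no DELIMITER juggling is needed when it
// is sent through the API. It fires on the secondary copy and pushes the row
// into the master copy, updating it if the id already exists there.
$db->query("
    CREATE TRIGGER db_second.clients_after_insert
    AFTER INSERT ON db_second.clients
    FOR EACH ROW
        INSERT INTO db_master.clients (id, name)
        VALUES (NEW.id, NEW.name)
        ON DUPLICATE KEY UPDATE name = NEW.name
") or die($db->error);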
You can use mysql INSERT ... SELECT like this (but first truncate the target table):
TRUNCATE TABLE database2.client;
INSERT INTO database2.client SELECT * FROM database1.client;
It will be way faster than doing it by PHP.
And note:
As long as the MySQL user has been given the right permissions to all databases and tables where data is pulled from or pushed to, this will work. Though the mysql_select_db function selects one database, the MySQL statement may reference another if you use a fully qualified reference like databasename.tablename.
Not exactly answering your question, but how about just using one table instead of two? You could use a FEDERATED table to access the other (if it's on a different MySQL instance), or reference the table directly (as in shamittomar's suggestion).
If both are on the same MySQL instance, you could easily use a view:
CREATE VIEW database2.client AS SELECT * FROM database1.client;
And that's it! No synchronizing, no cron jobs, no voodoo :)
I am building a PHP web application that lets a user upload an MS Access database (CSV export), which is then translated and migrated into a MySQL database.
The MS Access database consists of one table called t_product of 100k rows. This table is not designed well. As an example, the following query:
SELECT part_number, model_number FROM t_product
will return:
part_number    model_number
100            AX1000, AX1001, AX1002
101            CZ10, CZ220, MB100
As you can see, the model numbers are listed as comma-separated values instead of individual records in another table. There are many more issues of this nature. I'm writing a script to clean this data before importing it into the MySQL database. The script will also map existing Access columns to a properly designed relational database.
My issue is that my script takes too long to complete. Here's simplified code to explain what I'm doing:
$handle = fopen("MSAccess.csv, "r");
// get each row from the csv
while ($data=fgetcsv($handle, 1000, ","))
{
mysql_query("INSERT INTO t_product (col1, col2 etc...) values ($data[0], $data[1], etc...");
$prodId = mysql_last_insert_id();
// using model as an example, there are other columns
// with csv values that need to be broken up
$arrModel = explode(',', $data[2]);
foreach($arrModel as $modelNumber)
mysql_query("INSERT INTO t_model (product_id, col1, col2 etc...) values ($prodId, $modelNumber[0], $modelNumber[1] etc...");
}
The problem here is that each while-loop iteration makes a tremendous number of calls to the database. For every product record, I have to insert N model numbers, Y part numbers, X serial numbers etc...
I started another approach where I stored the whole CSV in an array. I then write one batch query like
$sql = "INSERT INTO t_product (col1, col2, etc...) values ";
foreach($arrParam as $val)
$sql .= " ($val[0], $val[1], $val[2]), "
But I ran into excessive memory errors with this approach. I increased the max memory limit to 64M and I'm still running out of memory.
What is the best way to tackle this problem?
Maybe I should write all my queries to a *.sql file first, then import the *.sql file into the mysql database?
This may be entirely not the direction you want to go, but you can generate the MySQL creation script directly from MS Access with the free MySQL Migration Toolkit
Perhaps you could allow the user to upload the Access db, and then have your PHP script call the Migration toolkit?
If you're going to try optimizing the code you have there already, I would try aggregating the INSERTS and see if that helps. This should be easy to add to your code. Something like this (C# pseudocode):
int flushCount = 0;
while (!done)
{
    // Build next query, concatenate to last set of queries
    if (++flushCount == 5)
    {
        // Flush queries to database
        // Reset query string to empty
        flushCount = 0;
    }
}
// Flush remaining queries to the database
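In PHP, that flush-every-N idea might look roughly like the following for the parent table (placeholder columns, sticking with the question's mysql_* functions; the child t_model rows still need the insert id per product, so they are not covered here):
$handle = fopen('MSAccess.csv', 'r');
$values = array();

while (($data = fgetcsv($handle, 1000, ',')) !== false) {
    // build next row, concatenate to the current batch
    $values[] = sprintf("('%s','%s')",
        mysql_real_escape_string($data[0]),
        mysql_real_escape_string($data[1]));

    if (count($values) >= 500) {
        // flush queries to database, reset batch to empty
        mysql_query("INSERT INTO t_product (col1, col2) VALUES " . implode(',', $values));
        $values = array();
    }
}

// flush remaining rows to the database
if (!empty($values)) {
    mysql_query("INSERT INTO t_product (col1, col2) VALUES " . implode(',', $values));
}
fclose($handle);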
I decided to write all my queries into a .SQL file. This gave me the opportunity to normalize the CSV file into a proper relational database. Afterwards, my PHP script called exec("mysql -h dbserver.com -u myuser -pmypass dbname < db.sql");
This solved my memory problems and it was much faster than running multiple queries from PHP.