I have a big data set in MySQL (users, companies, contacts), about 1 million records.
Now I need to import new users, companies and contacts from a CSV file with about 100,000 records. Each record in the file has all the info for all three entities (user, company, contacts).
Moreover, on production I can't use LOAD DATA (I just don't have the privileges :( ).
So there are three steps which should be applied to that data set.
- compare with existing DB data
- update it (if we find something in the previous step)
- and insert new records
I'm using PHP on the server for this. I can see two approaches:
read ALL the data from the file at once, then work with this BIG array and apply those steps,
or read the file line by line and pass each line through the steps.
Which approach is more efficient in terms of CPU, memory or time usage?
Can I use transactions, or will they slow down the whole production system?
Thanks.
In terms of CPU time and elapsed time there won't be much in it, although reading the whole file will be slightly faster. However, for such a large data set, the additional memory required to read all records into memory will vastly outweigh the time advantage. I would definitely process one line at a time.
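A minimal sketch of the line-at-a-time approach, assuming PHP 5.5+ (for generators) and a CSV file whose first line is a header. The function name readCsvRows() and the column names in the usage note are hypothetical.

```php
<?php
// Hypothetical helper: yields one CSV row at a time as an associative array,
// so only the current row is held in memory, never the whole file.
function readCsvRows($path, $delimiter = ',')
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        $header = fgetcsv($handle, 0, $delimiter, '"', '\\');
        if (!is_array($header)) {
            return; // empty file
        }
        while (($row = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== false) {
            if ($row === array(null)) {
                continue; // fgetcsv() reports a blank line as array(null)
            }
            if (count($row) !== count($header)) {
                continue; // malformed row; a real import would log it
            }
            yield array_combine($header, $row);
        }
    } finally {
        fclose($handle);
    }
}
```

Usage would look like `foreach (readCsvRows('import.csv') as $row) { ... }`, with the compare/update/insert steps applied to each `$row` inside the loop.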
Did you know that phpMyAdmin has a nifty "resumable import" feature for big SQL files?
Just check "Allow interrupt of import" in the Partial Import section. And voila, phpMyAdmin will stop and loop until all requests are executed.
It may be more efficient to just "use the tool" rather than "reinvent the wheel".
I think the 2nd approach is more acceptable:
Create a change list (it would be a separate table)
Make the updates line by line (and mark each line as updated using an "updflag" field, for example)
Perform this process in the background using transactions.
Please, could somebody help me?
My problem is:
I have a table with 8 fields and about 510,000 records. In a web form, the user selects an Excel file, which is read with SimpleXLSX. The file has about 340,000 lines. With PHP and the SimpleXLSX library this file is loaded into memory; then a for loop reads it line by line, takes one value from each line and searches for this value in the table. If the value exists in the table, nothing is inserted; otherwise, the values that were read are stored in the table.
This process takes days to finish.
Can somebody suggest a way to speed up the process?
Thanks a lot.
If you have many users and they may use the site at the same time:
switch from SimpleXLSX to js-xlsx, so that the web browser does all the parsing work and the server only writes to the database.
If you have few users (I think this is your case):
"and search this value in the table"
This is what costs the most time, if you compare one value at a time between memory and the database and then add or skip it.
Instead, you can read all the database values into memory (you must use a hash list for the comparison), then compare everything there,
adding new values to the in-memory list and marking them as new.
Finally,
write the new in-memory values to the database.
Because your database table and your XLS file have roughly the same number of rows, querying the database for each row gains you almost nothing.
So forget the database during the comparison; doing it in memory is fastest.
In memory, use a hash list for the comparison.
Of course, you can run the above in the database instead if you use @Barmar's idea: don't insert single rows, insert in batches.
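The in-memory comparison described above can be sketched as follows. In PHP an ordinary array keyed by the value already behaves as a hash list. The function name findNewValues() is hypothetical; $existingValues is assumed to come from a single SELECT of the lookup column and $fileValues from the spreadsheet.

```php
<?php
// Hypothetical sketch: returns the values from the file that are not yet in
// the table, using array keys as a hash set instead of one query per row.
function findNewValues(array $existingValues, array $fileValues)
{
    // array_flip() turns the values into keys, so isset() is a hash lookup.
    $seen = array_flip($existingValues);
    $new = array();
    foreach ($fileValues as $value) {
        if (!isset($seen[$value])) {
            $seen[$value] = true; // also skips duplicates within the file
            $new[] = $value;
        }
    }
    return $new;
}
```

The returned list can then be written to the database in batches rather than row by row. Note that array_flip() only accepts string and integer values, which is what a database column fetched through mysqli or PDO normally yields.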
Focus on the speed of getting the data into the database. Do not try to do all the work during the INSERT. Then use SQL queries to further clean up the data.
Use the minimal XLS to get the XML into the database. Use some programming language if you need to massage the data a lot. Neither XLS nor SQL is the right place for complex string manipulations.
If practical, use LOAD XML (the XML counterpart of LOAD DATA) to get the data loaded; it is very fast.
SQL is excellent for handling entire tables at once; it is terrible at handling one row at a time. (Hence, my recommendation of putting the data into a staging table, not directly into the target table.)
If you want to discuss further, we need more details about the conversions involved.
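As a rough illustration of the staging-table idea, the sketch below builds one set-based statement that copies only the rows missing from the target table. The function name and the table and column names in the usage note are hypothetical, and the identifiers are assumed to come from trusted configuration, not user input.

```php
<?php
// Hypothetical helper: builds an INSERT ... SELECT with an anti-join, so the
// "does it already exist?" check runs once for the whole table, not per row.
function buildInsertMissingSql($target, $staging, $keyColumn, array $columns)
{
    $quote = function ($name) {
        return '`' . str_replace('`', '``', $name) . '`';
    };
    $cols = implode(', ', array_map($quote, $columns));
    $select = implode(', ', array_map(function ($c) use ($quote) {
        return 's.' . $quote($c);
    }, $columns));
    $key = $quote($keyColumn);

    return 'INSERT INTO ' . $quote($target) . ' (' . $cols . ')'
         . ' SELECT ' . $select
         . ' FROM ' . $quote($staging) . ' AS s'
         . ' LEFT JOIN ' . $quote($target) . ' AS t ON t.' . $key . ' = s.' . $key
         . ' WHERE t.' . $key . ' IS NULL';
}
```

For example, `buildInsertMissingSql('items', 'items_staging', 'code', array('code', 'name'))` yields a statement that can be run once after the file has been bulk-loaded into `items_staging`. An index on the key column of the target table is assumed; without it the join will be slow. Duplicate keys inside the staging table are not handled by this sketch.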
Looking for insight on the best approach for large CSV file imports to MySQL and managing the dataset. This is for an ecommerce storefront "startup". All product data will be read from CSV files which are downloaded via curl (server to server).
Each csv file represents a different supplier/warehouse with up to 100,000 products. In total there are roughly 1.2 million products spread over 90-100 suppliers. At least 75% of the row data (51 columns) is redundant garbage and will not be needed.
Would it be better to use mysqli LOAD DATA LOCAL INFILE to load into a 'temp_products' table, then make the needed data adjustments per row and insert into the live 'products' table, or simply use fgetcsv() and go row by row? The import will be handled by a cron job using the site's php.ini with a memory limit of 128M.
Apache V2.2.29
PHP V5.4.43
MySQL V5.5.42-37.1-log
memory_limit 128M
I'm not looking for "How to's". I'm simply looking for the "best approach" from the community's perspective and experience.
I have direct experience of doing something virtually identical to what you describe -- lots of third party data sources in different formats all needing to go into a single master table.
I needed to take different approaches for different data sources, because some were in XML, some in CSV, some large, some small, etc. For the large CSV ones, I did indeed follow roughly your suggested route:
I used LOAD DATA INFILE to dump the raw contents into a temporary table.
I took the opportunity to transform or discard some of the data within this query; LOAD DATA INFILE allows some quite complex queries. This allowed me to use the same temp table for several of the import processes even though they had quite different CSV data, which made the next step easier.
I then used a set of secondary SQL queries to pull the temp data into the various main tables. All told, I had about seven steps to the process.
I had a set of PHP classes to do the imports, which all implemented a common interface. This meant that I could have a common front-end program which could run any of the importers.
Since a lot of the importers did similar tasks, I put the commonly used code in traits so that the code could be shared.
Some thoughts based on the things you said in your question:
LOAD DATA INFILE will be orders of magnitude quicker than fgetcsv() with a PHP loop.
LOAD DATA INFILE queries can be very complex and achieve very good data mapping without ever having to run any other code, as long as the imported data is going into a single table.
Your memory limit is likely to need to be raised. However, using LOAD DATA INFILE means that it will be MySQL which will use the memory, not PHP, so the PHP limit won't come into play for that. 128M is still likely to be too low for you though.
If you struggle to import the whole thing in one go, try using some simple Linux shell commands to split the file into several smaller chunks. CSV data format should make that fairly simple.
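To illustrate the kind of LOAD DATA statement described above, here is a sketch that discards unwanted CSV columns through user variables and transforms one column in a SET clause. The function name and the column names are hypothetical, the path is assumed to be already escaped, and the comma-separated, double-quoted, newline-terminated layout with one header line is an assumption about the files.

```php
<?php
// Hypothetical helper: builds a LOAD DATA LOCAL INFILE statement.
// $fieldList maps CSV positions to table columns or to @variables;
// $assignments maps table columns to SQL expressions over those variables.
// All inputs are assumed to be trusted configuration, not user input.
function buildLoadDataSql($escapedPath, $table, array $fieldList, array $assignments)
{
    $set = array();
    foreach ($assignments as $column => $expression) {
        $set[] = '`' . $column . '` = ' . $expression;
    }
    $sql = "LOAD DATA LOCAL INFILE '" . $escapedPath . "'"
         . ' INTO TABLE `' . $table . '`'
         . " FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'"
         . " LINES TERMINATED BY '\\n'"
         . ' IGNORE 1 LINES'
         . ' (' . implode(', ', $fieldList) . ')';
    if ($set) {
        $sql .= ' SET ' . implode(', ', $set);
    }
    return $sql;
}
```

A call such as `buildLoadDataSql('/tmp/supplier.csv', 'temp_products', array('sku', '@unused', 'name', '@raw_price'), array('price' => "NULLIF(TRIM(@raw_price), '')"))` reads four CSV fields, throws the second away and cleans the fourth on the way in. The LOCAL variant only works if local_infile is enabled on the server and in the client (with mysqli, via the MYSQLI_OPT_LOCAL_INFILE option).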
I've a Cronjob script, written in PHP with following requirements:
Step 1 (DB server 1): Get some data from multiple tables (We have lot of data here)
Step 2 (Application server): Perform some calculation
Step 3 (DB Server 2): After calculation, insert that data into another database (MySQL) / table (InnoDB) for reporting purposes. This table contains 97 columns, actually different rates, which cannot be normalized further. This is a different physical DB server and has only one DB.
The script worked fine during development, but on production Step 1 returned approx. 50 million records. As a result, the script ran for around 4 days and then failed. (As a rough estimate, at the current rate it would have taken approx. 171 days to finish.)
Just a note: we were using prepared statements, and Step 1 fetches data in batches of 1000 records at a time.
What we did till now
Optimization Step 1: Multiple values in insert & drop all indexes
Some tests showed that the insert (Step 3 above) takes most of the time (more than 95%). To optimize, after some googling, we dropped all indexes from the table, and instead of one insert query per row, we now have one insert query per 100 rows. This gave us somewhat faster inserts but still, as per a rough estimate, it will take 90 days to run the cron once, and we need to run it once every month as new data will be available every month.
Optimization step 2: instead of writing to the DB, write to a CSV file and then import it into MySQL using a Linux command.
This step does not seem to work. Writing 30000 rows to the CSV file took 16 minutes, and we still need to import that CSV file into MySQL. We have a single file handle for all write operations.
Current state
It seems I'm now clueless about what else can be done. Some key requirements:
The script needs to insert approx. 50,000,000 records (will increase with time)
There are 97 columns for each record; we can skip some, but need 85 columns at the minimum.
Based on input, we can break the script into three different crons to run on three different servers, but the insert has to be done on one DB server (master), so I'm not sure if it will help.
However:
We are open to changing the database/storage engine (including NoSQL)
On production, we could have multiple database servers, but inserts have to be done on the master only. All read operations, which are minimal and occasional (just to generate reports), can be directed to a slave.
Question
I don't need any descriptive answer but can someone in short suggest what could be possible solution. I just need some optimization hint and I'll do remaining R&D.
We are open for everything, change database/storage engine, Server optimization/ multiple servers (Both DB and application), change programming language or whatever is best configuration for above requirements.
Final expectation, cron must finish in maximum 24 hours.
Edit in optimization step 2
To further understand why generating csv is taking time, I've created a replica of my code, with only necessary code. That code is present on git https://github.com/kapilsharma/xz
Output file of experiment is https://github.com/kapilsharma/xz/blob/master/csv/output500000_batch5000.txt
If you check the above file, I'm inserting 500000 records and getting 5000 records from the database at a time, making the loop run 100 times. The time taken in the first loop was 0.25982284545898 seconds, but in the 100th loop it was 3.9140808582306. I assume it's because of system resources and/or the size of the CSV file. In that case, it becomes more of a programming question than DB optimization. Still, can someone suggest why it takes more time in later loops?
If needed, whole code is committed except csv files and sql file generated to create dummy DB as these files are very big. However they can be easily generated with code.
Using OFFSET and LIMIT to walk through a table is O(N*N), which is much slower than you want or expect: every chunk has to scan past all the rows that earlier chunks already read.
Instead, walk through the table "remembering where you left off". It is best to use the PRIMARY KEY for such. Since the id looks like an AUTO_INCREMENT without gaps, the code is simple. My blog discusses that (and more complex chunking techniques).
It won't be a full 100 (500K/5K) times as fast, but it will be noticeably faster.
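A minimal sketch of "remembering where you left off", assuming a positive integer primary key named id. The function name walkTable() and the $fetchAfter callback are hypothetical; in a real script the callback would run something like SELECT ... FROM src WHERE id > ? ORDER BY id LIMIT ?.

```php
<?php
// Hypothetical sketch: walks a table in chunks by primary key instead of
// by OFFSET, so each chunk starts directly after the last row seen.
// $fetchAfter($lastId, $limit) must return up to $limit rows with
// id > $lastId, ordered by id, each row being an array with an 'id' key.
function walkTable(callable $fetchAfter, $chunkSize, callable $handleChunk)
{
    $lastId = 0; // assumes ids start above 0
    while (true) {
        $rows = $fetchAfter($lastId, $chunkSize);
        if (count($rows) === 0) {
            break; // nothing left after $lastId
        }
        $handleChunk($rows);
        $last = end($rows);
        $lastId = $last['id']; // remember where we left off
    }
}
```

Because the WHERE clause uses the primary key index, each chunk costs roughly the same regardless of how far into the table it is, and gaps in the id sequence do not matter.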
This is a very broad question. I'd start by working out what the bottleneck is with the "insert" statement. Run the code, and use whatever your operating system gives you to see what the machine is doing.
If the bottleneck is CPU, you need to find the slowest part and speed it up. Unlikely, given your sample code, but possible.
If the bottleneck is I/O or memory, you're almost certainly going to need either better hardware, or a fundamental re-design.
The obvious way to re-design this is to find a way to handle only deltas in the 50M records. For instance, if you can write to an audit table whenever a record changes, your cron job can look at that audit table and pick out any data that was modified since the last batch run.
I had a mailer cron job on CakePHP which failed on merely fetching 600 rows and sending email to the registered users. It couldn't even perform the job in batch operations. We finally opted for Mandrill, and since then it has all gone well.
I'd suggest (considering it a bad idea to touch the legacy system in production) :
Schedule a micro solution in Golang or Node.js, considering performance benchmarks; as database interaction is involved, you'll be fine with either of these. Have this micro solution perform the cron job (fetch + calculate).
Reporting from NoSQL will be challenging, so you should try out available services like Google BigQuery. Have the cron job store data in Google BigQuery and you should get a huge performance improvement, even in generating reports.
or
With each row inserted into your original DB server 1, set up a messaging mechanism which performs the operations of the cron job every time an insert is made (a sort of trigger) and stores the result in your reporting server. Possible services you can use are Google PubSub or Pusher. I think the time consumed per insert will be pretty low. (You can also use an async service setup which does the task of storing into the reporting database.)
Hope this helps.
My web application lets users import an Excel file and writes the data from the file into the MySQL database.
The problem is, when the Excel file has lots of entries, even 1000 rows, I get an error saying PHP ran out of memory. This occurs while reading the file.
I have assigned 1024MB to PHP in the php.ini file.
My question is how to go about importing such large data in PHP.
I am using CodeIgniter.
For reading the Excel file, I am using this library.
SOLVED. I used CSV instead of xls. and I could import 10,000 rows of data within seconds.
Thank you all for your help.
As others have said, 1000 records is not much. Make sure you process the records one at a time, or a few at a time, and that the variables you use for each iteration go out of scope after you're finished with that row or you're reusing the variables.
If you can avoid the necessity of processing Excel files by exporting them to CSV, that's even better, because then you wouldn't need such a library (which might or might not have its own memory issues).
Don't be afraid of increasing memory usage if you need to and that solves the problem; buying memory is the cheapest option sometimes. And don't let the 1 GB scare you. It is a lot for such a simple task, but if you have the memory and that's all you need to do, then it's good enough for the moment.
And as a plus, if you are using an old version of PHP, try updating to PHP 5.4 which handles memory much better than its predecessors.
Instead of inserting one row at a time in a loop, insert 100 rows at a time.
You can always run
INSERT INTO myTable (col1, col2) VALUES
(val1, val2), (val3, val4), (val5, val6), ...
This way the number of network round trips is reduced, which in turn reduces resource usage.
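The multi-row INSERT can be generated with placeholders so the values are bound and never concatenated into the SQL. This is a sketch only: the function name buildBatchInsert() is hypothetical, and the table and column names are assumed to be trusted.

```php
<?php
// Hypothetical helper: builds one multi-row INSERT with "?" placeholders
// plus the flat parameter list that goes with it.
function buildBatchInsert($table, array $columns, array $rows)
{
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = 'INSERT INTO `' . $table . '` (`' . implode('`, `', $columns) . '`) VALUES '
         . implode(', ', array_fill(0, count($rows), $rowPlaceholder));

    $params = array();
    foreach ($rows as $row) {
        foreach ($columns as $column) {
            $params[] = $row[$column]; // same column order as in the SQL
        }
    }
    return array($sql, $params);
}
```

With PDO, for instance, the rows could be split with `array_chunk($allRows, 100)` and each chunk sent as `$pdo->prepare($sql)->execute($params)`. The chunk size of 100 is taken from the advice above and is not a tuned value.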
Ok, I'll try and keep this short, sweet and to-the-point.
We do massive GeoIP updates to our system by uploading a MASSIVE CSV file to our PHP-based CMS. This thing usually has more than 100k records of IP address information. Now, doing a simple import of this data isn't an issue at all, but we have to run checks against our current regional IP address mappings.
This means that we must validate the data, compare and split overlapping IP addresses, etc. And these checks must be made for each and every record.
Not only that, but I've just created a field mapping solution that would allow other vendors to implement their GeoIP updates in different formats. This is done by applying rules to IP records within the CSV update.
For instance a rule might look like:
if 'countryName' == 'Australia' then send to the 'Australian IP Pool'
There might be multiple rules that have to be run and each IP record must apply them all. For instance, 100k records to check against 10 rules would be 1 million iterations; not fun.
We're finding 2 rules for 100k records takes up to 10 minutes to process. I'm fully aware of the bottleneck here, which is the sheer number of iterations that must occur for a successful import; I'm just not fully aware of any other options we may have to speed things up a bit.
Someone recommended splitting the file into chunks, server-side. I don't think this is a viable solution as it adds yet another layer of complexity to an already complex system. The file would have to be opened, parsed and split. Then the script would have to iterate over the chunks as well.
So, question is, considering what I just wrote, what would the BEST method be to speed this process up a bit? Upgrading the server's hardware JUST for this tool isn't an option unfortunately, but they're pretty high-end boxes to begin with.
Not as short as I thought, but yeah. Halps? :(
Perform a BULK IMPORT into a database (SQL Server's what I use). The BULK IMPORT takes seconds literally, and 100,000 records is peanuts for a database to crunch on business rules. I regularly perform similar data crunches on a table with over 4 million rows and it doesn't take the 10 minutes you listed.
EDIT: I should point out, yeah, I don't recommend PHP for this. You're dealing with raw DATA, use a DATABASE.. :P
The simple key to this is keeping as much work out of the inner loop as possible.
Simply put, anything you do in the inner loop is done "100K times", so doing nothing is best (but certainly not practical), and doing as little as possible is the next best bet.
If you have the memory, for example, and it's practical for the application, defer any "output" until after the main processing. Cache any input data if practical as well. This works best for summary data or occasional data.
Ideally, save for the reading of the CSV file, do as little I/O as possible during the main processing.
Does PHP offer any access to the Unix mmap facility? That is typically the fastest way to read files, particularly large files.
Another consideration is to batch your inserts. For example, it's straightforward to build up your INSERT statements as simple strings, and ship them to the server in blocks of 10, 50, or 100 rows. Most databases have some hard limit on the size of the SQL statement (like 64K, or something), so you'll need to keep that in mind. This will dramatically reduce your round trips to the DB.
If you're creating primary keys through simple increments, do that en masse (blocks of 1000, 10000, whatever). This is another thing you can remove from your inner loop.
And, for sure, you should be processing all of the rules at once for each row, and not run the records through for each rule.
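A small sketch of applying every rule while a row is in hand, rather than re-reading the records once per rule. The rule format (field, expected value, target pool) is an assumption modelled on the 'countryName' example in the question, and routeRecords() is a hypothetical name.

```php
<?php
// Hypothetical sketch: one pass over the records, all rules checked per row.
// Each rule is array(field name, expected value, target pool name).
function routeRecords($records, array $rules)
{
    $pools = array();
    foreach ($records as $record) {       // the data is read exactly once
        foreach ($rules as $rule) {       // rules are cheap in-memory checks
            list($field, $expected, $pool) = $rule;
            if (isset($record[$field]) && $record[$field] === $expected) {
                $pools[$pool][] = $record;
            }
        }
    }
    return $pools;
}
```

The number of comparisons is the same as with one pass per rule, but the file reading, parsing and validation of each record happen once instead of once per rule, which is where most of the time tends to go.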
100k records isn't a large number. 10 minutes isn't a bad job processing time for a single thread. The amount of raw work to be done in a straight line is probably about 10 minutes, regardless if you're using PHP or C. If you want it to be faster, you're going to need a more complex solution than a while loop.
Here's how I would tackle it:
Use a map/reduce solution to run the process in parallel. Hadoop is probably overkill. Pig Latin may do the job. You really just want the map part of the map/reduce problem, i.e. you're forking off a chunk of the file to be processed by a subprocess. Your reducer is probably cat. A simple version could be having PHP fork processes for each 10K record chunk, wait for the children, then re-assemble their output.
Use a queue/grid processing model. Queue up chunks of the file, then have a cluster of machines checking in, grabbing jobs and sending the data somewhere. This is very similar to the map/reduce model, just using different technologies, plus you could scale by adding more machines to the grid.
If you can write your logic as SQL, do it in a database. I would avoid this because most web programmers can't work with SQL on this level. Also, SQL is sort of limited for doing things like RBL checks or ARIN lookups.
One thing you can try is running the CSV import under command line PHP. It generally provides faster results.
If you are using PHP to do this job, switch the parsing to Python since it is WAY faster than PHP in this matter; this switch should speed up the process by 75% or even more.
If you are using MySQL you can also use the LOAD DATA INFILE statement; I'm not sure if you need to check the data before you insert it into the database, though.
I have worked on this problem intensively for a while now. And yes, the better solution is to read in only a portion of the file at any one time, parse it, do validation, do filtering, then export it, and then read the next portion of the file. I would agree that this is probably not a job for PHP, although you can probably do it in PHP, as long as you have a seek function so that you can start reading from a particular location in the file. You are right that it adds a higher level of complexity, but it is worth that little extra effort.
If your data is pure, i.e. delimited correctly, string qualified, free of broken lines etc., then by all means bulk upload into a SQL database. Otherwise you want to know where, when and why errors occur and to be able to handle them.
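PHP does have a seek function (fseek(), with ftell() to read the current position), so the portion-at-a-time idea can be sketched as below. The function name processPortion() is hypothetical, and a comma-separated file without a header line is assumed.

```php
<?php
// Hypothetical sketch: processes at most $maxRows CSV rows starting at byte
// $offset and returns the offset to resume from, or null once the end of the
// file has been reached.
function processPortion($path, $offset, $maxRows, callable $handleRow)
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    fseek($handle, $offset); // jump straight to where the last run stopped
    $done = 0;
    while ($done < $maxRows && ($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if ($row !== array(null)) { // array(null) means a blank line
            $handleRow($row);
        }
        $done++;
    }
    // Fewer rows than requested means fgetcsv() hit the end of the file.
    $next = ($done < $maxRows) ? null : ftell($handle);
    fclose($handle);
    return $next;
}
```

The returned offset always falls on a record boundary, so it can be stored (in a file or a table) between runs, and each run validates, filters and exports only its own portion.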
I'm working with something similar. The CSV file I'm working with contains Portuguese-format dates (dd/mm/yyyy) that I have to convert into MySQL yyyy-mm-dd, and Portuguese-format money, R$ 1.000,15, that has to be converted into MySQL decimal 1000.15. Then trim any spaces and, finally, addslashes.
There are 25 variables to be treated before the insert.
If I check every $notafiscal value (a SELECT on the table to see if it exists, then update), PHP handles around 60k rows. But if I don't check it, PHP handles more than 1 million rows. The server works with 4GB of memory; running the script on localhost (2GB of memory), it handles half as many rows in both cases.
// $db is an open mysqli connection; $handle is the CSV file opened with fopen()
mysqli_query($db, "SET AUTOCOMMIT=0");
mysqli_query($db, "BEGIN");
mysqli_query($db, "SET FOREIGN_KEY_CHECKS = 0");
fgets($handle); // skip the header line of the CSV file
while (($data = fgetcsv($handle, 100000, ';')) !== FALSE):
    // if $notafiscal is lower than 1, ignore the record
    // (the integer cast also keeps the value safe to embed in the query below)
    $notafiscal = (int) $data[0];
    if ($notafiscal < 1):
        continue;
    endif;
    $serie = trim($data[1]);
    $data_emissao = converteDataBR($data[2]);
    $cond_pagamento = trim(addslashes($data[3]));
    //...
    $valor_total = trim(moeda($data[24]));
    // check if the $notafiscal already exists: if so, update, else insert into the table
    $query = "SELECT 1 FROM venda WHERE notafiscal = " . $notafiscal . " LIMIT 1";
    $rs = mysqli_query($db, $query);
    if (mysqli_num_rows($rs) > 0):
        //UPDATE TABLE
    else:
        //INSERT INTO TABLE
    endif;
    mysqli_free_result($rs); // release the result set before the next row
endwhile;
mysqli_query($db, "COMMIT");
mysqli_query($db, "SET AUTOCOMMIT=1");
mysqli_query($db, "SET FOREIGN_KEY_CHECKS = 1");
mysqli_close($db);
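The helpers converteDataBR() and moeda() are called in the script but their bodies are not shown. The following is only a guess at what they might look like, based on the conversions described (dd/mm/yyyy to yyyy-mm-dd, and R$ 1.000,15 to 1000.15); these are hypothetical implementations, not the author's code.

```php
<?php
// Hypothetical implementation: "31/12/2015" -> "2015-12-31", null if the
// value does not match dd/mm/yyyy. Note that an impossible date such as
// 31/02/2015 is rolled over by DateTime rather than rejected.
function converteDataBR($value)
{
    $date = DateTime::createFromFormat('!d/m/Y', trim($value));
    return $date === false ? null : $date->format('Y-m-d');
}

// Hypothetical implementation: "R$ 1.000,15" -> "1000.15".
function moeda($value)
{
    $digits = preg_replace('/[^0-9,.\-]/', '', $value); // drop "R$" and spaces
    $digits = str_replace('.', '', $digits);            // drop thousands separators
    return str_replace(',', '.', $digits);              // decimal comma -> point
}
```

Both functions work on plain strings, so they can be checked in isolation before the import is run against the database.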