My web application lets user import an excel file and writes the data from the file into the mysql database.
The problem is, when the excel file has lots of entries, even 1000 rows, i get an error saying PHP ran out of memory. This occurs while reading the file.
I have assigned 1024MB to PHP in the php.ini file.
My question is, how to go about importing such large data in PHP.
I am using CodeIgniter.
for reading the excel file, i am using this library.
SOLVED. I used CSV instead of xls. and I could import 10,000 rows of data within seconds.
Thank you all for your help.
As others have said, 1000 records is not much. Make sure you process the records one at a time, or a few at a time, and that the variables you use for each iteration go out of scope after you're finished with that row or you're reusing the variables.
If you can avoid the necessity of processing excel files by exporting them to csv, that's even greater, cause then you wouldn't need such a library (which might or might not have its own memory issues).
Don't be afraid of increasing memory usage if you need to and that solves the problem, buying memory is the cheapest option sometimes. And don't let the 1 GB scare you, it is a lot for such a simple task, but if you have the memory and that's all you need to do, then its good enough for the moment.
And as a plus, if you are using an old version of PHP, try updating to PHP 5.4 which handles memory much better than its predecessors.
Instead of inserting one a time in a loop. Insert 100 row at a time.
You can always run
INSERT INTO myTable (clo1, col2, col2) VALUES
(val1, val2), (val3, val4), (val5, val6) ......
This way number of network transaction will reduce thus reducing resource usage.
Related
Please, if somebody can give me support.
My problem is:
I have a table with 8 fields and about 510 000 records. In a web form, the user select an Excel file and it's read it with SimpleXLSX. The file has about 340 000 lines. With PHP and SimpleXLSX library this file is loaded in memory, then with a for cicle the script read line by line, taken one data of ecah line and search this value in the table, if the data exists in the table, then does not insert the value, other wise, the values read it are stored in the table.
This process takes days to finish.
Can somebody suggest me some operation to speed up the process?
Thanks a lot.
if you have many users, and they maybe use the web at the same time:
you must change SimpleXLSX to js-xlsx, in webbrowser do all work but only write database in server
if you have few users (i think you in this case)
and search this value in the table
this is cost the must time, if your single-to-single compare memory and database, then add/not-add to database.
so you can read all database info in memory, (must use hash-list for compare),then compare all
and add it to memory and mark newable
at last
add memory info to database
because you database and xls have most same count, so...database become almost valueless
just forget database, this is most fast in memory
in memory use hash-list for compare
of course, you can let above run in database if you can use #Barmar's idea.. don't insert single, but batch
Focus on speed on throwing the data into the database. Do not try to do all the work during the INSERT. Then use SQL queries to further clean up the data.
Use the minimal XLS to get the XML into the database. Use some programming language if you need to massage the data a lot. Neither XLS nor SQL is the right place for complex string manipulations.
If practical, use LOAD DATA ... XML to get the data loaded; it is very fast.
SQL is excellent for handling entire tables at once; it is terrible at handling one row at a time. (Hence, my recommendation of putting the data into a staging table, not directly into the target table.)
If you want to discuss further, we need more details about the conversions involved.
Looking for insight on the best approach for large csv file imports to mysql and managing the dataset. This is for an ecommerce storefront "startup". All product data will be read from csv files which are download via curl (server to server).
Each csv file represents a different supplier/warehouse with up to 100,000 products. In total there are roughly 1.2 million products spread over 90-100 suppliers. At least 75% of the row data (51 columns) is redundant garbage and will not be needed.
Would it be better to use mysqli LOAD DATA LOCAL INFILE to 'temp_products' table. Then, make the needed data adjustments per row, then insert to the live 'products' table or simply use fgetcsv() and go row by row? The import will be handled by a CronJob using the sites php.ini with a memory limit of 128M.
Apache V2.2.29
PHP V5.4.43
MySQL V5.5.42-37.1-log
memory_limit 128M
I'm not looking for "How to's". I'm simply looking for the "best approach" from the communities perspective and experience.
I have direct experience of doing something virtually identical to what you describe -- lots of third party data sources in different formats all needing to go into a single master table.
I needed to take different approaches for different data sources, because some were in XML, some in CSV, some large, some small, etc. For the large CSV ones, I did indeed follow roughly your suggested routed:
I used LOAD DATA INFILE to dump the raw contents into a temporary table.
I took the opportunity to transform or discard some of the data within this query; LOAD DATA INFILE allows some quite complex queries. This allowed me to use the same temp table for several of the import processes even though they had quite different CSV data, which made the next step easier.
I then used a set of secondary SQL queries to pull the temp data into the various main tables. All told, I had about seven steps to the process.
I had a set of PHP classes to do the imports, which all implemented a common interface. This meant that I could have a common front-end program which could run any of the importers.
Since a lot of the importers did similar tasks, I put the commonly used code in traits so that the code could be shared.
Some thoughts based on the things you said in your question:
LOAD DATA INFILE will be orders of magnitude quicker than fgetcsv() with a PHP loop.
LOAD DATA INFILE queries can be very complex and achieve very good data mapping without ever having to run any other code, as long as the imported data is going into a single table.
Your memory limit is likely to need to be raised. However, using LOAD DATA INFILE means that it will be MySQL which will use the memory, not PHP, so the PHP limit won't come into play for that. 128M is still likely to be too low for you though.
-If you struggle to import the whole thing in one go, try using some simple Linux shell commands to split the file into several smaller chunks. CSV data format should make that fairly simple.
I have a big data set into MySQL (users, companies, contacts)? about 1 million records.
And now I need to make import new users, companies, contacts from import file (csv) with about 100000 records. I records from file has all info for all three essences (user, company, contacts).
Moreover on production i can't use LOAD DATA (just do not have so many rights :( ).
So there are three steps which should be applied to that data set.
- compare with existing DB data
- update it (if we will find something on previous step)
- and insert new, records
I'm using php on server for doing that. I can see two approaches:
reading ALL data from file at once and then work with this BIG array and apply those steps.
or reading line by line from the file and pass each line through steps
which approach is more efficient ? by CPU, memory or time usage
Can I use transactions ? or it will slow down whole production system ?
Thanks.
CPU time/time there won't be much in it, although reading the whole file will be slightly faster. However, for such a large data set, the additional memory required to read all records into memory will vastly outstrip the time advantage - I would definitely process one line at a time.
Did you know that phpMyAdmin has that nifty feature of "resumable import" for big SQL files ?
Just check "Allow interrupt of import" in the Partial Import section. And voila, PhpMyAdmin will stop and loop until all requests are executed.
It may be more efficient to just "use the tool" rather than "reinvent the wheel"
I think, 2nd approach is more acceptable:
Create change list (it would be a separate table)
Make updates line by line (and mark each line as updated using "updflag" field, for example)
Perform this process in background using transactions.
I'm converting a forum from myBB to IPBoard (the conversion is done through a PHP script), however I have over 4 million posts that need to be converted, and it will take about 10 hours at the current rate. I basically have unlimited RAM and CPU, what I want to know is how can I speed this process up? Is there a way I can allocate a huge amount of memory to this one process?
Thanks for any help!
You're not going to get a script to run any faster. By giving it more memory, you might be able to have it do more posts at one time, though. Change memory_limit in your php.ini file to change how much memory it can use.
You might be able to tell the script to do one forum at a time. Then you could run several copies of the script at once. This will be limited by how it talks to the database table and whether the script has been written to allow this -- it might do daft things like lock the target table or do an insanely long read on the source table. In any case, you would be unlikely to get more than three or four running at once without everything slowing down, anyway.
It might be possible to improve the script, but that would be several days' hard work learning the insides of both forums' database formats. Have you asked on the forums for IPBoard? Maybe someone there has experience at what you're trying to do.
not sure how the conversion is done, but if you are importing a sql file , you could split it up to multiple files and import them at the same time. hope that helps :)
If you are saying that you have the file(s) already converted, you should look into MySQL Load Data In FIle for importing, given you have access to the MySQL Console. This will load data considerably faster than executing the SQL Statements via the source command.
If you do not have them in the files and you are doing them on the fly, then I would suggest having the conversion script write the data to a file (set the time limit to 0 to allow it to run) and then use that load data command to insert / update the data.
I am importing a csv file with more then 5,000 records in it. What i am currently doing is, getting all file content as an array and saving them to the database one by one. But in case of script failure, the whole process will run again and if i start checking the them again one by one form database it will use lots of queries, so i thought to keep the imported values in session temporarily.
Is it good practice to keep that much of records in the session. Or is there any other way to do this ?
Thank you.
If you have to do this task in stages (and there's a couple of suggestions here to improve the way you do things in a single pass), don't hold the csv file in $_SESSION... that's pointless overhead, because you already have the csv file on disk anyway, and it's just adding a lot of serialization/unserialization overhead to the process as the session data is written.
You're processing the CSV records one at a time, so keep a count of how many you've successfully processed in $_SESSION. If the script times out or barfs, then restart and read how many you've already processed so you know where in the file to restart.
What can be the maximum size for the $_SESSION ?
The session is loaded into memory at run time - so it's limited by the memory_limit in php.ini
Is it good practice to keep that much of records in the session
No - for the reasons you describe - it will also have a big impact on performance.
Or is there any other way to do this ?
It depends what you are trying to achieve. Most databases can import CSV files directly or come with tools which will do it faster and more efficently than PHP code.
C.
It's not a good idea imho since session data will be serialized/unserialized for every page request, even if they are unrelated to the action you are performing.
I suggest using the following solution:
Keep the CSV file lying around somewhere
begin a transaction
run the inserts
commit after all inserts are done
end of transaction
Link: MySQL Transaction Syntax
If something fails the inserts will be rolled back so you know you can safely redo the inserts without having to worry about duplicate data.
To answer the actual question (Somebody just asked a duplicate, but deleted it in favour of this question)
The default session data handler stores its data in temporary files. In theory, those files can be as large as the file system allows.
However, as #symcbean points out, session data is auto-loaded into the script's memory when the session is initialized. This limits the maximum size you should store in session data severely. Also, loading lots of data has a massive impact on performance.
If you have huge amounts of data you need to store connected to a session, I would recommend using temporary files that you name by the current session ID. You can then deal with those files as needed, and as possible within the limits of the script's memory_limit.
If you are using Postgresql, you can use a single query to insert them all using pg_copy_from., or you can use pg_put_line like it is shown in the example (copy from stdin), which I found very useful when importing tons of data.
If you use MySql, you'll have to do multiple inserts. Remember to use transactions, so that if you use transactions, if your query fails it will be canceled and you can start over. Note that 5000 rows is not that large! You snould however be aware of the max_execution_time constraint which will kill your script after a number of seconds.
For what the SESSION is concerned, I believe that you are limited by the maximum amount of memory a script can use (memory_limit in php.ini). Session data are saved in files, so you should consider also the disk space usage if many clients are connected.
It depends on operating system file size, Whatever the session size, per page default is 128 MB.