I have a problem with exporting a large amount of data to a CSV file using PHP.
Information:
I need to export 700,000 addresses from a database table (address).
The server times out or runs out of memory.
The project I'm working on runs on multiple servers.
My solution (what I have tried):
Get the data part by part from the database, process it (fputcsv), write that part to a temporary file, and send progress information to the user via Ajax (showing them the percentage processed). After the last part of the data has been processed, give the user a link to download the file. This works for me on my local environment, but
the problem is that the project runs on multiple servers, so I ran into the problem that the temporary file can be stored on different servers.
For Example:
I have 3 servers: Server1, Server2 and Server3.
The first time, I read data from the DB with LIMIT 0, 50000, process it, and save it to File.csv on Server1; the next iteration, LIMIT 50000, 50000, can be saved on another server, Server2. This is the problem.
So my question is:
Where can I store my processed temporary CSV data? Or maybe I am missing something; I am stuck here and looking for advice.
Every suggestion or solution will be appreciated! Thanks.
UPDATE
PROBLEM IS SOLVED
I will post my solution later.
You can use a MySQL query, with limits if needed, to export the records directly into a CSV file from the MySQL database:
SELECT id, name, email INTO OUTFILE '/tmp/result.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
FROM users WHERE 1
It would really help if you posted your code. The reason I say that is because it doesn't sound like you're looping row by row, which would save you heaps of memory: no huge array to keep in RAM. If you're not looping row by row and committing to the CSV file as you go, then I suggest you modify your code to do just that; it might solve the issue altogether.
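For illustration, a minimal sketch of that row-by-row approach, assuming a PDO connection, an address table with these columns, and a local output path; every name here is a placeholder, not your actual schema:

<?php
// Row-by-row streaming: nothing but the current row is ever held in PHP memory.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);
// Unbuffered query so the full result set is never loaded into PHP at once.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$fh = fopen('/tmp/export.csv', 'w');           // or 'php://output' to stream directly
fputcsv($fh, ['id', 'street', 'city', 'zip']); // header row

$stmt = $pdo->query('SELECT id, street, city, zip FROM address');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    fputcsv($fh, $row);                        // commit one row at a time, no big array
}
fclose($fh);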
If even committing to the CSV row by row is not enough, then the issue you're running into is that your server setup relies on the code being stateless, but your code isn't.
You can solve this issue using either of the following ways:
Make user sessions server-specific. If you're routing requests via a load balancer, then you can probably control that setting there. If not, you'll have to go into custom session handling and configure your environment accordingly. I don't like this method, but if you can control it via the load balancer, it might be the quickest way to get the problem solved.
Save the temporary files to a shared DB that all the servers have access to, with a simple transaction ID or some other identifier (see the sketch below). Have the server handling the last portion of the export aggregate the data and prepare the file for download.
Potentially, you could run into another memory limit or max run time issue with method #2. In that case, if you cannot raise the servers' RAM, configure PHP to use more RAM, or extend the script's maximum run time, then my suggestion would be to let the user download the file portion by portion: export the CSV up to the limit your server supports, let the user download it, then let them download the next file, and so on.
Potentially, you should try this method before you try any of the others. But perhaps the question we should really be asking is: why use PHP to convert database entries into CSV in the first place? A lot of databases have a built-in CSV export which is almost guaranteed to take less memory and time. If you're using MySQL, for example, see: How to output MySQL query results in CSV format?
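To make method #2 concrete, here is a rough sketch of storing the processed chunks in a shared database table keyed by an export ID; the export_chunks table, its columns, and both function names are invented for illustration:

// Assumes a shared table export_chunks(export_id CHAR(36), chunk_no INT, csv_data LONGTEXT)
// that every server can reach.
function saveChunk(PDO $pdo, string $exportId, int $chunkNo, array $rows): void
{
    $buf = fopen('php://temp', 'r+');          // build this chunk's CSV text in memory
    foreach ($rows as $row) {
        fputcsv($buf, $row);
    }
    rewind($buf);
    $stmt = $pdo->prepare(
        'INSERT INTO export_chunks (export_id, chunk_no, csv_data) VALUES (?, ?, ?)'
    );
    $stmt->execute([$exportId, $chunkNo, stream_get_contents($buf)]);
    fclose($buf);
}

// Whichever server handles the final request stitches the chunks together in order
// and streams them to the user.
function streamExport(PDO $pdo, string $exportId): void
{
    header('Content-Type: text/csv');
    header('Content-Disposition: attachment; filename="export.csv"');
    $stmt = $pdo->prepare(
        'SELECT csv_data FROM export_chunks WHERE export_id = ? ORDER BY chunk_no'
    );
    $stmt->execute([$exportId]);
    while (($chunk = $stmt->fetchColumn()) !== false) {
        echo $chunk;
    }
}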
Hope this helps.
You can increase the execution time of your PHP code using ini_set('max_execution_time', $seconds), where $seconds is the number of seconds to allow.
Related
I'm looking for some general advice on how to go about this. I have the following task(s) planned:
The following is already working:
User submits a CSV file via the form on my site.
The file/URL of the file gets sent to a different server for processing (loop through each CSV row, connect to my WordPress site, and create each item as a product via the WooCommerce REST API).
What I want to achieve:
If, for example, 5 people submit a CSV at roughly the same time, that's a lot of writing to the database at once (some of the files could have 500, 1000+ rows). I would prefer a 'queue' system:
1 CSV file received.
Process the file, do all the product creation etc.
When finished, move to the next CSV and process that one next.
Note: when I say the CSV is received, I mean that I am simply passing the CSV URL and doing a file_get_contents in the script which processes it.
First point: you have a high workload to process, and it depends not only on the number of files but also on the length of each individual file.
E.g., you have a CSV with 1000+ lines.
That kind of file would lock your queue for a long time and consume a lot of the MySQL reserved memory.
So I would proceed as follows:
Take each file and translate it into a series of MySQL multiple-row INSERT queries.
This way you reduce the number of round trips between MySQL and the application that would happen if you did separate inserts.
Consider that the best approach is to split each file into bulk insert scripts of roughly 200 records each, to avoid high MySQL memory consumption, which would slow down the process (a sketch follows below).
Create a queue job for each bulk import script you create and send it to your queue processor. I would avoid a cron job and go for an AMQP implementation, using the php-amqplib wrapper library as a starting point.
Do not use file_get_contents but fgetcsv, since loading the whole file at once and processing it manually may not be the best option.
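As a rough sketch of the fgetcsv-plus-batched-INSERT idea mentioned above, assuming a three-column products table (sku, name, price) and an open PDO connection; all names are placeholders:

// Read the CSV with fgetcsv and batch ~200 rows into each multi-row INSERT.
$pdo = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'user', 'pass');
$batch = [];
$fh = fopen($csvPath, 'r');
fgetcsv($fh);                       // skip the header row (assumes one exists)

while (($row = fgetcsv($fh)) !== false) {
    $batch[] = $row;                // each $row is assumed to have exactly 3 fields
    if (count($batch) === 200) {
        insertBatch($pdo, $batch);
        $batch = [];
    }
}
if ($batch) {
    insertBatch($pdo, $batch);      // flush the remainder
}
fclose($fh);

function insertBatch(PDO $pdo, array $rows): void
{
    // One multi-row INSERT instead of 200 single-row round trips.
    $placeholders = implode(',', array_fill(0, count($rows), '(?,?,?)'));
    $stmt = $pdo->prepare("INSERT INTO products (sku, name, price) VALUES $placeholders");
    $stmt->execute(array_merge(...$rows));
}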
I receive files in a streamed manner once every 30 seconds. The files may have up to 40 columns and 50,000 rows. The files are txt files and tab-separated. Right now, I'm saving the file temporarily, loading the contents into a temporary table in the database with LOAD DATA INFILE, and deleting the file afterwards.
I would like to avoid the save-and-delete process and instead save the data directly to the database. The stream is the $output here:
protected function run(OutputInterface $output)
{
$this->readInventoryReport($this->interaction($output));
}
I've been googling around trying to find a convincing answer for the "performance is a big issue" case, but I can't find a good way of doing this without saving the data to a file and using LOAD DATA INFILE. I need to have the contents available quickly and to work with them after they are saved to a temporary table (updating other tables with the contents...).
Is there a good way of handling this, or will the save-and-delete method together with LOAD DATA INFILE be better than other solutions?
The server I'm running this on has SSDs and 32GB of RAM.
LOAD DATA INFILE is your fastest way to do low-latency ingestion of tonnage of data into MySQL.
You can write yourself a PHP program that will, using prepared statements and the like, do a pretty good job of inserting rows into your database. If you arrange to do a COMMIT every couple of hundred rows, use prepared statements, and write your code carefully, it will be fairly fast, but not as fast as LOAD DATA INFILE. Why? Individual row operations have to be serialized onto the network wire, then deserialized, and processed one (or two or ten) at a time. LOAD DATA just slurps up your data locally.
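For reference, a minimal sketch of what that prepared-statement-plus-batched-COMMIT approach could look like; the inventory table and its columns are placeholders, and a real script would add error handling:

$pdo = new PDO('mysql:host=localhost;dbname=stock;charset=utf8mb4', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);
$stmt = $pdo->prepare('INSERT INTO inventory (sku, qty, price) VALUES (?, ?, ?)');

$fh = fopen($tsvPath, 'r');
$pdo->beginTransaction();
$count = 0;
while (($row = fgetcsv($fh, 0, "\t")) !== false) {  // tab-separated input
    $stmt->execute([$row[0], $row[1], $row[2]]);
    if (++$count % 200 === 0) {                      // COMMIT every couple hundred rows
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();
fclose($fh);
// LOAD DATA INFILE will still beat this, because it avoids the per-row
// round trips over the wire, but this keeps memory usage flat.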
It sounds like you have a nice MySQL server machine. But the serialization is still a bottleneck.
50K records every 30 seconds, eh? That's a lot! Is any of that data redundant? That is, do any of the rows in a later batch of data overwrite rows in an earlier batch? If so, you might be able to write a program that would skip rows that have become obsolete.
I have a big data set in MySQL (users, companies, contacts): about 1 million records.
Now I need to import new users, companies, and contacts from an import file (CSV) with about 100,000 records. Each record in the file has all the info for all three entities (user, company, contact).
Moreover, on production I can't use LOAD DATA (I just don't have the necessary rights :( ).
So there are three steps which should be applied to that data set:
- compare with existing DB data
- update it (if we find something in the previous step)
- and insert new records
I'm using PHP on the server for this. I can see two approaches:
read ALL the data from the file at once and then work with this BIG array, applying those steps;
or read line by line from the file and pass each line through the steps.
Which approach is more efficient in terms of CPU, memory, or time usage?
Can I use transactions? Or will they slow down the whole production system?
Thanks.
In terms of CPU time there won't be much in it, although reading the whole file will be slightly faster. However, for such a large data set, the additional memory required to read all the records into memory will vastly outweigh that time advantage, so I would definitely process one line at a time.
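One possible line-by-line shape for the compare/update/insert steps is to lean on MySQL's INSERT ... ON DUPLICATE KEY UPDATE inside short transactions; in this sketch the contacts table, its unique key on email, and the column order are assumptions for illustration:

$pdo = new PDO('mysql:host=localhost;dbname=crm;charset=utf8mb4', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);
$stmt = $pdo->prepare(
    'INSERT INTO contacts (email, name, company)
     VALUES (:email, :name, :company)
     ON DUPLICATE KEY UPDATE name = VALUES(name), company = VALUES(company)'
);

$fh = fopen('import.csv', 'r');
$pdo->beginTransaction();
$n = 0;
while (($row = fgetcsv($fh)) !== false) {
    $stmt->execute(['email' => $row[0], 'name' => $row[1], 'company' => $row[2]]);
    if (++$n % 500 === 0) {       // keep transactions short so production
        $pdo->commit();           // traffic is not blocked for long
        $pdo->beginTransaction();
    }
}
$pdo->commit();
fclose($fh);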
Did you know that phpMyAdmin has a nifty "resumable import" feature for big SQL files?
Just check "Allow interrupt of import" in the Partial Import section. And voilà, phpMyAdmin will stop and loop until all requests are executed.
It may be more efficient to just "use the tool" rather than "reinvent the wheel".
I think the 2nd approach is more acceptable:
Create a change list (it would be a separate table).
Make updates line by line (and mark each line as updated using an "updflag" field, for example).
Perform this process in the background using transactions (a rough sketch follows).
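A very rough sketch of that change-list idea, assuming $pdo is an open PDO connection; the import_changes staging table, its columns, and the batch size are all made up for illustration:

// Stage the CSV rows in a separate table, then let a background worker apply
// them to the live tables in small transactions.
$pdo->exec('CREATE TABLE IF NOT EXISTS import_changes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255), name VARCHAR(255), company VARCHAR(255),
    updflag TINYINT NOT NULL DEFAULT 0
)');

// The background worker picks up unprocessed lines in batches:
$pdo->beginTransaction();
$rows = $pdo->query('SELECT * FROM import_changes WHERE updflag = 0 LIMIT 500');
$mark = $pdo->prepare('UPDATE import_changes SET updflag = 1 WHERE id = ?');
foreach ($rows as $row) {
    // ...compare with the live tables and UPDATE/INSERT the real data here...
    $mark->execute([$row['id']]);   // mark the line as processed
}
$pdo->commit();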
I have a php/mysql application, part of which allows the user to upload a csv file. The steps in the process are as follows:
User uploads a file, and it gets parsed to check for validity
If the file is valid, the parsed information is displayed, along with options for the user to match columns from the csv file to database columns
Import the data - final stage where the csv data is actually imported into the database
So, the problem I have at this point is that the same CSV file gets parsed in each of the above 3 steps, which means 3 parses per import.
Given that there can be up to 500 rows per csv file, then this doesn't strike me as particularly efficient.
Should I instead temporarily store the imported information in a database table after step 1? If so, I would obviously run clean-up routines to keep the table as clean as possible. The one downside is that the CSV imports can contain between 2 and 10 columns, so I'd have to make a database table of at least 11 columns (with an ID field), which would be somewhat redundant in most cases.
Or should I just stick with the csv parsing? Up to 500 rows is quite small...
Or perhaps there is another better alternative?
In PHP, you can store data in the session memory for later use. This allows you to parse the CSV file only once, save it in the session memory, and use this object in all of the later steps.
See http://www.tizag.com/phpT/phpsessions.php for a small tutorial.
Let me explain a bit more.
Every time a web browser requests a page from the server, PHP executes the PHP script associated with the web page. It then sends the output to the user. This is inherently stateless: the user requests something, you give something back -> end of transaction.
Sometimes, you may want to remember something you calculated in your PHP script and use it the next time the page is requested. This is stateful, you want to save state across different web requests.
One way is to save this result in the database or in a flat file. You could even add an identifier for the currently connected user, so you use a per-user file or save the current user in your database.
You could also use a hidden form and save all of the data as hidden input fields. When the user presses "Next", the hidden input fields are sent back to the PHP script.
This is all very clumsy. There is a better way: session memory. This is a piece of memory that you can access, which is saved across different PHP calls. It is perfect for saving temporary state information like this. The session memory can be indexed per application user.
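For example, a small sketch of keeping the parsed CSV in session memory between steps; the csv_rows key and the upload field name (csv) are arbitrary choices:

session_start();

if (isset($_FILES['csv'])) {
    // Step 1: parse the upload once and remember it in the session.
    $rows = array();
    $fh = fopen($_FILES['csv']['tmp_name'], 'r');
    while (($row = fgetcsv($fh)) !== false) {
        $rows[] = $row;
    }
    fclose($fh);
    $_SESSION['csv_rows'] = $rows;
} else {
    // Steps 2 and 3 (later requests): read it back, no re-parsing needed.
    $rows = isset($_SESSION['csv_rows']) ? $_SESSION['csv_rows'] : array();
}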
Please note that there are frameworks that take this a lot further. Java SEAM has an APPLICATION memory, a SESSION memory, a CONVERSATION memory, a PAGE memory and even a single EVENT memory.
I had to do a similar thing for importing users into a database. What I ended up doing was this:
Import and parse CSV
Assign data to an array
Next page had a bunch of hidden form fields each with the data (ex. <input type="hidden" name="data[]" value="thedata" />)
Post it over and add the data to the database
It ended up working well for me. You can also save the data to session variables.
I'd just stick with parsing it 3 times. PHP is slow anyways, as are the network latencies for using the database or sending information to the client. What's most important is that your code is maintainable and extensible. The slowest part of a computer program is the developer.
See http://en.wikipedia.org/wiki/Program_optimization#When_to_optimize
http://ubiquity.acm.org/article.cfm?id=1147993
http://benchmarksgame.alioth.debian.org/u32/which-programs-are-fastest.html
Hope that helps...
We have 2 servers, one of which is the customer's. Our customer is providing us URLs of XML/JSON exports of his clients' information from his CMS, and our task is to write import scripts for importing the data into the web app which we're developing.
I've always done that like this:
INSERT INTO customers (name,address) VALUES ('John Doe', 'NY') ON DUPLICATE KEY UPDATE name='John Doe', address='NY'
This solution is the best in terms of performance, as far as I know...
But this solution does NOT solve the problem of deleting records. What if some client is deleted from the database and is no longer in the export? How should I handle that?
Should I first TRUNCATE the whole table and then fill it again?
Or should I fill an array in PHP with all the records, then walk through it again and delete the records which aren't in the XML/JSON?
I think there must be a better solution.
I'm interested in the best solution in terms of performance, because we have to import many thousands of records and the whole import process may take a lot of time.
I'm interested in the best solution in terms of performance
If it's MySQL at the client, use MySQL replication: the client as the master and your end as the slave. You can either use a direct feed (you'd probably want to run this across a VPN) or disconnected mode (they send you the binlogs to roll forward).
Our customer is providing us URLs of XML/JSON exports of his clients' information from his CMS
This is a really dumb idea, and it sounds like you're trying to make the solution fit the problem (which it doesn't). HTTP is not the medium for transferring large data files across the internet. It also means that the remote server must do rather a lot of work just to make the data available (assuming it can even identify what data needs to be replicated, which, as you point out, is currently failing to work for deleted records). The latter point is true regardless of the network protocol.
You certainly can't copy large amounts of data directly across at a lower level in the stack than the database (e.g. trying to use rsync to replicate data files), because the local mirror will nearly always be inconsistent.
C.
Assuming you are using MySQL, the only SQL I know anything about:
Is it true that the export from your customer's CMS always contains all of his current customer data? If so, then yes, in my opinion it is best to drop or truncate the 'customers' table; that is, to just throw away yesterday's customer table and reconstruct it today from scratch.
But you can't use 'insert': it will take ~28 hours per day to insert thousands of customer rows. So forget about 'insert'.
Instead, add rows to 'customers' with LOAD DATA LOCAL INFILE: first write a temporary disk file 'cust_data.txt' containing all the customer data, with the column data separated somehow (perhaps by commas), and then say something like:
load data local infile 'cust_data.txt' replace into table customers fields terminated by ',' lines terminated by '\n';
Can you structure the query such that you can use your client's output file directly, without first staging it into 'cust_data.txt'? That would be the answer to a maiden's prayer.
It should be fast enough for you: you will be amazed!
ref: http://dev.mysql.com/doc/refman/5.0/en/load-data.html
If your customer can export data as a CSV file, you can use SQL Data Examiner
(http://www.sqlaccessories.com/SQL_Data_Examiner) to update records in the target database (insert/update/delete) using the CSV file as the source.