Ok, so I've got a web system (built on CodeIgniter and running on MySQL) that allows people to query a database of postal address data by making selections in a series of forms until they arrive at the selection they want, pretty standard stuff. They can then buy that information and download it via the system.
The queries themselves run very fast, but when it comes to applying a query to the database and exporting the result to CSV, once the datasets get to around the 30,000 record mark (each row has around 40 columns, of which about 20 are populated with on average 20 characters of data per cell) it can take 5 or so minutes to export to CSV.
So, my question is: what is the main cause of the slowness? Is the result set from the query so large that it is running into memory issues, and should I therefore allow much more memory for the process? Or is there a much more efficient way of exporting to CSV from a MySQL query that I'm not using? Should I save the contents of the query to a temp table and simply export the temp table to CSV? Or am I going about this all wrong? Also, is the fact that I'm using CodeIgniter's Active Record for this prohibitive due to the way it stores the result set?
Pseudo Code:
$query = $this->db->select('field1, field2, field3')->where_in('field1', $values_array)->get('tablename');
$this->load->dbutil(); // load the Database Utility class that provides csv_from_result()
$data = $this->dbutil->csv_from_result($query, $delimiter, $newline); // then some code to save the file
$this->load->helper('download');
force_download($filename, $data);
Any advice is welcome! Thank you for reading!
Ignoring CodeIgniter for a moment, you basically have three options for exporting CSV using PHP:
To disk - typically the slowest option
To memory - typically the fastest option
Directly to the browser
In your case I would skip any built-in CodeIgniter CSV functions and try streaming directly to the browser (a minimal sketch is included at the end of this answer).
30,000 records * 40 columns * 20 bytes = 24,000,000 bytes (roughly 24 MB)
If this is a busy shared server then I can imagine this being a disk I/O bottleneck. If it's a Windows-based server then there's probably some paging happening as well to slow it down.
Try skipping the disk and writing directly to the network.
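For illustration, here is a minimal sketch of the "directly to the browser" approach, using plain mysqli and an unbuffered result; the connection details, table and column names are placeholders, so adapt them to your own schema or CodeIgniter models:

// Send CSV headers so the browser treats the response as a download
header('Content-Type: text/csv; charset=utf-8');
header('Content-Disposition: attachment; filename="export.csv"');

$out = fopen('php://output', 'w');                   // write straight into the response body
fputcsv($out, array('field1', 'field2', 'field3'));  // header row

$mysqli = new mysqli('localhost', 'db_user', 'db_pass', 'db_name');
$result = $mysqli->query(
    'SELECT field1, field2, field3 FROM tablename',
    MYSQLI_USE_RESULT                                // unbuffered: rows never pile up in PHP memory
);

while ($row = $result->fetch_row()) {
    fputcsv($out, $row);                             // stream one row at a time
}

$result->close();
fclose($out);

Because nothing is accumulated in a PHP string or written to disk first, memory use stays flat no matter how many rows the export contains.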
Please, can somebody give me some support?
My problem is:
I have a table with 8 fields and about 510,000 records. In a web form, the user selects an Excel file and it is read with SimpleXLSX. The file has about 340,000 lines. With PHP and the SimpleXLSX library the file is loaded into memory; then, with a for loop, the script reads it line by line, takes one value from each line and searches for that value in the table. If the value already exists in the table, the row is not inserted; otherwise, the values that were read are stored in the table.
This process takes days to finish.
Can somebody suggest some way to speed up the process?
Thanks a lot.
If you have many users and they may be using the web form at the same time:
Switch from SimpleXLSX to js-xlsx, so that the browser does all the parsing work and the server only has to write to the database.
If you have few users (which I think is your case):
"and search this value in the table"
This step is what costs most of the time: you compare one value at a time between memory and the database, then decide row by row whether to insert it or not.
So instead, read all the relevant database values into memory (use a hash list, i.e. an associative array keyed by the value, for the comparisons), do all the comparing there, add the new rows to that in-memory structure and mark them as new, and only at the end write the in-memory additions back to the database.
Because your database and your XLSX file contain roughly the same number of rows, hitting the database for every comparison is almost worthless; just forget the database for that part. Doing the comparison in memory, with a hash list, is by far the fastest.
Of course, you can still make the final write to the database efficient if you use @Barmar's idea: don't insert rows one at a time, insert them in batches.
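A rough sketch of that idea in PHP, assuming PDO and SimpleXLSX's rows() accessor; the table and column names (my_table, lookup_value, col1-col3) are placeholders:

$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'db_user', 'db_pass');

// 1. Load every existing key into a PHP hash (array keys give O(1) lookups).
$existing = array();
foreach ($pdo->query('SELECT lookup_value FROM my_table') as $row) {
    $existing[$row['lookup_value']] = true;
}

// 2. Walk the spreadsheet rows and keep only the ones that are not already present.
$toInsert = array();
foreach ($xlsx->rows() as $row) {               // $xlsx comes from SimpleXLSX, as in the question
    $value = $row[0];
    if (!isset($existing[$value])) {
        $toInsert[] = $row;
        $existing[$value] = true;               // also catches duplicates inside the file itself
    }
}

// 3. Insert the new rows in batches instead of one query per row.
$stmt = $pdo->prepare('INSERT INTO my_table (col1, col2, col3) VALUES (?, ?, ?)');
$pdo->beginTransaction();
foreach ($toInsert as $i => $row) {
    $stmt->execute(array($row[0], $row[1], $row[2]));
    if (($i + 1) % 1000 === 0) {                // commit periodically to keep transactions small
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();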
Focus on the speed of getting the data into the database. Do not try to do all the work during the INSERT; use SQL queries afterwards to further clean up the data.
Do only the minimal XLSX handling needed to get the data into the database. Use a programming language if you need to massage the data a lot; neither XLSX nor SQL is the right place for complex string manipulations.
If practical, use LOAD XML ... INFILE to get the data loaded; it is very fast.
SQL is excellent for handling entire tables at once; it is terrible at handling one row at a time. (Hence, my recommendation of putting the data into a staging table, not directly into the target table.)
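As a rough illustration of the staging-table idea (the table and column names below are made up; get the raw rows into the staging table however is fastest, e.g. batched inserts or LOAD XML / LOAD DATA INFILE):

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'db_user', 'db_pass');

// Clean up the raw data with set-based SQL, operating on the whole staging table at once.
$pdo->exec("UPDATE staging SET lookup_value = TRIM(lookup_value)");

// Then a single statement moves only the rows that are not already in the target table.
$pdo->exec("
    INSERT INTO target (lookup_value, col2, col3)
    SELECT s.lookup_value, s.col2, s.col3
    FROM staging s
    LEFT JOIN target t ON t.lookup_value = s.lookup_value
    WHERE t.lookup_value IS NULL
");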
If you want to discuss further, we need more details about the conversions involved.
I'm looking for some general advice on how to go about this. I have the following task(s) planned:
The following is already working:
User submits a CSV file via the form on my site.
The file/URL of the file gets sent to a different server for processing (loop through each CSV row, connect to my WordPress site and create each item as a product via the WooCommerce REST API).
What I want to achieve:
If, for example, 5 people submit a CSV at roughly the same time, that's a lot of writing to the database at once (some of the files could have 500, 1,000+ rows). I would prefer a 'queue' system:
1 CSV file received.
Process the file, do all the product creation etc.
When finished, move on to the next CSV and process that one.
Note: when I say the CSV is received, I mean I am simply passing the CSV URL and doing a file_get_contents() in the script which processes it.
First point: you have a high workload to process, and it depends not only on the number of files but also on the length of each individual file.
E.g. you have a CSV with 1,000+ lines.
That kind of file would lock your queue for a long time and consume a lot of the MySQL reserved memory.
So I would proceed as follows:
Take each file and translate it into a series of MySQL multi-row INSERT queries.
This way you reduce the number of round trips between the application and MySQL that would otherwise happen with separate inserts.
The best approach is to split each file into bulk-insert scripts of roughly 200 records, to avoid the high MySQL memory consumption that would slow the process down.
Create a queue job for each bulk-import script you generate and send them to your queue processor. I would avoid a cron job and go for an AMQP implementation, using the php-amqplib wrapper library as a starting point.
Do not use file_get_contents() but fgetcsv(), since loading the whole file at once and processing it manually may not be the best option.
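A rough sketch of the multi-row INSERT batching and the fgetcsv() advice above; PDO, a products table, three columns and $csvUrl (the submitted CSV URL) are assumed purely for illustration:

$pdo   = new PDO('mysql:host=localhost;dbname=shop;charset=utf8', 'db_user', 'db_pass');
$batch = array();
$size  = 200;                                     // roughly 200 records per multi-row INSERT

$flush = function (array $rows) use ($pdo) {
    if (empty($rows)) {
        return;
    }
    $placeholders = implode(',', array_fill(0, count($rows), '(?, ?, ?)'));
    $stmt = $pdo->prepare("INSERT INTO products (sku, name, price) VALUES $placeholders");
    $params = array();
    foreach ($rows as $row) {
        $params = array_merge($params, $row);     // flatten the rows into one parameter list
    }
    $stmt->execute($params);
};

$handle = fopen($csvUrl, 'r');                    // fgetcsv() streams the file; no file_get_contents()
fgetcsv($handle);                                 // skip the header row
while (($row = fgetcsv($handle)) !== false) {
    $batch[] = array($row[0], $row[1], $row[2]);
    if (count($batch) >= $size) {
        $flush($batch);                           // in the queued setup, enqueue this batch as a job instead
        $batch = array();
    }
}
$flush($batch);                                   // remaining partial batch
fclose($handle);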
I'm looking for insight on the best approach for large CSV file imports into MySQL and for managing the dataset. This is for an ecommerce storefront "startup". All product data will be read from CSV files which are downloaded via cURL (server to server).
Each CSV file represents a different supplier/warehouse with up to 100,000 products. In total there are roughly 1.2 million products spread over 90-100 suppliers. At least 75% of the row data (51 columns) is redundant garbage and will not be needed.
Would it be better to use mysqli with LOAD DATA LOCAL INFILE into a 'temp_products' table, then make the needed data adjustments per row and insert into the live 'products' table, or to simply use fgetcsv() and go row by row? The import will be handled by a cron job using the site's php.ini with a memory limit of 128M.
Apache V2.2.29
PHP V5.4.43
MySQL V5.5.42-37.1-log
memory_limit 128M
I'm not looking for "how-tos". I'm simply looking for the best approach from the community's perspective and experience.
I have direct experience of doing something virtually identical to what you describe -- lots of third party data sources in different formats all needing to go into a single master table.
I needed to take different approaches for different data sources, because some were in XML, some in CSV, some large, some small, etc. For the large CSV ones, I did indeed roughly follow the route you suggest:
I used LOAD DATA INFILE to dump the raw contents into a temporary table.
I took the opportunity to transform or discard some of the data within this query; LOAD DATA INFILE allows some quite complex queries. This allowed me to use the same temp table for several of the import processes even though they had quite different CSV data, which made the next step easier.
I then used a set of secondary SQL queries to pull the temp data into the various main tables. All told, I had about seven steps to the process.
I had a set of PHP classes to do the imports, which all implemented a common interface. This meant that I could have a common front-end program which could run any of the importers.
Since a lot of the importers did similar tasks, I put the commonly used code in traits so that the code could be shared.
Some thoughts based on the things you said in your question:
LOAD DATA INFILE will be orders of magnitude quicker than fgetcsv() with a PHP loop.
LOAD DATA INFILE queries can be very complex and achieve very good data mapping without ever having to run any other code, as long as the imported data is going into a single table.
Your memory limit is likely to need to be raised. However, using LOAD DATA INFILE means that it will be MySQL which will use the memory, not PHP, so the PHP limit won't come into play for that. 128M is still likely to be too low for you though.
If you struggle to import the whole thing in one go, try using some simple Linux shell commands (e.g. split) to break the file into several smaller chunks. The CSV data format should make that fairly simple.
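As a rough sketch of the LOAD DATA INFILE mapping described above (the column names, the @dummy discards and the SET transformations are invented for illustration; the point is that a lot of the cleanup can happen inside the load itself):

$pdo = new PDO('mysql:host=localhost;dbname=shop', 'db_user', 'db_pass', array(
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,         // required for LOAD DATA LOCAL INFILE via PDO
));

// 1. Dump the raw CSV into a temp/staging table, discarding and transforming as it loads.
$pdo->exec("
    LOAD DATA LOCAL INFILE '/path/to/supplier_a.csv'
    INTO TABLE temp_products
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (@sku, @name, @dummy, @dummy, @price, @dummy)
    SET sku      = TRIM(@sku),
        name     = TRIM(@name),
        price    = NULLIF(@price, ''),
        supplier = 'supplier_a'
");

// 2. A secondary set-based query pulls the cleaned rows into the live table.
$pdo->exec("
    INSERT INTO products (sku, name, price, supplier)
    SELECT sku, name, price, supplier FROM temp_products
    ON DUPLICATE KEY UPDATE price = VALUES(price)
");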
I receive files in a streamed manner once every 30 seconds. The files may have up to 40 columns and 50,000 rows. The files are txt files and tab separated. Right now, I'm saving the file temporarily, loading its contents with LOAD DATA INFILE into a temporary table in the database, and deleting the file afterwards.
I would like to avoid the save and delete process and instead save the data directly to the database. The stream is the $output here:
protected function run(OutputInterface $output)
{
$this->readInventoryReport($this->interaction($output));
}
I've been googling around trying to find a "performance is a big issue"-proof answer to this, but I can't find a good way of doing it without saving the data to a file and using LOAD DATA INFILE. I need the contents to be available quickly and to work with them after they are saved to a temporary table (updating other tables with the contents...).
Is there a good way of handling this, or will the file save-and-delete method together with LOAD DATA INFILE be better than other solutions?
The server I'm running this on has SSDs and 32GB of RAM.
LOAD DATA INFILE is your fastest way to do low-latency ingestion of tonnage of data into MySQL.
You can write yourself a PHP program that will, using prepared statements and the like, do a pretty good job of inserting rows into your database. If you arrange to do a COMMIT every couple of hundred rows, use prepared statements, and write your code carefully, it will be fairly fast, but not as fast as LOAD DATA INFILE. Why? Individual row operations have to be serialized onto the network wire, then deserialized, and processed one (or two, or ten) at a time. LOAD DATA just slurps up your data locally.
It sounds like you have a nice MySQL server machine. But the serialization is still a bottleneck.
50K records every 30 seconds, eh? That's a lot! Is any of that data redundant? That is, do any of the rows in a later batch of data overwrite rows in an earlier batch? If so, you might be able to write a program that would skip rows that have become obsolete.
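For completeness, a minimal sketch of the prepared-statement alternative described above; the table and column names are placeholders, and $stream stands for the already-open incoming report:

$pdo = new PDO('mysql:host=localhost;dbname=inventory', 'db_user', 'db_pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('INSERT INTO report_tmp (col1, col2, col3) VALUES (?, ?, ?)');

$pdo->beginTransaction();
$count = 0;
while (($line = fgets($stream)) !== false) {      // tab-separated lines from the stream
    $fields = explode("\t", rtrim($line, "\r\n"));
    $stmt->execute(array($fields[0], $fields[1], $fields[2]));

    if (++$count % 200 === 0) {                   // COMMIT every couple of hundred rows
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();

It works, but every row still pays the per-statement round trip, which is exactly the serialization cost described above; LOAD DATA INFILE avoids it.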
Here is the plan. I have a large CSV file extracted from a DB with 10,000 entries. The entries look like:
firstname
lastname
tel
fax
mobile
address
jav-2012-selltotal
fev-2012-selltotal
etc.. etc..
So, I have read about loading that CSV data into a MySQL database and querying the database to find out who sold the most in February 2012, or what John's selling total is... or whatever I ask it.
But for optimization purposes, caching, optimizing and indexing the queries is a must... which leads me to this question. Since I know the 2-3 queries I will run ALL THE TIME against the DB, is it faster to take the CSV file, perform the request in PHP and write a result file to disk, so that every call is just "read the file, load it, display it"?
Another wording of the question: is making a query to the DB faster or slower than reading a file from disk? Because if the DB has 10,000 records and the result for Paul's sales is 100 lines, the file will contain ONLY those 100 lines, so it will be small... whereas the query will always take about the same time.
Please help; I don't want to code it myself just to discover things that are evident to you...
thanks in advance
If you stick to database normalization rules and keep everything in the database, you are just fine. 10k records is not really much and you should not have to worry about performance.
Database queries are faster because the data gets (at least partially) cached in memory, rather than sitting on plain disk until it is fully read into RAM.
A handful of plain text files might look faster at first sight, but when you have 100k files versus 100k datasets in the DB, the database is so much better: you don't have unlimited (parallel) inode access, and you would be slowing down and wearing out your hard drive/SSD. The more files you have, the slower everything gets.
You'd also have to manually code a locking queue for read/write actions, which is already built into MySQL (row and table locking).
Consider that in a few months you may want to extend everything... how would you implement JOINs on text files? All the aggregation functionality is already built into MySQL (GROUP BY, ORDER BY, ...).
MySQL also lets you analyze queries (prefix a statement with EXPLAIN) and can optimize even much bigger datasets.
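To make that concrete, here is the kind of one-statement answer SQL gives you once the CSV has been normalised into a hypothetical sales(seller, sold_at, amount) table (schema and names assumed purely for illustration):

$pdo = new PDO('mysql:host=localhost;dbname=salesdb', 'db_user', 'db_pass');

// "Who sold the most in February 2012?" No per-person result files needed.
$stmt = $pdo->query("
    SELECT seller, SUM(amount) AS total
    FROM sales
    WHERE sold_at >= '2012-02-01' AND sold_at < '2012-03-01'
    GROUP BY seller
    ORDER BY total DESC
");
foreach ($stmt as $row) {
    echo $row['seller'] . ': ' . $row['total'] . PHP_EOL;
}

// Prefix the SELECT with EXPLAIN to see how MySQL executes it and whether an
// index on (sold_at, seller) is being used.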
When I went to school I said to my teacher: 'Plain files are much faster than your MySQL.' I made a site with a directory for each user and stored each attribute in its own text file inside that user's folder, like /menardmam/username.txt, /menardmam/password.txt, /DanFromgermany/username.txt, ... I tried to benchmark this, and yes, the text files were faster, but only because there were just 1000 text files. When it comes to real business, 1,000,000,000 datasets, combined and cross-joined, there is no way to do that with text files, and when applying for a job it is much better to present work you did with MySQL than what you did with text files.