I receive files in a streamed manner once every 30 seconds. The files may have up to 40 columns and 50,000 rows. They are tab-separated txt files. Right now, I save each file temporarily, load its contents into a temporary table in the database with LOAD DATA INFILE, and delete the file afterwards.
I would like to avoid the save and delete process and instead save the data directly to the database. The stream is the $output here:
protected function run(OutputInterface $output)
{
$this->readInventoryReport($this->interaction($output));
}
I've been googling around trying to find a performance-proven answer to this, but I can't find a good way of doing it without saving the data to a file and using LOAD DATA INFILE. I need the contents available quickly so I can work with them after they are saved to a temporary table (update other tables with the contents, etc.).
Is there a good way of handling this, or will the file save and delete method together with load data infile be better than other solutions?
The server I'm running this on has SSDs and 32GB of RAM.
LOAD DATA INFILE is your fastest way to do low-latency ingestion of tonnage of data into MySQL.
You can write yourself a PHP program that will, using prepared statements and the like, do a pretty good job of inserting rows into your database. If you arrange to do a COMMIT every couple of hundred rows, use prepared statements, and write your code carefully, it will be fairly fast, but not as fast as LOAD DATA INFILE. Why? Individual row operations have to be serialized onto the network wire, then deserialized, and processed one (or two or ten) at a time. LOAD DATA just slurps up your data locally.
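A minimal sketch of that approach, assuming a PDO connection `$pdo`, an open stream `$stream`, and a two-column staging table (all names here are illustrative, not from the question):

```php
<?php
// Sketch: batched inserts with a prepared statement and periodic COMMITs.
// Assumes $pdo (PDO, MySQL) and $stream (readable resource) already exist,
// and a staging table inventory_staging(sku, qty) -- adjust to your schema.
$pdo->beginTransaction();
$stmt = $pdo->prepare('INSERT INTO inventory_staging (sku, qty) VALUES (?, ?)');

$count = 0;
while (($fields = fgetcsv($stream, 0, "\t")) !== false) {
    $stmt->execute([$fields[0], (int) $fields[1]]);
    if (++$count % 500 === 0) {       // COMMIT every couple of hundred rows
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();                        // flush the final partial batch
```

This avoids the temp file entirely, at the cost of per-row round trips; as the answer says, expect it to be slower than LOAD DATA INFILE.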
It sounds like you have a nice MySQL server machine. But the serialization is still a bottleneck.
50K records every 30 seconds, eh? That's a lot! Is any of that data redundant? That is, do any of the rows in a later batch of data overwrite rows in an earlier batch? If so, you might be able to write a program that would skip rows that have become obsolete.
Related
Please, can somebody give me some support?
My problem is:
I have a table with 8 fields and about 510,000 records. In a web form, the user selects an Excel file, which is read with SimpleXLSX. The file has about 340,000 lines. With PHP and the SimpleXLSX library the file is loaded into memory; then, in a for loop, the script reads it line by line, takes one value from each line, and searches for that value in the table. If the value already exists in the table, the row is not inserted; otherwise, the values read are stored in the table.
This process takes days to finish.
Can somebody suggest a way to speed up the process?
Thanks a lot.
If you have many users, and they may use the web form at the same time:
you should switch from SimpleXLSX to js-xlsx, do all the parsing work in the browser, and only write to the database on the server.
If you have few users (I think this is your case):
and search this value in the table
this is where most of the time goes: a one-by-one comparison between memory and the database for every row, then add/not-add to the database.
So instead, read all the relevant database values into memory (use a hash set for the comparison), then compare everything in memory, adding new rows to the in-memory set and marking them as new.
At the end, write the in-memory additions back to the database.
Because your database and your XLS file have nearly the same row counts, querying the database row by row adds almost no value; just forget the database during the comparison step, doing it in memory with a hash set is fastest.
Of course, you can let the above run in the database if you use @Barmar's idea: don't insert rows one at a time, insert them in batches.
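A sketch of the in-memory hash-set approach combined with batched inserts (assuming a PDO connection `$pdo` and a SimpleXLSX object `$xlsx`; table and column names are illustrative):

```php
<?php
// 1. Pull the existing key column into a hash set.
//    PHP array keys give O(1) lookups, so this replaces 340k SELECTs.
$existing = [];
foreach ($pdo->query('SELECT keycol FROM mytable') as $row) {
    $existing[$row['keycol']] = true;
}

// 2. Compare every spreadsheet row in memory, keeping only the new ones.
$newRows = [];
foreach ($xlsx->rows() as $row) {          // SimpleXLSX row iteration
    if (!isset($existing[$row[0]])) {
        $existing[$row[0]] = true;         // mark as seen (handles file dupes too)
        $newRows[] = $row;
    }
}

// 3. Insert the new rows in batches instead of one at a time.
foreach (array_chunk($newRows, 500) as $chunk) {
    $placeholders = implode(',', array_fill(0, count($chunk), '(?, ?)'));
    $stmt = $pdo->prepare("INSERT INTO mytable (keycol, val) VALUES $placeholders");
    $params = [];
    foreach ($chunk as $r) {
        $params[] = $r[0];
        $params[] = $r[1];
    }
    $stmt->execute($params);
}
```

With 510k existing keys the hash set costs tens of megabytes of RAM, which is usually a fine trade against days of per-row queries.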
Focus on speed on throwing the data into the database. Do not try to do all the work during the INSERT. Then use SQL queries to further clean up the data.
Use the minimal XLS to get the XML into the database. Use some programming language if you need to massage the data a lot. Neither XLS nor SQL is the right place for complex string manipulations.
If practical, use LOAD DATA ... XML to get the data loaded; it is very fast.
SQL is excellent for handling entire tables at once; it is terrible at handling one row at a time. (Hence, my recommendation of putting the data into a staging table, not directly into the target table.)
If you want to discuss further, we need more details about the conversions involved.
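The statement MySQL provides for this is LOAD XML INFILE. A hedged sketch of the staging-table flow described above (file path, table, and column names are illustrative):

```sql
-- Load the XML rows into a staging table as fast as possible.
LOAD XML INFILE '/tmp/products.xml'
INTO TABLE staging_products
ROWS IDENTIFIED BY '<row>';

-- Then clean up and move the data in one set-based query,
-- rather than doing the work during the load.
INSERT INTO products (sku, name, price)
SELECT sku, TRIM(name), price
FROM staging_products
WHERE price > 0;
```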
Looking for insight on the best approach for large CSV file imports to MySQL and managing the dataset. This is for an ecommerce storefront "startup". All product data will be read from CSV files which are downloaded via curl (server to server).
Each csv file represents a different supplier/warehouse with up to 100,000 products. In total there are roughly 1.2 million products spread over 90-100 suppliers. At least 75% of the row data (51 columns) is redundant garbage and will not be needed.
Would it be better to use mysqli LOAD DATA LOCAL INFILE into a 'temp_products' table, then make the needed data adjustments per row and insert into the live 'products' table, or simply use fgetcsv() and go row by row? The import will be handled by a cron job using the site's php.ini with a memory limit of 128M.
Apache V2.2.29
PHP V5.4.43
MySQL V5.5.42-37.1-log
memory_limit 128M
I'm not looking for "how-tos". I'm simply looking for the "best approach" from the community's perspective and experience.
I have direct experience of doing something virtually identical to what you describe -- lots of third party data sources in different formats all needing to go into a single master table.
I needed to take different approaches for different data sources, because some were in XML, some in CSV, some large, some small, etc. For the large CSV ones, I did indeed follow roughly your suggested route:
I used LOAD DATA INFILE to dump the raw contents into a temporary table.
I took the opportunity to transform or discard some of the data within this query; LOAD DATA INFILE allows some quite complex queries. This allowed me to use the same temp table for several of the import processes even though they had quite different CSV data, which made the next step easier.
I then used a set of secondary SQL queries to pull the temp data into the various main tables. All told, I had about seven steps to the process.
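Roughly, the first and third steps looked like this (table and column names here are illustrative, not the originals):

```sql
-- 1. Dump the raw CSV into a staging table, transforming as we load.
--    LOAD DATA INFILE supports per-column expressions via user
--    variables and a SET clause.
LOAD DATA INFILE '/tmp/supplier.csv'
INTO TABLE staging_import
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES
(@sku, @name, @price)
SET sku   = UPPER(TRIM(@sku)),
    name  = TRIM(@name),
    price = NULLIF(@price, '');

-- 3. Pull the cleaned rows into the main table in one set-based query,
--    discarding the redundant data.
INSERT INTO products (sku, name, price)
SELECT sku, name, price
FROM staging_import
WHERE price IS NOT NULL;
```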
I had a set of PHP classes to do the imports, which all implemented a common interface. This meant that I could have a common front-end program which could run any of the importers.
Since a lot of the importers did similar tasks, I put the commonly used code in traits so that the code could be shared.
Some thoughts based on the things you said in your question:
LOAD DATA INFILE will be orders of magnitude quicker than fgetcsv() with a PHP loop.
LOAD DATA INFILE queries can be very complex and achieve very good data mapping without ever having to run any other code, as long as the imported data is going into a single table.
Your memory limit is likely to need to be raised. However, using LOAD DATA INFILE means that it will be MySQL which will use the memory, not PHP, so the PHP limit won't come into play for that. 128M is still likely to be too low for you though.
If you struggle to import the whole thing in one go, try using some simple Linux shell commands to split the file into several smaller chunks. The CSV data format should make that fairly simple.
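For example, `split` will do this in one line (file names here are just examples; if the CSV has a header row, strip it first with `tail -n +2` so it doesn't land in a data chunk):

```shell
# Fake a 5000-row CSV, then split it into 2000-line chunks.
seq 1 5000 | awk '{print $1",item"$1}' > big.csv
split -l 2000 big.csv chunk_     # produces chunk_aa, chunk_ab, chunk_ac
wc -l chunk_*
```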
I have an app that is posting data from android to some MySQL tables through PHP with a 10 second interval. The same PHP file does a lot of queries on some other tables in the same database and the result is downloaded and processed in the app (with DownloadWebPageTask).
I usually have between 20 and 30 clients connected this way. Most of the data each client queries for is the same as for all the other clients. If 30 clients run the same query every 10th second, 180 queries will be run per minute. In fact every client runs several queries, some of them in a loop (looping through the results of another query).
My question is: if I somehow produce a text file containing the same data, update this file every x seconds, and let all the clients read this file instead of running the queries themselves, is that a better approach? Will it reduce server load?
In my opinion you should consider using memcache.
It will let you store your data in memory which is even faster than files on disk or mysql queries.
What it will also do is reduce load on your database so you will be able to serve more users with the same server/database setup.
Memcache is very easy to use and there are lots of tutorials on the internet.
Here is one to get you started:
http://net.tutsplus.com/tutorials/php/faster-php-mysql-websites-in-minutes/
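A minimal sketch of that idea using the PHP Memcached extension (the cache key, TTL, and query are illustrative; `$pdo` is assumed to be an existing database connection):

```php
<?php
// Cache the shared query result so 30 clients polling every 10 seconds
// hit the database only once per refresh interval.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$data = $mc->get('shared_report');
if ($data === false) {                         // cache miss: query once
    $stmt = $pdo->query('SELECT * FROM report_table');
    $data = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $mc->set('shared_report', $data, 10);      // expire after 10 seconds
}
echo json_encode($data);                       // same payload for every client
```

With a 10-second TTL matching the clients' polling interval, the 180 queries per minute collapse to roughly 6.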
What you need is caching. You can either cache the data coming from your DB or cache the page itself. Below you can find a few links on how to do this in PHP:
http://www.theukwebdesigncompany.com/articles/php-caching.php
http://www.addedbytes.com/articles/for-beginners/output-caching-for-beginners/
And yes. This will reduce DB server load drastically.
At my company, I use a program that reads CSV files and inserts the data into a database. I am having trouble with it because it needs to be able to insert a large amount of data (up to 10,000 rows) at a time. At first I had it looping through and inserting each record one at a time. That is slow because it calls an insert function 10,000 times. Next I tried to group it together so it inserted 50 rows at a time by concatenating the SQL call. I have tried grouping the SQL calls into up to 1,000 rows at a time, but it is still too slow.
Another thing that I have to do is change the data. The client gives a spreadsheet with data such as usernames and passwords, but sometimes the usernames are the same, so I change them by adding a number at the end, i.e. JoDoe, JoDoe1. Sometimes there is no password or username, so I have to generate one. The reason I bring this up is that I read that LOAD DATA INFILE reads a file really fast and puts it into a table, but I need to edit the data before it goes into the table.
It will time out after 120 seconds, and what doesn't get finished in that time is inserted as all 0's. I need to speed it up so it doesn't take as long. I do NOT want to change the time limit because it is a company thing. What is an efficient way to insert many rows of a CSV file into a database?
LOAD DATA INFILE can perform numerous preprocessing operations as it loads the data. That might be enough. If not, run a PHP script to process from one CSV file to another, temporary, CSV file, editing as you go. Then use LOAD DATA INFILE on the newly created file.
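A sketch of the CSV-to-CSV preprocessing step for the username/password rules from the question (file names, column positions, and the generation scheme are illustrative):

```php
<?php
// Rewrite one CSV into a cleaned CSV, then LOAD DATA INFILE the result.
$in   = fopen('clients.csv', 'r');
$out  = fopen('clients_clean.csv', 'w');
$seen = [];                                    // usernames already used

while (($row = fgetcsv($in)) !== false) {
    [$username, $password] = $row;
    if ($username === '') {                    // generate a missing username
        $username = 'user' . bin2hex(random_bytes(4));
    }
    $base = $username;
    $n = 0;
    while (isset($seen[$username])) {          // JoDoe -> JoDoe1, JoDoe2, ...
        $username = $base . ++$n;
    }
    $seen[$username] = true;
    if ($password === '') {                    // generate a missing password
        $password = bin2hex(random_bytes(8));
    }
    fputcsv($out, [$username, $password]);
}
fclose($in);
fclose($out);
// Then: LOAD DATA INFILE 'clients_clean.csv' INTO TABLE users ...
```

The PHP pass is a single linear scan, and the actual insertion stays as one fast LOAD DATA INFILE, which should fit comfortably inside the 120-second limit.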
Ok, so I've got a web system (built on CodeIgniter and running on MySQL) that allows people to query a database of postal address data by making selections in a series of forms until they arrive at the selection they want, pretty standard stuff. They can then buy that information and download it via the system.
The queries run very fast, but when it comes to applying a query to the database and exporting the result to CSV, once the datasets get to around the 30,000-record mark (each row has around 40 columns, of which about 20 are populated with on average 20 characters of data per cell), it can take 5 or so minutes to export to CSV.
So, my question is: what is the main cause of the slowness? Is the result set of the query so large that it is running into memory issues, and should I therefore allow much more memory to the process? Or is there a much more efficient way of exporting to CSV from a MySQL query that I'm not using? Should I save the contents of the query to a temp table and simply export the temp table to CSV? Or am I going about this all wrong? Also, is the fact that I'm using CodeIgniter's Active Record prohibitive here, due to the way that it stores the result set?
Pseudo Code:
$query = $this->db->select('field1, field2, field3')->where_in('field1', $values_array)->get('tablename');
$data = $this->dbutil->csv_from_result($query, $delimiter, $newline); // then some code to save the file
$this->load->helper('download');
force_download($filename, $data);
Any advice is welcome! Thank you for reading!
Ignoring Codeigniter for a moment, you basically have three options for exporting CSV using PHP:
To disk - typically the slowest option
To memory - typically the fastest option
Directly to the browser
In your case I would skip any built-in CodeIgniter CSV functions and try streaming the CSV directly to the browser.
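A sketch of the direct-to-browser option using a plain PDO connection (the query, column names, and file name are illustrative; this bypasses CodeIgniter's `csv_from_result`):

```php
<?php
// Stream CSV straight to the browser instead of building it on disk or
// in memory. An unbuffered query means PHP never holds the full result set.
header('Content-Type: text/csv');
header('Content-Disposition: attachment; filename="export.csv"');

$out = fopen('php://output', 'w');              // write directly to the response
fputcsv($out, ['field1', 'field2', 'field3']);  // header row

$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
$stmt = $pdo->query('SELECT field1, field2, field3 FROM tablename');
while ($row = $stmt->fetch(PDO::FETCH_NUM)) {
    fputcsv($out, $row);                        // one row at a time
}
fclose($out);
```

Memory use stays flat regardless of result-set size, and the user starts receiving data immediately instead of waiting minutes for a file to be assembled.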
30,000 records * 40 columns * 20 bytes = 24,000,000 bytes
If this is a busy shared server then I can imagine this being a disk I/O bottleneck. If it's a Windows-based server then there's probably some paging happening as well to slow it down.
Try skipping the disk and writing directly to the network.