For example, I have a CSV file with a million rows and I created a queue that handles the rows in steps of 100 rows. After 100 rows have been handled I want to remove them from the CSV file. Is that possible? Or maybe someone knows another tool or a way to do that. It would be great if there were a way to remove rows with an offset and a limit. The approach where we read the whole file content, transform it into an array, run array_shift to remove the first line, for example, and then write the content back with file_put_contents is not suitable here, because my CSV file contains millions of rows.
I use PHP and the League\Csv library, but maybe there is another tool for this task?
Is it possible to delete a certain number of rows from a CSV file, by either removing them entirely or replacing them with empty lines?
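For context, reading the file in chunks of 100 with League\Csv's Statement (offset/limit) looks roughly like the sketch below (the path and chunk size are just placeholders); the part I am missing is how to drop the already handled rows from the file afterwards:
use League\Csv\Reader;
use League\Csv\Statement;

require 'vendor/autoload.php';

// path and chunk size are placeholders
$csv    = Reader::createFromPath('/path/to/huge.csv', 'r');
$limit  = 100;
$offset = 0;

do {
    // read only the next 100 records instead of loading the whole file
    $records = (new Statement())->offset($offset)->limit($limit)->process($csv);

    $handled = 0;
    foreach ($records as $record) {
        // ... handle one row here ...
        $handled++;
    }

    $offset += $limit;
} while ($handled === $limit);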
Related
I have a rather large product feed which I split up into multiple CSV files of 20k lines each.
feed-0.csv
feed-1.csv
etc
I load this CSV into a temp table and export 3 new CSV files which I will later on load into separate tables.
products.csv
attributes.csv
prices.csv
The above files of course also contain 20k lines each, just like the (split) source. So far so good, all going well.
Another part of the script loads the above 3 CSV files into their respective tables: db.products, db.attributes and db.prices. When I select 1 file (be it feed-0.csv or feed-9.csv, any split file will do) the database is updated with 20k rows in each respective table. Still no problem there.
Now I create a loop where I loop through the split CSV files and add 20k rows to each table on every pass.
This works well until I hit the 3rd loop; then I get mismatching numbers in the database, e.g.
db.products - 57569 rows
db.attributes - 58661 rows
db.prices - 52254 rows
So while after the previous loops everything was at 40k, now I have mismatching numbers.
I have checked products.csv, attributes.csv and prices.csv on each loop, and each of them has the 20k lines it should have.
I have tried with random split feeds, e.g. feed-1.csv, feed-5.csv, feed-7.csv or feed-1.csv, feed-8.csv and feed-3.csv. So I changed the files and I changed the order, but each time on the 3rd and further loops I get this problem.
I tried to import split files from different feeds too, but each time on the 3rd loop I get incorrect numbers. The source files should be fine; when I run just 1 or 2 files in any sequence the results are good.
I suspect that I am hitting some limitation somewhere. I thought it might be an InnoDB buffer issue, so I restarted the server, but the issue remains (the InnoDB buffer is at around 25% after the 3rd loop).
I am using MariaDB 10.1.38, PHP 7.3.3 and an InnoDB buffer size of 500 MB.
Any pointers in which direction I have to search for a solution would be welcome!
Is there a way that I can update 100k records in a query while the MySQL database keeps working smoothly?
Suppose there is a users table containing a hundred thousand records and I have to update approximately fifty thousand of them. For the update I have the IDs of those records (around fifty thousand of them) stored in a CSV file.
1 - Will the query be OK, or will it be too large? If there is any way to do it in smaller chunks, let me know.
2 - Considering the Laravel framework, is there any option to read only part of the file rather than the whole file, to avoid running out of memory? I do not want to read the whole file at once, please suggest.
Any suggestions are welcome!
If you're thinking of building a query like UPDATE users SET column = 'value' WHERE id = 1 OR id = 2 OR id = 3 ... OR id = 50000 or WHERE id IN (1, 2, 3, ..., 50000) then that will probably be too big. If you can make some logic to summarize that, it would shorten the query and speed things up on MySQL's end significantly. Maybe you could make it WHERE id >= 1 AND id <= 50000.
If that's not an option, you could do it in bursts. You're probably going to loop through the rows of the CSV file, build the query as a big WHERE id = 1 OR id = 2... query and every 100 rows or so (or 50 if that's still too big), run the query and start a new one for the next 50 IDs.
Or you could just run 50.000 single UPDATE queries on your database. Honestly, if the table makes proper use of indexes, running 50.000 queries should only take a few seconds on most modern webservers. Even the busiest servers should be able to handle that in under a minute.
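To make the batching idea concrete, here is a minimal sketch with PDO; the connection details, file path, table and column names are placeholders, and it assumes the CSV holds one ID per line:
// placeholder connection details, table, column and file path
$pdo  = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$file = fopen('/path/to/ids.csv', 'r');

$batch = [];
while (($row = fgetcsv($file)) !== false) {
    $batch[] = (int) $row[0];                // first CSV column holds the ID

    if (count($batch) === 100) {             // run the query every 100 IDs
        $ids = implode(',', $batch);
        $pdo->exec("UPDATE users SET some_column = 'value' WHERE id IN ($ids)");
        $batch = [];
    }
}

if ($batch) {                                // whatever is left after the loop
    $ids = implode(',', $batch);
    $pdo->exec("UPDATE users SET some_column = 'value' WHERE id IN ($ids)");
}

fclose($file);
Casting the IDs to integers keeps the query safe to build by string concatenation.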
As for reading a file in chunks, you can use PHP's basic file access functions for that:
$file = fopen('/path/to/file.csv', 'r');

// read one line at a time from the file (fgets reads up to the
// next newline character if you don't provide a number of bytes)
while (!feof($file)) {
    $line = fgets($file);

    // or, since it's a CSV file (use one or the other, not both,
    // because each call advances the file pointer by one line):
    $row = fgetcsv($file);
    // $row is now an array with all the CSV columns

    // do stuff with the line/row
}

// you can also jump around, e.g. set the file pointer to 60 KB into the file
fseek($file, 60 * 1024);

// close the file
fclose($file);
This will not read the full file into memory. Not sure if Laravel has its own way of dealing with files, but this is how to do that in basic PHP.
Depending on the data you have to update, I would suggest a few ways:
If all users are to be updated with the same value, then, as #rickdenhaan said, you can build batched queries every X rows of the CSV.
If every individual user has to be updated with a unique value, you have to run single queries.
If any of the updated columns are indexed, you should disable autocommit and wrap the updates in a manual transaction to avoid a reindex after every single update (see the sketch after this list).
To avoid running out of memory, my opinion is the same as #rickdenhaan's: you should read the CSV line by line using fgetcsv.
To avoid possible timeouts, you can for example move the processing into Laravel queues.
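For the per-user case with indexed columns, a minimal sketch of running the single updates inside one manual transaction with PDO (connection details, file path and column names are placeholders):
// placeholder connection details, file path and column names
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('UPDATE users SET balance = :balance WHERE id = :id');
$file = fopen('/path/to/updates.csv', 'r');

$pdo->beginTransaction();
try {
    // one UPDATE per CSV row, but only a single commit at the end
    // instead of an implicit commit after every statement
    while (($row = fgetcsv($file)) !== false) {
        $stmt->execute([':id' => (int) $row[0], ':balance' => $row[1]]);
    }
    $pdo->commit();
} catch (Throwable $e) {
    $pdo->rollBack();
    throw $e;
} finally {
    fclose($file);
}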
I am downloading CSV reports from DoubleClick for Advertisers. My client needs to save the CSV to database tables. Before saving the CSV to the database, I need to delete a few rows from the top and the grand-total row from the bottom. Is there any function in PHP with which I can delete rows that I specify from CSV files?
There is no specific function for that, but there are several ways to accomplish it.
If you have enough RAM to hold the entire file in memory at once, you could convert the file to an array with:
$data = explode("\n", file_get_contents("path/to/csv_file"));
Then, to get rid of the totals at the bottom, you can just unset the last row with something like:
unset($data[sizeof($data)-1]);
The way that works is that each line in the array is numbered, starting from zero. sizeof($data) gives the number of rows, but since the numbering starts at zero, that count is one past the last index, hence the -1.
To remove some rows from the top, you could
unset($data[0]); and unset($data[1]);
Then use a foreach loop to go through the remaining rows and insert them into the database:
foreach ($data as $line) {
    // parse your $line
    // insert the row into the database
}
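Putting those pieces together, the whole thing might look roughly like this (the connection details, the number of header rows and the table with its columns are just placeholders for the example):
// placeholder connection details; the table and its columns are made up too
$pdo  = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$data = explode("\n", file_get_contents("path/to/csv_file"));

// a file that ends with a newline yields an empty last element, so drop blanks first
$data = array_values(array_filter($data, function ($line) {
    return trim($line) !== '';
}));

unset($data[sizeof($data) - 1]);   // drop the grand-total row at the bottom
unset($data[0], $data[1]);         // drop two report-header rows at the top

$stmt = $pdo->prepare('INSERT INTO report_rows (col_a, col_b, col_c) VALUES (?, ?, ?)');

foreach ($data as $line) {
    $row = str_getcsv($line);      // parse the CSV line into columns
    $stmt->execute([$row[0], $row[1], $row[2]]);
}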
I have n csv files which I need to compare against each other and modify them afterwards.
The problem is that each CSV file has around 800.000 lines.
To read the CSV files I use fgetcsv and it works well. I get some memory spikes, but in the end it is fast enough. But if I try to compare the arrays against each other it takes ages.
One other problem is that I have to use a foreach to get the CSV data with fgetcsv because of the n amount of files. I end up with one ultra-big array and can't compare it with array_diff. So I need to compare it with nested foreach loops, and that takes ages.
a code snippet for better understanding:
foreach ($files as $value) {
    $data[] = $csv->read($value['path']);
}
My CSV class uses fgetcsv to add the output to the array:
fgetcsv( $this->_fh, $this->_lengthToRead, $this->_delimiter, $this->_enclosure )
All the data from all the CSV files is stored in the $data array. Using only one array is probably the first big mistake, but I have no clue how to stay flexible with the files without using a foreach. I tried to use flexible variable names but I got stuck there as well :)
Now I have this big array. Normally, if I try to compare the values against each other and find out whether the data from file one exists in file two and so on, I use array_diff or array_intersect. But in this case I only have this one big array. And, as I said, running a foreach over it takes ages.
Also, after only 3 files I have an array with 3 * 800.000 entries. I guess at the latest after 10 files my memory will explode.
So is there any better way to use PHP to compare n very large CSV files?
Use SQL
Create a table with the same columns as your CSV files.
Insert the data from the first CSV file.
Add indexes to speed up queries.
Compare with other CSV files by reading a line and issuing a SELECT.
You did not describe how you compare the n files, and there are several ways to do so. If you just want to find the lines that are in A1 but not in A2, ..., An, then you'll just have to add a boolean column diff to your table. If you want to know in which files a line is repeated, you'll need a text column, or a new table if a line can be in several files.
Edit: a few words on performance if you're using MySQL (I do not know much about other RDBMSs).
Inserting lines one by one would be too slow. You probably can't use LOAD DATA unless you can put the CSV files directly onto the DB server's filesystem. So I guess the best solution is to read a few hundred lines from the CSV and then send a multiple-insert query: INSERT INTO mytable VALUES (..1..), (..2..).
You can't issue a SELECT for each line you read from your other files either, so you'd better put them in another table as well. Then issue a multiple-table update to mark the rows that are identical in tables t1 and t2: UPDATE t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b) SET t1.diff = 1
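As a rough sketch of the batched insert in PHP (the connection details and file path are placeholders, and it assumes each CSV line has exactly two columns matching t2's columns a and b):
// placeholder connection details and file path
$pdo  = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$file = fopen('/path/to/other-file.csv', 'r');

$rows = [];
while (($row = fgetcsv($file)) !== false) {
    $rows[] = $row;

    if (count($rows) === 500) {            // flush every few hundred lines
        insertBatch($pdo, $rows);
        $rows = [];
    }
}
if ($rows) {
    insertBatch($pdo, $rows);              // flush the remainder
}
fclose($file);

function insertBatch(PDO $pdo, array $rows): void
{
    // one "(?, ?)" placeholder group per row, then a single multi-row INSERT
    $groups = implode(',', array_fill(0, count($rows), '(?, ?)'));
    $stmt   = $pdo->prepare("INSERT INTO t2 (a, b) VALUES $groups");
    $stmt->execute(array_merge(...$rows));
}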
Maybe you could try using SQLite. There are no concurrency problems there, and it could be faster than the client/server model of MySQL. And you don't need to set up much to use SQLite.
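For illustration, opening an SQLite database from PHP and bulk-loading one CSV into it is only a few lines; the file paths and the two-column table are invented for the example:
// a single file on disk is the whole database; nothing to install or configure
$pdo = new PDO('sqlite:/tmp/csv_compare.sqlite');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE IF NOT EXISTS t1 (a TEXT, b TEXT)');
$pdo->exec('CREATE INDEX IF NOT EXISTS idx_t1_ab ON t1 (a, b)');

$stmt = $pdo->prepare('INSERT INTO t1 (a, b) VALUES (?, ?)');
$file = fopen('/path/to/file-1.csv', 'r');

$pdo->beginTransaction();                  // one transaction makes bulk inserts fast in SQLite
while (($row = fgetcsv($file)) !== false) {
    $stmt->execute([$row[0], $row[1]]);
}
$pdo->commit();
fclose($file);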
A vendor is feeding us a CSV file of their products. A particular column in the file (e.g. column 3) is the style number. This file has thousands of entries.
We have a database table of products with a column called manufacturer_num which is the vendor's style number.
I need to find which of the vendor's products we do not currently have.
I know I can loop through each line in the CSV file, extract the style number and check to see if it is in our database. But then I am making a call to the database for each line. This would be thousands of calls to the database, which I think is inefficient.
I could also build a list of the style numbers (either as a string or an array) to make one DB call.
Something like: WHERE manufacturer_num IN (...). But won't PHP run out of memory if the list is too big? And actually this would give me the ones we do have, not the ones we don't have.
What's an efficient way to do this?
Bulk load the CSV into a temporary table, do a LEFT JOIN, then get the records where the RHS of the join is NULL.
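A rough sketch of that approach, assuming MySQL/MariaDB; the file path, the temporary table and the CSV layout are placeholders, and LOAD DATA LOCAL INFILE has to be enabled on both the client and the server:
// placeholder DSN and file path; MYSQL_ATTR_LOCAL_INFILE must also be allowed server-side
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

// 1. temporary table that only holds the vendor style numbers (column 3 of the CSV)
$pdo->exec('CREATE TEMPORARY TABLE vendor_feed (style_number VARCHAR(64), INDEX (style_number))');

// 2. bulk load the CSV, keeping only the third column of every line
$pdo->exec("LOAD DATA LOCAL INFILE '/path/to/vendor.csv'
            INTO TABLE vendor_feed
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            (@c1, @c2, @c3)
            SET style_number = @c3");

// 3. vendor style numbers that have no match in our products table
$missing = $pdo->query('SELECT v.style_number
                        FROM vendor_feed v
                        LEFT JOIN products p ON p.manufacturer_num = v.style_number
                        WHERE p.manufacturer_num IS NULL')->fetchAll(PDO::FETCH_COLUMN);
If LOAD DATA is not available, the temporary table can be filled with batched INSERTs instead; the LEFT JOIN ... IS NULL part stays the same.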