I have a question about code optimization.
I haven't coded anything besides simple loops in over ten years.
I created the code below, which works fine but is super slow for my needs.
In essence, I have 2 CSV files:
a source CSV file that has about 500 000 records, let's say: att1, att2, source_id, att3, att4 (in reality there are about 40 columns)
a main CSV file that has about 120 million records, let's say: att1, att2, att3, main_id, att4 (in reality there are about 120 columns)
For each source_id in the source file, my code parses the main file for all the lines where main_id == source_id and writes each of those lines to a new file.
Do you have any suggestions on how I could optimize the code to go much, much faster?
<?php
$mf = "main.csv";
$mf_max_line_length = "512";
$mf_id = "main_id";
$sf = "source.csv";
$sf_max_line_length = "884167";
$sf_id = "source_id";
if (($mf_handle = fopen($mf, "r")) !== FALSE)
{
// Read the first line of the main CSV file
// and look for the position of main_id
$mf_data = fgetcsv($mf_handle, $mf_max_line_length, ",");
$mf_id_pos = array_search ($mf_id, $mf_data);
// Create a new main CSV file
if (($nmf_handle = fopen("new_main.csv", "x")) !== FALSE)
{
fputcsv($nmf_handle,$mf_data);
} else {
echo "Cannot create file: new_main.csv" . $sf;
break;
}
}
// Open the source CSV file
if (($sf_handle = fopen($sf, "r")) !== FALSE)
{
// Read the first line of the source CSV file
// and look for the position of source_id
$sf_data = fgetcsv($sf_handle, $sf_max_line_length, ",");
$sf_id_pos = array_search ($sf_id, $sf_data);
// Go through the whole source CSV file
while (($sf_data = fgetcsv($sf_handle, $sf_max_line_length, ",")) !== FALSE)
{
// Open the main CSV file
if (($mf_handle = fopen($mf, "r")) !== FALSE)
{
// Go through the whole main CSV file
while (($mf_data = fgetcsv($mf_handle, $mf_max_line_length, ",")) !== FALSE)
{
// If the source_id matches the main_id
// then we write it into the new_main CSV file
if ($mf_data[$mf_id_pos] == $sf_data[$sf_id_pos])
{
fputcsv($nmf_handle,$mf_data);
}
}
fclose($mf_handle);
}
}
fclose($sf_handle);
fclose($nmf_handle);
}
?>
Sounds like a job for MySQL.
First, you'll need to create tables based on all your fields. See here
Then, you'll load your data. See here
Finally, you'll create a query like:
SELECT * INTO OUTFILE '/tmp/something.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM source_table INNER JOIN main_table ON
source_table.source_id=main_table.main_id;
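If you'd rather stay in pure PHP, the main fix is to stop re-reading the 120-million-row file once per source row. Load the ~500,000 source_ids into an in-memory hash (PHP array keys give constant-time lookups), then stream the main file exactly once and keep the rows whose id is in the set. A rough sketch along those lines, assuming the column names above and that duplicate source_ids don't need to produce duplicate output rows:
<?php
// Pass 1: collect every source_id as an array key (a hash set).
$ids = array();
if (($sf_handle = fopen("source.csv", "r")) !== FALSE) {
    $header = fgetcsv($sf_handle);
    $sf_id_pos = array_search("source_id", $header);
    while (($row = fgetcsv($sf_handle)) !== FALSE) {
        $ids[$row[$sf_id_pos]] = true;
    }
    fclose($sf_handle);
}
// Pass 2: stream the main file once, writing matching rows as we go.
if (($mf_handle = fopen("main.csv", "r")) !== FALSE) {
    $nmf_handle = fopen("new_main.csv", "w");
    $header = fgetcsv($mf_handle);
    $mf_id_pos = array_search("main_id", $header);
    fputcsv($nmf_handle, $header);
    while (($row = fgetcsv($mf_handle)) !== FALSE) {
        if (isset($ids[$row[$mf_id_pos]])) {
            fputcsv($nmf_handle, $row);
        }
    }
    fclose($mf_handle);
    fclose($nmf_handle);
}
?>
Half a million keys fit comfortably in memory, and the 120-million-row file is read once instead of 500,000 times.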
Related
I have several CSV files with 10 million+ values in them, each value is 9 characters long. My goal is to divide each file into two equally sized files, where each half is a random selection of the values from the initial set.
I am thinking about doing this using PHP (because I am slightly familiar with it).
I can think of two potential ways to do this, but I'm curious: (1) which one will run faster? (2) is there a different way of doing this that will be better? (3) or, with a data set of about 10 to 15 million values, does it not matter?
Plan 1:
Convert CSV into an array
Shuffle the array using the shuffle() function
Divide the array in 2 with the array_chunk() function
Save each array to CSV file (not sure how but will figure it out)
Plan 2:
Convert CSV into an array
Use array_rand() to randomly select X values, where X = (number of values / 2), and create an array from that selection
Repeat step 2 for second half of values
Save each of the new arrays to CSV file
Is this anywhere close to right? Should I consider a different language?
Thank you!
Plan 3:
1) Write a PHP script that fetches all CSV data and inserts it into a MySQL database (plenty of examples).
2) In your PHP, run something like select * from table where type = 1 order by rand() limit 10, or some other fancy query with a timestamp or whatever.
This is how I would do it.
EDIT with example
<?php
$files = glob("path/to/files/*.csv");
foreach($files as $file) {
if (($handle = fopen($file, "r")) !== FALSE) {
echo "<b>Filename: " . basename($file) . "</b><br><br>";
while (($data = fgetcsv($handle, 4096, ",")) !== FALSE) {
//do something with the data
echo implode("\t", $data);
}
echo "<br>";
fclose($handle);
} else {
echo "Could not open file: " . $file;
}
}
?>
This will get the content of all the CSV files in a directory. Keep in mind this is a stressful task for a server, especially with so many values. So maybe this helps:
function listdirfile_by_date($path)
{
    $dir = opendir($path);
    $list = array();
    while ($file = readdir($dir)) {
        if ($file != '..' && $file != '.') {
            // key each entry by "mtime,filename" so krsort() puts the newest first
            $mtime = filemtime($path . $file) . ',' . $file;
            $list[$mtime] = $file;
        }
    }
    closedir($dir);
    krsort($list);
    // return the first (newest) entry, or '' if the directory was empty
    foreach ($list as $file) {
        return $file;
    }
    return '';
}
A borrowed function that returns the newest file in a directory, sorted by modification date. With it you can run the script against only the newest file.
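Coming back to Plan 1: it is only a few lines of plain PHP, provided the whole data set fits in memory (10-15 million 9-character values cost a fair amount of RAM with PHP's array overhead, so raise memory_limit accordingly). A rough sketch, assuming one value per line and with hypothetical file names:
<?php
// Read all values, shuffle them, split into two halves, write each half out.
$values = file("input.csv", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
shuffle($values);
$half = (int)ceil(count($values) / 2);
file_put_contents("half1.csv", implode("\n", array_slice($values, 0, $half)) . "\n");
file_put_contents("half2.csv", implode("\n", array_slice($values, $half)) . "\n");
?>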
I have a CSV that is downloaded from the wholesaler every night with updated prices.
What I need to do is edit the price column (2nd column) and multiply the current value by 1.3 (a 30% markup).
My code to read the provided CSV and take just the columns I need is below; however, I can't seem to figure out how to edit the price column.
<?php
// open the csv file in write mode
$fp = fopen('var/import/tb_prices.csv', 'w');
// read csv file
if (($handle = fopen("var/import/Cbl_4036_2408.csv", "r")) !== FALSE) {
$targetColumns = array(1, 2, 3); // get data from the 2nd, 3rd and 4th columns (0-indexed)
while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
$targetData = array(); // array that hold target data
foreach($targetColumns as $column){ // loop through the targeted columns array
if($column[2]){
$data[$column] = $data[0] * 1.3;
}
$targetData[] = $data[$column]; // get the data from the column
}
# Populate the multidimensional array.
$csvarray[$nn] = $targetData; // add target data to csvarray
// write csv file
fputcsv($fp, $targetData);
}
fclose($handle);
fclose($fp);
echo "CSV File Written Successfully!";
}
?>
Could somebody point me in the right direction please, explaining how you've worked out the function too so I can learn at the same time.
You are always multiplying your price as $data[0] * 1.3.
That may be what's wrong here: $data[0] is the first column, not the price.
Other views:
If this is a one-off job for this data (CSV) handling, try to solve it with MySQL itself. Create a table matching the CSV structure, import the .csv data into that MySQL table, and then operate on it with SQL as you want.
No loops, no coding, no file read/write, and precise control over what you want to do with UPDATE. You just need to be aware of the delimiters (line separators, e.g. \r\n; column separators, e.g. comma, tab or semicolon) and whether the data is enclosed in double/single quotes or not.
Once you modify your data, you can export it back to CSV again.
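If you do take the MySQL route, the price bump itself is a single UPDATE once the data is imported. A minimal sketch via PDO, where the connection details and the tb_prices/price names are placeholders for your own:
<?php
// Connect and raise every price by 30% in one statement.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('UPDATE tb_prices SET price = price * 1.3');
?>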
If you want to handle the .csv file itself, open it with one handle (read-only mode) and write to a second file, preserving the original data.
You say that the column containing the price is the second, but then you index it with zero. Anyway, the whole thing can be simpler:
$handle = fopen("test.csv", "r");
if ( $handle !== FALSE) {
$out = "";
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
$data[1] = ((float)$data[1] * 1.3);
$out .= implode(";",$data) . "\n";
}
fclose($handle);
file_put_contents("test2.csv", $out);
}
This code opens a CSV file with a semicolon as separator (change the third argument of fgetcsv if yours uses commas).
Then it reads every line and, for every line, multiplies the second column (index 1) by 1.3.
This line
$out .= implode(";",$data) . "\n";
generates a line for the new CSV file. See implode in the official documentation.
Afterwards I close the file handle. It's useless to keep two files open at once when you can write the second file in one go. That said, this only holds for small files, since the whole output is built up in memory first.
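If the file ever grows beyond what you want to hold in memory, the same idea works streamed: write each row out as soon as it is read, so memory use stays flat no matter the file size. A sketch of that variant:
<?php
$in = fopen("test.csv", "r");
$out = fopen("test2.csv", "w");
while (($data = fgetcsv($in, 1000, ";")) !== FALSE) {
    $data[1] = (float)$data[1] * 1.3; // bump the 2nd column (index 1) by 30%
    fputcsv($out, $data, ";");        // write immediately instead of buffering
}
fclose($in);
fclose($out);
?>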
I'm trying my best to learn PHP and hack things out myself. But this part has me stuck.
I have two CSV files with hundreds of rows each.
CSV 1 looks like this:
name, email, interest
CSV 2 looks like this:
email only
I'm trying to write a script to compare the two files looking for duplicates. I only want to keep the duplicates. But as you can see, CSV 2 only contains an email. If an email in CSV 1 DOES NOT EXIST in CSV 2, then the row containing that email in CSV 1 should be deleted.
The end result can either overwrite CSV 1 or create a fresh new file called "final.csv"... whatever is easiest.
I would be grateful for the help.
I tried something along these lines with no luck:
egrep -v $(cat csv2.csv | tr '\n' '|' | sed 's/.$//') csv1.csv
and
grep -v -f csv22.csv csv1.csv >output-file
cheers,
marc
Here is a script that will loop through both files and output a 3rd file containing only the rows of file1 whose email address also appears in file2.
if (($file3 = fopen("file3.csv", "w")) !== FALSE) {
if (($file1 = fopen("file1.csv", "r")) !== FALSE) {
while (($file1Row = fgetcsv($file1)) !== FALSE) {
if (($file2 = fopen("file2.csv", "r")) !== FALSE) {
while (($file2Row = fgetcsv($file2)) !== FALSE) {
if ( strtolower(trim($file2Row[0])) == strtolower(trim($file1Row[1])) )
fputcsv($file3, $file1Row);
}
fclose($file2);
}
}
fclose($file1);
}
fclose($file3);
}
Couple of notes:
You may need to provide some additional arguments to fgetcsv, depending on how your csv is structured (e.g. delimiter, quotes)
Based on how you listed the contents of each file, this code reads the 2nd column of file1, and the 1st column of file2. If that's not really how they are positioned, you will need to change the number in the bracket for $file1Row[1] and $file2Row[0]. Column # starts at 0.
The script is currently set to overwrite file3.csv if it exists. If you want it to append instead of overwrite, change the 2nd argument of the $file3 fopen to "a" instead of "w"
Example:
file1.csv:
john,john@foobar.com,blah
mary,mary@blah.com,something
jane,jan@something.com,blarg
bob,bob@test.com,asdfsfd
file2.csv
mary@blah.com
bob@test.com
file3.csv (generated)
mary,mary@blah.com,something
bob,bob@test.com,asdfsfd
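One note on performance: the nested loop above re-reads file2 once for every row of file1, which is fine for hundreds of rows but slows down quickly as the files grow. A sketch of a variant that loads file2's emails into an array once and then makes a single pass over file1:
<?php
// Load all emails from file2 into a lookup table (lowercased keys).
$emails = array();
if (($file2 = fopen("file2.csv", "r")) !== FALSE) {
    while (($row = fgetcsv($file2)) !== FALSE) {
        $emails[strtolower(trim($row[0]))] = true;
    }
    fclose($file2);
}
// Single pass over file1, keeping only rows whose email appears in file2.
$file3 = fopen("file3.csv", "w");
if (($file1 = fopen("file1.csv", "r")) !== FALSE) {
    while (($row = fgetcsv($file1)) !== FALSE) {
        if (isset($emails[strtolower(trim($row[1]))])) {
            fputcsv($file3, $row);
        }
    }
    fclose($file1);
}
fclose($file3);
?>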
Solved! The problem was with Mac line breaks. Look at the code below to see the additions at the beginning and end of the code to fix that problem. Thank you Crayon Violent for all of your help!
ini_set('auto_detect_line_endings',TRUE);
if (($file3 = fopen("output.csv", "w")) !== FALSE) {
if (($file1 = fopen("dirty.csv", "r")) !== FALSE) {
while (($file1Row = fgetcsv($file1)) !== FALSE) {
if (($file2 = fopen("clean.csv", "r")) !== FALSE) {
while (($file2Row = fgetcsv($file2)) !== FALSE) {
if ( strtolower(trim($file2Row[0])) == strtolower(trim($file1Row[1])) )
fputcsv($file3, $file1Row);
}
fclose($file2);
}
}
fclose($file1);
}
fclose($file3);
}
ini_set('auto_detect_line_endings',FALSE);
I am trying to programmatically delete blank lines in CSV files using PHP. Files are uploaded to a site and converted to CSV using PHPExcel. A particular type of CSV is being generated with blank lines in between the data rows, and I'm trying to clean them up with PHP without any luck. Here is an example of what this CSV looks like: https://gist.github.com/vinmassaro/467ea98151e26a79d556
I need to load the CSV, remove the blank lines, and save it, using either PHPExcel or standard PHP functions. Thanks in advance.
EDIT:
Here is a snippet from how it is currently converted with PHPExcel. This is part of a Drupal hook, acting on a file that has just been uploaded. I couldn't get the PHPExcel removeRow method working because it didn't seem to work on blank lines, only empty data rows.
// Load the PHPExcel IOFactory.
require_once(drupal_realpath(drupal_get_path('module', 'custom')) . '/PHPExcel/Classes/PHPExcel/IOFactory.php');
// Load the uploaded file into PHPExcel for normalization.
$loaded_file = PHPExcel_IOFactory::load(drupal_realpath($original_file->uri));
$writer = PHPExcel_IOFactory::createWriter($loaded_file, 'CSV');
$writer->setDelimiter(",");
$writer->setEnclosure("");
// Get path to files directory and build a new filename and filepath.
$files_directory = drupal_realpath(variable_get('file_public_path', conf_path() . '/files'));
$new_filename = pathinfo($original_file->filename, PATHINFO_FILENAME) . '.csv';
$temp_filepath = $files_directory . '/' . $new_filename;
// Save the file with PHPExcel to the temp location. It will be deleted later.
$writer->save($temp_filepath);
If you want to use PHPExcel, search for CSV.php and edit this file:
// Write rows to file
for ($row = 1; $row <= $maxRow; ++$row) {
    // Convert the row to an array...
    $cellsArray = $sheet->rangeToArray('A'.$row.':'.$maxCol.$row, '', $this->preCalculateFormulas);
    // edit by ger
    // if this is the last row, no line break will be added
    $ifMaxRow = ($row == $maxRow) ? TRUE : FALSE;
    // ... and write to the file
    $this->writeLine($fileHandle, $cellsArray[0], $ifMaxRow);
}
at the end of file edit this
/**
 * Write line to CSV file
 *
 * edit by ger
 *
 * @param mixed $pFileHandle PHP filehandle
 * @param array $pValues Array containing values in a row
 * @throws PHPExcel_Writer_Exception
 */
private function writeLine($pFileHandle = null, $pValues = null, $ifMaxRow = false)
{
...
// Add enclosed string
$line .= $this->enclosure . $element . $this->enclosure;
}
insert the following:
if ($ifMaxRow == false) {
    // Add line ending
    $line .= $this->lineEnding;
}
Using str_replace as in Mihai Iorga's comment will work. Something like:
$csv = file_get_contents('path/file.csv');
$no_blanks = str_replace("\r\n\r\n", "\r\n", $csv);
file_put_contents('path/file.csv', $no_blanks);
I copied the text from the example you posted, and this worked, although I had to change the "find" parameter to "\r\n \r\n" instead of "\r\n\r\n" because of a single space on each of the blank-looking lines.
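If you'd rather not guess at the exact whitespace, a slightly more defensive variant is to split the file into lines, drop every line that trims to nothing, and join the rest back together. A sketch, assuming the file fits in memory:
<?php
$lines = file('path/file.csv', FILE_IGNORE_NEW_LINES);
$lines = array_filter($lines, function ($line) {
    return trim($line) !== ''; // drop blank and whitespace-only lines
});
file_put_contents('path/file.csv', implode("\r\n", $lines) . "\r\n");
?>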
Try this:
<?php
$handle = fopen("test.csv", 'r'); //your csv file
$clean = fopen("clean.csv", 'a+'); //new file with no empty rows
while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
$num = count($data);
if($num > 1)
fputcsv($clean, $data ,";");
}
fclose($handle);
fclose($clean);
?>
Tested on my localhost. Note that the fputcsv() call above writes with ";" as the delimiter, so the clean file's separator differs from the input's commas; pass "," instead if you want to keep the original format.
Output Data:
Initial File:
col1,col2,col3,col4,col5,col6,col7,col8
0,229.500,7.4,3165.5,62,20.3922,15.1594,0
1,229.600,8.99608,3156.75,62,15.6863,16.882,0
2,229.700,7.2549,3130.25,62,16.8627,15.9633,0
3,229.800,7.1098,3181,62,17.2549,14.1258,0
Clean Csv File:
col1 col2 col3 col4 col5 col6 col7 col8
0 229.500 7.4 3165.5 62 203.922 151.594 0
1 229.600 899.608 3156.75 62 156.863 16.882 0
2 229.700 72.549 3130.25 62 168.627 159.633 0
3 229.800 71.098 3181 62 172.549 141.258 0
Is it possible to validate a text file before I dump its data into a MYSQL database?
I want to check if it contains, say, 5 columns (of data). If so, then I go ahead with the following query:
LOAD DATA CONCURRENT INFILE 'c:/test/test.txt'
INTO TABLE DUMP_TABLE FIELDS TERMINATED BY '\t' ENCLOSED BY '' LINES TERMINATED BY '\n' ignore 1 lines.
If not, I remove the entire row. I repeat this process for all rows in the txt file.
The text file contains data of the format:
id col2 col3 2012-07-27-19:27:06 col5
id col2 col3 2012-07-25-09:58:50 col5
id col2 col3 2012-07-23-10:14:13 col5
EDIT: After reading your comments, here's the code for doing the same on tab-separated data:
$handler = fopen("myfile.txt","r");
$error = false;
while (!feof($handler)){
fgets($handler,$linetocheck);
$cols = explode (chr(9), $linetocheck); //edit: using http://es.php.net/manual/en/function.fgetcsv.php you can get the same result as with fgets+explode
if (count($cols)>$max_cols){
$error=true;
break;
}
}
fclose($handler);
if (!$error){
//...do stuff
}
This code reads a file, let's say "myfile.txt", line by line and sets the variable $error to true if any line is longer than $max_cols characters. (My apologies if that's not what you're asking; your question is not the clearest to me.)
$handler = fopen("myfile.txt","r");
$error = false;
while (!feof($handler)){
fgets($handler,$linetocheck);
if (strlen($linetocheck)>$max_cols){
$error=true;
break;
}
}
fclose($handler);
if (!$error){
//...do stuff
}
I know it's an old thread, but I was looking for something similar myself and came across this topic, but none of the answers provided here helped me.
Thus, I went ahead and came up with my own solution, which is tested and works perfectly (and can still be improved).
Assume we have a CSV file named example.csv that contains the following dummy data (on purpose, the last line, the 6th, contains one extra value compared to the other rows):
Name,Country,Age
John,Ireland,18
Ted,USA,22
Lisa,UK,23
Michael,USA,20
Louise,Ireland,22,11
Now, to check that all the rows in the CSV file have the same number of values, the following block of code will do the trick and pinpoint the line where the error occurred:
function validateCsvColumnLength($pathToCsvFile)
{
    if (!file_exists($pathToCsvFile) || !is_readable($pathToCsvFile)) {
        throw new \Exception("Filename doesn't exist or is not readable.");
    }
    if (!$handle = fopen($pathToCsvFile, "r")) {
        throw new \Exception("Stream error");
    }
    // Record how many values each row contains.
    $rowLength = [];
    while (($data = fgetcsv($handle)) !== FALSE) {
        $rowLength[] = count($data);
    }
    fclose($handle);
    $rowKeyWithError = array_search(max($rowLength), $rowLength);
    $differentRowCount = count(array_unique($rowLength));
    // if any row has more or less data than the others, throw an error naming the row that triggered it
    if ($differentRowCount !== 1) {
        throw new \Exception("Error, data count from row {$rowKeyWithError} does not match header size");
    }
    return true;
}
To actually test it, just do a var_dump() to see the result:
var_dump(validateCsvColumnLength('example.csv'));
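Since the function throws on failure, in real use you would typically wrap the call in a try/catch rather than var_dump() it, for example:
try {
    validateCsvColumnLength('example.csv');
    // all rows matched; safe to go ahead with the import
} catch (\Exception $e) {
    echo $e->getMessage();
}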
What columns do you mean? If you just mean the number of characters per row, just split (explode) the file into rows and check whether their lengths are equal to 5.
If you mean columns with delimiters, then count the occurrences of that delimiter in each row and again check whether they equal 5. Use fgetcsv for that.
I'm assuming you're talking about the length of each line in the file. If so, here's a possible solution.
$file_handle = fopen("myfile", "r");
while (!feof($file_handle)) {
    $line = fgets($file_handle);
    // rtrim the newline that fgets keeps, otherwise every line fails the check
    if (strlen(rtrim($line, "\r\n")) != 5) {
        throw new Exception("Could not save file to database.");
    }
}
fclose($file_handle);
Yes, it is possible. I've done that exact thing. Use PHP's csv processing functions.
You will need these functions:
fopen()
fgetcsv()
And possibly some others.
fgetcsv returns an array.
I'll give you a short example of how you can validate.
here's the csv:
col1,col2,col3,col4
1,2,3,4
1,2,3,4,
1,2,3,4,5
1,2,3,4
I'll skip the fopen part and go straight to the validation step.
Note that "\t" is the tab character.
$row_length = 0;
$i = 0;
while (($row = fgetcsv($handle, 0, "\t")) !== FALSE) {
    if ($i == 0) {
        $row_length = sizeof($row); // remember the header's column count
    } else {
        if (sizeof($row) != $row_length) {
            echo "Error, line $i of the data does not match header size";
            break;
        }
    }
    $i++;
}
That would test each row to make sure it is the same as the 1st row's ($i = 0) length.
EDIT:
And, in case you don't know how to search the internet, here is the page for fgetcsv:
http://php.net/manual/en/function.fgetcsv.php
Here is the function prototype:
array fgetcsv ( resource $handle [, int $length = 0 [, string $delimiter = ',' [, string $enclosure = '"' [, string $escape = '\' ]]]] )
As you can see, it has everything you would need for doing a quick scan in PHP before you send your data to LOAD DATA INFILE.
I have solved your exact problem in my own program. My program also automatically eliminates duplicate rows and other cool stuff.
You can try to see if fgetcsv will suffice. If it doesn't, please be a bit more descriptive on what you mean by columns.