I'm looking for an algorithm strategy. I have a CSV file with 162 columns and 55,000 lines.
I want to sort the data by a date (which is in column 3).
First I tried to put everything directly into an array, but memory explodes.
So I decided to:
1/ Put the first 3 columns into an array.
2/ Sort this array with usort.
3/ Read the CSV file again to recover the other columns.
4/ Append the complete line to a new CSV file.
5/ Replace the line with an empty string in the CSV file that was read.
//First read of the file
while (($data = fgetcsv($handle, 0, ';')) !== false)
{
    $tabLigne[$columnNames[0]] = $data[0];
    $tabLigne[$columnNames[1]] = $data[1];
    $tabLigne[$columnNames[2]] = $data[2];

    $dateCreation = DateTime::createFromFormat('d/m/Y', $tabLigne['Date de Création']);
    if ($dateCreation !== false)
    {
        $tableauDossiers[$row] = $tabLigne;
    }
    $row++;

    unset($data);
    unset($tabLigne);
}
//Order the array by date
usort(
    $tableauDossiers,
    function ($x, $y) {
        $date1 = DateTime::createFromFormat('d/m/Y', $x['Date de Création']);
        $date2 = DateTime::createFromFormat('d/m/Y', $y['Date de Création']);
        return $date1->format('U') > $date2->format('U');
    }
);
fclose($handle);

copy(PATH_CSV.'original_file.csv', PATH_CSV.'copy_of_file.csv');

for ($row = 3; $row <= count($tableauDossiers); $row++)
{
    $handle = fopen(PATH_CSV.'copy_of_file.csv', 'c+');
    $tabHandle = file(PATH_CSV.'copy_of_file.csv');

    fgetcsv($handle);
    fgetcsv($handle);

    $rowHandle = 2;
    while (($data = fgetcsv($handle, 0, ';')) !== false)
    {
        if ($tableauDossiers[$row]['Caisse Locale Déléguée'] == $data[0]
            && $tableauDossiers[$row]['Date de Création'] == $data[1]
            && $tableauDossiers[$row]['Numéro RCT'] == $data[2])
        {
            fputcsv($fichierSortieDossier, $data, ';');

            $tabHandle[$rowHandle] = str_replace("\n", '', $tabHandle[$rowHandle]);
            file_put_contents(PATH_CSV.'copy_of_file.csv', $tabHandle);
            unset($tabHandle);
            break;
        }
        $rowHandle++;

        unset($data);
        unset($tabLigne);
    }
    fclose($handle);
    unset($handle);
}
This algorithm works, but it takes far too long to execute.
Any idea how to improve it?
Thanks
Assuming you are limited to PHP and cannot use a database as suggested in the comments, the next best option is to use an external sorting algorithm.
Split the file into small files. The files should be small enough to sort them in memory.
Sort all these files individually in memory.
Merge the sorted files to one big file by comparing the first lines of each file.
The merging of the sorted files can be done very memory-efficiently: you only need to keep the first line of each file in memory at any given time. The first line with the minimal timestamp goes to the resulting file.
For really big files you can cascade the merging, i.e. if you have 10,000 files you can merge groups of 100 files first, then merge the resulting 100 files.
Example
I use a comma to separate values instead of line-breaks for readability.
The unsorted file (imagine it to be too big to fit into memory):
1, 6, 2, 4, 5, 3
Split the files in parts that are small enough to fit into memory:
1, 6, 2
4, 5, 3
Sort them individually:
1, 2, 6
3, 4, 5
Now merge:
Compare 1 & 3 → take 1
Compare 2 & 3 → take 2
Compare 6 & 3 → take 3
Compare 6 & 4 → take 4
Compare 6 & 5 → take 5
Take 6.
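In PHP, the whole split/sort/merge flow could look roughly like the sketch below. It is only a sketch under assumptions taken from the question (semicolon delimiter, a header row, the date in column index 2 in d/m/Y format); the file names, the chunk size and the temp-file handling are made up for illustration, and validation/error handling is left out.

<?php
// External-sort sketch: split into sorted chunks, then merge the chunks.
$chunkSize = 5000;                      // rows per chunk; small enough to sort in memory
$in = fopen('original_file.csv', 'r');  // illustrative file name
$header = fgetcsv($in, 0, ';');         // keep the header aside

// 1) Split the input into individually sorted chunk files.
$chunkFiles = [];
while (!feof($in)) {
    $rows = [];
    while (count($rows) < $chunkSize && ($row = fgetcsv($in, 0, ';')) !== false) {
        if (isset($row[2])) {           // skip blank or short lines
            $rows[] = $row;
        }
    }
    if (!$rows) {
        break;
    }
    usort($rows, function ($a, $b) {
        // d/m/Y reformatted as Ymd sorts correctly as a plain string
        $da = DateTime::createFromFormat('d/m/Y', $a[2])->format('Ymd');
        $db = DateTime::createFromFormat('d/m/Y', $b[2])->format('Ymd');
        return strcmp($da, $db);
    });
    $chunkName = tempnam(sys_get_temp_dir(), 'chunk');
    $out = fopen($chunkName, 'w');
    foreach ($rows as $row) {
        fputcsv($out, $row, ';');
    }
    fclose($out);
    $chunkFiles[] = $chunkName;
}
fclose($in);

// 2) Merge: only the current line of every chunk is held in memory.
$handles = [];
$current = [];
foreach ($chunkFiles as $i => $name) {
    $handles[$i] = fopen($name, 'r');
    $current[$i] = fgetcsv($handles[$i], 0, ';');
}
$result = fopen('sorted_file.csv', 'w');
fputcsv($result, $header, ';');
while ($current) {
    // pick the chunk whose current row has the smallest date
    $minKey = null;
    foreach ($current as $i => $row) {
        if ($minKey === null
            || DateTime::createFromFormat('d/m/Y', $row[2]) < DateTime::createFromFormat('d/m/Y', $current[$minKey][2])) {
            $minKey = $i;
        }
    }
    fputcsv($result, $current[$minKey], ';');
    $next = fgetcsv($handles[$minKey], 0, ';');
    if ($next === false) {
        fclose($handles[$minKey]);      // this chunk is exhausted
        unlink($chunkFiles[$minKey]);
        unset($current[$minKey]);
    } else {
        $current[$minKey] = $next;
    }
}
fclose($result);

Cascading, as described above, would simply mean running step 2 on groups of chunk files and then merging the intermediate results the same way.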
You have a fairly large set of data to process, so you need to do something to optimize it.
You could increase your memory limit, but that will only postpone the error; when a bigger file comes along, it will crash again (or get way too slow).
The first option is to try to minimize the amount of data: remove all non-relevant columns from the file. Whichever solution you apply, a smaller dataset is always faster.
I suggest you put it into a database and apply your requirements to it, using that result to create a new file. A database is made to manage large data sets, so it'll take a whole lot less time.
Taking that much data and writing it to a file from PHP will still be slow, but could be manageable. Another tactic is to use the command line via a .sh file. If you have basic terminal/SSH skills, you have basic .sh writing capabilities. In that script, you can use mysqldump to export as CSV. mysqldump will be significantly faster, but it's a bit trickier to get going when you're used to PHP.
To improve your current code:
- The unset calls at the end of the first while loop don't do anything useful. Those variables barely store any data and get overwritten anyway when the next iteration of the loop starts.
- Instead of using DateTime for everything, which is easier to work with but slower, use epoch values. I don't know what format the date is in right now, but if you convert it once to epoch seconds (like the result of time()), you end up with two plain numbers. Your usort() will improve drastically, as it no longer has to use the heavy DateTime class, just a simple number comparison; see the sketch after this list.
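For illustration, a minimal sketch of that idea (the sample rows and the column index are made up, not taken from the real file): parse the date once per row while loading, then sort on plain integers.

<?php
// Pre-compute an epoch timestamp per row, then sort on integers only.
$rows = [
    ['A', '05/03/2020', 'RCT-1'],   // illustrative rows: [caisse, date, numéro]
    ['B', '17/01/2019', 'RCT-2'],
];
foreach ($rows as &$row) {
    $d = DateTime::createFromFormat('d/m/Y', $row[1]);   // parse the date once
    $row['ts'] = $d ? $d->getTimestamp() : 0;            // epoch seconds
}
unset($row);

usort($rows, function ($x, $y) {
    return $x['ts'] - $y['ts'];   // cheap integer comparison, no DateTime in the comparator
});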
This all assumes that you need to do it multiple times. If not, just open the file in Excel or Numbers, sort it there, and save a copy.
I've only tried this on a small file, but the principle is very similar to your idea of reading the file, storing the dates and then sorting them, then reading the original file again and writing out the sorted data.
In this version, the load just reads the dates in and creates an array which holds the date and the position in the file of the start of the line (using ftell() after each read to get the file pointer).
It then sorts this array (as the date comes first, a normal sort() is sufficient).
Then it goes through the sorted array and for each entry it uses fseek() to locate the record in the file and reads the line (using fgets()) and writes this line to the output file...
$file = "a.csv";
$out = "sorted.csv";
$handle = fopen($file, "r");
$tabligne = [];
$start = 0;
while ( $data = fgetcsv($handle) ) {
$tabligne[] = ['date' => DateTime::createFromFormat('d/m/Y', $data[2]),
'start' => $start ];
$start = ftell($handle);
}
sort($tabligne);
$outHandle = fopen( $out, "w" );
foreach ( $tabligne as $entry ) {
fseek($handle, $entry['start']);
$copy = fgets($handle);
fwrite($outHandle, $copy);
}
fclose($outHandle);
fclose($handle);
I would load the data into a database and let that worry about the underlying algorithm.
If this is a one-time issue, I would suggest not automating it and using a spreadsheet instead.
Related
I have a CSV file with 104 fields, but I only need 4 of them to use in a MySQL database. Each file has about a million rows.
Could somebody tell me an efficient way to do this? Reading each line into an array takes a long time.
Thanks
You have to read every line in its entirety by definition. This is necessary to find the delimiter for the next record (i.e. the newline character). You only need to discard the data you have read that you don't need. E.g.:
$data = array();
$fh = fopen('data.csv', 'r');
$headers = fgetcsv($fh);
while ($row = fgetcsv($fh)) {
    $row = array_combine($headers, $row);
    $data[] = array_intersect_key($row, array_flip(array('foo', 'bar', 'baz')));
    // alternatively, if you know the column index, something like:
    // $data[] = array($row[1], $row[45], $row[60]);
}
This only retains the columns foo, bar and baz and discards the rest. The reading from file (fgetcsv) is about as fast as it gets. If you need it any faster, you'll have to implement your own CSV tokenizer and parser which skips over the columns you don't need without even temporarily storing them in memory; how much of a performance boost this brings vs. development time necessary to implement this bug free is very debatable.
A simple Excel macro can drop all the unnecessary columns (100 out of 104) within a second. I am looking for a similar solution.
That is because Excel, once a file is opened, has all the data in memory and can act on it very quickly. For an accurate comparison you need to measure the time it takes to open the file in Excel plus dropping the columns, not just dropping the columns.
This question was asked on a message board, and I want to get a definitive answer and intelligent debate about which method is more semantically correct and less resource intensive.
Say I have a file with each line in that file containing a string. I want to generate an MD5 hash for each line and write it to the same file, overwriting the previous data. My first thought was to do this:
$file = 'strings.txt';
$lines = file($file);
$handle = fopen($file, 'w+');
foreach ($lines as $line)
{
    fwrite($handle, md5(trim($line))."\n");
}
fclose($handle);
Another user pointed out that file_get_contents() and file_put_contents() were better than using fwrite() in a loop. Their solution:
$thefile = 'strings.txt';
$newfile = 'newstrings.txt';

$current = file_get_contents($thefile);
$explodedcurrent = explode("\n", $current);

$temp = '';
foreach ($explodedcurrent as $string)
    $temp .= md5(trim($string)) . "\n";

file_put_contents($newfile, $temp);
My argument is that since the main goal of this is to get the file into an array, and file_get_contents() is the preferred way to read the contents of a file into a string, file() is more appropriate and allows us to cut out another unnecessary function, explode().
Furthermore, by directly manipulating the file using fopen(), fwrite(), and fclose() (which is the exact same as one call to file_put_contents()) there is no need to have extraneous variables in which to store the converted strings; you're writing them directly to the file.
My method is the exact same as the alternative - the same number of opens/closes on the file - except mine is shorter and more semantically correct.
What do you have to say, and which one would you choose?
This should be more efficient and less resource-intensive than the previous two methods:
$file = 'passwords.txt';
$passwords = file($file);
$converted = fopen($file, 'w+');
$i = 0;
while (count($passwords) > 0)
{
    fwrite($converted, md5(trim($passwords[$i]))."\n");
    unset($passwords[$i]);
    $i++;
}
fclose($converted);
echo 'Done.';
As one of the comments suggests, do what makes more sense to you, since you might come back to this code in a few months and you want to spend the least amount of time trying to understand it.
However, if speed is your concern, then I would create two test cases (you pretty much already have them) and time them: store a timestamp at the beginning of the script, then subtract it from a timestamp taken at the end of the script to work out how long it took to run. Prepare a few test files, I would go for about 3: two extremes and one normal file, to see which version runs faster.
http://php.net/manual/en/function.time.php
I would think the differences would be marginal, but it also depends on your file sizes.
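If you do benchmark it, a minimal timing sketch could look like this (using microtime(true) rather than time() so sub-second differences show up as well):

<?php
// Wrap each candidate version in a simple timer and compare the results.
$start = microtime(true);

// ... run version 1 or version 2 here ...

$elapsed = microtime(true) - $start;
echo 'Took ' . round($elapsed, 4) . " seconds\n";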
I'd propose writing to a new temporary file while you process the input one. Once done, overwrite the input file with the temporary one.
I'm looking for a very fast method to read a csv file. My data structure looks like this:
timestamp ,float , string ,ip ,string
1318190061,1640851625, lore ipsum,84.169.42.48,appname
and I'm using fgetcsv to read this data into arrays.
The problem: Performance. On a regular basis the script has to read (and process) more than 10,000 entries.
My first attempt is very simple:
//Performance: 0.141 seconds / 13.5 MB
while (!feof($statisticsfile))
{
    $temp = fgetcsv($statisticsfile);
    $timestamp[] = $temp[0];
    $value[] = $temp[1];
    $text[] = $temp[2];
    $ip[] = $temp[3];
    $app[] = $temp[4];
}
My second attempt:
//Performance: 0.125 seconds / 10.8 MB
while (($userinfo = fgetcsv($statisticsfile)) !== FALSE) {
    list ($timestamp[], $value[], $text, $ip, $app) = $userinfo;
}
Is there any way to improve performance even further, or is my method as fast as it could get?
Probably more important: is there any way to define which columns are read? E.g. sometimes only the timestamp and float columns are needed. Is there any better way than my second attempt? :)
Thanks :)
How long is the longest line? Pass that as the second parameter to fgetcsv() and you'll see the greatest improvement.
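For example, with an illustrative file name and 4096 as an assumed upper bound on the line length (pick a value that safely covers your longest line):

<?php
$statisticsfile = fopen('statistics.csv', 'r');   // illustrative path
// passing a maximum line length saves fgetcsv() from having to determine it itself
while (($userinfo = fgetcsv($statisticsfile, 4096)) !== FALSE) {
    list($timestamp[], $value[], $text[], $ip[], $app[]) = $userinfo;
}
fclose($statisticsfile);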
Check how much time PHP spends just reading this file: if it is big, move the file to a ramdisk or an SSD.
[..]sometimes only the timestamp
Something like this:
$f = file_get_contents('statistics.csv'); // the whole file as one string; path is illustrative
preg_match_all('#\d{10},\d{10}, (.*?),\d{1,3}(?:\.\d{1,3}){3},appname#', $f, $res);
print_r($res);
I am using PHP to expose vehicle GPS data from a CSV file. This data is captured at least every 30 seconds for over 70 vehicles and includes 19 columns of data. This produces several thousand rows of data and file sizes around 614 KB. New data is appended to the end of the file. I need to pull out the last row of data for each vehicle, which should represent its most recent status. I am able to pull out one row for each unit; however, since the CSV file is in chronological order, I am pulling out the oldest data in the file instead of the newest. Is it possible to read the CSV from the end to the beginning? I have seen some solutions, however they typically involve loading the entire file into memory and then reversing it, which sounds very inefficient. Do I have any other options? Thank you for any advice you can offer.
EDIT: I am using this data to map real-time locations on-the-fly. The data is only provided to me in CSV format, so I think importing into a DB is out of the question.
With fseek you can set the pointer to the end of the file and use a negative offset to read the file backwards.
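A rough sketch of that approach for grabbing just the final line (the file path is a placeholder; extending it to walk further back, e.g. until every vehicle has appeared once, follows the same pattern):

<?php
// Read the last line of a file without loading the whole file into memory.
$fp = fopen('/path/to/yourfile.csv', 'r');
$line = '';
$pos = -1;
// step back over any trailing newline / carriage-return characters first
while (fseek($fp, $pos, SEEK_END) !== -1 && in_array(fgetc($fp), array("\n", "\r"), true)) {
    $pos--;
}
// collect characters backwards until the previous newline is hit
while (fseek($fp, $pos, SEEK_END) !== -1) {
    $char = fgetc($fp);
    if ($char === "\n") {
        break;
    }
    $line = $char . $line;
    $pos--;
}
fclose($fp);
echo $line;   // the most recently appended record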
If you must use csv files instead of a database, then perhaps you could read the file line-by-line. This will prevent more than the last line being stored in memory (thanks to the garbage collector).
$handle = @fopen("/path/to/yourfile.csv", "r");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        // old values of $last are garbage collected after re-assignment
        $last = $line;
        // you can perform optional computations on past data here if desired
    }
    if (!feof($handle)) {
        echo "Error: unexpected fgets() fail\n";
    }
    fclose($handle);
    // $last will now contain the last line of the file.
    // You may now do whatever with it
}
Edit: I did not see the fseek() post. If all you need is the last line, then that is the way to go.
How can I get a particular line from a 3 GB text file? The lines are delimited by \n, and I need to be able to get any line on demand.
How can this be done? Only one line needs to be returned, and I would not like to use any system calls.
Note: there is the same question elsewhere about how to do this in Bash. I would like to compare it with the PHP equivalent.
Update: each line is the same length the whole way through.
Without keeping some sort of index to the file, you would need to read all of it until you've encountered x number of \n characters. I see that nickf has just posted some way of doing that, so I won't repeat it.
To do this repeatedly in an efficient manner, you will need to build an index. Store some known file positions for certain (or all) line numbers once, which you can then use to seek to the right location using fseek.
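A minimal sketch of such an index (the file name is illustrative): it records the byte offset where every line starts, so any line can then be fetched with a single fseek.

<?php
// Build the index once: $index[$lineNumber] = byte offset of that line (0-based).
$fh = fopen('huge.txt', 'r');       // illustrative file name
$index = [0];                       // line 0 starts at byte 0
while (fgets($fh) !== false) {
    $index[] = ftell($fh);          // offset of the line after the one just read
}
array_pop($index);                  // drop the final EOF offset

// Look up any line directly:
$n = 123456;                        // 0-based line number
fseek($fh, $index[$n]);
echo fgets($fh);
fclose($fh);

For a 3 GB file the index itself can get large, so you might store only every Nth offset (and fgets forward from the nearest one), or serialize the index to disk so it is built only once.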
Edit: if each line is the same length, you do not need the index.
$myfile = fopen($fileName, "r");
fseek($myfile, $lineLength * $lineNumber);
$line = fgets($myfile);
fclose($myfile);
The line number is 0-based in this example, so you may need to subtract one first. The line length includes the \n character.
There is little discussion of the problem, and no mention is made of how the 'one line' should be referenced (by number, by some value within it, etc.), so below is just a guess as to what you're wanting.
If you're not averse to using an object (it might be 'too high level', perhaps) and wish to reference the line by offset, then SplFileObject (available as of PHP 5.1.0) could be used. See the following basic example:
$file = new SplFileObject('myreallyhugefile.dat');
$file->seek(123456789); // seek to line 123456790
echo $file->current(); // or simply, echo $file
That particular method (seek) requires scanning through the file line-by-line. However, if as you say all the lines are the same length then you can instead use fseek to get where you want to go much, much faster.
$line_length = 1024; // each line is 1 KB
$file->fseek($line_length * 1234567); // seek lots of bytes
echo $file->current(); // echo line 1234568
You said each line has the same length, so you can use fopen() in combination with fseek() to get a line quickly.
http://ch2.php.net/manual/en/function.fseek.php
The only way I can think to do it would be like this:
function getLine($fileName, $num) {
    $fh = fopen($fileName, 'r');
    $line = false;
    // read and discard lines until the requested (1-based) line has been read
    for ($i = 0; $i < $num && ($line = fgets($fh)); ++$i);
    fclose($fh);
    return $line;
}
While this is not exactly a solution: why do you need to pull out one line from a 3 GB text file? Is performance an issue, or can this run at a leisurely pace?
If you need to pull lots of lines out of this file at different points in time, I would definitely suggest putting this data into a DB of some kind. SQLite may be your friend here, as it's very simple, but it's not great with lots of scripts/people accessing it at the same time.