I have a 10MB text file.
The length of the lines may vary.
What is the most efficient way (fast and memory-friendly) to read just one specific line from this file, e.g. get_me_the_line($nr, $file_resource)?
I don't know of a way to jump straight to the line if the lines are of varying length. However, you can iterate through lines pretty quickly when not using them for anything, and return the one of interest.
function ReadLineNumber($file, $number)
{
    $handle = fopen($file, "r");
    $i = 0;
    // skip ahead; note that the fgets() in the condition consumes a line on every check
    while (fgets($handle) && $i < $number - 1) {
        $i++;
    }
    $line = fgets($handle);
    fclose($handle);
    return $line;
}
Edit
I added -1 to the loop because the condition reads a line ahead. $number is therefore a zero-indexed line reference. Change it to -2 if you would prefer line 1 to mean the first line in the file.
As the lines are of varying length, you have to look at each character, since any one of them might mark the end of a line. The quickest approach would be to load the file in chunks sized like the block size of the filesystem and count the line breaks until you reach the desired line.
A better way would be to have an index file that stores information about the lines in the file (for example, their byte offsets). Using a database could also be a better idea.
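A rough sketch of that chunk-counting idea, assuming "\n" line endings and a 4096-byte read size; the function name and its zero-based $target parameter are made up for the example.
function lineByChunks($path, $target, $chunkSize = 4096) {
    $fh = fopen($path, "r");
    $line = 0;
    $carry = "";
    while (($chunk = fread($fh, $chunkSize)) !== false && $chunk !== "") {
        $parts = explode("\n", $carry . $chunk);
        $carry = array_pop($parts); // possibly incomplete last line
        foreach ($parts as $part) {
            if ($line === $target) {
                fclose($fh);
                return $part;
            }
            $line++;
        }
    }
    fclose($fh);
    // the last line may have no trailing newline
    return ($line === $target && $carry !== "") ? $carry : null;
}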
If the file is REALLY large (several GB or more) and your application is running on *nix, you may not want to have PHP process the file, and instead use existing Unix tools optimized for this kind of line processing. One such tool is sed, and an example of printing a specific line from a huge file can be found here.
It should be trivial to wrap this in an exec() (or similar) call to build the function you are looking for.
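For example, wrapping sed could look roughly like this; the function name is just a placeholder, escapeshellarg() keeps the filename safe, and sed's q command quits right after printing so the rest of the file is never scanned.
function getLineWithSed($file, $lineNo) {
    $cmd = sprintf("sed -n '%dp;%dq' %s", $lineNo, $lineNo, escapeshellarg($file));
    return (string) shell_exec($cmd); // one line (1-based), including its newline
}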
Related
Considering I have a 100GB text file containing millions of lines of text, how could I read this file in blocks of lines using PHP?
I can't use file_get_contents() because the file is too large. fgets() also reads the text line by line, which will likely take a long time to finish reading the whole file.
If I use fread($fp, 5030), where '5030' is the number of bytes it has to read, could there be a case where it doesn't read a whole line (such as stopping in the middle of a line) because it has reached the maximum length?
I can't use file_get_contents() because the file is too large. fgets() also reads the text line by line, which will likely take a long time to finish reading the whole file.
I don't see why you shouldn't be able to use fgets():
$fh = fopen($filename, "r"); // $filename is the path to the large file
$blocksize = 50; // in "number of lines"
while (!feof($fh)) {
    $lines = array();
    $count = 0;
    while (!feof($fh) && (++$count <= $blocksize)) {
        $lines[] = fgets($fh);
    }
    doSomethingWithLines($lines);
}
fclose($fh);
Reading 100GB will take time anyway.
The fread approach sounds like a reasonable solution. You can detect whether you've reached the end of a line by checking whether the final character in the string is a newline character ('\n'). If it isn't, then you can either read some more characters and append them to your existing string, or you can trim characters from your string back to the last newline, and then use fseek to adjust your position in the file.
Side point: Are you aware that reading a 100GB file will take a very long time?
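To make the fread() idea concrete, here is a hedged sketch: read fixed-size blocks, cut each block back to the last complete line, and fseek() back so the trailing partial line is re-read with the next block. The 5030 block size only mirrors the question; the sketch assumes "\n" line endings and that no single line is longer than one block.
$fp = fopen('huge.txt', 'r');
$blockSize = 5030;
while (!feof($fp)) {
    $block = fread($fp, $blockSize);
    $cut = strrpos($block, "\n");
    if ($cut !== false && !feof($fp)) {
        // rewind past the trailing partial line so it is read again next time
        fseek($fp, -(strlen($block) - $cut - 1), SEEK_CUR);
        $block = substr($block, 0, $cut + 1);
    }
    foreach (explode("\n", rtrim($block, "\n")) as $line) {
        // process $line
    }
}
fclose($fp);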
I think you have to use fread($fp, somesize) and check manually whether you have found the end of the line; otherwise, read another chunk.
Hope this helps.
I would recommend implementing the reading of a single line within a function, hiding the implementation details of that specific step from the rest of your code - the processing function must not care how the line was retrieved. You can then implement your first version using fgets() and then try other methods if you notice that it is too slow. It could very well be that the initial implementation is too slow, but the point is: you won't know until you've benchmarked.
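As one illustration of that separation, a generator (available since PHP 5.5) can hide the retrieval details; readLines() below is only a sketch, and the fgets() inside it could later be swapped for a chunked reader without touching the processing loop.
function readLines($path) {
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($line = fgets($fh)) !== false) {
            yield $line; // hand one line at a time to the caller
        }
    } finally {
        fclose($fh);
    }
}
foreach (readLines('huge.txt') as $line) {
    // process $line
}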
I know this is an old question, but I think there is value in a new answer for anyone who finds this question eventually.
I agree that reading 100GB takes time; that is exactly why we should look for the most effective way to read it, so the time spent is as short as possible instead of just thinking "who cares how long it takes if it is already a lot". So, let's find our lowest possible time.
Another solution:
Cache a chunk of raw data: use fread to read a chunk of the file into a cache.
Read line by line: read line by line from the cache until the end of the cache, or the end of the data, is found.
Read the next chunk and repeat: grab the unprocessed last part of the chunk (the part in which you were still looking for the line delimiter) and move it to the front, then read a chunk of the size you defined minus the size of that unprocessed data and put it right after the unprocessed part; there you go, you have a new complete chunk.
Repeat the line-by-line reading and this whole process until the file has been read completely (a sketch of this loop follows below).
You should use a cache chunk bigger than any line length you expect.
The bigger the cache size, the faster you read, but the more memory you use.
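A minimal sketch of that cache/carry-over loop, assuming "\n" line endings and a cache larger than the longest line; the 1 MB size is arbitrary.
$fh = fopen('huge.txt', 'r');
$cacheSize = 1024 * 1024; // bigger is faster but uses more memory
$carry = '';
while (!feof($fh)) {
    // keep the unprocessed tail at the front, then top the cache back up
    $chunk = $carry . fread($fh, $cacheSize - strlen($carry));
    $lastNl = strrpos($chunk, "\n");
    if ($lastNl === false || feof($fh)) {
        $complete = $chunk; // final chunk, or a chunk without any newline
        $carry = '';
    } else {
        $complete = substr($chunk, 0, $lastNl + 1);
        $carry = substr($chunk, $lastNl + 1); // partial line, moved to the front next time
    }
    foreach (explode("\n", rtrim($complete, "\n")) as $line) {
        // process $line
    }
}
fclose($fh);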
I have a CSV file with records sorted on the first field. I managed to write a function that does a binary search through that file, using fseek for random access within the file.
However, this is still a pretty slow process, since when I seek to some file position, I actually need to look left for the \n character so I can make sure I'm reading a whole line (once a whole line is read, I can check the first field value mentioned above).
Here is the function that returns the line containing the character at position x:
function fgetLineContaining( $fh, $x ) {
    if ($x < 0 || $x > 125145411) // 12514511 is the last pos in my file
        return "";
    // now go as far left as possible, until a newline is found
    // or the beginning of the file is reached
    $c = "";
    while ($x > 0 && $c != "\n" && $c != "\r") {
        fseek($fh, $x);
        $x--; // go left in the file
        $c = fgetc($fh);
    }
    $x += 2; // skip the newline char
    fseek($fh, $x);
    return fgets($fh, 1024); // return the line from the beginning until \n
}
While this is working as expected, I have to say that my CSV file has ~1.5 million lines, and these left-seeks are slowing things down quite a bit.
Is there a better way to seek a line containing position x inside a file?
Also, it would be much better if an object of a class could be saved to a file without serializing it, thus enabling reading of a file object by object. Does PHP support that?
Thanks
I think you really should consider using SQLite or MySQL again (like others have suggested in the comments). Most of the suggestions about pre-calculating indexes are already implemented "properly" in these SQL engines.
You said the speed wasn't good enough with SQL. Did you have the fields indexed properly? How were you querying the data? Were you using bulk queries? Were you using prepared statements? Did the SQL process have enough RAM to keep its indexes in memory?
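Just to illustrate the point about indexes and prepared statements, a minimal SQLite-via-PDO setup could look like this; the table and column names are invented for the example.
$db = new PDO('sqlite:/path/to/lookup.db');
$db->exec('CREATE TABLE IF NOT EXISTS records (sort_key TEXT, payload TEXT)');
$db->exec('CREATE INDEX IF NOT EXISTS idx_records_key ON records (sort_key)');
// a prepared statement reuses the parse/plan work across many lookups
$stmt = $db->prepare('SELECT payload FROM records WHERE sort_key = :key');
$stmt->execute([':key' => 'some-value']);
$row = $stmt->fetch(PDO::FETCH_ASSOC);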
One thing you can possibly try to speed up the current algorithm is to load the (~100MB?) file onto a RAM disk. No matter what you choose to do, either CSV or SQLite, this WILL help speed things up, especially if hard drive seek time is your bottleneck.
You could possibly even read the whole file into PHP arrays (assuming your computer has enough RAM for that). That would allow you to do your search via index ($big_array[$offset]) lookups.
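If it does fit, the lookup itself is trivial; file() loads the whole file into an array in one call (FILE_IGNORE_NEW_LINES just strips the trailing newlines).
$big_array = file($fileName, FILE_IGNORE_NEW_LINES); // one element per line
echo $big_array[$offset]; // $offset is the zero-based line number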
Also, one thing to keep in mind: PHP isn't exactly super fast at doing low-level things. You might want to consider moving away from PHP in favor of C or C++.
I have a script that parses large files line by line. When it encounters an error that it can't handle, it stops, notifying us of the last line parsed.
Is this really the best / only way to seek to a specific line in a file? (fseek() is not usable in my case.)
<?php
for ($i = 0; $i < 100000; $i++)
fgets($fp); // just discard this
I don't have a problem using this; it is fast enough. It just feels a bit dirty. From what I know about the underlying code, I don't imagine there is a better way to do this.
An easy way to seek to a specific line in a file is to use the SplFileObject class, which supports seeking to a line number (seek()) or byte offset (fseek()).
$file = new SplFileObject('myfile.txt');
$file->seek(9999); // Seek to line no. 10,000
echo $file->current(); // Print contents of that line
In the background, seek() just does what your PHP code did (except, in C code).
If you only have the line number to go on, there is no other method of finding the line. Files are not line-based (or even character-based), so there is no way to simply jump to a specific line in a file.
There might be other ways of reading the lines in the file that are slightly faster, like reading larger chunks of the file into a buffer and reading lines from that, but you could only hope for it to be a few percent faster. Any method of finding a specific line in a file still has to read all the data up to that line.
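For what it is worth, that buffered variant could be sketched like this: read large chunks, count newlines with substr_count(), and only split the one chunk that actually contains the wanted (zero-based) line. It assumes "\n" endings and, for brevity, ignores an unterminated final line.
function lineViaBuffer($path, $target, $bufSize = 1048576) {
    $fh = fopen($path, 'r');
    $seen = 0;   // newlines consumed so far
    $carry = ''; // partial line carried over from the previous chunk
    while (($buf = fread($fh, $bufSize)) !== false && $buf !== '') {
        $buf = $carry . $buf;
        $inBuf = substr_count($buf, "\n");
        if ($seen + $inBuf > $target) {
            $lines = explode("\n", $buf);
            fclose($fh);
            return $lines[$target - $seen];
        }
        $seen += $inBuf;
        $nlPos = strrpos($buf, "\n");
        $carry = ($nlPos === false) ? $buf : substr($buf, $nlPos + 1);
    }
    fclose($fh);
    return null; // line number past the end of the file
}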
I know it is late for posting, but it might help some people.
I wrote a function like fseekbyline one day ...
function GoToLine($handle, $line)
{
    fseek($handle, 0); // rewind to the start of the file
    $bufcarac = 0;
    for ($i = 1; $i < $line; $i++) {
        $ligne = fgets($handle);
        $bufcarac += strlen($ligne); // in the end $bufcarac holds the number of characters before the target line
    }
    fseek($handle, $bufcarac);
}
There is no error handling: if you ask for a line < 1, or for line 203 when the file is empty, you will not get anything useful.
The same goes if you seek past the end of the file.
rewind($handle);
for ($i=0; $i < $desired_line; $i++) {
fgetcsv($handle, 1000, ",");
}
This works for me when I need to rewind to a specific line multiple times in my script.
I am not sure how it affects memory or speed, but it does the trick.
If I understand correctly, you want to seek to the specific line at some point after you have found an error. If that is the case, you probably store or print the line-number of the bad line somewhere, depending on what you mean by "notify".
Unless you really mean that you cannot use fseek()*, what you can do is to also store/print the position in the file where the bad line starts. Then you can fseek().
* How, in that case, would fseekbyline() be usable if it existed?
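A tiny illustration of that footnoted idea: record each line's starting offset with ftell() while parsing, then jump straight back with fseek(). parseLine() is only a stand-in for whatever your parser does.
$fh = fopen('data.log', 'r');
$badPos = null;
while (($pos = ftell($fh)) !== false && ($line = fgets($fh)) !== false) {
    if (!parseLine($line)) {
        $badPos = $pos; // byte offset where the bad line begins
        break;
    }
}
// ...later, return directly to the offending line:
if ($badPos !== null) {
    fseek($fh, $badPos);
    $badLine = fgets($fh);
}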
How can I get a particular line from a 3 GB text file? The lines are delimited by \n, and I need to be able to get any line on demand.
How can this be done? Only one line needs to be returned, and I would not like to use any system calls.
Note: there is the same question elsewhere regarding how to do this in Bash. I would like to compare it with the PHP equivalent.
Update: each line is the same length the whole way through.
Without keeping some sort of index into the file, you would need to read all of it until you have encountered x number of \n characters. I see that nickf has just posted a way of doing that, so I won't repeat it.
To do this repeatedly in an efficient manner, you will need to build an index. Store some known file positions for certain (or all) line numbers once, which you can then use to seek to the right location using fseek.
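One possible shape of such an index, sketched with made-up function names: one pass records the byte offset of every 1000th line, and a lookup then only has to walk forward from the nearest recorded offset.
function buildLineIndex($path, $every = 1000) {
    $fh = fopen($path, 'r');
    $index = array(0 => 0); // line number => byte offset of its first character
    $line = 0;
    while (fgets($fh) !== false) {
        $line++;
        if ($line % $every === 0) {
            $index[$line] = ftell($fh);
        }
    }
    fclose($fh);
    return $index; // persist this once, e.g. as JSON, and reuse it
}
function getLineUsingIndex($path, $index, $target, $every = 1000) {
    $nearest = intdiv($target, $every) * $every; // closest indexed line at or below $target
    $fh = fopen($path, 'r');
    fseek($fh, $index[$nearest]);
    for ($i = $nearest; $i < $target; $i++) {
        fgets($fh); // skip the lines between the indexed position and the target
    }
    $line = fgets($fh);
    fclose($fh);
    return $line;
}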
Edit: if each line is the same length, you do not need the index.
$myfile = fopen($fileName, "r");
fseek($myfile, $lineLength * $lineNumber);
$line = fgets($myfile);
fclose($myfile);
Line number is 0 based in this example, so you may need to subtract one first. The line length includes the \n character.
There is little discussion of the problem, and no mention is made of how the 'one line' should be referenced (by number, by some value within it, etc.), so the following is just a guess as to what you're wanting.
If you're not averse to using an object (it might be 'too high level', perhaps) and wish to reference the line by offset, then SplFileObject (available as of PHP 5.1.0) could be used. See the following basic example:
$file = new SplFileObject('myreallyhugefile.dat');
$file->seek(123456789); // seek to line 123456790
echo $file->current(); // or simply, echo $file
That particular method (seek) requires scanning through the file line-by-line. However, if as you say all the lines are the same length then you can instead use fseek to get where you want to go much, much faster.
$line_length = 1024; // each line is 1 KB
$file->fseek($line_length * 1234567); // seek lots of bytes
echo $file->current(); // echo line 1234568
You said each line has the same length, so you can use fopen() in combination with fseek() to get a line quickly.
http://ch2.php.net/manual/en/function.fseek.php
The only way I can think to do it would be like this:
function getLine($fileName, $num) {
    $fh = fopen($fileName, 'r');
    $line = false;
    for ($i = 0; $i < $num && ($line = fgets($fh)); ++$i);
    fclose($fh);
    return $line;
}
While this is not exactly a solution: why do you need to pull one line out of a 3 GB text file? Is performance an issue, or can this run at a leisurely pace?
If you need to pull lots of lines out of this file at different points in time, I would definitely suggest putting this data into a DB of some kind. SQLite may be your friend here, as it's very simple, but it's not great with lots of scripts/people accessing it at one time.