Reading a block of lines in a file using PHP

Suppose I have a 100GB txt file containing millions of lines of text. How could I read this text file in blocks of lines using PHP?
I can't use file_get_contents() because the file is too large. fgets() also reads the text line by line, which will likely take a long time to finish reading the whole file.
If I use fread($fp, 5030), where '5030' is some length value it has to read, would there be a case where it won't read the whole line (such as stopping in the middle of the line) because it has reached the max length?

I can't use file_get_contents() because the file is too large. fgets() also reads the text line by line, which will likely take a long time to finish reading the whole file.
I don't see why you shouldn't be able to use fgets():
$fh = fopen('bigfile.txt', 'r'); // file name assumed
$blocksize = 50; // in "number of lines"
while (!feof($fh)) {
    $lines = array();
    $count = 0;
    while (!feof($fh) && (++$count <= $blocksize)) {
        $lines[] = fgets($fh);
    }
    doSomethingWithLines($lines);
}
fclose($fh);
Reading 100GB will take time anyway.

The fread approach sounds like a reasonable solution. You can detect whether you've reached the end of a line by checking whether the final character in the string is a newline character ('\n'). If it isn't, then you can either read some more characters and append them to your existing string, or you can trim characters from your string back to the last newline, and then use fseek to adjust your position in the file.
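For illustration, here is a minimal sketch of the trim-and-fseek variant described above. The file name and chunk size are assumptions, and it presumes no line is longer than one chunk:
$fp = fopen("bigfile.txt", "r");
while (!feof($fp)) {
    $chunk = fread($fp, 5030);
    $last_nl = strrpos($chunk, "\n");
    if (!feof($fp) && $last_nl !== false) {
        // Rewind past the partial line at the end of the chunk so the
        // next fread() starts exactly at the beginning of that line.
        fseek($fp, $last_nl + 1 - strlen($chunk), SEEK_CUR);
        $chunk = substr($chunk, 0, $last_nl + 1);
    }
    $lines = explode("\n", rtrim($chunk, "\n"));
    // ... process $lines ...
}
fclose($fp);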
Side point: Are you aware that reading a 100GB file will take a very long time?

I think you have to use fread($fp, $somesize) and check manually whether you have found the end of the line; otherwise, read another chunk.
Hope this helps.

I would recommend implementing the reading of a single line within a function, hiding the implementation details of that specific step from the rest of your code; the processing function shouldn't care how the line was retrieved. You can then implement your first version using fgets() and try other methods if you notice it is too slow. It could very well be that the initial implementation is too slow, but the point is: you won't know until you've benchmarked.
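A minimal sketch of that abstraction, with hypothetical names (next_line, process_line, the file name), so the internals can later be swapped for a buffered fread() version without touching the callers:
function next_line($fh) {
    return fgets($fh); // v1: the naive version; replace the internals after benchmarking
}

$fh = fopen('bigfile.txt', 'r');
while (($line = next_line($fh)) !== false) {
    process_line($line);
}
fclose($fh);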

I know this is an old question, but I think there is value in a new answer for anyone who finds this question eventually.
I agree that reading 100GB takes time; that is exactly why we should find the most efficient way to read it, so the total time is as small as possible, instead of thinking "who cares how long it takes if it is already a lot". So, let's find our lowest possible time.
Another solution:
1. Cache a chunk of raw data: use fread() to read a chunk of the file into a buffer.
2. Read line by line from the cache until the end of the cache or the end of the data is found.
3. Read the next chunk and repeat: grab the unprocessed last part of the chunk (the part in which you were still looking for the line delimiter) and move it to the front; then read a chunk of your defined size minus the size of the unprocessed data, and put it right after that unprocessed part. There you go: you have a new complete chunk.
4. Repeat the line-by-line reading and this process until the file is read completely (a sketch follows this list).
You should use a cache chunk bigger than any expected line length. The bigger the cache size, the faster you read, but the more memory you use.
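A sketch of that carry-over scheme, assuming "\n" line endings and a hypothetical file name; the cache size must exceed the longest expected line:
$cache_size = 1024 * 1024;            // 1 MB cache chunk
$fh = fopen('bigfile.txt', 'r');
$carry = '';                          // unprocessed tail of the previous chunk
while (!feof($fh)) {
    // Refill: the leftover tail goes to the front, then read enough
    // bytes to bring the buffer back up to $cache_size.
    $buffer = $carry . fread($fh, $cache_size - strlen($carry));
    $lines = explode("\n", $buffer);
    // The last element has no delimiter yet; keep it for the next pass
    // (unless the file is finished, in which case it is a real line).
    $carry = feof($fh) ? '' : array_pop($lines);
    foreach ($lines as $line) {
        // ... process $line ...
    }
}
fclose($fh);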

Related

Faster Way to Read File Line by Line?

In PHP, I use fopen(), fgets(), and fclose() to read a file line by line. It works well. But I have a script (run from the CLI) that has to process three hundred 5GB text files. That's approximately 3 billion fgets() calls. It works well enough, but at this scale tiny speed savings will add up extremely fast. So I'm wondering if there are any tricks to speed up the process.
The only potential thing I thought of was getting fgets() to read more than one line at once. It doesn't look like it supports that, but I could in theory do, let's say, 20 consecutive $line[] = fgets($file); calls and then process the array. That's not quite the same thing as reading multiple lines in one command, so it may not have any effect. But I know that queueing MySQL inserts and sending them as one giant insert (another trick I'm going to implement in this script after more testing and benchmarking) will save a lot of time.
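As an aside, a hedged sketch of that multi-row INSERT trick; the table name, column, batch size, and $mysqli connection are all assumptions:
$rows = array();
while (($line = fgets($file)) !== false) {
    $rows[] = "('" . $mysqli->real_escape_string(rtrim($line)) . "')";
    if (count($rows) >= 1000) {        // flush every 1000 rows in one statement
        $mysqli->query('INSERT INTO lines (content) VALUES ' . implode(',', $rows));
        $rows = array();
    }
}
if ($rows) {                           // flush the remainder
    $mysqli->query('INSERT INTO lines (content) VALUES ' . implode(',', $rows));
}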
Update 4/13/19
Here is the solution I went with. Originally I had a much more complicated method of slicing off the end of each read, but then I realized it can be done much more simply.
$read_length = 131072; // 128 KB per read (see the note below)
$index_file = fopen("path/to/file", "r");
$chunk = "";
while (!feof($index_file))
{
    $chunk .= fread($index_file, $read_length);
    $payload_lines = explode("\n", $chunk);
    if (!feof($index_file))
    { $chunk = array_pop($payload_lines); } // carry the partial last line into the next read
    // ... process each complete line in $payload_lines ...
}
fclose($index_file);
Of course PHP has a function for everything. So I break every read into an array of lines and array_pop() the last item in the array back to the beginning of the 'read buffer'. That last part is probably a split line, but not necessarily. Either way, it goes back in and gets processed with the next loop (unless we're done with the file; then we don't pop it).
The only thing you have to watch out for here is a line so long that a single read won't capture the whole thing. But know your data; that probably won't be a hassle. For me, I'm parsing a JSON-ish file, and I'm reading 128 KB at a time, so there are always many line breaks in each read.
Note: I settled on 128 KB by running a million benchmarks and finding the size my server processes the absolute fastest. This parsing function will run 300 times, so every second I save saves me 5 minutes of total runtime.
One possible approach that might be faster would be to read large chunks of the file with fread(), split them by newlines, and then process the lines. You'd have to take into account that the chunks may sever lines, and you'd have to detect this and glue them back together.
Generally speaking, the larger the chunk you can read in one go, the faster your process should become, within the limits of your available memory.
From fread() docs:
Note that fread() reads from the current position of the file pointer. Use ftell() to find the current position of the pointer and rewind() to rewind the pointer position.

PHP read a large file line by line and string replace

I'd like to read a large file line by line, perform string replacement, and save the changes to the file, in other words, rewriting one line at a time. Is there any simple solution in PHP/Unix?
The easiest way that comes to mind would be to write the lines into a new file and then replace the old one, but it's not elegant.
I think there are only two options:
Use memory
Read, replace, then store the replaced string in memory; once done, overwrite the source file.
Use a tmp file
Read and replace the string, then write every line immediately to a tmp file; once everything is done, replace the original file with the tmp file.
#1 will be more efficient because I/O is expensive; use it if you have vast memory or the file being processed is not too big.
#2 will be a bit slower but quite stable, even on large files.
Of course, you may combine both approaches by writing the replaced string to the file in chunks of lines (instead of line by line); a sketch follows.
These are the simplest, most elegant ways I can think of.
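A sketch of that combined approach; the file names, buffer size, and the replacement itself are assumptions:
$src = 'data.txt';
$tmp = 'data.txt.tmp';
$in  = fopen($src, 'r');
$out = fopen($tmp, 'w');
$buffer = '';
$count  = 0;
while (($line = fgets($in)) !== false) {
    $buffer .= str_replace('foo', 'bar', $line); // your replacement goes here
    if (++$count === 1000) {                     // flush every 1000 lines
        fwrite($out, $buffer);
        $buffer = '';
        $count  = 0;
    }
}
fwrite($out, $buffer);                           // flush the remainder
fclose($in);
fclose($out);
rename($tmp, $src);                              // swap the result into place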
It seems that using a temporary file is not such a bad solution in most cases.
$f = 'data.txt';
$fh = fopen($f, 'r');
while (($l = fgets($fh)) !== false) {
    file_put_contents('tmp', clean($l), FILE_APPEND); // clean() is your replacement function
}
fclose($fh); // fclose() takes the handle, not the file name
unlink($f);
rename('tmp', $f);

Is it possible to load a text file but only part of it with PHP? [duplicate]


Read single line from a big text file

I have a 10MB text file.
The length of the lines may vary.
Which is the most efficient way (fast and memory friendly) to read just one specific line from this file? e.g. get_me_the_line($nr, $file_resource)
I don't know of a way to jump straight to the line if the lines are of varying length. However, you can iterate through lines pretty quickly when not using them for anything, and return the one of interest.
function ReadLineNumber($file, $number)
{
    $handle = fopen($file, "r");
    $i = 0;
    while (fgets($handle) && $i < $number - 1)
        $i++;
    $line = fgets($handle);
    fclose($handle); // close the handle instead of leaking it
    return $line;
}
Edit
I added - 1 to the loop because this reads a line ahead. $number is therefore a zero-indexed line reference. Change it to - 2 if you would prefer line 1 to mean the first line in the file.
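A quick usage example, with the file name assumed:
echo ReadLineNumber("data.txt", 5); // prints the sixth line (zero-based index 5)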
As the lines are of varying length, you have to look at each character, since any one of them might denote the end of a line. The quickest way would be to load the file in chunks sized like the block size of the filesystem and count the line breaks until you are on the desired line.
The better way would be to have an index file that stores information about the line positions in the file. Using a database could also be a better idea.
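A minimal sketch of that index idea: one pass records the byte offset of every line with ftell(); afterwards, any line is a single fseek() plus fgets() away. The helper names and index format are assumptions:
function build_index($file) {
    $handle = fopen($file, "r");
    $offsets = array(0);              // line 0 starts at byte 0
    while (fgets($handle) !== false) {
        $offsets[] = ftell($handle);  // byte offset where the next line starts
    }
    fclose($handle);
    array_pop($offsets);              // the last entry is EOF, not a line
    return $offsets;
}

function get_me_the_line($nr, $file, $offsets) {
    $handle = fopen($file, "r");
    fseek($handle, $offsets[$nr]);    // jump straight to line $nr (zero-based)
    $line = fgets($handle);
    fclose($handle);
    return $line;
}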
If the file is REALLY large (several GB or more) and your application runs on *nix, you may not want PHP to process the file at all; instead, use existing Unix tools optimized for this kind of line processing. One such tool is sed, which can print a specific line from a huge file and quit.
It should be trivial to wrap this in a shell_exec() call, or similar, to write the function you are looking for.
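For example, a hedged sketch of such a wrapper (the file name is an assumption; '5p;5q' would print line 5 and quit, so sed never scans the rest of the file):
function get_line_via_sed($file, $nr) {
    $cmd = sprintf("sed -n '%dp;%dq' %s", $nr, $nr, escapeshellarg($file));
    return shell_exec($cmd);
}

echo get_line_via_sed("huge.log", 1000000);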

How can I tell what line a file resource is currently "on" in PHP?

Using PHP, it's possible to read off the contents of a file using fopen and fgets. Each time fgets is called, it returns the next line in the file.
How does fgets know what line to read? In other words, how does it know that it last read line 5, so it should return the contents of line 6 this time? Is there a way for me to access that line-number data?
(I know it's possible to do something similar by reading the entire contents of the file into an array with file, but I'd like to accomplish this with fopen.)
There is a "position" kept in memory for each file that is opened; it is automatically updated each time you read a line/character/whatever from the file.
You can get this position with ftell and modify it with fseek:
ftell — Returns the current position of the file read/write pointer
fseek — Seeks on a file pointer
You can also use rewind to... rewind... the position of that pointer.
This does not give you the position as a line number, but rather as a character number (actually, you get the position as a number of bytes from the beginning of the file); once you have that, reading a line is just a matter of reading characters until you hit an end-of-line character.
BTW: as far as I remember, these functions come from the C language -- PHP itself being written in C ;-)
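A tiny illustration of remembering and restoring a position (the file name is assumed):
$fh = fopen("data.txt", "r");
fgets($fh);          // read line 1
$pos = ftell($fh);   // byte offset where line 2 starts
fgets($fh);          // read line 2
fseek($fh, $pos);    // jump back, so the next fgets() re-reads line 2
echo fgets($fh);
fclose($fh);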
Files are just a stream of data, read from the beginning to the end. The OS will remember the position you've read up to in that file. If needed, doing so in the application as well is fairly simple. The OS only cares about byte positions, though, not lines.
Just imagine dealing out a deck of 52 cards sequentially. You hand off the first card. Next time, the 2nd card. When you want to give out the 3rd card, you don't need to start counting from the beginning again, or even remember where you were; you just hand out the next available card, and that will be the third.
It might take a bit more work to read lines, since you'd want to buffer data read from the actual file for performance's sake, but there's not much more to it than recording the offset of the last piece of data you handed out, finding the next newline character, and handing off all the data between those two points.
Neither PHP nor the OS has any real need to keep the line number around, since all the system cares about is "the next line". If you want to know the line number, you keep a counter and increment it every time your app reads a line.
$lineno = 0;
while (!feof($handle)) {
    $buffer = fgets($handle, 4096);
    $lineno++; // keep track of the line number
    // ...
}
I have this old sample, I hope it can help you :)
$File = file('path'); // note: file() loads the whole file into an array
$array = array();
$linenr = 5;
foreach ($File as $line_num => $line)
{
    $array[] = $line; // array_push() returns the new count, so don't assign its result back
}
echo $array[$linenr - 1];
You could just call fgets and increment a $line_number variable each time you call it. That would tell you which line it is on.
