How can I get a particular line from a 3 GB text file? The lines are delimited by \n, and I need to be able to get any line on demand.
How can this be done? Only one line needs to be returned, and I would prefer not to use any system calls.
Note: The same question has been asked elsewhere about doing this in bash. I would like to compare it with the PHP equivalent.
Update: Every line is the same length throughout the file.
Without keeping some sort of index to the file, you would need to read all of it until you've encountered x number of \n characters. I see that nickf has just posted some way of doing that, so I won't repeat it.
To do this repeatedly in an efficient manner, you will need to build an index. Store some known file positions for certain (or all) line numbers once, which you can then use to seek to the right location using fseek.
Edit: if each line is the same length, you do not need the index.
$myfile = fopen($fileName, "r");
fseek($myfile, $lineLength * $lineNumber);
$line = fgets($myfile);
fclose($myfile);
Line number is 0 based in this example, so you may need to subtract one first. The line length includes the \n character.
There is little discussion of the problem and no mention is made of how the 'one line' should be referenced (by number, some value within it, etc.) so below is just a guess as to what you're wanting.
If you're not averse to using an object (it might be 'too high level', perhaps) and wish to reference the line by offset, then SplFileObject (available as of PHP 5.1.0) could be used. See the following basic example:
$file = new SplFileObject('myreallyhugefile.dat');
$file->seek(12345689); // seek() is zero-based, so this is the 12345690th line
echo $file->current(); // or simply, echo $file
That particular method (seek) requires scanning through the file line-by-line. However, if as you say all the lines are the same length then you can instead use fseek to get where you want to go much, much faster.
$line_length = 1024; // each line is 1 KB long, including the \n
$file->fseek($line_length * 1234567); // seek lots of bytes
echo $file->current(); // echo line 1234568
You said each line has the same length, so you can use fopen() in combination with fseek() to get a line quickly.
http://ch2.php.net/manual/en/function.fseek.php
The only way I can think to do it would be like this:
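function getLine($fileName, $num) {
    // $num is 1-based; returns false if the file has fewer than $num lines
    $fh = fopen($fileName, 'r');
    $line = false;
    for ($i = 0; $i < $num && ($line = fgets($fh)) !== false; ++$i);
    fclose($fh);
    return $line;
}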
While this is not exactly a solution, why do you need to pull one line out of a 3 GB text file? Is performance an issue, or can this run at a leisurely pace?
If you need to pull lots of lines out of this file at different points in time, I would definitely suggest putting this data into a database of some kind. SQLite may be your friend here, as it's very simple, but it is not great with lots of scripts/people accessing it at one time.
I'm OK with PHP, but probably not half as good as some of you guys on here.
I am basically trying to find a way to grab a line from a huge (and I mean huge) text file. It's basically a list of keywords that I want to call by line number, preferably without going through all of them before I get to that line; otherwise it could obviously crash my server.
At the moment I'm using this:
$lines = file('http://www.mysite.com/keywords.txt');
foreach ($lines as $line_num => $line) {
echo "$line_num";
}
This works, but I'm sure there's got to be a better way of doing it to save on resource usage, because this puts the whole file into memory. If I could simply tell PHP to give me line number 97, that would rule.
Hope you guys can come up with a solution, as you're much smarter than me :P Thanks.
Use SplFileObject
$file = "test.txt";
$line_number = 1000;
$file_obj = new SplFileObject( $file );
/*** seek to the line number ***/
$file_obj->seek( $line_number );
/*** return the current line ***/
echo $file_obj->current();
If the lines are just text and variable in length, you can't know which line is #97; the only thing that makes it 97th is that there are 96 lines before.
So you need to read the whole file up to that point (this is what SplFileObject does):
$fp = fopen("keywords.txt", "r");
$line = 97; // 1-based number of the line we want
$text = false;
while ($line--)
{
    $text = fgets($fp, 1024); // 1024 = max length of one line
    if ($text === false)
        break; // ERROR: line does not exist
}
fclose($fp);
But if you can store a line number before each line, i.e. the file is
...
95 abbagnale
96 abbatangelo
97 abbatantuono
98 ...
then you can implement a sort of binary search:
- start with s1 = 0 and s2 = file length
- read a keyword and line number at seek position s3 = (s1+s2)/2 (*)
- if line number is less than desired, s1 = s3; else s2 = s3; and repeat previous step.
- if line number is the one desired, strip the number from the text and you get the keyword.
(*) since the line most likely will not start exactly at s#, you need two fgets: one to get rid of the spurious half keyword, the second to read the line number. When you get "close", it will be faster to read a bigger chunk and split it into lines. For example, you seek line 170135 and read in line 170180: what you'd better do is rewind the seek position by one kilobyte, read in a kilobyte of data, and seek 170135 in there.
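The binary search described above could be sketched roughly as follows. This is only an illustration under stated assumptions: every line has the form "number, space, keyword", the numbers are sorted in ascending order, and the function name findNumberedLine and the $window parameter are made up for this example. Once the search range is smaller than $window bytes, it falls back to reading lines sequentially, in the spirit of the "read a bigger chunk when you get close" advice.

```php
<?php
// Hypothetical helper: looks up line number $target in a file whose lines
// look like "97 abbatantuono\n" and are sorted by ascending number.
function findNumberedLine(string $path, int $target, int $window = 4096): ?string
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        return null;
    }
    $lo = 0;                      // s1: the target line starts at or after this offset
    $hi = fstat($fp)['size'];     // s2: the target line starts at or before this offset
    while ($hi - $lo > $window) {
        $mid = intdiv($lo + $hi, 2);   // s3
        fseek($fp, $mid);
        fgets($fp);                    // first fgets: discard the partial line
        $start = ftell($fp);           // offset where the next full line begins
        $line = fgets($fp);            // second fgets: a full "number keyword" line
        if ($line !== false && (int) $line <= $target) {
            $lo = $start;              // target is this line or a later one
        } else {
            $hi = $mid;                // target starts at or before $mid (or we hit EOF)
        }
    }
    // Close enough: scan the remaining lines sequentially from $lo.
    fseek($fp, $lo);
    $found = null;
    while (($line = fgets($fp)) !== false) {
        $number = (int) $line;         // leading digits of the line
        if ($number === $target) {
            $parts = explode(' ', rtrim($line, "\r\n"), 2);
            $found = $parts[1] ?? '';  // strip the number, keep the keyword
            break;
        }
        if ($number > $target) {
            break;                     // passed it: the line does not exist
        }
    }
    fclose($fp);
    return $found;
}
```

With a file numbered as in the example above, findNumberedLine('keywords.txt', 97) would be expected to return the keyword stored on the line that starts with 97, or null if there is no such line.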
Or, if the lengths of the various lines are not too different, it could be worthwhile to store a fixed size line (here the "#" should actually be spaces, and in the line length you need to count the line terminator, \n or \r\n):
abbagnale#########
abbatangelo#######
abbatantuono######
and then, say that each keyword is 32 bytes,
$fp = fopen("keywords.txt", "r");
fseek($fp, 97 * 32, SEEK_SET);
$text = trim(fgets($fp, 32));
fclose($fp);
would be more or less instantaneous.
If the file is on a remote server, though, you still need to download the whole file (up to the desired line), and you'd be better served by placing a "scanner" script on the remote server that could run the search. Then you could run
$text = file_get_contents("http://www.mysite.com/keywords.php?line=97");
and get your line in milliseconds.
There isn't any way to get 'line number x' from a file in pretty much any language without having to read it first some way or the other. A line, after all, is just the stuff between two end-of-line characters. Whereas picking up 'character number x' from a file can be done without loading the whole file (with some difficulty), picking up 'line number x' can't be done without loading all lines till x (and in most methods, you need to load all lines)
A method in which you load all the lines till line x is the following (using fgets):
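$f = fopen('http://www.mysite.com/keywords.txt', 'r');
$i = 97; // 1-based number of the line to fetch
$text = false;
while ($i-- > 0 && ($text = fgets($f, 2048)) !== false) {
    // discard lines until the counter runs out
}
fclose($f);
echo $text;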
This question was asked on a message board, and I want to get a definitive answer and intelligent debate about which method is more semantically correct and less resource intensive.
Say I have a file with each line in that file containing a string. I want to generate an MD5 hash for each line and write it to the same file, overwriting the previous data. My first thought was to do this:
$file = 'strings.txt';
$lines = file($file);
$handle = fopen($file, 'w+');
foreach ($lines as $line)
{
fwrite($handle, md5(trim($line))."\n");
}
fclose($handle);
Another user pointed out that file_get_contents() and file_put_contents() were better than using fwrite() in a loop. Their solution:
$thefile = 'strings.txt';
$newfile = 'newstrings.txt';
$current = file_get_contents($thefile);
$explodedcurrent = explode("\n", $current); // double quotes, so \n is a real newline
$temp = '';
foreach ($explodedcurrent as $string)
    $temp .= md5(trim($string)) . "\n";
$newfile = file_put_contents($newfile, $temp);
My argument is that since the main goal of this is to get the file into an array, and file_get_contents() is the preferred way to read the contents of a file into a string, file() is more appropriate and allows us to cut out another unnecessary function, explode().
Furthermore, by directly manipulating the file using fopen(), fwrite(), and fclose() (which is the exact same as one call to file_put_contents()) there is no need to have extraneous variables in which to store the converted strings; you're writing them directly to the file.
My method is the exact same as the alternative - the same number of opens/closes on the file - except mine is shorter and more semantically correct.
What do you have to say, and which one would you choose?
This should be more efficient and less resource-intensive than the previous two methods:
$file = 'passwords.txt';
$passwords = file($file);
$converted = fopen($file, 'w+');
$i = 0;
while (count($passwords) > 0)
{
    fwrite($converted, md5(trim($passwords[$i])) . "\n");
    unset($passwords[$i]); // free each line once it has been written
    $i++;
}
fclose($converted);
echo 'Done.';
As one of the comments suggests, do what makes more sense to you. You might come back to this code in a few months, and you will want to spend as little time as possible trying to understand it.
However, if speed is your concern, then I would create two test cases (you pretty much already have them) and time each one: record a timestamp at the beginning of the script, then subtract it from a timestamp taken at the end to work out how long the script took to run. Prepare a few files to test with; I would go for about three: two extremes and one normal file. Then see which version runs faster.
http://php.net/manual/en/function.time.php
I would think that differences would be marginal, but it also depends on your file sizes.
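A timing harness along those lines might look like the sketch below. Note that it swaps in microtime(true) rather than time(), because time() only has one-second resolution; the helper name timeIt and the number of runs are assumptions made for illustration.

```php
<?php
// Hypothetical helper: runs $fn several times and returns the fastest
// wall-clock duration in seconds, measured with microtime(true).
function timeIt(callable $fn, int $runs = 5): float
{
    $best = INF;
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);   // timestamp before the run
        $fn();
        $elapsed = microtime(true) - $start;
        $best = min($best, $elapsed);
    }
    return $best;
}
```

Each candidate would be wrapped in a closure and passed to timeIt() once per test file, making sure every run starts from a fresh copy of the input, since both versions overwrite or rewrite data.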
I'd propose to write a new temporary file, while you process the input one. Once done, overwrite the input file with the temporary one.
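A minimal sketch of that temporary-file approach, assuming the temporary file can live next to the input under a ".tmp" suffix (the function name md5Lines and that suffix are illustrative choices, not part of the original suggestion). It reads one line at a time, so memory use should stay flat regardless of file size.

```php
<?php
// Hypothetical helper: replaces every line of $path with its MD5 hash,
// streaming through a temporary file instead of loading everything at once.
function md5Lines(string $path): bool
{
    $in = fopen($path, 'r');
    if ($in === false) {
        return false;
    }
    $tmpPath = $path . '.tmp';      // assumed temp location next to the input
    $out = fopen($tmpPath, 'w');
    if ($out === false) {
        fclose($in);
        return false;
    }
    while (($line = fgets($in)) !== false) {
        fwrite($out, md5(trim($line)) . "\n");
    }
    fclose($in);
    fclose($out);
    // Overwrite the input with the finished temporary file.
    return rename($tmpPath, $path);
}
```

Because the input is only replaced after the temporary file is complete, an interrupted run should leave the original file untouched.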
I have a 10MB text file.
The length of the lines may vary.
Which is the most efficient way (fast and memory friendly) to read just one specific line from this file? e.g. get_me_the_line($nr, $file_resource)
I don't know of a way to just jump to the line, if the lines are of varying length. However you can iterate through lines pretty quickly when not using them for anything, and return the one of interest.
function ReadLineNumber($file, $number)
{
    // $number is zero-based: 0 returns the first line
    $handle = fopen($file, "r");
    for ($i = 0; $i < $number; $i++)
        if (fgets($handle) === false)
            break; // the file has fewer lines than requested
    $line = fgets($handle);
    fclose($handle);
    return $line;
}
Edit
The loop skips the first $number lines without storing them, so $number is a zero-indexed line reference. Pass $number - 1 if you would prefer line 1 to mean the first line in the file.
As the lines are of varying length you have to look at each character as it might denote the end of the line. Quickest would be loading the file in chunks that are sized like the blocksize of the filesystem and counting the linebreaks until you are on the desired line.
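The chunked approach could look roughly like this. It is a sketch only: the function name readLineByChunks is invented, line numbers are assumed to be zero-based, and the 8192-byte default chunk size is a guess that should be tuned to the filesystem's actual block size.

```php
<?php
// Hypothetical helper: finds a zero-based line by counting "\n" in
// fixed-size chunks, then reads just that one line with fgets().
function readLineByChunks(string $path, int $lineNo, int $chunkSize = 8192): ?string
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        return null;
    }
    $remaining = $lineNo;   // newlines still to skip
    $offset = 0;            // byte offset where the wanted line starts
    while ($remaining > 0) {
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;          // end of file before reaching the line
        }
        $count = substr_count($chunk, "\n");
        if ($count < $remaining) {
            $remaining -= $count;
            $offset += strlen($chunk);
            continue;
        }
        // The wanted line starts inside this chunk: locate the newline before it.
        $pos = -1;
        while ($remaining > 0) {
            $pos = strpos($chunk, "\n", $pos + 1);
            $remaining--;
        }
        $offset += $pos + 1;
    }
    $line = null;
    if ($remaining === 0) {
        fseek($fp, $offset);
        $read = fgets($fp);
        $line = ($read === false) ? null : $read;
    }
    fclose($fp);
    return $line;
}
```

The returned string keeps its trailing newline, as fgets() does, and null signals that the file has fewer lines than requested.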
Better way would be to have an index file that stores information about the file containing the lines. Using a database could also be a better idea.
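One possible shape for such an index file is sketched below: a binary file holding one 8-byte offset per line, so that looking up line N means reading 8 bytes at position N * 8. The function names, the index layout, and the zero-based numbering are all assumptions for illustration, and pack('J') requires a 64-bit PHP build.

```php
<?php
// Hypothetical helper: writes the starting byte offset of every line of
// $path into $indexPath as 8-byte big-endian integers. Returns the line count.
function buildLineIndex(string $path, string $indexPath): int
{
    $in = fopen($path, 'rb');
    $idx = fopen($indexPath, 'wb');
    $count = 0;
    while (true) {
        $offset = ftell($in);          // where the next line starts
        if (fgets($in) === false) {
            break;
        }
        fwrite($idx, pack('J', $offset));
        $count++;
    }
    fclose($in);
    fclose($idx);
    return $count;
}

// Hypothetical helper: fetches zero-based line $lineNo using the index.
function readIndexedLine(string $path, string $indexPath, int $lineNo): ?string
{
    if ($lineNo < 0) {
        return null;
    }
    $idx = fopen($indexPath, 'rb');
    fseek($idx, $lineNo * 8);          // each index entry is 8 bytes
    $raw = fread($idx, 8);
    fclose($idx);
    if ($raw === false || strlen($raw) !== 8) {
        return null;                   // no such line in the index
    }
    $offset = unpack('J', $raw)[1];
    $in = fopen($path, 'rb');
    fseek($in, $offset);
    $line = fgets($in);
    fclose($in);
    return ($line === false) ? null : $line;
}
```

The index only needs to be rebuilt when the data file changes; after that, each lookup costs two seeks regardless of the line number.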
If the file is REALLY large (several GB or more) and your application is running on *nix, you may not want PHP to process the file at all, and instead use one of the existing Unix tools optimized for this kind of line processing. One such tool is sed, and an example of printing a specific line from a huge file can be found here.
It should be trivial to wrap this in a call to exec(), shell_exec(), or similar to write the function you are looking for.
I have a script that parses large files line by line. When it encounters an error that it can't handle, it stops, notifying us of the last line parsed.
Is this really the best / only way to seek to a specific line in a file? (fseek() is not usable in my case.)
<?php
for ($i = 0; $i < 100000; $i++)
fgets($fp); // just discard this
I don't have a problem using this, it is fast enough - it just feels a bit dirty. From what I know about the underlying code, I don't imagine there is a better way to do this.
An easy way to seek to a specific line in a file is to use the SplFileObject class, which supports seeking to a line number (seek()) or byte offset (fseek()).
$file = new SplFileObject('myfile.txt');
$file->seek(9999); // Seek to line no. 10,000
echo $file->current(); // Print contents of that line
In the background, seek() just does what your PHP code did (except, in C code).
If you only have the line number to go on, there is no other method of finding the line. Files are not line based (or even character based), so there is no way to simply jump to a specific line in a file.
There might be other ways of reading the lines in the file that might be slightly faster, like reading larger chunks of the file into a buffer and read lines from that, but you could only hope for it to be a few percent faster. Any method to find a specific line in a file still has to read all data up to that line.
I know it is late to post, but this may help some people.
I once wrote a function along the lines of fseekbyline:
function GoToLine($handle,$line)
{
fseek($handle,0); // seek to 0
$i = 0;
$bufcarac = 0;
for($i = 1;$i<$line;$i++)
{
$ligne = fgets($handle);
$bufcarac += strlen($ligne); // in the end, $bufcarac holds the number of bytes before the target line
}
fseek($handle,$bufcarac);
}
There is no error handling: if you ask for a line below 1, or for line 203 when the file is empty, you will get nothing good.
The same applies if you try to go past the end of the file.
rewind($handle);
for ($i=0; $i < $desired_line; $i++) {
fgetcsv($handle, 1000, ",");
}
This is working for me while I need to rewind to a specific line multiple times in my script.
I am not sure if this eats up memory or speed, but it does the trick.
If I understand correctly, you want to seek to the specific line at some point after you have found an error. If that is the case, you probably store or print the line-number of the bad line somewhere, depending on what you mean by "notify".
Unless you really mean that you cannot use fseek()*, what you can do is to also store/print the position in the file where the bad line starts. Then you can fseek().
* How, in that case, would fseekbyline() be usable if it existed?
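Recording the byte position alongside the line number could be done roughly as follows. This is a sketch under assumptions: parseUntilError and the $parseLine callback (which returns false for a line it cannot handle) are hypothetical names standing in for the real parser.

```php
<?php
// Hypothetical helper: parses lines from $fp until $parseLine() rejects one.
// Returns null when the whole file was parsed, or the 1-based line number
// and the byte offset at which the bad line starts.
function parseUntilError($fp, callable $parseLine): ?array
{
    $lineNo = 0;
    while (true) {
        $offset = ftell($fp);          // remember where this line begins
        $line = fgets($fp);
        if ($line === false) {
            return null;               // reached the end without errors
        }
        $lineNo++;
        if (!$parseLine($line)) {
            return ['line' => $lineNo, 'offset' => $offset];
        }
    }
}
```

On a later run, fseek($fp, $bad['offset']) followed by fgets($fp) would jump straight back to the bad line without discarding everything before it. The line count starts from wherever the handle was positioned when the function was called.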
Using PHP, it's possible to read off the contents of a file using fopen and fgets. Each time fgets is called, it returns the next line in the file.
How does fgets know what line to read? In other words, how does it know that it last read line 5, so it should return the contents of line 6 this time? Is there a way for me to access that line-number data?
(I know it's possible to do something similar by reading the entire contents of the file into an array with file, but I'd like to accomplish this with fopen.)
There is a "position" kept in memory for each file that is opened ; it is automatically updated each time you are reading a line/character/whatever from the file.
You can get this position with ftell, and modify it with fseek :
ftell — Returns the current position
of the file read/write pointer
fseek — Seeks on a file pointer
You can also use rewind to... rewind... the position of that pointer.
This does not give you a position as a line number, but something closer to a character number (actually, you get the position as a number of bytes from the beginning of the file); once you have that, reading a line is just a matter of reading characters until you hit an end-of-line character.
BTW : as far as I remember, these functions are coming from the C language -- PHP itself being written in C ;-)
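To see the pointer move, something like the following could be tried. It is just an illustrative sketch: the function name traceFilePointer is made up, and it assumes the file has at least two lines.

```php
<?php
// Hypothetical helper: records ftell() before and after each fgets() call,
// then uses fseek() and rewind() to move the pointer by hand.
function traceFilePointer(string $path): array
{
    $fp = fopen($path, 'rb');
    $positions = [];
    $positions[] = ftell($fp);      // 0: nothing read yet
    fgets($fp);                     // read line 1
    $positions[] = ftell($fp);      // now at the first byte of line 2
    fgets($fp);                     // read line 2
    $positions[] = ftell($fp);      // now at the first byte of line 3
    fseek($fp, $positions[1]);      // jump back to the start of line 2
    $secondLine = fgets($fp);       // fgets() simply reads from the pointer
    rewind($fp);
    $positions[] = ftell($fp);      // back to 0
    fclose($fp);
    return ['positions' => $positions, 'secondLine' => $secondLine];
}
```

The recorded positions are byte counts, not line numbers: each one grows by the length of the line just read, including its newline.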
Files are just a stream of data, read from the beginning to the end. The OS will remember the position you've read so far in that file. If needed, doing so in the application as well is fairly simple. The OS only cares about byte positions though, not lines.
Just imagine dealing out a deck of 52 cards sequentially. You hand off the first card. Next time, the second card. When you want to give out the third card, you don't need to start counting from the start again, or even remember where you were; you just hand out the next available card, and that will be the third.
Reading lines might take a bit more work, since you'd want to buffer data read from the actual file for performance's sake, but there is not much more to it than recording the offset of the last piece of data you handed out, finding the next newline character, and handing off all the data between those two points.
Neither PHP nor the OS has any real need to keep the line number around, since all the system cares about is the "next line". If you want to know the line number, you keep a counter and increment it every time your app reads a line.
$lineno=0;
while (!feof($handle)) {
$buffer = fgets($handle, 4096);
$lineno++; // keep track of the line number
...
}
I have this old sample; I hope it can help you :)
$File = file('path');
$array = array();
$linenr = 5;
foreach( $File AS $line_num => $line )
{
$array[] = $line; // array_push() returns the new count, so do not assign its result back
}
echo $array[($linenr-1)];
You could just call fgets and increment a var $line_number each time you call it. That would tell you the line it is on.