In a text file, lines are detected by the \n at the end of each line, so to find a given line you have to read through the file, and this is a big problem for large files (say 2 GB). I am looking for a method to read a single line without walking through the entire file (though I know it would be a complicated process).
The first way I know of is to use fseek() with an offset, but that is not practical.
Another is creating a flat key/value file, but I am not sure there is a way to avoid loading the entire file into RAM (it would need to work something like reading an array in PHP).
Alternatively, could I put a number at the beginning of each line? I mean, is it possible to read just the digits at the start of each line while skipping the rest of the line's contents (jumping straight to the next line)?
768| line content is here
769| another line
770| something
If only the leading digits are read, the total amount of data to read is small even for large files.
Do you need to read specific lines that can be indexed by line number? If so, just do a binary search. Read (say) 200 characters in the middle of the file to find out a line number, then repeat in one half or the other until you get to the right line.
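In PHP, a sketch of that search could look like the following (it assumes every line starts with its number and a '|', as in the example above, with the numbers in ascending order; findLine() is just an illustrative name):

<?php
// Binary search over byte offsets: seek to the middle, skip the
// (possibly partial) line we landed in, read the next full line,
// and compare its leading number against the target.
function findLine(string $path, int $target): ?string
{
    $fp = fopen($path, 'r');
    $lo = 0;
    $hi = filesize($path);

    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        fseek($fp, $mid);
        fgets($fp);                       // skip the partial line at $mid
        $line = fgets($fp);               // first complete line after $mid
        if ($line === false) {            // ran off the end of the file
            $hi = $mid;
            continue;
        }
        $num = (int) strtok($line, '|');  // the "768" of "768| ..."
        if ($num === $target) {
            fclose($fp);
            return rtrim($line, "\n");
        }
        if ($num < $target) {
            $lo = ftell($fp);             // target lies after this line
        } else {
            $hi = $mid;                   // target lies at or before $mid
        }
    }

    // The interval can converge on the start of the target line without
    // the loop ever reading it (e.g. the very first line), so probe $lo.
    fseek($fp, $lo);
    $line = fgets($fp);
    fclose($fp);
    if ($line !== false && (int) strtok($line, '|') === $target) {
        return rtrim($line, "\n");
    }
    return null;
}

Each lookup reads only a handful of short segments, so even a 2 GB file takes roughly log2(n) seeks instead of a full scan.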
I think there is no simple way to do what you want. The records have variable length, and no length can be determined in advance, right?
If the file is always the same (or at least not modified frequently), I'd put it in a database, or at least create an index file (record number: offset) and use that with fseek().
Alternatively, you can index your text file once and then carry out your daily operation of picking up single lines based on that index file. Indexing a text file is no different from indexing a CSV or other variable-record file.
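A rough sketch of that index-file idea, using fixed-width 8-byte offsets so that neither the data file nor the index ever has to be loaded into RAM (the function names are mine):

<?php
// Build a companion index where entry N holds the byte offset of line N.
function buildIndex(string $dataPath, string $indexPath): void
{
    $data  = fopen($dataPath, 'r');
    $index = fopen($indexPath, 'w');
    while (true) {
        $offset = ftell($data);             // where the next line starts
        if (fgets($data) === false) {
            break;
        }
        fwrite($index, pack('J', $offset)); // 'J' = unsigned 64-bit int
    }
    fclose($data);
    fclose($index);
}

// Fetch line N with two small reads: one in the index, one in the data.
function fetchLine(string $dataPath, string $indexPath, int $lineNo): ?string
{
    $index = fopen($indexPath, 'r');
    fseek($index, ($lineNo - 1) * 8);       // lines numbered from 1
    $entry = fread($index, 8);
    fclose($index);
    if ($entry === false || strlen($entry) < 8) {
        return null;                        // line number out of range
    }
    $data = fopen($dataPath, 'r');
    fseek($data, unpack('J', $entry)[1]);
    $line = fgets($data);
    fclose($data);
    return $line === false ? null : rtrim($line, "\n");
}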
Related
There are a lot of similar scenarios (replacing text in a file, reading specific lines, etc.), but I have not found a good solution to what I want to do:
Messages (strings) are normally sent to a queue. If the server that handles the queue is down, the messages are saved to a file, one message per line.
When the server is up again I want to start sending the messages to it. The file of messages could be "big", so I do not want to read the entire file into memory. I also only want to send each message once, so the file needs to reflect whether a message has been sent (in other words: don't read 100 lines and then have PHP time out after 95, so that the next run repeats the same work).
What I basically need is to read one line from a big text file and then delete that line when it has been processed by my script, without constantly reading/writing the whole file.
I have seen different solutions (fread, SplFileObject, etc.) that can read a line from a file without reading the entire file into memory, but I have not seen a good way to delete the line that was just read without going through the entire file and saving it again.
I'm guessing that it can be done, since all that needs to happen is to remove x bytes from the beginning or the end of the file, depending on where you read the lines from.
To be clear: I do not think it's a good solution to read the first line from the file, use it, and then read all the other lines just to write them to a tmp file and from there back to the original file. Reading/writing 100,000 lines just to get one line.
The problem can be fixed in other ways, like creating a number of smaller files so they can be read/written without too much of a performance problem, but I would like to know if anyone has a solution to the exact problem.
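Something like the following is what I have in mind for the read-from-the-end case (a rough sketch that only works if messages may be handled newest-first; popLastLine() is a name I made up, and it walks backwards one byte at a time for simplicity):

<?php
// Treat the file as a stack: read the last line, then ftruncate()
// the file so that line is gone. No rewriting of earlier lines.
function popLastLine(string $path): ?string
{
    $fp = fopen($path, 'r+');
    if ($fp === false) {
        return null;
    }
    flock($fp, LOCK_EX);              // guard against concurrent writers

    $pos  = fstat($fp)['size'];
    $line = '';
    while ($pos > 0) {
        $pos--;
        fseek($fp, $pos);
        $char = fgetc($fp);
        if ($char === "\n" && $line !== '') {
            $pos++;                   // keep the newline ending the previous line
            break;
        }
        if ($char !== "\n") {
            $line = $char . $line;
        }
    }

    ftruncate($fp, $pos);             // drop the line we just read
    flock($fp, LOCK_UN);
    fclose($fp);
    return $line === '' ? null : $line;
}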
Update:
Since it can't be done, I ended up using SQLite.
I have one text file with 100,000 lines. Does anyone know a way to get one line at random as fast as possible? Thank you very much!
The fastest way would be to build an index (a simple array that contains the position in the file of each new line). Then choose a random key, get the position, fseek() the file to that position, and read the line. This requires updating the index any time you change the file, but if you want to optimize retrieving the data, that's the way.
You can optimize further by splitting the file into ranges (i.e. sharding the data), or by keeping several representations of the file (for example, a copy with the lines in reverse order, so that if your random number falls past the halfway point you read from the reversed file instead).
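A minimal sketch of that approach (the function names and file name are made up for illustration):

<?php
// Scan once to record each line's byte offset, then any random line
// is a single fseek() + fgets() away.
function buildLineIndex(string $path): array
{
    $index  = [];
    $offset = 0;
    $fp = fopen($path, 'r');
    while (fgets($fp) !== false) {
        $index[] = $offset;           // byte offset where this line starts
        $offset  = ftell($fp);
    }
    fclose($fp);
    return $index;
}

function randomLine(string $path, array $index): string
{
    $fp = fopen($path, 'r');
    fseek($fp, $index[array_rand($index)]);
    $line = fgets($fp);
    fclose($fp);
    return rtrim($line, "\n");
}

$index = buildLineIndex('lines.txt'); // build once, reuse for every pick
echo randomLine('lines.txt', $index), "\n";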
Let's say I have a modestly sized text file (~850 KB, 10,000+ lines), and I want to replace a particular line (or several) spread throughout the file.
The methods I know of involve rewriting the whole file. The one I currently use is to read through the entire file line by line, writing to a .tmp file, and once I am done, rename() the tmp file over the original source file.
It works, but it is slow. And of course, as the file grows, so will execution times.
Is there another way (using PHP) to get the job done without having to rewrite the entire file every time a line or two needs to be replaced or removed?
Thanks! I looked around and could not find an answer to this on Stack Overflow.
If the replacement is EXACTLY the same size as the original line, then you can simply fwrite() at that location in the file and all's well. But if it's a different length (shorter OR longer), you will have to rewrite the portion of the file that comes AFTER the replacement.
There is no way around this. Shorter 'new' lines will leave a gap in the file, and longer 'new' lines would overwrite the first part of the next line.
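A minimal sketch of the same-length case (overwriteLine() is an illustrative name; $offset is assumed to be the byte offset where the line starts, e.g. noted with ftell() while scanning for it):

<?php
// Overwrite a line in place; only valid when old and new have equal length.
function overwriteLine(string $path, int $offset, string $old, string $new): bool
{
    if (strlen($new) !== strlen($old)) {
        return false;                 // different length: the tail must be rewritten
    }
    $fp = fopen($path, 'r+');         // read/write without truncating
    fseek($fp, $offset);
    fwrite($fp, $new);                // bytes after the line stay untouched
    fclose($fp);
    return true;
}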
Basically, you're asking if it's possible to insert a piece of wood in the middle of another board without having to move the original board around.
You can't, because of the way files are stored on common filesystems. A file always takes up one or more 'blocks' of disk space, where blocks are, for example, 4096 bytes in size. A file that has one byte of data will still occupy one whole block (consuming 4096 bytes of available disk space), while a file of 4097 bytes will occupy two blocks (taking up 8192 bytes).
If you remove one byte from a file, there would be a gap of one byte inside one of the blocks it occupies, and that is not something a disk can store. You have to shift all the following bytes one position toward the beginning of the file, which affects the current block and every block after it.
The other way around, adding bytes in the middle of a block, shows the same problem: you'll have one or more bytes that don't fit in the 4096 bytes anymore, so they'll have to shift into the next block, and so on, until the end of the file (and all blocks) has been reached.
The only place where you can have non-occupied bytes in a block, is at the end of the last block that forms a file.
I have a large file, 100,000 lines. I can read each line and process it, or I can store the lines in an array then process them. I would prefer to use the array for extra features, but I'm really concerned about the memory usage associated with storing that many lines in an array, and if it's worth it.
There are two functions you should familiarize yourself with.
The first is file(), which reads an entire file into an array, with each line as an array element. This is good for shorter files, and probably isn't what you want to be using on a 100k line file. This function handles its own file management, so you don't need to explicitly open and close the file yourself.
The second is fgets(), which you can use to read a file one line at a time. You can loop for as long as there are more lines to process, and run your line processing inside the loop. You'll need to use fopen() to get a handle on the file; you may also want to track the file pointer yourself for recovery management (i.e. so you won't have to restart processing from scratch if something goes sideways and the script fails).
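A minimal sketch of that loop ('big.txt' is a placeholder path, and processLine() stands in for whatever you do with each line):

<?php
// Stream the file line by line; memory use stays flat regardless of size.
$fp = fopen('big.txt', 'r');
if ($fp === false) {
    die('could not open file');
}
while (($line = fgets($fp)) !== false) {
    processLine(rtrim($line, "\n"));
    // For recovery, persist ftell($fp) after each line and fseek() back
    // to that offset if the script has to be restarted.
}
fclose($fp);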
Hopefully that's enough to get you started.
How about a combination of the two? Read 1,000 lines into an array, process them, clear the array, then read 1,000 more, etc. Monitor memory usage and adjust how many lines you read at a time.
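For illustration, a rough sketch of that batching approach (again 'big.txt' is a placeholder, and processBatch() is a stand-in for your own logic):

<?php
// Buffer 1,000 lines at a time: bounded memory, fewer processing calls.
$batchSize = 1000;
$batch = [];
$fp = fopen('big.txt', 'r');
while (($line = fgets($fp)) !== false) {
    $batch[] = rtrim($line, "\n");
    if (count($batch) === $batchSize) {
        processBatch($batch);
        $batch = [];                  // release the memory before refilling
    }
}
if ($batch !== []) {
    processBatch($batch);             // whatever remains in the final batch
}
fclose($fp);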
There is an array of numbers, divided into partitions containing the same number of elements (as output by array_chunk()). They are written into separate files: file 1.txt contains the first chunk, 2.txt the second, and so on. Now I want these files to contain a different number of elements of the initial array. Of course, I could read them all into one array and split it again, but that requires quite a large amount of memory. Could you please help me with a more efficient solution? (The number of files and the size of the last one are stored separately.) I have no other ideas...
Do you know what the different number is? If you do, then you can easily read the data in and write a chunk out whenever the buffer fills. In PHP, roughly (with $numFiles and $newSize standing in for the two values you said are stored separately):

$buffer = [];
$out = 1;
for ($i = 1; $i <= $numFiles; $i++) {                    // originals: 1.txt, 2.txt, ...
    foreach (file("$i.txt", FILE_IGNORE_NEW_LINES) as $record) {
        $buffer[] = $record;                             // one old chunk at a time fits in memory
        if (count($buffer) === $newSize) {
            file_put_contents("new/$out.txt", implode("\n", $buffer) . "\n");
            $out++;
            $buffer = [];
        }
    }
}
if ($buffer) file_put_contents("new/$out.txt", implode("\n", $buffer) . "\n"); // final partial chunk
Obviously you'll need to keep the new files separate from the old ones. Then, once you've rewritten the data, you can swap them out somehow. (I would personally suggest writing into a second directory, then renaming the directories once you're done.)
If you don't know what the size of your chunks should be (for instance, you want a specific number of files), then first do whatever work is needed to figure that out, then proceed with the solution above.