I have a large file, 100,000 lines. I can read each line and process it, or I can store the lines in an array and then process them. I would prefer to use the array for extra features, but I'm really concerned about the memory usage associated with storing that many lines in an array, and whether it's worth it.
There are two functions you should familiarize yourself with.
The first is file(), which reads an entire file into an array, with each line as an array element. This is good for shorter files, and probably isn't what you want to be using on a 100k line file. This function handles its own file management, so you don't need to explicitly open and close the file yourself.
The second is fgets(), which you can use to read a file one line at a time. You can use this to loop for as long as there are more lines to process, and run your line processing inside the loop. You'll need to use fopen() to get a handle on the file; you may also want to track the file pointer yourself for recovery management (i.e. so you won't have to restart processing from scratch if something goes sideways and the script fails), etc.
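For example, a bare-bones fgets() loop could look like this (bigfile.txt and process_line() are placeholders for your own file and processing code):

$fp = fopen('bigfile.txt', 'r');          // placeholder file name
if ($fp === false) {
    die('Could not open file');
}

while (($line = fgets($fp)) !== false) {
    process_line(rtrim($line, "\r\n"));   // process_line() is your own code
    // ftell($fp) gives the current byte offset, which you could persist
    // somewhere if you want to resume after a failure instead of restarting.
}

fclose($fp);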
Hopefully that's enough to get you started.
How about a combination of the two? Read 1000 lines into an array, process it, delete the array, then read 1000 more, etc. Monitor memory usage and adjust how many you read into an array at a time.
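A rough sketch of that batching idea (the batch size, file name and process_batch() are placeholders):

$fp = fopen('bigfile.txt', 'r');
$batch = [];

while (($line = fgets($fp)) !== false) {
    $batch[] = rtrim($line, "\r\n");

    if (count($batch) >= 1000) {
        process_batch($batch);             // your own processing code
        $batch = [];                       // free the memory before the next batch
        // echo memory_get_usage() . "\n"; // monitor this and tune the batch size
    }
}

if ($batch) {
    process_batch($batch);                 // whatever is left at the end of the file
}
fclose($fp);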
tl;dr: I need a way of splitting 5 GB / ~11m row files roughly in half (or thirds), while keeping track of exactly which files I create and, of course, not breaking any lines, so I can process both files at once.
I have a set of 300 very large JSON-like files I need to parse with a PHP script periodically. Each file is about 5 GB decompressed. I've optimized the hell out of the parsing script and it's reached its speed limit. But it's still a single-threaded script running for about 20 hours on a 16-core server.
I'd like to split each file into approximately half, and have two parsing scripts run at once, to "fake" multi-threadedness and speed up the run time. I can store global runtime information and "messages" between threads in my SQL database. That should cut the total runtime in half, with one thread downloading the files, another decompressing them, and two more loading them into SQL in parallel.
That part is actually pretty straightforward; where I'm stuck is splitting up the file to be parsed. I know there is a split tool that can break down files into chunks based on KB or line count. The problem is that doesn't quite work for me. I need to split these files in half (or thirds or quarters) cleanly, without having any excess data go into an extra file. I need to know exactly what files the split command has created, so I can note each file in my SQL table and the parsing script can know which files are ready to be parsed. If possible, I'd even like to avoid running wc -l in this process. That may not be possible, but at about 7 seconds per file, 300 files means 35 extra minutes of runtime.
Despite what I just said, I guess I could run wc -l on my file, divide that by n, round the result up, and use split to break the file into chunks of that many lines. That should always give me exactly n files. Then I can just know that I'll have filea, fileb and so on.
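Roughly what I have in mind, driven from PHP (the path, chunk prefix and number of pieces below are just example values, and the SQL insert is only indicated in a comment):

$file   = '/data/dump.json';                       // example path
$pieces = 2;

$total  = (int) shell_exec('wc -l < ' . escapeshellarg($file));
$lines  = (int) ceil($total / $pieces);            // lines per chunk, rounded up

$prefix = $file . '.chunk.';
shell_exec(sprintf(
    'split -l %d %s %s',
    $lines,
    escapeshellarg($file),
    escapeshellarg($prefix)
));

// split names its output predictably (chunk.aa, chunk.ab, and so on), so the
// resulting files can be listed and recorded in the SQL table that tracks
// which files are ready to be parsed.
foreach (glob($prefix . '*') as $chunk) {
    // INSERT INTO chunks (path, status) VALUES (:chunk, 'ready');
    echo $chunk, "\n";
}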
I guess the question ultimately is: is there a better way to deal with this problem? Maybe there's another utility that will split in a way that's more compatible with what I'm doing. Or maybe there's another approach entirely that I'm overlooking.
I had the same problem and it wasn't easy to find a solution.
First you need to use jq to convert your JSON to string format.
Then use the GNU version of split; it has an extra --filter option which allows processing individual chunks of data in much less space, as it does not need to create any temporary files:
split --filter='shell_command'
Your filter command should read from stdin:
jq -r '' file.json | split -l 10000 --filter='php process.php'
-l will tell split to work on 10000 lines at a time.
In the process.php file you just need to read from stdin and do whatever you want.
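A minimal process.php along those lines (the actual parsing and SQL loading is left as a comment):

<?php
// process.php - reads one chunk of up to 10000 lines from stdin, as handed
// over by split --filter='php process.php'
$in = fopen('php://stdin', 'r');

while (($line = fgets($in)) !== false) {
    $line = rtrim($line, "\r\n");
    // ... parse the line and load it into SQL here ...
}

fclose($in);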
I tried to load a 16 MB file into a PHP array.
It ends up with about 63 MB of memory usage.
Loading it into a string consumes just the 16 MB, but the issue is that I need it inside an array to access it faster afterwards.
The file consists of about 750k lines (routing table dump).
I probably should load it into a MySQL database, but the issue there is that there isn't enough memory to run it, so I chose rqlite: https://github.com/rqlite/rqlite, since I also need the replication features.
I am not sure if a SQLite database is fast enough for that.
Does anyone have an idea for this issue?
You can get the actual file here: http://data.caida.org/datasets/routing/routeviews-prefix2as/2018/07/routeviews-rv2-20180715-1400.pfx2as.gz
The code I used:
$data = file('routeviews-rv2-20180715-1400.pfx2as');
var_dump(memory_get_usage());
Thanks.
You may use the PHP fread() function. It reads data in fixed-size blocks and can be used inside a loop to work through the file block by block. It does not consume much memory and is suitable for reading large files.
If you want to sort the data, then you may want to use a database. You can read the data from the large file one chunk at a time using fread() and then insert it into the database.
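A sketch of that approach, assuming the pfx2as file is tab-separated (prefix, length, ASN) and that an SQLite table prefix2as(prefix, length, asn) already exists; the database file name and table layout are assumptions:

$pdo  = new PDO('sqlite:routes.db');               // routes.db is a placeholder
$stmt = $pdo->prepare('INSERT INTO prefix2as (prefix, length, asn) VALUES (?, ?, ?)');

$fp = fopen('routeviews-rv2-20180715-1400.pfx2as', 'rb');
$buffer = '';

$pdo->beginTransaction();                          // one transaction keeps the inserts fast
while (!feof($fp)) {
    $buffer .= fread($fp, 8192);                   // fixed-size block, memory stays bounded

    // Hand over only the complete lines; keep any trailing partial line.
    while (($nl = strpos($buffer, "\n")) !== false) {
        $line   = substr($buffer, 0, $nl);
        $buffer = substr($buffer, $nl + 1);

        $fields = explode("\t", $line);
        if (count($fields) === 3) {
            $stmt->execute($fields);
        }
    }
}
$pdo->commit();
fclose($fp);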
There are a lot of different scenarios that are similar (replace text in file, read specific lines etc) but I have not found a good solution to what I want to do:
Messages (strings) are normally sent to a queue. If the server that handles the queue is down the messages are saved to a file. One message per line.
When the server is up again I want to start sending the messages to the server. The file with messages could be "big", so I do not want to read the entire file into memory. I also only want to send each message once, so the file needs to reflect whether a message has been sent (in other words: don't fetch 100 lines and then have PHP time out after 95, so that the next time the same thing happens again).
What I basically need is to read one line from a big text file and then delete that line when it has been processed by my script, without constantly reading/writing the whole file.
I have seen different solutions (fread, SplFileObject etc) that can read a line from a file without reading the entire file (into memory) but I have not seen a good way to delete the line that was just read without going through the entire file and saving it again.
I'm guessing that it can be done, since all that needs to happen is to remove x bytes from the beginning or the end of the file, depending on where you read the lines from.
To be clear: I do not think it's a good solution to read the first line from the file, use it, and then read all the other lines just to write them to a tmp-file and then from there to the original file. Read/write 100000 lines just to get one line.
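For illustration, the "remove x bytes from the end" idea could be sketched like this, treating the file as a stack and popping the last line with ftruncate(). Note that it hands messages back in reverse order, and pop_last_line() and queue.txt are made-up names:

function pop_last_line(string $path): ?string
{
    $fp = fopen($path, 'r+');
    if ($fp === false) {
        return null;
    }
    $size = fstat($fp)['size'];
    if ($size === 0) {
        fclose($fp);
        return null;
    }

    // Walk backwards until the newline that precedes the last line.
    $pos  = $size - 1;
    $line = '';
    while ($pos >= 0) {
        fseek($fp, $pos);
        $char = fgetc($fp);
        if ($char === "\n" && $pos !== $size - 1) {
            break;                         // found the start of the last line
        }
        if ($char !== "\n") {
            $line = $char . $line;
        }
        $pos--;
    }

    ftruncate($fp, max(0, $pos + 1));      // drop the line that was just read
    fclose($fp);

    return $line === '' ? null : $line;
}

// Usage: pop one message, send it, repeat until the file is empty.
// while (($msg = pop_last_line('queue.txt')) !== null) { send($msg); }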
The problem can be fixed in other ways, like creating a number of smaller files so they can be read/written without too many performance problems, but I would like to know if anyone has a solution to the exact problem.
Update:
Since it can't be done, I ended up using SQLite.
We get a product list from our suppliers delivered to our site by ftp. I need to create a script that searches through that file (tab delimited) for the products relevant to us and use the information to update stock levels, prices etc.
The file itself is something like 38,000 lines long and I'm wondering about the best way of handling this.
The only way I can think of initially is using fopen and fgetcsv, then cycling through each line, putting the line into an array and looking for the relevant product code.
I'm hoping there is a much more efficient way (though I haven't tested the efficiency of this yet)
The file I'll be reading is 8.8 MB.
All of this will need to be done automatically, e.g. by CRON on a daily basis.
Edit - more information.
I have run my first trial, and based on the 2 answers, I have the following code:
The items I need to pick out of the text file are loaded from the database into the array $items, with $items[$row['item_id']] = $row['prod_code'];
$catalogue = file('catalogue.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($catalogue as $line)
{
    $prod = explode("\t", $line); // the catalogue is tab delimited
    if (in_array($prod[0], $items))
    {
        echo $prod[0]."<br>"; // will be updating the stock level in the db eventually
    }
}
Though this is not giving the correct output currently
I used to do a similar thing with Dominos Pizza clocking in daily data (all UK).
Either load it all into a database in one go.
OR
Use fopen and load a line at a time into a database, keeping memory overheads low. (I had to use this method as the data wasn't formatted very well)
You can then query the database at your leisure.
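A rough sketch of that line-at-a-time loading, assuming a tab-delimited catalogue with the product code in the first column followed by stock and price; the file name, credentials, column order and table layout are all assumptions:

$pdo  = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');   // placeholder credentials
$stmt = $pdo->prepare('UPDATE products SET stock = ?, price = ? WHERE prod_code = ?');

$fp = fopen('catalogue.txt', 'r');

while (($row = fgetcsv($fp, 0, "\t")) !== false) {
    if (count($row) < 3) {
        continue;                          // skip malformed lines
    }
    [$code, $stock, $price] = $row;        // assumed column order
    $stmt->execute([$stock, $price, $code]);
}

fclose($fp);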
What do you mean by »I hope there is a more efficient way«? Efficient with respect to what? Writing the code? CPU consumption while executing the code? Disk I/O? Memory consumption?
Holding ~9 MB of text in memory is not a problem (unless you've got a very low memory limit). A file() call would read the entire file and return an array (split by lines). This or file_get_contents() will be the most efficient approach with respect to disk I/O, but consumes a lot more memory than necessary.
Putting the line into an array and looking for the relevant product code.
I'm not sure why you would need to cache the contents of that file in an array. But if you do, remember that the array will use slightly more memory than the ~9MB of text. So you'd probably want to read the file sequentially, to avoid having the same data in memory twice.
Depending on what you want to do with the data, loading it into a database might be a viable solution as well, as @user1487944 already pointed out.