We get a product list from our suppliers, delivered to our site by FTP. I need to create a script that searches through that file (tab-delimited) for the products relevant to us and uses the information to update stock levels, prices, etc.
The file itself is something like 38,000 lines long, and I'm wondering about the best way of handling this.
The only way I can think of initially is using fopen and fgetcsv, then cycling through each line, putting the line into an array and looking for the relevant product code.
I'm hoping there is a much more efficient way (though I haven't tested the efficiency of this yet).
The file I'll be reading is 8.8 MB.
All of this will need to be done automatically, e.g. by CRON on a daily basis.
Edit - more information.
I have run my first trial, and based on the 2 answers, I have the following code:
The items I need to pick out of the text file come from the database and are stored in an array with $items[$row['item_id']] = $row['prod_code'];
$catalogue = file('catalogue.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($catalogue as $line)
{
    $prod = explode("\t", $line); // the supplier file is tab-delimited
    if (in_array($prod[0], $items))
    {
        echo $prod[0]."<br>"; // will be updating the stock level in the db eventually
    }
}
This was not giving the correct output at first, though.
I used to do a similar thing with Dominos Pizza clocking in daily data (all UK).
Either load it all into a database in one go.
OR
Use fopen and load a line at a time into a database, keeping memory overheads low. (I had to use this method as the data wasn't formatted very well)
You can then query the database at your leisure.
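A minimal sketch of the second approach, assuming a PDO connection and a staging table called catalogue_import (both made up for the example; the column order is also a guess):
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass'); // placeholder credentials
$insert = $pdo->prepare('INSERT INTO catalogue_import (prod_code, price, stock) VALUES (?, ?, ?)');

$handle = fopen('catalogue.txt', 'r');
while (($line = fgets($handle)) !== false) { // one line at a time keeps memory flat
    $fields = explode("\t", rtrim($line, "\r\n")); // the supplier file is tab-delimited
    $insert->execute([$fields[0], $fields[1], $fields[2]]); // column positions are an assumption
}
fclose($handle);
Once the data is in a table, matching it against your own product codes becomes a simple JOIN or UPDATE query.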
What do you mean by »I hope there is a more efficient way«? Efficient with respect to what? Writing the code? CPU consumption while executing the code? Disk I/O? Memory consumption?
Holding ~9 MB of text in memory is not a problem (unless you've got a very low memory limit). A file() call would read the entire file and return an array (split by lines). This or file_get_contents() will be the most efficient approach with respect to disk I/O, but both consume a lot more memory than necessary.
»Putting the line into an array and looking for the relevant product code.«
I'm not sure why you would need to cache the contents of that file in an array. But if you do, remember that the array will use slightly more memory than the ~9MB of text. So you'd probably want to read the file sequentially, to avoid having the same data in memory twice.
Depending on what you want to do with the data, loading it into a database might be a viable solution as well, as #user1487944 already pointed out.
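If you stay in PHP rather than a database, a rough sketch of the sequential approach could look like the following (it assumes the tab-delimited file and the $items array from the question; array_flip() is only there so each lookup is a constant-time isset() instead of an in_array() scan):
$lookup = array_flip($items); // prod_code => item_id

$handle = fopen('catalogue.txt', 'r');
while (($prod = fgetcsv($handle, 0, "\t")) !== false) { // read one tab-delimited line at a time
    if (isset($lookup[$prod[0]])) {
        // update the stock level / price for item $lookup[$prod[0]] here
    }
}
fclose($handle);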
Related
I tried to load a 16 MB file into a PHP array.
It ends up with about 63 MB of memory usage.
Loading it into a string just consumes the 16 MB, but the issue is that I need it in an array to access it faster afterwards.
The file consists of about 750k lines (a routing table dump).
I probably should load it into a MySQL database, but the issue there is that there isn't enough memory to run it, so I chose rqlite (https://github.com/rqlite/rqlite), since I also need the replication features.
I am not sure if a SQLite database is fast enough for that.
Does anyone have an idea for this issue?
You can get the actual file here: http://data.caida.org/datasets/routing/routeviews-prefix2as/2018/07/routeviews-rv2-20180715-1400.pfx2as.gz
The code I used:
$data = file('routeviews-rv2-20180715-1400.pfx2as');
var_dump(memory_get_usage());
Thanks.
You may use the PHP fread() function. It allows reading a fixed amount of data and can be used inside a loop to read fixed-size blocks. It does not consume much memory and is suitable for reading large files.
If you want to sort the data, then you may want to use a database. You can read the large file one line at a time (fgets() reads by line, whereas fread() reads by byte count) and insert each line into the database.
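A minimal sketch of the block-reading loop described above (the 8192-byte block size is an arbitrary choice):
$handle = fopen('routeviews-rv2-20180715-1400.pfx2as', 'r');
while (!feof($handle)) {
    $block = fread($handle, 8192); // read one fixed-size block
    // process $block here; note that a line may be split across two blocks
}
fclose($handle);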
I have a large file, 100,000 lines. I can read each line and process it, or I can store the lines in an array then process them. I would prefer to use the array for extra features, but I'm really concerned about the memory usage associated with storing that many lines in an array, and if it's worth it.
There are two functions you should familiarize yourself with.
The first is file(), which reads an entire file into an array, with each line as an array element. This is good for shorter files, and probably isn't what you want to be using on a 100k line file. This function handles its own file management, so you don't need to explicitly open and close the file yourself.
The second is fgets(), which you can use to read a file one line at a time. You can use this to loop for as long as there are more lines to process, and run your line processing inside the loop. You'll need to use fopen() to get a handle on the file; you may also want to track the file pointer yourself for recovery management (i.e. so you won't have to restart processing from scratch if something goes sideways and the script fails), etc.
Hopefully that's enough to get you started.
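A rough sketch of the fgets() approach, using ftell() to record a resume position (the file and checkpoint names are placeholders, and in practice you would checkpoint less often than every line):
$handle = fopen('data.txt', 'r'); // placeholder name for the 100k-line file
while (($line = fgets($handle)) !== false) {
    // process $line here
    file_put_contents('checkpoint.txt', ftell($handle)); // byte offset to resume from if the script dies
}
fclose($handle);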
How about a combination of the two? Read 1000 lines into an array, process it, delete the array, then read 1000 more, etc. Monitor memory usage and adjust how many you read into an array at a time.
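A hedged sketch of that batching idea (the batch size of 1000 and the processBatch() function are assumptions):
$handle = fopen('data.txt', 'r');
$batch = [];
while (($line = fgets($handle)) !== false) {
    $batch[] = $line;
    if (count($batch) >= 1000) { // tune this while watching memory_get_usage()
        processBatch($batch);    // hypothetical function that processes one chunk of lines
        $batch = [];             // free the chunk before reading the next one
    }
}
if ($batch) {
    processBatch($batch);        // handle the final partial batch
}
fclose($handle);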
I am building a website where the basic premise is that there are two files: index.php and file.txt.
file.txt currently has 10 MB of data; this could potentially grow to 500 MB. The idea of the site is that people go to index.php and can then seek to any position of the file. Another feature is that they can read up to 10 KB of data from the point of seeking. So:
index.php?pos=432 will get the byte at position 432 in the file.
index.php?pos=555&len=5000 will get 5 KB of data from the file, starting at position 555.
Now, imagine the site getting thousands of hits a day.
I currently use fseek and fread to serve the data. Is there any faster way of doing this? Or is my usage too low to consider advanced optimizations such as caching the results of each request or loading the file into memory and reading it from there?
Thousands of hits per day, that's like one every few seconds? That's definitely too low to need optimizing at this point, so just use fseek and fread if that's what's easiest for you.
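For reference, a minimal sketch of what that fseek()/fread() handler might look like (the 10 KB cap and the parameter validation are assumptions based on the description above):
// index.php: serve a slice of file.txt
$pos = isset($_GET['pos']) ? max(0, (int)$_GET['pos']) : 0;
$len = isset($_GET['len']) ? (int)$_GET['len'] : 1;
$len = min(max($len, 1), 10240); // cap reads at 10 KB, per the requirements

$handle = fopen('file.txt', 'rb');
fseek($handle, $pos);            // jump straight to the requested offset
echo fread($handle, $len);       // send the requested slice to the client
fclose($handle);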
If it is crucial for you to keep all the data in a file, I would suggest splitting your file into a set of smaller files.
So, for example, you could decide that a file should be no more than 1 MB. That means splitting your file.txt into 10 separate files: file-1.txt, file-2.txt, file-3.txt and so on...
When you process a request, you will need to determine which file to pick up by dividing the pos argument by the chunk size, and then show the appropriate amount of data. In this case the fseek function will perhaps work faster...
But either way you will have to stick with the fopen and fseek functions.
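A hedged sketch of how the chunk lookup could work with 1 MB files (a read that crosses a chunk boundary would need a second fread(), which is left out here):
$pos = (int)($_GET['pos'] ?? 0);
$len = min((int)($_GET['len'] ?? 1), 10240);

$chunkSize = 1024 * 1024;                  // 1 MB per file, as in the example above
$chunk     = intdiv($pos, $chunkSize) + 1; // which file-N.txt holds this position
$offset    = $pos % $chunkSize;            // offset within that file

$handle = fopen("file-$chunk.txt", 'rb');
fseek($handle, $offset);
echo fread($handle, $len);
fclose($handle);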
Edit: now that I consider it, so long as you're using fseek() to go to a byte offset and then using fread() to get a certain number of bytes, it shouldn't be a problem. For some reason I read your question as serving X number of lines from a file, which would be truly terrible.
The problem is that you are absolutely hammering the disk with I/O operations, and you're not just causing performance issues with this one file/script, you're causing performance issues with anything that needs that disk: other users, the OS, etc. If you're on shared hosting, I guarantee that one of the sysadmins is trying to figure out who you are so they can turn you off. [I would be.]
You need to find a way to either:
Offload this to memory, e.g. set up a daemon on the server that loads the file into memory and serves chunks on request.
Offload this to something more efficient, like MySQL.
You're already serving the data in sequential chunks, e.g. lines 466 to 476; it will be much faster to retrieve the data from a table like:
CREATE TABLE mydata (
    line INTEGER NOT NULL AUTO_INCREMENT,
    data VARCHAR(2048),
    PRIMARY KEY (line)
);
by:
SELECT data FROM mydata WHERE line BETWEEN 466 AND 476;
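A minimal sketch of querying that table from PHP (the PDO connection details are placeholders):
$pdo  = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass'); // placeholder credentials
$stmt = $pdo->prepare('SELECT data FROM mydata WHERE line BETWEEN ? AND ?');
$stmt->execute([466, 476]); // the range from the example above

foreach ($stmt as $row) {
    echo $row['data'], "\n"; // stream each line back to the client
}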
If the file never changes, and is truly limited in maximum size, I would simply mount a ramdisk, and have a boot script which copies the file from permanent storage to RAM storage.
This probably requires hosting the site on Linux, if you aren't already.
This would allow you to guarantee that the file segments are served from memory, without relying on the OS filesystem cache.
I have a PHP script that exports a MySQL database to CSV format. The strange CSV format is a requirement for a 3rd party software application. It worked initially, but I am now running out of memory at line 743. My hosting service has a limit of 66MB.
My complete code is listed here: http://pastebin.com/fi049z4n
Are there alternatives to using array_splice? Can anyone please assist me in reducing memory usage. What functions can be changed? How do I free up memory?
You have to change your CSV creation strategy. Instead of reading the whole result from the database into an array in memory, you must work sequentially:
Read a row from the database
Write that row into the CSV file (or maybe buffer)
Free the row from memory
Do this in a loop...
This way the memory consumption of your script does not rise proportionally with the number of rows in the result, but stays (more or less) constant.
This might serve as an untested example to sketch the idea:
while (FALSE !== ($row = mysql_fetch_assoc($res))) {
    fwrite($csv_file, implode(',', $row) . "\n");
}
Increase the memory limit and maximum execution time
For example :
// Change the values to suit you
ini_set("memory_limit","15000M");
ini_set("max-execution_time", "5000");
We've run into a bit of a weird issue. It goes like this: we have large reams of data that we need to output to the client. These data files cannot be pre-built; they must be served from live data.
My preferred solution has been to write into the CSV line by line from a fetch like this:
while($datum = $data->fetch(PDO::FETCH_ASSOC)) {
$size += fputcsv($outstream, $datum, chr(9), chr(0));
}
This got around a lot of ridiculous memory usage (reading 100,000 records into memory at once is bad mojo), but we still have lingering issues for large tables that are only going to get worse as the data increases in size. And please note, there is no data partitioning; they don't download in year segments, they download all of their data and then segment it themselves. This is per the requirements; I'm not in a position to change this, much as it would remove the problem entirely.
In either case, it runs out of memory on the largest table. One solution is to increase the memory available, which solves one problem but risks creating server load problems later on, or even now if more than one client is downloading.
In this case, $outstream is:
$outstream = fopen("php://output",'w');
This is pretty obviously not a physical disk location. I don't know much about php://output in terms of where the data resides before it is sent to the client, but it seems obvious that there are memory issues with simply writing a large database table to CSV via this method.
To be exact, the staging box allows about 128 MB for PHP, and this call in particular was short by about 40 MB (it tried to allocate 40 MB more). This seems a bit odd as behaviour goes, since you would expect it to ask for memory in smaller parts.
Anyone know what can be done to get a handle on this?
So it looks like the memory consumption is caused by Zend Framework's output buffering. The best solution that I came up with was calling ob_end_clean() right before we start to stream the file to the client. This particular instance of ZF is not going to produce any normal output or do anything more after this point, so complications don't arise. The one odd thing that does happen (perhaps from the standpoint of the user) is that they really do get the file streamed to them.
Here's the code:
ob_end_clean();
while($datum = $data->fetch(PDO::FETCH_ASSOC)) {
$size += fputcsv($outstream, $datum, chr(9), chr(0));
}
Memory usage (according to the function memory_get_peak_usage(true) suggested in a ZF forum post somewhere) went from 90 megabytes down to 9 megabytes, which is what it was using here on my development box prior to any file reading.
Thanks for the help, guys!