I have a CSV file whose records are sorted on the first field. I managed to write a function that does a binary search through that file, using fseek for random access.
However, this is still a pretty slow process, since when I seek to some file position I actually need to scan left, looking for a \n character, so I can make sure I'm reading a whole line (once a whole line is read, I can check its first field against the value mentioned above).
Here is the function that returns a line that contains character at position x:
function fgetLineContaining( $fh, $x ) {
    if( $x > 125145411 ) // 125145411 is the last pos in my file
        return "";
    $c = "";
    // now go as much left as possible, until newline is found
    // or beginning of the file
    while( $x > 0 && $c != "\n" && $c != "\r") {
        fseek($fh, $x);
        $x--; // go left in the file
        $c = fgetc( $fh );
    }
    $x += 2; // skip newline char
    fseek( $fh, $x );
    return fgets( $fh, 1024 ); // return the line from the beginning until \n
}
While this is working as expected, I have to say that my CSV file has ~1.5 million lines, and these left-seeks are slowing things down quite a bit.
Is there a better way to seek a line containing position x inside a file?
Also, it would be much better if an object of a class could be saved to a file without serializing it, thus enabling the file to be read object by object. Does PHP support that?
Thanks
I think you really should consider using SQLite or MySQL again (like others have suggested in the comments). Most of the suggestions about pre-calculating indexes are already implemented "properly" in these SQL engines.
You said the speed wasn't good enough in SQL. Did you have the fields indexed properly? How were you querying the data? Were you using bulk queries? Were you using prepared statements? Did the SQL process have enough RAM to keep its indexes in memory?
One thing you could try to speed up the current algorithm is to load the (~100MB?) file onto a RAM disk. No matter which you choose, CSV or SQLite, this WILL help speed things up, especially if hard drive seek time is your bottleneck.
You could possibly even read the whole file into PHP arrays (assuming your computer has enough RAM for that). That would allow you to do your search via index ($big_array[$offset]) lookups.
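For illustration, a rough sketch of that in-memory idea, assuming the file fits in RAM and is sorted lexicographically on its first field ('data.csv' and findLine() are just placeholder names):

$lines = file('data.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

function findLine($lines, $needle) {
    $lo = 0;
    $hi = count($lines) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        $key = explode(',', $lines[$mid], 2)[0]; // first CSV field
        if ($key === $needle) {
            return $lines[$mid];
        } elseif ($key < $needle) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return null; // not found
}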
Also, one thing to keep in mind: PHP isn't exactly fast at low-level work. You might want to consider moving away from PHP in favor of C or C++.
Related
In PHP, I use fopen( ), fgets( ), and fclose( ) to read a file line by line. It works well. But I have a script (being run from the CLI) that has to process three hundred 5GB text files. That's approximately 3 billion fgets( ). So it works well enough but at this scale, tiny speed savings will add up extremely fast. So I'm wondering if there are any tricks to speed up the process?
The only potential thing I thought of was getting fgets( ) to read more than one line at once. It doesn't look like it supports that, but I could in theory do, let's say, 20 consecutive $line[] = fgets($file); calls and then process the array. That's not quite the same thing as reading multiple lines in one command, so it may not have any effect. But I know that queuing your mysql inserts and sending them as one giant insert (another trick I'm going to implement in this script after more testing and benchmarking) will save a lot of time.
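(For reference, the "one giant insert" trick usually looks roughly like this; it's only a sketch, and the $pdo connection, $file handle, table lines, and column content are made-up assumptions:)

$batchSize = 500;                 // rows per INSERT, tune to taste
$batch     = array();

function flushBatch($pdo, &$batch) {
    if (!$batch) {
        return;
    }
    // one placeholder group per queued line, e.g. (?),(?),(?)
    $placeholders = implode(',', array_fill(0, count($batch), '(?)'));
    $stmt = $pdo->prepare("INSERT INTO lines (content) VALUES $placeholders");
    $stmt->execute($batch);
    $batch = array();
}

while (($line = fgets($file)) !== false) {
    $batch[] = rtrim($line, "\r\n");
    if (count($batch) >= $batchSize) {
        flushBatch($pdo, $batch);
    }
}
flushBatch($pdo, $batch);         // whatever is left over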
Update 4/13/19
Here is the solution I went with. Originally I had a much more complicated method of slicing off the end of each read, but then I realized you can do it much more simply.
$read_length = 131072; // 128 KB per read, per the benchmarking note below
$index_file  = fopen( 'path to file', "r" );
$chunk = "";
while ( !feof($index_file) )
{
    $chunk .= fread($index_file, $read_length);
    $payload_lines = explode("\n", $chunk);
    if ( !feof($index_file) )
    { $chunk = array_pop($payload_lines); } // keep the (possibly partial) last line for the next read
    // ... process $payload_lines here ...
}
fclose($index_file);
Of course PHP has a function for everything. So I break every read into an array of lines, and array_pop() the last item in the array back to the beginning of the 'read buffer'. That last item is probably a partial line, but not necessarily. Either way, it goes back in and gets processed on the next loop iteration (unless we're done with the file, in which case we don't pop it).
The only thing you have to watch out for here is a line so long that a single read won't capture the whole thing. But if you know your data, that probably won't be a problem. For me, I'm parsing a json-ish file, and I'm reading 128 KB at a time, so there are always many line breaks in my read.
Note: I settled on 128 KB by doing a million benchmarks and finding the size my server processes the absolute fastest. This parsing function will run 300 times, so every second I save per run saves me 5 minutes of total runtime.
One possible approach that might be faster would be to read large chunks of the file in with fread(), split them by newlines, and then process the lines. You'd have to take into account that the chunks may sever lines, and you'd have to detect this and glue them back together.
Generally speaking, the larger the chunk you can read in one go, the faster your process should become, within the limits of your available memory.
From fread() docs:
Note that fread() reads from the current position of the file pointer. Use ftell() to find the current position of the pointer and rewind() to rewind the pointer position.
I have a 1.3GB text file that I need to extract some information from in PHP. I have researched it and have come up with a few ways to do what I need to do, but as always I am after a little clarification on which method would be best, or whether a better one exists that I do not know about.
The information I need from the text file is only the first 40 characters of each line, and there are around 17 million lines in the file. The 40 characters from each line will be inserted into a database.
The methods I have are below:
// REMOVE TIME LIMIT
set_time_limit(0);
// REMOVE MEMORY LIMIT
ini_set('memory_limit', '-1');
// OPEN FILE
$handle = @fopen('C:\Users\Carl\Downloads\test.txt', 'r');
if($handle) {
    while(($buffer = fgets($handle)) !== false) {
        $insert[] = substr($buffer, 0, 40);
    }
    if(!feof($handle)) {
        // error: fgets() failed before reaching the end of the file
    }
    fclose($handle);
}
The above reads one line at a time and gets the data; I have all the database inserts sorted, doing 50 inserts at a time, ten times over, in a transaction.
The next method is really the same as above, but calling file() to store all the lines in an array before doing a foreach to get the data. I am not sure about this method, though, as the array would essentially have over 17 million values.
Another method would be to extract only part of the file, rewrite the file with the unused data, and after that part has been executed re-invoke the script using a header call.
What would be the best way to get this done in the quickest and most efficient manner? Or is there a better way to approach this that I have not thought of?
Also, I plan to use this script with WAMP, but running it in a browser while testing has caused problems with timeouts, even with the script timeout set to 0. Is there a way I can execute the script without accessing the page through a browser?
You have it right so far; don't use the file() function, as it would most probably hit the RAM usage limit and terminate your script.
I wouldn't even accumulate stuff into the $insert[] array, as that would waste RAM as well. If you can, insert into the database right away.
BTW, there is a nice tool called "cut" that you could use to process the file.
cut -c1-40 file.txt
You could even redirect cut's stdout to some PHP script that inserts into database.
cut -c1-40 file.txt | php -f inserter.php
inserter.php could then read lines from php://stdin and insert into DB.
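A minimal sketch of what inserter.php could look like (the DSN, credentials, table and column names are placeholders):

<?php
$pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO extracted (prefix) VALUES (?)');

$stdin = fopen('php://stdin', 'r');
while (($line = fgets($stdin)) !== false) {
    // each line from cut is already trimmed to 40 characters
    $stmt->execute(array(rtrim($line, "\r\n")));
}
fclose($stdin);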
"cut" is a standard tool available on all Linuxes, if you use Windows you can get it with MinGW shell, or as part of msystools (if you use git) or install native win32 app using gnuWin32.
Why are you doing this in PHP when your RDBMS almost certainly has bulk import functionality built in? MySQL, for example, has LOAD DATA INFILE:
LOAD DATA INFILE 'data.txt'
INTO TABLE `some_table`
FIELDS TERMINATED BY ''
LINES TERMINATED BY '\n'
( @line )
SET `some_column` = LEFT( @line, 40 );
One query.
MySQL also has the mysqlimport utility that wraps this functionality from the command line.
None of the above. The problem with using fgets() is that it does not work quite as you might expect: when the maximum length is reached, the next call to fgets() will continue on the same line. You have correctly identified the problem with using file(). The third method is an interesting idea, and you could pull it off with other solutions as well.
That said, your first idea of using fgets() is pretty close; however, we need to slightly modify its behaviour. Here's a customized version that will work as you'd expect:
// Read one line from $fp, returning at most $len characters of it but always
// consuming the rest of the line. Returns false only at end of file.
function fgetl($fp, $len) {
    $l = 0;
    $buffer = '';
    while (false !== ($c = fgetc($fp)) && PHP_EOL !== $c) {
        if ($l < $len) {
            $buffer .= $c;
        }
        ++$l;
    }
    if (0 === $l && false === $c) {
        return false;
    }
    return $buffer;
}
Execute the insert operation immediately or you will waste memory. Make sure you are using prepared statements to insert this many rows; this will drastically reduce execution time. You don't want to submit the full query on each insert when you can submit only the data.
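For example, a sketch that combines those points, streaming the file and inserting each 40-character prefix with a reused prepared statement inside a transaction (the DSN and table/column names are assumptions):

$pdo    = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt   = $pdo->prepare('INSERT INTO extracted (prefix) VALUES (?)');
$handle = fopen('C:\Users\Carl\Downloads\test.txt', 'r');

$pdo->beginTransaction();
while (($buffer = fgets($handle)) !== false) {
    // insert immediately instead of accumulating into $insert[]
    $stmt->execute(array(substr($buffer, 0, 40)));
}
$pdo->commit();
fclose($handle);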
I have a 10MB text file.
The length of the lines may vary.
Which is the most efficient way (fast and memory friendly) to read just one specific line from this file? e.g. get_me_the_line($nr, $file_resource)
I don't know of a way to just jump to the line if the lines are of varying length. However, you can iterate through the lines pretty quickly when you're not doing anything with them, and return the one of interest.
function ReadLineNumber($file, $number)
{
    $handle = fopen($file, "r");
    $i = 0;
    while (fgets($handle) && $i < $number - 1)
        $i++;
    $line = fgets($handle);
    fclose($handle);
    return $line;
}
Edit
I added - 1 to the loop because this reads a line ahead. The $number is therefore a zero-indexed line reference. Change it to - 2 if you would prefer line 1 to mean the first line in the file.
As the lines are of varying length, you have to look at each character, since any one of them might denote the end of a line. The quickest way would be to load the file in chunks sized like the block size of the filesystem and count the line breaks until you reach the desired line.
A better way would be to have an index file that stores information about the line positions in the data file. Using a database could also be a better idea.
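A sketch of the index idea (function names are made up; the offset array could itself be saved to an index file for reuse): one pass records the byte offset of every line, after which any line can be fetched with a single fseek().

function buildLineIndex($path) {
    $offsets = array();
    $fh  = fopen($path, 'r');
    $pos = 0;
    while (fgets($fh) !== false) {
        $offsets[] = $pos;      // start of the line we just read
        $pos = ftell($fh);      // start of the next line
    }
    fclose($fh);
    return $offsets;
}

function readLineByNumber($path, $offsets, $number) {
    if (!isset($offsets[$number])) {
        return false;           // no such line
    }
    $fh = fopen($path, 'r');
    fseek($fh, $offsets[$number]);
    $line = fgets($fh);
    fclose($fh);
    return $line;
}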
If the file is REALLY large (several GB or more) and your application is running on *nix, you may not want to have PHP process the file and instead use an existing unix tool optimized for this kind of line processing. One such tool is sed, and an example of printing a specific line from a huge file can be found here.
It should be trivial to wrap this in a shell_exec() call or similar to write the function you are looking for.
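Something along these lines should do it (a hypothetical wrapper; $nr is 1-based here):

function get_me_the_line($nr, $file) {
    // print line $nr with sed and quit right after, so sed doesn't scan the rest of the file
    return shell_exec(sprintf("sed -n '%dp;%dq' %s", $nr, $nr, escapeshellarg($file)));
}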
To make this more clear, I'm going to put code samples:
$file = fopen('filename.ext', 'rb');
// Assume $pos has been declared
// method 1
fseek($file, $pos);
$parsed = fread($file, 2);
// method 2
$data = '';
while (!feof($file)) {
    $data .= fread($file, 1000000);
}
$data = bin2hex($data);
$parsed = substr($data, $pos, 2);
fclose($file);
There are about 40 fread() in method 1 (with maybe 15 fseek()) vs 1 fread() in method 2. The only thing I am wondering is if loading in 1000000 bytes is overkill when you're really only extracting maybe 100 total bytes (all relatively close together in the middle of the file).
So which code is going to perform better? Which code makes more sense to use? A quick explanation would be greatly appreciated.
If you already know the offset you are looking for, fseek is the best method here, as there is no reason to load the whole file into memory if you only need a few bytes of it. The first method is better because you skip right to what you want in the file stream and read out a small portion. The second method requires you to read the entire file into memory and then seek through that, when you could have just read the piece you need straight from the file. Hope this answers your question.
Files are read in units of clusters, and a cluster is usually something like 8 kb. Usually a few clusters are read ahead.
So, if the file is only a few kb there is very little to gain by using fseek compared to reading the entire file. The file system will read the entire file anyway.
If the file is considerably larger, as in your case, only a few of the clusters have to be read, so the first method should perform better. At worst all the data will still be read from the disk, but your application will still use less memory.
It seems that seeking to the position you want and then reading only the bytes you need is the best approach.
But the correct answer is (as always) to test it for real instead of guessing. Run your two examples in your server environment and make some time measurements. Also check memory usage. Then make your optimization once you have some hard data to back it up.
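For instance, a quick-and-dirty way to get those hard numbers for the first method (a sketch; reuse $pos and the filename from your own code):

$start = microtime(true);

$file = fopen('filename.ext', 'rb');
fseek($file, $pos);
$parsed = fread($file, 2);
fclose($file);

printf("seek+read: %.4f s, peak memory: %d bytes\n",
       microtime(true) - $start, memory_get_peak_usage(true));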
What is the best way to write to files in a large PHP application? Let's say lots of writes are needed per second. What is the best way to go about this?
Could I just open the file and append the data? Or should I open, lock, write, and unlock?
What will happen if the file is being worked on and other data needs to be written? Will that activity be lost, or will it be saved? And if it is saved, will it halt the application?
If you have been, thank you for reading!
Here's a simple example that highlights the danger of simultaneous writes:
<?php
for($i = 0; $i < 100; $i++) {
    $pid = pcntl_fork();
    //only spawn more children if we're not a child ourselves
    if(!$pid)
        break;
}

$fh = fopen('test.txt', 'a');

//The following is a simple attempt to get multiple threads to start at the same time.
$until = round(ceil(time() / 10.0) * 10);
echo "Sleeping until $until\n";
time_sleep_until($until);

$myPid = posix_getpid();
//create a line starting with pid, followed by 10,000 copies of
//a "random" char based on pid.
$line = $myPid . str_repeat(chr(ord('A')+$myPid%25), 10000) . "\n";

for($i = 0; $i < 1; $i++) {
    fwrite($fh, $line);
}

fclose($fh);
echo "done\n";
If appends were safe, you should get a file with 100 lines, all of which are roughly 10,000 chars long and begin with an integer. And sometimes, when you run this script, that's exactly what you'll get. Sometimes, though, a few appends will conflict and the file will get mangled.
You can find corrupted lines with grep '^[^0-9]' test.txt
This is because file append is only atomic if:
You make a single fwrite() call
and that fwrite() is smaller than PIPE_BUF (somewhere around 1-4k)
and you write to a fully POSIX-compliant filesystem
If you make more than a single call to fwrite during your log append, or you write more than about 4k, all bets are off.
Now, as to whether or not this matters: are you okay with having a few corrupt lines in your log under heavy load? Honestly, most of the time this is perfectly acceptable, and you can avoid the overhead of file locking.
I do have a high-performance, multi-threaded application where all threads write (append) to a single log file. So far I have not noticed any problems with that; each thread writes multiple times per second and nothing gets lost. I think just appending to a huge file should be no issue. But if you want to modify already existing content, especially with concurrency, I would go with locking; otherwise a big mess can happen...
If concurrency is an issue, you should really be using databases.
If you're just writing logs, maybe you should take a look at the syslog() function, since syslog provides an API for exactly this.
You could also delegate writes to a dedicated backend and do the job in an asynchronous manner.
These are my 2p.
Unless a unique file is needed for a specific reason, I would avoid appending everything to one huge file. Instead, I would rotate ("wrap") the file by time and size. A couple of configuration parameters (wrap_time and wrap_size) could be defined for this.
Also, I would probably introduce some buffering to avoid waiting for the write operation to complete.
PHP is probably not the best-suited language for this kind of operation, but it is still possible.
Use flock()
See this question
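A minimal locked-append sketch (assuming $line holds whatever you want to write): LOCK_EX serializes the writers, so concurrent appends can no longer interleave mid-line, at the cost of writers blocking each other under heavy load.

$fh = fopen('test.txt', 'a');
if (flock($fh, LOCK_EX)) {
    fwrite($fh, $line);
    fflush($fh);
    flock($fh, LOCK_UN);
}
fclose($fh);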
If you just need to append data, PHP should be fine with that, as the filesystem should take care of simultaneous appends.