Which is faster between glob() and opendir(), for reading around 1-2K file(s)?
http://code2design.com/forums/glob_vs_opendir
Obviously opendir() should be (and is) quicker, as it just opens a directory handle and lets you iterate. glob() has to parse its pattern argument and match it against every entry, which takes more time (and if the pattern reaches into subdirectories, those have to be scanned as well, adding to the execution time).
glob and opendir do different things. glob finds pathnames matching a pattern and returns these in an array, while opendir returns a directory handle only. To get the same results as with glob you have to call additional functions, which you have to take into account when benchmarking, especially if this includes pattern matching.
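For example, to get roughly what glob('*.jpg') returns you need a filter and a sort on top of opendir(); a minimal sketch (the pattern and directory are made up):
$matches = [];
$dh = opendir('.');
while (($entry = readdir($dh)) !== false) {
    if (fnmatch('*.jpg', $entry)) {
        $matches[] = $entry; // glob() would also sort these for you
    }
}
closedir($dh);
sort($matches);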
Bill Karwin has written an article about this recently. See:
http://www.phparch.com/2010/04/28/putting-glob-to-the-test/
Not sure whether that is a perfect comparison, but glob() also lets you use shell-like wildcard patterns, whereas opendir() works on the directory directly, which makes it faster.
Another question that can be answered with a bit of testing. I had a convenient folder with 412 things in it, but the results shouldn't vary much, I imagine:
igor47#whisker ~/test $ ls /media/music | wc -l
412
igor47#whisker ~/test $ time php opendir.php
414 files total
real 0m0.023s
user 0m0.000s
sys 0m0.020s
igor47#whisker ~/test $ time php glob.php
411 files total
real 0m0.023s
user 0m0.010s
sys 0m0.010s
Okay,
Long story short:
if you want full filenames+paths, sorted, glob is practically unbeatable.
if you want full filenames+paths unsorted, use glob with GLOB_NOSORT.
if you want only the names, and no sorting, use opendir + loop.
That's it (a quick sketch of all three variants follows below).
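A minimal sketch of those three variants (the directory name is made up):
$dir = '/var/www/images';

// 1. Full paths, sorted:
$sorted = glob($dir . '/*');

// 2. Full paths, unsorted (skips the sort entirely):
$unsorted = glob($dir . '/*', GLOB_NOSORT);

// 3. Bare names, unsorted:
$names = [];
$dh = opendir($dir);
while (($entry = readdir($dh)) !== false) {
    if ($entry !== '.' && $entry !== '..') {
        $names[] = $entry;
    }
}
closedir($dh);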
Some more thoughts:
You can run tests that compose the exact same result with different methods, only to find they have approximately the same time cost; merely for fetching the information there is no real winner. However, consider these:
Dealing with a huge file list, glob will sort faster: the sorting happens at the C level on plain names, whereas PHP would be sorting a hashed array of arbitrary strings, so it's simply not a fair comparison.
You'll probably want to filter your list by some extensions or filename masks, for which glob is really efficient. You have fnmatch() of course, but calling it for every entry will never be faster than a filter implemented at the C level for this very job.
On the other hand, glob returns a significantly bigger amount of text (each name with full path) so with a lot of files you may run into memory allocation limits. For a zillion files, glob is not your friend.
opendir() is faster...
<?php
$path   = "/var/Upload/gallery/TEST/";
$filenm = "IMG20200706075415";

function microtime_float()
{
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}

$t1 = microtime_float();
echo "<br> <i>T1:</i>" . $t1;

echo "<br><br> <b><i>Glob :</i></b>";
foreach (glob($path . $filenm . ".*") as $file) {
    echo "<br>" . $file;
}
$t2 = microtime_float();
echo "<br> <i>T2:</i> " . $t2;

echo "<br><br> <b><i>OpenDir :</i></b>";

function resolve($name)
{
    // read information about the path
    $info = pathinfo($name);
    if (!empty($info['extension'])) {
        // if the name already contains an extension, return it as-is
        return $name;
    }
    $filename = $info['filename'];
    $len = strlen($filename);
    // open the folder
    $dh = opendir($info['dirname']);
    if (!$dh) {
        return false;
    }
    // scan each entry in the folder
    while (($file = readdir($dh)) !== false) {
        if (strncmp($file, $filename, $len) === 0) {
            if (strlen($name) > $len) {
                // the name contains a directory part
                $name = substr($name, 0, strlen($name) - $len) . $file;
            } else {
                // the name is at the path root
                $name = $file;
            }
            closedir($dh);
            return $name;
        }
    }
    // file not found
    closedir($dh);
    return false;
}

$file = resolve($path . $filenm);
echo "<br>" . $file;
$t3 = microtime_float();
echo "<br> <i>T3:</i> " . $t3;

$gt = $t2 - $t1;
$ot = $t3 - $t2;
echo "<br><br> <b>glob time:</b> " . $gt . "<br><b>opendir time:</b> " . $ot;
echo "<u>" . (($ot < $gt)
    ? "<br><br>OpenDir is " . ($gt - $ot) . " seconds faster"
    : "<br><br>Glob is " . ($ot - $gt) . " seconds faster") . "</u>";
?>
Output:
T1:1620133029.7558
Glob :
/var/Upload/gallery/TEST/IMG20200706075415.jpg
T2: 1620133029.7929
OpenDir :
/var/Upload/gallery/TEST/IMG20200706075415.jpg
T3: 1620133029.793
glob time:0.037137985229492
opendir time:5.9843063354492E-5
OpenDir is 0.037078142166138 seconds faster
tl;dr: To store files under a path determined by their hash, I need a single function to get the following with level=3 and hash='fd6eg3': f/fd/fd6/fd6eg3
I am looking for a way to create chunks (of a given string) of increasing lengths, from the beginning of the string. Also, I want to limit the number of produced chunks.
The goal is to store a file named fd6eg3 under (with a number of chunks set to 3):
A directory named fd6.
Which is the child of a directory named fd.
Which is the child of a directory named f.
So the final path would be: f/fd/fd6/fd6eg3
The closest I got is getting the "directory part" (f/fd/fd6/) using the following function:
function computePathForHash(string $hash, int $level): string
{
    if ($level <= 0) {
        return '';
    } else {
        return computePathForHash($hash, $level - 1)
            . mb_substr($hash, 0, $level)
            . DIRECTORY_SEPARATOR;
    }
}
echo computePathForHash('fd6eg3', 3);
Its output is summarized in this table:
+-------+-------------+
| level | Returned    |
+-------+-------------+
| 0     | ""          |
| 1     | "f/"        |
| 2     | "f/fd/"     |
| 3     | "f/fd/fd6/" |
+-------+-------------+
But I fail to add the "file part" ($hash) to the end.
I would like to avoid passing a third parameter such as $initial_level that would store the originally requested level, against which I could compare the current $level to decide when to append $hash.
Considering the low value of $level (it shouldn't be higher than 10 in my use case) I don't think there are arguments against recursion. I find recursion simpler to read and understand, but if it can't be done recursively I'll go with something else (or use two functions, one for the directory part and one to concatenate it with the file part).
I think recursion is great in the right circumstances, but a straightforward for loop can do this and is hopefully more maintainable.
Just loop up to the level and add the start chunk of the hash each time, then add the full hash on the return...
function computePathForHash(string $hash, int $level): string
{
    $output = '';
    for ($i = 1; $i <= $level; $i++) {
        $output .= mb_substr($hash, 0, $i) . DIRECTORY_SEPARATOR;
    }
    return $output . $hash;
}
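Example (hedged), using the same call as in the question:
echo computePathForHash('fd6eg3', 3); // f/fd/fd6/fd6eg3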
If we say that we have a text file containing order numbers or references (one number per line), what is the best way to find/validate an input (a number entered in a form, for example) against the numbers in that file?
Is there a simple idea to do it? Assume we have thousands of numbers to search through.
Thank you very much.
If memory is not an issue (Demo):
if (in_array($number, file('numbers.txt', FILE_IGNORE_NEW_LINES))) {
    // number exists - do something
}
Since file returns an array where each line is one element in the array, you can also use array_search to find the line where it was found or array_keys to find all the lines where it was found.
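For example (a small sketch, assuming the same numbers.txt as above):
$lines = file('numbers.txt', FILE_IGNORE_NEW_LINES);
$firstLine = array_search($number, $lines); // index of the first matching line, or false
$allLines  = array_keys($lines, $number);   // indexes of every matching line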
If memory is an issue (Demo):
foreach (new SplFileObject('numbers.txt') as $line) {
    if ($number == trim($line)) {
        // number exists - do something
        break;
    }
}
When in doubt which to use, benchmark.
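A minimal benchmark sketch (hedged; the file name and sample number are made up):
$number = '123456';

$t = microtime(true);
$found = in_array($number, file('numbers.txt', FILE_IGNORE_NEW_LINES));
printf("file() + in_array(): %.4f s\n", microtime(true) - $t);

$t = microtime(true);
$found = false;
foreach (new SplFileObject('numbers.txt') as $line) {
    if (trim($line) === $number) {
        $found = true;
        break;
    }
}
printf("SplFileObject loop:  %.4f s\n", microtime(true) - $t);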
Marking CW because there are already several questions asking how to read a file line by line or efficiently.
$file = file_get_contents("filename.txt");
// note: strpos() matches substrings, so "42" would also be found inside "142"
if (strpos($file, "search string") === false) {
    echo "String not found!";
}
If the numbers are ordered, don't load the whole file into memory. Seek to the middle of the file and read the number there. If your number is less than the one in the middle, seek to the middle of the first half; otherwise seek to the middle of the second half, and so on.
Binary Search
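A sketch of that idea (hedged): it assumes numbers.txt is sorted in ascending order with one integer per line, and the function name is made up.
function numberExistsInSortedFile(string $path, int $needle): bool
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        return false;
    }
    $lo = 0;               // the needle's line, if any, starts at or after $lo
    $hi = filesize($path); // ... and at or before $hi

    while ($hi - $lo > 4096) {
        $mid = intdiv($lo + $hi, 2);
        fseek($fp, $mid);
        fgets($fp);             // skip the partial line we landed in
        $line = fgets($fp);     // first whole line after $mid
        if ($line === false) {
            $hi = $mid;
            continue;
        }
        $value = (int) trim($line);
        if ($value === $needle) {
            fclose($fp);
            return true;
        }
        if ($value < $needle) {
            $lo = $mid;
        } else {
            $hi = $mid;
        }
    }

    // Linearly scan the small remaining window, realigned to a line start.
    fseek($fp, max(0, $lo - 1));
    if ($lo > 1) {
        fgets($fp);             // drop the partial line before $lo
    }
    while (ftell($fp) <= $hi && ($line = fgets($fp)) !== false) {
        $value = (int) trim($line);
        if ($value === $needle) {
            fclose($fp);
            return true;
        }
        if ($value > $needle) {
            break;              // sorted file: we've gone past it
        }
    }
    fclose($fp);
    return false;
}

var_dump(numberExistsInSortedFile('numbers.txt', 42));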
If you want to return the line number of the location of the matching number in the file, you can use file() to return the reference file as an array of file lines.
$search_string = '42';
$file_name = 'test_file.txt';
$file = file($file_name);
$found_on_lines = [];
foreach ($file as $line_number => $number) {
    if (intval($search_string) == $number) {
        $found_on_lines[] = $line_number;
    }
}
echo "String " . $search_string;
if (count($found_on_lines) > 0) {
    echo " found on line(s):<br> ";
    foreach ($found_on_lines as $line) {
        echo $line . "<br>";
    }
} else {
    echo "not found in file " . $file_name . ".";
}
This will output
String 42 found on line(s):
9
256
if your reference file contains the number '42' on lines 9 and 256 (note that the line numbers from file() are zero-based).
As I was not able to find a function that retrieves the number of lines a file has,
do I need to use
$handle = fopen("file.txt", "r");
for ($line = 1; $line <= 10; $line++) {
    fgets($handle);
}
if (feof($handle)) {
    echo "File has 10 lines or fewer.";
} else {
    echo "File has more than 10 lines.";
}
fclose($handle);
or something similar? All I want to know is if the file has more than 10 lines or not :-).
Thanks in advance!
You can get the number of lines using:
$file = 'smth.txt';
$num_lines = count(file($file));
Faster and more memory-efficient:
$file = new SplFileObject('file.txt');
$file->seek(9);
if ($file->eof()) {
    echo 'File has less than 10 lines.';
} else {
    echo 'File has 10 lines or more.';
}
SplFileObject
Bigger problems will occur if you have a LARGE file; PHP tends to slow down. Why not run an exec command and let the system return the number? Then you do not have to worry about the PHP overhead of reading the file.
$count = exec("wc -l /path/to/file"); // note: wc -l prints "<count> <filename>"
Or if you want to get a bit more fancy:
$count = exec("awk '// {++x} END {print x}' /path/to/file");
If you have a big file, it is better to read it in segments and count the line-ending characters ("\n", or whatever the line ending is; on some systems you will also need to count "\r").
$lineCounter = 0;
$myFile = fopen('/pathto/file.whatever', 'r');
while ($stringSegment = fread($myFile, 4096000)) {
    $lineCounter += substr_count($stringSegment, "\n");
}
Is there a maximum file size the XMLReader can handle?
I'm trying to process an XML feed about 3GB large. There are certainly no PHP errors as the script runs fine and successfully loads to the database after it's been run.
The script also runs fine with smaller test feeds - 1GB and below. However, when processing larger feeds the script stops reading the XML File after about 1GB and continues running the rest of the script.
Has anybody experienced a similar problem? and if so how did you work around it?
Thanks in advance.
I had the same kind of problem recently and thought I'd share my experience.
It seems that the problem is in the way PHP was compiled, i.e., whether it was compiled with support for 64-bit file sizes/offsets or only with 32-bit support.
With 32 bits you can only address 4GB of data. You can find a somewhat confusing but good explanation here: http://blog.mayflower.de/archives/131-Handling-large-files-without-PHP.html
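A quick way to check which kind of build you are running (these constants are standard PHP):
var_dump(PHP_INT_SIZE); // 4 on a 32-bit build, 8 on a 64-bit build
var_dump(PHP_INT_MAX);  // 2147483647 vs 9223372036854775807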
I had to split my files with Perl utility xml_split which you can find here: http://search.cpan.org/~mirod/XML-Twig/tools/xml_split/xml_split
I used it to split my huge XML file into manageable chunks. The good thing about the tool is that it splits XML files over whole elements. Unfortunately it's not very fast.
I needed to do this one time only and it suited my needs, but I wouldn't recommend it for repetitive use. After splitting I used XMLReader on smaller files of about 1GB in size.
Splitting up the file will definitely help. Other things to try...
adjust the memory_limit variable in php.ini. http://php.net/manual/en/ini.core.php
rewrite your parser using SAX -- http://php.net/manual/en/book.xml.php . This is a stream-oriented parser that doesn't need to parse the whole tree. Much more memory-efficient but slightly harder to program (a minimal sketch follows below).
Depending on your OS, there might also be a 2gb limit on the RAM chunk that you can allocate. Very possible if you're running on a 32-bit OS.
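A minimal SAX-style sketch with ext/xml (hedged: the <record> element name and the file name are assumptions, not the asker's actual feed):
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) {
        // ext/xml upper-cases element names by default, hence 'RECORD'
        if ($name === 'RECORD') {
            // start of one record: reset per-record state here
        }
    },
    function ($parser, $name) {
        if ($name === 'RECORD') {
            // end of one record: flush it to the database here
        }
    }
);

$fp = fopen('feed.xml', 'rb');
while (!feof($fp)) {
    // feed the parser one chunk at a time; only the chunk is held in memory
    $chunk = fread($fp, 8192);
    xml_parse($parser, $chunk, feof($fp));
}
fclose($fp);
xml_parser_free($parser);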
It should be noted that PHP in general has a maximum file size. PHP does not have unsigned integers or separate long integers, meaning you're capped at 2^31 - 1 (or 2^63 - 1 on 64-bit systems) for integers. This is important because PHP uses an integer for the file pointer (your position in the file as you read through), meaning it cannot process a file larger than 2^31 bytes in size.
However, this should be more than 1 gigabyte. I ran into issues with two gigabytes (as expected, since 2^31 is roughly 2 billion).
I've run into a similar issue when parsing large documents. What I wound up doing is breaking the feed into smaller chunks using filesystem functions, then parsing those smaller chunks... So if you have a bunch of <record> tags that you are parsing, parse them out with string functions as a stream, and when you get a full record in the buffer, parse that using the xml functions... It sucks, but it works quite well (and is very memory efficient, since you only have at most 1 record in memory at any one time)...
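A rough sketch of that buffering approach (hedged: the <record> tag name and file name are assumptions, and the tag matching is deliberately naive):
$fp = fopen('feed.xml', 'rb');
$buffer = '';
while (!feof($fp)) {
    $buffer .= fread($fp, 8192);
    // pull every complete <record>...</record> currently sitting in the buffer
    while (($start = strpos($buffer, '<record')) !== false
        && ($end = strpos($buffer, '</record>', $start)) !== false) {
        $end += strlen('</record>');
        $recordXml = substr($buffer, $start, $end - $start);
        $buffer = substr($buffer, $end);
        $record = simplexml_load_string($recordXml);
        // ... handle one record here, e.g. insert it into the database ...
    }
}
fclose($fp);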
Do you get any errors with
libxml_use_internal_errors(true);
libxml_clear_errors();
// your parser stuff here....
$r = new XMLReader(...);
// ....
foreach (libxml_get_errors() as $err) {
    printf(". %d %s\n", $err->code, $err->message);
}
when the parser stops prematurely?
Using Windows XP, NTFS as the filesystem and PHP 5.3.2, there was no problem with this test script:
<?php
define('SOURCEPATH', 'd:/test.xml');

if ( 0 ) {
    build();
} else {
    echo 'filesize: ', number_format(filesize(SOURCEPATH)), "\n";
    timing('read');
}

function timing($fn) {
    $start = new DateTime();
    echo 'start: ', $start->format('Y-m-d H:i:s'), "\n";
    $fn();
    $end = new DateTime();
    echo 'end: ', $start->format('Y-m-d H:i:s'), "\n";
    echo 'diff: ', $end->diff($start)->format('%I:%S'), "\n";
}

function read() {
    $cnt = 0;
    $r = new XMLReader;
    $r->open(SOURCEPATH);
    while ($r->read()) {
        if (XMLReader::ELEMENT === $r->nodeType) {
            if (0 === ++$cnt % 500000) {
                echo '.';
            }
        }
    }
    echo "\n#elements: ", $cnt, "\n";
}

function build() {
    $fp = fopen(SOURCEPATH, 'wb');
    $s = '<catalogue>';
    //for ($i = 0; $i < 500000; $i++) {
    for ($i = 0; $i < 60000000; $i++) {
        $s .= sprintf('<item>%010d</item>', $i);
        if (0 === $i % 100000) {
            fwrite($fp, $s);
            $s = '';
            echo $i / 100000, ' ';
        }
    }
    $s .= '</catalogue>';
    fwrite($fp, $s);
    fflush($fp); // was flush($fp); fflush() is the stream-level flush
    fclose($fp);
}
output:
filesize: 1,380,000,023
start: 2010-08-07 09:43:31
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:43:31
diff: 07:31
(as you can see I screwed up the output of the end-time but I don't want to run this script another 7+ minutes ;-))
Does this also work on your system?
As a side-note: The corresponding C# test application took only 41 seconds instead of 7,5 minutes. And my slow harddrive might have been the/one limiting factor in this case.
filesize: 1.380.000.023
start: 2010-08-07 09:55:24
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:56:05
diff: 00:41
and the source:
using System;
using System.IO;
using System.Xml;

namespace ConsoleApplication1
{
    class SOTest
    {
        delegate void Foo();
        const string sourcepath = @"d:\test.xml";

        static void timing(Foo bar)
        {
            DateTime dtStart = DateTime.Now;
            System.Console.WriteLine("start: " + dtStart.ToString("yyyy-MM-dd HH:mm:ss"));
            bar();
            DateTime dtEnd = DateTime.Now;
            System.Console.WriteLine("end: " + dtEnd.ToString("yyyy-MM-dd HH:mm:ss"));
            TimeSpan s = dtEnd.Subtract(dtStart);
            System.Console.WriteLine("diff: {0:00}:{1:00}", s.Minutes, s.Seconds);
        }

        static void readTest()
        {
            XmlTextReader reader = new XmlTextReader(sourcepath);
            int cnt = 0;
            while (reader.Read())
            {
                if (XmlNodeType.Element == reader.NodeType)
                {
                    if (0 == ++cnt % 500000)
                    {
                        System.Console.Write('.');
                    }
                }
            }
            System.Console.WriteLine("\n#elements: " + cnt + "\n");
        }

        static void Main()
        {
            FileInfo f = new FileInfo(sourcepath);
            System.Console.WriteLine("filesize: {0:N0}", f.Length);
            timing(readTest);
            return;
        }
    }
}
I have just found out that my script gives me a fatal error:
Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 440 bytes) in C:\process_txt.php on line 109
That line is this:
$lines = count(file($path)) - 1;
So I think it is having difficulty loading the file into memory and counting the number of lines. Is there a more efficient way I can do this without having memory issues?
The text files that I need to count the number of lines for range from 2MB to 500MB. Maybe a Gig sometimes.
Thanks all for any help.
This will use less memory, since it doesn't load the whole file into memory:
$file="largefile.txt";
$linecount = 0;
$handle = fopen($file, "r");
while(!feof($handle)){
$line = fgets($handle);
$linecount++;
}
fclose($handle);
echo $linecount;
fgets loads a single line into memory (if the second argument $length is omitted it will keep reading from the stream until it reaches the end of the line, which is what we want). This is still unlikely to be as quick as using something other than PHP, if you care about wall time as well as memory usage.
The only danger with this is if any lines are particularly long (what if you encounter a 2GB file without line breaks?). In that case you're better off slurping it in in chunks and counting end-of-line characters:
$file="largefile.txt";
$linecount = 0;
$handle = fopen($file, "r");
while(!feof($handle)){
$line = fgets($handle, 4096);
$linecount = $linecount + substr_count($line, PHP_EOL);
}
fclose($handle);
echo $linecount;
Using a loop of fgets() calls is a fine solution and the most straightforward to write, however:
even though internally the file is read using a buffer of 8192 bytes, your code still has to call that function for each line.
it's technically possible that a single line may be bigger than the available memory if you're reading a binary file.
This code reads a file in chunks of 8kB each and then counts the number of newlines within that chunk.
function getLines($file)
{
    $f = fopen($file, 'rb');
    $lines = 0;
    while (!feof($f)) {
        $lines += substr_count(fread($f, 8192), "\n");
    }
    fclose($f);
    return $lines;
}
If the average length of each line is at most 4kB, you will already start saving on function calls, and those can add up when you process big files.
Benchmark
I ran a test with a 1GB file; here are the results:
+------------+-------------+------------------+---------+
|            | This answer | Dominic's answer | wc -l   |
+------------+-------------+------------------+---------+
| Lines      | 3550388     | 3550389          | 3550388 |
+------------+-------------+------------------+---------+
| Runtime    | 1.055       | 4.297            | 0.587   |
+------------+-------------+------------------+---------+
Time is measured in seconds of real (wall-clock) time.
True line count
While the above works well and returns the same results as wc -l, if the file ends without a newline, the line number will be off by one; if you care about this particular scenario, you can make it more accurate by using this logic:
function getLines($file)
{
    $f = fopen($file, 'rb');
    $lines = 0;
    $buffer = '';
    while (!feof($f)) {
        $buffer = fread($f, 8192);
        $lines += substr_count($buffer, "\n");
    }
    fclose($f);
    if (strlen($buffer) > 0 && $buffer[-1] != "\n") {
        ++$lines;
    }
    return $lines;
}
Simple object-oriented solution
$file = new \SplFileObject('file.extension');
while($file->valid()) $file->fgets();
var_dump($file->key());
Update:
Another way to do this is with PHP_INT_MAX and the SplFileObject::seek method.
$file = new \SplFileObject('file.extension', 'r');
$file->seek(PHP_INT_MAX);
echo $file->key();
If you're running this on a Linux/Unix host, the easiest solution would be to use exec() or similar to run the command wc -l $path. Just make sure you've sanitized $path first to be sure that it isn't something like "/path/to/file ; rm -rf /".
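For example (hedged), escapeshellarg() takes care of the quoting for you:
$output = exec('wc -l ' . escapeshellarg($path));
$lines  = (int) $output; // wc prints "<count> <filename>"; the cast keeps just the count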
There is a faster way I found that does not require looping through the entire file.
It only works on *nix systems, but there might be a similar way on Windows...
$file = '/path/to/your.file';
//Get number of lines
$totalLines = intval(exec("wc -l '$file'"));
If you're using PHP 5.5 you can use a generator. This will NOT work in any version of PHP before 5.5 though. From php.net:
"Generators provide an easy way to implement simple iterators without the overhead or complexity of implementing a class that implements the Iterator interface."
// This function implements a generator to load individual lines of a large file
function getLines($file) {
    $f = fopen($file, 'r');
    // read each line of the file without loading the whole file to memory
    while ($line = fgets($f)) {
        yield $line;
    }
}
// Since generators implement simple iterators, I can quickly count the number
// of lines using the iterator_count() function.
$file = '/path/to/file.txt';
$lineCount = iterator_count(getLines($file)); // the number of lines in the file
If you're under linux you can simply do:
number_of_lines = intval(trim(shell_exec("wc -l ".$file_name." | awk '{print $1}'")));
You just have to find the right command if you're using another OS
Regards
This is an addition to Wallace Maxter's solution
It also skips empty lines while counting:
function getLines($file)
{
    $file = new \SplFileObject($file, 'r');
    $file->setFlags(SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY |
        SplFileObject::DROP_NEW_LINE);
    $file->seek(PHP_INT_MAX);
    return $file->key() + 1;
}
The most succinct cross-platform solution that only buffers one line at a time.
$file = new \SplFileObject(__FILE__);
$file->setFlags($file::READ_AHEAD);
$lines = iterator_count($file);
Unfortunately, we have to set the READ_AHEAD flag, or iterator_count blocks indefinitely; otherwise, this would be a one-liner.
private static function lineCount($file) {
    $linecount = 0;
    $handle = fopen($file, "r");
    while (!feof($handle)) {
        if (fgets($handle) !== false) {
            $linecount++;
        }
    }
    fclose($handle);
    return $linecount;
}
I wanted to add a little fix to the function above...
In a specific example where I had a file containing only the word 'testing', the function returned 2 as a result, so I needed to add a check for whether fgets returned false or not :)
have fun :)
Based on Dominic Rodger's solution,
here is what I use (it uses wc if available, otherwise it falls back to Dominic Rodger's solution).
class FileTool
{
    public static function getNbLines($file)
    {
        $linecount = 0;
        $m = exec('which wc');
        if ('' !== $m) {
            $cmd = 'wc -l < "' . str_replace('"', '\\"', $file) . '"';
            $n = exec($cmd);
            return (int)$n + 1;
        }
        $handle = fopen($file, "r");
        while (!feof($handle)) {
            $line = fgets($handle);
            $linecount++;
        }
        fclose($handle);
        return $linecount;
    }
}
https://github.com/lingtalfi/Bat/blob/master/FileTool.php
Counting the number of lines can be done with the following code:
<?php
$fp = fopen("myfile.txt", "r");
$count = 0;
while ($line = fgetss($fp)) { // fgetss() gets a line while stripping HTML tags (removed in PHP 8; use fgets() there)
    $count++;
}
echo "Total number of lines are " . $count;
fclose($fp);
?>
You have several options. The first is to increase the available memory allowed, which is probably not the best way to do things given that you state the file can get very large. The other way is to use fgets to read the file line by line and increment a counter, which should not cause any memory issues at all as only the current line is in memory at any one time.
There is another answer that I thought might be a good addition to this list.
If you have perl installed and are able to run things from the shell in PHP:
$lines = exec('perl -pe \'s/\r\n|\n|\r/\n/g\' ' . escapeshellarg('largetextfile.txt') . ' | wc -l');
This should handle most line breaks whether from Unix or Windows created files.
TWO downsides (at least):
1) It is not a great idea to have your script so dependent upon the system it's running on (it may not be safe to assume Perl and wc are available).
2) Just a small mistake in escaping and you have handed over access to a shell on your machine.
As with most things I know (or think I know) about coding, I got this info from somewhere else:
John Reeve Article
public function quickAndDirtyLineCounter()
{
    echo "<table>";
    $folders = [
        'C:\wamp\www\qa\abcfolder\\',
    ];

    foreach ($folders as $folder) {
        $files = scandir($folder);
        foreach ($files as $file) {
            if ($file == '.' || $file == '..' || !file_exists($folder . '\\' . $file)) {
                continue;
            }
            $handle = fopen($folder . '/' . $file, "r");
            $linecount = 0;
            while (!feof($handle)) {
                if (is_bool($handle)) {
                    break;
                }
                $line = fgets($handle);
                $linecount++;
            }
            fclose($handle);
            echo "<tr><td>" . $folder . "</td><td>" . $file . "</td><td>" . $linecount . "</td></tr>";
        }
    }
    echo "</table>";
}
I use this method purely for counting how many lines are in a file. What is the downside of doing this versus the other answers? I'm seeing many lines of code, as opposed to my two-line solution. I'm guessing there's a reason nobody does this.
$lines = count(file('your.file'));
echo $lines;
This is a bit late, but...
Here is my solution for a text log file I have, which uses \n to separate each line.
$data = file_get_contents("myfile.txt");
$numlines = strlen($data) - strlen(str_replace("\n","",$data));
It does load the file into memory but doesn't need to cycle through an unknown number of lines. It may be unsuitable if the file is GB in size but for smaller files with short lines of data it works a treat for me.
It just removes the "\n" from the file and works out how many were removed by comparing the length of the data in the file to the length after removing all the line breaks ("\n" chars in my case). If your line delimiter is a different character, replace the "\n" with whatever your line delimiter is.
I know it is not the best answer for all occasions, but it is something I have found quick and simple for my purposes, where each line of the log is only a few hundred chars and the total log file is not too large.
For just counting the lines use:
$handle = fopen("file","r");
static $b = 0;
while($a = fgets($handle)) {
$b++;
}
echo $b;