Discovering folder tree via depth-first or breadth-first - php

I had to find the paths to the "deepest" folders in a folder. For this I implemented two algorithms, and one is way faster than the other.
Does anyone know why? I suppose this has some link with the hard-disk hardware, but I'd like to understand.
Here is the fast one:
/**
 * Depth-first function using glob
 */
private function getHostAux($path) {
    $matches = array();
    $folder = rtrim($path, DIRECTORY_SEPARATOR);
    $moreFolders = glob($folder.DIRECTORY_SEPARATOR.'*', GLOB_ONLYDIR);
    if (count($moreFolders) == 0) {
        // No subdirectories: this is one of the "deepest" folders
        $matches[] = $folder;
    } else {
        foreach ($moreFolders as $fd) {
            $arr = $this->getHostAux($fd);
            $matches = array_merge($matches, $arr);
        }
    }
    return $matches;
}
And here is the slow one:
/**
* Breadth-first function using glob
*/
private function getHostAux($path) {
    $matches = array();
    $folders = array(rtrim($path, DIRECTORY_SEPARATOR));
    $i = 0;
    while ($folder = array_shift($folders)) {
        $moreFolders = glob($folder.DIRECTORY_SEPARATOR.'*', GLOB_ONLYDIR);
        if (count($moreFolders) == 0) {
            $matches[$i] = $folder;
        }
        $folders = array_merge($folders, $moreFolders);
        $i++;
    }
    return $matches;
}
Thanks!

You haven't provided additional information that might be crucial for understanding the "timings" you observed. (I put "slow" and "fast" in quotes intentionally, since you haven't specified what they mean or how exactly you measured them.)
Assuming that the supplied information is accurate, that the speedup for the first method is greater than a couple of percent, and that you've tested it on directories of various sizes and depths...
First I would like to comment on the supplied answers:
I wouldn't be so sure about your answer. First, I think you mean "kernel handles". But this is not true, since glob doesn't open handles. How did you come up with this answer?
Both versions have the same total iteration count.
And add something from myself:
I would suspect array_shift() may cause the slowdown, because it reindexes the whole array on each call (see the sketch after these points).
The order in which you glob may matter depending on the underlying OS and file system.
You (probably) have a bug in your code: you increment $i after every glob() rather than after adding an element to the $matches array. As a result $matches is sparse, which may slow down the merging, the shifting, or even the appending itself. I don't know for certain whether that's the case in PHP, but I know several languages in which arrays have such properties, and they're easy to forget while coding. I'd recommend fixing this, timing the code again, and seeing whether it makes any difference.
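If array_shift() really is the culprit, a queue structure avoids the reindexing entirely. A minimal sketch (untested, based on your second function) that replaces the plain array with SplQueue, whose dequeue() is O(1):
private function getHostAux($path) {
    $matches = array();
    $queue = new SplQueue();
    $queue->enqueue(rtrim($path, DIRECTORY_SEPARATOR));
    while (!$queue->isEmpty()) {
        $folder = $queue->dequeue(); // O(1), no reindexing of remaining items
        $moreFolders = glob($folder.DIRECTORY_SEPARATOR.'*', GLOB_ONLYDIR);
        if (count($moreFolders) == 0) {
            $matches[] = $folder; // leaf: no subdirectories
        }
        foreach ($moreFolders as $fd) {
            $queue->enqueue($fd);
        }
    }
    return $matches;
}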

I think your first algorithm, with recursion, does fewer iterations than the second one. Try counting how many iterations each algorithm performs using auxiliary variables.
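For instance (a rough sketch, not the original code), you could wrap glob() in a counting closure and compare the totals:
// Sketch: count filesystem hits per algorithm via a closure (PHP 5.3+).
$globCalls = 0;
$countingGlob = function ($pattern) use (&$globCalls) {
    $globCalls++; // one increment per directory listed
    return glob($pattern, GLOB_ONLYDIR);
};
// Replace each glob(..., GLOB_ONLYDIR) call with $countingGlob(...) in both
// versions, run them on the same tree, and compare $globCalls afterwards.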

Related

PHP If-Else does not work for comparing filecontents

I am trying to make a PHP application that searches through the subdirectories of the current directory, looks for a file called email.txt in each one, compares its contents with the given query, and echoes all directories that match. But it does not work, and the problem seems to be in the if-else part at the end of the script, because it doesn't give any output.
<?php
// pulling query from link
$query = $_GET["q"];
echo($query);
echo("<br>");
// listing all files in doc directory
$files = scandir(".");
// searching through array for unwanted files
$downloader = array_search("downloader.php", $files);
$viewer = array_search("viewer.php", $files);
$search = array_search("search.php", $files);
$editor = array_search("editor.php", $files);
$index = array_search("index.php", $files);
$error_log = array_search("error_log", $files);
$images = array_search("images", $files);
$parsedown = array_search("Parsedown.php", $files);
// deleting unwanted files from array
unset($files[$downloader]);
unset($files[$viewer]);
unset($files[$search]);
unset($files[$editor]);
unset($files[$index]);
unset($files[$error_log]);
unset($files[$images]);
unset($files[$parsedown]);
// counting folders
$folderamount = count($files);
// defining loop variables
$loopnum = 0;
// loop
while ($loopnum <= $folderamount + 10) {
    $loopnum = $loopnum + 1;
    // gets the emails from every folder
    $dirname = $files[$loopnum];
    $email = file_get_contents("$dirname/email.txt");
    // checks if the email matches
    if ($stremail == $query) {
        echo($dirname);
    }
}
//print_r($files);
//echo("<br><br>");
?>
Can someone explain / fix this for me? I literally have no clue what's wrong, and I've debugged so much already. Any help would be greatly appreciated.
Kind regards,
Bluppie05
There are a few problems with this code that would prevent you from getting the correct output.
The main reason you don't get any output from the if test is that the condition is (presumably) using the wrong variable name.
// variable with the file data is called $email
$email = file_get_contents("$dirname/email.txt");
// test is checking $stremail which is never given a value
if ($stremail == $query) {
    echo($dirname);
}
There is also an issue with your scandir() and unset() combination. As you've discovered, scandir() basically gives you everything that a dir or ls would on the command line. Using unset() to remove specific files is problematic because you have to maintain a hardcoded list of files. However, unset() also leaves holes in your array: the count changes, but the original indices do not. This may be why you are using $folderamount + 10 in your loop. Take a look at this Stack Overflow question for more discussion of the problem:
Rebase array keys after unsetting elements
I recommend you read the PHP manual page on the glob() function, as it will greatly simplify getting the contents of a directory. In particular, take a look at the GLOB_ONLYDIR flag.
https://www.php.net/manual/en/function.glob.php
Lastly, don't increment your loop counter at the beginning of the loop when you're using the counter to read elements from an array. Take a look at the PHP manual page on foreach loops for a neater way to iterate over an array:
https://www.php.net/manual/en/control-structures.foreach.php
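Putting these suggestions together, a rough sketch (untested; it assumes email.txt sits directly inside each subdirectory, and trims a possible trailing newline before comparing) might look like this:
<?php
$query = $_GET["q"];

// GLOB_ONLYDIR skips downloader.php, viewer.php, etc. automatically,
// so no hardcoded unset() list is needed.
foreach (glob("*", GLOB_ONLYDIR) as $dirname) {
    $emailFile = "$dirname/email.txt";
    if (!is_file($emailFile)) {
        continue; // this subdirectory has no email.txt
    }
    $email = trim(file_get_contents($emailFile));
    if ($email == $query) {
        echo $dirname . "<br>";
    }
}
?>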

Random image from directory with no repeats?

I am successfully able to get random images from my 'uploads' directory with my code, but the issue is that images repeat. I will reload the page and the same image will show 2-15 times without changing. I thought about setting a cookie for the previous image, but working out how to do this is frying my brain. I'll post what I have here; any help would be great.
$files = glob($dir . '/*.*');
$file = array_rand($files);
$filename = $files[$file];
$search = array_search($_COOKIE['prev'], $files);
if ($_COOKIE['prev'] == $filename) {
    unset($files[$search]);
    $filename = $files[$file];
    setcookie('prev', $filename);
}
Similar to slick's answer, but a little simpler on the session front:
Instead of using array_rand to randomise the array, you can reorder it with a custom process based on just rand():
$files = array_values(glob($dir . '/*.*'));
$randomFiles = array();
while (count($files) > 0) {
    $randomIndex = rand(0, count($files) - 1);
    $randomFiles[] = $files[$randomIndex];
    unset($files[$randomIndex]);
    $files = array_values($files);
}
This is useful because you can seed the rand function, meaning it will always generate the same sequence of random numbers. Just add this before you randomise the array:
if (isset($_COOKIE['key'])) {
    $microtime = $_COOKIE['key'];
} else {
    $microtime = microtime(true);
    setcookie('key', $microtime);
}
srand((int) ($microtime * 1000000)); // srand() needs an integer seed
This does mean that someone can manipulate the order of the images by manipulating the cookie, but if you're okay with that, this should work.
So you want no repeats per request? Use a session. The best way to avoid repetitions is to have two arrays (buckets). The first one contains all the available elements you can pick from; the second starts empty.
Then start picking items from the first array, moving each one from the first array to the second (remove it, then array_push it onto the second). Do this in a loop. On the next iteration the first array won't contain the element you already picked, so you avoid duplicates.
In general: move items from one bucket to the other and you're done. Additionally, you can store the state in the session instead of cookies; server-side storage is better for this kind of thing.
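A rough sketch of that two-bucket idea (a sketch only, with hypothetical session key names; $dir is the uploads directory from the question, and shuffle() up front makes each array_pop() an effectively random pick):
session_start();

$files = glob($dir . '/*.*');

// First bucket: images not yet shown. (Re)fill it when empty or missing.
if (empty($_SESSION['remaining'])) {
    $_SESSION['remaining'] = $files;
    shuffle($_SESSION['remaining']);
    $_SESSION['shown'] = array(); // second bucket: images already used
}

// Move one item from the first bucket to the second.
$filename = array_pop($_SESSION['remaining']);
$_SESSION['shown'][] = $filename;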

What exactly are the benefits of using a PHP 5 DirectoryIterator over PHP 4 "opendir/readdir/closedir"?

What exactly are the benefits of using a PHP 5 DirectoryIterator
$dir = new DirectoryIterator(dirname(__FILE__));
foreach ($dir as $fileinfo)
{
    // handle what has been found
}
over a PHP 4 "opendir/readdir/closedir"
if ($handle = opendir(dirname(__FILE__)))
{
    while (false !== ($file = readdir($handle)))
    {
        // handle what has been found
    }
    closedir($handle);
}
besides the subclassing options that come with OOP?
To understand the difference between the two, let's write two functions that read contents of a directory into an array - one using the procedural method and the other object oriented:
Procedural, using opendir/readdir/closedir
function list_directory_p($dirpath) {
    if (!is_dir($dirpath) || !is_readable($dirpath)) {
        error_log(__FUNCTION__ . ": Argument should be a path to valid, readable directory (" . var_export($dirpath, true) . " provided)");
        return null;
    }
    $paths = array();
    $dir = realpath($dirpath);
    $dh = opendir($dir);
    while (false !== ($f = readdir($dh))) {
        if ("$f" != '.' && "$f" != '..') {
            $paths[] = "$dir" . DIRECTORY_SEPARATOR . "$f";
        }
    }
    closedir($dh);
    return $paths;
}
Object Oriented, using DirectoryIterator
function list_directory_oo($dirpath) {
    if (!is_dir($dirpath) || !is_readable($dirpath)) {
        error_log(__FUNCTION__ . ": Argument should be a path to valid, readable directory (" . var_export($dirpath, true) . " provided)");
        return null;
    }
    $paths = array();
    $dir = realpath($dirpath);
    $di = new DirectoryIterator($dir);
    foreach ($di as $fileinfo) {
        if (!$fileinfo->isDot()) {
            $paths[] = $fileinfo->getRealPath();
        }
    }
    return $paths;
}
Performance
Let's assess their performance first:
$start_t = microtime(true);
for ($i = 0; $i < $num_iterations; $i++) {
    $paths = list_directory_oo(".");
}
$end_t = microtime(true);
$time_diff_micro = (($end_t - $start_t) * 1000000) / $num_iterations;
echo "Time taken per call (list_directory_oo) = " . round($time_diff_micro / 1000, 2) . "ms (" . count($paths) . " files)\n";

$start_t = microtime(true);
for ($i = 0; $i < $num_iterations; $i++) {
    $paths = list_directory_p(".");
}
$end_t = microtime(true);
$time_diff_micro = (($end_t - $start_t) * 1000000) / $num_iterations;
echo "Time taken per call (list_directory_p) = " . round($time_diff_micro / 1000, 2) . "ms (" . count($paths) . " files)\n";
On my laptop (Win 7 / NTFS), the procedural method seems to be the clear winner:
C:\code>"C:\Program Files (x86)\PHP\php.exe" list_directory.php
Time taken per call (list_directory_oo) = 4.46ms (161 files)
Time taken per call (list_directory_p) = 0.34ms (161 files)
On an entry-level AWS machine (CentOS):
[~]$ php list_directory.php
Time taken per call (list_directory_oo) = 0.84ms (203 files)
Time taken per call (list_directory_p) = 0.36ms (203 files)
The above results are on PHP 5.4. You'll see similar results with PHP 5.3 and 5.2, and whether PHP is running under Apache or NGINX.
Code Readability
Although slower, code using DirectoryIterator is more readable.
File reading order
The order in which directory contents are read is exactly the same with either method. That is, if list_directory_oo returns array('h', 'a', 'g'), then list_directory_p also returns array('h', 'a', 'g').
Extensibility
The two functions above demonstrate performance and readability. Note that if your code needs to do further operations, code using DirectoryIterator is more extensible.
e.g. in the function list_directory_oo above, the $fileinfo object provides you with a bunch of methods such as getMTime(), getOwner(), isReadable(), etc. (the return values of most of which are cached and do not require extra system calls).
Therefore, depending on your use-case (that is, what you intend to do with each child element of the input directory), it's possible that code using DirectoryIterator performs as well as, or sometimes better than, code using opendir.
You can modify the code of list_directory_oo and test it yourself.
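For instance, here is a small variation of list_directory_oo (a sketch, not benchmarked) that collects some of that metadata while iterating:
function list_directory_oo_meta($dirpath) {
    $entries = array();
    $di = new DirectoryIterator(realpath($dirpath));
    foreach ($di as $fileinfo) {
        if ($fileinfo->isDot()) {
            continue;
        }
        // These SplFileInfo accessors are available on each iteration item.
        $entries[] = array(
            'path'     => $fileinfo->getRealPath(),
            'mtime'    => $fileinfo->getMTime(),
            'readable' => $fileinfo->isReadable(),
        );
    }
    return $entries;
}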
Summary
The decision of which to use depends entirely on the use-case.
If I were to write a cronjob in PHP which recursively scans a directory (and its subdirectories) containing thousands of files and does certain operations on them, I would choose the procedural method.
But if my requirement is to write a sort of web-interface to display uploaded files (say in a CMS) and their metadata, I would choose DirectoryIterator.
You can choose based on your needs.
Benefit 1: You can hide away all the boring details.
When using iterators you generally define them somewhere else, so real-life code would look something more like:
// ImageFinder is an abstraction over an Iterator
$images = new ImageFinder($base_directory);
foreach ($images as $image) {
    // application logic goes here.
}
The specifics of iterating through directories, sub-directories and filtering out unwanted items are all hidden from the application. That's probably not the interesting part of your application anyway, so it's nice to be able to hide those bits away somewhere else.
Benefit 2: What you do with the result is separated from obtaining the result.
In the above example, you could swap out that specific iterator for another iterator and you don't have to change what you do with the result at all. This makes the code a bit easier to maintain and add new features to later on.
A DirectoryIterator provides you with items that make sense in themselves. For example, DirectoryIterator::getPathname() returns all the information you need to access the file's contents.
The information that readdir() provides only makes sense locally, namely in combination with the parameter that you passed to opendir().
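In code, the contrast looks roughly like this:
// readdir() yields a bare name; the full path must be rebuilt by hand:
$fullpath = $dir . DIRECTORY_SEPARATOR . $file;

// DirectoryIterator hands back a self-contained path:
$fullpath = $fileinfo->getPathname();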
The DirectoryIterator is implemented in terms of wrappers around the php_stream_* functions, so no fundamentally different performance characteristics are to be expected. In particular, items from the directory are read only when they are requested. Details can be found in the file
ext/spl/spl_directory.c
of the PHP source code.
It's shorter, cleaner, and easier to type and read.
Re-read your examples: the first one is just "for each $fileinfo in $dir".
You write what you want, not how to get it.

dirname() X amount of times on path of file PHP

I need to do a dirname() on a file path multiple times to exclude sub-folders, like this:
dirname(dirname(dirname(__FILE__)));
The amount of times I need to do this on a file path is completely dynamic (not fixed) so I need to somehow do it variable $x amount of times...
I could do this:
$x = 6; // amount of sub-folders involved in the path
if ($x == 1) { dirname(__FILE__); }
elseif ($x == 2) { dirname(dirname(__FILE__)); }
elseif ($x == 3) { dirname(dirname(dirname(__FILE__))); }
elseif ($x == 4) { dirname(dirname(dirname(dirname(__FILE__)))); } // and so on.....
But that's not exactly a professional way of going about it, and it will never scale (what if $x = 9999999...?).
Does anyone know how I'd go about doing this?
You need to invoke the dirname function $x times; repeated invocation is called a loop:
$x = 6; // amount of sub-folders involved in the path
$dir = dirname(__FILE__);
while (max(0, --$x)) {
    $dir = dirname($dir);
}
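Worth noting: as of PHP 7.0, dirname() accepts a second $levels argument, so the whole loop collapses into a single call:
$x = 6;
$dir = dirname(__FILE__, $x); // PHP 7.0+ only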
Recursion is the answer my friend!
function go_up_x_times($path, $x) {
    if ($x <= 0) {
        return $path; // we're done, yay!
    }
    return dirname(go_up_x_times($path, $x - 1));
}

go_up_x_times(__FILE__, 5);

Crunch lots of files to generate stats file

I have a bunch of files I need to crunch, and I'm worried about scalability and speed.
The filename and file data (only the first line) are stored in an array in RAM, to create some statistics files later in the script.
The files must remain files and can't be put into a database.
The filenames are formatted in the following fashion:
Y-M-D-title.ext (where Y is year, M is month and D is day)
I'm currently using glob to list all the files and create my array.
Here is a sample of the code creating the array for "year" or "month" (it's used in a function with only one parameter -> $period):
[...]
function create_data_info($period = NULL) {
    $data = array();
    $files = glob(ROOT_DIR.'/'.'*.ext');
    $size = sizeOf($files);
    $existing_title = array(); // Used so we can handle the same title appearing twice at different dates.
    if (isSet($period)) {
        if ("year" === $period) {
            for ($i = 0; $i < $size; $i++) {
                $info = extract_info($files[$i], $existing_title);
                // Create the data array with all the data ordered by year/month/day
                $data[(int)$info[5]][] = $info;
                unset($info);
            }
        } elseif ("month" === $period) {
            for ($i = 0; $i < $size; $i++) {
                $info = extract_info($files[$i], $existing_title);
                $key = $info[5].$info[6];
                // Create the data array with all the data ordered by year/month/day
                $data[(int)$key][] = $info;
                unset($info);
            }
        }
    }
    [...]
}
function extract_info($file, &$existing) {
    $full_path_file = $file;
    $file = basename($file);
    $info_file = explode("-", $file, 4);
    $filetitle = explode(".", $info_file[3]);
    $info[0] = $filetitle[0];
    if (!isSet($existing[$info[0]])) {
        $existing[$info[0]] = -1;
    }
    $existing[$info[0]] += 1;
    if ($existing[$info[0]] > 0) {
        // We have already found a post with this title.
        // The creation of the cache is based on info[4] data for the filename,
        // so we need to tune it.
        $info[0] = $info[0]."-".$existing[$info[0]];
    }
    $info[1] = $info_file[3];
    $info[2] = $full_path_file;
    $post_content = file(ROOT_DIR.'/'.$file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $info[3] = $post_content[0]; // first line of the file
    unset($post_content);
    $info[4] = filemtime(ROOT_DIR.'/'.$file);
    $info[5] = $info_file[0]; // year
    $info[6] = $info_file[1]; // month
    $info[7] = $info_file[2]; // day
    return $info;
}
So in my script I only call create_data_info(PERIOD) (PERIOD being "year", "month", etc.).
It returns an array filled with the info I need, and then I can loop through it to create my statistics files.
This process is done every time the PHP script is launched.
My question is: is this code optimal (certainly not), and what can I do to squeeze some juice out of it?
I don't know how I can cache this (if it's even possible), as there is a lot of I/O involved.
I can change the tree structure if it would change anything compared to a flat structure, but from what I found in my tests, flat seems to be best.
I already thought about writing a little "booster" in C doing only the crunching, but since it's I/O bound, I don't think it would make a huge difference, and the application would be a lot less compatible for shared-hosting users.
Thank you very much for your input; I hope I was clear enough. Let me know if you need clarification (and forgive my English mistakes).
To begin with, you should use DirectoryIterator instead of the glob function. When it comes to scandir vs opendir vs glob, glob is as slow as it gets.
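For example, the listing step could look like this with DirectoryIterator (a sketch using the ROOT_DIR constant and the *.ext extension from your code; getExtension() needs PHP 5.3.6+):
$files = array();
foreach (new DirectoryIterator(ROOT_DIR) as $fileinfo) {
    // keep only regular files with the expected extension
    if ($fileinfo->isFile() && $fileinfo->getExtension() === 'ext') {
        $files[] = $fileinfo->getPathname();
    }
}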
Also, when you are dealing with a large number of files you should try to do all your processing inside one loop; PHP function calls are rather slow.
I see you are using unset($info), yet on every loop iteration $info gets a new value. PHP does its own garbage collection, if that's your concern. unset is a language construct, not a function, and should be pretty fast, but when it's not needed it still makes the whole thing a bit slower.
You are passing $existing by reference. Is there a practical reason for this? In my experience, references make things slower.
And lastly, your script seems to do a lot of string processing. You might want to consider some kind of "serialize and base64 encode/decode" caching solution, but you should benchmark it specifically; it might be faster or slower depending on your whole code. (My thinking is that serialize/unserialize MIGHT run faster, since these are native PHP functions, while custom string-processing functions are slower.)
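As a rough illustration of the caching idea (hypothetical cache file name, untested; note that the directory-mtime check only catches added, removed, or renamed files, not edited ones):
$cacheFile = ROOT_DIR . '/.stats_cache'; // hypothetical cache path

// Reuse the cache while it is newer than the directory's mtime.
if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime(ROOT_DIR)) {
    $data = unserialize(file_get_contents($cacheFile));
} else {
    $data = create_data_info("year"); // the expensive crunch from the question
    file_put_contents($cacheFile, serialize($data));
}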
My answer was not very I/O-related, but I hope it was helpful.
