Is it possible to speed up a recursive file scan in PHP?

I've been trying to replicate GNU find ("find .") in PHP, but it seems impossible to get even close to its speed. The PHP implementations take at least twice as long as find. Are there faster ways of doing this in PHP?
EDIT: I added a code example using the SPL implementation -- its performance is equal to the iterative approach
EDIT2: When calling find from PHP it was actually slower than the native PHP implementation. I guess I should be satisfied with what I've got :)
// measured to 317% of gnu find's speed when run directly from a shell
function list_recursive($dir) {
if ($dh = opendir($dir)) {
while (false !== ($entry = readdir($dh))) {
if ($entry == '.' || $entry == '..') continue;
$path = "$dir/$entry";
echo "$path\n";
if (is_dir($path)) list_recursive($path);
}
closedir($dh);
}
}
// measured to 315% of gnu find's speed when run directly from a shell
function list_iterative($from) {
$dirs = array($from);
while (NULL !== ($dir = array_pop($dirs))) {
if ($dh = opendir($dir)) {
while (false !== ($entry = readdir($dh))) {
if ($entry == '.' || $entry == '..') continue;
$path = "$dir/$entry";
echo "$path\n";
if (is_dir($path)) $dirs[] = $path;
}
closedir($dh);
}
}
}
// measured to 315% of gnu find's speed when run directly from a shell
function list_recursivedirectoryiterator($path) {
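// RecursiveDirectoryIterator only descends when wrapped in RecursiveIteratorIterator
$it = new RecursiveIteratorIterator(
new RecursiveDirectoryIterator($path, FilesystemIterator::SKIP_DOTS),
RecursiveIteratorIterator::SELF_FIRST
);
foreach ($it as $file) {
echo $file->getPathname(), "\n";
}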
}
// measured to 390% of gnu find's speed when run directly from a shell
function list_gnufind($dir) {
$dir = escapeshellarg($dir); // quote the path as a single shell argument
$h = popen("/usr/bin/find $dir", "r");
while ('' != ($s = fread($h, 2048))) {
echo $s;
}
pclose($h);
}

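I'm not sure if the performance is better, but you could use a recursive directory iterator to make your code simpler. See RecursiveDirectoryIterator and SplFileInfo.
// Wrap the directory iterator so that subdirectories are actually traversed;
// SKIP_DOTS removes the "." and ".." entries.
$it = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($from, FilesystemIterator::SKIP_DOTS),
    RecursiveIteratorIterator::SELF_FIRST
);
foreach ($it as $file)
{
    echo $file->getPathname(), "\n";
}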

Before you start changing anything, profile your code.
Use something like Xdebug (plus kcachegrind for a pretty graph) to find out where the slow parts are. If you start changing things blindly, you won't get anywhere.
My only other advice is to use the SPL directory iterators as posted already. Letting the internal C code do the work is almost always faster.

PHP just cannot perform as fast as C, plain and simple.

Why would you expect the interpreted PHP code to be as fast as the compiled C version of find? Being only twice as slow is actually pretty good.
About the only advice I would add is to call ob_start() at the beginning and ob_get_contents() and ob_end_clean() at the end. That might speed things up.
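A minimal sketch of the output-buffering idea, assuming the listing is produced by a callback. The helper name capture_output() is hypothetical and not part of the original answer; whether buffering actually helps depends on the SAPI and should be measured.

```php
<?php
// Hypothetical helper: runs $producer while buffering everything it echoes,
// then returns the captured text so the caller can emit it in one write.
function capture_output(callable $producer): string
{
    ob_start();
    try {
        $producer();
        return (string) ob_get_contents();
    } finally {
        ob_end_clean(); // discard the buffer; the contents were already copied
    }
}

// Usage: the three echo calls end up in a single string.
$listing = capture_output(function () {
    foreach (['a', 'b', 'c'] as $name) {
        echo "./$name\n";
    }
});
echo $listing;
```

In a real scan the callback would be one of the list_* functions from the question.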

You're keeping N directory streams open where N is the depth of the directory tree. Instead, try reading an entire directory's worth of entries at once, and then iterate over the entries. At the very least you'll maximize use of the disk I/O caches.

You might want to seriously consider just using GNU find. If it's available, and safe mode isn't turned on, you'll probably like the results just fine:
function list_recursive($dir) {
$dir = escapeshellarg($dir); // quote the path as a single shell argument
$h = popen("/usr/bin/find $dir -type f", "r");
while ($s = fgets($h,1024)) {
echo $s;
}
pclose($h);
}
However, there might be some directory that's so big that you're not going to want to bother with this either. Consider amortizing the slowness in other ways. Your second try can be checkpointed (for example) by simply saving the directory stack in the session. If you're giving the user a list of files, simply collect a pageful, then save the rest of the state in the session for page 2.
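The checkpointing idea could look roughly like this. It is a sketch under assumptions: scan_page() is a hypothetical name, and because a directory is always finished once it has been started, a page can contain more than $limit paths.

```php
<?php
// Hypothetical helper: resumes an iterative scan from a saved directory stack
// and stops starting new directories once $limit paths have been collected.
// Returns [paths found on this page, directories still left to scan].
function scan_page(array $stack, int $limit): array
{
    $found = [];
    while ($stack && count($found) < $limit) {
        $dir = array_pop($stack);
        $entries = @scandir($dir);
        if ($entries === false) {
            continue; // unreadable directory: skip it
        }
        foreach ($entries as $entry) {
            if ($entry === '.' || $entry === '..') {
                continue;
            }
            $path = "$dir/$entry";
            $found[] = $path;
            if (is_dir($path)) {
                $stack[] = $path; // scanned on this page or a later one
            }
        }
    }
    return [$found, $stack];
}

// In a web request the remaining stack would be stored between pages, e.g.
// $_SESSION['scan_stack'] = $stack; and read back for page 2.
```

The first call would receive array($from) as the stack; an empty returned stack means the scan is complete.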

Try using scandir() to read a whole directory at once, as Jason Cohen has suggested. I've based the following code on code from the PHP manual comments for scandir():
function scan($dir) {
    // Read the whole directory in one call and drop the dot entries
    $entries = array_diff(scandir($dir), array(".", ".."));
    foreach ($entries as $entry) {
        $path = $dir . "/" . $entry;
        if (is_dir($path)) {
            scan($path); // descend into subdirectories
        } else {
            echo $path . "\n";
        }
    }
}

Related

How to perform a statement inside a query to check existence of a file that has the query id as name with less resources in laravel

I'm looking for a way to check whether the query 'id' has a folder with the same ID as its name in the file system. I did it, but it will slow down the drive in the future when there are lots of files:
$query = Model::all();
if(Input::get('field') == 'true'){
$filenames = scandir('img/folders');
$query->whereIn('id', $filenames);
}
As you can see, this scans the 'folders' directory and builds an array with the names of all the folders inside it. My app is going to have hundreds of thousands of folders in the future, and I would like to resolve this before it becomes a problem. Thanks for any help.
PS: Other suggestions for doing this differently are welcome.
Do you have good reason to believe that scandir on a directory with a large number of folders will actually slow you down?
You can do your query like this:
if(Input::has('field')){
$filenames = scandir('img/folders');
$query = Model::whereIn('id', $filenames)->get();
}
Edit 1
You may find these links useful:
PHP: scandir() is too slow
Get the Files inside a directory
Edit 2
There are some really good suggestions in the links, which you should be able to use as guidance for your own implementation. As I see it, based on the links from my first edit, your options are to use DirectoryIterator, readdir, or chunking with scandir.
This is a very basic way of doing it but I guess you could do something with readdir like this:
$ids = Model::lists('id');
$matches = [];
if($handle = opendir('path/to/folders'))
{
while (($entry = readdir($handle)) !== false)
{
if(count($ids) === 0)
{
break;
}
if ($entry != "." && $entry != "..")
{
foreach ($ids as $key => $value)
{
if($value === $entry)
{
$matches[] = $entry;
unset($ids[$key]);
}
}
}
}
closedir($handle);
}
return $matches;
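A variation on the loop above, offered as a sketch rather than a drop-in replacement: flipping the ID list into array keys turns the inner foreach into a single isset() lookup per directory entry. The function name match_ids_to_folders() is hypothetical, and the IDs are cast to strings on the assumption that folder names are the decimal form of the ID.

```php
<?php
// Hypothetical helper: returns the directory entries whose names equal one of $ids.
function match_ids_to_folders(array $ids, string $dir): array
{
    // Keys are the IDs as strings, so each entry needs one isset() lookup.
    $wanted = array_flip(array_map('strval', $ids));
    $matches = [];

    $handle = @opendir($dir);
    if ($handle === false) {
        return $matches; // directory missing or unreadable
    }
    // Stop early once every ID has been matched.
    while ($wanted && ($entry = readdir($handle)) !== false) {
        if (isset($wanted[$entry])) {
            $matches[] = $entry;
            unset($wanted[$entry]);
        }
    }
    closedir($handle);
    return $matches;
}
```

The returned names can then be passed to whereIn() as in the earlier snippet.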

Optimising php file reading code

I have the following which is fairly slow. How can I speed it up?
(it scans a directory and makes headers out of the foldernames and retrieves the pdf files from within and adds them to lists)
$directories= array_diff(scandir("../pdfArchive/subfolder", 0), array('..', '.'));
foreach ($directories as $v) {
echo "<h3>".$v."</h3>";
$current = array_diff(scandir("../pdfArchive/subfolder/".$v, 0), array('..', '.'));
echo "<ul style=\"list-style-image: url(/images/pdf.gif); margin-left: 20px;\">";
foreach ($current as $vone) {
echo "<li><a target=\"_blank\" href=\"../pdfArchive/subfolder/".$v."/".$vone."\">".str_replace(".pdf", "", $vone)."</a>";
echo "</li><br>";
}
echo "</ul>";
}
Don't use array_diff() to filter out the current and parent directories. Use something like DirectoryIterator or glob(), and test for . and .. with an if statement.
glob() has a flag (GLOB_ONLYDIR) that lets you retrieve only directories for your loops.
Profile your code to see exactly which lines or functions are executing slowly.
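The glob() suggestion might be sketched as follows. The function name list_pdfs_by_folder() is hypothetical; note that the * pattern skips dot entries (and hidden files), so no array_diff() is needed, and that this has not been benchmarked against the scandir() version.

```php
<?php
// Hypothetical helper: maps each subdirectory name of $base to the names of
// the PDF files directly inside it.
function list_pdfs_by_folder(string $base): array
{
    $result = [];
    // GLOB_ONLYDIR returns only directories; "?: []" covers a false return.
    $dirs = glob($base . '/*', GLOB_ONLYDIR) ?: [];
    foreach ($dirs as $dir) {
        $pdfs = glob($dir . '/*.pdf') ?: [];
        $result[basename($dir)] = array_map('basename', $pdfs);
    }
    return $result;
}
```

The HTML from the question can then be generated by looping over the returned array.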
I'm not sure how fast array_diff() is when the array is very large, isn't it faster to simply add a separate check and make sure that '.' and '..' is not the returned name?
Other than that, I can't see there being anything really wrong.
What did you test to consider the current approach slow?
Here is a snippet of code I use that I adapted from php.net. It is very basic and goes through a given directory and lists the files contained within.
// The @ suppresses any errors, $dir is the directory path
if (($handle = @opendir($dir)) != FALSE) {
// Loop over directory contents
while (($file = readdir($handle)) !== FALSE) {
// We don't want the current directory (.) or parent (..)
if ($file != "." && $file != "..") {
var_dump($file);
if (!is_dir($dir . $file)) {
// $file is really a file
} else {
// $file is a directory
}
}
}
closedir($handle);
} else {
// Deal with it
}
You may adapt this further to recurse over subdirectories by using is_dir to identify folders as I have shown above.

PHP - fastest way to find if directory has children?

I'm building a file browser, and I need to know if a directory has children (but not how many or what type).
What's the most efficient way to find if a directory has children? glob()? scandir() it? Check its tax records?
Edit
It seems I was misunderstood, although I thought I was pretty clear. I'll try to restate my question.
What is the most efficient way to know if a directory is not empty? I'm basically looking for a boolean answer - NOT EMPTY or EMPTY.
I don't need to know:
how many files are in the directory
what the files are
when they were modified
etc.
I do need to know:
does the directory have any files in it at all
efficiently.
I think this is very efficient:
function dir_contains_children($dir) {
$result = false;
if($dh = opendir($dir)) {
while(!$result && ($file = readdir($dh)) !== false) {
$result = $file !== "." && $file !== "..";
}
closedir($dh);
}
return $result;
}
It stops listing the directory's contents as soon as a file or directory is found (not counting . and ..).
You could use 'find' to list all empty directories in one step:
exec("find '$dir' -maxdepth 1 -empty -type d",$out,$ret);
print_r($out);
It's not "pure" PHP, but it's simple and fast.
This should do, easy, quick and effective.
<?php
function dir_is_empty($dir) {
$dirItems = count(scandir($dir));
if($dirItems > 2) return false;
else return true;
}
?>
Unfortunately, each solution so far has lacked the brevity and elegance necessary to shine above the rest.
So, I was forced to homebrew a solution myself, which I'll be implementing until something better pops up:
if(count(glob($dir."/*"))) {
echo "NOT EMPTY";
}
Still not sure of the efficiency of this compared to other methods, which was the original question.
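Another short option, shown here as a sketch with a hypothetical function name: FilesystemIterator skips . and .. by default, so asking whether the freshly constructed iterator is valid answers the EMPTY / NOT EMPTY question without building an array of every entry. Unlike the glob() pattern above, it also counts hidden files. The constructor throws an UnexpectedValueException if the directory cannot be opened.

```php
<?php
// Hypothetical helper: true if $dir contains at least one entry besides "." and "..".
function dir_has_entries(string $dir): bool
{
    // With the default SKIP_DOTS flag the iterator starts on the first real
    // entry, so valid() is false only for an empty directory.
    return (new FilesystemIterator($dir))->valid();
}
```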
I wanted to expand vstm's answer - Check only for child directories (and not files):
/**
* Check if directory contains child directories.
*/
function dir_contains_children_dirs($dir) {
$result = false;
if($dh = opendir($dir)) {
while (!$result && ($file = readdir($dh)) !== false) {
$result = $file !== "." && $file !== ".." && is_dir($dir.'/'.$file);
}
closedir($dh);
}
return $result;
}

Optimize PHP function

I have a function that detects all files whose names start with a given string and returns an array of the matching files, but it is starting to get slow because I have around 20000 files in a particular directory.
I need to optimize this function, but I just can't see how. This is the function:
function DetectPrefix ($filePath, $prefix)
{
$dh = opendir($filePath);
while (false !== ($filename = readdir($dh)))
{
$posIni = strpos( $filename, $prefix);
if ($posIni===0):
$files[] = $filename;
endif;
}
if (count($files)>0){
return $files;
} else {
return null;
}
}
What more can I do?
Thanks
http://php.net/glob
$files = glob('/file/path/prefix*');
Wikipedia breaks uploads up by the first couple letters of their filenames, so excelfile.xls would go in a directory like /uploads/e/x while textfile.txt would go in /uploads/t/e.
Not only does this reduce the number of files glob (or any other approach) has to sort through, but it avoids the maximum files in a directory issue others have mentioned.
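The directory layout described above could be computed along these lines. This is only a sketch of the scheme as stated in this answer: sharded_path() is a hypothetical name, and replacing non-alphanumeric or missing characters with an underscore is an assumption made here so that every file gets a two-level path.

```php
<?php
// Hypothetical helper: builds "<root>/<1st char>/<2nd char>/<filename>".
function sharded_path(string $root, string $filename): string
{
    $name = strtolower($filename);
    $parts = [];
    for ($i = 0; $i < 2; $i++) {
        $char = $name[$i] ?? '';
        // Assumption: anything other than a letter or digit maps to "_".
        $parts[] = ctype_alnum($char) ? $char : '_';
    }
    return rtrim($root, '/') . '/' . implode('/', $parts) . '/' . $filename;
}

echo sharded_path('/uploads', 'excelfile.xls'), "\n";
```

A prefix search then only has to look inside the one subdirectory that matches the first two characters of the prefix.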
You could use scandir() to list the files in the directory, instead of iterating through them one-by-one using readdir(). scandir() returns an array of the files.
However, it'd be better if you could change your file system organization - do you really need to store 20000+ files in a single directory?
As the other answers mention, I'd look at glob(), scandir(), and/or the DirectoryIterator class; there is no need to reinvent the wheel.
However, watch out! Check your operating system: there may be a limit on the maximum number of files in a single directory. If that is the case and you keep adding files to the same directory, you will have some downtime, and some problems, when you reach the limit. The error will probably appear as a permissions or write failure rather than an obvious "you can't write more files in a single directory" message.
I'm not sure but probably DirectoryIterator is a bit faster. Also add caching so that list gets generated only when files are added or deleted.
You just need to compare the first length of prefix characters. So try this:
function DetectPrefix($filePath, $prefix) {
$dh = opendir($filePath);
$len = strlen($prefix);
$files = array();
while (false !== ($filename = readdir($dh))) {
if (substr($filename, 0, $len) === $prefix) {
$files[] = $filename;
}
}
if (count($files)) {
return $files;
} else {
return null;
}
}

How to check if directory contents has changed with PHP?

I'm writing a photo gallery script in PHP and have a single directory where the user will store their pictures. I'm attempting to set up page caching and have the cache refresh only if the contents of the directory has changed. I thought I could do this by caching the last modified time of the directory using the filemtime() function and compare it to the current modified time of the directory. However, as I've come to realize, the directory modified time does not change as files are added or removed from that directory (at least on Windows, not sure about Linux machines yet).
So my question is, what is the simplest way to check if the contents of a directory have been modified?
As already mentioned by others, a better way to solve this would be to trigger a function when particular events happen, that changes the folder.
However, if your server is a unix, you can use inotifywait to watch the directory, and then invoke a PHP script.
Here's a simple example:
#!/bin/sh
inotifywait --recursive --monitor --quiet --event modify,create,delete,move --format '%f' /path/to/directory/to/watch |
while read -r FILE ; do
php /path/to/trigger.php "$FILE"
done
See also: http://linux.die.net/man/1/inotifywait
What about touching the directory after a user has submitted his image?
The touch() changelog says that changing a directory's modification time on Windows requires PHP 5.3, but I think it should work on all other environments.
Using inotifywait from within PHP:
$watchedDir = 'watch';
$in = popen("inotifywait --monitor --quiet --format '%e %f' --event create,moved_to '$watchedDir'", 'r');
if ($in === false)
throw new Exception ('fail start notify');
while (($line = fgets($in)) !== false)
{
list($event, $file) = explode(' ', rtrim($line, PHP_EOL), 2);
echo "$event $file\n";
}
Uh. I'd simply store the md5 of a directory listing. If the contents change, the md5(directory-listing) will change. You might get the very occasional md5 clash, but I think that chance is tiny enough..
Alternatively, you could store a little file in that directory that contains the "last modified" date. But I'd go with md5.
PS. on second thought, seeing as how you're looking at performance (caching) requesting and hashing the directory listing might not be entirely optimal..
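A minimal sketch of the md5-of-listing idea, assuming a single flat directory as in the question. The function name directory_fingerprint() is hypothetical. It only notices files being added, removed or renamed, not changes to a file's contents.

```php
<?php
// Hypothetical helper: hashes the sorted list of entry names in $dir.
function directory_fingerprint(string $dir): string
{
    $entries = scandir($dir); // sorted alphabetically by default
    if ($entries === false) {
        throw new RuntimeException("Cannot read directory: $dir");
    }
    return md5(implode("\n", $entries));
}
```

The cache would store the fingerprint next to the cached page and regenerate the page when a fresh fingerprint differs from the stored one.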
IMO edubem's answer is the way to go, however you can do something like this:
if (sha1(serialize(Map('/path/to/directory/', true))) != /* previous stored hash */)
{
// directory contents has changed
}
Or a weaker but faster version:
if (Size('/path/to/directory/', true) != /* previous stored size */)
{
// directory contents has changed
}
Here are the functions used:
function Map($path, $recursive = false)
{
$result = array();
if (is_dir($path) === true)
{
$path = Path($path);
$files = array_diff(scandir($path), array('.', '..'));
foreach ($files as $file)
{
if (is_dir($path . $file) === true)
{
$result[$file] = ($recursive === true) ? Map($path . $file, $recursive) : Size($path . $file, true);
}
else if (is_file($path . $file) === true)
{
$result[$file] = Size($path . $file);
}
}
}
else if (is_file($path) === true)
{
$result[basename($path)] = Size($path);
}
return $result;
}
function Size($path, $recursive = true)
{
$result = 0;
if (is_dir($path) === true)
{
$path = Path($path);
$files = array_diff(scandir($path), array('.', '..'));
foreach ($files as $file)
{
if (is_dir($path . $file) === true)
{
$result += ($recursive === true) ? Size($path . $file, $recursive) : 0;
}
else if (is_file($path . $file) === true)
{
$result += sprintf('%u', filesize($path . $file));
}
}
}
else if (is_file($path) === true)
{
$result += sprintf('%u', filesize($path));
}
return $result;
}
function Path($path)
{
if (file_exists($path) === true)
{
$path = rtrim(str_replace('\\', '/', realpath($path)), '/');
if (is_dir($path) === true)
{
$path .= '/';
}
return $path;
}
return false;
}
Here's what you may try. Store all pictures in a single directory (or in /username subdirectories inside it to speed things up and to lessen the stress on the FS) and set up Apache (or whatever you're using) to serve them as static content with "expires-on" set to 100 years in the future. File names should contain some unique prefix or suffix (timestamp, SHA1 hash of file content, etc.), so whenever a user changes a file, its name changes and Apache will serve the new version, which will get cached along the way.
You're thinking the wrong way.
You should execute your directory indexer script as soon as someone's uploaded a new file and it's moved to the target location.
Try deleting the cached version when a user uploads a file to his directory.
When someone tries to view the gallery, look if there's a cached version first. If there's a cached version, load it, otherwise, generate the page, cache it, done.
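The two steps above (serve the cached page if it exists, delete it on upload) can be sketched like this. Both function names and the idea of keeping the page in a single cache file are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the cached gallery page, generating and
// storing it first when no cached copy exists.
function gallery_page(string $cacheFile, callable $render): string
{
    if (is_file($cacheFile)) {
        return (string) file_get_contents($cacheFile);
    }
    $html = $render();
    file_put_contents($cacheFile, $html, LOCK_EX);
    return $html;
}

// Hypothetical helper: call this right after an upload has been moved into
// the gallery directory, so the next view regenerates the page.
function invalidate_gallery_cache(string $cacheFile): void
{
    if (is_file($cacheFile)) {
        unlink($cacheFile);
    }
}
```

With this arrangement the directory never has to be scanned just to find out whether it changed.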
I was looking for something similar and I just found this:
http://www.franzone.com/2008/06/05/php-script-to-monitor-ftp-directory-changes/
For me looks like a great solution since I'll have a lot of control (I'll be doing an AJAX call to see if anything changed).
Hope that this helps.
Here is a code sample, that would return 0 if the directory was changed.
I use it in backups.
The changed status is determined by presence of files and their filesizes.
You could easily change this, to compare file contents by replacing
$longString .= filesize($file);
with
$longString .= crc32(file_get_contents($file));
but it will affect execution speed.
#!/usr/bin/php
<?php
$dirName = $argv[1];
$basePath = '/var/www/vhosts/majestichorseporn.com/web/';
$dataFile = './backup_dir_if_changed.dat';
# startup checks
if (!is_writable($dataFile))
die($dataFile . ' is not writable!');
if (!is_dir($basePath . $dirName))
die($basePath . $dirName . ' is not a directory');
$dataFileContent = file_get_contents($dataFile);
$data = @unserialize($dataFileContent);
if ($data === false)
$data = array();
# find all files and concatenate their sizes to calculate crc32
$files = glob($basePath . $dirName . '/*', GLOB_BRACE);
$longString = '';
foreach ($files as $file) {
$longString .= filesize($file);
}
$longStringHash = crc32($longString);
# do changed check
if (isset ($data[$dirName]) && $data[$dirName] == $longStringHash)
die('Directory did not change.');
# save hash to DB
$data[$dirName] = $longStringHash;
file_put_contents($dataFile, serialize($data));
die('0');
