I am brushing up on my non-framework object-oriented PHP and decided to do a test. Unfortunately, while I understand the concept of calling methods on a class, this particular test is slightly more complicated, and I don't know the terminology for this type of situation.
Test
Create a PHP class that parses an unknown number of text files in a folder, allows you to extract the total value of the amount field from each file, and returns the filenames of the parsed files.
File Format:
The files are plain-text CSV files. Let's assume the files contain a list of payments changed in the last N days. There are two different types of line:
Card payment collected - type = 1, date, order id, amount
Card payment rejected - type = 2, date, order id, reason, amount
Example file:
1,20090313,542,11.99
1,20090313,543,9.99
2,20090312,500,some reason, 2.99
Usage Example:
The usage could be something like this:
$parser = new Parser(...);
$files = $parser->getFiles();
foreach ($files as $file) {
    $filename = $file->getFileName();
    $amount_collected = $file->getTotalAmount(...);
    $amount_rejected = $file->getTotalAmount(...);
}
My question is:
How can you do $file->method() when the class is called Parser? I'm guessing you return an object from the getFiles method in the Parser class, but how can you run methods on the returned object?
I attempted to Google this, but as I don't know the terminology for this situation I didn't find anything.
Any help is much appreciated, even if it's just what the terminology for this situation is.
The returned object could be of a class ParserFile with getFileName and getTotalAmount methods. This approach would be quite close to the Factory pattern, and it could be a good idea to make the getFiles method static, so it is callable without instantiating the Parser class itself.
class ParserFile {
    public function getFilename() { /* whatever */ }
    public function getTotalAmount() { /* whatever */ }
}
class Parser {
    public static function getFiles() {
        // loop through the available files
        // and store them in some $arr
        $arr[] = new ParserFile('filename1.txt');
        $arr[] = new ParserFile('filename2.txt');
        return $arr;
    }
}
$files = Parser::getFiles();
foreach ($files as $file) {
    $filename = $file->getFilename();
    $amount_collected = $file->getTotalAmount();
    $amount_rejected = $file->getTotalAmount();
}
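To make the stub concrete, here is a minimal sketch of ParserFile's internals, assuming the CSV format from the test; the type constants and the str_getcsv-based parsing are my assumptions, not part of the original answer:

class ParserFile {
    const TYPE_COLLECTED = 1; // card payment collected
    const TYPE_REJECTED  = 2; // card payment rejected

    private $filename;

    public function __construct($filename) {
        $this->filename = $filename;
    }

    public function getFilename() {
        return $this->filename;
    }

    // Sums the amount field for lines of the given type.
    // In both line formats the amount is the last field.
    public function getTotalAmount($type) {
        $total = 0.0;
        foreach (file($this->filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
            $fields = str_getcsv($line);
            if ((int)$fields[0] === $type) {
                $total += (float)end($fields);
            }
        }
        return $total;
    }
}

The two calls in the loop would then be $file->getTotalAmount(ParserFile::TYPE_COLLECTED) and $file->getTotalAmount(ParserFile::TYPE_REJECTED), which also resolves the (...) placeholders from the usage example.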
I'm pretty sure this is not the best design, though. Another approach would be:
$parser = new Parser();
$files = $parser->getFiles();
foreach ($files as $file) {
    $filename = $parser->getFilename($file);
    $amount_collected = $parser->getTotalAmount($file);
    $amount_rejected = $parser->getTotalAmount($file);
}
So you'll get the array of files into $files, but when you want to parse these files you'll ask $parser to do it for you by passing the current $file to its methods.
There's no 100% correct solution, I guess; just use what works best for you. If you then encounter problems: profile, benchmark, and refactor.
Hope that helped. Cheers :)
P.S. Hope this isn't homework :D
Related
Here's my recent result for recursively listing a user directory.
I use the results to build a file manager (screenshots omitted; source: ddlab.de).
Therefore I need two separate arrays, one for the directory tree, the other for the files.
I'm only showing the PHP solution here, as I am currently still working on the JavaScript for usability. The array keys contain all the info currently needed. The key "tree" is used to get an ID for the folders as well as a CLASS for the files (using jQuery: show the files which are related to the active folder and hide those which are not), and so on.
The folder list is a UL/LI; the files section is a sortable table which includes a "show all files" function, where all files are listed, sortable as well, with path info.
The function
function build_tree($dir, $deep = 0, $tree = '/', &$arr_folder = array(), &$arr_files = array()) {
    $dir = rtrim($dir, '/').'/'; // not really necessary if the 1st function call is clean
    $handle = opendir($dir);
    while (false !== ($file = readdir($handle))) // explicit test, so an entry named "0" doesn't end the loop
    {
        if ($file != "." && $file != "..")
        {
            if (is_dir($dir.$file))
            {
                $deep++;
                $tree_pre = $tree; // remember for reset
                $tree = $tree.$file.'/'; // builds something like "/", "/sub1/", "/sub1/sub2/"
                $arr_folder[$tree] = array('tree' => $tree, 'deep' => $deep, 'file' => $file);
                build_tree($dir.$file, $deep, $tree, $arr_folder, $arr_files); // recursive function call
                $tree = $tree_pre; // reset to go back to upper levels
                $deep--; // reset to go back to upper levels
            }
            else
            {
                $arr_files[$file.'.'.$tree] = array('tree' => $tree, 'file' => $file, 'filesize' => filesize($dir.$file), 'filemtime' => filemtime($dir.$file));
            }
        }
    }
    closedir($handle);
    return array($arr_folder, $arr_files); // a function cannot return two separate arrays, so wrap both in one
}
Calling the function
$build_tree = build_tree($udir); // 1st function call, $udir is my user directory
Get the arrays separated
$arr_folder = $build_tree[0]; // separate the two arrays
$arr_files = $build_tree[1]; // separate the two arrays
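On PHP 7.1 and later, short list syntax unpacks both arrays in one step (just a convenience; list() does the same on older versions):

list($arr_folder, $arr_files) = build_tree($udir); // any PHP 5
[$arr_folder, $arr_files] = build_tree($udir);     // PHP 7.1+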
See the results:
print_r($arr_folder);
print_r($arr_files);
It works like a charm.
Whoever needs something like this, have fun with it.
I promise to post the entire code when it's finished :-)
What exactly are the benefits of using a PHP 5 DirectoryIterator
$dir = new DirectoryIterator(dirname(__FILE__));
foreach ($dir as $fileinfo)
{
    // handle what has been found
}
over a PHP 4 "opendir/readdir/closedir"
if ($handle = opendir(dirname(__FILE__)))
{
    while (false !== ($file = readdir($handle)))
    {
        // handle what has been found
    }
    closedir($handle);
}
besides the subclassing options that come with OOP?
To understand the difference between the two, let's write two functions that read the contents of a directory into an array, one using the procedural method and the other object-oriented:
Procedural, using opendir/readdir/closedir
function list_directory_p($dirpath) {
    if (!is_dir($dirpath) || !is_readable($dirpath)) {
        error_log(__FUNCTION__ . ": Argument should be a path to valid, readable directory (" . var_export($dirpath, true) . " provided)");
        return null;
    }
    $paths = array();
    $dir = realpath($dirpath);
    $dh = opendir($dir);
    while (false !== ($f = readdir($dh))) {
        if ($f != '.' && $f != '..') {
            $paths[] = $dir . DIRECTORY_SEPARATOR . $f;
        }
    }
    closedir($dh);
    return $paths;
}
Object Oriented, using DirectoryIterator
function list_directory_oo($dirpath) {
    if (!is_dir($dirpath) || !is_readable($dirpath)) {
        error_log(__FUNCTION__ . ": Argument should be a path to valid, readable directory (" . var_export($dirpath, true) . " provided)");
        return null;
    }
    $paths = array();
    $dir = realpath($dirpath);
    $di = new DirectoryIterator($dir);
    foreach ($di as $fileinfo) {
        if (!$fileinfo->isDot()) {
            $paths[] = $fileinfo->getRealPath();
        }
    }
    return $paths;
}
Performance
Let's assess their performance first:
$num_iterations = 1000; // any reasonably large sample size will do

$start_t = microtime(true);
for ($i = 0; $i < $num_iterations; $i++) {
    $paths = list_directory_oo(".");
}
$end_t = microtime(true);
$time_diff_micro = (($end_t - $start_t) * 1000000) / $num_iterations;
echo "Time taken per call (list_directory_oo) = " . round($time_diff_micro / 1000, 2) . "ms (" . count($paths) . " files)\n";

$start_t = microtime(true);
for ($i = 0; $i < $num_iterations; $i++) {
    $paths = list_directory_p(".");
}
$end_t = microtime(true);
$time_diff_micro = (($end_t - $start_t) * 1000000) / $num_iterations;
echo "Time taken per call (list_directory_p) = " . round($time_diff_micro / 1000, 2) . "ms (" . count($paths) . " files)\n";
On my laptop (Win 7 / NTFS), the procedural method seems to be the clear winner:
C:\code>"C:\Program Files (x86)\PHP\php.exe" list_directory.php
Time taken per call (list_directory_oo) = 4.46ms (161 files)
Time taken per call (list_directory_p) = 0.34ms (161 files)
On an entry-level AWS machine (CentOS):
[~]$ php list_directory.php
Time taken per call (list_directory_oo) = 0.84ms (203 files)
Time taken per call (list_directory_p) = 0.36ms (203 files)
The results above are from PHP 5.4. You'll see similar results with PHP 5.3 and 5.2, and whether PHP is running under Apache or NGINX.
Code Readability
Although slower, code using DirectoryIterator is more readable.
File reading order
The order in which directory contents are read is exactly the same with either method. That is, if list_directory_oo returns array('h', 'a', 'g'), list_directory_p also returns array('h', 'a', 'g').
Extensibility
The two functions above demonstrate performance and readability. Note that if your code needs to do further operations on each entry, code using DirectoryIterator is more extensible.
For example, in list_directory_oo above, the $fileinfo object provides a bunch of methods such as getMTime(), getOwner(), isReadable() etc. (the return values of most of which are cached and do not require extra system calls).
Therefore, depending on your use case (that is, what you intend to do with each child element of the input directory), code using DirectoryIterator can perform as well as, or sometimes better than, code using opendir.
You can modify the code of list_directory_oo and test it yourself.
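For instance, here is a variation that returns metadata alongside paths (my own sketch, not part of the original benchmark):

function list_directory_meta($dirpath) {
    $entries = array();
    $di = new DirectoryIterator(realpath($dirpath));
    foreach ($di as $fileinfo) {
        if ($fileinfo->isDot()) {
            continue;
        }
        $entries[] = array(
            'path'     => $fileinfo->getRealPath(),
            'mtime'    => $fileinfo->getMTime(),   // served from cached stat data
            'size'     => $fileinfo->getSize(),
            'readable' => $fileinfo->isReadable(),
        );
    }
    return $entries;
}

Doing the same with the procedural version would require explicit filemtime()/filesize()/is_readable() calls per entry.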
Summary
The decision of which to use depends entirely on the use case.
If I were to write a cron job in PHP which recursively scans a directory (and its subdirectories) containing thousands of files and does a certain operation on them, I would choose the procedural method.
But if the requirement is to write a sort of web interface to display uploaded files (say, in a CMS) and their metadata, I would choose DirectoryIterator.
Choose based on your needs.
Benefit 1: You can hide away all the boring details.
When using iterators you generally define them somewhere else, so real-life code would look more like:
// ImageFinder is an abstraction over an Iterator
$images = new ImageFinder($base_directory);
foreach ($images as $image) {
    // application logic goes here.
}
The specifics of iterating through directories, sub-directories and filtering out unwanted items are all hidden from the application. That's probably not the interesting part of your application anyway, so it's nice to be able to hide those bits away somewhere else.
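ImageFinder above is hypothetical; a minimal sketch of such an abstraction built on SPL iterators (the extension whitelist is my own assumption) could be:

// Hypothetical ImageFinder: recursively yields image files under a base directory.
class ImageFinder extends FilterIterator
{
    public function __construct($base_directory)
    {
        parent::__construct(new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($base_directory, FilesystemIterator::SKIP_DOTS)
        ));
    }

    // Keep only regular files with common image extensions.
    public function accept()
    {
        $file = $this->getInnerIterator()->current();
        return $file->isFile()
            && in_array(strtolower($file->getExtension()), array('jpg', 'jpeg', 'png', 'gif'));
    }
}

The foreach in the application code stays untouched no matter how ImageFinder finds its files.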
Benefit 2: What you do with the result is separated from obtaining the result.
In the above example, you could swap out that specific iterator for another iterator and you don't have to change what you do with the result at all. This makes the code a bit easier to maintain and add new features to later on.
A DirectoryIterator provides you with items that make sense in themselves. For example, DirectoryIterator::getPathname() will return all the information that you need to access the file contents.
The information that readdir() provides to you only makes sense locally, namely in combination with the parameter that you passed to opendir().
The DirectoryIterator is implemented as a wrapper around the php_stream_* functions, so no fundamentally different performance characteristics are to be expected. In particular, items from the directory are read only when they are requested. Details can be found in the file
ext/spl/spl_directory.c
of the PHP source code.
It's shorter, cleaner and easier to type and read.
Re-read the two examples: the first one is simply "for each $fileinfo in $dir".
You write what you want done, not how to do it.
I had to find the paths to the "deepest" folders in a folder. For this I implemented two algorithms, and one is way faster than the other.
Does anyone know why? I suppose this has some link with the hard disk hardware, but I'd like to understand it.
Here is the fast one:
private function getHostAux($path) {
    $matches = array();
    $folder = rtrim($path, DIRECTORY_SEPARATOR);
    $moreFolders = glob($folder.DIRECTORY_SEPARATOR.'*', GLOB_ONLYDIR);
    if (count($moreFolders) == 0) {
        $matches[] = $folder;
    } else {
        foreach ($moreFolders as $fd) {
            $arr = $this->getHostAux($fd);
            $matches = array_merge($matches, $arr);
        }
    }
    return $matches;
}
And here is the slow one:
/**
 * Breadth-first function using glob
 */
private function getHostAux($path) {
    $matches = array();
    $folders = array(rtrim($path, DIRECTORY_SEPARATOR));
    $i = 0;
    while ($folder = array_shift($folders)) {
        $moreFolders = glob($folder.DIRECTORY_SEPARATOR.'*', GLOB_ONLYDIR);
        if (count($moreFolders == 0)) {
            $matches[$i] = $folder;
        }
        $folders = array_merge($folders, $moreFolders);
        $i++;
    }
    return $matches;
}
Thanks !
You haven't provided additional information that might be crucial for understanding the "timings" you observed. (I intentionally put that in quotes since you haven't specified what "slow" and "fast" mean, or how exactly you measured them.)
Assuming that the supplied information is accurate, that the speedup of the first method is greater than a couple of percent, and that you've tested it on directories of various sizes and depths...
First I would like to comment on the supplied answers:
I wouldn't be so sure about your answer. First, I think you mean "kernel handles". But that is not the issue here, since glob doesn't open handles. How did you come up with that answer?
Both versions have the same total iteration count.
And to add something of my own:
I would suspect array_shift() causes the slowdown, because it reindexes the whole array each time you call it (see the sketch after this list for a way around that).
The order in which you glob may matter depending on the underlying OS and file system.
You (probably) have a bug in your code: you increment $i after every glob, not after adding an element to the $matches array. That leaves the $matches array sparse, which may make the merging, shifting, or even appending slower. I don't know for certain whether that's the case in PHP, but I know several languages in which arrays have such properties, which are sometimes hard to keep in mind while coding. I would recommend fixing this, timing the code again, and seeing if it makes any difference.
I also think that your first, recursive algorithm does fewer iterations than the second one. Try counting how many iterations each algorithm performs using auxiliary variables.
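As a sketch of the array_shift() point above (my own untimed variation, not from the original post): SplQueue dequeues in O(1) rather than reindexing, and appending to $matches avoids the sparse-index issue:

private function getHostAux($path) {
    $matches = array();
    $folders = new SplQueue();
    $folders->enqueue(rtrim($path, DIRECTORY_SEPARATOR));
    while (!$folders->isEmpty()) {
        $folder = $folders->dequeue();
        $moreFolders = glob($folder.DIRECTORY_SEPARATOR.'*', GLOB_ONLYDIR);
        if (count($moreFolders) == 0) { // leaf folder: no subdirectories
            $matches[] = $folder;
        }
        foreach ($moreFolders as $fd) {
            $folders->enqueue($fd);
        }
    }
    return $matches;
}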
tl;dr: Is there a way to prevent alteration to (essentially lock) variables declared/defined prior to an include() call, by the file being included? Also, somewhat related question.
I'm wondering about what measures can be taken, to avoid variable pollution from included files. For example, given this fancy little function:
/**
 * Recursively loads the values returned by included files into
 * the arguments of a callback.
 *
 * If $path is a file, only that file will be included.
 * If $path is a directory, all files in that directory
 * and all sub-directories will be included.
 *
 * When a file is included, $callback is invoked, passing
 * the returned value as an argument.
 *
 * @param string $path
 * @param callable $callback
 */
function load_values_recursive($path, $callback){
    $paths[] = path($path);
    while(!empty($paths)){
        $path = array_pop($paths);
        if(is_file($path)){
            if(true === $callback(include($path))){
                break;
            }
        }
        if(is_dir($path)){
            foreach(glob($path . '*') as $path){
                $paths[] = path($path);
            }
        }
    }
}
I know it's missing some type-checking and other explanations, let's ignore those.
Anyways, this function basically sifts through a bunch of "data" files that merely return values (typically configuration arrays, or routing tables, but whatever) and then invokes the passed callback so that the value can be filtered or sorted or used somehow. For instance:
$values = array();
load_values_recursive('path/to/dir/', function($value) use(&$values){
    $values[] = $value;
});
And path/to/dir/ may have several files that follow this template:
return array(
    // yay, data!
);
My problem comes when these "configuration" files (or whatever, trying to keep this portable and cross-functional) start to contain even rudimentary logic. There's always the possibility of polluting the variables local to the function. For instance, a configuration file, that for the sake of cleverness does:
return array(
    'path_1' => $path = 'some/long/complicated/path/',
    'path_2' => $path . 'foo/',
    'path_3' => $path . 'bar/',
);
Now, if $path happens to name a visible directory relative to the current one, the function is gonna go wonky:
// ...
if(is_file($path)){
    if(true === $callback(include($path))){ // path gets reset to
        break;                              // some/long/complicated/path/
    }
}
if(is_dir($path)){                          // and gets added into the
    foreach(glob($path . '*') as $path){    // search tree
        $paths[] = path($path);
    }
}
// ...
This would likely have bad-at-best results. The only solution I can think of is wrapping the include() call in yet another anonymous function to change scope:
// ...
if(true === call_user_func(function() use($callback, $path){
    return $callback(include($path));
})){
    break;
}
// ...
Thus protecting $path (and, more importantly, $callback) from side effects on each iteration.
I'm wondering if there exists a simpler way to "lock" variables in PHP under such circumstances.
I just wanna go on the record here; I know I could use, for instance, an elseif to alleviate one of the issues specific to this function, however my question is more interested in circumstance-agnostic solutions, a catch-all if you will.
Take a look at Giving PHP include()'d files parent variable scope; it has a rather unique approach to the problem that can be used here.
It amounts to unsetting all defined vars before the include and then restoring them after.
It certainly isn't elegant, but it'll work.
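A rough sketch of that idea, reconstructed by me (config.php stands in for whichever file gets included; fragile if the surrounding code needs its own locals mid-include):

$__stash = get_defined_vars();      // snapshot everything defined so far
foreach (array_keys($__stash) as $__k) {
    unset($$__k);                   // scrub the scope before including
}
$__value = include 'config.php';    // the file now sees an almost-empty scope
extract($__stash);                  // restore the original variables
unset($__stash, $__k);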
I've gone with the following solution to include pollution:
$value = call_user_func(function(){
    return include(func_get_arg(0));
}, $path);
$path is nowhere to be seen at inclusion time, and it seems the most elegant option. Sure, calling func_get_arg($i) from the included file will still yield the passed values, but, well...
I have a bunch of files I need to crunch, and I'm worrying about scalability and speed.
The filename and file data (only the first line) are stored in an array in RAM to create some statistics files later in the script.
The files must remain files and can't be put into a database.
The filenames are formatted in the following fashion:
Y-M-D-title.ext (where Y is year, M is month and D is day)
I'm currently using glob to list all the files and create my array.
Here is a sample of the code creating the array "for year" or "month" (it's used in a function with only one parameter, $period):
[...]
function create_data_info($period=NULL){
    $data = array();
    $files = glob(ROOT_DIR.'/'.'*.ext');
    $size = sizeOf($files);
    $existing_title = array(); //Used so we can handle having the same title twice at different dates.
    if (isset($period)){
        if ( "year" === $period ){
            for ($i = 0; $i < $size; $i++) {
                $info = extract_info($files[$i], $existing_title);
                //Create the data array with all the data ordered by year/month/day
                $data[(int)$info[5]][] = $info;
                unset($info);
            }
        }elseif ( "month" === $period ){
            for ($i = 0; $i < $size; $i++) {
                $info = extract_info($files[$i], $existing_title);
                $key = $info[5].$info[6];
                //Create the data array with all the data ordered by year/month/day
                $data[(int)$key][] = $info;
                unset($info);
            }
        }
    }
    [...]
}
function extract_info($file, &$existing){
    $full_path_file = $file;
    $file = basename($file);
    $info_file = explode("-", $file, 4);
    $filetitle = explode(".", $info_file[3]);
    $info[0] = $filetitle[0];
    if (!isset($existing[$info[0]])) {
        $existing[$info[0]] = -1;
    }
    $existing[$info[0]] += 1;
    if ($existing[$info[0]] > 0) {
        //We have already found a post with this title;
        //the creation of the cache is based on info[4] data for the filename,
        //so we need to tune it
        $info[0] = $info[0]."-".$existing[$info[0]];
    }
    $info[1] = $info_file[3];
    $info[2] = $full_path_file;
    $post_content = file(ROOT_DIR.'/'.$file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $info[3] = $post_content[0]; //first line of the file
    unset($post_content);
    $info[4] = filemtime(ROOT_DIR.'/'.$file);
    $info[5] = $info_file[0]; //year
    $info[6] = $info_file[1]; //month
    $info[7] = $info_file[2]; //day
    return $info;
}
So in my script I only call create_data_info(PERIOD) (PERIOD being "year", "month", etc.).
It returns an array filled with the info I need, and then I can loop through it to create my statistics files.
This process is done every time the PHP script is launched.
My question is: is this code optimal (certainly not), and what can I do to squeeze some juice out of it?
I don't know how I can cache this (if it's even possible), as there is a lot of I/O involved.
I can change the tree structure if that would change things compared to a flat structure, but from what I found out in my tests, flat seems to be best.
I already thought about writing a little "booster" in C that does only the crunching, but since the work is I/O bound, I don't think it would make a huge difference, and the application would be a lot less compatible for shared-hosting users.
Thank you very much for your input; I hope I was clear enough here. Let me know if you need clarification (and forgive my English mistakes).
To begin with, you should use DirectoryIterator instead of the glob function. When it comes to scandir vs opendir vs glob, glob is as slow as it gets.
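If you want to keep the *.ext pattern while moving to SPL, GlobIterator is the closest fit (a sketch under that assumption; ROOT_DIR is the constant from the question):

$it = new GlobIterator(ROOT_DIR . '/*.ext', FilesystemIterator::KEY_AS_FILENAME);
foreach ($it as $filename => $fileinfo) {
    // $fileinfo is an SplFileInfo, so metadata comes from cached stat data:
    $mtime = $fileinfo->getMTime();    // replaces filemtime(ROOT_DIR.'/'.$file)
    $path  = $fileinfo->getPathname(); // full path, no manual concatenation
}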
Also, when you are dealing with a large number of files you should try to do all your processing inside one loop; PHP function calls are rather slow.
I see you are using unset($info), yet in every loop iteration $info gets a new value. PHP does its own garbage collection, if that's your concern. unset is a language construct, not a function, and should be pretty fast, but when it isn't needed it still makes the whole thing a bit slower.
You are passing $existing by reference. Is there a practical reason for this? In my experience, references make things slower.
And lastly, your script seems to deal with a lot of string processing. You might want to consider some kind of "serialize data and base64 encode/decode" solution, but you should benchmark that specifically; it might be faster or slower depending on your whole code. (My thinking is that serialize/unserialize MIGHT run faster, as these are native PHP functions, while custom functions with string processing are slower.)
My answer was not very I/O related but I hope it was helpful.
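On the caching question from the post: one option is to serialize the crunched array to disk and rebuild only when a source file changes. A hedged sketch (the cache filename is my invention; create_data_info is the function from the question):

function create_data_info_cached($period) {
    $cache = ROOT_DIR . '/.data_cache_' . $period . '.ser'; // hypothetical cache file
    $newest = 0;
    foreach (glob(ROOT_DIR . '/*.ext') as $f) {
        $newest = max($newest, filemtime($f)); // newest source file wins
    }
    if (is_file($cache) && filemtime($cache) >= $newest) {
        return unserialize(file_get_contents($cache)); // cache still fresh
    }
    $data = create_data_info($period);
    file_put_contents($cache, serialize($data));
    return $data;
}

Note this misses deletions; including the file count or a hash of the file list in the freshness check would cover that.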