I'm trying to get my head around using Iterators effectively in PHP 5, and without a lot of decent examples on the net, it's proving to be a little difficult.
I'm trying to loop over a directory and read all the (PHP) files within to search for defined classes. What I then want to do is have an associative array returned with the class names as keys and the file paths as the values.
By using a RecursiveDirectoryIterator(), I can recurse through directories.
By passing this into a RecursiveIteratorIterator, I can retrieve the contents of the directory as a single dimensional iterator.
By then using a filter on this, I can filter out all the directories, and non-php files which will just leave me the files I want to consider.
What I now want to do is be able to pass this iterator into another iterator (not sure which would be suitable), such that when it loops over each entry, it could retrieve an array which it needs to combine into a master array.
It's a little complicated to explain, so here's a code example:
// $php_files now represents an array of SplFileInfo objects representing files under $dir that match our criteria
$php_files = new PhpFileFilter(new RecursiveIteratorIterator(new RecursiveDirectoryIterator($dir)));
class ClassDetector extends FilterIterator {
public function accept() {
$file = $this->current(); // get the current item, which will be an SplFileInfo object
// Match all the classes contained within this file
if (preg_match($regex, $file->getContents(), $match)) {
// Return an assoc array of all the classes matched, the class name as key and the filepath as value
return array(
'class1' => $file->getFilename(),
'class2' => $file->getFilename(),
'class3' => $file->getFilename(),
);
}
}
}
foreach (new ClassDetector($php_files) as $class => $file) {
print "{$class} => {$file}\n";
}
// Expected output:
// class1 => /foo.php
// class2 => /foo.php
// class3 => /foo.php
// class4 => /bar.php
// class5 => /bar.php
// ... etc ...
As you can see from this example, I'm kind of hijacking the accept() method for FilterIterator, which is completely incorrect usage I know - but I use it only as an example to demonstrate how I just want the one function to be called, and for it to return an array which is merged into a master array.
At the moment I'm thinking I'm going to have to use one of the RecursionIterators, since this appears to be what they do, but I'm not fond of the idea of using two different methods (hasChildren() and getChildren()) to achieve the goal.
In short, I'm trying to identify which Iterator I can use (or extend) to get it to pass over a single-dimensional array(?) of objects, and get it to combine the resulting array into a master one and return that.
I realise that there are several other ways around this, ala something like:
$master = array();
foreach($php_files as $file) {
if (preg_match($regex, $file->getContents(), $match)) {
// create $match_results
$master = array_merge($master, $match_results);
}
}
but this defeats the purpose of using Iterators, and it's not very elegant either as a solution.
Anyway, I hope I've explained that well enough. Thanks for reading this far, and for your answers in advance :)
Right, I managed to get my head around it eventually. I had to use a Recursive iterator because the input iterator is essentially generating child results, and I extended IteratorIterator which already had the functionality to loop over an Iterator.
Anyways, here's a code example, just in case this helps anyone else. This assumes you've passed in an array of SplFileInfo objects (which are the result of a DirectoryIterator anyway).
class ClassMatcher extends IteratorIterator implements RecursiveIterator
{
protected $matches;
public function hasChildren() {
return preg_match_all(
'#class (\w+)\b#ism',
file_get_contents($this->current()->getPathname()),
$this->matches
);
}
public function getChildren() {
$classes = $this->matches[1];
return new RecursiveArrayIterator(
array_combine(
$classes, // class name as key
array_fill(0, count($classes), $this->current()->getPathname()) // file path as value
)
);
}
}
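For anyone wanting to try this out, here is a minimal usage sketch. It repeats the class so that it runs on its own; the temporary directory and the demo files are made up, and the #[\ReturnTypeWillChange] lines are my own addition to keep newer PHP versions quiet (older versions read them as comments). Wrapping the matcher in a RecursiveIteratorIterator flattens the children into class => path pairs.

```php
<?php
class ClassMatcher extends IteratorIterator implements RecursiveIterator
{
    protected $matches;

    #[\ReturnTypeWillChange]
    public function hasChildren() {
        // A file "has children" when at least one class declaration is found
        return preg_match_all(
            '#class (\w+)\b#ism',
            file_get_contents($this->current()->getPathname()),
            $this->matches
        ) > 0;
    }

    #[\ReturnTypeWillChange]
    public function getChildren() {
        $classes = $this->matches[1];
        return new RecursiveArrayIterator(
            array_combine(
                $classes, // class name as key
                array_fill(0, count($classes), $this->current()->getPathname()) // file path as value
            )
        );
    }
}

// Hypothetical demo files in a temporary directory
$dir = sys_get_temp_dir() . '/class_matcher_' . uniqid();
mkdir($dir);
file_put_contents("$dir/foo.php", "<?php\nclass Foo {}\nclass Bar {}\n");
file_put_contents("$dir/baz.php", "<?php\nclass Baz {}\n");

$files = array(new SplFileInfo("$dir/foo.php"), new SplFileInfo("$dir/baz.php"));

// Flatten the per-file children into one class => path map
$map = array();
foreach (new RecursiveIteratorIterator(new ClassMatcher(new ArrayIterator($files))) as $class => $path) {
    $map[$class] = $path;
}
print_r($map);
```

Note that a file in which the regex finds nothing has no children and is therefore a leaf itself, so it may show up in the loop under its numeric key; filtering such files out beforehand avoids that.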
I once did something similar. The source is right here and I believe is easily understandable. If you have any problem with it please let me know.
The main idea is to extend SplFileInfo and then use RecursiveIteratorIterator::setInfoClass($className); in order to obtain information about the source code. A Filter for parsing only PHP files could be nice though I decided back then to filter them by extension in the main loop.
Is it possible to use XPath syntax directly on PHP objects in order to navigate through the hierarchy of the object?
That is, can I use (2) instead of (1):
$object->subObject1->subObject2
$object['subObject1/subObject2'] (The expression in the brackets is the XPath.)
Additional question:
According to my current understanding, a conversion of an object into an ArrayObject doesn't make sense, because XPath cannot be used with ArrayObjects. Is this correct?
If all you need is basic traversal based on a /-separated path, then you can implement it with a simple loop like this:
public function getDescendant($path) {
// Separate the path into an array of components
$path_parts = explode('/', $path);
// Start by pointing at the current object
$var = $this;
// Loop over the parts of the path specified
foreach($path_parts as $property)
{
// Check that it's a valid access
if ( is_object($var) && isset($var->$property) )
{
// Traverse to the specified property,
// overwriting the same variable
$var = $var->$property;
}
else
{
return null;
}
}
// Our variable has now traversed the specified path
return $var;
}
To set a value is similar, but we need one extra trick: to make it possible to assign a value after the loop has exited, we need to assign the variable by reference each time:
public function setDescendant($path, $value) {
// Separate the path into an array of components
$path_parts = explode('/', $path);
// Start by pointing at the current object
$var =& $this;
// Loop over the parts of the path specified
foreach($path_parts as $property)
{
// Traverse to the specified property,
// overwriting the same variable with a *reference*
$var =& $var->$property;
}
// Our variable has now traversed the specified path,
// and is a reference to the variable we want to overwrite
$var = $value;
}
Adding those to a class called Test allows us to do something like the following:
$foo = new Test;
$foo->setDescendant('A/B', 42);
$bar = new Test;
$bar->setDescendant('One/Two', $foo);
echo $bar->getDescendant('One/Two/A/B'), ' is the same as ', $bar->One->Two->A->B;
To allow this using array access notation as in your question, you need to make a class that implements the ArrayAccess interface:
The above functions can be used directly as offsetGet and offsetSet
offsetExists would be similar to getDescendant/offsetGet, except returning false instead of null, and true instead of $var.
To implement offsetUnset properly is slightly trickier, as you can't use the assign-by-reference trick to actually delete a property from its parent object. Instead, you need to treat the last part of the specified path specially, e.g. by grabbing it with array_pop($path_parts)
With a bit of care, the 4 methods could probably use a common base.
One other thought is that this might be a good candidate for a Trait, which basically lets you copy-and-paste the functions into unrelated classes. Note that Traits can't implement Interfaces directly, so each class will need both implements ArrayAccess and the use statement for your Trait.
(I may come back and edit in a full example of ArrayAccess methods when I have time.)
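In the meantime, here is a rough sketch of how the four ArrayAccess methods could look as a Trait. The names PathAccess and pathParts are made up for illustration. Unlike setDescendant above, this version creates intermediate stdClass objects explicitly rather than relying on assign-by-reference, which I believe newer PHP versions no longer allow on a null value. The #[...] attribute lines are only there to silence deprecation notices on recent versions; older ones treat them as comments.

```php
<?php
trait PathAccess
{
    // Split "A/B/C" into its components
    private function pathParts($path) {
        return explode('/', (string) $path);
    }

    #[\ReturnTypeWillChange]
    public function offsetExists($path) {
        $var = $this;
        foreach ($this->pathParts($path) as $property) {
            if (!is_object($var) || !isset($var->$property)) {
                return false;
            }
            $var = $var->$property;
        }
        return true;
    }

    #[\ReturnTypeWillChange]
    public function offsetGet($path) {
        $var = $this;
        foreach ($this->pathParts($path) as $property) {
            if (!is_object($var) || !isset($var->$property)) {
                return null;
            }
            $var = $var->$property;
        }
        return $var;
    }

    #[\ReturnTypeWillChange]
    public function offsetSet($path, $value) {
        $parts = $this->pathParts($path);
        $last = array_pop($parts); // the property that receives the value
        $var = $this;
        foreach ($parts as $property) {
            // Create missing intermediate objects explicitly
            if (!isset($var->$property) || !is_object($var->$property)) {
                $var->$property = new stdClass;
            }
            $var = $var->$property;
        }
        $var->$last = $value;
    }

    #[\ReturnTypeWillChange]
    public function offsetUnset($path) {
        $parts = $this->pathParts($path);
        $last = array_pop($parts); // treat the last part specially
        $var = $this;
        foreach ($parts as $property) {
            if (!is_object($var) || !isset($var->$property)) {
                return; // nothing to unset
            }
            $var = $var->$property;
        }
        if (is_object($var)) {
            unset($var->$last);
        }
    }
}

#[\AllowDynamicProperties]
class Test implements ArrayAccess
{
    use PathAccess;
}

$foo = new Test;
$foo['A/B'] = 42;
$bar = new Test;
$bar['One/Two'] = $foo;
$value = $bar['One/Two/A/B'];
echo $value, ' is the same as ', $bar->One->Two->A->B, "\n";
unset($bar['One/Two/A']);
var_dump(isset($bar['One/Two/A/B'])); // the path no longer exists
```

As in getDescendant, isset() is used for the checks, so a property holding null counts as missing.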
With a few dependencies it should be (fairly easily) possible to support the complete set of XPath expressions. The only difficulty is implementing the walk over the object from the fully qualified XPath.
1. Serialize the object to XML with something like XML_Serializer from PEAR.
2. Load the created document as DOMDocument, run your arbitrary XPath expression and get the node path ($node->getNodePath()) from the selected elements as shown here
3. Armed with a node path like /blah/example[2]/x[3] you can now implement a walk on the object recursively using object attribute iteration. This heavily depends on how the serializer from 1. actually works.
Note: I don't know if implementing the ArrayAccess interface is actually necessary, because you can access object attributes like $obj->$key with $key being some string that was sliced from the node path.
I'm working with some existing code, specifically the jQuery File Upload Plugin. There is one large class and within that there are some functions I'm trying to customize. The problem is there are a few lines of code that make no sense to me.
protected function get_file_object($file_name) {
//whole bunch of code is here that generates a file object with the file size
//and other information related to the image that was in the array.
//removed the code to be concise, just know it returns an object.
return $file;
}
protected function get_file_objects() {
return array_values(
array_filter(
array_map(
array($this, 'get_file_object'),
scandir($this->options['upload_dir'])
)));
}
Okay, so what I don't understand is what is going on inside array_map. I know array_map takes a callback and then an array as arguments. scandir grabs an array from a directory.
It's the callback that makes no sense to me. I looked at the syntax for the array() function in the PHP documentation and it didn't say anything about taking two arguments like this. Obviously the second one is a function, but it's in quotes? I understand what the code is doing, just not how it's doing it.
Is this some undocumented functionality?
The first argument of array_map is a callable. One of the things that counts as a callable is an array where the first element is the instance (or the class name, if the method is static) and the second is the method name. So array($this, 'get_file_object') refers to the get_file_object method of the current instance ($this).
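A small stand-alone sketch may make this clearer. The Uploader class and its describe() and shout() methods are invented for the demo and are not part of the plugin; only the array(..., 'method') callable syntax is the point.

```php
<?php
class Uploader
{
    // Hypothetical stand-in for the plugin's get_file_object()
    protected function describe($name) {
        // Return null for the dot entries so array_filter() can drop them
        return ($name === '.' || $name === '..') ? null : strtoupper($name);
    }

    public function describeAll(array $names) {
        // array($this, 'describe') means "call describe() on this instance"
        return array_values(
            array_filter(
                array_map(array($this, 'describe'), $names)
            )
        );
    }

    public static function shout($name) {
        return $name . '!';
    }
}

$uploader = new Uploader;
$result = $uploader->describeAll(array('.', '..', 'a.jpg', 'b.png'));
print_r($result);

// For a static method, the first element is the class name instead
$shouted = array_map(array('Uploader', 'shout'), array('hi', 'there'));
print_r($shouted);
```

The protected method is callable here because array_map is invoked from inside the class, which is the same situation as in the plugin code.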
I have yet to find a good example of how to use the php RegexIterator to recursively traverse a directory.
The end result would be I want to specify a directory and find all files in it with some given extensions. Say for example only html/php extensions. Furthermore, I want to filter out folders such as .Trash-0, .Trash-500 etc.
<?php
$Directory = new RecursiveDirectoryIterator("/var/www/dev/");
$It = new RecursiveIteratorIterator($Directory);
$Regex = new RegexIterator($It,'/^.+\.php$/i',RecursiveRegexIterator::GET_MATCH);
foreach($Regex as $v){
echo $v[0]."<br/>";
}
?>
That is what I have so far, but it results in: Fatal error: Uncaught exception 'UnexpectedValueException' with message 'RecursiveDirectoryIterator::__construct(/media/hdmovies1/.Trash-0)
Any suggestions?
There are a couple of different ways of going about something like this, I'll give two quick approaches for you to choose from: quick and dirty, versus longer and less dirty (though, it's a Friday night so we're allowed to go a little bit crazy).
1. Quick (and dirty)
This involves just writing a regular expression (could be split into multiple) to use to filter the collection of files in one quick swoop.
(Only the two commented lines are really important to the concept.)
$directory = new RecursiveDirectoryIterator(__DIR__);
$flattened = new RecursiveIteratorIterator($directory);
// Make sure the path does not contain "/.Trash*" folders and ends with a .php or .html file
$files = new RegexIterator($flattened, '#^(?:[A-Z]:)?(?:/(?!\.Trash)[^/]+)+/[^/]+\.(?:php|html)$#Di');
foreach($files as $file) {
echo $file . PHP_EOL;
}
This approach has a number of issues, though it is quick to implement being just a one-liner (though the regex might be a pain to decipher).
2. Less quick (and less dirty)
A more re-usable approach is to create a couple of bespoke filters (using regex, or whatever you like!) to whittle the list of available items in the initial RecursiveDirectoryIterator down to only those that you want. The following is only one example, written quickly just for you, of extending the RecursiveRegexIterator.
We start with a base class whose main job is to keep a hold of the regex that we want to filter with, everything else is deferred back to the RecursiveRegexIterator. Note that the class is abstract since it doesn't actually do anything useful: the actual filtering is to be done by the two classes which will extend this one. Also, it may be called FilesystemRegexFilter but there is nothing forcing it (at this level) to filter filesystem-related classes (I'd have chosen a better name, if I weren't quite so sleepy).
abstract class FilesystemRegexFilter extends RecursiveRegexIterator {
protected $regex;
public function __construct(RecursiveIterator $it, $regex) {
$this->regex = $regex;
parent::__construct($it, $regex);
}
}
These two classes are very basic filters, acting on the file name and directory name respectively.
class FilenameFilter extends FilesystemRegexFilter {
// Filter files against the regex
public function accept() {
return ( ! $this->isFile() || preg_match($this->regex, $this->getFilename()));
}
}
class DirnameFilter extends FilesystemRegexFilter {
// Filter directories against the regex
public function accept() {
return ( ! $this->isDir() || preg_match($this->regex, $this->getFilename()));
}
}
To put those into practice, the following iterates recursively over the contents of the directory in which the script resides (feel free to edit this!), filters out the .Trash folders (by making sure that folder names match the specially crafted regex), and accepts only PHP and HTML files.
$directory = new RecursiveDirectoryIterator(__DIR__);
// Filter out ".Trash*" folders
$filter = new DirnameFilter($directory, '/^(?!\.Trash)/');
// Filter PHP/HTML files
$filter = new FilenameFilter($filter, '/\.(?:php|html)$/');
foreach(new RecursiveIteratorIterator($filter) as $file) {
echo $file . PHP_EOL;
}
Of particular note is that since our filters are recursive, we can choose to play around with how to iterate over them. For example, we could easily limit ourselves to only scanning up to 2 levels deep (including the starting folder) by doing:
$files = new RecursiveIteratorIterator($filter);
$files->setMaxDepth(1); // Two levels, the parameter is zero-based.
foreach($files as $file) {
echo $file . PHP_EOL;
}
It is also super-easy to add yet more filters (by instantiating more of our filtering classes with different regexes; or, by creating new filtering classes) for more specialised filtering needs (e.g. file size, full-path length, etc.).
P.S. Hmm this answer babbles a bit; I tried to keep it as concise as possible (even removing vast swathes of super-babble). Apologies if the net result leaves the answer incoherent.
The docs are indeed not very helpful. There's a problem using a regex for 'does not match' here, but we'll illustrate a working example first:
<?php
//we want to iterate a directory
$Directory = new RecursiveDirectoryIterator("/var/dir");
//We want to stop descending into directories named '.Trash[0-9]+'
//(the filter has to wrap the RecursiveIterator itself, not a RecursiveIteratorIterator)
$Regex1 = new RecursiveRegexIterator($Directory,'%([^0-9]|^)(?<!/.Trash-)[0-9]*$%');
//But, still continue on doing it **recursively**
$It2 = new RecursiveIteratorIterator($Regex1);
//Now, match files
$Regex2 = new RegexIterator($It2,'/\.php$/i');
foreach($Regex2 as $v){
echo $v."\n";
}
?>
The problem is the 'doesn't match .Trash[0-9]{3}' part: the only way I know to negatively match the directory is to match the end of the string ($) and then assert with a lookbehind (?<!/foo) that it is not preceded by '/foo'.
However, as .Trash[0-9]{1,3} is not fixed length, we cannot use it as a lookbehind assertion. Unfortunately, there is no 'invert match' for a RegexIterator. But perhaps there are people more regex-savvy than I am who know how to match 'any string not ending with .Trash[0-9]+'.
edit: got it '%([^0-9]|^)(?<!/.Trash-)[0-9]*$%' as a regex would do the trick.
An improvement on salathe's answer would be to forget about the custom abstract class.
Just use good OOP in PHP and directly extend RecursiveRegexIterator instead:
Here is the File filter
class FilenameFilter
extends RecursiveRegexIterator
{
// Filter files against the regex
public function accept()
{
return ! $this->isFile() || parent::accept();
}
}
And the Directory filter
class DirnameFilter
extends RecursiveRegexIterator
{
// Filter directories against the regex
public function accept() {
return ! $this->isDir() || parent::accept();
}
}
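One thing worth checking before swapping these in: as far as I can tell, parent::accept() tests the regex against the current item cast to a string, which for the SplFileInfo objects coming out of a RecursiveDirectoryIterator is the full pathname rather than the bare name that getFilename() returns. A pattern anchored at the start of the name, such as /^(?!\.Trash)/, would then need rewriting against the path. The sketch below assumes "/" as the directory separator and uses a made-up temporary directory tree; the attribute lines are only there to keep newer PHP versions quiet.

```php
<?php
class FilenameFilter extends RecursiveRegexIterator
{
    // Filter files against the regex
    #[\ReturnTypeWillChange]
    public function accept() {
        return ! $this->isFile() || parent::accept();
    }
}

class DirnameFilter extends RecursiveRegexIterator
{
    // Filter directories against the regex
    #[\ReturnTypeWillChange]
    public function accept() {
        return ! $this->isDir() || parent::accept();
    }
}

// Hypothetical demo tree: two wanted files, one unwanted, one inside a trash folder
$root = sys_get_temp_dir() . '/regex_filter_' . uniqid();
mkdir("$root/.Trash-0", 0777, true);
mkdir("$root/sub");
foreach (array('a.php', 'b.txt', '.Trash-0/c.php', 'sub/d.html') as $name) {
    file_put_contents("$root/$name", '');
}

$directory = new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS);
// The last path segment of a directory must not start with ".Trash"
$filter = new DirnameFilter($directory, '#/(?!\.Trash)[^/]*$#');
// The path of a file must end in .php or .html
$filter = new FilenameFilter($filter, '#\.(?:php|html)$#');

$found = array();
foreach (new RecursiveIteratorIterator($filter) as $file) {
    $found[] = $file->getFilename();
}
sort($found);
print_r($found);
```

The extension pattern works unchanged because it is anchored at the end of the string, which is the same for the path and the file name.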
I need to write a script that will search through a CSV file, and perform certain search functions on it;
find duplicate entries in a column
find matches to a list of banned entries in another column
find entries through regular expression matching on a column specified
Now, I have no problem at all coding this procedurally, but as I am now moving on to Object Orientated Programming, I would like to use classes and instances of objects instead.
However, thinking in OOP doesn't come naturally to me yet, so I'm not entirely sure which way to go. I'm not looking for specific code, but rather suggestions on how I could design the script.
My current thinking is this;
Create a file class. This will handle import/export of data
Create a search class. A child class of file. This will contain the various search methods
How it would function in index.php:
get an array from the csv in the file object in index.php
create a loop to iterate through the values of the array
call the methods in the loop from a search object and echo them out
The problem I see with this approach is this;
I will want to point at different elements in my array to look at particular "columns". I could just put my loop in a function and pass this as a parameter, but this kind of defeats the point of OOP, I feel
My search methods will work in different ways. To search for duplicate entries is fairly straightforward with nested loops, but I do not need a nested loop to do simple word or regular expression searches.
Should I instead go like this?
Create a file class. This will handle import/export of data
Create a loop class. A child class of file. This will contain methods that deal with iterating through the array
Create a search class. A child class of loop. This will contain the various search methods
My main issue with this is that it appears that I may need multiple search objects and iterate through this within my loop class.
Any help would be much appreciated. I'm very new to OOP, and while I understand the individual parts, I'm not yet able to see the bigger picture. I may be overcomplicating what it is I'm trying to do, or there may be a much simpler way that I can't see yet.
PHP already offers a way to read a CSV file in an OO manner with SplFileObject:
$file = new SplFileObject("data.csv");
// tell object that it is reading a CSV file
$file->setFlags(SplFileObject::READ_CSV);
$file->setCsvControl(',', '"', '\\');
// iterate over the data
foreach ($file as $row) {
list ($fruit, $quantity) = $row;
// Do something with values
}
Since SplFileObject streams over the CSV data, the memory consumption is quite low and you can efficiently handle large CSV files, but since it is file i/o, it is not the fastest. However, an SplFileObject implements the Iterator interface, so you can wrap that $file instance into other iterators to modify the iteration. For instance, to limit file i/o, you could wrap it into a CachingIterator:
$cachedFile = new CachingIterator($file, CachingIterator::FULL_CACHE);
To fill the cache, you iterate over the $cachedFile once:
foreach ($cachedFile as $row) {
// nothing to do here; iterating is what fills the cache
}
To iterate over the cache afterwards, you do
foreach ($cachedFile->getCache() as $row) {
// work with the cached rows
}
The tradeoff is increased memory obviously.
Now, to do your queries, you could wrap that CachingIterator or the SplFileObject into a FilterIterator which would limit the output when iterating over the csv data
class BannedEntriesFilter extends FilterIterator
{
private $bannedEntries = array();
public function setBannedEntries(array $bannedEntries)
{
$this->bannedEntries = $bannedEntries;
}
public function accept()
{
foreach ($this->current() as $key => $val) {
// reject the row as soon as one column holds a banned entry
if ($this->isBannedEntryInColumn($val, $key)) {
return false;
}
}
return true;
}
public function isBannedEntryInColumn($entry, $column)
{
return isset($this->bannedEntries[$column])
&& in_array($entry, $this->bannedEntries[$column]);
}
}
A FilterIterator will omit all entries from the inner Iterator that do not satisfy the test in the FilterIterator's accept method. Above, we check the current row from the csv file against an array of banned entries and if it matches, the data is not included in the iteration. You use it like this:
$filteredCachedFile = new BannedEntriesFilter(
new ArrayIterator($cachedFile->getCache())
);
Since the cached results are always an Array, we need to wrap that Array into an ArrayIterator before we can wrap it into our FilterIterator. Note that to use the cache, you also need to iterate the CachingIterator at least once. We just assume you already did that above. The next step is to configure the banned entries
$filteredCachedFile->setBannedEntries(
array(
// banned entries for column 0
array('foo', 'bar'),
// banned entries for column 1
array( …
)
);
I guess that's rather straightforward. You have a multidimensional array with one entry for each column in the CSV data holding the banned entries. You then simply iterate over the instance and it will give you only the rows not having banned entries
foreach ($filteredCachedFile as $row) {
// do something with filtered rows
}
or, if you just want to get the results into an array:
$results = iterator_to_array($filteredCachedFile);
You can stack multiple FilterIterators to further limit the results. If you don't feel like writing a class for each filter, have a look at the CallbackFilterIterator, which allows passing the accept logic at runtime:
$filteredCachedFile = new CallbackFilterIterator(
new ArrayIterator($cachedFile->getCache()),
function(array $row) {
static $bannedEntries = array(
array('foo', 'bar'),
…
);
foreach ($row as $key => $val) {
// logic from above returning boolean if match is found
}
}
);
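To make the callback variant concrete, here is a small self-contained sketch. The rows and the banned values are made up, and a plain array stands in for $cachedFile->getCache(); the banned list is passed in with use rather than a static variable, which is just one way to do it.

```php
<?php
// Hypothetical rows, standing in for $cachedFile->getCache()
$rows = array(
    array('apple', 'foo'),
    array('foo', 'pear'),
    array('kiwi', 'plum'),
    array('bar', 'fig'),
);

// Column index => banned values for that column (made-up data)
$bannedEntries = array(
    0 => array('foo', 'bar'),
    1 => array('plum'),
);

$filtered = new CallbackFilterIterator(
    new ArrayIterator($rows),
    function (array $row) use ($bannedEntries) {
        foreach ($row as $column => $value) {
            if (isset($bannedEntries[$column])
                && in_array($value, $bannedEntries[$column], true)) {
                return false; // a banned entry was found: drop the row
            }
        }
        return true; // no banned entry in any column: keep the row
    }
);

$results = iterator_to_array($filtered);
print_r($results);
```

Only the first row survives here: 'foo' is banned in column 0 but not in column 1, so array('apple', 'foo') is kept while the other three rows are dropped.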
I'm going to illustrate a reasonable approach to designing OOP code that serves your stated needs. While I firmly believe that the ideas presented below are sound, please be aware that:
the design can be improved -- the aim here is to show the approach, not the final product
the implementation is only meant as an example -- if it (barely) works, it's good enough
How to go about doing this
A highly engineered solution would start by trying to define the interface to the data. That is, think about what would be a representation of the data that allows you to perform all your query operations. Here's one that would work:
A dataset is a finite collection of rows. Each row can be accessed given its zero-based index.
A row is a finite collection of values. Each value is a string and can be accessed given its zero-based index (i.e. column index). All rows in a dataset have exactly the same number of values.
This definition is enough to implement all three types of queries you mention by looping over the rows and performing some type of test on the values of a particular column.
The next move is to define an interface that describes the above in code. A not particularly nice but still adequate approach would be:
interface IDataSet {
public function getRowCount();
public function getValueAt($row, $column);
}
Now that this part is done, you can go and define a concrete class that implements this interface and can be used in your situation:
class InMemoryDataSet implements IDataSet {
private $_data = array();
public function __construct(array $data) {
$this->_data = $data;
}
public function getRowCount() {
return count($this->_data);
}
public function getValueAt($row, $column) {
if ($row >= $this->getRowCount()) {
throw new OutOfRangeException();
}
return isset($this->_data[$row][$column])
? $this->_data[$row][$column]
: null;
}
}
The next step is to go and write some code that converts your input data to some kind of IDataSet:
function CSVToDataSet($file) {
return new InMemoryDataSet(array_map('str_getcsv', file($file)));
}
Now you can trivially create an IDataSet from a CSV file, and you know that you can perform your queries on it because IDataSet was explicitly designed for that purpose. You're almost there.
The only thing missing is creating a reusable class that can perform your queries on an IDataSet. Here is one of them:
class DataQuery {
private $_dataSet;
public function __construct(IDataSet $dataSet) {
$this->_dataSet = $dataSet;
}
public function getRowsWithDuplicates($columnIndex) {
$values = array();
for ($i = 0; $i < $this->_dataSet->getRowCount(); ++$i) {
$values[$this->_dataSet->getValueAt($i, $columnIndex)][] = $i;
}
return array_filter($values, function($row) { return count($row) > 1; });
}
}
This code will return an array where the keys are values in your CSV data and the values are arrays with the zero-based indexes of the rows where each value appears. Since only duplicate values are returned, each array will have at least two elements.
So at this point you are ready to go:
$dataSet = CSVToDataSet("data.csv");
$query = new DataQuery($dataSet);
$dupes = $query->getRowsWithDuplicates(0);
What you gain by doing this
Clean, maintainable code that supports being modified in the future without requiring edits all over your application.
If you want to add more query operations, add them to DataQuery and you can instantly use them on all concrete types of data sets. The data set and any other external code will not need any modifications.
If you want to change the internal representation of the data, modify InMemoryDataSet accordingly or create another class that implements IDataSet and use that one instead from CSVToDataSet. The query class and any other external code will not need any modifications.
If you need to change the definition of the data set (perhaps to allow more types of queries to be performed efficiently) then you have to modify IDataSet, which also brings all the concrete data set classes into the picture and probably DataQuery as well. While this won't be the end of the world, it's exactly the kind of thing you would want to avoid.
And this is precisely the reason why I suggested to start from this: If you come up with a good definition for the data set, everything else will just fall into place.
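As an illustration of the "add more query operations" point, the other two searches from the question could be sketched as further DataQuery methods. The method names getRowsMatching and getRowsWithBannedValues and the sample data are my own invention; the interface and data set class are repeated only so the snippet runs by itself.

```php
<?php
interface IDataSet {
    public function getRowCount();
    public function getValueAt($row, $column);
}

class InMemoryDataSet implements IDataSet {
    private $_data = array();

    public function __construct(array $data) {
        $this->_data = $data;
    }

    public function getRowCount() {
        return count($this->_data);
    }

    public function getValueAt($row, $column) {
        if ($row >= $this->getRowCount()) {
            throw new OutOfRangeException();
        }
        return isset($this->_data[$row][$column]) ? $this->_data[$row][$column] : null;
    }
}

class DataQuery {
    private $_dataSet;

    public function __construct(IDataSet $dataSet) {
        $this->_dataSet = $dataSet;
    }

    // Hypothetical: indexes of rows whose value in $columnIndex matches $pattern
    public function getRowsMatching($columnIndex, $pattern) {
        $rows = array();
        for ($i = 0; $i < $this->_dataSet->getRowCount(); ++$i) {
            $value = $this->_dataSet->getValueAt($i, $columnIndex);
            if ($value !== null && preg_match($pattern, $value) === 1) {
                $rows[] = $i;
            }
        }
        return $rows;
    }

    // Hypothetical: indexes of rows whose value in $columnIndex is on the banned list
    public function getRowsWithBannedValues($columnIndex, array $banned) {
        $rows = array();
        for ($i = 0; $i < $this->_dataSet->getRowCount(); ++$i) {
            if (in_array($this->_dataSet->getValueAt($i, $columnIndex), $banned, true)) {
                $rows[] = $i;
            }
        }
        return $rows;
    }
}

$dataSet = new InMemoryDataSet(array(
    array('apple', 'red'),
    array('banana', 'yellow'),
    array('avocado', 'green'),
));
$query = new DataQuery($dataSet);
$matches = $query->getRowsMatching(0, '/^a/');
$banned = $query->getRowsWithBannedValues(1, array('yellow'));
print_r($matches);
print_r($banned);
```

Neither method touches the data set's internals, so they would work with any other IDataSet implementation as well.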
You have actually chosen a bad example for learning OOP, because the functionality you are looking for, "importing" and "searching" a file, is best implemented in a procedural way rather than an object-oriented way. Remember that not everything in the world is an "object". Besides objects, we have "procedures", "actions" etc. You can still implement this functionality with classes, which is in fact the recommended way. But just putting functionality in a class does not automatically turn it into real OOP.
The point that I am trying to make is that one of the reasons you might be struggling to comprehend this functionality in terms of OOP is that it is not really object-oriented in nature.
If you are familiar with the Java Math class (PHP may have a similar thing), it has a bunch of methods/functions such as abs, log, etc. Although it is a class, it is not really a class in the object-oriented sense. It is just a bunch of functions.
What really is a class in the object-oriented sense? Well, this is a huge topic, but at least one general criterion is that it has both state (attributes/fields) and behavior (methods), with an intrinsic bond between the two: a call to a method typically accesses the state, because they are so tied together.
Here is a simple OOP class:
Class person {
// State
name;
age;
income;
// Behavior
getName();
setName()
.
.
.
getMonthlyIncome() {
return income / 12;
}
}
And here is a class that, despite its appearance (as a class), is in reality procedural:
class Math {
multiply(double x, double y) {
return x * y;
}
divide(double x, double y) {
return x / y;
}
exponentiate(double x, double y) {
return x^y;
}
}
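In PHP terms, the same contrast might look roughly like this. It is only a sketch: the constructor, the MathUtil name and the assumption that income is a yearly figure are mine, not part of the pseudocode above.

```php
<?php
// A class in the object-oriented sense: behavior works on the object's own state
class Person
{
    // State
    private $name;
    private $income; // assumed to be a yearly figure

    public function __construct($name, $income) {
        $this->name = $name;
        $this->income = $income;
    }

    // Behavior
    public function getName() {
        return $this->name;
    }

    public function setName($name) {
        $this->name = $name;
    }

    public function getMonthlyIncome() {
        return $this->income / 12;
    }
}

// A "class" that is really a bag of functions: there is no state at all
class MathUtil
{
    public static function multiply($x, $y) {
        return $x * $y;
    }

    public static function divide($x, $y) {
        return $x / $y;
    }

    public static function exponentiate($x, $y) {
        return pow($x, $y);
    }
}

$person = new Person('Ann', 24000);
$monthly = $person->getMonthlyIncome();
$product = MathUtil::multiply(3, 4);
echo $person->getName(), ' earns ', $monthly, " per month\n";
echo '3 * 4 = ', $product, "\n";
```

getMonthlyIncome() only makes sense on a particular Person, whereas the MathUtil methods depend on nothing but their arguments.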
For my task it would be very nice to write a kind of object serialization (for XML output). I've already done it, but have no idea how to avoid recursive links.
The trouble is that some objects must have public(!) properties with links to their parents (it's really necessary). And when I try to serialize a parent object which aggregates some children, the children's links to the parent recurse forever.
Is there a solution to handle such recursion the way print_r() does, without hacks?
I can't use something like "if ($prop === 'parent')", because sometimes there's more than one link to parents from different contexts.
Write your own serialization function and always pass it a list of already-processed items. Since PHP5 (I assume, you are using php5) always copies references to an object, you can do the following:
public function __sleep() {
return $this->serialize();
}
protected function serialize($processed = array()) {
if (($position = array_search($this, $processed, true)) !== false) {
# This object has already been processed, you can use the
# $position of this object in the $processed array to reference it.
return;
}
$processed[] = $this;
# do your actual serialization here
# ...
}
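A fuller sketch of the same idea, written as a stand-alone function rather than a method, could look like the following. The function name objectToXml, the id/ref attributes and the item element name are all made up for illustration. The sketch also assumes that property names are valid XML element names and that only public properties matter. The list of already-processed objects is passed by reference and keyed by spl_object_hash().

```php
<?php
function objectToXml($value, $name, array &$seen = array(), $indent = '')
{
    if (is_object($value)) {
        $hash = spl_object_hash($value);
        if (isset($seen[$hash])) {
            // Already serialized: emit a back-reference instead of recursing again
            return $indent . '<' . $name . ' ref="' . $seen[$hash] . '"/>' . "\n";
        }
        $seen[$hash] = count($seen) + 1; // number objects in the order they are visited
        $xml = $indent . '<' . $name . ' id="' . $seen[$hash] . '">' . "\n";
        // Called from outside the class, get_object_vars() returns public properties only
        foreach (get_object_vars($value) as $property => $child) {
            $xml .= objectToXml($child, $property, $seen, $indent . '  ');
        }
        return $xml . $indent . '</' . $name . '>' . "\n";
    }
    if (is_array($value)) {
        $xml = $indent . '<' . $name . '>' . "\n";
        foreach ($value as $item) {
            $xml .= objectToXml($item, 'item', $seen, $indent . '  ');
        }
        return $xml . $indent . '</' . $name . '>' . "\n";
    }
    // Scalars and null
    return $indent . '<' . $name . '>' . htmlspecialchars((string) $value) . '</' . $name . '>' . "\n";
}

$parent = new stdClass;
$parent->name = 'root';
$child = new stdClass;
$child->name = 'kid';
$child->parent = $parent; // link back to the parent
$parent->children = array($child);

$xml = objectToXml($parent, 'node');
echo $xml;
```

Because every object is recorded before its properties are visited, the link back to the parent comes out as an empty element with a ref attribute, no matter which property name holds it.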