Find duplicate files using PHP with high performance

I've got around 25,000 files, each between 5 MB and 200 MB, scattered across many folders on 2 external hard drives. I need to find out which of these are duplicates, leaving only the unique files on the drives.
Currently I'm running md5_file() over each source file and comparing the hashes to see if the same file has been found before. The issue with that is that md5_file() can easily take more than 10 seconds to execute, and I've seen it take up to a minute for some files. If I let this script run in its current form, the process will take more than a week to finish.
Note that I'm saving each hash after it has been computed, so I don't have to re-hash files on subsequent runs. The thing is that all of these files are yet to be hashed.
I'm wondering what I could do to speed this up. I need to finish this in less than 5 days, so a script that takes more than a week is not an option. I was thinking multithreading (using pthreads) could be a solution, but since the drives are so slow and my CPU is not the bottleneck, I don't think it would help. What else could I do?

As you guessed, it's hard to tell whether you will see any gains from threading while the drives are the bottleneck ...
However, I decided to write a pthreads example based on your idea; I think it illustrates the things you should do while threading ...
Your mileage will vary, but here's the example all the same:
<?php
/* create a mutex for readable logging output */
define("LOG", Mutex::create());

/* log a message to stdout; use as a thread-safe printf */
function out($message, $format = null) {
    $format = func_get_args();
    if ($format) {
        $message = array_shift($format);
        Mutex::lock(LOG);
        echo vsprintf($message, $format);
        Mutex::unlock(LOG);
    }
}

/* Sums is a collection of sum => file shared among workers */
class Sums extends Stackable {
    public function run() {}
}

/* worker to execute sum tasks */
class CheckWorker extends Worker {
    public function run() {}
}

/* the simplest version of a job that calculates the checksum of a file */
class Check extends Stackable {
    /* all properties are public */
    public $file;
    public $sums;

    /* accept a file and the shared Sums collection */
    public function __construct($file, Sums $sums) {
        $this->file = $file;
        $this->sums = $sums;
    }

    public function run() {
        out("checking: %s\n", $this->file);

        /* calculate checksum */
        $sum = md5_file($this->file);

        /* check for sum in shared list */
        if (isset($this->sums[$sum])) {
            /* deal with duplicate */
            out("duplicate file found: %s, duplicate of %s\n",
                $this->file, $this->sums[$sum]);
        } else {
            /* set sum in shared list */
            $this->sums[$sum] = $this->file;

            /* output some info ... */
            out("unique file found: %s, sum (%s)\n",
                $this->file, $sum);
        }
    }
}

/* start a timer */
$start = microtime(true);

/* checksum collection, shared across all threads */
$sums = new Sums();

/* create a suitable number of worker threads */
$workers = array();
$checks = array();
$worker = 0;

/* how many worker threads you need depends on your hardware */
while (count($workers) < 16) {
    $workers[$worker] = new CheckWorker();
    $workers[$worker]->start();
    $worker++;
}

/* scan the path given on the command line for files */
foreach (scandir($argv[1]) as $id => $path) {
    /* @TODO(u) write code to recursively scan a path */
    $path = sprintf("%s/%s", $argv[1], $path);

    /* create a job to calculate the checksum of a file */
    if (!is_dir($path)) {
        $checks[$id] = new Check($path, $sums);

        /* @TODO(u) write code to stack to an appropriate worker */
        $workers[array_rand($workers)]->stack($checks[$id]);
    }
}

/* join threads */
foreach ($workers as $worker) {
    $worker->shutdown();
}

/* output some info */
out("complete in %.3f seconds\n", microtime(true) - $start);

/* destroy logging mutex */
Mutex::destroy(LOG);
?>
Play around with it, see how different numbers of workers affect runtime, and implement your own logic for deleting files and scanning directories (basic stuff you should know already, left out to keep the example simple) ...
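For the recursive-scan TODO, PHP's SPL iterators already do the heavy lifting. A minimal sketch of that part, assuming it replaces the scandir() loop above and reuses the $workers, $checks and $sums variables from the example (collectFiles is an illustrative name):
<?php
/* recursively collect regular files under a root directory using SPL iterators */
function collectFiles($root) {
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS));
    foreach ($iterator as $info) {
        if ($info->isFile()) {
            $files[] = $info->getPathname();
        }
    }
    return $files;
}

/* stack a Check job for every file found under the path given on the command line */
foreach (collectFiles($argv[1]) as $id => $path) {
    $checks[$id] = new Check($path, $sums);
    $workers[array_rand($workers)]->stack($checks[$id]);
}
?>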

You could narrow down the candidates by first comparing file sizes and only hashing files that share their size with at least one other file. Reading a file's size is just a stat call, so it's practically free compared to reading an entire file for a hash, and files of different sizes can never be identical.
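A minimal single-threaded sketch of that idea, assuming $files holds the paths collected by a prior scan (the variable name and output format are illustrative):
<?php
/* group candidate files by size; only same-sized files can be duplicates */
$bySize = [];
foreach ($files as $path) {
    $bySize[filesize($path)][] = $path;
}

/* hash only the files that share their size with at least one other file */
$seen = [];
foreach ($bySize as $size => $group) {
    if (count($group) < 2) {
        continue; /* unique size, cannot have a duplicate */
    }
    foreach ($group as $path) {
        $sum = md5_file($path);
        if (isset($seen[$sum])) {
            printf("duplicate: %s (same as %s)\n", $path, $seen[$sum]);
        } else {
            $seen[$sum] = $path;
        }
    }
}
?>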

Related

How to avoid race hazard with multiple requests?

In order to protect the script from a race hazard, I am considering the approach described by this code sample:
$file = 'yxz.lockctrl';
// if the file exists, it means that some other request is running
while (file_exists($file))
{
    sleep(1);
}
file_put_contents($file, '');
// do some work
unlink($file);
If I go this way, is it possible for multiple requests to create the file with the same name simultaneously?
I know that there is a PHP mutex. I would like to handle this situation without any extensions (if possible).
The task for the program is to handle bids in an auctions application. I would like to process every bid request sequentially, with the least possible latency.
From what I understand you want to make sure only a single process at a time runs a certain piece of code. A mutex or similar mechanism could be used for this. I myself use lockfiles, a solution that works on many platforms and doesn't rely on a specific library that is only available on Linux etc.
For that, I have written a small Lock class. Note that it uses some non-standard functions from my library, for instance to determine where to store temporary files, but you could easily change that.
<?php
class Lock
{
    private $_owned = false;
    private $_name = null;
    private $_lockFile = null;
    private $_lockFilePointer = null;

    public function __construct($name)
    {
        $this->_name = $name;
        $this->_lockFile = PluginManager::getInstance()->getCorePlugin()->getTempDir('locks') . $name . '-' . sha1($name . PluginManager::getInstance()->getCorePlugin()->getPreference('EncryptionKey')->getValue()) . '.lock';
    }

    public function __destruct()
    {
        $this->release();
    }

    /**
     * Acquires a lock
     *
     * Returns true on success and false on failure.
     * Can be told to wait (block), optionally for a max number of seconds,
     * or return false right away.
     *
     * @param bool $wait
     * @param null $maxWaitTime
     * @return bool
     * @throws \Exception
     */
    public function acquire($wait = false, $maxWaitTime = null) {
        $this->_lockFilePointer = fopen($this->_lockFile, 'c');
        if (!$this->_lockFilePointer) {
            throw new \RuntimeException(__('Unable to create lock file', 'dliCore'));
        }

        if ($wait && $maxWaitTime === null) {
            $flags = LOCK_EX;
        } else {
            $flags = LOCK_EX | LOCK_NB;
        }

        $startTime = time();
        while (1) {
            if (flock($this->_lockFilePointer, $flags)) {
                $this->_owned = true;
                return true;
            } else {
                if ($maxWaitTime === null || time() - $startTime > $maxWaitTime) {
                    fclose($this->_lockFilePointer);
                    return false;
                }
                sleep(1);
            }
        }
    }

    /**
     * Releases the lock
     */
    public function release()
    {
        if ($this->_owned) {
            @flock($this->_lockFilePointer, LOCK_UN);
            @fclose($this->_lockFilePointer);
            @unlink($this->_lockFile);
            $this->_owned = false;
        }
    }
}
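If you don't have the author's PluginManager infrastructure, the constructor is the only part that needs changing. A minimal sketch of a portable replacement, assuming sys_get_temp_dir() is acceptable for the lock directory and using an arbitrary salt of your own in place of the encryption-key preference:
public function __construct($name)
{
    $this->_name = $name;
    // sys_get_temp_dir() stands in for the library-specific temp dir;
    // the salt is arbitrary and merely namespaces the lock files
    $salt = 'change-me';
    $this->_lockFile = sys_get_temp_dir() . DIRECTORY_SEPARATOR
        . $name . '-' . sha1($name . $salt) . '.lock';
}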
Usage
Now you can have two processes that run at the same time and execute the same script.
Process 1
$lock = new Lock('runExpensiveFunction');
if ($lock->acquire()) {
    // Some expensive function that should only run one at a time
    runExpensiveFunction();
    $lock->release();
}
Process 2
$lock = new Lock('runExpensiveFunction');
// acquire() returns false since the lock is already held by someone else, so the function is skipped
if ($lock->acquire()) {
    // Some expensive function that should only run one at a time
    runExpensiveFunction();
    $lock->release();
}
Another alternative would be to have the second process wait for the first one to finish instead of skipping the code.
$lock = new Lock('runExpensiveFunction');
// The process will now wait for the lock to become available. A max wait time can be set if needed.
if ($lock->acquire(true)) {
    // Some expensive function that should only run one at a time
    runExpensiveFunction();
    $lock->release();
}
Ram disk
To limit the number of writes to your HDD/SSD with the lockfiles, you could create a RAM disk to store them in.
On Linux you could add something like the following to /etc/fstab
tmpfs /mnt/ramdisk tmpfs nodev,nosuid,noexec,nodiratime,size=1024M 0 0
On Windows you can download something like ImDisk Toolkit and create a ramdisk with that.

Synchronize and pause Thread in PHP

I am running 2 threads at the same time, but I have a critical section where I need to put something into a MySQL DB. The problem is that they can both insert the same thing at the same time.
I have done some calculations showing that after indexing 20000 different news pages, the indexes run from 20000 to 20020 (so 0 to 20 are duplicates).
How do I pause one thread while the other is accessing the database?
----- thread.php
class Process extends Thread {
    public function __construct($website_url) {
        $this->website_url = $website_url;
    }
    public function run() {
        work($this->website_url);
    }
}
----- work
function work($website_url) {
    while (condition) {
        // some work ...
        if (something->check) { // if this exists in the base
            mysqli->query("INSERT something IN db...");
            // prepare, bind, exec ...
        }
        // between the check and the insert, the second thread can insert that element;
        // the critical section is really small, but it is sometimes hit ...
    }
}
----- main.php
$job1 = new Process($website_url, $trigger);
$job2 = new Process($website_url, $trigger);
$job1->start();
$job2->start();
Mutual Exclusion
The simplest way of achieving what you want here is to use a single Mutex:
<?php
class Process extends Thread {
    public function __construct($url, $mutex) {
        $this->url = $url;
        $this->mutex = $mutex;
    }

    public function run() {
        work($this->url, $this->mutex);
    }

    protected $url;
    protected $mutex;
}

function work($url, $mutex) {
    while (1) {
        /* some work */

        /* failing to check the return value of calls to acquire
           or release a mutex is bad form; I haven't done so for brevity */
        Mutex::lock($mutex);
        {
            /* critical section */
            printf("working on %s\n", $url);

            /* sleeping here shows you that the critical section is
               not entered by the second thread; this is obviously not needed */
            sleep(1);
        }
        Mutex::unlock($mutex);

        /* breaking here allows the example code to end; not needed */
        break;
    }
}

$website = "stackoverflow.com";
$lock = Mutex::create();
$jobs = [
    new Process($website, $lock),
    new Process($website, $lock)
];

foreach ($jobs as $job)
    $job->start();
foreach ($jobs as $job)
    $job->join();

/* always destroy mutexes when finished with them */
Mutex::destroy($lock);
?>
This code should explain itself; I have added a few comments to guide you through it.
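Note that the Mutex API used above belongs to pthreads v2 (PHP 5); pthreads v3 (PHP 7) removed Mutex and Cond in favour of synchronized blocks on Threaded objects. A rough sketch of the same critical section in the v3 style, under that assumption:
<?php
class Process extends Thread {
    public function __construct($url, Threaded $monitor) {
        $this->url = $url;
        $this->monitor = $monitor;
    }

    public function run() {
        /* synchronized() serializes entry for all threads sharing $monitor */
        $this->monitor->synchronized(function () {
            /* critical section */
            printf("working on %s\n", $this->url);
        });
    }

    protected $url;
    protected $monitor;
}

$monitor = new Threaded();
$jobs = [
    new Process("stackoverflow.com", $monitor),
    new Process("stackoverflow.com", $monitor)
];

foreach ($jobs as $job)
    $job->start();
foreach ($jobs as $job)
    $job->join();
?>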

PHP - Multiple instances of script accessing same resources

I have to analyze a lot of information.
To speed things up, I'll be running multiple instances of the same script at the same moment.
However, there is a big chance the scripts would analyze the same piece of information (a duplicate), which I don't like, as it would slow down the process.
When running only 1 instance, I solve this problem with an array (I record what has already been analyzed).
So my question is: how could I somehow sync that array with the other "threads"?
MySQL is an option, but I guess it would be overkill?
I have also read about memory sharing, but I'm not sure that's the solution I'm looking for.
So if anyone has some suggestions, let me know.
Regards
This is a trivial task using real multi-threading:
<?php
/* we want logs to be readable, so we create a mutex for output */
define("LOG", Mutex::create());

/* basically a thread-safe printf */
function slog($message, $format = null) {
    $format = func_get_args();
    if ($format) {
        $message = array_shift($format);
        if ($message) {
            Mutex::lock(LOG);
            echo vsprintf($message, $format);
            Mutex::unlock(LOG);
        }
    }
}

/* any pthreads descendant would do */
class S extends Stackable {
    public function run() {}
}

/* a thread that manipulates the shared data until it's all gone */
class T extends Thread {
    public function __construct($shared) {
        $this->shared = $shared;
    }

    public function run() {
        /* you could also use ::chunk if you wanted to bite off a bit more work */
        while (($next = $this->shared->shift())) {
            slog("%lu working with item #%d\n",
                $this->getThreadId(), $next);
        }
    }
}

$shared = new S();

/* fill with dummy data */
while (@$o++ < 10000) {
    $shared[] = $o;
}

/* start some threads */
$threads = array();
while (@$thread++ < 5) {
    $threads[$thread] = new T($shared);
    $threads[$thread]->start();
}

/* join all threads */
foreach ($threads as $thread)
    $thread->join();

/* important: ::destroy what you ::create */
Mutex::destroy(LOG);
?>
The slog() function isn't necessarily required for your use case, but I thought it useful to show an executable example with readable output.
The main gist of it is that multiple threads need only a reference to a common set of data in order to manipulate that data ...

php mutex for ram based wordpress cache in php

I'm trying to implement a cache for a high-traffic WP site in PHP. So far I've managed to store the results to a ramfs and load them directly from the .htaccess. However, during peak hours more than one process is generating a certain page, and this is becoming an issue.
I was thinking that a mutex would help, and I was wondering if there is a better way than system("mkdir cache.mutex").
From what I understand you want to make sure only a single process at a time regenerates a given page. A mutex or similar mechanism could be used for this; the lockfile-based Lock class from the race-hazard answer above works here unchanged, including the RAM-disk variant for the lock files.
I agree with @gries, a reverse proxy is going to be a really good bang-for-the-buck way to get high performance out of a high-volume WordPress site. I've leveraged Varnish with quite a lot of success, though I suspect you can do so with nginx as well.

How to touch a file and read the modification date in PHP on Linux?

I need to touch a file from within one PHP script and read the last time the file was touched from within another script, but no matter how I touch the file and read out the modification date, the modification date doesn't change. Below is a test file.
How can I touch the log file, and thus change its modification date, and then read this modification date?
class TestKeepAlive {
    protected $log_file_name;

    public function process() {
        $this->log_file_name = 'test_keepalive_log.txt';
        $this->_writeProcessIdToLogFile();
        for ($index = 0; $index < 10; $index++) {
            echo 'test' . PHP_EOL;
            sleep(1);
            touch($this->log_file_name);
            $this->_touchLogFile();
            $dateTimeLastTouched = $this->_getDateTimeLogFileLastTouched();
            echo $dateTimeLastTouched . PHP_EOL;
        }
    }

    protected function _touchLogFile() {
        //touch($this->log_file_name);
        exec("touch {$this->log_file_name}");
    }

    protected function _getDateTimeLogFileLastTouched() {
        return filemtime($this->log_file_name);
    }

    protected function _writeProcessIdToLogFile() {
        file_put_contents($this->log_file_name, getmypid());
    }
}

$testKeepAlive = new TestKeepAlive();
$testKeepAlive->process();
You should use the clearstatcache() function, as described in the PHP manual:
PHP caches the information those functions (filemtime) return in order to provide faster performance. However, in certain cases, you may want to clear the cached information. For instance, if the same file is being checked multiple times within a single script, and that file is in danger of being removed or changed during that script's operation, you may elect to clear the status cache. In these cases, you can use the clearstatcache() function to clear the information that PHP caches about a file.
Function:
protected function _getDateTimeLogFileLastTouched() {
    clearstatcache();
    return filemtime($this->log_file_name);
}
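Since PHP 5.3, clearstatcache() also accepts a filename, so you can invalidate the cached stat data for just that one file instead of the whole cache; a small variant of the same method, under that assumption:
protected function _getDateTimeLogFileLastTouched() {
    // clear the cached stat data for this one file only
    clearstatcache(true, $this->log_file_name);
    return filemtime($this->log_file_name);
}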
