Check if image exists - database or file lookup? - php

I hash each file's name based on its contents, store that name in a database, and store the file itself on a server.
Would it be more efficient (quicker) to check for duplicate files (and therefore not re-upload) by looking the name up in the database, or by checking whether the file exists on the server?
There will be thousands of files.

I had the same issue: we have roughly 40k images, and the duplicates were a heavy load on our server, especially for image license management, since the same license had to be attached to the same image multiple times.
I recommend a database lookup. It's much faster as your collection of files grows: a 40k table scan takes something like 20 milliseconds, while a 40k file search on disk runs in a few seconds, which gets annoying fast.
To solve this we changed how images are uploaded: instead of duplicate files, we keep multiple database records that all reference the same physical file on disk. This gives us fast lookups of the file data without touching the "file" itself or even knowing where it actually lives.
We also don't store the file under its original filename, but under a hexadecimal hash based on date and time. That way we never get conflicting filenames and have no delivery issues caused by special characters, spaces, etc.; the original file name is kept in a database field for lookup purposes.
Each image's "metadata" is stored in the database alongside its hexadecimal disk name and its "original" filename. Checking against this database is really fast, and when there's a match/relation we simply retrieve the file link. This also lets us check whether a file has already been uploaded without scanning the entire directory structure with all the images, which can take a significant amount of time.
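Generating the hexadecimal disk name can be as simple as the following sketch; the exact scheme is an assumption on my part, and any collision-resistant name will do:
// hexadecimal disk name derived from the current date/time plus some entropy
$ext = strtolower($uploadedFile->getClientOriginalExtension());
$diskName = md5(date('YmdHis') . uniqid('', true)) . '.' . $ext;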
This is the code I use; you can do something similar. Note that it uses Laravel's Eloquent, but it's fairly easy to replicate in plain MySQL.
First you get an instance to query the file model table.
Then you check for a file where the original filename, file size, content type and other metadata that shouldn't change are all the same.
If they match, make your new file entry a duplicate of the original entry (in my case this allows modifying image titles and descriptions per reference):
$file = new $FILE(); // $FILE holds the file model's class name
$existingFile = $file->newQuery()
    ->where('file_name', $uploadedFile->getClientOriginalName())
    ->where('file_size', $uploadedFile->getSize())
    ->where('content_type', $uploadedFile->getMimeType())
    ->where('is_public', $fileRelation->isPublic())
    ->limit(1)->get()->first();

if ($existingFile) {
    // duplicate: point the new record at the existing physical file
    $file->disk_name = $existingFile->disk_name;
    $file->file_size = $existingFile->file_size;
    $file->file_name = $existingFile->file_name;
    $file->content_type = $existingFile->content_type;
    $file->is_public = $existingFile->is_public;
}
else {
    // first upload of this file: store the data itself
    $file->data = $uploadedFile;
    $file->is_public = $fileRelation->isPublic();
}
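The same duplicate check in plain PDO might look like this; a sketch only, assuming a files table with the columns used above and that $name, $size, $mimeType and $isPublic hold the uploaded file's metadata:
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare(
    'SELECT * FROM files
     WHERE file_name = ? AND file_size = ? AND content_type = ? AND is_public = ?
     LIMIT 1'
);
$stmt->execute([$name, $size, $mimeType, $isPublic]);
$existingFile = $stmt->fetch(PDO::FETCH_ASSOC); // false when no duplicate exists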
Then, when a file is deleted, you need to check whether it's the "last one" before removing the physical file:
public function afterDelete()
{
    try {
        // count the remaining records that still reference this physical file
        $count = $this->newQuery()
            ->where('disk_name', '=', $this->disk_name)
            ->where('file_name', '=', $this->file_name)
            ->where('file_size', '=', $this->file_size)
            ->count();
        if (!$count) {
            // this was the last reference, so remove the file and its thumbnails
            $this->deleteThumbs();
            $this->deleteFile();
        }
    }
    catch (Exception $ex) {
        traceLog($ex->getMessage() . "\n" . $ex->getTraceAsString());
    }
}

Related

PHP - can my script for fetching filenames and finding new files be faster?

I have FTP access to one directory that holds all images for all of the vendor's products.
One product has multiple images: variations in size and variations in how the product is displayed.
There is no "list" (XML, CSV, database...) by which I am able to know "what's new".
For now the only way I see is to grab all filenames and compare them with the ones in my DB.
The last check counted 998,283 files in that directory.
One product has multiple variations, and there is no documentation of how they are named.
I did an initial grab of the filenames, compared them with my products and saved in database table for "images" with their filenames and date modified (from file).
The next step is to check for "new ones".
What I am doing now is:
// get the file list
foreach ($this->getFilenamesFromFtp() as $key => $image_data) {
    // extract data from the filename (product code, size, variation number, extension...)
    // so it can be stored in the table and used later as a reference
    // (e.g. use only large images of a variation, not all sizes)
    $data = self::extractDataFromImage($image_data);
    // if the filename already exists in the DB, do nothing;
    // otherwise continue with the insertion
    if (!$this->checkForFilenameInDb($data['filename'])) {
        $export_codes = $this->export->getProductIds();
        // check if the product code is in the export table - i.e. do we really need this image
        if ($this->functions->in_array_r($data['product_code'], $export_codes)) {
            self::insertImageDataInDb($data);
        }
    }
}
My method getFilenamesFromFtp() looks like this:
$filenames = array();
$i = 1;
$ftp = $this->getFtpConfiguration();

// set up basic connection
$conn_id = ftp_ssl_connect($ftp['host']);
// login with username and password
$login_result = ftp_login($conn_id, $ftp['username'], $ftp['pass']);
ftp_set_option($conn_id, FTP_USEPASVADDRESS, false);
$mode = ftp_pasv($conn_id, true);
ftp_set_option($conn_id, FTP_TIMEOUT_SEC, 180);

// login OK?
if ((!$conn_id) || (!$login_result) || (!$mode)) {
    die("FTP connection has failed!");
}
else {
    // get all filenames and store them in an array
    $files = ftp_nlist($conn_id, ".");
    // count the number of files in the array = the number of files on the FTP server
    $nofiles = count($files);
    foreach ($files as $filename) {
        // the limit is used while developing or testing;
        // in production (current mode) it has to run without a limit
        if (self::LIMIT > 0 && $i == self::LIMIT) {
            break;
        }
        else {
            // get the modification date of the file
            $date_modified = ftp_mdtm($conn_id, $filename);
            // collect filename and date modified so they can be returned and stored in the DB
            $filenames[] = array(
                "filename"      => $filename,
                "date_modified" => $date_modified
            );
        }
        $i++;
    }
    // close the connection
    ftp_close($conn_id);
    return $filenames;
}
The problem is that the script takes a long time.
The longest stage I have detected so far is in getFilenamesFromFtp(), where I build the array:
$filenames[] = array(
    "filename" => $filename,
    "date_modified" => $date_modified
);
That part alone has been running for 4 hours and is still not done.
While writing this I had an idea: drop "date modified" from the initial pass and fetch it later, only when I am actually going to store that image in the DB.
I will update this question as soon as I am done with this change and have tested it :)
Processing a million filenames will take time. However, I see no reason to store those filenames (and date_modified) in an array first; why not process each filename directly?
Also, instead of completely processing a filename right away, why not store it in a database table first and do the real processing later? Splitting the task in two, retrieval and processing, makes it more flexible. For instance, you don't need to do a new retrieval if you want to change the processing. A sketch of the retrieval phase follows.
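A minimal sketch of that first phase, streaming each name straight into a staging table; the table name and PDO setup are assumptions, and a unique index on filename is assumed so re-runs skip names already staged:
// phase 1: stage raw FTP listings for later processing
$pdo = new PDO('mysql:host=localhost;dbname=images', 'user', 'pass');
$stmt = $pdo->prepare('INSERT IGNORE INTO ftp_filenames (filename) VALUES (?)');
foreach (ftp_nlist($conn_id, ".") as $filename) {
    $stmt->execute([$filename]); // no per-file ftp_mdtm round trip, no second array in memory
}
// phase 2 (a separate run): read ftp_filenames and do the real processing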
If the objective is just to display new files on the webpage:
You can store the highest file created/modified time in the DB.
Then, for the next batch, fetch that last modified time and compare it against the created/modified time of each file. This keeps the app pretty lightweight. You can use filemtime for this.
Then take the highest filemtime of all files in the current iteration, store that high-water mark in the DB, and repeat the same steps, as in the sketch below.
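A rough sketch of that incremental check, using ftp_mdtm rather than filemtime since the files here live on a remote FTP server; the sync_state table, its column, and the $pdo/$conn_id handles are assumptions:
// fetch the high-water mark from the last run (0 on the first run)
$lastSeen = (int)$pdo->query('SELECT MAX(last_mtime) FROM sync_state')->fetchColumn();
$maxSeen = $lastSeen;
foreach (ftp_nlist($conn_id, ".") as $filename) {
    $mtime = ftp_mdtm($conn_id, $filename);
    if ($mtime > $lastSeen) {
        // file is new (or changed) since the last run: process it here
        $maxSeen = max($maxSeen, $mtime);
    }
}
// persist the new high-water mark for the next run
$stmt = $pdo->prepare('UPDATE sync_state SET last_mtime = ?');
$stmt->execute([$maxSeen]);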
Suggestions:
foreach ($this->getFilenamesFromFtp() as $key => $image_data) {
If the above snippet pulls all filenames into a single array, you may want to discard that strategy, as it consumes a lot of memory. Instead, read the entries one by one using directory functions, as mentioned in this answer: they maintain an internal pointer on the handle and don't load all entries at once. Of course, you'd need to make that approach recurse into nested directories as well; a sketch follows.
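A sketch of the one-by-one reading; since the files here live on FTP, this relies on PHP's ftp:// stream wrapper for opendir, and the credentials and path are placeholders:
// iterate the remote directory entry by entry instead of listing everything at once
$handle = opendir('ftp://user:pass@ftp.example.com/images/');
if ($handle === false) {
    die("Could not open remote directory");
}
while (($entry = readdir($handle)) !== false) {
    if ($entry === '.' || $entry === '..') {
        continue;
    }
    // process $entry here (extract product code, check DB, insert...)
}
closedir($handle);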

How to create incrementing folder names in PHP

I have an HTML form with three inputs:
name
consultant id (number)
picture upload
After the user submits the form, a php script would:
Create folder with the submitted name
Inside the folder create a txt file with: name + consultant id (given number)
Inside the folder, store the image uploaded by user
The most important thing I want is that the folders created by the PHP script increase by 1. What I mean: folder1 (txt file + image), folder2 (txt file + image), folder3 (txt file + image) and so on...
There are a few different methods for accomplishing what you describe. One option would be to look at all existing folders (directories) when you attempt to create a new one and determine the next highest number.
You can accomplish this by using scandir on your parent output directory to find the existing entries.
Example:
$max = 0;
$files = scandir("/path/to/your/output-directory");
$matches = [];
foreach ($files as $file) {
    if (preg_match("/folder(\d+)/", $file, $matches)) {
        $number = intval($matches[1]);
        if ($number > $max)
            $max = $number;
    }
}
$newNumber = $max + 1;
$newNumber=$max+1;
That is a simple example to get the next number. There are many other factors to consider. For instance, what happens if two users submit the form concurrently? You would need some synchronization mechanism (such as a semaphore or a file lock) to ensure only one insert can occur at a time.
You could use a separate lock file both to store the current number and to act as that synchronization mechanism.
I would highly encourage finding a different way to store the data. Using a database may well be a better option (see the sketch below).
If you need to store the files on disk locally, you may consider other options for generating the directory name: a timestamp, a hash of the data, or a combination thereof, for instance. You may also be able to get by with something like uniqid. Any filesystem option will require some form of synchronization to address race conditions.
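If a database is available, an AUTO_INCREMENT id gives you both the sequence and the concurrency handling for free; a minimal sketch, where the table name and PDO setup are assumptions:
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
// assumes a submissions table with an AUTO_INCREMENT primary key `id`
$stmt = $pdo->prepare('INSERT INTO submissions (name, consultant_id) VALUES (?, ?)');
$stmt->execute([$name, $consultantId]);
// the id doubles as the folder number; concurrent inserts can never collide
$dir = "/some/directory/folder" . $pdo->lastInsertId();
mkdir($dir);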
Here is a more complete example for sequentially creating directories using a lock file for the sequence and synchronization. This omits some error handling that should be added for production code, but should provide the core functionality.
define("LOCK_FILE", "/some/file/path"); //A file for synchronization and to store the counter
define("OUTPUT_DIRECTORY", "/some/directory"); //The directory where you want to write your folders
//Open the lock file
$file=fopen(LOCK_FILE, "r+");
if(flock($file, LOCK_EX)){
//Read the current value of the file, if empty, default to 0
$last=fgets($file);
if(empty($last))
$last=0;
//Increment to get the current ID
$current=$last+1;
//Write over the existing value(a larger number will always completely overwrite a smaller number written from the same position)
rewind($file);
fwrite($file, (string)$current);
fflush($file);
//Determine the path for the next directory
$dir=OUTPUT_DIRECTORY."/folder$current";
if(file_exists($dir))
die("Directory $dir already exists. Lock may have been reset");
//Create the next directory
mkdir($dir);
//TODO: Write your content to $dir (You'll need to provide this piece)
//Release the lock
flock($file, LOCK_UN);
}
else{
die("Unable to acquire lock");
}
//Always close the file handle
fclose($file);

PHP move file using part of a known file name

I have a directory full of images (40,000+) that I need sorted. I have written a script to sort them into their proper directories; however, I am having issues with the file names.
The image URLs, with the id they belong to, are stored in a database, and I am using the database in conjunction with the script to sort the images.
My problem:
The image URLs in the database are shortened. An example of two such corresponding images:
dsc_0107-367.jpg
dsc_0107-367-5478-2354-0014.jpg
The first part of each filename is the same, but the actual file name contains more information. I'd like a way to move a file using only the known part of its name.
I have some basic code:
<?php
$sfiles = mysqli_query($dbconn, "SELECT * FROM files WHERE gal_id = '$_GET[id']");
while($file = mysqli_fetch_assoc($sfiles)){
$folder = $file['gal_id'];
$fileToMove = $file['filename'];
$origDir = "mypath/to/dir";
$newDir = "mypath/to/new/dir/$file['gal_id']";
mkdir "$newDir";
mv "$fileToMove" "$newDir";
}
I'm just confused about how to select the file based on the partial name from the database.
NOTE: It's not as simple as changing the number of chars in the db, because the db was given to me from an external site that has since been deleted. So this is all the data I have.
PHP can find files using the function glob(). glob() searches the specified directory for any files matching a pattern you supply.
Using glob() like this will let you pull your images from a partial name.
Run this query separately, before the second one:
$update = mysqli_query($dbconn, "UPDATE files
    SET filename = REPLACE(filename, '.jpg', '')");
filename should be the column in your database that contains the image names. The reason we remove the .jpg from the column values is that if your names are partial, the .jpg may not line up with the rest of the name in your directory; with it removed, we can search solely for the name pattern.
Then build the query to select and move the files:
$sfiles = mysqli_query($dbconn, "SELECT * FROM files");
while($file = mysqli_fetch_assoc($sfiles)){
    $fileToMove = $file['filename'];
    // because glob outputs the result set into an array,
    // we use foreach to handle each result individually
    foreach(glob("$fileToMove*") as $filename){
        echo "$filename <br>";
        // echoing this out to confirm the results are processed
        // one at a time and that the photos match the pattern
        $folder = $file['gal_id'];
        // the id of the gallery the photo belongs to, pulled from the db;
        // this specifies which folder to move the pic to
        // (replace gal_id with the name of your column)
        $newDir = $_SERVER['DOCUMENT_ROOT']."/admin/wysiwyg/kcfinder/upload/images/gallery/old/".$folder;
        copy($filename, $newDir."/".$filename);
        // I recommend copy rather than move: the original photo stays in place,
        // which ensures the photo made it to the new directory before you lose anything.
        // You can go back and delete the originals afterwards if you'd prefer.
    }
}
Your MySQL query is ripe for SQL injection, and your GET parameter needs to be sanitized. If I went to your page with something like:
pagename.php?id=' DROP TABLE; #--
it would end extremely badly for you.
So:
Overall it's much better to use prepared statements. There's lots and lots of material about how to use them all over SO and the wider internet (a minimal prepared-statement sketch is at the end of this answer). What I show below is only a stopgap measure.
$id = (int)$_GET['id']; // forces the id value to be numeric
$sfiles = mysqli_query($dbconn, "SELECT * FROM files WHERE gal_id = ".$id);
Also take note of closing your ' and " quotes; your original doesn't close the array key's wrapping quotes.
I never used mysqli_fetch_assoc and always used mysqli_fetch_array, so I will use that here as it fits the same syntax:
while($file = mysqli_fetch_array($sfiles)){
    $folder = $id; // same thing
    $fileToMove = $file['filename'];
    $origDir = "mypath/to/dir/".$fileToMove;
    // This directory should always start with $_SERVER['DOCUMENT_ROOT'].
    // Please read the manual for it.
    $newDir = $_SERVER['DOCUMENT_ROOT']."/mypath/to/new/dir/".$folder;
    if(!is_dir($newDir)){
        mkdir($newDir);
    }
    // Now the magic happens: copy the file to the new directory,
    // then (optionally) delete the original.
    copy($origDir, $newDir."/".$fileToMove);
    unlink($origDir); // removes the original
    // Add a flag to your database to record that this file has been copied;
    // ideally, resave the file path so it points to the correct new location
    // (a MySQL UPDATE saving the new file path).
}
Read up on PHP copy() and PHP unlink().
And please use prepared statements for all PHP and database interactions!
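For completeness, a minimal prepared-statement version of the lookup; a sketch only, assuming the same files table as above:
$id = (int)$_GET['id'];
$stmt = mysqli_prepare($dbconn, "SELECT * FROM files WHERE gal_id = ?");
mysqli_stmt_bind_param($stmt, "i", $id); // "i" = integer parameter
mysqli_stmt_execute($stmt);
$result = mysqli_stmt_get_result($stmt);
while ($file = mysqli_fetch_assoc($result)) {
    // ...same copy/move logic as above...
}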

Directory structure for large number of files

I made a site where I am storing user-uploaded files in separate directories, like:
user_id = 1
so
img/upload_docs/1/1324026061_1.txt
img/upload_docs/1/1324026056_1.txt
Same way if
user_id = 2
so
img/upload_docs/2/1324026061_2.txt
img/upload_docs/2/1324026056_2.txt
...
n
So in the future, if I get 100,000 users, my upload_docs folder will contain 100,000 folders.
And there is no restriction on uploads, so it could be 1,000 files for one user, or 10 files; any number of files...
So is this a proper way to do it?
And if not, can anyone suggest how I should structure the storage of these files?
What I would do is name the images with UUIDs and create subfolders based on the names of the files. You can do this pretty easily with chunk_split. For example, if you create a folder for every 4 characters, you end up with a structure like this:
img/upload_docs/1/1324/0260/61_1.txt
img/upload_docs/1/1324/0260/56_1.txt
By storing the image name 1324026056_1.txt you can then very easily determine where it belongs or where to fetch it from using chunk_split.
This is a similar method to how git stores objects.
As code, it could look something like this.
// pass the filename ('123456789.txt' from the db)
function get_path_from_filename($filename) {
    $path = 'img/upload_docs';
    list($name, $ext) = explode('.', $filename); // split off the extension
    $folders = chunk_split($name, 4, '/');       // '123456789' becomes '1234/5678/9/'
    // build the full path, re-attaching the extension
    $newpath = $path . '/' . rtrim($folders, '/') . '.' . $ext;
    return $newpath;
}
Now, when you look up the file to deliver it to the user, use a function following these steps to recreate where the file is (based on the filename, which is still stored as '123456789.txt' in the DB).
To deliver or store the file, use get_path_from_filename; see the usage example below.
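For instance, with the function above (output follows the assumptions in the snippet):
echo get_path_from_filename('123456789.txt');
// prints: img/upload_docs/1234/5678/9.txt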
Another option is to shard the per-user folders by the leading digits of the user id, for example:
img/upload_docs/1/0/10000/1324026056_2.txt
img/upload_docs/9/7/97555/1324026056_2.txt
img/upload_docs/2/3/23/1324026056_2.txt

What's the best way to read from and then overwrite file contents in php?

What's the cleanest way in php to open a file, read the contents, and subsequently overwrite the file's contents with some output based on the original contents? Specifically, I'm trying to open a file populated with a list of items (separated by newlines), process/add items to the list, remove the oldest N entries from the list, and finally write the list back into the file.
fopen(<path>, 'a+')
flock(<handle>, LOCK_EX)
fread(<handle>, filesize(<path>))
// process contents and remove old entries
fwrite(<handle>, <contents>)
flock(<handle>, LOCK_UN)
fclose(<handle>)
Note that I need to lock the file with flock() in order to protect it across multiple page requests. Will the 'w+' flag when fopen()ing do the trick? The php manual states that it will truncate the file to zero length, so it seems that may prevent me from reading the file's current contents.
If the file isn't overly large (that is, you can be confident loading it won't blow PHP's memory limit), then the easiest way to go is to just read the entire file into a string (file_get_contents()), process the string, and write the result back to the file (file_put_contents()). This approach has two problems:
If the file is too large (say, tens or hundreds of megabytes), or the processing is memory-hungry, you're going to run out of memory (even more so when you have multiple instances of the thing running).
The operation is destructive; if saving fails halfway through, you lose all your original data.
If either of these is a concern, plan B is to process the file and at the same time write to a temporary file; after successful completion, close both files, rename (or delete) the original file, and then rename the temporary file to the original filename. A sketch of that approach follows.
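A minimal sketch of the temp-file variant, assuming line-oriented processing; process_line is a hypothetical placeholder for your own transformation:
$src = fopen($filename, 'r');
$tmpName = $filename . '.tmp';
$tmp = fopen($tmpName, 'w');
// stream line by line so memory use stays flat
while (($line = fgets($src)) !== false) {
    fwrite($tmp, process_line($line)); // process_line: your own logic
}
fclose($src);
fclose($tmp);
// atomically swap the processed file into place (same filesystem)
rename($tmpName, $filename);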
Read
$data = file_get_contents($filename);
Write
file_put_contents($filename, $data);
One solution is to use a separate lock file to control access.
This solution assumes that only your script, or scripts you have access to, will want to write to the file. This is because the scripts will need to know to check a separate file for access.
$file_lock = obtain_file_lock();
if ($file_lock) {
    $old_information = file_get_contents('/path/to/main/file');
    $new_information = update_information_somehow($old_information);
    file_put_contents('/path/to/main/file', $new_information);
    release_file_lock($file_lock);
}

function obtain_file_lock() {
    $attempts = 10;
    // There are probably better ways of dealing with waiting for a file
    // lock, but this shows the principle relevant to the original question.
    for ($ii = 0; $ii < $attempts; $ii++) {
        $lock_file = fopen('/path/to/lock/file', 'r'); // only need read access
        if (flock($lock_file, LOCK_EX)) {
            return $lock_file;
        } else {
            // give the other process time to release the lock
            usleep(100000); // 0.1 seconds
        }
    }
    // This is only reached if all attempts fail.
    // Error handling for that eventuality goes here.
    return false;
}

function release_file_lock($lock_file) {
    flock($lock_file, LOCK_UN);
    fclose($lock_file);
}
This should prevent a concurrently running script from reading old information and updating it, which would cause you to lose information that another script wrote after you read the file. It allows only one instance of the script at a time to read the file and then overwrite it with updated information.
While this hopefully answers the original question, it doesn't give a good solution to making sure all concurrent scripts have the ability to record their information eventually.
