Only grab completed files - php

I'm making a really simple "backend" (PHP5) for two Flash/AIR applications. One of them will upload a photo, the backend will save it to a folder, and the second app will poll the backend for new photos and show them.
I don't have any access to a database, so the backend has to be pure PHP5 and nothing more. That's why I chose to save the images to a folder (with a timestamp in their names) and use readdir() to get them back.
This all works like a charm. Nevertheless, I would really like to make sure the backend only returns photos that are completely uploaded, preventing the second app from trying to load an unfinished image. Are there any methods/tricks that I can use to validate a file?

You could check the filesize a couple hundred milliseconds apart and see if it changes:
$first = filesize($file);
// wait 100ms
usleep(100000);
// clear PHP's stat cache, otherwise the second filesize() call may return the cached value
clearstatcache();
$second = filesize($file);
if ($first == $second) {
    // size hasn't changed, so the file is most likely no longer being actively uploaded
}

The usual trick for atomic filesystem operations is to write into a temporary file that is not matched by the reader (e.g. XXX.jpg.tmp) and, once it's completely uploaded, rename it to its target name. Renames on the same volume are atomic, so there is no point in time where the file is either incomplete or unavailable.
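A minimal sketch of that approach in plain PHP5 (the photos/ folder, the timestamp naming and the "photo" form field are assumptions, not part of the question):
<?php
// upload.php - write the incoming photo under a .tmp name first, then rename
$target = 'photos/' . time() . '.jpg';  // final name the polling app will match
$temp   = $target . '.tmp';             // never matched by the reader

if (is_uploaded_file($_FILES['photo']['tmp_name'])) {
    move_uploaded_file($_FILES['photo']['tmp_name'], $temp);
    rename($temp, $target);             // atomic on the same volume
}

// list.php - the polling app only ever sees completely written files
print_r(glob('photos/*.jpg'));
?>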

A really easy and common way to do so would be to create a trigger file based on the file's name, so that you get something like
123.jpg
123.rdy
or
123.jpg
123.jpg.rdy
You create that file (just an empty stub) as soon as the upload is complete. The application that grabs files to load only cares about files with a trigger file and then processes those. Alternatively, you could also save the uploaded file as e.g. 123.bsy or 123.jpg.bsy while it is still being uploaded and then rename it to the final name 123.jpg after the upload is done. Since renames in the same directory are usually really cheap operations in terms of processing time, the chances of running into a race condition should be pretty low. (This might or might not depend on the OS used, though...)
If you need to keep the files in that place, you could, of course, use a database where you add a record for each file once its upload is complete. The other app could then just serve files with a matching database record.
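A small sketch of the trigger-file variant described above (the 123.jpg.rdy naming is taken from the example; the photos/ folder is an assumption):
<?php
// Only hand out images that have a matching ".rdy" stub next to them.
$ready = array();
foreach (glob('photos/*.jpg') as $image) {
    if (file_exists($image . '.rdy')) {
        $ready[] = $image;
    }
}
print_r($ready);
?>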

After writing this all down I figured it out myself. What I did was add the exact number of bytes to the filename as well and validate that while outputting the list of images. The .tmp/.bsy solution is nice too, but I read it a bit too late :)
The upside of my solution is that no renaming is required after the upload is done. Thanks everybody for your fast answers!
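For completeness, a rough sketch of that filename-based validation, assuming a naming scheme like <timestamp>_<bytes>.jpg (the exact scheme isn't stated above):
<?php
// A file such as "1299170245_48213.jpg" is only listed once it really is 48213 bytes on disk.
$complete = array();
foreach (glob('photos/*_*.jpg') as $image) {
    if (preg_match('/_(\d+)\.jpg$/', basename($image), $m) && filesize($image) == (int) $m[1]) {
        $complete[] = $image;
    }
}
print_r($complete);
?>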

Related

Good way to handle temporary file situation in PHP

I have a PHP script where a user can upload an image. This image is stored in the temporary directory and is returned to the user. The user can then use a JavaScript interface to crop the image. (x1,y1),(x2,y2) is sent to the script, which is used to crop the image. It is then returned to the user for another preview and/or crop. Once the user is sufficiently satisfied he will click "save". The temp file is copied over to the original and the temp deleted. These are not per-user images, but rather images of equipment. Any user in the organization can replace any image of equipment. This approach is good but there are a few issues:
1) Let's say the user uploads an image for preview but then closes the browser window. I will be left with a temporary file. This can become an issue. Sure, I can have a cron job clean them up, but in theory I can have a ton of temporary files (this is ugly). The cron job could also delete the user's temp file during an edit.
2) To deal with number 1 I can always have a temporary file per piece of equipment, such as equip1.temp and equip1.jpg. All uploads are stored in equip1.temp, all commits are transferred to equip1.jpg. If two users are trying to upload pictures of the same piece of equipment at the same time this could mess them up (highly unlikely + not an issue, but still ugly)
3) I can always pass the image back and forth (user "uploads" image and it gets echoed back as an <img src="base64....." />. The resulting edits + original base64 string are sent back to PHP for processing). This solution relieves the temp file issue but I noticed it takes several seconds to send high-res images back and forth.
How would you deal with this situation?
I had a similar issue to this. If I recall correctly (it's been a while), I ended up creating a table in a DB to store file names and session keys/time. Each time the script loaded, if there was a dead session in the database, the corresponding session and image/file was deleted.
I don't know if that's a good solution or not, but it solved the multiple user access problem for me.
I wouldn't recommend #3 due to the reasons you mentioned.
I suggest you do this instead:
User uploads file to a random temporary name. equip1.jpg gets stored as equip1_fc8293ae82f72cf7.jpg. Be sure your script will juggle both file names around. It will allow two users to upload the same equipment, with the last one to upload being the winner, but no conflict along the way.
Every time your cropper works with the temp image, you should "touch" it to update the modified time.
Let the user finish their edits, then move the temp file into place as the final image name.
Have a cron job, or a section of your uploader script, that deletes abandoned temp files with an mtime older than an hour or so; see the sketch below. You suggest this is messy because of the potential for lots of temp files, but do you expect a lot of images to be abandoned? Garbage collection is a very standard method for this problem.
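A rough sketch of that garbage collection step, assuming the temp crops live in one dedicated directory and using the one-hour threshold mentioned above:
<?php
// Remove temp crops whose mtime is older than an hour; "touch"-ing the file
// during edits (as suggested above) keeps active sessions alive.
foreach (glob('/path/to/tmp_crops/*.jpg') as $tmp) {
    if (filemtime($tmp) < time() - 3600) {
        unlink($tmp);
    }
}
?>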

How to create a solution that limits user storage in an app?

I can't figure out a good solution for limiting the storage amount a user may access with his files.
In the application users are allowed to upload a limited amount of files. The limitation is based on total file size, i.e. a user might be allowed to store 50 MB on the server.
This number is stored in a database so that it can be easily increased/decreased.
The language used is PHP, but I guess the solution isn't dependent on the scripting language.
Very sorry if the question is unclear. I don't really know what to ask for more than a strategy to implement.
Thanks!
Keeping track of how much space has been used should be straightforward - with each upload you could store the space used in another table. The PHP filesize() function will tell you the size of a file on disk. Use a SUM() SQL query to get the total size of all the files uploaded by each user, and compare it against their quota limit.
The tricky bit is when you're approaching the limit - you can't tell how big the file is going to be before it's uploaded. So you'll have to let the user upload a file, then check its size and see if it takes them over quota. If the file's too big, delete it and let the user know they're out of space.
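A sketch of that check, assuming PDO and a hypothetical uploads table with (user_id, bytes) columns; none of these names come from the question:
<?php
// $userId and $quotaBytes would come from the session and the quota table.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'dbuser', 'dbpass');
$stmt = $pdo->prepare('SELECT COALESCE(SUM(bytes), 0) FROM uploads WHERE user_id = ?');
$stmt->execute(array($userId));
$used = (int) $stmt->fetchColumn();

// The new file has already arrived in PHP's temp dir, so its size is known.
$size = filesize($_FILES['upload']['tmp_name']);

if ($used + $size > $quotaBytes) {
    // reject: the upload would take the user over quota
} else {
    // move_uploaded_file(...) and INSERT a row with the file's size here
}
?>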
A simple approach would be to store the filename, dates and sizes of a users uploads in the database too. Then you can easily reject an upload when it exceeds their total storage.
This also makes it easy to show a list of files sorted in a variety of ways, allowing a user close to their limit to select some files for removal.
You could even use the average size of the files the user uploads to warn them when they are getting close to using up all their space.
You can use a script (something like that) which iterates through a directory's contents, calculates file sizes and then deletes files that don't fit or rejects new uploads. But I think this would be better done with some sort of directory restrictions on the server. Unfortunately, I'm not a Linux guy, so I don't know exactly how to do that, but this post might be helpful.
drewm's solution is good; I just want to add a few words about the tricky part he mentioned.
Yes, it is impossible to predict the file size before the file is uploaded, as you cannot check the file size using JavaScript on the user's upload page. However, you can do it using a Flash-based file uploader (swfupload.org, for example). With it you can check the file size before the upload starts and compare it against the upload limit you have. This way you save the user time (no need to upload the file just to get a "limit exceeded" error message).
As a bonus you can show the user an upload progress bar as well.
Don't forget about OS solutions. If the files are stored in a "user"-specific directory, then you can use the OS to find the disk space used in that directory. A Linux solution would be something like this:
$dirSize = explode("\t", `du -ks $userDir`); // Will return an array of size, dirName
if ($dirSize[0] > MAX_DIR_LIMIT) print "USER IS OVER QUOTA";

File uploads with php - displaying a list of files

I am in the middle of making a script to upload files via PHP. What I would like to know is how to display the files already uploaded, and when clicking on them open them for download. Should I store the names and path in a database, or just list the contents of a directory with PHP?
Check out handling file uploads in PHP. A few points (a short sketch follows the list):
Ideally you want to allow the user to upload multiple files at the same time. Just create extra file inputs dynamically with JavaScript for this;
When you get an upload, make sure you check that it is an upload with is_uploaded_file;
Use move_uploaded_file() to copy the file to wherever you're going to store it;
Don't rely on what the client tells you the MIME type is;
Sending them back to the client can be done trivially with a PHP script but you need to know the right MIME type;
Try and verify that what you get is what you expect (e.g. if it is a PDF file, use a library to verify that it is), particularly if you use the file for anything or send it to anyone else; and
I would recommend you store the file name of the file from the client's computer and display that to them regardless of what you store it as. The user is just more likely to recognise this than anything else.
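A minimal sketch tying the points above together (the "userfile" field name and the storage path are assumptions, not part of the answer):
<?php
if (isset($_FILES['userfile']) && is_uploaded_file($_FILES['userfile']['tmp_name'])) {
    // keep the client's original name for display, but store under a name we control
    $originalName = $_FILES['userfile']['name'];
    $storedName   = uniqid('upload_', true);
    move_uploaded_file($_FILES['userfile']['tmp_name'], '/var/uploads/' . $storedName);
    // record ($originalName, $storedName) in the database for the listing page
}
?>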
Storing paths in the database might be okay, depending on your specific application, but consider storing the filenames in the database and constructing your paths to those files in PHP in a single place. That way, if you end up moving all uploaded files later, there is only one place in your code where you need to change path generation, and you can avoid doing a large amount of data transformation on your "path" field in the database.
For example, for the file 1234.txt, you might store it in:
/your_web_directory/uploaded_files/1/2/3/1234.txt
You can use a configuration file or, if you prefer, a global somewhere to define the path where your uploads are stored (/your_web_directory/uploaded_files/) and then split characters from the filename (in the database) to figure out which subdirectory the file actually resides in.
As for displaying your files, you can simply load your list of files from the database and use a path-generating function to get download paths for each one based on their filenames. If you want to paginate the list of files, try using something like LIMIT 0, 50 in MySQL. Just pass in a new start number with each successive page of upload results.
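A sketch of such a single path-generating function, following the subdirectory layout shown above (the helper name is made up):
<?php
define('UPLOAD_ROOT', '/your_web_directory/uploaded_files');

// Build the on-disk path for a stored filename, e.g. "1234.txt"
// becomes /your_web_directory/uploaded_files/1/2/3/1234.txt
function upload_path($filename) {
    $subdirs = implode('/', str_split(substr($filename, 0, 3)));
    return UPLOAD_ROOT . '/' . $subdirs . '/' . $filename;
}

echo upload_path('1234.txt');
?>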
Maybe you could just use a plain text file for this, something like:
myfile.txt
My Uploaded File||my_upload_dir/my_uploaded_file.pdf
Other Uploaded File||my_upload_dir/other_uploaded.html
and go through them like this:
<?php
$file  = "myfile.txt";
$lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$files = array();
for ($i = 0; $i < count($lines); $i++) {
    // each line is "Display Name||path/to/file"
    $parts = explode("||", $lines[$i]);
    $files[$i][0] = $parts[0]; // display name
    $files[$i][1] = $parts[1]; // stored path
}
print_r($files);
?>
hope this helps. :)
What I always did (past tense, I haven't written an upload script for ages) is, I'd link up an upload script (any upload script) to a simple database.
This offers some advantages;
You do not offer your users direct insight into your file system (what if there is a leak in your 'browse' script and you expose your whole hard drive?)
You can store extra information and meta-data in an easy and efficient way
You can actually query for files / meta-data instead of just looping through all the files
You can enable a 'safe-delete', where you delete the row, but keep the file (for example)
You can enable logging way more easily
Showing files in pages is easier
You can 'mask' files. Using a database enables you to store a 'masked' filename, and a 'real' filename.
Obviously, there are some disadvantages as well;
It is a little harder to migrate, since your file system and database have to be in sync
If an operation fails (on either end) you have either a 'corrupt' database or file system
As mentioned before (but it cannot be mentioned enough, I'm afraid): _keep your uploads safe!_
The MIME type / extension issue is one that has been going on for ages. I think most of the web is solid nowadays, but there used to be a time when developers would check either the MIME type or the extension, but never both (why bother?). This resulted in websites being very, very leaky.
If not written properly, upload scripts are a big hole in your security. A great example of that is a website I 'hacked' a while back (at their request, of course). They supported the upload of images to a photo album, but they only checked the file extension. So I uploaded a GIF with a directory scanner inside. This allowed me to scan through their whole system (since it wasn't a dedicated server, I could see a little more than that).
Hope I helped ;)

How do I handle image management (upload, removal, etc.) in CakePHP?

I'm building a site where users can upload images and then "use" them. What I would like is some thoughts and ideas about how to manage temporary uploads.
For example, a user uploads an image but decides not to do anything with it and just leaves the site. I have then either uploaded the file to the server, or loaded it into server memory, but how do I know when the image can be removed? First, I thought of just having a temporary upload folder which is emptied periodically, but it feels like there must be something better?
BTW I'm using CakePHP and MySQL. Although images are stored on the server, only the location is stored in the db.
Save the information about the file to MySQL, along with the last time the image was viewed - this can be updated by a small script every time the image is used. Then check the database for images not used for 30 days and delete them.
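A rough sketch of that cleanup, assuming an images table with id, path and last_viewed columns (all of these names are hypothetical):
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'dbuser', 'dbpass');

// Find images nobody has viewed for 30 days and remove them from disk and the table.
$old = $pdo->query('SELECT id, path FROM images WHERE last_viewed < NOW() - INTERVAL 30 DAY');
foreach ($old as $row) {
    if (file_exists($row['path'])) {
        unlink($row['path']);
    }
    $pdo->prepare('DELETE FROM images WHERE id = ?')->execute(array($row['id']));
}
?>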
You could try to define a "session" in some way and give the user some information about it. For example, on SO there is a popup when you have started an answer but try to leave the site (and your answer would be lost). You could do the same and delete the uploaded image if the user proceeds. Of course, you can still use a timeout or some other rules (maximum image folder size etc.).
I'm not sure what does "temporary upload" mean in your app. The file is either uploaded or not, and under the ownership of a user. If a user doesn't want to do anything at the moment, you have no other choice but to leave the file where it is.
What you can do is put a warning somewhere on your image management page about unused images, but removing them yourself seems like a bad practice (at least from the user perspective).
As a user, when I upload an image to a server (assuming I want to use it later) and leave the site, I don't expect it to be deleted if I am a registered user.
I would prefer it to be there in my account until I come back. I would suggest thinking along those lines and implementing a solution to save the users' images if possible.
Check the last accessed/modified time of the file to see if it has been used.

Generating ZIP files with PHP + Apache on-the-fly in high speed?

To quote some famous words:
“Programmers… often take refuge in an understandable, but disastrous, inclination towards complexity and ingenuity in their work. Forbidden to design anything larger than a program, they respond by making that program intricate enough to challenge their professional skill.”
While solving some mundane problem at work I came up with this idea, which I'm not quite sure how to solve. I know I won't be implementing this, but I'm very curious as to what the best solution is. :)
Suppose you have this big collection with JPG files and a few odd SWF files. By "big" I mean "a couple thousand". Every JPG file is around 200KB, and the SWFs can be up to a few MB in size. Every day there are a few new JPG files. The total size of all the stuff is thus around 1 GB, and is slowly but steadily increasing. Files are VERY rarely changed or deleted.
The users can view each of the files individually on the webpage. However there is also the wish to allow them to download a whole bunch of them at once. The files have some metadata attached to them (date, category, etc.) that the user can filter the collection by.
The ultimate implementation would then be to allow the user to specify some filter criteria and then download the corresponding files as a single ZIP file.
Since the number of possible criteria combinations is large, I cannot pre-generate all the possible ZIP files and must do it on the fly. Another problem is that the download can be quite large, and for users with slow connections it's quite likely to take an hour or more. Support for "resume" is therefore a must-have.
On the bright side however the ZIP doesn't need to compress anything - the files are mostly JPEGs anyway. Thus the whole process shouldn't be more CPU-intensive than a simple file download.
The problems that I have identified are thus:
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Will passing large amounts of file data through PHP not be a performance hit in itself?
How would you implement this? Is PHP up to the task at all?
Added:
By now two people have suggested storing the requested ZIP files in a temporary folder and serving them from there as regular files. While this is indeed an obvious solution, there are several practical considerations which make it infeasible.
The ZIP files will usually be pretty large, ranging from a few tens of megabytes to hundreds of megabytes. It's also completely normal for a user to request "everything", meaning that the ZIP file will be over a gigabyte in size. Also there are many possible filter combinations and many of them are likely to be selected by the users.
As a result, the ZIP files will be pretty slow to generate (due to sheer volume of data and disk speed), and will contain the whole collection many times over. I don't see how this solution would work without some mega-expensive SCSI RAID array.
This may be what you need:
http://pablotron.org/software/zipstream-php/
This lib allows you to build a dynamic streaming zip file without swapping to disk.
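A rough idea of how such a streaming library is typically used (the method names below are assumptions modelled on zipstream-php's examples and may differ between releases, so check the library's own docs):
<?php
require_once 'zipstream.php';

// Stream a zip straight to the browser, adding files one by one
// without ever writing the archive to disk.
$zip = new ZipStream('photos.zip');
foreach ($matching_files as $path) {
    $zip->add_file_from_path(basename($path), $path);
}
$zip->finish();
?>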
Use e.g. the PhpConcept Library Zip library.
Resuming must be supported by your webserver, except in the case where you don't make the zip files directly accessible. If you have a PHP script as mediator then pay attention to sending the right headers to support resuming.
The script creating the files shouldn't ever time out; just make sure the users can't select thousands of files at once. And keep something in place to remove old zip files, and watch out that some malicious user doesn't use up your disk space by requesting many different file collections.
You're going to have to store the generated zip file, if you want them to be able to resume downloads.
Basically you generate the zip file and chuck it in a /tmp directory with a repeatable filename (hash of the search filters maybe). Then you send the correct headers to the user and echo file_get_contents to the user.
To support resuming you need to check out the $_SERVER['HTTP_RANGE'] value, its format is detailed here, and once you've parsed that you'll need to run something like this.
$size = filesize($zip_file);
if (isset($_SERVER['HTTP_RANGE'])) {
    // HTTP_RANGE looks like "bytes=500-999"; strip the unit and split the bounds
    list(, $seek_range) = explode('=', $_SERVER['HTTP_RANGE'], 2);
    $range = explode('-', $seek_range);
    $start = (int) $range[0];
    $end   = ($range[1] !== '') ? (int) $range[1] : $size - 1;
    $new_length = $end - $start + 1;
    header("HTTP/1.1 206 Partial Content");
    header("Content-Length: $new_length");
    header("Content-Range: bytes $start-$end/$size");
    echo file_get_contents($zip_file, FILE_BINARY, null, $start, $new_length);
} else {
    header("Accept-Ranges: bytes");
    header("Content-Length: " . $size);
    echo file_get_contents($zip_file);
}
This is very sketchy code; you'll probably need to play around with the headers and the contents of the HTTP_RANGE variable a bit. You can use fopen and fread rather than file_get_contents if you wish, and just fseek to the right place.
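A sketch of that fopen/fseek alternative, which avoids pulling a multi-gigabyte zip into memory ($start and $new_length as computed in the range-handling snippet above):
<?php
$fp = fopen($zip_file, 'rb');
fseek($fp, $start);

$remaining = $new_length;
while ($remaining > 0 && !feof($fp)) {
    $chunk = fread($fp, min(8192, $remaining));
    echo $chunk;
    $remaining -= strlen($chunk);
    flush();                      // push each chunk out to the client
}
fclose($fp);
?>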
Now to your questions
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
You can remove it if you want to; however, if something goes pear-shaped and your code gets stuck in an infinite loop, it can lead to interesting problems, should that infinite loop be logging an error somewhere that you don't notice until a rather grumpy sys-admin wonders why their server ran out of hard disk space ;)
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Caching the file to the hard disk means you won't have this problem.
Will passing large amounts of file data through PHP not be a performance hit in itself?
Yes, it won't be as fast as a regular download from the web server, but it shouldn't be too slow.
I have a download page, and I made a zip class that is very similar to your ideas.
My downloads are very big files that can't be zipped properly with the zip classes out there,
and I had similar ideas to yours.
Giving up on compression is a very good approach: not only do you need fewer CPU resources, you also save memory because you don't have to touch the input files and can pass them through; you can also calculate everything, like the zip headers and the final file size, very easily, and you can jump to any position and generate from that point to implement resume.
I go even further: I generate one checksum from all the input files' CRCs and use it as an ETag for the generated file to support caching, and as part of the filename.
If you have already downloaded the generated zip file, the browser gets it from the local cache instead of the server.
You can also adjust the download rate (for example 300KB/s).
You can add zip comments.
You can choose which files can be added and which not (for example thumbs.db).
But there's one problem that you can't overcome with the zip format completely:
the generation of the CRC values.
Even if you use hash_file() to overcome the memory problem, or use hash_update() to incrementally generate the CRC, it will use too much CPU.
Not much for one person, but not recommended for professional use.
I solved this with an extra CRC value table that I generate with an extra script,
and I pass these CRC values as a parameter to the zip class.
With this, the class is ultra fast,
like a regular download script, as you mentioned.
My zip class is a work in progress; you can have a look at it here: http://www.ranma.tv/zip-class.txt
I hope I can help someone with that :)
But I will discontinue this approach; I will reprogram my class into a tar class.
With tar I don't need to generate CRC values from the files; tar only needs some checksums for the headers, that's all.
And I don't need an extra MySQL table any more.
I think it makes the class easier to use if you don't have to create an extra CRC table for it.
It's not so hard, because tar's file structure is simpler than the zip structure.
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
If your script is safe and it closes on user abort, then you can remove it completely.
But it would be safer if you just renew the timeout on every file that you pass through :)
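A tiny sketch of renewing the timeout per file (the loop body stands in for whatever streams one file into the archive):
<?php
foreach ($files as $file) {
    set_time_limit(60);   // give each file its own 60-second budget
    // ... read $file and pass it through to the client here ...
}
?>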
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Yes that would work.
I generated a checksum from the input files' CRCs.
I used this as an ETag and as part of the zip filename.
If something changed, the user couldn't resume the generated zip,
because the ETag and filename changed together with the content.
Will passing large amounts of file data through PHP not be a performance hit in itself?
No, if you only pass the data through it will not use much more than a regular download.
Maybe 0.01% more, I don't know, it's not much :)
I assume that's because PHP doesn't do much with the data :)
You can use ZipStream or PHPZip, which will send zipped files on the fly to the browser, divided into chunks, instead of loading the entire content in PHP and then sending the zip file.
Both libraries are nice and useful pieces of code. A few details:
ZipStream "works" only with memory, but cannot be easily ported to PHP 4 if necessary (uses hash_file())
PHPZip writes temporary files on disk (consumes as much disk space as the biggest file to add in the zip), but can be easily adapted for PHP 4 if necessary.
