Detecting whether a file is complete or partial in PHP

In this system, SFTP will be used to upload large video files to a specific directory on the server. I am writing a PHP script that will be used to copy completely uploaded files to another directory.
My question is, what is a good way to tell if a file is completely uploaded or not? I considered taking note of the file size, sleeping the script for a few seconds, reading the size again, and comparing the two numbers. But this seems very kludgy.

Simple
If you just want a simple technique that will handle most use cases, write a process that checks the last modified date or (as indicated in the question) the file size, and if it hasn't changed in a set time, say one minute, assume the upload is finished and process it. In fact, your question is a near duplicate of another question showing exactly this.
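A minimal sketch of that approach, assuming a cron job runs it every minute or so (the directory paths and the 60-second threshold are assumptions):

```php
<?php
// Sketch: treat a file as complete once it hasn't been modified for 60 seconds.
// Paths and the threshold are placeholders; adjust to your setup.
$incoming = '/var/uploads/incoming';
$done     = '/var/uploads/processed';
$minAge   = 60; // seconds without modification before we trust the file

foreach (glob($incoming . '/*') as $file) {
    if (!is_file($file)) {
        continue;
    }
    clearstatcache(true, $file);                 // don't reuse a cached mtime
    if (time() - filemtime($file) >= $minAge) {
        rename($file, $done . '/' . basename($file));
    }
}
```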
Robust
An alternative to polling the size/last modified time of a file (which with a slow or intermittent connection could cause an open upload to be considered complete too early) is to listen for when the file has finished being written with inotify.
By listening for IN_CLOSE_WRITE you can know when an FTP upload has finished writing a specific file.
I've never used it in PHP, but from the spartan docs it looks like it ought to work in exactly the same way as the underlying library.
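Going by the PECL inotify extension's documented functions, a sketch might look like this (the watch directory is an assumption):

```php
<?php
// Sketch using the PECL inotify extension: block until a file in the watched
// directory has been written and closed, then hand it off for processing.
$watchDir = '/var/uploads/incoming';               // assumed upload directory
$fd    = inotify_init();
$watch = inotify_add_watch($fd, $watchDir, IN_CLOSE_WRITE);

while (true) {
    foreach (inotify_read($fd) as $event) {        // blocks until events arrive
        if ($event['mask'] & IN_CLOSE_WRITE) {
            $finished = $watchDir . '/' . $event['name'];
            // the upload is complete: move/copy $finished to the other directory here
        }
    }
}
```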
Make FTP clients work for you
Take into account how some ftp programs work, from this comment:
it's worth mentioning that some ftp servers create a hidden temporary file to receive data until the transfer is finished. I.e. .foo-ftpupload-2837948723 in the same upload path as the target file
If applicable, the final filename therefore tells you whether the FTP upload actually finished or was stopped/disconnected mid-upload.

In theory, if you are allowed to run system commands from PHP, you can get the PIDs of your SFTP server processes with the pidof command and then use lsof output to check whether any open file descriptors exist for the upload file you are checking.
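A rough sketch of that idea; it assumes the SFTP sessions are handled by processes named sshd, that pidof and lsof are installed, and that PHP may call shell_exec():

```php
<?php
// Sketch: list the SFTP server's PIDs with pidof, then ask lsof whether any
// of those processes still has the uploaded file open.
function uploadStillOpen(string $path): bool
{
    $pids = trim((string) shell_exec('pidof sshd'));          // assumed process name
    if ($pids === '') {
        return false;                                          // no server processes running
    }
    $pidList = implode(',', preg_split('/\s+/', $pids));
    $open = (string) shell_exec('lsof -p ' . escapeshellarg($pidList) . ' 2>/dev/null');
    return strpos($open, $path) !== false;
}

if (!uploadStillOpen('/var/uploads/incoming/video.mp4')) {
    // no open descriptor on the file: assume the upload is finished
}
```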

During a transfer, the only thing that knows the real size of the file is your SFTP client and maybe the SFTP server (I don't know the protocol specifics, so the server may know too). Unless you can talk to the SFTP server directly, you'll have to wait for the file to finish. Checking at some interval seems kludgy, but under the circumstances it's pretty good. It's what we do on my project.
If you have a very short poll interval, it's not terribly resilient when dealing with uploads that may stall for a few seconds. So just set the interval to once a minute or something like that. Assuming you're on a *nix based platform, you could set up a cron job to run every minute (or whatever your polling interval is) to run your script for you.

As #jcolebrand stated, if you have no control over the upload process there is actually not much you can do except guess (file size/date). I would look for server software that allows you to execute hooks/scripts before/after certain server actions (here: file transfer complete). I am not sure, though, whether such software exists. As a last resort you could adapt some open-source SFTP server to your requirements by adding such a hook yourself.

An interesting answer was mentioned in a comment under the original question (but for some unknown reason the comment was deleted): you can try parsing the SFTP server logs to find files that were uploaded completely.

Related

PHP creating .nfs00000 files?

I have a PHP app, which is working fine for me, both on test system and a production system.
But another user of my app wrote to me that it creates a lot of .nfs00000* files on his system and slows down loading of the page.
My app does not create any files on the filesystem; all data is stored in MySQL. So I was really surprised by this. But that user removed my PHP app from his website and the problem disappeared.
I will be honest -- I know nothing about .nfs00000* files and I was not able to google anything reasonable about them. Can someone please explain what they are, why they are created, and whether I can do anything to avoid their creation?
Thanx, Honza
Maybe this can help:
Under Linux/Unix, if you remove a file that a currently running process still has open, the file isn't really removed. Once the process closes the file, the OS then removes the file handle and frees up the disk blocks. This process is complicated slightly when the file that is open and removed is on an NFS-mounted filesystem. Since the process that has the file open is running on one machine (such as a workstation in your office or lab) and the files are on the file server, there has to be some way for the two machines to communicate information about this file. The way NFS does this is with the .nfsNNNN files. If you try to remove one of these files and the file is still open, it will just reappear with a different number. So, in order to remove the file completely you must kill the process that has it open.
If you want to know what process has this file open, you can use 'lsof .nfs1234'. Note, however, this will only work on the machine where the process that has the file open is running. So, if your process is running on one machine (eg. bobac) and you run lsof on some other burrow machine (eg. silo or prairiedog), you won't see anything.
(Source)
If your app is deleting or modifying some files it could be the cause of the problem.

prevent php timeout exception for image uploads

I have an ajax image uploader that sends images to a PHP script, so that they are validated / resized & saved into a directory. The ajax uploader allows multiple files to be uploaded at the same time. Since it allows multiple files, there can be timeouts, so I thought of increasing the execution time using set_time_limit. But I am having trouble determining how much time to set, since the default is 30 seconds. Will one minute be enough? The images upload properly on my local machine, but I suspect there will be timeout errors on a shared hosting service. Any ideas and thoughts on how others have implemented this would be valuable. Thanks.
You can set it to 5 minutes if you need to. But for obvious reasons, you don't really want to have it that high especially for http calls.
So... if I had the energy, this is what I'd do...
I need:
process_initializer.php
process_checker.php
client.html
the_process.php (runs as background)
...
client uploads files to process_initializer.
initializer creates a unique ID, maybe based off of time with milliseconds or some other advanced solution
initializer starts a background process, sending it necessary arguments like filenames along with the ID
initializer responds to the client with ID
client then polls process_checker to see what's going on with ID (maybe 20 second intervals - setTimeout(), whatever)
process_checker may check whether the file output_ID.txt exists (which the_process should create when it's done); if it doesn't exist, respond to the client that it's not ready, and if it does, maybe send the output to the client so the client can do whatever it needs.
When Apache runs PHP it uses one php.ini configuration, and when you run PHP from the command line or from another script, like exec('php the_process arg1 arg2'), it will use a different php.ini, referred to as the PHP CLI configuration, unless you have the CLI configured to use the same php.ini that Apache does. The important thing is that they can use different settings, so you can let CLI scripts take more time than your HTTP-called scripts.
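A stripped-down sketch of the initializer/checker pair described above; the images[] field name, the output_<ID>.txt convention and the JSON responses are all assumptions:

```php
<?php
// process_initializer.php (sketch): save the uploaded images, start
// the_process.php in the background and hand the client an ID to poll with.
$id    = uniqid('', true);                                 // unique job ID
$paths = [];
foreach ($_FILES['images']['tmp_name'] as $i => $tmp) {    // assumes <input name="images[]">
    $dest = sys_get_temp_dir() . '/' . $id . '_' . $i;
    move_uploaded_file($tmp, $dest);
    $paths[] = escapeshellarg($dest);
}
exec('php the_process.php ' . escapeshellarg($id) . ' ' . implode(' ', $paths)
    . ' > /dev/null 2>&1 &');                              // detach so the HTTP request returns at once
echo json_encode(['id' => $id]);
```

```php
<?php
// process_checker.php (sketch): the client polls this with ?id=...;
// the_process.php is expected to write output_<ID>.txt when it finishes.
$id     = preg_replace('/[^a-zA-Z0-9.]/', '', $_GET['id'] ?? '');
$output = 'output_' . $id . '.txt';
if (is_file($output)) {
    echo json_encode(['done' => true, 'output' => file_get_contents($output)]);
} else {
    echo json_encode(['done' => false]);
}
```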

PHP: How do I avoid reading partial files that are pushed to me with FTP?

Files are being pushed to my server via FTP. I process them with PHP code in a Drupal module. O/S is Ubuntu and the FTP server is vsftp.
At regular intervals I will check for new files, process them with SimpleXML and move them to a "Done" folder. How do I avoid processing a partially uploaded file?
vsftp has lock_upload_files defaulted to yes. I thought of attempting to move the files first, expecting the move to fail on a currently uploading file. That doesn't seem to happen, at least on the command line. If I start uploading a large file and move it, it just keeps growing in the new location. I guess the directory entry is not locked.
Should I try fopen with mode 'a' or 'r+' just to see if it succeeds before attempting to load into SimpleXML, or is there a better way to do this? I guess I could just detect the SimpleXML load failing, but... that seems messy.
I don't have control of the sender. They won't do an upload and rename.
Thanks
Using the lock_upload_files configuration option of vsftpd leads to locking files with the fcntl() function. This places advisory lock(s) on uploaded file(s) which are in progress. Other programs don't need to consider advisory locks, and mv for example does not. Advisory locks are in general just advice for programs that care about such locks.
You need another command line tool like lockrun which respects advisory locks.
Note: lockrun must be compiled with the WAIT_AND_LOCK(fd) macro to use the lockf() and not the flock() function in order to work with locks that are set by fcntl() under Linux. So when lockrun is compiled using lockf(), it will cooperate with the locks set by vsftpd.
With such features (lockrun, mv, lock_upload_files) you can build a shell script or similar that moves files one by one, checking if the file is locked beforehand and holding an advisory lock on it as long as the file is moved. If the file is locked by vsftpd then lockrun can skip the call to mv so that running uploads are skipped.
If locking doesn't work, I don't know of a solution as clean/simple as you'd like. You could make an educated guess by not processing files whose last modified time (which you can get with filemtime()) is within the past x minutes.
If you want a higher degree of confidence than that, you could check and store each file's size (using filesize()) in a simple database, and every x minutes check new size against its old size. If the size hasn't changed in x minutes, you can assume nothing more is being sent.
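A sketch of that bookkeeping, using a small JSON file instead of a database just to keep the example self-contained (paths and interval are assumptions):

```php
<?php
// Sketch: run this every x minutes (e.g. from cron). A file whose size is
// unchanged since the previous run is treated as fully uploaded.
$incoming  = '/var/uploads/incoming';
$stateFile = '/var/uploads/sizes.json';      // stands in for the database

$previous = is_file($stateFile) ? json_decode(file_get_contents($stateFile), true) : [];
$current  = [];

foreach (glob($incoming . '/*') as $file) {
    clearstatcache(true, $file);
    $size = filesize($file);
    $current[$file] = $size;
    if (isset($previous[$file]) && $previous[$file] === $size) {
        // size has not changed since the last run: assume nothing more is being sent
        // ... process/move $file here ...
    }
}

file_put_contents($stateFile, json_encode($current));
```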
The lsof linux command lists opened files on your system. I suggest executing it with shell_exec() from PHP and parsing the output to see what files are still being used by your FTP server.
Picking up on the previous answer, you could copy the file over and then compare the sizes of the copied file and the original file at a fixed interval.
If the sizes match, the upload is done; delete the copy and work with the file.
If the sizes do not match, copy the file again.
Repeat.
Here's another idea: create a super (but hopefully not root) FTP user that can access some or all of the upload directories. Instead of your PHP code reading uploaded files right off the disk, make it connect to the local FTP server and download files. This way vsftpd handles the locking for you (assuming you leave lock_upload_files enabled). You'll only be able to download a file once vsftp releases the exclusive/write lock (once writing is complete).
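A minimal sketch of that approach with PHP's FTP functions (host, credentials and paths are placeholders):

```php
<?php
// Sketch: pull the file back through the local FTP server instead of reading
// it off disk, so vsftpd's upload lock decides when the file is available.
$conn = ftp_connect('127.0.0.1');
if (!$conn || !ftp_login($conn, 'reader', 'secret')) {    // placeholder credentials
    exit("FTP login failed\n");
}
ftp_pasv($conn, true);

// ftp_get() should only succeed once vsftpd has released its write lock.
if (ftp_get($conn, '/tmp/incoming/data.xml', 'uploads/data.xml', FTP_BINARY)) {
    // safe to hand /tmp/incoming/data.xml to SimpleXML now
}
ftp_close($conn);
```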
You mentioned trying flock in your comment (and how it fails). It does indeed seem painful to try to match whatever locking vsftpd is doing, but dio_fcntl might be worth a shot.
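For completeness, a rough, untested sketch of what a dio_fcntl() probe could look like with the PECL dio extension; whether its lock actually collides with the one vsftpd sets is exactly the thing you would need to verify:

```php
<?php
// Sketch: try to take a non-blocking write lock via fcntl() using the PECL
// dio extension. If vsftpd still holds its lock, F_SETLK is expected to fail.
$fd = dio_open('/var/uploads/incoming/data.xml', O_RDWR);
$result = dio_fcntl($fd, F_SETLK, [
    'type'   => F_WRLCK,   // exclusive lock
    'whence' => SEEK_SET,
    'start'  => 0,
    'length' => 0,         // 0 = lock the whole file
]);

if ($result === -1) {
    // lock refused: assume the upload is still in progress
} else {
    // we got the lock, so the upload is presumably finished; release it again
    dio_fcntl($fd, F_SETLK, ['type' => F_UNLCK, 'whence' => SEEK_SET, 'start' => 0, 'length' => 0]);
}
dio_close($fd);
```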
I guess you've solved your problem years ago but still.
If you use some pattern to find the files you need, you can ask the party uploading the file to use a different name and rename the file once the upload has completed.
You should check out HiddenStores in ProFTPD; more info here:
http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html

Will I run into load problems with this application stack?

I am designing a file download network.
The ultimate goal is to have an API that lets you directly upload a file to a storage server (no gateway or something). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or an API returns the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything similar at any point. Every drive is just hung into the server as JBOD. All the replication is at application level. If one server breaks down, it is just marked as broken in the database and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring/balancing and maybe later billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySql LAMP stack.
A URL that hits the current storage server is generated by the API (that's no problem so far; just a classic PHP website and MySQL database).
Now it gets interesting...
The storage server runs Apache2 and a PHP script catches the request. URL parameters (a secure token hash) are validated. IP, timestamp and filename are validated so the request is authorized. (No database connection required, just a PHP script that knows a secret token.)
The PHP script sets the file header so that Apache2's mod_xsendfile takes over.
Apache delivers the file passed via mod_xsendfile and is configured to have the access log piped to another PHP script.
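The token-check-and-handoff step might look roughly like this; the HMAC token scheme and parameter names are assumptions, only the X-Sendfile header is the mod_xsendfile part:

```php
<?php
// Sketch: validate the signed URL, then let mod_xsendfile stream the file.
$secret   = 'shared-secret';                           // known only to the API and the storage node
$file     = basename((string) ($_GET['file'] ?? ''));
$expires  = (int) ($_GET['expires'] ?? 0);
$token    = (string) ($_GET['token'] ?? '');
$expected = hash_hmac('sha256', $file . '|' . $_SERVER['REMOTE_ADDR'] . '|' . $expires, $secret);

if ($expires < time() || !hash_equals($expected, $token)) {
    http_response_code(403);
    exit;
}

header('X-Sendfile: /data/storage/' . $file);          // Apache streams the file itself
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . $file . '"');
```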
Apache runs mod_logio and the access log is in Combined I/O log format, additionally extended with the %D variable (the time taken to serve the request, in microseconds) to calculate transfer speeds and spot bottlenecks in the network and so on.
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, and is assigned to one client, so the client is known), counts input/output traffic and increments database fields. For performance reasons I thought about having daily fields and updating them like traffic = traffic + X, creating the row if no row has been updated.
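A sketch of such a piped-log script; the regular expression assumes %I and %O are the last two fields of the log line (adjust if %D comes after them), and the mysqli calls and table layout are stand-ins for whatever DB layer you use. MySQL's INSERT ... ON DUPLICATE KEY UPDATE does the "update, and create the row if it doesn't exist yet" in one statement; alternatively you could UPDATE first and INSERT when the affected-row count is zero, as described above.

```php
#!/usr/bin/env php
<?php
// Sketch: Apache pipes access-log lines to this script's stdin, e.g.
//   CustomLog "|/usr/bin/php /path/to/logpipe.php" combined_io
// For each line we extract the bucket (first path segment) and the byte
// counts, then upsert a per-day traffic counter.
$db = new mysqli('localhost', 'stats', 'secret', 'storage');   // placeholder credentials
$stmt = $db->prepare(
    'INSERT INTO traffic_daily (bucket, day, bytes_in, bytes_out)
     VALUES (?, CURDATE(), ?, ?)
     ON DUPLICATE KEY UPDATE bytes_in  = bytes_in  + VALUES(bytes_in),
                             bytes_out = bytes_out + VALUES(bytes_out)'
);

while (($line = fgets(STDIN)) !== false) {
    // Loose parse: the bucket is the first path segment of the quoted request,
    // the two trailing numbers are assumed to be mod_logio's %I and %O.
    if (!preg_match('~"\w+ /([^/ ]+)[^"]*".* (\d+) (\d+)$~', $line, $m)) {
        continue;
    }
    $bucket = $m[1];
    $in     = (int) $m[2];
    $out    = (int) $m[3];
    $stmt->bind_param('sii', $bucket, $in, $out);
    $stmt->execute();
}
```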
I have to mention that the servers will be low-budget servers with massive storage.
You can have a closer look at the intended setup in this thread on Server Fault.
The key data is that the systems will have Gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests). Maybe 500 MB on average or so!
The currently planned setup runs on a cheap consumer mainboard (Asus), 2 GB of DDR3 RAM and an AMD Athlon II X2 220 (2x 2.80 GHz, tray) CPU.
Of course download managers and range requests will be an issue, but I think the average size of an access will be at least around 50 MB or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be directly read from the last request and does not do another request to the mysql server?
Do you think the system with the specs given above can handle this? If not, how could I improve? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using Lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address and I am not that flexible. It would have the advantage that the download validation would not need a PHP process to fire. But since the PHP process only runs briefly and doesn't read and output the data itself, I think this is OK. Do you? I once served downloads using Lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that.
What do you think about the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update the database in a second process that can handle delays? Or not do it at all and parse the log files at night? My thought is that I would like to have it as close to real time as possible and not have accumulated data anywhere other than the central database. I also don't want to keep track of jobs running on all the servers; that could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open source alternative to expensive storage services such as Amazon S3 that is oriented toward file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would reuse an existing solution, preferably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly the thing that you want.

Uploading big files over HTTP

I need to upload potentially big (as in, 10's to 100's of megabytes) files from a desktop application to a server. The server code is written in PHP, the desktop application in C++/MFC. I want to be able to resume file uploads when the upload fails halfway through, because this software will be used over unreliable connections.
What are my options? I've found a number of HTTP upload components for C++, such as http://www.chilkatsoft.com/refdoc/vcCkUploadRef.html which looks excellent, but it doesn't seem to handle resuming of half-done uploads (I assume this is because HTTP 1.1 doesn't support it). I've also looked at the BITS service, but for uploads it requires an IIS server.
So far my only option seems to be to cut up the file I want to upload into smaller pieces (say 1 MB each), upload them all to the server, reassemble them with PHP and run a checksum to see if everything went OK. To resume, I'd need to have some form of 'handshake' at the beginning of the upload to find out which pieces are already on the server.
Will I have to code this by hand, or does anyone know of a library that does all this for me, or maybe even a completely different solution? I'd rather not switch to another protocol that supports resume natively, for maintenance reasons (potential problems with firewalls etc.).
I'm eight months late, but I just stumbled upon this question and was surprised that webDAV wasn't mentioned. You could use the HTTP PUT method to upload, and include a Content-Range header to handle resuming and such. A HEAD request would tell you if the file already exists and how big it is. So perhaps something like this:
1) HEAD the remote file
2) If it exists and size == local size, upload is already done
3) If size < local size, add a Content-Range header to the request and seek to the appropriate location in the local file.
4) Make PUT request to upload the file (or portion of the file, if resuming)
5) If connection fails during PUT request, start over with step 1
You can also list (PROPFIND) and rename (MOVE) files, and create directories (MKCOL) with dav.
I believe both Apache and Lighttpd have dav extensions.
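Although the client here is C++, the HTTP side of that flow can be sketched with PHP's curl extension just to make the steps concrete (the URL and the exact Content-Range handling on the server are assumptions):

```php
<?php
// Sketch of the HEAD-then-PUT resume flow against a WebDAV-enabled server.
$local  = '/path/to/bigfile.bin';
$remote = 'https://example.com/dav/bigfile.bin';     // placeholder URL
$size   = filesize($local);

// 1) HEAD the remote file to see how much is already there.
$ch = curl_init($remote);
curl_setopt_array($ch, [CURLOPT_NOBODY => true, CURLOPT_RETURNTRANSFER => true]);
curl_exec($ch);
$remoteSize = (int) curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);
if ($remoteSize >= $size) {
    exit("already uploaded\n");                      // step 2: nothing left to do
}

// 3) + 4) PUT the remaining bytes with a Content-Range header.
$offset = max(0, $remoteSize);
$fp = fopen($local, 'rb');
fseek($fp, $offset);

$ch = curl_init($remote);
curl_setopt_array($ch, [
    CURLOPT_PUT        => true,
    CURLOPT_INFILE     => $fp,
    CURLOPT_INFILESIZE => $size - $offset,
    CURLOPT_HTTPHEADER => [
        sprintf('Content-Range: bytes %d-%d/%d', $offset, $size - 1, $size),
    ],
    CURLOPT_RETURNTRANSFER => true,
]);
curl_exec($ch);       // step 5: on failure, start over with the HEAD request
curl_close($ch);
fclose($fp);
```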
You need a standard size (say 256k). If your file "abc.txt", uploaded by user x is 78.3MB it would be 313 full chunks and one smaller chunk.
You send a request to upload stating filename and size, as well as number of initial threads.
Your PHP code will create a temp folder named after the IP address and filename.
Your app can then use MULTIPLE connections to send the data in different threads, so you could be sending chunks 1, 111, 212 and 313 at the same time (with separate checksums).
Your PHP code saves them to different files and confirms reception after validating the checksum, giving the number of the next chunk to send, or telling that thread to stop.
After all threads are finished, you would ask PHP to join all the files; if something is missing, it would go back to 3.
You could increase or decrease the number of threads at will, since the app is controlling the sending.
You can easily show a progress indicator, either a simple progress bar, or something close to downthemall's detailed view of chunks.
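The receiving end of that scheme might look roughly like this; the field names, temp-folder layout and MD5 checksum are assumptions:

```php
<?php
// receive_chunk.php (sketch): store one chunk, verify its checksum and report
// back so the client can send the next chunk or retry this one.
$upload   = preg_replace('/[^a-zA-Z0-9._-]/', '', (string) ($_POST['upload_id'] ?? ''));  // e.g. IP + filename
$chunkNo  = (int) ($_POST['chunk'] ?? 0);
$checksum = (string) ($_POST['md5'] ?? '');
$dir      = '/var/uploads/tmp/' . $upload;

if (!is_dir($dir)) {
    mkdir($dir, 0770, true);
}

$dest = $dir . '/chunk_' . $chunkNo;
move_uploaded_file($_FILES['data']['tmp_name'], $dest);

if (md5_file($dest) !== $checksum) {
    unlink($dest);
    echo json_encode(['ok' => false, 'retry' => $chunkNo]);   // ask for the same chunk again
    exit;
}
echo json_encode(['ok' => true]);
// A separate join step would then concatenate chunk_1..chunk_N in order and
// report any missing chunk numbers back to the client.
```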
libcurl (C api) could be a viable option
-C/--continue-at
Continue/Resume a previous file transfer at the given offset. The given offset is the exact number of bytes that will be skipped, counting from the beginning of the source file before it is transferred to the destination. If used with uploads, the FTP server command SIZE will not be used by curl.
Use "-C -" to tell curl to automatically find out where/how to resume the transfer. It then uses the given output/input files to figure that out.
If this option is used several times, the last one will be used
Google have created a Resumable HTTP Upload protocol. See https://developers.google.com/gdata/docs/resumable_upload
Is reversing the whole process an option? I mean, instead of pushing the file over to the server, make the server pull the file using a standard HTTP GET with all the bells and whistles (like Accept-Ranges, etc.).
Maybe the easiest method would be to create an upload page that accepts the filename and range as parameters, such as http://yourpage/.../upload.php?file=myfile&from=123456, and handle resumes in the client (maybe you could add a function to inspect which ranges the server has already received).
# Anton Gogolev
Lol, I was just thinking about the same thing: reversing the whole thing, making the server a client and the client a server. Thanks to Roel, it's clearer to me now why it wouldn't work.
# Roel
I would suggest implementing a Java uploader (JumpLoader is good, with its JScript interface and even sample PHP server-side code). Flash uploaders suffer badly when it comes to BIIIGGG files :), on a gigabyte scale that is.
F*EX can upload files up to the TB range via HTTP and is able to resume after link failures.
It does not exactly meet your needs, because it is written in Perl and needs a UNIX-based server, but the clients can be on any operating system. Maybe it is helpful for you nevertheless:
http://fex.rus.uni-stuttgart.de/
There is a protocol called TUS for resumable uploads, with implementations in PHP and C++.
