Backing up thousands of images in PHP? - php

I am wondering how I can back up a folder containing 8000+ images without the script timing out. The folder holds around 1.5 GB of data in total, and we need to back it up ourselves every so often.
I have tried the zip functionality provided in PHP, but the request simply times out because of the huge number of files. It does work with smaller numbers of files.
I am running this script through an HTTP request. Would running it from a cron job avoid the timeout?
Does anyone have any recommendations?

I would not use PHP for that.
If you are on Linux, I would set up a cron job to run a program like rsync periodically.
A nice introduction about rsync.
Edit: If you do want or need to go the PHP way, you can also consider just copying the files instead of using zip. Zip normally doesn't gain much on images, since most image formats are already compressed. If you already have a database, you can check your current directory against the database and do a differential backup (copy only the new files). That way only your initial backup would take a long time.
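A minimal sketch of the differential copy idea, under a few assumptions: instead of a database, it compares against what is already in the backup folder (same relative path and same size means "already backed up"), and the function name copyNewFiles and the paths are made up for illustration.

```php
<?php
/**
 * Copy files from $source to $target, skipping files that already exist
 * in the target with the same size. Returns the number of files copied.
 */
function copyNewFiles(string $source, string $target): int
{
    $copied = 0;
    $sourceRoot = rtrim($source, '/');
    $targetRoot = rtrim($target, '/');
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($sourceRoot, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::SELF_FIRST
    );

    clearstatcache();
    foreach ($iterator as $item) {
        // Path of this entry relative to the source root.
        $relative = substr($item->getPathname(), strlen($sourceRoot) + 1);
        $destination = $targetRoot . '/' . $relative;

        if ($item->isDir()) {
            if (!is_dir($destination)) {
                mkdir($destination, 0755, true);
            }
            continue;
        }
        // Assumption: same relative path and same size means already backed up.
        if (is_file($destination) && filesize($destination) === $item->getSize()) {
            continue;
        }
        if (!is_dir(dirname($destination))) {
            mkdir(dirname($destination), 0755, true);
        }
        if (!copy($item->getPathname(), $destination)) {
            throw new RuntimeException("Could not copy {$item->getPathname()}");
        }
        $copied++;
    }
    return $copied;
}
```

Run from cron, something like `php backup.php` (a hypothetical script calling copyNewFiles) would only spend real time on the first run. A size comparison will miss a file that was edited without changing size, so comparing modification times or checksums may be worth it if images get replaced in place.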

You can post the code so we can optimize it. Other than that, you should change your php.ini (the configuration file) and remove or increase the timeout, i.e. the max_execution_time setting, which is the longest time your script can run on your server.
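A small sketch of the timeout point, in case changing php.ini is not an option. set_time_limit(0) lifts max_execution_time for the current script only, and when PHP runs from the command line (which is how a cron job normally starts it) that limit defaults to 0, meaning unlimited. The helper name liftTimeLimit is made up for illustration.

```php
<?php
/**
 * Try to remove the execution time limit for this script and return the
 * limit that is in effect afterwards, in seconds (0 means unlimited).
 */
function liftTimeLimit(): int
{
    // Some hosts disable set_time_limit(); in that case the limit stays as it was.
    if (function_exists('set_time_limit')) {
        @set_time_limit(0);
    }
    return (int) ini_get('max_execution_time');
}

$limit = liftTimeLimit();
echo PHP_SAPI === 'cli'
    ? "Running from the command line (e.g. cron), limit: {$limit}s\n"
    : "Running under a web SAPI, limit: {$limit}s\n";
```

This only affects PHP's own limit. Timeouts configured in the web server or in PHP-FPM are separate and can still cut off a long HTTP request, which is one more reason to run the backup from cron.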

Related

PHP FPM download speed on a backup system

I'm building a backup system for a company and I need to understand why I can't get better download speeds using PHP.
The files are on the web server and I need to bring them to the backup server. The problem is that with wget I can download them to the backup server at 50 Mbps (the network limit), but with PHP's file_put_contents I only get about 2 Mbps for a single file, and when I try to download about 50 files at the same time they get 50 kbps each.
Since I'm downloading about 50 TB of content, and each file is about 800 MB to 1.2 GB, this would take months.
I'm using NGINX with PHP-FPM and the configs are correct everywhere: no limits, no timeouts, etc.
The code I'm using is basically this example, except that I'm also updating the number of bytes downloaded in MySQL:
https://www.php.net/manual/en/function.stream-notification-callback.php
Could this problem be related to file_put_contents performance? Is there a solution to get better download speeds?

Temp files being left over from PHP script on linux/apache

I've just published a client website. Its primary purpose is distributing content from other sources, so it regularly pulls in text, videos, images and audio from external feeds.
It also has an option for the client to manually add content to be distributed.
In PHP, all this makes a fair bit of use of copy() to copy files from another server and move_uploaded_file() to move manually uploaded files, and it also uses the SimpleImage image manipulation class to make multiple copies, crop them, etc.
Now to the problem: in amongst all of this, some temp files are not being deleted. This locks up the server pretty quickly, because when /tmp is full it causes things like MySQL errors and stops pages from loading.
I've spent a lot of time googling, which leads me to one thing: "temp files are deleted when the script is finished executing". This is clearly not the case here.
Is there anything I can do to make sure any temporary files created by the scripts are deleted?
I've spoken to my server guy, who suggested running a cron job that deletes from /tmp every 24 hours. I don't know whether this is a good solution, but it's certainly not THE solution, as I believe the files should be getting deleted. What could be stopping the files from being deleted?
Regardless of anything else you come up with, the cron idea is still a good one, as you want to make sure that /tmp is getting cleaned up. You can have the cron job delete anything older than 24 hours, not delete everything every 24 hours, assuming this leaves enough space.
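A rough sketch of the "delete anything older than 24 hours" variant, written in PHP so it can be called from a cron-run script. The function name deleteStaleFiles is made up, and it only looks at regular files directly inside the given directory. Pointing it at a dedicated scratch directory rather than all of /tmp is the safer assumption, since other programs keep files in /tmp too.

```php
<?php
/**
 * Delete regular files directly inside $dir whose modification time is
 * older than $maxAgeSeconds. Returns the paths that were removed.
 */
function deleteStaleFiles(string $dir, int $maxAgeSeconds = 86400): array
{
    clearstatcache();
    $cutoff = time() - $maxAgeSeconds;

    // Collect candidates first so nothing is deleted while iterating.
    $stale = [];
    foreach (new FilesystemIterator($dir, FilesystemIterator::SKIP_DOTS) as $file) {
        if ($file->isFile() && !$file->isLink() && $file->getMTime() < $cutoff) {
            $stale[] = $file->getPathname();
        }
    }

    $removed = [];
    foreach ($stale as $path) {
        // A file may vanish or be protected; skip it quietly in that case.
        if (@unlink($path)) {
            $removed[] = $path;
        }
    }
    return $removed;
}
```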
As for temp files being deleted when the script is done: as far as I know, this only happens when tmpfile() is used to create the temp file in the first place. Other files created in /tmp by other means (and there are many other means) will not just go away because the script is done.
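To illustrate the difference, here is a small sketch comparing tmpfile() with tempnam(). The 'img_' prefix and the variable names are only for illustration. The shutdown handler is one possible way to clean up files the script created itself. It will not run if the process is killed outright, which is why the cron cleanup is still worth having.

```php
<?php
// tmpfile(): the file is removed automatically when the handle is closed
// or when the script ends.
$handle = tmpfile();
$autoPath = stream_get_meta_data($handle)['uri'];
fwrite($handle, 'scratch data');
fclose($handle);
clearstatcache();
$autoRemoved = !file_exists($autoPath);

// tempnam(): creates a file that stays on disk until something deletes it.
$manualPath = tempnam(sys_get_temp_dir(), 'img_');
file_put_contents($manualPath, 'scratch data');

// Register cleanup so the file is removed when the script finishes normally.
register_shutdown_function(static function () use ($manualPath): void {
    if (is_file($manualPath)) {
        unlink($manualPath);
    }
});
$stillThere = file_exists($manualPath);
```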

Detecting whether a file is complete or partial in PHP

In this system, SFTP will be used to upload large video files to a specific directory on the server. I am writing a PHP script that will be used to copy completely uploaded files to another directory.
My question is, what is a good way to tell if a file is completely uploaded or not? I considered taking note of the file size, sleeping the script for a few seconds, reading the size again, and comparing the two numbers. But this seems very kludgy.
Simple
If you just want a simple technique that will handle most use cases, write a process that checks the last modified date or (as indicated in the question) the file size, and if it hasn't changed within a set time, say 1 minute, assume the upload is finished and process the file. In fact, your question is a near duplicate of another question showing exactly this.
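A minimal sketch of the simple technique, using the last modified time. The function name looksComplete and the optional $now parameter (there only to make it easy to try out) are made up for illustration, and the 60 second default is just the "say 1 minute" from above.

```php
<?php
/**
 * Returns true when the file has not been modified for at least
 * $quietSeconds, which is taken as a sign that the upload has ended.
 */
function looksComplete(string $path, int $quietSeconds = 60, ?int $now = null): bool
{
    // PHP caches stat results, so clear them before asking again.
    clearstatcache(true, $path);
    $mtime = @filemtime($path);
    if ($mtime === false) {
        return false; // missing or unreadable file
    }
    return (($now ?? time()) - $mtime) >= $quietSeconds;
}
```

A script run every minute could loop over the upload directory and only copy the files for which looksComplete() returns true.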
Robust
An alternative to polling the size/last modified time of a file (which with a slow or intermittent connection could cause an open upload to be considered complete too early) is to listen for when the file has finished being written with inotify.
By listening for IN_CLOSE_WRITE, you can tell when an FTP upload has finished updating a specific file.
I've never used it in PHP, but from the spartan docs it looks like it ought to work in exactly the same way as the underlying library.
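For what it's worth, a sketch of how this could look with the PECL inotify extension (it is not bundled with PHP, so this assumes it has been installed). The function name waitForClosedUploads is made up. The call blocks until something in the watched directory is closed after writing.

```php
<?php
/**
 * Block until at least one file in $dir is closed after being written,
 * then return the names of those files. Requires the PECL inotify extension.
 */
function waitForClosedUploads(string $dir): array
{
    if (!extension_loaded('inotify')) {
        throw new RuntimeException('The PECL inotify extension is not installed');
    }
    $inotify = inotify_init();
    $watch = inotify_add_watch($inotify, $dir, IN_CLOSE_WRITE);

    $finished = [];
    // inotify_read() blocks until events are available.
    foreach (inotify_read($inotify) ?: [] as $event) {
        if ($event['mask'] & IN_CLOSE_WRITE) {
            $finished[] = $event['name'];
        }
    }

    inotify_rm_watch($inotify, $watch);
    fclose($inotify);
    return $finished;
}
```

One caveat: the close event also fires when a connection drops mid-upload, so it says the writer is done with the file, not that the file is whole.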
Make FTP clients work for you
Take into account how some ftp programs work, from this comment:
it's worth mentioning that some ftp servers create a hidden temporary file to receive data until the transfer is finished. I.e. .foo-ftpupload-2837948723 in the same upload path as the target file
If applicable, from the finished filename you can therefore tell if the ftp upload has actually finished or was stopped/disconnected mid-upload.
In theory, if you are allowed to run system commands from PHP, you can get the PIDs of your SFTP server processes with the pidof command and then check the lsof output for open file descriptors on the upload file you are checking.
During a transfer, the only thing that knows the real size of the file is your SFTP client and maybe the SFTP server (I don't know the protocol specifics, so the server may know too). Unless you can talk to the SFTP server directly, you'll have to wait for the file to finish. Checking at some interval seems kludgy, but under the circumstances it is pretty good. It's what we do on my project.
If you have a very short poll interval, it's not terribly resilient when dealing with uploads that may stall for a few seconds. So just set the interval to once a minute or something like that. Assuming you're on a *nix based platform, you could set up a cron job that runs your script every minute (or whatever your polling interval is).
As @jcolebrand stated, if you have no control over the upload process, there is actually not much you can do except guess (from the size or date of the file). I would look for server software that allows you to execute hooks/scripts before or after some server action (here: file transfer complete). I am not sure, though, whether such software exists. As a last resort, you could adapt an open-source SFTP server to your requirements by adding such a hook yourself.
An interesting answer was mentioned in a comment under the original question (but for some unknown reason the comment was deleted): you can try parsing the SFTP server logs to find files that were uploaded completely.

PHP: How do I avoid reading partial files that are pushed to me with FTP?

Files are being pushed to my server via FTP. I process them with PHP code in a Drupal module. The OS is Ubuntu and the FTP server is vsftpd.
At regular intervals I check for new files, process them with SimpleXML and move them to a "Done" folder. How do I avoid processing a partially uploaded file?
vsftpd has lock_upload_files set to yes by default. I thought of attempting to move the files first, expecting the move to fail on a file that is still uploading. That doesn't seem to happen, at least on the command line: if I start uploading a large file and move it, it just keeps growing in the new location. I guess the directory entry is not locked.
Should I try fopen with mode 'a' or 'r+' just to see if it succeeds before attempting to load into SimpleXML or is there a better way to do this? I guess I could just detect SimpleXML load failing but... that seems messy.
I don't have control of the sender. They won't do an upload and rename.
Thanks
With the lock_upload_files option enabled, vsftpd locks files using the fcntl() function. This places advisory locks on files whose upload is in progress. Other programs are free to ignore advisory locks, and mv, for example, does. In general, advisory locks only matter to programs that explicitly check for them.
You need another command line tool, such as lockrun, that respects advisory locks.
Note: under Linux, lockrun must be compiled so that its WAIT_AND_LOCK(fd) macro uses lockf() and not flock() in order to work with locks that are set by fcntl(). Compiled with lockf(), it will cooperate with the locks set by vsftpd.
With these pieces (lockrun, mv, lock_upload_files) you can build a shell script or similar that moves files one by one, checking beforehand whether each file is locked and holding an advisory lock on it for as long as it is being moved. If the file is locked by vsftpd, lockrun can skip the call to mv, so that running uploads are skipped.
If locking doesn't work, I don't know of a solution as clean/simple as you'd like. You could make an educated guess by not processing files whose last modified time (which you can get with filemtime()) is within the past x minutes.
If you want a higher degree of confidence than that, you could check and store each file's size (using filesize()) in a simple database, and every x minutes check new size against its old size. If the size hasn't changed in x minutes, you can assume nothing more is being sent.
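A sketch of the size comparison, with the "simple database" reduced to an array that the caller stores between runs (in a real setup that could be a table or a JSON file, which is left out here). The function name compareWithSnapshot is made up for illustration.

```php
<?php
/**
 * Compare current file sizes in $dir with the sizes recorded on the
 * previous run. Returns [names whose size is unchanged, new size snapshot].
 */
function compareWithSnapshot(string $dir, array $previousSizes): array
{
    clearstatcache();
    $current = [];
    $unchanged = [];
    foreach (glob(rtrim($dir, '/') . '/*') ?: [] as $path) {
        if (!is_file($path)) {
            continue;
        }
        $name = basename($path);
        $current[$name] = filesize($path);
        // Only files seen on the previous run with the same size count as settled.
        if (isset($previousSizes[$name]) && $previousSizes[$name] === $current[$name]) {
            $unchanged[] = $name;
        }
    }
    return [$unchanged, $current];
}
```

Each run would load the previous snapshot, call the function, process the unchanged files, and save the new snapshot for the next run x minutes later.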
The lsof Linux command lists the files that are open on your system. I suggest executing it with shell_exec() from PHP and parsing the output to see which files are still being used by your FTP server.
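A sketch of the lsof check, with a few assumptions: lsof is installed, PHP is allowed to run commands, and exec() is used here instead of shell_exec() only because it also returns the exit status. The function name isOpenElsewhere is made up. lsof can generally only see other users' processes when run with enough privileges, so a "false" from an unprivileged PHP process is not proof that the FTP server has let go of the file.

```php
<?php
/**
 * Ask lsof whether any process it can see has $path open.
 * Returns true or false, or null when lsof could not be run.
 */
function isOpenElsewhere(string $path): ?bool
{
    $output = [];
    $status = 0;
    // -t prints only PIDs, one per line; "--" ends the option list.
    exec('lsof -t -- ' . escapeshellarg($path) . ' 2>/dev/null', $output, $status);
    if ($status === 126 || $status === 127) {
        return null; // lsof is missing or not executable
    }
    // No PIDs printed means lsof found no process holding the file open.
    return count($output) > 0;
}
```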
Picking up on the previous answer, you could copy the file over and then compare the sizes of the copied file and the original file at a fixed interval.
If the sizes match, the upload is done: delete the copy and work with the file.
If the sizes do not match, copy the file again and repeat.
Here's another idea: create a super (but hopefully not root) FTP user that can access some or all of the upload directories. Instead of your PHP code reading uploaded files right off the disk, make it connect to the local FTP server and download files. This way vsftpd handles the locking for you (assuming you leave lock_upload_files enabled). You'll only be able to download a file once vsftp releases the exclusive/write lock (once writing is complete).
You mentioned trying flock in your comment (and how it fails). It does indeed seem painful to try to match whatever locking vsftpd is doing, but dio_fcntl might be worth a shot.
I guess you've solved your problem years ago but still.
If you use some pattern to find the files you need, you can ask the party uploading the file to use a different name and rename the file once the upload has completed.
You should check the HiddenStores directive in ProFTPD, more info here:
http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html

Server file read/write concurrency issue

I have a web service that serves data from a MySQL database. I would like to create a cache file to improve performance. The idea is that once in a while we read data from the DB and generate a text file. My question is:
What if a client-side user is accessing the file while we are generating it?
We are using LAMP. In PHP, flock() handles concurrency problems, but my understanding is that it only applies when two PHP processes access the file simultaneously. Our case is different.
I don't know whether this will cause issues at all. If so, how can I prevent it?
Thanks,
Don't use locking.
If your cache file is /tmp/cache.txt, always regenerate the cache to /tmp/cache2.txt and then do a
mv /tmp/cache2.txt /tmp/cache.txt
or
rename('/tmp/cache2.txt','/tmp/cache.txt')
The mv/rename operation is atomic if it happens within the same filesystem, so no locking is needed.
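A minimal sketch of that pattern as a PHP function. The name writeCacheAtomically is made up, and tempnam() is used instead of a fixed cache2.txt name so that two generators running at once do not write to the same temporary file. The temporary file is created next to the target so that both are on the same filesystem.

```php
<?php
/**
 * Write $contents to $cacheFile so that readers only ever see the old
 * version or the complete new version, never a half-written file.
 */
function writeCacheAtomically(string $cacheFile, string $contents): void
{
    // Same directory as the target, so rename() stays within one filesystem.
    $tmp = tempnam(dirname($cacheFile), 'cache_');
    if ($tmp === false) {
        throw new RuntimeException('Could not create a temporary file');
    }
    // tempnam() creates the file readable by its owner only; open it up for readers.
    chmod($tmp, 0644);

    if (file_put_contents($tmp, $contents) === false || !rename($tmp, $cacheFile)) {
        @unlink($tmp);
        throw new RuntimeException("Could not update {$cacheFile}");
    }
}
```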
There are all sorts of optimisation options here:
1) Are you using the MySQL query cache? That can take a huge load off the database to start with.
2) You could pull the file through a web proxy like Squid (or Apache configured as a reverse caching proxy). I do this all the time and it's a really handy technique: generate the file by fetching it from a URL, using wget for example (that way you can put it in a cron job). The web proxy takes care of either delivering the same file that was there before or regenerating it if need be.
3) You don't want to be rolling your own file locking solution in this scenario.
Depending on your scenario, you could also consider caching pages in something like memcache, which is fantastic for high-traffic scenarios but possibly beyond the scope of this question.
You can use A -> B switching to avoid this issue.
For example, keep two copies of the cache file, A and B, and have the program read them via a symlink, C.
When the program builds the cache, it modifies the file that is not "current": if C links to A, it updates B. Once the update is complete, it switches the symlink to B.
Next time, it updates A and switches the symlink to A once the update is complete.
This way clients never read a file while it is being updated.
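A sketch of the A/B switch in PHP, assuming a Linux filesystem where symlinks are available. The function names are made up. To keep the switch itself atomic, the sketch creates a temporary symlink and renames it over C, rather than deleting C and recreating it.

```php
<?php
/**
 * Point the symlink $link at $newTarget in one step by creating a
 * temporary link and renaming it over the existing one.
 */
function switchSymlink(string $link, string $newTarget): void
{
    $tmpLink = $link . '.' . uniqid('tmp', true);
    if (!symlink($newTarget, $tmpLink)) {
        throw new RuntimeException("Could not create {$tmpLink}");
    }
    if (!rename($tmpLink, $link)) {
        @unlink($tmpLink);
        throw new RuntimeException("Could not switch {$link}");
    }
}

/** Return the copy (A or B) that the link does not currently point to. */
function inactiveCopy(string $link, string $a, string $b): string
{
    clearstatcache(true, $link);
    return (is_link($link) && readlink($link) === $a) ? $b : $a;
}
```

A cache rebuild would then call inactiveCopy() to pick the file to write, fill it, and call switchSymlink() once it is complete.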
When a client accesses the file, it reads it as it is at that moment.
flock() is for when two PHP processes access the file simultaneously.
I would solve it like this:
While generating the new text file, save it to a temporary file (cache.tmp). That way the old file (cache.txt) is still served as before.
When generation is done, delete the old file and rename the new file.
To avoid problems during that short period of time, your code should check whether cache.txt exists and retry for a short while if it does not.
Trivial, but that should do the trick.
