I have 20 million local files. Each file is identified by a numeric ID, and its name is the SHA-1 hash of that ID.
File 1 is named 356a192b7913b04c54574d18c28d46e6395428ab (sha1 of "1")
File 2 is named da4b9237bacccdf19c0760cab7aec4a8359010b0 (sha1 of "2")
etc. etc.
Not every number represents a file, but I have a list of all the numbers that do.
The files are placed in folders named after the first two characters in the hash, followed by the next two, followed by the next two.
For file 2 (da4b9237bacccdf19c0760cab7aec4a8359010b0) the folder structure is
da/4b/92/
The file is placed in that folder and named with its full hash, so the full path of the file is
da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b0
I now want to move all the files from the file system to a bucket at Amazon S3, and while doing so I want to move them out to the root of that bucket.
As there are so many files, it would be good to have a way to log which files have been moved and which ones failed for some reason; I need to be able to resume the operation if it fails.
My plan is to create a MySQL table called moved_files and then run a PHP script that fetches X IDs from the files table, uses the AWS SDK for PHP to copy each file to S3 and, if that succeeds, adds the ID to the moved_files table, roughly like the sketch below. However, I'm not sure this would be the fastest way to do it; maybe I should look into writing a bash script that uses the AWS CLI instead.
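Roughly, the PHP script I have in mind would look like the sketch below (this assumes the AWS SDK for PHP v3; the bucket name, database credentials, local storage root and batch size are all placeholders):

<?php
// Sketch only: upload a batch of files whose IDs are in `files` but not yet in
// `moved_files`. Bucket name, DSN and /storage are placeholder assumptions.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3  = new S3Client(['region' => 'eu-west-1', 'version' => 'latest']);
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Fetch the next batch of IDs that have not been moved yet
$ids = $pdo->query('SELECT id FROM files
                    WHERE id NOT IN (SELECT id FROM moved_files)
                    LIMIT 1000')->fetchAll(PDO::FETCH_COLUMN);

foreach ($ids as $id) {
    $hash = sha1((string) $id);
    $path = sprintf('/storage/%s/%s/%s/%s',
        substr($hash, 0, 2), substr($hash, 2, 2), substr($hash, 4, 2), $hash);

    try {
        $s3->putObject([
            'Bucket'     => 'mybucket',
            'Key'        => $hash,       // flat key in the bucket root
            'SourceFile' => $path,
        ]);
        // Record success so the run can be resumed later
        $pdo->prepare('INSERT INTO moved_files (id) VALUES (?)')->execute([$id]);
    } catch (Exception $e) {
        error_log("Failed to upload $hash: " . $e->getMessage());
    }
}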
Any suggestions would be appreciated!
I do NOT use AWS S3, but a little Googling suggests you need a command like:
aws s3 cp test.txt s3://mybucket/test2.txt
So, if you want to run that for all your files, I would suggest you use GNU Parallel to keep your connection fully utilised and to reduce latencies.
Please make a test directory with a few 10s of files to test with, then cd to that directory and try this command:
find . -type f -print0 | parallel -0 --dry-run aws s3 cp {} s3://rootbucket/{/}
Sample Output
aws s3 cp ./da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b0 s3://rootbucket/da4b9237bacccdf19c0760cab7aec4a8359010b0
aws s3 cp ./da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b1 s3://rootbucket/da4b9237bacccdf19c0760cab7aec4a8359010b1
If you have 8 CPU cores, that will run 8 parallel copies of aws at a time till all your files are copied.
The {} expands to mean "the current file", and {/} expands to mean "the current file without its directory". You can also add --bar to get a progress bar.
If that looks hopeful, we can add a little bash function for each file that updates your database, or deletes the local file, conditionally upon the success of the aws command. That looks like this - start reading at the bottom ;-)
#!/bin/bash
# bash function to upload a single file
upload() {
   local="$1"      # Local path passed in by GNU Parallel
   remote="$2"     # Basename to use as the S3 key
   # Remove the "echo" to actually upload instead of just printing the command
   if echo aws s3 cp "$local" "s3://rootbucket/$remote" ; then
      :   # Delete locally or update database with success
   else
      :   # Log error somewhere
   fi
}

# Export the upload function so processes started by GNU Parallel can find it
export -f upload

# Run GNU Parallel on all files
find . -type f -print0 | parallel -0 upload {} {/}
Related
I want to regularly update (rewrite, not append) a txt file from PHP using file_put_contents. Another PHP API reads this file and prints the content for the user.
Is it possible that when the user reads the file via the PHP API it comes back empty, because the first PHP script truncates the file before it writes the new content? If that is possible, how can I avoid it?
You can prevent this and make sure the file is never read empty; try the following solution:
Keep the text file you are writing in a tmp folder, e.g. tmp_txt, created alongside the location of your current text file, so the new content goes into this tmp folder first.
Create a shell script file and keep it under the tmp folder or any other folder.
In the shell script, check the file size and move the file once it is big enough, then put the script into the cron job scheduler:
find /your project root path/tmp_txt/ -type f -size +1k -name "mytext.txt" -exec mv {} /your project root path/folder where you want it/ \;
"find" searches for the file; the path that follows is your tmp folder path.
"-type f" considers regular files only.
"-size +1k" means larger than 1 KB.
"-name "mytext.txt"" matches your file name; for dynamic names use -name "*.txt".
"-exec mv {} ... \;" moves the file to the path given after it, but only once it matches the size condition above (1 KB); change that to suit your needs.
e.g. a cron job entry which runs the script every minute:
* * * * * bash /your project root path/tmp_txt/shellscriptfilename >> /dev/null 2>&1
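The same idea can also be done directly in PHP: write the new content to a temporary file and move it into place only when the write has finished. rename() within the same filesystem is atomic, so readers never see a half-written or empty file. A minimal sketch (file names are placeholder assumptions):

<?php
// Minimal sketch: write to a temp file first, then atomically move it over the
// real file. The file names below are placeholder assumptions.
$target = __DIR__ . '/content.txt';             // file the reader API serves
$tmp    = __DIR__ . '/tmp_txt/content.txt.tmp'; // temp file on the same filesystem

$newContent = "fresh data\n";

file_put_contents($tmp, $newContent, LOCK_EX);  // write the full content to the temp file
rename($tmp, $target);                          // atomic replace; readers see old or new, never empty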
I have a huge list of files (more than 48k files with their paths) and I want to create a symlink for each of these files.
Here is my PHP code:
$files = explode("\n", file_get_contents("files.txt"));
foreach ($files as $file) {
    $file = trim($file);
    @symlink($file, "/home/" . $file . "#" . rand(1, 80000) . ".txt");
}
The problem is that the process takes more than an hour.
I thought about checking whether the file exists first and then creating the symlink, so I did some research on php.net. There are functions like is_link() and readlink() that do what I wanted in the first place, but a comment caught my attention:
It is necessary to note that readlink only works on a real link file; if this link points to a directory, you cannot detect and read its contents, the is_link() function will always return false, and readlink() will display a warning if you try to do so.
So I made this new code:
$files = explode("\n", file_get_contents("files.txt"));
foreach ($files as $file) {
    $file = trim($file);
    if (!empty(readlink($file))) {
        @symlink($file, "/home/" . $file . "#" . rand(1, 80000) . ".txt");
    }
}
The problem now: no symlink files are created at all!
How can I prevent these problems? Should I use multithreading, or is there another option?
Obviously you are running a Linux-based operating system and your question is related to the file system.
In this case I would recommend creating a bash script that reads file.txt and creates the symlinks for all of them.
Good start to this is:
How to symlink a file in Linux?
Linux/UNIX: Bash Read a File Line By Line
Random number from a range in a Bash Script
So you may try something like this:
#!/bin/bash
while read -r name
do
    # Create a symlink for $name with a random numeric prefix
    ln -s "/full/path/to/the/file/$name" "/path/to/symlink/$(shuf -i 1-80000 -n 1)$name.txt"
done < file.txt
EDIT:
One line solution:
while read -r name; do ln -s "/full/path/to/the/file/$name" "/path/to/symlink/$(shuf -i 1-80000 -n 1)$name.txt"; done < file.txt
Note: Replace "file.txt" with the full path to the file, and test it on a small number of files first in case anything goes wrong.
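For comparison, the same loop directly in PHP, with the list actually read from disk and each source file checked before linking, would look roughly like this (the /home/ target pattern is taken from the question; basename() is my own assumption so the links land in one directory):

<?php
// Sketch: read files.txt (one path per line) and create one symlink per file,
// only for source files that actually exist.
$files = file('files.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($files as $file) {
    if (file_exists($file)) {   // skip paths that do not point to a real file
        symlink($file, '/home/' . basename($file) . '#' . rand(1, 80000) . '.txt');
    }
}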
I have a little problem: I have a large 41 GB file on a server and I need to extract it.
How would I go about it? The file is in tar.gz format; extraction takes 24 hours on a GoDaddy server and then it stops for some reason.
I need to exclude a folder named data; it contains the bulk of the data (40.9 GB), the rest is just PHP.
home/xxx/public_html/xxx.com.au/data << this is the folder I don't need
I have been searching Google and other sites for days but nothing works.
shell_exec('tar xvf xxx_backup_20140921.tar.gz'); is the command I use. I have even used the k option to skip existing files and it doesn't work.
I have used the --exclude option but got nothing.
Try this:
shell_exec("tar xzvf xxx_backup_20140921.tar.gz --exclude='home/xxx/public_html/xxx.com.au/data'");
This should prevent the path listed (relative to the root of the archive) from being extracted.
I have created hundreds of folders and text files using php, I then add them to a zip archive.
This all works fine, but if I create another zip archive using the same folders and files, the new archive will have a different hash from the first one. The same happens if I use WinRAR instead of PHP to create the archive.
It only seems to produce different hashes when I zip the files I have created through PHP, yet they open fine.
Very strange. Can anyone shed any light on this?
Thanks
Zip is not deterministic. To solve this zip problem (it's a real problem when you have CI and need to update an AWS Lambda, for example, and don't want to update it every time but only when something has really changed) I used this article: https://medium.com/@pat_wilson/building-deterministic-zip-files-with-built-in-commands-741275116a19
Like this:
find . -exec touch -t "$(git ls-files -z . | \
xargs -0 -n1 -I{} -- git log -1 --date=format:"%Y%m%d%H%M" --format="%ad" '{}' | \
sort -r | head -n 1)" '{}' +
zip -rq -D -X -9 -A --compression-method deflate dest.zip sources...
There is certainly some difference in the files. If the lengths are not exactly the same, the hash will be different. You can use a comparing hex editor, like Hex Workshop for example, to see what exactly the differences are.
Possibilities that come to my mind:
As @orn mentioned, there may be a timestamp in the zip format you are using (not sure).
The order that the files are added to the archive may be different (depending on how you're selecting them / building the source array).
You can consider using deterministic_zip; it solves this issue. From its documentation:
There are three tricks to building a deterministic zip:
Files must be added to the zip in the same order. Directory iteration order may vary across machines, resulting in different zips. deterministic_zip sorts all files before adding them to the zip archive.
Files in the zip must have consistent timestamps. If I share a directory to another machine, the timestamps of individual files may differ, despite having identical content. To achieve timestamp consistency, deterministic_zip sets the timestamp of all added files to 2019-01-01 00:00:00.
Files in the zip must have consistent permissions. File permissions look like -rw-r--r-- for a file that is readable by all users, and only writable by the user who owns the file. Similarly executable files might have permissions that look like: -rwxr-xr-x or -rwx------. deterministic_zip sets the permission of all files added to the archive to either -r--r--r--, or -r-xr-xr-x. The latter is only used when the user running deterministic_zip has execute access on the file.
Note: deterministic_zip does not modify nor update timestamps of any files it adds to archives. The techniques used above apply only to the copies of files within archives deterministic_zip creates.
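If the archive is being built with PHP's ZipArchive, the same three ideas can be approximated directly. A rough sketch, assuming PHP >= 8.0 (for setMtimeName) and a placeholder source directory:

<?php
// Sketch: sorted file order, fixed timestamps and fixed permissions with
// ZipArchive. $sourceDir and dest.zip are placeholder assumptions.
$sourceDir = __DIR__ . '/output';
$iterator  = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($sourceDir, FilesystemIterator::SKIP_DOTS)
);

$paths = [];
foreach ($iterator as $item) {
    if ($item->isFile()) {
        $paths[] = $item->getPathname();
    }
}
sort($paths);                                    // 1. add files in a fixed order

$zip = new ZipArchive();
$zip->open('dest.zip', ZipArchive::CREATE | ZipArchive::OVERWRITE);

foreach ($paths as $path) {
    $entry = substr($path, strlen($sourceDir) + 1);
    $zip->addFile($path, $entry);
    $zip->setMtimeName($entry, mktime(0, 0, 0, 1, 1, 2019));   // 2. fixed timestamp
    $zip->setExternalAttributesName(                           // 3. fixed permissions
        $entry, ZipArchive::OPSYS_UNIX, 0100644 << 16
    );
}
$zip->close();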
I have a directory on a remote machine in which my clients are uploading (via different tools and protocols, from WebDav to FTP) files. I also have a PHP script that returns the directory structure. Now, the problem is, if a client uploads a large file, and I make a request during the uploading time, the PHP script will return the file even if it's not completely uploaded. Is there a way to check whether a file is completely uploaded using PHP?
Set up your remote server to move uploaded files to another directory once the upload completes, and only query that directory for files.
AFAIK, there is no way (at least cross-machine) to tell if a file is still being uploaded, without doing something like:
Query the file's length
Wait a few seconds
Query the file's length
If it's the same, it's possibly complete; see the sketch below.
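A minimal PHP sketch of that size-polling check (the 5-second delay is an arbitrary assumption):

<?php
// Sketch: compare the file size at two points in time. If it has not changed,
// the upload has probably (but not certainly) finished.
function probablyUploaded(string $path, int $delay = 5): bool
{
    clearstatcache(true, $path);
    $first = filesize($path);

    sleep($delay);

    clearstatcache(true, $path);   // filesize() results are cached, so clear again
    $second = filesize($path);

    return $first !== false && $first === $second;
}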
Most UNIX/Linux/BSD-like operating systems have a command called lsof ("list open files") which outputs a list of all currently open files on the system. You can run that command to see whether any process is still working with the file; if not, the upload has finished. In this example, awk is used to filter the output so that only files opened with write or read/write file handles are shown:
if (shell_exec("lsof | awk '\$4 ~ /.*[uw]/' | grep " . $uploaded_file_name) == '') {
/* No file handles open for this file, so upload is finished. */
}
I'm not very familiar with Windows servers, but this thread might help you to do the same on a Windows machine (if that is what you have): How can I determine whether a specific file is open in Windows?
I would think that some clients and operating systems create a ".part" file while a transfer is in progress, so there may be a way to check for the existence of such a file. Otherwise, I agree with Brian's answer. If the file were being uploaded by a PHP script on the same system, it would be simple enough to tell from move_uploaded_file()'s return value, but it does become a challenge when pulling from a remote directory that can be written to over different protocols.