I have a PHP application running on a cluster of servers. It copies a file from an AWS S3 bucket to the server, processes it (unzips the file, converts the PDF to XML using iText for Java, reads the XML and saves the data to a database), and then uploads the processed file back to the bucket.
It works fine for a single instance, but with load balancing across multiple instances, a file that is being processed on one server disappears.
I cannot process the file directly from the bucket, since I can neither unzip it there nor run a JAR file against it, so I have to store the file temporarily for processing. Is there any way to handle this situation?
A few possible solutions:
Use a central key-value store (database) to track the paths of the files you are currently processing; when downloading a new file, check whether that file isn't already being processed. You could use Redis for this (see the sketch after this list).
Upload a new, empty file to S3, with something in the file name so you know that if that file is present, the accompanying file is already being processed (though I'm not sure whether S3 caches directory listings). With this solution you should also consider the cost of writing a file to S3, which depends on your scale.
Rename or remove the file from S3 while it's being processed
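A minimal sketch of the Redis idea, assuming the phpredis extension and a made-up key naming scheme (processing:<object key>); the important part is that SET with the NX flag is atomic, so only one instance can claim a file:

<?php
// Sketch only: assumes the phpredis extension and a reachable Redis instance.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

function claimFile(Redis $redis, string $objectKey): bool
{
    // SET ... NX EX 3600 is atomic: it succeeds only if the key does not
    // exist yet, and it expires after an hour in case the instance dies
    // mid-processing.
    return (bool) $redis->set('processing:' . $objectKey, gethostname(), ['nx', 'ex' => 3600]);
}

function releaseFile(Redis $redis, string $objectKey): void
{
    $redis->del('processing:' . $objectKey);
}

$key = 'uploads/archive-123.zip'; // example object key
if (claimFile($redis, $key)) {
    // ... download from S3, unzip, convert, upload, then:
    releaseFile($redis, $key);
} else {
    // another instance is already working on this file, skip it
}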
There can be multiple solutions to this:
One solution is to use object tags: once a file has been processed, apply a tag such as processed=true, and check the tags before downloading a file.
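A rough sketch of the tagging idea with the AWS SDK for PHP v3; the bucket and key names are placeholders:

<?php
// Sketch only: AWS SDK for PHP v3, placeholder bucket and key names.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);
$bucket = 'my-bucket';
$key    = 'uploads/archive-123.zip';

// Before downloading, check whether the object is already tagged as processed.
$tags = $s3->getObjectTagging(['Bucket' => $bucket, 'Key' => $key]);
$alreadyProcessed = false;
foreach ($tags['TagSet'] as $tag) {
    if ($tag['Key'] === 'processed' && $tag['Value'] === 'true') {
        $alreadyProcessed = true;
    }
}

if (!$alreadyProcessed) {
    // ... download and process the file ...

    // Mark the object as processed so other instances skip it.
    $s3->putObjectTagging([
        'Bucket'  => $bucket,
        'Key'     => $key,
        'Tagging' => ['TagSet' => [['Key' => 'processed', 'Value' => 'true']]],
    ]);
}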
A better solution is to use Lambda for this task.
You can use the following pattern:
S3 upload event triggers Lambda
Lambda drops a message in SQS
Application monitors SQS
Application processes the file
Application deletes the message
Or just have Lambda do all the work on S3 upload, depending on how long the process runs; the maximum execution time is 5 minutes.
http://docs.aws.amazon.com/lambda/latest/dg/limits.html
For example:
Set up a Lambda function to monitor the S3 bucket for the new-object upload event, then have the Lambda function drop a message in SQS (from the event data it receives, the Lambda function knows the source bucket name and object key name). The server can monitor the queue, process a message, extract the file, upload it to a new bucket, delete the file from the old S3 bucket and then delete the message from the queue. If the server dies during processing, the message goes back onto the queue (visibility timeout). A way to ensure it is processed and deleted from the old bucket is to enable versioning and a lifecycle policy: when processing a message, if the file doesn't exist in the old bucket, send an alert and/or check for the previous version. You can also add a lifecycle policy on the old bucket to permanently delete versions once they are older than X days.
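A minimal sketch of the server-side consumer with the AWS SDK for PHP v3; the queue URL and bucket are placeholders, the processing step is a stub, and the exact message shape depends on what your Lambda puts in the body:

<?php
// Sketch only: AWS SDK for PHP v3, placeholder queue URL and buckets.
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\Sqs\SqsClient;

$sqs = new SqsClient(['region' => 'us-east-1', 'version' => 'latest']);
$s3  = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);
$queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/file-jobs'; // placeholder

while (true) {
    $result = $sqs->receiveMessage([
        'QueueUrl'            => $queueUrl,
        'MaxNumberOfMessages' => 1,
        'WaitTimeSeconds'     => 20,  // long polling
        'VisibilityTimeout'   => 300, // hide the message while this server works on it
    ]);

    foreach ($result->get('Messages') ?? [] as $message) {
        // Assumes the Lambda wrote a JSON body with the source bucket and key.
        $body   = json_decode($message['Body'], true);
        $bucket = $body['bucket'];
        $key    = $body['key'];

        // ... download, unzip, convert, upload the result to the new bucket ...

        $s3->deleteObject(['Bucket' => $bucket, 'Key' => $key]);
        $sqs->deleteMessage([
            'QueueUrl'      => $queueUrl,
            'ReceiptHandle' => $message['ReceiptHandle'],
        ]);
    }
}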
Monitoring S3 with Lambda
http://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
S3 Versioning
http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html
Select Permanently delete previous versions and then enter the number of days after an object becomes a previous version to permanently delete the object (for example, 455 days).
http://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
What you need is a system that will store the file without losing it. There are several alternatives for that:
a) Another server
b) An SQS Queue.
@strongiz's answer above explains it very well.
c) Even another database.
In each of these cases, you need a flag that records whether the file has been processed or not. When file processing is complete, either:
a) delete the file, or
b) change the flag.
Since PHP is session oriented, you can't store data there permanently, so you need to connect to another interface. In the case of a database, you can store the file path and a flag that says whether the file has been processed or not (see the sketch below). A combination of the three might also work.
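A minimal sketch of the database flag with PDO and MySQL; the table and column names are made up for illustration. The key point is that a single UPDATE ... WHERE status = 'new' is atomic, so only one instance can claim a given file:

<?php
// Sketch only: PDO/MySQL, hypothetical `files` table with `path` and `status` columns.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

function claimFile(PDO $pdo, string $path): bool
{
    // Atomic claim: only one instance can flip 'new' to 'processing'.
    $stmt = $pdo->prepare(
        "UPDATE files SET status = 'processing' WHERE path = ? AND status = 'new'"
    );
    $stmt->execute([$path]);
    return $stmt->rowCount() === 1;
}

function markProcessed(PDO $pdo, string $path): void
{
    $stmt = $pdo->prepare("UPDATE files SET status = 'processed' WHERE path = ?");
    $stmt->execute([$path]);
}

$path = 'uploads/archive-123.zip'; // example file path
if (claimFile($pdo, $path)) {
    // ... download, unzip, convert, upload ...
    markProcessed($pdo, $path);
}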
Related
I have numerous storage servers, and then more "cache" servers which are used to load balance downloads. At the moment I use RSYNC to copy the most popular files from the storage boxes to the cache boxes, then update the DB with the new server IDs, so my script can route the download requests to a random box which has the file.
I'm now looking at better ways to distribute the content, and wondering whether it's possible to route requests to any box at random and have the download script check if the file exists locally. If it doesn't, it would "get" the file contents from the remote storage box and output the content in real time to the browser, while keeping the file on the cache box so that the next time the same request is made it can just serve the local copy rather than connecting to the storage box again.
Hope that makes sense(!)
I've been playing around with RSYNC, wget and cURL commands, but I'm struggling to find a way to output the data to the browser as it comes in.
I've also been reading up on reverse proxies with nginx, which sounds like the right route... but it still sounds like they require the entire file to be downloaded from the origin server to the cache server before anything can be output to the client(?). Some of my files are 100 GB+ and each server has a 1 Gbps bandwidth limit, so at best it would take 100s of seconds to download a file of that size to the cache server before the client sees any data at all. There must be a way to "pipe" the data to the client as it streams in?
Is what I'm trying to achieve possible?
You can pipe data without downloading the full file by using streams. One example of downloading a file as a stream is the Guzzle sink feature. One example of sending a file to the user as a stream is the Symfony StreamedResponse. Using those, the following can be done:
Server A has a file the user wants
Server B gets the user request for the file
Server B uses Guzzle to setup a download stream to server A
Server B outputs the StreamedResponse directly to the user
Doing so will serve the download in real time without having to wait for the entire file to be finished. However, I do not know whether you can stream to the user and store the file on disk at the same time. There's a stream_copy_to_stream function in PHP which might allow this, but I don't know that for sure. A rough sketch of the basic flow follows.
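This is what that flow could look like, assuming Guzzle 6+ and the Symfony HttpFoundation component; the origin URL is a placeholder, and here the 'stream' request option is used instead of a sink so the body can be read in chunks. The simultaneous write-to-disk part is only hinted at in comments, since that was the open question:

<?php
// Sketch only: Guzzle + Symfony HttpFoundation, placeholder origin URL.
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\HttpFoundation\StreamedResponse;

$origin = 'http://storage-box.internal/files/big-file.bin'; // hypothetical storage box URL

$client   = new Client();
$upstream = $client->request('GET', $origin, ['stream' => true]); // body is not buffered
$body     = $upstream->getBody();

$response = new StreamedResponse(function () use ($body) {
    // Optionally also keep a local cache copy while streaming:
    // $local = fopen('/cache/big-file.bin', 'wb');
    while (!$body->eof()) {
        $chunk = $body->read(8192);
        echo $chunk;                 // send to the client as it arrives
        // fwrite($local, $chunk);   // ...and write the cache copy at the same time
        flush();
    }
});
$response->headers->set('Content-Type', 'application/octet-stream');
$response->send();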
I have a long-running script which gets a (long) array of folders (each with a subarray of files in that folder), and I have to perform several actions on each file.
What is the best way to make sure all actions complete successfully? And how should I handle unsuccessful actions?
For example, what happens if my MySQL server is unavailable, or the Amazon S3 API is not working correctly?
pseudocode of my script:
starting script with folders / files array
looping through each folder
looping through each file in that folder
open file (from external server) and try converting it to custom object (only continue if file is a valid "object")
extract some parts of file and save them to Amazon S3 bucket
extract some other parts of file and save them to another Amazon S3 bucket
extract metadata / text of file and insert into elasticsearch
update mysql record
As mentioned, what you could do is throw and catch Exceptions.
So for instance, if you iterate over the files in a folder using a foreach and do something with those files, on an error you can throw an Exception, and it will stop code execution until it is caught.
So maybe you want to use a logger instead. Since it is 2014, you probably want to use a DIC (dependency injection container) to inject a logger service; otherwise, you can just use a singleton (keeping in mind the well-known flaws that brings) that stores your errors.
So either way you have this service that stores every error.
At the end you just check if it has any errors and then act accordingly.
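A minimal sketch of that pattern, assuming a PSR-3 logger (Monolog here, purely as an example) and stub processing calls standing in for the S3/Elasticsearch/MySQL steps from the question:

<?php
// Sketch only: Monolog as an example PSR-3 logger, stubbed processing steps.
require 'vendor/autoload.php';

use Monolog\Handler\StreamHandler;
use Monolog\Logger;

$logger = new Logger('import');
$logger->pushHandler(new StreamHandler('import.log'));

$failures = [];

// $folders is the folders/files array described in the question.
foreach ($folders as $folder) {
    foreach ($folder['files'] as $file) {
        try {
            // ... open/convert the file, save parts to the S3 buckets,
            //     index metadata in Elasticsearch, update the MySQL record ...
        } catch (\Exception $e) {
            // Log and remember the failure, but keep going with the next file.
            $logger->error('Failed to process file', [
                'file'  => $file,
                'error' => $e->getMessage(),
            ]);
            $failures[] = $file;
        }
    }
}

// At the end, act on the collected failures (retry, alert, ...).
if ($failures !== []) {
    $logger->warning(count($failures) . ' file(s) failed and may need a retry');
}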
I want to use Amazon's S3 service to store the files that users upload to my LAMP application. I'm wondering what would be the best way (time, cost, security, etc.) to do this. I'm familiar with uploading files over HTTP with PHP processing, but I've always saved them to local storage. Should I have a tmp directory that the SDK uploads from, or should (or even could) I upload the file to S3 from a data variable? Also, I would like to be able to handle 5 GB files, but at the moment I'm only running half a GB of RAM; would this cause any issues while I'm in the alpha of my project? Keep in mind my web server is an EC2 instance. Thanks for the help.
Q1: Uploading from memory without creating a temporary file.
Yes, you can do it. The Amazon SDK has "putObjectFile" and "putObjectString" functions, the first creates an object from a temporary file, the second from a string.
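A short, hedged sketch: the first call assumes the standalone S3 PHP class those functions come from (the exact signature can vary between versions), and the commented lines show the equivalent with the official AWS SDK for PHP v3, whose putObject accepts a string Body directly. Bucket and key names are placeholders:

<?php
// Sketch only: placeholder bucket/key names and credentials.
require 'S3.php';                        // the standalone S3 PHP class
S3::setAuth('ACCESS_KEY', 'SECRET_KEY'); // placeholder credentials

$data = '... file contents already in memory ...';
S3::putObjectString($data, 'my-bucket', 'uploads/report.pdf'); // signature may vary by version

// Equivalent with the official AWS SDK for PHP v3:
// $s3 = new Aws\S3\S3Client(['region' => 'us-east-1', 'version' => 'latest']);
// $s3->putObject(['Bucket' => 'my-bucket', 'Key' => 'uploads/report.pdf', 'Body' => $data]);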
Q2: Uploading large files (5GB).
While you can get a server with 5 GB of memory, it's a bit overkill just to hold the data entirely in memory while the upload happens - so going for a temporary file and streaming chunk by chunk from that file would probably be wise. To handle chunks with cURL in PHP, you may need to add a CURLOPT_READFUNCTION that reads a bit of the file at a time for upload.
The name of a callback function where the callback function takes three parameters. The first is the cURL resource, the second is a stream resource provided to cURL through the option CURLOPT_INFILE, and the third is the maximum amount of data to be read. The callback function must return a string with a length equal or smaller than the amount of data requested, typically by reading it from the passed stream resource. It should return an empty string to signal EOF.
You can find the curl functions in the Amazon SDK, function getResponse(). The class is labelled "final" so you'll need to actually modify the SDK to add this in.
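A standalone sketch of the CURLOPT_READFUNCTION idea (independent of the SDK), streaming a local temporary file in chunks; the target URL is a placeholder, and a real S3 upload would also need the proper authentication headers:

<?php
// Sketch only: placeholder URL, no S3 auth headers; shows the chunked-read callback.
$file = '/tmp/upload.zip';
$fh   = fopen($file, 'rb');

$ch = curl_init('https://example.com/upload-target'); // placeholder
curl_setopt($ch, CURLOPT_UPLOAD, true);               // PUT-style upload
curl_setopt($ch, CURLOPT_INFILE, $fh);
curl_setopt($ch, CURLOPT_INFILESIZE, filesize($file));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Called repeatedly by cURL; return at most $length bytes, '' signals EOF.
curl_setopt($ch, CURLOPT_READFUNCTION, function ($ch, $stream, $length) {
    return (string) fread($stream, $length);
});

$result = curl_exec($ch);
curl_close($ch);
fclose($fh);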
Q3: Costs. A server with a slightly larger hard disk (to store temp files) is most likely cheaper than adding memory.
Q4: Security. You can store your temp files outside the web root, so they will be as secure as your web server is. If your web server gets compromised, the attacker will get your Amazon secret key anyway - so this shouldn't really be any more of a concern than protecting the rest of your application.
I'm working on an auto-update solution, and I'm using Amazon S3 for distribution.
I would like for this to work like follows:
I upload a file to s3 folder
An automatic PHP script detects that a new file has been added and notifies clients
To do this, I somehow need to list all files in an amazon bucket's folder, and find the one which has been added last.
I've tried $s3->list_objects("mybucket");, but it returns the list of all objects inside the bucket, and I don't see an option to list only files inside the specified folder.
What is the best way to do this using Amazon S3 PHP api?
To do this, I somehow need to list all files in an amazon bucket's folder, and find the one which has been added last.
S3's API isn't really optimized for sort-by-modified-date, so you'd need to call list_objects() and check each timestamp, always keeping track of the newest one until you get to the end of the list.
An automatic PHP script detects that a new file has been added and notifies clients
You'd need to write a long-running PHP CLI script that starts with:
while (true) { /*...*/ }
Maybe throw an occasional sleep(1) in there so that your CPU doesn't spike so badly, but you essentially need to sleep-and-poll, looping over all of the timestamps each time.
I've tried $s3->list_objects("mybucket");, but it returns the list of all objects inside the bucket, and I don't see an option to list only files inside the specified folder.
You'll want to set the prefix parameter in your list_objects() call.
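A sketch of the polling loop; this uses the AWS SDK for PHP v3 (ListObjectsV2 with a Prefix), since I'm not sure which SDK version the question's list_objects() comes from, but the older SDKs take an equivalent prefix option. The bucket and prefix are placeholders:

<?php
// Sketch only: AWS SDK for PHP v3, placeholder bucket and "folder" prefix.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);

$lastSeen = null; // timestamp of the newest object we've already notified about

while (true) {
    $newest = null;
    // Paginate in case the "folder" holds more than 1000 keys.
    foreach ($s3->getPaginator('ListObjectsV2', [
        'Bucket' => 'mybucket',
        'Prefix' => 'updates/', // the "folder"
    ]) as $page) {
        foreach ($page['Contents'] ?? [] as $object) {
            if ($newest === null || $object['LastModified'] > $newest['LastModified']) {
                $newest = $object;
            }
        }
    }

    if ($newest !== null && ($lastSeen === null || $newest['LastModified'] > $lastSeen)) {
        $lastSeen = $newest['LastModified'];
        // ... notify clients that $newest['Key'] has been added ...
    }

    sleep(1); // poll gently so the CPU doesn't spike
}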
S3 launched versioning functionality for the files in a bucket: http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html.
You could get the latest n files by calling s3client.listVersions(request) and specifying n if you want. See http://docs.aws.amazon.com/AmazonS3/latest/dev/list-obj-version-enabled-bucket.html
That example is in Java; I'm not sure whether the PHP API exposes versioning.
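For what it's worth, the AWS SDK for PHP v3 does expose the version listing as listObjectVersions; a rough sketch with a placeholder bucket name:

<?php
// Sketch only: AWS SDK for PHP v3, placeholder bucket name.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);

$result = $s3->listObjectVersions([
    'Bucket'  => 'mybucket',
    'MaxKeys' => 10, // cap the number of entries returned per page
]);

foreach ($result['Versions'] ?? [] as $version) {
    echo $version['Key'], ' ', $version['VersionId'],
         ' (latest: ', $version['IsLatest'] ? 'yes' : 'no', ")\n";
}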
I need to send a file from one PHP page (on which the client uploads their files) to another PHP page on another server, where the files will be finally stored.
To communicate, I currently use the JSON-RPC protocol; is it wise to send the file this way?
$string = file_get_contents("uploaded_file_path");
send the string to remote server and then
file_put_contents("file_name", $recieved_string_from_remte);
I understand that this approach takes twice as long as uploading directly to the second server.
Thanks
[edit]
details:
I need to write a service allowing a PHP (maybe Joomla) user to use a simple API to upload files and send some other data to my server, which analyzes them, puts them in a DB, and sends back a response.
[re edit]
I need to create a simple method allowing the end user to do that; they will use the interface on server 1 (the upload side) with plain PHP and nothing else, so no remote SSH mounts or other strange tricks.
If I were you, I'd send the file directly to the second server and store its file name and/or some hash of the file name (for easier retrieval) in a database on the first server.
Using this approach, you could query the second server from the first one for the status of the operation. This way, you can leave the file processing to the second machine, and assign user interaction to the first machine.
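A rough sketch of that split, with a hypothetical receiving endpoint on the second server and a hypothetical uploads table on the first; the first server forwards the upload with cURL and records the name and hash locally so the status can be queried later:

<?php
// Sketch only: hypothetical receive.php endpoint on server 2 and a local `uploads` table.
// ---- on server 1: receive the user upload and forward it ----
$tmpPath  = $_FILES['upload']['tmp_name'];
$fileName = $_FILES['upload']['name'];
$hash     = sha1_file($tmpPath);

$ch = curl_init('https://server2.example.com/receive.php'); // hypothetical endpoint
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, [
    'file' => new CURLFile($tmpPath, mime_content_type($tmpPath), $fileName),
]);
curl_exec($ch);
curl_close($ch);

// Record the file on server 1 so its status can be queried later.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->prepare('INSERT INTO uploads (name, hash, status) VALUES (?, ?, ?)')
    ->execute([$fileName, $hash, 'sent']);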
As I said in my comment, THIS IS NOT RECOMMENDED, but anyway...
You can use sockets reading byte by byte:
http://php.net/manual/en/book.sockets.php
or you can use ftp:
http://php.net/manual/en/book.ftp.php
Anyway, the question with your approach is whether the process runs asynchronously or synchronously with the user's navigation. I really suggest you transfer it via SQL or FTP and give the user a response based on another event (like a file watcher, then an email, etc.), or via SQL (binary, BLOB, etc.).
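A minimal sketch of the FTP option with PHP's built-in ftp_* functions; the host and credentials are placeholders:

<?php
// Sketch only: placeholder host/credentials, assumes the ftp extension is enabled.
$local  = $_FILES['upload']['tmp_name'];
$remote = 'incoming/' . basename($_FILES['upload']['name']);

$conn = ftp_connect('server2.example.com');   // placeholder host
ftp_login($conn, 'ftp-user', 'ftp-password'); // placeholder credentials
ftp_pasv($conn, true);                        // passive mode is usually needed behind NAT

if (!ftp_put($conn, $remote, $local, FTP_BINARY)) {
    error_log('FTP transfer failed for ' . $remote);
}
ftp_close($conn);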
Use SSHFS on machine 1 to map a file path to machine 2 (using SSH) and save the uploaded file to machine 2. After the file is uploaded, trigger machine 2 to do the processing and report back as normal.
This would allow you to upload to machine 1, but actually stream it to machine 2's HD so it can be processed faster on that machine.
This will be faster than any SQL or manual file copy solution, because the file transfer happens while the user is uploading the file.
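A small sketch of what that looks like in practice; the mount is set up once on machine 1 (paths and hostnames are placeholders), and the PHP side simply writes the upload into the mounted path:

<?php
// Sketch only: placeholder paths/host. One-time mount on machine 1 (shell):
//   sshfs user@machine2:/var/uploads /mnt/machine2-uploads
// After that, writing to /mnt/machine2-uploads streams the data to machine 2 over SSH.

$target = '/mnt/machine2-uploads/' . basename($_FILES['upload']['name']);
move_uploaded_file($_FILES['upload']['tmp_name'], $target);

// Then trigger processing on machine 2, e.g. via a hypothetical endpoint:
// file_get_contents('https://machine2.example.com/process.php?file=' . urlencode(basename($target)));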
If you don't need the files immediately after receiving them (for processing etc.), then you can save them all in one folder on Server 1 and set up a cron job to scp the contents of the folder to Server 2. All of this assumes you are using Linux servers; it is one of the most secure and efficient ways to do it.
For more info please take a look at http://en.wikipedia.org/wiki/Secure_copy or google scp.