Okay, this is somewhat complicated in my head and I hope I can explain it. If anything is unclear, please comment so I can refine the question.
I want to handle user file uploads to a third server.
So we have
the User
the website (the server the website runs on)
the storage server (which receives the file)
The flow should be like:
The website requests an upload URL from the storage cloud's gateway that points directly to the final storage server (something like http://serverXY.mystorage.com/upload.php). Along with the request, a "target path" (website-specific and globally unique) and a redirect URL are sent.
The website generates an upload form with the storage server's upload URL as target; the user selects a file and clicks the submit button. The storage server handles the POST request, saves the file to a temporary location (which is '/tmp-directory/'.sha1(target-path-from-above)) and redirects the user back to the redirect URL that was specified by the website. The "target path" is also passed along.
I do not want any "ghosted files" to remain if the user cancels the process or the connection gets interrupted! Entries in the website's database that have not been correctly processed in the storage cloud, and are therefore broken, must also be avoided. That's the reason for this and the next step.
These are the critical steps:
The website now writes an entry to its own database and issues a RESTful request to the storage API (signed; the website has to authenticate with a secret token) that
copies the file from its temporary location on the storage server to its final location (this should be fast because it's only a rename)
the same REST request also inserts a database row in the storage network's database, with the website's ID as owner
All files in the tmp directory on the storage server that are older than 24 hours get deleted automatically.
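The critical steps above could be sketched roughly like this (a minimal sketch; the function and directory names are made up for illustration, only the '/tmp-directory/'.sha1(target path) scheme is from the flow above):

```php
<?php
// Temporary location for an upload, as described above:
// '/tmp-directory/' . sha1(target path).
function temp_path_for(string $targetPath, string $tmpDir = '/tmp-directory/'): string
{
    return $tmpDir . sha1($targetPath);
}

// The "copy to final location" is really just a rename on the same disk.
function finalize_upload(string $targetPath, string $finalDir = '/storage/'): bool
{
    return rename(temp_path_for($targetPath), $finalDir . $targetPath);
}

// Cron job: delete every temp file older than 24 hours.
function cleanup_temp_dir(string $tmpDir = '/tmp-directory/', int $maxAge = 86400): int
{
    $deleted = 0;
    foreach (glob(rtrim($tmpDir, '/') . '/*') as $file) {
        if (is_file($file) && time() - filemtime($file) > $maxAge) {
            unlink($file);
            $deleted++;
        }
    }
    return $deleted;
}
```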
If the user closes the browser window or the connection gets interrupted, the program flow on the server gets aborted too, right?
Only destructors and registered shutdown functions are executed, correct?
Can I somehow make this code part "critical" so that the server, once it enters this code part, executes it to the end regardless of whether the user aborts the page load or not?
(Of course I am aware that a server crash or an error may interrupt at any time, but my concerns are about the regular flow now)
One idea of mine was to have a flag and a timestamp in the website's database that marks the file as "completed", and to check in a cron job for old incomplete files and delete them from the storage cloud and then from the website's database, but I would really like to avoid this extra field and procedure.
I want the storage API to be very generic and to use it in many other future projects.
I had a look at Google Storage for Developers and Amazon S3.
They have the same problem, only worse. In Amazon S3 you can "sign" your POST request, so the file gets uploaded by the user under your authority, is saved and stored directly, and you have to pay for it.
If the connection gets interrupted and the user never gets back to your website, you don't even know it happened.
So you have to store all the upload URLs you sign and check them in a cron job, deleting everything that hasn't "reached its destination".
Any ideas or best practices for that problem?
If I'm reading this correctly, you're performing the critical operations in the script that is called when the storage service redirects the user back to your website.
I see two options for ensuring that the critical steps are performed in their entirety:
Ensure that PHP ignores the connection status and runs scripts through to completion using ignore_user_abort().
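A minimal sketch of option 1 (ignore_user_abort(), set_time_limit() and connection_aborted() are real PHP functions; the two commented-out helper calls are hypothetical names standing in for your critical steps):

```php
<?php
// Enter the "critical" part: keep running even if the client disconnects,
// and lift the execution time limit for this request.
ignore_user_abort(true);
set_time_limit(0);

// --- critical section (the helper names below are hypothetical) ---
// write_database_entry($targetPath);        // 1. insert the row in the website DB
// issue_signed_copy_request($targetPath);   // 2. signed REST call to the storage API
// --- end critical section ---

// Only now does an aborted connection matter again; connection_aborted()
// reports whether the client is still there.
if (connection_aborted()) {
    // the client is gone; there is nobody left to send output to
}
```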
Trigger some back-end process that performs the critical operations separately from the user-facing scripts. This could be as simple as dropping a job into the at queue if you're using a *NIX server (man at for more details) or as complex as having a dedicated queue management daemon, much like the one LrdCasimir suggested.
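For the simple `at` variant of option 2, queuing a job from PHP could look like this (a sketch; the worker script path and its argument are assumed names):

```php
<?php
// Build the shell command that hands a job to the *NIX `at` daemon;
// `at` reads the command to run from stdin.
function build_at_command(string $job, string $when = 'now'): string
{
    return sprintf('echo %s | at %s 2>&1', escapeshellarg($job), escapeshellarg($when));
}

// Queue the critical work as a back-end job, detached from the user request.
function queue_at_job(string $job, string $when = 'now'): void
{
    shell_exec(build_at_command($job, $when));
}

// Hypothetical usage: hand the target path to a worker script.
// queue_at_job('php /var/www/jobs/process_upload.php ' . escapeshellarg('site42/photos/1.jpg'));
```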
The problems like this that I've faced have all had pretty time-consuming processes associated with their operation, so I've always gone with Option 2 to provide prompt responses to the browser, and to free up the web server. Option 1 is easy to implement, but Option 2 is ultimately more fault-tolerant, as updates would stay in the queue until they could be successfully communicated to the storage server.
The connection handling page in the PHP manual provides a lot of good insights into what happens during the HTTP connection.
I'm not certain I'd call this a "best practice", but here are a few ideas on a general approach for this kind of problem. One, of course, is to let the REST transaction to the storage server take place asynchronously, either via a daemonized process that listens for incoming requests (by watching a file for changes, or a socket, shared memory, database, whatever you think is best for IPC in your environment) or via a very frequently running cron job that would pick up and deliver the files. The benefit is that you can deliver a quick message to the user who uploaded the file, while the background process can try, and try again, if there's a connectivity issue with the REST service. You could even go as far as to have some AJAX polling taking place so the user gets a nice JS message displayed when the REST process completes.
Related
I've found posts similar to what I want to do, but nothing exact, so please excuse me if I missed it and there is already an answer here. Also, I am a C++ engineer, but new to PHP and without much experience with networking and HTTP requests.
Here is what I am hoping to do. I have a Linux server running PHP that hosts a RESTful API for clients to access. Clients have their own custom authentication to access the API, and they can upload files through it. I then need to take those files and send them to an external server using my private authentication credentials. I can easily set that up so that when I receive the POST, I create a new HTTP request to post the file to my private server and then return the results back to the client.
The issue is speed. The files can be quite large, which means the client has to wait for the file to be uploaded twice before receiving a response. One solution I have is to immediately send a response back to the client and then have the client ping the server every x seconds to check the status of the secondary upload. That would let me get a response back to the client faster, but it is not ideal. I was hoping there is a more advanced solution. Is there a way that I can begin the secondary upload on my server as soon as I start receiving the upload, so that by the time the upload to my server is complete, the upload to the secondary server is almost complete as well? This all has to be accomplished with POSTs, so I don't know if that is a limiting factor in the equation.
Is something like that even possible? If so, how would you recommend doing it?
Another option might be to somehow have the client upload directly to the secondary server, but how would that be possible without giving the client my private authentication? Keep in mind that the secondary server is just a RESTful API that accepts POSTs using an API key and token for authentication.
That is a tough one. If you had FTP access on the secondary server, we could perhaps get a bit more speed.
As far as I know, you can't start sending the secondary upload at the same time you receive the file. While the file is being uploaded, it resides in a temporary folder until the request is finished, and only then is it moved to some accessible folder. But hey, I am not 100% sure there is no possibility at all.
The current solution seems the best to me: get the file, inform the user, upload to the secondary server, inform the user again when it is all complete. Communication between servers is usually quite a bit faster, so the secondary upload should take less time than the first.
The last option, for now, is a no-no. I can't think of a way to upload directly to the secondary server without exposing your credentials.
Thank you guys. Looks like I am stuck with the option of returning a result to the client when the upload is complete, then starting the secondary upload while having the client ping the main server to check its status. I think that should be fine since I am limited to PHP at the moment.
Cheers
I am working on a PHP product application which will be deployed on several client servers (say 500+). My client's requirement is that when a new feature is released for this product, we need to push all the changes (files & DB) to the clients' servers using FTP/SFTP. I am a bit concerned about transferring files to 500+ servers at a time via FTP, and I am not sure how to handle this kind of situation. I have come up with some ideas, like:
When the user (product admin) clicks update, we send an AJAX request which updates the first 10 servers and returns the count of remaining servers. From the AJAX response, we send the request again for the next 10, and so on.
OR
Create a cron that runs every minute, checks whether any update is active, and updates the first 10 active servers. Once it completes the transfer for a server, it changes the status of that server to 0.
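The cron variant could be sketched roughly like this (the servers table, its status column, and the push_update transfer callback are assumptions for illustration):

```php
<?php
// Cron job, run every minute: pick up to 10 servers that still need the
// update (status = 1), push it to them, then mark them done (status = 0).
function run_update_batch(PDO $db, callable $pushUpdate, int $batchSize = 10): int
{
    $stmt = $db->prepare('SELECT id, host FROM servers WHERE status = 1 LIMIT ?');
    $stmt->bindValue(1, $batchSize, PDO::PARAM_INT);
    $stmt->execute();

    $done = 0;
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $server) {
        if ($pushUpdate($server['host'])) {   // e.g. an SFTP transfer
            $db->prepare('UPDATE servers SET status = 0 WHERE id = ?')
               ->execute([$server['id']]);
            $done++;
        }
    }
    return $done;
}
```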
I just want to know: is there any other method for this kind of task?
Add the whole codebase to a version control system like Git and push all present files to the created repository. Then write a cron job that automatically pulls the repository onto a server, and install that cron job on every server.
In the future, when you want to add a new feature, just add it to the repository and push the code again; it will then be pulled automatically by the cron job on every server where you installed it.
First, I would like to provide some suggestions and insight into your suggested methods:
In both methods, you'll have to keep a list of all the servers where your application has been deployed and keep track of whether the update has been applied to each one or not. That can become difficult if in the future you want to scale from 500+ to, say, 50,000+.
You are also not considering the case where the target server might not be functioning at the time you send the update request.
I suggest that, instead of sending the update from your end to the target server, you achieve the same in the opposite direction. As you said, you are developing an entire PHP application to be deployed on client servers, so I suggest you develop an update module for it. The target server can send a request to your servers at a designated time to check whether there is any update available. If there is, then I suggest the following two ways to proceed:
You send an update list providing the names and paths of the files to be updated, along with any DB changes, and the same can be processed on the client side accordingly.
You just send a response saying there is an update available; then a separate process launches on the client server which downloads all the files and DB changes from your server.
For maintaining consistency of updates you can implement a token system, or rely on the timestamp at which the update happened.
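The pull-based check could be as small as this on each client server (the endpoint URL and the JSON response format are assumptions):

```php
<?php
// Decide from the update endpoint's (assumed) JSON response whether there
// is anything to install; null means "up to date" or "unreadable response".
function parse_update_response(string $json): ?array
{
    $info = json_decode($json, true);
    return is_array($info) && ($info['available'] ?? false) ? $info : null;
}

// Cron job on each client server: ask the vendor's endpoint for updates,
// passing the locally installed version.
function check_for_update(string $endpoint, string $installedVersion): ?array
{
    $json = @file_get_contents($endpoint . '?version=' . urlencode($installedVersion));
    return $json === false ? null : parse_update_response($json);
}
```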
I have a file uploading site which currently rests on a single server, i.e. the same server is used for users to upload files to and for content delivery.
What I want to implement is a CDN (content delivery network). I would like to buy a server farm, and if I had a mechanism to spread files across the different servers, that would balance my load a whole lot better.
However, I have a few questions regarding this:
Assuming my server farm consists of 10 servers for content delivery,
Since the script to upload files is at one location only at the user's end, i.e. <form action="upload.php">, it has to reside on a single server, correct? How can I duplicate the script across multiple servers and direct the user's file upload data to the server with the least load?
How should I determine which files to send to which server? During the upload process, should I randomize which files go to which servers? If the user sends 10 files, should I send them to a random server? Is there a mechanism to send them to the server with the least load? Is there any other algorithm that can help determine which server the files should be sent to?
How will the files be sent from the upload server to the CDN? Using FTP? Wouldn't that introduce additional overhead and the need for error checking, to detect FTP connection breaks, verify that files were transferred successfully, and so on?
Assuming you're using an Apache server, there is a module called mod_proxy_balancer. It handles all of the load-balancing work behind the scenes. The user will never know the difference -- except when their downloads and uploads are 10 times faster.
If you use this, you can have a complete copy on each server.
mod_proxy_balancer will handle this for you.
Each server can have its own sub-domain. You will have a database on your 'main' server which matches up all of your download pages to the physical servers they are located on. Then an on-the-fly URL is generated based on some hashing algorithm, which prevents hard-linking the download and increases your page hits. It could be a mix of personal and miscellaneous information, e.g. the user's IP and the time of day. The download server then checks the hashes and either accepts or denies the request.
If everything checks out, the download starts; your load is balanced; and the users don't have to worry about any of this behind the scenes stuff.
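A sketch of how such a hash check might look (the secret, the fields that go into the hash, and the 10-minute expiry window are illustrative assumptions, not anything a particular CDN prescribes):

```php
<?php
// "Main" server: build an on-the-fly download URL tied to the user's IP
// and an expiry time, so hard links stop working after a while.
function make_download_url(string $host, string $file, string $ip, int $now, string $secret): string
{
    $expires = $now + 600; // link valid for 10 minutes (assumption)
    $sig = hash_hmac('sha256', "$file|$ip|$expires", $secret);
    return "http://$host/get.php?f=" . rawurlencode($file) . "&e=$expires&s=$sig";
}

// Download server: recompute the signature and accept or deny the request.
function verify_download(string $file, string $ip, int $expires, string $sig, int $now, string $secret): bool
{
    if ($now > $expires) {
        return false; // link has expired
    }
    $expected = hash_hmac('sha256', "$file|$ip|$expires", $secret);
    return hash_equals($expected, $sig); // constant-time comparison
}
```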
Note: I have done Apache administration and web development, but I have never managed a large CDN, so this is based on what I have seen on other sites and other knowledge. Anyone who has something to add here, or corrections to make, please do.
Update
There are also companies that manage it for you. A simple Google search will get you a list.
I've seen many web apps that implement progress bars, however, my question is related to the non-uploading variety.
Many PHP web applications (phpBB, Joomla, etc.) implement a "smart" installer to not only guide you through the installation of the software, but also keep you informed of what it's currently doing. For instance, if the installer was creating SQL tables or writing configuration files, it would report this without asking you to click. (Basically, sit-back-and-relax installation.)
Another good example is Joomla's Akeeba Backup (formerly Joomla Pack). When you perform a backup of your Joomla installation, it makes a full archive of the installation directory. This, however, takes a long time and hence requires progress updates. But the server itself has a limit on PHP script execution time, so it seems that either
The backup script is able to bypass it.
Some temp data is stored so that the archive is appended to (if archive appending is possible).
Client scripts call the server's PHP every so often to perform actions.
My general guess (not specific to Akeeba) is with #3, that is:
Web page JS -> POST foo/installer.php?doaction=1 SESSID=foo2
Server -> ERRCODE SUCCESS
Web page JS -> POST foo/installer.php?doaction=2 SESSID=foo2
Server -> ERRCODE SUCCESS
Web page JS -> POST foo/installer.php?doaction=3 SESSID=foo2
Server -> ERRCODE SUCCESS
Web page JS -> POST foo/installer.php?doaction=4 SESSID=foo2
Server -> ERRCODE FAIL Reason: Configuration.php not writable!
Web page JS -> Show error to user
I'm 99% sure this isn't the case, since that would create a very nasty dependency on the user having JavaScript enabled.
I guess my question boils down to the following:
How are long-running PHP scripts (on web servers, of course) handled, and how are they able to "stay alive" past the PHP maximum execution time? If they don't "cheat", how do they split up the task at hand? (I notice that Akeeba Backup does acknowledge the PHP maximum execution time limit, but I don't want to dig too deep to find the relevant code.)
How is the progress displayed via AJAX+PHP? I've read that people use a file to indicate progress, but to me that seems "dirty" and puts a bit of strain on I/O, especially for live servers with 10,000+ visitors running the aforementioned script.
The environment for this script is where safe_mode is enabled, and the limit is generally 30 seconds. (Basically, a restrictive, free $0 host.) This script is aimed at all audiences (will be made public), so I have no power over what host it will be on. (And this assumes that I'm not going to blame the end user for having a bad host.)
I don't necessarily need code examples (although they are very much appreciated!), I just need to know the logic flow for implementing this.
Generally, this sort of thing is stored in the $_SESSION variable. As far as the execution timeout goes, what I typically do is have a JavaScript timer that sets the innerHTML of a status div to the output of a PHP script every x seconds. When that script executes, it doesn't "wait" or anything like that; it merely grabs the current status from the session (which is updated by the script(s) actually performing the installation) and outputs it in whatever fancy form I see fit (status bar, etc.).
I wouldn't recommend any direct I/O for status updates. You're correct in that it is messy and inefficient. I'd say $_SESSION is definitely the way to go here.
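A minimal sketch of that approach (the step list, the file names and the endpoint split are made up for illustration):

```php
<?php
// Percentage to show in the status bar after a given number of steps.
function progress_after_step(int $completed, int $total): int
{
    return (int) floor($completed / $total * 100);
}

// worker.php - called repeatedly by the page's JavaScript; performs ONE
// small unit of work per request, well under the execution time limit,
// and records progress in the session between requests.
session_start();
$steps = ['create_tables', 'write_config', 'copy_files']; // hypothetical steps
$i = $_SESSION['step'] ?? 0;
if ($i < count($steps)) {
    // do_step($steps[$i]);  // the actual installation work would go here
    $_SESSION['step'] = ++$i;
    $_SESSION['progress'] = progress_after_step($i, count($steps));
}

// progress.php - polled via AJAX; it just reports the session value:
//   session_start();
//   echo json_encode(['progress' => $_SESSION['progress'] ?? 0]);
```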
I am designing a file download network.
The ultimate goal is to have an API that lets you upload a file directly to a storage server (no gateway or anything). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or an API gives the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything at any point; every drive is just attached to the server as JBOD. All replication happens at the application level. If one server breaks down, it is just marked as broken in the database, and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring/balancing and maybe, later, billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySql LAMP stack.
A URL that hits the current storage server is generated by the API (that's no problem so far; just a classic PHP website and MySQL database).
Now it gets interesting...
The storage server runs Apache2, and a PHP script catches the request. URL parameters (a secure token hash) are validated; IP, timestamp and filename are checked, so the request is authorized. (No database connection required, just a PHP script that knows a secret token.)
The PHP script sets the file headers so that Apache2's mod_xsendfile is used.
Apache delivers the file passed via mod_xsendfile and is configured to pipe the access log to another PHP script.
Apache runs mod_logio, and the access log is in combined I/O log format, additionally extended with the %D variable (the time taken to serve the request, in microseconds) to calculate transfer speeds and spot bottlenecks in the network.
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, that is assigned to one client, so the client is known), counts input/output traffic and increments database fields. For performance reasons I thought about having daily fields and updating them like traffic = traffic + X, and if no row has been updated, creating it.
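The "update, and create if no row was updated" part can be collapsed into a single MySQL ON DUPLICATE KEY UPDATE statement. A sketch (the traffic table, its columns, and the simplified log pattern are assumptions; a real combined I/O line has more fields):

```php
<?php
// Extract the bucket (first path segment) and the bytes in/out from one
// piped access-log line. Simplified pattern for a combined I/O log line:
// ... "GET /bucket/file HTTP/1.1" 200 size "ref" "agent" <bytes_in> <bytes_out> ...
function parse_log_line(string $line): ?array
{
    if (!preg_match('~"[A-Z]+ /([^/ ]+)/[^ ]* HTTP/[^"]*" \d+ \S+ "[^"]*" "[^"]*" (\d+) (\d+)~', $line, $m)) {
        return null;
    }
    return ['bucket' => $m[1], 'in' => (int)$m[2], 'out' => (int)$m[3]];
}

// One row per bucket per day, counters incremented in place.
// Requires a UNIQUE key on (bucket, day).
function record_traffic(PDO $db, string $bucket, int $in, int $out): void
{
    $db->prepare(
        'INSERT INTO traffic (bucket, day, bytes_in, bytes_out)
         VALUES (?, CURDATE(), ?, ?)
         ON DUPLICATE KEY UPDATE bytes_in  = bytes_in  + VALUES(bytes_in),
                                 bytes_out = bytes_out + VALUES(bytes_out)'
    )->execute([$bucket, $in, $out]);
}
```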
I have to mention that the servers will be low-budget servers with massive storage.
You can have a closer look at the intended setup in this thread on Server Fault.
The key data is that the systems will have gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load with lots of log lines and requests). Maybe 500 MB on average.
The currently planned setup runs on a cheap consumer mainboard (Asus), 2 GB DDR3 RAM and an AMD Athlon II X2 220 tray CPU (2x 2.80 GHz).
Of course download managers and range requests will be an issue, but I think the average size of an access will be at least around 50 MB or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be read directly from the last request and does not issue another request to the MySQL server?
Do you think the system with the specs given above can handle this? If not, how could I improve? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address, and I am not that flexible. It would have the advantage that the download validation would not need a PHP process to fire. But as the PHP process only runs briefly and doesn't read and output the data itself, I think this is OK. Do you? I once did downloads using lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that.
What do you think about the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update the database in a second process that can handle delays? Or not do it live at all, but parse the log files at night? My thought is that I would like it to be as real-time as possible and not have accumulated data somewhere other than in the central database. I also don't want to keep track of jobs running on all the servers; that could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it, and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open source alternative to expensive storage services like Amazon S3 that is oriented toward file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would rather reuse an existing solution, preferably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly the thing that you want.