PHP File Upload Bandwidth - php

I have a page that uploads a file to my server, where it then gets copied to a permanent directory via move_uploaded_file. This all seems to work great, with the exception that in a real-life scenario I will be expecting much larger files than I have successfully sent up.
I have already tackled the timeout for the file upload by changing the connection timeout in my site settings in IIS, so the file continues to upload for up to six hours ( -_- ). But this is where I run into my current problem: it might just take six hours!
After getting the upload process past 10% or so (on a 300 meg file), I noticed that the file continues to push up, but my upload rate seems to be 'falling off': I observed faster speeds when I started the transfer than I am seeing halfway through it. The numbers here aren't necessarily relevant, as I know that my upload (still 2 Mbps while I'm uploading) is capable of pushing faster than it is, and the server on the other end is on fiber.
I wonder if anyone has encountered this before, and if so, have you determined a work-around? Any help appreciated. Thanks.

You should not be using HTTP for this task. You may have observed that all the "file locker" services (and others which involve uploading files, such as Apple's online-music service) provide you with an "uploader" program rather than making use of the browser. There are reasons for this.
First off, the overhead of the transfer encoding can be large. If your (presumably binary) data ends up Base64-encoded, that's 33% overhead. So if it would take four hours that way, it would only take three with a binary protocol, and that's disregarding the chunked-transfer overhead. (A plain multipart/form-data POST from a browser sends the file bytes as-is, so this penalty applies only when the client or some layer in between actually encodes the payload.)
Second, there's no way to "resume" an upload in HTTP. So if your connection is broken, you'll either have to write application-specific code to handle the resumption, or start all over.
Third, HTTP servers are not designed for super-long-lived connections: they usually have a finite or small pool of workers to service client requests (which usually last seconds at the outside), and occasionally they have smallish limits on the size of request data (2GB is common, and PHP by default has only a few MB).
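To put a number on the Base64 overhead mentioned in the first point, here is a minimal sketch in Python (chosen only for illustration; the function name is made up). It measures the size ratio between Base64 text and the raw bytes. Keep in mind the figure only matters when the payload is really Base64-encoded on the wire.

```python
import base64
import os


def base64_overhead(n_bytes: int) -> float:
    """Return encoded_size / raw_size for n_bytes of random binary data."""
    raw = os.urandom(n_bytes)
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)
```

Base64 turns every 3 input bytes into 4 output characters, so for any input whose length is a multiple of 3 the ratio is exactly 4/3, which is where the "four hours versus three" estimate comes from.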
I strongly recommend using a file transfer protocol to transfer files (such as FTP). You don't have to give out a single username/password pair to everyone: you can have a gatekeeper which integrates with whatever authentication system you already have in place. FTP-over-TLS also exists and is relatively mature.
There is a fairly good summary of the differences between the two protocols here. Note that you gain nothing from any of the advantages of HTTP listed, due to your circumstances.
Don't feel limited to FTP - rsync is a great protocol for transferring files as well, especially if you only change part of the file (it even does binary deltas!). git can also efficiently transport large blobs over secure connections or even HTTP, if you insist on using that.

Related

By closing a socket after only reading the useful data, am I really saving bandwidth?

I have an app that, as a first step, only needs to get the size of some images on a web server. To do that, I'm using fsockopen. After reading the Content-Length header, I close the socket.
The question may be silly, but I know little to nothing about TCP, the whole data transmission process over the internet, or how the file gets to my PHP app through this socket. So what I want to know is: am I saving bandwidth by closing the socket before reading the whole file, or is it still transferred to my local machine in its entirety anyway? What about the server that hosts the image: does it know the socket is closed and stop sending the data?
It depends on a bunch of stuff. If the image is 10 Terabytes, then yes. Absolutely. If it's 100k, then probably not.
It all has to do with buffers -- buffers all over the place. Buffers on each computer, on each device in the network between them.. as well as latency and available bandwidth.
But basically if the file is big, yes, you're saving bandwidth. If it's small, you're not. Figuring out exactly would be difficult and the number of variables involved would be large. And the break-even point would likely change over time unless you controlled the full end-to-end system (and even then lots of things you would have a hard time controlling would still impact the answer).
Basically yes, for large enough files, but you'd save a lot more if you used the HTTP HEAD request for this, not a full GET request. Then you would save for all files.
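As a rough illustration of the HEAD approach, here is a minimal sketch in Python (the question uses PHP's fsockopen; the function name `remote_size` is invented for this example). It sends a HEAD request and reads Content-Length, so the server never sends the body at all.

```python
import http.client
from urllib.parse import urlsplit


def remote_size(url: str, timeout: float = 10.0):
    """Return the Content-Length the server reports for url, or None if absent."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        # HEAD asks for the headers only; no response body is transferred.
        conn.request("HEAD", path)
        response = conn.getresponse()
        length = response.getheader("Content-Length")
    finally:
        conn.close()
    return int(length) if length is not None else None
```

Not every server reports Content-Length for every resource, which is why the sketch returns None rather than assuming the header is present.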

Will I run into load problems with this application stack?

I am designing a file download network.
The ultimate goal is to have an API that lets you directly upload a file to a storage server (no gateway or something). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or an API gives the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything like it at any point. Every drive is just attached to the server as JBOD. All the replication is at application level. If one server breaks down, it is just marked as broken in the database and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring / balancing and maybe later billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySql LAMP stack.
A URL that hits the current storage server is generated by the API (that's no problem so far; it's just a classic PHP website and MySQL database).
Now it gets interesting...
The Storage server runs Apache2 and a PHP script catches the request. URL parameters (secure token hash) are validated. IP, Timestamp and filename are validated so the request is authorized. (No database connection required, just a PHP script that knows a secret token).
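The token check described above can be sketched roughly as follows. This is Python for illustration only (the question plans a PHP script), and the secret, the field separator and the function names are all assumptions, not part of the original design. The API signs IP, expiry timestamp and filename with a shared secret, and the storage server recomputes the signature without touching the database.

```python
import hashlib
import hmac
import time

# Hypothetical shared secret known to both the API and the storage servers.
SECRET = b"change-me"


def make_token(ip: str, expires: int, filename: str, secret: bytes = SECRET) -> str:
    """Sign the request parameters; the API would put this token in the URL."""
    message = f"{ip}|{expires}|{filename}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_authorized(token: str, ip: str, expires: int, filename: str,
                  now: float = None, secret: bytes = SECRET) -> bool:
    """Recompute the signature on the storage server and compare it."""
    now = time.time() if now is None else now
    if now > expires:
        return False  # link has expired
    expected = make_token(ip, expires, filename, secret)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected.encode(), token.encode())
```

Because the signature covers all three values, changing the IP, the filename or the expiry in the URL invalidates the token.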
The PHP script sets the file header to use Apache2 mod_xsendfile.
Apache delivers the file passed by mod_xsendfile and is configured to have the access log piped to another PHP script.
Apache runs mod_logio, and the access log is in Combined I/O log format, additionally extended with the %D variable (the time taken to serve the request, in microseconds) to calculate the transfer speed, spot bottlenecks in the network, and so on.
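A minimal sketch of what the log-parsing side could look like, in Python for illustration (the question plans a PHP script). It assumes the Combined I/O format with %D appended as the last field; the regular expression, the function name and the sample values in the usage note are made up for this example.

```python
import re

# Combined I/O format (%I and %O after the user agent) with %D appended.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "[^"]*" (?P<bytes_in>\d+) (?P<bytes_out>\d+) (?P<micros>\d+)'
)


def parse_line(line: str):
    """Return bucket, byte counts and transfer rate for one log line, or None."""
    match = LOG_LINE.match(line)
    if match is None:
        return None  # malformed or unexpected line
    bytes_in = int(match["bytes_in"])
    bytes_out = int(match["bytes_out"])
    seconds = int(match["micros"]) / 1_000_000
    return {
        # The first path segment is treated as the bucket, i.e. the client.
        "bucket": match["path"].lstrip("/").split("/", 1)[0],
        "bytes_in": bytes_in,
        "bytes_out": bytes_out,
        "seconds": seconds,
        "bytes_per_second": bytes_out / seconds if seconds else 0.0,
    }
```

For a request that sent 524288420 bytes in 5000000 microseconds, this gives a rate of about 105 MB/s, close to what a saturated Gigabit link can deliver.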
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, and is assigned to one client, so the client is known), counts input/output traffic and increases database fields. For performance reasons I thought about having daily fields and updating them like traffic = traffic+X, and creating the row if no row has been updated.
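The "update, and create the row if nothing was updated" idea can be sketched like this. The example uses Python with SQLite purely as a stand-in for the planned PHP and MySQL setup, and the table and column names are invented.

```python
import sqlite3


def open_stats(path: str = ":memory:") -> sqlite3.Connection:
    """Open the stats database and make sure the daily counter table exists."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS traffic ("
        " bucket TEXT NOT NULL, day TEXT NOT NULL,"
        " bytes_in INTEGER NOT NULL, bytes_out INTEGER NOT NULL,"
        " PRIMARY KEY (bucket, day))"
    )
    return db


def add_traffic(db, bucket: str, day: str, bytes_in: int, bytes_out: int) -> None:
    """Increment the daily counters, creating the row on first use."""
    cursor = db.execute(
        "UPDATE traffic SET bytes_in = bytes_in + ?, bytes_out = bytes_out + ?"
        " WHERE bucket = ? AND day = ?",
        (bytes_in, bytes_out, bucket, day),
    )
    if cursor.rowcount == 0:
        # No row for this bucket and day yet, so create it.
        db.execute(
            "INSERT INTO traffic (bucket, day, bytes_in, bytes_out)"
            " VALUES (?, ?, ?, ?)",
            (bucket, day, bytes_in, bytes_out),
        )
    db.commit()
```

With several log processors writing at once, two of them could both see zero updated rows and both try to insert. On MySQL, a single INSERT ... ON DUPLICATE KEY UPDATE statement is one way to avoid that race.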
I have to mention that the servers will be low-budget servers with massive storage.
You can have a close look at the intended setup in this thread on serverfault.
The key data is that the systems will have Gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests). Maybe 500MB on average or so!
The currently planned setup runs on a cheap consumer mainboard (Asus), 2 GB DDR3 RAM and an AMD Athlon II X2 220, 2x 2.80GHz tray CPU.
Of course download managers and range requests will be an issue, but I think the average size of an access will be at least around 50 megs or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be directly read from the last request and does not do another request to the MySQL server?
Do you think the system with the specs given above can handle this? If not, how could I improve it? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using Lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address, so I would not be as flexible. It would have the advantage that the download validation would not need a PHP process to fire. But as that process only runs briefly and doesn't read and output the data itself, I think this is OK. Do you? I once did downloads using lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that. But
What do you think about the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update them in the database in a second process that can handle delays? Or not do it at all, but parse the log files at night? My thought is that I would like to have it as real-time as possible and not have accumulated data anywhere other than in the central database. I also don't want to keep track of jobs running on all the servers. This could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open source alternative to expensive storage services such as Amazon S3, one that is oriented towards file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would reuse an existing solution, preferably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly what you want.

Is uploading very large files (eg 500mb) via php advisable?

I created a simple web interface to allow various users to upload files. I set the upload limit to 100mb, but now it turns out that the client occasionally wants to upload files of 500mb+.
I know how to alter the PHP configuration to change the upload limit, but I was wondering if there are any serious disadvantages to uploading files of this size via PHP?
Obviously FTP would be preferable, but if possible I'd rather not have two different methods of uploading files.
Thanks
Firstly FTP is never preferable. To anything.
I assume you mean that you are transferring the files via HTTP. While not quite as bad as FTP, it's not a good idea if you can find another way of solving the problem. HTTP (and hence the component programs) is optimized around transferring relatively small files around the internet.
While the protocol supports server to client range requests, it does not allow for the reverse operation. Even if the software at either end were unaffected by the volume, the more data you are pushing across the greater the interval during which you could lose the connection. But the biggest problem is that caveat in the last sentence.
Regardless of the server technology you use (PHP or something else), it's never a good idea to push a file that big in one go in synchronous mode.
There are lots of plugins for any technology/framework that will do asynchronous upload for you.
Besides the connection timing out, there is one more disadvantage in that file uploading consumes the web server memory. You don't normally want that.
PHP will handle as many and as large a file as you'll allow it. But consider that it's basically impossible to resume an aborted upload in PHP, as scripts are not fired up until AFTER the upload is completed. The larger the file gets, the larger the chance of a network glitch killing the upload and wasting a good chunk of time and bandwidth. As well, without extra work with APC, or using something like uploadify, there's no progress report and users are left staring at a browser showing no visible signs of actual work except the throbber chugging away.
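One common way around the lack of resume is to have the client send the file in small pieces and let the server append each piece at a known offset. The sketch below shows only the server-side bookkeeping, in Python for illustration. The function names and the offset-based protocol are assumptions, not something the answers above prescribe.

```python
import os


def received_bytes(path: str) -> int:
    """How much of the upload has been stored so far."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def append_chunk(path: str, offset: int, data: bytes) -> int:
    """Append data if offset matches what is already stored.

    Returns the number of bytes stored afterwards, which is the offset the
    client should send next. A chunk with the wrong offset is ignored.
    """
    current = received_bytes(path)
    if offset != current:
        return current  # client is out of sync; tell it where to resume
    with open(path, "ab") as handle:
        handle.write(data)
    return current + len(data)
```

After a network glitch the client asks for the current size and continues from there, so only the interrupted chunk has to be sent again. The number of stored bytes also gives a progress figure.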

Read files via php

You all know about the restrictions that exist in a shared environment, so with that in mind, please suggest a PHP function or something similar with which I could stream my videos and other files. I have a lot of videos on the server, unlimited bandwidth and disk space, but I am limited in RAM and CPU.
Don't use php to stream the data. Use a header redirect to point to the URL of the actual file. This will offload the work onto the webserver which might run under a different user id and is better optimized for this task.
Hmm, there is XMoov that acts as a "streaming server" but does not do much more than serve a file byte by byte, with a few additional options and settings. It promises random access (i.e. arbitrary skipping within a video) but I haven't used it myself yet.
As a server administrator, though, I would frown on anybody using PHP to serve huge files like that because of the strain it puts on the server. I would generally not regard this to be a good idea, and rent a streaming server instead if at all possible. Use at your own risk.
You can use a while loop to load bits of the file, and then sleep for some time, and then output more, and sleep... (that would be the only way to limit the CPU usage).
RAM shouldn't be a problem, as you will just dump parts of the file, so you don't need to load it into RAM.
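The read-a-bit, sleep, read-a-bit loop can be sketched as a generator. This is Python for illustration (in PHP the same shape would be a loop around fread and a sleep call), and the chunk size and rate limit are arbitrary example values.

```python
import time


def throttled_chunks(path, chunk_size=64 * 1024,
                     max_bytes_per_sec=512 * 1024, sleep=time.sleep):
    """Yield the file piece by piece, pausing after each piece."""
    # Pause long enough that chunk_size bytes per pause stays under the limit.
    delay = chunk_size / max_bytes_per_sec
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk  # only one chunk is held in memory at a time
            sleep(delay)
```

Memory use stays at roughly one chunk regardless of file size, and the pause between chunks caps both the output rate and the time the process spends busy.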

HTTP vs FTP upload

I am building a large website where members will be allowed to upload content (images, videos) up to 20MB of size (maybe a little less like 15MB, we haven't settled on a final upload limit yet but it will be somewhere between 10-25MB).
My question is, should I go with HTTP or FTP upload in this case? Bear in mind that 80-90% of uploads will be smaller, around 1-3MB, but from time to time some members will also want to upload large files (10MB+).
Is HTTP uploading reliable enough for such large files or should I go with FTP? Is there a noticeable speed difference between HTTP and FTP while uploading files?
I am asking because I'm using Zend Framework, which already has an HTTP adapter for file uploads; if I choose FTP, I would have to write my own adapter for it.
Thanks!
HTTP definitely puts less of a burden on your clients. A lot of places have proxies or firewalls that block all FTP traffic (in or out).
The big advantage of HTTP is that it goes over firewalls and it's very easy to encrypt: just use HTTPS on port 443 instead of HTTP on port 80. Both go through proxies and firewalls. And these days it's pretty easy to upload a 20MB file over HTTP/HTTPS using a POST.
The problem with HTTP is that it is not restartable for uploads. If you get 80% of the file sent and then there is a failure, you will need to restart at the beginning. That's why vendors are increasingly using flash-based, java-based or javascript-based uploaders and downloaders. These systems can see how much of the file has been sent, send a MAC to make sure that it has arrived properly, and resend the parts that are missing.
A MAC is more important than you might think. TCP checksums are only 16 bits, so a corrupted segment has roughly a 1-in-65,536 chance of slipping through undetected. That potentially happens a lot with today's internet.
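As a rough sketch of how such an uploader could decide what to resend, the example below computes an HMAC per chunk and lists the chunks that are missing or fail verification. It is written in Python for illustration. The chunk numbering, the key handling and the function names are assumptions, not a description of any particular uploader.

```python
import hashlib
import hmac


def chunk_mac(key: bytes, index: int, data: bytes) -> bytes:
    """MAC over the chunk index and its bytes, so chunks cannot be swapped."""
    message = index.to_bytes(8, "big") + data
    return hmac.new(key, message, hashlib.sha256).digest()


def chunks_to_resend(key: bytes, total: int, received: dict) -> list:
    """Return indices that never arrived or whose MAC does not verify.

    received maps chunk index -> (data, mac) as they arrived at the server.
    """
    resend = []
    for index in range(total):
        if index not in received:
            resend.append(index)  # never arrived
            continue
        data, mac = received[index]
        if not hmac.compare_digest(chunk_mac(key, index, data), mac):
            resend.append(index)  # arrived damaged
    return resend
```

Only the listed chunks need to go over the wire again, rather than the whole file.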
"Is HTTP uploading reliable enough for such large files"
One major advantage of FTP would be the ability to resume aborted uploads. Most FTP servers and clients support this, though it's not always activated. Whereas with HTTP, it's theoretically possible using special headers, but a normal client (i.e. browser) will not support it.
Another advantage would be bulk uploads: very simple in FTP, not so in HTTP.
But why not simply offer both options? HTTP for those who are behind proxies or won't/can't use an FTP client, and FTP for people who have to upload many or large files over unreliable connections.
I do not want to be sarcastic, but File Transfer Protocol must be more reliable on file transfer :)
Resource availability / usage is more of an issue than reliability or speed. Each upload consumes resources - thread / memory / etc - on your web server for the duration of the upload. If content upload traffic is significant for large files it would be better to use FTP simply to free your HTTP server to be more responsive to page requests.
I would definitely opt for the HTTP approach, like the rest of the people here. The reason for this is what you've said about most of the files being from one to three megabytes.
The problem is for the "rest", so:
Have you considered allowing users to send larger files through e-mail to a daemon script that receives the emails and uploads them to the account associated with the sender?
Or there is the solution of the flash uploader, in a facebook-like approach.
FTP will consume less bandwidth than HTTP whenever the HTTP upload has to Base64-encode the binary content into plain text, which increases the total transfer size by 1/3. (A standard multipart/form-data upload sends the bytes unencoded, in which case the difference is small.)
However, bandwidth consumption might not necessarily be the major concern compared to other factors like usability and security, in which HTTP prevails.
