I'm building a PHP-based upload service for some of our clients. I am using SWFUpload so that I can view the progress of a file as it uploads. I've got it pretty much built, but I'm running into one last issue before we can release it to the public.
Many (almost all) of our clients are Mac-based and are uploading sets of files that include InDesign files, fonts, Illustrator files, etc. Most of the time the image files are OK, but occasionally (and always with Type 1 fonts) a file will become corrupted because it loses its resource fork.
I understand why this is happening (moving from a multi-fork system to a single-fork system), but I cannot find any elegant solution. In my research the best answer I've found so far is "have the user compress it". I know that works, but it's unreasonable - in our clients' opinion - for us to require them to compress every set of files they are going to send.
Are there any better solutions for keeping those resource forks alive? Of course, I would prefer a solution that is straight javascript/php, but would settle for something that is flash based or (least preferably) java based.
My only requirements for the new solution would be:
View upload progress
User doesn't have to manually compress files
Here's some information about my system
Ubuntu 10.10 Server running a standard LAMP install
PHP5
SWFUpload (whatever the most recent version is)
Uploads handle files. If the browser and the underlying OS are not able to deal with forks in this procedure (i.e., map any file onto the single-file model used for uploads), then you're bound to what the system architecture gives you.
Resource fork: The resource fork is a construct of the Mac OS operating system used to store structured data in a file, alongside unstructured data stored within the data fork. A resource fork stores information in a specific form, such as icons, the shapes of windows, definitions of menus and their contents, and application code (machine code).
If that's a blocker to you, you might have chosen the wrong field to work in. Just saying: if you run into systemic limits, there is not much you can do about them - even if you work for graphic designers and Mac users.
SWFUpload would need a feature to deal with forks. For that, Flash would need a feature to deal with forks. For that, the browser would eventually need a feature to deal with forks. And so on.
Next to this chain, another question remains: how to deal with forks? As the upload only maps one file to a chunk of binary data, how do you map the fork as well? Append it? Add an additional file?
So on the technical level this does not sound easily solvable. All components and systems in the file input chain would have to support a feature that is commonly not supported at all.
So as you cannot offer something to the user that does not exist, the only thing you can do is make your application more usable or user-friendly, e.g. by providing the right notes at the right time (e.g. when a user selects a Type 1 font file for uploading, remind him/her to include the fork as well). Communicating with the user can help, but keep in mind that a user needs to be spoken to in a language he/she understands.
So if you know that certain file types have forks, address the issue to someone who can solve it: The user. You can't.
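If you go the notification route, a minimal server-side sketch might look like this. The extension list is purely illustrative (not exhaustive), and classic Mac font files often have no extension at all, which is itself a useful signal:

```php
<?php
// Illustrative sketch: flag uploads whose type commonly relies on a Mac
// resource fork so the user can be warned before the data is silently lost.
// The extension list is an assumption for demonstration, not a complete list.
function needsForkWarning($filename)
{
    $forkProne = array('suit', 'lwfn', 'ffil');
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // Classic Mac Type 1 / suitcase fonts often have no extension at all,
    // which is itself a hint that the payload lived in the resource fork.
    return $ext === '' || in_array($ext, $forkProne, true);
}

// 'Filedata' is SWFUpload's default upload field name (file_post_name);
// adjust it if your configuration differs.
if (isset($_FILES['Filedata']) && needsForkWarning($_FILES['Filedata']['name'])) {
    header('Content-Type: application/json');
    echo json_encode(array(
        'warning' => 'This file type may depend on a Mac resource fork and ' .
                     'could arrive incomplete. Please archive it (zip) before uploading.'
    ));
    exit;
}
```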
You don't have to use swfupload to monitor progress.
Here are some files that demonstrate this: https://github.com/senica/Booger/tree/master/assets/js/jquery-upload
It is not documented very well, but it basically uses the webkitSlice function for uploading the files in JavaScript. You can use the callback functions to display the progress of the files.
This would be a javascript/php solution.
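On the PHP side, the receiving endpoint for such a slice-based uploader can stay quite small. A rough sketch - the field names `chunk`, `filename`, and `offset` are assumptions, so adapt them to whatever the JavaScript actually posts:

```php
<?php
// Minimal sketch of a chunk-receiving endpoint for a slice-based JS uploader.
// Field names ('chunk', 'filename', 'offset') and the upload directory are
// illustrative assumptions; adapt them to what the client actually sends.
$uploadDir = '/var/uploads/';
$filename  = basename($_POST['filename']);   // never trust the raw client name
$offset    = (int) $_POST['offset'];

// The slice arrives as an ordinary multipart file field.
$chunk = file_get_contents($_FILES['chunk']['tmp_name']);

// Write the chunk at its offset so out-of-order chunks still land correctly.
$fp = fopen($uploadDir . $filename, 'c');     // create if missing, keep existing contents
fseek($fp, $offset);
fwrite($fp, $chunk);
fclose($fp);

// Let the JavaScript progress callback know this chunk was stored.
echo json_encode(array('status' => 'ok', 'received' => strlen($chunk)));
```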
I am building a web application that allows users to upload audio files, music in particular. Most of the time, I expect the duration of each song to be several minutes and the file to be approximately 3-10 MB in size. However, I would like to accept audio uploads up to about 100 MB, possibly allowing for over an hour of audio. I am currently using a combination of FFmpeg, SoX, and LAME to convert from 7 possible formats to mp3 and perform audio modifications including equalization, trimming, and fading. The files are then stored and linked in the database.
My current strategy is to handle the entire process in one HTTP file upload request using PHP on the backend, in which I perform the following functions:
Validation
Transcode audio into multiple versions (using shell through PHP)
Store the original and transcoded versions in a temp directory
Upload all audio files to Amazon S3 for permanent storage
Commit the ID of each file to a database, linking them to the user
This works very similarly to an image processing system I have already set up. However, while images complete this whole process in just a few seconds, audio can take a lot longer; at most, audio could take about 5-10 minutes to be processed and stored.
My questions are:
For audio processing, would it be better to fork off the transcoding to another background process, writing its state to the database, and pinging it every few seconds to update the webpage vs. doing it all in one HTTP request?
With the intention of scaling in the future, would it be advisable to do all processing on a single server instance, leaving the frontend web instances free to replicate / be destroyed?
If yes, would this require cross-domain file uploading directly to that server? (Anyone know if this is how youtube or the big sites do it?)
Thanks!
If I understand your system correctly, your best approach is probably something more like this:
In your web front-end, store the audio and create a "task" indicating that the audio needs to be processed.
Run a background task that pulls tasks and does the processing. At the end of the task, the user can be notified (if necessary) and database state can be updated or whatever.
Your tasks should be written so that if they fail partway through, they can be re-executed from the start without causing problems. You can run multiple background tasks and web front-ends in this architecture.
A good way to write tasks is using a message-passing system like AMQP. Brokers like RabbitMQ (or the cheap hosted services built on it) will do this for you. You can, of course, also build your own on top of any database, but this may require polling.
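If you do go the database route, a minimal polling worker could look like this sketch. The `tasks` table and its columns are made up for illustration:

```php
<?php
// Sketch of a database-backed worker that polls for pending transcode tasks.
// The 'tasks' table and its columns are illustrative assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

while (true) {
    // Atomically claim one pending task so multiple workers don't collide.
    $pdo->beginTransaction();
    $task = $pdo->query(
        "SELECT id, audio_path FROM tasks
         WHERE status = 'pending'
         ORDER BY created_at LIMIT 1 FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);

    if ($task) {
        $pdo->prepare("UPDATE tasks SET status = 'processing' WHERE id = ?")
            ->execute(array($task['id']));
        $pdo->commit();

        // Transcode here (shelling out to FFmpeg/SoX/LAME as in the question),
        // upload the results to S3, then mark the task done so the
        // frontend can poll the status and update the page.
        // ... processing happens here ...

        $pdo->prepare("UPDATE tasks SET status = 'done' WHERE id = ?")
            ->execute(array($task['id']));
    } else {
        $pdo->commit();
        sleep(5);   // nothing to do; poll again in a few seconds
    }
}
```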
Finally, you might find it faster and more efficient to use a service like Zencoder to do your transcoding, because they can parallelize the work and probably handle more input formats, but it may not be compatible with your processing.
You definitely want to hand the audio processing off to a background process.
Depending on the scalability involved, you might need a computer dedicated to the processing. You might also want to look into other resources you can offload audio work to (like PCIe cards and such).
Sorry to say I know nothing about cross-domain file uploading or how the big dogs do it (YouTube, SoundCloud, etc.).
My application requires downloading many images from the server (each image is about 10 KB). I'm simply downloading each of them with an independent AsyncTask without any optimization.
Now I'm wondering what the common practice is to transfer these images. For example, I'm thinking about saving zipped images on the server, then sending the zipped file for the user's device to unzip. In this case, is it better to combine the zip files into one big zip file for the user to download?
Or there's better solution? Thanks in advance!
EDIT:
It seems combining zip files is a good idea, but I feel it may take too long for the user to wait for downloading and unzipping all the images. So I may put ten or twenty images in each zip file, so the user can see some downloaded ones while waiting for more to come. Having multiple AsyncTasks fired together can be faster, right? But they won't finish at the same time, even given the same file size and the same download address?
Since latency is often the largest problem with mobile connections, reducing the number of connections you have to open is a great way to optimize the loading times. Sending a zip file with all the images sounds like a very good idea, and is probably worth the time implementing.
The images are probably already compressed (gif, jpg, png). You will not reduce the file size, but you will reduce the number of connections, which is a good idea for mobile. If it is always the same set of images you can use some sprite technique (sending one bigger image file containing all the images but with different x/y offsets; in HTML you can use the background with an offset to show the right image).
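If the server side happens to be PHP (the language used elsewhere on this page), the zip-bundling step suggested above could be sketched like this. The paths and the ~20-image batch size (taken from the edit above) are assumptions:

```php
<?php
// Sketch: bundle a batch of already-compressed images (jpg/png/gif) into one
// zip so the mobile client opens a single connection per batch.
// Paths and the batch size are illustrative assumptions.
$images  = glob('/var/www/images/*.{jpg,png,gif}', GLOB_BRACE);
$batch   = array_slice($images, 0, 20);   // ~20 images per archive, per the edit above
$zipPath = '/var/www/bundles/batch_001.zip';

$zip = new ZipArchive();
if ($zip->open($zipPath, ZipArchive::CREATE | ZipArchive::OVERWRITE) === true) {
    foreach ($batch as $file) {
        // STORE instead of DEFLATE: jpg/png/gif are already compressed, so
        // the win comes from fewer connections, not from smaller payloads.
        $zip->addFile($file, basename($file));
        $zip->setCompressionName(basename($file), ZipArchive::CM_STORE); // PHP 7.0+; omit on older versions
    }
    $zip->close();
}
```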
I was looking at the sidebar and saw this topic, though judging from the comments you're also asking about patching.
The best way is to make sure the user knows what to do with it: you want the user to download file X and get output Y for a given purpose. That said, it appears to be common practice to download chunks of resources separately when they are not native to the Android app and cannot fit in the APK.
A comparable example is the JDIC apps, which use the popular Japanese dictionary resources that are in turn used for English translations. JDIC apps like WWWJDIC use online downloads for the extremely large reference files that would otherwise have bad latency (as mentioned before) on Google's servers. It's also bad rep to have >200 MB in a Google Play app unless it is 3D, which is justifiable. If your images cannot be compressed without extremely long loading times in the app itself, you may need to consider this option. The only downside is having to require an online connection (also mentioned before).
Also, you could use 7-Zip and have your Android app self-extract it to a location: http://www.wikihow.com/Use-7Zip-to-Create-Self-Extracting-excutables
On another note, it would be optimal to do a one-time download on initial startup and have the app perform routine checks afterwards. You can then optionally put this in an AsyncTask so that your files are downloaded to the app and used after a restart (or however you want it), so you really need only one AsyncTask. The benefit of this is that the user syncs the app and may need to check only once. The downside is that the user may not always be able to update and may need to use 4G or LTE, but that is a minor concern if they can use Wi-Fi whenever they want.
What are the security considerations when a server fetches a file from an untrusted domain?
What are the security considerations when resizing an image that you don't trust with PHPs GD2 library?
The file will be stored on the server machine, and will be offered for download. I know I can't trust the MIME-Type header. Is there anything else I should be aware of?
I have a webservice that looks like this:
input
An http-URL (or a String that is expected to be a URL)
output
A meta description of the file, or an error if there was one.
The meta description has one of two forms:
It's an image + a URL to the image on my domain + a thumbnail of the image (generated on and hosted by my server)
It's not an image + a URL to the file on my domain
update
Concerns that I can come up with:
The remote server is a malicious server that will send tiny bits of information, just enough to keep the socket open, but doesn't do anything useful - like Slowloris. I don't know how real a threat this is. I suppose it could be easily avoided with a timeout plus a progress check.
The remote server serves something that looks like an image (headers, mime-type) but causes PHP to crash when I load it with GD2.
The server sends a useless or bad MIME-type header, like text/plain for binary files.
The remote server serves an image with a virus in it. I assume that resizing the image will get rid of the virus, but I will serve the original image if there is no reason to scale.
The remote server serves a file with a virus in it. The file will not be treated as an image so my server will do nothing with it. Nothing will happen until the user downloads and runs it.
Also, I assume I can trust the users of my service. This is a private application in a situation where users can be held accountable for bad behavior. I assume they won't intentionally try to break it.
What are the security considerations when a server fetches a file from an untrusted domain?
The domain (host) and the file are not to be trusted. This breaks down into two points:
Transport
Data
To transport the data safely, use a timeout and a size limit. Modern HTTP client libraries offer both. If the file could not be requested in time, drop the connection. If the file is too large, drop the data. Tell the user that there was a problem getting the file. Alternatively, let the user handle the transport to that server by using the user's browser and JavaScript to obtain the file, then post it, with the post limit set by your script.
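As a rough sketch of such a guarded transport with PHP's cURL extension - the 10-second timeout and 5 MB cap are arbitrary example limits:

```php
<?php
// Sketch: fetch an untrusted URL with a hard timeout and a size cap.
// The 10-second timeout and 5 MB cap are arbitrary example limits.
$maxBytes = 5 * 1024 * 1024;

$ch = curl_init($url);
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_CONNECTTIMEOUT => 5,           // give up if the host is unreachable
    CURLOPT_TIMEOUT        => 10,          // hard cap on the whole transfer (defeats slow drips)
    CURLOPT_MAXFILESIZE    => $maxBytes,   // advisory (needs Content-Length); enforced below as well
    CURLOPT_NOPROGRESS     => false,
    // Abort mid-transfer as soon as more than $maxBytes have arrived (PHP 5.5+ callback signature).
    CURLOPT_PROGRESSFUNCTION => function ($ch, $dlTotal, $dlNow, $ulTotal, $ulNow) use ($maxBytes) {
        return ($dlNow > $maxBytes) ? 1 : 0;   // non-zero return aborts the transfer
    },
));
$data = curl_exec($ch);

if ($data === false) {
    // Timed out, too large, or otherwise failed: tell the user and stop here.
    error_log('Fetch failed: ' . curl_error($ch));
}
curl_close($ch);
```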
As long as the data is untrusted you need to handle it with caution. That means you implement a process yourself that is able to run different security checks on the file before you mark it as "safe".
What are the security considerations when resizing an image that you don't trust with PHPs GD2 library?
Do not pass untrusted data to the image library then. See the step above, bring it into a safe state first.
The file will be stored on the server machine, and will be offered for download. I know I can't trust the MIME-Type header. Is there anything else I should be aware of?
I think you're still at the point above: how to get from untrusted to safe. Sure, you can't trust the Content-Type header, however it's good to understand it as well.
You want to protect against the Unrestricted File Upload vulnerability (OWASP).
Check the filename. If you store the data on your server, give it a safe temporary name that cannot be guessed up front and that is not accessible via the web (see the sketch after this list).
Check the data associated with the filename, e.g. the URL information of the source of that file. Properly handle encoding.
Drop anything that does not meet your expectations, so check the pre-conditions you formulate strictly.
Validate the file data before you continue, for example by using a virus checker.
Validate the image data before you continue. This includes file headers (magic numbers) as well as checking that the file size and file content are valid. You should use a library that specializes in this job, e.g. an image-file-format malformation checker. This is specialized software, so if this is part of your business, get into that business. Plenty of free image-file-handling code exists; I leave this just for info - you can't trust any recommendation anyway and need to get into the topic yourself.
If you plan to resize the image yourself, you need to make everything double-safe, because in addition to hosting you plan to process the data. So know what you will do with the data first, to locate potential problem areas.
Do logging and monitoring.
Have a plan for the case that everything goes wrong.
Consider to repeat the process for already existing files, so if you change your procedure, you are able to automatically apply the principles to uploads that were done in the past as well.
Create a system for each type of work that can be wiped clean after the work has been done: one system to do the download, one system to obtain the metadata, etc. After each action, restore the system from an image. If a single component fails, it won't be left over in an exploited state. Additionally, if you detect a failure, you can take your whole system out of business until you have found the flaw.
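As a rough illustration of two of the points above (the quarantine path and the accepted types are assumptions):

```php
<?php
// Sketch of two of the steps above: store the fetched bytes under a
// non-guessable temporary name outside the web root, then run a basic
// image-header sanity check before any further processing.
// $data holds the downloaded file contents; paths are illustrative assumptions.
$tmpDir  = '/srv/fetch-quarantine';                      // not served by the webserver
$tmpName = $tmpDir . '/' . bin2hex(openssl_random_pseudo_bytes(16));
file_put_contents($tmpName, $data);

// Magic-number check: exif_imagetype() (from the exif extension) reads the
// file header, not the remote server's Content-Type header.
$type    = exif_imagetype($tmpName);
$allowed = array(IMAGETYPE_JPEG, IMAGETYPE_PNG, IMAGETYPE_GIF);

if ($type === false || !in_array($type, $allowed, true)) {
    // Not an image: keep it quarantined (or drop it), but never serve it
    // from a path where the webserver might execute or inline it.
    exit('not an accepted image type');
}

// getimagesize() parses a bit more of the header; failure here is another
// cheap signal that the file is malformed.
if (getimagesize($tmpName) === false) {
    unlink($tmpName);
    exit('malformed image');
}
```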
All this depends a bit how much you want to do, but I think you get the idea. Create a process that works for you knowing where improvement can be added, but first create an infrastructure that is modular enough to deal with error-cases and which probably encapsulates the process enough to deal with any outcome.
You could delegate critical parts to a system that you don't need to care about, e.g. to separate processing from hosting. Additionally, when you host the images the webserver must not be clever. The more stupid a system is, the less exploitable it is (normally).
If hosting is not part of your business, why not hand it over to Amazon S3 or similar stores? Your domain can be preserved via DNS settings.
Keep the libraries you use to verify images up to date (which implies you know which libraries are used and their versions; e.g. the PHP exif extension makes use of mbstring, etc. - track the whole dependency tree down). Make sure you're in a position to report flaws to the library maintainers in a useful way, e.g. with logging, storing upload data to reproduce issues, etc.
Get knowledge about which exploits for images did exist in the past and which systems/components/libraries (example, see disclaimer there) were affected.
Also get into the topic of common ways to exploit something, to get the basics together (I'm sure you are aware, however it's always good to re-read some stuff):
Secure file upload in PHP web applications (Alla Bezroutchko; June 13, 2007; PDF)
Some related questions, assorted:
Is it important to verify that the uploaded file is an actual image file?
PHP Upload file enhance security
What you're describing basically comes down to an input validation problem; you don't trust what your application is reading in as input and processing.
To address this, what you should do is download the resource in question and then attempt to determine its true file type. There are multiple ways to attempt this, but basically you will want to use either some custom code or a library to parse through the file and look for the tell-tale signs of a certain type. There is a good SO discussion on how to do this in PHP here - How can I determine a file's true extension/type programatically? - I would check the second answer, which lists some PHP-specific functions for this. When your application receives a file, it should perform some true file typing like this and then compare the result to the MIME type specified by the remote server; if they match, accept the file, and if they do not, drop it.
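A small sketch of that compare step, using PHP's standard Fileinfo extension (the variable names are placeholders):

```php
<?php
// Sketch: determine the true type from the file contents and compare it to
// the MIME type the remote server claimed. Variable names are illustrative.
$finfo        = new finfo(FILEINFO_MIME_TYPE);
$detectedType = $finfo->file($downloadedPath);            // e.g. "image/png"
$claimedType  = strtolower(trim($remoteContentType));     // from the response header

if ($detectedType === false || $detectedType !== $claimedType) {
    // Mismatch or undetectable type: drop the file, per the rule above.
    unlink($downloadedPath);
    exit('type mismatch: claimed ' . $claimedType . ', detected ' . $detectedType);
}
```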
I would also suggest using a whitelist of allowable filetypes (a list of everything your service will support and then ONLY accept files of those types). If you have a very general-purpose service, then you should at least do a blacklist of disallowed filetypes (a list of everything your service absolutely will not support and drop those immediately based on the outcome of your MIME type compares). Again, the use of these is entirely dependent on your use-cases.
Once you've got a type, the concern becomes whether what the remote server has sent you is a bad file that targets your server (contains malicious code, a buffer overflow designed to make the GD2 library blow up and run arbitrary code, etc.). Basically, you are relying on the GD2 library not containing bugs that would lead to such a successful exploit. There's not much you can do here, short of running a security audit on the library yourself, and I'm going to assume that's out of scope. Basically, keep up on any reported security bugs with the library and patch as soon as you can; as a consumer of the library, you are really relying on the maintainers to find and remedy security vulnerabilities like this.
Next, the concern is that the remote server has sent you a bad file that targets your users/clients (contains malicious code, buffer overflows, viruses, etc.). Here, if there is corrupted data that is really malware in the image, it will most likely either (1) break or exploit GD2 when it is read (see above for that scenario) or (2) be eliminated when the resize operation is performed by the library, if GD2 can successfully process it. There is still a chance it will remain despite the processing, but there's not much you can do there either. If you're really concerned about this, you can apply a virus scan using an external product designed for that; I would suggest doing so both (1) after the download and before GD2 processing and (2) on the manipulated file before you serve it out. Personally, I don't think you gain much by doing this, but if you want to provide an additional check / warm fuzzies to your users, it cannot hurt.
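For reference, the GD2 re-encode step mentioned above might be sketched like this (the thumbnail width and JPEG quality are arbitrary example values):

```php
<?php
// Sketch: re-encode an untrusted JPEG through GD2 so only pixel data survives.
// The resize width and quality are arbitrary example values.
$src = @imagecreatefromjpeg($downloadedPath);   // returns false on malformed input
if ($src === false) {
    exit('not a decodable JPEG');
}

$width  = imagesx($src);
$height = imagesy($src);
$thumbW = 200;
$thumbH = (int) round($height * ($thumbW / $width));

$dst = imagecreatetruecolor($thumbW, $thumbH);
imagecopyresampled($dst, $src, 0, 0, 0, 0, $thumbW, $thumbH, $width, $height);

imagejpeg($dst, $thumbPath, 85);                // writes a freshly encoded file
imagedestroy($src);
imagedestroy($dst);
```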
To address the slow-feeding of data to keep a connection open, put a timeout on any connection to deal with this problem; unless you are dealing with a specific threat to your use-case here, I do not think this is a huge concern.
1) My primary concern with blindly fetching a file from an untrusted domain would be how to verify that the file is, in fact, what you expected to get. Could the untrusted server trick your script into downloading a harmful file (like a virus) or possibly a script that would allow a backdoor into your system?
2) I haven't read any security issues with resizing an image with the GD2 library. If it's not an image to begin with, the GD2 functions would throw an error. I don't think you have much to worry about with this part.
3) I (personally) would not ever do this without reviewing every single file that my script downloaded first. If you want to partially automate this, you might consider running magic number tests on all the files as a pre-filter. But a human look is the safest way to serve random files. When you finish this project - before you make it live - try to break / trick / hack it as hard as you can. Get some knowledgeable friends involved to help.
When it is not an image, you store the file anyway, regardless of what kind of file it is? So they could upload a PHP file and browse to it to execute PHP code on your server?
I have a general question about this.
When you have a gallery, sometimes people need to upload thousands of images at once. Most likely, it would be done through a .zip file. What is the best way to go about uploading this sort of thing to a server? Many times, servers have timeouts etc. that need to be accounted for. I am wondering what kinds of things I should be looking out for and what is the best way to handle a large number of images being uploaded.
I'm guessing that you would allow a user to upload a zip file (assuming the timeout does not affect you), and this zip file is uploaded to a specific directory; let's assume in this case a directory is created for each user in the system. You would then unzip the archive on the server and scan the user's folder for any directories containing .jpg or .png or .gif files (etc.) and then import them into a table accordingly, I'm guessing labeled by folder name.
What kind of server side troubles could I run into?
I'm aware that there may be many issues. Even general ideas would be good so I can then research further. Thanks!
Also, I would be programming in Ruby on Rails, but I think this question applies across any language.
There's no reason why you couldn't handle this kind of thing with a web application. There are a couple of excellent components that would be useful for this:
Uploadify (based on jquery/flash)
plupload (from moxiecode, the tinymce people)
The reason they're useful is that in the first instance, it uses a flash component to handle uploads, so you can select groups of files from the file browser window (assuming no one is going to individually select thousands of images..!), and with plupload, drag and drop is supported too along with more platforms.
Once you've got your interface working, the server side stuff just needs to be able to handle individual uploads, associating them with some kind of user account, and from there it should be pretty straightforward.
With regards to server side issues, that's really a big question, depending on how many people will be using the application at the same time, size of images, any processing that takes place after. Remember, the files are kept in a temporary location while the script is processing them, and either deleted upon completion or copied to a final storage location by your script, so space/memory overheads/timeouts could be an issue.
If the images are massive in size, say RAW or TIFF, then this kind of thing could still work with chunked uploads, but implementing some kind of FTP upload might be easier. It's a bit of a vague question, but there should be plenty here to get you going ;)
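If you do stick with the single-zip approach from your question, the server-side extraction and filtering isn't much code either. A rough sketch in PHP (the other examples on this page use PHP; the same idea maps directly onto Rails) - the paths and extension whitelist are assumptions:

```php
<?php
// Rough sketch of the zip-based approach from the question: extract the upload
// into the user's directory, then collect only image files for import.
// Paths and the extension whitelist are illustrative assumptions.
$userDir = '/var/galleries/user_' . (int) $userId;
$zip     = new ZipArchive();

if ($zip->open($_FILES['archive']['tmp_name']) === true) {
    // Note: for untrusted zips, also guard against entries containing '..'
    // in their names before extracting.
    $zip->extractTo($userDir);
    $zip->close();
}

$allowed = array('jpg', 'jpeg', 'png', 'gif');
$images  = array();

// Walk the extracted tree and keep only files whose header says "image".
$it = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($userDir));
foreach ($it as $file) {
    if (!$file->isFile()) {
        continue;
    }
    $ext = strtolower($file->getExtension());
    if (in_array($ext, $allowed, true) && @exif_imagetype($file->getPathname()) !== false) {
        $images[] = $file->getPathname();   // ready to be imported into the DB
    }
}
```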
With that many images it has to be a serious app, which gives you the liberty to suggest a piece of software running on the client (something like Yahoo Mail or Picasa does) that will take care of 'managing' the upload of images (network interruptions, resume support, etc.).
For the server side, you could process these one at a time (assuming your client is sending them that way), thus keeping it simple.
Take a peek at http://gallery.menalto.com
They have a dozen methods for uploading pictures into galleries.
You can choose the one which suits you.
Either have a client app, or some Ajax code that sends the images one by one, preventing timeouts. Alternatively, if this is not available to the public, FTP still works...
I'd suggest a client application (maybe written in AIR or Titanium) or telling your users what FTP is.
deviantArt.com for example offers FTP as an upload method for paying subscribers and it works really well.
Flickr instead has its own app for this, the "Flickr Uploadr".
I have a file host website that's burning through 2 Gbit of bandwidth, so I need to start adding secondary media servers to store the files. What would be the best way to manage a multiple-server setup with a large number of files? Preferably through PHP only.
Currently, I only have around 100 GB of files... so I could get a second server, mirror all content between them, and then round-robin the traffic 50/50, 33/33/33, etc. But once the total amount of files grows beyond the capacity of a single server, this won't work.
The idea that I had was to have a list of media servers stored in the DB with the amount of free space left on each server. Once a file is uploaded, PHP chooses which server the file is actually uploaded to, spreading all the files evenly among the servers.
Was hoping to get some more input/inspiration.
Can't use any 3rd-party services like Amazon. The files range from several bytes to a gigabyte.
Thanks
You could try MogileFS. It is a distributed file system. Has a good API for PHP. You can create categories and upload a file to that category. For each category you can define on how many servers it should be distributed. You can use the API to get a URL to that file on a random node.
If you are doing as much data transfer as you say, it would seem whatever it is you are doing is growing quite rapidly.
It might be worth your while to contact your hosting provider and see if they offer any sort of shared storage solutions via iscsi, nas, or other means. Ideally the storage would not only start out large enough to store everything you have on it, but it would also be able to dynamically grow beyond your needs. I know my hosting provider offers a solution like this.
If they do not, you might consider colocating your servers somewhere that either does offer a service like that, or would allow you to install your own storage server (which could be built cheaply from off-the-shelf components and software like FreeNAS or Openfiler).
Once you have a centralized storage platform, you could then add web servers to your heart's content and load balance them based on load, all while accessing the same central storage repository.
Not only is this the correct way to do it, it would offer you much more redundancy and expandability in the future if your endeavor continues to grow at the pace it is currently growing.
The other solutions offered, using a database repository of what is stored where, would work, but they not only add an extra layer of complexity into the fold, they also add an extra layer of processing between your visitors and the data they wish to access.
What if you lost a hard disk, do you lose 1/3 or 1/2 of all your data?
Should the heavy IO's of static content be on the same spindles as the rest of your operating system and application data?
Your best bet is really to get your files into some sort of storage that scales. Storing files locally should only be done with good reason (they are sensitive, private, etc.)
Your best bet is to move your content into the cloud. Mosso's CloudFiles or Amazon's S3 will both allow you to store an almost infinite amount of files. All your content is then accessible through an API. If you want, you can then use MySQL to track meta-data for easy searching, and let the service handle the actual storage of the files.
I think your own idea is not the worst one: get a bunch of servers, and for every file store which server(s) it's on. If new files are uploaded, use most-free-space first* (see the sketch at the end of this answer). Every server handles its own delivery (instead of piping through the main server).
pros:
use multiple servers for a single file. e.g. for cutekitten.jpg: filepath="server1\cutekitten.jpg;server2\cutekitten.jpg", and then choose the server depending on the server load (or randomly, or alternating, ...)
if you're careful you may be able to move around files automatically depending on the current load. so if your cute-kitten image gets reddited/slashdotted hard, move it to the server with the lowest load and update the entry.
you could do this with a cron job. Just log the downloads for the last xx minutes. Try some formula like downloads-per-minute * filesize * (product of server loads) for weighting. Pick thresholds for increasing/decreasing the number of servers those files are distributed to.
if you add a new server, it's relatively painless (just add the address to the server pool)
cons:
homebrew solutions are always risky
your load distribution algorithm must be well tested, otherwise bad things could happen (everything mirrored everywhere)
constantly moving files around for balancing adds additional server load
* or use a mixed weighting algorithm: free-space, server-load, file-popularity
disclaimer: never been in the situation myself, just guessing.
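For what it's worth, the "most free space first" pick could be sketched like this - the `servers` and `files` tables and their columns are made up for illustration:

```php
<?php
// Rough sketch of the "most free space first" pick described above.
// The 'servers' and 'files' tables and their columns are made up for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=filehost', 'user', 'pass');

// Pick the media server with the most free space left.
$server = $pdo->query(
    "SELECT id, hostname, free_bytes FROM servers
     ORDER BY free_bytes DESC LIMIT 1"
)->fetch(PDO::FETCH_ASSOC);

// Push the uploaded file to that server (FTP/SCP/HTTP, whatever fits),
// then record where it lives and decrement the cached free space.
$pdo->prepare(
    "INSERT INTO files (name, size, server_id) VALUES (?, ?, ?)"
)->execute(array($fileName, $fileSize, $server['id']));

$pdo->prepare(
    "UPDATE servers SET free_bytes = free_bytes - ? WHERE id = ?"
)->execute(array($fileSize, $server['id']));
```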
Consider HDFS, which is part of Apache's Hadoop. This will integrate with PHP, but you'll be setting up a second application. This will also solve all your points of balancing among servers and handling things when your file space usage exceeds one server's ability. It's not purely in PHP, though, but I don't think that's what you meant when you said "pure" anyway.
See http://hadoop.apache.org/core/docs/current/hdfs_design.html for the idea of it. They cover the whole idea of how it handles large files, many files, replication, etc.