I'm currently separating the video conversion part of our web application (kind of like YouTube, where users upload videos and we convert them to FLV/MP4) onto a different server. I already have the system running with Gearman on a single machine: when a user uploads a video file to server A, it gets picked up by a Gearman worker on that same server A.
Now I've moved the worker to server B, so the worker on server B needs to access the uploaded file on server A. Currently I use SCP to copy the file from A to B and then process it. This method works, but I feel like there should be a cleaner way of doing it; I just haven't found any information about sending files (or large files) to Gearman workers. How would you approach this problem?
Preferably the client would send the video file as part of the command that starts the background job, so I wouldn't have to worry about where the file actually is from within the worker. That way I could add more conversion servers without too much hassle.
I'm using PHP (with Gearman extension) for both my webpage and the worker.
As was suggested in the comments, having a shared FS is the (usual) way to implement this: simply pass the file path around in the Gearman job request. Gearman is not well suited to passing around large blobs of data, as it has to keep all of the information for a job in memory. It was never designed for handling the transfer and distribution of large files. Since MogileFS was also initially developed at Danga, there simply was no need to also incorporate file transfer and handling in Gearman (and that's a good thing; there are quite a few technologies that solve that problem better than Gearman ever would).
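For illustration, a minimal sketch of that pattern with the PHP Gearman extension; the Gearman host, function name, shared-mount path and ffmpeg call are assumptions, not part of the original setup:

    // Web front-end (server A): submit only the path on the shared mount.
    $client = new GearmanClient();
    $client->addServer('gearman-host', 4730);                        // assumed job server
    $payload = json_encode(['path' => '/mnt/videos/incoming/abc123.avi']);
    $client->doBackground('convert_video', $payload);

    // Worker (server B): read the same path from the shared mount, no copying.
    $worker = new GearmanWorker();
    $worker->addServer('gearman-host', 4730);
    $worker->addFunction('convert_video', function (GearmanJob $job) {
        $data = json_decode($job->workload(), true);
        $src  = $data['path'];
        $dst  = preg_replace('/\.\w+$/', '.mp4', $src);
        // hypothetical conversion command; adjust codecs/options to your pipeline
        exec('ffmpeg -i ' . escapeshellarg($src) . ' ' . escapeshellarg($dst));
    });
    while ($worker->work());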
We're using NFS to hand videos off to distributed workers when they arrive, and the encoder puts the encoded video back onto the NFS share that's available to the public when it's done. We haven't had a serious issue yet; NFS is stable and its problems are well known and already solved for the kind of loads you'll see.
Related
I am building a web application that allows users to upload audio files, music in particular. Most of the time I expect each song to be a few minutes long and the file to be approximately 3-10 MB in size. However, I would like to accept audio uploads up to about 100 MB, possibly allowing for over an hour of audio. I am currently using a combination of FFmpeg, SoX, and LAME to convert from 7 possible formats to MP3 and perform audio modifications including equalization, trimming, and fading. The files are then stored and linked in the database.
My current strategy is to handle the entire process in one HTTP file upload request using PHP on the backend, in which I perform the following functions:
Validation
Transcode audio into multiple versions (using shell through PHP; see the sketch after this list)
Store the original and transcoded versions in a temp directory
Upload all audio files to Amazon S3 for permanent storage
Commit the ID of each file to a database, linking them to the user
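As referenced in the list, a rough shape of the transcode-via-shell step; the bitrate, flags and file paths here are assumptions, not the poster's actual commands:

    // Hypothetical sketch of one transcode pass (FFmpeg with the LAME encoder).
    $src = '/tmp/uploads/original.wav';
    $dst = '/tmp/uploads/original.192k.mp3';
    $cmd = sprintf(
        'ffmpeg -y -i %s -codec:a libmp3lame -b:a 192k %s 2>&1',
        escapeshellarg($src),
        escapeshellarg($dst)
    );
    exec($cmd, $output, $exitCode);
    if ($exitCode !== 0) {
        // keep the tool output around instead of failing silently
        error_log("transcode failed:\n" . implode("\n", $output));
    }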
This works very similarly to an image processing system I have already set up. However, while images can complete this whole process in just a few seconds, audio can take a lot longer; at most, audio could take about 5-10 minutes to be processed and stored.
My questions are:
For audio processing, would it be better to fork off the transcoding to another background process, writing its state to the database, and pinging it every few seconds to update the webpage vs. doing it all in one HTTP request?
With the intention of scaling in the future, would it be advisable to do all processing on a single server instance, leaving the frontend web instances free to replicate / be destroyed?
If yes, would this require cross-domain file uploading directly to that server? (Does anyone know if this is how YouTube or the big sites do it?)
Thanks!
If I understand your system correctly, your best approach is probably something more like this:
In your web front-end, store the audio and create a "task" indicating that the audio needs to be processed.
Run a background task that pulls tasks and does the processing. At the end of the task, the user can be notified (if necessary) and database state can be updated or whatever.
Your tasks should be written so that if they fail partway through, they can be re-executed from the start without causing problems. You can run multiple background tasks and web front-ends in this architecture.
A good way to dispatch tasks is a message-passing system like AMQP; RabbitMQ is a popular broker, and there are cheap hosted services that will run it for you. You can, of course, also build your own on top of any database, though this may require polling.
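If you do build it on a database, a minimal polling worker could look roughly like this; the table schema, column names and process_audio() call are assumptions for the sketch:

    // Assumed schema:
    //   CREATE TABLE tasks (id INT AUTO_INCREMENT PRIMARY KEY, file_id INT,
    //     status ENUM('pending','working','done','failed') DEFAULT 'pending');
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    while (true) {
        // claim one pending task inside a transaction so two workers never grab the same row
        $pdo->beginTransaction();
        $task = $pdo->query("SELECT id, file_id FROM tasks
                             WHERE status = 'pending'
                             ORDER BY id LIMIT 1 FOR UPDATE")->fetch(PDO::FETCH_ASSOC);
        if (!$task) { $pdo->commit(); sleep(5); continue; }
        $pdo->prepare("UPDATE tasks SET status = 'working' WHERE id = ?")->execute([$task['id']]);
        $pdo->commit();

        $ok = process_audio($task['file_id']);   // hypothetical: transcode, push to S3, update DB
        $pdo->prepare("UPDATE tasks SET status = ? WHERE id = ?")
            ->execute([$ok ? 'done' : 'failed', $task['id']]);
    }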
Finally, you might find it faster and more efficient to use a service like Zencoder to do your transcoding, because they can parallelize the work and probably handle more input formats, but it may not be compatible with your processing.
You definitely want to throw the audio processing to a background process.
Depending on the scalability involved, you might need a computer dedicated to the processing. You might also want to look into other resources you can offload audio work to (like PCIe cards and such).
Sorry to say I know nothing about cross-domain file uploading or how the big dogs do it (YouTube, SoundCloud, etc.).
I am designing a file download network.
The ultimate goal is to have an API that lets you directly upload a file to a storage server (no gateway or anything like that). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds it is selected from the database, and an HTTP redirect is done (or an API returns the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID at any point; every drive is just attached to the server as JBOD. All the replication happens at the application level. If one server breaks down, it is just marked as broken in the database, and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring / balancing and maybe, later, billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySql LAMP stack.
A URL that hits the current storage server is generated by the API (that's no problem so far; just a classic PHP website and MySQL database).
Now it gets interesting...
The storage server runs Apache2, and a PHP script catches the request. URL parameters (a secure token hash) are validated; IP, timestamp and filename are checked so the request is authorized. (No database connection required, just a PHP script that knows a secret token.)
The PHP script sets the file headers so the download is handed off to Apache's mod_xsendfile.
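A rough sketch of that gatekeeper script, assuming an HMAC-style token over filename + client IP + expiry time; the parameter names, token scheme and storage root are assumptions:

    $secret  = 'shared-secret-known-to-API-and-storage-server';
    $file    = $_GET['file']  ?? '';             // e.g. "bucketname/video.mp4"
    $expires = (int) ($_GET['expires'] ?? 0);    // unix timestamp
    $token   = $_GET['token'] ?? '';

    $expected = hash_hmac('sha256', $file . '|' . $_SERVER['REMOTE_ADDR'] . '|' . $expires, $secret);

    if (strpos($file, '..') !== false || $expires < time() || !hash_equals($expected, $token)) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }

    // Hand the actual transfer to Apache via mod_xsendfile; PHP is done immediately.
    header('Content-Type: application/octet-stream');
    header('X-Sendfile: /data/storage/' . $file);    // assumed storage root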
Apache delivers the file passed via mod_xsendfile and is configured to have the access log piped to another PHP script.
Apache runs mod_logio, and the access log is in combined I/O log format, additionally extended with the %D variable (the time taken to serve the request, in microseconds) to calculate transfer speeds and spot bottlenecks in the network and so on.
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, that is assigned to one client, so the client is known), counts input/output traffic and increments database fields. For performance reasons I thought about having daily rows and updating them like traffic = traffic + X, creating the row if no row has been updated.
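A sketch of what that piped log handler might look like, reading log lines from STDIN and doing the traffic = traffic + X upsert; the log-format regex, LogFormat nickname, table and column names are assumptions:

    // Invoked by Apache roughly as: CustomLog "|/usr/bin/php /path/to/logsink.php" combinedio_d
    // where combinedio_d is assumed to be the combined I/O format with %D appended:
    //   "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" %I %O %D"
    // Assumed table:
    //   CREATE TABLE traffic_daily (bucket VARCHAR(64), day DATE, bytes_out BIGINT,
    //     PRIMARY KEY (bucket, day));
    $pdo  = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
    $stmt = $pdo->prepare(
        'INSERT INTO traffic_daily (bucket, day, bytes_out) VALUES (?, CURDATE(), ?)
         ON DUPLICATE KEY UPDATE bytes_out = bytes_out + VALUES(bytes_out)'
    );

    while (($line = fgets(STDIN)) !== false) {
        // grab the request path plus the trailing %I %O %D fields
        if (!preg_match('/"(?:GET|HEAD) (\S+) HTTP[^"]*" .* (\d+) (\d+) (\d+)\s*$/', $line, $m)) {
            continue;
        }
        $bucket   = explode('/', ltrim($m[1], '/'))[0];   // first path segment = client bucket
        $bytesOut = (int) $m[3];                          // %O: bytes sent, headers included
        $stmt->execute([$bucket, $bytesOut]);
    }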
I have to mention that the servers will be low-budget machines with massive storage.
You can have a closer look at the intended setup in this thread on Server Fault.
The key data is that the systems will have gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests). Maybe 500 MB on average or so.
The currently planned setup runs on a cheap consumer mainboard (ASUS), 2 GB of DDR3 RAM and an AMD Athlon II X2 220 (2x 2.80 GHz, tray) CPU.
Of course download managers and range requests will be an issue, but I think the average size of an access will be at least around 50 MB or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be read directly from the last request and does not issue another request to the MySQL server?
Do you think the system with the specs given above can handle this? If not, how could I improve? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using Lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address, and I am not so flexible. It would have the advantage that the download validation would not need a PHP process to fire. But as that process only runs briefly and doesn't read and output the data itself, I think this is OK, don't you? I once served downloads using Lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that.
What do you think about the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update the database in a second process that can handle delays? Or not do it at all and parse the log files at night? My thought is that I would like to have it as close to real time as possible and not have data accumulating anywhere other than the central database. I also don't want to have to keep track of jobs running on all the servers; that could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open-source all of this. I just think there needs to be an open-source alternative to expensive storage services such as Amazon S3 that is oriented toward file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would reuse an existing solution, preferably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly the thing you want.
I've recently set up a test server running in a virtual machine on my computer so I can do such things as interactive debugging with XDebug. For the most part it's pretty sweet, but I've run into a snag when running multiple requests to the server at once from the same client.
The problem is that the guest-host network connection doesn't really exist as a physical connection, so it will run as fast as the computer hardware will allow. This isn't usually a big issue, but I'm trying to implement APC file upload monitoring, and this requires an AJAX request to run in parallel to the file upload to monitor its performance. In the real world, the network would introduce lag, latency and suchlike, leaving enough unused bandwidth for the AJAX request to run in parallel with the file upload. However, on the test machine, the AJAX request can't fetch any data from the server until the upload is finished, as there's absolutely no bandwidth left available to it.
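For context, the polling endpoint behind that APC upload monitoring is roughly this; it assumes apc.rfc1867 = 1 in php.ini, the usual hidden APC_UPLOAD_PROGRESS field in the form, and a hypothetical 'key' GET parameter:

    // Called repeatedly via AJAX while the upload request is still running.
    $key    = $_GET['key'] ?? '';
    $status = apc_fetch('upload_' . $key);    // false until APC has registered the upload
    header('Content-Type: application/json');
    echo json_encode($status ?: ['current' => 0, 'total' => 0, 'done' => 0]);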
Is it possible to set up some kind of bandwidth management in the virtual machine (in Apache, PHP or some Linux utility) that could limit the bandwidth available per HTTP request? For example, so that each request is limited to 1 Mbps but several requests can exist between the client and the server at the same time? I'm hoping that if this can be done, it will allow the AJAX request to fetch its data while the upload is progressing, instead of stalling until the upload actually completes.
I tried a utility called IPRelay, but I can't seem to get it to work, or at least not in a way that limits per request.
What you're asking for is called Traffic Shaping.
Lighttpd (an alternative to Apache) supports this natively
For Apache, there are a few ways of doing it.
mod_bandwidth - a third-party module (that hasn't been updated recently) which appears to do the same thing.
mod_bwshare - a third-party module designed to combat DoS attacks, but it may be helpful.
Here's a ServerFault Question that may be relevant...
Thanks for the reply. However, I found a handy little utility for Linux called iprelay that lets me throttle connections; it seems to let me have multiple connections open with each connection throttled to the specified limit. That's what I've been using today for testing my APC code, and it all seems to be working fine.
I created a simple web interface to allow various users to upload files. I set the upload limit to 100 MB, but now it turns out that the client occasionally wants to upload files of 500 MB+.
I know how to alter the PHP configuration to change the upload limit, but I was wondering if there are any serious disadvantages to uploading files of this size via PHP?
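For reference, these are the directives that usually have to move together for big uploads (upload_max_filesize and post_max_size can only be changed in php.ini or .htaccess, not at runtime); a quick way to see what the server is actually running with:

    // Print the limits that govern large uploads on this server.
    $directives = ['upload_max_filesize', 'post_max_size', 'max_input_time',
                   'max_execution_time', 'memory_limit'];
    foreach ($directives as $d) {
        echo $d . ' = ' . ini_get($d) . PHP_EOL;
    }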
Obviously FTP would be preferable, but if possible I'd rather not have two different methods of uploading files.
Thanks
Firstly FTP is never preferable. To anything.
I assume you mean that you're transferring the files via HTTP. While not quite as bad as FTP, it's not a good idea if you can find another way of solving the problem. HTTP (and hence its component programs) is optimized around transferring relatively small files around the internet.
While the protocol supports server-to-client range requests, it does not allow for the reverse operation. Even if the software at either end were unaffected by the volume, the more data you push across, the greater the interval during which you could lose the connection. And the biggest problem is the caveat in that last sentence.
Regardless of the server technology you use (PHP or something else), it's never a good idea to push a file that big in one sweep in synchronous mode.
There are lots of plugins for any technology/framework that will do asynchronous upload for you.
Besides the connection timing out, there is one more disadvantage in that file uploading consumes the web server memory. You don't normally want that.
PHP will handle as many and as large a file as you'll allow it. But consider that it's basically impossible to resume an aborted upload in PHP, as scripts are not fired up until AFTER the upload is completed. The larger the file gets, the larger the chance of a network glitch killing the upload and wasting a good chunk of time and bandwidth. As well, without extra work with APC, or using something like uploadify, there's no progress report and users are left staring at a browser showing no visible signs of actual work except the throbber chugging away.
I'm playing with an embedded Linux device and looking for a way to get my application code to communicate with a web interface. I need to show some status information from the application on the device's web interface, and I would also like a way to inform the application of any user actions, like uploaded files etc. PHP seems to be a good way to build the interface, but the communication part is harder. I have found the following options, but I'm not sure which would be the easiest and most convenient to use.
Sockets. I'd have to enable sockets in PHP first to try this. I don't know if enabling them will take much more space.
Database. Seems like an overkill solution.
Shared file. Seems like a lot of work.
Named pipes. Tried this with some success, but I'm not sure if there will be problems with, for example, simultaneous page loads. Maybe sockets are easier?
What would be the best way to go? Is there something I'm totally missing? How is this done in those numerous commercial Linux based network switches?
I recently did something very similar using sockets, and it worked really well. I had a Java application that communicated with the device; it listened on a server socket, and the PHP application was the client.
So in your case, the PHP client would initialize the connection, and then the server can reply with the status of the device.
There are plenty of tutorials on how to do client/server socket communication in most languages, so it shouldn't take too long to figure out.
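On the PHP side, the client can be as small as this sketch; the host, port and line-based protocol are assumptions:

    // Connect to the device application's server socket and ask for its status.
    $fp = stream_socket_client('tcp://127.0.0.1:9000', $errno, $errstr, 5);
    if (!$fp) {
        die("Could not connect: $errstr ($errno)");
    }
    fwrite($fp, "STATUS\n");        // hypothetical one-line request
    $reply = fgets($fp, 4096);      // hypothetical one-line reply, e.g. key=value pairs
    fclose($fp);
    echo 'Device says: ' . trim($reply);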
What kind of device is it?
If you work with something like a shared file, how will the device be updated?
How will named pipes run into concurrency problems that sockets will avoid?
In terms of communication from the device to PHP, a file seems perfect. PHP can use something basic like file_get_contents(), and the device can just write to the file. If you're worried about catching the file at the moment it's being updated, do a quick length check.
In terms of PHP informing the device of what to do, I'm also leaning towards files. Have the device watch a directory, and have the script create a file there with something like file_put_contents($path . uniqid(), $command); that way, should two scripts run at the exact same time, you simply have two files for the device to work with.
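A small sketch of both directions with plain files; the paths, file names and command format are assumptions:

    // Device -> PHP: read whatever status the device last wrote.
    $raw    = @file_get_contents('/var/run/devapp/status.json');
    $status = $raw !== false ? json_decode($raw, true) : null;

    // PHP -> device: one uniquely named file per command, so simultaneous page
    // loads never clobber each other. Write to a dotfile first, then rename,
    // so the watching application never sees a half-written file.
    $spool = '/var/spool/devapp/commands/';
    $tmp   = $spool . '.' . uniqid('', true);
    file_put_contents($tmp, "REBOOT_INTERFACE eth0\n");       // hypothetical command
    rename($tmp, $spool . uniqid('cmd_', true));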
Embedded Linux boxes for routing with a web interface don't use PHP. They use CGI and have shell scripts deliver the web pages.
For getting information from the application to the web interface, the Shared file option seems most reasonable to me. The application can just write information into the file which is read by PHP.
The other way around doesn't look so good at first. PHP supports locking of files, but it most probably doesn't work at a system level. Perhaps one solution is that every PHP script which has information for the application in fact creates its own file (with a unique filename, e.g. based on timestamp + random value). The application could watch a designated directory for these files to pop up. After processing them, it could just delete them. For that, the application only needs write permission on the directory (so file ownership is not an issue).
If possible, use shell scripts.
I did something similar; I wrote a video surveillance application. The video part is handled by Motion (a great FOSS package). The application is a turn-key solution on standardized hardware, used to monitor slot-machine casinos. It serves as a kiosk system locally and is accessible via the internet. I wrote all the UI code in PHP; the local display is a tightly locked-down KDE desktop with a full-screen browser defaulting to localhost. I used shell scripts to interact with Motion and the OS.
On a second thought:
If you can use self-compiled applications on the device: write a simple program that returns the value you want and use PHP's exec(), passthru() or system().
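For example, a sketch along these lines; the helper binary name and its output format are purely hypothetical:

    // Call a self-compiled helper and show its output on the web interface.
    $output   = [];
    $exitCode = 0;
    exec('/usr/local/bin/devstatus --json 2>&1', $output, $exitCode);
    if ($exitCode === 0) {
        $status = json_decode(implode("\n", $output), true);
        echo 'Temperature: ' . ($status['temperature'] ?? 'n/a');
    } else {
        echo 'Helper failed: ' . implode("\n", $output);
    }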