I'm trying to create an AJAX push implementation in PHP using the Comet long-polling method. My code uses file_get_contents() to repeatedly read a file and check for any messages to send to the user. To reduce server load, I'm using two text files: one containing the actual command and one acting as a "change notifier", which either cycles through 0-9 or contains a UNIX timestamp. My question is: how often can I access and read a small file (only a few bytes) without overloading the server? The push implementation means I can poll for changes much more often than requesting a file every few seconds, but there must still be a limit.
If it helps, I'm using the 1&1 Home (Linux) hosting plan, which is shared hosting.
Assuming you're running a sane OS that will cache the "change notifier" file in RAM, the operation is so cheap as to be insignificant. PHP itself would become the bottleneck long before the file reads do.
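For a rough idea of what that loop might look like, here's a minimal sketch; notify.txt, message.txt and the timings are illustrative choices, not from your setup:

    <?php
    // Minimal long-poll sketch: hold the request open, re-reading the small
    // notifier file until it changes or we give up and let the client re-poll.
    $last  = isset($_GET['last']) ? $_GET['last'] : '';
    $start = time();

    while (time() - $start < 25) {                 // stay under typical 30s timeouts
        $current = trim(file_get_contents('notify.txt'));
        if ($current !== $last) {                  // notifier changed: push the message
            echo json_encode(array(
                'notify'  => $current,
                'message' => file_get_contents('message.txt'),
            ));
            exit;
        }
        usleep(250000);                            // check 4 times per second
    }
    // Nothing changed: tell the client to reconnect and poll again.
    echo json_encode(array('notify' => $last, 'message' => null));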
I have an AJAX image uploader that sends images to a PHP script, where they are validated, resized and saved into a directory. The AJAX uploader allows multiple files to be uploaded at the same time. Since it allows multiple files, there can be timeouts, so I thought of increasing the execution time using set_time_limit. But I'm having trouble determining how much time to set, since the default is 30 seconds. Will one minute be enough? The images upload properly on my local machine, but I suspect there will be timeout errors on a shared hosting service. Any ideas and thoughts on how others have implemented this would be valuable. Thanks.
You can set it to 5 minutes if you need to, but for obvious reasons you don't really want it that high, especially for HTTP calls.
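If you do raise it, do it only inside the upload script rather than globally; something along these lines (the 300 seconds is an arbitrary pick):

    <?php
    // Sketch: raise the limit just for this request.
    set_time_limit(300);          // 5 minutes; 0 would remove the limit entirely
    // ... validate / resize / save the uploaded images ...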
So... if I had the energy, this is what I'd do...
I need:
process_initializer.php
process_checker.php
client.html
the_process.php (runs in the background)
...
client uploads files to process_initializer.
initializer creates a unique ID, maybe based off of time with milliseconds or some other advanced solution
initializer starts a background process, sending it necessary arguments like filenames along with the ID
initializer responds to the client with ID
client then polls process_checker to see what's going on with ID (maybe 20 second intervals - setTimeout(), whatever)
process_checker checks whether a file output_ID.txt exists, which the_process should create when it's done. If it doesn't exist yet, respond to the client that the job isn't ready; if it does, send the output back and the client can do whatever it needs with it. A rough sketch of the initializer and checker follows below.
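A very rough sketch of the two endpoints; the jobs/ directory, the form field name and the JSON responses are just illustrative choices:

    <?php
    // process_initializer.php -- receives the upload, starts the background job.
    $id       = uniqid('job_', true);                 // unique job ID
    $uploaded = array();
    foreach ($_FILES['images']['tmp_name'] as $i => $tmp) {
        $dest = 'jobs/' . $id . '_' . $i . '.tmp';
        move_uploaded_file($tmp, $dest);
        $uploaded[] = $dest;
    }
    // Redirect output and background the process so exec() returns immediately.
    $cmd = 'php the_process.php ' . escapeshellarg($id) . ' '
         . implode(' ', array_map('escapeshellarg', $uploaded))
         . ' > /dev/null 2>&1 &';
    exec($cmd);
    echo json_encode(array('id' => $id));

    <?php
    // process_checker.php -- polled by the client with ?id=...
    $id  = basename($_GET['id']);                     // basename() to block path tricks
    $out = 'jobs/output_' . $id . '.txt';             // the_process.php writes this when done
    if (file_exists($out)) {
        echo json_encode(array('done' => true, 'output' => file_get_contents($out)));
    } else {
        echo json_encode(array('done' => false));
    }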
When Apache runs PHP it uses one php.ini configuration, and when you run PHP from the command line or from another script, like exec('php the_process.php arg1 arg2'), it uses a different one, referred to as the PHP CLI configuration, unless you have configured the CLI to use the same php.ini as Apache. The important thing is that they may use different settings, so you can let CLI scripts take more time than your HTTP-called scripts.
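A quick way to see which configuration each environment actually uses is to run a tiny check script once through Apache and once from the command line:

    <?php
    // check.php -- php_ini_loaded_file() reports the php.ini currently in effect.
    echo php_ini_loaded_file(), "\n";
    echo ini_get('max_execution_time'), "\n";   // CLI typically reports 0 (unlimited)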
I'm currently separating the video conversion part of our web page (kind of like YouTube, where users upload videos and we convert them to FLV/MP4) onto a different server. I already have the system running with Gearman on the same machine: when a user uploads a video file to server A, it gets picked up by a Gearman worker on the same server A.
Now I have moved the worker to server B, so the worker on server B needs to access the uploaded file on server A. Currently I use SCP to copy the file from A to B and then process it. This works, but I feel there should be a cleaner way of doing it; I just haven't found any information about sending files (or large files) to Gearman workers. How would you approach this problem?
Preferably the client would send the video file as part of the command to start a background job, so I don't have to worry about where the file actually is from within the worker. That way I can add more conversion servers without too much hassle.
I'm using PHP (with Gearman extension) for both my webpage and the worker.
As was suggested in the comments, having a shared FS is the (usual) way to implement this, and simply pass the path around in the job request from gearman. Gearman is not well-suited for passing around large blobs of data, as it has to keep all of the information for a job in memory. It was never designed for handling the transfer and distribution of large files. Since MogileFS was also initially developed at Danga, there simply was no need to also incorporate file transfer and handling in Gearman (and that's a good thing, there's quite a few technologies that solve that problem better than Gearman would ever do).
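As a hedged sketch of what that looks like with the PHP Gearman extension (the host name, the function name convert_video and the /mnt/shared path are placeholders, not your actual setup):

    <?php
    // Web server (client): queue the job with only the path on the shared FS.
    $client = new GearmanClient();
    $client->addServer('gearman.example.com');
    $client->doBackground('convert_video', json_encode(array(
        'path'   => '/mnt/shared/uploads/video123.avi',
        'format' => 'mp4',
    )));

    <?php
    // Conversion server (worker): read the same path from the shared mount.
    $worker = new GearmanWorker();
    $worker->addServer('gearman.example.com');
    $worker->addFunction('convert_video', function (GearmanJob $job) {
        $args = json_decode($job->workload(), true);
        // ... run the encoder against $args['path'] on the shared mount ...
    });
    while ($worker->work());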
We're using NFS for handling distributed workers when videos arrive, and the encoder puts the encoded video back onto the NFS share that's available to the public when it's done. We haven't had a serious issue yet; NFS is stable, and its problems are well known and already solved for the kind of loads you'll see.
I have a page that uploads a file to my server, where it then gets copied to a permanent directory via move_uploaded_file. This all seems to work great, with the exception that in a real-life scenario I will be expecting much larger files than I have successfully sent up.
I have already tackled the timeout for the file upload by changing the connection timeout in my site settings in IIS, so the file continues to upload for up to six hours ( -_- ), but this is where I run into my current problem: it might actually take six hours!
After getting the upload process past 10% or so (on a 300 MB file), I noticed that the file keeps uploading, but the upload rate seems to be falling off: I observed faster speeds when I started the transfer than I am seeing halfway through it. The numbers here aren't necessarily relevant; I know that my connection (still 2 Mbps while I'm uploading) can push faster than it currently is, and the server on the other end is on fiber.
I wonder if anyone has encountered this before, and if so, whether you found a workaround. Any help appreciated. Thanks.
You should not be using HTTP for this task. You may have observed that all the "file locker" services (and others which involve uploading files, such as Apple's online-music service) provide you with an "uploader" program rather than making use of the browser. There are reasons for this.
First off, the transfer-encoding overhead can be large. If your uploader Base64-encodes the (presumably binary) data before sending it, as many browser-based uploaders do, that's 33% overhead: a transfer that takes four hours over HTTP would take only three with a raw binary protocol, and that's before accounting for chunked-transfer overhead, so the reality is probably more severe. (A plain multipart/form-data POST sends the bytes unencoded, but the problems below still apply.)
Second, there's no way to "resume" an upload in HTTP. So if your connection is broken, you'll either have to write application-specific code to handle the resumption, or start all over.
Third, HTTP servers are not designed for super-long-lived connections: they usually have a small, finite pool of workers to service the (normally seconds-long) client requests, and they often have smallish limits on the size of the request data (2 GB is common, and PHP by default allows only a few MB).
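For reference, if you do stay with HTTP and PHP despite all this, these are the directives that usually impose those limits (the values below are only examples):

    ; php.ini (or a per-vhost override)
    upload_max_filesize = 512M
    post_max_size       = 520M    ; must be at least upload_max_filesize
    max_input_time      = 3600    ; seconds allowed for receiving/parsing the request body
    max_execution_time  = 300     ; script run time after the body has arrived
    memory_limit        = 256M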
I strongly recommend using a file transfer protocol to transfer files (such as FTP). You don't have to give out a single username/password pair to everyone: you can have a gatekeeper which integrates with whatever authentication system you already have in place. FTP-over-TLS also exists and is relatively mature.
There is a fairly good summary of the differences between the two protocols here. Note that you gain nothing from any of the advantages of HTTP listed, due to your circumstances.
Don't feel limited to FTP - rsync is a great protocol for transferring files as well, especially if you only change part of the file (it even does binary deltas!). git can also efficiently transport large blobs over secure connections or even HTTP, if you insist on using that.
I am designing a file download network.
The ultimate goal is to have an API that lets you upload a file directly to a storage server (no gateway or anything in between). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or an API returns the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything similar at any point. Every drive is just hung into the server as JBOD. All the replication happens at the application level. If one server breaks down, it is simply marked as broken in the database, and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring/balancing and maybe, later, billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySQL LAMP stack.
A URL that hits the current storage server is generated by the API (that's no problem so far; just a classic PHP website and MySQL database).
Now it gets interesting...
The storage server runs Apache2, and a PHP script catches the request. The URL parameters (a secure token hash) are validated: IP, timestamp and filename are checked, so the request is authorized. (No database connection required, just a PHP script that knows a secret token.)
The PHP script sets the X-Sendfile header so the file is delivered via Apache's mod_xsendfile.
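A rough sketch of that delivery script (the token scheme here is simplified and all names are illustrative; the real check would cover IP, timestamp and filename exactly as described above):

    <?php
    // download.php -- validate the signed URL, then hand the file to mod_xsendfile.
    $secret  = 'shared-secret-known-to-api-and-storage-server';
    $file    = basename($_GET['file']);
    $expires = (int)$_GET['expires'];
    $token   = $_GET['token'];

    $expected = hash_hmac('sha256', $file . '|' . $_SERVER['REMOTE_ADDR'] . '|' . $expires, $secret);
    if ($expires < time() || $token !== $expected) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }

    header('X-Sendfile: /data/storage/' . $file);      // Apache + mod_xsendfile streams the file
    header('Content-Type: application/octet-stream');
    header('Content-Disposition: attachment; filename="' . $file . '"');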
Apache delivers the file passed via mod_xsendfile and is configured to pipe the access log to another PHP script.
Apache runs mod_logio, and the access log is in Combined I/O log format, additionally extended with the %D variable (the time taken to serve the request, in microseconds) so I can calculate transfer speeds and spot bottlenecks in the network.
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, and is assigned to one client, so the client is known), counts input/output traffic and increments database fields. For performance reasons I thought about having daily fields and updating them like traffic = traffic + X, creating the row if none was updated.
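A sketch of that piped-log consumer, hooked into Apache with something like CustomLog "|/usr/bin/php /path/to/log_consumer.php" combined_io. The field positions, table name and schema below are assumptions for illustration only:

    <?php
    // log_consumer.php -- Apache writes one access-log line per request to STDIN.
    // Assumed table: traffic_daily(bucket, day, bytes_out, requests), UNIQUE(bucket, day).
    $pdo  = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass',
                    array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));
    $stmt = $pdo->prepare(
        'INSERT INTO traffic_daily (bucket, day, bytes_out, requests)
         VALUES (:bucket, CURDATE(), :bytes, 1)
         ON DUPLICATE KEY UPDATE bytes_out = bytes_out + VALUES(bytes_out),
                                 requests  = requests + 1'
    );

    while (($line = fgets(STDIN)) !== false) {
        // First path segment of the requested URL is the "bucket" (the client).
        if (!preg_match('#"(?:GET|HEAD) /([^/ ]+)/#', $line, $m)) {
            continue;
        }
        $fields = explode(' ', $line);
        // Position of the bytes-sent (%O) field depends on your exact LogFormat.
        $bytes  = (int)$fields[count($fields) - 2];
        $stmt->execute(array('bucket' => $m[1], 'bytes' => $bytes));
    }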
I have to mention that the servers will be low-budget machines with massive storage.
You can have a closer look at the intended setup in this thread on Server Fault.
The key data is that the systems will have gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests). Maybe 500 MB on average or something like that!
The currently planned setup runs on a cheap consumer mainboard (ASUS), 2 GB of DDR3 RAM and an AMD Athlon II X2 220 (2x 2.80 GHz, tray) CPU.
Of course download managers and range requests will be an issue, but I think the average size of an access will be at least around 50 MB or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be read directly from the last request and does not make another round trip to the MySQL server?
Do you think the system with the specs given above can handle this? If not, how could I improve it? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address, so I am not as flexible there. It would have the advantage that the download validation wouldn't need a PHP process to fire, but since that process only runs briefly and doesn't read and output the data itself, I think this is OK. Do you? I once served downloads using lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with it.
What do you think about piping the log to a script that directly updates the database? Should I rather write requests to a job queue and update the database in a second process that can handle delays? Or not do it in real time at all and parse the log files at night? My thought is that I would like it to be as close to real time as possible and not have accumulated data anywhere other than in the central database. I also don't want to keep track of jobs running on all the servers; that could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it, and checks that everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open source alternative to expensive storage services like Amazon S3 that is oriented toward file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would rather reuse an existing solution, preferably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly the thing you want.
I've recently set up a test server running in a virtual machine on my computer so I can do such things as interactive debugging with XDebug. For the most part it's pretty sweet, but I've run into a snag when running multiple requests to the server at once from the same client.
The problem is that the guest-host network connection doesn't really exist as a physical connection, so it runs as fast as the computer hardware allows. This isn't usually a big issue, but I'm trying to implement APC file upload monitoring, and this requires an AJAX request to run in parallel with the file upload to monitor its progress. In the real world, the network would introduce lag and latency and such, leaving enough unused bandwidth for the AJAX request to run in parallel with the file upload. However, on the test machine, the AJAX request can't fetch any data from the server until the upload is finished, as there's absolutely no bandwidth left available to it.
Is it possible to set up some kind of bandwidth management in the virtual machine (in Apache, PHP or some Linux utility) that could limit the bandwidth available per HTTP request? For example, so that each request is limited to 1 Mbps, but several requests can exist between the client and the server at the same time? I'm hoping that if this can be done, it will allow the AJAX request to fetch its data while the upload is progressing instead of being stalled until the upload actually completes.
I tried a utility called IPRelay, but I can't seem to get it to work, or at least not in a way that limits bandwidth per request.
What you're asking for is called Traffic Shaping.
Lighttpd (an alternative to Apache) supports this natively.
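In lighttpd it should be a one-line setting, something like this (the values are illustrative):

    # lighttpd.conf -- throttle each connection to 128 kB/s (~1 Mbit/s)
    connection.kbytes-per-second = 128
    # optional cap for the server as a whole
    server.kbytes-per-second = 1024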
For Apache, there are a few ways of doing it.
mod_bandwidth - a third-party module (which hasn't been updated recently) that appears to do the same thing.
mod_bwshare - a third-party module designed to combat DoS attacks, but it may be helpful.
Here's a ServerFault Question that may be relevant...
Thanks for the reply. However, I found a handy little utility for Linux called iprelay that lets me throttle connections; it seems to let me have multiple connections open, with each connection throttled to the specified limit. That's what I've been using today to test my APC code, and it all seems to be working fine.