I'm developing the backend part of a social app. Clients are iOS/Android phones. The backend code is a PHP application that provides a REST API to clients.
I'm using a simple logging system with several log levels and different log writers. The simplest writer is a FileWriter: all log messages go to a log file that changes every day. The log files are not going to be used for analytical purposes, at least so far. They just record errors and important user operations (mainly database access).
I'm worried because, if the userbase grows quickly, I think that writing to a file is a kind of bottleneck, for 2 reasons:
Disk writing overhead
Concurrency?
About the second point, I have a doubt, and I'm sorry if it is a stupid one. I'm using Apache with the Prefork MPM. Since different clients' requests are handled by different processes, I assume there are no concurrency issues when two processes try to log messages to the same file, because the OS (Ubuntu 11.10) handles this. Am I right?
Even if I don't have to worry about concurrent writes to the file, is it a good idea? Isn't it too slow?
Many thanks in advance
As long as you open the file in append mode you are fine. Note that if you want persistent log files, they have to go to a file on disk at some point anyway. It makes no sense to use a DBMS for this, since that is simply another layer on top of the filesystem. As long as you don't open the file with caching disabled, the OS should take care of the I/O scheduling and write the data out in batches.
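As a rough illustration of the append-mode advice, here is a minimal sketch in Python (the question's backend is PHP, so this is a language-neutral stand-in; the function name and log line format are assumptions). It relies on the POSIX O_APPEND flag, which makes the kernel position each write at the current end of the file, and it emits each log line in a single write call. Very long lines or network filesystems may behave differently.

```python
import os
import time


def append_log(path: str, level: str, message: str) -> None:
    """Append one complete log line using a single write call."""
    line = f"{time.strftime('%Y-%m-%d %H:%M:%S')} [{level}] {message}\n"
    # O_APPEND: every write lands at the current end of file, so separate
    # processes appending to the same file do not overwrite each other.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, line.encode("utf-8"))
    finally:
        os.close(fd)
```

Building the whole line first and writing it once keeps each entry in one piece; the OS page cache then decides when the data actually reaches the disk.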
Related
I recently finished building a web server whose main responsibility is simply to take the body data of each HTTP POST request and write it to a log file. The POST data is obfuscated when received, so I'm deobfuscating it and writing it to a log file on the server. After deobfuscation, the contents are a series of random key-value pairs that differ between requests. It is not fixed data.
The server is running Linux with 2.6+ kernel. Server is configured to handle heavy traffic (open files limit 32k, etc). The application is written in Python using web.py framework. The http server is Gunicorn behind Nginx.
After using Apache Benchmark to do some load testing, I noticed that it can handle up to about 600-700 requests per second without any log writing issues. Linux natively does a good job at buffering. Problems start to occur when more than this many requests per second attempt to write to the same file at the same moment. Data does not get written and information is lost. I know that the "write directly to a file" design might not have been the right solution from the get-go.
So I'm wondering whether anyone can propose a solution that I can implement quickly, without altering too much infrastructure and code, to overcome this problem.
I have read about in memory storage like Redis, but I have realized that if data is sitting in memory during server failure then that data is lost. I have read in the docs that redis can be configured as a persistent store, there just needs to be enough memory on the server for Redis to do it. This solution would mean that I would have to write a script that would dump the data from Redis (memory) to the Log file at a certain interval.
I am wondering if there is even a quicker solution? Any help would be greatly appreciated!
One option I can think of is a separate logging process, so that your web.py application is shielded from logging performance issues. This is the classic way of handling a logging module. You can use IPC or any other bus communication infrastructure. With this you will be able to address the following issues:
Logging will not be a huge bottleneck for high-capacity call flows.
A separate module can ensure/provide switch off/on facility.
As such there would not be any huge/significant process memory usage.
However, you should bear the following points in mind:
You need to be sure that logging is restricted to just logging. It must not be a data store for business processing. Otherwise you may have many synchronization problems in your business logic.
The logging process (here I mean an actual Unix process) will become critical and slightly complex (i.e. you may have to handle some form of IPC).
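The decoupling idea can be sketched with Python's standard-library QueueHandler and QueueListener. Note the substitution: here the writer is a listener thread inside the same process, used as a simple stand-in for the separate Unix process described above; a real multi-worker setup (for example several Gunicorn workers) would replace the in-memory queue with an IPC channel. The logger name and line format are assumptions.

```python
import logging
import logging.handlers
import queue


def start_queued_logging(path: str):
    """Return (logger, listener): callers only enqueue, one listener writes."""
    log_queue = queue.Queue(-1)  # unbounded queue between request code and writer

    # The file handler is owned by the listener; request code never touches it.
    file_handler = logging.FileHandler(path, mode="a", encoding="utf-8")
    file_handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger("queued_example")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers = [logging.handlers.QueueHandler(log_queue)]
    logger.propagate = False
    return logger, listener
```

Request handlers call logger.info(...) and return immediately; calling listener.stop() at shutdown drains the queue before the writer thread exits.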
HTH!
Let's say you're writing a PHP application that will be hosted in a load-balanced/multi-server setup. What are the things you need to know in order to ensure smooth operation? Right now the only thing I think will be an issue is PHP sessions (i.e., you must use a custom database handler for them). Anything else?
Let's turn this into an answer:
In my experience, the overwhelming majority of PHP applications are constrained not only by PHP horsepower on the web server, but at least as much by the backing store, i.e. database and/or files.
So load balancing a PHP application without careful analysis has the potential to make things worse: it hits the weakest link in the chain with more and more load.
So the first, and IMHO most important, "thing to know when writing a web app hosted in a load-balanced server" is the load pattern and its potential for balancing. If your app performs badly and you load-balance it across more servers, only to find out that you now have more servers waiting for the DB, you are in trouble.
Here is an off-the-cuff checklist; please regard it as a brainstorm (or a brainfart) only:
First: Are you really CPU-bound?
Which pages are hit most (see your log)
For the top N of these (with a suitable N) check the processing pattern: Where do the CPU cycles go?
What would be the side effects of making sessions, uploads, file storage (add whatever you use) shared and would it be offset by the load balancing?
Comments welcome, I am very sure to have not even scratched the surface!
Edit
Just thought of something that bit me once in this context: resource locking. Brace yourself for a higher degree of concurrency if you go multi-server.
File uploads/downloads could also be an issue: you would probably need them to be visible to all servers.
I am building a web-application and have a couple of quick questions. From what I learnt, one should not worry about scalability when initially building the app and should only start worrying when the traffic increases. However, this being my first web-application, I am not quite sure if I should take an approach where I design things in an ad-hoc manner and later "fix" them. I have been reading stories about how people start off with an app that gets millions of users in a week or two. Not that I will face the same situation but I can't help but wonder, how do these people do it?
Currently, I bought a shared hosting account on Lunarpages and that got me started in building and testing the application. However, I am interested in learning how to build the same application in a scalable-manner using the cloud, for instance, Amazon's EC2. From my understanding, I can see a couple of components:
There is a load balancer that first receives requests and then decides where to route each request
This request is then handled by a server replica that then processes the request and updates (if required) the database and sends back the response to the client
If a similar request comes in, then a caching mechanism like memcached kicks into picture and returns objects from the cache
A blackbox that handles database replication
Specifically, I am trying to do the following:
Setting up a load balancer (my homework revealed that HAProxy is one such load balancer)
Setting up replication so that databases can be synchronized
Using memcached
Configuring Apache to work with multiple web servers
Partitioning application to use Amazon EC2 and Amazon S3 (my application is something that will need great deal of storage)
Finally, how can I avoid burning myself when using Amazon services? Because this is just a learning phase, I can probably do with 2-3 servers with a simple load balancer and replication, but I want to avoid paying loads of money accidentally.
I am able to find resources on individual topics but am unable to find something that starts off from the big picture. Can someone please help me get started?
Personally, I think you should be considering how your app will scale initially - as otherwise you'll run into problems down the line.
I'm not saying you need to build it initially as a multi-server system, but if you think you'll need to do it later, be mindful of the concerns now.
In my experience, this includes things like:
Sessions. Unless you use 'sticky' load balancing, you will have to have some way of sharing session state between servers. This probably means storing session data on either shared storage, or in a DB.
File uploads and replication. If you allow users to upload files, or you have a CMS that allows you to upload images/documents, it needs to cater for the fact that these files will also need to find their way onto other nodes in your cluster. However, if you've gone down the shared storage route mentioned above, this should cover it.
DB scalability. If you're using traditional DB servers, you might want to think about how you'll implement scalability at that level. This may mean coding your app so you use one connection string for reads and another for writes. Then you are free to implement replication, with one master node handling the inserts/updates and cascading the changes to read-only nodes that handle the bulk of the work.
Middleware. You might even want to go down the route of implementing some kind of message-oriented middleware solution to completely hand off business logic functions. This will give you a great level of flexibility in how you scale the business logic layer in the future, although initially it will be a lot of complication and work for not a great deal of payoff.
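The "one connection string for reads, another for writes" point from the list above can be sketched as a small router. This is a minimal illustration in Python rather than a definitive implementation: the class name, the connection strings, and the rule "only SELECT goes to a replica" are all assumptions.

```python
import itertools


class ConnectionRouter:
    """Send SELECTs to read replicas in rotation, everything else to the primary."""

    def __init__(self, primary_dsn, replica_dsns):
        self.primary_dsn = primary_dsn
        # With no replicas configured, reads fall back to the primary.
        self._replicas = itertools.cycle(list(replica_dsns) or [primary_dsn])

    def dsn_for(self, sql: str) -> str:
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        if verb == "SELECT":
            return next(self._replicas)
        # INSERT/UPDATE/DELETE, DDL, and anything unrecognised go to the primary.
        return self.primary_dsn
```

If the application asks the router for a connection string from day one, replicas can be added later without touching query code. One caveat: with asynchronous replication, a read issued right after a write may not yet see that write on a replica.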
Have you considered playing around with VMs first? You can run 2-3 VMs on your local machine and set them up like you would actual servers, they just won't be able to handle real traffic levels. If all you're looking for is the learning experience, it might be an ideal way to go about it.
I am designing a file download network.
The ultimate goal is to have an API that lets you directly upload a file to a storage server (no gateway or something). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or an API returns the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything like it at any point. Every drive is just attached to the server as JBOD. All the replication is at application level. If one server breaks down, it is just marked as broken in the database and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring/balancing and maybe later billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySql LAMP stack.
A URL that points to the current storage server is generated by the API (that's no problem so far; it's just a classic PHP website and MySQL database).
Now it gets interesting...
The Storage server runs Apache2 and a PHP script catches the request. URL parameters (secure token hash) are validated. IP, Timestamp and filename are validated so the request is authorized. (No database connection required, just a PHP script that knows a secret token).
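A database-free token check of this kind is commonly built on an HMAC over the request parameters. The following is a minimal sketch in Python (the setup described here uses PHP); the secret value, the field order, and the "|" separator are assumptions, not the author's actual scheme.

```python
import hashlib
import hmac
import time

SECRET = b"change-me"  # hypothetical secret shared by the API and storage servers


def sign(filename: str, client_ip: str, expires: int, secret: bytes = SECRET) -> str:
    """API side: produce the token that is embedded in the download URL."""
    message = f"{filename}|{client_ip}|{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_authorized(filename, client_ip, expires, token, now=None, secret=SECRET) -> bool:
    """Storage side: recompute the token locally; no database lookup is needed."""
    now = time.time() if now is None else now
    if now > expires:
        return False  # link has expired
    expected = sign(filename, client_ip, expires, secret)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, token)
```

Because the IP, expiry timestamp, and filename are all part of the signed message, changing any of them in the URL invalidates the token.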
The PHP script sets the file header to use Apache2 mod_xsendfile.
Apache delivers the file passed by mod_xsendfile and is configured to have the access log piped to another PHP script
Apache runs mod_logio, and the access log is in Combined I/O log format, additionally extended with the %D variable (the time taken to serve the request, in microseconds) to calculate the transfer speed, spot bottlenecks in the network, and so on.
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, that is assigned to one client, so the client is known), counts input/output traffic, and increments database fields. For performance reasons I thought about having daily fields and updating them like traffic = traffic+X, creating the row if no row has been updated.
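The "update, and insert if no row was updated" counter can be sketched as follows. To keep the example self-contained it uses Python's built-in sqlite3 module instead of PHP and MySQL, and the table and column names are assumptions. The affected-row count comes back with the result of the UPDATE itself, so checking it does not need a second query.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """One row per bucket per day (hypothetical table layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS traffic ("
        " bucket TEXT, day TEXT, bytes_out INTEGER,"
        " PRIMARY KEY (bucket, day))"
    )


def bucket_from_path(path: str) -> str:
    """The first path segment identifies the client's bucket."""
    return path.lstrip("/").split("/", 1)[0]


def add_traffic(conn: sqlite3.Connection, bucket: str, day: str, bytes_out: int) -> None:
    cur = conn.execute(
        "UPDATE traffic SET bytes_out = bytes_out + ? WHERE bucket = ? AND day = ?",
        (bytes_out, bucket, day),
    )
    if cur.rowcount == 0:
        # No row for this bucket and day yet: create it.
        conn.execute(
            "INSERT INTO traffic (bucket, day, bytes_out) VALUES (?, ?, ?)",
            (bucket, day, bytes_out),
        )
    conn.commit()
```

If several writers can hit the same bucket and day at once, two of them may both see zero affected rows and both try to insert; the primary key then rejects the second insert. In MySQL, a single INSERT ... ON DUPLICATE KEY UPDATE statement avoids that race.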
I have to mention that the servers will be low-budget servers with massive storage.
You can have a closer look at the intended setup in this thread on Server Fault.
The key data is that the systems will have Gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests). Maybe 500MB on average or something!
The currently planned setup runs on a cheap consumer mainboard (Asus), 2 GB DDR3 RAM and an AMD Athlon II X2 220, 2x 2.80GHz tray CPU.
Of course download managers and range requests will be an issue, but I think the average size of an access will be around at least 50 megs or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be directly read from the last request and does not do another request to the mysql server?
Do you think the system with the specs given above can handle this? If not, how could I improve it? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using Lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address, and I would not be as flexible. It would have the advantage that the download validation would not need a PHP process to fire. But as the PHP process only runs briefly and doesn't read and output the data itself, I think this is OK. Do you? I once served downloads using lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that. But
What do you think about the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update them in the database in a second process that can handle delays? Or not do it at all, and parse the log files at night instead? My thought is that I would like to have it as real-time as possible and not have accumulated data anywhere other than in the central database. I also don't want to keep track of jobs running on all the servers. This could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it, and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open source alternative, oriented towards file downloads, to expensive storage services such as Amazon S3.
I really searched a lot but didn't find anything like this out there. Of course I would reuse an existing solution, preferably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/, is almost exactly the thing that you want.
In ASP.NET, I grew to love the Application and Cache stores. They're awesome. For the uninitiated, you can just throw your data-logic objects into them, and hey presto, you only need to query the database once for a bit of data.
By far one of the best ASP.NET features, IMO.
I've since ditched Windows for Linux, and therefore PHP, Python and Ruby for webdev. I use PHP most because I dev several open source projects, all using PHP.
Needless to say, I've explored what PHP has to offer in terms of caching data-objects. So far I've played with:
Serializing to file (a pretty slow/expensive process)
Writing the data to file as JSON/XML/plaintext/etc (even slower for read ops)
Writing the data to file as pure PHP (the fastest read, but quite a convoluted write op)
I should stress now that I'm looking for a solution that doesn't rely on a third party app (eg memcached) as the apps are installed in all sorts of scenarios, most of which don't have install rights (eg: a cheap shared hosting account).
So back to what I'm doing now: is persisting to file secure? Rule 1 in production server security has always been to disable file writing, but I really don't see any way PHP could cache if it couldn't write. Are there any tips and/or tricks to boost the security?
Is there another persist-to-file method that I'm forgetting?
Are there any better methods of caching in "limited" environments?
Serializing is quite safe and commonly used. There is an alternative however, and that is to cache to memory. Check out memcached and APC, they're both free and highly performant. This article on different caching techniques in PHP might also be of interest.
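For the serialize-to-file approach, one common pattern is to write the serialized object to a temporary file and then rename it into place, so a concurrent reader never sees a half-written cache entry. Below is a minimal sketch in Python using JSON (the PHP equivalent would use serialize()/unserialize()); the function names and the expiry field are assumptions, and keys are assumed to be safe to use as file names.

```python
import json
import os
import tempfile
import time


def cache_put(cache_dir: str, key: str, value, ttl_seconds: float) -> None:
    """Serialize value to a temp file, then atomically rename it into place."""
    payload = {"expires": time.time() + ttl_seconds, "value": value}
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)  # same directory, same filesystem
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(payload, tmp)
    os.replace(tmp_path, os.path.join(cache_dir, key + ".json"))


def cache_get(cache_dir: str, key: str, now=None):
    """Return the cached value, or None if it is missing or expired."""
    try:
        with open(os.path.join(cache_dir, key + ".json"), encoding="utf-8") as fh:
            payload = json.load(fh)
    except FileNotFoundError:
        return None
    now = time.time() if now is None else now
    return payload["value"] if now < payload["expires"] else None
```

A caller would try cache_get first and only run the expensive query, followed by cache_put, when it returns None.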
Re: Is there another persist-to-file method that I'm forgetting?
It's of limited utility but if you have a particularly beefy database query you could write the serialized object back out to an indexed database table. You'd still have the overhead of a database query, but it would be a simple select as opposed to the beefy query.
Re: Is persisting to file secure? (and cheap shared hosting accounts)
The sad fact is cheap shared hosting isn't secure. How much do you trust the 100, 500, or 1000 other people who have access to your server? For historic and (ironically) security reasons, shared hosting environments have PHP/Apache running as an unprivileged user (with PHP running as an Apache module). The security rationale here is that if the world-facing Apache process gets compromised, the exploiters only have access to an unprivileged account that can't screw with important system files.
The bad part is, that means whenever you write to a file using PHP, the owner of that file is the same unprivileged Apache user. This is true for every user on the system, which means anyone has read and write access to the files. The theoretical hackers in the above scenario would also have access to the files.
There's also a persistent bad practice in PHP of giving directories and files permissions of 777 to enable the unprivileged Apache user to write files out, and then leaving the directory or file in that state. That gives anyone on the system read/write access.
Finally, you may think obscurity saves you. "There's no way they can know where my secret cache files are", but you'd be wrong. Shared hosting sets up users in the same group, and most default file masks will give your group users read permission on files you create. SSH into your shared hosting account sometime, navigate up a directory, and you can usually start browsing through other users' files on the system. This can be used to sniff out writable files.
The solutions aren't pretty. Some hosts will offer a CGI Wrapper that lets you run PHP as a CGI. The benefit here is PHP will run as the owner of the script, which means it will run as you instead of the unprivileged user. Problem averted! New Problem! Traditional CGI is slow as molasses in February.
There is FastCGI, but FastCGI is finicky and requires constant tuning. Not many shared hosts offer it. If you find one that does, chances are they'll have APC enabled, and may even be able to provide a mechanism for memcached.
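Once scripts run as your own user, as with the CGI wrapper or FastCGI setups described above, owner-only permissions become a meaningful alternative to 777. The following minimal sketch, written in Python for POSIX systems, shows the idea (PHP offers mkdir() and chmod() for the same purpose); the function names are assumptions. It does not help when every site runs as the same Apache user.

```python
import os
import stat


def make_private_cache_dir(path: str) -> None:
    """Create the directory so that only the owning user can enter or list it."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    os.chmod(path, 0o700)  # enforce the mode even if the directory already existed


def write_private_file(path: str, data: bytes) -> None:
    """Create or overwrite a file readable and writable by the owner only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as fh:
        os.fchmod(fh.fileno(), 0o600)  # covers files created earlier with looser modes
        fh.write(data)


def permission_bits(path: str) -> int:
    """Return just the permission bits, e.g. 0o600."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

Setting the mode explicitly, instead of relying on the default file mask, is what keeps other users in the same group from reading the cache.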
I had a similar problem, and thus wrote a solution: a memory cache written in PHP. It only requires the PHP build to support sockets. Other than that, it is a pure PHP solution and should run just fine on shared hosting.
http://code.google.com/p/php-object-cache/
What I always do if I have to be able to write is to ensure I'm not writing anywhere I have PHP code. Typically my directory structure looks something like this (it's varied between projects, but this is the general idea):
project/
    app/
    html/
        index.php
    data/
    cache/
app is not writable by the web server (neither is index.php, preferably). cache is writable and used for caching things such as parsed templates and objects. data is possibly writable, depending on need. That is, if the users upload data, it goes into data.
The web server gets pointed to project/html and whatever method is convenient is used to set up index.php as the script to run for every page in the project. You can use mod_rewrite in Apache, or content negotiation (my preference but often not possible), or whatever other method you like.
All your real code lives in app, which is not directly accessible by the web server, but should be added to the PHP path.
This has worked quite well for me for several projects. I've even been able to get, for instance, Wikimedia to work with a modified version of this structure.
Oh... and I'd use serialize()/unserialize() to do the caching, although generating PHP code has a certain appeal. All the templating engines I know of generate PHP code to execute, making post-parse very fast.
If you have access to a database query cache (e.g. MySQL's), you could go with serializing your objects and storing them in the DB. The database will take care of holding the query results in memory, so that should be pretty fast.
You don't spell out -why- you're trying to cache objects. Are you trying to speed up a slow database query, work around expensive object instantiation, avoid repeated generation of complex page, maintain application state or are you just compulsively storing away objects in case of a long winter?
The best solution, given the atrocious limitations of most low-cost shared hosting, is going to depend on what you're trying to accomplish. Going for bottom-of-the-barrel shared hosting means you have to accept that you won't be working with the best tools. The numbers are hard to quantify, but there's a trade-off between hosting costs, site performance, and developer time (i.e. fast, cheap, or easy).
It's in theory possible to store objects in sessions. That might get you past the problem of file writing being disabled. Additionally, you could store the session in a MySQL memory-backed table to speed up the query.
Some hosting places may have APC compiled in. That would allow you to store the objects in memory.