Quick writing to log file after http request - php

I recently finished building a web server whose main responsibility is simply to take the body of each HTTP POST request and write it to a log file. The POST data is obfuscated when received, so I deobfuscate it and write it to a log file on the server. After deobfuscation, the content is a series of random key-value pairs that differ between every request. It is not fixed data.
The server is running Linux with 2.6+ kernel. Server is configured to handle heavy traffic (open files limit 32k, etc). The application is written in Python using web.py framework. The http server is Gunicorn behind Nginx.
After using Apache Benchmark to do some load testing, I noticed that it can handle up to about 600-700 requests per second without any log writing issues. Linux natively does a good job of buffering. Problems start to occur when more than this many requests per second attempt to write to the same file at the same moment. Data does not get written and information is lost. I know that the "write directly to a file" design might not have been the right solution from the get-go.
So I'm wondering if anyone can propose a solution that I can implement quickly, without altering too much infrastructure and code, that overcomes this problem?
I have read about in-memory storage like Redis, but I have realized that if data is sitting in memory during a server failure then that data is lost. I have read in the docs that Redis can be configured as a persistent store; there just needs to be enough memory on the server for Redis to do it. This solution would mean that I would have to write a script that dumps the data from Redis (memory) to the log file at a certain interval.
I am wondering if there is even a quicker solution? Any help would be greatly appreciated!

One option I can think of is a separate logging process, so that your web.py application is shielded from the performance issue. This is the classic way of handling a logging module. You can use IPC or any other message-bus infrastructure. With this you address several points:
Logging will not be a huge bottleneck for high-capacity call flows.
A separate module can provide a facility to switch logging on and off.
There would not be any significant increase in process memory usage.
However, you should bear the following points in mind:
You need to be sure that logging is restricted to just logging. It must not be a data store for business processing; otherwise you may have many synchronization problems in your business logic.
The logging process (here I mean an actual Unix process) will become critical and slightly complex (i.e. you may have to handle a form of IPC).
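A minimal Python sketch of the single-writer idea: request handlers only put records on a queue, and one dedicated writer is the only code that touches the log file. To keep the example self-contained the writer runs as a background thread; with several Gunicorn workers you would need a real separate process and some form of IPC in place of the in-process queue. The names `log_writer`, `start_writer` and `stop_writer` are hypothetical.

```python
import queue
import threading

STOP = None  # sentinel that tells the writer to finish


def log_writer(q, path):
    """Drain the queue and append each record to the log file.

    This is the only code that writes to the file, so records never interleave.
    """
    with open(path, "a", encoding="utf-8") as f:
        while True:
            record = q.get()
            if record is STOP:
                break
            f.write(record + "\n")
            f.flush()


def start_writer(path):
    """Start the writer in the background and return (queue, thread)."""
    q = queue.Queue()
    t = threading.Thread(target=log_writer, args=(q, path), daemon=True)
    t.start()
    return q, t


def stop_writer(q, t):
    """Ask the writer to stop and wait until everything queued is written."""
    q.put(STOP)
    t.join()
```

A request handler would then call something like `q.put("key=value ...")` and return immediately. Records still waiting in the queue are lost if the process dies, so this trades some durability for throughput.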
HTH!

Related

Linux server: Would a cache scheme help reduce hits to 3rd-party server?

I have a situation where my Linux server will be running a website which gets some of its data from a 3rd-party server through a SOAP interface. The data isn't exactly real-time, but it does change every 5 minutes or so. I was told not to have our website hammer their website for data, which I can completely understand.
So I wondered if this was a good candidate for a cache scheme of some type. When a user comes to our web page to display the data, if it's less than 5 minutes old (for example), it would get that data from our server instead of polling the 3rd party for it. This way, if 100 users come to our website at once, our server won't be accessing the 3rd-party website 100 times to fetch the same exact data within a given time frame.
Is this a practical thing to do in PHP? Or should this be written in a faster language when it comes to caching? Are there cache packages for this sort of situation which can be used along with a PHP Joomla application? Thanks!
I think memcached is a good choice.
You can set a timeout when you store content in the memcached server; if the key is missing, retrieve the data from the 3rd-party server and store it again.
There is a memcached extension for PHP, check doc here.
There are lots of ways to solve the problem; we can't say which is the right one without knowing a lot more about the constraints you are working in or how the service is used. If you are using Joomla then you're obviously not bothered about performance - it would be really hard to write anything which has a measurable impact on your HTML generation times. This does not need to "be written in a faster language", but....
can you install additional software?
have you got access to cron?
at what rate is the service consumed?
how many webservers do you have consuming the service - do they have a shared filesystem? Are they on the same sub-net?
Is the SOAP response cacheable?
how do you deal with non-availability of the service?
For a very scalable solution I would suggest running a simple forward proxy (e.g. squid) but do make sure that it's not accessible from the internet. Sven (see comment elsewhere) is right about POST sometimes not being cacheable - but you can cache the response from a surrogate script on your own site accessed via GET returning appropriate caching instructions - and this could return the data as a serialized php array / object which is much less expensive to process. Indeed whichever method you choose I would recommend caching the parsed response - not the XML. This also allows you to override poor caching information from the service.
If the rate is less than around 1 per minute then the cron solution is overkill. But if it's more than 20 per minute then it makes a lot of sense. If you don't have access to cron / can't install your own software then you might consider simply caching the response and refreshing the cache on demand. Don't bother with memcache unless you are already using it. APC is faster on a single server - but memcache is distributed. If you have multiple servers then use whatever cluster storage you are currently sharing your data in (distributed filesystem / database cluster / shared filesystem....).
Don't try to use locking / mutexes around the cache refresh unless you really have to (i.e. only if accessing the service more than once every 5 minutes is a mortal sin) - this gets real complicated real quick - it's too easy to introduce bugs.
Do make sure you buffer and validate any responses before writing them to the cache.
Yes, just use HTTP. Most of the heavy lifting has already been built into your web server.
Since SOAP is just a simple HTTP POST request with an XML body, you could set up your website or HTTP API in front of the SOAP endpoint to act like a translator to regular HTTP, attaching the appropriate HTTP caching headers to the transformed response body, and then configure an Nginx reverse proxy in front of it.
Notably: if the transformation is simple you could just use XSLT to transform the response body from the SOAP API and remove the web service layer entirely.
Your problem is a very small one, which does not require a complicated solution.
You could write a small cron job that is executed every five minutes, sends the request to the SOAP server, and stores the result in a local file. If any script needs the data, it reads the local file. This will result in 288 requests to the SOAP server per day, and have excellent performance for any script call that needs the results because they are already on your server.
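A rough Python sketch of what such a cron-driven refresh could look like. It assumes the SOAP call is wrapped in a caller-supplied `fetch` function that returns a parsed dict; `refresh_cache` and `read_cache` are hypothetical names. The result is validated before it is stored and written through a temporary file, so a script reading the cache sees either the old or the new content, never a half-written file.

```python
import json
import os
import tempfile


def refresh_cache(fetch, path):
    """Fetch fresh data, validate it, then atomically replace the cache file."""
    data = fetch()  # assumption: wraps the SOAP call and returns a dict
    if not isinstance(data, dict) or not data:
        raise ValueError("unexpected response, keeping the old cache file")
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        os.replace(tmp, path)  # atomic rename within the same directory
    except BaseException:
        os.unlink(tmp)  # do not leave a stray temporary file behind
        raise


def read_cache(path):
    """Return the cached data; called by any script that needs the result."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The cron job would call `refresh_cache` every five minutes. If the service is down or returns garbage, the exception leaves the previous cache file in place.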
If you do not have cron jobs available and cannot fake them, any other cache will do. You really don't need fancy stuff like Memcached, unless it already is available. Storing the result to a cache file will work as well. Note that if you have to really fetch the SOAP result from the origin, this will take some more time and might affect the perceived performance of your site.
There are plenty of frameworks which also offer cache support, and if you use one you should investigate if there is support included. I'm not sure if Joomla has something appropriate for you. Otherwise, you can implement something yourself. It isn't that hard.
Cache functionality comes in various flavours:
memory-based, where a separate process on the server holds data in RAM (or overflows to disk) and you query it like you would a database; very efficient and powerful, and will have options to manage storage use and clear up after themselves, but requires setting up additional software on the server; e.g. memcached, redis
file-based, where you just write the data to disk; less efficient, but can be implemented in "user-land" code, i.e. pure PHP; beware of filling up your disk with variant caches that have expired but not been cleaned up; many frameworks have an implementation of this built in
database-backed, where you push data into an RDBMS (e.g. MySQL, PostgreSQL) or fully-featured NoSQL store (e.g. MongoDB); might make sense if you have a large amount of data, and can trade a bit of performance; as with files, you need to make sure that stale data is cleaned up
In each case, the basic idea is that you create a "key" that can tell one request from another (e.g. the name of the SOAP call and its input parameters, serialized), and pick a "lifetime" (how long you want to carry on using the same copy of the data). The caching engine or library then checks for a cache with that key, and if it is still within its "lifetime" returns the previously cached data. If there is a "cache miss" (there is no cache for that key, or it has expired), you perform the costly operation (in your case, the SOAP call) and save to the cache, using the same key.
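The key-and-lifetime flow described above can be sketched in a few lines of Python. This is an in-memory illustration only, not any particular library's API: `make_key` and `TTLCache` are hypothetical names, and the `compute` callback stands in for the SOAP call.

```python
import hashlib
import json
import time


def make_key(call_name, params):
    """Build a stable key from the call name and its input parameters."""
    payload = json.dumps([call_name, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    def __init__(self, lifetime, clock=time.monotonic):
        self.lifetime = lifetime  # seconds a cached copy stays usable
        self.clock = clock        # injectable so the expiry logic can be tested
        self.store = {}           # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        entry = self.store.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]       # cache hit, still within its lifetime
        value = compute()         # cache miss: perform the costly operation
        self.store[key] = (now + self.lifetime, value)
        return value
```

With a lifetime of 300 seconds, any number of page views inside the same five-minute window trigger at most one call to the remote service per key.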
You can do more complex things, like pre-caching things in the background so that there is never a cache miss, or having some code paths which accept stale data in order to return quickly, but these can generally be implemented on top of whatever you're using as the main caching solution.
Edit: Another important decision is at what level of granularity to cache the data, in relation to processing it.
At one extreme, you could cache each individual SOAP call: simple to set up, but means re-processing the same data repeatedly, and can cause problems if two responses are related, but cached independently and may get out of sync.
At the other extreme, you can cache whole rendered pages: pages load very fast once cached, but creating variations based on the same data without repeating work becomes tricky.
In between are various points in your code where you have processed and combined data into meaningful chunks: if your application is well-written, these are the input and output of major functions, or possibly even complete model objects; this is more work to implement, as you have to choose the right keys (avoiding two contexts overwriting each other's caches while ignoring variables that have no impact on the data in question) and values (avoiding repeats of costly work without having to store huge blobs of data which will be slow to unserialize and use up the capacity of your cache store).
As with anything else, no approach suits all needs, and a complex application will probably involve caching at multiple levels for different purposes.

Best way to log messages in a mobile-based social app

I'm developing the backend part of a social app. Clients are iOS/Android phones. The backend code is a PHP application that provides a REST API to clients.
I'm using a simple logging system, with several log levels and different log writers. The simplest writer is a FileWriter. All the log messages go to a log file that changes every day. The log files are not going to be used for analytical purposes, at least so far. They just record errors and users' important operations (database access, mainly).
I'm worried because, if the userbase grows quickly, I think that writing to a file is a kind of bottleneck, for 2 reasons:
Disk writing overhead
Concurrency?
About the second point, I have a doubt. I'm sorry if the doubt is stupid: I'm using Apache with the Prefork MPM. Since different clients' requests are handled by different processes, there are no concurrency issues when two processes try to log messages to the same file. The OS (Ubuntu 11.10) handles this. Am I right?
Even in that case when I don't have to worry about concurrency writing to a file, is it a good idea? Isn't it too slow?
Many thanks in advance
As long as you open the file in append mode you are fine. Note that as long as you want persistent log files, they have to go to a file on disk at some point anyways. It makes absolutely no sense at all to use a DBMS, since that's simply another layer on top of the filesystem. As long as you don't open the file with caching disabled, the OS should take care of the I/O scheduling and write stuff off in bunches.
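To illustrate the append-mode point, here is a small Python sketch (the question is about PHP, so treat this as an illustration of the idea rather than the asker's code; `append_line` is a hypothetical helper). Each record is written with a single `write` on a descriptor opened with `O_APPEND`, so the operating system places it at the current end of the file even when several writers are active. This assumes a local filesystem and records of modest size.

```python
import os


def append_line(path, message):
    """Append one record with a single write() on an O_APPEND descriptor."""
    data = (message.rstrip("\n") + "\n").encode("utf-8")
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)  # the whole record goes out in one call
    finally:
        os.close(fd)
```

Building the complete line first and writing it in one call matters: a record assembled from several small writes can be interleaved with records from other processes.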

PHP Threading and high-latency file access (eg; FTP)

This is a bit complicated, so please don't jump to conclusions, feel free to ask about anything that is not clear enough.
Basically, I have a websocket server written in PHP. Please note that websocket messages are asynchronous, that is, a response to a request might take a lot of time, all the while the client keeps on working (if applicable).
Clients are supposed to ask the server for access to files on other servers. This can be an FTP service, or Dropbox, for that matter.
Here, please take note of two issues: connections should be shared and reused and the server actually 'freezes' while it does its work, hence any requests are processed after the server has 'unfrozen'.
Therefore, I thought, why not offload file access (which is what freezes the server) to PHP threads?
The problem here is twofold:
how do I make a connection resource in the main thread (the server) available to the sub threads (not possible with the above threading model)?
what would happen if two threads end up needing the same resource? It's perfectly fine if one is locked until the other one finishes, but we still need to figure out issue #1.
Perhaps my train of thought is all screwed up, if you can find a better solution, I'm eager to hear it out. I've also had the idea of having a PHP thread hosting a connection resource, but it's pretty memory intensive.
PHP supports no threads. The purpose of PHP is to respond to web requests quickly. That's what the architecture was built for. Different libraries try to do something like threads but they usually cause more issues than they solve.
In general there are two ways to achieve what you want:
off-load the long processes to an external process. A common approach is using a system like gearman http://php.net/gearman
Use asynchronous operations. Some stream operations and such provide an "async" flag or "non-blocking" mode. http://php.net/stream-set-blocking

Will I run into load problems with this application stack?

I am designing a file download network.
The ultimate goal is to have an API that lets you directly upload a file to a storage server (no gateway or something). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or an API gives the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything similar at any point. Every drive is just attached to the server as JBOD. All the replication is at application level. If one server breaks down it is just marked as broken in the database and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring / balancing and maybe later billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySql LAMP stack.
A URL that hits the current storage server is generated by the API (that's no problem so far. Just a classic PHP website and MySQL database)
Now it gets interesting...
The Storage server runs Apache2 and a PHP script catches the request. URL parameters (secure token hash) are validated. IP, Timestamp and filename are validated so the request is authorized. (No database connection required, just a PHP script that knows a secret token).
The PHP script sets the file header to use Apache2 mod_xsendfile
Apache delivers the file passed by mod_xsendfile and is configured to have the access log piped to another PHP script
Apache runs mod_logio and the access log is in Combined I/O log format, additionally extended with the %D variable (the time taken to serve the request, in microseconds) to calculate the transfer speed and spot bottlenecks in the network.
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, that is assigned to one client, so the client is known), counts input/output traffic and increases database fields. For performance reasons I thought about having daily fields and updating them like traffic = traffic+X, and if no row has been updated, creating it.
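The "update, and create the row if nothing was updated" step could look roughly like this. The sketch uses Python with an in-memory SQLite database purely so that it runs on its own; the table and column names (`traffic`, `bucket`, `day`, `bytes`) are assumptions, and the setup in the question would use PHP and MySQL instead.

```python
import sqlite3


def add_traffic(conn, bucket, day, nbytes):
    """Increment the daily counter, creating the row on first use."""
    cur = conn.execute(
        "UPDATE traffic SET bytes = bytes + ? WHERE bucket = ? AND day = ?",
        (nbytes, bucket, day),
    )
    if cur.rowcount == 0:  # no row for this bucket and day yet
        conn.execute(
            "INSERT INTO traffic (bucket, day, bytes) VALUES (?, ?, ?)",
            (bucket, day, nbytes),
        )
    conn.commit()


# Hypothetical schema: one row per bucket per day.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE traffic (bucket TEXT, day TEXT, bytes INTEGER, "
    "PRIMARY KEY (bucket, day))"
)
```

With more than one writer the update-then-insert pair can race, so on MySQL a single `INSERT ... ON DUPLICATE KEY UPDATE` statement is probably the safer way to express the same thing.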
I have to mention that the servers will be low-budget servers with massive storage.
You can have a close look at the intended setup in this thread on serverfault.
The key data is that the systems will have Gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests). Maybe 500MB on average or something!
The currently planned setup runs on a cheap consumer mainboard (Asus), 2 GB DDR3 RAM and an AMD Athlon II X2 220, 2x 2.80GHz tray CPU.
Of course download managers and range requests will be an issue, but I think the average size of an access will be at least around 50 megs or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be directly read from the last request and does not do another request to the mysql server?
Do you think the system with the specs given above can handle this? If not, how could I improve it? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using Lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address and I am not so flexible. It would have the advantage that the download validation would not need a PHP process to fire. But as it only runs briefly and doesn't read and output the data itself, I think this is ok. Do you? I once did downloads using lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that. But
What do you think about the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update them in the database in a second process that can handle delays? Or not do it at all but parse the log files at night? My thought is that I would like to have it as real time as possible and not have accumulated data somewhere other than in the central database. I also don't want to keep track of jobs running on all the servers. This could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open source alternative to expensive storage services such as Amazon S3 that is oriented towards file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would reuse an existing solution, preferably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly the thing that you want.

How to stream the contents of a file live to a browser

I'm trying to find an efficient way to watch the server log on a webpage. I don't mind building an app, I just can't work out the best way to do it.
Is there a way to keep a stream open to a file with PHP and to the browser? Or will it have to be done by polling the file every x seconds?
Thanks in advance,
Shadi
The best solution is definitely AJAX in some capacity. The only way to have the server "push" to you the way you describe (maintain an open stream) would require the HTTP connection to remain open which would ultimately trigger timeouts and consume a lot of resources. I would look into the Cometd library. The downside to this is that I believe it depends on Java although the site does mention perl, python and "other languages." In the worst case, you could use a specific jetty implementation just for log monitoring on a specific port. Regardless, that framework would most likely be your best bet.
Any web-based chat mechanism essentially uses a push architecture and would be good to look at for some inspiration. In this case, instead of users creating messages that are fired to other users, the server creates the events (when a log message is generated). Check out this article on Facebook chat for some insight into how they do it. Google chat might be worth looking into if you can find some stuff on the architecture.
For the actual logging, I'm not sure if you are in need of help for that, but log4php which is currently under incubation might be a good place to start as it provides you with a configuration that can simultaneously log to an arbitrary number of "loggers" like database, file, socket, etc. You could likely find one that would allow you to tie it into whatever push framework you elect to use.
Good luck!
Remember that the web model is essentially stateless (disconnected). With that in mind, when a client submits a request, the server processes the request and then sends a response accordingly. You can keep track of the client's actions using cookies and/or sessions, but the resources reserved for a request are released after the response is sent back.
I think that the best way to meet your goal is to develop a web service that checks the status of the log and fetches the diff (if any). Your app may consist of a web page with a div that displays the diff from the web service.
A script with a timer will trigger the call to the web service.
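The "fetch the diff" part of such a web service might be sketched like this in Python. The idea is that the client remembers how far into the file it has read and sends that offset with each poll; `read_new` is a hypothetical helper, and the HTTP layer around it is left out.

```python
import os


def read_new(path, offset):
    """Return (text appended since offset, new offset to send next time)."""
    size = os.path.getsize(path)
    if size < offset:  # file was truncated or rotated: start from the top
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data.decode("utf-8", errors="replace"), offset + len(data)
```

The timer script in the page would send the last offset it received, append the returned text to the div, and store the new offset for the next call.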
I will try to do something like this in a few weeks, and I will post the entire solution on moropo blog (spanish). You can ask for a post translation using the comments.
The best way to do it is to use AJAX to pull the file content every x seconds, giving the illusion of real time.
If you do want real time, you can use an XMPP server, but from what I can see, the first solution is more than sufficient and doesn't require a lot of work.
Try wonlog.
https://www.npmjs.com/package/wonlog
You can stream multiple log files to a web browser.
