I'm working on a PHP/CodeIgniter web app that will be the backend for a non-realtime game. We want the ability to record game activity for later analysis. In my performance tests using either codeigniter's own logging system or log4php, file logging seems slow, reducing the number of requests per second the server can handle by 50%. I've tried it on both a WAMP machine and an Apache/Ubuntu server. If I change logging to use MongoDB, the performance only drops by a few percent, even if I'm logging the same amount of information.
Is file logging going to be inherently slow for php scripts because they are all waiting on locks on the same file or is it likely a configuration issue?
You can try logging in RAM drive file(s).
Also consider naming of the logs by a date stamp like YYYY-mm-dd-HH.log so you can periodically take older logs and process (archive) them and the drive will stay clean.
Related
I have created a PHP+MYSQL web app and I am trying to implement now a logging system to store and track some actions of each user.
The purpose of this is the following: track the activity of each user's session by logging the IP+time+action, then see which pages he accessed later on by logging time+pagename; for each user there will be a file in the format: log{userid}_{month}.log
Each log will then be viewed only by the website owner, through a custom admin panel, and the data will be used only for security purposes (as in to show to the user if he logged in from a different IP or if someone else logged in from a different IP and to see which areas of the website the user accessed during his login session).
Currently, I have a MYSQL MyISAM table where I store the userid,IP,time,action and the app is still not launched, but we intend to have very many users (over 100k), and using a database for this solutions feels like suicide.
So what do you suggest? How should the logging be done? Using files, using a table in the current database, using a separate database? Are there any file-logging frameworks available for PHP?
How should the reading of the file be done then? Read the results by row?
Thank you
You have many options, so I'll speak from my experience running a startup with about 500k users, 100k active every month, which seems to be in your range.
We logged user actions in a MySQL database.
Querying your data is very easy and fast (provided good indexes)
We ran on Azure, and had a dedicated MySQL (with slaves, etc) for storing all user data, including logs. Space was not an issue.
Logging to MySQL can be slow, depending on everything you are logging, so we just pushed a log to Redis and had a Python app read it from Redis and insert into MySQL in the background. This made that logging basically had no impact on loading times.
We decided to log in MySQL for user actions because:
We wanted to run queries on anything at any time without much effort. The structured format of the user action logs made that incredibly easy to do.
It also allows you to display certain logs to users, if you would require it.
When we introduced badges, we had no need to parse text logs to award badges to those who performed a specific action X number of times. We simply wrote a query against the user action logs, and the badges were awarded. So adding features based on actions was easy as well.
We did use file logging for a couple of application logs - or things we did not query on a daily basis - such as the Python app writing to the database, Webserver access and error logs, etc.
We used Logstash to process those logs. It can simply hook into a log file and stream it to your Logstash server. Logstash can also query your logs, which is pretty cool.
Advanced uses
We used Slack for team communications and integrated the Python database writing app with it, this allowed us to send critical errors to a channel (via their API) where someone could action a fix immediately.
Closing
My suggestion would be to not over think it for now, log to MySQL, query and see the stats. Make updates, rinse and repeat. You want to keep the cycle between deploy and update quick, so making decisions from a quick SQL query makes it easy.
Basically what you want to avoid is logging into a server, finding a log and grep your way through it to find something, the above achieved that.
This is what we did, it is still running like that and we have no plans to change it soon. We haven't had any issues where we could not find anything that we needed. If there is a massive burst of users and we scale to 1mil monthly active users, then we might change it.
Please note: whichever way you decide to log, if you are saving the POST data, be sure to never do that for credit card info, unless you are compliant. Or rather use Stripe's JavaScript libraries.
If you are sure that reading the log will mainly target one user at a time, you should consider partioning your log table:
http://dev.mysql.com/doc/refman/5.1/en/partitioning-range.html
using your user_id as partitioning key.
Maximum number of partitions being 1024, you will have one partition storing 1/1000 of your 100k users, which is something reasonable.
Are there any file-logging frameworks available for PHP?
There is this which is available on packagist: https://packagist.org/packages/psr/log
Note that it's not a file logging framework but an API for a logger based on the PSR-3 standard from FIG. So, if you like, it's the "standard" logger interface for PHP. You can build a logger that implements this interface or search around on packagist for other loggers that implement that interface (either file or MySQL based). There are a few other loggers on packagist (teacup, forestry) but it would be preferable to use one that sticks to the PSR standard.
We do logging with the great tool Graylog.
It scales as high as you want it, has great tools on data visualization, is incredibly fast even for complex querys and huge datasets, and the underlying search-enginge (elasticsearch) is schemaless. The latter may be an advantage as you get more possibilities on extending your logs without the hassle mysql-schemas can give you.
Graylog, elasticsearch and mongodb (which is used as to save the configuration of graylog and its webinterface) are easily deployable via tools like puppet, chef and the like.
Actually logging to graylog is easy with the already mentioned php-lib monolog.
Of curse the great disadvantage here is that you have to learn a bunch of new tools and softwares. But it is worth it in my opinion.
The crux of the matter is the data you are writing is not going to be changed. In my experience in this scenario I would use either:
MySQL with a blackhole storage engine. Set it up right and its blisteringly fast!
Riak Cluster (NoSQL solution) - though this may be a learning curve for you it might be one you may need to eventually take anyway.
Use SysLog ;)
Set it up on another server and it can log all of your processes seperately (such as networking, servers, sql, apache, and your php).
It can be usefull for you and decreasing the time spend of debugging. :)
I am currently using an AWS micro instance as a web server for a website that allows users to upload photos. Two questions:
1) When looking at my CloudWatch metrics, I have recently noticed CPU spikes, the website receives very little traffic at the moment, but becomes utterly unusable during these spikes. These spikes can last several hours and resetting the server does not eliminate the spikes.
2) Although seemingly unrelated, whenever I post a link of my website on Twitter, the server crashes (i.e.,Error Establishing a Database Connection). Once restarting Apache and MySQL, the website returns to normal functionality.
My only guess would be that the issue is somehow the result of deficiencies with the micro instance. Unfortunately, when I upgraded to the small instance, the site was actually slower due to fact that the micro instances can have two EC2 compute units.
Any suggestions?
If you want to stay in the free tier of AWS (micro instance), you should off load as much as possible away from your EC2 instance.
I would suggest you to upload the images directly to S3 instead of going through your web server (see some example for it here: http://aws.amazon.com/articles/1434).
S3 can also be used to serve most of your web pages (images, js, css...), instead of your weak web server. You can also add these files in S3 as origin to Amazon CloudFront (CDN) distribution to improve your application performance.
Another service that can help you in off loading the work is SQS (Simple Queue Service). Instead of working with online requests from users, you can send some requests (upload done, for example) as a message to SQS and have your reader process these messages on its own pace. This is good way to handel momentary load cause by several users working simultaneously with your service.
Another service is DynamoDB (managed NoSQL DB service). You can put on dynamoDB most of your current MySQL data and queries. Amazon DynamoDB also has a free tier that you can enjoy.
With the combination of the above, you can have your micro instance handling the few remaining dynamic pages until you need to scale your service with your growing success.
Wait… I'm sorry. Did you say you were running both Apache and MySQL Server on a micro instance?
First of all, that's never a good idea. Secondly, as documented, micros have low I/O and can only burst to 2 ECUs.
If you want to continue using a resource-constrained micro instance, you need to (a) put MySQL somewhere else, and (b) use something like Nginx instead of Apache as it requires far fewer resources to run. Otherwise, you should seriously consider sizing up to something larger.
I had the same issue: As far as I understand the problem is that AWS will slow you down when you reach a predefined usage. This means that they allow for a small burst but after that things will become horribly slow.
You can test that by logging in and doing something. If you use the CPU for a couple of seconds then the whole box will become extremely slow. After that you'll have to wait without doing anything at all to get things back to "normal".
That was the main reason I went for VPS instead of AWS.
I'm developing the backend part of a social app. Clients are iOS/Android phones. The backend code is a PHP application that provides a REST API to clients.
I'm using a simple logging system, with several log levels and different log writers. The simpler writer is a FileWriter. All the log messages go to a log file that changes every day. The log files are not going to be used for analytical purposes, at least so far. Just record errors and user's important operations (database access, mainly)
I'm worried because, if the userbase grows quickly, I think that writing to a file is a kind of bottleneck, for 2 reasons:
Disk writing overhead
¿Concurrency?
About the second point, I have a doubt. I'm sorry if the doubt is stupid: I'm using Apache with Prefork MPM. As far as different client's requests are handled using different processes, there're no concurrecy issues when two processes are trying to log messages to the same file. The OS (Ubuntu 11.10) handles this. Am I right?
Even in that case when I don't have to worry about concurrency writing to a file, is it a good idea? Isn't it too slow?
Many thanks in advance
As long as you open the file in append mode you are fine. Note that as long as you want persistent log files, they have to go to a file on disk at some point anyways. It makes absolutely no sense at all to use a DBMS, since that's simply another layer on top of the filesystem. As long as you don't open the file with caching disabled, the OS should take care of the I/O scheduling and write stuff off in bunches.
I have a simple CRM system that allows sales to put in customer info and upload appropriate files to create a project.
The system is already being hosted in the cloud. But the office internet upload speed is horrendous. One file may take up to 15 minutes or more to finish, causing a bottleneck in the sales process.
Upgrading our office internet is not an option; what other good solutions are out there?
I propose splitting the project submission form into 2 parts. Project info fields are posted directly to our cloud server webapp and stored in the appropriate DB table, the file submission will actually be submitted to a LAN server with a simple DB and api that will allow the cloud-hosted server webapp to communicate with to retrieve the file if ever needed again via a download link. Details need to be worked out for this set-up. But this is what I want to do in general.
Is this a good approach to solving this slow upload problem? I've never done this before, so are there also any obstacles to this implementation (cross-domain restrictions is something that comes into mind, but I believe that can be fixed with using an iFrame)?
If bandwidth is the bottleneck, then you need a solution that doesn't chew up all your bandwidth. You mentioned that you can't upgrade your bandwidth - what about putting in a second connection?
If not, the files need to stay on the LAN a little longer. It sounds like your plan would be to keep the files on the LAN forever, but you can store them locally initially and then push them later.
When you do copy the files out to the cloud, be sure to compress them and also setup rate limiting (so they take up maybe 10% of your available bandwidth during business hours).
Also put some monitoring in place to make sure the files are being sent in a timely manner.
I hope nobody needs to download those files! :(
I am designing a file download network.
The ultimate goal is to have an API that lets you directly upload a file to a storage server (no gateway or something). The file is then stored and referenced in a database.
When the file is requsted a server that currently holds the file is selected from the database and a http redirect is done (or an API gives the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no Raid or something at any point. Every drive ist just hung into the server as JBOD. All the replication is at application level. If one server breaks down it is just marked as broken in the database and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring / balancing and maby later billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySql LAMP stack.
An url that hits the currently storage server is generated by the API (thats no problem far. Just a classic PHP website and MySQL Database)
Now it gets interesting...
The Storage server runs Apache2 and a PHP script catches the request. URL parameters (secure token hash) are validated. IP, Timestamp and filename are validated so the request is authorized. (No database connection required, just a PHP script that knows a secret token).
The PHP script sets the file hader to use apache2 mod_xsendfile
Apache delivers the file passed by mod_xsendfile and is configured to have the access log piped to another PHP script
Apache runs mod_logio and an access log is in Combined I/O log format but additionally estended with the %D variable (The time taken to serve the request, in microseconds.) to calculate the transfer speed spot bottlenecks int he network and stuff.
The piped access log then goes to a PHP script that parses the url (first folder is a "bucked" just as google storage or amazon s3 that is assigned one client. So the client is known) counts input/output traffic and increases database fields. For performance reasons i thought about having daily fields, and updating them like traffic = traffic+X and if no row has been updated create it.
I have to mention that the server will be low budget servers with massive strage.
The can have a close look at the intended setup in this thread on serverfault.
The key data is that the systems will have Gigabit throughput (maxed out 24/7) and the fiel requests will be rather large (so no images or loads of small files that produce high load by lots of log lines and requests). Maby on average 500MB or something!
The currently planned setup runs on a cheap consumer mainboard (asus), 2 GB DDR3 RAM and a AMD Athlon II X2 220, 2x 2.80GHz tray cpu.
Of course download managers and range requests will be an issue, but I think the average size of an access will be around at least 50 megs or so.
So my questions are:
Do I have any sever bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be directly read from the last request and does not do another request to the mysql server?
Do you think the system with the specs given above can handle this? If not, how could I improve? I think the first bottleneck would be the CPU wouldnt it?
What do you think about it? Do you have any suggestions for improvement? Maby something completely different? I thought about using Lighttpd and the mod_secdownload module. Unfortunately it cant check IP adress and I am not so flexible. It would have the advantage that the download validation would not need a php process to fire. But as it only runs short and doesnt read and output the data itself i think this is ok. Do you? I once did download using lighttpd on old throwaway pcs and the performance was awesome. I also thought about using nginx, but I have no experience with that. But
What do you think ab out the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update them in the database in a 2nd process that can handle delays? Or not do it at all but parse the log files at night? My thought that i would like to have it as real time as possible and dont have accumulated data somehwere else than in the central database. I also don't want to keep track on jobs running on all the servers. This could be a mess to maintain. There should be a simple unit test that generates a secured link, downlads it and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open soure all of this. I just think there needs to be an open source alternative to the expensive storage services as amazon s3 that is oriented on file downloads.
I really searched a lot but didnt find anything like this out there that. Of course I would re use an existing solution. Preferrably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly thing, that you want.