Database logs vs file logs - PHP

I have created a PHP+MySQL web app and I am now trying to implement a logging system to store and track certain actions of each user.
The purpose is the following: track the activity of each user's session by logging IP+time+action, then see which pages they accessed later on by logging time+pagename; for each user there will be a file in the format: log{userid}_{month}.log
Each log will then be viewed only by the website owner, through a custom admin panel, and the data will be used only for security purposes (e.g. to show the user if he, or someone else, logged in from a different IP, and to see which areas of the website were accessed during the login session).
Currently I have a MySQL MyISAM table where I store userid, IP, time and action. The app has not launched yet, but we intend to have a lot of users (over 100k), and using a database for this feels like suicide.
So what do you suggest? How should the logging be done? Using files, using a table in the current database, using a separate database? Are there any file-logging frameworks available for PHP?
How should the reading of the file be done then? Row by row?
Thank you

You have many options, so I'll speak from my experience running a startup with about 500k users, 100k active every month, which seems to be in your range.
We logged user actions in a MySQL database.
Querying your data is very easy and fast (provided good indexes)
We ran on Azure, and had a dedicated MySQL (with slaves, etc) for storing all user data, including logs. Space was not an issue.
Logging to MySQL can be slow, depending on what you are logging, so we just pushed each log entry to Redis and had a Python app read it from Redis and insert it into MySQL in the background. This meant that logging had basically no impact on page load times.
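A minimal sketch of that buffering step, assuming the phpredis extension and a Redis list named log_queue (the names and fields here are illustrative, not our actual schema):

    <?php
    // Push the log entry onto a Redis list instead of writing to MySQL inline.
    // A background worker can BRPOP entries and batch-insert them into MySQL.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    $entry = json_encode([
        'user_id' => 42,                        // the logged-in user's id
        'ip'      => $_SERVER['REMOTE_ADDR'],
        'action'  => 'login',
        'time'    => time(),
    ]);

    $redis->lPush('log_queue', $entry);         // O(1), keeps the request fast
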
We decided to log in MySQL for user actions because:
We wanted to run queries on anything at any time without much effort. The structured format of the user action logs made that incredibly easy to do.
It also allows you to display certain logs to users, if you would require it.
When we introduced badges, we had no need to parse text logs to award badges to those who performed a specific action X number of times. We simply wrote a query against the user action logs, and the badges were awarded. So adding features based on actions was easy as well.
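For instance, awarding a badge to everyone who performed a given action at least 100 times is a single aggregate query; a rough sketch, assuming a user_action_log table with user_id and action columns (names are illustrative):

    <?php
    // Find users who performed 'posted_answer' at least 100 times.
    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'dbuser', 'dbpass');
    $stmt = $pdo->prepare("
        SELECT user_id, COUNT(*) AS times
        FROM user_action_log
        WHERE action = :action
        GROUP BY user_id
        HAVING COUNT(*) >= :min_times
    ");
    $stmt->execute(['action' => 'posted_answer', 'min_times' => 100]);

    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        // award the badge to $row['user_id'] here
    }
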
We did use file logging for a couple of application logs - things we did not query on a daily basis - such as the Python app that wrote to the database, web server access and error logs, etc.
We used Logstash to process those logs. It can simply hook into a log file and stream it to your Logstash server. Logstash can also query your logs, which is pretty cool.
Advanced uses
We used Slack for team communications and integrated the Python database-writing app with it; this allowed us to send critical errors to a channel (via their API) where someone could action a fix immediately.
Closing
My suggestion would be to not overthink it for now: log to MySQL, query and see the stats. Make updates, rinse and repeat. You want to keep the cycle between deploy and update quick, so making decisions from a quick SQL query keeps it easy.
Basically what you want to avoid is logging into a server, finding a log file and grepping your way through it to find something; the setup above achieved that.
This is what we did, it is still running like that and we have no plans to change it soon. We haven't had any issues where we could not find anything that we needed. If there is a massive burst of users and we scale to 1mil monthly active users, then we might change it.
Please note: whichever way you decide to log, if you are saving POST data, be sure never to do that for credit card info unless you are PCI compliant - or rather, use Stripe's JavaScript libraries.

If you are sure that reading the log will mainly target one user at a time, you should consider partitioning your log table:
http://dev.mysql.com/doc/refman/5.1/en/partitioning-range.html
using your user_id as partitioning key.
The maximum number of partitions being 1024, each partition would hold roughly 1/1000th of your 100k users (about 100 users per partition), which is reasonable.
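A rough sketch of what that could look like, using HASH partitioning on user_id for brevity (the RANGE syntax from the link works the same way; table and column names are just examples):

    <?php
    // Log table partitioned on user_id, so reads for one user hit a single partition.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'dbuser', 'dbpass');

    $pdo->exec("
        CREATE TABLE user_action_log (
            user_id    INT UNSIGNED NOT NULL,
            ip         VARCHAR(45)  NOT NULL,
            action     VARCHAR(64)  NOT NULL,
            created_at DATETIME     NOT NULL,
            KEY idx_user_time (user_id, created_at)
        ) ENGINE=InnoDB
          PARTITION BY HASH (user_id)
          PARTITIONS 1000
    ");
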

Are there any file-logging frameworks available for PHP?
There is this which is available on packagist: https://packagist.org/packages/psr/log
Note that it's not a file logging framework but an API for a logger based on the PSR-3 standard from FIG. So, if you like, it's the "standard" logger interface for PHP. You can build a logger that implements this interface or search around on packagist for other loggers that implement that interface (either file or MySQL based). There are a few other loggers on packagist (teacup, forestry) but it would be preferable to use one that sticks to the PSR standard.
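One widely used logger that implements the PSR-3 interface is Monolog (monolog/monolog on Packagist). A minimal file-logging sketch, assuming it is installed via Composer:

    <?php
    require 'vendor/autoload.php';

    use Monolog\Logger;
    use Monolog\Handler\StreamHandler;

    // A PSR-3 compatible logger writing to a plain log file.
    $log = new Logger('app');
    $log->pushHandler(new StreamHandler(__DIR__ . '/logs/app.log', Logger::INFO));

    $log->info('User logged in', ['user_id' => 42, 'ip' => $_SERVER['REMOTE_ADDR']]);
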

We do logging with the great tool Graylog.
It scales as high as you want it, has great tools for data visualization, is incredibly fast even for complex queries and huge datasets, and the underlying search engine (Elasticsearch) is schemaless. The latter may be an advantage, as you get more freedom to extend your logs without the hassle MySQL schemas can give you.
Graylog, Elasticsearch and MongoDB (which is used to store the configuration of Graylog and its web interface) are easily deployable via tools like Puppet, Chef and the like.
Actually logging to Graylog is easy with the already-mentioned PHP library Monolog.
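For illustration, something along these lines should work, assuming the graylog2/gelf-php package alongside Monolog (exact class names can differ between library versions):

    <?php
    require 'vendor/autoload.php';

    use Gelf\Publisher;
    use Gelf\Transport\UdpTransport;
    use Monolog\Logger;
    use Monolog\Handler\GelfHandler;

    // Send log records to a Graylog server over GELF/UDP.
    $transport = new UdpTransport('graylog.example.com', 12201);
    $logger    = new Logger('app');
    $logger->pushHandler(new GelfHandler(new Publisher($transport)));

    $logger->warning('Payment gateway timeout', ['order_id' => 1234]);
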
Of course the big disadvantage here is that you have to learn a bunch of new tools and software. But it is worth it in my opinion.

The crux of the matter is that the data you are writing is never going to change. In my experience, in this scenario I would use either:
MySQL with the BLACKHOLE storage engine (writes are not stored locally but are picked up from the binary log by replication slaves, which keeps the master fast). Set it up right and it's blisteringly fast!
Riak Cluster (a NoSQL solution) - though this may be a learning curve for you, it might be one you eventually need to take anyway.

Use SysLog ;)
Set it up on another server and it can log all of your processes separately (such as networking, servers, SQL, Apache, and your PHP).
It can be useful for you and will reduce the time spent debugging. :)
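For reference, PHP can talk to syslog directly with its built-in functions, so no extra library is needed:

    <?php
    // Send an entry to the local syslog daemon, which rsyslog/syslog-ng
    // can then forward to a central logging server.
    openlog('mywebapp', LOG_PID | LOG_ODELAY, LOG_USER);
    syslog(LOG_INFO, sprintf('user=%d ip=%s action=%s', 42, $_SERVER['REMOTE_ADDR'], 'login'));
    closelog();
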

Related

Does PHP support "Application Sessions"?

I've got a PHP app that stores arbitrary config info in a file. I would like to read that file once, when the app first starts, save it as some kind of application state variable, and leverage it across potentially thousands of user sessions. My Google foo is usually pretty good but in this case the only thing I'm able to come up with is the $_SESSION variable. Using it means reading the config file once per user session, which could mean reading it thousands of times a minute in high-volume installations, which seems inefficient.
When I worked with .NET web apps there was an idea of an application session that could be used to persist app configuration information across multiple user sessions. Does PHP have a similar concept?
Does PHP provide an API for cross-session data management? No.
Does PHP provide mechanisms for reading and updating data? Yes, there are lots of them.
While this sounds like a session handler that is shared across multiple users, its implementation is very different. By default (and by necessity) PHP's sessions are blocking. If access to this shared dataset were blocking, then you would severely limit concurrency.
Given that the access to the data must be non-blocking, how do you mediate concurrent updates to the shared data? A lot depends on the frequency of the updates. But there's also questions about capacity and whether you need to support multiple nodes.
Any one-size-fits-all solution for this functionality is going to be severely hampered in capacity and/or performance. There are lots of products PHP will integrate with to provide a suitable storage substrate; however (leaving aside the logic of the interface for your super-session), it is not in the nature of open source software to package up third-party products and hide them behind APIs.
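As one concrete illustration (a sketch of one possible substrate, not a recommendation of the only option), a shared-memory userland cache such as APCu gives you non-blocking, per-server application state that outlives a single request:

    <?php
    // Read the app config from APCu; fall back to the file only on a cache miss.
    // Note: APCu is per-server shared memory, so it does not span multiple nodes.
    function appConfig()
    {
        $config = apcu_fetch('app_config', $hit);
        if (!$hit) {
            $config = parse_ini_file(__DIR__ . '/config.ini', true);
            apcu_store('app_config', $config, 300); // re-read at most every 5 minutes
        }
        return $config;
    }
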

How to dispatch events across multiple instances?

I'm currently managing PHP "events" in a single instance. This is working well and is correctly implemented in my system using something similar to Laravel's events provider.
My question now concerns a system where I need to dispatch events across different instances/users.
For example, I have an account composed of multiple users. Each user is caching the account settings in session after the initial loading of the application.
Now, if a user makes a modification to the account settings, I'd like to send an event to my other users so they update their settings.
For the time being, I'm thinking about these solutions:
Storing the events in a database table, with each user regularly checking the values; but this will add SQL load and would make the caching system pointless.
Another solution would be to store a flag in Redis. Each user can regularly check the value of the flag and reload the settings if required. It's similar to the SQL solution above but would be much more efficient with Redis. However, the implementation would be more complex, and it might end up custom-built for this specific event.
I also started to look at ways of sharing data between PHP instances and found this question which is suggesting the usage of shared memory. I'm not very familiar with this concept and I'm still looking at it, but I suppose that it may be possible to build a cross instance event system using it.
Using a memcached server in PHP. I'm not familiar with that and still evaluating the possibility of building an event dispatcher system around it.
Using a message queue server. Still evaluating the possibility, and also checking whether existing event-based systems in PHP are built with it.
Are there any other solutions I could use to dispatch such events between instances?
Edit:
Proposition 3 has been rejected, as shared memory only works within a single server and I'm working with a server cluster on the application side.
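For what it's worth, option 2 can stay quite small. A hypothetical sketch with the phpredis extension, using a per-account "settings version" key ($accountId and loadAccountSettings() are placeholders for your own code):

    <?php
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // Writer: bump the version whenever the account settings change.
    $redis->incr("account:{$accountId}:settings_version");

    // Reader (on each request): reload the cached settings only if the version moved.
    $current = (int) $redis->get("account:{$accountId}:settings_version");
    if (!isset($_SESSION['settings_version']) || $_SESSION['settings_version'] !== $current) {
        $_SESSION['settings']         = loadAccountSettings($accountId); // placeholder loader
        $_SESSION['settings_version'] = $current;
    }
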

CakePHP: deploying database changes with filters

We periodically need to modify the structure of our production database. In many cases, these changes correspond to changes in the codebase which reference the new changes.
We usually end up putting up a fail whale page for a minute while we pull the new changes and run the new SQL queries, but I'm interested in writing a component or something to run the SQL queries on the fly, so that no downtime would be required.
I haven't tried it yet, but here's my plan:
Write a component or something to run a specific query or queries. The queries will probably have checks like IF EXISTS so they run only once (see the sketch after this list). I think it will also have to clear the model / persistent caches too.
Call the above component/query from inside the AppController's beforeFilter.
Pull the changes into the live site (with the above code in place).
Wait a few seconds for the app to run once (either by another user or us).
Remove the beforeFilter code which triggered the SQL queries to run.
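A rough sketch of steps 1-2, assuming CakePHP 2.x (the table, column and connection names are placeholders):

    <?php
    // app/Controller/AppController.php
    App::uses('ConnectionManager', 'Model');

    class AppController extends Controller {

        public function beforeFilter() {
            parent::beforeFilter();
            $db = ConnectionManager::getDataSource('default');

            // Guard so the ALTER only ever runs once.
            $exists = $db->fetchAll(
                "SELECT 1 FROM information_schema.columns
                 WHERE table_schema = DATABASE()
                   AND table_name = 'posts'
                   AND column_name = 'summary'"
            );

            if (empty($exists)) {
                $db->execute("ALTER TABLE posts ADD COLUMN summary TEXT NULL");
                Cache::clear(false, '_cake_model_'); // flush cached schema descriptions
            }
        }
    }
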
Is this a crazy idea? Here are my questions:
A. Will something like this work?
B. Is there a better way to do this that I'm missing?
C. What do I need to know about the model caches to keep from throwing errors? Presumably, the debug level will be set to 0 (because the site will be in production).
By the way, it seems relevant to mention that we are not on a load balanced system. We're on a single dedicated server.
Will something like this work?
Sort of. You will likely still suffer downtime. Even if only briefly while the queries finish running. You may also have concurrency issues if the site has heavy traffic during your deployment.
Is there a better way to do this that I'm missing?
With a load balanced system you could update one system at a time while diverting traffic to another system until all systems are updated.
You could feature flag your code. Code will not run until this feature is enabled. So launch your code and when your database updates are complete, enable the feature.
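A feature flag can be as small as a config value checked around the new code paths; a hypothetical sketch using CakePHP's Configure class:

    <?php
    // app/Config/bootstrap.php - flip to true once the SQL changes have run.
    Configure::write('Feature.newCommentSystem', false);

    // In the code that depends on the new schema:
    if (Configure::read('Feature.newCommentSystem')) {
        // code paths that use the new columns/tables
    } else {
        // old behaviour until the database update is complete
    }
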
What do I need to know about the model caches to keep from throwing errors?
I am not familiar enough with the caches, but a cache could provide the illusion that the site is available; any dynamic request (e.g. a form submission) could still result in an error.
As an aside, take a look at Schema Migrations if you haven't already.
Will something like this work?
I'm sure it would work, but you can also pound a nail with a shoe if you hit it hard enough.
Is there a better way to do this?
Jason McCreary is absolutely correct with his comment mentioning how code shouldn't deploy itself and that you should be using a deployment process - whether that be (at the very least) custom scripts, migrations, or even a custom tool like my company's tool BuildMaster which handles database deployments as a first-class concept.
Using a tool like this to flesh out your process will allow you to incrementally build up to a fully automated system, so that if you ever do add load balancing (or some other additional infrastructure) you don't have to spend a whole lot of time updating your deployment process all over the place. You can also plan for rollbacks, advanced deployment scenarios, and other things that cannot be accomplished simply by putting deployment code within application code.
For database deployments specifically, BuildMaster can manage the scripts and will deploy them in the same order automatically when the appropriate deployment actions are used. This will minimize any downtime you may experience trying to run them manually. You can also put in stop/start application actions which will clear any server-side caches. There will always be the problem of ViewState or other client-side persistence, but that's a different issue altogether.

PHP chat active users

I have added a chat capability to a site using jQuery and PHP and it seems to generally work well, but I am worried about scalability. I wonder if anyone has some advice. The key area for me, I think, is efficiently managing awareness of who is online.
detail:
I haven't implemented long-polling (yet) and I'm worried about the raw number of long-running processes in PHP (Apache) getting out of control.
My code runs a periodic jquery ajax poll (4secs), that first updates the db to say I am active and sets a timestamp.
Then there is a routine that checks the timestamp for all active users and sets those outside (10mins) to inactive.
This is fairly normal from my research so far. However, I am concerned that if I allow every active user to check every other active user, and then have everyone update the db to kick off inactive users, I will get duplicated effort, record locks and unnecessary server load.
So I have implemented an idea of the role of a 'sweeper'. This is just one of the online users, who inherits the role of the person doing the cleanup. Everyone else just checks whether there is a 'sweeper' in existence (DB read) and carries on. If there is no sweeper when they check, they make themselves sweeper (DB write for their own record). If there are more than one, make yourself 'non-sweeper', sleep for a random period and check again.
My theory is that this way there is only one user regularly writing updates to several records on the relevant table and everyone else is either reading or just writing to their own record.
So it works OK, but the problem possibly is that the process requires a few DB reads and may actually be less efficient than just letting everyone do the cleanup, as in the other approaches I mentioned.
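For reference, the two queries involved are tiny either way; a minimal sketch assuming a chat_users table with last_seen and is_online columns (names are illustrative):

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=chat', 'dbuser', 'dbpass');

    // Heartbeat: on every poll, each user only touches their own row.
    $stmt = $pdo->prepare(
        "UPDATE chat_users SET last_seen = NOW(), is_online = 1 WHERE id = :id"
    );
    $stmt->execute(['id' => $_SESSION['user_id']]);

    // Sweep: only the elected 'sweeper' runs this, marking stale users offline.
    $pdo->exec(
        "UPDATE chat_users
         SET is_online = 0
         WHERE is_online = 1
           AND last_seen < NOW() - INTERVAL 10 MINUTE"
    );
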
I have had over 100 concurrent users running OK so far, but the client wants to scale up to several hundred, even over 1,000, and I have no way of knowing at this stage whether this idea is good or not.
Does anyone know whether this is a good approach or not, whether it is scalable to hundreds of active users, or whether you can recommend a different approach?
As an aside, long polling / comet for the actual chat messages seems simple and I have found a good resource for the code, but there are several blog comments suggesting it's dangerous with PHP and Apache specifically (active threads, etc.), with the impact minimised using usleep and session_write_close.
Again, does anyone have any practical experience of a PHP long polling setup for hundreds of active users? Maybe you can put my mind at ease! Do I really have to look at migrating this to node.js (no experience)?
Thank you in advance
Tony
My advice would be to do this with the Meteor framework, which should be pretty trivial to do even if you are not an expert, and then simply load that chat into your PHP website via an iframe.
It will be scalable, won't consume many resources, and it will only get better in the future, I presume.
And it sure beats both PHP comet solutions and jquery & ajax timeout based calls to server.
I even believe you could find a more or less complete solution on GitHub that just requires tweaking.
But of course, do read the docs before you implement it.
If you worry about security issues, read up on security with Meteor.
Long polling is indeed pretty disastrous for PHP. PHP always runs with a limited number of concurrent processes, and it will scale great as long as you optimize for handling each request as quickly as possible.
Long polling and similar solutions will quickly fill up your pipe.
It could be argued that PHP is simply not the right technology for this type of stuff with the current tools out there. If you insist on using PHP you could try ReactPHP, which is a framework for PHP built quite similarly to NodeJS. The implication with React is also that it's expected to run as a separate daemon, and not within a web server such as Apache. I have no experience with its stability or how well it scales, so you will have to do the testing yourself.
NodeJS is not hard to get into if you know JavaScript well. NodeJS + socket.io make it really easy to write the chat server and client with websockets. This would be my recommendation. When I started with this, I had something nice up and running within a few hours.
If you want to keep your application stack in PHP, want the chat application running in your actual web app (not an iframe), and are concerned about scaling your realtime infrastructure, then I'd recommend you look at a hosted service for the realtime updates, such as Pusher, who I work for. This way the hosted service handles the scaling of the realtime infrastructure for you and lets you concentrate on building your application functionality.
This way you only need to handle the chat message requests - sanitize/verify the content - and then push the information through Pusher to the thousands of connected clients.
The quick start guide is available here:
http://pusher.com/docs/quickstart
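Publishing a sanitized chat message is then a single server-side call; a rough sketch with the pusher-http-php library (the constructor arguments vary slightly between library versions):

    <?php
    require 'vendor/autoload.php';

    // Credentials come from your Pusher dashboard.
    $pusher = new Pusher\Pusher('app_key', 'app_secret', 'app_id');

    // Validate/sanitize the incoming message first, then broadcast it.
    $pusher->trigger('chat-room-1', 'new-message', [
        'user' => 'tony',
        'text' => 'Hello everyone',
    ]);
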
I've a full list of hosted services on my realtime web tech guide.

Scalability 101: How can I design a scalable web application using PHP?

I am building a web-application and have a couple of quick questions. From what I learnt, one should not worry about scalability when initially building the app and should only start worrying when the traffic increases. However, this being my first web-application, I am not quite sure if I should take an approach where I design things in an ad-hoc manner and later "fix" them. I have been reading stories about how people start off with an app that gets millions of users in a week or two. Not that I will face the same situation but I can't help but wonder, how do these people do it?
Currently, I bought a shared hosting account on Lunarpages and that got me started in building and testing the application. However, I am interested in learning how to build the same application in a scalable-manner using the cloud, for instance, Amazon's EC2. From my understanding, I can see a couple of components:
There is a load balancer that first receives requests and then decides where to route each request
This request is then handled by a server replica that then processes the request and updates (if required) the database and sends back the response to the client
If a similar request comes in, then a caching mechanism like memcached kicks in and returns objects from the cache
A blackbox that handles database replication
Specifically, I am trying to do the following:
Setting up a load balancer (my homework revealed that HAProxy is one such load balancer)
Setting up replication so that databases can be synchronized
Using memcached
Configuring Apache to work with multiple web servers
Partitioning the application to use Amazon EC2 and Amazon S3 (my application is something that will need a great deal of storage)
Finally, how can I avoid burning myself when using Amazon services? Because this is just a learning phase, I can probably do with 2-3 servers with a simple load balancer and replication, but I want to avoid accidentally paying loads of money.
I am able to find resources on individual topics but am unable to find something that starts off from the big picture. Can someone please help me get started?
Personally, I think you should be considering how your app will scale initially - as otherwise you'll run into problems down the line.
I'm not saying you need to build it initially as a multi-server system, but if you think you'll need to do it later, be mindful of the concerns now.
In my experience, this includes things like:
Sessions. Unless you use 'sticky' load balancing, you will have to have some way of sharing session state between servers. This probably means storing session data on either shared storage, or in a DB.
File uploads and replication. If you allow users to upload files, or you have a CMS that allows you to upload images/documents, it needs to cater for the fact that these files will also need to find their way onto other nodes in your cluster. However, if you've gone down the shared storage route mentioned above, this should cover it.
DB scalability. If you're using traditional DB servers, you might want to think about how you'll implement scalability at that level. This may mean coding your app so you use one connection string for reads, and another for writes (see the sketch after this list). Then you are free to implement replication, with one master node handling the inserts/updates and cascading the changes to read-only nodes that handle the bulk of the read work.
Middleware. You might even want to go down the route of implementing some kind of message oriented middleware solution to completely hand off business logic functions - this will give you a great level of flexibility in how you wish to scale this business logic layer in the future. Although initially this will be a lot of complication and work for not a great deal of payoff.
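A minimal sketch of the read/write split mentioned in the DB scalability point above, assuming one master and one read replica (the hostnames are placeholders):

    <?php
    // Two connections: writes always go to the master, reads to a replica.
    $write = new PDO('mysql:host=db-master.internal;dbname=app', 'dbuser', 'dbpass');
    $read  = new PDO('mysql:host=db-replica.internal;dbname=app', 'dbuser', 'dbpass');

    // INSERT/UPDATE/DELETE -> master
    $stmt = $write->prepare("UPDATE users SET last_login = NOW() WHERE id = :id");
    $stmt->execute(['id' => 42]);

    // SELECTs -> replica (beware of replication lag for read-after-write cases)
    $stmt = $read->prepare("SELECT id, name FROM users WHERE id = :id");
    $stmt->execute(['id' => 42]);
    $user = $stmt->fetch(PDO::FETCH_ASSOC);
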
Have you considered playing around with VMs first? You can run 2-3 VMs on your local machine and set them up like you would actual servers, they just won't be able to handle real traffic levels. If all you're looking for is the learning experience, it might be an ideal way to go about it.
