I'm sitting in a situation where I have to build a statistics module which can store user-related statistical information.
Basically, all that's stored is an event identifier, a datetime, the number of times this event has been fired, and the id of the object which is being interacted with.
I've made similar systems before, but never anything that has to store this amount of information.
My suggestion would be a simple table in the database, e.g. "statistics", containing the following columns:
id (Primary, auto-increment)
amount (integer)
event (enum: list, click, view, contact)
datetime (datetime)
object_id (integer)
Usually, this method works fine, enabling me to store statistics about the object in a given timeframe (inserting a new row every hour or every 15 minutes, so the statistics update every 15 minutes).
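For illustration, here is a minimal sketch of that kind of bucketed counter table and the upsert that keeps it current, assuming MySQL accessed through PDO (the column names and ENUM values follow the list above; everything else is made up):

<?php
// Hypothetical connection details; adjust for your environment.
$pdo = new PDO('mysql:host=localhost;dbname=stats;charset=utf8mb4', 'user', 'pass');

// One row per (event, object, 15-minute bucket). The unique key is what makes the
// upsert below possible.
$pdo->exec("
    CREATE TABLE IF NOT EXISTS statistics (
        id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
        event     ENUM('list','click','view','contact') NOT NULL,
        object_id INT UNSIGNED NOT NULL,
        bucket    DATETIME NOT NULL,
        amount    INT UNSIGNED NOT NULL DEFAULT 0,
        PRIMARY KEY (id),
        UNIQUE KEY uq_event_object_bucket (event, object_id, bucket)
    )");

// Round 'now' down to the current 15-minute window and bump its counter.
$bucket = date('Y-m-d H:i:s', intdiv(time(), 900) * 900);
$stmt = $pdo->prepare("
    INSERT INTO statistics (event, object_id, bucket, amount)
    VALUES (:event, :object_id, :bucket, 1)
    ON DUPLICATE KEY UPDATE amount = amount + 1");
$stmt->execute(['event' => 'click', 'object_id' => 42, 'bucket' => $bucket]);

A single statement either creates the 15-minute row or increments it, which keeps the hot path to one write per event.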
Now, my questions are:
Are there better or more optimized methods of achieving and building a custom statistics module?
Since this new site will receive massive traffic, how do I handle the trade-off that an index on object_id speeds up the reads but slows down the inserts and updates?
How do you even achieve live statistics, like e.g. Google Analytics does? Is this solely about server size and processing power, or is there a best practice?
I hope my questions are understandable, and I'm looking forward to getting wiser on this topic.
Best regards,
Jonas
I believe one of the issues you are going to run into is that you want two worlds, transactional and analytical. That is fine in small cases, but not when you start to scale, especially into the realm of 500M+ records.
I would suggest separating the two: you generate events and keep track of just the event itself. You would then run analytical queries to get things such as the count of events per object interaction. You could have these counts or other metric calculations aggregated into a report table periodically.
As for tracking events, you could either keep them in a table of event occurrences, or have something in front of the database that does this tracking and then provides the periodic aggregations to the database. Think of the world of monitoring systems, which use collection agents to generate events that go to an aggregation layer, which then writes a periodic metric snapshot to an analytical area (e.g. CollectD to StatsD / Graphite to Whisper).
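As a rough illustration of that split (a sketch only; the table and column names are invented, and a unique key on (event, object_id, period_start) in the report table is assumed), the write path stays append-only and a periodic job rolls the raw events up:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=stats;charset=utf8mb4', 'user', 'pass');

// Hot path: append-only insert, no aggregation work at write time.
$pdo->prepare("INSERT INTO events (event, object_id, created_at) VALUES (?, ?, NOW())")
    ->execute(['click', 42]);

// Periodic job (e.g. every 15 minutes): roll up everything received so far, using an
// id watermark so the aggregation and the cleanup cover exactly the same rows.
$maxId = (int) $pdo->query("SELECT COALESCE(MAX(id), 0) FROM events")->fetchColumn();
$pdo->beginTransaction();
$pdo->exec("
    INSERT INTO events_report (event, object_id, period_start, hits)
    SELECT event, object_id,
           DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS period_start,
           COUNT(*)                                     AS hits
    FROM events
    WHERE id <= $maxId
    GROUP BY event, object_id, period_start
    ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)");
$pdo->exec("DELETE FROM events WHERE id <= $maxId");  // or keep the raw rows and track a watermark instead
$pdo->commit();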
Disclaimer: I am an architect for InfiniDB.
Not sure what kind of data source you are using, but as you grow and work out how much history to keep, etc., you will probably face sizing problems, as most people typically do when they are collecting event data or monitoring data. If you are on MySQL / MariaDB / PostgreSQL, I would suggest you check out InfiniDB (an open source columnar MPP database for analytics); it is fully open source (GPLv2) and will provide the performance you need to run queries over billions of rows and TBs of data to answer those analytical questions.
Related
I currently have a MySQL database which handles a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) coming in real time. The structure is such that users belong to sub-affiliates and sub-affiliates belong to affiliates.
I need to keep a balance of clicks. For the sake of simplicity, let's say I need to increase the clicks balance by 1 (there is actually more processing depending on the event) for each of the user, the sub-affiliate and the affiliate. Currently I do it very simply: once I receive the event, I run sequential queries in PHP. I read the balance of the user, increment it by one and store the new value, then I read the balance of the sub-affiliate, increment and write, etc.
The user's balance is the most important metric for me, so I want to keep it as real-time as possible. The other metrics at the sub-affiliate and affiliate level are less important, but the closer they are to real time, the better; I think a 5-minute delay might be OK.
As the project grows, this is already becoming a bottleneck, and I am now looking at alternatives for how to redesign the calculation of balances. I want to ensure that the new design will be able to crunch 50 million events per day. It is also important for me not to lose a single event, and I currently wrap each cycle of changes to click balances in an SQL transaction.
Some things I am considering:
1 - Create a cron job that will update the balances on the sub-affiliate and affiliate level not in real time, let's say every 5 mins.
2 - Move the number crunching and balance updates to the database itself by using stored procedures. I am considering adding a separate database; maybe Postgres would be better suited for the job? I tried to find out whether there is a serious performance improvement, but the Internet seems divided on the topic.
3 - Moving this particular data stream to something like Hadoop with Parquet (or Apache Kudu?) and just adding more servers if needed.
4 - Sharding the existing db, basically adding a separate db server for each affiliate.
Are there some best practices / technologies for this type of task or some obvious things that I could do? Any help is really appreciated!
My advice for High Speed Ingestion is here. In your case, I would collect the raw information in the ping-pong table it describes, then have the other task summarize that table in order to do mass UPDATEs of the counters. When there is a burst of traffic, this becomes more efficient rather than keeling over.
Click balances (and "Like counts") should be in a table separate from all the associated data. This helps avoid interference with other activity in the system. And it is likely to improve the cacheability of the balances if you have more data than can be cached in the buffer_pool.
Note that my design does not include a cron job (other than perhaps as a "keep-alive"). It processes a table, flips tables, then loops back to processing -- as fast as it can.
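A rough sketch of that flip-and-summarize cycle, assuming MySQL via PDO; the table names are invented and the atomic RENAME stands in for the ping-pong swap, so treat it as an outline of the idea rather than the exact design the linked write-up describes:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=stats;charset=utf8mb4', 'user', 'pass');

// 1. Atomically swap the live staging table with the (empty) processing table, so new
//    clicks keep landing in an empty table while we work on the full one.
$pdo->exec("RENAME TABLE clicks_staging TO clicks_tmp,
                         clicks_processing TO clicks_staging,
                         clicks_tmp TO clicks_processing");

// 2. Apply the rotated-out batch to the counter tables in bulk: one UPDATE per level
//    instead of one UPDATE per click.
$pdo->beginTransaction();
$pdo->exec("
    UPDATE user_balances b
    JOIN (SELECT user_id, COUNT(*) AS n FROM clicks_processing GROUP BY user_id) c
      ON c.user_id = b.user_id
    SET b.clicks = b.clicks + c.n");
$pdo->exec("
    UPDATE affiliate_balances b
    JOIN (SELECT affiliate_id, COUNT(*) AS n FROM clicks_processing GROUP BY affiliate_id) c
      ON c.affiliate_id = b.affiliate_id
    SET b.clicks = b.clicks + c.n");
$pdo->commit();

// 3. Empty the processed table so it is ready to be swapped back in on the next cycle.
//    (TRUNCATE commits implicitly, so it stays outside the transaction above.)
$pdo->exec("TRUNCATE TABLE clicks_processing");

The point is that the counters are touched by a handful of bulk UPDATEs per cycle instead of one UPDATE per click.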
If I were you, I would put Redis in-memory storage in front and increment your metrics there. It's very fast and reliable, and you can also read from this DB. Then create a cron job which will save those data into the MySQL DB.
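Something along these lines, as a sketch only: it assumes the phpredis extension, invented key and table names, and that $userId / $affiliateId come from the incoming event. The web request only touches Redis, and a cron job flushes the accumulated counters into MySQL:

<?php
// Hot path (per incoming event): only touch Redis.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->hIncrBy('click_balances:user', (string) $userId, 1);
$redis->hIncrBy('click_balances:affiliate', (string) $affiliateId, 1);

// Cron job (e.g. every minute): rotate the hash out atomically, then persist to MySQL.
if ($redis->exists('click_balances:user')) {
    $redis->rename('click_balances:user', 'click_balances:user:flush');
    $pending = $redis->hGetAll('click_balances:user:flush');

    $pdo  = new PDO('mysql:host=localhost;dbname=stats;charset=utf8mb4', 'user', 'pass');
    $stmt = $pdo->prepare("UPDATE user_balances SET clicks = clicks + :n WHERE user_id = :id");
    foreach ($pending as $id => $n) {
        $stmt->execute(['n' => (int) $n, 'id' => (int) $id]);
    }
    $redis->del('click_balances:user:flush');
}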
Is your web tier doing the number crunching as it receives and processes the HTTP request? If so, the very first thing you will want to do is move this to a work queue and process these events asynchronously. I believe you hint at this in your item 3.
There are many solutions, and choosing one is beyond the scope of this answer, but some packages to consider:
Gearman/PHP
Sidekiq/Ruby
Amazon SQS
RabbitMQ
NSQ
...etc...
In terms of storage it really depends on what you're trying to achieve: fast reads, fast writes, bulk reads, sharding/distribution, high availability... the answer to each points you in a different direction.
This sounds like an excellent candidate for Clustrix, which is a drop-in replacement for MySQL. They do something like sharding, but instead of putting data in separate databases, they split it and replicate it across nodes in the same DB cluster. They call it slicing, and the DB does it automatically for you, transparently to the developers. There is a good performance paper on it that shows how it's done, but the short of it is that it is a scale-out OLTP DB that happens to be able to absorb mad amounts of analytical processing on real-time data as well.
I'm currently designing and developing a web application that has the potential to grow very large at a fast rate. I will give some general information and move on to my question(s). I would say I am a mid-level web programmer.
Here are some specifications:
MySQL - Database Backend
PHP - Used in front/backend. Also used for SOAP Client
HTML, CSS, JS, jQuery - Front end widgets (highcharts, datatables, jquery-ui, etc.)
I can't get into too many fine details as it is a company project, but the main objective is to construct a dashboard that thousands of users will be accessing from various devices.
The data for this project is projected to grow by 50,000 items per year ( ~1000 items per week ).
1 item = 1 row in database
An item will also record a daily history starting on the day it was inserted.
1 day of history per item = 1 record
365 records per item per year
365 * 50,000 = ~18,250,000 [first year]
multiply ~18,250,000 records by x for each year after.
(My formula is a bit off since items will be added periodically throughout that year.)
All items and history are accessed through a SOAP Client that connects to an API service, then writes the record to the database.
The majority of this data will be read and remain static (read only), but some item data may be updated or changed. The data will also be updated each day and will need another x amount of history written.
Questions:
1) Is MySQL a good solution to handle these data requirements? ~100 million records at some point.
2) I am limited to synchronous calls with my PHP Soap Client (as far as I know). This is becoming time consuming as more items are being extracted. Is there a better option for writing a SOAP Client so that I can send asynchronous requests without waiting for a response?
3) Are there any other requirements I should be thinking about?
The difficulty involved in scaling is almost always a function of users times data. If you have a lot of users, but not much data, it's not hard to scale. A typical example is a popular blog. Likewise, if you have a lot of data but not very many users, you're also going to be fine. This represents things like accounting systems or data-warehouse situations.
The first step towards any solution is to rough in a schema and test it at scale. You will have no idea how your application is going to perform until you run it through the paces. No two applications ever have exactly the same problems. Most of the time you'll need to adjust your schema, de-normalize some data, or cache things more aggressively, but these are just techniques and there's no standard cookbook for scaling.
In your specific case you won't have many problems if the rate of INSERT activity is low and your indexes aren't too complicated. What you'll probably end up doing is splitting out those hundreds of millions of rows into several identical tables each with a much smaller set of records in them.
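One way to get that effect without hand-managing several table names is MySQL's native range partitioning, which is closely related to the manual splitting described above. A sketch with invented table and column names (not necessarily how you should model your items):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=dashboard;charset=utf8mb4', 'user', 'pass');

// One physical partition per year of history. Queries that filter on recorded_on only
// touch the partitions they need (partition pruning), so each query effectively works
// against a much smaller table.
$pdo->exec("
    CREATE TABLE item_history (
        item_id     INT UNSIGNED NOT NULL,
        recorded_on DATE NOT NULL,
        value       DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (item_id, recorded_on)
    )
    PARTITION BY RANGE (YEAR(recorded_on)) (
        PARTITION p2015 VALUES LESS THAN (2016),
        PARTITION p2016 VALUES LESS THAN (2017),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    )");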
If you're having trouble getting your queries to execute, consider the standard approach: index, optimize, then denormalize, then cache.
Where PHP can't cut it, consider using something like Python, Ruby, Java/Scala or even NodeJS to help facilitate your database calls. If you're writing a SOAP interface, you have many options.
1) Is MySQL a good solution to handle these data requirements? ~100 million records at some point.
Absolutely. Make sure you've got everything indexed properly, and if you hit a storage or queries-per-second limit, you've got plenty of options that apply to most/all DBMSs: beefier hardware, sharding data across servers, clustering, etc.
2) I am limited to synchronous calls with my PHP Soap Client (as far as I know). This is becoming time consuming as more items are being extracted. Is there a better option for writing a SOAP Client so that I can send asynchronous requests without waiting for a response?
PHP 5+ allows you to execute multiple requests in parallel with cURL. Refer to the curl_multi_* functions for this, such as curl_multi_exec(). As far as I know, this requires you to handle the SOAP/XML processing separately from the requests.
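A bare-bones sketch of that pattern; the endpoint URL and the request envelopes are placeholders, and the SOAP responses come back as raw XML that you still have to parse yourself:

<?php
// Fire several SOAP requests in parallel with the curl_multi_* API, then collect the
// responses once all transfers have finished.
$endpoint = 'https://api.example.com/soap';            // placeholder endpoint
$payloads = [
    '<soapenv:Envelope>...</soapenv:Envelope>',        // placeholder SOAP envelopes;
    '<soapenv:Envelope>...</soapenv:Envelope>',        // build these however you do today
];

$mh = curl_multi_init();
$handles = [];
foreach ($payloads as $i => $xml) {
    $ch = curl_init($endpoint);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $xml,
        CURLOPT_HTTPHEADER     => ['Content-Type: text/xml; charset=utf-8'],
    ]);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers until none are still running.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

$responses = [];
foreach ($handles as $i => $ch) {
    $responses[$i] = curl_multi_getcontent($ch);       // raw SOAP/XML response
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);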
3) Are there any other requirements I should be thinking about?
Probably. But you're usually on the right track if you start with a properly indexed, normalized database for which you've thought about your objects at least mostly correctly. Start denormalizing if/when you find instances where denormalization solves an existing or obvious near-future efficiency problem. But don't optimize for things that could only become problems if the moons of Saturn align; only optimize for problems that users will notice somewhat regularly.
When talking about a large-scale app, not all the effort and credit should go to the database alone. It is the core part, since data is the main thing in any web application, but alongside it your application also depends on code optimization, which includes your backend and frontend scripts, your images and, above all, your server. There are many factors affecting the application.
I'm really interested to find out how people approach collaborative filtering, recommendation engines and so on. I mean this more in terms of the performance of the script than anything else. I have started reading Programming Collective Intelligence, which is really interesting but tends to focus more on the algorithmic side of things.
I currently only have 2k users, but my current system is proving to be totally not future proof and very taxing on the server already. The entire system is based on making recommendations of posts to users. My application is PHP/MySQL but I use some MongoDB for my collaborative filtering stuff - I'm on a large Amazon EC2 instance. My setup is really a 2 step process. First I calculate similarities between items, then I use this information to make recommendations. Here's how it works:
First my system calculates similarities between users' posts. The script runs an algorithm which returns a similarity score for each pair. The algorithm examines information such as common tags, common commenters and common likers, and returns a similarity score. The process goes like this:
Each time a post is added, has a tag added, commented on or liked I add it to a queue.
I process this queue via cron (once a day), finding out the relevant information for each post, e.g. user_id's of the commenters and likers and tag_id's. I save this information to MongoDB in this kind of structure: {"post_id":1,"tag_ids":[12,44,67],"commenter_user_ids":[6,18,22],"liker_user_ids":[87,6]}. This allows me to eventually build up a MongoDB collection which gives me easy and quick access to all of the relevant information for when I try to calculate similarities
I then run another cron script (also once a day, but after the previous one) which goes through the queue again. This time, for each post in the queue, I grab its entry from the MongoDB collection and compare it to all of the other entries. When two entries have some matching information, I give them +1 in terms of similarity. In the end I have an overall score for each pair of posts. I save the scores to a different MongoDB collection with the following structure: {"post_id":1,"similar":{"23":2,"2":5,"7":2}} ('similar' is a key=>value array with the post_id as key and the similarity score as the value). I don't save a score if it is 0.
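For what it's worth, the pairwise scoring step described above amounts to something like this (a simplified sketch in plain PHP; the field names are taken from the document structure above, everything else is invented):

<?php
// Given two post documents shaped like the MongoDB entries above, count how many
// tags, commenters and likers they have in common to get a similarity score.
function similarityScore(array $a, array $b): int
{
    $score = 0;
    foreach (['tag_ids', 'commenter_user_ids', 'liker_user_ids'] as $field) {
        $score += count(array_intersect($a[$field] ?? [], $b[$field] ?? []));
    }
    return $score;
}

$postA = ['post_id' => 1,  'tag_ids' => [12, 44, 67], 'commenter_user_ids' => [6, 18, 22], 'liker_user_ids' => [87, 6]];
$postB = ['post_id' => 23, 'tag_ids' => [44],         'commenter_user_ids' => [6],         'liker_user_ids' => [3]];

echo similarityScore($postA, $postB); // 2: one shared tag (44) plus one shared commenter (6)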
I have 5k posts. So all of the above is quite hard on the server. There's a huge amount of reads and writes to be performed. Now, this is only half the issue. I then use this information to work out what posts would be interesting to a particular user. So, once an hour I run a cron script which runs a script that calculates 1 recommended post for each user on the site. The process goes like so:
The script first decides which type of recommendation the user will get. It's a 50-50 chance of either 1. a post similar to one of your posts, or 2. a post similar to a post you have interacted with.
If 1, then the script grabs the users post_ids from MySQL, then uses them to grab their similar posts from MongoDB. The script takes the post that is most similar and has not yet been recommended to the user.
If 2, the script grabs all of the posts the user has commented on or liked from MySQL and uses their ids to do the same in 1 above.
Unfortunately the hourly recommendation script is getting very resource intensive and is slowly taking longer and longer to complete... currently 10-15 minutes. I'm worried that at some point I won't be able to provide hourly recommendations anymore.
I'm just wondering if anyone feels I could be approaching this any better?
With 5000 posts, that's 25,000,000 relationships, increasing O(n^2).
Your first problem is how to avoid examining so many relationships every time the batch runs. Using tags or keywords will help with content matching, and you could use date ranges to limit common 'likes'. Beyond that... we'd need to know a lot more about the methodology for establishing relationships.
Another consideration is when you establish relationships. Why are you waiting until the batch runs to compare a new post with existing data? Certainly it makes sense to handle this asynchronously to ensure that the request is processed quickly - but (other than the restrictions imposed by your platform) why wait until the batch kicks in before establishing the relationships? Use an asynchronous message queue.
Indeed depending on how long it takes to process a message, there may even be a case for re-generating cached relationship data when an item is retrieved rather than when it is created.
And if I were writing a platform to measure relationships within data, then (the clue is in the name) I'd definitely be leaning towards a relational database, where joins are easy and much of the logic can be implemented on the database tier.
It's certainly possible to reduce the length of time the system takes to cross-reference the data. This is exactly the kind of problem map-reduce is intended to address, but the benefits of that mainly come from being able to run the algorithm in parallel across lots of machines; at the end of the day it takes just as many clock ticks.
Here's how I would start to plan this.
The first thing is to possibly get rid of your database technology, or supplement it with triplestore or graph technologies. That should provide better performance for analyzing similar likes or topics.
Next, yes, get a subset. Take a few interests that the user has and get a small pool of users that have similar interests.
Then build indexes of likes in some sort of meaningful order and count the inversions (divide and conquer: this is pretty similar to merge sort, and you'll want to sort on the way back out to count split inversions anyway).
I hope that helps: you don't want to compare everything to everything else, or it's definitely O(n^2). You should be able to replace that with something somewhere between constant and linear if you take sets of people who have similar likes and use that.
For example, from a graph perspective, take something that they recently liked, look at the in-edges, then trace them out and analyze just those users. Maybe do this on a few recently liked articles, then find a common set of users from that and use it for the collaborative filtering to find articles the user would likely enjoy. Then you're at a workable problem size, especially in a graph where there is no index growth (although there may be more in-edges to traverse on the article; that just gives you more chance of finding usable data).
Even better would be to key the articles themselves, so that if an article was liked by someone you can see articles that they may like based on other users (i.e. Amazon's 'users that bought this also bought').
Hope that gives a few ideas. For graph analysis there are some frameworks that may help, like Faunus for stats and derivations.
We are currently developing an API and we want to provide an analytics dashboard for our clients to view metrics about their calls per month/day/hour.
The current strategy we thought is to save each call to a client separate table (eg. calls_{client_id}) for historic reasons and have a summary table (eg. calls_summary) containing the number of calls for a given hour of a day for each client.
Then, each day a cron job will create an XML file with the summary of the previous day's calls for each client, and the dashboard will use those files instead of the database. So the only analytics task which will use the database is the cron job.
For infrastructure we are thinking of MySQL replication, with the slave as the analytics database.
Is that strategy useful and valid for real web statistics?
Can you propose any tuning on that or even a totally different one?
save each call to a client separate table (eg. calls_{client_id}) for historic reasons
No. Don't break the rules of normalization unless you've got good reason. It won't improve performance and could actually be very detrimental. It'll certainly make your code more complex and therefore less reliable.
It might be worth archiving off old records on a period-by-period basis, but unless you know that you're going to run into performance problems I'd advise against this.
By all means pre-consolidate the data into another table (provided you are getting a reduction in the number of rows of at least 95%). But don't bother converting it to XML unless and until you need the data in that format.
As for how you pre-consolidate: either use period-based consolidation (e.g. roll up by date) or use flagging to record which records have already been consolidated.
The less frequently you run the consolidation, the bigger its impact on performance when it does run. But run it too frequently and you'll have problems with contention/locking.
Without knowing a lot about the structure and volume of data, or the constraints in terms of budget, availability and timeliness, it's hard to provide an optimal solution. But if it were me, I'd probably go with 3 mysqld tiers: one providing the transactional write facility, one replicating this data and generating the consolidated data, and one providing read access to the consolidated data (master <-> master <-> slave).
Performance-wise, it is a bad idea to create a separate table for each client. The classic approach would be the following:
client: id, name, address, ...
call: id, client_id, created_at, duration, ...
calls_summary: id, client_id, date_start, date_end, nb_calls
Now if you want to retrieve all the calls of a client, you go like this:
SELECT * FROM client
LEFT JOIN `call` ON `call`.client_id = client.id
WHERE client.id = 42
Or:
SELECT * FROM `call` WHERE client_id = 42
(The backticks are needed because CALL is a reserved word in MySQL.)
I don't see any reason for using XML; your cron could simply keep the calls_summary table updated.
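For instance, a cron job along these lines, run once an hour, would do it. This is a sketch that assumes MySQL via PDO and a unique key on (client_id, date_start) in calls_summary; it rolls up the previous full hour and is safe to re-run because the summary row is simply replaced:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=api;charset=utf8mb4', 'user', 'pass');

// Summarize the previous full hour of calls per client. REPLACE relies on the assumed
// unique key (client_id, date_start) to overwrite an existing summary row.
$pdo->exec("
    REPLACE INTO calls_summary (client_id, date_start, date_end, nb_calls)
    SELECT client_id,
           DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS date_start,
           DATE_FORMAT(created_at, '%Y-%m-%d %H:59:59') AS date_end,
           COUNT(*)                                     AS nb_calls
    FROM `call`
    WHERE created_at >= DATE_SUB(DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), INTERVAL 1 HOUR)
      AND created_at <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
    GROUP BY client_id, date_start, date_end");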
I have a website where users can post their articles, and I would like to give each author full stats about their articles' visits and referrers. The implementation seems quite straightforward here: just store a database record for every visit and then use aggregate functions to draw graphs and so on.
The problem is that articles receive about 300k views in 24 hours, so in just a month the stats table will get about 9 million records, which is a very big number because my server isn't particularly powerful.
Is there a solution to this kind of task? Is there an algorithm or caching mechanism that allows storing long-term statistics without losing accuracy?
P.S. Here is my original stats table:
visitid INT
articleid INT
ip INT
datetime DATETIME
Assuming a home-brewed usage-tracking solution (as opposed to, say, GA as suggested in the other response), a two-database setup may be what you are looking for:
a "realtime" database which captures the vist events as they come.
an "offline" database where the data from the "realtime" database is collected on a regular basis, for being [optionally] aggregated and indexed.
The purpose for this setup is mostly driven by operational concerns. The "realtime" database is not indexed (or minimally indexed), for fast insertion, and it is regularly emptied, typically each night, when the traffic is lighter, as the "offline" database picks up the events collected through the day.
Both databases can have the very same schema, or the "offline" database may introduce various forms of aggregation. The specific aggregation details applied to the offline database can vary greatly depending on the desire to keep the database's size in check and depending on the data which is deemed important (most statistics/aggregation functions introduce some information loss, and one needs to decide which losses are acceptable and which are not).
Because of the "half life" nature of the value of usage logs, whereby the relative value of details decays with time, a common strategy is to aggregate info in multiple tiers, whereby the data collected in the last, say, X days remains mostly untouched, the data collected between X and Y days is partially aggregated, and finally, data older than Y days only keep the most salient info (say, number of hits).
Unless you're particularly keen on storing your statistical data yourself, you might consider using Google Analytics or one of its modern counterparts, which are much better than the old remotely hosted hit counters of the 90s. You can find the API to the Google Analytics PHP interface at http://code.google.com/p/gapi-google-analytics-php-interface/