We are currently developing an API and we want to provide an analytics dashboard for our clients to view metrics about their calls per month/day/hour.
The current strategy we have come up with is to save each call in a separate per-client table (e.g. calls_{client_id}) for historical reasons, and to have a summary table (e.g. calls_summary) containing the number of calls in a given hour of a day for each client.
Then, each day a cron job will create an XML file with the summary of the previous day's calls for each client, and the dashboard will read those files instead of the database. So the only analytics task that touches the database will be the cron job.
For infrastructure we are thinking of MySQL replication, with the slave acting as the analytics database.
Is that strategy useful and valid for real web statistics?
Can you propose any tuning of that strategy, or even a totally different one?
save each call in a separate per-client table (e.g. calls_{client_id}) for historical reasons
No. Don't break the rules of normalization unless you've got good reason. It won't improve performance and could actually be very detrimental. It'll certainly make your code more complex and therefore less reliable.
It might be worth archiving off old records on a period-by-period basis, but unless you know that you're going to run into performance problems I'd advise against this.
By all means pre-consolidate the data into another table (provided you are getting a reduction in the number of rows of at least 95%). But don't bother converting it to XML unless and until you need the data in that format.
As for how you pre-consolidate: either use period-based consolidation (e.g. roll up by date) or use flagging to record which records have already been consolidated.
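For the date-based variant, the roll-up could look something like the sketch below (the calls and calls_summary table/column names are assumptions based on the question, and it relies on a unique key over (client_id, call_date, call_hour)):

-- Sketch only: assumes calls(client_id, created_at) plus a unique key on
-- calls_summary(client_id, call_date, call_hour).
INSERT INTO calls_summary (client_id, call_date, call_hour, nb_calls)
SELECT client_id, DATE(created_at), HOUR(created_at), COUNT(*)
FROM calls
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY client_id, DATE(created_at), HOUR(created_at)
ON DUPLICATE KEY UPDATE nb_calls = VALUES(nb_calls);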
The less frequently you run the consolidation, the bigger each run's impact on performance. But run it too frequently and you'll have problems with contention/locking.
Without knowing a lot about the structure and volume of data, or the constraints in terms of budget, availability and timeliness, it's hard to provide an optimal solution. But if it were me, I'd probably go with three mysqld tiers: one providing the transactional write facility, one replicating this data and generating the consolidated data, and one providing read access to the consolidated data (master <-> master <-> slave).
Performance-wise, it is a bad idea to create a separate table for each client. The classic approach would be the following:
client: id, name, address, ...
call: id, client_id, created_at, duration, ...
calls_summary: id, client_id, date_start, date_end, nb_calls
Now if you want to retrieve all the calls of a client, you go like this:
-- note: `call` needs backticks because CALL is a reserved word in MySQL
SELECT * FROM client
LEFT JOIN `call` ON `call`.client_id = client.id
WHERE client.id = 42
Or:
SELECT * FROM `call` WHERE client_id = 42
I don't see any reason for using XML; your cron could simply update the calls_summary table.
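For example, the cron job could populate it along these lines (a sketch only, using the column names listed above; the hourly bucket format is an assumption):

-- Sketch: roll yesterday's calls into hourly buckets per client.
INSERT INTO calls_summary (client_id, date_start, date_end, nb_calls)
SELECT client_id,
       DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS date_start,
       DATE_FORMAT(created_at, '%Y-%m-%d %H:59:59') AS date_end,
       COUNT(*) AS nb_calls
FROM `call`
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY client_id, date_start, date_end;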
I currently have a MySQL database which deals with a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) coming in real time. The structure is such that users belong to sub-affiliates and sub-affiliates belong to affiliates.
I need to keep a balance of clicks. For the sake of simplicity, let's say I need to increase the click balance by 1 (there is actually more processing, depending on the event) for each of the user, the sub-affiliate and the affiliate. Currently I do it very simply: once I receive the event, I run sequential queries in PHP - I read the user's balance, increment it by one and store the new value, then I read the sub-affiliate's balance, increment and write, etc.
The user's balance is the most important metric for me, so I want to keep it as real-time as possible. Other metrics on the sub-affiliate and affiliate level are less important, but the closer they are to real-time, the better; I think a 5-minute delay might be OK.
As the project grows, it is already becoming a bottleneck, and I am now looking at alternatives - how to redesign the calculation of balances. I want to ensure that the new design will be able to crunch 50 million events per day. It is also important for me not to lose a single event, and I currently wrap each cycle of changes to click balances in an SQL transaction.
Some things I am considering:
1 - Create a cron job that will update the balances on the sub-affiliate and affiliate level not in real time, let's say every 5 mins.
2 - Move the number crunching and balance updates to the database itself by using stored procedures. I am considering adding a separate database; maybe Postgres would be better suited for the job? I tried to see if there is a serious performance improvement, but the Internet seems divided on the topic.
3 - Moving this particular data stream to something like hadoop with parquet (or Apache Kudu?) and just add more servers if needed.
4 - Sharding the existing db, basically adding a separate db server for each affiliate.
Are there some best practices / technologies for this type of task or some obvious things that I could do? Any help is really appreciated!
My advice for High Speed Ingestion is here. In your case, I would collect the raw information in the ping-pong table it describes, then have the other task summarize that table and do mass UPDATEs of the counters. When there is a burst of traffic, this approach becomes more efficient rather than keeling over.
Click balances (and "Like counts") should be in a table separate from all the associated data. This helps avoid interference with other activity in the system. And it is likely to improve the cacheability of the balances if you have more data than can be cached in the buffer_pool.
Note that my design does not include a cron job (other than perhaps as a "keep-alive"). It processes a table, flips tables, then loops back to processing -- as fast as it can.
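Roughly, the "summarize, then mass UPDATE" step could look like the sketch below (the staging and balance table names/columns are illustrative, not taken from the linked write-up):

-- Sketch: assumes a staging table click_events(user_id, sub_affiliate_id, affiliate_id)
-- filled by the ingestion task, and balance tables keyed on their id columns.
INSERT INTO user_balance (user_id, clicks)
SELECT user_id, COUNT(*) FROM click_events GROUP BY user_id
ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks);

INSERT INTO sub_affiliate_balance (sub_affiliate_id, clicks)
SELECT sub_affiliate_id, COUNT(*) FROM click_events GROUP BY sub_affiliate_id
ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks);

-- ...same pattern for the affiliate balances; afterwards the processed staging
-- table is truncated (or swapped with its ping-pong partner) and the loop repeats.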
If I were you, I would use Redis as in-memory storage and increment your metrics there. It's very fast and reliable, and you can also read from it. Then create a cron job that persists that data into the MySQL DB.
Is your web tier doing the number crunching as it receives and processes the HTTP request? If so, the very first thing you will want to do is move this to a work queue and process these events asynchronously. I believe you hint at this in your item 3.
There are many solutions, and choosing one is outside the scope of this answer, but some packages to consider:
Gearman/PHP
Sidekiq/Ruby
Amazon SQS
RabbitMQ
NSQ
...etc...
In terms of storage, it really depends on what you're trying to achieve: fast reads, fast writes, bulk reads, sharding/distribution, high availability... the answer to each points you in a different direction.
This sounds like an excellent candidate for Clustrix, which is a drop-in replacement for MySQL. They do something like sharding, but instead of putting data in separate databases, they split it and replicate it across nodes in the same DB cluster. They call it slicing, and the DB does it automatically for you, transparently to the developers. There is a good performance paper on it that shows how it's done, but the short of it is that it is a scale-out OLTP DB that also happens to be able to absorb large amounts of analytical processing on real-time data.
I'm in a situation where I have to build a statistics module that can store user-related statistical information.
Basically, all that's stored is an event identifier, a datetime, the number of times this event has been fired, and the id of the object being interacted with.
I've made similar systems before, but never anything that has to store as much information as this one.
My suggestion would be a simple table in the database, e.g. "statistics", containing the following columns:
id (Primary, auto-increment)
amount (integer)
event (enum: list, click, view, contact)
datetime (datetime)
object_id (integer)
Usually, this method works fine, enabling me to store statistics about the object in a given timeframe (inserting a new row every hour or every 15 minutes, so the statistics update every 15 minutes).
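For illustration, the per-bucket upkeep could be a single upsert like the sketch below (it assumes a unique key on (object_id, event, datetime); the 15-minute rounding and the literal values are just examples):

-- Sketch: bump the counter for the current 15-minute bucket (900 seconds).
INSERT INTO statistics (object_id, event, `datetime`, amount)
VALUES (123, 'click',
        FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(NOW()) / 900) * 900),
        1)
ON DUPLICATE KEY UPDATE amount = amount + 1;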
Now, my questions are:
Are there better or more optimized methods of building a custom statistics module?
Since this new site will receive massive traffic, how do I deal with the trade-off that an index on object_id will slow down updates?
How do you even achieve live statistics like e.g. Google Analytics? Is this solely about server size and processing power, or is there a best practice?
I hope my questions are understandable, and I'm looking forward to getting wiser on this topic.
Best regards,
Jonas
I believe one of the issues you are going to run into is wanting the two worlds of transactional and analytical at once. That is fine in small cases, but becomes a problem when you start to scale, especially into the realm of 500M+ records.
I would suggest separating the two: you generate events and keep track of just the event itself. You then run analytical queries to get things such as the count of events per object interaction. You could have these counts or other metric calculations aggregated into a report table periodically.
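For instance, the periodic aggregation could look like this sketch (the events and event_report table/column names are assumptions):

-- Sketch: fold yesterday's raw events into a per-object report table.
INSERT INTO event_report (object_id, event, event_count)
SELECT object_id, event, COUNT(*)
FROM events
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY object_id, event
ON DUPLICATE KEY UPDATE event_count = event_count + VALUES(event_count);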
As for tracking the events themselves, you could either keep them in a table of event occurrences, or have something in front of the database doing the tracking and providing the periodic aggregations to the database. Think of the world of monitoring systems, which use collection agents to generate events that go to an aggregation layer, which then writes a periodic metric snapshot to an analytical area (e.g. collectd to StatsD / Graphite to Whisper).
Disclaimer: I am an architect for InfiniDB.
Not sure what kind of data source you are using, but as you grow and decide how much history to keep, you will probably face sizing problems, as most people do when collecting event or monitoring data. If you are on MySQL / MariaDB / PostgreSQL, I would suggest you check out InfiniDB (an open-source columnar MPP database for analytics); it is fully open source (GPLv2) and will provide the performance you need to run queries over billions of rows and TBs of data to answer those analytical questions.
The system I'm working on is structured as below, given that I'm planning to use Joomla as the base.
a (www.a.com), b (www.b.com), c (www.c.com) are search portals which allow users to search for reservations.
x (www.x.com), y (www.y.com), z (www.z.com) are hotels where bookings are made by users.
www.a.com's users can only search for bookings in www.x.com
www.b.com's users can only search for bookings in www.x.com and www.y.com
www.c.com's users can search for bookings in www.x.com, www.y.com and www.z.com
All of a, b, c, x, y, z run the same system, but they should have separate domains. So according to my findings and research, the architecture should be as above, where an API integrates all database calls.
Only 6 instances are shown here (a, b, c, x, y, z); there can be up to 100, with different search combinations.
My problems,
Should I maintain a single database for the whole system? If so, how can I unplug one instance if required (e.g. removing www.a.com or www.z.com from the system)? Since I'm using MySQL, won't the number of records become cumbersome for the system?
If I maintain a separate database for each instance, how can I do the search? How can I combine the required records and search across them?
Is there a different database approach that should be used rather than those mentioned above?
The problem you describe is "multitenancy" - it's a fairly tricky problem to solve, but luckily, others have written up some useful approaches. (Though the link is to Microsoft, it applies to most SQL environments except in the details).
The trade-offs in your case are:
Does the data from your hotels fit into a single schema? Do their "vacancy" records have the same fields?
How many hotels will there be? 3 separate databases is kinda manageable; 30 is probably not; 300 is definitely not.
How large will the database grow? How many vacancy records?
How likely is it that the data structures will change over time? How likely is it that one hotel will need a change that the others don't?
By far the simplest to manage and develop against is the "single database" model, but only if the data is moderately homogenous in schema, and as long as you can query the data with reasonable performance. I'd not worry about putting a lot of records in MySQL - it scales very well.
In such a design, you'd map "portal" to "hotel" in a lookup table:
PortalHotelAccess
PortalID  HotelID
-----------------
A         X
B         X
B         Y
C         X
C         Y
C         Z
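A portal's search then simply joins through this mapping, for example (a sketch; the vacancy table and its columns are assumptions):

-- Sketch: list the vacancies a given portal is allowed to see.
SELECT v.*
FROM PortalHotelAccess pha
JOIN vacancy v ON v.HotelID = pha.HotelID
WHERE pha.PortalID = 'B';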
I can suggest two approaches. Which one to choose depends on some additional information about the whole system. In fact, the main question is whether or not your system can impersonate (legally substitute for) any of the data providers (x, y, z, etc.) from the consumers' point of view (a, b, c, etc.).
Centralized DB
The first one is actually based on your original scheme with a centralized API. It implies a single search engine collecting the required data from the data sources, aggregating it in its own DB, and providing it to the data consumers.
This is most likely the preferable solution if the data sources differ in their data representation, so that you need to preprocess the data for uniformity. This variant also protects your clients from connectivity problems: if one of the source sites goes offline for a short period (I think even up to several hours, without a great impact on how current the booking data is), you can still handle requests against the offline site and store all new documents in the central DB until the problem is solved. On the other hand, this means you have to provide some sort of two-way synchronization between your DB and every data source site. The centralized DB should also be designed with reliability in mind from the start, so it seems it should be distributed (preferably over different data centers).
As a result, this approach will probably give the best user experience, but will require significant effort for a robust implementation.
Multiple Local DBs
If every data provider runs its own DB, but all of them (including the backend APIs) are based on a single standard, you can eliminate the need to copy their data into a central DB. Of course, the central point remains, but it hosts middle-layer logic only, without a DB. That layer is actually an API which binds (x, y, z) to the appropriate (a, b, c) - that is, configuration, nothing more. Every consumer site hosts a widget (it can be plain JavaScript or a fully-fledged web application) loaded from your central point with the appropriate settings embedded into it.
The widget will request all specified backends directly, and aggregate their results in a single list.
This variant is much like how most of today's web applications work; it's simpler to implement, but more error-prone.
I have a website where users can post their articles, and I would like to give each article's author full stats about its visits and referrers.
The problem is that articles receive about 300k views in 24 hours, so in just a month the stats table will get about 9 million records, which is a very big number given that my server isn't very powerful.
Is there a solution to this kind of task? Is there an algorithm or caching mechanism that allows storing long-term statistics without losing accuracy?
P.S. Here is my original stats table:
visitid INT
articleid INT
ip INT
datetime DATETIME
Assuming a home-brewed usage-tracking solution (as opposed to, say, GA as suggested in another response), a two-database setup may be what you are looking for:
a "realtime" database which captures the vist events as they come.
an "offline" database where the data from the "realtime" database is collected on a regular basis, for being [optionally] aggregated and indexed.
The purpose for this setup is mostly driven by operational concerns. The "realtime" database is not indexed (or minimally indexed), for fast insertion, and it is regularly emptied, typically each night, when the traffic is lighter, as the "offline" database picks up the events collected through the day.
Both databases can have the very same schema, or the "offline" database may introduce various forms of aggregation. The specific aggregation details applied to the offline database can vary greatly depending on the desire to keep the database's size in check and depending on the data which is deemed important (most statistics/aggregation functions introduce some information loss, and one needs to decide which losses are acceptable and which are not).
Because of the "half life" nature of the value of usage logs, whereby the relative value of details decays with time, a common strategy is to aggregate info in multiple tiers, whereby the data collected in the last, say, X days remains mostly untouched, the data collected between X and Y days is partially aggregated, and finally, data older than Y days only keep the most salient info (say, number of hits).
Unless you're particularly keen on storing your statistical data yourself, you might consider using Google Analytics or one of its modern counterparts, which are much better than the old remotely hosted hit counters of the 90s. You can find the API to the Google Analytics PHP interface at http://code.google.com/p/gapi-google-analytics-php-interface/
I have a site where the users can view quite a large number of posts. Every time this is done I run a query similar to UPDATE table SET views=views+1 WHERE id = ?. However, there are a number of disadvantages to this approach:
There is no way of tracking when the pageviews occur - they are simply incremented.
Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
Therefore I consider employing an approach where I create a table, say:
object_views { object_id, year, month, day, views }, so that each object has one row per day in this table. I would then periodically update the views column in the objects table so that I don't have to do expensive joins all the time.
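For illustration, that periodic update could be something like the sketch below (it assumes the objects table has id and views columns, as described above):

-- Sketch: refresh the denormalized counter from the per-day rows.
UPDATE objects o
JOIN (
    SELECT object_id, SUM(views) AS total_views
    FROM object_views
    GROUP BY object_id
) v ON v.object_id = o.id
SET o.views = v.total_views;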
This is the simplest solution I can think of, and it seems that it is also the one with the least performance impact. Do you agree?
(The site is built on PHP 5.2, Symfony 1.4 and Doctrine 1.2, in case you wonder.)
Edit:
The purpose is not web analytics - I know how to do that, and that is already in place. There are two purposes:
Allow the user to see how many times a given object has been shown, for example today or yesterday.
Allow the moderators of the site to see simple view statistics without going into Google Analytics, Omniture or whatever solution. Furthermore, the results in the backend must be realtime, a feature which GA cannot offer at this time. I do not wish to use the Analytics API to retrieve the usage data (not realtime, GA requires JavaScript).
Quote: Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
There is much more to it than this. This is a database killer.
I suggest you make a table like this:
object_views { object_id, timestamp}
This way you can aggregate on object_id (with the COUNT() function).
So every time someone views the page, you INSERT a record into the table.
Once in a while you must clean out the old records in the table. The UPDATE statement is EVIL :)
On most platforms it will basically mark the row as deleted and insert a new one, thus fragmenting the table. Not to mention locking issues.
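For example, the two recurring pieces might look like the sketch below (the 90-day retention window is an arbitrary example):

-- Aggregate views per object from the raw rows:
SELECT object_id, COUNT(*) AS views
FROM object_views
GROUP BY object_id;

-- Periodically purge old raw rows instead of updating them:
DELETE FROM object_views
WHERE `timestamp` < NOW() - INTERVAL 90 DAY;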
Hope that helps
Along the same lines as Rage, you simply are not going to get the same results doing it yourself when there are a million third-party log tools out there. If you are tracking on a daily basis, then a basic program such as WebTrends is perfectly capable of tracking the hits, especially if your URL contains the IDs of the items you want to track... I can't stress this enough: it's all about the URL when it comes to these tools (WordPress, for example, allows lots of different URL constructs).
Now, if you are looking into "impression" tracking, then it's another ball game, because you are probably tracking each object, the page, the user, and possibly a weighted value based on location on the page. If this is the case, you can keep your performance up by hosting the tracking on another server where you can fire and forget. In the past I handled this using SQL updates against the ID and a string version of the date... that way, when the date changes from 20091125 to 20091126, it's a simple query without the overhead of, say, a DATEDIFF function.
First, just a quick remark: why not combine the year, month and day into a DATETIME column? It would make more sense in my mind.
Also, I am not really sure of the exact reason you are doing this; if it's for marketing/web stats purposes, you would be better off using a tool made for that.
There are two big families of tools capable of giving you an idea of your website access statistics: log-based ones (AWStats is probably the most popular) and AJAX/1-pixel-image-based ones (Google Analytics would be the most popular).
If you prefer to build your own stats database, you can probably build a log parser easily using PHP. If you find parsing Apache logs (or IIS logs) too much of a burden, you could make your application output custom logs formatted in a simpler way.
One other possible solution is to use memcached; the daemon provides a counter that you can increment. You can log views there and have a script collect the results every day.
If you're going to do that, why not just log each access? MySQL can cache inserts in continuous tables quite well, so there shouldn't be a notable slowdown due to the insert. You can always run SHOW PROFILES to see what the performance penalty actually is.
On the datetime issue, you can always use GROUP BY MONTH(accessed_at), YEAR(accessed_at) or WHERE MONTH(accessed_at) = 11 AND YEAR(accessed_at) = 2009.
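Written out in full, that might look like the sketch below (it reuses the object_views naming from the other answer; accessed_at stands in for whatever the timestamp column ends up being called):

-- Monthly view counts for one object.
SELECT YEAR(accessed_at) AS yr, MONTH(accessed_at) AS mon, COUNT(*) AS views
FROM object_views
WHERE object_id = 42
GROUP BY yr, mon
ORDER BY yr, mon;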