How to build a proper Database for a traffic analytics system?

How do I build a proper structure for an analytics service? Currently I have one table that stores data about every user who visits a page, together with my client's ID, so that my clients can later see the statistics for a specific date.
I've thought about it a bit today and I'm wondering: let's say I have 1,000 clients and each gets around 1,000 impressions on their sites daily. That means I get 1,000,000 (1M) new records every day in a single table. How will it work after 2 months or so (when the table reaches 60 million records)?
I suspect that after some time it will have so many records that the PHP queries that pull out the data will be really heavy, slow and take a lot of resources. Is that true, and how can I prevent it?
A friend of mine is working on something similar and is going to make a new table for every client. Is this the correct way to go?
Thanks!

The problem you are facing is an I/O-bound system. 1 million records a day is roughly 12 write queries per second. That's achievable, but pulling the data out while writing at the same time will leave your system bound at the disk level.
What you need to do is configure your database to support the I/O volume you'll be handling: use an appropriate storage engine (InnoDB and not MyISAM), make sure you have a fast enough disk subsystem (RAID, not single drives, since they can and will fail at some point), design your database well, and inspect your queries with EXPLAIN to see where you might have gone wrong with them. You might even use a different storage engine; personally, I'd use TokuDB if I were you.
Also, I sincerely hope you'll be doing your querying, sorting and filtering on the database side and not on the PHP side.

Consider the Google Analytics Platform Components Overview page and pay special attention to the way the data is written to the database, simply based on the architecture of the entire system.
Instead of writing everything to your database right away, you could write everything to a log file, then process the log later (perhaps at a time when the traffic isn't so high). At the end of the day, you'll still need to make all of those writes to your database, but if you batch them together and do them when that kind of load is more tolerable, your system will scale a lot better.
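The log-then-batch idea above could be sketched roughly as follows. This is only an illustration under assumptions: it uses Python with an in-memory SQLite database so that it runs anywhere, and the `impressions` table, its columns, the CSV log format and the function names are all made up for the example, not taken from the question.

```python
import csv
import os
import sqlite3
import tempfile


def log_impression(log_path, client_id, page, ts):
    """Per request: append one line to a plain CSV log, no database work."""
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([client_id, page, ts])


def load_log_into_db(conn, log_path):
    """Off-peak batch job: bulk-insert the whole log in one transaction."""
    with open(log_path, newline="") as f:
        rows = [(int(c), p, t) for c, p, t in csv.reader(f)]
    with conn:  # a single commit for the whole batch
        conn.executemany(
            "INSERT INTO impressions (client_id, page, visited_at) "
            "VALUES (?, ?, ?)",
            rows,
        )
    os.remove(log_path)  # the log has been consumed
    return len(rows)


# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions (client_id INTEGER, page TEXT, visited_at TEXT)"
)
log_path = os.path.join(tempfile.mkdtemp(), "impressions.log")
log_impression(log_path, 1, "/home", "2024-01-01 10:00:00")
log_impression(log_path, 1, "/about", "2024-01-01 10:00:05")
log_impression(log_path, 2, "/home", "2024-01-01 10:01:00")
inserted = load_log_into_db(conn, log_path)
```

A real setup would also have to rotate the log safely while requests are still appending to it; the sketch ignores that.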

You could normalize the impressions data like this:
Client Table
{
ID
Name
}
Pages Table
{
ID
Page_Name
}
PagesClientsVisits Table
{
ID
Client_ID
Page_ID
Visits
}
and just increment visits on the final table on each new impression. Then the maximum number of records in there becomes (No. of clients * No. of pages)
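A minimal sketch of the "increment on each impression" step, assuming Python and SQLite purely for illustration. The table name `pages_clients_visits`, the unique key on (client, page) and the helper `record_visit` are assumptions added for the example.

```python
import sqlite3


def record_visit(conn, client_id, page_id):
    """Increment the counter for (client, page); create the row on first visit."""
    with conn:
        cur = conn.execute(
            "UPDATE pages_clients_visits SET visits = visits + 1 "
            "WHERE client_id = ? AND page_id = ?",
            (client_id, page_id),
        )
        if cur.rowcount == 0:  # no counter row yet for this pair
            conn.execute(
                "INSERT INTO pages_clients_visits (client_id, page_id, visits) "
                "VALUES (?, ?, 1)",
                (client_id, page_id),
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages_clients_visits ("
    " id INTEGER PRIMARY KEY,"
    " client_id INTEGER NOT NULL,"
    " page_id INTEGER NOT NULL,"
    " visits INTEGER NOT NULL DEFAULT 0,"
    " UNIQUE (client_id, page_id))"
)
for _ in range(3):
    record_visit(conn, 1, 10)
record_visit(conn, 1, 11)
```

Note that a bare counter like this drops the time of each visit. Since the question asks for statistics per date, a date column would probably have to become part of the key, which raises the row count to clients * pages * days.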

Having a table with 60 million records can be OK. That is what a database is for. But you should be careful about how many fields you have in the table, and about the datatype (and therefore the size) of each field.
You create some kind of reports on the data. Think about what data you really need for those reports. For example, you might only need the number of visits per user on every page. A simple count would do the trick.
You can also generate the report every night and delete the raw data afterwards.
So, read and think about it.
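The nightly "generate the report, then delete the raw data" step might look something like this. It is a sketch, assuming Python with SQLite for portability; the `impressions` and `daily_report` tables and the `roll_up_day` helper are hypothetical names.

```python
import sqlite3


def roll_up_day(conn, day):
    """Aggregate one day's raw impressions into daily_report, then delete them."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO daily_report (day, client_id, page, visits) "
            "SELECT date(visited_at), client_id, page, COUNT(*) "
            "FROM impressions WHERE date(visited_at) = ? "
            "GROUP BY date(visited_at), client_id, page",
            (day,),
        )
        conn.execute("DELETE FROM impressions WHERE date(visited_at) = ?", (day,))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions (client_id INTEGER, page TEXT, visited_at TEXT)"
)
conn.execute(
    "CREATE TABLE daily_report "
    "(day TEXT, client_id INTEGER, page TEXT, visits INTEGER)"
)
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?)",
    [
        (1, "/home", "2024-01-01 09:00:00"),
        (1, "/home", "2024-01-01 18:30:00"),
        (2, "/home", "2024-01-01 12:00:00"),
        (1, "/home", "2024-01-02 08:00:00"),  # next day, must stay untouched
    ],
)
roll_up_day(conn, "2024-01-01")
```

Running both statements in one transaction means a failure cannot delete raw rows that were never summarised.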

Related

Handling large number of MySQL tables

I have an API which is being used by around 200 websites right now. The number is expected to grow very soon. I need to store information about each visitor (IP address etc.) on my clients' websites. The number of daily visitors for each client ranges from 2000 to 50000. That means I am adding 400000 to 500000 rows every day. For that, right now I am making a different table for each client.
Now the problem is when I try to fetch data from all tables combined, it takes a lot of time. How should I handle this? How should I store the data?
Thanks!

I always try to keep tables to a minimum in my schemas. Perhaps you should make a client table with relevant client information and then have a visitor table with all the visitor information. Link the two with a foreign key.

Since all the tables are the same, I'd just keep the visitor information in one table, with a column to identify the client / website.
The question then is whether a large table like that will still perform... Obviously you need your indexing and so on, but here are a couple of ideas:
Partitioning: I know nothing about partitioning on MySQL (but have tried it on PostgreSQL). The idea is to design the physical data storage to suit your data retrieval / work needs. Might be an idea if your table gets huge.
"Live" and "archive" tables. I'm sure there's proper terminology for this. Again, depending on how you're analysing your data, you can keep the data for today / this week / this month / whatever period you need in the "live" table where new records are added, then have housekeeping functions that move older records to a larger archive table. The idea would be to keep only the records you want to analyse frequently in the smaller live table, so query performance is fast.
Lastly, you might be pleasantly surprised by the performance of MySQL even on large tables. I've got a PostgreSQL table with several million records and performance is more than adequate without any playing around.
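The "live" and "archive" housekeeping idea could be sketched like this. The example assumes Python with SQLite so it is self-contained; `visitors_live`, `visitors_archive` and `archive_older_than` are invented names, and timestamps are assumed to be ISO-8601 strings so that plain string comparison orders them correctly.

```python
import sqlite3

# Both tables share one column layout so rows can be copied with SELECT *.
SCHEMA = "(client_id INTEGER, ip TEXT, visited_at TEXT)"


def archive_older_than(conn, cutoff):
    """Move rows older than `cutoff` from the live table to the archive."""
    with conn:  # copy and delete in one transaction
        conn.execute(
            "INSERT INTO visitors_archive "
            "SELECT * FROM visitors_live WHERE visited_at < ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM visitors_live WHERE visited_at < ?", (cutoff,)
        )
    return cur.rowcount  # number of rows moved


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitors_live " + SCHEMA)
conn.execute("CREATE TABLE visitors_archive " + SCHEMA)
conn.executemany(
    "INSERT INTO visitors_live VALUES (?, ?, ?)",
    [
        (1, "10.0.0.1", "2024-01-01 10:00:00"),
        (1, "10.0.0.2", "2024-01-15 10:00:00"),
        (2, "10.0.0.3", "2024-02-01 10:00:00"),
    ],
)
moved = archive_older_than(conn, "2024-02-01 00:00:00")
```

Queries over recent data then only touch the small live table, while reports over all time would need to read both.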

Do not store raw data in MySQL. Put the visitor data into a queue (based on Redis, RabbitMQ, etc.) and store only the aggregated data that is necessary for your business model.

Insert a row every given time else update previous row (Postgresql, PHP)

I have multiple devices (eleven, to be specific) which send information every second. This information is received by an Apache server, parsed by a PHP script, stored in the database and finally displayed in a GUI.
What I am doing right now is checking whether a row for the current day exists; if it doesn't, I create a new one, otherwise I update it.
The reason I do it like that is that I need to poll the information from the database and display it in a C++ application to make it look sort of real-time. If I were to create a row every time a device sent information, processing and reading the data would take a significant amount of time as well as system resources (memory, CPU, etc.), making the display of the data not quite real-time.
I wrote a report generation tool which takes the information for every day (from 00:00:00 to 23:59:59) and put it in an excel spreadsheet.
My questions are basically:
Is it possible to do the insertion/updating part directly in the database server, or do I have to do the logic in the PHP script?
Is there a better (more efficient) way to store the information without a decrease in performance in the display device?
Regarding the report generation: if I want to sample intervals, let's say starting from yesterday at 15:50:00 and ending today at 12:45:00, it cannot be done with my current data structure. What do I need to consider in order to make a data structure which would allow me to create such queries?
The components I use:
- Apache 2.4.4
- PostgreSQL 9.2.3-2
- PHP 5.4.13

My recommendation: just store all the information your devices are sending. With proper indexes and queries you can process and retrieve information from the DB really fast.
For your questions:
Yes, it is possible to build any logic you desire inside the Postgres DB using SQL, PL/pgSQL, PL/PHP, PL/Java, PL/Python and many other languages available for Postgres.
As I said before, proper indexing can do magic.
If you cannot get the desired query speed with the full table, you can create a small table with 1 row for every device and keep the last known values in it, to show them in sort of real-time.
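One way to picture the "store everything, plus a small last-known-values table" suggestion is the sketch below. It assumes Python with SQLite for the sake of a runnable example; the `readings` and `latest_reading` tables and the `store_reading` helper are hypothetical, and `INSERT OR REPLACE` is SQLite syntax (PostgreSQL would need its own upsert form).

```python
import sqlite3


def store_reading(conn, device_id, value, ts):
    """Keep the full history and refresh the one-row-per-device snapshot."""
    with conn:
        conn.execute(
            "INSERT INTO readings (device_id, value, recorded_at) "
            "VALUES (?, ?, ?)",
            (device_id, value, ts),
        )
        # device_id is the primary key, so this overwrites the previous row.
        conn.execute(
            "INSERT OR REPLACE INTO latest_reading "
            "(device_id, value, recorded_at) VALUES (?, ?, ?)",
            (device_id, value, ts),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (device_id INTEGER, value REAL, recorded_at TEXT)"
)
conn.execute("CREATE INDEX idx_readings_time ON readings (recorded_at)")
conn.execute(
    "CREATE TABLE latest_reading "
    "(device_id INTEGER PRIMARY KEY, value REAL, recorded_at TEXT)"
)
store_reading(conn, 1, 20.5, "2024-01-01 00:00:00")
store_reading(conn, 1, 21.0, "2024-01-01 00:00:01")
store_reading(conn, 2, 18.0, "2024-01-01 00:00:01")
store_reading(conn, 1, 21.5, "2024-01-01 00:00:02")
```

The display application would then poll `latest_reading` (eleven rows in the asker's case), while reports read the full `readings` history.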

1) The technique is called upsert. In PG 9.1+ it can be done with a writable CTE (wCTE) (http://www.depesz.com/2011/03/16/waiting-for-9-1-writable-cte/)
2) If you really want it to be real-time, you should send the data directly to the application; storing it in memory or in a plain-text file will also be faster if you only care about the last few values. But PG does have LISTEN/NOTIFY channels, so your lag will probably be just 100-200 milliseconds, which shouldn't matter much given that you're only displaying it.

I think you are overestimating the memory system requirements given the process you have described. Adding a row of data every second (or 11 per second) is not a hog of resources. In fact it is likely more time consuming to UPDATE vs ADD a new row. Also, if you add a TIMESTAMP to your table, sort operations are lightning fast. Just add some garbage collection handling as a CRON job (deletion of old data) once a day or so and you are golden.
However to answer your questions:
Is it possible to do the insertion/updating part directly in the database server or do I have to do the logic in the php script?
Writing logic from within the database engine is usually not very straightforward. To keep it simple, stick with the logic in the PHP script. UPDATE (or) INSERT INTO table SET var1='assignment1', var2='assignment2' (WHERE id = 'checkedID')
Is there a better (more efficient) way to store the information without a decrease in performance in the display device?
It's hard to answer because you haven't described the display device connectivity. There are more efficient ways to do the process, but none that have the locking mechanisms required for such frequent updating.
Regarding the report generation, if I want to sample intervals lets say starting from yesterday at 15:50:00 and ending today at 12:45:00 it cannot be done with my current data structure, so what do I need to consider in order to make a data structure which would allow me to create such queries.
You could use a TIMESTAMP column type. This would include the DATE and TIME of the UPDATE operation. Then it's just a simple WHERE clause using DATE functions within the database query.
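With one row per reading and a timestamp column, the interval from the question becomes a plain range filter. The sketch below assumes Python with SQLite, a hypothetical `readings` table, and fixed example dates standing in for "yesterday" and "today".

```python
import sqlite3


def readings_between(conn, start, end):
    """Return all readings with start <= recorded_at <= end, oldest first."""
    return conn.execute(
        "SELECT device_id, value, recorded_at FROM readings "
        "WHERE recorded_at >= ? AND recorded_at <= ? "
        "ORDER BY recorded_at",
        (start, end),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (device_id INTEGER, value REAL, recorded_at TEXT)"
)
# An index on the timestamp lets the range filter avoid a full table scan.
conn.execute("CREATE INDEX idx_readings_time ON readings (recorded_at)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (1, 1.0, "2024-01-01 15:49:59"),  # just before the interval
        (1, 2.0, "2024-01-01 15:50:00"),
        (2, 3.0, "2024-01-02 12:45:00"),
        (2, 4.0, "2024-01-02 12:45:01"),  # just after the interval
    ],
)
rows = readings_between(conn, "2024-01-01 15:50:00", "2024-01-02 12:45:00")
```

This only works because each reading keeps its own timestamp; a single row per day that gets overwritten cannot answer the same question.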

Processing and matching large amounts of data

I have one large database table of request data, much like Apache request logs, of about 50 million rows:
request_url
user_agent
created
that contains data like this:
/profile/Billy
Mozilla.....
2012-06-17...
/profile/Jane
Mozilla.....
2012-06-17...
I then have my user database table, with all my user data including usernames.
At the moment, every night, I process the request data for the previous day, row by row and see if it contains an URL that matches one of the usernames in the users table. If it does, I increment a total in another table that stores stats that allows users to see how many pageviews they got for any particular day.
However as the datasets grow, this is becoming resource intensive and can also take a long time to complete, even when grouping the request data by URL and grabbing a count for that group.
Is there a better way of processing this information to get the end result I need? The request data is going to be logged anyway, so it would be preferable to generate the stats after the fact rather than incrementing the total on every page view.
I'm running this on one server, so distributed processing of the data on multiple servers isn't required.

Start with a fresh log-table every day. When the day is done, use it to increment the totals, then append it to that huge main log-table and delete it.

Incrementing the total on every page view is your best option. It saves the trouble of a "search" later on for each user separately. It's just one extra UPDATE query on every pageview, and thus the processing load is spread out throughout the day instead of hitting all at once (plus your stats stay up to date all the time, instead of being updated daily).
If you insist on doing it in SQL, you might consider
SELECT COUNT(request_url) FROM your_table WHERE request_url LIKE '%/profile/username%'
(though I am not sure if that's what you're already doing?)
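If the stats do have to be computed after the fact, one set-based query per day may be cheaper than a row-by-row loop or one LIKE query per user. The sketch below assumes Python with SQLite, hypothetical `requests` and `users` tables, and that a profile URL is exactly `/profile/<username>`; none of that is confirmed by the question.

```python
import sqlite3


def daily_profile_views(conn, day):
    """Count one day's pageviews per known user with a single join."""
    return conn.execute(
        "SELECT u.username, COUNT(*) AS views "
        "FROM requests r "
        "JOIN users u ON r.request_url = '/profile/' || u.username "
        "WHERE date(r.created) = ? "
        "GROUP BY u.username ORDER BY u.username",
        (day,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (request_url TEXT, user_agent TEXT, created TEXT)"
)
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO users VALUES (?)", [("Billy",), ("Jane",)])
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [
        ("/profile/Billy", "Mozilla", "2012-06-17 08:00:00"),
        ("/profile/Jane", "Mozilla", "2012-06-17 09:00:00"),
        ("/profile/Billy", "Mozilla", "2012-06-17 10:00:00"),
        ("/profile/Nobody", "Mozilla", "2012-06-17 11:00:00"),  # not a user
        ("/profile/Jane", "Mozilla", "2012-06-18 09:00:00"),  # other day
    ],
)
views = daily_profile_views(conn, "2012-06-17")
```

An equality join like this can use an index, whereas a pattern with a leading wildcard such as `LIKE '%/profile/...%'` generally cannot.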

Start looking into an analytic database like Infobright. Column-based storage engines are huge in big data initiatives and are built for doing in-memory analytics on aggregates as well as ad hoc querying.
Disclaimer: the author is affiliated with Infobright.

MySql queries at certain time

I'm supposed to query a MySQL database once a day and display the data on a page... and this sounds like a cron job. I have never done this before and I'd like your opinion.
If I make the query once a day, I have to save the data in a file, let's say an XML file, and every time the page reloads, it has to parse the data from that file.
From my point of view, it would be faster and more user friendly to make the query every time the page loads, as the data would be refreshed ...
Any help please ....
Thanks for your answers, I'll update my question ... I don't think the queries would be extensive: something like finding the most popular categories from articles, the most popular cities where the authors are from ... three of those queries. So the data pulled out of the database will rely on only two tables, three at most, and only one will have dynamic data; the others will be small ones.
I didn't ask yet why ... because it is not available at the moment ...

It all depends on the load on the server. If users are requesting this data a few times a day, then pulling the data on each request should be ok (KISS first). However, if they are slamming the server many times and the request is slow on top of that, then you should store the data off. I would just suggest storing it to a table and just clearing the table each night on a successful reload.

If this is a normal query that doesn't take long to execute, there is no reason to cache the result in a file. MySQL also has caching built in, which may be closer to what you want.

That would depend on the complexity of the query. If the "query" is actually going through a lot of work to build a dataset, or querying a dozen different database servers, I can see only doing it once per day.
For example, if you own a chain of stores across 30 states and 5 countries, each with their own stock levels, and you want to display local stock levels on your website, I can see only going through the trouble of doing that once per day...
If efficiency is the only concern, it should be pretty easy to estimate which is better:
Time to run Query + (Time to load xml x estimated visits)
versus
Time to run Query x Estimated Visits
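The comparison above is easy to put into numbers. A small sketch in Python, with made-up timings in seconds; the function names and the sample figures are assumptions for illustration, not measurements.

```python
def cached_cost(query_time, load_time, visits):
    """Run the query once, then load the cached XML on every visit."""
    return query_time + load_time * visits


def live_cost(query_time, visits):
    """Run the query on every visit."""
    return query_time * visits


def cheaper_strategy(query_time, load_time, visits):
    """Return 'cache' or 'live', whichever has the lower total time."""
    if cached_cost(query_time, load_time, visits) < live_cost(query_time, visits):
        return "cache"
    return "live"


# A slow 2 s query against a 10 ms file load, 1000 visits a day:
slow_query = cheaper_strategy(2.0, 0.01, 1000)
# A fast 5 ms query against the same 10 ms file load:
fast_query = cheaper_strategy(0.005, 0.01, 1000)
```

With a cheap query, parsing the file on every request can cost more than simply asking the database, which matches the other answers here.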

Tracking the views of a given row

I have a site where the users can view quite a large number of posts. Every time this is done I run a query similar to UPDATE table SET views=views+1 WHERE id = ?. However, there are a number of disadvantages to this approach:
There is no way of tracking when the pageviews occur - they are simply incremented.
Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
Therefore I consider employing an approach where I create a table, say:
object_views { object_id, year, month, day, views }, so that each object has one row per day in this table. I would then periodically update the views column in the objects table so that I wouldn't have to do expensive joins all the time.
This is the simplest solution I can think of, and it seems that it is also the one with the least performance impact. Do you agree?
(The site is built on PHP 5.2, Symfony 1.4 and Doctrine 1.2, in case you wonder.)
Edit:
The purpose is not web analytics - I know how to do that, and that is already in place. There are two purposes:
Allow the user to see how many times a given object has been shown, for example today or yesterday.
Allow the moderators of the site to see simple view statistics without going into Google Analytics, Omniture or whatever solution. Furthermore, the results in the backend must be realtime, a feature which GA cannot offer at this time. I do not wish to use the Analytics API to retrieve the usage data (not realtime, GA requires JavaScript).

Quote: Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
There is much more to it than this. This is a database killer.
I suggest you make a table like this:
object_views { object_id, timestamp }
This way you can aggregate on object_id (with the count() function).
So every time someone views the page, you INSERT a record into the table.
Once in a while you must clean the old records out of the table. The UPDATE statement is EVIL :)
On most platforms it will basically mark the row as deleted and insert a new one, thus making the table fragmented. Not to mention locking issues.
Hope that helps
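A rough sketch of the insert-per-view table with counting and cleanup, assuming Python with SQLite so that it runs as is. The column is called `viewed_at` here instead of `timestamp`, and the helper names are made up for the example.

```python
import sqlite3


def record_view(conn, object_id, ts):
    """One INSERT per page view; nothing is ever updated in place."""
    with conn:
        conn.execute(
            "INSERT INTO object_views (object_id, viewed_at) VALUES (?, ?)",
            (object_id, ts),
        )


def views_on(conn, object_id, day):
    """Count the views of one object on one calendar day (YYYY-MM-DD)."""
    return conn.execute(
        "SELECT COUNT(*) FROM object_views "
        "WHERE object_id = ? AND viewed_at >= ? "
        "AND viewed_at < date(?, '+1 day')",
        (object_id, day, day),
    ).fetchone()[0]


def purge_older_than(conn, cutoff):
    """Periodic cleanup of old rows; returns how many were deleted."""
    with conn:
        cur = conn.execute(
            "DELETE FROM object_views WHERE viewed_at < ?", (cutoff,)
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE object_views (object_id INTEGER, viewed_at TEXT)")
conn.execute(
    "CREATE INDEX idx_views ON object_views (object_id, viewed_at)"
)
record_view(conn, 7, "2009-11-25 09:00:00")
record_view(conn, 7, "2009-11-25 23:59:59")
record_view(conn, 7, "2009-11-26 00:00:00")
record_view(conn, 8, "2009-11-25 12:00:00")
```

Because every view keeps its own timestamp, "today" and "yesterday" counts from the question come from the same table without a separate per-day counter.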

Along the same lines as Rage, you simply are not going to get the same results doing it yourself when there are a million third party log tools out there. If you are tracking on a daily basis, then a basic program such as webtrends is perfectly capable of tracking the hits especially if your URL contains the ID's of the items you want to track... I can't stress this enough, it's all about the URL when it comes to these tools (Wordpress for example allows lots of different URL constructs)
Now, if you are looking into "impression" tracking then it's another ball game because you are probably tracking each object, the page, the user, and possibly a weighted value based upon location on the page. If this is the case you can keep your performance up by hosting the tracking on another server where you can fire and forget. In the past I worked this using SQL updating against the ID and a string version of the date... that way when the date changes from 20091125 to 20091126 it's a simple query without the overhead of let's say a datediff function.

First, just a quick remark: why not combine the year, month and day into a DATETIME? It would make more sense in my mind.
Also, I am not really sure what the exact reason is for doing this. If it's for a marketing/web stats purpose, you had better use a tool made for that purpose.
There are two big families of tools capable of giving you an idea of your website access statistics: log-based ones (AWStats is probably the most popular) and Ajax/1-pixel-image-based ones (Google Analytics would be the most popular).
If you prefer to build your own stats database, you can probably manage to build a log parser easily using PHP. If you find parsing Apache logs (or IIS logs) too much of a burden, you could make your application output some custom logs formatted in a simpler way.
Another possible solution is to use memcached; the daemon provides a kind of counter that you can increment. You can log views there and have a script collect the results every day.

If you're going to do that, why not just log each access? MySQL can cache inserts in continuous tables quite well, so there shouldn't be a notable slowdown due to the insert. You can always run SHOW PROFILES to see what the performance penalty actually is.
On the datetime issue, you can always use GROUP BY MONTH( accessed_at ) , YEAR( accessed_at) or WHERE MONTH(accessed_at) = 11 AND YEAR(accessed_at) = 2009.
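The grouping idea above can be tried out with the sketch below. It assumes Python with SQLite, where `strftime('%Y', ...)` and `strftime('%m', ...)` stand in for MySQL's `YEAR()` and `MONTH()`; the `access_log` table and the function names are hypothetical.

```python
import sqlite3


def views_per_month(conn):
    """Count rows per (year, month) from a single datetime column."""
    return conn.execute(
        "SELECT strftime('%Y', accessed_at) AS year, "
        "strftime('%m', accessed_at) AS month, COUNT(*) "
        "FROM access_log GROUP BY year, month ORDER BY year, month"
    ).fetchall()


def views_in_range(conn, start, end):
    """Count rows with start <= accessed_at < end, using a plain range."""
    return conn.execute(
        "SELECT COUNT(*) FROM access_log "
        "WHERE accessed_at >= ? AND accessed_at < ?",
        (start, end),
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (object_id INTEGER, accessed_at TEXT)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [
        (1, "2009-10-31 23:59:59"),
        (1, "2009-11-01 00:00:00"),
        (2, "2009-11-25 12:00:00"),
    ],
)
per_month = views_per_month(conn)
november = views_in_range(conn, "2009-11-01", "2009-12-01")
```

For a single month, the range form in `views_in_range` is usually friendlier to an index on the datetime column than wrapping the column in `MONTH()` and `YEAR()`.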
