Banner Impressions Tracking - Database Design - PHP

Looking for some good advice on DB design for tracking multiple banner impressions.
I.e. I have 5 banners over x domains.
I would like to build up data on each banner, such as how many impressions per day per banner, and also be able to do lookups over other date ranges.
Would it be best to have one row per banner per day, or to track each impression as its own row?
Hope you can advise.
And thanks in advance.

I'd recommend creating the most flexible design, one that allows you to create new reports as requirements expand in the future. You suggest that the customer wants reports on "impressions per day". What if they come in later and ask "what time of the day are impressions shown most?" How about "when are they clicked on most?"
So the most flexible way to do this is to have one record for each impression, where each record is just:
banner_id
timestamp
Later on, you can create a stored procedure that aggregates historical data and then purges the HUGE amounts of raw data you have accumulated, producing reports at whatever level of granularity you care about. I can imagine storing hourly data for a month, and daily data for a year. The stored procs would just write to an archive table:
Banner ID
Time interval identifier (of the month/year for monthly data, or day/month/year for daily data, etc)
Number of impressions
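A minimal sketch of what those two tables and the aggregation step might look like in MySQL (all table and column names here are illustrative, not from the question):

CREATE TABLE impressions (
    banner_id INT NOT NULL,
    shown_at DATETIME NOT NULL,
    KEY idx_banner_time (banner_id, shown_at)
);

CREATE TABLE impressions_daily (
    banner_id INT NOT NULL,
    day DATE NOT NULL,
    impressions INT NOT NULL,
    PRIMARY KEY (banner_id, day)
);

-- Roll up raw rows older than a month, then purge them
INSERT INTO impressions_daily (banner_id, day, impressions)
SELECT banner_id, DATE(shown_at), COUNT(*)
FROM impressions
WHERE shown_at < CURDATE() - INTERVAL 1 MONTH
GROUP BY banner_id, DATE(shown_at);

DELETE FROM impressions
WHERE shown_at < CURDATE() - INTERVAL 1 MONTH;

The composite index keeps per-banner, per-day reports fast while the raw table is still large.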

Why reinvent the wheel? There are plenty of free ad servers. The most notable one I've heard of is OpenX (used to be phpAdsNew). If nothing else, you can install it and see how they set up their DB.


Time Prediction based on existing date:time records

I have a system that logs date:time and it returns results such as:
05.28.2013 11:58pm
05.27.2013 10:20pm
05.26.2013 09:47pm
05.25.2013 07:30pm
05.24.2013 06:24pm
05.23.2013 05:36pm
What I would like is a list of date:time predictions for the next few days, so a person could see when the next event might occur.
Example of prediction results:
06.01.2013 04:06pm
05.31.2013 03:29pm
05.30.2013 01:14pm
Thoughts on how to go about doing time prediction of this kind with PHP?
The basic answer is "no". Programming tools are not designed to do prediction. Statistical tools are designed for that purpose. You should be thinking more about R, SPSS, SAS, or some other similar tool. Some databases have rudimentary data analysis tools built-in, which is another (often inferior) option.
The standard statistical technique for time-series prediction is ARIMA (autoregressive integrated moving average) analysis. It is unlikely that you are going to be implementing that in PHP/SQL. The standard statistical technique for estimating time between events is Poisson regression. It is also highly unlikely that you are going to be implementing that in PHP/SQL.
I observe that your data points are once per day in the evening. I might guess that this is the end of some process that runs during the day. The end time is based on the start time and the duration of the process.
What can you do? Often a reasonable prediction is "what happened yesterday". You would be surprised at how hard it is to beat this prediction for weather forecasting and for estimating the stock market. Another very reasonable method is the average of historical values.
If you know something about your process, then an average by day of the week can work well. You can also get more sophisticated, and do Monte Carlo estimates, by measuring the average and standard deviation, and then pulling a random value from a statistical distribution. However, the average value would work just as well in your case.
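For instance, assuming the log lived in a MySQL table such as events(event_ts DATETIME) (a hypothetical name, not from the question), the average event time by day of the week is a single query:

SELECT DAYOFWEEK(event_ts) AS dow,
       SEC_TO_TIME(AVG(TIME_TO_SEC(TIME(event_ts)))) AS avg_time
FROM events
GROUP BY DAYOFWEEK(event_ts);

The avg_time for the upcoming weekday is then your naive prediction.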
I would suggest that you study a bit about statistics/data mining/predictive analytics before attempting to do any "predictions". At the very least, if you really have a problem in this domain, you should be looking for the right tools to use.
As Gordon Linoff posted, the simple answer is "no", but you can write some code that will give a rough guess at what the next time will be.
I wrote a very basic example on how to do this on my site http://livinglion.com/2013/05/next-occurrence-in-datetime-sequence/
Here's a possible way that this could be done, using PHP + MySQL:
You can have a table with two fields: a DATE field and a TIME field (essentially storing the date + time portion separately). Say that the table is named "timeData" and the fields are:
eventDate: date
eventTime: time
Your primary key would be the combination of eventDate and eventTime, so that they're never repeated as a pair.
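In MySQL, that table might be declared like this (a sketch):

CREATE TABLE timeData (
    eventDate DATE NOT NULL,
    eventTime TIME NOT NULL,
    PRIMARY KEY (eventDate, eventTime)
);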
Then, you can do a query like:
SELECT eventTime, count(*) as counter FROM timeData GROUP BY eventTime ORDER BY counter DESC LIMIT 0, 10
The query above will always return the 10 most frequent event times, ordered by frequency. You can then re-order these from smallest to largest.
This way, you can return reasonably accurate time predictions, which will become more accurate as you gather data each day.

Caching a large SQL query - best way of structuring?

Let me give an example of my issue. Let's say I have a table called users and a table called payments. To calculate a user's total balance, I'd use a query to get all the payments after a certain date and then cache the result for a while.
However, I was wondering: due to the nature of this, would it be a good idea to have a column in the users table called balance, and then, when the cache expires, use a different query that gathers only the more recent payments and adds that amount to whatever is in the balance column?
You can create an additional table that always contains the user's current balance. If a new payment is added for the user, that balance needs to be updated too. Use a transaction so that adding the payment and updating the total balance stay aligned.
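A rough sketch of that transaction, with hypothetical payments and user_balances tables:

START TRANSACTION;
INSERT INTO payments (user_id, amount, paid_at) VALUES (1138, 25.00, NOW());
UPDATE user_balances SET balance = balance + 25.00 WHERE user_id = 1138;
COMMIT;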
If you need this more fine-grained, you can keep, next to the user relation, a date column representing the interval you need to calculate over, e.g. the week number or month number, so you can give a view back into the past.
If you need more flexibility, you can after some time compress existing payments into a total value and store that in such a balance table, keyed by user and date.
You can then UNION it with the table of payments that are "realtime", for the dates not yet compressed/condensed, and use an aggregate function to SUM the total balance. This might give you the best of both worlds: recent data keeps full detail, and after some time you can move it out of the data store, keeping only statistical values.
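Sketched as a query, assuming a hypothetical balance_archive table for the condensed totals and '2013-05-01' as the cutoff for data not yet compressed:

SELECT user_id, SUM(amount) AS total_balance
FROM (
    SELECT user_id, amount FROM balance_archive
    UNION ALL
    SELECT user_id, amount FROM payments WHERE paid_at >= '2013-05-01'
) AS combined
WHERE user_id = 1138
GROUP BY user_id;

Note the UNION ALL: a plain UNION would silently collapse identical rows before they are summed.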
Generally, with these kinds of "pre-calculated" values, I find that the most pain-free way is to store/update them on save of any model that concerns the data.
So in short, update the total balance whenever a new payment is saved. That way you can guarantee that your database and your data will always be in sync.
The pre-calculation can be done either with a MySQL trigger or in a background task with something like Gearman.
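For the trigger route, a minimal sketch (column names are assumptions, not from the question):

CREATE TRIGGER payments_after_insert
AFTER INSERT ON payments
FOR EACH ROW
    UPDATE users SET balance = balance + NEW.amount WHERE id = NEW.user_id;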
But as your own question suggested, if you want to do some kind of incremental roll-up of the balance, I would advise going by months or some other fixed date range. This works provided that you have no payment backdating or anything like that, where a payment could appear in an old month.
At the start of the new month, run a payment aggregator, and bam: you now only have to sum the monthly tables.
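That aggregator could be as simple as (table names assumed):

INSERT INTO monthly_payment_totals (user_id, month_start, total)
SELECT user_id, DATE_FORMAT(paid_at, '%Y-%m-01'), SUM(amount)
FROM payments
WHERE paid_at >= '2013-05-01' AND paid_at < '2013-06-01'
GROUP BY user_id;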
It all really depends on how much data you have to deal with. But again I stress, data consistency is a lot more valuable than speed, you can always buy more servers.

MySQL database with entries increasing by 1 million every month, how can I partition the database to keep a check on query time

I am a college undergrad working on a PHP and MySQL based inventory management system operating on a country-wide level. Its database size is projected to increase by 1 million plus entries every month, with a current size of about 2 million.
I need to prevent a steep increase in query time, which currently ranges from 7 to 11 seconds for most modules.
The thing is that the probability of accessing data entered in the last month is much higher than for any older data. So I believe partitioning the data on the basis of entry time should keep the query time in check. How can I achieve this?
Specifically speaking, I want a way to cache the last month's data so that every query searches for the product in the tables holding recent data first, and searches the rest of the data only if it is not found in the last month's data.
If you want to use the partitioning functions of MySQL, have a look at this article; a sketch of range partitioning follows the list below.
That being said, there are a few restrictions when using partitions:
every unique key (including the primary key) must include all columns used in the partitioning expression
you lose some database portability, as partitioning works quite differently in other databases
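For example, a minimal range-partitioning sketch (the table and column names are assumptions, not from the question):

ALTER TABLE inventory_entries
PARTITION BY RANGE (TO_DAYS(entered_at)) (
    PARTITION p2013_04 VALUES LESS THAN (TO_DAYS('2013-05-01')),
    PARTITION p2013_05 VALUES LESS THAN (TO_DAYS('2013-06-01')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

Queries with a WHERE clause on entered_at will then only touch the relevant partitions.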
You can also handle partitioning manually, by moving old records to an archive table at regular intervals. Of course, you will then also have to implement different code to read those archived records.
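A sketch of that manual archiving (again, names assumed):

INSERT INTO inventory_entries_archive
SELECT * FROM inventory_entries
WHERE entered_at < CURDATE() - INTERVAL 1 MONTH;

DELETE FROM inventory_entries
WHERE entered_at < CURDATE() - INTERVAL 1 MONTH;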
Also note that your query times seem quite long. I have worked with tables much larger than 2 million records with much better access times.

How to calculate percentile rank for point totals over different time spans?

On a PHP & CodeIgniter-based web site, users can earn reputation for various actions, not unlike Stack Overflow. Every time reputation is awarded, a new entry is created in a MySQL table with the user_id, action being rewarded, and value of that bunch of points (e.g. 10 reputation). At the same time, a field in a users table, reputation_total, is updated.
Since all this is sort of meaningless without a frame of reference, I want to show users their percentile rank among all users. For total reputation, that seems easy enough. Let's say my user_id is 1138. Just count the number of users in the users table with a reputation_total less than mine, count the total number of users, and divide to find the percentage of users with a lower reputation than mine. That'll be user 1138's percentile rank, right? Easy!
But I'm also displaying reputation totals over different time spans--e.g., earned in the past seven days, which involves querying the reputation table and summing all my points earned since a given date. I'd also like to show percentile rank for the different time spans--e.g., I may be 11th percentile overall, but 50th percentile this month and 97th percentile today.
It seems I would have to go through and find the reputation totals of all users for the given time span, and then see where I fall within that group, no? Is that not awfully cumbersome? What's the best way to do this?
Many thanks.
I can think of a few options off the top of my head here:
As you mentioned, total up the reputation points earned during the time range and calculate the percentile ranks based on that.
Track updates to reputation_total on a daily basis - so you have a table with user_id, date, reputation_total.
Add some new columns to the user table (reputation_total, reputation_total_today, reputation_total_last30days, etc) for each time range. You could also normalize this into a separate table (reputation_totals) to prevent you from having to add a new column for each time span you want to track.
Option #1 is the easiest, but it's probably going to get slow if you have lots of rows in your reputation transaction table - it won't scale very well, especially if you need to calculate these in real time.
Option #2 is going to require more storage over time (one row per user per day) but would probably be significantly faster than querying the transaction table directly.
Option #3 is less flexible, but would likely be the fastest option.
Both options 2 & 3 would likely require a batch process to calculate the totals on a daily basis, so that's something to consider as well.
I don't think any option is necessarily the best - they all involve different tradeoffs of speed/storage space/complexity/flexibility. What you do will ultimately depend on the requirements for your application of course.
I don't see why that would be overly complex. Generally, all you would need is to add to your WHERE clause a condition that limits the results, like:
WHERE DatePosted between #StartOfRange and #EndOfRange
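Building on that, the percentile rank for an arbitrary time span can be computed in one query. A sketch, assuming the reputation table has user_id, value, and DatePosted columns (partly my assumptions), with the dates as placeholders:

SELECT 100 * SUM(t.total < me.total) / COUNT(*) AS percentile
FROM (
    SELECT user_id, SUM(value) AS total
    FROM reputation
    WHERE DatePosted BETWEEN '2013-05-01' AND '2013-05-31'
    GROUP BY user_id
) AS t
CROSS JOIN (
    SELECT COALESCE(SUM(value), 0) AS total
    FROM reputation
    WHERE user_id = 1138
      AND DatePosted BETWEEN '2013-05-01' AND '2013-05-31'
) AS me;

Note that users who earned nothing in the span don't appear in the subquery, so this ranks you among active users only.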

Datewise record sorting/fetching from a large MySQL database

I have a separate table for every day's data, which is basically webstats type: keywords, visits, duration, IP, sale, etc. (maybe 100 bytes total per record).
Each table will have around a couple of million records.
What I need to do is have a web admin so that the user/admin can view reports for different date periods AND sorted by certain calculated values. For example, the user may want the results from the 15th of last month to the 12th of this month, sorted by SALE/VISIT, in descending order.
The admin/user only needs to view (say) the top 200 records at a time and will probably not view more than a few hundred total in any one session
Because of the arbitrary date period involved, I need to sum up the relevant columns for each record and only then can the selection be done.
My question is whether it will be possible to have the reports in real time, or would they be too slow (the tables are rarely, if ever, updated after the day's data has been inserted).
Is such a scenario better suited to indexes or table scans?
And also, would one massive table for all dates be better than having separate tables for each date (there are almost no joins)?
thanks in advance!
With a separate table for each day's data, summarizing across a month is going to involve doing the same analysis on each of 30-odd tables. Over a year, you will have to do the analysis on 365 or so tables. That's going to be a nightmare.
It would almost certainly be better to have a soundly indexed single table than that huge number of tables. Some DBMS support partitioned (fragmented) tables, and MySQL does, so partition the single big table by date. I would be inclined to partition by month, especially if the normal queries are for one month or less and do not cross month boundaries. (Even if a query involves two months, with decent partition pruning the query engine won't have to read most of the data, just the two partitions for the two months. It might even be able to do those scans in parallel, again depending on the DBMS.)
Sometimes it is quicker to do a sequential scan of a table than indexed lookups, so don't simply assume that because the query plan involves a table scan it will automatically perform badly.
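With the single-table design, the kind of report described in the question becomes one aggregate query (table and column names assumed):

SELECT keyword,
       SUM(sale) / SUM(visits) AS sale_per_visit
FROM webstats
WHERE stat_date BETWEEN '2013-05-15' AND '2013-06-12'
GROUP BY keyword
ORDER BY sale_per_visit DESC
LIMIT 200;

An index (or partitioning) on stat_date narrows the work to the date range; the grouping and sort still scan that slice, which is exactly the case where a partial table scan can be perfectly acceptable.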
You may want to try a different approach. I think Splunk would work for you; it was designed for this (they even run ads on this site). They have a free version you can try.
