Tracking page views and display daily, weekly, monthly results

Platform: PHP5, MySQL
I have a web site that displays articles that are sorted into categories. I would like to track the number of views for each article. Then, in the sidebar, display the top 5 viewed articles for the current day, within the past week, and within the last month. What do you think would be the best way to do that? One row in the database for each view (article_id, timestamp)? What would be the least amount of work for the server?
Thanks, joe

This can become a tricky problem. If you just store raw hits, your table will grow rapidly and crunching the numbers becomes more time consuming. So, one way to deal with this is to create aggregate tables and crunch the numbers using a cron job.
For example, you could have the following tables:
hit_count: article_id, timestamp
hit_count_daily: day, year, article_id, hit_count
hit_count_weekly: week, year, article_id, hit_count
hit_count_monthly: month, year, article_id, hit_count
hit_count_yearly: year, article_id, hit_count
You then process the data in the hit_count table, add it to the aggregate tables, and then remove the data from the hit_count table.
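As a rough sketch of the "crunch the numbers" step, the cron job could bucket each raw hit by day, ISO week, month, and year before adding the counts to the aggregate tables. The function name rollupHits, the array shapes, and the "period|article_id" key format are assumptions made for illustration; the database reads and writes are left out so the bucketing logic stands on its own.

```php
<?php
/**
 * Roll raw hits up into per-period counters (sketch, no database access).
 *
 * @param array $hits rows from hit_count, each ['article_id' => int, 'timestamp' => int]
 * @return array counters keyed by aggregate table name, then by "period|article_id"
 */
function rollupHits(array $hits): array
{
    // gmdate() formats: 'o' and 'W' are the ISO-8601 year and week number.
    $formats = [
        'hit_count_daily'   => 'Y-m-d',
        'hit_count_weekly'  => 'o-\WW',
        'hit_count_monthly' => 'Y-m',
        'hit_count_yearly'  => 'Y',
    ];
    $totals = array_fill_keys(array_keys($formats), []);

    foreach ($hits as $hit) {
        foreach ($formats as $table => $format) {
            $key = gmdate($format, $hit['timestamp']) . '|' . $hit['article_id'];
            $totals[$table][$key] = ($totals[$table][$key] ?? 0) + 1;
        }
    }
    return $totals;
}
```

Each resulting counter would then be added to the matching aggregate row (for example with INSERT ... ON DUPLICATE KEY UPDATE in MySQL), and the processed rows deleted from hit_count in the same transaction.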
You also need to think about what happens if someone refreshes the page or if Google crawls the article. Do you want to count those as hits?
To keep crawlers from triggering hits, you could use some JavaScript on the page to communicate back to your server and register the hit. This way, a normal browser will trigger the hit but a crawler that does not execute JavaScript will not.
You could also offload this task to another service, like Chartbeat or Clicky.

How about:
Add this to php.ini
auto_append_file = /server_root/footer.php
footer.php contains a silent routine to write the SQL data.

Related

MySQL Performance for Online Games Highscore Lists

I have a question about making "Highscore-Lists".
Let's say I have an online game with 1.000.000 active users. Each user has points from 0 to X. Now, I want to show a ranking list. It would be insane to show all million entries on one page, so it is divided into Y pages (100 entries per page => 10.000 pages).
I am not really sure how to solve it.
1. The easiest way to do that would be loading all 1m entries in one SELECT, getting the result, finding the current user with a for loop, and showing that specific page (but all other 999.900 entries would be kept in RAM even though they are not shown). For a page change I could just reuse the result data with no second database call. (So I don't care about point changes during that time.)
SELECT UserName, UserID, Points FROM UserAccount ORDER BY Points;
2. My second idea was to load each page individually, but then I do not know
2.1 if it really performs better
2.2 how to get the right start page, because I only have the user's points but not his actual rank
So how could I solve that problem? I don't really know what MySQL can handle. Are more small calls better than one huge call?
Can I even save huge result data?
The second solution would pick up all changed points with each page change, though I care more about performance than about an always up-to-date list.
Thank you for your help!
Markus

Use pagination. In SQL it's a "limit" clause:
SELECT UserName, UserID, Points FROM UserAccount ORDER BY Points LIMIT 0, 20;
The above query will return only the first 20 rows of the original selection.
You can pass page parameters via get, like this: highscore.php?page=1 or ?page=2 and so on.
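To also find the right start page for the current user (point 2.2 in the question), one option is to count how many players have more points and derive the page from that rank. The sketch below is only an illustration: the helper names pageOffset, pageForRank and pageSql are made up, the table and column names are taken from the question, and it assumes the list is sorted highest score first (ORDER BY Points DESC) with 1-based page numbers.

```php
<?php
/** Zero-based row offset for a 1-based page number. */
function pageOffset(int $page, int $perPage): int
{
    return (max(1, $page) - 1) * $perPage;
}

/** 1-based page on which a player with the given 1-based rank appears. */
function pageForRank(int $rank, int $perPage): int
{
    return intdiv(max(1, $rank) - 1, $perPage) + 1;
}

// Rank of one user: number of players with strictly more points, plus one.
// Bind the user's points to the placeholder in a prepared statement.
const RANK_SQL = 'SELECT COUNT(*) + 1 FROM UserAccount WHERE Points > ?';

/** SQL for one page of the list; both numbers are formatted as integers. */
function pageSql(int $page, int $perPage): string
{
    return sprintf(
        'SELECT UserName, UserID, Points FROM UserAccount ORDER BY Points DESC LIMIT %d OFFSET %d',
        $perPage,
        pageOffset($page, $perPage)
    );
}
```

With 100 entries per page, a user ranked 101st lands on page 2, so only that page's 100 rows are fetched. An index on Points keeps both queries cheap, and adding UserID as a second sort column would make the order stable when several users have the same score.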

Calculating page views in php / MySQL

I have a table article with many articles. I want to track the number of viewers for each article. Here's how I plan to do it.
I have a table viewers with columns: id, ip, date, article_id.
article_id is a foreign key referring to the id of the article. So, when a user opens an article, his/her IP address is stored in the table.
Is this a decent approach? It is for a personal site and not a very big website.
EDIT: I want to print the number of views on each article page.

It depends on how frequently you need to display the number of viewers. Your general query will be:
select count(*) from viewers
where article_id='10'
With time, your viewers table will grow. Say it has a million records after a year or two. If you are showing the number of viewers on each article page or displaying the articles with the most viewers, the count will start to hurt performance even though the foreign key is indexed. However, that will only happen after you have added hundreds of articles, each with thousands of viewers.
A better optimized solution may be to keep the number of viewers in the article table and use that to display results. The existing viewers table is still necessary to ensure there are no duplicate entries (the same user reading an article ten times must be counted as a single entry, not ten).

Use a tool like Google Analytics. It does the job in much more detail and you're up and running in minutes. There's more to unique visitors than IP addresses!
If you want an on-premise solution, look at PIWIK, which is a PHP application built for exactly this purpose.

There is one problem in this design: if the same user opens the article again and again, you either have to check before inserting the entry, or you end up inserting the same IP address multiple times with different timestamps.
Most popular sites count one IP address as one view even if that client or user opens the article several times.
I can think of two solutions:
Keep your approach but add a single check: if the same client has already opened the article, don't insert it.
Or group by IP address when you retrieve the counter.

It depends on what you want to store in your database. If you want to know exactly how many unique users visited this particular article (including date and IP), this is a reasonable way to do it. But if you only need a number to show, you can alter the article table to add a visit_counter column, and store a cookie to prevent incrementing the counter again for the same user.

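Try something like this:
// assumes $link is an open mysqli connection and viewers.ip is a UNIQUE integer column
// insert
$ip = ip2long($_SERVER['REMOTE_ADDR']); // false for non-IPv4 addresses
mysqli_query($link, "REPLACE INTO viewers (ip) VALUES (" . (int) $ip . ")");
// retrieve
list($pageviews) = mysqli_fetch_row(mysqli_query($link, "SELECT COUNT(ip) FROM viewers"));
echo $pageviews;
Read: REPLACE INTO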

Yes, this is a good approach if you create some kind of cache for displaying how many views each article has had. It's not optimal to count the views each time a user opens the website.
You can do it on the database side with something like a materialized view; for MySQL there is Flexviews ( https://code.google.com/p/flexviews/ ):
select article_id, count(*) as views from viewers group by article_id
Or you can cache it in files and refresh them at a fixed interval.
To record the users who viewed an article, I suggest using AJAX. When a user opens the page, a separate background request calls the site to add him as a viewer. Even if your database is slow, it will not slow down page loading, and web spiders that do not run JavaScript will not be counted.

How to generate statistics on article views last week

I see on sites that they sometimes have a statistic showing how many views an article or downloads a file had over the last week. I think download.com does something like this. I was wondering how they go about doing this. Are they actually keeping track of every day's downloads or am I missing something really basic?
Are they doing something like having three columns called total_downloads, last_week_downloads, this_week_downloads, then every week copying the value of this_week_downloads to last_week_downloads and resetting this_week_downloads to 0?

There are a couple of ways to do it, depending on what you're trying to get out of the stats.
One way is to include a visits column on your table, then just increase that number by 1 each time that article's page is loaded.
This however isn't very good for giving the past week's number of views. You can do that in 2 ways:
1) Add another column to your table that works the same as visits, but run a cron job to put it back to 0 every week.
2) Create another table which holds article_id, ip_address and timestamp. You would insert a record each time someone visits the article, storing their IP address (allowing you to roughly get page views and unique page views), and of course the timestamp allows you to query for only a subset of those records. Note: using this method you could store more information for stats, but it does require a lot more server resources.
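To illustrate what option 2 gives you, here is a small sketch that derives page views and (rough) unique page views for one article since a given time. It works on rows already loaded into an array so it can run on its own; the function name viewsSince and the array shape are assumptions for the example.

```php
<?php
/**
 * Count views and distinct IP addresses for one article since a timestamp.
 *
 * @param array $rows each ['article_id' => int, 'ip_address' => string, 'timestamp' => int]
 * @return array ['views' => int, 'unique' => int]
 */
function viewsSince(array $rows, int $articleId, int $since): array
{
    $views = 0;
    $ips = [];
    foreach ($rows as $row) {
        if ($row['article_id'] === $articleId && $row['timestamp'] >= $since) {
            $views++;
            $ips[$row['ip_address']] = true; // array keys act as a set of IPs
        }
    }
    return ['views' => $views, 'unique' => count($ips)];
}

// Example: views of article 123 over the last seven days.
// $stats = viewsSince($rows, 123, time() - 7 * 86400);
```

In practice the database would do this work, along the lines of SELECT COUNT(*), COUNT(DISTINCT ip_address) with a WHERE clause on article_id and timestamp, which is where the extra server load mentioned above comes from.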

The most basic way you can do this is to add a MySQL field alongside your article in the database and just increment it.
Assuming you were retrieving article 123 from your database, you would have something like this in your code:
<?php
// this would increment the number of views
$sql = "UPDATE table SET count_field=count_field+1 WHERE id=123";
...
?>

Optimizing queries for content popularity by hits

I've done some searching for this but haven't come up with anything, maybe someone could point me in the right direction.
I have a website with lots of content in a MySQL database and a PHP script that loads the most popular content by hits. It does this by logging each content hit in a table along with the access time. Then a select query is run to find the most popular content in the past 24 hours, 7 days or at most 30 days. A cronjob deletes anything older than 30 days from the log table.
The problem I'm facing now is that as the website grows, the log table has 1m+ hit records and it is really slowing down my select query (10-20s). At first I thought the problem was a join I had in the query to get the content title, url, etc. But now I'm not sure, as in testing, removing the join does not speed up the query as much as I thought it would.
So my question is: what is the best practice for this kind of popularity storing/selecting? Are there any good open source scripts for this? Or what would you suggest?
Table scheme
"popularity" hit log table
nid | insert_time | tid
nid: Node ID of the content
insert_time: timestamp (2011-06-02 04:08:45)
tid: Term/category ID
"node" content table
nid | title | status | (there are more but these are the important ones)
nid: Node ID
title: content title
status: is the content published (0=false, 1=true)
SQL
SELECT node.nid, node.title, COUNT(popularity.nid) AS count
FROM `node` INNER JOIN `popularity` USING (nid)
WHERE node.status = 1
AND popularity.insert_time >= DATE_SUB(CURDATE(),INTERVAL 7 DAY)
GROUP BY popularity.nid
ORDER BY count DESC
LIMIT 10;

We've just come across a similar situation and this is how we got around it. We decided we didn't really care about what exact 'time' something happened, only the day it happened on. We then did this:
Every record has a 'total hits' field which is incremented every time something happens
A logs table records these 'total hits' per record, per day (in a cron job)
By selecting the difference between two given dates in this log table, we can deduce the 'hits' between two dates, very quickly.
The advantage of this is the size of your log table is only as big as NumRecords * NumDays which in our case is very small. Also any queries on this logs table are very quick.
The disadvantage is you lose the ability to deduce hits by time of day but if you don't need this then it might be worth considering.
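A minimal sketch of the "difference between two dates" lookup, assuming the cron job has logged one snapshot of the running total per record per day. The function name hitsBetween and the 'Y-m-d' => total array shape are illustrative, not something from our actual code.

```php
<?php
/**
 * Hits for one record over a period, from daily snapshots of its running total.
 *
 * @param array  $snapshots 'Y-m-d' => total hits as logged at the end of that day
 * @param string $from      last day BEFORE the period (exclusive)
 * @param string $to        last day of the period (inclusive)
 * @return int|null hits in the period, or null if a snapshot is missing
 */
function hitsBetween(array $snapshots, string $from, string $to): ?int
{
    if (!isset($snapshots[$from], $snapshots[$to])) {
        return null; // no snapshot was logged for one of the two days
    }
    return $snapshots[$to] - $snapshots[$from];
}
```

Because only two rows per record are read regardless of how many hits occurred, the cost of the lookup does not grow with traffic.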

You actually have two problems to solve further down the road.
One, which you've yet to run into but which you might earlier than you want, is going to be insert throughput within your stats table.
The other, which you've outlined in your question, is actually using the stats.
Let's start with input throughput.
Firstly, in case you're doing so, don't track statistics directly in pages that could use caching. Use a PHP script that advertises itself as an empty JavaScript file, or as a one-pixel image, and include it on the pages you're tracking. Doing so allows you to readily cache the remaining content of your site.
In the telco business, rather than doing an actual insert for billing on each phone call, things are placed in memory and periodically synced to disk. Doing so makes it possible to handle gigantic throughput while keeping the hard drives happy.
To proceed similarly on your end, you'll need an atomic operation and some in-memory storage. Here's some memcache-based pseudo-code for doing the first part...
For each page, you need a Memcache variable. In Memcache, increment() is atomic, but a read followed by set() is not, and increment() fails on a key that does not exist yet. So you need to be wary of miscounting hits when concurrent processes create the counter for the same page at the same time:
$ns = $memcache->get('stats-namespace');
while (!$memcache->increment("stats-$ns-$page_id")) {
    $memcache->add("stats-$ns-$page_id", 0, 1800); // garbage collect in 30 minutes
    $db->upsert('needs_stats_refresh', array($ns, $page_id)); // engine = memory
}
Periodically, say every 5 minutes (configure the timeout accordingly), you'll want to sync all of this to the database, without any possibility of concurrent processes affecting each other or existing hit counts. For this, you increment the namespace before doing anything (this gives you a lock on existing data for all intents and purposes), and sleep a bit so that existing processes that reference the prior namespace finish up if needed:
$ns = $memcache->get('stats-namespace');
$memcache->increment('stats-namespace');
sleep(60); // allow concurrent page loads to finish
Once that is done, you can safely loop through your page ids, update stats accordingly, and clean up the needs_stats_refresh table. The latter only needs two fields (page_id int pkey, ns_id int). There's a bit more to it than simple select, insert, update and delete statements run from your scripts, however, so continuing...
As another replier suggested, it's quite appropriate to maintain intermediate stats for your purpose: store batches of hits rather than individual hits. At the very most, I'm assuming you want hourly or quarter-hourly stats, so it's fine to deal with subtotals that are batch-loaded every 15 minutes.
Even more importantly for your sake, since you're ordering posts using these totals, you want to store the aggregated totals and have an index on the latter. (We'll get to where further down.)
One way to maintain the totals is to add a trigger which, on insert or update to the stats table, will adjust the stats total as needed.
When doing so, be especially wary of deadlocks. While no two $ns runs will be mixing their respective stats, there is still a (however slim) possibility that two or more processes fire up the "increment $ns" step described above concurrently, and subsequently issue statements that seek to update the counts concurrently. Obtaining an advisory lock is the simplest, safest, and fastest way to avoid problems related to this.
Assuming you use an advisory lock, it's perfectly OK to use total = total + subtotal in the update statement.
While on the topic of locks, note that updating the totals will require an exclusive lock on each affected row. Since you're ordering by them, you don't want them processed all in one go because it might mean keeping an exclusive lock for an extended duration. The simplest approach here is to process the inserts into stats in smaller batches (say, 1000), each followed by a commit.
For intermediary stats (monthly, weekly), add a few boolean fields (bit or tinyint in MySQL) to your stats table. Have each of these store whether the row is to be counted towards the monthly, weekly, daily stats, etc. Place a trigger on them as well, in such a way that they increase or decrease the applicable totals in your stat_totals table.
As a closing note, give some thought to where you want the actual count to be stored. It needs to be an indexed field, and it is going to be heavily updated. Typically, you'll want it stored in its own table, rather than in the pages table, in order to avoid cluttering your pages table with (much larger) dead rows.
Assuming you did all the above, your final query becomes:
select p.*
from pages p join stat_totals s using (page_id)
order by s.weekly_total desc limit 10
It should be plenty fast with the index on weekly_total.
Lastly, let's not forget the most obvious of all: if you're running these same total/monthly/weekly/etc queries over and over, their result should be placed in memcache too.

You can add indexes and try tweaking your SQL, but the real solution here is to cache the results.
You should really only need to calculate the last 7/30 days of traffic once daily,
and you could do the past 24 hours hourly.
Even if you did it once every 5 minutes, that's still a huge saving over running the (expensive) query for every hit of every user.
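A minimal file-based sketch of that idea: return the stored result while it is younger than a given number of seconds, otherwise recompute it once and store it again. The function name cachedResult, the cache directory, and the key are placeholders; memcache or APC would follow the same pattern.

```php
<?php
/**
 * Return a cached value if it is younger than $ttl seconds,
 * otherwise call $compute, store its result, and return it.
 * Note: a computed value of false is never served from the cache.
 */
function cachedResult(string $cacheDir, string $key, int $ttl, callable $compute)
{
    $file = $cacheDir . '/' . md5($key) . '.cache';
    clearstatcache(true, $file); // make sure the file age is read fresh

    if (is_file($file) && time() - filemtime($file) < $ttl) {
        $cached = @unserialize((string) file_get_contents($file));
        if ($cached !== false) {
            return $cached;
        }
    }

    $value = $compute();
    // Write to a temporary file and rename it, so readers never see a half-written cache.
    $tmp = $file . '.' . uniqid('', true) . '.tmp';
    file_put_contents($tmp, serialize($value));
    rename($tmp, $file);
    return $value;
}

// Example (hypothetical names): run the expensive popularity query at most every 5 minutes.
// $top = cachedResult('/tmp/stats-cache', 'top-7-days', 300, function () use ($db) {
//     return $db->query('...')->fetchAll();
// });
```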

RRDtool
Many tools/systems do not build their own logging and log aggregation but use RRDtool (round-robin database tool) to efficiently handle time-series data. RRDtool also comes with a powerful graphing subsystem, and (according to Wikipedia) there are bindings for PHP and other languages.
From your questions I assume you don't need any special and fancy analysis and RRDtool would efficiently do what you need without you having to implement and tune your own system.

You can do some 'aggregation' in the background, for example with a cron job. Some suggestions (in no particular order) that might help:
1. Create a table with hourly results. This means you can still create the statistics you want, but you reduce the amount of data to 24*7*4 = 672 records per page per four-week month.
Your table can be somewhere along the lines of this:
hourly_results (
nid integer,
start_time datetime,
amount integer
)
After you have rolled the raw hits up into your aggregate table, you can more or less delete them.
2. Use result caching (memcache, apc)
You can easily store the results (which should not change every minute, but rather every hour?) in a memcache database (which you can also update from a cronjob), in the apc user cache (which you can't update from a cronjob), or in a file cache by serializing objects/results if you're short on memory.
3. Optimize your database
10 seconds is a long time. Try to find out what is happening with your database. Is it running out of memory? Do you need more indexes?

What is the right way to count article comments, hits, and likes in an articles index?

I have four tables: one for articles, one for comments, one for likes, and one for visits, as in this example schema
**news**
news_id
**comments**
comment_id
news_id
**likes**
like_id
news_id
**hits**
hit_id
news_id
What I want to do is list all the articles in a sortable index, with a box/div for each article showing its count of hits, comments, and likes. I know how to do all this, so it's not the how I am seeking but the best way. I am thinking about these two solutions:
1. Do it the normal way: a complex SQL query, then cache the query, let's say for an hour or two.
2. Write a script that is executed every two or three hours to calculate the data and store it in the same news table in "news_hits, news_likes, news_comments" numeric fields.
3. And of course the third way is to run the query each time the page is loaded, without any caching.
I feel that method number one is the one I should go with, but I wanted a professional or experienced opinion. I am not expecting a huge number of visitors, around 500-1000 a day at most, but I still want to be prepared for high traffic.
thank you,
Rami

It would be best to accept some redundancy in this case, to improve speed. To the news table, add these fields:
comments_count int not null default 0,
likes_count int not null default 0,
hits_count int not null default 0
When a comment/like/hit is added/deleted, if the database supports triggers, trigger an increment/decrement of the referenced counter, and if not - do it manually on each insert/delete (stored procedure maybe?).
This type of data is read more often than it is written, so trading slightly slower writes and a little extra storage for faster reads isn't a big deal.
From time to time, it would be OK to run a query that recalculates these counters, in case they have become inaccurate for some reason.

Break the complex SQL into several smaller (less complex) queries and cache the individual results, so that whenever you want to warm up the cache, it won't take too many database resources.

With such a simple model, query and low number of visitors I would go for the straight query. It will execute just fine (milliseconds) with proper indexing.
If I understand the scenario correctly, the query should sort news articles by their popularity, which is determined in some way by the nr of likes/hits/comments.
If you are set on fixing a performance problem you may not actually run into, the simplest "solution" would be to use a query cache that expires every 10 seconds. With your current load, each visitor would basically always render the view from the database, since the cache expires between page visits. If one day you are suddenly overrun with, say, 200,000 visitors, you would still only perform the query once every 10 seconds.
