Using PHP5 and the latest version of MySQL, I want to track impressions and clicks for business listings. My question is: if I did this myself, what would be the best way to store the data so I can run reports? Previously I just had a table with the listing ID, the user's IP address, whether it was a click or an impression, and the date it was tracked. However, the database is approaching 2GB and it's very slow. Part of the problem is that it's a pretty simple script that records impressions and clicks from anyone and anything that accesses the listing page, including search engines.
Is there an API or file out there with an up-to-date list that can detect whether the visitor is an actual person and not a spider, so I don't fill up the database with unneeded stats? I'm just looking for suggestions. Should I have a raw table that just collects the hits, then a cron job at night that tallies up the day for each listing and each IP and stores the cumulative stats in a different table?
Also, what type of table should it be: InnoDB or MyISAM?
I would think that you will never create something better than what is already out there, so I'd use Google Analytics. If you want to use it on the admin side of a site (for a client to run, maybe), you can always use Google's API and pull the data as you need it. Here is where I'd look: http://code.google.com/intl/en-US/apis/analytics/
HTH. Cheers, Jeremy
Just in case you need to distinguish real users from bots, here's a simple solution: use javascript for sending reports to the server.
Let's say you have a link and you want to track when it's clicked. Then add an onclick handler, which will send a decent report to the server. Here's an example:
<a href="/some/page" onclick="track('click', 'some-page')">Some page</a>
The track function would look like this one:
function track(action, data) {
    // Fire-and-forget beacon: requesting this image hits /track.php,
    // which records the event server-side.
    var img = new Image();
    img.src = '/track.php?action=' + encodeURIComponent(action) +
              '&data=' + encodeURIComponent(data);
}
So in this case, when a user clicks the link, the information about the click is sent to the server by this piece of JavaScript code. Most bots don't run JavaScript, so they won't be counted. There is one drawback though: if a user has disabled JavaScript in their browser, your tracking script won't count them. Obviously, you'll need to implement the track.php script in order to store the data.
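For completeness, here is a rough sketch of what track.php could look like; the table name, column names, and connection details are placeholders assumed for illustration, not anything prescribed above:

<?php
// track.php - minimal sketch of the endpoint hit by the Image() beacon above.
// Assumes a "tracking" table with (action, data, ip, created_at) columns;
// adjust names and credentials to your own schema.
$action = isset($_GET['action']) ? $_GET['action'] : '';
$data   = isset($_GET['data'])   ? $_GET['data']   : '';

// Whitelist the actions you expect so random requests don't pollute the stats.
if (in_array($action, array('click', 'impression'), true)) {
    $pdo  = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
    $stmt = $pdo->prepare(
        'INSERT INTO tracking (action, data, ip, created_at) VALUES (?, ?, ?, NOW())'
    );
    $stmt->execute(array($action, $data, $_SERVER['REMOTE_ADDR']));
}

// Return a 1x1 transparent GIF so the image request completes cleanly.
header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');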
Concerning your MySQL question, I'd pick MyISAM as it seems to be more tolerant of lots of inserts. You can also look at the INSERT DELAYED statement, and your idea about nightly cron jobs seems reasonable to me. You can split your statistics table by day, week, or month as well.
99.999% of the time, you will just write to the database.
So for this kind of work, daily partitioned MySQL tables will do the job.
Each day, write to the current day's partition, and run ANALYZE PARTITION on yesterday's partition.
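To make that concrete, here is a rough sketch of a daily-partitioned hits table and the nightly maintenance step, run from PHP. The table, columns, and partition names are all assumptions, and in practice the same nightly job would also create the next day's partition:

<?php
// Sketch: daily-partitioned hits table plus the nightly ANALYZE PARTITION step.
// All names are placeholders; adjust to your own schema.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

// One-time setup: partition by day using TO_DAYS() on the hit date.
$pdo->exec("
    CREATE TABLE IF NOT EXISTS hits (
        listing_id INT NOT NULL,
        action     ENUM('impression','click') NOT NULL,
        ip         VARCHAR(45) NOT NULL,
        hit_date   DATE NOT NULL
    )
    PARTITION BY RANGE (TO_DAYS(hit_date)) (
        PARTITION p20140101 VALUES LESS THAN (TO_DAYS('2014-01-02')),
        PARTITION p20140102 VALUES LESS THAN (TO_DAYS('2014-01-03')),
        PARTITION pmax      VALUES LESS THAN MAXVALUE
    )
");

// Nightly cron: refresh index statistics on yesterday's partition.
$yesterday = 'p' . date('Ymd', strtotime('-1 day'));
$pdo->exec("ALTER TABLE hits ANALYZE PARTITION {$yesterday}");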
I have made a progress sheet that between 1 and 30 users will update constantly throughout the course of the day. It's controlled by PHP and jQuery, and all the data is stored in a MySQL database.
It's quite slow on the live server, because I re-query on every search and filter. I was wondering where the best place to store the result is, because I only need the latest updates and I don't need to constantly filter and update.
Is it okay to store that kind of data in a session, or where else could I store it to avoid a full query on every change, search, or page change?
I did look into memcached, but I ended up redesigning it.
I used jQuery to filter the records using hide() and show() on the data attributes that matched, instead of using a “LIKE '%Darren%'” SQL query.
This puts the load on the user's terminal as opposed to my server (try to keep the animations to a minimum to keep memory usage low).
Then I store a timestamp of the last update in the HTML on the user's terminal and periodically run a
"SELECT id FROM progress WHERE updated > '2014-02-20 16:54:30' LIMIT 1" query.
If I get one record back, I use Ajax to do a full update and refresh the required areas with fresh HTML.
Doing it this way has made it so much quicker and smoother.
I hope this helps; if you need more info let me know!
I'm running a server and website for an online game.
I need to pull information from the MySQL database containing player names and messages and display it on our website.
I managed to do that with a php script which directly pulls the info from the MySQL database whenever someone opens the page.
However, this puts unnecessary load on the server.
I need to make it so that the function pulls the data only at certain intervals, for example every hour.
I have no idea how to do this and what to even look for in terms of functions.
Could anyone show me the right direction?
If you're worried about the load from the database call being made every time a user opens the site, perhaps you should first look into caching, with something like APC or memcached. That way there isn't an actual DB lookup; the cache just returns the same data that it grabbed last time. You set the period that you want the cache to hold the data, so that in time it will grab a fresh copy; in your case this would be an hour.
There's a bunch of questions on APC, memcached etc on stack overflow that should be able to help you out, e.g., The best way of PHP Caching
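As a rough sketch of that caching approach, using the Memcached extension (the cache key, query, and connection details here are placeholders):

<?php
// Sketch: serve the player/message list from cache, hitting MySQL at most once an hour.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$messages = $cache->get('latest_messages');
if ($messages === false) {                       // cache miss or expired
    $pdo = new PDO('mysql:host=localhost;dbname=game', 'user', 'pass');
    $messages = $pdo->query('SELECT player_name, message FROM messages ORDER BY id DESC LIMIT 50')
                    ->fetchAll(PDO::FETCH_ASSOC);
    $cache->set('latest_messages', $messages, 3600);  // keep for one hour
}

// ... render $messages into the page as before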
I'm developing a website that is sensitive to page visits. For instance it has sections that will show the users which parts of the website (which items) have been visited the most. To implement this features, two strategies come to my mind:
Create a page hit counter, sort the pages by the number of visits and pick the highest ones.
Create a Google Analytics account and use its info.
If the first strategy is chosen, I would need a very fast and accurate hit counter with the ability to distinguish unique IPs (or users). I believe that using MySQL wouldn't be a good choice, since a lot of page visits means a lot of DB locks and performance problems. I think a fast logging class would be a better fit.
The second option seems very attractive given all the problems of the first one, but I don't know if there is a way (like an API) for Google Analytics to give me access to the information I want. And if there is, is it fast enough?
Which approach (or even an alternative approach) do you suggest I take? Which one is faster? Performance is my top priority. Thanks.
UPDATE:
Thank you. It's interesting to see the different answers. These answers reminded me of an important factor: my website updates the "most visited" items every 8 minutes, so I don't need the data in real time, but I do need it to be accurate enough every 8 minutes or so. What I had in mind was this (a rough sketch follows the list):
Log every page visit to a simple text log file
Send a cookie to the user to separate unique users
Every 8 minutes, load the log file, collect the info and update the MySQL tables.
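Here is a rough sketch of that flow, assuming a writable log file, a visitor cookie, and a page_stats table with a unique key on the page column (all of these names are made up):

<?php
// Part 1 - on every page view: tag the visitor with a cookie and append one
// line to a plain text log (no database work in the request).
if (empty($_COOKIE['visitor_id'])) {
    $visitorId = uniqid('', true);
    setcookie('visitor_id', $visitorId, time() + 31536000, '/');  // one year
} else {
    $visitorId = $_COOKIE['visitor_id'];
}
file_put_contents(
    '/var/log/myapp/pagehits.log',
    time() . "\t" . $visitorId . "\t" . $_SERVER['REQUEST_URI'] . "\n",
    FILE_APPEND | LOCK_EX
);

// Part 2 - a separate cron script, run every 8 minutes: rotate the log,
// count unique visitors per page, and push the totals into MySQL in one pass.
rename('/var/log/myapp/pagehits.log', '/var/log/myapp/pagehits.work');
$counts = array();
foreach (file('/var/log/myapp/pagehits.work') as $line) {
    list(, $visitor, $page) = explode("\t", rtrim($line, "\n"));
    $counts[$page][$visitor] = true;   // one entry per unique visitor per page
}
$pdo  = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO page_stats (page, visits) VALUES (?, ?)
     ON DUPLICATE KEY UPDATE visits = visits + VALUES(visits)'
);
foreach ($counts as $page => $visitors) {
    $stmt->execute(array($page, count($visitors)));
}
unlink('/var/log/myapp/pagehits.work');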
That said, I wouldn't want to reinvent the wheel. If a 3rd party service can meet my requirements, I would be happy to use it.
Given you are planning to use the page hit data to determine what data to display on your site, I'd suggest logging the page hit info yourself. You don't want to be reliant upon some 3rd-party service that you'd have to interrogate in order to build your page. This is especially true if you are loading that data in real time, as you'd have to interrogate that service for every incoming request to your site.
I'd be inclined to save the data yourself in a database. If you're really concerned about the performance of the inserts, then you could investigate intercepting requests (I'm not sure how you go about this in PHP, but I'm assuming it's possible) and then passing the request data off to a separate thread to store the request info. By having a separate thread handle the logging, you won't interrupt your response to the end user.
Also, given you are planning to use the data collected to "... show the users which parts of the website (which items) have been visited the most", you'll need to think about how you access this data to build your dynamic page. It might be good to store a consolidated count for each resource. For example, rather than having 30,000 rows showing that index.php was requested, have one row showing index.php was requested 30,000 times. This would certainly be quicker to reference than having to perform queries on what could become quite a large table.
Google Analytics has a latency to it and it samples some of the data returned to the API so that's out.
You could try the API from Clicky. Bear in mind that:
Free accounts are limited to the last 30 days of history, and 100 results per request.
There are many examples of hit counters out there, but it sounds like you didn't find one that met your needs.
I'm assuming you don't need real-time data. If that's the case, I'd probably just read the data out of the web server log files.
Your web server can distinguish IP addresses. There's no fully reliable way to distinguish users. I live in a university town; half the dormitory students have the same university IP address. I think Google Analytics relies on cookies to identify users, but shared computers makes that somewhat less than 100% reliable. (But that might not be a big deal.)
"Visited the most" is also a little fuzzy. The easy way out is to count every hit on a particular page as a visit. But a "visit" of 300 milliseconds is of questionable worth. (Probably realized they clicked the wrong link, and hit the "back" button before the page rendered.)
Unless there are requirements I don't know about, I'd probably start by using awk to extract timestamp, ip address, and page name into a CSV file, then load the CSV file into a database.
How do I build a proper structure for an analytics service? Currently I have one table that stores data about every user that visits a page carrying my client's ID, so that later my clients will be able to see the statistics for a specific date.
I thought about it a bit today and I'm wondering: let's say I have 1,000 users and each of them gets around 1,000 impressions on their site daily; that means I get 1,000,000 (1M) new records every day in a single table. How will it work after two months or so, when the table reaches 60 million records?
I just think that after some time it will have so many records that the PHP queries to pull out the data will be really heavy, slow, and resource-hungry. Is that true, and how can I prevent it?
A friend of mine is working on something similar and he is going to create a new table for every client. Is this the correct way to go?
Thanks!
The problem you are facing is an I/O-bound system. One million records a day is roughly 12 write queries per second. That's achievable, but pulling data out while writing at the same time will make your system bound at the HDD level.
What you need to do is configure your database to support the I/O volume you'll be doing: use an appropriate storage engine (InnoDB rather than MyISAM), make sure you have a fast enough HDD subsystem (RAID, not single drives, since they can and will fail at some point), design your schema optimally, and inspect your queries with EXPLAIN to see where you might have gone wrong with them. You could even use a different storage engine; personally, I'd use TokuDB if I were you.
And also, I sincerely hope you'll be doing your querying, sorting, and filtering on the database side and not on the PHP side.
Have a look at the Google Analytics Platform Components Overview page and pay special attention to the way the data is written to the database, simply based on the architecture of the entire system.
Instead of writing everything to your database right away, you could write everything to a log file, then process the log later (perhaps at a time when the traffic isn't so high). At the end of the day, you'll still need to make all of those writes to your database, but if you batch them together and do them when that kind of load is more tolerable, your system will scale a lot better.
You could normalize the impressions data like this:
Client Table
{
ID
Name
}
Pages Table
{
ID
Page_Name
}
PagesClientsVisits Table
{
ID
Client_ID
Page_ID
Visits
}
and just increment visits on the final table on each new impression. Then the maximum number of records in there becomes (No. of clients * No. of pages)
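For illustration, the increment could be done in a single statement if PagesClientsVisits has a unique key on (Client_ID, Page_ID); $clientId and $pageId are hypothetical values assumed to come from the tracking request:

<?php
// Sketch: bump the counter for this client/page pair, creating the row on first hit.
// Assumes a UNIQUE KEY on (Client_ID, Page_ID) in PagesClientsVisits.
$pdo  = new PDO('mysql:host=localhost;dbname=analytics', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO PagesClientsVisits (Client_ID, Page_ID, Visits)
     VALUES (?, ?, 1)
     ON DUPLICATE KEY UPDATE Visits = Visits + 1'
);
$stmt->execute(array($clientId, $pageId));   // $clientId / $pageId: hypothetical inputs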
Having a table with 60 million records can be OK. That is what a database is for. But you should be careful about how many fields the table has, and what datatype (and therefore size) each field uses.
You'll be creating some kind of reports from the data. Think about what data you really need for those reports. For example, you might only need the number of visits per user on each page; a simple count would do the trick.
What you also can do is generate the report every night and delete the raw data afterwards.
So, read and think about it.
I have a site where the users can view quite a large number of posts. Every time this is done I run a query similar to UPDATE table SET views=views+1 WHERE id = ?. However, there are a number of disadvantages to this approach:
There is no way of tracking when the pageviews occur - they are simply incremented.
Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
Therefore I consider employing an approach where I create a table, say:
object_views { object_id, year, month, day, views }, so that each object has one row pr. day in this table. I would then periodically update the views column in the objects table so that I wouldn't have to do expensive joins all the time.
This is the simplest solution I can think of, and it seems that it is also the one with the least performance impact. Do you agree?
(The site is built on PHP 5.2, Symfony 1.4 and Doctrine 1.2, in case you wonder)
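In case it helps to see the idea spelled out, here is a sketch of the per-day counter plus the periodic roll-up; it assumes a UNIQUE KEY on (object_id, year, month, day), an id/views pair on the objects table, and an $objectId variable, none of which is stated above:

<?php
// Sketch of the approach described above: one counter row per object per day,
// plus a periodic roll-up onto objects.views. All names are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// On each page view: bump today's row for this object.
$stmt = $pdo->prepare(
    'INSERT INTO object_views (object_id, year, month, day, views)
     VALUES (?, YEAR(NOW()), MONTH(NOW()), DAY(NOW()), 1)
     ON DUPLICATE KEY UPDATE views = views + 1'
);
$stmt->execute(array($objectId));   // $objectId: the object being shown

// Periodic cron: copy the totals back onto the objects table so the
// listing pages never need the join.
$pdo->exec(
    'UPDATE objects o
     SET o.views = (SELECT COALESCE(SUM(v.views), 0)
                    FROM object_views v
                    WHERE v.object_id = o.id)'
);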
Edit:
The purpose is not web analytics - I know how to do that, and that is already in place. There are two purposes:
Allow the user to see how many times a given object has been shown, for example today or yesterday.
Allow the moderators of the site to see simple view statistics without going into Google Analytics, Omniture or whatever solution. Furthermore, the results in the backend must be realtime, a feature which GA cannot offer at this time. I do not wish to use the Analytics API to retrieve the usage data (not realtime, GA requires JavaScript).
Quote: "Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower."
There is much more to it than this. It is a database killer.
I suggest you make a table like this:
object_views { object_id, timestamp}
This way you can aggregate on object_id (with the COUNT() function).
So every time someone views the page, you INSERT a record into the table.
Once in a while you must clean out the old records from the table. The UPDATE statement is EVIL :)
On most platforms it will basically mark the row as deleted and insert a new one, thus fragmenting the table. Not to mention locking issues.
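A rough sketch of this insert-only approach, with assumed table and column names:

<?php
// Sketch: one row per view, aggregate with COUNT(), purge old rows periodically.
// Table and column names are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// On each page view: append a row, never update in place.
$pdo->prepare('INSERT INTO object_views (object_id, viewed_at) VALUES (?, NOW())')
    ->execute(array($objectId));   // $objectId: hypothetical input

// Reporting: views per object for today.
$stmt = $pdo->query(
    'SELECT object_id, COUNT(*) AS views
     FROM object_views
     WHERE viewed_at >= CURDATE()
     GROUP BY object_id'
);
$todaysViews = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Housekeeping cron: drop rows older than, say, 90 days.
$pdo->exec('DELETE FROM object_views WHERE viewed_at < NOW() - INTERVAL 90 DAY');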
Hope that helps
Along the same lines as Rage, you simply are not going to get the same results doing it yourself when there are a million third-party log tools out there. If you are tracking on a daily basis, then a basic program such as WebTrends is perfectly capable of tracking the hits, especially if your URL contains the IDs of the items you want to track... I can't stress this enough: it's all about the URL when it comes to these tools (WordPress, for example, allows lots of different URL constructs).
Now, if you are looking into "impression" tracking, then it's another ball game, because you are probably tracking each object, the page, the user, and possibly a weighted value based on location on the page. If this is the case, you can keep your performance up by hosting the tracking on another server where you can fire and forget. In the past I did this using SQL updates against the ID and a string version of the date... that way, when the date changes from 20091125 to 20091126, it's a simple query without the overhead of, say, a DATEDIFF function.
First, just a quick remark: why not combine the year, month, and day into a DATETIME column? It would make more sense in my mind.
Also, I'm not really sure what the exact reason is for doing this; if it's for marketing/web-stats purposes, you'd be better off using a tool made for that purpose.
There are two big families of tools capable of giving you an idea of your website's access statistics: log-based ones (AWStats is probably the most popular) and ajax/1-pixel-image-based ones (Google Analytics would be the most popular).
If you prefer to build your own stats database, you can probably manage to build a log parser easily using PHP. If you find parsing Apache logs (or IIS logs) too much of a burden, you could make your application output custom logs formatted in a simpler way.
One other possible solution is to use memcached; the daemon provides a kind of counter that you can increment. You can log views there and have a script collect the results every day.
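A sketch of that memcached counter idea (the key scheme and the daily collection step are assumptions):

<?php
// Sketch: count views in memcached, then flush the totals to MySQL once a day.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

// On each page view: increment the counter for this page, creating it on the first hit.
$key = 'views:' . md5($_SERVER['REQUEST_URI']);
if ($cache->increment($key) === false) {
    $cache->add($key, 1);
}

// A daily script would then read each counter, write the total to MySQL,
// and reset it, e.g. $total = $cache->get($key); $cache->set($key, 0);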
If you're going to do that, why not just log each access? MySQL can cache inserts in continuous tables quite well, so there shouldn't be a notable slowdown due to the inserts. You can always run SHOW PROFILES to see what the performance penalty actually is.
On the datetime issue, you can always use GROUP BY MONTH(accessed_at), YEAR(accessed_at) or WHERE MONTH(accessed_at) = 11 AND YEAR(accessed_at) = 2009.