Track Web Traffic with PHP

Is there an effective way to track web traffic (or at least the origin of web traffic) with PHP?
I was thinking of using custom canonical links for each search engine and other websites, which would mean anybody who visits mywebsite.com without a parameter is likely direct traffic. But then I would somehow need to change the href value of the link rel='canonical' element for each engine crawler (e.g. https://mywebsite.com/?ref=google, https://mywebsite.com/?ref=duckduckgo, etc.), and I'm not exactly sure how to go about this (through robots.txt, meta tags, or something else?).
I really don't want to use Google Analytics if I don't have to. I'd prefer to have all of my analytics under one roof, so to speak, but I'm stuck for ideas of how to achieve this, and most of my searches on SO seem to pull up stuff related to GA.

Well, I've read all over SO about how in many cases the Referer header can be, and often is, simply omitted for various reasons such as AV software, browser extensions, navigating from HTTPS to HTTP, etc. Is this often the case?
Yes, this can happen. How often for your particular site's visitors is anyone's guess.
Does GA rely on the Referer header?
Not quite... as Google Analytics runs client-side, it's getting that information from document.referrer, which contains the same value as what is sent in the Referer header.
But I would, of course, like to have numbers that are as accurate as possible.
With any web analytics, there are things you simply can't measure. The best approach is to use a client-side analytics script to send data to your server. There are a handful of reasons why this is better than simply looking at the HTTP request data you get in PHP:
Pages can be cached, so you'll be able to see page loads at times when the browser never even checked in with your server to load the page.
The Performance API is available, allowing you to track specific load timings that you can work to improve on over time.
In most browsers, you can use the Beacon API to get a sense for when the user leaves the page, so you have accurate time-on-page measurements.
I'd like to get an idea of what traffic is direct, what traffic is not, and where the non-direct traffic is coming from.
document.referrer is what you want, and gets you as close to accurate as you can get.
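As a rough illustration of the "client-side script sends data to your server" idea, the receiving end can be a small PHP endpoint. Everything here is a sketch: the file name collect.php, the hits.log flat file, and the JSON field names are assumptions, and the client side would be a few lines of fetch() or navigator.sendBeacon() posting the page URL, document.referrer, and a timing value.
<?php
// collect.php -- hypothetical analytics endpoint (name and field names are
// assumptions for this sketch). A small client-side script would POST JSON
// such as {"page": "...", "referrer": "...", "loadMs": 1234} to it.

$data = json_decode(file_get_contents('php://input'), true);
if (!is_array($data)) {
    http_response_code(400);
    exit;
}

$entry = [
    'time'     => date('c'),
    'ip'       => $_SERVER['REMOTE_ADDR'] ?? '',
    'page'     => substr((string)($data['page'] ?? ''), 0, 2048),
    'referrer' => substr((string)($data['referrer'] ?? ''), 0, 2048), // empty string => direct traffic
    'loadMs'   => (int)($data['loadMs'] ?? 0),
];

// One JSON line per hit; swap the flat file for a database insert in practice.
file_put_contents(__DIR__ . '/hits.log', json_encode($entry) . PHP_EOL, FILE_APPEND | LOCK_EX);

http_response_code(204); // beacons don't need a response body
Whatever shape the client sends, an empty referrer is what you would count as direct traffic.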

Related

How does Google Analytics avoid the same-origin policy?

I had an idea for a project involving a JavaScript terminal utilising a specified PHP script as a server to carry out remote functions. I understand that the same-origin policy would be an obstacle for such a project, but looking at Google Analytics, which I use every day, it seems they have a way of avoiding the problem on a huge scale.
Google Analytics, Google AdWords and practically all other analytics/web-marketing platforms use <img> tags.
They load their JS programs, those programs handle whatever tracking you put on the page, and then they create an image and set its source to their server's domain, with all of your tracking information appended to the query string.
The crux is that it doesn't matter how it gets there:
the server only cares about the data inside the URL being requested, and the client only cares about making the call to a specific URL, not about getting any return value.
Thus, somebody chose <img> years and years ago, and companies have been using it ever since.
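As a rough illustration of that pattern, the receiving end can be a tiny PHP script that logs whatever arrives in the query string and answers with a transparent GIF. The file name, log format, and example tracker domain below are made up for this sketch.
<?php
// pixel.php -- illustrative tracking pixel. The page embeds it as
//   <img src="https://tracker.example.com/pixel.php?ref=google&page=%2Fabout">
// or JS sets new Image().src to the same URL with tracking data in the query string.

$line = date('c') . ' ' . ($_SERVER['REMOTE_ADDR'] ?? '-') . ' ' . ($_SERVER['QUERY_STRING'] ?? '');
file_put_contents(__DIR__ . '/pixel.log', $line . PHP_EOL, FILE_APPEND | LOCK_EX);

header('Content-Type: image/gif');
header('Cache-Control: no-store');   // every hit should reach the server, never the browser cache
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'); // a commonly used 1x1 transparent GIF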
The modern way to allow cross-domain requests is for the server to respond with the following header to any requests:
Access-Control-Allow-Origin: *
This allows requests from any host; alternatively, a specific host can be used instead of *. This is called Cross-Origin Resource Sharing (CORS). Unfortunately it's not supported in older browsers, so in that case you need hacks to work around the browser (as a commenter said, perhaps by requesting an image).
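On the PHP side this is just a couple of header() calls issued before any output; a minimal sketch:
<?php
// Hypothetical PHP tracking endpoint opting in to cross-origin requests via CORS.
header('Access-Control-Allow-Origin: *');            // or a specific origin, e.g. https://mywebsite.com
header('Access-Control-Allow-Methods: GET, POST');
// ...handle the tracking request as usual...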
You can include scripts from third-party sites, but collecting data with them is restricted by the same-origin policy.
Google collects data with the "_gaq" command array embedded by the first-party site, and then sends the collected data embedded in the HTTP request parameters of a request like this:
http://www.google-analytics.com/__utm.gif?utmwv=4&utmn=769876874&utmhn=example.com&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r115&utmcn=1&utmdt=GATC012%20setting%20variables&utmhid=2059107202&utmr=0&utmp=/auto/GATC012.html?utm_source=www.gatc012.org&utm_campaign=campaign+gatc012&utm_term=keywords+gatc012&utm_content=content+gatc012&utm_medium=medium+gatc012&utmac=UA-30138-1&utmcc=__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B...
Google demonstrates clearly how tracking works.

How can I track outgoing link clicks without tracking bots?

I have a few thoughts on this but I can see problems with both. I don't need 100% accurate data. An 80% solution that allows me to make generalizations about the most popular domains I'm routing users to is fine.
Option 1 - Use PHP. Route links through a file track.php that makes sure the referring page is from my domain before tracking the click. This page then routes the user to the final intended URL. Obviously bots could spoof this. Do many? I could also check the user agent. Again, I KNOW many bots spoof this.
Option 2 - Use JavaScript. Execute a JavaScript on click function that writes the click to the database and then directs the user to the final URL.
Both of these methods feel like they may cause problems with crawlers following my outgoing links. What is the most effective method for tracking these outgoing clicks?
The most effective method for tracking outgoing links (it's used by Facebook, Twitter, and almost every search engine) is a "track.php" type file.
Detecting bots can be considered a separate problem, and the methods are covered fairly well by these questions: http://duckduckgo.com/?q=how+to+detect+http+bots+site%3Astackoverflow.com But doing a simple string search for "bot" in the User-Agent will probably get you close to your 80%* (and watching for hits to /robots.txt will, depending on the type of bot you're dealing with, get you 95%*).
*: a semi-educated guess, based on zero concrete data
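A rough sketch of such a track.php, combining the redirect with the simple "bot" substring check mentioned above. The file name, log format, and the mywebsite.com referrer check are illustrative assumptions, not part of any of the answers.
<?php
// track.php -- illustrative outgoing-link tracker, called as
// /track.php?url=https%3A%2F%2Fexample.org

$url = $_GET['url'] ?? '';

// Only redirect to http(s) destinations so this doesn't become an open redirect.
if (!preg_match('#^https?://#i', $url)) {
    http_response_code(400);
    exit('Bad destination');
}

$ua       = $_SERVER['HTTP_USER_AGENT'] ?? '';
$referrer = $_SERVER['HTTP_REFERER'] ?? '';

$isBot = stripos($ua, 'bot') !== false                        // the simple "bot" substring check
      || strpos($referrer, 'https://mywebsite.com/') !== 0;   // naive same-site referrer check (spoofable)

if (!$isBot) {
    $line = date('c') . "\t" . parse_url($url, PHP_URL_HOST) . "\t" . ($_SERVER['REMOTE_ADDR'] ?? '-');
    file_put_contents(__DIR__ . '/clicks.log', $line . PHP_EOL, FILE_APPEND | LOCK_EX);
}

header('Location: ' . $url, true, 302);
exit;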
Well, Google Analytics and Piwik use JavaScript for that.
Since most bots don't run JS, you'll mostly be left with humans. On the other hand, humans can disable JS too (but honestly, that's rarely the case).
Facebook, DeviantArt, WLM, etc. use server-side scripts to track. I don't know how they filter bots, but a decent robots.txt with one or two filters should be good enough to get to 80%, I guess.

Algorithms used for catching robots

What kind of algorithms do websites, including Stack Exchange, use to catch robots?
What makes them fail at times and present human-verification to normal users?
For web applications and websites running on PHP, what would you recommend to stop robots, bot attacks, and even content stealing?
Thank you.
Check out http://www.captcha.net/ for good and easy human-verification tools.
Preventing content stealing will be really difficult as you want the information to be available to your visitors.
Do not disable right-click; it will only annoy your users and not stop content thieves in any way.
You won't be able to keep out all bots, but you will be able to implement layers of security that will each stop a part of the bots.
A few hints and tips:
Use CAPTCHAs for human verification, but don't use too many of them, as they will tire users.
You could do e-mail verification with a CAPTCHA and require a login for your content (if it doesn't scare away too many users). Or consider giving some part of the content away for free and requiring registration for the full content.
Check for pieces of your content on other sites regularly (through Google, possibly automated with the Google API) and send a DMCA notice or sue if they blatantly stole (rather than quoted) your content.
Limit the speed at which individual clients can make requests to your site. Bots will scrape often and quickly; requesting content more than once a second is already a lot for human users. There are server tools that can accomplish this, e.g. check out http://www.modsecurity.org/
I am sure there are more layers of security that can be thought of, but these come to mind directly.
I ran across an interesting article from Princeton University that presents nice ideas for automatic robot detection. The idea is quite simple. Humans behave differently than machines, and an automated access usually does things differently than a human.
The article presents some basic checks that can be done over the course of a few requests. You spend a few requests gathering information about how the client is browsing and after some time you take all your variables and make an assertion. Things to include are:
Mouse movement: a robot will most likely not use a mouse and therefore will not generate mouse-movement events in the browser. You can prepare a JavaScript function, say "onBodyMouseMove()", and call it whenever the mouse moves anywhere over the page's body. If this function is called, count +1 in a session counter.
JavaScript: some robots will not take the time to run JavaScript (e.g. curl, wget, axel, and other command-line tools), since they are mostly sending specific requests that return useful output. You can prepare a function that is called after a page is loaded and count +1 in a session counter.
Invisible links: crawler robots are sucking machines that don't care about the content of a website. They are designed to follow all possible links and mirror all the content. You can insert invisible links somewhere in your webpage -- for example, a few &nbsp; space characters at the bottom of the page surrounded by an anchor tag. Humans will never see this link, but if you get a request for it, count +1 in a session counter.
CSS, images, and other visual components: robots will most likely ignore CSS and images, because they are not interested in rendering the webpage for viewing. You can serve your CSS and images through URLs that end in *.css or *.jpg but are actually handled by a script (you can use Apache rewrites, or servlet mappings for Java). If those specific links are accessed, it's most likely a browser loading CSS and JPG for viewing.
NOTE: *.css, *.js, *.jpg, etc. are usually loaded only once per page in a session. You need to append a unique counter at the end for the browser to reload these links every time the page is requested.
Once you gather all that information in your session over the course of a few requests, you can make an assertion. For example, if you don't see any JavaScript, CSS, or mouse-move activity, you can assume it's a bot. It's up to you to weigh these counters according to your needs, so you can program it based on these variables any way you want. If you decide some client is a robot, you can force it to solve a CAPTCHA before continuing with further requests.
Just a note: Tablets will usually not create any mouse move events. So I'm still trying to figure out how to deal with them. Suggestions are welcome :)
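A rough PHP sketch of the session-counter bookkeeping described above. It assumes tiny client-side hooks (an onmousemove handler, a post-load script, the invisible link, and rewritten CSS/image URLs) each request this script with a different signal parameter; all of the names here are made up for the sketch.
<?php
// signals.php -- illustrative signal collector. Assumed client-side hooks:
//   /signals.php?signal=js        fired by a small script after page load
//   /signals.php?signal=mouse     fired once from an onmousemove handler
//   /signals.php?signal=honeypot  target of the invisible link -- humans should never hit it
//   /signals.php?signal=asset     CSS/image URLs rewritten through this script
session_start();

$allowed = ['js', 'mouse', 'asset', 'honeypot'];
$signal  = $_GET['signal'] ?? '';
if (in_array($signal, $allowed, true)) {
    $_SESSION['signals'][$signal] = ($_SESSION['signals'][$signal] ?? 0) + 1;
}
http_response_code(204);

// Elsewhere, after a handful of page requests, the assertion might look like:
function looksLikeBot(array $signals): bool
{
    if (($signals['honeypot'] ?? 0) > 0) {
        return true;                       // only crawlers follow the invisible link
    }
    // No JS and no asset loads is a strong hint of automation. Mouse movement is
    // deliberately not required on its own, since tablets never send it.
    return ($signals['js'] ?? 0) === 0 && ($signals['asset'] ?? 0) === 0;
}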

How to determine whether a real user is browsing my site or it is just being crawled, in PHP

I want to know whether a user is actually looking at my site (I know the page could just be loaded by the browser and displayed without a human actually looking at it).
I know of two methods that should work.
JavaScript.
If the page is loaded by a browser, it will run the JS code automatically, unless JS is disabled by the browser. Then use AJAX to call back to the server.
A 1×1 transparent image in the HTML.
Use the img request to call back to the server.
Does anyone know the pitfalls of these methods, or any better method?
Also, I don't know how to detect a 0×0 or 1×1 iframe, which could defeat the above methods.
A bot can drive a real browser, e.g. http://browsershots.org does.
The bot can request that 1x1 image.
In short, there is no real way to tell. The best you could do is use a CAPTCHA, but then it degrades the experience for humans.
Just use a CAPTCHA where required (user sign-up, etc.).
I want to know whether a user is actually looking at my site (I know the page could just be loaded by the browser and displayed without a human actually looking at it).
The image way seems better, as JavaScript might be turned off by normal users as well. Robots generally don't load images, so this should indeed work. Nonetheless, if you're just looking to filter out a known set of robots (say Google and Yahoo), you can simply check the HTTP User-Agent header, as those robots will actually identify themselves as robots.
You can create a Google Webmaster Tools account; it tells you how to configure your site for bots and also shows how a robot will read your website.
I agree with others here, this is really tough - generally nice crawlers will identify themselves as crawlers, so using the User-Agent is a pretty good way to filter out those guys. A good source for user-agent strings can be found at http://www.useragentstring.com. I've used Chris Schuld's PHP script (http://chrisschuld.com/projects/browser-php-detecting-a-users-browser-from-php/) to good effect in the past.
You can also filter these guys at the server level using the Apache config or .htaccess file, but I've found that to be a losing battle keeping up with it.
However, if you watch your server logs you'll see lots of suspect activity with valid (browser) user-agents or funky user-agents so this will only work so far. You can play the blacklist/whitelist IP game, but that will get old fast.
Lots of crawlers do load images (e.g. Google Image Search), so I don't think that will work all the time.
Very few crawlers will have JavaScript engines, so that is probably a good way to differentiate them. And let's face it, how many users actually turn off JavaScript these days? I've seen the stats on that, but I think those stats are very skewed by the sheer number of crawlers/bots out there that don't identify themselves. However, a caveat is that I have seen that the Google bot does run JavaScript now.
So, bottom line, it's tough. I'd go with a hybrid strategy for sure - if you filter using user-agent, images, IP, and JavaScript, I'm sure you'll get most bots, but expect some to get through despite that.
Another idea: you could always use a known JavaScript browser quirk to test whether the reported user-agent (if it's a browser) really is that browser.
"Nice" robots like those from google or yahoo will usually respect a robots.txt file. Filtering by useragent might also help.
But in the end - if someone wants to gain automated access it will be very hard to prevent that; you should be sure it is worth the effort.
Inspect the User-Agent header of the HTTP request.
Crawlers should set this to something other than a known browser.
Here are the Googlebot headers: http://code.google.com/intl/nl-NL/web/controlcrawlindex/docs/crawlers.html
In PHP you can get the user agent with:
$Uagent = $_SERVER['HTTP_USER_AGENT'];
Then you just compare it with the known user-agent strings.
As a tip, preg_match() could be handy to do this all in a few lines of code.
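For example, a hedged sketch along those lines; the pattern list is a small sample of well-known crawler signatures, not an exhaustive one.
<?php
$Uagent = $_SERVER['HTTP_USER_AGENT'] ?? '';

// Matches a handful of known crawler names plus generic "bot"/"crawl"/"spider" hints.
$isKnownBot = (bool) preg_match(
    '/googlebot|bingbot|slurp|duckduckbot|baiduspider|yandexbot|crawl|spider|bot/i',
    $Uagent
);

if ($isKnownBot) {
    // e.g. skip the analytics logging but serve the content normally
}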

Top techniques to avoid 'data scraping' from a website database

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection attacks, there is nothing to prevent anyone from viewing all the records as html and running some sort of script to parse this data back into another database. Even if I was to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc. - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address (a simple per-IP sketch follows this list).
Require JavaScript - to ensure the client has some resemblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robot metatags. This would stop conforming spiders. This will prevent Google from indexing you for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
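As a sketch of the rate-limiting point above (the first item in the list), here is a hypothetical per-IP limiter using the APCu extension as the counter store; the key name and the 60-requests-per-minute threshold are arbitrary examples.
<?php
// Hypothetical per-IP rate limiter; a database or Redis works the same way.
function tooManyRequests(string $ip, int $limit = 60, int $window = 60): bool
{
    $key = 'hits:' . $ip;
    if (apcu_add($key, 1, $window)) {     // first request in this window
        return false;
    }
    return apcu_inc($key) > $limit;       // true once the IP exceeds $limit hits per window
}

if (tooManyRequests($_SERVER['REMOTE_ADDR'] ?? '0.0.0.0')) {
    http_response_code(429);
    exit('Too many requests');
}
Note that APCu counters live per server, so a setup with several web servers would keep the same counter in Redis or a database instead.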
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping (a sketch follows below).
Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.
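A sketch of the time-limited hash idea from the list above; the token name, the 10-second window, and the endpoint layout are assumptions for illustration only.
<?php
session_start();

// 1) When rendering the page shell:
$_SESSION['data_token']    = bin2hex(random_bytes(16));
$_SESSION['data_token_ts'] = time();
// ...print $_SESSION['data_token'] into the page so the AJAX call can send it back...

// 2) In the AJAX endpoint that actually returns the data:
$token   = $_GET['token'] ?? '';
$expired = (time() - ($_SESSION['data_token_ts'] ?? 0)) > 10;   // valid for ~10 seconds

if ($expired || !hash_equals($_SESSION['data_token'] ?? '', $token)) {
    http_response_code(403);
    exit;
}
// ...query the database and emit the JSON for the grid here...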
Force a reCAPTCHA every 10 page loads for each unique IP.
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not able to be scraped.
He's asking for two incongruent things, and maybe having a discussion about his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
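A hedged way to act on that observation in PHP: unless configured otherwise, command-line clients identify themselves in the User-Agent and send far fewer headers than a real browser. Treat a match here as a hint, never as proof, since all of these headers can be spoofed.
<?php
$ua = strtolower($_SERVER['HTTP_USER_AGENT'] ?? '');

$looksAutomated =
    $ua === ''
    || preg_match('/curl|wget|python-requests|libwww|httpclient|scrapy/', $ua)
    || empty($_SERVER['HTTP_ACCEPT_LANGUAGE']);   // browsers virtually always send Accept-Language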
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to a bulletin board's troll protection? If a scrape is detected (perhaps a certain number of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for Google IPs!
Normally to screen-scrape a decent amount one has to make hundreds, thousands (and more) requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviour. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when, say, tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
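A rough sketch of that throttling idea using the PHP session as the counter; the thresholds are arbitrary, and a scraper that refuses cookies would need an IP-based variant of the same bookkeeping (like the APCu limiter above) instead.
<?php
session_start();

$now = time();
if (($now - ($_SESSION['window_start'] ?? 0)) > 60) {   // new one-minute window
    $_SESSION['window_start'] = $now;
    $_SESSION['page_count']   = 0;
}
$_SESSION['page_count'] = ($_SESSION['page_count'] ?? 0) + 1;

if ($_SESSION['page_count'] > 30) {
    // render a CAPTCHA page here instead of the content
} elseif ($_SESSION['page_count'] > 10) {
    // 11th page in the window waits 2s, the 12th 4s, the 13th 8s... capped at 20s
    sleep(min(2 ** ($_SESSION['page_count'] - 10), 20));
}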
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would be just to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.
