Exclude bots and spiders from a view counter in PHP

I have built a pretty basic advertisement manager for a website in PHP.
I say basic because it's not complex like Google or Facebook ads or even most high-end ad servers: it doesn't handle payments or target users.
It serves the purpose for my low-traffic site, though, which is simply to show a random banner ad and count impression views and clicks.
Features:
Ad slot/position on page
Banner image
Name
View/impression counter
Click counter
Start and end date, or never ending
Disable/enable ad
I want to gradually add more functionality to the system, though.
One thing I have noticed is that the impression/view counter often seems inflated.
I believe the cause is spiders and bots from social networks as well as search engines.
For example, if someone enters a URL from a page on my website into Facebook, Google+, Twitter, LinkedIn, Pinterest, or another network, those sites will often spider my site to gather the webpage's title, images, and description.
I would really like to stop these visits from counting as advertisement impressions/views when an actual human is not viewing the page.
I realize it will be very hard to detect all of them, but if there is a way to catch the majority, it will at least make my stats a little more accurate.
So I am reaching out for any help or ideas on how to achieve this. Please do not suggest using another advertisement system; that is not in the cards. Thank you.

You need to serve the ads with JavaScript. That's the only way to avoid most of the crawlers: only browsers load dependencies like images, JS, and CSS, and 99% of robots avoid them.
You can also do this:
// Basic crawler detection and blocking (no legitimate browser should match this)
if (!empty($_SERVER['HTTP_USER_AGENT']) && preg_match('~(bot|crawl)~i', $_SERVER['HTTP_USER_AGENT'])) {
    // This is a crawler; do not count an ad impression here
}
You'll have much better stats this way. Use JS for ads.
PS: You could also try setting a cookie in JavaScript and checking for it later. Crawlers might accept cookies sent by PHP over HTTP, but there is a 99.9% chance they'll miss one set in JS, because that requires loading a JS file and interpreting it, which only browsers do.
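To make that concrete, here is a minimal sketch of a JS-only impression counter. Everything in it is an assumption for illustration: the impression.php endpoint name, the js_ok cookie, and an ads table with id and impressions columns.
<?php
// impression.php -- hypothetical endpoint, requested only from JavaScript:
//   <script>
//     document.cookie = 'js_ok=1; path=/';
//     new Image().src = '/impression.php?ad_id=123';
//   </script>
// Crawlers that don't execute JS never request this file, and the few
// that do rarely carry the JS-set cookie.

$adId = isset($_GET['ad_id']) ? (int)$_GET['ad_id'] : 0;

$looksLikeBot = !empty($_SERVER['HTTP_USER_AGENT'])
    && preg_match('~(bot|crawl|spider)~i', $_SERVER['HTTP_USER_AGENT']);

if ($adId > 0 && isset($_COOKIE['js_ok']) && !$looksLikeBot) {
    // Table and column names are assumptions; adjust to your schema.
    $pdo = new PDO('mysql:host=localhost;dbname=ads', 'user', 'pass');
    $stmt = $pdo->prepare('UPDATE ads SET impressions = impressions + 1 WHERE id = ?');
    $stmt->execute(array($adId));
}

// Return a 1x1 transparent GIF so the image request completes cleanly.
header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');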

You could do something like this:
There is a good list of crawlers in text format here: http://www.robotstxt.org/db/all.txt
Assume you've collected all of the user agents from that file into an array called $botList:
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : null;
// Match by substring: a real UA string contains the bot name but rarely equals it exactly
foreach ($botList as $bot) {
    if ($ua && strpos($ua, strtolower($bot)) !== false) {
        // this is probably a bot
        break;
    }
}
Of course, the user agent can easily be changed or may be missing entirely, but search engines like Google and Yahoo are honest about identifying themselves.

A crawler will download robots.txt even if it doesn't respect it, if only out of curiosity. This is a good indication you might be dealing with one, although it's not definitive.
You can also detect a crawler when it visits a huge number of links in a very short time, though that can be quite complicated to implement.
But both of these are only worth the trouble if you can't or don't want to rely on JavaScript. Otherwise go with CodeAngry's answer.
Edit: In response to keune's answer, you could keep all the visitor IPs, run them through the list in a cron job, and then publish the updated visitor count.
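A sketch of the robots.txt idea, assuming you route the file through PHP; the log file name and the rewrite rule are illustrative:
<?php
// robots.php -- assumes a rewrite rule maps /robots.txt here, e.g. in
// .htaccess: RewriteRule ^robots\.txt$ robots.php [L]
// Any IP that fetches robots.txt is probably a crawler; log it so a
// cron job can later subtract those IPs from the impression counts.

file_put_contents(
    __DIR__ . '/crawler_ips.log',
    $_SERVER['REMOTE_ADDR'] . ' ' . date('c') . PHP_EOL,
    FILE_APPEND | LOCK_EX
);

// Serve the actual robots.txt content.
header('Content-Type: text/plain');
echo "User-agent: *\nDisallow:\n";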

Try this:
if (preg_match('/^(Mozilla|Opera|PSP|Bunjalloo|wii)/i', $_SERVER['HTTP_USER_AGENT'])
    && !preg_match('/bot|crawl|crawler|slurp|spider|link|checker|script|robot|discovery|preview/i', $_SERVER['HTTP_USER_AGENT'])) {
    // It's not a bot
} else {
    // It's a bot
}

Related

How to make a private URL?

I want to create a private URL such as
http://domain.com/content.php?secret_token=XXXXX
Then only visitors who have the exact URL (e.g. received by email) can see the page. We check $_GET['secret_token'] before displaying the content.
My problem is that if by any chance search bots find the URL, they will simply index it and the URL will be public. Is there a practical method to avoid bot visits and the subsequent indexing?
Possible But Unfavorable Methods:
Login system (e.g. by php session): But I do not want to offer user login.
Password-protected folder: The problem is as above.
Using Robots.txt: Many search engine bots do not respect it.
What you are talking about is security through obscurity. It's never a good idea. If you must, I would offer these thoughts:
Make the link expire
Lock the link to the class C or D of the IP it was first accessed from
Have the page challenge the user with something like a logic question before forwarding to the real page with a time-sensitive token (a two-step process), and if the challenge fails, send a 404 back so the crawler stops.
Try generating a 5-6 character alphanumeric password and attaching it to the email, so even though robots spider the URL, they still need the password to access the page. (Just an extra safety measure.)
If there is no link to it (and the folder has no index view), a robot won't find it.
You could return a 404 if the token is wrong: that way, a robot (and anyone else without the token) will think there is no such page.
As long as you don't link to it, no spider will pick it up. And since you don't want any password protection, the link is going to work for everyone. Consider disabling the secret key after it has been used.
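Pulling a few of these suggestions together (expiry, a 404 on a bad token, single use), here is a minimal sketch; the tokens table and its token, expires_at, and used columns are assumptions:
<?php
// content.php -- sketch of an expiring, single-use secret token.
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$token = isset($_GET['secret_token']) ? $_GET['secret_token'] : '';

$stmt = $pdo->prepare(
    'SELECT id FROM tokens WHERE token = ? AND used = 0 AND expires_at > NOW()'
);
$stmt->execute(array($token));
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if (!$row) {
    // Wrong, expired, or reused token: answer 404 so a crawler
    // (and anyone guessing) believes the page does not exist.
    http_response_code(404);
    exit;
}

// Burn the token so the link only works once.
$pdo->prepare('UPDATE tokens SET used = 1 WHERE id = ?')->execute(array($row['id']));

// ... display the private content here ...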
You only need to tell the search engines not to index /content.php; search engines that honor robots.txt won't index any page whose path starts with /content.php.
Leaving the link unpublished will be ok in most circumstances...
...However, I will warn you that the prevalence of browser toolbars (Google and Yahoo come to mind) changes the game. One company I worked for had pages from their intranet indexed in Google. You could search for the page and a few results came up, but you couldn't access them unless you were inside our firewall or VPN'd in.
We figured the only way those links got propagated to Google had to be through the toolbar. (If anyone else has a better explanation, I'd love to hear it...) I've been out of that company a while now, so I don't know if they ever figured out definitively what happened there.
I know, strange but true...

How to track relevant views using PHP

I would like to track all views of a page using PHP and MySQL. I will be tracking the number of times a person viewed the page and their IP address, along with the current date. However, is there a way to make sure you're tracking actual users rather than bots/spiders?
Two options that I see:
Create a "hidden" link on your home page to a honey pot. Any one who hits the honey pot page should be considered a bot and not included in your stats
2: Not a fool proof way, but you could compare the browser's User Agent string to a white list of known web browsers. This string can be spoofed so its not the most reliable.
Personally, I'd go with the first option.
For the honey pot:
on your home page I'd add something like this:
<a href="honeypot.php" style="display:none">ReallyNotATrap</a>
and on the honey pot page itself something like this:
$BotIp = $_SERVER['REMOTE_ADDR'];
// DB connection (credentials are placeholders)
$db = new mysqli('localhost', 'user', 'pass', 'stats');
// log the IP and date, plus any other data you care about logging
$stmt = $db->prepare('INSERT INTO BlackList (ip, date) VALUES (?, NOW())');
$stmt->bind_param('s', $BotIp);
$stmt->execute();
$db->close();
Then for your stats code, simply compare every user's IP to the BlackList table. If the user isn't on it, record the stats.
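That check could look like this sketch, assuming the same mysqli connection and BlackList table as above:
$stmt = $db->prepare('SELECT 1 FROM BlackList WHERE ip = ? LIMIT 1');
$stmt->bind_param('s', $_SERVER['REMOTE_ADDR']);
$stmt->execute();
$stmt->store_result();
if ($stmt->num_rows === 0) {
    // IP is not blacklisted: record the page view here
}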
EDIT
As pointed out below, Googlebot can be tricked by this. If that matters to you (if you're just filtering your own stats and not filtering content, it shouldn't), include your honeypot page in your robots.txt. Google will read the file and avoid the trap; other, nastier bots will still fall into it. Since Google avoids the trap, I would also use option 2 and filter Google's User-Agent string out of the stats.
The number of real users should basically be the total number of visitors minus the bots. If you want, you can check the User-Agent, which will tell you what is browsing the site.
You could try out my tracking script. It's pretty simple to implement, and bots and spiders show up as a bunk browser, so it's easy to weed them out. I use it on all my company's sites for analytics. There's one caveat, though: if you use it for keyword tracking you may be disappointed soon, because Google is starting to change the structure of its query strings for logged-in users.
https://github.com/k4t434sis/tracking.php

How to track users across domains?

We got pitched this idea yesterday. A user visits our site and is marked. Then when they visit other sites, like CNN, they are targeted with ads for our site. So once they are exposed to us, they start to see us everywhere, creating the illusion that we are bigger than we are.
The person pitching it said it was done with cookies. I was very skeptical, since I don't believe there is any way to see what cookies a different domain has set. So I wanted to try and figure out how it was accomplished. The salesman called this technology pixel tracking.
I had never heard of pixel tracking, but from my research I have found that it involves placing a 1-pixel image that references a script on another domain, with parameters to be executed. My first thought was: OK, maybe it's possible this way... but I still don't know how.
Can anyone explain how they are able to mark you as having visited our site, and then see this mark on another site? Is it from your IP?
Included at the bottom of the (CNN) website in this case is an img tag like:
<img src="http://www.webmarketingCompany.com/pixel.php?ID=623489593479">
When a user visits the (CNN) website and the browser renders the page, it sends HTTP requests for all the images as well, including a request to http://www.webmarketingCompany.com for the image pixel.php, which includes the ID as a GET parameter. pixel.php not only returns an image, typically a 1x1 transparent gif (so it isn't visible in the rendered page), but can do a whole host of additional processing using the ID value; and it also has access to any webmarketingCompany.com cookies, which are sent along with the HTTP request.
Of course, CNN have to agree to include the img tag in their html. Typically it's used as a tracker by third party marketing companies working on behalf of CNN to identify who is visiting their site, what pages they're viewing, etc.
But because it's a PHP script, it can do a whole host of extras, such as setting further cookies. If webmarketingCompany.com also handle ad-serving on behalf of CNN, they can do some creative selection of the ads that they choose to serve.
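A minimal sketch of what such a pixel.php might look like; the visitor_id cookie name and the logging are assumptions for illustration:
<?php
// pixel.php -- sketch of a tracking-pixel endpoint.
$id = isset($_GET['ID']) ? $_GET['ID'] : '';

// Read (or create) this marketing domain's own visitor cookie. Because
// the request goes to webmarketingCompany.com, its cookies ride along
// even though the page itself lives on cnn.com.
if (isset($_COOKIE['visitor_id'])) {
    $visitor = $_COOKIE['visitor_id'];
} else {
    $visitor = md5(uniqid(mt_rand(), true));
    setcookie('visitor_id', $visitor, time() + 86400 * 365, '/');
}

// Log which visitor saw which placement (ID), and from which page.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
error_log(sprintf('visitor=%s placement=%s referer=%s', $visitor, $id, $referer));

// Serve a 1x1 transparent GIF so the img tag renders invisibly.
header('Content-Type: image/gif');
header('Cache-Control: no-store');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');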
Such cross-client "pollination" is frowned upon, certainly here in the UK.
What you are describing is pretty standard for all advertisement networks. The only difference here is that they will place that cookie on your site as well.
As long as the browser has "accept third-party cookies" set to true, this will work as the salesman said. Most browsers have it set to true by default; the only exception I can think of is Safari.

PHP: how to know that a click came from Google

My AdSense ad has a dedicated landing page.
I want to show the content only to those who came through that ad.
The page is coded in PHP, so I'm using $_SERVER['HTTP_REFERER'].
Two questions here:
Is there a better alternative to $_SERVER['HTTP_REFERER'] ?
To what strings/domains should I compare the referrer's domain (I'll handle extracting it)? I mean, I'm guessing that Google uses more than one domain for its ads, or not? There's doubleclick.com... any other domains? How can I check, besides trial and error?
$_SERVER['HTTP_REFERER'] is the canonical way to determine where a click came from generally. There are more reliable (and complicated) methods for clicks within a site you fully control, but that's not much help for clicks from Google. Yes, it can be spoofed, and yes, it can be null, but as long as you're not targeting nuclear weapons based on that data, and you can handle null values gracefully, it should be good enough.
As for domains, you have to consider the international Google domains as well as all the google*.com domains.
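A rough sketch of such a check; the suffix list below is illustrative, not exhaustive:
// Decide whether the referring host looks like a Google/ad-serving domain.
function came_from_google_ad($referer) {
    if (!$referer) {
        return false;
    }
    $host = strtolower((string)parse_url($referer, PHP_URL_HOST));
    $suffixes = array('google.com', 'google.co.uk', 'google.de',
                      'googleadservices.com', 'doubleclick.net');
    foreach ($suffixes as $suffix) {
        // Match the bare domain or any subdomain of it.
        if ($host === $suffix || substr($host, -strlen('.' . $suffix)) === '.' . $suffix) {
            return true;
        }
    }
    return false;
}

$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if (came_from_google_ad($referer)) {
    // show the landing-page content
}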
I suggest adding a parameter to the link you give to Google, i.e. instead of yoursite.com/landing, use yoursite.com/landing?campaign=12.
If you are concerned that curious users will play with this parameter, the fix is simple: redirect via a server-side 301 redirect when they hit that URL.
That is, if I request yoursite.com/landing?campaign=12, your server, before serving a page, should log my visit to campaign 12 and redirect me to the plain URL yoursite.com/landing. This has the added advantage that reloads won't increment your campaign hit count.
Yes, users could still mess with the original link if they are clever or curious enough to look at it before they click, but I think this is going to be far more effective than sniffing the referer.
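The log-then-redirect idea in sketch form; the campaign parameter name and the log-file destination are assumptions:
<?php
// landing.php -- log the campaign hit, then redirect to the clean URL.
if (isset($_GET['campaign'])) {
    $campaign = (int)$_GET['campaign'];

    // Record the hit; a DB insert would also work, a log file keeps it short.
    file_put_contents(__DIR__ . '/campaign_hits.log',
        $campaign . ' ' . date('c') . PHP_EOL, FILE_APPEND | LOCK_EX);

    // 301 to the plain URL so reloads and shared links don't double-count.
    header('Location: /landing', true, 301);
    exit;
}

// ... render the plain landing page here ...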
Rather than trying to work out on your own how to measure your page views, consider using an existing system for that, like Google Analytics.

Short URL Outbound Link Tracking in PHP and MySQL

I have a site that shortens links, based on Noah Hendrix's tutorial on the subject. I decided it would be great if I could track when users click the short URLs, similar to the way HootSuite users can track their links with Ow.ly. I currently have a database which stores the short URL along with the true URL and its click count. Ideally, the click-count column would update when that short URL is accessed by an outside user.
In short, I am looking for a PHP/MySQL solution to keep track of the number of times various short URLs are clicked. Any additional information that could be gathered from the clicks would be greatly appreciated as well.
I am assuming you followed the PHP version of his tutorial. If so, look at the listing for serve.php under "Serving the Short URL". In the section around line 11, where it sets the 301 status, you can log the redirect with an update to the database. Something like
$query = "UPDATE `" . $database . "`.`url_redirects` SET `count` = `count` + 1 WHERE `short` = '" . mysql_real_escape_string($short, $db) . "'";
mysql_query($query, $db);
should do it.
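The mysql_* functions match the tutorial's era but were removed in PHP 7; a mysqli sketch of the same update, assuming the same url_redirects table:
// Same update with mysqli and a prepared statement.
$db = new mysqli('localhost', 'user', 'pass', $database);
$stmt = $db->prepare('UPDATE url_redirects SET `count` = `count` + 1 WHERE `short` = ?');
$stmt->bind_param('s', $short);
$stmt->execute();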
Here's a roundabout alternative: how about a no-brain-damage approach? Try putting Google Analytics on the site. You'll not only get a click report, but you can also track paths through the site, ins vs. outs, network properties, user locations, etc. It's a simple JavaScript include and takes all of five minutes to set up, start to finish.
I've been a PHP developer for a long time, and my personal theory is that there are a lot of challenges out there that still need solving; there's no reason to waste time on solutions others are willing to give you...
