Related
How can one detect the search engine bots using php?
I use the following code which seems to be working fine:
function _bot_detected() {
return (
isset($_SERVER['HTTP_USER_AGENT'])
&& preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
);
}
update 16-06-2017
https://support.google.com/webmasters/answer/1061943?hl=en
added mediapartners
Here's a Search Engine Directory of Spider names
Then you use $_SERVER['HTTP_USER_AGENT']; to check if the agent is said spider.
if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
// what to do
}
Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/useragentstring.php
Or more specifically for crawlers:
http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler
If you want to -say- log the number of visits of most common search engine crawlers, you could use
$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) .')/';
$matches = array();
$numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches, 'i');
if($numMatches > 0) // Found a match
{
// $matches[1] contains an array of all text matches to either 'google' or 'yahoo'
}
You can checkout if it's a search engine with this function :
<?php
function crawlerDetect($USER_AGENT)
{
$crawlers = array(
'Google' => 'Google',
'MSN' => 'msnbot',
'Rambler' => 'Rambler',
'Yahoo' => 'Yahoo',
'AbachoBOT' => 'AbachoBOT',
'accoona' => 'Accoona',
'AcoiRobot' => 'AcoiRobot',
'ASPSeek' => 'ASPSeek',
'CrocCrawler' => 'CrocCrawler',
'Dumbot' => 'Dumbot',
'FAST-WebCrawler' => 'FAST-WebCrawler',
'GeonaBot' => 'GeonaBot',
'Gigabot' => 'Gigabot',
'Lycos spider' => 'Lycos',
'MSRBOT' => 'MSRBOT',
'Altavista robot' => 'Scooter',
'AltaVista robot' => 'Altavista',
'ID-Search Bot' => 'IDBot',
'eStyle Bot' => 'eStyle',
'Scrubby robot' => 'Scrubby',
'Facebook' => 'facebookexternalhit',
);
// to get crawlers string used in function uncomment it
// it is better to save it in string than use implode every time
// global $crawlers
$crawlers_agents = implode('|',$crawlers);
if (strpos($crawlers_agents, $USER_AGENT) === false)
return false;
else {
return TRUE;
}
}
?>
Then you can use it like :
<?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?>
I'm using this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
// is bot
}
In addition I use a whitelist to block unwanted bots:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
// allowed bot
}
An unwanted bot (= false-positive user) is then able to solve a captcha to unblock himself for 24 hours. And as no one solves this captcha, I know it does not produce false-positives. So the bot detection seem to work perfectly.
Note: My whitelist is based on Facebooks robots.txt.
Because any client can set the user-agent to what they want, looking for 'Googlebot', 'bingbot' etc is only half the job.
The 2nd part is verifying the client's IP. In the old days this required maintaining IP lists. All the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google https://support.google.com/webmasters/answer/80553 and Bing http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
At first perform a reverse DNS lookup of the client IP. For Google this brings a host name under googlebot.com, for Bing it's under search.msn.com. Then, because someone could set such a reverse DNS on his IP, you need to verify with a forward DNS lookup on that hostname. If the resulting IP is the same as the one of the site's visitor, you're sure it's a crawler from that search engine.
I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier
If you really need to detect GOOGLE engine bots you should never rely on "user_agent" or "IP" address because "user_agent" can be changed and acording to what google said in: Verifying Googlebot
To verify Googlebot as the caller:
1.Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
2.Verify that the domain name is in either googlebot.com or google.com
3.Run a forward DNS lookup on the domain name retrieved in step 1 using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.
Here is my tested code :
<?php
$remote_add=$_SERVER['REMOTE_ADDR'];
$hostname = gethostbyaddr($remote_add);
$googlebot = 'googlebot.com';
$google = 'google.com';
if (stripos(strrev($hostname), strrev($googlebot)) === 0 or stripos(strrev($hostname),strrev($google)) === 0 )
{
//add your code
}
?>
In this code we check "hostname" which should contain "googlebot.com" or "google.com" at the end of "hostname" which is really important to check exact domain not subdomain.
I hope you enjoy ;)
I use this function ... part of the regex comes from prestashop but I added some more bot to it.
public function isBot()
{
$bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|H�m�h�kki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
$userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
$isBot = !$userAgent || preg_match($bot_regex, $userAgent);
return $isBot;
}
Anyway take care that some bots uses browser like user agent to fake their identity
( I got many russian ip that has this behaviour on my site )
One distinctive feature of most of the bot is that they don't carry any cookie and so no session is attached to them.
( I am not sure how but this is for sure the best way to track them )
Use Device Detector open source library, it offers a isBot() function: https://github.com/piwik/device-detector
You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client’s IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.
I made one good and fast function for this
function is_bot(){
if(isset($_SERVER['HTTP_USER_AGENT']))
{
return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
}
return false;
}
This cover 99% of all possible bots, search engines etc.
<?php // IPCLOACK HOOK
if (CLOAKING_LEVEL != 4) {
$lastupdated = date("Ymd", filemtime(FILE_BOTS));
if ($lastupdated != date("Ymd")) {
$lists = array(
'http://labs.getyacg.com/spiders/google.txt',
'http://labs.getyacg.com/spiders/inktomi.txt',
'http://labs.getyacg.com/spiders/lycos.txt',
'http://labs.getyacg.com/spiders/msn.txt',
'http://labs.getyacg.com/spiders/altavista.txt',
'http://labs.getyacg.com/spiders/askjeeves.txt',
'http://labs.getyacg.com/spiders/wisenut.txt',
);
foreach($lists as $list) {
$opt .= fetch($list);
}
$opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
$fp = fopen(FILE_BOTS,"w");
fwrite($fp,$opt);
fclose($fp);
}
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
$ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$host = strtolower(gethostbyaddr($ip));
$file = implode(" ", file(FILE_BOTS));
$exp = explode(".", $ip);
$class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.';
$threshold = CLOAKING_LEVEL;
$cloak = 0;
if (stristr($host, "googlebot") && stristr($host, "inktomi") && stristr($host, "msn")) {
$cloak++;
}
if (stristr($file, $class)) {
$cloak++;
}
if (stristr($file, $agent)) {
$cloak++;
}
if (strlen($ref) > 0) {
$cloak = 0;
}
if ($cloak >= $threshold) {
$cloakdirective = 1;
} else {
$cloakdirective = 0;
}
}
?>
That would be the ideal way to cloak for spiders. It's from an open source script called [YACG] - http://getyacg.com
Needs a bit of work, but definitely the way to go.
100% Working Bot detector. It is working on my website successfully.
function isBotDetected() {
if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
) {
return true; // 'Above given bots detected'
}
return false;
} // End :: isBotDetected()
I'm using this code, pretty good. You will very easy to know user-agents visitted your site. This code is opening a file and write the user_agent down the file. You can check each day this file by go to yourdomain.com/useragent.txt and know about new user_agents and put them in your condition of if clause.
$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if(!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)){
// if not meet the conditions then
// do what you need
// here open a file and write the user_agent down the file. You can check each day this file useragent.txt and know about new user_agents and put them in your condition of if clause
if($user_agent!=""){
$myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
fwrite($myfile, $user_agent);
$user_agent = "\n";
fwrite($myfile, $user_agent);
fclose($myfile);
}
}
This is the content of useragent.txt
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
For Google i'm using this method.
function is_google() {
$ip = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr( $ip );
if ( strpos( $host, '.google.com' ) !== false || strpos( $host, '.googlebot.com' ) !== false ) {
$forward_lookup = gethostbyname( $host );
if ( $forward_lookup == $ip ) {
return true;
}
return false;
} else {
return false;
}
}
var_dump( is_google() );
Credits: https://support.google.com/webmasters/answer/80553
Verifying Googlebot
As useragent can be changed...
the only official supported way to identify a google bot is to run a
reverse DNS lookup on the accessing IP address and run a forward DNS
lookup on the result to verify that it points to accessing IP address
and the resulting domain name is in either googlebot.com or google.com
domain.
Taken from here.
so you must run a DNS lookup
Both, reverse and forward.
See this guide on Google Search Central.
function bot_detected() {
if(preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT']){
return true;
}
else{
return false;
}
}
might be late, but what about a hidden a link. All bots will use the rel attribute follow, only bad bots will use the nofollow rel attribute.
<a style="display:none;" rel="follow" href="javascript:void(0);" onclick="isabot();">.</a>
function isabot(){
//define a variable to pass with ajax to php
// || send bots info direct to where ever.
isabot = true;
}
for a bad bot you can use this:
<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>
for PHP specific you can remove the onclick attribute and replace the href attribute with a link to your ip detector/ bot detector like so:
<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>
OR
<a style="display:none;" rel="nofollow" href="https://somedomain.com/badbotdetector.php">.</a>
you can work with it and maybe use both, one detects a bot, while the other proves it to be a bad bot.
hope you find this useful
Can any any one help with the following code it runs and works fine but seems to always log a double entry each time for a single hint.
Not sure if its the host or my code
Time: 23rd February 2012 5:45:36 am
IP Address: xxx.xxx.141.162
Browser: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1Time: 23rd February 2012 5:45:36 am
IP Address: xxx.xxx.141.162
Browser: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
.
<?php
// Create a new image instance
$im = imagecreatetruecolor(60, 20);
// Make the background red
imagefilledrectangle($im, 0, 0, 60, 20, 0xFF0000);
// Draw a text string on the image
imagestring($im, 3, 1, 1, 'Tracking', 0xCCFFFF);
// Output the image to browser
header('Content-Type: image/gif');
imagegif($im);
imagedestroy($im);
// Get server variables
$address = $_SERVER['REMOTE_ADDR'];
$referer = isset($_SERVER['HTTP_REFERER']) ?
$_SERVER['HTTP_REFERER'] : '';
$browser = $_SERVER['HTTP_USER_AGENT'];
//Open log file
$file = fopen("log.html",'a');
//Set time zone and date format
date_default_timezone_set('Australia/Sydney');
$accessTime = date("jS F Y g:i:s a");
//write collected data to file
fwrite($file, "<b>Time:</b> $accessTime<br />");
if( $address != null)
fwrite($file,"<b>IP Address:</b> $address<br />");
if($referer != null)
fwrite($file,"<b>Referer:<b> $referer<br />");
fwrite($file,"<b>Browser:</b> $browser<hr>");
// save file and close
fclose($file);
?>
I think this may be because of request to favicon.ico. Browser does request to http://your-site.com/favicon.ico and webserver rewrite this request to your script that log it to file. So you get two lines in log file.
I need to display additional information on a web page (PHP) if one of the following criteria is met:
.NET Framework Installed on the client machine in lower than 3.5
Impossible to determine if .NET Framework 3.5 is installed or not on the client machine
I know that some browser (if not only one, IE) is sending that information in his tag.
Can any of you provide me with his suggestions ?
Website is built in PHP.
Thanks!
EDIT:
Proposed answers are incomplete and/or don't provide me with a robust solution.
The only way to get this information is from the user agent that the browser sends. You can parse the .NET version from there. But note that client can spoof this information or omit it completely, so I wouldn't base any critical functionality on this.
I would try and do a strstr on the UserAgent in PHP, Example Below
A .NET User Agent : Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Pivim Multibar; GTB6.4; .NET CLR 2.0.50727)
The PHP:
<?php
function DotNetInstalled($ua = false)
{
$ua = $ua ? $ua : $_SERVER['HTTP_USER_AGENT'];
//return (strstr('.NET CLR',$ua) !== false);
$matches = preg_match('^\.NET CLR ([0-4]+\.)?([0-9]\.)?(\*|\d+)$',$ua);
if((int)$matches[1] > 0)
{
return array(
(int)$matches[1],
(int)$matches[2],
(int)$matches[3],
);
}
return false;
}
if(false !== ($version = DotNetInstalled()))
{
//Show me the money
//$version[0] = 2
//$version[1] = 0
//$version[2] = 50727
}
?>
I would also check out the following PEAR Package : http://pear.php.net/package/Net_UserAgent_Detect
?>
Here is another version I actually made for a friend of mine and realised that there was a bounty on this thread so what the hell :)
This is fully working but only with User-agent strings as there is no alternative means of doing so.
The core class:
class NETFrameworkChecker
{
//General String / Array holders
var $original_au,$ua_succesParse,$ua_componants,$ua_dotNetString,$CLRTag = "";
//IsInstalled
var $installed = false;
//Version holders
public $major = 0,$minor = 0,$build = 0;
public function __construct($ua = false)
{
$this->original_au = $ua !== false ? $ua : $_SERVER['HTTP_USER_AGENT'];
$this->ParserUserAgent();
}
public function Installed(){return (bool)$this->installed;}
public function AUTag(){return $this->CLRTag;}
//Version Getters
public function getMajor(){return $this->major;}
public function getMinor(){return $this->minor;}
public function getBuild(){return $this->build;}
private function ParserUserAgent()
{
$this->ua_succesParse = (bool) preg_match('/(?<browser>.+?)\s\((?<components>.*?)\)/',$this->original_au,$this->ua_componants);
if($this->ua_succesParse)
{
$this->ua_componants = explode(';',$this->ua_componants['components']);
foreach($this->ua_componants as $aComponant)
{
$aComponant = trim($aComponant);
if(substr(strtoupper($aComponant),0,4) == ".NET")
{
//We have .Net Installed
$this->installed = true;
$this->CLRTag = $aComponant;
//Lets make sure we can get the versions
$gotVersions = (bool)preg_match("/\.NET.CLR.+?(?<major>[0-9]{1})\.(?<minor>[0-9]{1})\.(?<build>[0-9]+)/si",$aComponant,$versions);
if($gotVersions)
{
$this->major = (int)$versions['major'];
$this->minor = (int)$versions['minor'];
$this->build = (int)$versions['build'];
}
break;
}
}
}
}
}
Example Usage:
$Net = new NETFrameworkChecker(); //leave first param blank to detect current user agent
if($Net->Installed())
{
if($Net->getMajor()> 2 && $Net->getMinor() >= 0)
{
//User is using a >NET system thats greater than 2.0.0000
if($Net->GetBuild() >= 0200)
{
//We can do stuff with so that's only supported from this build up-words
}
}else
{
//Redirect them asking them to upgrade :) pretty please
}
}
If you also want to check custom UA Strings from DB lets say
$Net = new NETFrameworkChecker("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Pivim Multibar; GTB6.4; .NET CLR 2.0.50727)");
Info about .NET CLR versions in the UserAgent:
http://www.hanselman.com/blog/TheNETFrameworkAndTheBrowsersUserAgentString.aspx
http://msdn.microsoft.com/en-us/library/ms537503.aspx
The code below is first the client code, then the class file.
For some reason the 'deductTokens()' method is calling twice, thus charging an account double.
I've been programming all night, so I may just need a second pair of eyes:
if ($action == 'place_order') {
if ($_REQUEST['unlimited'] == 200) {
$license = 'extended';
} else {
$license = 'standard';
}
if ($photograph->isValidPhotographSize($photograph_id, $_REQUEST['size_radio'])) {
$token_cost = $photograph->getTokenCost($_REQUEST['size_radio'], $_REQUEST['unlimited']);
$order = new ImageOrder($_SESSION['user']['id'], $_REQUEST['size_radio'], $license, $token_cost);
$order->saveOrder();
$order->deductTokens();
header('location: account.php');
} else {
die("Please go back and select a valid photograph size");
}
}
######CLASS CODE#######
<?php
include_once('database_classes.php');
class Order {
protected $account_id;
protected $cost;
protected $license;
public function __construct($account_id, $license, $cost) {
$this->account_id = $account_id;
$this->cost = $cost;
$this->license = $license;
}
}
class ImageOrder extends Order {
protected $size;
public function __construct($account_id, $size, $license, $cost) {
$this->size = $size;
parent::__construct($account_id, $license, $cost);
}
public function saveOrder() {
//$db = Connect::connect();
//$account_id = $db->real_escape_string($this->account_id);
//$size = $db->real_escape_string($this->size);
//$license = $db->real_escape_string($this->license);
//$cost = $db->real_escape_string($this->cost);
}
public function deductTokens() {
$db = Connect::connect();
$account_id = $db->real_escape_string($this->account_id);
$cost = $db->real_escape_string($this->cost);
$query = "UPDATE accounts set tokens=tokens-$cost WHERE id=$account_id";
$result = $db->query($query);
}
}
?>
When I die("$query"); directly after the query, it's printing the proper statement, and when I run that query within MySQL it works perfectly.
$action = $_REQUEST['action'];
account.php is just a list of orders, never does it call up downloads.php. Just tried commenting out the redirect, but I'm having the same problem. I don't understand how it's getting called twice, the die statements are showing the right query, and the script doesn't reload itself.
Here are my apache access logs:
71.*** - - [22/May/2010:13:14:35 +0000] "POST /download.php?action=confirm_download&photograph_id=122 HTTP/1.1" 200 1951 "http://***.com/viewphotograph.php?photograph_id=122" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)"
71.*** - - [22/May/2010:13:14:36 +0000] "GET /download.php?action=place_order&photograph_id=122&size_radio=xsmall&unlimited=0 HTTP/1.1" 302 453 "http://*** .com/download.php?action=confirm_download&photograph_id=122" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)"
71.*** - - [22/May/2010:13:14:36 +0000] "GET /download.php?action=place_order&photograph_id=122&size_radio=xsmall&unlimited=0 HTTP/1.1" 302 453 "http://*** .com/download.php?action=confirm_download&photograph_id=122" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)"
71.*** - - [22/May/2010:13:14:36 +0000] "GET /account.php HTTP/1.1" 200 2626 "http://***.com/download.php?action=confirm_download&photograph_id=122" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)"
I understand there's obviously something wrong here. But I can't figure out where the second request is coming from.
The line:
header('location: account.php');
Forwards the browser, but doesn't end the php script.
This might be fine, the code-snippet you gave doens't "do" anything after this line.
Another option might be a double-click
If you double-click on the submit button, the form will be sent twice.
We used some javascript to disable the submit button after the first click.
Another long shot, but it happened to me once in Firefox - doubled page executions, resulting in doubled inserts and updates - close your browser and restart.
I know this will sound strange but make sure that you don't have tag with empty src="" attribute or any css style refering to empty url (like background: url();) on your site around the place when you have your code that runs twice.
Read about some trouble this may cause here: http://hi.baidu.com/zhenyk/blog/item/38a1051fc63b96c3a686698f.html
How can one detect the search engine bots using php?
I use the following code which seems to be working fine:
function _bot_detected() {
return (
isset($_SERVER['HTTP_USER_AGENT'])
&& preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
);
}
update 16-06-2017
https://support.google.com/webmasters/answer/1061943?hl=en
added mediapartners
Here's a Search Engine Directory of Spider names
Then you use $_SERVER['HTTP_USER_AGENT']; to check if the agent is said spider.
if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
// what to do
}
Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/useragentstring.php
Or more specifically for crawlers:
http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler
If you want to -say- log the number of visits of most common search engine crawlers, you could use
$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) .')/';
$matches = array();
$numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches, 'i');
if($numMatches > 0) // Found a match
{
// $matches[1] contains an array of all text matches to either 'google' or 'yahoo'
}
You can checkout if it's a search engine with this function :
<?php
function crawlerDetect($USER_AGENT)
{
$crawlers = array(
'Google' => 'Google',
'MSN' => 'msnbot',
'Rambler' => 'Rambler',
'Yahoo' => 'Yahoo',
'AbachoBOT' => 'AbachoBOT',
'accoona' => 'Accoona',
'AcoiRobot' => 'AcoiRobot',
'ASPSeek' => 'ASPSeek',
'CrocCrawler' => 'CrocCrawler',
'Dumbot' => 'Dumbot',
'FAST-WebCrawler' => 'FAST-WebCrawler',
'GeonaBot' => 'GeonaBot',
'Gigabot' => 'Gigabot',
'Lycos spider' => 'Lycos',
'MSRBOT' => 'MSRBOT',
'Altavista robot' => 'Scooter',
'AltaVista robot' => 'Altavista',
'ID-Search Bot' => 'IDBot',
'eStyle Bot' => 'eStyle',
'Scrubby robot' => 'Scrubby',
'Facebook' => 'facebookexternalhit',
);
// to get crawlers string used in function uncomment it
// it is better to save it in string than use implode every time
// global $crawlers
$crawlers_agents = implode('|',$crawlers);
if (strpos($crawlers_agents, $USER_AGENT) === false)
return false;
else {
return TRUE;
}
}
?>
Then you can use it like :
<?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?>
I'm using this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
// is bot
}
In addition I use a whitelist to block unwanted bots:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
// allowed bot
}
An unwanted bot (= false-positive user) is then able to solve a captcha to unblock himself for 24 hours. And as no one solves this captcha, I know it does not produce false-positives. So the bot detection seem to work perfectly.
Note: My whitelist is based on Facebooks robots.txt.
Because any client can set the user-agent to what they want, looking for 'Googlebot', 'bingbot' etc is only half the job.
The 2nd part is verifying the client's IP. In the old days this required maintaining IP lists. All the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google https://support.google.com/webmasters/answer/80553 and Bing http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
At first perform a reverse DNS lookup of the client IP. For Google this brings a host name under googlebot.com, for Bing it's under search.msn.com. Then, because someone could set such a reverse DNS on his IP, you need to verify with a forward DNS lookup on that hostname. If the resulting IP is the same as the one of the site's visitor, you're sure it's a crawler from that search engine.
I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier
If you really need to detect GOOGLE engine bots you should never rely on "user_agent" or "IP" address because "user_agent" can be changed and acording to what google said in: Verifying Googlebot
To verify Googlebot as the caller:
1.Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
2.Verify that the domain name is in either googlebot.com or google.com
3.Run a forward DNS lookup on the domain name retrieved in step 1 using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.
Here is my tested code :
<?php
$remote_add=$_SERVER['REMOTE_ADDR'];
$hostname = gethostbyaddr($remote_add);
$googlebot = 'googlebot.com';
$google = 'google.com';
if (stripos(strrev($hostname), strrev($googlebot)) === 0 or stripos(strrev($hostname),strrev($google)) === 0 )
{
//add your code
}
?>
In this code we check "hostname" which should contain "googlebot.com" or "google.com" at the end of "hostname" which is really important to check exact domain not subdomain.
I hope you enjoy ;)
I use this function ... part of the regex comes from prestashop but I added some more bot to it.
public function isBot()
{
$bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|H�m�h�kki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
$userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
$isBot = !$userAgent || preg_match($bot_regex, $userAgent);
return $isBot;
}
Anyway take care that some bots uses browser like user agent to fake their identity
( I got many russian ip that has this behaviour on my site )
One distinctive feature of most of the bot is that they don't carry any cookie and so no session is attached to them.
( I am not sure how but this is for sure the best way to track them )
Use Device Detector open source library, it offers a isBot() function: https://github.com/piwik/device-detector
You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client’s IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.
I made one good and fast function for this
function is_bot(){
if(isset($_SERVER['HTTP_USER_AGENT']))
{
return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
}
return false;
}
This cover 99% of all possible bots, search engines etc.
<?php // IPCLOACK HOOK
if (CLOAKING_LEVEL != 4) {
$lastupdated = date("Ymd", filemtime(FILE_BOTS));
if ($lastupdated != date("Ymd")) {
$lists = array(
'http://labs.getyacg.com/spiders/google.txt',
'http://labs.getyacg.com/spiders/inktomi.txt',
'http://labs.getyacg.com/spiders/lycos.txt',
'http://labs.getyacg.com/spiders/msn.txt',
'http://labs.getyacg.com/spiders/altavista.txt',
'http://labs.getyacg.com/spiders/askjeeves.txt',
'http://labs.getyacg.com/spiders/wisenut.txt',
);
foreach($lists as $list) {
$opt .= fetch($list);
}
$opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
$fp = fopen(FILE_BOTS,"w");
fwrite($fp,$opt);
fclose($fp);
}
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
$ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$host = strtolower(gethostbyaddr($ip));
$file = implode(" ", file(FILE_BOTS));
$exp = explode(".", $ip);
$class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.';
$threshold = CLOAKING_LEVEL;
$cloak = 0;
if (stristr($host, "googlebot") && stristr($host, "inktomi") && stristr($host, "msn")) {
$cloak++;
}
if (stristr($file, $class)) {
$cloak++;
}
if (stristr($file, $agent)) {
$cloak++;
}
if (strlen($ref) > 0) {
$cloak = 0;
}
if ($cloak >= $threshold) {
$cloakdirective = 1;
} else {
$cloakdirective = 0;
}
}
?>
That would be the ideal way to cloak for spiders. It's from an open source script called [YACG] - http://getyacg.com
Needs a bit of work, but definitely the way to go.
100% Working Bot detector. It is working on my website successfully.
function isBotDetected() {
if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
) {
return true; // 'Above given bots detected'
}
return false;
} // End :: isBotDetected()
I'm using this code, pretty good. You will very easy to know user-agents visitted your site. This code is opening a file and write the user_agent down the file. You can check each day this file by go to yourdomain.com/useragent.txt and know about new user_agents and put them in your condition of if clause.
$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if(!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)){
// if not meet the conditions then
// do what you need
// here open a file and write the user_agent down the file. You can check each day this file useragent.txt and know about new user_agents and put them in your condition of if clause
if($user_agent!=""){
$myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
fwrite($myfile, $user_agent);
$user_agent = "\n";
fwrite($myfile, $user_agent);
fclose($myfile);
}
}
This is the content of useragent.txt
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
For Google i'm using this method.
function is_google() {
$ip = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr( $ip );
if ( strpos( $host, '.google.com' ) !== false || strpos( $host, '.googlebot.com' ) !== false ) {
$forward_lookup = gethostbyname( $host );
if ( $forward_lookup == $ip ) {
return true;
}
return false;
} else {
return false;
}
}
var_dump( is_google() );
Credits: https://support.google.com/webmasters/answer/80553
Verifying Googlebot
As useragent can be changed...
the only official supported way to identify a google bot is to run a
reverse DNS lookup on the accessing IP address and run a forward DNS
lookup on the result to verify that it points to accessing IP address
and the resulting domain name is in either googlebot.com or google.com
domain.
Taken from here.
so you must run a DNS lookup
Both, reverse and forward.
See this guide on Google Search Central.
function bot_detected() {
if(preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT']){
return true;
}
else{
return false;
}
}
might be late, but what about a hidden a link. All bots will use the rel attribute follow, only bad bots will use the nofollow rel attribute.
<a style="display:none;" rel="follow" href="javascript:void(0);" onclick="isabot();">.</a>
function isabot(){
//define a variable to pass with ajax to php
// || send bots info direct to where ever.
isabot = true;
}
for a bad bot you can use this:
<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>
for PHP specific you can remove the onclick attribute and replace the href attribute with a link to your ip detector/ bot detector like so:
<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>
OR
<a style="display:none;" rel="nofollow" href="https://somedomain.com/badbotdetector.php">.</a>
you can work with it and maybe use both, one detects a bot, while the other proves it to be a bad bot.
hope you find this useful