How can one detect the search engine bots using php?
I use the following code which seems to be working fine:
function _bot_detected() {
return (
isset($_SERVER['HTTP_USER_AGENT'])
&& preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
);
}
update 16-06-2017
https://support.google.com/webmasters/answer/1061943?hl=en
added mediapartners
Here's a Search Engine Directory of Spider names
Then you use $_SERVER['HTTP_USER_AGENT']; to check if the agent is said spider.
if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
// what to do
}
Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/useragentstring.php
Or more specifically for crawlers:
http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler
If you want to -say- log the number of visits of most common search engine crawlers, you could use
$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) .')/';
$matches = array();
$numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches, 'i');
if($numMatches > 0) // Found a match
{
// $matches[1] contains an array of all text matches to either 'google' or 'yahoo'
}
You can checkout if it's a search engine with this function :
<?php
function crawlerDetect($USER_AGENT)
{
$crawlers = array(
'Google' => 'Google',
'MSN' => 'msnbot',
'Rambler' => 'Rambler',
'Yahoo' => 'Yahoo',
'AbachoBOT' => 'AbachoBOT',
'accoona' => 'Accoona',
'AcoiRobot' => 'AcoiRobot',
'ASPSeek' => 'ASPSeek',
'CrocCrawler' => 'CrocCrawler',
'Dumbot' => 'Dumbot',
'FAST-WebCrawler' => 'FAST-WebCrawler',
'GeonaBot' => 'GeonaBot',
'Gigabot' => 'Gigabot',
'Lycos spider' => 'Lycos',
'MSRBOT' => 'MSRBOT',
'Altavista robot' => 'Scooter',
'AltaVista robot' => 'Altavista',
'ID-Search Bot' => 'IDBot',
'eStyle Bot' => 'eStyle',
'Scrubby robot' => 'Scrubby',
'Facebook' => 'facebookexternalhit',
);
// to get crawlers string used in function uncomment it
// it is better to save it in string than use implode every time
// global $crawlers
$crawlers_agents = implode('|',$crawlers);
if (strpos($crawlers_agents, $USER_AGENT) === false)
return false;
else {
return TRUE;
}
}
?>
Then you can use it like :
<?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?>
I'm using this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
// is bot
}
In addition I use a whitelist to block unwanted bots:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
// allowed bot
}
An unwanted bot (= false-positive user) is then able to solve a captcha to unblock himself for 24 hours. And as no one solves this captcha, I know it does not produce false-positives. So the bot detection seem to work perfectly.
Note: My whitelist is based on Facebooks robots.txt.
Because any client can set the user-agent to what they want, looking for 'Googlebot', 'bingbot' etc is only half the job.
The 2nd part is verifying the client's IP. In the old days this required maintaining IP lists. All the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google https://support.google.com/webmasters/answer/80553 and Bing http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
At first perform a reverse DNS lookup of the client IP. For Google this brings a host name under googlebot.com, for Bing it's under search.msn.com. Then, because someone could set such a reverse DNS on his IP, you need to verify with a forward DNS lookup on that hostname. If the resulting IP is the same as the one of the site's visitor, you're sure it's a crawler from that search engine.
I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier
If you really need to detect GOOGLE engine bots you should never rely on "user_agent" or "IP" address because "user_agent" can be changed and acording to what google said in: Verifying Googlebot
To verify Googlebot as the caller:
1.Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
2.Verify that the domain name is in either googlebot.com or google.com
3.Run a forward DNS lookup on the domain name retrieved in step 1 using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.
Here is my tested code :
<?php
$remote_add=$_SERVER['REMOTE_ADDR'];
$hostname = gethostbyaddr($remote_add);
$googlebot = 'googlebot.com';
$google = 'google.com';
if (stripos(strrev($hostname), strrev($googlebot)) === 0 or stripos(strrev($hostname),strrev($google)) === 0 )
{
//add your code
}
?>
In this code we check "hostname" which should contain "googlebot.com" or "google.com" at the end of "hostname" which is really important to check exact domain not subdomain.
I hope you enjoy ;)
I use this function ... part of the regex comes from prestashop but I added some more bot to it.
public function isBot()
{
$bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|H�m�h�kki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
$userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
$isBot = !$userAgent || preg_match($bot_regex, $userAgent);
return $isBot;
}
Anyway take care that some bots uses browser like user agent to fake their identity
( I got many russian ip that has this behaviour on my site )
One distinctive feature of most of the bot is that they don't carry any cookie and so no session is attached to them.
( I am not sure how but this is for sure the best way to track them )
Use Device Detector open source library, it offers a isBot() function: https://github.com/piwik/device-detector
You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client’s IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.
I made one good and fast function for this
function is_bot(){
if(isset($_SERVER['HTTP_USER_AGENT']))
{
return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
}
return false;
}
This cover 99% of all possible bots, search engines etc.
<?php // IPCLOACK HOOK
if (CLOAKING_LEVEL != 4) {
$lastupdated = date("Ymd", filemtime(FILE_BOTS));
if ($lastupdated != date("Ymd")) {
$lists = array(
'http://labs.getyacg.com/spiders/google.txt',
'http://labs.getyacg.com/spiders/inktomi.txt',
'http://labs.getyacg.com/spiders/lycos.txt',
'http://labs.getyacg.com/spiders/msn.txt',
'http://labs.getyacg.com/spiders/altavista.txt',
'http://labs.getyacg.com/spiders/askjeeves.txt',
'http://labs.getyacg.com/spiders/wisenut.txt',
);
foreach($lists as $list) {
$opt .= fetch($list);
}
$opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
$fp = fopen(FILE_BOTS,"w");
fwrite($fp,$opt);
fclose($fp);
}
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
$ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$host = strtolower(gethostbyaddr($ip));
$file = implode(" ", file(FILE_BOTS));
$exp = explode(".", $ip);
$class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.';
$threshold = CLOAKING_LEVEL;
$cloak = 0;
if (stristr($host, "googlebot") && stristr($host, "inktomi") && stristr($host, "msn")) {
$cloak++;
}
if (stristr($file, $class)) {
$cloak++;
}
if (stristr($file, $agent)) {
$cloak++;
}
if (strlen($ref) > 0) {
$cloak = 0;
}
if ($cloak >= $threshold) {
$cloakdirective = 1;
} else {
$cloakdirective = 0;
}
}
?>
That would be the ideal way to cloak for spiders. It's from an open source script called [YACG] - http://getyacg.com
Needs a bit of work, but definitely the way to go.
100% Working Bot detector. It is working on my website successfully.
function isBotDetected() {
if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
) {
return true; // 'Above given bots detected'
}
return false;
} // End :: isBotDetected()
I'm using this code, pretty good. You will very easy to know user-agents visitted your site. This code is opening a file and write the user_agent down the file. You can check each day this file by go to yourdomain.com/useragent.txt and know about new user_agents and put them in your condition of if clause.
$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if(!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)){
// if not meet the conditions then
// do what you need
// here open a file and write the user_agent down the file. You can check each day this file useragent.txt and know about new user_agents and put them in your condition of if clause
if($user_agent!=""){
$myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
fwrite($myfile, $user_agent);
$user_agent = "\n";
fwrite($myfile, $user_agent);
fclose($myfile);
}
}
This is the content of useragent.txt
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
For Google i'm using this method.
function is_google() {
$ip = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr( $ip );
if ( strpos( $host, '.google.com' ) !== false || strpos( $host, '.googlebot.com' ) !== false ) {
$forward_lookup = gethostbyname( $host );
if ( $forward_lookup == $ip ) {
return true;
}
return false;
} else {
return false;
}
}
var_dump( is_google() );
Credits: https://support.google.com/webmasters/answer/80553
Verifying Googlebot
As useragent can be changed...
the only official supported way to identify a google bot is to run a
reverse DNS lookup on the accessing IP address and run a forward DNS
lookup on the result to verify that it points to accessing IP address
and the resulting domain name is in either googlebot.com or google.com
domain.
Taken from here.
so you must run a DNS lookup
Both, reverse and forward.
See this guide on Google Search Central.
function bot_detected() {
if(preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT']){
return true;
}
else{
return false;
}
}
might be late, but what about a hidden a link. All bots will use the rel attribute follow, only bad bots will use the nofollow rel attribute.
<a style="display:none;" rel="follow" href="javascript:void(0);" onclick="isabot();">.</a>
function isabot(){
//define a variable to pass with ajax to php
// || send bots info direct to where ever.
isabot = true;
}
for a bad bot you can use this:
<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>
for PHP specific you can remove the onclick attribute and replace the href attribute with a link to your ip detector/ bot detector like so:
<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>
OR
<a style="display:none;" rel="nofollow" href="https://somedomain.com/badbotdetector.php">.</a>
you can work with it and maybe use both, one detects a bot, while the other proves it to be a bad bot.
hope you find this useful
Related
i have created a website in codeignter, the website is working fine my local, i uploaded it to my server, then its showing the following error:
Table 'kirana_btp_new.oc_ci_sessions' doesn't exist
SELECT * FROM (`oc_ci_sessions`) WHERE `session_id` = '20e9bf3d13c5fbcad7582f354abaf8e3' AND `user_agent` = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
Filename: libraries/Session.php
coz my table names doesn't start with _oc, i don't know where this oc is coming from, below is a few code from the session.php where the error is showing.
if ($this->sess_match_useragent == TRUE)
{
$this->CI->db->where('user_agent', $session['user_agent']);
}
$query = $this->CI->db->get($this->sess_table_name);
// No result? Kill it!
if ($query->num_rows() == 0)
{
$this->sess_destroy();
return FALSE;
}
can anyone please tell me what is wrong here? thanks in advance
please check your config database file
you can add there prefix for db prefix
$db['default']['dbprefix'] = '';
you can check it from application\config\database.php
I need to get a specific value from the subscription object in woocommerce to put it into a variable, but unfortunately I don't manage to solve this problem.
This object I get from woocommerce subscription:
{"id":327,"parent_id":326,"status":"expired","currency":"EUR","version":"3.6.5","prices_include_tax":false,"date_created":{"date":"2019-08-03 10:40:55.000000","timezone_type":3,"timezone":"Europe\/Berlin"},"date_modified":{"date":"2019-08-03 10:49:23.000000","timezone_type":3,"timezone":"Europe\/Berlin"},"discount_total":"0","discount_tax":"0","shipping_total":"0.00","shipping_tax":"0","cart_tax":"0","total":"2.00","total_tax":"0","customer_id":5,"order_key":"wc_order_O6jvDq6kygZxu","billing":{"first_name":"Hello","last_name":"Test","company":"","address_1":"Test 123","address_2":"","city":"1111","state":"","postcode":"8888","country":"AT","email":"123#test123.com","phone":"1234"},"shipping":{"first_name":"","last_name":"","company":"","address_1":"","address_2":"","city":"","state":"","postcode":"","country":""},"payment_method":"stripe","payment_method_title":"Credit Card (Stripe)","transaction_id":"","customer_ip_address":"178.165.131.46","customer_user_agent":"Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/70.0.3538.77 Safari\/537.36","created_via":"checkout","customer_note":"","date_completed":null,"date_paid":null,"cart_hash":"","billing_period":"day","billing_interval":"1","suspension_count":0,"requires_manual_renewal":false,"cancelled_email_sent":"","trial_period":"","schedule_trial_end":null,"schedule_next_payment":null,"schedule_cancelled":null,"schedule_end":{"date":"2019-08-03 10:49:24.000000","timezone_type":3,"timezone":"Europe\/Berlin"},"schedule_payment_retry":null,"schedule_start":{"date":"2019-08-03 10:40:55.000000","timezone_type":3,"timezone":"Europe\/Berlin"},"switch_data":"","number":"327","meta_data":[{"id":3350,"key":"is_vat_exempt","value":"no"},{"id":3351,"key":"_cred_meta","value":"a:1:{i:0;a:3:{s:15:\"cred_product_id\";s:3:\"168\";s:12:\"cred_form_id\";i:156;s:12:\"cred_post_id\";i:325;}}"},{"id":3352,"key":"_cred_post_id","value":"325"},
{"id":3353,"key":"_cred_form_id","value":"156"},
{"id":3373,"key":"_stripe_customer_id","value":"cus_FXnn7KOOi3LN6T"},{"id":3374,"key":"_stripe_source_id","value":"src_1F2iCEFxFOzrkjfff6wQYsA7"}],"line_items":{"31":{}},"tax_lines":[],"shipping_lines":[],"fee_lines":[],"coupon_lines":[]}
Now I want to read the "value" out of this line and put into my variable:
{"id":3353,"key":"_cred_form_id","value":"156"}
How can I do that?
Would be great if somebody could give me a solution on how to solve this problem. Thanks a lot!
This code will do as you want.
$json = '{"id":327,"parent_id":326,"status":"expired","currency":"EUR","version":"3.6.5","prices_include_tax":false,"date_created":{"date":"2019-08-03 10:40:55.000000","timezone_type":3,"timezone":"Europe\/Berlin"},"date_modified":{"date":"2019-08-03 10:49:23.000000","timezone_type":3,"timezone":"Europe\/Berlin"},"discount_total":"0","discount_tax":"0","shipping_total":"0.00","shipping_tax":"0","cart_tax":"0","total":"2.00","total_tax":"0","customer_id":5,"order_key":"wc_order_O6jvDq6kygZxu","billing":{"first_name":"Hello","last_name":"Test","company":"","address_1":"Test 123","address_2":"","city":"1111","state":"","postcode":"8888","country":"AT","email":"123#test123.com","phone":"1234"},"shipping":{"first_name":"","last_name":"","company":"","address_1":"","address_2":"","city":"","state":"","postcode":"","country":""},"payment_method":"stripe","payment_method_title":"Credit Card (Stripe)","transaction_id":"","customer_ip_address":"178.165.131.46","customer_user_agent":"Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/70.0.3538.77 Safari\/537.36","created_via":"checkout","customer_note":"","date_completed":null,"date_paid":null,"cart_hash":"","billing_period":"day","billing_interval":"1","suspension_count":0,"requires_manual_renewal":false,"cancelled_email_sent":"","trial_period":"","schedule_trial_end":null,"schedule_next_payment":null,"schedule_cancelled":null,"schedule_end":{"date":"2019-08-03 10:49:24.000000","timezone_type":3,"timezone":"Europe\/Berlin"},"schedule_payment_retry":null,"schedule_start":{"date":"2019-08-03 10:40:55.000000","timezone_type":3,"timezone":"Europe\/Berlin"},"switch_data":"","number":"327","meta_data":[{"id":3350,"key":"is_vat_exempt","value":"no"},{"id":3351,"key":"_cred_meta","value":"a:1:{i:0;a:3:{s:15:\"cred_product_id\";s:3:\"168\";s:12:\"cred_form_id\";i:156;s:12:\"cred_post_id\";i:325;}}"},{"id":3352,"key":"_cred_post_id","value":"325"},
{"id":3353,"key":"_cred_form_id","value":"156"},
{"id":3373,"key":"_stripe_customer_id","value":"cus_FXnn7KOOi3LN6T"},{"id":3374,"key":"_stripe_source_id","value":"src_1F2iCEFxFOzrkjfff6wQYsA7"}],"line_items":{"31":{}},"tax_lines":[],"shipping_lines":[],"fee_lines":[],"coupon_lines":[]}';
$object = json_decode($json);
$meta_data = $object->meta_data;
$target_id = 3353;
foreach($meta_data as $data) {
if ($data->id == $target_id) {
$result = $data;
break;
}
}
echo $result->value;
Data from woocomerce is a json. So we just need to pointing to target data.
I want to get the browser details of the client. so that am using $_SERVER['HTTP_USER_AGENT'] to get the details but it get some extra information also like
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
in chrome browser. what i have done is convert the string into an array like
Array ( [0] => Mozilla/5.0 [1] => (Windows [2] => NT [3] => 10.0; [4]
=> Win64; [5] => x64) [6] => AppleWebKit/537.36 [7] => (KHTML, [8] => like [9] => Gecko) [10] => Chrome/60.0.3112.90 [11] => Safari/537.36 )
If i search for Chr -- i want it to search the array and return Chrome/60.0.3112.90 .
Please suggest solution for this thanks.
Code :
echo $browser = $_SERVER['HTTP_USER_AGENT'];
$strArray = explode(' ',$browser);
print_r($strArray);
The way you're converting the user-agent to an array is quite faulty, for example "Windows NT" becomes ['Windows','NT'] and this is not something you want.
You might want to use ua-parser to extract the user-agent information in a better way.
require_once 'vendor/autoload.php';
use UAParser\Parser;
$ua = "Mozilla/5.0 (Macintosh; Intel Ma...";
$parser = Parser::create();
$result = $parser->parse($ua);
print $result->ua->family; // Safari
print $result->ua->major; // 6
print $result->ua->minor; // 0
print $result->ua->patch; // 2
print $result->ua->toString(); // Safari 6.0.2
print $result->ua->toVersion(); // 6.0.2
print $result->os->family; // Mac OS X
print $result->os->major; // 10
print $result->os->minor; // 7
print $result->os->patch; // 5
print $result->os->patchMinor; // [null]
print $result->os->toString(); // Mac OS X 10.7.5
print $result->os->toVersion(); // 10.7.5
print $result->device->family; // Other
print $result->toString(); // Safari 6.0.2/Mac OS X 10.7.5
print $result->originalUserAgent; // Mozilla/5.0 (Macintosh; Intel Ma...
With the added need for searching for OS that is written in comments the accepted answer will not work well.
#Nabils answer will work quite well but since it splits the string in very small pieces it may be hard to use.
I thought I could use preg_split and create a good array to search and I think I made it.
I don't know all variations of user agents but they seem to follow a pattern.
This will split on space but also on ( and ).
$input_line = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36";
$arr = preg_split("/( \()|(\) )/", $input_line);
$arr2 =explode(" ", end($arr)); // explode "Chrome/60.0.3112.90 Safari/537.36" on space
Unset($arr[count($arr)-1]); // remove above exploded
$arr = Array_merge($arr,$arr2); // reinsert them as two items
//Var_dump($arr);
$search = "Chr";
Foreach($arr as $val){
If($search == Substr($val,0,3)) echo $val;
}
See here how it works: https://3v4l.org/qF65g
You can use regex on the string.
$str = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36";
$search = "Chr";
Preg_match("/". $search . ".*?\/\d+\.\d+\.\d+\.\d+/", $str, $match);
Var_dump($match);
It will capture the search and any characters til / then your digits with dots.
https://3v4l.org/uYpoP
You can create function to search from array
echo $browser = $_SERVER['HTTP_USER_AGENT'];
$strArray = explode(' ',$browser);
print_r($strArray);
function search_from_array($array,$string)
{
$return_str = "";
foreach ($array as $key => $value) {
$pos = stripos($value, $string);
if($pos !== false)
$return_str=$value;
}
return $return_str;
}
echo search_from_array($strArray,"chr");
The code below is first the client code, then the class file.
For some reason the 'deductTokens()' method is calling twice, thus charging an account double.
I've been programming all night, so I may just need a second pair of eyes:
if ($action == 'place_order') {
if ($_REQUEST['unlimited'] == 200) {
$license = 'extended';
} else {
$license = 'standard';
}
if ($photograph->isValidPhotographSize($photograph_id, $_REQUEST['size_radio'])) {
$token_cost = $photograph->getTokenCost($_REQUEST['size_radio'], $_REQUEST['unlimited']);
$order = new ImageOrder($_SESSION['user']['id'], $_REQUEST['size_radio'], $license, $token_cost);
$order->saveOrder();
$order->deductTokens();
header('location: account.php');
} else {
die("Please go back and select a valid photograph size");
}
}
######CLASS CODE#######
<?php
include_once('database_classes.php');
class Order {
protected $account_id;
protected $cost;
protected $license;
public function __construct($account_id, $license, $cost) {
$this->account_id = $account_id;
$this->cost = $cost;
$this->license = $license;
}
}
class ImageOrder extends Order {
protected $size;
public function __construct($account_id, $size, $license, $cost) {
$this->size = $size;
parent::__construct($account_id, $license, $cost);
}
public function saveOrder() {
//$db = Connect::connect();
//$account_id = $db->real_escape_string($this->account_id);
//$size = $db->real_escape_string($this->size);
//$license = $db->real_escape_string($this->license);
//$cost = $db->real_escape_string($this->cost);
}
public function deductTokens() {
$db = Connect::connect();
$account_id = $db->real_escape_string($this->account_id);
$cost = $db->real_escape_string($this->cost);
$query = "UPDATE accounts set tokens=tokens-$cost WHERE id=$account_id";
$result = $db->query($query);
}
}
?>
When I die("$query"); directly after the query, it's printing the proper statement, and when I run that query within MySQL it works perfectly.
$action = $_REQUEST['action'];
account.php is just a list of orders, never does it call up downloads.php. Just tried commenting out the redirect, but I'm having the same problem. I don't understand how it's getting called twice, the die statements are showing the right query, and the script doesn't reload itself.
Here are my apache access logs:
71.*** - - [22/May/2010:13:14:35 +0000] "POST /download.php?action=confirm_download&photograph_id=122 HTTP/1.1" 200 1951 "http://***.com/viewphotograph.php?photograph_id=122" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)"
71.*** - - [22/May/2010:13:14:36 +0000] "GET /download.php?action=place_order&photograph_id=122&size_radio=xsmall&unlimited=0 HTTP/1.1" 302 453 "http://*** .com/download.php?action=confirm_download&photograph_id=122" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)"
71.*** - - [22/May/2010:13:14:36 +0000] "GET /download.php?action=place_order&photograph_id=122&size_radio=xsmall&unlimited=0 HTTP/1.1" 302 453 "http://*** .com/download.php?action=confirm_download&photograph_id=122" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)"
71.*** - - [22/May/2010:13:14:36 +0000] "GET /account.php HTTP/1.1" 200 2626 "http://***.com/download.php?action=confirm_download&photograph_id=122" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)"
I understand there's obviously something wrong here. But I can't figure out where the second request is coming from.
The line:
header('location: account.php');
Forwards the browser, but doesn't end the php script.
This might be fine, the code-snippet you gave doens't "do" anything after this line.
Another option might be a double-click
If you double-click on the submit button, the form will be sent twice.
We used some javascript to disable the submit button after the first click.
Another long shot, but it happened to me once in Firefox - doubled page executions, resulting in doubled inserts and updates - close your browser and restart.
I know this will sound strange but make sure that you don't have tag with empty src="" attribute or any css style refering to empty url (like background: url();) on your site around the place when you have your code that runs twice.
Read about some trouble this may cause here: http://hi.baidu.com/zhenyk/blog/item/38a1051fc63b96c3a686698f.html
How can one detect the search engine bots using php?
I use the following code which seems to be working fine:
function _bot_detected() {
return (
isset($_SERVER['HTTP_USER_AGENT'])
&& preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
);
}
update 16-06-2017
https://support.google.com/webmasters/answer/1061943?hl=en
added mediapartners
Here's a Search Engine Directory of Spider names
Then you use $_SERVER['HTTP_USER_AGENT']; to check if the agent is said spider.
if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
// what to do
}
Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/useragentstring.php
Or more specifically for crawlers:
http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler
If you want to -say- log the number of visits of most common search engine crawlers, you could use
$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) .')/';
$matches = array();
$numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches, 'i');
if($numMatches > 0) // Found a match
{
// $matches[1] contains an array of all text matches to either 'google' or 'yahoo'
}
You can checkout if it's a search engine with this function :
<?php
function crawlerDetect($USER_AGENT)
{
$crawlers = array(
'Google' => 'Google',
'MSN' => 'msnbot',
'Rambler' => 'Rambler',
'Yahoo' => 'Yahoo',
'AbachoBOT' => 'AbachoBOT',
'accoona' => 'Accoona',
'AcoiRobot' => 'AcoiRobot',
'ASPSeek' => 'ASPSeek',
'CrocCrawler' => 'CrocCrawler',
'Dumbot' => 'Dumbot',
'FAST-WebCrawler' => 'FAST-WebCrawler',
'GeonaBot' => 'GeonaBot',
'Gigabot' => 'Gigabot',
'Lycos spider' => 'Lycos',
'MSRBOT' => 'MSRBOT',
'Altavista robot' => 'Scooter',
'AltaVista robot' => 'Altavista',
'ID-Search Bot' => 'IDBot',
'eStyle Bot' => 'eStyle',
'Scrubby robot' => 'Scrubby',
'Facebook' => 'facebookexternalhit',
);
// to get crawlers string used in function uncomment it
// it is better to save it in string than use implode every time
// global $crawlers
$crawlers_agents = implode('|',$crawlers);
if (strpos($crawlers_agents, $USER_AGENT) === false)
return false;
else {
return TRUE;
}
}
?>
Then you can use it like :
<?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?>
I'm using this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
// is bot
}
In addition I use a whitelist to block unwanted bots:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
// allowed bot
}
An unwanted bot (= false-positive user) is then able to solve a captcha to unblock himself for 24 hours. And as no one solves this captcha, I know it does not produce false-positives. So the bot detection seem to work perfectly.
Note: My whitelist is based on Facebooks robots.txt.
Because any client can set the user-agent to what they want, looking for 'Googlebot', 'bingbot' etc is only half the job.
The 2nd part is verifying the client's IP. In the old days this required maintaining IP lists. All the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google https://support.google.com/webmasters/answer/80553 and Bing http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
At first perform a reverse DNS lookup of the client IP. For Google this brings a host name under googlebot.com, for Bing it's under search.msn.com. Then, because someone could set such a reverse DNS on his IP, you need to verify with a forward DNS lookup on that hostname. If the resulting IP is the same as the one of the site's visitor, you're sure it's a crawler from that search engine.
I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier
If you really need to detect GOOGLE engine bots you should never rely on "user_agent" or "IP" address because "user_agent" can be changed and acording to what google said in: Verifying Googlebot
To verify Googlebot as the caller:
1.Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
2.Verify that the domain name is in either googlebot.com or google.com
3.Run a forward DNS lookup on the domain name retrieved in step 1 using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.
Here is my tested code :
<?php
$remote_add=$_SERVER['REMOTE_ADDR'];
$hostname = gethostbyaddr($remote_add);
$googlebot = 'googlebot.com';
$google = 'google.com';
if (stripos(strrev($hostname), strrev($googlebot)) === 0 or stripos(strrev($hostname),strrev($google)) === 0 )
{
//add your code
}
?>
In this code we check "hostname" which should contain "googlebot.com" or "google.com" at the end of "hostname" which is really important to check exact domain not subdomain.
I hope you enjoy ;)
I use this function ... part of the regex comes from prestashop but I added some more bot to it.
public function isBot()
{
$bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|H�m�h�kki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
$userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
$isBot = !$userAgent || preg_match($bot_regex, $userAgent);
return $isBot;
}
Anyway take care that some bots uses browser like user agent to fake their identity
( I got many russian ip that has this behaviour on my site )
One distinctive feature of most of the bot is that they don't carry any cookie and so no session is attached to them.
( I am not sure how but this is for sure the best way to track them )
Use Device Detector open source library, it offers a isBot() function: https://github.com/piwik/device-detector
You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client’s IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.
I made one good and fast function for this
function is_bot(){
if(isset($_SERVER['HTTP_USER_AGENT']))
{
return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
}
return false;
}
This cover 99% of all possible bots, search engines etc.
<?php // IPCLOACK HOOK
if (CLOAKING_LEVEL != 4) {
$lastupdated = date("Ymd", filemtime(FILE_BOTS));
if ($lastupdated != date("Ymd")) {
$lists = array(
'http://labs.getyacg.com/spiders/google.txt',
'http://labs.getyacg.com/spiders/inktomi.txt',
'http://labs.getyacg.com/spiders/lycos.txt',
'http://labs.getyacg.com/spiders/msn.txt',
'http://labs.getyacg.com/spiders/altavista.txt',
'http://labs.getyacg.com/spiders/askjeeves.txt',
'http://labs.getyacg.com/spiders/wisenut.txt',
);
foreach($lists as $list) {
$opt .= fetch($list);
}
$opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
$fp = fopen(FILE_BOTS,"w");
fwrite($fp,$opt);
fclose($fp);
}
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
$ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$host = strtolower(gethostbyaddr($ip));
$file = implode(" ", file(FILE_BOTS));
$exp = explode(".", $ip);
$class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.';
$threshold = CLOAKING_LEVEL;
$cloak = 0;
if (stristr($host, "googlebot") && stristr($host, "inktomi") && stristr($host, "msn")) {
$cloak++;
}
if (stristr($file, $class)) {
$cloak++;
}
if (stristr($file, $agent)) {
$cloak++;
}
if (strlen($ref) > 0) {
$cloak = 0;
}
if ($cloak >= $threshold) {
$cloakdirective = 1;
} else {
$cloakdirective = 0;
}
}
?>
That would be the ideal way to cloak for spiders. It's from an open source script called [YACG] - http://getyacg.com
Needs a bit of work, but definitely the way to go.
100% Working Bot detector. It is working on my website successfully.
function isBotDetected() {
if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
) {
return true; // 'Above given bots detected'
}
return false;
} // End :: isBotDetected()
I'm using this code, pretty good. You will very easy to know user-agents visitted your site. This code is opening a file and write the user_agent down the file. You can check each day this file by go to yourdomain.com/useragent.txt and know about new user_agents and put them in your condition of if clause.
$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if(!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)){
// if not meet the conditions then
// do what you need
// here open a file and write the user_agent down the file. You can check each day this file useragent.txt and know about new user_agents and put them in your condition of if clause
if($user_agent!=""){
$myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
fwrite($myfile, $user_agent);
$user_agent = "\n";
fwrite($myfile, $user_agent);
fclose($myfile);
}
}
This is the content of useragent.txt
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
For Google i'm using this method.
function is_google() {
$ip = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr( $ip );
if ( strpos( $host, '.google.com' ) !== false || strpos( $host, '.googlebot.com' ) !== false ) {
$forward_lookup = gethostbyname( $host );
if ( $forward_lookup == $ip ) {
return true;
}
return false;
} else {
return false;
}
}
var_dump( is_google() );
Credits: https://support.google.com/webmasters/answer/80553
Verifying Googlebot
As useragent can be changed...
the only official supported way to identify a google bot is to run a
reverse DNS lookup on the accessing IP address and run a forward DNS
lookup on the result to verify that it points to accessing IP address
and the resulting domain name is in either googlebot.com or google.com
domain.
Taken from here.
so you must run a DNS lookup
Both, reverse and forward.
See this guide on Google Search Central.
function bot_detected() {
if(preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT']){
return true;
}
else{
return false;
}
}
might be late, but what about a hidden a link. All bots will use the rel attribute follow, only bad bots will use the nofollow rel attribute.
<a style="display:none;" rel="follow" href="javascript:void(0);" onclick="isabot();">.</a>
function isabot(){
//define a variable to pass with ajax to php
// || send bots info direct to where ever.
isabot = true;
}
for a bad bot you can use this:
<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>
for PHP specific you can remove the onclick attribute and replace the href attribute with a link to your ip detector/ bot detector like so:
<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>
OR
<a style="display:none;" rel="nofollow" href="https://somedomain.com/badbotdetector.php">.</a>
you can work with it and maybe use both, one detects a bot, while the other proves it to be a bad bot.
hope you find this useful