Google Hotel Finder, file_get_contents PHP

Google Hotel Finder, file_get_contents PHP - php

I try to check if a hotel is available through the Google Hotel Finder. I use the following PHP code:
<?php
//get date, in the future with $_POST
//get source
$source = file_get_contents('https://www.google.co.uk/hotelfinder/#search;l=london;d=2015-08-14;n=1;usd=1;h=17709217511794056234;si=;av=d');
//"Book a room" only shows when room(s) are available
if (strpos($source, "Book a room") !== false) {
echo "Room(s) available";
} else {
echo "Nothing available";
}
echo $source;
?>
When I run this code in my server, Google Hotel Finder gives me the following error message: "Google Hotel Finder has not been optimised for your browser. For best results, please try Chrome, Firefox 3.5+, Internet Explorer 8+, Safari 4+".
So Google detected that I was visiting the page trough PHP... Is there an way to ignore or bypass this "block"?

You may try to send user agent info with header.
$options = array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept-language: en\r\n" .
"Cookie: foo=bar\r\n" . // check function.stream-context-create on php.net
"User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.102011-10-16 20:23:10\r\n" // i.e. An iPad
)
);
$context = stream_context_create($options);
$file = file_get_contents($url, false, $context);
PHP file_get_contents() and headers

Related

Check if visit is from a search engine Php Wordpress [duplicate]

How can one detect the search engine bots using php?

I use the following code which seems to be working fine:
function _bot_detected() {
return (
isset($_SERVER['HTTP_USER_AGENT'])
&& preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
);
}
update 16-06-2017
https://support.google.com/webmasters/answer/1061943?hl=en
added mediapartners

Here's a Search Engine Directory of Spider names
Then you use $_SERVER['HTTP_USER_AGENT']; to check if the agent is said spider.
if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
// what to do
}

Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/useragentstring.php
Or more specifically for crawlers:
http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler
If you want to -say- log the number of visits of most common search engine crawlers, you could use
$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) .')/';
$matches = array();
$numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches, 'i');
if($numMatches > 0) // Found a match
{
// $matches[1] contains an array of all text matches to either 'google' or 'yahoo'
}

You can checkout if it's a search engine with this function :
<?php
function crawlerDetect($USER_AGENT)
{
$crawlers = array(
'Google' => 'Google',
'MSN' => 'msnbot',
'Rambler' => 'Rambler',
'Yahoo' => 'Yahoo',
'AbachoBOT' => 'AbachoBOT',
'accoona' => 'Accoona',
'AcoiRobot' => 'AcoiRobot',
'ASPSeek' => 'ASPSeek',
'CrocCrawler' => 'CrocCrawler',
'Dumbot' => 'Dumbot',
'FAST-WebCrawler' => 'FAST-WebCrawler',
'GeonaBot' => 'GeonaBot',
'Gigabot' => 'Gigabot',
'Lycos spider' => 'Lycos',
'MSRBOT' => 'MSRBOT',
'Altavista robot' => 'Scooter',
'AltaVista robot' => 'Altavista',
'ID-Search Bot' => 'IDBot',
'eStyle Bot' => 'eStyle',
'Scrubby robot' => 'Scrubby',
'Facebook' => 'facebookexternalhit',
);
// to get crawlers string used in function uncomment it
// it is better to save it in string than use implode every time
// global $crawlers
$crawlers_agents = implode('|',$crawlers);
if (strpos($crawlers_agents, $USER_AGENT) === false)
return false;
else {
return TRUE;
}
}
?>
Then you can use it like :
<?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?>

I'm using this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
// is bot
}
In addition I use a whitelist to block unwanted bots:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
// allowed bot
}
An unwanted bot (= false-positive user) is then able to solve a captcha to unblock himself for 24 hours. And as no one solves this captcha, I know it does not produce false-positives. So the bot detection seem to work perfectly.
Note: My whitelist is based on Facebooks robots.txt.

Because any client can set the user-agent to what they want, looking for 'Googlebot', 'bingbot' etc is only half the job.
The 2nd part is verifying the client's IP. In the old days this required maintaining IP lists. All the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google https://support.google.com/webmasters/answer/80553 and Bing http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
At first perform a reverse DNS lookup of the client IP. For Google this brings a host name under googlebot.com, for Bing it's under search.msn.com. Then, because someone could set such a reverse DNS on his IP, you need to verify with a forward DNS lookup on that hostname. If the resulting IP is the same as the one of the site's visitor, you're sure it's a crawler from that search engine.
I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier

If you really need to detect GOOGLE engine bots you should never rely on "user_agent" or "IP" address because "user_agent" can be changed and acording to what google said in: Verifying Googlebot
To verify Googlebot as the caller:
1.Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
2.Verify that the domain name is in either googlebot.com or google.com
3.Run a forward DNS lookup on the domain name retrieved in step 1 using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.
Here is my tested code :
<?php
$remote_add=$_SERVER['REMOTE_ADDR'];
$hostname = gethostbyaddr($remote_add);
$googlebot = 'googlebot.com';
$google = 'google.com';
if (stripos(strrev($hostname), strrev($googlebot)) === 0 or stripos(strrev($hostname),strrev($google)) === 0 )
{
//add your code
}
?>
In this code we check "hostname" which should contain "googlebot.com" or "google.com" at the end of "hostname" which is really important to check exact domain not subdomain.
I hope you enjoy ;)

I use this function ... part of the regex comes from prestashop but I added some more bot to it.
public function isBot()
{
$bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|H�m�h�kki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
$userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
$isBot = !$userAgent || preg_match($bot_regex, $userAgent);
return $isBot;
}
Anyway take care that some bots uses browser like user agent to fake their identity
( I got many russian ip that has this behaviour on my site )
One distinctive feature of most of the bot is that they don't carry any cookie and so no session is attached to them.
( I am not sure how but this is for sure the best way to track them )

Use Device Detector open source library, it offers a isBot() function: https://github.com/piwik/device-detector

You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client’s IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.

I made one good and fast function for this
function is_bot(){
if(isset($_SERVER['HTTP_USER_AGENT']))
{
return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
}
return false;
}
This cover 99% of all possible bots, search engines etc.

<?php // IPCLOACK HOOK
if (CLOAKING_LEVEL != 4) {
$lastupdated = date("Ymd", filemtime(FILE_BOTS));
if ($lastupdated != date("Ymd")) {
$lists = array(
'http://labs.getyacg.com/spiders/google.txt',
'http://labs.getyacg.com/spiders/inktomi.txt',
'http://labs.getyacg.com/spiders/lycos.txt',
'http://labs.getyacg.com/spiders/msn.txt',
'http://labs.getyacg.com/spiders/altavista.txt',
'http://labs.getyacg.com/spiders/askjeeves.txt',
'http://labs.getyacg.com/spiders/wisenut.txt',
);
foreach($lists as $list) {
$opt .= fetch($list);
}
$opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
$fp = fopen(FILE_BOTS,"w");
fwrite($fp,$opt);
fclose($fp);
}
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
$ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$host = strtolower(gethostbyaddr($ip));
$file = implode(" ", file(FILE_BOTS));
$exp = explode(".", $ip);
$class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.';
$threshold = CLOAKING_LEVEL;
$cloak = 0;
if (stristr($host, "googlebot") && stristr($host, "inktomi") && stristr($host, "msn")) {
$cloak++;
}
if (stristr($file, $class)) {
$cloak++;
}
if (stristr($file, $agent)) {
$cloak++;
}
if (strlen($ref) > 0) {
$cloak = 0;
}
if ($cloak >= $threshold) {
$cloakdirective = 1;
} else {
$cloakdirective = 0;
}
}
?>
That would be the ideal way to cloak for spiders. It's from an open source script called [YACG] - http://getyacg.com
Needs a bit of work, but definitely the way to go.

100% Working Bot detector. It is working on my website successfully.
function isBotDetected() {
if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
) {
return true; // 'Above given bots detected'
}
return false;
} // End :: isBotDetected()

I'm using this code, pretty good. You will very easy to know user-agents visitted your site. This code is opening a file and write the user_agent down the file. You can check each day this file by go to yourdomain.com/useragent.txt and know about new user_agents and put them in your condition of if clause.
$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if(!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)){
// if not meet the conditions then
// do what you need
// here open a file and write the user_agent down the file. You can check each day this file useragent.txt and know about new user_agents and put them in your condition of if clause
if($user_agent!=""){
$myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
fwrite($myfile, $user_agent);
$user_agent = "\n";
fwrite($myfile, $user_agent);
fclose($myfile);
}
}
This is the content of useragent.txt
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174

For Google i'm using this method.
function is_google() {
$ip = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr( $ip );
if ( strpos( $host, '.google.com' ) !== false || strpos( $host, '.googlebot.com' ) !== false ) {
$forward_lookup = gethostbyname( $host );
if ( $forward_lookup == $ip ) {
return true;
}
return false;
} else {
return false;
}
}
var_dump( is_google() );
Credits: https://support.google.com/webmasters/answer/80553

Verifying Googlebot
As useragent can be changed...
the only official supported way to identify a google bot is to run a
reverse DNS lookup on the accessing IP address and run a forward DNS
lookup on the result to verify that it points to accessing IP address
and the resulting domain name is in either googlebot.com or google.com
domain.
Taken from here.
so you must run a DNS lookup
Both, reverse and forward.
See this guide on Google Search Central.

function bot_detected() {
if(preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT']){
return true;
}
else{
return false;
}
}

might be late, but what about a hidden a link. All bots will use the rel attribute follow, only bad bots will use the nofollow rel attribute.
<a style="display:none;" rel="follow" href="javascript:void(0);" onclick="isabot();">.</a>
function isabot(){
//define a variable to pass with ajax to php
// || send bots info direct to where ever.
isabot = true;
}
for a bad bot you can use this:
<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>
for PHP specific you can remove the onclick attribute and replace the href attribute with a link to your ip detector/ bot detector like so:
<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>
OR
<a style="display:none;" rel="nofollow" href="https://somedomain.com/badbotdetector.php">.</a>
you can work with it and maybe use both, one detects a bot, while the other proves it to be a bad bot.
hope you find this useful

Create POST (protocol) for Wordpress from reading debugging tool Chrome

I have a button on my page WordPress to check or uncheck the post of WordPress as favorite. It is my intention to make a POST call from php to do this. Later I call this php from a mobile app.
My App Mobile ==> (get_favorito.php) POST (idUser, idPost, Status) ==> Favorite ON / OFF
I currently use WP 4.4.2 and Plugin for WordPress FAVORITES (https://github.com/kylephillips/favorites)
I launch the POST used the tool for developers of Chrome.
image important debugging
And I can see that the call is made:
http://web.domine.com/wp-admin/admin-ajax.php?action=simplefavorites_favorite&nonce=XXXXXXcd14&postid=273&siteid=1&status=inactive
or
http://web.domine.com/wp-admin/admin-ajax.php?action=simplefavorites_favorite&nonce=XXXXXXcd14&postid=273&siteid=1&status=active
My question comes with the part of Header and Cookie. How did you get this information?
I'm trying this, but it does not work.
This is the php I am writing.
<?php
$ruta = 'http://' . $_SERVER['HTTP_HOST'];
$json = file_get_contents($ruta . '/wp-admin/admin-ajax.php?action=simplefavorites_nonce');
$arr = json_decode($json, true);
$nonce = $arr['nonce'];
$opts = array(
'http'=>array(
'method'=>'POST',
'header'=> 'POST /wp-admin/admin-ajax.php HTTP/1.1\r\n' .
'Host: web.domine.com\r\n' .
'Connection: keep-alive\r\n' .
'Content-Length: 84\r\n' .
'Accept: */*\r\n' .
'Origin: http://web.domine.com\r\n' .
'X-Requested-With: XMLHttpRequest\r\n' .
'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36\r\n' .
'Content-Type: application/x-www-form-urlencoded; charset=UTF-8\r\n' .
'Referer: http://web.domine.com/hola-mundo-2/\r\n' .
'Accept-Encoding: gzip, deflate\r\n' .
'Accept-Language: es-ES,es;q=0.8\r\n' .
'Cookie: wordpress_dddd3333f97127bf3816f4455971ce5aa=peteradmin%7C1457085836%7CxWJrk7EQVEYRpZY9Jzev4fH6jx3cq97wx6LPaMd9C4v%7Cd232ca14edca535e653dd37607b754d78926410e317d34315cbcb5533cda08c8; PHPSESSID=8eda0049e17a67becb1c8fddd18c6c51;
wordpress_logged_in_dddd3333f97127bf3816f4455971ce5aa=peteradmin%7C1457085836%7CxWJrk7EQVEYRpZY9Jzev4fH6jx3cq97wx6LPaMd9C4v%7C63a7b53cfbb2c5a3b86e59c65e9977077e352ad8fe00228dee9b04a7a1e36ad9;
wp-settings-1=libraryContent%3Dbrowse%26editor%3Dtinymce%26mfold%3Do;
wp-settings-time-1=1456991866;
wordpress_test_cookie=WP+Cookie+check;
simplefavorites=%5B%7B%22site_id%22%3A1%2C%22posts%22%3A%7B%221%22%3A194%2C%222%22%3A208%2C%223%22%3A273%7D%7D%5D'
)
);
$context = stream_context_create($opts);
//
//
$param = "action=simplefavorites_favorite&nonce='.$nonce.'&postid=273&siteid=1&status=active";
$json = file_get_contents($ruta . '/wp-admin/admin-ajax.php?'.$param.'', false, $context);
echo $json;
?>
(I put spaces so that cookies are correctly displayed)
And now I get nonce with:
http://web.domine.com/wp-admin/admin-ajax.php?action=simplefavorites_nonce

Hello I was redirected here from nubelo in order to answer.
The headers are set automatically by the browser and the cookies are set by different pages of wordpress like the wp-login.php page.
The simplefavorites cookie is a cookie that stores an anonymoys user favorite posts array, and it is returned in the response headers of the wp-admin/admin-ajax.php?action=simplefavorites_array page. For logged in users the favorites information is returned in json format in the response of that same page.
I made a php script to toggle the status it just sends the cookies to the respective endpoints and you would only need to store the cookies in your mobile app and send them with your request.
https://gist.github.com/chaps/eec3769560c7d8debe59

Foreach Loop only loops once

I am trying to make many requests to my website, using proxies and headers in PHP, and grab a proxy line by line from a text file to use in the file_get_contents, however I have 3 proxies in the text file (one per line) and the script is only using one, then ending. (I am executing it from command line)
<?php
$proxies = explode("\r\n", file_get_contents("proxies.txt"));
foreach($proxies as $cpr0xy) {
$aContext = array(
'http' => array(
'proxy' => "tcp://$cpr0xy",
'request_fulluri' => true,
'method'=>"GET",
'header'=>"Accept-language: en\r\n" .
"User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36\r\n"
), );
$rqcon = stream_context_create($aContext);
$destc = file_get_contents("http://domain.com/file.php", False, $rqcon);
echo $destc;
} ?>
Right now its only using the first proxy and it is returning the value correctly, however then the script stops. My goal is for it to endlessly make requests until it runs out of proxies in proxies.txt

This should work for you:
$proxies = explode(PHP_EOL, file_get_contents("proxies.txt"));

Setting useragent in PHP file_get_contents?

I am trying to get the content from a page using file_get_contents to get the HTML and regex for further processing.
The site I am getting my content from has a desktop and mobile site so I was wondering is there a way to send a custom useragent to get the mobile site instead of the desktop site?
Using file_get_contents I have tried it with my code shown below but all I get is a blank page:
$options = array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept-language: en\r\n" .
"Cookie: foo=bar\r\n" .
"User-Agent: Mozilla/5.0 (iPad; U; CPU iPad OS 5_0_1 like Mac OS X; en-us) AppleWebKit/535.1+ (KHTML like Gecko) Version/7.2.0.0 Safari/6533.18.5\r\n" // i.e. An iPad )
);
$context = stream_context_create($options);
$file = file_get_contents('http://www.example.com/api/'.$atrib,false,$context);
$doc = new DOMDocument();// create new dom document
$doc->loadHTML($file);// load the xmlpage
$tags = $doc->getElementsByTagName('video'); // find the tag we are looking for
foreach ($tags as $tag) { // for ever tag that is the same make it a new tag of its own
if (isset($_GET['key']) && $_GET['key'] == $key) { // if key is in url and matches script key - do or dont
echo $tag->getAttribute('src'); // get out 3 min video from the attribute in page
} else { // if key is not in url or not correct show error
echo "ACCESS DENIED!"; // our out bound error
}
}
I am trying to get the useragent to load up the content from the sites mobile page and useing regex get the src url from this line of code in the page just in case this is the problem:
<video id="player" src="http://example.com/api/4.m3u8" poster="http://example.com/default.png" autoplay="" autobuffer="" preload="" controls="" height="537" width="935"></video>

As mentioned by Casimir et Hippolyte in the comments, uncomment the closing parenthesis at the end of the line of "User-Agent: Mozilla...:
ini_set('display_errors', 'On');

how to detect search engine bots with php?

How can one detect the search engine bots using php?

I use the following code which seems to be working fine:
function _bot_detected() {
return (
isset($_SERVER['HTTP_USER_AGENT'])
&& preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
);
}
update 16-06-2017
https://support.google.com/webmasters/answer/1061943?hl=en
added mediapartners

Here's a Search Engine Directory of Spider names
Then you use $_SERVER['HTTP_USER_AGENT']; to check if the agent is said spider.
if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
// what to do
}

Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/useragentstring.php
Or more specifically for crawlers:
http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler
If you want to -say- log the number of visits of most common search engine crawlers, you could use
$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) .')/';
$matches = array();
$numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches, 'i');
if($numMatches > 0) // Found a match
{
// $matches[1] contains an array of all text matches to either 'google' or 'yahoo'
}

You can checkout if it's a search engine with this function :
<?php
function crawlerDetect($USER_AGENT)
{
$crawlers = array(
'Google' => 'Google',
'MSN' => 'msnbot',
'Rambler' => 'Rambler',
'Yahoo' => 'Yahoo',
'AbachoBOT' => 'AbachoBOT',
'accoona' => 'Accoona',
'AcoiRobot' => 'AcoiRobot',
'ASPSeek' => 'ASPSeek',
'CrocCrawler' => 'CrocCrawler',
'Dumbot' => 'Dumbot',
'FAST-WebCrawler' => 'FAST-WebCrawler',
'GeonaBot' => 'GeonaBot',
'Gigabot' => 'Gigabot',
'Lycos spider' => 'Lycos',
'MSRBOT' => 'MSRBOT',
'Altavista robot' => 'Scooter',
'AltaVista robot' => 'Altavista',
'ID-Search Bot' => 'IDBot',
'eStyle Bot' => 'eStyle',
'Scrubby robot' => 'Scrubby',
'Facebook' => 'facebookexternalhit',
);
// to get crawlers string used in function uncomment it
// it is better to save it in string than use implode every time
// global $crawlers
$crawlers_agents = implode('|',$crawlers);
if (strpos($crawlers_agents, $USER_AGENT) === false)
return false;
else {
return TRUE;
}
}
?>
Then you can use it like :
<?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?>

I'm using this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
// is bot
}
In addition I use a whitelist to block unwanted bots:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
// allowed bot
}
An unwanted bot (= false-positive user) is then able to solve a captcha to unblock himself for 24 hours. And as no one solves this captcha, I know it does not produce false-positives. So the bot detection seem to work perfectly.
Note: My whitelist is based on Facebooks robots.txt.

Because any client can set the user-agent to what they want, looking for 'Googlebot', 'bingbot' etc is only half the job.
The 2nd part is verifying the client's IP. In the old days this required maintaining IP lists. All the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google https://support.google.com/webmasters/answer/80553 and Bing http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
At first perform a reverse DNS lookup of the client IP. For Google this brings a host name under googlebot.com, for Bing it's under search.msn.com. Then, because someone could set such a reverse DNS on his IP, you need to verify with a forward DNS lookup on that hostname. If the resulting IP is the same as the one of the site's visitor, you're sure it's a crawler from that search engine.
I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier

If you really need to detect GOOGLE engine bots you should never rely on "user_agent" or "IP" address because "user_agent" can be changed and acording to what google said in: Verifying Googlebot
To verify Googlebot as the caller:
1.Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
2.Verify that the domain name is in either googlebot.com or google.com
3.Run a forward DNS lookup on the domain name retrieved in step 1 using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.
Here is my tested code :
<?php
$remote_add=$_SERVER['REMOTE_ADDR'];
$hostname = gethostbyaddr($remote_add);
$googlebot = 'googlebot.com';
$google = 'google.com';
if (stripos(strrev($hostname), strrev($googlebot)) === 0 or stripos(strrev($hostname),strrev($google)) === 0 )
{
//add your code
}
?>
In this code we check "hostname" which should contain "googlebot.com" or "google.com" at the end of "hostname" which is really important to check exact domain not subdomain.
I hope you enjoy ;)

I use this function ... part of the regex comes from prestashop but I added some more bot to it.
public function isBot()
{
$bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|H�m�h�kki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
$userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
$isBot = !$userAgent || preg_match($bot_regex, $userAgent);
return $isBot;
}
Anyway take care that some bots uses browser like user agent to fake their identity
( I got many russian ip that has this behaviour on my site )
One distinctive feature of most of the bot is that they don't carry any cookie and so no session is attached to them.
( I am not sure how but this is for sure the best way to track them )

Use Device Detector open source library, it offers a isBot() function: https://github.com/piwik/device-detector

You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client’s IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.

I made one good and fast function for this
function is_bot(){
if(isset($_SERVER['HTTP_USER_AGENT']))
{
return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
}
return false;
}
This cover 99% of all possible bots, search engines etc.

<?php // IPCLOACK HOOK
if (CLOAKING_LEVEL != 4) {
$lastupdated = date("Ymd", filemtime(FILE_BOTS));
if ($lastupdated != date("Ymd")) {
$lists = array(
'http://labs.getyacg.com/spiders/google.txt',
'http://labs.getyacg.com/spiders/inktomi.txt',
'http://labs.getyacg.com/spiders/lycos.txt',
'http://labs.getyacg.com/spiders/msn.txt',
'http://labs.getyacg.com/spiders/altavista.txt',
'http://labs.getyacg.com/spiders/askjeeves.txt',
'http://labs.getyacg.com/spiders/wisenut.txt',
);
foreach($lists as $list) {
$opt .= fetch($list);
}
$opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
$fp = fopen(FILE_BOTS,"w");
fwrite($fp,$opt);
fclose($fp);
}
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
$ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$host = strtolower(gethostbyaddr($ip));
$file = implode(" ", file(FILE_BOTS));
$exp = explode(".", $ip);
$class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.';
$threshold = CLOAKING_LEVEL;
$cloak = 0;
if (stristr($host, "googlebot") && stristr($host, "inktomi") && stristr($host, "msn")) {
$cloak++;
}
if (stristr($file, $class)) {
$cloak++;
}
if (stristr($file, $agent)) {
$cloak++;
}
if (strlen($ref) > 0) {
$cloak = 0;
}
if ($cloak >= $threshold) {
$cloakdirective = 1;
} else {
$cloakdirective = 0;
}
}
?>
That would be the ideal way to cloak for spiders. It's from an open source script called [YACG] - http://getyacg.com
Needs a bit of work, but definitely the way to go.

100% Working Bot detector. It is working on my website successfully.
function isBotDetected() {
if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
) {
return true; // 'Above given bots detected'
}
return false;
} // End :: isBotDetected()

I'm using this code, pretty good. You will very easy to know user-agents visitted your site. This code is opening a file and write the user_agent down the file. You can check each day this file by go to yourdomain.com/useragent.txt and know about new user_agents and put them in your condition of if clause.
$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if(!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)){
// if not meet the conditions then
// do what you need
// here open a file and write the user_agent down the file. You can check each day this file useragent.txt and know about new user_agents and put them in your condition of if clause
if($user_agent!=""){
$myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
fwrite($myfile, $user_agent);
$user_agent = "\n";
fwrite($myfile, $user_agent);
fclose($myfile);
}
}
This is the content of useragent.txt
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174

For Google i'm using this method.
function is_google() {
$ip = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr( $ip );
if ( strpos( $host, '.google.com' ) !== false || strpos( $host, '.googlebot.com' ) !== false ) {
$forward_lookup = gethostbyname( $host );
if ( $forward_lookup == $ip ) {
return true;
}
return false;
} else {
return false;
}
}
var_dump( is_google() );
Credits: https://support.google.com/webmasters/answer/80553

Verifying Googlebot
As useragent can be changed...
the only official supported way to identify a google bot is to run a
reverse DNS lookup on the accessing IP address and run a forward DNS
lookup on the result to verify that it points to accessing IP address
and the resulting domain name is in either googlebot.com or google.com
domain.
Taken from here.
so you must run a DNS lookup
Both, reverse and forward.
See this guide on Google Search Central.

function bot_detected() {
if(preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT']){
return true;
}
else{
return false;
}
}

might be late, but what about a hidden a link. All bots will use the rel attribute follow, only bad bots will use the nofollow rel attribute.
<a style="display:none;" rel="follow" href="javascript:void(0);" onclick="isabot();">.</a>
function isabot(){
//define a variable to pass with ajax to php
// || send bots info direct to where ever.
isabot = true;
}
for a bad bot you can use this:
<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>
for PHP specific you can remove the onclick attribute and replace the href attribute with a link to your ip detector/ bot detector like so:
<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>
OR
<a style="display:none;" rel="nofollow" href="https://somedomain.com/badbotdetector.php">.</a>
you can work with it and maybe use both, one detects a bot, while the other proves it to be a bad bot.
hope you find this useful

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Google Hotel Finder, file_get_contents PHP - php

Related

Check if visit is from a search engine Php Wordpress [duplicate]

Create POST (protocol) for Wordpress from reading debugging tool Chrome

Foreach Loop only loops once

Setting useragent in PHP file_get_contents?

how to detect search engine bots with php?

Categories

Resources