Detecting a crawl (Search Engine's visit) using PHP

When a search engine visits a webpage, what do the get_browser() function and $_SERVER['HTTP_USER_AGENT'] return?
Also, what other evidence does PHP offer when a search engine crawls a webpage?

The get_browser() function attempts to determine the browser's capabilities (returned as an object, or as an array if its second argument is true), but don't rely on it too much, because user agents are not standardized; for a serious app, build your own detection.
$_SERVER["HTTP_USER_AGENT"] is a long string "describing" the user's browser; it can be passed as the optional first parameter of the function above. A tip: use this string to identify the user's browser instead of get_browser() itself. Also be prepared for a missing user agent. An example of this string:
Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3
A search engine (robot, spider, or crawler) that follows the rules will visit your pages according to the instructions in robots.txt, which must be located in your site's root.
Without a robots.txt, a spider can crawl the whole site, as long as it finds links inside your pages; with this file, you can tell the spider what it may crawl. NOTE: this applies only to "good" spiders, not the bad ones.

$_SERVER['HTTP_USER_AGENT'] returns the user-agent string, which get_browser() also reads by default. For the major search engines, it looks like this:
Google:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
Googlebot-Image/1.0
Bing:
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
Yahoo:
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
To fully control (and limit) crawling, don't rely on robots.txt; use .htaccess or httpd.conf rules instead. (In my experience, even good crawlers ignore the Disallow rules in robots.txt about half of the time.)
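The user-agent strings listed above can be matched with a case-insensitive substring check. The following is a minimal sketch, not a definitive detector: the constant CRAWLER_TOKENS and the function detectCrawler() are hypothetical names, and the token list is taken only from the examples in this answer. Since any client can send any user agent, a match is a hint rather than proof.

```php
<?php
// Tokens taken from the user-agent examples above (an assumption; extend as needed).
const CRAWLER_TOKENS = ['Googlebot', 'bingbot', 'BingPreview', 'msnbot', 'Yahoo! Slurp'];

/**
 * Hypothetical helper: returns the first token from CRAWLER_TOKENS found in
 * the user agent, or null when nothing matches or no user agent was sent.
 */
function detectCrawler(?string $userAgent): ?string
{
    if ($userAgent === null || $userAgent === '') {
        return null; // The User-Agent header can be missing entirely.
    }
    foreach (CRAWLER_TOKENS as $token) {
        // stripos() is case-insensitive, so "BingBot" and "bingbot" both match.
        if (stripos($userAgent, $token) !== false) {
            return $token;
        }
    }
    return null;
}

// Typical call inside a request handler:
// $crawler = detectCrawler($_SERVER['HTTP_USER_AGENT'] ?? null);
```

Passing the header through `?? null` keeps the check safe when the client sends no User-Agent at all.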

Related

How to detect social media giants Bots and refine the useragent in php?

I am trying to build a script that captures the user agent of visitors. That can easily be done using $_SERVER['HTTP_USER_AGENT'].
Example: below are all the bots that $_SERVER['HTTP_USER_AGENT'] detected via Twitter.
I simply posted the link to the PHP script on Twitter, and it detected the bots.
Here are the bots captured through HTTP_USER_AGENT from the Twitter network:
1
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.2) Gecko/20090729 Firefox/52.0
2
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)
3
Mozilla/5.0 (compatible; AhrefsBot/6.1; News; +http://ahrefs.com/robot/)
4
Mozilla/5.0 (compatible; TrendsmapResolver/0.1)
5 (not sure whether it's a bot or a normal agent)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36
6
Twitterbot/1.0
7
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)
Now I want to refine/filter the bot names out of the detected HTTP_USER_AGENT values.
Example:
rv:1.9.1.2
Trident/4.0
(compatible; AhrefsBot/6.1; News; +http://ahrefs.com/robot/)
(compatible; TrendsmapResolver/0.1)
Twitterbot/1.0
(Applebot/0.1; +http://www.apple.com/go/applebot)
What I have tried so far:
// Fall back to an empty string when no User-Agent header was sent.
$userAgent = $_SERVER["HTTP_USER_AGENT"] ?? "";
// Open the log once, in append mode, for both branches.
$file = fopen("crawl.txt", "a");
if (
    strpos($userAgent, "Twitterbot/1.0") !== false ||
    strpos($userAgent, "Applebot/0.1") !== false
) {
    fwrite($file, "TW-bot detected.\n");
    echo "TW-bot detected.";
} else {
    fwrite($file, "Nothing found.\n");
    echo "Nothing";
}
fclose($file);
But somehow the above code is not working: crawl.txt always shows "Nothing found". Let me know where I am going wrong.
Let me know the proper/better/best way to detect bots; any direction or guidance is appreciated.

You might find that it's easy to spot the bots which capture simple website previews, but the user agents of bots which scrape for restricted content are a lot more difficult.
You'd have to do more than just parse the UA. Interrogating REMOTE_ADDR will also be necessary. You'd send each request's address through something like http://ip-api.com to determine whether it's coming from a datacenter. Be careful of users with proxies; they will trigger false positives. You could go further and investigate the browser's capabilities with JavaScript, but be aware that this is a difficult problem and a constant arms race between providers' detection tools and (usually) black-hat advertisers.
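For the UA-parsing part, a regular expression can pull a bot name such as Twitterbot/1.0 or AhrefsBot/6.1 out of the strings in the question. This is a minimal sketch under stated assumptions: the function name extractBotToken() and the keyword list (bot, spider, crawler, resolver) are illustrative choices, not an established convention, and bots that disguise themselves as browsers will not be caught.

```php
<?php
/**
 * Hypothetical helper: returns the first "name/version" token whose name
 * contains one of the assumed keywords, or null when none is found.
 */
function extractBotToken(string $userAgent): ?string
{
    // Group 1: a word containing bot/spider/crawler/resolver (case-insensitive).
    // Group 2: an optional version after a slash, e.g. "2.1" or "2.0b".
    $pattern = '~\b([\w-]*(?:bot|spider|crawler|resolver)[\w-]*)(?:/(\d[\w.]*))?~i';
    if (preg_match($pattern, $userAgent, $m) !== 1) {
        return null;
    }
    // The version group is absent from $m when the token has no "/version" part.
    return $m[1] . (isset($m[2]) ? '/' . $m[2] : '');
}

// Typical call inside a request handler:
// $bot = extractBotToken($_SERVER['HTTP_USER_AGENT'] ?? '');
```

The ordinary browser strings in the question (items 1, 2 and 5) return null here, which is exactly why the UA alone is not enough for them.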

404 Bot Attack on My Website (DDoS of Sorts)

Over the last few days I noticed that my WordPress website had been running quite slowly, so I decided to investigate. After checking my database, I saw that a table responsible for tracking 404 errors was over 1 GB in size. At this point it was evident that I was being targeted by bots.
After checking my access log, I could see a pattern of sorts: the bot lands on a legitimate page that lists my categories, moves into a category page, and then requests seemingly random page numbers, many of which do not exist, which causes the issue.
Example:
/watch-online/ - Landing Page
/category/evolution/page/7 - 404
/category/evolution/page/1
/category/evolution/page/3
/category/evolution/page/5 - 404
/category/evolution/page/8 - 404
/category/evolution/page/4 - 404
/category/evolution/page/2
/category/evolution/page/6 - 404
/category/evolution/page/9 - 404
/category/evolution/page/10 - 404
This is the actual order of requests, and they all happen within a second. At this point the IP is blocked because it has thrown too many 404s, but this seems to have no effect due to the sheer number of bots all doing the same thing.
Also, the category changes with each bot, so they are all attacking random categories and generating 404 pages.
At the moment there are 2037 unique IPs which have thrown similar 404s in the last 24 hours.
I also use Cloudflare and have manually blocked many IPs from ever reaching my box, but this attack is relentless, and it seems as though they keep generating new IPs. Here is a list of some offending IPs:
77.101.138.202
81.149.196.188
109.255.127.90
75.19.16.214
47.187.231.144
70.190.53.222
62.251.17.234
184.155.42.206
74.138.227.150
98.184.129.57
151.224.41.144
94.29.229.186
64.231.243.218
109.160.110.135
222.127.118.145
92.22.14.143
92.14.176.174
50.48.216.145
58.179.196.182
Other than automatically blocking IPs for too many 404 errors, I can think of no other real solution, and this in itself is quite ineffective due to the sheer number of IPs.
Any suggestions on how to deal with this would be greatly appreciated, as there appears to be no end to this attack and my website's performance really is taking a hit.
Some user agents include:
Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.86 Safari/537.36
Mozilla/5.0 (Windows NT 6.2; rv:26.0) Gecko/20100101 Firefox/26.0
Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 7.0; WOW64; Trident/6.0)
Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:22.0) Gecko/20100101 Firefox/22.0
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36

If it's your personal website, you can try Cloudflare, which is free and can also provide protection against DDoS attacks. Maybe give it a try.

Okay, so after much searching, experimentation and head banging, I have finally mitigated the attack.
The solution was to install the Apache module 'mod_evasive'; see:
https://www.digitalocean.com/community/tutorials/how-to-protect-against-dos-and-ddos-with-mod_evasive-for-apache-on-centos-7
So, for any other poor soul that gets slammed as severely as I did: have a look at that and get your thresholds finely tuned. This is a simple, cheap and very effective means of drastically reducing the impact of any attack similar to the one I suffered.
My server is still getting bombarded by bots, but this really does limit their damage.
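The threshold idea (count hits per IP inside a short time window and block once a limit is exceeded) can also be sketched at the application level. This is only an illustration and not how mod_evasive is implemented: the class NotFoundThrottle, its method record(), and the default limits are hypothetical, and the counters live in memory, so a real site would have to keep them in shared storage between requests.

```php
<?php
/**
 * Hypothetical sliding-window counter for 404 responses per IP address.
 * State is kept in memory only, purely for illustration.
 */
final class NotFoundThrottle
{
    /** @var int maximum number of 404s allowed inside the window */
    private $maxHits;
    /** @var int window length in seconds */
    private $windowSeconds;
    /** @var array<string, int[]> timestamps of recent 404s, keyed by IP */
    private $hits = [];

    public function __construct(int $maxHits = 5, int $windowSeconds = 10)
    {
        $this->maxHits = $maxHits;
        $this->windowSeconds = $windowSeconds;
    }

    /** Records one 404 for $ip at time $now; returns true when the IP should be blocked. */
    public function record(string $ip, int $now): bool
    {
        $cutoff = $now - $this->windowSeconds;
        // Keep only the hits that are still inside the window.
        $recent = array_filter($this->hits[$ip] ?? [], function (int $t) use ($cutoff): bool {
            return $t > $cutoff;
        });
        $recent[] = $now;
        $this->hits[$ip] = array_values($recent);
        return count($this->hits[$ip]) > $this->maxHits;
    }
}

// Typical call when a 404 is about to be served:
// $blocked = $throttle->record($_SERVER['REMOTE_ADDR'], time());
```

With the burst pattern from the question (about ten requests within one second), a limit of a few 404s per ten seconds would trip after the first handful of missing pages.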

Why does Chrome user agent string change in windows 8?

On my computer with Windows 8 and Google Chrome 35, the variable $_SERVER['HTTP_USER_AGENT'] sometimes returns
Mozilla/5.0 (Linux; U; Android 2.3.4; en-us; Kindle Fire HD Build/GINGERBREAD) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
when the correct value would be:
Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36
Why does it happen and how can I prevent it?

Kindle is both a hardware e-reader and a piece of software for reading e-books on computers, laptops and tablets. Could it be that you installed it on your machine and that it embedded itself in Chrome, like a plug-in/add-on?
If you're sure that is not the case, consider the suggestion by Alok in the comments. If you don't know how to work with the console, check here whether the PHP- and JS-detected UAs read the same. If not, that would indeed be the cause.
I wouldn't know how to cure that, though, other than by removing the (other) plug-ins/add-ons one by one.

PHP sniff out Safari Desktop version

Super quick one: I know that browser sniffing is frowned upon, but I need to detect the desktop version of Safari only (using PHP). I cannot seem to find this specific combination on Google, or on SO for that matter.
I know how to use $_SERVER['HTTP_USER_AGENT'] but don't know which part to look for on Mac OS X/Windows 7/8.
Thanks.

The User Agent string for Safari on the Mac will be something like this:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.13+ (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2
On Windows, you will find something like this:
Mozilla/5.0 (Windows; U; Windows NT 6.1; tr-TR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27
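Based on those two strings, a check for desktop Safari can look for a desktop platform plus the "Version/" and "Safari/" tokens, while excluding browsers that also advertise "Safari/". This is a minimal sketch: isDesktopSafari() is a hypothetical name, and the exclusion list (Chrome, Chromium, Mobile, Android) is an assumption that may need extending for other WebKit-based browsers.

```php
<?php
/**
 * Hypothetical helper: true when the user agent looks like Safari on
 * macOS or Windows, judged only by the tokens in the string.
 */
function isDesktopSafari(string $userAgent): bool
{
    // Desktop platforms appear first inside the parentheses, e.g. "(Macintosh; ...".
    $isDesktopOs = preg_match('~\((?:Macintosh|Windows)\b~', $userAgent) === 1;
    // Safari sends both tokens; Chrome sends "Safari/" without "Version/".
    $isSafari = strpos($userAgent, 'Safari/') !== false
        && strpos($userAgent, 'Version/') !== false;
    // Assumed exclusion list for other browsers and mobile devices.
    $isOther = preg_match('~Chrome|Chromium|Mobile|Android~', $userAgent) === 1;

    return $isDesktopOs && $isSafari && !$isOther;
}

// Typical call inside a request handler:
// $safari = isDesktopSafari($_SERVER['HTTP_USER_AGENT'] ?? '');
```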

Change User Agent in php.ini

I am trying to change the user agent in the php.ini file as follows:
user_agent="Mozilla/5.0 (iPhone Simulator; U; CPU iPhone OS 4_3_2 like Mac OS X; en-us) AppleWebKit/535.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8H7 Safari/6533.18.5"
After that, when I check the user agent in my PHP file with the following command, it shows that the user agent has not changed:
echo $_SERVER['HTTP_USER_AGENT'];
This shows: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)
which is still not the iPhone user agent that I set in the php.ini file.
So please help me set the user agent in php.ini so that my browser request is switched to an iPhone browser request.
I have also tried the following command:
ini_set('user_agent', 'Mozilla/5.0 (iPhone Simulator; U; CPU iPhone OS 4_3_2 like Mac OS X; en-us) AppleWebKit/535.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8H7 Safari/6533.18.5');
This gives the same result, and I am unable to switch to an iPhone browser request.

I'm afraid you've misunderstood. The user_agent setting in php.ini has nothing to do with $_SERVER['HTTP_USER_AGENT'].
The setting in php.ini is used as the default when PHP itself makes HTTP requests through its stream wrappers, for example with file_get_contents() or fopen() on an http:// URL. (cURL does not read this setting; it has its own CURLOPT_USERAGENT option.)
$_SERVER['HTTP_USER_AGENT'] contains the user agent that the web browser sent along with its request to your PHP script. That's why it's showing MSIE: you're viewing the page in MSIE.
If you want to send a different user agent from your browser, you'll have to use a browser plugin unless the browser allows you to freely modify it. For example like this.
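To make the distinction concrete, here is a minimal sketch of the two directions. The client name ExampleFetcher/1.0 and the URL in the commented-out call are placeholders chosen for illustration; the outgoing request is left commented out so the snippet runs without network access.

```php
<?php
// Hypothetical client name, used only for illustration.
$outgoingAgent = 'ExampleFetcher/1.0';

// Outgoing, option 1: change the default for later http:// stream requests in this script.
ini_set('user_agent', $outgoingAgent);

// Outgoing, option 2: set it for a single request through a stream context.
$context = stream_context_create(['http' => ['user_agent' => $outgoingAgent]]);

// Either way, the header is sent by PHP to the remote server, e.g.:
// $body = file_get_contents('http://example.com/', false, $context);

// Incoming: whatever the visitor's browser sent; neither option above changes it.
$incomingAgent = $_SERVER['HTTP_USER_AGENT'] ?? '(none sent)';
```

Neither option affects `$incomingAgent`, which is why the question's echo kept showing the MSIE string.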
