What is proper way to block bots/crawlers hitting link/page?

What is proper way to block bots/crawlers hitting link/page? - php

I am working on analytics and I am getting many in accurate results mostly because of either social media bots or other random bots like BufferBot,DataMinr etc from Twitter.
Is there any Web API/Database of all known bots available which I can use to check if it is a bot or human ?
Or is there any good way to block such kind of bots so that it doesn't effect the stats in terms of analytics?

You can link to a hidden page that is blocked by robots.txt. When visited, captures the user-agent and IP address of the bot and then appends one or both of them to a .htaccess file which blocks them permanently. It only catches bad bots and is automated so you don't have to do anything to maintain it.
Just make sure you set up the robots.txt file first and then give the good bots a fair chance to read it and update their crawling accordingly.

Create a file callled robots.txt in your route and add the following lines:
User-agent: *
Disallow: /

There is no way to outright block ALL bots, it would be an insane amount of time spent, you could use a .htaccess file or a robots.txt, stopping google indexing the site is easy but blocking bot traffic can get complicated and act like a house of cards
I suggest using this list of crawlers/web-bots http://www.robotstxt.org/db.html

Related

Can I stop robots from ruining statistics in Google Analytics?

I use Google Analytics to get visitors statistics on my webiste (PHP) and I see that a lot of traffic comes from sites like share-buttons.xyz, traffic2cash.xyz and top1-seo-service.com. I think this is because I use SEO-firendy URL:s (for looks in the addess bar).
This is not really a problem for the site itself, but when I look at the statistics in Google Analytics it includes these robots and non-users and the numbers are therefore not true.
Is there a way to block these robots or do I have to subtract the robots visits from the statistics manually every time I want a report?

If you see this happening you can prospectively exclude them from all future reports in GA by using a filter on that view (admin - filters, create filter, then apply to specific view)
If you specifically want to do it proactively using PHP then you could use some regex to match undesirable referrers in request headers and return nothing.

The answer to the main question is yes, but it requires to be be persistent and it is basically an ongoing task that you will need to perform. Yes, I know is a pain.
Just to let you know this has nothing todo with PHP or your friendly URL, your website is being a victim of what is known as ghost referrals. Google has not publicly said anything on the topic but just recently I found this article reporting that Google has finally found a solution here.
However, I choose to be sceptical about this. In the mean time this is what you need to do:
Make sure to leave a view untouched without any filters (Read the fourth paragrah)
In Google Analytics > admin > view > View Settings> Check "Exclude all hits from known bots and spiders" like this.
In the same view block spam bots: a) Check the list of ghost referrals in YOUR REPORT following this method and b) Create a filter like this.
I recommend you to read this article in full that contains lots of details and more information.
Some people like to create filters with Regex listening all the spammy bots, if you want to check a up to date list visit this repository.

Stop Facebook probing a site for content with PHP

Okay, so when you post a link on Facebook, it does a quick scan of the page to find images and text etc. to create a sort of preview on their site. I'm sure other social networks such as Twitter do much the same, too.
Anyway, I created a sort of "one time message" system, but when you create a message and send the link in a chat on Facebook, it probes the page and renders the message as "seen".
I know that the Facebook probe has a user agent of facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php), so I could just block all requests from anything with that user agent, but I was wondering if there's a more efficient way of achieving this with all sites that "probe" links for content?

No there's no fool-proof way to do this. The easiest way to achieve something like this is to manually block certain visitors from marking the content as seen.
Every entity on the web identifies itself with a user agent, although not every non-human entity identfies itself in an unique way there are online database like this one that can help achieve your goal.
In case of trying to block all bots via robots.txt, not every bot holds up to that standard. I will speculate that Facebook may try to prevent malware from being spread across their network by visiting any shared link.

you could try something like this in your robots.txt file
User-agent: *
Disallow: /

Restricting JS links from search engine's crawling

I would like to prevent google from following links I have in JS.
I didn't find how to do that in robots.txt
Am I looking in the wrong place?
Some more information:
I'm seeing google is crawling those pages although the links only appear in JS.
The reason I don't want him to crawl is that this content depends on external API's which I don't want to waste my rate limit with them on google crawlers and only per user demand

Direct from google ->
http://www.google.com/support/webmasters/bin/answer.py?answer=96569

Google probably won't find any links you have hidden in JS, but someone else could link to the same place.
It isn't links that matter though, it is URLs. Just specify the URLs you don't want search engines to visit in the robots.txt. The fact that you usually expose them to the browser via JS is irrelevant.
If you really want to limit access to the content, then just reducing discoverability probably isn't sufficient and you should put an authentication layer (e.g. password protection) in place.

sending a long link in an email using PHP

I am trying to implement a website which among other things, let users invite other users to specific pages. Unfortunately the link address of those pages are fairly long, and often cross the 70 characters limit. SO when I add them to the mail, even if I start a new line before the link, still the link address is cut halfway, and then the email client (gmail, for example) assumes the link ends at the end of the line. SO when the user clicks on the link, they experience it as broken.
I am coding all this in PHP, but the problem seem to be general.
What is the standard solution to this problem?

Place the URL in <> brackets. Most mail clients will parse the URL correctly and make it clickable, even when wrapped.
<http://www.somereallylongdomain.com/somereallylongdirectory/somereallylongfilename.html>

You could use a URL shortener to redirect to the longer links. Bit.ly has an API with which your code can interface for this purpose.

I don't know if there are better solutions, but you can implement a url shortener with http://yourls.org/ or with other tools...

Create your own URL shortening solution. There are several ways you can go, depending on the complexity of your requirements:
if you're using only a few selected urls which are always repeating, use apache rewrite
if the url is user specific or changes in other ways from case to case, use a database table that stores a short url and the original url
if you don't want or can't implement your own solution, use an existing url shortening service via an API, but make sure not to expose security relevant information

How do I block Google adsense linking to files on my server?

Today I noticed that a few Google advertisements in an adsence block on one of my pages were trying to display a file called "/pagead/badge/checkout_999999.gif" from my server. I did a bit of investigating and found out that the companies behind these adverts use Google Checkout and "checkout_999999.gif" is supposed to be a tiny shopping kart icon with a tooltip which reads "This site accepts Google Checkout".
My problem is that "/pagead/badge/checkout_999999.gif" doesn't exist on my server. What do you do to handle this on your website? e.g:
Save the logo () on your server in the place it is expected by Google?
Use a mod_rewrite rule to redirect the request? To where though?
Find a adsense option to turn off Google checkout enabled adverts? (I looked but couldn't see one?)
Ignore the issue and get on with something important
Back-story - please ignore unless very bored: Page 2 of the search page on our site suddenly stopped working and I didn't know why. It turned out to be related to Google adsense. We use PHP session variables to save search criteria over different pages which worked fine for a while but then randomly stopped working. Random bugs are the worst! I was trying to work out what else is random on the page and decided that the Google ads were the only other random thing. Sure enough, sometimes a particular advert seemed to clear the session variables and break the search. What was actually happening was that the advert was requesting a image from our server ("checkout_999999.gif") which didn't exist and Apache was behind-the-scenes redirecting to the site homepage which unfortunately clears the session variables needed for the search - hence the non-obvious breakage. I'm a bit worried that Google ads can request random files from my sever? I'd prefer if they could only use absolute URL's if they want to include logo's or other media.

Sounds like a bug with google adsense delivery. File a bug with them is your best bet for a long-term fix.

After playing around with Apache mod_rewrite for a while I have found a rule that seems to fix my Google Adsense issue:
RewriteRule ^\/pagead\/badge\/checkout\_999999\.gif$ http://pagead2.googlesyndication.com/pagead/badge/checkout_999999.gif [R=301,L]
The problem is I'm not sure how to stop a similar thing happening in the future if Google decides to hotlink to a different file?

As Google doesn't care about the problem (previous posts were sent in June and we are now in October... and the bug was reported directely to Google), I decided to put an image on my server that would suggest the user to click on the ad!
Clicks then increased!!
As this is my server, I can do what I want and that's not my problem if Google asks for files on my server. They cost me money by using my bandwidth and connecting to my server, I am now paid for that!!
You should do the same...

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.