Restricting JS links from search engine crawling - php

I would like to prevent Google from following links I have in JS.
I couldn't find out how to do that in robots.txt.
Am I looking in the wrong place?
Some more information:
I'm seeing that Google is crawling those pages even though the links only appear in JS.
The reason I don't want them crawled is that the content depends on external APIs, and I don't want to waste my rate limit with those APIs on Google's crawlers rather than on actual user demand.

Straight from Google:
http://www.google.com/support/webmasters/bin/answer.py?answer=96569

Google probably won't find any links you have hidden in JS, but someone else could link to the same place.
It isn't links that matter though, it is URLs. Just specify the URLs you don't want search engines to visit in robots.txt; the fact that you usually expose them to the browser via JS is irrelevant.
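For instance, if the URLs your JS builds all live under a common path, a robots.txt entry like this asks well-behaved crawlers to skip them (the /api-content/ path is a made-up example; substitute your own):
User-agent: *
Disallow: /api-content/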
If you really want to limit access to the content, then just reducing discoverability probably isn't sufficient and you should put an authentication layer (e.g. password protection) in place.

Related

Can I stop robots from ruining statistics in Google Analytics?

I use Google Analytics to get visitor statistics for my website (PHP) and I see that a lot of traffic comes from sites like share-buttons.xyz, traffic2cash.xyz and top1-seo-service.com. I think this is because I use SEO-friendly URLs (for looks in the address bar).
This is not really a problem for the site itself, but when I look at the statistics in Google Analytics they include these robots and non-users, so the numbers are not accurate.
Is there a way to block these robots, or do I have to subtract the robots' visits from the statistics manually every time I want a report?
If you see this happening you can prospectively exclude them from all future reports in GA by using a filter on that view (Admin > Filters, create the filter, then apply it to the specific view).
If you specifically want to do it proactively using PHP, you could use a regex to match undesirable referrers in the request headers and return nothing, as in the sketch below.
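As a rough sketch of that idea, assuming the referrer domains from the question are the ones you want to drop:
<?php
// Hedged sketch: refuse requests whose Referer matches known spam domains.
// The domain list below is taken from the question; extend it as needed.
$spamPattern = '/(share-buttons\.xyz|traffic2cash\.xyz|top1-seo-service\.com)/i';
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if ($referrer !== '' && preg_match($spamPattern, $referrer)) {
    http_response_code(403); // or simply exit with an empty body
    exit;
}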
The answer to the main question is yes, but it requires you to be persistent; it is basically an ongoing task that you will need to keep performing. Yes, I know it's a pain.
Just to let you know, this has nothing to do with PHP or your friendly URLs: your website is the victim of what is known as ghost referrals. Google has not publicly said anything on the topic, but just recently I found this article reporting that Google has finally found a solution here.
However, I choose to be sceptical about this. In the meantime, this is what you need to do:
Make sure to leave one view untouched, without any filters (read the fourth paragraph).
In Google Analytics > Admin > View > View Settings, check "Exclude all hits from known bots and spiders", like this.
In the same view, block spam bots: a) check the list of ghost referrals in YOUR REPORT following this method, and b) create a filter like this.
I recommend reading this article in full; it contains lots of details and more information.
Some people like to create filters with a regex listing all the spammy bots; if you want an up-to-date list, visit this repository.
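For example, a single exclude-filter pattern covering just the referrers named in this question would be share-buttons\.xyz|traffic2cash\.xyz|top1-seo-service\.com; the repository mentioned above maintains a much longer version of the same idea.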

Stop Facebook probing a site for content with PHP

Okay, so when you post a link on Facebook, it does a quick scan of the page to find images and text etc. to create a sort of preview on their site. I'm sure other social networks such as Twitter do much the same, too.
Anyway, I created a sort of "one time message" system, but when you create a message and send the link in a chat on Facebook, it probes the page and renders the message as "seen".
I know that the Facebook probe has a user agent of facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php), so I could just block all requests from anything with that user agent, but I was wondering if there's a more efficient way of achieving this with all sites that "probe" links for content?
No, there's no fool-proof way to do this. The easiest approach is to manually block certain visitors from marking the content as seen.
Every entity on the web identifies itself with a user agent. Although not every non-human entity identifies itself in a unique way, there are online databases like this one that can help you achieve your goal.
If you instead try to block all bots via robots.txt, be aware that not every bot honours that standard. I would speculate that Facebook visits any shared link to prevent malware from being spread across their network.
You could try something like this in your robots.txt file:
User-agent: *
Disallow: /
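If you'd rather handle it in PHP than rely on robots.txt, a minimal sketch might look like this. The facebookexternalhit string comes from the question; the other probe names and the markMessageAsSeen() helper are hypothetical placeholders for your own code:
<?php
// Skip the "seen" logic when the request comes from a known link-preview probe.
$probes = array('facebookexternalhit', 'Twitterbot', 'Slackbot'); // assumed probe names
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isProbe = false;
foreach ($probes as $probe) {
    if (stripos($userAgent, $probe) !== false) {
        $isProbe = true;
        break;
    }
}
if (!$isProbe) {
    markMessageAsSeen($messageId); // hypothetical: your existing one-time-message logic
}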

Topic search in search engines

I am developing a website similar to a web forum. Users will post queries and others help them through their replies. In many websites like mine, the topic of the query will be included in the URL, e.g. www.sample.com/topic-1.html or www.sample.com/topic-2.html, and these links can be found via search engines.
How can I dynamically generate the HTML files and configure my website so that search engines can access it?
No, they usually aren't putting those files on the web server manually. They are rewriting URLs using the web server (e.g. Apache/nginx). Check the response headers to get more info about what happens behind the scenes.
Please see How to create friendly URL in php?
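As an illustration of that rewriting, an Apache .htaccess rule such as the following maps the friendly URLs onto a single PHP script (topic.php and the id parameter are invented for the example):
RewriteEngine On
RewriteRule ^topic-([0-9]+)\.html$ topic.php?id=$1 [L,QSA]
topic.php can then read $_GET['id'], fetch the matching thread, and render the HTML, so no static .html files ever exist on disk.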

What is the proper way to block bots/crawlers from hitting a link/page?

I am working on analytics and I am getting many inaccurate results, mostly because of social media bots or other random bots like BufferBot, DataMinr etc. from Twitter.
Is there any web API/database of all known bots that I can use to check whether a visitor is a bot or a human?
Or is there any good way to block such bots so that they don't affect the stats in terms of analytics?
You can link to a hidden page that is blocked by robots.txt. When visited, the page captures the user agent and IP address of the bot and then appends one or both of them to a .htaccess file, which blocks them permanently. It only catches bad bots and is automated, so you don't have to do anything to maintain it.
Just make sure you set up the robots.txt file first and then give the good bots a fair chance to read it and update their crawling accordingly.
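A minimal sketch of such a trap page, assuming an Apache setup where PHP is allowed to append to the site's .htaccess (the file names are hypothetical):
<?php
// trap.php -- linked from a hidden anchor and disallowed in robots.txt,
// so only bots that ignore robots.txt ever reach it.
$ip        = $_SERVER['REMOTE_ADDR'];
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';
// Log the offender, then append a deny rule for its IP to .htaccess.
file_put_contents('bad-bots.log', date('c') . " $ip $userAgent\n", FILE_APPEND);
file_put_contents('.htaccess', "Deny from $ip\n", FILE_APPEND); // Apache 2.2 syntax; 2.4 uses "Require not ip"
And the matching robots.txt entry that keeps good bots away from the trap:
User-agent: *
Disallow: /trap.php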
Create a file called robots.txt in your root and add the following lines:
User-agent: *
Disallow: /
There is no way to outright block ALL bots; it would take an insane amount of time. You could use a .htaccess file or a robots.txt. Stopping Google from indexing the site is easy, but blocking bot traffic can get complicated and act like a house of cards.
I suggest using this list of crawlers/web-bots http://www.robotstxt.org/db.html
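If you do go the .htaccess route mentioned above, a mod_rewrite sketch that refuses the bot user agents named in the question (extend the alternation as needed) could be:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BufferBot|DataMinr) [NC]
RewriteRule .* - [F,L]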

iframes for ads? getting user information?

I was trying to do something like Google's AdSense. I believe they use JavaScript? But is using an iframe a good idea for something someone would put on their site to display ads? Would iframes be able to capture users' data such as cookies (how AdSense works, they get users' cookies--that's why they can display ads for sites you've visited, correct me if I'm wrong)?
If this works, how would I be able to get users' cookies? Is it possible? How does Google get users' cookies?
Thanks for your help in advance!
(how AdSense works, they get users' cookies--that's why they can display ads for sites you've visited, correct me if I'm wrong)?
You are wrong. Google can only access Google's cookies. It's a big point in cookie security; no browser will allow you to get to other sites' cookies. Google can use cookies to identify you, but can't use them to see your behaviour on non-Google sites.
AdSense knows what you've been browsing by checking what links you click on Google Search and other services, what Ads you click on, what pages you visit that have AdSense in them (window.top.document.location) and which pages you visit them from (window.top.document.referrer), and probably more methods that people smarter than me at Google come up with :)
EDIT: as shown in the comments, one can't in fact rely on the top properties.
No, you can't get these cookies. They're stored so as to be readable only by the domain AdSense uses to log people.
This is why an iframe is used: it allows Google to load a specific URL on a domain they control, and that URL contains an identifier telling them which AdSense campaign is being used.
Besides, the cookie that is present (but not accessible to you) doesn't contain any information about the user itself. It is just an identifier linking the person to data that is already present on Google's servers.
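As a purely illustrative sketch, the publisher-side embed is little more than an iframe pointing at the ad network's own domain (the domain and parameter below are invented):
<iframe src="https://ads.example-network.com/show?campaign=12345" width="300" height="250"></iframe>
Because the page inside the iframe is served from the network's domain, any cookie it sets or reads belongs to that domain, and the publisher's page can never touch it.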
