How to protect a site from (google) caching? - php

I would like to hide some content from public (like google cached pages). Is it possible?

Add the following HTML tag in the <head> section of your web pages to prevent Google from showing the Cached link for a page.
<META NAME="ROBOTS" CONTENT="noarchive">
Check out Google Webmaster Central's "Meta tags" page to see what other meta tags Google understands.
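If your pages are rendered by PHP, you can also emit that tag conditionally so only certain pages are kept out of the cache. A minimal sketch, assuming a hypothetical per-page $hideFromCache flag:
<?php
// Set this flag on the pages whose cached copy you want to suppress.
$hideFromCache = true;
?>
<head>
<?php if ($hideFromCache): ?>
    <meta name="robots" content="noarchive">
<?php endif; ?>
</head>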

Option 1: Disable 'Show Cached Site' Link In Google Search Results
If you want to prevent Google from archiving your site, add the following meta tag to your <head> section:
<meta name="robots" content="noarchive">
If your site is already cached by Google, you can request its removal using Google's URL removal tool. For more instructions on how to use this tool, see "Remove a page or site from Google's search results" at Google Webmaster Central.
Option 2: Remove Site From Google Index Completely
Warning! The following method will remove your site from Google's index completely. Use it only if you don't want your site to show up in Google results at all.
To prevent ("protect") your site from getting to Google's cache, you can use robots.txt. For instructions on how to use this file, see "Block or remove pages using a robots.txt file".
In principle, you need to create a file named robots.txt and serve it from your site's root folder (/robots.txt). Sample file content:
User-agent: *
Disallow: /folder1/
User-Agent: Googlebot
Disallow: /folder2/
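If the rules need to change at runtime, robots.txt can also be generated by PHP. A minimal sketch, assuming your server is configured to route requests for /robots.txt to this script (e.g. via a rewrite rule):
<?php
// robots.php - emits the same rules as the static sample above.
header('Content-Type: text/plain');
echo "User-agent: *\n";
echo "Disallow: /folder1/\n";
echo "\n";
echo "User-agent: Googlebot\n";
echo "Disallow: /folder2/\n";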
In addition, consider setting the robots meta tag in your HTML document to noindex ("Using meta tags to block access to your site"):
To prevent all robots from indexing your site, set <meta name="robots" content="noindex">
To selectively block only Google, set <meta name="googlebot" content="noindex">
Finally, make sure that your settings really work, for instance with Google Webmaster Tools.

robots.txt: http://www.robotstxt.org/

You can use a robots.txt file to request that your page is not indexed. Google and other reputable services will adhere to this, but not all do.
The only way to make sure that your site content isn't indexed or cached by any search engine or similar service is to prevent access to the site unless the user has a password.
This is most easily achieved using HTTP Basic Auth. If you're using the Apache web server, there are lots of tutorials (example) on how to configure this. A good search term to use is htpasswd.
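If you can't touch the Apache configuration, a rough PHP-only equivalent of Basic Auth looks like the sketch below. The username, password and realm are placeholders, and note that $_SERVER['PHP_AUTH_USER'] is only populated on some setups (e.g. PHP running as an Apache module); htpasswd remains the more robust option.
<?php
// Very small Basic Auth guard; put this before any protected output.
$validUser = 'editor';     // placeholder credentials - store real ones hashed
$validPass = 'change-me';

if (!isset($_SERVER['PHP_AUTH_USER'])
    || $_SERVER['PHP_AUTH_USER'] !== $validUser
    || $_SERVER['PHP_AUTH_PW'] !== $validPass) {
    header('WWW-Authenticate: Basic realm="Private area"');
    header('HTTP/1.0 401 Unauthorized');
    exit('Authentication required.');
}
// ...protected content below...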

A simple way to do this would be with a <meta name="robots" content="noarchive"/>
You can also achieve a similar effect with the robots.txt file.
For a good explanation, see the official Google blog post on the robots exclusion protocol.

I would like to hide some content from public....
Use a login system to view the content.
...(like google cached pages).
Configure robots.txt to deny Googlebot.

If you want to limit who can see content, secure it behind some form of authentication mechanism (e.g. password protection, even if it is just HTTP Basic Auth).
The specifics of how to implement that would depend on the options provided by your server.

You can also add this HTTP header to your response instead of having to update the HTML files:
X-Robots-Tag: noarchive
e.g. for Apache:
Header set X-Robots-Tag "noarchive"
See also: https://developers.google.com/search/reference/robots_meta_tag?csw=1
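If you would rather send that header from PHP than from the Apache configuration, a minimal sketch (it must run before any output is sent):
<?php
// Equivalent of the Apache directive above, applied per script.
header('X-Robots-Tag: noarchive');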

Related

No Robots.txt but I want to noindex a page

I have an HTML/PHP/CSS site, very simple, and I have no robots.txt file for it. The site is indexed, it's all great.
But now I created a page for something I need, and I want to make that one page noindex.
Do I have to create a robots.txt file for it, or is there an easier way to do it without having to create a robots.txt?
Also, I did Google for this before asking, and I came across an article that instructed me to put the following code in the <head> section of the page:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
I did that. However, after I checked the page here: http://www.seoreviewtools.com/bulk-meta-robots-checker/
It says: Meta robots: Not found
Explanation: There are no restrictions for indexing or serving detected.
So then, how can I make that page noindex?
Although having a robots.txt file is the best you can do, there is never a guarantee that a search robot will not index that page. I believe most, if not all, respectable search engines will follow the restrictions in that file, but there are plenty of other bots out there that don't.
That said, if you don't link to the page anywhere, no search engine will index it - at least in theory: if you or other people access that page and their browser or an extension submits that page for indexing, it ends up being indexed anyway.
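If the page in question is a PHP script, one hedged alternative is to send the directive as an HTTP header from the very top of that single file, before any HTML is echoed, and then re-run the checker:
<?php
// First line of the one page you want kept out of the index.
// The header works even if something later rewrites or strips the <meta> tags.
header('X-Robots-Tag: noindex, nofollow');
?>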

How to get Facebook Debugger to read canonical URL?

So this is happening when I test my website using Facebook's Open Graph Object Debugger:
It doesn't like the trailing numbers after the profile page. But I have both of these defined properly:
<meta property="og:url" content="http://www.website.com/profile/139">
<link rel="canonical" href="http://www.website.com/profile/139">
I've tried for hours and it just keeps redirecting to the homepage.
Is there anything I can add to my .htaccess file or PHP header to prevent this 301 redirect?
May be related to the way Facebook/Google handle URL parameters: http://gohe.ro/1fpOA0N
The answer was a problem with our hosting provider, WP Engine, which tricks spiders into ignoring purely numeric strings at the end of page URLs. This pertains specifically to:
Googlebot (Google's spider)
Slurp! (Yahoo's spider)
BingBot (Bing's spider)
Facebook OG/Debugger
For example, the following URL:
http://www.website.com/profile/12345
will be interpreted by these bots as:
http://www.website.com/profile
However, if the string is non-numeric the bots will recognize it. This is done for caching purposes. But again, this pertains only to WP Engine and a few other hosting providers.
Facebook treats the og:url Meta Tag as the Canonical for your page:
<meta property="og:url" content="http://www.yoursite.com/your-canonical-url" />
If your canonical URL is redirecting, you are in fact creating a loop.
Don't redirect from your canonical.
The canonical is the page that should be considered the preferred option for the spiders.
If a page has a canonical URL tag, it means that it is NOT the best/default page but rather a lesser variation of the canonical.
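One way to keep the two tags from ever disagreeing is to build both from the same value. A small sketch in PHP; $profileId and the domain are placeholders:
<?php
// Compute the canonical URL once and reuse it for both tags.
$profileId    = 139; // placeholder
$canonicalUrl = 'http://www.website.com/profile/' . $profileId;
?>
<link rel="canonical" href="<?php echo htmlspecialchars($canonicalUrl); ?>">
<meta property="og:url" content="<?php echo htmlspecialchars($canonicalUrl); ?>">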

Html - PHP - Robots

I am just looking for a bit of advice / feedback. I was thinking about setting up an OpenCart shop behind an HTML site that gets ranked well in Google.
The index.html site appears instead of the index.php page by default on the web server (I have tested it).
I was hoping I could construct the site in maintenance mode on the domain, then just delete the html site leaving the (live) opencart site once finished (about 2 weeks).
Just worried in case this may affect ranking.
In the robots.txt file put:
User-agent: *
User-agent: Googlebot
Disallow: /index.php
I would also put this in the header of the index.php (OpenCart) page:
<meta name="robots" content="nofollow">
<meta name="googlebot" content="noindex, noarchive">
I don't want Google to cache the "website under maintenance" opencart index.php page. It could take a month or so to refresh it.
Obviously I would remove/change the Disallow robot.txt and meta tag etc commands once live and html site files deleted.
I would like to know if anyone has tried it and whether it will work. Will it affect Google ranking etc.?
Is it just a bad idea? Any feedback would be appreciated.
Many Thanks
I assume you're using LAMP for your website (Linux, Apache, MySQL, PHP)
1) Apache has an option (DirectoryIndex) to set the default page; set it to index.php instead of index.html.
2) You may use a rewrite rule in the .htaccess file (read more here). If your hosting provider doesn't give you permission to use .htaccess, there's a workaround! In index.php you may include this snippet at the top:
<?php
// Redirect direct requests for /index.php to the site root with a permanent (301) redirect.
if ($_SERVER['REQUEST_URI'] == '/index.php') {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: /");
    die();
}
?>
So even if a user opens http://www.domain.com/index.php,
they'll get redirected to http://www.domain.com/
(Eg: My site http://theypi.net/index.php goes to http://theypi.net/)
I've also set similar redirect for http://www.theypi.net to redirect to http://theypi.net
Choosing one of the two options (with or without www) helps improve ranking as well.
To your question
I would like to know if any one has tried it or if it will work? Effect google ranking etc?
Shorter URL: this is part of URL hygiene, which is meant for SEO improvement.
If the homepage opens through the domain name alone (without index.php), then your CTR (click-through rate) in search results is higher.
I would suggest not using a robot-blocking mechanism unless the above steps aren't feasible for you.
Hope it helps, Thanks!
Edit:
And if you don't even have permission to set the homepage to index.php, you may do one of the following:
1. Create index.html and put the PHP code in it; if the web server parses PHP in .html files, use the redirect logic above.
2. Otherwise, use a JavaScript redirect (not a recommended way):
<script language="JavaScript"> self.location = "index.php"; </script>

Turn off search privacy

How can I hide profile details from search engines using PHP?
I want to prevent search engines from indexing users' details on my website, but only when the user has set "turn off search privacy" on their account page.
Ex: Facebook profile privacy
You can add the following lines to the <head> section of your HTML/PHP page:
<meta name="robots" content="noindex, nofollow" />
<meta name="googlebot" content="noindex, nofollow" />
Create a robots.txt with something like the following
User-agent: *
Disallow: /profile
More information on possible options can be found here.
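Because the setting is per user, the tags have to be printed conditionally. A minimal sketch, assuming a hypothetical $user array for the profile being viewed, with a boolean 'search_privacy' column that the owner toggles on their account page:
<?php
// In the <head> of the profile template.
if (!empty($user['search_privacy'])) {
    echo '<meta name="robots" content="noindex, nofollow" />' . "\n";
    echo '<meta name="googlebot" content="noindex, nofollow" />' . "\n";
}
?>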
You may try using /robots.txt but there's no guarantee that they will really respect that configuration.
The second thing you may do is hide the content from robots by blacklisting their IP addresses (you can Google for lists of search-robot IPs, like this one), although this is a practice also used by malware.
Using robots.txt / HTML meta tags is nice if the bot respects them, but otherwise pointless. The only reliable way to protect the info is to use some sort of authentication system where only registered members may view certain content, otherwise limiting who can see what.

How Does WordPress Block Search Engines?

If you go to the WordPress admin and then Settings -> Privacy, there are two options asking you whether you want to allow your blog to be searched through by search engines, including this option:
I would like to block search engines,
but allow normal visitors
How does WordPress actually block search bots/crawlers from crawling the site when it is live?
According to the codex, it's just robots meta tags, robots.txt and suppression of pingbacks:
Causes <meta name='robots' content='noindex,nofollow' /> to be generated into the <head> section (if wp_head is used) of your site's source, causing search engine spiders to ignore your site.
Causes hits to robots.txt to send back:
User-agent: *
Disallow: /
Note: The above only works if WordPress is installed in the site root and no robots.txt exists.
These are "guidelines" that all friendly bots will follow. A malicious spider searching for E-Mail addresses or forms to spam into will not be affected by these settings.
With a robots.txt (if installed at the site root):
User-agent: *
Disallow: /
or (from here):
I would like to block search engines, but allow normal visitors - check this for these results:
* Causes "<meta name='robots' content='noindex,nofollow' />" to be generated into the <head> section (if wp_head is used) of your site's source, causing search engine spiders to ignore your site.
* Causes hits to robots.txt to send back:
User-agent: *
Disallow: /
Note: The above only works if WordPress is installed in the site root and no robots.txt exists.
* Stops pings to Ping-O-Matic and any other RPC ping services specified in the Update Services of Administration > Settings > Writing. This works by having the function privacy_ping_filter() remove the sites to ping from the list. This filter is added by having add_filter('option_ping_sites','privacy_ping_filter'); in the default filters. When the generic_ping function attempts to get the "ping_sites" option, this filter blocks it from returning anything.
* Hides the Update Services option entirely on the Administration > Settings > Writing panel with the message "WordPress is not notifying any Update Services because of your blog's privacy settings."
You can't actually block bots and crawlers from searching through a publicly available site; if a person with a browser can see it, then a bot or crawler can see it (caveat below).
However, there is something called the Robots Exclusion Standard (or robots.txt standard), which allows you to indicate to well-behaved bots and crawlers that they shouldn't index your site. This site, as well as Wikipedia, provide more information.
The caveat to the above comment that what you see on your browser, a bot can see, is this: most simple bots do not include a Javascript engine, so anything that the browser renders as a result of Javascript code will probably not be seen by a bot. I would suggest that you don't use this as a way to avoid indexing, since the robots.txt standard does not rely on the presence of Javascript to ensure correct rendering of your page.
One last comment: bots are free to ignore this standard. Those bots are badly behaved. The bottom line is that anything that can read your HTML can do what it likes with it.
I don't know for sure but it probably generates a robots.txt file which specifies rules for search engines.
Using a Robots Exclusion file.
Example:
User-agent: Googlebot
Disallow: /private/
