I have an HTML/PHP/CSS site, very simple, and I have no robots.txt file for it. The site is indexed, it's all great.
But now I created a page for something I need, and I want to make that one page noindex.
Do I have to create a robots.txt file for it, or is there an easier way to do it without having to create a robots.txt?
Also, I did Google this before asking, and I came across an article that said to put the following code on your page inside the <head> element:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
I did that. However, after I checked the page here: http://www.seoreviewtools.com/bulk-meta-robots-checker/
It says: Meta robots: Not found
Explanation: There are no restrictions for indexing or serving detected.
So then, how can I make that page noindex?
Although having a robots.txt file is the best you can do, there is never a guarantee that a search robot will not index that page. I believe most, if not all, respectable search engines will follow the restrictions in that file, but there are plenty of other bots out there that don't.
That said, if you don't link to the page anywhere, no search engine will index it, at least in theory: if you or other people access that page and a browser or an extension submits that page for indexing, it ends up being indexed anyway.
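As for the checker reporting "Not found": the robots meta element is only honored (and only detected by such tools) when it appears in the raw HTML inside <head>, not injected by JavaScript and not placed in <body>. A minimal sketch of where it has to live (page content here is made up):

```html
<!DOCTYPE html>
<html>
<head>
  <title>Private page</title>
  <!-- must be inside <head>; the tag name and attribute case don't matter -->
  <meta name="robots" content="noindex, nofollow">
</head>
<body>
  ...
</body>
</html>
```

It is also worth checking that the checker tool was given the exact URL of the page carrying the tag, not a redirect to it.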
I created a PHP page that is only accessible by means of a token/pass received through $_GET.
Therefore, if you go to the following URL you'll get a generic or blank page:
http://fakepage11.com/secret_page.php
However, if you use the link with the token, it shows you special content:
http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4
Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and is only accessed through the provided link.
Are dynamic pages that depend on $_GET variables indexed by Google and other search engines?
If so, will including the following be enough to hide it?
robots.txt:
User-agent: *
Disallow: /
Meta tag: <META NAME="ROBOTS" CONTENT="NOINDEX">
Even if I type into google:
site:fakepage11.com/
Thank you!
If a search engine bot finds the link with the token somehow¹, it may crawl and index it.
If you use robots.txt to disallow crawling the page, conforming search engine bots won’t crawl the page, but they may still index its URL (which then might appear in a site: search).
If you use meta-robots to disallow indexing the page, conforming search engine bots won’t index the page, but they may still crawl it.
You can’t have both: If you disallow crawling, conforming bots can never learn that you also disallow indexing, because they are not allowed to visit the page to see your meta-robots element.
¹ There are countless ways how search engines might find a link. For example, a user that visits the page might use a browser toolbar that automatically sends all visited URLs to a search engine.
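In other words, for the tokenized page the workable combination is the second one: leave the page crawlable (no Disallow for it in robots.txt) and put the noindex rule on the page itself, so that conforming bots can fetch the page and see the restriction:

```html
<!-- in the <head> of secret_page.php; the URL must NOT be disallowed in
     robots.txt, or a conforming bot can never see this tag -->
<meta name="robots" content="noindex">
```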
If your page isn't discoverable, then it will not be indexed.
By "discoverable" we mean:
1. it is a standard web page, i.e. index.*
2. it is referenced by another link, either yours or from another site
So in your case, by using the GET parameter for access you achieve 1 but not necessarily 2, since someone may reference that link and hence the "hidden" page.
You can use the robots.txt that you gave, and in that case the page will not get indexed by a bot that respects it (not all do). Not indexing your page doesn't mean, of course, that the "hidden" page URL will not be in the wild.
Furthermore, another issue (depending on your requirements) is that you use unencrypted HTTP; that means that your "hidden" URLs and the content of the pages are visible to every server between your server and the user.
Apart from search engines, take care that certain services cache/resolve content when URLs are exchanged, for example in Skype or Facebook Messenger. In those cases they will visit the URL and try to extract metadata, and maybe cache it if applicable. Of course this scenario does not expose your URL to the public, but it is exposed to the systems of those services, and with them the content that you have "hidden".
UPDATE:
Another issue to consider is exposing a "hidden" page by linking from it to another page. In that case, in the logs of the server that hosts the linked URL, your hidden page will appear as the referrer and thus be visible; that extends to Google Analytics etc. So if you want to remain stealthy, do not link to other pages from the hidden page.
I have a (randomly named) php file that does a bit of processing and then uses a header("location:url") to redirect to another page.
As mentioned, the script has a random name (eg: euu238843a.php) for security reasons as I don't want people stumbling upon it.
Thing is - How do I stop Google from indexing it - I want other pages in the same directory to be indexed but not this php file. I don't want people to be able to do a site:myurl.com and find the "hidden" script.
I would normally put a meta name="robots" content="noindex" in the head, but I can't do that, as the page needs to output headers at the end to redirect.
You can dynamically update the robots.txt file within the directory using a PHP script that outputs a new or appended robots.txt as needed. If you specify each dynamic filename on a new line, such as Disallow: /File_12345.html, you can avoid having to disallow the entire directory.
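A minimal sketch of that idea, assuming you route requests for /robots.txt to a PHP script (the script name, the rewrite setup, and the hidden filenames here are all hypothetical):

```php
<?php
// robots.php - served as /robots.txt via a rewrite rule (hypothetical setup).
// Emits one Disallow line per hidden script, so the rest of the
// directory stays crawlable.

function build_robots_txt(array $hiddenFiles): string {
    $lines = ["User-agent: *"];
    foreach ($hiddenFiles as $file) {
        // Normalize to a root-relative path
        $lines[] = "Disallow: /" . ltrim($file, "/");
    }
    return implode("\n", $lines) . "\n";
}

header("Content-Type: text/plain");
echo build_robots_txt(["euu238843a.php", "File_12345.html"]);
```

Keep in mind (as noted in an answer above) that a Disallow only stops conforming crawlers from fetching the page; the URL itself can still end up indexed if it is linked from somewhere.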
I am just looking for a bit of advice/feedback. I was thinking about setting up an OpenCart store behind an HTML site (shop) that gets ranked well in Google.
The index.html site appears instead of the index.php page by default on the web server (I have tested it).
I was hoping I could construct the site in maintenance mode on the domain, then just delete the HTML site, leaving the (live) OpenCart site once finished (about 2 weeks).
I'm just worried in case this may affect ranking.
In the robots.txt file put:
User-agent: *
User-agent: Googlebot
Disallow: /index.php
I would also put in the index.php (OpenCart) page header:
<meta name="robots" content="nofollow">
<meta name="googlebot" content="noindex, noarchive">
I don't want Google to cache the "website under maintenance" opencart index.php page. It could take a month or so to refresh it.
Obviously I would remove/change the Disallow in robots.txt and the meta tags etc. once the site is live and the HTML files are deleted.
I would like to know if anyone has tried it, or if it will work. Will it affect Google ranking etc.?
Is it just a bad idea? Any feedback would be appreciated.
Many Thanks
I assume you're using LAMP for your website (Linux, Apache, MySQL, PHP).
1) Apache has an option to set the default page; set it to index.php instead of index.html.
2) You may use a rewrite rule in an .htaccess file. If your hosting provider doesn't give you permission to use .htaccess, there's a workaround! In index.php you may include this snippet at the top:
<?php
// 301-redirect explicit requests for /index.php to the site root
if ($_SERVER['REQUEST_URI'] == '/index.php') {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: /");
    die();
}
?>
So even if a user opens up http://www.domain.com/index.php,
they'll get redirected to http://www.domain.com/
(Eg: My site http://theypi.net/index.php goes to http://theypi.net/)
I've also set similar redirect for http://www.theypi.net to redirect to http://theypi.net
Choosing one of the two options (with or without www) helps improve ranking as well.
To your question
I would like to know if any one has tried it or if it will work? Effect google ranking etc?
Shorter URL: this is part of URL hygiene, which is meant for SEO improvement.
If the homepage opens just through the domain name (without index.php), then your CTR (click-through rate) in search results is higher.
I would suggest not to use robot blocking mechanism unless above steps aren't feasible for you.
Hope it helps, Thanks!
Edit:
And if you don't even have permission to set the homepage to index.php, you may do one of the following:
1. Create index.html and put the PHP code in it. If the web server understands PHP, use the redirect logic as above.
2. Else, use a JavaScript redirect (not a recommended way):
<script> self.location = "index.php"; </script>
I have a price comparison website.
You can click on the link of an offer, and I get $1 from the shop.
The problem is: crawlers crawl the whole website, so they "click on the links".
How can I prevent them from clicking? JavaScript is a bad solution.
Thank you!
I've been thinking about this the wrong way.
I agree with everything that #yttriuszzerbus says above: add a robots.txt file, add "rel=nofollow" to links, and block the user agents that you know about.
So if you've got someone who's now trying to click on a link, it's either a live person, or a badly behaved bot that you don't want clicking.
So how about doing something strange to create the links to the shop sites? Normally, you'd never, ever do this, as it makes your site impossible to index. But that's not an issue - all the well-behaved bots won't be indexing those links because they'll be obeying the robots.txt file.
I'm thinking of something like not having an <a href= tag in there at all. Instead, generate the text of the link, add underlining to the font using a stylesheet so it looks like a link to a normal user, and then have a JavaScript onclick function that redirects the user when they click on it. Bots won't see it as a link, and users won't notice a thing.
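A rough sketch of that idea (the class name and the target URL are made up):

```html
<!-- Styled to look like a link, but with no <a href> for crawlers to follow -->
<span class="fake-link"
      style="text-decoration: underline; color: blue; cursor: pointer;"
      onclick="window.location.href = '/redirect.php?offer=123';">
  View this offer
</span>
```

Note that a headless crawler that executes JavaScript can still trigger the onclick handler, so this only filters out the simpler bots.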
You could:
Use "rel=nofollow" to instruct crawlers not to follow your links.
Block certain user-agent strings
Use robots.txt to exclude spidering of your site.
Unfortunately, none of the above will exclude badly-behaved crawlers. The only solution to actually prevent crawlers is some kind of JavaScript link or a CAPTCHA.
I also had a similar project.
My problem was solved only by blocking certain user-agent strings.
Another problem is that I don't know every "bad" user-agent, so when a new crawler enters the site, I add it to the blacklist and retroactively remove its visits from the statistics.
"rel=nofollow" and robots.txt did not work at all for me.
I would like to hide some content from the public (like Google cached pages). Is it possible?
Add the following HTML tag in the <head> section of your web pages to prevent Google from showing the Cached link for a page.
<META NAME="ROBOTS" CONTENT="noarchive">
Check out Google webmaster central | Meta tags to see what other meta tags Google understands.
Option 1: Disable 'Show Cached Site' Link In Google Search Results
If you want to prevent Google from archiving your site, add the following meta tag to your <head> section:
<meta name="robots" content="noarchive">
If your site is already cached by Google, you can request its removal using Google's URL removal tool. For more instructions on how to use this tool, see "Remove a page or site from Google's search results" at Google Webmaster Central.
Option 2: Remove Site From Google Index Completely
Warning! The following method will remove your site from Google index completely. Use it only if you don't want your site to show up in Google results.
To prevent ("protect") your site from getting into Google's cache, you can use robots.txt. For instructions on how to use this file, see "Block or remove pages using a robots.txt file".
In principle, you need to create a file named robots.txt and serve it from your site's root folder (/robots.txt). Sample file content:
User-agent: *
Disallow: /folder1/
User-Agent: Googlebot
Disallow: /folder2/
In addition, consider setting robots meta tag in your HTML document to noindex ("Using meta tags to block access to your site"):
To prevent all robots from indexing your site, set <meta name="robots" content="noindex">
To selectively block only Google, set <meta name="googlebot" content="noindex">
Finally, make sure that your settings really work, for instance with Google Webmaster Tools.
robots.txt: http://www.robotstxt.org/
You can use a robots.txt file to request that your page not be indexed. Google and other reputable services will adhere to this, but not all do.
The only way to make sure that your site content isn't indexed or cached by any search engine or similar service is to prevent access to the site unless the user has a password.
This is most easily achieved using HTTP Basic Auth. If you're using the Apache web server, there are lots of tutorials (example) on how to configure this. A good search term to use is htpasswd.
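A minimal .htaccess sketch of that setup (the AuthUserFile path here is an example; the .htpasswd file should live outside the web root):

```apache
AuthType Basic
AuthName "Restricted"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

The password file itself is created with the htpasswd utility, e.g. `htpasswd -c /home/example/.htpasswd username` (the -c flag creates the file; omit it when adding further users).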
A simple way to do this would be with a <meta name="robots" content="noarchive"/>
You can also achieve a similar effect with the robots.txt file.
For a good explanation, see the official Google blog post on the robots exclusion protocol.
I would like to hide some content from public....
Use a login system to view the content.
...(like google cached pages).
Configure robots.txt to deny Google bot.
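For example, a robots.txt that blocks only Google's crawler while leaving other bots alone:

```
User-agent: Googlebot
Disallow: /
```

As noted elsewhere in this thread, this stops conforming crawlers from fetching pages, but the bare URL can still show up in results if it is linked from other sites.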
If you want to limit who can see content, secure it behind some form of authentication mechanism (e.g. password protection, even if it is just HTTP Basic Auth).
The specifics of how to implement that would depend on the options provided by your server.
You can also add this HTTP header to your response, instead of needing to update the HTML files:
X-Robots-Tag: noarchive
e.g. for Apache:
Header set X-Robots-Tag "noarchive"
See also: https://developers.google.com/search/reference/robots_meta_tag?csw=1
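If you'd rather set the header per page from PHP (for instance when you can't edit the Apache config), the same value can be sent with header() before any output. The helper function and the path check below are just an illustration, not part of any existing API:

```php
<?php
// Hypothetical helper: choose the X-Robots-Tag value for a given script path.
function robots_tag_for(string $path): string {
    // Keep only this one script out of the cache; everything else stays as-is.
    return basename($path) === "secret_page.php" ? "noarchive" : "all";
}

// Must run before any HTML is echoed, or the header cannot be sent.
header("X-Robots-Tag: " . robots_tag_for($_SERVER["SCRIPT_NAME"] ?? ""));
```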