How Does WordPress Block Search Engines? - php

If you go to the WordPress admin and then Settings -> Privacy, there are two options asking whether you want to allow your blog to be searched through by search engines, one of them being this option:
I would like to block search engines,
but allow normal visitors
How does WordPress actually block search bots/crawlers from searching through the site when it is live?

According to the Codex, it's just a robots meta tag, robots.txt and suppression of pingbacks:
Causes <meta name='robots' content='noindex,nofollow' /> to be generated into the <head> section (if wp_head is used) of your site's source, causing search engine spiders to ignore your site.
Causes hits to robots.txt to send back:
User-agent: *
Disallow: /
Note: The above only works if WordPress is installed in the site root and no robots.txt exists.
These are "guidelines" that all friendly bots will follow. A malicious spider searching for E-Mail addresses or forms to spam into will not be affected by these settings.

With a robots.txt (if installed as root)
User-agent: *
Disallow: /
or (from here): checking "I would like to block search engines, but allow normal visitors" gives these results:
* Causes "<meta name='robots' content='noindex,nofollow' />" to be generated into the <head> section (if wp_head is used) of your site's source, causing search engine spiders to ignore your site.
* Causes hits to robots.txt to send back:
User-agent: *
Disallow: /
Note: The above only works if WordPress is installed in the site root and no robots.txt exists.
* Stops pings to Ping-O-Matic and any other RPC ping services specified in the Update Services of Administration > Settings > Writing. This works by having the function privacy_ping_filter() remove the sites to ping from the list. This filter is added by having add_filter('option_ping_sites','privacy_ping_filter'); in the default-filters. When the generic_ping function attempts to get the "ping_sites" option, this filter blocks it from returning anything.
* Hides the Update Services option entirely on the Administration > Settings > Writing panel with the message "WordPress is not notifying any Update Services because of your blog's privacy settings."
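For readers curious how those two mechanisms can be wired up, below is a minimal plugin-style sketch. It assumes the privacy choice is stored in the blog_public option (the option name WordPress uses for this setting); the function names my_privacy_noindex() and my_privacy_ping_filter() are illustrative, not the exact ones WordPress ships.
<?php
// Sketch: emit the robots meta tag through the wp_head hook when the blog
// is set to "block search engines" (blog_public stored as '0').
function my_privacy_noindex() {
    if ( '0' === get_option( 'blog_public' ) ) {
        echo "<meta name='robots' content='noindex,nofollow' />\n";
    }
}
add_action( 'wp_head', 'my_privacy_noindex' );

// Sketch in the spirit of privacy_ping_filter(): return an empty list of ping
// services so generic_ping() has nothing to notify while the blog is private.
function my_privacy_ping_filter( $sites ) {
    return ( '0' === get_option( 'blog_public' ) ) ? '' : $sites;
}
add_filter( 'option_ping_sites', 'my_privacy_ping_filter' );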

You can't actually block bots and crawlers from searching through a publicly available site; if a person with a browser can see it, then a bot or crawler can see it (caveat below).
However, there is something called the Robots Exclusion Standard (or robots.txt standard), which allows you to indicate to well-behaved bots and crawlers that they shouldn't index your site. This site, as well as Wikipedia, provide more information.
The caveat to the above comment that what you see in your browser, a bot can also see, is this: most simple bots do not include a JavaScript engine, so anything that the browser renders as a result of JavaScript code will probably not be seen by a bot. I would suggest that you don't rely on this as a way to avoid indexing, since the robots.txt standard does not rely on the presence of JavaScript to ensure correct handling of your page.
One last comment: bots are free to ignore this standard. Those bots are badly behaved. The bottom line is that anything that can read your HTML can do what it likes with it.

I don't know for sure but it probably generates a robots.txt file which specifies rules for search engines.

Using a Robots Exclusion file.
Example:
User-agent: Googlebot
Disallow: /private/

Related

Add "noindex" in a link to a pdf

I have a website with links to a PHP script that generates a PDF with the mPDF library; the PDF is displayed in the browser or downloaded, depending on the configuration.
The problem is that I do not want these PDFs to be indexed by Google. I have already added rel="nofollow" to the links, so new ones are no longer indexed, but how can I de-index the ones that are already there?
Using rel="noindex, nofollow" does not work.
It would have to be done only with PHP or some HTML tag.
How is Google supposed to de-index something if you prevent its robot from accessing the resource? ;) This may seem counter-intuitive at first.
Remove the rel="nofollow" on the links, and in the script which serves the PDF files include an X-Robots-Tag: none header. Google will be able to reach the resource, see that it is forbidden to index this particular resource, and remove the record from the index.
When de-indexing is done, add the Disallow rule to the robots.txt file as @mtr.web mentions so robots won't drain your server anymore.
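As a rough sketch of what that header looks like in the PHP script that streams the PDF (assuming mPDF 7+ with Composer autoloading; the file name and HTML are placeholders):
<?php
require __DIR__ . '/vendor/autoload.php';

// Ask conforming crawlers not to index or follow this response;
// "none" is shorthand for "noindex, nofollow". Must be sent before any output.
header('X-Robots-Tag: none');

$mpdf = new \Mpdf\Mpdf();
$mpdf->WriteHTML('<h1>Report</h1><p>Placeholder content.</p>');

// 'I' displays the PDF inline in the browser; 'D' would force a download.
$mpdf->Output('report.pdf', 'I');

The key point is that the header travels with the PDF response itself, so Googlebot must still be allowed to fetch the URL until the entry has dropped out of the index.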
Assuming you have a robots.txt file, you can stop google from indexing any particular file by adding a rule to it. In your case, it would be something like this:
User-agent: *
Disallow: /path/to/PdfIdontWantIndexed.pdf
From there, all you have to do is make sure that you submit your robots.txt to Google, and it should stop indexing it shortly thereafter.
Note:
It may also be wise to remove your url from the existing Google index because this will be quicker in the case that it has already been crawled by Google.
Easiest way: Add robots.txt to root, and add this:
User-agent: *
Disallow: /*.pdf$
Note: if there are parameters appended to the URL (like ../doc.pdf?ref=foo) then this wildcard will not prevent crawling since the URL no longer ends with “.pdf”

Are robots.txt and meta tags enough to stop search engines from indexing dynamic pages that depend on $_GET variables?

I created a PHP page that is only accessible by means of a token/pass received through $_GET.
Therefore if you go to the following URL you'll get a generic or blank page:
http://fakepage11.com/secret_page.php
However if you use the link with the token it shows you special content:
http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4
Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and only accessible through the provided link.
Are dynamic pages that depend on $_GET variables indexed by Google and other search engines?
If so, will including the following be enough to hide them?
robots.txt:
User-agent: *
Disallow: /
Meta tag: <META NAME="ROBOTS" CONTENT="NOINDEX">
And will they stay hidden even if I type into Google: site:fakepage11.com/ ?
Thank you!
If a search engine bot finds the link with the token somehow¹, it may crawl and index it.
If you use robots.txt to disallow crawling the page, conforming search engine bots won’t crawl the page, but they may still index its URL (which then might appear in a site: search).
If you use meta-robots to disallow indexing the page, conforming search engine bots won’t index the page, but they may still crawl it.
You can’t have both: If you disallow crawling, conforming bots can never learn that you also disallow indexing, because they are not allowed to visit the page to see your meta-robots element.
¹ There are countless ways in which search engines might find a link. For example, a user that visits the page might use a browser toolbar that automatically sends all visited URLs to a search engine.
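To illustrate the practical upshot of the answer above (keep the page crawlable, but mark it non-indexable), here is a minimal sketch of what secret_page.php could look like; the token value is just the one from the question and the markup is an assumption:
<?php
// Ask conforming bots not to index this response, whether or not the token matches.
header('X-Robots-Tag: noindex');

// Compare tokens in constant time (hash_equals() exists since PHP 5.6).
$expected = '344ee833bde0d8fa008de206606769e4';
$token    = isset($_GET['token']) ? (string) $_GET['token'] : '';

if (!hash_equals($expected, $token)) {
    // Generic/blank page for visitors without the token.
    echo '<p>Nothing here.</p>';
    exit;
}
?>
<!DOCTYPE html>
<html>
<head><meta name="robots" content="noindex"></head>
<body><p>Special content for visitors who have the link.</p></body>
</html>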
If your page isn't discoverable then it will not be indexed.
By "discoverable" we mean:
it is a standard web page, i.e. index.*
it is referenced by another link, either on your own site or from another site
So in your case, by using the GET parameter for access, you take care of the first point but not necessarily the second, since someone may reference that link and hence the "hidden" page.
You can use the robots.txt that you gave, and in that case the page will not get indexed by a bot that respects it (not all will). Not indexing your page doesn't mean, of course, that the "hidden" page URL will not be out in the wild.
Furthermore, another issue - depending on your requirements - is that you use unencrypted HTTP; that means your "hidden" URLs and the content of the pages are visible to every server between your server and the user.
Apart from search engines, take care that certain services cache/resolve content when URLs are exchanged, for example in Skype or Facebook Messenger. In those cases they will visit the URL and try to extract metadata and maybe cache it if applicable. Of course this scenario does not expose your URL to the public, but it is exposed to the systems of those services, and with them the content that you have "hidden".
UPDATE:
Another issue to consider is exposing a "hidden" page by linking from it to another page. In that case, in the logs of the server that hosts the linked URL, your page will show up as a referrer and thus be visible; that extends to Google Analytics etc. as well. So if you want to remain stealthy, do not link to other pages from the hidden page.

Do the search engines index pages containing GET requests (php)?

I have some pages on the website which are hidden behind a GET request. For example, if you navigate to the page http://www.mypage.com/example.php you see one piece of content,
but if you navigate to http://www.mypage.com/example.php?name=12345 you get different content.
Do search engines see such pages? If yes, is it possible to hide them from search engines like Google?
Thanx in advance
I am sure there are no links to such a page anywhere on the internet, as I treat it as a "secret" page.
But even with that, can they crawl it?
I could be wrong, but when you don't have any hyperlink which refers to "?name=12345" they shouldn't find the page. But if there is a hyperlink on any page in the world, it may be possible.
There is a saying that security through obscurity is no security at all. If you have a page that you want to actually be secret or secure, you need to do something other than making sure it isn't indexed.
Search engines typically find pages by looking at links. If there isn't a link to the page, then it probably won't index it (unless it finds the page in some other way -- eg, like Bing did: http://thecolbertreport.cc.com/videos/ct2jwf/bing-gets-served). Note that whether you have a GET parameter (/index.php?param=12345) or not (/index.php) won't affect this. Search engine crawlers can find either of them just as easily.
If your concern is to stop search engines from indexing your site, you should use a robots.txt file. Check out http://www.robotstxt.org/robotstxt.html for some info on robots.txt files (the examples below come from that page). If you want to prevent search engines from indexing any page on your site, you can do something like:
User-agent: *
Disallow: /
If you want to disallow specific directories, you can do something like:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
If you want to disallow specific URLs, you can do something like:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

Hide uploaded files from search results?

A client running WordPress has requested the development of the following feature on their website.
They would like to include/exclude specific files (typically PDF) uploaded via the WordPress media uploader from search results.
I'm guessing this could be done somehow using a robots.txt file, but I have no idea where to start.
Any advice/ideas?
This is from Google Webmaster Developers site https://developers.google.com/webmasters/control-crawl-index/docs/faq
How long will it take for changes in my robots.txt file to affect my search results?
First, the cache of the robots.txt file must be refreshed (we generally cache the contents for up to one day). Even after finding the change, crawling and indexing is a complicated process that can sometimes take quite some time for individual URLs, so it's impossible to give an exact timeline. Also, keep in mind that even if your robots.txt file is disallowing access to a URL, that URL may remain visible in search results despite the fact that we can't crawl it. If you wish to expedite removal of the pages you've blocked from Google, please submit a removal request via Google Webmaster Tools.
And here are specifications for robots.txt from Google https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
If your file's syntax is correct, the best answer is just to wait until Google picks up your new robots.txt file.
I'm not certain how to do this within the confines of WordPress, but if you're looking to exclude particular file types, I would suggest using the X-Robots-Tag HTTP Header. It's particularly great for PDFs and non-HTML based file types where you would normally want to use a robots tag.
You can add the header for all specific FileType requests and then set a value of NOINDEX. This will prevent the PDFs from being included in the search results.
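For example, if the site runs on Apache with mod_headers enabled, a snippet along these lines in the vhost configuration or .htaccess would attach the header to every PDF; this is a sketch, not something WordPress sets up for you:
# Ask conforming crawlers not to index any PDF served from this site.
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>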
You can use the robots.txt file if the URLs end with the filetype or something that is unique to the file type. Example: Disallow: /*.pdf$ ... but I know that's not always the case with URLs.
https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag

How to protect a site from (google) caching?

I would like to hide some content from public (like google cached pages). Is it possible?
Add the following HTML tag in the <head> section of your web pages to prevent Google from showing the Cached link for a page.
<META NAME="ROBOTS" CONTENT="noarchive">
Check out Google webmaster central | Meta tags to see what other meta tags Google understands.
Option 1: Disable 'Show Cached Site' Link In Google Search Results
If you want to prevent Google from archiving your site, add the following meta tag to your <head> section:
<meta name="robots" content="noarchive">
If your site is already cached by Google, you can request its removal using Google's URL removal tool. For more instructions on how to use this tool, see "Remove a page or site from Google's search results" at Google Webmaster Central.
Option 2: Remove Site From Google Index Completely
Warning! The following method will remove your site from Google's index completely. Use it only if you don't want your site to show up in Google results.
To prevent ("protect") your site from getting into Google's cache, you can use robots.txt. For instructions on how to use this file, see "Block or remove pages using a robots.txt file".
In principle, you need to create a file named robots.txt and serve it from your site's root folder (/robots.txt). Sample file content:
User-agent: *
Disallow: /folder1/
User-Agent: Googlebot
Disallow: /folder2/
In addition, consider setting robots meta tag in your HTML document to noindex ("Using meta tags to block access to your site"):
To prevent all robots from indexing your site, set <meta name="robots" content="noindex">
To selectively block only Google, set <meta name="googlebot" content="noindex">
Finally, make sure that your settings really work, for instance with Google Webmaster Tools.
robots.txt: http://www.robotstxt.org/
You can use a robots.txt file to request that your page is not indexed. Google and other reputable services will adhere to this, but not all do.
The only way to make sure that your site content isn't indexed or cached by any search engine or similar service is to prevent access to the site unless the user has a password.
This is most easily achieved using HTTP Basic Auth. If you're using the Apache web server, there are lots of tutorials (example) on how to configure this. A good search term to use is htpasswd.
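If you would rather enforce the password check in PHP itself instead of the server configuration, a minimal Basic Auth sketch looks like this (the username and password are placeholders; note that PHP_AUTH_USER/PHP_AUTH_PW may not be populated under some CGI/FastCGI setups without extra configuration):
<?php
// Refuse to render the page until valid credentials are supplied.
$user = isset($_SERVER['PHP_AUTH_USER']) ? $_SERVER['PHP_AUTH_USER'] : '';
$pass = isset($_SERVER['PHP_AUTH_PW'])   ? $_SERVER['PHP_AUTH_PW']   : '';

if ($user !== 'editor' || !hash_equals('change-me', $pass)) {
    header('WWW-Authenticate: Basic realm="Private area"');
    header('HTTP/1.0 401 Unauthorized');
    exit('Authentication required.');
}

// Content below this point is never served to unauthenticated visitors,
// so search engines cannot crawl or cache it.
echo '<p>Secret page.</p>';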
A simple way to do this would be with a <meta name="robots" content="noarchive"/>
You can also achieve a similar effect with the robots.txt file.
For a good explanation, see the official Google blog post on the robots exclusion protocol.
I would like to hide some content from public....
Use a login system to view the content.
...(like google cached pages).
Configure robots.txt to deny Google bot.
If you want to limit who can see content, secure it behind some form of authentication mechanism (e.g. password protection, even if it is just HTTP Basic Auth).
The specifics of how to implement that would depend on the options provided by your server.
You can also add this HTTP Header on your response, instead of needing to update the html files:
X-Robots-Tag: noarchive
eg for Apache:
Header set X-Robots-Tag "noarchive"
See also: https://developers.google.com/search/reference/robots_meta_tag?csw=1
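And if the page is generated by a PHP script rather than served statically, the same header can be sent from the script itself (a one-line sketch using PHP's header() function):
<?php
// Send the noarchive hint with this script's response; must run before any output.
header('X-Robots-Tag: noarchive');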
