I have a website with links to a PHP script that generates a PDF with the mPDF library; depending on the configuration, the PDF is displayed in the browser or downloaded.
The problem is that I do not want it to be indexed by Google. I have already put rel="nofollow" on the links, so new PDFs are no longer indexed, but how can I deindex the ones that are already in the index?
rel="noindex, nofollow" does not work.
Ideally I would do this with PHP alone, or with some HTML tag.
How is Google supposed to deindex something if you prevent its robot from accessing the resource? ;) This may seem counter-intuitive at first.
Remove the rel="nofollow" from the links, and in the script that serves the PDF files, include an X-Robots-Tag: none header. Google will be able to reach the resource, see that it is forbidden to index it, and remove the record from the index.
Once deindexing is done, add a Disallow rule to the robots.txt file as #mtr.web mentions, so robots won't drain your server anymore.
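A minimal sketch of the header part of the serving script might look like the following (the helper name is made up, and the mPDF output call is shown commented because the exact API depends on your mPDF version):

```php
<?php
// Hypothetical sketch: headers to send before streaming the PDF.
// 'X-Robots-Tag: none' is equivalent to 'noindex, nofollow'.
function robots_headers(): array
{
    return [
        'X-Robots-Tag: none',
        'Content-Type: application/pdf',
    ];
}

// In the PDF-serving script:
// foreach (robots_headers() as $h) { header($h); }
// $mpdf->Output('document.pdf', 'I'); // 'I' = send inline to the browser
```

The key point is that the X-Robots-Tag header must go out with the PDF response itself, since a PDF has no HTML where a robots meta tag could live.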
Assuming you have a robots.txt file, you can stop google from indexing any particular file by adding a rule to it. In your case, it would be something like this:
User-agent: *
Disallow: /path/to/PdfIdontWantIndexed.pdf
From there, all you have to do is make sure that you submit your robots.txt to Google, and it should stop indexing it shortly thereafter.
Note:
It may also be wise to request removal of your URL from the existing Google index, because this will be quicker in the case that the file has already been crawled by Google.
Easiest way: Add robots.txt to root, and add this:
User-agent: *
Disallow: /*.pdf$
Note: if there are parameters appended to the URL (like ../doc.pdf?ref=foo) then this wildcard will not prevent crawling since the URL no longer ends with “.pdf”
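One way around that, per Google's wildcard support in robots.txt, is to drop the trailing $ so the rule matches any URL containing ".pdf":

```
User-agent: *
Disallow: /*.pdf
```

Note that this broader pattern also matches ".pdf" anywhere in the path, so check that it does not accidentally cover pages you want indexed.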
Related
A client running WordPress has requested the development of the following feature on their website.
They would like to include/exclude specific files (typically PDF) uploaded via the WordPress media uploader from search results.
I'm guessing this could be done somehow using a robots.txt file, but I have no idea where to start.
Any advice/ideas?
This is from Google Webmaster Developers site https://developers.google.com/webmasters/control-crawl-index/docs/faq
How long will it take for changes in my robots.txt file to affect my search results?
First, the cache of the robots.txt file must be refreshed (we generally cache the contents for up to one day). Even after finding the change, crawling and indexing is a complicated process that can sometimes take quite some time for individual URLs, so it's impossible to give an exact timeline. Also, keep in mind that even if your robots.txt file is disallowing access to a URL, that URL may remain visible in search results despite the fact that we can't crawl it. If you wish to expedite removal of the pages you've blocked from Google, please submit a removal request via Google Webmaster Tools.
And here are specifications for robots.txt from Google https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
If your file's syntax is correct the best answer is just wait till Google updates your new robots file.
I'm not certain how to do this within the confines of WordPress, but if you're looking to exclude particular file types, I would suggest using the X-Robots-Tag HTTP Header. It's particularly great for PDFs and non-HTML based file types where you would normally want to use a robots tag.
You can add the header to all responses for a specific file type and set its value to NOINDEX. This will prevent the PDFs from being included in the search results.
You can use the robots.txt file if the URLs end with the filetype or something that is unique to the file type. Example: Disallow: /*.pdf$ ... but I know that's not always the case with URLs.
https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
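If the PDFs are served directly by Apache (rather than through a WordPress script), a sketch of the X-Robots-Tag approach in .htaccess might look like this, assuming mod_headers is enabled:

```apache
# Send "noindex" with every PDF response (mod_headers assumed)
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```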
I run a price comparison website.
When a visitor clicks on an offer link, I get $1 from the shop.
The problem is: crawlers crawl the whole website, so they "click on the links".
How can I prevent them from clicking? JavaScript is a bad solution.
Thank you!
I've been thinking about this the wrong way.
I agree with everything that #yttriuszzerbus says above: add a robots.txt file, add "rel=nofollow" to links, and block the user agents that you know about.
So if you've got someone who's now trying to click on a link, it's either a live person, or a badly behaved bot that you don't want clicking.
So how about doing something strange to create the links to the shop sites? Normally, you'd never, ever do this, as it makes your site impossible to index. But that's not an issue - all the well-behaved bots won't be indexing those links because they'll be obeying the robots.txt file.
I'm thinking of something like not using an <a href="..."> tag at all: instead, generate the link text, add underlining via a stylesheet so it looks like a link to a normal user, and attach a JavaScript onclick handler that redirects the user when they click on it. Bots won't see it as a link, and users won't notice a thing.
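A sketch of that idea (the class name, URL, and styling are made up for illustration):

```html
<style>
  /* Make the span look like an ordinary link */
  .pseudo-link { text-decoration: underline; color: #00e; cursor: pointer; }
</style>

<!-- No <a href>, so crawlers that parse HTML for links won't follow it -->
<span class="pseudo-link"
      onclick="window.location.href='https://shop.example.com/offer/123'">
  Go to offer
</span>
```

The trade-off is accessibility: keyboard users and screen readers also won't see it as a link, so this should only be used where that is acceptable.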
You could:
Use "rel=nofollow" to instruct crawlers not to follow your links.
Block certain user-agent strings
Use robots.txt to exclude crawlers from parts of your site.
Unfortunately, none of the above will exclude badly-behaved crawlers. The only solution to actually prevent crawlers is some kind of JavaScript link or a CAPTCHA.
I had a similar project.
My problem was solved only by blocking certain user-agent strings.
Another problem is that I don't know every "bad" user-agent, so when a new crawler enters the site, I add it to the blacklist and retroactively remove its visits from the statistics.
"rel=nofollow" and robots.txt did not work at all for me.
In building a site in PHP, I have found that the URL is capable of having extra info that doesn't belong, i.e.
http://www.mydomain.com/index.php/extrainformation
I've read about it being part of $_SERVER['PATH_INFO'], but I need to find a way to stop this information from being displayed, as it is showing up in Google search results. Is this something I can prevent by adding a condition in my .htaccess file?
Any insight?
That information is technically part of a valid URL, even if your web page ignores it. So if a search engine like Google finds a URL containing that extra information, probably through a link, and it pulls up a valid web page, Google will display it in its results.
You can solve this a few ways:
Use canonical URLs to specify the proper URL without the extra information
Do a 301 redirect to the URL without the garbage information if it is appended to a URL
Return an error (HTTP 40x) that the URL is invalid
All three will prevent Google from indexing pages with those kinds of URLs.
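A minimal sketch of the 301-redirect option in PHP (the helper name is made up; it returns the clean URL to redirect to, or null when no redirect is needed):

```php
<?php
// Hypothetical helper: if extra PATH_INFO is present, return the
// canonical URL (the script itself) for a 301 redirect; otherwise null.
function canonical_target(string $scriptName, string $pathInfo): ?string
{
    return $pathInfo !== '' ? $scriptName : null;
}

// Usage at the top of index.php:
// $target = canonical_target($_SERVER['SCRIPT_NAME'], $_SERVER['PATH_INFO'] ?? '');
// if ($target !== null) {
//     header('Location: ' . $target, true, 301);
//     exit;
// }
```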
Those look like Apache's multiviews. Add this to your htaccess file:
Options -MultiViews
I have a simple question. Let's say that I have this in robots.txt:
User-agent: *
Disallow: /
And something like this in .htaccess:
RewriteRule ^somepage/.*$ index.php?section=ubberpage&parameter=$0
And of course in index.php something like:
$imbaVar = $_GET['section'];
// ... some splitting, whatever, to get a specific page
include("pages/theImbaPage.html"); // or .php, or whatever
Will robots be able to see what's in the HTML included by the script (site.com/somepage)? I mean, the URL points to a disallowed location (/somepage is disallowed by robots.txt), but it is rewritten server-side to a valid script (index.php).
Assuming the robots respect robots.txt, they wouldn't be able to see any page on the site at all (you stated you used Disallow: /).
If a robot does not respect your robots.txt file, however, it will be able to see the content, since the rewrite is performed server-side.
No. By disallowing robot access, robots aren't allowed to browse any pages on your site, as long as they follow your rules.
If you go to the WordPress admin and then Settings -> Privacy, there are two options asking whether you want to allow your blog to be searched through by search engines, including this option:
I would like to block search engines,
but allow normal visitors
How does WordPress actually block search bots/crawlers from searching through the site when it is live?
According to the codex, it's just robots meta tags, robots.txt and suppression of pingbacks:
Causes <meta name='robots' content='noindex,nofollow' /> to be generated into the <head> section (if wp_head is used) of your site's source, causing search engine spiders to ignore your site.
Causes hits to robots.txt to send back:
User-agent: *
Disallow: /
Note: The above only works if WordPress is installed in the site root and no robots.txt exists.
These are "guidelines" that all friendly bots will follow. A malicious spider searching for E-Mail addresses or forms to spam into will not be affected by these settings.
With a robots.txt (if WordPress is installed at the site root):
User-agent: *
Disallow: /
or (from here):
I would like to block search engines, but allow normal visitors - check this for these results:
* Causes "<meta name='robots' content='noindex,nofollow' />" to be generated into the <head> section (if wp_head is used) of your site's source, causing search engine spiders to ignore your site.
* Causes hits to robots.txt to send back:
User-agent: *
Disallow: /
Note: The above only works if WordPress is installed in the site root and no robots.txt exists.
* Stops pings to ping-o-matic and any other RPC ping services specified in the Update Services of Administration > Settings > Writing. This works by having the function privacy_ping_filter() remove the sites to ping from the list. This filter is added by having add_filter('option_ping_sites','privacy_ping_filter'); in the default-filters. When the generic_ping function attempts to get the "ping_sites" option, this filter blocks it from returning anything.
* Hides the Update Services option entirely on the Administration > Settings > Writing panel, with the message "WordPress is not notifying any Update Services because of your blog's privacy settings."
You can't actually block bots and crawlers from searching through a publicly available site; if a person with a browser can see it, then a bot or crawler can see it (caveat below).
However, there is something called the Robots Exclusion Standard (or robots.txt standard), which allows you to indicate to well-behaved bots and crawlers that they shouldn't index your site. This site, as well as Wikipedia, provide more information.
The caveat to the above comment that what you see on your browser, a bot can see, is this: most simple bots do not include a Javascript engine, so anything that the browser renders as a result of Javascript code will probably not be seen by a bot. I would suggest that you don't use this as a way to avoid indexing, since the robots.txt standard does not rely on the presence of Javascript to ensure correct rendering of your page.
One last comment: bots are free to ignore this standard. Such bots are badly behaved. The bottom line is that anything that can read your HTML can do what it likes with it.
I don't know for sure but it probably generates a robots.txt file which specifies rules for search engines.
Using a Robots Exclusion file.
Example:
User-agent: Googlebot
Disallow: /private/