I have a simple question. Let's say that I have this in robots.txt:
User-agent: *
Disallow: /
And something like this in .htaccess:
RewriteRule ^somepage/.*$ index.php?section=ubberpage&parameter=$0
And of course in index.php something like:
$imbaVar = $_GET['section'];
// Some splitting, some whatever, to get a specific page
include("pages/theImbaPage.html"); // Or .php or whatever
Will the robots be able to see what's in the HTML included by the script (site.com/somepage)? I mean, the URL points to an inaccessible place (/somepage is disallowed), but it is still rewritten to a valid place (index.php).
Assuming the robots respect robots.txt, they won't be able to see any page on the site at all (you stated you used Disallow: /).
If, however, the robots do not respect your robots.txt file, they will be able to see the content, since the rewrite to index.php is made server-side.
No. By disallowing robot access you've told robots not to crawl any page on your site, and as long as they follow your rules they won't see that content.
Related
I have a website with links to a PHP script that generates a PDF with the mPdf library; it is displayed in the browser or downloaded, depending on the configuration.
The problem is that I do not want it to be indexed by Google. I've already added rel="nofollow" to the links, so new PDFs are no longer indexed, but how can I deindex the ones that are already there?
Using rel="noindex, nofollow" does not work.
It would have to be done with PHP only, or with some HTML tag.
How is Google supposed to deindex something if you prevent its robot from accessing the resource? ;) This may seem counter-intuitive at first.
Remove the rel="nofollow" from the links, and in the script which serves the PDF files include an X-Robots-Tag: none header. Google will still be able to fetch the resource, will see that it is forbidden to index this particular resource, and will remove the record from the index.
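A minimal sketch of what the PDF-serving script might look like, assuming mPDF 7+ with Composer autoloading (the document content and file name are just placeholders):

<?php
// Hypothetical PDF-serving script: send an X-Robots-Tag header so Google
// can still fetch the file but is told not to index or follow it.
require __DIR__ . '/vendor/autoload.php';

header('X-Robots-Tag: none'); // equivalent to "noindex, nofollow"

$mpdf = new \Mpdf\Mpdf();
$mpdf->WriteHTML('<h1>Example document</h1>'); // placeholder content
$mpdf->Output('document.pdf', \Mpdf\Output\Destination::INLINE); // display in browser
?>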
When deindexing is done, add the Disallow rule to the robots.txt file as #mtr.web mentions so robots won't drain your server anymore.
Assuming you have a robots.txt file, you can stop google from indexing any particular file by adding a rule to it. In your case, it would be something like this:
User-agent: *
Disallow: /path/to/PdfIdontWantIndexed.pdf
From there, all you have to do is make sure that you submit your robots.txt to Google, and it should stop indexing it shortly thereafter.
Note:
It may also be wise to remove your URL from the existing Google index, because this will be quicker in the case that it has already been crawled by Google.
Easiest way: Add robots.txt to root, and add this:
User-agent: *
Disallow: /*.pdf$
Note: if there are parameters appended to the URL (like ../doc.pdf?ref=foo) then this wildcard will not prevent crawling since the URL no longer ends with “.pdf”
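If query strings are a concern, one option (based on Google's documented wildcard matching, where every rule is a prefix match) is to drop the $ anchor so the rule also matches URLs such as ../doc.pdf?ref=foo:

User-agent: *
Disallow: /*.pdf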
I have some pages on the website which are hidden behind a GET parameter. For example, if you navigate to the page http://www.mypage.com/example.php you see one piece of content,
but if you navigate to http://www.mypage.com/example.php?name=12345 you get different content.
Do search engines see such pages? If yes, is it possible to hide them from search engines like Google?
Thanks in advance.
I am sure there are no links to such a page anywhere on the internet, as I treat it as a "secret" page.
But even so, can they crawl it?
I could be wrong, but if you don't have any hyperlink which refers to "?name=12345", they shouldn't find the page. If there is such a hyperlink on any page anywhere, though, it may be possible.
There is a saying that security through obscurity is no security at all. If you have a page that you want to actually be secret or secure, you need to do something other than making sure it isn't indexed.
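For illustration only, here is a minimal sketch of the kind of check that actually protects such a page; the token name and value are hypothetical, and a real site should use a proper login/session mechanism rather than a hard-coded secret:

<?php
// example.php - hypothetical access check for the "secret" content.
$suppliedToken = $_GET['token'] ?? '';

if (!hash_equals('replace-with-a-long-random-secret', $suppliedToken)) {
    http_response_code(403); // refuse the request outright
    exit('Forbidden');
}

// ... render the protected content (e.g. for name=12345) here ...
?>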
Search engines typically find pages by looking at links. If there isn't a link to the page, then it probably won't be indexed (unless the search engine finds the page in some other way -- e.g., like Bing did: http://thecolbertreport.cc.com/videos/ct2jwf/bing-gets-served). Note that whether you have a GET parameter (/index.php?param=12345) or not (/index.php) won't affect this. Search engine crawlers can find either of them just as easily.
If your concern is to stop search engines from indexing your site, you should use a robots.txt file. Check out http://www.robotstxt.org/robotstxt.html for some info on robots.txt files (the examples below come from that page). If you want to prevent search engines from indexing any page on your site, you can do something like:
User-agent: *
Disallow: /
If you want to disallow specific directories, you can do something like:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
If you want to disallow specific URLs, you can do something like:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
I’ve just finished a website for a client who is going to replace their old (very old, hard-coded HTML) website. The problem is that they (for now) want to keep their old website and all its files on the webserver in their original location. This does not create any issues with the new website, which is made in PHP and WordPress, but it becomes a big problem when Google (and others) drop by with their search robots and index the old files.
A Google search still finds the old HTML files. Is there any way that I could keep the old HTML files on the web server but make sure that, first, no robots are going to index them and, second, anyone trying to navigate to an HTML page, e.g. http://www.clientdomain.com/old_index_file.html, gets redirected? I think the last part might be doable in .htaccess, but I haven’t found anything useful searching for it.
For the first question, about not allowing robots and agents to index the HTML files, I’ve tried putting these two lines in my robots.txt file:
Disallow: /*.html$
Disallow: /*.htm$
But I’m unsure whether that will work.
I might be dealing with this in a completely wrong way, but I’ve never before had a client request to keep the old website on the same server and in its original location.
Thanks,
- Mestika
<?php
$redirectlink = 'http://www.puttheredirectedwebpageurlhere.com';
// Do not edit below here
header('HTTP/1.1 301 Moved Permanently');
header('Location: ' . $redirectlink);
exit;
?>
This code will 301-redirect the page to the URL that you desire. The filename of this .php file should be the URL slug of the page you want to redirect.
301 Redirect
A 301 redirect, also known as a permanent redirect, should be put in place to permanently redirect a page. The word ‘permanent’ is there to imply that ALL qualities of the redirected page will be passed on to the detour page.
That includes:
PageRank
MozRank
Page Authority
Traffic Value
A 301 redirect is implemented if the change you want to make is, well… permanent. The detour page now embodies the redirected page as if it was the former. A complete takeover.
The old page will be removed from Google’s index and the new one will replace it.
Or you can do it in your .htaccess, as shown by the poster above.
There are probably a lot of ways to handle this. Assuming you have a clear mapping of pages from the old template to the new one, you could detect the Google bot in your old template (see [1]) and do a 301 redirect (see [2] for an example) to the new template; a rough sketch follows the links below.
[1] how to detect search engine bots with php?
[2] How to implement 303 redirect?
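A rough sketch of that idea, assuming the old pages can run PHP; the bot pattern and the old-to-new URL map are placeholders you would need to fill in:

<?php
// Hypothetical snippet for the top of an old template page: if the visitor
// looks like a search-engine bot, send a permanent redirect to the
// corresponding page on the new WordPress site.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
$isBot = (bool) preg_match('/googlebot|bingbot|slurp/i', $userAgent);

// Placeholder mapping: old file => new URL
$map = [
    '/old_index_file.html' => 'http://www.clientdomain.com/',
];

$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if ($isBot && isset($map[$path])) {
    header('Location: ' . $map[$path], true, 301);
    exit;
}
?>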
It will take some work, but it sounds like you'll need to crack open your .htaccess file and start adding 301 redirects from the old content to the new.
RewriteCond %{REQUEST_URI} ^/oldpage.html
RewriteRule . http://www.domainname.com/pathto/newcontentinwp/ [R=301,L]
Rinse and repeat
This is definitely something mod_rewrite can help with. Converting your posted robots.txt to a simple rewrite:
RewriteEngine on
RewriteRule /.*\.html /index\.php [R]
The [R] flag signifies an explicit redirect. I would recommend seeing http://httpd.apache.org/docs/2.4/rewrite/remapping.html for more information. You can also forbid direct access with the [F] flag.
In WordPress it is possible to have URLs such as sitename.com/post-title although no directory named post-title exists.
I guess that index.php (or maybe the 404 page) must be handling such requests.
Can anyone please explain the exact trick behind this?
Read about mod_rewrite basics on this page. It's an Apache module that allows you to parse a request and rewrite it to a "real" page without changing the browser address.
Let's say you have this:
RewriteRule ^post/([0-9]+)?/?([0-9]+)?/?$ /index.php?p=$1&page=$2 [QSA]
With such a rewrite rule, a request for an address like:
www.mysite.com/post/10/2
Will be redirected to:
www.mysite.com/index.php?p=10&page=2
And this is better from the point of view of both aesthetics and Search Engine Optimization.
To be specific, in WordPress every request is redirected to the index.php page, which contains an internal "parser" for the request and loads the content depending on the parsing result.
Namaste!
I guess that index.php must be handling such requests.
Exactly.
They are redirecting all requests to index.php through .htaccess rules. If you're thinking about adopting their solution in your own application, don't; today there are better implementations of this technique. Here's an example.
How it works:
mod_rewrite takes requests that match a certain pattern and can redirect them to other locations. In the case of WordPress, all requests (*) are accepted and transferred to the index.php page, which has a standard router that splits the URL (i.e. $_SERVER["REQUEST_URI"]), matches it against the stored URL configuration, and loads the correct component (category, post, page, etc.).
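A stripped-down sketch of that front-controller idea (this is not WordPress's actual code; the routing table and file names are placeholders):

<?php
// index.php - minimal front controller: every request is rewritten here,
// the path is split into segments, and the matching component is loaded.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$segments = array_values(array_filter(explode('/', $path)));

// Placeholder routing table: first segment => handler file
$routes = [
    ''         => 'pages/home.php',
    'post'     => 'pages/post.php',     // e.g. /post/10/2
    'category' => 'pages/category.php',
];

$key = $segments[0] ?? '';

if (isset($routes[$key])) {
    include $routes[$key]; // the handler can inspect the remaining $segments
} else {
    http_response_code(404);
    include 'pages/404.php';
}
?>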
I'd like to redirect the user depending on the destination URL and the referrer URL. Say I have a homepage at http://www.example.com/, and on that page there are a bunch of links that point to http://www.example.com/page/x/. When the user goes to http://www.example.com/page/ from http://www.example.com/, it should redirect to another page. But when the user goes to http://www.example.com/page/x/ via a link from http://www.example.com/, it should not redirect. To achieve this, the solution I am thinking of is to get the destination URL as well, to correctly determine whether the user comes from http://www.example.com/ but wants to view http://www.example.com/page/x/. The bottom line is that I want to prevent access to http://www.example.com/page/ but not to its sub-pages.
What you are trying to do here is scary bad.
You can't rely on the referer being returned by the browser (though it is a good indicator). You could use a generic JavaScript to rewrite every link on the page to append a CGI variable containing the URI-encoded URL of the current page (but where JavaScript is disabled it won't work). Or you could rewrite the output buffer to inject CGI vars into hrefs in PHP. Neither of these is trivial, and if they break, your users will not be able to navigate.
But leaving aside the implementation for now - your solution seems to be rather absurd.
If the problem is to
prevent access to http://www.example.com/page/
but allow requests for
http://www.example.com/page/x/
Then create an index.php in http://www.example.com/page/ with something like....
<?php
header('Location: /', true, 301);
?>
Or disable auto-index on your webserver.
Since it seems no one should be accessing example.com/page/ directly, you can use a header() call in http://www.example.com/page/ to redirect it to some other page.