I have some pages on my website that are hidden behind a GET parameter: for example, if you navigate to http://www.mypage.com/example.php you see one kind of content,
but if you navigate to http://www.mypage.com/example.php?name=12345 you get other content.
Do search engines see such pages? If yes, is it possible to hide them from search engines like Google?
Thanks in advance
I am sure there are no links to such a page anywhere on the internet, as I treat it as a "secret" page.
But even with that, can they crawl it?
I could be wrong, but when you don't have any hyperlink which refers to "?name=12345", they shouldn't find the page. But if there is a hyperlink on any page in the world, it may be possible.
There is a saying that security through obscurity is no security at all. If you have a page that you want to actually be secret or secure, you need to do something other than making sure it isn't indexed.
Search engines typically find pages by following links. If there isn't a link to the page, then they probably won't index it (unless they find the page in some other way -- e.g., like Bing did: http://thecolbertreport.cc.com/videos/ct2jwf/bing-gets-served). Note that whether you have a GET parameter (/index.php?param=12345) or not (/index.php) won't affect this. Search engine crawlers can find either of them just as easily.
If your concern is to stop search engines from indexing your site, you should use a robots.txt file. Check out http://www.robotstxt.org/robotstxt.html for some info on robots.txt files (the examples below come from that page). If you want to prevent search engines from indexing any page on your site, you can do something like:
User-agent: *
Disallow: /
If you want to disallow specific directories, you can do something like:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
If you want to disallow specific URLs, you can do something like:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
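If a page must actually stay secret, the more robust approach is a real server-side check instead of an unlinked URL, as the first paragraph of this answer suggests. Here is a minimal PHP sketch of that idea; the token constant and its value are hypothetical placeholders, not anything from the original setup:

<?php
// example.php - gate the "secret" content behind a real check instead of
// relying on the URL staying unknown.
const SECRET_TOKEN = 'replace-with-a-long-random-value'; // hypothetical value

$supplied = (string) ($_GET['name'] ?? '');

// hash_equals() compares strings in constant time, avoiding timing leaks.
if (!hash_equals(SECRET_TOKEN, $supplied)) {
    http_response_code(404); // behave as if the page does not exist
    exit('Not found');
}

header('X-Robots-Tag: noindex'); // additionally ask crawlers not to index this
echo 'Hidden content for visitors who know the token';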
Related
I have a website with links to a PHP script that generates a PDF with the mPDF library; the PDF is displayed in the browser or downloaded, depending on the configuration.
The problem is that I do not want it to be indexed by Google. I've already put rel="nofollow" on the links, so new PDFs are no longer indexed, but how can I deindex the ones that are already there?
Using rel="noindex, nofollow" does not work.
It would have to be done only with PHP or some HTML tag.
How is Google supposed to deindex something if you prevent its robot from accessing the resource? ;) This may seem counter-intuitive at first.
Remove the rel="nofollow" on the links and, in the script which serves the PDF files, include an X-Robots-Tag: none header. Google will be able to access the resource, see that it is forbidden to index this particular resource, and remove the record from the index.
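For example, if the PDFs come from an mPDF script, the header just has to be sent before the PDF bytes. A sketch, assuming mPDF 7+ installed via Composer (the file names here are illustrative):

<?php
// pdf.php - serve the generated PDF with a deindexing header.
require __DIR__ . '/vendor/autoload.php';

// Headers must be sent before any output, including the PDF itself.
header('X-Robots-Tag: none'); // "none" is equivalent to "noindex, nofollow"

$mpdf = new \Mpdf\Mpdf();
$mpdf->WriteHTML('<h1>Invoice</h1>');
// INLINE shows the PDF in the browser; DOWNLOAD forces a download instead.
$mpdf->Output('invoice.pdf', \Mpdf\Output\Destination::INLINE);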
When deindexing is done, add the Disallow rule to the robots.txt file as @mtr.web mentions, so robots won't drain your server anymore.
Assuming you have a robots.txt file, you can stop google from indexing any particular file by adding a rule to it. In your case, it would be something like this:
User-agent: *
Disallow: /path/to/PdfIdontWantIndexed.pdf
From there, all you have to do is make sure that you submit your robots.txt to Google, and it should stop indexing it shortly thereafter.
Note:
It may also be wise to remove your URL from the existing Google index, because this will be quicker if it has already been crawled by Google.
Easiest way: Add robots.txt to root, and add this:
User-agent: *
Disallow: /*.pdf$
Note: if there are parameters appended to the URL (like ../doc.pdf?ref=foo) then this wildcard will not prevent crawling since the URL no longer ends with “.pdf”
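If query strings are a concern, one workaround (relying on Google's documented wildcard matching, which tests the pattern against the path plus query string) is to drop the trailing $ so the rule matches ".pdf" anywhere in the URL:

User-agent: *
Disallow: /*.pdf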
I have a redirect URL for external links (an example would be a "click" link), and it does a great job. However, doing a site search on Google like 'site:www.mysite.com' shows my redirect links, and clicking on them takes me directly to wherever they redirect.
Is there any way to stop google from using those specific links in their results?
Any help would be much appreciated!
You could create a robots.txt file and place these 2 lines of text in it:
User-agent: *
Disallow: /redirect
See the explanation here,
http://www.robotstxt.org/robotstxt.html
I created a PHP page that is only accessible by means of a token/pass received through $_GET.
Therefore, if you go to the following URL you'll get a generic or blank page:
http://fakepage11.com/secret_page.php
However, if you use the link with the token, it shows you special content:
http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4
Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and only accessed through the provided link.
Are dynamic pages that depend on $_GET variables indexed by Google and other search engines?
If so, will including the following be enough to hide them?
Robots.txt:
User-agent: *
Disallow: /
metadata: <META NAME="ROBOTS" CONTENT="NOINDEX">
Will they stay hidden even if I type into Google:
site:fakepage11.com/
Thank you!
If a search engine bot finds the link with the token somehow¹, it may crawl and index it.
If you use robots.txt to disallow crawling the page, conforming search engine bots won’t crawl the page, but they may still index its URL (which then might appear in a site: search).
If you use meta-robots to disallow indexing the page, conforming search engine bots won’t index the page, but they may still crawl it.
You can’t have both: If you disallow crawling, conforming bots can never learn that you also disallow indexing, because they are not allowed to visit the page to see your meta-robots element.
¹ There are countless ways how search engines might find a link. For example, a user that visits the page might use a browser toolbar that automatically sends all visited URLs to a search engine.
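To make the trade-off above concrete: for the token page, leave the URL crawlable (no robots.txt Disallow) and send the noindex signal from the page itself. A minimal PHP sketch, using the example token from the question and simplified token handling:

<?php
// secret_page.php - allow crawling, but forbid indexing.
$validToken = '344ee833bde0d8fa008de206606769e4';

if (!hash_equals($validToken, (string) ($_GET['token'] ?? ''))) {
    exit('Nothing to see here.'); // generic page without the token
}

// Sent as a header so it also covers non-HTML responses.
header('X-Robots-Tag: noindex');
?>
<!DOCTYPE html>
<html>
<head>
  <meta name="robots" content="noindex">
  <title>Special content</title>
</head>
<body>Special content, reachable only with the link.</body>
</html>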
If your page isn't discoverable then it will not be indexed.
by "discoverable" we mean:
it is a standard web page, i.e. index.*
it is referenced by another link either yours or from another site
So in your case, by using the GET parameter for access, you achieve 1 but not necessarily 2, since someone may reference that link and hence the "hidden" page.
You can use the robots.txt that you gave, and in that case the page will not get indexed by a bot that respects it (not all will). Of course, not indexing your page doesn't mean that the "hidden" page's URL will not be in the wild.
Furthermore, another issue - depending on your requirements - is that you use unencrypted HTTP: that means your "hidden" URLs and the content of the pages are visible to every server between your server and the user.
Apart from search engines, take care that certain services cache or resolve content when URLs are exchanged, for example in Skype or Facebook Messenger. In those cases they will visit the URL, try to extract metadata, and maybe cache it if applicable. Of course this scenario does not expose your URL to the public, but it is exposed to the systems of those services, and with them the content that you have "hidden".
UPDATE:
Another issue to consider is exposing a "hidden" page by linking from it to another page. In that case, your page will show up as a referrer in the logs of the server hosting the linked URL, and thus be visible; the same goes for Google Analytics and the like. So if you want to remain stealthy, do not link to other pages from the hidden page.
I have a simple question. Let's say that I have this in robots.txt:
User-agent: *
Disallow: /
And something like this in .htaccess:
RewriteRule ^somepage/.*$ index.php?section=ubberpage&parameter=$0
And of course in index.php something like:
$imbaVar = $_GET['section'];
// ... some splits, some whatever, to get a specific page ...
include("pages/theImbaPage.html"); // or .php or whatever
Will the robots be able to see what's in that HTML included by the script (site.com/somepage)? I mean, the URL points to an inaccessible place (the /somepage path is disallowed), but it is still rewritten server-side to a valid place (index.php).
Assuming the robots respect the robots.txt, they wouldn't be able to see any page on the site at all (you stated you used Disallow: /).
If the robots do not respect your robots.txt file, however, they would be able to see the content, since the rewrite happens server-side.
No. By disallowing robot access, robots that follow your rules aren't allowed to browse any pages on your site.
If you go to the WordPress admin and then Settings -> Privacy, there are two options asking whether you want to allow your blog to be searched through by search engines, one of them being this option:
I would like to block search engines,
but allow normal visitors
How does WordPress actually block search bots/crawlers from searching through the site when it is live?
According to the codex, it's just robots meta tags, robots.txt and suppression of pingbacks:
Causes <meta name='robots' content='noindex,nofollow' /> to be generated into the <head> section (if wp_head is used) of your site's source, causing search engine spiders to ignore your site.
Causes hits to robots.txt to send back:
User-agent: *
Disallow: /
Note: The above only works if WordPress is installed in the site root and no robots.txt exists.
These are "guidelines" that all friendly bots will follow. A malicious spider searching for E-Mail addresses or forms to spam into will not be affected by these settings.
With a robots.txt (if installed as root)
User-agent: *
Disallow: /
or (from here)
I would like to block search engines, but allow normal visitors -
check this for these results:
Causes "<meta name='robots' content='noindex,nofollow' />"
to be
generated into the
section (if wp_head is used) of your
site's source, causing search engine
spiders to ignore your site.
* Causes hits to robots.txt to send back:
User-agent: *
Disallow: /
Note: The above only works if WordPress is installed in the site root and no robots.txt exists.
Stops pings to ping-o-matic and any other RPC ping services specified in the Update
Services of Administration > Settings > Writing. This works by having the function privacy_ping_filter() remove
the sites to ping from the list. This
filter is added by having
add_filter('option_ping_sites','privacy_ping_filter');
in the default-filters. When the
generic_ping function attempts to get
the "ping_sites" option, this filter
blocks it from returning anything.
Hides the Update Services option entirely on the
Administration > Settings > Writing
panel with the message "WordPress is
not notifying any Update Services
because of your blog's privacy
settings."
You can't actually block bots and crawlers from searching through a publicly available site; if a person with a browser can see it, then a bot or crawler can see it (caveat below).
However, there is something called the Robots Exclusion Standard (or robots.txt standard), which allows you to indicate to well-behaved bots and crawlers that they shouldn't index your site. This site, as well as Wikipedia, provide more information.
The caveat to the comment above (that what you see in your browser, a bot can see) is this: most simple bots do not include a JavaScript engine, so anything the browser renders as a result of JavaScript code will probably not be seen by a bot. I would suggest you don't use this as a way to avoid indexing, since the robots.txt standard does not rely on JavaScript to ensure correct rendering of your page.
One last comment: bots are free to ignore this standard. Those bots are badly behaved. The bottom line is that anything that can read your HTML can do what it likes with it.
I don't know for sure, but it probably generates a robots.txt file that specifies rules for search engines.
Using a Robots Exclusion file.
Example:
User-agent: Googlebot
Disallow: /private/