Hide uploaded files from search results? - php

A client running WordPress has requested the development of the following feature on their website.
They would like to include or exclude specific files (typically PDFs) uploaded via the WordPress media uploader from search results.
I'm guessing this could be done somehow using a robots.txt file, but I have no idea where to start.
Any advice/ideas?

This is from Google's Webmaster developer documentation: https://developers.google.com/webmasters/control-crawl-index/docs/faq
How long will it take for changes in my robots.txt file to affect my search results?
First, the cache of the robots.txt file must be refreshed (we generally cache the contents for up to one day). Even after finding the change, crawling and indexing is a complicated process that can sometimes take quite some time for individual URLs, so it's impossible to give an exact timeline. Also, keep in mind that even if your robots.txt file is disallowing access to a URL, that URL may remain visible in search results despite the fact that we can't crawl it. If you wish to expedite removal of the pages you've blocked from Google, please submit a removal request via Google Webmaster Tools.
And here is Google's robots.txt specification: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
If your file's syntax is correct, the best answer is simply to wait until Google picks up your new robots.txt file.

I'm not certain how to do this within the confines of WordPress, but if you're looking to exclude particular file types, I would suggest using the X-Robots-Tag HTTP header. It's particularly useful for PDFs and other non-HTML file types where you would normally want to use a robots meta tag.
You can add the header to every response for a given file type and set its value to noindex. This will prevent the PDFs from being included in search results.
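For example, on an Apache server a small .htaccess rule could attach the header to every PDF response (a sketch assuming mod_headers is enabled; the file pattern is only an illustration):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

Unlike a robots meta tag, this works for files you cannot add HTML markup to.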
You can use the robots.txt file if the URLs end with the file type or something else unique to the file type. Example: Disallow: /*.pdf$ ... but I know that's not always the case with URLs.
https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag

Related

How to redirect all subsite URLs to one single URL in a multisite, and also send a variable/value to this subsite

I have a specific requirement and am looking for suggestions on the best possible way to achieve it. I'll start by apologizing if I sound too naïve. What I am trying to achieve is:
A) I have a parent site, say, www.abc.com.
B) I am planning to enable the multisite option for it. The parent site has an area map with a number of location images overlaid. Each of these images, when clicked, should lead to a subsite.
C) This subsite (already coded) is totally dynamic, and every piece of information displayed on it is extracted from the database. It uses a session variable, which for now is hard-coded at the very beginning of the header; this variable also decides which database to refer to. So it will display information for different locations, based on the location selected on the parent site, and even the URL should reflect that. Say Location 'A' was clicked on the parent site: the session variable needs to be set to 'LocA' on the subsite and the URL should be something like www.abc.com/LocA; if Location 'B' was clicked, the session variable should be set to 'LocB' and the URL should appear as www.abc.com/LocB, and so on. I'm trying to figure out how to achieve this. [It will have one front end for all the locations but a different database for each location.]
I am an entrepreneur with some programming experience from my past (but none related to website design). Thanks to the help of all you geniuses and the code samples lying around, I was able to code the parent site and the subsite (using HTML, PHP, JS, and CSS). Now the trouble is how to put it all together and make it work in concert. Though it will still be a week or two before I get to try it, I am trying to gather insights so that I am ready by the time I get there. Any help will be deeply appreciated.
I think the fundamental thing to understand before you get deeper is what a URL is. A URL is not part of the content that you display to the user; nor is it the name of a file on your server. A URL is the identifier the user sends your server, which your server can use to decide what content to serve. The existence of "sub-sites", and "databases", and even "files" is completely invisible to the end user, and you can arrange them however you like; you just need to tell the server how to respond to different URLs.
While it is possible to have the same URL serve different content to different users, based on cookies or other means of identifying a user, having entire sites "hidden" behind such conditions is generally a bad idea: it means users can't bookmark that content, or share it with others; and it probably means it won't show up in search results, which need a URL to link to.
When you don't want to map URLs directly to files and folders, the common approach involves two things:
Rewrite rules, which essentially say "when the user requests URL x, pretend they requested URL y instead".
Server-side code that acts as a "front controller", looking at the (rewritten) URL that was requested, and deciding what content to serve.
As a simple example:
The user requests /abc/holidays/spain
An Apache server is configured with RewriteRule /(...)/holidays/(.*) /show-holidays.php?site=$1&destination=$2, which expands it to /show-holidays.php?site=abc&destination=spain
The show-holidays.php script looks at the parameter $_GET['site'] and loads the configuration for sub-site "abc"
It then looks at $_GET['destination'] and loads the appropriate content
The output of the PHP script is sent back to the user
If the user requests /def/holidays/portugal, they will get different content, but the same PHP script will generate it
Both the rewrite rules and the server-side script can be as simple or as complex as you like: some sites have a single PHP script which handles every request, looks at the real URL that was requested, and decides what to do; others have a long list of mappings from URLs to specific PHP scripts.
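As a rough illustration of the second piece, a minimal front controller for the example above might look like the sketch below (the file name show-holidays.php, the config layout, and the table/column names are assumptions, not a prescribed structure):

<?php
// show-holidays.php - minimal front-controller sketch.
// The rewrite rule has already turned /abc/holidays/spain into
// /show-holidays.php?site=abc&destination=spain
$site        = $_GET['site'] ?? '';
$destination = $_GET['destination'] ?? '';

// Whitelist known sub-sites so arbitrary input can't load arbitrary config
$configs = [
    'abc' => __DIR__ . '/config/abc.php',
    'def' => __DIR__ . '/config/def.php',
];

if (!isset($configs[$site])) {
    http_response_code(404);
    exit('Unknown site');
}

// Each config file returns, e.g., the DSN and credentials for that location's database
$config = require $configs[$site];
$pdo    = new PDO($config['dsn'], $config['user'], $config['pass']);

// Load the content for the requested destination
$stmt = $pdo->prepare('SELECT title, body FROM holidays WHERE slug = ?');
$stmt->execute([$destination]);
$page = $stmt->fetch(PDO::FETCH_ASSOC);

if (!$page) {
    http_response_code(404);
    exit('Not found');
}

echo '<h1>' . htmlspecialchars($page['title']) . '</h1>';
echo '<div>' . htmlspecialchars($page['body']) . '</div>';

The same script would serve /def/holidays/portugal; only the parameters (and therefore the database) change.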

Add "noindex" in a link to a pdf

I have a website with links to a PHP script that generates a PDF with the mPDF library; it is displayed in the browser or downloaded, depending on the configuration.
The problem is that I do not want it to be indexed by Google. I've already added rel="nofollow" to the links, so new PDFs are no longer indexed, but how can I de-index the ones that are already there?
Using rel="noindex, nofollow" does not work.
I would have to do it with PHP alone or with some HTML tag.
How is Google supposed to de-index something if you prevent its robot from accessing the resource? ;) This may seem counter-intuitive at first.
Remove the rel="nofollow" from the links, and in the script which serves the PDF files, include an X-Robots-Tag: none header. Google will be able to reach the resource, see that it is forbidden to index this particular resource, and remove the record from the index.
Once de-indexing is done, add the Disallow rule to the robots.txt file, as mtr.web mentions, so robots won't drain your server anymore.
Assuming you have a robots.txt file, you can stop Google from indexing any particular file by adding a rule to it. In your case, it would be something like this:
User-agent: *
Disallow: /path/to/PdfIdontWantIndexed.pdf
From there, all you have to do is make sure that you submit your robots.txt to Google, and it should stop indexing it shortly thereafter.
Note:
It may also be wise to request removal of the URL from the existing Google index, because this will be quicker if it has already been crawled.
Easiest way: Add robots.txt to root, and add this:
User-agent: *
Disallow: /*.pdf$
Note: if there are parameters appended to the URL (like ../doc.pdf?ref=foo) then this wildcard will not prevent crawling, since the URL no longer ends with ".pdf". Dropping the trailing $ (Disallow: /*.pdf) would also match those URLs.

Are robots.txt and meta tags enough to stop search engines from indexing dynamic pages that depend on $_GET variables?

I created a PHP page that is only accessible by means of a token/pass received through $_GET.
Therefore, if you go to the following URL you'll get a generic or blank page:
http://fakepage11.com/secret_page.php
However, if you use the link with the token, it shows you special content:
http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4
Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and only accessed through the provided link.
Are dynamic pages that depend on $_GET variables indexed by Google and other search engines?
If so, will including the following be enough to hide them?
robots.txt:
User-agent: *
Disallow: /
Meta tag:
<META NAME="ROBOTS" CONTENT="NOINDEX">
Even if I type into google:
site:fakepage11.com/
Thank you!
If a search engine bot finds the link with the token somehow¹, it may crawl and index it.
If you use robots.txt to disallow crawling the page, conforming search engine bots won’t crawl the page, but they may still index its URL (which then might appear in a site: search).
If you use meta-robots to disallow indexing the page, conforming search engine bots won’t index the page, but they may still crawl it.
You can’t have both: If you disallow crawling, conforming bots can never learn that you also disallow indexing, because they are not allowed to visit the page to see your meta-robots element.
¹ There are countless ways how search engines might find a link. For example, a user that visits the page might use a browser toolbar that automatically sends all visited URLs to a search engine.
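In practice, "crawlable but not indexed" means letting bots fetch the page while the page itself carries the noindex signal. A minimal sketch of secret_page.php (the token comes from the question; everything else is illustrative):

<?php
// secret_page.php - crawlable, but never indexable.
// The header is sent for every request, with or without a valid token,
// so the URL can be de-indexed even if a bot learns about it.
header('X-Robots-Tag: noindex', true);

$validToken = '344ee833bde0d8fa008de206606769e4';

if (($_GET['token'] ?? '') !== $validToken) {
    // Generic page for visitors (and bots) without the token
    echo '<p>Nothing to see here.</p>';
    exit;
}
?>
<!DOCTYPE html>
<html>
<head>
  <!-- repeats what the header already says, belt and braces -->
  <meta name="robots" content="noindex">
  <title>Special content</title>
</head>
<body>
  <p>Special content for token holders.</p>
</body>
</html>

This only works if robots.txt does not also block the page; otherwise bots never get to see the noindex.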
If your page isn't discoverable then it will not be indexed.
by "discoverable" we mean:
it is a standard web page, i.e. index.*
it is referenced by another link either yours or from another site
So in your case by using the get parameter for access, you achieve 1 but not necessarily 2 since someone may reference that link and hence the "hidden" page.
You can use the robots.txt you gave, and in that case the page will not get crawled by a bot that respects it (not all do). Of course, not indexing your page doesn't mean the "hidden" URL won't be out in the wild.
Furthermore, another issue, depending on your requirements, is that you use unencrypted HTTP, which means that your "hidden" URLs and the content of the pages are visible to every server between yours and the user.
Apart from search engines, be aware that certain services cache/resolve content when URLs are exchanged, for example in Skype or Facebook Messenger. In those cases they will visit the URL, try to extract metadata, and maybe cache it. This scenario does not expose your URL to the public, but it is exposed to the systems of those services, and with it the content that you have "hidden".
UPDATE:
Another issue to consider is exposing a "hidden" page by linking from it to another page. In that case your page will show up as a referrer in the logs of the server that hosts the linked URL, and thus be visible; the same goes for Google Analytics and the like. So if you want to stay stealthy, do not link to other pages from the hidden page.

Access denied for specific image in post in WordPress

The Social Media Department would like a directory where they can upload images and other media that need to be kept private until we are ready to publish them. Ideally, we would want the user to get a 404 error instead of being prompted to log in, or instead of getting an "access denied" message, if they put in a URL for a private file.
Because the Social Media Department does not want to have to move images once an article is ready to be published, what they really need is a way for images saved to the WordPress Media Library (or some other folder) to return a 404 error if they belong to unpublished articles, and to display for anyone if they belong to published articles.
Our users like to try to guess what we'll be announcing by putting in random image file names once they know the URL structure for the images.
The only way is to track what you want or don't want (or both); at some point you have to ask, "can this file be served?". It's not hard to code, but it could be an expensive operation per request.
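A hedged sketch of that check, assuming requests for uploads are rewritten to a small gatekeeper script (the file name serve-upload.php and the ?file= parameter are assumptions):

<?php
// serve-upload.php - only serve a media file if its parent post is published;
// everything else gets a plain 404, never a login prompt or "access denied".
require __DIR__ . '/wp-load.php'; // bootstrap WordPress

$relative = ltrim($_GET['file'] ?? '', '/');   // e.g. 2017/05/launch-image.png
$uploads  = wp_get_upload_dir();
$path     = realpath($uploads['basedir'] . '/' . $relative);

$attachment_id = attachment_url_to_postid($uploads['baseurl'] . '/' . $relative);
$parent_status = $attachment_id ? get_post_status(wp_get_post_parent_id($attachment_id)) : false;

$allowed = $path
    && strpos($path, realpath($uploads['basedir'])) === 0   // block path traversal
    && is_file($path)
    && $parent_status === 'publish';                        // attached to a published article

if (!$allowed) {
    status_header(404);
    nocache_headers();
    exit;
}

header('Content-Type: ' . mime_content_type($path));
header('Content-Length: ' . filesize($path));
readfile($path);

Routing every upload request through PHP is exactly the per-request cost mentioned above, so you would want to cache or bypass the check for files that are already public.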
To keep users from guessing names, you can either prepend/append a random string (per Graham Walters) or hash the whole name. Don't forget to suppress autoindexing of the directory via the Options -Indexes .htaccess directive, or with a "Nothing to see here folks, move along." index.html file.
If users can somehow get hold of the names (say, via a leak), but there aren't too many "embargoed" files, the embargoed files could be added to an .htaccess blacklist similar to hotlink protection. Return a 404 if anyone requests those files not via your official pages. Remove them from the blacklist once they go live. If you set up the hotlink protection correctly, you may be able to forbid access to whole classes of files (such as by filetype), except for your official pages.
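For the blacklist variant, a hedged .htaccess sketch (the domain and directory names are placeholders) could look like:

RewriteEngine On
# Requests not coming from our own pages...
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# ...get a plain 404 for anything in the embargoed directory
RewriteRule ^wp-content/uploads/embargoed/ - [R=404,L]

As with hotlink protection, the Referer header is easy to spoof, so treat this as a deterrent rather than real access control.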

Rewriting links in scraped content using mod_rewrite

I'm looking to create an iframe on my site that contains amazon.com, and I'd like to control it (see which product the user is on).
I realize I can't do this because of browser security policy issues, and the only real workaround is to feed the entire page through my server.
So I load the page and I change all the href values from something like
grocery-breakfast-foods-snacks-organic/b/ref=sa_menu_gro7?ie=UTF8&node=16310101&pf_rd_p=328655101&pf_rd_s=left-nav-1&pf_rd_t=101&pf_rd_i=507846&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=1S4N4RYF949Z2NS263QP
(the links on the site are relative) to 'me.com/work.php?link='.urlencode(theirlink).
The problem is the amount of time this takes, plus PHP frequently runs out of memory doing it.
Could I use mod_rewrite to rewrite all URLs from:
http://www.me.com/grocery-breakfast-foods-snacks-organic/b/ref=sa_menu_gro7?ie=UTF8&node=16310101&pf_rd_p=328655101&pf_rd_s=left-nav-1&pf_rd_t=101&pf_rd_i=507846&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=1S4N4RYF949Z2NS263QP
to:
http://www.me.com/work.php?url=urlencode(thatlink)
And if not, are there any better options rather than going through every <a> tag?
Thanks!
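For reference, a catch-all rule of the kind described above might look like the following sketch (work.php, the conditions, and the flags are assumptions; it only covers relative/path links, not absolute ones):

RewriteEngine On
# Don't loop on the proxy script itself, and leave real files alone
RewriteCond %{REQUEST_URI} !^/work\.php
RewriteCond %{REQUEST_FILENAME} !-f
# Hand the original path to work.php; B escapes the backreference,
# QSA carries the original query string across
RewriteRule ^(.+)$ /work.php?link=$1 [B,QSA,L]

Because the scraped links are relative, they already resolve against me.com, so in principle a rule like this routes them to work.php without touching every <a> tag.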
Have you checked out the Associates API? You can get your data that way.
https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=498&categoryID=14
http://astore.amazon.com/
