How can I hide the profile details from search engines using PHP?
I want to prevent search engines from indexing users' details on my website, but only when the user has turned search privacy off from their account page.
Ex: Facebook profile privacy
You can add the following lines to the <head> section of your HTML/PHP page:
<meta name="robots" content="noindex, nofollow" />
<meta name="googlebot" content="noindex, nofollow" />
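If the tags should only appear for users who opted out, you can emit them conditionally from PHP. Here is a minimal sketch; loadProfileOwner() and the hide_from_search flag are hypothetical stand-ins for however your application stores that setting.
<?php
// Emit the robots meta tags only when the profile owner opted out of search engines.
// loadProfileOwner() and 'hide_from_search' are hypothetical placeholders for your own storage.
$owner = loadProfileOwner((int) ($_GET['id'] ?? 0));
$hideFromSearch = !empty($owner['hide_from_search']);
?>
<head>
<?php if ($hideFromSearch): ?>
    <meta name="robots" content="noindex, nofollow" />
    <meta name="googlebot" content="noindex, nofollow" />
<?php endif; ?>
</head>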
Create a robots.txt with something like the following
User-agent: *
Disallow: /profile
More information on possible options can be found here.
You may try using /robots.txt, but there's no guarantee that crawlers will really respect that configuration.
The second thing you can do is hide the content from robots by blacklisting their IP addresses (you can Google for lists of search-robot IPs) and blocking all IPs from lists like this one (although this is a practice also used by malware).
Using robots.txt or HTML meta tags is fine if the bot respects them, but otherwise pointless. The only reliable way to protect the info is to use some sort of authentication system where only registered members may view certain content, thereby limiting who can see what.
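For illustration, a minimal sketch of such a gate in PHP (session-based; the user_id session key is just an assumed convention for your own login system):
<?php
// Sketch of an authentication gate: anyone not logged in (including crawlers)
// is sent to the login page and never sees the profile details.
session_start();

if (empty($_SESSION['user_id'])) {
    header('Location: /login.php', true, 302);
    exit;
}

// ... render the protected profile details for logged-in members below ...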
I have an HTML/PHP/CSS site, very simple, and I have no robots.txt file for it. The site is indexed, it's all great.
But now I created a page for something I need, and I want to make that one page noindex.
Do I have to create a robots.txt file for it, or is there an easier way to do it without having to create a robots.txt?
Also, I did Google for this before asking, and I came across an article that instructed me to put the following code on the page, inside the <head> section:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
I did that. However, after I checked the page here: http://www.seoreviewtools.com/bulk-meta-robots-checker/
It says: Meta robots: Not found
Explanation: There are no restrictions for indexing or serving detected.
So then, how can I make that page noindex?
Although adding a robots.txt file is the best you can do, there is never a guarantee that a search robot will not index that page. I believe most, if not all, respectable search engines will follow the restrictions in that file, but there are plenty of other bots out there that don't.
That said, if you don't link to the page anywhere, no search engine will index it - at least in theory: if you or other people access that page and their browser or an extension submits that page for indexing, it ends up being indexed anyway.
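If the meta tag keeps getting reported as missing, another option (assuming the page is served by PHP) is to send the equivalent directive as an HTTP response header before any output, for example:
<?php
// Sketch: send the noindex/nofollow directive as an HTTP header for this one page.
// Must be called before any HTML output, otherwise the headers are already sent.
header('X-Robots-Tag: noindex, nofollow');
?>
<!DOCTYPE html>
<html>
<head><title>The page you want kept out of the index</title></head>
<body>...</body>
</html>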
So this is what happens when I test my website using Facebook's Open Graph Object Debugger: it doesn't like the trailing numbers after the profile page. But I have both of these defined properly:
<meta property="og:url" content="http://www.website.com/profile/139">
<link rel="canonical" href="http://www.website.com/profile/139">
I've tried for hours and it just keeps redirecting to the homepage.
Is there anything I can add to my .htaccess file or PHP header to prevent this 301 redirect?
May be related to the way Facebook/Google handle URL parameters: http://gohe.ro/1fpOA0N
The answer was a problem with our hosting provider, WP Engine, which tricks spiders into ignoring pure numeric strings at the end of page URLs. This pertains specifically to:
Googlebot (Google's spider)
Slurp! (Yahoo's spider)
BingBot (Bing's spider)
Facebook OG/Debugger
For example, the following URL:
http://www.website.com/profile/12345
Will be interpreted by these bots as:
http://www.website.com/profile
However, if the string is non-numeric, the bots will recognize it. This is done for caching purposes. But again, this pertains only to WP Engine and a few other hosting providers.
Facebook treats the og:url meta tag as the canonical URL for your page:
<meta property="og:url" content="http://www.yoursite.com/your-canonical-url" />
If your canonical URL is redirecting, you are in fact creating a loop.
Don't redirect from your canonical URL.
The canonical URL is the page that should be considered the preferred version by the spiders.
If a page has a canonical URL tag pointing elsewhere, it means that it is NOT the best/default page, but rather a lesser variation of the canonical.
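As a quick sanity check, you can confirm that the URL you publish in og:url / rel=canonical answers with a 200 rather than a 301/302. Here is a sketch using PHP's cURL extension; the profile URL is just the example from the question.
<?php
// Sketch: verify that the canonical / og:url target does not redirect.
$url = 'http://www.website.com/profile/139';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);          // status line and headers are enough
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // do NOT follow redirects; we want the raw status
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

echo ($status === 200)
    ? "OK: canonical URL answers 200\n"
    : "Problem: canonical URL answers {$status} (a 301/302 here creates the loop described above)\n";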
The LinkedIn documentation can be found here.
As it says, it needs:
og:title
og:description
og:image
og:url
Here is an example from my WordPress blog's source code; for simplicity I use the Jetpack plug-in:
<!-- Jetpack Open Graph Tags -->
<meta property="og:type" content="article" />
<meta property="og:title" content="Starbucks Netherlands Intel" />
<meta property="og:url" content="http://lorentzos.com/starbucks-netherlands-intel/" />
<meta property="og:description" content="Today I had some free time at work. I wanted to play more with Foursquare APIs. So the question: &quot;What is the correlation of the Starbucks Chain in the Netherlands?&quot;. Methodology: I found all the p..." />
<meta property="og:site_name" content="Dionysis Lorentzos" />
<meta property="og:image" content="http://lorentzos.com/wp-content/uploads/2013/08/starbucks-intel-nl-238x300.png" />
In Facebook it works great, and you can see the meta data here. However, LinkedIn is more stubborn and doesn't really parse the data, even though the documentation says: "If you're unable to set Open Graph tags within the page that's being shared, LinkedIn will attempt to fetch the content automatically by determining the title, description, thumbnail image, etc."
I know that I don't have the og:image:width tag, but LinkedIn doesn't even parse the title, description or URL. Any ideas on how to debug it?
I checked my HTML again and found some warnings/errors in the metadata. I fixed them and now everything works. So, the solution if you encounter the same problem:
Check your HTML again and debug it. Even if the page loads fine in your browser, the LinkedIn parser is not as forgiving of small errors. This tool might help.
My very first suggestion is appending a meaningless query to the URL, so that LinkedIn thinks it's a new link (this doesn't affect anything else), e.g.:
http://example.com/link.php?42 or http://example.com/link.html?refid=LinkedIn
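If you generate the share links from PHP, here is a sketch of that cache-busting trick; the refid parameter name is arbitrary and only has to change between attempts.
<?php
// Sketch: append a throwaway query parameter so LinkedIn treats the URL as new.
function cacheBustedUrl(string $url): string
{
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $separator . 'refid=' . time();
}

echo cacheBustedUrl('http://example.com/link.php');
// e.g. http://example.com/link.php?refid=1700000000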
If that doesn't suit your needs, a more drastic measure is in order.
After making sure you don't have any errors in your console and validating your site using:
http://validator.w3.org/...
Add the prefix attribute to every meta tag (not to the html tag), then sign in with your LinkedIn account again to clear the cache...
prefix="og: http://ogp.me/ns#", e.g.:
<meta prefix="og: http://ogp.me/ns#" property="og:title" content="Title of Page" />
<meta prefix="og: http://ogp.me/ns#" property="og:type" content="article" />
<meta prefix="og: http://ogp.me/ns#" property="og:image" content="http://example.com/image.jpg" />
<meta prefix="og: http://ogp.me/ns#" property="og:url" content="http://example.com/" />
I hope one of these three solutions works for someone. Cheers!
If you're sure you've done everything right (using Open Graph meta tags, no errors on validator.w3.org) and it still isn't working, be sure to try it with a different page; it might be a LinkedIn cache thing.
I had a <h1>Project information</h1> on my page, which LinkedIn used as the title for sharing the page, instead of the <title> or <meta property="og:title" [...]/> tag, even though I did everything right. And when I completely removed this <h1>Project information</h1> from the page source, it kept using 'Project information' as the title even though it wasn't on the page anymore.
After trying a different page, it worked.
I stumbled upon the same problem on our WordPress site. The problem is caused by conflicting OGP and oEmbed headers in standard WordPress plus the Yoast/Jetpack SEO plugins.
You need to disable the oEmbed headers with this plugin (this has no side effects): https://wordpress.org/plugins/disable-embeds/
After that you can force a fresh link preview by appending ?1 to the URL, as some of you have already pointed out!
I hope that fixes your problem.
I wrote a detailed explanation for the problem here: https://pmig.at/2017/10/26/linkedin-link-preview-for-wordpress/
LinkedIn caches the URLs, so it's very practical to make sure that this is not your problem before you start debugging.
This tool might come in handy: https://www.linkedin.com/post-inspector/inspect/
Here you can preview your URL and see how it will look when shared. It refreshes the cache as well, so you can be sure whether you have a real problem or it was only the caching.
After a long trial and error I found out that my .htaccess was somehow blocking the LinkedIn robot (WordPress site). For those who use the iThemes Security plugin for WordPress or another security plugin, make sure that LinkedIn is not blocked.
Make sure there is no line like:
RewriteCond %{HTTP_USER_AGENT} ^Link [NC,OR]
The easiest way to check is to use the default WordPress .htaccess lines.
As mentioned before, make sure you don't retry cached pages in LinkedIn.
You can try this only once a week!
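If you want to confirm from the PHP side that LinkedIn's crawler is actually getting through your .htaccess, a small sketch like this logs its visits. LinkedIn's crawler identifies itself with a user agent containing "LinkedInBot"; the log path is an arbitrary example.
<?php
// Sketch: log hits from LinkedIn's crawler so you can see whether .htaccess lets it through.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? 'unknown';

if (stripos($ua, 'LinkedInBot') !== false) {
    $line = date('c') . ' LinkedIn crawler requested ' . ($_SERVER['REQUEST_URI'] ?? '/') . "\n";
    error_log($line, 3, '/tmp/linkedin-bot.log'); // adjust the path to somewhere writable
}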
I had a link to my site and I wanted to customize the image LinkedIn displayed, so I added Open Graph tags, which didn't seem to render at all. Until I read this:
The first time that LinkedIn's crawlers visit a webpage when asked to share content via a URL, the data it finds (Open Graph values or our own analysis) will be cached for a period of approximately 7 days.
This means that if you subsequently change the article's description, upload a new image, fix a typo in the title, etc., you will not see the change represented during any subsequent attempts to share the page until the cache has expired and the crawler is forced to revisit the page to retrieve fresh content.
https://developer.linkedin.com/docs/share-on-linkedin
The solution for me was to add a hashbang. I am on an AJAX-style application which doesn't render the whole page, and I think LinkedIn has a bit of a hissy fit about the text/image not being on the page on the initial scrape. Adding
%23!
to the end of my encoded URL, or
#!
to the unencoded URL, before sending it off to LinkedIn seemed to do the trick nicely for my share-button popup. Not sure if this applies only to AJAX/JS apps, but it certainly put an end to a couple of hours of effort for me.
I guess this is only useful if your application is set up to handle the _escaped_fragment_ in the URL and render a static page rather than a dynamic one, but I can't test this theory right now.
This was happening on one of my client's sites as well. I discovered that the .htaccess file was blocking LinkedIn from the site when the user agent contained the string "jakarta".
As soon as I removed this filtering, LinkedIn was able to access all of the required Open Graph (og) information when the client posted a link.
True, the documentation states that you can have: title, url, description, and image. But in reality, you have two options. Pick one of the two following sets and use it, as you have no other choice...
Set 1 Options
og:title
og:url
og:image
Set 2 Options
og:title
og:url
og:description
That is the reason why og:description is mysteriously missing from preview links. But if you drop image, then your description will finally display.
Try it: Wikipedia has an og description but no og image, while GitHub has both. Share Wikipedia and Share GitHub. It clearly seems that you get either the description or the image, not both. I have spent weeks struggling with LinkedIn Support to correct this, but to no avail.
When I look in my Webmaster Tools account I have hundreds of duplicate meta descriptions. When I look at each one, the duplicate URLs are like so:
/index.php?route=product/product&product_id=158?48efc520
/index.php?route=product/product&product_id=158?abc56c80
Where are these numbers coming from after my product ID?
Thanks
Pjn
That means that the <meta name="description" content="..."> is the same for several pages.
Since you're not sure how the additional parameters are added to your URL, you could use the link tag to specify the canonical URL. This needs to be added to the head of each page.
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
For more information, have a look at http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html.
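If your pages are rendered by PHP, here is a sketch of emitting that canonical tag from the parameters that actually matter; the domain is a placeholder, and product_id is assumed to be the only significant parameter.
<?php
// Sketch: build the canonical URL from only the parameters that matter,
// ignoring whatever extra querystring fragments were appended.
$productId = (int) ($_GET['product_id'] ?? 0);
$canonical = 'http://www.example.com/index.php?route=product/product&product_id=' . $productId;
?>
<link rel="canonical" href="<?php echo htmlspecialchars($canonical, ENT_QUOTES); ?>" />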
To resolve this, add code to 301 redirect URLs with extraneous parameters to the canonical URL. For example, redirect /index.php?route=product/product&product_id=158?48efc520 to
/index.php?route=product/product&product_id=158
Assuming these extraneous parameters are indeed not being used.
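Here is a sketch of that redirect in PHP, assuming route and product_id really are the only parameters the page uses:
<?php
// Sketch: 301-redirect requests carrying extraneous parameters back to the canonical product URL.
$route     = $_GET['route'] ?? '';
$productId = (int) ($_GET['product_id'] ?? 0); // the cast drops anything like "158?48efc520"

if ($route === 'product/product' && $productId > 0) {
    $canonical = '/index.php?route=product/product&product_id=' . $productId;

    if (($_SERVER['REQUEST_URI'] ?? '') !== $canonical) {
        header('Location: ' . $canonical, true, 301);
        exit;
    }
}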
Stack Overflow does this; see how this URL redirects:
Google webmaster tools HTML suggestions - duplicate meta descriptions
Do you have a referral system or something in place? Those are query strings, and without the name part of the usual name=value pair, they tend to look like those used for referrals.
Is this a system you built yourself, or are you using an existing product?
Without more information about your setup, it will be tough to diagnose.
Hard to say where Google "learned" the names. You can do a few things in Webmaster Tools to avoid the reports.
One is the canonical tag, as somebody mentioned. The other is in Webmaster Tools under Site Configuration/Settings: click the Parameter Handling tab and enter your exceptions.
I would like to hide some content from the public (like Google cached pages). Is this possible?
Add the following HTML tag in the <head> section of your web pages to prevent Google from showing the Cached link for a page.
<META NAME="ROBOTS" CONTENT="noarchive">
Check out Google webmaster central | Meta tags to see what other meta tags Google understands.
Option 1: Disable 'Show Cached Site' Link In Google Search Results
If you want to prevent Google from archiving your site, add the following meta tag to your <head> section:
<meta name="robots" content="noarchive">
If your site is already cached by Google, you can request its removal using Google's URL removal tool. For more instructions on how to use this tool, see "Remove a page or site from Google's search results" at Google Webmaster Central.
Option 2: Remove Site From Google Index Completely
Warning! The following method will remove your site from Google index completely. Use it only if you don't want your site to show up in Google results.
To prevent ("protect") your site from getting to Google's cache, you can use robots.txt. For instructions on how to use this file, see "Block or remove pages using a robots.txt file".
In principle, you need to create a file named robots.txt and serve it from your site's root folder (/robots.txt). Sample file content:
User-agent: *
Disallow: /folder1/
User-Agent: Googlebot
Disallow: /folder2/
In addition, consider setting the robots meta tag in your HTML document to noindex ("Using meta tags to block access to your site"):
To prevent all robots from indexing your site, set <meta name="robots" content="noindex">
To selectively block only Google, set <meta name="googlebot" content="noindex">
Finally, make sure that your settings really work, for instance with Google Webmaster Tools.
robots.txt: http://www.robotstxt.org/
You can use a robots.txt file to request that your page is not indexed. Google and other reputable services will adhere to this, but not all do.
The only way to make sure that your site content isn't indexed or cached by any search engine or similar service is to prevent access to the site unless the user has a password.
This is most easily achieved using HTTP Basic Auth. If you're using the Apache web server, there are lots of tutorials (example) on how to configure this. A good search term to use is htpasswd.
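If you'd rather handle it in PHP than in the Apache configuration, a minimal Basic Auth sketch looks roughly like this; the credentials are placeholders, and a real application should compare against hashed passwords.
<?php
// Minimal HTTP Basic Auth sketch. The credentials are placeholders;
// in practice verify against a stored hash (password_hash()/password_verify()).
$validUser = 'member';
$validPass = 'change-me';

$user = $_SERVER['PHP_AUTH_USER'] ?? '';
$pass = $_SERVER['PHP_AUTH_PW'] ?? '';

if (!hash_equals($validUser, $user) || !hash_equals($validPass, $pass)) {
    header('WWW-Authenticate: Basic realm="Members only"');
    header('HTTP/1.0 401 Unauthorized');
    exit('Authentication required.');
}

// ... protected content below ...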
A simple way to do this would be with a <meta name="robots" content="noarchive"/>
You can also achieve a similar effect with the robots.txt file.
For a good explanation, see the official Google blog post on the robots exclusion protocol.
I would like to hide some content from public....
Use a login system to view the content.
...(like Google cached pages).
Configure robots.txt to deny Googlebot.
If you want to limit who can see content, secure it behind some form of authentication mechanism (e.g. password protection, even if it is just HTTP Basic Auth).
The specifics of how to implement that would depend on the options provided by your server.
You can also send this HTTP header with your response, instead of needing to update the HTML files:
X-Robots-Tag: noarchive
e.g. for Apache:
Header set X-Robots-Tag "noarchive"
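Or, if the page itself is served by PHP, you can send the same header before any output:
<?php
// Sketch: PHP equivalent of the Apache directive above.
header('X-Robots-Tag: noarchive');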
See also: https://developers.google.com/search/reference/robots_meta_tag?csw=1