I have user profiles on my site. Users can make a profile public by checking a checkbox (searchable via a search engine) and uncheck the box to block the page from being indexed by a search engine. The site is PHP (CodeIgniter).
How is this accomplished? I am especially lost on what happens when the user unchecks the box to make the page private: how is that done, and how can it take effect in as close to real time as possible? A good example is the profiles on Facebook or LinkedIn.
This isn't secure, but you could check visitors' referring URLs and allow/deny their requests by looking for a search engine's address. The results would still show up in Google, and Google would still keep a cached copy of the page (which you can kinda stop with a <meta> tag).
Basically, warn the users that when they make the page public, it's not that easy to make it private after that. I'd make it a tedious and painful process, as people will complain that "your site is brokeded".
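For the checkbox itself, a rough sketch (not from the answer above; the is_public flag and variable names are assumptions) would be to emit a robots <meta> tag on private profiles and, if you also want the referrer trick, inspect HTTP_REFERER before any output:

```php
<?php
// Sketch only: assumes $profile['is_public'] was loaded from a hypothetical profiles table.
if (empty($profile['is_public'])) {
    // Optional (and easy to spoof): turn away visits arriving from a search results page.
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    if (preg_match('/(google|bing|yahoo)\./i', $referer)) {
        header('HTTP/1.1 403 Forbidden');
        exit('This profile is private.');
    }
    // Emitted in the <head> of the profile view: ask compliant crawlers not to index or cache it.
    echo '<meta name="robots" content="noindex, noarchive">' . "\n";
}
```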
The fastest way to get pages automatically removed from Google:
1) Set a new last-modified date on that specific page.
2) Return an HTTP 410 for the page URL (you can still display content on it, or make the HTTP 410 specific to the Google user agent, to cookie-less requests, or to requests without a language header).
3) Put the URL into a sitemap.xml (with the new last-modified date).
4) Ping the sitemap.xml to Google.
With this method you can make pages disappear from Google SERPs in 1 to 2 hours (if your site gets crawled regularly).
Note: if you want to kick out thousands and thousands of pages at once, this process will take longer.
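A minimal sketch of the 410 part in plain PHP (the is_public flag and the sitemap URL are assumptions; the ping URL is Google's documented sitemap ping endpoint):

```php
<?php
// (a) In the profile page: answer with 410 once the profile has been switched to private.
//     $profile and 'is_public' are assumptions about your schema.
if (empty($profile['is_public'])) {
    header('HTTP/1.1 410 Gone');     // stronger "gone for good" signal than a 404
    header('X-Robots-Tag: noindex');
    exit;
}
```

```php
<?php
// (b) In the code that handles the checkbox change: regenerate sitemap.xml with a new
//     <lastmod> date for that URL, then ping Google with the sitemap location.
$sitemapUrl = urlencode('http://example.com/sitemap.xml');
@file_get_contents('http://www.google.com/ping?sitemap=' . $sitemapUrl);
```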
What is the proper way, using PHP, to record web page views? I believe that currently we just record a view each time a page is hit, but I am assuming that includes hits from bots or other things we don't want to be recording.
How can we just record real legit views into our DB and not include stuff that shouldn't be counted as an actual page view?
Thanks!
Use Google Analytics.
To set up the web tracking code:
1) Find the tracking code snippet for your property: sign in to your Google Analytics account and select the Admin tab. ...
2) Find your tracking code snippet. ...
3) Copy the snippet. ...
4) Paste your snippet (unaltered, in its entirety) into every web page you want to track. ...
5) Check your setup.
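For reference, the pasted snippet typically looks something like the gtag.js tag below; the exact snippet comes from the Admin screen, and G-XXXXXXXX is a placeholder for your own ID. In a PHP site you would usually drop it into a shared header or footer view:

```php
<!-- e.g. in a shared footer.php view; replace G-XXXXXXXX with your own measurement ID -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-XXXXXXXX"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());
  gtag('config', 'G-XXXXXXXX');
</script>
```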
1) Ignore any known bots that visit your web page (best handled via robots.txt).
2) Use an Ajax call at the end of the page to get rid of bouncers (visitors who open the page by mistake and close it before everything has loaded).
3) You can also fire that Ajax call after a short delay, so bots will already have left your page while a real visitor is still browsing it.
4) Record IP addresses and (if possible) some device identifier to find unique visitors.
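A rough sketch of points 2)-4), assuming a hypothetical record_view.php endpoint called from the page via a delayed Ajax request, and a page_views table (all names and credentials are placeholders):

```php
<?php
// record_view.php - called via Ajax a few seconds after page load (e.g. setTimeout + XMLHttpRequest).

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Skip anything that identifies itself as a crawler (deliberately incomplete pattern).
if ($userAgent === '' || preg_match('/bot|crawl|slurp|spider/i', $userAgent)) {
    http_response_code(204);
    exit;
}

// Record the view with IP and user agent so unique visitors can be estimated later.
$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'dbuser', 'dbpass');
$stmt = $pdo->prepare(
    'INSERT INTO page_views (page, ip_address, user_agent, viewed_at) VALUES (?, ?, ?, NOW())'
);
$stmt->execute([
    isset($_POST['page']) ? $_POST['page'] : '/',
    $_SERVER['REMOTE_ADDR'],
    $userAgent,
]);

http_response_code(204);
```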
I once saw a friend's site which allowed you to see a private PHP file by browsing to a URL.
So for example, http://example.com/accessthesite.php would then show you the content that is hidden on http://example.com/index.php
They basically used it for previewing changes to their site as it hadn't launched fully.
So whilst /index.php had a coming soon page up, if you browsed to /accessthesite.php you would then be redirected to /index.php and shown the full website.
Any ideas how they did this?
It is likely that accessthesite.php sets a session variable or a cookie. Once that is set, index.php contains code to recognize it and display an alternate view.
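Something along these lines (file names from the question; the session key and included files are assumptions):

```php
<?php
// accessthesite.php - flips a flag in the session, then sends the visitor to the homepage.
session_start();
$_SESSION['preview_access'] = true;   // hypothetical key name
header('Location: /index.php');
exit;
```

```php
<?php
// index.php - show the real site to flagged visitors, the "coming soon" page to everyone else.
session_start();
if (!empty($_SESSION['preview_access'])) {
    include 'site_home.php';      // the full website (hypothetical file)
} else {
    include 'coming_soon.php';    // the placeholder page (hypothetical file)
}
```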
I have used this method myself for various things, but I often find it easier to set up a second development and/or staging site. Pending changes can be viewed there before they are published to the public site. In general, it's not a good idea for developers to work directly on the public site, since there might be problems. You do not want to inconvenience users by displaying a site that isn't ready yet.
I want to create a private URL such as
http://domain.com/content.php?secret_token=XXXXX
Then, only visitors who have the exact URL (e.g. received by email) can see the page. We check the $_GET['secret_token'] before displaying the content.
My problem is that if by any chance search bots find the URL, they will simply index it and the URL will be public. Is there a practical method to avoid bot visits and subsequent indexing?
Possible But Unfavorable Methods:
A login system (e.g. via PHP sessions): but I do not want to offer user login.
A password-protected folder: the problem is the same as above.
Using robots.txt: many search engine bots do not respect it.
What you are talking about is security through obscurity. It's never a good idea. If you must, I would offer these thoughts:
Make the link expire
Lock the link to the class C or class D network it was accessed from the first time
Have the page challenge the user with something like a logic question before forwarding to the real page with a time-sensitive token (a two-step process), and if the challenge fails, send back a 404 so the crawler stops.
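A minimal sketch pulling a few of these together (expiry, optional single use, 404 otherwise); the secret_tokens table, its columns, and the DB credentials are hypothetical:

```php
<?php
// content.php - only serve the page for a known, unexpired, unused token; otherwise act like the page doesn't exist.
$token = isset($_GET['secret_token']) ? $_GET['secret_token'] : '';

$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'dbuser', 'dbpass');
$stmt = $pdo->prepare(
    'SELECT id FROM secret_tokens WHERE token = ? AND expires_at > NOW() AND used = 0'
);
$stmt->execute([$token]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if ($row === false) {
    // Wrong, expired or already-used token: answer exactly like a missing page.
    header('HTTP/1.1 404 Not Found');
    exit;
}

// Optionally burn the token so the link only works once.
$pdo->prepare('UPDATE secret_tokens SET used = 1 WHERE id = ?')->execute([$row['id']]);

// ... render the private content here ...
```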
Try generating a 5-6 character alphanumeric password and attaching it to the email, so even if robots spider the URL, they still need the password to access the page. (Just an extra safety measure.)
If there is no link to it (including that the folder has no index view), a robot won't find it.
You could return a 404 if the token is wrong: this way, a robot (and anyone else who doesn't have the token) will think there is no such page.
As long as you don't link to it, no spider will pick it up. And, since you don't want any password protection, the link is going to work for everyone. Consider disabling the secret key after it is used.
You only need to tell the search engines not to index /content.php, and search engines that honor robots.txt won't index any pages whose URLs start with /content.php.
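For completeness, that robots.txt entry (placed at the site root, and again only honored by well-behaved crawlers) would just be:

```
User-agent: *
Disallow: /content.php
```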
Leaving the link unpublished will be ok in most circumstances...
...However, I will warn you that the prevalence of browser toolbars (Google and Yahoo come to mind) changes the game. One company I worked for had pages from their intranet indexed in Google. You could search for the page and a few results would come up, but you couldn't access them unless you were inside our firewall or VPN'd in.
We figured the only way those links got propagated to Google had to be through the toolbar. (If anyone else has a better explanation, I'd love to hear it...) I've been out of that company a while now, so I don't know if they ever figured out definitively what happened there.
I know, strange but true...
OK, here's my problem: content is disappearing from my site. It's not the most secure site out there; it has a number of issues. Right now, every time I upload a page that can delete content from my site using simple links wired to a GET request, I find the corresponding content being deleted en masse.
For example, I have functionality on my site to upload images. Once a user uploads an image, the admin (the owner) can use another page to delete all (owned) images from the site. The delete functionality is implemented so that when a user clicks the link under a thumbnail, a GET request is sent that deletes the image's information from the site's database and removes the image from the server's file system.
The other day I uploaded that functionality, and the next morning I found all my images deleted. The pages are protected by user authentication when you view them in a browser. To my surprise, however, I could wget that page without any problem.
So I was wondering: could some evil web bot be deleting my content using those links? Is that possible? What do you advise for further securing my website?
It is absolutely possible. Even non-evil web bots could be doing it. The Google bot doesn't know the link it follows has any specific functionality.
The easiest way to possibly address this is to set up a proper robots.txt file to tell the bots not to go to specific pages. Start here: http://www.robotstxt.org/
RFC 2616 (HTTP protocol), section 9.1.1: Safe Methods:
The convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.
Basically, if your application allows deletion via GET requests, it's doing it wrong. Bots will follow public links, and they have no reason to expect that doing so will delete things; neither do browsers. If the links are protected, it could still be browser prefetching or acceleration of some kind.
Edit: It might also be Bing. Nowadays Internet Explorer sends data to Microsoft about everywhere you go to gather data for its shitty search engine.
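A sketch of the safer shape suggested above (not your existing code): render the delete action as a small POST form and make the handler refuse anything that isn't an authenticated POST. The file, session, and field names here are placeholders.

```php
<!-- Under each thumbnail: a tiny form instead of a plain GET link -->
<form method="post" action="/delete_image.php">
    <input type="hidden" name="image_id" value="<?php echo (int) $image['id']; ?>">
    <button type="submit">Delete</button>
</form>
```

```php
<?php
// delete_image.php - crawlers and prefetchers only issue GETs, so they never reach the delete code.
session_start();

if ($_SERVER['REQUEST_METHOD'] !== 'POST' || empty($_SESSION['is_admin'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

$imageId = (int) $_POST['image_id'];
// ... verify ownership, delete the DB row, unlink() the image file ...
// (A CSRF token would be the next improvement on top of this.)
```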
Typically, a search bot will scan a page for any links and peek down those links to see what pages are behind them. So yeah, if a bot has access to that page, the page contains links that delete items/stuff, and the bot opens those links to see what's behind them, the code simply gets triggered.
There are a couple of ways to block bots from scanning pages. Look into robots.txt implementations. Also, you might want to look into the mechanism / safety of your admin authentication system... ;-)
You can use the robots.txt file to block access for some web bots.
And for those that don't look at the robots.txt file, you can also use JavaScript; there shouldn't be many web bots interpreting it.
A web site I'm working on sells vehicles to business entities only.
Consequently, it displays data aimed at business customers (prices without Value Added Tax, warranty limitations, etc.). In Germany, showing this kind of data to private end-users can be punished as misleading advertising.
One way around that is to show a dialog when the user enters the site. In the dialog, the user must confirm that they are a business user.
My idea at the moment is to use a flag in $_SESSION to detect whether the user is new, and then to redirect them to a confirmation page using a header redirect. When they confirm they are a business user, they get taken to the actual page.
However, search engines should see the content straight away, without the confirmation page.
Does somebody have a genius simple way of detecting search engine bots:
Without the use of JavaScript
Without the need for constant maintenance (e.g. a list of spiders' USER_AGENT strings)
Bot detection doesn't need to be 100% reliable as long as the major search engines are served properly. Any other ideas on how to fulfill the legal requirement of having the user confirm their business status are very welcome as well.
The web site is based on PHP 5 and runs on a Linux-based shared hosting package (can't install any extensions).
Adding an absolutely positioned overlay to all pages if the session variable isn't set is the easiest solution, I'd think: you still serve the whole page (for users and bots), but it isn't usable for users until they confirm their status.
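A minimal sketch of that overlay, assuming a $_SESSION['is_business'] flag and a confirm_business.php handler that sets it and redirects back (both names invented here):

```php
<?php session_start(); ?>
<?php if (empty($_SESSION['is_business'])): ?>
    <!-- The full page content is still in the HTML below, so crawlers index it normally. -->
    <div style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;
                background: rgba(255, 255, 255, 0.95); z-index: 1000; text-align: center;">
        <p>This site is aimed exclusively at business customers. Prices are shown excluding VAT.</p>
        <form method="post" action="/confirm_business.php">
            <button type="submit">I confirm that I am a business customer</button>
        </form>
    </div>
<?php endif; ?>
<!-- ... normal page content continues here ... -->
```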