My web site has a database lookup; filling out a CAPTCHA gives you 5 minutes of lookup time. There is also some custom code to detect automated scripts. I do this because I don't want anyone data-mining my site.
The problem is that Google does not see the lookup results when it crawls my site. If someone is searching for a string that is present in the result of a lookup, I would like them to find this page by Googling it.
The obvious solution to me is to use the PHP variable $_SERVER['HTTP_USER_AGENT'] to bypass the CAPTCHA and custom security code for the Google bots. My question is whether this is sensible or not.
People could then use Google's cache to view the lookup results without having to fill out the CAPTCHA, but would Google's own script detection methods prevent them from data mining these pages?
Or would there be some way for people to make $_SERVER['HTTP_USER_AGENT'] appear as Google to bypass the security measures?
Thanks in advance.
Or would there be some way for people to make $_SERVER['HTTP_USER_AGENT'] appear as Google to bypass the security measures?
Definitely. The user agent is laughably easy to forge; see e.g. User Agent Switcher for Firefox. It's just as easy for a spam bot to set its user agent header to that of the Googlebot.
It might still be worth a shot, though. I'd say just try it out and see what the results are. If you get problems, you may have to think about another way.
An additional way to recognize the Google bot could be the IP range(s) it uses. I don't know whether the bot uses defined IP ranges; you'd have to find out.
Update: it seems to be possible to verify the Google Bot by analyzing its IP. From Google Webmaster Central: How to verify Googlebot
Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:
host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.
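For reference, Google's advice above boils down to two lookups, which is straightforward to sketch in PHP with gethostbyaddr() and gethostbyname(). The function name is my own, and 66.249.66.1 is just the example IP from the quoted output; treat this as a sketch rather than production code.

<?php
// Verify that an IP really belongs to Googlebot:
// 1) the reverse (PTR) lookup must resolve to a *.googlebot.com name,
// 2) the forward lookup on that name must point back to the same IP.
function isGooglebotIp($ip)
{
    $host = gethostbyaddr($ip);               // reverse lookup
    if ($host === false || $host === $ip) {
        return false;                         // no usable PTR record
    }
    if (!preg_match('/\.googlebot\.com$/i', $host)) {
        return false;                         // not a Googlebot hostname
    }
    return gethostbyname($host) === $ip;      // forward lookup must match
}

var_dump(isGooglebotIp('66.249.66.1'));       // example IP from the output above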
The $_SERVER['HTTP_USER_AGENT'] parameter is not secure; people can fake it if they really want to get your results. Your decision is a business one: do you wish to lower security and potentially allow people/bots to scrape your site, or do you want your results hidden from Google?
I am working on a Pinterest clone in WordPress at the moment.
How does Pinterest manage to get its site indexed by search engines when you have to log in to see the pin screens?
So, I need a solution where users have to log in to the site, but search engines can still index everything without logging in. Is this technically cloaking?
Any ideas?
You can do this in your script by using a reverse DNS lookup. Google suggests this method itself. With PHP you can do it using gethostbyaddr().
The fastest way is to check the user-agent string. The full list is at http://user-agents.org/.
If you try the DNS method suggested by Leonard, you'll find it's extremely slow. Some IPs don't even have valid reverse lookups. The page in question recommends a reverse lookup only for verifying the origin of a request that claims to be googlebot in the user-agent string. You shouldn't worry about people hacking their browsers (pretending to be the googlebot) until you are popular.
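Putting those two suggestions together (cheap user-agent check first, the slow DNS verification only when the request actually claims to be Googlebot) might look roughly like this; isGooglebotIp() refers to the reverse/forward lookup sketched earlier on this page.

// Only pay for the DNS round-trips when the user agent claims to be Googlebot.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isVerifiedGooglebot = stripos($ua, 'Googlebot') !== false
    && isGooglebotIp($_SERVER['REMOTE_ADDR']);

if (!$isVerifiedGooglebot) {
    // fall through to the normal login / CAPTCHA flow
}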
I have two approaches in mind, but I can see problems with both. I don't need 100% accurate data; an 80% solution that allows me to make generalizations about the most popular domains I'm routing users to is fine.
Option 1 - Use PHP. Route links through a file track.php that makes sure the referring page is from my domain before tracking the click. This page then routes the user to the final intended URL. Obviously bots could spoof this. Do many? I could also check the user agent. Again, I KNOW many bots spoof this.
Option 2 - Use JavaScript. Execute a JavaScript onclick function that writes the click to the database and then directs the user to the final URL.
Both of these methods feel like they may cause problems with crawlers following my outgoing links. What is the most effective method for tracking these outgoing clicks?
The most effective method for tracking outgoing links (it's used by Facebook, Twitter, and almost every search engine) is a "track.php" type file.
Detecting bots can be considered a separate problem, and the methods are covered fairly well by these questions: http://duckduckgo.com/?q=how+to+detect+http+bots+site%3Astackoverflow.com But doing a simple string search for "bot" in the User-Agent will probably get you close to your 80%* (and watching for hits to /robots.txt will, depending on the type of bot you're dealing with, get you 95%*).
*: a semi-educated guess, based on zero concrete data
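To make that concrete, a rough track.php sketch that combines the redirect with the simple "bot" string check could look like the following. The SQLite file and table layout are invented for the example (the clicks table is assumed to exist), and a real version would whitelist destinations so the script doesn't become an open redirect.

<?php
// track.php?url=... : record the outgoing click, then forward the visitor.
$url = isset($_GET['url']) ? $_GET['url'] : '/';
$ua  = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Crude bot filter: skip logging when the user agent contains "bot".
if (stripos($ua, 'bot') === false) {
    $db = new PDO('sqlite:clicks.db');        // illustrative storage
    $db->prepare('INSERT INTO clicks (url, referer, ts) VALUES (?, ?, ?)')
       ->execute(array(
           $url,
           isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '',
           time(),
       ));
}

header('Location: ' . $url);                  // send the visitor to the real destination
exit;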
Well, Google Analytics and Piwik use JavaScript for that.
Since bots can't run JS, you'll only capture humans. On the other hand, humans can disable JS too (but honestly, that's rarely the case).
Facebook, DeviantArt, WLM, etc. use server-side scripts to track. I don't know how they filter bots, but a decent robots.txt with a filter or two should be good enough to get to 80%, I guess.
I have a LAMP server where I run a website, which I want to protect against bulk scraping/downloading. I know there is no perfect solution for this and that an attacker will always find a way, but I would like to have at least some protection that makes stealing the data harder than having nothing at all.
This website has circa 5,000 subpages with valuable text data and a couple of pictures on each page. I would like to analyze incoming HTTP requests online and, if there is suspicious activity (e.g. tens of requests in one minute from one IP), automatically blacklist that IP address from further access to the site.
I fully realize that what I am asking for has many flaws, but I am not really looking for a bullet-proof solution, just a way to limit script kiddies from "playing" with easily scraped data.
Thank you for your on-topic answers and possible solution ideas.
Although this is a pretty old post, I think the answer isn't quite complete and I thought it worthwhile to add my two cents. First, I agree with @symcbean: try to avoid using IPs and instead use a session, a cookie, or another method to track individuals; otherwise you risk lumping together groups of users sharing an IP. The most common method for rate limiting, which is essentially what you are describing ("tens of requests in one minute from one IP"), is the leaky bucket algorithm.
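For illustration, a minimal session-based take on the leaky bucket could look like this. The capacity and drain rate are arbitrary example values, and a real implementation would usually keep the bucket somewhere shared and fast (memcached, Redis) rather than in the PHP session.

<?php
// Leaky bucket per session: each request adds one unit, the bucket drains
// at a fixed rate over time, and requests are refused once it overflows.
session_start();

$capacity = 30;     // maximum burst size (illustrative)
$leakRate = 0.5;    // units drained per second (illustrative)
$now = microtime(true);

if (!isset($_SESSION['bucket'])) {
    $_SESSION['bucket'] = 0.0;
    $_SESSION['bucket_ts'] = $now;
}

// Drain according to elapsed time, then add this request.
$elapsed = $now - $_SESSION['bucket_ts'];
$_SESSION['bucket'] = max(0, $_SESSION['bucket'] - $elapsed * $leakRate) + 1;
$_SESSION['bucket_ts'] = $now;

if ($_SESSION['bucket'] > $capacity) {
    http_response_code(429);    // Too Many Requests
    exit('Slow down.');
}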
Other ways to combat web scrapers are:
CAPTCHAs
Make your code hard to interpret, and change it up frequently. This makes scripts harder to maintain.
Download IP lists of known spammers, proxy servers, TOR exit nodes, etc. It's going to be a lengthy list, but it's a great place to start. You may also want to block all Amazon EC2 IPs (a rough CIDR check is sketched below).
This list, and rate limiting, will stop simple script kiddies, but anyone with even moderate scripting experience will easily be able to get around you. Combating scrapers on your own is a futile effort, but my opinion is biased because I am a co-founder of Distil Networks, which offers anti-scraping protection as a service.
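For the IP-list idea mentioned above, a rough IPv4 CIDR check might look like this. blocked_ranges.txt is a placeholder for whatever list you download, and the sketch assumes 64-bit PHP, IPv4 only, and well-formed a.b.c.d/nn entries.

<?php
// Block visitors whose address falls in any downloaded CIDR range
// (TOR exit nodes, EC2 ranges, known proxies, ...).
function ipInCidr($ip, $cidr)
{
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int)$bits);          // e.g. /24 -> 0xFFFFFF00
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$ranges = file('blocked_ranges.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($ranges as $range) {
    if (ipInCidr($_SERVER['REMOTE_ADDR'], $range)) {
        http_response_code(403);
        exit('Forbidden.');
    }
}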
Sorry - but I'm not aware of any anti-leeching code available off-the-shelf which does a good job.
How do you limit access without placing burdens on legitimate users / without providing a mechanism for DOSing your site? Like spam prevention, the best solution is to use several approaches and maintain scores of badness.
You've already mentioned looking at the rate of requests, but bear in mind that increasingly users will be connecting from NAT networks, e.g. IPv6 PoPs. A better approach is to check per session: you don't need to require your users to register and log in (although OpenID makes this a lot simpler), but you could redirect them to a defined starting point whenever they make a request without a current session and log them in with no username/password. Checking the referer (and that the referer really does point to the current content item) is a good idea too, as is tracking 404 rates. Add road blocks (when the score exceeds a threshold, redirect to a captcha or require a login). Checking the user agent can be indicative of attacks, but it should be used as part of the scoring mechanism, not as a yes/no criterion for blocking.
Another approach, rather than interrupting the flow, is when the thresholds are triggered start substituting content. Or do the same when you get repeated external hosts appearing in your referer headers.
Do not tar pit connections unless you've got a lot of resource serverside!
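To make the scoring idea concrete, here is a rough per-session sketch. The individual signal weights and the threshold are invented for illustration, and /captcha.php stands in for whatever road block (captcha page, login form) you choose.

<?php
// Accumulate a per-session "badness" score from several weak signals,
// and only put up a road block once the score crosses a threshold.
session_start();
if (!isset($_SESSION['badness'])) {
    $_SESSION['badness'] = 0;
}

$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$ua      = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if ($referer === '') {
    $_SESSION['badness'] += 1;    // no referer at all
}
if (stripos($ua, 'wget') !== false || stripos($ua, 'curl') !== false) {
    $_SESSION['badness'] += 5;    // downloader-style user agent
}
// A custom 404 handler could likewise do $_SESSION['badness'] += 3;

if ($_SESSION['badness'] > 20) {              // threshold is illustrative
    header('Location: /captcha.php');         // road block: captcha or login
    exit;
}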
Referrer checking is one very simple technique that works well against automated attacks. You serve content normally if the referrer is your own domain (i.e. the user has reached the page by clicking a link on your own site), but if the referrer is not set, you can serve alternate content (such as a 404 Not Found).
Of course you need to set this up to allow search engines to read your content (assuming you want that), and also be aware that if you have any Flash content, the referrer is never set, so you can't use this method.
Also it means that any deep links into your site won't work - but maybe you want that anyway?
You could also enable it just for images, which makes it a bit harder for them to be scraped from the site.
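As a sketch of the image variant, an image-serving script could check the referrer host before streaming the file. www.example.com, the images/ directory and the hard-coded JPEG content type are all placeholders.

<?php
// image.php?f=photo.jpg : only serve the image when the referrer is our own domain.
$ownHost = 'www.example.com';                 // placeholder for your domain
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$fromUs  = $referer !== '' && parse_url($referer, PHP_URL_HOST) === $ownHost;

$file = basename(isset($_GET['f']) ? $_GET['f'] : '');   // basename() blocks path traversal
$path = __DIR__ . '/images/' . $file;

if (!$fromUs || !is_file($path)) {
    header('HTTP/1.0 404 Not Found');         // alternate content for hotlinkers
    exit;
}
header('Content-Type: image/jpeg');
readfile($path);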
Something that I've employed on some of my websites is to block known User-Agents of downloaders or archivers. You can find a list of them here: http://www.user-agents.org/ (unfortunately, not easy to sort by Type: D). In the host's setup, I enumerate the ones that I don't want with something like this:
SetEnvIf User-Agent ^Wget/[0-9\.]* downloader
Then I can do a Deny from env=downloader in the appropriate place. Of course, changing user-agents isn't difficult, but at least it's a bit of a deterrent if going through my logs is any indication.
If you want to filter by requests per minute or something along those lines, I don't think there's a way to do that in apache. I had a similar problem with ssh and saslauth, so I wrote a script to monitor the log files and if there were a certain number of failed login attempts made within a certain amount of time, it appended an iptables rule that blocked that IP from accessing those ports.
If you don't mind using an API, you can try our https://ip-api.io
It aggregates several databases of known IP addresses of proxies, TOR nodes and spammers.
I would advise one of two things.
First: if you have information that other people want, give it to them in a controlled way, say, via an API.
Second: try to copy Google. If you scrape Google's results a lot (and I mean a few hundred times a second), it will notice and force you to solve a CAPTCHA.
I'd say that if a site is visited 10 times a second, it's probably a bot, so give it a CAPTCHA to be sure.
If a bot crawls your website slower than 10 times a second, I see no reason to try to stop it.
You could use a counter (DB or session) and redirect the request if the limit is exceeded.
/** Pseudocode made concrete: per-session request counter */
session_start();
$limit = 100; // illustrative threshold
if (!isset($_SESSION['hits'])) $_SESSION['hits'] = 0;
$_SESSION['hits']++;
if ($_SESSION['hits'] > $limit) {
    header('Location: /captcha.php'); // illustrative redirect target
    exit;
}
I think dynamic blocking of IPs using an IP blocker will help more.
I've been having a spam problem on my site, where people sign up and act extremely abusively toward other users. I can easily IP-ban them, except they always come back under a different IP address through a proxy or TOR.
So I was curious whether there are any PHP classes or functions that can look up the IP and determine if it's a genuine user or someone behind a proxy, in which case it would muzzle their accounts upon registration.
Many legitimate users will come to you through proxies - are you sure you want to filter all of them out? For example:
ISPs that run caching proxies for all their users
People on corporate networks
To answer your question, checking for the X-Forwarded-For or Via headers is probably your best bet.
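A rough sketch of that header check in PHP is below. Note that "elite" proxies and most TOR exit nodes add none of these headers, so a negative result proves nothing; treat a positive one as a single signal, not a verdict.

<?php
// Very rough proxy heuristic: transparent and anonymous proxies usually add one of these headers.
function looksLikeProxy(array $server)
{
    $proxyHeaders = array('HTTP_X_FORWARDED_FOR', 'HTTP_VIA', 'HTTP_FORWARDED', 'HTTP_CLIENT_IP');
    foreach ($proxyHeaders as $header) {
        if (!empty($server[$header])) {
            return true;
        }
    }
    return false;
}

if (looksLikeProxy($_SERVER)) {
    // e.g. flag the new account for moderation instead of blocking outright
}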
Following RichieHindle's answer, I'd suggest some kind of profanity filter/detection: detect the unacceptable behaviour and suspend the accounts. Use of a proxy could definitely influence the weight of decisions made by the filter/detector!
Actually stopping them is difficult, but if their nasty content doesn't get published they'll soon give up.
Is it possible to find out where the users come from? For example, I give a client a banner and a link. The client may put the banner/link on any website, let's say on a site called www.domain.com.
When the user clicks the banner, is it possible to know where he is coming from (www.domain.com)?
Have a look at the HTTP_REFERER variable. It will tell you what site the user was on before he came to your site.
Yes. You give the client a unique URL, like www.yourdomain.com/in/e10c89ee4fec1a0983179c8231e30a45. Then track these URLs and accesses in a database.
The real problem is tracking unique visitors.
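A rough sketch of that lookup-and-redirect flow is below; the token extraction, SQLite file and table layout are all invented for the example.

<?php
// in.php, routed as /in/<token>: look the token up, log the hit, then redirect.
$token = basename(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));

$db = new PDO('sqlite:referrals.db');         // illustrative storage
$stmt = $db->prepare('SELECT client_id, target_url FROM referral_links WHERE token = ?');
$stmt->execute(array($token));
$link = $stmt->fetch(PDO::FETCH_ASSOC);

if ($link === false) {
    header('HTTP/1.0 404 Not Found');
    exit;
}

$db->prepare('INSERT INTO referral_hits (client_id, ip, ts) VALUES (?, ?, ?)')
   ->execute(array($link['client_id'], $_SERVER['REMOTE_ADDR'], time()));

header('Location: ' . $link['target_url']);
exit;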
See
$_SERVER["HTTP_REFERER"]
Although it can't always be trusted, since it's set by the client, you may not care in your case.
Note that it is the legacy $HTTP_REFERER global, not the $_SERVER["HTTP_REFERER"] superglobal, that is only populated when php.ini has register_globals turned on.
Register globals can allow exploitation in loosely coded PHP applications, commonly in apps that allow users to post data.
I have used the following method in the past to check referrers in applications where I control the operator input.
session_start();
// Capture the referrer once per session; note it may be absent or spoofed.
if (!isset($_SESSION['url_referer'])) {
    $_SESSION['url_referer'] = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
}
Other than hashing strings in session variables, I don't know of a more efficient practice. Does anyone know the best practice?
Finest Regards,
Brad
The only chance is to use a unique ID (as pointed out by gnud). This way you can track the incoming links. The referrer may be altered or removed by browsers or proxies (many companies do that).
Using the IP to track unique visitors is a bad idea. AOL still pools IPs, so a user may appear under different IPs every few minutes, and with proxies your counting will not be very accurate.
I'd say, go with the unique ID.