On my website, I have 1000 products, and each of them has its own page, accessible via something like product.php?id=PRODUCT_ID.
On each of these pages, I have a link with a URL of the form action.php?id=PRODUCT_ID&referer=CURRENT_PAGE_URL. So if I am visiting product.php?id=100, this URL becomes action.php?id=100&referer=/product.php?id=100, and clicking on this URL returns the user back to the referer.
Now, the problem I am facing is that I keep getting false hits from spiders. Is there any way I can avoid these false hits? I know I can "disallow" this URL in robots.txt, but there are still bots that ignore it. What would you recommend?
Any ideas are welcome. Thanks
Currently, the easiest way of making a link inaccessible to 99% of robots (even those that choose to ignore robots.txt) is with Javascript. Add some unobtrusive jQuery:
<script type="text/javascript">
$(document).ready(function() {
    // Copy each link's data-href into its real href once the DOM is ready
    $('a[data-href]').each(function() {
        $(this).attr('href', $(this).attr('data-href'));
    });
});
</script>
Then construct your links in the following fashion:
<a data-href="action.php?id=100&referer=/product.php?id=100">Click me!</a>
Because the href attribute is only written after the DOM is ready, robots won't find anything to follow.
Your problem consists of 2 separate issues:
multiple URLs lead to the same resource
crawlers don't respect robots.txt
The second issue is hard to tackle; see Detecting 'stealth' web-crawlers.
The first one is easier.
You seem to need an option to let the user go back to the previous page.
I'm not sure why you do not let the browser's history take care of this (through the use of the back-button and javascript's history.back();), but there are enough valid reasons out there.
Why not use the referrer header?
Almost all common browsers send information about the referring page with every request. It can be spoofed, but for the majority of visitors this should be a working solution.
Why not use a cookie?
If you store the CURRENT_PAGE_URL in a cookie, you can still use a single unique URL for each page, dynamically create breadcrumbs and back links based on the referrer stored in the cookie, and not be dependent on the HTTP referrer value.
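A minimal sketch of that cookie approach, assuming the file names from the question (in real code you would also validate the cookie value before redirecting to it):
<?php
// product.php (illustrative): remember the page the visitor is currently on
$id = isset($_GET['id']) ? (int)$_GET['id'] : 0;
setcookie('last_product_page', '/product.php?id=' . $id, 0, '/');
?>
<?php
// action.php (illustrative): send the visitor back based on the cookie,
// falling back to the HTTP referrer, then to the home page
$back = isset($_COOKIE['last_product_page']) ? $_COOKIE['last_product_page']
      : (isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '/');
header('Location: ' . $back);
exit;
?>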
You can use the robots.txt file to keep compliant bots out.
The next thing you can do, once robots.txt is configured, is to check your server logs and find any user agents that seem suspicious.
Let's say you find evil_webspider_crawling_everywhere as a user agent; you can check for it in the headers of the request and deny access to the spider.
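For example, a rough PHP sketch of such a check, using the hypothetical user agent name from above:
<?php
// Deny access when the user agent matches a known bad spider
if (isset($_SERVER['HTTP_USER_AGENT'])
    && stripos($_SERVER['HTTP_USER_AGENT'], 'evil_webspider_crawling_everywhere') !== false) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>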
Another option is to use PHP to detect bots visiting your page.
You could use this PHP function to detect the bot (this gets most of them):
function bot_detected() {
    return (
        isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
    );
}
Then echo the link only when you find that the user is not a bot:
if (bot_detected() === false) {
    echo '<a href="http://example.com/yourpage">your page</a>';
}
I don't believe you can stop user agents that don't obey your advice.
Before going down this route I would really want to ascertain that bots/spiders are a problem; doing anything that prevents natural navigation of your site should be seen as a last resort.
If you really want to stop spiders, consider using JavaScript in your links so that navigation only happens after the link is clicked. This should stop spiders.
Personally I'm not fussed about spiders or bots.
Related
I'm writing a pay per click function on my site. It's fairly easy to add a link on a button:
http://www.mysite.com/advertLink?id=123
I could pick up the ID and redirect accordingly. But how do I stop Google and other bots from "clicking" on this link? I don't want clicks that are generated by bots to be counted and charged like real user clicks.
Also, are there other types of traffic I should consider blocking? I am considering, for example, blocking all traffic from outside my country from being registered as clicks, because this site is directed only at my country.
UPDATE
The nofollow and htaccess rules are a good start, but I was hoping there was perhaps a more foolproof way. I see, for example on this site, www.pricecheck.co.za, that if you click on an ad, it takes you to a fancy forwarding page. I am curious as to what logic is on that page. It also looks like perhaps JavaScript is used somehow. See what I mean here:
http://www.pricecheck.co.za/offers/19453458/Apple+iPad+2+Black+64GB+9.7%22+Tablet+With+WiFi+&+3G/
Change your button to an <a> link and add rel="nofollow", which should tell search engines not to follow the link. Alternatively, you could write out the link using JavaScript, and search engines normally won't follow it:
<script type="text/javascript">
document.write('<a href="http://www.mysite.com/advertLink?id=123">link</a>');
</script>
and like hakan says, add rules to disallow it in your robots.txt.
You could also check the referrer in your script to make sure it was clicked on from your page.
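For example, a rough sketch of that referrer check, assuming the click handler is the advertLink script from the question (the advertiser URL below is purely illustrative):
<?php
// advertLink.php (illustrative): only count clicks that come from our own pages
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if (strpos($referrer, 'http://www.mysite.com/') === 0) {
    // increment the click counter for $_GET['id'] here
}
// redirect to the advertiser either way so real users are not left hanging
header('Location: http://advertiser.example.com/');
exit;
?>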
Nice bots will read and respect your robots.txt. You can write something like
User-agent: *
Disallow: /advertLink
I have got a price comparison website.
You can click on a link of an offer, and I get $1 from the shop.
The problem is that crawlers crawl the whole website, so they "click" on the links.
How can I prevent them from clicking? JavaScript is a bad solution.
Thank you!
I've been thinking about this the wrong way.
I agree with everything that #yttriuszzerbus says above: add a robots.txt file, add "rel=nofollow" to links, and block the user agents that you know about.
So if you've got someone who's now trying to click on a link, it's either a live person, or a badly behaved bot that you don't want clicking.
So how about doing something strange to create the links to the shop sites? Normally, you'd never, ever do this, as it makes your site impossible to index. But that's not an issue - all the well-behaved bots won't be indexing those links because they'll be obeying the robots.txt file.
I'm thinking of something like not having an <a href= tag in there at all. Instead, generate the text of the link, underline it using a stylesheet so it looks like a link to a normal user, and attach a JavaScript onclick function that redirects the user when they click on it. Bots won't see it as a link, and users won't notice a thing.
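A minimal sketch of that idea as a PHP template (the shop URL is purely illustrative):
<?php
// Render the offer as plain styled text instead of an <a href>, and attach
// the real shop URL only through an onclick handler, so there is no link to follow.
$shopUrl = 'http://shop.example.com/offer/123'; // hypothetical offer URL
?>
<span style="text-decoration: underline; cursor: pointer;"
      onclick="window.location.href='<?php echo htmlspecialchars($shopUrl, ENT_QUOTES); ?>';">
    Go to shop
</span>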
You could:
Use "rel=nofollow" to instruct crawlers not to follow your links.
Block certain user-agent strings
Use robots.txt to exclude parts of your site from being crawled.
Unfortunately, none of the above will exclude badly-behaved crawlers. The only solution to actually prevent crawlers is some kind of JavaScript link or a CAPTCHA.
I also have a similar project.
My problem was solved only by blocking certain user-agent strings.
Another problem is that I don't know every "bad" user agent, so when a new crawler enters the site, I add it to the blacklist and retroactively remove its visits from the statistics.
"rel=nofollow" and robots.txt did not work at all for me.
I have an affiliate link on my webpage. When you click on the link it follows the href value which is as follows:
www.site_name.com/?refer=my_affiliate_id
This would be fine, except that the site offers no tracking for the ads, so I can't tell how many clicks I am getting. I could easily implement my own tracking by changing the original link href value to a php script which increments some click stats in a database and then redirects the user to the original page. Like so:
<?php // Do database updating stuff here
Header("Location: http://www.site_name.com/?refer=my_affiliate_id");
?>
But I have read some articles that say that using redirects may be seen by Google as a sign of "blackhat" techniques, and they might rank me lower, unindex my site, or even hurt the site that I'm redirecting to.
Does anybody know if this is true, or have any idea of the best way I could go about this?
Many thanks in advance
Joe
You could always do what Google does with search results. They leave the link's href normal until the mousedown event, then rewrite it. Something to the effect of:
adlink.onmousedown = function(e) {
    var callingLink = this; // the link element the handler is attached to
    callingLink.href = 'http://mysite.com/adtrack_redirect_page.ext?link=' + encodeURIComponent(callingLink.href);
};
Or something like that :P
So, Google will see a normal link, but almost all users will be redirected to your counter page.
Using a 301 redirect simply tells Google that the page has permanently moved. It should have, according to most random people on the internet and according to Google itself, no effect on your page rank.
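For what it's worth, an explicit 301 can be sent from the PHP script in the question like this:
<?php // Do database updating stuff here
header("Location: http://www.site_name.com/?refer=my_affiliate_id", true, 301);
exit;
?>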
Actually, I've read (can't remember where exactly) that this kind of redirect DOES HURT your rating. No, it won't "kill" your website nor the referenced one, as far as I know (and please do check further), but it will hurt your site's rating, as I said.
Anyway, I'd recommend using some JavaScript to refer anything out of your domain. Something like window.open(....) should do the trick, as Google will not follow this code.
From there, refer to your tracking script, which will redirect further.
You could use a javascript onClick event to send an ajax signal to your server whenever the link is clicked. That way the outgoing link is still fully functional, and your server-side script can increment your counter to track the clickthrough.
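A minimal sketch of the server-side part, assuming the AJAX call hits a hypothetical track_click.php and a clicks table already exists:
<?php
// track_click.php (illustrative): increment the counter for the clicked link
$linkId = isset($_GET['id']) ? (int)$_GET['id'] : 0;
$db = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'password'); // hypothetical credentials
$stmt = $db->prepare('UPDATE clicks SET count = count + 1 WHERE link_id = ?');
$stmt->execute(array($linkId));
?>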
Hi, I'm using AJAX to load all the pages into the main page, but I am not able to control the refresh. If somebody refreshes, the page returns back to the main page. Can anybody give me any solutions? I would really appreciate the help.
You could add an anchor (#something) to your URL and change it to something you can decode to a particular page state on every AJAX event.
Then, in body.onload, check the anchor and decode it back into that state.
The back button (at least in Firefox) will work all right too. If you want the back button to work in IE6, you should add some iframe magic.
Check the various JavaScript libraries designed to support the back button or history in an AJAX environment; this is probably what you really need. For example, the jQuery history plugin.
You can rewrite the current URL so it gives pointers to where the user was; see Facebook for examples of this.
I always store the 'current' state in the PHP session.
That way, the user can refresh at any time and the page will still be the same.
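A minimal sketch of that approach (script and key names are illustrative):
<?php
// ajax_handler.php (illustrative): remember which sub-page was last loaded
session_start();
$_SESSION['current_page'] = isset($_GET['page']) ? $_GET['page'] : 'home';
?>
<?php
// index.php (illustrative): on a full refresh, restore the last sub-page
session_start();
$page = isset($_SESSION['current_page']) ? $_SESSION['current_page'] : 'home';
// load $page into the main content area instead of the default
?>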
If somebody refreshes, the page returns back to the main page. Can anybody give me any solutions?
This is a feature, not a bug, in the browser. You need to change the URL for different pages. Nothing is worse than websites that use any kind of magic, either on the client side or the server side, which causes a bunch of completely different pages to use the same URL. Why? How the heck am I gonna link to a specific page? What if I like something and want to copy & paste the URL into an IM window?
In other words, consider the use cases. What constitutes a "page"? For example, if you have a website for stock quotes, should each stock have a unique URL? Yes. Should you have a unique URL for every variation you can make to the graph (e.g. logarithmic vs. linear)? It depends; if you don't, at least provide a "share this" link like Google Maps does, so you can have some kind of URL that you can share.
That all said, I agree with the suggestion to mess with the #anchor and parse it out. Probably the most elegant solution.
I own an image hosting site and would like to generate one popup per visitor per day. The easiest way for me to do this was to write a php script that called subdomains, like ads1.sitename.com
ads2.sitename.com
Unfortunately, most of my advertisers want to give me a block of JavaScript code to use rather than a direct link, so I can't just make the individual subdomains header redirects. I'd rather use the subdomains so that I can manage multiple advertisers without changing any code on the page, just code in my PHP admin page. Any ideas on how I can stick this JavaScript into the page so I don't need to worry about a blank ads1.sitename.com as well as the popup coming up?
I doubt you'll find much sympathy for help with pop-up ads.
How about appending a simple window.close() after the advertising code? That way their popup is displayed and your window closes neatly.
I'm not sure that I've ever had a browser complain that the window is being closed. This method has always worked for me. (IE, Firefox, etc.)
At the risk of helping someone who wants to deploy popup ads (which is bound to fail due to most popup blockers anyway), why can't you just have the subdomains load pages that load the block of Javascript the advertisers give you?
Hey, cut the guy some slack. Popups might not be very nice, but at least he's trying to reduce the amount of them. And popup blockers are going to fix most of it anyway. In any case, someone else might find this question with more altruistic goals (not sure how they'd fit that with popups, but hey-ho).
I don't quite follow your question, but here's some ideas:
Look into Server Side Includes (SSI) to easily add a block of javascript to each page (though you could also do it with a PHP include instead)
Do your advertiser choosing in your PHP script rather than calling the subdomains (see the sketch after this list)
Decipher the javascript to work out what it's doing and put a modified version in the subdomain page so it doesn't need an additional popup. Shouldn't be too hard.
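A rough sketch of that second idea: choose the advertiser's snippet in PHP and include it directly, instead of redirecting to an ads subdomain (file names are hypothetical, and this has to run before any output so the cookie can still be set):
<?php
// Show at most one popup snippet per visitor per day, picking a random advertiser
if (!isset($_COOKIE['popup_shown'])) {
    setcookie('popup_shown', '1', time() + 86400, '/'); // roughly one day
    $adSnippets = array('ads/advertiser1.html', 'ads/advertiser2.html'); // hypothetical snippet files
    include $adSnippets[array_rand($adSnippets)];
}
?>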