price comparison website - crawler - php

I have a price comparison website.
You can click on a link to an offer, and I'll get $1 from the shop.
The problem is: crawlers crawl the whole website, so they "click on the links".
How can I prevent them from clicking? JavaScript is a bad solution.
Thank you!

I've been thinking about this the wrong way.
I agree with everything that #yttriuszzerbus says above - add a robots.txt file, add "rel=nofollow" to the links, and block the user agents that you know about.
So if you've got someone who's now trying to click on a link, it's either a live person or a badly behaved bot that you don't want clicking.
So how about doing something strange to create the links to the shop sites? Normally you'd never, ever do this, as it makes your site impossible to index. But that's not an issue here - all the well-behaved bots won't be indexing those links, because they'll be obeying the robots.txt file.
I'm thinking of something like not having an <a href= tag in there at all - instead, generate the text of the link, add underlining to the font using a stylesheet so it looks like a link to a normal user, and then attach a JavaScript onclick function that redirects the user when they click on it. Bots won't see it as a link, and users won't notice a thing.
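A minimal sketch of that idea, assuming a hypothetical go.php redirect script on your own site that logs the click and forwards to the shop (the class name, file name and parameter are illustrative, not from the original post):
<?php
// Hypothetical: $shopId would come from your offer data.
$shopId = 123;
?>
<style>
/* Make the span look like an ordinary link */
.shop-link { text-decoration: underline; color: #00e; cursor: pointer; }
</style>
<span class="shop-link"
      onclick="window.location='/go.php?shop=<?php echo (int)$shopId; ?>';">
Visit this offer
</span>
Because there is no <a> tag and no href attribute, crawlers have nothing to follow, while the click still works for a normal visitor.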

You could:
Use "rel=nofollow" to instruct crawlers not to follow your links.
Block certain user-agent strings
Use robots.txt to exclude crawlers from parts of your site.
Unfortunately, none of the above will exclude badly-behaved crawlers. The only solution to actually prevent crawlers is some kind of JavaScript link or a CAPTCHA.

I also have a similar project.
My problem was solved only by blocking certain user-agent strings.
Another problem is that I don't know every "bad" user agent, so when a new crawler enters the site, I add it to the blacklist and retroactively remove its visits from the statistics.
"rel=nofollow" and robots.txt did not work at all for me.

Related

Stopping bots from registering on pay per click?

I'm writing a pay-per-click function on my site. It's fairly easy to add a link to a button:
http://www.mysite.com/advertLink?id=123
I could pick up the ID and redirect accordingly. But how do I stop Google and other bots from "clicking" on this link? I don't want my advertisers to be charged for clicks that are generated by bots.
Also, are there other types of traffic I should consider blocking? For example, I am considering blocking all traffic from outside my country from being registered as clicks, because this site is directed only at my country.
UPDATE
The nofollow and htaccess rules are a good start, but I was hoping there was a more foolproof way. I see, for example, on this site: www.pricecheck.co.za, that if you click on an ad, it takes you to a fancy forwarding page. I am curious as to what logic is on that page. It also looks like JavaScript is used somehow. See what I mean here:
http://www.pricecheck.co.za/offers/19453458/Apple+iPad+2+Black+64GB+9.7%22+Tablet+With+WiFi+&+3G/
Change your button to an <a> link and put rel="nofollow" on it, which should tell search engines not to follow the link. Alternatively, you could write the link out with JavaScript, and search engines normally won't follow it:
<script type="text/javascript">
document.write('<a href="http://www.mysite.com/advertLink?id=123">link</a>');
</script>
And, like hakan says, add rules to disallow it in your robots.txt.
You could also check the referrer in your script, to make sure the click actually came from a page on your site.
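A sketch of that referrer check in the advertLink script, assuming the mysite.com domain from the question (this only filters casual bots, since the Referer header can be spoofed or absent):
<?php
// Only count the click if the request appears to come from our own pages.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$host = parse_url($referer, PHP_URL_HOST);
if ($host !== 'www.mysite.com') {
    // No referrer, or a foreign one: forward without registering a paid click.
    header('Location: http://www.mysite.com/');
    exit;
}
// ...otherwise record the click for $_GET['id'] and redirect to the advertiser...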
Nice bots will read and respect your robots.txt. You can write something like
User-agent: *
Disallow: /advertLink

Redirecting outbound links and rel="nofollow" attribute - what is the difference?

As far as I know, many websites add a rel="nofollow" attribute to all outbound links inside their forum posts. As I understand it, that way they tell search robots not to use those links for ranking webpages. I've also noticed that some forums use an internal redirect (I'm not sure if this is the right term, though) for outgoing links. Let's say the forum URL is http://someforum.com. So when I post with a link
Hi this is [url="http://mysite.com"]my site[/url]
The link transforms to something like this
Hi this is <a href="http://someforum.com/redirect?url=http://mysite.com">my site</a>
I suspect that the meaning of this is the same as adding a rel="nofollow" attribute.
Am I right? If so, is there any sense in using this kind of redirection, and why not just use a rel="nofollow" attribute instead?
This kind of redirecting is used for several reasons. Here are some I am aware of:
tracking outgoing traffic leaving your own site
displaying a warning page that the user is leaving the site now with the ability to cancel within a few seconds and go back
The 2nd point gives you a chance to keep traffic on your site. There may also be legal reasons, in countries like Germany for example. In Germany you are responsible even for content that is not your own if you link to it, so you must check the linked content on a regular basis and warn users that it is not under your control. This can be done on such an extra redirect page.
I am not a lawyer but this is one of the most discussed internet-related legal issues here.
How the redirection is done determines whether ranking juice is passed to the recipient.
A 301 redirect will work almost like a direct link, with a little loss of ranking in the process.
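A minimal sketch of such a redirect page, assuming an illustrative /out.php?url=... entry point (the logging step and URL check are placeholders you would adapt):
<?php
// out.php - log the outgoing click, then 301 to the target.
$url = isset($_GET['url']) ? $_GET['url'] : '';
// Only forward absolute http(s) URLs; ideally also check against a
// whitelist of known targets so this can't be abused as an open redirect.
if (!preg_match('#^https?://#i', $url)) {
    exit('Invalid URL');
}
// ...log the outgoing click here (file, database, etc.)...
header('HTTP/1.1 301 Moved Permanently');
header('Location: ' . $url);
exit;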

How to hide a page url from bots/spiders?

On my website I have 1000 products, and they all have their own web pages, which are accessible by something like product.php?id=PRODUCT_ID.
On all of these pages, I have a link with the URL action.php?id=PRODUCT_ID&referer=CURRENT_PAGE_URL. So if I am visiting product.php?id=100, this URL becomes action.php?id=100&referer=/product.php?id=100. Clicking on this URL returns the user to the referer.
Now, the problem I am facing is that I keep getting false hits from spiders. Is there any way I can avoid these false hits? I know I can "disallow" this URL in robots.txt, but there are still bots that ignore it. What would you recommend?
Any ideas are welcome. Thanks
Currently, the easiest way of making a link inaccessible to 99% of robots (even those that choose to ignore robots.txt) is with JavaScript. Add some unobtrusive jQuery:
<script type="text/javascript">
$(document).ready(function() {
    // Copy each link's data-href attribute into its real href.
    $('a[data-href]').each(function() {
        $(this).attr('href', $(this).attr('data-href'));
    });
});
</script>
Then construct your links in the following fashion:
<a data-href="action.php?id=100&referer=/product.php?id=100">Click me!</a>
Because the href attribute is only written after the DOM is ready, robots won't find anything to follow.
Your problem consists of 2 separate issues:
multiple URLs lead to the same resource
crawlers don't respect robots.txt
The second issue is hard to tackle; see Detecting 'stealth' web-crawlers.
The first one is easier.
You seem to need an option to let the user go back to the previous page.
I'm not sure why you don't let the browser's history take care of this (through the use of the back button and JavaScript's history.back()), but there are enough valid reasons out there.
Why not use the referrer header?
Almost all common browsers send information about the referring page with every request. It can be spoofed, but for the majority of visitors this should be a working solution.
Why not use a cookie?
If you store CURRENT_PAGE_URL in a cookie, you can still use a single unique URL for each page, and still dynamically create breadcrumbs and back links based on the referrer stored in the cookie, without depending on the HTTP Referer value.
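A sketch of the cookie approach, with illustrative names (last_product, action.php) rather than anything from the original site:
<?php
// On each product page: remember the page the visitor is on.
if (isset($_GET['id'])) {
    setcookie('last_product', '/product.php?id=' . (int)$_GET['id'], 0, '/');
}
// Later, in action.php: build the back link from the cookie, with a
// safe fallback when the cookie is missing.
$back = isset($_COOKIE['last_product']) ? $_COOKIE['last_product'] : '/';
echo '<a href="' . htmlspecialchars($back) . '">Back to product</a>';
This keeps one canonical URL per page, so crawlers no longer see a separate action.php?...&referer=... variant for every product.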
You can use the robots.txt file to stop bots that comply with it.
The next thing you can do, once robots.txt is configured, is to check your server logs. Look for any user agents that seem suspicious.
Let's say you find evil_webspider_crawling_everywhere as a user agent; you can check for it in the headers of the request (sorry, no example, haven't used PHP in a long time) and deny access to the webspider.
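Filling in the example the answer skips: a sketch of that user-agent check in PHP, using the made-up agent string from the answer above:
<?php
// Deny access when the request carries the blacklisted user agent.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'evil_webspider_crawling_everywhere') !== false) {
    header('HTTP/1.1 403 Forbidden');
    exit('Access denied');
}
In practice you would keep a list of such strings and loop over it, extending the blacklist as new crawlers show up in the logs.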
Another option is to use PHP to detect bots visiting your page.
You could use this PHP function to detect the bot (this gets most of them):
function bot_detected() {
    return (
        isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
    );
}
And then echo the href links to the page only when you find that the user is not a bot:
if (bot_detected() === false) {
    echo '<a href="http://example.com/yourpage">yourpage</a>';
}
I don't believe you can stop user agents that don't obey your advice.
Before going down this route I would really want to ascertain that bots/spiders are actually a problem - doing anything that prevents natural navigation of your site should be seen as a last resort.
If you really want to stop spiders, you might want to consider using JavaScript in your links so that navigation only happens after the link is clicked. This should stop spiders.
Personally I'm not fussed about spiders or bots.

One address (www.domain.com) for the whole website? Do you recommend it for SEO?

Hello, I'm planning to develop a communication platform fully in Ajax and long-polling.
There will be no full page reloading!
So the website address would always be www.domain.com.
Do you recommend that for SEO?
Forget about SEO - what about your visitors? Will people be able to bookmark a page on your site and get back to where they want to be? Will they be able to email a link to their friends to show them something?
It's not just Google that likes to have direct URLs to visit. Those direct URLs are vital for SEO, but they're also important for your human visitors.
Google has a full specification on how to make ajax-powered sites like this crawlable.
The trick is to update window.location.hash with an escaped fragment whenever you want specific content to be linkable and treated as its own page, without having to reload. For example, Twitter rewrites its URIs from http://twitter.com/user to http://twitter.com/#!/user.
From an SEO standpoint these are both valid and will be regarded as separate pages. They can be directly linked to and used in browser history navigation. If you update your metadata (keywords, description etc.) and sitemaps accordingly, SEO will be the least of your worries.
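For context, the server side of that scheme: when Google sees #!user it re-requests the page as ?_escaped_fragment_=user, and you answer with a static HTML snapshot. A minimal sketch in PHP, with an assumed snapshots/ directory for the pre-rendered files:
<?php
// Serve a pre-rendered snapshot when the crawler asks for one.
if (isset($_GET['_escaped_fragment_'])) {
    // Sanitize the fragment so it can't escape the snapshots directory.
    $fragment = preg_replace('/[^a-zA-Z0-9_\-]/', '', $_GET['_escaped_fragment_']);
    $snapshot = __DIR__ . '/snapshots/' . $fragment . '.html';
    if (is_file($snapshot)) {
        readfile($snapshot);
        exit;
    }
}
// ...otherwise serve the normal Ajax application shell...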
As long as you can generate a fully qualified link for each page, you should be fine if you generate a sitemap including those links and submit it to Google.
If you look at Twitter and Facebook, they use #! in the URL so Google can still crawl the pages.
If it's mostly using Ajax for content population, loading and state changes, then it's probably a bad model for SEO purposes anyway. Somewhat of a moot point by nature, no?

Opening Javascript based popup ads on the same page

I own an image hosting site and would like to generate one popup per visitor per day. The easiest way for me to do this was to write a PHP script that calls subdomains, like:
ads1.sitename.com
ads2.sitename.com
Unfortunately, most of my advertisers want to give me a block of JavaScript code to use rather than a direct link, so I can't just make the individual subdomains header redirects. I'd rather use the subdomains, so that I can manage multiple advertisers without changing any code on the page, just code in my PHP admin page. Any ideas on how I can stick this JavaScript into the page so I don't need to worry about a blank ads1.sitename.com as well as the popup coming up?
I doubt you'll find much sympathy for help with pop-up ads.
How about appending a simple window.close() after the advertising code? That way their popup is displayed and your window closes neatly.
I'm not sure that I've ever had a browser complain that the window is being closed. This method has always worked for me. (IE, Firefox, etc.)
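A sketch of what an ads1.sitename.com page could look like with that approach. get_advertiser_snippet() is a hypothetical helper standing in for however the admin page stores each advertiser's JavaScript block:
<?php
// Hypothetical lookup of the advertiser's JS block for this subdomain.
$advertiserJs = get_advertiser_snippet();
?>
<html>
<body>
<?php echo $advertiserJs; // the block the advertiser supplied ?>
<script type="text/javascript">
// Once the ad code has opened its popup, close this now-blank window.
window.close();
</script>
</body>
</html>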
At the risk of helping someone who wants to deploy popup ads (which is bound to fail due to most popup blockers anyway), why can't you just have the subdomains load pages that load the block of Javascript the advertisers give you?
Hey, cut the guy some slack. Popups might not be very nice, but at least he's trying to reduce the amount of them. And popup blockers are going to fix most of it anyway. In any case, someone else might find this question with more altruistic goals (not sure how they'd fit that with popups, but hey-ho).
I don't quite follow your question, but here are some ideas:
Look into Server Side Includes (SSI) to easily add a block of JavaScript to each page (though you could also do it with a PHP include instead)
Do your advertiser choosing in your PHP script rather than calling the subdomains (see the sketch after this list)
Decipher the JavaScript to work out what it's doing, and put a modified version in the subdomain page so it doesn't need an additional popup. It shouldn't be too hard.
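A sketch combining the first two ideas: one PHP include on every page that picks an advertiser server-side, so no subdomain call is needed at all. The ads/ directory and file naming are illustrative:
<?php
// Pick one advertiser's JS snippet at random and emit it inline.
$snippets = glob($_SERVER['DOCUMENT_ROOT'] . '/ads/*.js.php');
if ($snippets) {
    include $snippets[array_rand($snippets)];
}
The admin page would then manage advertisers simply by adding or removing snippet files in that directory.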
