I have used URL shortener services, such as goo.gl or bit.ly, to shorten long URLs in my applications using their respective APIs. These APIs are very convenient; unfortunately, I have noticed that the long URL gets hit when they shorten it. Let me explain the issue a bit. Say, for instance, that I want users to validate something (such as an email address, or a confirmation), and my application offers them a link to visit in order to perform that validation. I take this long URL and use the API to shorten it. The target link (a PHP script, for example) gets hit when I call the shorten API, which makes the validation process useless.
One solution would be to add an intermediate button on the target page which the user has to click to confirm, but that adds another step to the validation process, which I would like to keep simple.
I would like to know if anyone has already encountered this problem, or if anyone has a clue how to solve it.
Thanks for any help.
I can't speak for Google, but at Bitly we crawl a portion of the URLs shortened via our service to support various product features (spam checking, title fetching, etc.), which is the cause of the behavior you are seeing.
In this type of situation we make two recommendations:
Use robots.txt to mark relevant paths as "disallowed". This is a light form of protection, as there's nothing forcing clients to respect robots.txt, but well-behaved bots like BitlyBot or GoogleBot will respect your robots.txt file (see the example after this list).
As mentioned by dwhite.me in a comment, and as you acknowledged in your post, it is usually best not to perform any state-changing actions in response to GET requests. As always, there's a judgment call on the risks involved versus the added complexity of a safer approach.
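For the first recommendation, a minimal robots.txt entry might look like this (the /validate.php path is just a placeholder for wherever your validation script lives):

User-agent: *
Disallow: /validate.php

For the second recommendation, one possible sketch is to serve a confirmation form on GET and only change state on POST, so that a crawler merely fetching the shortened link cannot trigger the validation. The database connection, table, and column names here are assumptions for illustration:

<?php
// validate.php (sketch): show a confirm button on GET, validate only on POST
$token = isset($_POST['token']) ? $_POST['token'] : (isset($_GET['token']) ? $_GET['token'] : '');

if ($_SERVER['REQUEST_METHOD'] === 'POST' && $token !== '') {
    // hypothetical PDO connection and schema
    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $pdo->prepare('UPDATE users SET validated = 1 WHERE validation_token = ?');
    $stmt->execute([$token]);
    echo 'Your address has been confirmed.';
} else {
    // GET: no state change, just a form the user has to submit
    echo '<form method="post">'
       . '<input type="hidden" name="token" value="' . htmlspecialchars($token) . '">'
       . '<button type="submit">Confirm my email address</button>'
       . '</form>';
}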
Related
A number of my pages are produced from results pulled from MySQL using $_GET. This means the URLs end up like this: /park.php?park_id=1. Is this a security issue, and would it be better to hide the query string from the URL? If so, how do I go about doing it?
Also, I have read somewhere that Google doesn't index URLs with a ?, which would be a problem since these are the main pages of my site. Is there any truth in this?
Thanks
It's only a security concern if this is sensitive information. For example, you send a user to this URL:
/park.php?park_id=1
Now the user knows that the park currently being viewed has a system identifier of "1" in the database. What happens if the user then manually requests this?:
/park.php?park_id=2
Have they compromised your security? If they're not allowed to view park ID 2, then this request should fail appropriately. But is it a problem if they happen to know that there's an ID of 1 or 2?
In either case, all the user is doing is making a request. The server-side code is responsible for appropriately handling that request. If the user is not permitted to view that data, deny the request. Don't try to stop the user from making the request, because they can always find a way. (They can just manually type it in. Even without ever having visited your site in the first place.) The security takes place in responding to the request, not in making it.
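A minimal sketch of that idea in PHP, assuming a hypothetical current_user_can_view() permission helper and an existing PDO connection in $pdo:

<?php
// park.php (sketch): authorize the request instead of hiding the ID
$parkId = isset($_GET['park_id']) ? (int) $_GET['park_id'] : 0;

if (!current_user_can_view('park', $parkId)) {   // hypothetical permission check
    http_response_code(403);
    exit('You are not allowed to view this park.');
}

$stmt = $pdo->prepare('SELECT name, description FROM parks WHERE park_id = ?');
$stmt->execute([$parkId]);
$park = $stmt->fetch(PDO::FETCH_ASSOC);

if ($park === false) {
    http_response_code(404);
    exit('No such park.');
}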
There is some data they're not allowed to know. But an ID probably isn't that data. (Or at least shouldn't be, because numeric IDs are very easy to guess.)
No, there is absolutely no truth to it.
ANY data that comes from a client is subject to spoofing, no matter whether it's in a query string, a POST form, or the URL. It's as simple as that...
As far as "Google doesn't index URLs with a ?" goes, whoever told you that has no clue what they are talking about. There are "SEO" best practices, but they have nothing to do with "Google doesn't index". It's MUCH more fine-grained than that. And yes, Google will index you just fine.
@David does show one potential issue with using an identifier in a URL. In fact, this has a very specific name: A4: Insecure Direct Object Reference.
Note that it's not that using the ID is bad; it's that you need to authorize the user for the URL. So granting permissions solely by which links you show the user is BAD. But if you also authorize them when they hit the URL, you should be fine.
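One common way to apply that authorization, sketched here with assumed table and column names, is to scope the query to the logged-in user so that a guessed ID simply returns nothing:

<?php
// sketch: the ID comes from the URL, but the query is scoped to the session user
session_start();   // assumes the user ID was stored in the session at login
$stmt = $pdo->prepare('SELECT * FROM documents WHERE id = ? AND owner_id = ?');
$stmt->execute([(int) $_GET['id'], $_SESSION['user_id']]);
$doc = $stmt->fetch(PDO::FETCH_ASSOC);

if ($doc === false) {
    http_response_code(404);   // same response whether it doesn't exist or isn't theirs
    exit;
}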
So no, in short, you're fine. You can go with "pretty URLs", but don't feel that you have to because of anything you posted here...
I have a couple of approaches in mind, but I can see problems with both. I don't need 100% accurate data; an 80% solution that allows me to make generalizations about the most popular domains I'm routing users to is fine.
Option 1 - Use PHP. Route links through a file, track.php, that makes sure the referring page is from my domain before tracking the click. This page then routes the user to the final intended URL. Obviously bots could spoof this, but do many? I could also check the user agent; again, I KNOW many bots spoof this.
Option 2 - Use JavaScript. Execute a JavaScript onclick function that writes the click to the database and then directs the user to the final URL.
Both of these methods feel like they may cause problems with crawlers following my outgoing links. What is the most effective method for tracking these outgoing clicks?
The most effective method for tracking outgoing links (it's used by Facebook, Twitter, and almost every search engine) is a "track.php" type file.
Detecting bots can be considered a separate problem, and the methods are covered fairly well by these questions: http://duckduckgo.com/?q=how+to+detect+http+bots+site%3Astackoverflow.com But doing a simple string search for "bot" in the User-Agent will probably get you close to your 80%* (and watching for hits to /robots.txt will, depending on the type of bot you're dealing with, get you 95%*).
*: a semi-educated guess, based on zero concrete data
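A minimal track.php along those lines might look like the sketch below. The whitelist of destination hosts, the database credentials, and the table name are assumptions, and the "bot" check is the rough 80% filter described above:

<?php
// track.php (sketch): skip obvious bots, log the click, then redirect
$url = isset($_GET['url']) ? $_GET['url'] : '';
$ua  = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// only redirect to destinations you actually link to (avoids an open redirect)
$allowedHosts = ['example.com', 'example.org'];   // assumption
if (!in_array(parse_url($url, PHP_URL_HOST), $allowedHosts, true)) {
    http_response_code(400);
    exit('Unknown destination');
}

// crude bot filter: skip logging when the user agent mentions "bot"
if (stripos($ua, 'bot') === false) {
    $pdo  = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');   // assumption
    $stmt = $pdo->prepare('INSERT INTO outgoing_clicks (url, user_agent, clicked_at) VALUES (?, ?, NOW())');
    $stmt->execute([$url, $ua]);
}

header('Location: ' . $url, true, 302);
exit;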
Well, Google Analytics and Piwik use JavaScript for that.
Since bots can't run JS, you'll only count humans. On the other hand, humans can disable JS too (but honestly, that's rarely the case).
Facebook, DeviantArt, WLM, etc. use server-side scripts to track clicks. I don't know how they filter bots, but a decent robots.txt plus one or two filters should be good enough to get you to 80%, I guess.
I am trying to create a script to get the number of backlinks to particular URLs. The method I am currently using is to query the Google Search API for link:example.com/foo/bar, which returns the number of results; I use that value to estimate the backlinks.
However, I am looking for alternative solutions.
The most basic approach would be to log $_SERVER['HTTP_REFERER'] on every incoming request, which is the URL of the page linking to your site. I'm sure there are some caveats to this approach (e.g. conditions under which Referer is not sent, and the potential for being spammed with bogus Referer URLs), but I can't speak to all of them. The Wikipedia page may be a good starting point.
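A bare-bones version of that logging, with a hypothetical backlinks table, might look like this:

<?php
// sketch: record the Referer of every incoming request (when one is sent)
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

// ignore empty referers and internal navigation within your own site
if ($referer !== '' && parse_url($referer, PHP_URL_HOST) !== $_SERVER['HTTP_HOST']) {
    $pdo  = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');   // assumption
    $stmt = $pdo->prepare('INSERT INTO backlinks (referer, landing_page, seen_at) VALUES (?, ?, NOW())');
    $stmt->execute([$referer, $_SERVER['REQUEST_URI']]);
}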
There are also pingbacks/trackbacks, but I wouldn't rely on them.
Pingbacks/trackbacks are used to detect hits from a particular website. They are manual rather than automatic, and they are only meaningful when a hit actually arrives through them.
However, the approach you have used so far involves a huge cache of links and backlinks.
Either there must be some kind of database tracking the connections between pairs of pages, or you must start building your own.
Use the databases that are available, or better, build a mashup of more than one. But if you want a robust system, verify each backlink from your end as well, and maintain the cache yourself; the cache should include the verified backlinks only (a rough sketch of the verification follows below).
I hope this works.
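As a rough illustration of the "verify the backlink" step mentioned above, one could fetch the claimed referring page and only cache it if it actually links back. The URLs below are placeholders, and a real system would use cURL with timeouts rather than a bare file_get_contents():

<?php
// sketch: confirm that $refererUrl really contains a link to $myUrl before caching it
function backlink_is_verified($refererUrl, $myUrl)
{
    $html = @file_get_contents($refererUrl);
    return $html !== false && stripos($html, $myUrl) !== false;
}

if (backlink_is_verified('http://example.com/some-page', 'http://mysite.com/foo/bar')) {
    // store it in the verified-backlinks cache (the schema is up to you)
}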
I think http://www.opensiteexplorer.org/ and their API might be of more help.
I'm interested in how custom profile URLs work for social network sites like Facebook, for example. When you look at someone's profile, it's
www.facebook.com/customname
or, if they haven't made one yet, it's
www.facebook.com/generatedname
Is there a GET request somewhere I'm missing? Is
www.facebook.com/profile.php?key=
hidden in the URL? But how does the server know to interpret the URL as a request for someone's profile page? How does it work? Thanks!
Yes, the request is usually hidden using rewrite engines such as mod_rewrite.
As such something like facebook.com/customname is rewritten to facebook.com/profile.php?key=customname, which then internally looks up the correct profile page from the database.
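A minimal .htaccess sketch of that kind of rewrite (the pattern and script name are assumptions, not Facebook's actual rules):

RewriteEngine On
# don't rewrite requests for files or directories that really exist (css, images, profile.php itself)
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# /customname -> /profile.php?key=customname (internal rewrite; the URL in the browser stays the same)
RewriteRule ^([A-Za-z0-9_.-]+)/?$ profile.php?key=$1 [L,QSA]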
There is a solution called mod_rewrite, which translates the URL visited by the user (and visible to the user) into the path of the actual script (along with all its parameters).
Example: when you visit e.g. http://www.facebook.com/ben, the server may actually translate it into www.facebook.com/profile.php?name=ben without you noticing (because it happens on the server side).
That is how it is done.
But there is another, loosely related solution that happens on the client side (within the user's browser, not on the server). It is called pushState and it is an HTML5 feature (HTML5 being the newer standard that supports application-like behaviour in modern browsers).
Just look at a pushState demonstration: it allows a page to change the URL and go back and forth, but if you type the displayed URL in directly, you will see that there is nothing at that address on the server. To do something similar, you will need to learn JavaScript (the language of scripts executed on the browser's side).
As an alternative to pushState, some pages (like Twitter and, as far as I recall, Facebook) use solutions based on the location hash (the part of the URL after #), which lets them maintain compatibility with older browsers, such as IE7.
Maybe this is far more than your question needed, but you should now be fairly well informed about how the URL visible to the user may differ from what is really invoked.
If you have any additional questions, let me know.
They probably use .htaccess or a similar mechanism to redirect all requests to a single entry file. That file starts processing the request and can also check to see if there is an account for customname that was specified on the url.
I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I were to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile them into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc. - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address (a rough sketch of this appears after this list).
Require JavaScript - to ensure the client has some semblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robots meta tags. This would stop conforming spiders. For instance, this will prevent Google from indexing you:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
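As a rough sketch of the first (rate-limiting) option, one could count requests per IP in a small table and refuse heavy hitters; the thresholds, credentials, and table name are assumptions:

<?php
// sketch: block IPs that request far more pages than a human plausibly would
$ip  = $_SERVER['REMOTE_ADDR'];
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');   // assumption

// requests from this IP in the last 10 minutes
$stmt = $pdo->prepare('SELECT COUNT(*) FROM request_log WHERE ip = ? AND requested_at > NOW() - INTERVAL 10 MINUTE');
$stmt->execute([$ip]);
$recent = (int) $stmt->fetchColumn();

if ($recent > 300) {               // arbitrary threshold
    http_response_code(429);
    exit('Too many requests, slow down.');
}

$pdo->prepare('INSERT INTO request_log (ip, requested_at) VALUES (?, NOW())')->execute([$ip]);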
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libraries in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all of them).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping (see the sketch after this answer).
Try using Flash or Silverlight for your frontend.
While this can't stop someone who's really determined, it would make things more difficult. If you're loading your data through services, you can always use a secure connection to prevent man-in-the-middle scraping.
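A sketch of the hash check in the second option, using PHP sessions; the token name is an assumption and the 10-second window matches the answer. The two <?php blocks below stand for two separate files:

<?php
// file 1, the page shell (sketch): issue a short-lived token for the AJAX call to echo back
session_start();
$_SESSION['data_token']    = bin2hex(random_bytes(16));
$_SESSION['data_token_ts'] = time();
// ...output the page shell, embedding the token where the JavaScript can read it...

<?php
// file 2, the data endpoint (sketch): refuse the AJAX request unless the token is present and fresh
session_start();
$ok = isset($_GET['token'], $_SESSION['data_token'])
    && hash_equals($_SESSION['data_token'], $_GET['token'])
    && (time() - (int) $_SESSION['data_token_ts']) <= 10;   // the 10-second window

if (!$ok) {
    http_response_code(403);
    exit;
}
// ...fetch the records and return them as JSON...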
Force a reCAPTCHA every 10 page loads for each unique IP.
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not scrapable.
He's asking for two incompatible things, and maybe having a discussion about his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to a bulletin board's troll protection? If a scrape is detected (perhaps a certain number of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone numbers or adding silly names to the name fields.
Turn this off for Google's IPs!
Normally, to screen-scrape a decent amount of data, one has to make hundreds or thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
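A crude sketch of the growing-delay idea, assuming a per-IP request counter is already kept somewhere (a session, APCu, or a table); the helper functions and thresholds here are hypothetical:

<?php
// sketch: slow down responses exponentially once an IP exceeds a threshold
$recent    = get_recent_request_count($_SERVER['REMOTE_ADDR']);   // hypothetical counter lookup
$threshold = 30;   // pages per minute still considered "human"

if ($recent > 2 * $threshold) {
    show_captcha_and_exit();   // hypothetical CAPTCHA page
}
if ($recent > $threshold) {
    $delay = min(30, pow(2, $recent - $threshold));   // 2s, 4s, 8s... capped at 30 seconds
    sleep((int) $delay);
}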
My suggestion would be that this is probably illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would be simply to include a link back to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.