Is there a way to detect in my script whether a request is coming from a normal web browser or from a script executing curl? I can see the headers and can distinguish using the User-Agent and a few other headers, but curl can set fake headers, so I am not able to reliably identify the request.
Please suggest ways of identifying curl or other similar non-browser requests.
The only way to catch most "automated" requests is to code in logic that spots activity that couldn't possibly come from a human using a browser.
For example: hitting pages too fast, filling out a form too fast, or referencing an external resource in the HTML (like a fake CSS file served through a PHP script) and checking whether the requesting IP downloaded it during an earlier stage of your site (kind of like a reverse honeypot). You would need to exclude certain IPs/user agents from being blocked, though, otherwise you'll block Google's web spiders.
This is probably the only way of doing it if curl (or any other automated script) is faking its headers to look like a browser.
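Below is a minimal sketch of the fake-CSS "reverse honeypot" idea described above, assuming you can reference a PHP-served stylesheet from your pages. The file names (fake-style.css.php, form-handler.php) and the flat-file storage are made up for illustration; a real implementation would want a database and some expiry of old IPs.
<?php
// fake-style.css.php - referenced from your pages as a stylesheet.
// Real browsers fetch linked stylesheets automatically; most naive scripts do not.
$seenFile = __DIR__ . '/seen_ips.txt';   // hypothetical storage location
file_put_contents($seenFile, $_SERVER['REMOTE_ADDR'] . PHP_EOL, FILE_APPEND | LOCK_EX);
header('Content-Type: text/css');
echo "/* intentionally empty */";

<?php
// form-handler.php - a later step: check whether this IP ever loaded the stylesheet.
$seenFile = __DIR__ . '/seen_ips.txt';
$seenIps  = file_exists($seenFile) ? file($seenFile, FILE_IGNORE_NEW_LINES) : [];
if (!in_array($_SERVER['REMOTE_ADDR'], $seenIps, true)) {
    // This client never requested the stylesheet: likely an automated request.
    http_response_code(403);
    exit('Automated request suspected.');
}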
Strictly speaking, there is no way.
There are indirect techniques, but I would never discuss them in public, especially on a site like Stack Overflow, which encourages screen scraping, content swiping, autoposting and all that dirty robot stuff.
In some cases you can use a CAPTCHA test to tell a human from a bot.
As far as I know, you can't tell the difference between a "real" call from your browser and one from curl.
You can compare the headers (User-Agent), but that's all I know.
Related
I have a webserver, and certain users have been retrieving my images using an automated script. I wish to redirect them to an error page, or give them an invalid image, only if it's a cURL request.
My image resides at http://example.com/images/AIDd232320233.png; is there some way I can route it with .htaccess to my controller's index function, where I can check whether it's an authentic request?
And my other question: how can I check browser headers to distinguish between most likely authentic requests and ones made with cURL?
Unfortunately, the short answer is 'no.'
cURL provides all of the necessary options to "spoof" any browser. That is to say, more specifically, browsers identify themselves via specific header information, and cURL provides all of the tools to set header data in whatever manner you choose. So, directly distinguishing two requests from one another is not possible.*
*Without more information. Common methods to determine if there is a Live Human initiating the traffic are to set cookies during previous steps (attempts to ensure that the request is a natural byproduct of a user being on your website), or using a Captcha and a cookie (validate someone can pass a test).
The simplest is to set a cookie, which will really only ensure that bad programmers don't get through, or programmers who don't want to spend the time to tailor their scraper to your site.
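As a rough sketch of the cookie idea (and of the .htaccess routing the question asks about), something along these lines could work. The rewrite rule, the cookie name site_visited and the protected-images directory are assumptions for illustration, not a drop-in solution:
<?php
// image-controller.php - reached via a hypothetical .htaccess rewrite such as:
//   RewriteRule ^images/(.+\.png)$ image-controller.php?file=$1 [L]
// Serve the image only if a cookie set on a normal page view is present.
$file = basename($_GET['file'] ?? '');
$path = __DIR__ . '/protected-images/' . $file;

if ($file === '' || !isset($_COOKIE['site_visited']) || !is_file($path)) {
    // No cookie (or unknown file): treat it as a scripted request.
    header('Location: /error.html');
    exit;
}

header('Content-Type: image/png');
header('Content-Length: ' . filesize($path));
readfile($path);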
The more tried and true approach is a Captcha, as it requires the user to interact to prove they have blood in their veins.
If the image is not a "download" but more of a piece of a greater whole (say, just an image on your site), a Captcha could be used to validate a human before giving them access to the site as a whole. Or if it is a download, it would be presented before unlocking the download.
Unfortunately, Captchas are "a pain," both to set up and for the end-user. They don't make a whole lot of sense for general-purpose access; there they are a little overboard.
For general-purpose stuff, you can really only throttle IPs, impose download limits and the like. And even then, there is nothing you can do if the requests are distributed. Them's the breaks, really...
We have certain action links which are one-time use only. Some of them do not require any action from the user other than viewing them. And here comes the problem: when you share such a link in, say, Viber, Slack or anything else that generates a preview of the link (or unfurls the link, as Slack calls it), it gets counted as used because it was requested.
Is there a reliable way to detect these preview generating requests solely via PHP? And if it's possible, how does one do that?
Not possible with 100% accuracy in PHP alone, as it deals with HTTP requests, which are quite abstracted from the client. Strictly speaking, you cannot even guarantee that the user has actually seen the response, even though it was legitimately requested by the user.
The options you have:
use checkboxes like "I've read this" (violates the no-action requirement)
use javascript to send an "I've read this" request without user interaction (violates the PHP-alone requirement)
rely on cookies: redirect the user with a Set-Cookie header, then redirect back to show the content and mark the URL as consumed (still not 100% guaranteed, and may result in infinite redirects for bots that follow 302 redirects but do not persist cookies)
rely on request headers (could work if you had a finite list of supported bots and all of them provided a signature; see the sketch below)
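For the last option, a minimal User-Agent check might look like this. The list of signatures is purely illustrative (preview bots change their strings), so verify it against your own access logs:
<?php
// Hypothetical helper for the "request headers" option above.
function isLinkPreviewBot(): bool
{
    $userAgent  = $_SERVER['HTTP_USER_AGENT'] ?? '';
    $signatures = ['Slackbot', 'Twitterbot', 'facebookexternalhit', 'WhatsApp', 'Viber', 'TelegramBot', 'Discordbot'];

    foreach ($signatures as $signature) {
        if (stripos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

// Usage: do not mark the one-time link as consumed for preview requests.
if (!isLinkPreviewBot()) {
    // markLinkAsUsed($token);   // your existing logic (hypothetical name)
}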
I've looked all over the internet to solve this problem, and I've found some workarounds to verify whether a request was made to generate a link preview.
Then, I've created a tool to solve it. It's on GitHub:
https://github.com/brunoinds/link-preview-detector
You only need to call a single method from the class:
<?php
require('path/to/LinkPreviewOrigin.php');
$response = LinkPreviewOrigin::isForLinkPreview();
//true or false
I hope this answers your question!
I am curious to know whether detecting the visitor's browser with a client-side script is more reliable than with a server-side script.
It is easy and popular to get the visitor's browser both in PHP and in Javascript. In the former, we analyze $_SERVER['HTTP_USER_AGENT'], which is sent in the request headers. However, the header is not always reliable. Can Javascript be more reliable, since it gets the visitor's browser from the visitor's machine?
I mean, is it possible for the User-Agent to be missing from the headers while Javascript can still detect the browser?
UPDATE: Please do not introduce methods such as jQuery, as I am familiar with them. I just want to know whether it's possible for the header's user agent to fail while Javascript can still detect the browser: a comparison of client-side and server-side methods.
The User-Agent can be tested server side or client side, either way it can be spoofed.
You can finger print the browser with JavaScript (seeing what methods and objects the browser provides) and use that to infer the browser, but that is less precise and JavaScript can be disabled / blocked / edited by the client.
So neither is entirely reliable.
It is generally a bad idea to do anything based on the identity of the browser, though.
OK. So the User-Agent header is not required by the RFC:
User agents SHOULD include this field with requests.
https://www.rfc-editor.org/rfc/rfc2616#section-14.43
Which means server-side detection is not guaranteed.
Similarly client side detection typically relies on navigator.userAgent but that is also provided by the user agent (browser or what not) and similarly cannot be guaranteed.
Thus the answer to your question is 50/50 :)
Now, if you are trying to figure out how to handle different browsers - feature detection is your safest bet here - but that's a different question ;)
I would just use the server side detection.
If a user wants to mask their browser, their browser will likely be masked on both ends.
If you want to find out their browser for HTML compatibility, they should be expecting mildly broken pages if they've masked their browser (but you should always try your best not to have browser-specific HTML). If it's for javascript compatibility, they should also be expecting some broken javascript.
Take a look at $.browser in jQuery (note that it was deprecated and removed in jQuery 1.9).
A different angle: why do we want to detect the browser?
In the case of analytics, there isn't much you can do really. Anyone that does a little research can send whatever user agent string they like, but who's going to go through all the trouble ;)
If we're talking about features to enable/disable on a website, you should really be going for feature detection. By focusing on what the browser can/can't do, instead of what it calls itself, you can generally expect that browser to perform whatever action reliably if the feature you need is present.
More info: http://jibbering.com/faq/notes/detect-browser/
One big advantage of using client-side javascript is that you can get much more information about the browser.
Here is an interesting example: https://panopticlick.eff.org/
I want to know whether a user is actually looking at my site (I know I can only tell that the page was loaded by the browser and displayed to a human, not that a human is actually looking at it).
I know two methods that will work.
Javascript.
If the page was loaded by a browser, it will run the JS code automatically, unless the browser forbids it. Then use AJAX to call back to the server.
A 1×1 transparent image in the HTML.
Use the img to call back to the server.
Does anyone know the pitfalls of these methods, or any better method?
Also, I don't know how to detect a 0×0 or 1×1 iframe, which could be used to defeat the above methods.
A bot can access a browser, e.g. http://browsershots.org
The bot can request that 1x1 image.
In short, there is no real way to tell. The best you could do is use a CAPTCHA, but then it degrades the experience for humans.
Just use a CAPTCHA where required (user sign up, etc).
The image way seems better, as Javascript might be turned off by normal users as well. Robots generally don't load images, so this should indeed work. Nonetheless, if you're just looking to filter out a known set of robots (say Google and Yahoo), you can simply check the HTTP User-Agent header, as those robots will actually identify themselves as robots.
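A minimal sketch of the tracking-image approach, with made-up file names (pixel.php, views.log). It only proves the client fetched the image, not that a human looked at the page:
<?php
// pixel.php - embed in the page as: <img src="pixel.php?page=home" width="1" height="1">
// Log who fetched the image, then serve a 1x1 transparent GIF.
$page = preg_replace('/[^a-zA-Z0-9_-]/', '', $_GET['page'] ?? 'unknown');
$line = date('c') . ' ' . $_SERVER['REMOTE_ADDR'] . ' ' . $page . PHP_EOL;
file_put_contents(__DIR__ . '/views.log', $line, FILE_APPEND | LOCK_EX);

header('Content-Type: image/gif');
// A commonly used base64-encoded 1x1 transparent GIF.
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');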
You can create a Google Webmasters account.
It tells you how to configure your site for bots, and it also shows how the robot will read your website.
I agree with others here, this is really tough - generally nice crawlers will identify themselves as crawlers, so using the User-Agent is a pretty good way to filter out those guys. A good source for user agent strings can be found at http://www.useragentstring.com. I've used Chris Schuld's PHP script (http://chrisschuld.com/projects/browser-php-detecting-a-users-browser-from-php/) to good effect in the past.
You can also filter these guys at the server level using the Apache config or .htaccess file, but I've found that to be a losing battle keeping up with it.
However, if you watch your server logs you'll see lots of suspect activity with valid (browser) user-agents or funky user-agents so this will only work so far. You can play the blacklist/whitelist IP game, but that will get old fast.
Lots of crawlers do load images (e.g. Google image search), so I don't think that will work all the time.
Very few crawlers will have Javascript engines, so that is probably a good way to differentiate them. And let's face it, how many users actually turn off Javascript these days? I've seen the stats on that, but I think those stats are very skewed by the sheer number of crawlers/bots out there that don't identify themselves. However, a caveat is that I have seen that the Google bot does run Javascript now.
So, bottom line, it's tough. I'd go with a hybrid strategy for sure - if you filter using user-agent, images, IP and javascript, I'm sure you'll catch most bots, but expect some to get through despite that.
Another idea: you could always use a known Javascript browser quirk to test whether the reported user-agent (if it claims to be a browser) really is that browser.
"Nice" robots like those from google or yahoo will usually respect a robots.txt file. Filtering by useragent might also help.
But in the end - if someone wants to gain automated access it will be very hard to prevent that; you should be sure it is worth the effort.
Inspect the User-Agent header of the http request.
Crawlers should set this to anything but a known browser.
Here are the Googlebot headers: http://code.google.com/intl/nl-NL/web/controlcrawlindex/docs/crawlers.html
In PHP you can get the user-agent with:
$Uagent = $_SERVER['HTTP_USER_AGENT'];
Then you just compare it with the known user-agent strings.
As a tip, preg_match() could be handy to do this all in a few lines of code.
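A short sketch of that tip, assuming you only care about a handful of well-known, self-identifying crawlers; the pattern is illustrative and by no means exhaustive:
<?php
// Match a few self-identifying crawler signatures in the User-Agent string.
$uagent = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (preg_match('/googlebot|bingbot|slurp|duckduckbot|baiduspider/i', $uagent)) {
    $isCrawler = true;   // treat as a (self-identified) crawler
} else {
    $isCrawler = false;  // looks like a browser, but could still be spoofed
}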
I am working with OpenID, just playing around making a class to interact with / authenticate OpenIDs on my site (in PHP). I know there are a few other libraries (like RPX), but I want to use my own (it helps me better understand the protocol and whether it's right for me).
The question I have relates to the OpenID discovery sequence. Basically I have reached the point where I am looking at using the XRDS doc to get the local identity (openid.identity) from the claimed identity (openid.claimed_id).
My question is: do I have to make a cURL request to get the XRDS location (X-XRDS-Location) and then make another cURL request to get the actual XRDS doc?
It seems like with a dumb request I only make one cURL request and get the OpenID server, but I have to make two to use the XRDS smart method. That just doesn't seem right; can anyone give me some more info?
To be complete, yes, your RP must HTTP GET on the URL the user gives you, and then search for an XRDS document reference and if found do another HTTP GET from there. Keep in mind that the XRDS may be hosted on a different server, so don't code up anything that would require the connection to be the same between the two requests since it might not be the same connection.
If in your initial HTTP GET request you include the HTTP header:
Accept: application/xrds+xml
Then the page MAY respond immediately with the XRDS document rather than an HTML document that you have to parse for an XRDS link. You'll be able to detect that this has occurred by checking the HTTP response header for application/xrds+xml in its Content-Type header. This is an optimization so that RPs don't typically have to make that second HTTP GET call -- but you can't rely on it happening.
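A rough sketch of that optimized first request using curl in PHP; the identifier URL is a placeholder, and error handling plus the HTML/XRDS parsing for the fallback case are left out:
<?php
// Discovery GET with the Accept header set, hoping for XRDS directly.
$ch = curl_init('https://example.com/identity');   // the user-supplied identifier (placeholder)
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTPHEADER     => ['Accept: application/xrds+xml'],
]);
$body        = curl_exec($ch);
$contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);

if ($contentType !== false && stripos((string) $contentType, 'application/xrds+xml') !== false) {
    // Lucky: the server returned the XRDS document directly.
    $xrds = $body;
} else {
    // Otherwise, look for an X-XRDS-Location header or a <meta> tag in the HTML,
    // then issue a second GET to fetch the XRDS document from that URL.
}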
The best advice I can give you is to abstract your HTTP requesting a little bit, and then just go through the entire process of doing an HTTP request twice.
You can keep your curl instances around if you want to speed things up using persistent connections, but that may or may not be what you want.
I hope this helps, and good luck.. OpenID is one of the most bulky and convoluted web standards I've come across since WebDAV =)
Evert
I know I'm late to the game here, but I think you should also check out the WebFinger protocol. It takes the standard "email as userid" pattern and lets you do a lookup from there to discover OpenID, etc.