I have a webserver, and certain users have been retrieving my images using an automated script. I wish to redirect them to an error page or give them an invalid image, but only if the request comes from cURL.
My image resides at http://example.com/images/AIDd232320233.png. Is there some way I can route it with .htaccess to my controller's index function, where I can check whether it's an authentic request?
And my other question: how can I check browser headers to distinguish between most likely authentic requests and ones made with cURL?
Unfortunately, the short answer is 'no.'
cURL provides all of the options necessary to "spoof" any browser. More specifically, browsers identify themselves via specific header information, and cURL lets you set that header data to whatever you choose. So, directly distinguishing the two kinds of requests from one another is not possible.*
*Without more information, that is. Common methods of determining whether a live human is initiating the traffic are to set cookies during previous steps (an attempt to ensure that the request is a natural byproduct of a user being on your website), or to use a Captcha plus a cookie (validating that someone can pass a test).
The simplest approach is to set a cookie, which will really only keep out bad programmers, or programmers who don't want to spend the time tailoring their scraper to your site.
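For what it's worth, a minimal sketch of that cookie gate could look like this, assuming the image URL is rewritten to a PHP front controller; the cookie name img_ok and the file parameter are invented for illustration:

<?php
// Sketch only: serve the image just to visitors who already carry the cookie
// that ordinary page views set. 'img_ok' and the 'file' parameter are made up.

// Somewhere in your regular page controller:
setcookie('img_ok', '1', time() + 3600, '/');

// In the controller that .htaccess routes image requests to:
if (empty($_COOKIE['img_ok'])) {
    header('HTTP/1.1 403 Forbidden');
    exit('Not allowed');
}

$path = __DIR__ . '/images/' . basename($_GET['file'] ?? ''); // basename() blocks path traversal
if (!is_file($path)) {
    header('HTTP/1.1 404 Not Found');
    exit;
}
header('Content-Type: image/png');
readfile($path);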
The more tried and true approach is a Captcha, as it requires the user to interact to prove they have blood in their veins.
If the image is not a "download" but more of a piece of a greater whole (say, just an image on your site), a Captcha could be used to validate a human before giving them access to the site as a whole. Or if it is a download, it would be presented before unlocking the download.
Unfortunately, Captchas are "a pain," both to set up and for the end user. They don't make a whole lot of sense for general-purpose access; they are a little overboard.
For general-purpose stuff, you can really only throttle IPs, enforce download limits and the like. And even then, there is nothing you can do if the requests are distributed. Them's the breaks, really...
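If you do go the throttling route, a crude per-IP limiter is easy enough to sketch; the numbers and the temp-file storage below are arbitrary choices, not a recommendation:

<?php
// Minimal per-IP download throttle sketch. The limit (30 requests / 10 minutes)
// and the flat-file counter storage are arbitrary choices for illustration.
$ip     = $_SERVER['REMOTE_ADDR'];
$window = 600;   // seconds
$limit  = 30;    // downloads allowed per window
$file   = sys_get_temp_dir() . '/dl_' . md5($ip) . '.json';

$data = is_file($file) ? json_decode(file_get_contents($file), true) : null;
if (!$data || time() - $data['start'] > $window) {
    $data = ['start' => time(), 'count' => 0];
}
$data['count']++;
file_put_contents($file, json_encode($data));

if ($data['count'] > $limit) {
    header('HTTP/1.1 429 Too Many Requests');
    exit('Slow down.');
}
// ...otherwise serve the image as usual.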
Related
We have certain action links which are one-time use only. Some of them do not require any action from the user other than viewing them. And here comes the problem: when you share one in, say, Viber, Slack or anything else that generates a preview of the link (or unfurls the link, as Slack calls it), it gets counted as used, since it was requested.
Is there a reliable way to detect these preview generating requests solely via PHP? And if it's possible, how does one do that?
Not possible with 100% accuracy in PHP alone, as it deals with HTTP requests, which are quite abstracted from the client. Strictly speaking, you cannot even guarantee that the user has actually seen the response, even though it was legitimately requested by the user.
The options you have:
use checkboxes like "I've read this" (violates no-action requirement)
use javascript to send "I've read this" request without user interaction (violates PHP alone requirement)
rely on cookies: redirect the user with a Set-Cookie header, then redirect back to show the content and mark the URL as consumed (still not 100% guaranteed, and may result in infinite redirects for bots that follow 302 redirects but do not persist cookies)
rely on request headers (could work if you had a finite list of supported bots, and all of them provide a signature; see the sketch below)
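To illustrate that last option: if you only care about a handful of well-known preview bots, a User-Agent signature check along these lines might be enough (the signature list is an incomplete example, not authoritative):

<?php
// Rough sketch: treat the request as a link-preview bot if the User-Agent
// contains one of a few known signatures. The list is incomplete by design.
function isPreviewBot(): bool
{
    $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
    foreach (['Slackbot', 'TelegramBot', 'WhatsApp', 'Viber', 'facebookexternalhit', 'Twitterbot'] as $sig) {
        if (stripos($ua, $sig) !== false) {
            return true;
        }
    }
    return false;
}

if (isPreviewBot()) {
    // Serve the preview, but do not mark the one-time link as used.
}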
I've searched all over the internet to solve this problem, and I've found some workarounds to verify whether the request is for link preview generation.
Then, I've created a tool to solve it. It's on GitHub:
https://github.com/brunoinds/link-preview-detector
You only need to call a single method from the class:
<?php
require 'path/to/LinkPreviewOrigin.php';

$response = LinkPreviewOrigin::isForLinkPreview(); // true or false
I hope this solves your question!
I developed a PHP application whose main purpose is to fetch data from a database. I want to prevent all the records from being fetched via machine requests (I mean requests made by non-humans, i.e. some mechanism like cURL; you generally prevent such requests via CAPTCHA).
How can I let only search engines grab my data, and no one else, without a noticeable usability cost?
related: Preventing non-human generated requests
To open your question, I clicked the link, and my browser made the request to the Stack Overflow server and asked for this page. That's the same thing cURL does... except cURL can't handle JavaScript. But then again, I didn't parse the JavaScript on behalf of my browser either; that too was done by a program.
What I really need to emphasize is that there is virtually no way you can prevent a machine from faking user activity.
But here are some tricks if you are interested. Personally, I prefer methods that don't involve the human directly.
Add a captcha challenge to pages.
If your target audience mostly uses modern browsers, use some Ajax page loading. This will keep out most low-end scrapers, but not all. Google can process some Ajax requests; see hashbangs.
Log users' IP addresses and look for ones with several thousand hits in a short time.
Add some flood control to the site. You can prevent a form submission (for example) from being processed more than once a minute.
Add tokens to the form and validate them. This will at least make the crawling a two-step process (see the sketch after this list).
And make your site fetch only a little data from the database per request. For example, if your application is a calendar, you can reject any request for a date range longer than a year.
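As an illustration of the form-token item above, here is a minimal session-based sketch (field names are arbitrary, and it assumes PHP 7+ for random_bytes):

<?php
session_start();

// Step 1: when rendering the search form, embed a one-time token.
if (empty($_POST)) {
    $_SESSION['form_token'] = bin2hex(random_bytes(16));
    echo '<form method="post">
            <input type="hidden" name="token" value="' . $_SESSION['form_token'] . '">
            <input type="text" name="q"> <button>Search</button>
          </form>';
    exit;
}

// Step 2: only process the submission if the token matches, then discard it.
if (!isset($_POST['token'], $_SESSION['form_token'])
    || !hash_equals($_SESSION['form_token'], $_POST['token'])) {
    header('HTTP/1.1 400 Bad Request');
    exit('Invalid token');
}
unset($_SESSION['form_token']);
// ...run the database query here.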
You can't block bots by their user agent. cURL and other programs can send whatever user agent the caller specifies when making the request.
You can adjust how Googlebot should behave in Google Webmaster Central. Try to match that with your flood control mechanism.
And remember, Google advises you not to depend on its user agent.
I am currently researching the best way to share the same session across two domains (for a shared shopping cart / shared account feature). I have decided on two of the following three approaches:
Every 15 minutes, send a one-time-only token (made from a secret plus the user's IP/user agent) to "sync the sessions" using one of the following:
an img src tag:
<img src="http://domain-two.com/sessionSync.png?token=urlsafebase64_hash">
This displays an empty 1x1-pixel image and starts a remote session with the same session ID on the remote server. The "png" is actually a PHP script with some mod_rewrite action (see the sketch below).
Drawbacks: what if images are disabled?
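For the record, the sessionSync "png" on domain-two.com could be sketched roughly like this; validateToken() and sessionIdFromToken() are placeholders for however you sign and map your token, and the GD extension is assumed for the 1x1 pixel:

<?php
// sessionSync.php on domain-two.com, which mod_rewrite maps from sessionSync.png.
// validateToken() and sessionIdFromToken() are placeholders for your own token logic.
$token = $_GET['token'] ?? '';
if (validateToken($token)) {
    session_id(sessionIdFromToken($token)); // reuse the shared session ID
    session_start();                        // sends the session cookie for this domain
}

// Always answer with a tiny 1x1 PNG so the <img> renders harmlessly (GD assumed).
header('Content-Type: image/png');
$im = imagecreatetruecolor(1, 1);
imagepng($im);
imagedestroy($im);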
a succession of 302 redirects (almost the same as above, just sending the token via 302s instead):
redirect to domain-2.com/sessionSync.php?token=urlsafebase64_hash
then, from domain-2.com/sessionSync, set (or refresh) the session and redirect back to domain-1.com to continue the original request (sketched below).
Question: What does Google think about this in terms of SEO/PageRank? Will their bots have issues crawling my site properly? Will they think I am trying to trick the user?
Drawbacks: 3 requests before a user gets a page load, which is slower than the IMG technique.
Advantages: Almost always works?
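A matching sketch of the 302 variant of sessionSync.php, with the same placeholder helpers as above and an assumed return parameter for the bounce-back URL:

<?php
// sessionSync.php on domain-2.com for the 302 variant; same placeholder helpers as above.
$token = $_GET['token'] ?? '';
if (validateToken($token)) {
    session_id(sessionIdFromToken($token));
    session_start();                 // the Set-Cookie for this domain goes out here
}

// Bounce back to domain-1. Validate $back against a whitelist in real code
// to avoid an open redirect; 'return' is an assumed parameter name.
$back = $_GET['return'] ?? 'http://domain-1.com/';
header('Location: ' . $back, true, 302);
exit;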
use JSONP to do the same as above.
Drawbacks: won't work if JavaScript is disabled. I am avoiding this option particularly because of this.
Advantages: callback function on success may be useful (but not really in this situation)
My questions are:
What will Google think of using 302s as stated in example 2 above? Will they punish me?
What do you think the best way is?
Are there any security considerations introduced by any of these methods?
Am I not realizing something else that might cause problems?
Thanks for all the help in advance!
Just some ideas:
You could use the JSONP approach and use the <noscript> tag to fall back to the 302-chain mode.
You won't find a lot of JS-disabled clients among the human part of your web clients.
But the web crawlers will mostly fall into the 302-chain mode, and if you care about them, you could implement some user-agent checks in sessionSync to give them specific instructions; for example, give them a 301 permanent redirect. Your session synchronisation needs probably don't apply to web crawlers, so maybe you can redirect them permanently (so only the first time) without handling any session synchronisation for them. It depends on your implementation of the 302 chain, but you could also set something in the crawler's session to let it crawl domain-1 without any check on domain-2: the redirect depends on the URLs you generate on the page, and you could keep something in the session to prevent the domain-2 redirect when generating those URLs.
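A tiny sketch of that crawler shortcut inside sessionSync.php (the bot pattern is only an example):

<?php
// Inside sessionSync.php: skip the sync for known crawlers and 301 them instead.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (preg_match('/Googlebot|bingbot|Slurp/i', $ua)) {
    header('Location: http://domain-1.com/', true, 301); // crawlers never enter the 302 chain
    exit;
}
// ...normal token check and redirect for human visitors continues here.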
I have a script that uses JSONP to make cross-domain Ajax calls. This works great, but my question is: is there a way to prevent other sites from accessing and getting data from these URLs? I would basically like to make a list of allowed sites and only return data if the requester is on the list. I am using PHP and figured I might be able to use "HTTP_REFERER", but I have read that some browsers will not send this info. Any ideas?
Thanks!
There really is no effective solution. If your JSON is accessible through the browser, then it is equally accessible to other sites. To the web server, a request originating from a browser and one originating from another server are virtually indistinguishable aside from the headers. As ILMV commented, referrers (and other headers) can be falsified. They are, after all, self-reported.
Security is never perfect. A sufficiently determined person can overcome any security measure in place, but the goal of security is to create a deterrent high enough that most people are dissuaded from putting in the time and resources necessary to compromise it.
With that thought in mind, you can create a barrier to entry high enough that other sites probably won't bother making requests. You can generate single-use tokens that are required to grab the JSON data; once a token is used, it is invalidated. To obtain a token, the web page must first be requested, with the token embedded within the page's JavaScript, and that token is then put into the Ajax call for the JSON data. Combine this with time-expiring tokens and sufficient obfuscation in the JavaScript, and you've created a high enough barrier.
Just remember, this isn't impossible to circumvent. Another website could extract the token from the JavaScript, or intercept the Ajax call and hijack the data at multiple points.
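Here is a minimal sketch of that single-use token flow, collapsed into one script for brevity; the action and token parameter names are made up, and PHP 7+ is assumed:

<?php
// ?action=page serves the page with an embedded token; ?action=data serves the JSON once.
session_start();

if (($_GET['action'] ?? 'page') === 'page') {
    $token = bin2hex(random_bytes(16));
    $_SESSION['json_token']    = $token;
    $_SESSION['json_token_ts'] = time();
    echo "<script>var jsonToken = '$token'; // passed along in the Ajax call</script>";
    exit;
}

// action=data: validate, then immediately invalidate, the token.
$valid = isset($_GET['token'], $_SESSION['json_token'])
      && hash_equals($_SESSION['json_token'], $_GET['token'])
      && (time() - $_SESSION['json_token_ts']) < 300;   // 5-minute expiry
unset($_SESSION['json_token'], $_SESSION['json_token_ts']); // single use: gone either way

if (!$valid) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
header('Content-Type: application/json');
echo json_encode(['secret' => 'data']);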
Do you have access to the servers/sites that you would like to give access to the JSONP?
What you could do, although it's not ideal, is to add a record to a database on page load with the IP that is allowed to view the JSONP, then on the JSONP load, check whether that record exists. Perhaps put an expiry on the record if appropriate.
e.g.
http://mysite.com/some_page/ - user loads page, add their IP to the database of allowed users
http://anothersite.com/anotherpage - as above, add to database
load JSONP, check the IP exists in the database.
After one hour, delete the record from the database, so another page load would be required, for example.
This could quite easily be worked around, though: if the scraper (or other site) managed to work out what method you are using to allow users to view the JSONP, they'd only have to hit the page first.
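A rough sketch of the idea, assuming MySQL via PDO; the allowed_ips table, its columns and the DSN are all invented for illustration:

<?php
// Sketch of the IP allow-list idea. REPLACE INTO assumes MySQL and a unique key on ip.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$ip  = $_SERVER['REMOTE_ADDR'];

// On the normal page load: remember that this IP may fetch the JSONP for an hour.
$pdo->prepare('REPLACE INTO allowed_ips (ip, expires_at) VALUES (?, ?)')
    ->execute([$ip, time() + 3600]);

// In the JSONP endpoint: only answer if a fresh record exists.
$stmt = $pdo->prepare('SELECT 1 FROM allowed_ips WHERE ip = ? AND expires_at > ?');
$stmt->execute([$ip, time()]);
if (!$stmt->fetchColumn()) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

$callback = preg_replace('/[^\w.]/', '', $_GET['callback'] ?? 'callback'); // sanitize callback name
header('Content-Type: application/javascript');
echo $callback . '(' . json_encode(['ok' => true]) . ');';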
How about using a cookie that holds a token sent with every JSONP request?
Depending on the setup, you can also use a variable if you don't want to use cookies.
Working with importScripts from a Web Worker is much the same as JSONP.
Do a double check, as theAlexPoon said: main script to web worker, web worker to server and back, with a security query. If the web worker answers the main script without being asked, or with the wrong token, it's better to send your website off to nirvana. If the server is asked with the wrong token, don't answer. Cookies will not be sent with an importScripts request, because document is not available at the web worker level. Always send security-relevant cookies with a POST request.
But there are still a lot of risks. The man in the middle knows how.
I'm certain you can do this with htaccess -
Ensure the headers are sending "HTTP_REFERER" - I don't know of any browser that won't send it if you tell it to (if you're still worried, fall back gracefully).
Then use htaccess to allow/deny access from the right referer.
# allow only requests whose referer matches domain.com, deny everyone else
SetEnvIf Referer "domain\.com" allowed_referer
Order Deny,Allow
Deny from all
Allow from env=allowed_referer
Is there a way to detect in my script whether the request is coming from a normal web browser or from some script executing cURL? I can see the headers and can distinguish using the User-Agent and a few other headers, but fake headers can be set in cURL, so I am not able to track the request.
Please suggest ways of identifying cURL or other similar non-browser requests.
The only way to catch most "automated" requests is to code in logic that spots activity that couldn't possibly be human with a browser.
For example: hitting pages too fast, filling out a form too fast, or including an external resource in the HTML (like a fake CSS file served through a PHP script) and checking whether the requesting IP downloaded it during the previous stage of your site (kind of like a reverse honeypot). You would need to exclude certain IPs/user agents from being blocked, though, otherwise you'll block Google's web spiders.
This is probably the only way of doing it if curl (or any other automated script) is faking its headers to look like a browser.
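The fake-CSS "reverse honeypot" mentioned above could be sketched like this; the file names and flat-file storage are invented for illustration:

<?php
// "Reverse honeypot" sketch: a stylesheet URL is rewritten to this PHP script,
// which records the IPs that fetched it; later pages require that record.

function honeypotMark(string $ip): void
{
    touch(sys_get_temp_dir() . '/css_' . md5($ip));   // "this IP loaded the CSS"
}

function honeypotSeen(string $ip, int $maxAge = 3600): bool
{
    $f = sys_get_temp_dir() . '/css_' . md5($ip);
    return is_file($f) && (time() - filemtime($f)) < $maxAge;
}

// Fake CSS endpoint (e.g. /style.css rewritten here): mark the IP, serve empty CSS.
honeypotMark($_SERVER['REMOTE_ADDR']);
header('Content-Type: text/css');
echo '/* intentionally empty */';

// In a later page/controller (a separate script), require the record, remembering
// to whitelist known crawler IPs/user agents before blocking anyone:
//   if (!honeypotSeen($_SERVER['REMOTE_ADDR'])) { header('HTTP/1.1 403 Forbidden'); exit; }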
Strictly speaking, there is no way.
There are indirect techniques, but I would never discuss them in public, especially on a site like Stack Overflow, which encourages screen scraping, content swiping, autoposting and all this dirty robot stuff.
In some cases you can use CAPTCHA test to tell a human from a bot.
As far as I know, you can't see the difference between a "real" call from a browser and one from cURL.
You can compare the User-Agent header, but that's all I know.