Spammers in Magento website - PHP

We are facing an issue with a huge number of online customers from the same IP address, which is consuming a large share of our server's CPU. We have already installed Mage Firewall, but we have to blacklist the offending IPs manually.
Is there any way to reduce the CPU usage caused by spam users and hackers who send traffic that is irrelevant to the website?
We have already enabled the Magento cache and a full-page cache extension in Magento.
What more can we do to protect our Magento website from malicious traffic and free up CPU for other processes?

No real solution, but some things that probably have to be considered:
First, I would check what defines a spammer in your case.
How many times does the spammer perform a certain action? Does he follow a particular interaction pattern?
If a user performs an action y that is known to be done by spammers, you could start tracking repetitions of y; after x occurrences you could block the user.
The difficulty here is finding a difference in usage pattern between a spammer and a regular user who is probably just doing things fast.
The reason you should look for a spammer pattern is that this way you don't need to save every user's IP address. Sure, you could store every IP, so that you don't need a pattern and only have to check how often each address interacts, but that will fill up the database pretty fast.
By tracking and saving I mean storing the user's IP in a database. So you have to find an approach that stores a minimum of false positives (good users), ideally zero; otherwise good users get blacklisted for no reason.
Maybe the approach with the least implementation effort and the minimum risk is keeping things the way they are and blacklisting spammers manually. That, on the other hand, is ongoing effort for you.
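The repeated-action tracking described above could be sketched like this (the class name and threshold are assumptions for illustration; in production the counts would be persisted in a database table and expired over a time window so the table does not grow without bound):

```php
<?php
// Minimal sketch of the repeated-action tracking described above.
// The counts are kept in memory here; a real deployment would store
// them in a database keyed by IP and action.
class ActionTracker
{
    /** @var array<string,int> occurrences per "ip|action" key */
    private $counts = [];
    private $limit;

    public function __construct(int $limit = 10)
    {
        $this->limit = $limit;
    }

    // Record one occurrence of a suspicious action for an IP.
    // Returns true once the IP should be blocked.
    public function record(string $ip, string $action): bool
    {
        $key = $ip . '|' . $action;
        $this->counts[$key] = ($this->counts[$key] ?? 0) + 1;
        return $this->counts[$key] > $this->limit;
    }
}
```

The threshold (x in the text above) has to be tuned so that fast-clicking legitimate users stay below it.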

Related

Block automated request to page

I've developed a site with a basic search function. The site accepts input in GET params and is hosted on a shared hosting server, so there are limits on SQL execution.
My goal is to stop, or at least lower the chance of, processing automated search queries, so the site does not hit the SQL limit on bogus searches where there is no real user. To prevent this I've used a CSRF token on the landing page from which the search is initiated.
What else can I try to make sure that the search is performed only for real users and not for automated/bot searches? I've thought of CAPTCHAs, but asking to confirm a CAPTCHA for every search query would make the experience much worse.
Welcome to the eternal dichotomy between usability and security. ;)
Many of the measures used to detect and block bots also impact usability (such as the extra steps required by opaque CAPTCHAs), and none of them solve the bot problem 100% either (think CAPTCHA farms).
The trick is to use a reasonable mix of controls while impacting the user experience as little as possible.
A good combination that I have used with success to protect high-cost functions on high-volume sites is:
CSRF: a good basic measure in itself to stop blind script submission, but it won't slow down or stop a sophisticated attacker at all;
Response caching: search engines tend to get used for the same thing repeatedly, so by caching the answers to common searches you avoid making the SQL request altogether (which avoids the resource consumption, and also speeds up regular usage);
Source throttling: track the source IP and restrict the number of operations within a reasonable window (not rejecting requests outside this, just queueing them and so throttling the volume to a reasonable level); and
Transparent CAPTCHA: something like Google's reCAPTCHA v3 in invisible mode will help you drop a lot of the automated requests without impacting the user experience.
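The response-caching point above can be sketched in PHP (a sketch only: the cache directory, TTL and function name are assumptions, not an existing API):

```php
<?php
// Sketch of response caching for a search endpoint: cache the results of
// common queries on disk so repeated searches skip the SQL query entirely.
// Cache location and TTL are illustrative assumptions.
function cached_search(string $query, callable $runSql,
                       string $dir = '/tmp/search-cache', int $ttl = 300): array
{
    @mkdir($dir, 0700, true);
    $file = $dir . '/' . sha1($query) . '.json';

    // Serve from cache while the entry is fresh.
    if (is_file($file) && time() - filemtime($file) < $ttl) {
        return json_decode(file_get_contents($file), true);
    }

    $results = $runSql($query);            // the expensive SQL call
    file_put_contents($file, json_encode($results));
    return $results;
}
```

On shared hosting this kind of file cache is usually available even when memcached or APCu is not.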
You could also look at implementing your search against an XML file instead of the database. That would let searches run as often as users like without touching the limited SQL resource.

How to prevent excessive site visits (suspected screen scraping) from hackers?

I have a website that has been hacked once to have its database stolen. I think it was done by an automated process that simply accessed the visible website using a series of searches, in the style of 'give me all things beginning with AA', then 'with AB', then 'with AC', and so on. The reality is a little more complicated than this, but that illustrates the principle of the attack. I found the thief and am now taking steps against them, but I want to prevent more of this in the future.
I thought there must be some ready-made PHP scripts (PHP is what I use) out there for this. Something that, for instance, recorded the IP addresses of the last (say) 50 visitors, tracked the frequency of their requests over the last (say) 5 minutes, and banned them for (say) 24 hours if they exceeded a certain threshold of requests. However, to my amazement I can find no such class, library or example of code intended for this purpose anywhere online.
Am I missing a trick, or is there a solution here - like the one I imagine, or maybe an even simpler and more effective safeguard?
Thanks.
There are no silver bullets. If you are trying to brainstorm possible workarounds and solutions, none are particularly easy, but here are some things to consider:
Most screen scrapers use curl to do their dirty work. There is some discussion, such as here on SO, about whether blocking based on the User-Agent header (or the lack of one) is a good way to prevent screen scraping. If it helps at all it is probably a good idea (Google does it to stop sites from scraping them), but because User-Agent spoofing is trivial, this measure can be overcome fairly easily.
Log user requests. If you notice an outlier far beyond your average number of user requests (it is up to you to determine what is unacceptable), you can serve them an HTTP 500 error until they revert to an acceptable range.
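A minimal sketch of that request logging, assuming an in-memory store (in production the timestamps would live in a shared store such as a database or APCu, and the window and threshold would be tuned to your traffic):

```php
<?php
// Sketch of per-IP request logging: count requests per IP in a sliding
// window; once an IP exceeds the threshold, the caller can answer HTTP 500.
// Window length and threshold are illustrative assumptions.
class RequestLog
{
    /** @var array<string,int[]> ip => request timestamps */
    private $hits = [];
    private $window;
    private $threshold;

    public function __construct(int $window = 300, int $threshold = 100)
    {
        $this->window = $window;
        $this->threshold = $threshold;
    }

    public function allow(string $ip, int $now): bool
    {
        // Drop timestamps that fell out of the window, then record this hit.
        $this->hits[$ip] = array_values(array_filter(
            $this->hits[$ip] ?? [],
            fn ($t) => $now - $t < $this->window
        ));
        $this->hits[$ip][] = $now;
        return count($this->hits[$ip]) <= $this->threshold;
    }
}
```

Usage at the top of a page would look like: if the call returns false, send `http_response_code(500);` and exit.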
Check the number of requests for broken links. If a request for a broken link is served, add it to a log. A few of these are fine, but someone fishing for data (looking for AA, AB, AC, and so on) should stand out clearly. When that occurs, start serving HTTP 500 errors for all of your pages for a set amount of time. You can do this by routing all page requests through a front controller, or by creating a custom 404 page that logs the requests for you.
Raise alerts when there is a sudden change in statistics. This is not to shut anyone down, just to prompt you to investigate. The last thing you want is to block someone by accident, because to them it will just look like the website is down. A script that e-mails you when usage patterns change suddenly, before anyone is blocked, helps you make the decision appropriately.
These are all fairly broad concepts, and there are plenty of other solutions or tweaks on them that can work. To apply them successfully you will need to monitor your own traffic patterns to determine safe thresholds. Crafting such a solution well is not a small undertaking.
A Caveat
This is important: security is always counterbalanced by usability. If you do it right, you won't sacrifice much security and your users will never run into these issues. Because downtime is so costly for websites, perform extensive testing whenever you introduce a new security measure, before bringing it live. Otherwise you will have a group of very unhappy people to deal with and a potential mass loss of users. In the end, screen scraping is probably a better thing to deal with than angry users.
Another caveat
This could interfere with SEO for your site, as search engines like Google employ scraping (crawling) to keep their indexes up to date. Again, the note on balance applies. I am sure a workable compromise exists, but exploring it would stray too far from the original question.
If you're using Apache, I'd look into mod_evasive:
http://www.zdziarski.com/blog/?page_id=442
"mod_evasive is an evasive maneuvers module for Apache to provide evasive action in the event of an HTTP DoS or DDoS attack or brute force attack. It is also designed to be a detection and network management tool, and can be easily configured to talk to ipchains, firewalls, routers, and etcetera. mod_evasive presently reports abuses via email and syslog facilities.
...
Detection is performed by creating an internal dynamic hash table of IP Addresses and URIs, and denying any single IP address from any of the following:
Requesting the same page more than a few times per second
Making more than 50 concurrent requests on the same child per second
Making any requests while temporarily blacklisted (on a blocking list)"
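A typical mod_evasive configuration looks like this (the numbers are illustrative and need tuning against real traffic):

```apache
<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    # Block an IP requesting the same page more than 5 times in 1 second
    DOSPageCount        5
    DOSPageInterval     1
    # ...or making more than 50 requests site-wide in 1 second
    DOSSiteCount        50
    DOSSiteInterval     1
    # Blocks last 60 seconds (extended while the abuse continues)
    DOSBlockingPeriod   60
    DOSEmailNotify      admin@example.com
    DOSWhitelist        127.0.0.1
</IfModule>
```

Because it sits inside Apache, it catches abusive clients before any PHP runs, which is exactly where you want to save CPU.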

How to protect website from bulk scraping /downloading? [duplicate]

This question already has answers here:
Top techniques to avoid 'data scraping' from a website database
(14 answers)
Closed 5 years ago.
I have a LAMP server where I run a website which I want to protect against bulk scraping/downloading. I know that there is no perfect solution for this and that an attacker will always find a way, but I would like to have at least some "protection" that makes stealing the data harder than having nothing at all.
The website has approximately 5000 subpages with valuable text data and a couple of pictures on each page. I would like to analyze incoming HTTP requests online and, if there is suspicious activity (e.g. tens of requests in one minute from one IP), automatically blacklist that IP address from further access to the site.
I fully realize that what I am asking for has many flaws, but I am not really looking for a bullet-proof solution, just a way to limit script kiddies from "playing" with easily scraped data.
Thank you for your on-topic answers and possible solution ideas.
Although this is a pretty old post, I think the answers aren't quite complete and thought it worthwhile to add my two cents. First, I agree with @symcbean: try to avoid using IPs and instead use a session, a cookie, or another method to track individuals; otherwise you risk lumping together groups of users sharing an IP. The most common method for rate limiting, which is essentially what you are describing ("tens of requests in one minute from one IP"), is the leaky bucket algorithm.
Other ways to combat web scrapers are:
CAPTCHAs
Make your markup hard to interpret, and change it frequently. This makes scraping scripts harder to maintain.
Download IP lists of known spammers, proxy servers, TOR exit nodes, etc. This is going to be a lengthy list, but it's a great place to start. You may also want to block all Amazon EC2 IPs.
These lists, together with rate limiting, will stop simple script kiddies, but anyone with even moderate scripting experience will easily get around them. Combating scrapers on your own is a futile effort, though my opinion is biased: I am a co-founder of Distil Networks, which offers anti-scraping protection as a service.
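The leaky bucket algorithm mentioned above can be sketched like this (the capacity and leak rate are illustrative assumptions; per-user state would normally be keyed by session, not IP, per the advice above):

```php
<?php
// Sketch of the leaky bucket algorithm for rate limiting: each request
// adds one unit of "water"; the bucket leaks at a steady rate, so bursts
// up to the capacity are allowed but sustained flooding is rejected.
class LeakyBucket
{
    private $level = 0.0;   // current water in the bucket
    private $last;          // time of the last request
    private $capacity;      // burst size
    private $leakPerSec;    // sustained requests/second allowed

    public function __construct(float $capacity = 10.0,
                                float $leakPerSec = 1.0, float $now = 0.0)
    {
        $this->capacity   = $capacity;
        $this->leakPerSec = $leakPerSec;
        $this->last       = $now;
    }

    // Returns true if the request fits in the bucket, false if it should be rejected.
    public function allow(float $now): bool
    {
        // Leak out water for the time elapsed since the last request.
        $this->level = max(0.0, $this->level - ($now - $this->last) * $this->leakPerSec);
        $this->last = $now;
        if ($this->level + 1 > $this->capacity) {
            return false;
        }
        $this->level += 1;
        return true;
    }
}
```
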
Sorry, but I'm not aware of any off-the-shelf anti-leeching code that does a good job.
How do you limit access without placing burdens on legitimate users or providing a mechanism for DoSing your site? As with spam prevention, the best solution is to use several approaches and maintain scores of badness.
You've already mentioned looking at the rate of requests, but bear in mind that increasingly users will be connecting from NAT networks, e.g. IPv6 PoPs. A better approach is to check per session: you don't need to require your users to register and log in (although OpenID makes this a lot simpler), but you could redirect them to a defined starting point whenever they make a request without a current session, and log them in with no username/password. Other useful signals: checking the Referer header (and that the referer really does point to the current content item); tracking 404 rates; road blocks (when the score exceeds a threshold, redirect to a CAPTCHA or require a login). Checking the user agent can be indicative of attacks, but it should be used as part of the scoring mechanism, not as a yes/no criterion for blocking.
Another approach, rather than interrupting the flow, is to start substituting content when the thresholds are triggered, or when you see repeated external hosts appearing in your Referer headers.
Do not tarpit connections unless you've got a lot of resources server-side!
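The "scores of badness" approach could be sketched as follows (the signal names and weights are assumptions, chosen only to illustrate combining several weak signals into one score):

```php
<?php
// Sketch of a badness score: several weak signals each add points to a
// per-session score, and a threshold triggers a road block (CAPTCHA or
// forced login). Signal names and weights are illustrative assumptions.
class BadnessScore
{
    /** @var array<string,int> score per session id */
    private $scores = [];

    // Weights per signal; tune these against your own traffic.
    private const WEIGHTS = [
        'no_referer'  => 1,
        'bad_referer' => 2,
        'hit_404'     => 3,
        'odd_agent'   => 2,
    ];

    public function add(string $sessionId, string $signal): int
    {
        $this->scores[$sessionId] =
            ($this->scores[$sessionId] ?? 0) + (self::WEIGHTS[$signal] ?? 0);
        return $this->scores[$sessionId];
    }

    // Past the threshold, redirect to a CAPTCHA or require a login.
    public function blocked(string $sessionId, int $threshold = 10): bool
    {
        return ($this->scores[$sessionId] ?? 0) >= $threshold;
    }
}
```

The point of scoring rather than hard rules is that no single signal (a missing referer, one 404) ever blocks a legitimate user on its own.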
Referrer checking is one very simple technique that works well against naive automated attacks. You serve content normally if the referrer is your own domain (i.e. the user reached the page by clicking a link on your own site), but if the referrer is not set, you can serve alternate content (such as a 404 Not Found).
Of course you need to set this up to allow search engines to read your content (assuming you want that), and be aware that if you have any Flash content, the referrer is never set, so you can't use this method there.
It also means that any deep links into your site won't work; but maybe you want that anyway?
You could also enable it just for images, which makes it a bit harder for them to be scraped from the site.
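A minimal sketch of the referrer check in PHP (the allowed host name is an assumption):

```php
<?php
// Sketch of the referrer check described above: serve content normally
// only when the Referer header points at our own domain.
// The host name is an illustrative assumption.
function referer_ok(?string $referer, string $ownHost = 'www.example.com'): bool
{
    if ($referer === null || $referer === '') {
        return false;               // no referrer: serve alternate content
    }
    $host = parse_url($referer, PHP_URL_HOST);
    return $host === $ownHost;
}

// In a page or image handler:
// if (!referer_ok($_SERVER['HTTP_REFERER'] ?? null)) {
//     http_response_code(404);
//     exit;
// }
```

Note that comparing the parsed host, not a substring of the URL, avoids being fooled by URLs like `http://evil.test/?www.example.com`.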
Something that I've employed on some of my websites is to block known user agents of downloaders or archivers. You can find a list of them here: http://www.user-agents.org/ (unfortunately, it's not easy to sort by Type: D). In the host's setup, I enumerate the ones that I don't want with something like this:
SetEnvIf User-Agent ^Wget/[0-9\.]* downloader
Then I can add a Deny from env=downloader in the appropriate place. Of course, changing user agents isn't difficult, but at least it's a bit of a deterrent, if going through my logs is any indication.
If you want to filter by requests per minute or something along those lines, I don't think there's a built-in way to do that in Apache itself. I had a similar problem with ssh and saslauthd, so I wrote a script to monitor the log files; if a certain number of failed login attempts were made within a certain amount of time, it appended an iptables rule that blocked that IP from accessing those ports.
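A sketch of that log-monitoring approach, adapted to a web access log (the log format, with the client IP first on each line, and the threshold are assumptions; the iptables call is shown only as a comment):

```php
<?php
// Sketch of log monitoring: count requests per IP in a batch of recent
// access-log lines and report the offenders. A cron job could feed this
// the last few minutes of the log and block the IPs it returns.
function offending_ips(array $logLines, int $threshold): array
{
    $counts = [];
    foreach ($logLines as $line) {
        // Assume a combined-log-style line starting with the client IP.
        if (preg_match('/^(\S+)\s/', $line, $m)) {
            $counts[$m[1]] = ($counts[$m[1]] ?? 0) + 1;
        }
    }
    // Each offender could then be blocked, e.g. via
    // exec("iptables -A INPUT -s $ip -j DROP");
    return array_keys(array_filter($counts, fn ($n) => $n > $threshold));
}
```

Blocking at the firewall this way spares Apache and PHP from serving the abusive traffic at all.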
If you don't mind using an API, you can try our https://ip-api.io
It aggregates several databases of known IP addresses of proxies, TOR nodes and spammers.
I would advise one of two things.
First: if you have information that other people want, give it to them in a controlled way, say, an API.
Second: try to copy Google. If you scrape Google's results a lot (and I mean a few hundred times a second), it will notice and force you to a CAPTCHA.
I'd say that if a site is visited 10 times a second, it's probably a bot, so give it a CAPTCHA to be sure.
If a bot crawls your website slower than 10 times a second, I see no reason to try to stop it.
You could use a counter (in the DB or session) and redirect the page if the limit is triggered.
/* The pseudocode, fleshed out as PHP; the session itself identifies the visitor */
session_start();
$limit = 100; // requests allowed per session
$_SESSION['counter'] = ($_SESSION['counter'] ?? 0) + 1;
if ($_SESSION['counter'] > $limit) {
    header('Location: /rate-limited.php');
    exit;
}
I think dynamic blocking of IPs using an IP blocker will work better.

Smart PHP Session Handling/ Security

I've decided the best way to handle authentication for my apps is to write my own session handler from the ground up. Just like in Aliens, it's the only way to be sure a thing is done the way you want it to be.
That being said, I've hit a bit of a roadblock in fleshing out the initial design. I was originally going to use PHP's session handler in a hybrid fashion, but I'm worried about concurrency issues with my database. Here's what I was planning:
The first thing I'm doing is checking IPs (or possibly even sessions) to honeypot unauthorized attempts. I've written some conditionals that sleep on naughtiness. The big problem here is obviously WHERE to store my blacklist for optimal read speed.
A session_id is generated, hashed, and stored in $_SESSION[myid]. A separate piece of the same token is stored in a second $_SESSION[mytoken]. The corresponding data is then stored in TABLE X, whose location I'm not settled on (which is the root of this question).
Each subsequent request then verifies that [myid] and [mytoken] are what we expect them to be, then reissues new credentials for the next request.
Depending on the status of the session, more obvious ACL functions could then be performed.
So that is a high level overview of my paranoid session handler. Here are the questions I'm really stuck on:
I. What's the optimal way of storing an IP ACL? Should I be writing/reading to hosts.deny? Are there any performance concerns with my methodology?
II. Does my MitM prevention method seem ok, or am I being overly paranoid with comparing multiple indexes? What's the best way to store this information so I don't run into brick walls at 80-100 users?
III. Am I hammering on my servers unnecessarily with constant session regeneration + writebacks? Is there a better way?
I'm writing this for a small application initially, but I'd prefer to keep it a reusable component I could share with the world, so I want to make sure I make it as accessible and safe as possible.
Thanks in advance!
Writing to hosts.deny
While this is an alright idea if you want to completely IP-ban a user from your server, it will only work on a single server. Unless you have some kind of safe propagation across multiple servers (oh man, it sounds horrible already), you're going to be stuck on a single server forever.
You'll have to consider these points about using hosts.deny too:
Security: opening up write access to as important a file as hosts.deny to the web server user
Pain in the A: managing multiple writes from different processes (denyhosts, for example)
Pain in the A: safely amending the file if you'd later like to re-grant access to an IP that was previously banned
I'd suggest you simply ban the IP address at the application level in your application. You could even store the banned IP addresses in a central database so the list can be shared by multiple subsystems while still being enforced at the application level.
I. The optimal way of storing an IP ACL is to push banned IPs to an SQL database, which does not suffer from the concurrency problems of writing to files. An external script can then, on a regular basis or via a trigger, generate iptables rules. You do not need to re-read the database on every access; you write only when you detect misbehavior.
II. Binding a session to an IP is not a good idea on the public Internet if you serve clients behind transparent proxies or on mobile devices, because their IP changes. Let users choose in their preferences whether they want this feature (it depends on your audience whether they know what an IP means...). My solution is to generate a unique token per page request, re-used by that page's AJAX requests (so as not to run into resource problems with random numbers and the session data store). The tokens I generate are stored within the session and remembered for several minutes. This lets a user open several tabs, go back, and submit in an earlier-opened tab. I do not bind to the IP.
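The per-page token scheme described in II could be sketched like this (the token lifetime and the session array layout are assumptions; the session is passed as an array so the functions stay testable):

```php
<?php
// Sketch of per-page-request tokens: a fresh token per rendered page,
// kept in the session for a few minutes so multiple tabs and back-button
// submissions keep working. Lifetime and array key are assumptions.
function issue_token(array &$session, int $now, int $lifetime = 300): string
{
    $token = bin2hex(random_bytes(16));
    $session['tokens'][$token] = $now;
    // Drop expired tokens so the session store stays small.
    $session['tokens'] = array_filter(
        $session['tokens'],
        fn ($issued) => $now - $issued < $lifetime
    );
    return $token;
}

function token_valid(array $session, string $token, int $now, int $lifetime = 300): bool
{
    $issued = $session['tokens'][$token] ?? null;
    return $issued !== null && $now - $issued < $lifetime;
}
```

In real code `$session` would simply be `$_SESSION` and `$now` would be `time()`; keeping several live tokens, rather than one, is what allows multiple open tabs.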
III. It depends... there is not enough data from you to answer. The above may perfectly suit a user base of ~500 coming to your site for 5 minutes a day, once, or it may fit even 1000 unique concurrent users per hour on a chat site or game. It depends on what your application is doing, and how well you cache the data that can be cached.
Design well, test, benchmark. Verify that session handling is your resource problem, and not something else. Good algorithms should not throw you into resource problems. That includes DoS defense, which should not be in-application code: applications may hint to DoS-prevention mechanisms what to do, but leave the defense to specialized tools (see answer I).
Anyway, if you run into resource problems in the future, the best way out is often new hardware. That may sound rude or even incompetent to some, but compare the price of a new server in 6 months, practically 30% better, against the price of your work: pay $600 for a new server and gain substantial extra horsepower, or pay yourself $100 monthly for improving performance by a few percent (the exact economics vary with what your time is worth).
If you design from scratch, read https://www.owasp.org/index.php/Session_Management first, then search Google for session hijacking, session fixation, and similar terms.

How to prevent the same person from playing my RPG game as two different persons?

Of course, I store all players' IP addresses in MySQL, and I can check whether there is already a person with the same IP address before someone registers, but then he can register from school or wherever he wants. So, any suggestions?
The only approach that proves particularly effective is to make people pay to access your game.
Looking behind the question:
Why do you want to stop the same person registering and playing twice?
What advantage will they have if they do?
If there's no (or only a minimal) advantage, then don't waste your time and effort trying to solve a non-problem. Also, putting up barriers will make some people more determined to break or circumvent them, which could make your problem worse.
If there is an advantage then you need to think of other, more creative, solutions to that problem.
You can't. There is no way to uniquely identify users over the internet. Don't use IP addresses, because many people can share the same IP, and others use dynamic IPs.
Even if somehow you made them give you a piece of legal identification, you still wouldn't be absolutely sure that they were not registered on the site twice as two different accounts.
I would check the user's IP every time they log into the game, then log users who come from the same IP and how much they interact. You may find that you get some users from the same IP (i.e., roommates or spouses who play together and are not actually the same person). You may just have to flag these users and monitor their interactions. For example, is there a chat service in the game? If they don't ever talk to each other, they're more than likely the same person, and you can review on an individual basis.
If it's in a web browser, you could collect information like the OS or browser. This doesn't make it safe, but it is still safer.
It would take hackers only a little more time, and you have to allow for the possibility that some people legitimately play on systems with the same OS and browser.
The safest thing would be that people on the same IP cannot interact with each other, e.g. no trading, or, like in the poker game PKR, they cannot sit at the same table.
Another wise thing to do is to use CAPTCHAs; they are very user-unfriendly, but they keep a lot of bots out.
If it is a browser-based game, Flash cookies are a relatively resilient way to identify a computer. Or have them pay a minimal amount and identify them by credit card number; that way it still won't be hard to make a few extra accounts (friends' and family members' cards), but it will be hard to make a lot of them. Depending on your target demographic, it might prohibit potential players from registering, though.
The best approach is probably not worrying much about it and setting the game balance in such a way that progress is proportional to time spent playing (and use a strong captcha to keep bots away). That way, using multiple accounts will offer no advantage.
There are far too many ways to circumvent any restriction that limits play to a single account. FAR too many.
Unless the additional account is causing some sort of problem, it is not worth the attempt. You will spend most of your time chasing 'ghosts' instead of concentrating on improving the game and making more money.
IP bans do not work, nor do Flash cookies as a control mechanism.
Browser fingerprinting does not work either; people can easily use a second browser.
Even UUIDs will not work, as those too can be spoofed.
And if you actually did manage to discover and implement a working method, the user could simply use a second computer or laptop, and what then?
People can also sandbox a browser so as to use the same browser twice, thus defeating browser identification.
And then there are virtual machines...
There are an extreme number of control freaks out there wanting to control every aspect of computing, and the losers are the people who do the computing.
Every tracking measure I have ever encountered, I could circumvent easily, be it UUIDs, MAC addresses, IP addresses, fingerprinting, etc. And it is very easy to do, too.
The best suggestion is to simply watch for any TOU violations and address the problem accordingly.
