Dynamically blocking IPs in high-traffic site: best strategy? - php

I've got some bad bots targeting my website and I need to dynamically handle the IP addresses from which those bots come. It's a pretty high-traffic site: we get a couple of million pageviews per day, which is why we're using 4 load-balanced servers. We don't use any caching (besides assets) because most of our responses are unique.
Code-wise it's a pretty small PHP website, which does no database queries and one XML request per pageview. The XML request gets a pretty fast response.
I've developed a script to (very frequently) analyse which IP addresses are making abusive requests, and I want to handle requests from those IPs differently for a certain amount of time. The abusive IPs change a lot, so I need to block different IPs every couple of minutes.
So: I see IP xx.xx.xx.xx being abusive, I record this somewhere, and then I want to give that IP special treatment for any requests it makes during the next x minutes. I need to do this in a fast way, because I don't want to slow down the server and have the legitimate users suffer for this.
Solution 1: file
Writing the abusive IPs down in a file and then reading that file for every request seems too slow. Would you agree?
Solution 2: PHP include
I could let my analysis script write a PHP include file which the PHP engine then would include for every request. But: I can imagine that, while writing the PHP file, a lot of users that do a request right then get an error because the file is being used.
I could solve that potential problem by writing the file and then doing a symlink change (which might be faster).
Solution 3: htaccess
Another way to separate the abusers out would be to write an htaccess file that blocks or redirects them. This might be the most efficient way, but then I need to write an htaccess file every x minutes.
I'd love to hear some thoughts/reactions on my proposed solutions, especially concerning speed.

What about dynamically configuring iptables to block the bad IPs? I don't see any reason to do the "firewalling" in PHP...

For the record I've finally decided to go for (my own proposed) solution number 2, generating a PHP file that is included on every page request.
The complete solution is as follows:
A Python script analyses the access log file every x minutes and doles out "punishments" to certain IP addresses. All currently running punishments are written into a fairly small (<1 KB) PHP file. This PHP file is included for every page request. Directly after the PHP file is generated, an rsync job is started to push the new PHP file out to the other 3 servers behind the load balancer.
In the Python script that generates the PHP file I first build the complete contents of the file as a single string. I then open, write and close the file in quick succession, so that the file is locked for the shortest possible period.
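A common alternative to a quick open-write-close is to write the new contents to a temporary file and rename it over the old one, which is close to the symlink idea from the question. The following is only a sketch of that rename-based variant, not the poster's actual script; the function name `write_blocklist` and the `$blocked` array format are assumptions made for illustration.

```python
import os
import tempfile


def write_blocklist(path, punished_ips):
    """Render the PHP include in memory, then swap it into place atomically."""
    # Build the complete file contents first (hypothetical format: a PHP array
    # keyed by IP, so the including page can do a cheap isset() lookup).
    entries = "".join(f"    '{ip}' => true,\n" for ip in sorted(punished_ips))
    body = "<?php\n$blocked = array(\n" + entries + ");\n"

    # The temporary file lives in the same directory, so the final rename
    # never crosses a filesystem boundary.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(body)
        # mkstemp creates the file readable by its owner only; widen the mode
        # so the web server user can include it.
        os.chmod(tmp_path, 0o644)
        # os.replace is atomic on POSIX: a reader gets the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return body
```

With this approach a request that arrives mid-update still includes a complete file, because the old file stays in place until the rename.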

I would seriously consider putting up another server that holds the (constantly changing) block list in-memory and serves the front-end servers.
I implemented such a solution using Node.JS and found the implementation easy and performance very good.
memcached could also be used, but I never tried it.
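The core of such an in-memory block list is small. The sketch below is in Python rather than Node.JS and leaves out the network layer entirely; the class name `BlockList` and its methods are hypothetical. With memcached, the expiry would be handled by the per-key expiration time instead of by hand.

```python
import time


class BlockList:
    """In-memory set of punished IPs, each with its own expiry time."""

    def __init__(self, clock=time.monotonic):
        self._expires = {}   # ip -> time at which the punishment ends
        self._clock = clock  # injectable so the class can be tested

    def punish(self, ip, minutes):
        self._expires[ip] = self._clock() + minutes * 60

    def is_blocked(self, ip):
        deadline = self._expires.get(ip)
        if deadline is None:
            return False
        if self._clock() >= deadline:
            # Punishment is over: forget the entry lazily on lookup.
            del self._expires[ip]
            return False
        return True
```

Each front-end request then costs one dictionary lookup on the block-list server plus one network round trip, which is the trade-off against a local include file.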

Related

Can making a curl request to hundreds of sites be considered an attack by some hosts?

Sometimes we don't have the APIs we would like to, and this is one of these cases.
I want to extract certain information from a certain website, so I was considering programmatically making curl requests to hundreds of pages within the site, using a cron job on my server.
I would then cache the responses and fire the requests again after one or more days.
Could that potentially be considered some kind of attack by the server, which might see hundreds of calls to certain sites in a very short period of time from the same server IP?
Let's say, 500 curl requests?
What would you recommend? Perhaps using sleep between one curl request and the next to reduce the frequency of those requests?
There are a lot of situations where your scripts could end up getting blocked by the website's firewall. One of the best ways to find out whether this is allowed is to contact the site owner and let them know what you want to do. If that's not possible, read their Terms of Service and see if it's strictly prohibited.
If time is not of the essence when making these calls then, yes, you can definitely use the sleep command to add a delay between requests, and I would recommend it if you find out you need to make fewer requests per second.
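The delay between requests can be sketched as follows. This is a Python illustration rather than curl in a shell or PHP loop; the function name `fetch_all` and the two-second default are assumptions, and a suitable delay depends on the target site.

```python
import time
import urllib.request


def fetch_all(urls, delay_seconds=2.0, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read()

    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # spread the requests out over time
        pages[url] = fetch(url)
    return pages
```

At two seconds per request, 500 pages take roughly 17 minutes, which is usually acceptable for a job that runs once a day.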
You could definitely do this. However you should keep a few things in mind:
Most competent sites will have a clause in their Terms of Service which prohibits the use of the site in any way other than through the interface provided.
If the site sees what you are doing and notices a detrimental effect on their network, they will block your IP. (Our organization was running into this issue often enough that it warranted developing a program that logs IPs and the rate at which they access content; if an IP attempts to access more than x pages in y seconds, we ban it for z minutes.) However, you might be able to circumvent this by using the sleep command as you mentioned.
If you require information on the page that is loaded dynamically via JavaScript after the markup has been rendered, the response you receive from your curl request will not include this information. For cases such as these there are programs such as iMacros which allow you to write scripts in your browser to carry out actions programmatically as if you were actually using the browser.
As mentioned by @RyanCady, the best solution may be to reach out to the owner of the site, explain what you are doing, and see if they can accommodate your requirement.
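The "more than x pages in y seconds, ban for z minutes" rule described above can be sketched as a sliding-window counter. This is an illustration of the general idea, not the program the answerer's organization wrote; the class name `RateBan` and its interface are hypothetical.

```python
from collections import defaultdict, deque


class RateBan:
    """Ban an IP for z minutes once it exceeds x requests in y seconds."""

    def __init__(self, max_pages, window_seconds, ban_minutes):
        self.max_pages = max_pages
        self.window = window_seconds
        self.ban_seconds = ban_minutes * 60
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self._banned_until = {}          # ip -> time at which the ban ends

    def allow(self, ip, now):
        """Record a request at time `now`; return False if it is rejected."""
        if self._banned_until.get(ip, 0) > now:
            return False
        hits = self._hits[ip]
        hits.append(now)
        # Drop requests that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) > self.max_pages:
            self._banned_until[ip] = now + self.ban_seconds
            hits.clear()
            return False
        return True
```

A scraper that sleeps long enough between requests never fills the window, which is why the sleep approach tends to avoid this kind of ban.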

I get mass traffic to one of my PHP files, should I split to several same files?

OK, so I get lots of traffic from around the world to a PHP file that is hosted on my server. This file runs several checks against the visitor, runs several SQL queries, and decides what to do based on the user's status.
I'm getting hundreds of hits per second.
So, my question is:
Should I create many identical files, and randomly drive the traffic to each of the files I created?
I want to avoid traffic loss and overload, but I don't know if splitting into different files even matters.
Thanks to everyone who helps.

Load Balancing - How to set it up correctly?

Here it gets a little complicated. I'm in the last few months of finishing a larger web-based project, and since I'm trying to keep the budget low (and learn some stuff myself) I'm now touching an issue that I never touched before: load balancing with NGINX, and scalability for the future.
The setup is the following:
1 Web server
1 Database server
1 File server (also used to store backups)
Using PHP 5.4< over fastCGI
Now, all those servers should be 'scalable' - in the sense that I can add a new File Server, if the Free Disk Space is getting low, or a new Web Server if I need to handle more requests than expected.
Another thing is: I would like to do everything over one domain, so that the access to different backend servers isn't really noticed in the frontend (some backend servers are basically called via a subdomain, for example the file server over 'http://file.myserver.com/...', where load balancing happens only between the file servers).
Do I need an additional, separate Server for load balancing? Or can I just use one of the web servers? If yes:
How much power (CPU / RAM) do I require for such a load-balancing server? Does it have to be the same as the web server, or is a 'lighter' server enough for that?
Does the 'load balancing' server have to be scalable too? Will I need more than one if there are too many requests?
How exactly does the whole load balancing work anyway? What I mean:
I've seen many entries stating that there are some problems like session handling / synchronisation on load-balanced systems. I could find 2 solutions that might fit my needs: either the user is always directed to the same machine, or the data is stored inside a database. But with the second, I basically would have to rebuild parts of the $_SESSION functionality PHP already has, right? (How do I know which user gets which session, are cookies really enough?)
What problems do I have to expect, except the unsynchronized sessions?
Write scalable code - that's a sentence I read a lot. But in terms of PHP, for example, what does it really mean? Usually, all the calculations for one user happen on one server only (the one NGINX directed the user to) - so how can PHP itself be scalable, since it's not actually redirected by NGINX?
Are different 'load balancing' pools possible? What I mean is that all file servers are in one 'pool' and all web servers are in another 'pool', and if you request an image from a file server that has too much to do, the request is redirected to a less busy file server.
SSL - I'll only need one certificate for the load-balancing server, right? Since the data always goes back over the load-balancing server - or how exactly does that work?
I know it's a huge question - basically, I'm really just looking for some advice and a bit of a helping hand, as I'm a bit lost in the whole thing. I can read snippets that partially answer the above questions, but really 'doing' it is a completely different thing. So I already know that there won't be a clear, definitive answer, but maybe some of you can share your experiences.
The end target is to be easily scalable in the future, and already plan for it ahead (and even buy stuff like the load balancer server) in time.
You can use one of the web servers for load balancing, but it'll be more reliable to run the load balancer on a separate machine. If your web servers don't respond very quickly and you're getting many requests, the load balancer will put the requests in a queue. For a big queue you need a sufficient amount of RAM.
You don't generally need to scale a load balancer.
Alternatively, you can create two or more A (address) records for your domain, each pointing to a different web server's address. This gives you 'DNS load balancing' without a balancing server. Consider this option.

Will I run into load problems with this application stack?

I am designing a file download network.
The ultimate goal is to have an API that lets you directly upload a file to a storage server (no gateway or something). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or an API gives the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything similar at any point. Every drive is just attached to the server as JBOD. All the replication is at application level. If one server breaks down it is just marked as broken in the database and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring / balancing and maybe later billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySql LAMP stack.
A URL that points to the current storage server is generated by the API (that's no problem so far. Just a classic PHP website and MySQL database)
Now it gets interesting...
The Storage server runs Apache2 and a PHP script catches the request. URL parameters (secure token hash) are validated. IP, Timestamp and filename are validated so the request is authorized. (No database connection required, just a PHP script that knows a secret token).
The PHP script sets the file header to use apache2 mod_xsendfile
Apache delivers the file passed by mod_xsendfile and is configured to have the access log piped to another PHP script
Apache runs mod_logio, and the access log is in Combined I/O log format but additionally extended with the %D variable (the time taken to serve the request, in microseconds) to calculate the transfer speed, spot bottlenecks in the network, and so on.
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, that is assigned to one client, so the client is known), counts input/output traffic and increases database fields. For performance reasons I thought about having daily fields and updating them like traffic = traffic+X, and creating the row if no row has been updated.
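The token check on the storage server (validate IP, timestamp and filename against a shared secret, with no database connection) could look roughly like this. The sketch is in Python rather than PHP, and the message layout, the `SECRET` value and the function names are all assumptions; the original post does not say how the hash is built.

```python
import hashlib
import hmac
import time

# Hypothetical shared secret, known to the API and to every storage server.
SECRET = b"change-me"


def sign(filename, client_ip, expires, secret=SECRET):
    """Create the token the API would append to the download URL."""
    message = f"{filename}|{client_ip}|{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_authorized(filename, client_ip, expires, token, now=None, secret=SECRET):
    """Check a request on the storage server without touching the database."""
    now = time.time() if now is None else now
    if now > expires:
        return False  # the link has expired
    expected = sign(filename, client_ip, expires, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))
```

Because the token covers the filename, the client IP and the expiry time, changing any of the three in the URL invalidates it.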
I have to mention that the servers will be low-budget servers with massive storage.
You can have a close look at the intended setup in this thread on serverfault.
The key data is that the systems will have Gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests). Maybe 500MB on average or something!
The currently planned setup runs on a cheap consumer mainboard (asus), 2 GB DDR3 RAM and an AMD Athlon II X2 220, 2x 2.80GHz tray cpu.
Of course download managers and range requests will be an issue, but I think the average size of an access will be around at least 50 megs or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be directly read from the last request and does not do another request to the mysql server?
Do you think the system with the specs given above can handle this? If not, how could I improve it? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using Lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address, so I am not as flexible. It would have the advantage that the download validation would not need a PHP process to fire. But as it only runs briefly and doesn't read and output the data itself, I think this is OK. Do you? I once did downloads using lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that. But
What do you think about the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update them in the database in a 2nd process that can handle delays? Or not do it at all but parse the log files at night? My thought is that I would like to have it as real-time as possible and not have accumulated data somewhere other than in the central database. I also don't want to keep track of jobs running on all the servers. This could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open source alternative to expensive storage services such as Amazon S3 that is oriented towards file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would reuse an existing solution, preferably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly the thing that you want.

Sending large files via HTTP

I have a PHP client that requests an XML file over HTTP (i.e. loads an XML file via URL). As of now, the XML file is only several KB in size. A problem I can foresee is that the XML becomes several MBs or GBs in size. I know that this is a huge question and that there are probably a myriad of solutions, but what ideas do you have for transporting this data to the client?
Thanks!
Based on your use case I'd definitely suggest zipping up the data first. In addition, you may want to MD5-hash the file and compare the hash before initiating the download (no need to update if the file has no changes); this will help with point #2.
Also, would it be possible to just send the segment of XML that has been changed instead of the whole file?
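The hash comparison could be sketched like this, in Python rather than PHP. It assumes the server publishes the MD5 of the current XML somewhere cheap to fetch (for example a small companion file); that arrangement and the function names are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so that a multi-GB file never sits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(remote_md5, local_path):
    """Return True when the local copy is missing or differs from the server's."""
    try:
        return file_md5(local_path) != remote_md5.strip().lower()
    except FileNotFoundError:
        return True
```

When the hashes match, both the transfer and the nightly database import can be skipped.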
Ignoring how well a browser may or may not handle a GB-sized XML file, the only real concern I can think of off the top of my head is if the execution time to generate all the XML is greater than any execution time thresholds that are set in your environment.
PHP's max_execution_time setting
PHP's set_time_limit() function
Apache's TimeOut Directive
Given that the XML is created dynamically with your PHP, the simplest thing I can think of is to ensure that the file is gzipped automatically by the web server, as described here; it offers a general PHP approach and an Apache httpd-specific solution.
Besides that, having a browser (what else can be a PHP-client?) do such a job every night for some data synchronizing sounds like there must be a far simpler solution somewhere else.
And, of course, at some point, transferring "a lot" of data is going to take "a lot" of time...
The problem is that he's syncing up two datasets. The problem is completely misstated.
You need to either a) keep a differential log of changes to dataset A so that you can send that log to dataset B, or b) keep two copies of the dataset (last night's and the current dataset), and then compare them so you can then send the differential log from A to B.
Welcome to the world of replication.
The problem with (a) is that it's potentially invasive to all of your code, though if you're using an RDBMS you could do some logging perchance via database triggers to keep track of inserts/updates/deletes, and write the information in to a table, then export the relevant rows as your differential log. But, that can be nasty too.
The problem with (b) is the whole "comparing the database" all at once. Fine for 100 rows. Bad for 10^9 rows. Nasty nasty.
In fact, it can all be nasty. Replication is nasty.
A better plan is to look into a "real" replication system designed for the particular databases that you're running (assuming you're running a database). Something that perhaps sends database log records over for synchronization rather than trying to roll your own.
Most of the modern DBMS systems have replication systems.
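For small datasets, option (b) above can be sketched in a few lines. This is a Python illustration that assumes both snapshots fit in memory as dictionaries keyed by primary key, which is exactly the assumption that stops holding at 10^9 rows; the function name `diff_snapshots` is hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two {primary_key: row} snapshots and return the change log."""
    inserts = {key: row for key, row in current.items() if key not in previous}
    deletes = [key for key in previous if key not in current]
    updates = {
        key: row
        for key, row in current.items()
        if key in previous and previous[key] != row
    }
    return {"insert": inserts, "update": updates, "delete": deletes}
```

Only the returned log needs to travel from A to B, and only the listed rows need to be written on the receiving side.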
Gallery2, which allows you to upload photos over http, makes you set up a couple of php parameters, post_max_size and upload_max_filesize, to allow larger uploads. You might want to look into that.
It seems to me that posting large files has problems with browser time-outs and the like, but on the plus side it works with proxy servers and firewalls better than trying a different file upload protocol.
Thanks for the responses. I failed to mention that transferring the file should be relatively fast (a few minutes max, is this even possible?). The XML that is requested will be parsed and inserted into a database every night. The XML may be the same as the night before, or it may be different. So there are basically two requirements: 1. it has to be relatively fast 2. it should minimize the number of writes to the database.
One solution that was proposed is to zip the XML file and then transfer it, but that only satisfies (1).
Any other ideas?
Are there any algorithms that I could apply to compress the XML? How are large files such as MP3s being downloaded in a matter of seconds?
Having PHP receive GBs of data will take a long time and adds overhead.
It is also more susceptible to failures.
I would dispatch the job to a shell script (wget with simple error catching) that is not bothered by execution time and on failure could perhaps even retry on its own.
I'm not experienced with this, but although one could use exec() or the like, these sadly block until the command finishes.
Calling a script with **./test.sh &** makes it run in the background and solves that problem, I guess. The script could easily let your PHP pick it back up via a wget `http://yoursite.com/continue-xml-stuff.php?id=1049381023&status=0`. The id could be a filename, if you don't need to backtrack lost requests. The status would indicate how the script ended up handling the request.
Have you thought about using some sort of version control system to handle this? You could leverage its ability to calculate and send just the differences in the files, plus you get the added benefits of maintaining a version history of your file.
Since I don't know the details of your situation I'll throw a question out there. Just for the sake of argument, does it have to be HTTP? FTP is much better suited for large data transfer and can be automated easily via PHP or Perl.
If you are using Apache, you might also consider Apache mod_gzip. This should allow you to compress the file automatically and the decompression should also happen automatically, as long as both sides accept gzip compression.
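To get a feel for how much gzip helps on XML, the effect can be tried locally. The snippet below is a Python sketch on a made-up, repetitive XML payload; real feeds will compress differently, so treat the printed sizes as an illustration only.

```python
import gzip

# Hypothetical XML payload with the repetitive structure typical of data feeds.
xml = ("<items>" + "".join(
    f"<item><id>{i}</id><name>product {i}</name></item>" for i in range(10000)
) + "</items>").encode("utf-8")

compressed = gzip.compress(xml)        # what would travel over the wire
restored = gzip.decompress(compressed)  # what the client ends up parsing

print(len(xml), len(compressed))
```

Repeated tag names are what make XML compress well, so the saving tends to grow with the number of similar elements.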
