PHP cURL causing huge Apache access log

I'm curious to know how to stop Apache from logging every URL I request with cURL.
My PHP script opens a few hundred thousand URLs, scans each one, takes a tiny bit of info, closes it, and then opens the next.
I discovered after opening the access log that every request made with cURL is written to it:
::1 - - [01/Dec/2010:18:37:37 -0600] "GET /test.php HTTP/1.1" 200 8469 "-"..."
My access log is almost 45 MB. Help, anyone?

This is the purpose of the access log: recording all incoming traffic.
In order to effectively manage a web server, it is necessary to get feedback about the activity and performance of the server as well as any problems that may be occurring. The Apache HTTP Server provides very comprehensive and flexible logging capabilities. This document describes how to configure its logging capabilities, and how to understand what the logs contain.
source: http://httpd.apache.org/docs/trunk/logs.html
Of course, you have the option to disable logging (preferably not).

If all of your cURL requests come from a single IP, or an otherwise manageable group of IPs, you can exclude them from your logs with a configuration similar to the following:
# Set your address here, you can do this for multiple addresses
SetEnvIf Remote_Addr "1\.1\.1\.1" mycurlrequest
CustomLog logs/access_log common env=!mycurlrequest
You can do something similar with the User-Agent field, which by default will indicate that the request came from curl.
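For instance, a minimal sketch of tagging your requests on the PHP side so they can be matched by User-Agent (the my-url-scanner/1.0 string is just an illustrative value):

<?php
// Give the scanner a distinctive User-Agent so Apache can
// filter its requests out of the access log.
$ch = curl_init('http://localhost/test.php');
curl_setopt($ch, CURLOPT_USERAGENT, 'my-url-scanner/1.0'); // arbitrary tag
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
curl_close($ch);

On the Apache side, SetEnvIf User-Agent "my-url-scanner" mycurlrequest would then set the same environment variable as the Remote_Addr example above.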
You can read more here:
http://httpd.apache.org/docs/2.2/logs.html#accesslog (conditional logging is the last section under this header)
and here
http://httpd.apache.org/docs/2.2/mod/mod_setenvif.html#setenvif
If you want to conditionally exclude logging, I would do it by the most precise method possible, such as the IP address. In the event the server is externally accessible, you probably don't want to find yourself NOT logging external requests from curl.
Using conditional logging you can also segment your logging into multiple files, one of which you could rotate more frequently. The benefit of that is you can save space and at the same time have log data to help with research and debugging.

See the Apache manual on conditional logging. That may be what you are looking for.

Related

Is there a setting to detect too many requests?

I am programming in PHP and I am wondering if there is a built-in way to handle too many requests, or does this need to be coded manually? For example, if someone has opened 30 pages in 60 seconds, that is too many requests (and they may potentially be a bot), so they get sent a 429 Too Many Requests HTTP status code.
If it is supposed to be done manually, what is the best practice for setting up something like this?
You could try using mod_ratelimit from Apache.
Here is a sample provided by Apache. It caps responses under that location at 400 KiB/s per connection:
<Location "/downloads">
    SetOutputFilter RATE_LIMIT
    SetEnv rate-limit 400
</Location>
More specifically, you can try a module like mod_evasive to prevent clients from flooding the server with requests, or use a service like Cloudflare to mitigate DDoS attacks.
If you really want to use PHP for this, you can log the number of requests from a given IP, and if the requests from that IP exceed a certain threshold, block that IP from accessing your page.
To do this, you can store the IP addresses in a database along with a date column indicating when they accessed your page, and calculate aggregates of their access over a particular period using SQL.
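For illustration, a minimal sketch of that approach, assuming a PDO connection in $pdo and a hypothetical request_log(ip, requested_at) table:

<?php
// Sketch only: $pdo, the table name, and the thresholds are assumptions.
$ip = $_SERVER['REMOTE_ADDR'];

// Record this request.
$pdo->prepare('INSERT INTO request_log (ip, requested_at) VALUES (?, NOW())')
    ->execute(array($ip));

// Count requests from this IP in the last 60 seconds.
$stmt = $pdo->prepare(
    'SELECT COUNT(*) FROM request_log
     WHERE ip = ? AND requested_at > NOW() - INTERVAL 60 SECOND'
);
$stmt->execute(array($ip));

if ($stmt->fetchColumn() > 30) {   // e.g. more than 30 pages in 60 seconds
    http_response_code(429);       // Too Many Requests
    exit;
}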
Just for anyone who might not be using Apache, here is the nginx documentation:
https://docs.nginx.com/nginx/admin-guide/security-controls/controlling-access-proxied-http/

PHP cURL function

We have developed a cURL function in our application. This function is mainly used to map data over from one site into the form fields of our application.
This function had been working fine and in use for more than two months. Yesterday, it broke: data from that website can no longer be mapped over. We are trying to find out what the problem is. While troubleshooting, it shows a response timeout issue.
To make sure there is nothing wrong with our code and that our server is performing well, we duplicated this instance to another server and tried the function there. It worked perfectly.
Wondering if anyone out there is facing such a problem?
What could be the cause of this issue?
When we use cURL, will the site owner know that we are pulling their data to map into our server application? If so, is there a way we can overcome this?
Could it be that the owner blocked our server's IP address? That would explain why the function works well on the other server but not on the original one.
Appreciate your help on this.
Thank you,
Your problem description is far too generic to determine a specific cause. Most likely, however, there is a specific block in place.
For example a firewall rule on the other end, or on your end, would cause all traffic to be dropped, thus causing the timeout. There could also be a regular network outage between both servers, but that's unlikely.
Yes, they will see it in their Apache (or IIS) logs regularly. No, you cannot hide from the server logs - they record all successful requests. You either get the data or you stay stealthy; not both.
Yes, the webserver logs will contain the IP doing all the requests. Adding a DROP rule to the firewall is then a trivial task.
I have applied such firewall rules to bandwidth and/or data leechers many times in the past few years, although usually I prefer the more resilient Deny from 1.2.3.4 approach in the Apache vhost/.htaccess. Usually, if you use someone else's facilities, it's nice to ask for proper permission - it lessens the chance you get blocked this way.
I faced a similar problem some time ago.
My server IP had been blocked by the website owner.
It can be seen in the server logs. Google Analytics, however, won't see it, as cURL doesn't execute JavaScript.
Try to ping the destination server from the one executing the cURL requests.
Some suggestions (a combined sketch follows the list):
Use a browser-like User-Agent header to mask your request.
If you insist on using this server, you can route the requests through a proxy.
Put some sleep() between the requests.
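A rough sketch combining the three suggestions (the User-Agent string and proxy address are placeholder values):

<?php
foreach ($urls as $url) {          // $urls: your list of pages to fetch
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // 1. Send a browser-like User-Agent header.
    curl_setopt($ch, CURLOPT_USERAGENT,
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
    // 2. Optionally route the request through a proxy.
    // curl_setopt($ch, CURLOPT_PROXY, 'proxy.example.com:8080');
    $html = curl_exec($ch);
    curl_close($ch);
    // 3. Pause between requests.
    sleep(2);
}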

Is this code safe against being run from another server

I'll explain the setup quickly, we have multiple domains, running on 2 servers (Dev and Live), which all are populated from a CMS database on the Live server. I'm adding in 404 reporting, so each site logs any 404's it gets, and we can view them in the CMS.
Each site, when our Framework detects a 404 error, it does a curl call to http://cms.example.com/log404.php and sends the $_SERVER variable. At the top of the log404.php I have this code which wraps the whole logging code.
if (in_array($_SERVER['REMOTE_ADDR'], array('dev server ip', 'live server ip'))) {
Then in here I store the relevant bits of data from $_POST. The reason I did it this way, rather than having each site add directly to the database, was that if we want to change the logging code somehow (log different data, write it to a file, change the database, etc.), it only needs changing once rather than in 15+ different sites.
Is the above if statement a safe way to check if the data was posted by us, and not somebody else? Or would it be possible for somebody to manipulate the curl call so the REMOTE_ADDR appears to be one of our IP's?
$_SERVER['REMOTE_ADDR'] uses the IP address from the TCP handshake, so the same answer as this applies: Is it possible to pass a TCP handshake with a spoofed IP address?
So if this is over the internet, then you are safe from the IP being spoofed.
However, for extra protection you should also protect your log service with authentication (as @moonwave99 suggested), and you should only run your log service over an HTTPS connection.
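As a sketch of what that authentication could look like on top of the IP check (the X-Log-Secret header name and the secret value are hypothetical; keep the real secret in configuration, not in source):

<?php
// log404.php - sketch only.
$allowedIps = array('dev server ip', 'live server ip');
$secret     = 'long-random-string';            // hypothetical shared secret

$given = isset($_SERVER['HTTP_X_LOG_SECRET'])
    ? $_SERVER['HTTP_X_LOG_SECRET'] : '';

if (in_array($_SERVER['REMOTE_ADDR'], $allowedIps, true)
    && hash_equals($secret, $given)) {
    // ... store the relevant bits of $_POST ...
} else {
    http_response_code(403);
}

The sites sending the reports would then add the header to their cURL call, e.g. curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Log-Secret: long-random-string')).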

Use of .htaccess to mitigate denial of service attacks

I have an application that requires logon.
It is only possible to access the site via a single logon page.
I am concerned about DDoS attacks and have (thanks to friends here) been able to write a script that will recognise potential DDoS attacks and lock the particular IP to prevent site access (also a security measure to prevent multiple password/username combination guesses).
Is there any value in blocking the offending IPs with .htaccess? I can simply modify the file to prevent my server from allowing access to the offending IP for a period of time, but will it do any good? Will the incoming requests still bung up the system even though .htaccess prevents them from being served, or will it reduce the load, allowing genuine requests in?
It is worth noting that most of my requests will come from a limited range of genuine IPs, so the implementation I intend is along the lines of:
If a DDoS attack is suspected, allow access only from IPs from which there has been a previous good logon, for a set time period. Block all suspect IPs where there has been no good logon permanently, unless a manual request to unblock has been made.
Your sage advice would be greatly appreciated. If you think this is a waste of time, please let me know!
Implementation is pretty much pure PHP.
Load caused by a DDoS attack will be lower if it is blocked by .htaccess, as the unwanted connections will be refused early and not allowed to reach your PHP scripts.
Take, for example, a request made to the login script: your Apache server will call the PHP script, which will (I'm assuming) do a user lookup in a database of some kind. This is load.
Request <---> Apache <---> PHP <---> MySQL (maybe)
If you block an IP (say 1.2.3.4), your .htaccess will have an extra line like this:
Deny from 1.2.3.4
And the request will go a little like this:
Request <---> Apache <-x-> [Blocked]
And no PHP script or database calls will happen; this is less load than in the previous example.
This also has the added bonus of preventing brute-force attacks on the login form. You'll have to decide when to add IPs to the blocklist - maybe when they give incorrect credentials 20 times in a minute, or continuously over half an hour.
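As a rough illustration of the blocklist step in PHP (the threshold, the file path, and the $failedAttempts variable are all hypothetical, and letting the web server user write to .htaccess carries its own risks):

<?php
$ip = $_SERVER['REMOTE_ADDR'];

if ($failedAttempts >= 20) {       // fed by your own attempt-counting logic
    file_put_contents(
        __DIR__ . '/.htaccess',
        "Deny from {$ip}\n",
        FILE_APPEND | LOCK_EX      // append under an exclusive lock
    );
}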
Firewall
It would be better to block the requests using a firewall, though, rather than with .htaccess. This way the request never gets to Apache; it's a simple action for the server to drop the packet based on an IP address rule.
The line below is a shell command that (when run as root) will add an iptables rule to drop all packets originating from that IP address:
/sbin/iptables -I INPUT -s 1.2.3.4 -j DROP

Count downloads without `echo file_get_contents($file)`?

I currently have download links on my server that point directly to files. I have a set of quite complicated rewrite rules, but they don't affect what I am asking about.
What I want to do is count the number of downloads. I know I could write a PHP script to echo the content, with a rewrite rule so that the PHP script processes all downloads.
However, there are a few points that I am worried about:
There is a chance that some dangerous paths (e.g. /etc/passwd, ../../index.php) will not be blocked due to carelessness or unnoticed bugs.
I would need to handle HTTP 404 Not Found responses (and others) in the script, which I would prefer to let Apache handle (I have an error handler script that relies on server redirect variables).
HTTP headers (like content type or modified time) may not be correctly set.
Using a PHP script doesn't usually allow an HTTP 304 Not Modified response, so browser caching becomes useless and re-downloads consume extra bandwidth. (Actually I can check for that, but it would require some more coding and debugging.)
A PHP script uses more processing power than having Apache serve the file directly.
So, I would like to find some other ways to perform statistics. Can I, for example, make Apache trigger a script when certain files (in certain directories) are being requested and downloaded?
This may not be quite what you're looking for, but in the spirit of using the right tool for the job you could easily use Google Analytics (or probably any other analytics package) to track this. Take a look at https://support.google.com/analytics/bin/answer.py?hl=en-GB&answer=1136922.
Edit:
It would require the ability to modify the vhost setup for your site, but you could create a separate Apache log file for your downloads. Let's say you've got a downloads folder storing the files that are available for download; you could add something like this to your vhost:
SetEnvIf Request_URI "^/downloads/.+$" download
LogFormat "%U" download-log
CustomLog download-tracking.log download-log env=download
Now, any time something is requested from the /downloads/ folder, it will be logged in the download-tracking.log file.
A few things to know:
You can have as many SetEnvIf lines as you need. As long as they all set the download environment variable, the request will be logged to the CustomLog.
The LogFormat I've shown will log only the URI requested, but you can easily customize that to log much more than just the URI, see http://httpd.apache.org/docs/2.2/mod/mod_log_config.html#logformat for more details.
If you're providing PDF files, be aware that some browsers/plugins will make a separate request for each page of the PDF so you would need to account for that when you read the logs.
The primary benefit of this method is that it does not require any coding, just a simple config change and you're ready to go. The downside, of course, is that you'd have to do some kind of log processing. It just depends what is most important to you.
Another option would be to use a PHP script and the readfile function. This makes it much easier to log requests to a database, but it does come with the other issues you mentioned earlier.
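A minimal sketch of that option (log_download() is a hypothetical helper that records the hit; basename() strips any ../ components, which addresses the dangerous-path worry):

<?php
// download.php?file=report.pdf - sketch only.
$dir  = __DIR__ . '/downloads';
$file = basename(isset($_GET['file']) ? $_GET['file'] : '');
$path = $dir . '/' . $file;

if ($file === '' || !is_file($path)) {
    http_response_code(404);
    exit;
}

log_download($file);               // hypothetical: record the hit in a DB

header('Content-Type: application/octet-stream');
header('Content-Length: ' . filesize($path));
header('Content-Disposition: attachment; filename="' . $file . '"');
readfile($path);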
There are ways to pipe Apache logs to MySQL, but from what I've seen it can be tricky. Depending on what you're doing, it may be worth the effort... but then again it might not.
You can parse the Apache log files.
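For example, a minimal sketch that counts successful hits per file under /downloads/ in a common-format access log (the log path varies by system and is an assumption here):

<?php
$counts = array();

foreach (file('/var/log/apache2/access.log') as $line) {
    // e.g. ... "GET /downloads/foo.zip HTTP/1.1" 200 ...
    if (preg_match('#"GET (/downloads/[^ ]+) HTTP/[^"]*" 200#', $line, $m)) {
        $counts[$m[1]] = isset($counts[$m[1]]) ? $counts[$m[1]] + 1 : 1;
    }
}

arsort($counts);                   // most-downloaded first
print_r($counts);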
Apache's mod_lua is probably the most general, flexible, and efficient approach to hooking your own code into Apache's request processing. Usually you choose the language that offers the most direct approach to the task, and Lua interacts with C/C++ much better than most alternatives.
However, there certainly are other strategies, so be creative. Two things come to my mind immediately:
Some creative use of PAM, if you are on a Unix-like system: configure some kind of dummy authentication requirement and set up PAM for processing. Inside the PAM configuration you can do whatever you like. The advantage: you see the requests and can filter yourself what to count and what not. You have to make sure the PAM response does not create a valid session, though, so that you really get a tick for each request made by a client, not only the first one.
There are other Apache modules that allow request processing hooks. Have a look at the forensic logging module (mod_log_forensic) or the external filter module (mod_ext_filter). Both allow hooking external logic into request processing. You will need CLI-based PHP configured for that.
