Are there HTTP header fields I could use to spot spam bots? - php

It stands to reason that scrapers and spambots wouldn't be built as well as normal web browsers. With this in mind, it seems like there should be some way to spot blatant spambots by just looking at the way they make requests.
Are there any methods for analyzing HTTP headers or is this just a pipe-dream?
Array
(
[Host] => example.com
[Connection] => keep-alive
[Referer] => http://example.com/headers/
[Cache-Control] => max-age=0
[Accept] => application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
[User-Agent] => Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.44 Safari/534.7
[Accept-Encoding] => gzip,deflate,sdch
[Accept-Language] => en-US,en;q=0.8
[Accept-Charset] => ISO-8859-1,utf-8;q=0.7,*;q=0.3
)
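For reference, a dump like the one above can be built from $_SERVER; a sketch (under Apache, getallheaders() does this natively — this fallback just reverses PHP's HTTP_ prefix convention):

```php
<?php
// Collect request headers from a $_SERVER-style array:
// HTTP_ACCEPT_ENCODING -> Accept-Encoding, etc.
function request_headers(array $server) {
    $headers = array();
    foreach ($server as $key => $value) {
        if (strpos($key, 'HTTP_') === 0) {
            $name = str_replace(' ', '-',
                ucwords(strtolower(str_replace('_', ' ', substr($key, 5)))));
            $headers[$name] = $value;
        }
    }
    return $headers;
}
```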

If I were writing a spam bot, I would fake the headers of a normal browser, so I doubt this is a viable approach. Some suggestions that might help instead:
use a captcha
if that's too annoying, a simple but effective trick is to include a text input which is hidden by a CSS rule; users won't see it, but spam bots won't normally bother to parse and apply all the CSS rules, so they won't realise the field is invisible and will put something in it. Check on form submission that the field is empty, and discard the submission if it isn't.
use a nonce on your forms; check that the nonce that was used when you rendered the form is the same as when it's submitted. This won't catch everything, but will ensure that the post was at least made by something that received the form in the first place. Ideally change the nonce every time the form is rendered.
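A minimal sketch of the honeypot and nonce checks in PHP (the field names 'website' and 'nonce' are just examples):

```php
<?php
// Returns true when a submission looks automated: either the hidden
// honeypot field was filled in, or the submitted nonce does not match
// the one stored in the session when the form was rendered.
function is_probable_bot(array $post, $sessionNonce) {
    $honeypotFilled = isset($post['website']) && $post['website'] !== '';
    $nonceValid = isset($post['nonce'])
        && is_string($sessionNonce)
        && hash_equals($sessionNonce, $post['nonce']);
    return $honeypotFilled || !$nonceValid;
}
```

On render, store something like bin2hex(random_bytes(16)) in the session and emit it as a hidden field; on submission, call is_probable_bot($_POST, $_SESSION['form_nonce']) and then clear the session value so the nonce is single-use.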

You can't find all bots this way, but you could catch some, or at least get some probability of a UA being a bot and use that in conjunction with another method.
Some bots forget about Accept-Charset and Accept-Encoding headers. You may also find impossible combinations of Accept and User-Agent (e.g. IE6 won't ask for XHTML, Firefox doesn't advertise MS Office types).
When blocking, be careful about proxies, because they could modify the headers. I recommend backing off if you see Via or X-Forwarded-For headers.
Ideally, instead of writing rules manually, you could use a Bayesian classifier. It could be as simple as joining the relevant headers together and using them as a single "word" in the classifier.
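A sketch of both ideas against a PHP-style header array; the choice of headers and the scoring are illustrative, not a proven rule set:

```php
<?php
// Crude "bot score": count headers that many bots forget to send,
// and back off entirely when a proxy may have rewritten the request.
function header_bot_score(array $headers) {
    $score = 0;
    if (empty($headers['Accept-Charset']))  $score++;
    if (empty($headers['Accept-Encoding'])) $score++;
    if (empty($headers['Accept-Language'])) $score++;
    if (!empty($headers['Via']) || !empty($headers['X-Forwarded-For'])) {
        $score = 0; // headers may have been modified in transit
    }
    return $score;
}

// Join the relevant headers into a single "word" for a classifier.
function header_token(array $headers) {
    $keys = array('User-Agent', 'Accept', 'Accept-Encoding', 'Accept-Charset');
    $parts = array();
    foreach ($keys as $k) {
        $parts[] = isset($headers[$k]) ? $headers[$k] : '-';
    }
    return md5(implode('|', $parts));
}
```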

Related

MSXML2.XMLHTTP send data to PHP script

I have been looking into an option to send data, read from an attached file in an Outlook message, directly to a PHP script that will then insert the data into a nice MySQL database.
The extraction of the file and the splitting of the data are all OK, but here is the trick...
On the internet (here) I found a nice post by Jeremy Slade, who has managed to send some data to a CGI script; all good.
So, clever as I thought I was, I figured I could rewrite this to deal with a PHP script.
But that's where things stopped working.
I have shortened the code to the snippet below:
Sub TestURL()
    Dim xhr As Object
    Dim URL As String
    Dim data As String
    Set xhr = CreateObject("MSXML2.XMLHTTP")
    URL = "http://somedomain.com/php/test.php"
    data = "someVariable=Test"
    With xhr
        .Open "POST", URL, False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .Send data
    End With
End Sub
This should, in theory, open an MSXML2.XMLHTTP request to the given URL and send the data along with it to the script.
Funnily enough, the script is called, but no data is passed?
I've tried setting the PHP script to read both $_GET and $_POST for the [someVariable] element, yet neither receives any data?
When I set the PHP to $_GET I matched the VBA MSXML2.XMLHTTP request to "GET" as well, and vice versa...
I've tried passing the 'data' variable as an argument to .Send by including it in brackets,
i.e.
.send (data)
But this doesn't work either...
I'm a bit at a loss, because the script is called and a data line is added to the table, yet there is no actual transfer of the 'sent' data?
I've tried appending the data string to the URL that is passed to the HTTP object, essentially turning it into a 'GET' request,
i.e.
URL = URL & "?" & data
but to no avail...:-(
The PHP script itself works properly if I pass data directly from the browser,
i.e.
http://somedomain.com/php/test.php?someVariable=Test
the data is correctly added and the variable is read...
Can some more enlightened spirits guide me in the right direction ?
20141016 ********** UPDATE **********
OK, while digging into this I found there is also an option to refer to the XmlHttp object as "Microsoft.XmlHttp"?
Funnily enough, when setting the object like that,
i.e.
Set xhr = CreateObject("Microsoft.XMLHTTP")
The code works, the data is added to the table, and the .responseText is a success message.
Yet if I return to the original code, I get a PHP error message telling me there is an error in my PHP syntax?? This would imply that the actual data being sent differs between "MSXML2.XMLHTTP" and "Microsoft.XMLHTTP"???
I have tried to dig out the difference between the two on the internet, but can't find any post that gives me a full understanding of the subject?
Despite the fact that my code now works, I still have the nagging question of not understanding the difference between the two, and would appreciate a reply from someone who does :-) As I now have code that works, but no understanding of why it works... :-)
Or moreover, no understanding of why the "MSXML2" option does NOT work...
Much appreciated,
Kindest regards
Martijn
This is not exactly an answer but more of a comment, as I lack enough reputation to comment.
The issue can be analyzed using Fiddler, which provides details of the requests and responses. I checked the same code as yours on my system with both the MSXML2.XMLHTTP and Microsoft.XMLHTTP objects and found no difference in the requests. Both of them passed the POST request body containing someVariable=Test to the URL http://somedomain.com/php/test.php.
Here is the raw POST request in both cases:
POST http://somedomain.com/php/test.php HTTP/1.1
Accept: */*
Accept-Language: en-us
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/5.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; .NET4.0C; .NET4.0E; MS-RTC LM 8)
Host: somedomain.com
Content-Length: 17
Proxy-Connection: Keep-Alive
Pragma: no-cache
someVariable=Test
And the response from the sample URL provided:
HTTP/1.1 405 Method Not Allowed
Server: nginx/1.7.6
Date: Thu, 08 Jan 2015 15:23:58 GMT
Content-Type: text/html
via: HTTP/1.1 proxy226
Connection: close
<html>
<head><title>405 Not Allowed</title></head>
<body bgcolor="white">
<center><h1>405 Not Allowed</h1></center>
<hr><center>nginx/1.7.6</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
One question here would be whether the web server in question expects further data to be passed by way of headers (User-Agent, Referer, cookies, etc.) or as part of the request body (perhaps further input elements that are part of the web form)?
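While debugging something like this, it can help to make test.php echo back exactly what arrived, including the raw request body, since data posted with a wrong or missing Content-Type never makes it into $_POST. A sketch:

```php
<?php
// test.php (sketch): echo back what the request carried, so the VBA
// side can inspect it via .responseText.
function dump_request(array $get, array $post, $rawBody) {
    return "GET:  " . print_r($get, true)
         . "POST: " . print_r($post, true)
         . "RAW:  " . $rawBody . "\n";
}

header('Content-Type: text/plain');
// php://input holds the raw body even when PHP could not parse it.
$raw = (PHP_SAPI === 'cli') ? '' : file_get_contents('php://input');
echo dump_request($_GET, $_POST, $raw);
```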

Ajax response is JSON on server and STRING on local computer with CakePHP, why?

Locally and on the server, I get different results with the same code.
Locally my result arrives as a string, while on the server the same code returns a JSON object. Can anybody tell me why?
The javascript:
$.post(
    url, // Various urls of type '/users/add_secondary_email_ajax'
    data,
    function(res){
        if (typeof(res.success) == 'undefined'){
            ModalManager.update_body_html(res);
        } else {
            callback_success(res);
        }
    }
);
The CakePHP:
$this->autoRender = false;
$this->RequestHandler->respondAs('json');
echo json_encode( array('success'=>true) ); // this arrives as string locally
return;
I also had this working on my other computer, but not this one. Could it be some PHP setting?
Both computers have the same browser and CakePHP version (2.2.3).
I see differences in the PHP and Apache versions. It could be the settings as well, but I don't know where to look.
Header On Broken Computer:
Request URL:localhost/alert_subscribers/subscribe_ajax
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8,bg;q=0.6
Connection:keep-alive
Content-Length:153
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:timezoneoffset=-120; viewedJobsGuest=[24]; __atuvc=13%7C11%2C46%7C12; CAKEPHP=dfbf9407743d43eb619a42aa5dbda735; toolbarDisplay=hide
Host:jobsadvent.dev
Origin:URL:localhost
Referer:URL:localhost/search
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
X-Requested-With:XMLHttpRequest
Form Data
data[title]:the title
data[email]:fake2#hotmail.com
data[alert]:1
Response Headers
Connection:Keep-Alive
Content-Length:57
Content-Type:text/html
Date:Fri, 21 Mar 2014 10:19:06 GMT
Keep-Alive:timeout=5, max=100
Server:Apache/2.2.26 (Unix) DAV/2 PHP/5.4.24 mod_ssl/2.2.26 OpenSSL/0.9.8y
X-Powered-By:PHP/5.4.24
Header on Working computer
Request URL:http://domain.com/alert_subscribers/subscribe_ajax
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8,bg;q=0.6
Connection:keep-alive
Content-Length:162
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:__atuvc=1%7C10%2C5%7C11; timezoneoffset=-120; CAKEPHP=sb3013ffk40h7o1jhsl8ulqfj4; toolbarDisplay=hide
Host:domain.com
Origin:http://domain.com
Referer:http://domain.com/search
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
X-Requested-With:XMLHttpRequest
Form Data
data[title]:the title
data[email]:fake#hotmail.com
data[alert]:1
Response Headers
Connection:close
Content-Length:57
Content-Type:application/json
Date:Fri, 21 Mar 2014 10:24:32 GMT
Server:Apache/2.2.15 (CentOS)
X-Powered-By:PHP/5.3.3
As for the routes.php file both are identical and contain the following line:
Router::parseExtensions('json');
This could be an issue with your Apache settings:
The answer given on Apache sending incorrect response header for .js files suggests that you need something like
<FilesMatch \.php$>
SetHandler application/x-httpd-php
</FilesMatch>
to get the right content types.
Refer to the jQuery.post() documentation. There is a fourth parameter (dataType) that you can use that will force jQuery to coerce the response to the correct datatype. You will need to set that equal to 'json' if you want an object back.
Well, no - on one computer it is application/json and on the other it is text/html. Both have the same code I posted above.
There's your problem. jQuery uses the response's Content-Type header as a guide.
The CakePHP docs seem to indicate that $this->RequestHandler->respondAs() may work better if you pass it application/json rather than just json.
JSON parsing should fix it.
$.post(
    url, // Various urls of type '/users/add_secondary_email_ajax'
    data,
    function(res){
        var result = JSON.parse(res);
        if (typeof(result.success) == 'undefined'){
            ModalManager.update_body_html(res);
        } else {
            callback_success(result);
        }
    }
);
I would set the contentType and dataType in your request. In the jQuery versions of that era, $.post() does not accept an options object, so use $.ajax():
$.ajax({
    url : url,
    type : "POST",
    data : data,
    contentType : "application/x-www-form-urlencoded; charset=UTF-8",
    dataType : "json"
});
When calling the API:
$.post();
the "dataType" param should be set to "json". If it is not specified, jQuery will make an intelligent guess (xml, json, script, text, html...); see the manual here:
So how does the Ajax call guess the type of the data?
There is a response header, "Content-Type", by which the server tells the client what type the data is. I think Ajax needs this header to guess the data type.
this is your broken computer's response:
Content-Type:text/html
and this is your working computer's response:
Content-Type:application/json
If you don't want to specify the "dataType" param of $.post(), you can change the response header instead; there must be many ways to change it, for example:
<?php
header("Content-Type:application/json");
?>
That could be messy, but don't get worried until there's something to really worry about.
Statement of fact: one of your servers is behaving as expected and the other is not.
With the way that your error is manifesting, it sounds an awful lot like you are not specifying your request precisely enough, or your borked server is failing content negotiation.
There are two basic things that come into play here that you likely already know about: the requester's "Accept" header, which allows the user agent to specify the content types it is willing to receive, and the server's ability to interpret that request and serve it appropriately. In the absence of an explicitly set Accept header, text/html is the default response type.
Accept Header: RFC2616 Hypertext Transfer Protocol Section 14.1
The Accept request-header field can be used to specify certain media
types which are acceptable for the response. Accept headers can be
used to indicate that the request is specifically limited to a small
set of desired types, as in the case of a request for an in-line
image.
The asterisk "*" character is used to group media types into ranges,
with "*/*" indicating all media types and "type/*" indicating all
subtypes of that type. The media-range MAY include media type
parameters that are applicable to that range.
The accept headers that you set for each request indicate that you don't care what the server gives you. You might try setting your accept header to application/json and see if the "broken" server can interpret it and serve you. If that works, then it seems you're just running into an inconsistency with the way the servers are defaulting their response types. This even looks to be what you're asking for it to do. You said you accept all response types. If you don't specify something specific, the most reasonable type for a server to give you is text/html
MIME Types: RFC 2046 Multipurpose Internet Mail Extensions
JSON: RFC 4627 The application/json Media Type for JavaScript Object Notation (JSON)
If setting the Accept header doesn't work for you, you're going to want to check your server's MIME type registration to make sure that application/json is registered and configured. That is not an esoteric configuration subject, so it should be covered in any server's configuration documentation.
If neither of those approaches work, then the solution is to unplug the offending machine, carry it to the top of the building, and throw it as far as you can.

HTTP request getting partial response

I'm trying to get this CrunchBase API page as a string in PHP. When I visit that page in a browser, I get the full response (some 230K characters); however, when I try to get the page in a script, the response is much shorter (24341 characters on a server and 36629 characters locally, with exactly the same number of characters for other long CrunchBase pages). To get the page, I am using a function almost identical to drupal_http_request() although I'm not using Drupal. (I have also tried using cURL and file_get_contents() and got the same result. And now that I'm thinking about it I have experienced the same from CrunchBase in Python in the past.)
What could be causing this and how can I fix it? PHP 5.3.2, Apache 2.2.14, Ubuntu 10.04. Here are additional details on the response:
[protocol] => HTTP/1.1
[headers] => Array
(
[content-type] => text/javascript; charset=utf-8
[connection] => close
[status] => 200 OK
[x-powered-by] =>
[etag] => "d809fc56a529054e613cd13e48d75931"
[x-runtime] => 0.00453
[content-length] => 230310
[cache-control] => private, max-age=0, must-revalidate
[server] => nginx/1.0.10 + Phusion Passenger 3.0.11 (mod_rails/mod_rack)
)
I don't think it's a user agent issue as I used User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6 in the request.
UPDATE
According to this thread I needed to add the Accept-Encoding: gzip, deflate header to the request. That does result in a longer response, but now I have to figure out how to inflate it. The gzinflate() function fails with a Warning: Data error. Any thoughts on how to inflate the response?
See the comments in the PHP docs about gzinflate(), specifically the remarks about stripping the initial bytes. The last comment did the trick for me:
<?php $dec = gzinflate(substr($enc,10)); ?>
Though it seems that the number of bytes to be stripped depends on the original encoder. Another comment has a more thorough solution, and a reference to RFC1952 for further reading.
Evidently gzdecode() is meant to address this issue, but it hasn't been released yet.
ps -- I deleted my comment about the returned data being plain text. I was wrong.
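A sketch of that approach: check the RFC 1952 magic bytes before stripping the 10-byte header, and fall back gracefully for non-gzip bodies. Note this minimal version ignores optional header fields such as FEXTRA and FNAME, which some encoders add (on PHP 5.4+ you can simply call gzdecode()):

```php
<?php
// Decode a possibly gzip-encoded HTTP response body.
function gzip_body_decode($body) {
    // RFC 1952: gzip streams start with \x1f\x8b, and the basic
    // header is 10 bytes with an 8-byte trailer at the end.
    if (strlen($body) > 18 && substr($body, 0, 2) === "\x1f\x8b") {
        return gzinflate(substr($body, 10));
    }
    // Not gzip: maybe a raw deflate stream, else return as-is.
    $inflated = @gzinflate($body);
    return ($inflated === false) ? $body : $inflated;
}
```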

prevent query to captcha generator from YSlow

I have a pretty simple captcha, something like this:
<?php
session_start();
function randomText($length) {
    $pattern = "1234567890abcdefghijklmnopqrstuvwxyz";
    $key = "";
    for ($i = 0; $i < $length; $i++) {
        $key .= $pattern[rand(0, 35)];
    }
    return $key;
}
$textCaptcha = randomText(8);
$_SESSION['tmptxt'] = $textCaptcha;
$captcha = imagecreatefromgif("bgcaptcha.gif");
$colText = imagecolorallocate($captcha, 0, 0, 0);
imagestring($captcha, 5, 16, 7, $textCaptcha, $colText);
header("Content-type: image/gif");
imagegif($captcha);
?>
The problem is that if the user has YSlow installed, the image is requested twice, so the captcha is regenerated and never matches the one entered by the user.
I saw that it only gets requested a second time if I send the Content-Type header as image/gif; if I print the output as a normal PHP page, this doesn't happen.
Does anyone have any clue about this? How can I prevent it, or identify that the second request is made by YSlow, so that I don't generate the captcha again?
Regards,
Shadow.
YSlow does request the page components when run, so it sounds like your problem is cases where the user has YSlow installed and it's set to run automatically at each page load.
The best solution may be to adjust your captcha code to not recreate new values within the same session, or if it does to make sure the session variable matches the image sent.
But to your original question about detecting the second query made by YSlow, it's possible if you look at the HTTP headers received.
I just ran a test and found these headers sent with the YSlow request. The User-Agent is set to match the browser (Firefox in my case), but you could check for the presence of X-YQL-Depth as a signal. (YSlow uses YQL for all of its requests.)
Array
(
[Client-IP] => 1.2.3.4
[X-Forwarded-For] => 1.2.3.4, 5.6.7.8
[X-YQL-Depth] => 1
[User-Agent] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0.1) Gecko/20100101 Firefox/8.0.1
[Accept-Encoding] => gzip
[Host] => www.example.com
[Connection] => keep-alive
[Via] => HTTP/1.1 htproxy1.ops.sp1.yahoo.net[D1832930] (YahooTrafficServer/1.19.5 [uScM])
)
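Putting both suggestions together, a hedged sketch: reuse the session's captcha text on repeat fetches instead of regenerating it, and use the X-YQL-Depth header (seen in the dump above) only as an extra signal:

```php
<?php
// Guard so the snippet also runs from the CLI.
if (PHP_SAPI !== 'cli') {
    session_start();
}

function randomText($length) {
    $pattern = "1234567890abcdefghijklmnopqrstuvwxyz";
    $key = "";
    for ($i = 0; $i < $length; $i++) {
        $key .= $pattern[rand(0, 35)];
    }
    return $key;
}

// YSlow fetches page components through YQL, which adds this header.
function is_yql_refetch(array $server) {
    return isset($server['HTTP_X_YQL_DEPTH']);
}

// Generate the code only when the session doesn't have one yet;
// a second request (YSlow's included) reuses the stored value,
// so the image always matches what the user will type.
if (!isset($_SESSION['tmptxt'])) {
    $_SESSION['tmptxt'] = randomText(8);
}
$textCaptcha = $_SESSION['tmptxt'];
```

Clear $_SESSION['tmptxt'] after a successful form submission so the next form view gets a fresh code.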

Email Tracking - GMail

I am creating my own email tracking system for email marketing tracking. I have been able to determine each person's email client by using the HTTP referrer, but for some reason Gmail does not send an HTTP_REFERER at all!
So I am trying to find another way of identifying when Gmail requests a transparent image from my server. I get the following headers from print_r($_SERVER):
DOCUMENT_ROOT = /usr/local/apache/htdocs
GATEWAY_INTERFACE = CGI/1.1
HTTP_ACCEPT = */*
HTTP_ACCEPT_CHARSET = ISO-8859-1,utf-8;q=0.7,*;q=0.3
HTTP_ACCEPT_ENCODING = gzip,deflate,sdch
HTTP_ACCEPT_LANGUAGE = en-GB,en-US;q=0.8,en;q=0.6
HTTP_CONNECTION = keep-alive
HTTP_COOKIE = __utmz=156230011.1290976484.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=156230011.422791272.1290976484.1293034866.1293050468.7
HTTP_HOST = xx.xxx.xx.xxx
HTTP_USER_AGENT = Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.237 Safari/534.10
PATH = /bin:/usr/bin
QUERY_STRING = i=MTA=
REDIRECT_STATUS = 200
REMOTE_ADDR = xx.xxx.xx.xxx
REMOTE_PORT = 61296
REQUEST_METHOD = GET
Is there anything of use in that list? Or is there something else I can do to actually get the HTTP referrer? If not, how are other ESPs managing to find out whether Gmail was used to view an email?
Btw, I appreciate it if we can hold back on whether this is ethical or not as many ESPs do this already, I just don't want to pay for their service and I want to do it internally.
Thanks all for any implementation advice.
Update
Just thought I would update this question and make it clearer in light of the bounty.
I would like to find out when a user opens my email when it is sent to a Gmail inbox. Assume I have the usual transparent image tracking and the user does not block images.
I would like to do this with the single request and the header details I get when the transparent image is requested.
Are your images requested over plain HTTP?
If so, that's the problem.
HTTPS->HTTP referrals do not leak a Referer header (HTTP_REFERER).
If you embed an HTTP-hosted image in an email that is requested from an HTTPS page, it won't send a referrer. (HTTP pages requesting HTTPS resources, however, do send a referrer.) The solution is to embed the image as HTTPS. I've tested it, and sure enough, secure HTTPS images do indeed send the referrer.
One way Gmail could block the referrer information on loaded images by default is if they used a referrer policy, which is supported on most modern browsers. (As of 2011, they did not implement such a policy.)
See the below screenshot of an embedded image that is generated dynamically with the HTTP REFERER of the request:
Make the link something like http://www.example.com/image.jpg?h=8dh38dj
image.jpg is a PHP file, and 8dh38dj is the hash of the email you included the link in. When the user requests the file, your PHP script will receive '8dh38dj', look it up in your database, and find the matching email address. Parse out the domain, i.e. gmail.com from example@gmail.com, and you know it came from Gmail. To make .jpg files execute as PHP, use an AddHandler directive in your Apache configuration.
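A sketch of that pixel handler; lookup_email_by_hash() and record_open() are hypothetical placeholders for your own database code:

```php
<?php
// image.jpg executed as PHP via AddHandler; ?h= carries the
// per-recipient hash that was embedded in the email.

// "user@gmail.com" -> "gmail.com"
function provider_from_email($email) {
    $at = strrpos($email, '@');
    return ($at === false) ? '' : substr($email, $at + 1);
}

$hash = isset($_GET['h']) ? $_GET['h'] : '';
// Hypothetical lookup/logging, e.g. against MySQL:
// $email = lookup_email_by_hash($hash);
// record_open($hash, provider_from_email($email));

// Emit a 1x1 transparent GIF so the email still renders normally.
header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
```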
