Curl doesn't recognize expires value in cookie correctly - php

I'm trying to perform a log-in on pinterest.com with curl. I got the following request-response flow:
1. GET request for the login form, scraping the hidden fields (csrftoken)
2. POST request with the login credentials (mail and pw) and the scraped csrftoken
3. Receive the session cookie for the login
Using Curl, I can see the following Headers being sent and received:
GET /login/?next=%2F HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
Host: pinterest.com
Referer:
Accept: text/html,application/xhtml+xml,application/xml,*/*
Accept-Language: de-de,en-us
Connection: keep-alive
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Date: Tue, 10 Apr 2012 15:03:24 GMT
ETag: "45d6a85f0ede46f13f4fc751842ce5b7"
Server: nginx/0.8.54
Set-Cookie: csrftoken=dec6cb66064f318790c6d51e3f3a9612; Max-Age=31449600; Path=/
Set-Cookie: _pinterest_sess="eJyryMwNcTXOdtI3zXcKNq0qznIxyXVxK/KqSsy3tY8vycxNtfUN8a3yc3E09nXxLPdztLVVK04tLs5MsfXNAopVpVf6VnlW+Qba2gIAuqgZIg=="; Domain=pinterest.com; HttpOnly; expires=Tue, 17-Apr-2012 15:03:24 GMT; Max-Age=1334675004; Path=/
Vary: Cookie, Accept-Encoding
Content-Length: 4496
Connection: keep-alive
So after step 1, the two cookies csrftoken and _pinterest_sess are set. But a look in the cookiejar file (I use CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR to let curl handle the cookie processing) shows the following:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
pinterest.com FALSE / FALSE 1365519805 csrftoken dec6cb66064f318790c6d51e3f3a9612
#HttpOnly_.pinterest.com TRUE / FALSE -1626222087 _pinterest_sess "eJyryMwNcTXOdtI3zXcKNq0qznIxyXVxK/KqSsy3tY8vycxNtfUN8a3yc3E09nXxLPdztLVVK04tLs5MsfXNAopVpVf6VnlW+Qba2gIAuqgZIg=="
The first thing to note is the #HttpOnly_ prefix preceding the _pinterest_sess cookie line. I just assume that curl handles that fine. But looking further, one can see that a negative value is set as the expiration date: -1626222087
I don't know where that's coming from, because the cookie is set with "expires=Tue, 17-Apr-2012 15:03:24 GMT" (which is about 7 days in the future, counting from today).
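To make the cookie jar lines above easier to inspect, here is a minimal sketch that splits one jar line into fields. The function name parseCookieJarLine is hypothetical, and the sketch assumes whitespace-separated fields (libcurl itself writes tabs) and a shortened cookie value:

```php
<?php
// Hypothetical helper: parse one line of a Netscape-format cookie jar.
// Assumption: fields are separated by whitespace (libcurl writes tabs).
function parseCookieJarLine(string $line): ?array
{
    $httpOnly = false;
    if (strpos($line, '#HttpOnly_') === 0) {
        // libcurl marks HttpOnly cookies with this prefix; it is not a comment.
        $httpOnly = true;
        $line = substr($line, strlen('#HttpOnly_'));
    } elseif ($line === '' || $line[0] === '#') {
        return null; // blank line or a real comment
    }

    $parts = preg_split('/\s+/', trim($line), 7);
    if (count($parts) < 7) {
        return null; // not a complete cookie line
    }

    [$domain, $includeSubdomains, $path, $secure, $expires, $name, $value] = $parts;

    return [
        'domain'    => $domain,
        'http_only' => $httpOnly,
        'expires'   => (int) $expires,
        'name'      => $name,
        'value'     => $value,
    ];
}

// Cookie value shortened for readability.
$cookie = parseCookieJarLine('#HttpOnly_.pinterest.com TRUE / FALSE -1626222087 _pinterest_sess "abc"');
var_dump($cookie['expires'] < 0); // bool(true): the stored expiry lies before the epoch
```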
On the next request, the _pinterest_sess cookie won't be set by curl:
POST /login/?next=%2F HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
Host: pinterest.com
Referer: https://pinterest.com/login/?next=%2F
Cookie: csrftoken=dec6cb66064f318790c6d51e3f3a9612
Accept: text/html,application/xhtml+xml,application/xml,*/*
Accept-Language: de-de,en-us
Connection: keep-alive
Content-Length: 123
Content-Type: application/x-www-form-urlencoded
HTTP/1.1 302 FOUND
Content-Type: text/html; charset=utf-8
Date: Tue, 10 Apr 2012 15:05:26 GMT
ETag: "d41d8cd98f00b204e9800998ecf8427e"
Location: http://pinterest.com/
Server: nginx/0.8.54
Set-Cookie: _pinterest_sess="eJzLcssPCy4NTclIjvAOrjQzyywoCChISgvLDi+2tY9PrSjILEottvUN8a3yc4k09gtxrfRLt7VVK04tLs5MAYonV/qGeFb4ZkWW+4LES4tTi+KBEv4u6UZ+WYEmvlm+QOxZ6R/iWOEbEmgLAKNfJps="; Domain=pinterest.com; HttpOnly; expires=Tue, 17-Apr-2012 15:05:26 GMT; Max-Age=1334675126; Path=/
Vary: Cookie
Content-Length: 0
Connection: keep-alive
In the response, another _pinterest_sess cookie is set since curl didn't send the last one.
Currently, I don't know if I'm doing something wrong or if curl just isn't able to parse the expires value in the cookie correctly.
Any help would be greatly appreciated :)
// edit
One more thing:
According to http://opensource.apple.com/source/curl/curl-57/curl/lib/cookie.c the function curl_getdate() is used to extract the date. The documentation on that function lists some examples (http://curl.haxx.se/libcurl/c/curl_getdate.html):
Sun, 06 Nov 1994 08:49:37 GMT
Sunday, 06-Nov-94 08:49:37 GMT
Sun Nov 6 08:49:37 1994
06 Nov 1994 08:49:37 GMT
06-Nov-94 08:49:37 GMT
Nov 6 08:49:37 1994
06 Nov 1994 08:49:37
06-Nov-94 08:49:37
1994 Nov 6 08:49:37 GMT
08:49:37 06-Nov-94
Sunday 94 6 Nov 08:49:37
1994 Nov 6
06-Nov-94
Sun Nov 6 94
1994.Nov.6
Sun/Nov/6/94/GMT
Sun, 06 Nov 1994 08:49:37 CET
06 Nov 1994 08:49:37 EST
Sun, 12 Sep 2004 15:05:58 -0700
Sat, 11 Sep 2004 21:32:11 +0200
20040912 15:05:58 -0700
20040911 +0200
None of them matches the above-mentioned expires date "Tue, 17-Apr-2012 15:03:24 GMT", because all examples with hyphens only use 2-digit years.

You are experiencing an issue on your computer because of the limits of 32 bit signed integer values.
The server sets a cookie with the Max-Age of 1334675004 seconds in the future.
Max-Age=1334675004
You posted your question here at 2012-04-10 15:13:24Z, which is the UNIX timestamp 1334070804. If you add 1334675004 to it and the result is stored in a signed 32-bit integer (maximum 2147483647), it wraps around to -1626221488:
1334070804
+ 1334675004
------------
-1626221488
As the numbers show, it looks like the server misunderstood the Max-Age attribute: if you subtract the two values from each other, the delta is roughly 7 days in seconds (604200 = ~6.99 days; the difference is because the cookie was set earlier than you posted your question here). However, Max-Age is a delta in seconds, not an absolute UNIX timestamp.
PHP_INT_MAX is fixed by the platform and cannot be raised, so try a 64-bit build of PHP and curl, which should prevent the negative number. However, the Max-Age calculation is still broken on the server. You might want to contact pinterest.com and report the problem.
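The wraparound can be reproduced with a short sketch. The helper wrapToSigned32 is a hypothetical name, the code assumes a 64-bit PHP build so the intermediate sum fits in a native integer, and the request time is back-calculated from the value in the cookie jar:

```php
<?php
// Simulate storing a value in a signed 32-bit integer.
// Assumes a 64-bit PHP build so the intermediate sum fits in a native int.
function wrapToSigned32(int $value): int
{
    $value &= 0xFFFFFFFF;          // keep only the low 32 bits
    return $value >= 0x80000000    // high bit set means negative in two's complement
        ? $value - 0x100000000
        : $value;
}

$requestTime = 1334070205; // 2012-04-10 15:03:25 UTC, about when the first response arrived
$maxAge      = 1334675004; // Max-Age value sent by the server

echo wrapToSigned32($requestTime + $maxAge), "\n"; // -1626222087, the expiry stored in the cookie jar
```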

Looks like pinterest.com is using Max-Age incorrectly, and that's why curl is deleting this cookie.
From your example, Max-Age contains the timestamp for Tue, 17-Apr-2012 15:03:24 GMT, while it should contain the number of seconds from the request time to this date: 604800 (judging from the request time in the Date header).
What curl is doing is adding the Max-Age value to the current timestamp and saving it as a signed 32-bit integer, hence -1626222087.
As for a solution, you can try contacting Pinterest and reporting the bug.
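A small sketch can confirm both numbers from the headers in the question. The helper toTimestamp is a hypothetical name, and the two format strings are assumptions chosen to match the expires and Date values as they appear in the first response:

```php
<?php
// Parse a GMT date string with an explicit format and return its UNIX timestamp.
function toTimestamp(string $format, string $value): int
{
    $date = DateTimeImmutable::createFromFormat($format, $value);
    if ($date === false) {
        throw new InvalidArgumentException("Cannot parse date: $value");
    }
    return $date->getTimestamp();
}

// Values copied from the first response in the question.
$expires = toTimestamp('D, d-M-Y H:i:s T', 'Tue, 17-Apr-2012 15:03:24 GMT');
$date    = toTimestamp('D, d M Y H:i:s T', 'Tue, 10 Apr 2012 15:03:24 GMT');

echo $expires, "\n";         // 1334675004, identical to the Max-Age the server sent
echo $expires - $date, "\n"; // 604800, the delta Max-Age should have carried
```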

Actually, you do not need to contact Pinterest, since the client is not required to send the cookie's max age back to the server (if you only use the cookie for a short period of time, or if you calculate the correct max age yourself). Just flip the minus sign on the expiry value and it will work, meaning the cookie will be sent back to the server.
That is not all you have to do, though. Depending on the login page presented, you sometimes also have to parse the hidden fields (where the CSRF token resides, which has to match the token value in the cookie). Moreover, it is sometimes necessary to change cookies (reset cookie values). Pinterest keeps making it harder to log in with automated tools and to do screen scraping, and it recently changed how the site functions, so the points above no longer work. You cannot really know when they will change how login works; you have to try and "guess" when a change happens. That attitude should really be directed not at developers but at those who threaten the security of the system (intruders).
You also have to think about the legality of the points above. Pinterest has an API (although it is down right now), so the best and most correct way is to use that API (please see https://github.com/kellan/pinterest.api.php). There you exchange messages in JSON format. The last option is to use m.pinterest.com, which is meant for mobile devices and is straightforward to use: parse the login HTML for hidden input fields and resubmit the form with the correct values (the same legality issues apply). Please consult Pinterest before using curl-like tools, or wait until the Pinterest API is up.
Yes, there are some improvements in the system, such as JSON responses, which put an end to screen scraping, but that does not mean a completely new API. Right now they have also (seemingly) implemented RESTful web services and an API and accept Ajax requests, which are again steps toward positive improvement.
There are many discussions going on around the net on this matter, so please refer to them for detailed info.

Related

Browser ignoring html/php cache headers?

I am confused about cache headers.
I am trying to make the browser cache my php/html pages because they are mostly static and they would not change for a month or so.
The urls look like: example.com/article-url and in background that is a php page like /article_url.php
I tried this in PHP:
$cache_seconds = 60*60*24*30;
header("Expires: ".gmdate('D, d M Y H:i:s \G\M\T', time()+$cache_seconds));
header("Cache-Control:public, max-age=".$cache_seconds);
And in browser debug window I can see that indeed the page would expire next month:
Request URL: https://www.example.com/article-url
Request Method: GET
Status Code: 200
Referrer Policy: no-referrer-when-downgrade
cache-control: public, max-age=2592000
content-encoding: gzip
content-length: 2352
content-type: text/html
date: Mon, 25 Nov 2019 14:23:40 GMT
expires: Wed, 25 Dec 2019 14:23:40 GMT
server: nginx
status: 200
vary: Accept-Encoding
But if I access that page again, I see it was generated again: I made it print the request timestamp in the footer, and I can see the page is regenerated on each page load.
I was expecting the browser to show the exact same page from cache.
What am I doing wrong?

Internet Explorer 10 back button caching

In Internet Explorer 10, if you press the back button it would try to fetch the previous page from the browser cache. This behavior differs from virtually every other browser including IE9 in which pressing the back button would do a full reload of the previous page instead of reusing the cache.
How do I communicate with IE10 from the website, possibly using javascript/headers etc to not do this cache utilization for the site globally?
(Note: I'm not looking for an IE10 setting to disable this. I'm looking for a solution that can be implemented in the Website and not the browser to instruct IE10 to not use this cache for the back button). Also I'm looking for a global solution that works for every page in the site...
I use PHP/Jquery for the site
so here's more information
The page is a Form. It contains some dynamically loaded info. (Let's say it contains the number of times the user submitted the form)
You click on the submit button of the form. You will then get redirected to the form's action page.
Then you press the back button.
In every other browser, it would reload the initial form with the newly updated "number of times the user submitted the form". In IE10, however, this doesn't happen. How do I get this to happen in IE10?
Here are some example headers:
1. When you first load the form:
Request Header
Key Value
Request GET /path/to/my/page HTTP/1.1
Accept text/html, application/xhtml+xml, */*
Accept-Language en-US
User-Agent Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)
Accept-Encoding gzip, deflate
Host myhost.com
If-Modified-Since Tue, 10 Sep 2013 23:55:33 GMT
If-None-Match "1378857333"
DNT 1
Connection Keep-Alive
Cookie __utma=104299925.1011127538.1340896287.1364829735.1378764406.12; __utmz=104299925.1340896287.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); has_js=1; __utmc=104299925; __qca=P0-1247924781-1340896285157; _mkto_trk=id:601-CPX-764&token:_mch-sadfsadfze.com-1358808312889-73607; __utma=171146939.775168663.1343066079.1375907514.1378762647.41; __utmz=171146939.1343066079.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); s_stats_browser_info=%7B%22pluginInfo%22%3A%7B%22pdf%22%3A%5B%22pdf%22%2C%22application/pdf%22%2C%220%22%5D%2C%22quicktime%22%3A%5B%22qt%22%2C%22video/quicktime%22%2C%220%22%5D%2C%22realplayer%22%3A%5B%22realp%22%2C%22audio/x-pn-realaudio-plugin%22%2C%220%22%5D%2C%22wma%22%3A%5B%22wma%22%2C%22application/x-mplayer2%22%2C%220%22%5D%2C%22director%22%3A%5B%22dir%22%2C%22application/x-director%22%2C%220%22%5D%2C%22flash%22%3A%5B%22fla%22%2C%22application/x-shockwave-flash%22%2C%220%22%5D%2C%22java%22%3A%5B%22java%22%2C%22application/x-java-vm%22%2C%221%22%5D%2C%22gears%22%3A%5B%22gears%22%2C%22application/x-googlegears%22%2C%220%22%5D%2C%22silverlight%22%3A%5B%22ag%22%2C%22application/x-silverlight%22%2C%220%22%5D%7D%2C%22res%22%3A%221920x1080%22%7D; _pk_id.2.1644=19232922ec6753dc.1371502517.1.1371502630.1371502517.; SESS569093948b0206b05eb2212616da3db6=1977iogjr841af2s8l4sd1cjd0; XDEBUG_SESSION=12250; has_js=1; __utmc=171146939
Response Header
Key Value
Response HTTP/1.1 200 OK
Date Tue, 10 Sep 2013 23:55:44 GMT
Server Apache/2.2.20 (Ubuntu)
X-Powered-By PHP/5.4.15-1~tooptee10+1
Last-Modified Tue, 10 Sep 2013 23:55:44 +0000
Cache-Control no-cache, must-revalidate, post-check=0, pre-check=0
ETag "1378857344"
Keep-Alive timeout=15, max=9987
Connection Keep-Alive
Content-Type text/html; charset=utf-8
2. When you hit the back button to go back to that form
Request Header
Key Value
Request GET /path/to/my/page HTTP/1.1
Accept text/html, application/xhtml+xml, */*
Accept-Language en-US
User-Agent Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)
Accept-Encoding gzip, deflate
Host myhost.com
Response Header
Key Value
Response HTTP/1.1 304 Not Modified
X-Powered-By PHP/5.4.15-1~tooptee10+1
ETag "1378857344"
Keep-Alive timeout=15, max=9987
Content-Type text/html; charset=utf-8
Content-Length 117183
Expires Tue, 10 Sep 2013 22:55:36 GMT
Last-Modified Tue, 10 Sep 2013 23:55:44 GMT
Notice that it ends up returning a 304. When I tried this in Firefox, it returned 200 instead when you press the back button.
I think the behaviour you want is a behaviour that breaks the expectation of the back button for users.
Users expect that when they press back, it returns them back to the page they were previously viewing, in the state it was in when they left it. Most modern browsers achieve this by not only caching the page, but by retaining the page state (including the Javascript context) in memory so that when returning to the page via the back button, it's in the same state it was before, including anything they wrote into forms or any Javascript they interacted with.
In most browsers you can forcibly override this by setting Cache-Control headers such as no-cache and no-store. I don't know if no-store would work in your case for IE10, or if IE10 ignores even this and just goes back to the page anyway. If it did, I don't think I'd really blame it. It's doing it in the user's interest of both being fast, and of returning back to the page as it was when it was viewed before.
I think the approach that I would take, and you don't have to agree with me, is to re-think the design. Why do you require users to hit "back" if you are not going to show them the same thing they saw when they were back there? If you want to show an updated form, why not redirect after POST back to the form, which will count as a new page load and honor your Cache-Control headers? That is what I'd do and it's become somewhat of a de-facto standard.
tl;dr it's possible, but I'm not certain, that you could do what you want with no-store, but I'd be looking at moving to redirect after POST instead so as not to rely on the back button for something other than going back to the previous state.
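The redirect-after-POST idea can be sketched roughly as follows. The function handleSubmission and its return shape are hypothetical, and the counter update is left out because it is application-specific:

```php
<?php
// Hypothetical handler for the form's action page: process the POST, then redirect.
function handleSubmission(string $method, string $formUrl): array
{
    if ($method !== 'POST') {
        return ['status' => 405, 'headers' => ['Allow: POST']];
    }

    // ... update the submission counter here (application-specific, omitted) ...

    // 303 See Other asks the browser to fetch the form again with GET,
    // which counts as a new page load instead of a "back" navigation.
    return ['status' => 303, 'headers' => ['Location: ' . $formUrl]];
}

$response = handleSubmission('POST', '/path/to/my/page');
echo $response['status'], ' ', $response['headers'][0], "\n";
```

In a real script the returned status and header lines would be passed to http_response_code() and header() before any output is sent.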
You may be able to set some headers in PHP:
Cache-Control: private, must-revalidate, max-age=0
Expires: Thu, 01 Jan 1970 00:00:00 GMT
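In PHP that might look like the sketch below. The function name noCacheHeaders is hypothetical, and no-store is added here only because the earlier answer mentions it; whether IE10 honors these headers for the back button is not confirmed by anything in this thread:

```php
<?php
// Sketch: header lines that ask the browser not to reuse a stored copy.
function noCacheHeaders(): array
{
    return [
        'Cache-Control: private, no-store, must-revalidate, max-age=0',
        // Timestamp 0 formats as the start of the UNIX epoch, i.e. long expired.
        'Expires: ' . gmdate('D, d M Y H:i:s', 0) . ' GMT',
    ];
}

foreach (noCacheHeaders() as $line) {
    header($line); // must run before any output is sent
}
```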

How to check if a page is cached with php

...and whether it was cached 30 days ago.
I am using this code:
$page=get_headers('http://www.w3schools.com/php/func_date_strtotime.asp');
The output is this:
0=>HTTP/1.1 200 OK
1=>Connection: close
2=>Date: Thu, 03 May 2012 10:51:00 GMT
3=>Server: Microsoft-IIS/6.0
4=>MicrosoftOfficeWebServer: 5.0_Pub
5=>X-Powered-By: ASP.NET
6=>Pragma: no-cache
7=>Content-Length: 23643
8=>Content-Type: text/html
9=>Expires: Thu, 03 May 2012 10:50:00 GMT
10=>Set-Cookie: ASPSESSIONIDSAARQQST=AAMAAHBBBHBELMHDCHNNLMFP; path=/
11=>Cache-control: no-cache
I read that Pragma: no-cache doesn't necessarily mean that the page is uncacheable.
I want to know 2 things:
1) if the page is cached
2) if it was cached 30 days ago.
Can I do this?
$date1=gmdate("D, d M Y H:i:s", strtotime("30 days ago")) . " GMT";
$date2=$page['Expires'];
if($date1>$date2)
{
echo 'The page was cached for longer than 30 days';
}
Since PHP is a server-side language, you cannot check the browser cache (which is client-side) using PHP. You need some client-side scripting such as JavaScript rather than server-side programming like PHP.
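What PHP can do is inspect the caching headers the remote server sends, which says nothing about whether any browser actually holds a copy. The sketch below assumes headers shaped like the output of get_headers($url, true) (header names only become array keys when that second argument is set); describeCaching is a hypothetical helper, and it compares timestamps rather than formatted date strings:

```php
<?php
// Hypothetical helper: inspect headers shaped like get_headers($url, true) output.
function describeCaching(array $headers, int $now): array
{
    // Header names are case-insensitive, so normalise the keys first.
    $h = array_change_key_case($headers, CASE_LOWER);

    // Repeated headers arrive as arrays; join them into one string.
    $get = function (string $name) use ($h): string {
        $value = $h[$name] ?? '';
        return is_array($value) ? implode(', ', $value) : (string) $value;
    };

    $noCache = stripos($get('cache-control'), 'no-cache') !== false
        || stripos($get('cache-control'), 'no-store') !== false
        || stripos($get('pragma'), 'no-cache') !== false;

    // Compare timestamps, not formatted date strings.
    $expires = strtotime($get('expires')); // false when missing or unparsable
    $alreadyExpired = $expires !== false && $expires <= $now;

    return ['no_cache' => $noCache, 'already_expired' => $alreadyExpired];
}

// Header values taken from the output shown in the question.
$headers = [
    0               => 'HTTP/1.1 200 OK',
    'Date'          => 'Thu, 03 May 2012 10:51:00 GMT',
    'Pragma'        => 'no-cache',
    'Expires'       => 'Thu, 03 May 2012 10:50:00 GMT',
    'Cache-control' => 'no-cache',
];

$result = describeCaching($headers, strtotime('Thu, 03 May 2012 10:51:00 GMT'));
var_dump($result['no_cache'], $result['already_expired']); // bool(true) bool(true)
```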

Google storage (curl + PHP + create a dummy _$folder$)

Can anyone enlighten me how to create a _$folder$ using Google storage API? Here is the class that I have so far (I managed to list/filter files), but no success with creating a 'directory'. http://guy.codepad.org/lEO4J6hL
When I try to create a test_$folder$, this is what I send to the server:
PUT /test_$folder$ HTTP/1.1
Host: static.hotelpublisher.com.commondatastorage.googleapis.com
Date: Mon, 29 Nov 2010 19:45:49 GMT
Content-Length: 0
Content-Type: application/octet-stream
Content-MD5: 1B2M2Y8AsgTpgAmY7PhCfg==
Authorization: GOOG1 GOOGRVMFQJPKHRXAU3F6:gtzlxexMjBOafn5tOZKF7UZGv1I=
x-goog-acl: public-read
This is what I get in return:
<?xml version='1.0' encoding='UTF-8'?><Error><Code>MalformedHeaderValue</Code><Message>An HTTP header value was malformed.</Message><Date>mon, 29 nov 2010 19:34:23</Date></Error>
This is done following the Google provided documentation, thus I don't see why this does not work.
Hm... have you tried taking out the Content-MD5 header? It doesn't look like it's necessary since your Content-Length is 0.
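For reference, the Content-MD5 value in the request is exactly the digest of an empty body, as this small sketch shows (contentMd5 is a hypothetical helper name). It only checks the header's value and does not establish which header the server actually rejected:

```php
<?php
// Content-MD5 is the base64 encoding of the raw (binary) MD5 digest of the body.
function contentMd5(string $body): string
{
    return base64_encode(md5($body, true));
}

echo contentMd5(''), "\n"; // 1B2M2Y8AsgTpgAmY7PhCfg==
```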

What headers do I want to send together with a 304 response?

When I send a 304 response, how will the browser interpret other headers which I send together with the 304?
E.g.
header("HTTP/1.1 304 Not Modified");
header("Expires: " . gmdate("D, d M Y H:i:s", time() + $offset) . " GMT");
Will this make sure the browser will not send another conditional GET request (nor any request) until $offset time has "run out"?
Also, what about other headers?
Should I send headers like this together with the 304:
header('Content-Type: text/html');
Do I have to send:
header("Last-Modified:" . $modified);
header('Etag: ' . $etag);
To make sure the browser sends a conditional GET request the next time the $offset has "run out" or does it simply save the old Last Modified and Etag values?
Are there other things I should be aware of when sending a 304 response header?
This blog post helped me a lot in taming the "conditional GET" beast.
An interesting excerpt (which partially contradicts Ben's answer) lists what a 304 response must include:
If a normal response would have included an ETag header, that header must also be included in the 304 response.
Cache headers (Expires, Cache-Control, and/or Vary), if their values might differ from those sent in a previous response.
This is in complete accordance with RFC 2616 sec 10.3.5.
Below is a 200 response...
HTTP/1.1 200 OK
Server: nginx/0.8.52
Date: Thu, 18 Nov 2010 16:04:38 GMT
Content-Type: image/png
Last-Modified: Thu, 15 Oct 2009 02:04:11 GMT
Expires: Thu, 31 Dec 2010 02:04:11 GMT
Cache-Control: max-age=315360000
Accept-Ranges: bytes
Content-Length: 6394
Via: 1.1 proxyIR.my.corporate.proxy.name:8080 (IronPort-WSA/6.3.3-015)
Connection: keep-alive
Proxy-Connection: keep-alive
X-Junk: xxxxxxxxxxxxxxxx
...And its optimal valid 304 counterpart.
HTTP/1.1 304 Not Modified
Server: nginx/0.8.52
Date: Thu, 18 Nov 2010 16:10:35 GMT
Expires: Thu, 31 Dec 2011 16:10:35 GMT
Cache-Control: max-age=315360000
Via: 1.1 proxyIR.my.corporate.proxy.name:8080 (IronPort-WSA/6.3.3-015)
Connection: keep-alive
Proxy-Connection: keep-alive
X-Junk: xxxxxxxxxxx
Notice that the Expires header is at most Current Date + One Year as per RFC-2616 14.21.
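A minimal sketch of that rule in PHP, assuming the application supplies the ETag and the max-age itself. The function conditionalResponse and its parameters are hypothetical, and only a single ETag in If-None-Match is handled:

```php
<?php
// Hypothetical helper: build the status line and headers for a conditional GET.
// $etag and $maxAge are assumed to be supplied by the application.
function conditionalResponse(?string $ifNoneMatch, string $etag, int $maxAge, int $now): array
{
    $headers = [
        'ETag: ' . $etag,
        'Cache-Control: max-age=' . $maxAge,
        'Expires: ' . gmdate('D, d M Y H:i:s', $now + $maxAge) . ' GMT',
    ];

    if ($ifNoneMatch !== null && trim($ifNoneMatch) === $etag) {
        // Validator matches: no body and no Content-Type, but ETag and cache headers stay.
        return ['HTTP/1.1 304 Not Modified', $headers];
    }

    $headers[] = 'Content-Type: text/html';
    return ['HTTP/1.1 200 OK', $headers];
}

[$status, $headers] = conditionalResponse('"abc"', '"abc"', 3600, 1290096635);
echo $status, "\n"; // HTTP/1.1 304 Not Modified
```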
The Content-Type header only applies to responses which contain a body. A 304 response does not contain a body, so that header does not apply. Similarly, you don't want to send Last-Modified or ETag because a 304 response means that the document hasn't changed (and so neither have the values of those two headers).
For an example, see this blog post by Anne van Kesteren examining WordPress' http_modified function. Note that it returns either Last-Modified and ETag or a 304 response.
