How to politely ask a remote webpage if it changed? - php

The remote webpage updates at irregular intervals - sometimes slowly, once every ten minutes or so; sometimes more often, every minute or even more frequently. There's a piece of data on that page I want to store, updating it whenever it changes (not necessarily grabbing every change, but not falling too far behind the current version, and keeping the updater running 24/7).
Downloading the whole remote page every minute to check if it differs from previous version is definitely on the rude side.
Pinging the remote website for headers once a minute won't be too excessive.
Ideally, there would be some hint about when to recheck for updates, or a way to have the server reply with the content only after it has changed.
How should I go about minimizing unwanted traffic to the remote server while still staying up-to-date?
The "watcher/updater" is written in PHP and currently fetches the page with simplexml_load_file() on the remote URL once a minute, so something that plays nicely with that would be preferred - e.g. a solution that, upon determining the file differs, proceeds straight to requesting the content, rather than dropping the connection only to reconnect for the actual content half a second later.
edit: per request, sample headers.
> HEAD xxxxxxxxxxxxxxxxxxxxxxxxxxx HTTP/1.1
> User-Agent: curl/7.27.0
> Host: xxxxxxxxxxxxxx
> Accept: */*
>
* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Server: nginx
< Date: Tue, 18 Feb 2014 19:35:04 GMT
< Content-Type: application/rss+xml; charset=utf-8
< Content-Length: 9865
< Connection: keep-alive
< Status: 200 OK
< X-Frame-Options: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< X-Content-Type-Options: nosniff
< X-UA-Compatible: chrome=1
< ETag: "66509a4967de2c5984aa3475188012df"
< Cache-Control: max-age=0, private, must-revalidate
< X-Request-Id: 351a829a-641b-4e9e-a7ed-80ea32dcb071
< X-Runtime: 0.068888
< X-Powered-By: Phusion Passenger
< X-Frame-Options: SAMEORIGIN
< Accept-Ranges: bytes
< X-Varnish: 688811779
< Age: 0
< Via: 1.1 varnish
< X-Cache: MISS

ETag: "66509a4967de2c5984aa3475188012df"
This is a very promising header. If it indeed corresponds to changes in the page itself, you can query the server setting this request header:
If-None-Match: "<the last received etag value>"
If the content was not modified, the server should respond with a 304 Not Modified status and no body. See http://en.wikipedia.org/wiki/HTTP_ETag. It also seems to be running a cache front end, so you're probably not hitting it too hard anyway.
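A sketch of this conditional-GET loop in PHP with cURL (the URL is a placeholder and the function names are mine, not from the question; error handling is kept minimal):

```php
<?php
// Extract the ETag value from a block of raw response headers.
// Returns null when no ETag header is present.
function extract_etag(string $rawHeaders): ?string {
    return preg_match('/^ETag:\s*(.+?)\s*$/mi', $rawHeaders, $m) ? $m[1] : null;
}

// Fetch $url conditionally: send If-None-Match with the previously seen
// ETag; on 304 return null (unchanged), otherwise return the new body
// and update $etag by reference.
function fetch_if_changed(string $url, ?string &$etag): ?string {
    $ch = curl_init($url);
    $headers = [];
    if ($etag !== null) {
        $headers[] = 'If-None-Match: ' . $etag;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,   // keep headers so we can read the new ETag
        CURLOPT_HTTPHEADER     => $headers,
    ]);
    $response   = curl_exec($ch);
    $status     = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);

    if ($response === false || $status === 304) {
        return null; // not modified (or request failed)
    }
    $etag = extract_etag(substr($response, 0, $headerSize)) ?? $etag;
    return substr($response, $headerSize); // body only
}

// Usage sketch (placeholder URL):
// $etag = null;
// while (true) {
//     $body = fetch_if_changed('http://example.com/feed.rss', $etag);
//     if ($body !== null) {
//         $xml = simplexml_load_string($body);
//         // ... store the interesting piece of data ...
//     }
//     sleep(60);
// }
```

On a 304 the server sends no body at all, so this stays cheap for both sides while still using the same connection flow as a normal GET.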

Send an HTTP HEAD request using cURL and read the Last-Modified value. HEAD is like GET, but it transfers only the status line and header section, so you won't be "rude" to the other server.
On the command line, we can achieve this with the following (note: use -I/--head rather than -X HEAD, which makes curl wait for a body that never arrives):
curl -sI http://example.com/file.html | grep -i '^Last-Modified:'
It shouldn't be too hard to rewrite this using PHP's cURL library.
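A rough PHP equivalent using CURLOPT_NOBODY (which is how cURL issues a proper HEAD request); the helper names and the placeholder URL are mine:

```php
<?php
// Parse the Last-Modified header out of raw response headers and
// return it as a Unix timestamp (null if absent or unparseable).
function last_modified_ts(string $rawHeaders): ?int {
    if (!preg_match('/^Last-Modified:\s*(.+?)\s*$/mi', $rawHeaders, $m)) {
        return null;
    }
    $ts = strtotime($m[1]);
    return $ts === false ? null : $ts;
}

// Issue a HEAD request (CURLOPT_NOBODY) and return the raw headers.
function head_request(string $url): string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD: status line + headers only
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
    ]);
    $headers = curl_exec($ch);
    curl_close($ch);
    return $headers === false ? '' : $headers;
}

// Usage sketch (placeholder URL):
// $ts = last_modified_ts(head_request('http://example.com/file.html'));
// if ($ts !== null && $ts > $lastSeenTs) { /* re-fetch the full page */ }
```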

Related

Transfer-Encoding: chunked sent twice (chunk size included in response body)

I'm using Apache 2.2 and PHP 7.0.1. I force chunked encoding with flush() like in this example:
<?php
header('HTTP/1.1 200 OK');
echo "hello";
flush();
echo "world";
die;
And I get unwanted characters at the beginning and end of the response:
HTTP/1.1 200 OK
Date: Fri, 09 Sep 2016 15:58:20 GMT
Server: Apache/2.2.15 (CentOS)
X-Powered-By: PHP/7.0.9
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
a
helloworld
0
The first is the chunk size in hex (10 bytes = "a"), and the trailing "0" is the zero-length chunk that terminates the body. I'm using Klein as my PHP router, and I have found that the problem only comes up when the HTTP status header is rewritten. I guess there is a problem with my Apache config, but I wasn't able to figure it out.
Edited: My problem had nothing to do with Apache, but with Nginx and its chunked_transfer_encoding directive. Check the answer below.
This is how Transfer-Encoding: chunked works. The extra characters you're seeing are part of the encoding, not of the body.
A client that understands the encoding will not include them in the result; a client that doesn't understand it does not support HTTP/1.1 and should be considered broken.
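To make the framing concrete, here is a minimal decoder sketch for the chunked format in PHP (chunk extensions and trailers are ignored; decode_chunked is my own helper, not a built-in):

```php
<?php
// Minimal decoder for a Transfer-Encoding: chunked body.
// Each chunk is "<size-in-hex>\r\n<data>\r\n"; a zero-size chunk ends
// the body. Chunk extensions and trailers are ignored for simplicity.
function decode_chunked(string $raw): string {
    $body = '';
    $pos  = 0;
    while (($eol = strpos($raw, "\r\n", $pos)) !== false) {
        $size = hexdec(trim(substr($raw, $pos, $eol - $pos)));
        if ($size === 0) {
            break; // terminating zero-length chunk
        }
        $body .= substr($raw, $eol + 2, $size);
        $pos = $eol + 2 + $size + 2; // skip chunk data + its trailing CRLF
    }
    return $body;
}

// The response above carried one 10-byte (0xA) chunk:
// decode_chunked("a\r\nhelloworld\r\n0\r\n\r\n") yields "helloworld"
```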
As @Joe pointed out before, this is the normal behavior when chunked transfer encoding is enabled. My tests were not accurate because I was requesting Apache directly on the server. When I was actually experiencing the problem in Chrome, I was querying an Nginx service acting as a proxy for Apache.
By running tcpdump I realized that Nginx was re-chunking the responses, but only when the HTTP status header was rewritten (header('HTTP/1.1 200 OK')) in PHP. The solution to Transfer-Encoding: chunked being sent twice is to set chunked_transfer_encoding off in the location context of my Nginx .php handler.
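For reference, the nginx side of that fix might look roughly like this (the location pattern and backend address are assumptions about the setup, not taken from the question):

```nginx
# Assumed nginx location block proxying .php requests to the Apache
# backend; chunked_transfer_encoding off stops nginx from re-chunking
# a response that the backend already chunked.
location ~ \.php$ {
    proxy_pass http://127.0.0.1:8080;   # assumed Apache backend address
    chunked_transfer_encoding off;
}
```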

varnish not work only for one web site

I have a server hosting several sites. After installing Varnish I tested whether the cache works; it does, except for one website, where the response comes back with max-age=0. If I put a simple PHP page (unrelated to the main website) in the same folder as that website, its response is cached fine.
These are the response headers from the failing site:
HTTP/1.1 200 OK
Server: Apache/2.2.27 (Unix) mod_ssl/2.2.27 OpenSSL/1.0.1e-fips
X-Powered-By: PHP/5.2.17
Set-Cookie: PHPSESSID=ragejao4sm1kckjn1trvap3ft0; path=/
Vary: User-Agent,Accept-Encoding
Content-Encoding: gzip
Content-Type: text/html
Cache-Control: max_age=8600
magicmarker: 1
Content-Length: 11863
Accept-Ranges: bytes
Date: Fri, 12 Jun 2015 12:28:15 GMT
X-Varnish: 1250916100
Age: 0
Via: 1.1 varnish
Connection: keep-alive
Varnish by default doesn't cache responses where cookies are set.
If you want to change this behaviour, you need to consider how the cookie is being used (it looks like a session cookie). Either use the session id as part of the cache hash (so other users don't get a cached response from someone else's session), or use something like ESI to allow the "common" parts of the page to be cached while the session-specific parts are fetched independently. As a side note, the backend's Cache-Control: max_age=8600 header is malformed - the directive is max-age, with a hyphen - so it would be ignored even without the cookie.
http://www.varnish-cache.org/trac/wiki/VCLExampleCacheCookies
https://www.varnish-cache.org/trac/wiki/ESIfeatures
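For completeness, a minimal sketch of how one might let Varnish cache the cookie-free parts of a site (Varnish 3-style VCL - newer versions use vcl_backend_response; the path patterns are assumptions, not from the question):

```vcl
# Sketch: strip cookies on requests known not to depend on the session,
# so Varnish will cache them despite the PHP session cookie elsewhere.
sub vcl_recv {
    # Assumed: static assets never need PHPSESSID.
    if (req.url ~ "^/(images|css|js)/") {
        unset req.http.Cookie;
    }
}

sub vcl_fetch {
    # Assumed: don't let a stray Set-Cookie on those paths block caching.
    if (req.url ~ "^/(images|css|js)/") {
        unset beresp.http.Set-Cookie;
    }
}
```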

html/php: Changing images order by swapping filenames doesn't take immediate effect

I have the following problem that's completely bugging me. I have to take over this CMS from someone who doesn't want to maintain it anymore and gives no support whatsoever.
The situation is as follows: the site has several photo albums, which are populated by reading a directory in PHP. All is good there; pictures are shown in the order they are read. In the management system, the order of the pictures can be changed with up/down buttons. This is done by swapping the images' filenames, and it does work: when I change the order of an image, I can see server-side that the filenames have actually been swapped.
This is, however, not the case on the site, at least not immediately: on average it takes about 10 minutes for the swapped images to show up there. Of course my client can't work like this, and he claims it always worked before. I have tried turning off caching browser-side, which didn't help. I can also note that the changes take effect at the same time in IE and FF. I tried several ways of disabling the cache server-side in PHP too, also to no avail.
Is there any other place where I should be looking or could there be another reason why these changes don't take effect immediately?
In addition, changes I make to JavaScript don't get picked up immediately either. I installed Fiddler, and this is the request header for that JS file:
GET http://www.nobel-country-gite.be/admin/modules/Photoalbum/js/album.js HTTP/1.1
Accept: application/javascript, */*;q=0.8
Referer: http://www.nobel-country-gite.be/admin/index.php?page=pic&album=24
Accept-Language: nl-BE
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
If-Modified-Since: Wed, 27 May 2015 15:55:12 GMT
If-None-Match: "ba1248f5-138b-5171244a92f66"
DNT: 1
Host: www.nobel-country-gite.be
Pragma: no-cache
Cookie: __utmc=39679548; __utma=39679548.1608184058.1429963360.1432662247.1432664636.7; __utmz=39679548.1429963360.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=1; PHPSESSID=7uge1ltg2rc11q63untthrc5s1; __utma=1.459796341.1429963360.1432662247.1432664636.7; __utmz=1.1429963360.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
Response header is as follows:
HTTP/1.1 304 Not Modified
Server: Apache
Last-Modified: Wed, 27 May 2015 15:55:12 GMT
ETag: "ba1248f5-138b-5171244a92f66"
Vary: Accept-Encoding
Content-Type: application/javascript
Date: Wed, 27 May 2015 16:57:55 GMT
X-Varnish: 1826689067 1825041752
Age: 556
Via: 1.1 varnish
Connection: keep-alive
I would have expected a different response here instead of 304 Not Modified.
Edit - upon waiting a few minutes and refreshing the page again, the response for this file is what is expected:
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Wed, 27 May 2015 16:57:30 GMT
ETag: "ba1248f5-1387-51713237ac28e"
Vary: Accept-Encoding
Content-Type: application/javascript
Transfer-Encoding: chunked
Date: Wed, 27 May 2015 17:03:43 GMT
X-Varnish: 1827728442
Age: 0
Via: 1.1 varnish
Connection: keep-alive
I couldn't help but notice you are using Varnish (indicated by the X-Varnish response header). Varnish is a caching reverse proxy, which means your pages are not just being cached by the browser, but also on the server (by Varnish). Your browser connects to Varnish, and Varnish connects to your Apache backend.
The first response includes "Age: 556" - that's the cached copy's age in seconds (almost 10 minutes). After the refresh the age comes across as "0", because Varnish has just updated its cache. You can probably see your changes reflected immediately by accessing the page over HTTPS (Varnish doesn't terminate HTTPS itself, and most people don't bother setting up a separate HTTPS cache), or by adding a throwaway GET parameter to the URL (e.g. "?bogus=123") to force Varnish to re-fetch the page (this won't make other users see the new version, since they'll still be using the normal URL).
Fixes: you can use varnishadm to ban (expire) specific URLs in Varnish after you've made a change; you can reduce the cache time by changing the "Cache-Control" or "Expires" headers your CMS/Apache produces (via PHP, .htaccess, etc.), since Varnish respects cache-control headers in its caching strategy; you can change Varnish's behavior by editing the relevant VCL (usually "default.vcl"); or you can accept that caches are generally a good thing (they save a lot of time and resources when generating responses) and that a 10-minute delay may be an acceptable trade-off.
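Shortening the cache lifetime from the CMS side can be sketched in PHP like this (the 60-second lifetime and the helper name are arbitrary choices of mine, not from the question):

```php
<?php
// Build a Cache-Control header value for a given lifetime in seconds.
function cache_control(int $seconds): string {
    return sprintf('public, max-age=%d', $seconds);
}

// Send it before any body output; Varnish uses max-age when deciding
// how long to keep the object, so a filename swap would become visible
// within about a minute instead of ten.
header('Cache-Control: ' . cache_control(60));
```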

Link between PHP and HTTP Request and Response Messages

When I did a networks course I learned about HTTP Request and Response messages and I know how to code in php reasonably enough to get around. Now my question is, the PHP has to have some link to HTTP request and response message but how. I can't seem to see the link between the two. My reasoning for asking this is that I am using the Twitter API console tool to query their api. The tool sends the following HTTP request:
GET /1.1/search/tweets.json?q=%40twitterapi HTTP/1.1
Authorization:
OAuth oauth_consumer_key="DC0se*******YdC8r4Smg",oauth_signature_method="HMAC-SHA1",oauth_timestamp="1410970037",oauth_nonce="2453***055",oauth_version="1.0",oauth_token="796782156-ZhpFtSyPN5K3G**********088Z50Bo7aMWxkvgW",oauth_signature="Jes9MMAk**********CxsKm%2BCJs%3D"
Host:
api.twitter.com
X-Target-URI:
https://api.twitter.com
Connection:
Keep-Alive
and then I get a HTTP response:
HTTP/1.1 200 OK
x-frame-options:
SAMEORIGIN
content-type:
application/json;charset=utf-8
x-rate-limit-remaining:
177
last-modified:
Wed, 17 Sep 2014 16:07:17 GMT
status:
200 OK
date:
Wed, 17 Sep 2014 16:07:17 GMT
x-transaction:
491****a8cb3f7bd
pragma:
no-cache
cache-control:
no-cache, no-store, must-revalidate, pre-check=0, post-check=0
x-xss-protection:
1; mode=block
x-content-type-options:
nosniff
x-rate-limit-limit:
180
expires:
Tue, 31 Mar 1981 05:00:00 GMT
set-cookie:
lang=en
set-cookie:
guest_id=v1%3A14109******2451388; Domain=.twitter.com; Path=/; Expires=Fri, 16-Sep-2016 16:07:17 UTC
content-length:
59281
x-rate-limit-reset:
1410970526
server:
tfe_b
strict-transport-security:
max-age=631138519
x-access-level:
read-write-directmessages
So how do these HTTP request and response messages fit into PHP? Does PHP generate them automatically? How do I add authorization to PHP requests, and so on? I'm confused about the deeper workings of PHP.
When the client sends the HTTP request to the server, there has to be something to receive the HTTP request, which is called a web server. Examples of web servers are Apache, IIS, Nginx, etc. You can also write your own server, which can handle input however it wants. In this case, I'll assume that you are requesting a PHP file.
When the web server captures the HTTP request, it determines how it should be handled. If the file requested is tweets.json, it will go make sure that file exists, and then pass control over to PHP.
PHP then begins execution and performs whatever logic the script needs: it may go to the database; it reads, writes, and makes decisions based on cookies; it does math; and so on.
When the PHP script is done, it returns an HTML page plus a set of headers back to the web server that invoked it. From there, the web server turns the HTML page and headers into the HTTP response it sends back.
That is a pretty simple overview, and web servers can work in many different ways, but this is how it works in an introductory use-case. In more complex scenarios, people write their own web servers that perform more complex logic inside the server software, rather than passing requests off to PHP.
When it comes down to it, PHP files are just scripts that the web server executes when they are requested: the web server provides the HTTP request as input and gets a web page and headers as output.
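A small sketch of that input/output contract in PHP: request headers arrive in $_SERVER (under HTTP_-prefixed keys), and header()/echo produce the response the web server sends back. The server_key helper is mine, not a PHP built-in:

```php
<?php
// Map an HTTP request header name to the $_SERVER key PHP exposes it
// under, e.g. "User-Agent" -> "HTTP_USER_AGENT".
function server_key(string $header): string {
    return 'HTTP_' . strtoupper(str_replace('-', '_', $header));
}

// Reading the request side (query string and headers the client sent):
// $query = $_GET['q'] ?? '';                              // from ?q=%40twitterapi
// $auth  = $_SERVER[server_key('Authorization')] ?? '';   // the OAuth header, if present

// Writing the response side (status, headers, then body):
// http_response_code(200);
// header('Content-Type: application/json;charset=utf-8');
// echo json_encode(['statuses' => []]);
```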

PHP, cURL, Sessions and Cookies - Oh my

I wrote a PHP web application which uses authentication and sessions (no cookies though). All works fine for the users in their browsers. Now I need to add functionality which performs a task automatically; users don't need to see anything and can't interact with this process. So I wrote a new script, import.php, which works in my browser. I set up a cron job to call 'php import.php'. It doesn't work. I started Googling, and it seems I may need cURL and possibly cookies, but I'm not certain. Basically import.php needs to authenticate and then access functions in a separate file, funcs.php, in the same directory on the local server. So I added cURL to import.php and reran it from the command line; I see the following:
[me#myserver]/var/www/html/webapp% php ./import.php
* About to connect() to myserver.internal.corp port 443 (#0)
* Trying 192.168.111.114... * connected
* Connected to myserver.internal.corp (192.168.111.114) port 443 (#0)
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* Remote Certificate has expired.
* SSL certificate verify ok.
* SSL connection using SSL_RSA_WITH_3DES_EDE_CBC_SHA
* Server certificate:
* subject: CN=dept,O=Corp,L=Some City,ST=AK,C=US
* start date: Jan 11 16:48:38 2012 GMT
* expire date: Feb 10 16:48:38 2012 GMT
* common name: myserver
* issuer: CN=dept,O=Corp,L=Some City,ST=AK,C=US
> POST /webapp/import.php HTTP/1.1
Host: myserver.internal.corp
Accept: */*
Content-Length: 356
Expect: 100-continue
Content-Type: multipart/form-data; boundary=----------------------------2c5ad35fd319
< HTTP/1.1 100 Continue
< HTTP/1.1 200 OK
< Date: Thu, 27 Dec 2012 22:09:00 GMT
< Server: Apache/2.4.2 (Unix) OpenSSL/0.9.8g PHP/5.4.3
< X-Powered-By: PHP/5.4.3
* Added cookie webapp="tzht62223b95pww7bfyf2gl4h1" for domain myserver.internal.corp, path /, expire 0
< Set-Cookie: webapp=tzht62223b95pww7bfyf2gl4h1; path=/
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Content-Length: 344
< Content-Type: text/html
<
* Connection #0 to host myserver.internal.corp left intact
* Closing connection #0
I'm not sure what I'm supposed to do after I authenticate via cURL. Or is there an alternate way to authenticate that doesn't use cURL? Currently all pages in the web app take action (or not) based on $_SESSION and $_POST value checks. If cURL is the only way, do I need cookies? And if I need cookies, what do I need to do to process the cookie once I send it back to the server?
Basically import.php checks for and reads files in the same directory; supposing there are files when the cron runs, it parses them and inserts the data into the DB. Again, everything works in the browser, just not the import from the command line.
Having never done this before (or much PHP for that matter), I'm completely stumped.
Thanks for your help.
I've solved my problems with this one.
shell_exec('nohup php '.realpath(dirname(__FILE__)).'/yourscript.php > /dev/null &');
You can set this to run every x minutes, and it will run in the background without user delay.
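If you do stay with the cURL approach, the usual way to carry the PHP session across requests is a cookie jar: log in once with a POST, let cURL store the PHPSESSID cookie, and replay it on later requests. A sketch (the URLs, endpoint, and POST field names are assumptions about the app, not from the question):

```php
<?php
// Log in, then call a protected page, sharing cookies via a jar file so
// the PHPSESSID set during login survives into the second request.
function fetch_with_session(string $loginUrl, array $credentials, string $targetUrl): string {
    $jar = tempnam(sys_get_temp_dir(), 'cookiejar');

    $ch = curl_init($loginUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($credentials),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEJAR      => $jar,   // write cookies here after login
    ]);
    curl_exec($ch);
    curl_close($ch);

    $ch = curl_init($targetUrl);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE     => $jar,   // replay the session cookie
    ]);
    $body = curl_exec($ch);
    curl_close($ch);
    unlink($jar);

    return $body === false ? '' : $body;
}

// Usage sketch (assumed endpoint and field names):
// $html = fetch_with_session(
//     'https://myserver.internal.corp/webapp/login.php',
//     ['user' => 'cronuser', 'pass' => 'secret'],
//     'https://myserver.internal.corp/webapp/import.php'
// );
```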
Can we start from here?
This is highly unlikely to help anybody, but the requirements for this project changed, so I ended up creating a PHP-based REST API and rewriting this import script in Python to integrate with some other tools being developed. All works as needed. In Python...
import cookielib
import getopt
import os
import sys
import urllib
import urllib2
import MultipartPostHandler
I shouldn't need to provide any more details - anybody versed enough in Python should get the drift. The script reads a file and submits it to my PHP API.
