In this post I will try to explain the “most important” aspects of caching in the HTTP protocol. I assume that HTTP/1.1 is used (not 1.0), in a setup where a reverse proxy, such as Squid, is placed in front of an origin server. A client that requests a resource, such as a browser, usually has a forward proxy itself, which is, in this sense, exactly the same type of caching entity as that reverse proxy. The client requests a resource from its local cache, which may forward that request to its parent, the reverse proxy, which in turn may forward it to its own parent, the origin server. The ultimate goal is to minimize the time it takes for the client to get a response.
A definition list of the terms commonly used in this context:
- Resource
- An object that you can identify with a URI. For example: http://example.com/foo and http://example.net/bar?x=y.
- Stale resource
- A cached resource that is past its expiry date.
- Parent cache
- If a cache does not have a complete answer to an incoming request, it forwards the request to its parent cache.
- Origin server
- The server that generates the content. The origin server has no parent cache.
- Reverse proxy / Accelerator
- A caching proxy that handles all incoming connections for an origin server and caches the response.
- Revalidate
- A cache revalidates a cached resource by checking with its parent cache whether it is still up to date, downloading any updates that are available.
- Refresh
- A cache refreshes a cached resource by deleting it from its cache and downloading a fresh copy from the parent cache, regardless of whether or not it was already up to date.
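The relationship between a cache and its parent can be sketched in a few lines of Python. This is only a toy model, not how Squid or a browser works: the `Origin` and `Cache` classes and their method names are made up for illustration, and expiry is ignored entirely so that only the forwarding behaviour is visible.

```python
class Origin:
    """The origin server: generates content and has no parent cache."""

    def __init__(self, resources):
        self.resources = resources  # maps URI -> body
        self.hits = 0               # number of requests that reached the origin

    def get(self, uri):
        self.hits += 1
        return self.resources[uri]


class Cache:
    """A cache that forwards a request to its parent when it has no copy."""

    def __init__(self, parent):
        self.parent = parent
        self.store = {}

    def get(self, uri):
        if uri not in self.store:
            # Miss: ask the parent (another cache or the origin server).
            self.store[uri] = self.parent.get(uri)
        return self.store[uri]


origin = Origin({"http://example.com/foo": "hello"})
reverse_proxy = Cache(parent=origin)
browser_cache = Cache(parent=reverse_proxy)

browser_cache.get("http://example.com/foo")  # travels all the way to the origin
browser_cache.get("http://example.com/foo")  # answered by the browser cache
```

After both calls the origin has been contacted only once; the second request never leaves the client.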
Before demonstrating usage examples, let me introduce (some of) the caching-related headers in HTTP, along with the protocol version each one belongs to:
- Cache-Control, HTTP/1.1
- Expires, HTTP/1.0
- Last-Modified, both
The ETag header holds the entity tag of a resource. This can be used in requests the same way as the “last modified” timestamp, to tell the server that it only needs to send the resource if it is different from what is currently in the cache. An entity tag is an opaque value, a literal string whose content has no meaning in the protocol: it either matches another string exactly or it does not. Humans might read some logic into the tag, but caches treat it as an arbitrary string of characters.
The Cache-Control header allows several different values that detail the cacheability of a certain resource. It can be sent by both the server and the client (i.e. both in a request and in a response). If sent in a response, the most interesting value, here, is max-age.
This directive is very similar to the Expires header: it indicates for how long a certain resource may be cached, expressed as a number of seconds rather than as an absolute date. If a cache gets a request for a resource that it has cached but that, according to this header, has “expired”, it will ask its parent to confirm that the cached copy is still valid. This is done by forwarding the request, adding an If-Modified-Since header with the timestamp of the locally cached version, or an If-None-Match header with its ETag.
If the resource has not changed on the origin server, it will reply with 304 Not Modified and will not send the body of the resource. Thus, an expiry date merely tells caches when, at the latest, they must check with the origin server whether their copy is still valid. If the resource happens to never change, the cache may never have to redownload the entity, always getting a 304 Not Modified instead.
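The revalidation exchange described above can be sketched as follows. This is a simplified illustration rather than a complete implementation: the function names and the dictionary layout are my own, entity tags are compared as plain strings (no lists of tags, no weak tags), and the entity tag is given priority over the timestamp when both are present.

```python
from email.utils import parsedate_to_datetime


def conditional_headers(cached):
    """Build the validation headers a cache adds when forwarding a request."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def origin_response(request_headers, current):
    """Return (status, body) given the current state of the resource."""
    if "If-None-Match" in request_headers:
        # Entity tags are opaque: only exact equality matters.
        unchanged = request_headers["If-None-Match"] == current["etag"]
    elif "If-Modified-Since" in request_headers:
        since = parsedate_to_datetime(request_headers["If-Modified-Since"])
        modified = parsedate_to_datetime(current["last_modified"])
        unchanged = modified <= since
    else:
        unchanged = False
    if unchanged:
        return 304, None  # Not Modified: no body is sent
    return 200, current["body"]


cached = {"etag": '"latestid:4"', "last_modified": "Wed, 09 Sep 2009 13:22:01 GMT"}
current = {
    "etag": '"latestid:4"',
    "last_modified": "Wed, 09 Sep 2009 13:22:01 GMT",
    "body": "<html>comments</html>",
}
status, body = origin_response(conditional_headers(cached), current)
```

Here the resource has not changed since it was cached, so the origin answers with a 304 and the cache keeps using its own copy.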
Coming from a client, max-age means that the client will accept a cached response that is at most that many seconds old. When you hit the “refresh” button in a browser, it repeats the request for the page but adds a Cache-Control: max-age=0 header. This essentially tells all caches that they must revalidate their copies. It does not require caches to refresh their copies no matter what. That is done by sending a no-cache directive along with the request. For example, the following command will refresh every resource in the cache of a reverse proxy that wget can reach by following links from the front page:
wget --recursive \
     --no-directories \
     --delete-after \
     --header='Cache-Control: no-cache' \
     http://example.com/
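To make the difference between the two request directives concrete, here is a rough sketch of the decision a cache makes when a request arrives for a resource it already holds. The function names and the three return values are hypothetical, the terms “revalidate” and “refresh” are used as defined in this post, and the parser is deliberately naive (it does not handle quoted arguments that contain commas).

```python
def parse_cache_control(value):
    """Split a Cache-Control value into a {directive: argument-or-None} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, argument = part.partition("=")
        directives[name.strip().lower()] = argument.strip().strip('"') or None
    return directives


def action_for_request(request_cache_control, age, still_fresh):
    """Decide what a cache does with its stored copy.

    age is the number of seconds since the copy was last validated;
    still_fresh says whether the server's own expiry rules still allow it.
    """
    request = parse_cache_control(request_cache_control)
    if "no-cache" in request:
        return "refresh"  # fetch a new copy no matter what
    max_age = request.get("max-age")
    if max_age is not None and age > int(max_age):
        return "revalidate"  # the client wants something more recent
    if not still_fresh:
        return "revalidate"  # the copy has expired by the server's rules
    return "serve from cache"


action_for_request("max-age=0", age=30, still_fresh=True)  # what a browser reload sends
action_for_request("no-cache", age=30, still_fresh=True)   # what the wget command sends
```

The first call yields "revalidate" and the second "refresh"; without any directive the same copy would simply be served from the cache.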
Now let us look at different types of resources with different requirements:
Quickly changing, unpredictable content
Resources that depend on user-generated content, like a “comments” page, or similar. If it is important that users always see the very latest development, the server should allow caching (by sending an ETag and/or Last-Modified) but require revalidation (max-age=0):
Cache-Control: public, max-age=0
ETag: "latestid:4"
Last-Modified: Wed, 09 Sep 2009 13:22:01 GMT
The origin server will have to check if the resource has changed on every request. If all those 304 Not Modified replies put too much load on the origin server, consider increasing the max-age directive a bit to offload that work to the reverse proxy.
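How much a slightly larger max-age helps can be estimated with a small simulation. This is a back-of-the-envelope sketch: the function name is hypothetical, the traffic pattern (one request per second for a minute) is made up, and the reverse proxy is assumed to treat a copy as fresh while its age is below max-age.

```python
def count_origin_requests(request_times, max_age):
    """Count how often a reverse proxy has to contact the origin server."""
    origin_requests = 0
    validated_at = None  # time of the last fetch or successful revalidation
    for now in request_times:
        if validated_at is None or now - validated_at >= max_age:
            # No copy yet, or the copy is stale: check with the origin.
            origin_requests += 1
            validated_at = now
    return origin_requests


requests = range(60)  # one request per second for a minute

print(count_origin_requests(requests, max_age=0))   # prints 60
print(count_origin_requests(requests, max_age=10))  # prints 6
```

With max-age=0 every single request reaches the origin, while ten seconds of allowed staleness cuts that to one request in ten for this traffic.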
Static content that rarely changes
Stylesheets, images, “about us” pages, etcetera. These resources almost never change, and even if they do, it is OK if users do not get the updates immediately. Moreover, allowing aggressive caching means that the client’s cache can serve content without even opening a connection to the reverse proxy. If you have a gallery website, for example, where users spend a lot of time clicking around, this will allow very quick page loads because almost no new connections are made over the internet, eliminating that latency.
Cache-Control: public, max-age=86400
ETag: "revision183"
Last-Modified: Wed, 10 Dec 2008 06:44:39 GMT
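If you generate such headers in application code, something along these lines would do. It is a minimal sketch assuming Python on the server side: `static_headers` and its parameters are names I made up, and only the standard library is used to format the date.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def static_headers(last_modified, revision, lifetime):
    """Build response headers for a rarely changing, publicly cacheable resource."""
    return {
        # max-age is expressed in whole seconds; one day is 86400.
        "Cache-Control": f"public, max-age={int(lifetime.total_seconds())}",
        # The entity tag is sent as a quoted string.
        "ETag": f'"{revision}"',
        # HTTP dates are always in GMT; the datetime must be UTC-aware.
        "Last-Modified": format_datetime(last_modified, usegmt=True),
    }


headers = static_headers(
    last_modified=datetime(2008, 12, 10, 6, 44, 39, tzinfo=timezone.utc),
    revision="revision183",
    lifetime=timedelta(days=1),
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Run as is, this prints the same three header lines as the example above.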
Content that may absolutely not be cached by public/shared caches, such as company secrets or resources that could put users in an embarrassing position if it got out that they had requested them at some point in the past. The Cache-Control: private header in the response prevents public/shared caches from storing such a resource, while the client’s own cache may still keep it. Note that private is a response directive; it has no defined meaning in a request.
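From the point of view of a cache, the rule boils down to one check. The sketch below is illustrative only: `may_store` is a hypothetical helper, it looks at the private directive and nothing else, and it does not handle the form of private that carries a list of field names.

```python
def may_store(response_cache_control, shared):
    """Decide whether a cache may keep a copy of a response.

    shared is True for a public/shared cache such as a reverse proxy,
    and False for a cache that serves a single user, such as a browser's.
    """
    directives = {
        part.strip().lower() for part in response_cache_control.split(",")
    }
    if "private" in directives and shared:
        return False
    return True


may_store("private", shared=True)   # reverse proxy: must not store it
may_store("private", shared=False)  # browser cache: may store it
```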
In the end, the ultimate reference is the HTTP/1.1 RFC. I recommend reading at least the related parts of that document, which I tried to summarize in this post.
The Squid proxy software can act as a forward proxy as well as a reverse proxy. The latter functionality is dubbed “accelerator mode” in Squid-speak.