Hypertext Transfer Protocol (HTTP)

Basic operation
HTTP version 1.0
HTTP version 1.1
Request methods
Header fields

Proxies and caches
Security and authentication
State management
The HTTP message body
Status codes

HyperText Transfer Protocol (HTTP) is a connectionless and stateless application-layer protocol that defines how a client computer retrieves content (web pages etc.) from a web server (sometimes called the origin server). The protocol follows a request-response paradigm in which the client computer issues requests which the server responds to. The protocol is said to be connectionless because, although a TCP connection must be established between the client and the server before a request can be sent, once the request has been satisfied the connection is dropped. Furthermore, the protocol has no mechanism for keeping track of the current state of a user transaction, hence it is said to be stateless. Web applications that need to keep track of a user's actions must use other means.

The HTTP client application is called a Web browser (sometimes referred to as a user agent). Communication takes place over a TCP connection, which is established by default with port 80 on the Web server. When a user initiates a request for a web page by clicking on a link, the browser creates an HTTP request message which contains, among other things, the uniform resource locator (URL) associated with the link. A software program called an HTTP daemon, which runs on the server and listens on port 80 for incoming traffic, handles HTTP request messages. Between the client computer and the Web server itself, there may be a number of intermediaries involved, such as proxy servers, gateway servers, and tunnels.

Most web resources exist on a server in the form of files of various kinds, but a response may also be created dynamically by a server-side CGI script. All being well, the requested resource is sent to the client computer together with a Content-Type: header that tells the client what kind of data the resource contains. If the web browser recognises the content type it will display the content correctly, if necessary invoking the appropriate plug-in (a software module designed to handle a specific media type that can be called upon by the browser when required).

Basic operation

The first version of HTTP was relatively simple. The browser would request a TCP connection with the server, and send a command such as:

GET /something.html

The command is followed by carriage-return and line-feed characters (CRLF). The server would respond with the contents of the requested file. No request headers were used, and GET was the only method defined. The server would return no additional information, and the protocol was restricted to the retrieval of HTML text documents. Once the document was sent, the server would close the connection. The server terminated each line in the document with a mandatory line-feed character, which could optionally be preceded by a carriage-return. The client computer could not send any data to the server.
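As a rough illustration of how little this first version required, the sketch below builds the single-line request and reads a response until the stream ends. The function names are illustrative assumptions, and an in-memory stream stands in for the TCP connection.

```python
import io


def build_simple_request(path: str) -> bytes:
    """Return an original-style request: method, path, CRLF (no version, no headers)."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    return f"GET {path}\r\n".encode("ascii")


def read_simple_response(stream) -> str:
    """Read to end-of-stream; the server closing the connection marked the end of the document."""
    return stream.read().decode("ascii", errors="replace")


request = build_simple_request("/something.html")

# io.BytesIO is a stand-in for the socket; a real client would read from the connection.
fake_connection = io.BytesIO(b"<html>Hello</html>\n")
document = read_simple_response(fake_connection)
```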

HTTP version 1.0

HTTP 1.0 introduced new methods and allowed the use of headers in request and response messages, which follow the general format shown below.

The general form of HTTP 1.0 messages

An HTTP request message consists of a request line, optionally followed by one or more headers, and an optional entity body which could contain data (for example, form data). The "|" character indicates alternatives. The request line consists of the method name, the path to the requested resource, and the version of HTTP being used. A typical request line might appear as follows:

GET /path_to_file/index.html HTTP/1.0
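The message layout just described (request line, header lines, a blank line, then an optional entity body) can be sketched as a small helper. This is a minimal illustration rather than a complete implementation; the function name `build_request` and the User-Agent value are assumptions made for the example.

```python
def build_request(method, path, headers=None, body=b"", version="HTTP/1.0"):
    """Assemble request line, header lines, a blank line, then the optional entity body."""
    lines = [f"{method} {path} {version}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Every line ends with CRLF; an empty line separates the headers from the body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


message = build_request(
    "GET",
    "/path_to_file/index.html",
    {"User-Agent": "example-client/0.1"},  # hypothetical client name
)
```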

An HTTP response message is similar to the request message except that a status line replaces the request line. The status line contains the HTTP version, a status code that indicates whether or not the request was successful, and a brief status description. Two typical status lines are shown below.

HTTP/1.0 200 OK
HTTP/1.0 404 Not Found
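A client has to split the status line into its three parts before it can act on the response. The following is a minimal sketch of that step; the function name `parse_status_line` is an assumption made for illustration.

```python
def parse_status_line(line: str):
    """Split a status line such as 'HTTP/1.0 200 OK' into (version, code, reason)."""
    parts = line.rstrip("\r\n").split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError(f"not a valid status line: {line!r}")
    version, code = parts[0], int(parts[1])
    # The reason phrase is free text and may contain spaces (or be absent).
    reason = parts[2] if len(parts) > 2 else ""
    return version, code, reason


ok = parse_status_line("HTTP/1.0 200 OK")
missing = parse_status_line("HTTP/1.0 404 Not Found\r\n")
```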

HTTP 1.0 allowed access to certain resources to be restricted to users with a valid username and password. A status code of 401 in a response, for example, indicates that the request did not include valid credentials for the resource, and that the user must authenticate before it can be viewed.

Request headers can be used by a client to specify the preferred content type, or to specify conditional retrieval (for example, so that a resource will only be returned if it has changed since a given date). Response headers can be used by the server to specify the type of content being returned. The body of a response message may contain the requested file, the result of a user query, or some other (often dynamically generated) response from the server.

HTTP 1.0 introduced the HEAD method, which allows a client to retrieve only the file header information without actually retrieving the file itself, and the POST method, which allows a client to send a significant amount of information (such as the data entered into a form by a user) to the server. The data sent to the server will often be passed to a CGI script to be processed in some way, and will often result in a dynamically-generated response. A typical form submission is shown below.

A typical form submission
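A form submission of this kind can be sketched in a few lines. The example below assumes the common `application/x-www-form-urlencoded` encoding; the script path `/cgi-bin/register` and the field names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode


def build_form_post(path, fields):
    """Build an HTTP/1.0 POST request whose entity body carries URL-encoded form fields."""
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.0\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


form_request = build_form_post(
    "/cgi-bin/register",  # hypothetical CGI script
    {"name": "Ada Lovelace", "email": "ada@example.com"},
)

# The server-side script would recover the fields from the body like this:
head, _, body = form_request.partition(b"\r\n\r\n")
fields = parse_qs(body.decode("ascii"))
```

Content-Length is computed from the encoded body rather than the raw field values, because it is the encoded bytes that travel in the message.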

HTTP version 1.1

HTTP version 1.1 was defined in RFC 2616 (1999) and remains in widespread use, although newer versions (HTTP/2 and HTTP/3) have since been standardised. Basic operation is the same as for HTTP 1.0, but a number of new features have been added, some of the more important of which are described below.

Hostname identification - the domain name system (DNS) allows multiple host names to be bound to the same IP address, which means that a number of websites with different domain names can reside on the same physical server. HTTP 1.1 requests include a Host: header that specifies the hostname of the host that is the subject of the request.
Persistent connections - Web pages are HTML documents that typically contain references to image files and other types of media. In previous versions of HTTP, each file referenced became the subject of a separate HTTP request, and required a new TCP connection to be set up with the server, incurring considerable overhead. Persistent connections (the default in HTTP 1.1) allow multiple HTTP requests to be serviced over the same connection. The connection remains open until it is explicitly closed by either the client or the server (or until it times out). Persistent HTTP connections greatly reduce the latency associated with multiple requests, since there is no need for repeated TCP connection handshakes. The use of a persistent connection also allows multiple HTTP requests (and the responses from the server) to be pipelined: a client can make further requests without having to wait for a response to requests already sent. There is a possibility that the server might close the connection before all responses are sent, so the client is responsible for keeping track of requests and resending them if necessary.
Chunked transfer encoding - allows a server to start sending a response without knowing its total length. The response is sent as a series of chunks, each preceded by its own size, and a zero-length chunk marks the end of the response.
Byte serving - this is where a server sends only the part of a resource that has been explicitly requested by a client.
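Of the features listed above, chunked transfer encoding is the easiest to show in code. The sketch below frames each chunk as a hexadecimal size line followed by the data, and ends with a zero-size chunk. The function names are assumptions made for illustration, and trailer headers after the final chunk are not handled.

```python
def encode_chunked(pieces):
    """Frame each piece as: size in hex, CRLF, data, CRLF; finish with a zero-size chunk."""
    out = bytearray()
    for piece in pieces:
        if piece:  # skip empty pieces: a zero-size chunk would be read as the terminator
            out += f"{len(piece):x}\r\n".encode("ascii") + piece + b"\r\n"
    out += b"0\r\n\r\n"
    return bytes(out)


def decode_chunked(data):
    """Reassemble the body from a chunked stream (trailers are ignored in this sketch)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # Anything after ';' on the size line is a chunk extension and is ignored here.
        size_field = data[pos:line_end].split(b";", 1)[0]
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            return bytes(body)
        start = line_end + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data


wire = encode_chunked([b"Wiki", b"pedia"])
restored = decode_chunked(wire)
```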

Request methods

GET - used to request a specific resource from the server when a user clicks on a hyperlink or types a URL into a browser's address box. The request becomes conditional if an If-Modified-Since header is included, such that the requested resource is only returned by the server if it has been modified since the date given by the If-Modified-Since header. This conditional use of GET reduces both network traffic and response time by allowing cached resources to be used if they are still current.
HEAD - the HEAD method performs the same function as GET, except that the response from the server does not include an entity body. This method can be used to retrieve meta-information about the requested entity without actually retrieving the entity itself, and is often used by clients that cache resources to check whether the currently stored copy has changed since it was last requested. If not, the local copy will be used. If the resource has changed, the updated version is retrieved using the GET method.
POST - submits data (for example, from a form) to the server to be processed. The URL is usually a CGI script, which performs the processing it is programmed to perform, and usually returns a dynamically generated response to the client. The data to be sent is included in the request's entity body.
OPTIONS - returns the HTTP methods that the server supports for the specified URL, and allows the client to determine the capabilities of a server (for example, to determine whether the server is HTTP 1.1 compliant) without initiating a request for a resource.
CONNECT - this method converts the request connection to a transparent TCP/IP tunnel, usually to facilitate secure (SSL-encrypted) communication (HTTPS) through an unencrypted HTTP proxy.
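To make the relationship between GET, HEAD, and OPTIONS concrete, here is a minimal sketch of a request handler working against an in-memory store. The `RESOURCES` dictionary and the `handle` function are hypothetical, and real servers do considerably more.

```python
RESOURCES = {"/index.html": b"<html>Hello</html>"}  # hypothetical in-memory store
ALLOWED = "GET, HEAD, OPTIONS"


def handle(method, path):
    """Return (status code, headers, body) for a small subset of the methods."""
    if method == "OPTIONS":
        # Report the supported methods without touching any resource.
        return 200, {"Allow": ALLOWED}, b""
    if method not in ("GET", "HEAD"):
        return 405, {"Allow": ALLOWED}, b""
    content = RESOURCES.get(path)
    if content is None:
        return 404, {}, b""
    headers = {"Content-Type": "text/html", "Content-Length": str(len(content))}
    # HEAD returns the same headers as GET, but with no entity body.
    return 200, headers, content if method == "GET" else b""


get_response = handle("GET", "/index.html")
head_response = handle("HEAD", "/index.html")
```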

Header fields

Content-Type - indicates the media type of the data sent to the client or (in the case of the HEAD method) the media type that would have been sent had the request been a GET.
Date - the date and time at which the message was created.
Expires - the date after which the resource ceases to be current.
From - the email address of the user initiating the request.
Host - HTTP 1.1 enables a single IP address to be associated with multiple hostnames, allowing a number of Web sites with different domain names to exist on the same physical server. Each HTTP request must specify the hostname for which the request is intended using the Host: header (the only mandatory header in an HTTP 1.1 request).
If-Modified-Since - used with GET to make the request conditional. The requested resource is only returned if it has been modified since the time specified; otherwise a 304 (Not Modified) response will be returned.
Last-Modified - indicates the date and time at which the resource was last modified. If the recipient has a copy of the resource that is older than the date given, it will be considered stale.
Location - used to redirect the client to a location other than the request URL in order to complete the request or identify a new resource. The field will contain an absolute URL.
Referer - allows the client to pass the server the URL of the resource from which the requested URL was obtained.
Server - this field contains information about the server software application that handled the request.
User-Agent - information about the user agent originating the request. This information can be used for statistical purposes, and to tailor the response to take account of any user agent limitations.
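The interaction between Last-Modified and If-Modified-Since can be sketched as a server-side check. This is an illustrative simplification using Python's standard date helpers; the function name `conditional_get` is an assumption, and an unparseable date is simply ignored.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(last_modified, if_modified_since=None):
    """Return 304 if the resource is unchanged since the header date, otherwise 200."""
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: treat the request as unconditional
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)  # assume UTC when no zone is given
            # HTTP dates have one-second resolution, so drop microseconds before comparing.
            if last_modified.replace(microsecond=0) <= since:
                return 304
    return 200


modified = datetime(1994, 10, 29, 19, 43, 31, tzinfo=timezone.utc)
last_modified_header = format_datetime(modified, usegmt=True)
status_unchanged = conditional_get(modified, "Sat, 29 Oct 1994 19:43:31 GMT")
status_changed = conditional_get(modified, "Fri, 28 Oct 1994 19:43:31 GMT")
```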

Proxies and caches

A proxy server receives an HTTP request from a client computer and forwards it to the appropriate web server on behalf of the client, which can conceal the client's identity from the web server. The web server sends the response to the proxy server, which forwards it to the client. The proxy may store a copy of the response in its cache for a period of time, from which it can service additional requests for the same resource. This has the effect of reducing the overall amount of Internet traffic and speeding up response times. The proxy can make a conditional request to the server, in which case the server will return the requested document only if the proxy does not already have an up-to-date version in its cache. The origin server may also include an Expires: header with any resource returned that provides a time frame within which the proxy may consider the resource to be current. The client can, if required, specify that a request should not be satisfied from a cache, or may specify the maximum acceptable age of a cached response.

A proxy server relays client requests to a web server
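The caching behaviour described above can be sketched with a small class that serves a stored copy until its expiry time passes. Everything here is a simplifying assumption: the `ProxyCache` class, the `origin` function standing in for the web server, and the use of plain numbers for timestamps.

```python
class ProxyCache:
    """Serve cached copies while they are fresh; otherwise fetch from the origin again."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: url -> (body, expires_at)
        self._store = {}

    def get(self, url, now):
        entry = self._store.get(url)
        if entry is not None and now < entry[1]:
            return entry[0], "HIT"
        body, expires_at = self._fetch(url)
        self._store[url] = (body, expires_at)
        return body, "MISS"


calls = []


def origin(url):
    """Hypothetical origin server: records each request and returns a fixed expiry time."""
    calls.append(url)
    return b"<html>page</html>", 100.0  # considered current until t = 100


cache = ProxyCache(origin)
first = cache.get("http://example.com/", now=0.0)
second = cache.get("http://example.com/", now=50.0)
third = cache.get("http://example.com/", now=150.0)
```

Only the first and third lookups reach the origin; the second is answered from the cache because the stored copy has not yet expired.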

Security and authentication

Basic authentication is a challenge-response access control mechanism in which a user's password is sent across the network with only Base64 encoding, which is effectively clear text. An alternative method called digest authentication, developed in parallel with HTTP 1.1, employs a password known only to client and server. The client computes a digest value using the Message Digest 5 (MD5) algorithm. The value is computed using a combination of the username and password, and several values related to the requested resource. It may also be time-dependent. The computed digest value is then sent across the network to the server, which computes the same digest from the information it holds about the client and compares the two values to verify the client's authenticity. Digest authentication is more secure than basic authentication, since the password itself is never transmitted.

The dominant method of establishing a secure HTTP connection is to use the HTTPS URL scheme, which is syntactically identical to that used for normal HTTP connections, but which signals the browser to use an added encryption layer - either Secure Sockets Layer (SSL) or Transport Layer Security (TLS). SSL is particularly suited for HTTP because it provides some protection even if only one end of the connection (typically the server) is authenticated.
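The difference between the two authentication schemes can be seen in a short sketch. The digest calculation below follows the simplest form of the scheme (no quality-of-protection options), which is an assumption made to keep the example brief; the function names are also illustrative.

```python
import base64
import hashlib


def basic_credentials(username, password):
    """Build a Basic Authorization header value: Base64 encoding only, not encryption."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def _md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri, nonce):
    """Compute a digest response in its simplest form (no qop options)."""
    ha1 = _md5_hex(f"{username}:{realm}:{password}")  # identifies the user
    ha2 = _md5_hex(f"{method}:{uri}")                 # ties the digest to this request
    return _md5_hex(f"{ha1}:{nonce}:{ha2}")           # the server nonce makes it time-dependent


basic_header = basic_credentials("Aladdin", "open sesame")
# Anyone who intercepts the header can recover the password directly:
recovered = base64.b64decode(basic_header.split(" ", 1)[1]).decode("utf-8")

# Realm and nonce values below are made up for the example.
digest_a = digest_response("Aladdin", "example", "open sesame", "GET", "/index.html", "nonce-1")
digest_b = digest_response("Aladdin", "example", "open sesame", "GET", "/index.html", "nonce-2")
```

The server performs the same calculation with its stored credentials and compares the result with the value the client sent.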

State management

From the server's point of view, all HTTP requests are stateless in that each request is considered to be independent of any other request. This model conflicts with the need that many Web applications have for keeping track of user activity. An online shopping application, for example, must keep track of what is in the user's shopping cart. Many Web applications store state information in "cookies". These are small data files which are stored on the client computer, and which are updated by the Web application as the user interacts with it. The server includes updated state information in its response to the client, which saves the information in a cookie. The data held in the cookie is sent by the client to the server as part of the next request. An application can assign an expiration time and date to a cookie, after which it may not be used. Cookies may contain sensitive information such as credit card details, passwords, and personal details. As such, they represent a potential security risk, since they may be accessed by a hacker (or some autonomous agent) gaining unauthorised access to the client system. Alternative methods of state management that do not involve the use of cookies are provided by application environments like Microsoft's Active Server Pages (ASP), Sun Microsystems' Java Server Pages (JSP), and the open-source PHP server-side scripting environment.
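The cookie exchange described above can be sketched with Python's standard `http.cookies` module. The helper names and the cookie names (`cart`, `session`) are assumptions made for illustration.

```python
from http.cookies import SimpleCookie


def make_set_cookie(name, value, max_age):
    """Build the value of a Set-Cookie response header with a lifetime in seconds."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = "/"
    cookie[name]["max-age"] = max_age
    return cookie.output(header="").strip()


def parse_cookie_header(header_value):
    """Parse the Cookie request header the client sends back on its next request."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {name: morsel.value for name, morsel in cookie.items()}


# Server side: include updated state in the response.
set_cookie_value = make_set_cookie("cart", "3", 3600)

# Client side, next request: the stored cookies are returned to the server.
state = parse_cookie_header("cart=3; session=abc123")
```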

The HTTP message body

Both request and response messages may include an entity consisting of entity header fields and an entity body. The entity body differs from the message body only if a transfer encoding has been applied to it. The presence of a message body in a request is indicated by the inclusion of a Content-Length or Transfer-Encoding header field. A response to a request that uses the HEAD method will not include a message body, even though it may well include a number of entity header fields. Most other response messages will, of course, include a message body, which will consist of all or part of the requested resource (or user data sent from the client to the server). The entity header fields provide meta-information about the entity body (or about the resource identified by the request, if the HEAD method has been used).

When an entity body is included with a message, the media type of the underlying data is specified by the header field Content-Type, while the Content-Length header specifies the length of the data in bytes. The Content-Encoding header is used to specify any additional encoding that has been applied to the entity body (such encoding is primarily used to allow a document to be compressed), and enables the client to determine the correct decoding mechanism to apply.
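The roles of the three headers can be illustrated with a short round trip. The sketch below assumes gzip as the content coding; the function names are hypothetical.

```python
import gzip


def encode_entity(data, media_type, compress=False):
    """Return (headers, body); Content-Length counts the bytes actually sent."""
    headers = {"Content-Type": media_type}
    if compress:
        data = gzip.compress(data)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(data))
    return headers, data


def decode_entity(headers, body):
    """Check the declared length, then undo any content encoding."""
    if len(body) != int(headers["Content-Length"]):
        raise ValueError("body length does not match Content-Length")
    if headers.get("Content-Encoding") == "gzip":
        return gzip.decompress(body)
    return body


original = b"<html>" + b"hello " * 200 + b"</html>"
headers, body = encode_entity(original, "text/html", compress=True)
restored_entity = decode_entity(headers, body)
```

Content-Type still describes the underlying data (here, HTML) even when the body has been compressed; Content-Encoding tells the recipient which decoding to apply first.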

Status codes

The HTTP response status line includes a machine-readable 3-digit status code, followed by a human-readable reason phrase. How the user agent deals with the response depends on the code and the response headers. If the status code indicates a problem, the user agent will usually display the reason phrase to the user to inform them of the nature of the problem. The general class of the status code is determined by the first digit, as follows:

1xx - informational (request received, process continuing)
2xx - success (the requested action was successfully completed)
3xx - redirection (further action is required to fulfil the request)
4xx - client error (the request contains bad syntax or cannot be fulfilled)
5xx - server error (the server failed to fulfil an apparently valid request)
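Because the class is determined by the first digit alone, a client can classify any status code with integer division, including codes it does not otherwise recognise. A minimal sketch, with the function name assumed for illustration:

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Map a 3-digit status code to its general class using the first digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


not_found = status_class(404)
unavailable = status_class(503)
```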

The status codes defined for HTTP 1.1 are listed in the table below (the reason phrases shown are recommendations only, and may be safely replaced with locally defined messages).

HTTP 1.1 Status Codes
Code	Reason phrase
100	Continue
101	Switching Protocols
200	OK
201	Created
202	Accepted
203	Non-Authoritative Information
204	No Content
205	Reset Content
206	Partial Content
300	Multiple Choices
301	Moved Permanently
302	Found
303	See Other
304	Not Modified
305	Use Proxy
307	Temporary Redirect
400	Bad Request
401	Unauthorized
402	Payment Required
403	Forbidden
404	Not Found
405	Method Not Allowed
406	Not Acceptable
407	Proxy Authentication Required
408	Request Time-out
409	Conflict
410	Gone
411	Length Required
412	Precondition Failed
413	Request Entity Too Large
414	Request-URI Too Large
415	Unsupported Media Type
416	Requested range not satisfiable
417	Expectation Failed
500	Internal Server Error
501	Not Implemented
502	Bad Gateway
503	Service Unavailable
504	Gateway Time-out
505	HTTP Version not supported
