Markup Language

What is a markup language?

A markup language is neither a programming language nor a scripting language. It is rather a way of annotating a document, using symbolic instructions, to show how it should be laid out and formatted. Markup of one sort or another has been used for centuries by printers and publishers to specify how text and images should appear on the printed page. Traditionally, markup has been used to specify the typeface, style and size of text in books, magazines, newspapers and other printed publications. More recently, it has been used to specify how web pages are displayed in a browser window. The first markup language used to create web pages was the HyperText Markup Language (HTML).

During the 1970s and 1980s IBM, thanks largely to the ideas and persistence of researcher Charles Goldfarb, developed the IBM Generalised Markup Language (GML), which ultimately became the Standard Generalised Markup Language (SGML) in which most of today's markup languages, including HTML, have their roots. The SGML specification was released as a standard by ISO in 1986. SGML markup focused on the structural aspects of a document, leaving the visual presentation to the interpreter. It specified a syntax for the inclusion of markup in documents, together with a separate syntax for describing what markup instructions (called tags) were allowed and where - the Document Type Definition (DTD). With SGML, authors could create and use any markup they desired by selecting appropriate tags and using contextualised names. SGML can thus be described as a meta-language, since it essentially allows an author to create their own, specialised markup language, tailored to a specific commercial or industrial environment.

HyperText Markup Language

Tim Berners-Lee recognised the flexibility and extensibility of markup languages created with SGML, and used it as the basis for the HyperText Markup Language (HTML), although a formal DTD was only developed later. HTML, in its various forms, is today probably the most widely used of all markup languages worldwide. With HTML, as with many other markup languages, the markup instructions (or tags) are inserted into the text of a web page, which is essentially a plain text document, as illustrated by the following HTML code fragment:

<h1>Markup Language</h1>
<p>Markup instructions are embedded <em>within the text</em> of web page content.</p>

The markup instructions, or tags, are enclosed within angle brackets. The text between each pair of tags is the actual text of the document. The tags used here are examples of structural markup. The <h1> ... </h1> tag, for example, denotes that the enclosed text is a level 1 header. The <p> ... </p> tag is used to denote that the enclosed text is a paragraph. The <em> ... </em> tag denotes that the enclosed text is to be emphasised. A program that interprets such structural markup (e.g. a browser) will apply its own set of default styles when rendering the information. The typeface, font size, foreground and background colours and indentation, for example, are not specified by structural tags such as those illustrated in the above example. Different browser implementations may therefore present the same content slightly differently.

The <h1> tag, for example, declares that the text it contains is a level 1 header, so most browsers will display the text in bold type, probably at a significantly larger font size than other text. Other HTML tags, such as the <b> ... </b> tag, are considered to be presentational tags, because they tell the browser that the enclosed text should be displayed using a bold typeface, without providing any information about the purpose of the text.
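
The separation between tags and document text described above can be sketched with Python's standard html.parser module. This is only an illustration: the MarkupSplitter class and split_markup function are hypothetical names introduced here, and the fragment is a version of the earlier example with the paragraph tags written out.

```python
from html.parser import HTMLParser


class MarkupSplitter(HTMLParser):
    """Hypothetical helper: records tag names and text content separately."""

    def __init__(self):
        super().__init__()
        self.tags = []   # names of the opening tags, in document order
        self.text = []   # non-empty runs of document text

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Ignore whitespace-only runs such as the newline between elements.
        if data.strip():
            self.text.append(data.strip())


def split_markup(html):
    """Return (tag names, text runs) found in an HTML fragment."""
    parser = MarkupSplitter()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.text


FRAGMENT = (
    "<h1>Markup Language</h1>\n"
    "<p>Markup instructions are embedded <em>within the text</em> "
    "of web page content.</p>"
)

tags, text = split_markup(FRAGMENT)
print(tags)  # ['h1', 'p', 'em']
print(text)
```

Note that the parser reports only what the markup says (a header, a paragraph, emphasis). How each of these is drawn on screen is left to whatever program consumes the result.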

Certain tags must always be included in an HTML document, as they define a basic structure that is common to all HTML documents. The following code, for example, represents the basic structure of all HTML pages:

<html>
  <head>
    <title>Some text goes here</title>
  </head>
  <body>
    The page content goes here ...
  </body>
</html>

The <html> ... </html> tag encloses the entire document, and essentially tells the user agent (the browser) that this is a web page. The <head> ... </head> tag encloses the document's header information, which usually includes the page title and may include various other items of metadata (information about the document), none of which is displayed inside the browser window itself. The <body> ... </body> tag encloses the actual content of the web page, which is displayed within the browser window.

Each tag contains the name of the element type that it represents, and an opening tag may contain additional attributes and values that modify the behaviour of the element in some way. Most HTML elements require both an opening and a closing tag. The closing tag in each case can be recognised by the presence of a forward slash, "/" (also called a solidus or virgule). Page elements such as graphic images are not actually part of the HTML file itself, but are referenced by an appropriate tag, which acts as a kind of placeholder. When the browser finds an <img> tag, for example, it will retrieve the image stored in the file referenced by the tag's src attribute and display it at the appropriate position in its screen rendition of the page.
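
As a rough sketch of how a user agent might locate the files referenced by image tags, the following Python code collects the value of the src attribute from every <img> tag it meets. The ImageFinder class, the image_sources function and the file name images/logo.png are all assumptions made for the purposes of illustration.

```python
from html.parser import HTMLParser


class ImageFinder(HTMLParser):
    """Hypothetical helper: collects the src attribute of each <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the opening tag.
        if tag == "img":
            attributes = dict(attrs)
            if "src" in attributes:
                self.sources.append(attributes["src"])


def image_sources(html):
    """Return the src values of all <img> tags in an HTML fragment."""
    finder = ImageFinder()
    finder.feed(html)
    finder.close()
    return finder.sources


# The file name below is invented for this example.
PAGE = '<p>Our logo: <img src="images/logo.png" alt="Logo"></p>'
print(image_sources(PAGE))  # ['images/logo.png']
```

A real browser would go on to request each file and place the decoded image in the page layout. This sketch stops at finding the references.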

eXtensible Markup Language

eXtensible Markup Language (XML) was developed by the World Wide Web Consortium (W3C) as a simplified version of SGML, scaled down for use in Web applications. Details of the first version of XML (XML 1.0) were originally published as a W3C Recommendation in 1998. Like SGML, of which it is a subset, XML is a meta-language. Because of its SGML roots, documents created with XML are effectively also SGML documents, so existing SGML users find the transition to XML relatively painless, while much of the software used with SGML works equally well with XML. Despite much of the complexity of SGML having been removed from XML, it retains one of the most important features of SGML - extensibility. Languages created with XML are also extensible, allowing users to create and define their own tags. Because of its ease of use, flexibility, and extensibility, XML is widely used for communicating data between applications.

In HTML, both the tag set used and the tag semantics (i.e. what each tag actually does) are fixed. The tag <h1> ... </h1> is a member of the standard tag set, and is used to display the text between its opening and closing tags as a level 1 header. Although HTML itself has grown from twenty tags in the original version to over ninety in the most recent specification, and can now display a more diverse range of page elements, it is constrained by the need to maintain backward compatibility with earlier versions of HTML, and by the manner in which browser vendors interpret the HTML Document Type Definition (DTD). XML, on the other hand, places no limits on the number of tags that can be defined. The example below shows a simple XML document:

<?xml version="1.0"?>
<publication>
  <title>Why I am Overworked</title>
  <author role="Author">
    <firstname>...</firstname>
    <lastname>...</lastname>
    <company>Jones and Associates</company>
  </author>
  <abstract>This is the text of the abstract.</abstract>
</publication>

As with HTML, the XML document itself can be read and understood by humans relatively easily, and the purpose of each tag can usually be discerned from its name. The structure of the document reflects the structure of the information it contains. The ability to structure documents arbitrarily is an important feature of XML: it enables us to preserve the structure of the information being disseminated within the XML document itself, something which is far more difficult to achieve with HTML. The tags that may be used are declared in the Document Type Definition (see below), which also defines the order in which they may appear, what attributes they can take, and any default attribute values.
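
Because the structure of the information is carried by the document itself, a program can navigate it by element name. The following sketch uses Python's standard xml.etree.ElementTree module. It assumes a publication document whose root element is publication and whose author element contains firstname and lastname children; the author's name (Jane Doe) and the summarise function are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# The author's first and last names are placeholders invented for this sketch.
DOCUMENT = """<?xml version="1.0"?>
<publication>
  <title>Why I am Overworked</title>
  <author role="Author">
    <firstname>Jane</firstname>
    <lastname>Doe</lastname>
    <company>Jones and Associates</company>
  </author>
  <abstract>This is the text of the abstract.</abstract>
</publication>"""


def summarise(xml_text):
    """Pull the title and author details out of a publication document."""
    root = ET.fromstring(xml_text)
    authors = []
    for author in root.findall("author"):
        authors.append({
            "role": author.get("role"),
            "name": f"{author.findtext('firstname')} {author.findtext('lastname')}",
            # An author may have a company, a university, or neither.
            "affiliation": author.findtext("company") or author.findtext("university"),
        })
    return {"title": root.findtext("title"), "authors": authors}


summary = summarise(DOCUMENT)
print(summary["title"])    # Why I am Overworked
print(summary["authors"])
```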

<?xml version="1.0"?>
<!DOCTYPE publication [
<!ELEMENT publication (title,author+,abstract*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (firstname, lastname, (university | company)?)>
<!ATTLIST author role (Author|Techwriter) "Author">
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT university (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT abstract (#PCDATA)>
]>

An ELEMENT declaration names a discrete component of the document's content and states what that component may contain. Elements are declared in the Document Type Definition as shown above, and usually define tag pairs. In this example, the publication element must contain one title, followed by one or more author elements (indicated by "+"), followed by zero or more abstract elements (indicated by "*"). The author element contains firstname and lastname, optionally followed (indicated by "?") by one further element. The construct (university | company) tells us that this can be either the element university or the element company. The ATTLIST keyword is used to define the attributes of a particular element. In the example shown above, the author element is defined as having one attribute (role) which can take as its value either "Author" or "Techwriter" (the default value is defined as "Author"). The construct used for the other elements, e.g. <!ELEMENT firstname (#PCDATA)>, tells us that the element will contain parsed character data (i.e. text information) as opposed to further tags.
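
The content models in the ELEMENT declarations behave much like regular expressions over the sequence of child element names. The sketch below makes that idea concrete by translating two of the declarations by hand into Python regular expressions. It is not a DTD validator (it ignores attributes and text content, and Python's standard library does not validate against DTDs); the CONTENT_MODELS table and check_structure function are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated from the DTD's content models. Each child name is followed
# by a single space so that adjacent names cannot run together.
CONTENT_MODELS = {
    "publication": r"title (author )+(abstract )*",
    "author": r"firstname lastname ((university|company) )?",
}


def check_structure(element):
    """Return the names of elements whose children break their content model."""
    problems = []
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            problems.append(element.tag)
    for child in element:
        problems.extend(check_structure(child))
    return problems


VALID = ET.fromstring(
    "<publication><title>T</title>"
    "<author><firstname>A</firstname><lastname>B</lastname>"
    "<company>C</company></author></publication>"
)
# Here the title follows the author, which the content model does not allow.
INVALID = ET.fromstring(
    "<publication><author><firstname>A</firstname>"
    "<lastname>B</lastname></author><title>T</title></publication>"
)

print(check_structure(VALID))    # []
print(check_structure(INVALID))  # ['publication']
```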

XML allows us to arbitrarily define tags and the structural relationships between them. There is no predefined tag set, and thus no pre-ordained set of semantics. The semantics of an XML document will be determined by the applications that process it, or by style sheets. Because HTML and XML are both inherently related to SGML, a familiarity with HTML should make learning to author XML documents (relatively) painless. There are, however, some significant differences.

First of all, a Document Type Definition (DTD) should be provided for each XML application created so that authoring and validation tools can be used to create and validate documents. Style sheets are also needed to tell XML-enabled user agents (browsers) how the documents should be presented. These can be either cascading style sheets (CSS) or eXtensible Stylesheet Language (XSL) style sheets; XSL is the XML counterpart of CSS. The separation of content from presentation, which is encouraged but not enforced by the HTML 4.0 specification, is thus inherent in XML. There are also differences in syntax. For example, HTML tags are not case sensitive, but XML tags are. One way to avoid problems in this respect is to define all tags in lower-case.
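
The difference in case sensitivity can be demonstrated with Python's standard parsers. In this sketch, the helper names parses_as_xml and html_tag_names are invented for the example, and the behaviour shown is that of Python's xml.etree.ElementTree and html.parser modules rather than of any particular browser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def parses_as_xml(markup):
    """Return True if the markup is accepted by an XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagNames(HTMLParser):
    """Hypothetical helper: records tag names as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)

    def handle_endtag(self, tag):
        self.names.append("/" + tag)


def html_tag_names(html):
    parser = TagNames()
    parser.feed(html)
    parser.close()
    return parser.names


# In XML, <H1> and </h1> are different names, so the tags do not match.
print(parses_as_xml("<H1>Report</h1>"))   # False
print(parses_as_xml("<h1>Report</h1>"))   # True
# The HTML parser folds tag names to lower-case, so the pair matches.
print(html_tag_names("<H1>Report</h1>"))  # ['h1', '/h1']
```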

Unlike HTML, which has a limited number of rigidly defined tags, there are no constraints on the number of tags that can be defined for an XML application, and both tags and attributes can be given meaningful names that closely reflect the vocabulary of the environment for which the application is intended. Furthermore, the structure of an XML document type can be made to reflect the naturally occurring structure of the information it records. XML is therefore well suited to the creation of flexible, portable document formats, and can be used to transfer data between widely differing applications.

eXtensible HyperText Markup Language

In January 2000, the W3C published XHTML 1.0 as a Recommendation, reformulating HyperText Markup Language as an application of XML rather than SGML. Just as HTML was a product of SGML, eXtensible HyperText Markup Language (XHTML) is a product of XML. XHTML documents are required to be well-formed XML documents. This means they require a more rigorous approach in respect of their construction and syntax, although they largely retain the tag set used for HTML 4, making the transition from HTML to XHTML relatively painless for Web authors. One of the most noticeable differences between HTML and XHTML is the requirement for all tags to be closed. Empty HTML tags such as <br> must be replaced by a self-closing form of the tag: <br/>. Other changes include the requirement that all tag and attribute names must be in lower-case, and all attribute values must be quoted (e.g. colspan="2"). In essence, XHTML has the same tag set as HTML version 4.01, although W3C has deprecated a number of tags such as the <font> ... </font> tag, since purely presentational issues, including the style (or styles) of text used in a document, should now be determined using style sheets.
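
Since XHTML documents must be well-formed XML, the syntax rules above can be checked with any XML parser. The following sketch uses Python's standard xml.etree.ElementTree module; the is_well_formed function and the sample fragments are assumptions made for illustration, and the check covers well-formedness only, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# An unclosed <br> is acceptable in HTML but not in XHTML.
HTML_STYLE = "<p>First line<br>second line</p>"
XHTML_STYLE = "<p>First line<br/>second line</p>"

# Attribute values must be quoted in XHTML.
UNQUOTED = "<td colspan=2>cell</td>"
QUOTED = '<td colspan="2">cell</td>'

for sample in (HTML_STYLE, XHTML_STYLE, UNQUOTED, QUOTED):
    print(is_well_formed(sample), sample)
```

Only the second and fourth samples are accepted, which matches the rules for closed tags and quoted attribute values.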