HTML Character Encoding
In order to display a web page correctly, your web browser needs to know what character set to use. In the early days of the World Wide Web this was not really a problem, because the only character set used for web pages was the 128-character ASCII character set.
Within a decade, however, the web had become a truly worldwide phenomenon, and the resulting demand for internationalisation led to the development of a number of disparate and largely incompatible international character sets. It soon became obvious that some kind of international standard was needed.
ASCII (sometimes referred to as US-ASCII) is the abbreviation for the American Standard Code for Information Interchange, a character encoding standard developed for electronic communication and based on typographical symbols predominantly used in the United States. The ASCII standard was first published in 1963 and was most recently updated in 1986. ASCII has been widely used to represent text in computers and telecommunications devices.
The table below lists the ASCII characters together with their decimal and hexadecimal code points. It also gives the named HTML entity used to represent a character, if one exists. Note that a numeric HTML entity exists for every printable character in the ASCII character set. It consists of an ampersand (&) followed by a hash sign (#) and then the decimal code point followed by a semicolon (;). The upper-case character 'A' would thus be represented as follows:
&#65;
The hexadecimal code point can also be used. The format is the same, except that an 'x' precedes the hexadecimal code to indicate that a hexadecimal code point is being used. Here is the hexadecimal numeric HTML entity for the upper-case character 'A':
&#x41;
The use of HTML entities is normally only required for characters that have special meaning within the HTML code itself, like the quotation mark, apostrophe, ampersand, less-than, and greater-than symbols (we have highlighted the corresponding entity names in the table).
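The numeric forms described above can also be generated programmatically. The following is a minimal Python sketch; the helper name numeric_references is our own invention for illustration, while html.unescape() is part of the Python standard library.

```python
import html

def numeric_references(char: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal numeric entities for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"

decimal_ref, hex_ref = numeric_references("A")
print(decimal_ref, hex_ref)  # &#65; &#x41;

# html.unescape() decodes either form back to the original character.
print(html.unescape(decimal_ref) == html.unescape(hex_ref) == "A")  # True
```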
ASCII uses a 7-bit character encoding (the eighth bit was originally reserved for parity checking). This allows a total of one hundred and twenty-eight characters to be encoded. The first thirty-two characters are non-printing control characters (now mostly obsolete) used in data transmission, or to control devices such as printers. The last character (127 or 7Fhex) is the delete (DEL) control character.
The ninety-five remaining code points (32 to 126 or 20hex to 7Ehex) all represent printable characters, including the digits 0 to 9, the lowercase letters a to z, the uppercase letters A to Z, a number of punctuation symbols and other symbols, and the space character. The ASCII character set forms the basis for more recent character encoding standards such as ISO-8859 and UTF-8.
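The arithmetic behind these ranges is easy to check. The short Python sketch below simply counts the code points in each range; the variable names are ours.

```python
# Classify all 128 ASCII code points as control or printable.
control = [cp for cp in range(128) if cp < 32 or cp == 127]
printable = [cp for cp in range(128) if 32 <= cp <= 126]

print(len(control), len(printable))  # 33 95

# The highest code point (127) fits in seven bits.
print((127).bit_length())  # 7

# The printable range starts with the space character and ends with the tilde.
print(repr(chr(32)), repr(chr(126)))  # ' ' '~'
```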
The default character set for the HTML 4.0 standard, which emerged towards the end of the 1990s, was based on the ISO/IEC 8859 series of standards for 8-bit character encodings. Unlike ASCII, ISO/IEC 8859 used all eight bits to encode characters and was thus able to represent twice as many characters.
The first 128 characters were identical to those in the ASCII character set, making ISO/IEC 8859 a superset of ASCII. Characters 128-159 were reserved for control characters. The remaining 96 places were intended for non-ASCII characters used by other languages based on the classical Latin alphabet.
Because the number of extra characters required to support these languages far exceeded the 96 places available, the ISO/IEC 8859 standard consisted of fifteen separate parts: ISO/IEC 8859-1 - ISO/IEC 8859-16 (ISO/IEC 8859-12 was abandoned). Each part was intended for use with a different set of languages.
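To see why the choice of part matters, consider what happens to a single byte in the upper half of the range. The Python sketch below decodes the same byte using four parts of the standard; the codec names are those used by Python's standard library, and the choice of byte (E6hex) is arbitrary.

```python
# One byte value, decoded with four different parts of ISO/IEC 8859.
raw = bytes([0xE6])
parts = ("iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7")

decoded = {codec: raw.decode(codec) for codec in parts}
for codec, char in decoded.items():
    print(codec, char, f"U+{ord(char):04X}")

# The lower half (0 to 127) is plain ASCII in every part.
print(all(b"HTML".decode(codec) == "HTML" for codec in parts))  # True
```

The same byte yields a different character in each part, which is why a page encoded with one part and read with another displays the wrong text.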
The table below gives a brief summary of the ISO/IEC 8859 series of standards.
|Part|Name|Coverage|
|---|---|---|
|ISO/IEC 8859-1|Latin-1 (Western European)|Danish (partial), Dutch (partial), English, Faeroese, Finnish (partial), French (partial), German, Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Catalan, Swedish, Albanian, Southeast Asian Indonesian, Afrikaans, Swahili. Revised as ISO/IEC 8859-15 in 1999.|
|ISO/IEC 8859-2|Latin-2 (Central European)|Bosnian, Polish, Croatian, Czech, Slovak, Slovene, Serbian, Hungarian.|
|ISO/IEC 8859-3|Latin-3 (South European)|Turkish, Maltese, Esperanto. Largely superseded by ISO/IEC 8859-9 for Turkish.|
|ISO/IEC 8859-4|Latin-4 (North European)|Estonian, Latvian, Lithuanian, Greenlandic, Sami.|
|ISO/IEC 8859-5|Latin/Cyrillic|Belarusian, Bulgarian, Macedonian, Russian, Serbian, Ukrainian (partial).|
|ISO/IEC 8859-6|Latin/Arabic|Covers most common Arabic language characters.|
|ISO/IEC 8859-7|Latin/Greek|Modern Greek, Ancient Greek (limited to monotonic orthography).|
|ISO/IEC 8859-8|Latin/Hebrew|Covers the modern Hebrew alphabet as used in Israel.|
|ISO/IEC 8859-9|Latin-5 (Turkish)|Turkish (similar to ISO/IEC 8859-1, but with Turkish characters replacing Icelandic characters).|
|ISO/IEC 8859-11|Latin/Thai|Contains characters needed for the Thai language.|
|ISO/IEC 8859-13|Latin-7 (Baltic Rim)| |
|ISO/IEC 8859-14|Latin-8 (Celtic)|Celtic languages such as Gaelic and the Breton language.|
|ISO/IEC 8859-15|Latin-9|A revision of 8859-1 that removes seldom-used symbols and adds the euro sign € and the letters Š, š, Ž, ž, Œ, œ, and Ÿ (completes coverage of French, Finnish and Estonian).|
|ISO/IEC 8859-16|Latin-10 (South-Eastern European)|Albanian, Croatian, Hungarian, Italian, Polish, Romanian, Slovene, Finnish, French, German, Irish Gaelic.|
The default character set for HTML 4 was the Latin-1 (Western European) character set ISO/IEC 8859-1. To use a different character set, it was necessary to specify the character encoding explicitly, for example by using a meta element with the charset attribute, or by using the http-equiv and content attributes, within the document's <head> element.
For the majority of languages, the required character repertoire could be found in a single part of the standard, but this was not true in every case. Moreover, a number of East Asian languages, including Chinese, Japanese and Korean, are not represented at all in the ISO/IEC 8859 system. These shortcomings would be addressed by the development of the Unicode standard.
The table below lists characters 128 – 255 in the ISO/IEC 8859-1 repertoire together with their decimal and hexadecimal code points and named HTML entities. Numeric HTML entities exist for every printable character in the ISO/IEC 8859-1 repertoire and can be derived in the manner described for ASCII printable characters (see above).
Windows-1252 (or CP-1252) is an 8-bit character encoding scheme for the Latin alphabet that was developed by Microsoft and is the default encoding scheme for versions of the Microsoft Windows operating system and other Microsoft software intended for use with English and some other Western languages. Although almost identical to ISO 8859-1, Windows-1252 has never been an ANSI or ISO standard.
Because of the former popularity of Windows-1252 (at one time probably one of the most widely used character encoding schemes in the world) the charset label "windows-1252" is still recognised by most if not all browsers, although probably less than one percent of web sites worldwide now declare the use of Windows-1252.
As for ISO 8859-1, the first 128 characters in Windows-1252 (i.e. code points 0-127) are identical to those in the ASCII character set. Windows-1252 is thus a superset of ASCII. In fact, it differs from ISO 8859-1 only in its use of printable characters rather than control characters for code points 128 through 159. A summary of these characters is shown in the table below.
The code point representations shown in the above table are those used in the final version of Windows-1252, which made its first appearance in Windows 98, and was subsequently ported to older versions of Windows.
In terms of printable characters, Windows-1252 can be considered to be a superset of ISO 8859-1. In addition to all of the printable characters in ISO 8859-1, Windows-1252 includes curly quotation marks, and all of the printable characters in ISO 8859-15 that were not included in ISO 8859-1 (albeit in different positions).
Windows-1252 characters have in the past been included in web pages that claimed to use the ISO 8859-1 charset. This can occur, for example, when text containing "smart quotes" is created in Microsoft Word and then pasted into an HTML document. The quotation marks and apostrophes are subsequently read by the browser as control characters and are displayed incorrectly.
There are still a significant number of websites in existence that make use of Windows-1252 characters, but incorrectly identify the charset as ISO 8859-1. The default behaviour of most browsers, whenever they encounter a reference to ISO 8859-1, is to parse the text as if the charset has been declared as Windows-1252. This ensures that any non-ISO 8859-1 characters will be correctly displayed.
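The "smart quotes" problem can be reproduced in a few lines. The Python sketch below decodes the same three-part byte string first as ISO 8859-1 and then as Windows-1252, using the standard library codec names iso8859_1 and cp1252; the sample text is ours.

```python
# The bytes Windows-1252 uses for a pair of curly double quotation marks.
raw = b"\x93Hello\x94"

as_iso_8859_1 = raw.decode("iso8859_1")
as_windows_1252 = raw.decode("cp1252")

# Read as ISO 8859-1, the first and last bytes become control characters.
print([f"U+{ord(ch):04X}" for ch in (as_iso_8859_1[0], as_iso_8859_1[-1])])
# ['U+0093', 'U+0094']

# Read as Windows-1252, they become the intended quotation marks.
print([f"U+{ord(ch):04X}" for ch in (as_windows_1252[0], as_windows_1252[-1])])
# ['U+201C', 'U+201D']
```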
In 1987, three software engineers – Joe Becker, Lee Collins and Mark Davis – initiated a project to develop a universal character encoding scheme which they called Unicode. The project led to the incorporation of the Unicode Consortium in California in 1991, whose stated aim was to develop, extend, and promote the use of the Unicode Standard.
The Unicode Standard claims to include encodings for (almost) every character, punctuation mark and symbol used by every language in the world. It is supported by current versions of virtually every operating system and web browser. The encoding scheme used is called the Unicode Transformation Format (UTF) and has several variants. The most relevant variant from the point of view of developing web pages is UTF-8.
UTF-8 is a variable-width encoding scheme that can use from one to four 8-bit bytes to represent any character in the Unicode repertoire. Like each of the ISO 8859 standards that preceded it, it is a superset of ASCII. Its repertoire also includes every character found in the ISO 8859 standards, although characters above code point 127 are encoded using two or more bytes rather than one. The World Wide Web Consortium (W3C) guidelines on Internationalisation techniques, under the heading Choosing and applying a character encoding, now advise web authors to choose UTF-8 for all content.
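The variable-width nature of UTF-8 is easiest to appreciate by looking at the bytes. The Python sketch below encodes four sample characters (our own choice) and reports how many bytes each one needs.

```python
# Four sample characters: A, e-acute, the euro sign and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

widths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    widths.append(len(encoded))
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex(" "))

print(widths)  # [1, 2, 3, 4]

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("HTML".encode("ascii") == "HTML".encode("utf-8"))  # True
```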
The first 256 Unicode code points are identical to the 256 characters of ISO 8859-1. This represents just a tiny fraction, however, of the total code space available. Unicode has a total of 1,114,112 code points! The Unicode code space is divided into seventeen planes (the Basic Multilingual Plane plus sixteen supplementary planes), each of which contains 65,536 (2^16) code points.
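The size of the code space follows directly from the number of planes. The Python sketch below checks the arithmetic and works out which plane a character belongs to; plane_of is a hypothetical helper of our own, and sys.maxunicode is the highest code point Python supports.

```python
import sys

PLANE_SIZE = 2 ** 16   # 65,536 code points per plane
PLANE_COUNT = 17       # the BMP plus sixteen supplementary planes

print(PLANE_SIZE * PLANE_COUNT)  # 1114112
print(sys.maxunicode + 1)        # 1114112

def plane_of(char: str) -> int:
    """Return the number of the plane (0 to 16) containing the character."""
    return ord(char) // PLANE_SIZE

# 'A' and the euro sign are in plane 0; the emoji U+1F600 is in plane 1.
print(plane_of("A"), plane_of("\u20ac"), plane_of("\U0001F600"))  # 0 0 1
```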
Unicode version 12.1 (released in May 2019) defines a total of 137,994 encoded characters with unique identifying names, although it should be noted that more complex characters and symbols (emojis, for example) can be created using a combination of two or more characters from this code space. It is therefore not possible to calculate the actual number of character representations possible with Unicode.
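A simple case of such a combination is an accented letter built from a base letter and a combining mark. The Python sketch below uses the standard unicodedata module; the example sequence is ours, and the normalisation step is shown only to demonstrate that the two-character sequence and the single precomposed character are treated as equivalent.

```python
import unicodedata

# The letter 'e' followed by a combining acute accent: two code points.
sequence = "e\u0301"
print(len(sequence))  # 2
print([unicodedata.name(ch) for ch in sequence])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# NFC normalisation composes the pair into the single character U+00E9.
composed = unicodedata.normalize("NFC", sequence)
print(len(composed), composed == "\u00e9")  # 1 True
```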
What we can say is that UTF-8 can represent every character within the Universal Coded Character Set (UCS), which is defined by the international standard ISO/IEC 10646. This is the character repertoire now used by HTML. Each character in the ISO/IEC 10646 repertoire is identified by an unambiguous name and a unique numeric value (its code point). The ISO/IEC 10646 standard is maintained in tandem with the Unicode standard, and both standards share the same set of unique code points.
Each Unicode code point is (usually) written as a hexadecimal number of at least four digits, padded with leading zeros where necessary and prefixed with an upper-case "U" followed by a plus sign. For example, the uppercase letter "E", which has the decimal code point 69 (45h), would be written as follows:
U+0045
Most web pages today specify the use of the UTF-8 encoding scheme, because it can represent all of the characters used by virtually every language in the world, as well as a huge range of symbols and characters used in areas such as mathematics and science. To some extent, this has eliminated the need to use HTML entity references (which we will talk about shortly) since any character in the UTF-8 repertoire will be rendered correctly by modern browsers.
Each plane of the Unicode code space is further subdivided into Unicode blocks. A block is a contiguous range of code points, and each block has a unique descriptive name. The number of code points in any Unicode block is always a multiple of 16. Blocks can vary in size, from a minimum size of 16 code points up to a maximum size of 65,536 code points.
At the time of writing, all of the named HTML entities published by the W3C refer to code points belonging to Unicode blocks contained within the Basic Multilingual Plane (BMP) except for those that refer to characters in the Mathematical Alphanumeric Symbols block, which is part of the Supplementary Multilingual Plane (SMP).
Declaring a character encoding
The default character encoding for HTML5, and the one recommended by the World Wide Web Consortium for all content, is UTF-8. Having said that, there is nothing to prevent you from using alternative character encodings should you (for whatever reason) feel the need to do so.
Whichever character encoding standard you decide to use, you should always declare it in the head of your HTML documents in order to ensure that a web browser will render your pages correctly. If you don't specify a character encoding, the browser will either assume the default encoding for HTML 5 or use the encoding (if any) specified in the HTTP header. Which brings us to a rather tricky point.
The character encoding declared in the web server's HTTP header when it delivers your content will override any character encoding you specify in your HTML document. If your chosen character encoding differs from the character encoding specified by the web server's HTTP header, a web browser may not display all of your text correctly.
We'll come back to that point shortly. Before we do, let's look at how we actually declare our chosen character encoding within the HTML document itself. The W3C has this to say:
"Always declare the encoding of your document using a meta element with a charset attribute, or using the http-equiv and content attributes (called a pragma directive). The declaration should fit completely within the first 1024 bytes at the start of the file, so it's best to put it immediately after the opening head tag."
Both of the following declarations do exactly the same thing:
<meta charset="utf-8">
. . .
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
. . .
According to the W3C, it doesn't matter which one you use (although we prefer the first option because it's shorter and easier to remember). One other point here is of paramount importance: you must ensure that the HTML file is saved with the same character encoding you have declared for the document. Most HTML editors will allow you to choose the encoding to be used when saving a file; the default encoding is usually UTF-8, but you might want to check, just to make sure.
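If you want to check that a file really is saved in the encoding it declares, something along the following lines may help. This is a rough Python sketch rather than a full HTML parser: the regular expression and the function names declared_encoding and saved_as_declared are our own, and the check only looks at the first 1024 bytes, as the W3C advice suggests.

```python
import re

# Matches both <meta charset="..."> and the http-equiv/content form.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def declared_encoding(document: bytes):
    """Return the charset declared in the first 1024 bytes, or None."""
    match = META_CHARSET.search(document[:1024])
    return match.group(1).decode("ascii").lower() if match else None

def saved_as_declared(document: bytes) -> bool:
    """Return True if the bytes decode cleanly using the declared charset."""
    encoding = declared_encoding(document)
    if encoding is None:
        return False
    try:
        document.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return False
    return True

page = '<head><meta charset="utf-8"><title>Caf\u00e9</title></head>'
print(saved_as_declared(page.encode("utf-8")))      # True
print(saved_as_declared(page.encode("iso8859_1")))  # False
```

Note that the test is one-sided. A file declared as ISO 8859-1 will always pass, because every byte sequence is valid in an 8-bit encoding, so the sketch is most useful for catching files that claim to be UTF-8 but are not.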
Now let's tackle the issue of the HTTP header. As stated earlier, any character encoding declared in the HTTP header used by the web server to serve your documents will override the character encoding declaration used in your HTML document; this is obviously not an issue if they are both the same. If they are not, then what can you do about it? There are a couple of options available to you.
If you have administrative access to the server settings, you could change the character encoding declared by the HTTP header to match that of your HTML documents. Note that you should still declare the character encoding in the head of your HTML file, as it will be used by user agents when rendering your document in offline mode. You must also make sure that the character encoding declared in your document matches the character encoding used in the HTTP header.
If you do not have access to the web server settings (this is likely to be the case if you are using a web hosting service, for example) you may still be able to override the default "Content-Type" HTTP settings using a file called .htaccess, which you would need to create and upload to your website's root directory. Using the Header directive within the .htaccess file allows you to add, modify or delete HTTP response headers.
A detailed discussion of how you would go about implementing either of the above solutions is beyond the scope of this page. The good news is that if you stick to using UTF-8, there shouldn't be a problem, since most web servers don't specify a charset, and most of those that do specify UTF-8 by default.
One final point relates to something called the byte order mark or BOM. This is a short byte sequence (the encoded form of the code point U+FEFF) that is sometimes included at the start of HTML files that use a Unicode character encoding. It occupies two bytes in UTF-16 and three bytes (EF BB BF) in UTF-8. Before the advent of UTF-8 in 1993, all Unicode characters were transmitted as 16-bit code units with either the most significant byte first or the least significant byte first, depending on the order in which they were stored in memory. The BOM was added to the beginning of a transmission to indicate the byte order being used.
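The byte order mark is the code point U+FEFF, and its byte representation depends on the encoding: two bytes in UTF-16 and three bytes in UTF-8. The Python sketch below uses the BOM constants from the standard codecs module to detect one at the start of some data; the function name sniff_bom is our own.

```python
import codecs

BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_bom(data: bytes):
    """Return the encoding indicated by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

with_bom = codecs.BOM_UTF8 + "<!DOCTYPE html>".encode("utf-8")
print(sniff_bom(with_bom))            # utf-8
print(sniff_bom(b"<!DOCTYPE html>"))  # None

# Python's "utf-8-sig" codec strips a leading BOM when decoding.
print(with_bom.decode("utf-8-sig"))   # <!DOCTYPE html>
```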
If there is a UTF-8 byte order mark at the beginning of an HTML file, most modern browsers will interpret this as meaning that the character encoding used by the document is UTF-8. Furthermore, this will (in the majority of cases) override any character encoding declared elsewhere, including the HTTP Content-Type header.
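The order of precedence described so far (byte order mark first, then the HTTP header, then the declaration in the document) can be summarised in a few lines of Python. This is a simplified sketch of the behaviour described on this page, not a reproduction of any browser's actual algorithm. The function names are hypothetical, and the UTF-8 fallback is an assumption based on the default mentioned earlier. The header parsing relies on the standard library's email.message.Message class.

```python
from email.message import Message

def charset_from_header(content_type: str):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # lower-cased, or None if absent

def effective_encoding(bom=None, http_header=None, meta=None, fallback="utf-8"):
    """Pick the encoding that wins: BOM, then HTTP header, then meta element."""
    for candidate in (bom, http_header, meta):
        if candidate:
            return candidate.lower()
    return fallback

header = charset_from_header("text/html; charset=ISO-8859-1")
print(header)                                                   # iso-8859-1
print(effective_encoding(http_header=header, meta="utf-8"))     # iso-8859-1
print(effective_encoding(bom="utf-8", http_header=header))      # utf-8
print(effective_encoding(meta="utf-8"))                         # utf-8
```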
You can check for the presence of a BOM at the start of a web page using the W3C Internationalization Checker. Enter the URL of the page you want to check, then click on the Check button. Details of the document's character encoding, including the presence or absence of a byte order mark, will be shown in the Character encoding section of the Internationalization Checker's output, as illustrated below.
W3C's Internationalization Checker will indicate whether a BOM is present
Note that the Unicode standard allows, but does not require or recommend, the use of a byte order mark for UTF-8. It does not recommend removing a BOM if present, because some code may require it in order to function correctly. A number of Microsoft applications - Notepad for example - add a BOM when saving text as UTF-8 and will be unable to interpret UTF-8 text correctly unless the BOM is present (or the file consists purely of ASCII characters).
The byte order mark can cause problems for some legacy software that does not know how to handle it, although these problems are gradually disappearing with the adoption of up-to-date browsers and HTML editing programs. Perhaps of more concern, though not something we need to worry about just yet, is that the presence of a BOM in files used by some scripting languages can sometimes cause unwanted side effects.
HTML character references
HTML character references provide an alternative means of representing characters in your HTML code. Essentially, since UTF-8 can be used to represent every code point in the Unicode code space, the only characters you must use an HTML entity for are the HTML reserved characters - ampersand (&), less-than (<), and greater-than (>). There are other circumstances, however, where the use of HTML character references may be necessary (or simply more convenient).
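When text is inserted into a page by a program, the reserved characters are usually escaped automatically. The Python sketch below uses html.escape() and html.unescape() from the standard library; the sample string is ours, and quote=False is passed so that only the three reserved characters mentioned above are replaced.

```python
import html

text = 'if (a < b && b > c) print("ok")'

# Replace &, < and > with their named character references.
escaped = html.escape(text, quote=False)
print(escaped)  # if (a &lt; b &amp;&amp; b &gt; c) print("ok")

# Unescaping restores the original text.
print(html.unescape(escaped) == text)  # True
```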
In HTML documents, each character used can either represent itself directly or be represented by a sequence of characters called a character reference. There are two types of character reference - a numeric character reference and a character entity reference. A numeric character reference uses the character's Unicode code point expressed as a decimal or hexadecimal number, whereas a character entity reference uses a unique entity name.
The following character references all represent the Greek capital letter Omega (Ω):
&#937; <!-- decimal numeric character reference -->
&#x3A9; <!-- hexadecimal numeric character reference -->
&Omega; <!-- character entity reference -->
Note the format for each representation. All HTML character references begin with the ampersand character (&) and end with a semi-colon (;). Numeric character references must include either the decimal representation of the character's code point prefixed with a hash sign (#), or the hexadecimal representation of the character's code point prefixed with a hash sign plus the lowercase letter "x". Leading zeros are ignored.
Named entities must include the unique W3C-defined symbolic name that refers to the character. Note that entity names are case sensitive. Note also that, in a number of cases, there may be two or more names defined for the same Unicode character (the reasons for this are probably historical; it doesn't seem to matter which name is used).
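These rules can be explored using Python's standard library, which includes a copy of the HTML5 named entity list in html.entities.html5 (keys are entity names, usually with the trailing semicolon). The sketch below is illustrative only; the variable names are ours, and the list of alternative names it prints depends on the entity table shipped with your Python version.

```python
import html
from html.entities import html5

# Decimal, hexadecimal and named references for capital Omega (U+03A9).
references = ["&#937;", "&#x3A9;", "&Omega;"]
decoded = [html.unescape(ref) for ref in references]
print(all(ch == "\u03a9" for ch in decoded))  # True

# Leading zeros in a numeric reference are ignored.
print(html.unescape("&#0937;") == "\u03a9")   # True

# Entity names are case sensitive: &omega; is the lowercase letter.
print(html.unescape("&omega;") == "\u03c9")   # True

# Every entity name defined for U+03A9 (there may be more than one).
names = sorted(name for name, value in html5.items() if value == "\u03a9")
print(names)
```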
At the time of writing, W3C defines names for 1450 Unicode characters (and yes, we did count them - twice in fact), although the current list is a work in progress. We present a list of the named entities, in Unicode code point order and organised by Unicode block (see above) at the bottom of this page. As far as we have been able to ascertain (by carrying out a number of random checks), numeric character references can be used for any currently defined Unicode character.
The arguments for and against the use of HTML character references are many and varied, and a detailed discussion of the relative merits thereof is beyond the scope of this page. The general consensus seems to be that you should use actual Unicode characters (as opposed to HTML character references) wherever possible. This has the advantage of reducing document size and making your HTML code easier to read.
The use of Unicode characters is obviously conditional on code editors, browsers, and other related software providing support for Unicode. It is also imperative to save your HTML document in the correct (i.e. UTF-8) format, and to declare the character encoding as UTF-8 within the document's <head> element.
Links to character reference tables
Use the links below to view listings of commonly used Unicode code blocks. Each listing includes the HTML numeric entity reference and entity name (if applicable) for each code point in the block, together with a brief description. Characters are listed in ascending numerical order according to their Unicode code point value. NOTE: before using an entity reference in your HTML pages, it is worth checking whether it is supported by popular browsers, as support can vary (particular care should be taken, for example, when using combining diacritical marks).