HTML ASCII Character Sets and Entity Definitions

Posted on November 6

By default, HTML+ documents are made up of 8-bit characters from the ISO 8859 Latin-1 character set. The network protocol used to retrieve documents may translate the character set into a locally acceptable form, e.g. EBCDIC.
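As a rough illustration of that translation step, the sketch below re-encodes Latin-1 bytes into an EBCDIC code page. The function name `translate_charset` is hypothetical, and the choice of Python's `cp500` codec as the target is an assumption; a real gateway might use a different EBCDIC variant.

```python
def translate_charset(data: bytes, target: str = "cp500") -> bytes:
    """Re-encode ISO 8859-1 (Latin-1) bytes into another character set.

    The default target, cp500, is one of the EBCDIC code pages that ship
    with Python; it is only an assumed stand-in for "a locally acceptable form".
    """
    # Latin-1 is an 8-bit set: every byte decodes to exactly one character.
    text = data.decode("latin-1")
    return text.encode(target)


latin1_bytes = "café".encode("latin-1")   # b'caf\xe9', one byte per character
ebcdic_bytes = translate_charset(latin1_bytes)

# The byte values change, but the text survives a round trip.
restored = ebcdic_bytes.decode("cp500")
```

The document content is unchanged by the translation; only the byte representation differs between the two character sets.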

The HTTP protocol uses the MIME standard (RFC 1341) to specify the document type and character set. ISO SGML entity definitions are used to include characters which are missing from the character set or which would otherwise be confused with markup elements, e.g.:

ampersand &amp; (&)

less than sign &lt; (<)

greater than sign &gt; (>)

double quote sign &quot; (")
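A minimal sketch of how an author's tool might escape these four characters before placing text into a document. The helper name `escape_markup` is made up for illustration; Python's standard `html.escape` does a similar job but also escapes the single quote.

```python
def escape_markup(text: str) -> str:
    """Replace the four markup-sensitive characters with entity references."""
    # The ampersand must be handled first; otherwise the "&" introduced by
    # the later replacements would itself be escaped a second time.
    return (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&quot;")
    )


escaped = escape_markup('a < b & "c"')
# escaped == 'a &lt; b &amp; &quot;c&quot;'
```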

Some other useful entity definitions are:

en dash &ndash; (–, half the width of an em unit)

em dash &mdash; (—, equal to the width of an “m” character)

en space &ensp;

em space &emsp;

non breaking space &nbsp;

soft hyphen &shy; (normally invisible)

copyright sign &copy; (©)

trade mark sign &trade; (™)

registered sign &reg; (®)

There are a large number of entities defined by the ISO, covering most languages and symbols for publishing and mathematics. Requiring all browsers to support these would be impractical: how, for example, should a dumb terminal show such symbols? In some cases there will be accepted ways of mapping them to normal characters, e.g. æ as ae and è as e. Perhaps the safest recommendation is that where authors need to use a specialised character or symbol, they should use ISO entity names rather than inventing their own. Browsers should leave unrecognised entity names untranslated.
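The behaviour described above could be sketched roughly as follows. This is only an illustration under several assumptions: the function `expand_entities`, the `ASCII_FALLBACK` table, and the `ascii_only` flag are all hypothetical, and Python's `html.entities.name2codepoint` table (the HTML 4 entity names) stands in for the much larger ISO entity sets.

```python
import re
from html.entities import name2codepoint

# Hypothetical fallback table for displays limited to plain ASCII,
# following the "æ as ae, è as e" mappings mentioned in the text.
ASCII_FALLBACK = {"aelig": "ae", "egrave": "e"}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_entities(text: str, ascii_only: bool = False) -> str:
    """Expand known entity references; leave unrecognised ones untouched."""

    def replace(match):
        name = match.group(1)
        if ascii_only and name in ASCII_FALLBACK:
            return ASCII_FALLBACK[name]
        if name in name2codepoint:
            char = chr(name2codepoint[name])
            # On an ASCII-only display, only substitute characters it can show.
            if not ascii_only or ord(char) < 128:
                return char
        # Unknown (or undisplayable) entity: keep the original reference.
        return match.group(0)

    return ENTITY.sub(replace, text)


full = expand_entities("&copy; &bogus; &amp;")               # "© &bogus; &"
dumb = expand_entities("&aelig;&egrave;", ascii_only=True)   # "aee"
```

Leaving the reference in place, rather than dropping it, means the reader still sees which symbol the author intended even when the browser cannot render it.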
