Character encoding


Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that make up a character encoding are known as "code points" and collectively comprise a "code space", a "code page", or a "character map".

Early character codes associated with the optical or electrical telegraph could only represent a subset of the characters used in written languages, sometimes restricted to upper-case letters, numerals and some punctuation only. The low cost of digital representation of data in modern computer systems allows more elaborate character codes, such as Unicode, which represent most of the characters used in many written languages. Character encoding using internationally accepted standards permits worldwide interchange of text in electronic form.

Unicode encoding model


Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a modern, unified character encoding. Rather than mapping characters directly to octets (bytes), they separately define what characters are available, their corresponding natural numbers (code points), how those numbers are encoded as a series of fixed-size natural numbers (code units), and finally how those units are encoded as a stream of octets. The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. To describe this model correctly requires more precise terms than "character set" and "character encoding." The terms used in the modern model follow:

A character repertoire is the full set of abstract characters that a system supports. The repertoire may be closed, i.e. no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series), or it may be open, allowing additions (as is the case with Unicode and, to a limited extent, the Windows code pages). The characters in a given repertoire reflect decisions that have been made about how to divide writing systems into basic information units. The basic variants of the Latin, Greek and Cyrillic alphabets can be broken down into letters, digits, punctuation, and a few special characters such as the space, which can all be arranged in simple linear sequences that are displayed in the same order they are read. But even with these alphabets, diacritics pose a complication: they can be regarded either as part of a single character containing a letter and diacritic (known as a precomposed character), or as separate characters. The former allows a far simpler text handling system, but the latter allows any letter/diacritic combination to be used in text. Ligatures pose similar problems. Other writing systems, such as Arabic and Hebrew, are represented with more complex character repertoires due to the need to accommodate things like bidirectional text and glyphs that are joined together in different ways for different situations.
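The precomposed-versus-separate distinction can be observed directly in Python, whose standard unicodedata module exposes the Unicode normalization forms that convert between the two representations. A minimal sketch:

```python
import unicodedata

# "é" as a single precomposed character vs. a letter plus a combining diacritic.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(len(precomposed))  # 1 code point
print(len(decomposed))   # 2 code points

# Normalization converts between the two representations.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

Both strings render identically, which is why text-processing systems must normalize before comparing them.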

A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on. Multiple coded character sets may share the same repertoire; for example, ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover the same repertoire but map the characters to different code points.
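As an illustration, Python's built-in ord() and chr() expose the Unicode coded character set directly, and the cp037 codec shows the same letter mapped to a different code point in the EBCDIC-based IBM code page 037:

```python
# A code point is just the number a coded character set assigns to a character.
print(ord("A"))  # 65
print(ord("B"))  # 66
print(chr(65))   # 'A'

# The same repertoire under a different coded character set: IBM code page 037
# assigns "A" a different number than ASCII/Unicode do.
print("A".encode("cp037")[0])  # 193, not 65
```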

A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (say, 65,536 to 1.4 million) could be represented by using multiple 16-bit units. This correspondence is defined by a CEF.
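A sketch of how such a CEF works in practice: the hypothetical helper below computes the UTF-16 code units for a code point, using a single 16-bit unit for code points below 65,536 and a surrogate pair of two units for larger ones:

```python
def utf16_code_units(cp: int) -> list[int]:
    """Map a Unicode code point to its UTF-16 code units (illustrative helper)."""
    if cp < 0x10000:
        return [cp]                 # fits in a single 16-bit unit
    cp -= 0x10000                   # 20 bits remain
    high = 0xD800 + (cp >> 10)      # high (lead) surrogate: top 10 bits
    low = 0xDC00 + (cp & 0x3FF)     # low (trail) surrogate: bottom 10 bits
    return [high, low]

print([hex(u) for u in utf16_code_units(ord("A"))])  # ['0x41']
print([hex(u) for u in utf16_code_units(0x1F600)])   # ['0xd83d', '0xde00']
```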

Next, a character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE and UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using a byte order mark or escape sequences; compressing schemes try to minimize the number of bytes used per code unit (examples include SCSU, BOCU, and Punycode).
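The difference between simple and compound schemes can be seen with Python's built-in codecs: the byte-order-specific codecs emit no byte order mark, while the compound "utf-16" codec prepends one so a reader can tell which simple scheme follows. A brief sketch:

```python
import codecs

s = "A\u00e9"  # "Aé": two code points

# Simple schemes: byte order is fixed by the scheme's name, no BOM emitted.
print(s.encode("utf-16-be").hex())  # '004100e9'
print(s.encode("utf-16-le").hex())  # '4100e900'

# Compound scheme: a byte order mark is prepended in the platform's
# native byte order.
assert s.encode("utf-16").startswith(codecs.BOM)
```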

Although UTF-32BE is a simpler CES, most systems working with Unicode use either UTF-8, which is backward compatible with fixed-width ASCII and maps Unicode code points to variable-width sequences of octets, or UTF-16BE, which is backward compatible with fixed-width UCS-2BE and maps Unicode code points to variable-width sequences of 16-bit words. See comparison of Unicode encodings for a detailed discussion.
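The variable-width behaviour of both encodings can be demonstrated with Python's built-in codecs:

```python
# UTF-8 maps each code point to 1-4 octets; ASCII characters stay one byte.
for ch in "A\u00e9\u4e2d\U0001f600":  # 'A', 'é', '中', '😀'
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+4E2D -> 3 byte(s)
# U+1F600 -> 4 byte(s)

# UTF-16 is likewise variable-width: one 16-bit word inside the Basic
# Multilingual Plane, two words (a surrogate pair) above it.
print(len("\u4e2d".encode("utf-16-be")) // 2)      # 1 code unit
print(len("\U0001f600".encode("utf-16-be")) // 2)  # 2 code units
```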

Finally, there may be a higher-level protocol which supplies additional information to select the particular variant of a Unicode character, especially where there are regional variants that have been 'unified' in Unicode as the same character. An example is the XML attribute xml:lang.

The Unicode model uses the term character map for historical systems which directly assign a sequence of characters to a sequence of bytes, covering all of the CCS, CEF and CES layers.
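For a historical character map such as ISO/IEC 8859-1 (Latin-1), the three layers collapse into one direct character-to-byte mapping, as a short example shows:

```python
# In Latin-1, code point, code unit, and octet coincide: one byte per character.
data = "café".encode("latin-1")
print(list(data))  # [99, 97, 102, 233]
assert data.decode("latin-1") == "café"
```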