UTF-1

From Wikipedia, the free encyclopedia

UTF-1 is a method of transforming ISO 10646/Unicode into a stream of bytes. Its design does not allow a decoder to resynchronise if decoding starts in the middle of a character (which makes truncation hard, among other things), and simple byte-oriented search routines cannot be used reliably with it. UTF-1 is also fairly slow to encode and decode because it requires division by a number that is not a power of 2. Because of these issues, UTF-1 never gained wide acceptance and has been almost entirely replaced by UTF-8.

Design

UTF-1 is a multi-byte encoding like UTF-8; a single Unicode code point is encoded in one, two, three, or five octets. The ASCII range is encoded as one octet, as in UTF-8, but the ASCII octets 0x21–0x7E (decimal 33–126) also appear inside UTF-1 multi-byte sequences; UTF-1 is therefore unsuited to many Internet protocols, including MIME.
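The effect on byte-oriented searching can be illustrated with values from the comparison table later in this article. The following minimal Python sketch copies the UTF-1 and UTF-8 octets for U+0100 from that table and runs a naive search for the ASCII character "!" (0x21); the variable names are illustrative only.

```python
# UTF-1 octets for U+0100, copied from the comparison table (A1 21).
utf1_u0100 = bytes.fromhex("A121")
# UTF-8 octets for the same code point (C4 80), for contrast.
utf8_u0100 = bytes.fromhex("C480")

# A naive byte search for ASCII "!" (0x21) finds a false match in UTF-1 ...
false_match_utf1 = b"!" in utf1_u0100
# ... but not in UTF-8, whose multi-byte sequences never contain ASCII octets.
false_match_utf8 = b"!" in utf8_u0100

print(false_match_utf1, false_match_utf8)  # True False
```

The text contains no exclamation mark, yet the UTF-1 search reports one, because 0x21 is the trail octet of U+0100.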

UTF-1 does not use the octets of the C0 and C1 control codes, the space, or DEL inside multi-byte encodings: any 0x00–0x20 or 0x7F–0x9F octet always stands for the corresponding ISO-8859-1 code point (U+0000–0020 and U+007F–009F, respectively). This design, with its 66 protected octets, was intended to be compatible with ISO 2022.
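As a quick check of that count, the two protected ranges can be enumerated directly. This is a small Python sketch; the names PROTECTED and UNPROTECTED are chosen here for illustration and do not come from the specification.

```python
# Octets reserved for single-octet use: C0 controls and space (0x00-0x20),
# then DEL and the C1 controls (0x7F-0x9F).
PROTECTED = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))

# Every other octet value may appear inside a multi-byte sequence.
UNPROTECTED = [b for b in range(256) if b not in PROTECTED]

print(len(PROTECTED))    # 66
print(len(UNPROTECTED))  # 190

# The unprotected octets form two runs: 0x21-0x7E and 0xA0-0xFF.
print(hex(UNPROTECTED[0]), hex(UNPROTECTED[93]),
      hex(UNPROTECTED[94]), hex(UNPROTECTED[-1]))  # 0x21 0x7e 0xa0 0xff
```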

The UTF-1 encoding scheme uses "modulo 190" arithmetic (256 − 66 = 190); it was designed to encode the complete 31 bits of the original Universal Character Set (UCS-4). For comparison, UTF-8 protects all 128 ASCII octets and needs two bits in each trail byte of a multi-byte encoding for this purpose, resulting in "modulo 64" arithmetic (8 − 2 = 6, 2^6 = 64). BOCU-1 protects only the minimal set required for MIME compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
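The modulo-190 arithmetic can be sketched as a small encoder. The Python below is an illustrative reconstruction, not text from the specification: the range boundaries (0xA0, 0x100, 0x4016, 0x38E2E) and lead octets (0xA0, 0xA1, 0xF6, 0xFC) are assumptions chosen to reproduce the table in this article, and the function names are hypothetical.

```python
def _digit_to_octet(z: int) -> int:
    """Map a base-190 digit (0..189) onto an unprotected octet."""
    # Digits 0..93 become 0x21..0x7E; digits 94..189 become 0xA0..0xFF.
    return z + 0x21 if z < 0x5E else z + 0x42


def utf1_encode(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF) as UTF-1 octets."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit UCS-4 range")
    if cp < 0xA0:        # one octet, same value as in ISO-8859-1
        return bytes([cp])
    if cp < 0x100:       # two octets: 0xA0 prefix, then the value itself
        return bytes([0xA0, cp])
    if cp < 0x4016:      # two octets, lead octet 0xA1..0xF5
        y = cp - 0x100
        return bytes([0xA1 + y // 190, _digit_to_octet(y % 190)])
    if cp < 0x38E2E:     # three octets, lead octet 0xF6..0xFB
        y = cp - 0x4016
        return bytes([0xF6 + y // 190**2,
                      _digit_to_octet(y // 190 % 190),
                      _digit_to_octet(y % 190)])
    y = cp - 0x38E2E     # five octets, lead octet 0xFC or 0xFD
    return bytes([0xFC + y // 190**4,
                  _digit_to_octet(y // 190**3 % 190),
                  _digit_to_octet(y // 190**2 % 190),
                  _digit_to_octet(y // 190 % 190),
                  _digit_to_octet(y % 190)])


print(utf1_encode(0x10FFFF).hex().upper())  # FC21396E6C
```

Every multi-byte branch divides by 190 or one of its powers, which is the non-power-of-2 division that makes UTF-1 slower than the shift-and-mask operations sufficient for UTF-8.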

Code point  UTF-16BE  UTF-16LE  UTF-8     UTF-1
U+007F      007F      7F00      7F        7F
U+0080      0080      8000      C280      80
U+009F      009F      9F00      C29F      9F
U+00A0      00A0      A000      C2A0      A0A0
U+00BF      00BF      BF00      C2BF      A0BF
U+00C0      00C0      C000      C380      A0C0
U+00FF      00FF      FF00      C3BF      A0FF
U+0100      0100      0001      C480      A121
U+015D      015D      5D01      C59D      A17E
U+015E      015E      5E01      C59E      A1A0
U+01BD      01BD      BD01      C6BD      A1FF
U+01BE      01BE      BE01      C6BE      A221
U+07FF      07FF      FF07      DFBF      AA72
U+0800      0800      0008      E0A080    AA73
U+0FFF      0FFF      FF0F      E0BFBF    B548
U+1000      1000      0010      E18080    B549
U+4015      4015      1540      E48095    F5FF
U+4016      4016      1640      E48096    F62121
U+D7FF      D7FF      FFD7      ED9FBF    F72FC3
U+E000      E000      00E0      EE8080    F73A79
U+F8FF      F8FF      FFF8      EFA3BF    F75C3C
U+FDD0      FDD0      D0FD      EFB790    F762BA
U+FDEF      FDEF      EFFD      EFB7AF    F762D9
U+FEFF      FEFF      FFFE      EFBBBF    F7644C
U+FFFD      FFFD      FDFF      EFBFBD    F765AD
U+FFFE      FFFE      FEFF      EFBFBE    F765AE
U+FFFF      FFFF      FFFF      EFBFBF    F765AF
U+10000     D800DC00  00D800DC  F0908080  F765B0
U+38E2D     D8A3DE2D  A3D82DDE  F0B8B8AD  FBFFFF
U+38E2E     D8A3DE2E  A3D82EDE  F0B8B8AE  FC21212121
U+FFFFF     DBBFDFFF  BFDBFFDF  F3BFBFBF  FC2137B27A
U+100000    DBC0DC00  C0DB00DC  F4808080  FC2137B27B
U+10FFFF    DBFFDFFF  FFDBFFDF  F48FBFBF  FC21396E6C
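
Individual rows of the table can be checked by decoding the UTF-1 column back to a code point. The Python sketch below is a hedged reconstruction: its range offsets and lead-octet boundaries are assumptions inferred from the table rather than quoted from the specification, and the function names are hypothetical. It decodes exactly one character and expects to start at a lead octet, since UTF-1 gives no way to resynchronise from the middle of a sequence.

```python
def _octet_to_digit(b: int) -> int:
    """Inverse digit mapping: 0x21..0x7E -> 0..93, 0xA0..0xFF -> 94..189."""
    if 0x21 <= b <= 0x7E:
        return b - 0x21
    if 0xA0 <= b <= 0xFF:
        return b - 0x42
    raise ValueError(f"octet 0x{b:02X} cannot appear as a trail octet")


def utf1_decode_one(data: bytes) -> int:
    """Decode the single code point encoded by `data` (must start at a lead octet)."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    # The lead octet alone determines the sequence length.
    expected = 1 if lead < 0xA0 else 2 if lead <= 0xF5 else 3 if lead <= 0xFB else 5
    if len(data) != expected:
        raise ValueError(f"expected {expected} octet(s), got {len(data)}")
    if lead < 0xA0:                      # single octet
        return lead
    if lead == 0xA0:                     # 0xA0 prefix: next octet is the value
        if data[1] < 0xA0:
            raise ValueError("invalid octet after 0xA0 prefix")
        return data[1]
    digits = [_octet_to_digit(b) for b in data[1:]]
    if lead <= 0xF5:                     # two octets
        return 0x100 + (lead - 0xA1) * 190 + digits[0]
    if lead <= 0xFB:                     # three octets
        return 0x4016 + (lead - 0xF6) * 190**2 + digits[0] * 190 + digits[1]
    value = (0x38E2E + (lead - 0xFC) * 190**4 + digits[0] * 190**3
             + digits[1] * 190**2 + digits[2] * 190 + digits[3])
    if value > 0x7FFFFFFF:               # beyond the 31-bit UCS-4 range
        raise ValueError("value outside the 31-bit UCS-4 range")
    return value


# Decode a few UTF-1 entries copied from the table.
for hex_octets in ("7F", "A0A0", "A17E", "F62121", "F765B0", "FC21396E6C"):
    cp = utf1_decode_one(bytes.fromhex(hex_octets))
    print(f"{hex_octets} -> U+{cp:04X}")
```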

References

  • ISO IR 178 (PDF, 256 KB, the retired UTF-1 specification)