UTF-8

Standard: Unicode Standard
Classification: Unicode Transformation Format, extended ASCII, variable-length encoding
Extends: ASCII
Transforms / Encodes: ISO/IEC 10646 (Unicode)
Preceded by: UTF-1

UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit.[1] Almost every webpage is stored in UTF-8.

UTF-8 can encode all 1,112,064[2] valid Unicode scalar values using a variable-width encoding of one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. UTF-8 was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as in ASCII, so a UTF-8-encoded file that uses only those characters is identical to an ASCII file. Most software designed for any extended ASCII can read and write UTF-8 (including on Microsoft Windows), which results in fewer internationalization issues than any alternative text encoding.[3][4]
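
The properties described above can be checked with a short Python sketch. It relies only on the built-in `str.encode` method; the helper name `utf8_length` and the sample characters are illustrative choices, not part of any standard.

```python
# Number of Unicode scalar values: 17 planes of 2**16 code points each,
# minus the 2**11 surrogate code points (U+D800 to U+DFFF).
SCALAR_VALUE_COUNT = 17 * 2**16 - 2**11


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses to encode the given character."""
    return len(ch.encode("utf-8"))


# Illustrative samples with increasing code point values:
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (emoji).
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ch) for ch in samples]

# Text restricted to the first 128 code points encodes to the same bytes
# in UTF-8 as in ASCII.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

if __name__ == "__main__":
    print(SCALAR_VALUE_COUNT)  # 1112064
    print(lengths)             # [1, 2, 3, 4]
    print(same_bytes)          # True
```

The four samples were picked so that each falls in a different length class, which shows lower code points taking fewer bytes.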

UTF-8 is the dominant encoding for all countries and languages on the internet, is used in most standards (often as the only allowed encoding), and is supported by all modern operating systems and programming languages.

  1. ^ "Chapter 2. General Structure". The Unicode Standard (6.0 ed.). Mountain View, California, US: The Unicode Consortium. ISBN 978-1-936213-01-6.
  2. ^ "Conformance". The Unicode Standard (6.0 ed.). Mountain View, California, US: The Unicode Consortium. D76 Unicode scalar value. ISBN 978-1-936213-01-6. - 17 planes times 2^16 code points per plane, minus 2^11 technically-invalid surrogates
  3. ^ Cite error: The named reference Microsoft GDK was invoked but never defined (see the help page).
  4. ^ Cite error: The named reference whatwg was invoked but never defined (see the help page).