UTF-8

Standard: Unicode Standard
Classification: Unicode Transformation Format, extended ASCII, variable-length encoding
Extends: ASCII
Transforms / Encodes: ISO/IEC 10646 (Unicode)
Preceded by: UTF-1

UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit.[1] Almost every web page is stored in UTF-8.

UTF-8 can encode all 1,112,064[2] valid Unicode code points using a variable-width encoding of one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. UTF-8 was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are each encoded as a single byte with the same binary value as in ASCII. A UTF-8-encoded file that uses only those characters is therefore identical to an ASCII file, and most software designed for any extended ASCII can read and write UTF-8. Using UTF-8 results in fewer internationalization issues than any alternative text encoding.[3][4] Virtually all software can at least read and write UTF-8 text (including on Microsoft Windows), and it is the most-used method of storing text: as of 2024, it accounts for 98.3% of all web pages, 99.1% of the top 100,000 pages, and up to 100% of pages for many languages.[5]
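The properties described above can be checked with a short Python sketch. It relies only on Python's built-in `utf-8` and `ascii` codecs. The sample characters and the names `utf8_length`, `samples`, and `ascii_text` are illustrative choices, not part of the standard.

```python
# Valid code points: 17 planes of 2**16 code points each,
# minus the 2**11 surrogates (U+D800..U+DFFF).
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16
SURROGATES = 2 ** 11

valid_code_points = PLANES * CODE_POINTS_PER_PLANE - SURROGATES


def utf8_length(ch: str) -> int:
    """Return how many bytes Python's UTF-8 codec uses for one character."""
    return len(ch.encode("utf-8"))


# Illustrative samples, one per encoded length (assumed examples):
# U+0041 LATIN CAPITAL LETTER A, U+00E9 LATIN SMALL LETTER E WITH ACUTE,
# U+20AC EURO SIGN, U+1F600 GRINNING FACE.
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ch) for ch in samples]

# Text restricted to the first 128 code points encodes to the same bytes
# under ASCII and UTF-8.
ascii_text = "Hello, world!"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

if __name__ == "__main__":
    print(valid_code_points)
    for ch, n in zip(samples, lengths):
        print(f"U+{ord(ch):04X}: {n} byte(s), hex {ch.encode('utf-8').hex()}")
    print(same_bytes)
```

Running the script prints the code point count, the byte length and hexadecimal bytes of each sample, and whether the ASCII and UTF-8 encodings of the ASCII-only string match.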

  1. ^ "Chapter 2. General Structure". The Unicode Standard (6.0 ed.). Mountain View, California, US: The Unicode Consortium. ISBN 978-1-936213-01-6.
  2. ^ 17 planes times 2^16 code points per plane, minus 2^11 technically-invalid surrogates
  3. ^ "Microsoft GDK" (named reference; not defined in the source).
  4. ^ "whatwg" (named reference; not defined in the source).
  5. ^ "W3TechsWebEncoding" (named reference; not defined in the source).