UTF-16

Example of Unicode character encoding through UTF-16
Language(s): International
Standard: Unicode Standard
Classification: Unicode Transformation Format, variable-width encoding
Extends: UCS-2
Transforms / Encodes: ISO/IEC 10646 (Unicode)

UTF-16 (16-bit Unicode Transformation Format) is a character encoding method capable of encoding all 1,112,064 valid code points of Unicode.[a] The encoding is variable-length, as each code point is encoded with one or two 16-bit code units. UTF-16 arose from an earlier, now obsolete, fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set),[1][2] once it became clear that more than 2^16 (65,536) code points were needed,[3] including most emoji and important CJK characters such as those used in personal and place names.[4]
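
The one-or-two code unit scheme can be illustrated with a minimal Python sketch. The function name `encode_utf16_code_units` is hypothetical, and the surrogate arithmetic it uses (subtract 0x10000, split the result into two 10-bit halves, add them to 0xD800 and 0xDC00) is the standard UTF-16 construction rather than something spelled out in the paragraph above.

```python
from typing import List


def encode_utf16_code_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units (as integers) for one Unicode code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 itself and are not valid
        # code points to encode.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single 16-bit code unit.
        return [code_point]
    # Supplementary planes: a surrogate pair (two 16-bit code units).
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x1F600):
        units = encode_utf16_code_units(cp)
        print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```

Under these assumptions, U+0041 and U+20AC each occupy one code unit, while U+1F600 (an emoji outside the first 65,536 code points) becomes the pair D83D DE00. The figure of 1,112,064 valid code points corresponds to the 1,114,112 values from 0 to 0x10FFFF minus the 2,048 surrogates.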

UTF-16 is used by systems such as the Microsoft Windows API, the Java programming language and JavaScript/ECMAScript. It is also sometimes used for plain text and word-processing data files on Microsoft Windows. It is used by more modern implementations of SMS.[5]

UTF-16 is the only encoding still allowed on the web that is incompatible with 8-bit ASCII.[6][b] However, it has never gained popularity on the web, where it is declared by under 0.003% of public web pages.[8] UTF-8, by comparison, accounts for over 98% of all web pages.[9] The Web Hypertext Application Technology Working Group (WHATWG) considers UTF-8 "the mandatory encoding for all [text]" and states that, for security reasons, browser applications should not use UTF-16.[10]
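
The ASCII incompatibility can be seen by comparing byte sequences. The following is a small sketch using Python's built-in codecs; the sample string and variable names are arbitrary choices for illustration, and little-endian byte order without a byte order mark is assumed.

```python
text = "Hi"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

print(ascii_bytes.hex(" "))  # 48 69
print(utf8_bytes.hex(" "))   # 48 69
print(utf16_bytes.hex(" "))  # 48 00 69 00
```

For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes, whereas UTF-16 uses two bytes per character, one of which is zero, so software that expects ASCII-compatible bytes cannot read it unchanged.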

The variable-length nature of UTF-16, combined with the fact that most characters in common use fit in a single code unit (so the two-unit case is rarely tested), has led to many bugs in software, including in Windows itself.[11] The usual solution is to adopt UTF-8, as most software has done, including (partially) Windows itself, Java, and JavaScript.
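
A minimal Python sketch of this class of bug follows. All function names are hypothetical, and the code is not a description of how Windows or any other system is implemented; it only shows how treating every 16-bit code unit as one character goes wrong for text outside the first 65,536 code points.

```python
def utf16_length(text: str) -> int:
    """Number of 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def naive_backspace(units: bytes) -> bytes:
    """Buggy: drops the last code unit, ignoring surrogate pairs."""
    return units[:-2]


def safe_backspace(units: bytes) -> bytes:
    """Drops the last code point, removing both halves of a trailing pair."""
    if len(units) >= 4:
        last = int.from_bytes(units[-2:], "little")
        prev = int.from_bytes(units[-4:-2], "little")
        if 0xDC00 <= last <= 0xDFFF and 0xD800 <= prev <= 0xDBFF:
            return units[:-4]
    return units[:-2]


if __name__ == "__main__":
    name = "a\U0001F600"                 # one BMP letter plus one emoji
    encoded = name.encode("utf-16-le")

    print(utf16_length(name))            # 3 code units for 2 code points

    try:
        naive_backspace(encoded).decode("utf-16-le")
    except UnicodeDecodeError:
        print("naive delete left a lone surrogate")

    print(safe_backspace(encoded).decode("utf-16-le"))  # a
```

In this sketch the naive version needs two calls to remove the emoji and leaves invalid data after the first, which matches the kind of "two presses on backspace" behaviour quoted in reference 11.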



  1. ^ "C.2 Encoding Forms in ISO/IEC 10646" (PDF). The Unicode Standard, version 6.0. Mountain View, CA: Unicode Consortium. February 2011. p. 573. ISBN 978-1-936213-01-6. [...] the term UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard.
  2. ^ "FAQ: What is the difference between UCS-2 and UTF-16?". unicode.org. Archived from the original on 2003-08-18. Retrieved 2024-03-19. UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1 [...]
  3. ^ "What is UTF-16?". The Unicode Consortium. Unicode, Inc. Retrieved 2023-01-07. UTF-16 uses a single 16-bit code unit to encode over 60,000 of the most common characters in Unicode
  4. ^ Lunde, Ken (2022-01-09). "2022 Top Ten List: Why Support Beyond-BMP Code Points?". Medium. Retrieved 2024-01-07. I first came up with the idea for this Top Ten List over 10 years ago, which was prompted by some environments that still supported only BMP code points. The idea, of course, was to motivate the developers of such environments to support code points beyond the BMP by providing an enumerated list of reasons to do so. And yes, there are still some environments that support only BMP code points, such as the VivaDesigner app.
  5. ^ Chad Selph (2012-11-08). "Adventures in Unicode SMS". Twilio. Archived from the original on 2015-09-08. Retrieved 2015-08-28.
  6. ^ "HTML Living Standard". w3.org. 2020-06-10. Archived from the original on 2020-09-08. Retrieved 2020-06-15. UTF-16 encodings are the only encodings that this specification needs to treat as not being ASCII-compatible encodings.
  7. ^ "Encoding Standard". encoding.spec.whatwg.org. Retrieved 2023-04-22.
  8. ^ "Usage Statistics of UTF-16 for Websites, September 2024". w3techs.com. Retrieved 2024-09-03.
  9. ^ "Usage Statistics of UTF-8 for Websites, September 2024". w3techs.com. Retrieved 2024-09-03.
  10. ^ "Encoding Standard". encoding.spec.whatwg.org. Retrieved 2018-10-22. The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding. [..] The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that UTF-8 is now the mandatory encoding for all text things on the Web.
  11. ^ "Should UTF-16 be considered harmful?". Software Engineering Stack Exchange. Retrieved 2024-11-20. File names editing in Window dialogs in broken (delete required 2 presses on backspace)