Language(s) | International |
---|---|
Standard | Unicode Standard |
Classification | Unicode Transformation Format, variable-width encoding |
Extends | UCS-2 |
Transforms / Encodes | ISO/IEC 10646 (Unicode) |
UTF-16 (16-bit Unicode Transformation Format) is a character encoding method capable of encoding 1,112,064 code points of Unicode.[a] The encoding is variable-length as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding now known as UCS-2 (for 2-byte Universal Character Set),[1][2] once it became clear that more than 216 (65,536) code points were needed,[3] including most emoji and important CJK characters such as for personal and place names.[4]
UTF-16 is used by systems such as the Microsoft Windows API, the Java programming language and JavaScript/ECMAScript. It is also sometimes used for plain text and word-processing data files on Microsoft Windows. It is used by more modern implementations of SMS.[5]
UTF-16 is the only encoding (still) allowed on the web that is incompatible with 8-bit ASCII.[6][b] However it has never gained popularity on the web, where it is declared by under 0.003% of public web pages.[8] UTF-8, by comparison, accounts for over 98% of all web pages.[9] The Web Hypertext Application Technology Working Group (WHATWG) considers UTF-8 "the mandatory encoding for all [text]" and that for security reasons browser applications should not use UTF-16.[10]
The variable length character of UTF-16, combined with the fact that most characters are not variable length (so variable length is rarely tested), has led to many bugs in software, including in Windows itself,[11] the solution is usually adopting UTF-8, as most software has done including (partially) Windows itself and Java and JavaScript.
Cite error: There are <ref group=lower-alpha>
tags or {{efn}}
templates on this page, but the references will not show without a {{reflist|group=lower-alpha}}
template or {{notelist}}
template (see the help page).
[...] the term UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard.
UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1 [...]
UTF-16 uses a single 16-bit code unit to encode over 60,000 of the most common characters in Unicode
I first came up with the idea for this Top Ten List over 10 years ago, which was prompted by some environments that still supported only BMP code points. The idea, of course, was to motivate the developers of such environments to support code points beyond the BMP by providing an enumerated list of reasons to do so. And yes, there are still some environments that support only BMP code points, such as the VivaDesigner app.
UTF-16 encodings are the only encodings that this specification needs to treat as not being ASCII-compatible encodings.
The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding. [..] The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that UTF-8 is now the mandatory encoding for all text things on the Web.
File names editing in Window dialogs in broken (delete required 2 presses on backspace)