International Components for Unicode

Developer(s): Unicode Consortium
Initial release: 1999
Stable release: 75.1[1] / 16 April 2024
Written in: C/C++ (C++11) and Java 8+
Operating system: Cross-platform
Type: Libraries for Unicode and internationalization
License: Unicode License
Website: icu.unicode.org

International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and is sponsored, supported, and used by IBM and many other companies.[2] ICU has been included as a standard component with Microsoft Windows since Windows 10 version 1703.[3]

ICU provides the following services: Unicode text handling, full character properties, and character set conversions; Unicode regular expressions; full Unicode sets; character, word, and line boundaries; language-sensitive collation and searching; normalization, upper and lowercase conversion, and script transliterations; comprehensive locale data and resource bundle architecture via the Common Locale Data Repository (CLDR); multiple calendars and time zones; and rule-based formatting and parsing of dates, times, numbers, currencies, and messages. ICU historically provided a complex text layout service for Arabic, Hebrew, Indic, and Thai, but it was deprecated in version 54 and completely removed in version 58 in favor of HarfBuzz.[4]
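Three of the services listed above (language-sensitive collation, word boundaries, and normalization) can be illustrated with a short Java sketch. To stay runnable without extra dependencies, it uses the JDK's own `java.text` classes as a stand-in rather than ICU itself; ICU4J offers similarly named classes under `com.ibm.icu.text`, but its locale data and results may differ from the JDK's. The class name `TextServicesSketch` and its helper methods are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;
import java.text.Collator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TextServicesSketch {

    // Language-sensitive collation: order strings by a locale's rules
    // instead of by raw code unit values.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }

    // Word boundaries: split text at the boundaries reported by a
    // BreakIterator and keep only pieces that start with a letter or digit.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    // Normalization: compose "e" + combining acute accent into one code point.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(sortForLocale(Arrays.asList("cherry", "Banana", "apple"), Locale.ENGLISH));
        System.out.println(words("Hello, world!", Locale.ENGLISH));
        System.out.println(toNfc("e\u0301").length());
    }
}
```

A plain `String` comparison would place "Banana" before "apple" because uppercase letters have lower code unit values; the collator compares base letters first, which is the point of locale-aware sorting.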

ICU provides more extensive internationalization facilities than the standard libraries for C and C++. ICU 75, released in April 2024, requires C++17 (up from C++11) or C11 (up from C99), depending on which language is used. ICU has historically used UTF-16; the Java library still uses only UTF-16, while the C/C++ library also supports UTF-8,[5][6] including the correct handling of "illegal UTF-8".[7]
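What handling of ill-formed UTF-8 can look like in practice is sketched below. This is not ICU code: it uses the JDK's `java.nio.charset` decoder as a stand-in to show two common policies, substituting U+FFFD for bad input or rejecting the input outright. ICU's exact substitution behavior is not reproduced here, and the class name `Utf8Sketch` and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Sketch {

    // Lenient policy: malformed bytes are replaced with U+FFFD.
    static String decodeLenient(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Strict policy: malformed input raises CharacterCodingException.
    static String decodeStrict(byte[] utf8) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(utf8))
                .toString();
    }

    // Convenience check built on the strict decoder.
    static boolean isWellFormed(byte[] utf8) {
        try {
            decodeStrict(utf8);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {0x41, (byte) 0xFF, 0x42};
        System.out.println(isWellFormed(bad));
        System.out.println(decodeLenient(bad).equals("A\uFFFDB"));

        // The same text needs a different number of units in each encoding:
        // "é" is one UTF-16 code unit but two UTF-8 bytes.
        String s = "\u00e9";
        System.out.println(s.length() + " vs " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The lenient path suits display of untrusted text, while the strict path suits validation at a system boundary where bad input should be refused.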

ICU 73.2 includes significant changes for GB18030-2022 compliance, i.e. for Chinese (the updated GB18030 Unicode Transformation Format standard is slightly incompatible with its predecessor). It has "a modified character conversion table, mapping some GB18030 characters to Unicode characters that were encoded after GB18030-2005", along with a number of other changes such as improved Japanese and Korean short-text line breaking; and in "English, the name “Türkiye” is now used for the country instead of “Turkey” (the alternate spelling is also available in the data)."[8]

ICU 74 "updates to Unicode 15.1, including new characters, emoji, security mechanisms, and corresponding APIs and implementations. [..] ICU 74 and CLDR 44 are major releases, including a new version of Unicode and major locale data improvements."[9] Among the many changes are person name formatting, improved language support (e.g. for Low German), and a new spoof checker API that follows Unicode 15.1.0 UTS #39: Unicode Security Mechanisms.

  1. "Release ICU 75.1 · unicode-org/icu". Retrieved 2024-04-21.
  2. "ICU - International Components for Unicode". site.icu-project.org. Archived from the original on 2021-08-27. Retrieved 2011-11-14.
  3. Chen, Raymond (2021-05-27). "How can I convert between IANA time zones and Windows registry-based time zones?". The Old New Thing. Microsoft.
  4. "Layout Engine - ICU User Guide". userguide.icu-project.org.
  5. Reference named "UTF-8" is cited in the text but not defined in the source.
  6. "UTF-8 - ICU User Guide". userguide.icu-project.org. Retrieved 2018-04-03.
  7. "#13311 (change illegal-UTF-8 handling to Unicode "best practice")". bugs.icu-project.org. Retrieved 2018-04-03.
  8. "ICU - International Components for Unicode - ICU 73". icu.unicode.org. Retrieved 2023-09-24.
  9. "ICU - International Components for Unicode - ICU 74". icu.unicode.org. Retrieved 2023-11-29.