
In some recent context, probably GTALUG-related, I argued that UTF-16 is dumb. That it is the worst of both worlds: not all characters can be represented as a single 16-bit value, byte-order issues arise in byte streams of UTF-16, and UTF-16 is wasteful for representing ASCII. Java, Python before 3, Javascript, Microsoft's C and C++, the Joliet filesystem, the NT file system, and many other things use UTF-16. Dumb. It represents a frozen mistake. (There's a small Python sketch of these three complaints at the end of this post.)

Python 3, if I understand correctly, represents strings as arrays of UTF-32 characters. That too is dumb. It is wasteful. The driving justification is the belief that indexing into an array of characters is an important and frequent operation, and that it should cost O(1). An array of UTF-8 bytes can be indexed cheaply, but that does not get you to the n'th character; finding the n'th character costs O(n). But it turns out that indexing an array of characters is not a very useful operation. Iterating through the characters of a string is common, but that is a different thing and is cheap to do with UTF-8 (also sketched below).

I've just read an old blog post that is quite good at showing me even more odd things about UNICODE that I hadn't internalized.

<https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/>

- UNICODE has code points, not characters.
- "character" isn't a well defined concept.
- a glyph might be made up of a sequence of codepoints combined.
- a glyph might have several representations as a sequence of codepoints.
- there may not be a canonical form for a glyph.
- collating sequence is tough.
- fixed-width terminals cannot reasonably handle UNICODE.
- there is a well reasoned argument why indexing by codepoint is not important.

A couple of the code-point-versus-glyph points are sketched below as well. There's more. I recommend reading it and some of the things it references. Beware: I spent a lot of time following references to references. Example:

=> <https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/>
=> <https://en.wikipedia.org/wiki/Ogham>
=> <https://en.wikipedia.org/wiki/Nimrod>
=> <https://en.wikipedia.org/wiki/Josephus>

Luckily, the code I write only has to understand ASCII. It has to be able to pass through other code points. UTF-8 is quite good for this, and the last sketch below shows why.
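
Here's that first sketch: the three UTF-16 complaints, demonstrated with nothing but Python's standard codecs. (The exact byte-order-mark bytes shown depend on the machine's byte order.)

```python
# A quick look at the three UTF-16 complaints, using Python's built-in codecs.

s = "hello \U0001F600"              # "hello " plus an emoji outside the BMP

# 1. Not every character fits in one 16-bit unit: the emoji becomes a
#    surrogate pair, so 7 code points turn into 8 UTF-16 code units.
utf16 = s.encode("utf-16-le")
print(len(s), "code points ->", len(utf16) // 2, "16-bit units")   # 7 -> 8

# 2. Byte order matters: the same text gives two different byte streams,
#    so a plain "utf-16" encode has to prepend a byte-order mark.
print(s.encode("utf-16-le")[:4])    # b'h\x00e\x00'
print(s.encode("utf-16-be")[:4])    # b'\x00h\x00e'
print(s.encode("utf-16")[:2])       # the BOM: b'\xff\xfe' on a little-endian machine

# 3. ASCII text doubles in size.
print(len("just ascii".encode("utf-8")))      # 10 bytes
print(len("just ascii".encode("utf-16-le")))  # 20 bytes
```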
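
The indexing sketch. The two helper functions are mine, written only for illustration, not from any library: finding the n'th code point in UTF-8 means scanning from the start, which is O(n), while iterating over all of them is one cheap linear pass.

```python
def nth_codepoint(data: bytes, n: int) -> str:
    """Return the n'th code point of UTF-8 `data` (0-based).

    UTF-8 continuation bytes look like 0b10xxxxxx, so counting the
    bytes that are NOT continuations counts code points.  There is no
    way to jump straight to code point n: we must scan, so this is O(n).
    """
    count = -1
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:            # start of a new code point
            count += 1
            if count == n:
                j = i + 1               # find where this code point ends
                while j < len(data) and data[j] & 0xC0 == 0x80:
                    j += 1
                return data[i:j].decode("utf-8")
    raise IndexError(n)

def iterate_codepoints(data: bytes):
    """Iterating, by contrast, is a single linear pass over the bytes."""
    start = None
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:
            if start is not None:
                yield data[start:i].decode("utf-8")
            start = i
    if start is not None:
        yield data[start:].decode("utf-8")

text = "naïve café".encode("utf-8")
print(nth_codepoint(text, 2))            # 'ï'
print(list(iterate_codepoints(text)))    # ['n', 'a', 'ï', 'v', 'e', ' ', 'c', 'a', 'f', 'é']
```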
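
The code-point-versus-glyph sketch, using Python's unicodedata module: the same glyph can have two different code point sequences (normalization reconciles those), and some glyphs are built from several code points that normalization cannot collapse into one.

```python
import unicodedata

# One glyph, two representations: precomposed é vs. e + combining accent.
composed   = "\u00e9"          # é as a single code point
decomposed = "e\u0301"         # 'e' followed by COMBINING ACUTE ACCENT
print(composed == decomposed)                  # False
print(len(composed), len(decomposed))          # 1 2
# NFC normalization picks the canonical composed form, so they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True

# One glyph, several code points: a "family" emoji is four emoji joined
# with ZERO WIDTH JOINERs, and normalization leaves it as seven code points.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family))                                # 7
print(len(unicodedata.normalize("NFC", family)))  # still 7
```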
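
And the last sketch, the reason UTF-8 suits ASCII-only code that passes everything else through: every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte that looks like an ASCII comma really is a comma.

```python
# UTF-8 never reuses bytes below 0x80 inside a multi-byte sequence, so
# byte-level code that only understands ASCII can carve up UTF-8 safely
# and pass the non-ASCII pieces through untouched.
line = "poutine,smörgåsbord,寿司".encode("utf-8")

fields = line.split(b",")            # byte-level split on an ASCII comma
print([f.decode("utf-8") for f in fields])
# ['poutine', 'smörgåsbord', '寿司']

# The same trick on UTF-16 bytes would be wrong: the byte 0x2C can turn
# up as half of a 16-bit code unit that has nothing to do with a comma.
```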