string representation is tricky

In some recent context, probably GTALUG-related, that UTF-16 is dumb. That it is the worst of two worlds: that not all characters could be represented as a single 16-bit value, that byte-order issues arise in byte streams of UTF-16, that UTF-16 is wasteful for representing ASCII. Java, Python before 3, Javascript, Microsoft's C and C++, the Jolliete filesystem, the NT File system, and many other things use UTF-8. Dumb. It represents a frozen mistake. Python 3, if I understand correctly, represents strings as arrays of UTF-32 characters. That too is dumb. It is wasteful. The driving justification is that it is thought that indexing an array of characters is an important and frequent operation and it should be O(1) in cost. An array of UTF-8 bytes can be indexed cheaply, but that does not get you to the n'th character. Indexing the character would cost O(n). But it turns out that indexing an array of characters is not a very useful operation. Iterating through the characters of a string is common, but that is a different thing and is cheap to do with UTF-8. I've just read an old blog post that is quite good at showing me even more odd things about UNICODE that I hadn't internalized. <https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/> - UNICODE has code points, not characters. - "character" isn't a well defined concept. - a glyph might be made up of a sequence of codepoints combined. - a glyph might have several representations as a sequence of codepoints. - there may not be a canonical form for a glyph. - collating sequence is tough - fixed-width terminals cannot reasonably handle UNICODE - there is a well reasoned argument why indexing by codepoint is not important There's more. I recommend reading it and some of the things it references. Beware: I spent a lot of time following references to references. Example: => <https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/> => <https://en.wikipedia.org/wiki/Ogham> => <https://en.wikipedia.org/wiki/Nimrod> => <https://en.wikipedia.org/wiki/Josephus> Luckilly, the code I write only has to understand ASCII. It has to be able to pass through other code points. UTF-8 is quite good for this.

"Executive Summary" would be helpful, Hugh. -- William Park <opengeometry@yahoo.ca> On Tue, Apr 16, 2019 at 10:36:41PM -0400, D. Hugh Redelmeier via talk wrote:
> [...]

On 2019-04-16 10:36 p.m., D. Hugh Redelmeier via talk wrote:
> Java, Python before 3, JavaScript, Microsoft's C and C++, the Joliet
> filesystem, the NT file system, and many other things use UTF-8.
You meant to say UTF-16?

Collation is difficult in anything but the simplistic "ASCIIbetical" case. People expect natural sort orders now, with '10' coming after '9' and case being of lesser importance. (A sketch after this message shows one way to do that.) Once you get outside English*, things get much more delightful. In Welsh, for instance, 'ff' and 'll' sort as letters distinct from 'f' and 'l', but an initial 'ng' sorts as a 'g', as it's merely an inflected form.

Capitalization is a whole different horror and is left as an exercise for the reader. Suffice to say, an initial 'ff' (as in the rare Welsh/English surnames ffrench and ffinch) is never capitalized.

Stewart

*: difficult, because we assimilate everything, accents and all.
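A minimal sketch of the natural-sort point, in Python (the language is my choice for illustration, and natural_key is a made-up name): split each string into digit and non-digit runs so that numeric runs compare by value, and casefold the text runs so case matters less. The Welsh digraph rules would need a real collation library (e.g. ICU) and are not attempted here.

    import re

    def natural_key(s: str):
        """Sort key: digit runs compare numerically, text case-insensitively."""
        # re.split with a capturing group keeps the digit runs; the pieces
        # alternate non-digit / digit, so like types are always compared.
        return [int(run) if run.isdigit() else run.casefold()
                for run in re.split(r"(\d+)", s)]

    names = ["file10", "file9", "File2", "file1"]
    print(sorted(names))                   # ASCIIbetical: File2, file1, file10, file9
    print(sorted(names, key=natural_key))  # natural: file1, File2, file9, file10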

| From: Stewart C. Russell via talk <talk@gtalug.org>

| On 2019-04-16 10:36 p.m., D. Hugh Redelmeier via talk wrote:
| >
| > Java, Python before 3, JavaScript, Microsoft's C and C++, the Joliet
| > filesystem, the NT file system, and many other things use UTF-8.
|
| You meant to say UTF-16?

Yes. Thanks!

[the rest is for orthographic nerds only]

| Collation is difficult

Yes. And even just string equality. And there are security implications here. (A sketch after this message shows the equality trap.)

| Once you get outside English*,
| things get much more delightful.

| *: difficult, because we assimilate everything, accents and all.

Including Scots, accent and all :-)

| In Welsh, for instance, 'ff' and 'll'
| sort as letters distinct from 'f' and 'l', but an initial 'ng' sorts as a
| 'g', as it's merely an inflected form. Capitalization is a whole
| different horror and is left as an exercise for the reader. Suffice to
| say, an initial 'ff' (as in the rare Welsh/English surnames ffrench and
| ffinch) is never capitalized.

And that's not all ffolkes! But Jasper Fforde, apparently.

As far as I know, the idea of upper case doesn't apply to most languages. Of course other languages have distinctions that we're not used to. Think of all the forms of each letter in Arabic.

One Unicode surprise: it has a capital scharfes S (sharp S, ß). Wikipedia says:

    In 2017, the Council for German Orthography ultimately adopted capital
    ß (ẞ) into German orthography, ending a long orthographic debate.[4]

<https://en.wikipedia.org/wiki/%C3%9F>

In English, certain "s" letters were written in a way that looks like an f to us (but the crossbar is missing or different). I remember thinking "King Charles the Fecond" was a witty pun (Spring Thaw, 1967). This seems to be related to the scharfes S.

<https://en.wikipedia.org/wiki/Long_s>

Have a look at the contrasting Britannica pages. A Google search for "charles the fecond" gets me lots of books.google.* hits for books that have been OCRed incorrectly: the long s has been taken as an f.
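A minimal Python sketch of the string-equality trap mentioned above (Python again only for illustration): visually identical strings can be different code-point sequences, which is exactly where the security implications come from (think look-alike identifiers). unicodedata.normalize from the standard library is the usual remedy, and the scharfes S shows the case-mapping oddity.

    import unicodedata

    cafe_nfc = "caf\u00E9"    # 'café' with precomposed U+00E9
    cafe_nfd = "cafe\u0301"   # 'café' as 'e' + U+0301 COMBINING ACUTE ACCENT

    print(cafe_nfc == cafe_nfd)  # False: same glyphs, different code points
    print(unicodedata.normalize("NFC", cafe_nfc) ==
          unicodedata.normalize("NFC", cafe_nfd))  # True once normalized

    # The scharfes S: traditional uppercasing doubles it,
    print("straße".upper())      # 'STRASSE'
    # even though a capital form U+1E9E exists:
    print("\u1E9E")              # 'ẞ'
    # casefold() is the right tool for caseless matching:
    print("STRASSE".casefold() == "straße".casefold())  # True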
participants (3):
- D. Hugh Redelmeier
- Stewart C. Russell
- William Park