string representation is tricky

In some recent context, probably GTALUG-related, that UTF-16 is dumb. That it is the worst of two worlds: that not all characters could be represented as a single 16-bit value, that byte-order issues arise in byte streams of UTF-16, that UTF-16 is wasteful for representing ASCII. Java, Python before 3, Javascript, Microsoft's C and C++, the Jolliete filesystem, the NT File system, and many other things use UTF-8. Dumb. It represents a frozen mistake. Python 3, if I understand correctly, represents strings as arrays of UTF-32 characters. That too is dumb. It is wasteful. The driving justification is that it is thought that indexing an array of characters is an important and frequent operation and it should be O(1) in cost. An array of UTF-8 bytes can be indexed cheaply, but that does not get you to the n'th character. Indexing the character would cost O(n). But it turns out that indexing an array of characters is not a very useful operation. Iterating through the characters of a string is common, but that is a different thing and is cheap to do with UTF-8. I've just read an old blog post that is quite good at showing me even more odd things about UNICODE that I hadn't internalized. <https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/> - UNICODE has code points, not characters. - "character" isn't a well defined concept. - a glyph might be made up of a sequence of codepoints combined. - a glyph might have several representations as a sequence of codepoints. - there may not be a canonical form for a glyph. - collating sequence is tough - fixed-width terminals cannot reasonably handle UNICODE - there is a well reasoned argument why indexing by codepoint is not important There's more. I recommend reading it and some of the things it references. Beware: I spent a lot of time following references to references. Example: => <https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/> => <https://en.wikipedia.org/wiki/Ogham> => <https://en.wikipedia.org/wiki/Nimrod> => <https://en.wikipedia.org/wiki/Josephus> Luckilly, the code I write only has to understand ASCII. It has to be able to pass through other code points. UTF-8 is quite good for this.

"Executive Summary" would be helpful, Hugh. -- William Park <opengeometry@yahoo.ca> On Tue, Apr 16, 2019 at 10:36:41PM -0400, D. Hugh Redelmeier via talk wrote:
> [...]

On 2019-04-16 10:36 p.m., D. Hugh Redelmeier via talk wrote:
> Java, Python before 3, JavaScript, Microsoft's C and C++, the Joliet
> filesystem, the NT file system, and many other things use UTF-8.
You meant to say UTF-16?

Collation is difficult in anything but the simplistic "ASCIIbetical" case. People expect natural sort orders now, with '10' coming after '9' and case being of lesser importance. (A sketch after this message shows one way to do that.) Once you get outside English*, things get much more delightful. In Welsh, for instance, 'ff' and 'll' sort as letters distinct from 'f' and 'l', but an initial 'ng' sorts as a 'g', as it's merely an inflected form.

Capitalization is a whole different horror and is left as an exercise for the reader. Suffice to say, an initial 'ff' (as in the rare Welsh/English surnames ffrench and ffinch) is never capitalized.

Stewart

*: difficult, because we assimilate everything, accents and all.
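A minimal sketch of the natural-sort point, in Python (the language is my choice for illustration, and natural_key is a made-up name): split each string into digit and non-digit runs so that numeric runs compare by value, and casefold the text runs so case matters less. The Welsh digraph rules would need a real collation library (e.g. ICU) and are not attempted here.

    import re

    def natural_key(s: str):
        """Sort key: digit runs compare numerically, text case-insensitively."""
        # re.split with a capturing group keeps the digit runs; the pieces
        # alternate non-digit / digit, so like types are always compared.
        return [int(run) if run.isdigit() else run.casefold()
                for run in re.split(r"(\d+)", s)]

    names = ["file10", "file9", "File2", "file1"]
    print(sorted(names))                   # ASCIIbetical: File2, file1, file10, file9
    print(sorted(names, key=natural_key))  # natural: file1, File2, file9, file10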

| From: Stewart C. Russell via talk <talk@gtalug.org>

| On 2019-04-16 10:36 p.m., D. Hugh Redelmeier via talk wrote:
| >
| > Java, Python before 3, JavaScript, Microsoft's C and C++, the Joliet
| > filesystem, the NT file system, and many other things use UTF-8.
|
| You meant to say UTF-16?

Yes. Thanks!

[the rest is for orthographic nerds only]

| Collation is difficult

Yes. And even just string equality. And there are security implications here. (A sketch after this message shows the equality trap.)

| Once you get outside English*,
| things get much more delightful.

| *: difficult, because we assimilate everything, accents and all.

Including Scots, accent and all :-)

| In Welsh, for instance, 'ff' and 'll'
| sort as letters distinct from 'f' and 'l', but an initial 'ng' sorts as a
| 'g', as it's merely an inflected form. Capitalization is a whole
| different horror and is left as an exercise for the reader. Suffice to
| say, an initial 'ff' (as in the rare Welsh/English surnames ffrench and
| ffinch) is never capitalized.

And that's not all ffolkes! But Jasper Fforde, apparently.

As far as I know, the idea of upper case doesn't apply to most languages. Of course other languages have distinctions that we're not used to. Think of all the forms of each letter in Arabic.

One Unicode surprise: it has a capital scharfes S (sharp S, ß). Wikipedia says:

    In 2017, the Council for German Orthography ultimately adopted capital
    ß (ẞ) into German orthography, ending a long orthographic debate.[4]

<https://en.wikipedia.org/wiki/%C3%9F>

In English, certain "s" letters were written in a way that looks like an f to us (but the crossbar is missing or different). I remember thinking "King Charles the Fecond" was a witty pun (Spring Thaw, 1967). This seems to be related to the scharfes S.

<https://en.wikipedia.org/wiki/Long_s>

Have a look at the contrasting Britannica pages. A Google search for "charles the fecond" gets me lots of books.google.* hits for books that have been OCRed incorrectly: the long s has been taken as an f.
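A minimal Python sketch of the string-equality trap mentioned above (Python again only for illustration): visually identical strings can be different code-point sequences, which is exactly where the security implications come from (think look-alike identifiers). unicodedata.normalize from the standard library is the usual remedy, and the scharfes S shows the case-mapping oddity.

    import unicodedata

    cafe_nfc = "caf\u00E9"    # 'café' with precomposed U+00E9
    cafe_nfd = "cafe\u0301"   # 'café' as 'e' + U+0301 COMBINING ACUTE ACCENT

    print(cafe_nfc == cafe_nfd)  # False: same glyphs, different code points
    print(unicodedata.normalize("NFC", cafe_nfc) ==
          unicodedata.normalize("NFC", cafe_nfd))  # True once normalized

    # The scharfes S: traditional uppercasing doubles it,
    print("straße".upper())      # 'STRASSE'
    # even though a capital form U+1E9E exists:
    print("\u1E9E")              # 'ẞ'
    # casefold() is the right tool for caseless matching:
    print("STRASSE".casefold() == "straße".casefold())  # True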
participants (3):
- D. Hugh Redelmeier
- Stewart C. Russell
- William Park