interesting article and comments about UCS-2, UTF-16, UTF-8

https://news.ycombinator.com/item?id=20600195

There are so many hairy details! UTF-8 gets a bit less coverage since it has fewer hairy details.

From this I learned that Java and JavaScript now have optimizations to use Latin-1 when they can. Normally they use UTF-16 (originally UCS-2). I take it that using Latin-1 is an opportunistic optimization hidden from the program. I don't think Python 3 uses this.

I think that Linux does this right and needs no such hack: just use UTF-8. Of course Java, JavaScript, Python 2, and Python 3 on Linux don't get it right.
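A minimal sketch of the hairy details in question, in plain Python 3 (standard library only, nothing assumed beyond it): UTF-16 comes in two byte orders, and any character above U+FFFF becomes a surrogate pair, so code-unit counts and character counts disagree.

    # Why UTF-16 is "hairy".
    s = "héllo \U0001F600"  # Latin-1 text plus an emoji outside the BMP

    # UTF-16 needs a byte-order choice; the same text has two encodings.
    print(s.encode("utf-16-le") != s.encode("utf-16-be"))  # True

    # One character above U+FFFF is one code point but two UTF-16
    # code units (a surrogate pair) -- a classic source of string bugs.
    print(len("\U0001F600"))                           # 1 code point
    print(len("\U0001F600".encode("utf-16-le")) // 2)  # 2 code units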

On 2019-08-03 6:18 p.m., D. Hugh Redelmeier via talk wrote:
https://news.ycombinator.com/item?id=20600195
There are so many hairy details!
UTF-8 gets a bit less coverage since it has fewer hairy details.
From this I learned that Java and JavaScript now have optimizations to use Latin-1 when they can. Normally they use UTF-16 (originally UCS-2). I take it that using Latin-1 is an opportunistic optimization hidden from the program. I don't think Python 3 uses this.
I think that Linux does this right and needs no such hack: just use UTF-8. Of course Java, JavaScript, Python 2, and Python 3 on Linux don't get it right.
I looked through it briefly, and a lot of it depends on what the implementers think the language is being used for. There may be a very good reason for the optimization, similar to SSO (small string optimization) in most C++ STL implementations; see the sketch below. Sure, that one program may get bitten by it, but it's one program. I would be curious to see measurements across a lot of different JavaScript, Python 2/3, and Java programs to see whether it's a good idea. It's the same with hardware heuristics or special instructions for compiler backends: sure, 5% of programs may perform better, but what about the other 95%?

It's interesting, though, that UTF-16 is complex enough that UTF-8 or another less complex encoding would be preferred, and that handling the full encoding may cause overhead. The question is how much, and whether hardware is already handling it.

Nick
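To make the SSO analogy concrete, here is a toy sketch in Python (a hypothetical CompactString class, not how Java or V8 actually implement it) that opportunistically stores one byte per character when the text fits in Latin-1, falls back to a wider representation otherwise, and never lets callers see which path was taken:

    class CompactString:
        """Toy version of the opportunistic narrow-storage trick:
        1 byte per character when the text fits in Latin-1, otherwise
        4 bytes per character (UTF-32). Callers can't tell which."""

        def __init__(self, text):
            try:
                self._buf = text.encode("latin-1")    # narrow fast path
                self._width = 1
            except UnicodeEncodeError:
                self._buf = text.encode("utf-32-le")  # wide fallback
                self._width = 4

        def __len__(self):
            return len(self._buf) // self._width

        def char_at(self, i):
            chunk = self._buf[i * self._width:(i + 1) * self._width]
            return chunk.decode("latin-1" if self._width == 1
                                else "utf-32-le")

    print(len(CompactString("café")))                       # 4
    print(CompactString("café").char_at(3))                 # é
    print(len(CompactString("café \U0001F600")))            # 6
    print(CompactString("café \U0001F600").char_at(5))      # the emoji

The measurement that would settle it is the one asked for above: how often real programs stay on the narrow path.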

On Sat, Aug 03, 2019 at 06:18:51PM -0400, D. Hugh Redelmeier via talk wrote:
https://news.ycombinator.com/item?id=20600195
There are so many hairy details!
UTF-8 gets a bit less coverage since it has fewer hairy details.
From this I learned that Java and JavaScript now have optimizations to use Latin-1 when they can. Normally they use UTF-16 (originally UCS-2). I take it that using Latin-1 is an opportunistic optimization hidden from the program. I don't think Python 3 uses this.
I think that Linux does this right and needs no such hack: just use UTF-8. Of course Java, JavaScript, Python 2, and Python 3 on Linux don't get it right.
UTF-8 just makes much more sense: backwards compatible with ASCII, no endianness issues, stupidly simple. 16-bit characters are just all sorts of pain. :) Of course, given who invented UTF-8, it is no wonder it is brilliant and simple.

--
Len Sorensen
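Those three claims are easy to check from a Python 3 prompt (standard library only). This sketch shows ASCII passing through UTF-8 unchanged, the byte-order problem UTF-16 has and UTF-8 doesn't, and the fixed bit patterns that keep UTF-8 decoding simple:

    # 1. Backwards compatible with ASCII: identical bytes.
    print("hello".encode("utf-8") == "hello".encode("ascii"))  # True

    # 2. Endianness: UTF-16 has two byte orders and needs a BOM to
    #    tell them apart; UTF-8 has exactly one form.
    print("hi".encode("utf-16-le"))  # b'h\x00i\x00'
    print("hi".encode("utf-16-be"))  # b'\x00h\x00i'
    print("hi".encode("utf-16"))     # BOM first, then native byte order
    print("hi".encode("utf-8"))      # b'hi', one form, no BOM needed

    # 3. Stupidly simple: continuation bytes always look like 10xxxxxx,
    #    so a decoder can resynchronize from any byte in the stream.
    print([format(b, "08b") for b in "é".encode("utf-8")])
    # ['11000011', '10101001']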