
On 2019-08-03 6:18 p.m., D. Hugh Redelmeier via talk wrote:
https://news.ycombinator.com/item?id=20600195
There are so many hairy details!
UTF-8 gets a bit less coverage since it has fewer hairy details.
From this I learned that Java and JavaScript now have optimizations to use Latin-1 when they can. Normally they use UTF-16 (originally UCS-2). I take it that using Latin-1 is an opportunistic optimization hidden from the program. I don't think Python 3 uses this.
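[For what it's worth, CPython 3.3+ does use a similar trick: PEP 393's flexible string representation stores a string in 1, 2, or 4 bytes per character depending on its widest code point. A quick sketch, assuming a standard CPython build, that makes the hidden optimization visible through object sizes:]

```python
import sys

# Two strings of equal length; only the second contains a
# character outside the Latin-1 range (U+20AC, the euro sign).
latin1_text = "a" * 1000
wide_text = "\u20ac" * 1000

assert len(latin1_text) == len(wide_text)

# Under PEP 393 the Latin-1-representable string is stored at
# 1 byte per character, the other at 2+, so it occupies less memory.
assert sys.getsizeof(latin1_text) < sys.getsizeof(wide_text)
```

The program never sees this: indexing, slicing, and comparison behave identically for both strings.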
I think that Linux does this right and needs no such hack: just use UTF-8. Of course Java, JavaScript, Python 2, and Python 3 on Linux don't get it right.
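[One reason UTF-8 needs no such hack: ASCII text is already valid UTF-8, byte for byte, while UTF-16 doubles it. A small sketch using only the stdlib codecs:]

```python
text = "hello"

# UTF-8 is a strict superset of ASCII: the encoded bytes are identical,
# so ASCII-only data pays no penalty at all.
assert text.encode("utf-8") == text.encode("ascii") == b"hello"

# UTF-16 spends two bytes per ASCII character, plus a 2-byte
# byte-order mark at the front of the stream.
assert len(text.encode("utf-16")) == 2 * len(text) + 2
```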
I looked through it briefly, and a lot of it would depend on what the implementer thinks the language is being used for. There may be a very good reason for it, similar to SSO (small string optimization) in most STL implementations. Sure, that one program may get bitten, but it's one program. I would be curious to see measurements across a lot of different JavaScript, Python 2/3, and Java programs to see whether it's a good idea. It's the same with hardware heuristics or instructions for compiler backends: sure, 5% of programs may perform better, but what about the other 95%?

It is interesting, though, that UTF-16 is complex enough that UTF-8 or some other less complex encoding would be preferred, and that requiring the full encoding may cause overhead. The question is how much, and whether hardware is already handling it.

Nick
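On the "how much overhead" question, the size tradeoff is easy to measure. A rough sketch with hypothetical sample strings (stdlib only): UTF-8 wins for ASCII-heavy text, while UTF-16 wins for e.g. CJK text, since most CJK code points take 3 bytes in UTF-8 but 2 in UTF-16.

```python
samples = {
    "ascii": "hello, world" * 100,
    "cjk": "\u4f60\u597d\u4e16\u754c" * 100,  # "ni hao shi jie"
}

for name, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    # utf-16-le avoids counting the byte-order mark.
    utf16 = len(text.encode("utf-16-le"))
    print(f"{name}: utf-8={utf8} bytes, utf-16={utf16} bytes")

# ASCII text: 1 byte/char in UTF-8 vs. 2 in UTF-16.
# CJK text:   3 bytes/char in UTF-8 vs. 2 in UTF-16.
```

So neither encoding dominates; it depends entirely on the text the program actually handles, which is presumably why the VM-level heuristics exist.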