
On 2019-08-03 6:18 p.m., D. Hugh Redelmeier via talk wrote:
https://news.ycombinator.com/item?id=20600195
There are so many hairy details!
UTF-8 gets a bit less coverage since it has fewer hairy details.
From this I learned that Java and JavaScript now have optimizations to use Latin-1 when they can. Normally they use UTF-16 (originally UCS-2). I take it that using Latin-1 is an opportunistic optimization hidden from the program. I don't think Python 3 uses this.
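[For what it's worth, CPython 3.3+ does use a similar trick: PEP 393's flexible string representation stores a string in 1, 2, or 4 bytes per character depending on its widest code point. A quick sketch, assuming a standard CPython build, that makes the hidden optimization visible through object sizes:]

```python
import sys

# Two strings of equal length; only the second contains a
# character outside the Latin-1 range (U+20AC, the euro sign).
latin1_text = "a" * 1000
wide_text = "\u20ac" * 1000

assert len(latin1_text) == len(wide_text)

# Under PEP 393 the Latin-1-representable string is stored at
# 1 byte per character, the other at 2+, so it occupies less memory.
assert sys.getsizeof(latin1_text) < sys.getsizeof(wide_text)
```

The program never sees this: indexing, slicing, and comparison behave identically for both strings.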
I think that Linux does this right and needs no such hack: just use UTF-8. Of course Java, JavaScript, Python 2, and Python 3 on Linux don't get it right.
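[One reason UTF-8 needs no such hack: ASCII text is already valid UTF-8, byte for byte, while UTF-16 doubles it. A small sketch using only the stdlib codecs:]

```python
text = "hello"

# UTF-8 is a strict superset of ASCII: the encoded bytes are identical,
# so ASCII-only data pays no penalty at all.
assert text.encode("utf-8") == text.encode("ascii") == b"hello"

# UTF-16 spends two bytes per ASCII character, plus a 2-byte
# byte-order mark at the front of the stream.
assert len(text.encode("utf-16")) == 2 * len(text) + 2
```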
I looked through it briefly, and a lot of it would depend on what the implementer thinks the language is being used for. There may be a very good reason for it, similar to SSO (small string optimization) in most STL implementations. Sure, that one program may get bitten, but it's one program. I would be curious to see measurements across a lot of different JavaScript, Python 2/3, and Java programs to see whether it's a good idea. It's the same with hardware heuristics or instructions for compiler backends: sure, 5% of programs may perform better, but what about the other 95%?

It is interesting, though, that UTF-16 is complex enough that UTF-8 or some other less complex encoding would be preferred, and that requiring the full encoding may cause overhead. The question is how much, and whether hardware is already handling it.

Nick
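On the "how much overhead" question, the size tradeoff is easy to measure. A rough sketch with hypothetical sample strings (stdlib only): UTF-8 wins for ASCII-heavy text, while UTF-16 wins for e.g. CJK text, since most CJK code points take 3 bytes in UTF-8 but 2 in UTF-16.

```python
samples = {
    "ascii": "hello, world" * 100,
    "cjk": "\u4f60\u597d\u4e16\u754c" * 100,  # "ni hao shi jie"
}

for name, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    # utf-16-le avoids counting the byte-order mark.
    utf16 = len(text.encode("utf-16-le"))
    print(f"{name}: utf-8={utf8} bytes, utf-16={utf16} bytes")

# ASCII text: 1 byte/char in UTF-8 vs. 2 in UTF-16.
# CJK text:   3 bytes/char in UTF-8 vs. 2 in UTF-16.
```

So neither encoding dominates; it depends entirely on the text the program actually handles, which is presumably why the VM-level heuristics exist.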