
UTF-8

a brilliant and beautiful hack

--- https://youtu.be/_mZBa3sqTrI?t=2407

--- https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

resource RFC 3629, UTF-8, a transformation format of ISO 10646 --- https://www.rfc-editor.org/rfc/rfc3629.html

resource UTF-8 history email chain, a trip to the past --- utf-8-history.txt --- https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

resource The WTF-8 Encoding, a superset of UTF-8 that can also encode unpaired surrogates --- The WTF-8 encoding.html --- https://wtf-8.codeberg.page/#unicode-scalar-value

properties

utf-8 is a superset of ascii: every 7-bit ascii byte is a valid utf-8 one-byte sequence that encodes the same character. furthermore, no 7-bit ascii byte appears within a utf-8 multi-byte sequence (every byte of such a sequence has its high bit set), so programs that look for ascii sentinels such as the \0 terminator and the / path separator do not misbehave when processing utf-8. for other nice properties of utf-8, see RFC 3629, §1 'Introduction' https://www.rfc-editor.org/rfc/rfc3629.html#section-1
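
a quick check of both properties, as a minimal sketch in python. the helper name `multibyte_bytes_avoid_ascii` and the sample strings are made up for illustration; only the built-in `str.encode` and `bytes.split` are used.

```python
# property 1: pure ascii text encodes to the same bytes under ascii and utf-8.
ascii_text = "/usr/bin"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")


def multibyte_bytes_avoid_ascii(text: str) -> bool:
    """Return True if every byte of every multi-byte sequence is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


# property 2: splitting the raw bytes on the ascii '/' never cuts a
# multi-byte sequence in half, so each piece still decodes cleanly.
sample = "naïve/café/日本語/🍁"
parts = [p.decode("utf-8") for p in sample.encode("utf-8").split(b"/")]
```

this is why byte-oriented code written for ascii (path splitting, \0-terminated strings) can usually carry utf-8 through untouched.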

representation

| encoding | bits | codepoints | description |
| --- | --- | --- | --- |
| 0b0yyyzzzz | 7 | U+0000--U+007F | one-byte encoding sequence |
| 0b110xxxyy 0b10yyzzzz | 11 | U+0080--U+07FF | two-byte encoding sequence |
| 0b1110wwww 0b10xxxxyy 0b10yyzzzz | 16 | U+0800--U+FFFF | three-byte encoding sequence |
| 0b11110uvv 0b10vvwwww 0b10xxxxyy 0b10yyzzzz | 21 | U+10000--U+10FFFF | four-byte encoding sequence |
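
the bit layouts above can be turned into a small encoder. this is a rough sketch, not a reference implementation: `encode_utf8` is a hypothetical name, and rejecting surrogates (U+D800--U+DFFF) follows RFC 3629, which is exactly the restriction that WTF-8 relaxes.

```python
def encode_utf8(codepoint: int) -> bytes:
    """Encode one codepoint by hand, following the layouts in the table."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint outside U+0000--U+10FFFF")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in utf-8")

    def continuation(shift: int) -> int:
        # 0b10xxxxxx: six payload bits taken from the codepoint at `shift`.
        return 0b10000000 | ((codepoint >> shift) & 0b111111)

    if codepoint <= 0x7F:        # 7 bits, one byte: 0b0yyyzzzz
        return bytes([codepoint])
    if codepoint <= 0x7FF:       # 11 bits, two bytes
        return bytes([0b11000000 | (codepoint >> 6), continuation(0)])
    if codepoint <= 0xFFFF:      # 16 bits, three bytes
        return bytes([0b11100000 | (codepoint >> 12),
                      continuation(6), continuation(0)])
    # 21 bits, four bytes: U+10000--U+10FFFF
    return bytes([0b11110000 | (codepoint >> 18),
                  continuation(12), continuation(6), continuation(0)])


print(encode_utf8(0x20AC).hex())  # euro sign, prints e282ac
```

comparing against the built-in encoder, e.g. `encode_utf8(cp) == chr(cp).encode("utf-8")` at the boundary codepoints of each row, is a cheap sanity check.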