# Unicode Overview
This document goes over general concepts of Unicode and text rendering.
## **Breakdown**
This is a general overview of how text gets transformed from raw bytes into a
glyph on the screen. The document focuses on explaining niche concepts
while also pointing out some pitfalls to avoid.
Chrome deals with Unicode, so this document assumes text will be rendered as
Unicode characters. Chrome uses [ICU](https://icu.unicode.org/), an open-source
set of libraries that provides Unicode support for many different applications.
This doc will go over four stages:
1. Unicode codepoint encodings
2. Codepoints
3. Graphemes
4. Glyphs
## **Unicode codepoint encodings**
Binary and codepoint encoding are general computer science concepts with many
online resources available; for example:
- [Unicode Technical Reports](https://www.unicode.org/reports/)
Codepoint encoding is how software converts codepoints into assigned bytes,
while codepoint decoding transforms those bytes back into characters. The most
commonly used encodings are UTF-8 and UTF-16.
Imagine this as a mapping between binary and codepoints.
- The user inputs a string while specifying the codepoint encoding. (Note:
generally this falls back to the default encoding scheme.)
- The program transforms the string into binary and stores it.
- When the program needs to use the string again, it fetches the binary and
decodes it back into the string with the same encoding scheme, as sketched
below.
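To make the round trip concrete, here is a minimal sketch using ICU's
`icu::UnicodeString` (real ICU API, but illustrative rather than Chrome code):
```
#include <string>

#include <unicode/unistr.h>

int main() {
  // "Decode": interpret raw UTF-8 bytes as a sequence of codepoints.
  std::string bytes = "\xF0\x9F\x98\x94";  // U+1F614 encoded as UTF-8.
  icu::UnicodeString decoded = icu::UnicodeString::fromUTF8(bytes);

  // "Encode": convert the codepoints back into UTF-8 bytes for storage.
  std::string round_tripped;
  decoded.toUTF8String(round_tripped);  // round_tripped == bytes.
  return 0;
}
```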
The main difference between encoding schemas is how much memory is reserved for
each codepoint.
- `UTF-8` - A variable-length encoding scheme built on 8-bit code units. It
stores the first 128 codepoints (U+0000 through U+007F) in a single byte, but a
codepoint can take up to 4 bytes.
- `UTF-16` - A variable-length encoding scheme built on 16-bit code units.
Every codepoint takes one or two 16-bit units (2 or 4 bytes); codepoints in the
Basic Multilingual Plane fit in a single unit, so it generally uses less memory
than UTF-32.
- `UTF-32` - A fixed-width 32-bit encoding scheme that can represent every
codepoint in a single unit. The tradeoff is that it uses the most memory per
codepoint.
For example:
In UTF-8 the letter `A` is represented as the following:
```
Binary 01000001 => Codepoint (U+0041).
UTF-8:
41
std::string str = "\x41";
UTF-16:
0041
std::u16string str = u"\x41";
UTF-32:
00000041
std::u32string str = U"\U00000041";
```
For characters that require multiple bytes, like the treble clef `𝄞`
(U+1D11E), the encoding will change depending on the schema.
```
Binary
11110000 10011101 10000100 10011110
UTF-8:
F0 9D 84 9E
std::string str = "\xF0\x9D\x84\x9E";
UTF-16:
D834 DD1E
std::u16string str = u"\xD834\xDD1E";
UTF-32:
0001D11E
std::u32string str = U"\U0001D11E";
```
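To see those code-unit counts in practice, here is a minimal sketch
(illustrative, not Chrome code) that prints how many code units the treble clef
occupies in each encoding:
```
#include <iostream>
#include <string>

int main() {
  // U+1D11E (musical symbol G clef) in each encoding.
  std::string utf8 = "\xF0\x9D\x84\x9E";   // 4 8-bit code units.
  std::u16string utf16 = u"\xD834\xDD1E";  // 2 16-bit units (surrogate pair).
  std::u32string utf32 = U"\U0001D11E";    // 1 32-bit code unit.

  std::cout << utf8.size() << " "     // Prints 4.
            << utf16.size() << " "    // Prints 2.
            << utf32.size() << "\n";  // Prints 1.
  return 0;
}
```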
## **Codepoints**
A codepoint is a unique number that represents a character or some other piece
of textual information. Codepoints also have
[properties/attributes](https://unicode-org.github.io/icu/userguide/strings/properties.html)
that describe how they should be rendered (e.g. BiDi, Block, Script, etc.).
For example, `U+0041` is the codepoint that represents the letter `A`. Unicode
has a large library of codepoints covering characters from different languages,
symbols used in pronunciation, and even emojis.
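As an illustration, here is a minimal sketch that walks a UTF-8 string
codepoint by codepoint using ICU's `U8_NEXT` macro from `unicode/utf8.h`
(illustrative, not Chrome code):
```
#include <cstdint>
#include <cstdio>

#include <unicode/utf8.h>  // ICU's UTF-8 macros and UChar32.

int main() {
  const char utf8[] = "A\xF0\x9D\x84\x9E";  // "A" followed by U+1D11E.
  int32_t i = 0;
  int32_t length = sizeof(utf8) - 1;  // Exclude the trailing '\0'.
  while (i < length) {
    UChar32 c;
    U8_NEXT(utf8, i, length, c);  // Decodes one codepoint and advances i.
    if (c < 0) break;             // Negative means an ill-formed sequence.
    std::printf("U+%04X\n", static_cast<unsigned>(c));  // U+0041, U+1D11E.
  }
  return 0;
}
```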
Some codepoints can affect their surrounding characters.
For example, diacritical codepoints such as the "combining acute accent"
(`U+0301`) are used to append an accent to the preceding character.
```
โ—Œฬ
```
Some codepoints do not map to a displayed character.
ZWJ (`U+200D`), the zero-width joiner, is a codepoint that joins two
codepoints together. Left-To-Right Embedding (`U+202A`) is a codepoint that
forces text to be interpreted as left-to-right.
Variation selectors are another set of codepoints that only affect their
surrounding character. These codepoints change the presentation of the
preceding character. For example, emoji + `U+FE0E` selects the emoji's text
presentation, while emoji + `U+FE0F` selects its colored presentation. If you
do not specify a variation selector, the shaping engine will just pick the
default glyph in the font.
```
U+2708 maps to an airplane: โœˆ๏ธ
Adding a variation selector (U+FE0E or U+FE0F) will affect the way the emoji is
displayed
U+2708 U+FE0E = โœˆ๏ธŽ
U+2708 U+FE0F = โœˆ๏ธ
```
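As UTF-16 string literals, the two presentation variants differ only in the
trailing selector (a small illustrative sketch):
```
#include <string>

// U+2708 followed by a variation selector; two codepoints, one grapheme.
std::u16string text_style = u"\u2708\uFE0E";   // Monochrome text glyph.
std::u16string emoji_style = u"\u2708\uFE0F";  // Colored emoji glyph.
```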
## **Graphemes**
A grapheme is a sequence of one or multiple codepoints. For example, โ€œeโ€ and โ€œรฉโ€
are both graphemes. โ€œeโ€ is a single codepoint while โ€œรฉโ€ can be either a single
codepoint or multiple codepoints depending on how itโ€™s encoded.
```
é (U+00E9)
```
or
```
“e” + “◌́”
(U+0065) + (U+0301)
```
Emojis are another example of grapheme clusters that can be combinations of
multiple codepoints.
```
๐Ÿ‘จโ€โœˆ๏ธ is actually a combination of
๐Ÿ‘จ Man (U+1F468) +
Zero Width Joiner (U+200D) +
โœˆ๏ธ Airplane (U+2708)
```
**Note: Graphemes are not breakable!**
Because codepoints such as diacritics and joiners can combine multiple
codepoints into a single grapheme, many codepoints can make up one grapheme.
For example, a grapheme can consist of:
- 1x codepoint: a single codepoint
- 2x codepoints: base + diacritic
- 3x codepoints: codepoint + joiner + codepoint
- Nx codepoints: codepoint + joiner + codepoint + joiner + codepoint + etc.
Luckily we do not need to implement all of the various combinations or
understand what makes a valid grapheme. Chrome relies on the ICU library to
iterate through graphemes.
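For example, here is a minimal sketch of grapheme iteration using ICU's
`BreakIterator` directly (Chrome wraps this in its own helpers; the snippet is
illustrative, not Chrome code):
```
#include <cstdio>
#include <memory>

#include <unicode/brkiter.h>
#include <unicode/unistr.h>

int main() {
  // "a" + pilot emoji (4 codepoints) + "b": 6 codepoints, but 3 graphemes.
  icu::UnicodeString text = icu::UnicodeString::fromUTF8("a👨‍✈️b");

  UErrorCode status = U_ZERO_ERROR;
  std::unique_ptr<icu::BreakIterator> it(
      icu::BreakIterator::createCharacterInstance(icu::Locale::getUS(),
                                                  status));
  if (U_FAILURE(status)) return 1;
  it->setText(text);

  // Each call to next() lands on the next grapheme boundary: a | 👨‍✈️ | b.
  int32_t count = 0;
  for (int32_t end = it->next(); end != icu::BreakIterator::DONE;
       end = it->next()) {
    ++count;
  }
  std::printf("%d graphemes\n", count);  // Prints 3 with a recent ICU.
  return 0;
}
```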
## **Glyphs**
A glyph is a set of graphic primitives, painted to the screen, that represents
a grapheme. Fonts are pre-made mappings between characters and glyphs. The
pixels painted on the screen are based on the loaded font that maps graphemes
to glyphs. Fonts can be user controlled, so any type of image can be displayed
depending on what is mapped to the grapheme.
Note that there is no guaranteed 1:1 mapping between graphemes and glyphs; it
is more of an N:M relationship.
Translating graphemes to glyphs is a multi-step process that will be covered
here: (TODO - add link to RenderText doc).
## **Pitfalls:**
### ***Rule: Do not break up graphemes!***
Modifying an encoded string is hard and has more risk than you think.
If you are trying to truncate a string to a length of 3, you might try something
like:
```
std::string new_string = original_string.substr(0, 3);
```
But that is incorrect!
Let's imagine `original_string` is actually an emoji and you attempt to truncate
the string to a length of 3.
```
std::string original_string = "😔"; // Encoded as "\xF0\x9F\x98\x94", length = 4.
std::string new_string = original_string.substr(0, 3);
```
Well what's wrong here? You might think this is fine since there is only 1
displayed character in the emoji string, but that is incorrect!
`"๐Ÿ˜”" maps to "F0 9F 98 94"`
The length of the string is actually 4. Taking a substring of the emoji with a
length of 3 corrupts the data by cutting off the last byte of the codepoint:
`F0 9F 98` ~~`94`~~
This becomes a corrupted string which does not map to a valid Unicode character.
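One way to avoid this is to snap the cut point back to a codepoint boundary
before truncating. Here is a sketch using ICU's `U8_SET_CP_START` macro (the
helper name is made up for illustration; Chrome has its own string utilities):
```
#include <cstdint>
#include <string>

#include <unicode/utf8.h>  // ICU's U8_SET_CP_START macro.

// Truncates `input` to at most `max_bytes` without splitting a codepoint.
// Note: a full fix must also respect grapheme boundaries (see above), since
// cutting between a base and its joiner still breaks a grapheme.
std::string TruncateAtCodepoint(const std::string& input, int32_t max_bytes) {
  if (max_bytes >= static_cast<int32_t>(input.size()))
    return input;
  int32_t i = max_bytes;
  // Moves i back to the start of the codepoint it points into.
  U8_SET_CP_START(input.data(), 0, i);
  return input.substr(0, i);
}

// TruncateAtCodepoint("\xF0\x9F\x98\x94", 3) returns "" rather than a
// corrupted three-byte prefix of U+1F614.
```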
Developers might also attempt to highlight only a section of the string's
codepoints.
`SetColor(Yellow, Range(1...3));`
But this is also incorrect depending on which codepoints are highlighted.
Similar to the previous example, if the range covers only some of the
codepoints within a grapheme, this will break!
It is impossible to know what color to use for that glyph because the range
does not encompass the entire grapheme.
## **Recommendations:**
### Avoid using custom text modifications
The crux of this document is to highlight the difficulties of Unicode. Before
trying to add custom string modifiers, look at the `gfx::` namespace to see if
the functionality already exists. If it is not covered, please reach out to the
owners and consider adding the new functionality to `gfx::`.