JavaScript has a Unicode problem

Published 20th October 2013 · tagged with JavaScript, Unicode

The way JavaScript handles Unicode is… surprising, to say the least. This write-up explains the pain points associated with Unicode in JavaScript, provides solutions for common problems, and explains how the ECMAScript 6 standard improves the situation.

Unicode basics

Before we take a closer look at JavaScript, let’s make sure we’re all on the same page when it comes to Unicode.

It’s easiest to think of Unicode as a database that maps any symbol you can think of to a number called its code point, and to a unique name. That way, it’s easy to refer to specific symbols without actually using the symbol itself. Examples:

Code points are usually formatted as hexadecimal numbers, zero-padded up to at least four digits, with a U+ prefix.

The possible code point values range from U+0000 to U+10FFFF. That’s over 1.1 million possible symbols. To keep things organised, Unicode divides this range of code points into 17 planes that consist of about 65 thousand code points each.

The first plane (U+0000 → U+FFFF) and is called the Basic Multilingual Plane or BMP, and it’s probably the most important one, as it contains all the most commonly used symbols. Most of the time you don’t need any code points outside of the BMP for text documents in English. Just like any other Unicode plane, it groups about 65 thousand symbols.

That leaves us about 1 million other code points (U+010000 → U+10FFFF) that live outside the BMP. The planes these code points belong to are called supplementary planes, or astral planes.

Astral code points are pretty easy to recognize: if you need more than 4 hexadecimal digits to represent the code point, it’s an astral code point.

Now that we have a basic understanding of Unicode, let’s see how it applies to JavaScript strings.

Escape sequences

You may have seen things like this before:

>> '\x41\x42\x43'
'ABC'

>> '\x61\x62\x63'
'abc'

These are called hexadecimal escape sequences. They consist of two hexadecimal digits that refer to the matching code point. For example, \x41 represents U+0041 LATIN CAPITAL LETTER A. These escape sequences can be used for code points in the range from U+0000 to U+00FF.

Also common is the following type of escape:

>> '\u0041\u0042\u0043'
'ABC'

>> 'I \u2661 JavaScript!'
'I ♡ JavaScript!'

These are called Unicode escape sequences. They consist of exactly 4 hexadecimal digits that represent a code point. For example, \u2661 represents U+2661 WHITE HEART SUIT. These escape sequences can be used for code points in the range from U+0000 to U+FFFF, i.e. the entire Basic Multilingual Plane.

But what about all the other planes — the astral planes? We need more than 4 hexadecimal digits to represent their code points… So how can we escape them?

In ECMAScript 6 this will be easy, since it introduces a new type of escape sequence: Unicode code point escapes. For example:

>> '\u{41}\u{42}\u{43}'
'ABC'

>> '\u{1F4A9}'
'💩' // U+1F4A9 PILE OF POO

Between the braces you can use up to six hexadecimal digits, which is enough to represent all Unicode code points. So, by using this type of escape sequence, you can easily escape any Unicode symbol based on its code point.

For backwards compatibility with ECMAScript 5 and older environments, the unfortunate solution is to use surrogate pairs:

>> '\uD83D\uDCA9'
'💩' // U+1F4A9 PILE OF POO

In that case, each escape represents the code point of a surrogate half. Two surrogate halves form a single astral symbol.

Note that the surrogate code points don’t look anything like the original code point. There are formulas to calculate the surrogates based on a given astral code point, and the other way around — to calculate the original astral code point based on its surrogate pair.

Using surrogate pairs, all astral code points (i.e. from U+010000 to U+10FFFF) can be represented… But the whole concept of using a single escape to represent BMP symbols, and two escapes for astral symbols, is confusing, and has lots of annoying consequences.

Counting symbols in a JavaScript string

Let’s say you want to count the number of symbols in a given string, for example. How would you go about it?

My first thought would probably be to simply use the length property.

>> 'A'.length // U+0041 LATIN CAPITAL LETTER A
1

>> 'A' == '\u0041'
true

>> 'B'.length // U+0042 LATIN CAPITAL LETTER B
1

>> 'B' == '\u0042'
true

In these examples, the length property of the string happens to reflect the number of characters. This makes sense: if we use escape sequences to represent the symbols, it’s obvious that we only need a single escape for each of these symbols. But this is not always the case! Here’s a slightly different example:

>> '𝐀'.length // U+1D400 MATHEMATICAL BOLD CAPITAL A
2

>> '𝐀' == '\uD835\uDC00'
true

>> '𝐁'.length // U+1D401 MATHEMATICAL BOLD CAPITAL B
2

>> '𝐁' == '\uD835\uDC01'
true

>> '💩'.length // U+1F4A9 PILE OF POO
2

>> '💩' == '\uD83D\uDCA9'
true

Internally, JavaScript represents astral symbols as surrogate pairs, and it exposes the separate surrogate halves as separate “characters”. If you represent the symbols using nothing but ECMAScript 5-compatible escape sequences, you’ll see that two escapes are needed for each astral symbol. This is confusing, because humans generally think in terms of Unicode symbols or graphemes instead.

Accounting for astral symbols

Getting back to the question: how to accurately count the number of symbols in a JavaScript string? The trick is to account for surrogate pairs properly, and only count each pair as a single symbol. You could use something like this:

var regexAstralSymbols = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

function countSymbols(string) {
	return string
		// Replace every surrogate pair with a BMP symbol.
		.replace(regexAstralSymbols, '_')
		// …and *then* get the length.
		.length;
}

Or, if you use Punycode.js (which ships with Node.js), make use of its utility methods to convert between JavaScript strings and Unicode code points. The punycode.ucs2.decode method takes a string and returns an array of Unicode code points; one item for each symbol.

function countSymbols(string) {
	return punycode.ucs2.decode(string).length;
}

In ES6 you can do something similar with Array.from which uses the string’s iterator to split it into an array of strings that each contain a single symbol:

function countSymbols(string) {
	return Array.from(string).length;
}

Or, using the spread operator ...:

function countSymbols(string) {
	return [...string].length;
}

Using any of those implementations, we’re now able to count code points properly, which leads to more accurate results:

>> countSymbols('A') // U+0041 LATIN CAPITAL LETTER A
1

>> countSymbols('𝐀') // U+1D400 MATHEMATICAL BOLD CAPITAL A
1

>> countSymbols('💩') // U+1F4A9 PILE OF POO
1

Accounting for lookalikes

But if we’re being really pedantic, counting the number of symbols in a string is even more complicated. Consider this example:

>> 'mañana' == 'mañana'
false

JavaScript is telling us that these strings are different, but visually, there’s no way to tell! So what’s going on there?