Approaches to line breaking

This article gives a high-level summary of various typographic strategies for wrapping text at the end of a line, for a variety of scripts.

Line-breaking is often a precursor to text justification. For a similar high-level summary of approaches to justification see Approaches to full justification.

This article is only an overview of the strategies used by different writing systems. Special rules apply to almost all scripts, affecting which characters can and can't start or end a line. Some writing systems allow hyphenation, and others don't. We will only give examples of the main differences, rather than exhaustively list all the details.

For more detailed information about how line-breaking happens in various scripts, see the Language enablement index.

Basic parameters

The most fundamental algorithm used to wrap text at the end of a line depends on the confluence of two factors:

whether 'words' or syllables are separated in the text, and if so, how, and
whether the writing system wraps words, syllables, or characters to the next line.

What is a word?

A clear definition of the term 'word' is very difficult to arrive at, and yet the distinction between words and syllables is significant in certain languages for the purposes of line-breaking.

Often applications and algorithms assume that a word is a sequence of characters delimited by space, or occasionally some other punctuation character. Some languages, however, are written in scripts that only delimit syllables, but still regard words as units that are composed of one or more syllables (eg. Tibetan and Vietnamese). Others do not visually identify word or syllable boundaries at all, but maintain a distinction between words and syllables (eg. Khmer, which typically doesn't visually separate either within a phrase, but is strongly inclined to treat 'words' as a basic unit when wrapping lines, rather than syllables or characters).

Even if a word is assumed, for a particular set of languages, to be a sequence of letters bounded by spaces, linguistically-speaking this obscures some significant underlying differences and complications. The composition of those words can differ significantly from language to language. For example,

words in Finnish may end with several prepositional or other suffixes attached to the base word (talo means 'house', and talostani means 'from my house'),
words in German may be a composite, made up of a sequence of smaller words, such as Eingabeverarbeitungsfunktionen, which is a compound of the words Eingabe, Verarbeitung, and Funktion, followed by a plural marker,
in Arabic, small words like 'and' (و) are written alongside the following word with no intervening space (eg. الجامعات والكليات means 'universities and colleges', but there is only one space).

When 'word delimiters' are not present, in languages such as Khmer, Thai, Japanese, etc., the definition of what constitutes a word can be subjective when compound nouns or grammatical particles are involved. For example, the Thai translation of 'writing', การเขียน, might be regarded as a single word (kānkhīan) or as two (kān khīan).

For the purposes of this article, we will not try to define the term 'word' too closely. Instead, we will just use it to mean a vaguely-defined semantic unit that may comprise one or more syllables.

Broad types

The following table provides a high-level view of factors that influence how a writing system wraps text at the end of a line. The language+script combinations listed in the table are only examples, and only refer to writing systems in modern use. Where a language name is not followed by a script name, both language and script have the same name. Note that it is common for a given language to be written using more than one script.

Note also that some language-script combinations (with asterisks) appear in more than one place in the table, indicating that there are alternative approaches. The reasons for this are described later.

	Space as word separator	Other word separator	Syllable separator	No separator
Wraps words	Amharic (ethiopic), Arabic, Armenian, Bengali, Cherokee, Dhivehi (thaana), English (latin), English (deseret), Fula (adlam), Georgian, Greek, Gujarati, Hebrew, Hindi (devanagari), Inuktitut (UCAS), Kannada, Korean (hangul), Malayalam, Mandaic, Mandinka (n’ko), Oriya, Panjabi (gurmukhi), Russian (cyrillic), Sinhala, Syriac, Tamil, Tedim (pau cin hau), Telugu	Samaritan		Khmer, Lao, Myanmar, Thai
Wraps syllables	Eastern Cham, Korean (hangul)*, Sundanese		Vietnamese (latin), Tibetan	Balinese, Batak, Chinese, Javanese, Western Cham
Wraps characters		Amharic (ethiopic)*		Japanese, Vai

Archaic scripts are much more likely to use a scriptio continua approach (ie. no word or syllable breaks), although in modern texts describing them you may find spaces separating units of text. Older versions of the scripts mentioned may also use different rules for word division and line breaking.

In following sections we will give examples of the main alternatives, and mention some of the implications. We'll focus on modern usage only, and we'll defer mention of hyphenation until later.

Languages that wrap words

Space delimited words

This is an approach that most people are familiar with, and it’s the way the English text in this article works. When the end of the line is reached, the application typically looks for the previous space, which is taken to be a word delimiter, and wraps everything after that to the next line*.
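As a minimal sketch of this basic approach, the following Python function fills each line with as many space-delimited words as will fit. The name wrap_at_spaces is ours, and the width is counted in characters purely for illustration; a real application measures rendered glyph widths and applies the additional rules described in the rest of this article.

```python
def wrap_at_spaces(text: str, width: int) -> list[str]:
    """Greedy wrap: break at the last space that keeps a line within width.

    Assumption for this sketch: width is a number of characters,
    not a measured rendering width.
    """
    lines: list[str] = []
    current = ""
    for word in text.split(" "):
        if not word:
            continue  # ignore the empty strings produced by runs of spaces
        candidate = word if not current else current + " " + word
        if len(candidate) <= width or not current:
            # The word fits, or it is alone on the line and simply overflows.
            current = candidate
        else:
            # The word does not fit: end the line at the previous space.
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


print(wrap_at_spaces("the quick brown fox jumps", 10))
# ['the quick', 'brown fox', 'jumps']
```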

Of course, things can be more complicated than just finding the previous space when justification is applied to text. For example, it may be possible to reduce the inter-word spacing on a line, in order to allow a word to fit when it would naturally overflow slightly; or conversely, it may be better to choose an earlier break than the immediately-preceding space, moving a word down even though it could fit, in order to improve a following line. Read more about justification.
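To illustrate the second case, here is a small sketch that picks break points for the whole paragraph at once, rather than line by line. It minimises the sum of the squares of the unused space on every line except the last. That cost function, the name wrap_balanced, and the character-count widths are assumptions made for this example, not a description of what any particular application does.

```python
def wrap_balanced(words: list[str], width: int) -> list[str]:
    """Choose line breaks that minimise squared trailing space (last line is free)."""
    n = len(words)
    # best[i] is the lowest cost of laying out words[i:];
    # nxt[i] is the index of the first word on the line after the one starting at i.
    best = [float("inf")] * (n + 1)
    nxt = [n] * (n + 1)
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):
            length += 1 + len(words[j])
            if length > width and j > i:
                break  # words[i..j] no longer fit on one line
            slack = max(width - length, 0)
            cost = 0 if j == n - 1 else slack ** 2
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines


print(wrap_balanced(["aaa", "bb", "cc", "ddddd"], 6))
# ['aaa', 'bb cc', 'ddddd']
```

With a width of 6, a line-by-line approach would put 'aaa bb' on the first line and leave 'cc' alone on the second. The sketch instead moves 'bb' down, even though it would fit, because that leaves less unused space overall.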

Many scripts work this way. Among others, they include scripts used for all major European languages, including Cyrillic and Greek; scripts used for major Indian languages, such as Devanagari, Gujarati, and Tamil; scripts used for modern Semitic languages, such as Arabic, Hebrew, and Syriac; and scripts used for American languages, such as Cherokee and Unified Canadian Aboriginal Syllabics (UCAS).

Devanagari line breaks — Line break opportunities for Hindi text (Devanagari script).

Languages written in right-to-left scripts, such as Arabic or Hebrew or Dhivehi, also typically wrap full words to the next line. However they do so, of course, in the opposite direction from, say, English.

Arabic line breaks — Line break opportunities for Arabic text.

Text in languages such as Arabic, Hebrew or Dhivehi, however, gets significantly more complicated when it contains bidirectional text. If we make the text read '...في this is English ذلك... ' in the above example we end up with the following.

Arabic line breaks in bidirectional text — Wrapping embedded opposite-direction text in Arabic.

Looking at the above example, you will notice that the relative order of the English words has been rearranged across the line break. This is because horizontal bidirectional text is never read upwards, from line to line. This output is managed by the bidirectional reordering process, which is applied to each line after the line breaks have been determined, and only affects the positioning of font glyphs. Characters in memory run in order of pronunciation, and don't change.

Vertically-set Chinese, Japanese, Korean, and Traditional Mongolian wrap words upwards, but the new line appears to the left for CJK, and right for Mongolian.

South-East Asia: no word separator

Thai, Lao, and Khmer are languages that are written with no spaces between words. Spaces do occur, but they serve as phrase delimiters, rather than word delimiters. However, when Thai, Lao, or Khmer text reaches the end of a line, the expectation is that text is wrapped a word at a time. For humans, this is not too hard (if you speak the language), but applications have to find a way to understand the text in order to determine where the word boundaries are.

Khmer line break opportunities. — Line break opportunities in Khmer text.

Most applications do this by using dictionary lookup. It’s not 100% perfect, and authors may need to adjust things from time to time. For example, here are two alternative sets of line-breaking opportunities for a Thai phrase.

Long Thai words. — Alternative line break opportunities for Thai text.

Shorter Thai words. — Alternative line break opportunities for Thai text.

The difference here is not just a question of faulty implementations. As mentioned earlier, the concept of what is a word in writing systems that don't clearly delimit them is somewhat fluid. The above differences arise from different subjective opinions about whether compound words should be wrapped whole or not to the next line.
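The effect of the dictionary on the result can be shown with a deliberately simplified sketch. It uses greedy longest-match lookup, which is only one of several possible strategies, and a toy word list built from the Thai example given earlier in this article. The function name segment and both word lists are assumptions for illustration; production segmenters use far larger dictionaries and more sophisticated matching.

```python
def segment(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a word list."""
    longest = max([1, *map(len, dictionary)])
    words: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when nothing in the dictionary matches at this position.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical dictionaries: one lists the compound, the other only its parts.
with_compound = {"การ", "เขียน", "การเขียน"}
parts_only = {"การ", "เขียน"}

print(segment("การเขียน", with_compound))  # one unit: no break allowed inside
print(segment("การเขียน", parts_only))     # two units: a break is allowed between them
```

Neither result is wrong. Which one is preferable depends on whether the compound should be kept together when wrapping.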

In the past, the Unicode character U+200B ZERO WIDTH SPACE (ZWSP) was used to indicate word boundaries for these scripts, and some standard keyboards such as Khmer NIDA still generate ZWSP with the spacebar key, but recently major languages have line-breaking implementations at their disposal, which means ZWSP is not essential. Large-scale manual entry of ZWSP is also not very practical because the user cannot see the separator in most scenarios; this leads to problems with ZWSP being inserted in the wrong position, or multiple times. ZWSP may, however, be used to hand-craft and fix aspects of line-break behaviour.
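Because ZWSP is invisible, it can help to have small utilities for inspecting text that contains it. The following sketch makes the separators visible and lists the units between them, ignoring the empty units that result from ZWSP being entered more than once. The helper names and the choice of '|' as a visible stand-in are assumptions for this example.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def reveal_zwsp(text: str, marker: str = "|") -> str:
    """Replace each ZWSP with a visible marker, for proofreading."""
    return text.replace(ZWSP, marker)


def zwsp_segments(text: str) -> list[str]:
    """Return the units between ZWSP characters, skipping empty ones
    caused by repeated, leading, or trailing ZWSP."""
    return [unit for unit in text.split(ZWSP) if unit]


sample = "ab" + ZWSP + ZWSP + "cd" + ZWSP
print(reveal_zwsp(sample))    # ab||cd|
print(zwsp_segments(sample))  # ['ab', 'cd']
```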

It is also important to bear in mind that the scripts referred to here may be used to write languages other than those mentioned, in particular minority languages for which different dictionaries are needed. Since such dictionaries may not be available in a given browser or other application, there is a tendency to use ZWSP in order to compensate.

Languages that wrap syllables

Some writing systems wrap not just words but syllables to the next line. Often it is preferable to wrap whole words, but text can also be broken at syllable boundaries instead.

Some analysis of the text is typically needed to determine where the syllable boundaries occur. Often the end of a syllable is marked by a final consonant that is a combining character, or by a special mark. In some cases, however, the location of syllable boundaries may be visually ambiguous. Furthermore, the syllable in question may be an orthographic syllable, rather than a phonetic syllable (see below).

Chinese and Korean are included here, although they are slightly unusual in that a syllable normally corresponds to a single character, rather than a sequence. (Although in the rare case where Korean is stored as jamos, rather than syllabic characters, there is a sequence involved.)
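The difference between the two ways of storing Korean can be seen with Python's standard unicodedata module. In this sketch, normalising to NFC before looking for syllable units is our assumption about one reasonable way to handle jamo sequences; it works for modern syllables, but archaic jamo sequences have no precomposed form and would need cluster-aware segmentation instead.

```python
import unicodedata

syllable = "\ud55c"  # U+D55C HANGUL SYLLABLE HAN, stored as one character
jamos = unicodedata.normalize("NFD", syllable)  # the same syllable as a jamo sequence

print(len(syllable))  # 1
print(len(jamos))     # 3


def hangul_units(text: str) -> list[str]:
    """Split text into characters after composing jamo sequences, so that
    a break is never placed inside a modern hangul syllable."""
    return list(unicodedata.normalize("NFC", text))
```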

Tibetan: visible syllable dividers

A good example of a writing system that breaks regularly at syllable boundaries is Tibetan, which uses ་ [U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG] (pronounced tsek) to signal the end of a syllable.

Tibetan wraps by moving complete syllables to the next line, so that the original line ends with a tsek mark. Tibetan words can be made up of multiple syllables, and although it is preferable to avoid breaking a line in the middle of a word, it is not essential. A syllable, on the other hand, should always be kept intact.

Line break opportunities in Tibetan. — Line break opportunities in Tibetan
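A rough sketch of this behaviour follows. It splits the text after each tsek, keeping the mark attached to the syllable it ends, and then fills lines with whole syllables. The function names are ours, the width is counted in code points rather than measured, and other Tibetan punctuation and line-breaking rules are ignored.

```python
TSHEG = "\u0f0b"  # U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG


def tibetan_syllables(text: str) -> list[str]:
    """Split after each tsheg, keeping the tsheg with the syllable it ends."""
    parts = text.split(TSHEG)
    units = [part + TSHEG for part in parts[:-1]]
    if parts[-1]:
        units.append(parts[-1])  # trailing text with no final tsheg
    return units


def wrap_units(units: list[str], width: int) -> list[str]:
    """Fill lines with whole units; no separator is added between them."""
    lines: list[str] = []
    current = ""
    for unit in units:
        if current and len(current) + len(unit) > width:
            lines.append(current)
            current = unit
        else:
            current += unit
    if current:
        lines.append(current)
    return lines


# Two syllables, each ending in a tsheg, written with escapes for clarity.
text = "\u0f56\u0f7c\u0f51\u0f0b\u0f66\u0f90\u0f51\u0f0b"
for line in wrap_units(tibetan_syllables(text), 4):
    print(line)  # each printed line ends with a tsheg
```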

Korean hangul: alternatives

Korean is unusual in that words in modern hangul text are normally separated by spaces, but the writing system allows content authors to choose one of two ways for that text to be wrapped.

Syllable-based wrapping is common, especially in fully justified text (which is more common in CJK writing systems than in Western ones), but paragraphs with a ragged right edge will often wrap whole words. However, the choice is motivated by author preference, rather than any hard and fast rule.
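The two alternatives can be expressed as a choice of where break opportunities are allowed. The sketch below returns the indexes before which a line may be broken. The mode names 'word' and 'syllable' are invented for this example, and only precomposed hangul syllables (U+AC00 to U+D7A3) are considered; punctuation rules are ignored.

```python
def is_hangul_syllable(ch: str) -> bool:
    """True for precomposed hangul syllables, U+AC00 to U+D7A3."""
    return "\uac00" <= ch <= "\ud7a3"


def break_opportunities(text: str, mode: str) -> list[int]:
    """Return the indexes before which a line break is allowed.

    mode "word":     only after a space.
    mode "syllable": after a space, and between any two hangul syllables.
    """
    if mode not in ("word", "syllable"):
        raise ValueError("mode must be 'word' or 'syllable'")
    points: list[int] = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev == " " and cur != " ":
            points.append(i)
        elif mode == "syllable" and is_hangul_syllable(prev) and is_hangul_syllable(cur):
            points.append(i)
    return points


sample = "\ud55c\uad6d\uc5b4 \ubb38\uc7a5"  # two space-separated words
print(break_opportunities(sample, "word"))      # [4]
print(break_opportunities(sample, "syllable"))  # [1, 2, 4, 5]
```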

You may also come across hangul written with no space between words (ie. like Chinese and Japanese), especially in older texts.