Small Datum: 2018

Monday, December 17, 2018

New small servers for performance testing

My old NUC cluster found a new home and I downsized to 2 new NUC servers. The new server is NUC8i7beh with 16g RAM, 500g Samsung 860 EVO for the OS and 500g Samsung 970 EVO for performance. The Samsung 860 is SATA and the Samsung 970 is an m.2 device. I expect to wear out the performance devices as I have done that in the past. With the OS on a separate device I avoid the need to reinstall the OS when that happens.

The new NUC has a post-Skylake CPU (i7-8559u), provides 4 cores (8 HW threads) compared to 2 cores (4 HW threads) in the old NUCs. I disabled turbo boost again to avoid performance variance as mentioned in the old post. I am not sure these have sufficient cooling for sustained boost and when boost isn't sustained there are frequent changed in CPU performance. I also disabled hyperthreads out of concern for both the impact from Spectre fixes and to avoid a different syscall overhead each time I update the kernel.

I might use these servers to examine the impact of the ~10x increase in PAUSE times on InnoDB with and without HT enabled. I might also use them for another round of MySQL performance testing when 8.0.14 is release.

I am a big fan of Intel NUC servers. But maybe I am not a fan of the SATA cables they use. I already had one of my old NUCs replaced under warranty after one of the SATA wires was bare. In the new NUCs I just setup a few of the SATA cables appear to be cut and I wonder if that eventually becomes bare.

Friday, December 14, 2018

LSM math - size of search space for LSM tree configuration

I have written before and will write again about using 3-tuples to explain the shape of an LSM tree. This makes it easier to explain the configurations supported today and configurations we might want to support tomorrow in addition to traditional tiered and leveled compaction. The summary is that n LSM tree has N levels labeled from L1 to Ln and Lmax is another name for Ln. There is one 3-tuple per level and the components of the 3-tuple are (type, fanout, runs) for Lk (level k) where:

type is Tiered or Leveled and explains compaction into that level
fanout is the size of a sorted run in Lk relative to a sorted run from Lk-1, a real and >= 1
runs is the number of sorted runs in that level, an integer and >= 1

Given the above how many valid configurations exist for an LSM tree? There are additional constraints that can be imposed on the 3-tuple but I will ignore most of them except for limiting fanout and runs to be <= 20. The answer is easy - there are an infinite number of configurations because fanout is a real.

The question is more interesting when fanout is limited to an integer and the number of levels is limited to between 1 and 10. I am doing this to explain the size of the search space but I don't think that fanout should be limited to an integer.

There are approximately 2^11 configurations only considering compaction type, which has 2 values, and 1 to 10 levels because there are 2^N configurations of compaction types for a tree with N levels and the sum of 2^1 + 2^2 + ... + 2^9 + 2^10 = 2^11 - 1

But when type, fanout and runs are considered then there are 2 x 20 x 20 = 800 choices per level and 800^N combinations for an LSM tree with N levels. Considering LSM trees with 1 to 10 levels then the number of valid configurations is the sum 800^1 + 800^2 + ... + 800^9 + 800^10. That is a large number of configurations if exhaustive search were to be used to find the best configuration. Note that I don't think exhaustive search should be used.

Thursday, December 13, 2018

LSM math - how many levels minimizes write amplification?

How do you configure an LSM tree with leveled compaction to minimize write amplification? For a given number of levels write-amp is minimal when the same fanout (growth factor) is used between all levels, but that does not explain the number of levels to use. In this post I answer that question.

The number of levels that minimizes write-amp is one of ceil(ln(T)) or floor(ln(T)) where T is the total fanout -- sizeof(database) / sizeof(memtable)
When #1 is done then the per-level fanout is e when the number of levels is ln(t) and a value close to e when the number of levels is an integer.

Introduction

I don't recall reading this result elsewhere, but I am happy to update this post with a link to such a result. I was encouraged to answer this after a discussion with the RocksDB team and thank Siying Dong for stating #2 above while leaving the math to me. I assume the original LSM paper didn't address this problem because that system used a fixed number of levels.

One result from the original LSM paper and updated by me is that write-amp is minimized when the per-level growth factor is constant. Sometimes I use fanout or per-level fanout rather than per-level growth factor. In RocksDB the option name is max_bytes_for_level_multiplier. Yes, this can be confusing. The default fanout in RocksDB is 10.

Math

I solve this for pure-leveled compaction which differs from what RocksDB calls leveled. In pure-leveled all levels used leveled compaction. In RocksDB leveled the first level, L0, uses tiered and the other levels used leveled. I started to explain this here where I claim that RocksDB leveled is really tiered+leveled. But I am not asking for them to change the name.

Assumptions:

LSM tree uses pure-leveled compaction and compaction from memtable flushes into the first level of the LSM tree uses leveled compaction
total fanout is T and is size(Lmax) / size(memtable) where Lmax is the max level of the LSM tree
workload is update-only so the number of keys in the database is fixed
workload has no write skew and all keys are equally likely to be updated
per-level write-amp == per-level growth factor. In practice and in theory the per-level write-amp tends to be less than the per-level growth factor.
total write-amp is the sum of per-level write-amp. I ignore write-amp from the WAL.

Specify function for write-amp and determine critical points

# wa is the total write-amp
# n is the number of levels
# per-level fanout is the nth root of the total fanout
# per-level fanout = per-level write-amp
# therefore wa = number of levels * per-level fanout
wa = n * t^(1/n)

# given the function for write-amp as wa = a * b
# ... then below is a' * b + a * b'
a = n, b = t^(1/n)
wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2)

# which simplifies to
wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)

# critical point for this occurs when wa' = 0
t^(1/n) - (1/n) * ln(t) * t^(1/n) = 0
t^(1/n) = (1/n) * ln(t) * t^(1/n)
1 = (1/n) * ln(t)
n = ln(t)

When t = 1024 then n = ln(1024) ~= 6.93. In this case write-amp is minimized when 7 levels are used although 6 isn't a bad choice.

Assuming the cost function is convex (see below) the critical point is the minimum for write-amp. However, n must be an integer so the number of levels that minimizes write-amp is one of: ceil(ln(t)) or floor(ln(t)).

The graph for wa when t=1024 can be viewed thanks to Desmos. The function looks convex and I show below that it is.

Determine whether critical point is a min or max

The critical point found above is a minimum for wa if wa is convex so we must show that the second derivative is positive.

wa = n * t ^ (1/n)
wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n)
wa' = t^(1/n) * (1 - (1/n) * ln(t))

# assuming wa' is a * b then wa'' is a' * b + a * b'
a = t^(1/n)
a' = ln(t) * t^(1/n) * -1 * (1/n^2)
a' = - ln(t) * t^(1/n) * (1/n^2)

b = 1 - (1/n) * ln(t)
b' = (1/n^2) * ln(t)

# a' * b
- ln(t) * t^(1/n) * (1/n^2) --> called x below
+ ln(t) * ln(t) * (1/n^3) * t^(1/n) --> called y below

# b' * a
t^(1/n) * (1/n^2) * ln(t) --> called z below

# therefore wa'' = x + y + z
# note that x, y and z all contain: t^(1/n), 1/n and ln(t)
wa'' = t^(1/n) * (1/n) * ln(t) * (-(1/n) + (ln(t) * 1/n^2) + (1/n))
wa'' = t^(1/n) * (1/n) * ln(t) * ( ln(t) * 1/n^2 )'
wa'' = t^(1/n) * 1/n^3 * ln(t)^2

Therefore wa'' is positive, wa is convex and the critical point is a minimum value for wa

Solve for per-level fanout

The next step is to determine the value of the per-level fanout when write-amp is minimized. If the number of levels doesn't have to be an integer then this occurs when ln(t) levels are used and below I show that the per-level fanout is e in that case. When the number of levels is limited to an integer then the per-level fanout that minimizes write-amp is a value that is close to e.

# total write-amp is number of levels * per-level fanout
wa = n * t^(1/n)

# The per-level fanout is t^(1/n) and wa is minimized when n = ln(t)
# Therefore we show that t^(1/n) = e when n = ln(t)
Assume t^(1 / ln(t)) = e
ln (t^(1 / ln(t))) = ln e
(1 / ln(t)) * ln(t) = 1
1=1

When the t=1024 then ln(t) ~= 6.93. With 7 levels the per-level fanout is t^(1/7) ~= 2.69 while e ~= 2.72.

Saturday, December 1, 2018

Pixelbook review

This has nothing to do with databases. This is a review of a Pixelbook (Chromebook laptop) that I got on sale last month. This one has a core i5, 8gb RAM and 128gb storage. It runs Linux too but I haven't done much with that. I expected a lot from this given that my 2013 Nexus 7 tablet is still awesome. I have been mostly happy with the laptop but if you care about keyboards and don't like the new Macs thanks to the butterfly keyboard then this might not be the laptop for you. My 3 complaints:

keyboard is hard to read. It is grey on grey and too hard to read when there is light on my back even with the backlight (backlit?) turned all the way up. I don't get it -- grey on grey. So this is a great device for using in a dark room or for improving your touch typing skills.
touchpad control is too coarse grained so it is either too fast or too slow. The settings has 5 values via a slider (1=slowest, 5=fastest). I have been using it at 3 which is a bit too fast for me while 2 is a bit too slow. I might go back to 2 but that means picking up my finger more frequently when moving a pointer across the screen.
no iMessage - my family uses Apple devices and I can't run that here as I can on a Mac laptop
the "" ey is flay --> the "k" key is flaky -> spacebar is flaky. Keys go bad for a few days, then get better, repeat. Ugh, his is one-off Google hardware. Maybe they don't want Apple and the butterfly keyboard to have all the fun. Fortunately I bought from an authorized reseller (Best Buy) so the 1 year warranty should apply.
Charger failed, fortunately that is easy to replace.

Monday, November 19, 2018

Review of TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores

This is review of TRIAD which was published in USENIX ATC 2017. It explains how to reduce write amplification for RocksDB leveled compaction although the ideas are useful for many LSM implementations. I share a review here because the paper has good ideas. It isn't easy to keep up with all of the LSM research, even when limiting the search to papers that reference RocksDB, and I didn't notice this paper until recently.

TRIAD reduces write amplification for an LSM with leveled compaction and with a variety of workloads gets up to 193% more throughput, up to 4X less write amplification and spends up to 77% less time doing compaction and flush. Per the RUM Conjecture improvements usually come at a cost and the cost in this case is more cache amplification (more memory overhead/key) and possibly more read amplification. I assume this is a good tradeoff in many cases.

The paper explains the improvements via 3 components -- TRIAD-MEM, TRIAD-DISK and TRIAD-LOG -- that combine to reduce write amplification.

TRIAD-MEM

TRIAD-MEM reduces write-amp by keeping frequently updated keys (hot keys) in the memtable. It divides keys into the memtable into two classes: hot and cold. On flush the cold keys are written into a new L0 SST while the hot keys are copied over to the new memtable. The hot keys must be written again to the new WAL so that the old WAL can be dropped. TRIAD-MEM tries to keep the K hottest keys in the memtable and there is work in progress to figure out a good value for K without being told by the DBA.

An extra 4-bytes/key is used for the memtable to track write frequency and identify hot keys. Note that RocksDB already 8 bytes/key for metadata. So TRIAD-MEM has a cost in cache-amp but I don't think that is a big deal.

Assuming the per-level write-amp is 1 from the memtable flush this reduces it to 0 in the best case where all keys are hot.

TRIAD-DISK

TRIAD-DISK reduces write-amp by delaying L0:L1 compaction until there is sufficient overlap between keys to be compacted. TRIAD continues to use an L0:L1 compaction trigger based on the number of files in the L0 but can trigger compaction earlier when there is probably sufficient overlap between the L0 and L1 SSTs.

Overlap is estimated via Hyperloglog (HLL) which requires 4kb/SST and is estimated as the following where file-i is the i-th SST under consideration, UniqueKeys is the estimated number of distinct keys across all of the SSTs and Keys(file-i) is the number of keys in the i-th SST. The paper states that both UniqueKeys and Keys are approximated using HLL. But I assume that per-SST metadata already has an estimate or exact value for the number of keys in the SST. The formula for overlap is:
UniqueKeys(file-1, file-2, ... file-n) / sum( Keys( file-i))

The benefit from early L0:L1 compaction is less read-amp, because there will be fewer sorted runs to search on a query. The cost from always doing early compaction is more per-level write-amp which is etimated by size(L1 input) / size(L0 input). TRIAD-DISK provides the benefit with less cost.

In RocksDB today you can manually schedule early compaction by setting the trigger to 1 or 2 files, or you can always schedule it to be less early with a trigger set to 8 or more files. But this setting is static. TRIAD-DISK uses a cost-based approach to do early compaction when it won't hurt the per-level write-amp. This is an interesting idea.

TRIAD-LOG

TRIAD-LOG explains improvements to memtable flush that reduce write-amp. Data in an L0 SST has recently been written to the WAL. So they use the WAL in place of writing the L0 SST. But something extra, an index into the WAL, is written on memtable flush because everything in the L0 must have an index. The WAL in the SST (called the CL-SST for commit log SST) will be deleted when it is compacted into the L1.

There is cache-amp from TRIAD-LOG. Each key in the CL-SST (L0) and maybe in the memtable needs 8 extra bytes -- 4 bytes for CL-SST ID, 4 bytes for the WAL offset.

Assuming the per-level write-amp is one from the memtable flush for cold keys this reduces that to 0.

Reducing write amplification

The total write-amp for an LSM tree with leveled compaction is the sum of:

writing the WAL = 1
memtable flush = 1
L0:L1 compaction ~= size(L1) / size(L0)
Ln compaction for n>1 ~= fanout, the per-level growth factor, usually 8 or 10. Note that this paper explains why it is usually a bit less than fanout.

TRIAD avoids the write-amp from memtable flush thanks to TRIAD-MEM for hot keys and TRIAD-LOG for cold keys. I will wave my hands and suggest that TRIAD-DISK reduces write-amp for L0:L1 from 3 to 1 based on the typical LSM configuration I use. So TRIAD reduces the total write-amp by 1+2 or 3.

Reducing total write-amp by 3 is a big deal when the total write-amp for the LSM tree is small, for example <= 10. But that only happens when there are few levels beyond the L1. Assuming you accept my estimate for total write-amp above then per-level write-amp is ~8 for both L1:L2 and L2:L3. The total write-amp for an LSM tree without TRIAD would be 1+1+3+8 = 13 if the max level is L2 and 1+1+3+8+8 = 21 if the max level is L3. And then TRIAD reduces that from 13 to 10 or from 21 to 18.

But my write-amp estimate above is more true for workloads without skew and less true for workloads with skew. Many of the workloads tested in the paper have a large amount of skew. So while I have some questions about the paper I am not claiming they are doing it wrong. What I am claiming is that the benefit from TRIAD is significant when total write-amp is small and less significant otherwise. Whether this matters is workload dependent. It would help to know more about the LSM tree from each benchmark. How many levels were in the LSM tree per benchmark? What is the per-level write-amp with and without TRIAD? Most of this can be observed from compaction statistics provided by RocksDB. The paper has some details on the workloads but that isn't sufficient to answer the questions above.

Questions

The paper documents the memory overhead, but limits the definition of read amplification to IO and measured none. I am interested in IO and CPU and suspect there might be some CPU read-amp from using the commit-log SST in the L0 both for searches and during compaction as logically adjacent data is no longer physically adjacent in the commit-log SST.
impact of more levels?

Another question is how far down the LSM compaction occurs. For example if the write working set fits in the L2, should compaction stop at the L2. It might with some values of compaction priority in RocksDB but it doesn't for all. When the workload has significant write skew then the write working set is likely to fit into one of the smaller levels of the LSM tree.

An interesting variant on this is a workload with N streams of inserts that are each appending (right growing). When N=1 there is an optimization in RocksDB that limits write-amp to 2 (one for WAL, one for SST). I am not aware of optimizations in RocksDB for N>2 but am curious if we could do something better.

Friday, November 2, 2018

Converting an LSM to a B-Tree and back again

I wonder if it is possible to convert an LSM to a B-Tree. The goal is to do it online and in-place -- so I don't want two copies of the database while the conversion is in progress. I am interested in data structures for data management that adapt dynamically to improve performance and efficiency for a given workload.

Workloads change in the short and long term. I hope that data structures can be adapt to the change and converting between an LSM and a B-Tree is one way to adapt. This is more likely to be useful when the data structure supports some kind of partitioning in the hope that different workloads can be isolated to different partitions -- and then some can use an LSM while others use a B-Tree.

LSM to B-Tree

A B-Tree is one tree. An LSM is a sequence of trees. Each sorted run in the LSM is a tree. With leveled compaction in RocksDB there are a few sorted runs in level 0 (L0) and then one sorted run in each of L1, L2 up to the max level (Lmax).

A B-Tree persists changes by writing back pages -- either in-place or copy-on-write (UiP or CoW). An LSM persists changes by writing and then re-writing rows. I assume that page write back is required if you want to limit the database to one tree and row write back implies there will be more than one tree.

There are two things that must be done online and in-place:

Convert the LSM from many trees to one tree
Convert from row write back to page write back

Note that my goal has slightly changed. I want to move from an LSM to a data structure with one tree. For the one-tree solution a B-Tree is preferred but not required.

The outline of a solution:

Reconfigure the LSM to use 2 levels -- L0 and L1 -- and 3 trees -- memtable, L0, L1.
Disable the L0. At this point the LSM has two trees -- memtable and L1.
Flush the memtable and merge it into the L1. Now there is one tree.
After the flush disable the memtable and switch to a page cache. Changes now require a copy of the L1 block in the page cache that eventually get written back via UiP or CoW.

The outline above doesn't explain how to maintain indexes for the L1. Note that after step 2 there is one tree on disk and the layout isn't that different from the leaf level of a B-Tree. The interior levels of the B-Tree could be created by reading/rewriting the block indexes stored in the SSTs.

B-Tree to LSM

The conversion can also be done in the opposite direction (B-Tree to LSM)

Treat the current B-Tree as the max level of the LSM tree. While it might help to flush the page cache I don't think that is required. This is easier to do when your LSM uses a B-Tree per level, as done by WiredTiger.
Record new changes for insert, update, delete in a memtable rather than a page cache.
When the memtable is full then flush it to create a new tree (sorted run, SST) on disk.
Eventually start to do compaction.

Friday, October 19, 2018

Combining tiered and leveled compaction

There are simple optimization problems for LSM tuning. For example use leveled compaction to minimize space amplification and use tiered to minimize write amplification. But there are interesting problems that are harder to solve:

maximize throughput given a constraint on write and/or space amplification
minimize space and/or write amplification given a constraint on read amplification

To solve the first problem use leveled compaction if it can satisfy the write amp constraint, else use tiered compaction if it can satisfy the space amp constraint, otherwise there is no solution. The lack of a solution might mean the constraints are unreasonable but it can also mean we need to enhance LSM implementations to support more diversity in LSM tree shapes. Even when there is a solution using leveled or tiered compaction there are solutions that would do much better were an LSM to support more varieties of tiered+leveled and leveled-N.

When I mention solved above I leave out that there is more work to find a solution even when tiered or leveled compaction is used. For both there are decisions about the number of levels and per-level fanout. If minimizing write amp is the goal then that is a solved problem. But there are usually more things to consider.

Tiered+leveled

I defined tiered+leveled and leveled-N in a previous post. They occupy the middle ground between tiered and leveled compaction with better read efficiency than tiered and better write efficiency than leveled. They are not supported today by popular LSM implementations but I think they can and should be supported.

While we tend to explain compaction as a property of an LSM tree (all tiered or all leveled) it is really a property of a level of an LSM tree and RocksDB already supports hybrids, combinations of tiered and leveled. For tiered compaction in RocksDB all levels except the largest use tiered. The largest level is usually configured to use leveled to reduce space amp. For leveled compaction in RocksDB all levels except the smallest use leveled and the smallest (L0) uses tiered.

So tiered+leveled isn't new but I think we need more flexibility. When a string of T and L is created from the per-level compaction choices then the regex for the strings that RocksDB supports is T+L or TL+. I want to support T+L+. I don't want to support cases where leveled is used for a smaller level and tiered for a larger level. So I like TTLL but not LTTL. My reasons for not supporting LTTL are:

The benefit from tiered is less write amp and is independent of the level on which it is used. The reduction in write amp is the same whether tiered is used for L1, L2 or L3.
The cost from tiered is more read and space amp and that is dependent on the level on which it is used. The cost is larger for larger levels. When space amp is 2 more space is wasted on larger levels than smaller levels. More IO read amp is worse for larger levels because they have a lower hit rate than smaller levels and more IO will be done. More IO implies more CPU cost from decompression and the CPU overhead of performing IO.

From above the benefit from using T is the same for all levels but the cost increases for larger levels so when T and L are both used then T (tiered) should be used on the smaller levels and L (leveled) on the larger levels.

Leveled-N

I defined leveled-N in a previous post. Since then a co-worker, Maysam Yabandeh, explained to me that a level that uses leveled-N can also be described as two levels where the smaller uses leveled and the larger uses tiered. So leveled-N might be syntactic sugar in the LSM tree configuration language.

For example with an LSM defined using the triple syntax from here as (compaction type, fanout, runs-per-level) then this is valid: (T,1,8) (T,8,2) (L,8,2) (L,8,1) and has total fanout of 512 (8 * 8 * 8). The third level (L,8,2) uses leveled-N with N=2. Assuming we allow LSM trees where T follows L then the leveled-N level can be replaced with two levels: (L,8,1) (T,1,8). Then the LSM tree is defined as (T,1,8) (T,8,2) (L,8,1) (T,1,8) (L,8,1). These LSM trees have the same total fanout and total read/write/space amp. Compaction from (L,8,1) to (T,1,8) is special. It has zero write amp because it is done by a file move rather than merging/writing data so all that must be updated is LSM metadata to record the move.

So in general I don't support T after L but I do support it in the special case. Of course we can pretend the special case doesn't exist if we use the syntactic sugar provided by leveled-N. But I appreciate that Maysam discovered this.

Wednesday, October 3, 2018

Minimizing write amplification in an LSM

Write-amplification for an LSM with leveled compaction is minimized when the per-level growth factor (fanout) is the same between all levels. This is a result for an LSM tree using a given number of levels. To find the minimal write-amplification for any number of levels this result can be repeated for 2, 3, 4, ... up to a large value. You might find that a large number of levels is needed to get the least write-amp and that comes at price of more read-amp, as the RUM Conjecture predicts.

In all cases below I assume that compaction into the smallest level (from a write buffer flush) has no write-amp. This is done to reduce the size of this blog post.

tl;dr - for an LSM with L1, L2, L3 and L4 what values for per-level fanout minimizes write-amp when the total fanout is 1000?

(10, 10, 10) for leveled
(6.3, 12.6, 12.6) for leveled-N assuming two of the levels have 2 sorted runs
(>1, >1, >1) for tiered

Minimizing write-amp for leveled compaction

For an LSM with 4 levels (L1, L2, L3, L4) there is a per-level fanout between L1:L2, L2:L3 and L3:L4. Assume this uses classic leveled compaction so the total fanout is size(L4) / size(L1). The product of the per-level fanouts must equal the total fanout. The total write-amp is the sum of the per-level write-amp. I assume that the per-level write amp is the same as the per-level fanout although in practice and in theory it isn't that simple. Lets use a, b and c as the variables for the per-level fanout (write-amp) then the math problem is:

minimize a+b+c
such that a*b*c=k and a, b, c > 1

While I have been working on my math skills this year they aren't great and corrections are welcome. This is a constrained optimization problem that can be solved using Lagrange Multipliers. From above #1 is the sum of per-level write-amp and #2 means that the product of per-level fanout must equal the total fanout. The last constraint is that a, b and c must (or should) all be > 1.

This result uses Lagrange Multipliers for an LSM tree with 4 levels do there are 3 variables: a, b, c. But the math holds for an LSM tree with fewer levels or with more levels. If there are N levels then there are N-1 variables.

L(a, b, c) = a + b + c - lambda * (a*b*c - k)

dL/da = 1 - lambda * bc

dL/db = 1 - lambda * ac

dL/dc = 1 - lambda * ab

then

lambda = 1/bc = 1/ac = 1/ab
bc == ac == ab

and a == b == c to minimize the sum in #1

I wrote a Python script to discover the (almost) best values and the results match the math above.

Minimizing write-amp for tiered compaction

Assuming you can reason about tiered compaction using the notion of levels then the math changes a bit because the per-level write-amp with tiered equals 1 regardless of the per-level fanout. For tiered with 4 levels and 3 variables the problem is:

minimize 1+1+1
such that a*b*c = k and a, b, c > 1

Any values for a, b and c are sufficient as long they satisfy the constraints in #2. But it still helps to minimize a+b+c if that is predicts read-amp because a, b and c are also the number of sorted runs in L2, L3 and L4. So my advice is to use a == b == c in most cases.

Minimizing write-amp for leveled-N compaction
I explain leveled-N compaction here and here. It is like leveled compaction but allows a level to have more than one sorted run. This reduces the per-level write-amp at the cost of more read-amp. Sometimes that is a good trade.

The math above can also be used to determine how to configure per-level fanout to minimize write-amp for leveled-N. Assume an LSM tree with 4 levels (L1, L2, L3, L4) and 2 sorted runs in L2 and L3. The problem is:

minimize a + b/2 + c/2
such that a*b*c = k and a, b, c > 1

For leveled compaction I assume that per-level write-amp is all-size(Ln+1) / all-size(Ln) for compaction from Ln into Ln+1. For leveled-N I assume it is run-size(Ln+1) / all-size(Ln) where all-size is the size of all sorted runs on that level and run-size is the size of one sorted run. The astute reader might notice that all-size(Ln) == run-size(Ln) for traditional leveled. For leveled-N I assume that fanout continues to be run-size(Ln+1) / run-size(Ln).

Therefore with leveled-N the per-level write-amp is b/2 for L2 to L3 and c/2 for L3 to L4 because there are 2 sorted runs in the compaction input (twice as much data) in those cases. Were there 3 sorted runs then the values would be b/3 and c/3.

Lagrange Multipliers can be used to solve this assuming we want to minimize a + b/2 + c/2.

L(a, b, c) = a + b/2 + c/2 - lambda * (a*b*c - k)

dL/da = 1 - lambda * bc

dL/db = 1/2 - lambda * ac

dL/dc = 1/2 - lambda * ab

then

lambda = 1/bc = 1/2ac = 1/2ab
bc == 2ac -> b == 2a
bc == 2ab -> c == 2a
2ac == 2ab -> c == b

and 2a == b == c to minimize the sum

If the total fanout is 1000 then the per-level fanout values that minimize write-amp are 10, 10, 10 for leveled and 6.30, 12.60, 12.60 for this example with leveled-N and can be computed by "bc -l"

# for leveled-N
e(l(1000/4)/3)

6.29960524947436582381

e(l(1000/4)/3) * 2
12.59921049894873164762

# and for leveled
e(l(1000)/3)
9.99999999999999999992

One way to think of this result is that with leveled compaction the goal is to use the same per-level fanout between levels. This also uses the same per-level write-amp between levels because per-level write-amp == the per-level fanout for leveled.

But with leveled-N compaction we need to adjust the per-level fanout for levels to continue to get the same per-level write-amp between levels.

Tuesday, October 2, 2018

Describing tiered and leveled compaction

This is another attempt by me to define the shape of an LSM tree with more formality and this builds on previous posts here and here. My key point is that compaction is the property of a level in an LSM tree rather than the LSM tree. Some levels can use tiered and others can use leveled. This combination of tiered and leveled is already done in popular LSM implementations but it hasn't been called out as a feature.

Stepped Merge

The Stepped Merge paper might have been the first description of tiered compaction. It is a way to improve B-Tree insert performance. It looked like an LSM tree with a few sorted runs at each level. When a level was full the sorted runs at that level were merged to create a larger sorted run in the next level. The per-level write-amplification was 1 because compaction into level N+1 merged runs from level N but did not read/rewrite a run already on level N+1.

This looks like tiered compaction. However it allows for N sorted runs on the max level which means that space-amplification will be >= N. I assume that is too much for most users of tiered compaction in Cassandra, RocksDB and HBase. But this isn't a problem for Stepped Merge because it is an algorithm for buffering changes to a B-Tree, not for storing the entire database and it doesn't lead to a large space-amp for that workload. Note that the InnoDB change buffer is a B-Tree that buffers changes to other B-Trees for a similar reason.

Compaction per level

I prefer to define compaction as a property of a level in an LSM tree rather than a property of the LSM tree. Unfortunately this can hamper discussion because it takes more time and text to explain compaction per level.

I will start with definitions:

When a level is full then compaction is done from it to the next larger level. For now I ignore compaction across many levels, but that is a thing (see "major compaction" in HBase).
A sorted run is a sequence of key-value pairs stored in key order. It is stored using 1+ files.
A level is tiered when compaction into it doesn't read/rewrite sorted runs already in that level.
A level is leveled when compaction into that level reads/rewrites sorted runs already in that level.
Levels are full when they have a configurable number of sorted runs. In classic leveled compaction a level has one sorted run. A tiered level is full when it has X sorted runs where X is some value >= 2.
leveled-N uses leveled compaction which reads/rewrites an existing sorted run, but it allows N sorted runs (full when runs == N) rather than 1.
The per level fanout is size(sorted-run in level N) / size(sorted-run in level N-1)
The total fanout is the product of the per level fanouts. When the write buffer is 1G and the database is 1000G then the total fanout must be 1000.
The runs-per-level is the number of sorted runs in a level when it is full.
The per level write-amplification is the work done to compact from Ln to Ln+1. It is 1 for tiered, all-size(Ln+1) / all-size(Ln) for leveled and run-size(Ln+1) / all-size(Ln) for leveled-N where run-size is the size of a sorted run and all-size is the sum of the sizes of all sorted runs on a level.

A level can be described by a 3-tuple (c, f, r) where c is the type of compaction (T or L for tiered or leveled), f is the fanout and r is the runs-per-level. Unfortunately, now we have made the description of an LSM tree even more complex because there is a 3-tuple per level. For now I don't use 3-tuples to describe the write buffer (memory component). That is a topic for another post. Example 3-tuples include:

T:1:4 - this is tiered with fanout=1 and runs-per-level=4. It is a common configuration for the RocksDB level 0 (L0) where the fanout is 1 because the compaction input is a write buffer flush so the size of a sorted run in L0 is similar to the size of a full write buffer. For now I ignore that RocksDB can merge write buffers on a flush.
T:8:8 - this is tiered with fanout=8 and runs-per-level=8. When Ln and Ln+1 both use tiered then runs-per-level in Ln == fanout in Ln+1.
T:8:4 - this is tiered with fanout=8 and runs-per-level=4. It might be used when the next larger level uses leveled and runs-per-level on this level can be smaller than fanout to reduce read-amp.
L:10:1 - this is common in RocksDB with leveled compaction, fanout=10 and runs-per-level=1
L:10:2 - this is leveled-N with runs-per-level=2

Compaction per LSM tree

An LSM tree can be described using the per level 3-tuples from the previous section. The following are examples for popular algorithms.

Classic LSM with total fanout = 1000 is:

C0 is the write buffer
C1, C2 and C3 are L:10:1

RocksDB leveled with total fanout = 1000 is:

L0 is T:1:4
L1 is L:1:1
L2, L3, L4 are L:10:1

Stepped Merge with total fanout = 1000 is:

L1 is T:1:10
L2, L3, L4 are T:10:10

Tiered in HBase and Cassandra with total fanout = 1000 might be:

L1 is T:1:10
L2, L3 are T:10:10
L4 is L:10:1

Tiered+leveled

Note that some smaller levels using tiered and some larger levels using leveled is done by both RocksDB leveled and Cassandra/HBase tiered. I think both of these are examples of an LSM variant that I call tiered+leveled but I won't ask any of the projects update their docs. My definition of tiered+leveled is the smallest (1 or more) levels use tiered compaction, then 0 or more levels use leveled-N, then the remaining levels use leveled. When tiered=T, leveled=L and leveled-N=N then the regex for this is T+N*L+.

I don't think it is a good idea for leveled levels to precede tiered levels in tiered+leveled (TTL is OK, LTL is not) but that is a topic for another post.

The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification.

LSM trees with tiered+leveled can be described using 3-tuples and the previous section does that but here I provide one for a tree that uses leveled-N for L1 and L2 with total fanout = 1000:

L0 is T:1:4
L1 is L:1:2
L2 is L:10:2
L3, L4 are L:10:1

And another example to show that tiered levels don't have to use the same fanout or runs-per-level, but runs-per-level for Ln == fanout for Ln+1:

L0 is T:1:20
L1 is T:20:10
L2 is T:10:2
L3 is L:5:1

Leveled-N

Leveled-N can reduce the per level write-amp at the cost of increasing the per level read-amp.

The regex for an LSM tree that uses leveled-N is N+L+ (see the previous section). The largest level should use leveled compaction with runs-per-level=1 to avoid too much space amplification. An example 3-tuple for leveled-N with fanout=1000 that has runs-per-level=2 for L1 and L2 is:

L1 is L:10:2
L2 is L:10:2
L3 is L:10:1

Monday, October 1, 2018

Transaction Processing in NewSQL

This is a list of references for transaction processing in NewSQL systems. The work is exciting. I don't have much to add and wrote this to avoid losing interesting links. My focus is on OLTP, but some of these systems support more than that.

By NewSQL I mean the following. I am not trying to define "NewSQL" for the world:

Support for multiple nodes because the storage/compute on one node isn't sufficient.
Support for SQL with ACID transactions. If there are shards then cross-shard operations can be consistent and isolated.
Replication does not prevent properties listed above when you are wiling to pay the price in commit overhead. Alas synchronous geo-replication is slow and too-slow commit is another form of downtime. I hope NewSQL systems make this less of a problem (async geo-replication for some or all commits, commutative operations). Contention and conflict are common in OLTP and it is important to understand the minimal time between commits to a single row or the max number of commits/second to a single row.

NewSQL Systems

MySQL Cluster - this was NewSQL before NewSQL was a thing. There is a nice book that explains the internals. There is a company that uses it to make HDFS better. Cluster seems to be more popular for uses other than web-scale workloads.
VoltDB - another early NewSQL system that is still getting better. It was after MySQL Cluster but years before Spanner and came out of the H-Store research effort.
Spanner - XA across-shards, Paxos across replicas, special hardware to reduce clock drift between nodes. Sounds amazing, but this is Google so it just works. See the papers that explain the system and support for SQL. This got the NewSQL movement going.
CockroachDB - the answer to implementing Spanner without GPS and atomic clocks. From that URL they explain it as "while Spanner always waits after writes, CockroachDB sometimes waits before reads". It uses RocksDB and they help make it better.
FaunaDB - FaunaDB is inspired by Calvin and Daniel Abadi explains the difference between it and Spanner -- here and here. Abadi is great at explaining distributed systems, see his work on PACELC (and the pdf). A key part of Calvin is that "Calvin uses preprocessing to order transactions. All transactions are inserted into a distributed, replicated log before being processed." This approach might limit the peak TPS on a large cluster, but I assume that doesn't matter for a large fraction of the market.
YugaByte - another user of RocksDB. There is much discussion about it in the