
For example: in essence, any tree that lets you "rotate" based on statistics about subtree popularity. This becomes more and more interesting in combination with copy-on-write strategies, because there you have to rewrite the top of the tree anyway. So why not aggregate some statistics while you're at it?


The "hot items" optimisation you're thinking of has less of an effect on trees with wide fanout, such as B-trees with a substantial block size.

This is because the ratio of "hot" items you can store at a reduced depth to the total number of items is much smaller in a high-fanout tree than in a low-fanout one; and at the same time, the depths are lower with high fanout, reducing the benefit of reduced depth anyway.

In other words, you can only give the benefit to a smaller number of hot items, and the benefit you can give is smaller.
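To make that concrete, here's a back-of-the-envelope sketch (the billion-item tree size is just an assumed number for illustration):

```python
import math

def hot_capacity(fanout, total_items):
    """How many items could live above the leaves, and at what depth."""
    depth = math.ceil(math.log(total_items, fanout))        # levels down to the leaves
    interior_slots = sum(fanout**i for i in range(depth))   # root..last interior level
    return interior_slots / total_items, depth

for fanout in (4, 256):
    ratio, depth = hot_capacity(fanout, 10**9)
    print(f"fanout={fanout:3d}: depth={depth:2d}, hot-slot ratio={ratio:.4f}")
```

With fanout 4, roughly a third of a billion items could sit above the leaves, and the full depth is about 15, so promotion saves a lot. With fanout 256, under 2% of items fit above the leaves, and the full depth is only about 4 to begin with.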

On top of that, the added complications of rotations and statistics mean fewer items can be stored in interior nodes (which I'd call index nodes) for a given node size. This reduces the fanout and increases the full depth.

And the rotations and access statistics cause more writing.

There are tradeoffs and perhaps it is worth it for some tree parameters. Perhaps for exceptionally large trees with particular access patterns. It would be interesting to see how it's implemented.

But generally for a high-fanout B-tree, you might as well just store frequently accessed items in a separate small tree as a cache, and keep that cached in memory.
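A minimal sketch of that separate-cache approach, with `disk_get` standing in for a full on-storage B-tree descent (the class and parameter names are made up for illustration):

```python
from collections import OrderedDict

class CachedTree:
    """An in-memory LRU cache in front of a (stubbed) on-storage B-tree."""
    def __init__(self, disk_get, capacity=1024):
        self.disk_get = disk_get     # stand-in for the real tree lookup
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as recently used
            return self.cache[key]
        value = self.disk_get(key)           # one full-depth tree descent
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used
        return value
```

Hot items stay in memory regardless of where they sit in the on-storage tree, without complicating the tree's own node format.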


You're absolutely right. However, a COW strategy forces you to rewrite parts of the tree _anyway_ (typically the path from the node you're updating to the root), so you might as well rotate an out-of-balance node in that path. The rewriting also makes you shy away from high fanouts because of the overhead.
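A toy illustration of why the statistics are nearly free under COW: a path-copying insert already reallocates every node on the root-to-leaf path, so bumping a hit counter (or rotating) on those nodes adds no extra node writes. This is a binary-tree sketch, not a B-tree, and the `hits` field is an invented example statistic:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    hits: int = 0  # example statistic aggregated during the rewrite

def insert(node, key):
    """Path-copying insert: every node on the root-to-leaf path is
    reallocated anyway, so updating statistics there is free."""
    if node is None:
        return Node(key)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right, node.hits + 1)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key), node.hits + 1)
    return node
```

The old version of the tree stays intact, which is the COW property; a rotation decision could be made at each copied node with the same zero marginal write cost.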


Well, I would definitely be interested to see the paper where that B-tree rotation is described along with measurements! There are many effective B-tree modifications; maybe that's one of them. At least with storage B-trees, I haven't seen a paper describing anything like that yet.

One of the "rules of thumb" of optimising high-fanout B-trees larger than memory is to generally put values in the leaves, because the benefit of being able to keep more pointers cached in memory in compact internal nodes to help locate values in fewer I/O steps outweighs the benefit of having an arbitrary and small subset of larger values in the internal nodes. But if you have particularly well-chosen larger values in the internal nodes, and a low-entropy access pattern, perhaps the rule of thumb doesn't apply then.

There are quite a few B-tree modifications where separate small trees or logs or both are maintained to improve performance, but these are generally to optimise writes, which can be sorted and distributed in batches to reduce random access block I/O, rather than to optimise frequently accessed reads by acting as an on-storage persistent cache.

On fanout:

High fanout is generally used with storage devices, because low fanout always increases the overhead.

For example on NVMe SSDs I've used, reading or writing any block smaller than 4096 bytes takes the same time as 4096 bytes. On HDDs the equivalent number is much larger due to seek time.

So lowering fanout below 4096 bytes per block would just increase the number of block I/Os per tree operation without reducing the time per I/O.
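As a rough illustration (assuming 16 bytes per key-plus-child-pointer entry and a billion keys, both made-up numbers):

```python
import math

KEYS = 10**9
ENTRY = 16  # assumed: 8-byte key + 8-byte child pointer

for block in (512, 4096):
    fanout = block // ENTRY
    ios = math.ceil(math.log(KEYS, fanout))  # block reads per cold lookup
    print(f"{block}-byte blocks: fanout={fanout}, ~{ios} I/Os per lookup")
```

If every sub-4096-byte read costs the same device time as a full 4096-byte read, the 512-byte layout pays for about 6 same-priced I/Os per cold lookup where the 4096-byte layout pays for about 4.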

On COW:

You're right about the rewriting, although it's worth knowing that times have changed: On NVMe you have durable write atomicity guarantees so you don't need to COW as much to have the same storage guarantees. Inside the NVMe of course it may be doing COW on hidden page remappings to provide this atomicity, or it may use other techniques (e.g. a supercapacitor to ensure writes always complete) but from the outside you can simply overwrite blocks or parts of blocks in place.

This adds quite a saving to safe tree updates, and it also reduces padding required in write-ahead logs at durable commit points, including the in-block logs used in several B-tree optimisations (e.g. see bcache on Linux).

Even with COW, it isn't always necessary to update the path to the root by rewriting parent blocks in general. To avoid a parent block write, sometimes it's faster to log an update to the parent's pointer entry, either in an actual logical log somewhere, or using a separate logical to physical page mapping table, or by using pointer-to-log entries that are resolved at read time.

If it's an in-memory B-tree, a COW strategy doesn't need to rewrite parent blocks at all. It can update the pointers in place directly with atomic pointer updates, although some implementations prefer a logical-to-physical page mapping table instead and to atomically update the table.
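A sketch of that page-mapping-table variant (all names invented; "atomic" here just means the publish step is a single reference assignment rather than a path rewrite):

```python
class PageTable:
    """Child links in tree nodes hold logical page IDs, not physical
    locations. Publishing a COW update is one table-entry swing, so
    parent nodes never need rewriting."""
    def __init__(self):
        self.physical = {}   # physical slot -> node payload
        self.logical = {}    # logical page id -> physical slot
        self.next_slot = 0

    def write(self, logical_id, payload):
        slot = self.next_slot
        self.next_slot += 1
        self.physical[slot] = payload      # new copy, written out of place
        self.logical[logical_id] = slot    # the single "pointer update"

    def read(self, logical_id):
        return self.physical[self.logical[logical_id]]
```

Readers that descend via logical IDs see either the old or the new page, never a torn mix, because the indirection entry is swapped in one step.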


NVMe devices can be split into regions with a different block size per region. If you look at the spec and are interested in building a storage system on top of them (as I was in a previous life), you will see they don't resemble an HDD at all. It's a shame to put a file system on top of them and throw away all the IOPS. Also, there's no such thing as an atomic write. HDDs, SSDs, and NVMe devices all lie to you in different ways. I've seen SSD power failures where every byte in a whole bank had its 2nd bit set. So it's an illusion to think that you don't break what you don't touch.

https://www.usenix.org/system/files/conference/fast13/fast13...


I tried to split an NVMe into different block sizes per region recently. It was to get the small performance improvement (~2%) of 4096-byte sectors, while needing to boot on a motherboard that needs 512-byte sectors. (I didn't want to deal with switching a remote server to UEFI and figuring out every issue remotely; it's a shame there isn't some trivial BIOS extension to support 4096-byte sectors.)

So I decided to format a boot region and main region.

What I found with the particular models I had was:

- Multiple regions were fine, and different block sizes (512 or 4096 bytes) were fine too, but they wouldn't accept different block sizes in different regions at the same time. Format commands failed when you tried that.

- The atomicity size and alignment were identical regardless of block size. Durable (power fail safe) atomicity was reported as 4096 byte aligned blocks. Write command overlapping atomicity was reported as 512kiB aligned blocks.

I'm not sure why you may think I think NVMe resembles a HDD regarding atomicity. HDDs don't claim to do atomic replacement of sector contents, and neither do non-NVMe SSDs.

However, NVMe specifies the atomicity, and the device reports atomicity parameters. Are you saying the NVMe devices lie about their atomicity as well? If so, that's a real shame, after going to the effort of reporting atomicity parameters in the first place. Or are you inferring it from experience with HDDs and non-NVMe SSDs?

With SSDs (at least low-quality non-NVMe ones) I'm not convinced you can make use of "you don't break what you don't touch". Blocks are remapped all over the place, so you don't know which banks you are touching, so if there are faults with the SSD you don't know which random blocks may become corrupt as a side effect of the blocks you are writing to, so you can't work around it.

In other words, such devices are not usable for anything if you need power fail safety. Atomicity is the least of your worries.

The most extreme awfulness I've heard of is USB flash reporting fake capacity. So when you write your important files to it, it's guaranteed to corrupt them. https://www.sqlite.org/howtocorrupt.html#_fake_capacity_usb_...

However, to the extent you decide to depend on a device not corrupting random unrelated data on power failure, it's hard to see why you should not include the NVMe-reported block atomicity in that assumption and modify your storage algorithm to take advantage of it.

Of course any device can fail catastrophically to meet its own basic specifications, such as your bit 2 in every byte example, and in general random corruption due to design or manufacturing issues. It doesn't make sense to design around those kinds of "unknown unknowns" at the B-tree and logging format design level though. Those need to be handled at a higher level, where we are willing to write off a database and perhaps switch to a backup.

Edit: ps. Thanks for the Usenix paper link. It's a good paper, and I'd hoped to see someone do those measurements. I hope to see a similar paper about NVMe someday.


| Are you saying the NVMe devices lie about their atomicity as well?

I've seen 'em do it, man.



