
What I find interesting about the performance of this type of hardware is how it affects the software we use for storage. The article talked about how the Linux kernel just can't keep up, but what about databases or KV stores? Are the trade-offs those types of solutions make still valid for this type of hardware?

RocksDB, and LSM algorithms in general, seem to be designed with the assumption that random block I/O is slow. It appears that, for modern hardware, that assumption no longer holds, and the software only slows things down [0].

[0] - https://github.com/BLepers/KVell/blob/master/sosp19-final40....



I have personally found that even the most primitive efforts at the single-writer principle and I/O batching in your software can make an orders-of-magnitude difference.

Saturating an NVMe drive with a single x86 thread is trivial if you change how you play the game. Using async/await and yielding to the OS is not going to cut it anymore. Latency with these drives is measured in microseconds. You are better off doing microbatches of writes (10–1000 µs wide) and pushing these to disk with a single thread that monitors a queue in a busy-wait loop (sort of like LMAX Disruptor but even more aggressive).
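
To make that concrete, here's a minimal C sketch of the pattern. Names and sizes (QCAP, RECSZ, the 64-record batch cap) are all illustrative, and the Disruptor-style claim/publish protocol a real implementation needs is elided:

    /* Sketch only: one dedicated writer thread busy-polls a shared ring
       and flushes micro-batches with a single write() each. A real version
       needs a claim/publish step so the flusher never reads a slot before
       the producer's memcpy lands; that is elided here. */
    #include <stdatomic.h>
    #include <string.h>
    #include <unistd.h>

    #define QCAP  4096
    #define RECSZ 512                      /* fixed-size records for simplicity */

    static char queue[QCAP][RECSZ];
    static atomic_uint head;               /* producers: next slot to claim */
    static atomic_uint tail;               /* flusher: next slot to drain   */

    void enqueue(const char *rec) {        /* producer side */
        unsigned slot = atomic_fetch_add(&head, 1) % QCAP;
        memcpy(queue[slot], rec, RECSZ);   /* no overflow check in this sketch */
    }

    void *writer_loop(void *arg) {         /* run this on its own pinned core */
        int fd = *(int *)arg;
        char batch[64 * RECSZ];
        for (;;) {                         /* busy wait: never sleeps, never yields */
            unsigned t = atomic_load(&tail), h = atomic_load(&head);
            unsigned n = h - t;
            if (n == 0) continue;          /* queue empty: keep spinning */
            if (n > 64) n = 64;            /* cap the micro-batch */
            for (unsigned i = 0; i < n; i++)
                memcpy(batch + i * RECSZ, queue[(t + i) % QCAP], RECSZ);
            write(fd, batch, n * RECSZ);   /* one syscall for the whole batch */
            atomic_store(&tail, t + n);
        }
    }

The point is the shape: producers only touch memory, and exactly one thread ever issues the write() syscalls.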

Thinking about high-core-count parts, sacrificing an entire thread to busy waiting so you can write your transactions to disk very quickly is not a terrible prospect anymore. This same ideology is also really useful for ultra-precise execution of future timed actions. Approaches in managed languages like Task.Delay or even Thread.Sleep are insanely inaccurate by comparison. The humble while(true) loop is certainly not energy efficient, but it is very responsive and predictable as long as you don't ever yield. What's one core when you have 63 more to go around?
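
For the timing half, a sketch of that while(true) loop in C against CLOCK_MONOTONIC; the jitter is roughly one clock_gettime call rather than a scheduler quantum:

    #include <stdint.h>
    #include <time.h>

    static inline uint64_t now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* Fire an action at deadline_ns with sub-microsecond jitter by spinning.
       nanosleep()/Thread.Sleep-style waits can be off by a scheduler quantum;
       this is off by roughly the cost of one clock read. */
    void spin_until(uint64_t deadline_ns) {
        while (now_ns() < deadline_ns)
            ;                              /* deliberately no yield, no sleep */
    }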


Isn't the use or non-use of async/await a bit orthogonal to the rest of this?

I'm not an expert in this area, but wouldn't it be just as lightweight to have your async workers pushing onto a queue, and then have your async writer only wake up when the queue is at a certain level to create the batched write? Either way, you won't be paying the OS context switching costs associated with blocking a write thread, which I think is most of what you're trying to get out of here.
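
Something like this, as a hedged C sketch (the threshold and the elided enqueue are illustrative):

    #include <pthread.h>

    #define BATCH_THRESHOLD 64

    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static int pending = 0;                /* records queued but not flushed */

    /* Worker side: enqueue (elided) and only signal once a batch is "full",
       so the writer isn't woken once per record. */
    void producer_push(void) {
        pthread_mutex_lock(&mu);
        pending++;                         /* real code would also copy the record */
        if (pending >= BATCH_THRESHOLD)
            pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mu);
    }

    /* Writer side: sleep until a full batch exists, then drain it in one go. */
    void writer_wait_and_flush(void) {
        pthread_mutex_lock(&mu);
        while (pending < BATCH_THRESHOLD)
            pthread_cond_wait(&cv, &mu);
        int n = pending;
        pending = 0;
        pthread_mutex_unlock(&mu);
        /* flush n records with one pwritev()/io_uring submission here */
        (void)n;
    }

In practice you'd use pthread_cond_timedwait so a partial batch still flushes after some deadline, otherwise you strand the tail of the queue below the threshold.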


Right, I agree. I'd go even further and say that async/await is a great fit for a modern asynchronous I/O stack (not read()/write()). Especially with io_uring using polled I/O (the worker thread is in the kernel, all the async runtime has to do is check for completion periodically), or with SPDK if you spin up your own I/O worker thread(s) like @benlwalker explained elsewhere in the thread.
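
Roughly this shape with liburing (a sketch, not production code: error handling elided, and IORING_SETUP_IOPOLL requires a file opened with O_DIRECT and block-aligned buffers):

    #include <liburing.h>

    /* One submission plus a non-blocking completion check: roughly what an
       async runtime's reactor would do on each tick. */
    void submit_and_poll(int fd, const void *buf, unsigned len) {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL); /* polled I/O mode */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, len, 0);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        /* The runtime just peeks periodically instead of parking a thread. */
        while (io_uring_peek_cqe(&ring, &cqe) != 0)
            ;                          /* spin, or go do other work, until done */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
    }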


Very interesting. I'm currently designing and building a system which has a separate MCU just for timing-accurate stuff rather than taking on the burden of realtime kernel stuff, but I never considered just dedicating a core. Then I could also use that specifically to handle some IO queues too perhaps, so it could do double duty and not necessarily be wasteful. Thanks... now I need to go figure out why I either didn't consider that, or perhaps I did and discarded it for some reason beyond me right now. Hmm... thought-provoking post of the day for me


The authors of the article I linked to earlier came to the same conclusions. And so did the SPDK folks. And the kernel community (or axboe :)) when coming up with io_uring. I'm just hoping that we will see software catching up.


>Latency with these drives is measured in microseconds.

For context and to put numbers around this, the average read latency of the fastest, latest-generation PCIe 4.0 x4 U.2 enterprise drives is 82–86 µs, and the average write latency is 11–16 µs.
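
(Rough arithmetic on those numbers: at ~15 µs per write, a strictly serial issuer tops out around 1 s ÷ 15 µs ≈ 66k writes/s, so reaching a drive's rated IOPS takes batching or queue depth, not faster syscalls.)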


ScyllaDB had a blog post once about how surprisingly little CPU time is available to process packets on the fastest modern networks, 40 Gbit and the like.

I can't find it now. I think they were trying to say that Cassandra can't keep up because of the JVM overhead and you need to be close to the metal for extreme performance.

This is similar. Huge amounts of flooding I/O from modern PCIe SSDs really close the traditional gap between CPU and "disk".

The biggest limiter in the cloud right now is EBS/SAN. Sure, you can use local storage in AWS if you don't mind it disappearing, and while gp3 is an improvement, it pales next to stuff like this.

Also, this is fascinating:

"Take the write speeds with a grain of salt, as TLC & QLC cards have slower multi-bit writes into the main NAND area, but may have some DIMM memory for buffering writes and/or a “TurboWrite buffer” (as Samsung calls it) that uses part of the SSDs NAND as faster SLC storage. It’s done by issuing single-bit “SLC-like” writes into TLC area. So, once you’ve filled up the “SLC” TurboWrite buffer at 5000 MB/s, you’ll be bottlenecked by the TLC “main area” at 2000 MB/s (on the 1 TB disks)."

I didn't know controllers could swap between TLC/QLC and SLC.


Hi! From ScyllaDB here. There are a few things that help us really get the most out of hardware and network IO.

1. Async everywhere - We use AIO and io_uring to make sure that your inter-core communications are non-blocking.

2. Shard-per-core - It also helps if specific data is pinned to a specific CPU, so we partition on a per-core basis. Avoids cross-CPU traffic and, again, less blocking (a rough sketch of the idea follows this list).

3. Schedulers - Yes, we have our own IO scheduler and CPU scheduler. We try to get every cycle out of a CPU. Java is very "slushy" and though you can tune a JVM it is never going to be as "tight" performance-wise.

4. Direct-attached NVMe > networked-attached block storage. I mean... yeah.
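
A back-of-the-napkin C sketch of what shard-per-core routing means (not Scylla's actual code; Seastar does this in C++ with far more machinery):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>

    #define NSHARDS 8   /* one shard per core; count chosen for illustration */

    /* Owner core for a key: the same hash everywhere means a given key is
       always handled by the same core, so its data needs no cross-CPU locks. */
    static unsigned shard_for_key(uint64_t key_hash) {
        return key_hash % NSHARDS;
    }

    /* Pin the calling shard thread to its core so the OS never migrates it. */
    static void pin_to_core(unsigned core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

Requests for a key get forwarded to the owning shard's queue instead of taking locks on shared state.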

We're making Scylla even faster now, so you might want to check out our blogs on Project Circe:

• Introducing Project Circe: https://www.scylladb.com/2021/01/12/making-scylla-a-monstrou...

• Project Circe January Update: https://www.scylladb.com/2021/01/28/project-circe-january-up...

The latter has more on our new scheduler 2.0 design.


> I didn't know controllers could swap between TLC/QLC and SLC.

I wish I could control the % of SLC. Even dividing QLC capacity by 4 (one bit per cell instead of four) makes it cheaper than buying a similarly sized SLC drive


I learned the last bit from here (Samsung Solid State Drive TurboWrite Technology pdf):

https://images-eu.ssl-images-amazon.com/images/I/914ckzwNMpS...


Yes, a number of articles about these newer TLC drives talk about it. The end result is that an empty drive is going to benchmark considerably differently from one 99% full of incompressible files.

for example:

https://www.tomshardware.com/uk/reviews/intel-ssd-660p-qlc-n...


A paper on making LSM more SSD friendly: https://users.cs.duke.edu/~rvt/ICDE_2017_CameraReady_427.pdf


Thanks for sharing this article - I found it very insightful. I've seen similar ideas being floated around before, and they often seem to focus on what software can be added on top of an already fairly complex solution (while LSM can appear to be conceptually simple, its implementations are anything but).

To me, what the original article shows is an opportunity to remove - not add.


Reminds me of the Solid-State Drive checkbox that VirtualBox has for any VM disks. Checking it will make sure that the VM hardware emulation doesn't wait for the filesystem journal to be written, which would normally be advisable with spinning disks.


If you think about it from the perspective of the authors of large-scale databases, linear access is still a lot cheaper than random access in a datacenter filesystem.


Not only the assumptions at the application layer, but potentially the filesystem too.


Disappointed there was no LMDB comparison in there.



