This is pretty cool. Just want to copy-paste a paragraph from your link that sums up how it's different from other tools:
>Cloud service providers like AWS and Azure leverage verification software such as the TLA+ model checker to achieve the same goal, but whereas those solutions typically verify a high level system design, Stateright is able to verify the underlying system implementation in addition to the design (along with providing other unique benefits explained in the "Comparison with TLA+" chapter). On the other end of the spectrum are tools such as Jepsen which can validate a final implementation by testing a random subset of the system's behavior, whereas Stateright can systematically enumerate all possible behaviors within a specified model of the system.
This is really cool! Also, `JobMarket` is a pun I'm surprised I haven't seen before.
One recommendation for the guide: you know the tool much better than the audience, so you're going to underestimate how hard it is for readers to follow. It's something all experts do all the time. Have you considered smaller, shorter examples for each step?
I’m thrilled to hear that as I’m a big fan of your work!
Thank you for the suggestion as well. Do you have thoughts on specific places where I should split a chapter and/or its associated project to make the ramp-up more incremental?
Hm. So the easiest, short-term thing might be more annotations. For example, are `on_start` and `on_msg` Rust-isms, or are they particular to this project? If the latter, I'd suggest putting a comment describing what they do in the code itself, plus a separate page describing how they work (with a small, easily modifiable example; I've sketched what I have in mind below this list). Things like that would go a long way toward getting people to the expertise level. Past that, I get the impression there are three ideas here:
1. Actor systems; writing code with Stateright
2. Model checking; model checking with Stateright
3. The actual problems we want to apply Stateright to
Right now you're teaching all three of those simultaneously; I think it would be better if you could scaffold, or introduce the ideas one at a time. For example, showcasing setup and model checking first on a property simpler than linearizability. I've been really fond of race-free critical sections for concurrent properties recently, just because they have relatively few moving parts.
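Re: the annotation idea above, here's the kind of deliberately simplified sketch I mean. This is a made-up trait in the spirit of what I understand Stateright's `Actor` to be, not the crate's real signatures (check its docs for those):

```rust
// Hypothetical, simplified Actor trait for illustration only.
trait Actor {
    type Msg;
    type State;

    /// Runs once when the actor boots: returns the initial state and may
    /// enqueue outgoing messages into `outbox` as (destination, message).
    fn on_start(&self, outbox: &mut Vec<(usize, Self::Msg)>) -> Self::State;

    /// Runs for each delivered message: may mutate the state and enqueue replies.
    fn on_msg(
        &self,
        state: &mut Self::State,
        src: usize,
        msg: Self::Msg,
        outbox: &mut Vec<(usize, Self::Msg)>,
    );
}

/// Example: an actor that counts pings and answers each with a pong.
struct Ponger;

enum Msg {
    Ping,
    Pong,
}

impl Actor for Ponger {
    type Msg = Msg;
    type State = u64; // number of pings seen so far

    fn on_start(&self, _outbox: &mut Vec<(usize, Msg)>) -> u64 {
        0 // start with no pings seen; nothing to send yet
    }

    fn on_msg(&self, state: &mut u64, src: usize, msg: Msg, outbox: &mut Vec<(usize, Msg)>) {
        if let Msg::Ping = msg {
            *state += 1;
            outbox.push((src, Msg::Pong)); // reply to the sender
        }
    }
}
```

Even a toy like this gives readers something small to poke at before the real trait shows up.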
Sorry these thoughts are a little disjointed; writing guides is really, really hard.
EDIT: probably not something that would be directly helpful to people, but one thing I'd personally be very interested in reading is how this all works under the hood. If I'm understanding correctly, you're making each Actor inspectable by `ActorModel`, which then acts as an orchestrator and records history. Is that correct?
Thanks for these insights! I think retaining the actor example while postponing the content on model checking and linearizability might be the way to go. More details on related Rust concepts could be helpful too, given that this may be the first encounter w/ the language for many readers. With that in mind, I’m inclined to break the first chapter into 3:
1. A high level overview of distributed systems and the actor model. Perhaps no code.
2. A Rust intro via an implementation of the Actor trait (which is where the `on_msg`/etc methods can be explained). The chapter would include the same Actor implementation but would omit verification.
3. An intro to system properties, nondeterminism, and verification via model checking. Here the book could postpone a discussion of linearizability and instead use a simpler property for the actor system. Linearizability could then be introduced when discussing replication later in the book.
I can also add a supplementary chapter explaining how it works under the hood. You’re correct about `ActorModel`: it’s a reusable model of actor system semantics parameterized by an `Actor` implementation. Its “dual” is the `spawn` function that lets you run that same `Actor` implementation on a real network.
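If it helps to make the orchestrator idea concrete, here's a deliberately tiny toy of the concept (my own sketch, not Stateright's actual code or API): the "world" is every actor's state plus the set of in-flight messages, and the checker explores every possible delivery order, checking an invariant in each reachable state.

```rust
use std::collections::BTreeSet;

#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct World {
    counters: Vec<u64>,    // one counter per actor (the actor states)
    in_flight: Vec<usize>, // destination of each pending "increment" message
}

fn explore(world: World, seen: &mut BTreeSet<World>, invariant: &dyn Fn(&World) -> bool) {
    if !seen.insert(world.clone()) {
        return; // state already visited via another interleaving
    }
    assert!(invariant(&world), "invariant violated in a reachable state");
    // Nondeterminism: any in-flight message could be delivered next.
    for i in 0..world.in_flight.len() {
        let mut next = world.clone();
        let dst = next.in_flight.remove(i);
        next.counters[dst] += 1; // the actor's message handler, inlined for this toy
        explore(next, seen, invariant);
    }
}

fn main() {
    // Two actors; three increment messages already in flight.
    let init = World { counters: vec![0, 0], in_flight: vec![0, 1, 1] };
    let mut seen = BTreeSet::new();
    explore(init, &mut seen, &|w| w.counters.iter().sum::<u64>() <= 3);
    println!("explored {} distinct states", seen.len());
}
```

Within the model's bounds, every reachable `World` is visited exactly once, which is what lets a checker make exhaustive claims that a randomized tester can't.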
Imagine that you are doing this at web scale. Instead of one web server you have 2000, and another 1000 middleware servers, all talking to a distributed database cluster with another 500 machines. You can probably run your web servers close to stateless, sharing sessions across Redis. And your middleware worker machines can all run pub/sub, so as a web dev you won't have this problem much. But the developers of your distributed key/value store, your distributed pub/sub queueing system, and your database are going to be very different animals. Take a look at the last 10 years of Kyle Kingsbury's work with Jepsen and you'll see what I mean (https://aphyr.com/tags/jepsen). Opportunities for corruption, unavailability, and lossage abound with distributed systems.
Imagine you're making a theatre booking system, where groups can reserve tickets in blocks. Lots of people are reserving tickets at the same time, both individually and in groups. How do you enforce that groups are mostly booked together? What if there's a contiguous block when they start purchasing, but then another individual wants one of those seats before the group finishes purchasing? What if one person in the group needs to cancel: do you pick a seat in the middle and let an individual "split" an existing group block, or do you pick one off the edge and shuffle the seats? Does that change if the group spans two rows? Do any of these decisions change if the tickets are "family" tickets instead of "group" tickets? What if some of the individual tickets are "floating" tickets and not strictly assigned seats? What about wheelchair accessible seats?
Whatever you end up deciding, does your system, where many people are reserving, cancelling, and re-reserving over long periods of time, guarantee all of your clients' expectations?
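This is where a checker earns its keep: each of those expectations can be pinned down as an invariant that gets asserted in every reachable state. A minimal sketch of one of them, with all names made up for illustration:

```rust
use std::collections::{HashMap, HashSet};

#[derive(Clone)]
struct Booking {
    holder: String,      // an individual or a group
    seats: HashSet<u32>, // seat numbers this booking currently holds
}

/// Invariant: no seat is ever held by two bookings at the same time.
fn no_double_booking(bookings: &[Booking]) -> bool {
    let mut owner: HashMap<u32, &str> = HashMap::new();
    for b in bookings {
        for &seat in &b.seats {
            if owner.insert(seat, &b.holder).is_some() {
                return false; // a previous booking already held this seat
            }
        }
    }
    true
}
```

The hard part is the workflow model around it (reserve, cancel, split, re-reserve); the invariant itself stays this small.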
Imagine that you are designing an electronic bank vault door. The product requirements state that the door should be secure when closed and should open only when authorized. Oh, and in an emergency, the vault should fail closed unless there are people inside.
How would you prove to the client that your solution implements these requirements? Modeling the system using a language like TLA+ is one way of doing that.
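For instance, those requirements distill into a safety property over a small state machine. A rough sketch (made-up names; a TLA+ spec would express the same thing declaratively, this just shows the shape of the property):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Door {
    Open,
    Closed,
}

#[derive(Clone, Copy)]
struct Vault {
    door: Door,
    authorized: bool,    // is an authorized open in progress?
    emergency: bool,     // has the alarm fired?
    people_inside: bool, // is anyone still in the vault?
}

/// Safety property: the door may be open only during an authorized access,
/// or during an emergency while people are still inside (fail closed otherwise).
fn safe(v: &Vault) -> bool {
    v.door == Door::Closed || v.authorized || (v.emergency && v.people_inside)
}
```

A model checker then drives every sequence of events (authorize, open, close, alarm, evacuate) and asserts `safe` in each reachable state.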
How would I prove to the client that the implementation conforms to the model? Wouldn't it end up being written in some crappy programming language for which no one can agree on a formal semantics anyway, running on some grotesquely complicated processor family full of defects and bugs?
Regarding the last point — correct, Stateright aims to verify both.
It’s important to clarify that this doesn’t provide a proof of correctness, but it can dramatically improve confidence in both the design and implementation compared with fuzz testing, for example. This is done by exhaustively enumerating possible nondeterministic outcomes (e.g. due to message reordering) within specified constraints (e.g. up to S servers and C clients performing X operations…).
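To give a feel for what "exhaustively enumerating" means here: with n undelivered messages there are n! delivery orders, and each one is a distinct behavior to examine. A toy illustration of that enumeration (my own sketch, not Stateright code):

```rust
// Enumerate every delivery order of a set of pending messages. A fuzzer
// samples some of these orders at random; an exhaustive checker visits
// every one exactly once.
fn delivery_orders(pending: Vec<u8>, prefix: Vec<u8>, all: &mut Vec<Vec<u8>>) {
    if pending.is_empty() {
        all.push(prefix);
        return;
    }
    for i in 0..pending.len() {
        let mut rest = pending.clone();
        let msg = rest.remove(i);
        let mut next = prefix.clone();
        next.push(msg);
        delivery_orders(rest, next, all);
    }
}

fn main() {
    let mut all = Vec::new();
    delivery_orders(vec![1, 2, 3], Vec::new(), &mut all);
    assert_eq!(all.len(), 6); // 3! orderings, each examined exactly once
    println!("{all:?}");
}
```

Random sampling scales further, but only enumeration can tell you that a bad ordering doesn't exist within the stated bounds.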
We've been working on some techniques for this. To my understanding, Stateright (the OP's repo) directly connects the spec with Rust code, so you can make some implementation guarantees.
If you don't have that (say you're using TLA+), one really promising technique is to generate behavioral traces and then refine those into code tests. I know a few companies have done this successfully, and I've been working on some examples of how it's done, but to my knowledge there's no battle-tested tooling that does this automatically. It's all bespoke for now.
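As a sketch of the shape this takes (entirely hypothetical names; in practice the trace comes out of the spec's model checker): a generated trace is a sequence of actions paired with expected states, and the refined test replays it against the real implementation.

```rust
#[derive(Debug)]
enum Action {
    Incr,
    Reset,
}

/// The "real" implementation under test.
struct Counter {
    n: u64,
}

impl Counter {
    fn apply(&mut self, a: &Action) {
        match a {
            Action::Incr => self.n += 1,
            Action::Reset => self.n = 0,
        }
    }
}

#[test]
fn replays_spec_trace() {
    // In practice this trace would be generated from the TLA+ model.
    let trace = [(Action::Incr, 1), (Action::Incr, 2), (Action::Reset, 0)];
    let mut counter = Counter { n: 0 };
    for (action, expected) in &trace {
        counter.apply(action);
        assert_eq!(counter.n, *expected, "diverged after {action:?}");
    }
}
```

The bespoke part is the mapping from spec-level states and actions to implementation-level ones; the replay loop itself is trivial.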
> Wow, these formal specification tools is something I always wanted to look into but the learning curve seems steep.
One of my dreams is to flatten that learning curve. I don't think we'll ever make these tools effortless, but I think it's possible to reach a point where "these tools are too hard to learn" isn't a barrier anymore.
(Even then, I imagine a lot of people will learn some formal specification, and think "this isn't useful for me", and never touch them again. And that's fine. But I want anybody to be able to reach the point where they can make that decision, as opposed to feeling like they're locked out by default. We have a long way to go, but I think we're considerably further along than we were five years ago.)
> Having a specification which is guaranteed to be correct before implementation begins is a programmer’s wet dream.
I wouldn't say guaranteed to be correct, but at least it's, like, closer to correct? Less likely to have horrific bugs, or bizarre edge cases in the requirements that nobody notices until neck-deep in the implementation. Stuff that increases well-founded confidence.
As a regular web dev, you are constantly staring a distributed system in the face: it’s made of the program running on the server (back end) and the program running in the web browser (front end). How do those two programs agree on pieces of common information? How does the server deal with communications from other clients? At what cadence? How are the caches invalidated? Regular web dev is anything but simple if you seriously consider the questions it poses. Tools like this help you think through such questions.
Imagine you are implementing a database engine and you want your database to run on multiple nodes in order to make it fault-tolerant. You therefore need to implement some kind of consensus algorithm (Paxos, Raft) to select a leader and followers, so that if some node fails, your entire cluster can keep working.
That said, implementing a consensus algorithm is quite hard. Modeling and testing it is much harder; there is even a language just for this purpose: TLA+ [0].
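To give a flavor: the core safety property of Raft-style leader election, "at most one leader per term", is exactly the kind of thing these tools check in every reachable state. A hypothetical sketch (made-up names, not from any real spec):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Clone, Copy)]
struct Node {
    role: Role,
    term: u64,
}

/// Election safety: no term ever has two leaders.
fn at_most_one_leader_per_term(nodes: &[Node]) -> bool {
    let mut leaders: HashMap<u64, usize> = HashMap::new();
    for n in nodes {
        if n.role == Role::Leader {
            *leaders.entry(n.term).or_insert(0) += 1;
        }
    }
    leaders.values().all(|&count| count <= 1)
}
```

A checker then drives all the election steps (timeouts, votes, dropped and reordered messages) and asserts this invariant along every path.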
A web page can be seen as a client to Paxos/Raft: a non-voting participant. It is a way to structure an application for resilience, scalability, and/or availability.
It includes a comparison w/ TLA+/TLC for those familiar with those technologies.