I have written an in-house implementation of protobuf for C++ (sorry can't share) and studied the wire format extensively.
Google's implementations, at least C++ and Java, are a bunch of bloated crap (or maybe they're very good, but for a use case that I haven't yet encountered). Don't shoot down the format because of a specific implementation; find or write a better one and enjoy the fact that every language has at least some implementation available.
Then, the author laments that the wire format itself uses varint-encoded length prefixes, which cannot be fixed up in place. That is true, but it's not that much of a problem. The most straightforward approach is to go through nested data multiple times: once to calculate the length, and again for the actual encoding (and then again and again for deeper nesting).
What makes this bearable is the fact that data will be mostly loaded into L2 cache (L1 for smaller messages) on the first pass, which makes the next pass much faster.
The story breaks down for large, deeply nested messages, but then, the topic here is telemetry which I would expect to consist of a stream of small, shallow messages.
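The two-pass idea can be sketched in a few lines of C++. This is a hand-rolled illustration, not protobuf's actual code; the message shape (an outer message whose field 1 is a nested message whose field 1 is a varint) is made up for the example:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes needed to encode v as a protobuf varint (7 payload bits per byte).
std::size_t varint_size(uint64_t v) {
    std::size_t n = 1;
    while (v >= 0x80) { v >>= 7; ++n; }
    return n;
}

void write_varint(std::string& out, uint64_t v) {
    while (v >= 0x80) { out.push_back(char(v | 0x80)); v >>= 7; }
    out.push_back(char(v));
}

// Encode { outer.field1 = { inner.field1 = value } }.
// Pass 1 computes the inner payload size so its length prefix is known
// before pass 2 writes any bytes.
std::string encode(uint64_t value) {
    const uint64_t inner_tag = (1 << 3) | 0;  // field 1, wire type 0 (varint)
    std::size_t inner_size = varint_size(inner_tag) + varint_size(value);  // pass 1

    std::string out;
    const uint64_t outer_tag = (1 << 3) | 2;  // field 1, wire type 2 (length-delimited)
    write_varint(out, outer_tag);
    write_varint(out, inner_size);            // length prefix, known up front
    write_varint(out, inner_tag);             // pass 2: re-encode the nested data
    write_varint(out, value);
    return out;
}
```

For deeper nesting the size pass recurses before the write pass, which is exactly the "again and again" above.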
> Google's implementations, at least C++ and Java, are a bunch of bloated crap (or maybe they're very good, but for a use case that I haven't yet encountered).
Right, they were designed for use in Google's servers, where binary size is mostly irrelevant, while speed and features (e.g. reflection) are valued.
"Lite mode" (not mentioned in the article, for some reason) optimizes for code size instead. Admittedly, it's not as small as an implementation written from scratch with minimal code footprint as the main goal, but it does cut the size quite a bit...
(Disclaimer: I wrote those C++ and Java implementations, but that was 10-15 years ago, things may have changed in the meantime...)
> HN is one of the few places you can talk smack about a company's tooling and have the person who wrote it reply!
Not just that, I suspect it’s a sign of HN’s unique culture that the “smack talk” seems not to be taken personally and instead a highly nuanced and interesting discussion ensues!
> Google's implementations, at least C++ and Java, are a bunch of bloated crap (or maybe they're very good, but for a use case that I haven't yet encountered).
As someone who has been working on protobuf-related things for >10 years, including creating a size-focused implementation (https://github.com/protocolbuffers/upb), and has been working on the protobuf team for >5 years, I have a few thoughts on this (thoughts are my own, and I don't speak for anybody else).
I think it is true that protobuf C++ could be a lot more lean than it currently is (I can't speak to Java as I don't work on it directly). That's why I created upb to begin with. But there's also a bit more to this story.
The protobuf core runtime is split into two parts, "lite" and "full". If you don't need reflection for your protos, it's better to use "lite" by putting "option optimize_for = LITE_RUNTIME" in your .proto file (https://developers.google.com/protocol-buffers/docs/proto#op...). That will cut out a huge amount of code size from your binary. On the downside, you won't get functionality that requires reflection, such as text format, JSON, and DebugString().
Even the lite runtime can get "lighter" if you compile your binary to statically link the runtime and strip unused symbols with -ffunction-sections/-fdata-sections/--gc-sections flags. Some parts of the lite runtime are only needed in unusual situations, like ExtensionSet which is only used if your protos use proto2 extensions (https://developers.google.com/protocol-buffers/docs/proto#ex...). If you avoid these cases, the lite runtime is quite light.
However, there is also the issue of the generated code size. Unlike the runtime, this is not a fixed cost, but is proportional to the number of messages you use. If you have a lot of messages it can quickly dwarf the size of the runtime. For this reason, C++ also supports "option optimize_for = CODE_SIZE" which uses reflection-based algorithms for all parsing/serialization/etc instead of using generated code. This means you pay the fixed size hit from linking in the full runtime, but the generated code size is much smaller. On the downside, "optimize_for = CODE_SIZE" has a severe ~10x speed penalty for parsing and serialization.
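For reference, the option is set per .proto file, and only one optimize_for value can be active at a time (the message and field names below are made up for illustration):

```proto
syntax = "proto2";

// Smallest runtime, no reflection:
option optimize_for = LITE_RUNTIME;
// ...or, to shrink generated code at a ~10x parse/serialize speed cost:
// option optimize_for = CODE_SIZE;

message Sample {
  optional int64 timestamp_us = 1;
  optional double value = 2;
}
```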
I have long had the goal of making https://github.com/protocolbuffers/upb competitive with protobuf C++ in speed while achieving much smaller code size. With the benefit of 10 years of hindsight and many wrong turns, upb is beginning to meet and even surpass these goals. It is an order of magnitude smaller than protobuf C++, both in the core runtime and the generated code, and after some recent experiments it is beginning to significantly surpass it in speed also (I want to publish these results soon, but the code was merged in this PR: https://github.com/protocolbuffers/upb/pull/310).
upb has downsides that prevent it from being fully "user ready" yet: the API is still not 100% stable, there is no C++ API for the generated code yet (and C APIs for protobuf are relatively verbose and painful), it has a bunch of legacy APIs sitting around that I am just on the verge of being able to finally delete, and it doesn't support proto2 extensions yet. On the upside, upb is 100% conformant on every other protobuf feature, it supports reflection, JSON, and text format, but also lets you omit these if you don't want to pay the code size.
I hope 2021 is a year when I'll be able to publish more about these results, and when upb will be a more viable choice for users who want a smaller protobuf implementation.
What is this thing with JSON support? Don't people use pb so they do not have to deal with JSON? I'd expect that for a truly lean pb implementation, adding JSON is a 300% increase in code size?
People implementing public-facing APIs might not be comfortable forcing all the consumers of that API to use Protobuf, but may want to use Protobuf internally within their service. So, Protobuf lets you convert to/from JSON, so that your service can expose a JSON API while maintaining Protobuf schemas and never dealing with a JSON parser directly.
The implementation is based on Protobuf's existing reflection features, so if you are already compiling with that enabled, the JSON implementation does not add much code bloat on top of that.
PB is used for a wide variety of things; it's a lingua franca of data interchange within Google, but web (frontend) still often uses JSON. So if I want to have some proto in a database and render it on a UI to a user, somewhere in there I'm probably going to be translating the raw proto to either JSON or JSON-like (I'll admit I'm not sure how stuff like grpc-web fits in here, if you can just get proto-serialized bytes passed around in a blob, but there are some pretty low-overhead JSON proto encodings that exist anyway).
But it's somewhat ugly as well, from an architecture point of view. Because that argument translates to anything else you might want to do with these objects.
In past libraries for C++ I've tried to prevent this kind of coupling by adding generated template methods like "walk(f)" where f would be a templated callable, called with a descriptor and data reference for each field. Any kind of pretty printer or SQL statement can be built that way.
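Something like this, presumably -- a hand-written sketch of what such generated code could look like (the descriptor and message types here are invented for illustration, not from any real library):

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical per-field descriptor: just a name and field number here.
struct FieldDescriptor { const char* name; int number; };

// Hand-written stand-in for what a generator could emit: walk(f) calls f
// once per field with the descriptor and a reference to the data, so any
// pretty-printer or SQL builder can be layered on top without coupling the
// message type to one output format.
struct Order {
    int64_t id = 0;
    std::string symbol;

    template <typename F>
    void walk(F&& f) {
        f(FieldDescriptor{"id", 1}, id);
        f(FieldDescriptor{"symbol", 2}, symbol);
    }
};

// One possible consumer: a generic pretty-printer built on walk().
std::string pretty(Order& o) {
    std::string out;
    o.walk([&](FieldDescriptor d, auto& value) {
        out += d.name;
        out += '=';
        if constexpr (std::is_same_v<std::decay_t<decltype(value)>, std::string>)
            out += value;
        else
            out += std::to_string(value);
        out += ' ';
    });
    return out;
}
```

The same walk() can drive a binary encoder, an SQL INSERT builder, or a hasher, all without the message type knowing about any of them.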
> Don't people use pb so they do not have to deal with JSON?
I think that's one reason, but there are others too, like getting generated type-safe accessors and an explicit schema. I've sometimes used protobuf for this reason even when I'm not planning to use binary format.
> I'd expect that for a truly lean pb implementation, adding JSON is a 300% increase in code size?
Here is a breakdown of a binary that uses upb and links in binary format, JSON format, text format, and some generated protos:
Things may have improved since, but the implementations are somehow very large and slow.
Things may have changed since, but AFAIK the C++ implementation would always allocate on the heap for nested messages, and perhaps even for optional scalars. This may be optimal for larger documents, but not for smallish messages (my use case was market data and trading instructions). I measured certain small messages, where an encode/decode pair would take over a microsecond with Google's implementation, but about 50 ns with a simpler one (versus 15 ns for a memcpy).
For Java my experience is mostly with the API itself, which felt very heavy.
Edit: I think a lot depends on your use case. I use protobuf mostly as a 'trusted' protocol. If someone didn't set a required field, I don't care. Some bloat may have to do with verifications that I've never needed.
This has never been the case, except for string fields where std::string forces us to allocate.
Ideally we will eventually use std::string_view for string accessors instead of std::string, so that even string data can be allocated on an arena instead of the heap.
> This has never been the case, except for string fields where std::string forces us to allocate.
I was quite surprised you didn't offer your own string_view implementation (or something similar) the last time I looked at protobuf. I'd naively assume that inside Google this could be quite a low-effort, high-reward optimization.
The internal version of protobuf lets you switch individual string fields to string_view using [ctype=STRING_PIECE], but migrating the default away from std::string is mainly just an enormous migration challenge.
Internally we also do something slightly nuts: we break the encapsulation of std::string so that we can point it to arena-allocated memory (we then "steal" the memory back before the destructor runs). We can only afford to do this internally, where the implementation of std::string is known. The real long-term solution is to move to string_view.
Since all other comments appear to contradict you and make apologies from authority (after all, Google can't do anything wrong, right?), I'd like to just reassure you with my 20 years of experience developing C (and the last 10 with C++) in network and system software: any library -- any library -- that forces internal dynamic memory upon its user smells bad. It's not a universal condemnation, but it invites skepticism and the question: why?
In the case of protobufs, there is no real answer. Protobufs is one of the few network libraries that doesn't operate with zero-copy. That steps over a tangible line in the sand.
No zero-copy for networking? Forced internal heap allocations with only this arena feature after a decade? Sorry no. Protobufs isn't useful for serious network applications.
Unless you are targeting a constrained embedded system, this kind of anti-memory-allocation thinking is counter-productive. Avoiding memory allocation significantly increases the complexity of an API -- or worse, leads to shortcuts like using (thread-unsafe) globals or (overrun-prone) fixed buffers.
In C, dealing with memory allocation is such a pain that C programmers still tend to avoid it. But C++, especially post-C++11, makes dynamic memory allocation much, much easier to deal with. (And GC'd languages, obviously, are easier still.) There is still a performance cost, obviously, but that cost almost never matters in application programming use cases, and is even negligible in many systems programming cases.
I do think Protobuf does too much allocation, but saying that libraries should not allocate memory at all is an outdated view.
> Avoiding memory allocation significantly increases the complexity of an API -- or worse, leads to shortcuts like using (thread-unsafe) globals or (overrun-prone) fixed buffers.
Allow me to unpack this, because the object interface that all of these serialization schemes present is what increases complexity. It makes sense that the simple implementation would tend toward dynamic memory and not further compound such complexity on their interface. This is a flawed architectural premise taken by these library implementations.
On the receiving side, there is little reason to ever deviate from zero-copy/zero-allocate. These libraries are far too proactive in deserializing incoming data into native object structures. It's not necessary to present the user with this; they should pick through the results lazily -- query into the data at their own discretion. This is important because it bounds the minimum requisite cost of receiving messages at ... zero. It costs nothing to discard trash. Whereas with these proactive serialization layers, every single incoming attack becomes an exercise against the deserializer. With a zero-copy/zero-allocate -- let's even say zero-parse -- mere lexing of the data, software has more flexibility. In practice, that means passing pointers around to parts of messages straight off the wire to application functionality further up the stack.
On the sending side, there is little reason to ever force whole native object representations to conduct a serialization. In other words, don't force the user to build an std::map first and serialize it later. All input is streaming input. Properties do not have to be pre-buffered, they can be streamed. One can model any network serialization this way, sans perhaps canonical representations (sorted JSON keys, etc), and far more efficiently than with requiring arbitrary native structures.
This is not really a rebuttal about the merits for or against dynamic memory in and of itself. I know the research, thread-aware allocators can be pretty good -- even darn good, and the state of the art in GC is nothing to shake a stick at. The problem is that with better design it's just not necessary, and in the end it does have a cost that one should want to eliminate if possible. I'm certain it's quite usually possible, at least more than I see in a list like https://en.wikipedia.org/wiki/Comparison_of_data-serializati... etc
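To make the receiving side concrete, here is a minimal sketch of the lazy, zero-copy style being described: fields are located by scanning the wire bytes on demand, and the caller gets a view into the original buffer rather than a deserialized object. The helper names are made up, and only length-delimited fields are handled:

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Decode one varint from the front of `data`, advancing it.
// Returns nullopt on truncated input.
std::optional<uint64_t> read_varint(std::string_view& data) {
    uint64_t v = 0;
    for (int shift = 0; !data.empty() && shift < 64; shift += 7) {
        uint8_t b = uint8_t(data.front());
        data.remove_prefix(1);
        v |= uint64_t(b & 0x7F) << shift;
        if (!(b & 0x80)) return v;
    }
    return std::nullopt;
}

// Lazily find a length-delimited field (wire type 2) by number, returning a
// view straight into the original buffer -- nothing is copied or allocated.
std::optional<std::string_view> find_field(std::string_view msg, uint32_t field) {
    while (!msg.empty()) {
        auto tag = read_varint(msg);
        if (!tag) return std::nullopt;
        uint32_t num = uint32_t(*tag >> 3), wire = uint32_t(*tag & 7);
        if (wire != 2) return std::nullopt;  // sketch handles only wire type 2
        auto len = read_varint(msg);
        if (!len || *len > msg.size()) return std::nullopt;
        if (num == field) return msg.substr(0, *len);
        msg.remove_prefix(*len);
    }
    return std::nullopt;
}
```

Malformed or unwanted input fails or is skipped during the scan itself, so discarding trash costs only the lexing, never an allocation.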
What you describe sounds exactly like how Cap'n Proto works. Obviously I like that design since I wrote it.
But it does have drawbacks. The encoded size is necessarily a bit larger to allow random access traversal of the raw bytes (though it compresses well). The API to manipulate structures in-place is a little awkward, particularly on the writing side. And, you can't really use the generated types as mutable in-memory state, as people commonly like to do with Protobuf types -- as a result, a common feature request for Cap'n Proto is to support generating "native" structs with the ability to convert between those and the zero-copy types as desired.
Everything is full of trade-offs. I don't disagree with your design preferences but I do object to the extreme line you are taking on them. You said: "Protobufs isn't useful for serious network applications." That is plainly contradicted by the existence of a trillion-dollar company built on said technology.
Constrained embedded systems cover a broad range of things up to and including your phone. There's few things I hate to see more than a flat profile from memory allocation or cache misses.
The good news is that flatbuffers[1] is a reasonable replacement for most of my use cases. In particular being able to mmap() them directly is a wonderful thing that you can't do with protobufs in addition to being very allocation sparse.
> Constrained embedded systems cover a broad range of things up to and including your phone.
No, modern phones are certainly not constrained in the way I meant, and I don't think you could call them "embedded" either. The common programming languages used on phones are very memory-allocation-friendly.
> There's few things I hate to see more than a flat profile from memory allocation or cache misses.
I think you may be arguing a different point, or a different level of extremity of the point. Reducing memory allocation to optimize performance is a fine thing that everyone does. The person I was replying to, though, seemed to be asserting that libraries should completely avoid allocating memory for themselves.
> The good news is that flatbuffers[1] is a reasonable replacement for most of my use cases. In particular being able to mmap() them directly is a wonderful thing that you can't do with protobufs in addition to being very allocation sparse.
Yeah... I'm the author of Cap'n Proto, which has the same property, and predates Flatbuffers.
> The common programming languages used on phones are very memory-allocation-friendly.
I would hardly classify Dalvik or ART as "allocation friendly", they don't perform escape analysis and if you do it constantly you'll be in a world of constant hard GC pauses. Multiple times over the years I've had to build free-lists in Java to avoid this specific problem.
Same for C++ if you use one of the built-in generic memory allocators. The fastest new/delete are the ones that you don't call.
For what it's worth I tend to agree with the grand-parent thread. The lack of awareness of allocations, cache-invalidation via indirection are a significant contributor to why we see software clawing back hardware wins across the years on these platforms.
> I would hardly classify Dalvik or ART as "allocation friendly", they don't perform escape analysis and if you do it constantly you'll be in a world of constant hard GC pauses. Multiple times over the years I've had to build free-lists in Java to avoid this specific problem.
I don't know what you're doing, but I've generally found free-lists to be a net performance negative in Java libraries. Time and again, I've been called in to "optimize" Java code that uses them, and usually by simply removing them I can get rid of the performance problems entirely.
Dalvik and ART have different behaviors than traditional Java VMs. Dalvik is aimed at low-memory devices; ART trades some memory for speed but still is significantly different (i.e. it doesn't do escape analysis). As always, benchmarking on hardware is the way to confirm this, but I've had numerous cases where free lists or pre-allocated arrays have gained 10-30x performance improvements.
Flatbuffers in particular excels here since once you have the ByteBuffer in memory you can immediately start accessing data without needing to do any extensive parsing.
Even the official Android docs are very explicit[1] on the allocation point. Allocations are not cheap and even with generational collection you still will blow the 16.6ms frame window for 60 FPS if any of your operations allocate excessively.
You may be right - presumably iPhones have fewer GC pauses (though you can still have VM, loading, compression/decompression, network, and other pauses.)
Be that as it may, lots of people still manage to use Android devices.
Once upon a time, we were concerned with handling network traffic at wire speed. I can remember when the AIX folks were happy they could implement the TCP fast path entirely in the interrupt handler.
Certainly you wouldn't use protobuf to encode TCP packets. It's meant for encoding application-layer messages, typically ones that aren't especially large but contain a lot of complex structure.
> make apologies from authority (after all, Google can't do anything wrong, right?)
That's not my position at all. In my other comment (https://news.ycombinator.com/item?id=25586447) I explain how I've spent 10 years trying to improve on protobuf C++ precisely because I agree that some of these limitations are unnecessary.
> No zero-copy for networking? Forced internal heap allocations with only this arena feature after a decade? Sorry no. Protobufs isn't useful for serious network applications.
I suppose it depends what you are comparing it to. Almost every JSON library has the same limitations you mentioned, and yet many people find JSON useful for network applications. But I agree that giving users full control over allocations makes a library useful in many more situations.
I think arena allocation is a pretty reasonable solution to the problem. You can use whatever memory you want for the arena (stack, heap, static buffer) and you can constrain it so that no heap allocations are allowed.
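A minimal sketch of what such an arena can look like -- caller-supplied memory, bump-pointer allocation, and a hard failure instead of a hidden heap fallback (this is an illustration, not protobuf's actual arena):

```cpp
#include <cstddef>
#include <cstdint>

// Minimal bump-pointer arena over caller-supplied memory (stack, heap, or a
// static buffer). Allocation is a pointer increment; nothing is freed
// individually -- the whole region is released at once.
class Arena {
public:
    Arena(char* buf, std::size_t size) : cur_(buf), end_(buf + size) {}

    void* alloc(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        uintptr_t p = (uintptr_t(cur_) + align - 1) & ~uintptr_t(align - 1);
        if (p + n > uintptr_t(end_)) return nullptr;  // exhausted: no heap fallback
        cur_ = reinterpret_cast<char*>(p + n);
        return reinterpret_cast<void*>(p);
    }

private:
    char* cur_;
    char* end_;
};
```

Backing it with a stack or static buffer gives the "no heap allocations allowed" mode; backing it with one big heap block amortizes the cost instead.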
Unfortunately protobuf C++ can't live fully within this arena model while it uses std::string for accessors. Hopefully this can be fixed at some point.
> I suppose it depends what you are comparing it to. Almost every JSON library has the same limitations you mentioned, and yet many people find JSON useful for network applications.
JSON is a human-readable format which is hugely advantageous to develop and operate in many settings. Protobufs doesn't have that advantage. Yet we're paying all of the same costs to structure the data with both. That's enough to posit JSON as a net winner over protobufs.
> Unfortunately protobuf C++ can't live fully within this arena model while it uses std::string for accessors.
Forcing std::string as a container for core components of a networking API (specifically for an accessor) is an exemplary demonstration of a lack of seriousness in a library.
You can debate the design decisions all you want (and I agree `std::string` was a poor choice here)... but it's kind of absurd to say that a technology underpinning products used by billions of people every day is not "serious".
Why is that absurd? Products used by billions of people may just as well not have serious requirements. I beg not to debate semantics, indeed my own fault, but in the world of network software requirements can start to get very serious, straight through to userspace protocol stacks on DMA'ed device buffers.
Would you go through all of the effort to rig an application with userspace networking only to return its results inside an `std::string` -- one of the few std containers where one can't even control the allocator if they wanted to? That's absurd, if anything.
If you are arguing that there exist use cases for which Protobuf is ill-suited, then fine, I obviously agree.
But you seem to be arguing that Protobuf is bad because it is not well suited to certain use cases, dismissing all other use cases as "non-serious". That is offensive.
I asked you not to argue semantics, but you're doing it anyway. I didn't respond here to argue protobufs actually is bad, since all software is suited to something for somebody somewhere.
In fact, I can just do that now: Protobufs is bad. You seem to agree, because you haven't even advocated for it -- just flatbuffers and Cap'n Proto and the phonebook of alternatives. Listen, I don't care who you've worked for or what you've done. As a scientist and engineer I'm totally ready and able to entertain critiques of my prior works without being offended. I'm offended that this is the quality of your participation here.
This is a really unprofessional and unreasonable approach (and I can see you're being downvoted for it). Since you say you're a scientist and engineer, can you share your prior work so we can evaluate it?
> but in the world of network software requirements can start to get very serious, straight through to userspace protocol stacks on DMA'ed device buffers.
Google does all of that when it makes sense, and yet uses it to push protobufs. That’s the entire point: you are calling it absurd and non-serious, but this only shifts my opinion on you, not on Google.
> Forcing std::string as a container for core components of a networking API (specifically for an accessor) is an exemplary demonstration of a lack of seriousness in a library.
I'm not super experienced with C++, so please forgive the maybe obvious question, but why is that?
(I'm guessing it has to do with allowing control over where allocations come from?)
In general I agree although I think it's fair to separate the protobuf wire format from the proto3 schema format which had a canonical json representation.
> No zero-copy for networking? Forced internal heap allocations with only this arena feature after a decade? Sorry no. Protobufs isn't useful for serious network applications.
That's a bit harsh. Protobufs deliver smaller wire size than any of the newer "zero-copy" formats. And many receivers of zero-copy formats will... copy the data into some internal representation. If your protobuf implementation delivers classes that are good enough to work with internally (store in maps, forward, etc) then you don't really lose something; instead you gain, due to no manual conversion layer.
CBOR also delivers optimal wire sizes with its variable length encoding. Ceteris paribus, there's no technical advantage to using protobufs. The choice is almost always non-technical (business, platform, partnership, etc) or simple naivete. It's not harsh, it's just engineering.
"that any library -- any library -- that forces internal dynamic memory upon its user smells bad"
Cough.
While with embedded systems there definitely is a big thing about dynamic memory allocation, much as I don't like it, it's not like very popular and successful libraries don't do this. It's a pretty common and accepted practice, and there are standard idioms for how it is done.
> AFAIK the C++ implementation would always allocate on the heap for nested messages
FWIW if you reuse the same message object for multiple parsings, it will re-use the sub-objects as well, thus amortizing away the allocation cost. Parsing the same message into the same object twice should do zero allocations on the second parse. This is the intended way to use Protobuf for small-size messages.
Apparently the C++ implementation has also grown support for arena allocation more recently. (After my time, so I don't know much about it.)
But then all that must come with bookkeeping, which brings its own cost.
Take a look at an implementation like Prost, for Rust. It's very similar to what I did (10 years ago by now). Everything is just inline, except when messages can be recursive (which should be rare for most protocols).
> Everything is just inline, except when messages can be recursive (which should be rare for most protocols).
Many messages have lots of optional sub-message fields, and set only a few of them in any given message. These messages would be huge if everything is inline (especially if the same thing happens in those sub-messages).
I agree that inlining all sub-messages works great for dense schemas, but it assumes too much about the schema to be a good design for a general-purpose proto library I think. Also maps and repeated fields can never be inline.
The bookkeeping is not that hard... the pointer is null until it is first allocated, then it remains non-null, while a separate boolean indicates whether the sub-message is actually present in the parent.
The big problem is bloat in memory usage if you parse many differently-shaped messages, requiring the app to implement hacks like only reusing a particular object a certain number of times.
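The bookkeeping described above can be sketched with a toy analogue (names invented for the example; real protobuf packs presence bits far more compactly):

```cpp
#include <memory>
#include <string>

// Toy analogue of the reuse pattern: the sub-message pointer is allocated
// once and then kept, while a separate presence flag records whether the
// field was set in the most recent parse. Re-parsing into the same object
// therefore does zero sub-message allocations.
struct Inner { std::string payload; };

struct Outer {
    std::unique_ptr<Inner> inner_storage;  // stays non-null once allocated
    bool has_inner = false;                // presence in the current message
    int allocations = 0;                   // instrumentation for this example

    void parse_inner(const std::string* data) {
        has_inner = (data != nullptr);
        if (!data) return;                 // field absent: keep storage for reuse
        if (!inner_storage) { inner_storage = std::make_unique<Inner>(); ++allocations; }
        inner_storage->payload = *data;    // later parses overwrite in place
    }
};
```

The memory-bloat problem follows directly: storage retained for every field shape ever seen accumulates until the app resets or discards the object.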
This over-assuming-of-shared-context is one of my pet peeves on this site. Based on the average post you see on Hackernews, it's a virtual certainty he means "snooping on users of a program". I can't explain why, but it just bothers me that this would be considered so obvious as to escape mention. "Telemetry" means something, and it meant something to aerospace engineers for a long time before somebody at google started using it to refer to something incredibly narrow.
Of course, I'm also annoyed that some computer science PhD apparently googled around for a cooler name for multi-dimensional arrays and ruined the term "Tensor" forevermore.
> Based on the average post you see on Hackernews, it's a virtual certainty he means "snooping on users of a program". I can't explain why, but it just bothers me that this would be considered so obvious as to escape mention. "Telemetry" means something, and it meant something to aerospace engineers for a long time before somebody at google started using it to refer to something incredibly narrow.
Did you even read the post? It's obvious OP works at Datadog (they mention it and mention the work is for their employer), so telemetry as in OpenTelemetry (the new standard around metrics, traces and logs).
Well, I started to read the article, thinking that it might refer to sending, you know, telemetry. The hackernews link was titled like this:
>Don't Use Protobuf for Telemetry
so I thought I had a shot of reading something relevant to me. So I opened up the article:
>Protobuf needs no introduction, but this post argues that you shouldn’t use it for telemetry. The basic premise of this post is that a good telemetry library needs to be lightweight to avoid perturbing the application; inefficient diagnostic tools are self-defeating. Unlike other formats, nested Protobuf messages cannot be written contiguously into a stream without significant buffering. The post doesn’t argue to never use Protobuf, but that the trade-off made by the wire-format itself, as opposed to any existing implementation, is unlikely to work for lightweight message senders.
My field is related to robotics, where we actually send, you know, telemetry. Like, over a radio. What I'm complaining about is that this is arguably the "main" usage of this word, but people in the web space assume that like protobuf, the concept of telemetry needs no introduction: it means sending back statistics, metrics and logs about your web services. When he says "a good telemetry library" needs to be lightweight, there's zero preamble here to get me oriented.
You saying "did you even read the post" is exactly missing the point. Yes, I read through the post, and through context clues, discerned that this post isn't relevant to my interests. I just notice that people writing about cloud service technology never feel compelled to warn you that's what they're going to be talking about. They make grand proclamations, and you have to play detective to find out whether their claims apply to you. "Microservices are the future!" "Okay, does he mean like... for all of software? Or for SaaS companies?"
Imagine if the world was a different place, and the average programmer worked at a robotics company instead of a web tech company. And then you saw an article titled "gRPC is dead - DDS is the future". And then it jumped in and started talking about all these throughput metrics, and it took you 6 paragraphs to figure out that the reason they're done with gRPC is that they're shipping around 4K images at 60fps to dozens of microservices that all run on the same computer. And you'd go oh, well that's... irrelevant. That's how it feels, all the time.
> What I'm complaining about is that this is arguably the "main" usage of this word
Telemetry is getting remote metrics. You can have telemetry in robotics, you can have telemetry for app performance, you can have telemetry for environment sensors, etc. They're all valid - I'm not sure why you say your field's usage is the "main" one.
Yes. Just recently I discovered that some web types refer to generating HTML from some other representation as "rendering". To graphics people, and artists, rendering is taking a description of something and turning it into a directly viewable picture. HTML is a long, long way from displayed pixels.
Come on. There's a big industry out there working to push game and entertainment content through the GPU and onto the screen. That's rendering.
Rendering is what happens when fat is cooked slowly over low heat and becomes a liquid, rather than crisping up. What pixels have to do with that, I have no idea.
The practice of naming software or start-ups after real, useful terms from other industries is super annoying to me.
And it's especially unconscionable for web devs, who should be more wary than most about global namespace collision (looking at you, browser window). At least have the decency to change the word slightly.
Software is now literally everywhere, doing everything.
Picture if there was a website dedicated to Math News. Some articles talk about how amazing Tangents are. Some talk about Bayesian inference. It would be a mess.
Yes, Hacker News is a mess. That's pretty unavoidable. All conversations take place within a domain - a culture, a set of shared assumptions. And if you start talking with someone new, yes, it will take you quite a while to find out what assumptions they have that are different from yours.
Like, I wander in to an economics class, and they put the independent axis in the wrong direction.
I open up a computer graphics library, and it's using the left-hand rule.
One textbook says negative charge goes this way, and the textbook for a different course says it goes the other way.
Yes, people don't even understand that others MIGHT not understand their domain.
To be fair, if you don’t want to read grand proclamations about opinions which make zero sense except for people in roles tightly related to surveillance capitalism, perhaps this isn’t the site for you. (Us...)
I read the whole post, and had to do a mid-post brain state reshuffle to get out of my assumed “pump in the field” context into his “snooping on users” context.
This, for me, is because I’m playing with an ESP32-based project which uses protobufs, and I came into it from that mindset.
The intended meaning of “telemetry” was not obvious to me until some way into reading the piece. Even the first mention of Datadog didn’t raise any suspicions; it was the use of Java that first had me wondering, and then, down around the 6th paragraph, when he said “a diagnostic agent which tells you what your application is doing”, the lightbulb went off and I stopped and re-oriented my brain.
People tend to jump from telemetry straight to snooping on users even though it's just one possible case.
A possible megabyte of data: a rolling in-memory debug-level log + stack traces with context + timing information, that you snapshot when the app catches an exception. Or a buffered set of timing stats from a minute of runtime.
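A minimal sketch of that rolling debug-log idea, using only the stdlib `logging` module (the class name and capacity are hypothetical, not from any real agent):

```python
import logging
from collections import deque

class RingBufferHandler(logging.Handler):
    """Keep the last `capacity` formatted records in memory; nothing is
    shipped anywhere until snapshot() is called (e.g. on an exception)."""

    def __init__(self, capacity: int = 1000):
        super().__init__()
        self.records = deque(maxlen=capacity)  # old records fall off the front

    def emit(self, record):
        self.records.append(self.format(record))

    def snapshot(self):
        # Materialize the ring buffer, e.g. to attach to a crash report.
        return list(self.records)
```

In steady state the app sends nothing; on catching an exception, `handler.snapshot()` plus the traceback becomes the (potentially megabyte-sized) payload.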
The author mentioned implementing ddsketch for probabilistic quantile distribution. That's something you run on your own app/infrastructure rather than for snooping on users.
If you are expecting multi-MB messages protobuf isn't the best option as the author highlighted. Often you can combat this by breaking a large message into a stream of smaller messages but for larger amounts of data you will want something else.
pb is also used in low-resource/embedded systems (to send the data to the servers).
In these cases, the processors typically have little to no cache and very limited RAM, so performing multiple encoding passes is the only option, and each pass is as expensive as the first.
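The multiple-pass cost is easy to see in a toy sketch of the wire format (a hypothetical hand-rolled encoder, not any real library): to emit a length-delimited submessage, the encoder must know the submessage's serialized size before writing a single payload byte.

```python
def varint(n: int) -> bytes:
    """Protobuf base-128 varint: 7 bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        lo = n & 0x7F
        n >>= 7
        out.append(lo | (0x80 if n else 0))
        if not n:
            return bytes(out)

def size_inner(value: int) -> int:
    # Pass 1: size of a one-field submessage (field 1, varint).
    return 1 + len(varint(value))

def encode_outer(value: int) -> bytes:
    # Pass 2 cannot start until pass 1 has produced the length.
    inner_size = size_inner(value)
    buf = bytearray()
    buf += varint((1 << 3) | 2)   # outer field 1, wire type 2 (length-delimited)
    buf += varint(inner_size)     # varint length prefix -- needs pass 1's result
    buf += varint((1 << 3) | 0)   # inner field 1, wire type 0 (varint)
    buf += varint(value)
    return bytes(buf)

# encode_outer(150) == b"\x0a\x03\x08\x96\x01"
```

With deeper nesting, each enclosing message repeats the sizing walk over everything beneath it, which is exactly the cost being discussed here.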
It's a shame you can't share the code; we could really do with a modern C++ implementation. Having said that, protozero looks very interesting [1], need to find some time to look into it.
You don't work for Blizzard do you? When I worked there, engineers (not my team thankfully) designed their own alternative to protobuf to try and save a few more bytes over the wire, which in my opinion was a really poor decision. Rather than getting on with actually adding value, they ended up pushing back multiple other teams' deadlines while adding almost no value. It was a classic "not invented here" mentality and doing engineering work for the sake of the work, rather than actually asking "what problem are we trying to solve here?"
If protobuf works for Google then it essentially works for 99.999% of every other company on the globe.
> If protobuf works for Google then it essentially works for 99.999% of every other company on the globe.
I can't agree with this mindset.
Another commenter here pointed out the Google implementation is 20x slower than others, and 1.6MB for this kind of task feels bloaty. Just because it meets Google's needs doesn't mean it's universally adequate.
I did a search on this page and I didn't find the comment you are referencing. It seems very unlikely to me that google's implementation is 20x slower.
I think the general point is 'if it works for Google, you should feel relatively safe adopting it for your own use'. It doesn't mean it'll fit every single usecase.
If what you need is something that is super efficient over the wire - or any other such requirement not filled by Protobuf - then maybe look for a different protocol altogether. Or design your own!
But for the rest of us - the '99.999%' - the trade-offs are well understood and we'd rather go for the tool we know than reinvent the wheel or use a less maintained tool.
These are simply tradeoffs. For example, one can make very fast protobuf parsers/serializers if you prioritize that over the ergonomics of the generated code.
No I don't work for Blizzard. When I did this, we were using a very fast custom wire format, but entirely hand-coded. Say flatbuffers without the code generation part. And having massive trouble with schema evolution.
Another group had already evaluated pb and found it way too slow. They had designed something similar but faster. I wrote my implementation of pb to prevent this, and showed pb could be fast enough. It was definitely the right choice at the time, as it gave us C++ speed close to the old format, plus easy interop with other languages.
> If protobuf works for Google then it essentially works for 99.999% of every other company on the globe.
Uh.. no. Google is a massive company, but if you browse the comments in this thread, you'll find multiple remarks like "this was built for Google's servers". Google have specific use cases, and they build software for that. The software may well be lacking for other use cases. I can totally imagine Blizzard wanting to write their own implementation, think of the benefits of reducing parse time in a multiplayer server.
I remember watching a GDC talk of JAM (https://www.gdcvault.com/play/1018184/Network-Serialization-...), which was Blizzard's solution to network serialization for WoW back in 2004. The talk even had performance comparisons against protobuf and it seemed like a decent alternative that worked well for them. Was this new alternative an extension of JAM, or a full rewrite? I know they have millions of players monthly, but I can't imagine saving a bit on bandwidth could have warranted a full rewrite, especially when the majority of their traffic is from their own datacenters.
Yeah, the author seems to understand the nesting issue, yet not understand why it's not really an issue in practice.
As per the format guidelines, you shouldn't have deeply nested structures... or at least if you do it should be in cases where only one or two layers of nesting are decoded at any given stage (I've done this with "envelope" messages).
Of course, there are ways to write serialized lengths of nested elements recursively, you just have to do it yourself. It's just not what you normally want to do, so the standard library and compiler don't emit that logic. If you are going down that road, you gotta ask yourself why.
It didn't exist yet. We were evolving from simple raw messages with all sorts of problems.
The alternative was something custom again, with better support for schema evolution, but pb was convenient due to existing implementations in Python (system tests) and C# (UI).
His point on protobuf-java library adding nontrivial bloat to his Java app is definitely valid. However, his other argument about protobuf wire format being inefficient is hard to square with decades of practical experience Google had with protobufs, which are used for literally everything, including telemetry, and high throughput, low latency applications. Sure, having to recursively precompute lengths before serialization is a bit of a hassle, but I wouldn’t call it expensive.
It seems easy to square the practical experience of protobufs at Google versus the tradeoffs encountered in the wire format when you take into account how many Google engineers have directly contributed to the diaspora of related formats such as CapnProto, flatbuffers, msgpack, etc with different tradeoffs (especially for smaller scales than Google).
It seems clear enough that protobufs were optimized in a scale that involved a lot of time/cost-sensitive reads and far fewer time/cost-sensitive writes. That's a valid tradeoff for Google scale, that comes at a cost of other potential applications (such as telemetry that can be very time/cost-sensitive at write-time but has much more relaxed read time/cost-sensitivity).
> Sure, having to recursively precompute lengths before serialization is a bit of a hassle, but I wouldn’t call it expensive.
It's expensive in memory (needing to prebuild all the subcomponents of the message, instead of streaming them as needed), and potentially time (prebuilding then sending, versus streaming as soon as data is available to write). Of course, in the scheme of things it may not be expensive to particular use cases. (Which seems why the author seems to particularly highlight this specific use case where such things really are expensive overhead necessary to avoid.)
> It seems clear enough that protobufs were optimized in a scale that involved a lot of time/cost-sensitive reads and far fewer time/cost-sensitive writes.
Nah, you're assuming too much. Protobuf was thrown together in a fairly ad hoc way by a couple (brilliant!) engineers (Jeff and Sanjay) to help make the Google search index protocol easier to maintain. The specific design decisions in Protobuf were not carefully tested or weighed against other possibilities. They just did something that worked well enough, and it worked well enough that it was rapidly adopted by the rest of the company. It was then too late to change anything.
Yes, the fact that a variable-width size must be written before the data is a kind of big problem, which essentially requires you to make two passes over the message tree, one to compute sizes and one to write the data. There are some clever optimizations that can reduce the impact, but I don't think the designers would make the same decision if starting from scratch with no need to support legacy. But there was no point in history where it was worth breaking compatibility to fix this issue, so that's how it remains. It's a problem, just not that big a problem.
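One of those optimizations is size caching: do the sizing pass bottom-up, memoize each node's size, and reuse the cached values during the write pass rather than re-walking the tree at every nesting level. This is a toy sketch of the idea (hypothetical classes, not actual protobuf code, though generated C++ does something similar with a cached size field):

```python
def varint(n: int) -> bytes:
    out = bytearray()
    while True:
        lo = n & 0x7F
        n >>= 7
        out.append(lo | (0x80 if n else 0))
        if not n:
            return bytes(out)

class Node:
    """Toy message: field 1 = varint value, field 2 = repeated submessage."""
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)
        self._cached_size = None   # filled in by byte_size()

def byte_size(node) -> int:
    # Sizing pass, bottom-up: each node's size is computed exactly once.
    size = 1 + len(varint(node.value))          # field 1 tag + value
    for child in node.children:
        s = byte_size(child)
        size += 1 + len(varint(s)) + s          # field 2 tag + length + payload
    node._cached_size = size
    return size

def serialize(node, out=None) -> bytes:
    # Write pass: length prefixes come from the cache, no re-sizing.
    out = bytearray() if out is None else out
    out += varint((1 << 3) | 0) + varint(node.value)
    for child in node.children:
        out += varint((2 << 3) | 2) + varint(child._cached_size)
        serialize(child, out)
    return bytes(out)
```

`byte_size()` must run before `serialize()`; with the cache, the total work is two linear passes regardless of nesting depth, instead of one sizing walk per level.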
(I maintained Protobuf for several years, including writing version 2 and open sourcing it.)
> The specific design decisions in Protobuf were not carefully tested or weighed against other possibilities.
And just to be clear, I don't think this is bad. On the contrary, I think Protobuf won because it did a wide variety of things "pretty well" while moving quickly and solving real problems. This is how the best technologies are usually made, not by academically trying to perfect everything, but by banging out something that works and running with it to solve real problems. If you try to carefully design everything perfectly upfront, you'll spend a huge amount of time on decisions that don't really matter.
> This is how the best technologies are usually made, not by academically trying to perfect everything, but by banging out something that works and running with it to solve real problems.
I think this is a really important point. I don't think I've ever told you this before, but I am really impressed by how quickly you turned out proto2. While there are some things here and there that we wish we could change, a lot of it holds up really well. A lot of decisions I made in upb early on, where I thought I was improving on proto2, actually turned out to be bad ideas and the proto2 design was better.
Though in retrospect, I don't know if proto2 was a net win, given all the migration pain it caused. If I were doing it again I would take a more incremental improvement approach on proto1. It would have taken a lot longer but with less pain, I think. That said... who knows if that would have resulted in a better or worse outcome. Open sourcing would have taken a lot longer.
Had I realized at the time that I was taking on a project with no good solutions... heh.
Yes, protobufs are a very interesting and fulfilling technology to work on, but also frustrating because there is so much API exposure that changing any existing API is like trying to run through molasses. A clean break a la proto1->proto2 opens up lots of possibilities on a much shorter time scale, but also creates a heavy migration burden.
There are lot of improvements we can make without touching API, but whenever the API itself is a barrier to further improvements, there are just few good options for managing that.
Your point about incremental changes reminds me of the Linus rant about "bundling", where he argues that incremental changes lead to a better result than a big bang rewrite: https://yarchive.net/comp/linux/bundling.html I very much agree with that approach when possible, but Linux has the benefit of a much narrower API offered to its users. The protobuf API is not only the API of the core library, but of every generated class. The massive surface area just makes any kind of change to generated APIs an enormous challenge (for example, returning string_view from accessors instead of std::string).
If you're designing a critical component that's going to impact your business indefinitely, you can take some time for due diligence. This is basic Shift Left mentality and it has lots of benefits.
This was 2001-ish. No one at the time had any idea how widely the thing would end up being used, nor how big the company would grow. And spending too much time dwelling on each part of the tech stack then could easily have given a competitor the opportunity to pull ahead, in which case Google wouldn't be what it is today.
It's easy to sit in hindsight and say that all one's actions are justified, because otherwise the present wouldn't have happened. But it's a mistake to justify past actions judging solely on individual outcome. If it had all gone badly and Google had been sunk, these would've been listed as reasons why they should have done things differently. The individual action may be right or wrong in spite of any merit it may garner.
Hey that reminds me, I've always wondered: how did it come to be that delimited encoding "won" for sub-messages and group encoding was deprecated? A lot of these problems would go away if things had gone the other way.
Groups can be encoded in one pass, they are efficient to decode (unless you were trying to skip the sub-message, a la LazyField), and they don't have the string/message ambiguity in UnknownFieldSet that messages have.
Do you remember how that came to pass? Maybe some of this happened during the evolution of proto1, which was before my time.
That decision predates me. I think it was basically because early versions of protobuf didn't actually support using a message type as field type, so instead people would declare "string" fields and then manually encode/decode another protobuf type into that field. When the ability to explicitly use message types as field types was added to the language, they wanted to use it in those existing protocols without breaking compatibility, so the design was fit to the pre-existing practice.
I argued for switching to group encoding for submessages when working on proto2, but was shot down. It was a long time ago, but I think the counter-argument was some combination of "it's not worth the breakage" and "the ability to lazily parse sub-messages is too valuable".
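The wire-level difference is easy to sketch with a toy encoder (hypothetical helper names): a group brackets its payload with start/end tags, so it can be streamed in one pass, while a length-delimited submessage needs its size before the first payload byte.

```python
def varint(n: int) -> bytes:
    out = bytearray()
    while True:
        lo = n & 0x7F
        n >>= 7
        out.append(lo | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_group(field: int, payload: bytes) -> bytes:
    # Wire types 3/4: start-group ... end-group. No length prefix,
    # so the payload can be emitted as it is produced.
    return varint((field << 3) | 3) + payload + varint((field << 3) | 4)

def encode_length_delimited(field: int, payload: bytes) -> bytes:
    # Wire type 2: the length must be known before the payload is written.
    return varint((field << 3) | 2) + varint(len(payload)) + payload
```

The flip side, as the counter-argument goes: with groups, a decoder that wants to skip or lazily parse the submessage has to scan for the end-group tag instead of jumping over a known length.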
> Nah, you're assuming too much. Protobuf was thrown together in a fairly ad hoc way by a couple (brilliant!) engineers (Jeff and Sanjay) to help make the Google search index protocol easier to maintain. The specific design decisions in Protobuf were not carefully tested or weighed against other possibilities. They just did something that worked well enough, and it worked well enough that it was rapidly adopted by the rest of the company. It was then too late to change anything.
That's fair enough as a description of how the protocol was (not) designed that those sorts of trade-offs were not taken into account in the initial design.
However, it doesn't entirely invalidate my assumption: search index protocol is pretty obviously a context/scale that would have deeply favored more time/cost-sensitive reads (need to get search results to users as fast as possible) and the time/cost for writers less of a pressure (as writers for the index protocol could presumably be amortized with caching/proxying/throwing hardware at the problem). Whether that was an intentional "optimization" process or simply "optimization evolutionary pressure", it does seem to me (as an entirely outside observer) like a natural trade-off optimization that occurred at Google scale for protobufs that kept protobufs feeling "well enough" that they were essentially left alone and never optimized for something with different trade-offs (such as use cases that were more write than read-heavy).
Which it is still useful to know those sorts of "optimization evolutionary pressures" to answer questions like "sure, it worked for Google, but will it work for me in this very different use case/scale?"
Yes, I agree with that. If Protobuf hadn't been a reasonably-good design for Google-scale distributed systems, it wouldn't have won (within the company, or beyond). So regardless of how that design came about, we can say it is a reasonably good design overall. But only in aggregate -- not every individual decision can be assumed to be great on its own.
Signal boosting this. I am the last person to be swayed by “use our magical design that solves all problems,” but in the case of protobufs, it solves most of them.
I didn’t like it until I was forced to use it. Now I’m not sure I’ll ever go back.
There’s a handy code snippet to turn any protobuf message into JSON:
    from google.protobuf.json_format import MessageToJson
    import json
    from pprint import pprint as pp

    pp(json.loads(MessageToJson(msg)))
CapWords function names match the C++ style guide at Google, not Java. A lot of Python uses autogenerated bindings to C++ code and reuses the same naming conventions. Even pure Python code would use that style, although their public style guide doesn't mention it anymore.
Notably this is no longer true[0]. While legacy stuff can still use the CapWords styling, snake_case names are preferred for all new methods and functions (including code that wraps c++). But this is a relatively recent change to the style guide, and the proto libraries far predate it.
(As a response to a sibling of the parent, the same rules apply to internal code: CapWords should only be used for consistency within a file. All new code should embrace its snakey heritage.)
Is this a thing people care about? From what I've seen of server-side Java applications (we have several at work) 1.6MB and 700 classes are lost in the noise of the endless list of Maven dependencies. I'm sure you can do it a lot more efficiently without using someone else's thing, but what exactly is being optimised for here?
Yes, but in this case the author is an employee of DataDog, presumably working on their agents or clients and so is probably acutely focused on not impacting the performance of their customer's systems.
I want to be a fan of Apache Avro (https://avro.apache.org/) so much for situations such as this. And while I like Avro as a standard, most of the implementations I have found (specifically in C/C++) are... lacking. I feel Avro would be a great fit for something like this, with close to zero overhead (assuming a pre-shared schema), but there is little to no support for pre-shared schemas, and the RPC part of the standard is not a great fit for telemetry. Maybe one day I'll make an Avro library I like, or contribute to an extant one.
I think AVRO is fairly well designed overall, but JSON schemas? If you need deeply nested objects just look away, it becomes unintelligible when trying to understand the schema with any real nesting. The company I was at used it fairly well as long as you keep things 1 or 2 levels nested at most, but we had a legacy schema that was 5 or 6 levels deep at some points and it was just a disaster. My anecdote anyways.
Oh yeah. Schema nesting is a problem. I think a lot of it would be solved by having a schema allow a "types" field which just has a list your named types that can be referenced later in the list and in the actual schema. But yeah, totally concur on that pain point.
Ignoring the bits about Java specifically, one thing that jumps out at me (while reading through the comments) is the distinction between use cases. It seems that the wire format puts some extra burden on the encoding side, but has the benefit of making decoding nice.
In a situation where you're using protobufs internally, you're both encoding and decoding the messages. CPU hours aren't fungible, but you're paying a small latency cost somewhere in your system between when the data is produced and when it can be consumed.
In the case of telemetry, such as with Datadog, you pretty much only encode the data. That is, if I'm a Datadog customer, I'm literally never decoding the encoded data on my own servers. That being the case, it would seem that the argument is that if encoding performance is a desirable property, protobufs add overhead that doesn't offer any meaningful tradeoff to the user ("you're using more CPU on my server and I don't see tangible benefits").
CP was written by Kenton Varda, who spent many years working on protobuf at Google.
One massive advantage of protobuf for mainstream Google languages (C++, Java) is that Google has used them extremely heavily for many years, and you can trust that they've been extensively battle tested in Google's enormous high-traffic services, and scrutinized by Google's vast army of engineers. Kenton's experience notwithstanding, the same cannot be said of CP.
This argument doesn't apply for languages other than the ones that Google uses for production services.
Eh... I'd actually argue that Protobuf's big advantage is all the languages it supports that Cap'n Proto doesn't.
If you're using C++ exclusively, I'd argue Cap'n Proto beats Protobuf. That "army of engineers" isn't necessarily the advantage you think it is -- rather than forcing Protobuf to be the best it can be, I would argue they forced Protobuf to get stuck with early design decisions that no one thinks were ideal (e.g. varint encoding), because it's too hard to change once you have a lot of users. Battle-testing is great, but you could argue that Cap'n Proto benefited more from Protobuf's battle-testing than Protobuf itself did, since Cap'n Proto was designed from scratch with those lessons already learned.
On the other hand, if you need to support a broad set of languages, Protobuf is more likely to meet your needs. This is where the army of engineers and Google backing is helpful -- in building out support for a wide variety of languages and platforms.
Part of the reason Kenton left Google and created Cap'n Proto was that he was the only person working on proto at Google trying to fix the problems Google found.
That's not really accurate... I transferred off the Protobuf project three years before quitting Google, based on feedback I received from management suggesting they didn't think my work there was worthwhile. By the time I left Google, there was a new team maintaining Protobuf, and my reasons for leaving were not related to Protobuf.
I didn't leave Google to create Cap'n Proto. Rather, having left Google and being free to do whatever I wanted for a while, I created Cap'n Proto mostly for fun, to "scratch an itch" by trying out a different design.
Damned if you're not the most patient self-advocate for an open source project I've ever seen. I feel like I can always count on finding The Kenton Varda in the comments whenever a protobuf discussion comes up, graciously discussing design trade-offs and dispelling misconceptions.
Did you ever consider looking for official / commercial backing for Cap N Proto? It seems demonstrably better than all the alternatives, but I get the sense that it hasn't taken over because "nobody ever got fired for buying IBM". I know for certain I'd have an easier time pushing capnp in my own organization if it was getting that type of backing.
> Did you ever consider looking for official / commercial backing for Cap N Proto?
Well, I don't think Cap'n Proto is marketable (for any kind of profit) on its own. It's one of those things that has to be free and open source to get any adoption. So a corporate backer would need to get some indirect benefit from its development and adoption. Also, at this point my goal in building and maintaining Cap'n Proto is primarily to benefit my other projects that use it -- if other people benefit too, great, but wide adoption of Cap'n Proto is not an intrinsic goal of mine.
Right now, Cap'n Proto development is de facto backed by Cloudflare, as we are using it heavily in Cloudflare Workers and that is driving all the recent work on the C++ implementation. Currently, this use is internal-only and so we have little reason to build out support for multiple languages. But, as the Workers platform grows the ability to support increasingly complex apps and distributed systems (especially with Durable Objects), limiting applications to communicating with HTTP only is getting awkward. One idea that has been tossed around is to expose Cap'n Proto directly to apps. It makes a lot of sense: it would be very easy for us to support since we already use it under the hood, and zero-copy communications make a ton of sense especially for worker-to-worker comms happening within a single process. If we decided to do this, Cloudflare would become the commercial backer. But, there's also a strong argument to support gRPC directly in Workers, given the existing ecosystem. Would we want to support both? Maybe, maybe not. There's a lot of trade-offs still to think about here, and my goal would be to make the best possible product decision for Cloudflare Workers -- not necessarily for Cap'n Proto.
Cap'n Proto has a number of drawbacks: the 'one man band' problem (witness the two-year lull when Kenton went sandstorming), an awkward API, and limited language support.
Personally I don't think there is a perfect protocol, because different people want different things: self-describing, easy/optimised memory management, zero copy, partial decode. The list goes on...
At a pinch, flatbuffers with flexbuffer evolution would be close to my goals, but I'd much prefer having a meta description of messages, and perhaps access, authentication and transport security, and using that (e.g. an OpenAPI v3.1 spec) to generate an implementation in protobuf, msgpack, JSON, ASN.1 etc., whichever is suitable for a use case, over an appropriate transport, whether QUIC, TCP or UDP.
Some of the high performance work I've seen uses ASN.1 on a very large virtual server at 100Gb line rates because the messages lend themselves to parallel decode.
I think Mike Acton had it right by suggesting things are tailored to the data needs and not overgeneralised.
> Witness the two year lull when Kenton went sandstorming
There wasn't really a lull; development on Cap'n Proto continued that whole time by the Sandstorm team in service of Sandstorm. There was an absence of official releases since Sandstorm always used the latest Cap'n Proto code from git. The same story continues today with Sandstorm replaced by Cloudflare Workers. TBH I should probably give up on "official releases" and just advise everyone to use git...
Perhaps this is too unhip, but I wonder - have you ever dealt with the OMG's Data Distribution Service (DDS)? I used to think it was too fuddy-duddy to even glance at, but the recent adoption of DDS by ROS2 made me take another look. DDS seems to have a lot of interesting properties, particularly the QoS and discovery mechanisms. What made me think of it is that DDS is really a free and open standard, but there are a number of commercial entities making money off of it. They provide DDS implementations for free, and then they make money by offering professional consulting and support. Maybe a Cloudflare-backed capnp could end up looking the same way. Anyway, I'm sure you have your head around the possibilities and tradeoffs, not sure why I'm rambling.
Eh, I'm not a big fan of the "provide paid support for open tech" model... it doesn't scale well, and it seems like it creates a perverse incentive to make the tech hard to use, to generate contracts.
Yeah, I see your point, and I have to admit that it's been kind of a turnoff as I've been considering DDS for my own company. The other thing it does is create incentives to create optional paid add-ons. Then you find the support company talking out of both sides of their mouth, because they're trying to sell you on how great it is that the underlying standard is free and open, but simultaneously trying to sell you on how their paid, proprietary add-on is absolutely crucial. It creates an unease, where you're never sure where the border between paid and free will lie.
There's a marshalling analog to Greenspun's Tenth Rule.
Every sufficiently complicated serialization implementation contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of ASN.1.
I'd add that this is true of most ASN.1 implementations as well.
The length encoding is a solved problem in ASN.1 (just use CER).
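Concretely, BER/CER indefinite-length encoding lets an encoder stream a value without knowing its total size up front: a length octet of 0x80 means "indefinite", and the value is terminated by two zero end-of-contents octets. A toy sketch for a chunked OCTET STRING (chunks kept under 128 bytes so the short-form length suffices; real CER mandates 1000-byte chunks, which need long-form lengths):

```python
def cer_octet_string(chunks) -> bytes:
    # Constructed OCTET STRING (tag 0x24 = 0x04 | constructed bit 0x20),
    # indefinite length (0x80): primitive chunks are emitted as they
    # arrive, then 0x00 0x00 marks end-of-contents.
    out = bytearray([0x24, 0x80])
    for chunk in chunks:
        assert len(chunk) < 128           # toy: short-form length only
        out += bytes([0x04, len(chunk)])  # primitive OCTET STRING chunk
        out += chunk
    out += b"\x00\x00"                    # end-of-contents
    return bytes(out)
```

No sizing pass is needed; the trade-off is that a reader skipping the value must scan for the terminator instead of jumping over a known length.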
In many cases you do indeed want or need to use Protobuf for telemetry or other time-series data. The Protobuf RecordIO / TFRecord formats are very unhelpful for this use case.
Protobag is a library that helps you write and read time-series and other collections of Protobuf data using the standard Tar and Zip archives as containers. Protobag also includes code for leveraging self-describing messages to embed schemas in archives so that you never have data on disk that you can't read. Hope somebody finds Protobag helpful.
We already have FlatBuffers, which allow you to send structs on the wire without reinventing anything. It sounds like it's a good match here. Protobuf is better if you care about the byte count on the wire and less about the CPU time to pack and unpack it.
Protobuf's variable-length lengths are indeed annoying, but can't you encode back-to-front as in ASN.1? That lets you write your lengths just after encoding the value, which should work (but may leave your data in e.g. bytes 20+ of a buffer; depending on interfaces, you may need to do one final copy to move the data to the start of the buffer.)
(ASN.1 and protobuf are both tag-length-value formats with variable-length integer encodings... h/t to tptacek for mentioning that particular trick.)
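The back-to-front trick can be sketched in a few lines (a hypothetical toy encoder, not a real library): fill a buffer from the tail, so by the time a length prefix is written, the payload after it has already been produced and measured.

```python
def varint(n: int) -> bytes:
    out = bytearray()
    while True:
        lo = n & 0x7F
        n >>= 7
        out.append(lo | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_backwards(value: int) -> bytes:
    buf = bytearray(64)             # scratch buffer, filled from the tail
    pos = len(buf)

    def push(chunk: bytes):
        nonlocal pos
        pos -= len(chunk)
        buf[pos:pos + len(chunk)] = chunk

    end = pos
    push(varint(value))             # inner field 1 payload, written first
    push(varint((1 << 3) | 0))      # inner field 1 tag (varint)
    push(varint(end - pos))         # length prefix: payload already measured
    push(varint((1 << 3) | 2))      # outer field 1 tag (length-delimited)
    return bytes(buf[pos:])         # the final copy-to-front mentioned above
```

One pass over the data, at the cost of the message ending up at the tail of the buffer rather than the head.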
I work with an application that spends something like 70-80% of its time in serialization / deserialization code. This is mostly due to heap allocation / deallocation (in Golang) of objects, and the RPC stacks in use not doing allocation arenas, reuse, or similar.
The "state of the art" in de(serialization) isn't awesome. :(.
Optimization: Prototype before polishing. Get it working before you optimize it. - Eric S. Raymond, The Art of Unix Programming (2003)
The First Rule of Program Optimization: Don't do it. - Michael Jackson
The Second Rule of Program Optimization (for experts only): Don't do it yet. - Michael Jackson
Spell create with an 'e'. - Ken Thompson (referring to design regrets on the UNIX creat(2) system call and the fallacy of premature optimization)
The No Free Lunch theorem: Any two optimization algorithms are equivalent when their performance is averaged across all possible problems (if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems).
FWIW at my company we do embedded robotics, use ridiculously underpowered processors with very limited capabilities, and still prefer to use HTTP as it is well understood, has good tooling and is easy for everyone to deal with, which reduces real-world business costs vs. custom wire-level formats. I'd gently suggest that unless you have a really particular use case (real-time stock market feeds where latency is paramount, or a very expensive or constrained link layer), it makes little sense to worry about this sort of optimization.
What do optimization algorithms have to do with performance optimization? I mean, sure, you can use either to improve the other, but NFL has nothing to do with performance optimization.
Fair point - I just grepped for optimization. However, it's a bit like a single iteration of any such algorithm, if applied to the domain. Distilled, it means "no single approach works for all environments" which is essentially the feedback people writing this sort of article need to hear (because apparently they missed it). There could be great reasons to use protobuf for telemetry: like, it's good enough for your application, the team is familiar with it and the code is already written, tested and documented.
Google's own protobuf runtime is so bloated that even Google itself doesn't use it in much of its own software, but rather nanopb[1], which was written by someone completely unrelated to Google.
I am only aware of nanopb being used at Google in embedded scenarios. The primary reason for the project I worked on was that the first Pixel Buds didn't have a C++ compiler (at least not a functional one - based on GCC 2 or 3 if I recall correctly & an extremely crippled implementation). Pixel Buds 2 didn't have that limitation (stock GCC 9) but nanopb was kept for the second Pixel Buds because the memory management was saner for an embedded RTOS environment (vs stock upstream that might use STL or otherwise try to allocate memory). I don't think runtime bloat was really on our radar (it could have been the next critical path, but wasn't on our radar at the time).
Nanopb itself needed some changes to the codegen for the first Pixel Buds (which I investigated & fixed) due to architectural peculiarities of the CSR8675. It generated a lot of constants but the CSR8675 had limited space there & was made worse because every 8 bits took up 1 16-bit word (e.g. the size of your string literal or any constant byte array is effectively doubled). So I changed the code gen to put those constants in the code section instead. Before this change resolved the issue more fully, random engineers who were unlucky enough to add a constant in some way were hitting limits internally with regularity (as the project was heading toward ship) & working around it by removing characters from log messages (e.g. "Some long message" => "Sme lng msg").
For what it's worth I tried to contribute any meaningful improvements we made back to nanopb but this one didn't make that cut as it was specific to a chip no one in the wider community would be using anyway (8675 is super old).
I don't exclude the possibility of it being used in more places, but I'd say the average Google engineer does not encounter nanopb in their daily development.
> even google themselves don't use it in many of their software
Citation needed. Granted, it's a big company, so it's certainly possible that some teams I'm unaware of were using it. But I never ran into one.
Besides, nanopb doesn't answer the critique of the article. Nanopb cuts down on heap traffic by statically allocating flat buffers large enough to hold maximums set by the application. It solves the two-pass problem by the simple expediency of actually running two passes, storing the size info on the ordinary call stack. In effect, it trades malloc/free traffic for higher peak memory utilization.
At the largest scales, big G suffers from total memory pressure more than it suffers from lack of arithmetic. Nanopb isn't a good trade for them. It is a good trade for my current (hard realtime, embedded) application.
They use it in Android. They also use it in their Firebase SDKs for iOS/Android.
Nanopb doesn't solve wire issues. That's for you to solve by designing your data better. Nanopb solves your application's binary size getting needlessly huge, thanks to less bloated code-gen and a saner runtime (if you have to use protobuf, or are already married to it, that is). It also gives more control over memory management, which is especially important for embedded.
> Protobuf-java is a little heavy [...] Just depending on the library adds 1.6MB and nearly 700 classes before you even generate your own message classes.
By comparison, protobuf-net [1] is about 260KB and 68 classes. Python's [2] is a 1MB package download (with source).
I work on OpenTelemetry Java, including a Java Agent that is originally a fork from Datadog's excellent base which they contributed to the project.
Protobuf definitely has a sizable footprint, but I guess it's still ok compared to other popular libraries like Guava and Jackson. I suspect Richard is saving an article for gRPC+Netty which is extremely large :-) Lots of classes, and every shaded Netty on a classpath has its own arena of pooled buffers.
Wire format I like well enough though. I think protobuf does provide good expressiveness to allow nesting if that fits the bill, or more flat structures, where it is a joy to output repeated fields completely independently and even interleaved in a stream.
It's unclear to me if protobuf itself is a bad wire format for telemetry or heavy use of nesting is. The OpenTelemetry Protocol is highly nested to be completely denormalized. This reduces data on the wire (or not so much compared to gzip, but can be considered a hand-coded gzip), but it means all data has to be traversed before any can be sent to compute the length prefix. I don't know which wins in practice - I've just never seen any comparison of the exact same data with different formats. I'm hoping the CPU cache alleviates the double traversal though. But a less nested format in protobuf would easily be possible.
I created a binary serialization format and library that is comparable to JSON but binary. It uses 1-byte tags to give type information, and the tag often already contains the value (in the case of 0, 1, true, false, null, empty string, empty list, empty object, a single-letter string, or one of the last 32 interned strings), very similarly to msgpack. But it has the advantage that strings are interned, which does about the same for performance as a runtime schema-based thing like Avro, but is more generic. Also, like bencode, it uses the convention that keys in objects are sorted, so that linear deserialization directly into the target objects is much simpler. And instead of prepending the size or length of objects and lists, there is a separate end-tag, so that you can start streaming data before you know how much there will be. Sure, you can't skip without reading the data you skip over, but you (the receiver) can just do an initial pass and store the skip information in a cache, if you need. That seems better than requiring the sender to do multipass without knowing if anyone will ever use the skip info.
Are there any advantages of protobuf over using DER encoded ASN.1 ? The format looks very similar. It looks like they kind of reinvented the wheel here.
ASN.1 is specified as a compilable language. Open source compilers, such as snacc, have existed for 30 years. ASN.1 also defines multiple encodings, DER is used in crypto as you can compare encoded blocks bit by bit. Other encoding rules, such as BER, are tagged. Protobuffers is the same thing re-invented by people who probably never saw ASN.1.
At my previous job we started out using simple StatsD messages that are a few bytes long, easy to implement, easy to debug, and quick.
But then possibly due to a full-steam-ahead mandate of "everything should use protobuf (or was it thrift)?", we switched to a much heavier format, where clients needed a codegen'd binary implementation of the protocol, while making messages themselves much heavier.
Isn't there an easy solution for this? A varint representation does not mandate that you use its variable length. Just pad with leading zeros and you have a fixed length.
If I understand the varint format right, it's groups of 7 bits, stuffed into one byte each. Alright, so let's just always use 4 bytes, which gives you a value range from 0 to 2^28-1 for the length; that's up to 256M, which should suffice if you are that concerned about telemetry speed. Now your length field is always 4 bytes, even though you could shrink it to 1 or 2 most of the time - but since you care more about encoding speed than size on the wire, that should be fine. Nothing forces you to encode the value 42 in 1 byte. Just use 4.
If Protobuf was just created to make it easy for teams to inter-operate, couldn't you just use a REST API with JSON Schema? If it's gonna be a resource pig you might as well use simple components that are widely supported.
The post seems more like an argument against protobuf in general than against its use in telemetry. The main complaints seem to be that the library is big and that nested serialization is tricky because of design decisions.
In regards to binary size, using whatever serialization you're already doing for your main data seems like a win for telemetry. Including telemetry with your main data has pluses and minuses, but using the same serialization is generally fine - unless your serialization can fail, in which case it's hard to report how many times telemetry serialization failed, because you might not be able to serialize that report.
It really depends on your scale and how many messages you are processing per second. For a lot of applications, you're absolutely correct, but if your scale is sufficient, a "micro" optimization like this is actually a "macro" optimization. Also, the author of this article works at Datadog, and I suspect the number of messages they process each second falls under the sufficient category.
For example, suppose you are processing 1M messages per second and you can shave 1 byte off the message size, that shaves off 1MB/sec of data that needs to be processed. If you’re paying for network bandwidth or storing the messages, that saves you something like 2.6TB of data each month.
2.6TB/month is not likely to be a huge deal when it comes to cost savings, but if you keep scaling the messages/sec or the bytes/msg you can start to get some significant savings.
Now I used message size as an example, and the article focuses on processing time not message size, but the point still stands. When you can make a micro optimization for something that is done a very large number of times, there are not-insignificant gains to be had.
If you manage scale in the millions of messages/sec you really don't care about some TB/month. Most companies are still using REST/JSON in places where a binary protocol "would be better".
I think this was for client/application instrumentation where they want to try to have little-to-no overhead for including the datadog agent that ships telemetry back to Datadog's service