Yeah; expected limits are also fantastically useful in performance engineering. ...

citizen_friend · on March 18, 2023

An arena allocator sounds like it could handle your problem elegantly, without the special cases.

josephg · on March 19, 2023

I've thought about that.

While processing, yes - an arena allocator is a better fit. But my data is loaded from disk & held in memory while it's manipulated by consumers of my API. Given the lifetime is determined by the caller, there's no obvious arena to allocate from.

I could put the whole thing into a long lived arena - but unless I'm careful, some operations would leak memory.
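To make the leak concern concrete, here is a minimal sketch of one way it could happen. `Arena`, `alloc`, `get`, `used` and `replace` are hypothetical names invented for illustration, not code from the project being discussed. It assumes a simple bump-style arena that only frees memory when the whole arena is dropped.

```rust
use std::ops::Range;

/// Hypothetical bump arena: hands out ranges into one growing buffer and
/// only releases memory when the whole arena is dropped.
struct Arena {
    storage: Vec<u32>,
}

impl Arena {
    fn new() -> Self {
        Arena { storage: Vec::new() }
    }

    /// Copy `items` to the end of the buffer and return a handle to them.
    fn alloc(&mut self, items: &[u32]) -> Range<usize> {
        let start = self.storage.len();
        self.storage.extend_from_slice(items);
        start..self.storage.len()
    }

    fn get(&self, handle: &Range<usize>) -> &[u32] {
        &self.storage[handle.clone()]
    }

    /// Total number of elements the arena is holding, live or dead.
    fn used(&self) -> usize {
        self.storage.len()
    }
}

/// Replace a stored list with an edited copy. The old range is never
/// reused, so it stays allocated for as long as the arena lives.
fn replace(arena: &mut Arena, old: Range<usize>, extra: u32) -> Range<usize> {
    let mut items = arena.get(&old).to_vec();
    items.push(extra);
    arena.alloc(&items)
}
```

Each call to `replace` leaves the previous copy behind, so a long-lived arena grows with the number of edits rather than with the size of the live data.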

But it would definitely be better from a performance standpoint. Using smallvec, every time these values are read or written the code needs to check if the value is "spilled" or not. And I think there's a lot of code monomorphization involved too - using a vec in an arena would probably make my binary a fair bit smaller.
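For readers unfamiliar with the "spilled" check, a rough sketch of the idea follows. This is not the smallvec crate's actual implementation. `TinyVec` is a hypothetical type, fixed to `u32` elements, written only to show where the branch comes from.

```rust
/// Hypothetical small vector: up to N items stored inline, otherwise
/// moved ("spilled") to a heap-allocated Vec.
enum TinyVec<const N: usize> {
    Inline { len: usize, buf: [u32; N] },
    Spilled(Vec<u32>),
}

impl<const N: usize> TinyVec<N> {
    fn new() -> Self {
        TinyVec::Inline { len: 0, buf: [0; N] }
    }

    fn is_spilled(&self) -> bool {
        matches!(self, TinyVec::Spilled(_))
    }

    /// Every read has to branch on which representation is active.
    fn as_slice(&self) -> &[u32] {
        match self {
            TinyVec::Inline { len, buf } => &buf[..*len],
            TinyVec::Spilled(v) => v.as_slice(),
        }
    }

    /// Every write branches too, and may switch representation.
    fn push(&mut self, value: u32) {
        match self {
            TinyVec::Inline { len, buf } if *len < N => {
                buf[*len] = value;
                *len += 1;
            }
            TinyVec::Inline { len, buf } => {
                // Inline storage is full: copy everything to the heap.
                let mut v = Vec::with_capacity(N * 2 + 1);
                v.extend_from_slice(&buf[..*len]);
                v.push(value);
                *self = TinyVec::Spilled(v);
            }
            TinyVec::Spilled(v) => v.push(value),
        }
    }
}
```

The const parameter `N` also hints at the monomorphization point: each distinct inline capacity (and, in a generic version, each element type) gets its own compiled copy of these methods.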

comex · on March 19, 2023

Even with an arena allocator, the indirection is likely to increase cache misses, especially when the array element type is something small like an integer.

It’s possible to implement something like smallvec without the branch by having it always contain a pointer field, which points to either the inline storage or a heap allocation. However, this means it can’t be moved in memory (has to be pinned), and also means you can’t reuse the pointer field to be part of the inline storage in the inline case.
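A rough Rust sketch of that pointer-always layout, under some loud assumptions: `PinnedVec` is a hypothetical type, it is read-only after construction, and it uses `Box::pin` purely to get a stable address. That boxing adds a heap allocation of its own, which a real design would avoid by pinning in place, so treat this as an illustration of the layout and its constraints rather than something to use.

```rust
use std::marker::PhantomPinned;
use std::pin::Pin;

/// Hypothetical branch-free small vector: `data` always points at the live
/// elements, whether they sit in `inline` or in `heap`.
struct PinnedVec<const N: usize> {
    data: *const u32,
    len: usize,
    inline: [u32; N], // separate from `data`, so the pointer's bytes can't double as storage
    heap: Vec<u32>,
    _pinned: PhantomPinned, // opt out of Unpin: moving the struct would dangle `data`
}

impl<const N: usize> PinnedVec<N> {
    /// Build from a slice. The struct is boxed and pinned so its address,
    /// and therefore the address of `inline`, never changes.
    fn from_slice(items: &[u32]) -> Pin<Box<Self>> {
        let mut boxed = Box::pin(PinnedVec {
            data: std::ptr::null(),
            len: items.len(),
            inline: [0; N],
            heap: Vec::new(),
            _pinned: PhantomPinned,
        });
        // SAFETY: fields are only written in place; the value is never moved.
        let this = unsafe { boxed.as_mut().get_unchecked_mut() };
        if items.len() <= N {
            this.inline[..items.len()].copy_from_slice(items);
            this.data = this.inline.as_ptr();
        } else {
            this.heap = items.to_vec();
            this.data = this.heap.as_ptr();
        }
        boxed
    }

    /// Reads go straight through `data`, with no inline/spilled check.
    fn as_slice(&self) -> &[u32] {
        // SAFETY: `data` points at `len` initialised u32s owned by `self`,
        // and `self` is pinned, so the inline buffer has not moved.
        unsafe { std::slice::from_raw_parts(self.data, self.len) }
    }
}
```

Both costs mentioned above show up directly: the type has to opt out of `Unpin`, and `data` takes up space alongside `inline` instead of overlapping it.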

josephg · on March 19, 2023

I'd love to see some real numbers showing how these different decisions impact performance and code size. I suspect the branch cost is pretty minimal because so few of my smallvecs get spilled - so the branch predictor probably does a pretty good job at this.

And there's often fiercely diminishing returns from optimizing allocations. Dropping the number of allocations from 1M to 1k made a massive performance difference. Dropping it from 1k to 1 will probably be under the benchmark noise floor.

xigoi · on March 19, 2023

I don't think that goes against the rule. Your code doesn't impose an arbitrary limit on the data, it just internally represents it differently based on size.