SuperMalloc: A Super Fast Multithreaded Malloc for 64-bit Machines (researchr.org)
147 points by ingve on Dec 21, 2015 | 32 comments


For what it's worth, in my tests, Hoard (http://www.hoard.org, https://github.com/emeryberger/Hoard) significantly outperformed SuperMalloc on a machine without TSX. For example, on a simple microbenchmark that tests scalability, Hoard is 2x faster with one thread, and the gap widens with more threads (Hoard is 3.3x faster with 8 threads, 4x faster with 16 threads).


Wow! Want to do a full disclosure?

Your papers were very valuable during my undergraduate studies. Is Hoard neck and neck with jemalloc and Lockless's allocator nowadays?

http://locklessinc.com/benchmarks_allocator.shtml



"SuperMalloc is an implementation of malloc(3) designed for x86 Hardware Transactional Memory (HTM). It turns out that the same design also makes it fast even without HTM. "

Does anyone know if there's a cpuid feature flag for HTM? Do all haswell-generation SKUs support the feature?


TSX is disabled in most (all?) Haswell SKUs due to bugs.


Interesting. "In August 2014, Intel announced a bug in the TSX implementation on current steppings of Haswell, Haswell-E, Haswell-EP and early Broadwell CPUs, which resulted in disabling the TSX feature on affected CPUs via a microcode update." -- So only (non-early) Broadwell CPUs support it now?


Skylake CPUs support it now as well. Check the "TSX-NI" table entry: http://ark.intel.com/products/88191/Intel-Core-i5-6600K-Proc...


It shows that it is supported in Skylake. I guess they fixed the bugs.


Hmmm. Does anybody know if this qualifies the processor for a refund/replacement if I bought one to learn and use TSX?


That's Intel Transactional Synchronization Extensions, right? https://en.wikipedia.org/wiki/Transactional_Synchronization_...


Neat, we have multiple allocator authors on this thread! I'm one of the authors of reference [38], Streamflow (http://www.scott-a-s.com/projects/#streamflow). Unlike the other allocators, our code has not been kept up to date over the years, so it's reasonable that our allocator was not compared against in the experiments. In any event, TBBmalloc shares a lot of design similarities with ours. (It's based on McRT-malloc, which was presented as a paper in the same session as our allocator at ISMM 2006.)

The author of this paper does not use lock-free techniques, which our allocator did; I'm curious whether using lock-free algorithms would have changed the author's design or improved its performance. I do think that, despite Streamflow's similarities to TBBmalloc, Streamflow is not susceptible to the same kind of memory blowup. The problem, as described in the recent paper:

"TBBmalloc can have an unbounded footprint. One case was documented by [40]. In this case, one thread allocates a large number of objects, and a second thread then frees them, placing them into the first threads foreign block. If the first thread then does not call free(), then the memory will never be removed from the foreign block to be reused. There appears to be no easy fix to this problem in TBBmalloc, since the thread-local locking policy assumes, deep in its design, that every thread calls free() periodically."

Streamflow avoids this problem by putting the remote-free block check on the malloc path. That is, when allocating memory, a thread always checks whether other threads have remotely freed memory for it. Basically, if you're continually allocating memory, you're also continually reclaiming the memory that other threads freed for you. You can see this in action: https://github.com/scotts/streamflow/blob/master/streamflow....
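A hypothetical sketch of that idea (illustrative names and a fixed 64-byte size class, not Streamflow's actual code): non-owning threads push freed blocks onto a lock-free list, and the owner atomically drains that list at the top of every allocation, so memory freed remotely is reclaimed without the owner ever calling free():

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct node { struct node *next; } node_t;

typedef struct heap {
    _Atomic(node_t *) remote_free; /* blocks freed by other threads */
    node_t *local_free;            /* owner-only free list */
} heap_t;

/* Free path taken by a non-owning thread: lock-free Treiber-style push. */
void remote_free_push(heap_t *h, node_t *n) {
    node_t *head = atomic_load(&h->remote_free);
    do {
        n->next = head;
    } while (!atomic_compare_exchange_weak(&h->remote_free, &head, n));
}

/* Malloc path: atomically take the entire remote list, then reuse it.
 * Because this runs on every allocation, a continually-allocating
 * thread keeps reclaiming memory that other threads freed for it. */
void *heap_alloc(heap_t *h) {
    node_t *remote = atomic_exchange(&h->remote_free, NULL);
    while (remote) {               /* splice remote frees into local list */
        node_t *next = remote->next;
        remote->next = h->local_free;
        h->local_free = remote;
        remote = next;
    }
    if (h->local_free) {           /* reuse a reclaimed block */
        node_t *n = h->local_free;
        h->local_free = n->next;
        return n;
    }
    return malloc(64);             /* fall back to fresh memory */
}
```

The key design point is the atomic_exchange: the owner takes the whole remote list in one step, so the remote-free push never blocks and the owner never needs a lock.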

All of the above is just to add some background - I look forward to really digging into this paper over the holidays.

(In case anyone actually follows my comments, they may know I currently do research and development for IBM Streams. That has zero relationship to this memory allocator I worked on early in grad school; it's just an odd coincidence of project names.)


Fast, multicore-scalable, low-fragmentation memory allocation through large virtual memory and global data structures [1]

An allocator designed for scalability from the ground up. Similar in design to Streamflow (based on so-called spans), it eagerly returns memory (latency-aware), and its backend also takes advantage of fragmenting virtual memory, which is plentifully available on 64-bit systems.

[1]: http://dl.acm.org/citation.cfm?doid=2814270.2814294


You all are giving me a lot of reading material.


I think the significant design decision is that in a 64-bit world, virtual address space is no longer scarce. This allows a simpler implementation, with roughly half the code.

The microbenchmarks are great; the whole-program benchmarks show nothing earth-shaking. But I'll take OK performance and half the code any day.


Can you explain? It seems like the address space is getting more scarce, not less. Several years ago, when x86-64 adoption really took off, 48 bits was a lot bigger than the amount of memory one might actually put in a system, but now it isn't (48 bits virtual vs. 40 bits physical).


The algorithm sparsely uses 512MB of virtual address space for its allocations and caches. For each of about 40 size classes, it keeps an array of fixed-size blocks, plus a bunch of per-thread and per-CPU caches.

In a 32-bit world, taking 1/4 to 1/8 of the address space would be impolite. In a 64-bit (or 48-bit) world it doesn't matter.


Yes, in the x86 world, things are not as rosy as they could be.

SPARC moved to a 64-bit address space a long time ago. 32 terabytes of memory in a single server? Sure, why not.

But as the other poster pointed out, this really isn't an issue given the approach they've taken, even on x86.


SuperMalloc was subject to critique and comparison with Bonwick's Slab Allocator in this excellent talk by Ryan Zezeski: "Memory by the Slab: The Tale of Bonwick's Slab Allocator" video, paper etc: http://paperswelove.org/2015/video/ryan-zezeski-memory-by-th...


"dual licensed" - guaranteed to double your licensing confusion.


Can't we just use the MIT part? What's the point of having two FOSS licenses, one being non-viral?


I don't think adding a GPLv3 dual to an MIT license changes much. You can derive a work from MIT licensed code and license your work with a more restrictive license. You would still need to reproduce the MIT license alongside your GPLv3 license which would be confusing.

The author is giving you the option to omit the MIT text and have a pure GPLv3 if you wish to restrict access to your software in that way.


Don't all MIT-licensed projects have this possibility de facto? I mean, I can take any other MIT-licensed software, add some GPL-licensed bits of my own, and release the whole thing as just GPL.


The whole thing is not GPL; someone can still replace your changes with more MIT code and use the whole thing under the MIT license. You cannot change the license of someone else's code.


>reproduce the MIT license alongside your GPLv3 license

The MIT license must be included in the provided software, but it isn't a part of the new license.


Also, beware of submerged patent mines.


Comparison to tcmalloc suspiciously omitted.


Came here to say just that. tcmalloc performs better than jemalloc in various cases.


Interesting - I had noticed the omission, but I had assumed that meant that jemalloc had emerged as the agreed-upon allocator best for production. Are you aware of any recent published experiments with the two?


I'm not aware of any serious published works. Really the only way to choose an allocator is to try them all on your benchmark workload. Any other benchmarks are likely to be irrelevant.

I think tcmalloc vs. jemalloc is basically a wash. One is written by Google people and the other by Facebook people. Facebook is much more forward about its open source projects than Google is, so more people have heard of jemalloc.


jemalloc is unaffiliated with FB. It was developed and integrated into FreeBSD and Firefox long before the author started working for Facebook.


What I meant about facebook's advocacy is that they have published lots of high profile blogs and whatnot about applications of jemalloc which have contributed greatly to its renown relative to tcmalloc, and I think it's fair to say that jemalloc versions 2, 3, and 4 have been largely underwritten by the Big Thumb.

And the other side of what I was trying to say is that for Google it's largely an issue of personality. tcmalloc is Google's malloc. JE works for Facebook. If JE worked at Google I'm sure Google would just use jemalloc.


tcmalloc is great for small, random-sized allocations. When using std::string without .reserve(), it performed much better in the testing I did last year.



