Exactly. I just wanted to add that #StoreLoad fences (i.e. mfence on x86 for exa...

BeeOnRope · on June 6, 2019

Yes, that's right. As far as I know, on modern Intel chips, atomic operations block both the store and load ports but let other ops through. I think allocation blocks when the first load/store arrives while the pipeline is in that state, so you can hide a lot of the cost of atomic operations by ensuring there is as long a series as possible of non-load/store operations after them (of course, this is often not possible).
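To make that concrete, here is a minimal sketch of the idea, assuming x86-64 and a typical compiler. The function name `bump_then_compute`, the mixing constant, and the 8-entry table are all hypothetical, chosen only for illustration. The atomic increment is followed by a run of register-only arithmetic, and the next memory access (the table load) comes last. Whether this hides any latency depends on the microarchitecture and on how the compiler schedules the code, so treat it as something to measure, not a guarantee.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: atomic RMW first, then ALU-only work, then the next load.
std::uint64_t bump_then_compute(std::atomic<std::uint64_t>& counter,
                                std::uint64_t x,
                                const std::uint64_t* table) {
    // Locked read-modify-write (lock xadd or similar on x86).
    counter.fetch_add(1, std::memory_order_seq_cst);

    // Register-only work: no loads or stores are needed here, so in
    // principle it can proceed while the atomic operation completes.
    std::uint64_t h = x;
    for (int i = 0; i < 16; ++i) {
        h = h * 0x9E3779B97F4A7C15ull + static_cast<std::uint64_t>(i);
        h ^= h >> 29;
    }

    // First memory access after the atomic; placed as late as possible.
    return table[h & 7];
}
```

The compiler is free to move the arithmetic around, so checking the generated assembly is the only way to know what ordering the hardware actually sees.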

mfence used to work like that, but due to various bugs/errata was "upgraded" to now block execution of all subsequent instructions until it retires (like mfence) in addition to its store draining effects. So mfence is actually a slightly stronger barrier than atomic operations (but the difference is only apparent with non-WB memory).

If you want to be totally pedantic, it may be the case that mfence or another fencing atomic operation results in the stores being visible faster: because they block further memory access instructions, there can be less competition for resources like fill buffers, so it is possible that the stores drain faster.

For example, Intel chips have a feature where cache lines targeted by stores other than the ones at the head of the store buffer can be fetched, so-called "RFO prefetch". This gives MLP in the store pipeline. However, this will be limited by the available fill buffers and perhaps also by heuristics ramping back this feature when fill buffers are highly used even if some are available (since load latency is generally way more important than store latency).

So something like an mfence/atomic op blocks later competing requests and gives stores the quietest possible environment to drain. I don't think the effect is very big though, and you could achieve the same effect by, e.g., just putting a bunch of nops after the "key" store (although you wouldn't know how many to put).
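For reference, the two flavors of "key store followed by a full barrier" look roughly like this in portable C++. This is a sketch only: the variable and function names are hypothetical, and the exact instructions are up to the compiler. On x86-64 the seq_cst fence is typically emitted as mfence or a dummy locked instruction, and the seq_cst exchange as xchg, which is implicitly locked.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical shared variables, for illustration only.
std::atomic<std::uint64_t> key_flag{0};
std::atomic<std::uint64_t> other_flag{0};

// Variant 1: plain store, then an explicit full fence (StoreLoad barrier).
std::uint64_t store_fence_load(std::uint64_t v) {
    key_flag.store(v, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return other_flag.load(std::memory_order_relaxed);
}

// Variant 2: the store itself is a fencing atomic operation (exchange).
std::uint64_t xchg_load(std::uint64_t v) {
    key_flag.exchange(v, std::memory_order_seq_cst);
    return other_flag.load(std::memory_order_relaxed);
}
```

In both variants the later load cannot be satisfied before the store to `key_flag` is globally visible, which is the property under discussion here.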

gpderetta · on June 7, 2019

> As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but lets other ops through. I think allocation blocks when the first load/store arrives when the pipeline is in that state.

That's great to know. I suspected that was the case and that they had moved away from the stall-the-pipeline approach, but I had never tested it.

BeeOnRope · on June 6, 2019

Sorry that should say "(like lfence)" not "(like mfence)".