
Exactly.

I just wanted to add that #StoreLoad fences (e.g. mfence on x86), as far as I know, do not usually flush the store buffer per se. They just stall the pipeline (technically they only need to stall subsequent loads) until all stores prior to the fence have been drained by normal store buffer operation; the store buffer is always draining as fast as it can anyway.
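For reference, the case where a #StoreLoad fence is needed for correctness is the classic store-buffering litmus test. Here is a minimal C++ sketch of it, assuming a hosted C++11 toolchain; `count_both_zero` is a made-up name for illustration. With the `seq_cst` fences in place, the C++ memory model forbids both threads reading 0. On x86 the fence typically compiles to mfence or a locked instruction, depending on the compiler.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical litmus harness: runs the store-buffering test `iterations`
// times and returns how many runs ended with BOTH threads reading 0.
// With a StoreLoad barrier between each store and load, that must be zero.
int count_both_zero(int iterations) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            // StoreLoad barrier: the load below may not pass the store above.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });

        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Removing the two fences makes the both-zero outcome legal, and on x86 it is the store buffer that makes it observable in practice. The fence is there to forbid that outcome, not to speed anything up.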

You didn't imply otherwise, but I wanted to clarify that because I have seen comments elsewhere, and in code, claiming that a fence makes a prior store visible faster (i.e. the fence was added to improve latency rather than being required for correctness), which I do not think is the case, at least at the microarchitectural level (things are more complex when a compiler is involved, of course).
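If someone wants to check the "fence improves latency" claim on their own hardware, a rough way is to time a two-thread ping-pong with and without a fence after each store. This is only a sketch: `ping_pong_ns` is a hypothetical helper, it assumes at least two hardware threads, and I am not claiming any particular result. Thread pinning, frequency scaling and noise all matter for a real measurement.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical micro-benchmark: time `rounds` ping-pong exchanges between two
// threads, optionally issuing a seq_cst fence right after each store.
// Returns the elapsed wall-clock time in nanoseconds.
long long ping_pong_ns(bool fence_after_store, int rounds) {
    assert(rounds >= 0);
    std::atomic<int> ping{0}, pong{0};

    std::thread responder([&] {
        for (int i = 1; i <= rounds; ++i) {
            // Spin until the main thread's store of `i` becomes visible.
            while (ping.load(std::memory_order_acquire) != i) {}
            pong.store(i, std::memory_order_release);
            if (fence_after_store)
                std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 1; i <= rounds; ++i) {
        ping.store(i, std::memory_order_release);
        if (fence_after_store)
            std::atomic_thread_fence(std::memory_order_seq_cst);
        // Spin until the responder's reply for this round becomes visible.
        while (pong.load(std::memory_order_acquire) != i) {}
    }
    auto stop = std::chrono::steady_clock::now();

    responder.join();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Comparing `ping_pong_ns(false, n)` against `ping_pong_ns(true, n)` over many runs gives a crude idea of whether the fence changes round-trip time at all. Note that neither variant needs the fence for correctness here; release/acquire is enough.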



Yes, that's right. As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but let other ops through. I think allocation blocks when the first load/store arrives while the pipeline is in that state, so you can hide a lot of the cost of atomic operations by ensuring there is as long a series as possible of non-load/store operations after them (of course, this is often not possible).

mfence used to work like that, but due to various bugs/errata was "upgraded" to now block execution of all subsequent instructions until it retires (like mfence) in addition to its store-draining effects. So mfence is actually a slightly stronger barrier than atomic operations (but the difference is only apparent with non-WB memory).

If you want to be totally pedantic, it may be the case that an mfence or another fencing atomic operation results in the stores being visible faster: because they block further memory access instructions, there can be less competition for resources like fill buffers, so it is possible that the stores drain faster.

For example, Intel chips have a feature where cache lines targeted by stores other than the one at the head of the store buffer can be fetched early, so-called "RFO prefetch", which gives MLP in the store pipeline. However, this will be limited by the available fill buffers, and perhaps also by heuristics that scale the feature back when fill buffers are heavily used even if some are available (since load latency is generally way more important than store latency).

So something like an mfence/atomic op blocks later competing requests and gives stores the quietest possible environment to drain. I don't think the effect is very big though, and you could achieve the same effect by, e.g., just putting a bunch of nops after the "key" store (although you wouldn't know how many to put).


> As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but let other ops through. I think allocation blocks when the first load/store arrives when the pipeline is in that state.

That's great to know. I suspected that was the case and that they had moved away from the stall-the-pipeline approach, but I had never tested it.


Sorry, that should say "(like lfence)", not "(like mfence)".



