
Movdqu vs movdqa is just one of the possible bad things that can happen when misaligned data is erroneously used. And it will segfault consistently, which is a good thing: it means the mistake will be caught early in testing. You want compilers to produce binaries that catch mistakes early in testing, not binaries that work 99.99% of the time and fail in obscure ways.


To expand on the 0.01% cases - atomic operations (be they used for lock-free algorithms, or to implement spinlocks) generally require alignment. If you're lucky, unaligned access will segfault. I have been so lucky, when unaligned global char[] buffers were recast to types containing pthread locks. Operations on said locks were kind enough to segfault when unit testing a port to ARM hardware.

If you're unlucky, your "atomic" x64 instructions will silently revert to non-atomic behavior when straddling cachelines but otherwise "succeed". This will introduce one of the absolute nastiest kinds of heisenbugs into your codebase, in one of the hardest to understand and debug parts of your program. Speaking frankly - my coworkers won't figure it out. I won't figure it out. Instead, we'll ship unstable software, and maybe blame the hardware and suggest memtest.

A movdqa segfault, by comparison, is tame, easily understood, easily fixed, and may well help me catch and fix said accidentally non-atomic behavior.


Citation needed that lock-prefixed instructions require alignment or must not span cache lines on x86. I don't see that in any documentation.


> lock-prefixed instructions

Are not the only instructions used for atomic operations.

> Citation needed

EDIT: memory_order_acq_rel as I've shown below isn't actually a valid value to pass to store (I noticed this when trying to get sane disassembly out of MSVC for yet another data point - it was turning the store into a no-op!). memory_order_seq_cst is, however, and it gets compiled to xchg by GCC, Clang, and MSVC... but memory_order_release and memory_order_relaxed are both legal too, and get compiled down to vanilla movs in GCC, Clang, and MSVC as well: https://clang.godbolt.org/z/QfnxPJ

  void foo(std::atomic<int> & i) {
      i.store(42, std::memory_order_acq_rel);
  }
Gets compiled down into a no-lock-prefix mov by clang (gcc emits xchg, which is implicitly locked and will be atomic even when straddling cacheline boundaries): https://clang.godbolt.org/z/87x98C

Mov can't even take a lock prefix (https://software.intel.com/sites/default/files/managed/39/c5... page 1158):

> [...] The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG [...]

And mov is only atomic when within a single cacheline on modern processors: (https://software.intel.com/sites/default/files/managed/39/c5... page 3052):

> The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

> [...]

> • Reading or writing a doubleword aligned on a 32-bit boundary

> The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

> [...]

> • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

> Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

(PDF is the latest version of "Intel® 64 and IA-32 Architectures Software Developer’s Manual, Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4" I found at https://software.intel.com/en-us/download/intel-64-and-ia-32... )


I get the impression that you're blaming flaws in the implementation of std::atomic on the processor.

Memory reads using MOV are not atomic. That's not as bad as you make it sound. Instructions that are prefixed with LOCK (or implicitly LOCKed) are always atomic on misaligned data, as per the last paragraph you quoted. That paragraph weasels around because the feature technically depends on external bus management implemented on the mainboard, which is outside the scope of the processor manual. But third party chipsets are dead. The same is repeated explicitly in the description of the LOCK prefix (Volume 2A, chapter 3):

> The integrity of the LOCK prefix is not affected by the alignment of the memory field. Memory locking is observed for arbitrarily misaligned fields.

As for the example code that you posted: I postulate that at least GCC is essentially broken and violates the guarantees given in their own manual and the C++ standard. The following example demonstrates this:

  #define UNSAFE
  
  template<typename valueType>
  class MyAtomic {
  public:
      valueType load()
      {
  #ifdef UNSAFE
          return __atomic_load_n(&value, __ATOMIC_RELAXED);
  #else
          valueType temp = 0;
          __atomic_compare_exchange_n(&value, &temp, temp, false, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
          return temp;
  #endif
      }

      void store(valueType newValue)
      {
  #ifdef UNSAFE
          __atomic_store_n(&value, newValue, __ATOMIC_RELAXED);
  #else
          __atomic_exchange_n(&value, newValue, __ATOMIC_RELAXED);
  #endif
      }
  
  private:
      valueType value;
  };
The UNSAFE version compiles to simple mov instructions. This is not correct as there is no alignment guarantee for value. Using LOCK CMPXCHG and XCHG for the read and write is possible and safe. This is the code you get when undefining UNSAFE.


> Memory reads using MOV are not atomic.

They are according to the Intel documentation, as long as they are suitably aligned (which is not an issue for sane code). There are billions of lines of code that require this atomic behaviour, and not only C applications BTW (including the Linux kernel): the mapping of the JVM memory model to x86 also relies on it.

> I postulate that at least GCC is essentially broken and violates the guarantees given in their own manual and the C++ standard.

Which guarantee is violated exactly? __atomic_load_n works on pointers of type T, and the type implies the (implementation-defined) alignment. The standard std::atomic<T> is the same. Using locked operations for, say, relaxed loads and stores would be pointlessly expensive and against the spirit of the C++11 memory model.

Also, lock-prefixed operations straddling cachelines are implemented with an extremely slow path costing thousands of clock cycles instead of a couple of dozen; this slow path not only slows down the CPU executing it, but all the CPUs in the system. It is so bad that newer Intel CPUs allow the OS to disable this (mis)feature, as it is an easy source of DoS. In theory the OS can then catch the fault and emulate the atomic instruction in software: it will be even slower, but at least it would only affect the process issuing it.


> I get the impression that you're blaming flaws in the implementation of std::atomic on the processor.

These flaws are not unique to std::atomic nor x86, they are in fact ubiquitous across multiple languages and processor types. If I were to blame something, I'd blame fundamental physics. You've only got a few options for cacheline straddling atomic operations:

1) Slow, complicated, and silent handling of the edge case in your processor and bus design

2) Misbehave

3) Explode violently

x86 chooses... well, actually, it lets you choose, to some degree, by providing a wide variety of instructions. Neat! You have limited control over which option your C++ compiler will choose, and the C++ standard allows any choice by labeling the situations where it comes up "undefined behavior". I prefer option 3 as the least bad option (they're all bad), giving me the chance to fix my code (by aligning my accesses, mooting the problem entirely.) I can always write my own debug assertions around my atomic wrappers to force behavior #3 even if the compiler chooses something else... assuming the optimizer doesn't get too aggressive and optimize away my alignment checks.

Bonus points to the compiler if it saves me the trouble and uses instructions that take option 3, letting me catch the mistake in my code in fully optimized, assertionless release builds. This is why I defend the compiler's use of movdqa elsewhere in the discussion tree.

EDIT: I could see the argument for preferring option 1, if I could choose to do so ubiquitously. But I do not have the luxury of enough control to accomplish that. So I prefer option 3 for what I can control.

> Memory reads using MOV are not atomic

This is directly contradicted, when aligned, by the citations I just gave you from Intel's own manuals. An aligned 8-, 16-, or 32-bit memory read - including via MOV - is atomic, per 8.1.1 Guaranteed Atomic Operations, even on a 486. Pentiums and x64 expand atomicity to even more memory accesses.

> I postulate that at least GCC is essentially broken

You do not merely postulate that GCC is essentially broken, since per my edit this is not GCC-only behavior. You postulate GCC, Clang, and MSVC are all essentially broken, because they all compile release/relaxed stores to vanilla movs.

EDIT:

> This is not correct as there is no alignment guarantee for value.

Incorrect. value is guaranteed an alignment of at least alignof(valueType) within the context of a defined-behavior program, which generally means the compiler imposes alignof(MyAtomic<valueType>) == alignof(valueType). This does typically mean alignof(MyAtomic<int>) > alignof(MyAtomic<char>): https://wandbox.org/permlink/jatL7ZC7eC2V1lVw


We're thinking very differently about the same topic. To me, MOV is not atomic, except in specific cases. You're saying that MOV is mostly atomic except in a specific set of cases. In the end, it's about how defensive you want to be about it. I tend to be very much on the defensive side in arguments like this.

I really can see the appeal in having the code blow up as soon as it can and as aggressively as it can. It's ideal for development. This stuff is hard enough as it is and impossible to test 100%. There's no testing harness on earth that allows you to enter two sections of code on two cores with a cycle-accurate delay and with a 100% defined state of the CPU (TLB, cache contents, branch predictor status...).

As for alignment guarantee by the language: your own example code uses pointer arithmetic to conjure up a class out of thin air that is overlaid over other memory and thus has the improper alignment you're exploiting. That demonstrates how little faith you are allowed to have in the alignment of an instance of your type if you didn't allocate it yourself. The same thing may happen if you make such a type part of packed struct.

I've just skimmed the relevant sections on alignment in the C++ standard, but it is not very explicit on the issue of alignment of fundamental types. The chapter on the atomics library doesn't bring that issue up at all. I'm not reading through the entire thing now - I'm not that bored.


> As for alignment guarantee by the language: your own example code uses pointer arithmetic to conjure up a class out of thin air that is overlaid over other memory and thus has the improper alignment you're exploiting. That demonstrates how little faith you are allowed to have in the alignment of an instance of your type if you didn't allocate it yourself.

I forgot to point out in my other reply: -fsanitize=undefined immediately catches this, and gives me the exact source file / line number of the bug. So there are ways to convince yourself you're mostly doing the right thing.

EDIT: Scroll up to the first error in https://wandbox.org/permlink/q0mxk5DViIVs2GBl if you want to see an example.


> To me, MOV is not atomic, except in specific cases

It seems that you are in an extremely small minority. MOV is atomic for all properly aligned accesses. All valid accesses in C++ are de jure properly aligned, hence a plain mov is always a safe way to lower (non-seq-cst) stores and loads.

You would want the compiler to generate extremely pessimal code (locked instructions for all atomic loads and stores) for something that never happens in sane code.


> We're thinking very differently about the same topic.

Yes and no. I recognize the same edge cases you do, and I err on the side of caution. Sometimes, it's even quite worthwhile to make seemingly outlandish arguments such as "literally every compiler and library dealing with this topic out there is broken".

But in this case the compilers and libraries have reasonable rationale behind their decisions that I can't entirely tear apart. If I ruled the ecosystem with an iron fist, I might have made different ones... if there were a diversity of implementation decisions, it might be worthwhile to agitate in favor of said different choices... but the ecosystem seems to near-unanimously agree in the other direction. So in this case, caution and defense on an individual level means working within the reality of that ecosystem. That reality is allowed by the C++ standard. One could perhaps argue the C++ standard is broken, and I've certainly taken many a tilt at the C++ standard windmill, but I feel I have many more dangerous - and more slayable - dragons to face down first.

I could replace every single mov with an xchg in all code I have access to, but it still wouldn't fix the problem: I link against closed source libs, compiled with movs, with their own alignment requirements, and their own multithreading nonsense. I must align my types or I will suffer the bugs.

> There's no testing harness on earth that allows you to enter two sections of code on two cores with a cycle-accurate delay and with a 100% defined state of the CPU (TLB, cache contents, branch predictor status...).

There have been some interesting research projects into this FWIW - Maple, Thread Sanitizer, Helgrind - focusing on symbolic execution, static analysis, executable instrumentation, etc., but cycle- and bug-accurate CPU emulators are mostly a retro community thing, not cheap, and not generalizable to the wide variety of hardware I need to deal with.

> That demonstrates how little faith you are allowed to have in the alignment of an instance of your type if you didn't allocate it yourself.

You're not entirely wrong. But by the time I get my hands on a given T with an alignment less than alignof(T), the nasal demons of undefined behavior have already been well and truly invoked. Trying to partially paper over the problem with misalignment tolerant code at that point is a waste of my time that merely encourages more subtle heisenbugs. The better use of my time, and the better solution to my lack of faith, is to verify alignment - and sometimes catch and fix misalignment - not to give up and assume I'm unaligned.

You're not entirely right either. Several ARM ports of multiple large codebases have turned up surprisingly few alignment bugs in my experience. They're surprisingly rare. You have to go out of your way to create most of them.

> The same thing may happen if you make such a type part of packed struct.

Modern Clang, GCC, and MSVC can all warn about this, which you can upgrade to an error: https://clang.godbolt.org/z/hNjao6

As can Rust: https://play.rust-lang.org/?version=stable&mode=debug&editio...


...and just to give a concrete example of a program that can silently "misrun" on x64 (via cacheline straddling - which is undefined behavior due to the misalignment, obviously):

  #include <atomic>
  #include <thread>
  #include <iostream>
  #include <cassert>
  #include <cstdlib>
  
  int main() {
      constexpr size_t target_cacheline_bits = 6;
      constexpr size_t target_cacheline_size = (1 << target_cacheline_bits); // 64 bytes
      constexpr size_t target_cacheline_straddle_mask = (target_cacheline_size - 1);
      char buffer[target_cacheline_size + sizeof(std::atomic<short>)];
      std::atomic<short> & straddling = *(std::atomic<short> *)((size_t)buffer | target_cacheline_straddle_mask); // UB due to misalignment
      straddling.store(1);
      
      constexpr size_t N = 1000000;
      std::thread t1([&](){
          for (size_t i=0; i<N; ++i) {
              straddling.store(i&1 ? 0x0001 : 0x0100, std::memory_order_relaxed);
          }
      });
      std::thread t2([&](){
          for (size_t i=0; i<N; ++i) {
              const short value = straddling.load(std::memory_order_relaxed);
              if (value != 0x0001 && value != 0x0100) {
                  std::cerr << "Got value: " << value << "\n";
                  std::exit(1);
              }
          }
      });
  
      t1.join();
      t2.join();
  }
https://wandbox.org/permlink/IK43Ou0NfAw9y7us

Sometimes this will run to completion without triggering anything. Other times the values "0" or "257" (0x0101) will appear in stderr. Feel free to raise "N", or exclude "0" from the error triggering case, to see more values.



