A bug story: data alignment on x86 (2016) (pzemtsov.github.io)
40 points by phab on April 18, 2020 | hide | past | favorite | 39 comments


Summarizing the conclusion of the article: GCC correctly interprets the spec as saying that all integer reads must be aligned in memory, and when vectorizing the code chooses to use an "aligned" instruction that fails on unaligned data (MOVDQA). On modern x64 processors, the unaligned version of this instruction (MOVDQU) is just as fast on both aligned and unaligned data, and has the advantage of not causing a segfault when run.
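For context, the pattern at issue looks roughly like this (a hypothetical reduction, not the article's exact code; the function name is mine):

```cpp
#include <cstddef>
#include <cstdint>

// A byte buffer is reinterpreted as uint32_t*, which licenses the
// compiler to assume 4-byte alignment and vectorize with aligned loads
// (MOVDQA) - and to segfault if the buffer happens to be misaligned.
uint32_t sum_words(const char* buf, size_t len_bytes) {
    const uint32_t* p = (const uint32_t*)buf;  // UB if buf is misaligned
    uint32_t sum = 0;
    for (size_t i = 0; i < len_bytes / 4; i++)
        sum += p[i];
    return sum;
}
```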

Is this a bug in GCC that should be fixed, or is GCC justified in its behavior? Or is there another interpretation?

Having been bitten by this in the past, my conclusion is that while this is not a bug, it is (slight) evidence that GCC will not act in its users' best interests unless required to do so by the spec. Given the same inputs, Intel's ICC generates working fast code. If both compilers were equally available, I would usually prefer ICC over GCC for code that is going to run on modern Intel processors.


unless required to do so by the spec

Which IMHO is an absolutely stupid point of view, because compilers don't exist in a void. As the old saying goes, "what's right isn't always legal, and what's legal isn't always right", and behaving to the letter of the law is not the same as behaving to the spirit of the law.

Thus I think it is absolutely a bug. The standard even says undefined behaviour may result in something like "behaving in a manner characteristic of the environment", which is absolutely what programmers expect from the language.

I also suspect the fact that GCC has become a de-facto monopoly (duopoly if you count Clang/LLVM) among C compilers for Linux platforms makes them more likely to dismiss such complaints.

My experience agrees with yours that ICC and MSVC are nowhere near as aggressive and hostile with UB, yet still generate very good code.


> Thus I think it is absolutely a bug.

While I agree with your rationale, I disagree with its application in this case. Changing this compiler behavior won't get rid of alignment-related crashes - it'll just make them harder for me to reproduce / debug, because I'll have to get coredumps and symbols off of my ARM hardware instead.

> [...] MSVC [...]

Has its own history of alignment issues as well FWIW, although I'm happy to see that v19 seems to error out on __declspec(align(...))ments it won't guarantee for parameters - whereas that used to be a mere warning. A warning that was missing and wouldn't trigger in at least one MSVC version, which caused its share of debugging sessions for me: https://clang.godbolt.org/z/hykGEm


Given that it's GCC generating instructions which require alignment, "behaving in a manner characteristic of the environment" is one way of describing this situation.


Then perhaps Intel is ultimately to blame for this mess, since requiring alignment is completely at odds with how x86 normally behaves, and as shown in the article, there's a version of the instruction not requiring alignment and not really slower at all.


GCC is doing its best to give you the best performance on a broad set of CPUs, and its behaviour is 100% correct. Saying you want GCC to not exploit this contract (that the char* actually points to an aligned uint32_t) is asking for inferior performance, which is unacceptable to most people still using C.

A warning would be nice though.


I should add that my comment above is dependent on compiling with options that target a specific processor, when the performance on that processor is equally good with aligned and unaligned data. That is, when compiling with "-march=native" on a modern Intel processor, I strongly prefer a compiler that generates fast and working code over one that segfaults because the spec allows it to. I'm open to arguments that it should behave differently when asked to generate generic code that can work on generic processors. Are there x64 processors in common use today that are significantly slower with MOVDQU than MOVDQA?


Movdqu vs movdqa is just one of the possible bad things that can happen when misaligned data is erroneously used. And it will segfault consistently, which is a good thing: it means the bug will be caught early in testing. You want compilers to produce binaries that catch mistakes early in testing, not binaries that work 99.99% of the time and fail in obscure ways.


To expand on the 0.01% cases - atomic operations (be they used for lock-free algorithms, or to implement spinlocks) generally require alignment. If you're lucky, unaligned access will segfault. I have been so lucky, when unaligned global char[] buffers were recast to types containing pthread locks. Operations on said locks were kind enough to segfault when unit testing a port to ARM hardware.

If you're unlucky, your "atomic" x64 instructions will silently revert to non-atomic behavior when straddling cachelines but otherwise "succeed". This will introduce one of the absolute nastiest kinds of heisenbugs into your codebase, in one of the hardest to understand and debug parts of your program. Speaking frankly - my coworkers won't figure it out. I won't figure it out. Instead, we'll ship unstable software, and maybe blame the hardware and suggest memtest.

A movdqa segfault by comparison is tame, easily understood, easily fixed, and may well help me catch and fix said accidentally unatomic behavior.


Citation needed that lock-prefixed instructions require alignment or must not span cache lines on x86. I don't see that in any documentation.


> lock-prefixed instructions

Are not the only instructions used for atomic operations.

> Citation needed

EDIT: memory_order_acq_rel like I've shown below isn't actually a valid value to pass to store (noticed this when trying to spot sane disassembly in MSVC for yet another datapoint, which was turning into a noop!). memory_order_seq_cst is, however, which gets compiled to xchg in GCC, Clang, and MSVC... but memory_order_release and memory_order_relaxed are both legal too, and get compiled down to vanilla movs in GCC, Clang, and MSVC as well: https://clang.godbolt.org/z/QfnxPJ

  void foo(std::atomic<int> & i) {
      i.store(42, std::memory_order_acq_rel);
  }
Gets compiled down into a no-lock-prefix mov by clang (gcc emits xchg which is implicitly lock and will be atomic even straddling cacheline boundaries): https://clang.godbolt.org/z/87x98C

Mov can't even take a lock prefix (https://software.intel.com/sites/default/files/managed/39/c5... page 1158):

> [...] The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG [...]

And mov is only atomic when within a single cacheline on modern processors: (https://software.intel.com/sites/default/files/managed/39/c5... page 3052):

> The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

> [...]

> • Reading or writing a doubleword aligned on a 32-bit boundary

> The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

> [...]

> • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

> Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

(PDF is the latest version of "Intel® 64 and IA-32 Architectures Software Developer’s Manual, Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4" I found at https://software.intel.com/en-us/download/intel-64-and-ia-32... )


I get the impression that you're blaming flaws in the implementation of std::atomic on the processor.

Memory reads using MOV are not atomic. That's not as bad as you make it sound. Instructions that are prefixed with LOCK (or implicitly LOCKed) are always atomic on misaligned data as per the last paragraph you quoted. This paragraph is weaseling around as this feature technically depends on external bus management that is implemented on the mainboard and outside the scope of the processor manual. But third party chipsets are dead. The same is repeated explicitly in the description of the LOCK prefix (Volume 2A, chapter 3):

> The integrity of the LOCK prefix is not affected by the alignment of the memory field. Memory locking is observed for arbitrarily misaligned fields.

As for the example code that you posted: I postulate that at least GCC is essentially broken and violates the guarantees given in their own manual and the C++ standard. The following example demonstrates this:

  #define UNSAFE
  
  template<typename valueType>
  class MyAtomic {
  public:
      valueType load()
      {
  #ifdef UNSAFE
          return __atomic_load_n(&value, __ATOMIC_RELAXED);
  #else
          valueType temp = 0;
          __atomic_compare_exchange_n (&value, &temp, temp, false, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
          return temp;
  #endif
      }

      void store(valueType newValue)
      {
  #ifdef UNSAFE
          __atomic_store_n(&value, newValue, __ATOMIC_RELAXED);
  #else
          __atomic_exchange_n(&value, newValue, __ATOMIC_RELAXED);
  #endif
      }
  
  private:
      valueType value;
  };
The UNSAFE version compiles to simple mov instructions. This is not correct as there is no alignment guarantee for value. Using LOCK CMPXCHG and XCHG for the read and write is possible and safe. This is the code you get when undefining UNSAFE.


> Memory reads using MOV are not atomic.

They are according to the Intel documentation, as long as they are suitably aligned (which is not an issue for sane code). There are billions of lines of code that require this atomic behaviour. Not only C applications BTW (including the Linux kernel); the mapping of the JVM memory model to x86 also relies on that.

> I postulate that at least GCC is essentially broken and violates the guarantees given in their own manual and the C++ standard.

Which guarantee is violated exactly? __atomic_load_n works on pointers of type T and the type implies the (implementation-defined) alignment. The standard std::atomic<T> is the same. Using locked operations for, say, relaxed loads and stores would be pointlessly expensive and against the spirit of the C++11 memory model.

Also, lock-prefixed operations straddling cachelines are implemented with an extremely slow path costing thousands of clock cycles instead of a couple dozen; this slow path not only slows down the cpu executing it, but all the cpus in the system. It is so bad that newer Intel CPUs allow the OS to disable this (mis)feature as it is an easy source of DoS. In theory the OS can then catch the fault and emulate the atomic instruction in software: it will be even slower but at least it would only affect the process issuing it.


> I get the impression that you're blaming flaws in the implementation of std::atomic on the processor.

These flaws are not unique to std::atomic nor x86, they are in fact ubiquitous across multiple languages and processor types. If I were to blame something, I'd blame fundamental physics. You've only got a few options for cacheline straddling atomic operations:

1) Slow, complicated, and silent handling of the edge case in your processor and bus design

2) Misbehave

3) Explode violently

x86 chooses... well, actually, it lets you choose, to some degree, by providing a wide variety of instructions. Neat! You have limited control over which option your C++ compiler will choose, and the C++ standard allows any choice by labeling the situations where it comes up "undefined behavior". I prefer option 3 as the least bad option (they're all bad), giving me the chance to fix my code (by aligning my accesses, mooting the problem entirely). I can always write my own debug assertions around my atomic wrappers to force behavior #3 even if the compiler chooses something else... assuming the optimizer doesn't get too aggressive and optimize away my alignment checks.

Bonus points to the compiler if it saves me the trouble and uses instructions that take option 3, letting me catch the mistake in my code in fully optimized, assertionless release builds. This is why I defend the compiler's use of movdqa elsewhere in the discussion tree.

EDIT: I could see the argument for preferring option 1, if I could choose to do so ubiquitously. But I do not have the luxury of enough control to accomplish that. So I prefer option 3 for what I can control.

> Memory reads using MOV are not atomic

This is directly contradicted by the citations I just gave you, from Intel's own manuals, when aligned. An aligned 8-, 16-, or 32-bit memory read - including via MOV - is atomic, per 8.1.1 Guaranteed Atomic Operations, even on a 486. Pentiums and x64 expand atomicity to even more memory reads.

> I postulate that at least GCC is essentially broken

You do not merely postulate that GCC is essentially broken, since per my edit that is not GCC-only behavior. You postulate GCC, Clang, and MSVC are all essentially broken, because they all compile release/relaxed stores to vanilla movs.

EDIT:

> This is not correct as there is no alignment guarantee for value.

Incorrect. value is guaranteed alignment of at least alignof(valueType) within the context of a defined-behavior program, which generally means the compiler imposes alignof(MyAtomic<valueType>) == alignof(valueType). This does mean alignof(MyAtomic<int>) > alignof(MyAtomic<char>) typically: https://wandbox.org/permlink/jatL7ZC7eC2V1lVw
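The alignment claim can be checked at compile time. A minimal sketch (MyAtomic reduced to just its data member, mirroring the wandbox link; the last assertion assumes mainstream targets where alignof(int) > alignof(char)):

```cpp
#include <cstddef>

// Minimal stand-in for the MyAtomic template above: the compiler gives
// the wrapper the alignment of its member.
template<typename T>
class MyAtomic {
    T value;
};

static_assert(alignof(MyAtomic<int>) == alignof(int), "");
static_assert(alignof(MyAtomic<char>) == alignof(char), "");
// On mainstream targets alignof(int) > alignof(char), hence:
static_assert(alignof(MyAtomic<int>) > alignof(MyAtomic<char>), "");
```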


We're thinking very differently about the same topic. To me, MOV is not atomic, except in specific cases. You're saying that MOV is mostly atomic except in a specific set of cases. In the end, it's about how defensive you want to be about it. I tend to be very much on the defensive side in arguments like this.

I really can see the appeal in having the code blow up as soon as it can and as aggressively as it can. It's ideal for development. This stuff is hard enough as it is and impossible to test 100%. There's no testing harness on earth that allows you to enter two sections of code on two cores with a cycle-accurate delay and with a 100% defined state of the CPU (TLB, cache contents, branch predictor status...).

As for alignment guarantee by the language: your own example code uses pointer arithmetic to conjure up a class out of thin air that is overlaid over other memory and thus has the improper alignment you're exploiting. That demonstrates how little faith you are allowed to have in the alignment of an instance of your type if you didn't allocate it yourself. The same thing may happen if you make such a type part of a packed struct.

I've just skimmed the relevant sections on alignment in the C++ standard, but it is not very explicit on the issue of alignment of fundamental types. The chapter on the atomics library doesn't bring that issue up at all. I'm not reading through the entire thing now - I'm not that bored.


> As for alignment guarantee by the language: your own example code uses pointer arithmetic to conjure up a class out of thin air that is overlaid over other memory and thus has the improper alignment you're exploiting. That demonstrates how little faith you are allowed to have in the alignment of an instance of your type if you didn't allocate it yourself.

I forgot to point out in my other reply: -fsanitize=undefined immediately catches this, and gives me the exact source file / line number of the bug. So there are ways to convince yourself you're mostly doing the right thing.

EDIT: Scroll up to the first error in https://wandbox.org/permlink/q0mxk5DViIVs2GBl if you want to see an example.


> To me, MOV is not atomic, except in specific cases

It seems that you are in an extremely small minority. Mov is atomic for all properly aligned accesses. All valid accesses in C++ are de jure properly aligned, hence a plain mov is always a safe way to lower (non-seq-cst) stores and loads.

You would want the compiler to generate extremely pessimal code (locked instructions for all atomic loads and stores) for something that never happens in sane code.


> We're thinking very differently about the same topic.

Yes and no. I recognize the same edge cases you do, and I err on the side of caution. Sometimes, it's even quite worthwhile to make seemingly outlandish arguments such as "literally every compiler and library dealing with this topic out there is broken".

But in this case the compilers and libraries have reasonable rationale behind their decisions that I can't entirely tear apart. If I ruled the ecosystem with an iron fist, I might have made different ones... if there was a diversity of implementation decisions, it might be worthwhile to agitate in favor of said difference choices... but the ecosystem seems to near-unanimously agree in the other direction. So in this case, caution and defense on an individual level means working within the reality of that ecosystem. That reality is allowed by the C++ standard. One could perhaps argue the C++ standard is broken, and I've certainly taken many a tilt at the C++ standard windmill, but I feel I have many more dangerous - and more slayable - dragons to face down first.

I could replace every single mov with an xchg in all code I have access to, but it still wouldn't fix the problem: I link against closed source libs, compiled with movs, with their own alignment requirements, and their own multithreading nonsense. I must align my types or I will suffer the bugs.

> There's no testing harness on earth that allows you to enter two sections of code on two cores with a cycle-accurate delay and with a 100% defined state of the CPU (TLB, cache contents, branch predictor status...).

There have been some interesting research projects into this FWIW focusing on symbolic execution, static analysis, executable instrumentation, etc. - Maple, Thread Sanitizer, Helgrind - but cycle and bug accurate CPU emulators are mostly a retro community thing, not cheap, and not generalizable to the wide variety of hardware I need to deal with.

> That demonstrates how little faith you are allowed to have in the alignment of an instance of your type if you didn't allocate it yourself.

You're not entirely wrong. But by the time I get my hands on a given T with an alignment less than alignof(T), the nasal demons of undefined behavior have already been well and truly invoked. Trying to partially paper over the problem with misalignment tolerant code at that point is a waste of my time that merely encourages more subtle heisenbugs. The better use of my time, and the better solution to my lack of faith, is to verify alignment - and sometimes catch and fix misalignment - not to give up and assume I'm unaligned.

You're not entirely right either. Several ARM ports in multiple, large codebases have turned up surprisingly few alignment bugs in my experience. They're surprisingly rare. You have to go out of your way to create most of them.

> The same thing may happen if you make such a type part of a packed struct.

Modern Clang, GCC, and MSVC can all warn about this, which you can upgrade to an error: https://clang.godbolt.org/z/hNjao6

As can Rust: https://play.rust-lang.org/?version=stable&mode=debug&editio...
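A hypothetical repro of what those warnings catch (GCC/Clang attribute syntax; names are mine): packing pushes the int member to offset 1, so a plain pointer into the struct is misaligned.

```cpp
#include <cstddef>

// With packing, the int member lands at offset 1 rather than 4, so
// taking its address yields a misaligned int* - which is exactly what
// -Waddress-of-packed-member complains about.
struct __attribute__((packed)) Packed {
    char c;
    int  i;
};

static_assert(offsetof(Packed, i) == 1, "packing misaligns the member");
```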


...and just to give a concrete example of a program that can silently "misrun" on x64 (via cacheline straddling - which is undefined behavior due to the misalignment, obviously):

  #include <atomic>
  #include <thread>
  #include <iostream>
  #include <cassert>
  #include <cstdlib>
  
  int main() {
      constexpr size_t target_cacheline_bits = 6;
      constexpr size_t target_cacheline_size = (1 << target_cacheline_bits); // 64 bytes
      constexpr size_t target_cacheline_straddle_mask = (target_cacheline_size - 1);
      char buffer[target_cacheline_size + sizeof(std::atomic<short>)];
      std::atomic<short> & straddling = *(std::atomic<short> *)((size_t)buffer | target_cacheline_straddle_mask); // UB due to misalignment
      straddling.store(1);
      
      constexpr size_t N = 1000000;
      std::thread t1([&](){
          for (size_t i=0; i<N; ++i) {
              straddling.store(i&1 ? 0x0001 : 0x0100, std::memory_order_relaxed);
          }
      });
      std::thread t2([&](){
          for (size_t i=0; i<N; ++i) {
              const short value = straddling.load(std::memory_order_relaxed);
              if (value != 0x0001 && value != 0x0100) {
                  std::cerr << "Got value: " << value << "\n";
                  std::exit(1);
              }
          }
      });
  
      t1.join();
      t2.join();
  }
https://wandbox.org/permlink/IK43Ou0NfAw9y7us

Sometimes this will run to completion without triggering anything. Other times the values "0" or "257" (0x0101) will appear in stderr. Feel free to raise "N", or exclude "0" from the error triggering case, to see more values.


I've just tried it and for me GCC 9.2.1 produced either movdqa or movdqu depending on which x64 CPU it is asked to tune for.

That seems like it's doing what it's supposed to.

I also tried a tiny change to the C code using __attribute__((packed)), and it made the compiler output movdqu on the CPU where it was previously using movdqa, admittedly also producing slightly less efficient code.

So again, seems like GCC is doing what it's supposed to.

I'm surprised the author of the article went into obscure arch-specific GCC attributes, and tried things like memcpy(), but didn't try __attribute__((packed)) on a single-element struct, which is arch-independent.
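A sketch of the single-element packed struct in question (GCC/Clang attribute syntax, not portable to MSVC as written; names are mine): marking the field packed tells the compiler it may be unaligned, so it emits unaligned loads (MOVDQU when vectorized) instead of assuming alignment.

```cpp
#include <cstdint>

// The packed attribute removes the alignment assumption on v, so reads
// through this struct are compiled as unaligned accesses.
struct __attribute__((packed)) unaligned_u32 {
    uint32_t v;
};

uint32_t load_u32(const char* p) {
    return reinterpret_cast<const unaligned_u32*>(p)->v;
}
```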


I'm surprised any seasoned C developer would make this mistake. You haven't been able to assume trivial translation of C code to assembly for decades.

Casting from a 'pointer to type A' to a 'pointer to type B' is unsafe in all but a handful of circumstances.

- B is char or unsigned char.

- A is char or unsigned char, and the pointer was previously cast from a pointer to type B.

- A is a struct (or a standard layout[0] class in C++) and B is the first member of that structure.

[0] https://en.cppreference.com/w/cpp/language/data_members#Stan...
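For completeness, the usual standard-blessed escape hatch when none of those cases applies is memcpy into a properly typed local (a sketch; the function name is mine). At -O2 GCC and Clang lower this to a single unaligned load, so there is no copy in practice.

```cpp
#include <cstdint>
#include <cstring>

// Portable unaligned read: the bytes are copied into a correctly typed,
// correctly aligned local, so no pointer is ever dereferenced at the
// wrong type or alignment.
uint32_t read_u32(const char* p) {
    uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```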


> I'm surprised any seasoned C developer would make this mistake.

Not everyone gets the talk. And even for the ones that do, many consider it to be useless nitpicking until it bites them. (And some who are bitten would prefer to yell at the compiler for “breaking their code”…)

> A is char or unsigned char, and the pointer was previously cast from a pointer to type B.

And if A is void, of course.


> And if A is void, of course

To nitpick, casting is rarely the issue. Dereferencing a pointer to the wrong type is the issue. As void can never be dereferenced nor be a valid underlying type, void never factors into alias analysis.


You haven't been able to assume trivial translation of C code to assembly for decades.

Then perhaps everyone should continue to do that, and continue to very strongly complain to the compiler-authors who seem to have become absolutely engrossed in blindly following the standard to the letter and completely ignoring the fact that people are wanting C precisely because it's supposed to be close to Asm. But I guess it satiates their egos more to point and laugh... "we're following the standard, fuck you for thinking we care about anything else."


If the standard didn't allow this optimization then GCC would have to emit unaligned access instructions everywhere. If it did that then people who want C to be 'close to the machine' would be complaining about lackluster performance and calling for everything to be written in ASM.

The reality is that the abstract machine modeled by C is a long, long way from what modern CPUs actually do, and the C language standards committee seems to have little interest in extending their language.


If you were compiling for an architecture that doesn't allow unaligned access, then you'd expect unaligned accesses to fault. If you were compiling for an architecture which does, then you wouldn't. That's what sane undefined behaviour should be expected to do.


> The reality is that the abstract machine modeled by C is a long, long way from what modern CPUs actually do,

But some modern CPUs fault on unaligned access or make such access slower. (I think Intel is actually sometimes in the latter camp.)

I don't get why people label "modern" as a synonym for their pre-existing favorite idea and then demand stuff reflect that "modernity" or else it is totally illegitimate. Seems like some kind of hangup.

As some have mentioned in similar discussions, even the x86 machine code does not reflect how the CPU really works, there are things like caching and branch prediction and speculative execution and microcode and ... There is a long list and probably no one person is aware of all the abstractions.


Regardless of what C is or isn't "supposed" to be, it's not close to assembler. It's an abstract machine, and one that is fundamentally different than actual machines today.

It would do well for people to remember that before complaining to compiler authors. You're not at the bottom of the stack, you're just a few lowering passes closer than in other languages.


> I'm surprised any seasoned C developer

I'm not too surprised. There is seasoned as in recently experienced (and burned), and seasoned as in outdated. I have come across plenty who don't realize the last 25 years of compiler optimizations have changed the rules of what can be assumed.


Since we're being pedantic: I seem to remember that A or B can be char or unsigned char or signed char. Plain char can be signed or unsigned, but it is a distinct type from the other two regardless of that.


I believe signed char is not legal in this case, precisely for the reason you specified.


Sorry, I just checked (I should've done that first!) and you are right. I had misremembered.


Previously:

https://news.ycombinator.com/item?id=17910851

https://news.ycombinator.com/item?id=12889855

Some new developments:

> C++ allows us to write the same function in much more readable way by employing some template programming. We’ll introduce a generic type called const_unaligned_pointer.

If you support C++20, there's std::bit_cast: https://en.cppreference.com/w/cpp/numeric/bit_cast

Edit: fixed link. Thanks, nkurz.


I think something has gone very wrong with the state of programming when the original C version reads extremely straightforwardly, the version that works without UB already looks quite a bit more noisy, and the C++ version is just... yuck!

In this case, just writing the appropriate Asm instructions themselves would've been much shorter and simpler, and also worked the first time.


...and something has certainly gone very wrong when you get downvoted for pointing out the truth! A lot of this idiotic complexity increase would be completely avoided if compiler authors would just exercise some common sense, but unfortunately it's not so common after all.


This is just expected behaviour, isn't it? So many processors require aligned access, from the original C target of the PDP-11 onwards. That it happens to work on x86 is pure chance.


Am I the only one weirded out that the author didn't even consider writing/benchmarking the obvious byte-at-a-time version (in a loop, and/or unrolled) before resorting to a nonportable/incorrect 'optimized' version?

Versions of GCC old enough to drive are pretty good at generating fast code from byte operations, and summing up bytes seems like the kind of transparently analyzable case where modern compilers really are sufficiently smart.
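The byte-at-a-time version in question is about as simple as it gets (a sketch; the function name and accumulator type are mine, and the article's real accumulation may differ). char has no alignment requirement, and at -O2/-O3 GCC and Clang auto-vectorize the loop.

```cpp
#include <cstddef>
#include <cstdint>

// Portable byte-at-a-time sum: no casts, no alignment assumptions, and
// still vectorizable by the compiler.
uint64_t sum_bytes(const char* buf, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (unsigned char)buf[i];
    return sum;
}
```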


Challenge for all the anti-C people here: what other language (a) lets you set up this situation in the first place and (b) avoids the problem "optimally"? That is, generates performant assembler on x86 and doesn't crash on ARM?

(a) is a much harder criterion than it sounds!


Pascal probably



