Hacker News

I don't know anything about memory access.

But doesn't that only help if you have parallel threads doing independent 16 bit requests? If you're accessing a 64 bit value, wouldn't it still need to occupy four channels?



Depends. Cache lines are typically 64-128 bytes long, and depending on various factors a line might live on one memory channel or be spread across multiple channels, somewhat like a RAID-0 disk. I've seen servers (Opterons, I believe) that would map memory per channel or across channels based on BIOS settings. Generally a non-NUMA-aware OS ran better with striped memory, and NUMA-aware OSes ran better non-striped.

So striping a cache line across multiple channels does increase bandwidth, but it doesn't help latency much. If the DRAM latency is 70ns (not uncommon) and your memory is running at 3.2 GHz on a single 64-bit-wide channel, you get 128 bytes in 16 transfers. 16 transfers at 3.2GHz = 5ns, so you get a cache line back in 75ns. With two independent 64-bit channels you can get two cache lines per 75ns.

Now with a single 128-bit-wide channel (twice the bandwidth) you wait 70ns, then 8 transfers @ 3.2GHz = 2.5ns, so you get a cache line back in 72.5ns. Clearly not a big difference.

So the question becomes: for a complicated OS with a ton of cores, do you want one cache line per 72.5ns (the striped config) or two cache lines per 75ns (the non-striped config)?

In the 16-bit, 8-channel case (assuming the same bus speed and latency), each channel needs 64 transfers = 20ns per cache line, so you get eight cache lines per 90ns. I'm not sure what magic Apple has, but I'm seeing very low memory latencies on the M1, on the order of 33ns! With all cores busy I'm seeing throughput of a cache line per 11ns or so.
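The arithmetic in the comments above can be checked with a few lines of Python. This is just a sketch of the back-of-envelope model from the thread (70ns DRAM latency, 3.2 giga-transfers/s, 128-byte cache lines); the function name and constants are mine, not from any real tool:

```python
# Back-of-envelope model from the thread: total fetch time is
# fixed DRAM latency plus the burst time to transfer one cache line.
CACHE_LINE_BYTES = 128
DRAM_LATENCY_NS = 70.0
TRANSFER_RATE_GT = 3.2  # billions of transfers per second

def fetch_time_ns(channel_width_bits):
    """Time to pull one full cache line over a single channel."""
    bytes_per_transfer = channel_width_bits // 8
    transfers = CACHE_LINE_BYTES / bytes_per_transfer
    burst_ns = transfers / TRANSFER_RATE_GT  # GT/s == transfers per ns
    return DRAM_LATENCY_NS + burst_ns

print(fetch_time_ns(64))   # single 64-bit channel:  ~75 ns
print(fetch_time_ns(128))  # striped 128-bit:        ~72.5 ns
print(fetch_time_ns(16))   # single 16-bit channel:  ~90 ns
```

This makes the tradeoff explicit: striping shaves only a couple of nanoseconds off each fetch, while independent channels multiply how many fetches can be in flight.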


I believe modern superscalar architectures can run instructions out of order if they don't rely on the same data, so when paused waiting for a cache miss, the processor can read ahead in the code, and potentially find other memory to prefetch. I may be wrong about the specifics, but these are the types of tricks that modern CPUs employ to achieve higher speed.


Sure, but generally a cache-line miss will quickly stall the pipeline. You might have a few non-dependent instructions in flight, but for a CPU running at 3+GHz, waiting 70ns is an eternity, doubly so when it can execute multiple instructions per cycle.


You have to consider that a DRAM delivers a burst, not a single word. Usually the channel width times the burst length equals the cache line size.
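That relation is easy to sanity-check with one line of arithmetic. Using DDR4 as a concrete example (64-bit channel, burst length 8, which matches the common 64-byte x86 cache line):

```python
# Burst check: channel width (bytes) * burst length = cache line size.
# DDR4 example: 64-bit (8-byte) channel with burst length 8.
channel_width_bytes = 8
burst_length = 8
cache_line_bytes = channel_width_bytes * burst_length
print(cache_line_bytes)  # 64, the typical x86 cache line size
```

For the 128-byte lines discussed upthread, the same relation holds with either a wider channel or a longer burst.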



