Is AVX-512 really an advantage? IIRC the machine has to go into another mode where processing across the board becomes slower except for the AVX instructions, which to me seems useful only for niche HPC "measuring contest" applications.
AVX on Intel can have unobvious performance degradation: a single AVX-512 instruction runs at ¼ speed[1] until the core downclocks, and the core then stays downclocked for 2ms (severely affecting mixed loads). The downclock itself apparently takes 500µs[2] (edit: fact check? this seems unbelievable), and the CPU is idle until it restabilises.
If AMD has made different choices about its AVX implementation, then benchmarking becomes difficult.
Intel benchmarks for sustained AVX-512 load (HPC measuring contests) cannot be used to extrapolate to normal mixed loads (single instructions or short bursts of AVX-512).
Edit: are there better links on the true costs of AVX512?
Not game physics: since it puts the CPU in a lower speed regime, it would have negative implications for the rest of the game's performance. So far, the fact that AVX-512 requires this lower speed (due to thermals) is an implementation detail, and it could be expected that newer processes (Intel's 10 or 7 nm?) will allow AVX-512 tasks to run at full speed.
Until that happens, and everyone has AVX-512 (because it'd be a massive fail to ship a game that requires a HEDT Intel processor to play), it'll be a nice gimmick for very specific tech demos, and for performance-sensitive scientific code that you know will run on a certain machine with certain characteristics.
Games will ship with AVX512 special paths once the AMD chips in consoles support it. Until then it is just a fancy feature to make already fast CPUs a bit faster.
Game programmers will put time into making slow CPUs faster; outside tech demos or hardware marketing tie-ins, no budget is allocated to creating yet more spare capacity on already-fast CPUs.
What kind of games care about such specific instructions? Unless you are writing something in assembly, that's not something game developers are usually focused on.
All kinds of games use AVX. Particularly, anything using Denuvo won't run without it.
AVX-512 isn't just wider execution units, it's different types of instructions, particularly some that fill in holes in the existing sets of instructions. Once it starts to be widely available, it will get used, and will eventually be a requirement, just like AVX has.
Ice Lake is introducing AVX-512 on Intel mobile, Tiger Lake will introduce it on desktop, and presumably Zen 3 will introduce it on AMD.
Unity is a very high-level engine that uses C#, and it now has a built-in system that lets you write code that looks like C# but gets translated to whatever SIMD instruction set is available, much like ISPC.
There are also various libraries that leverage metaprogramming to do similar things. I don't think you understand what game devs are willing to do, to get a few more polygons and pixels on the screen!
> I don't think you understand what game devs are willing to do, to get a few more polygons and pixels on the screen!
Totally depends on the trade-offs. You can write your whole game in assembly, target very specific hardware, and maybe beat the optimizing compiler (doubtful). But at what cost? Time spent on that could be spent on making more games.
Normal, up-to-date hardware handles games just fine, as long as they are not using some abysmal, poorly parallelized engine. Modern CPUs with more cores are also helping with that, especially after Ryzen processors opened the gates for it.
And specifically the next generation of consoles is using Zen2 architecture, which does not support AVX-512.
That said, desktops can apply whatever optimizations they want. Denuvo uses AVX in their DRM, which is also not a thing on console, so presumably they will eventually incorporate AVX-512.
Downclocking is per core, so it should normally be very simple to get huge net speedups with AVX-512 despite the downclocking, in game physics or anything else.
It's also relatively common to write multiple versions of SIMD code (or use tools like ISPC or metaprogramming) to leverage whatever SIMD instruction set a CPU has, such as the DOTS system in Unity. Games will happily leverage AVX-512 as soon as a fair number of desktop CPUs support it.
From a developer perspective, yes. If you're manually writing AVX intrinsics instead of just relying on the compiler, AVX512 is quite exciting due to the added mask registers, allowing you to exclude certain elements of your vector from the operation.
You can achieve this in AVX2, but it's quite painful.
Example:
Let's say that, for whatever reason, you have a vector struct containing 3 doubles, and another 64 bits of arbitrary data. Now, if you want to add those vectors together, keeping the arbitrary data of one element, that's quite difficult to do with AVX2. In AVX512, you can just set the bits of the mask to zero to exclude them from the operation, making it trivial.
> Example: Let's say that, for whatever reason, you have a vector struct containing 3 doubles, and another 64 bits of arbitrary data. Now, if you want to add those vectors together, keeping the arbitrary data of one element
What? That's just a _mm256_setzero_pd (set the whole register to zero), _mm256_maskload_pd (load the 3 doubles, ignoring the 4th load), and then _mm256_add_pd (4x double-precision add).
AVX had mask instructions, but the masks themselves took up a full 256-bit register. AVX-512 mask instructions are exciting because they use only 1 bit per element: a 64-bit mask can cover 64 bytes (aka 512 bits) of data.
> only useful for niche hpc "measuring contest" applications
It's useful for crypto. I haven't measured myself, but I expect AVX-512 roughly doubles the throughput of ChaCha20. (Not only do you have 2x as many lanes, you also have new bit rotation instructions.) Something similar applies if you use a tree hash like KangarooTwelve.
Whether your application is actually bottlenecked on crypto performance is another question of course.
AVX-512 is a spectacular instruction set that absolutely speeds up real-world workloads a ton, and it makes it easier for compilers to auto-vectorize too. The only catch is that it's new, so not many things leverage it yet.
Because of the throttling penalty, it's often worse for a compiler to produce AVX-512 code. It's only worth it for very specialized tasks, not even gaming. This may change when throttling is no longer required on newer process nodes, but for now it's mostly a gimmick.
There's a lot of convenient new instructions in avx512 that can work on 128/256-bit vectors. I'm guessing that those wouldn't throttle more than regular avx.
how often? what suite of code did you benchmark?
Why wouldn't it be good for gaming? Put the AVX-512 workload on one core for just that. Or batch the AVX-512 work together so it isn't short bursts.
Nobody is even thinking about leveraging this stuff without understanding things like this anyway.
Show me a game that uses the actual zmm register operations. The big thing is that your game has to run at good frame rates without them, so there's little incentive to optimize something to use them, and if you do, the speedup is partially offset by the throttling anyway. A lot of work for no gain.
I'm not saying you can't find something in a game to speed up with AVX-512, just that you wouldn't want to do that in the first place.
I was just talking about this the other day on reddit. That situation was overblown. Cloudflare were cheaping out, and using chips that _weren't_ recommended or designed for use with AVX-512, and then being surprised that they weren't getting good performance out of it.
Note that for most AVX workloads, it's fine. AVX2 sees a very minimal clock speed drop until almost all cores are actively engaged in the work. It's also worth mentioning that since Haswell days, the CPU groups cores doing AVX work together, to reduce the likelihood of the downclock impacting non-AVX work (I am somewhat curious what the impact on L1 / L2 caching is from that).
AVX-512 is where it can hurt, but it really depends on the workload. Lots of AVX-512 instructions and you're fine: the throughput of the AVX-512 instructions is higher than scalar or AVX2 instructions, even with the down-clock.
The important thing to note here, is that Cloudflare went with what amounts to a bargain basement server CPU. It's almost on the absolute bottom end of the range, and it's a chip not designed for AVX workloads (or indeed anything resembling high performance work). Just take a look at the product brief from when the family was launched (page 8): https://www.intel.com/content/dam/www/public/us/en/documents...
Notice that it can handle a fair level of AVX-512 instructions before it down-clocks, and even then it takes a while before the down-clocking amount is significant. It can also handle significant AVX2 workloads before the maximum frequency gets affected at all (and at the point where AVX2 starts causing it to down-clock, you'd be more than reaping the benefits of the faster instructions).
For just a few hundred more dollars: https://en.wikichip.org/wiki/intel/xeon_gold/6126#Frequencie..., you can be utilising all the cores doing AVX-512 instructions and still be faster than the not-designed-for-the-workload chip that Cloudflare cheaped out on.
Note: These extra costs on CPUs are negligible when you take in to account depreciation, server lifetime etc. The biggest lifetime cost of a rack of servers in production is never the cost of the servers. It's your other OpEx. Cheaping out on server processors is a perfect example of what AWS employees call "Frupidity": Stupidity through Frugality. (Frugality is a leadership value in AWS, and sometimes it leads to the most phenomenally stupid things happening)
You can be using all cores with AVX-512 instructions, and see a drop of only 300Mhz, on the entry level platinum chip from that same year as Cloudflare's chip.
I'm not 100% convinced by the 5115 vs 4116 comparison, but that will depend on the workload. The case is clear for the 6126, but "for just a few hundred more dollars" is a bit dishonest; you are talking about $1000 (for the 4116) vs $1770 (for the 6126). Now, IF you are going to do AVX2/AVX-512 like crazy and have the money, sure, go for it. But it is not a case of "oh, for just a little bit more, I've got something way better". And that's even considering other costs, because you can use that reasoning for all components; in the end, does ×1.7 really not matter? Not so sure. Even $770 vs. amortization through increased performance can be difficult to justify sometimes; that's also very workload / business-model dependent.
Also, if you actually need tons of processors, availability might be an issue.
Now, in the end, I'm not familiar enough with Cloudflare's needs, and I know that pretty much anybody can make mistakes, so it is possible that they should have gone with something like the 6126 instead of the 4116. But then the 8124? Hmm, less probable.
Also, bear in mind those are RRPs, which Cloudflare likely isn't paying, especially at the scale they're operating at.
The crux of the point was that they were apparently surprised when a chip that wasn't designed for the kind of workload they decided to use it for, didn't perform well.
It's like buying a small car and then being surprised that it doesn't have the same hauling power as a truck.
Frustratingly, everyone has taken it as gospel that it means AVX-512 is entirely crap and just going to hurt you.
It definitely has made optimising a bit more complicated. You could certainly argue that you might need to have the runtime profiling the code to figure out if using AVX-512 is harmful or not.
What happens when a VM is on a core that's running AVX-512? Will it also be downclocked? This is important to me since I'm orchestrating workloads on on-prem machines and my clients might want AVX-512.
A core is a core... whatever is running on the core will be affected by the down-clocking. It's not possible for a single core to be operating at two different speeds simultaneously.
Note that a CPU can switch fairly quickly between different clock speeds. It's not instantaneous but it's pretty quick.
I'd encourage CPU pinning if you can (pinning each VM to a specific core). If you're depending on oversubscribing, that won't be possible, but presumably you'd already be expecting more of an impact from noisy neighbours anyway.