Is AVX-512 really an advantage? IIRC the machine has to go into another mode where processing across the board becomes slower except for the AVX instructions, which to me seems useful only for niche HPC "measuring contest" applications.
AVX on Intel can have unobvious performance degradation: a single AVX-512 instruction runs at ¼ speed[1] until the core downclocks, and the core then stays downclocked for 2ms (severely affecting mixed loads). The downclock itself apparently takes 500µs[2] (edit: fact check? this seems unbelievable), and the CPU is idle until it restabilises.
If AMD has made different choices about its AVX implementation, then benchmarking becomes difficult.
Intel benchmarks for sustained AVX-512 load (HPC measuring contests) cannot be used to extrapolate to normal mixed loads (single instructions or short bursts of AVX-512).
Edit: are there better links on the true costs of AVX512?
Not game physics: since it puts the CPU in a lower speed regime, it would have negative implications for the rest of the game's performance. So far, the fact that AVX-512 requires this lower speed (due to thermals) is an implementation detail, and it could be expected that newer processes (Intel's 10 or 7 nm?) will allow AVX-512 tasks to run at full speed.
Until that happens, and everyone has AVX-512 (because it'd be a massive fail to ship a game that requires a HEDT Intel processor to play), it'll be a nice gimmick for very specific tech demos, and for performance-sensitive scientific code that you know will run on a certain machine with certain characteristics.
Games will ship with AVX512 special paths once the AMD chips in consoles support it. Until then it is just a fancy feature to make already fast CPUs a bit faster.
Game programmers will put time into making slow CPUs faster; outside tech demos or hardware marketing tie-ins, no budget is allocated to creating yet more spare capacity on already-fast CPUs.
What kind of games care about such specific instructions? Unless you are writing something in assembly, that's not something game developers are usually focused on.
All kinds of games use AVX. Particularly, anything using Denuvo won't run without it.
AVX-512 isn't just wider execution units, it's different types of instructions, particularly some that fill in holes in the existing sets of instructions. Once it starts to be widely available, it will get used, and will eventually be a requirement, just like AVX has.
Ice Lake is introducing AVX-512 on Intel mobile, Tiger Lake will introduce it on desktop, and presumably Zen 3 will introduce it on AMD.
Unity is a very high-level engine that uses C#, and it now has a built-in system that lets you write code that looks like C# but gets translated to whatever SIMD instruction set is available, much like ISPC.
There are also various libraries that leverage metaprogramming to do similar things. I don't think you understand what game devs are willing to do, to get a few more polygons and pixels on the screen!
> I don't think you understand what game devs are willing to do, to get a few more polygons and pixels on the screen!
Totally depends on the trade-offs. You can write your whole game in assembly, target very specific hardware, and maybe beat the optimizing compiler (doubtful). But at what cost? Time spent on that could be spent on making more games.
Normal, up-to-date hardware handles games just fine, as long as they are not using some abysmal, poorly parallelized engine. Modern CPUs with more cores are also helping with that, especially after Ryzen processors opened the gates for it.
And specifically the next generation of consoles is using Zen2 architecture, which does not support AVX-512.
That said, desktops can apply whatever optimizations they want. Denuvo uses AVX in their DRM, which is also not a thing on console, so presumably they will eventually incorporate AVX-512.
Downclocking is per core, so it should normally be very simple to get huge net speedups with AVX-512 despite the downclocking, in game physics or anything else.
It's also relatively common to write multiple versions of SIMD code (or use tools like ISPC or metaprogramming) to leverage whatever SIMD instruction set a CPU has, such as the DOTS system in Unity. Games will happily leverage AVX-512 as soon as a fair number of desktop CPUs support it.
From a developer perspective, yes. If you're manually writing AVX intrinsics instead of just relying on the compiler, AVX512 is quite exciting due to the added mask registers, allowing you to exclude certain elements of your vector from the operation.
You can achieve this in AVX2, but it's quite painful.
Example:
Let's say that, for whatever reason, you have a vector struct containing 3 doubles, and another 64 bits of arbitrary data. Now, if you want to add those vectors together, keeping the arbitrary data of one element, that's quite difficult to do with AVX2. In AVX512, you can just set the bits of the mask to zero to exclude them from the operation, making it trivial.
> Example: Let's say that, for whatever reason, you have a vector struct containing 3 doubles, and another 64 bits of arbitrary data. Now, if you want to add those vectors together, keeping the arbitrary data of one element
What? That's just a _mm256_setzero_pd (set the whole register to zero), _mm256_maskload_pd (load the 3 doubles, ignoring the 4th load), and then _mm256_add_pd (4x double-precision add).
AVX had mask instructions, but the masks themselves took up a full 256-bit register. AVX-512 mask instructions are exciting because they use only 1 bit per element: a 64-bit mask can cover 64 bytes (aka 512 bits) of data.
> only useful for niche hpc "measuring contest" applications
It's useful for crypto. I haven't measured myself, but I expect AVX-512 roughly doubles the throughput of ChaCha20. (Not only do you have 2x as many lanes, you also have new bit rotation instructions.) Something similar applies if you use a tree hash like KangarooTwelve.
Whether your application is actually bottlenecked on crypto performance is another question of course.
AVX-512 is a spectacular instruction set that absolutely speeds up real-world workloads a ton, and it makes it easier for compilers to auto-vectorize too. The only catch is that it's new, so not many things leverage it yet.
Because of the throttling penalty, it's often worse for a compiler to produce AVX-512 code. It's only worth it for very specialized tasks, not even gaming. This may change when throttling is no longer required on newer process nodes, but for now it's mostly a gimmick.
There's a lot of convenient new instructions in avx512 that can work on 128/256-bit vectors. I'm guessing that those wouldn't throttle more than regular avx.
how often? what suite of code did you benchmark?
Why wouldn't it be good for gaming? Put the AVX-512 workload on one core for just that. Or batch the AVX-512 work together so it isn't short bursts.
Nobody is even thinking about leveraging this stuff without understanding things like this anyway.
Show me a game that uses the actual zmm register operations. The big thing is that your game has to run at good frame rates without them, so there's little incentive to optimize something to use them, and if you do, the speedup is partially offset by the throttling anyway. A lot of work for no gain.
I'm not saying you can't find something in a game to speed up with AVX-512, just that you wouldn't want to do that in the first place.
I was just talking about this the other day on reddit. That situation was overblown. Cloudflare were cheaping out, and using chips that _weren't_ recommended or designed for use with AVX-512, and then being surprised that they weren't getting good performance out of it.
Note that for most AVX workloads, it's fine. AVX2 sees a very minimal clock speed drop until almost all cores are actively engaged in the work. It's also worth mentioning that since Haswell days, the CPU groups cores doing AVX work together, to reduce the likelihood of the downclock impacting non-AVX work (I am somewhat curious what the impact on L1 / L2 caching is from that).
AVX-512 is where it can hurt, but it really depends on the workload. Lots of AVX-512 instructions and you're fine: the throughput of the AVX-512 instructions is higher than scalar or AVX2 instructions, even with the down-clock.
The important thing to note here, is that Cloudflare went with what amounts to a bargain basement server CPU. It's almost on the absolute bottom end of the range, and it's a chip not designed for AVX workloads (or indeed anything resembling high performance work). Just take a look at the product brief from when the family was launched (page 8): https://www.intel.com/content/dam/www/public/us/en/documents...
Notice that it can handle a fair level of AVX-512 instructions before it down-clocks, and even then it takes a while before the down-clocking amount is significant. It can also handle significant AVX2 workloads before the maximum frequency gets affected at all (and at the point where AVX2 starts causing it to down-clock, you'd be more than reaping the benefits of the faster instructions).
For just a few hundred more dollars: https://en.wikichip.org/wiki/intel/xeon_gold/6126#Frequencie..., you can be utilising all the cores doing AVX-512 instructions and still be faster than the not-designed-for-the-workload chip that Cloudflare cheaped out on.
Note: These extra costs on CPUs are negligible when you take in to account depreciation, server lifetime etc. The biggest lifetime cost of a rack of servers in production is never the cost of the servers. It's your other OpEx. Cheaping out on server processors is a perfect example of what AWS employees call "Frupidity": Stupidity through Frugality. (Frugality is a leadership value in AWS, and sometimes it leads to the most phenomenally stupid things happening)
You can be using all cores with AVX-512 instructions, and see a drop of only 300Mhz, on the entry level platinum chip from that same year as Cloudflare's chip.
I'm not 100% convinced by the 5115 vs 4116 comparison, but that will depend on the workload. The case is clear for the 6126, but "for just a few hundred more dollars" is a bit dishonest; you are talking about $1000 (for the 4116) vs $1770 (for the 6126). Now, IF you are going to do AVX2/AVX-512 like crazy and have the money, sure, go for it. But it is not a case of "oh, for just a little bit more, I've got something way better". And that's even considering other costs, because you can use that reasoning for all components; in the end, does ×1.7 really not matter? Not so sure. Even $770 vs. amortization through increased performance can be difficult to justify sometimes; that's also very workload / business-model dependent.
Also, if you actually need tons of processors, availability might be an issue.
Now, in the end, I'm not familiar enough with Cloudflare's needs, and I know that pretty much anybody can make mistakes, so it is possible that they should have gone with something like the 6126 instead of the 4116. But then the 8124? Hmm, less probable.
Also, bear in mind those are RRPs, which Cloudflare likely isn't paying, especially at the scale they're operating at.
The crux of the point was that they were apparently surprised when a chip that wasn't designed for the kind of workload they decided to use it for, didn't perform well.
It's like buying a small car and then being surprised that it doesn't have the same hauling power as a truck.
Frustratingly, everyone has taken it as gospel that it means AVX-512 is entirely crap and just going to hurt you.
It definitely has made optimising a bit more complicated. You could certainly argue that you might need to have the runtime profiling the code to figure out if using AVX-512 is harmful or not.
What happens when a VM is on a core that's running AVX-512? Will it also be downclocked? This is important to me since I'm orchestrating workloads on on-prem machines and my clients might want AVX-512.
A core is a core... whatever is running on the core will be affected by the down-clocking. It's not possible for a single core to be operating at two different speeds simultaneously.
Note that a CPU can switch fairly quickly between different clock speeds. It's not instantaneous but it's pretty quick.
I'd encourage CPU pinning if you can (pinning each VM to a specific core). If you're depending on oversubscribing, that won't be possible, but presumably you'd already be expecting more of an impact from noisy neighbours anyway.