> optimized Intel compiler/libraries now And that's just cheating anyway. Intel'...

fluffything · on Nov 25, 2019

That's not cheating.

There are two problems here.

The first problem is that AMD does not provide versions of those libraries tuned for AMD hardware. The investment of AMD into software for their hardware is close to zero, and without software, hardware is irrelevant. AMD has this problem when competing against nvidia in the GPU space, and against intel in the CPU space.

The second problem is that buyers of AMD products demand that Intel would release their software optimized for AMD products as well, which is just nonsensical. First, Intel is not required to do this and have no reason to do so - if this software doesn't work for your hardware, either buy Intel, or use something else. Second, these people don't expect AMD to provide these libraries, and aren't willing to provide them themselves, so... combined with AMD culture of not investing in software, nothing will happen.

That link is 10 years old, but these issues are at least 10 years older. That is, AMD has had 20 years to fix this, and nothing has happened.

I mean, this might sound like captain obvious, but Intel and Nvidia have super good profilers for their hardware (VTune and Nsight). AMD provides uProf, which sucks, so if that is what their engineers are using for their libraries, it isn't a surprise that they don't perform good.

unlinked_dll · on Nov 25, 2019

Strong disagree.

When writing optimized code you typically check for CPU feature before CPUID, and profile to verify that it actually is faster.

In Intel's case, libraries like MKL/IPP don't decide which implementation to use based on available features but CPUID, and fall back to slow versions even if the CPU supports all features required for the optimization.

There's nothing stopping Intel from providing fast libs that are optimized for their hardware, profiled on their systems, and utilize all the secret sauce they want while still being more or less "fast" on AMD.

But all that said those libs aren't especially compelling when there are alternatives that may be a bit slower on Intel but kick ass on AMD, when a non-zero segment of users are transitioning to the better value processors today.

toast0 · on Nov 25, 2019

> In Intel's case, libraries like MKL/IPP don't decide which implementation to use based on available features but CPUID, and fall back to slow versions even if the CPU supports all features required for the optimization.

The CPU features indicates if a feature is available, but doesn't indicate if it's advisable or fast -- that's why Intel built a table for their hardware, and they'd need a table for AMD hardware too, except they don't care, because it's not their hardware. AMD or someone could build that table and patch it in, or try to convince Intel to include it, but expecting Intel to build it themselves is wishful thinking at best.

sliken · on Nov 26, 2019

Dunno, just about every compiler on the planet has different settings for different CPUs and often let you hard code what you prefer.

I can see MKL not automatically picking the fastest path, but they should allow you to manually pick.

In fact they do, this flag lets you pick the AVX2 libs and speed increases up to 6.6x in my tests with matlab.

export MKL_DEBUG_CPU_TYPE=5

MertsA · on Nov 26, 2019

>that's why Intel built a table for their hardware

Err... Correct me if I'm wrong, but wasn't the check literally just a string match on "GenuineIntel"?

endorphone · on Nov 25, 2019

"and they'd need a table for AMD hardware too, except they don't care"

They very much do care, and that is amply evident. Every other compiler uses feature detection or developer intention (e.g. if I say use AVX, use it or crash). Intel actively and intentionally -- with intent -- sabotages performance on AMD devices.

This is where the market decides, however, and it's why the intel compilers and libraries are fringe products, and their software division is effectively a disaster. If you have a specific HPC setup with Intel processors maybe you'll cross compile with them, but most simply steer far clear of them. For a while Intel sold them as a software vendor -- not as some sort of coupled processor support -- and many learned they aren't trustworthy.

Const-me · on Nov 25, 2019

> Every other compiler uses feature detection or developer intention

ICC has tons of built-in functions like _mm_div_epi32, _mm256_log10_ps, _mm_sin_ps and many others. These are not hardware intrinsics. These are library functions. Other compilers don’t need to feature detect because unlike Intel’s they don’t have these functions in their standard libraries: no exponents/logarithms, no trigonometry, no integer divides, nothing at all on top of the hardware instructions.

I mostly use VC++ and I had to implement these functions more than once. Funny enough, AMD’s SSEPlus helped even on Intel, despite AMD abandoned the library a decade ago.

burnte · on Nov 25, 2019

> The second problem is that buyers of AMD products demand that Intel would release their software optimized for AMD products as well, which is just nonsensical. First, Intel is not required to do this and have no reason to do so - if this software doesn't work for your hardware, either buy Intel, or use something else.

While Intel isn't REQUIRED to do so, it absolutely is a sensible thing to do. If my customers need something, I want to provide it to them. If my customers want me to add a feature that makes their products run well on competitors' CPUs, then its in my best interest to make that happen, if I can get the other CPU maker to give me the data I need. This makes my direct customer happy, makes me look good for prospective customers, and builds loyalty to my brand. It also makes my customer happy by making THEIR customers happy, which keeps them customers of mine.

usefulcat · on Nov 25, 2019

All of those points would be a lot more relevant if Intel were a software company, or if it at least got a significant portion of it's revenue from software sales. I very much doubt that's the case though.

burnte · on Nov 26, 2019

They make the compiler to make their CPUs more attractive.

loeg · on Nov 26, 2019

Intel is very much of the mindset of that it is a silicon company selling hardware and everything else is accessory to that. The software is just inducement to use the hardware. Users of the compiler are not inherently customers; they're only customers if they're using the Intel compiler on Intel CPUs.

Yes, it's a weird mindset for a company that employs a lot of software engineers.

headmelted · on Nov 25, 2019

While I would like this issue resolved I'm not sure I follow your logic at all that Intel should be the people to fix it.

When there are only two viable options in a marketplace, offering improvements to make your direct competitor's product better than yours is shooting yourself in the foot.

Your point about happy customers would make sense if Intel was selling all manner of service contracts and extraneous engagements with it's consumer base, but is it?

I'm pretty sure the chip business is speed / cost and that's all.

What's ironic is that AMD knows better than any tech company I can think of exactly what the rewards are for coming second in the chip game.

berkut · on Nov 25, 2019

I totally buy that Intel should use the feature flags instead of checking the CPUid name for stuff like AVX, FMA, etc...

However, ICC does support several advanced optimisations whereby it literally schedules instructions based on expert knowledge of architecture cycle latency and number of ports: LLVM and GCC have similar tables for overall stats which I think AMD contribute to, but ICC can also hook in memory and cache bandwidth / latency to schedule instructions: this by definition would need to be Intel-specific, unless AMD were happy to give Intel this information.

So some of the optimisations are by definition Intel-only.

The question really is how fair should the fall-backs be.

wtallis · on Nov 25, 2019

> So some of the optimisations are by definition Intel-only.

The benefits of those optimizations may or may not be Intel-only, but the machine code emitted by the compiler is not Intel-only.

berkut · on Nov 25, 2019

Yes, but if you're interested in performance, it's very possible that using the wrong scheduling of instructions for those particular optimisations (very particular scheduling ones) will result in extremely bad performance due to pipeline stalls, so running the machine code on the wrong architecture (possibly even among Intel platforms) will not give good performance other than on the explicit machine it's optimised for.

Keep in mind that ICC supports dynamic code paths for different marchs, so in theory you could have code for Intel 6th Gen, 7th Gen, 8th Gen, etc all dynamically switchable at run time.

temac · on Nov 25, 2019

> extremely bad performance

Given how a modern high perf processor work, I don't expect that at all. Moderately bad performance in some extremely rare cases, and comparable performance in most, is WAY more probable. Better performance in some cases is also possible.

If you don't know the microarchitecture enough, you can also just look at benchmarks of Zen processors. They are vastly running code non-tuned for it, and for some loads I expect the code was actually tuned for Skylake, given all desktop Intel processors have been Skylake microarch for a non trivial number of years. It performs well on Zen.

berkut · on Nov 25, 2019

For highly-tuned code, it's fairly trivial to find cases where data alignment and the amount of loop unrolling compilers do can have significant differences between microarchs.

Whether it's to do with cache associativity, different cache / instruction latencies or number of ports between CPUs or things like false-sharing, it happens enough that it's worth doing per-CPU optimisations.

One of the reasons Zen2 runs code likely not directly optimised for it so well is likely because AMD's microarch is now quite similar to Core now (at least within the chiplets).

Previously, with things like Bulldozer, that was most definitely not the case, and you needed quite different instruction scheduling code to perform even moderately well on AMD machines of 5/6 years ago.

soup10 · on Nov 25, 2019

An enthusiast looking at published benchmarks is going to have a very different idea about what "highly-tuned" really means than the guys in the trenches spending lots of time profiling and making hardware sing.

temac · on Nov 29, 2019

> An enthusiast looking at published benchmarks is going to have a very different idea about what "highly-tuned" really means than the guys in the trenches spending lots of time profiling and making hardware sing.

That's actually what matters and drives the market (the capability to run very diverse general purpose loads with a reasonably high efficiency). If expert hand tuning to each microarch (and actually the whole computer around it) was required, Cell-like things would have won, coherency would be low, and so over.

You don't have that; the whole Earth use x64 for general purpose high perf, or other ISA but now the microarch is very similar for everybody. Oh yes, you will still find some differences and tuning points here and there, but the general principle are all the same, and any precise tuning will be as a rule of thumb as fragile as the gain it gives are impressive. Robust good enough perf is WAY more important. You don't want to fine tune all the programs again just because you went from 2 memory channels to 4.

I mean it is well known that we are at the point where random code offset changes give more perf diff than fine assembly adjustments in some cases, and there is even tooling to do diverse layout to empirically try to find out if perf diff of two snippets are by chance because of very indirect factors, or intrinsically one is better than the other. And except for VERY specific application, there is absolutely no economic sense in trying to get the last perf bits of a specific computer by trying to hand-tune everything at low level. Those modern beast do it dynamically largely well enough, in the first place.

Now I understand that this art still exists and TBH I sometimes even practice it myself (moderately), and if you actually are going to deploy some cool tricks to a fleet of similar servers, that can make sense. But in the long run, or even just medium, the improvement of the CPUs are going to "optimize" better than you are, like usual. So while I'm not saying "don't micro-optimize ever", I actually insist that Zen 2 is extremely decent, very similar to Skylake (not just Core, even the state of the art Skylake improvement of it, Ice Lake does not really count yet) even if it also have drawbacks and advantages here and there, and that the general purpose "enthusiast" benchmarks simply reflect that. And some of the loads in those benchmarks are actually programmed by quite expert people too.

Now if you have really really precise needs, fine, but I mean that's kind of irrelevant, you could also say "oh but all of that is crap, I need a shitload of shitload of IO and my mainframe is really cool for that, plus I'm an expert on this family since 20 years and now all the details of where it shines". Yeah, that can be an existing situation too. But not really where the fight is playing right now, for most people.

Dylan16807 · on Nov 25, 2019

> buyers of AMD products demand that Intel would release their software optimized for AMD products as well,

No, not at all.

They want to run the Intel-optimized version. But instead it checks CPUID and runs a deoptimized version.

omikun · on Nov 25, 2019

To play devil's advocate, why would Intel spend money to allow their competitor to leverage speedups from their compiler?

ivanhoe · on Nov 26, 2019

Because in many countries, including US, it's illegal to leverage one product to ensure unfair market advantage for your other products, and this is very close to that...

wtallis · on Nov 25, 2019

Intel spent money to prevent their competitor from leveraging speedups from their compiler. It would have been less work for them to simply use feature detection on AMD chips, since Intel already has to implement feature detection for their own chips.

Dylan16807 · on Nov 26, 2019

It took slightly more effort for them to put in the anti-optimization feature than to do the feature detection correctly.

So the answer to your question is "they don't need to", but your question has nothing to do with the problem.

7952 · on Nov 25, 2019

Because it would benefit the user of the compiler, their customer.

neonate · on Nov 25, 2019

Wow. Are you sure? How do you know this?

aidenn0 · on Nov 25, 2019

CentaurTechnologies ran benchmarks against ICC with the CPUID set to the default and then with it set to the name of an Intel chip with equivalent CPU flags.

Intel's response was roughly "We don't trust CPU flags, so we have kernels for each specific Intel chip, and a generic kernel for non-Intel chips"

acqq · on Nov 25, 2019

Luckily, from Agner's CPU blog linked above:

"[Update 2019:] The newest versions of Gnu and Clang C++ compilers are now optimizing better than the Intel compiler in my tests."

Good reason to avoid Intel compiler even more.

loeg · on Nov 26, 2019

Newest ICC is an LLVM frontend anyway. Maybe some special sauce, maybe not.

Dylan16807 · on Nov 25, 2019

The link in the post above is a reasonable intro: https://www.agner.org/optimize/blog/read.php?i=49

neonate · on Nov 25, 2019

Thank you!

zrm · on Nov 25, 2019

You're looking at this from the perspective of someone who is actually using this specific library. The issue is that it's cheating on a benchmark.

The large majority of the code people actually run isn't written by Intel or AMD. We use benchmarks for comparison because it gives people an idea of the relative performance of the hardware. Reviewers are not going to test the hardware against every individual piece of line-of-business software or in-house custom code that people will run on the hardware in real life, so instead you get a sample of some common categories of computer code.

Optimizing the benchmark for one architecture and not the other is cheating because it makes it no longer a representative sample. Real third party code is generally not going to enable instructions only for Intel and not AMD when they exist (and improve performance) on both, it will either use them on both or not at all.

paulmd · on Nov 28, 2019

So don't use a benchmark built with icc.

zrm · on Nov 28, 2019

It only became a topic for discussion after people failed to take your advice.

simion314 · on Nov 25, 2019

So you want that all binaries are packaged in 2 formats, one for AMD and one for Intel, basically all binaries will be double in size? Imagine Google code would detect if you run on Chrome and use native functions but for other browsers(including Chromium forks) will just load polyfils because checking if the feature exists is too much to ask.

ksec · on Nov 26, 2019

>o... combined with AMD culture of not investing in software, nothing will happen.

That is just not true, they work on the open source toolchain like LLVM and GCC and hopefully things will catch up. And as far as I can tell things are getting much better in 2019.

bitL · on Nov 25, 2019

They commit patches to OpenBLAS which is still 2-3x slower on Zen 2 than MKL on Intel hardware. Intel has better software engineers for high-performance math it seems.

fluffything · on Nov 25, 2019

> Intel has better software engineers for high-performance math it seems.

Intel has software engineers for pretty much every application. AMD has very few software engineers.

bitL · on Nov 25, 2019

Yeah, that's quite unfortunate, however as AMD seems to be able to earn some $ in the following ~2 years they might go on SWE hiring spree...

tutanchamun · on Nov 25, 2019

I think much of that will flow into their GPU department. They have the node advantage compared to Nvidia (TSMC 7nm vs. TSMC 12nm) but can only match Nvidia in power consumption and performance. Once Nvidia get's to 7nm with their new architecture they will increase the gap again (unless RDNA2 which is rumored for H1 2020 brings a big efficiency and performance improvement - and not only hardware accelerated RT).

jonhendry18 · on Nov 25, 2019

Or drive a truck full of money to Intel HQ to pay them to do it.

Probably need to be a lot of money though.

chrchang523 · on Nov 25, 2019

AOCL (https://developer.amd.com/amd-aocl/ ) is also a thing that exists.

But yes, they are generally way behind in this area, and in my domain (genomics) this is a serious barrier to adoption. I've been waiting for years for good-enough AMD-optimized linear algebra libraries, but distribution of Linux binaries statically linked to Intel MKL is still the obvious best choice as of November 2019, and that's a shame.

galeos · on Nov 25, 2019

MATLAB (which I think uses Intel MKL) showing faster performance for Intel than AMD but apparently due the above issues of the MKL libraries not utilising AVX on AMD processors? Is this correct?

https://www.extremetech.com/computing/302546-amd-threadrippe...

sliken · on Nov 26, 2019

Yes, it's correct. But it's easy to fix, just do this: export MKL_DEBUG_CPU_TYPE=5

Depending on the benchmark, I've seen up to 6.6x improvement in matlab on the AMD Rome/Zen2 chips

sliken · on Nov 26, 2019

See my other post on this thread. There's a Zen2/Rome/Ryzen 3xxx optimized compiler, lapback/blas, libm (math), random number generations, and others available.

ashleyn · on Nov 25, 2019

If all Intel has left over the competition is artificial vendor lock-in, they're in big trouble.

linuxftw · on Nov 25, 2019

CPU for desktops has been commoditized. Intel does a lot in networking, RAM, and storage. Not to mention having almost all of the server space.

rbanffy · on Nov 25, 2019

Desktops is relatively low growth and Intel holds a lot os sway with major integrators. There are about a handful of AMD offers at Dell or Lenovo and dozens of Intel ones.

HeWhoLurksLate · on Nov 25, 2019

And I just ordered an HP laptop with a Ryzen APU.

Voting with wallets counts, people! If you want to see more AMD laptops, buy them!

bitL · on Nov 25, 2019

You should have waited for 4000U series... By then you might even have HiDPI screens on AMD laptops...

HeWhoLurksLate · on Nov 25, 2019

I wanted one now. And yeah, the future is always brighter.

My current laptop- an HP Pavilion 2000 has an AMD APU as well- but it's from like a decade ago and it's been showing it's age for quite a while, and I needed something newer.

Ryzen 5 3550H, BTW

NullPrefix · on Nov 25, 2019

There's always something better on the horizon

bitL · on Nov 25, 2019

4000U should make 3000U/H completely obsolete, it's the first mobile Zen 2 APU series.

tutanchamun · on Nov 25, 2019

Yeah, especially if battery life is important as it will feature LPDDR4X-4266 support (0.6v instead of 1.2 and low power modes).

justinjlynn · on Nov 25, 2019

You think that's bad? Try getting Intel libraries like Embree to compile on PowerPC/OpenPOWER systems. It seems designed to be almost entirely impossible to port to new architectures... just about every piece of software released by Intel is made like this. SMH

berkut · on Nov 25, 2019

TBB has support for PowerPC and Sparc: Intel accepted patches.

It's not that difficult to port Embree to other architectures, at least for the 4-wide stuff: you just need intrinsics wrappers, and a bit of emulation for the instructions PowerPC doesn't support.

justinjlynn · on Nov 27, 2019

> not that difficult

afaict, the thing uses multi-templated indirected hardcoded asm inlines instead of intrinsic calls - it is not "that difficult" but it is by no means simple and they've done precisely zero favours to anyone trying. They've really gone out of their way, with Embree, to make it crazy hard, if not impossible, to fully activate the built-in SSE/MMX to AVX compatibility shim headers GCC ships with too, where they even can be.

sigh

bitL · on Nov 25, 2019

I've tried to use MKL_DEBUG_CPU_TYPE=5 trick on a Threadripper but it had no effect on scikit-learn nor matrix factorization inside MKL.

sliken · on Nov 26, 2019

To fix MKL for AMD Rome/Zen2 just do "export MKL_DEBUG_CPU_TYPE=5"

This helps quite a bit with Matlab (which uses MKL) as well.

For Blas/Lapack like libraries look for Blis and libFLAME.

Gcc-9.2.0 isn't bad, but if you want Zen2/Rome/Ryzen 3xxx optimizations I'd recommend AMD's AOCC, which is a tuned version of LLVM (including c, c++, and fortran).

If you need FFTs look at amd-fftw.

Basic low level math (pow, log, exp, exp2 and friends) library look at the "AMD Math library (libM).

sebazzz · on Nov 25, 2019

This almost calls from a microcode patch or kernel driver from AMD to mitigate this, if at all possible.

gowld · on Nov 25, 2019

It's only cheating if there exists an alternate compiler that works better on AMD.

It may be not-nice for Intel to use this form of "DRM" to lock their high-performance compiler to Intel chips, but they don't owe AMD users a high-performance compiler.

Now, if vendors are shipping pre-compiled software that has the enabled-for-Intel-only fast-paths and don't ofter AMD-optimized versions, those vendors are misbehaving (intentionally or not).

wjnc · on Nov 25, 2019

I agree. It's funny that the lack of attention of AMD to software / drivers (say Blas / MKL, CUDA) is holding the adoption of their silicon back in certain markets even after they've surpassed Intel in performance.

sliken · on Nov 26, 2019

AMD has a compiler (c, c++, fortran) and various math libs: libm for basics, blis/libflame for blas.