Intel MKL on AMD Zen
Introduction
Disclaimer: this post investigates how recent MKL versions behave on Zen CPUs. You should read the MKL license before using MKL. I shall not be held responsible for how you use MKL.
Intel MKL has been known to use SSE code paths on AMD CPUs that
support newer SIMD instructions, such as CPUs based on the Zen
microarchitecture. A (by now) well-known trick has been to set the
MKL_DEBUG_CPU_TYPE environment variable to the value 5 to force the
use of AVX2 kernels on AMD Zen CPUs. Unfortunately, this variable has
been removed from Intel MKL 2020 Update 1 and later. This can be
confirmed easily by running a program that uses MKL with
ltrace -e getenv.
Good news: Intel seems to be adding Zen kernels
However, it seems that Intel removed this option because they are adding Zen kernels to MKL. For instance, if we run the ACES dgemm benchmark with MKL 2020.2.254 on a Ryzen 3700X, performance is good:
$ ./mt-dgemm 4000 | grep GF
GFLOP/s rate: 382.756063 GF/s
A quick inspection with perf shows that most cycles are spent
in a Zen-optimized kernel:
79.95% mt-dgemm libmkl_def.so [.] mkl_blas_def_dgemm_kernel_zen
Bad news: sgemm is not yet implemented
However, it seems that they have not implemented Zen kernels for
every BLAS function yet. I modified the ACES benchmark to use the
sgemm BLAS function and the results aren't quite as good:
$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate: 237.352720 GF/s
And indeed, perf reveals that MKL does not use a Zen kernel:
88.90% mt-sgemm libmkl_def.so [.] LM_LOOPgas_1
A temporary workaround
Some quick tracing shows that MKL uses a single function,
mkl_serv_intel_cpu_true, to detect whether it is dealing with a
genuine Intel CPU. Fortunately, the function is rather trivial, so
we can replace it with our own function:
/* fakeintel.c: always tell MKL that it runs on a genuine Intel CPU. */
int mkl_serv_intel_cpu_true(void) {
    return 1;
}
And compile it as a shared library:
$ gcc -shared -fPIC -o libfakeintel.so fakeintel.c
And ensure that the library gets preloaded (use the full path unless
the library is in the dynamic linker's search path):
$ export LD_PRELOAD=libfakeintel.so
Now the sgemm benchmark shows good performance:
$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate: 851.541946 GF/s
And indeed, an AVX2-optimized code path is used:
82.73% mt-sgemm libmkl_avx2.so [.] mkl_blas_avx2_sgemm_kernel_0
The only minor downside is that MKL will also use AVX2 kernels for
other functions such as dgemm. But this does not seem to impact
performance negatively. In fact, for the dgemm benchmark performance
is slightly better on my machine (430 GF/s, compared to 383 GF/s in
the first run above).
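As background on the workaround above: on x86, a "genuine Intel" check usually comes down to the CPUID vendor string. How mkl_serv_intel_cpu_true is implemented internally is not shown in this post, so the following is only a sketch of the general idea, assuming GCC or Clang. The helper names cpu_vendor and is_genuine_intel are made up for this example.

```c
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Write the CPUID vendor string into out (13 bytes including the NUL).
   Returns 1 on success, 0 if CPUID is not available on this platform. */
int cpu_vendor(char out[13]) {
    out[0] = '\0';
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    /* Leaf 0 returns the vendor string in EBX, EDX, ECX (in that order). */
    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical stand-in for a vendor check; this is NOT MKL's actual code. */
int is_genuine_intel(const char *vendor) {
    return strcmp(vendor, "GenuineIntel") == 0;
}
```

AMD processors report AuthenticAMD as their vendor string, so a check of this kind fails on Zen regardless of which SIMD extensions the CPU supports. The preloaded function sidesteps that by answering 1 unconditionally.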
Making it permanent
Setting LD_PRELOAD every time can get tedious and is easy to forget.
An easy solution is to add our small library as a DT_NEEDED entry to
the ELF dynamic section of your program using patchelf. For example:
$ patchelf --add-needed libfakeintel.so yourbinary
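Since a DT_NEEDED entry (or an LD_PRELOAD setting) is invisible during normal runs, it can be handy to let the library announce itself on request. A minimal sketch of such a variant of fakeintel.c, assuming GCC or Clang (the constructor attribute is a compiler extension). The FAKEINTEL_VERBOSE variable and the fakeintel_announce function are names made up for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Same override as before: report a genuine Intel CPU to MKL. */
int mkl_serv_intel_cpu_true(void) {
    return 1;
}

/* Runs once when the shared library is loaded (GCC/Clang extension).
   FAKEINTEL_VERBOSE is a hypothetical variable invented for this sketch. */
__attribute__((constructor))
static void fakeintel_announce(void) {
    if (getenv("FAKEINTEL_VERBOSE") != NULL)
        fprintf(stderr, "libfakeintel: mkl_serv_intel_cpu_true overridden\n");
}
```

When the library is built with the same gcc command as before, starting the program with FAKEINTEL_VERBOSE=1 set should write the message to stderr if the library was actually loaded. Without the variable, the library stays silent.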
Addendum (January 2023)
If you link MKL dynamically into your own application, you do not
have to make a separate shared library to preload. You can simply
define the mkl_serv_intel_cpu_true symbol/function in your binary.