Introduction
Disclaimer: this post investigates how recent MKL versions behave on Zen CPUs. You should read the MKL license before using MKL. I shall not be held responsible for how you use MKL.
Intel MKL has been known to use SSE code paths on AMD CPUs that support newer SIMD instructions, such as CPUs based on the Zen microarchitecture. A (by now) well-known trick has been to set the MKL_DEBUG_CPU_TYPE environment variable to the value 5 to force the use of AVX2 kernels on AMD Zen CPUs. Unfortunately, support for this variable has been removed from Intel MKL 2020 Update 1 and later. This can be confirmed easily by running a program that uses MKL with ltrace -e getenv, which shows the environment variables that are looked up: MKL_DEBUG_CPU_TYPE is no longer among them.
Good news: Intel seems to be adding Zen kernels
However, it seems that Intel removed this option because they are adding Zen kernels to MKL. For instance, if we run the ACES dgemm benchmark with MKL 2020.2.254 on a Ryzen 3700X, performance is good. A quick inspection with perf shows that most cycles are spent in a Zen-optimized kernel:
79.95% mt-dgemm libmkl_def.so [.] mkl_blas_def_dgemm_kernel_zen
Bad news: sgemm is not yet implemented
However, it seems that they have not yet implemented Zen kernels for every BLAS function. I modified the ACES benchmark to use the sgemm BLAS function and the results aren’t quite as good. And indeed, perf reveals that MKL does not use a Zen kernel:
88.90% mt-sgemm libmkl_def.so [.] LM_LOOPgas_1
A temporary workaround
Some quick tracing shows that MKL uses a single function, mkl_serv_intel_cpu_true, to detect whether it is dealing with a genuine Intel CPU. Fortunately, the function is rather trivial, so we can replace it with our own version that always reports a genuine Intel CPU.
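A minimal sketch of such a replacement follows. It assumes that the function takes no arguments and that a nonzero return value means "genuine Intel CPU"; the function is internal to MKL and undocumented, so the signature is an assumption. The file and library names in the comments are hypothetical.

```c
/*
 * fakeintel.c (hypothetical file name)
 *
 * Overrides MKL's internal CPU vendor check. The signature is an
 * assumption: no arguments, nonzero meaning "genuine Intel CPU".
 *
 * Possible build and use, with hypothetical names:
 *   gcc -shared -fPIC -o libfakeintel.so fakeintel.c
 *   LD_PRELOAD=./libfakeintel.so ./mt-sgemm
 */
int mkl_serv_intel_cpu_true(void) {
    /* Always claim to be an Intel CPU, so that MKL picks its kernels
       based on the supported instruction sets rather than the vendor. */
    return 1;
}
```

Because a preloaded library is searched before MKL's own libraries during symbol resolution, this definition takes precedence over the one shipped with MKL, provided MKL calls the function through the dynamic symbol table.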
We then compile it as a shared library and ensure that the library gets preloaded through LD_PRELOAD.
Now the sgemm benchmark shows good performance. And indeed, an AVX2-optimized code path is used:
82.73% mt-sgemm libmkl_avx2.so [.] mkl_blas_avx2_sgemm_kernel_0
The only minor downside is that MKL will also use AVX2 kernels for other functions such as dgemm. But this does not seem to impact performance negatively. In fact, for the dgemm benchmark, performance is slightly better on my machine (430 GF/s).
Making it permanent
Setting LD_PRELOAD every time on a machine gets tiresome and is easy to forget. An easy solution is to add our small library to the ELF dynamic section of your program as a DT_NEEDED entry, which patchelf can do with its --add-needed option.
Addendum (January 2023)
If you link MKL dynamically into your own application, you do not have to make a separate shared library to preload. You can simply define the mkl_serv_intel_cpu_true symbol/function in your binary.