Intel MKL on AMD Zen

Aug 31, 2020

Tags: dev, linux

Introduction

Disclaimer: this post investigates how recent MKL versions behave on Zen CPUs. You should read the MKL license before using MKL. I shall not be held responsible for how you use MKL.

Intel MKL has been known to use SSE code paths on AMD CPUs that support newer SIMD instructions, such as CPUs based on the Zen microarchitecture. A (by now) well-known trick has been to set the MKL_DEBUG_CPU_TYPE environment variable to the value 5 to force the use of AVX2 kernels on AMD Zen CPUs. Unfortunately, support for this variable was removed in Intel MKL 2020 Update 1 and later. This is easy to confirm by running a program that uses MKL under ltrace -e getenv and checking that MKL_DEBUG_CPU_TYPE is no longer looked up.

Good news: Intel seems to be adding Zen kernels

However, it seems that Intel removed this option because they are adding Zen kernels to MKL. For instance, if we run the ACES dgemm benchmark with MKL 2020.2.254 on a Ryzen 3700X, performance is good:

$ ./mt-dgemm 4000 | grep GF
GFLOP/s rate:         382.756063 GF/s

A quick inspection with perf shows that most cycles are spent in a Zen-optimized kernel:

79.95%  mt-dgemm  libmkl_def.so           [.] mkl_blas_def_dgemm_kernel_zen

Bad news: no Zen kernel for sgemm yet

However, Intel does not seem to have implemented Zen kernels for every BLAS function yet. I modified the ACES benchmark to use the sgemm BLAS function and the results aren’t quite as good:

$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate:         237.352720 GF/s

And indeed, perf reveals that MKL does not use a Zen kernel:

88.90%  mt-sgemm  libmkl_def.so           [.] LM_LOOPgas_1
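
To put the GF/s figures in perspective: multiplying two N-by-N matrices takes about 2N^3 floating-point operations, so the rate is that count divided by the run time. Below is a minimal sketch, assuming this conventional count (the ACES benchmark's own accounting may include additional lower-order terms). gemm_gflops is a hypothetical helper, not part of the benchmark:

```c
#include <stddef.h>

/* Approximate GFLOP/s for one n-by-n matrix multiplication that took
   `seconds` of wall-clock time, using the conventional 2*n^3 flop count
   (one multiply and one add per inner-product term).
   Assumption: lower-order terms are ignored. */
double gemm_gflops(size_t n, double seconds) {
  double flops = 2.0 * (double)n * (double)n * (double)n;
  return flops / seconds / 1e9;
}
```

For N = 4000 this is 1.28e11 operations per multiplication, so a multiplication that takes one second corresponds to 128 GF/s.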

A temporary workaround

Some quick tracing shows that MKL uses a single function, mkl_serv_intel_cpu_true, to detect whether it is dealing with a genuine Intel CPU. Fortunately, the function is rather trivial, so we can replace it with our own function, saved here as fakeintel.c:

/* fakeintel.c: override MKL's vendor check. */
int mkl_serv_intel_cpu_true(void) {
  /* Always claim to be a genuine Intel CPU. */
  return 1;
}

Compile it as a shared library:

$ gcc -shared -fPIC -o libfakeintel.so fakeintel.c

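Then make sure the library gets preloaded. A name without a slash is looked up in the dynamic linker's search path, so either install the library in such a directory or give LD_PRELOAD the full path: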

$ export LD_PRELOAD=libfakeintel.so

Now the sgemm benchmark shows good performance:

$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate:         851.541946 GF/s

And indeed, an AVX2-optimized code path is used:

82.73%  mt-sgemm  libmkl_avx2.so          [.] mkl_blas_avx2_sgemm_kernel_0

The only minor downside is that MKL will now also use AVX2 kernels for other functions such as dgemm, instead of the Zen kernels. But this does not seem to hurt performance. In fact, dgemm performance is slightly better on my machine with the AVX2 kernel (430 GF/s versus 383 GF/s).
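
For context, vendor checks of this kind are commonly built on the CPUID instruction, whose leaf 0 returns a 12-byte vendor string ("GenuineIntel" on Intel CPUs, "AuthenticAMD" on AMD CPUs). The following is a minimal sketch of such a check using the <cpuid.h> header shipped with GCC and Clang on x86. It is not MKL's actual code, and cpu_vendor and is_genuine_intel are hypothetical names chosen for illustration:

```c
#include <cpuid.h>
#include <string.h>

/* Write the CPUID vendor string into `vendor` (at least 13 bytes).
   Returns 1 on success, 0 if CPUID leaf 0 is not available. */
int cpu_vendor(char vendor[13]) {
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
    return 0;
  /* The vendor string is stored in EBX, EDX, ECX, in that order. */
  memcpy(vendor + 0, &ebx, 4);
  memcpy(vendor + 4, &edx, 4);
  memcpy(vendor + 8, &ecx, 4);
  vendor[12] = '\0';
  return 1;
}

/* Hypothetical stand-in for a "genuine Intel" test:
   1 if the vendor string is Intel's, 0 otherwise. */
int is_genuine_intel(const char *vendor) {
  return strcmp(vendor, "GenuineIntel") == 0;
}
```

On a Zen CPU, a check along these lines would return 0, which is the answer that the preloaded mkl_serv_intel_cpu_true overrides.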

Making it permanent

Setting LD_PRELOAD every time gets tiresome and is easy to forget. A simple solution is to use patchelf to add our small library as a DT_NEEDED entry in the dynamic section of your program's ELF binary. As with LD_PRELOAD, the library must be findable through the dynamic linker's search path. For example:

$ patchelf --add-needed libfakeintel.so yourbinary