ROCm
According to the ROCm official document (v5.4.3) "ROCm is a brand name for ROCm open software platform (for software) or the ROCm™ open platform ecosystem (includes hardware like FPGAs or other CPU architectures)."
In the scope of Gentoo distribution, "ROCm" refers to ROCm open software platform, currently supporting AMDGPU as its hardware.
ROCm itself aims for as an environment for heterogeneous computing, not limiting to AMDGPU. It is the current packaging strategy of Gentoo that ROCm only supports AMDGPU; if ROCm is needed for other vendors (typically the cuda backend of
sci-libs/hip-*
packages), please file a bug to Gentoo BugzillaROCm is somewhat analogous to CUDA in the sense of providing an API (HIPAMD) whose usage is in the same spirit: simplified kernels-based GPU programming. It also provides OpenCL and OpenMP and programming models. Note that ROCm is not the only way to run compute tasks on AMD GPUs as Mesa3D (media-libs/mesa) also provides this capability over its own OpenCL and Vulkan APIs.
Components of ROCm
ROCm can be classified into five categories:
- Drivers and runtimes, provided by the amdgpu kernel model and
dev-libs/roct-thunk-interface
anddev-libs/rocr-runtime
. - Programming models. See ROCm#Programming_models for details.
- Compilers and tools. Gentoo uses to vanilla clang (
>=llvm-core/clang-14.0.6-r1
). - Libraries. Gentoo has packaged most libraries prefixed by
roc
andhip
insci-libs
category, withsrc_test
enabled. Allsci-libs/roc*
packages are written in HIP and uses hipamd as backend, whilesci-libs/hip*
are simple wrappers. - Deployment tools. As a user of Gentoo, the best choice to deploy common ROCm components is via portage.
Installation guide
Supported AMD GPUs
ROCm targets discrete AMD GPUs and officially supports nearly any GFX9 and later GPU generations with a few exceptions [1]. Basically this covers GCN-3 (Fiji chips only) to GCN-5 and all RDNA micro-architectures.
Kernel requirements
Kernel configuration options
The following kernel configuration options have be ticked:
Device Drivers --->
Graphics Support --->
<M> AMD GPU
[ ] Enable amdgpu support for SI parts
[ ] Enable amdgpu support for CIK parts
-*- Always enable userptr write support
ACP (Audio CoProcessor) Configuration --->
Display Engine Configuration --->
[*] HSA kernel driver for AMD GPU devices
[*] Enable HMM-based shared virtual memory manager
You can also choose to not build the AMD GPU support as a kernel module and so make it being built-in.
Kernel command line parameters
The defaults options values for the AMDGPU kernel module are adequate and provide a stable configuration. Under some circumstances it is possible to override various options (see drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c in the Linux kernel source tree for a detailed description of those).
For example, setting amdgpu.ppfeaturemask=0xffffffff
gives full features in AMDGPU power play, which maybe useful when adjusting GPU power profiles via rocm-smi
.
Checking for a functional kernel configuration
Once your Linux kernel has been reconfigured, recompiled by a manner of your choice and the machine rebooted to use it, you can check for KFD messages in the Linux kernel log. If AMDKFD (AMD Fusion Kernel Driver) which is the HSA compute Linux kernel driver has been successfully enabled for your AMD GPU you should see something like:
root #
dmesg | grep kfd
[ 9.324948] kfd kfd: amdgpu: Allocated 3969056 bytes on gart [ 9.376206] kfd kfd: amdgpu: added device 1002:73df
iommu pass-through
According to AMD, running peer-to-peer programs on multiple GPUs may trigger system crash. The solution is to set kernel command line parameter iommu=pt
.
Some user reports also suggest that iommu pass-through mode decreases the possibility of amdgpu driver crashes like page faults.
Users privileges
Using ROCm requires access to the character device /dev/kfd which is owned by root:video
. Thus any user wanting to use ROCm should be included in the video group. This is achieved via:
user $
usermod -a -G video userlogin
This change will be effective at their next logon.
System monitoring tools
dev-util/rocm-smi provides various information about the underlying AMD GPU devices such as the die temperature, the current clock speeds and VRAM usage. To install it:
root #
emerge --ask dev-util/rocm-smi
You can simply invoke it without any options to see a short summary:
user $
rocm-smi
======================= ROCm System Management Interface ======================= ================================= Concise Info ================================= GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 0 45.0c 7.0W 500Mhz 96Mhz 0% auto 220.0W 16% 1% ================================================================================ ============================= End of ROCm SMI Log ==============================
You can invoke
rocm-smi
with the option -a
to view a complete report.Another package of interest is dev-util/rocminfo which provides detailed information about currently registered HSA devices. To install it:
root #
emerge --ask dev-util/rocminfo
Then, simply invoke it to see what is usable by ROCm:
user $
rocminfo
ROCk module is loaded ===================== HSA System Attributes ===================== Runtime Version: 1.1 System Timestamp Freq.: 1000.000000MHz Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) Machine Model: LARGE System Endianness: LITTLE ========== HSA Agents ========== ******* Agent 1 ******* Name: AMD Ryzen 9 7950X 16-Core Processor Uuid: CPU-XX Marketing Name: AMD Ryzen 9 7950X 16-Core Processor Vendor Name: CPU Feature: None specified Profile: FULL_PROFILE Float Round Mode: NEAR Max Queue Number: 0(0x0) Queue Min Size: 0(0x0) Queue Max Size: 0(0x0) Queue Type: MULTI Node: 0 Device Type: CPU Cache Info: L1: 32768(0x8000) KB Chip ID: 0(0x0) ASIC Revision: 0(0x0) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 4500 BDFID: 0 Internal Node ID: 0 Compute Unit: 32 SIMDs per CU: 0 Shader Engines: 0 Shader Arrs. per Eng.: 0 WatchPts on Addr. Ranges:1 Features: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: FINE GRAINED Size: 131631704(0x7d88a58) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 2 Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 131631704(0x7d88a58) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 3 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 131631704(0x7d88a58) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE ISA Info: ******* Agent 2 ******* Name: gfx1031 Uuid: GPU-XX Marketing Name: AMD Radeon RX 6750 XT Vendor Name: AMD Feature: KERNEL_DISPATCH Profile: BASE_PROFILE Float Round Mode: NEAR Max Queue Number: 128(0x80) Queue Min Size: 64(0x40) Queue Max Size: 131072(0x20000) Queue Type: MULTI Node: 1 Device Type: GPU Cache Info: L1: 16(0x10) KB L2: 3072(0xc00) KB L3: 98304(0x18000) KB Chip ID: 29663(0x73df) ASIC Revision: 0(0x0) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 2880 BDFID: 768 Internal Node ID: 1 Compute Unit: 40 SIMDs per CU: 2 Shader Engines: 4 Shader Arrs. per Eng.: 2 WatchPts on Addr. Ranges:4 Features: KERNEL_DISPATCH Fast F16 Operation: TRUE Wavefront Size: 32(0x20) Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Max Waves Per CU: 32(0x20) Max Work-item Per CU: 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) Max fbarriers/Workgrp: 32 Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 12566528(0xbfc000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 2 Segment: GROUP Size: 64(0x40) KB Allocatable: FALSE Alloc Granule: 0KB Alloc Alignment: 0KB Accessible by all: FALSE ISA Info: ISA 1 Name: amdgcn-amd-amdhsa--gfx1031 Machine Models: HSA_MACHINE_MODEL_LARGE Profiles: HSA_PROFILE_BASE Default Rounding Mode: NEAR Default Rounding Mode: NEAR Fast f16: TRUE Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) FBarrier Max Size: 32 *** Done ***
In the above example, the integrated GPU had been deactivated in the BIOS, thus only the CPU itself and the dGPU are usable.
Programming models
OpenCL
Detailed information can be seen in OpenCL#AMD.
HIP
Detailed information can be seen in HIP.
OpenMP
Clang-15 is not functional as of mid-april 2023. See bug #904143.
Enabling OpenMP for Clang/LLVM
To enable OpenMP offloading on AMDGPU, ensure that :
- llvm-core/clang-runtime has been emerged with the USE flag
openmp
set - llvm-runtimes/openmp (LLVM/clang specific) has been emerged with the USE flags
offload
andllvm_targets_AMDGPU
both set.
This is achieved by:
- Checking if the portage configuration has
AMDGPU
set as a target for the variable LLVM_TARGETS:
root #
emerge --info
(...) LLVM_TARGETS="AMDGPU BPF X86" (...)
- Adding the required statements in /etc/portage/package.use either by adding a file under it (if /etc/portage/package.use) is a directory or an entry (if /etc/portage/package.use is a file) like this:
llvm-core/clang-runtime openmp
Then re-emerge llvm-core/clang-runtime (should kick llvm-runtimes/openmp in as a dependency):
root #
emerge --ask --deep --newuse llvm-core/clang-runtime
You can also set the
openmp
USE flag globally by adding it to the USE variable defined in /etc/portage/make.confTesting OpenMP support
user $
clang -fno-stack-protector -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa --rocm-path=/opt/gentoo/usr --rocm-device-lib-path=/opt/gentoo/usr/lib/amdgcn/bitcode <openmp source code> -o <executable>
-fno-stack-protector
is needed because Gentoo has turned on -fstack-protector-strong
in /etc/clang/gentoo-hardened.cfg
, see llvm-project/issues/62066 and bug #890377.If clang omits error: no library 'libomptarget-amdgpu-gfx1031.bc' found in the default clang lib directory or in LIBRARY_PATH; use '--libomptarget-amdgpu-bc-path' to specify amdgpu bitcode library
Then adding --libomptarget-amdgpu-bc-path=/usr/lib64/
to the compile command resolve this issue.
If llvm-core/clang does not contain amdgpu-arch tool or the target GPU is not on the machine
Clang omits error: cannot determine AMDGPU architecture: amdgpu-arch: Execute failed: Executable "amdgpu-arch" doesn't exist!
.
In this case clang cannot detect gpu architecture automatically (or in cross compile, arch is not present on compile machine), so clang needs a GPU arch specifier script:
#!/bin/bash
echo "gfx90a" # Change to the target to compile here, but do not append target features such as :xnack-
Make script executable:
user $
chmod +x /tmp/print_gpu_arch.sh
And then add option --amdgpu-arch-tool=/tmp/print_gpu_arch.sh
to the compilation command.
Others
The backend of ROCm is currently llvm/clang, so any programming model that can generate LLVM IR for AMDGPU can use ROCm. Numba is a jit compiler for python codes, and can offload to ROCm. Currently Gentoo does not packaged numba with ROCm yet.
ROCm libraries
Currently, Gentoo has packages rocBLAS
, rocFFT
, rocPRIM
, rocRAND
, rocSOLVER
, rocSPARSE
, rocThrust
, composable-kernel
, miopen
, and pytorch[rocm]
in sci-libs
category. Those are math and deep learning libraries written in HIP and runs on AMD GPUs.
Wrapper packages, are hipBLAS
(wrapper of rocBLAS
+rocSOLVER
vs cuBLAS
+cuSOLVER
), hipCUB
(wrapper of rocPRIM
vs CUB
), hipFFT
(rocFFT
vs cuFFT
), and hipSPARSE
(rocSPARSE
vs cuSPARSE
).
hipDNN
is currently not packaged. It's a wrapper of miopen
vs cudnn
.
dev-libs/rccl
(targeting nccl
) is collective communication routines for AMD GPUs. It can also run tests, but tests are only meaningful on multi GPU systems.
Ebuild for sci-libs/rocALUTION
(targeting paralution) is currently in development.
Specifying architectures to compile
With rocm.eclass
(ROCm version >=5.1.3), Gentoo handles the AMDGPU_TARGETS
USE_EXPAND
. The map between GPU and arch name can be viewed via checking use flag for ROCm libraries:
example $
equery uses rocBLAS
* Found these USE flags for sci-libs/rocBLAS-5.4.2-r1: U I - - amdgpu_targets_gfx1010 : RDNA GPU, codename navi10, including Radeon RX 5700XT/5700/5700M/5700B/5700XTB/5600XT/5600/5600M, Radeon Pro 5700XT/5700, Radeon Pro W5700X/W5700 - - amdgpu_targets_gfx1011 : RDNA GPU, codename navi12, including Radeon Pro 5600M/V520 - - amdgpu_targets_gfx1012 : RDNA GPU, codename navi14, including Radeon RX 5500XT/5500/5500M/5500XTB/5300/5300M, Radeon Pro 5500XT/5500M/5300/5300M, Radeon Pro W5500X/W5500/W5500M/W5300M + - amdgpu_targets_gfx1030 : RDNA2 GPU, codename navi21/sienna cichlid, including Radeon RX 6950XT/6900XT/6800XT/6800, Radeon Pro W6800 - - amdgpu_targets_gfx1031 : RDNA2 GPU, codename navi22/navy flounder, including Radeon RX 6750XT/6700XT/6800M/6700M - - amdgpu_targets_gfx1100 : RDNA3 GPU, codename navi31/plum bonito, including Radeon RX 7900XTX/7900XT - - amdgpu_targets_gfx1101 : RDNA3 GPU, codename navi32 - - amdgpu_targets_gfx1102 : RDNA3 GPU, codename navi33 - - amdgpu_targets_gfx803 : Fiji GPU, codename fiji, including Radeon R9 Nano/Fury/FuryX, Radeon Pro Duo, FirePro S9300x2, Radeon Instinct MI8 - - amdgpu_targets_gfx900 : Vega GPU, codename vega10, including Radeon Vega Frontier Edition, Radeon RX Vega 56/64, Radeon RX Vega 64 Liquid, Radeon Pro Vega 48/56/64/64X, Radeon Pro WX 8200/9100, Radeon Pro V320/V340/SSG, Radeon Instinct MI25 + - amdgpu_targets_gfx906 : Vega GPU, codename vega20, including Radeon (Pro) VII, Radeon Instinct MI50/MI60 + - amdgpu_targets_gfx908 : CDNA Accelerator, codename arcturus, including AMD Instinct MI100 Accelerator + - amdgpu_targets_gfx90a : CDNA2 Accelerator, codename aldebaran, including AMD Instinct MI200 series Accelerators - - benchmark : Build and install rocblas-bench. - - doc : Add extra documentation (API, Javadoc, etc). It is recommended to enable per package instead of globally - - test : Perform rocblas-test to compare the result between rocBLAS and system BLAS.
By default, officially supported architectures (gfx906 gfx908 gfx90a gfx1030
) are turned on. For example, for a system with Radeon VII and RX 6700XT, specify GPU archs for all packages:
# disable gfx908, gfx90a, gfx1030; turn on gfx1031; gfx906 remains on
*/* AMDGPU_TARGETS: -gfx908 -gfx90a -gfx1030 gfx1031
Adjusting use flags for individual packages is also supported. Portage will take care of the dependencies: if sci-libs/mipoen
enables gfx1031
, then sci-libs/rocBLAS
should turns on gfx1031
, or when portage will try to add it to /etc/portage/package.use/zz-autounmask
.
Upgrade to 5.1.3 or above from the legacy way
Before introducing rocm.eclass
(ROCm version <5.1.3), architectures are specified via environment variable AMDGPU_TARGETS
:
For users installing ROCm libraries using the legacy method (specifying /etc/portage/make.conf
), upgrading to 5.1.3 takes two steps:
1. RemoveAMDGPU_TARGETS
entry in/etc/portage/make.conf
2. Add/etc/portage/package.use/00-amdgpu-targets
mentioned in ROCm#Specifying_architectures_to_compile
Driver and userspace compatibility matrix
See User and kernel-space support matrix, where column "KFD" refers to the version of dev-libs/roct-thunk-interface
and the ROCm version of amdkfd in amdgpu kernel module (if using upstream amdgpu module Linux kernel, there's no clear ROCm version for amdkfd kernel driver).
XNACK target feature
NACK stands for negative-acknowledgement. It is an important feature in Heterogeneous Memory Management (HMM) [2]. Simply put, in HMM GPU VRAM and CPU RAM shares the same pointer space, while xnack allows the GPU to retry memory accesses after a page fault when the desired data does not exist in GPU VRAM, making VRAM kind of a cache of CPU RAM for GPU devices. ROCm official document [3] and other documents may help better understanding the mechanism and importance of xnack, how to enable it, and the performance impact: [4] [5]
For GFX9 series GPUs, it is common to append xnack feature flag behind GPU architecture, for example gfx906:xnack-
. xnack-
stands for xnack disabled, while xnack+
is xnack enabled, and the compiled GPU kernel can be only executed when xnack feature is disabled. gfx906
stands for "xnack any", meaning GPU kernel can be launched in both xnack on or xnack off, but may be less performant [6].
Whether GPU has xnack feature enabled can be checked with rocminfo
.
Contributing and developing guide
Testing ROCm libraries is not easy -- it requires recent AMD discrete GPUs and days of compilation and testing. If using ROCm libraries and mathematical correctness is considered important, please test the hardware by enabling tests:
FEATURES="test"
Then emerge the desired ROCm package. If test failures occurs, usually it is caused by small inconsistencies between ROCm libraries and CPU reference implementations. Or it is caused by upstream bugs, or Gentoo deployment strategy. In either situation, filing a bug report to Gentoo Bugzilla is welcome, and it would be better to report to upstream for mathematical errors.
There is a page ROCm packaging history working in progress.