RDMA, the Foundation of AI/ML Networking

1.
RDMA. A word that caught my eye again after more than a decade. I introduced RDMA around 2012 through the materials for the Low Latency Technology 2012 fall event(2). Interest in low latency ran high back then: the Korea Exchange had announced that Exture+, its next-generation trading system, would adopt InfiniBand, so technologies like RDMA drew a great deal of attention. I do not know whether anyone has since applied RDMA to a trading system, and I too buried it away as my attention shifted to other technologies.

Then RDMA caught my eye again while I was reading an article on the technology underlying DeepSeek. Looking around for related material, I happened upon a blog post. It is illustrated, and the quality of its diagrams is remarkably high.

AI/ML Networking Part I: RDMA Basics

The post above does not cover RDMA in general; it deals with GPUDirect RDMA. Conceptually, though, the two look similar. Being a layman when it comes to GPUs I cannot judge for certain, but NVIDIA's material suggests they share the same technical roots. Here is how NVIDIA defines GPUDirect RDMA.

GPUDirect RDMA: Direct Communication between NVIDIA GPUs

Remote direct memory access (RDMA) enables peripheral PCIe devices direct access to GPU memory. Designed specifically for the needs of GPU acceleration, GPUDirect RDMA provides direct communication between NVIDIA GPUs in remote systems. This eliminates the system CPUs and the required buffer copies of data via the system memory, resulting in 10X better performance.

Expressed as a simple diagram, I believe it looks like the following.

Userspace  <—> NIC <—RDMA—> NIC <—> Userspace
GPU Memory <—> NIC <—RDMA—> NIC <—> GPU Memory

Comparing the two pictorially gives the following. The diagram is taken from How RDMA Accelerates Cluster Performance: A Comprehensive Guide.


And the process flow, described in words, is as follows.

1.2. Standard DMA Transfer

First, we outline a standard DMA Transfer initiated from userspace. In this scenario, the following components are present:

  • Userspace program
  • Userspace communication library
  • Kernel driver for the device interested in doing DMA transfers

The general sequence is as follows:

  1. The userspace program requests a transfer via the userspace communication library. This operation takes a pointer to data (a virtual address) and a size in bytes.
  2. The communication library must make sure the memory region corresponding to the virtual address and size is ready for the transfer. If this is not the case already, it has to be handled by the kernel driver (next step).
  3. The kernel driver receives the virtual address and size from the userspace communication library. It then asks the kernel to translate the virtual address range to a list of physical pages and make sure they are ready to be transferred to or from. We will refer to this operation as pinning the memory.
  4. The kernel driver uses the list of pages to program the physical device’s DMA engine(s).
  5. The communication library initiates the transfer.
  6. After the transfer is done, the communication library should eventually clean up any resources used to pin the memory. We will refer to this operation as unpinning the memory.
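From the userspace communication library's point of view, steps 2, 3, and 6 map onto memory registration in the verbs API. Below is a minimal sketch (my own illustration, not code from the NVIDIA document) using libibverbs, where ibv_reg_mr() asks the kernel driver to translate and pin the virtual address range, and ibv_dereg_mr() unpins it. Error handling is trimmed for brevity.

```c
/* Minimal sketch of pinning/unpinning via libibverbs.
 * Compile with: gcc dma_pin.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t size = 4096;
    void *buf = malloc(size);

    /* Steps 2-3: hand the virtual address and size to the driver,
     * which pins the pages and programs the HCA's address tables. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("pinned %zu bytes, lkey=0x%x rkey=0x%x\n",
           size, mr->lkey, mr->rkey);

    /* ... steps 4-5: post work requests referencing mr->lkey ... */

    /* Step 6: unpin the memory once transfers are done. */
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```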

1.3. GPUDirect RDMA Transfers

For the communication to support GPUDirect RDMA transfers some changes to the sequence above have to be introduced. First of all, two new components are present:

  • Userspace CUDA library
  • NVIDIA kernel driver

As described in Basics of UVA CUDA Memory Management, programs using the CUDA library have their address space split between GPU and CPU virtual addresses, and the communication library has to implement two separate paths for them.

The userspace CUDA library provides a function that lets the communication library distinguish between CPU and GPU addresses. Moreover, for GPU addresses it returns additional metadata that is required to uniquely identify the GPU memory represented by the address. Refer to Userspace API for details.
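As a hedged sketch of what such a check might look like, the CUDA driver API exposes cuPointerGetAttribute(); whether a real communication library queries exactly these attributes is an assumption on my part, but CU_POINTER_ATTRIBUTE_MEMORY_TYPE distinguishes host from device memory, and CU_POINTER_ATTRIBUTE_BUFFER_ID yields an id for the underlying GPU allocation.

```c
/* Illustrative CPU-vs-GPU address check via the CUDA driver API. */
#include <cuda.h>
#include <stdbool.h>

/* Returns true if ptr refers to GPU memory; on success *buffer_id
 * receives an id identifying the underlying GPU allocation. */
static bool is_gpu_address(CUdeviceptr ptr, unsigned long long *buffer_id)
{
    CUmemorytype mem_type = 0;

    /* For plain host pointers never seen by CUDA this call fails,
     * which we treat here as "CPU address". */
    if (cuPointerGetAttribute(&mem_type,
                              CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                              ptr) != CUDA_SUCCESS)
        return false;

    if (mem_type != CU_MEMORYTYPE_DEVICE)
        return false;

    /* Metadata identifying the GPU memory behind the address. */
    cuPointerGetAttribute(buffer_id, CU_POINTER_ATTRIBUTE_BUFFER_ID, ptr);
    return true;
}
```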

The difference between the paths for CPU and GPU addresses is in how the memory is pinned and unpinned. For CPU memory this is handled by built-in Linux Kernel functions (get_user_pages() and put_page()). However, in the GPU memory case the pinning and unpinning has to be handled by functions provided by the NVIDIA Kernel driver.
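A kernel-side sketch of the two pinning paths might look like the following. The exact signatures vary across kernel and driver versions (get_user_pages() has changed shape several times, and the p2p_token/va_space arguments of nvidia_p2p_get_pages() are deprecated and passed as 0 in current GPUDirect RDMA), so treat this as illustrative rather than copy-paste ready.

```c
#include <linux/mm.h>
#include <nv-p2p.h>                /* from the NVIDIA driver source */

#define GPU_PAGE_SIZE (64 * 1024)  /* GPU BAR pages are 64 KB, not 4 KB */

/* CPU path: pin host pages with the built-in kernel helpers. */
static long pin_cpu_range(unsigned long addr, unsigned long nr_pages,
                          struct page **pages)
{
    return get_user_pages(addr, nr_pages, FOLL_WRITE, pages);
    /* ...later, release each page with put_page(). */
}

/* GPU path: pin device pages through the NVIDIA kernel driver. The
 * free_cb callback runs if the GPU allocation vanishes underneath us. */
static int pin_gpu_range(u64 addr, u64 len,
                         struct nvidia_p2p_page_table **pt,
                         void (*free_cb)(void *), void *data)
{
    u64 aligned = addr & ~((u64)GPU_PAGE_SIZE - 1);
    return nvidia_p2p_get_pages(0, 0, aligned, len + (addr - aligned),
                                pt, free_cb, data);
    /* ...later, release with nvidia_p2p_put_pages(). */
}
```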


From Developing a Linux Kernel Module using GPUDirect RDMA

2.
Although I encountered RDMA through AI-related networking, I have in fact been using RDMA-based technology almost every day: libvma, supplied by Mellanox. Among network cards with TCP offload capability, the NVIDIA (Mellanox) ConnectX series is the one ZeroServer has adopted. Because we install and ship libvma when providing our OS tuning service, I compile and run it frequently.

Looking at libvma's parameters, you encounter terms such as QP and CQ.

Here is an explanation of the terms mentioned above. A verbs-level sketch of these objects follows the list.

  • Queue Pairs (QPs): QPs are the fundamental units for managing RDMA operations. Each QP consists of a Send Queue (SQ) and a Receive Queue (RQ); completions are reported through an associated Completion Queue (CQ).
  • Work Requests (WRs): WRs are instructions that define the specific RDMA operations. They are posted into the SQ by the application, and the hardware then uses these WRs to perform the actual RDMA data transfer.
  • WRE (Work Request Entry): A WRE within a Queue Pair is a specific record that describes a particular RDMA operation. It provides instructions such as the source address, destination address, the size of the data to be transferred, and other parameters necessary to perform the RDMA operation.
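Here is a minimal libibverbs sketch of the objects just described: a CQ for completions and an RC QP whose Send and Receive Queues will carry the WRs. It assumes the ctx and pd handles created in the earlier sketch; the queue depths are arbitrary illustrative values.

```c
/* Create a CQ and a reliable-connection QP bound to it. */
static struct ibv_qp *make_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                                 struct ibv_cq **cq_out)
{
    /* One CQ shared by the send and receive sides, depth 16. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,              /* completions of SQ work requests */
        .recv_cq = cq,              /* completions of RQ work requests */
        .qp_type = IBV_QPT_RC,      /* reliable connection */
        .cap = {
            .max_send_wr  = 16,     /* SQ depth: outstanding send WRs */
            .max_recv_wr  = 16,     /* RQ depth: outstanding recv WRs */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };

    *cq_out = cq;
    return ibv_create_qp(pd, &attr);  /* QP starts in the RESET state */
}
```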

The following is a detailed introduction to RDMA built on the terms above. The title is wonderful.

Everything You Wanted to Know About RDMA But Were Too Proud to Ask

Download (PDF, 707KB)

The diagram below is from Interaction Between Software and Hardware of RDMA.

The process of sending an RDMA request is as follows:

  1. The software constructs a WQE (Work Queue Element) and submits it to the Work Queue.
  2. The software writes the Doorbell to notify the hardware.
  3. The hardware pulls and processes the WQE.
  4. The hardware completes processing, generates a CQE, and writes it into the CQ.
  5. The hardware raises an interrupt (optional).
  6. The software polls the CQ.
  7. The software reads the CQE updated by the hardware and knows that the WQE is completed.
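The same steps can be sketched with libibverbs, reusing the qp, cq, and mr objects from the earlier sketches plus a peer's remote address and rkey, which I assume were exchanged out of band. Steps 1 and 2 happen inside ibv_post_send(), which builds the WQE and rings the doorbell; steps 6 and 7 are the ibv_poll_cq() loop; the hardware handles steps 3 through 5.

```c
/* Post a one-sided RDMA write and busy-poll until its CQE arrives. */
int rdma_write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_mr *mr, size_t len,
                        uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,                    /* echoed back in the CQE */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,    /* one-sided RDMA write */
        .send_flags = IBV_SEND_SIGNALED,    /* ask for a CQE */
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad = NULL;

    /* Steps 1-2: the WQE goes into the SQ, the doorbell wakes the HCA. */
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    /* Steps 6-7: busy-poll the CQ until our CQE appears. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                   /* spin */
    return (wc.status == IBV_WC_SUCCESS && wc.wr_id == 1) ? 0 : -1;
}
```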

3.
RDMA, once in the spotlight in HPC, seems to have come into full bloom after meeting AI. Articles on RDMA, common around 2010, began multiplying again as AI rose to prominence around 2020.

Back to the blog introduced above. It covers a variety of topics under the theme of AI/ML networking.

AI/ML Networking Part I: RDMA Basics
AI/ML Networking: Part-II: Introduction of Deep Neural Networks
AI/ML Networking: Part-III: Basics of Neural Networks Training Process
AI/ML Networking: Part-IV: Convolutional Neural Network (CNN) Introduction
AI for Network Engineers: Chapter 1 – Deep Learning Basics
AI for Network Engineers: Chapter 2 – Backpropagation Algorithm: Introduction
AI for Network Engineers: Backpropagation Algorithm
AI for Network Engineers: Multi-Class Classification
AI for Network Engineers: Convolutional Neural Network
AI for Network Engineers: Recurrent Neural Network (RNN)
AI for Network Engineers: Recurrent Neural Network (RNN) – Part II
AI for Network Engineers: Long Short-Term Memory (LSTM)
AI for Network Engineers: LSTM-Based RNN
AI for Network Engneers: Challenges in AI Fabric Design
AI for Network Engineers: Understanding Flow, Flowlet, and Packet-Based Load Balancing

If you are curious about the background against which RDMA emerged, take a look at An Introduction to RDMA Networking, a 2019 lecture by Animesh Trivedi of IBM.

Download (PDF, 5.3MB)
