1.
RDMA. A term that has caught my eye again after more than a decade. Around 2012 I introduced RDMA in the materials for the Low Latency Technology 2012 fall event(2). Interest in low latency was running high at the time, and with the Korea Exchange adopting InfiniBand for its next-generation Exture+ system, technologies such as RDMA drew a great deal of attention. I do not know whether anyone has applied RDMA to a trading system since then, and I myself buried it away as my attention moved to other technologies.
Then, while reading an article introducing the technology underlying DeepSeek, RDMA caught my eye again. Looking around for related material, I happened upon a blog post. It is illustrated, and the quality of its diagrams is remarkably high.
AI/ML Networking Part I: RDMA Basics
The post above does not cover RDMA in general; it is about GPUDirect RDMA. Conceptually, however, the ideas look similar. As a layman where GPUs are concerned I am in no position to judge, but NVIDIA's own material suggests the two share the same technological roots. Here is NVIDIA's definition of GPUDirect RDMA.
GPUDirect RDMA: Direct Communication between NVIDIA GPUs
Remote direct memory access (RDMA) enables peripheral PCIe devices direct access to GPU memory. Designed specifically for the needs of GPU acceleration, GPUDirect RDMA provides direct communication between NVIDIA GPUs in remote systems. This eliminates the system CPUs and the required buffer copies of data via the system memory, resulting in 10X better performance.
Put into a simple diagram, the above might look like this:
Userspace <--> NIC <--RDMA--> NIC <--> Userspace
GPU Memory <--> NIC <--RDMA--> NIC <--> GPU Memory
Comparing the two in diagram form gives the figure below, taken from How RDMA Accelerates Cluster Performance: A Comprehensive Guide.
And the process flow, described in words, is as follows.
1.2. Standard DMA Transfer
First, we outline a standard DMA Transfer initiated from userspace. In this scenario, the following components are present:
- Userspace program
- Userspace communication library
- Kernel driver for the device interested in doing DMA transfers
The general sequence is as follows:
- The userspace program requests a transfer via the userspace communication library. This operation takes a pointer to data (a virtual address) and a size in bytes.
- The communication library must make sure the memory region corresponding to the virtual address and size is ready for the transfer. If this is not the case already, it has to be handled by the kernel driver (next step).
- The kernel driver receives the virtual address and size from the userspace communication library. It then asks the kernel to translate the virtual address range to a list of physical pages and make sure they are ready to be transferred to or from. We will refer to this operation as pinning the memory.
- The kernel driver uses the list of pages to program the physical device’s DMA engine(s).
- The communication library initiates the transfer.
- After the transfer is done, the communication library should eventually clean up any resources used to pin the memory. We will refer to this operation as unpinning the memory.
1.3. GPUDirect RDMA Transfers
For the communication to support GPUDirect RDMA transfers some changes to the sequence above have to be introduced. First of all, two new components are present:
- Userspace CUDA library
- NVIDIA kernel driver
As described in Basics of UVA CUDA Memory Management, programs using the CUDA library have their address space split between GPU and CPU virtual addresses, and the communication library has to implement two separate paths for them.
The userspace CUDA library provides a function that lets the communication library distinguish between CPU and GPU addresses. Moreover, for GPU addresses it returns additional metadata that is required to uniquely identify the GPU memory represented by the address. Refer to Userspace API for details.
The difference between the paths for CPU and GPU addresses is in how the memory is pinned and unpinned. For CPU memory this is handled by built-in Linux Kernel functions (get_user_pages() and put_page()). However, in the GPU memory case the pinning and unpinning has to be handled by functions provided by the NVIDIA Kernel driver.
From Developing a Linux Kernel Module using GPUDirect RDMA.
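The pinning and unpinning described above is what a userspace RDMA program experiences as memory registration. The sketch below is my own illustration, not part of the NVIDIA document: on a system with rdma-core (libibverbs) and CUDA installed, and, for the GPU path, GPUDirect RDMA support such as the nvidia-peermem module, ibv_reg_mr() on a malloc'd buffer goes through the get_user_pages() path, while the same call on a cudaMalloc'd pointer delegates pinning to the NVIDIA kernel driver.

    /* Sketch only: the two pinning paths as seen from userspace.
     * Build along the lines of: gcc reg.c -libverbs -lcudart
     * (plus the CUDA include/library paths of your installation). */
    #include <infiniband/verbs.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n;
        struct ibv_device **devs = ibv_get_device_list(&n);
        if (!devs || n == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */
        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                     IBV_ACCESS_REMOTE_WRITE;
        size_t len = 1 << 20;

        /* CPU path: the kernel pins the pages (get_user_pages()) and
         * programs the NIC; ibv_dereg_mr() later unpins them (put_page()). */
        void *host_buf = malloc(len);
        struct ibv_mr *host_mr = ibv_reg_mr(pd, host_buf, len, access);
        if (!host_mr) { fprintf(stderr, "host ibv_reg_mr failed\n"); return 1; }
        printf("host mr: lkey=%u rkey=%u\n", host_mr->lkey, host_mr->rkey);

        /* GPU path: the same verbs call on a GPU virtual address; on a
         * GPUDirect RDMA setup the NVIDIA kernel driver pins the GPU pages. */
        void *gpu_buf = NULL;
        if (cudaMalloc(&gpu_buf, len) == cudaSuccess) {
            struct ibv_mr *gpu_mr = ibv_reg_mr(pd, gpu_buf, len, access);
            if (gpu_mr) {
                printf("gpu mr: lkey=%u rkey=%u\n", gpu_mr->lkey, gpu_mr->rkey);
                ibv_dereg_mr(gpu_mr);
            }
            cudaFree(gpu_buf);
        }

        ibv_dereg_mr(host_mr);                          /* unpin host pages */
        free(host_buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }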
2.
Although I came across RDMA again in the context of AI networking, I actually use RDMA-based technology almost every day: libvma, supplied by Mellanox. Among network cards offering TCP offload, the NVIDIA (Mellanox) ConnectX series is the one ZeroServer has adopted. Because we install libvma when delivering our OS tuning service, I compile and run it frequently.
Looking at the libvma parameters, terms such as QP and CQ appear.
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 8.6.9-0 Development Snapshot built on Jun 26 2018 14:36:59
VMA INFO: Git: 41d72e46ba99badb1d7be73ac42af8b7ee6879e8
VMA INFO: Cmd Line: sockperf sr
VMA INFO: Current Time: Tue Jun 26 14:49:19 2018
VMA INFO: Pid: 20439
VMA INFO: OFED Version: MLNX_OFED_LINUX-4.4-0.1.8.0:
VMA INFO: Architecture: x86_64
VMA INFO: Node: r-aa-apollo03.mtr.labs.mlnx
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: Log Level DETAILS [VMA_TRACELEVEL]
VMA DETAILS: Log Details 0 [VMA_LOG_DETAILS]
VMA DETAILS: Log Colors Enabled [VMA_LOG_COLORS]
VMA DETAILS: Log File [VMA_LOG_FILE]
VMA DETAILS: Stats File [VMA_STATS_FILE]
VMA DETAILS: Stats shared memory directory /tmp/ [VMA_STATS_SHMEM_DIR]
VMA DETAILS: VMAD output directory /tmp/vma/ [VMA_VMAD_NOTIFY_DIR]
VMA DETAILS: Stats FD Num (max) 100 [VMA_STATS_FD_NUM]
VMA DETAILS: Conf File /etc/libvma.conf [VMA_CONFIG_FILE]
VMA DETAILS: Application ID VMA_DEFAULT_APPLICATION_ID [VMA_APPLICATION_ID]
VMA DETAILS: Polling CPU idle usage Disabled [VMA_CPU_USAGE_STATS]
VMA DETAILS: SigIntr Ctrl-C Handle Disabled [VMA_HANDLE_SIGINTR]
VMA DETAILS: SegFault Backtrace Disabled [VMA_HANDLE_SIGSEGV]
VMA DETAILS: Ring allocation logic TX 0 (Ring per interface) [VMA_RING_ALLOCATION_LOGIC_TX]
VMA DETAILS: Ring allocation logic RX 0 (Ring per interface) [VMA_RING_ALLOCATION_LOGIC_RX]
VMA DETAILS: Ring migration ratio TX 100 [VMA_RING_MIGRATION_RATIO_TX]
VMA DETAILS: Ring migration ratio RX 100 [VMA_RING_MIGRATION_RATIO_RX]
VMA DETAILS: Ring limit per interface 0 (no limit) [VMA_RING_LIMIT_PER_INTERFACE]
VMA DETAILS: Ring On Device Memory TX 0 [VMA_RING_DEV_MEM_TX]
VMA DETAILS: TCP max syn rate 0 (no limit) [VMA_TCP_MAX_SYN_RATE]
VMA DETAILS: Tx Mem Segs TCP 1000000 [VMA_TX_SEGS_TCP]
VMA DETAILS: Tx Mem Bufs 200000 [VMA_TX_BUFS]
VMA DETAILS: Tx Mem Buf size 0 [VMA_TX_BUF_SIZE]
VMA DETAILS: Tx QP WRE 2048 [VMA_TX_WRE]
VMA DETAILS: Tx QP WRE Batching 64 [VMA_TX_WRE_BATCHING]
VMA DETAILS: Tx Max QP INLINE 204 [VMA_TX_MAX_INLINE]
VMA DETAILS: Tx MC Loopback Enabled [VMA_TX_MC_LOOPBACK]
VMA DETAILS: Tx non-blocked eagains Disabled [VMA_TX_NONBLOCKED_EAGAINS]
VMA DETAILS: Tx Prefetch Bytes 256 [VMA_TX_PREFETCH_BYTES]
VMA DETAILS: Rx Mem Bufs 200000 [VMA_RX_BUFS]
VMA DETAILS: Rx QP WRE 16000 [VMA_RX_WRE]
VMA DETAILS: Rx QP WRE Batching 64 [VMA_RX_WRE_BATCHING]
VMA DETAILS: Rx Byte Min Limit 65536 [VMA_RX_BYTES_MIN]
VMA DETAILS: Rx Poll Loops 100000 [VMA_RX_POLL]
VMA DETAILS: Rx Poll Init Loops 0 [VMA_RX_POLL_INIT]
VMA DETAILS: Rx UDP Poll OS Ratio 100 [VMA_RX_UDP_POLL_OS_RATIO]
VMA DETAILS: HW TS Conversion 3 [VMA_HW_TS_CONVERSION]
VMA DETAILS: Rx Poll Yield Disabled [VMA_RX_POLL_YIELD]
VMA DETAILS: Rx Prefetch Bytes 256 [VMA_RX_PREFETCH_BYTES]
VMA DETAILS: Rx Prefetch Bytes Before Poll 0 [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA DETAILS: Rx CQ Drain Rate Disabled [VMA_RX_CQ_DRAIN_RATE_NSEC]
VMA DETAILS: GRO max streams 32 [VMA_GRO_STREAMS_MAX]
VMA DETAILS: TCP 3T rules Disabled [VMA_TCP_3T_RULES]
VMA DETAILS: UDP 3T rules Enabled [VMA_UDP_3T_RULES]
VMA DETAILS: ETH MC L2 only rules Disabled [VMA_ETH_MC_L2_ONLY_RULES]
VMA DETAILS: Force Flowtag for MC Disabled [VMA_MC_FORCE_FLOWTAG]
VMA DETAILS: Select Poll (usec) 100000 [VMA_SELECT_POLL]
VMA DETAILS: Select Poll OS Force Disabled [VMA_SELECT_POLL_OS_FORCE]
VMA DETAILS: Select Poll OS Ratio 10 [VMA_SELECT_POLL_OS_RATIO]
VMA DETAILS: Select Skip OS 4 [VMA_SELECT_SKIP_OS]
VMA DETAILS: CQ Drain Interval (msec) 10 [VMA_PROGRESS_ENGINE_INTERVAL]
VMA DETAILS: CQ Drain WCE (max) 10000 [VMA_PROGRESS_ENGINE_WCE_MAX]
VMA DETAILS: CQ Interrupts Moderation Enabled [VMA_CQ_MODERATION_ENABLE]
VMA DETAILS: CQ Moderation Count 48 [VMA_CQ_MODERATION_COUNT]
VMA DETAILS: CQ Moderation Period (usec) 50 [VMA_CQ_MODERATION_PERIOD_USEC]
VMA DETAILS: CQ AIM Max Count 560 [VMA_CQ_AIM_MAX_COUNT]
VMA DETAILS: CQ AIM Max Period (usec) 250 [VMA_CQ_AIM_MAX_PERIOD_USEC]
VMA DETAILS: CQ AIM Interval (msec) 250 [VMA_CQ_AIM_INTERVAL_MSEC]
VMA DETAILS: CQ AIM Interrupts Rate (per sec) 5000 [VMA_CQ_AIM_INTERRUPTS_RATE_PER_SEC]
VMA DETAILS: CQ Poll Batch (max) 16 [VMA_CQ_POLL_BATCH_MAX]
VMA DETAILS: CQ Keeps QP Full Enabled [VMA_CQ_KEEP_QP_FULL]
VMA DETAILS: QP Compensation Level 256 [VMA_QP_COMPENSATION_LEVEL]
VMA DETAILS: Offloaded Sockets Enabled [VMA_OFFLOADED_SOCKETS]
VMA DETAILS: Timer Resolution (msec) 10 [VMA_TIMER_RESOLUTION_MSEC]
VMA DETAILS: TCP Timer Resolution (msec) 100 [VMA_TCP_TIMER_RESOLUTION_MSEC]
VMA DETAILS: TCP control thread 0 (Disabled) [VMA_TCP_CTL_THREAD]
VMA DETAILS: TCP timestamp option 0 [VMA_TCP_TIMESTAMP_OPTION]
VMA DETAILS: TCP nodelay 0 [VMA_TCP_NODELAY]
VMA DETAILS: TCP quickack 0 [VMA_TCP_QUICKACK]
VMA DETAILS: Exception handling mode -1(just log debug message) [VMA_EXCEPTION_HANDLING]
VMA DETAILS: Avoid sys-calls on tcp fd Disabled [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA DETAILS: Allow privileged sock opt Enabled [VMA_ALLOW_PRIVILEGED_SOCK_OPT]
VMA DETAILS: Delay after join (msec) 0 [VMA_WAIT_AFTER_JOIN_MSEC]
VMA DETAILS: Internal Thread Affinity -1 [VMA_INTERNAL_THREAD_AFFINITY]
VMA DETAILS: Internal Thread Cpuset [VMA_INTERNAL_THREAD_CPUSET]
VMA DETAILS: Internal Thread Arm CQ Disabled [VMA_INTERNAL_THREAD_ARM_CQ]
VMA DETAILS: Thread mode Multi spin lock [VMA_THREAD_MODE]
VMA DETAILS: Buffer batching mode 1 (Batch and reclaim buffers) [VMA_BUFFER_BATCHING_MODE]
VMA DETAILS: Mem Allocate type 1 (Contig Pages) [VMA_MEM_ALLOC_TYPE]
VMA DETAILS: Num of UC ARPs 3 [VMA_NEIGH_UC_ARP_QUATA]
VMA DETAILS: UC ARP delay (msec) 10000 [VMA_NEIGH_UC_ARP_DELAY_MSEC]
VMA DETAILS: Num of neigh restart retries 1 [VMA_NEIGH_NUM_ERR_RETRIES]
VMA DETAILS: IPOIB support Enabled [VMA_IPOIB]
VMA DETAILS: SocketXtreme Disabled [VMA_SOCKETXTREME]
VMA DETAILS: BF (Blue Flame) Enabled [VMA_BF]
VMA DETAILS: fork() support Enabled [VMA_FORK]
VMA DETAILS: close on dup2() Enabled [VMA_CLOSE_ON_DUP2]
VMA DETAILS: MTU 0 (follow actual MTU) [VMA_MTU]
VMA DETAILS: MSS 0 (follow VMA_MTU) [VMA_MSS]
VMA DETAILS: TCP CC Algorithm 0 (LWIP) [VMA_TCP_CC_ALGO]
VMA DETAILS: Polling Rx on Tx TCP Disabled [VMA_RX_POLL_ON_TX_TCP]
VMA DETAILS: Trig dummy send getsockname() Disabled [VMA_TRIGGER_DUMMY_SEND_GETSOCKNAME]
VMA INFO: ---------------------------------------------------------------------------
Here is an explanation of the terms that appear above.
Queue Pairs (QPs): QPs are the fundamental units for managing RDMA operations. Each QP consists of a Send Queue (SQ) and a Receive Queue (RQ), and has an associated Completion Queue (CQ) on which the hardware reports finished work.
Work Requests (WRs): WRs are instructions that define the specific RDMA operations. They are posted to the SQ by the application, and the hardware then uses these WRs to perform the actual RDMA data transfer.
WRE (Work Request Entry): A WRE within a Queue Pair is a specific record that describes a particular RDMA operation. It carries instructions such as the source address, destination address, the size of the data to be transferred, and other parameters needed to perform the operation.
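To make these terms concrete, here is a small sketch of my own (not taken from any of the articles above) showing how a QP, its CQ, and the WRE depths fit together in the verbs API. The max_send_wr/max_recv_wr capacities are, as far as I understand, what knobs such as VMA_TX_WRE and VMA_RX_WRE in the log above ultimately size; the numbers below simply mirror that log.

    /* Sketch: creating a CQ and an RC QP with explicit SQ/RQ depths.
     * Assumes ctx and pd were set up as in the earlier sketch. */
    #include <infiniband/verbs.h>

    struct ibv_qp *create_qp(struct ibv_context *ctx, struct ibv_pd *pd)
    {
        /* One CQ shared by send and receive completions, 16000 entries deep. */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16000, NULL, NULL, 0);
        if (!cq)
            return NULL;

        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap = {
                .max_send_wr  = 2048,   /* SQ depth: outstanding send WREs */
                .max_recv_wr  = 16000,  /* RQ depth: posted receive WREs   */
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
            .qp_type = IBV_QPT_RC,      /* reliable connected QP */
        };
        return ibv_create_qp(pd, &attr);
    }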
Here is a detailed introduction to RDMA that builds on these terms. The title is wonderful.
Everything You Wanted to Know About RDMA But Were Too Proud to Ask
The following diagram is from Interaction Between Software and Hardware of RDMA.
The process of sending an RDMA request is as follows:
The software constructs WQE (Work Queue Element) and submits it to Work Queue.
The software writes Doorbell and notifies the hardware.
The hardware pulls and processes WQE.
The hardware completes processing, generates CQE, and writes it into CQ.
The hardware interrupts (optional).
The software polls CQ.
The software reads CQE updated by the hardware and knows that WQE is completed.
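In verbs terms, steps 1 and 2 correspond to ibv_post_send() (which builds the WQE and rings the doorbell) and steps 6 and 7 to ibv_poll_cq() (which reads back the CQE). The sketch below is my own illustration of that post-and-poll cycle for a one-sided RDMA WRITE, assuming a connected RC QP and a remote address/rkey already exchanged out of band.

    /* Sketch: post one RDMA WRITE WQE and busy-poll the CQ for its CQE. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    int rdma_write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                            struct ibv_mr *mr, uint64_t remote_addr,
                            uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr->addr,
            .length = (uint32_t)mr->length,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,     /* one-sided RDMA write */
            .send_flags = IBV_SEND_SIGNALED,     /* request a CQE        */
            .wr.rdma.remote_addr = remote_addr,
            .wr.rdma.rkey        = rkey,
        };
        struct ibv_send_wr *bad_wr = NULL;

        if (ibv_post_send(qp, &wr, &bad_wr))     /* steps 1-2: WQE + doorbell */
            return -1;

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)     /* step 6: poll the CQ       */
            ;                                    /* busy-wait for the CQE     */

        return wc.status == IBV_WC_SUCCESS ? 0 : -1;   /* step 7: read CQE */
    }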
3.
RDMA, which drew attention in HPC, seems to have come into full bloom after meeting AI. Articles on RDMA, common around 2010, have been multiplying again since AI rose to prominence around 2020.
Back to the blog introduced above: it presents a wide range of posts on the theme of AI/ML networking.
AI/ML Networking Part I: RDMA Basics
AI/ML Networking: Part-II: Introduction of Deep Neural Networks
AI/ML Networking: Part-III: Basics of Neural Networks Training Process
AI/ML Networking: Part-IV: Convolutional Neural Network (CNN) Introduction
AI for Network Engineers: Chapter 1 – Deep Learning Basics
AI for Network Engineers: Chapter 2 – Backpropagation Algorithm: Introduction
AI for Network Engineers: Backpropagation Algorithm
AI for Network Engineers: Multi-Class Classification
AI for Network Engineers: Convolutional Neural Network
AI for Network Engineers: Recurrent Neural Network (RNN)
AI for Network Engineers: Recurrent Neural Network (RNN) – Part II
AI for Network Engineers: Long Short-Term Memory (LSTM)
AI for Network Engineers: LSTM-Based RNN
AI for Network Engineers: Challenges in AI Fabric Design
AI for Network Engineers: Understanding Flow, Flowlet, and Packet-Based Load Balancing
If you want to know the background against which RDMA emerged, have a look at An Introduction to RDMA Networking, the material for a 2019 talk by Animesh Trivedi of IBM.