Onloading and Offloading

1.
About a month ago, someone sent me an email. It said that network cards using Intel Data Direct I/O, a technology supported by Intel Sandy Bridge, are the fastest.

What kind of technology is Intel DDIO? Looking it up, it turns out to be a technology that has network processing use the CPU cache instead of main memory.

Intel DDIO makes the processor cache the primary destination and source of I/O data rather than main memory. By avoiding system memory, Intel DDIO reduces latency, increases system I/O bandwidth, and reduces power consumption due to memory reads and writes.

The figure below illustrates Intel's DDIO technology as applied to Sandy Bridge.

What does Intel DDIO have to do with the network adapter? According to Intel's materials, it works regardless of the type of network card.

Intel Data Direct I/O Technology requires no industry enabling

Intel DDIO is enabled by default on all Intel Xeon processor E5 platforms. Intel DDIO has no hardware dependencies and is invisible to software, requiring no changes to drivers, operating systems, hypervisors, or applications. All I/O devices benefit from Intel DDIO, including Ethernet, InfiniBand*, Fibre Channel, and RAID.

2.
Now let's look at how technologies for low latency and high-performance networking have evolved. When the volume of data grows rapidly, the issues shown in the figure below arise.

A 2005 document, from the time these technologies appeared to address those issues, discusses RDMA, TOE, and Onload. I have covered them several times before, but this document describes them just as they first came to market.

TCP Offload Engine (TOE) is a concept that has been around for many years. It proposes the use of a specialized and dedicated processor on the network card to handle some or all of the packet processing. The idea is for the TOE to handle all tasks associated with protocol processing, and thereby relieve the main system processors of this work. Within the last few years, network cards with TOE processors have appeared on the market. These cards have proven effective in situations where the TCP/IP packets have certain desirable characteristics.

RDMA is a technology that enables the sending system to place the data payload of a network packet into a specific location on the destination system. This action is coordinated by the NICs on both ends of the transmission. Because data is placed immediately into its final memory location by the sending system, less processor time is spent moving network packet data within the receiving system.

Onloading retains use of the system processors as the principal engines for handling network traffic. It uses technologies throughout the platform to increase CPU efficiency by reducing the delays caused by memory accesses of network packet headers. Because the processing requires considerable fetching of data, the onload approach focuses on reducing those portions of the data-handling overhead, without requiring changes to the fundamental hardware system, applications or datacenter operations. To see why this works, it is useful to examine exactly how a packet is processed.
From "Accelerating High-Speed Networking with Intel® I/O Acceleration Technology"
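
To make the TOE paragraph above a bit more concrete: full TCP offload lives in vendor firmware and drivers, but the stateless offloads that most NICs do expose (TSO, checksum offload, and so on) can at least be queried from user space through the ethtool ioctl. Below is a minimal sketch that checks whether TCP segmentation offload is turned on; the interface name eth0 is just an assumption for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
    /* Interface name is an assumption; pass your own as argv[1]. */
    const char *ifname = argc > 1 ? argv[1] : "eth0";

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* ETHTOOL_GTSO asks the driver whether TCP segmentation offload is on. */
    struct ethtool_value ev = { .cmd = ETHTOOL_GTSO };
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ev;

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("%s: TCP segmentation offload (TSO) is %s\n",
               ifname, ev.data ? "on" : "off");
    else
        perror("SIOCETHTOOL");

    close(fd);
    return 0;
}
```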

The technologies classified as TOE, RDMA, and Onloading can be regrouped into two categories: Offloading vs. Onloading. Below is an article on Offloading by Jeff Squyres, who works on HPC networking at Cisco.

There are many types of NICs available today -- let's consider a spectrum of capabilities with “dumb” NICs on one end and “smart” NICs on the other end.

Dumb NICs are basically bridges between software and the network physical layer. These NICs support very few commands (e.g., send a frame, receive a frame), and rely on software (usually in the form of a low-level operating system driver) to do all the heavy lifting such as making protocol decisions, creating wire-format frames, interpreting the meaning of incoming frames, etc.

Smart NICs provide much more intelligence. They typically have microprocessors and can apply some amount of logic to traffic flowing through the NIC (both incoming and outgoing), such as VLAN processing or other network-related operations.

Smart NICs typically also offer some form of offloading work from the main CPU. This means that they offer high-level commands such as “send message X via TCP connection Y”, and handle the entire transaction in hardware -- the main CPU is not involved.

As another example, NICs that implement OpenFabrics-style RDMA typically do two offload-ish kinds of things (in addition to OS bypass, which I’ll discuss later):

  • Accept high-level commands, such as “put buffer X at address Y on remote node Z.” The hardware takes buffer X, fragments it (if necessary), frames the fragments, and sends them to the peer. The peer NIC hardware takes the incoming frames, sees that they’re part of an RDMA transaction, and puts the incoming data in buffer starting at address Y. Similarly, other high-level commands can effect protocol decisions and actions in peer NIC hardware.
  • Provide asynchronous, “background” progress in ongoing network operations. In the above “put” example, if the message was 1MB in length, it may take some time to retrieve the entire message from main RAM, fragment it, frame it, and send it (and it’s usually done in a pipelined fashion). The main CPU is not involved in this process -- the processor(s) on the NIC itself handle all of this.

While each of InfiniBand, iWARP, and RoCE NICs have slightly different ways of effecting offloading, they all expose high-level commands in firmware that allow relatively thin software layers to perform complex message-passing tasks that are wholly enacted in the NIC hardware.

The main idea is that once a high-level command is issued, an offload-capable NIC’s firmware takes over and the main CPU is no longer necessary for the rest of that transaction.

The whole point of offloading is basically to allow the NIC to handle network-ish stuff, thereby freeing up the CPU to do non-network-ish stuff (i.e., whatever your application needs).

This is frequently referred to as overlapping computation (on the main CPU) with communication (on the NIC). Since CPUs are tremendously faster than NICs / network speeds, you can do a lot of computation on the main CPU while waiting for a network action to complete.
From "Hardware vs. software: user questions"
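
To see what “put buffer X at address Y on remote node Z” looks like at the API level, here is a minimal libibverbs sketch that posts a one-sided RDMA WRITE. It assumes the queue pair is already connected and that the remote address and rkey were exchanged out of band; all of that setup is omitted, so this is only an illustration of the high-level command, not a complete program.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA WRITE ("put"). Assumes qp is already in RTS,
 * mr registers local_buf, and remote_addr/remote_rkey came from the peer. */
static int rdma_put(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided put */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;         /* "address Y" on the peer */
    wr.wr.rdma.rkey        = remote_rkey;         /* key for the remote MR */

    /* From here on, the NIC fragments, frames, and transmits the buffer
     * by itself; the host CPU only polls the CQ later for the completion. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```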

Offloading and Onloading view the CPU differently. Offloading improves performance by relieving the CPU of TCP/IP processing altogether. Onloading, on the other hand, dedicates a particular CPU core to TCP/IP processing so that the remaining cores can be used to their fullest by the application. This is only practical in the multi-core era. The figure below compares Offloading and Onloading.
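
A crude way to picture the Onloading side: the host CPU still runs the TCP/IP stack, but the thread that does all the socket work is pinned to one dedicated core so the other cores stay free for the application. Real onload stacks such as Solarflare's OpenOnload go much further (a user-space TCP stack with kernel bypass); the sketch below only shows the core-dedication idea, and the core number is an arbitrary assumption.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Arbitrary choice of the core dedicated to network processing. */
#define NET_CORE 2

static void *network_loop(void *arg)
{
    (void)arg;
    /* The recv()/epoll loop for all connections would live here,
     * so TCP/IP processing stays on NET_CORE. */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    cpu_set_t cpus;

    pthread_create(&tid, NULL, network_loop, NULL);

    /* Pin the network thread to its dedicated core. */
    CPU_ZERO(&cpus);
    CPU_SET(NET_CORE, &cpus);
    int err = pthread_setaffinity_np(tid, sizeof(cpus), &cpus);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));

    pthread_join(tid, NULL);
    return 0;
}
```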

3.
Among the cards on the market today, Chelsio, Mellanox, Myricom, and Solarflare are the most widely used. Chelsio and Mellanox take the Offloading approach, while Myricom and Solarflare take the Onloading approach. That does not mean an Offloading card comes with no software accelerator, nor that an Onloading card's NIC hardware does only menial work. You should tune for your intended use and run a benchmark (BMT) before making a choice.

Chelsio vs Solarflare

Among those use cases, DMA and DR services face a firewall issue: the traffic must pass through a firewall, so a protocol such as RoCE cannot be used. The options are TCP/IP, or iWARP, which runs on top of TCP. Which protocol performs better? If you go with iWARP, as the article below does, there is one more item to add when selecting a network card.

RDMA support

Design low latency iWARP network systems
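
Checking the "RDMA support" item above is straightforward in software: whether the card speaks iWARP, RoCE, or InfiniBand, an RDMA-capable NIC shows up through the verbs interface. The sketch below simply enumerates the RDMA devices the system sees and prints each one's transport type.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);

    if (!list || num == 0) {
        fprintf(stderr, "no RDMA-capable devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++)
        printf("RDMA device: %s (%s)\n",
               ibv_get_device_name(list[i]),
               list[i]->transport_type == IBV_TRANSPORT_IWARP
                   ? "iWARP" : "IB/RoCE");

    ibv_free_device_list(list);
    return 0;
}
```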

To raise interest in RDMA technology among capital-markets IT practitioners, I am preparing a training session for September. However networking technology evolves, I believe RDMA will continue to occupy an important place.

We welcome your opinions on the RDMA and iWARP training
