How Did DeepSeek Turn Low-Performance Nvidia GPUs into High-Performance Ones?

1.
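What is the secret behind how DeepSeek turned low-performance GPUs into high-performance ones? I had been wondering, and then a headline on exactly this topic caught my eye.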

DeepSeek’s AI breakthrough bypasses industry-standard CUDA for some functions, uses Nvidia’s assembly-like PTX programming instead

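I read it expecting something remarkable, but the source was unexpected. It was an excerpt from Mirae Asset Securities' AI report, AI Weekly #45, "2025: AI Innovation Continues and Accelerates".

Let's look at what PTX is. First, here is what Nvidia's official CUDA documentation says.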

1.1. Scalable Data-Parallel Computing using GPUs

Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable GPU has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth. The GPU is especially well-suited to address problems that can be expressed as data-parallel computations – the same program is executed on many data elements in parallel – with high arithmetic intensity – the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.

PTX defines a virtual machine and ISA for general purpose parallel thread execution. PTX programs are translated at install time to the target hardware instruction set. The PTX-to-GPU translator and driver enable NVIDIA GPUs to be used as programmable parallel computers.

1.2. Goals of PTX

PTX provides a stable programming model and instruction set for general purpose parallel programming. It is designed to be efficient on NVIDIA GPUs supporting the computation features defined by the NVIDIA Tesla architecture. High level language compilers for languages such as CUDA and C/C++ generate PTX instructions, which are optimized for and translated to native target-architecture instructions.

The goals for PTX include the following:

Provide a stable ISA that spans multiple GPU generations.

Achieve performance in compiled applications comparable to native GPU performance.

Provide a machine-independent ISA for C/C++ and other compilers to target.

Provide a code distribution ISA for application and middleware developers.

Provide a common source-level ISA for optimizing code generators and translators, which map PTX to specific target machines.

Facilitate hand-coding of libraries, performance kernels, and architecture tests.

Provide a scalable programming model that spans GPU sizes from a single unit to many parallel units.

From Parallel Thread Execution ISA Version 8.7

Here is a post that summarizes this.

PTX (or PTX ISA)

This post summarizes CUDA PTX (Parallel Thread Execution). CUDA PTX is a low-level parallel thread execution virtual machine and instruction set architecture (ISA). You can think of it as the ISA that drives the GPU, produced by compiling a .cu file. nvcc compiles CUDA kernel code into PTX instructions. The compiled PTX instructions are then translated into binary code by another compiler inside the GPU driver, and that binary code runs on the GPU. Publishing an ISA is said to amount to publishing the low-level device architecture. For that reason, NVIDIA does not publish the instructions that the GPU driver generates and publishes only the device-independent PTX ISA. This approach has another advantage. Unlike CPUs, GPUs have no de facto standard ISA yet (NVIDIA, Intel, AMD, and Qualcomm each have their own GPU architecture, and even within NVIDIA's products the design keeps changing across Pascal, Volta, Turing, and Ampere), and GPU architectures keep evolving (for example, the addition of Tensor Cores). So only the device-independent PTX ISA is published, and the device driver compiles it once more for each specific device. (I will write up the detailed .cu -> binary code compilation process another time.)

In short, PTX is a virtual assembly.

At run time it is translated into the machine code that actually runs on the GPU.

As a result, the same PTX ISA can be used even when the hardware implementation changes.

From CUDA PTX – 1 : Introduction

Another article summarizes the relationship between CUDA and PTX as follows.

PTX (Parallel Thread Execution) assembly to unlock previously untapped efficiency in GPU operations.

CUDA is often the go-to framework for AI development because it provides an easy-to-use interface for GPU programming. However, CUDA is essentially a high-level abstraction—it translates code into PTX, which is Nvidia’s intermediate assembly language for GPU execution.
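From How DeepSeek is Making High-Performance AI Accessible to All

A program written in Java is compiled to Java bytecode, which is then translated into machine code when it runs. A program written in C is translated into assembly language and then into machine code. In the same way, a program written in CUDA is translated into PTX before it runs on the GPU. However, because the architecture differs from one Nvidia GPU to the next, Nvidia appears to have defined a layer that works on every GPU and added a second step that turns it into machine code for the specific device. That seems to be why PTX is called a virtual assembly. PTX is therefore still part of Nvidia's own CUDA stack, which is why the interpretation that DeepSeek "broke the CUDA moat" is said to be wrong.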
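The two-stage flow described above can be sketched with the toolchain's own commands. This is a minimal sketch, assuming the CUDA toolkit's `nvcc` and `ptxas` are installed. The file name `kernel.cu`, the architectures `compute_80`, `sm_80`, and `sm_90`, and the helper function names are placeholders chosen for illustration, not anything taken from DeepSeek. The code only builds the command lines and does not run them.

```python
def cuda_to_ptx_cmd(source, ptx_out, virtual_arch="compute_80"):
    """Stage 1: compile CUDA C++ to PTX, the device-independent virtual ISA."""
    return ["nvcc", "-ptx", f"-arch={virtual_arch}", source, "-o", ptx_out]


def ptx_to_cubin_cmd(ptx_in, cubin_out, real_arch):
    """Stage 2: assemble PTX into machine code for one concrete GPU architecture."""
    return ["ptxas", f"-arch={real_arch}", ptx_in, "-o", cubin_out]


def build_plan(source, real_archs):
    """One PTX file, then one architecture-specific binary per target GPU."""
    stem = source.rsplit(".", 1)[0]  # "kernel.cu" -> "kernel"
    ptx = stem + ".ptx"
    plan = [cuda_to_ptx_cmd(source, ptx)]
    for arch in real_archs:
        plan.append(ptx_to_cubin_cmd(ptx, f"{stem}.{arch}.cubin", arch))
    return plan


if __name__ == "__main__":
    # Hypothetical targets: the same PTX is assembled for two GPU generations.
    for cmd in build_plan("kernel.cu", ["sm_80", "sm_90"]):
        print(" ".join(cmd))
```

In everyday use the second stage is usually hidden, because the GPU driver can translate PTX itself when the program is loaded. Hand-writing or hand-tuning the PTX that sits between the two stages is the kind of low-level work the headline refers to.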

2.
Besides PTX, another technology is said to have affected DeepSeek's performance: Huawei's AI chip, the 910C.

Although R1 was reportedly trained on over two thousand H800 GPUs from Nvidia, it’s significant for Huawei that the company’s GPUs have explicit support for actually running the LLM. This could cut out yet another part of the process where AI firms in China had to rely on Western companies, in this case, Nvidia and AMD, whose GPUs are sought out for both training and inference thanks to their high performance. However, Huawei may be catching up.

“Inference performance on Huawei 910C achieves 60% of the H100’s performance from developers [sic] experience,” Jin said on X. “With hand-written CUNN kernels and optimizations, the performance is higher.” Jin also noted that the 910C could also be used for training, but the R1 was officially trained using H800 chips, though that doesn’t mean DeepSeek will continue to use those H800s forever.
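From Huawei adds DeepSeek-optimized inference support for its Ascend AI GPUs

There is also news about using ordinary consumer GPUs rather than high-end ones. Once again, the source is China.

A research team at Shenzhen MSU-BIT University in China has developed an innovative algorithm that improves the performance of consumer GPUs by up to 800 times. The technology is expected to be applied to complex problems in the mechanics of materials across industries such as aerospace and civil engineering.

According to the Chinese Journal of Computational Mechanics on the 30th (local time), a research team led by Associate Professor Yang Yang developed the "PD-General" framework, which dramatically improves the computational efficiency of peridynamics (PD) theory. The technique uses Nvidia's CUDA programming technology to analyze the GPU's structure in depth and optimizes algorithm design and memory management.

When the team's algorithm was applied to a consumer Nvidia GeForce RTX 4070, it ran up to 800 times faster than the existing serial program and 100 times faster even compared with a parallel OpenMP program. A simulation containing millions of particles completed 4,000 iteration steps in 5 minutes, and even the most complex two-dimensional uniaxial tension problem processed 69.85 million iterative calculations in under 2 minutes.

Associate Professor Yang Yang said, "With this technology, calculations that used to take days can be cut to hours or minutes using an ordinary home GPU," and called it "an important advance in peridynamics research." Peridynamics is a recent theory for solving complex physical problems such as cracking, damage, and fracture. It is particularly good at modeling material damage and is widely used in aerospace, civil engineering, and military applications. However, because of its high computational complexity, large-scale simulations have suffered from heavy memory use and slow processing.

In aerospace, the technology can be used to model crack propagation in aircraft materials during a collision, and in civil engineering to simulate how damage progresses in bridges and buildings during an earthquake. It is also expected to play an important role in ballistics and explosion research for military equipment development.

The research team expects the performance of the PD-General framework to improve further as GPU hardware continues to advance. This is expected to make it possible to solve more complex mechanical problems and to open up new application areas.

In particular, because the technology can use ordinary consumer GPUs that are not subject to US sanctions, it is expected to see broad use across Chinese industry. It is regarded as a meaningful achievement in China securing its own technological capability amid intensifying US-China technology competition.
From "Chinese researchers develop innovative algorithm that boosts GPU performance 800-fold"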

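Two things in the article above caught my attention.

First, the RTX 4070 in the sentence "When the team's algorithm was applied to a consumer Nvidia GeForce RTX 4070". It is a GPU sold to ordinary consumers.
Second, the approach described as "analyze the GPU's structure in depth and optimizes algorithm design and memory management". The team developed its software by understanding the hardware and optimizing for it, which is similar to the direction DeepSeek took.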
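The figures quoted in the article also allow a quick sanity check. The following is a minimal sketch that only does arithmetic on the numbers as reported. It assumes the 800x and 100x figures refer to the same workload, which the article does not state, and the function names are made up for illustration.

```python
# Figures as quoted in the article (not independently verified).
SPEEDUP_VS_SERIAL = 800       # "up to 800 times" faster than the serial program
SPEEDUP_VS_OPENMP = 100       # "100 times" faster than the OpenMP program
STEPS, STEP_MINUTES = 4_000, 5            # 4,000 iteration steps in 5 minutes
ITERATIONS, MAX_MINUTES = 69_850_000, 2   # 69.85 million iterations in under 2 minutes


def implied_openmp_speedup(vs_serial, vs_openmp):
    """If both ratios describe the same workload, OpenMP vs serial is their quotient."""
    return vs_serial / vs_openmp


def seconds_per_step(minutes, steps):
    """Average wall-clock time per iteration step."""
    return minutes * 60 / steps


def min_iterations_per_second(iterations, max_minutes):
    """Lower bound on throughput, since the run finished within max_minutes."""
    return iterations / (max_minutes * 60)


if __name__ == "__main__":
    print(implied_openmp_speedup(SPEEDUP_VS_SERIAL, SPEEDUP_VS_OPENMP))
    print(seconds_per_step(STEP_MINUTES, STEPS))
    print(round(min_iterations_per_second(ITERATIONS, MAX_MINUTES)))
```

Under that assumption, the OpenMP baseline would itself be roughly 8 times faster than the serial program, so the headline 800x figure is measured against a single-threaded starting point rather than against an already parallel one.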

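The paper is said to appear in issue 1 of 2025 of 中国计算力学学报 (the Chinese Journal of Computational Mechanics), but it has not yet been posted on the journal's website.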

3.
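Here are the AI reports published by Mirae Asset Securities, mentioned earlier.

AI Weekly #46: The AI Singularity Has Already Been Reached
AI Weekly #47: Nvidia Is an AI Company at Its Core
AI Weekly #48: AI in 2025: China's Declaration of War and America's Great Awakening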

