DeepSeek, Fire-Flyer, and High-Flyer

1.
DeepSeek.

Among the AI posts showing up on Facebook, mentions of DeepSeek have surged recently. I assumed it was just one of many companies. Then I came across something unexpected: a post by 김성완.

DeepSeek-V3 and DeepSeek-R1, which stunned the world with their remarkable performance, drew enormous attention because DeepSeek, a young Chinese AI company, achieved them at far lower cost and with lower-grade GPUs than the big tech companies of the US or China.
Then came the intriguing news that this remarkable result originated as a side project of a Chinese quant firm. These are firms that even design their own chips to process trades at nanosecond scale, so their entry into AI is very encouraging.
The best-known quant firm is Renaissance Technologies, famous for preferring PhD holders in mathematics, physics, and computer science over people with traditional finance backgrounds.
Note: A quant is a specialist who develops mathematical and statistical models to find patterns and opportunities in financial markets. Quants analyze vast amounts of data and use computer algorithms to build systematic investment strategies, pursuing objective, scientific, data-driven investment decisions rather than relying on human emotion or intuition. The field is dominated by PhD-level talent from quantitative disciplines such as mathematics, physics, and computer engineering, and strong programming and statistical-analysis skills are characteristic.
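
The systematic, rule-based approach described above can be illustrated with a deliberately tiny example: a moving-average crossover signal. This is not any particular firm's strategy, only a sketch of what "data-driven and algorithmic" means in practice; the window lengths and the +1/-1 signal rule are arbitrary choices for illustration.

```python
# Toy quant-style signal: go long (+1) when the short moving average is
# above the long one, short (-1) otherwise. Purely illustrative; real
# quant models are vastly more sophisticated than this.

def moving_average(prices, window):
    """Simple moving average; one value per fully covered window."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

def crossover_signal(prices, short=3, long=5):
    """Return +1/-1 for each day on which both averages are defined."""
    s = moving_average(prices, short)[long - short:]  # align with long MA
    l = moving_average(prices, long)
    return [1 if a > b else -1 for a, b in zip(s, l)]
```

On a steadily rising price series the short average sits above the long one, so the signal stays at +1; on a falling series it stays at -1.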

Quant. DeepSeek is said to be a project that a financial firm created while pursuing AI-based work. First, a Financial Times article.

The quant fund got its start in a Chengdu apartment, where founder Liang Wenfeng, a computer science graduate of Zhejiang University, experimented with automated stock trading, according to local media reports. His profile in China’s asset management association registry says he was a freelancer until 2013, when he incorporated his first investment firm.

By 2021, all of High-Flyer’s strategies were using AI, according to manager Cai Liyu, employing strategies similar to those pioneered by hugely profitable hedge fund Renaissance Technologies. “AI helps to extract valuable data from massive data sets which can be useful for predicting stock prices and making investment decisions,” he said during a roadshow that streamed online that year.

Cai said the company’s first computing cluster had cost nearly Rmb200mn and that High-Flyer was investing about Rmb1bn to build a second supercomputing cluster, which would stretch across a roughly football pitch-sized area. Most of their profits went back into their AI infrastructure, he added.

The second cluster, now complete, connects more than 10,000 of Nvidia’s cutting-edge processors with servers and storage, giving DeepSeek the computing power to train a large model, according to archived versions of the company’s website. The group acquired the Nvidia A100 chips before Washington restricted their delivery to China in mid-2022.

“We always wanted to carry out larger-scale experiments, so we’ve always aimed to deploy as much computational power as possible,” founder Liang told Chinese tech site 36Kr last year. “We wanted to find a paradigm that can fully describe the entire financial market.”

The company is one of six Chinese groups with more than 10,000 A100 processors, commonly believed to be a computational threshold for self-training large models, according to Guosheng Securities. The other five are all Chinese tech giants, though their collective computing power pales in comparison to US companies. Meta has said it will have computing power equal to nearly 600,000 of Nvidia’s more advanced H100 chips by the end of the year.
From “Chinese quant fund becomes an AI high-flyer”

The first company that came to mind while reading the article above was Qraft Technologies. Its starting point and its original ambitions were similar, though it now seems to be heading in a different direction.

The article felt incomplete, so I looked for others. The piece below caught my eye.

Deepseek: From Hedge Fund to Frontier Model Maker

This is the part of the article that summarizes the founder's interview. What caught my eye was the system built from older GPUs.

Deepseek’s CEO lays out a grand strategy for AGI development. It explores:

  • Why High-Flyer decided to make early GPU purchases,
  • Liang’s belief in LLMs and the linguistic nature of human intelligence,
  • Methods to sustainably manage high research costs, including innovative uses of philanthropic budgets,
  • How High-Flyer plans to democratize AI access,
  • Organizational designs that facilitate innovation, from unconventional hiring to rejecting KPIs,
  • How curiosity-driven startups can succeed in an era dominated by tech giants,
  • Why High-Flyer pursues “hardcore innovation” instead of a business model based on imitation.

Below is the part of the interview concerning GPUs. They say they have used GPUs ever since they first developed AI-based asset-management strategies, with around 100 cards circa 2015. Retired but still serviceable GPUs apparently go toward supporting other companies through partnerships.

Waves: GPUs typically depreciate at about 20% (annually).
Liang: We haven’t calculated precisely, but it’s likely less. NVIDIA GPUs hold their value well, and older cards still find buyers. Our previously retired GPUs still held decent value when sold second-hand, so we didn’t lose too much.
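
A quick back-of-envelope check on the figure in the question above: at a flat 20% annual decline, residual value is the purchase price times 0.8 raised to the number of years held. The purchase price below is an arbitrary illustrative number, not anything from the interview.

```python
# Residual value under flat annual depreciation. The 20% rate is the
# interviewer's figure above; the $10,000 price is purely illustrative.
def residual_value(price, years, annual_rate=0.20):
    return price * (1 - annual_rate) ** years

print(residual_value(10_000, 3))  # ~5120: the card keeps about half its value
```

Liang's point is that actual resale values for NVIDIA cards tend to run above this simple schedule.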

Waves: GPUs are the scarce commodity in this wave of ChatGPT-related startups, yet you had the foresight to stockpile 10,000 of them as early as 2021. Why?
Liang: It was a gradual process — from a single card in the early days to 100 cards in 2015, 1,000 cards in 2019, and then 10,000 cards. Up to a few hundred cards, we relied on external Internet data centers. When the scale expanded, we began building our own facilities.  People may think there’s some hidden business logic behind this, but it’s mainly driven by curiosity.

Waves: Some assumed your clusters were primarily for financial market predictions.
Liang: If purely for quant investing, even a small number of GPUs would suffice. Our broader research aims to understand what kind of paradigms can fully describe the entire financial market, whether there are simpler ways to express it, the boundaries of these paradigms’ capabilities, and whether they have broader applicability, among other questions.

That said, I lack the ability to analyze what makes DeepSeek technically superior. The paper DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models is said to be something different, but, alas, it is over my head.


Still, browsing the company's homepage, I noticed something that sets it apart from other AI companies. In the architecture diagram on the homepage, unlike those of other companies, there are terms related to hardware and software technology.

Fire-Flyer. The paper covering it is Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning.

The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands of computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieved performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC.
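
The abstract mentions HFReduce, their accelerated allreduce. For readers unfamiliar with the primitive: allreduce sums a buffer across all workers so that every worker ends up with the full result. The sketch below simulates the classic ring algorithm in pure Python as a conceptual illustration only; HFReduce's actual design (a CPU-mediated, PCIe-aware asynchronous reduction) is described in the paper and differs from this.

```python
# Pure-Python simulation of ring allreduce: n "ranks" each hold a vector,
# and after the call every rank holds the element-wise sum. No real
# communication happens; list slices stand in for network messages.

def ring_allreduce(ranks):
    n = len(ranks)                       # number of simulated workers
    size = len(ranks[0])
    assert size % n == 0, "vector must split evenly into n chunks"
    c = size // n

    def seg(idx):                        # bounds of chunk `idx`
        return idx * c, (idx + 1) * c

    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) mod n
    # to its right neighbour, which adds it in. Payloads are snapshotted
    # first to mimic all ranks communicating simultaneously.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            idx = (r - step) % n
            lo, hi = seg(idx)
            sends.append((r, idx, ranks[r][lo:hi]))   # slice = copy
        for r, idx, payload in sends:
            dst = ranks[(r + 1) % n]
            lo, hi = seg(idx)
            for i, v in enumerate(payload):
                dst[lo + i] += v

    # Phase 2: all-gather. Rank r now owns the fully reduced chunk
    # (r + 1) mod n; reduced chunks are forwarded around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            idx = (r + 1 - step) % n
            lo, hi = seg(idx)
            sends.append((r, idx, ranks[r][lo:hi]))
        for r, idx, payload in sends:
            lo, hi = seg(idx)
            ranks[(r + 1) % n][lo:hi] = payload
    return ranks
```

Each of the 2(n-1) steps moves only 1/n of the buffer per rank, which is why the ring pattern keeps per-link bandwidth nearly constant as the worker count grows.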


The paper shows traces of their early effort to squeeze high performance out of a small budget. The very familiar InfiniBand appears, and so does Mellanox.

The paper also covers:

HFReduce: hardware-software co-design in the network
HaiScale: special optimizations for deep learning model training
3FS: a high-throughput distributed file system

The company's blog offers brief descriptions of each technology.

hfreduce | high-performance multi-card parallel communication tool
haiscale | High-Flyer Fire-Flyer high-performance parallel training tool library
3FS | High-Flyer's high-speed file system
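
A recurring theme across this software stack is overlapping computation with communication, which the paper credits for much of its scalability. Below is a minimal sketch of the idea, with plain Python threads standing in for GPU kernels and network transfers; real systems use CUDA streams and NCCL-style libraries rather than threads, and the function names here are invented for illustration.

```python
# Sketch of compute/communication overlap: instead of finishing every
# layer's backward pass and then synchronizing all gradients, each
# layer's gradient "communication" is launched as soon as that layer's
# "compute" finishes, so it runs concurrently with the next layer.
import threading

def train_step_overlapped(n_layers, compute, communicate):
    pending = []
    for layer in range(n_layers):
        compute(layer)                                  # e.g. backward pass
        t = threading.Thread(target=communicate, args=(layer,))
        t.start()                                       # sync overlaps next layer
        pending.append(t)
    for t in pending:                                   # drain outstanding traffic
        t.join()
```

The caller supplies `compute` and `communicate` callbacks; the point is only that communication for layer k proceeds while layer k+1 is still computing.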

2.
Earlier I introduced the interview with DeepSeek's founder. Below are the parts where, as an AI entrepreneur, he discusses research and investment.

Waves: What kind of curiosity?
Liang: Curiosity about the boundaries of AI capabilities. For many outsiders, the wave triggered by ChatGPT has been particularly disruptive; for those within the field, however, it was the impact of AlexNet in 2012 that ushered in a new era. AlexNet’s error rate was significantly lower than that of other models at the time, reviving neural network research that had been dormant for decades.

While specific technical directions have constantly evolved, the combination of models, data, and computing power has remained a constant. Especially after OpenAI released GPT-3 in 2020, the direction became clear: massive computing power would be essential. Yet even in 2021, when we were investing in the construction of Yinghuo Two, most people still couldn’t grasp the rationale.

Waves: So you did start paying attention to computational power in 2012?

Liang: Researchers have an insatiable hunger for computational resources. Small experiments often lead to a desire for larger-scale trials, prompting us to continuously expand our capacity.

Waves: But this process is also a money-burning endeavor.

Liang: An exciting endeavor perhaps cannot be measured purely in monetary terms. It’s like someone buying a piano for a home — first, they can afford it, and second, such a group of people are eager to play beautiful music on it.

Waves: Clusters require significant expenses — maintenance, labor, and even electricity.

Liang: Electricity and maintenance are relatively inexpensive, constituting about 1% of hardware costs annually. Labor is more significant but represents an investment in our future and a key asset for the company. The people we choose tend to be relatively humble, driven by curiosity, and have the opportunity to conduct research here.

Waves: High-Flyer recently announced its entry into the large-model space. Why is a Quant Fund undertaking such an endeavor?
Liang Wenfeng: Our large-model project is unrelated to our quant and financial activities. We’ve established an independent company called DeepSeek, to focus on this.

Many in our High-Flyer team come from an AI background. Years ago, we experimented with various applications before entering the complex domain of finance. AGI may be one of the next most challenging frontiers, so for us, the question is not “why” but “how”.

Waves: Are you training a general-purpose model, or focusing on vertical domains like finance?
Liang: We’re working on AGI — Artificial General Intelligence. Language models are likely a prerequisite for AGI and already exhibit some AGI characteristics. So we’ll start there and later expand into areas like computer vision.

Waves: Due to the entry of tech giants, many startup companies have abandoned the pursuit of solely developing general-purpose large models.
Liang: We won’t prematurely focus on applications. Our focus is solely on the large model itself.


Waves: Why do you define your goal as “to focus on research and exploration”?

Liang: It’s driven by curiosity. From a broader perspective, we want to validate certain hypotheses. For example, we hypothesize that the essence of human intelligence might be language, and human thought could essentially be a linguistic process. What you think of as “thinking” might actually be your brain weaving language. This suggests that human-like AGI could potentially emerge from large language models.

From a closer perspective, GPT-4 still holds many mysteries waiting to be unraveled. While reproducing it, we are also conducting research to uncover these secrets.

Waves: But research comes at a higher cost.

Liang: Reproduction alone is relatively cheap: based on public papers and open-source code, a minimal amount of training, or even just fine-tuning, suffices. Research, however, involves extensive experiments, comparisons, and higher computational and talent demands.

Finally, Deepseek: The Quiet Giant Leading China’s AI Race is another write-up of the interview. Its main topics are below.

Part 1: How was the first shot of the price war fired?
Part 2: The Real Gap Isn’t One or Two Years. It’s Between Original Innovation and Imitation.
Part 3: More Investments Do Not Equal More Innovation
Part 4: A group of young people doing “inscrutable” work
Part 5: All the methods are products of a previous generation

It offers a window into the thinking of an Asian AI entrepreneur.

3.
A post describing DeepSeek's impact appeared on an anonymous US forum. The original is in English.

Meta's generative AI organization falls into panic mode

It started with DeepSeek v3, which has already left Llama 4 behind on benchmarks. What made it worse was that this came from "an unknown Chinese company with a $5.5 million training budget." Engineers are frantically dissecting DeepSeek and copying everything they possibly can. This is no exaggeration.
Management is worried about justifying the enormous cost of the generative AI organization. How do you explain to leadership that every single "leader" in the org is paid more than it cost to train DeepSeek v3, and that there are dozens of such "leaders"? DeepSeek r1 made things even scarier. I can't disclose confidential information, but it will all be public soon anyway.

What should have been a small, engineering-focused organization turned into a situation where everyone loses, because so many people tried to grab influence and artificially inflate the org's headcount.
From "Meta genai org in panic mode"

Here is a reaction to the post above, from Twitter.

In the comments on the GeekNews post "Meta's generative AI org is in shock because of DeepSeek" (teamblind.com), you can see reactions from other AI companies.
