1.
Things are certainly different from before: there are now plenty of articles on HFT and low latency. In two previous posts I introduced books on the subject:
Low Latency Trading Insights
A book dedicated to professional HFT system development
This time, I introduce a few projects that apply low-latency techniques.
The areas this project focuses on technically are listed below.
Low-Latency Optimizations
Lock-free Data Structures: SPSC/MPSC queues with cache-line padding
Custom Memory Pool: O(1) allocation without heap fragmentation
Cache-Aware Design: 64-byte aligned structures for cache efficiency
CPU Affinity: Thread pinning for deterministic performance
RDTSC Timing: Nanosecond-precision measurement with minimal overhead
There is no single right answer to "how should the application be designed." Personally I believe the simplest structure is the fastest, but in practice a truly simple structure is hard to build. Setting software architecture aside, what remains is the OS and the hardware. The related issues are summarized below.
5.2 Impact of OS and Hardware Optimizations
Next, we use HFTPerformance to quantify the effect of several well-known optimizations in a controlled manner. Each test in this subsection was run for 10 seconds with 1 million tick messages total (to gather a robust distribution sample). We repeated each scenario 5 times and report averaged results (with negligible variance between runs thanks to deterministic config).
- CPU Affinity and Isolation: We measured latency for processing a fixed 100k msg/s feed in four setups: (A) no pinning (OS scheduler free to move threads), (B) pinned threads to isolated cores, (C) pinned + disabled Hyper-threading on those cores (to avoid sibling interference), and (D) pinned + isolated + real-time scheduling (using SCHED_FIFO). The median latencies were: A: 15.4 µs, B: 12.7 µs, C: 12.5 µs, D: 12.4 µs. The tail (99.9th) saw a more dramatic improvement: A: 120 µs, B: 40 µs, C: 35 µs, D: 30 µs. This aligns with expectations that isolating the CPU (preventing other tasks or kernel from preempting) and using real-time priority both reduce jitter significantly[12][17]. The incremental difference between C and D was small, indicating that once isolated, the default scheduling was already not interfering much. These results, obtained via HFTPerformance, quantitatively support the practice of dedicating cores to critical HFT threads.
- Kernel Bypass vs Standard Stack: We integrated a simple use of DPDK in the Market Data Generator and Exchange Simulator, bypassing the kernel networking. We then compared a scenario of sending and receiving 64-byte messages via kernel UDP sockets versus via DPDK’s poll-mode drivers (in both cases on the same machine using a virtual NIC device for testing). With kernel UDP, average one-way latency was ~9 µs (for just the network send-receive path between two threads on loopback interface at 10Gbps rate) with significant jitter (max ~30 µs, some outliers to 100 µs under load). With DPDK, average dropped to ~5 µs and was extremely consistent (max ~6 µs). While these absolute numbers include only the network path, they reflect how HFTPerformance can isolate the contribution of network stack overhead. The ~4 µs saving is in line with literature stating that kernel bypass can save a few microseconds by avoiding interrupts and syscalls[8][37]. This test showcases the tool’s ability to test different I/O mechanisms by swapping modules (we simply recompiled the Exchange Simulator to use DPDK send/receive calls and kept the rest the same).
- Threading Model Comparison: We set up a test of a simple arbitrage strategy on two instruments to see single-thread vs multi-thread differences. In single-thread mode, the strategy processed both instrument feeds sequentially. In dual-thread mode, each instrument’s feed was handled by its own thread (on separate cores) and they operated in parallel — essentially two parallel pipelines. We expected the single thread to have better per-event latency (no context switching), but the multi-thread to handle combined throughput better. Indeed, results: at low load (total 50k msg/s), single-thread median latency was 10 µs vs multi-thread 14 µs. However, when we increased load to 200k msg/s (100k per instrument), the single thread began to lag, showing 95th percentile latencies of 100+ µs (queue buildup), whereas the dual-thread setup still kept 95th percentile under 30 µs by dividing the work. The total throughput supported before saturating was ~120k msg/s for single-thread versus ~220k msg/s for dual-thread in this particular strategy. This experiment demonstrates how HFTPerformance can help find the crossover point where parallelization outweighs its overhead. It quantitatively supports the intuition that “HFT infra is 80% plumbing, 20% genius algo — nail the pipes, or watch your edge evaporate”[38]: i.e., if your code can’t keep up with the firehose of data, clever algorithms won’t matter. Using the tool, one can iteratively refine such decisions about threading and concurrency.
- Code Optimization Case Study: We took a real-world inspired computation — a moving average calculation on tick prices — and tested two implementations: one using a straightforward loop in C++, and another using SIMD instructions (via AVX2 intrinsics). We embedded each into the strategy and measured how long it added to per-tick processing. The scalar version achieved ~3 million computations per second, whereas the SIMD version achieved ~10 million per second (roughly a 3× speedup). In terms of latency per tick in our pipeline, this translated to roughly 0.33 µs delay (scalar) versus 0.1 µs (SIMD) added. This is a small difference in absolute terms, but in HFT even sub-microsecond improvements can accumulate. If a strategy does dozens of such math operations, these optimizations become meaningful. The key point is HFTPerformance allowed us to measure those differences directly in situ. Rather than relying on isolated micro-benchmarks or theoretical CPU FLOPS, we timed them as part of message processing. Moreover, by looking at the timeline of events, we confirmed that the SIMD version freed up enough CPU headroom to handle a higher tick rate before saturating. This showcases the tool’s usefulness in guiding low-level code optimization: you can plug in a new optimized routine and immediately see the latency impact on the overall system.
Design and Implementation of a Low-Latency High-Frequency Trading System for Cryptocurrency Markets
The repository for this project is the Crypto HFT Trading Platform. Below are the points the project focuses on. If you read the paper, a library that LMAX open-sourced long ago makes an appearance. There really is nothing new under the sun…
2.2 Low-Latency Design Principles
The goal of low-latency design is to reduce unpredictability and worst-case delays. Key principles include mechanical sympathy, lock-free data structures, zero-copy processing, and the single-writer principle:
· Mechanical Sympathy: This term (coined by LMAX) means aligning software design with CPU architecture. For example, minimizing cache misses, avoiding false sharing, and aligning data to cache lines can vastly improve performance. We consider CPU caches, memory bandwidth, and instruction pipelining when structuring data flows.
· Lock-Free Data Structures: Locks and blocking queues introduce non-deterministic latencies due to context switches and contention. In contrast, lock-free ring buffers or queues (using atomic CAS operations) can achieve predictable throughput. The LMAX Disruptor is a ring buffer with sequence counters, ensuring only one thread writes a given slot, eliminating locks[7]. Similarly, Agrona (a library we use) provides lock-free collections and ring buffers off-heap.
· Zero-Copy Processing: Every memory copy or allocation can induce CPU overhead and, in Java, garbage collection. Zero-copy means processing messages directly from pre-allocated buffers without extra copies. For example, Simple Binary Encoding (SBE) serializes messages to and from fixed binary layouts, so a receiver can read fields directly from a byte array without intermediate objects. Network I/O frameworks like Netty also support zero-copy features (e.g., using ByteBuf and FileRegion abstractions) to avoid copying data between user/kernel space
· Single-Writer Principle: Designing each data structure with a single writer thread avoids synchronization on writes. Readers may spin or wait, but writes never contend. In an order book, for instance, one thread can own each side of the book. This dramatically simplifies concurrency. We follow this principle by having one thread (the matching engine) write order updates, while other threads (risk, market data distributors) read from it via the Disruptor without locks.
By adhering to these principles, many advanced HFT systems achieve latencies on the order of microseconds or less. In practice, inter-thread communication latencies can reach tens of nanoseconds[3], and end-to-end tick-to-trade latencies can be under a millisecond even accounting for network I/O.
2.
The projects above come from overseas; the project below is a Korean one. It consists of posts by Gihun, who runs the blog 짧고 쉽게 말하자. Compared with the projects above, it covers the technical background of each topic in detail, so it should be helpful for study.
The benchmark results the author posted are as follows.
Key results:
- Limit Order: ~1,136,000 TPS
- Market Buy Order: ~113,000 TPS
- Mixed Order: ~251,500 TPS
Characteristics:
- Measures pure engine performance (network/DB excluded)
- Achieved using a single CPU core
- Run on macOS (no core pinning or real-time scheduling applied)
- Further gains expected in a Linux environment
Real exchange environment:
- End-to-end latency: roughly 10-50 ms
- Includes network/DB overhead
- Actual throughput: determined by network/DB bottlenecks
[Building an Exchange] 13. Benchmark Test Results
[Building an Exchange] 12. Features Actually Implemented
[Building an Exchange] 11. Matching Engine: Actual Implementation and Data Structures
[Building an Exchange] 10. System Architecture: Code Structure and System Flow
[Building an Exchange] 9. Kafka: Understanding Kafka's Essence as a Distributed Log
[Building an Exchange] 8. Memory Pre-allocation & Zero-copy
[Building an Exchange] 7. CPU Core Pinning and Real-time Scheduling
[Building an Exchange] 6. Serving the Order Book via UDP-based Broadcast
[Building an Exchange] 5. NUMA Nodes and Memory Access Costs
[Building an Exchange] 4. Why Locks Ruin Performance (an OS/CPU/DB-level Analysis)
[Building an Exchange] 3. Write-Ahead Logging (WAL)
[Building an Exchange] 2. BTreeMap and VecDeque
[Building an Exchange] 1. What Is a RingBuffer?


