Branchless Computing

1.
오랜만에 ssrn에서 HFT를 키워드로 해서 검색을 해보았습니다. 대부분 뻔한 논문들인데 전혀 예상하지 못한 논문이 한편 보였습니다.

Semi-static Conditions in Low-latency C for High Frequency Trading: Better than Branch Prediction Hints

논문을 보면 요즘 CPU를 다루고 있습니다. 저에게는 무척이나 생소한 분야입니다. CPU의 구조와 원리를 알아야 합니다. 논문을 완전히 이해하지 못해도 기본 개념 하나는 익히려고 찾아보았습니다.

Branch Prediction

이와 관련한 여럿 글들이 있을 듯 하지만 제가 본 글을 아래입니다. Vulkan이라는 API를 소개하는 위키에 있는 CPU Architecture Essentials입니다.

Branch prediction

With so much deep pipelining and out of order execution inside the CPU, we run into an issue with branches. What happens when the CPU hits a branch but the data is still not known on where to branch? We could stop the whole thing until the data arrives, but that would mean stopping the cpu completely for many clocks, if the branch data depends on a RAM load that might take 200 cycles to arrive. The modern CPUs decide to just decide on the branch direction, and continue executing anyway. If they decide wrong, they will flush the half-done instructions and the pending memory writes, and begin anew. If they guessed correctly all is good and nothing is done. This is known as branch prediction. It tries to predict the future to do educated guesses of where the program flow will go. To do that it keeps storage inside the CPU that records wether a given branch was taken or not last time, and even if there is a pattern like flipping between taken and not taken. The complexity of modern branch predictors is really high, and their specific working details are often closely guarded trade secrets. On something like a i7 or ryzen, the branch predictor hits the correct prediction more than 95% of the time on average. The place where they struggle is when the branch is essentially random for branches that are predictable and stable the predictor will almost always hit it correctly. If the branch is random and is constantly mispredicted, the CPU will stall and have to reset constantly due to the bad predictions, which is often a thing that can slow softare by a significant factor. In some cases, its worth to run both sides of the branch and use a select instruction or a branchless blend as the penalty for misprediction is higher than the calculation. For a video that talks about how to program in a branchless way for performance, “Branchless programming in Cpp” “CPPCON: Branchless Programming in Cpp” can be a good one. Another great article that explains different types of branch predictor seen in CPUs is this one “Danluu: Branch-prediction”
CPU Architecture Essentials중에서

자료를 읽어보면 요즘 CPU의 성능향상을 위해 Branch Prediction을 도입하였는데…Pipelining과 관계가 있네요. 위키의 정의입니다.

Branch Prediction

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures.

Download (PDF, 1.36MB)

2.
요즘 CPU에서 일반적으로 도입한 Branch Prediction. 그런데 Branchless Computing이라는 말이 나온 것으로 보면 특정한 경우 Branch Prediction이 이익이 아니라는 뜻으로 보입니다.

Most software code contains conditional branches. In code, they appear in if-then-else clauses, loops, and switch-case constructs. When encountering a conditional branch, the processor checks a condition and may jump to a new code path, if the branch is taken, or continue with the following instructions. Unfortunately, the processor may only know whether a jump is necessary after executing all instructions before the jump. For better performance, modern processors predict the branch and execute the following instructions speculatively. It is a powerful optimization.

There are some limitations to speculative execution, however. For example, the processor must discard the work done after the misprediction and start anew when the branch is mispredicted. Thankfully, processors are good at recognizing patterns and avoiding mispredictions, when possible. Nevertheless, some branches are intrinsically hard to predict and they may cause a performance bottleneck.
Making Your Code Faster by Taming Branches 중에서

The modern CPUs decide to just decide on the branch direction, and continue executing anyway. If they decide wrong, they will flush the half-done instructions and the pending memory writes, and begin anew.

Branchless Computing을 주제로 한 글을 무척 많습니다. 먼저 Algorithmica가 담고 있는 글이 개인적으로 좋았습니다.

Algorithmica is an open-access web book dedicated to the art and science of computing.

Instruction-Level Parallelism이라는 소제목아래에 다음과 같은 내용으로 채웠습니다.

Pipeline Hazards

The Cost of Branching

Branchless Programming

Instruction Tables

Throughput Computing

두번째는 CppCon 2021 Presentation Materials에서 by Fedor Pikus이 발표한 Branchless Computing: Why Conditions Are Bad for Your Code, and What Can You Do About It입니다.

Download (PDF, 9.08MB)

세번째는 C++ design patterns for low-latency applications including high-frequency trading입니다. 논문이면서 소스코드를 모아놓은 레포지토리이기도 합니다. Low-Latency Programming Repository for High-Frequency Trading (HFT)를 보면 논문의 소스코드가 있습니다. 옛날 자료도 있지만 개인적으로 첫번째 디자인패턴이 좋았습니다.

Comprehensive guide on low-latency programming techniques such as Cache Warming, Constexpr, Loop Unrolling, Lock-Free Programming, and Short-circuiting.

Market-neutral statistical arbitrage pairs trading strategy optimized for low-latency.

Implementation of the Disruptor pattern in C++ optimized for speed and scalability.

Download (PDF, 1.74MB)

#include <benchmark/benchmark.h>

// A typical error checking setup
int errorCounterA = 0;

bool checkForErrorA() {
  // Produce an error once every 10 calls
  errorCounterA++;
  return (errorCounterA % 10) == 0;
}

bool checkForErrorB() {
  // Simulate some error check
  return false;
}

bool checkForErrorC() {
  // Simulate some error check
  return false;
}

void handleErrorA() {
  // Simulate some error handling
}

void handleErrorB() {
  // Simulate some error handling
}

void handleErrorC() {
  // Simulate some error handling
}

void executeHotpath() {
  // Simulate some hot path execution
}

static void Branching(benchmark::State& state) {
  errorCounterA = 0;  // reset the counter before benchmark run
  for (auto _ : state) {
    if (checkForErrorA())
      handleErrorA();
    else if (checkForErrorB())
      handleErrorB();
    else if (checkForErrorC())
      handleErrorC();
    else
      executeHotpath();
  }
}

// A new setup using flags
enum ErrorFlags {
  ErrorA = 1 << 0,
  ErrorB = 1 << 1,
  ErrorC = 1 << 2,
  NoError = 0
};

int errorCounterFlags = 0;

ErrorFlags checkErrors() {
  // Produce ErrorA once every 10 calls
  errorCounterFlags++;
  return (errorCounterFlags % 10) == 0 ? ErrorA : NoError;
}

void HandleError(ErrorFlags errorFlags) {
  // Simulate some error handling based on flags
  if (errorFlags & ErrorA) {
    handleErrorA();
  }
  // handle other errors similarly...
}

void hotpath() {
  // Simulate some hot path execution
}

static void ReducedBranching(benchmark::State& state) {
  errorCounterFlags = 0;  // reset the counter before benchmark run
  for (auto _ : state) {
    ErrorFlags errorFlags = checkErrors();
    if (errorFlags)
      HandleError(errorFlags);
    else
      hotpath();
  }
}

// Register the functions as a benchmark
BENCHMARK(Branching);
BENCHMARK(ReducedBranching);

BENCHMARK_MAIN();

#include <benchmark/benchmark.h>

// A typical error checking setup

int errorCounterA = 0;

bool checkForErrorA() {

// Produce an error once every 10 calls

errorCounterA++;

return (errorCounterA % 10) == 0;

}

bool checkForErrorB() {

// Simulate some error check

return false;

}

bool checkForErrorC() {

// Simulate some error check

return false;

}

void handleErrorA() {

// Simulate some error handling

}

void handleErrorB() {

// Simulate some error handling

}

void handleErrorC() {

// Simulate some error handling

}

void executeHotpath() {

// Simulate some hot path execution

}

static void Branching(benchmark::State& state) {

errorCounterA = 0; // reset the counter before benchmark run

for (auto _ : state) {

if (checkForErrorA())

handleErrorA();

else if (checkForErrorB())

handleErrorB();

else if (checkForErrorC())

handleErrorC();

else

executeHotpath();

}

// A new setup using flags

enum ErrorFlags {

ErrorA = 1 << 0,

ErrorB = 1 << 1,

ErrorC = 1 << 2,

NoError = 0

};

int errorCounterFlags = 0;

ErrorFlags checkErrors() {

// Produce ErrorA once every 10 calls

errorCounterFlags++;

return (errorCounterFlags % 10) == 0 ? ErrorA : NoError;

}

void HandleError(ErrorFlags errorFlags) {

// Simulate some error handling based on flags

if (errorFlags & ErrorA) {

handleErrorA();

}

// handle other errors similarly...

}

void hotpath() {

// Simulate some hot path execution

}

static void ReducedBranching(benchmark::State& state) {

errorCounterFlags = 0; // reset the counter before benchmark run

for (auto _ : state) {

ErrorFlags errorFlags = checkErrors();

if (errorFlags)

HandleError(errorFlags);

else

hotpath();

}

// Register the functions as a benchmark

BENCHMARK(Branching);

BENCHMARK(ReducedBranching);

BENCHMARK_MAIN();

마지막으로는 실제로 어떻게 Branchless를 적용할 때 어떤 숫자를 보여주는지를 예제로 소개하는 곳입니다. 앞서 자료와 비슷하지만 좀더 실용적인 예를 보여줍니다.

Branchless-programming

예제중 하나입니다.

#include <stdio.h>
void branched_ifElse(int value) {

	if(value % 2 == 0)
        display("The number is even");
    else
        display("The number is odd");
    return;
}
void branchless_ifElse(int value) {
    (value%2==0)?display("The number is even"):display("The number is odd");
    return;
}
void display(char *message){
    printf("%s\n",message);
    return;
}
int main()
{
    int i;
    //Change the function between branchless_ifElse and branched_ifElse in order to compare execution times.
    for(i = 0; i<10000; i++)
        branchless_ifElse(i);
}

#include <stdio.h>

void branched_ifElse(int value) {

if(value % 2 == 0)

display("The number is even");

else

display("The number is odd");

return;

}

void branchless_ifElse(int value) {

(value%2==0)?display("The number is even"):display("The number is odd");

return;

}

void display(char *message){

printf("%s\n",message);

return;

}

int main()

{

int i;

//Change the function between branchless_ifElse and branched_ifElse in order to compare execution times.

for(i = 0; i<10000; i++)

branchless_ifElse(i);

}

4.
아래는 Branch Prediction과 관련한 논문및 보고서입니다.

Branch Prediction Is Not a Solved Problem: Measurements, Opportunities, and Future Directions

Download (PDF, 2.98MB)

2023년도 데이터산업 백서의 ‘5부 데이터산업 기술 동향’중 Branch Prediction과 관련한 부분입니다.

Download (PDF, 545KB)

이 글 공유하기:

Leave a Comment 응답 취소