Machine Learning, High-Frequency Trading, and the Failure of Machine Learning Funds

1.
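It was late on a Friday, around quitting time, with the teams I currently work with. We are all about the same age, so we chatted about this and that and ended up on the subject of fintech. The answers below are all my excuses.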

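“Why don't you do a robo-advisor?”
“Robo-advisors in the US started out with a clearly defined target customer, but in Korea you cannot differentiate, and even if you do, you cannot escape competing on fees…”
“What about blockchain?”
“If I could build a platform business I would give it a try, but what exists is essentially a systems-integration (SI) market…”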

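Fintech looked set to turn finance upside down, then quickly faded, and the “Fourth Industrial Revolution” is now filling its place. Once a fad passes, though, all that remains are the casualties who failed in the market. Machine learning and artificial intelligence, however, look likely to stay alive for a long time, because they guarantee profits to groups that hold technology, capital, and data.

Combining machine learning with trading is no longer new. Yet it does not wield the overwhelming power in the market that HFT did. I have no data on this, but it does not come up in conversation very often. For now it seems to be an area of interest rather than of practice.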

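Here are two of the machine learning papers I have looked at recently that deal with interesting topics. The first, which says “no” when everyone else says “yes”, is The 7 Reasons Most Machine Learning Funds Fail. It is a presentation that condenses Advances in Financial Machine Learning, a book due out in 2018. The author, Marcos Lopez de Prado, has been well known since the HFT days, and I have often come across his papers.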

The opening of the paper lists the topics it covers. I said above that it says “no”, but more precisely it is a paper that says “reduce your errors”.

Download (PDF, 1.49MB)

The author's list of papers is also worth consulting; from it, I am attaching Stock Portfolio Design and Backtest Overfitting for reference.

Download (PDF, 535KB)

Next is High Frequency Market Making with Machine Learning, which connects machine learning to high-frequency trading, a subject I still follow with interest.

This paper introduces a trade execution model to evaluate the economic impact of classifiers through backtesting. Extending the concept of a confusion matrix, we present a ‘trade information matrix’ to attribute the expected profit and loss of tick level predictive classifiers under execution constraints, such as fill probabilities and position dependent trade rules, to correct and incorrect predictions. We apply the execution model and trade information matrix to Level II E-mini S&P 500 futures history and demonstrate an estimation approach for measuring the sensitivity of the P&L to classification error. Our approach directly evaluates the performance sensitivity of a market making strategy to classifier error and augments traditional market simulation based testing.

Download (PDF, 907KB)
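To make the idea in the abstract a little more concrete, here is a minimal R sketch of how P&L might be attributed to the cells of a confusion matrix. It is not the paper's model: the function name trade_info_matrix, the single constant fill_prob, and the toy inputs are all assumptions for illustration, and position-dependent trade rules are left out entirely.

```r
# Hypothetical helper: attribute expected P&L to confusion-matrix cells.
# predicted, actual: direction labels in {-1, 1}
# move: absolute price move associated with each prediction
# fill_prob: assumed constant probability that the order gets filled
trade_info_matrix <- function(predicted, actual, move, fill_prob = 1) {
  stopifnot(length(predicted) == length(actual),
            length(actual) == length(move))
  lv <- c(-1, 1)
  pred_f <- factor(predicted, levels = lv)
  act_f  <- factor(actual, levels = lv)

  # ordinary confusion matrix: how often each (predicted, actual) pair occurs
  counts <- table(predicted = pred_f, actual = act_f)

  # a filled trade earns +move when the direction is right and -move when it is wrong
  pnl <- fill_prob * predicted * actual * move

  # same layout as the confusion matrix, but each cell holds summed expected P&L
  pnl_sum <- tapply(pnl, list(predicted = pred_f, actual = act_f), sum)
  pnl_sum[is.na(pnl_sum)] <- 0

  list(counts = counts, pnl = pnl_sum)
}

# made-up example data
predicted <- c( 1,  1, -1, -1,  1, -1)
actual    <- c( 1, -1, -1,  1,  1, -1)
move      <- c( 2,  1,  3,  1,  2,  1)

tim <- trade_info_matrix(predicted, actual, move, fill_prob = 0.5)
print(tim$counts)
print(tim$pnl)
```

The diagonal cells collect the profit from correct predictions and the off-diagonal cells the loss from incorrect ones, so the sensitivity of total P&L to classification error can be read off cell by cell.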

The papers that provide the background for it are Classification-Based Financial Markets Prediction Using Deep Neural Networks and Sequence Classification of the Limit Order Book using Recurrent Neural Networks.

Download (PDF, 605KB)

Download (PDF, 610KB)

2.
Golden Compass ran a test on Hang Seng Index Futures that compares the returns produced by different machine learning models. The methods used were Neural Networks, Random Forest, Naïve Bayes, K-nearest neighbors, and SVM. Which model produced the best result? See the results in Comparing Supervised Learning Methods for Hang Seng Index Futures Long/Short Strategy. Golden Compass also presents the results of a test that trades the Nikkei index using an SVM.

SVM Trend Strategy on Nikkei 225 Mini Futures

#################################################
# Nikkei SVM Trading Strategy
#################################################
# clear environment
rm(list = ls())

# require libraries
library(xts)
library(ggplot2)
library(grid)
library(gridExtra)
library(dplyr)
library(lazyeval)
library(zoo)
library(tidyquant)
library(rdrop2)
library(httpuv)
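library(caret)
library(kernlab)
# ROC/SMA/volatility come from TTR; table.Drawdowns/SharpeRatio from PerformanceAnalytics
# (tidyquant normally attaches both; they are listed here to make the dependency explicit)
library(TTR)
library(PerformanceAnalytics)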

# load data from Bloomberg consisting of daily prices, 30D RSI and Volume saved in RData file
load("daily_data.RData")

# feature engineering is conducted here
# note: dplyr masks lag(); stats::lag() is called explicitly so that the xts lag method is used
input <- xts()
input$cur_ret <- ROC(prices$PX_LAST)
input$ret1 <- stats::lag(ROC(prices$PX_LAST),1)
input$ret2 <- stats::lag(ROC(prices$PX_LAST),2)
input$ret3 <- stats::lag(ROC(prices$PX_LAST),3)
input$ret4 <- stats::lag(ROC(prices$PX_LAST),4)
input$ret5 <- stats::lag(ROC(prices$PX_LAST),5)

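# note: rsi, volume, the volatilities and the SMAs below are same-day values
# (the TTR ones include the day-t close), while cur_ret is the day-t return;
# if the signal has to be known before that close, lag these columns by one day
input$rsi <- prices$RSI_30D
input$volume <- prices$PX_VOLUME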

input$vol3d <- TTR::volatility(prices$PX_LAST, n=3, N=252) 
input$vol5d <- TTR::volatility(prices$PX_LAST, n=5, N=252) 
input$vol10d <- TTR::volatility(prices$PX_LAST, n=10, N=252) 
input$vol15d <- TTR::volatility(prices$PX_LAST, n=15, N=252) 
input$vol20d <- TTR::volatility(prices$PX_LAST, n=20, N=252) 

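# lagged prices normalized by 20-day volatility (stats::lag so the xts method is used despite dplyr)
input$vnp1 <- stats::lag(prices$PX_LAST,1) / input$vol20d
input$vnp2 <- stats::lag(prices$PX_LAST,2) / input$vol20d
input$vnp3 <- stats::lag(prices$PX_LAST,3) / input$vol20d
input$vnp4 <- stats::lag(prices$PX_LAST,4) / input$vol20d
input$vnp5 <- stats::lag(prices$PX_LAST,5) / input$vol20d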

input$sma10 <- SMA(prices$PX_LAST, n=10)
input$sma50 <- SMA(prices$PX_LAST, n=50) 
input$sma100 <- SMA(prices$PX_LAST, n=100) 

# clean data and slice timeframe needed
input <- na.omit(input)
input <- input["2013-09-15/"] # select four years of data (2013-09-15 onward)

input_original <- input
regime <- input$cur_ret

# select training and testing data
input$cur_ret <- ifelse(input$cur_ret>0,1,-1)
training <- input[1:as.integer(nrow(input)*0.75),]
testing <- input[(nrow(training)+1):nrow(input),]

training_df <- as.data.frame(training)
testing_df <- as.data.frame(testing)

training_predictor <- training_df[, -1]
testing_predictor <- testing_df[, -1]
training_target <- as.factor(training_df$cur_ret)
testing_target <- as.factor(testing_df$cur_ret)

# plot pca variance (20 predictors, so the cumulative-variance row has 20 entries)
pca <- prcomp(training_predictor, scale. = TRUE)
prop_var <- data.frame(num=0:20,var=c(0,summary(pca)$importance[3,]))
p <- ggplot(data = prop_var) +
  geom_line(aes(x=num, y=var), color = "blue", size = 1) +
  geom_vline(xintercept = 6, linetype = "dashed", size = 1) +
  scale_x_continuous(breaks = seq(0,20,2), expand = c(0,0)) +
  scale_y_continuous(breaks = c(1:10)/10, limits = c(0,1), labels = scales::percent, expand = c(0,0)) +
  scale_color_manual(values = c("Change"="black")) +
  theme_tq() +
  labs(x="Number of PCA Axes", y = "Cumulative Variance (%)", title = "Proportion of Variance Explained By Varying Number of PCA Factors") +
  theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")

# insert footnote
grid.newpage()
footnote <- paste("Computations: Golden Compass Quant", sep = "")
g <- arrangeGrob(p, bottom = textGrob(footnote, x = 0, hjust = -0.1, vjust = 0.2, gp = gpar(fontface = "italic", fontsize = 9)))
grid.draw(g)
ggsave(paste("proportion of variance.png", sep = ""),plot = g, width = 8, height = 6, dpi = 300)

# fit SVM on the first six principal components
preProc <- preProcess(training_predictor, method="pca", pcaComp = 6)
trainPC <- predict(preProc, training_predictor)
modelFit <- train(x=trainPC, y=training_target, method = "svmPoly")
testPC <- predict(preProc, testing_predictor)
prediction <- predict(modelFit, testPC)
# confusionMatrix() takes the predictions first (data) and the true labels second (reference)
confusionMatrix(data = prediction, reference = testing_target)

# visualizing svm using the same method
test <- trainPC
test$ret <- training_target
m <- ksvm(ret~., data = test, kernel = "polydot", kpar = list(degree =  3, scale =  0.1, offset =  1 ))
plot(m, data=test, grid = 50, slice=c("PC3"=0,"PC4"=0,"PC5"=0,"PC6"=0)) # visualize PC1 and PC2

# backtest
bt <- regime[(nrow(training)+1):nrow(input),]
bt$pred <- as.numeric(levels(prediction))[prediction]
colnames(bt) <- c("actual", "prediction")
bt$cumret <- cumprod((bt$actual * bt$prediction) + 1)
bt$ret <- bt$actual * bt$prediction

# compute drawdowns
bt$dd <- cummax(bt$cumret)
bt$dd <- (bt$cumret - bt$dd) / bt$dd

# add price series (matched on the backtest dates instead of a hard-coded start date)
bt$price <- prices$PX_LAST[index(bt)]

# add entry and exit markers
bt$delta <- diff(bt$prediction)
bt$delta[1] <- 0
bt$ent <- ifelse(bt$delta==2,bt$price,NA)
bt$ext <- ifelse(bt$delta==-2,bt$price,NA)

# plot equity curve and drawdowns
p1 <- ggplot(data = bt) +
  geom_line(aes(x= index(bt), y=price), color = "black", size = 1) +
  geom_point(aes(x= index(bt), y=ent), colour = "green", shape = 2) +
  geom_point(aes(x= index(bt), y=ext), colour = "red", shape = 6) +
  scale_x_datetime(expand = c(0,0)) +
  scale_y_continuous(limits = c(min(bt$price)*.9, max(bt$price)*1.1), expand = c(0,0)) +
  theme_tq() +
  labs(x="", y="", title = paste("Price Series and Entry/Exit Points", sep = "")) +
  theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "right")  

p2 <- ggplot(data = bt) +
  geom_line(aes(x= index(bt), y=prediction), color = "blue") +
  scale_x_datetime(expand = c(0,0)) +
  scale_y_continuous(limits = c(-1.2,1.2), breaks = c(-1,0,1), expand = c(0,0)) +
  theme_tq() +
  labs(x="", y="", title = paste("Positions", sep = "")) +
  theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "right")  

p3 <- ggplot(data = bt) +
  geom_line(aes(x= index(bt), y=cumret), color = "darkgreen", size = 1) +
  scale_x_datetime(expand = c(0,0)) +
  scale_y_continuous(limits = c(0.9,1.4), expand = c(0,0)) +
  geom_hline(yintercept = 1, linetype = "dashed", size=1) +
  theme_tq() +
  labs(x="", y="", title = paste("Cumulative Equity Holdings", sep = "")) +
  theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "right")  

p4 <- ggplot(data = bt) +
  geom_ribbon(aes(ymin=dd, ymax=0, x=index(bt)), fill="red", alpha = 0.6) +
  geom_line(aes(x=index(bt), y=dd), color = "red", size = 1) +
  scale_x_datetime(expand = c(0,0)) +
  scale_y_continuous(limits = c(-0.1,0), labels = scales::percent, expand = c(0,0)) +
  theme_tq() +
  labs(x="", y="", title = paste("Drawdowns (%)", sep = "")) +
  theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "right")

g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
g3 <- ggplotGrob(p3)
g4 <- ggplotGrob(p4)

grid.newpage()
footnote <- "Computations: Golden Compass Quant"
g <- grid.arrange(arrangeGrob(p1,p2,p3,p4, heights = c(2,1,2,1)), top = textGrob("SVM Strategy - JPX Nikkei 225 Mini Futures", gp = gpar(fontface = "bold", fontsize = 15)), 
                  bottom = textGrob(footnote, x = 0, hjust = -0.1, vjust = 0.2, gp = gpar(fontface = "italic", fontsize = 10)))
grid.draw(g)
ggsave(paste("bt_output_svm.png", sep = ""),plot = g, width = 8, height = 12, dpi = 300)

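# one-year return (converted to plain numbers first, because subtracting two xts
# values with different timestamps returns an empty series)
as.numeric(xts::last(bt$cumret)) - as.numeric(xts::first(bt$cumret))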

# total number of trades
nrow(bt$delta[bt$delta!=0])

# expectancy (mean daily strategy return)
mean(bt$ret)

# extreme returns
max(bt$ret)
min(bt$ret)

# average holding time
mean(diff(index(bt$delta[bt$delta!=0])))

# win rate
success <- bt[,c("prediction","delta","cumret")]
success <- success[success$delta!=0,]
success$trade <- stats::lag(success$prediction) # position held before this flip (stats::lag for the xts method)
success$ret <- ROC(success$cumret)
success <- na.omit(success)
success <- success[success$prediction!=0,]
success <- success[success$ret!=0,]
length(success$ret[success$ret>0])/nrow(success)

# expectancy (mean return per trade)
mean(success$ret)

table.Drawdowns(bt$ret)
# charts.PerformanceSummary(bt$ret)
SharpeRatio(bt$ret)

Compare this with the earlier results.
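The backtest arithmetic in the script above (strategy return, equity curve, drawdown, position flips) can be checked by hand on a few numbers. The sketch below is a minimal base-R version with made-up returns and predictions; the function name toy_backtest is hypothetical, and, like the script, it ignores transaction costs and slippage.

```r
# Hypothetical helper mirroring the backtest section of the script on plain vectors.
# actual: realized period returns; prediction: +1 (long) or -1 (short) for each period
toy_backtest <- function(actual, prediction) {
  stopifnot(length(actual) == length(prediction))
  ret    <- actual * prediction      # strategy return: earn the move when the sign is right
  cumret <- cumprod(1 + ret)         # equity curve starting from 1
  peak   <- cummax(cumret)           # running high-water mark
  dd     <- (cumret - peak) / peak   # drawdown relative to the high-water mark
  flips  <- sum(diff(prediction) != 0)  # number of long/short switches
  list(ret = ret, cumret = cumret, dd = dd, flips = flips)
}

# made-up numbers, chosen so the result is easy to verify by hand
actual     <- c(0.10, -0.10, 0.20, -0.50)
prediction <- c(1, -1, -1, 1)

res <- toy_backtest(actual, prediction)
print(res$cumret)
print(res$dd)
print(res$flips)
```

With two correct calls followed by two wrong ones, the equity curve rises first and then falls below 1, and the drawdown is measured from the peak reached after the second period rather than from the starting value.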

1 Comment

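harry, December 7, 2017 at 12:50 AM

Thank you for the good post.
In the “wrong labeling” part of the first paper there is a section on labeling with the triple barrier, and I cannot quite follow it even after looking at the figure. Could I ask you to explain it? I understand the upper and lower regions, but what does the vertical region refer to, and why are the regions labeled as samples in the image 1, 1, 1? I am really curious. Thank you.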
