What if you implemented HFT in Java?

1.
Through a blog I visit often, I discovered an impressive project: OpenHFT, a Java-based project for high-frequency trading. The best-known Java project for low latency is Disruptor, which I have introduced several times before.

Disruptor and FEP

Strictly speaking, Disruptor is not built for HFT; it is a framework, or a design pattern, worth considering when putting together a system for HFT. OpenHFT, introduced here, is a project aimed squarely at high-frequency trading. It consists of the following sub-projects.

Java-Chronicle – Persisted low latency messaging
Java-Thread-Affinity – OpenHFT Java Thread Affinity library
Java-Lang – Java Language support
Java-Runtime-Compiler – Java Runtime Compiler
TransFIX – Faster FIX engine.
HugeCollections – Huge Collections for Java using efficient, persisted off heap storage

Peter Lawrey, who leads the project, advocates low-level Java rather than "natural" Java to meet the demands of high-frequency trading.

Latency targets

You can consider 1 ms latency as not that fast these days unless you are below 1 ms 99.9% of the time. The fastest systems in Java are typically around 100 microseconds, even below 20 microseconds external to the box.

How to handle high throughput

For trading systems, I suggest you make the latency as low as possible, and the throughput is then often sufficient. For example, Chronicle can persist a full OPRA feed at maximum burst as a sustained rate with one thread. This is possible because the latency of Chronicle is very low: a tiny 4-byte message has an average latency of 7 nanoseconds, and it will be persisted even if the JVM crashes on the next line. At this point throughput is not such an issue.
For back-testing, low latency is not enough, because what you want is to replay months' worth of data in a fraction of a second (ideally). In this case you need a high degree of parallelism and the ability to replay lots of data in a short period of time. For this I have used memory-mapped files which hold pre-canned queries of the data I need. If these files are in a binary format and fit in main memory, they can be accessed repeatedly across threads very fast.
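As a sketch of this memory-mapped replay idea in plain NIO (the class name, fixed-width record layout, and synthetic data are all hypothetical, not taken from OpenHFT):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapReplay {
    // Write `records` fixed-width entries to a memory-mapped file, then
    // replay them sequentially from the mapping. Hypothetical layout:
    // 8-byte timestamp + 8-byte price per record.
    static double replay(int records) {
        try {
            Path file = Files.createTempFile("ticks", ".bin");
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer buf =
                        ch.map(FileChannel.MapMode.READ_WRITE, 0, records * 16L);
                for (int i = 0; i < records; i++) {
                    buf.putLong(i);                // synthetic timestamp
                    buf.putDouble(100.0 + i % 50); // synthetic price
                }
                // Replay: binary data in the page cache, read sequentially
                // with no parsing and no per-record allocation.
                buf.rewind();
                double sum = 0;
                for (int i = 0; i < records; i++) {
                    buf.getLong();          // skip timestamp
                    sum += buf.getDouble(); // consume price
                }
                return sum;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("sum of replayed prices: " + replay(1_000_000));
    }
}
```

Because the file is binary and page-cache resident, several threads could each map and scan their own region for the parallel replay described above.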

Handling GCs.

Garbage collection pauses slow you down, so the fastest approach when building an engine is to never touch the brakes. Instead you give yourself a budget: the largest Eden size you can reasonably have. This might be 24 GB of Eden. If it takes all day to fill, you won't collect, not even a minor collection, in a day. Assuming you can stop for a few seconds overnight or at the weekend, you run a Full GC at a predetermined time, and GCs are no longer an issue.
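A hedged sketch of what such a launch configuration might look like (all sizes, flags, and the jar name are illustrative assumptions, not a recommendation; tune them for your own hardware and collector):

```shell
# Illustrative only: a large young generation that takes all day to fill,
# so no collections happen during trading hours.
java -Xms32g -Xmx32g \
     -Xmn24g \
     -XX:+UseParallelGC \
     -Xlog:gc \
     -jar trading-engine.jar   # hypothetical application jar
# Overnight, trigger the planned Full GC yourself, e.g. from a scheduled
# task inside the JVM that calls System.gc() at a quiet time.
```

Note that -Xmn sizes the whole young generation (Eden plus survivor spaces), so the usable Eden budget is slightly smaller than the flag value.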
Even then I suggest keeping garbage to a minimum and localising memory access patterns to maximise L1/L2 efficiency, which can result in a 2-5x performance improvement. The L3 cache is at least 10x slower than the L1, and if you are not filling your L1 with garbage (as many Java programs literally do) your program runs much faster and more consistently.
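A minimal sketch of the zero-garbage, cache-local style this implies (the class, the fixed-point price representation, and the ring-buffer design are my own illustration, not OpenHFT code):

```java
public class ZeroGarbageBuffer {
    // One contiguous primitive array: cache-friendly, no per-event objects.
    private final long[] prices;
    private long count;

    public ZeroGarbageBuffer(int capacityPow2) {
        prices = new long[capacityPow2]; // capacity must be a power of two
    }

    // Hot path: no allocation and no boxing, so Eden fills slowly and the
    // working set stays resident in L1/L2 instead of being evicted by garbage.
    public void onPrice(long priceFixedPoint) {
        prices[(int) (count++ & (prices.length - 1))] = priceFixedPoint;
    }

    // Sequential scan over a contiguous array: the access pattern the
    // hardware prefetcher handles best.
    public long sum() {
        long s = 0;
        for (long p : prices) s += p;
        return s;
    }

    public static void main(String[] args) {
        ZeroGarbageBuffer buf = new ZeroGarbageBuffer(1024);
        for (long i = 1; i <= 1000; i++) buf.onPrice(i);
        System.out.println("sum = " + buf.sum());
    }
}
```

Storing prices as fixed-point longs rather than boxed objects is one common way to keep the hot path allocation-free.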

Use shared memory for IPC

This has two advantages.
1) You can keep a record of every event, with timings, for the whole day with trivial overhead. Because you can keep much more data, you can reproduce the exact state of the system externally or downstream, or reproduce a bug. This gives you a massive data-driven test base, and you can feed production data through your test system in real time to see how it behaves before running live.
2) It can be much faster than the alternatives. A tiny message can have an average latency of 7 nanoseconds to write and can be visible to another process in under 100 nanoseconds.
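The write-then-publish pattern behind shared-memory IPC can be sketched with plain NIO. This is a single-process stand-in (two mappings of the same file play the roles of writer and reader processes; the class and the sequence/payload layout are hypothetical); a real cross-process version needs properly ordered writes, which libraries like Chronicle take care of:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SharedMemIpc {
    static long demo() {
        try {
            Path file = Files.createTempFile("ipc", ".dat");
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Two independent mappings of the same region: in real use
                // these would live in two different processes.
                MappedByteBuffer writer = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                MappedByteBuffer reader = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);

                writer.putLong(8, 42L); // write the payload first...
                writer.putLong(0, 1L);  // ...then publish a sequence number

                // The reader polls the sequence word; both mappings are backed
                // by the same physical pages, so no copy and no system call.
                while (reader.getLong(0) != 1L) {
                    Thread.onSpinWait();
                }
                return reader.getLong(8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("received payload: " + demo());
    }
}
```

Because every published message lives in the file, the full day's event record described in point 1) falls out of the same mechanism for free.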

Pinning cores

Many people have tried pinning cores and not found much difference. This can be because a) the jitter of the application is so high that it hardly matters, or b) they didn't pin to an isolated core. I have found that if you bind to an isolated core, the worst-case latencies drop from 2 ms to 10 microseconds, but pinning to a non-isolated core doesn't appear to help at all.
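For reference, isolating and pinning happen outside Java on Linux; a minimal sketch (the core numbers and jar name are hypothetical):

```shell
# Kernel boot parameter (e.g. via the GRUB config): remove cores 2 and 3
# from the general scheduler so only explicitly pinned tasks run there.
#   isolcpus=2,3
# After rebooting, start the JVM pinned to one of the isolated cores:
taskset -c 2 java -jar trading-engine.jar   # hypothetical jar
```

OpenHFT's Java-Thread-Affinity library goes further and lets individual Java threads, rather than the whole JVM, be bound to specific cores.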

Busy waiting

You don't want your critical thread to context-switch, and one way to avoid this is to never give up the core or CPU. This can be done by busy waiting on a pinned and isolated core. Things I have done with isolated cores:
a) disable hyper-threading selectively for the most critical threads, but leave it on for the rest;
b) over-clock the isolated cores further by not over-clocking the "junk" cores. This works by limiting the heat produced by cores that don't need maximum speed anyway. E.g. on a hex-core machine, you leave cores 0, 2, 4 not over-clocked and over-clock cores 1, 3, 5 (without using HT) another 10% beyond what you could do to the whole socket. I assume the over-clocked cores are not next to each other on the die, so they are less likely to overheat.
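The busy-waiting idea itself can be shown in pure Java (the class and the request/acknowledge protocol are my own illustration; Thread.onSpinWait requires Java 9+):

```java
import java.util.concurrent.atomic.AtomicLong;

public class BusyWaitPingPong {
    // The consumer busy-waits for new sequence numbers instead of blocking,
    // so it never yields its (ideally pinned and isolated) core and never
    // pays a wake-up/context-switch latency.
    static long run(int events) {
        AtomicLong req = new AtomicLong();        // last published sequence
        AtomicLong ack = new AtomicLong();        // last consumed sequence
        AtomicLong handledSum = new AtomicLong(); // stand-in for real work
        Thread consumer = new Thread(() -> {
            long r;
            do {
                while ((r = req.get()) == ack.get()) {
                    Thread.onSpinWait(); // CPU spin hint (PAUSE on x86)
                }
                handledSum.addAndGet(r); // "handle" the event
                ack.set(r);
            } while (r < events);
        });
        consumer.start();
        for (long i = 1; i <= events; i++) {
            req.set(i);
            while (ack.get() < i) Thread.onSpinWait(); // producer spins too
        }
        try {
            consumer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return handledSum.get();
    }

    public static void main(String[] args) {
        System.out.println("sum of handled events: " + run(3));
    }
}
```

The trade-off is that a spinning thread burns a whole core even when idle, which is exactly why it should live on an isolated core that nothing else needs.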

Larger address space collections

A key feature of Chronicle and OpenHFT's Direct Store is that they allow access to 64-bit sizes (technically only 48-bit, due to OS/CPU limitations). This means you can manage TB-sized data sets as one collection, with many billions of records/excerpts, natively in memory in Java. This avoids even a JNI barrier or a system call slowing down your application.
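To see why a library is needed here: a plain NIO MappedByteBuffer is capped at 2^31-1 bytes, so 64-bit addressing has to be built by chunking several mappings behind a long-indexed API. A hedged sketch (class name and chunking scheme are mine, not OpenHFT's; values must not straddle a chunk boundary in this simplified version):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LongAddressedStore implements AutoCloseable {
    private static final long CHUNK = 1L << 30; // 1 GiB per mapping
    private final FileChannel ch;
    private final MappedByteBuffer[] chunks;

    public LongAddressedStore(Path file, long size) throws IOException {
        ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        int n = (int) ((size + CHUNK - 1) / CHUNK);
        chunks = new MappedByteBuffer[n];
        for (int i = 0; i < n; i++) {
            long off = i * CHUNK;
            chunks[i] = ch.map(FileChannel.MapMode.READ_WRITE, off,
                    Math.min(CHUNK, size - off));
        }
    }

    // A 64-bit index is split into (chunk, offset) with no JNI or syscall.
    public void putLong(long index, long value) {
        chunks[(int) (index / CHUNK)].putLong((int) (index % CHUNK), value);
    }

    public long getLong(long index) {
        return chunks[(int) (index / CHUNK)].getLong((int) (index % CHUNK));
    }

    @Override
    public void close() throws IOException {
        ch.close();
    }

    // Self-check: one value in the first chunk, one in the second.
    static long demo() {
        try {
            Path f = Files.createTempFile("store", ".bin");
            try (LongAddressedStore s = new LongAddressedStore(f, CHUNK + 4096)) {
                s.putLong(16, 7L);
                s.putLong(CHUNK + 8, 9L); // lands in the second mapping
                return s.getLong(16) + s.getLong(CHUNK + 8);
            } finally {
                Files.deleteIfExists(f);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("demo sum: " + demo());
    }
}
```

The mappings are sparse virtual memory, so a multi-GB store costs only the pages actually touched.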

From "C++ like Java for low latency"

2.
Peter Lawrey presents OpenHFT at events such as JavaOne and JAX. Judging from his blog as well, he is a fairly popular programmer.

Writing and Testing High Frequency Trading Engines talk at JavaOne

This is the presentation for his "Update on Writing and Monitoring HFT systems" talk.

Peter Lawrey's case for low-level Java is as follows: develop the parts that affect performance the most in a low-level style, and build the rest in a more productive, higher-level style.

If your application spends 90% of its time in 10% of its code (or perhaps 80/20), you can write the other 90% of the code in "natural" Java, or use third-party libraries which do. However, if you write that 10% in low-level Java, there are significant performance gains to be made.
