Exture+가 그리는 미래, 메시징(2)

1.
어제 메시징을 다루었습니다. 메시징은 Low Latency Framework를 구성할 때 주요한 부분입니다. 다만 메시징과 관련한 기술을 상품으로 구매할 경우 비용이 높습니다. 때문에 자체기술로 업무에 최적화한 메시징제품을 개발하는 것이 더 유리할 수 있습니다. 그래서 KRX도 ZeroMQ를 선택한 듯 하고 오픈소스중 많은 프로젝트들이 ZeroMQ를 채택하는 듯 합니다. 아래 자료도 KT가 클라우드서비스를 추진하면서 만든 자료인 듯 합니다.

Disruptor나 ZeroMQ나 Low Latency Framework를 위한 기본을 제공합니다. 그렇지만 업무환경에서 Low Latency만큼이나 이슈는 Concurrency입니다. Concurrency는 Producer-Consumer Queue의 문제이며 이는 아래 네가지 유형을 지원하여야 합니다.

Depending on allowed number of producer and consumer threads:
– Multi-producer/multi-consumer queues (MPMC)
– Single-producer/multi-consumer queues (SPMC)
– Multi-producer/single-consumer queues (MPSC)
– Single-producer/single-consumer queues (SPSC)

Disruptor는 기본적으로 SPMC를 위한 환경에 최적화하였습니다. 반면 ZeroMQ와 관련한 소스를 보시면 Disruptor와 비슷한 기능을 제공하는 부분이 있습니다. Ypipe와 Yqueue입니다. ZeroMQ는 이렇게 말합니다.

All the Disruptor tests are blocking threads, ØMQ Pipe uses non-blocking, ØMQ YPipe uses CAS, and ØMQ YQueue overlaps cache lines.

아래가 ZeroMQ소스중 yqueue.hpp 소스입니다. 위의 말을 확인해보시길 바랍니다. ZeroMQ는 Lock Free Algorithm으로 ‘Compare and Swap’을 사용하고 있습니다.

/*
    Copyright (c) 2009-2011 250bpm s.r.o.
    Copyright (c) 2007-2009 iMatix Corporation
    Copyright (c) 2007-2011 Other contributors as noted in the AUTHORS file

    This file is part of 0MQ.

    0MQ is free software; you can redistribute it and/or modify it under
    the terms of the GNU Lesser General Public License as published by
    the Free Software Foundation; either version 3 of the License, or
    (at your option) any later version.

    0MQ is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU Lesser General Public License for more details.

    You should have received a copy of the GNU Lesser General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.
*/

#ifndef __ZMQ_YQUEUE_HPP_INCLUDED__
#define __ZMQ_YQUEUE_HPP_INCLUDED__

#include <stdlib.h>
#include <stddef.h>

#include "err.hpp"
#include "atomic_ptr.hpp"

namespace zmq
{

    //  yqueue is an efficient queue implementation. The main goal is
    //  to minimise number of allocations/deallocations needed. Thus yqueue
    //  allocates/deallocates elements in batches of N.
    //
    //  yqueue allows one thread to use push/back function and another one 
    //  to use pop/front functions. However, user must ensure that there's no
    //  pop on the empty queue and that both threads don't access the same
    //  element in unsynchronised manner.
    //
    //  T is the type of the object in the queue.
    //  N is granularity of the queue (how many pushes have to be done till
    //  actual memory allocation is required).

    template <typename T, int N> class yqueue_t
    {
    public:

        //  Create the queue.
        inline yqueue_t ()
        {
             begin_chunk = (chunk_t*) malloc (sizeof (chunk_t));
             alloc_assert (begin_chunk);
             begin_pos = 0;
             back_chunk = NULL;
             back_pos = 0;
             end_chunk = begin_chunk;
             end_pos = 0;
        }

        //  Destroy the queue.
        inline ~yqueue_t ()
        {
            while (true) {
                if (begin_chunk == end_chunk) {
                    free (begin_chunk);
                    break;
                } 
                chunk_t *o = begin_chunk;
                begin_chunk = begin_chunk->next;
                free (o);
            }

            chunk_t *sc = spare_chunk.xchg (NULL);
            if (sc)
                free (sc);
        }

        //  Returns reference to the front element of the queue.
        //  If the queue is empty, behaviour is undefined.
        inline T &front ()
        {
             return begin_chunk->values [begin_pos];
        }

        //  Returns reference to the back element of the queue.
        //  If the queue is empty, behaviour is undefined.
        inline T &back ()
        {
            return back_chunk->values [back_pos];
        }

        //  Adds an element to the back end of the queue.
        inline void push ()
        {
            back_chunk = end_chunk;
            back_pos = end_pos;

            if (++end_pos != N)
                return;

            chunk_t *sc = spare_chunk.xchg (NULL);
            if (sc) {
                end_chunk->next = sc;
                sc->prev = end_chunk;
            } else {
                end_chunk->next = (chunk_t*) malloc (sizeof (chunk_t));
                alloc_assert (end_chunk->next);
                end_chunk->next->prev = end_chunk;
            }
            end_chunk = end_chunk->next;
            end_pos = 0;
        }

        //  Removes element from the back end of the queue. In other words
        //  it rollbacks last push to the queue. Take care: Caller is
        //  responsible for destroying the object being unpushed.
        //  The caller must also guarantee that the queue isn't empty when
        //  unpush is called. It cannot be done automatically as the read
        //  side of the queue can be managed by different, completely
        //  unsynchronised thread.
        inline void unpush ()
        {
            //  First, move 'back' one position backwards.
            if (back_pos)
                --back_pos;
            else {
                back_pos = N - 1;
                back_chunk = back_chunk->prev;
            }

            //  Now, move 'end' position backwards. Note that obsolete end chunk
            //  is not used as a spare chunk. The analysis shows that doing so
            //  would require free and atomic operation per chunk deallocated
            //  instead of a simple free.
            if (end_pos)
                --end_pos;
            else {
                end_pos = N - 1;
                end_chunk = end_chunk->prev;
                free (end_chunk->next);
                end_chunk->next = NULL;
            }
        }

        //  Removes an element from the front end of the queue.
        inline void pop ()
        {
            if (++ begin_pos == N) {
                chunk_t *o = begin_chunk;
                begin_chunk = begin_chunk->next;
                begin_chunk->prev = NULL;
                begin_pos = 0;

                //  'o' has been more recently used than spare_chunk,
                //  so for cache reasons we'll get rid of the spare and
                //  use 'o' as the spare.
                chunk_t *cs = spare_chunk.xchg (o);
                if (cs)
                    free (cs);
            }
        }

    private:

        //  Individual memory chunk to hold N elements.
        struct chunk_t
        {
             T values [N];
             chunk_t *prev;
             chunk_t *next;
        };

        //  Back position may point to invalid memory if the queue is empty,
        //  while begin & end positions are always valid. Begin position is
        //  accessed exclusively be queue reader (front/pop), while back and
        //  end positions are accessed exclusively by queue writer (back/push).
        chunk_t *begin_chunk;
        int begin_pos;
        chunk_t *back_chunk;
        int back_pos;
        chunk_t *end_chunk;
        int end_pos;

        //  People are likely to produce and consume at similar rates.  In
        //  this scenario holding onto the most recently freed chunk saves
        //  us from having to call malloc/free.
        atomic_ptr_t<chunk_t> spare_chunk;

        //  Disable copying of yqueue.
        yqueue_t (const yqueue_t&);
        const yqueue_t &operator = (const yqueue_t&);
    };

}

#endif

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

This file is part of 0MQ.

0MQ is free software; you can redistribute it and/or modify it under

the terms of the GNU Lesser General Public License as published by

the Free Software Foundation; either version 3 of the License, or

(at your option) any later version.

0MQ is distributed in the hope that it will be useful,

but WITHOUT ANY WARRANTY; without even the implied warranty of

MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License

along with this program. If not, see <http://www.gnu.org/licenses/>.

#ifndef __ZMQ_YQUEUE_HPP_INCLUDED__

#define __ZMQ_YQUEUE_HPP_INCLUDED__

#include <stdlib.h>

#include <stddef.h>

#include "err.hpp"

#include "atomic_ptr.hpp"

namespace zmq

{

// yqueue is an efficient queue implementation. The main goal is

// to minimise number of allocations/deallocations needed. Thus yqueue

// allocates/deallocates elements in batches of N.

// yqueue allows one thread to use push/back function and another one

// to use pop/front functions. However, user must ensure that there's no

// pop on the empty queue and that both threads don't access the same

// element in unsynchronised manner.

// T is the type of the object in the queue.

// N is granularity of the queue (how many pushes have to be done till

// actual memory allocation is required).

template <typename T, int N> class yqueue_t

{

public:

// Create the queue.

inline yqueue_t ()

{

begin_chunk = (chunk_t*) malloc (sizeof (chunk_t));

alloc_assert (begin_chunk);

begin_pos = 0;

back_chunk = NULL;

back_pos = 0;

end_chunk = begin_chunk;

end_pos = 0;

}

// Destroy the queue.

inline ~yqueue_t ()

{

while (true) {

if (begin_chunk == end_chunk) {

free (begin_chunk);

break;

}

chunk_t *o = begin_chunk;

begin_chunk = begin_chunk->next;

free (o);

}

chunk_t *sc = spare_chunk.xchg (NULL);

if (sc)

free (sc);

}

// Returns reference to the front element of the queue.

// If the queue is empty, behaviour is undefined.

inline T &front ()

{

return begin_chunk->values [begin_pos];

}

// Returns reference to the back element of the queue.

// If the queue is empty, behaviour is undefined.

inline T &back ()

{

return back_chunk->values [back_pos];

}

// Adds an element to the back end of the queue.

inline void push ()

{

back_chunk = end_chunk;

back_pos = end_pos;

if (++end_pos != N)

return;

chunk_t *sc = spare_chunk.xchg (NULL);

if (sc) {

end_chunk->next = sc;

sc->prev = end_chunk;

} else {

end_chunk->next = (chunk_t*) malloc (sizeof (chunk_t));

alloc_assert (end_chunk->next);

end_chunk->next->prev = end_chunk;

}

end_chunk = end_chunk->next;

end_pos = 0;

}

// Removes element from the back end of the queue. In other words

// it rollbacks last push to the queue. Take care: Caller is

// responsible for destroying the object being unpushed.

// The caller must also guarantee that the queue isn't empty when

// unpush is called. It cannot be done automatically as the read

// side of the queue can be managed by different, completely

// unsynchronised thread.

inline void unpush ()

{

// First, move 'back' one position backwards.

if (back_pos)

--back_pos;

else {

back_pos = N - 1;

back_chunk = back_chunk->prev;

}

// Now, move 'end' position backwards. Note that obsolete end chunk

// is not used as a spare chunk. The analysis shows that doing so

// would require free and atomic operation per chunk deallocated

// instead of a simple free.

if (end_pos)

--end_pos;

else {

end_pos = N - 1;

end_chunk = end_chunk->prev;

free (end_chunk->next);

end_chunk->next = NULL;

}

// Removes an element from the front end of the queue.

inline void pop ()

{

if (++ begin_pos == N) {

chunk_t *o = begin_chunk;

begin_chunk = begin_chunk->next;

begin_chunk->prev = NULL;

begin_pos = 0;

// 'o' has been more recently used than spare_chunk,

// so for cache reasons we'll get rid of the spare and

// use 'o' as the spare.

chunk_t *cs = spare_chunk.xchg (o);

if (cs)

free (cs);

}

private:

// Individual memory chunk to hold N elements.

struct chunk_t

{

T values [N];

chunk_t *prev;

chunk_t *next;

};

// Back position may point to invalid memory if the queue is empty,

// while begin & end positions are always valid. Begin position is

// accessed exclusively be queue reader (front/pop), while back and

// end positions are accessed exclusively by queue writer (back/push).

chunk_t *begin_chunk;

int begin_pos;

chunk_t *back_chunk;

int back_pos;

chunk_t *end_chunk;

int end_pos;

// People are likely to produce and consume at similar rates. In

// this scenario holding onto the most recently freed chunk saves

// us from having to call malloc/free.

atomic_ptr_t<chunk_t> spare_chunk;

// Disable copying of yqueue.

yqueue_t (const yqueue_t&);

const yqueue_t &operator = (const yqueue_t&);

};

}

#endif

2.
주제가 메시징에서 Concurrency로 넘어왔네요. Low Latency와 High Throughput -Concurrency를 자본시장IT의 아주 중요한 화두입니다. 두가지를 개념을 담은 프레임워크를 어떻게 구성하냐가 트레이딩시스템의 성능을 좌우하기때문입니다. 이런 분위기를 반영한 기사가 있었습니다. 아주 유명한 자본시장IT회사인 Sungard의 Adav가 인터뷰한 내용입니다.

Q&A: SunGard’s Aditya Yadav on Concurrent Design, and How Stealers Can Help

Sungard의 Stealers가 Disruptor보다 4배 이상 빠르다고 합니다.

Stealers’ throughput surpasses that of Disruptor. Disruptor delivers a throughput of about 40 to 80 million events per second at latencies of 1200 nanoseconds, while Stealers does 2.7 billion events per second and latencies of 4200 nanoseconds. Disruptor uses mechanical sympathy, thread affinity while Stealers doesn’t use any hardware specific optimisations and is able to perform better solely based on a superior algorithm. Disruptor also uses a bounded cyclic buffer while Stealers uses unbounded system memory limited by the physical memory installed on the machine. Also, Stealers is a configuration-less system, so to build an application you just need to add the Stealers library and build around it without worrying about affecting anything else.

Stealers는 어떤 Framework일까요? 개발에 관계한 인도기술자들이 LinkedIn에 올린 짧은 소개입니다.

Stealers aims to be the fastest Trading Platform Core & Stream Processing Core in the world. And is a general purpose mechanism for solving concurrency problems. Stealers is not only Fast it is perhaps the Fastest Concurrency Framework around.

Design
Stealers doesn’t use Queues or even Fixed Size Ring Buffers. It was designed from ground up to utilize Unlimited Memory (Maximum Physical RAM), be resilient to Jitter & Garbage Collection & Overflows, and have an absolute 0 message loss at any load or throughput level. Stealers was designed according to Lean Manufacturing principles to maximise throughput and and not try to maximize thread utilization. The latter is what causes extreme contention.

Benchmark & Applications
It beats LMAX/Disruptor by 4x times (correction 6.5x times as of Mar’12) on a regular core i7 desktop where it does a sustained 79m ops/sec compared to about 25m ops/sec with disruptor. And when setup in a 100P-200C configuration delivers a Total System Throughput of 1,035 million ops/second with 40+ Cores, 128+ GB RAM and less than 1ms latency. Stealers is being used in Trading Systems and Internet Scale Stream Processing (Realtime MapReduce) and Distributed Platforms.

3.
앞서 Sunguard의 기술자가 인터뷰한 내용중 자본시장IT가 해결하여야할 과제를 언급한 부분이 있습니다.

Low Latency, Big Data, Analytics, Data Science, Statistics, Cloud Computing, Hardware Acceleration, Natural User Interfaces and Mobile

자본시장이 요구하는 기술영역들입니다. 이런 기술들의 공통영역은 무엇일까요? 이를 개념화하면 Low Latency And High Throughput And Concurrency라고 할 수 있습니다. 이를 컴포넌트화하면 프레임워크입니다. 현재 자본시장IT의 프레임워크는 TP-Monitor와 DBMS를 기반으로 합니다. 차세대를 수행한 기업들이 내세운 프레임워크가 이를 벗어나지 않는 것으로 압니다. 이런 기술적인 기반으로 앞서 나왔던 시장의 요구를 해결할 수 있을까요? ZeroMQ는 C++를 기반으로 하고 Disruptor와 Stealers는 Java를 기반으로 합니다. ZeroMQ, Disruptor, Stealers. 서로 지향하는 바가 조금씩 다릅니다. 그렇지만 Low Latency와 Concurrency를 위한 방향을 공유합니다.

Exture+가 등장한 이후 증권사의 전사적 프레임워크는 아마도 위와 같은 방향성을 갖는 쪽으로 진화하여야 하지않을까 합니다. 그것이 자본시장의 요구입니다.

(*)사족입니다. 앞서 ZeroMQ소스를 살펴보시면 Inline함수를 사용하고 있습니다. C/C++ 개발자이시면 Inline 함수가 가지는 장점을 잘 아시리라 생각합니다. 지연을 줄이기 위한 노력입니다. Exture+에 참여한 어떤 개발자나 DMA 트레이딩시스템을 하고 있는 어떤 개발자도 지연을 줄이기 위하여 Inline함수를 많이 사용하고 있습니다.

An Inline Function is As Fast As a Macro

Exture+가 그리는 미래, 메시징(2)

이 글 공유하기:

Leave a Comment 응답 취소