Mutex Vs Spinlock

1.
어플리케이션에서 레이턴시를 줄이는 일은 그리 간단하지 않습니다. 예를 들어 Multi-Threaded Trading Application을 구현할 때 Lock-free Algorithm을 사용하는 경우가 많다고 하였습니다.

전략의 복수이벤트처리

그렇지만 Lock-Free Algorithm이 전가의 보도(傳家寶刀)처럼 Latency를 확 줄여주지 않습니다. 반대로 많은 프로그램들은 쓰레드동기화를 위하여 Lock을 사용합니다. 대표적인 방법은 Mutex입니다. Mutex를 사용한다고 해서 무조건 느린 어플리케이션이라고 할 수도 없습니다. 목표로 하는 업무와 하드웨어환경에 따라 최적화한 방법을 꾸준히 찾아내는 것외엔 방법이 없습니다. 지루한 작업입니다.

앞서 Mutex를 이야기했지만 Mutex이외에도 여라가지 Lock이 있습니다. 그중 하나가 SpinLock입니다.

Multithread, multiprocessor 환경에서는 여러 thread에서 동시에 하나의 resource에 접근해서 이 resource의 값을 바꾸는 시나리오가 생길 수 있다. 예를 들자면 두 개의 thread가 전역(global) 변수의 값을 동시에 바꾸려고 할 경우. 이 때 값을 바꾸는 기계어코드(Assembly code)가 atomic instruction(atomic instruction:이 instruction이 실행되고 있을 때 context switching이 일어나지 않는 것을 보장한다.) 이 아니기 때문에 변수의 값이 망가질(corrupted) 수 있다.

이 때 한 번에 한 thread만이 변수(resource)에 쓸 수 있도록 보장하는 것이 spinlock과 mutex이다. 즉 변수에 접근하기 전에 spinlock이나 mutex를 얻고 변수의 값을 바꾸고 나서는 spinlock 또는 mutex를 반환하는 것이다. 한 thread (A)가 mutex를 얻고자 할 때 다른 thread (B)가 이미 그 mutex를 가지고 있다면 일반적으로 OS scheduler는 그 thread (A)를 suspend 시키고 mutex가 release되었을 때 thread A를 다시 scheduling한다. 같은 시나리오에서 spinlock을 얻고자 하는 thread는 spinlock이 릴리즈되었는지 체크하면서 suspend되지 않는다.

resource가 빨리 릴리즈되어 spinlock으로 낭비되는 CPU time이 context switching으로 인한 latency보다 작다면 spinlock의 사용을 생각해 볼 수 있다. Single Processor환경에서 spinlock의 사용은 의미가 없으며 deadlock을 일으킬 수도 있다. 한 thread에서 recursive하게 같은 spinlock을 얻고자 하는 경우에도 deadlock이 발생 할 수 있다. 반면에 Mutex는 recursive하게 얻을 수 있다.
Spinlock과 Mutex (1)중에서

그러면 Lock을 이용한 다양한 방법중 어떤 것이 가장 좋을까요? 이곳저곳에서 조사한 자료들입니다.
먼저?Ivan Novick(Engineer at Greenplum Divsion of EMC)가 공개한 multi-threaded concurrency syncronization performance testing을 위한 프로그램입니다.

// Copyright (c) <2009>, Ivan Novick
// All rights reserved (BSD License)
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are met:
//     * Redistributions of source code must retain the above copyright
//       notice, this list of conditions and the following disclaimer.
//     * Redistributions in binary form must reproduce the above copyright
//       notice, this list of conditions and the following disclaimer in the
//       documentation and/or other materials provided with the distribution.
//     * Neither the name of the Ivan Novick nor the
//       names of its contributors may be used to endorse or promote products
//       derived from this software without specific prior written permission.
// 
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
// ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
// WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
// DISCLAIMED. IN NO EVENT SHALL IVAN NOVICK BE LIABLE FOR ANY
// DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
// (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
// LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
// ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

// Compare the wall clock time of different concurrency mechanisms, under different levels of conention
// mutex, spinlock, reead-write locks, and atomic memory accesses are compared

#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <features.h>

// compile command line
// gcc pthread_concurrency.c -D_GNU_SOURCE  -o pthread_concurrency -lpthread -lrt

struct timespec diff_timespec(struct timespec start, struct timespec end);
long long nanosec_elapsed(struct timespec diff);

// change this to make the run time short or longer
#define ITERS_PER_TEST 500000

// global variable will be incremented from many threads
unsigned long counter = 0;

pthread_mutex_t mutex;
pthread_rwlock_t rwlock;
pthread_spinlock_t spinlock;
typedef void* thread_func_t(void*);

void* atomic(void* arg){

    int i;
    for (i = 0 ; i < ITERS_PER_TEST; ++i) {
        __sync_fetch_and_add( &counter, 1 );
    }
}

void* mutexfunc(void* arg){

    int i;
    for (i = 0 ; i < ITERS_PER_TEST; ++i) {
        pthread_mutex_lock(&mutex);
        counter++;
        pthread_mutex_unlock(&mutex);
    }
}

void* spin(void* arg){

    int i;
    for (i = 0 ; i < ITERS_PER_TEST; ++i) {
        pthread_spin_lock(&spinlock);
        counter++;
        pthread_spin_unlock(&spinlock);
    }
}

void* read_lock(void* arg){

    int i;
    for (i = 0 ; i < ITERS_PER_TEST; ++i) {
        pthread_rwlock_rdlock(&rwlock);
        counter++;
        pthread_rwlock_unlock(&rwlock);
    }
}

void* write_lock(void* arg){

    int i;
    for (i = 0 ; i < ITERS_PER_TEST; ++i) {
        pthread_rwlock_wrlock(&rwlock);
        counter++;
        pthread_rwlock_unlock(&rwlock);
    }
}

void do_test(thread_func_t func, unsigned short threads, const char* name){

    int i;
    struct timespec start;
    struct timespec end;
    struct timespec diff;

    pthread_t thread_array[threads];
    counter = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);

    for (i = 0; i < threads; ++i){
        if (pthread_create( &thread_array[i], NULL, func, NULL)){
            printf("error creating threads, exiting");
            exit(1);
        }
    }

    for (i = 0; i < threads; i++)
        pthread_join( thread_array[i], NULL );

    clock_gettime(CLOCK_MONOTONIC, &end);
    diff = diff_timespec(start, end);
    printf("%14s %2d threads took %16lld nanoseconds, global counter = %lu\n", name, threads, nanosec_elapsed(diff), counter);
}

void do_all_tests(int threads){
    printf("*************************************\n");
    fflush(stdout);
    do_test(&atomic, threads, "atomic");
    do_test(&mutexfunc, threads, "mutex");
    do_test(&spin, threads, "spin");
    do_test(&read_lock, threads, "read_lock");
    do_test(&write_lock, threads, "write_lock");
    printf("*************************************\n");
    fflush(stdout);
}

int main(int argc, char** argv)
{
    pthread_mutex_init(&mutex, NULL);
    pthread_rwlock_init(&rwlock, NULL);
    pthread_spin_init(&spinlock, 0);

    do_all_tests(1);
    do_all_tests(8);
    do_all_tests(32); // spin lock will take forever if the number of threads is much higher

    pthread_mutex_destroy(&mutex);
    pthread_rwlock_destroy(&rwlock);
    pthread_spin_destroy(&spinlock);

    return 0;
}

struct timespec diff_timespec(struct timespec start, struct timespec end)
{
    struct timespec result;

    if (end.tv_nsec < start.tv_nsec){ // peform carry like in normal subtraction
        //                123456789
        result.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;        
        result.tv_sec = end.tv_sec - 1 - start.tv_sec;
    }
    else{
        result.tv_nsec = end.tv_nsec - start.tv_nsec;        
        result.tv_sec = end.tv_sec - start.tv_sec;
    }

    return result;
}

long long nanosec_elapsed(struct timespec diff){
    //                                123456789 
    return ((long long)diff.tv_sec * 1000000000) + diff.tv_nsec;
}

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

// Copyright (c) <2009>, Ivan Novick

// Redistribution and use in source and binary forms, with or without

// modification, are permitted provided that the following conditions are met:

// * Redistributions of source code must retain the above copyright

// notice, this list of conditions and the following disclaimer.

// * Redistributions in binary form must reproduce the above copyright

// notice, this list of conditions and the following disclaimer in the

// documentation and/or other materials provided with the distribution.

// * Neither the name of the Ivan Novick nor the

// names of its contributors may be used to endorse or promote products

// derived from this software without specific prior written permission.

// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND

// ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED

// WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE

// DISCLAIMED. IN NO EVENT SHALL IVAN NOVICK BE LIABLE FOR ANY

// DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES

// (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;

// LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND

// ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT

// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS

// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

// Compare the wall clock time of different concurrency mechanisms, under different levels of conention

// mutex, spinlock, reead-write locks, and atomic memory accesses are compared

#include <time.h>

#include <stdio.h>

#include <stdlib.h>

#include <pthread.h>

#include <features.h>

// compile command line

// gcc pthread_concurrency.c -D_GNU_SOURCE -o pthread_concurrency -lpthread -lrt

struct timespec diff_timespec(struct timespec start, struct timespec end);

long long nanosec_elapsed(struct timespec diff);

// change this to make the run time short or longer

#define ITERS_PER_TEST 500000

// global variable will be incremented from many threads

unsigned long counter = 0;

pthread_mutex_t mutex;

pthread_rwlock_t rwlock;

pthread_spinlock_t spinlock;

typedef void* thread_func_t(void*);

void* atomic(void* arg){

int i;

for (i = 0 ; i < ITERS_PER_TEST; ++i) {

__sync_fetch_and_add( &counter, 1 );

}

void* mutexfunc(void* arg){

int i;

for (i = 0 ; i < ITERS_PER_TEST; ++i) {

pthread_mutex_lock(&mutex);

counter++;

pthread_mutex_unlock(&mutex);

}

void* spin(void* arg){

int i;

for (i = 0 ; i < ITERS_PER_TEST; ++i) {

pthread_spin_lock(&spinlock);

counter++;

pthread_spin_unlock(&spinlock);

}

void* read_lock(void* arg){

int i;

for (i = 0 ; i < ITERS_PER_TEST; ++i) {

pthread_rwlock_rdlock(&rwlock);

counter++;

pthread_rwlock_unlock(&rwlock);

}

void* write_lock(void* arg){

int i;

for (i = 0 ; i < ITERS_PER_TEST; ++i) {

pthread_rwlock_wrlock(&rwlock);

counter++;

pthread_rwlock_unlock(&rwlock);

}

void do_test(thread_func_t func, unsigned short threads, const char* name){

int i;

struct timespec start;

struct timespec end;

struct timespec diff;

pthread_t thread_array[threads];

counter = 0;

clock_gettime(CLOCK_MONOTONIC, &start);

for (i = 0; i < threads; ++i){

if (pthread_create( &thread_array[i], NULL, func, NULL)){

printf("error creating threads, exiting");

exit(1);

}

for (i = 0; i < threads; i++)

pthread_join( thread_array[i], NULL );

clock_gettime(CLOCK_MONOTONIC, &end);

diff = diff_timespec(start, end);

printf("%14s %2d threads took %16lld nanoseconds, global counter = %lu\n", name, threads, nanosec_elapsed(diff), counter);

}

void do_all_tests(int threads){

printf("*************************************\n");

fflush(stdout);

do_test(&atomic, threads, "atomic");

do_test(&mutexfunc, threads, "mutex");

do_test(&spin, threads, "spin");

do_test(&read_lock, threads, "read_lock");

do_test(&write_lock, threads, "write_lock");

printf("*************************************\n");

fflush(stdout);

}

int main(int argc, char** argv)

{

pthread_mutex_init(&mutex, NULL);

pthread_rwlock_init(&rwlock, NULL);

pthread_spin_init(&spinlock, 0);

do_all_tests(1);

do_all_tests(8);

do_all_tests(32); // spin lock will take forever if the number of threads is much higher

pthread_mutex_destroy(&mutex);

pthread_rwlock_destroy(&rwlock);

pthread_spin_destroy(&spinlock);

return 0;

}

struct timespec diff_timespec(struct timespec start, struct timespec end)

{

struct timespec result;

if (end.tv_nsec < start.tv_nsec){ // peform carry like in normal subtraction

// 123456789

result.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;

result.tv_sec = end.tv_sec - 1 - start.tv_sec;

}

else{

result.tv_nsec = end.tv_nsec - start.tv_nsec;

result.tv_sec = end.tv_sec - start.tv_sec;

}

return result;

}

long long nanosec_elapsed(struct timespec diff){

// 123456789

return ((long long)diff.tv_sec * 1000000000) + diff.tv_nsec;

}

개발서버에서 실행한 결과입니다. 참고로 Quadcore입니다.

[zeroaos@ium ~]$ ./pthread_concurrency
*************************************
        atomic  1 threads took          7381341 nanoseconds, global counter = 500000
         mutex  1 threads took         10051864 nanoseconds, global counter = 500000
          spin  1 threads took          3579301 nanoseconds, global counter = 500000
     read_lock  1 threads took         16083250 nanoseconds, global counter = 500000
    write_lock  1 threads took         15635445 nanoseconds, global counter = 500000
*************************************
*************************************
        atomic  8 threads took         72055688 nanoseconds, global counter = 4000000
         mutex  8 threads took        439657179 nanoseconds, global counter = 4000000
          spin  8 threads took        282621428 nanoseconds, global counter = 4000000
     read_lock  8 threads took        719492547 nanoseconds, global counter = 3904112
    write_lock  8 threads took       2945338706 nanoseconds, global counter = 4000000
*************************************
*************************************
        atomic 32 threads took        290695065 nanoseconds, global counter = 16000000
         mutex 32 threads took       1910090731 nanoseconds, global counter = 16000000
            spin 32 threads took       3380825043 nanoseconds, global counter = 16000000
     read_lock 32 threads took       3182536379 nanoseconds, global counter = 15609558
    write_lock 32 threads took       2300156255 nanoseconds, global counter = 16000000
*************************************

[zeroaos@ium ~]$ ./pthread_concurrency

*************************************

atomic 1 threads took 7381341 nanoseconds, global counter = 500000

mutex 1 threads took 10051864 nanoseconds, global counter = 500000

spin 1 threads took 3579301 nanoseconds, global counter = 500000

read_lock 1 threads took 16083250 nanoseconds, global counter = 500000

write_lock 1 threads took 15635445 nanoseconds, global counter = 500000

*************************************

atomic 8 threads took 72055688 nanoseconds, global counter = 4000000

mutex 8 threads took 439657179 nanoseconds, global counter = 4000000

spin 8 threads took 282621428 nanoseconds, global counter = 4000000

read_lock 8 threads took 719492547 nanoseconds, global counter = 3904112

write_lock 8 threads took 2945338706 nanoseconds, global counter = 4000000

*************************************

atomic 32 threads took 290695065 nanoseconds, global counter = 16000000

mutex 32 threads took 1910090731 nanoseconds, global counter = 16000000

spin 32 threads took 3380825043 nanoseconds, global counter = 16000000

read_lock 32 threads took 3182536379 nanoseconds, global counter = 15609558

write_lock 32 threads took 2300156255 nanoseconds, global counter = 16000000

*************************************

같은 소스를 최적화옵션을 주고 컴파일을 한 후 실행해보았습니다.

[zeroaos@ium ~]$ gcc -O2 pthread_concurrency.c -D_GNU_SOURCE  -o pthread_concurrency -lpthread -lrt
[zeroaos@ium ~]$ ./pthread_concurrency
*************************************
        atomic  1 threads took          4920287 nanoseconds, global counter = 500000
         mutex  1 threads took         10300062 nanoseconds, global counter = 500000
          spin  1 threads took          3624036 nanoseconds, global counter = 500000
     read_lock  1 threads took         16055892 nanoseconds, global counter = 500000
    write_lock  1 threads took         15807383 nanoseconds, global counter = 500000
*************************************
*************************************
        atomic  8 threads took         72196236 nanoseconds, global counter = 4000000
         mutex  8 threads took        478613081 nanoseconds, global counter = 4000000
          spin  8 threads took        368431482 nanoseconds, global counter = 4000000
     read_lock  8 threads took        663119813 nanoseconds, global counter = 3916820
    write_lock  8 threads took       3441026599 nanoseconds, global counter = 4000000
*************************************
*************************************
        atomic 32 threads took        290805906 nanoseconds, global counter = 16000000
         mutex 32 threads took       2083969638 nanoseconds, global counter = 16000000
          spin 32 threads took       5692459866 nanoseconds, global counter = 16000000
     read_lock 32 threads took       2989131816 nanoseconds, global counter = 15653452
    write_lock 32 threads took       5042990707 nanoseconds, global counter = 16000000
*************************************

[zeroaos@ium ~]$ gcc -O2 pthread_concurrency.c -D_GNU_SOURCE -o pthread_concurrency -lpthread -lrt

[zeroaos@ium ~]$ ./pthread_concurrency

*************************************

atomic 1 threads took 4920287 nanoseconds, global counter = 500000

mutex 1 threads took 10300062 nanoseconds, global counter = 500000

spin 1 threads took 3624036 nanoseconds, global counter = 500000

read_lock 1 threads took 16055892 nanoseconds, global counter = 500000

write_lock 1 threads took 15807383 nanoseconds, global counter = 500000

*************************************

atomic 8 threads took 72196236 nanoseconds, global counter = 4000000

mutex 8 threads took 478613081 nanoseconds, global counter = 4000000

spin 8 threads took 368431482 nanoseconds, global counter = 4000000

read_lock 8 threads took 663119813 nanoseconds, global counter = 3916820

write_lock 8 threads took 3441026599 nanoseconds, global counter = 4000000

*************************************

atomic 32 threads took 290805906 nanoseconds, global counter = 16000000

mutex 32 threads took 2083969638 nanoseconds, global counter = 16000000

spin 32 threads took 5692459866 nanoseconds, global counter = 16000000

read_lock 32 threads took 2989131816 nanoseconds, global counter = 15653452

write_lock 32 threads took 5042990707 nanoseconds, global counter = 16000000

*************************************

Spin Lock의 경우 CPU 코어에 비해 Thread가 너무 많을 경우 성능이 떨어지는 듯 합니다. 다음은?pthread mutex vs pthread spinlock에 올라온 프로그램입니다.

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>

#include <list>

#define LOOPS 10000000

using namespace std;

list<int> the_list;

#ifdef USE_SPINLOCK
pthread_spinlock_t spinlock;
#else
pthread_mutex_t mutex;
#endif

pid_t gettid() { return syscall( __NR_gettid ); }

void *consumer(void *ptr)
{
    int i;

    printf("Consumer TID %lu\n", (unsigned long)gettid());

    while (1)
    {
#ifdef USE_SPINLOCK
        pthread_spin_lock(&spinlock);
#else
        pthread_mutex_lock(&mutex);
#endif

        if (the_list.empty())
        {
#ifdef USE_SPINLOCK
            pthread_spin_unlock(&spinlock);
#else
            pthread_mutex_unlock(&mutex);
#endif
            break;
        }

        i = the_list.front();
        the_list.pop_front();

#ifdef USE_SPINLOCK
        pthread_spin_unlock(&spinlock);
#else
        pthread_mutex_unlock(&mutex);
#endif
    }

    return NULL;
}

int main()
{
    int i;
    pthread_t thr1, thr2;
    struct timeval tv1, tv2;

#ifdef USE_SPINLOCK
    pthread_spin_init(&spinlock, 0);
#else
    pthread_mutex_init(&mutex, NULL);
#endif

    // Creating the list content...
    for (i = 0; i < LOOPS; i++)
        the_list.push_back(i);

    // Measuring time before starting the threads...
    gettimeofday(&tv1, NULL);

    pthread_create(&thr1, NULL, consumer, NULL);
    pthread_create(&thr2, NULL, consumer, NULL);

    pthread_join(thr1, NULL);
    pthread_join(thr2, NULL);

    // Measuring time after threads finished...
    gettimeofday(&tv2, NULL);

    if (tv1.tv_usec > tv2.tv_usec)
    {
        tv2.tv_sec--;
        tv2.tv_usec += 1000000;
    }

    printf("Result - %ld.%ld\n", tv2.tv_sec - tv1.tv_sec,
        tv2.tv_usec - tv1.tv_usec);

#ifdef USE_SPINLOCK
    pthread_spin_destroy(&spinlock);
#else
    pthread_mutex_destroy(&mutex);
#endif

    return 0;
}

100

101

102

103

104

105

#include <stdio.h>

#include <pthread.h>

#include <unistd.h>

#include <sys/syscall.h>

#include <errno.h>

#include <sys/time.h>

#include <list>

#define LOOPS 10000000

using namespace std;

list<int> the_list;

#ifdef USE_SPINLOCK

pthread_spinlock_t spinlock;

#else

pthread_mutex_t mutex;

#endif

pid_t gettid() { return syscall( __NR_gettid ); }

void *consumer(void *ptr)

{

int i;

printf("Consumer TID %lu\n", (unsigned long)gettid());

while (1)

{

#ifdef USE_SPINLOCK

pthread_spin_lock(&spinlock);

#else

pthread_mutex_lock(&mutex);

#endif

if (the_list.empty())

{

#ifdef USE_SPINLOCK

pthread_spin_unlock(&spinlock);

#else

pthread_mutex_unlock(&mutex);

#endif

break;

}

i = the_list.front();

the_list.pop_front();

#ifdef USE_SPINLOCK

pthread_spin_unlock(&spinlock);

#else

pthread_mutex_unlock(&mutex);

#endif

}

return NULL;

}

int main()

{

int i;

pthread_t thr1, thr2;

struct timeval tv1, tv2;

#ifdef USE_SPINLOCK

pthread_spin_init(&spinlock, 0);

#else

pthread_mutex_init(&mutex, NULL);

#endif

// Creating the list content...

for (i = 0; i < LOOPS; i++)

the_list.push_back(i);

// Measuring time before starting the threads...

gettimeofday(&tv1, NULL);

pthread_create(&thr1, NULL, consumer, NULL);

pthread_create(&thr2, NULL, consumer, NULL);

pthread_join(thr1, NULL);

pthread_join(thr2, NULL);

// Measuring time after threads finished...

gettimeofday(&tv2, NULL);

if (tv1.tv_usec > tv2.tv_usec)

{

tv2.tv_sec--;

tv2.tv_usec += 1000000;

}

printf("Result - %ld.%ld\n", tv2.tv_sec - tv1.tv_sec,

tv2.tv_usec - tv1.tv_usec);

#ifdef USE_SPINLOCK

pthread_spin_destroy(&spinlock);

#else

pthread_mutex_destroy(&mutex);

#endif

return 0;

}

역시 개발서버에서 실험한 결과입니다.

[zeroaos@ium ~]$ g++ -DUSE_SPINLOCK  -Wall -pthread mutex_vs_spinlock.cc
[zeroaos@ium ~]$ ./a.out
Consumer TID 15850
Consumer TID 15851
Result - 0.990084
[zeroaos@ium ~]$ g++ -Wall -pthread mutex_vs_spinlock.cc
[zeroaos@ium ~]$ ./a.out
Consumer TID 15870
Consumer TID 15871
Result - 2.143256
[zeroaos@ium ~]$

[zeroaos@ium ~]$ g++ -DUSE_SPINLOCK -Wall -pthread mutex_vs_spinlock.cc

[zeroaos@ium ~]$ ./a.out

Consumer TID 15850

Consumer TID 15851

Result - 0.990084

[zeroaos@ium ~]$ g++ -Wall -pthread mutex_vs_spinlock.cc

[zeroaos@ium ~]$ ./a.out

Consumer TID 15870

Consumer TID 15871

Result - 2.143256

[zeroaos@ium ~]$

Spinlock이 Mutex에 비해 3배정도 좋은 듯 합니다.

2.
또다른 자료입니다. 리눅스환경이 아닌 Sparc Solaris입니다. The cost of mutexes에 올라온 자료입니다.

#include <pthread.h>
#include <atomic.h>
#include "timing.h"

#define SIZE 10000000

pthread_mutex_t mutex;
pthread_t thread;
volatile unsigned int counter;

void atomic_add(volatile unsigned int \*,int);

void \* count(void\* value)
{
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    pthread_mutex_lock(&mutex);
    counter++;
    pthread_mutex_unlock(&mutex);
  }
  endtime(SIZE);
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    atomic_add_int(&counter,1);
  }
  endtime(SIZE);
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    atomic_add(&counter,1);
  }
  endtime(SIZE);
}

void main()
{
  pthread_mutex_init(&mutex,0);
  counter=0;
  pthread_create(&thread,0,count,0);

  pthread_join(thread,0);
  pthread_mutex_destroy(&mutex);
}

#include <pthread.h>

#include <atomic.h>

#include "timing.h"

#define SIZE 10000000

pthread_mutex_t mutex;

pthread_t thread;

volatile unsigned int counter;

void atomic_add(volatile unsigned int \*,int);

void \* count(void\* value)

{

counter=0;

starttime();

while (counter<SIZE)

{

pthread_mutex_lock(&mutex);

counter++;

pthread_mutex_unlock(&mutex);

}

endtime(SIZE);

counter=0;

starttime();

while (counter<SIZE)

{

atomic_add_int(&counter,1);

}

endtime(SIZE);

counter=0;

starttime();

while (counter<SIZE)

{

atomic_add(&counter,1);

}

endtime(SIZE);

}

void main()

{

pthread_mutex_init(&mutex,0);

counter=0;

pthread_create(&thread,0,count,0);

pthread_join(thread,0);

pthread_mutex_destroy(&mutex);

}

제가 Sparc CPU를 사용하는 서버가 없어서 시험을 하지 못했습니다. 저자가 올려놓은 결과는 이렇습니다.

% cc test.c add.il
% a.out
Time per iteration 250.61 ns
Time per iteration 75.85 ns
Time per iteration 65.21 ns

% cc test.c add.il

% a.out

Time per iteration 250.61 ns

Time per iteration 75.85 ns

Time per iteration 65.21 ns

이렇게 해석합니다.

So the mutex calls are about 3x slower than atomic operations. Calling libc is about 10ns slower than using an inline template (not a bad difference in return for not having to write the inline template code).

오라클의 수석엔지니어는 Inline을 이용한 방법이 최선이라고 주장합니다. 그래서 아래와 같은 글도 썼네요.

Using Inline Templates to Improve Application Performance

3.
다양한 방법에 대한 결과들입니다. 위의 소스를 이용하여 목적에 맞도록 수정하여 Performance를 점검해보시면 개발할 때 도움이 되리라 생각합니다. 그렇다고 ‘한방’을 기대하지 마시길 바랍니다.

오히려 끊임없이 진화하는 프로그램이도록 노력하는 것이 현명할 듯 합니다. 그래서 저의 목표는 지속적인 버전업(Version Up)입니다.

이 글 공유하기:

Leave a Comment 응답 취소