Multithreading part 2

September 10th, 2009 Romain

I’ve finally finished implementing multithreaded ray tracing in order to take full advantage of my iMac’s Core 2 Duo.

The implementation currently uses pthreads, which are simple enough to use and integrate. Of course, now with Snow Leopard a good move would be to go the Grand Central Dispatch route.

One issue with multithreaded ray tracing is synchronised access to the frame buffer, which is needed to ensure consistent reconstruction filtering.

A solution is to restrict access to the frame buffer to a single thread. This implies buffering the samples produced by the worker threads and having this dedicated thread splat them onto the frame buffer.
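As a rough illustration of this first approach, here is a minimal C++11 sketch. It is not code from my ray tracer: `Sample`, `SampleQueue` and `renderWithSplatThread` are hypothetical names, the workers emit a constant value instead of tracing rays, and the filter is a one-pixel box so that each sample lands in exactly one pixel.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical sample record: image-plane position and one radiance channel.
struct Sample { float x, y, value; };

// Thread-safe queue: workers push samples, the frame buffer thread pops them.
class SampleQueue {
public:
    void push(const Sample &s) {
        { std::lock_guard<std::mutex> lock(mutex_); samples_.push(s); }
        ready_.notify_one();
    }
    // Signals that no more samples will be pushed.
    void close() {
        { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
        ready_.notify_all();
    }
    // Blocks until a sample is available; returns false once closed and drained.
    bool pop(Sample &out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !samples_.empty() || closed_; });
        if (samples_.empty()) return false;
        out = samples_.front();
        samples_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Sample> samples_;
    bool closed_ = false;
};

// Runs `workers` producer threads and one consumer that owns the frame buffer.
std::vector<float> renderWithSplatThread(int width, int height, int workers) {
    assert(width > 0 && height > 0 && workers > 0);
    std::vector<float> frame(static_cast<std::size_t>(width) * height, 0.0f);
    SampleQueue queue;

    // Only this thread ever touches `frame`, so the buffer itself needs no lock.
    std::thread splatter([&] {
        Sample s;
        while (queue.pop(s))
            frame[static_cast<int>(s.y) * width + static_cast<int>(s.x)] += s.value;
    });

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&, w] {
            // Rows are interleaved among workers; a real tracer would shoot rays here.
            for (int y = w; y < height; y += workers)
                for (int x = 0; x < width; ++x)
                    queue.push({x + 0.5f, y + 0.5f, 1.0f});
        });

    for (auto &t : pool) t.join();
    queue.close();  // lets the splatting thread drain the queue and exit
    splatter.join();
    return frame;
}
```

The cost of this design is the queue itself: every sample goes through a lock, so a real implementation would probably push samples in batches rather than one at a time.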

Another solution, which I have implemented, is to have the worker threads work on non-adjacent parts of the frame buffer at any given time. That way the worker threads can write to the frame buffer directly and no buffering is required.
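To make the idea concrete, here is a minimal C++11 sketch of one way such a scheme could look. It is an illustration under assumptions rather than my actual code: `TileSequencer` and `renderTiled` are hypothetical names, the frame buffer is split into square tiles, and the filter radius is assumed to be at most half a tile so that the footprints of two non-adjacent tiles can never overlap.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

// Hands out tiles so that no two tiles in flight touch, even diagonally.
class TileSequencer {
public:
    TileSequencer(int cols, int rows)
        : cols_(cols), rows_(rows), state_(cols * rows, Pending), pending_(cols * rows) {}

    // Returns a tile index, blocking while every pending tile has a busy
    // neighbour. Returns -1 once all tiles have been handed out.
    int acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            if (pending_ == 0) return -1;
            for (int t = 0; t < cols_ * rows_; ++t)
                if (state_[t] == Pending && !hasBusyNeighbour(t)) {
                    state_[t] = InFlight;
                    --pending_;
                    return t;
                }
            released_.wait(lock);  // woken by release()
        }
    }

    void release(int tile) {
        { std::lock_guard<std::mutex> lock(mutex_); state_[tile] = Done; }
        released_.notify_all();
    }

    bool adjacent(int a, int b) const {
        return std::abs(a % cols_ - b % cols_) <= 1 && std::abs(a / cols_ - b / cols_) <= 1;
    }

private:
    enum State { Pending, InFlight, Done };

    bool hasBusyNeighbour(int t) const {
        for (int u = 0; u < cols_ * rows_; ++u)
            if (u != t && state_[u] == InFlight && adjacent(t, u)) return true;
        return false;
    }

    int cols_, rows_;
    std::vector<State> state_;
    int pending_;
    std::mutex mutex_;
    std::condition_variable released_;
};

// Workers splat a 3x3 footprint (filter radius 1) straight into the shared
// frame buffer, with no lock on the buffer itself.
std::vector<float> renderTiled(int cols, int rows, int tileSize, int workers) {
    assert(tileSize >= 2);  // footprint radius (1) must not exceed half a tile
    const int width = cols * tileSize, height = rows * tileSize;
    std::vector<float> frame(width * height, 0.0f);
    TileSequencer sequencer(cols, rows);

    auto work = [&] {
        int tile;
        while ((tile = sequencer.acquire()) != -1) {
            const int x0 = (tile % cols) * tileSize, y0 = (tile / cols) * tileSize;
            for (int y = y0; y < y0 + tileSize; ++y)
                for (int x = x0; x < x0 + tileSize; ++x)
                    for (int dy = -1; dy <= 1; ++dy)
                        for (int dx = -1; dx <= 1; ++dx) {
                            const int px = x + dx, py = y + dy;
                            if (px >= 0 && px < width && py >= 0 && py < height)
                                frame[py * width + px] += 1.0f;
                        }
            sequencer.release(tile);
        }
    };

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) pool.emplace_back(work);
    for (auto &t : pool) t.join();
    return frame;
}
```

The only lock here protects the sequencer, and it is held just long enough to pick or release a tile. The linear scans are fine for a sketch but would want a smarter ordering for large tile counts.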

I would be curious to have performance figures in order to compare these two approaches.

  1. Matt Pharr
    September 11th, 2009 at 02:58 | #1

    Grand Central Dispatch is great; it’s amazing how much simpler it makes the code for stuff like this.

    I’ve found that having a single big mutex for the framebuffer actually isn’t too bad (for up to 4 cores at least). But this may also be a function of my ray tracer being relatively slow, which means that there’s less pressure on the framebuffer update code. If you can partition among threads, that’s great, but if you have pixel reconstruction filters with width >1 pixel, then that doesn’t work so well…

    Another option is basically a lockless approach to do atomic floating point adds, along the lines of:

    inline float AtomicAdd(volatile float *val, float delta) {
        union bits { float f; int32_t i; };
        bits oldVal, newVal;
        do {
            oldVal.f = *val;
            newVal.f = oldVal.f + delta;
        } while (AtomicCompareAndSwap(((AtomicInt32 *)val),
                                      newVal.i, oldVal.i) != oldVal.i);
        return newVal.f;
    }

    with

    inline int32_t AtomicCompareAndSwap(AtomicInt32 *v, int32_t newValue, int32_t oldValue) {
    #if defined(WIN32)
        return InterlockedCompareExchange(v, newValue, oldValue);
    #elif defined(__APPLE__) && !(defined(__i386__) || defined(__amd64__))
        return OSAtomicCompareAndSwap32Barrier(oldValue, newValue, v);
    #else
        int32_t result;
        __asm__ __volatile__("lock\ncmpxchgl %2,%1"
                             : "=a"(result), "=m"(*v)
                             : "q"(newValue), "0"(oldValue)
                             : "memory");
        return result;
    #endif
    }

    This seems to perform decently as well, though again my ray tracer is relatively slow, so it’s hard to draw definitive performance conclusions from it. On the other hand, I haven’t run it on a modern (~Nehalem) CPU with fast in-cache atomics, which should make this stuff faster still…

    HTH/FWIW,
    -matt

  2. Romain
    September 11th, 2009 at 10:13 | #2

    @Matt Pharr

    Hi Matt,

    Thanks a lot for this comment. You’ve made my day! :-)

    “Grand Central Dispatch is great; it’s amazing how much simpler it makes the code for stuff like this.”

    That was my impression from reading the material on the Apple Dev Center. I’ll probably give it a try at some point; after all, that’s pretty much the only OS X 10.6 feature my Mac supports! ;-)

    “If you can partition among threads, that’s great, but if you have pixel reconstruction filters with width >1 pixel, then that doesn’t work so well…”

    Yeah, exactly. My sequencing scheme gets around the issue by ensuring that no two threads work on adjacent frame buffer tiles. This requires a mutex to synchronize access to the sequencer, but accesses are quick. No idea about the relative performance, though.

    I stalled on the lockless approach because I could not find out whether an atomic accumulation was possible. So this code snippet you’ve posted is a real gem. I’ll certainly try it out. Thanks! :-)

    PS: This little PBRT book of yours is an invaluable reference.
