Dekker’s Algorithm Does not Work, as Expected

Sometimes it is very reassuring that certain things do not work when tested in practice, especially when you have been telling people so for a long time. In my talks about Debugging Multicore Systems at the Embedded Systems Conference Silicon Valley in 2006 and 2007, I had a fairly long discussion about relaxed or weak memory consistency models and their effect on parallel software when run on a truly concurrent machine. I used Dekker’s Algorithm as an example of code that works just fine on a single-processor machine with a multitasking operating system, but that fails to work on a dual-processor machine. Over Christmas, I finally did a practical test of just how easy it is to make it fail in reality, which turned out to showcase some interesting properties of various types and brands of hardware and software.

Now to the code.

The core of Dekker’s Algorithm is two symmetrical pieces of code that access a set of shared variables in a way that makes it impossible for both to enter their critical sections at the same time, as long as memory is sequentially consistent. Here is my implementation, warts and all:

static volatile int flag1 = 0;
static volatile int flag2 = 0;
static volatile int turn  = 1;
static volatile int gSharedCounter = 0;
void dekker1(void) {
        flag1 = 1;
        turn  = 2;
        while((flag2 == 1) && (turn == 2)) ;
        // Critical section
        gSharedCounter++;
        // Leave the critical section, let the other task run
        flag1 = 0;
}

void dekker2(void) {
        flag2 = 1;
        turn  = 1;
        while((flag1 == 1) && (turn == 1)) ;
        // Critical section
        gSharedCounter++;
        // Leave the critical section, let the other task run
        flag2 = 0;
}

This code can fail on a machine with weak memory consistency, since most memory systems place no constraint on the order in which the updates to “flag2”, “flag1”, and “turn” become visible to the other processor. In particular, there is no guarantee that the read from “flag2” in dekker1 will happen after the writes to “flag1” and “turn” have propagated to dekker2. Applying this argument symmetrically, you get something like the following sketch:

[Figure: Dekker bug. Sketch of how reordered memory updates let both threads enter the critical section at the same time.]
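To show what a repaired version could look like, here is a sketch of dekker1 with explicit barriers. I am assuming GCC’s __sync_synchronize() builtin, which emits a full memory barrier; treat this as an untested sketch rather than a verified fix:

void dekker1_with_barriers(void) {
        flag1 = 1;
        turn  = 2;
        // Full barrier: the stores to flag1 and turn above must become
        // visible to the other processor before we read flag2 and turn
        __sync_synchronize();
        while((flag2 == 1) && (turn == 2)) ;
        // Critical section
        gSharedCounter++;
        // Full barrier: the counter update must be visible before the
        // flag is released
        __sync_synchronize();
        flag1 = 0;
}

On x86, the first barrier is the essential one, since store-to-load reordering is the only reordering its memory model allows; on weaker models like the PowerPC’s, the barrier before the flag release matters as well.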

From this basic faulty code, I then use pthreads to create a parallel program. In the program, each thread loops many millions of times trying to get into the critical section:

int gLoopCount;

void *task1(void *arg) {
        int i;
        printf("Starting task1\n");
        for(i = gLoopCount; i > 0; i--) {
                dekker1();
        }
        return NULL;
}

void *task2(void *arg) {
        int i;
        printf("Starting task2\n");
        for(i = gLoopCount; i > 0; i--) {
                dekker2();
        }
        return NULL;
}

If both threads do enter the critical section at the same time, incrementing the shared counter in dekker1 and dekker2 will likely result in a missed update to “gSharedCounter”, as both threads read the same value, increment it, and write the same result back (this is exactly the kind of error that mutual exclusion is supposed to protect against). Thus, by checking the value of gSharedCounter at the end of a run, I can tell whether any locking failures occurred. Note that this count is conservative: it is quite possible for one task to complete the entire read-increment-write sequence implied by the ++ operator before the other task does, so the number of missed updates is a lower bound on the number of failed lock attempts.
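For reference, here is roughly what the driver around these two tasks looks like. This is my reconstruction of the attached dekker.c (see the end of the post); the attachment is the authoritative version:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
        pthread_t t1, t2;
        // Number of locking attempts per thread, from the command line
        gLoopCount = (argc > 1) ? atoi(argv[1]) : 10000000;
        pthread_create(&t1, NULL, task1, NULL);
        pthread_create(&t2, NULL, task2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("Both threads terminated\n");
        if(gSharedCounter == 2 * gLoopCount)
                printf("Dekker worked, sum is %d.\n", gSharedCounter);
        else
                printf("[-] Dekker did not work, sum %d rather than %d.\n"
                       "%d missed updates due to memory consistency races.\n",
                       gSharedCounter, 2 * gLoopCount,
                       2 * gLoopCount - gSharedCounter);
        return 0;
}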

So what happened when I tried this on real machines?

I must admit that I did not expect to see very many errors, since I kind of assumed that modern hardware is so fast at communicating between cores that the window of opportunity for catching a bug would be pretty small. But errors turned out to be quite frequent, and the rate varied greatly across machines.

  • On my Core 2 Duo T7600 laptop, with Windows Vista, I got on average one error in every 1.5 million locking attempts.
  • On an older dual-processor Opteron 242 machine, I got on average one error every 15,000 locking attempts: 100 times more often than on the Core 2 Duo!
  • On a Freescale MPC8641D dual-core machine, I got on average one error every 2000 locking attempts.
  • On a range of single-core machines, not a single error was observed.

So the theory is validated. It always feels good to have practical proof that I have been telling the truth at ESC 🙂 What else can we tell from the numbers? I think this actually demonstrates a few other theories in practice:

  • Communication between cores on a single chip is much faster than communication between separate chips, and the longer latencies between chips make Dekker more likely to fail. This is shown by the difference between the dual-processor and dual-core x86 systems.
  • The PowerPC has, by design, a weaker memory consistency model than x86 systems, so the higher rate of locking failures there is also consistent with expectations.

Now all I need is a few more machines to test on. It is always fun to microbenchmark real machines, as evidenced by my RTAS 2003 paper on branch predictors and their effect on WCET predictability.

If you want to try this yourself, the source code is attached to this post: dekker.c. Just compile it with “gcc -O2 -o dekker dekker.c -lpthread”. It has been tested on Windows with Cygwin as well as on various Linuxes. Run it with the argument 10000000 (10 million), which seems to give a good indication of the prevalence of errors without taking too long to run.

22 thoughts on “Dekker’s Algorithm Does not Work, as Expected”

  1. Update: I just tried this on a Freescale MPC8572E dual-core chip, and it has a far more aggressive memory system… I get locking failures every 5 to 20 attempts on that one. Very interesting and very unexpected.

  2. Good analysis.
    I’m assuming that fixing the program would require some architecture-specific inline asm to create a memory barrier.

  3. Your testcase is faulty. Volatile cannot protect against operations like “++”. It can only protect loads and stores. Using gSharedCounter++ as an indicator is misleading. Therefore, your result is wrong. Use a mutex (or some other mechanism) to protect gSharedCounter and try again.

  4. Johnny: Your testcase is faulty. Volatile cannot protect against operations like “++”. It can only protect loads and stores. Using gSharedCounter++ as an indicator is misleading. Therefore, your result is wrong. Use a mutex (or some other mechanism) to protect gSharedCounter and try again.

    No, that is not a correct statement. Volatile ties a variable to memory, forcing it to be loaded and stored each time it is used and never cached in a register. That I do a “++” operation on it is irrelevant; doing “x=x+1” would have done the same.

    This code is interesting since it works on a single-processor machine, but not on a multiprocessor.

    Using a mutex is silly, since Dekker’s is a way to try to implement a mutex using user-level operations. But certainly, that would make things safe.
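    To make that concrete, the increment expands into three separate steps, roughly as follows (my sketch; the exact instructions depend on compiler and target, but the three-step structure is the point):

    int tmp = gSharedCounter;  /* load: volatile forces a real read from memory */
    tmp = tmp + 1;             /* the increment happens in a register */
    gSharedCounter = tmp;      /* store: volatile forces a real write to memory */

    Both threads can perform the load before either performs the store, and then one of the two updates is lost.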

  5. I wonder if using __sync_synchronize() would make this “correct”? Right before the while():

    turn = 2;
    __sync_synchronize();
    while((flag2 == 1) && (turn == 2)) ;

    turn = 1;
    __sync_synchronize();
    while((flag1 == 1) && (turn == 1)) ;

    Or is it just luck that it works for me on an Intel Core i7? I have run it a few times with ./dekker 1000000000

    /S

  6. I got a deadlock on my machine, even though I bound the two threads to one CPU core using the sched_setaffinity routine.

    My machine:
    Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
    8 cores

  7. Getting a deadlock is kind of unexpected. Tying each thread to a core should make the error occur more often, since we are not at the mercy of OS scheduling then. But it is strange that it deadlocks, as nothing in this code should be able to spin forever. I would suggest attaching gdb to the process and interrupting it to see what it is doing. Is it spinning in user land, or is it off in the kernel somewhere?

  8. Hi, Jakob. I tried your sample code, and I ran into the problem that the code runs very slowly. I modified the code a little to output “gSharedCounter” every 5 seconds, and here is the result:
    ./dekker 10000000
    Starting task2
    Starting task1
    dekker 2: 8942980
    dekker 1: 8942981
    dekker 1: 8944243
    dekker 2: 8944244
    dekker 2: 8945532
    dekker 1: 8945533
    dekker 1: 8946799
    dekker 2: 8946800

    You can see that the counter quickly gets to 8 million and then slows down…
    I am new to this and I am still trying to find out why.
    Due to some limitations, I use Ubuntu Oneiric on VirtualBox and set the processor number to 1. When I use 2 processors, the code seems OK.
    My host OS is Windows 7 with an Intel Core i5 2520.

  9. Hi, Jakob, I have another question: is there any tool or method to debug or observe such problems? I believe that if a program fails because “there is no constraint in most memory systems about the order in which the updates to ‘flag2’, ‘flag1’, and ‘turn’ become visible to the other processor”, it is not easy to find the root cause.

  10. memon:
    Due to some limitations, I use Ubuntu Oneiric on VirtualBox and set the processor number to 1. When I use 2 processors, the code seems OK.

    Two things. First, running in a VM could have interesting effects, depending on how the VM works. From what I know of how they work, a typical x86-hosted VM probably does not affect the outcome.

    Second, if you use 1 virtual processor, the program will not show the effect. There is no way to get a missed lock on a single processor, as this requires true concurrency; that is the whole point here. Dekker’s works nicely under interleaved execution on a single core, but fails under truly concurrent execution on a truly parallel machine.

  11. memon: Hi, Jakob, I have another question: is there any tool or method to debug or observe such problems? I believe that if a program fails because “there is no constraint in most memory systems about the order in which the updates to ‘flag2’, ‘flag1’, and ‘turn’ become visible to the other processor”, it is not easy to find the root cause.

    Exactly 🙂

    Finding totally missing locks around shared data is detective work. There might be some static analysis tools that could warn that you are accessing data from multiple threads without using known lock operations.

  12. Jakob:

    memon:
    Due to some limitations, I use Ubuntu Oneiric on VirtualBox and set the processor number to 1. When I use 2 processors, the code seems OK.

    Two things. First, running in a VM could have interesting effects, depending on how the VM works. From what I know of how they work, a typical x86-hosted VM probably does not affect the outcome.
    Second, if you use 1 virtual processor, the program will not show the effect. There is no way to get a missed lock on a single processor, as this requires true concurrency; that is the whole point here. Dekker’s works nicely under interleaved execution on a single core, but fails under truly concurrent execution on a truly parallel machine.

    Yes, for the multiprocessor case, I see the inconsistency there…
    ./dekker 10000000
    Output the cpu affinity
    cpu=0 is set
    cpu=1 is set
    cpu=2 is unset
    Starting task1
    Starting task2
    Both threads terminated
    [-] Dekker did not work, sum 19999997 rather than 20000000.
    3 missed updates due to memory consistency races.
    It is quite impressive how well this matches the ideas explained in your article.

    As for the problem of running dekker.c with one processor, I tried to find a PC with a single-core CPU instead of using a VM, but I failed… my oldest friend has a dual-core CPU… Then I tried to run dekker.c using only one core by setting the CPU affinity. It seems I got the same result as in the VM:
    ./dekker 1000000
    Output cpu affinity
    cpu=0 is unset
    cpu=1 is set
    cpu=2 is unset
    Starting task2
    Starting task1
    dekker2 gSharedCounter:422491.
    dekker1 gSharedCounter:422492.
    dekker1 gSharedCounter:423240.
    dekker2 gSharedCounter:423241.

    In 5 seconds, gSharedCounter only increases from 422492 to 423240. I think there might be some problem, but I still could not figure out why…

    Now I run all the tests on a real machine with a Pentium(R) Dual-Core CPU E5800 @ 3.20GHz and Ubuntu 11.10.

  13. In 5 seconds, gSharedCounter only increases from 422492 to 423240. I think there might be some problem, but I still could not figure out why…

    That is very interesting. My guess is that the OS gets into some chaotic scheduler state where the thread that holds the “lock” gets interrupted just after grabbing it, and the other thread then sits in its spin loop for an entire time slice… basically a “priority inversion”, where the thread that should run to make progress is blocked, while the other thread just spins and spins, eating lots of CPU time to no avail.

    Actually, you would almost expect to get that effect a few times in any sufficiently long single-processor execution of this program.
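    If you wanted to avoid burning whole time slices in that situation, one mitigation would be to yield inside the busy-wait loop. A sketch, assuming a POSIX system with sched_yield() from <sched.h>; note that this also changes the timing of the experiment:

    while((flag2 == 1) && (turn == 2))
            sched_yield();  /* give up the CPU instead of spinning */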

  14. Jakob,
    It’s not that the algorithm doesn’t work; it’s that one of its constraints (the memory barriers) was not implemented in the code you compiled.
    I guess strong memory models were implied at the time the algorithm was written, and therefore the constraint was not made explicit!
    Great post anyway, good reminder!

  15. Yes, indeed, Dekker’s was designed at a time when processors were single and memory thus implicitly consistent. That is the point of this post: the algorithm is a nice example of something that works multithreaded but fails multicore.
