The Cost of Volatile

Assessing the scalability of programs and algorithms on multicore hardware is critical. There is an extensive literature on locks and locking schemes, but the exact cost of volatile is less clear.

For the software composition seminar, Stefan Nüsch and I decided to look at this topic. Essentially, we devised a benchmark where multiple threads access objects within a pool. Each thread has a set of objects to work with. To generate contention, the sets can be fully disjoint, partially overlapping, or fully overlapping. The ratio of reads and writes per thread is also configurable.
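
To make the setup concrete, here is a rough Java sketch of the kind of benchmark described (the actual code is on GitHub; the class name, parameters, and the overlap scheme here are made up for illustration):

import java.util.concurrent.ThreadLocalRandom;

class VolatileBenchmarkSketch {

    static class Cell { volatile long value; }

    public static void main(String[] args) throws InterruptedException {
        final int poolSize = 1024, threads = 4, opsPerThread = 1_000_000;
        final double writeRatio = 0.5;   // configurable mix of writes and reads
        final boolean disjoint = false;  // false = full overlap, true = disjoint working sets

        final Cell[] pool = new Cell[poolSize];
        for (int i = 0; i < poolSize; i++) pool[i] = new Cell();

        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                int setSize = poolSize / threads;
                int base = disjoint ? id * setSize : 0;  // this thread's working set
                long sink = 0;
                long start = System.nanoTime();
                for (int op = 0; op < opsPerThread; op++) {
                    Cell c = pool[base + rnd.nextInt(setSize)];
                    if (rnd.nextDouble() < writeRatio) c.value = op; else sink += c.value;
                }
                System.out.println("thread " + id + ": "
                        + (System.nanoTime() - start) / 1_000_000 + " ms (" + sink + ")");
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
    }
}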

On an AMD 64 Dual Core, the graph looks as follows:

[Figure: bench_amd64]

On an i7 Quad Core, the graph looks as follows:

[Figure: bench_i7]

We clearly see that different architectures have different performance profiles.

In future work, we could try to reproduce false sharing and assess the impact of other forms of data locality.

More details about the benchmark and methodology can be found in his presentation. The code is on GitHub.

Here are some links about the semantics of volatile, and memory management in general.

Understanding the Visibility of Side-Effects

This entry is all about the Java Memory Model and the happens-before partial order it formalizes.

The official specification of the JMM is authoritative but unreadable. Fortunately, there are useful resources around that aim at making it more accessible:

Happens-Before

The JMM defines a partial order called happens-before on all actions in the program. Actions are reads and writes to variables, locks and unlocks of monitors etc. The rules for happens-before are as follows:

  • Program order rule. Each action in a thread happens-before every action in that thread that comes later in the program order.
  • Monitor lock rule. An unlock on a monitor lock happens-before every subsequent lock on that same monitor lock.
  • Volatile variable rule. A write to a volatile field happens-before every subsequent read of that same field.
  • Thread start rule. A call to Thread.start on a thread happens-before every action in the started thread.
  • Thread termination rule. Any action in a thread happens-before any other thread detects that this thread has terminated, either by successfully returning from Thread.join or by Thread.isAlive returning false.

The happens-before partial order provides clear guidance about what values are allowed to be returned when data is read. In other words: it formalizes the visibility of side-effects across threads.

A read is allowed to return the value of a write 1) if that write is the last write to that variable before the read along some path in the happens-before order, or 2) if the write is not ordered with respect to that read in the happens-before order.
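
As a minimal sketch of how these rules compose (the class and field names are made up): one thread writes a plain field and then sets a volatile flag; by the program order rule, the volatile variable rule, and condition 1), a second thread that observes the flag as true is guaranteed to also see the plain write.

class Publication {
    int data;                 // plain field
    volatile boolean ready;   // volatile flag

    void writer() {           // executed by thread A
        data = 42;
        ready = true;         // volatile write: the write to data happens-before it
    }

    void reader() {           // executed by thread B
        if (ready) {                   // volatile read
            System.out.println(data);  // guaranteed to print 42
        }
    }
}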

Instructions can be reordered as long as the happens-before order is preserved. The code below

t1 = 1;
t2 = 2;
t3 = t1+t2;
t4 = 2*t1;

can for instance be reordered as follows

t2 = 2;
t1 = 1;
t4 = 2*t1;
t3 = t1+t2;

If two threads share data without using any synchronization, the writes of one thread are considered unordered with respect to the other thread; according to condition 2), the execution is not deterministic and the writes of one thread might or might not be visible to the other thread.

Let us consider the two threads below, and one given execution:

  T1        T2

s = s+1 

          s = s+1

s = s+1 

The second increment might or might not see the previous increment in T1. Similarly, the third increment might or might not see the side-effect of the second increment.

In improperly synchronized programs, a read will return the value of one of the possible matching writes, according to conditions 1) and 2). This corresponds to data races. In properly synchronized programs, a read has one unique ancestor write.

  T1        T2

lock m
  |
s = s+1 
  |
unlock m  
          \  
          lock m
             |
          s = s+1
             |
          unlock m
         /
lock m
  |
s = s+1 
  |
unlock m                  

Implementing Happens-Before

Conditions 1) and 2) specify that side-effects might or might not be visible consistently to threads in improperly synchronized programs. They allow the implementation not to read shared memory all the time, for instance by using CPU caches or compiler optimizations. The specification, however, never speaks of specific implementation choices like caching and optimizations.

For instance, the code

t2 = s+1
t3 = s*2

could be rewritten by a compiler optimization as

t = s
t2 = t+1
t3 = t*2

where t caches the value read from s.

Typically, the compiler will emit memory barriers to flush the CPU caches. The JSR-133 Cookbook provides guidance about implementation issues.

Data Races

Failure to properly synchronize accesses to shared data is called a data race. Inversely, a program is data race free if all accesses to shared state are synchronized. But what does data race freedom exactly mean?

Is the program data race free if,

  1. accesses to shared data happen within critical sections?
  2. shared data is declared volatile?
  3. accesses to shared data are never concurrent?

Issues about the memory model are typically discussed using the double-checked locking pattern and the “pausable” thread pattern. They provide only a partial answer to these questions.

Point 1 is true. If the accesses to shared data are protected by the same critical section, there is no possible data race.

Point 2 is also true. If you declare variables as volatile, side-effects will be made visible to other threads and there will be no data race. But remember: data race freedom does not mean that the behavior is correct. Consider the trivial example below:

counter = counter + 1;

Making the counter volatile won’t suffice to ensure all increments are recorded.
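
As a minimal sketch (hypothetical names): a volatile counter is data race free, yet two threads can still read the same value and overwrite each other's increment, because the read-increment-write is not atomic. An atomic read-modify-write, for instance with AtomicInteger, records every increment.

import java.util.concurrent.atomic.AtomicInteger;

class Counters {
    volatile int racy = 0;                           // data race free, but increments can be lost
    final AtomicInteger exact = new AtomicInteger(); // atomic read-modify-write

    void increment() {
        racy = racy + 1;           // two threads may read the same value and write the same result
        exact.incrementAndGet();   // every increment is recorded
    }
}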

Point 3 holds, but requires some clarification of the underlying assumptions (answers to my stackoverflow question failed to be clear-cut and authoritative). Let us consider the situation where multiple threads manipulate an object X that is plain data, but alternate their temporal execution (so that X is never accessed concurrently) via another object Y that relies on concurrency control (wait, notify, synchronize). Should the fields of object X be volatile or not?

If we assume that the threads are synchronized using at least one of the concurrency control primitives (and not just temporal alternation thanks to, say, well-timed sleep statements), this implies the alternating acquisition of at least one lock m, as for instance in:

while (true) {
    synchronized (m) { m.wait(); }    // suspend until the other thread signals
    counter = counter + 1;
    synchronized (m) { m.notify(); }  // wake up the other thread
}

There is a happens-before dependency between the increment statement and the release of the lock m. By construction, when the lock is released, exactly one thread will acquire it and wait again. So exactly one increment will happen-after the release of the lock. Consequently, each increment is guaranteed to see the value of the previous increment, with no need for volatile.
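
Here is a runnable sketch of this reasoning (the class and field names are made up): two threads take turns incrementing a plain counter. The counter is accessed outside the synchronized blocks, yet the lock hand-off orders the increments, so no volatile is needed.

class Alternation {
    static final Object m = new Object();
    static int counter = 0;   // plain shared field, never accessed concurrently
    static int turn = 0;      // whose turn it is; only accessed while holding m

    public static void main(String[] args) {
        new Thread(() -> run(0)).start();
        new Thread(() -> run(1)).start();
    }

    static void run(int id) {
        try {
            for (int i = 0; i < 5; i++) {
                synchronized (m) {
                    while (turn != id) m.wait();   // suspend until it is our turn
                }
                counter = counter + 1;             // sees the other thread's last increment
                System.out.println("thread " + id + ": counter = " + counter);
                synchronized (m) {
                    turn = 1 - id;                 // hand the turn to the other thread
                    m.notifyAll();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}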


Implementation of Semaphores

For the needs of the experimental Pinocchio research VM, we needed to add support for threading and concurrency. We implemented green threads, à la Squeak, so there is no “real” multi-core concurrency going on. The VM relies on AST interpretation instead of bytecode. With green threads, the interpretation of an AST node can always be considered atomic: no two AST nodes can be interpreted concurrently. This is unlike Java and its memory model, where individual bytecodes can be interpreted concurrently, possibly with nasty side-effects (e.g. manipulation of a long is not atomic). Thread preemption can happen anytime between the evaluation of AST nodes.

How can we add support for semaphores?

The Pharo design

The Pharo design can be informally summarized like this: when a semaphore is instantiated, its counter is set to one. Whenever a block needs to be evaluated in an exclusive way, the counter is checked. If the counter > 0, it is decreased and the block is evaluated. If the counter = 0, the thread is suspended and added to the list of threads currently waiting on this semaphore. When the critical block has been evaluated, the list of suspended threads is checked. If there are no suspended threads, the counter is incremented to 1. Otherwise, one of the suspended threads is picked and resumed.
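
A minimal Java sketch of the same design (the class name is hypothetical, and the explicit list of suspended processes mentioned above is replaced here by the monitor's built-in wait set):

class CountingSemaphore {
    private int excessSignals;

    CountingSemaphore(int initial) { excessSignals = initial; }

    synchronized void acquire() throws InterruptedException {  // "wait" in Smalltalk terms
        while (excessSignals == 0) {
            wait();                   // suspend the calling thread until a signal arrives
        }
        excessSignals--;
    }

    synchronized void release() {     // "signal" in Smalltalk terms
        excessSignals++;
        notify();                     // resume one suspended thread, if any
    }
}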

This implementation explains why Semaphore extends LinkedList. I’m not sure it’s the best design decision, because a semaphore is not conceptually a list and the list protocol should not be exposed by a semaphore. It uses inheritance for implementation reuse, but composition would have been just fine here if the semaphore internally held a linked list (and maybe used a pluggable ownership type system to check that the list does not leak out of the semaphore…).

Also, semaphores must be created using the forMutualExclusion factory method. This method instantiates and initializes the semaphore to allow exactly one execution at a time (hence the term mutual exclusion), but nothing would prevent you from initializing the semaphore so that up to N blocks can be executed concurrently.
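
In Java terms, this N-permit behaviour is what java.util.concurrent.Semaphore provides when created with N permits. A small usage sketch (unrelated to the Pharo code, names made up):

import java.util.concurrent.Semaphore;

class NSlots {
    static final Semaphore gate = new Semaphore(3);  // up to 3 concurrent executions

    static void protectedSection() throws InterruptedException {
        gate.acquire();          // analogous to wait: decrement the counter or suspend
        try {
            // ... the protected block ...
        } finally {
            gate.release();      // analogous to signal: increment the counter or resume a waiter
        }
    }
}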

The code for wait and signal (which respectively decrement and increment the counter) is:

wait
    excessSignals > 0
        ifTrue: [ excessSignals := excessSignals - 1 ]
        ifFalse: [ self addLastLink: Processor activeProcess suspend ]

signal
    self isEmpty
        ifTrue: [ excessSignals := excessSignals + 1 ]
        ifFalse: [ Processor resume: self removeFirstLink ]

They are, however, implemented as primitives. I suspect this is not for performance reasons, but for the sake of concurrency correctness. These operations themselves need to be atomic. If they were implemented in Smalltalk, a thread could be preempted during one of them, breaking the semaphore’s design.

The test-and-set design

These two methods and the internal counter suggest that an implementation relying on a more generic concurrency primitive is possible. Typical concurrency primitives for this are test-and-set or compare-and-swap. We’ve added a primitive testAndSet to Boolean, and implemented the Semaphore with busy waiting (also sometimes called a spin lock):
 
critical: aBlock
    | v |
    "we spin on the lock until we can enter the semaphore"
    [ lock testAndSet ] whileTrue: [ PThread current yield ].
    "we evaluate the block and make sure we reset the flag when we leave it"
    [ v := aBlock value ] ensure: [ lock value: false ].
    ^ v
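
For comparison, here is the same busy-waiting idea sketched in Java, with AtomicBoolean.getAndSet standing in for the testAndSet primitive (the class name is hypothetical; this is not the Pinocchio code):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

class SpinSemaphore {
    private final AtomicBoolean lock = new AtomicBoolean(false);

    <T> T critical(Supplier<T> block) {
        while (lock.getAndSet(true)) {   // spin until the previous value was false
            Thread.yield();
        }
        try {
            return block.get();          // evaluate the protected block
        } finally {
            lock.set(false);             // release the lock even if the block throws
        }
    }
}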

The design could be improved to avoid busy waiting. Instead of yielding, the thread would be suspended and added to a list. In the ensure block, the flag would be reset and one of the suspended threads would be resumed. The resumed thread would however still need to testAndSet the lock, since another thread may have entered the semaphore in the meantime, possibly delaying the resumed thread forever. So if fairness is required, this algorithm is not optimal.

The bakery design

You can also implement critical sections without the support of other concurrency primitives. The most famous such algorithm is probably Lamport’s bakery algorithm:

What is significant about the bakery algorithm is that it implements mutual exclusion without relying on any lower-level mutual exclusion.  Assuming that reads and writes of a memory location are atomic actions, as previous mutual exclusion algorithms had done, is tantamount to assuming mutually exclusive access to the location.  So a mutual exclusion algorithm that assumes atomic reads and writes is assuming lower-level mutual exclusion.  Such an algorithm cannot really be said to solve the mutual exclusion problem.  Before the bakery algorithm, people believed that the mutual exclusion problem was unsolvable–that you could implement mutual exclusion only by using lower-level mutual exclusion.

In our case with green threads, reads and writes are atomic because they are single AST nodes, but this isn’t necessarily the case in general.
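
To make the idea concrete, here is a compact sketch of the bakery algorithm in Java (the class name is hypothetical). AtomicIntegerArray is used only to get per-element visibility, since plain array elements cannot be declared volatile; the algorithm itself relies on no test-and-set, compare-and-swap, or locks.

import java.util.concurrent.atomic.AtomicIntegerArray;

class Bakery {
    private final int n;
    private final AtomicIntegerArray choosing;  // 1 while thread i is picking its ticket
    private final AtomicIntegerArray number;    // ticket number; 0 means not interested

    Bakery(int n) {
        this.n = n;
        this.choosing = new AtomicIntegerArray(n);
        this.number = new AtomicIntegerArray(n);
    }

    void lock(int i) {
        choosing.set(i, 1);
        int max = 0;
        for (int j = 0; j < n; j++) max = Math.max(max, number.get(j));
        number.set(i, max + 1);                          // take a ticket higher than all seen
        choosing.set(i, 0);
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            while (choosing.get(j) == 1) Thread.yield(); // wait until j has its ticket
            while (number.get(j) != 0
                    && (number.get(j) < number.get(i)
                        || (number.get(j) == number.get(i) && j < i))) {
                Thread.yield();                          // lower (ticket, id) pair goes first
            }
        }
    }

    void unlock(int i) {
        number.set(i, 0);
    }
}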

Here ends this little trip into basic concurrency. There is a rich literature on the topic, which is truly fantastic, and we might explore and implement more sophisticated abstractions later on.

References

The Java Memory Model
Thread Synchronization and Critical Section Problem
A New Solution of Dijkstra’s Concurrent Programming Problem