Tuesday, 19 May 2015

Degrees Of (Lock/Wait) Freedom

Yo, Check the diagonal, three brothers gone...
I've been throwing around the terms lock-free and wait-free in the context of the queues I've been writing, perhaps too casually. The definition I was using was the one from D. Vyukov's website (direct quote below):
  • Wait-freedom:  Wait-freedom means that each thread moves forward regardless of external factors like contention from other threads, other threads blocking. Each operation is executed in a bounded number of steps. 
  • Lock-freedom: Lock-freedom means that a system as a whole moves forward regardless of anything. Forward progress for each individual thread is not guaranteed (that is, individual threads can starve). It's a weaker guarantee than wait-freedom. [...] A blocked/interrupted/terminated thread can not prevent forward progress of other threads. Consequently, the system as a whole undoubtedly makes forward progress.
  • Obstruction-freedom: Obstruction-freedom guarantee means that a thread makes forward progress only if it does not encounter contention from other threads. That is, two threads can prevent each other's progress and lead to a livelock. 
  • Blocking Algorithms: It's the weakest guarantee - basically all bets are off, the system as a whole may not make any forward progress. A blocked/interrupted/terminated thread may prevent system-wide forward progress infinitely. 
The above definitions refer to "forward progress/moving forward" in the context of "system" and "thread". In particular they translate "X-freedom" to "a guarantee that T/S makes forward progress".
Now "thread" is a well defined term, but what counts as forward progress of a given thread?
To my mind a thread which is free to return from a method is free to 'make progress'. For example: if the next element in a queue is not visible to a consumer thread I can return control to the consumer thread which is then free to make progress on anything it feels like and try getting an element out of the queue later. Similarly a producer thread which is unable to add elements to a queue is 'free' to 'make progress' even if the queue is full. In short, I interpreted 'freedom' as regaining control of execution.
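To make the control-sense reading concrete, here is a trivial sketch (using j.u.c.ConcurrentLinkedQueue purely for illustration, not one of the queues discussed here): poll returns null instead of waiting, handing control back to the consumer thread.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// 'Regained control' freedom: poll returns null instead of waiting, so the
// consumer thread gets control back and may progress on other work.
class RegainedControl {
    public static void main(String[] args) {
        Queue<Integer> queue = new ConcurrentLinkedQueue<>();
        Integer e = queue.poll(); // nothing visible right now -> null
        if (e == null) {
            // the thread is free to do anything it likes and retry later
            System.out.println("free to make progress elsewhere");
        }
    }
}
```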

Freedom? yeah, right...

My interpretation however is limited to the scope of the thread, and assumes it has other things to do (it places the thread within the context of a given 'system'). So the freedom to make progress is really the freedom to make progress on 'other things'. The definitions above, when applied to a given data structure, have no knowledge of the 'system', so it is fair to assume nothing about it. And if the 'system' is viewed as being concerned only with using the data structure, my view of 'regained control freedom' is not in line with 'progress making freedom'.
Let's consider for example the linked Multi-Producer-Single-Consumer queue discussed in the previous post (this is not the j.u.Queue compliant version):
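The embedded code is not reproduced here, but a minimal sketch of the Vyukov-style design captures the relevant parts (class and field names are mine, not the actual JCTools code; the 'line 34' gap is marked with a comment):

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of D. Vyukov's MPSC linked queue (NOT the j.u.Queue
// compliant version). Names are illustrative, not the actual JCTools code.
final class MpscLinkedQueueSketch<E> {
    static final class Node<E> {
        volatile Node<E> next;
        E value;
        Node(E value) { this.value = value; }
    }

    private final AtomicReference<Node<E>> head; // producers XCHG here
    private Node<E> tail;                        // owned by the single consumer

    MpscLinkedQueueSketch() {
        Node<E> stub = new Node<>(null);
        head = new AtomicReference<>(stub);
        tail = stub;
    }

    // Wait-free: a single XCHG, no retry loop.
    public void offer(E e) {
        Node<E> node = new Node<>(e);
        Node<E> prev = head.getAndSet(node);
        // A producer stalled HERE leaves a 'bubble': the new head is
        // published, but prev.next is still null, hiding the new node
        // (and all nodes added after it) from the consumer.
        prev.next = node;
    }

    // Non-compliant poll: returns null both when the queue is empty and when
    // a 'bubble' hides the next node. The j.u.Queue compliant version must
    // instead spin here when the queue is known to be non-empty -- which is
    // exactly what makes it blocking with respect to a stalled producer.
    public E poll() {
        Node<E> next = tail.next;
        if (next == null) {
            return null;
        }
        E e = next.value;
        next.value = null; // help GC
        tail = next;
        return e;
    }
}
```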

Now let us assume a producer has stalled in that unpleasant gap between setting the head and pointing prev to the new head (at line 34); this is what I've come to think of as a 'bubble' in the queue. The next producer will see the new head, but the consumer will not see it (or any nodes after it) until prev.next is set.
Other producers can 'make progress' in the sense that elements will get added to the queue. The visibility of those elements to the consumer, however, is blocked by the 'bubble'. What is the correct degree of freedom for these producers? What about the consumer?
When describing the queue Vyukov makes the following statements:
  • Wait-free and fast producers. One XCHG is maximum what one can get with multi-producer non-distributed queue.
  • Push [offer in the Java version] function is blocking wrt consumer. I.e. if producer blocked in (* [line 34]), then consumer is blocked too. Fortunately 'window of inconsistency' is extremely small - producer must be blocked exactly in (* [line 34]).
So producers are wait-free, but the consumer is blocking. The consumer is truly blocked by the 'bubble' and can make no progress in terms of the data structure. Despite other producers adding nodes, the consumer cannot see those nodes and is prevented from consuming the data. The fact that control is returned to the caller is considered irrelevant. If we are being precise, we should call this a blocking queue.

Life imitates Art

The "Art of Multiprocessor Programming" offers a different definition:
"A concurrent object implementation is wait free if each method call completes in a finite number of steps. A method is lock-free if it guarantees that infinitely often some method call finishes in a finite number of steps" - page 99
This sentiment is similarly echoed in "Wait-Free Queues With Multiple Enqueuers and Dequeuers", a paper on wait-free queues:

  • [...] to ensure that a process (or a thread) completes its operations in a bounded number of steps, regardless of what other processes (or threads) are doing. This property is known in the literature as (bounded) wait-freedom.
  • lock-freedom ensures that among all processes accessing a queue, at least one will succeed to finish its operation.

Hmmm... no system, no progress. The boundaries are now an object and a method, which are well defined. I think this definition matches my understanding of 'regained control freedom' (open to feedback). If a method returns, progress or no progress, the condition is fulfilled. Under this definition the queue above is wait-free.
The wiki definition is again closer to the one outlined by Vyukov:
  • An algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread for some operations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress.
  • Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.
  • Lock-freedom allows individual threads to starve but guarantees system-wide throughput. An algorithm is lock-free if it satisfies that when the program threads are run sufficiently long at least one of the threads makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free.
We are back to a fuzzy definition of system and progress, but the underlined sentence above is interesting in highlighting suspension. In particular I'm not sure how it is to be interpreted in the context of the lock-free definition, where suspension is not an issue as long as some threads can keep going.
Note that once we fix poll to behave in a j.u.Queue compliant way, the poll method is no longer wait-free by any definition.

What about the JCTools queues?

JCTools covers the full MPMC/MPSC/SPMC/SPSC range of queues, and aims for high performance rather than strict adherence to any of the definitions above. To be more precise in the definitions I would say:
  • SpscArrayQueue/SpscLinkedQueue are wait-free (in both the control and progress senses)
  • MpscArrayQueue is lock-free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait-free in the control sense.
  • MpscLinkedQueue is wait-free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait-free in the control sense.
  • SpmcArrayQueue is lock-free on the consumer side and blocking on the producer side. I'm planning to add a weakOffer method which will be wait-free in the control sense.
  • MpmcArrayQueue is blocking on both the producer and consumer sides. I'm planning to add weakOffer/weakPoll methods which will be lock-free in the control sense.
Does it matter that the queues do not meet the exact definitions set out above? As always, it depends on your needs...

Non-Blocking Algorithms On The JVM?

{Full disclosure: please note when reading the next section that I work with Azul Systems on the Zing JVM, I'm not trying to sell anything, but I am naturally more familiar with it and consider it awesome :-)}
What happens when we include the JVM in our definition of a system? Can you build a non-blocking algorithm on the JVM? All the JVMs I know of are blocking in some particular cases; the ones important to the discussion above are:
  1. Allocation: The common case for allocation is fast and involves no locking, but once the young generation is exhausted a collection is required. Young generation collections are stop-the-world events for Oracle/OpenJDK. Zing has a concurrent young generation collector, but under extreme conditions or bad configuration it may degrade to blocking the allocator. The bottom line is that allocation on the JVM is blocking, and that means you cannot consider an algorithm which allocates to be non-blocking. To be non-blocking a system would have to provably remove the risk of a blocking collection event on allocation.
  2. Deoptimization: Imagine the JIT compiler, in its eager and aggressive compilation strategy, has compiled your offer method in a particular way which ends up not working out (assumptions on inheritance, class loading, and passed-in values all play a part). When an assumption breaks, a deoptimization may take place, and that is a blocking event. It is hard to prove any piece of code is free of deoptimization risk, and therefore it is hard to prove any Java code is non-blocking. Deoptimization is in many cases a warmup issue. Zing is actively battling this issue with ReadyNow, which reloads previously recorded compilation profiles for the JVM, greatly reducing the risk of deoptimization. Oracle/OpenJDK users can do a warmup run of their application before actually using it to reduce the risk of deoptimization. My colleague Doug Hawkins gave a great talk on this topic which will give you far more detail than I want to go into here :-). 
  3. Safepoints: If your algorithm includes a safepoint poll, it can be brought to a halt by the JVM. The JVM may bring all threads to a safepoint for any number of reasons. Your method, unless inlined, already has a safepoint on method exit (OpenJDK) or entry (Zing). Any non-counted loop will have a safepoint poll, and this includes your typical CAS loop. Can you prevent safepoints from ever happening in your system? It might be possible for a JVM to provide users with compiler directives which prevent safepoint polls from being inserted in certain code paths, but I don't know of any JVM offering this feature at the moment.
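As a concrete example of the CAS loop mentioned in point 3, here is a minimal sketch (roughly what AtomicLong.getAndIncrement amounted to before the intrinsic): the retry loop is lock-free but not wait-free, and its backward branch is a non-counted loop carrying a safepoint poll.

```java
import java.util.concurrent.atomic.AtomicLong;

class CasLoop {
    // Lock-free: whenever several threads race, at least one CAS succeeds.
    // Not wait-free: an individual thread may lose the race indefinitely.
    // The backward branch is a non-counted loop, so the JIT keeps a
    // safepoint poll inside it -- the JVM can halt the thread right here.
    static long getAndIncrement(AtomicLong counter) {
        long current;
        do {
            current = counter.get();
        } while (!counter.compareAndSet(current, current + 1));
        return current;
    }
}
```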
In short, if you are on the JVM you are probably already blocking. Any discussion of non-blocking algorithms on the JVM is probably ignoring the JVM as part of the 'system'. This needs to be considered if you are going to go all religious about wait-free vs. lock-free vs. blocking. If a system MUST be non-blocking, it should probably not be on the JVM.

Summary

Any term is only as good as our ability to communicate with others using it. In that sense, splitting hairs about semantics is not very useful in my opinion. It is important to realize people may mean any number of things when talking about lock-free/wait-free algorithms and it is probably a good idea to check if you are all on the same page.
I personally find the distinction between control/progress freedom useful in thinking about algorithms, and I find most people mean lock/wait-free excluding the effects of the JVM...
Thanks Martin & Doug for the review, feedback and discussion, any remaining errors are my own (but their fault ;-) )

9 comments:

  1. The interesting thing about progress-freedom is that we can evaluate an algorithm's "degree of freedom" only given the "degree of freedom" of its building parts -- i.e. instructions. So saying "..this method is non-blocking because there is only one CAS" is true only while we're sure CAS itself is really wait-free. But is it? On any hardware a JVM might run on?

    I.e. let's take getAndAdd() implemented as a CAS-loop (as it was until 1.8) -- this is, theoretically, lock-free (if CAS is indeed lock-free on this JVM impl on this hardware platform), but not wait-free, because, theoretically, the loop could iterate forever, given enough threads to compete. But in 1.8 the JVM (afaik) compiles getAndAdd to an atomic lock xadd -- now it becomes wait-free. But how? Is there really any _nontrivial_ difference between how lock xadd is implemented on top of MESI+ and how (looped lock xchg) is implemented on top of the same MESI+? My guess (only a guess) is: no, there is no non-trivial difference. A CAS-loop is ~ equivalent to lock xadd, just with a bigger window of contention. And if that is true, then a CAS-loop should also be seen as wait-free -- because, given a finite number of physical CPUs to compete, it will terminate in a finite number of iterations.

    So, I mean, we should really keep in mind the separation between computer science and software architecture. A well designed blocking (from the CS point of view) algorithm could easily outperform a wait-free (from the CS point of view) one in most (if not all) practical applications -- and even more, blocking could be equivalent to a wait-free algorithm if we'd go deep enough in reducing the actual instructions executed down to the roots.

    P.S. For some reason the comment doesn't appear after sending, so maybe I've duplicated it a few times, sorry :)

    1. It's a good point that the 'system' definition should include the hardware (and OS/JVM) if we care about the real world. In particular, if we limit our discussion to Intel x86, are the LOCK instructions to be considered blocking? Can we use these instructions and claim lock-freedom?
      How would you translate "An algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread for some operations" to the instruction level? Can we just replace Algorithm with Instruction?
      The answer (as I understand it, on recent CPUs, based on the Intel manuals) is that LOCK instructions (XCHG/LOCK XADD/LOCK CMPXCHG etc) are non-blocking:
      - On old processors the memory bus is locked on a LOCK instruction; newer processors perform a location-limited exclusive access as long as the value is in the cache. This has performance implications, but to my understanding does not make them blocking by the above definition.
      - The instructions are atomic and exclusive; that means threads block each other, but the blocking duration cannot be stretched by "failure or suspension of any thread". Thread failure cannot induce deadlock.
      It is generally accepted in papers I read that the LOCK instructions are wait-free.
      As for the difference between LOCK XADD and a CAS loop, the big difference is in the semantics. A LOCK XADD operation cannot fail: it will get ordered and executed and the result will be returned. There's no failure mode. A CAS loop has distinct read-modify-compareAndUpdate stages, where only the compareAndUpdate is atomic; getAndSet/Add is a single stage. This also means these instructions can't do everything a CAS loop allows you to do (see the Mpsc/Spmc/Mpmc queues for examples where CAS cannot be replaced by LOCK XADD).
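      To make the semantic difference concrete, here is a hypothetical helper (not JCTools code): a CAS loop can guard the update with an arbitrary precondition, which an unconditional getAndAdd/LOCK XADD cannot.

```java
import java.util.concurrent.atomic.AtomicLong;

class BoundedAdd {
    // A CAS loop can check an arbitrary condition before committing --
    // e.g. refuse to move past a capacity bound, as the array queues must
    // when claiming slots. LOCK XADD always adds; it cannot bail out.
    static boolean tryIncrementBounded(AtomicLong v, long bound) {
        long current;
        do {
            current = v.get();
            if (current >= bound) {
                return false; // precondition failed: no update performed
            }
        } while (!v.compareAndSet(current, current + 1));
        return true;
    }
}
```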

  2. About lock-ed instructions: again, how would you implement them given MESI+ as your base cache-coherency protocol? Let core_1 process a lock xadd: first core_1 needs to issue an RFO for the memory location, then the cache line is acquired in E-state, then core_1 needs to read the value into some internal register, add something, and write the result back to the cache line. But what if core_2 issues an RFO for the same cache line while the RMW operation on core_1 is in progress? How will the request be arbitrated? Does core_1 intervene and prevent main memory from satisfying the request (so, effectively, "locking" the cache line)? Or does core_1 just lose its ownership, detect this on writing back the updated value, and retry (so, effectively, re-introducing a CAS-loop at the hardware level)? I'm not sure, but I'd bet on the first option ("locking" the cache line for the full duration of the RMW). And if that is indeed true about the current implementation of RMW in hardware, then is there any difference between such "locking" of cache line ownership, and fine-grained short code under synchronized()?

    My understanding of this difference is the following: it is all about an (unlimited number of) software threads with complex scheduling policy (and don't forget about interrupts!). Given a limited (and known in advance) number of hardware cores and fairness of the "lock" you use at the hardware level -- you are able to give an upper bound on the locked region duration. But having an unknown number of software threads with different priorities and priority switching policy, and an unknown number of interrupts, and without the implementation details of the locks you use -- there is no way to estimate such an upper bound. For me, as a physicist, it is like the transition from newton-laws-guided motion of 1-2-3-.. particles in a box to molecular kinetics with the number of particles rising towards 10^23: you can simulate the exact dynamics of a limited number of particles, but with the number growing you're forced to switch to a statistics-based approach at some point, even though individual particle motion still remains fully deterministic. The "blocking"/"non-blocking" dichotomy looks to me like exactly the same transition.

    1. I'll take the lawyer's way out and hang on the definition of non-blocking as: "failure or suspension of any thread cannot cause failure or suspension of another thread". The HW implementation does not matter, and its speed doesn't matter, as long as instructions are atomic from the executing thread's point of view.
      Can conflicting LOCKed instructions be considered blocking? No, because a thread cannot be suspended or interrupted half way through them. You need a crack to put your wedge in, and an atomic instruction guarantees exactly that there is no crack.
      If we agree that the instructions are at least non-blocking we can argue that LOCKed instructions cannot be considered wait free if starvation is possible. But I'm not sure that starvation is possible, i.e. that there is no bounded return time for LOCKed instructions. I think LOCKed instructions maintain exclusive access to the cache line for the duration of their execution, so there's no gap in which other cores can 'steal' the line. That a core will get exclusive access in a bounded time is perhaps enforced by some fairness of the MESI protocol implementation.
      The last paragraph is entirely conjecture based on internet rumours... I can't find an all out, spelling out details kind of explanation of the hardware level implementation.

    2. I totally agree with most of your reasoning. What I'm trying to point to is the following: while in computer science there are crucial differences between blocking/lock-free/wait-free algorithms, in the engineering area the difference is not crucial (and sometimes even minuscule). E.g. there is no way to implement distributed shared memory with RW access without any locks (==exclusive owning at some granularity) involved. Such locks could be designed to have predictable timing, but, again, such timings are predictable only because at the hardware level you have knowledge of the max mutator count (==physical cores). Imagine you have N physical cores trying to apply lock xadd on the same memory location -- do you really expect the upper bound on _any single lock xadd duration_ to be O(1)? I'd expect it to be _at least_ O(N) -- i.e. this number is fixed only as long as the core count is fixed. Are you working at Azul right now? -- it would be interesting to know how long an atomic_add could take on Azul's Vega with ~1000 cores...

      Another point: afaik there are no practical algorithms which are totally lock-free. All of the "lock-free" implementations I'm aware of are really lock-free only on the fast path -- they all introduce some kind of back-off (sometimes even in the form of blocking) on slow paths. This is, sure, because starvation is still possible in the lock-free world, and from my point of view starvation in the lock-free world is nothing really different from starvation in the blocking world -- i.e. there are specifics, but it's the same phenomenon at its root.

      Yet another point: all of the lock implementations I'm aware of are themselves lock-free (sic!) on the fast path. Well, you know -- all this [CAS(ownerThreadId) -> adaptive spinning -> ok, ok, I back off to enqueue you in the waiters queue and schedule you off the CPU] stuff. So if we're calling, say, the Disruptor lock-free (on the fast path) -- why do we not call code with ReentrantLock lock-free (on the fast path)?

  3. > Does core_1 intervene and prevent main memory to satisfy request (so, effectively, "locks" the cache line)?

    Take it with a big shovel of salt: AFAIK no and yes. Instead of main memory, a higher level cache is involved. It's inclusive, so it has the given cache line. Moreover it knows that it's core_1 that has the only true version.

    > is there any difference between such "locking" of cache line ownership

    Yes. It's locked, but the lock duration is limited by the single instruction. Unlike a synchronized block, it can't be prolonged by a page fault (as it'd abort the instruction) and it obviously can't be prolonged by a context switch.

    1. >Yes. It's locked, but the lock duration is limited by the single instruction. Unlike a synchronized block, it can't be prolonged by a page fault (as it'd abort the instruction) and it obviously can't be prolonged by a context switch.

      Yes, sure. But any thread's execution could be prolonged by a page fault or context switch or some interrupt or... If we worry about thread_2 waiting on a lock acquired by thread_1, which was interrupted inside the lock-ed region because of a page fault -- why don't we worry about thread_2 waiting on a plain local memory load because of the same page fault?

      Sure, there are some reasons to worry about the duration of a lock-ed region _more_ than any thread-local region duration -- e.g. because this way one page fault could slow down not 1, but 2, 3... threads at once. But still I see no crucial difference from a practical point of view.

  4. Being nit-picky, let's dissect the three JVM items you identify as causing blocking: Safepoints, Deopt, and Allocation. I don't think any of these are specific to JVMs. The same qualities are shared by pretty much all OSs and by hardware. The reality is that if a system MUST be non-blocking, it should probably not be on the JVM, an OS, a Hypervisor, hardware-that-uses-power-management, hardware-that-does-ECC, etc. Basically, non-blocking (systems), as opposed to algorithms or sets of threads, only exist on paper, in the sense that if you have a system that MUST be non-blocking (as a system), you probably need to execute it on paper... ;-)

    We probably agree that having the JVM suspend the execution of a thread temporarily for various reasons does not remove wait-freedom qualities from the thread or from the system of threads your algorithm (or whole program) represents. The JVM itself is not wait-free. But neither is any OS I know of. And neither is hardware. E.g. on any OS with interrupts enabled, any thread may experience a delay for some amount of time for interrupt and/or exception processing, but will still be considered wait-free if the other qualities hold.

    The key point is that all three examples of JVM blocking stuff are bound in time, and do not create any new cross-thread-blocking potential within the affected threads (that run on the JVM):

    Safepoint: A JVM may bring all threads to a global safepoint that can last an arbitrary (but bounded) amount of time. That makes for an ugly pause, but it does not make the threads running on it non-wait-free. Not any more than e.g. a page fault in an OS does to a C program, at least. The magnitude may be larger and uglier (on JVMs that make threads wait for large and ugly safepoints to complete), but the concept is the same.

    Time-to-safepoint (TTSP): The real issue with safepoints is cross-thread dependence on the execution of code, and the resulting time-to-safepoint that extends the perceived stall seen by all threads in a global safepoint. In the fairytale land of "all-safepoints-all-the-time", a safepoint can happen anywhere in code, without any necessity for instructions to be executed by any thread, removing the dependence between threads. However, in the practical world of optimized code, safepoints may occur more rarely in code, which means that threads need to "execute forward" to reach a safepoint, creating a cross-thread dependency. When safepoint opportunities are eliminated from long code paths (e.g. counted loops that count to a billion, or runtime things like array fills of 1GB), TTSP can get very long. When page faults may need to be satisfied for threads to make forward progress (e.g. when a thread is blocked on access to a mapped file), that gets ugly too. However, as ugly as TTSP is, it is still a bounded operation. It does not (in itself) make the threads affected by it non-wait-free.

    Deoptimization: Deoptimization without safepoints is possible. But I'll grant that all current JVMs use global safepoints for at least some deoptimizations. And global safepoints are ugly. However, see safepoint discussion above for why they do not make affected threads non-wait-free any more than an OS fault handler that grabs a spin lock does.

    Allocation: Allocation on the JVM is actually in better shape than allocation on anything else in this regard... The key is that it's not the JVM that stalls you when you allocate, it's the allocation itself that includes a potential semantic lock (just like a get on a semaphore does). An allocating C thread (using any variant of malloc() or new) is just as susceptible to long delayed operations. Whenever you allocate from something other than a completely-thread-local-self-managed-pool-of-memory thing, you've introduced blocking into your algorithm. JVMs (and other GC'ed runtimes) reduce the likelihood of allocation-related contention and blocking dramatically compared to other environments, but allocation is still a blocking operation, just like anywhere else.

    1. Going back to the qualification of non-blocking as "failure or suspension of any thread cannot cause failure or suspension of another thread", anything which triggers a global safepoint is blocking. The global safepoint is in effect a lock, and any thread waiting on it falls under "failure or suspension of another thread". If you suspend a Java thread on its way to a safepoint, or the GC thread in a safepoint, everyone is stuck. The amount of time you plan to spend in a lock, even a short or bounded one, does not matter if suspending a thread inside it will have the effect of blocking some other thread until the thread inside the lock is resumed.
      The reason I separate allocation and deopt from safepoints is to differentiate cause and effect. A Java thread can cause a global safepoint through allocation and deoptimization (and a few other ways). A Java thread can be affected by other Java threads, or the JVM itself, when hitting a safepoint initiated by them.
      > "non-blocking (systems) as opposed to algorithms or sets of threads only exist on paper, in the sense that if you have a system that MUST be non-blocking (as a system), you probably need to execute it on paper"
      This is probably true; my point is that to communicate using the "lock free" term we need to recognise it means different things to different people in different contexts. It's fuzzy on the boundaries of "system" and "progress", and should perhaps not be waved around quite as enthusiastically as I have been waving it :-).
