Tuesday, 19 May 2015

Degrees Of (Lock/Wait) Freedom

Yo, Check the diagonal, three brothers gone...
I've been throwing around the terms lock-free and wait-free in the context of the queues I've been writing, perhaps too casually. The definition I was using was the one from D. Vyukov's website (direct quote below):
  • Wait-freedom:  Wait-freedom means that each thread moves forward regardless of external factors like contention from other threads, other threads blocking. Each operation is executed in a bounded number of steps. 
  • Lock-freedom: Lock-freedom means that a system as a whole moves forward regardless of anything. Forward progress for each individual thread is not guaranteed (that is, individual threads can starve). It's a weaker guarantee than wait-freedom. [...] A blocked/interrupted/terminated thread can not prevent forward progress of other threads. Consequently, the system as a whole undoubtedly makes forward progress.
  • Obstruction-freedom: Obstruction-freedom guarantee means that a thread makes forward progress only if it does not encounter contention from other threads. That is, two threads can prevent each other's progress and lead to a livelock. 
  • Blocking Algorithms: It's the weakest guarantee - basically all bets are off, the system as a whole may not make any forward progress. A blocked/interrupted/terminated thread may prevent system-wide forward progress infinitely. 
The above definitions refer to "forward progress/moving forward" in the context of "system" and "thread". In particular they translate "X-freedom" to "a guarantee that T/S makes forward progress".
Now "thread" is a well defined term, but what counts as forward progress of a given thread?
To my mind a thread which is free to return from a method is free to 'make progress'. For example: if the next element in a queue is not visible to a consumer thread I can return control to the consumer thread which is then free to make progress on anything it feels like and try getting an element out of the queue later. Similarly a producer thread which is unable to add elements to a queue is 'free' to 'make progress' even if the queue is full. In short, I interpreted 'freedom' as regaining control of execution.
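To make the control-sense reading concrete, here is a trivial sketch (using j.u.c.ConcurrentLinkedQueue purely for illustration, not one of the queues discussed here): poll returns null instead of waiting, handing control back to the consumer thread.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// 'Regained control' freedom: poll returns null instead of waiting, so the
// consumer thread gets control back and may progress on other work.
class RegainedControl {
    public static void main(String[] args) {
        Queue<Integer> queue = new ConcurrentLinkedQueue<>();
        Integer e = queue.poll(); // nothing visible right now -> null
        if (e == null) {
            // the thread is free to do anything it likes and retry later
            System.out.println("free to make progress elsewhere");
        }
    }
}
```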

Freedom? yeah, right...

My interpretation however is limited to the scope of the thread, and assumes it has other things to do (it places the thread within the context of a given 'system'). So the freedom to make progress is really the freedom to make progress on 'other things'. The definitions above, when applied to a given data structure, have no knowledge of the 'system', so it is fair to assume nothing about it. And if the 'system' is viewed as being concerned only with using the data structure, my view of 'regained control freedom' is not in line with 'progress making freedom'.
Let's consider for example the linked Multi-Producer-Single-Consumer queue discussed in the previous post (this is not the j.u.Queue compliant version):
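The embedded code is not reproduced here, but a minimal sketch of the Vyukov-style design captures the relevant parts (class and field names are mine, not the actual JCTools code; the 'line 34' gap is marked with a comment):

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of D. Vyukov's MPSC linked queue (NOT the j.u.Queue
// compliant version). Names are illustrative, not the actual JCTools code.
final class MpscLinkedQueueSketch<E> {
    static final class Node<E> {
        volatile Node<E> next;
        E value;
        Node(E value) { this.value = value; }
    }

    private final AtomicReference<Node<E>> head; // producers XCHG here
    private Node<E> tail;                        // owned by the single consumer

    MpscLinkedQueueSketch() {
        Node<E> stub = new Node<>(null);
        head = new AtomicReference<>(stub);
        tail = stub;
    }

    // Wait-free: a single XCHG, no retry loop.
    public void offer(E e) {
        Node<E> node = new Node<>(e);
        Node<E> prev = head.getAndSet(node);
        // A producer stalled HERE leaves a 'bubble': the new head is
        // published, but prev.next is still null, hiding the new node
        // (and all nodes added after it) from the consumer.
        prev.next = node;
    }

    // Non-compliant poll: returns null both when the queue is empty and when
    // a 'bubble' hides the next node. The j.u.Queue compliant version must
    // instead spin here when the queue is known to be non-empty -- which is
    // exactly what makes it blocking with respect to a stalled producer.
    public E poll() {
        Node<E> next = tail.next;
        if (next == null) {
            return null;
        }
        E e = next.value;
        next.value = null; // help GC
        tail = next;
        return e;
    }
}
```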

Now let us assume a producer has stalled in that unpleasant gap between setting the head and pointing prev to the new head (at line 34); this is what I've come to think of as a 'bubble' in the queue. The next producer will see the new head, but the consumer will not see it (or any nodes after it) until prev.next is set.
Other producers can 'make progress' in the sense that elements will get added to the queue. The visibility of those elements to the consumer, however, is blocked by the 'bubble'. What is the correct degree of freedom for these producers? What about the consumer?
When describing the queue Vyukov makes the following statements:
  • Wait-free and fast producers. One XCHG is maximum what one can get with multi-producer non-distributed queue.
  • Push [offer in the Java version] function is blocking wrt consumer. I.e. if producer blocked in (* [line 34]), then consumer is blocked too. Fortunately 'window of inconsistency' is extremely small - producer must be blocked exactly in (* [line 34]).
So producers are wait-free, but the consumer is blocking. The consumer is truly blocked by the 'bubble' and can make no progress in terms of the data structure. Despite other producers adding nodes, the consumer cannot see those nodes and is prevented from consuming the data. The fact that control is returned to the caller is considered irrelevant. If we are being precise, we should call this a blocking queue.

Life imitates Art

The "Art of Multiprocessor Programming" offers a different definition:
"A concurrent object implementation is wait free if each method call completes in a finite number of steps. A method is lock-free if it guarantees that infinitely often some method call finishes in a finite number of steps" - page 99
This sentiment is similarly echoed in "Wait-Free Queues With Multiple Enqueuers and Dequeuers", a paper on wait-free queues:

  • [...] to ensure that a process (or a thread) completes its operations in a bounded number of steps, regardless of what other processes (or threads) are doing. This property is known in the literature as (bounded) wait-freedom.
  • lock-freedom ensures that among all processes accessing a queue, at least one will succeed to finish its operation.

Hmmm... no system, no progress. The boundaries are now an object and a method, which are well defined. I think this definition matches my understanding of 'regained control freedom' (open to feedback). If a method returns, progress or no progress, the condition is fulfilled. Under this definition the queue above is wait-free.
The wiki definition is again closer to the one outlined by Vyukov:
  • An algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread for some operations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress.
  • Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.
  • Lock-freedom allows individual threads to starve but guarantees system-wide throughput. An algorithm is lock-free if it satisfies that when the program threads are run sufficiently long at least one of the threads makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free.
We are back to a fuzzy definition of system and progress, but the underlined sentence above is interesting in highlighting suspension. In particular I'm not sure how it is to be interpreted in the context of the lock-free definition, where suspension is not an issue as long as some threads can keep going.
Note that once we fix poll to behave in a j.u.Queue compliant way, the poll method is no longer wait-free by any definition.

What about the JCTools queues?

JCTools covers the full MPMC/MPSC/SPMC/SPSC range of queues, and aims for high performance rather than strict adherence to any of the definitions above. To be more precise in the definitions I would say:
  • SpscArrayQueue/SpscLinkedQueue are wait-free (in both the control and progress senses)
  • MpscArrayQueue is lock-free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait-free in the control sense.
  • MpscLinkedQueue is wait-free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait-free in the control sense.
  • SpmcArrayQueue is lock-free on the consumer side and blocking on the producer side. I'm planning to add a weakOffer method which will be wait-free in the control sense.
  • MpmcArrayQueue is blocking on both the producer and consumer sides. I'm planning to add weakOffer/weakPoll methods which will be lock-free in the control sense.
Does it matter that the queues do not meet the exact definitions set out above? As always, it depends on your needs...

Non-Blocking Algorithms On The JVM?

{Full disclosure: please note when reading the next section that I work with Azul Systems on the Zing JVM, I'm not trying to sell anything, but I am naturally more familiar with it and consider it awesome :-)}
What happens when we include the JVM in our definition of a system? Can you build a non-blocking algorithm on the JVM? All the JVMs I know of are blocking in some particular cases; the ones important to the discussion above are:
  1. Allocation: The common case for allocation is fast and involves no locking, but once the young generation is exhausted a collection is required. Young generation collections are stop-the-world events for Oracle/OpenJDK. Zing has a concurrent young generation collector, but under extreme conditions or bad configuration it may degrade to blocking the allocator. The bottom line is that allocation on the JVM is blocking, and that means you cannot consider an algorithm which allocates to be non-blocking. To be non-blocking a system would have to provably remove the risk of a blocking collection event on allocation.
  2. Deoptimization: Imagine the JIT compiler, in its eager and aggressive compilation strategy, has compiled your offer method in a particular way which ends up not working out (assumptions on inheritance, class loading, and passed-in values all play a part). When an assumption breaks, a deoptimization may take place, and that is a blocking event. It is hard to prove any piece of code is free of deoptimization risk, and therefore it is hard to prove any Java code is non-blocking. Deoptimization is in many cases a warmup issue. Zing is actively battling this issue with ReadyNow, which reloads previously recorded compilation profiles for the JVM, greatly reducing the risk of deoptimization. Oracle/OpenJDK users can do a warmup run of their application before actually using it to reduce the risk of deoptimization. My colleague Doug Hawkins gave a great talk on this topic which will give you far more detail than I want to go into here :-). 
  3. Safepoints: If your algorithm includes a safepoint poll, it can be brought to a halt by the JVM. The JVM may bring all threads to a safepoint for any number of reasons. Your method, unless inlined, already has a safepoint on method exit (OpenJDK) or entry (Zing). Any non-counted loop will have a safepoint poll, and this includes your typical CAS loop. Can you prevent safepoints from ever happening in your system? It might be possible for a JVM to provide users with compiler directives which prevent safepoint polls from being inserted in certain code paths, but I don't know of any JVM offering this feature at the moment.
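As a concrete example of the CAS loop mentioned in point 3, here is a minimal sketch (roughly what AtomicLong.getAndIncrement amounted to before the intrinsic): the retry loop is lock-free but not wait-free, and its backward branch is a non-counted loop carrying a safepoint poll.

```java
import java.util.concurrent.atomic.AtomicLong;

class CasLoop {
    // Lock-free: whenever several threads race, at least one CAS succeeds.
    // Not wait-free: an individual thread may lose the race indefinitely.
    // The backward branch is a non-counted loop, so the JIT keeps a
    // safepoint poll inside it -- the JVM can halt the thread right here.
    static long getAndIncrement(AtomicLong counter) {
        long current;
        do {
            current = counter.get();
        } while (!counter.compareAndSet(current, current + 1));
        return current;
    }
}
```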
In short, if you are on the JVM you are probably already blocking. Any discussion of non-blocking algorithms on the JVM is probably ignoring the JVM as part of the 'system'. This needs to be considered if you are going to go all religious about wait-free vs. lock-free vs. blocking. If a system MUST be non-blocking, it should probably not be on the JVM.

Summary

Any term is only as good as our ability to communicate with others using it. In that sense, splitting hairs about semantics is not very useful in my opinion. It is important to realize people may mean any number of things when talking about lock-free/wait-free algorithms and it is probably a good idea to check if you are all on the same page.
I personally find the distinction between control/progress freedom useful in thinking about algorithms, and I find most people mean lock/wait-free excluding the effects of the JVM...
Thanks Martin & Doug for the review, feedback and discussion, any remaining errors are my own (but their fault ;-) )

9 comments:

  1. The interesting thing about progress-freedom is that we can evaluate an algorithm's "degree of freedom" only given the "degree of freedom" of its building parts -- i.e. instructions. So saying "..this method is non-blocking because there is only one CAS" is true only while we're sure CAS itself is really wait-free. But is it? On any hardware a JVM might run on?

    I.e. let's take getAndAdd() implemented as a CAS-loop (as it was until 1.8) -- this is, theoretically, lock-free (if CAS is indeed lock-free on this JVM impl on this hardware platform), but not wait-free, because, theoretically, the loop could iterate forever, given enough threads to compete. But in 1.8 the JVM (afaik) compiles getAndAdd to an atomic lock xadd -- now it becomes wait-free. But how? Is there really any _nontrivial_ difference between how lock xadd is implemented on top of MESI+ and how (looped lock xchg) is implemented on top of the same MESI+? My guess (only a guess) is: no, there is no non-trivial difference. A CAS-loop is ~ equivalent to lock xadd, just with a bigger window of contention. And if that is true, then a CAS-loop should also be seen as wait-free -- because, given a finite number of physical CPUs to compete, it will terminate in a finite number of iterations.

    So, I mean, we should really keep in mind the separation between computer science and software architecture. A well designed blocking (from the CS point of view) algorithm could easily outperform a wait-free (from the CS point of view) one in most (if not all) practical applications -- and even more, blocking could be equivalent to a wait-free algorithm if we'd go deep enough in reducing the actual instructions executed down to the roots.

    P.S. For some reason the comment doesn't appear after sending, so maybe I've duplicated it a few times, sorry :)

    1. It's a good point that the 'system' definition should include the hardware (and OS/JVM) if we care about the real world. In particular, if we limit our discussion to Intel x86, are the LOCK instructions to be considered blocking? Can we use these instructions and claim lock-freedom?
      How would you translate "An algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread for some operations" to the instruction level? Can we just replace Algorithm with Instruction?
      The answer (as I understand it, on recent CPUs, based on the Intel manuals) is that LOCK instructions (XCHG/LOCK XADD/LOCK CMPXCHG etc) are non-blocking:
      - On old processors the memory bus is locked on a LOCK instruction; newer processors perform a location-limited exclusive access as long as the value is in the cache. This has performance implications, but to my understanding does not make them blocking by the above definition.
      - The instructions are atomic and exclusive; that means threads block each other, but the blocking duration cannot be stretched by "failure or suspension of any thread". Thread failure cannot induce deadlock.
      It is generally accepted in papers I read that the LOCK instructions are wait-free.
      As for the difference between LOCK XADD and a CAS loop, the big difference is in the semantics. A LOCK XADD operation cannot fail: it will get ordered and executed and the result will be returned. There's no failure mode. A CAS loop has distinct read-modify-compareAndUpdate stages, where only the compareAndUpdate is atomic; getAndSet/Add is a single stage. This also means these instructions can't do everything a CAS loop allows you to do (see the Mpsc/Spmc/Mpmc queues for examples where CAS cannot be replaced by LOCK XADD).
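      To make the semantic difference concrete, here is a hypothetical helper (not JCTools code): a CAS loop can guard the update with an arbitrary precondition, which an unconditional getAndAdd/LOCK XADD cannot.

```java
import java.util.concurrent.atomic.AtomicLong;

class BoundedAdd {
    // A CAS loop can check an arbitrary condition before committing --
    // e.g. refuse to move past a capacity bound, as the array queues must
    // when claiming slots. LOCK XADD always adds; it cannot bail out.
    static boolean tryIncrementBounded(AtomicLong v, long bound) {
        long current;
        do {
            current = v.get();
            if (current >= bound) {
                return false; // precondition failed: no update performed
            }
        } while (!v.compareAndSet(current, current + 1));
        return true;
    }
}
```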

  2. About lock-ed instructions: again, how would you implement them given MESI+ as your base cache-coherency protocol? Let core_1 process a lock xadd: first core_1 needs to issue an RFO for the memory location, then the cache line is acquired in E-state, then core_1 needs to read the value into some internal register, add something, and write the result back to the cache line. But what if core_2 issues an RFO for the same cache line while the RMW operation on core_1 is in progress? How will the request be arbitrated? Does core_1 intervene and prevent main memory from satisfying the request (so, effectively, "locking" the cache line)? Or does core_1 just lose its ownership, detect this on writing back the updated value, and retry (so, effectively, re-introducing a CAS-loop at the hardware level)? I'm not sure, but I'd bet on the first option ("locking" the cache line for the full duration of the RMW). And if that is indeed true about the current implementation of RMW in hardware, then is there any difference between such "locking" of cache line ownership, and fine-grained short code under synchronized()?

    My understanding of this difference is the following: it is all about an (unlimited number of) software threads with complex scheduling policy (and don't forget about interrupts!). Given a limited (and known in advance) number of hardware cores and fairness of the "lock" you use at the hardware level -- you are able to give an upper bound on the locked region duration. But having an unknown number of software threads with different priorities and priority switching policy, and an unknown number of interrupts, and without the implementation details of the locks you use -- there is no way to estimate such an upper bound. For me, as a physicist, it is like the transition from newton-laws-guided motion of 1-2-3-.. particles in a box to molecular kinetics with the number of particles rising towards 10^23: you can simulate the exact dynamics of a limited number of particles, but with the number growing you're forced to switch to a statistics-based approach at some point, even though individual particle motion still remains fully deterministic. The "blocking"/"non-blocking" dichotomy looks to me like exactly the same transition.

    1. I'll take the lawyer's way out and hang on the definition of non-blocking as: "failure or suspension of any thread cannot cause failure or suspension of another thread". The HW implementation does not matter, and its speed doesn't matter, as long as instructions are atomic from the executing thread's point of view.
      Can conflicting LOCKed instructions be considered blocking? No, because a thread cannot be suspended or interrupted half way through them. You need a crack to put your wedge in, and an atomic instruction guarantees exactly that there is no crack.
      If we agree that the instructions are at least non-blocking we can argue that LOCKed instructions cannot be considered wait free if starvation is possible. But I'm not sure that starvation is possible, i.e. that there is no bounded return time for LOCKed instructions. I think LOCKed instructions maintain exclusive access to the cache line for the duration of their execution, so there's no gap in which other cores can 'steal' the line. That a core will get exclusive access in a bounded time is perhaps enforced by some fairness of the MESI protocol implementation.
      The last paragraph is entirely conjecture based on internet rumours... I can't find an all out, spelling out details kind of explanation of the hardware level implementation.

    2. I totally agree with most of your reasoning. What I'm trying to point to is the following: while in computer science there are crucial differences between blocking/lock-free/wait-free algorithms, in the engineering area the difference is not crucial (and sometimes even minuscule). E.g. there is no way to implement distributed shared memory with RW access without any locks (==exclusive owning at some granularity) involved. Such locks could be designed to have predictable timing, but, again, such timings are predictable only because at the hardware level you have knowledge of the max mutator count (==physical cores). Imagine you have N physical cores trying to apply lock xadd on the same memory location -- do you really expect the upper bound on _any single lock xadd duration_ to be O(1)? I'd expect it to be _at least_ O(N) -- i.e. this number is fixed only as long as the core count is fixed. Are you working at Azul right now? -- it would be interesting to know how long an atomic_add could take on Azul's Vega with ~1000 cores...

      Another point: afaik there are no practical algorithms which are totally lock-free. All of the "lock-free" implementations I'm aware of are really lock-free only on the fast path -- they all introduce some kind of back-off (sometimes even in the form of blocking) on slow paths. This is, sure, because starvation is still possible in the lock-free world, and from my point of view starvation in the lock-free world is nothing really different from starvation in the blocking world -- i.e. there are specifics, but it's the same phenomenon at its root.

      Yet another point: all of the lock implementations I'm aware of are themselves lock-free (sic!) on the fast path. Well, you know -- all this [CAS(ownerThreadId) -> adaptive spinning -> ok, ok, I back off to enqueue you in the waiters queue and schedule you off the CPU] stuff. So if we're calling, say, the Disruptor lock-free (on the fast path) -- why do we not call code with ReentrantLock lock-free (on the fast path)?

  3. > Does core_1 intervene and prevent main memory to satisfy request (so, effectively, "locks" the cache line)?

    Take it with a big shovel of salt: AFAIK no and yes. Instead of main memory, a higher level cache is involved. It's inclusive, so it has the given cache line. Moreover it knows that it's core_1 that has the only true version.

    > is there any difference between such "locking" of cache line ownership

    Yes. It's locked, but the lock duration is limited by the single instruction. Unlike a synchronized block, it can't be prolonged by a page fault (as it'd abort the instruction) and it obviously can't be prolonged by a context switch.

    1. >Yes. It's locked, but the lock duration is limited by the single instruction. Unlike a synchronized block, it can't be prolonged by a page fault (as it'd abort the instruction) and it obviously can't be prolonged by a context switch.

      Yes, sure. But any thread's execution could be prolonged by a page fault or context switch or some interrupt or... If we worry about thread_2 waiting on a lock acquired by thread_1, which was interrupted inside the lock-ed region because of a page fault -- why don't we worry about thread_2 waiting on a plain local memory load because of the same page fault?

      Sure, there are some reasons to worry about the duration of a lock-ed region _more_ than any thread-local region duration -- e.g. because this way one page fault could slow down not 1, but 2, 3... threads at once. But still I see no crucial difference from a practical point of view.

  4. Being nit-picky, let's dissect the three JVM items you identify as causing blocking: Safepoints, Deopt, and Allocation. I don't think any of these are specific to JVMs. The same qualities are shared by pretty much all OSs and by hardware. The reality is that if a system MUST be non-blocking, it should probably not be on the JVM, an OS, a Hypervisor, hardware-that-uses-power-management, hardware-that-does-ECC, etc. Basically, non-blocking (systems), as opposed to algorithms or sets of threads, only exist on paper, in the sense that if you have a system that MUST be non-blocking (as a system), you probably need to execute it on paper... ;-)

    We probably agree that having the JVM suspend the execution of a thread temporarily for various reasons does not remove wait-freedom qualities from the thread or from the system of threads your algorithm (or whole program) represents. The JVM itself is not wait-free. But neither is any OS I know of. And neither is hardware. E.g. on any OS with interrupts enabled, any thread may experience a delay for some amount of time for interrupt and/or exception processing, but will still be considered wait-free if the other qualities hold.

    The key point is that all three examples of JVM blocking stuff are bound in time, and do not create any new cross-thread-blocking potential within the affected threads (that run on the JVM):

    Safepoint: A JVM may bring all threads to a global safepoint that can last an arbitrary (but bounded) amount of time. That makes for an ugly pause, but it does not make the threads running on it non-wait-free. Not any more than e.g. a page fault in an OS does to a C program, at least. The magnitude may be larger and uglier (on JVMs that make threads wait for large and ugly safepoints to complete), but the concept is the same.

    Time-to-safepoint (TTSP): The real issue with safepoints is cross-thread dependence on the execution of code, and the resulting time-to-safepoint that extends the perceived stall seen by all threads in a global safepoint. In the fairytale land of "all-safepoints-all-the-time", a safepoint can happen anywhere in code, without any necessity for instructions to be executed by any thread, removing the dependence between threads. However, in the practical world of optimized code, safepoints may occur more rarely in code, which means that threads need to "execute forward" to reach a safepoint, creating a cross-thread dependency. When safepoint opportunities are eliminated from long code paths (e.g. counted loops that count to a billion, or runtime things like array fills of 1GB), TTSP can get very long. When page faults may need to be satisfied for threads to make forward progress (e.g. when a thread is blocked on access to a mapped file), that gets ugly too. However, as ugly as TTSP is, it is still a bounded operation. It does not (in itself) make the threads affected by it non-wait-free.

    Deoptimization: Deoptimization without safepoints is possible. But I'll grant that all current JVMs use global safepoints for at least some deoptimizations. And global safepoints are ugly. However, see safepoint discussion above for why they do not make affected threads non-wait-free any more than an OS fault handler that grabs a spin lock does.

    Allocation: Allocation on the JVM is actually in better shape than allocation on anything else in this regard... The key is that it's not the JVM that stalls you when you allocate, it's the allocation itself that includes a potential semantic lock (just like a get on a semaphore does). An allocating C thread (using any variant of malloc() or new) is just as susceptible to long delayed operations. Whenever you allocate from something other than a completely-thread-local-self-managed-pool-of-memory thing, you've introduced blocking into your algorithm. JVMs (and other GC'ed runtimes) reduce the likelihood of allocation-related contention and blocking dramatically compared to other environments, but allocation is still a blocking operation, just like anywhere else.

    1. Going back to the qualification of non-blocking as "failure or suspension of any thread cannot cause failure or suspension of another thread", anything which triggers a global safepoint is blocking. The global safepoint is in effect a lock, and any thread waiting on it falls under "failure or suspension of another thread". If you suspend a Java thread on its way to a safepoint, or the GC thread in a safepoint, everyone is stuck. The amount of time you plan to spend in a lock, even a short or bounded one, does not matter if suspending a thread inside it will have the effect of blocking some other thread until the thread inside the lock is resumed.
      The reason I separate allocation and deopt from safepoints is to differentiate cause and effect. A Java thread can cause a global safepoint through allocation and deoptimization (and a few other ways). A Java thread can be affected by other Java threads, or the JVM itself, when hitting a safepoint initiated by them.
      > "non-blocking (systems) as opposed to algorithms or sets of threads only exist on paper, in the sense that if you have a system that MUST be non-blocking (as a system), you probably need to execute it on paper"
      This is probably true; my point is that to communicate using the "lock free" term we need to recognise it means different things to different people in different contexts. It's fuzzy on the boundaries of "system" and "progress", and should perhaps not be waved around quite as enthusiastically as I have been waving it :-).
