Tuesday, 20 December 2016

What do Atomic*::lazySet/Atomic*FieldUpdater::lazySet/Unsafe::putOrdered* actually mean?

Paved with well intended definitions it is.
lazySet/putOrdered (or an ordered store) was added as a bit of a rushed/non-commital afterthought after the JMM was done, so it's description is subject to many debates on the mailing lists, stack overflow and watercoolers the world over. This post merely tries to provide a clear definition with references to relevant/reputable sources.

An ordered store is a weakened volatile store[2][5]. It prevents preceding stores and loads from being reordered with the store[1][3][6], but does not prevent subsequent stores and loads from being reordered with it[2][4].
If there was a JMM cookbook entry for ordered store defining it with barriers in mind it would seem that the consensus is that ordered stores are preceded by a StoreStore AND a LoadStore barrier[4][6].

Ordered store is practically the same as a C++ memory_release_store[5][7].

[1] Original bug:
"lazySet provides a preceding store-store barrier (which is either a no-op or very cheap on current platforms), but no store-load barrier"
See here: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6275329

[2] java.util.concurrent docs:
"lazySet has the memory effects of writing (assigning) a volatile variable except that it permits reorderings with subsequent (but not previous) memory actions that do not themselves impose reordering constraints with ordinary non-volatile writes."
See here: https://docs.oracle.com/javase/8/docs/api/?java/util/concurrent/package-summary.html

[3] JMM cookbook: Defining barriers meaning here: http://g.oswego.edu/dl/jmm/cookbook.html

[4] concurrency-interest Q&A with Doug Lea, October 2011:
"[Ruslan]:... If it happens (== we see spin-wait loop finished) -- does it mean,that all writes preceding lazySet are also done, committed, and visible to thread 2, which finished spin-wait loop?
[Doug]: Yes, although technically, you cannot show this by reference to the Synchronization Order in the current JLS.
lazySet basically has the properties of a TSO store"
See here: http://cs.oswego.edu/pipermail/concurrency-interest/2011-October/008296.html

The discussion is REALLY worth reading, involving Hans, Doug, Vitaly, Ruslan and other such respectable members of this excellent mailing list. Go on, I'll wait.

The discussion on that thread concludes the following is required:
LoadStore + StoreStore
st [Y],X // store X into memory address Y

Outcome: Stores before and after are now prevented from floating across the barrier. Loads before the barrier are also prevented from floating down. Later loads are free to float up. Note that st may in theory be delayed indefinitely, certainly other loads and stores are allowed to float up between it and the barrier.

[5] concurrency-interest Q&A with Aleksey Shipilev, May 2016:
"putOrdered is a release in disguise, most of the C++11 std::atomic(...,
mem_order_release) reasoning applies here."
"acquire/release are the relaxations from the usual volatile
rules -- while producing happens-before-s, they drop from total
synchronization order, thus breaking sequential consistency."

And adds some fine examples:
"Safe publication still works:

                       int x; volatile int y;
    put(x, 1);                   |  r1 = get{Acquire|Volatile}(y);
    put{Release|Volatile}(y, 2); |  r2 = get(x);

(r1, r2) = (2, 0) is forbidden.

But anything trickier that requires sequential consistency fails. IRIW
fails, because no consistent write order observed by all threads. Dekker
fails, because release stores followed by loads may or may not be
visible in program order:

                     volatile int x; volatile int y;
    putRelease(x, 1);            |    putRelease(y, 1);
    r1 = getAcquire(y);          |    r2 = getAcquire(x);

(r1, r2) = (0, 0) is allowed. Forbidden if all ops are volatile.

Safe construction still does not work (even for volatiles!):

                                A global;
    A a = <alloc>;                  |  A a = global;
    put{Release|Volatile}(a.x, 1);  |  r1 = get{Acquire|Volatile}(a.x);
    global = a;                     |

(r1) = (0) is allowed."
See here: http://cs.oswego.edu/pipermail/concurrency-interest/2016-May/015104.html

[6] concurrency-interest Q&A with Aleksey Shipilev, March 2016:
"> int a, b;
> boolean tryLock() {
>     UNSAFE.putOrdered(a, 1); // Declare intent.
>     // No StoreLoad here as store is not volatile.
>     if (UNSAFE.getVolatile(b) == 1)) {
>         // Reset intent and return false;
>     }
>     return true;
> }

Even in the naive barrier interpretation that usually gives stronger
answers, you have:

 a = 1;

 r1 = b;
See here: http://cs.oswego.edu/pipermail/concurrency-interest/2016-March/015037.html

[7] C++ memory_order_release definition:
"A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below)."
See here: http://en.cppreference.com/w/cpp/atomic/memory_order

Many thanks to A. Shipilev, M. Thompson, D. Lawrie and C. Ruslan  for reviewing, any remaining errors are their own and they shall be most severely reprimanded for them.


  1. >Note that st may in theory be delayed indefinitely, certainly other loads and stores are allowed to float up between it and the barrier.

    This also true for ordinary volatile stores: they can be delayed infinitely (in theory).

    I was always found confusing why lazySet javadocs state it is "eventually visible" store -- while it seems no more "eventually" than any other plain/volatile store.

    1. +1

      This part of the docs I find confuses many people.

    2. There is a difference still. On x86 the volatile write will issue an mfence, flushing the Load Buffer to L1 and thus (via the cache coherency protocol) ensure that every other core sees the stores that happened before the volatile store.

      lazySet on the other hand does not do that, all it does is ensure that there are no re-orderings. So there has to be something else that will do the flushing, but it's not the lazySet that triggers that.

      At least that is my understanding and things seen that JIT emits for this.

    3. @eugene I did not go into the produced assembly because it represents an implementation, not a guarantee or specification. It may even have bugs...

      @martin + @ruslan you are right, normal loads/stores are similarly free to "float up" past a volatile store so it is similarly open to the hypothetical infinite delay.

  2. Now all the comments on JCtools with something like this:

    soElement(buffer, offset, e); // StoreStore

    need to be updated to:

    soElement(buffer, offset, e); // LoadStore + StoreStore


    With no jokes, that could mean that when you have something like:


    it is valid and effectively prevents the mutuallyExclusiveOperation instructions to float outside the "guarded" block?

    1. :) those comments are somewhat redundant, more gardening required...
      Yes, I think the model you describe works for mutual exclusion since getAndAdd is full fence and the happens-before model is hanging on observations/actions on x. It's similar to the mutual exclusion on allocation model used by the 'MpscLinkedArrayQueue's.

  3. This has puzzled me for quite a while too, and while your article(s) is/are very good, I still have a few questions if I may.

    Strictly speaking about x86 : the usual volatile store will emit an mfence, that basically says "flush the Load Buffer to L1" before this barrier. About the re-orderings x86 is already a strict memory model, so when reading a volatile variable, there's no need for a lfence.

    lazySet on the other hand does not flush the Load Buffer, all it does is prevent re-orderings, as you call it a weaker volatile store (which I like a lot). So it does not trigger the flush, someone else has to, but it ensures the order of the writes (I assume this is what make it such a good tool in the SingleWriter principle).

    My question is actually simple : do i make any sense? :) thank you

    1. Is it just me finding this stuff so much easier to understand thinking in cpu instructions (logic) then in the JMM? I feel weird right now...