Wednesday, 27 August 2014

Disassembling a JMH Nano-Benchmark

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
I often feel it is nano-benchmarks that give microbenchmarks a bad name (that and the fact MBMs tend to sell crack and their young bodies). Putting to one side the latter issue for bleeding heart liberalists to solve, we are left with the former. In this post I'd like to help the budding nano-benchmark writer resolve and investigate the embarrassing dilemma of: "What just happened?"
"What just happened?" is a question you should almost always ask yourself when running a nano-benchmark. The chances of the compiler finding out your benchmark does nothing, or that significant part of your benchmark can be omitted, are surprisingly large. This is partly a case of extreme cleverness of compiler writers and partly the simplicity of the benchmark code potentially leaving the door open to optimisations perhaps not possible in the wild. The best way to answer the question is to have a look at the assembly end result of your benchmark code.
Hipster developer that I am, I use JMH to write microbenchmarks. Chances are you should too if you are writing nano/micro benchmarks as it goes a long way toward solving common issues. In the rest of this post we'll be looking at the assembly produced by JMH benchmarks and explaining away the framework so that you can more easily find your way in your own benchmark.

The NOOP benchmark

I started with the observation that nano-benchmarks sometimes get optimized away; if they did, they'd have the same end result as this benchmark:
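A minimal sketch of such a no-op benchmark (class and method names here are illustrative, not the exact code from the sample project):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import java.util.concurrent.TimeUnit;

    public class NoopBenchmark {
        @Benchmark
        @BenchmarkMode(Mode.AverageTime)
        @OutputTimeUnit(TimeUnit.NANOSECONDS)
        public void noop() {
            // deliberately empty: whatever we measure is pure framework overhead
        }
    }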
Exciting stuff! So we measure nothing at all. How are we measuring this? JMH generates some code around a call to the above method that will do the measurement:
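The generated method (the one matching noop_avgt_jmhLoop in the command further down) has roughly this shape; this is a hand-written approximation, not the verbatim generated code:

    // Approximation of the JMH-generated measurement loop; parameter and field
    // names are guesses at the shape of the real generated code.
    public void noop_avgt_jmhLoop(InfraControl control, RawResults result, NoopBenchmark benchmark) {
        long operations = 0;
        long realTime = 0;
        result.startTime = System.nanoTime();
        do {
            benchmark.noop();          // the method under measurement
            operations++;              // counter held on the stack/in a register
        } while (!control.isDone);     // volatile read of the isDone flag
        result.stopTime = System.nanoTime();
        result.realTime = realTime;
        result.operations = operations;
    }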
So we have a while loop, spinning on the isDone flag and counting how many times we can manage to execute it until someone tells us to stop (by setting the isDone flag to true). It follows therefore that the measurement overhead is:
  • Reading the volatile field isDone (L1 hitting read, predictable)
  • Incrementing a counter (on the stack)
But healthy skepticism is what this is all about, so let's see what the generated assembly looks like! I'll be gentle, assembly is often hard on the eyes.

Getting The Assembly Output

To try this at home you'll need a drink, a JVM set up to print assembly, and the sample code. Build the project with Maven, then run the benchmark and generate the assembly using the following command:
$JAVA_HOME/bin/java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*.noop_avgt_jmhLoop -XX:PrintAssemblyOptions=intel -XX:-UseCompressedOops -jar target/microbenchmarks.jar -i 5 -wi 5 -f 0 ".*.noop" > noop.ass
I'm only printing the measurement method, using the Intel syntax instead of the default AT&T and disabling compressed oops to get simpler output for this particular exercise. The output will contain several versions of the compiled method, I will be discussing the final version which is the last in the output.
Now that we have the assembly printed we can get familiar with the structure of the JMH measurement loop as it is translated into assembly:

This is just the preliminaries for the method, so not much to see except noting which reference is in which register to help interpret the rest of the code. The comments in the printout are generated by the JVM, my comments are prefixed with [NW].
Once all the pieces are in place we can move on to some actual work.

Measurement Loop: 2 Timestamps diverged in a yellow wood

Refresh your memory of what the Java code above does and let's see if we can find it here:
Have a sip and scan slowly. Here's some nuggets to consider:

  • As expected the noop() method is not called and any mention of it is gone from the measurement loop.
  • The first iteration of the loop has been 'peeled', this is common practice.
  • Even though we never call noop(), we still have to do the null check for the benchmark reference.
  • The sharp-eyed reader will have noticed the redundant realTime variable in the generated measurement loop; so has the JIT compiler, and it has been replaced with setting the result.realTime field directly to 0.
  • RBP is an 8 byte register, EBP is the lower half of the same register. Setting EBP to 1 in the peeled first iteration is the same as setting RBP to 1.
  • The measurement loop includes a safepoint! Put that down as further measurement overhead.
This is the simplest benchmark one can write with JMH. On my test machine (an Intel Xeon E5-2697 v2 @ 2.70GHz) doing nothing is quite fast at 0.288 ns/op.
As you may have expected, reading the generated assembly is not so pleasant. I find the generated comments very helpful for orientation, and the timestamp calls on either side of the measurement loop help in zooming in on the important bits.

A Nano-Benchmark: i++

Nothing says "nano-benchmark" like benchmarking a single operation. Let's have a go at it!
The generated loop is the same, but this time that crafty old JIT compiler cannot just do nothing with our code. We will finally learn the true cost of incrementing an integer! Given the overhead includes a long increment already I might even guess the cost at 0.25 ns/op, so maybe the result reported by JMH will be 0.5 ns/op? A warm fuzzy feeling of wisdom.
But when I run this benchmark on the same machine I learn to my dismay that incrementing an integer takes 1.794 ns/op according to my JMH benchmark. Damn integers! Why does the JVM torture us so with slow integer increments?
This is a silly benchmark, and the result makes absolutely no sense as an estimate of the cost of the ++ operator on integers. So what does it mean? Could it be that the JIT compiler failed us? Let's have a look at the assembly:
So why is the reported cost so much higher than our expectation?

What just happened?

My increment method got translated perfectly into "inc DWORD PTR [r8+0x10]", so there is no compiler issue. The comparison I made between incrementing the operations counter and incrementing the benchmark field is flawed/misguided/stupid/ignorant when taking into account the benchmark framework.
The context in which we increment operations is:
  • It's a long variable allocated on the stack
  • It's used in a very small method where there is no register pressure
  • It follows that operations is always a register
  • ADD/INC on a register cost very little (it's the cheapest thing you can do usually)
The context in which we increment benchmark.i is:
  • It's a field on the benchmark object
  • It's subject to happens-before rules so cannot be hoisted into a register inside the measurement loop (because control.isDone is a volatile read, see this post for more detail)
  • It follows that benchmark.i is always a memory location
  • INC on a memory location is not so cheap (by nano benchmark standards)
Consulting the most excellent Agner Fog instruction tables tells me that for Ivy Bridge the latency for INC on memory is 6 cycles, while the latency of ADD on a register is 1. This indeed agrees to some extent with the cost reported by JMH (assuming 0.288 was for one cycle, 0.288 * 6 = 1.728, which is pretty close to 1.794). But that's bad analysis. The truth is that cost is not additive, particularly where nano-benchmarks are concerned. In this case the cost of the INC seems to swallow up the baseline cost we measured before.
Is there something wrong with JMH? I don't think so. If we take the benchmark to be "an attempt at estimating the cost of calling a method which increments a field" then I would argue we got a valid answer. It's not the only answer however. Calling the same method in a context which allows further optimizations would yield a different answer.


Tuesday, 12 August 2014

The volatile read surprise

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
On occasion, and for perfectly good reasons, I find myself trying to answer such deep existential questions as this one. Which is faster:
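A sketch of the shape of the benchmark (the name bunn appears in the discussion below; the element type, default size and the remaining names are assumptions):

    import java.util.Random;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.infra.Blackhole;

    @State(Scope.Thread)
    public class LoopBenchmark {
        @Param({"1024"})
        public int size;
        byte[] bunn;

        @Setup
        public void createBunn() {
            bunn = new byte[size];
            new Random().nextBytes(bunn);
        }

        @Benchmark
        public void goodOldLoop(Blackhole fox) {
            for (int y = 0; y < bunn.length; y++) { // the field bunn is re-read each iteration, see below
                fox.consume(bunn[y]);
            }
        }

        @Benchmark
        public void sweetLoop(Blackhole fox) {
            for (byte bunny : bunn) { // the enhanced for loop reads bunn once into a local
                fox.consume(bunny);
            }
        }
    }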
As you can see from the sample I turn to JMH to help me resolve such questions. If you know not what JMH is you may enjoy reading previous posts on the subject (start with this one). In short it is a jolly awesome framework for benchmarking java:
  • @Benchmark annotated methods will get benchmarked
  • The framework will pass in a Blackhole object that will pretend to 'consume' the values you pass into it and thus prevent the JIT compiler from dead code eliminating the above loops to nothing.
Assuming we are all on the same page with this snippet above, let the game begin!

Yummy yummy sugar!

So I ran the above benchmarks on some heavy duty benchmarking machine and got the following results for different array sizes:
It sure looks like that syntactic sugar is much better! More than twice as fast! Awesome?

Must give us pause

At this point we could either:
  1. Declare syntactic sugar the clear winner and never write the old style for loops ever again 'cause they be slow like everything old! we hates them old loops! hates them!
  2. Worry that we are being a bit stupid
I get very little sleep and I was never very bright, so I'll go for 2. 
This benchmark result seems off, it's not what we expect. It would make sense for the JVM to make both loops the same, and yet they seem to work out very differently. Why, god? whhhhhhhy?
The above benchmark is a tiny piece of code, and is a fine example of a nano-benchmark (to use the term coined by Shipilev for benchmarks of nano-second scale). These are pretty suspect benchmarks at the best of times, so you want to be quite alert when trying to make sense of them. When stuff doesn't make sense it is best to see what the JIT compiler made of your code and hit the assembly! Printing the JIT generated assembly is a neat party trick (sure to win you new friends and free drinks) and results in loads of funky text getting thrown at you. I was going to do a whole walk through the assembly but I have promises to keep and miles to walk before I sleep (some other time, I promise). So let's just skip to the WTF moment.

Into the hole

The assembly code for the goodOldLoop is long and painful to read through, and that in itself is a clue. Once you work out the control flow you'll sit there scratching your head and wondering. The thing that stands out (when the assembly smoke clears) is that bunn is loaded on every iteration, bunn.length is loaded and an array boundary check happens. This is surely a terrible way to interpret a for loop... 
The culprit turns out to be a volatile read in Blackhole.consume:
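Paraphrasing the shape of the method (this is not the exact JMH source; the important part is the volatile fields it compares against):

    // Rough paraphrase of Blackhole.consume, not the exact JMH source.
    public volatile byte b1 = 1, b2 = 2;

    public final void consume(byte b) {
        // b1 and b2 are volatile, so this comparison cannot be constant-folded away,
        // and the branch can never be taken since b cannot equal both of them.
        if (b == b1 & b == b2) {
            throw new IllegalStateException("Should never happen");
        }
    }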

The above method ensures that a consumed value will not be subject to DCE even if it is completely predictable. The values of b1, b2, being volatile, cannot be assumed to stay the same and so require re-examination. The side effect, however, is that we now have a volatile load in the midst of our for loop. A volatile load of one value requires the JVM to load all subsequent loads from memory to preserve happens-before relationships; in this case the field bunn is reloaded on every iteration of the loop. If bunn may have changed then its length may have also changed... sadness follows. To test this theory we can make a third loop:
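Something along these lines, reusing the names from the sketch above (the name goodOldLoopReturns is taken from the lessons below):

    @Benchmark
    public void goodOldLoopReturns(Blackhole fox) {
        byte[] array = bunn; // copy the field into a local, just like the enhanced for loop does
        for (int y = 0; y < array.length; y++) {
            fox.consume(array[y]);
        }
    }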
This performs much like the sweet syntactic sugar version:

Lessons learnt?

  • Nano benchmarks and their results are hard to interpret. When in doubt read the assembly; when not in doubt, smack yourself to regain doubt and read the assembly. It's very easy for a phenomenon you are not looking to benchmark to slip into the benchmark.
  • Sugar is not necessarily bad for you. In the above case the syntactic sugar interpretation by the JVM was a better match to our intuition than the explicit old school loop. By being explicit we inhibited optimisation, despite intending the same thing. The enhanced for loop, as the JLS calls it, is semantically different from the basic for loop in that it assumes some sort of snapshot iterator taken at the beginning of the loop and used throughout, which for primitive arrays means taking the form used in goodOldLoopReturns.
  • Blackhole.consume is also a memory barrier, and these come with some side effects you may not expect. In larger benchmarks these may be negligible, but in nano benchmarks every little thing counts. This is a fine use case for a 'weak' volatile read, one which requires a memory read but no memory barrier (see the previous post on the compound meaning of volatile access).

Friday, 8 August 2014

The many meanings of volatile read and write

Just a quick note on the topic as I find I keep having this conversation. Volatile fields in Java provide three distinct features:
  1. Atomicity: volatile long and double fields are guaranteed to be atomically written. This is not otherwise the case for long and double. See JLS section 17.7 for more details. Also see this excellent argument made by Shipilev on why all fields could be made atomic with no significant downside.
  2. Store/Load to/from memory: a normal field load may get hoisted out of a loop and be done once, a volatile field is prevented from being optimized that way and will be loaded on each iteration. Similarly stores are to memory and will not be optimized.
  3. Global Ordering: A volatile write acts as a StoreLoad barrier thus preventing previous stores from being reordered with following loads. A volatile read acts as a LoadLoad barrier and prevents following loads from happening before it. This is opposed to the meaning of volatile in C/C++ where only other volatile loads/stores are prevented from reordering.
I would personally prefer to have these more refined tools at my disposal for when I need them, but volatile is a 3-in-1 sort of tool...
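To make the second point concrete, here is a minimal, hypothetical illustration:

    class FlagSpinner {
        boolean plainDone;             // plain field: the read below may be hoisted out of the loop
        volatile boolean volatileDone; // volatile field: must be re-read on every iteration

        void spinOnPlain() {
            // The JIT is allowed to compile this as: if (!plainDone) while (true) {}
            // so the loop may never notice another thread setting the flag.
            while (!plainDone) { }
        }

        void spinOnVolatile() {
            // The volatile load cannot be hoisted, so the loop exits soon after
            // another thread sets volatileDone to true.
            while (!volatileDone) { }
        }
    }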

What about AtomicLong.lazySet?

For those of you wondering (as I did) whether or not AtomicLong.lazySet (A.K.A Unsafe.putOrderedLong) provides atomicity, it would seem the answer is yes. Digging through the JVM source code for the putOrderedLong intrinsic yields the following nugget:
Look at that perfectly pleasant C++ code! The store is indeed made atomic. We can further test this observation by looking at the generated assembly for a 32 vs 64 bit JVM:
There you go! Atomicity is preserved! Hoorah!

Monday, 28 July 2014

Poll me, maybe?

The java.util.Queue interface has the ever important poll/offer methods, documented thus:
This allows the caller to assume that if the poll method returns null the queue is empty, and in a single threaded world this is quite straightforward. A problem for multi-producer lock free concurrent queues, however, is that while an element may not be visible the queue may not be empty. "WOT?" I hear you cry, so let's look at an example:
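Here is a sketch of the offer/poll pair under discussion (simplified; the real code is in JCTools and uses different plumbing, names here are assumptions):

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

    class MpscLinkedQueueSketch<E> {
        static final class Node<E> {
            @SuppressWarnings("rawtypes")
            static final AtomicReferenceFieldUpdater<Node, Node> NEXT =
                    AtomicReferenceFieldUpdater.newUpdater(Node.class, Node.class, "next");
            E value;
            volatile Node<E> next;
            Node(E value) { this.value = value; }
            void soNext(Node<E> n) { NEXT.lazySet(this, n); } // ordered store of next
        }

        private final AtomicReference<Node<E>> producerNode; // written by many producers
        private Node<E> consumerNode;                        // touched only by the single consumer

        MpscLinkedQueueSketch() {
            Node<E> stub = new Node<>(null);
            producerNode = new AtomicReference<>(stub);
            consumerNode = stub;
        }

        public boolean offer(E e) {
            Node<E> newNode = new Node<>(e);
            // Atomically swap in the new producerNode (an xchg); from this moment
            // other producers will link behind newNode.
            Node<E> prevProducerNode = producerNode.getAndSet(newNode);
            // Link the previous node to the new one; between the swap above and
            // this store the chain is briefly broken.
            prevProducerNode.soNext(newNode);
            return true;
        }

        public E poll() {
            Node<E> nextNode = consumerNode.next;
            if (nextNode == null) {
                return null; // may happen even when the queue is not empty, as discussed below
            }
            E value = nextNode.value;
            nextNode.value = null;
            consumerNode = nextNode;
            return value;
        }
    }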
The above snippet shows the offer/poll methods of an MPSC linked queue as originated by Mr. Vyukov (MPSC - Multi-Producer, Single-Consumer) and ported into Java by your humble servant (though others have ported this algo before me: Akka/RxJava/Netty, and others...).
Where be dragons?

Multiple Producers Break The Chain

How the above algorithm works for multiple producers:

  • In the offer method we use the same mechanism offered by AtomicReference.getAndSet to atomically swap the producerNode with the new node.
  • This means no producerNode is ever returned more than once, preventing producer threads from overwriting the producerNode reference field and that node's reference to the next node.
  • We use the same mechanism offered by AtomicReference.lazySet to set this new node as the next node from the previous.
  • On the consumer thread side we process nodes by grabbing the next node and replacing the consumerNode with it after we pluck out its value.

The problem is in the offer method, where we first swap in the new producerNode (the xchg) and then set the new node (now the producerNode) to be the next node after the previous producerNode. For a short moment the old producer node has null as its next element; the chain is broken. This short moment is of undefined length: depending on scheduler pressure and the whimsical sense of humour of the gods, a producer thread which got interrupted between the swap and the link (puff) may be held up for a while.
And yet, while this producer thread is sleeping (what dreams may come?), other producers can make progress. They may add any number of nodes to the queue, each linking the producerNode to the next. The producerNode can be at any 'distance' from that suspended node on that suspended thread waiting for its next field to be patched through, and there may be any number of further broken links in the way.
Looking at the poll method in this light the problem becomes more obvious. If a node may have its next set to null due to the timing described above, then poll may return null when the queue is in fact not empty.

Unbreaking The Chain

To fix this issue we could do the following:
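For example, something along these lines (a sketch, reusing the names from the sketch above):

    public E poll() {
        Node<E> nextNode = consumerNode.next;
        if (nextNode == null && consumerNode != producerNode.get()) {
            // The queue is not empty, the link just isn't visible yet:
            // spin until the producer patches the next pointer through.
            while ((nextNode = consumerNode.next) == null) { }
        }
        if (nextNode == null) {
            return null; // genuinely empty
        }
        E value = nextNode.value;
        nextNode.value = null;
        consumerNode = nextNode;
        return value;
    }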

And indeed this is how other implementors have chosen to tackle the issue (in spirit, if not in code).
In doing so we have given up quite a lot however:
  1. poll() is no longer wait free. Which is a shame, I quite like wait freedom.
  2. The consumer thread is volunteered into spin waiting on the next field to become visible.
  3. In the event of hitting null we now read the producerNode. This introduces the potential for a cache miss. This is a big problem to my mind as this cache miss has an unknown and potentially very large cost.
  4. The producerNode read from the consumer thread will have a negative impact on the producer threads contending to write to it. This has been previously explored here. This will be particularly bad when the consumer is spinning on the poll() method while waiting for the next value.
Is it worth it? I'm not convinced... Given that a Queue already mandates an isEmpty method, could we not settle for a relaxed definition of poll? Given the above observation that queue emptiness is also (by its lock free and concurrent nature) imprecise, should we really sacrifice performance/scalability for it?
At the moment I am leaning towards offering weaker guarantees for the lock-free queues offered as part of JCTools, but I'm hoping to get some feedback from prospective users on how important that part of the queue interface is to their purposes.

NOTE: The same issue exists for the offer method when we look at an SPMC queue, as discussed in this issue.

Tuesday, 8 July 2014

Concurrent Bugs: Size Matters

As part of my ongoing work on JCTools (no relation to that awfully popular Mexican dude) I've implemented SPSC/SPMC/MPSC/MPMC lock free queues, all of which conform to the java.util.Queue interface. The circular array/ring buffered variants all share a similar data structure where a consumer thread (calling the poll() method) writes to the consumerIndex and a producer thread (calling the offer() method) writes to the producerIndex. They merrily chase each other around the array as discussed in this previous post.
Recently I've had the pleasure of having Martin Thompson pore over the code and contribute some bits and pieces. Righteous brother that Mr. T is, he caught this innocent looking nugget (we're playing spot the bug, can you see it?):
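A sketch of the nugget (assuming volatile long producerIndex/consumerIndex fields as described above):

    // The innocent looking method: size is the distance between the two indexes.
    public int size() {
        // Two independent volatile loads: producerIndex is read first, consumerIndex second.
        return (int) (producerIndex - consumerIndex);
    }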
What could possibly go wrong?


Like bugs? you'll looove Concurrency!

The code above can return a negative number as the size. How can that be you wonder? Let us see the same method with some slow motion comments:
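A sketch, with a hypothetical interleaving spelled out in the comments:

    public int size() {
        // Suppose producerIndex is 10 and consumerIndex is 5 when we load producerIndex -> 10.
        final long producer = producerIndex;
        // Before the next load the producer offers 5 more elements (producerIndex is now 15)
        // and the consumer polls 7 of them (consumerIndex is now 12).
        final long consumer = consumerIndex;
        // We return 10 - 12 = -2. A negative size!
        return (int) (producer - consumer);
    }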

See it?
  • We need not get suspended for this to happen. The independent loads could have the second one hit a cache miss and get delayed, or maybe the consumer and the producer are just moving furiously fast. The important thing to realise is that the loads are independent and so while there may not be much time between the 2, there is the potential for some time by definition.
  • Because the 2 loads are volatile they cannot get re-ordered.
  • Doing it all on one line, or on 3 different lines (in the same order) makes no difference at all.

So how do we solve this?
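A sketch of a fix in the spirit of what JCTools ended up with: re-read until the consumer index is stable around the read of the producer index.

    public int size() {
        // Keep re-reading until the consumer index has not moved across our read of
        // the producer index; only then is producer - consumer guaranteed to be in range.
        long after = consumerIndex;
        while (true) {
            final long before = after;
            final long currentProducerIndex = producerIndex;
            after = consumerIndex;
            if (before == after) {
                return (int) (currentProducerIndex - after);
            }
        }
    }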

Is it really solved?
  • This will at least return a value that is in the correct range.
  • This will NOT ALWAYS return the right or correct number. The only way to get the absolutely correct number would be to somehow read both indexes atomically.
  • Atomically reading 2 disjoint longs is not possible... You can (on recent x86) atomically read 2 adjacent longs, but that is:
    • Not possible in Java
    • A recipe for false sharing
  • We could block the producers/consumers from making progress while we read the 2 values, but that would kill lock freedom.
  • This method is lock free, not wait free. In theory the while loop can spin forever in the same way that a CAS loop can spin forever. That is very unlikely however.
This is as good as it gets baby.

Lesson?

We all know that a thread can get suspended and thread executions can interleave, but it is easy to forget/overlook that when looking at a line of code. This issue is in a way just as naive as incrementing a volatile field and expecting atomicity.
Sometimes there is no atomicity to be found and we just have to come to terms with the best approximation to a good answer we can get...
A further lesson here is that having your code reviewed by others is an amazingly effective way to find bugs, and learn :-)


Thursday, 26 June 2014

Notes on False Sharing

I've recently given a talk on queues and related optimisations, but due to the limitations of time, the Universe and Everything else, I had to cut the deep dive into false sharing short. Here however I'm not limited by such worldly concerns, so here is more detail for those who can stomach it.

What mean False Sharing?

Described before on this blog and in many other places (Intel, usenix, Mr. T), but assuming you are not already familiar with the issue:
"false sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism. When a system participant attempts to periodically access data that will never be altered by another party, but that data shares a cache block with data that is altered, the caching protocol may force the first participant to reload the whole unit despite a lack of logical necessity. The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource." - From Wikipedia
Makes sense, right?
I tried my hand previously at explaining the phenomenon in terms of the underlying coherency protocol, which boils down to the same issue. These explanations often leave people still confused, and I spent some time trying to reduce the explanation so that the idea can get across in a presentation... I'm pretty sure it still left people confused.
Let's try again.
The simplest explanation I've come up with to date is:
  1. Memory is cached at a granularity of a cache line (assumed 64 bytes, which is typical, in the following examples but may be a smaller or larger power of 2, determined by the hardware you run on)
  2. Both reading and writing to a memory location from a particular CPU require the presence of a copy of the memory location in the reading/writing CPU cache. The required location is 'viewed' or cached at the granularity of a cache line.
  3. When a line is not in a thread's cache we experience a cache miss as we go searching for a valid copy of it.
  4. Once a line is in our cache it may get evicted (when more memory is required) or invalidated (because a value in that line was changed by another CPU). Otherwise it just stays there.
  5. A write to a cache line invalidates ALL cached copies of that line for ALL CPUs except the writer.
I think there is a mistaken intuition most of us have when it comes to memory access, which is that we think of it as direct. In reality however you access memory through the CPU cache hierarchy and it is in the attempt to make this system coherent and performant that the hardware undoes our intuition. Rather than thinking of all CPUs accessing a global memory pool we should have a mental model that is closer to a distributed version control system as illustrated in this excellent post on memory barriers.
Given the above setting of the stage we can say that false sharing occurs when Thread 1 invalidates (i.e. writes to) a cache line required by Thread 2 (for reading/writing) even though Thread 2 and Thread 1 are not accessing the same location.
The cost of false sharing is therefore observable as cache misses, and the cause of false sharing is the write to the cache line (coupled with the granularity of cache coherency). Ergo, the symptom of false sharing can be observed by looking at hardware events for your process or thread:
  • If you are on linux you can use perf. (Checkout the awesome perf integration in JMH!)
  • On linux/windows/mac Intel machines there is PCM.
  • From Java you can use Overseer
  • ... look for hardware counters on google.
Now assuming you have established that the number of cache misses you experience is excessive, the challenge remains in finding the sources of these cache misses. Sadly for Java there is no free utility that I'm aware of which can easily point you to the source of your false sharing problem, C/C++ developers are spoilt for choice by comparison. Some usual suspects to consider however are any shared data constructs, for example queues.


Active False Sharing: Busy Counters

Let's start with a queue with 2 fields which share the same cache line, where each field is being updated by a separate thread. In this example we have an SPSC queue with the following structure:
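A sketch of that structure (field names are illustrative): the indexes and the buffer reference are packed tightly together and will typically end up on the same cache line.

    class SpscQueue<E> {
        final E[] buffer;
        volatile long producerIndex; // written by the producer thread in offer()
        volatile long consumerIndex; // written by the consumer thread in poll()

        @SuppressWarnings("unchecked")
        SpscQueue(int capacity) {
            buffer = (E[]) new Object[capacity];
        }
    }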
Ignoring the actual correct implementation and edge cases to make this a working queue (you can read more about how that part works here), we can see that offer will be called by one thread, and poll by another as this is an SPSC queue for delivering messages between 2 threads. Each thread will update a distinct field, but as the fields are all jammed tight together we know False Sharing is bound to follow.
The solution at the time of writing seems to be the introduction of padding by means of class inheritance, which is covered in detail here. This adds up to an unattractive class as follows:
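A sketch of the technique (names and the padding width, 7 longs of roughly 56 bytes, are illustrative):

    class ColdFields<E> {
        protected final E[] buffer;
        @SuppressWarnings("unchecked")
        ColdFields(int capacity) { buffer = (E[]) new Object[capacity]; }
    }
    class ConsumerIndexField<E> extends ColdFields<E> {
        protected volatile long consumerIndex;
        ConsumerIndexField(int capacity) { super(capacity); }
    }
    class Pad<E> extends ConsumerIndexField<E> {
        long p01, p02, p03, p04, p05, p06, p07; // keeps the two counters apart
        Pad(int capacity) { super(capacity); }
    }
    class PaddedCountersQueue<E> extends Pad<E> {
        protected volatile long producerIndex;
        PaddedCountersQueue(int capacity) { super(capacity); }
    }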
Yipppeee! the counters are no longer false sharing! That little Pad in the middle will make them feel all fresh and secure.



Passive False Sharing: Hot Neighbours

But the journey is far from over (stay strong), because in False Sharing you don't have to write to a shared cache line to suffer. In the above example both the producer and the consumer are reading the reference to buffer which is sharing a cache line with naughty Mr. consumerIndex. Every time we update consumerIndex an angel loses her wings! As if the wings business wasn't enough, the write also invalidates the cache line for the producer which is trying to read the buffer so that it can write to the damn thing. The solution?
MORE PADDING!!
Surely now everyone will just get along?

The Object Next Door

So now we have the consumer and the producer feverishly updating their respective indexes and we know the indexes are not interfering with each other or with buffer, but what about the objects allocated before/after our queue object? Imagine we allocate 2 instances of the queue above and they end up next to each other in memory, we'd be practically back to where we started with the consumerIndex on the tail end of one object false sharing with the producer index at the head end of the other. In fact, the conclusion we are approaching here is that rapidly mutated fields make very poor neighbours altogether, be it to other fields in the same class or to neighbouring classes. The solution?
EVEN MORE PADDING!!! 
Seems a tad extreme don't it? But hell, at least now we're safe, right?

Buffers are people too

So we have our queue padded like a newborn on its first trip out of the house in winter, but what about Miss buffer? Arrays in Java are objects, and as such the buffer field merely points to another object and is not inlined into the queue (as it can be in C/C++ for instance). This means the buffer is also subject to potentially causing or suffering from false sharing. This is particularly true for the length field on the array, which will get invalidated by writes into the first few elements of the queue, but equally true for any object allocated before or after the array. For circular array queues we know the producer will go around writing to all the elements and come back to the beginning, so the middle elements will be naturally padded. If the array is small enough and the message passing rate is high enough this can have the same effect as any hot field. Alternatively we might experience uneven behaviour for the queue as the elements around the edges of the array suffer false sharing while the middle ones don't.
Since we cannot extend the array object and pad it to our needs we can over-allocate the buffer and never use the initial/last slots:
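A sketch of the idea (assuming 64 byte cache lines and 8 byte references, i.e. compressed oops off):

    class PaddedBufferQueue<E> {
        static final int PAD_SLOTS = 64 / 8; // one cache line worth of slots at each end
        final E[] buffer;
        final int capacity; // hoisted copy of the usable length, so we don't read buffer.length

        @SuppressWarnings("unchecked")
        PaddedBufferQueue(int capacity) {
            this.capacity = capacity;
            this.buffer = (E[]) new Object[capacity + 2 * PAD_SLOTS];
        }

        void spElement(int index, E e) { buffer[PAD_SLOTS + index] = e; }    // offset store
        E lpElement(int index)         { return buffer[PAD_SLOTS + index]; } // offset load
    }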
We cannot prevent false sharing between neighbours and the array length, what we can do is hoist it into our own class field and use our copy instead of the buffer field. This can also result in better locality if the hoisted length value is hosted in a field alongside the array itself.
Note that this is costing us some computational overhead to compute the offset index instead of the natural one, but in practice we can implement queues such that we achieve the same effect with no overhead at all.

Protection against the Elements

While on this topic (I do go on about it... hmmm) we can observe that the producer and the consumer threads can still experience false sharing on the cache line holding adjacent elements in the buffer array. This is something we could perhaps handle more effectively if we had arrays of structs in Java (see the value types JEP and structured arrays proposal), and if the queue were a queue of such structs. Reality being the terrible place that it is, this is not the case yet. If we could prove our application indeed suffered from this issue, could we solve it?
Yes, but at a questionable price...
If we allocate extra elements for each reference we mean to store in the queue we can use the empty elements to pad the used elements. This will reduce the density of each cache line and as a result the probability of false sharing, as fewer and fewer elements share a cache line. This comes at a high cost however, as we multiply the size of the buffer to make room for the empty slots and actively sabotage the memory throughput as we have less data per read cache line. This is an optimization I am reluctant to straight out recommend as a result, but what the hell? Sometimes it might help.

Playing with Dirty Cards

This last one is a nugget and a half. Usually when people say: "It's a JVM/compiler/OS issue" it turns out that it's a code issue written by those same people. But sometimes it is indeed the machine.
In certain cases you might not be the cause of False Sharing. In some cases you might be experiencing False Sharing induced by card marking. I won't spoil it for you, follow the link.

Summary

So there you have it, 6 different cases of False Sharing encountered in the process of optimizing/proofing the piddly SPSC queue. Each one had an impact, some more easily measured than others. The take away here is not "Pad every object and field to death" as that will be detrimental in most cases, much like making every field volatile. But this is an issue worth keeping in mind when writing shared data structures, and in particular when considering highly contended ones.
We've discussed 2 types of False Sharing:

  • Active: where both threads update distinct locations on the same cache line. This will severely impact both threads as they both experience cache misses and cause repeated invalidation of the cache line.
  • Passive: where one thread writes to and another reads from distinct locations on the same cache line. This will have a major impact on the reading thread with many reads resulting in a cache miss. This will also have an impact on the writer as the cache line sharing overhead is experienced.
These were identified and mitigated in several forms:

  1. Neighbouring hot/hot fields (Active)
  2. Neighbouring hot/cold fields (Passive)
  3. Neighbouring objects (potential for both Active and Passive)
  4. Objects neighbouring an array and the array elements (same as 3, but worth highlighting as a subtlety that arrays are separate objects)
  5. Array elements and array length field (Passive as the length is never mutated)
  6. Distinct elements sharing the same cache line (Passive in above example, but normally Active as consumer nulls out the reference on poll)
  7. -XX:+UseCondCardMark - see Dave Dice article

What should you do?
  1. Consider fields concurrently accessed by different threads, in particular frequently written to fields and their neighbours.
  2. Consider Active AND Passive false sharing.
  3. Consider (and be considerate to) your neighbours.
  4. All padding comes at the cost of using up valuable cache space for isolation. If over used (like in the elements padding) performance can suffer.
And finally: love thy neighbour, just a little bit and with adequate protection and appropriate padding, but still love'em.
Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium!

Wednesday, 11 June 2014

Advice for the concurrently confused: AtomicLong JDK7/8 vs. LongAdder

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
Almost a year ago I posted some thoughts on scalable performance counters and compared AtomicLong, a LongAdder backport to JDK7, Cliff Click's ConcurrentAtomicTable and my own humble ThreadLocalCounter. With JDK8 now released I thought I'd take an opportunity to refresh the numbers and have a look at the delivered LongAdder and the now changed AtomicLong. This is also an opportunity to refresh the JMH usage for this little exercise.

JMH Refresh

If you never heard of JMH here are a few links to catch up on:
In short, JMH is a micro-benchmarking harness written by the Oracle performance engineering team to help in constructing performance experiments while side stepping (or providing means to side step) the many pitfalls of Java related benchmarks.

Experiment Refresh

Here's a brief overview of the steps I took to update the original benchmark to the current:
  • At the time I wrote the original post JMH was pretty new but already had support for thread groups. Alas, there was no way to tweak their size from the command line. Now there is, so I could drop the boilerplate for different thread counts.
  • Hide counter types behind an interface/factory and use single benchmark (was possible before, just tidying up my own mess there).
  • Switched to using @Param to select the benchmarked counter type. With JMH 0.9 the framework will pick up an enum and run through its values for me, yay!

The revised benchmark is pretty concise:
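It looks roughly like this (a sketch, not the exact code from the repository; the real version also covers the JDK7 backport):

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.LongAdder;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Group)
    public class CounterBenchmark {
        public interface Counter {
            void inc();
            long get();
        }

        public enum CounterType {
            ATOMIC_LONG {
                Counter create() {
                    final AtomicLong c = new AtomicLong();
                    return new Counter() {
                        public void inc() { c.incrementAndGet(); }
                        public long get() { return c.get(); }
                    };
                }
            },
            LONG_ADDER {
                Counter create() {
                    final LongAdder c = new LongAdder();
                    return new Counter() {
                        public void inc() { c.increment(); }
                        public long get() { return c.sum(); }
                    };
                }
            };
            abstract Counter create();
        }

        @Param // no explicit values: JMH runs through all the enum constants by default
        public CounterType type;
        Counter counter;

        @Setup
        public void createCounter() {
            counter = type.create();
        }

        @Benchmark
        @Group("counters")
        public void inc() {
            counter.inc();
        }

        @Benchmark
        @Group("counters")
        public long get() {
            return counter.get();
        }
    }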
The JMH infrastructure takes care of launching the requested numbers of incrementing/getting threads and sums up all the results neatly for us. The @Param defaults will run all variants if we don't pick a particular implementation from the command line. Altogether a more pleasant experience than the rough and tumble of version 0.1. Code repository is here.

AtomicLong: CAS vs LOCK XADD

With JDK8 a change has been made to the AtomicLong class to replace the CAS loop:
With a single intrinsic:
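Paraphrasing the two versions (not exact quotes of the JDK sources):

    // JDK7 AtomicLong.getAndIncrement: a CAS retry loop (paraphrased)
    public final long getAndIncrement() {
        while (true) {
            long current = get();
            long next = current + 1;
            if (compareAndSet(current, next)) {
                return current;
            }
        }
    }

    // JDK8 AtomicLong.getAndIncrement: delegates to an intrinsified Unsafe method (paraphrased)
    public final long getAndIncrement() {
        return unsafe.getAndAddLong(this, valueOffset, 1L);
    }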
getAndAddLong() (which corresponds to fetch-and-add) translates into LOCK XADD on x86 CPUs which atomically returns the current value and increments it. This translates into better performance under contention as we leave the hardware to negotiate the atomic increment.

Incrementing only

Running the benchmark with incrementing threads only on a nice (but slightly old) dual socket Xeon X5650@2.67GHz (HyperThreading is off, 2 CPUs, 6 cores each, JDK7u51, JDK8u5) shows the improvement:

  • Left hand column is number of threads, all threads increment
  • All numbers are in nanoseconds measuring the average cost per op
  • Values are averaged across threads, so the cost per op is presented from a single threaded method call perspective.




Threads   JDK7-AL .inc (err)    JDK8-AL .inc (err)
      1      9.8 +/-  0.0          6.2 +/-  0.0
      2     40.4 +/-  2.1         34.7 +/-  0.8
      3     69.4 +/-  0.3         54.1 +/-  5.6
      4     90.2 +/-  2.8         85.4 +/-  1.3
      5    123.3 +/-  3.7        102.7 +/-  1.9
      6    144.2 +/-  0.4        120.4 +/-  2.5
      7    398.0 +/- 29.6        276.6 +/- 33.9
      8    417.3 +/- 32.6        307.8 +/- 39.1
      9    493.8 +/- 46.0        257.5 +/- 16.4
     10    457.7 +/- 17.9        409.8 +/- 34.8
     11    515.3 +/- 13.7        308.5 +/- 10.1
     12    537.9 +/-  7.9        351.3 +/-  7.6

Notes on the how benchmarks were run:
  • I used taskset to pin the benchmark process to a set of cores
  • The number of cores allocated matches the number of threads required by the benchmark
  • Each socket has 6 cores and I taskset the cores from a single socket from 1 to 6, then spilt over to utilizing the other socket. In the particular layout of cores on this machine this turned out to translate into taskset -c 0-<number-of-threads - 1>
  • There are no get() threads in use, only inc() threads. This is controlled from the command line by setting the -tg option. E.g. -tg 0,6 will spin no get() threads and 6 inc() threads
 Observations:
  • JDK 8 AtomicLong is consistently faster as expected.
  • LOCK XADD does NOT cure the scalability issue in this case. This echoes the rule of thumb by D. Vyukov here, which is that shared data writes are a scalability bottleneck (he is comparing private writes to a LOCK XADD on shared data). The chart in his post demonstrates throughput rather than cost per op and the benchmark he uses is somewhat different. Importantly, his chart demonstrates that LOCK XADD hits a sustained throughput which remains fixed as threads are added. The large noise in the dual socket measurements renders the data less than conclusive in my measurements here, but the throughput does converge.
  • When we cross the socket boundary (>6 threads) the ratio between JDK7 and JDK8 increases.
  • Crossing the socket boundary also increases results variability (as expressed by the error). 
This benchmark demonstrates cost under contention. In a real application with many threads doing a variety of tasks you are unlikely to experience this kind of contention, but when you hit it you will pay.

LongAdder JDK7 backport vs. JDK8 LongAdder

Is a very boring comparison, as they turn out to scale and cost pretty much the same (minor win for the native LongAdder implementation). It is perhaps comforting to anyone who needs to support a JDK7 client base that the backport should work fine on both and that no further work is required for now. Below are the results; LA7 is the LongAdder backport, LA8 is the JDK8 implementation:


Threads   JDK7-LA7 .inc (err)   JDK8-LA7 .inc (err)   JDK8-LA8 .inc (err)
      1      9.8 +/- 0.0           10.0 +/- 0.8           9.8 +/- 0.0
      2     12.1 +/- 0.6           10.8 +/- 0.1          10.0 +/- 0.1
      3     11.5 +/- 0.4           11.7 +/- 0.3          10.3 +/- 0.0
      4     12.4 +/- 0.6           11.1 +/- 0.1          10.3 +/- 0.0
      5     12.4 +/- 0.8           11.5 +/- 0.6          10.3 +/- 0.0
      6     11.8 +/- 0.3           11.1 +/- 0.3          10.3 +/- 0.0
      7     11.6 +/- 0.4           12.9 +/- 1.3          10.6 +/- 0.4
      8     11.8 +/- 0.6           11.8 +/- 0.7          10.8 +/- 0.5
      9     12.9 +/- 0.9           12.0 +/- 0.7          10.5 +/- 0.3
     10     12.6 +/- 0.4           12.1 +/- 0.6          11.0 +/- 0.5
     11     11.5 +/- 0.2           12.3 +/- 0.6          10.7 +/- 0.3
     12     11.7 +/- 0.4           11.5 +/- 0.3          10.4 +/- 0.1

JDK8: AtomicLong vs LongAdder

Similar results to those discussed in the previous post; here are the JDK8 versions side by side:



Threads   JDK8-AL .inc (err)    JDK8-LA8 .inc (err)
      1      6.2 +/-  0.0           9.8 +/- 0.0
      2     34.7 +/-  0.8          10.0 +/- 0.1
      3     54.1 +/-  5.6          10.3 +/- 0.0
      4     85.4 +/-  1.3          10.3 +/- 0.0
      5    102.7 +/-  1.9          10.3 +/- 0.0
      6    120.4 +/-  2.5          10.3 +/- 0.0
      7    276.6 +/- 33.9          10.6 +/- 0.4
      8    307.8 +/- 39.1          10.8 +/- 0.5
      9    257.5 +/- 16.4          10.5 +/- 0.3
     10    409.8 +/- 34.8          11.0 +/- 0.5
     11    308.5 +/- 10.1          10.7 +/- 0.3
     12    351.3 +/-  7.6          10.4 +/- 0.1

Should I use AtomicLong or LongAdder?

Firstly, this question is only relevant if you are not using AtomicLong as a unique sequence generator. LongAdder does not claim to, nor make any attempt to, give you that guarantee. So LongAdder is definitely NOT a drop in replacement for AtomicLong; they have very different semantics.
From the LongAdder JavaDoc:
"This class is usually preferable to AtomicLong when multiple threads update a common sum that is used for purposes such as collecting statistics, not for fine-grained synchronization control. Under low update contention, the two classes have similar characteristics. But under high contention, expected throughput of this class is significantly higher, at the expense of higher space consumption."
Assuming you were using AtomicLong as a counter you will need to consider a few tradeoffs:
  • When NO contention is present, AtomicLong performs slightly better than LongAdder.
  • To avoid contention LongAdder will allocate Cells (see the previous post for implementation discussion); each Cell will consume at least 256 bytes (current implementation of @Contended) and you may have as many Cells as CPUs. If you are on a tight memory budget and have a lot of counters this is perhaps not the tool for the job.
  • If you prefer get() performance to inc() performance then you should definitely stick with AtomicLong.
  • When you prefer inc() performance and expect contention, and when you have some memory to spare then LongAdder is indeed a great choice.

Bonus material: How the Observer affects the Experiment

What is the impact of reading a value that is being rapidly mutated by another thread? On the observing thread side we expect to pay a read-miss, but as discussed previously here there is a price to pay on the mutator side as well. I ran the same benchmark with an equal number of inc()/get() threads. The process is pinned as before, but as the roles are not uniform I have no control over which socket the readers/writers end up on, so we should expect more noise as we cross the socket line (LA - LongAdder, AL - AtomicLong, both on JDK8, same type inc/get are within the same run, left column is number of inc()/get() threads):


inc,get   LA .get (err)       LA .inc (err)     AL .get (err)       AL .inc (err)
  1,1       4.7 +/-  0.2       68.6 +/- 5.0      10.2 +/-  1.0       39.1 +/-  7.6
  2,2      24.9 +/-  2.7       69.7 +/- 4.0      41.5 +/- 26.4       87.6 +/- 11.0
  3,3     139.9 +/- 24.3       69.0 +/- 8.2      55.1 +/- 13.3      157.4 +/- 28.2
  4,4     332.9 +/- 10.3       80.4 +/- 5.1      56.8 +/-  7.6      198.7 +/- 21.1
  5,5     479.3 +/- 13.2       84.1 +/- 4.4      71.9 +/-  7.3      233.1 +/- 20.1
  6,6     600.6 +/- 11.2       89.8 +/- 4.6     152.1 +/- 41.7      343.2 +/- 41.0

Now that is a very different picture... 2 factors come into play:
  1. The reads disturb the writes by generating cache coherency noise as discussed here. A writer must have the cache line in an Exclusive/Modified state, but the read will cause it to shift into Shared.
  2. The get() measurement does not differentiate between new values and old values.
This second point is important as we compare different means of mutating values being read. If the value being read is slowly mutating we will succeed in reading the same value many times before a change is visible. This will make our average operation time look great, as most reads will be L1 hitting. If we have a fast incrementing implementation we will cause more cache misses for the reader, making it look bad. On the other hand a slow reader will cause less of a disturbance to the incrementing threads as it produces less coherency noise, thus making less of a dent in the writer's performance. Martin Thompson has previously hit on this issue from a different angle here (note his posts discuss Nehalem and Sandy Bridge; I'm benchmarking on Westmere here).
In this light I'm not sure we can read much into the effect of readers in this particular use case. The 'model' represented by a hot reading thread does not sit well with the use case I have in mind for these counters, which is normally as performance indicators to be sampled at some set frequency (once a second or millisecond). A different experiment is more appropriate, perhaps utilising Blackhole.consumeCPU to set a gap between get() calls (see a great insight into this method here).
With this explanation in mind we can see perhaps more sense in the following comparison between AtomicLong on JDK7 and JDK8:


inc,get   AL7 .get (err)      AL7 .inc (err)      AL8 .get (err)      AL8 .inc (err)
  1,1       4.0 +/-  0.2       61.1 +/-  7.8       10.2 +/-  1.0       39.1 +/-  7.6
  2,2       7.0 +/-  0.3      133.1 +/-  2.9       41.5 +/- 26.4       87.6 +/- 11.0
  3,3      21.8 +/-  3.4      278.5 +/- 47.4       55.1 +/- 13.3      157.4 +/- 28.2
  4,4      27.2 +/-  3.3      324.1 +/- 34.6       56.8 +/-  7.6      198.7 +/- 21.1
  5,5      31.2 +/-  1.8      378.5 +/- 28.1       71.9 +/-  7.3      233.1 +/- 20.1
  6,6      57.3 +/- 12.5      481.8 +/- 39.4      152.1 +/- 41.7      343.2 +/- 41.0

Now consider that the implementation of get() has not changed. The reason for the change in the cost of get() is down to the increase in visible values, and thus cache misses, caused by the change to the increment method.