Psychosomatic, Lobotomy, Saw

Wednesday 11 July 2018

How Inlined Code Makes For Confusing Profiles

Inlining is a powerful and common optimization technique used by compilers.
But inlining combines with other optimizations to transform your code to such an extent that other tooling becomes confused, namely profilers.

Mommy, What is Inlining?

Inlining is the mother of all optimizations (to quote C. Click), and mothers are awesome as we all know. Inlining is simply the practice of replacing a method call with the method body. This doesn't sound like much of an improvement, but it's potentially a massive win for a couple of reasons:

Expanding the scope of other optimizations! calling a method with a constant parameter may just end up getting folded into a constant for instance. Most compilers take a view that you either know everything about a method, because it's the method you are compiling ATM, or you know next to nothing about it. Calling another method may end up blocking, or having some unknown side effect, or bring the world to an end. Inlining expands the horizons of all available optimizations into the inlined code. These benefits can be quite dramatic, eliminating much of the inlined code (callee) on the one hand and assisting in supporting assumptions about the compilation unit at hand (caller).
Very small methods call overhead may actually exceed the method cost. Importantly, and counter intuitively perhaps, the cost of calling a method is not fixed. The cost will depend for example on calling convention (whose job is it to keep track of registers), pre/post conditions (e.g. ordering and safepoints).

But Inlining is not without it's potential issues:

"code explosion". If I have a method of size 128b and that method cannot be reduced in any form by the context of it's caller, every time I inline it I increase my code size by 128b (a bit less, as I can remove the calling related prologue/epilogue). This may lead to a very bloated application, which will put pressure on my instruction cache as well as the JVM code cache.
Increased CPU pressure from the compiler itself, being challenged with larger and larger compilation units will lead to higher compile times, increased CPU usage, delayed compilation etc.
Increasingly complex compilation units may result in less optimal code than simpler ones. This is a real world limitation, though I would think in theory the more context the compiler has to work with, and assuming an infinite amount of cpu/memory/monkeys to throw at the problem, the better overall result we should get. In practice I have seen several cases where limiting inlining can improve results.

Picking the right boundaries is challenging, but luckily we got them bright compiler engineers to worry about that.

How Much Inlining Can We Do?

The amount of inlining a compiler does will depend on many factors. I'm not going to get very far if I try and explain the heuristics involved in great detail but some basic dimensions to the problem are:

Who is the callee? In Java almost all method calls are 'virtual'. This means that at runtime any number of implementations could possibly exist for a particular call. This is traditionally where you'd give up on inlining. To make inlining happen anyway the compiler must convert a virtual call into a static call before it can be inlined. The same code in a different runtime may induce different compiler decisions on this. JIT compilation allows the compiler to tailor the decision in a way that AOT compilation cannot (in a dynamic environment).
How big is the callee? Most compilers put some upper limit on how big a caller method can grow as well as how big can a callee method be and still get inlined. A very large caller will be blocked from further inlining and even a medium sized callee might be blocked from being inlined. What is considered big will depend on the compiler, configuration and the profile. Limitations on compilation time for JIT compilers also drive this decision, at least partially.
How deep is the already inlined stack? How many veils can the compiler pierce? What about recursion?

So, that's the general shape of the problem: Virtual x Size x Depth. And if you were gonna try and write an inlining strategy for a compiler you'd need allot more details, but we can keep going with this for now.

How Much Inlining Does Java Actually Do?

And by Java I mean the Oracle JDK 8u161. These things change whimsically from run to run, release to release, JVM to JVM etc, so stick a big YMMV on this shit.
We can have the JVM helpfully share with us compilation details via a bunch of flags, but most importantly for this feature we will need: -XX:+PrintInlining -XX:+PrintCompilation (output is a bit hard to reason about because concurrent logging...).
For the sake of demonstration, here's the output of the offer method of SpscArrayQueue from a throughput benchmark I run regularly for JCTools:

Some observations:

Tiny methods and some intrinsics are inlined even by the early compilation.
The offerSlowPath method which is not inlined by the first compilation is inlined by the C2 compiler at a second pass.

This is not a particularly challenging piece of code, but I picked it because it is easy enough for a human to follow. Small methods get inlined, even if they are nested a bit in other tiny methods.
There's a performance related nugget to dig here:

I split the offerSlowPath method with the intent of not having it inlined (or at least leaving this decision to the compiler). The compiler decided it is better off inlined. Who's right? has the compiler defeated my puny human brain? am I as worthless as my mother in law suspects? these questions will be answered some other day.

Where Do Profilers Come In?

In traditional Java profilers no attempt is made at distinguishing between inlined frames (calls to methods which have been inlined) and 'real' frames (real method calls on the stack). Telling the 2 apart can be useful:

Inlining is often a good thing, so failure to inline a method can be worth looking into. It's also instructive to see just how much the compiler can inline, and how many layers of distraction it must cut through to get to where the actual code you want to run is. Excessive "Manager/Abstract/Adapter/Proxy/Wrapper/Impl/Dimple/Pimpl" layering however can still prevent inlining and may also lead to erectile dysfunction, early onset dementia and being knifed by your colleagues.
Top of stack hot methods which are not inlined present obvious curios. Perhaps callers are too large, or the method itself?
Multiple implementors of same methods which are not inlined indicate a megamorphic callsite. If one of these is significantly hotter than the rest you might be able to special case for it (peel off the callsite with an instanceof check, effectively rolling your own conditional inlining).

But missing out on some good opportunities is not the issue I'm concerned with in this post. The point I'd like to get across is that Inlining makes an already confusing reality much much more confusing.

A Confusing Reality

CPUs execute instructions, these instructions are not our code but the product of a series of compilation and optimization steps. It follows that if we hope to learn where the bottleneck in our code/application is we must be able to translate back from CPU instructions (the product of C1/C2/Other JIT compilers) to bytecode (the product of javac/JVM language compiler/bytecode generating agent/Other) to Code coordinates (file + line or class + method).
This is not a straight forward mapping because:

Some instructions do not map to any bytecode: this can happen when the JVM generates code for it's own accounting purposes.
Some instructions map to many bytecodes: the compiler can be quite clever in substituting many operations with one specialized instruction, Or when the result of one computation can be reused elsewhere.
Some bytecodes do not map to Code: this can happen due to bytecode generation.
Sometimes the compiler just fails in it's book keeping :-(.

A modern JIT compiler employs a significant number of optimizations to any given method, allowing it to deliver us the speed we love, but making reverse engineering the instructions into code a challenge. The mapping rules employed by profiling tools trying to bridge this gap, and by the compilers trying to give them a usable index boil down to:

We assume instruction X translates to at most 1 bytecode location.
When a mapping does not exist, find the nearest instruction Y which has a mapping and use that.

This is already confusing, as the imperfect mapping, coupled with the compiler's license to reorder code combine to make "nearest instruction" possibly be "not nearest line of code".
The problem is made worse by the inaccuracy of the profiler observations themselves. It is a known issue, largely unsolved, that the reported instruction from which the mapping starts is often a few (in theory many) instructions away from the intended sample location. This can have a knock on effect on the ultimately blamed/mapped code. This issue was touched on in a previous post on AGCT profilers.
When looking at a single method this is often a pesky, but ultimately managable problem. Sure, the profiler pointed to line A, but any fool can see the problem is line B... silly little hobbit.

MOAR, BIGGER, CONFUSION

So, given that:

The instruction reported is a few instructions off.
The instructions do not map perfectly to lines of code.
The order of instructions does not have to match the order of the code.

We can see how even a small inaccuracy can translate into a large distance in lines of code between the cost we are trying to profile and where we end up. This is made exponentially worse by inlining.

Where we had one method, we now have many methods, from many classes. As the code gets pulled in it is subjected to the same compiler "remixing", but at an expanded horizon. For instance, a computation repeated in the caller and callee can now be done once. Considering that inlining 4-6 layers of indirection is very common the "LOC distance" between "near instructions" can end up being very confusing.

I should really elaborate with some clear example, but I do not have a repeatable experiment to hand and already stalled with this article for a few months while distracted elsewhere. An interesting example of how this could work out is available here. In this issue we have the wrong instruction being blamed for the cost of a cache-miss in a loop. The cache miss originates in the bowels of an inlined LinkedHashMap::get. The nearest instruction with a mapping however is from Field::binaryValue(line 441) it's easy to see how this would end up a very misleading profile if considered from a code perspective. When observed at the assembly level the instruction slip is much easier (as much as reading assembly is easy) to see through, but profiling on the assembly level is something very few people do and very few tools support.

Summary

Inlining is good, and profiling is good. But profiling modern compiled inlined code presents us challenges which are not easily solvable by looking at the common output of most tools. The more sense the compiler makes of our code, it's context, it's execution tree, the less obvious the mapping between instructions and code.
So, what can we do? I think we should have our tools approach the problem in 2 ways:

Fuzzier profiles: What would happen if instead of assigning cost to a single instruction (and it's code coordinates) we distributed the cost on a range of instructions? We could have more or less sophisticated models, but even a naive bell curve around the sampled instruction would bring in more context to the profile, and may help. This is hypothetical, as I am not aware of any tools which attempt this ATM.
Differentiate between real/inlined frames: This is something the JVM is certainly capable of doing, but most APIs for profiling do not support. AFAIK only Oracle Studio and perf/bcc + perf-map-agent support views on real vs. inlined. I tend to use the latter more than the former. There's talk of adding support for this into async-profiler.

Many thanks to V. Sitnikov for reviewing :-).

This post is one of a few on the topic of profiling, check the other ones out:

Friday 5 January 2018

What a difference a JVM makes?

JDK 9 is out! But as a library writer, this means change, and change can go either way... Once we've satisfied that JCTools works with JDK9, what other observations can we make? Well, one of the main motivations for using JCTools is performance, and since the code has been predominantly tested and run with JDK8, is it even better with JDK9? is it worse?

A JVM gets into trouble

I started my comparison (and we will not cover anything else cause the fuckers broke early) with the simplest queue, the SpscArrayQueue:
$ taskset -c 4-7 java -jar jctools-benchmarks/target/microbenchmarks.jar throughput.QueueThroughputBackoffNone -p qType=SpscArrayQueue -p qCapacity=131072 -jvmArgs="-Xmx1g -Xms1g" -i 5 -wi 15 -r 5 -w 1 -f 3

Oracle8u144:
offersFailed | 0.020 ± 0.020 ops/us
pollsFailed | 0.093 ± 0.102 ops/us
pollsMade | 361.161 ± 4.126 ops/us

Oracle9.0.1:
offersFailed | 0.065 ± 0.269 ops/us
pollsFailed | 5.987 ± 2.788 ops/us
pollsMade | 26.182 ± 2.273 ops/us

Some explanations on method and results:

This is running on my beefy laptop, lots of memory to spare, Xeon(R) CPU E3-1505M v6 @ 3.00GHz. I set the gov'nor to "userspace" and frequency to 2.8GHz to avoid CPU frequency scaling and turbo boosting while benchmarking. I use taskset above to pin the JVM process 1 logical core on each physical core, so no 2 threads share a core. Easier than disabling HT, and sufficient for this exploration.
The QueueThroughputBackoffNone benchmark is an all out throughput benchmark for queues where producers and consumers spin to offer/poll (with j.u.Queue semantics). Failures are recorded and the throughput observed is the rate of successful polls per microsecond(so millions per second if you prefer that figure). The benchmark is run with a single producer and single consumer thread, and is known to be sensitive to producer/consumer speed balance as contending on empty/full queue can lead to degraded performance. See some discussion of the benchmark here.

Back to the results. They is not good :(
Why would this happen? HOW COULD THIS HAPPEN!!!

I profiled this miserable little bastard only to find that on the offer side of there's a previously unobserved bottleneck:

What's happening? The whole point of SPSC is that there's no need for a strong memory barrier, no need for lock add or CAS, just some careful ordering (using putOrdered/lazySet). But here we got this LOCK ADDL (line 34) instruction spoiling all the fun and eating all the cycles (but the next line is getting all the blame, typical).
Where did it come from? I didn't put it there. Note that this is add 0 to the base of the stack(-0x40), which is how a StoreLoad barrier is implemented (see interesting post on barrier implementation details). But there's no volatile store in sight.

A JVM gets out of trouble

TBH, even before bothering to look at the assembly I reached for a big "usual suspect" for performance differences with JDK9: G1GC is the new default GC!

I quickly verified this theory (that G1GC fucked this up for me) by re-running the same benchmark on JDK9 with the JDK8 default GC (-XX:+UseParallelGC). Got similar results to JDK8. Awesome-ish:
Oracle9.0.1 (-XX:+UseParallelGC)
offersFailed | 0.059 ± 0.133 ops/us
pollsFailed | 0.147 ± 0.251 ops/us
pollsMade | 356.100 ± 6.305 ops/us

The bottleneck I've hit? it's G1GC card marking! I even wrote a post about it. G1GC write barrier is quite different from CMS/Parallel. You'll notice that the barrier discussed there is a little different from the one we see here. Times, they are a changing...

So... the G1GC write barrier is ANGRY. Why so angry?

What this means is that the g1_write_barrier_post is pissed because:

buffer (the backing array for the queue) and element (being offered) are from different regions
element is not null
card (for the buffer) is not young

Confusingly, when playing around with this issue I moved from a small machine (8gb) to a bigger one (32gb) and when running with a larger heap size specified the issue became allot less pronounced. This is because, if we set the size of the heap to 1g we get 1mb regions. If we set the size of the heap to 8g (set both mx and ms) however we get 4mb regions. We can demonstrate this is the issue by running again with G1 and setting the region size:
Oracle9.0.1 (-XX:+UseG1GC -Xms1g -Xmx1g -XX:G1HeapRegionSize=4m) offersFailed | 0.009 ± 0.033 ops/us pollsFailed | 0.252 ± 0.257 ops/us pollsMade | 183.827 ± 16.650 ops/us

So, still not so brilliant, but much better. This however implies that the improvement is due to the buffer and the element being allocated from the same region. This is purely a product of the benchmark and the chosen queue size, and is not typical of normal applications. It follows therefore (I think) that the behaviour we see here is quite likely to manifest pathologically for writes into long standing data structures, such as caches and queues. And indeed if we increase the queue capacity the situation reverts back.
All this messing around with JDK9 vs 8, runtime and GC impact comparison got me thinking of a couple of other regional collectors which might exhibit interesting behaviours here. Namely, Zing C4 and Shenandoah.

Put some Zing in that thing

I no longer work at Azul, so I had to get an Open Source developer licence, which was a breeze. I had to get some help with getting it all to work on Ubuntu 17.10, but 16.04 works off the bat (just follow the instructions). Starting with same parameters I got:
Zing8u17.12:
offersFailed | 0.016 ± 0.010 ops/us
pollsFailed | 0.013 ± 0.022 ops/us
pollsMade | 302.288 ± 3.602 ops/us

So Zing is 20% slower than Oracle ParallelGC here, but significantly better than G1 (either case). The JMH perfasm profiler does not work with Zing, though Azul does have a matching tool if you ask their support. To look at the profile I can either use ZVision, or Oracle Studio. I went with the latter, just because.
The profile is hard to read, so I might go into it another time, but the question that seems obvious is: "Is Zing slower than Oracle+ParallelGC because of read/write barrier costs?"
Zing after all is not a GC add-on to OpenJDK, but a completely different runtime+GC+compiler. In particular, Zing has recently switched from their old C2 like compiler (they forked paths many moons ago, but share a parent in Cliff Click), to an LLVM compiler which is now the default called Falcon. Testing that quickly by forcing Zing to use the C2 compiler yields the following results:
Zing8u17.12 (-XX:+UseC2):
offersFailed | 0.034 ± 0.055 ops/us
pollsFailed | 0.010 ± 0.017 ops/us
pollsMade | 198.067 ± 31.983 ops/us

OK, so Falcon is a big win for Zing here, that's not the issue. Can we take the read and write barriers out of the picture?
Sure we can! Zing supports 2 exciting GCs, you may have heard of the C4 GC, but Zing also supports the NoGC (it's very Zen), which is exactly what it sounds like. Running with no GC however may remove some positive effects GC has (e.g. not crashing when you've allocated more than your heap size and never collected, but also compacted relocated data), so we need to run NoGC with barriers, and NoGC with no barriers:
Zing8u17.12 (-XX:+GPGCNoGC -XX:+UseSVBs -XX:+UseLVBs):
offersFailed | 0.035 ± 0.053 ops/us
pollsFailed | 0.022 ± 0.043 ops/us
pollsMade | 302.433 ± 2.675 ops/us

So, turning GC off makes no difference at all for this benchmark. That's great, as we can consider the removal of the barriers in isolation:

Zing8u17.12 (-XX:+GPGCNoGC -XX:-UseSVBs -XX:-UseLVBs):
offersFailed | 0.099 ± 0.070 ops/us
pollsFailed | 0.027 ± 0.048 ops/us
pollsMade | 314.498 ± 17.872 ops/us

Note that we see:

Some improvement when barriers are removed.
Increased variance in results. This was due to large run to run variance which requires further digging.

So, while we can certainly see a difference here which is due to read/write barriers, that is not the whole story(maybe another day). My gut feeling is that Falcon is over inlining in this instance.

Oh Shenandoah

For Shenandoah I grabbed one of the builds provided by the benevolent Shipilev here. The build I got is this one: build 1.8.0-internal-jenkins_2017_11_12_03_35-b00

Running with same heap size, but remembering that these builds don't default to Shenandoah:

Shenandoah (-XX:+UseShenandoahGC):

offersFailed | 0.031 ± 0.024 ops/us

pollsFailed | 0.009 ± 0.025 ops/us

pollsMade | 143.165 ± 3.172 ops/us

Note that for Shenandoah there's not a massive imbalance between the offer/poll side, which indicates the issue is not pathological to one method or the other. For G1GC the problem was very clearly on the offer side. I had a peek at the assembly, but it's going to take me some time to get to grips with what's going on as it's a completely new set of tricks and quirks to look at. To get a high level view though, it's interesting to compare the HW counters for ParallelGC/Zing/G1/Shenandoah:

| PGC | Zing | G1 | Shenandoah

pollsMade | 368.055 | 303.039 | 194.538 | 140.236

CPI | 0.244 | 0.222 | 0.352 | 0.355

cycles | 7.629 | 9.319 | 14.764 | 19.998

instructions | 31.218 | 41.972 | 42.002 | 56.294

branches | 6.052 | 11.672 | 7.735 | 9.017

L1-dcache-loads | 10.057 | 10.936 | 14.320 | 26.707

L1-dcache-load-misses | 0.067 | 0.149 | 0.162 | 0.100

L1-dcache-stores | 3.553 | 3.042 | 4.057 | 5.064

This is a tiny workload, with 30-50 instructions per operation. I want to make it clear that it is very easy to have large differences between JVMs in such specific scenarios. This workload is all about loading and storing references in/out of an array. The data is all L1 resident, this is NOT A REPRESENTATIVE COMPARISON OF PERFORMANCE. If you want to know how these JVMs/GCs can help your application, run a representative workload with your application.
Seriously, let's not start a "my JVM/GC is better than thou" war here, OK people?
For simplicity of comparison I've run PGC/G1/Shenandoah out of the Shenandoah build that includes all 3. This makes for a simpler comparison as they all share the same compiler, but is not comparing with the relevant Oracle build (though it should be pretty much the same).
Zing has better CPI than PGC, but 10 more instructions per operation. These include 1 extra load, and 6 more branches. There are no branch misses in this workload, so the branches are just extra pressure on the branch predictor and more instructions. The 10 instructions difference translates into a 1.6 cycle difference, this is indicative of the success of the branch predictor in reducing the impact of the branches. These extra branches are the Zing LVB or read barrier. Each reference load costs at extra branch. Zing is doing less stores here, this is due to it's defensive approach to card marking.
G1 is given here with the good case (same region), as we already covered the bad case. We see G1 increasing the loads by 4 but using less branches than Zing, only 2 extra branches. These are related to the write barrier. We also see an extra store.
Shenandoah is using 3 extra instructions, and 16 extra loads. This is more than I expected. Since Shenandoah is using a Brooks-Pointer you would expect an extra load for each reference load. If we estimate from the Zing branch increase that we have 6 reference loads, I'd expect to have 6 extra loads on the Shenandoah side. I assume the other loads are related to card marking but I will need to learn more about this collector to say anything. Shipilev has expanded on my rudimentary analysis here. His conclusion: "read and write barriers around Unsafe intrinsics are very active. C2 handling on Unsafe intrinsics uses CPUOrder membars a lot (http://hg.openjdk.java.net/shenandoah/jdk10/file/1819ee64325f/src/hotspot/share/opto/library_call.cpp#l2631),which may inhibit some barrier optimizations. The workload is also tied up in a very unlucky volatile-predicated loop that prevents barrier hoisting.
```
Pending codegeneration improvements alleviate barrier costs even when they are not optimized."
```

Since I fully expect all the crusaders to get hot and bothered about the above I thought I'd throw in...

A Holy Graal!

Graal is the next gen compiler for HotSpot. Coming out of Oracle Labs and already running in Twitter production, I thought we should add another compiler dimension to this mix. To run Graal you can use: "-XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler"

Comes included in the Java 9 package!!! So how does it do?

Oracle9.0.1 (-XX:+UseParallelGC -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler)

offersFailed | ≈ 0 ops/us

pollsFailed | 0.347 ± 0.441 ops/us

pollsMade | 51.657 ± 3.568 ops/us

WHHHHHHHHYYYYYYYYYYYY!!!!!!!!

Why can't we have nice things? Well... It turns out the good folks at Oracle Labs have not yet implemented putOrdered as nicely as C2 and Falcon has, and have thus competely buggered up my SPSC queue :(

Summary: There Are Many JVMs In My Father's House

Variety is the spice of life as they say. In the narrow narrow usecase presented above we saw different JVMs/GCs/JITs throwing up all different behaviours. Some good, some less so. For this workload there's very little happening, no GC, no exciting vectorization opportunities, it's very limited. But, being limited has the benefit of simplicity and the opportunity to contrast.
Also note that by profiling and refining this code on C2 I have perhaps overfitted it to one compiler at the expense of others, I certainly had more opportunity to eliminate any issues for the C2+ParallelGC scenario....

I hope you enjoyed to tour, I encourage you to take these new friends home and play with them ;-).

Tuesday 14 February 2017

Java Flame Graphs Introduction: Fire For Everyone!

FlameGraphs are superawesome. If you've never heard of FlameGraphs and want to dive straight in the deep end, you should run off and check out the many many good resources provided by Brendan Greg in his one stop shop page here. This post will give a quick intro and some samples to get you started with collecting profiles for all JVMs everywhere. I'm taking a slightly different tack then Brendan in presenting the topic, so if it turns out my explanations suck you should see if his make more sense.

What's the big deal?

If you've ever used a profiler to look at your code you will have seen 2 profile reports typically:

Flat profile: This is often presented as the "top X" methods/classes/packages where time (or samples, or ticks or whatever) is spent. This is useful as it immediately shows up common bottlenecks across your code, but these are shown out of context. Sometimes this is enough, but often in larger application profiles context is significant. This representation is very useful when a method with a high overall impact is called from many callsites, making each callsite cheap but the method itself significant.
Tree profile: This profile will present you with a call tree where each method is a node with a total and self time quantity. The self measure implies the amount of time spent in the method itself(the amout of samples in which the method is the leaf), and total is for the total number of samples in which it shows up (leaf and node).

The problem with the tree view is that it is very unpleasant to navigate. Click click click click and as the stack deepens it becomes harder to look at and ingest. Enter FlameGraphs.

FlameGraph represents a tree profile in a single interactive SVG where:

The x-axis shows the stack profile population, sorted alphabetically (it is not the passage of time), and the y-axis shows stack depth. Each rectangle represents a stack frame. The wider a frame is is, the more often it was present in the stacks. The top edge shows what is on-CPU, and beneath it is its ancestry.

Like most visualisations, it makes sense when you see it rather than explain it. Let start with data sets we can easily grasp and see what they look like.

Synthetic Samples For Starters

For instance, what does a single stack sample look like? The FlameGraphs SVG generating script takes as it's input a "collapsed stacks" file which has a dead simple format, frames separated by semi-colons followed by the number of times this stack was sampled. Here's a dummy handwritten example of a single sample file (call it sample.cstk):
main;0;1;2;3;4;5;6;7;8;9;10 1

We can feed this to the flames (now is a good time to clone this repo and try shit out):
flamegraph.pl single-stack.cstk > single-stack.svg

Here's what a single stack trace looks like:

But a single stack trace is just one data point, not a profile. What if we had 1M samples of this same stack?

Well.. it would look pretty much the same, but if you hover over it will tell you it got 1M samples. It looks the same because we still have 100% the same stack for the whole profile. It's the same profile.
"BUT!" I hear you say, "But, colours?". Yes the colors mean nothing at this point, but will become interesting later. The default colour palate is red and the choice of colors is random, hence the different colour selection changes from run to run. Just forget colors for a second, OK?
Right, next we want to look at a set of samples with a few more stacks:
main;0;1;2;3;4;5 1
main;0;1;2;3;4;5;6 2
main;0;1;2;3;4;5;6;7 3
main;0;1;2;3;4;5;6;7;8 4
main;0;1;2;3;4;5;6;7;8;9 5

Now you can also get a feel for what clicking around does and how you zoom in and out.
By now I hope you get the picture for how a bunch of stacks and frequencies look with a simple data sets. Last synthtic example to look at has several root frames and a little more varied stacks. Lets try this:
main;a;1;2;3;4;5 1
main;c;1;2;3;4;5;6;7 4
main;a;1;2;3;4;5;6 2
main;c;1;2;3;4;5;6 4
main;c;1;2;3;4;5;6;8 4
main;b;1;2;3;4;5;6;7 3
main;b;1;2;3;4;5;6;8 3
main;b;1;2;3;4;5;6;9 3
main;d;1;2;3;4;5;6;7;8;9 5

And we get this:

We see here that stacks are sorted alphabetically and ordered from left to right. The ordering has nothing to do with the order in the file. The collapsed stacks format is itself an aggregation with no view on timing. So the order from left to right is only about merging, not time or anything else. We can see that stacks which share a common parent naturally aggregate under that parent. The width of each frame is it's relative total-time share. It's self-time share is it's top exposure, or how much of it is not covered by it's callees, the frames on top of it.

Tree View vs Flames

Now that we got the hang of this flamy thing, lets take a look at the same profile using 2 presentations. The venerated tree-view and this new hipsterish whatever flame thing. The following is a profile collected using honest-profiler for a netty benchmark:

In typical workflow I step further and further into the hot stack, but this pushes out the big picture out of my view. I would now have to go back up and fold it to see what hides under other hot branches in the tree. It's a familiar and annoying experience if you've ever used a profiler with this kind of view. The problem is that Java class and method names are typically long, and stacks are quite deep. This is a simple application and I quickly run out of room.
Here's the FlameGraph for the same profile (I chose green, because later it makes sense):

NOTE: I made all the flame graphs in this post narrow so they fit the layout. They don't have to be this narrow. You can set the width to whatever you like, I used "--width=700" for the graphs in this post.
We can see the root frames quickly break out to main activities, with the deep netty stack now visible upfront. We can click and zoom easily. I also find the search ability which colors matching strings useful to highlight class/package participation in the profile. Prominent flat-tops indicate hot leaf methods we might want to look at.
It's perhaps a matter of taste, but I love it. I've been using flame graphs for a while and they are pure genius IMO. I find the image itself is intuitively more approachable, and with the ability to quickly zoom in/out and search I can usually quickly work through a profile without losing sight of the full picture.
So how do you get one?

It's Bob! yay?

Everybody Gets A FlameGraph!

Yes, even you poor suckers running JDK 1.3 on Windows XP! I don't recommend this method of profiling if you have a more recent JVM, or if your JVM supports AsyncGetCallTrace, but if your deployment is stuck in the past you can still be in on this. This is because ALL JVMs must support JVMTI and AFAIK allow you to hit them with jstack/JVisualVM/hprof. It's a terrible way to profile, there's allot of overhead, and usually you can find a better way, but this is universally available. Collecting a sample via jstack is (a terrible idea) quite easy. Just find the pid of the process you want to profile using jps and then do something like:
for i in {1..100}; do
jstack <pid> >> iloveoldshit.jstk;
sleep 0.1;
done

And Bob is your relative (which is a good thing apparently).

Once you've collected a large enough sample for your application you can go on and feed flame graphs:

cat iloveoldshit.jstk | ./stackcollapse-jstack.pl | ./flamegraph.pl --color=green > jstack-flames.svg

And you get:

This is the same benchmark from before, but different profile with the safepoint bias. You can compare the two by scrolling up and down. OR you can use FlameGraphs to diff the 2, in a moment.
FlameGraphs supports converting jstack output into collapsed stacks (as above). Efforts exist on GitHub to convert the hprof format which JVisualVM produces (as well as other profilers) into collapsed stack format.
So for all my poor readers, struggling under the oppression of JVMs for which better means to spill out stack samples do not exist, I got your backs, you too can be looking at flame graphs!
But seriously, just move on.

Level Up: AsyncGetCallTrace or JFR

Now, if you want a better profiler, which does not result in bringing your application to safepoint and pausing ALL your threads at each sample AND your are either running a 1.6 or higher JDK (OpenJDK/Oracle/recent Zing) on Linux/Mac/BSD you can use Honest-Profiler to collect your profile. If you got Oracle JDK 1.7u40 or higher you can use Java Flight Recorder (if you use it in production you need to pay for the licence). These profilers rely on AsyncGetCallTrace to record the Java stack safely on interrupt(from a signal handler, not at safepoint).
To collect with Honest-Profiler I start my JVM with the following parameter:
-agentpath:$HONEST_PROFILER_HOME/liblagent.so=host=localhost,port=4242,logPath=$PWD/netty.hpl
Then, when I feel the time is right, I can start and stop the profile collection
echo start | nc localhost 4242
echo stop | nc localhost 4242
To convert the binary format into collapsed stacks I need to use a helper class in the honest-profiler.jar:
java -cp $HONEST_PROFILER_HOME/honest-profiler.jar com.insightfullogic.honest_profiler.ports.console.FlameGraphDumperApplication netty.hpl netty.cstk

I can then feed the flamegraph script the collapsed stacks file and get the result which we've already seen.
To convert JFR recordings to flame graphs see this post. But remember children, you must pay Oracle if you use it in production, or Uncle Larry might come for you.

Bonus: Diff Profiles!

A nice benefit of having 2 different profilers produce (via some massaging) a unified format for flame graphs is that we can now diff profiles from 2 profilers. Not something that is generally useful, granted. But diffing profiles is an established capability in many profilers, and is usually used to do a before/after comparison. Flame Graphs support this via the same visualization. Once you have converted your profiles into the collapsed stacks format you can produce a diff file and graph it:

./difffolded.pl -n A.cstk B.cstk | ./flamegraph.pl > A-B-diff.svg

Diffing the Honest-Profiler and jstack profiles gives us the following:

The white squares are not interesting, the red/pink squares highlight the delta of self samples as a percentage of total samples (not so intuitive). I admit it may seem a tad confusing at first, but at least it draws your eyes to the right places. More on differential flame graphs here.
Note: to make this diff work I had to shave off the thread names from the jstack collected collapsed stacks file.

Further Bonus: Icicle Graph

Some times the bottleneck is not a particular call stack plateau, but rather a particular method being called from many call sites. This kind of bottleneck will not show well in a flame graph as the different stacks with similar tops will be split and may not stand out. This is really where a flat profile is great, but we can also flip the flame graph view to highlight the top method merging:

cat netty.cstk | ./flamegraph.pl --reverse --invert --color=green > netty-icicles.svg

I've filtered the stacks to show only a relevant portion:

This is not dissimilar to functionality offered by other profiler GUIs which allow the drill down direction to start from hot methods and into their callers in a tree view presentation.

Level UP++: Java Perf Flame Graphs FTW!

If you are so fortunate as to:

Be running OpenJDK/Oracle 1.8u60 or later(this functionality is coming to Zing in a near future release, fingers crossed)
Running on Linux
Got permissions to make this run

You can get amazing visibility into your system by using a combination of:

Java with: -XX:+PreserveFramePointer (also recommended -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints so unfolded frame are more accurate)
perf-map-agent: This attachable agent dumps a symbol mapping file for all runtime generated code in the JVM, enabling perf to correctly resolve addresses to methods. You'll need to clone and build.
perf: You'll need permissions and you'll need to install it. I assume you are root of you own machine for simplicity.

With the above working you can collect a perf profile of your Java process(by itself or as part of whole system). This results in a perf.data file and a perf-<pid>.map file in your /tmp folder. You can then proceed to generate a collapsed stack profile from that file, the simplest way to get this going is by using a script packed with perf-map-agent:

perf-java-flames <pid>

This will ask for password as it needs to sudo a few things. Be the sudo you want to see in the world. After a suspenseful wait of 15 seconds you'll get this:

Note: Netty deploys it's native lib into tmp, loads it and deletes it, which means perf gets lost looking for it. I deleted it from the benchmarks jar and loaded it directly using LD_LIBRARY_PATH to resole this visibility issue. It doesn't make a huge difference, but in case you try this out.

The green frames are Java, and the rest are all the magic which happens to help Java along. Here's what happened:

Red frames are C library or Kernel code. We can see that the socket writes in the original profile actually go into native land. We now have further visibility down the hole. Importantly this illustrates where hot Java methods are in fact not Java methods at all and so looking at their Java code for optimisation ops is futile.
Yellow frames are C++. We can see the call chain leading into the interpreter and ultimately into compiled Java code.
Green frames are Java. BUT if you compare the previously presented profiles will this one you will notice there are some intermediate frames missing here. This is because the frames in this profile are "real" frames, or rather they map to stack frame. Inlined methods in Java do not have their own stack frames, so we can't see them (for now, we'll sort this out in a second). Further more, the keen observer will notice the familiar "Thread.run" bottom of the stack is missing, replaced by the "interpreter". As is often the case, the run method did not get compiled in this benchmark so it is not a proper compiled method for which we have a mapping. Methods in the interpreter are opaque in this profile.
Some stacks are broken, which can be confusing. In the example above we can see the 13.8 unknown chunk which leads to some JVM activities, but also to some Java code. More on that later.

So, it would seem that we have gained something in visibility into the native/OS/JVM CPU utilization, but lost allot of information we had in the Java side. When is this still useful:

This profile is super useful if you are writing Netty and trying to workout which system calls you end up with from your JNI code, or where time is spent in that code (netty implements it's own native epoll selector, very cool). If you are writing an application which utilizes JNI libraries this profile will give you visibility across the divide. The alternative here would be to use 2 profilers and try and correlate them. Solaris Studio also offers some help here, I will one day write a post on Solaris Studio.
This in not a good example of a profile dominated by JVM threads, but in many profiles the GC activity will show up. This is very useful, as GC and compiler CPU utilization can get in the way of application threads using the available CPU. A Java only profiler leaves you to correlate GC/compilation logs and application profile to figure out who ate the pie. It's also an interesting view into which part of the GC is to blame.
Some JVM intrinsics are confusing to AsyncGetCallTrace, and invisible to safepoint profilers. The biggest culprit I see is array copy. Array copies will show up as failed samples on AGCT profilers (unless, like JMC they just fail to tell you about failed samples all together). They show up in this profile (search above for arraycopy), but only a tiny bit.
This profile can be collected system wide, allowing you to present a much wider picture and expose machine wide issues. This is important when you are looking at machine level analysis of your application to improve configuration/setup.
In depth view of OS calls can inform your configuration.

'look at me! look at me now!' said the cat.

'with a cup and a cake on the top of my hat!

I can hold up TWO books!

I can hold up the fish!

and a little toy ship!

and some milk on a dish!

and look!

I can hop up and down on the ball!

but that is not all!

oh, no. That is not all...

Bonus: Inlined Frames! Threads! COLOR!

We can win back the inlined frames information by asking perf-map-agent to create a more detailed map file with inlining data. This leads to larger map files, but should be worth it.
You can further tweak the command line to color kernel frames differently and control sample duration and frequency. And while we're a-tweakin' lets also have threads info.
Here's what you run:

PERF_COLLAPSE_OPTS="--kernel --tid" PERF_RECORD_FREQ=99 PERF_RECORD_SECONDS=10 PERF_MAP_OPTIONS=unfoldall perf-java-flames <pid>

And the graph now looks like this:

The Java frames are now green and aqua. Aqua frames are inlined frames and green are "real". This information is not presented at all by most profilers, and is pretty inaccessible in others. Here we can instantly see some interesting inlining challenges in the tall towers of same sized frames. The compiler inlines through many, but maybe eventually gives up, maybe there's something to be won by simplifying the abstraction here?
Thread id is added as a base frame. This is helpful in this particular example because there are only 2 interesting threads and I very much want to see this split. It also helps bring back some broken stacks into the fold. Now I can tell these frames belong to the Netty epoll thread. Yay.
Orange frames are kernel frames.
Having the thread id highlights that the none Java frames on the left are from a variety of threads. If we had more GC/Compiler work happening this may become interesting.
Profiling a large application with thread pools this separation by thread may not be what you want... but sometimes it is very helpful, like above. In this benchmark I have a thread generating load and a thread I want to profile, so telling them apart works great. At the moment there's no mapping of threads to thread names, but in future we may be able to easily group thread pools for more meaningful views.

Bonus: Hack It All Up

There's very little code in perf-map-agent, and the scripts would take you 5 minutes to read through. You don't have to use the scripts, you can write your own. You can add or enhance features, it's easy to participate or customize. Dig in, have fun :-)

The FlameGraph scripts are nice and tidy, and the pipeline separation of [profile -> collapsed stacks -> graph] means you can read through them and tweak as you like the bits you care about without caring too much about the rest. While working on this post I played with the diff presentation a bit. It was my first ever interaction with Perl, and I'm not so very bright, and I managed to get what I wanted. Surely someone as fine as yourself can do better.

If you look at Brenden's updates page you'll see many many people are jumping in and tweaking and sharing and making funky things. Go for it!

Summary And Credits

So you can have flame graphs, all of you. And you can feed these with inputs from several sources, each with their own set of pros and cons. It makes a great tool in your tool box and may give you that extra perspective you are missing in your profiling.

There's a few people who deserve mention in the context of the tools above, look them up, they all helped make it happen:

Brendan Gregg: FlameGraphs proud daddy. Brendan has written allot about FlameGraphs, work through his posts and you'll learn plenty.
Johannes Rudolph: Author and maintainer of perf-map-agent.
Jake Lucianni: Contributed flamegraphs support for inlined frames.
Richard Warburton: Author and maintainer of honest-profiler.

Thanks for reading :-) next up some JVM profile analysis

Tuesday 20 December 2016

What do Atomic::lazySet/AtomicFieldUpdater::lazySet/Unsafe::putOrdered* actually mean?

Paved with well intended definitions it is.

lazySet/putOrdered (or an ordered store) was added as a bit of a rushed/non-commital afterthought after the JMM was done, so it's description is subject to many debates on the mailing lists, stack overflow and watercoolers the world over. This post merely tries to provide a clear definition with references to relevant/reputable sources.

Definition:

An ordered store is a weakened volatile store[2][5]. It prevents preceding stores and loads from being reordered with the store[1][3][6], but does not prevent subsequent stores and loads from being reordered with it[2][4].

If there was a JMM cookbook entry for ordered store defining it with barriers in mind it would seem that the consensus is that ordered stores are preceded by a StoreStore AND a LoadStore barrier[4][6].

Ordered store is practically the same as a C++ memory_release_store[5][7].

References:
[1] Original bug:

"lazySet provides a preceding store-store barrier (which is either a no-op or very cheap on current platforms), but no store-load barrier"
See here: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6275329

[2] java.util.concurrent docs:

"lazySet has the memory effects of writing (assigning) a volatile variable except that it permits reorderings with subsequent (but not previous) memory actions that do not themselves impose reordering constraints with ordinary non-volatile writes."
See here: https://docs.oracle.com/javase/8/docs/api/?java/util/concurrent/package-summary.html

[3] JMM cookbook: Defining barriers meaning here: http://g.oswego.edu/dl/jmm/cookbook.html

[4] concurrency-interest Q&A with Doug Lea, October 2011:

"[Ruslan]:... If it happens (== we see spin-wait loop finished) -- does it mean,that all writes preceding lazySet are also done, committed, and visible to thread 2, which finished spin-wait loop?

[Doug]: Yes, although technically, you cannot show this by reference to the Synchronization Order in the current JLS.
...
lazySet basically has the properties of a TSO store"
See here: http://cs.oswego.edu/pipermail/concurrency-interest/2011-October/008296.html

The discussion is REALLY worth reading, involving Hans, Doug, Vitaly, Ruslan and other such respectable members of this excellent mailing list. Go on, I'll wait.

The discussion on that thread concludes the following is required:
...
LoadStore + StoreStore
st [Y],X // store X into memory address Y
...

Outcome: Stores before and after are now prevented from floating across the barrier. Loads before the barrier are also prevented from floating down. Later loads are free to float up. Note that st may in theory be delayed indefinitely, certainly other loads and stores are allowed to float up between it and the barrier.

[5] concurrency-interest Q&A with Aleksey Shipilev, May 2016:

"putOrdered is a release in disguise, most of the C++11 std::atomic(...,

mem_order_release) reasoning applies here."

"acquire/release are the relaxations from the usual volatile

rules -- while producing happens-before-s, they drop from total

synchronization order, thus breaking sequential consistency."

And adds some fine examples:

"Safe publication still works:


                       int x; volatile int y;
-----------------------------------------------------------------------
    put(x, 1);                   |  r1 = get{Acquire|Volatile}(y);
    put{Release|Volatile}(y, 2); |  r2 = get(x);

(r1, r2) = (2, 0) is forbidden.

But anything trickier that requires sequential consistency fails. IRIW
fails, because no consistent write order observed by all threads. Dekker
fails, because release stores followed by loads may or may not be
visible in program order:

                     volatile int x; volatile int y;
-----------------------------------------------------------------------
    putRelease(x, 1);            |    putRelease(y, 1);
    r1 = getAcquire(y);          |    r2 = getAcquire(x);

(r1, r2) = (0, 0) is allowed. Forbidden if all ops are volatile.


Safe construction still does not work (even for volatiles!):

                                A global;
-----------------------------------------------------------------------
    A a = <alloc>;                  |  A a = global;
    put{Release|Volatile}(a.x, 1);  |  r1 = get{Acquire|Volatile}(a.x);
    global = a;                     |

(r1) = (0) is allowed."

See here: http://cs.oswego.edu/pipermail/concurrency-interest/2016-May/015104.html

[6] concurrency-interest Q&A with Aleksey Shipilev, March 2016:

"> int a, b;

> 
> boolean tryLock() {
>     UNSAFE.putOrdered(a, 1); // Declare intent.
> 
>     // No StoreLoad here as store is not volatile.
> 
>     if (UNSAFE.getVolatile(b) == 1)) {
>         // Reset intent and return false;
>     }
> 
>     return true;
> }

Even in the naive barrier interpretation that usually gives stronger
answers, you have:

 [LoadStore|StoreStore]
 a = 1;

 r1 = b;

 [LoadLoad|LoadStore]"

See here: http://cs.oswego.edu/pipermail/concurrency-interest/2016-March/015037.html

[7] C++ memory_order_release definition:

"A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below)."

See here: http://en.cppreference.com/w/cpp/atomic/memory_order

Many thanks to A. Shipilev, M. Thompson, D. Lawrie and C. Ruslan  for reviewing, any remaining errors are their own and they shall be most severely reprimanded for them.

Tuesday 13 December 2016

Linked Array Queues, part 2: SPSC Benchmarks

JCTools has a bunch of benchmarks we use to stress test the queues and evaluate optimizations.

These are of course not 'real' workloads, but serve to highlight imperfections and opportunities. While it is true that an optimization might work in a benchmark but not in the real world, a benchmark can work as a demonstration that there are at least circumstances in which it does work. All measurement is imperfect, but not as imperfect as claims made with no fucking evidence whatsoever, so here goes.

How do these linked-array queues fare in the benchmarks? what can we learn here?

The linked array queues are a hybrid of the array and linked queues. So it seems reasonable that we should compare them to both SpscArrayQueue and SpscLinkedQueue. We should also consider how the queues differ and see if we can flush out the differences via the benchmarks.
If you crack under the pressure of boring details, skip to the summary, do not stop at interlude, do not collect a cool drink or get praise, just be on yer fuckin' merry way.

Setup:

Benchmarks are run on a quiet server class machine:

Xeon processor(Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz): 2 CPUs x 12 cores x 2 threads (HT)
CentOS
Oracle JDK8u101
All benchmarks are run taskset to cores on the same numa node, but such that threads cannot share the same physical core.
Turbo boost is off, the scaling governor is userspace and the frequency is fixed.
The code is on github

Throughput benchmark: background and method

A throughput benchmark for queues is a tricky fucker. In particular the results change meaning depending on the balance between consumer and producer:

If the consumer is faster than the producer we are measuring empty queue contention (producer/consumer hitting the same cache line for elements in the queue, perhaps sampling each other index). Empty queues are the expected state for responsive applications.
If the producer is faster than the consumer we are measuring full queue contention, which may have similar issues. For some queues which optimize for the healthy assumption that queues are mostly empty this may be a particularly bad place to be.
If the producer and consumer are well balanced we are testing a streaming use case which offers the most opportunities for progress for both consumer and producer. This should yield the best performance, but for most applications may not be a realistic scenario at all.

The JCTools throughput benchmark does not resolve these issues. It does however report results which give us an idea of poll/offer failure rates which are in turn indicative of which state we find ourselves in.
A further challenge in managed runtime environments, which is unrelated to queues, is that garbage generating benchmarks will have GC state accumulate across measurement iterations. The implication is that each iteration is measuring from a different starting state. Naturally occurring GCs will leave the heap in varying states depending on the point at which they hit. We can choose to either embrace the noise in the measurement as an averaging of the cost/overhead of garbage or allocate a large enough heap to accommodate a single iteration worth of allocation and force a full GC per iteration, thus resetting the state per iteration. The benchmarks below were run with 8g heap and a GC cycle between iterations.
The benchmark I run here is the no backoff version of the throughput benchmark where failure to offer/poll makes no attempt at waiting/yielding/tapping of foot and just tries again straight away. This serves to maximize contention and is not a recipe for happiness in real applications.
JMH parameters common to all runs below:

-gc true -> GC cycle between iterations
-jvmArgs="-Xmx8g -Xms8g" -> 8g heap
-i 10 -r 1 -> 10 measurement iterations, 1 second each
-wi 20 -w 1 -> 20 warmup iterations, 1 second each
-f 5 -> five forks each to expose run to run variance

Throughput benchmark: baseline(JMH params: -bm thrpt -tu us)

Here's some baseline results, note the unit is ops/us equal to millions of ops per second:
SpscArrayQueue (128k capacity)
offersFailed 0.005 ± 0.008 ops/us
offersMade 252.201 ± 1.649 ops/us
pollsFailed 0.009 ± 0.008 ops/us
pollsMade 252.129 ± 1.646 ops/us

So the SpscArrayQueue is offering great throughput, and seems pretty well balanced with failed offers/polls sort of cancelling out and low compared to the overall throughput.

SpscLinkedQueue

offersFailed ≈ 0 ops/us
offersMade 14.711 ± 5.897 ops/us
pollsFailed 12.624 ± 8.281 ops/us
pollsMade 14.710 ± 5.896 ops/us

For the SpscLinkedQueue we have no failed offers, since it's an unbounded queue. We do see a fair amount of failed polls. We expect the polls to be faster than the offers as offering pays for allocation of nodes on each element (24b overhead per element), while the poll simply leaves it to the GC to toss it all away.

With this baseline we would expect linked arrays queues performance to be somewhere between the 2 data points above. Unlikely to hit the highs of the preallocated array queue, but hopefully much better than a linked queue.

Throughput benchmark: growable

So assuming we let it grow to 128k, how does the SpscGrowableArrayQueue perform in this benchmark and how much does the initial size impact the performance? CNK here is the initial buffer size. The buffer will double in size when offer fills up a buffer until we hit the max size buffer.

CNK Score Error Units
16 offersFailed 0.006 ± 0.006 ops/us
16 offersMade 183.720 ± 0.450 ops/us
16 pollsFailed 0.003 ± 0.001 ops/us
16 pollsMade 183.592 ± 0.450 ops/us
128 offersFailed 0.003 ± 0.006 ops/us
128 offersMade 184.236 ± 0.336 ops/us
128 pollsFailed 0.003 ± 0.001 ops/us
128 pollsMade 184.107 ± 0.336 ops/us
1K offersFailed 0.001 ± 0.003 ops/us
1K offersMade 183.113 ± 1.385 ops/us
1K pollsFailed 0.003 ± 0.001 ops/us
1K pollsMade 182.985 ± 1.385 ops/us
16K offersFailed 0.007 ± 0.006 ops/us
16K offersMade 181.388 ± 5.380 ops/us
16K pollsFailed 0.004 ± 0.001 ops/us
16K pollsMade 181.259 ± 5.380 ops/us

Under constant streaming pressure the Growable queue will keep growing until either full sized buffer is allocated (very likely) or a smaller buffer in which the throughput is sustainable is found (unlikely for this benchmark as all it takes is a single spike). If that was the case we would have no failing offers. Either way we expect transition to the last buffer to be a short phase after which the algorithm is very similar to SpscArrayQueue and no further allocations happen. The number of resizing events is small, as the buffer doubles each time (so log2(capacity/initial size), e.g. for initial capacity 16k: 16k -> 32k -> 64k -> 128k).
You may consider the slow down from SpscArrayQueue large at roughly 25%, but I don't think it too bad considering that with the throughputs in question we are looking at costs in the single digit nanoseconds where every extra instruction is going to show up (back of envelope: 250 ops/us -> ~4ns per offer/poll vs 180 ops/us -> ~5ns. 1ns = ~3 cycle ~= 12 instructions or 1 L1 load).

Throughput benchmark: chunked

For Chunked we see the expected increase in throughput as we increase the chunk size (CNK is the fixed chunk size, the max size is 128K):

CNK Score Error Units
16 offersFailed ≈ 0 ops/us
16 offersMade 43.665 ± 0.892 ops/us
16 pollsFailed 9.160 ± 0.519 ops/us
16 pollsMade 43.665 ± 0.892 ops/us
128 offersFailed ≈ 10⁻⁴ ops/us
128 offersMade 151.473 ± 18.786 ops/us
128 pollsFailed 0.380 ± 0.331 ops/us
128 pollsMade 151.443 ± 18.778 ops/us
1K offersFailed 0.309 ± 0.375 ops/us
1K offersMade 149.351 ± 14.102 ops/us
1K pollsFailed 0.112 ± 0.125 ops/us
1K pollsMade 149.314 ± 14.120 ops/us
16K offersFailed ≈ 10⁻⁸ ops/us
16K offersMade 175.408 ± 1.563 ops/us
16K pollsFailed 0.038 ± 0.031 ops/us
16K pollsMade 175.394 ± 1.563 ops/us

Note the decline in throughput for smaller chunks is matched with an increase in poll failures indicating that the consumer is becoming faster than the producer as the chunk grows smaller requiring more frequent allocations by the produce.
Note also that even with 16 slot chunks this option is ~3 times faster than the linked alternative.
Under constant streaming pressure the Chunked queue will be pushed to it's maximum size, which means the producer will be constantly allocating buffers. The producer resize conditions are also slightly trickier and require sampling of the consumer index. The consumer will be slowed down by this sampling, and also slowed down by jumping to new buffers. This problem will be worse as more resizing happens, which is a factor of chunk size.
The benefit of larger chunks will cap out at some point, you could explore this parameter to find the optimum.
An exercise to readers: run the benchmark with the JMH GC profiler and compare the queues. Use it to verify the assumption that Growable produces a bounded amount of garbage, while Chunked continues to churn.
Max throughput is slightly behind Growable.

The main take aways for sizing here seem to me that tiny chunks are bad, but even with small/medium chunks you can have pretty decent throughput. The right size for your chunk should therefore depend on your expectations of average traffic on the one hand and desirable size when empty.

Throughput benchmark: unbounded

For unbounded we see the expected increase in throughput as we increase the chunk size (CNK is the chunk size, the max size is infinity and beyond):

CNK Score Error Units
16 offersFailed ≈ 0 ops/us
16 offersMade 56.315 ± 7.563 ops/us
16 pollsFailed 10.823 ± 1.611 ops/us
16 pollsMade 56.315 ± 7.563 ops/us
128 offersFailed ≈ 0 ops/us
128 offersMade 135.119 ± 23.306 ops/us
128 pollsFailed 1.236 ± 0.851 ops/us
128 pollsMade 131.770 ± 21.535 ops/us
1K offersFailed ≈ 0 ops/us
1K offersMade 182.922 ± 3.397 ops/us
1K pollsFailed 0.005 ± 0.003 ops/us
1K pollsMade 176.208 ± 3.221 ops/us
16K offersFailed ≈ 0 ops/us
16K offersMade 177.586 ± 2.929 ops/us
16K pollsFailed 0.031 ± 0.038 ops/us
16K pollsMade 176.884 ± 2.255 ops/us

The 16 chunk size is ~4 times faster than the linked list option, as chunk size increases it gets more efficient.
Max throughput is slightly behind growable.
Why is Chunked faster than Unbounded on 128 chunks, but slower on 1K? I've not looked into it, it's taken long enough to write this bloody post as it is. How about you check it out and let me know?

Throughput benchmark: summary

Growable queue performs well regardless of initial size for this case.
For chunked and unbounded the chunk size has definite implications on throughput. Having said that throughput is very good even for relatively small chunks.
Note that the results for the same benchmark without a GC cycle between iterations were very noisy. The above result intentionally removes the variance GC induces by forcing GC and allowing a large heap. The GC impact of linked array queues when churning will likely be in increasing old generation pressure as the overflow chunks are likely to have been promoted before they get collected. This is assuming a load where overflow is not that frequent and other allocation is present.

Interlude

Go ahead, grab a beer, or a coffee, a mojito perhaps(Norman/Viktor, go on), or maybe order a large Pan Galactic Gargle Blaster, you've earned it. I never thought you'd read this far, it's a tad dry innit? Well, it's not fun writing it either, but we're getting there, just need to look at one more benchmark...

Burst "cost"/latency benchmark: background and method

The burst cost benchmark is a more stable workload than the throughput one. The producer sends a burst of messages to a consumer. The consumer signals completion when the last message in the burst has arrived. The measurement is from first message sent and arrival of last message observed from the producer thread. It's a 'latency' benchmark, or rather an estimate of average communication cost via the particular thread. It's got bells on. It's a friend, and it's a companion, it's the only product you will ever need, follow these easy assembly instructions it never needs ironing.

This is, I think, a better evaluation of queue characteristics than the throughput benchmark for most applications. Queue starts empty, is hit with a burst of traffic and the burst is drained. The cost measured is inclusive of return signal latency, but as scenarios go this is not too far fetched. Calling this queue latency is a damn sight better than PRETENDING THE BLOODY INVERSE OF THROUGHPUT IS LATENCY. <deep breath>

Same machine and JMH parameters used as above. All the measurements below are average time per operation in nanoseconds. The benchmark code can be found here.

Burst Cost benchmark: baseline

Testing first with SpscArrayQueue and SpscLinkedQueue to establish the expected baseline behaviour, BRST is the size of the burst:

SpscArrayQueue (128k capacity)

BRST Score Error Units

1 284.709 ± 8.813 ns/op

10 368.028 ± 6.949 ns/op

100 914.150 ± 11.424 ns/op

Right, sending one message has the overhead of cache coherency making data visible to another core. Sending 10/100 messages we can see the benefits of the SpscArrayQueue in allowing consumer and producer to minimize cache coherency overhead per element. We see a satisfying drop in cost per element as the burst size grows (the per element cost is the cost of the burst divided by the number of elements sent, so we see here: 1 -> 284, 10 -> 36, 100 -> 9), but this DOES NOT MEAN THE FRIGGIN' LATENCY IS BLOOMIN' DOWN TO 9ns WHEN WE SEND 100 MESSAGES.

SpscLinkedQueue

BRST Score Error Units

1 378.043 ± 7.536 ns/op

10 1675.589 ± 44.496 ns/op

100 17036.528 ± 492.875 ns/op

For the linked queue the per element overheads are larger, as well as the cost of scanning through a linked list rather than an array as we poll data out. The gap between the it and SpscArrayQueue widens as the burst size grows. The linked queue fails to make the most of the batching opportunity offered by slack in the queue in other words.

Burst Cost benchmark: growable

We expect the growable queue to grow to accommodate the size of the burst. The eventual buffer size will be a tighter fit around the burst size, which in theory might be a benefit as the array is more likely to fit in cache. Let's spin the wheel (CNK is the initial chunk size, the max size is 128K):

BRST CNK Score Error Units

1 16 327.703 ± 11.485 ns/op

1 128 292.382 ± 9.807 ns/op

1 1K 275.573 ± 6.230 ns/op

1 16K 286.354 ± 6.980 ns/op

10 16 599.540 ± 73.376 ns/op

10 128 386.828 ± 10.016 ns/op

10 1K 376.295 ± 8.009 ns/op

10 16K 358.096 ± 6.107 ns/op

100 16 1173.644 ± 28.669 ns/op

100 128 1152.241 ± 40.067 ns/op

100 1K 966.612 ± 9.504 ns/op

100 16K 951.495 ± 12.425 ns/op

We have to understand the implementation to understand the results here, in particular:

The growable queue buffer will grow to accommodate the burst in a power of 2 sized array. This in particular means that when the burst size is 100 the buffer for the initially smaller 16 chunk queue is also 128. The delta between the 2 configurations becomes marginal once that happens as we see in the 100 burst which forces the initially size 16 element buffer to grow to 128.
The queue tries to probe ahead within a buffer to avoid reading on each element.The read ahead step is a 25% of the buffer size. The smaller the buffer the more often we need to probe ahead (e.g. for a 16 element buffer we do this every 4 elements). This overhead is visible in the smaller buffers.
A burst which manages to fill more than 75% will fail to read ahead with the long probe described above and fall back to reading a single element ahead. This implies that buffers that fit too snugly to the burst size will have worse performance.
When the buffers are sufficiently large the costs closely match the costs observed for the SpscArrayQueue. Yay!

Burst Cost benchmark: chunked

For Chunked we see a slight increase in base cost and a bummer when the burst size exceeds the chunk size (CNK is the chunk size, the max size is 128K):

BRST CNK Score Error Units

1 16 311.743 ± 11.613 ns/op

1 128 295.987 ± 5.468 ns/op

1 1K 281.308 ± 8.381 ns/op

1 16K 281.962 ± 7.376 ns/op

10 16 478.687 ± 52.547 ns/op

10 128 390.041 ± 16.029 ns/op

10 1K 371.067 ± 7.789 ns/op

10 16K 386.683 ± 5.276 ns/op

100 16 2513.226 ± 38.285 ns/op

100 128 1117.990 ± 14.252 ns/op

100 1K 969.435 ± 10.072 ns/op

100 16K 939.010 ± 8.173 ns/op

Results are overall similar to the growable, what stands out is:

If the chunk is too little to accommodate the burst we see a large increase to cost. Still, comparing this to the SpscLinkedQueue shows a significant benefit. Comparing to the growable version we see the sense in perhaps letting the queue grow to a better size as a response to bursts.
If the chunk is large enough to accommodate the burst behaviour closely matches SpscGrowableArrayQueue. Yay!

Burst Cost benchmark: unbounded

Final one, just hang in there.

BRST CNK Score Error Units

1 16 303.030 ± 11.812 ns/op

1 128 308.158 ± 11.064 ns/op

1 1K 286.379 ± 6.027 ns/op

1 16K 282.574 ± 10.886 ns/op

10 16 554.285 ± 54.468 ns/op

10 128 407.350 ± 11.227 ns/op

10 1K 379.716 ± 9.357 ns/op

10 16K 370.885 ± 12.068 ns/op

100 16 2748.900 ± 64.321 ns/op

100 128 1150.393 ± 26.355 ns/op

100 1K 1005.036 ± 14.491 ns/op

100 16K 979.372 ± 13.369 ns/op

What stands out is:

If the chunk is too little to accommodate the burst we see a large increase to cost. Still, comparing this to the SpscLinkedQueue shows a significant benefit.
If the chunk is large enough to accommodate the burst and make the most of probing ahead the costs closely resemble the SpscArrayQueue for larger bursts. Yay!

Burst Cost benchmark: summary

We see a pretty much expected result for these queues, which is to say that on the fast path they are the same and therefore if the fast path dominates they show the same costs as a plain SpscArrayQueue, which is good news. When chunks are too small and we have to allocate new chunks we start to see overheads.

A more subtle observation here is that smaller buffers have some drawbacks as the slow path of the producer code is more likely to be executed. This reflects correctly the empty queue assumption that the JCTools queues rely on, but broken assumptions are... well... broken, so the cost goes up.

A further consideration here for smaller buffer is the hot/cold structure of the code. It is intended that the producer code inlines the "offer" hot path, but as the cold path is rarely run it will fail to inline it. This is an intentional inlining fail. Inlining the cold path will make the "offer" larger and allot more complex, making the compilers job harder and may result in worse resulting code. When we run with burst/buffer sizes which systematically violate the hot/cold assumption we can trigger a bad inlining decision. This can be worked around by marking the cold methods as "dontinline" using the CompileCommand option or the compiler oracle file.

Mmmm... this is boring :(

Yes... Nothing too surprising happened here, I did not emerge from the lab with my coat on fire, these things happen. One anecdote worth sharing here is that I originally run the benchmarks with only 2 threads allocated to the JVM, this resulted in noisier measurement as I effectively under provisioned the JVM with CPUs for compilation/GC or any OS scheduling contention/interrupts. When running on a 2 core laptop this is a reasonable compromise to fix the cross core topology of the benchmark, but on a server class machine it is easy enough to provision the same topology with more CPUs.

Next part will feature the astounding extension of these queues to the MPSC domain and will be far more interesting! I promise.

Wednesday 11 July 2018

How Inlined Code Makes For Confusing Profiles

Mommy, What is Inlining?

How Much Inlining Can We Do?

How Much Inlining Does Java Actually Do?

Where Do Profilers Come In?

A Confusing Reality

MOAR, BIGGER, CONFUSION

Summary

Friday 5 January 2018

What a difference a JVM makes?

A JVM gets into trouble

A JVM gets out of trouble

Put some Zing in that thing

Oh Shenandoah

A Holy Graal!

Summary: There Are Many JVMs In My Father's House

Tuesday 14 February 2017

Java Flame Graphs Introduction: Fire For Everyone!

What's the big deal?

Synthetic Samples For Starters

Tree View vs Flames

Everybody Gets A FlameGraph!

Level Up: AsyncGetCallTrace or JFR

Bonus: Diff Profiles!

Further Bonus: Icicle Graph

Level UP++: Java Perf Flame Graphs FTW!

Bonus: Inlined Frames! Threads! COLOR!

Bonus: Hack It All Up

Summary And Credits

Tuesday 20 December 2016

What do Atomic*::lazySet/Atomic*FieldUpdater::lazySet/Unsafe::putOrdered* actually mean?

Tuesday 13 December 2016

Linked Array Queues, part 2: SPSC Benchmarks

Setup:

Throughput benchmark: background and method

Throughput benchmark: baseline(JMH params: -bm thrpt -tu us)

Throughput benchmark: growable

Throughput benchmark: chunked

Throughput benchmark: unbounded

Throughput benchmark: summary

Interlude

Burst "cost"/latency benchmark: background and method

Burst Cost benchmark: baseline

Burst Cost benchmark: growable

Burst Cost benchmark: chunked

Burst Cost benchmark: unbounded

Burst Cost benchmark: summary

Mmmm... this is boring :(

What do Atomic::lazySet/AtomicFieldUpdater::lazySet/Unsafe::putOrdered* actually mean?