Monday, 1 February 2016

Wait For It: Counted/Uncounted loops, Safepoints and OSR Compilation

In this previous post about Safepoints I claimed that this here piece of code:

Will get your JVM to hang and will not exit after 5 seconds, as one might expect from reading the code. The reason this happens, I claimed, is because the compiler considers for loops limited by an int to be counted, and will therefore not insert a safepoint poll into the generated assembly code. This was the theory, and I tested it, and it definitely hung on my machine ('It Works On My Machine'™).
To my great surprise, some people not only read the posts, they also run the code (it's like they don't trust me ;-) ). And 2 of those diligent readers got back to me in the comments saying that actually the JVM exists just fine. So I ran it again, and it hung, again. Then I ran it with different versions of the JVM, and it hung as I expected. Then I thought, maybe they are running the code I put on GitHub, so I tried that. That also hung when I ran it from Eclipse, but did not when I built it with Maven and ran it from the command line. EUREKA!
As it happens, Eclipse has it's own Java compiler (a fact well known to those who know it well, no doubt) and the bytecode it generated (the .class file) hung every time, but Maven uses javac and the .class file generated from that did not hang. When testing initially I was running my class inside Eclipse, or using the Eclipse .class file to run from the command line. Both .class files are valid, but different (same-same but different). We can use javap to print out class bytecode and compare ("javap -c -v -p <class file>"). For your pleasure you can find the original Java file, the 2 class files and their respective bytecode printouts on GitHub and come to your own conclusions before you continue reading.
So... is my understanding (and the Eclipse javac) somehow out of date? can this problem of telling counted loops from uncounted loops be solved by the bytecode compiler?

How is the bytecode different?

Well, mostly not that different:
  1. An easy to spot difference which we can tell is semantically the same is the bytecode result for "System.out.println("How Odd:" + l);" you'll notice that Eclipse is using the constructor "java/lang/StringBuilder."<init>":(Ljava/lang/String;)" while javac is opting for "java/lang/StringBuilder."<init>":()V" followed by an append. Semantically the 2 choices are the same. We create a StringBuilder and append the String "How odd:" and the variable l. The reality is that there are subtle differences, the Eclipse version will allocate a larger char[] for instance. It may also be that the Eclipse pattern will miss out on some String concatenation optimization because the OpenJDK implementers are likely to focus on patterns generated from the official javac.
  2. The loop interpretation is different. Eclipse is checking the loop conditions at the end of the loop whereas javac is building a loop with the checks upfront and the loop backedge at the end. That seems like the important difference.
Let's zoom in on the relevant bytecode for the loop:

If we were to reverse translate the loop bytecode to ugly Java (which supported goto) we'd get:

Note that there's no special Java bytecode for loop control flows. So why so different at runtime? If we dump out the assembly for the run method we will see that the Eclipse version is missing the safepoint in the loop (the method has a {poll_return} but there's no {poll}), and the javac version has that safepoint there. Is it the bytecode difference that makes this happen?

Variations on bytecode

I tinkered with the class a bit and refactored the loop out of the lambda. This results in similar bytecode for the loop and same behaviour. I then made the limit of the loop a parameter of the method, this time it behaved like the javac version, but had the same funky loop structure:

WTF? if this is a purely a bytecode subtlety then it's seems pretty weird. What sort of flaky compiler would swing like this?

The compiler did it?

But... which compiler? so we looked at the above bytecode compiler results, this is but the first of many compilers to process your code on it's way to being executed. Traditionally when discussing Java code compilation we look at the following stages:
  1. Javac: From code to bytecode
  2. Interpreter: Step through the bytecode
  3. C1/Client: Short compilation time, does not generate the best optimized code. Intended for short lived applications.
  4. C2/Server: Longer compilation time, better optimized code. Intended for long running applications.
And the compilation timeline would look something like this:
                                                   /->C1
*.java->javac->*.class->interpreter->class+profile<
                                                   \->C2
The situation with OpenJDK8 is more complex with Tiered Compilation allowing us a smooth transition from interpreter to C2 compilation via C1 compilations. 
Given the behaviour we are after was supposed to manifest after a C1/C2 compilation we can run the code above to find which compiler compiled the method in question by adding the flag: -XX:+PrintCompilation:
    159  201 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
    159  202       3       safepoint.hang.WhenWillItExitInt2::countOdds (44 bytes)
    159  203 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
    162  201 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
    142  200 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
   1896  202 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
   1896  203 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
   1896  204 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
   1899  203 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
The 3/4 printouts on the 4th column above indicate the method is compiled by C1 (profiled) and C2. But what is that % about? This is an indicator that the compilations are not 'normal' compilations but rather On Stack Replacement(OSR) compilations. OSR compilations are 'special', from the OpenJDK glossary:
on-stack replacement
Also known as 'OSR'. The process of converting an interpreted (or less optimized) stack frame into a compiled (or more optimized) stack frame. This happens when the interpreter discovers that a method is looping, requests the compiler to generate a special nmethod with an entry point somewhere in the loop (specifically, at a backward branch), and transfers control to that nmethod. A rough inverse to deoptimization.
I like the "rough inverse to deoptimization", it's a great mental qualification helper. OSR is the opposite of deoptimization because it is a mid-method replacement of code with compiled code, whereas a deoptimization is a mid-method replacement of code with interpreted code. Normal compilation unit replacements happen out of method, so while the method is off stack. OSR is specifically done to replace long running loops, like our countOdds method. As we can see from the printout above both C1 and C2 can produce OSR compilations.
It is impossible to disable OSR directly in a product build (there's a developer option -XX:-CICompileOSR), but we can force upfront compilation by using the following flags: "-XX:-TieredCompilation -server/client -Xcomp". Running with these arguments all 3 variations (original, refactored loop, refactored parametrized loop) now hang both for the Eclipse and the javac generated bytecode.
In both cases where the Eclipse versions hang we can see OSR compilations of the relevant code.
So it seems:
  • C1/C2 non-OSR compilations of the loop treat it as a counted loop, thus eliminating the safepoint poll and demonstrating the hung behaviour due to huge TTSP.
  • C1/C2 OSR compilations of the same code result in different results depending on the generated bytecode for the loop. I'm not going to dig further into the reasons, but I would welcome any feedback from OpenJDK compiler guys.
The OSR compilation decision to treat the loop as non-counted is quite reasonable. If we have a loop which remains on stack for very long periods we should perhaps consider it uncounted.

Killer Task

I started down this path looking for valid Java code which will result in a hung JVM due to huge TTSP. Since OSR got in the way, can we build such a task? Sure we can, we just need to warmup the loop until it is C2 compiled and then run it with a large limit, like so:
This fucker now hangs in both the Eclipse and javac versions. Joy!

Side note: OSR and JMH

I look at benchmarks people write quite a bit, and it is very common to see people rolling their own benchmark harnesses put their "code under measurement" in a counted loop, run it 100,000 times and derive their results. Often that means the code they are measuring is OSR compiled code, which is different (for better or worse, mostly worse) from the code they will be running normally. If you are one of these people, I beg you: cut that fucking shit out!
JUST USE JMH!
PLEASE?
To a large extent it will cover your use case. Where it doesn't cover your use case, post a question on jmh-dev mailing list, it might be that you've missed something. If it turns out JMH cannot cover your usecase, make sure you've read the JMH samples before rolling your own harness.
I'm not going to include a before and after example, because it just makes me sad. Look at SO for some fine examples.

Counted Loops?

I keep saying 'counted loops' like it means something, so here are some examples for you to play with and contemplate:

So, as you can see the definition ain't so clear and I would imagine some of the above behaviour will vary depending on JVM version/vendor/phase of the moon etc.

Summary

So, it seems my previous example was not as tested as it should have been. I apologize to anyone who has based life changing decisions on that sample. I hope this post helps in explaining away the subtleties which led to my downfall.
The above examples help demonstrate that the line between counted and uncounted loops is a tad confusing and the classification may depend on differences in bytecode, compiler mode, and so on.

Monday, 14 December 2015

Safepoints: Meaning, Side Effects and Overheads

I've been giving a couple of talks in the last year about profiling and about the JVM runtime/execution and in both I found myself coming up against the topic of Safepoints. Most people are blissfully ignorant of the existence of safepoints and of a room full of people I would typically find 1 or 2 developers who would have any familiarity with the term. This is not surprising, since safepoints are not a part of the Java language spec. But as safepoints are a part of every JVM implementation (that I know of) and play an important role, here's my feeble attempt at an overview.

What's a Safepoint?

I was recently asked: "Are safepoints like safe words?". The short answer is "not really", but since I like the metaphor I'll run with it:
Imagine if you will a JVM full of mutator threads, all busy, sweating, mutating the heap. Some of them have <gasp> shared mutable state. They're mutating each others state, concurrently, like animals. Some stand in corners mutating their own state (go blind they will). Suddenly a neon sign flashes the word PINEAPPLES. One by one the mutators stop their rampant heap romping and wait, sweat dripping. When the last mutator stops, a bunch of elves come in, empty the ashtrays, fill up all the drinks, mop up the puddles, and quickly as they can they vanish back to the north pole. The sign is turned off and the threads go back to it....
<inner-dialogue>Too much? It's like 35c in the shade here, so sweating is like... no? never mind </inner-dialogue>
There are many references to Safepoints to be found online and what follows here is my attempt at a more nuanced use/definition, no sweating mutators from this point onwards.
A safepoint is a range of execution where the state of the executing thread is well described. Mutator threads are threads which manipulate the JVM heap (all your Java Threads are mutators. Non-Java threads may also be regarded as mutators when they call into JVM APIs which interact with the heap).
At a safepoint the mutator thread is at a known and well defined point in it's interaction with the heap. This means that all the references on the stack are mapped (at known locations) and the JVM can account for all of them. As long as the thread remains at a safepoint we can safely manipulate the heap + stack such that the thread's view of the world remains consistent when it leaves the safepoint.
This is particularly useful when the JVM wishes to examine or change the heap, for a GC or any one of many other reasons. If references on the stack were unaccounted for and the JVM was to run a GC it may miss the fact that some objects are still alive (referenced from the stack) and collect them, or it may move some object without updating it's reference on the stack leading to memory corruption.
A JVM will therefore need means of bringing threads to safepoints (and keeping them there) so that all sorts of runtime magic can happen. Here's a partial list of activities which JVMs run only once all mutator threads are at a safepoint and cannot leave it until released (at a global safepoint), these are sometime called safepoint operations:
  • Some GC phases (the Stop The World kind)
  • JVMTI stack sampling methods (not always a global safepoint operation for Zing))
  • Class redefinition
  • Heap dumping
  • Monitor deflation (not a global safepoint operation for Zing)
  • Lock unbiasing
  • Method deoptimization (not always)
  • And many more!
A great talk by Azul's own John Cuthbertson from JavaOne 2014 goes into some background on Safepoints and allot of details about safepoint operations other than GC (we at Azul feel GC is a solved problem, so the talk goes into the remaining reasons to stop).
Note that the distinction between requesting a global safepoint and a thread safepoint exists only for some JVM implementations (e.g. Zing, the Azul Systems JVM. Reminder: I work for Azul). Importantly it does not exist on OpenJDK/Oracle JVMs. This means that Zing can bring a single thread to a safepoint.
To summarize:
  • Safepoints are a common JVM implementation detail
  • They are used to put mutator threads on hold while the JVM 'fixes stuff up'
  • On OpenJDK/Oracle every safepoint operation requires a global safepoint
  • All current JVMs have some requirement for global safepoints

When is my thread at a safepoint?

So having threads at a safepoint allows the JVM to get on with it's managed runtime magic show, great! When is this groovy state happening?
  • A Java thread is at a safepoint if it is blocked on a lock or synchronized block, waiting on a monitor, parked, or blocked on blocking IO. Essentially these all qualify as orderly de-scheduling events for the Java thread and as part of tidying up before put on hold the thread is brought to a safepoint.
  • A Java thread is at a safepoint while executing JNI code. Before crossing the native call boundary the stack is left in a consistent state before handing off to the native code. This means that the thread can still run while at a safepoint.
  • A Java thread which is executing bytecode is NOT at a safepoint (or at least the JVM cannot assume that it is at a safepoint).
  • A Java thread which is interrupted (by the OS) while not at a safepoint is not brought to a safepoint before being de-scheduled.
The JVM and your running Java threads have the following relationship around safepoints:
  • The JVM cannot force any thread into a safepoint state.
  • The JVM can stop threads from leaving a safepoint state.
So how can the JVM bring all threads into a safepoint state? The problem is suspending a thread at a known state, not just interrupting it. To achieve this goal JVMs have the Java threads suspend themselves at convenient spots if they observe a 'safepoint flag'.

Bringing a Java Thread to a Safepoint

Java threads poll a 'safepoint flag' (global or thread level) at 'reasonable' intervals and transition into a safepoint state (thread is blocked at a safepoint) when they observe a 'Go to safepoint' flag. This is simple enough, but since we don't want to spend all of our time checking if we need to stop the C1/C2 compilers (-client/-server JIT compilers) try and keep safepoint polls to a minimum. On top of the cost of the flag check itself, maintaining a 'known state' adds significant complexity to the implementation of certain optimizations and so keeping safepoints further apart opens up a wider scope for optimization. These considerations combined lead to the following locations for safepoint polls:
  • Between any 2 bytecodes while running in the interpreter (effectively)
  • On 'non-counted' loop back edge in C1/C2 compiled code
  • Method entry/exit (entry for Zing, exit for OpenJDK) in C1/C2 compiled code. Note that the compiler will remove these safepoint polls when methods are inlined.
If you are the sort of person who looks at assembly for fun (or profit, or both) you'll find safepoint polls in the -XX:+PrintAssembly output by looking for:
  • '{poll}' or '{poll return}' on OpenJDK, this will be in the instructions comments
  • 'tls.pls_self_suspend' on Zing, this will be the flag examined at the poll operation
This mechanism is implemented differently on different VMs (on x86 + Linux, I've not looked at other architectures):
  • On Oracle/OpenJDK a blind TEST of an address on a special memory page is issued. Blind because it has no branch following it so it acts as a very unobtrusive instruction (usually a TEST is immediately followed by a branch instruction). When the JVM wishes to bring threads to a safepoint it protects the page causing a SEGV which is caught and handled  appropriately by the JVM. There is one such special page per JVM and therefore to bring any thread to a safepoint we must bring all threads to a safepoint.
  • On Zing the safepoint flag is thread local (hence the tls prefix). Threads can be brought to a safepoint independently.
See related post: Dude, Where's My Safepoint? for more details on polls.
Once a thread detects the safepoint flag it will perform the safepoint action that the poll triggers. This normally means blocking on some sort of JVM level lock, to be released when the safepoint operation is over. Think of it as a locking mechanism where:

  • Threads can lock themselves out (e.g. by calling into JNI, or blocking at a safepoint).
  • Threads can try and re-enter (when returning from JNI), but if the lock is held by the JVM they will block.
  • The safepoint operation requests the lock and blocks until it own it (when all the mutators have locked themselves out)

All Together Now

Now we have all the moving parts of the safepoint process:
  • Safepoint triggers/operations: reasons to request, and stuff to do which requires one or all threads to be at a safepoint
  • Safepoint state/lock: into which Java threads can voluntarily check in, but can only leave when the JVM is done executing the safepoint operation
  • Safepoint poll: the voluntary act by a Java thread to go into a safepoint if one is needed
Let's see how it all hangs together, for example when a profiler is sampling the stack using JVMTI::GetStackTrace:

GREEN  :   Executing Java
YELLOW :   Descheduled Java
RED    :   Native
This diagram highlights several components of safepoint operation overheads:
  1. Time To Safe Point(TTSP): Each thread enters a safepoint when it hits a safepoint poll. But arriving at a safepoint poll requires executing an unknown number of instructions. We can see J1 hits a safepoint poll straight away and is suspended. J2 and J3 are contending on the availability of CPU time. J3 grabs some CPU time pushing J2 into the run queue, but J2  is not in a safepoint. J3 arrives at a safepoint and suspends, freeing up the core for J2 to make enough progress to get to a safepoint poll. J4 and J5 are already at a safepoint while executing JNI code, they are not affected. Note that J5 is trying to leave JNI halfway through the safepoint and is suspended before resuming Java code. Importantly we observe that the time to safepoint varies from thread to thread and some threads are paused for longer than others, Java threads which take a long time to get to a safepoint can delay many other threads.
  2. Cost of the safepoint operation: This will depend on the operation. For GetStackTrace the cost of the operation itself will depend on the depth of the stack for instance. If your profiler is sampling many threads (or all the threads) in one go (e.g. via JVMTI::GetAllStackTraces) the duration will grow as more threads are involved. A secondary effect may be that the JVM will take the opportunity to execute some other safepoint operations while the going is good. 
  3. Cost of resuming the threads.
Some take aways from this analysis:
  1. Long TTSP can dominate pause times: some causes for long TTSP include page faults, CPU over-subscription, long running counted loops.
  2. The cost of stopping all threads and starting all threads scales with the number of threads. The more threads the higher the cost. Assuming some non-zero cost to suspend/resume and TTSP, multiply by number of threads...
  3. Add -XX:+PrintGCApplicationStoppedTime to your list of favourite flags! It prints the pause time and the TTSP.
But why should you trust me? Lets break some shit!!! Please note in the following examples I will be switching between Oracle 8u60 and Zing (latest dev JDK8 build). I work for Azul Systems on the Zing JVM, so feel free to assume bias on my part. My main motivation in including both JVMs is so that I can demonstrate the effect of -XX:+KeepSafepointsInCountedLoops on Zing.
NOTE: The benchmarks are not intended to spark a flame war and though Zing happens to compile this particular piece of code 'better' I would not get over excited about it. What is exciting is the extra option Zing makes available here to control and minimize TTSP, and the additional perspective it offers on how a JVM can deal with these challenges.

Example 0: Long TTSP Hangs Application

The above code should exit in 5 seconds (or see it on GitHub). The silly computation running in the daemon thread should not stop it from exiting (the JVM should exit if all remaining running Java threads are daemon threads). But if we run this sample we will find that the JVM does not exit after 5 seconds. It doesn't exit at all[UPDATE: this example turned out to be a bit flaky as per discussion in comments and following blog post]. Not only does it not exit, it will also refuse incoming JMX connections, it won't respond the jmap/jstack. It won't naturally die, you'll have to 'kill -9' this fucker.
The reason it's so stuck is because the computation is running in a 'counted' loop, and counted loops are considered short by the JVM. Short enough to not have safepoints. This thread refusing to come to a safepoint will lead to all other threads suspending indefinitely at the next safepoint operation.
We can fix this (somewhat contrived) problem in several ways:
  1. Prevent C1/C2 compilation: Use -Xint or a compile command to prevent the method compilation. This will force safepoint polls everywhere and the thread will stop hurting the JVM.
  2. Replace the index type with long: this will make the JVM consider this loop as 'uncounted' and include a safepoint poll in each iteration.
  3. On Zing you can apply the -XX:+KeepSafepointsInCountedLoops option on either the JVM or method level (via a -XX:CompileCommand=option,... or via a compiler oracle file). This option is coming shortly to an OpenJDK near you. See also related bug where breaking the loop into an inner/outer loop with safepoints in the outer loop is discussed.
This example is handy when considering shared JVM solutions. This is perfectly 'safe' Java code that will bring most application containers to their knees.
Summary: Delaying a safepoint poll indefinitely can be harmful...

Example 1: More Running Threads -> Longer TTSP, Higher Pause Times

The benchmark has 1 thread triggering a safepoint operation by calling safepointMethod() (in this case the System.gc() will be called as a result) and many threads scanning a byte array. The byte array scan loop will not have a safepoint poll (it's a counted loop), so the size of the array will act as our intended safepoint poll interval. With this setup we can control the frequency of safepoint operations (with some overhead), avg. distance from safepoint poll and with JMH the number of threads. We can use -XX:+PrintGCApplicationStoppedTime to log the stopped time and TTSP. Armed with this experiment we can now find a large machine to play on and see what happens. I use the command line '-tg' parameter to control the number of threads executing the contains1() method.
So, here goes on a 48 core Haswell Xeon machine, Cent OS, I'll be using Oracle 8u60. The benchmark class I'm running is here.
Lets start with no contains1() threads and explain the method and data collection as we go along:
$ java -jar safepoints.jar -f 1 -i 10 -wi 10 -jvmArgs="-XX:+PrintGCApplicationStoppedTime -Xms1g -Xmx1g" -tg 0,0,0,1 SafepointUsingGc

There's 4 thread groups because I add another benchmark to run with this crowd in the next sample.
This spouts allot of PrintGCApplicationStoppedTime printouts similar to:
Total time for which application threads were stopped: 0.0180172 seconds, Stopping threads took: 0.0000513 seconds

I'm no Unix shell guru, but with some googling even I can AWK something to summarize this shit:
grep 'Total time' output.file | awk '{ stopped += $9;ttsp += $14; n++ } END { if (n > 0) print "stopped.avg=" stopped / n ", ttsp.avg=" ttsp / n; }'

It's summing up the warmup period measurements as well, but I'm willing to let it ride. So with that in mind, lets compare the benchmark output with the reported pause times:
SafepointUsingGc.run_together:safepoint           100   10   32768     true  avgt   10  118611444.689 ± 1288530.618  ns/op
stopped.avg=0.015759, ttsp.avg=6.98995e-05

Or formatted for humans:
park+gc(ms) |Stp.ms|TTSP.ms|
  119 ± 1.0 | 16.1 |  0.07 |


We went with the default intervalMs=100, so we'd expect the safepoint() cost to be around 100 + stopped.avg, it's actually 2ms more than that. This is why we have the park() benchmark to baseline against. Running again with the park group enabled ('-tg 0,0,1,1'):
park(ms)   | park+gc(ms) |Stp.ms|TTSP.ms|
 101 ± 0.8 | 118 ± 1.4   | 15.9 |  0.07 |

We can see that there's some variance to the GC call and some scheduling variance which can account for delta between our expected value and actual. Small potatoes, lets move on. Start with measuring the cost of the contains1() call when we're not slapping it with a GC every 100ms ('-tg 1,0,1,0' we're running the park baseline):
contains1(us) | park(ms)    |Stp.ms|TTSP.ms|
  32.2 ± 0.02 | 100.1 ± 0.0 |  0.0 |  0.0  |

Note the  contains1() method cost is in microseconds. Awesome, now lets ramp up the number of contains1() thread with the GC thread running every 100ms:
T   |contains1(us) | park(ms)    | park+gc(ms) |Stp.ms|TTSP.ms|
1   |  51.4 ±  3.1 | 100.7 ± 0.7 | 117.5 ± 0.7 | 15.1 |  0.1  |
2   |  49.5 ± 17.1 | 101.6 ± 1.9 | 116.9 ± 0.5 | 14.4 |  0.1  |
4   |  45.5 ±  2.6 | 102.8 ± 4.0 | 116.9 ± 0.7 | 14.4 |  0.1  |
8   |  45.6 ±  2.1 | 103.6 ± 5.6 | 114.0 ± 0.7 | 13.3 |  0.1  |
16  |  46.8 ±  1.3 | 103.4 ± 6.8 | 115.0 ± 1.5 | 12.8 |  0.2  |
32  |  65.7 ±  2.0 | 101.8 ± 0.9 | 113.0 ± 1.3 | 11.4 |  0.7  |
64  | 121.9 ±  1.9 | 106.0 ± 8.1 | 116.1 ± 1.4 | 14.5 |  5.4  |
128 | 251.9 ±  6.6 | 103.6 ± 5.3 | 119.6 ± 2.3 | 16.9 |  6.4  |
256 | 513.2 ± 13.4 | 109.2 ± 9.7 | 123.4 ± 1.7 | 20.7 |  8.0  |
512 |1111.3 ± 69.7 | 143.9 ±24.8 | 134.8 ±15.3 | 27.8 | 10.0  |


OK, that escalated not so quickly. Keep in mind we only actually have 48 cores (hyper-threaded), and so when we cross the line between 32 and 64 we are now looking at the effect of contention and scheduling on the TTSP. It is evident that if you have more CPU bound threads than CPUs performance will degrade, but that is not really what we're after.
At 1 thread we already see a big drop in performance for contains1(). This is somewhat understandable in the presence of 15ms stolen from the application every 100ms.  But where you'd expect a 15% difference we're actually looking at a far more significant drop in performance. Going from 32us to 51us kinda sucks.
We can see a steady growth in TTSP while within the core count, going from 0.1ms to 0.7ms, this is bad news to anyone looking to keep their latencies in the low microseconds count as touched upon in this post from the brilliant Mark Price. Mark highlights the fact that the JVM regularly calls for a global safepoint (“whether she needed it or not”), so you may find to your delight that these little hiccups come round even when you've been allocating nothing.
Once we have more threads than cores the TTSP cost goes into the milliseconds, but escalates rather slowly once there. Going from 5ms to 10ms while we go from 64 to 512 threads is better than I expected.
Summary: By increasing the thread count we increase the TTSP. CPU over-subscription will lead to a further increase in TTSP.  Monitor machines load avg to detect this kind of misconfiguration.

Example 2: Long TTSP has Unfair Impact

Building on previous benchmark, but now we have 2 methods: contains1() and contains1toK()contains1toK() is K times further away from a safepoint poll (outer loop is counted, the contains needle methods gets inlined removing the method exit safepoint and the loop inside it is counted). By increasing K we can see the impact of one thread on the other, in particular we expect the cost of contains1() to increase as the cost of contains1toK() increases, in the presence of safepoint operations. The unfairness here being that while contains1toK() is making progress it is making contains1(legen...) wait for it (...DARY!). Because using GC as the safepoint operation is rather heavy handed (even though it's a small heap and there's no garbage) we'll switch to using another safepoint operation for this case:  ManagementFactory.getThreadMXBean().findDeadlockedThreads(). This is allot more lightweight than GC but becomes more expensive as we pile on threads (not by that much, but still) so I did not use it for prior example (though come to think of it the number of threads should also impact GC length a bit, oh well). Anyway, variety is the spice of life. Here's the results with both methods and park(), but no safepoint operation (benchmark class is here):
K     |contain1(us)|   containsK(us)    | park(ms)    |Stp.ms|TTSP.ms|
1     | 32.3 ± 0.1 |     34.8 ±     1.7 | 100.0 ± 0.0 |  0.0 |  0.0  |
10    | 32.4 ± 0.5 |    349.7 ±    18.0 | 100.0 ± 0.0 |  0.0 |  0.0  |
100   | 38.4 ± 2.8 |   3452.7 ±   130.4 | 100.0 ± 0.0 |  0.2 |  0.1  |
1000  | 39.8 ± 1.1 |  33419.9 ±   421.6 | 100.0 ± 0.0 |  0.0 |  0.0  |
10000 | 41.7 ± 5.0 | 336112.5 ± 10306.5 | 100.5 ± 1.7 |  0.0 |  0.0  |

Indeed the cost of contains1toK() grows linearly with K and the 2 methods don't seem to impact each other that significantly.
Here's what happens as we throw in the safepoint ops:
K     |contains1(us)|   containsK(us)   | park(ms)     | park+FDT(ms) |Stp.ms |TTSP.ms|
1     |  39.7 ± 0.9 |     39.6 ±    1.4 | 100.0 ±  0.0 | 100.3 ±  0.0 |   0.1 |   0.1 |
10    |  40.0 ± 1.0 |    400.1 ±   14.4 | 100.0 ±  0.0 | 100.5 ±  0.0 |   0.2 |   0.2 |
100   |  41.8 ± 1.1 |   3835.4 ±  141.2 | 100.1 ±  0.0 | 102.7 ±  0.1 |   2.1 |   2.0 |
1000  |  67.4 ± 3.5 |  37535.7 ± 2237.2 | 102.5 ±  1.9 | 127.1 ±  1.1 |  20.7 |  20.7 |
10000 | 183.4 ± 8.5 | 346048.6 ± 9818.2 | 194.9 ± 11.5 | 348.4 ± 11.4 | 152.8 | 152.8 |

A terrible thing, right? The annoying/unfair thing is that being a shit has no effect on contains1toK() performance. It scales linearly and seems oblivious to the havoc it's wrecking for everyone else. Note that the problem we see here was present in the run where no safepoint operations were invoked, there were just no safepoint operations to bring it out. On a long enough timeline a safepoint operation will come along and around it we'll have seen the same issue.
How can we fix it, and at what cost? Lets explore some options.

Solution 1: I use Zing -> use -XX:+KeepSafepointsInCountedLoops

So Zing is super awesome, and has this option to keep safepoint polls in counted loops which can come in handy in cases where one particular method is killing you softly with TTSP. Here's the results we get with Zing when running the benchmark above (same jvmArgs, same machine etc):
K     |contains1(us) |   containsK(us)   | park(ms)     | park+FDT(ms) | Stp.ms|
1     |  31.9 ±  0.9 |    30.5 ±     1.0 | 100.0 ±  0.0 | 100.2 ±  0.0 |   0.0 |
10    |  32.9 ±  0.2 |   317.8 ±     8.8 | 100.0 ±  0.0 | 100.2 ±  0.0 |   0.0 |
100   |  34.1 ±  2.4 |  3337.8 ±   265.4 | 100.1 ±  0.0 | 100.8 ±  0.2 |   0.4 |
1000  |  44.4 ±  1.9 | 33991.6 ±  1199.0 | 100.9 ±  0.2 | 117.0 ±  0.6 |  10.2 |
10000 | 159.0 ± 12.1 |299839.3 ±  6297.2 | 156.3 ±  8.0 | 319.8 ± 30.5 | 113.4 |

Still bad, but Zing happens to squeeze better performance from the containsNeedle() method and while contains1() performance degrades it's opening a wide gap with observed behaviour above probably because less time is lost to TTSP (because contains1toK() finishes quicker). Also note that -XX:+PrintGCApplicationStoppedTime only prints out the stopped time for Zing. A separate command line argument is available which prints out safepoint statistics, but as this post grows longer I'm losing my will to rerun and parse stuff...
We add the JVM argument: "-XX:CompileCommand=option,*.contains1ToK,+KeepSafepointsInCountedLoops" and watch the world burn:
K     |contain1(us)|   containsK(us)   | park(ms)    | park+FDT(ms)| Stp.ms|
1     | 33.0 ± 0.6 |     40.1 ±    1.2 | 100.0 ± 0.0 | 100.2 ± 0.0 |   0.0 |
10    | 32.8 ± 0.7 |    405.1 ±   13.1 | 100.0 ± 0.0 | 100.2 ± 0.0 |   0.0 |
100   | 33.3 ± 0.7 |   4124.8 ±  104.4 | 100.0 ± 0.0 | 100.3 ± 0.0 |   0.1 |
1000  | 32.9 ± 0.2 |  40839.6 ±  854.0 | 100.0 ± 0.0 | 101.0 ± 0.3 |   0.6 |
10000 | 40.2 ± 2.9 | 374935.6 ± 7145.3 | 111.3 ± 3.7 | 100.2 ± 0.0 |   0.0 |

SHAZZZING!... This is pretty fucking awesome. The impact on contains1toK() is visible, but I didn't have to change the code and the TTSP issue is gone man, solid GONE. A look at the assembly verifies that the inner loop is still unrolled within the outer loop, with the addition of the requested safepoint polls. There are more safepoint polls than I'd like here, since the option is applied to both the outer method and the inlined method (to be improved upon sometime down the road...).

Solution 2: I use OpenJDK/Oracle, disable inner loop inlining

If the inner loop is in it's own method as our example, or if we can perhaps modify the code such that it is, we can ask the compiler to not inline that method. This should work out nicely here. To force the JVM to not inline the containsNeedle() method only in contains1toK() I made a copy of containsNeedle() called containsNeedle2()  and used this argument: "-XX:CompilerCommand=dontinline,*.containsNeedle2". The results:
K     |contain1(us)|   containsK(us)   | park(ms)    | park+FDT(ms)|Stp.ms|TTSP.ms|
1     | 39.9 ± 1.0 |     40.0 ±    2.6 | 100.0 ± 0.0 | 100.3 ± 0.0 |  0.1 | 0.0   |
10    | 39.9 ±  .7 |    395.6 ±   10.0 | 100.0 ± 0.0 | 100.3 ± 0.0 |  0.1 | 0.0   |
100   | 41.3 ± 3.9 |   4030.6 ±  405.0 | 100.0 ± 0.0 | 100.3 ± 0.0 |  0.1 | 0.0   |
1000  | 39.6 ± 2.1 |  34543.8 ± 1409.9 | 100.0 ± 0.0 | 100.3 ± 0.0 |  0.1 | 0.0   |
10000 | 39.1 ± 1.2 | 328380.6 ± 8478.1 | 100.0 ± 0.0 | 100.3 ± 0.0 |  0.1 | 0.0   |

That worked well! No harm to contains1(), no TTSP issue, and contains1toK() performance didn't suffer for it. Inlining is usually a key optimization you want to make sure happens, but in this case the right thing to do is stop it from happening. The work inside containsNeedle() is chunky enough to make this a good tradeoff. If the external loop was at a very high count, and the array size rather small this would have been a very different story.

Solution 3: Last resort. stop compiling the offending method

This is a bit terrible, but if things get bad AND you can't change the code, you can use a CompileCommand to exclude the method from compilation. The method will be force to stay in the interpreter, so will effectively have a safepoint poll between every 2 bytecodes. Performance will suck, but at least the application will be responsive. I am not gonna run this option, but if you want to add the following argument to the command line and watch it roll: "-XX:CompilerCommand=exclude,*.contains1toK"

Summary: Threads which suffer from large TTSP will cause other threads to wait for them in the presence of a global safepoint attempt.

Example 3: Safepoint Operation Cost Scale


For this example we replace the safepoint() method with gathering thread stacks using ManagementFactory.getThreadMXBean().dumpAllThreads(false, false). Gathering thread stacks also occurs at a safepoint and the safepoint operation cost will scale with the number of threads. We construct a number of idle threads in the setup method to control the safepoint operation cost. To make sampling the threads stack expensive we park the fuckers 256 frames deep.
Different safepoint operations scale with different aspects of your process. Certain GC algorithms stop the world phases will scale with heap size. Profiling by collecting thread dumps (as done by JVisualVM sampler for instance) will scale with the number of threads (idle or not) and the stack depth. Deflating monitors will scale with the number of monitors in your application etc.
Here's what happens as we pile on the idle threads (running only contains1()):
IT   |  contain1(us)|  park(ms)    |  park+DTS(ms) | Stp.ms|TTSP.ms|
0    |  38.0 ±  1.6 | 100.0 ±  0.0 |  100.5 ±  0.0 |   0.2 |   0.0 |
1    |  40.7 ±  1.4 | 100.1 ±  0.0 |  102.2 ±  0.0 |   0.6 |   0.0 |
10   |  44.7 ±  0.8 | 100.2 ±  0.1 |  117.2 ±  0.1 |   5.5 |   0.0 |
100  |  75.5 ±  0.8 | 122.6 ±  3.5 |  261.9 ±  2.3 |  61.3 |   0.1 |
1000 | 204.5 ± 38.5 | 257.5 ± 28.8 | 1414.2 ± 47.0 | 313.0 |   1.0 |

We get long stops and short TTSPs! hoorah! You can play with the number of threads and the depth of the stack to get the effect you want. It is interesting that most profilers using JVMTI::Get*StackTrace opt for sampling all threads. This is risky to my mind as we can see above the profiling overhead is open ended. It would seem reasonable to sample 1 to X threads (randomly picking the threads we sample) to keep the overhead in check on the one hand and pick a reasonable balance of TTSP vs operation cost on the other.
As mention previously, JVMTI::GetStackTrace does not cause a global safepoint on Zing. Sadly AFAIK no commercial profilers use this method, but if they did it would help. The JMX thread bean does expose a single thread profiling method, but this only delegates to the multi-thread stack trace method underneath, which does cause a global safepoint...

Summary: This final step adds another interesting control for us to play with completing our set:
  • Safepoint poll interval -> Will drive time to safepoint. CPU scarcity, swapping and page faults are other common reasons for large TTSP.
  • Safepoint operation interval -> Sometimes in your control (be careful how you profile/monitor your application), sometimes not.
  • Safepoint operation cost -> Will drive time in safepoint. Will depend on the operation and your application specifics. Study your GC logs.

Example 4: Adding Safepoint Polls Stops Optimization

This is a repeat of a benchmark I ran to demonstrate the side effect of safepoint polls in loops on loop optimizations in the context of changing one of the Netty APIs. This post is so long by now that if you had the strength to read this far you can probably click through to the ticket. I'll quote the basic results here for completeness.
The benchmark:
We are debating replacing the length and DataSet index types from int to longs. The impact on usage inside loops is substantial as demonstrated by the following results:
Benchmark   (size)  Mode  Cnt    Score     Error  Units
copyInt       1000  avgt    5   76.943 ±  14.256  ns/op
copyLong      1000  avgt    5  796.583 ±  55.583  ns/op
equalsInt     1000  avgt    5  367.192 ±  16.398  ns/op
equalsLong    1000  avgt    5  806.421 ± 196.106  ns/op
fillInt       1000  avgt    5   84.075 ±  19.033  ns/op
fillLong      1000  avgt    5  567.866 ±  10.154  ns/op
sumInt        1000  avgt    5  338.204 ±  44.529  ns/op
sumLong       1000  avgt    5  585.657 ± 105.808  ns/op
There are several optimizations negated by the transition from counted to un-counted loop (uncounted loops currently include loops limited by a long, and therefore include a safepoint):
  • Loop unrolling :  This is not really a safepoint issue, can be fixed as per the KeepSafepointsInCountedLoops option
  • Fill pattern replacement (-XX:+OptimizeFill): the current fill implementation has no safepoints and the loop pattern it recognizes does not allow it.
  • Superword optimizations (-XX:+UseSuperword): current Superword implementation doesn't allow for long loops. This will probably require the compiler to construct an outer loop with an inner superword loop 
The latter 2 have been discussed on this blog before in some detail.

Final Summary And Testament

If you've read this far, you deserve a summary, or a beer, or something. WELL DONE YOU!!! Here's what I consider the main takeaways from this discussion/topic:
  • Safepoint/Safepoint poll/Safepoint operation are 3 distinct terms, often mingled. Separating them allows us to consider their costs and impacts individually.
  • In most discussions the term Safepoint is used to describe a global safepoint. Some JVMs can control safepoint transition on the thread level. By considering the individual thread state we can more easily reason about the transition costs on a thread by thread level. 
  • The failure of any thread to reach a safepoint (by hitting a safepoint poll) can cripple the whole JVM. In this sense most any bit of Java code can be considered blocking (as in not lock free). Consider for example a CAS loop, it's not counted, it will have a safepoint poll in it, the thread executing can be blocked indefinitely if another Java thread is suspended while not at a safepoint.
  • JVMs have many safepoint operations. If you are worried about their impact you should enable -XX:+PrintGCApplicationStoppedTime. Further detail on what types of safepoint operations were involved can be made available using -XX:+PrintSafepointStatistics –XX:PrintSafepointStatisticsCount=X. Try it out, it's educational.

References

  • From the OpenJDK Runtime Overview look for the "VM Operations and Safepoints" section. Note that the term safepoint is used here to mean a global safepoint.
  • J.Bempel's post covering many safepoint operations and observing the impact of safepoints.
  • Alexey Ragozin's post also covers similar definition to the Runtime Overview and explores the safepoint poll mechanism and logging options.
  • Gil Tene post on the Mechanical Sympathy mailing list which closely matches the "Thread State" approach above. Gil goes into the topic in the context of Unsafe operations and assumptions surrounding safepoints. 

Monday, 19 October 2015

Expanding The Queue interface: Relaxed Queue Access


Continuing from previous post on the expansion of the Queue interface to support new ways of interacting with queues I have gone ahead and implemented relaxedOffer/Poll/Peek for the JCTools queues. This was pretty easy as the original algorithms all required a bit of mangling to support the strong semantic and relaxing it made life easier. Implementing batch/continuous interactions was a bit more involved, but the results were interestingly rewarding. I will save discussing the drain/fill results for a follow up post.
The new interface is here, feedback is welcome.
TL;DR:

  • Throughput is improved in all cases (~5-15% increase)
  • Cost of failure (when offer/poll fail to return/add a value) is dramatically reduced
  • Latency for bursts is improved (as much as 40% reduction)

Relaxed Access API

This ended up a very small API change, most of the difficulty here being choice of name. I went with the 'relaxed' prefix as suggested by J.P.Bempel (as opposed to the 'weak' prefix considered earlier):

Relaxed vs. Non-Relaxed Example

The 'relaxation' of the queue full/empty reporting in these methods allows us to remove an extra load of the 'opposite' index and avoiding waiting around for delayed visibility. Here's how MPSCArrayQueue::poll was changed to become 'relaxed':

So now we can drop lines 11-15, it doesn't look like much. We wouldn't really expect this gap in offer (see offer for details) to happen very often, so the overhead can typically amount to an extra load of the producerIndex and a predictable branch.
Also, since poll on an MPSC is actually cheaper than the offer anyhow, aren't we optimizing a non-issue? Could we not be actually making throughput worse by making the consumer more aggressive?

Throughput Impact Of Relaxation

To test the impact of this change I constructed a JMH benchmark that measures throughput when both consumer and producer spin to offer/poll as fast as they can (see the code here). I ran the benchmark on a Haswell machine(i7-4770 CPU@3.40GHz/Ubuntu/OracleJVM8u60). The benchmarks were pinned to run cross core (no 2 threads share a physical core, i.e. no hyperthreading). Benchmarks were run with 5 forks, 10 warmup iterations (1 second), 10 measured iterations (1 second). The score reports operations per microsecond, or millions of operation per second if you prefer. The throughput is the number of pollsMade, but the other figures are interesting as well.
These results are with a single producer and consumer:
 Type |     1P1C     | Normal± Err  | Relax ± Err  |
 Spsc | offersFailed |   0.0 ±  0.1 |   0.1 ±  0.1 |
 Spsc | offersMade   | 439.1 ±  4.1 | 432.1 ± 10.0 |
 Spsc | pollsFailed  |   0.2 ±  0.5 |   0.1 ±  0.2 |
 Spsc | pollsMade    | 438.9 ±  4.1 | 431.9 ± 10.0 |
 Mpsc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.0 |
 Mpsc | offersMade   |  22.7 ±  0.6 |  31.9 ±  0.8 |
 Mpsc | pollsFailed  |   0.1 ±  0.1 |   6.6 ±  0.9 |
 Mpsc | pollsMade    |  22.8 ±  0.6 |  31.9 ±  0.8 |
 Spmc | offersFailed |   0.2 ±  0.2 |  81.0 ±  5.4 |
 Spmc | offersMade   |  23.5 ±  0.6 |  26.8 ±  0.6 |
 Spmc | pollsFailed  |   0.1 ±  0.1 |   0.2 ±  0.3 |
 Spmc | pollsMade    |  23.2 ±  0.6 |  26.6 ±  0.6 |
 Mpmc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.1 |
 Mpmc | offersMade   |  71.4 ±  5.1 |  71.7 ±  4.8 |
 Mpmc | pollsFailed  |   0.0 ±  0.1 |   0.3 ±  0.9 |
 Mpmc | pollsMade    |  71.4 ±  5.1 |  71.7 ±  4.8 |

  • The SPSC case serves as a baseline, and I expected results to be the same. There's a 2% difference which is close to the error reported. Maybe the inlining depth impacted the performance slightly, or perhaps the benchmark is not as stable as I'd like.
  • The MPSC case shows 50% improvement. This is pretty awesome. We also notice the consumer now fails more. This is indicative to the reduced cost of failure to find elements in the queue. The cache miss on the producerIndex was slowing down both the consumer and the producer. This phenomena has been observed in other post, but my best attempt at an explanation is here. The bottom line being that the mutator suffers from losing the exclusive state of the cache line in his own cache, while the reader gets hit with a read miss (going to LLC as the threads are on separate cores).
  • The SPMC case shows 15% improvement on throughput. The cost of failing to offer has gone down dramatically.
  • The MPMC case shows no change, but does show great throughput. This could be down to the very different algorithm in play or just because the MPMC queue is well balanced and thus behaves very well in a balanced use case as above. This may lead you to believe MPMC is a better choice than MPSC, but as the latency results and contended throughput results show this is an anomaly of the benchmark load. Because MPSC presents a much cheaper poll than offer the queue is always empty, making the contention on the queue buffer worse than it would be for MPMC where offer and poll are of similar costs.
Using the command line controls over thread groups sizes (-tg) I ran a few more cases. Did I mention I love JMH?
Here's MPMC and MPSC with 3 producers and 1 consumer:
 Type |     3P1C     | Normal± Err  | Relax ± Err  |
 Mpsc | offersFailed |   0.0 ±  0.0 |   0.1 ±  0.6 | 
 Mpsc | offersMade   |  12.7 ±  0.1 |  13.4 ±  0.1 | 
 Mpsc | pollsFailed  |   4.1 ±  0.5 | 115.3 ±  4.9 | 
 Mpsc | pollsMade    |  12.7 ±  0.1 |  13.4 ±  0.1 | 
 Mpmc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.0 | 
 Mpmc | offersMade   |  11.1 ±  0.0 |  11.7 ±  0.1 | 
 Mpmc | pollsFailed  |   0.3 ±  0.1 |  60.4 ± 10.0 | 
 Mpmc | pollsMade    |  11.1 ±  0.0 |  11.7 ±  0.1 |
  • MPSC case shows slight improvement to throughput of roughly 5%, but the improvement to cost of failure is very visible here.
  • MPMC case improves by 5% and the improved cost of failure is similarly visible.
Here's MPMC and SPMC with 1 producers and 3 consumer:
 Type |     1P3C     | Normal± Err  | Relax ± Err  |
 Spmc | offersFailed |   1.5 ±  0.2 | 125.9 ±  4.6 |
 Spmc | offersMade   |  13.4 ±  0.1 |  14.0 ±  0.0 |
 Spmc | pollsFailed  |   0.0 ±  0.0 |   0.0 ±  0.0 |
 Spmc | pollsMade    |  13.1 ±  0.1 |  13.7 ±  0.0 |
 Mpmc | offersFailed |  23.9 ±  0.8 | 124.1 ±  3.2 |
 Mpmc | offersMade   |  11.0 ±  0.0 |  12.1 ±  0.0 |
 Mpmc | pollsFailed  |   0.0 ±  0.2 |   1.0 ±  1.6 |
 Mpmc | pollsMade    |  10.7 ±  0.0 |  11.9 ±  0.0 |
  • SPMC case shows  5% improvement and great improvement to failure performance.
  • MPMC case improves by 11% and the improved cost of failure is similarly visible.
Here's MPMC only, with 2 producers and 2 consumers:
 Type |    2P2C      | Normal± Err  | Relax ± Err  |
 Mpmc | offersFailed |   1.4 ±  0.8 |   0.7 ±  0.6 |
 Mpmc | offersMade   |  14.8 ±  0.8 |  16.0 ±  0.6 |
 Mpmc | pollsFailed  |   0.0 ±  0.0 |   0.4 ±  0.5 |
 Mpmc | pollsMade    |  14.7 ±  0.8 |  15.9 ±  0.6 |
  • The improvement to throughput is a nice 8%, but the error is not that far from it so I'm not too confident without lots more runs. This is a similarly symmetric load to the 1P1C and the balance is showing through.
I could go on to higher core counts etc, but I'm actually feeling the win here is pretty conclusive:
  • Throughput is generally improved
  • Failure to make progress is significantly cheaper
Why is this failure mode important? This is particularly important when you have a consumer/producer which is able to make progress in the face of such a failure. A common example of this pattern is a Selector thread which also has an incoming queue of commands to execute, when there's nothing in the queue we would like to go back to selecting on the selector.


Latency impact of Relaxation

Is this going to help us in the case of burst handling latency? There's a benchmark for that too! Or rather there's an estimate of burst processing cost. Code is here. The general idea being to send a burst of messages from the producer, then wait for completion notice from the consumer.
Here goes:
 Size | Type | Normal±Err | Relax± Err |
    1 | Spsc |  126 ±   2 |  126 ±   1 |
    1 | Mpsc |  144 ±   2 |  140 ±   1 |
    1 | Spmc |  187 ±   2 |  189 ±   2 |
    1 | Mpmc |  190 ±   8 |  203 ±   2 |
   10 | Spsc |  193 ±   2 |  192 ±   3 |
   10 | Mpsc |  611 ±  22 |  429 ±   7 |
   10 | Spmc |  309 ±   3 |  307 ±   1 |
   10 | Mpmc |  807 ±  19 |  677 ±  15 |
  100 | Spsc |  530 ±   7 |  528 ±   4 |
  100 | Mpsc | 5243 ± 158 | 3023 ± 107 |
  100 | Spmc | 1240 ±   7 | 1250 ±   6 |
  100 | Mpmc | 4684 ± 382 | 3665 ± 111 |

  • Same as before, SPSC serves as a sanity check/baseline. No change is good.
  • SPMC/MPSC results are pretty similar when sending a single message. MPMC seems slightly regressed.
  • When sending 10/100 messages in a burst we see significant benefits to using the relaxed methods, in particular for the MPSC and MPMC cases.


Summary

Relaxing is good for you, deep down you always knew it. Turns out it ain't half bad for your queues either :-). The new API will be part of the next version of JCTools, coming out as soon as I can work through my Maven mojo phobia.

Thanks J.P. Bempel and Phil Aston for their kind review, remaining errors are a result of me refusing to listen to them persnickety bastards. 

Wednesday, 19 August 2015

An extended Queue interface

In my work on JCTools I have implemented a fair number of concurrent access queues. The Queue interface is part of the java.util package and offers a larger API surface area than I found core to concurrent message passing on the one hand, and still missing others. I'm hoping to solicit some discussion on some new methods, and see if I can be convinced to implement those I decided to avoid thus far.

Inter-thread Message Passing vs. Generic Collection

The queue data structure traditionally has an offer/push/enqueue method for writing into the queue and a poll/pop/dequeue method for removing elements. This should be enough for anyone, right?
Turns out some other useful methods come to mind when using a queue:
  • peek: looking, not touching
  • size: for monitoring and heuristics and so on
  • isEmpty: a potentially more efficient assertion than size == 0
But I would argue that implementing the full scope of the java.util.Collection methods is slightly redundant. Why is it good to have offer/poll also exposed as add/remove/element?
I have chosen to avoid implementing iterators for queues in JCTools, they seem to introduce a large complexity and I'm not convinced users out there need an iterator for their concurrent queue. The duplicate collection methods I get from extending AbstractQueue for free are alright though I don't see why anyone would actually use them.
So lots of methods I don't want, but also some methods missing, in particular for concurrent queue semantics. In particular I have come to the conclusion the following are both practical to implement and helpful to users:
  • weakOffer/Poll/Peek: the same as the 'strong' methods but without the requirement that allowing for spurious negative return value. This is particularly useful for queues where we are able to detect the inability to claim the next slot independently of the queue being full or empty. The 'weak' prefix is different from the Atomic* classes definition of 'weak' as it has no ordering implications, but I've been unable to come up with a better name(suggestions are most welcome).
  • drain(receiver)/fill(provider): this is attractive for cases where a receiver/provider function exists such that calls to it are guaranteed to succeed (e.g. receiver could just throw elements away, producer might return a constant). There are some small savings to be made here around the function calling loop, but for some these small savings might prove significant. A perpetual version of these methods could also take some sort of backoff strategy and spin when hitting an empty/full condition (similar to the Disruptor's BatchProcessor workflow).
  • batchPoll/Offer: this is attractive for processing known number of elements out of/into a queue. This is particularly profitable when multiple consumer/producers contend as the granularity of contention is reduced. It is also potentially damaging as batches can introduce large 'bubbles' into the queue and increase the unfairness. These methods could support a receiver/provider interface or use an array (enables batch array copies, guarantees quick release of a 'bubble').
  • offeredElementsCount/polledElementsCount: this is purely for reporting and would tend to be a long reflecting the producer/consumer counter. For linked queues this adds a new overhead, so while this is a clearly useful feature I'm reluctant to make it a compulsory part of the new interface.

I plan to implement the above for JCTools in the near future. What do you think? Please opine in the comments or via the projects suggestion box.
Thanks :)

Monday, 27 July 2015

JMH perfasm explained: Looking at False Sharing on Conditional Inlining

There is an edge that JMH (read the jmh resources page for other posts and related nuggets) has over other frameworks. That edge is so sharp you may well cut yourself using it, but given an infinite supply of bandages you should definitely use it :-) This edge is the ultimate profiler, the perfasm (pronounced PERF-AWESOME!, the exclamation mark is silent). I've been meaning to write about it for a while and as it just saved my ass recently...

SpscGrowableArrayQueue's False Sharing Issue

JCTools includes a specialized SPSC (single-producer/consumer) bounded queue aimed at actor systems, the SpscGrowableArrayQueue. This queue is quite neat because it is combining the compactness of a linked queue with the awesome throughput of an array backed queue. The idea is quite simple, and similar in spirit to an ArrayList:
  1. Start with a small buffer, default to 16 elements
  2. As long as the queue doesn't grow beyond that just stay small
  3. If on offer you find that the queue is full double the size of the underlying buffer
The mechanics of the queue resizing, handling intermediate queue states, and detecting a new buffer from the consumer are a bit involved and will perhaps be expanded on some other day.
Because the queue is geared at actor systems per queue footprint is important, so I reworked the memory layout of SpscArrayQueue and skimped on the padding where possible. If you have no idea what I mean by padding, and why it is done you can read the following posts:
  1. False sharing and the MESI protocol details related to cached counters: This explores the motivation of padding fields in the first place
  2. Discovering and controlling object layout(Using an early version of JOL): This show cases the JOL tool and how it should be used to discover object layout
  3. SPSC Revisited - part I: An empiricist tale: This post discusses the inlining of counters into a queue use inheritance to control field order and introduce padding. In this post I show how to pad the counter from each other and also how to pre/post pad the class and buffer.
  4. A more high level summary of false sharing is given in this post
With an overhead of roughly 128b per padding this was instrumental to reducing the high memory cost per queue. The hot counters were still padded from each other, but the class pre/post padding were removed as well as the padding of the elements array.
So, I gave up on some of the padding, but reasoned that in most cases this should not make a difference because the hottest fields were still padded from each other.

Why so slow?

Now, I expect a queue that is very similar to SpscArrayQueue, but adds some features to be slower. There's just very little you can do about this, doing more will cost you something. But given that resizing is a rather exceptional event for this queue I thought this will be a minor hit, maybe 10-20% reduction in performance (for a certain definition of performance). JCTools has some benchmarks included and I ran the QueueThroughputBackoffNone which will have a producer and consumer threads chewing on the queue as hard as they can.
Since the numbers are not very important here, I'll stick to running benchmarks on my laptop (an MBP 11,1/ Ubuntu 15.04/Zulu JDK8u45 - Zulu is a supported and tested OpenJDK build) in this post. Rest assured that I have confirmed on real hardware the same results. The results I quote below are the pollsMade figure which reflects the actual delivered throughput.
To my puzzlement I found:
SpscArrayQueue          361.223 ±  7.156  ops/us
SpscGrowableArrayQueue   64.277 ± 31.803  ops/us

Crap performance and terrible variance, bugger me sideways.
Looking at the code, I thought there must be something I was doing to upset the mighty JIT spirits. Maybe my methods were too big? My branches too unpredictable? My variable names too offensive? So I tweaked the code this way and that, looked at the inlining log (-XX:+PrintInlining) and the assembly (-XX:+PrintAssembly/-XX:CompileCommand=print,*) I got some minor improvements, but it mostly still sucked. What's wrong? A quick look at "-prof perfasm" and a fair amount of head scratching lead to the answer. The code (before and after) is on github, the main focus for this post is perfasm and its usage so I won't dive into it.

Before we start: How does perfasm work?

To use perfasm you'll need a Linux(or Windows) OS, running on real hardware, the relevant perf tool installed, and a JVM setup to print assembly.
Perf is (amongst other things) an instruction level profiler, of the kind that don't usually work with Java traditionally (though things are slowly changing). One of the features offered by perf is the "perf record/annotate" workflow. Perf interrupts your process repeatedly and records the current PC (program counter) at each sample. This sampling of the program counter is recorded over a period of time to be post processed by the annotate feature which correlates samples to methods, lines of code and assembly instructions. The challenge for perf when dealing with Java code is that the binary form of each method only exists for the duration of that process lifetime. This means the PC is mostly referring to methods that are nowhere to be found when the annotation stage comes along.
To summarize: perf record works, but perf annotate is broken for Java.
To make perfasm work JMH captures the JVM compiler outputs by enabling the following flags:
  -XX:+UnlockDiagnosticVMOptions
  -XX:+LogCompilation
  -XX:LogFile=...
  -XX:+PrintAssembly
  -XX:+PrintInterpreter
  -XX:+PrintNMethods
  -XX:+PrintNativeNMethods
  -XX:+PrintSignatureHandlers
  -XX:+PrintAdapterHandlers;
  -XX:+PrintStubCode
  -XX:+PrintCompilation
  -XX:+PrintInlining
  -XX:+TraceClassLoading
  -XX:PrintAssemblyOptions=syntax
The data collected here allows perfasm to do the annotation process by itself. I'll not bore you with the details of the output processing, but hat off to Mr. Shipilev who ploughed through the gory mess and made this work. The end result is a detailed output offering you the assembly of the hottest regions in your program, along with a list of the top hottest methods including both native and Java parts of the stack.
Because perf allows the recording of any number of events the definition of hottest depends on the events you choose to profile. The default events are cycles and instructions, but you can specify any number of events (E.g. -Djmh.perfasm.events=cycles,cache-misses). The first event specified will be the 'hotness' qualifier.

What do you get?

The perfasm output is split into 4 sections:
  1. Annotated assembly output for the top HOT regions(titled "Hottest code regions (>10.00% "cycles" events):")
  2. A list of the hot regions in your benchmark (titled "[Hottest Regions]"). This is the list of compiled methods inclusive of native methods, which is why we have the next section.
  3. A list of the hottest methods after inlining, so only java methods in this section (titled "[Hottest Methods (after inlining)]").
  4. A distribution of the cycles between types of regions (titled "[Distribution by Area]"). This will inform you of how the split goes between compiled code, kernel, JVM etc.
The most interesting of the 4 sections (and the only one I'm going to explain here) is the annotated hot regions section. I tend to edit the output of assembly spouting tools to make it more readable, but for the sake of explaining what's what here's the original output with footnotes:
To focus on the main features I trimmed out allot of the assembly code, which is where you see the "<MORE OF SAME/>" comments. Note the legend at the start of the output describing the 2 left most columns: cycles and instructions sample percentages.
Here's the footnotes explained:
  1. (Line 7) This column is the instruction address. This is important as this column is both how perfasm matches the perf record output with the PrintAssembly output and how poor sods like yours truely have to look for jump destinations in the code.
  2. (Line 7) This column has the actual instructions, in AT&T syntax (src -> dst).
  3. (Line 7) This column is the instruction annotation generated by PrintAssembly. It is not always correct, but it's helpful in connecting the dots. Multi-level inlining of methods can be seen in action everywhere.
  4. (Line 12) The condition/loop edge annotation is a wonderful recent addition to perfasm. It makes life marginally more bearable for the assembly consumer by connecting the jumps with their destinations. This is the start point annotation ""
  5. (Line 16) This is the end point annotation ""
When looking at this you will be searching for suspect hot spots in the code. Some instruction which is chewing away your cycles. If you're in luck you'll be able to tweak your code just right to side step it and then you are off and away :-)

Back on Track: Show me the cycles!

Right, so I point Captain Awesome at my benchmark and what do I see (trimmed addresses and edges as there's no important conditions here, also trimmed package names and moved code annotation left):
This instruction, eating 30% of my cycles (line 7, needs HEALING!), is part of a guard generated to ensure that inlining the poll method call is still a valid decision, that the instance observed at the callsite is of the class for which we inlined the poll method. This is not really a Java code line, this is the aftermath of inlining a virtual call (through an interface, see Aleksey's method dispatch inlining article to learn more on inlining).
To be fair, the above is kind of hard to understand. Why is a comparison between a constant and a register eating 30% of the cycles? according to every instruction 'cost' manual this should take 1 cycle (and the CPU can do 3-4 of them in parallel too), this is obviously not the instruction I'm looking for.
This is a phenomena known as 'skid' where the reported instruction at a given sample is inaccurate because modern CPUs are complicated. See the following dialogue on the linux-perf-user mailing list:
> I think Andi mentioned this to me last year -- that instruction profiling was no longer reliable. 
It never was. 
> Is this due to parallel and out-of-order execution? (ie, we're sampling the instruction pointer, but that's set to the resumption instruction, not the instructions being processed in the backend?). 
Most problems are due to 'skid': It takes some time to trigger the profiling interrupt after the event fired. [...] There are also other problems, for example an event may not be tied to an instruction. Some events have inherently large skid.
This sounds grim, but at the end of the day this is as accurate as profiling can get. It's not perfect, but it's still very valuable and you get used to it (there are ways to minimize skid discussed in the link, I've not tried those).
What you end up doing is looking for instructions just before the blamed instruction, or instructions on which the blamed instruction is dependant, which may be more reasonably blamed for the bottleneck. Given the CMP is not the problem, we must ask why would the CPU spend so much time at it? A CMP or a TEST will often get blamed for the price of the load into the registers they use, in this case the CMP is most probably being blamed for the load of the queue type from the object header one instruction back:
"0x8(%r12,%r11,8),%r8d  ; implicit exception: dispatches to 0x00007f83c93f929d"
It doesn't help that the comment talks about some implicit exception (this is an implicit null check), where it could say "getkid(q) + implicit_nullchk(q)" or something similar to indicate we are loading the kid (klass id) from the object header (see object header layout details here).
Now that I had pointed a finger at this operation, I was still quite confused. This was not an issue for any of the other queues, why would loading the kid be such a bottleneck in this case? Maybe I'm wrong about this (always a good assumption)? To prove this is the issue I created a duplicate of the original benchmark where I use the SpscGrowableArrayQueue directly instead of going via the interface, for comparison I also benchmarked the SpscArrayQueue in the same fashion:
SpscArrayQueue          369.544 ±  1.535  ops/us
SpscGrowableArrayQueue  272.021 ± 12.133  ops/us

Now that's more like it! the expected 20% difference is more like 30%, but this is much closer. This still begs the question, why is the type check for SpscGrowableArrayQueue so expensive? We can see that for SpscArrayQueue this makes very little difference, how is SpscGrowableArrayQueue different?

Messing with the KID

I had a long hard look at this issue, which didn't help, then slept on it, which did help, and realized the problem here is that the object header is false-sharing with the producer fields. When I trimmed down the padding on this class in an effort to minimize allocation, I removed the array post and pre-padding as well as the class pre/post padding and reasoned that for the most part I need not worry about the object's neighbours false sharing. What I failed to realize was that the consumer and producer threads might be frequently hitting the object header, in this benchmark on every call. Once I realized this was the issue I reinstated the pre-padding such that the producer index is far enough from the object header to stop interfering with it and the problem went away (see before and after, it's a messy diff as I fixed some other niggles while I was there, you can take the last version and play with adding/removing the padding and reordering the index and cold fields to verify my claims).
Here's the (slightly trimmed for readability) perfasm output for the same conditional inlining check in the padded version:

As an interesting side note, this issue was not visible in the hand rolled throughput benchmarks. This is because in those benchmarks the queue is hoisted into a variable before the consume/produce loop which means the conditional inlining check can be done out of loop as well. This is great for getting nice numbers but hides an issue which users are likely to hit. Credit goes to JMH for herding benchmarks down a path which forces these issues into the measured scope.

Summary

The main focus of this post is perfasm, I hope it helps get you started on using it. The broader context in which you would use this tool is a less explicit background. I use perfasm regularly, but I'm also reasonably happy to look at assembly and compiler annotation which I know most people are not. I find it to be invaluable in issues, like the one described in this post, where the cost is between the Java lines rather than the lines themselves. Any profiler can give you a broad feel of where performance bottlenecks are, and a non-safepoint-biased profiler can show you the hottest line of code. What a Java profiler will not tell you is about the generated JVM code between your lines. It will attribute the cost of those instructions to the nearest Java line, and that can become a very confusing chase.
It also worth pointing out that any nano-benchmark (measuring operations in the 0-500 nanoseconds range) practically requires you to look at the assembly for it's analysis. But 500ns can still be an aweful lot of code and an assembly level profiler is very handy. At this point it worth mentioning the Oracle Solaris Studio (first prize for least known profiler every year since release). It is a great assembly level profiler, and just generally a great profiler. If your measurement needs to take place outside the cosy comforts of a JMH benchmark I would recommend you give it a spin.
Finally, this investigation came in the context of a workflow that is followed in the development of JCTools. I would loosely describe it as follows:
  1. Implement new feature/data structure
  2. Reason about expected impact/performance as part of design and implementation
  3. Test expectations using the existing set of benchmarks
  4. Expectations are far off the mark (if not... not sure, will figure it out when it happens)
  5. Dig in until either expectations or code are cured.
This has a whiff of scientific exploration to it, but I assure you it is not done quite so seriously and I often fail to follow my own advice (or worse other people's advice). The habit of testing performance assumptions/expectation has offered me many an afternoon spent banging my head on a variety of surfaces. Perfasm has been instrumental in reducing the amount of head banging, but I fear nothing short of permanent brain damage will actually solve the problem.
This post has been kindly reviewed by Aleksey Shipilev, Darach Ennis and Richard Warburton. Any remaining errors are entirely their fault ;-)