Monday, 21 March 2016

GC 'Nepotism' And Linked Queues

I've just run into this issue this week, and it's very cute, so here's a summary. Akka has its own MPSC linked queue implementation, and this week it was suggested they swap to using JCTools. The context in which the suggestion was made was a recently closed bug with the mystery title: AbstractNodeQueue suffers from nepotism
An Akka user claimed to not be suffering from the bug because they were using JCTools, but as it turned out the issue was indeed present so I had to fix it etc... But what the hell is GC nepotism? You won't find it in your GC textbooks, but it's a catchy term for a very real problem.

Your Dead Old Relatives Can Promote You!

In the OpenJDK generational GCs there are two distinct generations, Old and New. The two generations are collected separately, but they are not independent of each other. In particular, references from NewGen to OldGen will keep OldGen objects alive through an OldGen collection cycle, and similarly references from OldGen to NewGen will keep objects alive through a NewGen collection. This is all quite straightforward, but it has a surprising implication:
Dead OldGen objects will keep dead NewGen objects they reference alive until an OldGen collection.
The reverse is also true, but since the whole notion of generational GC is based on the assumption that the NewGen is collected far more often than the OldGen, the case above is the real issue. How does this affect you? Is this even a real problem?
Turns out it's real enough to have convinced Doug Lea to fix a particular manifestation of it, but also subtle enough to have slipped past him (and me, and the Akka guys, and Martin Thompson, and others no doubt) in the first place.

Linked Queues: Pathological Nepotism

Consider the following LinkedQueue implementation:
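Here is a minimal, single-threaded sketch of the structure under discussion (the real MPSC implementations use atomic field updates for offer/poll, but the node/offer/poll shape is what matters here); it maintains the invariants listed below.

class LinkedQueue<E> {
    static final class Node<E> {
        E value;
        Node<E> next;
    }

    // tail is the consumer end and always refers to an already-consumed 'stub' node
    // (tail.value == null); head is the producer end (head.next == null).
    private Node<E> head;
    private Node<E> tail;

    LinkedQueue() {
        Node<E> stub = new Node<>();
        head = tail = stub; // head == tail <=> queue is empty
    }

    public void offer(E e) {
        Node<E> n = new Node<>();
        n.value = e;
        head.next = n; // link the new node after the current head
        head = n;
    }

    public E poll() {
        Node<E> next = tail.next;
        if (next == null) {
            return null; // empty
        }
        E e = next.value;
        next.value = null; // MUST clear, or the retained node leaks the value
        tail = next;       // the consumed node becomes the new stub...
        // ...note that the old tail still references 'next' through its next field,
        // which is exactly the nepotism problem discussed below
        return e;
    }
}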
The following invariants are maintained:
  • head == tail <=> queue is empty
  • head.next == null
  • tail.value == null
  • In summary: queue is empty <=> (head == tail && tail.next == null && tail.value == null)
The nodes are created and linked together on offer, and dereferenced as we poll. The poll method MUST clear the value because the queue retains the last node. If we were to let the last node keep its value we would have effectively leaked that value reference.
Ignoring the values, imagine the queue is part of the live set and we have enqueued a few elements; the arrows denote the next references:
head -> N3 -> null
tail -> N0 -> N1 -> N2 -> N3

If we now dequeue the elements we'll get back to empty; brackets denote a live object referenced from dead ones:
head -> N3 -> null
tail -> N3 -> null
DEAD:  N0 -> N1 -> N2 (-> N3)

Dead objects are traditionally considered the GC's problem, and cleaning up about-to-be-collected objects is considered an anti-pattern. After all, what is the point of cleaning up garbage? You are just duplicating work. Indeed, in a single generation collector this is a non-issue.
But consider for a moment the possibility that one of the above nodes was promoted into the OldGen. This is something which may not happen in short lived tests, or even 'lucky' long lived tests, but is obviously quite possible. We will have:
head -> N3 -> null
tail -> N3 -> null
Zombie NEW:  N1 -> N2 (-> N3)
DEAD OLD:  N0 (-> N1 -> N2 (-> N3))

The young are alive, as far as the old are concerned. The old are dead, but their promotion gives life to their 'children' until they get collected. To demonstrate this issue I've written the following program: 
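A rough sketch of the kind of program described (reusing the toy LinkedQueue above; class name and constants are made up, and the original may differ): burst-fill so that a young GC runs while many nodes are live and promotes some of them, then churn an empty queue and watch the young GC logs (e.g. with -verbose:gc).

public class NepotismDemo {
    public static void main(String[] args) {
        LinkedQueue<Object> q = new LinkedQueue<>();
        Object v = new Object();
        // Burst-fill enough nodes that a young GC is likely to run while many of them
        // are live, promoting some of them.
        for (int i = 0; i < 1_000_000; i++) {
            q.offer(v);
        }
        // Drain: the promoted nodes are now dead, but still chain to younger nodes.
        while (q.poll() != null) {
            // nothing to do, just empty the queue
        }
        // Keep churning: every new node remains reachable from the dead-but-promoted
        // nodes through the unbroken next chain, so young GCs keep them all alive and
        // they get promoted in turn, until a full GC clears the lot.
        while (true) {
            q.offer(v);
            q.poll();
        }
    }
}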

If you look at the GC logs, or plug in VisualVM with the Visual GC plugin, you will notice the young GC is entirely ineffective. Everything gets promoted to old, until an old GC clears the lot. After the old GC the problem disappears (in the test program; it is likely to repeat in real programs). This is 'pathological' because ANY promoted node will result in the promotion of ALL following nodes until an OldGen collection resolves the issue.

At this point you might be shaking your head sadly and saying: "Ye thick old fart, surely this is a silly issue which NEVER happens in the real world". I've had my share of heads being shook at me and am not offended, but consider that this issue is real enough for other people.

An Exercise To the Reader


My plan was to walk through the way this issue has been fixed in JCTools vs. Akka vs. Agrona, each solution reflecting different considerations. Sadly I'm quite short on time for a tidy summary, so I'll just give my observations, which you can choose to read up on and argue with.
All three implementations follow the algorithm outlined by D. Vyukov. Note that Agrona is ignoring the Queue interface requirement for poll() (return null iff the queue is empty).

  • In Akka the next reference is nulled on poll (see the sketch after this list). This solves the issue, but at the cost of further weakening the accuracy of the size() method.
  • In JCTools the next reference is set to point to the discarded node itself. This helps in telling the difference between a removed node and the head. That piece of information is used in the size() method to skip gaps in the chain being counted when racing with the consumer thread, resulting in a more accurate size estimate. The change does come at a cost to the size() method however, as we can no longer rely on the chain remaining whole throughout the size operation.
  • In Agrona the next reference is nulled, as in Akka. A further effort is made to reuse a special empty node to avoid repeatedly promoting transient nodes in empty queues. The effort is arguably worthwhile on systems with large numbers of active but mostly empty queues.
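To make the first two bullets concrete, here is the Akka-style fix expressed against the toy queue sketched earlier (a sketch, not the actual Akka or JCTools code), with the JCTools variation noted in a comment.

public E poll() {
    Node<E> next = tail.next;
    if (next == null) {
        return null; // empty
    }
    E e = next.value;
    next.value = null;
    Node<E> oldTail = tail;
    tail = next;
    oldTail.next = null; // Akka: a promoted-then-dead node can no longer drag its successors along
    // JCTools instead points the discarded node at itself (oldTail.next = oldTail),
    // which lets size() tell a consumed node apart from the end of the queue.
    return e;
}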

Open questions/discussion:

  • How common is this issue for 'classic' data structures? It seems reasonable that a host of queue/stack/tree data structures are similarly vulnerable. A scan of JDK sources suggests this is a known pitfall to the JDK developers, but I would bet many third party framework writers didn't get the memo.
  • How common is this issue for idiomatic 'user' Java code? Given the tendency towards composition rather than inheritance and the drive towards smaller objects, we may argue that user object graphs often approximate trees. If we have short-lived 'trees', a parent node can certainly drag a useless chunk of state to the old gen. If a reference bridges 2 trees we may have something approximating the pathological behaviour above. I'm not aware of any studies trying to determine the impact of nepotism on user code; part of the problem would be the definition of 'typical' user code.
  • Where does the term 'nepotism' come from? I made an attempt at tracing the origins, but got nowhere. The problem seems to have been documented at least 8 years ago, but the term 'nepotism' is not used in most of the writing. [UPDATE: it seems the term can be tracked down to a GC paper published in 1991: An Adaptive Tenuring Policy for Generation]

Wednesday, 24 February 2016

Why (Most) Sampling Java Profilers Are Fucking Terrible

This post builds on the basis of a previous post on safepoints. If you've not read it you might feel lost and confused. If you have read it, and still feel lost and confused, and you are certain this feeling is related to the matter at hand (as opposed to an existential crisis), please ask away.
So, now that we've established what safepoints are, and that:
  1. Safepoint polls are dispersed at fairly arbitrary points (depending on execution mode, mostly at uncounted loop back edges or method return/entry).
  2. Bringing the JVM to a global safepoint is a high-cost operation.
We have all the information we need to conclude that profiling by sampling at a safepoint is perhaps a bit shite. This will not come as a surprise to some, but this issue is present in the most commonly used profilers. According to this survey by RebelLabs this is the breakdown:


VisualVM, NB Profiler (same thing), YourKit and JProfiler all provide a sampling CPU profiler which samples at a safepoint. Seeing how this is a rather common issue, let's dig into it.

How Sampling Execution Profilers Work (in theory)

Sampling profilers are supposed to approximate the 'time spent' distribution in our application by collecting samples of where our application is at different points in time. The data collected at each sample could be:
  • current instruction
  • current line of code
  • current method
  • current stack trace
The data can be collected for a single thread or all threads at each sample. What do we need to hold for sampling to work?
"However, for sampling to produce results that are comparable to a full (unsampled) profile, the following two conditions must hold. First, we must have a large number of samples to get statistically significant results. For example, if a profiler collects only a single sample in the entire program run, the profiler will assign 100% of the program execution time to the code in which it took its sample and 0% to everything else. [...]
Second, the profiler should sample all points in a program run with equal probability. If a profiler does not do so, it will end up with bias in its profile. For example, let’s suppose our profiler can only sample methods that contain calls. This profiler will attribute no execution time to methods that do not contain calls even though they may account for much of the program’s execution time." - from Evaluating the Accuracy of Java Profilers, we'll get back to this article in a bit
That sounds easy enough, right?
Once we have lots of samples we can construct a list of hot methods, or even code lines in those methods (if the samples report it); we can look at the samples distributed on the call tree (if call traces are collected) and have an awesome time!

How Do Generic Commercial Java Sampling Execution Profilers Work?

Well, I can sort of reverse engineer here from different solutions, or read through open source code bases, but instead I'll offer unsupported conjecture and you are free to call me out on it if you know better. Generic profilers rely on the JVMTI spec, which all JVMs must meet:
  • JVMTI offers only safepoint sampling stack trace collection options (GetStackTrace for the calling thread doesn't require a safepoint, but that is not very useful to a profiler. On Zing  GetStackTrace to another thread will bring only that thread to a safepoint.). It follows that vendors who want their tools to work on ALL JVMs are limited to safepoint sampling.
  • You hit a global safepoint whether you are sampling a single thread or all threads (at least on OpenJDK, Zing is slightly different but as a profiler vendor OpenJDK is your assumption.). All profilers I looked into go for sampling all threads. AFAIK they also do not limit the depth of the stack collected. This amounts to the following JVMTI call: JvmtiEnv::GetAllStackTraces(0, &stack_info, &thread_count)
  • So this adds up to: setup a timer thread which triggers at 'sampling_interval' and gathers all stack traces.
This is bad for several reasons, some of which are avoidable:
  1. Sampling profilers need samples, so it is common to set the sampling frequency quite high (usually 10 times a second, or every 100ms). It's instructive to set -XX:+PrintGCApplicationStoppedTime and see what sort of pause time this introduces. It's not unusual to see a few milliseconds pause, but YMMV (depending on number of threads, stack depth, TTSP etc). A 5ms pause every 100ms will mean a 5% overhead (actual damage is likely worse than that) introduced by your profiler. You can usually control the damage here by setting the interval longer, but this also means you will need a longer profiling period to get a meaningful sample count.
  2. Gathering full stack traces from all the threads means your safepoint operation cost is open ended. The more threads your application has (think application server, SEDA architecture, lots of thread pools...), and the deeper your stack traces (think Spring and Co.), the longer your application will wait for a single thread to go round taking names and filling forms. This was clearly demonstrated in the previous post. AFAIK, current profilers do nothing to help you here. If you are building your own profiler it would seem sensible to set a limit on either quantity so that you can box your overheads. The JVMTI functionality allows you to query the list of current threads; you could sample all of them if there are fewer than 100, and otherwise pick a random subset of 100 to sample. It would make sense to perhaps bias towards sampling threads that are actually doing something as opposed to threads which spend all their time blocked.
  3. As if all that was not bad enough, sampling at a safepoint is a tad meaningless.
Points 1 and 2 are about profiling overheads, which is basically about cost. In my previous post on safepoints I looked at these costs, so there's no point repeating the exercise.  Cost may be acceptable for good profiling information, but as we'll see the information is not that great.
Point 3 bears explaining, so off we go in the search for meaning.

Safepoint Sampling: Theory

So what does sampling at a safepoint mean? It means only the safepoint polls in the running code are visible. Given that hot code is likely compiled by C1/C2 (the client/server compilers), we have reduced our sampling opportunities to method exits and uncounted loop back edges. This leads to the phenomenon called safepoint bias, whereby the sampling profiler's samples are biased towards the next available safepoint poll location (this fails the second criterion set out above: "the profiler should sample all points in a program run with equal probability").
This may not sound so bad at first, so lets work through a simple example and see which line gets the blame.
NOTE: In all of the following examples I will be using JMH as the test harness and make use of the 'CompilerControl' annotation to prevent inlining. This will let me control the compilation unit limits, and may seem cruel and unusual, or at least unfair, of me. Inlining decisions in the 'wild' are governed by many factors and it is safe (in my opinion) to consider them arbitrary (in the hands of several compilers/JVM vendors/command line arguments etc.). Inlining may well be the "mother of all optimizations", but it is a fickle and wily mother at that.
Let's look at something simple:
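A sketch of the kind of benchmark being discussed, reconstructed from the method names that appear in the profiler output further down (SafepointProfiling, meSoHotInline, meSoHotNoInline); the original differs in its details, but the shape is a counted loop with no poll, measured once inlined and once not.

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class SafepointProfiling {
    byte[] data = new byte[1024 * 1024];
    byte target = 1;

    // Free to inline into the JMH measurement loop: both the counted loop and the
    // method-exit poll vanish, leaving the measurement loop's back edge as the
    // nearest safepoint poll.
    @Benchmark
    public boolean meSoHotInline() {
        return contains(data, target);
    }

    // Kept out of line, so the method-exit safepoint poll survives.
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public boolean meSoHotNoInline() {
        return contains(data, target);
    }

    private static boolean contains(byte[] a, byte needle) {
        // Counted int loop: C1/C2 emit no safepoint poll inside it.
        for (int i = 0; i < a.length; i++) {
            if (a[i] == needle) return true;
        }
        return false;
    }
}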

This is an easy example to think about. We can control the amount of work in the method by changing the size of the array. We know the counted loop has no safepoint poll in it (verified by looking at the assembly output), so in theory the methods above will have a safepoint poll at method exit. The thing is, if we let the method above get inlined the end-of-method safepoint poll will disappear, and the next poll will be in the measurement loop (the JMH-generated *_jmhStub method).

So, it would seem reasonable to expect the method to get blamed if it is not inlined, but if it does get inlined we can expect the measurement method to get blamed. Right? Very reasonable, but a bit off.

Safepoint Sampling: Reality

Safepoint bias is discussed in this nice paper from 2010: Evaluating the Accuracy of Java Profilers, where the writers recognize that different Java profilers identify different hotspots in the same benchmarks and go digging for reasons. What they don't do is set up some benchmarks where the hotspot is known and use those to understand what it is that safepoint biased profilers see. They state:
"If we knew the “correct” profile for a program run, we could evaluate the profiler with respect to this correct profile. Unfortunately, there is no “correct” profile most of the time and thus we cannot definitively determine if a profiler is producing correct results."
So, if we construct a known workload... what do these profilers see?
We'll study that with the JMH safepoint biased profiler "-prof stack". It is much like JVisualVM in the profiles it presents for the same code, and a damn sight more convenient for this study.
NOTE: In the following sections I'm using the term sub-method to describe a method which is being called into from another method. E.g. if method A calls method B, B is a sub method of A. Maybe a better terminology exists, but that's what I mean here.
If we run the samples above we get 2 different hot lines of code (run with '-prof stack:detailLine=true'):
# Benchmark: safepoint.profiling.SafepointProfiling.meSoHotInline
....[Thread state: RUNNABLE]...
 99.6%  99.8% meSoHotInline_avgt_jmhStub:165

# Benchmark: safepoint.profiling.SafepointProfiling.meSoHotNoInline
....[Thread state: RUNNABLE]...
 99.4%  99.6% meSoHotNoInline_avgt_jmhStub:163

Neither one is in the actually hot method. It seems that the method exit safepoint is not deemed indicative of its own method, but rather of the line of code from which it was called. So forcing the method under measurement not to inline implicated the calling line of code in the measurement loop, while leaving it to inline meant the back edge of the loop got the blame. It also appears that an uncounted loop safepoint poll is deemed indicative of its own method.
We may deduce (though we won't necessarily be correct) that when looking at this kind of profile without line-of-code data a hot method is indicative of:
  1. Some non-inlined sub method is hot*
  2. Some code (own method? inlined sub method? non-inlined sub method?) in an uncounted loop is hot*
Having line of code data can help disambiguate the above cases, but it's not really very useful as line of code data. A hot line of code would be indicative of:
  1. Line has a method call: A method called from this line is hot (or maybe its inlined sub-methods are)*
  2. Line is a loop back edge: Some code (including inlined sub-methods) in this loop is hot*
  *Does that seem useful? Don't get your hopes up.
Because we usually have no idea which methods got inlined this can be a bit confusing (you can use -XX:+PrintInlining if you want to find out, but be aware that inlining decisions can change from run to run).

Mind The Gap

If the above rules held true you could use a safepoint biased profile by examining the code under the blamed node in the execution tree. In other words, it would mean a hot method indicates the hot code lies somewhere in that code or in the method it calls. This would be good to know, and the profile could serve as a starting point for some productive digging. But sadly, these rules do not always hold true. They are missing the fact that the hot code can be anywhere between the indicated safepoint poll and the previous one. Consider the following example:
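A sketch of the shape of such an example (names like MindTheGap and store are made up; setResult mirrors the method discussed next): the work is in a counted loop with no poll, and the first blame opportunity is the call that follows it.

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class MindTheGap {
    byte[] data = new byte[1024 * 1024];
    long result;

    @Benchmark
    public void blameSetResult() {
        long sum = 0;
        for (int i = 0; i < data.length; i++) { // where the time actually goes: counted loop, no poll
            sum += data[i];
        }
        setResult(sum); // but the first observable poll is around this call
    }

    private void setResult(long sum) {
        // setResult is cheap; its only sin is calling a method that does not get
        // inlined, which leaves a safepoint poll (and a blame opportunity) here.
        store(sum);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private void store(long sum) {
        result = sum;
    }
}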

Clearly, the time is spent in the loop before calling setResult, but the profile blames setResult. There's nothing wrong with setResult except that a method it calls into is not inlined, providing our profiler with a blame opportunity. This demonstrates the randomness with which the safepoint poll opportunities present themselves to user code, and shows that the hot code may be anywhere between the current safepoint poll and the previous one. This means that a hot method/line-of-code in a safepoint biased profile is potentially misleading without some idea of where the previous safepoint poll is. Consider the following example:

The profiler implicates the caller to a cheap method 9 levels down the stack, but the real culprit is the loop at the topmost method. Note that inlining will prevent a method from showing, but non-inlined frames only break the gap between safepoints on return (on OpenJDK anyhow; this is vendor and whim dependent, e.g. Zing puts the method-level safepoint on entry, and I'm not sure where J9 stands on this issue. This is not to say that one way is somehow better than the other, just that the location is arbitrary). This is why setResult6, which is not inlined and higher up the stack, doesn't show up.

Summary: What Is It Good For?

As demonstrated above, a safepoint sampling profiler can have a wildly inaccurate idea of where the hot code in your application is. This renders derived observations on "Running" threads pretty suspect, but they are at least correct observations on which threads are running. This doesn't mean they are entirely useless and sometimes all we need is a hint in the right direction for some good analysis to happen, but there's a potential for huge waste of time here. While samples of code running in the interpreter do not suffer safepoint bias, this is not very useful as hot code is quickly compiled. If your hot code is running in the interpreter you have bigger fish to fry than safepoint bias...
The stack traces for blocked threads are accurate, and so the "Waiting" profile is useful for discovering where threads block. If blocking methods are at the root of your performance issue then this will be a handy observation.
There are better options out there! I'll get into some of them in following posts:
- Java Mission Control
- Solaris Studio
- Honest-Profiler
- Perf + perf-map-agent (or perfasm if your workload is wrapped in a JMH benchmark)
No tool is perfect, but all of the above are a damn sight better at identifying where CPU time is being spent.
Big thanks to the many reviewers, if you find any issues please comment and I will punish them severely.

Monday, 1 February 2016

Wait For It: Counted/Uncounted loops, Safepoints and OSR Compilation

In this previous post about Safepoints I claimed that this here piece of code:
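A sketch of that piece of code, reconstructed from the details discussed below (the int-bounded loops and the "How Odd:" println); the class name is approximate and the real thing is on GitHub.

public class WhenWillItExit {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {
            long l = 0;
            for (int i = 0; i < Integer.MAX_VALUE; i++) {
                for (int j = 0; j < Integer.MAX_VALUE; j++) {
                    if ((i + j) % 2 == 0) {
                        l++;
                    }
                }
            }
            System.out.println("How Odd:" + l);
        });
        t.setDaemon(true); // a daemon thread should not keep the JVM alive
        t.start();
        Thread.sleep(5000); // main returns here, so the JVM 'should' exit
    }
}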

Will get your JVM to hang and will not exit after 5 seconds, as one might expect from reading the code. The reason this happens, I claimed, is because the compiler considers for loops limited by an int to be counted, and will therefore not insert a safepoint poll into the generated assembly code. This was the theory, and I tested it, and it definitely hung on my machine ('It Works On My Machine'™).
To my great surprise, some people not only read the posts, they also run the code (it's like they don't trust me ;-) ). And 2 of those diligent readers got back to me in the comments saying that actually the JVM exits just fine. So I ran it again, and it hung, again. Then I ran it with different versions of the JVM, and it hung as I expected. Then I thought, maybe they are running the code I put on GitHub, so I tried that. That also hung when I ran it from Eclipse, but did not when I built it with Maven and ran it from the command line. EUREKA!
As it happens, Eclipse has its own Java compiler (a fact well known to those who know it well, no doubt) and the bytecode it generated (the .class file) hung every time, but Maven uses javac and the .class file generated from that did not hang. When testing initially I was running my class inside Eclipse, or using the Eclipse .class file to run from the command line. Both .class files are valid, but different (same-same but different). We can use javap to print out class bytecode and compare ("javap -c -v -p <class file>"). For your pleasure you can find the original Java file, the 2 class files and their respective bytecode printouts on GitHub and come to your own conclusions before you continue reading.
So... is my understanding (and the Eclipse compiler) somehow out of date? Can this problem of telling counted loops from uncounted loops be solved by the bytecode compiler?

How is the bytecode different?

Well, mostly not that different:
  1. An easy to spot difference, which we can tell is semantically the same, is the bytecode result for "System.out.println("How Odd:" + l);": you'll notice that Eclipse is using the constructor "java/lang/StringBuilder."<init>":(Ljava/lang/String;)V" while javac is opting for "java/lang/StringBuilder."<init>":()V" followed by an append. Semantically the 2 choices are the same. We create a StringBuilder and append the String "How Odd:" and the variable l. The reality is that there are subtle differences; the Eclipse version will allocate a larger char[] for instance. It may also be that the Eclipse pattern will miss out on some String concatenation optimization because the OpenJDK implementers are likely to focus on patterns generated from the official javac.
  2. The loop interpretation is different. Eclipse is checking the loop conditions at the end of the loop whereas javac is building a loop with the checks upfront and the loop backedge at the end. That seems like the important difference.
Let's zoom in on the relevant bytecode for the loop:

If we were to reverse translate the loop bytecode to ugly Java (which supported goto) we'd get:

Note that there's no special Java bytecode for loop control flows. So why so different at runtime? If we dump out the assembly for the run method we will see that the Eclipse version is missing the safepoint in the loop (the method has a {poll_return} but there's no {poll}), and the javac version has that safepoint there. Is it the bytecode difference that makes this happen?

Variations on bytecode

I tinkered with the class a bit and refactored the loop out of the lambda. This results in similar bytecode for the loop and the same behaviour. I then made the limit of the loop a parameter of the method; this time it behaved like the javac version, but had the same funky loop structure:

WTF? If this is purely a bytecode subtlety then it seems pretty weird. What sort of flaky compiler would swing like this?

The compiler did it?

But... which compiler? So far we've looked at the output of the bytecode compiler, but that is only the first of many compilers to process your code on its way to being executed. Traditionally, when discussing Java code compilation, we look at the following stages:
  1. Javac: From code to bytecode
  2. Interpreter: Step through the bytecode
  3. C1/Client: Short compilation time, does not generate the best optimized code. Intended for short-lived applications.
  4. C2/Server: Longer compilation time, better optimized code. Intended for long-running applications.
And the compilation timeline would look something like this:
                                                   /->C1
*.java->javac->*.class->interpreter->class+profile<
                                                   \->C2
The situation with OpenJDK8 is more complex with Tiered Compilation allowing us a smooth transition from interpreter to C2 compilation via C1 compilations. 
Given the behaviour we are after was supposed to manifest after a C1/C2 compilation we can run the code above to find which compiler compiled the method in question by adding the flag: -XX:+PrintCompilation:
    159  201 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
    159  202       3       safepoint.hang.WhenWillItExitInt2::countOdds (44 bytes)
    159  203 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
    162  201 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
    142  200 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
   1896  202 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
   1896  203 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
   1896  204 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
   1899  203 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
The 3/4 printouts in the 4th column above indicate the method is compiled by C1 (profiled) and C2. But what is that % about? This is an indicator that the compilations are not 'normal' compilations but rather On Stack Replacement (OSR) compilations. OSR compilations are 'special'; from the OpenJDK glossary:
on-stack replacement
Also known as 'OSR'. The process of converting an interpreted (or less optimized) stack frame into a compiled (or more optimized) stack frame. This happens when the interpreter discovers that a method is looping, requests the compiler to generate a special nmethod with an entry point somewhere in the loop (specifically, at a backward branch), and transfers control to that nmethod. A rough inverse to deoptimization.
I like the "rough inverse to deoptimization", it's a great mental qualification helper. OSR is the opposite of deoptimization because it is a mid-method replacement of code with compiled code, whereas a deoptimization is a mid-method replacement of code with interpreted code. Normal compilation unit replacements happen out of method, so while the method is off stack. OSR is specifically done to replace long running loops, like our countOdds method. As we can see from the printout above both C1 and C2 can produce OSR compilations.
It is impossible to disable OSR directly in a product build (there's a developer option -XX:-CICompileOSR), but we can force upfront compilation by using the following flags: "-XX:-TieredCompilation -server/client -Xcomp". Running with these arguments all 3 variations (original, refactored loop, refactored parametrized loop) now hang both for the Eclipse and the javac generated bytecode.
In both cases where the Eclipse versions hang we can see OSR compilations of the relevant code.
So it seems:
  • C1/C2 non-OSR compilations of the loop treat it as a counted loop, thus eliminating the safepoint poll and demonstrating the hung behaviour due to huge TTSP.
  • C1/C2 OSR compilations of the same code result in different results depending on the generated bytecode for the loop. I'm not going to dig further into the reasons, but I would welcome any feedback from OpenJDK compiler guys.
The OSR compilation decision to treat the loop as non-counted is quite reasonable. If we have a loop which remains on stack for very long periods we should perhaps consider it uncounted.

Killer Task

I started down this path looking for valid Java code which will result in a hung JVM due to huge TTSP. Since OSR got in the way, can we build such a task? Sure we can, we just need to warm up the loop until it is C2 compiled and then run it with a large limit, like so:
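A sketch of what such a killer task might look like (the class name, the loop body and the warmup counts are guesses; countOdds is the method name from the compilation log above): warm the loop method up with a small bound so it gets a normal, non-OSR C2 compilation, then call it with a huge bound from a daemon thread.

public class KillerTask {
    static long countOdds(int limit) {
        long odds = 0;
        for (int i = 0; i < limit; i++) { // counted loop: no safepoint poll once compiled
            if (i % 2 == 1) {
                odds++;
            }
        }
        return odds;
    }

    public static void main(String[] args) throws InterruptedException {
        // Warmup: enough calls with a small limit to trigger a normal (non-OSR) C2 compilation.
        for (int i = 0; i < 100_000; i++) {
            countOdds(100);
        }
        Thread t = new Thread(() -> System.out.println("How Odd:" + countOdds(Integer.MAX_VALUE)));
        t.setDaemon(true);
        t.start();
        // We'd like the JVM to exit after this, but shutting down also needs a safepoint,
        // and the compiled counted loop never polls.
        Thread.sleep(5000);
    }
}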
This fucker now hangs in both the Eclipse and javac versions. Joy!

Side note: OSR and JMH

I look at benchmarks people write quite a bit, and it is very common to see people who roll their own benchmark harnesses put their "code under measurement" in a counted loop, run it 100,000 times and derive their results from that. Often that means the code they are measuring is OSR-compiled code, which is different (for better or worse, mostly worse) from the code they will be running normally. If you are one of these people, I beg you: cut that fucking shit out!
JUST USE JMH!
PLEASE?
To a large extent it will cover your use case. Where it doesn't cover your use case, post a question on the jmh-dev mailing list; it might be that you've missed something. If it turns out JMH cannot cover your use case, make sure you've read the JMH samples before rolling your own harness.
I'm not going to include a before and after example, because it just makes me sad. Look at SO for some fine examples.

Counted Loops?

I keep saying 'counted loops' like it means something, so here are some examples for you to play with and contemplate:
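A hedged sketch of the sort of loop shapes worth contemplating (not the original set of examples), with the classification I'd expect from OpenJDK C2 noted in the comments, subject to the caveats below.

public class CountedLoopExamples {
    // int index with an int bound: a classic counted loop, no safepoint poll expected inside
    static long sumIntIndex(byte[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // long index: treated as uncounted, a safepoint poll is kept on the back edge
    static long sumLongIndex(byte[] a) {
        long sum = 0;
        for (long i = 0; i < a.length; i++) {
            sum += a[(int) i];
        }
        return sum;
    }

    // long-bounded while loop: no int trip count, uncounted, poll on the back edge
    static long spinUntil(long limit) {
        long count = 0;
        while (count < limit) {
            count++;
        }
        return count;
    }
}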

So, as you can see the definition ain't so clear and I would imagine some of the above behaviour will vary depending on JVM version/vendor/phase of the moon etc.

Summary

So, it seems my previous example was not as tested as it should have been. I apologize to anyone who has based life changing decisions on that sample. I hope this post helps in explaining away the subtleties which led to my downfall.
The above examples help demonstrate that the line between counted and uncounted loops is a tad confusing and the classification may depend on differences in bytecode, compiler mode, and so on.

Monday, 14 December 2015

Safepoints: Meaning, Side Effects and Overheads

I've been giving a couple of talks in the last year about profiling and about the JVM runtime/execution, and in both I found myself coming up against the topic of Safepoints. Most people are blissfully ignorant of the existence of safepoints; in a room full of people I would typically find 1 or 2 developers who have any familiarity with the term. This is not surprising, since safepoints are not a part of the Java language spec. But as safepoints are a part of every JVM implementation (that I know of) and play an important role, here's my feeble attempt at an overview.

What's a Safepoint?

I was recently asked: "Are safepoints like safe words?". The short answer is "not really", but since I like the metaphor I'll run with it:
Imagine if you will a JVM full of mutator threads, all busy, sweating, mutating the heap. Some of them have <gasp> shared mutable state. They're mutating each others state, concurrently, like animals. Some stand in corners mutating their own state (go blind they will). Suddenly a neon sign flashes the word PINEAPPLES. One by one the mutators stop their rampant heap romping and wait, sweat dripping. When the last mutator stops, a bunch of elves come in, empty the ashtrays, fill up all the drinks, mop up the puddles, and quickly as they can they vanish back to the north pole. The sign is turned off and the threads go back to it....
<inner-dialogue>Too much? It's like 35c in the shade here, so sweating is like... no? never mind </inner-dialogue>
There are many references to Safepoints to be found online and what follows here is my attempt at a more nuanced use/definition, no sweating mutators from this point onwards.
A safepoint is a range of execution where the state of the executing thread is well described. Mutator threads are threads which manipulate the JVM heap (all your Java Threads are mutators. Non-Java threads may also be regarded as mutators when they call into JVM APIs which interact with the heap).
At a safepoint the mutator thread is at a known and well defined point in its interaction with the heap. This means that all the references on the stack are mapped (at known locations) and the JVM can account for all of them. As long as the thread remains at a safepoint we can safely manipulate the heap + stack such that the thread's view of the world remains consistent when it leaves the safepoint.
This is particularly useful when the JVM wishes to examine or change the heap, for a GC or any one of many other reasons. If references on the stack were unaccounted for and the JVM was to run a GC it may miss the fact that some objects are still alive (referenced from the stack) and collect them, or it may move some object without updating its reference on the stack, leading to memory corruption.
A JVM will therefore need means of bringing threads to safepoints (and keeping them there) so that all sorts of runtime magic can happen. Here's a partial list of activities which JVMs run only once all mutator threads are at a safepoint and cannot leave it until released (at a global safepoint); these are sometimes called safepoint operations:
  • Some GC phases (the Stop The World kind)
  • JVMTI stack sampling methods (not always a global safepoint operation for Zing)
  • Class redefinition
  • Heap dumping
  • Monitor deflation (not a global safepoint operation for Zing)
  • Lock unbiasing
  • Method deoptimization (not always)
  • And many more!
A great talk by Azul's own John Cuthbertson from JavaOne 2014 goes into some background on Safepoints and a lot of detail about safepoint operations other than GC (we at Azul feel GC is a solved problem, so the talk goes into the remaining reasons to stop).
Note that the distinction between requesting a global safepoint and a thread safepoint exists only for some JVM implementations (e.g. Zing, the Azul Systems JVM. Reminder: I work for Azul). Importantly it does not exist on OpenJDK/Oracle JVMs. This means that Zing can bring a single thread to a safepoint.
To summarize:
  • Safepoints are a common JVM implementation detail
  • They are used to put mutator threads on hold while the JVM 'fixes stuff up'
  • On OpenJDK/Oracle every safepoint operation requires a global safepoint
  • All current JVMs have some requirement for global safepoints

When is my thread at a safepoint?

So having threads at a safepoint allows the JVM to get on with its managed runtime magic show, great! When is this groovy state happening?
  • A Java thread is at a safepoint if it is blocked on a lock or synchronized block, waiting on a monitor, parked, or blocked on blocking IO. Essentially these all qualify as orderly de-scheduling events for the Java thread, and as part of tidying up before being put on hold the thread is brought to a safepoint.
  • A Java thread is at a safepoint while executing JNI code. Before crossing the native call boundary the stack is left in a consistent state before handing off to the native code. This means that the thread can still run while at a safepoint.
  • A Java thread which is executing bytecode is NOT at a safepoint (or at least the JVM cannot assume that it is at a safepoint).
  • A Java thread which is interrupted (by the OS) while not at a safepoint is not brought to a safepoint before being de-scheduled.
The JVM and your running Java threads have the following relationship around safepoints:
  • The JVM cannot force any thread into a safepoint state.
  • The JVM can stop threads from leaving a safepoint state.
So how can the JVM bring all threads into a safepoint state? The problem is suspending a thread at a known state, not just interrupting it. To achieve this goal JVMs have the Java threads suspend themselves at convenient spots if they observe a 'safepoint flag'.

Bringing a Java Thread to a Safepoint

Java threads poll a 'safepoint flag' (global or thread level) at 'reasonable' intervals and transition into a safepoint state (thread is blocked at a safepoint) when they observe a 'Go to safepoint' flag. This is simple enough, but since we don't want to spend all of our time checking if we need to stop, the C1/C2 compilers (the -client/-server JIT compilers) try to keep safepoint polls to a minimum. On top of the cost of the flag check itself, maintaining a 'known state' adds significant complexity to the implementation of certain optimizations, and so keeping safepoints further apart opens up a wider scope for optimization. These considerations combined lead to the following locations for safepoint polls:
  • Between any 2 bytecodes while running in the interpreter (effectively)
  • On 'non-counted' loop back edge in C1/C2 compiled code
  • Method entry/exit (entry for Zing, exit for OpenJDK) in C1/C2 compiled code. Note that the compiler will remove these safepoint polls when methods are inlined.
If you are the sort of person who looks at assembly for fun (or profit, or both) you'll find safepoint polls in the -XX:+PrintAssembly output by looking for:
  • '{poll}' or '{poll_return}' on OpenJDK, this will be in the instruction comments
  • 'tls.pls_self_suspend' on Zing, this will be the flag examined at the poll operation
This mechanism is implemented differently on different VMs (on x86 + Linux, I've not looked at other architectures):
  • On Oracle/OpenJDK a blind TEST of an address on a special memory page is issued. Blind because it has no branch following it so it acts as a very unobtrusive instruction (usually a TEST is immediately followed by a branch instruction). When the JVM wishes to bring threads to a safepoint it protects the page causing a SEGV which is caught and handled  appropriately by the JVM. There is one such special page per JVM and therefore to bring any thread to a safepoint we must bring all threads to a safepoint.
  • On Zing the safepoint flag is thread local (hence the tls prefix). Threads can be brought to a safepoint independently.
See related post: Dude, Where's My Safepoint? for more details on polls.
Once a thread detects the safepoint flag it will perform the safepoint action that the poll triggers. This normally means blocking on some sort of JVM level lock, to be released when the safepoint operation is over. Think of it as a locking mechanism (sketched in code after the list below) where:

  • Threads can lock themselves out (e.g. by calling into JNI, or blocking at a safepoint).
  • Threads can try and re-enter (when returning from JNI), but if the lock is held by the JVM they will block.
  • The safepoint operation requests the lock and blocks until it owns it (i.e. when all the mutators have locked themselves out)
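This is emphatically not how HotSpot implements any of it, but the locking analogy above can be spelled out as a toy model (all names here are made up).

import java.util.concurrent.locks.ReentrantReadWriteLock;

class SafepointAnalogy {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair, so the VM thread isn't starved
    private volatile boolean safepointRequested; // the 'safepoint flag' polls examine

    void startRunningJava() { lock.readLock().lock(); }   // a mutator begins executing Java code
    void enterNative()      { lock.readLock().unlock(); } // JNI call: lock myself out
    void returnFromNative() { lock.readLock().lock(); }   // blocks if a safepoint operation is in progress

    // A safepoint poll, as emitted by the JIT at back edges and method entry/exit.
    void poll() {
        if (safepointRequested) {
            lock.readLock().unlock(); // check in at the safepoint
            lock.readLock().lock();   // and block here until the operation is done
        }
    }

    // Executed by the VM thread.
    void runSafepointOperation(Runnable operation) {
        safepointRequested = true;
        lock.writeLock().lock(); // wait for every mutator to check in or be in native
        try {
            operation.run();
        } finally {
            safepointRequested = false;
            lock.writeLock().unlock(); // release the mutators
        }
    }
}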

All Together Now

Now we have all the moving parts of the safepoint process:
  • Safepoint triggers/operations: reasons to request, and stuff to do which requires one or all threads to be at a safepoint
  • Safepoint state/lock: into which Java threads can voluntarily check in, but can only leave when the JVM is done executing the safepoint operation
  • Safepoint poll: the voluntary act by a Java thread to go into a safepoint if one is needed
Let's see how it all hangs together, for example when a profiler is sampling the stack using JVMTI::GetStackTrace:

GREEN  :   Executing Java
YELLOW :   Descheduled Java
RED    :   Native
This diagram highlights several components of safepoint operation overheads:
  1. Time To Safe Point(TTSP): Each thread enters a safepoint when it hits a safepoint poll. But arriving at a safepoint poll requires executing an unknown number of instructions. We can see J1 hits a safepoint poll straight away and is suspended. J2 and J3 are contending on the availability of CPU time. J3 grabs some CPU time pushing J2 into the run queue, but J2  is not in a safepoint. J3 arrives at a safepoint and suspends, freeing up the core for J2 to make enough progress to get to a safepoint poll. J4 and J5 are already at a safepoint while executing JNI code, they are not affected. Note that J5 is trying to leave JNI halfway through the safepoint and is suspended before resuming Java code. Importantly we observe that the time to safepoint varies from thread to thread and some threads are paused for longer than others, Java threads which take a long time to get to a safepoint can delay many other threads.
  2. Cost of the safepoint operation: This will depend on the operation. For GetStackTrace the cost of the operation itself will depend on the depth of the stack for instance. If your profiler is sampling many threads (or all the threads) in one go (e.g. via JVMTI::GetAllStackTraces) the duration will grow as more threads are involved. A secondary effect may be that the JVM will take the opportunity to execute some other safepoint operations while the going is good. 
  3. Cost of resuming the threads.
Some takeaways from this analysis:
  1. Long TTSP can dominate pause times: some causes for long TTSP include page faults, CPU over-subscription, long running counted loops.
  2. The cost of stopping all threads and starting all threads scales with the number of threads. The more threads the higher the cost. Assuming some non-zero cost to suspend/resume and TTSP, multiply by number of threads...
  3. Add -XX:+PrintGCApplicationStoppedTime to your list of favourite flags! It prints the pause time and the TTSP.
But why should you trust me? Let's break some shit!!! Please note that in the following examples I will be switching between Oracle 8u60 and Zing (latest dev JDK8 build). I work for Azul Systems on the Zing JVM, so feel free to assume bias on my part. My main motivation in including both JVMs is so that I can demonstrate the effect of -XX:+KeepSafepointsInCountedLoops on Zing.
NOTE: The benchmarks are not intended to spark a flame war and though Zing happens to compile this particular piece of code 'better' I would not get over excited about it. What is exciting is the extra option Zing makes available here to control and minimize TTSP, and the additional perspective it offers on how a JVM can deal with these challenges.

Example 0: Long TTSP Hangs Application

The above code should exit in 5 seconds (or see it on GitHub). The silly computation running in the daemon thread should not stop it from exiting (the JVM should exit if all remaining running Java threads are daemon threads). But if we run this sample we will find that the JVM does not exit after 5 seconds. It doesn't exit at all [UPDATE: this example turned out to be a bit flaky, as per the discussion in the comments and the following blog post]. Not only does it not exit, it will also refuse incoming JMX connections and it won't respond to jmap/jstack. It won't naturally die, you'll have to 'kill -9' this fucker.
The reason it's so stuck is that the computation is running in a 'counted' loop, and counted loops are considered short by the JVM. Short enough to not have safepoints. This thread refusing to come to a safepoint will lead to all other threads suspending indefinitely at the next safepoint operation.
We can fix this (somewhat contrived) problem in several ways:
  1. Prevent C1/C2 compilation: Use -Xint or a compile command to prevent the method compilation. This will force safepoint polls everywhere and the thread will stop hurting the JVM.
  2. Replace the index type with long: this will make the JVM consider this loop as 'uncounted' and include a safepoint poll in each iteration.
  3. On Zing you can apply the -XX:+KeepSafepointsInCountedLoops option on either the JVM or method level (via a -XX:CompileCommand=option,... or via a compiler oracle file). This option is coming shortly to an OpenJDK near you. See also related bug where breaking the loop into an inner/outer loop with safepoints in the outer loop is discussed.
This example is handy when considering shared JVM solutions. This is perfectly 'safe' Java code that will bring most application containers to their knees.
Summary: Delaying a safepoint poll indefinitely can be harmful...

Example 1: More Running Threads -> Longer TTSP, Higher Pause Times

The benchmark has 1 thread triggering a safepoint operation by calling safepointMethod() (in this case the System.gc() will be called as a result) and many threads scanning a byte array. The byte array scan loop will not have a safepoint poll (it's a counted loop), so the size of the array will act as our intended safepoint poll interval. With this setup we can control the frequency of safepoint operations (with some overhead), avg. distance from safepoint poll and with JMH the number of threads. We can use -XX:+PrintGCApplicationStoppedTime to log the stopped time and TTSP. Armed with this experiment we can now find a large machine to play on and see what happens. I use the command line '-tg' parameter to control the number of threads executing the contains1() method.
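A sketch of the benchmark shape described above (the real class is linked below and differs in detail; only three of the four grouped methods are shown, contains1toK() joins in the next example).

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class SafepointUsingGc {
    @Param("100")
    int intervalMs;
    @Param("32768")
    int size;
    byte[] bytes;
    byte needle = 1;

    @Setup
    public void setup() {
        bytes = new byte[size];
    }

    // Counted loop, no poll inside: 'size' controls how far these threads can be
    // from their next safepoint poll.
    @Benchmark
    @Group("run_together")
    public boolean contains1() {
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == needle) return true;
        }
        return false;
    }

    // Baseline: just park for the interval.
    @Benchmark
    @Group("run_together")
    public void park() {
        LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(intervalMs));
    }

    // Park for the interval, then trigger a global safepoint operation via GC.
    @Benchmark
    @Group("run_together")
    public void safepoint() {
        LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(intervalMs));
        System.gc();
    }
}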
So, here goes on a 48 core Haswell Xeon machine, Cent OS, I'll be using Oracle 8u60. The benchmark class I'm running is here.
Let's start with no contains1() threads and explain the method and data collection as we go along:
$ java -jar safepoints.jar -f 1 -i 10 -wi 10 -jvmArgs="-XX:+PrintGCApplicationStoppedTime -Xms1g -Xmx1g" -tg 0,0,0,1 SafepointUsingGc

There are 4 thread groups because I add another benchmark to run with this crowd in the next example.
This spouts a lot of PrintGCApplicationStoppedTime printouts similar to:
Total time for which application threads were stopped: 0.0180172 seconds, Stopping threads took: 0.0000513 seconds

I'm no Unix shell guru, but with some googling even I can AWK something to summarize this shit:
grep 'Total time' output.file | awk '{ stopped += $9;ttsp += $14; n++ } END { if (n > 0) print "stopped.avg=" stopped / n ", ttsp.avg=" ttsp / n; }'

It's summing up the warmup period measurements as well, but I'm willing to let it ride. So with that in mind, let's compare the benchmark output with the reported pause times:
SafepointUsingGc.run_together:safepoint           100   10   32768     true  avgt   10  118611444.689 ± 1288530.618  ns/op
stopped.avg=0.015759, ttsp.avg=6.98995e-05

Or formatted for humans:
park+gc(ms) |Stp.ms|TTSP.ms|
  119 ± 1.0 | 16.1 |  0.07 |


We went with the default intervalMs=100, so we'd expect the safepoint() cost to be around 100 + stopped.avg, it's actually 2ms more than that. This is why we have the park() benchmark to baseline against. Running again with the park group enabled ('-tg 0,0,1,1'):
park(ms)   | park+gc(ms) |Stp.ms|TTSP.ms|
 101 ± 0.8 | 118 ± 1.4   | 15.9 |  0.07 |

We can see that there's some variance to the GC call and some scheduling variance which can account for the delta between our expected and actual values. Small potatoes, let's move on. We start with measuring the cost of the contains1() call when we're not slapping it with a GC every 100ms (with '-tg 1,0,1,0' we're also running the park baseline):
contains1(us) | park(ms)    |Stp.ms|TTSP.ms|
  32.2 ± 0.02 | 100.1 ± 0.0 |  0.0 |  0.0  |

Note that the contains1() method cost is in microseconds. Awesome, now let's ramp up the number of contains1() threads with the GC thread running every 100ms:
T   |contains1(us) | park(ms)    | park+gc(ms) |Stp.ms|TTSP.ms|
1   |  51.4 ±  3.1 | 100.7 ± 0.7 | 117.5 ± 0.7 | 15.1 |  0.1  |
2   |  49.5 ± 17.1 | 101.6 ± 1.9 | 116.9 ± 0.5 | 14.4 |  0.1  |
4   |  45.5 ±  2.6 | 102.8 ± 4.0 | 116.9 ± 0.7 | 14.4 |  0.1  |
8   |  45.6 ±  2.1 | 103.6 ± 5.6 | 114.0 ± 0.7 | 13.3 |  0.1  |
16  |  46.8 ±  1.3 | 103.4 ± 6.8 | 115.0 ± 1.5 | 12.8 |  0.2  |
32  |  65.7 ±  2.0 | 101.8 ± 0.9 | 113.0 ± 1.3 | 11.4 |  0.7  |
64  | 121.9 ±  1.9 | 106.0 ± 8.1 | 116.1 ± 1.4 | 14.5 |  5.4  |
128 | 251.9 ±  6.6 | 103.6 ± 5.3 | 119.6 ± 2.3 | 16.9 |  6.4  |
256 | 513.2 ± 13.4 | 109.2 ± 9.7 | 123.4 ± 1.7 | 20.7 |  8.0  |
512 |1111.3 ± 69.7 | 143.9 ±24.8 | 134.8 ±15.3 | 27.8 | 10.0  |


OK, that escalated not so quickly. Keep in mind we only actually have 48 cores (hyper-threaded), and so when we cross the line between 32 and 64 we are now looking at the effect of contention and scheduling on the TTSP. It is evident that if you have more CPU bound threads than CPUs performance will degrade, but that is not really what we're after.
At 1 thread we already see a big drop in performance for contains1(). This is somewhat understandable in the presence of 15ms stolen from the application every 100ms.  But where you'd expect a 15% difference we're actually looking at a far more significant drop in performance. Going from 32us to 51us kinda sucks.
We can see a steady growth in TTSP while within the core count, going from 0.1ms to 0.7ms, this is bad news to anyone looking to keep their latencies in the low microseconds count as touched upon in this post from the brilliant Mark Price. Mark highlights the fact that the JVM regularly calls for a global safepoint (“whether she needed it or not”), so you may find to your delight that these little hiccups come round even when you've been allocating nothing.
Once we have more threads than cores the TTSP cost goes into the milliseconds, but escalates rather slowly once there. Going from 5ms to 10ms while we go from 64 to 512 threads is better than I expected.
Summary: By increasing the thread count we increase the TTSP. CPU over-subscription will lead to a further increase in TTSP. Monitor your machines' load average to detect this kind of misconfiguration.

Example 2: Long TTSP has Unfair Impact

Building on the previous benchmark, but now we have 2 methods: contains1() and contains1toK(). contains1toK() is K times further away from a safepoint poll (the outer loop is counted, the containsNeedle() method gets inlined, removing the method exit safepoint, and the loop inside it is counted). By increasing K we can see the impact of one thread on the other; in particular we expect the cost of contains1() to increase as the cost of contains1toK() increases, in the presence of safepoint operations. The unfairness here being that while contains1toK() is making progress it is making contains1(legen...) wait for it (...DARY!). Because using GC as the safepoint operation is rather heavy handed (even though it's a small heap and there's no garbage) we'll switch to using another safepoint operation for this case: ManagementFactory.getThreadMXBean().findDeadlockedThreads(). This is a lot more lightweight than GC but becomes more expensive as we pile on threads (not by that much, but still), so I did not use it for the prior example (though come to think of it the number of threads should also impact GC length a bit, oh well). Anyway, variety is the spice of life. Here's the results with both methods and park(), but no safepoint operation (benchmark class is here):
K     |contain1(us)|   containsK(us)    | park(ms)    |Stp.ms|TTSP.ms|
1     | 32.3 ± 0.1 |     34.8 ±     1.7 | 100.0 ± 0.0 |  0.0 |  0.0  |
10    | 32.4 ± 0.5 |    349.7 ±    18.0 | 100.0 ± 0.0 |  0.0 |  0.0  |
100   | 38.4 ± 2.8 |   3452.7 ±   130.4 | 100.0 ± 0.0 |  0.2 |  0.1  |
1000  | 39.8 ± 1.1 |  33419.9 ±   421.6 | 100.0 ± 0.0 |  0.0 |  0.0  |
10000 | 41.7 ± 5.0 | 336112.5 ± 10306.5 | 100.5 ± 1.7 |  0.0 |  0.0  |

Indeed the cost of contains1toK() grows linearly with K and the 2 methods don't seem to impact each other that significantly.
Here's what happens as we throw in the safepoint ops:
K     |contains1(us)|   containsK(us)   | park(ms)     | park+FDT(ms) |Stp.ms |TTSP.ms|
1     |  39.7 ± 0.9 |     39.6 ±    1.4 | 100.0 ±  0.0 | 100.3 ±  0.0 |   0.1 |   0.1 |
10    |  40.0 ± 1.0 |    400.1 ±   14.4 | 100.0 ±  0.0 | 100.5 ±  0.0 |   0.2 |   0.2 |
100   |  41.8 ± 1.1 |   3835.4 ±  141.2 | 100.1 ±  0.0 | 102.7 ±  0.1 |   2.1 |   2.0 |
1000  |  67.4 ± 3.5 |  37535.7 ± 2237.2 | 102.5 ±  1.9 | 127.1 ±  1.1 |  20.7 |  20.7 |
10000 | 183.4 ± 8.5 | 346048.6 ± 9818.2 | 194.9 ± 11.5 | 348.4 ± 11.4 | 152.8 | 152.8 |

A terrible thing, right? The annoying/unfair thing is that being a shit has no effect on contains1toK() performance. It scales linearly and seems oblivious to the havoc it's wreaking for everyone else. Note that the problem we see here was present in the run where no safepoint operations were invoked; there were just no safepoint operations to bring it out. On a long enough timeline a safepoint operation will come along, and around it we'll see the same issue.
How can we fix it, and at what cost? Let's explore some options.

Solution 1: I use Zing -> use -XX:+KeepSafepointsInCountedLoops

So Zing is super awesome, and has this option to keep safepoint polls in counted loops, which can come in handy in cases where one particular method is killing you softly with TTSP. Here are the results we get with Zing when running the benchmark above (same jvmArgs, same machine etc.):
K     |contains1(us) |   containsK(us)   | park(ms)     | park+FDT(ms) | Stp.ms|
1     |  31.9 ±  0.9 |    30.5 ±     1.0 | 100.0 ±  0.0 | 100.2 ±  0.0 |   0.0 |
10    |  32.9 ±  0.2 |   317.8 ±     8.8 | 100.0 ±  0.0 | 100.2 ±  0.0 |   0.0 |
100   |  34.1 ±  2.4 |  3337.8 ±   265.4 | 100.1 ±  0.0 | 100.8 ±  0.2 |   0.4 |
1000  |  44.4 ±  1.9 | 33991.6 ±  1199.0 | 100.9 ±  0.2 | 117.0 ±  0.6 |  10.2 |
10000 | 159.0 ± 12.1 |299839.3 ±  6297.2 | 156.3 ±  8.0 | 319.8 ± 30.5 | 113.4 |

Still bad, but Zing happens to squeeze better performance out of the containsNeedle() method, and while contains1() performance degrades it opens a wide gap with the behaviour observed above, probably because less time is lost to TTSP (contains1toK() finishes quicker). Also note that -XX:+PrintGCApplicationStoppedTime only prints out the stopped time for Zing. A separate command line argument is available which prints out safepoint statistics, but as this post grows longer I'm losing my will to rerun and parse stuff...
We add the JVM argument: "-XX:CompileCommand=option,*.contains1ToK,+KeepSafepointsInCountedLoops" and watch the world burn:
K     |contain1(us)|   containsK(us)   | park(ms)    | park+FDT(ms)| Stp.ms|
1     | 33.0 ± 0.6 |     40.1 ±    1.2 | 100.0 ± 0.0 | 100.2 ± 0.0 |   0.0 |
10    | 32.8 ± 0.7 |    405.1 ±   13.1 | 100.0 ± 0.0 | 100.2 ± 0.0 |   0.0 |
100   | 33.3 ± 0.7 |   4124.8 ±  104.4 | 100.0 ± 0.0 | 100.3 ± 0.0 |   0.1 |
1000  | 32.9 ± 0.2 |  40839.6 ±  854.0 | 100.0 ± 0.0 | 101.0 ± 0.3 |   0.6 |
10000 | 40.2 ± 2.9 | 374935.6 ± 7145.3 | 111.3 ± 3.7 | 100.2 ± 0.0 |   0.0 |

SHAZZZING!... This is pretty fucking awesome. The impact on contains1toK() is visible, but I didn't have to change the code and the TTSP issue is gone man, solid GONE. A look at the assembly verifies that the inner loop is still unrolled within the outer loop, with the addition of the requested safepoint polls. There are more safepoint polls than I'd like here, since the option is applied to both the outer method and the inlined method (to be improved upon sometime down the road...).

Solution 2: I use OpenJDK/Oracle, disable inner loop inlining

If the inner loop is in its own method, as in our example, or if we can perhaps modify the code such that it is, we can ask the compiler to not inline that method. This should work out nicely here. To force the JVM to not inline the containsNeedle() method only in contains1toK() I made a copy of containsNeedle() called containsNeedle2() and used this argument: "-XX:CompileCommand=dontinline,*.containsNeedle2". The results:
K     |contain1(us)|   containsK(us)   | park(ms)    | park+FDT(ms)|Stp.ms|TTSP.ms|
1     | 39.9 ± 1.0 |     40.0 ±    2.6 | 100.0 ± 0.0 | 100.3 ± 0.0 |  0.1 | 0.0   |
10    | 39.9 ± 0.7 |    395.6 ±   10.0 | 100.0 ± 0.0 | 100.3 ± 0.0 |  0.1 | 0.0   |
100   | 41.3 ± 3.9 |   4030.6 ±  405.0 | 100.0 ± 0.0 | 100.3 ± 0.0 |  0.1 | 0.0   |
1000  | 39.6 ± 2.1 |  34543.8 ± 1409.9 | 100.0 ± 0.0 | 100.3 ± 0.0 |  0.1 | 0.0   |
10000 | 39.1 ± 1.2 | 328380.6 ± 8478.1 | 100.0 ± 0.0 | 100.3 ± 0.0 |  0.1 | 0.0   |

That worked well! No harm to contains1(), no TTSP issue, and contains1toK() performance didn't suffer for it. Inlining is usually a key optimization you want to make sure happens, but in this case the right thing to do is stop it from happening. The work inside containsNeedle() is chunky enough to make this a good tradeoff. If the outer loop count were very high and the array rather small, this would have been a very different story.
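For completeness, the duplication trick described above looks something like this (again a sketch on top of the reconstructed methods from earlier):

// containsNeedle2() is a verbatim copy of containsNeedle(), called only from
// contains1toK(), so we can block its inlining without touching contains1().
// The non-inlined call returns through a safepoint poll on every outer iteration.
boolean containsNeedle2(int[] data, int needle) {
    for (int i = 0; i < data.length; i++) {
        if (data[i] == needle) {
            return true;
        }
    }
    return false;
}
// Run with: -XX:CompileCommand=dontinline,*.containsNeedle2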

Solution 3: Last resort, stop compiling the offending method

This is a bit terrible, but if things get bad AND you can't change the code, you can use a CompileCommand to exclude the method from compilation. The method will be forced to stay in the interpreter, and so will effectively have a safepoint poll between every 2 bytecodes. Performance will suck, but at least the application will be responsive. I am not gonna run this option, but if you want to, add the following argument to the command line and watch it roll: "-XX:CompileCommand=exclude,*.contains1toK"

Summary: Threads which suffer from large TTSP will cause other threads to wait for them in the presence of a global safepoint attempt.

Example 3: Safepoint Operation Cost Scale


For this example we replace the safepoint() method with gathering thread stacks using ManagementFactory.getThreadMXBean().dumpAllThreads(false, false). Gathering thread stacks also occurs at a safepoint, and the cost of this safepoint operation will scale with the number of threads. We construct a number of idle threads in the setup method to control the safepoint operation cost. To make sampling the thread stacks expensive we park the fuckers 256 frames deep.
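A minimal sketch of this setup (class and method names here are illustrative, not the actual benchmark code):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.locks.LockSupport;

public class ThreadDumpSafepointOperation {
    static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    // The safepoint operation under test: dumping all stacks stops the world,
    // and the time spent stopped grows with thread count and stack depth.
    static void dumpThreadStacks() {
        THREADS.dumpAllThreads(false, false); // no locked monitors/synchronizers
    }

    // Idle threads parked 256 frames deep, making each stack expensive to walk.
    static Thread startDeeplyParkedThread() {
        Thread t = new Thread(() -> parkDeep(256));
        t.setDaemon(true);
        t.start();
        return t;
    }

    static void parkDeep(int depth) {
        if (depth == 0) {
            LockSupport.park();
        } else {
            parkDeep(depth - 1);
        }
    }
}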
Different safepoint operations scale with different aspects of your process. Certain GC algorithms' stop-the-world phases will scale with heap size. Profiling by collecting thread dumps (as done by the JVisualVM sampler for instance) will scale with the number of threads (idle or not) and the stack depth. Deflating monitors will scale with the number of monitors in your application, and so on.
Here's what happens as we pile on the idle threads (running only contains1()):
IT   | contains1(us)|  park(ms)    |  park+DTS(ms) | Stp.ms|TTSP.ms|
0    |  38.0 ±  1.6 | 100.0 ±  0.0 |  100.5 ±  0.0 |   0.2 |   0.0 |
1    |  40.7 ±  1.4 | 100.1 ±  0.0 |  102.2 ±  0.0 |   0.6 |   0.0 |
10   |  44.7 ±  0.8 | 100.2 ±  0.1 |  117.2 ±  0.1 |   5.5 |   0.0 |
100  |  75.5 ±  0.8 | 122.6 ±  3.5 |  261.9 ±  2.3 |  61.3 |   0.1 |
1000 | 204.5 ± 38.5 | 257.5 ± 28.8 | 1414.2 ± 47.0 | 313.0 |   1.0 |

We get long stops and short TTSPs! Hoorah! You can play with the number of threads and the depth of the stack to get the effect you want. It is interesting that most profilers using JVMTI::Get*StackTrace opt for sampling all threads. This is risky to my mind since, as we can see above, the profiling overhead is open ended. It would seem reasonable to sample 1 to X threads (randomly picking the threads we sample) to keep the overhead in check on the one hand, and to pick a reasonable balance of TTSP vs operation cost on the other.
As mentioned previously, JVMTI::GetStackTrace does not cause a global safepoint on Zing. Sadly AFAIK no commercial profilers use this method, but if they did it would help. The JMX thread bean does expose a single thread profiling method, but this only delegates to the multi-thread stack trace method underneath, which does cause a global safepoint...
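To illustrate the 'sample 1 to X threads' idea, here's a sketch using the standard ThreadMXBean API (names are mine; note that on OpenJDK each call is still a global safepoint operation, so this bounds the operation cost rather than the TTSP):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.ThreadLocalRandom;

public class BoundedThreadSampler {
    static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    // Sample at most maxThreads randomly picked threads, at most maxDepth frames
    // each, capping the cost of the safepoint operation per profiling tick.
    static ThreadInfo[] sampleSome(int maxThreads, int maxDepth) {
        long[] all = THREADS.getAllThreadIds();
        int n = Math.min(maxThreads, all.length);
        long[] picked = new long[n];
        for (int i = 0; i < n; i++) {
            // may pick duplicates, which is fine for a sampling sketch
            picked[i] = all[ThreadLocalRandom.current().nextInt(all.length)];
        }
        return THREADS.getThreadInfo(picked, maxDepth);
    }
}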

Summary: This final step adds another interesting control for us to play with, completing our set:
  • Safepoint poll interval -> Will drive time to safepoint. CPU scarcity, swapping and page faults are other common reasons for large TTSP.
  • Safepoint operation interval -> Sometimes in your control (be careful how you profile/monitor your application), sometimes not.
  • Safepoint operation cost -> Will drive time in safepoint. Will depend on the operation and your application specifics. Study your GC logs.

Example 4: Adding Safepoint Polls Stops Optimization

This is a repeat of a benchmark I ran to demonstrate the side effect that safepoint polls in loops have on loop optimizations, in the context of changing one of the Netty APIs. This post is so long by now that if you had the strength to read this far you can probably click through to the ticket. I'll quote the basic results here for completeness.
The benchmark:
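(The original embedded snippet isn't reproduced here; what follows is an illustrative sketch of the kind of int vs long indexed loops being compared, not the actual Netty benchmark.)

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IntVsLongIndexLoops {
    @Param("1000")
    int size;
    byte[] bytes;

    @Setup
    public void setup() {
        bytes = new byte[size];
    }

    @Benchmark
    public int sumInt() {
        int sum = 0;
        for (int i = 0; i < bytes.length; i++) { // int-counted loop: unrolled, poll removed
            sum += bytes[i];
        }
        return sum;
    }

    @Benchmark
    public int sumLong() {
        int sum = 0;
        for (long i = 0; i < bytes.length; i++) { // long induction variable: treated as uncounted, keeps its poll
            sum += bytes[(int) i];
        }
        return sum;
    }
}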
We are debating replacing the length and DataSet index types from int to long. The impact on usage inside loops is substantial, as demonstrated by the following results:
Benchmark   (size)  Mode  Cnt    Score     Error  Units
copyInt       1000  avgt    5   76.943 ±  14.256  ns/op
copyLong      1000  avgt    5  796.583 ±  55.583  ns/op
equalsInt     1000  avgt    5  367.192 ±  16.398  ns/op
equalsLong    1000  avgt    5  806.421 ± 196.106  ns/op
fillInt       1000  avgt    5   84.075 ±  19.033  ns/op
fillLong      1000  avgt    5  567.866 ±  10.154  ns/op
sumInt        1000  avgt    5  338.204 ±  44.529  ns/op
sumLong       1000  avgt    5  585.657 ± 105.808  ns/op
There are several optimizations negated by the transition from a counted to an uncounted loop (loops limited by a long induction variable are currently treated as uncounted, and therefore keep their safepoint poll):
  • Loop unrolling: this is not really a safepoint issue; it can be fixed as per the KeepSafepointsInCountedLoops option.
  • Fill pattern replacement (-XX:+OptimizeFill): the current fill implementation has no safepoints and the loop pattern it recognizes does not allow one.
  • Superword optimizations (-XX:+UseSuperWord): the current superword implementation doesn't handle long-indexed loops. This will probably require the compiler to construct an outer loop with an inner superword loop.
The latter 2 have been discussed on this blog before in some detail.

Final Summary And Testament

If you've read this far, you deserve a summary, or a beer, or something. WELL DONE YOU!!! Here's what I consider the main takeaways from this discussion/topic:
  • Safepoint/Safepoint poll/Safepoint operation are 3 distinct terms, often mingled. Separating them allows us to consider their costs and impacts individually.
  • In most discussions the term Safepoint is used to describe a global safepoint. Some JVMs can control safepoint transition on the thread level. By considering the individual thread state we can more easily reason about the transition costs on a thread by thread level. 
  • The failure of any thread to reach a safepoint (by hitting a safepoint poll) can cripple the whole JVM. In this sense most any bit of Java code can be considered blocking (as in not lock free). Consider for example a CAS loop: it's not a counted loop, so it will have a safepoint poll in it, and the executing thread can be blocked at that poll indefinitely if another Java thread is suspended while not at a safepoint (see the sketch after this list).
  • JVMs have many safepoint operations. If you are worried about their impact you should enable -XX:+PrintGCApplicationStoppedTime. Further detail on what types of safepoint operations were involved can be made available using -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=X. Try it out, it's educational.
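To make the CAS loop point concrete, here's a minimal sketch (not code from the post):

import java.util.concurrent.atomic.AtomicLong;

// An uncounted retry loop, so the JIT keeps a safepoint poll on its back-edge.
// If a global safepoint is pending, the executing thread stops at that poll like
// any other Java code, for as long as the safepoint operation (and the slowest
// thread's TTSP) takes.
static long casAddAndGet(AtomicLong counter, long delta) {
    long current;
    long next;
    do {
        current = counter.get();
        next = current + delta;
    } while (!counter.compareAndSet(current, next));
    return next;
}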

References

  • From the OpenJDK Runtime Overview look for the "VM Operations and Safepoints" section. Note that the term safepoint is used here to mean a global safepoint.
  • J.Bempel's post covering many safepoint operations and observing the impact of safepoints.
  • Alexey Ragozin's post also covers similar definition to the Runtime Overview and explores the safepoint poll mechanism and logging options.
  • Gil Tene's post on the Mechanical Sympathy mailing list which closely matches the "Thread State" approach above. Gil goes into the topic in the context of Unsafe operations and assumptions surrounding safepoints.

Monday, 19 October 2015

Expanding The Queue interface: Relaxed Queue Access


Continuing from the previous post on expanding the Queue interface to support new ways of interacting with queues, I have gone ahead and implemented relaxedOffer/Poll/Peek for the JCTools queues. This was pretty easy, as the original algorithms all required a bit of mangling to support the strong semantics, and relaxing them made life easier. Implementing the batch/continuous interactions was a bit more involved, but the results were interestingly rewarding. I will save discussing the drain/fill results for a follow up post.
The new interface is here, feedback is welcome.
TL;DR:

  • Throughput is improved in all cases (~5-15% increase)
  • Cost of failure (when offer/poll fail to return/add a value) is dramatically reduced
  • Latency for bursts is improved (as much as 40% reduction)

Relaxed Access API

This ended up being a very small API change, with most of the difficulty being the choice of name. I went with the 'relaxed' prefix as suggested by J.P. Bempel (as opposed to the 'weak' prefix considered earlier):
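The embedded interface snippet isn't reproduced here; the shape of the addition is roughly as follows (a sketch, see the linked interface for the authoritative version):

// Sketch of the relaxed additions to the JCTools message passing queue API.
public interface MessagePassingQueue<T> {
    // Strong methods: a false/null return means the queue really was
    // full/empty at some point during the call.
    boolean offer(T e);
    T poll();
    T peek();

    // Relaxed methods: allowed to report failure when the queue merely appears
    // full/empty to the calling thread, trading strict semantics for fewer
    // loads and no waiting on delayed visibility.
    boolean relaxedOffer(T e);
    T relaxedPoll();
    T relaxedPeek();
}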

Relaxed vs. Non-Relaxed Example

The 'relaxation' of the queue full/empty reporting in these methods allows us to remove an extra load of the 'opposite' index and avoid waiting around for delayed visibility. Here's how MPSCArrayQueue::poll was changed to become 'relaxed':
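The original snippet isn't reproduced here; the gist of it is roughly this (a paraphrased sketch using the JCTools-style load/store method names — lv = load volatile, sp = store plain, so = store ordered — not the verbatim code):

// Paraphrased sketch of the strong MPSC poll:
public E poll() {
    final long cIndex = consumerIndex;
    final long offset = calcElementOffset(cIndex);
    E e = lvElement(buffer, offset);          // volatile load of the element
    if (e == null) {
        // Strong contract: null may only mean 'empty'. The producer may have
        // bumped producerIndex before its element store became visible, so we
        // must load the 'opposite' index and wait the visibility gap out.
        if (cIndex != lvProducerIndex()) {    // extra load of the producer index
            do {
                e = lvElement(buffer, offset);
            } while (e == null);              // spin until the element is visible
        } else {
            return null;                      // genuinely empty
        }
    }
    spElement(buffer, offset, null);
    soConsumerIndex(cIndex + 1);
    return e;
}

relaxedPoll simply treats 'element not visible yet' the same as 'empty': the producerIndex check and spin-wait block disappears, and a null element just returns null.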

So now we can drop the producerIndex check and spin (lines 11-15 of the original snippet); it doesn't look like much. We wouldn't really expect this gap in offer (see offer for details) to happen very often, so the overhead typically amounts to an extra load of the producerIndex and a predictable branch.
Also, since poll on an MPSC is actually cheaper than the offer anyhow, aren't we optimizing a non-issue? Could we not be actually making throughput worse by making the consumer more aggressive?

Throughput Impact Of Relaxation

To test the impact of this change I constructed a JMH benchmark that measures throughput when both consumer and producer spin to offer/poll as fast as they can (see the code here). I ran the benchmark on a Haswell machine (i7-4770 CPU @ 3.40GHz, Ubuntu, Oracle JVM 8u60). The benchmarks were pinned to run cross core (no 2 threads share a physical core, i.e. no hyperthreading), and were run with 5 forks, 10 warmup iterations (1 second each) and 10 measured iterations (1 second each). The score reports operations per microsecond, or millions of operations per second if you prefer. The throughput is the number of pollsMade, but the other figures are interesting as well.
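The benchmark structure is roughly as follows (a simplified sketch of the linked code, using JMH thread groups and aux counters; not the exact JCTools benchmark):

import java.util.concurrent.TimeUnit;
import org.jctools.queues.MpscArrayQueue;
import org.openjdk.jmh.annotations.*;

@State(Scope.Group)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class QueueThroughputSketch {
    static final Integer ELEMENT = 1;
    MpscArrayQueue<Integer> queue;

    @Setup
    public void setup() {
        queue = new MpscArrayQueue<>(128 * 1024);
    }

    @AuxCounters
    @State(Scope.Thread)
    public static class OfferCounters {
        public long offersMade, offersFailed;
    }

    @AuxCounters
    @State(Scope.Thread)
    public static class PollCounters {
        public long pollsMade, pollsFailed;
    }

    @Benchmark
    @Group("tpt")
    @GroupThreads(1) // thread distribution can be overridden with -tg
    public void offer(OfferCounters c) {
        if (queue.offer(ELEMENT)) {
            c.offersMade++;
        } else {
            c.offersFailed++;
        }
    }

    @Benchmark
    @Group("tpt")
    @GroupThreads(1)
    public void poll(PollCounters c) {
        if (queue.poll() == null) {
            c.pollsFailed++;
        } else {
            c.pollsMade++;
        }
    }
}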
These results are with a single producer and consumer:
 Type |     1P1C     | Normal± Err  | Relax ± Err  |
 Spsc | offersFailed |   0.0 ±  0.1 |   0.1 ±  0.1 |
 Spsc | offersMade   | 439.1 ±  4.1 | 432.1 ± 10.0 |
 Spsc | pollsFailed  |   0.2 ±  0.5 |   0.1 ±  0.2 |
 Spsc | pollsMade    | 438.9 ±  4.1 | 431.9 ± 10.0 |
 Mpsc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.0 |
 Mpsc | offersMade   |  22.7 ±  0.6 |  31.9 ±  0.8 |
 Mpsc | pollsFailed  |   0.1 ±  0.1 |   6.6 ±  0.9 |
 Mpsc | pollsMade    |  22.8 ±  0.6 |  31.9 ±  0.8 |
 Spmc | offersFailed |   0.2 ±  0.2 |  81.0 ±  5.4 |
 Spmc | offersMade   |  23.5 ±  0.6 |  26.8 ±  0.6 |
 Spmc | pollsFailed  |   0.1 ±  0.1 |   0.2 ±  0.3 |
 Spmc | pollsMade    |  23.2 ±  0.6 |  26.6 ±  0.6 |
 Mpmc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.1 |
 Mpmc | offersMade   |  71.4 ±  5.1 |  71.7 ±  4.8 |
 Mpmc | pollsFailed  |   0.0 ±  0.1 |   0.3 ±  0.9 |
 Mpmc | pollsMade    |  71.4 ±  5.1 |  71.7 ±  4.8 |

  • The SPSC case serves as a baseline, and I expected results to be the same. There's a 2% difference which is close to the error reported. Maybe the inlining depth impacted the performance slightly, or perhaps the benchmark is not as stable as I'd like.
  • The MPSC case shows a ~40% improvement (22.7 to 31.9 ops/µs). This is pretty awesome. We also notice the consumer now fails more often, which is indicative of the reduced cost of failing to find elements in the queue. The cache miss on the producerIndex was slowing down both the consumer and the producer. This phenomenon has been observed in other posts, but my best attempt at an explanation is here. The bottom line is that the mutator suffers from losing the exclusive state of the cache line in its own cache, while the reader gets hit with a read miss (going to LLC as the threads are on separate cores).
  • The SPMC case shows 15% improvement on throughput. The cost of failing to offer has gone down dramatically.
  • The MPMC case shows no change, but does show great throughput. This could be down to the very different algorithm in play, or just because the MPMC queue is well balanced and thus behaves very well in a balanced use case like the one above. This may lead you to believe MPMC is a better choice than MPSC, but as the latency results and contended throughput results show, this is an anomaly of the benchmark load. Because MPSC presents a much cheaper poll than offer, the queue is always empty, making the contention on the queue buffer worse than it would be for MPMC, where offer and poll have similar costs.
Using the command line control over thread group sizes (-tg) I ran a few more cases. Did I mention I love JMH?
Here's MPMC and MPSC with 3 producers and 1 consumer:
 Type |     3P1C     | Normal± Err  | Relax ± Err  |
 Mpsc | offersFailed |   0.0 ±  0.0 |   0.1 ±  0.6 | 
 Mpsc | offersMade   |  12.7 ±  0.1 |  13.4 ±  0.1 | 
 Mpsc | pollsFailed  |   4.1 ±  0.5 | 115.3 ±  4.9 | 
 Mpsc | pollsMade    |  12.7 ±  0.1 |  13.4 ±  0.1 | 
 Mpmc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.0 | 
 Mpmc | offersMade   |  11.1 ±  0.0 |  11.7 ±  0.1 | 
 Mpmc | pollsFailed  |   0.3 ±  0.1 |  60.4 ± 10.0 | 
 Mpmc | pollsMade    |  11.1 ±  0.0 |  11.7 ±  0.1 |
  • MPSC case shows slight improvement to throughput of roughly 5%, but the improvement to cost of failure is very visible here.
  • MPMC case improves by 5% and the improved cost of failure is similarly visible.
Here's MPMC and SPMC with 1 producer and 3 consumers:
 Type |     1P3C     | Normal± Err  | Relax ± Err  |
 Spmc | offersFailed |   1.5 ±  0.2 | 125.9 ±  4.6 |
 Spmc | offersMade   |  13.4 ±  0.1 |  14.0 ±  0.0 |
 Spmc | pollsFailed  |   0.0 ±  0.0 |   0.0 ±  0.0 |
 Spmc | pollsMade    |  13.1 ±  0.1 |  13.7 ±  0.0 |
 Mpmc | offersFailed |  23.9 ±  0.8 | 124.1 ±  3.2 |
 Mpmc | offersMade   |  11.0 ±  0.0 |  12.1 ±  0.0 |
 Mpmc | pollsFailed  |   0.0 ±  0.2 |   1.0 ±  1.6 |
 Mpmc | pollsMade    |  10.7 ±  0.0 |  11.9 ±  0.0 |
  • SPMC case shows a 5% improvement and a great improvement to failure performance.
  • MPMC case improves by 11% and the improved cost of failure is similarly visible.
Here's MPMC only, with 2 producers and 2 consumers:
 Type |    2P2C      | Normal± Err  | Relax ± Err  |
 Mpmc | offersFailed |   1.4 ±  0.8 |   0.7 ±  0.6 |
 Mpmc | offersMade   |  14.8 ±  0.8 |  16.0 ±  0.6 |
 Mpmc | pollsFailed  |   0.0 ±  0.0 |   0.4 ±  0.5 |
 Mpmc | pollsMade    |  14.7 ±  0.8 |  15.9 ±  0.6 |
  • The improvement to throughput is a nice 8%, but the error is not that far from it so I'm not too confident without lots more runs. This is a similarly symmetric load to the 1P1C and the balance is showing through.
I could go on to higher core counts etc, but I'm actually feeling the win here is pretty conclusive:
  • Throughput is generally improved
  • Failure to make progress is significantly cheaper
Why is this failure mode important? It matters when you have a consumer/producer which is able to make progress in the face of such a failure. A common example of this pattern is a Selector thread which also has an incoming queue of commands to execute: when there's nothing in the queue we'd like to go straight back to selecting on the selector (see the sketch below).
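For instance, a sketch of such an event loop (illustrative, not code from any particular project):

import java.io.IOException;
import java.nio.channels.Selector;
import org.jctools.queues.MpscArrayQueue;

// An event loop that drains a command queue with relaxedPoll() and falls back
// to the selector when the queue looks empty. Failure to poll is the common
// case here, so failing cheaply matters.
public class SelectorEventLoop implements Runnable {
    final MpscArrayQueue<Runnable> commands = new MpscArrayQueue<>(1024);
    final Selector selector;

    public SelectorEventLoop(Selector selector) {
        this.selector = selector;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Runnable command;
                while ((command = commands.relaxedPoll()) != null) {
                    command.run();       // run any pending commands first
                }
                selector.select(100L);   // nothing queued: go back to selecting
                // ... handle selected keys ...
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}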


Latency impact of Relaxation

Is this going to help us with burst handling latency? There's a benchmark for that too! Or rather, there's an estimate of burst processing cost. Code is here. The general idea is to send a burst of messages from the producer, then wait for a completion notice from the consumer.
Here goes:
 Size | Type | Normal±Err | Relax± Err |
    1 | Spsc |  126 ±   2 |  126 ±   1 |
    1 | Mpsc |  144 ±   2 |  140 ±   1 |
    1 | Spmc |  187 ±   2 |  189 ±   2 |
    1 | Mpmc |  190 ±   8 |  203 ±   2 |
   10 | Spsc |  193 ±   2 |  192 ±   3 |
   10 | Mpsc |  611 ±  22 |  429 ±   7 |
   10 | Spmc |  309 ±   3 |  307 ±   1 |
   10 | Mpmc |  807 ±  19 |  677 ±  15 |
  100 | Spsc |  530 ±   7 |  528 ±   4 |
  100 | Mpsc | 5243 ± 158 | 3023 ± 107 |
  100 | Spmc | 1240 ±   7 | 1250 ±   6 |
  100 | Mpmc | 4684 ± 382 | 3665 ± 111 |

  • Same as before, SPSC serves as a sanity check/baseline. No change is good.
  • SPMC/MPSC results are pretty similar when sending a single message. MPMC seems slightly regressed.
  • When sending 10/100 messages in a burst we see significant benefits to using the relaxed methods, in particular for the MPSC and MPMC cases.


Summary

Relaxing is good for you, deep down you always knew it. Turns out it ain't half bad for your queues either :-). The new API will be part of the next version of JCTools, coming out as soon as I can work through my Maven mojo phobia.

Thanks to J.P. Bempel and Phil Aston for their kind review; remaining errors are a result of me refusing to listen to them persnickety bastards.