Wednesday, 24 February 2016

Why (Most) Sampling Java Profilers Are Fucking Terrible

This post builds on the basis of a previous post on safepoints. If you've not read it you might feel lost and confused. If you have read it, and still feel lost and confused, and you are certain this feeling is related to the matter at hand (as opposed to an existential crisis), please ask away.
So, now that we've established what safepoints are, and that:
  1. Safepoint polls are dispersed at fairly arbitrary points (depending on execution mode, mostly at uncounted loop back edge or method return/entry).
  2. Bringing the JVM to a global safepoint is high cost
We have all the information we need to conclude that profiling by sampling at a safepoint is perhaps a bit shite. This will not come as a surprise to some, but this issue is present in the most commonly used profilers. According to this survey by RebelLabs this is the breakdown:

VisualVM, NB Profiler(same thing), YourKit and JProfiler all provide a sampling CPU profiler which samples at a safepoint. Seeing how this is a rather common issue, lets dig into it.

How Sampling Execution Profilers Work (in theory)

Sampling profilers are supposed to approximate the 'time spent' distribution in our application by collecting samples of where our application is at different points in time. The data collected at each sample could be:
  • current instruction
  • current line of code
  • current method
  • current stack trace
The data can be collected for a single thread or all threads at each sample. What do we need to hold for sampling to work?
"However, for sampling to produce results that are comparable to a full (unsampled) profile, the following two conditions must hold. First, we must have a large number of samples to get statistically significant results. For example, if a profiler collects only a single sample in the entire program run, the profiler will assign 100% of the program execution time to the code in which it took its sample and 0% to everything else. [...]
Second, the profiler should sample all points in a program run with equal probability. If a profiler does not do so, it will end up with bias in its profile. For example, let’s suppose our profiler can only sample methods that contain calls. This profiler will attribute no execution time to methods that do not contain calls even though they may account for much of the program’s execution time." - from Evaluating the Accuracy of Java Profilers, we'll get back to this article in a bit
That sounds easy enough, right?
Once we have lots of samples we can construct a list of hot methods, or even code lines in those methods (if the samples report it), we can look at the samples distributed on the call tree (if call traces are collected) and have an awesome time!

How Do Generic Commercial Java Sampling Execution Profilers Work?

Well, I can sort of reverse engineer here from different solutions, or read through open source code bases, but instead I'll offer unsupported conjecture and you are free to call me out on it if you know better. Generic profilers rely on the JVMTI spec, which all JVMs must meet:
  • JVMTI offers only safepoint sampling stack trace collection options (GetStackTrace for the calling thread doesn't require a safepoint, but that is not very useful to a profiler. On Zing  GetStackTrace to another thread will bring only that thread to a safepoint.). It follows that vendors who want their tools to work on ALL JVMs are limited to safepoint sampling.
  • You hit a global safepoint whether you are sampling a single thread or all threads (at least on OpenJDK, Zing is slightly different but as a profiler vendor OpenJDK is your assumption.). All profilers I looked into go for sampling all threads. AFAIK they also do not limit the depth of the stack collected. This amounts to the following JVMTI call: JvmtiEnv::GetAllStackTraces(0, &stack_info, &thread_count)
  • So this adds up to: setup a timer thread which triggers at 'sampling_interval' and gathers all stack traces.
This is bad for several reasons, some of which are avoidable:
  1. Sampling profilers need samples, so it is common to set sampling frequency is quite high (usually 10 times a second, or every 100ms). It's instructive to set the -XX:+PrintGCApplicationStoppedTime and see what sort of pause time this introduces. It's not unusual to see a few milliseconds pause, but YMMV(depending on number of threads, stack depth, TTSP etc). A 5ms pause every 100ms will mean a 5% overhead (actual damage is likely worse than that) introduced by your profiler. You can usually control the damage here by setting the interval longer, but this also means you will need a longer profiling period to get a meaningful sample count.
  2. Gathering full stack traces from all the threads means your safepoint operation cost is open ended. The more threads your application has (think application server, SEDA architecture, lots of thread pools...), and the deeper your stack traces (think Spring and Co.) the longer your application will wait for a single thread to go round taking names and filling forms. This was clearly demonstrated in the previous post. AFAIK, current profilers do nothing to help you here. If you are building your own profiler it would seem sensible to set a limit on either quantities so that you can box your overheads. The JVMTI functionality allows you to query the list of current threads, you could sample all if there's less than a 100 and otherwise pick a random subset of 100 to sample. It would make sense to perhaps bias towards sampling threads that are actually doing something as opposed to threads which spend all their time blocked.
  3. As if all that was not bad enough, sampling at a safepoint is a tad meaningless.
Points 1 and 2 are about profiling overheads, which is basically about cost. In my previous post on safepoints I looked at these costs, so there's no point repeating the exercise.  Cost may be acceptable for good profiling information, but as we'll see the information is not that great.
Point 3 bears explaining, so off we go in the search for meaning.

Safepoint Sampling: Theory

So what does sampling at a safepoint mean? It means only the safepoint polls in the running code are visible. Given hot code is likely compiled by C1/C2 (client/server compilers) we have reduced our sampling opportunities to method exit and uncounted loop backedges. This leads to the phenomena called safepoint bias whereby the sampling profiler samples are biased towards the next available safepoint poll location (this fails the second criteria set out above "the profiler should sample all points in a program run with equal probability").
This may not sound so bad at first, so lets work through a simple example and see which line gets the blame.
NOTE: In all of the following examples I will be using JMH as the test harness and make use of the 'CompilerControl' annotation to prevent inlining. This will let me control the compilation unit limits, and may seem cruel and unusual, or at least unfair, of me. Inlining decisions in the 'wild' are governed by many factors and it is safe (in my opinion) to consider them arbitrary (in the hands of several compilers/JVM vendors/command line arguments etc.). Inlinining may well be the "mother of all optimizations", but it is a fickle and wily mother at that.
Let's look at something simple:

This is an easy example to think about. We can control the amount of work in the method by changing the size of the array. We know the counted loop has no safepoint poll in it (verified by looking at the assembly output), so in theory the methods above will have a safepoint at method exit. The thing is, if we let the method above get inlined the end of method safepoint poll will disappear, and the next poll will be in the measurement loop:

So, it would seem reasonable to expect the method to get blamed if it is not inlined, but if it does get inlined we can expect the measurement method to get blamed. Right? very reasonable, but a bit off.

Safepoint Sampling: Reality

Safepoint bias is discussed in this nice paper from 2010: Evaluating the Accuracy of Java Profilers, where the writers recognize that different Java profilers identify different hotspots in the same benchmarks and go digging for reasons. What they don't do is set up some benchmarks where the hotspot is known and use those to understand what it is that safepoint biased profilers see. They state:
"If we knew the “correct” profile for a program run, we could evaluate the profiler with respect to this correct profile. Unfortunately, there is no “correct” profile most of the time and thus we cannot definitively determine if a profiler is producing correct results."
So, if we construct a known workload... what do these profilers see?
We'll study that with the JMH safepoint biased profiler "-prof stack". It is much like JVisualVM in the profiles it presents for the same code, and a damn sight more convenient for this study.
NOTE: In the following sections I'm using the term sub-method to describe a method which is being called into from another method. E.g. if method A calls method B, B is a sub method of A. Maybe a better terminology exists, but that's what I mean here.
If we run the samples above we get 2 different hot lines of code (run with '-prof stack:detailLine=true'):
# Benchmark: safepoint.profiling.SafepointProfiling.meSoHotInline
....[Thread state: RUNNABLE]...
 99.6%  99.8% meSoHotInline_avgt_jmhStub:165

# Benchmark: safepoint.profiling.SafepointProfiling.meSoHotNoInline
....[Thread state: RUNNABLE]...
 99.4%  99.6% meSoHotNoInline_avgt_jmhStub:163

Neither one in the actually hot method. It seems that the method exit safepoint is not deemed indicative of it's own method, but rather of the line of code from which it was called. So forcing the method under measurement to not inline implicated the calling line of code in the measurement loop, while leaving it to inline meant the back edge of the loop got the blame. It also appears that an uncounted loop safepoint poll is deemed indicative of it's own method.
We may deduce(but we won't necessarily be correct) that when looking at this kind of profile without line of code data a hot method is indicative of:
  1. Some non-inlined sub method is hot*
  2. Some code (own method? inlined sub method? non-inlined sub method?) in an uncounted loop is hot*
Having line of code data can help disambiguate the above cases, but it's not really very useful as line of code data. A hot line of code would be indicative of:
  1. Line has a method call: A method called from this line is hot (or maybe it's inlined sub-methods are)*
  2. Line is a loop back edge: Some code (include inlined submethods) in this loop is hot*
  *Does that seem useful? Don't get your hopes up.
Because we usually have no idea which methods got inlined this can be a bit confusing (you can use -XX:+PrintInlining if you want to find out, but be aware that inlining decisions can change from run to run).

Mind The Gap

If the above rules held true you could use a safepoint biased profile by examining the code under the blamed node in the execution tree. In other words, it would mean a hot method indicates the hot code lies somewhere in that code or in the method it calls. This would be good to know, and the profile could serve as a starting point for some productive digging. But sadly, these rules do not always hold true. They are missing the fact that the hot code can be anywhere between the indicated safepoint poll and the previous one. Consider the following example:

Clearly, the time is spent in the loop before calling setResult, but the profile blames setResult. There's nothing wrong with setResult except that a method it calls into is not inlined, providing our profiler with a blame opportunity. This demonstrates the randomness with which the safepoint poll opportunities present themselves to  user code, and shows that the hot code may be anywhere between the current safepoint poll and the previous one. This means that a hot method/line-of-code in a safepoint biased profile are potentially misleading without some idea of where the previous safepoint poll is. Consider the following example:

The profiler implicates the caller to a cheap method 9 levels down the stack, but the real culprit is the loop at the topmost method. Note that inlining will prevent a method from showing, but not-inlined frames only break the gap between safepoints on return (on OpenJDK anyhow. This is vendor and whim dependent. e.g. Zing puts the method level safepoint on entry for instance and I'm not sure where J9 stands on this issue. This is not to say that one way is somehow better than the other, just that the location is arbitrary). This is why setResult6 which is not inlined and higher up the stack doesn't show up.

Summary: What Is It Good For?

As demonstrated above, a safepoint sampling profiler can have a wildly inaccurate idea of where the hot code in your application is. This renders derived observations on "Running" threads pretty suspect, but they are at least correct observations on which threads are running. This doesn't mean they are entirely useless and sometimes all we need is a hint in the right direction for some good analysis to happen, but there's a potential for huge waste of time here. While samples of code running in the interpreter do not suffer safepoint bias, this is not very useful as hot code is quickly compiled. If your hot code is running in the interpreter you have bigger fish to fry than safepoint bias...
The stack traces for blocked threads are accurate, and so the "Waiting" profile is useful to discover blocked code profile. If blocking methods are at the root of your performance issue than this will be a handy observation.
There are better options out there! I'll get into some of them in following posts:
- Java Mission Control
- Solaris Studio
- Honest-Profiler
- Perf + perf-map-agent (or perfasm if your workload is wrapped in a JMH benchmark)
No tool is perfect, but all of the above are a damn sight better at identifying where CPU time is being spent.
Big thanks to the many reviewers, if you find any issues please comment and I will punish them severely.

UPDATE: Read follow up post of Honest-Profiler/Java Flight Recorder type profilers in this next post - The Pros And Cons of AsyncGetCallTrace Profilers


  1. Too much of "fuckings" in public posts

    1. A recent poll shows that most people enjoy cursing in blog posts:
      But still:
      "If we shadows have offended,
       Think but this, and all is mended—
       That you have but slumbered here
       While these visions did appear.
       And this weak and idle theme,
       No more yielding but a dream,
       Gentles, do not reprehend."

    2. Following in the footsteps of one of my favorite development blogposts of all time ..

      Peter Booth

  2. Nitsan, one small addition. You mentioned that those open source and commercial tools all ship sampling profilers. It should be noted that the also ship instrumenting profilers as alternative. While I think that this is the even worse alternative, I think it is worth mentioning.

    1. They do indeed, and those profilers are sort of off topic, so I left them out. They have their uses and limitations... there's only so much I can cover in a post...

  3. Thanks for the post. Hope you include JFR and -XX:+DebugNonSafepoints in the followup post you mention.

    1. Next post on this topic will look at Honest-Profiler and JFR. Both require -XX:+DebugNonSafepoints to get better profiles. I'm not planning to look at JMC beyond the code profiling aspect, so from this limited perspective HP and JFR give very similar data. The data is better than JVisualVM etc, but has its issues as well.

    2. JFR is lovely. Alas it doesn't work with Zing.

    3. JFR only works on Oracle JVM, and only from 7u60.
      It doesn't work for Zing, or OpenJDK, IceTea, J9 or anything else.
      It originates from the now abandoned JRockit JVM, where it also worked.
      Zing supports AsyncGetCallTrace, so honest-profiler works, and we're trying to make it Solaris Studio compatible. It also comes with it's own profiler ZVision which is non-safepoint sampling and instruction level profiling (but is rather ugly to look at... on the other hand, if you are looking at assembly, you must have built some tolerance by now).

  4. Brendan Gregg did some work on the OpenJDK JVM so you could use the Linux "perf" tool to get real stack traces out of the interpreter. Combined with the java-perf-map agent you can see some non-biased, sampled stack traces.

    1. This is definitely a good option, let me just expand further on it (I will write a post on at some point):

      Brendan contributed a POC patch which (after the OpenJDK guys worked it into full production code) adds up to what is now available as -XX:+PerserveFramePointer. This allows perf to walk the stack. Using that in conjunction with perf-map-agent allows us to get usable call traces out of Java (on Linux+JDK8u60 and up).

      Brendan was instrumental to the effort, but others deserve good chunks of the credit too (and Brendan is very good about pointing this out himself).

      Using perf+PMA+PFP gives you the unbiased stacks, but the data you get is limited in 2 respects:

      1. No interpreter frames
      2. No inlined frames

      You can 'unfold' inlined frames (use latest PMA code, there was a bug there) but AFAIK this needs some FlameGraph lovin' to display well and to get sensible data you should enable -XX:+DebugNonSafepoint. Once you unfold you have other issues to worry about such as skid etc. Nothing is perfect :-).