Wednesday, 11 March 2015

Correcting YCSB's Coordinated Omission problem

YCSB is the Yahoo Cloud Serving Benchmark(also on wiki): a generic set of benchmarks setting out
The Nimbus Cloud Serving Board
to compare different key-value store providers under a set of loads:
The goal of the Yahoo Cloud Serving Benchmark (YCSB) project is to develop a framework and common set of workloads for evaluating the performance of different "key-value" and "cloud" serving stores.
The code is open for extension and contribution and all that good stuff, you can get it here. And it has become tool for comparing vendors in the NoSQL space. The benchmarks set out to measure latency and throughput. The terms are not directly defined in the paper, but the following statement is made:
The Performance tier of the benchmark focuses on the latency of requests when the database is under load. Latency is very important in serving systems, since there is usually an impatient human waiting for a web page to load. [...] Typically application designers must decide on an acceptable latency, and provision enough servers to achieve the desired throughput while preserving acceptable latency. [...] The YCSB Client allows the user to define the offered throughput as a command line parameter, and reports the resulting latency, making it straightforward to produce latency versus throughput curves.
What could possibly go wrong?™
It can go this wrong for instance, order of magnitude difference in results for different percentiles, leading to some poor decision making on how much hardware you'll need, leading to getting fired from your job and growing old bitter and twisted mumbling to yourself as you get drunk on the street corner until you freeze to death on a winter night. So potentially this is a risk to your future well being, listen up!


It's broken? Coordinated WOT?

When you measure latency bad,
Mr. Tene is sad
 :(
My colleague at Azul, Gil Tene, the magnificent, glorious, multi-dimensional, coding CTO, officer and gentleman (that's my bonus sorted) has been doing a lot of latency related preaching and teaching in the last few years. He has given the following talks at any number of conferences, but if you happened to have missed them, watch them NOW:
In particular he has coined the term "Coordinated Omission" (see raging discussion on Mechanical Sympathy) to describe instances in which the measuring/monitoring system coordinates measurement with the system under test/measurement such that samples are biased. This issue manifests in many load generating frameworks where the call into the system under test is done synchronously and the measurement thread holds up the next sample while the call is ongoing. This enables the system under test to delay requests that would have been made during the synchronous call thus skewing the sample set. Consider for example a system where:
  • We set out to measure a request every 1ms (from a single thread, synchronously)
  • The first 500 calls come back in 100µs each (so call K starts at Kms and returns at Kms + 100µs )
  • Call 501 takes 500 milliseconds (starts at 500ms, returns at 1 second)
  • Call 502 takes 100µs
See the problem?
The problem is that call 502 did NOT happen at it's designated time, and saying it took 100µs fails to capture this. It failed the assumptions laid out in the first sentence because we were blocked for 500ms. If we were to stick to our original schedule we would be making calls 502 to 1000 in the time it took for call 501 to execute. How should we treat this departure from plan?
Is it safe to come out yet?
  1. Ignore it! it will go away by itself! - This is the coordinated omission way. We are now reporting numbers that are no longer according to the test plan, which means that our "latency versus throughput curves" are off the mark. This is a very common solution to this issue.
  2. Fail the test, we wanted a call every ms and we didn't get that - This is an honest hardline answer, but it potentially throws the baby with the bath water. I think that if you set out to schedule 1000 calls per second you might want to see how often this falls apart and how. But this answer is The Truth™, can you handle it? If one is to start from scratch and write their own load generator I propose a read of the Iago load test framework philosophy page: "Iago accurately replicates production traffic. It models open systems, systems which receive requests independent of their ability to service them. Typical load generators measure the time it takes for M threads to make N requests, waiting for a response to each request before sending the next; if your system slows down under load, these load testers thus mercifully slow down their pace to match. That's a fine thing to measure; many systems behave this way. But maybe your service isn't such a system; maybe it's exposed on the internet. Maybe you want to know how your system behaves when N requests per second come in with no "mercy" if it slows down.". This is a fine sentiment.
  3. Coordinated Omission correction: Adjust the results to reflect the expected call rate. This can be done in a straight forward manner if the 'missing' calls are added back with a latency which reflects the period for which they were delayed. This correction method is supported out of the box by HdrHistogram but the discussion regarding it's over or under estimation of the impact of the delay is outside the scope of this post.
  4. Coordinated Omission avoidance: Measure all calls according to original schedule. We are now saying: "If I can't make the call, the meter is still running!". This is particularly relevant for systems where you would typically be making the requests to the system under test from a thread pool. That thread pool would be there to help you support asynchronous interaction where the API failed in giving you that option. Like JDBC... Like many key-value pair provider APIs.
This last solution is the one we'll go for in this post, but I would urge you to consider the results critically. In particular if you are trying to simulate independent access to a web server (as opposed to a DB via a thread pool) then the adherence to schedule might be hugely optimistic of the results in your case. This is because failing to generate independent load may have all sorts of beneficial effects on the system under test.
For the YCSB benchmark I'm assuming the harness/load generator is simulating a web serving layer accessing the key-value store in an effort to serve an unbounded, uncoordinated user request load via a predefined thread pool. So it's door number 3 for me. The corrected load generator is here.


Step 0: Some preliminary work (not strictly required)

HdrHistogram, as approved by Freddie!
As described previously here, we should all just get on with capturing latency using HdrHistograms. So as a first step toward correcting YCSB I have gone in and added an HdrHistogram measurement container. This is pretty straight forward as all I needed to modify was the Measurements class to allow a new measurement type. While I was there I tweaked this and that and the following list of changes to that class emerged:
  1. Add new measurement type and corresponding command-line option("-p measurementtype=hdrhistogram")
  2. Add combined measurement option allowing old/new measurement side by side: "hdrhistogram+histogram"
  3. Add support for capturing both corrected and uncorrected measurements for the same run.
  4. Use CHM instead of synchronizing around a HashMap.
The new measurement type supports logging loss less HdrHistogram data to a file (controlled by the hdrhistogram.fileoutput=<true|false> option and the hdrhistogram.output.path=<path> option) as well as better precision percentile data and lock free logging of latencies. This is not very interesting work but if you are interested in the "How would I plug in my own data structure to capture latency into YCSB?" topic have fun. It was not necessary for correction but it was good to do so that better quality results can be observed. You're welcome.


Step 1: Demonstrate the issue

YCSB includes a very useful means of verifying the measurements in the form of a mock DB driver. This means we can test our assertions regarding coordinated omission without setting up a key value store of any kind. The mock DB is called BasicDB and is the default DB used. We can configure it to simulate a pause and see what happens (-p basicdb.verbose=false -p basicdb.simulatedelay=4  will make the mock DB stop logging every action and simulate a latency of 0-4ms for each action). I added a further option to the BasicDB which allows us to turn off the randomization of the delay (-p basicdb.randomizedelay=false).
Let's consider our expectations in the case where a DB simply cannot handle request quickly enough. We can setup an experiment with the following settings: -target 1000 -threads 1 -s -p status.interval=1 -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -p measurementtype=hdrhistogram -p maxexecutiontime=60
Here's what they all mean:
  • -target 1000 -> We aim to test 1000 requests per second
  • -threads 1 -> We have a single client thread
  • -s -p status.interval=1 -> We will be printing out status every second (I made the status interval configurable)
  • -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -> The DB will sleep 4ms on each request, so the maximum we can hope for is 250, no noisy printing per operation please
  • -p measurementtype=hdrhistogram -> Use HdrHistogram to capture the latencies
  • -p maxexecutiontime=60 -> Run for one minute, then exit and print summary
This DB is obviously failing, it can't keep up with the rate of incoming requests and according to our model they queue up. The time measured per call is reflected in the summary for the READ operations:
[READ], Operations, 12528.0
[READ], AverageLatency(us), 4477.102809706258
[READ], MinLatency(us), 4018.0
[READ], MaxLatency(us), 44703.0
[READ], 95thPercentileLatency(ms), 4.0
[READ], 99thPercentileLatency(ms), 4.0
[READ], Return=0, 12528
But this completely ignores the time spent on the queue. If we were measuring according to schedule we'd get the following set of latencies:
Latency[k] = 4 + 3*(k-1) ms
The max latency would be for the last request to get in. We ran for 60 seconds, at 250 requests/sec which means our last request was (k=15000) and had a latency of 45 seconds when measured from the time we intended to make it. This number reflects the system's failure to handle load far more correctly than the numbers quoted above.


Step 2: Working to Schedule

The YCSB load generator has a weak notion of schedule, in the sense that it opts for option number 1 above and will just execute the operations when it can. When faced with the task of correcting this kind of issue (in a pile of foreign code) we need to look for 2 things in the load generator:
  1. "Scheduling an action to run at time X" - This will involve some calls to one of the many scheduling facilities in the JDK:
    1. Thread.sleep is an old favourite, but TimeUnit also supports a sleep method. A search for sleep in the code base will cover both. This is what YCSB was using to schedule next event to fire.
    2. Code submitting tasks to java.util.Timer, or alternatively the ScheduledExecutorService
    3. Code using LockSupport.parkNanos
    4. Object.wait(...)
    5. others?
  2. "Measuring the length of an operation" - This will involve calls to System.nanoTime() or currentTimeMillis(). For YCSB this is found to happen for example here.

To correct this problem I had to introduce the concept of 'intended start time' to the operations measurement. Schedule for YCSB is specified by the -target command line option which sets the overall number of operations per second to be attempted by the load generator. This is optional, and the default is to go as fast as you can manage, i.e. with no schedule but the back pressure from the system under test to guide us. I'm not sure what a good assumed rate of requests is reasonable in this case, so I did not correct this case. NOTE: If you don't specify target no correction will take place.
The target parameter is translated to a per-thread operation rate (number of threads is set via the threads option, default is 1) so if we have 10 threads, and the target request rate is 1000 (-target 1000 -threads 10) we will have each thread hitting the store with 100 requests per second. The client threads randomize the first operation time to avoid all hitting the store on the same interval. I did some ground work here by setting the units across the board to nanoseconds and naming interval parameters appropriately, nothing too exciting.
The actual correction at it's core involves:
  1. Record the operation's intended start time
  2. Use the intended start time when computing latency
Sadly the way YCSB measures latency does not lend itself to a simple in place fix. The operations are scheduled by the ClientThread which calls into a workload, calling into a DB, which is actually the DBWrapper which measures the latency (for calling into an actual DB implementation) and reports it to the central Measurements singleton. This means that changing the Workload/DB API to include a startTime parameter to each call is quite a far reaching change which would require me to dig through all the DB drivers implementations and would result in a very unpleasant time for all.
I settled on using a thread local on the Measurements object to transfer the start time to the DBWrapper, it is not a nice way to do things (and I'm happy to hear better suggestions) but it does the job without modifying the API.
Once we have:
  1. ClientThread setting up the start time for the operation via Measurements
  2. DBWrapper using the start time from Measurements to compute the operation latency
That's pretty much it. For extra points I wanted to include some facilities to compare measurements before/after the change. These can be removed if we accept HdrHistogram as a replacement and if we accept we only want to measure the intended latency, which would result in a much smaller PR.


Step 3: is the issue solved?

Running the setup from step 1 such that it produces the intended latency as well as the original measurement side by side(-p measurement.interval=both) yields the following result for the READ operations:
[READ], Operations, 12414.0
[READ], AverageLatency(us), 4524.981069759949
[READ], MinLatency(us), 4018.0
[READ], MaxLatency(us), 24703.0
[READ], 95thPercentileLatency(ms), 4.0
[READ], 99thPercentileLatency(ms), 4.0
[READ], Return=0, 12414
[Intended-READ], Operations, 12414.0
[Intended-READ], AverageLatency(us), 2.359010991606251E7
[Intended-READ], MinLatency(us), 4256.0
[Intended-READ], MaxLatency(us), 4.6989311E7
[Intended-READ], 95thPercentileLatency(ms), 42369.0
[Intended-READ], 99thPercentileLatency(ms), 46530.0

This reflects the effect a backed up system would have on latency as we express in Step 1 above. It's actually a bit worse because the average cost of calling the mock DB with a sleep of 4ms is 4.5ms. As we can see the maximum latency is 46.9 seconds, reflecting the fact that the last read to execute was scheduled to hit the system 13.1 seconds into the run.


Step 4: The limitations of the harness

We can now also consider the perfect DB for the sake of observing the short comings of the test harness by setting the mock DB delay to 0(-p basicdb.simulatedelay=0):
[READ], Operations, 56935.0
[READ], AverageLatency(us), 0.01796785808377975
[READ], MinLatency(us), 0.0
[READ], MaxLatency(us), 49.0
[READ], 95thPercentileLatency(ms), 0.0
[READ], 99thPercentileLatency(ms), 0.0
[READ], Return=0, 56935
[Intended-READ], Operations, 56935.0
[Intended-READ], AverageLatency(us), 232.37026433652412
[Intended-READ], MinLatency(us), 0.0
[Intended-READ], MaxLatency(us), 39007.0
[Intended-READ], 95thPercentileLatency(ms), 0.0
[Intended-READ], 99thPercentileLatency(ms), 0.0

How come it take so long to measure a noop? why such large differences? Here's some generic theories and how they panned out:
  • The JVM running the load generator is running with suboptimal settings(-Xms64m -Xmx64m -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime, Oracle JDK8u31) on a busy Mac laptop running on battery
This is no way to benchmark anything, but the interesting thing is that if we have no schedule to stick to the test harness is willing to just ignore the issue. If we run on a decent machine (with a decent OS) we get nicer results. This is from a server class machine running CentOS6.3/OracleJDK8u25 with same settings:
[READ], Operations, 56930.0
[READ], AverageLatency(us), 0.44417705954681186
[READ], MinLatency(us), 0.0
[READ], MaxLatency(us), 20.0
[READ], 95thPercentileLatency(ms), 0.0
[READ], 99thPercentileLatency(ms), 0.0
[READ], Return=0, 56930
[Intended-READ], Operations, 56930.0
[Intended-READ], AverageLatency(us), 146.31262954505533
[Intended-READ], MinLatency(us), 15.0
[Intended-READ], MaxLatency(us), 14255.0
[Intended-READ], 95thPercentileLatency(ms), 0.0
[Intended-READ], 99thPercentileLatency(ms), 0.0
This is still significant.

  • The JVM suffers from warmup related artefacts
This certainly correlated to the max values I'm seeing here. When looking at the status line for the first second I see:
[READ: Count=22, Max=14, Min=0, Avg=0.64, 90=0, 99=14, 99.9=14, 99.99=14]
[Intended-READ: Count=23, Max=14255, Min=15, Avg=5563.39, 90=13287, 99=14255, 99.9=14255, 99.99=14255]
 But after a few seconds the process settles and we see much better results, this is typical:
[READ: Count=947, Max=14, Min=0, Avg=0.02, 90=0, 99=0, 99.9=2, 99.99=14]
[Intended-READ: Count=946, Max=194, Min=61, Avg=151.29, 90=165, 99=175, 99.9=186, 99.99=194]
A good way to handle this issue is by relying on the HdrHistogram output files to grab relevant time periods for analysis. With the original YCSB output we have the percentile summary data, but this is not something we can combine for analysis. With the loss-less interval histogram logs we can look at any sub-period(which is longer than one interval, but shorter than the whole run) and get accurate full range histogram data. A common practice is to discard warmup period results, I'm no a fan of throwing away data, but since this is the load generator warmup I'd think it's quite legitimate. It's perhaps an interesting feature to add to such a framework that the framework can be warmed up separately from the system to examine cold system behaviour.
  • Thread.sleep/LockSupport.parkNanos are not super accurate and may wakeup after the intended operation start time
I've added an option for spinning instead of sleeping (so burn a CPU). This has improved the average value dramatically from ~146µs to ~3.1µs. A typical status line now looks like:
[READ: Count=947, Max=13, Min=0, Avg=0.02, 90=0, 99=0, 99.9=3, 99.99=13]
[Intended-READ: Count=948, Max=57, Min=0, Avg=0.47, 90=1, 99=12, 99.9=26, 99.99=57]
It is obviously not desirable for the load generator to burn a CPU instead of using sleep, but it also introduces scheduling inaccuracies. This is an accuracy issue we didn't have to deal with when not measuring from a schedule. This didn't impact the measured outliers, but has dramatically reduced their number. The take away here is just that there are accuracy limitations to the load generators ability to stick to schedule.

  • GC pauses that are large enough to derail the schedule on the load generator side are now captured. Unless the GC pauses happen inside the measurement gap we will have no idea we have gone off schedule if we don't track the intended start time.
We should capture GC logs on load generator side and make sure we correlate the GC events with recorded latencies. Here's a GC pause being captured by the corrected measurement:
[READ: Count=952, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0] 
[Intended-READ: Count=952, Max=14, Min=0, Avg=0.03, 90=0, 99=0, 99.9=3, 99.99=14]
[GC (Allocation Failure) [PSYoungGen: 17895K->1824K(18944K)] 17903K->1840K(62976K), 0.0024340 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
Total time for which application threads were stopped: 0.0026392 seconds
[READ: Count=957, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0]
[Intended-READ: Count=957, Max=2719, Min=0, Avg=5.21, 90=0, 99=0, 99.9=2119, 99.99=2719]
This process is running with a 64M heap, you can expect longer pauses as the heap grows (in particular as the young generation grows).

  • The operation setup time is now being measured as well as the operation itself.
When running with the spin option we can see the average operation cost is ~3.1µs, this is all test harness overhead and is really quite negligible in the context of network hopping operations. In other words, nothing to worry about for this harness but could well prove an issue for others.


Step 5: The Good, The Bad And The STW pausing DB

Many software processes have a latency profile that is far from normally distributed. To see what YCSB makes of this kind of profile now that we can compare corrected vs. uncorrected measurement I have built a mock DB that has 4 mods of latency (p is uniform random number [0,1]):
  1. Awesome (p < 0.9): we return in 200µs-1ms
  2. Good (0.9 < p < 0.99): we return in 1-10ms
  3. Minor Hiccup( 0.99 < p < 0.9999): we hit a bump, but only one thread is affected 10-50ms
  4. Major Hiccup(0.9999 < p): we hit a STW pause(because GC/THP/LBJ/STD/others), all threads stop for 50-200ms
I implemented the above with a read write lock, where the STW pause grabs the write lock and all the others grab the read lock. This is far from a perfect representation of a system (everyone waits for STW as intended, but also STW waits for everyone to start), but it will do. If you feel strongly that a better simulation is in order, write one and let's discuss in the comments!
What sort of profile will we see? How far off course will our measurements be if we don't stick to schedule? Here's this setup run at a rate of 10,000 requests per second, with 25 threads (so each thread is trying for 250 reqs/sec or 1 request per 4ms):
[READ], Operations, 569516.0
[READ], AverageLatency(us), 1652.1852871561116
[READ], MinLatency(us), 210.0
[READ], MaxLatency(us), 142463.0
[READ], 95thPercentileLatency(ms), 1.0
[READ], 99thPercentileLatency(ms), 19.0
[READ], Return=0, 569516

According to these numbers, the max is quite high but the overall impact of hiccups is not too severe (all depends on your requirements of course). Even at this stage we can see that the effect of global pauses is skewing the other measurements (if you hit a short operation while a STW pause is in progress you still have to wait for the STW event to finish).
The corrected measurements tell a different story:
[Intended-READ], Operations, 569516.0
[Intended-READ], AverageLatency(us), 24571.6025835973
[Intended-READ], MinLatency(us), 268.0
[Intended-READ], MaxLatency(us), 459519.0
[Intended-READ], 95thPercentileLatency(ms), 83.0
[Intended-READ], 99thPercentileLatency(ms), 210.0


How can this be right? Can this be right?
  • At a rate of 10000 request per second, the unlikely Major Hiccup is likely to happen every second. Consider this next time someone tells you of a 99.99%ile behaviour. Given an event rate of 10K per second, 99.99% is suddenly not very rare. Consider that at this rate there's likely to be a few events that are worse.
  • The average major hiccup is 125ms long, in this time 125/4 events are delayed on all 25 threads -> 125 * 25 / 4 = 781 events are delayed from starting, they will further delay each other as they execute. In roughly 12 seconds we can see how it is quite probable that one of these events is another major hiccup. What with all the queuing up behind the first one etc, the pileup becomes quite reasonable.
  • The probability of a 'mode' is not the probability of the per event latency once STW and queuing effects are in play.
I've made the mock DB print out 'OUCH' every time we get slapped with a STW event. It turns out that we got very unlucky in this run and hit three of these in a row:
56 sec:
[READ: Count=9192, Max=83903, Min=238, Avg=1745.13, 90=1531, 99=26959, 99.9=79551, 99.99=83775]
[Intended-READ: Count=9208, Max=159999, Min=303, Avg=16496.92, 90=54271, 99=103807, 99.9=150527, 99.99=158335]
OUCH
OUCH
OUCH
57 sec: 
[READ: Count=9642, Max=129727, Min=247, Avg=2318, 90=1799, 99=40607, 99.9=125631, 99.99=127359]  
[Intended-READ: Count=9635, Max=459519, Min=320, Avg=102971.39, 90=200319, 99=374271, 99.9=442367, 99.99=457983]

This is quite telling.
The view on what's the worst second in this run is wildly different here. Because the uncorrected measurement takes each event as it comes it will take the view that 75 events were delayed by these hiccups, and none by more than 130ms. But from the corrected measurement point of view all the queued up measurements were effected and were further delayed by each other.
I've re-run, this time logging interval histograms in their compressed form for every second in the run. Logging a 60 seconds run with 1 second interval data cost me 200k (we can tweak the construction in OneMeasurementHdrHistogram to minimize the cost). I can take the compressed logs and use the HistogramLogProcessor script provided with HdrHistogram to process the logs (you need to copy the HdrHistogram.jar into the script folder first). Running:
./HistogramLogProcessor -i READ.hdr -o uncorrected -outputValueUnitRatio 1000
./HistogramLogProcessor -i Intended-READ.hdr -o corrected -outputValueUnitRatio 1000
Will produce *.hgrm files for both. I then use the plotFiles.html to generate the following comparison:


They tell very different stories don't they.
The red line will have you thinking your system copes gracefully up to the 99%ile slowly degrading to 20ms, when measuring correctly however the system is shown to degrade very quickly with the 20ms line crossed as early as the median, and the 99%ile being 10 times the original measurement. The difference is even more pronounced when we look at one of those terrible seconds where we had back to back STW hiccups. I can use the HistogramLogProcessor script to produce partial summary histograms for the 3 seconds around that spike:
./HistogramLogProcessor -i Intended-READ.hdr -o correctedOuch3 -outputValueUnitRatio 1000 -start 1425637666.488 -end 1425637668.492
Similarly we can compare a good second with no STW pauses:


Summary

Coordinated Omission is a common problem in load generators (and other latency reporters), we had a look at fixing YCSB, an industry standard load generator:
  • Replaced the data structure used to capture latency with HdrHistogram: that is just generally useful and gives us better data to work with when examining the corrected measurement
  • Found scheduling code and introduced notion of operation start time.
  • Found measuring code and captured both operation cost (uncorrected measurement) and scheduled time latency (corrected measurement).
  • Use a mock system under test to evaluate measurement of known scenario. This is a very handy thing to have and luckily YCSB had this facility in place. In other places you may have to implement this yourself but it's a valuable tool to have in order to better understand the measurement capabilities of your harness. This helped highlight the scale of scheduling inaccuracies and test harness overhead per operation, as well as the scale of test harness error during its own warmup period.
  • Use HdrHistogram facilities to visualise and analyse latency histogram data from the compressed histogram logs.
Thanks goes to this posts kind reviewers: Peter Huges, Darach, and Philip Aston


Monday, 16 February 2015

HdrHistogram: A better latency capture method

Some years back I was working on a latency sensitive application, and since latency was sensitive it was a requirement that we somehow capture latency both on a transaction/event level and in summary form. The event level latency was post processed from the audit logs we had to produce in any case, but the summary form was used for live system monitoring. We ended up with a solution I've seen since in many places (mild variations on a theme), which is what I've come to think of as the linear buckets histogram:

The above data structure was an easy solution to a problem we had little time to solve, but it left much to be desired.
These days the histogram problem is gloriously solved by the HdrHistogram (High Dynamic Range), and though it's been around for a couple of years now I still find olden hand rolled histograms in many a code base. Cut that shit out boys and girls! Let me show you a better way :-)

HdrHistogram highlights:
  • Mature and battle tested, this data structure has been in production for many companies for a while now. Kinks have been unkinked and knickers untwisted.
  • Multi-lingual support: current implementations available in Java, C, C#, Erlang, Go and more are on the way.
  • Auto-resizing histograms (if you exceed your initial miximum value estimate)
  • Compact memory footprint supporting high precision of values across a wide range.
  • Compressed lossless serialization/de-serialization
  • Plotting scripts for gnuplot, a webby charting tool and an excel sheet chart
  • Lock-free concurrency support for recording and logging from multiple threads
  • Zero allocation on recording path (unless resizing which is optional, and then only if value exceeds initially specified max recorded value)
  • Constant time measurement which is less than cost of calling System.nanoTime() (on the cost, scalability and trustworthiness of nanoTime read Shipilev's excellent report)
It is truly as if the bees-knees and the dogs-bollocks had a baby, is it not?

Mama, what's a histogram?

"Histogram of arrivals per minute" by DanielPenfield
Own work. Licensed under CC BY-SA 3.0 via Wikimedia Common
Well, you could look it up on wikipedia, but the short answer is that a histogram is a summary of data in terms of the frequency of ranges of values.  So given the following data set [1,2,3,...,100000] captured for a second I could summarize in several ways:
  • I could capture the whole range of values each value in a bucket of it's own, assigning each value a frequency of 1 per second.
  • I could have a single bucket for values between 0 and 100,000, this bucket will have the frequency of 100,000 times per second.
These are both a bit silly, the first is as bad as dealing with the full set of data, the second is telling us nothing about the way the 100,000 values break down within the range. Still these are the 2 extremes of histograms, the alternatives lie within that range in terms of the data they offer but there are many ways to skin a cat (apparently, :( poor cats):
  • Capture the values in 1000 buckets, each bucket representing a range of 100 values: [1..100][101..200]...[99,901..100,000] that will result in 1,000 buckets each with a frequency of 100. This is the sort of histogram described above where all buckets capture the same fixed range.
  • Capture the values in 17 buckets, each bucket K representing a range [2^K..(2^(K+1)-1)] e.g. [1..1][2..3][4..7]...[65,536..131,071]. This would be a good solution if we thought most values are likely to be small and so wanted higher precision on the lower range, with lower and lower precision for larger values. Note that we don't have to use 2 as the base for exponential histogram, other values work as well.
Both of the above solutions trade precision across a large range with storage space. In both solutions I am required to choose up-front the histogram precision and we expect to pay for a large range with either space or precision. Now that we realize what the variables are we can describe these solutions:
  • Linear buckets: For a range 0..R we will have to pay R/B space for buckets of range B. The higher R is the more space we require, we can compensate by picking a large B.
  • Exponential buckets: For a range 0..R we require space of log2 of R. The bucket size grows exponentially as we track higher values.
The problem we face with latency data points is that the range of values we want to capture is rather large. It is not unreasonable to expect the latency outliers to be several orders of magnitude larger than the typical observed measurement. For example, it may well be that we are timing a method whose cost is in the 100s of nanoseconds, or a high speed connection round trip on the order of 1-100µs but on occasion our latency is dominated by some stop the world GC pause, or network congestion, which is in the order of 10ms to a few seconds. How can we correctly size the range of our histogram? Given the possibility of multi-second GC delays we need to cover a typical range of 1000ns to 100,000,000,000ns. If we used a linear histogram with a 100µs bucket we'd need 1,000,000 buckets (assuming an int counter this will add up to a ~4MB data structure).
The HdrHistogram follows a different path to the above solutions and manages to accommodate a large range with a high precision across the range in a limited space.

How does it work?

Here's what the documentation has to say:
"Internally, data in HdrHistogram variants is maintained using a concept somewhat similar to that of floating point number representation: Using an exponent a (non-normalized) mantissa to support a wide dynamic range at a high but varying (by exponent value) resolution. AbstractHistogram uses exponentially increasing bucket value ranges (the parallel of the exponent portion of a floating point number) with each bucket containing a fixed number (per bucket) set of linear sub-buckets (the parallel of a non-normalized mantissa portion of a floating point number). Both dynamic range and resolution are configurable, with highestTrackableValue controlling dynamic range, and numberOfSignificantValueDigits controlling resolution."
Hmmm... I'll admit to having difficulty immediately understanding what's happening from the above text, precise though it may be. I had to step through the code to get my head around what works why, read the above again and let it simmer. I'm not going to dig into the implementation because, while interesting, it is not the point of this post. I leave it to the reader to pester Gil Tene(the author of HdrHistogram) with implementation related questions.
The principal idea is a mix of the exponential and linear histograms to support a dynamic range precision that is appropriate to the time unit scale. At the scale of seconds we have a precision of milliseconds, at the scale of milliseconds we have a precision of microseconds etc. This translates roughly into exponential scale buckets which have linear sub-buckets.

Example: From raw recording to histogram

I have posted a while back a Java ping utility which measures the round trip between minimal client and server processes. Each round trip was recorded into a large array and every set number of round trips the measurements were summarized in percentiles:
Recording raw data is the simplest way of capturing latency, but it comes at a price. The long[] used to capture the latencies is ~8MB in size, this is for a million samples and in a real application can grow without bounds until some cutoff point where we decide to summarize or discard the data. When we want to report percentiles we have to sort it and pick the relevant data points. This is not usually an acceptable solution (because of the memory footprint), but it offers absolute accuracy and is trivial to implement (until you have to consider serialization, concurrency and visualization, but otherwise trivial).
Replacing this measurement method with a histogram is straight forward:
This histogram is 31KB when using 2 decimal places precision which is good enough in most case (according to JOL, add it to your utility belt if it ain't there already. Increasing the precision to 3 increases the size to 220KB), which is a large improvement over 8MB. We could reduce the memory consumption further if we were willing to limit the maximum data point count per bucket and use an integer/short histogram(ints seem like a reasonable choice).
If we print both measurements methods for the same run we can see the difference between the raw data and the HDR representation which is naturally slightly less accurate (# lines are HdrHistogram,@ is for raw data, each line represents 1M data points):
We can see that reported percentiles are pretty close to the raw data:

  • Note that the nanoTime on Mac reports in µs granularity, which is why the real values(@ lines) all end with 3 zeros.
  • Note that the max/min reported are adjusted to the correct histogram resolution (not a big deal, but slightly surprising).

What would have happened with our hand rolled solution? To keep a range of 0 to 60,000,000,000ns in a linear histogram of the same memory footprint we would need to limit ourselves to roughly 220k/8=64k buckets. Each bucket would have a granularity of roughly 1ms which would have translated to a very limited visibility on the lower end of the spectrum as most data sets are actually all under 1ms. This would have also completely skewed our percentiles (i.e 99.99% results in 1ms bucket, no breakdown of behaviour in percentiles). We could try and tackle the issue by picking a lower range to cover (which if applied to HdrHistogram will minimize memory usage further) or by blowing the memory budget on finer grained buckets.

Example: Aeron samples, much percentiles! Such graphs! Wow!

Percentiles are commonly used to describe latency SLA of a given system and a typical application will report a range of percentiles to reflect the probability of a given response time. In this context we can say that a 99%ile latency of 1ms means that 99% of all requests were handled in =< 1ms.
Aeron is a low latency, reliable UDP messaging library. Since latency is an obvious concern, all the samples utilize the HdrHistogram lib to demonstrate measurement and report results, here are the relevant excerpts from the Ping sample:
This results in a pile of text getting poured into the console, not that friendly:
Fear not, HdrHistogram comes packed with a handy charty thingy! Here's what the above histogram looks like when plotted:

To get this graph:
  • Save output above to a text file
  • Open a browser, and go here (the same HTML is in the project here)
  • Choose your file, choose percentiles to report and unit to report in
  • Export the picture and stick it in your blog!
This histogram was provided to me by Martin Thompson (one of the Aeron developers) and is from a test run in a performance lab. We can see that Aeron is delivering a solid 7µs RTT up to the 90%ile where latency starts to gradually grow. In this particular data set the maximum observed latency was 38µs. This is a great latency profile. It is far more common for the max and 99.99%ile to be orders of magnitude more that the 90%ile.
I could similarly plot this histogram using a gnuplot script to be found here. The gnuplot script is very handy for scripted reporting on large runs. It also allows for plotting several histograms on the same graph to allow visual comparison between benchmark runs for instance.

Example: Compressed histogram logging

Because the SLA is often specified in percentiles, it is common for applications to log only percentiles and not histograms. This leads to a reporting problem as it turns out that percentiles output cannot be combined to produce meaningful average percentiles. The solution would be to log the full histogram data, but who want a log that grows by 31KB every 5 seconds just to capture one histogram? Worry not, HdrHistogram comes with a compressed logging format and log writer and all that good stuff:
How many bytes is that logged histogram costing us? The 31KB histogram compressed down to 1KB in my example, but the result of compression will depend on the histogram captured. It is fair to assume that histograms compress well as the array of buckets is full of zeros (on the byte level) as most buckets are empty or low count.
If 1KB sounds like allot consider that a days worth of 10s interval histograms will result in an 8MB log file, which seems pretty acceptable even if you have a hundred such files. The benefit is that you will now have high precision interval latency data that you can reliably use to create longer interval latency data. You can use the HistogramLogProcessor to produce a full or partial log summary histogram for plotting as above.
I believe there's some truly exciting data visualizations one could build on top of this data, but sadly thats not where my skills lie. If you got skills to show of in this area I'm sure HdrHistogram would value your contribution.

Example: jHiccup and concurrent logging

jHiccup is a pause measurement tool used to capture OS or JVM 'hiccups'. It deserves it's own post but I'll try and summarize it in a few points:
  • jHiccup runs a HiccupRecorder thread which sleeps for a period of time (configurable) and measures the delta between the wakeup time and actual time. The failure to be re-scheduled is taken as a potential OS/JVM hiccup. The size of the hiccup is recorded in a histogram.
  • jHiccup can be run as an agent in your own process, an external process, or both.
  • jHiccup has been ported to C as well.
  • People typically use jHiccup to help charecterize and diagnose disruptions to execution in their system. While not every hiccup is the result of a STW pause we can use the jHiccup agent evidence correlated with an external jHiccup process and the JVM gc logs to support root cause analysis. A significant hiccup is a serious sign of trouble meaning a thread was denied from scheduling for the length of the hiccup. We can safely assume in most cases that this is a sign that other threads were similarly disturbed.
Gil Tene originally wrote HdrHistogram as part of jHiccup, but as HdrHistogram turned out to be more generally useful the two were split. The reason I bring jHiccup up in this context is that it serves as a regularly maintained full blown real world example of using an HdrHistogram.
jHiccup has 2 interesting threads, with roles that parallel many real world applications out there:
  • The measuring thread(HiccupRecorder): This is the thread that sleeps and wakes up and so on. The rate at which it does that is potentially quite high and we don't want to skew to measurement by performing IO on this thread. Similarly many real world application will have critical threads where it is not desirable to introduce IO. Since this is the case actual persistence will be performed on another thread.
  • The monitoring/logging thread(HiccupMeter): This thread will wake up at regular intervals and write the last interval's histogram to the log file. But since it is reading a histogram while another thread is writing to the histogram we now need to manage concurrency.
HdrHistogram offers a synchronization facility to serve exactly this use case in the form of the Recorder:
  • Recording a value in the recorder is a wait free operation (on JDK8, can be lock free on older depending on the getAndAdd implementation for AtomicLongArray).
  • The Recorder also comes in a single-writer flavour, which minimizes the concurrency related overheads.

Under the covers the recorder is using an active histogram and an inactive one, swapped seamlessly when an interval histogram is requested. Using a Recorder looks much like using a normal histogram(full code here):
And that's concurrent logging sorted ;-).

Summary

With HdrHistogram now hitting version 2.1.4 and offering a wealth of tried and tested functionality along with cross platform implementations and a standardized compressed logging format it is definitely time you gave it a go! May your latencies always be low!
If you are looking for a pet project and have a gift for UI thingies a latency explorer for the histogram interval logs would be an awesome contribution!

Monday, 19 January 2015

MPMC: The Multi Multi Queue vs. CLQ

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
JCTools, which is my spandex side project for lock-free queues and other animals, contains a lovely gem of a queue called the MpmcArrayQueue. It is a port of an algorithm put forward by D. Vyukov (the lock free ninja) which I briefly discussed in a previous post on ring-buffer queues.
The implementation is covered to a large extent in that post, but no performance numbers are provided to evaluate this queue vs. the JDK's own MPMC queue the illustrious ConcurentLinkedQueue (CLQ for short). Let's fix that!

Welcome to the Machine

Before we start the party, we must establish the test bed for these experiments. The results will only be relevant for similar setups and will become less indicative the more we deviate from this environment. So here it is, the quiet performance testing machine (shitloads of power and nothing else running):
  • OS: CentOS 6.4
  • CPU: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz, dual socket, each CPU has 12 cores (24 logical core as HT is on). This is an Ivy Bridge, so decent hardware if not the fanciest/latest.
  • JVM: Oracle JDK 1.8u25
  • Load average (idle): 0.00, 0.02, 0.00 -> it's oh so quiet....
I'll be using taskset to pin threads to a given topology of producers/consumers which is important because:
  • Running 2 logical threads on same physical core will share queue in L1 cache.
  • Running on 2 physical cores will share queue in LLC.
  • Running on different socket (i.e 2 different CPUs) will force shared data via QPI.
  • Each topology has it's own profile, there's no point to averaging the lot together.
For the sake of these benchmarks I'll be pinning the producer/consumer threads to never share a physical core, but remain on the same CPU socket. In my particular hardware this means I'll use the 12 physical cores on one CPU using taskset -c 0-11.
For the sake of keeping things simple I ran the benchmarks several times and averaged the results, no fancy graphs (still recovering from the new year). Throughput results are quoted in ops per microsecond, if you are more comfortable with ops per second just multiply by 1 million. Latency results are in nanoseconds per operation.

Throughput Benchmarks: Hand rolled 1 to 1

Let's start with the simplest benchmark in JCTools. The QueuePerfTest is really a single producer/consumer test:
  • A producer thread is spun up which feeds into the queue as fast as it can.
  • The main thread plays the consumer thread and empties the queue as fast as it can. 
The original benchmark was a sample used by Martin Thompson in one of his talks and I tweaked it further. It's nice and simple and has the added benefit of verifying that all the elements arrived at their destination. It also opens the door to loop unrolling which you may or may not consider a reasonable use case. This benchmark calls Thread.yield() if an offer/poll call fails which represents a rather forgiving backoff policy, which is why we also have the BusyQueuePerfTest which busy spins on fail.
Here's what we get:
Queue  Benchmark            Throughput(ops/µs)
CLQ    QueuePerfTest        10.7
Mpmc   QueuePerfTest        65.1
CLQ    BusyQueuePerfTest    15.3
Mpmc   BusyQueuePerfTest    63.5
         
The observed throughput for CLQ varies quite a bit between iterations, but the full run average is not too unstable. The performance also depends on the size of the heap as CLQ allocates a node per item passed and the GC overhead can become a significant part of management. In my runs I set the heap size to 1GB and the benchmark also calls System.gc() between measurement iterations. In any case, off to a good start, the JCTools queue is looking well.

Throughput benchmarks: JMH Joy

I'm a big JMH fan (see the JMH reference page), and these benchmarks show some of the power JMH has. The benchmark code is very simple and it's clear what's being measured:
  • We have 2 methods, offer and poll
  • Thread groups hammer each method
  • We use @AuxCounters to report successful/failed interactions with the queue
  • The throughput is the number of pollsMade, i.e. the number of items delivered
The nice thing is that once I got this setup I can now play with the number of threads in each group with no further investment (see an introduction to multi-threaded benchmarks with JMH).  I can run several iterations/forks and go make some coffee while JMH crunches through the lot and gives me a nice report (JMH options used for these benchmarks: "-gc -f 5 -i 5 -wi 5 -p qType=MpmcArrayQueue,ConcurrentLinkedQueue -p qCapacity=1000000" the -gc option forces a GC cycle between iterations as before, I use the -tg option to control number of producer/consumer threads). Here's the results for QueueThroughputBackoffNone (again results are in ops/µs, so higher is better):
Queue  1P1C        2P2C      3P3C      4P4C      5P5C      6P6C
CLQ     7.9 ±1.7 , 4.7 ±0.2, 3.8 ±0.1, 3.7 ±0.4, 3.1 ±0.1, 2.9 ±0.1
Mpmc   68.4 ±11.1, 9.2 ±1.1, 7.3 ±0.5, 5.8 ±0.4, 5.2 ±0.4, 5.3 ±0.2

Note the columns stand for number of producers/consumers so 1P1C will be running with -tg 1,1 and will have one thread hitting the offer and one the poll. 2P2C will have 2 consumers and 2 producers etc.
So on the one hand, joy, MPMC still consistently ahead, on the other hand it is not some sort of magic cure for contention. If you have multiple threads hitting a shared resource you will suffer for it. The initial hit is the worst, but the following degraded performance curve isn't too bad and we seem to stay ahead.

Average Round Trip Time Benchmarks


This benchmark was set up to model a bursty load on queue implementations where we measure the time it takes for a burst of messages to travel from an originating thread, to a chain of threads inter-linked by queues, back to the same thread. This benchmark is focusing on near empty queues and the notification/wakeup time from when an item is placed in a queue until it becomes visible to another thread. It also highlights any batching optimisations impact as bursts grow larger.
The benchmark has been discussed in some detail in 2 previous posts(1, 2)
We have 2 axis we can explore here, burst size and ring size, I have not explore all the possibilities. The results seem quite straightforward and I'm short on time. I tested with burst size=1,10,100 in the default configuration, i.e. chain length 2 (so a ping-pong kind of setup), and just for the heck of it I ran the burst size=100 with a chain length of 8 (so 1->2->3...->7->8->1). There you go (results are now in nanoseconds per operation, lower is better):
Queue  b=1,c=2   b=10,c=2   b=100,c=2    b=100,c=8
CLQ    488 ±11,  2230 ±98,  15755 ±601,  24886 ±2565
Mpmc   422 ±7,   1718 ±51,   5144 ±287,   8714 ± 200

Note that the headers shorthand stands for burst size(b) and chain length(c) so b=1,c=2 which is the default config for the benchmark stands for burst size of 1 (so send 1 message and wait until 1 message is received back) and chain length of 2 (so 2 threads linked up in this ring: 1->2->1).
The difference on single element exchange is not that great, but as burst size grows the gap widens significantly. This is perhaps down to the fact MPMC enjoys a much denser delivery of data, minimising the cache misses experienced as part of the message exchange. Note that extending the length of the chain seems to add the same percentage of overhead for each implementation, resulting in the same ratio for chain length 2 as we did for chain length 8 (MPMC is 3 times 'cheaper' than CLQ)

Summary

This was all rather dry, but I hope it helps people place the MPMC alternative in context. I would suggest you consider using MpmcArrayQueue in your application as a replacement to CLQ if:
  • You need a bounded queue
  • You are concerned about latency
  • You don't need to use the full scope of methods offered by Queue and can make do with the limited set supported in JCTools queues

Monday, 1 December 2014

The Escape of ArrayList.iterator()

{This post assumes some familiarity with JMH. For more JMH related content start at the new and improved JMH Resources Page and branch out from there!}
Escape Analysis was a much celebrated optimisation added to the JVM in Java 6u23:
"Based on escape analysis, an object's escape state might be one of the following:
  • GlobalEscape – An object escapes the method and thread. For example, an object stored in a static field, or, stored in a field of an escaped object, or, returned as the result of the current method.
  • ArgEscape – An object passed as an argument or referenced by an argument but does not globally escape during a call. This state is determined by analyzing the bytecode of called method.
  • NoEscape – A scalar replaceable object, meaning its allocation could be removed from generated code.
After escape analysis, the server compiler eliminates scalar replaceable object allocations and associated locks from generated code. The server compiler also eliminates locks for all non-globally escaping objects. It does not replace a heap allocation with a stack allocation for non-globally escaping objects." - from Java 7 performance enhancements 
Alas, proving an object never escapes is a difficult problem, and many people feel they cannot rely on this optimisation to kick in and "do the right thing" for them. Part of the problem is that there is no easy way to discover if a particular allocation has been eliminated (on a debug OpenJDK build one can use -XX:+UnlockDiagnosticVMOptions -XX:+PrintEscapeAnalysis -XX:+PrintEliminateAllocations).
The EscapeAnalysis skepticism leads some people to go as far as claim that the JIT compiler fails to eliminate the iterator allocation of collections, which are everywhere:

Buy why skepticize what you can analyze?

Theory: even simple iterators do not benefit from Escape Analysis

So lets give this some thought. I generally think the best of the JVM developer bunch. Sure they may miss here and there, they're just human after all, but we are talking about a good strong bunch of engineers. I tend to assume that if there is a simple case for an optimization that is worth while than they have not failed to capitalize on it. I would therefore be quite surprised if indeed iterators do not benefit from escape analysis as they seem quite natural candidates, particularly in the syntactic sugar case, but even in the explicit case. Still, my trust in the JVM engineers is no evidence, how can I prove this works for a given setup?
  1. Use a debug build... I invite the readers to try this method out, didna do it.
  2. Use a memory profiler, like the one packed with VisualVM or YourKit
  3. Setup an experiment to observe before/after effect of desired optimization. Use -XX:-+DoEscapeAnalysis and examine gc logs to observe effect.
  4. Look at the assembly...

Experiment: Observe a loop over an array list

Reach for your tool belts and pick the swiss army knife that is JMH. Here's the benchmark I will use for this:
This is one of them times when I'm not particularly concerned with the performance of this bit of code as such, but JMH makes a good crucible for java code. The same effort that went into correct measurement enables the examination of code in an isolated manner. This makes JMH a good tool for testing the JIT compiler.

Measurement: Profile the experiment

I want to plug the experiment into a profiler, so I set the number of iterations to 1000 and get into it man, profiling, doing it you know, like a... like a sex machine, can I count it all? Here's YourKit reporting:

Ouch! that array list iterator is right there at the top! all that trust I put in the JVM developers? GONE! Let's get a second opinion from JVisualVM, just to be sure:

Oh me, oh my, this is bad...
Finally, with tears in our eyes, let us try see what Java Mission Control has to say:

How curious! the iterator allocation is gone! but JMC is outnumbered 2 to 1 here. Could it be that some profilers are pooping their pants?

Measurement: Measure with -XX:+/-DoEscapeAnalysis

Sometimes we get lucky and the optimization we want to examine comes with a handy flag to turn it on and off. We expect escape analysis to remove the iterator allocation and thus leave us with a benchmark which generates no garbage. We are also fortunate that JMH comes with the handy GC profiler which simply examines the GC JMX bean to inform us if any collections happened. Running the benchmark with a short list size and a small heap should trigger plenty of young generation GC cycle in each measurement iteration. Lo and behold, with escape analysis on:
$java -jar target/microbenchmarks.jar -f 1 -i 10 -wi 10 -p size=1000 \
 -jvmArgs="-Xms64m -Xmx64m -XX:+DoEscapeAnalysis" -prof gc ".*.sumIteratorOverList"
Benchmark                                           (size)    Score    Error   Units
sumIteratorOverList                                   1000  816.367 ± 17.768   ns/op
sumIteratorOverList:@gc.count.profiled                1000    0.000 ±    NaN  counts
sumIteratorOverList:@gc.count.total                   1000    0.000 ±    NaN  counts
sumIteratorOverList:@gc.time.profiled                 1000    0.000 ±    NaN      ms
sumIteratorOverList:@gc.time.total                    1000    0.000 ±    NaN      ms


$java -jar target/microbenchmarks.jar -f 1 -i 10 -wi 10 -p size=1000 \
 -jvmArgs="-Xms64m -Xmx64m -XX:-DoEscapeAnalysis" -prof gc ".*.sumIteratorOverList"
Benchmark                                           (size)    Score    Error   Units
sumIteratorOverList                                   1000  940.567 ± 94.156   ns/op
sumIteratorOverList:@gc.count.profiled                1000   19.000 ±    NaN  counts
sumIteratorOverList:@gc.count.total                   1000   42.000 ±    NaN  counts
sumIteratorOverList:@gc.time.profiled                 1000    9.000 ±    NaN      ms
sumIteratorOverList:@gc.time.total                    1000   27.000 ±    NaN      ms
Now that is what we hoped for, escape analysis saves the day! Please note above numbers are from running on my laptop using Oracle Java 8u20, I was not aiming for measurement accuracy, just wanted to verify the compilation/GC aspects. Laptops are good enough for that.

WTF Profilers? Why you lie?

There's a big difference in how JVisualVM/YourKit work and how JMC work, in particular:
  • JVisualVM/YourKit: treat the JVM as a black box and profile via the JVMTI interfaces and bytecode injection. Work on all JVMs.
  • Java Mission Control: use internal JVM counters/reporting APIs only available to the Oracle JVM (so can't profile OpenJDK/Zing/IBM-J9 etc)
So why should this get in the way of escape analysis? Searching through the OpenJDK source code we can spot the following:
void C2Compiler::compile_method(ciEnv* env, ciMethod* target, int entry_bci) {
  assert(is_initialized(), "Compiler thread must be initialized");

  bool subsume_loads = SubsumeLoads;
  bool do_escape_analysis = DoEscapeAnalysis &&!env->jvmti_can_access_local_variables();
This explains a similar issue, but not this one. The above code is relevant for debuggers and other tools relying on stepping through code and examining variable values.
This issue is the result of some instrumentation trapping the reference, at least for JVisualVM. We can look at the assembly to be certain, this is the iterator non-allocation before we connect JVisualVM to the process:
Note how the iterator is initialized with no allocation actually taking place (note also how line 4 annotation is a big fat lie, this is actually getting modCount from the list). Method is recompiled to same when we attach the agent. And here's what happens when we turn on memory profiling: The iterator is now allocated (see 'new' keyword at line 18 above).We can see that the instrumenting agent added a traceObjAlloc call into the iterator constructor (line 52 above). JVisualVM is open source, so we can dig into the code of the above trace method and the instrumentation code, I leave it as an exercise to the reader. If I was a better man I might have a go at fixing it, maybe another day.

Summary

  • EscapeAnalysis works, at least for some trivial cases. It is not as powerful as we'd like it, and code that is not hot enough will not enjoy it, but for hot code it will happen. I'd be happier if the flags for tracking when it happens were not debug only.
  • Profilers can kill EscapeAnalysis. In the example above we have very little code to look at, so it is easy to cross examine. In the real world you'd be profiling a much larger codebase, looking for allocation elimination opportunities, I suggest you have a healthy doubt in your profiler.
  • Java Mission Control is my current favourite profiler for the Oracle JVM. It's a shame the underlying APIs are not made public and standard so that tool makers can rely on them for all JVMs. Perhaps one day at least part of these APIs will be made public.
Thanks Peter Lawrey for reviewing this post!