Tuesday, 21 May 2013

Know Thy Java Object Memory Layout

Sometimes we want to know how an object is laid out in memory, here's 2 use cases and 2 solutions which remove the need to hypothesise.
A few months back I was satisfying my OCD by reading up on java object memory layout. Now Java, as we all know and love, is all about taking care of  such pesky details as memory layout for you. You just leave it to the JVM son, and don't lose sleep over it.
Sometimes though... sometimes you do care. And when you do, here's how to find out.

In theory, theory and practice are the same

Here's an excellent article from a few years back which tells you all about how Java should layout your object, to summarise:
  • Objects are 8 bytes aligned in memory (address A is K aligned if A % K == 0)
  • All fields are type aligned (long/double is 8 aligned, integer/float 4, short/char 2)
  • Fields are packed in the order of their size, except for references which are last
  • Classes fields are never mixed, so if B extends A, an object of class B will be laid out in memory with A's fields first, then B's
  • Sub class fields start at a 4 byte alignment
  • If the first field of a class is long/double and the class starting point (after header, or after super) is not 8 aligned then a smaller field may be swapped to fill in the 4 bytes gap.
The reasons why the JVM doesn't just plonk your fields one after the other in the order you tell it are also discussed in the article, to summarise:
  • Unaligned access is bad, so JVM saves you from bad layout (unaligned access to memory can cause all sorts of ill side effects, including crashing your process on some architectures)
  • Naive layout of your fields would be wasting memory, the JVM reorders fields to improve the overall size of your object
  • JVM implementation requires types to have consistent layout, thus requiring the sub class rules
So... nice clear rules, what could possibly go wrong?

False False Sharing Protection

For one thing, the rules are not part of the JLS, they are just implementation details. If you read Martin Thompson's article about false sharing you'll notice Mr. T had a solution to false sharing which worked on JDK 6, but no longer worked on JDK 7. Here are the 2 versions:

It turns out the JVM changed the way it orders the fields between 6 and 7, and that was enough to break the spell. In fairness there is no rule specified above which requires the fields order to correlate to the order in which they were defined, but ... it's allot to worry about and it can trip you up.
Just as above rules were still fresh in my mind, LMAX (who kindly open sourced the Disruptor) released the Coalescing Ring Buffer. I read through the code and came across the following:

I approached Nick Zeeb on the blog post which introduced the CoalescingRingBuffer and raised my concern that the fields accessed by the producer/consumer might be suffering from false sharing, Nick's reply:
I’ve tried to order the fields such that the risk of false-sharing is minimized. I am aware that Java 7 can re-order fields however. I’ve run the performance test using Martin Thompson’s PaddedAtomicLong instead but got no performance increase on Java 7. Perhaps I’ve missed something so feel free to try it yourself.
Now Nick is a savvy dude, and I'm not quoting him here to criticise him. I'm quoting him to show that this is confusing stuff (so in a way, I quote him to comfort myself in the company of others equally confused professionals). How can we know? here's one way I thought of after talking to Nick:

Using Unsafe I can get the field offset from the object reference, if 2 fields are less than a cache line apart they can suffer from false sharing (depending on the end location in memory). Sure, it's a bit of a hackish way to verify things, but it can become part of your build so in the case of version changes you on't get caught out.
Note that it's the runtime which will determine the layout, not the compiler, so if you build on version X and run on Y it won't help you much.
Enough of that false sharing thing... so negative... why would you care about memory layout apart from false sharing? Here's another example.

The Hot Bunch

Through the blessings of the gods, at about the same time LMAX released the CoalescingRingBuffer, Gil Tene (CTO of Azul) released HdrHistogram. Now Gil is seriously, awesomely, bright and knows more about JVMs than most mortals (here's his InfoQ talk, watch it)  so I was keen to look into his code. And what do you know, a bunch of hot fields:

What Gil is doing here is good stuff, he's trying to get relevant fields to huddle together in memory, which will improve the likelihood of them ending up on the same cache line, saving the CPU a potential cache miss. Sadly the JVM has other plans... 
So here is another tool to help make sense of your memory layout to add to your tool belt: Java Object Layout I bumped into it by accident, not while obsessing about memory layout at all. Here's the output for Histogram:

Note how histogramData jumps to the botton and subBucketMask is moved to the top, breaking up our hot bunch. The solution is ugly but effective, move all fields but the hot bunch to an otherwise pointless parent class:

And the new layout:

Joy! I shall be sending Mr. Tene a pull request shortly :-)

Wednesday, 15 May 2013

Using JMH to Benchmark Multi-Threaded Code

Exploring some of the multi-threading related functionality in JMH via a benchmark of false shared counters vs. uncontended counters.
As discussed in previous post the JMH framework introduced by the good folk of the OpenJDK offers a wealth of out of the box functionality to micro benchmark writers. So much in fact that a single post is not enough! In this post I'll explore some of the functionality on offer for people writing and measuring multi-threaded code.

Sharing != Caring?

The phenomena I'm putting under the microscope here is the infamous False Sharing. There are many explanations of what False Sharing is and why it's bad to be found elsewhere(Intelusenix, Mr. T) but to save you from having to chase links here's mine. Here goes:
In order to maintain memory view consistency across caches modern CPUs must observe which CPU is modifying which location is memory. 2 CPUs modifying the same location are actually sharing and must take turns in doing so leading to contention a much back and forth between the CPUs invalidating each others view on the memory in question. Sadly the protocol used by the CPUs is limited in granularity to cache lines. So when CPU 1 modifies a memory location, the whole cache line (some memory locations to the actual locations right and left) is considered as modified. False sharing occurs when CPU1 and CPU2 are modifying 2 distinct memory locations on the same cache line. They shouldn't interfere with each other, as there is no actual contention, but due to the way cache coherency works they end up uselessly chattering to each other. The image to the right is taken from the Intel link provided above.
I will be using JMH to demonstrate the impact of false sharing by allocating an array of counters and accessing a counter per thread. I am comparing the difference between using all the counters bunched together and spacing them out such that no false sharing happens. I've repeated the same experiment for  plain access, lazySet and volatile access to highlight that false sharing is not limited to memory explicitly involved in multi-threading building blocks.


Multi-Threaded experiments with JMH

JMH offers support for running multi-threaded benchmarks, which is a new feature for such frameworks (to the best of my knowledge). I'll cover some of the available annotations and command line options I used for this benchmark. First up, let's look at the code:

The lazySet and volatile access versions are very similar using an AtomicLongArray instead of the primitive array used above.
In the previous post I briefly described the State annotation and made the most basic use of it. Here's the full description from the javadoc (slightly edited to read as one paragraph here):
State: Marks the state object. State objects naturally encapsulate the state on which benchmark is working on. The Scope of state object defines to which extent it is shared among the worker threads. Three Scope types are supported: 
  • Benchmark: The objects of this scope are always shared between all the threads, and all the identifiers. Note: the state objects of different types are naturally distinct. Setup and TearDown methods on this state object would be performed by one of the worker threads. No other threads would ever touch the state object. 
  • Group: The objects of this scope are shared within the execution group, across all the identifiers of the same type. Setup and TearDown methods on this state object would be performed by one of the group threads. No other threads would ever touch the state object.
  • Thread: The objects of this scope are always unshared, even with multiple identifiers within the same worker thread. Setup and TearDown methods on this state object would be performed by single worker thread exclusively. No other threads would ever touch the state object.
In the above example I'm exploring the use of scoped state further by utilising both Benchmark scope state to store the shared array and a Thread scope state to hold the thread counter index for both the false shared and no sharing case.
The state classes I define are instantiated by JMH rather than directly by my benchmark class. The framework then passes them as parameters to the methods being benchmarked. The only difference between the methods being the use of a different index for accessing the array of counters. As I add more threads, more instances of the Thread state will materialize, but only a single Benchmark state object will be used throughout.


Great Expectations

Before we start looking at what JMH offers us let's consider our expectations of the behaviour under test:
  1. We expect shared and unshared cases to behave the same when there is only a single benchmarking thread.
  2. When there are more than 1 benchmarking threads we expect the unshared case to outperform the shared case.
  3. We expect the unshared case to scale linearly, where the shared case performance should scale badly.
We can put our expectations/theory to the test by running the benchmark and scaling the number of threads participating. JMH offers us several ways to control the number of threads allocated to a benchmark:
  • -t 4 - will launch the benchmark with 4 benchmark threads hitting the code.
  • -tc 1,2,4,8 - will replace both the count of iterations and the number of threads by the thread count pattern described. This example will result in the benchmark being run for 4 iterations, iteration 1 will have 1 benchmarking thread, the second will have 2, the third 4 and the last will have 8 threads.
  • -s - will scale the number of threads per iteration from 1 to the maximum set by -t
I ended up either setting the number of threads directly or using the 'pattern' option. Variety is the spice of life though :-) and its nice to have the choice.
False sharing as a phenomena is a bit random (in Java where memory alignment is hard to control) as it relies on the memory layout falling a particular way on top of the cache line. Consider 2 adjacent memory locations being written to by 2 different threads, should they be placed such that they are on different cache lines in a particular run then that run will not suffer false sharing. This means we'll want to get statistics from a fair amount of runs to make sure we hit the variations we expect on memory layout. JMH supports running multiple forks of your benchmark by using the -f option to determine the number of forks.

And the results are?

Left as an exercise to the reader :-). I implore you to go have a play, come on.
Here's a high level review on my experimentation process for this benchmark:

  • As sanity check, during development I ran the benchmark locally. My laptop is a dual core MBA and is really not a suitable environment for benchmarking, still the results matched the theory although with massive measurement error:
    • Even at this point it was clear the false sharing is a significant issue. Results are consistent, if unstable in their suggestion of how bad it is. 
    • Also as expected, volatile writes are slower than lazy/ordered writes which are slower than plain writes.
  • Once I was happy with the code I deployed it on a more suitable machine. A dual socket Intel Xeon CPU E5-2630 @ 2.30GHz, using OpenJDK 7u9 on CentOS 6.3(OS and JDK are 64bit). This machine has 24 logical cores and is tuned for benchmarking (what that means is a post onto itself). First up I thought I'd go wild and use all the cores, just for the hell of it. Results were interesting if somewhat flaky. 
    • Using all the available CPU power is not a good idea if you want stable results ;-)
  • I decided to pin the process to a single socket and use all it's cores, this leaves plenty of head room for the OS and give me data on scaling up to 12 threads. Nice. At this point I could have a meaningful run with 1,2,4,8,12 threads to see the way false sharing hurts scaling for the different write methods. At this point things started to get more interesting:
    • Volatile writes suffering from false sharing degrade in overall performance as more threads are added, 12 threads suffering from false sharing achieve less than 1 thread. 
    • Once false sharing is resolved volatile writes scale near linearly until 8 threads, the same is true for lazy and plain. 8 to 12 seem to not scale quite so cleanly.
    • Lazy/Plain writes do not suffer quite as much from false sharing but the more threads are at work the larger the difference (up to 2.5x) becomes between false sharing and unshared.
    Scaling from 1-12 threads
  • I focused on the 12 thread case, running longer/more iterations and observing the output, it soon became clear the results for the false shared access suffer from high run-to-run variance. The reason for that, I assumed, was the different memory layout from run to run resulting in different cache line populations. In the case of 12 threads we have 12 adjacent longs. Given the cache line is 8 longs wide they can be distributed over 2 or 3 lines in a number of ways (2-8-2,1-8-3,8-4,7-5,6-6). Going ever deeper into the rabbit hole I thought I could print out the thread specific stats (choose -otss, JMH delivers again :-) ). The thread stats did deliver a wealth of information, but at this point I hit slight disappointment... There was no (easy) way for me to correlate thread numbers from the run to thread numbers in the output, which would have made my life slightly easier in characterising the different distributions...
  • I had enough and played with the data until I was happy with the results. And I finally found a use for a stacked bar chart :-). Each thread score is a different color, the sum of all thread scores is the overall score.
Here's the results for plain/volatile shared/unshared running on 12 cores (20 runs, 5 iterations per run: -f 20 -i 5 -r 1):
Plain writes suffering from false sharing. Note the different threads score
differently and the overall result falls into several different 'categories'.
Plain writes with no false sharing. Threads score is fairly uniform,
and overall score is stable.
Volatile writes suffering from false sharing, note the score varies
by less than it did in the plain write case but similar categories emerge.
Volatile writes with no false sharing, similarly uniform behaviour
for all threads.
Hope you enjoyed the pictures :-)

Sunday, 28 April 2013

Writing Java Micro Benchmarks with JMH: Juicy

Demonstrating use of JMH and exploring how the framework can squeeze every last drop out of a simple benchmark.
Writing micro benchmarks for Java code has always been a rather tricky affair with many pitfalls to lookout for:
  • JIT:
    • Pre/Post compilation behaviour: After 10K(default, tuneable via -XX:CompileThreshold) invocations your code with morph into compiled assembly making it hopefully faster, and certainly different from it's interpreted version.
    • Specialisation: The JIT will optimise away code that does nothing (i.e. has no side effects), will optimise for single interface/class implementations. A smaller code base, like a micro-benchmark, is a prime candidate.
    • Loop unrolling and OSR can make benchmark code (typically a loop of calls to the profiled method) perform different to how it would in real life.
  • GC effects:
    • Escape analysis may succeed in a benchmark where it would fail in real code.
    • A buildup to a GC might be ignored in a run or a collection may be included.
  • Application/Threading warmup: during initialisation threading behaviour and resources allocation can lead to significantly different behaviour than steady state behaviour. 
  • Environmental variance:
    • Hardware: CPU/memory/NIC etc...
    • OS
    • JVM: which one? running with which flags?
    • Other applications sharing resources

Here's a bunch of articles on Java micro-benchmarks which discuss the issue further:
Some of these issues are hard to solve, but some are addressable via a framework and indeed many frameworks have been written to tackle the above.

Let's talk about JMH

JMH (Java Micro-benchmarks Harness or Juicy Munchy Hummus, hard to say as they don't tell you on the site) is the latest and as it comes out of the workshop of the very people who work hard to make the OpenJDK JVM fly it promises to deliver more accuracy and better tooling then most.
The source/project is here and you will currently need to build it locally to have it in your maven repository, as per the instructions. Once you've done that you are good to go, and can set yourself up with a maven dependency on it.
Here is the project I'll be using throughout this post, feel free to C&P to your hearts content. It's a copy of the JMH samples project with the JMH jar built and maven sorted and all that so you can just clone and run without setting up JMH locally. The original samples are pure gold in terms of highlighting the complexities of benchmarking, READ THEM! The command line output is detailed and informative, so have a look to see what hides in the tool box.
I added my sample on top of the original samples, it is basic (very very basic) in it's use of the framework but the intention here is to help you get started, not drown you in detail, and give you a feel of how much you can get out of it for very little effort. Here goes...

It's fun to have fun, but you've got to know how

For the sake of easy comparison and reference I'll use JMH to benchmark the same bit of code I benchmarked with a hand rolled framework here and later on with Caliper here. We're benchmarking my novel way of encoding UTF-8 strings into ByteBuffers vs String.getBytes() vs best practice recommendation of using a CharsetEncoder. The benchmark compares the 3 methods by encoding a test set of UTF-8 strings samples.
Here's what the benchmark looks like when using JMH:
We're using three JMH annotations here:
  1. State - This annotation tells JMH how to share benchmark object state. I'm using the Thread scope which means no sharing is desirable. There are 2 other scopes available Group (for sharing the state between a group of threads) and Benchmark (for sharing the state across all benchmark threads).
  2. Setup - Much like the JUnit counterpart the Setup annotation tells JMH this method needs to be called before it starts hammering my methods. Setup methods are executed appropriately for your chosen State scope.
  3. GenerateMicroBenchmark - Tells JMH to fry this method with onions.

A lot of good tricks, I will show them to you, your mother will not mind at all if I do

To get our benchmarks going we need to run the generated microbenchmarks.jar. This is what we get:

Nice innit?
Here's the extra knobs we get on our experiment for our effort:
  • I'm using some command line options to control the number of iterations/warmup iterations, here's the available knobs on that topic:
    • i - number of benchmarked iterations, use 10 or more to get a good idea
    • r - how long to run each benchmark iteration
    • wi - number of warmup iterations
    • w - how long to run each warmup iteration (give ample room for warmup, how much will depend on the code you try and measure, try and have it execute 100K times or so)
  • To choose which benchmarks we want to run we need to supply a regular expression to filter them or ".*" to run all of them. If you can't remember what you packed use:
    • v - verbose run will also print out the list of available benchmarks and which ones were selected by your expression
    • l - to list the available benchmarks
  • If you wish to isolate GC effects between iterations you can use the gc option, this is often desirable to help getting more uniform results.
  • Benchmarks are forked into separate VMs by default. If you wish to run them together add "-f false"
The output is given for every iteration, then a summary of the stats. As I'm running 3 iterations these are not very informative (this is not recommended practice and was done for the sake of getting sample outputs rather than accurate measurement, I recommend you run more than 10 iterations and compare several JMH runs for good measure) but if I was to run 50 iterations they'd give me more valuable data. We can choose from a variety of several output formats to generate graphs/reports later. To get CSV format output add "-of csv" to your command line, which leaves you to draw your own conclusions from the data (no summary stats here):
The above has your basic requirements from a benchmark framework covered:
  1. Make it easy to write benchmarks
  2. Integrate with my build tool
  3. Make it easy to run benchmarks (there's IDE integration on the cards to make it even easier)
  4. Give me output in a format I can work with
I'm particularly happy with the runnable jar as a means to packaging the benchmarks as I can now take the same jar and try it out on different environments which is important to my work process. My only grumble is the lack of support for parametrization which leads me to use a system property to switch between the direct and heap buffer output tests. I'm assured this is also in the cards.

I will show you another good game that I know

There's even more! Whenever I run any type of experiment the first question is how to explain the results and what differences one implementation has over the other. For small bits of code the answer will usually be 'read the code you lazy bugger' but when comparing 3rd party libraries or when putting large  compound bits of functionality to the test profiling is often the answer, which is why JMH comes with a set of profilers:
  1. gc: GC profiling via standard MBeans
  2. comp: JIT compiler profiling via standard MBeans
  3. cl: Classloader profiling via standard MBeans
  4. hs_rt: HotSpot (tm) runtime profiling via implementation-specific MBeans
  5. hs_cl: HotSpot (tm) classloader profiling via implementation-specific MBeans
  6. hs_comp: HotSpot (tm) JIT compiler profiling via implementation-specific MBeans
  7. hs_gc: HotSpot (tm) memory manager (GC) profiling via implementation-specific MBeans
  8. hs_thr: HotSpot (tm) threading subsystem via implementation-specific MBeans
  9. stack: Simple and naive Java stack profiler

Covering the lot exceeds the scope of this blog post, let's focus on obvious ones that might prove helpful for this experiment. Running with the gc and hs_gc profiler (note: this should be done with fixed heap for best result, just demonstrating output here) give this output:

The above supports the theory that getBytes() is slower because it generates more garbage than the alternatives, and highlights the low garbage impact of custom/charset encoder. Running with the stack and hs_rt profilers gives us the following output:
What I can read from it is that getBytes() spends less time in encoding then the other 2 due to the overheads involved in getting to the encoding phase. Custom encoder spends the most time on on encoding, but what is significant is that as it outperforms charset encoder, and the ratios are similar we can deduce that the encoding algorithm itself is faster.

But that is not all, oh no, that is not all

The free functionality does not stop here! To quote Tom Waits:
It gets rid of your gambling debts, it quits smoking
It's a friend, and it's a companion,
And it's the only product you will ever need
Follow these easy assembly instructions it never needs ironing
Well it takes weights off hips, bust, thighs, chin, midriff,
Gives you dandruff, and it finds you a job, it is a job
...
'Cause it's effective, it's defective, it creates household odors,
It disinfects, it sanitizes for your protection
It gives you an erection, it wins the election
Why put up with painful corns any longer?
It's a redeemable coupon, no obligation, no salesman will visit your home
'What more?' you ask, well... there's loads more functionality around multi threading I will not attempt to try in this post and several more annotations to play with. In a further post I'd like to go back and compare this awesome new tool with the previous 2 variations of this benchmark and see if, how and why results differ...
Many thanks to the great dudes who built this framework of whom I'm only familiar with Master Shipilev (who also took time to review, thanks again), they had me trial it a few months back and I've been struggling to shut up about it ever since :-)

Sunday, 7 April 2013

135 Million messages a second between processes in pure Java

Porting an existing single producer/single consumer concurrent queue into an IPC mechanism via memory mapped files and getting 135 million messages throughput in pure Java.
In my previous post I covered a single producer/consumer queue developed and shared by Martin Thompson capable of delivering an amazing 130M messages per second. The queue he delivered is a great tool for communicating between threads, but sometimes communicating between threads is not enough. Sometime you need to leave your JVM and go out of process. Inter Process Communications (IPC) is a different problem to inter thread communications, can it be cracked by the same approach?

IPC, what's the problem?

Inter Process Communication is an old problem and there are many ways to solve it (which I will not discuss here). There are several attractions to specialized IPC solutions for Java:
  • Faster than socket communication.
  • An out of process integration option with applications written in other languages.
  • A means of splitting large VMs to smaller ones improving performance by allowing GC and JIT specialization.
For IPC to be attractive it has to be fast, otherwise you may as well go for network based solutions which would extend beyond your local machine uniformly. I attended an Informatica conference a while back and got talking to Todd Montgomerey about the Disruptor and mechanical sympathy, he suggested that IPC should be able to perform as well as inter thread messaging. I found the idea interesting and originally meant to port the Disruptor, but Martin's queue is simpler (and quicker) so I went for that instead. Starting with a good algorithm/data structure is very good indeed, now I just needed to bridge the gap and see if I can maintain the benefits.

 

Off the heap we go!

To do IPC we must go off heap. This has several implications for the queue, most importantly references are not supported. Also note persistence to and from the queue is required, though one could extend my implementation to support a zero copy interaction where a struct is acquired, written and committed instead of the offer method, and similarly acquired, read and finally released instead of the poll method. I plan to make several flavours of this queue to test out these ideas in the near future.
My IPC queue uses a memory mapped file as a means of acquiring a chunk of shared memory, there is no intention to use the persisted values though further development in that direction may prove interesting to some. So now that I got me some shared memory, I had to put the queue in it.
I started by laying out the queue counters and cached counters. After realizing the counters need to be aligned to work properly I learnt how to align memory in Java. I went on to verify that aligned memory offers the guarantees required for concurrent access. Quick summary:
  • aligned access means writing data types to addresses which divide by their size.
  • unaligned access is not atomic, which is bad for concurrency :(
  • unaligned access is slow, which is bad for performance :(
  • unaligned access may not work, depending on OS and architecture. Not working is very bad :(
Sorting out alignment is not such a big deal once you know how it works. One of the nice things about going off-heap was that solving false sharing has become far more straightforward. Move your pointer and you're in the next cache line, job done. This left me rather frustrated me with the tricks required to control memory layout in Java. Going back to the original implementation you will notice the Padded classes who's role it is to offer false sharing protection. They are glorious hacks (with all due respect) made necessary by this lack of control. The @Contended annotation coming in JDK 8 will hopefully remove the need for this.
This is how the memory layout worked out:


To illustrate in glorious ASCII graphics (each - is a byte), this is what the memory layout looks like when broken into cache lines:
|--------|--------|--------|head....|--------|--------|--------|--------|
|--------|--------|--------|tailCach|--------|--------|--------|--------|
|--------|--------|--------|tail----|--------|--------|--------|--------|
|--------|--------|--------|headCach|--------|--------|--------|--------|
|int1int2|int3int4|int5int6|etcetcet|cetcetce|tcetcetc|etcetcet|cetcetce|
...



I played around with mixing off heap counters with on heap buffer but in the interest of brevity I'll summarize and say the JVM does not like that very much and the end result performance is not as good as all heap/off-heap solutions. The code is available with everything else.
Once alignment and memory layout were sorted I had to give up the flexibility of having reference pointers and settle for writing my data (an integer) directly into the memory. This leaves my queue very restrictive in it's current form. I intend to revisit it and see what I can do to offer a more extendable API on top of it.
Let me summarize the recipe at this point:
  • Create a memory mapped file large enough to hold:
    • 4 cache lines for counters/cached counters.
    • 4 bytes(per integer) * queue capacity (must be a power of 2).
    • 1 spare cache line to ensure you can align the above to the cache line.
  • Get a mapped byte buffer, which is a direct byte buffer on top of the mapped memory.
  • Steal the address and get the contained aligned byte buffer.
  • Setup pointers to the counters and the beginning of the buffer
  • Replace use of natural counters with off heap counters accessed via Unsafe using the pointers.
  • Replace use of array with use of offset pointers into buffer and Unsafe access.
  • Test and debug until you work out the kinks...
The above code should give you a fair idea how it works out and the rest is here. This queue can work in process and out of process as demonstrated in the tests included in the repository. Now that it works (for the limited use case, and with room for further improvement... but works), is it fast enough? not so fast? is it...<gasp> ... FASTER????!?!?!

Smithers, release the hounds

Here are the numbers for using the different implementations in process:

Implementation/AffinitySame coreCross coreCross socket
P1C1QueueOriginal3110M130M19M
P1C1OffHeapQueue130M220M200M
P1C1QueueOriginalPrimitive124M220M215M


Confused? Let me explain. First line is the measurements taken for the original queue. Similar to what was presented in prev. post, though I saw a slight improvement in the results with increasing the compile threshold to 100000.
The second line is my offheap implementation of same algorithm. It is significantly faster. This is not IPC yet, this is in process. The reason it is faster is because data is inlined in the queue, which means that by loading an entry in the queue we get the data as opposed to a reference to the data. Getting a reference is what you get when you have and Object[] array. The array holds the references and the data is elsewhere, this seems to make it more painful as we get further from the producer.
The last entry is a mutation of P1C1QueueOriginal3 into a primitive array backed queue to compare performance like for like. As you can see this displays very similar results to the off heap implementation supporting the theory that data in-lining is behind the observed performance boost.
The lesson here is an old one, namely that pointer chasing is expensive business further amplified by the distance between the producing CPU and consuming CPU.
The off-heap queue can offer an alternative to native code integration as the consuming thread may interact directly with the off-heap queue and write results back to a different off-heap queue.
Running a similar benchmark adapted to use a memory mapped file as the backing DirectByteBuffer for the off-heap queue we get:
    same core      - ops/sec=135M
    across cores   - ops/sec=98M
    across sockets - ops/sec=25M



JOY! a pure Java IPC that gives you 135M messages per second is more throughput then you'd get with most commercial products out there. This is still not as fast as the same queue in process and I admit I'm not sure what the source of the performance difference is. Still I am quite happy with it.
A few notes/observations from the experimentation process:
  1. I got a variety of results, stabilizing around different average throughputs. I chose the best for the above summary and plan to go into detail about the results in the near future.
  2. The JVM was launched with: -XX:+UseCondCardMark -XX:CompileThreshold=100000
  3. Removing the Thread.yield from the producer/consumer loops improved performance when running on the same core, but made it worse otherwise.
  4. Moving the queue allocation into the test loop changes the performance profile dramatically.
  5. I've not had time to fully explore the size of the queue as a variable in the experiment but the little I've done suggests it makes a difference, choose the right size for your application.
I realize this post is rather less accessible than the previous one, so if you have any questions please ask.

Sunday, 17 March 2013

Single Producer/Consumer lock free Queue step by step

Reading through a fine tuned lock free single producer/consumer queue. Working through the improvements made and their respective impact.
Back in November, Martin Thompson posted a very nice bit of code on GitHub to accompany his Lock Free presentation. It's a single producer/consumer queue that is very fast indeed.  His repository includes the gradual improvements made, which is unusual in software, and offers us some insights into the optimization process. In this post I break down the gradual steps even further, discuss the reasoning behind each step and compare the impact on performance. My fork of his code is to be found here and contains the material for this post and the up and coming post. For this discussion focus on the P1C1QueueOriginal classes.
The benchmarks for this post were run on a dual socket Intel(R) Xeon(R) CPU E5-2630 @ 2.30GHz machine running Linux and OpenJdk 1.7.09. A number of runs were made to unsure measurement stability and then a represantative result from those runs was taken. Affinity was set using taskset. Running on same core is pinned to 2 logical cores on the same physical core. Across cores means pinned to 2 different physical cores on the same socket. Across sockets means pinned to 2 different cores on different sockets.

Starting point: Lock free, single writer principle

The initial version of the queue P1C1QueueOriginal1, while very straight forward in it's implementation already offers us a significant performance improvement and demonstrates the important Single Writer Principle. It is worth while reading and comparing offer/poll with their counter parts in ArrayBlockingQueue.
Running the benchmark for ArrayBlockingQueue on same core/across cores/across sockets yields the following result (run the QueuePerfTest with parameter 0):
same core      - ops/sec=9,844,983
across cores   - ops/sec=5,946,312 [I got lots of variance on this on, took an average]
across sockets - ops/sec=2,031,953

We expect this degrading scale as we pay more and more for cache traffic. These result are our out of the box JDK baseline.
When we move on to our first cut of the P1C1 Queue we get the following result (run the QueuePerfTest with parameter 1):
same core      - ops/sec=24,180,830[2.5xABQ]
across cores   - ops/sec=10,572,447[~2xABQ]
across sockets - ops/sec=3,285,411[~1.5xABQ]

Jumping jelly fish! Off to a good start with large improvements on all fronts. At this point we have gained speed at the expense of limiting our scope from multi producer/consumer to single producer/consumer. To go further we will need to show greater love for the machine. Note that the P1C1QueueOriginal1 class is the same as Martin's OneToOneConcurrentArrayQueue.

Lazy loving: lazySet replaces volatile write

As discussed previously on this blog, single writers can get a performance boost by replacing volatile writes with lazySet. We replace the volatile long fields for head and tail with AtomicLong and use get for reads and lazySet for writes in P1C1QueueOriginal12. We now get the following result (run the QueuePerfTest with parameter 12):
same core      - ops/sec=48,879,956[2xP1C1.1]
across cores   - ops/sec=30,381,175[3xP1C1.1 large variance, average result]
across sockets - ops/sec=10,899,806[3xP1C1.1]

As you may or may not recall, lazySet is a cheap volatile write such that it provides happens-before guarantees for single writers without forcing a drain of the store buffer. This manifests in this case as lower overhead to the thread writing, as well as reduced cache coherency noise as writes are not forced through.

Mask of power: use '& (k pow 2) - 1 instead of %

The next improvement is replacing the modulo operation for wrapping the array index location with a bitwise mask. This 'trick' is also present in ring buffer implementations, Cliff Click CHM and ArrayDeque. Combined with the lazySet improvement this version is Martin's OneToOneConcurrentArrayQueue2 or P1C1QueueOriginal2 in my fork.
The result (run the QueuePerfTest with parameter 2):
same core      - ops/sec=86,409,484[1.9xP1C1.12]
across cores   - ops/sec=47,262,351[1.6xP1C1.12 large variance, average result]
across sockets - ops/sec=11,731,929[1.1xP1C1.12]

We made a minor trade off here, forcing the queue to have a size which is a power of 2 and we got some excellent mileage out of it. The modulo operator is quite expensive both in terms of cost and in terms of limiting instruction throughput and it is a trick worth employing when the opportunity rises.
So far our love for the underlying architecture is expressed by offering cheap alternatives for expensive instructions. The next step is sympathy for the cache line.

False sharing

False sharing is described elsewhere(here, and later here, and more recently here where the coming of a solution by @Contended annotation is discussed). To summarize, from the CPU cache coherency  system perspective if a thread writes to a cache line then it 'owns' it, if another thread then needs to write to the same line they need to exchange ownership. When this happens for writes into different locations the sharing is 'falsely' assumed by the CPU and time is wasted on the exchange. The next step of improvement is made by padding the head and tail fields such that they are not on the same cache line in P1C1QueueOriginal21.
The result (run the QueuePerfTest with parameter 21):
same core      - ops/sec=88,709,910[1.02xP1C1.2]
across cores   - ops/sec=52,425,315[1.1xP1C1.2]
across sockets - ops/sec=13,025,529[1.2xP1C1.2]

This made less of an impact then previous changes. False sharing is a less deterministic side effect and may manifest differently based on luck of the draw memory placement. We expect code which avoids false sharing to have less variance in performance. To see the variation run the experiment repeatedly, this will result in different memory layout of the allocated queue.

Reducing volatile reads

Common myth regarding volatile reads is that they are free, the next improvement step shows that to be false. In P1C1QueueOriginal22 I have reversed the padding of the AtomicLong (i.e head and tail are back to being plain AtomicLong) and added caching fields for the last read value of head and tail. As these values are only used from a single thread (tailCache is used by consumer, headCache used by producer) they need not be volatile. Their only use is to reduce volatile reads to a minimum. Normal reads, unlike volatile reads are open to greater optimization and may end up in a register, volatile reads are never from a register (i.e always from memory).
The result (run the QueuePerfTest with parameter 22):
same core      - ops/sec=99,181,930[1.13xP1C1.2]
across cores   - ops/sec=80,288,491[1.6xP1C1.2]
across sockets - ops/sec=17,113,789[1.5xP1C1.2]

By Toutatis!!! This one is a cracker of an improvement. Not having to load the head/tail value from memory as volatile reads makes a massive difference.
The last improvement made is adding the same false sharing guard we had for the head and tail fields around the cache fields. This is required as these are both written at some point and can still cause false sharing, something we all tend to forget can happen to normal fields/data even if it is nothing to do with concurrency. I've added a further implementation P1C1QueueOriginal23 where only the cache fields are padded and not the head and tail. It makes for a slight further improvement, but as the head and tail are still suffering from false sharing it is not a massive step forward.

All together now!

The final version P1C1QueueOriginal3 packs together all the improvements made before:
  • lock free, single writer principle observed. [Trade off: single producer/consumer]
  • Set capacity to power of 2, use mask instead of modulo. [Trade off: more space than intended]
  • Use lazySet instead of volatile write.
  • Minimize volatile reads by adding local cache fields. [Trade off: minor size increment]
  • Pad all the hot fields: head, tail, headCache,tailCache [Trade off: minor size increment]
The result (run the QueuePerfTest with parameter 3):
same core      - ops/sec=110,405,940[1.33xP1C1.2]
across cores   - ops/sec=130,982,020[2.6xP1C1.2]
across sockets - ops/sec=19,338,354[1.7xP1C1.2]

To put these results in the context of the ArrayBlocking queue:
same core      - ops/sec=110,405,940[11xABQ]
across cores   - ops/sec=130,982,020[26xABQ]
across sockets - ops/sec=19,338,354[9xABQ]

This is great improvement indeed, hat off to Mr. Thompson.

Summary, and the road ahead

My intent in this post was to give context and add meaning to the different performance optimizations used in the queue implementation. At that I am just elaborating Martin's work further. If you find the above interesting I recommend you attend one of his courses or talks (if you can find a place).
During the course of running these experiments I encountered great variance between runs, particularly in the case of running across cores. I chose not to explore that aspect in this post and picked representative measurements for the sake of demonstration. To put it another way: your mileage may vary.
Finally, my interest in the queue implementation was largely as a data structure I could port off heap to be used as an IPC messaging mechanism. I have done that and the result are in my fork of Martin's code here. This post evolved out of the introduction to the post about my IPC queue, it grew to the point where I decided they would be better apart, so here we are. The IPC queue is working and achieves similar throughput between processes as demonstrated above... coming soon.

Wednesday, 6 March 2013

Merging Queue, ArrayDeque, and Suzie Q

A merging queue is a map/queue data structure which is useful in handling data where only the latest update/value for a particular topic is required and you wish to process the data in FIFO manner. Implement 2 flavours, benchmark and discussion.
The merging queue is a useful construct for slow consumers. It allows a bounded queue to keep receiving updates with the requirement for space limited to the number of keys. It also allows the consumer to skip old data. This is particularly of interest for systems dealing with fast moving data where old data is not relevant and will just slow you down. I've seen this requirement in many pricing systems in the past few years, but there are other variations.
Here's the interface:

What about LinkedHashMap?

Now it is true that LinkedHashMap offers similar functionality and you could use it to implement a merging queue as follows:
This works, but the way we have to implement poll() is clumsy. What I mean by that is that it looks like we are asking for allot more than we want to work around some missing functionality. If you dig into the machinery behind the expression: "lastValMap.remove(lastValMap.keySet().iterator().next())" there's an awful lot of intermediate structures we need to jump through before we get where we are going. LinkedHashMap is simply not geared toward being a queue, we are abusing it to get what we want.

ArrayDeque to the rescue!

ArrayDeque is one of the unsung heroes of the java collections. If you ever need a non-concurrent queue or stack look no further than this class. In it's guts you'll find the familiar ring buffer. It doesn't allocate or copy anything when you pop elements out or put them in(unless you exceed the capacity). It's cache friendly(unlike a linked list). It's LOVELY!
Here's a merging queue implementation using a HashMap and ArrayDeque combined:
You can replace the HashMap with an open address implementation to get more cache friendly behaviour for key collisions if you like, but in the name of KISS we won't go down that particular rabbit hole. As the comment states, setting entries to null rather than removing them is an optimization with a trade off. If your key set is not of a finite manageable range then this is perhaps not the way to go. As it stands it saves you some GC overheads. This optimization is not really open to you with LinkedHashMap where the values and their order are managed as one.
ArrayDeque is a better performer than any other queue for the all the reasons discussed in this StackOverflow discussion, which boil down to:

  • backed by a ring buffer (yes, like the Disruptor! you clever monkeys) 
  • it uses a power of 2 sized backing array, which allows it to replace modulo(%) with a bit-wise and(&) which works because x % some-power-of-2 is the same as x & (some-power-of-2 - 1)
  • adding and removing elements is all about changing the head/tail counters, no copies, no garbage (until you hit capacity).
  • iterating through an array involves no pointer chasing, unlike linked list.


I like the way you walk, I like the way you talk, Susie Q!

I'm using a micro benchmarking framework which is both awesome and secret, so sadly the benchmark code is not entirely peer review-able. I will put the benchmark on GitHub when the framework makers will give me the go ahead which should be soon enough. Here are the benchmarks:

The results(average over multiple runs) are as follows:
Experiment                            Throughput           Cost
array.measureOffer,                   100881.077 ops/msec, 10ns
array.measureOffer1Poll1,              41679.299 ops/msec, 24ns
array.measureOffer2Poll1,              30217.424 ops/msec, 33ns
array.measureOffer2Poll2,              21365.283 ops/msec, 47ns
array.measureOffer1000PollUntilEmpty,    102.232 ops/msec, 9804ns
linked.measureOffer,                  103403.692 ops/msec, 10ns
linked.measureOffer1Poll1,             24970.200 ops/msec, 40ns
linked.measureOffer2Poll1,             16228.638 ops/msec, 62ns
linked.measureOffer2Poll2,             12874.235 ops/msec, 78ns
linked.measureOffer1000PollUntilEmpty,    92.328 ops/msec, 10830ns
--------

Interpretation:

  • Offer method cost for both implementations is quite similar at 10ns, with the linked implementation marginally faster perhaps.
  • Poll method cost is roughly 14ns for the array deque based implementation, and 30ns for the linked implementation. Further profiling has also shown that while the deq implementation generates no garbage the linked implementation has some garbage overhead.
  • For my idea of a real world load the array deq is 10% faster.
Depending on the ratio between offer and poll the above implementation can be quite attractive. Consider for instance that queues/buffer buildup tends to be either empty, or quite full when a burst of traffic comes in. When you are dealing with relatively little traffic the cost of polling is more significant, when you are merging a large buildup of updates into your queue the offer cost is more important. Luckily this is not a difficult choice as the array deque implementation is only marginally slower for offering and much faster for polling.
Finally, a small real world gem I hit while writing this blog. When benchmarking the 1k offer/queue drain case for the linked implementation I hit this JVM bug - "command line length affects performance". The way it manifested was bad performance (~50 ops/ms) when running with one set of parameters and much better performance when using some extra parameters to profile GC which I'd have expected to slow things down if anything. It had me banging my head against the wall for a bit, I wrote a second benchmark to validate what I considered the representative performance etc. Eventually I talked to Mr. Shipilev who pointed me at this ticket. I was not suffering the same issue with the other benchmarks, or the same benchmark for the other implementation which goes to show what a slippery sucker this is. The life lesson from this is to only trust what you measure. I can discard the benchmark result if I like, but if you change your command line arguments in a production environment and hit a kink like that you will have a real problem.
Many thanks to Doug Lawrie with whom I had a discussion about his implementation of a merging event queue (a merging queue stuck on the end of a Disruptor) which drove me to write this post.

Update 08/03/2013: Just realized I forgot to include a link to the code. Here it is.

Tuesday, 12 February 2013

Alignment, Concurrency and Torture (x86)

Summary: Exploring the effects of unaligned/aligned concurrent access in Java for off/on heap memory, how to test for correctness, torture and cushions. 

Concurrent access to memory is a pain at the best of times. The JMM offers Java programmers a way to reason about concurrency in a platform independent manner, and while not without fault it's certainly better then having no memory model at all. Other languages are catching up of course, but the focus of this article is not comparative concurrency, so look it up yourselves ;).
When we go off heap via Unsafe or Direct Byte Buffers, or when we utilize Unsafe to gain direct access to on heap memory we are on our own. Direct memory access strips away the JVM guarantees and leaves you exposed to the underlying OS/architecture memory access rules and the interpretation of the JVM runtime.
Should you be worried?
Silly question.
If you are not the worried type, follow this link with my blessings.
Worried? Here's 2 guarantees now gone to worry about:
  • Atomicity - memory access on the JVM is guaranteed to be atomic, even when the underlying system does not support it. No such similar guarantees are available for ByteBuffers. In fact ByteBuffers are explicitly not thread safe, so atomicity is not guaranteed at all. Unsafe has no guarantees for anything, so caution is advised.
  • Word tearing - this means concurrent updates to adjacent bytes in memory do not result in corruption of state. Some processors perform memory writes in units of a word, which can be 2 or 4 bytes. This can result in the unmolested parts of the word overwriting bytes written to by other threads. While the JLS is binding for 'plain' memory access (via assignment) the guarantee does not necessarily extend to off heap memory.
The problem with verifying these behaviours is that concurrency guarantees and issues are by definition timing dependent. This means that while most of the time there is no issue, that is not to say that throwing a few extra processors in, or changing memory layout, or maybe just waiting a bit longer will not trigger the issue...
So how do you prove your code is thread safe once you stepped off heap and into the wild? I recently answered a question on stack overflow regarding testing for correctness under multi-threaded access of a data structure. My answer was accepted, so obviously it's correct, I cut out the niceties to keep this short:
 1.  Formally prove the correctness of your program in the terms of the JMM. 

 2.  Construct test cases which demonstrate intended above behaviour by using count down latches or other means of suspending threads of execution at specific points in your program to force contention.

 3.  Statistically demonstrate correctness by exercising the code over a sufficient period of time from multiple threads.
This is all fine for when you write 'safe' Java, but option 1, and therefore 2 go out the window when you are straying off spec. This is an important observation to make: if the rules are not well defined then testing the edge cases for those rules is not that meaningful.
Which leaves us with number 3.

Science is torture

It is a fact well known to those who know it well that a scientific experiment can only prove your theory is either definitely wrong or so far right. This (limited) power of induction from past experiences is what empiricism is all about. To prove JVM implementations implement the JMM correctly Mr. Shipilev (Benchmarking Tzar) has put together the Java Concurrency Torture Suite into which he drags JVMs and tries to break them on any number of architectures (not to worry, he tries it on cute bunnies first). It's a great project for anyone trying to reason about the JMM as it gives you examples in code of edge cases and the corresponding range of expected behaviours on all architectures. It also means that if you got questionable behaviours you want to explore on a particular architecture Aleksey just saved you allot of work.
The JCTS project is a very professional piece of work, so I'll try not to mince words over what you can read in the readme. The general approach is that if you leave it (the concurrent behaviour) long enough the edge cases will come crawling out of the woodwork. Here's one bug found using it, the discussion in the ticket is quite interesting.
The tests run also report how many times each state was detected, which has the added benefit of educating us on the distribution of occurrences for the different cases. The next version may support ASCII bar charts too. Instead of yapping on about it, lets see how atomicity is tackled and all will become clear.

 

Biggles! Fetch...THE CUSHIONS!

The JCTS already has atomicity tests, I just had to my own to target unaligned access. I added 2 tests, one for unaligned access within the line and one for crossing the cache line, for brevity this is just the long test:
The method for allocating aligned direct memory byte buffers is described in this post. The cross cache line case is very similar with the position being set intentionally to cause cache straddled access.
I expected the straddled line access to be non-atomic and for once I was right(read on, it doesn't last)!!!
Note the fine printed output, a joy! The test run several times and we can see how in the majority of cases things are fine, the observed result is either 0 or -1. But there are other values which should not be there (when our mother is not), those observed values are the result of the lack of atomicity. Reading the value in some undefined trashed state gives us... mmm... trash. As I suspected! Patting myself on the shoulder I moved on to the unaligned non-straddling tests.
The results suggested that unaligned access inside the cache line is atomic on my machine. I admit it was not what I thought. This is a bit surprising. Good old Intel, they can really make this work!


Our chief weapon is surprise

Going through these experiments and results reminded me of a comment made by Gil Tene (Azul CTO) on Martin Thompson's blog:
Cache alignment is only a performance issue. Never a correctness issue. Cache lines do not add or remove atomicity. Specifically, non atomic updates within a single cache line can still be seen non-atomically by other threads.The Java spec only guarantees atomic treatment (as in the bits in the field are read and written all-or-nothing) for boolean, byte, char, short, int, and float field types. The larger types (long and double) *may* be read and written using two separate operations (although most JVMs will still perform these as atomic, all-or-nothing operations).
BTW, all x86 variants DO support both unaligned data types in memory, as well as LOCK operations on such types. This means that on an x86, a LOCK operations that spans boundary between two cache lines will still be atomic (this little bit of historical compatibility probably has modern x86 implementors cursing each time).
I read the above to suggest Gil was thinking atomicity is a given across the cache line, so I sent him an e-mail to discuss my findings. Now Gil has been under the JVM hood more than most people, and this turned out to be quite enlightening:
I'm not saying that cross cache line boundaries cannot induce non-atomicity to occur more often. I *am* saying that whenever you see non-atomicity on accesses that cross cache line boundaries, that same non-atomicity potential was there [for the same code] even within a single cache line, and is usually there regardless of whether or not the in-same-cache-line access is aligned to the type size boundary. 
He went on to point out that the whole experimentation as means to proving correctness is not possible (I told you it's well known) and that where the spec. doesn't promise you anything, expect nothing:
... there is no way to verify atomicity by experimentation. You can only disprove it, but cannot prove it.
The only way you can control atomicity on a known x86 arch. for off heap access would be to control ALL x86 instructions used to access your data, where knowing exactly which instructions are used and what the layout restrictions you impose would allow you to prove atomicity through known architectural qualities.

 

I didn't expect a kind of Spanish Inquisition

Mmmm... but how would I know? Given there are no guarantees on the one hand and the limitations of experimentation on the other, we are back to square one... And while musing on Gil's wise words I ran the unaligned access test again, and what do you know:

Bugger. Bugger. Bugger. It seems like a special value sneaks in every once in a while. This happens quite infrequently and as things worked out it failed to happen until my little exchange with Gil was over, leading me to believe Gil possesses that power of the experienced to have reality demonstrate for them on cue. On the other hand, now I know it doesn't work. Certainty at last.
What about aligned memory access? I gave it a good run and saw no ill effects, but given Gil's warnings this may not be enough. The JCTS already had a test for aligned access and there are no particular expectations there. This is important: it is OK for a JVM implementation to give you no atomicity on putLong/Int/... (via Unsafe or DirectByteBuffer). Aligned access on x86 is atomic, but as Gil points out we don't have full control on every level so that may not be a strong enough guarantee...

Conclusion? 

The take away from the above is that offheap storage should be treated as if atomicity is not guaranteed, especially if no attempt is made at proper type alignment. Is that the end of the world? certainly not, but it does mean precautions must be taken if data is to be written and read concurrently. Note that mutating shared data is always a risk, but you may go further without noticing issues with a POJO approach. Also note that some of the tools used for proper object publication are not available, in particular final fields.


Many thanks to Aleksey Shipilev and Gil Tene for their feedback and guidance along the way. I leave word tearing as an exercise to the interested reader.

Update 14/02/2013:
Martin Thompson got back to me asking if the same atomicity issue is still there for putOrdered*/put*Volatile when the cache line is crossed. I expected it would be (so did he), wrote some more tests and it turned out we were both correct. It makes no difference if you use put*/putOrdered*/put*Volatile/compareAndSwap*, if you cross the cache line there is no atomicity. Similarly I would expect unaligned access within the cache line to lack atomicity, but have not had time to add the tests. Code is on GitHub for your pleasure.
Thanks Martin.
Update 16/03/2013:
The GitHub links to the code are temporarily removed, please contact for details if required.