Monday, 19 October 2015

Expanding The Queue interface: Relaxed Queue Access

Continuing from the previous post on expanding the Queue interface to support new ways of interacting with queues, I have gone ahead and implemented relaxedOffer/Poll/Peek for the JCTools queues. This was pretty easy, as the original algorithms all required a bit of mangling to support the stronger semantics, and relaxing them made life easier. Implementing the batch/continuous interactions was a bit more involved, but the results were interestingly rewarding. I will save discussing the drain/fill results for a follow-up post.
The new interface is here, feedback is welcome. The bottom line of the results discussed below:

  • Throughput is improved in most cases (roughly 5-40%, depending on the queue and the load)
  • Cost of failure (when offer/poll fail to return/add a value) is dramatically reduced
  • Latency for bursts is improved (as much as 40% reduction)

Relaxed Access API

This ended up a very small API change, most of the difficulty here being the choice of name. I went with the 'relaxed' prefix as suggested by J.P. Bempel (as opposed to the 'weak' prefix considered earlier):
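Roughly, the additions look like this (a sketch of the new methods only; the javadoc wording here is mine, see the linked interface for the authoritative contract):

  public interface MessagePassingQueue<T> {
    // ... offer/poll/peek and the rest of the interface omitted ...

    /** Like offer, but may fail spuriously: a false return does not prove the queue is full. */
    boolean relaxedOffer(T e);

    /** Like poll, but may fail spuriously: a null return does not prove the queue is empty. */
    T relaxedPoll();

    /** Like peek, but may fail spuriously: a null return does not prove the queue is empty. */
    T relaxedPeek();
  }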

Relaxed vs. Non-Relaxed Example

The 'relaxation' of the queue full/empty reporting in these methods allows us to remove an extra load of the 'opposite' index and avoid waiting around for delayed visibility. Here's how MpscArrayQueue::poll was changed to become 'relaxed':
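The following sketch captures the difference. The lv/lp/sp/so prefixes follow the JCTools convention for volatile/plain/ordered loads and stores, and calcElementOffset/buffer are the usual array queue plumbing; treat this as an illustration of the shape of the change rather than the exact class code:

  // Strict j.u.Queue poll: a null element does not prove the queue is empty, because a
  // producer may have moved the producer index but not yet stored the element.
  public E poll() {
    final long cIndex = lpConsumerIndex();      // plain load, we are the only consumer
    final long offset = calcElementOffset(cIndex);
    E e = lvElement(buffer, offset);            // volatile load of the element
    if (e == null) {
      if (cIndex == lvProducerIndex()) {        // extra volatile load of the 'opposite' index
        return null;                            // provably empty
      }
      do {                                      // a producer is mid-offer, wait for the element
        e = lvElement(buffer, offset);
      } while (e == null);
    }
    spElement(buffer, offset, null);
    soConsumerIndex(cIndex + 1);
    return e;
  }

  // Relaxed poll: if the element is not visible we just report failure - no producerIndex
  // load and no waiting around.
  public E relaxedPoll() {
    final long cIndex = lpConsumerIndex();
    final long offset = calcElementOffset(cIndex);
    E e = lvElement(buffer, offset);
    if (e == null) {
      return null;
    }
    spElement(buffer, offset, null);
    soConsumerIndex(cIndex + 1);
    return e;
  }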

So the relaxed version simply drops the wait-for-visibility branch; it doesn't look like much. We wouldn't really expect this gap in offer (between a producer claiming a slot by moving the producer index and actually storing the element) to happen very often, so the overhead of the strict version typically amounts to an extra load of the producerIndex and a predictable branch.
Also, since poll on an MPSC is actually cheaper than the offer anyhow, aren't we optimizing a non-issue? Could we not be actually making throughput worse by making the consumer more aggressive?

Throughput Impact Of Relaxation

To test the impact of this change I constructed a JMH benchmark that measures throughput when both consumer and producer spin to offer/poll as fast as they can (see the code here). I ran the benchmark on a Haswell machine (i7-4770@3.40GHz/Ubuntu/Oracle JVM 8u60). The benchmarks were pinned to run cross core (no 2 threads share a physical core, i.e. no hyperthreading). Benchmarks were run with 5 forks, 10 warmup iterations (1 second each) and 10 measured iterations (1 second each). The score reports operations per microsecond, or millions of operations per second if you prefer. The throughput is the pollsMade figure, but the other figures are interesting as well.
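The actual benchmark lives in JCTools; the sketch below only shows the shape of it: two benchmark methods in one JMH group, with @AuxCounters state objects producing the offersMade/offersFailed/pollsMade/pollsFailed figures. The queue choice and class names here are illustrative.

  import java.util.Queue;
  import java.util.concurrent.ConcurrentLinkedQueue;
  import java.util.concurrent.TimeUnit;
  import org.openjdk.jmh.annotations.*;

  @State(Scope.Group)
  @OutputTimeUnit(TimeUnit.MICROSECONDS)
  public class QueueThroughputSketch {
    private static final Integer ONE = 777;
    private final Queue<Integer> q = new ConcurrentLinkedQueue<>(); // swap in the queue under test

    @AuxCounters
    @State(Scope.Thread)
    public static class OfferCounters {
      public long offersMade, offersFailed;
      @Setup(Level.Iteration)
      public void reset() { offersMade = offersFailed = 0; }
    }

    @AuxCounters
    @State(Scope.Thread)
    public static class PollCounters {
      public long pollsMade, pollsFailed;
      @Setup(Level.Iteration)
      public void reset() { pollsMade = pollsFailed = 0; }
    }

    @Benchmark
    @Group("tpt")
    @GroupThreads(1) // producer threads, overridable with -tg on the command line
    public void offer(OfferCounters c) {
      if (q.offer(ONE)) { c.offersMade++; } else { c.offersFailed++; }
    }

    @Benchmark
    @Group("tpt")
    @GroupThreads(1) // consumer threads
    public void poll(PollCounters c) {
      if (q.poll() != null) { c.pollsMade++; } else { c.pollsFailed++; }
    }
  }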
These results are with a single producer and consumer:
 Type |     1P1C     | Normal± Err  | Relax ± Err  |
 Spsc | offersFailed |   0.0 ±  0.1 |   0.1 ±  0.1 |
 Spsc | offersMade   | 439.1 ±  4.1 | 432.1 ± 10.0 |
 Spsc | pollsFailed  |   0.2 ±  0.5 |   0.1 ±  0.2 |
 Spsc | pollsMade    | 438.9 ±  4.1 | 431.9 ± 10.0 |
 Mpsc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.0 |
 Mpsc | offersMade   |  22.7 ±  0.6 |  31.9 ±  0.8 |
 Mpsc | pollsFailed  |   0.1 ±  0.1 |   6.6 ±  0.9 |
 Mpsc | pollsMade    |  22.8 ±  0.6 |  31.9 ±  0.8 |
 Spmc | offersFailed |   0.2 ±  0.2 |  81.0 ±  5.4 |
 Spmc | offersMade   |  23.5 ±  0.6 |  26.8 ±  0.6 |
 Spmc | pollsFailed  |   0.1 ±  0.1 |   0.2 ±  0.3 |
 Spmc | pollsMade    |  23.2 ±  0.6 |  26.6 ±  0.6 |
 Mpmc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.1 |
 Mpmc | offersMade   |  71.4 ±  5.1 |  71.7 ±  4.8 |
 Mpmc | pollsFailed  |   0.0 ±  0.1 |   0.3 ±  0.9 |
 Mpmc | pollsMade    |  71.4 ±  5.1 |  71.7 ±  4.8 |

  • The SPSC case serves as a baseline, and I expected results to be the same. There's a 2% difference which is close to the error reported. Maybe the inlining depth impacted the performance slightly, or perhaps the benchmark is not as stable as I'd like.
  • The MPSC case shows a 40% improvement. This is pretty awesome. We also notice the consumer now fails more. This is indicative of the reduced cost of failing to find elements in the queue. The cache miss on the producerIndex was slowing down both the consumer and the producer. This phenomenon has been observed in other posts, but my best attempt at an explanation is here. The bottom line is that the mutator suffers from losing the exclusive state of the cache line in its own cache, while the reader gets hit with a read miss (going to LLC as the threads are on separate cores).
  • The SPMC case shows 15% improvement on throughput. The cost of failing to offer has gone down dramatically.
  • The MPMC case shows no change, but does show great throughput. This could be down to the very different algorithm in play, or just because the MPMC queue is well balanced and thus behaves very well in a balanced use case like the one above. This may lead you to believe MPMC is a better choice than MPSC, but as the latency results and contended throughput results show, this is an anomaly of the benchmark load. Because MPSC presents a much cheaper poll than offer, the queue is nearly always empty, making the contention on the queue buffer worse than it would be for MPMC, where offer and poll are of similar cost.
Using the command line controls over thread group sizes (-tg) I ran a few more cases. Did I mention I love JMH?
Here's MPMC and MPSC with 3 producers and 1 consumer:
 Type |     3P1C     | Normal± Err  | Relax ± Err  |
 Mpsc | offersFailed |   0.0 ±  0.0 |   0.1 ±  0.6 | 
 Mpsc | offersMade   |  12.7 ±  0.1 |  13.4 ±  0.1 | 
 Mpsc | pollsFailed  |   4.1 ±  0.5 | 115.3 ±  4.9 | 
 Mpsc | pollsMade    |  12.7 ±  0.1 |  13.4 ±  0.1 | 
 Mpmc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.0 | 
 Mpmc | offersMade   |  11.1 ±  0.0 |  11.7 ±  0.1 | 
 Mpmc | pollsFailed  |   0.3 ±  0.1 |  60.4 ± 10.0 | 
 Mpmc | pollsMade    |  11.1 ±  0.0 |  11.7 ±  0.1 |
  • MPSC case shows slight improvement to throughput of roughly 5%, but the improvement to cost of failure is very visible here.
  • MPMC case improves by 5% and the improved cost of failure is similarly visible.
Here's MPMC and SPMC with 1 producer and 3 consumers:
 Type |     1P3C     | Normal± Err  | Relax ± Err  |
 Spmc | offersFailed |   1.5 ±  0.2 | 125.9 ±  4.6 |
 Spmc | offersMade   |  13.4 ±  0.1 |  14.0 ±  0.0 |
 Spmc | pollsFailed  |   0.0 ±  0.0 |   0.0 ±  0.0 |
 Spmc | pollsMade    |  13.1 ±  0.1 |  13.7 ±  0.0 |
 Mpmc | offersFailed |  23.9 ±  0.8 | 124.1 ±  3.2 |
 Mpmc | offersMade   |  11.0 ±  0.0 |  12.1 ±  0.0 |
 Mpmc | pollsFailed  |   0.0 ±  0.2 |   1.0 ±  1.6 |
 Mpmc | pollsMade    |  10.7 ±  0.0 |  11.9 ±  0.0 |
  • SPMC case shows  5% improvement and great improvement to failure performance.
  • MPMC case improves by 11% and the improved cost of failure is similarly visible.
Here's MPMC only, with 2 producers and 2 consumers:
 Type |    2P2C      | Normal± Err  | Relax ± Err  |
 Mpmc | offersFailed |   1.4 ±  0.8 |   0.7 ±  0.6 |
 Mpmc | offersMade   |  14.8 ±  0.8 |  16.0 ±  0.6 |
 Mpmc | pollsFailed  |   0.0 ±  0.0 |   0.4 ±  0.5 |
 Mpmc | pollsMade    |  14.7 ±  0.8 |  15.9 ±  0.6 |
  • The improvement to throughput is a nice 8%, but the error is not that far from it so I'm not too confident without lots more runs. This is a similarly symmetric load to the 1P1C and the balance is showing through.
I could go on to higher core counts etc, but I'm actually feeling the win here is pretty conclusive:
  • Throughput is generally improved
  • Failure to make progress is significantly cheaper
Why is this failure mode important? It matters when the consumer/producer is able to make progress in the face of such a failure. A common example of this pattern is a Selector thread which also has an incoming queue of commands to execute: when there's nothing in the queue we would like to go back to selecting on the selector.
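A sketch of that pattern (class and field names are mine, and it assumes the new MessagePassingQueue interface with relaxedPoll): most calls find the command queue empty, so making that failure cheap matters.

  import java.io.IOException;
  import java.nio.channels.Selector;
  import org.jctools.queues.MessagePassingQueue;

  public final class SelectorThread implements Runnable {
    private final Selector selector;
    private final MessagePassingQueue<Runnable> commands; // e.g. an MpscArrayQueue

    SelectorThread(Selector selector, MessagePassingQueue<Runnable> commands) {
      this.selector = selector;
      this.commands = commands;
    }

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        // drain whatever commands are visible; relaxedPoll fails cheaply when there's nothing there
        Runnable command;
        while ((command = commands.relaxedPoll()) != null) {
          command.run();
        }
        try {
          selector.select(1); // back to selecting; key handling omitted
        } catch (IOException e) {
          return;
        }
      }
    }
  }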

Latency impact of Relaxation

Is this going to help us in the case of burst handling latency? There's a benchmark for that too! Or rather, there's an estimate of burst processing cost. Code is here. The general idea is to send a burst of messages from the producer, then wait for a completion notice from the consumer.
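The linked benchmark is the authority; the sketch below just illustrates the idea, assumes a single measurement (producer) thread, and uses names of my own: the measured method offers a burst and then spins until a background consumer thread reports it has consumed that many elements.

  import java.util.Queue;
  import java.util.concurrent.ConcurrentLinkedQueue;
  import java.util.concurrent.atomic.AtomicLong;
  import org.openjdk.jmh.annotations.*;

  @State(Scope.Benchmark)
  public class BurstCostSketch {
    private static final Integer ONE = 777;
    @Param({"1", "10", "100"})
    int burstSize;
    Queue<Integer> q;
    final AtomicLong consumed = new AtomicLong();
    volatile boolean running;
    Thread consumer;

    @Setup
    public void startConsumer() {
      q = new ConcurrentLinkedQueue<>(); // swap in the queue under test
      running = true;
      consumer = new Thread(() -> {
        long count = 0;
        while (running) {
          if (q.poll() != null) {
            consumed.lazySet(++count); // single writer, an ordered store is enough
          }
        }
      });
      consumer.start();
    }

    @TearDown
    public void stopConsumer() throws InterruptedException {
      running = false;
      consumer.join();
    }

    @Benchmark
    public void burstCost() {
      final long target = consumed.get() + burstSize;
      for (int i = 0; i < burstSize; i++) {
        while (!q.offer(ONE)) { } // spin until the whole burst is in
      }
      while (consumed.get() < target) { } // wait for the consumer to finish the burst
    }
  }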
Here goes:
 Size | Type | Normal±Err | Relax± Err |
    1 | Spsc |  126 ±   2 |  126 ±   1 |
    1 | Mpsc |  144 ±   2 |  140 ±   1 |
    1 | Spmc |  187 ±   2 |  189 ±   2 |
    1 | Mpmc |  190 ±   8 |  203 ±   2 |
   10 | Spsc |  193 ±   2 |  192 ±   3 |
   10 | Mpsc |  611 ±  22 |  429 ±   7 |
   10 | Spmc |  309 ±   3 |  307 ±   1 |
   10 | Mpmc |  807 ±  19 |  677 ±  15 |
  100 | Spsc |  530 ±   7 |  528 ±   4 |
  100 | Mpsc | 5243 ± 158 | 3023 ± 107 |
  100 | Spmc | 1240 ±   7 | 1250 ±   6 |
  100 | Mpmc | 4684 ± 382 | 3665 ± 111 |

  • Same as before, SPSC serves as a sanity check/baseline. No change is good.
  • SPMC/MPSC results are pretty similar when sending a single message. MPMC seems slightly regressed.
  • When sending 10/100 messages in a burst we see significant benefits to using the relaxed methods, in particular for the MPSC and MPMC cases.


Relaxing is good for you, deep down you always knew it. Turns out it ain't half bad for your queues either :-). The new API will be part of the next version of JCTools, coming out as soon as I can work through my Maven mojo phobia.

Thanks J.P. Bempel and Phil Aston for their kind review, remaining errors are a result of me refusing to listen to them persnickety bastards. 

Wednesday, 19 August 2015

An extended Queue interface

In my work on JCTools I have implemented a fair number of concurrent access queues. The Queue interface is part of the java.util package and on the one hand offers a larger API surface than I consider core to concurrent message passing, while on the other hand it is still missing some methods. I'm hoping to solicit some discussion on new methods, and to see if I can be convinced to implement those I have decided to avoid thus far.

Inter-thread Message Passing vs. Generic Collection

The queue data structure traditionally has an offer/push/enqueue method for writing into the queue and a poll/pop/dequeue method for removing elements. This should be enough for anyone, right?
Turns out some other useful methods come to mind when using a queue:
  • peek: looking, not touching
  • size: for monitoring and heuristics and so on
  • isEmpty: a potentially more efficient assertion than size == 0
But I would argue that implementing the full scope of the java.util.Collection methods is slightly redundant. Why is it good to have offer/poll also exposed as add/remove/element?
I have chosen to avoid implementing iterators for queues in JCTools; they seem to introduce a lot of complexity and I'm not convinced users out there need an iterator for their concurrent queue. The duplicate collection methods I get for free from extending AbstractQueue are alright, though I don't see why anyone would actually use them.
So there are lots of methods I don't want, but also some methods missing, especially around concurrent queue semantics. In particular I have come to the conclusion that the following are both practical to implement and helpful to users:
  • weakOffer/Poll/Peek: the same as the 'strong' methods, but allowing for spurious negative return values (failure does not prove the queue is full/empty). This is particularly useful for queues where we are able to detect the inability to claim the next slot independently of the queue being full or empty. The 'weak' prefix is different from the Atomic* classes' definition of 'weak' as it has no ordering implications, but I've been unable to come up with a better name (suggestions are most welcome).
  • drain(receiver)/fill(provider): this is attractive for cases where a receiver/provider function exists such that calls to it are guaranteed to succeed (e.g. the receiver could just throw elements away, the provider might return a constant). There are some small savings to be made here around the function calling loop, but for some these small savings might prove significant. A perpetual version of these methods could also take some sort of backoff strategy and spin when hitting an empty/full condition (similar to the Disruptor's BatchProcessor workflow). A sketch of what this might look like follows this list.
  • batchPoll/Offer: this is attractive for processing known number of elements out of/into a queue. This is particularly profitable when multiple consumer/producers contend as the granularity of contention is reduced. It is also potentially damaging as batches can introduce large 'bubbles' into the queue and increase the unfairness. These methods could support a receiver/provider interface or use an array (enables batch array copies, guarantees quick release of a 'bubble').
  • offeredElementsCount/polledElementsCount: this is purely for reporting and would tend to be a long reflecting the producer/consumer counter. For linked queues this adds a new overhead, so while this is a clearly useful feature I'm reluctant to make it a compulsory part of the new interface.
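To make the drain/fill and batch ideas concrete, here is a sketch of what the signatures might look like. The interface name, the callback shapes and the javadoc are all mine, for illustration only, not a final API:

  public interface ExtendedQueue<T> extends java.util.Queue<T> {
    interface Consumer<E> { void accept(E e); }
    interface Supplier<E> { E get(); }

    /** Remove all currently visible elements, passing each to the receiver; returns the count. */
    int drain(Consumer<T> receiver);

    /** Add elements from the provider for as long as the queue accepts them; returns the count. */
    int fill(Supplier<T> provider);

    /** Remove up to limit elements (the batch variant); returns the count. */
    int drain(Consumer<T> receiver, int limit);

    /** Add up to limit elements from the provider (the batch variant); returns the count. */
    int fill(Supplier<T> provider, int limit);
  }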

I plan to implement the above for JCTools in the near future. What do you think? Please opine in the comments or via the project's suggestion box.
Thanks :)

Monday, 27 July 2015

JMH perfasm explained: Looking at False Sharing on Conditional Inlining

There is an edge that JMH (read the jmh resources page for other posts and related nuggets) has over other frameworks. That edge is so sharp you may well cut yourself using it, but given an infinite supply of bandages you should definitely use it :-) This edge is the ultimate profiler, the perfasm (pronounced PERF-AWESOME!, the exclamation mark is silent). I've been meaning to write about it for a while and as it just saved my ass recently...

SpscGrowableArrayQueue's False Sharing Issue

JCTools includes a specialized SPSC (single-producer/single-consumer) bounded queue aimed at actor systems, the SpscGrowableArrayQueue. This queue is quite neat because it combines the compactness of a linked queue with the awesome throughput of an array backed queue. The idea is quite simple, and similar in spirit to an ArrayList:
  1. Start with a small buffer, default to 16 elements
  2. As long as the queue doesn't grow beyond that just stay small
  3. If on offer you find that the queue is full, double the size of the underlying buffer
The mechanics of the queue resizing, handling intermediate queue states, and detecting a new buffer from the consumer are a bit involved and will perhaps be expanded on some other day.
Because the queue is geared at actor systems, per-queue footprint is important, so I reworked the memory layout of SpscArrayQueue and skimped on the padding where possible. If you have no idea what I mean by padding, and why it is done, you can read the following posts:
  1. False sharing and the MESI protocol details related to cached counters: This explores the motivation of padding fields in the first place
  2. Discovering and controlling object layout(Using an early version of JOL): This show cases the JOL tool and how it should be used to discover object layout
  3. SPSC Revisited - part I: An empiricist tale: This post discusses inlining the counters into the queue, using inheritance to control field order and introduce padding. In it I show how to pad the counters from each other and also how to pre/post pad the class and the buffer.
  4. A more high level summary of false sharing is given in this post
With an overhead of roughly 128b per padding, trimming some of them was instrumental in reducing the high memory cost per queue. The hot counters were still padded from each other, but the class pre/post padding was removed, as was the padding of the elements array.
So, I gave up on some of the padding, but reasoned that in most cases this should not make a difference because the hottest fields were still padded from each other.

Why so slow?

Now, I expected a queue that is very similar to SpscArrayQueue but adds some features to be slower. There's just very little you can do about this: doing more will cost you something. But given that resizing is a rather exceptional event for this queue, I thought this would be a minor hit, maybe a 10-20% reduction in performance (for a certain definition of performance). JCTools has some benchmarks included and I ran QueueThroughputBackoffNone, which has producer and consumer threads chewing on the queue as hard as they can.
Since the numbers are not very important here, I'll stick to running benchmarks on my laptop (an MBP 11,1/Ubuntu 15.04/Zulu JDK8u45 - Zulu is a supported and tested OpenJDK build) in this post. Rest assured that I have confirmed the same results on real hardware. The results I quote below are the pollsMade figure, which reflects the actual delivered throughput.
To my puzzlement I found:
SpscArrayQueue          361.223 ±  7.156  ops/us
SpscGrowableArrayQueue   64.277 ± 31.803  ops/us

Crap performance and terrible variance, bugger me sideways.
Looking at the code, I thought there must be something I was doing to upset the mighty JIT spirits. Maybe my methods were too big? My branches too unpredictable? My variable names too offensive? So I tweaked the code this way and that, looked at the inlining log (-XX:+PrintInlining) and the assembly (-XX:+PrintAssembly/-XX:CompileCommand=print,*), and got some minor improvements, but it mostly still sucked. What's wrong? A quick look at "-prof perfasm" and a fair amount of head scratching led to the answer. The code (before and after) is on github; the main focus for this post is perfasm and its usage so I won't dive into it.

Before we start: How does perfasm work?

To use perfasm you'll need a Linux (or Windows) OS running on real hardware, the relevant perf tool installed, and a JVM set up to print assembly.
Perf is (amongst other things) an instruction level profiler, of the kind that traditionally doesn't work with Java (though things are slowly changing). One of the features offered by perf is the "perf record/annotate" workflow. Perf interrupts your process repeatedly and records the current PC (program counter) at each sample. This sampling of the program counter is recorded over a period of time, to be post processed by the annotate feature which correlates samples to methods, lines of code and assembly instructions. The challenge for perf when dealing with Java code is that the binary form of each method only exists for the duration of that process's lifetime. This means the PC mostly refers to methods that are nowhere to be found when the annotation stage comes along.
To summarize: perf record works, but perf annotate is broken for Java.
To make perfasm work, JMH captures the JVM compiler output for the forked benchmark JVM by enabling PrintAssembly and the related diagnostic flags (which in turn requires the hsdis disassembler library to be available).
The data collected here allows perfasm to do the annotation process by itself. I'll not bore you with the details of the output processing, but hats off to Mr. Shipilev, who ploughed through the gory mess and made this work. The end result is a detailed output offering you the assembly of the hottest regions in your program, along with a list of the top hottest methods including both native and Java parts of the stack.
Because perf allows the recording of any number of events, the definition of 'hottest' depends on the events you choose to profile. The default events are cycles and instructions, but you can specify any number of others (e.g. cache-misses). The first event specified will be the 'hotness' qualifier.

What do you get?

The perfasm output is split into 4 sections:
  1. Annotated assembly output for the top HOT regions(titled "Hottest code regions (>10.00% "cycles" events):")
  2. A list of the hot regions in your benchmark (titled "[Hottest Regions]"). This is the list of compiled methods inclusive of native methods, which is why we have the next section.
  3. A list of the hottest methods after inlining, so only java methods in this section (titled "[Hottest Methods (after inlining)]").
  4. A distribution of the cycles between types of regions (titled "[Distribution by Area]"). This will inform you of how the split goes between compiled code, kernel, JVM etc.
The most interesting of the 4 sections (and the only one I'm going to explain here) is the annotated hot regions section. I tend to edit the output of assembly spouting tools to make it more readable, but for the sake of explaining what's what here's the original output with footnotes:
To focus on the main features I trimmed out a lot of the assembly code, which is where you see the "<MORE OF SAME/>" comments. Note the legend at the start of the output describing the two leftmost columns: cycles and instructions sample percentages.
Here's the footnotes explained:
  1. (Line 7) This column is the instruction address. This is important as this column is both how perfasm matches the perf record output with the PrintAssembly output and how poor sods like yours truly have to look for jump destinations in the code.
  2. (Line 7) This column has the actual instructions, in AT&T syntax (src -> dst).
  3. (Line 7) This column is the instruction annotation generated by PrintAssembly. It is not always correct, but it's helpful in connecting the dots. Multi-level inlining of methods can be seen in action everywhere.
  4. (Line 12) The condition/loop edge annotation is a wonderful recent addition to perfasm. It makes life marginally more bearable for the assembly consumer by connecting the jumps with their destinations: an arrow drawn alongside the instructions marks the jump's start point...
  5. (Line 16) ...and the matching end of that arrow marks the jump's destination (the end point).
When looking at this you will be searching for suspect hot spots in the code. Some instruction which is chewing away your cycles. If you're in luck you'll be able to tweak your code just right to side step it and then you are off and away :-)

Back on Track: Show me the cycles!

Right, so I point Captain Awesome at my benchmark and what do I see (I trimmed addresses and edges as there are no important conditionals here, also trimmed package names and moved the code annotation left):
This instruction, eating 30% of my cycles (line 7, needs HEALING!), is part of a guard generated to ensure that inlining the poll method call is still a valid decision, that the instance observed at the callsite is of the class for which we inlined the poll method. This is not really a Java code line, this is the aftermath of inlining a virtual call (through an interface, see Aleksey's method dispatch inlining article to learn more on inlining).
To be fair, the above is kind of hard to understand. Why is a comparison between a constant and a register eating 30% of the cycles? According to every instruction 'cost' manual this should take 1 cycle (and the CPU can do 3-4 of them in parallel too), so this is obviously not the instruction I'm looking for.
This is a phenomenon known as 'skid', where the reported instruction at a given sample is inaccurate because modern CPUs are complicated. See the following dialogue on the linux-perf-user mailing list:
> I think Andi mentioned this to me last year -- that instruction profiling was no longer reliable. 
It never was. 
> Is this due to parallel and out-of-order execution? (ie, we're sampling the instruction pointer, but that's set to the resumption instruction, not the instructions being processed in the backend?). 
Most problems are due to 'skid': It takes some time to trigger the profiling interrupt after the event fired. [...] There are also other problems, for example an event may not be tied to an instruction. Some events have inherently large skid.
This sounds grim, but at the end of the day this is as accurate as profiling can get. It's not perfect, but it's still very valuable and you get used to it (there are ways to minimize skid discussed in the link, I've not tried those).
What you end up doing is looking for instructions just before the blamed instruction, or instructions on which the blamed instruction is dependent, which may be more reasonably blamed for the bottleneck. Given the CMP is not the problem, we must ask why the CPU would spend so much time on it. A CMP or a TEST will often get blamed for the price of the load into the registers they use; in this case the CMP is most probably being blamed for the load of the queue type from the object header one instruction back:
"0x8(%r12,%r11,8),%r8d  ; implicit exception: dispatches to 0x00007f83c93f929d"
It doesn't help that the comment talks about some implicit exception (this is an implicit null check), where it could say "getkid(q) + implicit_nullchk(q)" or something similar to indicate we are loading the kid (klass id) from the object header (see object header layout details here).
Now that I had pointed a finger at this operation, I was still quite confused. This was not an issue for any of the other queues, so why would loading the kid be such a bottleneck in this case? Maybe I'm wrong about this (always a good assumption)? To prove this was the issue I created a duplicate of the original benchmark which uses SpscGrowableArrayQueue directly instead of going via the interface; for comparison I also benchmarked SpscArrayQueue in the same fashion:
SpscArrayQueue          369.544 ±  1.535  ops/us
SpscGrowableArrayQueue  272.021 ± 12.133  ops/us

Now that's more like it! The expected 20% difference is more like 30%, but this is much closer. This still begs the question: why is the type check for SpscGrowableArrayQueue so expensive? We can see that for SpscArrayQueue this makes very little difference, so how is SpscGrowableArrayQueue different?

Messing with the KID

I had a long hard look at this issue, which didn't help, then slept on it, which did help, and realized the problem here is that the object header is false sharing with the producer fields. When I trimmed down the padding on this class in an effort to minimize allocation, I removed the array pre and post padding as well as the class pre/post padding, and reasoned that for the most part I need not worry about the object's neighbours false sharing. What I failed to realize was that the consumer and producer threads might be frequently hitting the object header, in this benchmark on every call. Once I realized this was the issue I reinstated the pre-padding such that the producer index is far enough from the object header to stop interfering with it, and the problem went away (see before and after; it's a messy diff as I fixed some other niggles while I was there, and you can take the last version and play with adding/removing the padding and reordering the index and cold fields to verify my claims).
Here's the (slightly trimmed for readability) perfasm output for the same conditional inlining check in the padded version:

As an interesting side note, this issue was not visible in the hand rolled throughput benchmarks. This is because in those benchmarks the queue is hoisted into a variable before the consume/produce loop which means the conditional inlining check can be done out of loop as well. This is great for getting nice numbers but hides an issue which users are likely to hit. Credit goes to JMH for herding benchmarks down a path which forces these issues into the measured scope.


The main focus of this post is perfasm, and I hope it helps get you started on using it. The broader context in which you would use this tool is harder to spell out. I use perfasm regularly, but I'm also reasonably happy to look at assembly and compiler annotations, which I know most people are not. I find it invaluable for issues, like the one described in this post, where the cost is between the Java lines rather than in the lines themselves. Any profiler can give you a broad feel of where performance bottlenecks are, and a non-safepoint-biased profiler can show you the hottest line of code. What a Java profiler will not tell you about is the generated JVM code between your lines. It will attribute the cost of those instructions to the nearest Java line, and that can become a very confusing chase.
It is also worth pointing out that any nano-benchmark (measuring operations in the 0-500 nanoseconds range) practically requires you to look at the assembly for its analysis. But 500ns can still be an awful lot of code, and an assembly level profiler is very handy. At this point it is worth mentioning the Oracle Solaris Studio (first prize for least known profiler every year since release). It is a great assembly level profiler, and just generally a great profiler. If your measurement needs to take place outside the cosy comforts of a JMH benchmark I would recommend you give it a spin.
Finally, this investigation came in the context of a workflow that is followed in the development of JCTools. I would loosely describe it as follows:
  1. Implement new feature/data structure
  2. Reason about expected impact/performance as part of design and implementation
  3. Test expectations using the existing set of benchmarks
  4. Expectations are far off the mark (if not... not sure, will figure it out when it happens)
  5. Dig in until either expectations or code are cured.
This has a whiff of scientific exploration to it, but I assure you it is not done quite so seriously, and I often fail to follow my own advice (or worse, other people's advice). The habit of testing performance assumptions/expectations has offered me many an afternoon spent banging my head on a variety of surfaces. Perfasm has been instrumental in reducing the amount of head banging, but I fear nothing short of permanent brain damage will actually solve the problem.
This post has been kindly reviewed by Aleksey Shipilev, Darach Ennis and Richard Warburton. Any remaining errors are entirely their fault ;-)

Friday, 22 May 2015

Object.equals, primitive '==', and Arrays.equals ain't equal

It is a fact well known to those who know it well that "==" != "equals()", the example usually going something like:
  String a = "Tom";
  String b = new String(a);
  -> a != b but a.equals(b)

It also seems reasonable therefore that:
  String[] arr1 = {a};
  String[] arr2 = {b};
  -> arr1 != arr2 but Arrays.equals(arr1, arr2)

So far, so happy... That's examples for you...
For primitives we don't have the equals method, but we can try boxing:
  int i = 0;
  int j = 0;
  -> i == j and also ((Integer)i).equals((Integer)j), and Arrays.equals({i}, {j})

Floats ruin everything

But... some primitives are more equal than others. Consider for instance the following example:
  float f1 = Float.NaN;
  float f2 = Float.NaN;
  -> f1 != f2, and also f1 != f1, NaNs are nasty

This is what floats are like, children: they be treacherous little hobbitses and no mistake. The fun starts when you box them fuckers:
  -> ((Float)f1).equals((Float)f2), because Java likes to fix shit up
  -> Arrays.equals({((Float)f1)},{((Float)f2)}), hmm consistent
  -> but also: Arrays.equals({f1},{f2})...

This is counter to how one would normally think arrays are compared. You would think that for primitives (skipping the arrays' null and length checks):
  boolean feq = true;
  for (int i = 0; i < farr1.length; i++) {
    if (farr1[i] != farr2[i]) {
      feq = false; break;
    }
  }
Is the same as:
  boolean feq = Arrays.equals(farr1,farr2);
But for double[] and float[] the contract has been changed to accommodate NaN. This is perhaps a good decision on the JDK authors' side, but it is somewhat surprising.
Conclusion: 2 arrays can be equal, even if some elements are not equal.
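A small self-contained demonstration (Arrays.equals for float[] compares elements via Float.floatToIntBits, which is why the NaNs come out equal):

  import java.util.Arrays;

  public class NaNArrays {
    public static void main(String[] args) {
      float[] farr1 = { Float.NaN };
      float[] farr2 = { Float.NaN };
      System.out.println(farr1[0] == farr2[0]);        // false: NaN is not '==' to anything, itself included
      System.out.println(Arrays.equals(farr1, farr2)); // true: the arrays are equal anyway
    }
  }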

Objects are weird

Let's consider objects for a second. We started with the observation that a != b does not imply !a.equals(b), but what about a == b?
The Object.equals() javadoc offers the following wisdom:
The equals method implements an equivalence relation on non-null object references:

  • It is reflexive: for any non-null reference value x, x.equals(x) should return true.
  • It is symmetric: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
  • It is transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
  • It is consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.
  • For any non-null reference value x, x.equals(null) should return false.
You know what the most important word in the above lines is? SHOULD
In the wonderful land of Should, all classes behave and if they bother implementing an equals method they follow the above rules (and also override the hashcode method while they are there).
But what happens when an object is implemented perversely, like this:
  public class Null {
    @Override
    public boolean equals(Object obj) {
      return obj == null;
    }
  }

This has some nice implications:
  Null n = new Null();
  -> n != null, but n.equals(null)
  Object[] a = {n};
  Object[] b = {null};
  Object[] c = {n};
  -> Arrays.equals(a, b) is true, because n.equals(null)
  -> Arrays.equals(b, a) is false, because null != n
  -> Arrays.equals(a, a) is true, because a == a
  -> Arrays.equals(a, c) is false, because !n.equals(n)

Or to quote The Dude: "Fuck it man, let's go bowling"

Tuesday, 19 May 2015

Degrees Of (Lock/Wait) Freedom

Yo, Check the diagonal, three brothers gone...
I've been throwing around the terms lock-free and wait-free in the context of the queues I've been writing, perhaps too casually. The definition I was using was the one from D. Vyukov's website (direct quote below):
  • Wait-freedom:  Wait-freedom means that each thread moves forward regardless of external factors like contention from other threads, other thread blocking. Each operations is executed in a bounded number of steps. 
  • Lock-freedom: Lock-freedom means that a system as a whole moves forward regardless of anything. Forward progress for each individual thread is not guaranteed (that is, individual threads can starve). It's a weaker guarantee than wait-freedom. [...] A blocked/interrupted/terminated thread can not prevent forward progress of other threads. Consequently, the system as a whole undoubtedly makes forward progress.
  • Obstruction-freedom: Obstruction-freedom guarantee means that a thread makes forward progress only if it does not encounter contention from other threads. That is, two threads can prevent each other's progress and lead to a livelock. 
  • Blocking Algorithms: It's the weakest guarantee - basically all bets are off, the system as a whole may not make any forward progress. A blocked/interrupted/terminated thread may prevent system-wide forward progress infinitely. 
The above definitions refer to "forward progress/moving forward" in the context of "system" and "thread". In particular they translate "X-freedom" to "a guarantee that T/S makes forward progress".
Now "thread" is a well defined term, but what counts as forward progress of a given thread?
To my mind a thread which is free to return from a method is free to 'make progress'. For example: if the next element in a queue is not visible to a consumer thread I can return control to the consumer thread which is then free to make progress on anything it feels like and try getting an element out of the queue later. Similarly a producer thread which is unable to add elements to a queue is 'free' to 'make progress' even if the queue is full. In short, I interpreted 'freedom' as regaining control of execution.

Freedom? yeah, right...

My interpretation however is limited to the scope of the thread, and assumes it has other things to do (so it places the thread within the context of a given 'system'). So the freedom to make progress is really a freedom to make progress on 'other things'. The definitions above, when applied to a given data structure, have no knowledge of the 'system', and it is therefore fair to assume nothing. And so if the 'system' is viewed as being concerned only with using the data structure, it seems my view of 'regained control freedom' is not in line with the 'progress making freedom'.
Let's consider for example the linked Multi-Producer-Single-Consumer queue discussed in the previous post (this is not the j.u.Queue compliant version):
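Here's the offer side as a minimal sketch (an AtomicReference<Node<E>> named head stands in for the Unsafe based field access of the real code, and the (*) comment marks the gap discussed below):

  public boolean offer(E e) {
    final Node<E> n = new Node<>(e);
    final Node<E> prev = head.getAndSet(n); // a single atomic XCHG, producers never wait on each other
    // (*) a producer suspended right here has published the new head but has not yet linked it;
    //     the consumer cannot reach n (or anything added after it) until the next line runs
    prev.next = n;
    return true;
  }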

Now let us assume a producer has stalled in that unpleasant gap between setting the head and pointing prev to the new head (the (*) line in the sketch above); this is what I've come to think of as a 'bubble' in the queue. The next producer will see the new head, but the consumer will not see it (or any nodes after it) until prev.next is set.
Other producers can 'make progress' in the sense that elements will get added to the queue. Those elements' visibility to the consumer, however, is blocked by the 'bubble'. What is the correct degree of freedom for these producers? What about the consumer?
When describing the queue Vyukov makes the following statements:
  • Wait-free and fast producers. One XCHG is maximum what one can get with multi-producer non-distributed queue.
  • Push [offer in the Java version] function is blocking wrt consumer. I.e. if producer blocked in (*) [the gap marked in the sketch above], then consumer is blocked too. Fortunately 'window of inconsistency' is extremely small - producer must be blocked exactly in (*).
So producers are wait-free, but consumer is blocking. The consumer is truly blocked by the 'bubble' and can make no progress in the terms of the data structure. Despite other producers adding nodes the consumer cannot see those nodes and is prevented from consuming the data. The fact that control is returned to the caller is considered irrelevant. And if we are being precise we should call this a blocking queue.

Life imitates Art

The "Art of Multiprocessor Programming" offers a different definition:
"A concurrent object implementation is wait free if each method call completes in a finite number of steps. A method is lock-free if it guarantees that infinitely often some method call finishes in a finite number of steps" - page 99
This sentiment is similarly echoed in "Wait-Free Queues With Multiple Enqueuers and Dequeuers" a paper on lock free queues:

  • [...] to ensure that a process (or a thread) completes its operations in a bounded number of steps, regardless of what other processes (or threads) are doing. This property is known in the literature as (bounded) wait-freedom.
  • lock-freedom ensures that among all processes accessing a queue, at least one will succeed to finish its operation.

Hmmm... no system, no progress. The boundaries are now an object and a method, which are well defined. I think this definition matches my understanding of 'regained control freedom' (open to feedback). If a method returns, progress or no progress, the condition is fulfilled. Under this definition the queue above is wait-free.
The wiki definition is again closer to the one outlined by Vyukov:
  • An algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread for some operations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress.
  • Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.
  • Lock-freedom allows individual threads to starve but guarantees system-wide throughput. An algorithm is lock-free if it satisfies that when the program threads are run sufficiently long at least one of the threads makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free.
We are back to a fuzzy definition of system and progress, but the non-blocking definition above is interesting in highlighting suspension. In particular I'm not sure how this is to be interpreted in the context of the lock-free definition, where suspension is not an issue as long as some threads can keep going.
Note that once we fix poll to behave in a j.u.Queue compliant way, the poll method is no longer wait-free by any definition.

What about the JCTools queues?

JCTools covers the full range of MPMC/MPSC/SPMC/SPSC range of queues, and aims for high performance rather than strict adherence to any of the definitions above. To be more precise in the definitions I would say:
  • SpscArrayQueue/SpscLinkedQueue are Wait Free (on both control and progress senses)
  • MpscArrayQueue is lock free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait free in the control sense.
  • MpscLinkedQueue is wait free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait free in the control sense.
  • SpmcArrayQueue is lock free on the consumer side and blocking on the producer side. I'm planning to add a weakOffer method which will be wait free in the control sense.
  • MpmcArrayQueue is blocking on the both producer and consumer side. I'm planning to add a weakOffer/Poll method which will be lock free in the control sense.
Does it matter that the queues do not meet the exact definitions set out above? As always, it depends on your needs...

Non-Blocking Algorithms On The JVM?

{Full disclosure: please note when reading the next section that I work with Azul Systems on the Zing JVM, I'm not trying to sell anything, but I am naturally more familiar with it and consider it awesome :-)}
What happens when we include the JVM in our definition of a system? Can you build a non-blocking algorithm on the JVM? All the JVMs I know of are blocking in some particular cases, important to the discussion above are:
  1. Allocation: The common case for allocation is fast and involves no locking, but once the young generation is exhausted a collection is required. Young generation collections are stop the world events for Oracle/OpenJDK. Zing has a concurrent young generation collector, but under extreme conditions or bad configuration it may degrade to blocking the allocator. The bottom line is that allocation on the JVM is blocking and that means you cannot consider an algorithm which allocates non-blocking. To be non-blocking a system would have to provably remove the risk of a blocking collection event on allocation.
  2. Deoptimization: Imagine the JIT compiler, in its eager and aggressive compilation strategy, has decided your offer method ends up getting compiled in a particular way which ends up not working out (assumptions on inheritance, class loading and passed in values play a part). When the assumption breaks, a deoptimization may take place, and that is a blocking event. It is hard to prove any piece of code is deoptimization risk free, and therefore it is hard to prove any Java code is non-blocking. Deoptimization is in many cases a warmup issue. Zing is actively battling this issue with ReadyNow, which reloads previously recorded compilation profiles for the JVM, greatly reducing the risk of deoptimization. Oracle/OpenJDK users can do a warmup run of their application before actually using it to reduce the risk of deoptimization. My colleague Doug Hawkins gave a great talk on this topic which will give you far more detail than I want to go into here :-).
  3. Safepoints: If your algorithm includes a safepoint poll it can be brought to a halt by the JVM. The JVM may bring all threads to a safepoint for any number of reasons. Your method, unless inlined, already has a safepoint on method exit (OpenJDK) or entry (Zing). Any non-counted loop will have a safepoint poll, and this includes your typical CAS loop. Can you prevent safepoints from ever happening in your system? It might be possible for a JVM to provide users with compiler directives which prevent safepoint polls from being inserted in certain code paths, but I don't know of any JVM offering this feature at the moment.
In short, if you are on the JVM you are probably already blocking. Any discussion of non-blocking algorithms on the JVM is probably ignoring the JVM as part of the 'system'. This needs to be considered if you are going to go all religious about wait-free vs. lock-free vs. blocking. If a system MUST be non-blocking, it should probably not be on the JVM.


Any term is only as good as our ability to communicate with others using it. In that sense, splitting hairs about semantics is not very useful in my opinion. It is important to realize people may mean any number of things when talking about lock-free/wait-free algorithms and it is probably a good idea to check if you are all on the same page.
I personally find the distinction between control/progress freedom useful in thinking about algorithms, and I find most people mean lock/wait-free excluding the effects of the JVM...
Thanks Martin & Doug for the review, feedback and discussion, any remaining errors are my own (but their fault ;-) )

Wednesday, 22 April 2015

Porting Pitfalls: Turning D.Vyukov MPSC Wait-free queue into a j.u.Queue

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
D. Vyukov is an awesome lock-free dude, and I often refer to his instructive and invaluable website in my posts. On his site he covers lock free queue implementations and in particular a wait-free MPSC linked node queue. This is really rather special when you consider that normally MP would imply lock-free rather than wait-free guarantees. I've ported his algorithm to Java (and so have many others: Netty/Akka/RxJava etc.), and had to tweak it to match the Queue interface. In this post I'd like to explain the algorithm, its translation to Java, and the implications of making it a j.u.Queue.

Lock free vs. Wait free

Let's review the definitions:
  • Wait-free: thread progress is guaranteed, all operations finish in a determined number of steps.
  • Lock-free: global progress is guaranteed, though a particular thread may be blocked.

An example of a transition from lock free to wait free is available with JDK8 changes to AtomicReference::getAndSet(). The change was made by utilizing the newly available Unsafe::getAndSetObject intrinsic which translates directly to XCHG (on x86). So where we used to have for AtomicReference:
T getAndSet(T newValue) {
  T currentValue;
  do {
    currentValue = val; // val is a volatile field
  } while (!UNSAFE.compareAndSwapObject(this, VAL_FIELD_OFFSET, currentValue, newValue)); // UNSAFE is the sun.misc.Unsafe instance
  return currentValue;
}
Now we have:
T getAndSet(T newValue) {
  return (T) UNSAFE.getAndSetObject(this, VAL_FIELD_OFFSET, newValue);
}
I discussed a similar change to AtomicLong.getAndAdd in a previous post, replacing the CAS loop with LOCK XADD.

The Vyukov Wait Free MPSC queue

This is a LinkedList type structure and the interesting methods are offer and poll, here's the original (I did some formatting):
Awesome in its simplicity, deceptive in its genius. Be aware that the head/tail meaning is the other way around from what most people are used to. I personally go for producerNode/consumerNode (head = producer side, tail = consumer side in the above snippet) in my code, but for consistency I'll stick with Mr. V's notation for the porting exercise.
But how do we manage the same memory barriers in Java? We can be nice about it and use an AtomicReferenceFieldUpdater, or be more brutal and use Unsafe. I find you get better performance with Unsafe, but you should consider how appropriate this is for you. In any case, here's what we end up with:
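Below is a minimal sketch of such a port. I've used an AtomicReference and a volatile next field instead of the Unsafe/field-offset machinery of the real JCTools class, and kept Vyukov's head/tail naming; treat it as an illustration of the structure rather than the actual code:

  import java.util.concurrent.atomic.AtomicReference;

  public class MpscLinkedQueueSketch<E> {
    static final class Node<E> {
      E value;
      volatile Node<E> next;
      Node(E value) { this.value = value; }
    }

    private final AtomicReference<Node<E>> head; // producer side: last node added
    private Node<E> tail;                        // consumer side: last node consumed (a stub)

    public MpscLinkedQueueSketch() {
      Node<E> stub = new Node<>(null);
      head = new AtomicReference<>(stub);
      tail = stub;
    }

    public boolean offer(E e) {
      final Node<E> n = new Node<>(e);
      final Node<E> prev = head.getAndSet(n); // single atomic XCHG: wait-free for producers
      prev.next = n;                          // link the new node; a stall here creates the 'bubble' discussed later
      return true;
    }

    // Vyukov-style poll: returns null when the NEXT element is not visible, which is not
    // necessarily the same as the queue being empty (this is the pitfall discussed below).
    public E poll() {
      final Node<E> next = tail.next;
      if (next == null) {
        return null;
      }
      final E value = next.value;
      next.value = null; // the consumed node becomes the new stub
      tail = next;
      return value;
    }
  }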
The code is pretty similar. Now if we wanted to complete the j.u.Queue we could extend AbstractQueue and implement only size()/peek() and we're done (I'm not going to bother with the iterator()):
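A sketch of that naive completion, using the names from the port above (and, as we're about to see, broken in subtle ways):

  @Override
  public E peek() {
    final Node<E> next = tail.next;
    return next == null ? null : next.value;
  }

  @Override
  public int size() {
    int size = 0;
    for (Node<E> n = tail.next; n != null; n = n.next) {
      size++;
    }
    return size;
  }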
There we go, seems reasonable, don't it? Off to the pub to celebrate some kick ass lock free programming!

Vyukov::poll() ain't Queue::poll!

I've discussed this particular annoyance in a previous post, which has some overlap with this one. The names are the same, but as it turns out the guarantees are quite different. While Vyukov is doing a fine job implementing a queue, not every queue is a j.u.Queue. In particular for this case, the poll() method has a different contract:
  • j.u.Queue: Retrieves and removes the head of this queue, or returns null if this queue is empty.
  • Vyukov: Retrieves and removes the head of this queue, or returns null if next element is not available.
Why wouldn't the next element be available? Doesn't that mean the queue is empty? Sadly this ain't the case. Imagine for instance there are 2 producers and we run both threads step by step in a debugger. We break on the head swap in offer (the getAndSet line in the Java version above):
  1. Producer 1: we step one line; we have just replaced head with node n. We are suspended before executing prev.next = n.
  2. Producer 2: we let the program continue. It has replaced head with its own node and linked the previous head (Producer 1's node) to it.
What state are we in? Let's look at head/tail nodes and where they lead (I number the nodes in order of assignment to head):
  • head = Node[2], the node created by Producer 2. We also know that Node[1].next = Node[2], because we let offer run its course on Producer 2.
  • tail = Node[0], the node we allocated in the constructor. This is the node head pointed to before the first producer came along, and it is what prev is equal to for Producer 1; but because we suspended that thread, it never set its next value. Node[0].next is still null!

If a consumer came along now they would get a null from poll(), indicating the queue is empty. But the queue is obviously not empty!
So it seems we cannot deduce that the queue is empty just by looking at tail.next. Here are 2 valid indicators that the queue is empty:
  • head == tail : this is the starting point set in the constructor and where the consumer ends up after consuming the last element
  • head.val == null : head can only have a value of null if it is tail
Here's a solution to a correct poll() and the knock on effect on peek():
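Something along these lines (again a sketch on top of the port above): when the next node is not visible but tail != head, a producer must be stuck in the gap, so the consumer has to wait for the link to appear rather than report an empty queue:

  public E poll() {
    final Node<E> currConsumerNode = tail;
    Node<E> nextNode = currConsumerNode.next;
    if (nextNode == null && currConsumerNode != head.get()) {
      // a producer is in the 'bubble': the queue is not empty, so spin until the link appears
      do {
        nextNode = currConsumerNode.next;
      } while (nextNode == null);
    }
    if (nextNode == null) {
      return null; // genuinely empty: tail == head
    }
    final E value = nextNode.value;
    nextNode.value = null;
    tail = nextNode;
    return value;
  }

  public E peek() {
    Node<E> nextNode = tail.next;
    if (nextNode == null && tail != head.get()) {
      do {
        nextNode = tail.next;
      } while (nextNode == null);
    }
    return nextNode == null ? null : nextNode.value;
  }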
This is a bit annoying because now we have a wait-free offer(), but poll() and peek() are only lock-free (they can block the single consumer thread, while producers can still make progress).
This pitfall is tricky enough that not only did I fall for it (on another queue algorithm, but the same mistake), it also took Mr. Manes by surprise while writing an interesting variant on this queue for his high performance cache implementation (I filed an issue which he promptly fixed), and it struck the notorious Dave Dice when considering Vyukov's MPMC queue (see the comments there for discussion).
So is this it? Are we done?
Almost... size() is still broken.

A Good Size

It's not really surprising that size is broken given the terminating condition for the size loop was relying on next == null to terminate the count. Size was also broken in 2 other subtle ways:
  • The interface for size dictates that it returns a positive int. But given that the queue is unbounded, it is possible (though very unlikely) for it to have more than 2^31 elements. This would require that the linked list consume over 64GB (16b + 8b + 8b = 32b per element; refs are 8b since this requires more than a 32GB heap, so no compressed oops). Unlikely, but not impossible. This edge condition is handled for ConcurrentLinkedQueue (same as here), and for LinkedBlockingQueue (by bounding its size to MAX_INT, so it's not really unbounded after all), but not for LinkedList (size is maintained in an int and is blindly incremented).
  • Size is chasing a moving target and as such can in theory never terminate as producers keep adding nodes. We should limit the scope of the measurement to the size of the queue at the time of the call. This is not done for CLQ and should perhaps be considered.
Here's the end result:
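A sketch of a size() that addresses both points, bounding the result at Integer.MAX_VALUE and only chasing up to the producer node observed when the call started (names as in the port above):

  @Override
  public int size() {
    Node<E> chaserNode = tail;               // consumer side
    final Node<E> producerNode = head.get(); // snapshot of the producer side at call time
    int size = 0;
    while (chaserNode != producerNode && size < Integer.MAX_VALUE) {
      Node<E> next;
      // a producer may be in the 'bubble'; its next pointer will become visible shortly
      while ((next = chaserNode.next) == null) { }
      chaserNode = next;
      size++;
    }
    return size;
  }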
If you need a full SPSC/MPSC linked queue implementation they are available in JCTools for your pleasure, enjoy!

Special thanks to Doug and Ben for reviewing!

Monday, 13 April 2015

On Arrays.fill, Intrinsics, SuperWord and SIMD instructions

{This post turned rather long, if you get lazy feel free to skip to the summary}
Let's start at the very beginning, a very good place to start... My very first post on this blog was a short rant on intrinsics, and how they ain't what they seem. In that post I made the following statement:
"intrinsic functions show up as normal methods or native methods"
Which is correct. An intrinsic function is applied as a method substitution. A method call will appear in the code and the compiler will replace its apparent source-level implementation with a pre-cooked implementation. In some cases intrinsics are sort of compilation cheats, the idea being that some bits of functionality are both very important (i.e. worthwhile optimizing) and can benefit from a hand crafted solution that will be better than what the compiler can achieve. The end result can be in one of a few flavours:
  1. Method call replaced with a call to a JVM runtime method: E.g. System.arrayCopy is replaced with a call to a method stub generated by the runtime for all array types. This method call is not a JNI call, but it is a static method call that is not inlined.
  2. Method call replaced with one or more instructions inlined: E.g. Unsafe.getByte/compareAndSet/Math.max
  3. Method call replaced with compiler IR implementation: E.g. java.lang.reflect.Array.getLength
  4. A mix of the above: E.g. String.equals is partially implemented in IR, but the array comparison is a call to a method stub.
The intrinsics are all set up in vmSymbols.hpp and if you look, you'll see Arrays.fill is NOT on the list. So why am I talking about Chewbacca? Because it is something like an intrinsic...

The Arrays.fill SIMD Opportunity

Arrays.fill is the Java memset (it fills an array with a given value), and just like System.arrayCopy (memcpy in C lingo) it is worth the effort to optimize and offers the same kind of opportunity. What opportunity might that be, you ask? The opportunity to use SIMD (Single Instruction Multiple Data) instructions when the underlying CPU offers them (I assume for the sake of discussion AVX enabled CPUs, i.e. since Sandy Bridge; I find this listing of Intel intrinsics useful to explain and sort through the available instructions). These instructions allow the CPU to operate on up to 256 bit (512 bit soon) chunks of data, thus transforming 32 single-byte MOV instructions into one 32-byte wide MOV instruction (e.g. the Intel C intrinsic _mm256_storeu_si256, or the corresponding instruction vmovdqu). SIMD instructions are good for all sorts of operations on vectors of data, or arrays, which is why the process of transforming element by element operations into SIMD instructions is also referred to as vectorization.
The actual assembly stub is generated dependent on CPU and available instruction set. For x86 the code is generated by the macroAssembler_x86.cpp, and the observant digger into the code will find it makes use of the widest memory instructions it can identify the processor is capable of. Wider is better baby! If you are not morbidly curious about what the implementation looks like, skip the next wall of assembly and you'll be back in Java land shortly.
Here's what the assembly boils down to when UseAVX>=2/UseSSE>=2/UseUnalignedLoadStores=true:

Roughly speaking the algorithm above is:
  1. Fill up an XMM register with the intended value
  2. Use the XMM register to write 64 byte chunks (2 vmovdqu) until no more are available
  3. Write leftover 32 byte chunk (skipped if no matching leftovers)
  4. Write leftover 8 byte chunks (skipped if no matching leftovers)
  5. Write leftover 4 bytes (skipped if no matching leftovers)
  6. Write leftover 2 bytes (skipped if no matching leftovers)
  7. Write leftover 1 bytes (skipped if no matching leftovers)
It ain't nice, but we do what we gotta for performance! There are variations of the above described across the internets as the done thing for a memset implementation, this might seem complex but is pretty standard... anyway, moving right along.

The Arrays.fill 'intrinsic'

Arrays.fill is different from System.arrayCopy because, as its absence from vmSymbols suggests, it's not a method substitution kind of intrinsic (so technically not an intrinsic). What is it then? Arrays.fill is a code pattern substitution kind of compiler shortcut, basically looking for this kind of loop:
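The shape being matched is roughly the following (the exact loop forms accepted are determined by the matching code, so take buffer/value as placeholders):

  // a simple forward loop storing the same value into every element
  for (int i = 0; i < buffer.length; i++) {
    buffer[i] = value;
  }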
And replacing it with a call into the JVM memset implementation (I recently learnt the same thing is done by GCC as well, see code to assembly here). The pattern matching bit is done in loopTransform.cpp. This feels enough like an intrinsic grey area that the method doing the pattern match and replace is called intrinsify_fill.
Pattern matching makes this optimization potentially far more powerful than method substitution as the programmer doesn't have to use a special JDK method to convey meaning, they can just express their meaning in code and the compiler 'knows' that this simple loop means 'fill'. Compare that with System.arrayCopy where rolling your own leads to performance that is much worse than that offered by the intrinsic.
Let's prove me right (my favourite thing, beats kittens and all that crap), here's a JMH (see the JMH reference page for more JMH info/examples) benchmark comparing Arrays.fill to a hand rolled fill, and System.arrayCopy to handrolled array copy:
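The benchmark code is linked above; a cut-down sketch of the byte variants looks something like this (the method names and the 32KB size match the results below, the rest is my reconstruction):

  import java.util.Arrays;
  import java.util.concurrent.TimeUnit;
  import org.openjdk.jmh.annotations.*;

  @State(Scope.Thread)
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public class ArrayFill {
    private static final int SIZE = 32 * 1024;
    private final byte[] bytes = new byte[SIZE];
    private final byte[] bytesDst = new byte[SIZE];

    @Benchmark
    public void fillBytes() {
      Arrays.fill(bytes, (byte) 1);
    }

    @Benchmark
    public void manualFillBytes() {
      for (int i = 0; i < bytes.length; i++) {
        bytes[i] = 1;
      }
    }

    @Benchmark
    public void manualReversedFillBytes() {
      // same work, but walking backwards defeats the fill pattern matching
      for (int i = bytes.length - 1; i >= 0; i--) {
        bytes[i] = 1;
      }
    }

    @Benchmark
    public void copyBytes() {
      System.arraycopy(bytes, 0, bytesDst, 0, SIZE);
    }

    @Benchmark
    public void manualCopyBytes() {
      for (int i = 0; i < SIZE; i++) {
        bytesDst[i] = bytes[i];
      }
    }
  }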
And the results are (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu, array is 32K in size)?
ArrayFill.fillBytes                561.540 ± 10.814 ns/op
ArrayFill.manualFillBytes          557.901 ± 5.255  ns/op
ArrayFill.manualReversedFillBytes 1017.856 ± 0.425  ns/op
ArrayFill.copyBytes               1300.313 ± 13.482 ns/op
ArrayFill.manualCopyBytes         1477.442 ± 13.030 ns/op

We can verify that the call out to the JVM fill method happens for fillBytes/manualFillBytes by printing out the assembly (e.g. by running with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly):

So what have we learnt so far:
  • Use System.arraycopy; it is better than your hand rolled loop. But surprisingly not hugely better, hmmm.
  • You don't have to use Arrays.fill; you can roll your own and it works the same. Notice the call out to the fill method. But...
  • Don't get too creative rolling your own. If you get too funky (like filling the array backwards) it'll fall apart and the 'intrinsic' won't happen. But do note that the reverse fill still has some of that good SIMD stuff going; we'll get to that in a sec.

Are The Other Types Filling The Love?

It all sounds great, don't it? Let's see how this pans out for other types. We'll be filling an array of 32KB. To be uniform across data types that means a 16K chars/shorts array, an 8K ints/floats array and a 4K longs/doubles array. I added an 8K array of objects, which is the same size with compressed oops on the Oracle JVM (reference size is 4 bytes, same as an int).
The JMH benchmark code is as you'd expect:
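Something along these lines, adding per-type methods to the same ArrayFill class sketched above (fill values are arbitrary; the sizes keep each array at 32KB of data):

// more fields/methods on the ArrayFill sketch from earlier
char[]   chars   = new char[16 * 1024];
short[]  shorts  = new short[16 * 1024];
int[]    ints    = new int[8 * 1024];
float[]  floats  = new float[8 * 1024];
long[]   longs   = new long[4 * 1024];
double[] doubles = new double[4 * 1024];
Object[] objects = new Object[8 * 1024]; // 4 byte references with compressed oops

@Benchmark public void fillChars()   { Arrays.fill(chars, 'c'); }
@Benchmark public void fillShorts()  { Arrays.fill(shorts, (short) 1); }
@Benchmark public void fillInts()    { Arrays.fill(ints, 1); }
@Benchmark public void fillFloats()  { Arrays.fill(floats, 1f); }
@Benchmark public void fillLongs()   { Arrays.fill(longs, 1L); }
@Benchmark public void fillDoubles() { Arrays.fill(doubles, 1d); }
@Benchmark public void fillObjects() { Arrays.fill(objects, "o"); }
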
Here's some reasonable expectations:
  • If no optimizations are present, wider writes are more efficient. It follows that the longFill would be the fastest. But...
  • Given a clever compiler the fill loop is replaced with the widest writes possible, so there should be no significant difference. But the fill optimization does not cover double/long/object arrays, so we might expect longFill to be the worst performer.
  • An objects array is not that different from an int array, so performance should be similar. Sure, there's a write barrier, but it need only be done once per card (not once for the whole array as I thought initially, god bless Shipilev and PrintAssembly), so that's an extra byte write per card of elements filled. A card covers 512 bytes and each element is 4 bytes, so that's one card per 128 elements. Given there is no fill method implemented for it, we may expect it to be slightly worse than the longFill.
  • We should not rely on expectations, because performance is best measured.
As you'd expect the results are somewhat different than the expectations (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu):
ArrayFill.fillBytes     561.540 ± 10.814  ns/op
ArrayFill.fillChars     541.901 ±  4.833  ns/op
ArrayFill.fillShorts    532.936 ±  4.508  ns/op
ArrayFill.fillInts      543.165 ±  3.788  ns/op
ArrayFill.fillFloats    537.599 ±  2.323  ns/op
ArrayFill.fillLongs     326.770 ±  3.267  ns/op
ArrayFill.fillDoubles   346.840 ±  5.786  ns/op
ArrayFill.fillObjects  4388.242 ± 11.945  ns/op

Say WOT?
For bytes/chars/shorts/ints/floats Arrays.fill performs very similarly. This much is as expected from the second point above. But filling an array of longs/doubles is better than the others. The funny thing is, there's no fill function implemented for the long array, so how come it is so darn quick? Also, why does the objects fill suck quite so badly when compared with the rest (I will not be addressing this last question! I refuse! this post is too fucking long as it is!)?
This is what happens when we turn off the OptimizeFill flag (-XX:-OptimizeFill):
ArrayFill.fillBytes    1013.803 ± 0.227  ns/op
ArrayFill.fillChars     323.806 ± 3.879  ns/op
ArrayFill.fillShorts    323.689 ± 4.499  ns/op
ArrayFill.fillInts      326.336 ± 1.559  ns/op
ArrayFill.fillFloats    319.449 ± 2.048  ns/op
ArrayFill.fillLongs     328.692 ± 3.282  ns/op
ArrayFill.fillDoubles   345.035 ± 6.362  ns/op
ArrayFill.fillObjects  4397.130 ± 7.161  ns/op

Strange, innit? Now we've got char/int/long arrays all performing similarly. In fact, with the exception of the byte array, everything is better than it was with the optimization.
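For what it's worth, flipping these flags doesn't require anything exotic; one way (a sketch, the class name and fork count are just an illustration) is to pin the flag on a benchmark class with JMH's @Fork, and the same trick covers the -XX:-UseSuperWord run further down:

import java.util.Arrays;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
// append "-XX:-UseSuperWord" as well to reproduce the later run
@Fork(value = 5, jvmArgsAppend = {"-XX:-OptimizeFill"})
public class ArrayFillNoOptimizeFill {
  byte[] bytes = new byte[32 * 1024];

  @Benchmark
  public void fillBytes() {
    Arrays.fill(bytes, (byte) 1);
  }
}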

Superword to the rescue! 

Turns out the JIT compiler is clued up on the topic of SIMD parallelisation by way of Superword Level Parallelism (see the original paper here):
In some respects, superword level parallelism is a restricted form of ILP (Instruction Level Parallelism). ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.
Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location.
Vector parallelism is a subset of superword level parallelism.
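In plain terms, the kind of loop SLP (and plain vectorization) goes after is a simple counted loop doing independent element-wise work; a sketch (assuming same-length arrays and no exotic control flow):

// the sort of loop the SuperWord pass can pack into SIMD operations:
// adjacent iterations do independent, identical work on neighbouring elements
static void addArrays(int[] a, int[] b, int[] c) {
  for (int i = 0; i < a.length; i++) {
    c[i] = a[i] + b[i];
  }
}
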
The Hotspot compiler implements SLP optimizations in superword.cpp and you are invited to dive into the implementation if you like. I'm going to focus on its impact here, and to do that I only need to know how to turn it on and off (core competency for any software person). It's on by default, so the above results are what happens when it is on; here's what life looks like when it is off too (so -XX:-OptimizeFill -XX:-UseSuperWord):
ArrayFill.fillBytes   8501.270 ±  2.896  ns/op
ArrayFill.fillChars   4286.161 ±  4.935  ns/op
ArrayFill.fillShorts  4286.168 ±  3.146  ns/op
ArrayFill.fillInts    2152.489 ±  2.653  ns/op
ArrayFill.fillFloats  2140.649 ±  2.587  ns/op
ArrayFill.fillLongs   1105.926 ±  2.228  ns/op
ArrayFill.fillDoubles 1105.820 ±  2.393  ns/op
ArrayFill.fillObjects 4392.506 ± 11.678  ns/op

Life is revealed in all its sucky splendour! This is what happens when the compiler shows you no love... did I say no love? Hang on, things can get a bit worse.

Detour: Unsafe? We don't serve your kind here

To all the Unsafe fans, I got some sad news for y'all. Unsafe 'fill' loops are not well loved by the compiler. This is the price of stepping off the beaten path I guess. Consider the following benchmark:
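A sketch of such a benchmark, hand rolling the fills over Unsafe puts (the method names match the results below; the setup, including grabbing Unsafe via reflection, is assumed):

// more additions to the ArrayFill sketch; requires
// import sun.misc.Unsafe; and import java.lang.reflect.Field;
static final Unsafe UNSAFE;
static {
  try {
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    UNSAFE = (Unsafe) f.get(null);
  } catch (Exception e) {
    throw new AssertionError(e);
  }
}

final long onHeapOffset = UNSAFE.arrayBaseOffset(byte[].class);
final long offHeap = UNSAFE.allocateMemory(SIZE);

@TearDown
public void tearDown() {
  UNSAFE.freeMemory(offHeap);
}

@Benchmark
public void unsafeFillOnHeapBytes() {
  for (int i = 0; i < SIZE; i++) {
    UNSAFE.putByte(bytes, onHeapOffset + i, (byte) 1);
  }
}

@Benchmark
public void unsafeFillOffheapBytes() {
  for (int i = 0; i < SIZE; i++) {
    UNSAFE.putByte(offHeap + i, (byte) 1);
  }
}
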
The results are:
ArrayFill.unsafeFillOffheapBytes      9742.621 ±  2.270  ns/op
ArrayFill.unsafeFillOnHeapBytes      12640.019 ±  1.977  ns/op
ArrayFill.fillBytes (for reference)    561.540 ± 10.814  ns/op

The Unsafe variants do not enjoy the 'fill' pattern matching magic, nor do they get the SuperWord optimizations. What can you do? For this kind of thing you should use the Unsafe.setMemory method instead:
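Continuing the sketch above, the setMemory variants are one-liners (both overloads exist on sun.misc.Unsafe; the on-heap one takes the array and its base offset):

@Benchmark
public void unsafeSetOnHeapBytes() {
  UNSAFE.setMemory(bytes, onHeapOffset, SIZE, (byte) 1);
}

@Benchmark
public void unsafeSetOffheapBytes() {
  UNSAFE.setMemory(offHeap, SIZE, (byte) 1);
}
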
With the result:
ArrayFill.unsafeSetOffheapBytes   1259.281 ± 21.294  ns/op
ArrayFill.unsafeSetOnHeapBytes    1275.158 ± 27.950  ns/op
Not quite there, still ~2x worse (why? How come it doesn't just call the byte fill method? A bit of digging shows it ends up calling the underlying platform's memset...), but that beats being 20-25x worse like the hand rolled method is.

Summary and Musings

It's the circle of life!
So what did we learn:
  • There's another kind of 'intrinsic'-like optimization, which uses pattern matching to swap out a block of code rather than a method. This is employed for the intrinsification of memset-like memory fill loops (in particular Arrays.fill). It's not an intrinsic technically, but you know what I fucking mean.
  • System.arraycopy/Arrays.fill implementations utilize SIMD instructions to improve their efficiency. These instructions are not available in plain Java, so some compiler intervention is required.
  • The JIT compiler is also able to use SuperWord Level Parallelism to derive SIMD code from 'normal' sequential code.
  • In the case of Arrays.fill, it looks like the SuperWord optimized code is faster than the fill specialized implementation for all types except bytes (on the system under test).
  • If you use Unsafe you will be excluded from these optimizations.
So I look at this process and I imagine history went something like this:
We want to use SIMD instructions, but the JIT compiler isn't really clever enough to generate them by itself. Memset implementations are rather specialized after all. Let's make life a bit easier for the compiler by creating an intrinsic. We'll even go the extra mile and make an effort to automatically identify opportunities to use this intrinsic, so now it's not really an intrinsic any more. The Arrays.fill optimization is available on Oracle JDK6u45 (the oldest I keep around, maybe it was there a while before that) and on that JVM it is twice as fast as the SLP generated code.
Over time SLP gets better, and eventually the compiler is good enough to optimize the fill loop by itself and beat the specialized method. That is an awesome thing. We just need to remove the training wheels now.
And there's a final punch line to this story. Memset/memcpy are such common and important optimization opportunities that Intel has decided to offer an assembly 'recipe' for them and save everyone the effort of writing their own:
3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)
Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1. - [From the Intel Optimization Manual (September 2014)]
From the manual it seems that this method of implementing memcpy/memset can perform well, but like anything else, YMMV (the Intel manual's discussion of the performance differences is in itself interesting, both on the results and the methodology level). One obvious advantage of this method is that it results in much, much smaller code that should be trivial to inline into callers. This will however put the SuperWord method at a slight disadvantage, and the tide will change again.
[UPDATE 14/03/2015: It seems the good folks of Oracle have considered and rejected the use of REP MOVSB for array copy.]
Thanks go to the kind reviewers Peter 'Massive' Hughes, Darrach and the Shipster