Monday 3 November 2014

The Mythical Modulo Mask

It is an optimisation well known to those who know it well that % by a power of 2 can be replaced by a much cheaper AND with that same power of 2 minus 1. E.g.:
x % 8 == x & (8 - 1)
[4/11/2014 NOTE] This works because the binary representation of N, where N is a power of 2, has a single bit set to 1, and (N-1) has all the bits below it set to 1 (e.g. 8 = 00001000, 8-1 = 00000111). When we do x AND (N-1), only the bits that make up the remainder of x / N survive the mask.
[4/11/2014 NOTE + slight spoiler: this only works when x >= 0]
[15/02/2016 NOTE: It has been pointed out that % is not modulo, but remainder. Why everyone calls it modulo I'm not sure]
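To make the notes above concrete, here's a quick self-contained illustration of both the identity and the negative-number caveat:

  public class MaskVsModulo {
      public static void main(String[] args) {
          int x = 10;
          System.out.println((x % 8) + " == " + (x & (8 - 1)));  // 2 == 2, identity holds for x >= 0
          int y = -10;
          // Java's % keeps the sign of the dividend, the mask does not:
          System.out.println((y % 8) + " != " + (y & (8 - 1)));  // -2 != 6, identity breaks for x < 0
      }
  }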
The reason the & is so much cheaper is that while % is implemented using the DIV instruction, & is just AND, and as it turns out DIV is expensive and AND is cheap on x86 CPUs (and in other places too, I think). The optimisation is used in the Disruptor as well as the JCTools circular array queues, and in ArrayDeque and other JDK classes. Is it time to replace % with & everywhere in your code which has this opportunity?
[4/11/2014 NOTE] The technical term for this sort of optimisation is Strength Reduction.

Starting Simple

Let's start with some basic benchmarks:
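Something along these lines (a minimal JMH sketch; the exact field setup is my assumption, the method names match the results below):

  import java.util.concurrent.TimeUnit;
  import org.openjdk.jmh.annotations.*;

  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  @State(Scope.Thread)
  public class ModuloMaskBenchmark {
      int[] array = new int[16];
      int mask = array.length - 1; // pre-calculated mask field
      int index = 7;

      @Benchmark public int moduloLengthNoMask()         { return index % array.length; }
      @Benchmark public int moduloConstantLengthNoMask() { return index % 16; }
      @Benchmark public int moduloLengthMask()           { return index & (array.length - 1); }
      @Benchmark public int moduloMask()                 { return index & mask; }
      @Benchmark public int consume()                    { return index; } // baseline: just return a value
      @Benchmark public void noop()                      { }               // baseline: empty method
  }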
And the results (on JDK8u5/E5-2697 v2 @ 2.70GHz/-XX:-UseCompressedOops for consistency between assembly and results):
  Benchmark                   Score   error  Units
  moduloLengthNoMask          3.448 ± 0.007  ns/op
  moduloConstantLengthNoMask  1.150 ± 0.002  ns/op
  moduloLengthMask            1.030 ± 0.006  ns/op
  moduloMask                  0.862 ± 0.001  ns/op
  consume                     0.719 ± 0.001  ns/op
  noop                        0.287 ± 0.000  ns/op

So, pretty much as expected, the modulo operation is far more expensive than the mask:
  • The clever JIT is aware of the optimisation opportunity and will replace a constant % with the &. It is not a perfect replacement, but pretty close.
  • At this sort of low single-digit ns benchmark we can’t make a statement such as “modulo is 4 times more expensive”, because the same machine produces a baseline of 0.287ns/op for the noop benchmark and 0.719ns/op for the consume benchmark. If we deduct the consume result from the other scores we see roughly a 1 : 19 ratio between the costs (2.729 vs. 0.143 ns/op). Is that a good way to model performance? Not really either: performance is not additive, so simply subtracting one cost from the other doesn't really work at this scale. The truth is somewhere fuzzy in between, and if we really care we should look at the assembly.
  • It seems that using a pre-calculated mask field is more awesome than using the "array length - 1" as a mask. That is consistent with the expectation that the re-calculation of the mask on the fly, as well as loading the value to be used for that calculation, is more expensive than using the pre-calculated field.
I love it when a plan comes together...

Going Deeper

The reason we wanted the modulo in the first place was to read from the array, right? So let’s try that:
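Again as a sketch, with the reads added to the same benchmark class as above:

  // added to the ModuloMaskBenchmark sketch above
  @Benchmark public int readByLengthNoMask()         { return array[index % array.length]; }
  @Benchmark public int readByConstantLengthNoMask() { return array[index % 16]; }
  @Benchmark public int readByMask()                 { return array[index & mask]; }
  @Benchmark public int readByLengthMask()           { return array[index & (array.length - 1)]; }
  @Benchmark public int readNoMask()                 { return array[index]; }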
And the results:
  Benchmark                   Score   error  Units
  readByLengthNoMask          3.736 ± 0.005  ns/op
  readByConstantLengthNoMask  1.437 ± 0.001  ns/op
  readByMask                  1.347 ± 0.022  ns/op
  readByLengthMask            1.181 ± 0.049  ns/op
  readNoMask                  1.175 ± 0.004  ns/op
Well, what’s this I see? "length-1" mask is leading the chart! How’d that happen?
To quote from the famous “Jack and the FlumFlum Tree”:
“Don’t get your knickers in a twist!” said Jack,
“Let’s have a look in the patchwork sack.”
Let's start with the generated assembly for the constant modulo:
I didna see that one coming! The modulo on a constant is not your garden variety & mask affair, since it turns out our original assertion about the mask/modulo equivalence is only true for non-negative numbers. The JIT in its wisdom is dealing with the negative case by doing (x = -x; x = x & 15; x = -x;).
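In Java terms, the shape of what the JIT emits for index % 16 is roughly this (a sketch of the semantics, not the actual instructions):

  static int moduloConstant16(int x) {
      if (x >= 0) {
          return x & 15;    // fast path: the plain mask works for non-negative x
      }
      return -((-x) & 15);  // negative x: negate, mask, negate back
  }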
I think the above case could be made a tiny bit faster by switching the branch around (so the jump is taken for the negative value). It’s easy, however, to see what happens if we simplify the constant version further by using a constant mask:
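Sketched as before (the constant 16 matches the & 15 above):

  @Benchmark public int moduloConstantLengthMask() { return index & (16 - 1); }
  @Benchmark public int readByConstantLengthMask() { return array[index & (16 - 1)]; }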
And results:
  Benchmark                   Score   error  Units
  moduloConstantLengthNoMask  1.150 ± 0.002  ns/op
  moduloConstantLengthMask    0.860 ± 0.001  ns/op
  readByConstantLengthNoMask  1.437 ± 0.001  ns/op
  readByConstantLengthMask    1.209 ± 0.017  ns/op
So, minor joy on the modulo; the constant-mask read beats the plain pre-calculated mask and is nearly as good as the "length-1" mask. Oh well, let's move on.
The big surprise was the version where the mask is calculated on the fly from the array length. How can calculating the mask on the fly, which seemed to be slower, end up being faster when reading from the array? Who feels like more assembly?
I was hoping the JVM was clever enough to remove the array bound checks, but that didn’t happen. What’s happening here is that the length load serves the purpose of both creating the mask and checking the bounds. This is not the case for the pre-calculated mask version, where we load the mask for the index calculation and the length for the bounds check, thus paying for 2 loads instead of one:
So removing the computation did not make a difference, because the bounds check requires the extra load of the length anyhow. Can we make the bounds check go away? Of course we can, but it’s Unsafe!!! Let’s do it anyways!
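Here's a sketch of what the Unsafe variant might look like (the reflective grab of sun.misc.Unsafe is the usual trick; UNSAFE and INT_BASE are my names):

  import java.lang.reflect.Field;
  import sun.misc.Unsafe;

  // added to the benchmark class:
  static final Unsafe UNSAFE;
  static final long INT_BASE = Unsafe.ARRAY_INT_BASE_OFFSET;
  static {
      try {
          Field f = Unsafe.class.getDeclaredField("theUnsafe");
          f.setAccessible(true);
          UNSAFE = (Unsafe) f.get(null);
      } catch (Exception e) {
          throw new RuntimeException(e);
      }
  }

  @Benchmark public int unsafeReadByMask() {
      // no bounds check: base offset + masked index scaled to bytes
      return UNSAFE.getInt(array, INT_BASE + ((index & mask) << 2));
  }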
The assembly:

Shazzam! No bounds check, but look at all the work that’s gone into the unsafe read of the array. It would have been so much better if the unsafe read enjoyed the same addressing mode as normal array reads, like so: “r8d,DWORD PTR [r9+r10*4+0x18]”, but it seems the JIT compiler is not recognising the opportunity here. What’s the performance like?
  Benchmark                   Score   error  Units
  readByMask                  1.347 ± 0.022  ns/op
  readByLengthMask            1.181 ± 0.049  ns/op
  readNoMask                  1.175 ± 0.004  ns/op
  unsafeReadByMask            1.152 ± 0.001  ns/op

This is even better than no mask at all. Yay?
Well… sort of. If you mean to have the fastest ‘get’ from an array whose size is not an application constant, then this is a mini-win. In particular it saves you a load of the array length in this case, and loads can cost anything really. In the case where the index and mask are long we can get better code generated:
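A sketch of that variant (the longIndex and longMask fields are my naming, reusing the UNSAFE setup from before):

  long longIndex = 7;
  long longMask = array.length - 1;

  @Benchmark public int unsafeReadByLongMask() {
      // the offset computation stays in long throughout, no widening step
      return UNSAFE.getInt(array, INT_BASE + ((longIndex & longMask) << 2));
  }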
But performance is much the same for this case. Seems like there’s not much left to win here.
For completeness’ sake we can compare the no-mask result with an Unsafe equivalent:
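Continuing the sketch:

  @Benchmark public int unsafeReadByNoMask() {
      return UNSAFE.getInt(array, INT_BASE + (index << 2));
  }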
  Benchmark                   Score   error  Units
  unsafeReadByNoMask          1.038 ± 0.022  ns/op
  readNoMask                  1.175 ± 0.004  ns/op

So it seems slipping past the array bounds check is worth something, but is it generally worth it? What if we weren't dealing with just the one element?

Bound Check Elimination

Looking at the above optimisation, we need to accept that it is probably only worth it if the array bounds check happens on every access. If we now compare a sum over an array:
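A sketch of the three loops (reusing the UNSAFE and INT_BASE setup from before):

  @Benchmark public int loopOverArrayStraight() {
      int sum = 0;
      for (int i = 0; i < array.length; i++) {
          sum += array[i];
      }
      return sum;
  }

  @Benchmark public int loopOverArrayUnsafeInt() {
      int sum = 0;
      for (int i = 0; i < array.length; i++) {
          sum += UNSAFE.getInt(array, INT_BASE + (i << 2));
      }
      return sum;
  }

  @Benchmark public int loopOverArrayUnsafeLong() {
      int sum = 0;
      for (long i = 0; i < array.length; i++) {
          // the long index forces a widened offset computation
          sum += UNSAFE.getInt(array, INT_BASE + (i << 2));
      }
      return sum;
  }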

We get the following results (length=100):
  Benchmark                    Score    error  Units
  loopOverArrayStraight        26.855 ± 0.060  ns/op
  loopOverArrayUnsafeInt       41.413 ± 0.056  ns/op
  loopOverArrayUnsafeLong      76.257 ± 0.171  ns/op
Oh Unsafe, why you so sucky sucky? How come the unsafe versions suck so significantly? Isn’t Unsafe the cure to all performance problems?
Once the bounds check is eliminated by the JIT we can see that for the UnsafeInt version we have the same issue with addressing conversion, only now the cost is not compensated for by the bounds check removal. The UnsafeLong version is even worse; how come?
The generated loop for the int case is long and boring because it’s unrolled; the long case is pretty small:
2 'bad' things just happened:
  1. Addressing didn’t work out the way we’d like. Instead of the desired “mov    r11d,DWORD PTR [r9+rdi*4+0x18]” we get a two-stage setup where we do “lea    r10,[r9+rdi*4]” and then “add    r11d,DWORD PTR [r10+0x18]”. Bummer.
  2. We got a safe point poll in the loop. This happens because long-indexed loops are considered potentially very long (as opposed to shorter int loops; a heuristic for time to safe point) and so include a safe point poll.
So we want to fix the addressing mode and stick to having an int index. If we were to insist on using Unsafe (perhaps because we are trying to do this with off-heap memory instead of an array) we’d have to do this:
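Something along these lines (again a sketch; note the int loop counter paired with a separately maintained long byte offset):

  @Benchmark public int loopOverArrayUnsafePointer() {
      int sum = 0;
      long offset = INT_BASE;
      // the int counter keeps the safe point poll out of the loop; advancing
      // the offset by 4 bytes per element avoids scaling on each access
      for (int i = 0; i < array.length; i++) {
          sum += UNSAFE.getInt(array, offset);
          offset += 4;
      }
      return sum;
  }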
[4/11/2014 NOTE] Note that what we really want here is more than just getting rid of the multiplication/widening; we want the JIT to identify the expression calculated for the offset as relative array access and pick the correct addressing mode for the MOV to use. There are clever people out there trying to make sure this will work better in the future.
This removes the need for a safe point poll and simplifies addressing to the point where we nearly match the iteration over the array case (length=100):
  Benchmark                    Score    error  Units
  loopOverArrayStraight        26.855 ± 0.060  ns/op
  loopOverArrayUnsafePointer   27.377 ± 0.049  ns/op
We can explore the relationship between the implementations by testing different array sizes:
  Benchmark (ns/op)            10     100     1000     10000
  loopOverArrayStraight        4.3    26.8    289.2    2883.7
  loopOverArrayUnsafePointer   4.8    27.3    296.1    2886.4

So it seems that the smaller the array, the more relative advantage the array iteration has when iterating in this fashion. This should not really be surprising; there's nothing here to confuse the JIT compiler, and iterating over arrays is important enough to optimise. We have to work hard to get close to the JIT compiler when it does what it does best.


Summary

We had a simple optimisation in mind: replace a % with an &:
  • We observed that for the case where constants are used, the JIT is able to perform that optimisation for us almost as well as we’d do ourselves (we have no way of specifying a positive-only modulo, i.e. a uint).
  • We proved the viability of the optimisation in 2 variations: using a pre-calculated mask field and using (array.length - 1).
  • Using the optimisation in the context of a circular array read showed an interesting reversal in performance. The cause of this reversal is that the array.length load required for the bounds check is reused as the mask input in the on-the-fly version, whereas the pre-calculated mask version pays for an extra load.
  • Using Unsafe we managed to bypass the array bounds check and get the best result using the mask for a single read.
  • When we tried the same method naively in a loop (over the whole array), the array bounds check was eliminated and plain old array access was the best performer.
  • To regain the performance for Unsafe access we had to tweak the code to avoid safe point polling as well as to get the addressing mode we want in the resulting assembly. Even then, plain array access is better for smaller arrays.
Simple innit?
Some notes on methodology:
  • I ran the same experiments on different Intel generations; you get different results, but the assembly remains the same. E.g. on older CPUs the maximum instructions per cycle is lower than on the Ivy Bridge CPU I've used here, which leads to instructions spilling over to the next cycle; the L1 latency could be higher, leading to loads dominating the costs, etc. This ends up giving a slightly different balance of compute vs. memory load, but the overall analysis holds.
  • Using -XX:-UseCompressedOops was done for the sake of consistent assembly and results. Using compressed oops makes the loads look a bit clumsier, and I wanted to have less to explain. But running with the flag on (as it is by default) also affects results at this scale. In particular, because compressed oops require a shift and shifters are a limited resource on the CPU (1 on Westmere, 2 on Ivy Bridge), it can end up adding a cycle to the results.
  • Running these same experiments on a laptop was good for getting the assembly out and a vague sense of scale for results, but measurements had far greater error in that environment. Also note that laptops and desktops tend to be a generation ahead of servers where processors are concerned.
  • An interesting experiment would be to look at the same experiment with the JMH perfasm profiler. I did that, but could not figure out how to get Intel syntax out of it, and so for consistency's sake stuck with what I had. Left as an exercise to the reader :P
Many thanks to J.P. Bempel and Peter Hughes for reviewing, any issues remaining were added by me after they reviewed the post.