Saturday, July 12, 2014

To vectorize or not to vectorize - that's the question

Some observations on SIMD performance on the A7

I have recently been dabbling a bit with an old passion of mine: computing deep zooms of the Mandelbrot fractal. The core of my (very proprietary) routine consists of this computation:

y = (a.*b >> n) + c

All terms are 2-element vectors, but with variable precision - from 32 bit up to, in principle, infinity in multiples of 32 bit (limited by machine memory, of course). The multi-precision multiplication, addition and shift call for a lot of 32 bit numbers being multiplied to form 64 bit numbers, and the additions are done as 32+32=64 bit to ease carry computation and keep the code in pure C.
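
To make that concrete, here is a minimal schoolbook multiply built from exactly those operations - a sketch for illustration, not my actual routine (with N a compile-time constant, the compiler can unroll both loops completely):

#include <stdint.h>

#define N 4  /* words per operand, i.e. 128 bit precision */

/* r = a * b, schoolbook style: every partial product is a 32x32 -> 64 bit
   multiply, and carries are propagated with plain 64 bit additions */
static void mpmul(uint32_t r[2 * N], const uint32_t a[N], const uint32_t b[N])
{
    for (int k = 0; k < 2 * N; k++)
        r[k] = 0;
    for (int i = 0; i < N; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < N; j++) {
            /* the 64 bit sum cannot overflow: (2^32-1)^2 + 2*(2^32-1) < 2^64 */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;   /* low half stays in place */
            carry = t >> 32;          /* high half ripples upward */
        }
        r[i + N] = (uint32_t)carry;   /* final carry of this row */
    }
}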

Intuitively this should map well to the NEON (and the new AArch64 SIMD) instruction set on the A7. Therefore I took a stab at implementing the code using the ARM NEON intrinsics.

When Apple introduced the A7 processor, all pure assembly NEON code could no longer be used, because the old NEON instructions no longer exist in ARM64 mode. Apple therefore now recommends using intrinsics, as the intrinsics found in arm_neon.h work both on the A7 and on the previous 32 bit processors. Looking at the generated assembly, the intrinsics map pretty much 1:1 to machine code, which removes the old complaint about intrinsics doing weird stuff to your code. This is a good thing, as it allows us to write the code using functions rather than assembly notation, which is quite a bit easier.
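
For example, one would expect a vmull_u32 call to show up in the ARM64 disassembly as a single vector multiply (umull vd.2d, vn.2s, vm.2s), with no extra shuffling around it.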

To test performance I have implemented a set of macros and inline functions that map the operations I need (multiplication, addition, shift, etc.) to either an intrinsic or a pure C function. As an example, the unsigned multiply of two 32 bit numbers into a 64 bit number looks like this for NEON:

#define vmull(_x,_y) vmull_u32(_x,_y)

And like this in pure C:

static inline v64x2_t vmull(v32x2_t x, v32x2_t y){
    v64x2_t v;
    v.r = x.r * (uint64_t)y.r;  /* widen first, so the multiply is 32x32 -> 64 */
    v.i = x.i * (uint64_t)y.i;
    return v;
}
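
The v32x2_t and v64x2_t types above are my own; a minimal sketch of how they could be declared (the NEON build would simply alias the corresponding types from arm_neon.h):

#include <stdint.h>

/* pure C build: two lanes, one for the real and one for the imaginary part */
typedef struct { uint32_t r, i; } v32x2_t;
typedef struct { uint64_t r, i; } v64x2_t;

/* NEON build (sketch):
   typedef uint32x2_t v32x2_t;
   typedef uint64x2_t v64x2_t; */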

This allows me to compare the performance of the code I write both when executed on the SIMD units (of which there supposedly are two in the A7) and on the normal integer data paths (of which there supposedly are four). Theoretically the two kinds of code have the same compute bandwidth for 64 bit numbers, but a different number of registers, so it is interesting to see how they compare.
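
Switching between the two implementations is then just a compile-time choice; a sketch of the mechanism (USE_NEON is a made-up flag name):

#ifdef USE_NEON
  #include <arm_neon.h>
  #define vmull(_x,_y) vmull_u32(_x,_y)  /* widening multiply intrinsic */
  #define vaddl(_x,_y) vaddl_u32(_x,_y)  /* widening add intrinsic */
#else
  /* the pure C fallbacks, like the vmull function shown above, go here */
#endif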

I have done fully unrolled versions of the routine for all bit sizes from 32 to 1024 (1 to 32 words), and I built the code with -O3, allowing the compiler to apply all possible optimizations to the standard C code as well. The code is very light on memory access: it tries to keep all results in registers, reading each operand only once and writing the result to memory only once. This is what I found:

[Chart: NEON performance relative to the scalar C code (scalar = 100) for precisions of 1 to 32 words]

Index 100 is the scalar code, meaning that the NEON code is (sometimes very much) slower up to a precision of 8 words. Looking at the assembly, this coincides with the point where the normal C code runs out of registers and stack spills start to happen. At 32 words the NEON code is 30% faster, but it is not winning by a lot. To me this shows the power of the new ARM64 instruction set: the core is simply able to get a lot of things done without the NEON extensions. It also shows that it is somewhat tricky to apply NEON optimizations to this kind of integer code - whether it pays off depends very much on the workload. In my case I now only run the NEON code for routines operating on more than 8 words, as it doesn't pay off below that.
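
In code, that crossover is just a word-count check (a sketch - mpmul_neon and mpmul_c are made-up names for the two builds of the routine):

void mpmul_neon(uint32_t *y, const uint32_t *a, const uint32_t *b, int words);
void mpmul_c(uint32_t *y, const uint32_t *a, const uint32_t *b, int words);

static void mpmul_dispatch(uint32_t *y, const uint32_t *a, const uint32_t *b, int words)
{
    if (words > 8)
        mpmul_neon(y, a, b, words);  /* SIMD pays off beyond 8 words */
    else
        mpmul_c(y, a, b, words);     /* below that, scalar ARM64 code wins */
}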

The case for a desktop Ax processor

When Apple presented the A7 processor, they claimed this was a desktop class processor. And with what I saw for integer performance, I thought it would be fun to compare it to my Ivy Bridge i7 from late 2011 running at 2.2 GHz. I wanted to compare both scalar and SIMD (SSE4) performance, and since my own small vector math library maps nicely to SSE4 as well, it was only a matter of writing another set of macros for SSE.
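
As a sketch of what such a mapping could look like: if each 2-element vector keeps its 32 bit values in the low halves of the two 64 bit lanes of an __m128i, then SSE2's _mm_mul_epu32 multiplies exactly those low halves into two full 64 bit products:

#include <emmintrin.h>

/* SSE build of vmull: lanes 0 and 2 hold the 32 bit real and imaginary
   parts; the two 64 bit products fill the whole register */
#define vmull(_x,_y) _mm_mul_epu32(_x,_y)

Running the same benchmark, this is what I got:

[Chart: scalar and SIMD performance, A7 vs. Core i7, across the same workloads]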
There are two interesting things here. SSE4 consistently improves the performance on the Core i7 for all workloads - sometimes a lot and sometimes only a little. But note the very heavy 1024 bit workload to the right: here the scalar code is only 20% slower on the A7 - and we are talking a passively cooled processor running at 1.3 GHz versus a 2.2 GHz processor with all fans on (literally). This is indeed very impressive. But SSE really works as well: the i7 seems to be doing a much better job at keeping its vector unit busy. Although the A7 is out-of-order, the instruction scheduling still seems to have some way to go here.

For fun I tried to account for the difference in clock speed by simply scaling the A7 results by 2.2/1.3 ≈ 1.7 (a very simplistic approach that assumes performance scales linearly with frequency - I know) to see what an imaginary 2.2 GHz A7 would be able to do:

[Chart: the same comparison with the A7 results scaled to a hypothetical 2.2 GHz clock]

For all the workloads the A7 would perform on par with the i7 when we talk scalar code, and for the very heavy workload it would even be on par when it comes to SIMD instructions, while its scalar code would run much faster. A 2.2 GHz A7 would probably not be passively cooled, but it would definitely not run as hot as the i7 either. I would be surprised if we don't see an ARM based MacBook Air at some point.