New region multiply technique using GFNI Affine instruction
I thought I'd post this as it may interest developers or anyone looking at this project.
Intel has announced several new SIMD extensions in their upcomming Icelake processors (expected 2019). Of particular interest, is the GFNI (Galois Field New Instructions) extension, which includes three new instructions:
GF2P8MULB. Details can be found in the following reference: https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
GF2P8MULB is a fairly self explanatory vectored modular 8-bit multiply, but using a fixed polynomial 0x11D. I can't quite think of a use for
GF2P8AFFINEINVQB as it seems too specific.
GF2P8AFFINEQB instruction is what I'm interested here, as it can be used to perform arbitrary 8-bit multiplies using any polynomial, and, using the split multiplication idea, can be applied to larger word sizes.
I have implemented this idea for
w=16 using ALTMAP memory arrangment here as an example of the idea. The core idea is somewhat similar to
SPLIT 8 8 but in SIMD form. The concept should work for other
w sizes supported by GF-Complete, as well as non-ALTMAP memory arrangements (just requires additional pack/unpack operations in the loop).
Performance is unknown until someone is able to test on an actual processor. If we compare the
SPLIT x 4 algorithms, which are currently the fastest in GF-Complete, this technique processes all bits in each vector per
GF2P8AFFINEQB instruction, whilst
SPLIT x 4 only processes half, per
PSHUFB invokation. If the
GF2P8AFFINEQB instruction has the same throughput as
PSHUFB (which doesn't seem unreasonable to me), this could end up being the fastest technique available, as it only needs to execute half the number of instructions.
Regardless, considering GF-Complete's stated interest in implementing every known technique for Galois Field multiplication, I thought it might be ideal to have this in GF-Complete.
If there's interest in such, I may be able to supply some implementations (likely 128-bit only unless GF-Complete is interested in supporting wider vectors).