### Issue #14

#### New region multiply technique using GFNI Affine instruction

I thought I'd post this as it may interest developers or anyone looking at this project.

Intel has announced several new SIMD extensions in their upcomming Icelake processors (expected 2019). Of particular interest, is the GFNI (Galois Field New Instructions) extension, which includes three new instructions: `GF2P8AFFINEINVQB`

, `GF2P8AFFINEQB`

, `GF2P8MULB`

. Details can be found in the following reference: https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

The `GF2P8MULB`

is a fairly self explanatory vectored modular 8-bit multiply, but using a fixed polynomial 0x11D. I can't quite think of a use for `GF2P8AFFINEINVQB`

as it seems too specific.

The `GF2P8AFFINEQB`

instruction is what I'm interested here, as it can be used to perform arbitrary 8-bit multiplies using any polynomial, and, using the split multiplication idea, can be applied to larger word sizes.

I have implemented this idea for `w`

=16 using ALTMAP memory arrangment here as an example of the idea. The core idea is somewhat similar to `SPLIT 8 8`

but in SIMD form. The concept should work for other `w`

sizes supported by GF-Complete, as well as non-ALTMAP memory arrangements (just requires additional pack/unpack operations in the loop).

Performance is unknown until someone is able to test on an actual processor. If we compare the `SPLIT x 4`

algorithms, which are currently the fastest in GF-Complete, this technique processes all bits in each vector per `GF2P8AFFINEQB`

instruction, whilst `SPLIT x 4`

only processes half, per `PSHUFB`

invokation. If the `GF2P8AFFINEQB`

instruction has the same throughput as `PSHUFB`

(which doesn't seem unreasonable to me), this could end up being the fastest technique available, as it only needs to execute half the number of instructions.

Regardless, considering GF-Complete's stated interest in implementing every known technique for Galois Field multiplication, I thought it might be ideal to have this in GF-Complete.

If there's interest in such, I may be able to supply some implementations (likely 128-bit only unless GF-Complete is interested in supporting wider vectors).