SSE2 based region multiply (fallback for when SSSE3 is unavailable)
This is perhaps more of a discussion rather than a request/report - just that I can't really find any other place where this is suitable (I hope this is okay with everyone!).
Thanks to GF-Complete, we have an ultra-fast SSSE3 shuffle-based method for GF region multiplication. Unfortunately, it does require a CPU with SSSE3 support. I've been trying to think up a compatible method for CPUs without byte shuffle support (but with SSE2 support, which is much more prevalent amongst CPUs), which is faster than the fallback to non-SIMD code.
I've come up with the following idea: https://github.com/animetosho/ParPar/blob/master/xor_depends/info.md
Whilst the idea isn't complex, an optimised implementation seems to be very complicated, and the method has a number of other warts.
The above page contains a link to a hacky patch for w=16, which you definitely don't want to merge (but might apply for testing purposes, if interested).
Now the thing is, I don't know much about the mathematics/science behind all of this at all, and in fact, have only started reading about erasure coding about two months ago. I don't know whether this approach has been tried before, and, as far as I can tell, does seem to be a viable method.
What I'm interested is: what is the viability of this, and what sort of improvements can be made? Any other feedback/suggestions/comments about such an SSE2 approach?