Use native word size instead of int64
A number of places use uint64_t variables where they are not strictly necessary. This works well when the native word size is 64 bits, but can really hurt performance when it isn't.
I suggest choosing the type based on the target architecture. Something like uint_fastX_t might not be a bad place to start, although it's not guaranteed to be the most efficient, so it might be better to define your own types.
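As a rough sketch of the "define your own types" option, something along these lines could work. The names word_t, WORD_BITS and word_top_bits are hypothetical and not taken from the project, and using SIZE_MAX as a proxy for the native word size is only a heuristic (it can be wrong on ABIs where size_t is narrower than the registers), so a real build would probably want an override.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical project-local word type; not part of the existing code.
 * Assumption: the width of size_t approximates the native word size. */
#if SIZE_MAX > 0xFFFFFFFFu
typedef uint64_t word_t;
#define WORD_BITS 64
#else
typedef uint32_t word_t;
#define WORD_BITS 32
#endif

/* Compile-time check (C11) that WORD_BITS matches the chosen type. */
_Static_assert(sizeof(word_t) * CHAR_BIT == WORD_BITS,
               "WORD_BITS must match the width of word_t");

/* Example of a shift written against WORD_BITS rather than a hard-coded 64:
 * returns the top n bits of x, for 1 <= n <= WORD_BITS. */
static inline word_t word_top_bits(word_t x, unsigned n)
{
    return x >> (WORD_BITS - n);
}
```

With this in place, code that currently hard-codes 64 in its shifts would use WORD_BITS instead, so the same source picks the narrower type on 32-bit targets.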
I got a significant speed boost for w16 mSPLIT(16,8) on 32-bit by simply changing the
uint64_t variables to
uint_fast32_t (and the corresponding shift operations to be based on