13 Sep, 2016

2 commits

  • This commit adds support for runtime detection of SIMD instructions. The idea is to build once with all supported SIMD functions so that the same binaries can run on machines with varying SIMD support. At runtime, gf-complete selects the right functions based on the processor.
    
    gf_cpu.c has the logic to detect SIMD instructions. On Intel processors this is done through cpuid. For ARM on Linux we use getauxval.
    
    The logic in gf_w*.c has been changed to check for runtime SIMD support and fall back to generic code.
    
    A new test has also been added. It compares the functions selected by gf_init when SIMD support is enabled or disabled through build flags with the functions selected when it is enabled or disabled at runtime, and checks that the results are identical.
    Bassam Tabbara
     
  • There is currently no way to figure out which functions were selected
    during gf_init and as a result of SIMD options. This is not even possible
    in gdb since most functions are static.
    
    This commit adds a new macro SET_FUNCTION that records the name of the
    function selected during init inside the gf_internal structure. This macro
    only records names when DEBUG_FUNCTIONS is defined at compile time.
    Otherwise the code works exactly as it did before this change.
    
    The names of selected functions will be used during testing of SIMD
    runtime detection.
    
    All calls such as:
    
    gf->multiply.w32 = gf_w16_shift_multiply;
    
    need to be replaced with the following:
    
    SET_FUNCTION(gf,multiply,w32,gf_w16_shift_multiply)
    
    Also added a new flag to tools/gf_methods that will print the names of
    functions selected during gf_init.
    Bassam Tabbara
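
The two commits above can be illustrated together with a minimal sketch. It is not the code from gf_cpu.c or gf_int.h: `struct demo_gf`, the `demo_*` functions, the `multiply_name` field and the macro body are all hypothetical, and the "SIMD" multiply is a plain C placeholder. Only the four-argument shape of SET_FUNCTION, the DEBUG_FUNCTIONS switch, and the use of cpuid and getauxval come from the commit messages. The polynomial 0x400007 is assumed for w=32.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>          /* GCC/Clang wrapper around the cpuid instruction */
#elif defined(__linux__) && (defined(__arm__) || defined(__aarch64__))
#include <sys/auxv.h>       /* getauxval() */
#include <asm/hwcap.h>      /* HWCAP_NEON / HWCAP_ASIMD */
#endif

/* Normally passed on the command line as -DDEBUG_FUNCTIONS. */
#define DEBUG_FUNCTIONS 1

typedef uint32_t (*demo_mult32_fn)(uint32_t a, uint32_t b);

/* Hypothetical stand-in for the real gf_t / gf_internal structures. */
struct demo_gf {
    struct { demo_mult32_fn w32; } multiply;
    const char *multiply_name;   /* filled in only with DEBUG_FUNCTIONS */
};

#ifdef DEBUG_FUNCTIONS
/* Assign the function pointer and remember its name as a string. */
#define SET_FUNCTION(gf, method, size, func) \
    do { (gf)->method.size = (func); (gf)->method##_name = #func; } while (0)
#else
/* Same assignment as before the change, nothing recorded. */
#define SET_FUNCTION(gf, method, size, func) \
    do { (gf)->method.size = (func); } while (0)
#endif

/* Generic shift multiply in GF(2^32), polynomial 0x400007 (assumed). */
static uint32_t demo_generic_multiply(uint32_t a, uint32_t b)
{
    uint32_t p = 0;
    while (b) {
        if (b & 1u)
            p ^= a;
        b >>= 1;
        uint32_t carry = a & 0x80000000u;
        a <<= 1;
        if (carry)
            a ^= 0x400007u;
    }
    return p;
}

/* Placeholder for a SIMD implementation; computes the same result in C. */
static uint32_t demo_simd_multiply(uint32_t a, uint32_t b)
{
    return demo_generic_multiply(a, b);
}

/* Returns nonzero if the CPU reports SSSE3 (x86) or NEON/ASIMD (ARM Linux). */
static int demo_cpu_has_simd(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> 9) & 1u);      /* CPUID leaf 1, ECX bit 9 = SSSE3 */
#elif defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#elif defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return 0;
#endif
}

/* Pick an implementation at runtime; allow_simd mimics a runtime disable. */
static void demo_gf_init(struct demo_gf *gf, int allow_simd)
{
    gf->multiply_name = "(not recorded)";
    if (allow_simd && demo_cpu_has_simd())
        SET_FUNCTION(gf, multiply, w32, demo_simd_multiply);
    else
        SET_FUNCTION(gf, multiply, w32, demo_generic_multiply);
}
```

After calling `demo_gf_init(&gf, 1)`, a caller can print `gf.multiply_name` to see which function was selected on the current machine. This is the kind of comparison the new test makes between build-time and runtime SIMD settings.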
     

24 Oct, 2014

5 commits

  • Optimisations for 4,64 split table region multiplications. Only used on
    ARMv8-A since it is not faster on ARMv7-A.
    Janne Grunau
     
  • Optimisations for 4,32 split table multiplications.
    
    Selected time_tool.sh results on a 1.7 GHz Cortex-A9:
    Region Best (MB/s):   346.67   W-Method: 32 -m SPLIT 32 4 -r SIMD -
    Region Best (MB/s):    92.89   W-Method: 32 -m SPLIT 32 4 -r NOSIMD -
    Region Best (MB/s):   258.17   W-Method: 32 -m SPLIT 32 4 -r SIMD -r ALTMAP -
    Region Best (MB/s):   162.00   W-Method: 32 -m SPLIT 32 8 -
    Region Best (MB/s):   160.53   W-Method: 32 -m SPLIT 8 8 -
    Region Best (MB/s):    32.74   W-Method: 32 -m COMPOSITE 2 - -
    Region Best (MB/s):   199.79   W-Method: 32 -m COMPOSITE 2 - -r ALTMAP -
    Janne Grunau
     
  • Optimisations for the 4,16 split table region multiplications.
    
    Selected time_tool.sh 16 -A -B results for a 1.7 GHz Cortex-A9:
    Region Best (MB/s):   532.14   W-Method: 16 -m SPLIT 16 4 -r SIMD -
    Region Best (MB/s):   212.34   W-Method: 16 -m SPLIT 16 4 -r NOSIMD -
    Region Best (MB/s):   801.36   W-Method: 16 -m SPLIT 16 4 -r SIMD -r ALTMAP -
    Region Best (MB/s):    93.20   W-Method: 16 -m SPLIT 16 4 -r NOSIMD -r ALTMAP -
    Region Best (MB/s):   273.99   W-Method: 16 -m SPLIT 16 8 -
    Region Best (MB/s):   270.81   W-Method: 16 -m SPLIT 8 8 -
    Region Best (MB/s):    70.42   W-Method: 16 -m COMPOSITE 2 - -
    Region Best (MB/s):   393.54   W-Method: 16 -m COMPOSITE 2 - -r ALTMAP -
    Janne Grunau
     
  • Optimisations for the 4,4 split table region multiplication and
    carry-less multiplication using NEON's polynomial long multiplication.
    arm: w8: NEON carry-less multiplication
    
    Selected time_tool.sh results for a 1.7 GHz Cortex-A9:
    Region Best (MB/s):   375.86   W-Method: 8 -m CARRY_FREE -
    Region Best (MB/s):   142.94   W-Method: 8 -m TABLE -
    Region Best (MB/s):   225.01   W-Method: 8 -m TABLE -r DOUBLE -
    Region Best (MB/s):   211.23   W-Method: 8 -m TABLE -r DOUBLE -r LAZY -
    Region Best (MB/s):   160.09   W-Method: 8 -m LOG -
    Region Best (MB/s):   123.61   W-Method: 8 -m LOG_ZERO -
    Region Best (MB/s):   123.85   W-Method: 8 -m LOG_ZERO_EXT -
    Region Best (MB/s):  1183.79   W-Method: 8 -m SPLIT 8 4 -r SIMD -
    Region Best (MB/s):   177.68   W-Method: 8 -m SPLIT 8 4 -r NOSIMD -
    Region Best (MB/s):    87.85   W-Method: 8 -m COMPOSITE 2 - -
    Region Best (MB/s):   428.59   W-Method: 8 -m COMPOSITE 2 - -r ALTMAP -
    Janne Grunau
     
  • Optimisations for the single table region multiplication and carry-less
    multiplication using NEON's polynomial multiplication of 8-bit values.
    
    A single polynomial multiplication is not that useful on its own, but the
    vector version is useful for region multiplication.
    
    Selected time_tool.sh results for a 1.7 GHz Cortex-A9:
    Region Best (MB/s):   672.72   W-Method: 4 -m CARRY_FREE -
    Region Best (MB/s):   265.84   W-Method: 4 -m BYTWO_p -
    Region Best (MB/s):   329.41   W-Method: 4 -m TABLE -r DOUBLE -
    Region Best (MB/s):   278.63   W-Method: 4 -m TABLE -r QUAD -
    Region Best (MB/s):   329.81   W-Method: 4 -m TABLE -r QUAD -r LAZY -
    Region Best (MB/s):  1318.03   W-Method: 4 -m TABLE -r SIMD -
    Region Best (MB/s):   165.15   W-Method: 4 -m TABLE -r NOSIMD -
    Region Best (MB/s):    99.73   W-Method: 4 -m LOG -
    Janne Grunau
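
The two techniques named in these commits, carry-less multiplication and 4-bit split tables, can be sketched in portable scalar C for w=8. This is an illustration under stated assumptions, not the NEON code from the commits. All `demo_*` names are hypothetical, and the polynomial 0x11d is assumed for GF(2^8). On SIMD hardware the 16-entry lookups are typically done with a byte table or shuffle instruction across a whole vector, and the carry-less product with a polynomial multiply instruction.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_POLY 0x11du   /* x^8 + x^4 + x^3 + x^2 + 1, assumed for w=8 */

/* Carry-less (polynomial) 8x8 -> 16 bit product: XOR instead of add. */
static uint16_t demo_clmul8(uint8_t a, uint8_t b)
{
    uint16_t p = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            p ^= (uint16_t)((unsigned)a << i);
    return p;
}

/* Reduce a product of degree <= 14 modulo DEMO_POLY. */
static uint8_t demo_reduce(uint16_t p)
{
    for (int bit = 14; bit >= 8; bit--)
        if (p & (1u << bit))
            p ^= (uint16_t)(DEMO_POLY << (bit - 8));
    return (uint8_t)p;
}

/* Single multiplication in GF(2^8) via carry-less multiply and reduce. */
static uint8_t demo_gf8_multiply(uint8_t a, uint8_t b)
{
    return demo_reduce(demo_clmul8(a, b));
}

/*
 * SPLIT 8 4 region multiply: dst[i] = y * src[i] (XORed into dst if
 * xor_into is nonzero). Each byte is split into two nibbles and each
 * nibble indexes its own 16-entry table; the two results are XORed.
 */
static void demo_split_8_4_region(const uint8_t *src, uint8_t *dst,
                                  size_t n, uint8_t y, int xor_into)
{
    uint8_t low[16], high[16];

    for (int i = 0; i < 16; i++) {
        low[i]  = demo_gf8_multiply(y, (uint8_t)i);          /* y * i        */
        high[i] = demo_gf8_multiply(y, (uint8_t)(i << 4));   /* y * (i << 4) */
    }
    for (size_t i = 0; i < n; i++) {
        uint8_t prod = (uint8_t)(low[src[i] & 0x0f] ^ high[src[i] >> 4]);
        dst[i] = xor_into ? (uint8_t)(dst[i] ^ prod) : prod;
    }
}
```

The split works because multiplication by a constant is linear over XOR: y * x equals y * (x & 0x0f) XOR y * (x & 0xf0). Two small tables therefore replace one 256-entry table and fit in vector registers.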
     

09 Oct, 2014

2 commits


23 Aug, 2014

1 commit


16 Jun, 2014

2 commits


09 Jun, 2014

1 commit


06 Jun, 2014

1 commit


22 Apr, 2014

1 commit


02 Apr, 2014

1 commit


18 Mar, 2014

1 commit


02 Jan, 2014

1 commit


01 Jan, 2014

1 commit


31 Dec, 2013

1 commit


30 Dec, 2013

3 commits


04 Dec, 2013

2 commits