24 Oct, 2014

5 commits

  • Optimisations for 4,64 split table region multiplications. Only used on
    ARMv8-A since it is not faster on ARMv7-A.
    Janne Grunau
     
  • Optimisations for 4,32 split table multiplications.
    
    Selected time_tool.sh results on a 1.7 GHz cortex-a9:
    Region Best (MB/s):   346.67   W-Method: 32 -m SPLIT 32 4 -r SIMD -
    Region Best (MB/s):    92.89   W-Method: 32 -m SPLIT 32 4 -r NOSIMD -
    Region Best (MB/s):   258.17   W-Method: 32 -m SPLIT 32 4 -r SIMD -r ALTMAP -
    Region Best (MB/s):   162.00   W-Method: 32 -m SPLIT 32 8 -
    Region Best (MB/s):   160.53   W-Method: 32 -m SPLIT 8 8 -
    Region Best (MB/s):    32.74   W-Method: 32 -m COMPOSITE 2 - -
    Region Best (MB/s):   199.79   W-Method: 32 -m COMPOSITE 2 - -r ALTMAP -
    Janne Grunau
     
  • Optimisations for the 4,16 split table region multiplications.
    
    Selected time_tool.sh 16 -A -B results for a 1.7 GHz cortex-a9:
    Region Best (MB/s):   532.14   W-Method: 16 -m SPLIT 16 4 -r SIMD -
    Region Best (MB/s):   212.34   W-Method: 16 -m SPLIT 16 4 -r NOSIMD -
    Region Best (MB/s):   801.36   W-Method: 16 -m SPLIT 16 4 -r SIMD -r ALTMAP -
    Region Best (MB/s):    93.20   W-Method: 16 -m SPLIT 16 4 -r NOSIMD -r ALTMAP -
    Region Best (MB/s):   273.99   W-Method: 16 -m SPLIT 16 8 -
    Region Best (MB/s):   270.81   W-Method: 16 -m SPLIT 8 8 -
    Region Best (MB/s):    70.42   W-Method: 16 -m COMPOSITE 2 - -
    Region Best (MB/s):   393.54   W-Method: 16 -m COMPOSITE 2 - -r ALTMAP -
    Janne Grunau
     
  • Optimisations for the 4,4 split table region multiplication and
    carry-less multiplication using NEON's polynomial long multiplication.
    arm: w8: NEON carry-less multiplication
    
    Selected time_tool.sh results for a 1.7 GHz cortex-a9:
    Region Best (MB/s):   375.86   W-Method: 8 -m CARRY_FREE -
    Region Best (MB/s):   142.94   W-Method: 8 -m TABLE -
    Region Best (MB/s):   225.01   W-Method: 8 -m TABLE -r DOUBLE -
    Region Best (MB/s):   211.23   W-Method: 8 -m TABLE -r DOUBLE -r LAZY -
    Region Best (MB/s):   160.09   W-Method: 8 -m LOG -
    Region Best (MB/s):   123.61   W-Method: 8 -m LOG_ZERO -
    Region Best (MB/s):   123.85   W-Method: 8 -m LOG_ZERO_EXT -
    Region Best (MB/s):  1183.79   W-Method: 8 -m SPLIT 8 4 -r SIMD -
    Region Best (MB/s):   177.68   W-Method: 8 -m SPLIT 8 4 -r NOSIMD -
    Region Best (MB/s):    87.85   W-Method: 8 -m COMPOSITE 2 - -
    Region Best (MB/s):   428.59   W-Method: 8 -m COMPOSITE 2 - -r ALTMAP -
    Janne Grunau
     
  • Optimisations for the single table region multiplication and carry-less
    multiplication using NEON's polynomial multiplication of 8-bit values.
    
    The single polynomial multiplication is not that useful on its own, but
    the vector version is useful for region multiplication.
    
    Selected time_tool.sh results for a 1.7 GHz cortex-a9:
    Region Best (MB/s):   672.72   W-Method: 4 -m CARRY_FREE -
    Region Best (MB/s):   265.84   W-Method: 4 -m BYTWO_p -
    Region Best (MB/s):   329.41   W-Method: 4 -m TABLE -r DOUBLE -
    Region Best (MB/s):   278.63   W-Method: 4 -m TABLE -r QUAD -
    Region Best (MB/s):   329.81   W-Method: 4 -m TABLE -r QUAD -r LAZY -
    Region Best (MB/s):  1318.03   W-Method: 4 -m TABLE -r SIMD -
    Region Best (MB/s):   165.15   W-Method: 4 -m TABLE -r NOSIMD -
    Region Best (MB/s):    99.73   W-Method: 4 -m LOG -
    Janne Grunau
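
The commits above refer to two ways of multiplying a region of bytes by a constant in GF(2^8): carry-less (polynomial) multiplication followed by reduction, and a "SPLIT 8 4" scheme that looks up each 4-bit half of a byte in a 16-entry table. The following is a minimal scalar sketch of both ideas in Python, not the NEON code from these commits. The reduction polynomial 0x11D and all function names are assumptions chosen for illustration.

```python
# Scalar illustration only; the commits implement these ideas with NEON.
POLY = 0x11D  # assumed reduction polynomial: x^8 + x^4 + x^3 + x^2 + 1


def clmul(a, b):
    """Carry-less (polynomial) multiply of two 8-bit values (up to 15 bits)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # XOR replaces addition, so no carries propagate
        a <<= 1
        b >>= 1
    return result


def reduce_poly(p):
    """Reduce a product of degree < 15 modulo POLY back to 8 bits."""
    for bit in range(14, 7, -1):
        if p & (1 << bit):
            p ^= POLY << (bit - 8)
    return p


def gf_mul(a, b):
    """Multiply two elements of GF(2^8) under the assumed polynomial."""
    return reduce_poly(clmul(a, b))


def split_tables(c):
    """Build the two 16-entry tables for a constant c (8-bit word, 4-bit split)."""
    low = [gf_mul(c, i) for i in range(16)]        # products with the low nibble
    high = [gf_mul(c, i << 4) for i in range(16)]  # products with the high nibble
    return low, high


def region_mul_split(c, src):
    """Multiply every byte of src by c using two nibble lookups and one XOR."""
    low, high = split_tables(c)
    return bytes(low[b & 0x0F] ^ high[b >> 4] for b in src)


def region_mul_clmul(c, src):
    """Multiply every byte of src by c using carry-less multiply and reduce."""
    return bytes(gf_mul(c, b) for b in src)
```

The split-table variant works because multiplication by a constant is linear over XOR, so the product of a byte equals the XOR of the products of its two nibbles. A SIMD implementation can perform those 16-entry lookups on many bytes at once, which is consistent with the large gap between the SIMD and NOSIMD rows in the results above.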
     

09 Oct, 2014

2 commits


23 Aug, 2014

1 commit


16 Jun, 2014

2 commits


09 Jun, 2014

1 commit


06 Jun, 2014

1 commit


22 Apr, 2014

1 commit


02 Apr, 2014

1 commit


18 Mar, 2014

1 commit


02 Jan, 2014

1 commit


01 Jan, 2014

1 commit


31 Dec, 2013

1 commit


30 Dec, 2013

3 commits


04 Dec, 2013

2 commits