Depending on instruction set, significant speedups are observed for the vectorized path:
log1p wall time is reduced 60-93% (2.5x - 15x speedup)
expm1 wall time is reduced 0-85% (1x - 7x speedup)
The scalar path is slower by 20-30% due to the extra branch needed to handle +infinity correctly.
Full benchmarks measured on Intel(R) Xeon(R) Gold 6154 here: https://bitbucket.org/snippets/rmlarsen/MXBkpM
1. Fix buggy pcmp_eq and unit test for half types.
2. Add unit test for pselect and add specializations for SSE 4.1, AVX512, and half types.
3. Get rid of FIXME: Implement faster pnegate for half by XOR'ing with a sign bit mask.
- no FMA: 1ULP up to 3pi, 2ULP up to sin(25966) and cos(18838), fallback to std::sin/cos for larger inputs
- FMA: 1ULP up to sin(117435.992) and cos(71476.0625), fallback to std::sin/cos for larger inputs
This provide several advantages:
- more flexibility in designing unit tests
- unit tests can be glued to speed up compilation
- unit tests are compiled with same predefined macros, which is a requirement for zapcc
Benchmark speed in Giga-sqrts/s
Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
-----------------------------------------
SSE AVX
Fast=1 2.529G 4.380G
Fast=0 1.944G 1.898G
Fast=1 fixed 2.214G 3.739G
This table illustrates the worst case in terms speed impact: It was measured by repeatedly computing the sqrt of an n=4096 float vector that fits in L1 cache. For large vectors the operation becomes memory bound and the differences between the different versions almost negligible.
Created the TensorDimensionList class to encode the list of all the dimensions of a tensor of rank n. This could be done using TensorIndexList, however TensorIndexList require cxx11 which isn't yet supported as widely as we'd like.
Created a new EIGEN_ALIGN_BYTES define to encode how the data should be aligned
Fixed a few remaining alignment issues exposed when the Eigen code is compiled with avx enabled.
Created a new EIGEN_ALIGN_DEFAULT define, which is set to the minimum alignment value required for the chosen instruction set. Use this value instead of EIGEN_ALIGN32 to preserve the existing alignment on SSE/Altivec/Neon.
- remove most of the metaprogramming kung fu in MathFunctions.h (only keep functions that differs from the std)
- remove the overloads for array expression that were in the std namespace
Renamed meta_{true|false} to {true|false}_type, meta_if to conditional, is_same_type to is_same, un{ref|pointer|const} to remove_{reference|pointer|const} and makeconst to add_const.
Changed boolean type 'ret' member to 'value'.
Changed 'ret' members refering to types to 'type'.
Adapted all code occurences.
as gcc on ARM (both CodeSourcery 4.4.1 used and experimental 4.5) fail to
ensure proper alignment with __attribute__((aligned(16))). This has to be
fixed upstream to remove the workarounds.
* renaming, e.g. LU ---> FullPivLU
* split tests framework: more robust, e.g. dont generate empty tests if a number is skipped
* make all remaining tests use that splitting, as needed.
* Fix 4x4 inversion (see stable branch)
* Transform::inverse() and geo_transform test : adapt to new inverse() API, it was also trying to instantiate inverse() for 3x4 matrices.
* CMakeLists: more robust regexp to parse the version number
* misc fixes in unit tests
now we also align to 8byte boundary fixed-size objects that are multiple of 8 bytes.
That's only useful for now for double, not e.g. for Vector2f, but that didn't seem to hurt. Am I missing something? Do you prefer that we don't align Vector2f at all?
Also, improvements in test_unalignedassert.
Pommier. They are for float only, and they return exactly the same
result as the standard versions in about 90% of the cases. Otherwise the max error
is below 1e-7. However, for very large values (>1e3) the accuracy of sin and cos
slighlty decrease. They are about 3 or 4 times faster than 4 calls to their respective
standard versions. So, is it ok to enable them by default in their respective functors ?
* add Homogeneous expression for vector and set of vectors (aka matrix)
=> the next step will be to overload operator*
* add homogeneous normalization (again for vector and set of vectors)
* add a Replicate expression (with uni-directional replication
facilities)
=> for all of them I'll add examples once we agree on the API
* fix gcc-4.4 warnings
* rename reverse.cpp array_reverse.cpp
* bugfix in Dot unroller
* added special random generator for the unit tests and reduced the tolerance threshold by an order of magnitude
this fixes issues with sum.cpp but other tests still failed sometimes, this have to be carefully checked...