Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							6b9c92fe7e 
							
						 
					 
					
						
						
							
							Add Apache 2.0 license text in COPYING.APACHE.  
						
						
						
					 
					
						2020-06-18 12:45:27 -07:00 
						 
				 
			
				
					
						
							
							
								Nicolas Mellado 
							
						 
					 
					
						
						
						
						
							
						
						
							cf7adf3a5d 
							
						 
					 
					
						
						
							
							Update things you can do message using cmake commands  
						
						
						
						
						Print cmake commands instead of make commands, which should work for any generator. 
						
					 
					
						2020-06-16 21:04:33 +00:00 
						 
				 
			
				
					
						
							
							
								Ilya Tokar 
							
						 
					 
					
						
						
						
						
							
						
						
							231ce21535 
							
						 
					 
					
						
						
							
							Run two independent chains, when reducing tensors.  
						
						
						
						
						Running two chains exposes more instruction-level parallelism
by allowing both chains to execute at the same time.
Results are a bit noisy, but for medium lengths we almost hit the
theoretical upper bound of 2x.
name                          [threads]                old time/op        new time/op        delta
BM_fullReduction_16T/3        [using 16 threads]       17.3ns ±11%        17.4ns ± 9%        ~           (p=0.178 n=18+19)
BM_fullReduction_16T/4        [using 16 threads]       17.6ns ±17%        17.0ns ±18%        ~           (p=0.835 n=20+19)
BM_fullReduction_16T/7        [using 16 threads]       18.9ns ±12%        18.2ns ±10%        ~           (p=0.756 n=20+18)
BM_fullReduction_16T/8        [using 16 threads]       19.8ns ±13%        19.4ns ±21%        ~           (p=0.512 n=20+20)
BM_fullReduction_16T/10       [using 16 threads]       23.5ns ±15%        20.8ns ±24%     -11.37%        (p=0.000 n=20+19)
BM_fullReduction_16T/15       [using 16 threads]       35.8ns ±21%        26.9ns ±17%     -24.76%        (p=0.000 n=20+19)
BM_fullReduction_16T/16       [using 16 threads]       38.7ns ±22%        27.7ns ±18%     -28.40%        (p=0.000 n=20+19)
BM_fullReduction_16T/31       [using 16 threads]        146ns ±17%          74ns ±11%     -49.05%        (p=0.000 n=20+18)
BM_fullReduction_16T/32       [using 16 threads]        154ns ±19%          84ns ±30%     -45.79%        (p=0.000 n=20+19)
BM_fullReduction_16T/64       [using 16 threads]        603ns ± 8%         308ns ±12%     -48.94%        (p=0.000 n=17+17)
BM_fullReduction_16T/128      [using 16 threads]       2.44µs ±13%        1.22µs ± 1%     -50.29%        (p=0.000 n=17+17)
BM_fullReduction_16T/256      [using 16 threads]       9.84µs ±14%        5.13µs ±30%     -47.82%        (p=0.000 n=19+19)
BM_fullReduction_16T/512      [using 16 threads]       78.0µs ± 9%        56.1µs ±17%     -28.02%        (p=0.000 n=18+20)
BM_fullReduction_16T/1k       [using 16 threads]        325µs ± 5%         263µs ± 4%     -19.00%        (p=0.000 n=20+16)
BM_fullReduction_16T/2k       [using 16 threads]       1.09ms ± 3%        0.99ms ± 1%      -9.04%        (p=0.000 n=20+20)
BM_fullReduction_16T/4k       [using 16 threads]       7.66ms ± 3%        7.57ms ± 3%      -1.24%        (p=0.017 n=20+20)
BM_fullReduction_16T/10k      [using 16 threads]       65.3ms ± 4%        65.0ms ± 3%        ~           (p=0.718 n=20+20) 
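The idea of running two independent chains can be illustrated outside Eigen with a plain summation loop. This is a minimal sketch, not the code from this commit: the names `sum_one_chain` and `sum_two_chains` are made up for illustration. It only shows how splitting the accumulator removes the serial dependency between consecutive additions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single dependency chain: every addition has to wait for the previous one.
float sum_one_chain(const std::vector<float>& v) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < v.size(); ++i) acc += v[i];
  return acc;
}

// Two independent chains: acc0 and acc1 never depend on each other inside
// the loop, so the CPU is free to overlap the two additions.
float sum_two_chains(const std::vector<float>& v) {
  float acc0 = 0.0f;
  float acc1 = 0.0f;
  const std::size_t n = v.size();
  const std::size_t even = n - (n % 2);  // largest even count <= n
  for (std::size_t i = 0; i < even; i += 2) {
    acc0 += v[i];
    acc1 += v[i + 1];
  }
  if (even < n) acc0 += v[even];  // leftover element when n is odd
  return acc0 + acc1;             // combine the chains once at the end
}
```

Because the two-chain version adds the elements in a different order, its floating-point result may differ from the single-chain result in the last bits.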
						
					 
					
						2020-06-16 15:55:11 -04:00 
						 
				 
			
				
					
						
							
							
								Pedro Caldeira 
							
						 
					 
					
						
						
						
						
							
						
						
							a475bf14d4 
							
						 
					 
					
						
						
							
							Fix pscatter and pgather for Altivec Complex double  
						
						
						
					 
					
						2020-06-16 16:41:02 -03:00 
						 
				 
			
				
					
						
							
							
								David Tellenbach 
							
						 
					 
					
						
						
						
						
							
						
						
							c6c84ed961 
							
						 
					 
					
						
						
							
							Fix unused variable warning on Arm  
						
						
						
					 
					
						2020-06-15 00:14:58 +02:00 
						 
				 
			
				
					
						
							
							
								Sebastien Boisvert 
							
						 
					 
					
						
						
						
						
							
						
						
							6228f27234 
							
						 
					 
					
						
						
							
							Fix   #1818 : SparseLU: add methods nnzL() and nnzU()  
						
						
						
						
						Now this compiles without errors:
$ clang++ -I ../../ test_sparseLU.cpp -std=c++03 
						
					 
					
						2020-06-11 23:49:49 +00:00 
						 
				 
			
				
					
						
							
							
								Sebastien Boisvert 
							
						 
					 
					
						
						
						
						
							
						
						
							39cbd6578f 
							
						 
					 
					
						
						
							
							Fix   #1911 : add benchmark for move semantics with fixed-size matrix  
						
						
						
						
						$ clang++ -O3 bench/bench_move_semantics.cpp -I. -std=c++11 \
        -o bench_move_semantics
$ ./bench_move_semantics
float copy semantics: 1755.97 ms
float move semantics: 55.063 ms
double copy semantics: 2457.65 ms
double move semantics: 55.034 ms 
						
					 
					
						2020-06-11 23:43:25 +00:00 
						 
				 
			
				
					
						
							
							
								Antonio Sanchez 
							
						 
					 
					
						
						
						
						
							
						
						
							a7d2552af8 
							
						 
					 
					
						
						
							
							Remove HasCast and fix packetmath cast tests.  
						
						
						
						
						The use of the `packet_traits<>::HasCast` field is currently inconsistent with
`type_casting_traits<>`, and is unused apart from within
`test/packetmath.cpp`. In addition, those packetmath cast tests do not
currently reflect how casts are performed in practice: they ignore the
`SrcCoeffRatio` and `TgtCoeffRatio` fields, assuming a 1:1 ratio.
Here we remove the unused `HasCast`, and modify the packet cast tests to
better reflect their usage. 
						
					 
					
						2020-06-11 17:26:56 +00:00 
						 
				 
			
				
					
						
							
							
								Sebastien Boisvert 
							
						 
					 
					
						
						
						
						
							
						
						
							463ec86648 
							
						 
					 
					
						
						
							
							Fix   #1757 : remove the word 'suicide'  
						
						
						
					 
					
						2020-06-11 00:56:54 +00:00 
						 
				 
			
				
					
						
							
							
								ShengYang1 
							
						 
					 
					
						
						
						
						
							
						
						
							b5d66b5e73 
							
						 
					 
					
						
						
							
							Implement scalar_cmp_with_cast_op  
						
						
						
					 
					
						2020-06-09 08:12:07 +08:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							c4059ffcb6 
							
						 
					 
					
						
						
							
							Fix static analyzer warning in SelfadjointProduct.h.  
						
						
						
						
						Fix compiler warnings in GeneralBlockPanelKernel.h. 
						
					 
					
						2020-06-08 11:48:44 -07:00 
						 
				 
			
				
					
						
							
							
								Thales Sabino 
							
						 
					 
					
						
						
						
						
							
						
						
							1fcaaf460f 
							
						 
					 
					
						
						
							
							Update FindComputeCpp.cmake to fix build problems on Windows  
						
						
						
						
						- Use standard types in SYCL/PacketMath.h to avoid compilation problems on Windows
- Add EIGEN_HAS_CONSTEXPR to cxx11_tensor_argmax_sycl.cpp to fix build problems on Windows 
						
					 
					
						2020-06-05 20:51:20 +00:00 
						 
				 
			
				
					
						
							
							
								David Tellenbach 
							
						 
					 
					
						
						
						
						
							
						
						
							3ce18d3c8f 
							
						 
					 
					
						
						
							
							Revert ".gitlab-ci.yml: initial commit"  
						
						
						
						
						This reverts commit 95177362ed 
						
					 
					
						2020-06-05 22:43:49 +02:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							c2ab36f47a 
							
						 
					 
					
						
						
							
							Fix broken packetmath test for logistic on Arm.  
						
						
						
					 
					
						2020-06-04 16:24:47 -07:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							537e2b322f 
							
						 
					 
					
						
						
							
							Fix typo in previous update to generic predux_any.  
						
						
						
					 
					
						2020-06-04 22:25:05 +00:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							fdc1cbdce3 
							
						 
					 
					
						
						
							
							Avoid implicit float equality comparison in generic predux_any; use numext::not_equal_strict instead, to avoid breaking builds that compile with -Werror=float-equal.  
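A rough sketch of the idea behind a "strict" inequality helper follows. It is not Eigen's implementation: `not_equal_strict_sketch` and `any_nonzero` are hypothetical names, and the helper simply routes the comparison through `std::not_equal_to`. Whether that keeps `-Wfloat-equal` quiet depends on the compiler treating the standard header as system code, which is an assumption here.

```cpp
#include <cassert>
#include <functional>

// Sketch only: compare through the standard functor instead of writing
// `x != y` directly in user code.
inline bool not_equal_strict_sketch(float x, float y) {
  return std::not_equal_to<float>()(x, y);
}

// Hypothetical scalar loop that reports whether any entry differs from zero.
inline bool any_nonzero(const float* data, int n) {
  for (int i = 0; i < n; ++i) {
    if (not_equal_strict_sketch(data[i], 0.0f)) return true;
  }
  return false;
}
```

The comparison semantics are unchanged: `0.0f` and `-0.0f` still compare equal, exactly as with the built-in operator.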
						
						
						
					 
					
						2020-06-04 22:15:56 +00:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							daf9bbeca2 
							
						 
					 
					
						
						
							
							Fix compilation error in logistic packet op.  
						
						
						
					 
					
						2020-06-03 00:57:41 +00:00 
						 
				 
			
				
					
						
							
							
								n0mend 
							
						 
					 
					
						
						
						
						
							
						
						
							6d2a9a524b 
							
						 
					 
					
						
						
							
							Update run instructions for benchCholesky  
						
						
						
					 
					
						2020-06-01 18:31:46 +00:00 
						 
				 
			
				
					
						
							
							
								Gael Guennebaud 
							
						 
					 
					
						
						
						
						
							
						
						
							029a76e115 
							
						 
					 
					
						
						
							
							Bug  #1777 : make the scalar and packet path consistent for the logistic function + respective unit test  
						
						
						
					 
					
						2020-05-31 00:53:37 +02:00 
						 
				 
			
				
					
						
							
							
								Gael Guennebaud 
							
						 
					 
					
						
						
						
						
							
						
						
							99b7f7cb9c 
							
						 
					 
					
						
						
							
							Fix   #556 : warnings with mingw  
						
						
						
					 
					
						2020-05-31 00:39:44 +02:00 
						 
				 
			
				
					
						
							
							
								Gael Guennebaud 
							
						 
					 
					
						
						
						
						
							
						
						
							72782d13e0 
							
						 
					 
					
						
						
							
							Bug  #1767 : increase required cmake version to 3.5.0  
						
						
						
					 
					
						2020-05-31 00:31:09 +02:00 
						 
				 
			
				
					
						
							
							
								Gael Guennebaud 
							
						 
					 
					
						
						
						
						
							
						
						
							867a756509 
							
						 
					 
					
						
						
							
							Fix   #1833 : compilation issue of "array!=scalar" with c++20  
						
						
						
					 
					
						2020-05-30 23:53:58 +02:00 
						 
				 
			
				
					
						
							
							
								Gael Guennebaud 
							
						 
					 
					
						
						
						
						
							
						
						
							ab615e4114 
							
						 
					 
					
						
						
							
							Save one extra temporary when assigning a sparse product to a row-major sparse matrix  
						
						
						
					 
					
						2020-05-30 23:15:12 +02:00 
						 
				 
			
				
					
						
							
							
								Christoph Junghans 
							
						 
					 
					
						
						
						
						
							
						
						
							95177362ed 
							
						 
					 
					
						
						
							
							.gitlab-ci.yml: initial commit  
						
						
						
					 
					
						2020-05-29 09:23:25 -06:00 
						 
				 
			
				
					
						
							
							
								Kan Chen 
							
						 
					 
					
						
						
						
						
							
						
						
							8d1302f566 
							
						 
					 
					
						
						
							
							Add support for PacketBlock<Packet8s,4> and PacketBlock<Packet16uc,4> ptranspose on NEON  
						
						
						
					 
					
						2020-05-29 00:33:45 +00:00 
						 
				 
			
				
					
						
							
							
								Antonio Sánchez 
							
						 
					 
					
						
						
						
						
							
						
						
							8719b9c5bc 
							
						 
					 
					
						
						
							
							Disable test for 32-bit systems (e.g. ARM, i386)  
						
						
						
						
						Neither i386 nor 32-bit ARM defines __uint128_t. On most systems, if
__uint128_t is defined, then so is the macro __SIZEOF_INT128__.
https://stackoverflow.com/questions/18531782/how-to-know-if-uint128-t-is-defined1  
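The guard described above can be sketched as follows. The helper `mul_high` is hypothetical and not part of the commit; it only illustrates selecting a 128-bit code path when `__SIZEOF_INT128__` is defined and falling back to portable 64-bit arithmetic otherwise.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: upper 64 bits of the full 128-bit product a * b.
#if defined(__SIZEOF_INT128__)
// 128-bit integers are available, so use them directly.
inline std::uint64_t mul_high(std::uint64_t a, std::uint64_t b) {
  return static_cast<std::uint64_t>((static_cast<__uint128_t>(a) * b) >> 64);
}
#else
// Portable fallback (e.g. i386, 32-bit ARM): schoolbook multiplication
// on 32-bit halves, carrying the middle terms by hand.
inline std::uint64_t mul_high(std::uint64_t a, std::uint64_t b) {
  const std::uint64_t mask = 0xFFFFFFFFu;
  const std::uint64_t a_lo = a & mask, a_hi = a >> 32;
  const std::uint64_t b_lo = b & mask, b_hi = b >> 32;
  const std::uint64_t lo_lo = a_lo * b_lo;
  const std::uint64_t hi_lo = a_hi * b_lo;
  const std::uint64_t lo_hi = a_lo * b_hi;
  const std::uint64_t hi_hi = a_hi * b_hi;
  // Sum of the middle terms; this cannot overflow 64 bits.
  const std::uint64_t cross = (lo_lo >> 32) + (hi_lo & mask) + lo_hi;
  return hi_hi + (hi_lo >> 32) + (cross >> 32);
}
#endif
```

Both branches return the same value, so a test written against `mul_high` can run on 32-bit and 64-bit targets alike.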
						
					 
					
						2020-05-28 17:40:15 +00:00 
						 
				 
			
				
					
						
							
							
								Yong Tang 
							
						 
					 
					
						
						
						
						
							
						
						
							8e1df5b082 
							
						 
					 
					
						
						
							
							Fix incorrect usage of if defined(EIGEN_ARCH_PPC) => if EIGEN_ARCH_PPC  
						
						
						
						
						This PR tries to fix an incorrect usage of `if defined(EIGEN_ARCH_PPC)`
in `Eigen/Core` header.
In `Eigen/src/Core/util/Macros.h`, EIGEN_ARCH_PPC was explicitly defined
as either 0 or 1. As a result `if defined(EIGEN_ARCH_PPC)` will always be true.
This causes issues when building on non-PPC platforms, where `MatrixProduct.h` is not
available.
This fix changes `if defined(EIGEN_ARCH_PPC)` => `if EIGEN_ARCH_PPC`.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com> 
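The difference between the two forms can be reproduced in isolation. In this sketch `DEMO_ARCH_PPC` is a made-up stand-in for a macro that is always defined, as either 0 or 1; it is not the Eigen macro itself.

```cpp
#include <cassert>

// Stand-in for an architecture macro that is always defined (here as 0).
#define DEMO_ARCH_PPC 0

// `defined(...)` only asks whether the macro exists, so this branch is
// taken even though the value is 0.
#if defined(DEMO_ARCH_PPC)
const bool taken_with_defined = true;
#else
const bool taken_with_defined = false;
#endif

// Testing the value itself skips the branch when the macro expands to 0.
#if DEMO_ARCH_PPC
const bool taken_with_value = true;
#else
const bool taken_with_value = false;
#endif
```

With the macro set to 0, `taken_with_defined` is `true` while `taken_with_value` is `false`, which is the mismatch the fix addresses.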
						
					 
					
						2020-05-28 05:53:44 -07:00 
						 
				 
			
				
					
						
							
							
								Kan Chen 
							
						 
					 
					
						
						
						
						
							
						
						
							4e7046063b 
							
						 
					 
					
						
						
							
							Fix   #1874 : it works on both MSVC 2017 and other platforms.  
						
						
						
					 
					
						2020-05-21 18:42:56 +08:00 
						 
				 
			
				
					
						
							
							
								Pedro Caldeira 
							
						 
					 
					
						
						
						
						
							
						
						
							2d67af2d2b 
							
						 
					 
					
						
						
							
							Add pscatter for Packet16{u}c (int8)  
						
						
						
					 
					
						2020-05-20 17:29:34 -03:00 
						 
				 
			
				
					
						
							
							
								David Tellenbach 
							
						 
					 
					
						
						
						
						
							
						
						
							5328cd62b3 
							
						 
					 
					
						
						
							
							Guard usage of decltype since it's a C++11 feature  
						
						
						
						
						This fixes https://gitlab.com/libeigen/eigen/-/issues/1897  
						
					 
					
						2020-05-20 16:04:16 +02:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							cc86a31e20 
							
						 
					 
					
						
						
							
							Add guard around specialization for bool, which is only currently implemented for SSE.  
						
						
						
					 
					
						2020-05-19 16:21:56 -07:00 
						 
				 
			
				
					
						
							
							
								Everton Constantino 
							
						 
					 
					
						
						
						
						
							
						
						
							8a7f360ec3 
							
						 
					 
					
						
						
							
							- Vectorizing MMA packing.  
						
						
						
						
						- Optimizing MMA kernel.
- Adding PacketBlock store to blas_data_mapper. 
						
					 
					
						2020-05-19 19:24:11 +00:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							a145e4adf5 
							
						 
					 
					
						
						
							
							Add newline at the end of StlIterators.h.  
						
						
						
					 
					
						2020-05-15 20:36:00 +00:00 
						 
				 
			
				
					
						
							
							
								Gael Guennebaud 
							
						 
					 
					
						
						
						
						
							
						
						
							8ce9630ddb 
							
						 
					 
					
						
						
							
							Fix   #1874 : workaround MSVC 2017 compilation issue.  
						
						
						
					 
					
						2020-05-15 20:47:32 +02:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							9b411757ab 
							
						 
					 
					
						
						
							
							Add missing packet ops for bool, and make it pass the same packet op unit tests as other arithmetic types.  
						
						
						
						
						This change also contains a few minor cleanups:
  1. Remove packet op pnot, which is not needed for anything other than pcmp_le_or_nan,
     which can be done in other ways.
  2. Remove the "HasInsert" enum, which is no longer needed since we removed the
     corresponding packet ops.
  3. Add faster pselect op for Packet4i when SSE4.1 is supported.
Among other things, this makes the fast transposeInPlace() method available for Matrix<bool>.
Run on ************** (72 X 2994 MHz CPUs); 2020-05-09T10:51:02.372347913-07:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                        Time(ns)        CPU(ns)     Iterations
-----------------------------------------------------------------------
BM_TransposeInPlace<float>/4            9.77           9.77    71670320
BM_TransposeInPlace<float>/8           21.9           21.9     31929525
BM_TransposeInPlace<float>/16          66.6           66.6     10000000
BM_TransposeInPlace<float>/32         243            243        2879561
BM_TransposeInPlace<float>/59         844            844         829767
BM_TransposeInPlace<float>/64         933            933         750567
BM_TransposeInPlace<float>/128       3944           3945         177405
BM_TransposeInPlace<float>/256      16853          16853          41457
BM_TransposeInPlace<float>/512     204952         204968           3448
BM_TransposeInPlace<float>/1k     1053889        1053861            664
BM_TransposeInPlace<bool>/4            14.4           14.4     48637301
BM_TransposeInPlace<bool>/8            36.0           36.0     19370222
BM_TransposeInPlace<bool>/16           31.5           31.5     22178902
BM_TransposeInPlace<bool>/32          111            111        6272048
BM_TransposeInPlace<bool>/59          626            626        1000000
BM_TransposeInPlace<bool>/64          428            428        1632689
BM_TransposeInPlace<bool>/128        1677           1677         417377
BM_TransposeInPlace<bool>/256        7126           7126          96264
BM_TransposeInPlace<bool>/512       29021          29024          24165
BM_TransposeInPlace<bool>/1k       116321         116330           6068 
						
					 
					
						2020-05-14 22:39:13 +00:00 
						 
				 
			
				
					
						
							
							
								Felipe Attanasio 
							
						 
					 
					
						
						
						
						
							
						
						
							d640276d31 
							
						 
					 
					
						
						
							
							Added support for reverse iterators for Vectorwise operations.  
						
						
						
					 
					
						2020-05-14 22:38:20 +00:00 
						 
				 
			
				
					
						
							
							
								Christopher Moore 
							
						 
					 
					
						
						
						
						
							
						
						
							fa8fd4b4d5 
							
						 
					 
					
						
						
							
							Indexed view should have RowMajorBit when there is statically a single row  
						
						
						
					 
					
						2020-05-14 22:11:19 +00:00 
						 
				 
			
				
					
						
							
							
								Christopher Moore 
							
						 
					 
					
						
						
						
						
							
						
						
							a187ffea28 
							
						 
					 
					
						
						
							
							Resolve "IndexedView of a vector should allow linear access"  
						
						
						
					 
					
						2020-05-13 19:24:42 +00:00 
						 
				 
			
				
					
						
							
							
								Mark Eberlein 
							
						 
					 
					
						
						
						
						
							
						
						
							ba9d18b938 
							
						 
					 
					
						
						
							
							Add KLU support to spbenchsolver  
						
						
						
					 
					
						2020-05-11 21:50:27 +00:00 
						 
				 
			
				
					
						
							
							
								Pedro Caldeira 
							
						 
					 
					
						
						
						
						
							
						
						
							5fdc179241 
							
						 
					 
					
						
						
							
							Altivec: use template functions to improve code reusability  
						
						
						
					 
					
						2020-05-11 21:04:51 +00:00 
						 
				 
			
				
					
						
							
							
								mehdi-goli 
							
						 
					 
					
						
						
						
						
							
						
						
							d3e81db6c5 
							
						 
					 
					
						
						
							
							Eigen moved the ScanLauncher function inside the internal namespace.  
						
						
						
						
						This commit applies the following changes:
    - Moving the `ScanLauncher` specialization inside the internal namespace to fix a compiler crash on TensorScan for the SYCL backend.
    - Replacing `SYCL/sycl.hpp` with `CL/sycl.hpp` in order to follow the SYCL 1.2.1 standard.
    - minor fixes: commenting out an unused variable to avoid compiler warnings. 
						
					 
					
						2020-05-11 16:10:33 +01:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							c1d944dd91 
							
						 
					 
					
						
						
							
							Remove packet ops pinsertfirst and pinsertlast that are only used in a single place, and can be replaced by other ops when constructing the first/final packet in linspaced_op_impl::packetOp.  
						
						
						
						
						I cannot measure any performance changes for SSE, AVX, or AVX512.
name                                 old time/op             new time/op             delta
BM_LinSpace<float>/1                 1.63ns ± 0%             1.63ns ± 0%   ~             (p=0.762 n=5+5)
BM_LinSpace<float>/8                 4.92ns ± 3%             4.89ns ± 3%   ~             (p=0.421 n=5+5)
BM_LinSpace<float>/64                34.6ns ± 0%             34.6ns ± 0%   ~             (p=0.841 n=5+5)
BM_LinSpace<float>/512                217ns ± 0%              217ns ± 0%   ~             (p=0.421 n=5+5)
BM_LinSpace<float>/4k                1.68µs ± 0%             1.68µs ± 0%   ~             (p=1.000 n=5+5)
BM_LinSpace<float>/32k               13.3µs ± 0%             13.3µs ± 0%   ~             (p=0.905 n=5+4)
BM_LinSpace<float>/256k               107µs ± 0%              107µs ± 0%   ~             (p=0.841 n=5+5)
BM_LinSpace<float>/1M                 427µs ± 0%              427µs ± 0%   ~             (p=0.690 n=5+5) 
						
					 
					
						2020-05-08 15:41:50 -07:00 
						 
				 
			
				
					
						
							
							
								David Tellenbach 
							
						 
					 
					
						
						
						
						
							
						
						
							5c4e19fbe7 
							
						 
					 
					
						
						
							
							Possibility to specify user-defined default cache sizes for GEBP kernel  
						
						
						
						
						Some architectures have no convenient way to determine cache sizes at
runtime. Eigen's GEBP kernel falls back to default cache values in this
case, which might not be correct in all situations.
This patch introduces three preprocessor directives
  `EIGEN_DEFAULT_L1_CACHE_SIZE`
  `EIGEN_DEFAULT_L2_CACHE_SIZE`
  `EIGEN_DEFAULT_L3_CACHE_SIZE`
to give users the possibility to set these default values explicitly. 
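A minimal sketch of how these directives might be set in user code follows. The sizes are invented values for a hypothetical target and are assumed to be given in bytes. The helper `configured_l1_bytes` exists only to make the sketch checkable without Eigen installed.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache sizes (in bytes) for a hypothetical target; replace them
// with the values of the actual hardware.
#define EIGEN_DEFAULT_L1_CACHE_SIZE (32 * 1024)
#define EIGEN_DEFAULT_L2_CACHE_SIZE (512 * 1024)
#define EIGEN_DEFAULT_L3_CACHE_SIZE (4 * 1024 * 1024)

// In a real project the Eigen headers would be included at this point,
// after the definitions above, for example: #include <Eigen/Core>

// Hypothetical helper that reports the configured L1 default.
inline std::ptrdiff_t configured_l1_bytes() {
  return EIGEN_DEFAULT_L1_CACHE_SIZE;
}
```

The same values could instead be passed on the compiler command line with `-D` flags, which avoids depending on include order.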
						
					 
					
						2020-05-08 12:54:36 +02:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							225ab040e0 
							
						 
					 
					
						
						
							
							Remove unused packet op "palign".  
						
						
						
						
						Clean up a compiler warning in c++03 mode in AVX512/Complex.h. 
						
					 
					
						2020-05-07 17:14:26 -07:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							74ec8e6618 
							
						 
					 
					
						
						
							
							Make size odd for transposeInPlace test to make sure we hit the scalar path.  
						
						
						
					 
					
						2020-05-07 17:29:56 +00:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							49f1aeb60d 
							
						 
					 
					
						
						
							
							Remove traits declaring NEON vectorized casts that do not actually have packet op implementations.  
						
						
						
					 
					
						2020-05-07 09:49:22 -07:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							2fd8a5a08f 
							
						 
					 
					
						
						
							
							Add parallelization of TensorScanOp for types without packet ops.  
						
						
						
						
						Clean up the code a bit and do a few micro-optimizations to improve performance for small tensors.
Benchmark numbers for Tensor<uint32_t>:
name                                                       old time/op             new time/op             delta
BM_cumSumRowReduction_1T/8   [using 1 threads]             76.5ns ± 0%             61.3ns ± 4%    -19.80%          (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/64  [using 1 threads]             2.47µs ± 1%             2.40µs ± 1%     -2.77%          (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/256 [using 1 threads]             39.8µs ± 0%             39.6µs ± 0%     -0.60%          (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/4k  [using 1 threads]             13.9ms ± 0%             13.4ms ± 1%     -4.19%          (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/8   [using 2 threads]             76.8ns ± 0%             59.1ns ± 0%    -23.09%          (p=0.016 n=5+4)
BM_cumSumRowReduction_2T/64  [using 2 threads]             2.47µs ± 1%             2.41µs ± 1%     -2.53%          (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/256 [using 2 threads]             39.8µs ± 0%             34.7µs ± 6%    -12.74%          (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/4k  [using 2 threads]             13.8ms ± 1%              7.2ms ± 6%    -47.74%          (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/8   [using 8 threads]             76.4ns ± 0%             61.8ns ± 3%    -19.02%          (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/64  [using 8 threads]             2.47µs ± 1%             2.40µs ± 1%     -2.84%          (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/256 [using 8 threads]             39.8µs ± 0%             28.3µs ±11%    -28.75%          (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/4k  [using 8 threads]             13.8ms ± 0%              2.7ms ± 5%    -80.39%          (p=0.008 n=5+5)
BM_cumSumColReduction_1T/8   [using 1 threads]             59.1ns ± 0%             80.3ns ± 0%    +35.94%          (p=0.029 n=4+4)
BM_cumSumColReduction_1T/64  [using 1 threads]             3.06µs ± 0%             3.08µs ± 1%       ~             (p=0.114 n=4+4)
BM_cumSumColReduction_1T/256 [using 1 threads]              175µs ± 0%              176µs ± 0%       ~             (p=0.190 n=4+5)
BM_cumSumColReduction_1T/4k  [using 1 threads]              824ms ± 1%              844ms ± 1%     +2.37%          (p=0.008 n=5+5)
BM_cumSumColReduction_2T/8   [using 2 threads]             59.0ns ± 0%             90.7ns ± 0%    +53.74%          (p=0.029 n=4+4)
BM_cumSumColReduction_2T/64  [using 2 threads]             3.06µs ± 0%             3.10µs ± 0%     +1.08%          (p=0.016 n=4+5)
BM_cumSumColReduction_2T/256 [using 2 threads]              176µs ± 0%              189µs ±18%       ~             (p=0.151 n=5+5)
BM_cumSumColReduction_2T/4k  [using 2 threads]              836ms ± 2%              611ms ±14%    -26.92%          (p=0.008 n=5+5)
BM_cumSumColReduction_8T/8   [using 8 threads]             59.3ns ± 2%             90.6ns ± 0%    +52.79%          (p=0.008 n=5+5)
BM_cumSumColReduction_8T/64  [using 8 threads]             3.07µs ± 0%             3.10µs ± 0%     +0.99%          (p=0.016 n=5+4)
BM_cumSumColReduction_8T/256 [using 8 threads]              176µs ± 0%               80µs ±19%    -54.51%          (p=0.008 n=5+5)
BM_cumSumColReduction_8T/4k  [using 8 threads]              827ms ± 2%              180ms ±14%    -78.24%          (p=0.008 n=5+5) 
						
					 
					
						2020-05-06 14:48:37 -07:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							0e59f786e1 
							
						 
					 
					
						
						
							
							Fix accidental copy of loop variable.  
						
						
						
					 
					
						2020-05-05 21:35:38 +00:00 
						 
				 
			
				
					
						
							
							
								Rasmus Munk Larsen 
							
						 
					 
					
						
						
						
						
							
						
						
							7b76c85daf 
							
						 
					 
					
						
						
							
							Vectorize and parallelize TensorScanOp.  
						
						
						
						
						TensorScanOp is used in TensorFlow for a number of operations, such as cumulative logexp reduction and cumulative sum and product reductions.
The benchmark numbers below are for cumulative row and column reductions of NxN matrices.
name                                                         old time/op             new time/op     delta
BM_cumSumRowReduction_1T/4    [using 1 threads ]             25.1ns ± 1%             35.2ns ± 1%    +40.45%
BM_cumSumRowReduction_1T/8    [using 1 threads ]             73.4ns ± 0%             82.7ns ± 3%    +12.74%
BM_cumSumRowReduction_1T/32   [using 1 threads ]              988ns ± 0%              832ns ± 0%    -15.77%
BM_cumSumRowReduction_1T/64   [using 1 threads ]             4.07µs ± 2%             3.47µs ± 0%    -14.70%
BM_cumSumRowReduction_1T/128  [using 1 threads ]             18.0µs ± 0%             16.8µs ± 0%     -6.58%
BM_cumSumRowReduction_1T/512  [using 1 threads ]              287µs ± 0%              281µs ± 0%     -2.22%
BM_cumSumRowReduction_1T/2k   [using 1 threads ]             4.78ms ± 1%             4.78ms ± 2%       ~
BM_cumSumRowReduction_1T/10k  [using 1 threads ]              117ms ± 1%              117ms ± 1%       ~
BM_cumSumRowReduction_8T/4    [using 8 threads ]             25.0ns ± 0%             35.2ns ± 0%    +40.82%
BM_cumSumRowReduction_8T/8    [using 8 threads ]             77.2ns ±16%             81.3ns ± 0%       ~
BM_cumSumRowReduction_8T/32   [using 8 threads ]              988ns ± 0%              833ns ± 0%    -15.67%
BM_cumSumRowReduction_8T/64   [using 8 threads ]             4.08µs ± 2%             3.47µs ± 0%    -14.95%
BM_cumSumRowReduction_8T/128  [using 8 threads ]             18.0µs ± 0%             17.3µs ±10%       ~
BM_cumSumRowReduction_8T/512  [using 8 threads ]              287µs ± 0%               58µs ± 6%    -79.92%
BM_cumSumRowReduction_8T/2k   [using 8 threads ]             4.79ms ± 1%             0.64ms ± 1%    -86.58%
BM_cumSumRowReduction_8T/10k  [using 8 threads ]              117ms ± 1%               18ms ± 6%    -84.50%
BM_cumSumColReduction_1T/4    [using 1 threads ]             23.9ns ± 0%             33.4ns ± 1%    +39.68%
BM_cumSumColReduction_1T/8    [using 1 threads ]             71.6ns ± 1%             49.1ns ± 3%    -31.40%
BM_cumSumColReduction_1T/32   [using 1 threads ]              973ns ± 0%              165ns ± 2%    -83.10%
BM_cumSumColReduction_1T/64   [using 1 threads ]             4.06µs ± 1%             0.57µs ± 1%    -85.94%
BM_cumSumColReduction_1T/128  [using 1 threads ]             33.4µs ± 1%              4.1µs ± 1%    -87.67%
BM_cumSumColReduction_1T/512  [using 1 threads ]             1.72ms ± 4%             0.21ms ± 5%    -87.91%
BM_cumSumColReduction_1T/2k   [using 1 threads ]              119ms ±53%               11ms ±35%    -90.42%
BM_cumSumColReduction_1T/10k  [using 1 threads ]              1.59s ±67%              0.35s ±49%    -77.96%
BM_cumSumColReduction_8T/4    [using 8 threads ]             23.8ns ± 0%             33.3ns ± 0%    +40.06%
BM_cumSumColReduction_8T/8    [using 8 threads ]             71.6ns ± 1%             49.2ns ± 5%    -31.33%
BM_cumSumColReduction_8T/32   [using 8 threads ]             1.01µs ±12%             0.17µs ± 3%    -82.93%
BM_cumSumColReduction_8T/64   [using 8 threads ]             4.15µs ± 4%             0.58µs ± 1%    -86.09%
BM_cumSumColReduction_8T/128  [using 8 threads ]             33.5µs ± 0%              4.1µs ± 4%    -87.65%
BM_cumSumColReduction_8T/512  [using 8 threads ]             1.71ms ± 3%             0.06ms ±16%    -96.21%
BM_cumSumColReduction_8T/2k   [using 8 threads ]             97.1ms ±14%              3.0ms ±23%    -96.88%
BM_cumSumColReduction_8T/10k  [using 8 threads ]              1.97s ± 8%              0.06s ± 2%    -96.74% 
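For readers unfamiliar with scan operations, a cumulative sum over an NxN matrix can be sketched with the standard library alone. This is not the TensorScanOp implementation: `cumsum_rows` and `cumsum_cols` are hypothetical helpers on a row-major `std::vector<int>`, and which of them corresponds to the "row" and "column" benchmarks above is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive cumulative sum along each row of an n x n row-major matrix.
// Rows do not depend on each other, so each row could be handed to a
// different thread.
std::vector<int> cumsum_rows(const std::vector<int>& m, std::size_t n) {
  assert(m.size() == n * n);
  std::vector<int> out(m.size());
  for (std::size_t r = 0; r < n; ++r) {
    const std::ptrdiff_t first = static_cast<std::ptrdiff_t>(r * n);
    const std::ptrdiff_t last = static_cast<std::ptrdiff_t>((r + 1) * n);
    std::partial_sum(m.begin() + first, m.begin() + last, out.begin() + first);
  }
  return out;
}

// Inclusive cumulative sum down each column. The inner loop touches
// contiguous memory and its iterations are independent, which is the kind
// of loop that lends itself to vectorization.
std::vector<int> cumsum_cols(const std::vector<int>& m, std::size_t n) {
  assert(m.size() == n * n);
  std::vector<int> out(m);
  for (std::size_t r = 1; r < n; ++r) {
    for (std::size_t c = 0; c < n; ++c) {
      out[r * n + c] += out[(r - 1) * n + c];
    }
  }
  return out;
}
```

For the 2x2 matrix `{1, 2, 3, 4}`, the row-wise scan yields `{1, 3, 3, 7}` and the column-wise scan yields `{1, 2, 4, 6}`.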
						
					 
					
						2020-05-05 00:19:43 +00:00 
						 
				 
			
				
					
						
							
							
								Xiaoxiang Cao 
							
						 
					 
					
						
						
						
						
							
						
						
							a74a278abd 
							
						 
					 
					
						
						
							
							Fix confusing template param name for Stride fwd decl.  
						
						
						
					 
					
						2020-04-30 01:43:05 +00:00