eigen

CFD/eigen

Author	SHA1	Message	Date
Antonio Sanchez	f6c8cc0e99	Fix TensorReduction warnings and error bound for sum accuracy test. The sum accuracy test currently uses the default test precision for the given scalar type. However, scalars are generated via a normal distribution, and given a large enough count and strong enough random generator, the expected sum is zero. This causes the test to periodically fail. Here we estimate an upper-bound for the error as `sqrt(N) * prec` for summing N values, with each having an approximate epsilon of `prec`. Also fixed a few warnings generated by MSVC when compiling the reduction test.	2021-10-30 14:59:00 -07:00
Rasmus Munk Larsen	b3bea43a2d	Don't use unrolled loops for stateful reducers. The problem is the combination step, e.g. reducer0.reducePacket(accum1, accum0); reducer0.reducePacket(accum2, accum0); reducer0.reducePacket(accum3, accum0); For the mean reducer this will increment the count as well as adding together the accumulators and result in the wrong count being divided into the sum at the end.	2021-10-28 23:52:54 +00:00
Antonio Sánchez	185ad0e610	Revert "Avoid integer overflow in EigenMetaKernel indexing" This reverts commit `100d7caf92`	2021-10-27 14:55:25 +00:00
Ben Barsdell	100d7caf92	Avoid integer overflow in EigenMetaKernel indexing - The current implementation computes `size + total_threads`, which can overflow and cause CUDA_ERROR_ILLEGAL_ADDRESS when size is close to the maximum representable value. - The num_blocks calculation can also overflow due to the implementation of divup(). - This patch prevents these overflows and allows the kernel to work correctly for the full representable range of tensor sizes. - Also adds relevant tests.	2021-10-26 00:04:28 +00:00
Antonio Sanchez	a500da1dc0	Fix broadcasting oob error. For vectorized 1-dimensional inputs that do not take the special blocking path (e.g. `std::complex<...>`), there was an index-out-of-bounds error causing the broadcast size to be computed incorrectly. Here we fix this, and make other minor cleanup changes. Fixes #2351.	2021-10-25 19:31:12 +00:00
Nico	b17bcddbca	Fix -Wbitwise-instead-of-logical clang warning. `&` and `|` don't short-circuit; `&&` and `||` do. When both arguments to those are boolean, the short-circuiting version is usually the desired one, so clang warns on this. Here, it is inconsequential, so switch to `&&` and `||` to suppress the warning.	2021-10-21 23:32:45 -04:00
Antonio Sanchez	24ebb37f38	Disable Tree reduction for GPU. For moderately sized inputs, running the Tree reduction quickly fills/overflows the GPU thread stack space, leading to memory errors. This was happening in the `cxx11_tensor_complex_gpu` test, for example. Disabling tree reduction on GPU fixes this.	2021-10-20 20:42:37 +00:00
Rasmus Munk Larsen	360290fc42	Improve accuracy of full tensor reduction for half and bfloat16 by reducing leaf size in tree reduction. Add more unit tests for summation accuracy.	2021-10-20 19:54:06 +00:00
Antonio Sanchez	d0d34524a1	Move CUDA/Complex.h to GPU/Complex.h, remove TensorReductionCuda.h The `Complex.h` file applies equally to HIP/CUDA, so placing under the generic `GPU` folder. The `TensorReductionCuda.h` has already been deprecated, now removing for the next Eigen version.	2021-10-20 12:00:19 -07:00
Rasmus Munk Larsen	1d75fab368	Speed up tensor reduction	2021-10-02 14:58:23 +00:00
Kolja Brix	afa616bc9e	Fix some typos found	2021-09-23 15:22:00 +00:00
sciencewhiz	4b6036e276	fix various typos	2021-09-22 16:15:06 +00:00
Alexander Karatarakis	4d622be118	[AutodiffScalar] Remove const when returning by value clang-tidy: Return type 'const T' is 'const'-qualified at the top level, which may reduce code readability without improving const correctness The types are somewhat long, but the affected return types are of the form: ``` const T my_func() { // } ``` Change to: ``` T my_func() { // } ```	2021-09-18 21:23:32 +00:00
Rasmus Munk Larsen	6cadab6896	Clean up EIGEN_STATIC_ASSERT to only use standard c++11 static_assert.	2021-09-16 20:43:54 +00:00
Rasmus Munk Larsen	d7d0bf832d	Issue an error in case of direct inclusion of internal headers.	2021-09-10 19:12:26 +00:00
Antonio Sanchez	6c10495a78	Remove unnecessary std::tuple reference.	2021-09-09 15:49:44 +00:00
Antonio Sanchez	eea2a3385c	Remove more DynamicSparseMatrix references. Also fixed some typos in SparseExtra/MarketIO.h.	2021-09-02 15:36:47 -07:00
Jens Wehner	8286073c73	Matrixmarket extension	2021-09-02 17:23:33 +00:00
Antonio Sanchez	74da2e6821	Rename Tuple -> Pair. This is to make way for a new `Tuple` class that mimics `std::tuple`, but can be reliably used on device and with aligned Eigen types. The existing Tuple has very few references, and is actually an analogue of `std::pair`.	2021-09-02 02:20:54 +00:00
jenswehner	a443a2373f	updated documentation	2021-08-31 22:58:28 +00:00
Antonio Sanchez	cc3573ab44	Disable cuda Eigen::half vectorization on host. All cuda `__half` functions are device-only in CUDA 9, including conversions. Host-side conversions were added in CUDA 10. The existing code doesn't build prior to 10.0. All arithmetic functions are always device-only, so there is no reason to use vectorization on the host at all. Modified the code to disable vectorization for `__half` on host, which required also updating the `TensorReductionGpu` implementation which previously made assumptions about available packets.	2021-08-31 19:13:12 +00:00
Turing Eret	3324389f6d	Add EIGEN_TENSOR_PLUGIN support per issue #2052 .	2021-08-30 19:36:55 +00:00
Jens Wehner	53ad9c75b4	included unordered_map header	2021-08-27 16:53:28 +00:00
jenswehner	9abf4d0bec	made RandomSetter C++11 compatible	2021-08-25 20:24:55 +00:00
jenswehner	90b3b6b572	added doxygen flowchart	2021-08-24 17:11:51 +00:00
jenswehner	d85de1ef56	removed sparse dynamic matrix	2021-08-24 10:33:00 +02:00
Alexander Karatarakis	4ba872bd75	Avoid leading underscore followed by cap in template identifiers	2021-08-04 22:41:52 +00:00
Alexander Karatarakis	f357283d31	_DerType -> DerivativeType as underscore-followed-by-caps is a reserved identifier	2021-07-29 18:02:04 +00:00
Antonio Sanchez	1fd5ce1002	For GpuDevice::fill, use a single memset if all bytes are equal. The original `fill` implementation introduced a 5x regression on my nvidia Quadro K1200. @rohitsan reported up to 100x regression for HIP. This restores performance.	2021-07-10 13:37:16 +00:00
Antonio Sanchez	9c22795d65	Put attach/detach buffer back in for TensorDeviceSycl. Also added a test to verify the original buffer is updated correctly.	2021-07-09 10:00:05 -07:00
Antonio Sanchez	1e6c6c1576	Replace memset with fill to work for non-trivial scalars. For custom scalars, zero is not necessarily represented by a zeroed-out memory block (e.g. gnu MPFR). We therefore cannot rely on `memset` if we want to fill a matrix or tensor with zeroes. Instead, we should rely on `fill`, which for trivial types does end up getting converted to a `memset` under-the-hood (at least with gcc/clang). Requires adding a `fill(begin, end, v)` to `TensorDevice`. Replaced all potentially bad instances of memset with fill. Fixes #2245.	2021-07-08 18:34:41 +00:00
Jonas Harsch	e9c9a3130b	Removed superfluous boolean `degenerate` in TensorMorphing.h.	2021-07-08 18:02:58 +00:00
Antonio Sanchez	f5a9873bbb	Fix Tensor documentation page. The extra [TOC] tag is generating a huge floating duplicated table-of-contents, which obscures the majority of the page (see bottom of https://eigen.tuxfamily.org/dox/unsupported/eigen_tensors.html). Remove it. Also, headers do not support markup (see [doxygen bug](https://github.com/doxygen/doxygen/issues/7467)), so backticks like ``` ``` end up generating titles that look like ``` Constructor <tt>Tensor<double,2></tt> ``` Removing backticks for now. To generate properly formatted headers, we must directly use html instead of markdown, i.e. ``` <h2>Constructor <code>Tensor<double,2></code></h2> ``` which is ugly. Fixes #2254.	2021-07-03 04:39:22 +00:00
Jonas Harsch	aab747021b	Don't crash when attempting to shuffle an empty tensor.	2021-07-02 20:33:52 +00:00
Antonio Sanchez	6035da5283	Fix compile issues for gcc 4.8. - Move constructors can only be defaulted as NOEXCEPT if all members have NOEXCEPT move constructors. - gcc 4.8 has some funny parsing bug in `a < b->c`, thinking `b-` is a template parameter.	2021-07-01 22:58:14 +00:00
Antonio Sanchez	3a087ccb99	Modify tensor argmin/argmax to always return first occurrence. As written, depending on multithreading/gpu, the returned index from `argmin`/`argmax` is not currently stable. Here we modify the functors to always keep the first occurrence (i.e. if the value is equal to the current min/max, then keep the one with the smallest index). This is otherwise causing unpredictable results in some TF tests.	2021-06-29 10:36:20 -07:00
Antonio Sanchez	e9ab4278b7	Rewrite balancer to avoid overflows. The previous balancer overflowed for large row/column norms. Modified to prevent that. Fixes #2273.	2021-06-21 17:29:55 +00:00
jenswehner	175f0cc1e9	changed documentation to make example compile	2021-06-16 11:45:06 +02:00
Antonio Sanchez	954879183b	Fix placement of permanent GPU defines.	2021-06-15 12:17:09 -07:00
Rasmus Munk Larsen	13fb5ab92c	Fix more enum arithmetic.	2021-06-15 09:09:31 -07:00
Antonio Sanchez	514977f31b	Add ability to permanently enable HIP/CUDA gpu* defines. When using Eigen for gpu, these simplify portability. If `EIGEN_PERMANENTLY_ENABLE_GPU_HIP_CUDA_DEFINES` is set, then we do not undefine them.	2021-06-11 17:19:54 +00:00
Antonio Sanchez	6aec83263d	Allow custom TENSOR_CONTRACTION_DISPATCH macro. Currently TF lite needs to hack around with the Tensor headers in order to customize the contraction dispatch method. Here we add simple `#ifndef` guards to allow them to provide their own dispatch prior to inclusion.	2021-06-11 17:02:19 +00:00
Nathan Luehr	972cf0c28a	Fix calls to device functions from host code	2021-05-11 22:47:49 +00:00
Antonio Sanchez	0eba8a1fe3	Clean up gpu device properties. Made a class and singleton to encapsulate initialization and retrieval of device properties. Related to !481, which already changed the API to address a static linkage issue.	2021-05-07 17:51:29 +00:00
Antonio Sanchez	e3b7f59659	Simplify TensorRandom and remove time-dependence. Time-dependence prevents tests from being repeatable. This has long been an issue with debugging the tensor tests. Removing this will allow future tests to be repeatable in the usual way. Also, the recently added macros in !476 are causing headaches across different platforms. For example, checking `_XOPEN_SOURCE` is leading to multiple ambiguous macro errors across Google, and `_DEFAULT_SOURCE`/`_SVID_SOURCE`/`_BSD_SOURCE` are sometimes defined with values, sometimes defined as empty, and sometimes not defined at all when they probably should be. This is leading to multiple build breakages. The simplest approach is to generate a seed via `Eigen::internal::random<uint64_t>()` if on CPU. For GPU, we use a hash based on the current thread ID (since `rand()` isn't supported on GPU). Fixes #1602.	2021-05-04 13:34:49 -07:00
Turing Eret	3804ca0d90	Fix for issue with static global variables in TensorDeviceGpu.h m_deviceProperties and m_devicePropInitialized are defined as global statics which will define multiple copies which can cause issues if initializeDeviceProp() is called in one translation unit and then m_deviceProperties is used in a different translation unit. Added inline functions getDeviceProperties() and getDevicePropInitialized() which defines those variables as static locals. As per the C++ standard 7.1.2/4, a static local declared in an inline function always refers to the same object, so this should be safer. Credit to Sun Chenggen for this fix. This fixes issue #1475.	2021-04-23 07:43:35 -06:00
Antonio Sanchez	045c0609b5	Check existence of BSD random before use. `TensorRandom` currently relies on BSD `random()`, which is not always available. The [linux manpage](https://man7.org/linux/man-pages/man3/srandom.3.html) gives the glibc condition: ``` _XOPEN_SOURCE >= 500 || /* Glibc since 2.19: */ _DEFAULT_SOURCE || /* Glibc <= 2.19: */ _SVID_SOURCE || _BSD_SOURCE ``` In particular, this was failing to compile for MinGW via msys2. If not available, we fall back to using `rand()`.	2021-04-22 20:42:12 +00:00
Antonio Sanchez	69adf26aa3	Modify googlehash use to account for namespace issues. The namespace declaration for googlehash is a configurable macro that can be disabled. In particular, it is disabled within google, causing compile errors since `dense_hash_map`/`sparse_hash_map` are then in the global namespace instead of in `::google`. Here we play a bit of gymnastics to allow for both `google::_hash_map` and `_hash_map`, while limiting namespace pollution. Symbols within the `::google` namespace are imported into `Eigen::google`. We also remove checks based on `_SPARSE_HASH_MAP_H_`, as this is fragile, and instead require `EIGEN_GOOGLEHASH_SUPPORT` to be defined.	2021-04-12 19:00:39 -07:00
Rasmus Munk Larsen	a2c0542010	Fix typo in TensorDimensions.h	2021-04-12 18:59:56 +00:00
Jens Wehner	f6fc66aa75	fixed doxygen for unsupported iterative solver module	2021-04-11 16:26:14 +00:00
Rohit Santhanam	2859db0220	This fixes an issue where the compiler was not choosing the GPU specific specialization of ScanLauncher. The issue was discovered when the GPU scan unit test was run and resulted in a segmentation fault. The segmentation fault occurred because the unit test allocated GPU memory and passed a pointer to that memory to the computation that it presumed would execute on the GPU. But because of the issue, the computation was scheduled to execute on the CPU, so a situation was constructed where the CPU attempted to access a GPU memory location. The fix expands the GPU specific ScanLauncher specialization to handle cases where vectorization is enabled. Previously, the GPU specialization was chosen only if vectorization was not used.	2021-04-08 15:14:48 +00:00
Steve Bronder	e7b8643d70	Revert "Revert "Adds EIGEN_CONSTEXPR and EIGEN_NOEXCEPT to rows(), cols(), innerStride(), outerStride(), and size()"" This reverts commit `5f0b4a4010`.	2021-03-24 18:14:56 +00:00
Jens Wehner	c0a889890f	Fixed output of complex matrices	2021-03-15 21:51:55 +00:00
Antonio Sanchez	543e34ab9d	Re-implement move assignments. The original swap approach leads to potential undefined behavior (reading uninitialized memory) and results in unnecessary copying of data for static storage. Here we pass down the move assignment to the underlying storage. Static storage does a one-way copy, dynamic storage does a swap. Modified the tests to no longer read from the moved-from matrix/tensor, since that can lead to UB. Added a test to ensure we do not access uninitialized memory in a move. Fixes: #2119	2021-03-10 16:55:20 +00:00
Antonio Sanchez	2468253c9a	Define EIGEN_CPLUSPLUS and replace most __cplusplus checks. The macro `__cplusplus` is not defined correctly in MSVC unless building with the `/Zc:__cplusplus` flag. Instead, it defines `_MSVC_LANG` to the specified c++ standard version number. Here we introduce `EIGEN_CPLUSPLUS` which will contain the c++ version number both for MSVC and otherwise. This simplifies checks for supported features. Also replaced most instances of standard version checking via `__cplusplus` with the existing `EIGEN_COMP_CXXVER` macro for better clarity. Fixes: #2170	2021-03-05 18:33:18 +00:00
David Tellenbach	5f0b4a4010	Revert "Adds EIGEN_CONSTEXPR and EIGEN_NOEXCEPT to rows(), cols(), innerStride(), outerStride(), and size()" This reverts commit `6cbb3038ac` because it breaks clang-10 builds on x86 and aarch64 when C++11 is enabled.	2021-03-05 13:16:43 +01:00
Steve Bronder	6cbb3038ac	Adds EIGEN_CONSTEXPR and EIGEN_NOEXCEPT to rows(), cols(), innerStride(), outerStride(), and size()	2021-03-04 18:58:08 +00:00
Eugene Zhulenev	a6601070f2	Add log2 operation to TensorBase	2021-03-04 00:13:36 +00:00
Christoph Hertzberg	2660d01fa7	Inherit from `no_assignment_operator` to avoid implicit copy constructor warnings (cherry picked from commit 9bbb7ea4b54b1f307863be4ed8d105c38cdefe50)	2021-02-27 18:44:26 +01:00
Christoph Hertzberg	a3521d743c	Fix some enum-enum conversion warnings (cherry picked from commit 838f3d8ce22a5549ef10c7386fb03040721749a0)	2021-02-27 18:44:26 +01:00
Christoph Hertzberg	81b5fe2f0a	ReturnByValue is already non-copyable (cherry picked from commit abbf95045009619f37bd92b45433eedbfcbe41cf)	2021-02-27 18:44:26 +01:00
Christoph Hertzberg	4fb3459a23	Fix double-promotion warnings (cherry picked from commit c22c103e932e511e96645186831363585a44b7a3)	2021-02-27 18:44:26 +01:00
Jens Wehner	4bfcee47b9	Idrs iterative linear solver	2021-02-27 12:09:33 +00:00
Rasmus Munk Larsen	f284c8592b	Don't crash when attempting to slice an empty tensor.	2021-02-24 18:12:51 -08:00
Guoqiang QI	f44197fabd	Some improvements for kissfft from Martin Reinecke (pocketfft author): 1. Compute only about half of the factors and use complex conjugate symmetry for the rest, instead of computing all of them, to save time. 2. Calculate all twiddles in double, because that gives the maximum achievable precision when doing float transforms. 3. Reduce all angles to the range 0 < angle < pi/4, which gives even more precision.	2021-02-24 21:36:47 +00:00
Antonio Sanchez	5f9cfb2529	Add missing adolc isinf/isnan. Also modified cmake/FindAdolc.cmake to eliminate warnings, and added search paths to match install layout. Fixed: #2157	2021-02-19 22:26:56 +00:00
frgossen	33e0af0130	Return nan at poles of polygamma, digamma, and zeta if limit is not defined	2021-02-19 16:35:11 +00:00
David Tellenbach	36200b7855	Remove vim specific comments to recognize correct file-type. As discussed in #2143 we remove editor specific comments.	2021-02-09 09:13:09 +01:00
Ralf Hannemann-Tamas	984d010b7b	add specialization of check_sparse_solving() for SuperLU solver, in order to test adjoint and transpose solves	2021-02-08 22:00:31 +00:00
Antonio Sanchez	3f4684f87d	Include `<cstdint>` in one place, remove custom typedefs Originating from [this SO issue](https://stackoverflow.com/questions/65901014/how-to-solve-this-all-error-2-in-this-case), some win32 compilers define `__int32` as a `long`, but MinGW defines `std::int32_t` as an `int`, leading to a type conflict. To avoid this, we remove the custom `typedef` definitions for win32. The Tensor module requires C++11 anyways, so we are guaranteed to have included `<cstdint>` already in `Eigen/Core`. Also re-arranged the headers to only include `<cstdint>` in one place to avoid this type of error again.	2021-01-26 14:23:05 -08:00
David Tellenbach	660c6b857c	Remove std::cerr in iterative solver since we don't have iostream. This fixes #2123	2021-01-21 11:40:05 +01:00
Maozhou, Ge	21a8a2487c	fix paddings of TensorVolumePatchOp	2021-01-15 11:51:49 +08:00
Antonio Sanchez	070d303d56	Add CUDA complex sqrt. This is to support scalar `sqrt` of complex numbers `std::complex<T>` on device, requested by Tensorflow folks. Technically `std::complex` is not supported by NVCC on device (though it is by clang), so the default `sqrt(std::complex<T>)` function only works on the host. Here we create an overload to add back the functionality. Also modified the CMake file to add `--relaxed-constexpr` (or equivalent) flag for NVCC to allow calling constexpr functions from device functions, and added support for specifying compute architecture for NVCC (was already available for clang).	2020-12-22 23:25:23 -08:00
Turing Eret	19e6496ce0	Replace call to FixedDimensions() with a singleton instance of FixedDimensions.	2020-12-16 07:34:44 -07:00
Turing Eret	bc7d1599fb	TensorStorage with FixedDimensions now has zero instance memory overhead. Removed m_dimension as instance member of TensorStorage with FixedDimensions and instead use the template parameter. This means that the sizeof a pure fixed-size storage is exactly equal to the data it is storing.	2020-12-14 07:19:34 -07:00
Antonio Sanchez	2dbac2f99f	Fix bad NEON fp16 check	2020-12-04 13:42:18 -08:00
Antonio Sanchez	e2f21465fe	Special function implementations for half/bfloat16 packets. Current implementations fail to consider half-float packets, only half-float scalars. Added specializations for packets on AVX, AVX512 and NEON. Added tests to `special_packetmath`. The current `special_functions` tests would fail for half and bfloat16 due to lack of precision. The NEON tests also fail with precision issues and due to different handling of `sqrt(inf)`, so special functions bessel, ndtri have been disabled. Tested with AVX, AVX512.	2020-12-04 10:16:29 -08:00
Rasmus Munk Larsen	71c85df4c1	Clean up the Tensor header and get rid of the EIGEN_SLEEP macro.	2020-12-02 11:04:04 -08:00
Antonio Sanchez	17268b155d	Add bit_cast for half/bfloat to/from uint16_t, fix TensorRandom The existing `TensorRandom.h` implementation makes the assumption that `half` (`bfloat16`) has a `uint16_t` member `x` (`value`), which is not always true. This currently fails on arm64, where `x` has type `__fp16`. Added `bit_cast` specializations to allow casting to/from `uint16_t` for both `half` and `bfloat16`. Also added tests in `half_float`, `bfloat16_float`, and `cxx11_tensor_random` to catch these errors in the future.	2020-11-18 20:32:35 +00:00
Antonio Sanchez	3669498f5a	Fix rule-of-3 for the Tensor module. Adds copy constructors to Tensor ops, inherits assignment operators from `TensorBase`. Addresses #1863	2020-11-18 18:14:53 +00:00
mehdi-goli	a725a3233c	[SYCL clean up the code] : removing extra #pragma unroll in SYCL which was causing issues in embedded systems	2020-10-28 08:34:49 +00:00
Rasmus Munk Larsen	61fc78bbda	Get rid of nested template specialization in TensorReductionGpu.h, which was broken by `c6953f799b`.	2020-10-13 23:53:11 +00:00
Rasmus Munk Larsen	c6953f799b	Add packet generic ops `predux_fmin`, `predux_fmin_nan`, `predux_fmax`, and `predux_fmax_nan` that implement reductions with `PropagateNaN`, and `PropagateNumbers` semantics. Add (slow) generic implementations for most reductions.	2020-10-13 21:48:31 +00:00
David Tellenbach	8f8d77b516	Add EIGEN prefix for HAS_LGAMMA_R	2020-10-08 18:32:19 +02:00
Eugene Zhulenev	2279f2c62f	Use lgamma_r if it is available (update check for glibc 2.19+)	2020-10-08 00:26:45 +00:00
Rasmus Munk Larsen	b431024404	Don't make assumptions about NaN-propagation for pmin/pmax - it varies across platforms. Change test to only test for NaN-propagation for pfmin/pfmax.	2020-10-07 19:05:18 +00:00
Zhuyie	e4b24e7fb2	Fix Eigen::ThreadPool::CurrentThreadId returning wrong thread id when EIGEN_AVOID_THREAD_LOCAL and NDEBUG are defined	2020-09-25 09:36:43 +00:00
Rasmus Munk Larsen	e55182ac09	Get rid of initialization logic for blueNorm by making the computed constants static const or constexpr. Move macro definition EIGEN_CONSTEXPR to Core and make all methods in NumTraits constexpr when EIGEN_HAS_CONSTEXPR is 1.	2020-09-18 17:38:58 +00:00
Deven Desai	603e213d13	Fixing a CUDA / P100 regression introduced by PR 181 PR 181 ( https://gitlab.com/libeigen/eigen/-/merge_requests/181 ) adds `__launch_bounds__(1024)` attribute to GPU kernels, that did not have that attribute explicitly specified. That PR seems to cause regressions on the CUDA platform. This PR/commit makes the changes in PR 181, to be applicable for HIP only	2020-08-20 00:29:57 +00:00
Deven Desai	46f8a18567	Adding an explicit launch_bounds(1024) attribute for GPU kernels. Starting with ROCm 3.5, the HIP compiler will change from HCC to hip-clang. This compiler change introduces a change in the default value of the `__launch_bounds__` attribute associated with a GPU kernel. (default value means the value assumed by the compiler as the `__launch_bounds__` attribute value, when it is not explicitly specified by the user) Currently (i.e. for HIP with ROCm 3.3 and older), the default value is 1024. That changes to 256 with ROCm 3.5 (i.e. hip-clang compiler). As a consequence of this change, if a GPU kernel with a `__launch_bounds__` attribute of 256 is launched at runtime with a threads_per_block value > 256, it leads to a runtime error. This is leading to a couple of Eigen unit test failures with ROCm 3.5. This commit adds an explicit `__launch_bounds__(1024)` attribute to every GPU kernel that currently does not have it explicitly specified (and hence will end up getting the default value of 256 with the change to hip-clang)	2020-08-05 01:46:34 +00:00
Rasmus Munk Larsen	b92206676c	Inherit alignment trait from argument in TensorBroadcasting to avoid segfault when the argument is unaligned.	2020-07-28 19:19:37 +00:00
Antonio Sanchez	9cb8771e9c	Fix tensor casts for large packets and casts to/from std::complex. The original tensor casts were only defined for `SrcCoeffRatio`:`TgtCoeffRatio` 1:1, 1:2, 2:1, 4:1. Here we add the missing 1:N and 8:1. We also add casting `Eigen::half` to/from `std::complex<T>`, which was missing, to make it consistent with `Eigen::bfloat16`, and generalize the overload to work for any complex type. Tests were added to `basicstuff`, `packetmath`, and `cxx11_tensor_casts` to test all cast configurations.	2020-06-30 18:53:55 +00:00
Teng Lu	386d809bde	Support BFloat16 in Eigen	2020-06-20 19:16:24 +00:00
Ilya Tokar	231ce21535	Run two independent chains, when reducing tensors. Running two chains exposes more instruction level parallelism, by allowing to execute both chains at the same time. Results are a bit noisy, but for medium length we almost hit theoretical upper bound of 2x. BM_fullReduction_16T/3 [using 16 threads] 17.3ns ±11% 17.4ns ± 9% ~ (p=0.178 n=18+19) BM_fullReduction_16T/4 [using 16 threads] 17.6ns ±17% 17.0ns ±18% ~ (p=0.835 n=20+19) BM_fullReduction_16T/7 [using 16 threads] 18.9ns ±12% 18.2ns ±10% ~ (p=0.756 n=20+18) BM_fullReduction_16T/8 [using 16 threads] 19.8ns ±13% 19.4ns ±21% ~ (p=0.512 n=20+20) BM_fullReduction_16T/10 [using 16 threads] 23.5ns ±15% 20.8ns ±24% -11.37% (p=0.000 n=20+19) BM_fullReduction_16T/15 [using 16 threads] 35.8ns ±21% 26.9ns ±17% -24.76% (p=0.000 n=20+19) BM_fullReduction_16T/16 [using 16 threads] 38.7ns ±22% 27.7ns ±18% -28.40% (p=0.000 n=20+19) BM_fullReduction_16T/31 [using 16 threads] 146ns ±17% 74ns ±11% -49.05% (p=0.000 n=20+18) BM_fullReduction_16T/32 [using 16 threads] 154ns ±19% 84ns ±30% -45.79% (p=0.000 n=20+19) BM_fullReduction_16T/64 [using 16 threads] 603ns ± 8% 308ns ±12% -48.94% (p=0.000 n=17+17) BM_fullReduction_16T/128 [using 16 threads] 2.44µs ±13% 1.22µs ± 1% -50.29% (p=0.000 n=17+17) BM_fullReduction_16T/256 [using 16 threads] 9.84µs ±14% 5.13µs ±30% -47.82% (p=0.000 n=19+19) BM_fullReduction_16T/512 [using 16 threads] 78.0µs ± 9% 56.1µs ±17% -28.02% (p=0.000 n=18+20) BM_fullReduction_16T/1k [using 16 threads] 325µs ± 5% 263µs ± 4% -19.00% (p=0.000 n=20+16) BM_fullReduction_16T/2k [using 16 threads] 1.09ms ± 3% 0.99ms ± 1% -9.04% (p=0.000 n=20+20) BM_fullReduction_16T/4k [using 16 threads] 7.66ms ± 3% 7.57ms ± 3% -1.24% (p=0.017 n=20+20) BM_fullReduction_16T/10k [using 16 threads] 65.3ms ± 4% 65.0ms ± 3% ~ (p=0.718 n=20+20)	2020-06-16 15:55:11 -04:00
mehdi-goli	d3e81db6c5	Eigen moved the `ScanLauncher` function inside the internal namespace. This commit applies the following changes: - Moving the `ScanLauncher` specialization inside internal namespace to fix compiler crash on TensorScan for SYCL backend. - Replacing `SYCL/sycl.hpp` with `CL/sycl.hpp` in order to follow SYCL 1.2.1 standard. - minor fixes: commenting out an unused variable to avoid compiler warnings.	2020-05-11 16:10:33 +01:00
Rasmus Munk Larsen	2fd8a5a08f	Add parallelization of TensorScanOp for types without packet ops. Clean up the code a bit and do a few micro-optimizations to improve performance for small tensors. Benchmark numbers for Tensor<uint32_t>: name old time/op new time/op delta BM_cumSumRowReduction_1T/8 [using 1 threads] 76.5ns ± 0% 61.3ns ± 4% -19.80% (p=0.008 n=5+5) BM_cumSumRowReduction_1T/64 [using 1 threads] 2.47µs ± 1% 2.40µs ± 1% -2.77% (p=0.008 n=5+5) BM_cumSumRowReduction_1T/256 [using 1 threads] 39.8µs ± 0% 39.6µs ± 0% -0.60% (p=0.008 n=5+5) BM_cumSumRowReduction_1T/4k [using 1 threads] 13.9ms ± 0% 13.4ms ± 1% -4.19% (p=0.008 n=5+5) BM_cumSumRowReduction_2T/8 [using 2 threads] 76.8ns ± 0% 59.1ns ± 0% -23.09% (p=0.016 n=5+4) BM_cumSumRowReduction_2T/64 [using 2 threads] 2.47µs ± 1% 2.41µs ± 1% -2.53% (p=0.008 n=5+5) BM_cumSumRowReduction_2T/256 [using 2 threads] 39.8µs ± 0% 34.7µs ± 6% -12.74% (p=0.008 n=5+5) BM_cumSumRowReduction_2T/4k [using 2 threads] 13.8ms ± 1% 7.2ms ± 6% -47.74% (p=0.008 n=5+5) BM_cumSumRowReduction_8T/8 [using 8 threads] 76.4ns ± 0% 61.8ns ± 3% -19.02% (p=0.008 n=5+5) BM_cumSumRowReduction_8T/64 [using 8 threads] 2.47µs ± 1% 2.40µs ± 1% -2.84% (p=0.008 n=5+5) BM_cumSumRowReduction_8T/256 [using 8 threads] 39.8µs ± 0% 28.3µs ±11% -28.75% (p=0.008 n=5+5) BM_cumSumRowReduction_8T/4k [using 8 threads] 13.8ms ± 0% 2.7ms ± 5% -80.39% (p=0.008 n=5+5) BM_cumSumColReduction_1T/8 [using 1 threads] 59.1ns ± 0% 80.3ns ± 0% +35.94% (p=0.029 n=4+4) BM_cumSumColReduction_1T/64 [using 1 threads] 3.06µs ± 0% 3.08µs ± 1% ~ (p=0.114 n=4+4) BM_cumSumColReduction_1T/256 [using 1 threads] 175µs ± 0% 176µs ± 0% ~ (p=0.190 n=4+5) BM_cumSumColReduction_1T/4k [using 1 threads] 824ms ± 1% 844ms ± 1% +2.37% (p=0.008 n=5+5) BM_cumSumColReduction_2T/8 [using 2 threads] 59.0ns ± 0% 90.7ns ± 0% +53.74% (p=0.029 n=4+4) BM_cumSumColReduction_2T/64 [using 2 threads] 3.06µs ± 0% 3.10µs ± 0% +1.08% (p=0.016 n=4+5) BM_cumSumColReduction_2T/256 [using 2 threads] 176µs ± 0% 189µs ±18% ~ (p=0.151 n=5+5) BM_cumSumColReduction_2T/4k [using 2 threads] 836ms ± 2% 611ms ±14% -26.92% (p=0.008 n=5+5) BM_cumSumColReduction_8T/8 [using 8 threads] 59.3ns ± 2% 90.6ns ± 0% +52.79% (p=0.008 n=5+5) BM_cumSumColReduction_8T/64 [using 8 threads] 3.07µs ± 0% 3.10µs ± 0% +0.99% (p=0.016 n=5+4) BM_cumSumColReduction_8T/256 [using 8 threads] 176µs ± 0% 80µs ±19% -54.51% (p=0.008 n=5+5) BM_cumSumColReduction_8T/4k [using 8 threads] 827ms ± 2% 180ms ±14% -78.24% (p=0.008 n=5+5)	2020-05-06 14:48:37 -07:00
Rasmus Munk Larsen	0e59f786e1	Fix accidental copy of loop variable.	2020-05-05 21:35:38 +00:00
Rasmus Munk Larsen	7b76c85daf	Vectorize and parallelize TensorScanOp. TensorScanOp is used in TensorFlow for a number of operations, such as cumulative logexp reduction and cumulative sum and product reductions. The benchmarks numbers below are for cumulative row- and column reductions of NxN matrices. name old time/op new time/op delta BM_cumSumRowReduction_1T/4 [using 1 threads ] 25.1ns ± 1% 35.2ns ± 1% +40.45% BM_cumSumRowReduction_1T/8 [using 1 threads ] 73.4ns ± 0% 82.7ns ± 3% +12.74% BM_cumSumRowReduction_1T/32 [using 1 threads ] 988ns ± 0% 832ns ± 0% -15.77% BM_cumSumRowReduction_1T/64 [using 1 threads ] 4.07µs ± 2% 3.47µs ± 0% -14.70% BM_cumSumRowReduction_1T/128 [using 1 threads ] 18.0µs ± 0% 16.8µs ± 0% -6.58% BM_cumSumRowReduction_1T/512 [using 1 threads ] 287µs ± 0% 281µs ± 0% -2.22% BM_cumSumRowReduction_1T/2k [using 1 threads ] 4.78ms ± 1% 4.78ms ± 2% ~ BM_cumSumRowReduction_1T/10k [using 1 threads ] 117ms ± 1% 117ms ± 1% ~ BM_cumSumRowReduction_8T/4 [using 8 threads ] 25.0ns ± 0% 35.2ns ± 0% +40.82% BM_cumSumRowReduction_8T/8 [using 8 threads ] 77.2ns ±16% 81.3ns ± 0% ~ BM_cumSumRowReduction_8T/32 [using 8 threads ] 988ns ± 0% 833ns ± 0% -15.67% BM_cumSumRowReduction_8T/64 [using 8 threads ] 4.08µs ± 2% 3.47µs ± 0% -14.95% BM_cumSumRowReduction_8T/128 [using 8 threads ] 18.0µs ± 0% 17.3µs ±10% ~ BM_cumSumRowReduction_8T/512 [using 8 threads ] 287µs ± 0% 58µs ± 6% -79.92% BM_cumSumRowReduction_8T/2k [using 8 threads ] 4.79ms ± 1% 0.64ms ± 1% -86.58% BM_cumSumRowReduction_8T/10k [using 8 threads ] 117ms ± 1% 18ms ± 6% -84.50% BM_cumSumColReduction_1T/4 [using 1 threads ] 23.9ns ± 0% 33.4ns ± 1% +39.68% BM_cumSumColReduction_1T/8 [using 1 threads ] 71.6ns ± 1% 49.1ns ± 3% -31.40% BM_cumSumColReduction_1T/32 [using 1 threads ] 973ns ± 0% 165ns ± 2% -83.10% BM_cumSumColReduction_1T/64 [using 1 threads ] 4.06µs ± 1% 0.57µs ± 1% -85.94% BM_cumSumColReduction_1T/128 [using 1 threads ] 33.4µs ± 1% 4.1µs ± 1% -87.67% BM_cumSumColReduction_1T/512 [using 1 threads ] 1.72ms ± 4% 0.21ms ± 5% -87.91% BM_cumSumColReduction_1T/2k [using 1 threads ] 119ms ±53% 11ms ±35% -90.42% BM_cumSumColReduction_1T/10k [using 1 threads ] 1.59s ±67% 0.35s ±49% -77.96% BM_cumSumColReduction_8T/4 [using 8 threads ] 23.8ns ± 0% 33.3ns ± 0% +40.06% BM_cumSumColReduction_8T/8 [using 8 threads ] 71.6ns ± 1% 49.2ns ± 5% -31.33% BM_cumSumColReduction_8T/32 [using 8 threads ] 1.01µs ±12% 0.17µs ± 3% -82.93% BM_cumSumColReduction_8T/64 [using 8 threads ] 4.15µs ± 4% 0.58µs ± 1% -86.09% BM_cumSumColReduction_8T/128 [using 8 threads ] 33.5µs ± 0% 4.1µs ± 4% -87.65% BM_cumSumColReduction_8T/512 [using 8 threads ] 1.71ms ± 3% 0.06ms ±16% -96.21% BM_cumSumColReduction_8T/2k [using 8 threads ] 97.1ms ±14% 3.0ms ±23% -96.88% BM_cumSumColReduction_8T/10k [using 8 threads ] 1.97s ± 8% 0.06s ± 2% -96.74%	2020-05-05 00:19:43 +00:00
Rasmus Munk Larsen	ab773c7e91	Extend support for Packet16b: * Add ptranspose<,4> to support matmul and add unit test for Matrix<bool> Matrix<bool> * work around a bug in slicing of Tensor<bool>. * Add tensor tests This speeds up matmul for boolean matrices by about 10x name old time/op new time/op delta BM_MatMul<bool>/8 267ns ± 0% 479ns ± 0% +79.25% (p=0.008 n=5+5) BM_MatMul<bool>/32 6.42µs ± 0% 0.87µs ± 0% -86.50% (p=0.008 n=5+5) BM_MatMul<bool>/64 43.3µs ± 0% 5.9µs ± 0% -86.42% (p=0.008 n=5+5) BM_MatMul<bool>/128 315µs ± 0% 44µs ± 0% -85.98% (p=0.008 n=5+5) BM_MatMul<bool>/256 2.41ms ± 0% 0.34ms ± 0% -85.68% (p=0.008 n=5+5) BM_MatMul<bool>/512 18.8ms ± 0% 2.7ms ± 0% -85.53% (p=0.008 n=5+5) BM_MatMul<bool>/1k 149ms ± 0% 22ms ± 0% -85.40% (p=0.008 n=5+5)	2020-04-28 16:12:47 +00:00
Eugene Zhulenev	3c02fefec5	Add async evaluation support to TensorSlicingOp. Device::memcpy is not async-safe and might lead to deadlocks. Always evaluate slice expression in async mode.	2020-04-22 19:55:01 +00:00
Changming Sun	b1aa07a8d3	Fix a bug in TensorIndexList.h	2020-04-13 18:22:03 +00:00
jangsoopark	39142904cc	Resolve C4346 when building eigen on windows	2020-04-08 14:55:39 +09:00
Deven Desai	7158ed4e0e	Fixing HIP breakage caused by the recent commit that introduces Packet4h2 as the Eigen::Half packet type	2020-03-12 01:06:24 +00:00
Sami Kama	b733b8b680	remove duplicate pset1 for half and add some comments about why we need to expose pmul/add/div/min/max on host	2020-03-10 20:28:43 +00:00
Cédric Hubert	98bfc5aaa8	Update MarketIO.h	2020-02-28 12:41:51 +00:00
Ilya Tokar	eb6cc29583	Avoid a division in NonBlockingThreadPool::Steal. Looking at profiles we spend ~10-20% of Steal on simply computing random % size. We can reduce random 32-bit int into [0, size) range with a single multiplication and shift. This transformation is described in https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/	2020-02-14 16:02:57 -05:00
Eugene Zhulenev	f584bd9b30	Fail at compile time if default executor tries to use non-default device	2020-02-06 22:43:24 +00:00
Eugene Zhulenev	3fda850c46	Remove dead code from TensorReduction.h	2020-01-29 18:45:31 +00:00
Jeff Daily	b5df8cabd7	fix hip-clang compilation due to new HIP scalar accessor	2020-01-20 21:08:52 +00:00
Deven Desai	6d284bb1b7	Fix for HIP breakage - 200115. Adding a missing EIGEN_DEVICE_FUNC attr	2020-01-16 00:51:43 +00:00
Srinivas Vasudevan	f6c6de5d63	Ensure Igamma does not NaN or Inf for large values.	2020-01-14 21:32:48 +00:00
Eugene Zhulenev	b9362fb8f7	Convert StridedLinearBufferCopy::Kind to enum class	2020-01-13 11:43:24 -08:00
Matthew Powelson	2ea5a715cf	Properly initialize b vector in SplineFitting. InterpolateWithDerivative does not initialize the b vector correctly. This issue is discussed in Stack Overflow question 48382939.	2020-01-09 21:29:04 +00:00
Ilya Tokar	19876ced76	Bug #1785 : Introduce numext::rint. This provides a new op that matches std::rint and previous behavior of pround. Also adds corresponding unsupported/../Tensor op. Performance is the same as e. g. floor (tested SSE/AVX).	2020-01-07 21:22:44 +00:00
mehdi-goli	d0ae052da4	[SYCL Backend] * Adding Missing operations for vector comparison in SYCL. This caused compiler error for vector comparison when compiling SYCL * Fixing the compiler error for placement new in TensorForcedEval.h This caused compiler error when compiling SYCL backend * Reducing the SYCL warning by removing the abort function inside the kernel * Adding Strong inline to functions inside SYCL interop.	2020-01-07 15:13:37 +00:00
Deven Desai	636e2bb3fa	Fix for HIP breakage - 191220 The breakage was introduced by the following commit : `ae07801dd8` After the commit, HIPCC errors out on some tests with the following error ``` Building HIPCC object unsupported/test/CMakeFiles/cxx11_tensor_device_1.dir/cxx11_tensor_device_1_generated_cxx11_tensor_device.cu.o In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_device.cu:17: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:100: /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:129:12: error: no matching constructor for initialization of 'Eigen::internal::TensorBlockResourceRequirements' return {merge(lhs.shape_type, rhs.shape_type), // shape_type ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided struct TensorBlockResourceRequirements { ^ /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit copy constructor) not viable: requires 5 arguments, but 3 were provided /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit default constructor) not viable: requires 0 arguments, but 3 were provided ... ... ``` The fix is to explicitly declare the (implicitly called) constructor as a device func	2019-12-20 21:28:00 +00:00
Christoph Hertzberg	1e9664b147	Bug #1796 : Make matrix squareroot usable for Map and Ref types	2019-12-20 18:10:22 +01:00
Christoph Hertzberg	d86544d654	Reduce code duplication and avoid confusing Doxygen	2019-12-19 19:48:39 +01:00
Christoph Hertzberg	dde279f57d	Hide recursive meta templates from Doxygen	2019-12-19 19:47:23 +01:00
Christoph Hertzberg	a3273aeff8	Fix trivial shadow warning	2019-12-19 19:13:11 +01:00
Eugene Zhulenev	7a65219a2e	Fix TensorPadding bug in squeezed reads from inner dimension	2019-12-19 05:43:57 +00:00
Eugene Zhulenev	73e55525e5	Return const data pointer from TensorRef evaluator.data()	2019-12-18 23:19:36 +00:00
Eugene Zhulenev	ae07801dd8	Tensor block evaluation cost model	2019-12-18 20:07:00 +00:00
Jeff Daily	de07c4d1c2	fix compilation due to new HIP scalar accessor	2019-12-17 20:27:30 +00:00
Eugene Zhulenev	788bef6ab5	Reduce block evaluation overhead for small tensor expressions	2019-12-17 19:06:14 +00:00
Eugene Zhulenev	381f8f3139	Initialize non-trivially constructible types when allocating a temp buffer.	2019-12-12 01:31:30 +00:00
Eugene Zhulenev	64272c7f40	Squeeze reads from two inner dimensions in TensorPadding	2019-12-11 16:54:51 -08:00
Eugene Zhulenev	963ba1015b	Add back accidentally deleted default constructor to TensorExecutorTilingContext.	2019-12-11 18:47:55 +00:00
Eugene Zhulenev	c9220c035f	Remove block memory allocation required by removed block evaluation API	2019-12-10 17:15:55 -08:00
Eugene Zhulenev	1c879eb010	Remove V2 suffix from TensorBlock	2019-12-10 15:40:23 -08:00
Eugene Zhulenev	dbca11e880	Remove TensorBlock.h and old TensorBlock/BlockMapper	2019-12-10 14:31:44 -08:00
Deven Desai	c49f0d851a	Fix for HIP breakage detected on 191210 The following commit introduces compile errors when running eigen with hipcc `2918f85ba9` hipcc errors out because it requires the device attribute on the methods within the TensorBlockV2ResourceRequirements struct introduced by the commit above. The fix is to add the device attribute to those methods	2019-12-10 22:14:05 +00:00
Eugene Zhulenev	2918f85ba9	Do not use std::vector in getResourceRequirements	2019-12-09 16:19:55 -08:00
Artem Belevich	8056a05b54	Undo the block size change. .z is used by the EigenContractionKernelInternal().	2019-12-09 11:10:29 -08:00
Eugene Zhulenev	dbb703d44e	Add async evaluation support to TensorSelectOp	2019-12-09 18:36:13 +00:00
Janek Kozicki	11d6465326	fix AlignedVector3 inconsistent interface with other Vector classes, default constructor and operator- were missing.	2019-12-06 21:07:39 +01:00
Eugene Zhulenev	bb7ccac3af	Add recursive work splitting to EvalShardedByInnerDimContext	2019-12-05 14:51:49 -08:00
Artem Belevich	25230d1862	Improve performance of contraction kernels * Force-inline implementations. They pass around pointers to shared memory blocks. Without inlining compiler must operate via generic pointers. Inlining allows compiler to detect that we're operating on shared memory which allows generation of substantially faster code. * Fixed a long-standing typo which resulted in launching 8x more kernels than we needed (.z dimension of the block is unused by the kernel).	2019-12-05 12:48:34 -08:00
Eugene Zhulenev	8f4536e852	Capture TensorMap by value inside tensor expression AST	2019-12-03 16:39:05 -08:00
Rasmus Munk Larsen	4e696901f8	Remove __host__ annotation for device-only function.	2019-12-03 14:33:19 -08:00
Rasmus Munk Larsen	ead81559c8	Use EIGEN_DEVICE_FUNC macro instead of __device__.	2019-12-03 12:08:22 -08:00
Mehdi Goli	00f32752f7	[SYCL] Rebasing the SYCL support branch on top of the Eigen upstream master branch. * Unifying all loadLocalTile from lhs and rhs to an extract_block function. * Adding get_tensor operation which was missing in TensorContractionMapper. * Adding the -D method missing from cmake for Disable_Skinny Contraction operation. * Wrapping all the indices in TensorScanSycl into Scan parameter struct. * Fixing typo in Device SYCL * Unifying load to private register for tall/skinny no shared * Unifying load to vector tile for tensor-vector/vector-tensor operation * Removing all the LHS/RHS class for extracting data from global * Removing Outputfunction from TensorContractionSkinnyNoshared. * Combining the local memory version of tall/skinny and normal tensor contraction into one kernel. * Combining the no-local memory version of tall/skinny and normal tensor contraction into one kernel. * Combining General Tensor-Vector and VectorTensor contraction into one kernel. * Making double buffering optional for Tensor contraction when the local memory version is used. * Modifying benchmark to accept custom Reduction Sizes * Disabling AVX optimization for SYCL backend on the host to allow SSE optimization to the host * Adding Test for SYCL * Modifying SYCL CMake	2019-11-28 10:08:54 +00:00
Eugene Zhulenev	5496d0da0b	Add async evaluation support to TensorReverse	2019-11-26 15:02:24 -08:00
Eugene Zhulenev	bc66c88255	Add async evaluation support to TensorPadding/TensorImagePatch/TensorShuffling	2019-11-26 11:41:57 -08:00
Hans Johnson	8c8cab1afd	STYLE: Convert CMake-language commands to lower case Ancient CMake versions required upper-case commands. Later command names became case-insensitive. Now the preferred style is lower-case.	2019-10-31 11:36:37 -05:00
Gael Guennebaud	c3f6fcf2c0	bug #1747 : one more fix for MSVC regarding the Bessel implementation.	2019-11-15 11:12:35 +01:00
Gael Guennebaud	b9837ca9ae	bug #1281 : fix AutoDiffScalar's make_coherent for nested expression of constant ADs.	2019-11-14 14:58:08 +01:00
Eugene Zhulenev	13c3327f5c	Remove legacy block evaluation support	2019-11-12 10:12:28 -08:00
Rasmus Munk Larsen	0ed0338593	Fix a race in async tensor evaluation: Don't run on_done() until after device.deallocate() / evaluator.cleanup() complete, since the device might be destroyed after on_done() runs.	2019-11-11 12:26:41 -08:00
Eugene Zhulenev	c952b8dfda	Break loop dependence in TensorGenerator block access	2019-11-11 10:32:57 -08:00


2485 Commits
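
Commit eb6cc29583 above replaces `random % size` in NonBlockingThreadPool::Steal with a single multiplication and shift. A minimal sketch of that reduction follows, assuming a uniformly distributed 32-bit input; the function name `reduce_range` is hypothetical and this is not the code from the commit itself.

```cpp
#include <cstdint>

// Hypothetical helper (not Eigen's actual code): maps a 32-bit value x
// into [0, size) without a division, using the multiply-and-shift trick
// from the blog post linked in the commit message.
// The 64-bit product x * size is always less than size * 2^32, so its
// upper 32 bits are always less than size.
inline std::uint32_t reduce_range(std::uint32_t x, std::uint32_t size) {
  const std::uint64_t product =
      static_cast<std::uint64_t>(x) * static_cast<std::uint64_t>(size);
  return static_cast<std::uint32_t>(product >> 32);
}
```

The result is generally not the same value as `x % size`; it is only guaranteed to fall in the same range, which is enough when `x` is random and the goal is to pick an index.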