FoldsCUDA.jl

FoldsCUDA.FoldsCUDA — Module

FoldsCUDA.jl provides Transducers.jl-compatible fold (reduce) functions implemented using CUDA.jl. This brings the transducers and reducing function combinators implemented in Transducers.jl to the GPU. Furthermore, using FLoops.jl, you can write parallel `for` loops that run on the GPU.
API
FoldsCUDA exports `CUDAEx`, a parallel loop executor. It can be used with parallel `for` loops created with `FLoops.@floop`, the `Base`-like high-level parallel API in Folds.jl, and the extensible transducers provided by Transducers.jl.
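As a sketch of these three entry points side by side (assuming a CUDA-capable GPU is available; the array size and the sum reduction are illustrative, not from the package docs):

```julia
using FoldsCUDA, CUDA, FLoops, Folds, Transducers

xs = CUDA.rand(10^6)

# 1. A parallel `for` loop via FLoops.@floop with the CUDAEx executor:
@floop CUDAEx() for x in xs
    @reduce(s += x)
end

# 2. The Base-like high-level API from Folds.jl:
s2 = Folds.sum(xs, CUDAEx())

# 3. A transducer from Transducers.jl applied inside a Folds.jl fold:
s3 = Folds.reduce(+, Map(identity)(xs), CUDAEx())
```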
Examples
findmax using FLoops.jl

You can pass the CUDA executor `FoldsCUDA.CUDAEx()` to `@floop` to run a parallel `for` loop on the GPU:
```julia
julia> using FoldsCUDA, CUDA, FLoops

julia> using GPUArrays: @allowscalar

julia> xs = CUDA.rand(10^8);

julia> @allowscalar xs[100] = 2;

julia> @allowscalar xs[200] = 2;

julia> @floop CUDAEx() for (x, i) in zip(xs, eachindex(xs))
           @reduce() do (imax = -1; i), (xmax = -Inf32; x)
               if xmax < x
                   xmax = x
                   imax = i
               end
           end
       end

julia> xmax
2.0f0

julia> imax  # the *first* position for the largest value
100
```
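The same `@reduce` syntax can also accumulate several independent results in a single pass. A minimal sketch, assuming a CUDA-capable GPU (the names `s` and `nbig` and the `0.5f0` threshold are illustrative, not from the package docs):

```julia
using FoldsCUDA, CUDA, FLoops

xs = CUDA.rand(10^6)

# One GPU pass computing both a sum and a count above a threshold;
# `@reduce(acc = init op x)` declares each accumulator with its initial value.
@floop CUDAEx() for x in xs
    @reduce(s = 0.0f0 + x)
    @reduce(nbig = 0 + (x > 0.5f0))
end
```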
extrema using Transducers.TeeRF
```julia
julia> using Transducers, Folds

julia> @allowscalar xs[300] = -0.5;

julia> Folds.reduce(TeeRF(min, max), xs, CUDAEx())
(-0.5f0, 2.0f0)

julia> Folds.reduce(TeeRF(min, max), (2x for x in xs), CUDAEx())  # iterator comprehension works
(-1.0f0, 4.0f0)

julia> Folds.reduce(TeeRF(min, max), Map(x -> 2x)(xs), CUDAEx())  # equivalent, using a transducer
(-1.0f0, 4.0f0)
```
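`TeeRF` is not limited to `min` and `max`; it fuses any set of reducing functions into one pass over the data. A brief sketch under the same GPU assumption (the `+`/`max` combination is illustrative):

```julia
using FoldsCUDA, Folds, Transducers, CUDA

xs = CUDA.rand(10^6)

# A single GPU fold that returns a tuple of (sum, largest element):
total, largest = Folds.reduce(TeeRF(+, max), xs, CUDAEx())
```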
More examples
For more examples, see the examples section in the documentation.
FoldsCUDA.CUDAEx — Type

CUDAEx()

A fold executor implemented using CUDA.jl.

For more information about executors, see Transducers.jl's glossary section and FLoops.jl's API section.
Examples
```julia
julia> using FoldsCUDA, Folds

julia> Folds.sum(1:10, CUDAEx())
55
```
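Other Folds.jl functions accept the executor as the last argument in the same way. A hedged sketch, assuming a CUDA-capable GPU (the inputs are illustrative):

```julia
using FoldsCUDA, Folds, CUDA

xs = CUDA.rand(10^6)

Folds.maximum(xs, CUDAEx())          # largest element
Folds.sum(x -> x * x, xs, CUDAEx())  # sum of squares via a mapping function
```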
FoldsCUDA.CoalescedCUDAEx — Type

CoalescedCUDAEx()

A fold executor implemented using CUDA.jl. It uses coalesced memory access while supporting non-commutative loops. It can be faster but is more limited than `CUDAEx`.
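A minimal usage sketch, assuming a CUDA-capable GPU and that `CoalescedCUDAEx` is available from FoldsCUDA as documented above (the input is illustrative):

```julia
using FoldsCUDA, Folds, CUDA

xs = CUDA.rand(10^6)

# CoalescedCUDAEx is used as a drop-in replacement for CUDAEx in fold calls:
Folds.sum(xs, CoalescedCUDAEx())
```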