FoldsCUDA.jl

FoldsCUDA.FoldsCUDA — Module

FoldsCUDA

FoldsCUDA.jl provides a Transducers.jl-compatible fold (reduce) implemented using CUDA.jl. This brings the transducers and reducing function combinators implemented in Transducers.jl to the GPU. Furthermore, using FLoops.jl, you can write parallel for loops that run on the GPU.

API

FoldsCUDA exports CUDAEx, a parallel loop executor. It can be used with parallel for loops created with FLoops.@floop, the Base-like high-level parallel API in Folds.jl, and the extensible transducers provided by Transducers.jl.

Examples

findmax using FLoops.jl

You can pass the CUDA executor FoldsCUDA.CUDAEx() to @floop to run a parallel for loop on the GPU:

julia> using FoldsCUDA, CUDA, FLoops

julia> using GPUArrays: @allowscalar

julia> xs = CUDA.rand(10^8);

julia> @allowscalar xs[100] = 2;

julia> @allowscalar xs[200] = 2;

julia> @floop CUDAEx() for (x, i) in zip(xs, eachindex(xs))
           @reduce() do (imax = -1; i), (xmax = -Inf32; x)
               if xmax < x
                   xmax = x
                   imax = i
               end
           end
       end

julia> xmax
2.0f0

julia> imax  # the *first* position for the largest value
100
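Because the executor is just an argument to @floop, the same loop body can be checked on the CPU before moving it to the GPU. The following is a minimal sketch under that assumption: `findmax_floop` is a hypothetical helper name, not part of FoldsCUDA.jl, and `SequentialEx` and `ThreadedEx` are the CPU executors exported by FLoops.jl.

```julia
using FLoops

# Small CPU array with the maximum value appearing twice (positions 2 and 4).
xs = Float32[0.1, 2.0, 0.5, 2.0, 0.3]

# Hypothetical helper: the loop from the example above, with the executor
# passed in as a parameter instead of being hard-coded to CUDAEx().
function findmax_floop(xs, ex)
    @floop ex for (x, i) in zip(xs, eachindex(xs))
        @reduce() do (imax = -1; i), (xmax = -Inf32; x)
            if xmax < x   # strict comparison: ties keep the earlier index
                xmax = x
                imax = i
            end
        end
    end
    return (xmax, imax)
end

println(findmax_floop(xs, SequentialEx()))
println(findmax_floop(xs, ThreadedEx()))
```

Both calls should agree with each other, which gives a quick sanity check of the reduction logic without requiring a GPU.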

extrema using Transducers.TeeRF

julia> using Transducers, Folds

julia> @allowscalar xs[300] = -0.5;

julia> Folds.reduce(TeeRF(min, max), xs, CUDAEx())
(-0.5f0, 2.0f0)

julia> Folds.reduce(TeeRF(min, max), (2x for x in xs), CUDAEx())  # iterator comprehension works
(-1.0f0, 4.0f0)

julia> Folds.reduce(TeeRF(min, max), Map(x -> 2x)(xs), CUDAEx())  # equivalent, using a transducer
(-1.0f0, 4.0f0)
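To see what TeeRF contributes independently of the GPU, the same reductions can be run on a small CPU array. This is an illustrative sketch only: the input values are made up, and it relies on the default executor of Folds.reduce and on Transducers.SequentialEx rather than on CUDAEx.

```julia
using Transducers, Folds

# Made-up CPU data, used only for illustration.
xs = Float32[0.25, 2.0, -0.5, 0.75]

# TeeRF(min, max) feeds every element to both reducing functions and
# returns their results as a tuple, so one pass yields (minimum, maximum).
lo_hi = Folds.reduce(TeeRF(min, max), xs)

# The same fold over a transformed input, using a transducer and an
# explicit sequential executor.
lo_hi_doubled = Folds.reduce(TeeRF(min, max), Map(x -> 2x)(xs), SequentialEx())

println(lo_hi)
println(lo_hi_doubled)
```

Passing CUDAEx() as the last argument, as in the examples above, changes where the fold runs, not what it computes.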

More examples

For more examples, see the examples section in the documentation.


FoldsCUDA.CUDAEx — Type

CUDAEx()

A fold executor implemented using CUDA.jl.

For more information about executors, see Transducers.jl's glossary section and FLoops.jl's API section.

Examples

julia> using FoldsCUDA, Folds

julia> Folds.sum(1:10, CUDAEx())
55
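A script that should also run on machines without a usable GPU can choose the executor at run time. The sketch below assumes that CUDA.functional() is an acceptable availability check and that falling back to Transducers.ThreadedEx is what you want; neither choice is prescribed by FoldsCUDA.jl. It also assumes that the mapping form Folds.sum(f, xs, executor) is supported by CUDAEx in the same way as the plain sum shown above.

```julia
using CUDA, FoldsCUDA, Folds
using Transducers: ThreadedEx

# Assumption: use the GPU executor only when CUDA reports a working setup,
# otherwise fall back to a multi-threaded CPU executor.
ex = CUDA.functional() ? CUDAEx() : ThreadedEx()

total = Folds.sum(1:10, ex)                 # same call as above, executor chosen at run time
squares = Folds.sum(x -> x^2, 1:10, ex)     # sum of squares via the mapping form

println((total, squares))
```

Keeping the executor in a variable leaves the calling code identical on both code paths.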


FoldsCUDA.CoalescedCUDAEx — Type

CoalescedCUDAEx()

A fold executor implemented using CUDA.jl. It uses coalesced memory access while supporting non-commutative loops. It can be faster than CUDAEx but is more limited.
