How to write X in parallel.

using FLoops

In-place mutation

Mutable containers can be allocated in the init expressions (zeros(3) in the example below):

@floop for x in 1:10
    xs = [x, 2x, 3x]
    @reduce() do (ys = zeros(3); xs)
        ys .+= xs
    end
end
ys
3-element Vector{Float64}:
  55.0
 110.0
 165.0

Mutating objects allocated in the init expressions is not a data race because each base case "owns" such mutable objects. However, it is incorrect to mutate objects created outside init expressions.

See also: What is the difference of @reduce and @init to the approach using state[threadid()]?

Note

Technically, it is correct to mutate objects in the loop body if the objects are protected by a lock. However, this means that the code block protected by the lock can only be executed by a single task at a time. For efficient data-parallel loops, it is highly recommended to use a non-thread-safe data collection (i.e., no lock) and to construct a @reduce block that efficiently merges two mutable objects.
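For illustration, a lock-based variant of the first example might look like the following sketch (the names acc and lck are hypothetical, not part of the examples on this page). It is correct, but every task must acquire the lock before mutating acc, so the accumulation runs serially; the @reduce version above instead lets each base case accumulate privately and merge results pairwise.

```julia
using FLoops
using Base.Threads: SpinLock

acc = zeros(3)       # shared accumulator, protected by the lock below
lck = SpinLock()
@floop for x in 1:10
    xs = [x, 2x, 3x]
    lock(lck) do
        acc .+= xs   # only one task at a time can execute this block
    end
end
acc
```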

INCORRECT EXAMPLE

This example has a data race because the array ys0 is shared across all base cases and mutated in parallel.

ys0 = zeros(3)
@floop for x in 1:10
    xs = [x, 2x, 3x]
    @reduce() do (ys = ys0; xs)
        ys .+= xs
    end
end

Data race-free reuse of mutable objects using private variables

To avoid an allocation in each iteration, it is useful to pre-allocate mutable objects and reuse them. We can use the @init macro to do this in a data-race-free ("thread-safe") manner:

@floop for x in 1:10
    @init xs = Vector{typeof(x)}(undef, 3)
    xs .= (x, 2x, 3x)
    @reduce() do (ys = zeros(3); xs)
        ys .+= xs
    end
end
ys
3-element Vector{Float64}:
  55.0
 110.0
 165.0

See also: What is the difference of @reduce and @init to the approach using state[threadid()]?

Efficient and reproducible usage patterns of random number generators

Julia's default random number generator (RNG) is data-race-free when invoked from multiple threads; i.e., calls like randn() have well-defined behavior. However, for performance and reproducibility, it is often useful to create RNGs directly. A convenient approach is to use a private variable:

using Random

MersenneTwister()  # the first invocation of `MersenneTwister` is not data race-free

@floop for _ in 1:10
    @init rng = MersenneTwister()
    @reduce(s += rand(rng))
end

The above approach may work well for exploratory purposes. However, it has the problem that the computation is not reproducible and that each invocation of MersenneTwister requires I/O (reading /dev/urandom). These problems can be solved, for example, by using the Future.randjump function. First, let us construct ntasks RNGs to be used:

using Future

ntasks = Threads.nthreads()  # the number of base cases
rngs = [MersenneTwister(123456789)]
let rng = rngs[end]
    for _ in 2:ntasks
        rng = Future.randjump(rng, big(10)^20)
        push!(rngs, rng)
    end
end

This list of RNGs can be used with some input array by manually partitioning the input into ntasks chunks:

xs = 1:10  # input
chunks = Iterators.partition(xs, cld(length(xs), length(rngs)))
@floop ThreadedEx(basesize = 1) for (rng, chnk) in zip(rngs, chunks)
    y = 0
    for _ in chnk
        y += rand(rng)
    end
    @reduce(s += y)
end

Note that the above pattern can also be used with the @threads for loop.
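As a sketch of that variant, the same chunk-per-RNG pattern can be written with Base's Threads.@threads by storing one partial sum per chunk (the array name partials is hypothetical; rngs, xs, and chunks are constructed as above):

```julia
using Random, Future
using Base.Threads

ntasks = Threads.nthreads()
rngs = [MersenneTwister(123456789)]
for _ in 2:ntasks
    push!(rngs, Future.randjump(rngs[end], big(10)^20))
end

xs = 1:10  # input
chunks = collect(Iterators.partition(xs, cld(length(xs), length(rngs))))

# One partial sum per chunk; each task uses its own RNG, so there is no
# data race on the RNG state or on the accumulators.
partials = zeros(length(chunks))
Threads.@threads for i in eachindex(chunks)
    y = 0.0
    for _ in chunks[i]
        y += rand(rngs[i])
    end
    partials[i] = y
end
s = sum(partials)
```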

Another approach is to use a counter-based RNG as illustrated in Monte-Carlo π · FoldsCUDA. This approach works both on CPU and GPU.
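On the CPU side, a minimal sketch of the counter-based approach might use the Random123.jl package (the Philox2x constructor and set_counter! below are that package's API, not part of FLoops; this is an assumption modeled on the FoldsCUDA example, not code from this page). Because the RNG state is derived from the loop index via the counter, the result is reproducible regardless of how iterations are scheduled:

```julia
using Random123  # counter-based RNGs (Philox, Threefry)
using FLoops

@floop for i in 1:10
    @init rng = Philox2x(0)  # cheap to construct; no /dev/urandom I/O
    set_counter!(rng, i)     # jump to the i-th counter: reproducible per index
    @reduce(s += rand(rng))
end
```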


This page was generated using Literate.jl.