Moved to: How to avoid boxing?
It is important to understand that state[threadid()] += f(x) may contain a concurrency bug. If f can yield to the scheduler (e.g., because it performs I/O such as @debug), this code may not work as you expect, even in a single-threaded Julia instance and/or in pre-1.3 Julia. This is because the above code is equivalent to
    i = threadid()
    a = state[i]
    b = f(x)
    c = a + b
    state[i] = c
If f can yield to the scheduler, and if there are other tasks with the same threadid that can mutate state, the value stored at state[threadid()] may no longer be equal to a by the time the last line is executed. Furthermore, if a future version of Julia supports migration of a Task across OS threads, the above scenario can happen even if f never yields to the scheduler. Therefore, reduction or private state handling using threadid is strongly discouraged.
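The lost update described above can be reproduced without any real I/O. The following is a minimal sketch, not code from FLoops.jl: f and racy_sum are hypothetical names, yield() stands in for an I/O call, and @async is used because it keeps every task on the thread that created it, so all tasks see the same threadid().

```julia
using Base.Threads: threadid, nthreads

# Hypothetical stand-in for an `f` that yields to the scheduler (as I/O would).
function f(x)
    yield()
    return x
end

function racy_sum(xs)
    # Assumes the calling task runs on a thread whose id is <= nthreads().
    state = zeros(Int, nthreads())
    @sync for x in xs
        # Reads state[i], yields inside f, then writes back a stale sum.
        @async state[threadid()] += f(x)
    end
    return sum(state)
end

# Updates are lost, so the result is smaller than sum(1:100) == 5050.
println(racy_sum(1:100), " vs ", sum(1:100))
```

Every task reads the old value of state[i] before any other task has written its result, so most additions are overwritten.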
This caveat does not apply to @init used in FLoops.jl and, in general, to the reduction mechanism used by JuliaFolds packages. Furthermore, since @init does not depend on a particular execution mechanism (i.e., threading), @floop can generate code that can be executed efficiently on distributed and GPU executors.
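For comparison, the same kind of reduction can be written with @reduce and @init. This is a sketch under the assumption that FLoops.jl is installed; f, floop_sum, and buf are hypothetical names chosen for illustration, and the scratch buffer exists only to show where @init fits.

```julia
using FLoops

# Hypothetical `f` that yields to the scheduler, as I/O would.
function f(x)
    yield()
    return x
end

function floop_sum(xs)
    @floop for x in xs
        # Private scratch buffer; it is not indexed by threadid() and is
        # not shared between concurrently running tasks.
        @init buf = zeros(Int, 1)
        buf[1] = f(x)
        y = buf[1]
        # The accumulator `s` is managed by FLoops, not by user code.
        @reduce(s += y)
    end
    return s
end

println(floop_sum(1:100) == sum(1:100))
```

Because neither the accumulator nor the buffer is looked up through threadid(), it does not matter whether f yields or on which OS thread a task runs.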
The problem discussed above can also be worked around by, e.g., using Threads.@threads for (since, as of Julia 1.6, it spawns exactly nthreads() tasks and ensures that each task is scheduled on its own OS thread) and making sure that state is not shared across multiple loops.
It depends on the exact executor used. For example, a parallel loop can be executed in a single thread by using the SequentialEx executor. (Thus, a "parallel loop" should really be called a parallelizable loop, but that is a mouthful, so we use the phrase "parallel loop".) Furthermore, the default executor is determined by the type of the input collection; e.g., if FoldsCUDA.jl is loaded, reductions on CuArray are executed on the GPU with the executor that FoldsCUDA.jl provides.
But, by default (i.e., if no special executor is registered for the input collection type), parallel loops are run with the ThreadedEx executor. How this executor works is an implementation detail. However, as of writing (Transducers.jl 0.4.60), this executor takes a divide-and-conquer approach. That is to say, it first recursively halves the input collection until each part (base case) is smaller than or equal to basesize. Each base case is then executed in a single Task. The results of the base cases are then combined pairwise in distinct Tasks (re-using the ones created for reducing the base cases). Compared to the sequential scheduling approach taken by Threads.@threads for (as of Julia 1.6), this approach has the advantage of exposing more parallelism.
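The divide-and-conquer idea can be illustrated with plain Base Julia. The sketch below is not the Transducers.jl implementation; dac_reduce is a hypothetical name, and the code assumes a non-empty, indexable input. It halves the input until a part has at most basesize elements, reduces each base case sequentially, and combines the two halves pairwise.

```julia
# Hypothetical helper; a simplified illustration, not the ThreadedEx code.
function dac_reduce(op, f, xs::AbstractVector; basesize::Integer = 4)
    basesize >= 1 || throw(ArgumentError("basesize must be at least 1"))
    if length(xs) <= basesize
        # Base case: reduce sequentially inside the current task.
        return mapreduce(f, op, xs)
    end
    mid = (firstindex(xs) + lastindex(xs)) ÷ 2
    left = @view xs[begin:mid]
    right = @view xs[mid+1:end]
    # Reduce the right half in a new task while this task handles the left.
    task = Threads.@spawn dac_reduce(op, f, right; basesize = basesize)
    a = dac_reduce(op, f, left; basesize = basesize)
    b = fetch(task)
    # Combine left before right, so `op` only needs to be associative.
    return op(a, b)
end

println(dac_reduce(+, x -> x^2, 1:100; basesize = 8) == sum(x -> x^2, 1:100))
```

A smaller basesize creates more tasks and therefore more opportunities for parallelism, at the cost of more scheduling overhead.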
If the scheduling by ThreadedEx does not yield the desired behavior, you can use FoldsThreads.jl, which provides different executors with different performance characteristics.
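The executor can also be chosen explicitly by passing it to @floop. This is a sketch assuming FLoops.jl is installed; total is a hypothetical name. Executors from FoldsThreads.jl are expected to be passed in the same position, but that is an assumption to check against that package's documentation.

```julia
using FLoops

# Hypothetical helper that runs the same loop body with a given executor.
function total(ex, xs)
    @floop ex for x in xs
        @reduce(s += x)
    end
    return s
end

# Same loop, executed sequentially and with threads.
seq = total(SequentialEx(), 1:1000)
par = total(ThreadedEx(basesize = 100), 1:1000)
println(seq == par)
```

Writing the loop body once and swapping the executor makes it easy to compare scheduling strategies without touching the reduction itself.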