## How to avoid boxing?

FLoops.jl may complain about HasBoxedVariableError. For a quick prototyping, calling FLoops.assistant(false) to disable the error/warning may be useful. However, it is also easy to avoid this problem if you understand a few patterns as explained below:

### "Leaked variables" ⇒ use local

HasBoxedVariableError can occur when "leaked" variables are causing data races. Consider the following example:

function leaked_variable_example(xs)
a = xs[begin]
b = xs[end]
c = (a + b) / 2
@floop for x in xs
a = max(x, c)
@reduce s += a
end
return s
end

Calling this function causes HasBoxedVariableError:

julia> FLoops.assistant(:error);

julia> leaked_variable_example(1:2)
ERROR: HasBoxedVariableError: Closure ... has 1 boxed variable: a

This occurs because the variable a is assigned before @floop by a = xs[begin] and inside @floop by a = max(x, c). However, since the assignment inside @floop can occur in parallel, this is a data race. We can avoid this issue easily by making a local inside the loop:

function leaked_variable_example_fix1(xs)
a = xs[begin]
b = xs[end]
c = (a + b) / 2
@floop for x in xs
local a = max(x, c)   # note local
@reduce s += a
end
return s
end

Alternatively, we can use a different name:

function leaked_variable_example_fix2(xs)
a = xs[begin]
b = xs[end]
c = (a + b) / 2
@floop for x in xs
d = max(x, c)   # not a
@reduce s += d
end
return s
end

or limit the scope of a used for calculating c:

function leaked_variable_example_fix3(xs)
c = let a = xs[begin], b = xs[end]  # limit the scope of a
(a + b) / 2
end
@floop for x in xs
a = max(x, c)
@reduce s += a
end
return s
end

### "Uncertain values" ⇒ use let

Note

This is a known limitation as of Julia 1.8-DEV. This documentation may not be accurate in the future version of Julia. For more information, see: performance of captured variables in closures · Issue #15276 · JuliaLang/julia

HasBoxedVariableError can also occur when Julia is uncertain about the value of some variables in the scope surrounding @floop (i.e., there are multiple binding locations). For example:

function uncertain_value_example(xs; flag = false)
if flag
a = 0
else
a = 1
end
@floop for x in xs
x += a
@reduce s += x
end
return s
end

This can be fixed by using let to ensure that the variable a does not change while executing @floop. (Julia can be sure that the variable is not updated when it is assigned only once.)

function uncertain_value_example_fix(xs; flag = false)
if flag
a = 0
else
a = 1
end
let a = a  # "quench" the value of a
@floop for x in xs
x += a
@reduce s += x
end
return s
end
end

### "Not really a data race" ⇒ use Ref

It is conceptually sound to assign to the variables in an outer scope of @floop if it is ensured that the race does not occur. For example, in the following example, found = i can be executed at most once since the elements of xs are all distinct:

function non_data_race_example(xs::UnitRange; ex = nothing)
local found = nothing
@floop ex for (i, x) in pairs(xs)
if x == 1
found = i
end
end
return found
end

However, FLoops.jl also complains about HasBoxedVariableError when executing this function. We can fix this by using Ref:

function non_data_race_example_fix(xs::UnitRange; ex = nothing)
found = Ref{Union{Int,Nothing}}(nothing)
@floop ex for (i, x) in pairs(xs)
if x == 1
found[] = i
end
end
return found[]
end

## What is the difference of @reduce and @init to the approach using state[threadid()]?

It is important to understand that state[threadid()] += f(x) may contain a concurrency bug. If f can yield to the scheduler (e.g., containing an I/O such as println and @debug), this code may not work as you expect, even in a single-threaded julia instance and/or pre-1.3 Julia. This is because the above code is equivalent to

i = threadid()
a = state[i]
b = f(x)
c = a + b
state[i] = c

If f can yield to the scheduler, and if there are other tasks with the same threadid that can mutate state, the value stored at state[threadid()] may not be equal to a by the time the last line is executed.

Furthermore, if julia supports migration of Task across OS threads at some future version, the above scenario can happen even if f never yields to the scheduler. Therefore, reduction or private state handling using threadid is very discouraged.

This caveat does not apply to @reduce and @init used in FLoops.jl and in general to the reduction mechanism used by JuliaFolds packages. Furthermore, since @reduce and @init do not depend on a particular execution mechanism (i.e., threading), @floop can generate the code that can be efficiently executed in distributed and GPU executors.

Note

The problem discussed above can also be worked around by, e.g., using Threads.@threads for (since it spawns exactly nthreads() tasks and ensures that each task is scheduled on each OS thread, as of Julia 1.6) and making sure that state is not shared across multiple loops.

## How is a parallel @floop executed? What is the scheduling strategy?

It depends on the exact executor used. For example, a parallel loop can be executed in a single thread by using SequentialEx executor. (Thus, a "parallel loop" should really be called a parallelizable loop. But it is mouthful so we use the phrase "parallel loop".) Furthermore, the default executor is determined by the input collection types; e.g., if FoldsCUDA.jl is loaded, reductions on CuArray are executed on GPU with CUDAEx executor.

But, by default (i.e., if no special executor is registered for the input collection type), parallel loops are run with ThreadedEx executor. How this executor works is an implementation detail. However, as of writing (Transducers.jl 0.4.60), this executor takes a divide-and-conquer approach. That is to say, it first recursively halves the input collection until the each part (base case) is smaller or equal to basesize. Each base case is then executed in a single Task. The results of base cases are then combined pair-wise in distinct Tasks (re-using the ones created for reducing the base case). Compared to the sequential scheduling approach taken by Threads.@threads for (as of Julia 1.6), this approach has an advantage that it exhibits a greater parallelism.

If the scheduling by ThreadedEx does not yield a desired behavior, you can use FoldsThreads.jl for different executors with different performance characteristics.