# FoldsThreads.jl

`FoldsThreads.FoldsThreads` — Module

**FoldsThreads: Extra threaded executors for JuliaFolds/*.jl**

FoldsThreads.jl provides extra thread-based executors usable with various JuliaFolds/*.jl packages such as Transducers.jl and FLoops.jl.

```
Executors
,----------------------.
Algorithms | FoldsThreads.jl | Data structures
,------------------. |-----------------------| ,-----------------.
| FLoops, | | ThreadedEx* | | Array, |
| Folds, | | WorkStealingEx, | | Tables, |
| Transducers, | --- | DepthFirstEx, | --- | FGenerators, |
| OnlineStats, | | TaskPoolEx, | | Dict, |
| DataTools, ... | | NondeterministicEx, | | Set, ... |
`------------------' | ... | `-----------------'
`-----------------------'
```

(* `ThreadedEx` is the default executor provided by Transducers.jl.)

- `WorkStealingEx`: implements work stealing (continuation stealing). Useful for load-balancing.
- `DepthFirstEx`: implements depth-first scheduling. Useful for `findfirst`-type computations.
- `TaskPoolEx`: a task pool executor. Useful for fine execution control (e.g., back pressure and "background" threads).
- `NondeterministicEx`: an executor for parallelizing computations with non-parallelizable iterators.

`FoldsThreads.DepthFirstEx` — Type

`DepthFirstEx(; [simd,] [basesize])`

Depth-first scheduling for parallel execution. Useful for `findfirst`-type computations.

**Examples**

```
julia> using FoldsThreads, Folds

julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, DepthFirstEx())
4642844
```

**Extended help**

`DepthFirstEx` schedules chunks of size roughly equal to `basesize` in the order that each chunk appears in the input collection. However, the base case computation does not wait for all the tasks to be scheduled. This approach performs better than a more naive approach in which all tasks are scheduled at once before the reduction starts. `DepthFirstEx` is useful for reductions that can terminate early (e.g., `findfirst`, `@floop` with `break`).
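To make the early-termination use case concrete, here is a hedged sketch. It assumes `Folds.findfirst` accepts an executor as a trailing argument, following the same pattern as the `Folds.sum` calls in this document:

```julia
using FoldsThreads
using Folds

# Search for the first multiple of 1009 in 1:1_000_000. Depth-first
# scheduling prioritizes chunks near the start of the collection, so
# the reduction can stop soon after an early chunk finds a match.
idx = Folds.findfirst(x -> x % 1009 == 0, 1:1_000_000, DepthFirstEx())
```

For a range `1:n`, the returned index equals the matching value, so `idx` is `1009` here.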

**More examples**

```
julia> using FoldsThreads, FLoops

julia> @floop DepthFirstEx() for x in 1:1000_000
           y = gcd(x, 42)
           @reduce(acc += y)
       end
       acc
4642844
```

**Keyword Arguments**

- `basesize`: The size of base case.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If `true`/`:ivdep`, the inner-most loop of each base case is annotated by `@simd`/`@simd ivdep`. Use a plain loop if `false` (default).

`FoldsThreads.NonThreadedEx` — Type

`NonThreadedEx(; [simd,] [basesize])`

Single-threaded executor with a call tree similar to that of the divide-and-conquer threaded executors. Useful for debugging, testing, and benchmarking.
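The other executor docstrings each include an example, so a matching sketch for `NonThreadedEx` (running the same reduction as the `DepthFirstEx` example) might look like:

```julia
using FoldsThreads
using Folds

# Same divide-and-conquer call tree as the threaded executors, but run
# on a single thread; handy for debugging and as a serial baseline.
result = Folds.sum(i -> gcd(i, 42), 1:1000_000, NonThreadedEx())
```

Because the call tree matches the threaded executors, the result (`4642844` here) is directly comparable when benchmarking against, say, `DepthFirstEx()`.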

`FoldsThreads.NondeterministicEx` — Type

`NondeterministicEx(; [simd,] [basesize,] [ntasks])`

Pipelined batched reduction for parallelizing computations with non-parallelizable collections (e.g., `Channel`, `Iterators.Stateful`).

This is a simple wrapper of the `NondeterministicThreading` transducer. Use `NondeterministicThreading` directly for explicit control over which transducers are parallelized.

**Examples**

```
julia> using FoldsThreads, Folds

julia> partially_parallelizable(seq) = (gcd(y, 42) for x in seq for y in 1:10000x);

julia> Folds.sum(partially_parallelizable(Iterators.Stateful(1:100)), NondeterministicEx())
234462500
```

**Extended help**

In the above example, we can run `gcd(y, 42)` (mapping), `for y in 1:10000x` (flattening), and `+` for `sum` (reduction) in parallel even though the iteration of `Iterators.Stateful(1:100)` is not parallelizable. Note that, as indicated in the example, the computation per iteration of the non-parallelizable collection should be very CPU-intensive for `NondeterministicEx` to show any performance benefit.

Same computation using FLoops.jl:

```
julia> using FoldsThreads, FLoops

julia> @floop NondeterministicEx() for x in Iterators.Stateful(1:100)
           for y in 1:10000x
               z = gcd(y, 42)
               @reduce(acc += z)
           end
       end
       acc
234462500
```

**Notes**

"Nondeterministic" in the name indicates that the result of the reduction is not deterministic (i.e., schedule-dependent) *if* the reducing function is only approximately associative (e.g., `+` on floats). For computations (e.g., `Folds.collect`) that use strictly associative operations (e.g., `vcat`), the result does not depend on the particular scheduling decisions of the Julia runtime. More specifically, this executor uses a scheduling strategy that does not produce a deterministic divide-and-conquer "task" graph; instead, the shape of the graph is determined by the timing of each computation at run time.
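As another illustration of a non-parallelizable iterator, the sketch below feeds the same workload through a `Channel`. This assumes `Folds.sum` accepts a `Channel` as its input collection, which the mention of `Channel` above suggests but this sketch does not otherwise verify:

```julia
using FoldsThreads
using Folds

# Fill a buffered Channel with 1:100 and close it. Iterating a Channel
# is inherently sequential, but NondeterministicEx still parallelizes
# the per-element work.
ch = Channel{Int}(100)
foreach(x -> put!(ch, x), 1:100)
close(ch)

result = Folds.sum(x -> sum(gcd(y, 42) for y in 1:10000x), ch, NondeterministicEx())
```

This computes the same sum as the `Iterators.Stateful` example, so `result` is `234462500`.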

**Keyword Arguments**

- `basesize`: The size of base case.
- `ntasks`: The number of tasks used by this executor.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If `true`/`:ivdep`, the inner-most loop of each base case is annotated by `@simd`/`@simd ivdep`. Use a plain loop if `false` (default).

`FoldsThreads.TaskPoolEx` — Type

`TaskPoolEx(; [simd,] [basesize,] [ntasks,] [background])`

Executor using pooled tasks for reduction. Useful for reductions with I/O and for managing back pressure. With `background = true`, it can also be used to isolate throughput-oriented tasks and reserve the primary thread for latency-oriented tasks.

**Examples**

```
julia> using FoldsThreads, Folds

julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, TaskPoolEx())
4642844
```

**Extended help**

Worker tasks are pooled (for each executor) so that the number of Julia `Task`s used for a reduction can be much smaller than `input_length ÷ basesize`. This strategy is used mainly for limiting the resources (e.g., memory) required by the reduction, rather than for load-balancing; `WorkStealingEx` performs better for load-balancing compute-intensive reductions.
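A hedged sketch of the resource-limiting use described above, built from the keyword arguments listed below (`ntasks`, `background`). Note that `background = true` requires more than one Julia thread, since tasks then avoid thread 1:

```julia
using FoldsThreads
using Folds

# Cap the reduction at 2 pooled tasks, regardless of input_length ÷ basesize.
# Only enable background mode when extra threads are available; otherwise
# no thread would be allowed to run the worker tasks.
ex = TaskPoolEx(ntasks = 2, background = Threads.nthreads() > 1)
result = Folds.sum(i -> gcd(i, 42), 1:1000_000, ex)
```

The choice of `ntasks = 2` is arbitrary here; in practice it bounds the peak number of concurrently live base cases, and hence the memory the reduction holds at once.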

**NOTE:** This executor is inspired by ThreadPools.jl. The hack for assigning a task to a dedicated thread is stolen from ThreadPools.jl.

**It is highly discouraged to use this executor in Julia packages**, especially those that are used as libraries rather than end-user applications. This is because the whole purpose of this executor is to *prevent* the Julia runtime from doing the right thing in managing tasks. Ideally, the library user should be able to pass an executor as an argument so that your library function can be used with any executor, including `TaskPoolEx`.

**More examples**

```
julia> using FoldsThreads, FLoops

julia> @floop TaskPoolEx() for x in 1:1000_000
           y = gcd(x, 42)
           @reduce(acc += y)
       end
       acc
4642844
```

**Keyword Arguments**

- `background = false`: If `background == true`, do not run tasks on `threadid() == 1`.
- `ntasks`: The number of tasks to be used.
- `basesize`: The size of base case.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If `true`/`:ivdep`, the inner-most loop of each base case is annotated by `@simd`/`@simd ivdep`. Use a plain loop if `false` (default).

`FoldsThreads.WorkStealingEx` — Type

`WorkStealingEx(; [simd,] [basesize])`

Work-stealing scheduling for parallel execution. Useful for load-balancing.

**Examples**

```
julia> using FoldsThreads, Folds

julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, WorkStealingEx())
4642844
```

**Extended help**

`WorkStealingEx` implements a work-stealing scheduler (in particular, continuation stealing) for Transducers.jl and other JuliaFolds/*.jl packages. Worker tasks are cached and re-used so that the number of Julia `Task`s used for a reduction can be much smaller than `input_length ÷ basesize`. This benefits computations that require load-balancing, since it avoids the overhead of spawning tasks.
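To illustrate the load-balancing claim, here is a sketch with an intentionally imbalanced workload; the `basesize = 100` setting is an arbitrary choice for illustration:

```julia
using FoldsThreads
using Folds

# The cost of f(i) grows linearly with i, so later chunks are far more
# expensive than earlier ones. With static chunking, workers that finish
# early sit idle; with work stealing, they take over pending chunks.
f(i) = sum(gcd(j, 42) for j in 1:i)
result = Folds.sum(f, 1:10_000, WorkStealingEx(basesize = 100))
```

Since `+` on integers is strictly associative, `result` matches the sequential `sum(f, 1:10_000)` regardless of how the work is stolen.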

**NOTE:** `WorkStealingEx` is more complex and experimental than the default multi-threaded executor `ThreadedEx`.

**More examples**

```
julia> using FoldsThreads, FLoops

julia> @floop WorkStealingEx() for x in 1:1000_000
           y = gcd(x, 42)
           @reduce(acc += y)
       end
       acc
4642844
```

**Keyword Arguments**

- `basesize`: The size of base case.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If `true`/`:ivdep`, the inner-most loop of each base case is annotated by `@simd`/`@simd ivdep`. Use a plain loop if `false` (default).