Data-parallel Julia

How to find data-parallel algorithms

💡 Note
Work in progress

[Alternative intro (maybe this is better)]

When there is a library function that provides an algorithm for what you want to compute, it is usually a good idea to use it. Although this is true for parallel programming as well, it may not be obvious what functions are provided by the libraries as it requires the knowledge of what can be parallelized. This how-to guide tries to provide a simplified overview for finding data-parallel library functions.


Julia provides @spawn as a basic building block for parallelism and FLoops.jl provides generic parallel for loop syntax. However, they are often too "low-level" as an API for building parallel programs. To make writing parallel programs simpler, Folds.jl and ThreadsX.jl provide basic "high-level" algorithms similar to Julia's Base library as described below. High-level APIs can be used for expressing what you want to compute which aids not only understandability and maintainability of the program but also the efficiency of the program because the library implementers can use adequate strategy for how to compute the result.

If you are familiar with the functions listed in Folds.jl and ThreadsX.jl, no further explanations are required for start using them. However, since not all Base functions are parallelizable, it may be hard to remember which functions are supported by these libraries. This how-to guide tries to provide a simple "guide map" for navigating the functions provided by Folds.jl and ThreadsX.jl, as summarized in the following picture:

guide map

TODO: cumprod!(ys) -> cumprod!(ys, xs)

The above picture is only very superficially accurate and you may need to "unlearn" it to use these libraries in full extent. However, it is a good approximation for start using

  1. Independent computation (map-family)
  2. Simple reductions (reduce-family)
  3. Element-wise computation fused with simple reductions (mapreduce-family)
  4. Searching
  5. Scan (accumulate-family)
  6. Generating data structure (collect-family)
  7. Flexible preprocessing
  8. Folds.jl vs ThreadsX.jl: which one to use?

Independent computation (map-family)

Given an array xs with elements x1,x2,...,xnx_1, x_2, ..., x_n...

FunctionReturns
map(f, xs)[f(x1),f(x2),...,f(xn)][f(x_1), f(x_2), ..., f(x_n)]
map!(f, ys, xs)ditto, but stores the result in ys
foreach(f, xs)runs f(xi)f(x_i) in parallel for side-effect

TODO: explanations, examples, ...

Simple reductions (reduce-family)

FunctionReturns
reduce(⊗, xs)x1⊗x2⊗...⊗xnx_1 \otimes x_2 \otimes ... \otimes x_n
sum(xs)x1+x2+...+xnx_1 + x_2 + ... + x_n
prod(xs)x1∗x2∗...∗xnx_1 * x_2 * ... * x_n
all(xs)x1 & x2 & ... & xnx_1 \;\&\; x_2 \;\&\; ... \;\&\; x_n
any(xs)x1 ∣ x2 ∣ ... ∣ xnx_1 \;|\; x_2 \;|\; ... \;|\; x_n
count(xs)x1+x2+...+xnx_1 + x_2 + ... + x_n
maximum(xs)max(x1,x2,...,xn)\texttt{max}(x_1, x_2, ..., x_n)
minimum(xs)min(x1,x2,...,xn)\texttt{min}(x_1, x_2, ..., x_n)
extrema(xs)(minimum(xs), maximum(xs))
issorted(xs)(x1≤x2) & (x2≤x3) & ... & (xn−1≤xn)(x_1 \le x_2) \;\&\; (x_2 \le x_3) \;\&\; ... \;\&\; (x_{n-1} \le x_n)

Element-wise computation fused with simple reductions (mapreduce-family)

FunctionReturns
mapreduce(f, ⊗, xs)f(x1)⊗f(x2)⊗...⊗f(xn)f(x_1) \otimes f(x_2) \otimes ... \otimes f(x_n)
sum(f, xs)f(x1)+f(x2)+...+f(xn)f(x_1) + f(x_2) + ... + f(x_n)
prod(f, xs)f(x1)∗f(x2)∗...∗f(xn)f(x_1) * f(x_2) * ... * f(x_n)
all(f, xs)f(x1) & f(x2) & ... & f(xn)f(x_1) \;\&\; f(x_2) \;\&\; ... \;\&\; f(x_n)
any(f, xs)f(x1) ∣ f(x2) ∣ ... ∣ f(xn)f(x_1) \;|\; f(x_2) \;|\; ... \;|\; f(x_n)
count(f, xs)f(x1)+f(x2)+...+f(xn)f(x_1) + f(x_2) + ... + f(x_n)
maximum(f, xs)max(f(x1),f(x2),...,f(xn))\texttt{max}(f(x_1), f(x_2), ..., f(x_n))
minimum(f, xs)min(f(x1),f(x2),...,f(xn))\texttt{min}(f(x_1), f(x_2), ..., f(x_n))
extrema(f, xs)(minimum(xs), maximum(xs))
issorted(xs; by = f)(f(x1)≤f(x2)) & (f(x2)≤f(x3)) & ... & (f(xn−1)≤f(xn))(f(x_1) \le f(x_2)) \;\&\; (f(x_2) \le f(x_3)) \;\&\; ... \;\&\; (f(x_{n-1}) \le f(x_n))

Searching

FunctionReturns
findmax(f, xs)(x, index) s.t. xs[index] == x == maximum(f, xs)
findmin(f, xs)(x, index) s.t. xs[index] == x == minimum(f, xs)
argmax(f, xs)x s.t. f(x) == maximum(f, xs)
argmin(f, xs)x s.t. f(x) == minimum(f, xs)
findall(f, xs)indices such that f(xs[i]) holds iff i in indices
findfirst(f, xs)first i s.t. f(xs[i]) holds; nothing if not found
findlast(f, xs)last i s.t. f(xs[i]) holds; nothing if not found

Scan (accumulate-family)

FunctionReturns
accumulate(⊗, xs)[x1,x1⊗x2,x1⊗x2⊗x3,...][x_1, x_1 \otimes x_2, x_1 \otimes x_2 \otimes x_3, ...]
accumulate!(⊗, ys, xs)ditto, but stores the result in ys
scan!(⊗, xs)ditto, but stores the result in xs
cumsum(xs)[x1,x1+x2,x1+x2+x3,...][x_1, x_1 + x_2, x_1 + x_2 + x_3, ...]
cumsum!(ys, xs)ditto, but stores the result in ys
cumprod(xs)[x1,x1∗x2,x1∗x2∗x3,...][x_1, x_1 * x_2, x_1 * x_2 * x_3, ...]
cumprod!(ys, xs)ditto, but stores the result in ys

Generating data structure (collect-family)

In above sections, we pretended that xs is an array. However, there various collection of items in Julia. We can turn sufficiently well-behaving such collections into an array in parallel using Folds.collect. It is also possible to collect only unique elements using Folds.unique. The output data structure does not have to be an array. We can use Folds.set to create a set of elements. If there is no need to preserve the ordering of the input, it is more efficient than Folds.unique. Given a collection of Pairs, we can use Folds.dict to create a dictionary.

FunctionReturns
collect(xs)collect elements of xs into an array
unique(xs)collect unique elements of xs into an array, preserving the order
set(xs)collect elements of xs into a set
dict(xs)collect elements of xs into a dictionary

Flexible preprocessing

xs = (f(y) for x in collection if p(x) for y in g(x))
#     ----                     ------- -------------
#     mapping               filtering  flattening

See also: A quick introduction to data parallelism in Julia

Folds.jl vs ThreadsX.jl: which one to use?

ThreadsX.jl provides a similar interface like Folds.jl but it is specific to multi-core -based parallelism. Furthermore, it provides algorithms such as sort! that cannot be easily expressed as a simple invocation of reduce.

Folds.jl focuses on algorithms expressible in terms of a generalized reduce.

#show test results