Solutions to many parallel programming problems can be reduced to a welldesigned monoid^{[1]}. As such, interesting parallel solution sometimes require nontrivial monoids. To enrich the repertoire, let us discuss semidirect product; a particularity elegant construction of useful monoids.
[1]  Here, we mainly discuss monoid and monoid actions. However, since one can always turn a semigroup into a monoid by adjoining the identity, the following discussion also applies to semigroups which is more useful than monoid in dynamic languages like Julia. 
As a motivating example, let us compute the maximum depth of parentheses in a string. Each position of a string has an integer depth which is one more larger (smaller) than the depth of the previous position if the current character is the open parenthesis (
(close parenthesis )
). Otherwise the depth is equal to the depth of the previous position. The initial depth is 0. We are interested in the maximum depth of a given string:
string ( ( ( ) ) ( ) )
depth 1 2 3 2 1 2 1 0
β
maximum depth
The following code is a Julia program that calculates the maximum depth. It computes the depth of every position using the cumulative sum function cumsum
and then take the maximum
of it:
parentheses = collect("(((())((()))) ((())))")
increments = ifelse.(parentheses .== '(', +1, ifelse.(parentheses .== ')', 1, 0))
depths = cumsum(increments)
maximum(depths)
5
Both cumsum
(prefix sum) and maximum
(reduce) can be computed in parallel. However, it is possible to fuse these operations and compute the maximum depth in one sweep (which is essential for performance). To solve this problem with singlepath reduction, we need to find a small set of states "just enough" for combining subsolutions obtained by solving each chunk of the input. An important observation here is that the "humps" in the depths
plot can be easily shifted up and down when combining into the result from preceding (left) string of parentheses. This motivates us to track the maximum and the final (relative) depth of a chunk of the input string. Splitting the above solution into three chunks, we get the following three subsolutions:
Consider chunk 1 and 2. Combining the final depths (d1
and d2
) is as easy as summing these two depths. To combine the maximum in each chunk, we need to shift the right maximum m2
by the final depth of the left chunk d1
so that both maximum "candidates" m1
and m2
are compared with respect to the same reference point (i.e., the beginning of the left chunk). Thus, we get the following function for combining the solutions (d1, m1)
and (d2, m2)
from two consecutive chunks:
function β((d1, m1), (d2, m2))
d = d1 + d2 # total change in depth
m2β² = d1 + m2 # shifting the maximum of the right chunk before comparison
m = max(m1, m2β²) # find the maximum
return (d, m)
end
Given the subsolutions above
(d1, m1) = (3, 4)
(d2, m2) = (2, 2)
(d3, m3) = (1, 3)
we can compute the final result
(d1, m1) β (d2, m2) β (d3, m3)
(0, 5)
where the second element of the tuple is the maximum depth. As we will see in the next section, β
is associative. Thus, we can use it as the input to reduce
. Observing singleton chunk [x]
(where x
is 1, 0, or 1) has the trivial solution (x, x)
(i.e., the last element of [x]
is x
and the maximum element of [x]
is x
), we get the singlesweep reduction to calculate the maximum depth of the parentheses:
mapreduce(x > (x, x), β, increments)
(0, 5)
Recall that mapreduce
can also be computed using the leftfold mapfoldl
. By manually inlining the definitions of mapfoldl(x > (x, x), β, increments)
, we also obtain the following sequential algorithm:
function maxdepth_seq1(increments)
d1 = m1 = 0
for x in increments
d2 = m2 = x
# Inlining `β`:
d = d1 + d2
m2β² = d1 + m2
m = max(m1, m2β²)
d1 = d
m1 = m
end
return m1
end
Since d1 + d2
and d1 + m2
are equivalent when d2 = m2 = x
, we can do "manual Common Subexpression Elimination" to get:
function maxdepth_seq2(increments)
d1 = m1 = 0
for x in increments
d1 = d1 + x
m1 = max(m1, d1)
end
return m1
end
Note that this is the straightforward sequential solution to the original problem. Compilers such as Julia+LLVM may be able to do this "derivation" as a part of optimization in may programs. It is also possible to implement this with sufficiently rich "monoid combinator" frameworks such as Transducers.jl (using next
and combine
specializations).
The structure in β
of the previous section is actually very generic and applicable to many other reductions. If we replace +
with *β²
and max
with +β²
, we obtain the following higherorder function (combinator) sdpl
. Let us also define a similar "flipped" version sdpr
.
sdpl(*β², +β²) = ((aβ, bβ), (aβ, bβ)) > (aβ *β² aβ, bβ +β² (aβ *β² bβ))
sdpr(*β², +β²) = ((aβ, bβ), (aβ, bβ)) > (aβ *β² aβ, (bβ *β² aβ) +β² bβ)
A binary function (x, y) > sdpl(*β², +β²)(x, y)
is associative (i.e., semigroup) if *β²
and +β²
are both associative and *β²
is leftdistributive over +β²
; i.e.,
Similarly, sdpr(*β², +β²)
is associative if *β²
and +β²
are both associative and *β²
is rightdistributives over +β²
. See, e.g., Kogge and Stone (1973), Blelloch (1990), Gorlatch and Lengauer (1997), Chauhan et al. (2016), and Kmett (2018) for the discussion and applications on this algebraic fact (or its generalized form; see below). The monoid combinator sdpl
of this form in particular is described in Gorlatch and Lengauer (1997), including the proofs we skipped here. They described it as scanreduce composition and scanscan composition theorems. As we saw in the previous section, mapreduce(x > (x,
x), sdpl(+, max), increments)
computes maximum(cumsum(increments))
efficiently by composing (fusing) the scan (cumsum
) and reduce (maximum
).
Borrowing the terminology from group theory (see Semidirect product  Wikipedia), and also following Chauhan et al. (2016)'s nomenclature, let us call sdpl
and sdpr
the semidirect products although the definition above is not of its fully generalized form.
Other than the usual multiplicationaddition pair (*
, +
) and additionmax pair (+
, max
) as discussed in the previous section, there are various pairs of monoids satisfying left and/or rightdistributivity property (see, Distributive property  Wikipedia). For example, following functions can be used with sdpl
and sdpr
(on appropriate domain types):
*β²  +β²  Example applications 

*  +  linear recurrence (see below) 
+  max  maximum depth of parentheses (see above) 
+  min  
min  max  
max  min  
β©  βͺ  unique (see below); GEN/KILLsets 
βͺ  β© 
The unique
example (below) and GEN/KILLsets example in Chauhan et al. (2016) can be considered an instance of sdpl(β©, βͺ)
with the first monoid β©
virtually operating on the complement sets.
Semidirect product sdpr(*β², +β²)
is a generalized from of linear recurrence equation:
given sequences $(a_t)_{t=1}^{T}$ and $(b_t)_{t=1}^T$. Let us use the initial condition $x_0 = 0$ for simplicity (as specifying $x_0$ is equivalent to specifying $b_1$). Indeed, (1, 0)
is the identity element for the method sdpr(*, +)(_::Number, _::Number)
. To see sdpr(*, +)
computes $(x_t)_{t=1}^T$, let us manually expand foldl(sdpr(*, +), zip(as, bs); init =
(1, 0))
, as we did for maxdepth_seq1
(or, alternatively, see Blelloch (1990)):
function linear_recurrence_seq1(as, bs)
aβ = 1
bβ = 0
for (aβ, bβ) in zip(as, bs)
# Inlining `sdpr(*, +)`:
aβ = aβ * aβ
bβ = bβ * aβ + bβ
end
return (aβ, bβ)
end
By eliminating aβ
and renaming bβ
to x
, we have
function linear_recurrence_seq2(as, bs)
x = 0
for (a, b) in zip(as, bs)
x = x * a + b
end
return x
end
which computes the linear recurrence equation (linrec). (Note: the first component aβ * aβ
is still required for associativity of sdpr(*, +)
. It keeps track of the multiplier $\prod_{t=p}^q a_t$ for propagating the effect of $x_{p1}$ to $x_q$.)
The above code also work when the elements in as
and bs
are matrices. In particular, x
and b
can be "row vectors." Even though Julia uses 1Γn matrices for row vectors, this observation indicates that a
s and b
s can live in different spaces. Indeed, we will see that it is useful to "disassociate" the functions for aβ *β² aβ
and bβ *β² aβ
.
The combinators sdpl
and sdpr
can be generalized to the case where a
s and b
s are of different types. We can simply extend these combinators to take three functions^{[2]} :
sdpl(*β², β³, +β²) = ((aβ, bβ), (aβ, bβ)) > (aβ *β² aβ, bβ +β² (aβ β³ bβ))
sdpr(*β², β², +β²) = ((aβ, bβ), (aβ, bβ)) > (aβ *β² aβ, (bβ β² aβ) +β² bβ)
[2]  In Julia, we acn use multiple dispatch for plumbing the calls aβ *β² aβ and bβ *β² aβ to different implementations. However, it would require defining particular type for each pair of aβ *β² aβ and bβ *β² aβ . Thus, it is more convenient to define these ternary combinators.

(x, y) > sdpl(*β², β³, +β²)(x, y)
is a monoid if *β²
and +β²
are monoids, the function β³
is a left monoid action^{[3]}; i.e., it satisfies
$\tag{actl}
a_1 \rhd (a_2 \rhd b) = (a_1 *' a_2) \rhd b$
and it is leftdistributive over +'
; i.e.,
The first condition indicates we can either
apply actions sequentially (e.g., $A_1 (A_2 x)$ where $A_1$ and $A_2$ are matrices and $x$ is a vector) or
combine actions first and then apply the combined function (e.g., $(A_1 A_2) x$).
The second condition indicates that we can either
merge the "targets" $b_1$ and $b_2$ first (e.g., $A (x_1 + x_2)$ where $A$ is a matrix and $x_1$ and $x_2$ are vectors) or
apply the actions separately and then merge them (e.g., $A x_1 + A x_2$).
These extra freedom in how to compute the result is essential in the parallel computing and is captured by the property that (x, y) > sdpl(*β², β³, +β²)(x, y)
is associative.
If $*' = \rhd$, the condition for monoid action (actl) is simply the associativity condition and the condition for leftdistributivity of the monoid action (distlact) is equivalent to the leftdistribuity of the modnoid (distlop).
For sdpr(*β², β², +β²)
, the function β²
must be a right monoid action
that rightdistributes over +'
This construction of monoids sdpl(*β², β³, +β²)
and sdpr(*β², β², +β²)
are called semidirect product^{[4]}.
[3]  It may be easier to find the resources on semigroup action. A monoid action is simply a semigroup action that is also a monoid. 
[4]  Kogge and Stone (1973) and Blelloch (1990) use the term semiassociative for describing the property required for the function β² .

prodop(ββ, ββ) = ((xa, xb), (ya, yb)) > (xa ββ ya, xb ββ yb)
mapreduce(x > (x, x), prodop(+, *), 1:10; init = (0, 1)) # example
(55, 3628800)
is a special case of semidirect product with trivial "donothing" actions a β³ b
= b
or b β² a = b
.
unique
Julia's unique
function returns a vector of unique element in the input collection while preserving the order that the elements appear in the input. To parallelize this function, we track the unique elements in a set and also keep the elements in a vector to maintain the ordering. When combining two solutions, we need to filter out elements in the right chunk if they already appear in the left chunk. This can be implemented by using setdiff
for the left action on the vector.
function vector_unique(xs)
singletons = ((Set([x]), [x]) for x in xs)
monoid = sdpl(union, (a, b) > setdiff(b, a), vcat)
return last(reduce(monoid, singletons))
end
Example:
vector_unique([1, 2, 1, 3, 1, 2, 1])
3element Vector{Int64}:
1
2
3
Using the general form of semidirect product, we can compute the maximum depth of parentheses and the corresponding index ("key"). It can be done with a function maxpair
that keeps track of the maximum value and the index in the second monoid. The action shiftvalue
can be used to shift the value but not the index:
maxpair((i, x), (j, y)) = y > x ? (j, y) : (i, x)
shiftvalue(d, (i, x)) = (i, d + x)
function findmax_parentheses(increments)
singletons = ((x, (i, x)) for (i, x) in pairs(increments))
monoid = sdpl(+, shiftvalue, maxpair)
return reduce(monoid, singletons)[2]
end
findmax_parentheses(increments)
(9, 5)
(Compare this result with the first section.)
In the above example, we assumed that the key (position/index) is cheap to compute. However, we may need to use reduction to compute the key as well. For example, the input may be a UTF8 string and we need the number of characters (code points) as the key. Another situation that this is requried is when processsing newline delimited strings and we need the line column (and/or the line number) as the key. We can compute the index onthefly by applying sdpl
twice. The "inner" application is the monoid used in the findmax_parentheses
function above. The "outer" application of sdpl
is used to for 1οΈβ£ counting the processed number of elemetns and 2οΈβ£ shifting the index using the left action:
shiftindex(n, (d, (i, x))) = (d, (n + i, x))
# ~~~~~
# shift the index i by the number n of processed elements
function findmax_parentheses_relative(increments)
singletons = ((1, (x, (1, x))) for x in increments)
#  
#  the index of the maximum element in a singleton collection
# the number of elements in a singleton collection
monoid = sdpl(+, shiftindex, sdpl(+, shiftvalue, maxpair))
# ~ ~.~~~~~~~~ ~~~.~~~~~~~~~~~~~~~~~~~~~~~~
#   ` compute maximum of depths and its index (as above)
#  
#  ` 2οΈβ£ shift the index of right chunk by the size of processed left chunk
# 
# ` 1οΈβ£ compute the size of processed chunk
return reduce(monoid, singletons)[2][2]
end
findmax_parentheses_relative(increments)
(9, 5)
Since shiftindex
acts only on the index portion of the state of sdpl(+,
shiftvalue, maxpair)
and the index does not interact with the other components, it is easy to see that this satisfy the condition for semidirect product.
# ...which does not mean we shouldn't be unittesting it
using Test
@testset begin
β = sdpl(+, shiftindex, sdpl(+, shiftvalue, maxpair))
prod3(xs) = Iterators.product(xs, xs, xs)
nfailed = 0
for (n1, n2, n3) in prod3(1:3),
(d1, d2, d3) in prod3(1:3),
(i1, i2, i3) in prod3('a':'c'),
(m1, m2, m3) in prod3(1:3)
x1 = (n1, (d1, (i1, m1)))
x2 = (n2, (d2, (i2, m2)))
x3 = (n3, (d3, (i3, m3)))
nfailed += !(x1 β (x2 β x3) == (x1 β x2) β x3)
end
@test nfailed == 0
end
Test Summary:  Pass Total
test set  1 1
Semidirect products appear naturally in reductions where multiple states interact. It helps us easily derive efficient and intricate parallel programs. Furthermore, even if semidirect product is not required to derive the monoid, it is cumbersom to convince oneself that the given definition and implementation of the monoid indeed satisfy the algebraic requirements. Decomposing the monoid into simpler "subcomponent" monoids and monoid actions makes it easy to reason about such algebraic properties.
Kogge, Peter M., and Harold S. Stone. 1973. βA Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations.β IEEE Transactions on Computers Cβ22 (8): 786β93. https://doi.org/10.1109/TC.1973.5009159.
Blelloch, Guy E. βPrefix Sums and Their Applications,β 1990, 26. https://kilthub.cmu.edu/articles/journal_contribution/Prefix_sums_and_their_applications/6608579/1
Gorlatch, S., and C. Lengauer. β(De) Composition Rules for Parallel Scan and Reduction.β In Proceedings. Third Working Conference on Massively Parallel Programming Models (Cat. No.97TB100228), 23β32, 1997. https://doi.org/10.1109/MPPM.1997.715958.
Chauhan, Satvik, Piyush P. Kurur, and Brent A. Yorgey. 2016. βHow to Twist Pointers without Breaking Them.β In Proceedings of the 9th International Symposium on Haskell, 51β61. Haskell 2016. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2976002.2976004.
Kmett, Edward. 2018. βThere and Back Again.β Presented at the Lambda World, Seattle, USA. https://www.youtube.com/watch?v=HGi5AxmQUwU.