LazyGroupBy.jl

LazyGroupBy.LazyGroupBy - Module

LazyGroupBy: lazy, parallelizable and composable group-by operations


LazyGroupBy.jl exports a single API, grouped. It can be used to run group-by operations using the dot-call syntax:

reducer.(..., grouped(key, collection), ...)

where reducer runs on each group (thus, grouped(key, collection) can be considered a collection of key-value pairs with a Dictionaries.jl-like broadcasting rule). Roughly speaking, grouped(key, collection) is equivalent to Dict(k_1 => [v_11, v_12, ...], k_2 => [v_21, v_22, ...], ...) where k_i is the value returned by key(v_ij) for each v_ij in collection, and each call of reducer is evaluated with a group "vector" [v_i1, v_i2, ...].

For example:

julia> using LazyGroupBy

julia> collect.(grouped(isodd, 1:7))
Transducers.GroupByViewDict{Bool,Array{Int64,1},…} with 2 entries:
  false => [2, 4, 6]
  true  => [1, 3, 5, 7]

julia> length.(grouped(isodd, 1:7))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 3
  true  => 4

julia> keys.(grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Array{Int64,1},…} with 2 entries:
  false => [1, 7, 9]
  true  => [2, 3, 4, 5, 6, 8, 10]

julia> foldl.(tuple, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Any,…} with 2 entries:
  false => ((0, 4), 0)
  true  => ((((((7, 3), 1), 5), 9), 3), 5)

julia> foldl.(tuple, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]); init = -1)
Transducers.GroupByViewDict{Bool,Tuple{Any,Int64},…} with 2 entries:
  false => (((-1, 0), 4), 0)
  true  => (((((((-1, 7), 3), 1), 5), 9), 3), 5)

julia> extrema_rf((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2));

julia> mapfoldl.(x -> (x, x), extrema_rf, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Tuple{Int64,Int64},…} with 2 entries:
  false => (0, 4)
  true  => (1, 9)
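
The rough equivalence between grouped(key, collection) and a dictionary of group vectors can be sketched in plain Julia, without LazyGroupBy.jl. The helper eager_groups below is hypothetical (it is not part of the package) and eagerly materializes every group, which the lazy version does not need to do for reducers such as length:

```julia
# Hypothetical helper, NOT part of LazyGroupBy.jl: eagerly build
# Dict(k_1 => [v_11, v_12, ...], k_2 => [v_21, v_22, ...], ...).
function eager_groups(key, collection)
    T = eltype(collection)
    groups = Dict{Any,Vector{T}}()
    for v in collection
        # Create the group vector on first sight of a key, then append.
        push!(get!(() -> Vector{T}(), groups, key(v)), v)
    end
    return groups
end

groups = eager_groups(isodd, 1:7)

# Applying a reducer to each group "vector" mimics `length.(grouped(isodd, 1:7))`.
lengths = Dict(k => length(vs) for (k, vs) in groups)
```

Here groups[true] is [1, 3, 5, 7] and lengths[false] is 3, matching the REPL output shown earlier.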

The following generic and standard reducers are supported:

  • collect.(grouped(...)) :: DICT{Key,Vector{...}}
  • view.(grouped(_, array)) :: DICT{Key,SubArray}
  • map.(f, grouped(...))
  • length.(grouped(...)) :: DICT{Key,Int}
  • count.([f,] grouped(...)) :: DICT{Key,Int}
  • sum.([f,] grouped(...)) :: DICT{Key,Number}
  • prod.([f,] grouped(...)) :: DICT{Key,Number}
  • any.(f, grouped(...)) :: DICT{Key,Bool}
  • all.(f, grouped(...)) :: DICT{Key,Bool}
  • minimum.([f,] grouped(...))
  • maximum.([f,] grouped(...))
  • extrema.([f,] grouped(...))
  • keys.(grouped(_, collection)) :: DICT{Key,Vector{keytype(collection)}}
  • pairs.(grouped(_, collection)) :: DICT{Key,DICT{keytype(collection),valtype(collection)}}
  • findfirst.(f, grouped(_, array)) :: DICT{Key,keytype(array)}
  • findlast.(f, grouped(_, array)) :: DICT{Key,keytype(array)}
  • findall.(f, grouped(_, array)) :: DICT{Key,Vector{keytype(array)}}
  • foldl.(op, grouped(...); [init])
  • mapfoldl.(f, op, grouped(...); [init])

where DICT{K,V} above is a short-hand for AbstractDict{<:K,<:V} and Key is the type of the values returned from the key function passed to grouped.
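
As a brief illustration of the list above, the following sketch applies several standard reducers to the same lazy grouping. It assumes LazyGroupBy.jl is installed and uses only call forms that appear in this page; the variable names are arbitrary:

```julia
using LazyGroupBy

xs = [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]
g = grouped(isodd, xs)    # lazy: nothing is computed yet

sums   = sum.(g)          # per-group sum
counts = count.(<(5), g)  # per-group number of items below 5
ranges = extrema.(g)      # per-group (minimum, maximum)

# Results are dictionary-like and can be indexed by group key.
sums[true], counts[false], ranges[true]
```

Each dot call runs its own pass over xs; the grouping object g itself holds no per-group data.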

For more complex tasks, Transducers.jl and OnlineStats.jl can also be used:

  • foldl.(op, xf, grouped(...); [init])
  • foldxl.(op, [xf,] grouped(...); [init])
  • foldxt.(op, [xf,] grouped(...); [init]) (multi-threaded)
  • foldxd.(op, [xf,] grouped(...); [init]) (distributed)
  • collect.(xf, grouped(...))
  • tcollect.(xf, grouped(...)) (multi-threaded version of collect)
  • dcollect.(xf, grouped(...)) (distributed version of collect)

where xf::Transducer is initiated for each group individually and op is either a two-argument function or an OnlineStat object (e.g., OnlineStats.Mean).
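
A minimal sketch of the transducer forms listed above, assuming both LazyGroupBy.jl and Transducers.jl are installed. Map is a transducer exported by Transducers.jl; the input range is chosen only for illustration:

```julia
using LazyGroupBy
using Transducers

# Transform each item before collecting it into its group.
doubled = collect.(Map(x -> 2x), grouped(isodd, 1:5))

# Square each item, then sum within each group.
sumsq = foldl.(+, Map(abs2), grouped(isodd, 1:5))

doubled[true], sumsq[true]
```

The transducer sees the items of a group, not the group key, so the key function still receives the original values.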

Caveats

The dot-call syntax is used for defining the "domain-specific language" (DSL) and it is different from the standard semantics of broadcasting on arrays. In particular, reducer.(..., grouped(key, collection), ...) may not actually call reducer. Rather, it is pattern-matched and dispatched to an alternative definition based on Transducers.jl.
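
One way such interception can work in Julia is sketched below using only Base. The type ToyGrouped is hypothetical, and this is not necessarily how LazyGroupBy.jl implements it; it only shows that a dot call can be redirected so that the named function is never invoked:

```julia
# Hypothetical stand-in for the object returned by `grouped`.
struct ToyGrouped{K,C}
    key::K
    collection::C
end

# `f.(args...)` lowers to `Base.materialize(Base.broadcasted(f, args...))`,
# so a method of `broadcasted` can replace the usual broadcasting machinery.
function Base.Broadcast.broadcasted(::typeof(length), g::ToyGrouped)
    counts = Dict{Any,Int}()
    for v in g.collection
        k = g.key(v)
        counts[k] = get(counts, k, 0) + 1
    end
    return counts  # `length` itself is never called
end

counts = length.(ToyGrouped(isodd, 1:7))
```

Because the method returns a plain Dict rather than a Broadcasted object, Base.materialize passes it through unchanged.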

Implementation

LazyGroupBy.jl is implemented as a direct transformation to foldl/foldxt/foldxd and GroupBy from Transducers.jl. Consider

foldl.(rf, xf, grouped(key, collection); init = init)

This is simply translated to

foldl(right, GroupBy(key, xf, rf, init), collection)

Other reducers like sum and collect are implemented in terms of the above transformation.
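
The idea behind this transformation, a single pass over the collection that keeps one accumulator per group, can be sketched in plain Julia. The function groupfold is hypothetical and only approximates what GroupBy does; unlike Transducers.jl, it requires an explicit init:

```julia
# Hypothetical single-pass group fold, NOT the Transducers.jl implementation.
function groupfold(rf, key, collection, init)
    acc = Dict{Any,Any}()
    for v in collection
        k = key(v)
        # Update only the accumulator of the group that `v` belongs to.
        acc[k] = rf(get(acc, k, init), v)
    end
    return acc
end

sums = groupfold(+, isodd, [7, 3, 1, 5, 9, 4, 3, 5], 0)
```

No group vector is ever built: each item updates its group's accumulator and is then discarded.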

source
Base.all - Function
all.(f, grouped(key, array))

Examples

julia> using LazyGroupBy

julia> xs = [0, 7, 3];

julia> gs = all.(<(1), grouped(isodd, xs))
Transducers.GroupByViewDict{Bool,Bool,…} with 2 entries:
  false => true
  true  => false
source
Base.any - Function
any.(f, grouped(key, array))

Examples

julia> using LazyGroupBy

julia> xs = [0, 7, 3];

julia> gs = any.(>(5), grouped(isodd, xs))
Transducers.GroupByViewDict{Bool,Bool,…} with 2 entries:
  false => false
  true  => true
source
Base.collect - Function
collect.([xf,] grouped(key, collection))

Collect each group as a Vector.

The first optional argument xf is a transducer.

Example

julia> using LazyGroupBy

julia> collect.(grouped(isodd, [0, 7, 3]))
Transducers.GroupByViewDict{Bool,Array{Int64,1},…} with 2 entries:
  false => [0]
  true  => [7, 3]
source
Base.count - Function
count.([f,] grouped(key, collection))

Count the number of items in each group for which f evaluates to true.

Example

julia> using LazyGroupBy

julia> count.(<(5), grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 3
  true  => 3
source
Base.extrema - Function
extrema.([f,] grouped(key, collection); [init])

Examples

julia> using LazyGroupBy

julia> xs = [0, 7, 2, 3];

julia> extrema.(grouped(isodd, xs))
Transducers.GroupByViewDict{Bool,Tuple{Int64,Int64},…} with 2 entries:
  false => (0, 2)
  true  => (3, 7)
source
Base.findall - Function
findall.(f, grouped(key, array))

Examples

julia> using LazyGroupBy

julia> xs = [0, 7, 2, 3];

julia> gs = findall.(>(1), grouped(isodd, xs))
Transducers.GroupByViewDict{Bool,Array{Int64,1},…} with 2 entries:
  false => [3]
  true  => [2, 4]

julia> xs[gs[false]]
1-element Array{Int64,1}:
 2

julia> xs[gs[true]]
2-element Array{Int64,1}:
 7
 3
source
Base.findfirst - Function
findfirst.(f, grouped(key, array))

Examples

julia> using LazyGroupBy

julia> xs = [0, 7, 2, 3];

julia> gs = findfirst.(>(1), grouped(isodd, xs))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 3
  true  => 2

julia> xs[gs[false]]
2

julia> xs[gs[true]]
7
source
Base.findlast - Function
findlast.(f, grouped(key, array))

Examples

julia> using LazyGroupBy

julia> xs = [0, 7, 2, 3];

julia> gs = findlast.(<(5), grouped(isodd, xs))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 3
  true  => 4

julia> xs[gs[false]]
2

julia> xs[gs[true]]
3
source
Base.foldl - Function
foldl.(op, [xf,] grouped(key, collection); [init])
foldl.(os::OnlineStat, [xf,] grouped(key, collection); [init])

The first argument is either a reducing step function or an OnlineStat. The second optional argument xf is a transducer.

Examples

julia> using LazyGroupBy

julia> foldl.(tuple, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Any,…} with 2 entries:
  false => ((0, 4), 0)
  true  => ((((((7, 3), 1), 5), 9), 3), 5)

julia> using OnlineStats

julia> foldl.(Ref(Mean()), grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Mean{Float64,EqualWeight},…} with 2 entries:
  false => Mean: n=3 | value=1.33333
  true  => Mean: n=7 | value=4.71429
source
Base.keys - Function
keys.(grouped(key, indexable))

Return a dictionary whose value is a vector of keys to the indexable input collection.

Example

julia> using LazyGroupBy

julia> keys.(grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Array{Int64,1},…} with 2 entries:
  false => [1, 7, 9]
  true  => [2, 3, 4, 5, 6, 8, 10]

julia> keys.(grouped(isodd, Dict(zip('a':'e', 1:5))))
Transducers.GroupByViewDict{Bool,Array{Char,1},…} with 2 entries:
  false => ['d', 'b']
  true  => ['a', 'c', 'e']
source
Base.length - Function
length.(grouped(key, collection))

Count the number of items in each group. This is defined as count.(_ -> true, grouped(key, collection)) rather than materializing each group vector.

Example

julia> using LazyGroupBy

julia> length.(grouped(isodd, 1:7))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 3
  true  => 4
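
The stated definition can be checked with a short sketch, assuming LazyGroupBy.jl is installed. Both calls below use forms documented on this page:

```julia
using LazyGroupBy

via_length = length.(grouped(isodd, 1:7))
via_count  = count.(_ -> true, grouped(isodd, 1:7))

# Both approaches report the same group sizes.
via_length[true] == via_count[true] && via_length[false] == via_count[false]
```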
source
Base.map - Function
map.(f, grouped(key, collection))

Like collect.(grouped(key, collection)), but process each item with f.

Examples

julia> using LazyGroupBy

julia> map.(string, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Array{String,1},…} with 2 entries:
  false => ["0", "4", "0"]
  true  => ["7", "3", "1", "5", "9", "3", "5"]
source
Base.mapfoldl - Function
mapfoldl.(f, op, grouped(key, collection); [init])

Examples

julia> using LazyGroupBy

julia> extrema_rf((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2));

julia> mapfoldl.(x -> (x, x), extrema_rf, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Tuple{Int64,Int64},…} with 2 entries:
  false => (0, 4)
  true  => (1, 9)
source
Base.maximum - Function
maximum.([f,] grouped(key, collection); [init])

Examples

julia> using LazyGroupBy

julia> maximum.(grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 4
  true  => 9
source
Base.minimum - Function
minimum.([f,] grouped(key, collection); [init])

Examples

julia> using LazyGroupBy

julia> minimum.(grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 0
  true  => 1
source
Base.pairs - Function
pairs.(grouped(key, indexable))

Return a dictionary whose values are dictionaries of key-value pairs from the indexable input collection.

Example

julia> using LazyGroupBy

julia> pairs.(grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Dict{Int64,Int64},…} with 2 entries:
  false => Dict(7=>4,9=>0,1=>0)
  true  => Dict(4=>1,10=>5,2=>7,3=>3,5=>5,8=>3,6=>9)

julia> pairs.(grouped(isodd, Dict(zip('a':'e', 1:5))))
Transducers.GroupByViewDict{Bool,Dict{Char,Int64},…} with 2 entries:
  false => Dict('d'=>4,'b'=>2)
  true  => Dict('a'=>1,'c'=>3,'e'=>5)
source
Base.prod - Function
prod.([f,] grouped(key, collection); [init])

Examples

julia> using LazyGroupBy

julia> prod.(grouped(isodd, [7, 3, 1, 5, 9, 4, 3, 5]))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 4
  true  => 14175
source
Base.sum - Function
sum.([f,] grouped(key, collection); [init])

Examples

julia> using LazyGroupBy

julia> sum.(grouped(isodd, [7, 3, 1, 5, 9, 4, 3, 5]))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 4
  true  => 33
source
Base.view - Function
view.(grouped(key, array))

Like collect.(grouped(key, array)), but return a mutable view to the input array.

Examples

julia> using LazyGroupBy

julia> xs = [0, 7, 3];

julia> gs = view.(grouped(isodd, xs))
Dict{Bool,SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false}} with 2 entries:
  false => [0]
  true  => [7, 3]

julia> gs[false][end] = 111;

julia> xs
3-element Array{Int64,1}:
 111
   7
   3
source
LazyGroupBy.grouped - Method
grouped(key, collection)

Create a lazy associative (dict-like) object grouped by a function key. Actual per-group reduction can be initiated by the dot-call (broadcasting) of the "reducers" like foldl and reduce.

Examples

julia> using LazyGroupBy

julia> length.(grouped(isodd, 1:7))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 3
  true  => 4
source
Statistics.mean - Function
mean.([f,] grouped(key, collection))

Compute mean of each group.

Example

julia> using LazyGroupBy, Statistics

julia> mean.(grouped(isodd, 1:7))
Dict{Bool,Float64} with 2 entries:
  false => 4.0
  true  => 4.0
source
Statistics.std - Function
std.([f,] grouped(key, collection))

Compute standard deviation of each group.

Example

julia> using LazyGroupBy, Statistics

julia> std.(grouped(isodd, 1:10))
Dict{Bool,Float64} with 2 entries:
  false => 3.16228
  true  => 3.16228
source
Statistics.var - Function
var.([f,] grouped(key, collection))

Compute variance of each group.

Example

julia> using LazyGroupBy, Statistics

julia> var.(grouped(isodd, 1:10))
Dict{Bool,Float64} with 2 entries:
  false => 10.0
  true  => 10.0
source
Transducers.dcollect - Function
dcollect.([xf,] grouped(key, collection))

Collect each group as a Vector using Distributed.jl.

The first optional argument xf is a transducer.

Example

julia> using LazyGroupBy
       using Transducers

julia> dcollect.(grouped(isodd, [0, 7, 3]))
Transducers.GroupByViewDict{Bool,Array{Int64,1},…} with 2 entries:
  false => [0]
  true  => [7, 3]
source
Transducers.foldxd - Function
foldxd.(op, [xf,] grouped(key, collection); [init])
foldxd.(os::OnlineStat, [xf,] grouped(key, collection); [init])

The first argument is either a reducing step function or an OnlineStat. The second optional argument xf is a transducer.

Examples

julia> using LazyGroupBy
       using Transducers

julia> foldxd.(+, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 4
  true  => 33
source
Transducers.foldxt - Function
foldxt.(op, [xf,] grouped(key, collection); [init])
foldxt.(os::OnlineStat, [xf,] grouped(key, collection); [init])

The first argument is either a reducing step function or an OnlineStat. The second optional argument xf is a transducer.

Examples

julia> using LazyGroupBy, Transducers

julia> foldxt.(max, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
  false => 4
  true  => 9

julia> using OnlineStats

julia> foldxt.(Ref(Mean()), grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Mean{Float64,EqualWeight},…} with 2 entries:
  false => Mean: n=3 | value=1.33333
  true  => Mean: n=7 | value=4.71429

An example of calculating the minimum, the maximum, and the number of items of each group in one go:

julia> table = ((k = gcd(v, 42), v = v) for v in 1:100);

julia> collect(Iterators.take(table, 5))  # preview
5-element Array{NamedTuple{(:k, :v),Tuple{Int64,Int64}},1}:
 (k = 1, v = 1)
 (k = 2, v = 2)
 (k = 3, v = 3)
 (k = 2, v = 4)
 (k = 1, v = 5)

julia> counter = reducingfunction(Map(_ -> 1), +);

julia> foldxt.(TeeRF(min, max, counter), Map(x -> x.v), grouped(x -> x.k, table))
Transducers.GroupByViewDict{Int64,Tuple{Int64,Int64,Int64},…} with 8 entries:
  7  => (7, 91, 5)
  14 => (14, 98, 5)
  42 => (42, 84, 2)
  2  => (2, 100, 29)
  3  => (3, 99, 15)
  21 => (21, 63, 2)
  6  => (6, 96, 14)
  1  => (1, 97, 28)
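
For comparison, the same per-group (minimum, maximum, count) triple can be computed with a plain single-threaded loop. The function minmaxcount is hypothetical and uses only Base, so it is a sketch of what the combined reduction produces rather than of how TeeRF works:

```julia
# Hypothetical helper: one pass, one (min, max, count) accumulator per key.
function minmaxcount(key, value, rows)
    acc = Dict{Int,Tuple{Int,Int,Int}}()
    for row in rows
        k, v = key(row), value(row)
        # A new group starts from (v, v, 0) so that min/max are well defined.
        lo, hi, n = get(acc, k, (v, v, 0))
        acc[k] = (min(lo, v), max(hi, v), n + 1)
    end
    return acc
end

table = ((k = gcd(v, 42), v = v) for v in 1:100)
stats = minmaxcount(x -> x.k, x -> x.v, table)
```

For instance, stats[42] is (42, 84, 2) and stats[1] is (1, 97, 28), in agreement with the foldxt output above.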
source
Transducers.tcollect - Function
tcollect.([xf,] grouped(key, collection))

Collect each group as a Vector using multiple threads. See also collect.(grouped(key, collection)).

The first optional argument xf is a transducer.

Example

julia> using LazyGroupBy
       using Transducers

julia> tcollect.(grouped(isodd, [0, 7, 3]))
Transducers.GroupByViewDict{Bool,Array{Int64,1},…} with 2 entries:
  false => [0]
  true  => [7, 3]
source