[BREAKING] Use DataAPI.refpool for optimized grouping #2442

nalimilan · 2020-09-19T17:13:37Z

Generalize existing optimized `row_group_slots` method for `CategoricalArray` and `PooledArray` so that it can be used for other array types for which `DataAPI.refpool` returns an `AbstractVector`. This allows dropping the dependency on CategoricalArrays in this part of the code.

Also refactor the method to be faster when not sorting. In that case, we do not need to build a map between reference codes and groups (indexing into it is slow when the number of groups is very large). `CategoricalArray` is no longer special cased: when `sort=false`, levels are still sorted, but `missing` appears first.

Add more tests to cover weird combinations.
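To illustrate why skipping the code-to-group map helps when not sorting, here is a minimal sketch in plain Julia. It is not the `row_group_slots` implementation: `group_indices` is a hypothetical helper, and it assumes the reference codes are integers in `1:length(pool)` with no missing values and no unused pool entries.

```julia
# Hypothetical helper, not the DataFrames.jl implementation.
# `refs` holds integer codes in 1:length(pool); `pool` holds the distinct values.
function group_indices(refs::AbstractVector{<:Integer}, pool::AbstractVector;
                       sort::Bool=false)
    if !sort
        # Codes already identify groups, so no lookup table is needed.
        return collect(Int, refs)
    end
    # Build a map from reference code to the rank of its value in sorted order.
    perm = sortperm(pool)
    code_to_group = invperm(perm)
    # One extra indexing operation per row; this is the lookup that the
    # description says becomes slow when the number of groups is very large.
    return [code_to_group[r] for r in refs]
end

pool = ["b", "c", "a"]
refs = [1, 3, 3, 2, 1]
unsorted_groups = group_indices(refs, pool)            # codes reused as group indices
sorted_groups = group_indices(refs, pool, sort=true)   # groups numbered by sorted pool order
```

In the unsorted case the result is just a copy of the codes, while the sorted case pays for one random access into `code_to_group` per row.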

Some benchmarks:

    df = DataFrame(x=PooledArray(rand(1:ngroups, N)), y=rand(N));

    ## 10M rows and 1k groups
    # Current
    julia> @btime groupby(df, :x);
      51.310 ms (35 allocations: 76.31 MiB)

    julia> @btime groupby(df, :x, sort=true);
      62.972 ms (71 allocations: 76.34 MiB)

    julia> @btime combine(groupby(df, :x), :y => sum);
      67.693 ms (216 allocations: 76.39 MiB)

    julia> @btime combine(groupby(df, :x, sort=true), :y => sum);
      78.816 ms (252 allocations: 76.42 MiB)

    # This PR
    julia> @btime groupby(df, :x);
      43.987 ms (50 allocations: 76.35 MiB)

    julia> @btime groupby(df, :x, sort=true);
      48.425 ms (55 allocations: 76.38 MiB)

    julia> @btime combine(groupby(df, :x), :y => sum);
      59.312 ms (231 allocations: 76.43 MiB)

    julia> @btime combine(groupby(df, :x, sort=true), :y => sum);
      63.025 ms (236 allocations: 76.46 MiB)

    ## 1G rows and 10M groups
    # Current
    julia> @time groupby(df, :x);
    129.936916 seconds (7.60 M allocations: 7.907 GiB, 0.54% gc time)

    julia> @time groupby(df, :x, sort=true);
    163.077585 seconds (1.76 M allocations: 8.575 GiB)

    julia> @time combine(groupby(df, :x), :y => sum);
    165.301710 seconds (8.78 M allocations: 8.456 GiB, 0.39% gc time)

    julia> @time combine(groupby(df, :x, sort=true), :y => sum);
    193.600132 seconds (1.03 M allocations: 9.014 GiB)

    # This PR
    julia> @time groupby(df, :x);
     39.474096 seconds (96 allocations: 7.740 GiB)

    julia> @time groupby(df, :x, sort=true);
    134.739279 seconds (76.63 k allocations: 8.042 GiB, 0.51% gc time)

    julia> @time combine(groupby(df, :x), :y => sum);
     72.631447 seconds (5.17 M allocations: 8.475 GiB)

    julia> @time combine(groupby(df, :x, sort=true), :y => sum);
    167.185490 seconds (293 allocations: 8.511 GiB, 0.97% gc time)

nalimilan · 2020-09-19T17:19:17Z

src/dataframerow/utils.jl

    @assert groups !== nothing && all(col -> length(col) == length(groups), cols)

+    refpools = map(DataAPI.refpool, cols)
+    foreach(refpool -> @assert(allunique(refpool)), refpools)


This should probably be turned into a check, with a fallback to the standard method if it fails, but I wanted to discuss what the expectations for a refpool are. Should we require it to contain unique values or not? @quinnj @piever

EDIT: in terms of performance, this is relatively negligible (1-2% of the total time for grouping).

Hmmmm....in the arrow format, duplicate values are specifically allowed in the "dict encoding" pool, though it's recommended to "normalize" them to avoid duplicates, so I'm not sure; probably a fine requirement in practice, but could potentially cause an issue for weird arrow files.

OK, so I'll turn this into a simple check with a fallback to the generic method.
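A rough sketch of what such a check with a fallback could look like, in plain Julia. The function names `groups_by_hashing` and `groups_from_pool` are hypothetical and do not correspond to DataFrames.jl internals; the sketch only shows why duplicate pool entries make the reference codes unusable as group indices.

```julia
# Hypothetical sketch; function names are illustrative, not DataFrames.jl internals.

# Generic path: hash every value to discover groups in order of first appearance.
function groups_by_hashing(values::AbstractVector)
    seen = Dict{eltype(values), Int}()
    return [get!(seen, v, length(seen) + 1) for v in values]
end

# Fast path: reuse reference codes, but only if the pool has no duplicates.
function groups_from_pool(refs::AbstractVector{<:Integer}, pool::AbstractVector)
    if allunique(pool)
        return collect(Int, refs)
    end
    # Duplicate pool entries would split one value across several groups,
    # so fall back to grouping the decoded values.
    return groups_by_hashing(pool[refs])
end

unique_case = groups_from_pool([2, 1, 2], ["a", "b"])
# Codes 1 and 3 both decode to "a", so they must land in the same group.
duplicate_case = groups_from_pool([1, 2, 3, 1], ["a", "b", "a"])
```

The `allunique(pool)` call is the part whose cost is discussed in this thread: it scales with the pool size, not the number of rows, so it only matters when the pool is very large.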

Actually, in some extreme cases it can take a lot of time. For example, the following takes 65s currently, but only 27s if I get rid of the check. So we probably should add an API to check whether the pool may contain duplicates. For PooledArray and CategoricalArray it would always be false, for others it could either always be true or vary at runtime.

    using DataFrames, PooledArrays
    N = 1_000_000_000; k = N ÷ 5
    df = DataFrame(x=PooledArray(rand(1:k, N)), y=rand(N));
    @time gd = groupby(df, :x);

Arrow.DictEncoded allows duplicates and would need a runtime check

Yes that's why I mentioned that possibility.

I wanted to use refpool for a different project, but the fact that refpool could have non-unique elements made the code too complicated to be worth it. I agree it would be nice to have an API that explicitly says elements are unique.

See also #2558.
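One possible shape for the API discussed in this thread is a trait with a conservative default. The sketch below is purely illustrative: `pool_may_have_duplicates` and `UniquePooled` are hypothetical names invented here, and nothing in this thread says that DataAPI provides such a function.

```julia
# Hypothetical trait; no such function is claimed to exist in DataAPI.
# Conservative default: unknown array types may have duplicates in their pool.
pool_may_have_duplicates(::AbstractVector) = true

# A toy pooled vector type whose constructor guarantees a duplicate-free pool.
struct UniquePooled{T} <: AbstractVector{T}
    refs::Vector{Int}
    pool::Vector{T}
    function UniquePooled(values::AbstractVector{T}) where {T}
        pool = unique(values)
        lookup = Dict(v => i for (i, v) in enumerate(pool))
        new{T}([lookup[v] for v in values], pool)
    end
end
Base.size(x::UniquePooled) = size(x.refs)
Base.getindex(x::UniquePooled, i::Int) = x.pool[x.refs[i]]

# Types that guarantee uniqueness opt out of the runtime check.
pool_may_have_duplicates(::UniquePooled) = false

x = UniquePooled(["a", "b", "a"])
skip_check = !pool_may_have_duplicates(x)
```

With a trait like this, a type that always keeps its pool unique could skip the `allunique` scan entirely, while a type that permits duplicates would keep the default and be checked at runtime.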

quinnj

Yay! This is awesome; love the speedups and making the code more generic with DataAPI calls.

bkamins · 2020-09-20T09:10:33Z

Adding [BREAKING] as it changes return value order when sort=false (minor but still...).

bkamins · 2020-09-20T13:40:37Z

Looks good thank you - the open comments are minor. I am OK to merge the PR when you resolve them.

nalimilan · 2020-09-21T11:44:24Z

I've added a commit to use the generic fallback when the pools are not unique, plus a few small things.

bkamins · 2020-09-21T15:25:46Z

src/dataframerow/utils.jl

    # so it makes sense to allocate more memory for better performance,
    # but it needs to remain reasonable compared with the size of the data frame.
-    if prod(Int128.(ngroupstup)) > typemax(Int) || ngroups > 2 * length(groups)
+    anydups = !all(allunique, refpools)


how expensive is this line?

This was discussed in another thread. Luckily that's only 1-2%. EDIT: it can be much more than that, see #2442 (comment).
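For readers wondering about the `Int128` conversion in the removed line above, here is a small sketch of that kind of guard. `too_many_groups` is a hypothetical helper, and it assumes that `ngroupstup` holds the pool size of each grouping column and that `nrows` plays the role of `length(groups)`.

```julia
# Hypothetical helper mirroring the condition in the removed line above.
function too_many_groups(ngroupstup::Tuple{Vararg{Int}}, nrows::Int)
    # Widen to Int128 so the product of two Int pool sizes cannot wrap around.
    total = prod(Int128.(ngroupstup))
    # Reject if the combined count does not fit in an Int, or if a table of
    # that size would be large compared with the number of rows.
    return total > typemax(Int) || total > 2 * Int128(nrows)
end

fits = too_many_groups((1000, 1000), 10_000_000)
overflows = too_many_groups((typemax(Int), 2), 100)
sparse = too_many_groups((100, 100), 1000)
```

Without the widening, `typemax(Int) * 2` would wrap around in `Int` arithmetic and the first comparison could never detect the overflow.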

bkamins · 2020-09-21T15:57:39Z

Looks good. I left some small remarks. Additionally - a general question - I hope that the tests cover all code paths (in particular the path when we have too many groups so we should invoke a slow path).

nalimilan · 2020-09-21T19:51:34Z

> Additionally - a general question - I hope that the tests cover all code paths (in particular the path when we have too many groups so we should invoke a slow path).

Yes, that's covered by the new tests (there was already one but quite limited). We don't have an example of an array for which refpool contains duplicates, but since the path is the same it should be OK.

bkamins · 2020-10-11T10:58:15Z

So - I understand we merge it once CI passes. Right?

bkamins · 2020-10-11T16:47:24Z

Thank you!

quinnj approved these changes Sep 20, 2020

bkamins changed the title ~~Use DataAPI.refpool for optimized grouping~~ [BREAKING] Use DataAPI.refpool for optimized grouping Sep 20, 2020

bkamins added the breaking, ecosystem, and performance labels Sep 20, 2020

bkamins added this to the 1.0 milestone Sep 20, 2020

bkamins approved these changes Sep 20, 2020

Allow duplicates, hash refs, and take into account skipmissing when checking sortedness

b46ed00

Apply suggestions from code review

9d05965

Merge branch 'master' into nl/refgrouping

f3ce3ed

nalimilan merged commit 9a292ff into master Oct 11, 2020

nalimilan deleted the nl/refgrouping branch October 11, 2020 16:34

nalimilan mentioned this pull request Nov 22, 2020

Introduce new dropunusedlevels! function and allow using from map JuliaData/PooledArrays.jl#43

Open

bkamins mentioned this pull request Nov 22, 2020

Track duplicates in DataAPI.refpool issue #2558

Open
