@nalimilan
Member

Generalize existing optimized row_group_slots method for CategoricalArray and PooledArray so that it can be used for other array types for which DataAPI.refpool returns an AbstractVector. This allows dropping the dependency on CategoricalArrays in this part of the code.

Also refactor the method to be faster when not sorting. In that case, we do not need to build a map between reference codes and groups (indexing into it is slow when the number of groups is very large). CategoricalArray is no longer special cased: when sort=false, levels are still sorted, but missing appears first.

Add more tests to cover weird combinations.
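
The non-sorting fast path described above can be sketched roughly as follows. This is a minimal single-column illustration in plain Julia, not the DataFrames.jl implementation: the function name `groups_from_refs` is hypothetical, and it assumes the reference codes are integers in `1:npool`. Each code is used directly as the group index, and the indices are compacted only if some codes never occur.

```julia
# Hypothetical sketch, not the actual `row_group_slots` method.
# `refs` holds one integer code per row, each assumed to lie in 1:npool.
function groups_from_refs(refs::AbstractVector{<:Integer}, npool::Integer)
    groups = Vector{Int}(undef, length(refs))
    seen = falses(npool)
    for i in eachindex(refs, groups)
        r = refs[i]
        groups[i] = r      # the code itself is the group index
        seen[r] = true
    end
    # Every code occurs: nothing to compact.
    all(seen) && return groups, Int(npool)
    # Some codes are unused: renumber so that group indices are contiguous.
    remap = cumsum(seen)   # remap[r] = number of used codes <= r
    for i in eachindex(groups)
        groups[i] = remap[groups[i]]
    end
    return groups, remap[end]
end

groups, ngroups = groups_from_refs([3, 1, 3, 5], 5)
```

In this sketch the groups come out in pool order, which is consistent with the note above that levels remain ordered even when `sort=false`.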

Some benchmarks:

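using BenchmarkTools, DataFrames, PooledArrays

# N is the number of rows and ngroups the number of distinct values; both are set per benchmark below
df = DataFrame(x=PooledArray(rand(1:ngroups, N)), y=rand(N));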

## 10M rows and 1k groups
# Current
julia> @btime groupby(df, :x);
  51.310 ms (35 allocations: 76.31 MiB)

julia> @btime groupby(df, :x, sort=true);
  62.972 ms (71 allocations: 76.34 MiB)

julia> @btime combine(groupby(df, :x), :y => sum);
  67.693 ms (216 allocations: 76.39 MiB)

julia> @btime combine(groupby(df, :x, sort=true), :y => sum);
  78.816 ms (252 allocations: 76.42 MiB)

# This PR
julia> @btime groupby(df, :x);
  43.987 ms (50 allocations: 76.35 MiB)

julia> @btime groupby(df, :x, sort=true);
  48.425 ms (55 allocations: 76.38 MiB)

julia> @btime combine(groupby(df, :x), :y => sum);
  59.312 ms (231 allocations: 76.43 MiB)

julia> @btime combine(groupby(df, :x, sort=true), :y => sum);
  63.025 ms (236 allocations: 76.46 MiB)


## 1G rows and 10M groups
# Current
julia> @time groupby(df, :x);
129.936916 seconds (7.60 M allocations: 7.907 GiB, 0.54% gc time)

julia> @time groupby(df, :x, sort=true);
163.077585 seconds (1.76 M allocations: 8.575 GiB)

julia> @time combine(groupby(df, :x), :y => sum);
165.301710 seconds (8.78 M allocations: 8.456 GiB, 0.39% gc time)

julia> @time combine(groupby(df, :x, sort=true), :y => sum);
193.600132 seconds (1.03 M allocations: 9.014 GiB)


# This PR
julia> @time groupby(df, :x);
 39.474096 seconds (96 allocations: 7.740 GiB)

julia> @time groupby(df, :x, sort=true);
134.739279 seconds (76.63 k allocations: 8.042 GiB, 0.51% gc time)

julia> @time combine(groupby(df, :x), :y => sum);
 72.631447 seconds (5.17 M allocations: 8.475 GiB)

julia> @time combine(groupby(df, :x, sort=true), :y => sum);
167.185490 seconds (293 allocations: 8.511 GiB, 0.97% gc time)

@assert groups !== nothing && all(col -> length(col) == length(groups), cols)

refpools = map(DataAPI.refpool, cols)
foreach(refpool -> @assert(allunique(refpool)), refpools)
Member Author

@nalimilan nalimilan Sep 19, 2020

This should probably be turned into a check, with a fallback to the standard method if it fails, but I wanted to discuss what the expectations for a refpool are. Should we require it to contain unique values or not? @quinnj @piever

EDIT: in terms of performance, this is relatively negligible (1-2% of the total time for grouping).

Member

Hmmmm....in the arrow format, duplicate values are specifically allowed in the "dict encoding" pool, though it's recommended to "normalize" them to avoid duplicates, so I'm not sure; probably a fine requirement in practice, but could potentially cause an issue for weird arrow files.

Member Author

OK, so I'll turn this into a simple check with a fallback to the generic method.
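
A rough sketch of what such a check with a fallback could look like, in plain Julia. The names `group_generic` and `group_pooled` are hypothetical and do not correspond to DataFrames.jl internals; the sketch assumes a single column whose codes index into `pool` starting at 1.

```julia
# Hypothetical sketch; function names are illustrative only.

# Generic path: hash every value. Works for any input, duplicates included.
function group_generic(values::AbstractVector)
    index = Dict{eltype(values), Int}()
    # A value seen for the first time gets the next free group number.
    return [get!(index, v, length(index) + 1) for v in values]
end

# Pooled path: reuse the integer codes, valid only if all pool entries are distinct.
function group_pooled(refs::AbstractVector{<:Integer}, pool::AbstractVector)
    if !allunique(pool)
        # Two codes may denote the same value, so fall back to hashing the values.
        return group_generic(pool[refs])
    end
    return collect(Int, refs)
end

group_pooled([1, 2, 3, 1], ["a", "b", "a"])  # duplicate "a" triggers the fallback
```

The cost of this design is the `allunique` call itself, which scans the whole pool on every grouping operation.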

Member Author

Actually, in some extreme cases the check can take a lot of time. For example, the following currently takes 65s, but only 27s if I remove the check. So we should probably add an API to query whether the pool may contain duplicates. For PooledArray and CategoricalArray it would always return false; for other types it could either always return true or vary at runtime.

using DataFrames, PooledArrays
N = 1_000_000_000;
k = N ÷ 5
df = DataFrame(x=PooledArray(rand(1:k, N)), y=rand(N));
@time gd = groupby(df, :x);
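
One possible shape for such an API is a trait function that array types can overload. Everything in the sketch below is hypothetical: `pool_may_have_duplicates` is not part of DataAPI, and the two struct types are simplified stand-ins for a type that guarantees a unique pool and for a dictionary-encoded type that does not.

```julia
# Hypothetical trait sketch; none of these names exist in DataAPI or DataFrames.jl.

# Stand-in for a type that guarantees unique pool entries.
struct UniquePoolVector{T}
    refs::Vector{Int}
    pool::Vector{T}
end

# Stand-in for a dictionary-encoded type whose pool may repeat values.
struct DictEncodedVector{T}
    refs::Vector{Int}
    pool::Vector{T}
end

# Conservative default: unknown types must be checked at run time.
pool_may_have_duplicates(::Any) = true
# Types that enforce uniqueness opt out of the check.
pool_may_have_duplicates(::UniquePoolVector) = false

# The pool is scanned with `allunique` only when the type cannot rule duplicates out.
can_use_fast_path(x) = !pool_may_have_duplicates(x) || allunique(x.pool)
```

With a trait like this, the scan of the pool would be skipped entirely for types that opt out, and other types would keep the runtime check.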

Member

Arrow.DictEncoded allows duplicates and would need a runtime check

Member Author

Yes that's why I mentioned that possibility.

Contributor

@matthieugomez matthieugomez Nov 27, 2020

I wanted to use refpool for a different project, but the fact that refpool could have non-unique elements made the code too complicated to be worth it. I agree it would be nice to have an API that explicitly says elements are unique.

Member Author

See also #2558.

Member

@quinnj quinnj left a comment

Yay! This is awesome; love the speedups and making the code more generic with DataAPI calls.

@bkamins bkamins changed the title from "Use DataAPI.refpool for optimized grouping" to "[BREAKING] Use DataAPI.refpool for optimized grouping" Sep 20, 2020
@bkamins bkamins added the breaking, ecosystem, and performance labels Sep 20, 2020
@bkamins bkamins added this to the 1.0 milestone Sep 20, 2020
@bkamins
Member

bkamins commented Sep 20, 2020

Adding [BREAKING] as it changes return value order when sort=false (minor but still...).

@bkamins
Member

bkamins commented Sep 20, 2020

Looks good, thank you. The open comments are minor, and I am OK with merging the PR once you resolve them.

@nalimilan
Member Author

I've added a commit to use the generic fallback when the pools are not unique, plus a few small things.

# so it makes sense to allocate more memory for better performance,
# but it needs to remain reasonable compared with the size of the data frame.
if prod(Int128.(ngroupstup)) > typemax(Int) || ngroups > 2 * length(groups)
anydups = !all(allunique, refpools)
Member

how expensive is this line?

Member Author

@nalimilan nalimilan Sep 21, 2020

This was discussed in another thread. Luckily that's only 1-2%. EDIT: it can be much more than that, see #2442 (comment).
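
As a side note on the condition quoted at the top of this thread, a small illustration of why the product of per-column group counts is computed in `Int128`. The numbers are made up for illustration, `nrows` is a hypothetical stand-in for `length(groups)`, and the sketch assumes a 64-bit `Int`.

```julia
# Illustrative values only; `ngroupstup` mirrors the name used in the quoted snippet.
ngroupstup = (3_000_000_000, 4_000_000_000)  # pool sizes of two grouping columns
nrows = 1_000                                # hypothetical stand-in for length(groups)

wrapped = prod(ngroupstup)          # Int64 arithmetic wraps around silently
exact = prod(Int128.(ngroupstup))   # widening first gives the true product

# Same shape as the quoted guard: too many potential groups for the fast path.
too_many = exact > typemax(Int) || exact > 2 * nrows
```

Without the widening, the wrapped product could pass the size comparison even though the true number of potential groups is far too large.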

@bkamins
Member

bkamins commented Sep 21, 2020

Looks good. I left some small remarks. Additionally - a general question - I hope that the tests cover all code paths (in particular the path when we have too many groups so we should invoke a slow path).

@nalimilan
Member Author

> Additionally - a general question - I hope that the tests cover all code paths (in particular the path when we have too many groups so we should invoke a slow path).

Yes, that's covered by the new tests (there was already one but quite limited). We don't have an example of an array for which refpool contains duplicates, but since the path is the same it should be OK.

@bkamins
Member

bkamins commented Oct 11, 2020

So - I understand we merge it once CI passes. Right?

@nalimilan nalimilan merged commit 9a292ff into master Oct 11, 2020
@nalimilan nalimilan deleted the nl/refgrouping branch October 11, 2020 16:34
@bkamins
Member

bkamins commented Oct 11, 2020

Thank you!
