-
Couldn't load subscription status.
- Fork 374
[BREAKING] Use DataAPI.refpool for optimized grouping #2442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Generalize existing optimized `row_group_slots` method for `CategoricalArray` and `PooledArray` so that it can be used for other array types for which `DataAPI.refpool` returns an `AbstractVector`. This allows dropping the dependency on CategoricalArrays in this part of the code. Also refactor the method to be faster when not sorting. In that case, we do not need to build a map between reference codes and groups (indexing into it is slow when the number of groups is very large). `CategoricalArray` is no longer special cased: when `sort=false`, levels are still sorted, but `missing` appears first. Add more tests to cover weird combinations.
src/dataframerow/utils.jl
Outdated
| @assert groups !== nothing && all(col -> length(col) == length(groups), cols) | ||
|
|
||
| refpools = map(DataAPI.refpool, cols) | ||
| foreach(refpool -> @assert(allunique(refpool)), refpools) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be turned into a check, with a fallback to the standard method if it fails, but I wanted to discuss that are the expectations for a refpool. Should we require it to contain unique values or not? @quinnj @piever
EDIT: in terms of performance, this is relatively negligible (1-2% of the total time for grouping).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm....in the arrow format, duplicate values are specifically allowed in the "dict encoding" pool, though it's recommended to "normalize" them to avoid duplicates, so I'm not sure; probably a fine requirement in practice, but could potentially cause an issue for weird arrow files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so I'll turn this into a simple check with a fallback to the generic method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, in some extreme cases it can take a lot of time. For example, the following takes 65s currently, but only 27s if I get rid of the check. So we probably should add an API to check whether the pool may contain duplicates. For PooledArray and CategoricalArray it would always be false, for others it could either always be true or vary at runtime.
using DataFrames, PooledArrays
N = 1_000_000_000;
k = N ÷ 5
df = DataFrame(x=PooledArray(rand(1:k, N)), y=rand(N));
@time gd = groupby(df, :x);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow.DictEncoded allows duplicates and would need a runtime check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's why I mentioned that possibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to use refpool for a different project, but the fact that refpool could have non-unique elements made the code too complicated to be worth it. I agree it would be nice to have an API that explicitly says elements are unique.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also #2558.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay! This is awesome; love the speedups and making the code more generic with DataAPI calls.
|
Adding [BREAKING] as it changes return value order when |
|
Looks good thank you - the open comments are minor. I am OK to merge the PR when you resolve them. |
…hecking sortedness
|
I've added a commit to use the generic fallback when the pools are not unique, plus a few small things. |
| # so it makes sense to allocate more memory for better performance, | ||
| # but it needs to remain reasonable compared with the size of the data frame. | ||
| if prod(Int128.(ngroupstup)) > typemax(Int) || ngroups > 2 * length(groups) | ||
| anydups = !all(allunique, refpools) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how expensive is this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was discussed in another thread. Luckily that's only 1-2%. EDIT: it can be much more than that, see #2442 (comment).
|
Looks good. I left some small remarks. Additionally - a general question - I hope that the tests cover all code paths (in particular the path when we have too many groups so we should invoke a slow path). |
Yes, that's covered by the new tests (there was already one but quite limited). We don't have an example of an array for which |
|
So - I understand we merge it once CI passes. Right? |
|
Thank you! |
Generalize existing optimized
row_group_slotsmethod forCategoricalArrayandPooledArrayso that it can be used for other array types for whichDataAPI.refpoolreturns anAbstractVector. This allows dropping the dependency on CategoricalArrays in this part of the code.Also refactor the method to be faster when not sorting. In that case, we do not need to build a map between reference codes and groups (indexing into it is slow when the number of groups is very large).
CategoricalArrayis no longer special cased: whensort=false, levels are still sorted, butmissingappears first.Add more tests to cover weird combinations.
Some benchmarks: