REF: split out sorted_rank algo #41910

Merged: 2 commits into pandas-dev:master from ref/rank_1d_sorted, Jun 10, 2021

mzeitlin11
Member

No behavior change here, but two main advantages can follow this refactor:

  • Right now rank_2d does the ranking portion of the algo with different logic. rank_2d can instead call this function, which will allow removing a lot of rank_2d code and make it easier to keep the two behaviors in sync. rank_2d also does not currently use nogil, which this would fix.
  • nancorr_spearman can use this for simpler (and hopefully faster) re-ranking when nulls are present.
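
To make the idea of a standalone "ranking from sorted order" step concrete, here is a minimal pure-Python sketch. It is not the Cython code from this PR: the name `rank_from_sorted`, its signature, and the tie-handling loop are assumptions chosen for illustration, and it ignores nulls, `ascending`, `pct`, and group labels.

```python
from typing import List, Sequence

METHODS = ("average", "min", "max", "first", "dense")


def rank_from_sorted(
    values: Sequence[float], order: Sequence[int], method: str = "average"
) -> List[float]:
    """Assign 1-based ranks given values and a precomputed sort order.

    Hypothetical helper: `order[k]` is the index in `values` of the k-th
    smallest element (i.e. the result of an argsort).
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")

    n = len(values)
    out = [0.0] * n
    dense = 0  # number of distinct values seen so far
    start = 0  # sorted position where the current run of ties began

    for i in range(n):
        # A run of ties ends at the last element, or when the next value differs.
        run_ends = i == n - 1 or values[order[i + 1]] != values[order[i]]
        if not run_ends:
            continue
        dense += 1
        for j in range(start, i + 1):
            if method == "average":
                rank = (start + 1 + i + 1) / 2
            elif method == "min":
                rank = start + 1
            elif method == "max":
                rank = i + 1
            elif method == "first":
                rank = j + 1
            else:  # "dense"
                rank = dense
            out[order[j]] = float(rank)
        start = i + 1
    return out


# Usage: drop nulls first, sort once, then rank the remaining values.
raw = [30.0, None, 10.0, 20.0, 10.0]
kept = [v for v in raw if v is not None]
order = sorted(range(len(kept)), key=kept.__getitem__)  # stable argsort
print(rank_from_sorted(kept, order, "average"))  # [4.0, 1.5, 3.0, 1.5]
```

Because the sort and the rank assignment are separate, a caller that already has a sort order (or that needs to re-rank after filtering out nulls, as in the nancorr_spearman case above) only has to run the second step.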

The diff looks more complicated than it is because the structure was changed from

if object:
    ...
else:
    with nogil:
        ...

to

if object:
    with gil:
        ...
else:
    ...

Benchmarks look unaffected:

       before           after         ratio
     [8b9b1a1d]       [c7a91ac3]
     <master>         <ref/rank_1d_sorted>
         264±10μs         297±20μs    ~1.13  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'direct')
         259±10μs          293±8μs    ~1.13  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'transformation')
         416±20μs         399±10μs     0.96  groupby.GroupByMethods.time_dtype_as_field('float', 'rank', 'direct')
          414±4μs         405±40μs     0.98  groupby.GroupByMethods.time_dtype_as_field('float', 'rank', 'transformation')
         406±30μs          374±8μs     0.92  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'direct')
        509±200μs         413±40μs    ~0.81  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'transformation')
      1.31±0.03ms      1.42±0.03ms     1.09  groupby.GroupByMethods.time_dtype_as_field('object', 'rank', 'direct')
      1.37±0.06ms      1.41±0.09ms     1.03  groupby.GroupByMethods.time_dtype_as_field('object', 'rank', 'transformation')
         354±40μs          406±9μs    ~1.15  groupby.GroupByMethods.time_dtype_as_field('uint', 'rank', 'direct')
         387±10μs          368±7μs     0.95  groupby.GroupByMethods.time_dtype_as_field('uint', 'rank', 'transformation')
          350±9μs         378±20μs     1.08  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'direct')
         366±20μs         390±10μs     1.06  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'transformation')
         339±10μs         368±10μs     1.09  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'direct')
         350±20μs         375±20μs     1.07  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'transformation')
         347±30μs         370±10μs     1.07  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'direct')
         348±30μs         380±20μs     1.09  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'transformation')
         270±20μs         277±10μs     1.03  groupby.GroupByMethods.time_dtype_as_group('object', 'rank', 'direct')
         259±10μs         272±20μs     1.05  groupby.GroupByMethods.time_dtype_as_group('object', 'rank', 'transformation')
         340±20μs         377±20μs    ~1.11  groupby.GroupByMethods.time_dtype_as_group('uint', 'rank', 'direct')
          364±7μs         387±20μs     1.06  groupby.GroupByMethods.time_dtype_as_group('uint', 'rank', 'transformation')
      1.05±0.07ms      1.02±0.04ms     0.98  groupby.RankWithTies.time_rank_ties('datetime64', 'average')
      1.04±0.05ms      1.05±0.04ms     1.00  groupby.RankWithTies.time_rank_ties('datetime64', 'dense')
       1.12±0.1ms      1.10±0.04ms     0.98  groupby.RankWithTies.time_rank_ties('datetime64', 'first')
       1.21±0.2ms      1.01±0.02ms    ~0.84  groupby.RankWithTies.time_rank_ties('datetime64', 'max')
      1.01±0.03ms         996±30μs     0.99  groupby.RankWithTies.time_rank_ties('datetime64', 'min')
      1.03±0.03ms         997±20μs     0.97  groupby.RankWithTies.time_rank_ties('float32', 'average')
      1.06±0.05ms      1.03±0.02ms     0.97  groupby.RankWithTies.time_rank_ties('float32', 'dense')
      1.07±0.03ms       1.22±0.2ms    ~1.14  groupby.RankWithTies.time_rank_ties('float32', 'first')
      1.02±0.04ms      1.02±0.06ms     1.01  groupby.RankWithTies.time_rank_ties('float32', 'max')
      1.06±0.03ms       1.08±0.1ms     1.02  groupby.RankWithTies.time_rank_ties('float32', 'min')
      1.08±0.03ms      1.03±0.04ms     0.96  groupby.RankWithTies.time_rank_ties('float64', 'average')
      1.03±0.04ms      1.01±0.06ms     0.98  groupby.RankWithTies.time_rank_ties('float64', 'dense')
      1.04±0.06ms         995±30μs     0.95  groupby.RankWithTies.time_rank_ties('float64', 'first')
      1.05±0.04ms      1.03±0.06ms     0.98  groupby.RankWithTies.time_rank_ties('float64', 'max')
      1.11±0.05ms      1.01±0.04ms    ~0.91  groupby.RankWithTies.time_rank_ties('float64', 'min')
      1.09±0.09ms      1.15±0.07ms     1.06  groupby.RankWithTies.time_rank_ties('int64', 'average')
       1.12±0.1ms      1.04±0.07ms     0.93  groupby.RankWithTies.time_rank_ties('int64', 'dense')
      1.05±0.09ms      1.02±0.03ms     0.97  groupby.RankWithTies.time_rank_ties('int64', 'first')
       1.15±0.1ms      1.07±0.07ms     0.93  groupby.RankWithTies.time_rank_ties('int64', 'max')
       1.02±0.2ms         999±30μs     0.98  groupby.RankWithTies.time_rank_ties('int64', 'min')
       10.1±0.8ms       10.0±0.9ms     0.99  series_methods.Rank.time_rank('float')
       7.71±0.4ms       7.99±0.9ms     1.04  series_methods.Rank.time_rank('int')
       48.4±0.7ms         57.1±2ms    ~1.18  series_methods.Rank.time_rank('object')
       8.09±0.7ms       7.07±0.3ms    ~0.87  series_methods.Rank.time_rank('uint')
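
For readers checking the ratio column, the following sketch recomputes after/before from the printed cells. The helper names `parse_timing` and `ratio` are hypothetical and not part of asv. asv presumably computes its ratios from unrounded measurements, so a value recomputed from the rounded cells can differ in the last digit (for example, 1.31ms to 1.42ms gives about 1.08 here, while the table prints 1.09).

```python
import re

# Scale factors to seconds; both the Greek mu and the micro sign are accepted.
_UNITS = {"ns": 1e-9, "μs": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0}
_TIMING = re.compile(r"^\s*([\d.]+)±([\d.]+)(ns|μs|µs|ms|s)\s*$")


def parse_timing(cell: str) -> float:
    """Return the central value of a cell like '264±10μs', in seconds."""
    m = _TIMING.match(cell)
    if m is None:
        raise ValueError(f"unrecognized timing cell: {cell!r}")
    mean, _spread, unit = m.groups()
    return float(mean) * _UNITS[unit]


def ratio(before: str, after: str) -> float:
    """after / before, so values above 1 mean the new branch was slower."""
    return parse_timing(after) / parse_timing(before)


# series_methods.Rank.time_rank('object')
print(round(ratio("48.4±0.7ms", "57.1±2ms"), 2))  # 1.18
# groupby.RankWithTies.time_rank_ties('datetime64', 'min'), mixed units
print(round(ratio("1.01±0.03ms", "996±30μs"), 2))  # 0.99
```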

@mzeitlin11 added the Algos and Refactor labels Jun 9, 2021
@jreback added this to the 1.3 milestone Jun 10, 2021
@jreback
Contributor

jreback commented Jun 10, 2021

great. assume perf is the same.

@mzeitlin11
Member Author

> great. assume perf is the same.

Yeah, perf is posted above in a details block.

@jreback merged commit 499ef8c into pandas-dev:master Jun 10, 2021
@jreback
Contributor

jreback commented Jun 10, 2021

thanks @mzeitlin11

@mzeitlin11 deleted the ref/rank_1d_sorted branch June 10, 2021 00:33
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021