mzeitlin11
Member

No behavior change here, but two main advantages can follow this refactor:

  • Right now rank_2d does the ranking portion of the algo with different logic. rank_2d can instead call this function, which will allow removing lots of rank_2d code and make keeping behaviors in sync easier. rank_2d also does not use nogil, which this would fix.
  • nancorr_spearman can use this to simplify (and hopefully speed up) re-ranking when nulls are present.
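
To make the two points above concrete, here is a minimal pure-Python sketch of the idea, not the actual Cython code in this PR. The names `rank_sorted_1d`, `rank_1d`, `rank_2d` and `spearman_with_nulls` are hypothetical, only the "average" tie method is shown, and nulls are modelled as `None`. One helper ranks values that already have a sort order, and both the 2D ranker and a null-aware Spearman correlation reuse it.

```python
import math
from typing import List, Optional, Sequence


def rank_sorted_1d(values: Sequence[float], order: Sequence[int]) -> List[float]:
    """Average ranks for `values`, given `order` (the indices that sort them)."""
    n = len(values)
    out = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the last element of the current run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Sorted positions i..j hold 1-based ranks i+1..j+1; ties share the mean.
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def rank_1d(values: Sequence[float]) -> List[float]:
    """Sort once, then delegate the ranking itself to the shared helper."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return rank_sorted_1d(values, order)


def rank_2d(columns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rank each column by calling the same 1D path (hypothetical reuse)."""
    return [rank_1d(col) for col in columns]


def spearman_with_nulls(
    x: Sequence[Optional[float]], y: Sequence[Optional[float]]
) -> float:
    """Spearman correlation: drop rows with a null, re-rank, then Pearson on ranks."""
    keep = [i for i in range(len(x)) if x[i] is not None and y[i] is not None]
    n = len(keep)
    if n < 2:
        return math.nan
    rx = rank_1d([x[i] for i in keep])
    ry = rank_1d([y[i] for i in keep])
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)
```

With a single ranking helper, tie handling lives in one place, so the 1D, 2D and correlation paths cannot drift apart.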

The diff looks more complicated than it is because the structure was changed from

if object:
    ...
else:
    with nogil:
        ...

to

if object:
    with gil:
        ...
else:
    ...

Benchmarks look unaffected:

       before           after         ratio
     [8b9b1a1d]       [c7a91ac3]
     <master>         <ref/rank_1d_sorted>
         264±10μs         297±20μs    ~1.13  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'direct')
         259±10μs          293±8μs    ~1.13  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'transformation')
         416±20μs         399±10μs     0.96  groupby.GroupByMethods.time_dtype_as_field('float', 'rank', 'direct')
          414±4μs         405±40μs     0.98  groupby.GroupByMethods.time_dtype_as_field('float', 'rank', 'transformation')
         406±30μs          374±8μs     0.92  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'direct')
        509±200μs         413±40μs    ~0.81  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'transformation')
      1.31±0.03ms      1.42±0.03ms     1.09  groupby.GroupByMethods.time_dtype_as_field('object', 'rank', 'direct')
      1.37±0.06ms      1.41±0.09ms     1.03  groupby.GroupByMethods.time_dtype_as_field('object', 'rank', 'transformation')
         354±40μs          406±9μs    ~1.15  groupby.GroupByMethods.time_dtype_as_field('uint', 'rank', 'direct')
         387±10μs          368±7μs     0.95  groupby.GroupByMethods.time_dtype_as_field('uint', 'rank', 'transformation')
          350±9μs         378±20μs     1.08  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'direct')
         366±20μs         390±10μs     1.06  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'transformation')
         339±10μs         368±10μs     1.09  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'direct')
         350±20μs         375±20μs     1.07  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'transformation')
         347±30μs         370±10μs     1.07  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'direct')
         348±30μs         380±20μs     1.09  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'transformation')
         270±20μs         277±10μs     1.03  groupby.GroupByMethods.time_dtype_as_group('object', 'rank', 'direct')
         259±10μs         272±20μs     1.05  groupby.GroupByMethods.time_dtype_as_group('object', 'rank', 'transformation')
         340±20μs         377±20μs    ~1.11  groupby.GroupByMethods.time_dtype_as_group('uint', 'rank', 'direct')
          364±7μs         387±20μs     1.06  groupby.GroupByMethods.time_dtype_as_group('uint', 'rank', 'transformation')
      1.05±0.07ms      1.02±0.04ms     0.98  groupby.RankWithTies.time_rank_ties('datetime64', 'average')
      1.04±0.05ms      1.05±0.04ms     1.00  groupby.RankWithTies.time_rank_ties('datetime64', 'dense')
       1.12±0.1ms      1.10±0.04ms     0.98  groupby.RankWithTies.time_rank_ties('datetime64', 'first')
       1.21±0.2ms      1.01±0.02ms    ~0.84  groupby.RankWithTies.time_rank_ties('datetime64', 'max')
      1.01±0.03ms         996±30μs     0.99  groupby.RankWithTies.time_rank_ties('datetime64', 'min')
      1.03±0.03ms         997±20μs     0.97  groupby.RankWithTies.time_rank_ties('float32', 'average')
      1.06±0.05ms      1.03±0.02ms     0.97  groupby.RankWithTies.time_rank_ties('float32', 'dense')
      1.07±0.03ms       1.22±0.2ms    ~1.14  groupby.RankWithTies.time_rank_ties('float32', 'first')
      1.02±0.04ms      1.02±0.06ms     1.01  groupby.RankWithTies.time_rank_ties('float32', 'max')
      1.06±0.03ms       1.08±0.1ms     1.02  groupby.RankWithTies.time_rank_ties('float32', 'min')
      1.08±0.03ms      1.03±0.04ms     0.96  groupby.RankWithTies.time_rank_ties('float64', 'average')
      1.03±0.04ms      1.01±0.06ms     0.98  groupby.RankWithTies.time_rank_ties('float64', 'dense')
      1.04±0.06ms         995±30μs     0.95  groupby.RankWithTies.time_rank_ties('float64', 'first')
      1.05±0.04ms      1.03±0.06ms     0.98  groupby.RankWithTies.time_rank_ties('float64', 'max')
      1.11±0.05ms      1.01±0.04ms    ~0.91  groupby.RankWithTies.time_rank_ties('float64', 'min')
      1.09±0.09ms      1.15±0.07ms     1.06  groupby.RankWithTies.time_rank_ties('int64', 'average')
       1.12±0.1ms      1.04±0.07ms     0.93  groupby.RankWithTies.time_rank_ties('int64', 'dense')
      1.05±0.09ms      1.02±0.03ms     0.97  groupby.RankWithTies.time_rank_ties('int64', 'first')
       1.15±0.1ms      1.07±0.07ms     0.93  groupby.RankWithTies.time_rank_ties('int64', 'max')
       1.02±0.2ms         999±30μs     0.98  groupby.RankWithTies.time_rank_ties('int64', 'min')
       10.1±0.8ms       10.0±0.9ms     0.99  series_methods.Rank.time_rank('float')
       7.71±0.4ms       7.99±0.9ms     1.04  series_methods.Rank.time_rank('int')
       48.4±0.7ms         57.1±2ms    ~1.18  series_methods.Rank.time_rank('object')
       8.09±0.7ms       7.07±0.3ms    ~0.87  series_methods.Rank.time_rank('uint')
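
For reading the table: the ratio column is the "after" timing divided by the "before" timing, so values above 1 mean the branch was slower. A small sketch that recomputes it for the four series_methods.Rank rows, using the rounded timings shown above (asv works from unrounded measurements, so the last digit can differ for other rows):

```python
# (before, after) timings in ms, copied from the series_methods.Rank rows above.
timings_ms = {
    "float": (10.1, 10.0),
    "int": (7.71, 7.99),
    "object": (48.4, 57.1),
    "uint": (8.09, 7.07),
}


def ratio(before: float, after: float) -> float:
    """after / before: values above 1 mean slower after the change."""
    return after / before


ratios = {name: ratio(before, after) for name, (before, after) in timings_ms.items()}

for name, value in ratios.items():
    print(f"{name:>6}: {value:.2f}")
```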

@mzeitlin11 added the Algos and Refactor labels Jun 9, 2021
@jreback added this to the 1.3 milestone Jun 10, 2021
@jreback
Contributor

jreback commented Jun 10, 2021

great. assume perf is the same.

@mzeitlin11
Member Author

> great. assume perf is the same.

Yeah perf is posted above in a details block

@jreback merged commit 499ef8c into pandas-dev:master Jun 10, 2021
@jreback
Contributor

jreback commented Jun 10, 2021

thanks @mzeitlin11

@mzeitlin11 deleted the ref/rank_1d_sorted branch June 10, 2021 00:33
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021