Skip to content

Conversation

realead
Copy link
Contributor

@realead realead commented Jan 23, 2021

This is follow up of #39009 but for mode: now keys returned by mode are in the original order. This is also improvement of the code quality as mode now uses the same code path as value_counts.

Fixes a performance regression of #39009.

@jreback jreback added the Bug label Jan 24, 2021
@jreback jreback added this to the 1.3 milestone Jan 24, 2021
@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jan 24, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm small coment

# thus we have 2 nans and 1 True
modes = ht.mode_object(values, False)
assert modes.size == 1
assert checknull(modes[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use is_na instead of the internal checknull

@jreback
Copy link
Contributor

jreback commented Jan 25, 2021

can you run the mode asvs (doubt we have too many), value_counts & some indexing ones just to make sure nothing is changing here

@realead
Copy link
Contributor Author

realead commented Jan 25, 2021

I have added some asv-benchmarks, because otherwise mode seems to be not covered in asv_bench.

For running asv continuous -f 1.10 upstream/master HEAD -b series_methods.Mode -b ^indexing I got:

       before           after         ratio
     [309cf3a6]       [e91e27a2]
+        18.6±1ms         51.3±4ms     2.76  series_methods.Mode.time_mode(100000, 'int')
+      18.9±0.5ms         38.4±5ms     2.03  series_methods.Mode.time_mode(100000, 'uint')
+      43.8±0.4ms         88.9±7ms     2.03  series_methods.Mode.time_mode(100000, 'float')
+     1.19±0.03ms      1.55±0.02ms     1.30  series_methods.Mode.time_mode(10000, 'int')
+     1.20±0.01ms      1.52±0.03ms     1.26  series_methods.Mode.time_mode(10000, 'uint')
+     1.89±0.08ms      2.39±0.04ms     1.26  series_methods.Mode.time_mode(10000, 'float')
+         134±2ms          153±2ms     1.15  series_methods.Mode.time_mode(100000, 'object')

There is a slow down, as soon as the table with keys no longer fits the cache (10^3 is OK, 10^5 isn't). More cache-misses should be expected, due to an additional indirection in value_counts_xxx, which is needed for keeping the original order.

Looking at the results it looks as if the array-sizes in asv-tests for value_counts are too small, otherwise this regression should be visible in asv-run in #39009 as well.

@realead
Copy link
Contributor Author

realead commented Jan 25, 2021

When I change series_methods.ValueCounts similar to ecf9ff6, i.e. adding different array-sizes, and compare with asv PR #39009 I get the following:

       before           after         ratio
     [2f883215]       [ecf9ff6e]
+      25.6±0.8ms         74.5±1ms     2.91  series_methods.ValueCounts.time_value_counts(100000, 'int')
+        26.4±1ms         73.1±5ms     2.76  series_methods.ValueCounts.time_value_counts(100000, 'uint')
+      51.2±0.7ms         123±20ms     2.40  series_methods.ValueCounts.time_value_counts(100000, 'float')
-         212±9ms          176±3ms     0.83  series_methods.ValueCounts.time_value_counts(100000, 'object')
-     1.50±0.02ms      1.05±0.02ms     0.70  series_methods.ValueCounts.time_value_counts(1000, 'object')
-      12.5±0.1ms       8.57±0.1ms     0.69  series_methods.ValueCounts.time_value_counts(10000, 'object')

With smaller sizes (as it is currently the case), the issues with cache misses aren't visible and one sees only speed-up in case of object (because some pre-/post-processing is no longer needed):

       before           after         ratio
     [2f883215]       [ecf9ff6e]
-     1.50±0.02ms      1.05±0.02ms     0.70  series_methods.ValueCounts.time_value_counts(1000, 'object')

@jreback
Copy link
Contributor

jreback commented Jan 26, 2021

is the build_count_table just a special case version of value_counts that doesn't keep order (and that is why its faster)?

@realead
Copy link
Contributor Author

realead commented Jan 26, 2021

@jreback

is the build_count_table just a special case version of value_counts that doesn't keep order (and that is why its faster)?

that is correct. However, my implementation is a little bit naive: I have missed, that there are more cache misses now (misses for look-up in table + miss for lookup in the line result_counts.data.data[unique_key_index] += 1 vs misses only for look-up table in the old version) and it really counts for bigger arrays.

I think there is a way to improve the performance (which would also improve value_counts as well) making it almost as fast as the old version. I will try it.

@realead
Copy link
Contributor Author

realead commented Jan 26, 2021

Comparison to the state before #39009 was merged (-b series_methods.ValueCounts -b series_methods.Mode):

       before           after         ratio
     [2f883215]       [31512ea5]
-         251±5ms          197±4ms     0.79  series_methods.ValueCounts.time_value_counts(100000, 'object')
-      13.3±0.8ms       9.36±0.8ms     0.70  series_methods.ValueCounts.time_value_counts(10000, 'object')
-      1.73±0.1ms      1.18±0.08ms     0.68  series_methods.ValueCounts.time_value_counts(1000, 'object')

no additional cache misses and better times for object, because post processing was removed.

Compared to the current master:

       before           after         ratio
     [309cf3a6]       [31512ea5]
-         165±4ms         62.7±2ms     0.38  series_methods.ValueCounts.time_value_counts(100000, 'float')
-         103±2ms         31.9±2ms     0.31  series_methods.ValueCounts.time_value_counts(100000, 'int')
-         106±2ms         33.0±2ms     0.31  series_methods.ValueCounts.time_value_counts(100000, 'uint')

Performance regression of #39009 has been reverted.

@jreback jreback merged commit 984def2 into pandas-dev:master Jan 27, 2021
@jreback
Copy link
Contributor

jreback commented Jan 27, 2021

very nice thanks @realead keep em coming!

@realead realead deleted the consistent_mode branch January 27, 2021 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Inconsistent results of value_counts and mode for different nan-values
2 participants