
Speed up LeaveOneOutEncoder with vectorization. #146


Merged
merged 2 commits into scikit-learn-contrib:master on Oct 26, 2018

Conversation

@jkleint commented Oct 23, 2018

Over 400X faster while using less memory.

Store category mappings as DataFrames, then do vectorized lookups
with .map(). fit() just computes the sum and count of y for each
level of each column of X. These mappings are stored as a dict from
column name to a DataFrame with sum and count columns. transform()
then maps each column of X through its table, plus a little
vectorized math.

There is a speed/space tradeoff, whether to store the mean of y
for each level as well as the sum and count. This was resolved
in favor of space, recomputing the mean in transform(), so that
e.g. pickled Transformers will take less space on disk.

import numpy as np
import pandas as pd
from category_encoders import LeaveOneOutEncoder

rows, cats = 1000000, 1000
X = pd.DataFrame({'x': np.random.randint(0, cats, rows).astype(str)})
y = pd.Series(np.random.rand(rows))

# old
%time LeaveOneOutEncoder().fit_transform(X, y)
CPU times: user 2min 47s, sys: 241 ms, total: 2min 48s
Wall time: 2min 48s

# new
%time LeaveOneOutEncoder().fit_transform(X, y)
CPU times: user 390 ms, sys: 17.3 ms, total: 408 ms
Wall time: 407 ms
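
For reference, here is a minimal sketch of the scheme described above, with illustrative names (not the PR's exact code):

    import pandas as pd

    class LOOSketch:
        """Illustrative leave-one-out encoder built on per-level sum/count tables."""

        def fit(self, X, y):
            y = pd.Series(y, index=X.index)
            self._global_mean = y.mean()
            # One small table per column: sum and count of y for each level.
            self._mapping = {col: y.groupby(X[col]).agg(['sum', 'count'])
                             for col in X.columns}
            return self

        def transform(self, X, y=None):
            out = X.copy()
            for col, stats in self._mapping.items():
                sums = out[col].map(stats['sum'])        # vectorized lookup
                counts = out[col].map(stats['count'])
                if y is None:
                    encoded = sums / counts              # plain per-level mean
                else:
                    yy = pd.Series(y, index=out.index)
                    encoded = (sums - yy) / (counts - 1) # leave one out
                # Unseen levels (and, per the discussion below, unique ones)
                # come out NaN; fall back to the overall target mean.
                out[col] = encoded.fillna(self._global_mean)
            return out

The speedup comes from replacing the per-row Python loop with one groupby per column at fit time and one .map() per column at transform time.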

@jkleint (Author) commented Oct 23, 2018

Minor API suggestion for LeaveOneOutEncoder: the randomized and sigma parameters are redundant. You could get rid of randomized and just have sigma default to None. If you want randomization, just set sigma > 0.
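
A hypothetical helper sketching the suggestion, assuming the multiplicative Gaussian noise the encoder applies to training-time transforms (names here are illustrative, not the encoder's API):

    import numpy as np

    # sigma=None (or 0) -> deterministic output; sigma > 0 multiplies the
    # encoded values by Gaussian noise centered at 1.
    def maybe_randomize(encoded, sigma=None, random_state=None):
        if not sigma:
            return encoded
        rng = np.random.RandomState(random_state)
        return encoded * rng.normal(1.0, sigma, size=len(encoded))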

@janmotl (Collaborator) commented Oct 23, 2018

Thank you. It always makes me happy when the size of the code decreases.

Note, however, that the behaviour changed when we encounter a unique value at training time:

    def test_leave_one_out_unique(self):
        X = pd.DataFrame(data=['1', '2', '2', '2', '3'], columns=['col'])
        y = [1, 0, 1, 0, 1]

        encoder = encoders.LeaveOneOutEncoder(impute_missing=False)
        result = encoder.fit(X, y).transform(X, y)

        self.assertFalse(result.isnull().values.any(), 'There should not be any missing value')

I am not sure we should treat a unique value as missing.

The API change makes sense. Go ahead.

@jkleint (Author) commented Oct 23, 2018

You're welcome! Glad to help.

Yeah, I didn't fully understand the algorithm when I was translating it, and I think I made a mistake. I believe the intent is that if there's only one sample for a level, we should use the overall target mean, yes? And that should happen both when y is provided and when it's not? That's a little cleaner. I can do that.
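
A toy check of that single-sample case (made-up numbers, just to show the arithmetic):

    import pandas as pd

    # For a level seen once, (sum - y) / (count - 1) is 0/0 -> NaN in
    # pandas, and the NaN is then replaced by the overall target mean.
    y = pd.Series([1.0, 0.0, 1.0])     # targets for levels a, b, b
    sums = pd.Series([1.0, 1.0, 1.0])  # per-row sum of y for the row's level
    counts = pd.Series([1, 2, 2])      # per-row count for the row's level
    loo = (sums - y) / (counts - 1)    # row 0 -> NaN
    print(loo.fillna(y.mean()))        # row 0 -> 0.666..., the global mean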

@janmotl (Collaborator) commented Oct 23, 2018

That is correct.

Over 400X faster while using less memory.
@jkleint (Author) commented Oct 23, 2018

Fixed the issue with missing values, and added some documentation and comments.

sigma defaults to None; set to a value > 0 for randomization.
@jkleint (Author) commented Oct 23, 2018

Also added a patch to remove the randomized arg in favor of sigma.

@janmotl (Collaborator) commented Oct 24, 2018

What if there is a missing value in the input?

    def test_leave_one_out_missing(self):
        X = pd.DataFrame(data=['1', '2', '2', '2', '3', None], columns=['col'])
        y = [1, 0, 1, 0, 1, 0]

        encoder = encoders.LeaveOneOutEncoder(impute_missing=False)
        result_fit = encoder.fit(X, y).transform(X, y)

        self.assertTrue(pd.isna(result_fit['col'][5]), 'we expect NaN or None because impute_missing=False')

Shouldn't we preserve the missing value?
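
A small standalone check (not the PR's code) of how .map() treats missing keys, which is what would make preserving NaN straightforward:

    import pandas as pd

    # Keys absent from the mapping -- including None/NaN -- come out as NaN,
    # so with impute_missing=False the encoder could simply leave them be.
    stats = pd.Series({'1': 0.5, '2': 0.25, '3': 1.0})
    col = pd.Series(['1', '2', '2', '2', '3', None])
    print(col.map(stats))   # last entry stays NaN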

@janmotl (Collaborator) commented Oct 24, 2018

Note: The suggested change depends on the outcome of #92.

@janmotl merged commit 2759d4e into scikit-learn-contrib:master on Oct 26, 2018