
Speed up LeaveOneOutEncoder with vectorization. #146


Merged
merged 2 commits into scikit-learn-contrib:master on Oct 26, 2018

Conversation

@jkleint commented Oct 23, 2018

Over 400X faster while using less memory.

Store category mappings as DataFrames, then do vectorized lookups
with .map(). fit() just computes the sum and count of y for each
level of each column of X. These mappings are stored as a dict from
column name to a DataFrame with sum and count columns. transform()
then maps each column of X through its table, plus a little
vectorized math.

There is a speed/space tradeoff, whether to store the mean of y
for each level as well as the sum and count. This was resolved
in favor of space, recomputing the mean in transform(), so that
e.g. pickled Transformers will take less space on disk.

import numpy as np
import pandas as pd
from category_encoders import LeaveOneOutEncoder

rows, cats = 1000000, 1000
X = pd.DataFrame({'x': np.random.randint(0, cats, rows).astype(str)})
y = pd.Series(np.random.rand(rows))

# old
%time LeaveOneOutEncoder().fit_transform(X, y)
CPU times: user 2min 47s, sys: 241 ms, total: 2min 48s
Wall time: 2min 48s

# new
%time LeaveOneOutEncoder().fit_transform(X, y)
CPU times: user 390 ms, sys: 17.3 ms, total: 408 ms
Wall time: 407 ms
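
For reference, here is a minimal sketch of the scheme described above, with illustrative names (not the PR's exact code):

    import pandas as pd

    class LOOSketch:
        """Illustrative leave-one-out encoder built on per-level sum/count tables."""

        def fit(self, X, y):
            y = pd.Series(y, index=X.index)
            self._global_mean = y.mean()
            # One small table per column: sum and count of y for each level.
            self._mapping = {col: y.groupby(X[col]).agg(['sum', 'count'])
                             for col in X.columns}
            return self

        def transform(self, X, y=None):
            out = X.copy()
            for col, stats in self._mapping.items():
                sums = out[col].map(stats['sum'])        # vectorized lookup
                counts = out[col].map(stats['count'])
                if y is None:
                    encoded = sums / counts              # plain per-level mean
                else:
                    yy = pd.Series(y, index=out.index)
                    encoded = (sums - yy) / (counts - 1) # leave one out
                # Unseen levels (and, per the discussion below, unique ones)
                # come out NaN; fall back to the overall target mean.
                out[col] = encoded.fillna(self._global_mean)
            return out

The speedup comes from replacing the per-row Python loop with one groupby per column at fit time and one .map() per column at transform time.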

@jkleint (Author) commented Oct 23, 2018

Minor API suggestion for LeaveOneOutEncoder: the randomized and sigma parameters are redundant. You could get rid of randomized and just have sigma default to None. If you want randomization, just set sigma > 0.
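
A hypothetical helper sketching the suggestion, assuming the multiplicative Gaussian noise the encoder applies to training-time transforms (names here are illustrative, not the encoder's API):

    import numpy as np

    # sigma=None (or 0) -> deterministic output; sigma > 0 multiplies the
    # encoded values by Gaussian noise centered at 1.
    def maybe_randomize(encoded, sigma=None, random_state=None):
        if not sigma:
            return encoded
        rng = np.random.RandomState(random_state)
        return encoded * rng.normal(1.0, sigma, size=len(encoded))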

@janmotl (Collaborator) commented Oct 23, 2018

Thank you. It always makes me happy when the size of the code decreases.

Note, however, that the behaviour changed when we encounter a unique value at training time:

    def test_leave_one_out_unique(self):
        X = pd.DataFrame(data=['1', '2', '2', '2', '3'], columns=['col'])
        y = [1, 0, 1, 0, 1]

        encoder = encoders.LeaveOneOutEncoder(impute_missing=False)
        result = encoder.fit(X, y).transform(X, y)

        self.assertFalse(result.isnull().values.any(), 'There should not be any missing value')

I am not sure we should treat a unique value as missing.

The API change makes sense. Go ahead.

@jkleint (Author) commented Oct 23, 2018

You're welcome! Glad to help.

Yeah, I didn't fully understand the algorithm when I was translating it, and I think I made a mistake. I believe the intent is that if there's only one sample for a level, we should use the overall target mean, yes? And that should happen both when y is provided and when it's not? That's a little cleaner. I can do that.
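
A toy check of that single-sample case (made-up numbers, just to show the arithmetic):

    import pandas as pd

    # For a level seen once, (sum - y) / (count - 1) is 0/0 -> NaN in
    # pandas, and the NaN is then replaced by the overall target mean.
    y = pd.Series([1.0, 0.0, 1.0])     # targets for levels a, b, b
    sums = pd.Series([1.0, 1.0, 1.0])  # per-row sum of y for the row's level
    counts = pd.Series([1, 2, 2])      # per-row count for the row's level
    loo = (sums - y) / (counts - 1)    # row 0 -> NaN
    print(loo.fillna(y.mean()))        # row 0 -> 0.666..., the global mean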

@janmotl (Collaborator) commented Oct 23, 2018

That is correct.

Over 400X faster while using less memory.
@jkleint (Author) commented Oct 23, 2018

Fixed the issue with missing values, and added some documentation and comments.

sigma defaults to None; set to a value > 0 for randomization.
@jkleint (Author) commented Oct 23, 2018

Also added a patch to remove the randomized arg in favor of sigma.

@janmotl (Collaborator) commented Oct 24, 2018

What if there is a missing value in the input?

    def test_leave_one_out_missing(self):
        X = pd.DataFrame(data=['1', '2', '2', '2', '3', None], columns=['col'])
        y = [1, 0, 1, 0, 1, 0]

        encoder = encoders.LeaveOneOutEncoder(impute_missing=False)
        result_fit = encoder.fit(X, y).transform(X, y)

        self.assertTrue(pd.isna(result_fit['col'][5]), 'we expect NaN or None because impute_missing=False')

Shouldn't we preserve the missing value?
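
A small standalone check (not the PR's code) of how .map() treats missing keys, which is what would make preserving NaN straightforward:

    import pandas as pd

    # Keys absent from the mapping -- including None/NaN -- come out as NaN,
    # so with impute_missing=False the encoder could simply leave them be.
    stats = pd.Series({'1': 0.5, '2': 0.25, '3': 1.0})
    col = pd.Series(['1', '2', '2', '2', '3', None])
    print(col.map(stats))   # last entry stays NaN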

@janmotl (Collaborator) commented Oct 24, 2018

Note: The suggested change depends on the outcome of #92.

@janmotl merged commit 2759d4e into scikit-learn-contrib:master on Oct 26, 2018