Speed up LeaveOneOutEncoder with vectorization. #146
Conversation
Minor API suggestion for
Thank you. It always makes me happy when the size of the code decreases. Note, however, that the behaviour changed when we encounter a unique value during training time:

```python
def test_leave_one_out_unique(self):
    X = pd.DataFrame(data=['1', '2', '2', '2', '3'], columns=['col'])
    y = [1, 0, 1, 0, 1]
    encoder = encoders.LeaveOneOutEncoder(impute_missing=False)
    result = encoder.fit(X, y).transform(X, y)
    self.assertFalse(result.isnull().values.any(), 'There should not be any missing value')
```

I am not sure we should treat a unique value as missing. The API change makes sense. Go ahead.
You're welcome! Glad to help. Yeah, I didn't fully understand the algorithm when I was translating it, and I think I made a mistake. I believe the intent is that if there's only one sample for a level, we should use the overall target mean, yes? And that should happen both when y is provided and when it's not? That's a little cleaner. I can do that.
That is correct. |
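In code, the agreed rule might look like the following scalar sketch (the names are illustrative, not the PR's actual vectorized implementation):

```python
def encode_single_row(level_sum: float, count: int, y_i: float, global_mean: float) -> float:
    """Hypothetical scalar version of the agreed rule: a level seen only once
    in training would make the leave-one-out denominator zero, so fall back
    to the overall target mean instead of producing a missing value."""
    if count <= 1:
        return global_mean
    return (level_sum - y_i) / (count - 1)
```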
Over 400X faster while using less memory. Store category mappings as DataFrames, then do a vectorized apply with `.map()`. `fit()` just computes the sum and count of `y` for each level of each column of `X`. These mappings are stored as a dict mapping the column name to a DataFrame with `sum` and `count` columns. `transform()` then applies the map to each column of `X`, plus a little vectorized math. There is a speed/space tradeoff: whether to store the mean of `y` for each level as well as the sum and count. This was resolved in favor of space, recomputing the mean in `transform()`, so that e.g. pickled transformers will take less space on disk.
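A minimal sketch of this approach, assuming the structure described above (`fit_loo` and `transform_loo` are hypothetical helper names; the actual PR code lives in `LeaveOneOutEncoder`):

```python
import pandas as pd

def fit_loo(X, y):
    """Sketch of fit(): per-column DataFrames holding the sum and count of y
    for each level, keyed by column name in a dict."""
    y = pd.Series(y, index=X.index)
    mapping = {col: y.groupby(X[col]).agg(['sum', 'count']) for col in X.columns}
    return mapping, y.mean()

def transform_loo(X, y, mapping, global_mean):
    """Sketch of transform() during training: leave-one-out value is
    (level_sum - y_i) / (level_count - 1), computed fully vectorized."""
    y = pd.Series(y, index=X.index)
    out = X.copy()
    for col in X.columns:
        stats = mapping[col]
        level_sum = X[col].map(stats['sum'])
        level_count = X[col].map(stats['count'])
        out[col] = (level_sum - y) / (level_count - 1)
        # A level with a single training sample would divide by zero;
        # fall back to the global target mean, per the discussion above.
        out[col] = out[col].mask(level_count == 1, global_mean)
    return out
```

For example, `transform_loo(X, y, *fit_loo(X, y))` reproduces the leave-one-out encoding on the training data; at test time (no `y`), the per-level mean `sum / count` would be used instead.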
Fixed the issue with missing values, and added some documentation and comments.
`sigma` defaults to `None`; set it to a value > 0 for randomization.
Also added a patch to remove
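As a rough illustration of the `sigma` randomization mentioned above, training-time encodings could be perturbed with multiplicative Gaussian noise (a hedged sketch; the exact noise model used by the encoder may differ):

```python
import numpy as np

def maybe_randomize(encoded, sigma=None, rng=None):
    """Multiply training-time encodings by noise drawn from N(1, sigma).
    With sigma=None (the default) the encodings are returned unchanged."""
    if sigma is None:
        return encoded
    rng = rng or np.random.default_rng()
    return encoded * rng.normal(1.0, sigma, size=len(encoded))
```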
What if there is a missing value in the input?

```python
def test_leave_one_out_missing(self):
    X = pd.DataFrame(data=['1', '2', '2', '2', '3', None], columns=['col'])
    y = [1, 0, 1, 0, 1, 0]
    encoder = encoders.LeaveOneOutEncoder(impute_missing=False)
    result_fit = encoder.fit(X, y).transform(X, y)
    self.assertTrue(pd.isna(result_fit['col'][5]), 'we expect NaN or None because impute_missing=False')
```

Shouldn't we preserve the missing value?
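Worth noting: pandas' `.map()` propagates missing and unmapped keys as NaN, so preserving missing inputs should fall out naturally when imputation is off (a small illustration, not the PR's code):

```python
import pandas as pd

s = pd.Series(['1', '2', None])
stats = pd.Series({'1': 0.5, '2': 0.25})
print(s.map(stats))  # -> 0.50, 0.25, NaN: the missing input stays missing
```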
Note: The suggested change depends on the outcome of #92.