Mismatched indexes in X,y esp. with sklearn pipelines #280

Closed
bmreiniger opened this issue Oct 30, 2020 · 5 comments · Fixed by #320
@bmreiniger
Contributor

bmreiniger commented Oct 30, 2020

When X is a numpy array but y is a pandas Series (which happens, for example, when an earlier sklearn step has converted X), the convert_input... functions called e.g. in
https://github.com/bmreiniger/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/target_encoder.py#L118
do not give the resulting pandas objects the same index. As a result, TargetEncoder, WOEEncoder, LeaveOneOutEncoder, CatBoostEncoder, and JamesSteinEncoder (any others?) miscalculate the encodings, e.g. at
https://github.com/bmreiniger/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/target_encoder.py#L172
(the groupby matches up by index).
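
The index alignment can be reproduced with pandas alone. The following is a minimal sketch, not the library's code: it assumes the array-derived X ends up with a default RangeIndex while y keeps its original index, and it stands in for the per-category target statistics with a plain groupby.

```python
import pandas as pd

# Assumption: X was rebuilt from a numpy array, so it carries a default RangeIndex.
X = pd.DataFrame({'x': ['a', 'a', 'b', 'c']})
# y keeps the index of the original DataFrame.
y = pd.Series([1, 1, 1, 0], index=[101, 105, 42, 76], name='y')

# groupby aligns the grouping Series to y's index before grouping.
# No labels are shared, so every key becomes NaN and is dropped.
stats = y.groupby(X['x']).agg(['count', 'mean'])
print(stats.empty)

# Mapping the categories through the empty statistics gives NaN for every row.
encoded = X['x'].map(stats['mean'])
print(encoded.isna().all())
```

With no shared index labels, the statistics table is empty, which is consistent with the all-NaN output reported in this issue.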

This is the cause (or at least one of the causes) of #272.

Actual Behavior

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'a', 'b', 'c'],
    'y': [1, 1, 1, 0],
})
df.index = [101, 105, 42, 76]

te = ce.TargetEncoder()
si = SimpleImputer(strategy='constant', fill_value='a')

pipe = Pipeline(steps=[
    ('impute', si),
    ('encode', te),
])
pipe.fit_transform(df[['x']], df['y'])

outputs

	0
0	NaN
1	NaN
2	NaN
3	NaN

More nefarious problems occur when the indexes partially match up so that the returned values aren't NaN but are incorrect.
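
A small pandas-only sketch of the partially matching case. It assumes, as above, that the array-derived X has a default RangeIndex; the y index here reuses the same labels in a different order (as could happen after shuffling rows), so nothing is NaN but targets are paired with the wrong categories.

```python
import pandas as pd

# Assumption: X carries a default RangeIndex (0..3) after array conversion.
X = pd.DataFrame({'x': ['a', 'a', 'b', 'c']})
y_values = [1, 1, 1, 0]

# Same labels as X's index, but in a different order.
y_shuffled = pd.Series(y_values, index=[3, 2, 1, 0])
# Reference: y lined up with X by position.
y_aligned = pd.Series(y_values, index=X.index)

# The grouping keys are aligned by label, so the shuffled y is
# silently matched to the wrong rows of X.
wrong = y_shuffled.groupby(X['x']).mean()
right = y_aligned.groupby(X['x']).mean()

print(wrong.to_dict())
print(right.to_dict())
```

Both results are fully populated, so nothing signals that the first one pairs each target with the wrong row.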

Specifications

  • Version: 2.2.2
  • Platform: Windows 10
  • Subsystem: Python 3.8.5
@bmreiniger
Contributor Author

I'd be up for making a PR, but am new to this project. I think it might be nicest to add a function that converts/checks both X, y:

  • if both pandas, check that their indexes are the same, error if not
  • if both arrays, cast to pandas with default indexes
  • if one of each, use the pandas index for the other

Thoughts?
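
The three rules above could be sketched roughly as follows. This is an illustration only: the name `convert_inputs_sketch` is hypothetical, it is not the function that exists in category_encoders, and it ignores details such as column names, 2-D targets, and sparse input.

```python
import numpy as np
import pandas as pd


def convert_inputs_sketch(X, y):
    """Hypothetical helper: return a (DataFrame, Series) pair sharing one index."""
    x_is_pandas = isinstance(X, pd.DataFrame)
    y_is_pandas = isinstance(y, pd.Series)

    if x_is_pandas and y_is_pandas:
        # Both pandas: the indexes must already agree, otherwise fail loudly.
        if not X.index.equals(y.index):
            raise ValueError("X and y are both pandas objects but their indexes differ")
        return X, y

    if x_is_pandas:
        # Only X is pandas: give y the index of X.
        return X, pd.Series(np.asarray(y), index=X.index)

    if y_is_pandas:
        # Only y is pandas: give X the index of y.
        return pd.DataFrame(np.asarray(X), index=y.index), y

    # Both array-like: cast to pandas with a default index.
    X_df = pd.DataFrame(np.asarray(X))
    return X_df, pd.Series(np.asarray(y), index=X_df.index)


# Usage: X arrives as an array (e.g. from an sklearn step), y is still a Series.
X_arr = np.array([['a'], ['a'], ['b'], ['c']], dtype=object)
y_ser = pd.Series([1, 1, 1, 0], index=[101, 105, 42, 76])
X_df, y_out = convert_inputs_sketch(X_arr, y_ser)
print(X_df.index.equals(y_out.index))
```

Raising on mismatched pandas indexes, rather than silently reindexing, keeps the caller in control of how the rows should be matched.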

@tsinggggg

Exactly. This is especially dangerous in a cross-validation setting.

@salmanea

I realized that resetting the index can solve the problem.
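
A pandas-only sketch of that workaround. It assumes the array-derived X is given a default RangeIndex, so dropping the custom labels before the data enters the pipeline makes X and y line up by position. The original index labels are lost in the process.

```python
import pandas as pd

df = pd.DataFrame(
    {'x': ['a', 'a', 'b', 'c'], 'y': [1, 1, 1, 0]},
    index=[101, 105, 42, 76],
)

# Workaround: replace the custom index with a default RangeIndex.
df_reset = df.reset_index(drop=True)

# Stand-in for X after an sklearn step turned it into an array
# and it was wrapped in a DataFrame again (default RangeIndex).
X_converted = pd.DataFrame(df_reset[['x']].to_numpy(), columns=['x'])

# The grouping keys and y now share the same index, so rows pair up correctly.
means = df_reset['y'].groupby(X_converted['x']).mean()
print(means.to_dict())
```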

@PaulWestenthanner
Collaborator

Hi @bmreiniger

Thanks for pointing this issue out. If you still want to make a PR, your help is much appreciated.
I think your approach is correct. We should use a single convert_input function that converts X and y together, so the indices match. Your suggested behaviour seems like the way to go to me. Please keep in mind that pretty much all encoders use the convert_input function, including those that do not have a target.

@bmreiniger
Contributor Author

@PaulWestenthanner I'll give it a shot, sure. And thanks for the heads up about X-only convert_input.
