Mismatched indexes in X,y esp. with sklearn pipelines #280

bmreiniger · 2020-10-30T14:40:18Z

When X is a numpy array but y is a pandas Series (which is the case e.g. when X was converted by sklearn), the convert_input... functions called e.g. in
https://github.com/bmreiniger/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/target_encoder.py#L118
don't give the resulting pandas objects the same index. This causes TargetEncoder, WOEEncoder, LeaveOneOutEncoder, CatBoostEncoder, and JamesSteinEncoder (any others?) to miscalculate the encodings, e.g. at
https://github.com/bmreiniger/category_encoders/blob/a810a4b7abfce9fc4eb7fc401e3d37f2c1c6e402/category_encoders/target_encoder.py#L172
(the groupby matches up by index).
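
To illustrate the alignment mechanism outside of category_encoders, here is a minimal pandas-only sketch. It assumes the encoder ends up with an X that carries a default RangeIndex (as happens when a numpy array is wrapped in a DataFrame) while y keeps its original labels; the variable names are illustrative and not taken from the library.

```python
import numpy as np
import pandas as pd

# y keeps the caller's index; X was rebuilt from a numpy array,
# so it gets a default RangeIndex (0, 1, 2, 3).
y = pd.Series([1, 1, 1, 0], index=[101, 105, 42, 76], name="y")
X = pd.DataFrame(np.array([["a"], ["a"], ["b"], ["c"]], dtype=object), columns=["x"])

# A Series used as a grouper is aligned to y's index first.
# None of y's labels exist in X, so every group key becomes NaN.
aligned_keys = X["x"].reindex(y.index)
print(aligned_keys.isna().all())

# NaN keys are dropped by groupby, so no group statistics are produced.
aligned_means = y.groupby(X["x"]).mean()
print(len(aligned_means))

# Grouping by the raw values (positional, no index involved) gives
# the per-category means one would expect.
positional_means = y.groupby(X["x"].to_numpy()).mean()
print(positional_means)
```

With no group statistics available, every category later maps to a missing value, which is consistent with the all-NaN output reported below.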

This is the cause (or at least one of the causes) of #272.

Actual Behavior

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'a', 'b', 'c'],
    'y': [1, 1, 1, 0],
})
df.index = [101, 105, 42, 76]

te = ce.TargetEncoder()
si = SimpleImputer(strategy='constant', fill_value='a')

pipe = Pipeline(steps=[
    ('impute', si),
    ('encode', te),
])
pipe.fit_transform(df[['x']], df['y'])

outputs

	0
0	NaN
1	NaN
2	NaN
3	NaN

More nefarious problems occur when the indexes partially match up so that the returned values aren't NaN but are incorrect.
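
A small pandas-only sketch of the partially matching case, under the assumption that y's labels overlap X's default RangeIndex but in a different order (for example after shuffling or subsetting rows). The data here is made up for illustration.

```python
import numpy as np
import pandas as pd

# X has a default RangeIndex; y's labels overlap it only partially
# and in a different order.
X = pd.DataFrame({"x": ["a", "a", "b", "b"]})
y = pd.Series([1, 1, 0, 0], index=[2, 3, 0, 10])

# Positional pairing (what the caller intended): a -> 1, b -> 0.
positional_means = y.groupby(X["x"].to_numpy()).mean()

# Index-aligned pairing (what a Series grouper does): y's labels 2 and 3
# pick up "b", label 0 picks up "a", and label 10 has no match and is dropped.
aligned_means = y.groupby(X["x"]).mean()

print(positional_means)
print(aligned_means)
```

Here the aligned result contains no NaN at all, yet the means for "a" and "b" are swapped relative to the intended pairing, so nothing flags the error downstream.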

Specifications

Version: 2.2.2
Platform: Windows 10
Subsystem: Python 3.8.5

bmreiniger · 2020-10-30T14:51:45Z

I'd be up for making a PR, but am new to this project. I think it might be nicest to add a function that converts/checks both X, y:

if both pandas, check that their indexes are the same, error if not
if both arrays, cast to pandas with default indexes
if one of each, use the pandas index for the other

Thoughts?
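
The three rules above could be sketched roughly as follows. This is only an illustration of the proposal, not the code that was eventually merged: the function name `convert_inputs`, the `"target"` series name, and the restriction to DataFrame/Series inputs are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def convert_inputs(X, y):
    """Hypothetical helper: return (DataFrame, Series) sharing one index."""
    X_is_pandas = isinstance(X, pd.DataFrame)
    y_is_pandas = isinstance(y, pd.Series)

    if X_is_pandas and y_is_pandas:
        # Both pandas: the indexes must already agree, otherwise error out.
        if not X.index.equals(y.index):
            raise ValueError("X and y have different indexes")
        return X, y
    if X_is_pandas:
        # Only X is pandas: give y the index of X.
        return X, pd.Series(np.asarray(y), index=X.index, name="target")
    if y_is_pandas:
        # Only y is pandas: give X the index of y.
        return pd.DataFrame(np.asarray(X), index=y.index), y
    # Both array-like: default RangeIndex on both sides.
    return pd.DataFrame(np.asarray(X)), pd.Series(np.asarray(y), name="target")


# The situation from the report: X is a numpy array, y keeps its labels.
X_arr = np.array([["a"], ["a"], ["b"], ["c"]], dtype=object)
y_ser = pd.Series([1, 1, 1, 0], index=[101, 105, 42, 76])

X_df, y_out = convert_inputs(X_arr, y_ser)
means = y_out.groupby(X_df[0]).mean()
print(means)
```

Reusing the pandas index for the array side keeps the rows paired by position, which is the pairing the caller had in mind when passing the two objects together.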

tsinggggg · 2020-12-10T23:32:21Z

Exactly, this is especially dangerous in a cross-validation setting.

salmanea · 2021-08-19T09:24:57Z

I realized that resetting the index can solve the problem.
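
A minimal sketch of that workaround, using plain pandas to stand in for what the encoder does internally. It assumes X reaches the encoder as a numpy array and is given a default RangeIndex there; `reset_index(drop=True)` on y then makes the labels line up with the row positions.

```python
import numpy as np
import pandas as pd

y = pd.Series([1, 1, 1, 0], index=[101, 105, 42, 76])
# Stand-in for X after an sklearn step turned it into a numpy array
# and it was wrapped in a DataFrame again (default RangeIndex).
X = pd.DataFrame(np.array([["a"], ["a"], ["b"], ["c"]], dtype=object), columns=["x"])

# Drop the original labels so y is also indexed 0..n-1.
y_reset = y.reset_index(drop=True)
print(X.index.equals(y_reset.index))

# Index alignment now coincides with positional pairing.
means = y_reset.groupby(X["x"]).mean()
print(means)
```

Note that this discards the original labels, so it only helps when nothing downstream relies on them.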

PaulWestenthanner · 2021-10-20T20:13:58Z

Hi @bmreiniger

thanks for pointing this issue out. If you still want to make a PR your help is much appreciated.
I think your approach is correct. We should just use a single convert_input function that converts both X and y together, so the indices match. Your suggested behaviour seems right to me. Note that pretty much all encoders use the convert_input function, including those that do not have a target, so please keep that in mind.

bmreiniger · 2021-10-21T02:51:40Z

@PaulWestenthanner I'll give it a shot, sure. And thanks for the heads up about X-only convert_input.

bmreiniger mentioned this issue Oct 18, 2021

Fix bad WOE scores. #304

Closed

bmreiniger mentioned this issue Oct 21, 2021

[BUG] Some encoders return NaN values #290

Closed

bmreiniger mentioned this issue Oct 24, 2021

Check array index fix #320

Merged

PaulWestenthanner closed this as completed in #320 Oct 29, 2021

bmreiniger mentioned this issue Feb 2, 2022

SklearnTransformerWrapper: cross-validation error when wrapping OneHotEncoder feature-engine/feature_engine#368

Closed

noahjgreen295 mentioned this issue Feb 19, 2022

Encoders that are f(X, y) can produce nan results when y has non-standard index and X becomes an np.ndarray feature-engine/feature_engine#376

Closed
