API: column ordering on get_dummies #12010 #17612

Closed
Giftlin opened this issue Sep 21, 2017 · 9 comments

@Giftlin
Contributor

Giftlin commented Sep 21, 2017

get_dummies moves the new dummy columns to the end of the DataFrame. What @jreback and @TomAugspurger have commented is right, but there are situations in which we need to preserve the column order. For example, in scipy model creation algorithms, the functions give preference to columns based on their order, so we have to reorder the columns explicitly. I get that this is not pandas' concern, but it would be better to have an option to preserve the order.

So I guess we should at least have an option to preserve the order. The default can stay as it is, with the new columns at the end.

@TomAugspurger
Contributor

TomAugspurger commented Sep 21, 2017 via email

@Giftlin
Contributor Author

Giftlin commented Sep 21, 2017

import pandas as pd

df = pd.DataFrame({'A': ["apple", "apple", "orange", "orange", "lemon", "lemon"],
                   'B': [98, 87, 45, 25, 12, 5]})
df = pd.get_dummies(df)
print(df)

Output:
image

Output (preserving order of columns):
image
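
Because the two outputs above were posted as screenshots, the following is a minimal sketch that reproduces the column order in question. It assumes a reasonably recent pandas; `requested_cols` is a hypothetical name for the ordering this issue asks for, written out by hand rather than produced by any pandas option.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["apple", "apple", "orange", "orange", "lemon", "lemon"],
    "B": [98, 87, 45, 25, 12, 5],
})

# Default behaviour: columns that are not encoded come first,
# and the dummy columns are appended after them.
default_cols = list(pd.get_dummies(df).columns)
print(default_cols)

# The ordering requested here: the dummies take the place of "A".
# This list is spelled out manually for illustration.
requested_cols = ["A_apple", "A_lemon", "A_orange", "B"]
reordered = pd.get_dummies(df)[requested_cols]
print(list(reordered.columns))
```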

Giftlin added a commit to Giftlin/pandas that referenced this issue Sep 21, 2017
@jreback
Contributor

jreback commented Sep 21, 2017

Categoricals already allow you to provide an order (note that these are not strictly ordered=True; rather, you are providing the order, as opposed to a lex-sort order).

And this works as expected.

In [10]: df['A_ordered'] = df['A'].astype('category', categories=['apple', 'orange', 'lemon'])

In [11]: df
Out[11]: 
        A   B A_ordered
0   apple  98     apple
1   apple  87     apple
2  orange  45    orange
3  orange  25    orange
4   lemon  12     lemon
5   lemon   5     lemon

In [12]: df.A_ordered
Out[12]: 
0     apple
1     apple
2    orange
3    orange
4     lemon
5     lemon
Name: A_ordered, dtype: category
Categories (3, object): [apple, orange, lemon]

In [13]: pd.get_dummies(df.A)
Out[13]: 
   apple  lemon  orange
0      1      0       0
1      1      0       0
2      0      0       1
3      0      0       1
4      0      1       0
5      0      1       0

In [14]: pd.get_dummies(df.A_ordered)
Out[14]: 
   apple  orange  lemon
0      1       0      0
1      1       0      0
2      0       1      0
3      0       1      0
4      0       0      1
5      0       0      1
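
The session above passes `categories=` directly to `astype`, which to my knowledge is no longer accepted in recent pandas releases. A hedged sketch of the same idea using `pd.CategoricalDtype` follows; `fruit_dtype`, `plain` and `custom` are illustrative names, not anything from the thread.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["apple", "apple", "orange", "orange", "lemon", "lemon"],
    "B": [98, 87, 45, 25, 12, 5],
})

# Supply the category order explicitly; ordered stays False (the default),
# so this only controls the listing order, not comparisons.
fruit_dtype = pd.CategoricalDtype(categories=["apple", "orange", "lemon"])
df["A_ordered"] = df["A"].astype(fruit_dtype)

plain = pd.get_dummies(df["A"])           # categories inferred and sorted
custom = pd.get_dummies(df["A_ordered"])  # categories in the supplied order

print(list(plain.columns))
print(list(custom.columns))
```

This only controls the order of the dummy columns relative to each other; it does not change where they land relative to the other columns of the DataFrame.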

jreback closed this as completed Sep 21, 2017
jreback added the Categorical and Usage Question labels Sep 21, 2017
jreback added this to the No action milestone Sep 21, 2017
@Giftlin
Contributor Author

Giftlin commented Sep 21, 2017

@jreback no, I'm not talking about column values here. It is the column I am talking about.

@jreback
Contributor

jreback commented Sep 21, 2017

In [6]: pd.concat([df.drop(['A'], axis=1), pd.get_dummies(df.A)], axis=1)
Out[6]: 
    B  apple  lemon  orange
0  98      1      0       0
1  87      1      0       0
2  45      0      0       1
3  25      0      0       1
4  12      0      1       0
5   5      0      1       0

@TomAugspurger
Contributor

I think that example only works here since A was first.

My initial reaction was to just have the user do pd.get_dummies(df)[desired_order]. Getting that order is somewhat difficult...

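def get_order(df):
    # Build the column list that keeps each set of dummies where its source column was.
    order = []
    for col in df.columns:
        # pd.api.types.is_categorical was deprecated and later removed from pandas,
        # so the dtype is checked directly instead.
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            # One name per category, matching get_dummies' default "<col>_<value>" naming.
            order.extend(['{}_{}'.format(col, val)
                          for val in df[col].cat.categories])
        else:
            order.append(col)
    return order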

That doesn't handle object types, but it wouldn't be hard to fix it to do that.
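
One way the object-dtype case might be covered is sketched below. This is an illustration under assumptions, not the fix the author had in mind: `get_dummy_order` is a hypothetical name, the default `prefix_sep="_"` is assumed, and missing values and `dummy_na` are not handled.

```python
import pandas as pd

def get_dummy_order(df):
    """Return get_dummies(df) column names in the original column order."""
    order = []
    for col in df.columns:
        dtype = df[col].dtype
        if isinstance(dtype, pd.CategoricalDtype):
            # Categoricals keep the category order they were given.
            values = dtype.categories
        elif pd.api.types.is_string_dtype(dtype):
            # Object/string columns: let Categorical infer the (sorted)
            # unique values, which is the order get_dummies produces.
            values = pd.Categorical(df[col]).categories
        else:
            # Columns that get_dummies leaves untouched.
            order.append(col)
            continue
        order.extend("{}_{}".format(col, val) for val in values)
    return order

df = pd.DataFrame({
    "A": ["apple", "apple", "orange", "orange", "lemon", "lemon"],
    "B": [98, 87, 45, 25, 12, 5],
    "C": pd.Categorical(["x", "y", "x", "y", "x", "y"], categories=["y", "x"]),
})

order = get_dummy_order(df)
result = pd.get_dummies(df)[order]
print(order)
```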

We could start with a cookbook recipe? And expand to a keyword argument if others want it? This is on the borderline of whether or not it's worth a keyword to me.

@jorisvandenbossche
Member

You can also do something like this to preserve the exact order:

In [178]: pd.concat([pd.get_dummies(df[col], prefix=col) if df[col].dtype == object else df[col] for col in df], axis=1)
Out[178]: 
   A_apple  A_lemon  A_orange   B
0        1        0         0  98
1        1        0         0  87
2        0        0         1  45
3        0        0         1  25
4        0        1         0  12
5        0        1         0   5

Only the condition if df[col].dtype == object should be adjusted to do exactly what you want (e.g. it should probably also match categorical columns).
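
A hedged sketch of that adjustment follows, with the condition widened to categorical columns as well. `dummies_in_place` is a hypothetical helper name, and treating every string-like or categorical column as something to encode is an assumption about what the caller wants.

```python
import pandas as pd

def dummies_in_place(df):
    """Encode string-like and categorical columns, keeping column positions."""
    pieces = []
    for col in df.columns:
        dtype = df[col].dtype
        to_encode = (isinstance(dtype, pd.CategoricalDtype)
                     or pd.api.types.is_string_dtype(dtype))
        if to_encode:
            # The dummies for this column are placed where the column was.
            pieces.append(pd.get_dummies(df[col], prefix=col))
        else:
            pieces.append(df[col])
    return pd.concat(pieces, axis=1)

# "A" sits between two numeric columns so the in-place behaviour is visible.
df = pd.DataFrame({
    "B": [98, 87, 45, 25, 12, 5],
    "A": ["apple", "apple", "orange", "orange", "lemon", "lemon"],
    "C": [1, 2, 3, 4, 5, 6],
})

encoded = dummies_in_place(df)
print(list(encoded.columns))
```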

@kimsin98

kimsin98 commented Aug 23, 2021

This needs to be reopened. @jreback's response is irrelevant. What @Giftlin proposed is inserting the new dummy columns where the categorical column used to be, rather than appending them to the end. It has nothing to do with the order of items within categoricals.

I would even go as far as saying users probably expect column orders to be preserved when using get_dummies, so this should be the default.

@greenare

@jreback you totally misunderstood what is going on. Please reopen this issue.
