Open
Labels
Categorical (Categorical Data Type), Enhancement, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Strings (String extension data type and string data)
Description
Hello pandas team, and thanks for making this package better day after day.
I was using the str.get_dummies method on a DataFrame and realized that, by default, the dummies are encoded as int64.
This seems very inefficient to me: I ran into a memory error when trying to get dummies for a DataFrame with several million rows (and about 5k dummies). I had to create the dummies in chunks and use to_numeric() to coerce them to int8.
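For readers hitting the same problem, the chunked workaround described above might look roughly like the sketch below. It is only an illustration under assumptions: the helper name get_dummies_int8, the "|" separator, and the chunk size are hypothetical, and astype("int8") is used in place of to_numeric() for brevity.

```python
import pandas as pd


def get_dummies_int8(series, sep="|", chunk_size=100_000):
    """Hypothetical helper: build str.get_dummies output chunk by chunk as int8."""
    parts = []
    for start in range(0, len(series), chunk_size):
        chunk = series.iloc[start:start + chunk_size]
        # Downcast each chunk right away so the wide integer frame is short-lived.
        parts.append(chunk.str.get_dummies(sep=sep).astype("int8"))

    if not parts:
        return pd.DataFrame(index=series.index)

    # Chunks can yield different dummy columns, so align them on the union.
    all_cols = sorted(set().union(*(p.columns for p in parts)))
    aligned = [
        # Cast again: columns added by reindex are not guaranteed to be int8.
        p.reindex(columns=all_cols, fill_value=0).astype("int8")
        for p in parts
    ]
    return pd.concat(aligned)


# Small illustrative input; the missing value becomes a row of zeros.
example = pd.Series(["a|b", None, "c", "a"])
dummies = get_dummies_int8(example, chunk_size=2)
print(dummies.dtypes)
```

Aligning every chunk on the full column set before concatenating avoids the NaN padding (and the upcast to float) that concatenating mismatched chunks would otherwise introduce.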
Would it be possible to have the dummies natively in int8 format so that they take up very little space? In that case NaN would be coerced to 0, but that should be fine.
What do you think?
Thanks!