Closed
Labels: API Design; Categorical (Categorical Data Type); Internals (Related to non-user accessible pandas implementation); Performance (Memory or execution speed performance)
Description
tl;dr - add true support for Categoricals in NDFrame.
There was an issue on the mailing list about using cut and sorting the results that brought this to mind. The issue is that (I believe) a categorical loses its representation when you put it in a DataFrame, so the output of cut has to be plain strings. I propose the following:
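
To make the sorting problem concrete, here is a minimal pure-Python sketch (not pandas code). The helper `cut_codes` and the sample bins are hypothetical and only mimic the idea of cut: each value gets an integer code pointing into an ordered list of interval labels. I'm assuming this is roughly the situation from the mailing list; once only the label strings survive, sorting falls back to lexical order.

```python
from bisect import bisect_left


def cut_codes(values, bin_edges):
    """Hypothetical stand-in for cut: bin values into right-closed intervals.

    Returns (codes, labels), where codes[i] indexes into labels.
    Values at or below the first edge get code -1 (no bin).
    """
    labels = [f"({lo}, {hi}]" for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
    codes = [bisect_left(bin_edges, v) - 1 for v in values]
    return codes, labels


values = [50, 3, 7]
codes, labels = cut_codes(values, [0, 5, 10, 100])

# What survives today (per the description above): only the label strings.
as_strings = [labels[c] for c in codes]
sorted_as_strings = sorted(as_strings)  # lexical order, "(10, ..." before "(5, ..."

# What a categorical representation keeps: integer codes plus ordered labels.
sorted_by_code = [labels[c] for c in sorted(codes)]  # interval order
```

Sorting the codes preserves the interval order, while sorting the strings puts `(10, 100]` ahead of `(5, 10]`.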
- Add a `CategoricalBlock` (or `FactorBlock`) internally that can handle categoricals like those produced from cut. It could share most of MultiIndex's (MI's) internals: a 2D int ndarray with an associated list of indexes for each column (again, nearly the same as MI, except most ops would work on just one 'level' and the underlying data could/would be 2D rather than a list of Int64Index). This would probably also mean abstracting common operations into a separate mixin class.
- Change `Categorical` to be a Series subclass with a SingleBlockManager that's a CategoricalBlock. This would not change its API, but it would gain Series methods.
- Add a `to_categorical` method to Series (bonus points if we change convert_objects to detect when there are fewer than Some_Max labels and convert object dtypes to categoricals).
- Add a registration method to make_block so it iterates over a set of functions that each return a klass or None before falling back to ObjectBlock (so abstract the existing else clause into a function and make the list of functions semi-public).
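
The registration idea in the last bullet could look roughly like the toy sketch below. Everything here is a simplified stand-in: the classes, `register_block_detector`, `CategoricalValues`, and this one-argument `make_block` are hypothetical and do not match pandas' real internals or signatures. It only shows the dispatch pattern of trying each registered function and falling back to ObjectBlock.

```python
class ObjectBlock:
    """Toy fallback block that just holds its values."""

    def __init__(self, values):
        self.values = values


class CategoricalBlock(ObjectBlock):
    """Toy block for categorical data (integer codes plus categories)."""


class CategoricalValues:
    """Hypothetical container: integer codes indexing into categories."""

    def __init__(self, codes, categories):
        self.codes = codes
        self.categories = categories


# Semi-public list of detector functions; each returns a block class or None.
block_detectors = []


def register_block_detector(func):
    """Append a detector to the list; usable as a decorator."""
    block_detectors.append(func)
    return func


def make_block(values):
    """Try each registered detector in order, then fall back to ObjectBlock."""
    for detect in block_detectors:
        klass = detect(values)
        if klass is not None:
            return klass(values)
    return ObjectBlock(values)


@register_block_detector
def detect_categorical(values):
    return CategoricalBlock if isinstance(values, CategoricalValues) else None


cat_block = make_block(CategoricalValues([0, 1, 0], ["low", "high"]))
obj_block = make_block(["any", "other", "values"])
```

With this shape, adding a new block type would not require touching `make_block` itself, only registering another detector.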
I'm going to work on this and I don't think it will be that difficult to implement, but it would make pandas more useful for representing level sets and other normalized data.