@@ -150,10 +150,10 @@ constructor to save the factorize step during normal constructor mode:
150150 splitter = np.random.choice([0, 1], 5, p=[0.5, 0.5])
151151 s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
152152
153- .. _categorical.objectcreation.frame:
153+ .. _categorical.objectcreation.existingframe:
154154
155- Creating categories from a ``DataFrame``
156- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
155+ Creating categories from an existing ``DataFrame``
156+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
157157
158158 .. versionadded:: 0.22.0
159159
@@ -169,15 +169,6 @@ if a column does not contain all labels:
169169 df['A'].dtype
170170 df['B'].dtype
171171
172- Note that this behavior is different than instantiating a ``DataFrame`` with categorical dtype, which will only assign
173- categories to each column based on the labels present in each column:
174-
175- .. ipython:: python
176-
177- df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
178- df['A'].dtype
179- df['B'].dtype
180-
181172 When using ``astype``, you can control the categories that will be present in each column by passing
182173 a ``CategoricalDtype``:
183174
@@ -199,6 +190,72 @@ discussed hold with subselection.
199190 df[['A', 'B']] = df[['A', 'B']].astype('category')
200191 df.dtypes
201192
193+ Note that you can use ``apply`` to set categories on a per-column basis:
194+
195+ .. ipython:: python
196+
197+ df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']})
198+ df = df.apply(lambda x: x.astype('category'))
199+ df['A'].dtype
200+ df['B'].dtype
201+
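As a minimal sketch of the per-column conversion above, the resulting categories can be inspected through the ``.cat`` accessor. The variable names ``per_column``, ``categories_a`` and ``categories_b`` are illustrative and not part of the patch:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c"], "B": ["c", "d", "e"]})

# Convert each column on its own, so each column only sees its own labels.
per_column = df.apply(lambda col: col.astype("category"))

# Collect the categories of each column for inspection.
categories_a = list(per_column["A"].cat.categories)
categories_b = list(per_column["B"].cat.categories)
print(categories_a, categories_b)
```

Because each column is converted independently, the two columns end up with different category sets even though they share the label ``'c'``.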
202+
203+ .. _categorical.objectcreation.frameconstructor:
204+
205+ Creating categories from the ``DataFrame`` constructor
206+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
207+
208+ .. versionchanged:: 0.22.0
209+
210+ .. warning::
211+
212+ Prior to version 0.22.0, the ``DataFrame`` constructor, when passed a categorical dtype, operated on a
213+ per-column basis by default: only the labels present in a given column became categories
214+ for that column.
215+
216+ To promote consistent behavior, from version 0.22.0 onwards instantiating a ``DataFrame`` with a categorical
217+ dtype will by default use all labels present in all columns when setting categories, even if a column does not
218+ contain all labels. This is consistent with the new ``astype`` behavior described above.
219+
220+ Behavior prior to version 0.22.0:
221+
222+ .. code-block:: ipython
223+
224+ In [2]: df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
225+
226+ In [3]: df
227+ Out[3]:
228+ A B
229+ 0 a c
230+ 1 b d
231+ 2 c e
232+
233+ In [4]: df['A'].dtype
234+ Out[4]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
235+
236+ In [5]: df['B'].dtype
237+ Out[5]: CategoricalDtype(categories=['c', 'd', 'e'], ordered=False)
238+
239+ Behavior from version 0.22.0 onwards:
240+
241+ .. ipython:: python
242+
243+ df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
244+ df
245+ df['A'].dtype
246+ df['B'].dtype
247+
248+ As with ``astype``, you can control the categories that will be present in each column by passing
249+ a ``CategoricalDtype``:
250+
251+ .. ipython:: python
252+
253+ dtype = CategoricalDtype(categories=list('abdef'), ordered=True)
254+ df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype=dtype)
255+ df
256+ df['A'].dtype
257+ df['B'].dtype
258+
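The effect of an explicit ``CategoricalDtype`` can be sketched as follows. This is a hedged illustration, not part of the patch: it uses ``astype`` on an existing frame instead of the constructor (an assumption, made so the sketch does not depend on the constructor behavior this change introduces), and the names ``converted``, ``missing_a``, ``missing_b`` and ``same_dtype`` are hypothetical. It shows that every column receives the same dtype and that labels absent from ``categories`` (here ``'c'``) become missing values:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same dtype as in the example above; note that 'c' is not a category.
dtype = CategoricalDtype(categories=list("abdef"), ordered=True)
df = pd.DataFrame({"A": ["a", "b", "c"], "B": ["c", "d", "e"]})

# Assumption: convert with astype rather than the constructor.
converted = df.astype(dtype)

# Labels that are not listed in the categories are converted to NaN.
missing_a = converted["A"].isna().tolist()
missing_b = converted["B"].isna().tolist()

# Every column shares the single dtype that was passed in.
same_dtype = all(converted[col].dtype == dtype for col in converted.columns)
print(converted)
```

Passing the dtype explicitly makes the categories independent of the labels that happen to appear in the data, at the cost of silently turning unlisted labels into missing values.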
202259 .. _categorical.categoricaldtype:
203260
204261CategoricalDtype