diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst
index 8ca96ba0daa5e..5443f24161f67 100644
--- a/doc/source/user_guide/categorical.rst
+++ b/doc/source/user_guide/categorical.rst
@@ -797,37 +797,52 @@ Assigning a ``Categorical`` to parts of a column of other types will use the val
     df.dtypes
 
 .. _categorical.merge:
+.. _categorical.concat:
 
-Merging
-~~~~~~~
+Merging / Concatenation
+~~~~~~~~~~~~~~~~~~~~~~~
 
-You can concat two ``DataFrames`` containing categorical data together,
-but the categories of these categoricals need to be the same:
+By default, combining ``Series`` or ``DataFrames`` which contain the same
+categories results in ``category`` dtype, otherwise results will depend on the
+dtype of the underlying categories. Merges that result in non-categorical
+dtypes will likely have higher memory usage. Use ``.astype`` or
+``union_categoricals`` to ensure ``category`` results.
 
 .. ipython:: python
 
-    cat = pd.Series(["a", "b"], dtype="category")
-    vals = [1, 2]
-    df = pd.DataFrame({"cats": cat, "vals": vals})
-    res = pd.concat([df, df])
-    res
-    res.dtypes
+    from pandas.api.types import union_categoricals
 
-In this case the categories are not the same, and therefore an error is raised:
+    # same categories
+    s1 = pd.Series(['a', 'b'], dtype='category')
+    s2 = pd.Series(['a', 'b', 'a'], dtype='category')
+    pd.concat([s1, s2])
 
-.. ipython:: python
+    # different categories
+    s3 = pd.Series(['b', 'c'], dtype='category')
+    pd.concat([s1, s3])
 
-    df_different = df.copy()
-    df_different["cats"].cat.categories = ["c", "d"]
-    try:
-        pd.concat([df, df_different])
-    except ValueError as e:
-        print("ValueError:", str(e))
+    # Output dtype is inferred based on categories values
+    int_cats = pd.Series([1, 2], dtype="category")
+    float_cats = pd.Series([3.0, 4.0], dtype="category")
+    pd.concat([int_cats, float_cats])
+
+    pd.concat([s1, s3]).astype('category')
+    union_categoricals([s1.array, s3.array])
 
-The same applies to ``df.append(df_different)``.
+The following table summarizes the results of merging ``Categoricals``:
 
-See also the section on :ref:`merge dtypes` for notes about preserving merge dtypes and performance.
++-------------------+------------------------+----------------------+-----------------------------+
+| arg1              | arg2                   | identical            | result                      |
++===================+========================+======================+=============================+
+| category          | category               | True                 | category                    |
++-------------------+------------------------+----------------------+-----------------------------+
+| category (object) | category (object)      | False                | object (dtype is inferred)  |
++-------------------+------------------------+----------------------+-----------------------------+
+| category (int)    | category (float)       | False                | float (dtype is inferred)   |
++-------------------+------------------------+----------------------+-----------------------------+
 
+See also the section on :ref:`merge dtypes` for notes about
+preserving merge dtypes and performance.
 
 .. _categorical.union:
 
@@ -920,46 +935,6 @@ the resulting array will always be a plain ``Categorical``:
     # "b" is coded to 0 throughout, same as c1, different from c2
     c.codes
 
-.. _categorical.concat:
-
-Concatenation
-~~~~~~~~~~~~~
-
-This section describes concatenations specific to ``category`` dtype. See :ref:`Concatenating objects` for general description.
-
-By default, ``Series`` or ``DataFrame`` concatenation which contains the same categories
-results in ``category`` dtype, otherwise results in ``object`` dtype.
-Use ``.astype`` or ``union_categoricals`` to get ``category`` result.
-
-.. ipython:: python
-
-    # same categories
-    s1 = pd.Series(['a', 'b'], dtype='category')
-    s2 = pd.Series(['a', 'b', 'a'], dtype='category')
-    pd.concat([s1, s2])
-
-    # different categories
-    s3 = pd.Series(['b', 'c'], dtype='category')
-    pd.concat([s1, s3])
-
-    pd.concat([s1, s3]).astype('category')
-    union_categoricals([s1.array, s3.array])
-
-
-Following table summarizes the results of ``Categoricals`` related concatenations.
-
-+----------+--------------------------------------------------------+----------------------------+
-| arg1     | arg2                                                   | result                     |
-+==========+========================================================+============================+
-| category | category (identical categories)                        | category                   |
-+----------+--------------------------------------------------------+----------------------------+
-| category | category (different categories, both not ordered)      | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
-| category | category (different categories, either one is ordered) | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
-| category | not category                                           | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
-
 Getting data in/out
 -------------------
 
diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst
index 4c0d3b75a4f79..dca744827477f 100644
--- a/doc/source/user_guide/merging.rst
+++ b/doc/source/user_guide/merging.rst
@@ -883,7 +883,7 @@ The merged result:
 .. note::
 
    The category dtypes must be *exactly* the same, meaning the same categories and the ordered attribute.
-   Otherwise the result will coerce to ``object`` dtype.
+   Otherwise the result will coerce to the categories' dtype.
 
 .. note::
 