|
| 1 | +.. _pymongo-arrow-data-types: |
| 2 | + |
| 3 | +========== |
| 4 | +Data Types |
| 5 | +========== |
| 6 | + |
| 7 | +.. contents:: On this page |
| 8 | + :local: |
| 9 | + :backlinks: none |
| 10 | + :depth: 1 |
| 11 | + :class: singlecol |
| 12 | + |
| 13 | +.. facet:: |
| 14 | + :name: genre |
| 15 | + :values: reference |
| 16 | + |
| 17 | +.. meta:: |
| 18 | + :keywords: support, conversions |
| 19 | + |
| 20 | +{+driver-short+} supports a majority of the BSON types. |
| 21 | +Because Arrow and Polars provide first-class support for lists and structs, |
| 22 | +this support includes embedded arrays and documents. |
| 23 | + |
| 24 | +Support for additional types will be added in subsequent releases. |
| 25 | + |
| 26 | +.. tip:: |
| 27 | + |
| 28 | + For more information about BSON types, see the |
| 29 | + `BSON specification <http://bsonspec.org/spec.html>`__. |
| 30 | + |
| 31 | +.. list-table:: |
| 32 | + :widths: 20 60 |
| 33 | + :header-rows: 1 |
| 34 | + |
| 35 | + * - BSON Type |
| 36 | + - Type Identifiers |
| 37 | + * - String |
| 38 | + - ``py.str``, an instance of ``pyarrow.string`` |
| 39 | + * - Embedded document |
| 40 | + - ``py.dict``, an instance of ``pyarrow.struct`` |
| 41 | + * - Embedded array |
| 42 | + - An instance of ``pyarrow.list_`` |
| 43 | + * - ObjectId |
| 44 | + - ``py.bytes``, ``bson.ObjectId``, an instance of ``pymongoarrow.types.ObjectIdType``, an instance of ``pymongoarrow.pandas_types.PandasObjectId`` |
| 45 | + * - Decimal128 |
| 46 | + - ``bson.Decimal128``, an instance of ``pymongoarrow.types.Decimal128Type``, an instance of ``pymongoarrow.pandas_types.PandasDecimal128`` |
| 47 | + * - Boolean |
| 48 | + - An instance of ``pyarrow.bool_``, ``py.bool`` |
| 49 | + * - 64-bit binary floating point |
| 50 | + - ``py.float``, an instance of ``pyarrow.float64`` |
| 51 | + * - 32-bit integer |
| 52 | + - An instance of ``pyarrow.int32`` |
| 53 | + * - 64-bit integer |
| 54 | + - ``py.int``, ``bson.int64.Int64``, an instance of ``pyarrow.int64`` |
| 55 | + * - UTC datetime |
| 56 | + - An instance of ``pyarrow.timestamp`` with ``ms`` resolution, ``py.datetime.datetime`` |
| 57 | + * - Binary data |
| 58 | + - ``bson.Binary``, an instance of ``pymongoarrow.types.BinaryType``, an instance of ``pymongoarrow.pandas_types.PandasBinary`` |
| 59 | + * - JavaScript code |
| 60 | + - ``bson.Code``, an instance of ``pymongoarrow.types.CodeType``, an instance of ``pymongoarrow.pandas_types.PandasCode`` |
| 61 | + |
| 62 | +.. note:: |
| 63 | + |
| 64 | + {+driver-short+} supports ``Decimal128`` only on little-endian systems. On |
| 65 | + big-endian systems, it uses ``null`` instead. |
| 66 | + |
| 67 | +Use type identifiers to specify the type of a field when you declare a |
| 68 | +``pymongoarrow.api.Schema``. For example, if your data has the fields ``f1`` |
| 69 | +and ``f2``, which hold a 32-bit integer and a UTC datetime respectively, and an |
| 70 | +``_id`` field that is an ``ObjectId``, you can define your schema as follows: |
| 71 | + |
| 72 | +.. code-block:: python |
| 73 | + |
| 74 | + schema = Schema({ |
| 75 | + '_id': ObjectId, |
| 76 | + 'f1': pyarrow.int32(), |
| 77 | + 'f2': pyarrow.timestamp('ms') |
| 78 | + }) |
| 79 | + |
| 80 | +If a schema contains an unsupported data type, {+driver-short+} raises a |
| 81 | +``ValueError`` that identifies the field and its data type. |
| 82 | + |
| 83 | +Embedded Array Considerations |
| 84 | +----------------------------- |
| 85 | + |
| 86 | +The schema for an embedded array must use the ``pyarrow.list_()`` type to specify |
| 87 | +the type of the array elements, as shown in the following example: |
| 88 | + |
| 89 | +.. code-block:: python |
| 90 | + |
| 91 | + from pyarrow import list_, float64 |
| 92 | + schema = Schema({'_id': ObjectId, |
| 93 | + 'location': {'coordinates': list_(float64())} |
| 94 | + }) |
| 95 | + |
| 96 | +Extension Types |
| 97 | +--------------- |
| 98 | + |
| 99 | +{+driver-short+} implements the ``ObjectId``, ``Decimal128``, ``Binary data``, |
| 100 | +and ``JavaScript code`` types as extension types for PyArrow and pandas. |
| 101 | +In Arrow tables, values of these types have the corresponding |
| 102 | +``pymongoarrow`` extension type, such as ``pymongoarrow.types.ObjectIdType``. |
| 103 | +You can obtain the corresponding ``bson`` Python object by calling the ``.as_py()`` |
| 104 | +method on a value, or by calling ``.to_pylist()`` on the table. |
| 105 | + |
| 106 | +.. code-block:: pycon |
| 107 | + |
| 108 | + >>> from pymongo import MongoClient |
| 109 | + >>> from bson import ObjectId |
| 110 | + >>> from pymongoarrow.api import find_arrow_all |
| 111 | + >>> client = MongoClient() |
| 112 | + >>> coll = client.test.test |
| 113 | + >>> coll.insert_many([{"_id": ObjectId(), "foo": 100}, {"_id": ObjectId(), "foo": 200}]) |
| 114 | + <pymongo.results.InsertManyResult at 0x1080a72b0> |
| 115 | + >>> table = find_arrow_all(coll, {}) |
| 116 | + >>> table |
| 117 | + pyarrow.Table |
| 118 | + _id: extension<arrow.py_extension_type<ObjectIdType>> |
| 119 | + foo: int32 |
| 120 | + ---- |
| 121 | + _id: [[64408B0D5AC9E208AF220142,64408B0D5AC9E208AF220143]] |
| 122 | + foo: [[100,200]] |
| 123 | + >>> table["_id"][0] |
| 124 | + <pyarrow.ObjectIdScalar: ObjectId('64408b0d5ac9e208af220142')> |
| 125 | + >>> table["_id"][0].as_py() |
| 126 | + ObjectId('64408b0d5ac9e208af220142') |
| 127 | + >>> table.to_pylist() |
| 128 | + [{'_id': ObjectId('64408b0d5ac9e208af220142'), 'foo': 100}, |
| 129 | + {'_id': ObjectId('64408b0d5ac9e208af220143'), 'foo': 200}] |
| 130 | + |
| 131 | +When you convert to pandas, the extension type columns have the corresponding |
| 132 | +``pymongoarrow`` extension type, such as |
| 133 | +``pymongoarrow.pandas_types.PandasDecimal128``. Each element in the |
| 134 | +DataFrame is a value of the corresponding ``bson`` type. |
| 135 | + |
| 136 | +.. code-block:: pycon |
| 137 | + |
| 138 | + >>> from pymongo import MongoClient |
| 139 | + >>> from bson import Decimal128 |
| 140 | + >>> from pymongoarrow.api import find_pandas_all |
| 141 | + >>> client = MongoClient() |
| 142 | + >>> coll = client.test.test |
| 143 | + >>> coll.insert_many([{"foo": Decimal128("0.1")}, {"foo": Decimal128("0.1")}]) |
| 144 | + <pymongo.results.InsertManyResult at 0x1080a72b0> |
| 145 | + >>> df = find_pandas_all(coll, {}) |
| 146 | + >>> df |
| 147 | + _id foo |
| 148 | + 0 64408bf65ac9e208af220144 0.1 |
| 149 | + 1 64408bf65ac9e208af220145 0.1 |
| 150 | + >>> df["foo"].dtype |
| 151 | + <pymongoarrow.pandas_types.PandasDecimal128 at 0x11fe0ae90> |
| 152 | + >>> df["foo"][0] |
| 153 | + Decimal128('0.1') |
| 154 | + >>> df["_id"][0] |
| 155 | + ObjectId('64408bf65ac9e208af220144') |
| 156 | + |
| 157 | +Polars does not support extension types. |
| 158 | + |
| 159 | +Null Values and Conversion to Pandas DataFrames |
| 160 | +----------------------------------------------- |
| 161 | + |
| 162 | +In Arrow and Polars, all arrays are nullable. |
| 163 | +By contrast, pandas provides experimental nullable data types, such as ``Int64``. |
| 164 | +You can instruct Arrow to create a pandas DataFrame that uses nullable dtypes |
| 165 | +by using the following code from the `Apache Arrow documentation <https://arrow.apache.org/docs/python/pandas.html>`__: |
| 166 | + |
| 167 | +.. code-block:: pycon |
| 168 | + |
| 169 | + >>> dtype_mapping = { |
| 170 | + ... pa.int8(): pd.Int8Dtype(), |
| 171 | + ... pa.int16(): pd.Int16Dtype(), |
| 172 | + ... pa.int32(): pd.Int32Dtype(), |
| 173 | + ... pa.int64(): pd.Int64Dtype(), |
| 174 | + ... pa.uint8(): pd.UInt8Dtype(), |
| 175 | + ... pa.uint16(): pd.UInt16Dtype(), |
| 176 | + ... pa.uint32(): pd.UInt32Dtype(), |
| 177 | + ... pa.uint64(): pd.UInt64Dtype(), |
| 178 | + ... pa.bool_(): pd.BooleanDtype(), |
| 179 | + ... pa.float32(): pd.Float32Dtype(), |
| 180 | + ... pa.float64(): pd.Float64Dtype(), |
| 181 | + ... pa.string(): pd.StringDtype(), |
| 182 | + ... } |
| 183 | + >>> df = arrow_table.to_pandas( |
| 184 | + ... types_mapper=dtype_mapping.get, split_blocks=True, self_destruct=True |
| 185 | + ... ) |
| 186 | + >>> del arrow_table |
| 187 | + |
| 188 | +Defining a conversion for ``pa.string()`` also converts Arrow strings to NumPy strings instead of objects. |
| 189 | + |
| 190 | +Nested Extension Types |
| 191 | +---------------------- |
| 192 | + |
| 193 | +Until `ARROW-179 <https://jira.mongodb.org/browse/ARROW-179>`__ is resolved, extension |
| 194 | +types, such as ``ObjectId``, that appear in nested documents are not |
| 195 | +converted to the corresponding {+driver-short+} extension type. Instead, they |
| 196 | +have the raw Arrow type, ``FixedSizeBinaryType(fixed_size_binary[12])``. |
| 197 | + |
| 198 | +You can use these values as-is or convert them individually to the |
| 199 | +desired extension type, such as ``_id = out['nested'][0]['_id'].cast(ObjectIdType())``. |