Commit ca9179e

cleanup data-types page (#2)
1 parent 0cbc1b8 commit ca9179e

3 files changed

+200
-185
lines changed

source/data-types.txt

Lines changed: 199 additions & 0 deletions
@@ -0,0 +1,199 @@
1+
.. _pymongo-arrow-data-types:
2+
3+
==========
4+
Data Types
5+
==========
6+
7+
.. contents:: On this page
8+
:local:
9+
:backlinks: none
10+
:depth: 1
11+
:class: singlecol
12+
13+
.. facet::
14+
:name: genre
15+
:values: reference
16+
17+
.. meta::
18+
:keywords: support, conversions
19+
20+
{+driver-short+} supports a majority of the BSON types.
21+
Because Arrow and Polars provide first-class support for Lists and Structs,
22+
this includes embedded arrays and documents.
23+
24+
Support for additional types will be added in subsequent releases.
25+
26+
.. tip::
27+
28+
For more information about BSON types, see the
29+
`BSON specification <http://bsonspec.org/spec.html>`__.
30+
31+
.. list-table::
32+
:widths: 20 60
33+
:header-rows: 1
34+
35+
* - BSON Type
36+
- Type Identifiers
37+
* - String
38+
- ``py.str``, an instance of ``pyarrow.string``
39+
* - Embedded document
40+
- ``py.dict``, an instance of ``pyarrow.struct``
41+
* - Embedded array
42+
- An instance of ``pyarrow.list_``
43+
* - ObjectId
44+
- ``py.bytes``, ``bson.ObjectId``, an instance of ``pymongoarrow.types.ObjectIdType``, an instance of ``pymongoarrow.pandas_types.PandasObjectId``
45+
* - Decimal128
46+
- ``bson.Decimal128``, an instance of ``pymongoarrow.types.Decimal128Type``, an instance of ``pymongoarrow.pandas_types.PandasDecimal128``
47+
* - Boolean
48+
- An instance of ``pyarrow.bool_``, ``py.bool``
49+
* - 64-bit binary floating point
50+
- ``py.float``, an instance of ``pyarrow.float64``
51+
* - 32-bit integer
52+
- An instance of ``pyarrow.int32``
53+
* - 64-bit integer
54+
- ``py.int``, ``bson.int64.Int64``, an instance of ``pyarrow.int64``
55+
* - UTC datetime
56+
- An instance of ``pyarrow.timestamp`` with ``ms`` resolution, ``py.datetime.datetime``
57+
* - Binary data
58+
- ``bson.Binary``, an instance of ``pymongoarrow.types.BinaryType``, an instance of ``pymongoarrow.pandas_types.PandasBinary``
59+
* - JavaScript code
60+
- ``bson.Code``, an instance of ``pymongoarrow.types.CodeType``, an instance of ``pymongoarrow.pandas_types.PandasCode``
61+
62+
.. note::
63+
64+
{+driver-short+} supports ``Decimal128`` only on little-endian systems. On
65+
big-endian systems, it uses ``null`` instead.
66+
67+
Use type identifiers to specify that a field is of a certain type
68+
during ``pymongoarrow.api.Schema`` declaration. For example, if your data
69+
has fields ``f1`` and ``f2`` bearing types 32-bit integer and UTC datetime, and
70+
an ``_id`` that is an ``ObjectId``, you can define your schema as follows:
71+
72+
.. code-block:: python
73+
74+
schema = Schema({
75+
'_id': ObjectId,
76+
'f1': pyarrow.int32(),
77+
'f2': pyarrow.timestamp('ms')
78+
})
79+
80+
Unsupported data types in a schema cause a ``ValueError`` identifying the
81+
field and its data type.
82+
83+
Embedded Array Considerations
84+
-----------------------------
85+
86+
The schema for an embedded array must use the ``pyarrow.list_()`` type to specify
87+
the type of the array elements. For example:
88+
89+
.. code-block:: python
90+
91+
from pyarrow import list_, float64
92+
schema = Schema({'_id': ObjectId,
93+
'location': {'coordinates': list_(float64())}
94+
})
95+
96+
Extension Types
97+
---------------
98+
99+
{+driver-short+} implements the ``ObjectId``, ``Decimal128``, ``Binary data``,
100+
and ``JavaScript code`` types as extension types for PyArrow and Pandas.
101+
For Arrow tables, values of these types have the appropriate
102+
``pymongoarrow`` extension type, such as ``pymongoarrow.types.ObjectIdType``.
103+
You can obtain the appropriate ``bson`` Python object by using the ``.as_py()``
104+
method, or by calling ``.to_pylist()`` on the table.
105+
106+
.. code-block:: python
107+
108+
>>> from pymongo import MongoClient
109+
>>> from bson import ObjectId
110+
>>> from pymongoarrow.api import find_arrow_all
111+
>>> client = MongoClient()
112+
>>> coll = client.test.test
113+
>>> coll.insert_many([{"_id": ObjectId(), "foo": 100}, {"_id": ObjectId(), "foo": 200}])
114+
<pymongo.results.InsertManyResult at 0x1080a72b0>
115+
>>> table = find_arrow_all(coll, {})
116+
>>> table
117+
pyarrow.Table
118+
_id: extension<arrow.py_extension_type<ObjectIdType>>
119+
foo: int32
120+
----
121+
_id: [[64408B0D5AC9E208AF220142,64408B0D5AC9E208AF220143]]
122+
foo: [[100,200]]
123+
>>> table["_id"][0]
124+
<pyarrow.ObjectIdScalar: ObjectId('64408b0d5ac9e208af220142')>
125+
>>> table["_id"][0].as_py()
126+
ObjectId('64408b0d5ac9e208af220142')
127+
>>> table.to_pylist()
128+
[{'_id': ObjectId('64408b0d5ac9e208af220142'), 'foo': 100},
129+
{'_id': ObjectId('64408b0d5ac9e208af220143'), 'foo': 200}]
130+
131+
When converting to pandas, the extension type columns have an appropriate
132+
``pymongoarrow`` extension type, such as
133+
``pymongoarrow.pandas_types.PandasDecimal128``. The value of the element in the
134+
dataframe is the appropriate ``bson`` type.
135+
136+
.. code-block:: python
137+
138+
>>> from pymongo import MongoClient
139+
>>> from bson import Decimal128
140+
>>> from pymongoarrow.api import find_pandas_all
141+
>>> client = MongoClient()
142+
>>> coll = client.test.test
143+
>>> coll.insert_many([{"foo": Decimal128("0.1")}, {"foo": Decimal128("0.1")}])
144+
<pymongo.results.InsertManyResult at 0x1080a72b0>
145+
>>> df = find_pandas_all(coll, {})
146+
>>> df
147+
_id foo
148+
0 64408bf65ac9e208af220144 0.1
149+
1 64408bf65ac9e208af220145 0.1
150+
>>> df["foo"].dtype
151+
<pymongoarrow.pandas_types.PandasDecimal128 at 0x11fe0ae90>
152+
>>> df["foo"][0]
153+
Decimal128('0.1')
154+
>>> df["_id"][0]
155+
ObjectId('64408bf65ac9e208af220144')
156+
157+
Polars does not support Extension Types.
158+
159+
Null Values and Conversion to Pandas DataFrames
160+
-----------------------------------------------
161+
162+
In Arrow and Polars, all Arrays are nullable.
163+
Pandas has experimental nullable data types, such as ``Int64``.
164+
You can instruct Arrow to create a pandas DataFrame that uses nullable dtypes,
165+
as shown in the following example from the `Apache Arrow documentation <https://arrow.apache.org/docs/python/pandas.html>`__:
166+
167+
.. code-block:: pycon
168+
169+
>>> dtype_mapping = {
170+
... pa.int8(): pd.Int8Dtype(),
171+
... pa.int16(): pd.Int16Dtype(),
172+
... pa.int32(): pd.Int32Dtype(),
173+
... pa.int64(): pd.Int64Dtype(),
174+
... pa.uint8(): pd.UInt8Dtype(),
175+
... pa.uint16(): pd.UInt16Dtype(),
176+
... pa.uint32(): pd.UInt32Dtype(),
177+
... pa.uint64(): pd.UInt64Dtype(),
178+
... pa.bool_(): pd.BooleanDtype(),
179+
... pa.float32(): pd.Float32Dtype(),
180+
... pa.float64(): pd.Float64Dtype(),
181+
... pa.string(): pd.StringDtype(),
182+
... }
183+
>>> df = arrow_table.to_pandas(
184+
... types_mapper=dtype_mapping.get, split_blocks=True, self_destruct=True
185+
... )
186+
>>> del arrow_table
187+
188+
Including a mapping for ``pa.string()`` also converts Arrow string columns to the pandas ``StringDtype`` rather than to ``object`` columns.
189+
190+
Nested Extension Types
191+
----------------------
192+
193+
Pending `ARROW-179 <https://jira.mongodb.org/browse/ARROW-179>`__, extension
194+
types, such as ``ObjectId``, that appear in nested documents are not
195+
converted to the corresponding {+driver-short+} extension type, but
196+
instead have the raw Arrow type, ``FixedSizeBinaryType(fixed_size_binary[12])``.
197+
198+
These values can be consumed as-is, or converted individually to the
199+
desired extension type, such as ``_id = out['nested'][0]['_id'].cast(ObjectIdType())``.
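If you consume the raw values as-is, each one should be the 12 bytes of the ObjectId. The following sketch uses only the Python standard library and a hypothetical helper, ``object_id_hex``, to render such a value as a 24-character hex string. The sample bytes are taken from the example output earlier on this page and are not read from a real nested column.

```python
# Sample raw value: the 12 bytes of an ObjectId, as they would appear in a
# fixed_size_binary[12] column (assumption: bytes copied from the example above).
raw_id = bytes.fromhex("64408b0d5ac9e208af220142")


def object_id_hex(raw: bytes) -> str:
    """Return the 24-character hex form of a raw 12-byte ObjectId value.

    Hypothetical helper for illustration; it is not part of pymongoarrow.
    """
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    return raw.hex()


print(object_id_hex(raw_id))
```

The length check guards against passing a value from a column that is not a 12-byte binary column.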

source/data_types.txt

Lines changed: 0 additions & 184 deletions
This file was deleted.

source/index.txt

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 /quick-start
 /whats-new
 /comparison
-/data_types
+/data-types
 /schemas
 API Documentation <{+api-root+}>
 /faq
