@@ -68,34 +68,43 @@ message).
6868See the :ref: `format_metadata_extension_types ` section of the metadata
6969specification for more details.
7070
71- Pyarrow allows you to define such extension types from Python.
72-
73- There are currently two ways:
74-
75- * Subclassing :class: `PyExtensionType `: the (de)serialization is based on pickle.
76- This is a good option for an extension type that is only used from Python.
77- * Subclassing :class: `ExtensionType `: this allows to give a custom
78- Python-independent name and serialized metadata, that can potentially be
79- recognized by other (non-Python) Arrow implementations such as PySpark.
71+ Pyarrow allows you to define such extension types from Python by subclassing
72+ :class: `ExtensionType ` and giving the derived class its own extension name
73+ and serialization mechanism. The extension name and serialized metadata
74+ can potentially be recognized by other (non-Python) Arrow implementations
75+ such as PySpark.
8076
8177For example, we could define a custom UUID type for 128-bit numbers which can
82- be represented as ``FixedSizeBinary `` type with 16 bytes.
83- Using the first approach, we create a ``UuidType `` subclass, and implement the
84- ``__reduce__ `` method to ensure the class can be properly pickled::
78+ be represented as ``FixedSizeBinary `` type with 16 bytes::
8579
86- class UuidType(pa.PyExtensionType ):
80+ class UuidType(pa.ExtensionType ):
8781
8882 def __init__(self):
89- pa.PyExtensionType.__init__(self, pa.binary(16))
83+ super().__init__(pa.binary(16), "my_package.uuid")
84+
85+ def __arrow_ext_serialize__(self):
86+ # Since we don't have a parameterized type, we don't need extra
87+ # metadata to be deserialized
88+ return b''
9089
91- def __reduce__(self):
92- return UuidType, ()
90+ @classmethod
91+ def __arrow_ext_deserialize__(cls, storage_type, serialized):
92+ # Sanity checks, not required but illustrate the method signature.
93+ assert storage_type == pa.binary(16)
94+ assert serialized == b''
95+ # Return an instance of this subclass given the serialized
96+ # metadata.
97+ return UuidType()
98+
99+ The special methods ``__arrow_ext_serialize__ `` and ``__arrow_ext_deserialize__ ``
100+ define the serialization of an extension type instance. For non-parametric
101+ types such as the above, the serialization payload can be left empty.
93102
94103This can now be used to create arrays and tables holding the extension type::
95104
96105 >>> uuid_type = UuidType()
97106 >>> uuid_type.extension_name
98- 'arrow.py_extension_type '
107+ 'my_package.uuid '
99108 >>> uuid_type.storage_type
100109 FixedSizeBinaryType(fixed_size_binary[16])
101110
@@ -112,8 +121,11 @@ This can now be used to create arrays and tables holding the extension type::
112121 ]
113122
114123This array can be included in RecordBatches, sent over IPC and received in
115- another Python process. The custom UUID type will be preserved there, as long
116- as the definition of the class is available (the type can be unpickled).
124+ another Python process. The receiving process must explicitly register the
125+ extension type for deserialization, otherwise it will fall back to the
126+ storage type::
127+
128+ >>> pa.register_extension_type(UuidType())
117129
118130For example, creating a RecordBatch and writing it to a stream using the
119131IPC protocol::
@@ -129,43 +141,12 @@ and then reading it back yields the proper type::
129141 >>> with pa.ipc.open_stream(buf) as reader:
130142 ... result = reader.read_all()
131143 >>> result.column('ext').type
132- UuidType(extension<arrow.py_extension_type>)
133-
134- We can define the same type using the other option::
135-
136- class UuidType(pa.ExtensionType):
137-
138- def __init__(self):
139- pa.ExtensionType.__init__(self, pa.binary(16), "my_package.uuid")
140-
141- def __arrow_ext_serialize__(self):
142- # since we don't have a parameterized type, we don't need extra
143- # metadata to be deserialized
144- return b''
145-
146- @classmethod
147- def __arrow_ext_deserialize__(self, storage_type, serialized):
148- # return an instance of this subclass given the serialized
149- # metadata.
150- return UuidType()
151-
152- This is a slightly longer implementation (you need to implement the special
153- methods ``__arrow_ext_serialize__ `` and ``__arrow_ext_deserialize__ ``), and the
154- extension type needs to be registered to be received through IPC (using
155- :func: `register_extension_type `), but it has
156- now a unique name::
157-
158- >>> uuid_type = UuidType()
159- >>> uuid_type.extension_name
160- 'my_package.uuid'
161-
162- >>> pa.register_extension_type(uuid_type)
144+ UuidType(FixedSizeBinaryType(fixed_size_binary[16]))
163145
164146The receiving application doesn't need to be Python but can still recognize
165- the extension type as a "uuid" type, if it has implemented its own extension
166- type to receive it.
167- If the type is not registered in the receiving application, it will fall back
168- to the storage type.
147+ the extension type as a "my_package.uuid" type, if it has implemented its own
148+ extension type to receive it. If the type is not registered in the receiving
149+ application, it will fall back to the storage type.
169150
170151Parameterized extension type
171152~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -187,7 +168,7 @@ of the given frequency since 1970.
187168 # attributes need to be set first before calling
188169 # super init (as that calls serialize)
189170 self._freq = freq
190- pa.ExtensionType. __init__(self, pa.int64(), 'my_package.period')
171+ super(). __init__(pa.int64(), 'my_package.period')
191172
192173 @property
193174 def freq(self):
@@ -198,7 +179,7 @@ of the given frequency since 1970.
198179
199180 @classmethod
200181 def __arrow_ext_deserialize__(cls, storage_type, serialized):
201- # return an instance of this subclass given the serialized
182+ # Return an instance of this subclass given the serialized
202183 # metadata.
203184 serialized = serialized.decode()
204185 assert serialized.startswith("freq=")
@@ -209,31 +190,10 @@ Here, we ensure to store all information in the serialized metadata that is
209190needed to reconstruct the instance (in the ``__arrow_ext_deserialize__ `` class
210191method), in this case the frequency string.
211192
212- Note that, once created, the data type instance is considered immutable. If,
213- in the example above, the ``freq `` parameter would change after instantiation,
214- the reconstruction of the type instance after IPC will be incorrect.
193+ Note that, once created, the data type instance is considered immutable.
215194In the example above, the ``freq `` parameter is therefore stored in a private
216195attribute with a public read-only property to access it.
217196
218- Parameterized extension types are also possible using the pickle-based type
219- subclassing :class: `PyExtensionType `. The equivalent example for the period
220- data type from above would look like::
221-
222- class PeriodType(pa.PyExtensionType):
223-
224- def __init__(self, freq):
225- self._freq = freq
226- pa.PyExtensionType.__init__(self, pa.int64())
227-
228- @property
229- def freq(self):
230- return self._freq
231-
232- def __reduce__(self):
233- return PeriodType, (self.freq,)
234-
235- Also the storage type does not need to be fixed but can be parameterized.
236-
237197Custom extension array class
238198~~~~~~~~~~~~~~~~~~~~~~~~~~~~
239199
@@ -252,12 +212,16 @@ the data as a 2-D Numpy array ``(N, 3)`` without any copy::
252212 return self.storage.flatten().to_numpy().reshape((-1, 3))
253213
254214
255- class Point3DType(pa.PyExtensionType ):
215+ class Point3DType(pa.ExtensionType ):
256216 def __init__(self):
257- pa.PyExtensionType. __init__(self, pa.list_(pa.float32(), 3))
217+ super(). __init__(pa.list_(pa.float32(), 3), "my_package.Point3DType" )
258218
259- def __reduce__(self):
260- return Point3DType, ()
219+ def __arrow_ext_serialize__(self):
220+ return b''
221+
222+ @classmethod
223+ def __arrow_ext_deserialize__(cls, storage_type, serialized):
224+ return Point3DType()
261225
262226 def __arrow_ext_class__(self):
263227 return Point3DArray
@@ -289,11 +253,8 @@ The additional methods in the extension class are then available to the user::
289253
290254
291255This array can be sent over IPC, received in another Python process, and the custom
292- extension array class will be preserved (as long as the definitions of the classes above
293- are available).
294-
295- The same ``__arrow_ext_class__ `` specialization can be used with custom types defined
296- by subclassing :class: `ExtensionType `.
256+ extension array class will be preserved (as long as the receiving process registers
257+ the extension type using :func: `register_extension_type ` before reading the IPC data).
297258
298259Custom scalar conversion
299260~~~~~~~~~~~~~~~~~~~~~~~~
@@ -304,18 +265,24 @@ If you want scalars of your custom extension type to convert to a custom type wh
304265For example, if we wanted the above example 3D point type to return a custom
3052663D point class instead of a list, we would implement::
306267
268+ from collections import namedtuple
269+
307270 Point3D = namedtuple("Point3D", ["x", "y", "z"])
308271
309272 class Point3DScalar(pa.ExtensionScalar):
310273 def as_py(self) -> Point3D:
311274 return Point3D(*self.value.as_py())
312275
313- class Point3DType(pa.PyExtensionType ):
276+ class Point3DType(pa.ExtensionType ):
314277 def __init__(self):
315- pa.PyExtensionType. __init__(self, pa.list_(pa.float32(), 3))
278+ super(). __init__(pa.list_(pa.float32(), 3), "my_package.Point3DType" )
316279
317- def __reduce__(self):
318- return Point3DType, ()
280+ def __arrow_ext_serialize__(self):
281+ return b''
282+
283+ @classmethod
284+ def __arrow_ext_deserialize__(cls, storage_type, serialized):
285+ return Point3DType()
319286
320287 def __arrow_ext_scalar_class__(self):
321288 return Point3DScalar
0 commit comments