DOC: Add expanded index descriptors for specifying for RangeIndex-as-metadata in Parquet file schema #25709
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,10 +37,17 @@ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a | |
|
||
.. code-block:: text | ||
|
||
{'index_columns': ['__index_level_0__', '__index_level_1__', ...], | ||
{'index_columns': [<descr0>, <descr1>, ...], | ||
'column_indexes': [<ci0>, <ci1>, ..., <ciN>], | ||
'columns': [<c0>, <c1>, ...], | ||
'pandas_version': $VERSION} | ||
'pandas_version': $VERSION, | ||
'creator': { | ||
'library': $LIBRARY, | ||
'version': $LIBRARY_VERSION | ||
}} | ||
|
||
The "descriptor" values ``<descr0>`` in the ``'index_columns'`` field are | ||
dictionaries with values as described below. | ||
|
||
Here, ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata | ||
for each column, *including the index columns*. This has JSON form: | ||
|
@@ -53,26 +60,42 @@ for each column, *including the index columns*. This has JSON form: | |
'numpy_type': numpy_type, | ||
'metadata': metadata} | ||
|
||
.. note:: | ||
See below for the detailed specification for these | ||
|
||
Index Metadata Descriptors | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
``RangeIndex`` can be stored as metadata only, not requiring serialization. The | ||
descriptor format for these is as follows: | ||
|
||
.. code-block:: python | ||
|
||
{'kind': 'range', | ||
'name': index.name, | ||
'start': index._start, | ||
'stop': index._stop, | ||
'step': index._step} | ||
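As a minimal sketch of the range descriptor described above: the helper names ``range_descriptor`` and ``range_from_descriptor`` are hypothetical, and Python's built-in ``range`` stands in for ``pandas.RangeIndex`` so the example runs without pandas. Since a built-in ``range`` has no ``name``, the name is passed separately here. This only illustrates that the descriptor carries enough information to rebuild the index without serializing any data.

```python
def range_descriptor(rng, name=None):
    """Build a 'range' index descriptor from a range-like object.

    Assumption: `rng` exposes public `start`, `stop` and `step`
    attributes (true for the built-in `range`).
    """
    return {
        'kind': 'range',
        'name': name,
        'start': rng.start,
        'stop': rng.stop,
        'step': rng.step,
    }


def range_from_descriptor(descr):
    """Rebuild a range from a 'range' descriptor; no data column is needed."""
    if descr.get('kind') != 'range':
        raise ValueError("expected a descriptor with kind 'range'")
    return range(descr['start'], descr['stop'], descr['step'])


descr = range_descriptor(range(0, 10, 2), name='row')
restored = range_from_descriptor(descr)
```

The round trip needs only the four descriptor fields, which is why a range-style index does not have to be written as a data column.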
|
||
Every index column is stored with a name matching the pattern | ||
``__index_level_\d+__`` and its corresponding column information can be | ||
found with the following code snippet. | ||
Other index types must be serialized as data columns along with the other | ||
DataFrame columns. The metadata for these is a dict with ``kind`` field | ||
``'serialized'`` and ``'field_name'`` field indicating which data column | ||
contains the index data. For example, | ||
|
||
Following this naming convention isn't strictly necessary, but strongly | ||
suggested for compatibility with Arrow. | ||
.. code-block:: python | ||
|
||
Here's an example of how the index metadata is structured in pyarrow: | ||
{'kind': 'serialized', | ||
'field_name': '__index_level_0__'} | ||
|
||
.. code-block:: python | ||
Every index column is stored with a name matching the pattern | ||
``__index_level_\d+__``. Following this naming convention isn't strictly | ||
necessary, but strongly suggested for compatibility with Arrow and | ||
disambiguation. The ``'field_name'`` is the actual name of the column in the | ||
serialized Parquet table. If the ``Index`` has a non-None ``name`` attribute, | ||
then it can be found in the ``name`` field of the metadata for that serialized | ||
data column as described below. | ||
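The lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the pyarrow implementation: ``index_level_names`` is a hypothetical helper, the sample ``metadata`` dict is made up for the example, and treating a bare string in ``'index_columns'`` as a serialized field name is an assumption made for compatibility with the older string-only format.

```python
def index_level_names(metadata):
    """Return the name of each index level described in `metadata`.

    Range descriptors carry the name directly; serialized descriptors
    point at a data column whose column metadata holds the name.
    """
    columns_by_field = {col['field_name']: col for col in metadata['columns']}
    names = []
    for descr in metadata['index_columns']:
        if isinstance(descr, str):
            # Assumption: a plain string is the older format, where the
            # entry is simply the serialized column's field name.
            descr = {'kind': 'serialized', 'field_name': descr}
        if descr['kind'] == 'range':
            names.append(descr['name'])
        elif descr['kind'] == 'serialized':
            names.append(columns_by_field[descr['field_name']]['name'])
        else:
            raise ValueError('unknown index descriptor kind: %r' % descr['kind'])
    return names


# Made-up metadata with one range level and one serialized level.
metadata = {
    'index_columns': [
        {'kind': 'range', 'name': None, 'start': 0, 'stop': 3, 'step': 1},
        {'kind': 'serialized', 'field_name': '__index_level_1__'},
    ],
    'columns': [
        {'name': 'a', 'field_name': 'a', 'pandas_type': 'int64',
         'numpy_type': 'int64', 'metadata': None},
        {'name': 'ts', 'field_name': '__index_level_1__',
         'pandas_type': 'int64', 'numpy_type': 'int64', 'metadata': None},
    ],
}
names = index_level_names(metadata)
```

Keying the column metadata by ``'field_name'`` avoids relying on the position of index columns within ``'columns'``.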
|
||
|
||
# assuming there's at least 3 levels in the index | ||
index_columns = metadata['index_columns'] # noqa: F821 | ||
columns = metadata['columns'] # noqa: F821 | ||
ith_index = 2 | ||
assert index_columns[ith_index] == '__index_level_2__' | ||
ith_index_info = columns[-len(index_columns):][ith_index] | ||
ith_index_level_name = ith_index_info['name'] | ||
Column Metadata | ||
~~~~~~~~~~~~~~~ | ||
|
||
``pandas_type`` is the logical type of the column, and is one of: | ||
|
||
|
@@ -121,7 +144,8 @@ As an example of fully-formed metadata: | |
|
||
.. code-block:: text | ||
|
||
{'index_columns': ['__index_level_0__'], | ||
{'index_columns': [{'kind': 'serialized', | ||
'field_name': '__index_level_0__'}], | ||
'column_indexes': [ | ||
{'name': None, | ||
'field_name': 'None', | ||
|
@@ -161,4 +185,8 @@ As an example of fully-formed metadata: | |
'numpy_type': 'int64', | ||
'metadata': None} | ||
], | ||
'pandas_version': '0.20.0'} | ||
'pandas_version': '0.20.0', | ||
'creator': { | ||
'library': 'pyarrow', | ||
'version': '0.13.0' | ||
}} |