Commit 7f4ef32

ENH: add to/from_parquet with pyarrow & fastparquet
1 parent 7930202 commit 7f4ef32

19 files changed: 629 additions and 6 deletions

ci/install_travis.sh

Lines changed: 1 addition & 0 deletions
@@ -156,6 +156,7 @@ fi
 echo
 echo "[removing installed pandas]"
 conda remove pandas -y --force
+pip uninstall -y pandas
 
 if [ "$BUILD_TEST" ]; then

ci/requirements-2.7.sh

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ source activate pandas
 
 echo "install 27"
 
-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet

ci/requirements-3.5.sh

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@ source activate pandas
 
 echo "install 35"
 
-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
-
 # pip install python-dateutil to get latest
 conda remove -n pandas python-dateutil --force
 pip install python-dateutil
+
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1

ci/requirements-3.5_OSX.sh

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ source activate pandas
 
 echo "install 35_OSX"
 
-conda install -n pandas -c conda-forge feather-format==0.3.1
+conda install -n pandas -c conda-forge feather-format==0.3.1 fastparquet

ci/requirements-3.6.pip

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+brotlipy

ci/requirements-3.6.run

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ sqlalchemy
 pymysql
 feather-format
 pyarrow=0.4.1
+python-snappy
+fastparquet
 # psycopg2 (not avail on defaults ATM)
 beautifulsoup4
 s3fs

ci/requirements-3.6_DOC.sh

Lines changed: 2 additions & 1 deletion
@@ -6,6 +6,7 @@ echo "[install DOC_BUILD deps]"
 
 pip install pandas-gbq
 
-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 nbsphinx pandoc
+conda install -n pandas -c conda-forge nbsphinx pandoc
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet
 
 conda install -n pandas -c r r rpy2 --yes

ci/requirements-3.6_WIN.run

Lines changed: 2 additions & 0 deletions
@@ -13,3 +13,5 @@ numexpr
 pytables
 matplotlib
 blosc
+fastparquet
+pyarrow

doc/source/install.rst

Lines changed: 1 addition & 0 deletions
@@ -236,6 +236,7 @@ Optional Dependencies
 * `xarray <http://xarray.pydata.org>`__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
 * `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
 * `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
+* ``Apache Parquet Format``: either `pyarrow <http://arrow.apache.org/docs/python/>`__ (>= 0.4.1) or `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__ (>= 0.0.6) is necessary for parquet-based storage. The `snappy <https://pypi.python.org/pypi/python-snappy>`__ and `brotli <https://pypi.python.org/pypi/brotlipy>`__ libraries are available for compression support.
 * `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:
 
     * `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
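Since parquet support requires either ``pyarrow`` or ``fastparquet``, a caller may want to check which one is importable before choosing an ``engine``. The following is a minimal sketch using only the standard library. The helper names ``available_parquet_engines`` and ``pick_parquet_engine`` are hypothetical and not part of pandas, and the sketch does not check the minimum versions listed above.

```python
import importlib.util


# Hypothetical helpers, not part of pandas: they only test whether a
# module can be found, not whether its version is recent enough.
def available_parquet_engines(candidates=("pyarrow", "fastparquet")):
    """Return the names in `candidates` that are importable, in order."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


def pick_parquet_engine(candidates=("pyarrow", "fastparquet")):
    """Return the first importable engine name, or None if none is found."""
    found = available_parquet_engines(candidates)
    return found[0] if found else None


if __name__ == "__main__":
    engine = pick_parquet_engine()
    if engine is None:
        print("no parquet engine found; install pyarrow or fastparquet")
    else:
        print("using engine:", engine)
```

The returned name could then be passed as the ``engine`` argument of ``to_parquet`` or ``read_parquet``.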

doc/source/io.rst

Lines changed: 71 additions & 0 deletions
@@ -43,6 +43,7 @@ object. The corresponding ``writer`` functions are object methods that are acces
     binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
     binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
     binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
+    binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
     binary;`Msgpack <http://msgpack.org/index.html>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
     binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
     binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
@@ -4550,6 +4551,76 @@ Read from a feather file.
    import os
    os.remove('example.feather')
 
+
+.. _io.parquet:
+
+Parquet
+-------
+
+.. versionadded:: 0.21.0
+
+Parquet provides a sharded, binary, columnar serialization for data frames. It is designed to make reading and writing data
+frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a
+variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
+
+Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting most of the pandas
+dtypes, including extension dtypes such as categorical and datetime with tz. The exceptions are listed in the caveats below.
+
+Several caveats:
+
+- The format will NOT write an ``Index`` or ``MultiIndex`` for the ``DataFrame`` and will raise an
+  error if a non-default one is provided. You can call ``.reset_index()`` first to store the index as columns.
+- Duplicate column names and non-string column names are not supported.
+- Unsupported types include ``Period`` and actual Python object types. These will raise a helpful error message
+  on an attempt at serialization.
+
+See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.
+
+.. note::
+
+   These engines are very similar and should read/write nearly identical parquet format files.
+   They differ in their underlying dependencies: ``fastparquet`` uses ``numba``, while ``pyarrow`` uses a C library.
+   TODO: differing options to write non-standard columns & null treatment
+
+.. ipython:: python
+
+   df = pd.DataFrame({'a': list('abc'),
+                      'b': list(range(1, 4)),
+                      'c': np.arange(3, 6).astype('u1'),
+                      'd': np.arange(4.0, 7.0, dtype='float64'),
+                      'e': [True, False, True],
+                      'f': pd.Categorical(list('abc')),
+                      'g': pd.date_range('20130101', periods=3),
+                      'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                      'i': pd.date_range('20130101', periods=3, freq='ns')})
+
+   df
+   df.dtypes
+
+Write to a parquet file.
+
+.. ipython:: python
+
+   df.to_parquet('example_pa.parquet', engine='pyarrow')
+   df.to_parquet('example_fp.parquet', engine='fastparquet')
+
+Read from a parquet file.
+
+.. ipython:: python
+
+   result = pd.read_parquet('example_pa.parquet')
+   result = pd.read_parquet('example_fp.parquet')
+
+   # we preserve dtypes
+   result.dtypes
+
+.. ipython:: python
+   :suppress:
+
+   import os
+   os.remove('example_pa.parquet')
+   os.remove('example_fp.parquet')
+
 .. _io.sql:
 
 SQL Queries
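The caveats in the new Parquet section (no non-default index, no duplicate or non-string column names) can be checked before calling ``to_parquet``. The following is a minimal sketch under stated assumptions: ``prepare_for_parquet`` is a hypothetical helper that is not part of pandas, it assumes Python 3 string semantics, and it does not call any parquet engine.

```python
import pandas as pd


# Hypothetical helper, not part of pandas: validate the column names and
# move the index into ordinary columns so it is not lost on write.
def prepare_for_parquet(df):
    if not all(isinstance(col, str) for col in df.columns):
        raise ValueError("non-string column names are not supported")
    if df.columns.duplicated().any():
        raise ValueError("duplicate column names are not supported")
    # reset_index() turns the index level(s) into columns and leaves
    # a default integer index behind.
    return df.reset_index()


df = pd.DataFrame({"value": [1.0, 2.0, 3.0]},
                  index=pd.Index(["x", "y", "z"], name="key"))
flat = prepare_for_parquet(df)
print(list(flat.columns))  # ['key', 'value']
```

After reading the file back, ``result.set_index('key')`` would restore the original index in this example.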
