Add reader for SPSS (.sav) files #26537

Merged
merged 53 commits on Jun 16, 2019

53 commits
e2492bf
Initial version of SPSS reader
cbrnr May 27, 2019
b3581b2
Rename file
cbrnr May 27, 2019
6db0941
Add usecols and categorical optional parameters
cbrnr May 27, 2019
27a2768
Fix typo
cbrnr May 27, 2019
554fd3f
Add tests
cbrnr May 28, 2019
d8b2cb8
Skip tests if pyreadstat is not available
cbrnr May 28, 2019
7640448
Add pyreadstat to Travis (just 37 for now to see if it works)
cbrnr May 28, 2019
8fc9ee5
Ignore flake8 F401
cbrnr May 28, 2019
01fd5ec
Update whatsnew
cbrnr May 28, 2019
57bc84c
Change versionadded to 0.25.0
cbrnr May 28, 2019
b515ecc
Specify reason for skipif
cbrnr May 28, 2019
ef9f7d0
Fix API
cbrnr May 28, 2019
951a0c2
Fix path to test files
cbrnr May 28, 2019
977fff0
Sort imports
cbrnr May 28, 2019
a69c2bc
Use datapath fixture
cbrnr May 29, 2019
40c9875
Acknowledge Haven project
cbrnr May 29, 2019
c3a4291
Fix imports order
cbrnr May 29, 2019
c59f1e8
Use importorskip
cbrnr May 29, 2019
a3a95bf
Add pyreadstat dependency to macOS and Windows CI
cbrnr May 29, 2019
1983464
Add typing
cbrnr May 29, 2019
5829c95
Add missing whitespace
cbrnr May 29, 2019
17e8786
Add Haven license files
cbrnr May 29, 2019
fe2e2fc
Remove trailing whitespace
cbrnr May 30, 2019
1510b88
Fix import format and add pathlib.Path
cbrnr May 30, 2019
a6e5ad6
Use Optional to properly type an optional argument with default value…
cbrnr May 30, 2019
7817136
Use conda-forge instead of PyPI
cbrnr May 31, 2019
f070282
Better ImportError message
cbrnr May 31, 2019
8a52e41
Use pd.read_spss
cbrnr May 31, 2019
aa85f94
Add conda-forge on macOS task
cbrnr May 31, 2019
0707fbf
Use pyreadstat from pip for Python 3.5 (not available on conda-forge)
cbrnr Jun 3, 2019
cf6403f
usecols only accepts list-like or None
cbrnr Jun 6, 2019
f6a2747
Remove condition (str is not allowed anymore)
cbrnr Jun 6, 2019
af9dda9
Rename to convert_categoricals
cbrnr Jun 8, 2019
d6d408e
Explicitly convert to list
cbrnr Jun 8, 2019
1497a03
Add pyreadstat to environment.yml
cbrnr Jun 8, 2019
2d7a256
Use is_list_like
cbrnr Jun 8, 2019
bd56eee
Fix tests
cbrnr Jun 10, 2019
f68b516
Fix df is assigned but never used
cbrnr Jun 10, 2019
ee14f29
Sort imports
cbrnr Jun 10, 2019
15d7c71
Update requirements-dev.txt
cbrnr Jun 10, 2019
a05de6c
Remove isort
cbrnr Jun 11, 2019
748fe61
Improve docstring
cbrnr Jun 11, 2019
ced4866
Revert indent
cbrnr Jun 11, 2019
a18e0f5
Indent should be 2 spaces
cbrnr Jun 11, 2019
913989d
Add minimum version for pyreadstat
cbrnr Jun 13, 2019
0abcde8
Add minimum version
cbrnr Jun 13, 2019
040af2b
Use import_optional_dependency
cbrnr Jun 13, 2019
53f5692
Remove minimum version for now
cbrnr Jun 13, 2019
ceef885
Correct import order
cbrnr Jun 13, 2019
b232b61
Remove duplicate
cbrnr Jun 14, 2019
b8b7fff
Don't need conda-forge here
cbrnr Jun 14, 2019
90702f3
Remove blank line
cbrnr Jun 14, 2019
e55b8c4
Fix order
cbrnr Jun 15, 2019

2 changes: 2 additions & 0 deletions LICENSES/HAVEN_LICENSE
@@ -0,0 +1,2 @@
YEAR: 2013-2016
COPYRIGHT HOLDER: Hadley Wickham; RStudio; and Evan Miller
32 changes: 32 additions & 0 deletions LICENSES/HAVEN_MIT
@@ -0,0 +1,32 @@
Based on http://opensource.org/licenses/MIT

This is a template. Complete and ship as file LICENSE the following 2
lines (only)

YEAR:
COPYRIGHT HOLDER:

and specify as

License: MIT + file LICENSE

Copyright (c) <YEAR>, <COPYRIGHT HOLDER>

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
1 change: 1 addition & 0 deletions ci/deps/azure-macos-35.yaml
@@ -23,6 +23,7 @@ dependencies:
- xlsxwriter
- xlwt
- pip:
- pyreadstat
# universal
- pytest==4.5.0
- pytest-xdist
1 change: 1 addition & 0 deletions ci/deps/azure-windows-37.yaml
@@ -30,3 +30,4 @@ dependencies:
- pytest-mock
- moto
- hypothesis>=3.58.0
- pyreadstat
1 change: 1 addition & 0 deletions ci/deps/travis-37.yaml
@@ -19,5 +19,6 @@ dependencies:
- hypothesis>=3.58.0
- s3fs
- pip
- pyreadstat
- pip:
- moto
1 change: 1 addition & 0 deletions doc/source/install.rst
@@ -285,6 +285,7 @@ pandas-gbq 0.8.0 Google Big Query access
psycopg2                                     PostgreSQL engine for sqlalchemy
pyarrow                   0.9.0              Parquet and feather reading / writing
pymysql                                      MySQL engine for sqlalchemy
pyreadstat                                   SPSS files (.sav) reading
qtpy                                         Clipboard I/O
s3fs                      0.0.8              Amazon S3 access
xarray                    0.8.2              pandas-like API for N-dimensional data
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -99,6 +99,7 @@ Other Enhancements
- Error message for missing required imports now includes the original import error's text (:issue:`23868`)
- :class:`DatetimeIndex` and :class:`TimedeltaIndex` now have a ``mean`` method (:issue:`24757`)
- :meth:`DataFrame.describe` now formats integer percentiles without decimal point (:issue:`26660`)
- Added support for reading SPSS .sav files using :func:`read_spss` (:issue:`26537`)

.. _whatsnew_0250.api_breaking:

2 changes: 2 additions & 0 deletions environment.yml
@@ -79,3 +79,5 @@ dependencies:
- xlrd # pandas.read_excel, DataFrame.to_excel, pandas.ExcelWriter, pandas.ExcelFile
- xlsxwriter # pandas.read_excel, DataFrame.to_excel, pandas.ExcelWriter, pandas.ExcelFile
- xlwt # pandas.read_excel, DataFrame.to_excel, pandas.ExcelWriter, pandas.ExcelFile
- pip:
- pyreadstat # pandas.read_spss
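pyreadstat is only an optional dependency: `read_spss` resolves it at call time through `import_optional_dependency`, so pandas itself still imports without it. The block below is a minimal, stdlib-only sketch of that lazy-import pattern. It is not pandas' implementation; the helper name `require_optional`, its `purpose` parameter, and the message wording are assumptions made for illustration.

```python
import importlib
from types import ModuleType


def require_optional(name: str, purpose: str = "") -> ModuleType:
    """Import an optional module on demand (hypothetical helper).

    Raises ImportError with an installation hint if the module is missing.
    """
    try:
        return importlib.import_module(name)
    except ImportError as err:
        hint = f"Missing optional dependency '{name}'."
        if purpose:
            hint += f" It is required for {purpose}."
        hint += f" Use pip or conda to install {name}."
        # Chain the original error so the root cause stays visible.
        raise ImportError(hint) from err


# The import only happens when the feature is actually used.
json_module = require_optional("json", purpose="the demo")
```

Deferring the import into the function body keeps the failure local to the feature that needs the package, and the chained exception preserves the original import error text.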
2 changes: 1 addition & 1 deletion pandas/__init__.py
@@ -105,7 +105,7 @@

# misc
read_clipboard, read_parquet, read_feather, read_gbq,
-read_html, read_json, read_stata, read_sas)
+read_html, read_json, read_stata, read_sas, read_spss)

from pandas.util._tester import test
import pandas.testing
1 change: 1 addition & 0 deletions pandas/io/api.py
@@ -16,5 +16,6 @@
from pandas.io.pickle import read_pickle, to_pickle
from pandas.io.pytables import HDFStore, read_hdf
from pandas.io.sas import read_sas
from pandas.io.spss import read_spss
from pandas.io.sql import read_sql, read_sql_query, read_sql_table
from pandas.io.stata import read_stata
41 changes: 41 additions & 0 deletions pandas/io/spss.py
@@ -0,0 +1,41 @@
from pathlib import Path
from typing import Optional, Sequence, Union

from pandas.compat._optional import import_optional_dependency

from pandas.api.types import is_list_like
from pandas.core.api import DataFrame


def read_spss(path: Union[str, Path],
              usecols: Optional[Sequence[str]] = None,
              convert_categoricals: bool = True) -> DataFrame:
    """
    Load an SPSS file from the file path, returning a DataFrame.

    .. versionadded:: 0.25.0

    Parameters
    ----------
    path : str or Path
        File path.
    usecols : list-like, optional
        Return a subset of the columns. If None, return all columns.
    convert_categoricals : bool, default True
        Convert categorical columns into pd.Categorical.

    Returns
    -------
    DataFrame
    """
    pyreadstat = import_optional_dependency("pyreadstat")

    if usecols is not None:
        if not is_list_like(usecols):
            raise TypeError("usecols must be list-like.")
        else:
            usecols = list(usecols)  # pyreadstat requires a list

    df, _ = pyreadstat.read_sav(path, usecols=usecols,
                                apply_value_formats=convert_categoricals)
    return df
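The `usecols` handling in `read_spss` can be tried in isolation. The following is a rough stdlib-only sketch of the same check, assuming that "list-like" means any iterable other than `str` or `bytes`, which approximates but is not identical to pandas' `is_list_like`. The function name `normalize_usecols` is hypothetical.

```python
from collections.abc import Iterable
from typing import List, Optional


def normalize_usecols(usecols) -> Optional[List[str]]:
    """Hypothetical stand-in for the usecols check in read_spss."""
    if usecols is None:
        return None  # None means "read every column"
    # Approximation of pandas' is_list_like: iterable, but not str/bytes.
    if isinstance(usecols, (str, bytes)) or not isinstance(usecols, Iterable):
        raise TypeError("usecols must be list-like.")
    return list(usecols)  # pyreadstat requires a list


print(normalize_usecols(("VAR00001", "VAR00002")))
```

Rejecting a bare string avoids the silent surprise of `list("VAR00002")` turning one column name into eight single-character names.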
2 changes: 1 addition & 1 deletion pandas/tests/api/test_api.py
@@ -81,7 +81,7 @@ class TestPDApi(Base):
'read_gbq', 'read_hdf', 'read_html', 'read_json',
'read_msgpack', 'read_pickle', 'read_sas', 'read_sql',
'read_sql_query', 'read_sql_table', 'read_stata',
-'read_table', 'read_feather', 'read_parquet']
+'read_table', 'read_feather', 'read_parquet', 'read_spss']

# top-level to_* funcs
funcs_to = ['to_datetime', 'to_msgpack',
Binary file added pandas/tests/io/data/labelled-num-na.sav
Binary file not shown.
Binary file added pandas/tests/io/data/labelled-num.sav
Binary file not shown.
Binary file added pandas/tests/io/data/labelled-str.sav
Binary file not shown.
Binary file added pandas/tests/io/data/umlauts.sav
Binary file not shown.
74 changes: 74 additions & 0 deletions pandas/tests/io/test_spss.py
@@ -0,0 +1,74 @@
import numpy as np
import pytest

import pandas as pd
from pandas.util import testing as tm

pyreadstat = pytest.importorskip("pyreadstat")


def test_spss_labelled_num(datapath):
    # test file from the Haven project (https://haven.tidyverse.org/)
    fname = datapath("io", "data", "labelled-num.sav")

    df = pd.read_spss(fname, convert_categoricals=True)
    expected = pd.DataFrame({"VAR00002": "This is one"}, index=[0])
    expected["VAR00002"] = pd.Categorical(expected["VAR00002"])
    tm.assert_frame_equal(df, expected)

    df = pd.read_spss(fname, convert_categoricals=False)
    expected = pd.DataFrame({"VAR00002": 1.0}, index=[0])
    tm.assert_frame_equal(df, expected)


def test_spss_labelled_num_na(datapath):
    # test file from the Haven project (https://haven.tidyverse.org/)
    fname = datapath("io", "data", "labelled-num-na.sav")

    df = pd.read_spss(fname, convert_categoricals=True)
    expected = pd.DataFrame({"VAR00002": ["This is one", None]})
    expected["VAR00002"] = pd.Categorical(expected["VAR00002"])
    tm.assert_frame_equal(df, expected)

    df = pd.read_spss(fname, convert_categoricals=False)
    expected = pd.DataFrame({"VAR00002": [1.0, np.nan]})
    tm.assert_frame_equal(df, expected)


def test_spss_labelled_str(datapath):
    # test file from the Haven project (https://haven.tidyverse.org/)
    fname = datapath("io", "data", "labelled-str.sav")

    df = pd.read_spss(fname, convert_categoricals=True)
    expected = pd.DataFrame({"gender": ["Male", "Female"]})
    expected["gender"] = pd.Categorical(expected["gender"])
    tm.assert_frame_equal(df, expected)

    df = pd.read_spss(fname, convert_categoricals=False)
    expected = pd.DataFrame({"gender": ["M", "F"]})
    tm.assert_frame_equal(df, expected)


def test_spss_umlauts(datapath):
    # test file from the Haven project (https://haven.tidyverse.org/)
    fname = datapath("io", "data", "umlauts.sav")

    df = pd.read_spss(fname, convert_categoricals=True)
    expected = pd.DataFrame({"var1": ["the ä umlaut",
                                      "the ü umlaut",
                                      "the ä umlaut",
                                      "the ö umlaut"]})
    expected["var1"] = pd.Categorical(expected["var1"])
    tm.assert_frame_equal(df, expected)

    df = pd.read_spss(fname, convert_categoricals=False)
    expected = pd.DataFrame({"var1": [1.0, 2.0, 1.0, 3.0]})
    tm.assert_frame_equal(df, expected)


def test_spss_usecols(datapath):
    # usecols must be list-like
    fname = datapath("io", "data", "labelled-num.sav")

    with pytest.raises(TypeError, match="usecols must be list-like."):
        pd.read_spss(fname, usecols="VAR00002")
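Each test above reads the same file twice: once with `convert_categoricals=True`, which yields value labels, and once with `False`, which yields the stored values. The relationship between the two results can be sketched in plain Python as a lookup from stored value to label. The mappings below are inferred from the expected frames in `test_spss_umlauts` and `test_spss_labelled_str`; `apply_value_labels` is a hypothetical helper and not part of pandas or pyreadstat.

```python
from typing import Dict, Hashable, List, Sequence


def apply_value_labels(values: Sequence[Hashable],
                       labels: Dict[Hashable, str]) -> List[Hashable]:
    """Replace each stored value by its label; unlabelled values pass through."""
    return [labels.get(value, value) for value in values]


# Mapping inferred from the expected frames in test_spss_umlauts.
umlaut_labels = {1.0: "the ä umlaut", 2.0: "the ü umlaut", 3.0: "the ö umlaut"}
raw = [1.0, 2.0, 1.0, 3.0]
labelled = apply_value_labels(raw, umlaut_labels)

# Same idea for the string-coded column in test_spss_labelled_str.
gender = apply_value_labels(["M", "F"], {"M": "Male", "F": "Female"})
```

This only models the value-to-label substitution; the real reader additionally returns the labelled column as a `pd.Categorical`, as the assertions above require.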
3 changes: 2 additions & 1 deletion requirements-dev.txt
@@ -52,4 +52,5 @@ sqlalchemy
xarray
xlrd
xlsxwriter
-xlwt
+xlwt
+pyreadstat
Contributor

Can you put this in the correct alphabetical place?

Contributor Author

The file wasn't sorted at all. I've sorted all lines now.