Add reader for SPSS (.sav) files #26537
Merged
53 commits
e2492bf
Initial version of SPSS reader
cbrnr b3581b2
Rename file
cbrnr 6db0941
Add usecols and categorical optional parameters
cbrnr 27a2768
Fix typo
cbrnr 554fd3f
Add tests
cbrnr d8b2cb8
Skip tests if pyreadstat is not available
cbrnr 7640448
Add pyreadstat to Travis (just 37 for now to see if it works)
cbrnr 8fc9ee5
Ignore flake8 F401
cbrnr 01fd5ec
Update whatsnew
cbrnr 57bc84c
Change versionadded to 0.25.0
cbrnr b515ecc
Specify reason for skipif
cbrnr ef9f7d0
Fix API
cbrnr 951a0c2
Fix path to test files
cbrnr 977fff0
Sort imports
cbrnr a69c2bc
Use datapath fixture
cbrnr 40c9875
Acknowledge Haven project
cbrnr c3a4291
Fix imports order
cbrnr c59f1e8
Use importorskip
cbrnr a3a95bf
Add pyreadstat dependency to macOS and Windows CI
cbrnr 1983464
Add typing
cbrnr 5829c95
Add missing whitespace
cbrnr 17e8786
Add Haven license files
cbrnr fe2e2fc
Remove trailing whitespace
cbrnr 1510b88
Fix import format and add pathlib.Path
cbrnr a6e5ad6
Use Optional to properly type an optional argument with default value…
cbrnr 7817136
Use conda-forge instead of PyPI
cbrnr f070282
Better ImportError message
cbrnr 8a52e41
Use pd.read_spss
cbrnr aa85f94
Add conda-forge on macOS task
cbrnr 0707fbf
Use pyreadstat from pip for Python 3.5 (not available on conda-forge)
cbrnr cf6403f
usecols only accepts list-like or None
cbrnr f6a2747
Remove condition (str is not allowed anymore)
cbrnr af9dda9
Rename to convert_categoricals
cbrnr d6d408e
Explicitly convert to list
cbrnr 1497a03
Add pyreadstat to environment.yml
cbrnr 2d7a256
Use is_list_like
cbrnr bd56eee
Fix tests
cbrnr f68b516
Fix df is assigned but never used
cbrnr ee14f29
Sort imports
cbrnr 15d7c71
Update requirements-dev.txt
cbrnr a05de6c
Remove isort
cbrnr 748fe61
Improve docstring
cbrnr ced4866
Revert indent
cbrnr a18e0f5
Indent should be 2 spaces
cbrnr 913989d
Add minimum version for pyreadstat
cbrnr 0abcde8
Add minimum version
cbrnr 040af2b
Use import_optional_dependency
cbrnr 53f5692
Remove minimum version for now
cbrnr ceef885
Correct import order
cbrnr b232b61
Remove duplicate
cbrnr b8b7fff
Don't need conda-forge here
cbrnr 90702f3
Remove blank line
cbrnr e55b8c4
Fix order
cbrnr
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
YEAR: 2013-2016 | ||
COPYRIGHT HOLDER: Hadley Wickham; RStudio; and Evan Miller |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
Based on http://opensource.org/licenses/MIT | ||
|
||
This is a template. Complete and ship as file LICENSE the following 2 | ||
lines (only) | ||
|
||
YEAR: | ||
COPYRIGHT HOLDER: | ||
|
||
and specify as | ||
|
||
License: MIT + file LICENSE | ||
|
||
Copyright (c) <YEAR>, <COPYRIGHT HOLDER> | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining | ||
a copy of this software and associated documentation files (the | ||
"Software"), to deal in the Software without restriction, including | ||
without limitation the rights to use, copy, modify, merge, publish, | ||
distribute, sublicense, and/or sell copies of the Software, and to | ||
permit persons to whom the Software is furnished to do so, subject to | ||
the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be | ||
included in all copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, | ||
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF | ||
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND | ||
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE | ||
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION | ||
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION | ||
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,6 +23,7 @@ dependencies: | |
- xlsxwriter | ||
- xlwt | ||
- pip: | ||
- pyreadstat | ||
# universal | ||
- pytest==4.5.0 | ||
- pytest-xdist | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,3 +30,4 @@ dependencies: | |
- pytest-mock | ||
- moto | ||
- hypothesis>=3.58.0 | ||
- pyreadstat |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,5 +19,6 @@ dependencies: | |
- hypothesis>=3.58.0 | ||
- s3fs | ||
- pip | ||
- pyreadstat | ||
- pip: | ||
- moto |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
from pathlib import Path | ||
from typing import Optional, Sequence, Union | ||
|
||
from pandas.compat._optional import import_optional_dependency | ||
|
||
from pandas.api.types import is_list_like | ||
from pandas.core.api import DataFrame | ||
|
||
|
||
def read_spss(path: Union[str, Path], | ||
usecols: Optional[Sequence[str]] = None, | ||
convert_categoricals: bool = True) -> DataFrame: | ||
""" | ||
Load an SPSS file from the file path, returning a DataFrame. | ||
|
||
.. versionadded:: 0.25.0 | ||
|
||
Parameters | ||
---------- | ||
path : str or Path | ||
File path. | ||
usecols : list-like, optional | ||
Return a subset of the columns. If None, return all columns. | ||
convert_categoricals : bool, default True | ||
Convert categorical columns into pd.Categorical. | ||
|
||
Returns | ||
------- | ||
DataFrame | ||
""" | ||
pyreadstat = import_optional_dependency("pyreadstat") | ||
|
||
if usecols is not None: | ||
if not is_list_like(usecols): | ||
raise TypeError("usecols must be list-like.") | ||
else: | ||
usecols = list(usecols) # pyreadstat requires a list | ||
|
||
df, _ = pyreadstat.read_sav(path, usecols=usecols, | ||
apply_value_formats=convert_categoricals) | ||
return df |
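The two guard steps in `read_spss` above (resolving the optional `pyreadstat` dependency, then validating `usecols`) can be sketched without pandas installed. This is only an illustration under stated assumptions: `import_optional` and `normalize_usecols` are hypothetical names, `import_optional` is a simplified stand-in for pandas' private `import_optional_dependency`, and the list-like check here (any iterable except `str`/`bytes`) is narrower than what `is_list_like` actually accepts.

```python
import importlib
from collections.abc import Iterable
from typing import List, Optional


def import_optional(name: str):
    """Import ``name`` or raise an ImportError that names the missing package.

    Hypothetical, simplified stand-in for pandas' private
    ``import_optional_dependency`` helper.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        raise ImportError(
            f"Missing optional dependency '{name}'. "
            "Install it with pip or conda."
        ) from None


def normalize_usecols(usecols) -> Optional[List[str]]:
    """Return ``usecols`` as a list, or None to mean "all columns".

    Assumption: "list-like" is approximated as any iterable that is not a
    str or bytes object; pandas' ``is_list_like`` is more thorough.
    """
    if usecols is None:
        return None
    if isinstance(usecols, (str, bytes)) or not isinstance(usecols, Iterable):
        raise TypeError("usecols must be list-like.")
    # The diff notes that pyreadstat requires a real list, so tuples,
    # sets, and similar containers are converted here.
    return list(usecols)
```

A bare string such as `"VAR00002"` is rejected rather than being treated as a sequence of single-character column names, which is the case `test_spss_usecols` exercises.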
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
import numpy as np | ||
import pytest | ||
|
||
import pandas as pd | ||
from pandas.util import testing as tm | ||
|
||
pyreadstat = pytest.importorskip("pyreadstat") | ||
|
||
|
||
def test_spss_labelled_num(datapath): | ||
# test file from the Haven project (https://haven.tidyverse.org/) | ||
fname = datapath("io", "data", "labelled-num.sav") | ||
|
||
df = pd.read_spss(fname, convert_categoricals=True) | ||
expected = pd.DataFrame({"VAR00002": "This is one"}, index=[0]) | ||
expected["VAR00002"] = pd.Categorical(expected["VAR00002"]) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
df = pd.read_spss(fname, convert_categoricals=False) | ||
expected = pd.DataFrame({"VAR00002": 1.0}, index=[0]) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
|
||
def test_spss_labelled_num_na(datapath): | ||
# test file from the Haven project (https://haven.tidyverse.org/) | ||
fname = datapath("io", "data", "labelled-num-na.sav") | ||
|
||
df = pd.read_spss(fname, convert_categoricals=True) | ||
expected = pd.DataFrame({"VAR00002": ["This is one", None]}) | ||
expected["VAR00002"] = pd.Categorical(expected["VAR00002"]) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
df = pd.read_spss(fname, convert_categoricals=False) | ||
expected = pd.DataFrame({"VAR00002": [1.0, np.nan]}) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
|
||
def test_spss_labelled_str(datapath): | ||
# test file from the Haven project (https://haven.tidyverse.org/) | ||
fname = datapath("io", "data", "labelled-str.sav") | ||
|
||
df = pd.read_spss(fname, convert_categoricals=True) | ||
expected = pd.DataFrame({"gender": ["Male", "Female"]}) | ||
expected["gender"] = pd.Categorical(expected["gender"]) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
df = pd.read_spss(fname, convert_categoricals=False) | ||
expected = pd.DataFrame({"gender": ["M", "F"]}) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
|
||
def test_spss_umlauts(datapath): | ||
# test file from the Haven project (https://haven.tidyverse.org/) | ||
fname = datapath("io", "data", "umlauts.sav") | ||
|
||
df = pd.read_spss(fname, convert_categoricals=True) | ||
expected = pd.DataFrame({"var1": ["the ä umlaut", | ||
"the ü umlaut", | ||
"the ä umlaut", | ||
"the ö umlaut"]}) | ||
expected["var1"] = pd.Categorical(expected["var1"]) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
df = pd.read_spss(fname, convert_categoricals=False) | ||
expected = pd.DataFrame({"var1": [1.0, 2.0, 1.0, 3.0]}) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
|
||
def test_spss_usecols(datapath): | ||
# usecols must be list-like | ||
fname = datapath("io", "data", "labelled-num.sav") | ||
|
||
with pytest.raises(TypeError, match="usecols must be list-like."): | ||
pd.read_spss(fname, usecols="VAR00002") |
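The tests above compare raw stored values (`convert_categoricals=False`) against their value labels (`convert_categoricals=True`). As a rough mental model only, and not how pyreadstat implements `apply_value_formats`, the conversion behaves like a lookup from stored value to label in which missing values stay missing. The helper name `apply_value_labels` is hypothetical, and passing unlabelled values through unchanged is an assumption.

```python
import math
from typing import Dict, List, Optional, Union

Value = Union[float, str]


def apply_value_labels(
    values: List[Value], labels: Dict[Value, str]
) -> List[Optional[Value]]:
    """Map stored values to their labels; NaN becomes None.

    Hypothetical illustration of the behaviour the tests check, not
    pyreadstat's actual implementation.
    """
    converted: List[Optional[Value]] = []
    for value in values:
        if isinstance(value, float) and math.isnan(value):
            converted.append(None)  # missing stays missing
        else:
            # Assumption: values without a label are passed through as-is.
            converted.append(labels.get(value, value))
    return converted


# Mirrors the expectations in test_spss_labelled_num_na and
# test_spss_labelled_str.
numeric = apply_value_labels([1.0, math.nan], {1.0: "This is one"})
strings = apply_value_labels(["M", "F"], {"M": "Male", "F": "Female"})
```

In `read_spss` itself the labelled column additionally comes back as a `pd.Categorical`, which is why each test wraps its expected column in `pd.Categorical` before comparing.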
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -52,4 +52,5 @@ sqlalchemy | |
xarray | ||
xlrd | ||
xlsxwriter | ||
xlwt | ||
xlwt | ||
pyreadstat | ||
Review comment: Can you put this in the correct alphabetical place?
Reply: The file wasn't sorted at all. I've sorted all lines now.