StataReader: Support sorting categoricals #8816

Closed
wants to merge 2 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.15.2.txt
@@ -27,6 +27,7 @@ API changes
Enhancements
~~~~~~~~~~~~

- StataReader: Properly support sorting categorical variables read from stata files.

.. _whatsnew_0152.performance:

17 changes: 11 additions & 6 deletions pandas/io/stata.py
@@ -1139,12 +1139,17 @@ def data(self, convert_dates=True, convert_categoricals=True, index=None,
             )[0]
             for i in cols:
                 col = data.columns[i]
-                labeled_data = np.copy(data[col])
-                labeled_data = labeled_data.astype(object)
-                for k, v in compat.iteritems(
-                        self.value_label_dict[self.lbllist[i]]):
-                    labeled_data[(data[col] == k).values] = v
-                data[col] = Categorical.from_array(labeled_data)
+                codes = np.nan_to_num(data[col])
+                codes = codes.astype(int)
Contributor

What happens when the Stata data is float and is partially labeled?

For example

one [1.0]
1.5
two [2.0]
2.5

where [#] indicates the underlying data? I suspect this produces an incorrect result in this case, and would look like

one
one
two
two

in pandas.
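
The concern can be checked directly. As a minimal sketch, the values below are the hypothetical ones from this comment, and the two NumPy calls mirror the `codes = ...` lines added by the patch:

```python
import numpy as np

# Hypothetical column from the comment: 1.0 and 2.0 carry value labels,
# 1.5 and 2.5 do not.
values = np.array([1.0, 1.5, 2.0, 2.5])

# Same two steps as the patch: replace NaN with 0, then cast to int.
codes = np.nan_to_num(values).astype(int)

# The integer cast truncates toward zero, so 1.5 and 2.5 collapse onto 1 and 2.
print(codes.tolist())  # [1, 1, 2, 2]
```

After the cast, the unlabeled rows can no longer be told apart from the labeled ones, which is the "one, one, two, two" outcome described above.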

+                codes = codes-1
+                categories = []
+                labeldict = self.value_label_dict[self.lbllist[i]]
+                for j in range(max(labeldict.keys())):
+                    try:
+                        categories.append(labeldict[j+1])
+                    except:
+                        categories.append(j+1)
+                data[col] = Categorical.from_codes(codes, categories, ordered=True)
Contributor

How do you know (from Stata) that they are ordered? (Is there some kind of flag?)

Contributor

you are iterating over the columns. Going to be really slow. Need a vectorized soln for this.

Contributor Author

As I understand it, ordered=True does not sort the values, just defines the order in which they can be sorted. Otherwise I get "TypeError: Categorical not ordered" when trying to sort the data. Is there a technical reason to not enable this? dta files seem to not define if a variable can be sorted or not.
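
To illustrate the distinction, here is a small sketch with made-up codes and labels. It uses the API of recent pandas releases, which is an assumption relative to the version under review: there, sorting an unordered categorical no longer raises, but order-dependent reductions such as `min` still do.

```python
import pandas as pd

# Hypothetical codes and labels, listed in their intended (non-alphabetical) order.
codes = [2, 0, 1, 0]
categories = ["Poor", "Fair", "Good"]

unordered = pd.Series(pd.Categorical.from_codes(codes, categories=categories))
ordered = pd.Series(
    pd.Categorical.from_codes(codes, categories=categories, ordered=True)
)

# ordered=True does not move any rows; the data stays in its original positions.
print(ordered.tolist())

# Sorting follows the category order rather than the alphabetical order.
print(ordered.sort_values().tolist())

# Reductions that depend on an order are only defined for ordered categoricals.
print(ordered.min())
try:
    unordered.min()
    unordered_min_allowed = True
except TypeError:
    unordered_min_allowed = False
print(unordered_min_allowed)
```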

Contributor Author

I vectorized the loop in PKEuS@c410441 (I will squash the commits later)

Contributor

Well, the point of the ordered flag is to define whether the categories have an order or not. It's an inherent property when creating the Categorical. I am not sure of the Stata semantics w.r.t. ordering. R supports both notions.

Contributor

This should probably be invoked through a flag for StataReader, something like order_categoricals=False, and should be False by default. Always ordering degrades the fidelity of a write-read cycle when the original categories are not ordered/unorderable (e.g. male-female).
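
A rough sketch of how such a flag could behave. The helper `labels_to_categorical` is hypothetical and only illustrates the suggestion; it is not code from this PR or from `StataReader`:

```python
import pandas as pd


def labels_to_categorical(codes, categories, order_categoricals=False):
    """Hypothetical helper: build the labeled column and mark it as ordered
    only when the caller explicitly asks for it."""
    return pd.Categorical.from_codes(
        codes, categories=categories, ordered=order_categoricals
    )


# Default: no ordering is implied for something like male/female.
sex = labels_to_categorical([0, 1, 1], ["male", "female"])

# Opt in when the underlying codes are known to carry an order.
health = labels_to_categorical(
    [2, 0, 1], ["Poor", "Fair", "Good"], order_categoricals=True
)

print(sex.ordered, health.ordered)
```

Either way the category sequence follows the underlying codes; the flag only decides whether pandas treats that sequence as a meaningful order.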

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am inclined to disagree on the default behaviour -- I find losing information is worse than losing speed.

Contributor

I'm not sure I understand which way loses information?

Using ordered=True is adding information that the Stata data file cannot know, and so it is an end user adding non-data-file-based information to the imported data.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you lose the underlying numeric codes from Stata, which is what you end up using all the time when coding in Stata. In most cases, I guess that the codes carry order. That's what you potentially lose.

I actually stumbled across this with the test dataset, where self-reported health came out as an alphabetically-ordered variable.

Contributor

I see your point about the loss.

I would think that there should be a monotonic increasing bijection between the underlying Stata data and the cat.codes, which would mean that cat.codes would always preserve the same information that is in the Stata data. This could be done independently of whether ordered=True is used (so adding it should hopefully be non-controversial).
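
One way such a mapping could be built, as a sketch under assumptions: the data and labels are made up, `np.unique` is my choice of tool rather than anything proposed in this thread, and missing values are not handled here.

```python
import numpy as np
import pandas as pd

# Hypothetical Stata column (partially labeled floats) and its value labels.
values = np.array([2.0, 1.5, 1.0, 2.5, 1.0])
value_labels = {1.0: "one", 2.0: "two"}

# np.unique returns the distinct values in ascending order and, for each row,
# the position of its value in that sorted array. A larger underlying value
# therefore always gets a larger code, and distinct values get distinct codes.
uniques, codes = np.unique(values, return_inverse=True)

# Use the label where one exists; otherwise keep the raw number as the category.
categories = [value_labels.get(v, v) for v in uniques.tolist()]

cat = pd.Categorical.from_codes(codes, categories=categories)

print(codes.tolist())
print(cat.tolist())
```

Because the codes are positions in the sorted unique values, the partially labeled float case from the earlier comment keeps 1.5 and 2.5 as their own categories instead of folding them into "one" and "two".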


if not preserve_dtypes:
retyped_data = []
Binary file added pandas/io/tests/data/stata10_117.dta
Binary file not shown.
9 changes: 9 additions & 0 deletions pandas/io/tests/test_stata.py
@@ -13,6 +13,7 @@

import pandas as pd
from pandas.compat import iterkeys
from pandas.core.categorical import Categorical
from pandas.core.frame import DataFrame, Series
from pandas.io.parsers import read_csv
from pandas.io.stata import (read_stata, StataReader, InvalidColumnName,
@@ -81,6 +82,8 @@ def setUp(self):
self.dta18_115 = os.path.join(self.dirpath, 'stata9_115.dta')
self.dta18_117 = os.path.join(self.dirpath, 'stata9_117.dta')

self.dta19_117 = os.path.join(self.dirpath, 'stata10_117.dta')


def read_dta(self, file):
# Legacy default reader configuration
@@ -744,6 +747,12 @@ def test_drop_column(self):
columns = ['byte_', 'int_', 'long_', 'not_found']
read_stata(self.dta15_117, convert_dates=True, columns=columns)

def test_categorical_sorting(self):
dataset = read_stata(self.dta19_117)
dataset = dataset.sort("srh")
expected = Categorical.from_codes(codes=[-1, -1, 0, 1, 1, 1, 2, 2, 3, 4], categories=["Poor", "Fair", "Good", "Very good", "Excellent"])
tm.assert_equal(True, (np.asarray(expected)==np.asarray(dataset["srh"])).all())
Contributor

Should probably test whether the DataFrames are equal
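
A frame-level comparison could look roughly like this. The sketch uses an in-memory stand-in for the parsed file and the `pandas.testing` module of recent pandas releases, both of which are assumptions relative to the test suite in this PR:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

categories = ["Poor", "Fair", "Good"]

# Stand-in for the frame that read_stata would return (hypothetical data).
parsed = pd.DataFrame(
    {"srh": pd.Categorical.from_codes([2, 0, 1], categories=categories, ordered=True)}
)

# Expected result after sorting: rows in category order, original index labels kept.
expected = pd.DataFrame(
    {"srh": pd.Categorical.from_codes([0, 1, 2], categories=categories, ordered=True)},
    index=[1, 2, 0],
)

# One call checks values, dtype (categories and orderedness), and the index.
assert_frame_equal(parsed.sort_values("srh"), expected)
```

Unlike comparing `np.asarray` views of the column, this also fails if the categories or the ordered flag differ from what is expected.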


if __name__ == '__main__':
nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
exit=False)