StataReader: Support sorting categoricals #8816
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1139,12 +1139,17 @@ def data(self, convert_dates=True, convert_categoricals=True, index=None, | |
)[0] | ||
for i in cols: | ||
col = data.columns[i] | ||
labeled_data = np.copy(data[col]) | ||
labeled_data = labeled_data.astype(object) | ||
for k, v in compat.iteritems( | ||
self.value_label_dict[self.lbllist[i]]): | ||
labeled_data[(data[col] == k).values] = v | ||
data[col] = Categorical.from_array(labeled_data) | ||
codes = np.nan_to_num(data[col]) | ||
codes = codes.astype(int) | ||
codes = codes-1 | ||
categories = [] | ||
labeldict = self.value_label_dict[self.lbllist[i]] | ||
for j in range(max(labeldict.keys())): | ||
try: | ||
categories.append(labeldict[j+1]) | ||
except: | ||
categories.append(j+1) | ||
data[col] = Categorical.from_codes(codes, categories, ordered=True) | ||
how do you know (from stata) that they are ordered? (is there some kind of flag)?
you are iterating over the columns. Going to be really slow. Need a vectorized soln for this.
As I understand it, ordered=True does not sort the values, just defines the order in which they can be sorted. Otherwise I get "TypeError: Categorical not ordered" when trying to sort the data. Is there a technical reason to not enable this? dta files seem to not define if a variable can be sorted or not.
I vectorized the loop in PKEuS@c410441 (I will squash the commits later)
well, the point of the
This should probably be invoked through flag for
I am inclined to disagree on the default behaviour -- I find losing information is worse than losing speed.
I'm not sure I understand which way loses information? Using
Well, you lose the underlying numeric codes from Stata, which is what you end up using all the time when coding in Stata. In most cases, I guess that the codes carry order. That's what you potentially lose. I actually stumbled across this with the test dataset, where self-reported health came out as an alphabetically-ordered variable.
I see your point about the loss. I would think that there should be a monotonic increasing bijection between the underlying Stata data and the |
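The exchange above about ordered=True and the numeric codes can be illustrated with a small sketch. The column values and the label dictionary are made up, and the code targets a current pandas release rather than the version under review, so names such as sort_values are assumptions relative to this PR. It also assumes every raw value has a label; np.searchsorted is just one possible vectorized lookup, not necessarily what the linked commit does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for one Stata column and its value labels.
raw = np.array([3, 1, 5, 2, 1, 4])
labels = {1: "Poor", 2: "Fair", 3: "Good", 4: "Very good", 5: "Excellent"}

# Order the categories by the underlying Stata value, not alphabetically.
keys = sorted(labels)
categories = [labels[k] for k in keys]

# Vectorized lookup: position of each raw value within the sorted keys.
# This is only valid because every value in `raw` is a labelled key.
codes = np.searchsorted(keys, raw)

cat = pd.Categorical.from_codes(codes, categories, ordered=True)

# Sorting follows the category order, i.e. the order of the Stata values.
sorted_labels = list(pd.Series(cat).sort_values())
print(sorted_labels)
```

An alphabetical sort of the same labels would put "Excellent" first, which is the loss of information described in the thread.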
||
|
||
if not preserve_dtypes: | ||
retyped_data = [] | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,6 +13,7 @@ | |
|
||
import pandas as pd | ||
from pandas.compat import iterkeys | ||
from pandas.core.categorical import Categorical | ||
from pandas.core.frame import DataFrame, Series | ||
from pandas.io.parsers import read_csv | ||
from pandas.io.stata import (read_stata, StataReader, InvalidColumnName, | ||
|
@@ -81,6 +82,8 @@ def setUp(self): | |
self.dta18_115 = os.path.join(self.dirpath, 'stata9_115.dta') | ||
self.dta18_117 = os.path.join(self.dirpath, 'stata9_117.dta') | ||
|
||
self.dta19_117 = os.path.join(self.dirpath, 'stata10_117.dta') | ||
|
||
|
||
def read_dta(self, file): | ||
# Legacy default reader configuration | ||
|
@@ -744,6 +747,12 @@ def test_drop_column(self): | |
columns = ['byte_', 'int_', 'long_', 'not_found'] | ||
read_stata(self.dta15_117, convert_dates=True, columns=columns) | ||
|
||
def test_categorical_sorting(self): | ||
dataset = read_stata(self.dta19_117) | ||
dataset = dataset.sort("srh") | ||
expected = Categorical.from_codes(codes=[-1, -1, 0, 1, 1, 1, 2, 2, 3, 4], categories=["Poor", "Fair", "Good", "Very good", "Excellent"]) | ||
tm.assert_equal(True, (np.asarray(expected)==np.asarray(dataset["srh"])).all()) | ||
Should probably test whether the DataFrames are equal |
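A minimal sketch of what a whole-frame comparison could look like. It does not read the test fixture; the frames are built by hand with a hypothetical make_frame helper, and it uses pandas.testing.assert_frame_equal and sort_values from current pandas, which may not match the API available when this PR was written.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

categories = ["Poor", "Fair", "Good", "Very good", "Excellent"]


def make_frame(codes):
    # Hypothetical helper: build a one-column frame holding an ordered categorical.
    srh = pd.Categorical.from_codes(codes, categories, ordered=True)
    return pd.DataFrame({"srh": srh})


# Sort an unsorted frame, then reset the index so it lines up with `expected`.
result = make_frame([2, 0, 4, 1]).sort_values("srh").reset_index(drop=True)
expected = make_frame([0, 1, 2, 4])

# Compares values, dtypes, categories and the ordered flag; raises on mismatch.
assert_frame_equal(result, expected)
```

Compared with checking np.asarray equality, this also catches differences in the categories themselves and in the index.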
||
|
||
if __name__ == '__main__': | ||
nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'], | ||
exit=False) | ||
|
What happens when the Stata data is float and is partially labeled?
For example
where the [#] indicates the underlying data? I suspect this produces the incorrect result in this case, and would look like
in pandas.
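The reviewer's own example did not survive on this page. As a stand-in, the sketch below replays the arithmetic from the diff (nan_to_num, cast to int, subtract one) on made-up float data with a made-up label dictionary. It is an illustration of the concern under those assumptions, not a reproduction of the reviewer's case.

```python
import numpy as np
import pandas as pd

# Hypothetical float column: 1.5 has no label, and the last value is missing.
raw = np.array([1.0, 1.5, 2.0, np.nan])
labeldict = {1: "low", 2: "high"}

# Same steps as the diff: NaN -> 0, truncate to int, shift so label 1 -> code 0.
codes = np.nan_to_num(raw).astype(int) - 1

# Same category construction as the diff, with the bare except narrowed.
categories = []
for j in range(max(labeldict.keys())):
    try:
        categories.append(labeldict[j + 1])
    except KeyError:
        categories.append(j + 1)

cat = pd.Categorical.from_codes(codes, categories, ordered=True)
print(list(cat))
```

The unlabelled value 1.5 is truncated to 1 and comes out as "low", so the original number is lost, while the missing value maps to code -1 and stays missing.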