StataReader: Support sorting categoricals #8816
            categories.append(labeldict[j+1])
        except:
            categories.append(j+1)
    data[col] = Categorical.from_codes(codes, categories, ordered=True)
How do you know (from Stata) that they are ordered? Is there some kind of flag?
You are iterating over the columns, which is going to be really slow. This needs a vectorized solution.
As I understand it, ordered=True does not sort the values; it only defines the order in which they can be sorted. Without it I get "TypeError: Categorical not ordered" when trying to sort the data. Is there a technical reason not to enable this? dta files do not seem to define whether a variable can be sorted.
I vectorized the loop in PKEuS@c410441 (I will squash the commits later)
Well, the point of the ordered flag is to define whether the category has an order or not. It's an inherent property when creating the Categorical. I am not sure of the Stata semantics w.r.t. . R supports both notions.
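To make the role of the flag concrete, here is a minimal sketch with made-up labels and codes (not taken from the PR). It assumes a recent pandas release, where sorting works for both variants and the flag mainly gates order-dependent operations such as min(); the pandas version under review raised the TypeError quoted above when sorting an unordered Categorical.

```python
import pandas as pd

# Codes index into `categories`; the labels are illustrative, listed worst to best.
codes = [2, 0, 1, 0]
categories = ["Poor", "Fair", "Good"]

unordered = pd.Categorical.from_codes(codes, categories, ordered=False)
ordered = pd.Categorical.from_codes(codes, categories, ordered=True)

# The flag leaves the stored data untouched: same codes, same element order.
same_codes = list(unordered.codes) == list(ordered.codes)

# Sorting follows the category positions ("Poor" < "Fair" < "Good"),
# not the alphabetical order of the labels.
sorted_labels = list(pd.Series(ordered).sort_values())

# Order-dependent reductions such as min() are only defined when ordered=True.
lowest = ordered.min()
try:
    unordered.min()
    unordered_min_fails = False
except TypeError:
    unordered_min_fails = True
```

So ordered=True does not rearrange anything; it only declares that the category positions carry meaning.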
This should probably be invoked through a flag for StataReader, something like order_categoricals=False, and should be False by default. Ordering by default degrades the fidelity of a write-read cycle when the original categories are not ordered/unorderable (e.g. male-female).
I am inclined to disagree on the default behaviour -- I find losing information is worse than losing speed.
I'm not sure I understand which way loses information?
Using ordered=True is adding information that the Stata data file cannot know, and so it is an end user adding non-data-file-based information to the imported data.
Well, you lose the underlying numeric codes from Stata, which is what you end up using all the time when coding in Stata. In most cases, I guess that the codes carry order. That's what you potentially lose.
I actually stumbled across this with the test dataset, where self-reported health came out as an alphabetically-ordered variable.
I see your point about the loss.
I would think that there should be a monotonic increasing bijection between the underlying Stata data and the cat.codes, which would mean that cat.codes would always preserve the same information that is in the Stata data. This could be done independently of whether ordered=True is used (so adding it should hopefully be non-controversial).
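A minimal pure-Python sketch of such a monotonic mapping, under the assumption that codes are assigned by ranking the distinct underlying values. The helper name codes_from_stata_values, the column, and the labels are all hypothetical and are not part of pandas or of this PR.

```python
def codes_from_stata_values(values):
    """Map raw values to 0-based codes that preserve their rank (hypothetical helper)."""
    distinct = sorted(set(values))                      # underlying values, ascending
    code_of = {v: i for i, v in enumerate(distinct)}    # value -> code, one-to-one
    return [code_of[v] for v in values], distinct


stata_values = [3, 1, 5, 1, 3]                          # made-up raw column
labels = {1: "Poor", 3: "Good", 5: "Excellent"}         # made-up value labels

codes, distinct = codes_from_stata_values(stata_values)
# Categories are listed in the order of the underlying values, not alphabetically.
categories = [labels[v] for v in distinct]
```

Because the mapping is one-to-one and increasing, comparing two codes gives the same answer as comparing the two original Stata values, whether or not the resulting Categorical is flagged as ordered.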
cc @bashtage
There is no notion of this in Stata. You simply have numeric codes that Stata works with; if there are labels attached to them, read_stata assumes they are Categoricals. So we just have to pick a sensible default. The current implementation of read_stata leads to an ordered categorical where the order is based on the labels (=alphabetical). This does not make any sense. Two reasons why I think that picking an ordered categorical as the default makes sense.
    dataset = read_stata(self.dta19_117)
    dataset = dataset.sort("srh")
    expected = Categorical.from_codes(codes=[-1, -1, 0, 1, 1, 1, 2, 2, 3, 4], categories=["Poor", "Fair", "Good", "Very good", "Excellent"])
    tm.assert_equal(True, (np.asarray(expected)==np.asarray(dataset["srh"])).all())
Should probably test whether the DataFrames are equal
Good point. You wouldn't think of labelled floats, but I guess if you call "compress" in Stata it can easily happen. Add a check whether labels are exhaustive (excluding missing data), if not, just use the underlying values and emit a warning?
The current version does the correct thing IMO - return a partially labeled array, with labels
Agree that alphabetical is certainly a bad choice for a "guessed" order. I would think the best "guess" would be to use the underlying values in the Stata dta, so that a label with value 1 would be < a label with a value 2, and so on.
I don't think people would want to work with a partially labelled variable. Can you think of a use case? In Stata, you use the numeric codes anyhow and the labels are purely for display, so that would be the natural thing IMO.
OK, so the current implementation does not order the Categorical; I assume we should preserve that, as it's easy enough for a user to transform it to an ordered Categorical if needed.
This has produced a lot of errors across a wide userbase and IMO as much as possible the code should follow the pretty well documented dta format, including strange but still (unambiguously) technically correct values.
@bashtage unless you want to offer this as an option when reading?
This is how I would imagine it to be implemented:

    pd.read_stata('my_ordered_data.dta', convert_categoricals=True, order_categoricals=True)

This said, I think @hmgaudecker raised a valid point and that the information encoded in the rank of the Stata data should always be preserved in the rank of the

Preserving this ordinal information would also allow for edge case matching where the Stata file has labeled floats, so that all information (both the underlying float values and the value labels) could be imported with two reads:

    df_labeled = pd.read_stata('my_labeled_floats.dta', convert_categoricals=True)
    df_values = pd.read_stata('my_labeled_floats.dta', convert_categoricals=False)

I would describe the current implementation as incorrect (buggy) since it loses this ordinal information.
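The two-read idea can be tried end to end with a small round trip. This is only a sketch: the column, labels, and file name are made up, and it assumes a current pandas release, where read_stata accepts both convert_categoricals and order_categoricals. At the time of this thread the second keyword was still only a proposal.

```python
import os
import tempfile

import pandas as pd

# Made-up data: a labelled variable whose labels are NOT in alphabetical order.
srh = pd.Categorical(
    ["Good", "Poor", "Excellent"],
    categories=["Poor", "Good", "Excellent"],
    ordered=True,
)
original = pd.DataFrame({"srh": srh})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_labeled_data.dta")  # illustrative file name
    original.to_stata(path, write_index=False)

    # First read: value labels applied, order taken from the underlying codes.
    df_labeled = pd.read_stata(path, convert_categoricals=True, order_categoricals=True)
    # Second read: the raw numeric codes stored in the file.
    df_values = pd.read_stata(path, convert_categoricals=False)

labels_read = df_labeled["srh"].tolist()
categories_read = list(df_labeled["srh"].cat.categories)
values_read = df_values["srh"].tolist()
```

Reading the same file twice keeps both views available: the labels for display and the numeric codes that carry the rank.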
how about
@bashtage on the partially labelled variables: I just don't think that this is a useful thing to work with in Pandas then. In Stata, you only work with numeric codes; you won't feel that they are only partially labelled except for the output. Rather than ending up with a mix of both as the default, I would leave it to the user to construct it by hand from StataReader. @jreback As long as the bijection is there, I am happy. Order is often implicit in Stata datasets.
I think the reader has to produce something from a call to
A correct implementation of this method should be simple by augmenting the value label dictionary
and then the same code can be used for fully labeled or partially labeled.
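One way to read the "augment the value label dictionary" suggestion is sketched below in plain Python. The helper augment_value_labels and the sample data are hypothetical; the sketch simply assumes that every unlabelled value is given itself as its label, so that one code path handles both the fully and the partially labelled case.

```python
def augment_value_labels(values, value_labels):
    """Return a copy of value_labels in which every observed value has an entry.

    Unlabelled values map to themselves (hypothetical helper, not pandas API).
    """
    augmented = dict(value_labels)
    for v in set(values):
        augmented.setdefault(v, v)
    return augmented


values = [1, 2, 3, 2, 4]               # made-up raw column
value_labels = {1: "low", 4: "high"}   # only two of the four values are labelled

augmented = augment_value_labels(values, value_labels)
# Categories follow the underlying numeric order, mixing labels and raw values.
categories = [augmented[v] for v in sorted(augmented)]
```

With a fully labelled column the helper changes nothing, so the same downstream code applies in both cases.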
I agree. I think this should get a bug label too, since discarding ordinal information is lossy.
@bashtage @hmgaudecker are you saying that effectively Stata has categories like
You can have a Stata data file that looks like
which is applying three labels ,
Stata has numeric values, which are the unit of operation for any code. Then it has labels that it uses purely for displaying output.
That seems insane. Is that actually useful? Partial labels? How then do you know '1' is not a label?
If it is a string then it is a label; if it is a number it is not. I am not claiming that these are useful - I am only claiming that they are supported and documented in the dta specification. As a result, they should be handled on a best-effort basis. The current test suite explicitly tests this case.
Seems "intuitive" to me. OK, no problem. You can handle a partially labelled categorical however seems clear, then.
Well, as I said - labels are purely used for producing views, they do not
I think from this TL;DR thread the conclusion was to add:
as additional arguments (there is still debate over whether the ordering should be true or false by default, though). Whoever wants to do this (@PKEuS, @bashtage, @hmgaudecker): go ahead.
I took a stab but am not totally sure of how a Category works. Does a category assign

    if convert_categoricals and self.value_label_dict:
        value_labels = list(compat.iterkeys(self.value_label_dict))
        cat_converted_data = []
        for col, label in zip(data, self.lbllist):
            if label in value_labels:
                cat_data = data[col].copy().astype('category')
                value_label_dict = self.value_label_dict[label]
                categories = []
                for category in cat_data.cat.categories:
                    if category in value_label_dict:
                        categories.append(value_label_dict[category])
                    else:
                        categories.append(category)  # Partially labeled
                cat_data.cat.categories = categories
                cat_converted_data.append((col, cat_data))
            else:
                cat_converted_data.append((col, data[col]))
        data = DataFrame.from_items(cat_converted_data)
they are both 'assigned' using
The sort has to do with the codes and not the lexical order of the categories.
closing in favor of #8836