StataReader: Support sorting categoricals #8816

Closed
PKEuS wants to merge 2 commits

Contributor

@PKEuS PKEuS commented Nov 14, 2014

No description provided.

categories.append(labeldict[j+1])
except:
categories.append(j+1)
data[col] = Categorical.from_codes(codes, categories, ordered=True)
Contributor

How do you know (from Stata) that they are ordered? Is there some kind of flag?

Contributor

You are iterating over the columns, which is going to be really slow. This needs a vectorized solution.

Contributor Author

As I understand it, ordered=True does not sort the values; it only defines the order in which they can be sorted. Without it I get "TypeError: Categorical not ordered" when trying to sort the data. Is there a technical reason not to enable this? dta files do not seem to define whether a variable can be sorted.
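To illustrate the distinction drawn in this comment, here is a minimal sketch with made-up codes and labels (they are not taken from the PR's test file). It assumes a recent pandas, where sorting an unordered Categorical may no longer raise, but order-dependent reductions such as min() still raise TypeError.

```python
import pandas as pd

# Hypothetical codes and labels, for illustration only.
codes = [2, 0, 1, 0]
categories = ["Poor", "Fair", "Good"]

unordered = pd.Categorical.from_codes(codes, categories, ordered=False)
ordered = pd.Categorical.from_codes(codes, categories, ordered=True)

# The flag does not move any values: both hold the same data in the same positions.
assert list(unordered) == list(ordered) == ["Good", "Poor", "Fair", "Poor"]

# Order-dependent reductions are only defined for the ordered variant.
try:
    unordered.min()
    unordered_min_error = None
except TypeError as exc:
    unordered_min_error = exc

lowest = ordered.min()   # the first category in the declared order
highest = ordered.max()  # the last category in the declared order
```

The ordering comes from the position of each label in `categories`, not from the labels' alphabetical order.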

Contributor Author

I vectorized the loop in PKEuS@c410441 (I will squash the commits later)

Contributor

Well, the point of the ordered flag is to define whether the category has an order or not. It's an inherent property when creating the Categorical. I am not sure of the Stata semantics here. R supports both notions.

Contributor

This should probably be invoked through a flag for StataReader, something like order_categoricals=False, and should be False by default. Forcing ordered=True degrades the fidelity of a write-read cycle when the original categories are not ordered or cannot be ordered (e.g. male-female).

I am inclined to disagree on the default behaviour -- I find losing information is worse than losing speed.

Contributor

I'm not sure I understand which way loses information?

Using ordered=True is adding information that the Stata data file cannot know, and so it is an end user adding non-data-file-based information to the imported data.

Well, you lose the underlying numeric codes from Stata, which is what you end up using all the time when coding in Stata. In most cases, I guess that the codes carry order. That's what you potentially lose.

I actually stumbled across this with the test dataset, where self-reported health came out as an alphabetically-ordered variable.

Contributor

I see your point about the loss.

I would think that there should be a monotonically increasing bijection between the underlying Stata data and the cat.codes, which would mean that cat.codes always preserves the same information that is in the Stata data. This could be done independently of whether ordered=True is used (so adding it should hopefully be non-controversial).
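A minimal sketch of such a mapping follows. The names `stata_values` and `value_labels` are hypothetical stand-ins for one column and its value-label table; this is not the reader's actual implementation, only one way the bijection could be built.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for one Stata column and its value-label mapping.
stata_values = np.array([3, 1, 5, 1, 4, 2])
value_labels = {1: "Poor", 2: "Fair", 3: "Good", 4: "Very good", 5: "Excellent"}

# np.unique returns the distinct values sorted ascending, plus each element's
# position in that sorted array; those positions serve as the codes.
distinct, codes = np.unique(stata_values, return_inverse=True)
categories = [value_labels[int(v)] for v in distinct]

# ordered=False on purpose: the rank information lives in the codes either way.
srh = pd.Categorical.from_codes(codes, categories, ordered=False)

# The original values can be recovered from the codes alone.
recovered = distinct[srh.codes]
```

Because the codes index into the ascending array of distinct values, a smaller Stata value always gets a smaller code, whether or not the result is flagged as ordered.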

Contributor

jreback commented Nov 14, 2014

cc @bashtage

@jreback added the Categorical and IO Stata labels Nov 14, 2014
@hmgaudecker

There is no notion of this in Stata. You simply have numeric codes that Stata works with; if labels are attached to them, read_stata assumes they are Categoricals. So we just have to pick a sensible default.

The current implementation of read_stata leads to an ordered categorical where the order is based on the labels (=alphabetical). This does not make any sense.

There are two reasons why I think picking an ordered categorical as the default makes sense.

  1. In my experience, the vast majority of variables in the fields Stata is most used in will be ordered.
  2. More compelling: if we assume unordered, the underlying numeric codes are lost, and it is pretty clumsy to go back and find out what they are in the Stata dataset. But it should be easy to turn an ordered variable into an unordered one. [Related to this, it should also be easy to reverse the order. Are there methods for these two cases, or would it be easy to add them? This should come up all the time when reading in data automatically rather than constructing it from scratch.]
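For the two conversions asked about in the second point, current pandas offers accessor methods on categorical Series. The sketch below assumes a recent pandas release (these methods may not have existed when this thread was written) and uses made-up data.

```python
import pandas as pd

# Made-up ordered variable, for illustration only.
srh = pd.Series(
    pd.Categorical(
        ["Good", "Poor", "Fair"],
        categories=["Poor", "Fair", "Good"],
        ordered=True,
    )
)

# Drop the ordering but keep categories and codes unchanged.
unordered = srh.cat.as_unordered()

# Reverse the ordering: same values, opposite ranking of the categories.
reversed_srh = srh.cat.reorder_categories(
    list(srh.cat.categories[::-1]), ordered=True
)
```

Both calls return new Series and leave `srh` untouched.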

dataset = read_stata(self.dta19_117)
dataset = dataset.sort("srh")
expected = Categorical.from_codes(codes=[-1, -1, 0, 1, 1, 1, 2, 2, 3, 4], categories=["Poor", "Fair", "Good", "Very good", "Excellent"])
tm.assert_equal(True, (np.asarray(expected)==np.asarray(dataset["srh"])).all())
Contributor

Should probably test whether the DataFrames are equal

@hmgaudecker

Good point. You wouldn't think of labelled floats, but I guess if you call "compress" in Stata it can easily happen.

Add a check whether labels are exhaustive (excluding missing data), if not, just use the underlying values and emit a warning?

@bashtage
Contributor

The current version does the correct thing IMO - return a partially labeled array, with labels "one", 1.5, "two", and 2.5 (where the #s are floats). Of course, the codes are always int, but this is probably the best approximation to the underlying data.

@bashtage
Contributor

> The current implementation of read_stata leads to an ordered categorical where the order is based on the labels (=alphabetical). This does not make any sense.

Agree that alphabetical is certainly a bad choice for a "guessed" order.

I would think the best "guess" would be to use the underlying values in the Stata dta, so that a label with value 1 would be < a label with a value 2, and so on.

@hmgaudecker

I don't think people would want to work with a partially labelled variable. Can you think of a use case? In Stata, you use the numeric codes anyhow and the labels are purely for display, so that would be the natural thing IMO.

Contributor

jreback commented Nov 14, 2014

OK, so the current implementation does not order the Categorical. I'd say let's preserve that, as it's easy enough for a user to transform it to an ordered Categorical if needed.

In [6]: s = pd.Categorical(list('aabbcdedfab'),ordered=False)     

In [7]: s
Out[7]: 
[a, a, b, b, c, ..., e, d, f, a, b]
Length: 11
Categories (6, object): [a, b, c, d, e, f]

In [8]: s.ordered
Out[8]: False

In [9]: s = pd.Categorical(s,ordered=True)

In [10]: s
Out[10]: 
[a, a, b, b, c, ..., e, d, f, a, b]
Length: 11
Categories (6, object): [a < b < c < d < e < f]

In [11]: s.ordered
Out[11]: True

@bashtage
Contributor

> I don't think people would want to work with a partially labelled variable. Can you think of a use case? In Stata, you use the numeric codes anyhow and the labels are purely for display, so that would be the natural thing IMO.

StataReader and StataWriter have historically avoided uncommon edge cases.

This has produced a lot of errors across a wide userbase. IMO the code should follow the well-documented dta format as closely as possible, including strange but still (unambiguously) technically correct values.

Contributor

jreback commented Nov 14, 2014

@bashtage unless you want to offer this as an option when reading?

@bashtage
Contributor

> unless you want to offer this as an option when reading?

This is how I would imagine it to be implemented

pd.read_stata('my_ordered_data.dta', convert_categoricals=True, order_categoricals=True)

That said, I think @hmgaudecker raised a valid point: the information encoded in the rank of the Stata data should always be preserved in the rank of the cat.codes, irrespective of whether the categoricals are returned as ordered or not.

Preserving this ordinal information would also allow for edge case matching where the Stata file has labeled floats so that all information -- including both the underlying float values and the value labels -- could be imported with two reads:

df_labeled = pd.read_stata('my_labeled_floats.dta', convert_categoricals=True)
df_values = pd.read_stata('my_labeled_floats.dta', convert_categoricals=False)

I would describe the current implementation as incorrect (buggy) since it loses this ordinal information.

Contributor

jreback commented Nov 14, 2014

How about order_categoricals=False as the default? What I gather from the above conversation is that the order is 'undefined'.

@hmgaudecker

@bashtage on the partially labelled variables: I just don't think that this is a useful thing to work with in pandas. In Stata, you only work with numeric codes, so you won't notice that they are only partially labelled except in the output. Rather than ending up with a mix of both as the default, I would leave it to the user to construct it by hand from StataReader.

@jreback As long as the bijection is there, I am happy. Order is often implicit in Stata datasets.

@bashtage
Contributor

> Rather than ending up with a mix of both as the default, I would leave it to the user to construct it by hand from StataReader.

I think the reader has to produce something from a call to read_stata for this case - I honestly believe the highest fidelity default is to use the label for the category where there is a label, and use the Stata value when it is unlabeled. What options are there?

  • Raising, which is not a good choice since this is a valid dta file
  • Refusing to convert mixed data, which is OK
  • Returning a categorical with labels where available, and values where not

A correct implementation should be simple: augment the value label dictionary

for value in np.setdiff1d(np.unique(stata_values), list(label_dict.keys())):
    label_dict[value] = value

and then the same code can be used for fully labeled or partially labeled.
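A self-contained sketch of that idea, assuming hypothetical inputs `stata_values` (one column's raw values) and `label_dict` (its value labels); how the reader actually stores these is not shown in this thread.

```python
import numpy as np
import pandas as pd

# Hypothetical column: values 3, 4 and 6 carry labels, the rest do not.
stata_values = np.array([1, 2, 3, 4, 5, 6])
label_dict = {3: "three", 4: "four", 6: "six"}

# Give every unlabeled value itself as its "label".
for value in np.setdiff1d(np.unique(stata_values), list(label_dict.keys())):
    label_dict[int(value)] = int(value)

# From here on the fully labeled and partially labeled cases look identical:
# codes follow the ascending order of the underlying values.
distinct, codes = np.unique(stata_values, return_inverse=True)
categories = [label_dict[int(v)] for v in distinct]
partially_labeled = pd.Categorical.from_codes(codes, categories)
```

The resulting categories mix strings and numbers, which is the "labels where available, values where not" option from the list above.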

@bashtage
Contributor

> @jreback As long as the bijection is there, I am happy. Order is often implicit in Stata datasets.

I agree.

I think this should get a bug label too, since discarding ordinal information is lossy.

Contributor

jreback commented Nov 14, 2014

@bashtage @hmgaudecker are you saying that Stata effectively allows categories like [1,'foo'], i.e. mixed ones which don't actually mean anything?

@bashtage
Contributor

> are you saying that Stata effectively allows categories like [1,'foo'], i.e. mixed ones which don't actually mean anything?

You can have a Stata data file that looks like

1
2
three
four
5
six

which applies three labels: 6->six, 3->three and 4->four. This is a "partially" labeled series.

@hmgaudecker

Stata has numeric values, which are the unit of operation for any code.

Then it has labels that it uses purely for displaying output.

Contributor

jreback commented Nov 14, 2014

That seems insane. Is that actually useful? Partial labels? How then do you know '1' is not a label?

@bashtage
Contributor

> That seems insane. Is that actually useful? Partial labels? How then do you know '1' is not a label?

If it is a string then it is a label; if it is a number it is not.

I am not claiming that these are useful - I am only claiming that they are supported and documented in the dta specification. As a result, they should be handled on a best-effort basis.

The current test suite explicitly tests this case.

Contributor

jreback commented Nov 14, 2014

Seems "intuitive" to me. OK, no problem. You can handle a partially labelled categorical however seems clearest, then.

@hmgaudecker

Well, as I said - labels are purely used for producing views, they do not
carry any meaning for Stata itself. Very different model, but that's how
things worked in the 80's, I guess.

Contributor

jreback commented Nov 16, 2014

I think from this TL;DR thread the conclusion was to add:

pd.read_stata('my_ordered_data.dta', convert_categoricals=True, order_categoricals=True)

as additional arguments (there is still debate over whether the ordering should be true or false by default, though).
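For readers on a pandas version whose read_stata accepts an order_categoricals parameter, the round trip below sketches what the two settings return. The data and file name are made up, and the example assumes to_stata writes the categorical column as a labeled variable.

```python
import os
import tempfile

import pandas as pd

# Made-up data: one ordered categorical column.
df = pd.DataFrame(
    {
        "srh": pd.Categorical(
            ["Good", "Poor", "Fair"],
            categories=["Poor", "Fair", "Good"],
            ordered=True,
        )
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_ordered_data.dta")
    df.to_stata(path, write_index=False)

    # Same file, read once with each setting of the flag.
    as_ordered = pd.read_stata(
        path, convert_categoricals=True, order_categoricals=True
    )
    as_unordered = pd.read_stata(
        path, convert_categoricals=True, order_categoricals=False
    )
```

In both results the codes follow the underlying Stata values; only the ordered flag on the returned Categorical differs.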

Whoever wants to do this (@PKEuS, @bashtage, @hmgaudecker), go ahead.

@bashtage
Contributor

I took a stab but am not totally sure of how a Category works. Does a category assign codes to a numeric array in ascending order (except np.nan which is -1)? Or is ordered required for this to happen?

        if convert_categoricals and self.value_label_dict:
            value_labels = list(compat.iterkeys(self.value_label_dict))
            cat_converted_data = []
            for col, label in zip(data, self.lbllist):
                if label in value_labels:
                    cat_data = data[col].copy().astype('category')
                    value_label_dict = self.value_label_dict[label]
                    categories = []
                    for category in cat_data.cat.categories:
                        if category in value_label_dict:
                            categories.append(value_label_dict[category])
                        else:
                            categories.append(category)  # Partially labeled
                    cat_data.cat.categories = categories
                    cat_converted_data.append((col, cat_data))
                else:
                    cat_converted_data.append((col, data[col]))
            data = DataFrame.from_items(cat_converted_data)

Contributor

jreback commented Nov 17, 2014

they are both 'assigned' using pd.factorize (whether ordered or not)
This all assumes that the user does not pass in the codes.

In [2]: pd.factorize(list('ddbcab'),sort=True)
Out[2]: (array([3, 3, 1, 2, 0, 1]), array(['a', 'b', 'c', 'd'], dtype=object))

In [3]: pd.factorize(list('ddbcab'),sort=False)
Out[3]: (array([0, 0, 1, 2, 3, 1]), array(['d', 'b', 'c', 'a'], dtype=object))

The sort option determines how the codes are assigned; it is not about a lexical ordering of the categories as such.
Notice that with sort=False the codes are assigned in the order in which the values are passed.
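Applied to the numeric case asked about earlier in the thread, a small sketch with made-up float values: with sort=True the codes follow ascending value order and NaN gets the sentinel -1, and constructing a Categorical without explicit categories gives the same codes without requiring ordered=True.

```python
import numpy as np
import pandas as pd

# Made-up numeric values with one missing entry.
values = np.array([3.0, 1.0, np.nan, 2.0, 1.0])

# sort=True: uniques are sorted ascending, codes index into them, NaN -> -1.
codes, uniques = pd.factorize(values, sort=True)

# Inferring categories from sortable values also sorts them,
# even though the result is not flagged as ordered.
cat = pd.Categorical(values)
```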

Contributor

jreback commented Nov 18, 2014

closing in favor of #8836

@jreback jreback closed this Nov 18, 2014