StataReader: Support sorting categoricals #8816

PKEuS · 2014-11-14T13:07:03Z

No description provided.

jreback · 2014-11-14T13:13:01Z

pandas/io/stata.py

+                        categories.append(labeldict[j+1])
+                    except:
+                        categories.append(j+1)
+                data[col] = Categorical.from_codes(codes, categories, ordered=True)


how do you know (from Stata) that they are ordered? (is there some kind of flag)?

you are iterating over the columns. Going to be really slow. Need a vectorized soln for this.

As I understand it, ordered=True does not sort the values, just defines the order in which they can be sorted. Otherwise I get "TypeError: Categorical not ordered" when trying to sort the data. Is there a technical reason to not enable this? dta files seem to not define if a variable can be sorted or not.
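A minimal sketch of the distinction described above, assuming a current pandas release (the 0.15-era code under review may behave differently) and made-up codes and labels. The `ordered` flag leaves codes and categories untouched; it only decides whether order-dependent operations such as `min()` are allowed.

```python
import pandas as pd

# Hypothetical codes and labels, not taken from any real dta file.
codes = [2, 0, 1, 0]
categories = ["Poor", "Fair", "Good"]

unordered = pd.Categorical.from_codes(codes, categories, ordered=False)
ordered = pd.Categorical.from_codes(codes, categories, ordered=True)

# The flag does not rearrange anything: both objects hold the same codes.
same_codes = bool((unordered.codes == ordered.codes).all())

# Order-dependent reductions are refused on an unordered Categorical.
try:
    unordered.min()
    unordered_min_allowed = True
except TypeError:
    unordered_min_allowed = False

# On the ordered one, "smallest" means "first in the category list".
ordered_min = ordered.min()
```

Here `ordered_min` is the category with the lowest code among the values present, not the alphabetically first label.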

I vectorized the loop in PKEuS@c410441 (I will squash the commits later)

well, the point of the ordered flag is to define whether the category has an order or not. It's an inherent property when creating the Categorical. I am not sure of the Stata semantics w.r.t. ordering. R supports both notions.

This should probably be invoked through a flag for StataReader, something like order_categoricals=False, and should be False by default. Forcing an order degrades the fidelity of a write-read cycle when the original categories are not ordered/unorderable (e.g. male-female).

I am inclined to disagree on the default behaviour -- I find losing information is worse than losing speed.

I'm not sure I understand which way loses information?

Using ordered=True is adding information that the Stata data file cannot know, and so it is an end user adding non-data-file-based information to the imported data.

Well, you lose the underlying numeric codes from Stata, which is what you end up using all the time when coding in Stata. In most cases, I guess that the codes carry order. That's what you potentially lose.

I actually stumbled across this with the test dataset, where self-reported health came out as an alphabetically-ordered variable.

I see your point about the loss.

I would think that there should be a monotonic increasing bijection between the underlying Stata data and the cat.codes, which would mean that cat.codes would always preserve the same information that is in the Stata data. This could be done independently of whether ordered=True is used (so adding it should hopefully be non-controversial).
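One way the monotonic bijection could look, as a sketch only: the array `stata_values` and the mapping `value_labels` are hypothetical stand-ins for what a reader would pull out of a dta file, and this is not the code that was eventually merged. Codes are assigned by the rank of the underlying Stata value, so the ordinal information survives whether or not `ordered=True` is set.

```python
import numpy as np
import pandas as pd

# Hypothetical Stata column and its value labels (values need not be contiguous).
stata_values = np.array([10, 30, 20, 50, 10])
value_labels = {10: "Poor", 20: "Fair", 30: "Good", 50: "Excellent"}

# np.unique returns the distinct values sorted ascending, so the position of
# each value in that array is a monotonically increasing code.
unique_values = np.unique(stata_values)
codes = np.searchsorted(unique_values, stata_values)
categories = [value_labels[int(v)] for v in unique_values]

cat = pd.Categorical.from_codes(codes, categories)  # unordered by default

# Sorting by code gives the same permutation as sorting by Stata value.
rank_preserved = bool(
    (np.argsort(stata_values, kind="stable") == np.argsort(cat.codes, kind="stable")).all()
)
```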

jreback · 2014-11-14T13:13:10Z

cc @bashtage

hmgaudecker · 2014-11-14T14:06:02Z

There is no notion of this in Stata. You simply have numeric codes that Stata works with, if there are labels attached to them, read_stata assumes they are Categoricals. So we just have to pick a sensible default.

The current implementation of read_stata leads to an ordered categorical where the order is based on the labels (=alphabetical). This does not make any sense.

Two reasons why I think that picking an ordered categorical as the default makes sense:

1. In my experience, the vast majority of variables in the fields Stata is most used in will be ordered.
2. More compelling: if we assume unordered, the underlying numeric codes will be lost, and it is pretty clumsy to go back and find out what they are in the Stata dataset. But it should be easy to turn an ordered variable into an unordered one. [Related to this, it should also be easy to reverse the order. Are there methods for these two cases, or would it be easy to add them? This should pop up all the time when reading in data automatically rather than constructing it from scratch.]
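For the two cases asked about (dropping the order and reversing it), current pandas releases offer `as_unordered()` and `reorder_categories()`. These may not have existed in the version discussed in this thread, so treat the following as a sketch against a recent pandas, with invented example values.

```python
import pandas as pd

# Hypothetical ordered variable, e.g. self-reported health.
srh = pd.Categorical(
    ["Good", "Poor", "Fair"],
    categories=["Poor", "Fair", "Good"],
    ordered=True,
)

# Case 1: turn an ordered variable into an unordered one.
unordered = srh.as_unordered()

# Case 2: reverse the order while keeping the values themselves unchanged.
reversed_order = srh.reorder_categories(list(srh.categories[::-1]), ordered=True)
```

After the reversal, `reversed_order.min()` is the category that used to be the largest.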

bashtage · 2014-11-14T14:15:51Z

pandas/io/tests/test_stata.py

+        dataset = read_stata(self.dta19_117)
+        dataset = dataset.sort("srh")
+        expected = Categorical.from_codes(codes=[-1, -1, 0, 1, 1, 1, 2, 2, 3, 4], categories=["Poor", "Fair", "Good", "Very good", "Excellent"])
+        tm.assert_equal(True, (np.asarray(expected)==np.asarray(dataset["srh"])).all())


Should probably test whether the DataFrames are equal
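A sketch of why comparing whole DataFrames is stricter than comparing `np.asarray` output, assuming a recent pandas with `pandas.testing.assert_frame_equal` and invented data. Two columns can hold the same visible values while their categories, and therefore their codes, are arranged differently.

```python
import pandas as pd
import pandas.testing as pdt

labels = ["Poor", "Fair", "Good"]

# Categories in "Stata value" order versus alphabetical order.
by_value = pd.DataFrame(
    {"srh": pd.Categorical.from_codes([0, 1, 2, 1], categories=labels)}
)
by_alphabet = pd.DataFrame(
    {"srh": pd.Categorical(["Poor", "Fair", "Good", "Fair"], categories=sorted(labels))}
)

# An element-wise comparison of the values sees no difference.
values_match = list(by_value["srh"]) == list(by_alphabet["srh"])

# The frame-level assertion also compares the categorical internals.
try:
    pdt.assert_frame_equal(by_value, by_alphabet)
    frames_equal = True
except AssertionError:
    frames_equal = False
```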

hmgaudecker · 2014-11-14T14:28:05Z

Good point. You wouldn't think of labelled floats, but I guess if you call "compress" in Stata it can easily happen.

Add a check whether labels are exhaustive (excluding missing data), if not, just use the underlying values and emit a warning?

bashtage · 2014-11-14T14:31:39Z

The current version does the correct thing IMO - return a partially labeled array, with labels "one", 1.5, "two", and 2.5 (where the #s are floats). Of course, the codes are always int, but this is probably the best approximation to the underlying data.

bashtage · 2014-11-14T14:35:09Z

The current implementation of read_stata leads to an ordered categorical where the order is based on the labels (=alphabetical). This does not make any sense.

Agree that alphabetical is certainly a bad choice for a "guessed" order.

I would think the best "guess" would be to use the underlying values in the Stata dta, so that a label with value 1 would be < a label with a value 2, and so on.

hmgaudecker · 2014-11-14T14:38:05Z

I don't think people would want to work with a partially labelled variable. Can you think of a use case? In Stata, you use the numeric codes anyhow and the labels are purely for display, so that would be the natural thing IMO.

jreback · 2014-11-14T14:43:58Z

ok, so the current implementation does not order the Categorical; I assume we should preserve that, as it's easy enough to have a user transform to an ordered Categorical if needed.

In [6]: s = pd.Categorical(list('aabbcdedfab'),ordered=False)     

In [7]: s
Out[7]: 
[a, a, b, b, c, ..., e, d, f, a, b]
Length: 11
Categories (6, object): [a, b, c, d, e, f]

In [8]: s.ordered
Out[8]: False

In [9]: s = pd.Categorical(s,ordered=True)

In [10]: s
Out[10]: 
[a, a, b, b, c, ..., e, d, f, a, b]
Length: 11
Categories (6, object): [a < b < c < d < e < f]

In [11]: s.ordered
Out[11]: True

bashtage · 2014-11-14T14:44:19Z

I don't think people would want to work with a partially labelled variable. Can you think of a use case? In Stata, you use the numeric codes anyhow and the labels are purely for display, so that would be the natural thing IMO.

StataReader and StataWriter have historically avoided uncommon edge cases.

This has produced a lot of errors across a wide userbase and IMO as much as possible the code should follow the pretty well documented dta format, including strange but still (unambiguously) technically correct values.

jreback · 2014-11-14T14:44:56Z

@bashtage unless you want to offer this an option when reading?

bashtage · 2014-11-14T14:47:23Z

unless you want to offer this an option when reading?

This is how I would imagine it to be implemented

pd.read_stata('my_ordered_data.dta', convert_categoricals=True, order_categoricals=True)

This said, I think @hmgaudecker raised a valid point and that the information encoded in the rank of the Stata data should always be preserved in the rank of the cat.codes, irrespective of whether the categoricals are returned as ordered or not.

Preserving this ordinal information would also allow for edge case matching where the Stata file has labeled floats so that all information -- including both the underlying float values and the value labels -- could be imported with two reads:

df_labeled = pd.read_stata('my_labeled_floats.dta', convert_categoricals=True)
df_values = pd.read_stata('my_labeled_floats.dta', convert_categoricals=False)

I would describe the current implementation as incorrect (buggy) since it loses this ordinal information.
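The two-read pattern can be tried end to end with a round trip through a temporary file. This sketch assumes a recent pandas in which `read_stata` accepts both `convert_categoricals` and `order_categoricals`; the column name and labels are invented, and the file is written by `to_stata` rather than by Stata itself.

```python
import os
import tempfile

import pandas as pd

# Hypothetical labeled column; to_stata stores the codes plus value labels.
df = pd.DataFrame(
    {
        "srh": pd.Categorical(
            ["Good", "Poor", "Fair", "Poor"],
            categories=["Poor", "Fair", "Good"],
        )
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")
    df.to_stata(path, write_index=False)

    # Read 1: labels applied, returned as an unordered Categorical.
    df_labeled = pd.read_stata(path, convert_categoricals=True, order_categoricals=False)
    # Read 2: the underlying numeric values, with no labels applied.
    df_values = pd.read_stata(path, convert_categoricals=False)
```

Comparing `df_labeled["srh"].cat.codes` with `df_values["srh"]` is then a direct check of whether the rank of the stored values is preserved.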

jreback · 2014-11-14T14:56:09Z

how about order_categoricals=False as the default? What I gather from the above conversation is that the order is 'undefined'.

hmgaudecker · 2014-11-14T14:59:42Z

@bashtage on the partially labelled variables: I just don't think that this is a useful thing to work with in pandas then. In Stata, you only work with numeric codes; you won't notice that they are only partially labelled except in the output. Rather than ending up with a mix of both as the default, I would leave it to the user to construct it by hand from StataReader.

@jreback As long as the bijection is there, I am happy. Order is often implicit in Stata datasets.

bashtage · 2014-11-14T15:14:02Z

Rather than ending up with a mix of both as the default, I would leave it to the user to construct it by hand from StataReader.

I think the reader has to produce something from a call to read_stata for this case - I honestly believe the highest fidelity default is to use the label for the category where there is a label, and use the Stata value when it is unlabeled. What options are there?

1. Raising, which is not a good choice since this is a valid dta file
2. Refusing to convert mixed data, which is OK
3. Returning a categorical with labels where available, and values where not

A correct implementation of this method should be simple by augmenting the value label dictionary

for value in np.setdiff1d(np.unique(stata_values), list(label_dict.keys())):
    label_dict[value]=value

and then the same code can be used for fully labeled or partially labeled.
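A self-contained sketch of the augmentation idea, using invented data that mirrors a partially labeled column. `stata_values` and `label_dict` are hypothetical names for the raw column and its value labels; the rest shows one possible way to build the Categorical afterwards, not the implementation that was merged.

```python
import numpy as np
import pandas as pd

# Hypothetical column where only 3, 4 and 6 carry value labels.
stata_values = np.array([1, 2, 3, 4, 5, 6, 3])
label_dict = {3: "three", 4: "four", 6: "six"}

# Augment a copy of the dictionary: unlabeled values become their own label.
full_labels = dict(label_dict)
for value in np.setdiff1d(np.unique(stata_values), list(full_labels.keys())):
    full_labels[int(value)] = int(value)

# From here on, fully and partially labeled columns take the same path.
unique_values = np.unique(stata_values)
codes = np.searchsorted(unique_values, stata_values)
categories = [full_labels[int(v)] for v in unique_values]
cat = pd.Categorical.from_codes(codes, categories)
```

The resulting categories mix integers and strings, which is the "labels where available, values where not" option from the list above.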

bashtage · 2014-11-14T15:14:31Z

@jreback As long as the bijection is there, I am happy. Order is often implicit in Stata datasets.

I agree.

I think this should get a bug label too, since discarding ordinal information is lossy.

jreback · 2014-11-14T15:20:12Z

@bashtage @hmgaudecker are you saying that in Stata categories like [1,'foo'] are effectively possible? i.e. mixed ones, which don't actually mean anything?

bashtage · 2014-11-14T15:22:10Z

are you saying that in Stata categories like [1,'foo'] are effectively possible? i.e. mixed ones, which don't actually mean anything?

You can have a Stata data file that looks like

1
2
three
four
5
six

which applies three labels: 6->six, 3->three and 4->four. A "partially" labeled series.

hmgaudecker · 2014-11-14T15:22:37Z

Stata has numeric values, which are the unit of operation for any code.

Then it has labels that it uses purely for displaying output.

jreback · 2014-11-14T15:23:48Z

that seems insane. Is that actually useful? partially labels? how then do you know '1' is not a label?

bashtage · 2014-11-14T15:25:33Z

that seems insane. Is that actually useful? partially labels? how then do you know '1' is not a label?

If it is a string then it is a label; if it is a number it is not.

I am not claiming that these are useful - I am only claiming that they are supported and documented in the dta specification. As a result, they should be handled on a best-effort basis.

The current test suite explicitly tests this case.

jreback · 2014-11-14T15:26:43Z

seems "intuitive" to me. ok np. Handle a partially labelled categorical however seems clearest, then.

hmgaudecker · 2014-11-14T15:27:32Z

Well, as I said - labels are purely used for producing views, they do not
carry any meaning for Stata itself. Very different model, but that's how
things worked in the 80's, I guess.

jreback · 2014-11-16T19:29:39Z

I think from this TL;DR thread the conclusion was to add:

pd.read_stata('my_ordered_data.dta', convert_categoricals=True, order_categoricals=True)

as additional arguments (debate over whether the ordering should be true or false by default though).

whoever wants to do this: @PKEuS , @bashtage , @hmgaudecker go ahead

bashtage · 2014-11-16T23:01:54Z

I took a stab but am not totally sure of how a Categorical works. Does a Categorical assign codes to a numeric array in ascending order (except np.nan which is -1)? Or is ordered required for this to happen?

        if convert_categoricals and self.value_label_dict:
            value_labels = list(compat.iterkeys(self.value_label_dict))
            cat_converted_data = []
            for col, label in zip(data, self.lbllist):
                if label in value_labels:
                    cat_data = data[col].copy().astype('category')
                    value_label_dict = self.value_label_dict[label]
                    categories = []
                    for category in cat_data.cat.categories:
                        if category in value_label_dict:
                            categories.append(value_label_dict[category])
                        else:
                            categories.append(category)  # Partially labeled
                    cat_data.cat.categories = categories
                    cat_converted_data.append((col, cat_data))
                else:
                    cat_converted_data.append((col, data[col]))
            data = DataFrame.from_items(cat_converted_data)

jreback · 2014-11-17T01:32:12Z

they are both 'assigned' using pd.factorize (whether ordered or not)
This all assumes that the user does not pass in the codes.

In [2]: pd.factorize(list('ddbcab'),sort=True)
Out[2]: (array([3, 3, 1, 2, 0, 1]), array(['a', 'b', 'c', 'd'], dtype=object))

In [3]: pd.factorize(list('ddbcab'),sort=False)
Out[3]: (array([0, 0, 1, 2, 3, 1]), array(['d', 'b', 'c', 'a'], dtype=object))

The sort has to do with the codes and not the lexical order of the categories.
Notice that with sort=False they are assigned in the passed order.
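For the numeric case asked about above, a small check with invented values, assuming a recent pandas: `astype('category')` on a numeric Series sorts the distinct values ascending to form the categories, missing values get code -1, and none of this requires `ordered=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value.
values = pd.Series([3.0, 1.0, np.nan, 2.0, 1.0])
cat = values.astype("category")

categories = cat.cat.categories.tolist()  # distinct values, sorted ascending
codes = cat.cat.codes.tolist()            # position of each value in categories
is_ordered = cat.cat.ordered              # stays False unless requested
```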

jreback · 2014-11-18T13:23:22Z

closing in favor of #8836

StataReader: Support sorting categoricals

99b85c8

jreback reviewed Nov 14, 2014

jreback added the Categorical (Categorical Data Type) and IO Stata (read_stata, to_stata) labels Nov 14, 2014

Vectorized loop

c410441

bashtage reviewed Nov 14, 2014

bashtage mentioned this pull request Nov 17, 2014

FIX: Correct Categorical behavior in StataReader #8836

Merged

jreback closed this Nov 18, 2014
