
Decide whether (new) ExtensionArrays and Dtypes are public #22860


Closed
TomAugspurger opened this issue Sep 27, 2018 · 23 comments
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays.

@TomAugspurger
Contributor

TomAugspurger commented Sep 27, 2018

This mostly affects

  1. IntervalArray
  2. PeriodArray
  3. DatetimeArray (and maybe TimedeltaArray if we do that)
  4. SparseArray
  5. IntegerArray

Categorical is already public, so let's leave that out.


A few questions

  1. Do we allow users to construct these directly (via a set of to_*_array methods, or a top-ish-level pd.array([...], dtype) method)?
  2. Do users see these when they do .values or any operation returning an array (.unique, probably others)?
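For context when reading the discussion below: question 1 is ultimately what pandas answered with a single top-level `pd.array` constructor (added in 0.24). A minimal illustration with released pandas:

```python
import pandas as pd

# One top-level constructor, dispatching on dtype, is what eventually shipped.
arr = pd.array([1, 2, None], dtype="Int64")        # IntegerArray with NA support
cat = pd.array(["a", "b", "a"], dtype="category")  # Categorical

print(type(arr).__name__, arr.dtype)
```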
@TomAugspurger TomAugspurger added API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Sep 27, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Sep 27, 2018
@TomAugspurger TomAugspurger changed the title Decide whether (new) ExtensionArrays are public Decide whether (new) ExtensionArrays and Dtypes are public Oct 8, 2018
@TomAugspurger
Contributor Author

@pandas-dev/pandas-core here's a proposal: Everything goes at the top level.

  1. That's one extra *_array per array.
  2. One extra Dtype per array (more for IntegerArray)
  3. (Maybe) add the actual Array itself to the top-level.

This is consistent with Categorical and SparseArray, meaning we wouldn't have to deprecate those if we want all the arrays in one place.

I'm concerned about nesting them in pandas.api.types.extensions.* In the past, users have had reservations about reaching that far into pandas' namespace. I think that's fine for library developers subclassing ExtensionArray, but not for interactive use.

So we add

  1. categorical_array
  2. CategoricalDtype
  3. period_array
  4. PeriodDtype
  5. datetime_array
  6. DatetimeDtype
  7. sparse_array
  8. interval_array
  9. IntervalDtype
  10. integer_array
  11. IntegerDtype(...) (several of these)

I don't think we need a timedelta_array / TimedeltaDtype in the public API. These shouldn't be user-facing at all.

@jbrockmendel
Member

Do users see these when they do .values or any operation returning an array

I would be ecstatic if we got to a place where a) .values always points to an EA, or at least a lossless, non-costly array and b) ._values were rendered unnecessary. Even better if .get_values() also became a synonym.

or a top-ish-level pd.array([...], dtype) method)?

A lot of the pandas-internal EA code could be simplified if there were something analogous to Index.__new__ that returned the appropriately-typed array.
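A rough sketch of such a dispatching constructor, using the ExtensionDtype registry and the `construct_array_type` hook that pandas dtypes already expose (the helper name `make_array` is hypothetical, and `_from_sequence` is a private method, so this is illustration, not API):

```python
import numpy as np
import pandas as pd

def make_array(data, dtype):
    """Hypothetical Index.__new__-style factory: return the array type
    appropriate for ``dtype`` (a sketch, not a pandas API)."""
    dt = pd.api.types.pandas_dtype(dtype)  # resolves strings like "Int64"
    if isinstance(dt, pd.api.extensions.ExtensionDtype):
        # Every ExtensionDtype knows its array class; delegate construction.
        return dt.construct_array_type()._from_sequence(data, dtype=dt)
    return np.asarray(data, dtype=dt)

print(type(make_array([1, 2, 3], "Int64")).__name__)
```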

@jreback
Contributor

jreback commented Oct 14, 2018

I think we should leave the dtypes themselves in pandas.api.types

no objection to the *_array in the pandas namespace

@jschendel
Member

+1 on making these public in general.

  1. interval_array

This does not currently exist and I'm a little confused as to what it would look like if implemented. I understand how this makes sense for the non-interval EA's, as they are usually constructed in a fairly direct manner. With intervals, my impression is that users have generally been using the IntervalIndex.from_* methods as opposed to the constructor itself, and I imagine the same will hold for IntervalArray.

Seems like trying to support all these construction methods in a single interval_array function would be a bit convoluted. Or am I missing something here? If not, my preference would be to expose the actual IntervalArray class itself.

I think we should leave the dtypes themselves in pandas.api.types

I agree, these don't seem to provide much additional utility in general. We could maybe make an exception for CategoricalDtype though, as it's useful for creating non-default categoricals via astype; this is what the docs recommend as well. The other dtypes don't have similar non-default behavior iirc, so should be fine just directing users to the string equivalent when specifying dtypes.
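The astype use case mentioned here, with a non-default CategoricalDtype (this API already existed at the time):

```python
import pandas as pd

# An ordered categorical with explicit categories, applied via astype;
# this is the behavior a plain dtype string cannot express.
dtype = pd.api.types.CategoricalDtype(["low", "med", "high"], ordered=True)
s = pd.Series(["low", "high", "med"]).astype(dtype)
print(s.cat.categories.tolist(), s.cat.ordered)
```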

@TomAugspurger
Contributor Author

I agree with @jschendel that CategoricalDtype is especially useful on its own and should be in the top-level namespace.

@TomAugspurger
Contributor Author

Also agreed that IntervalArray's alternative constructors are especially useful, and conflating them in a single interval_array method isn't a great idea.
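The alternative constructors in question, as they exist on the class (exposed as `pd.arrays.IntervalArray` in released pandas):

```python
import pandas as pd

# Three construction paths that a single interval_array() would have to conflate:
from_breaks = pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3])
from_arrays = pd.arrays.IntervalArray.from_arrays([0, 1], [1, 2])
from_tuples = pd.arrays.IntervalArray.from_tuples([(0, 1), (1, 2)])
print(len(from_breaks), len(from_arrays), len(from_tuples))
```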

@jorisvandenbossche
Member

That's one extra *_array per array.

I have the feeling that adding a function for each array type is some API bloat. I think it would be nice to restrict ourselves to a general array(..) function, if that would be sufficient. It is, e.g., also what pyarrow has to convert python array-likes (pure python/numpy/pandas containers) to an arrow array.

I suppose the main drawback of having a single function vs specialized functions for each array is the ability of having additional kwargs? (eg freq for period_array, or tz for datetime_array)

Further, I think we also need to discuss and somewhat define the "scope" of those .._array() functions. Although it might differ for each data type, I think we should have some general idea on this.
For example, is this function expected to do parsing of strings? (period_array does a limited form (only default format), integer_array does not; of course for periods this is a more typical use case) Is this function expected to accept "low-level values" (eg ordinals for period_array, ints for datetime_array) ...

I think for me the core use case that should certainly be covered is the round trip to/from "python objects" (so something like pd.array(np.asarray(EA, dtype=object), dtype=EA.dtype) should always work). Kind of the public version of _from_sequence.
But strictly this use case is of course more limited than eg what period_array already can do.
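The round-trip contract described here, written out with released pandas and a period array (a public analogue of `_from_sequence`):

```python
import numpy as np
import pandas as pd

ea = pd.array([pd.Period("2018-09", freq="M"), pd.Period("2018-10", freq="M")])
obj = np.asarray(ea, dtype=object)    # decompose to python scalars ...
back = pd.array(obj, dtype=ea.dtype)  # ... and rebuild losslessly
print(str(ea.dtype), (back == ea).all())
```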

@jorisvandenbossche
Member

Another remark about where to put them in our API: I don't think those belong in pandas.api.extension.
Yes, we call them ExtensionArrays, and they are extending the numpy arrays, but they are a fundamental part of pandas, not a kind of secondary extension of pandas. For me, pandas.api.extension exposes the API for external people to extend pandas.

From the existing submodules, I would say that pandas.api.types is more fitting, but we might also consider adding something like pandas.api.arrays or even pandas.arrays (although that might feel strange to put the dtype objects)

@jbrockmendel
Member

Yes, we call them ExtensionArrays, and they are extending the numpy arrays, but they are a fundamental part of pandas, not a kind of secondary extension of pandas.

+1 on being clear about this.

@TomAugspurger
Contributor Author

TomAugspurger commented Nov 7, 2018

On the idea of a single pd.array() function, that's fine, but would require adding all of the types to the top-level so that users can easily do

arr = pd.array([1, 2, 3], type=pd.Int64())
  • Categorical: fine. Requires a bit more typing to get CategoricalDtype(..., ordered=...)
  • Integer: fine, unless we want to accept a mask for the missing / valid values
  • period: fine
  • Sparse: not great, since the main SparseArray constructor accepts sparse_index and kind (and index, which we want to remove).
  • Interval: not being considered? Since it requires the alternative constructors?
  • Datetime: Probably not sufficient, if we want to handle errors, dayfirst, yearfirst, utc, etc.

I'd really like to avoid pd.array(data, type, **kwargs) if we can avoid it...

I think that a simple array(data, type) would suffice for categorical, period, integer (unless we want to support a mask= argument like pyarrow), datetime / timedelta, sparse.

@TomAugspurger
Contributor Author

I hate .api every time I type it. Python has a way of saying something is in the API: it doesn't start with an underscore :)

So what do people think? pandas.array(...) or pandas[.api].arrays.period_array, pandas[.api].arrays.integer_array, ...?

@jbrockmendel
Member

what do I get if I write arr = pd.array([2.0, 3.5], dtype='f8')? i.e. do we raise or just return a numpy array?

@TomAugspurger
Contributor Author

TomAugspurger commented Nov 7, 2018 via email

@jorisvandenbossche
Member

On the idea of a single pd.array() function, that's fine, but would require adding all of the types to the top-level

Somehow this is maybe fine, as we then would not have a bunch of .._array functions at the top level, and could have the dtypes top-level instead.

Integer: fine, unless we want to accept a mask for the missing / valid values

Unless we are fine with a mask argument for all? (like pyarrow.array(..) has). It's probably mainly useful for integer though ..

Datetime: Probably not sufficient, if we want to handle errors, dayfirst, year first, utc, etc.

For that we already have to_datetime?

Related to that, want to repeat my question of above (#22860 (comment)) on the "scope" of those functions. Should a datetime_array or array function be able to handle errors, parse custom formats, etc like to_datetime and DatetimeIndex now do? Or maybe we can simply keep it on actual datetime-like scalar objects (numpy scalars, pd.Timestamp, datetime.datetime), since we already have to_datetime to do string parsing.
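The division of labor suggested here, sketched with released pandas: string parsing and error handling stay in `to_datetime`, while the array constructor only takes actual datetime-like scalars:

```python
import datetime
import pandas as pd

# Parsing, error handling, dayfirst, etc. remain to_datetime's job ...
parsed = pd.to_datetime(["2018-11-07", "not a date"], errors="coerce")

# ... while the array constructor takes datetime-like scalars directly.
arr = pd.array([datetime.datetime(2018, 11, 7)], dtype="datetime64[ns]")
print(type(arr).__name__, pd.isna(parsed[1]))
```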

@jorisvandenbossche
Member

Requires a bit more typing to get CategoricalDtype(..., ordered=...)

Would something like pd.array(..., dtype=pd.types.categorical(ordered=True)) be less typing / easier to read? (note the factory function instead of the class constructor; it is put in a types submodule because at the top level this might be confusing)

@TomAugspurger
Contributor Author

Somehow this is maybe fine, as we then not have a bunch of .._array functions top-level and could have the dtypes top-level instead.

Yeah, I think it would be one or the other (except for IntervalArray, which will be special).

Unless we are fine with a mask argument for all? (like pyarrow.array(..) has). It's probably mainly useful for integer though ..

When you say "mainly useful for integer", you mean mainly useful

[...] scope [...]

I'm not sure what's best here. I think list / object array of scalars, clearly. For convenience, I think unboxing Series / Index, and idempotency, is nice. So all of

  • pd.array(Series[period])
  • pd.array(Index[period])
  • pd.array(a_period_array)

would work.
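Those three cases, written out against released pandas:

```python
import pandas as pd

pidx = pd.period_range("2000-01", periods=3, freq="M")
cases = [pd.Series(pidx), pidx, pidx.array]  # Series, Index, and the array itself
results = [pd.array(c) for c in cases]       # all should come back as PeriodArray
print({type(r).__name__ for r in results})
```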

pd.array(..., dtype=pd.types.categorical(ordered=True))

To be clear, you can do dtype=CategoricalDtype(ordered=True), and the categories will be inferred from the data. I would favor that over a new .types.categorical(...) method.

@jorisvandenbossche
Member

To be clear, you can do dtype=CategoricalDtype(ordered=True), and the categories will be inferred from the data

Yep, I know, I was only looking at readability / writability of dtype=pd.api.types.CategoricalDtype(ordered=True) vs dtype=pd.types.categorical(ordered=True)

@jorisvandenbossche
Member

For convenience, I think unboxing Series / Index, and idempotency is nice. So all of

Yes, for sure that as well (next to the list of scalars).
But we could opt to keep it at those two cases: list of scalars (and of course some "coercion" can be done like int to float or round float to int) and unboxing containers that already have the dtype (without the more advanced parsing).

@TomAugspurger TomAugspurger mentioned this issue Nov 8, 2018
@jorisvandenbossche
Member

One additional thing that we didn't really discuss yet: I think the idea for now is that this will also return numpy arrays.

There are clear advantages to that behaviour (you know you can give it any data and it will give you a usable array-like regardless of the data type, and this is what would otherwise also be stored in a Series if you passed the data there).
But it can also give dubious cases that we need to discuss:

  • What should be the return value of pd.array([1, 2, 3]) ? numpy array or IntegerArray ?

@TomAugspurger
Contributor Author

We mentioned it briefly on the call, and Jeff was for it returning ndarrays.

pd.array([1, 2, 3]) ? numpy array or IntegerArray ?

In https://github.com/pandas-dev/pandas/pull/23581/files#diff-69ac57923b848af43df327c311b79db4R18 we have a nice, succinct description of what dtype does. If you don't specify it, the dtype is inferred from np.array(). I'd like to be out of the inference business as much as possible, and determining that [1, 2, 3] should be an IntegerArray would require looking into the data.

@jorisvandenbossche
Member

I'd like to be out of the inference business as much as possible,

Also not for built-in EAs?
For example, passing a list/array of Period objects could easily be detected and converted into a PeriodArray.

@jorisvandenbossche
Member

would require looking into the data.

Yes, but only if there is no dtype provided (and anyway, then it's numpy that looks into the data). We have an infer_dtype that works nicely for this.
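The `infer_dtype` helper referenced here is public and does exactly this kind of inspection of the scalars:

```python
import pandas as pd

# infer_dtype inspects the values without requiring a dtype up front.
inferred_period = pd.api.types.infer_dtype([pd.Period("2018-09", freq="M")])
inferred_int = pd.api.types.infer_dtype([1, 2, 3])
print(inferred_period, inferred_int)
```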

@TomAugspurger
Contributor Author

Hmm yeah, you're right. It would be strange for Series[List[Period]] to be a Period dtype, but pd.array(List[Period]) not to be... So we'll need to do some inference...

OK, two problems then: that'll put pandas EAs on a different level from 3rd party EAs (unless we let infer_dtype be extensible, but I don't think that's easily possible).

And then the problem of array([1, 2, 3]). Let's just follow series here?
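For reference, the behaviour pandas eventually settled on: `pd.array` without an explicit dtype infers a pandas extension type where one exists, and since pandas 1.0 plain integers infer to Int64 rather than a NumPy array:

```python
import pandas as pd

# No dtype given: inference picks the nullable extension type (pandas >= 1.0).
arr = pd.array([1, 2, 3])
print(type(arr).__name__, arr.dtype)
```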
