ENH: Specify how pandas infers dtype on objects #41848

@mocquin

Hello there

Is your feature request related to a problem?

Context: I am creating a package to handle physical units (yes, another one), and I started working on the pandas interface implementation. I looked into the pandas extension page, as well as what pint did with pint-pandas. I am pretty satisfied with the result, except for one thing: when creating pandas objects (Series or DataFrame), I have to explicitly specify the dtype (QuantityDtype, the extension dtype for my "Quantity" class) that pandas should use to cast my Quantity object to the corresponding QuantityArrayExtension. Categorical objects kind of exhibit the same problem:

import pandas as pd

# this does create a Series with a Categorical dtype
s = pd.Series(["a", "b", "c", "a"], dtype="category")
# without an explicit dtype, pandas uses "object"
pd.Series(["a", "b", "c", "a"])

from physipy import m # import the "meter" object
from physipy import QuantityDtype # import the extension dtype for Quantity objects
# this does create a QuantityDtype Series
s = pd.Series([1, 2, 3]*m, dtype=QuantityDtype)

# casts into integers, dropping the unit (because pandas bypasses my object by accessing its "array" value directly)
pd.Series([1, 2, 3]*m)

Now, I understand that for the Categorical example, it is not obvious what kind of dtype pandas should use, but for my custom class, I would like to be able to tell pandas how to behave.

Describe the solution you'd like

I would expect some interface like this :

import pandas as pd
from physipy import m, Quantity, QuantityDtype

# tell pandas to use QuantityDtype when a Quantity object is passed
pd.dtype_lut[Quantity] = QuantityDtype # proposed API, does not exist yet

# then a Series can be created directly
my_quantity_object = [1, 2, 3]*m # this is a Quantity object
s = pd.Series(my_quantity_object) # note the absence of dtype specification

Here, pandas recognizes that it does not know the type of the passed object, and so checks its dtype_lut to see whether a corresponding dtype has been registered.
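
To make the lookup idea concrete, here is a minimal pure-Python sketch of the behaviour I have in mind. Nothing in it is existing pandas API: `dtype_lut`, `infer_dtype`, and the stand-in `Quantity` and `QuantityDtype` classes are hypothetical names used only for illustration.

```python
# Hypothetical sketch: none of these names exist in pandas today.

class QuantityDtype:
    """Stand-in for the extension dtype of Quantity."""


class Quantity:
    """Stand-in for the unit-aware container."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit


# The proposed look-up table: maps a Python type to the dtype to use.
dtype_lut = {}


def infer_dtype(data, dtype=None):
    """Return the dtype a constructor could pick for ``data``."""
    if dtype is not None:
        return dtype  # an explicit dtype always wins
    # Walk the MRO so that subclasses of a registered type also match.
    for cls in type(data).__mro__:
        if cls in dtype_lut:
            return dtype_lut[cls]
    return None  # nothing registered: fall back to the regular inference


dtype_lut[Quantity] = QuantityDtype
q = Quantity([1, 2, 3], "m")
print(infer_dtype(q) is QuantityDtype)
```

Walking the MRO is a design choice of this sketch, not part of the request: it just means a subclass of Quantity would not need its own entry in the table.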

Another interface would be to add a method with a pandas-specific name to Quantity that plays the role of this look-up table:

# in my Quantity class
class Quantity:
    ....

    def pd_dtype(self):
        return QuantityDtype

so that when pandas encounters an unknown object type, it first tries to get its dtype by calling "obj.pd_dtype()".
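
A rough sketch of how that hook could be consulted, again in pure Python. The method name `pd_dtype` is only the name proposed in this issue, and `infer_dtype_from_hook` and the stand-in classes are hypothetical; none of this is existing pandas behaviour.

```python
# Hypothetical sketch: "pd_dtype" is the hook proposed here, not a pandas API.

class QuantityDtype:
    """Stand-in for the extension dtype of Quantity."""


class Quantity:
    """Stand-in for the unit-aware container."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def pd_dtype(self):
        # The object itself tells the caller which dtype to use.
        return QuantityDtype


def infer_dtype_from_hook(data, dtype=None):
    """Return the dtype a constructor could pick for ``data``."""
    if dtype is not None:
        return dtype  # an explicit dtype always wins
    hook = getattr(data, "pd_dtype", None)
    if callable(hook):
        return hook()
    return None  # no hook: fall back to the regular inference


q = Quantity([1, 2, 3], "m")
print(infer_dtype_from_hook(q) is QuantityDtype)
```

Compared with the look-up table, this keeps the dtype choice inside the class and needs no global registration step, at the cost of reserving a method name on third-party objects.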

Cheers
