
ENH: from_records returns dtypes respecting input numpy dtypes #55081


Closed
1 of 3 tasks
Ruibin-Liu opened this issue Sep 9, 2023 · 8 comments
Labels
Closing Candidate (may be closeable, needs more eyeballs) · Dtype Conversions (unexpected or buggy dtype conversions) · Strings (string extension data type and string data)

Comments

@Ruibin-Liu

Ruibin-Liu commented Sep 9, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When creating a DataFrame using from_records with a NumPy structured array, the current implementation doesn't respect the array's dtypes.

>>> x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
...              dtype=[('name', 'U10'), ('age', 'i2'), ('weight', 'f4')])
>>> df = pd.DataFrame.from_records(x)
>>> df.dtypes
name       object
age         int16
weight    float32
dtype: object

From some quick tests, it seems the integer and float dtypes are respected, but the str dtypes are not.

Feature Description

def from_records(data, *args, **kwargs):
    ...
    df = ...
    if isinstance(data, np.ndarray) and data.dtype.names is not None:
        # map each field name to its dtype
        array_dtypes = {name: data.dtype[name] for name in data.dtype.names}
        df = df.astype(array_dtypes)
    ...
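A runnable sketch of this idea, written as a user-side wrapper rather than a change to pandas internals (the function name is hypothetical). Fixed-width unicode fields are skipped here, since pandas has no matching dtype for them and stores them as object regardless:

```python
import numpy as np
import pandas as pd

def from_records_keep_dtypes(data, *args, **kwargs):
    """Hypothetical wrapper: build the frame, then re-apply the
    structured array's field dtypes where pandas supports them."""
    df = pd.DataFrame.from_records(data, *args, **kwargs)
    if isinstance(data, np.ndarray) and data.dtype.names is not None:
        # data.dtype[name] is each field's dtype; skip kind 'U'
        # (fixed-width unicode), which pandas keeps as object anyway
        field_dtypes = {name: data.dtype[name]
                        for name in data.dtype.names
                        if data.dtype[name].kind != 'U'}
        df = df.astype(field_dtypes)
    return df

x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
             dtype=[('name', 'U10'), ('age', 'i2'), ('weight', 'f4')])
df = from_records_keep_dtypes(x)
```

As the thread notes, this still leaves the name column as object; the numeric dtypes are already preserved by from_records itself, so the astype is a no-op for them.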

Alternative Solutions

End users can do similar post-processing themselves.

Additional Context

No response

@Ruibin-Liu Ruibin-Liu added the Enhancement and Needs Triage (issue that has not been reviewed by a pandas team member) labels Sep 9, 2023
@hedeershowk
Contributor

In pandas the str type is represented as object by default. There is a StringDtype, and you can astype to "string" if you'd like:

In [9]: df['name'] = df['name'].astype("string")

In [10]: df.dtypes
Out[10]: 
name      string[python]
age                int16
weight           float32
dtype: object

See https://pandas.pydata.org/docs/user_guide/text.html#text-types for more information about how text types are handled.
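If the goal is to convert every column at once rather than naming each one, `DataFrame.convert_dtypes` infers the nullable extension dtypes, including string, in one call. A small sketch using the data from this issue:

```python
import numpy as np
import pandas as pd

x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
             dtype=[('name', 'U10'), ('age', 'i2'), ('weight', 'f4')])
# name: object -> string; age: int16 -> nullable Int16
df = pd.DataFrame.from_records(x).convert_dtypes()
print(df.dtypes)
```

Note that convert_dtypes can also narrow whole-number float columns to a nullable integer dtype, so check the result if the float dtype must be preserved.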

@Ruibin-Liu
Author

Ruibin-Liu commented Sep 14, 2023

@hedeershowk Thanks for replying. I know we can set the dtypes in a post-processing step, but I am wondering whether the DataFrame constructor can use the dtype information from the NumPy structured array directly. That could be faster because no Python objects would need to be constructed and destroyed.

Edit:
My feature description didn't make this clear, because what it shows is still post-processing, and that is probably why you answered as you did. I hope this reply clarifies the request a little; I obviously don't know how to implement the feature beyond what I sketched, or I would have done it myself.

@hedeershowk
Contributor

I am wondering whether the DataFrame constructor can directly use the dtypes information from Numpy structured array so that it can be faster because no python object is needed to be constructed and destructed.

So you're proposing that pandas convert a NumPy U10 type into str? I think the only two options on the pandas side are object (which is a catch-all type) or StringDtype. It might make sense for pandas to use the latter in this case, but as far as I understand, those are your two options (StringDtype or object).

@Ruibin-Liu
Author

@hedeershowk It seems pandas doesn't support all NumPy structured array dtypes, but it does support some, like S10. Currently we can do this:

>>> df['name'] = df['name'].astype("S10")
>>> df.dtypes
name         |S10
age         int16
weight    float32
dtype: object

What I proposed is that pandas should try to use NumPy string dtypes like S10 (as in numpy.chararray) whenever the given NumPy array carries that type information at construction time.

It would be better if pandas supported all NumPy structured array dtypes, so that the U10 type could be used in pandas as well.

@hedeershowk
Contributor

It would be better if Pandas support all numpy structured array types so that the U10 type can be used in Pandas as well.

👍 Okay, makes sense. I think that's a question for core contributors, not for me. The current dtypes seem to be a firm part of the structure, and the decision to capture all strings with StringDtype or object was made a long time ago.

@lithomas1
Member

This is intentional, I think.

I think prior discussion can be found here #10351.

@lithomas1 lithomas1 added the Dtype Conversions (unexpected or buggy dtype conversions), Closing Candidate (may be closeable, needs more eyeballs), and Strings (string extension data type and string data) labels and removed the Enhancement and Needs Triage labels Sep 15, 2023
@Ruibin-Liu
Author

Ruibin-Liu commented Sep 16, 2023

This is intentional, I think.

I think prior discussion can be found here #10351.

I read the discussion, and it seems the main argument was this comment: #10351 (comment). It raised the question: if a column has a dtype like S1, what happens if we change one entry in that column to something longer, like 'Some other string that is large'? Well, we can just try that in the current version:

>>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')],
...                 dtype=[('col_1', 'i4'), ('col_2', 'U1')])
>>> df = pd.DataFrame.from_records(data)
>>> df.dtypes
col_1     int32
col_2    object
dtype: object
>>> df.col_2 = df.col_2.astype('S1')
>>> df.dtypes
col_1    int32
col_2      |S1
dtype: object
>>> df.iloc[0,1] = 'Some other string that is large'
>>> df.dtypes
col_1     int32
col_2    object
dtype: object
>>> type(df.iloc[1, 1])
<class 'bytes'>

As you can see, the current pandas version already converts the column dtype from |S1 to object, while the entries that still satisfy the length constraint remain bytes.

I am very unfamiliar with the pandas internals, but it seems pandas has built a very flexible model over the past several years: 'mixed' types in one column are no longer a problem, and the column dtype appears to be assigned the most compatible one.

With that information, I don't think there is actually any reason that pandas cannot use the provided numpy fixed-length string dtypes as the column types.

At least for the from_records classmethod.
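One concrete way to see the information loss being discussed: the field dtype does not survive a round trip through from_records and to_records. This is just an observation using the toy data from earlier in the thread, not a proposed fix:

```python
import numpy as np
import pandas as pd

x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
             dtype=[('name', 'U10'), ('age', 'i2'), ('weight', 'f4')])
# round trip: structured array -> DataFrame -> record array
rec = pd.DataFrame.from_records(x).to_records(index=False)
# the unicode field comes back as object ('O'), not '<U10';
# the numeric fields keep their original widths
print(rec.dtype)
```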

@Ruibin-Liu Ruibin-Liu changed the title ENH: from_record returns dtypes respecting input numpy dtypes ENH: from_records returns dtypes respecting input numpy dtypes Sep 16, 2023
@phofl
Member

phofl commented Dec 7, 2023

This is supposed to be object; fixed-width strings from NumPy are not supported in pandas, and we are moving towards Arrow strings anyway, so this won't get support in pandas itself.

closing
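For context on the "arrow strings" direction mentioned above: the Arrow-backed string dtype can be requested explicitly, assuming the optional pyarrow dependency is installed (this sketch falls back to the default Python-object storage if it is not):

```python
import pandas as pd

try:
    # Arrow-backed storage; requires the optional pyarrow dependency
    s = pd.Series(["Rex", "Fido"], dtype="string[pyarrow]")
except ImportError:
    # pyarrow not installed: fall back to the Python-object storage
    s = pd.Series(["Rex", "Fido"], dtype="string")

# either way this is pandas' dedicated string dtype, not object
print(s.dtype)
```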

@phofl phofl closed this as completed Dec 7, 2023

4 participants