
ENH: from_records returns dtypes respecting input numpy dtypes #55081


Closed
1 of 3 tasks
Ruibin-Liu opened this issue Sep 9, 2023 · 8 comments
Labels
Closing Candidate (may be closeable, needs more eyeballs) · Dtype Conversions (unexpected or buggy dtype conversions) · Strings (string extension data type and string data)

Comments

@Ruibin-Liu

Ruibin-Liu commented Sep 9, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When creating a DataFrame using from_records with a NumPy structured array, the current implementation doesn't respect the array's dtypes.

>>> x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
...              dtype=[('name', 'U10'), ('age', 'i2'), ('weight', 'f4')])
>>> df = pd.DataFrame.from_records(x)
>>> df.dtypes
name       object
age         int16
weight    float32
dtype: object

From some quick tests, it seems the integer and float dtypes are respected, but the str dtypes are not.

Feature Description

def from_records(data, *args, **kwargs):
    ...
    df = ...
    if isinstance(data, np.ndarray) and data.dtype.names is not None:
        # map each field name to its dtype
        array_dtypes = {name: data.dtype[name] for name in data.dtype.names}
        df = df.astype(array_dtypes)
    ...
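A runnable sketch of this idea, written as a user-side wrapper rather than a change to pandas internals (the function name is hypothetical). Fixed-width unicode fields are skipped here, since pandas has no matching dtype for them and stores them as object regardless:

```python
import numpy as np
import pandas as pd

def from_records_keep_dtypes(data, *args, **kwargs):
    """Hypothetical wrapper: build the frame, then re-apply the
    structured array's field dtypes where pandas supports them."""
    df = pd.DataFrame.from_records(data, *args, **kwargs)
    if isinstance(data, np.ndarray) and data.dtype.names is not None:
        # data.dtype[name] is each field's dtype; skip kind 'U'
        # (fixed-width unicode), which pandas keeps as object anyway
        field_dtypes = {name: data.dtype[name]
                        for name in data.dtype.names
                        if data.dtype[name].kind != 'U'}
        df = df.astype(field_dtypes)
    return df

x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
             dtype=[('name', 'U10'), ('age', 'i2'), ('weight', 'f4')])
df = from_records_keep_dtypes(x)
```

As the thread notes, this still leaves the name column as object; the numeric dtypes are already preserved by from_records itself, so the astype is a no-op for them.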

Alternative Solutions

End users can do similar post-processing themselves.

Additional Context

No response

@Ruibin-Liu Ruibin-Liu added the Enhancement and Needs Triage (issue that has not been reviewed by a pandas team member) labels Sep 9, 2023
@hedeershowk
Contributor

In pandas the str type is represented as object by default. There is a StringDtype, and you can astype to "string" if you'd like:

In [9]: df['name'] = df['name'].astype("string")

In [10]: df.dtypes
Out[10]: 
name      string[python]
age                int16
weight           float32
dtype: object

See https://pandas.pydata.org/docs/user_guide/text.html#text-types for more information about how text types are handled.
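If the goal is to convert every column at once rather than naming each one, `DataFrame.convert_dtypes` infers the nullable extension dtypes, including string, in one call. A small sketch using the data from this issue:

```python
import numpy as np
import pandas as pd

x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
             dtype=[('name', 'U10'), ('age', 'i2'), ('weight', 'f4')])
# name: object -> string; age: int16 -> nullable Int16
df = pd.DataFrame.from_records(x).convert_dtypes()
print(df.dtypes)
```

Note that convert_dtypes can also narrow whole-number float columns to a nullable integer dtype, so check the result if the float dtype must be preserved.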

@Ruibin-Liu
Author

Ruibin-Liu commented Sep 14, 2023

@hedeershowk Thanks for replying. I know we can set the dtypes in a post-processing step, but I am wondering whether the DataFrame constructor can use the dtype information from the NumPy structured array directly. That could be faster because no Python objects would need to be constructed and destroyed.

Edit:
My feature description didn't make this clear, because what it shows is still post-processing, and that is probably why you answered as you did. I hope this reply clarifies the request a little; I obviously don't know how to implement the feature beyond what I sketched, or I would have done it myself.

@hedeershowk
Contributor

I am wondering whether the DataFrame constructor can directly use the dtypes information from Numpy structured array so that it can be faster because no python object is needed to be constructed and destructed.

So you're proposing that pandas convert a NumPy U10 type into str? I think the only two options on the pandas side are object (which is a catch-all type) or StringDtype. It might make sense for pandas to use the latter in this case, but as far as I understand, those are your two options (StringDtype or object).

@Ruibin-Liu
Author

@hedeershowk It seems pandas doesn't support all NumPy structured array dtypes, but it does support some, like S10. Currently we can do this:

>>> df['name'] = df['name'].astype("S10")
>>> df.dtypes
name         |S10
age         int16
weight    float32
dtype: object

What I proposed is that pandas should try to use NumPy string dtypes like S10 (as in numpy.chararray) whenever the given NumPy array carries that type information at construction time.

It would be better if pandas supported all NumPy structured array dtypes, so that the U10 type could be used in pandas as well.

@hedeershowk
Contributor

It would be better if Pandas support all numpy structured array types so that the U10 type can be used in Pandas as well.

👍 Okay, makes sense. I think that's a question for core contributors, not for me. The current dtypes seem to be a firm part of the structure, and the decision to capture all strings with StringDtype or object was made a long time ago.

@lithomas1
Member

This is intentional, I think.

I think prior discussion can be found here #10351.

@lithomas1 lithomas1 added the Dtype Conversions (unexpected or buggy dtype conversions), Closing Candidate (may be closeable, needs more eyeballs), and Strings (string extension data type and string data) labels and removed the Enhancement and Needs Triage labels Sep 15, 2023
@Ruibin-Liu
Author

Ruibin-Liu commented Sep 16, 2023

This is intentional, I think.

I think prior discussion can be found here #10351.

I read the discussion, and it seems the main argument was this comment: #10351 (comment). It raised the question: if a column has a dtype like S1, what happens if we change one entry in that column to something longer, like 'Some other string that is large'? Well, we can just try that in the current version:

>>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')],
...                 dtype=[('col_1', 'i4'), ('col_2', 'U1')])
>>> df = pd.DataFrame.from_records(data)
>>> df.dtypes
col_1     int32
col_2    object
dtype: object
>>> df.col_2 = df.col_2.astype('S1')
>>> df.dtypes
col_1    int32
col_2      |S1
dtype: object
>>> df.iloc[0,1] = 'Some other string that is large'
>>> df.dtypes
col_1     int32
col_2    object
dtype: object
>>> type(df.iloc[1, 1])
<class 'bytes'>

As you can see, the current pandas version already converts the column dtype from |S1 to object, while the entries that still satisfy the length constraint remain bytes.

I am very unfamiliar with the pandas internals, but it seems pandas has built a very flexible model over the past several years: 'mixed' types in one column are no longer a problem, and the column dtype appears to be assigned the most compatible one.

With that information, I don't think there is actually any reason that pandas cannot use the provided numpy fixed-length string dtypes as the column types.

At least for the from_records classmethod.
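One concrete way to see the information loss being discussed: the field dtype does not survive a round trip through from_records and to_records. This is just an observation using the toy data from earlier in the thread, not a proposed fix:

```python
import numpy as np
import pandas as pd

x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
             dtype=[('name', 'U10'), ('age', 'i2'), ('weight', 'f4')])
# round trip: structured array -> DataFrame -> record array
rec = pd.DataFrame.from_records(x).to_records(index=False)
# the unicode field comes back as object ('O'), not '<U10';
# the numeric fields keep their original widths
print(rec.dtype)
```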

@Ruibin-Liu Ruibin-Liu changed the title ENH: from_record returns dtypes respecting input numpy dtypes ENH: from_records returns dtypes respecting input numpy dtypes Sep 16, 2023
@phofl
Member

phofl commented Dec 7, 2023

This is supposed to be object; fixed-width strings from NumPy are not supported in pandas, and we are moving towards Arrow strings anyway, so this won't get support in pandas itself.

closing
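For context on the "arrow strings" direction mentioned above: the Arrow-backed string dtype can be requested explicitly, assuming the optional pyarrow dependency is installed (this sketch falls back to the default Python-object storage if it is not):

```python
import pandas as pd

try:
    # Arrow-backed storage; requires the optional pyarrow dependency
    s = pd.Series(["Rex", "Fido"], dtype="string[pyarrow]")
except ImportError:
    # pyarrow not installed: fall back to the Python-object storage
    s = pd.Series(["Rex", "Fido"], dtype="string")

# either way this is pandas' dedicated string dtype, not object
print(s.dtype)
```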

@phofl phofl closed this as completed Dec 7, 2023

4 participants