seisman
Member

@seisman seisman commented Nov 10, 2024

Description of proposed changes

This PR adds tests for PyArrow string types: string/utf8/large_string/large_utf8/string_view (xref: https://arrow.apache.org/docs/python/api/datatypes.html).

None of them can be converted to np.str_ directly, so we need to map them explicitly.

In [1]: import pyarrow as pa

In [2]: x = pa.array(["abc", "defg", "12345"], type=pa.string())

In [3]: x.type
Out[3]: DataType(string)

In [4]: str(x.type)
Out[4]: 'string'

In [6]: import numpy as np

In [7]: np.ascontiguousarray(x)
Out[7]: array(['abc', 'defg', '12345'], dtype=object)

@seisman seisman added the "maintenance" (Boring but important stuff for the core devs) and "needs review" (This PR has higher priority and needs review.) labels Nov 11, 2024
@seisman seisman added this to the 0.14.0 milestone Nov 11, 2024
@seisman seisman marked this pull request as ready for review November 11, 2024 09:15
@seisman seisman force-pushed the to_numpy/pyarrow_string branch from 8bfac8e to f10175f Compare November 11, 2024 09:48
@michaelgrund michaelgrund added the "final review call" (This PR requires final review and approval from a second reviewer) label and removed the "needs review" (This PR has higher priority and needs review.) label Nov 12, 2024
@seisman
Member Author

seisman commented Nov 13, 2024

I plan to cherry-pick f10175f into a separate PR to have an entry in the "Enhancement" category.

@seisman seisman requested a review from weiji14 November 14, 2024 04:37
         array = np.ascontiguousarray(data.astype(float))
     else:
-        vec_dtype = str(getattr(data, "dtype", ""))
+        vec_dtype = str(getattr(data, "dtype", getattr(data, "type", "")))
Member Author

A PyArrow array only has a .type attribute, not .dtype.
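
To make the fallback concrete, here is a small self-contained sketch of the nested `getattr` lookup from the changed line. The classes `OnlyDtype` and `OnlyType` and the function `dtype_name` are hypothetical stand-ins invented for this example (mimicking a NumPy-like object and a PyArrow-like object); they are not part of the PR.

```python
class OnlyDtype:
    """Hypothetical stand-in for an object exposing .dtype (NumPy-like)."""

    def __init__(self, dtype):
        self.dtype = dtype


class OnlyType:
    """Hypothetical stand-in for an object exposing only .type (PyArrow-like)."""

    def __init__(self, type_):
        self.type = type_


def dtype_name(data):
    """Return the dtype name, trying .dtype first, then .type, then ""."""
    return str(getattr(data, "dtype", getattr(data, "type", "")))


name_numpy_like = dtype_name(OnlyDtype("float64"))
name_arrow_like = dtype_name(OnlyType("string"))
name_plain_list = dtype_name([1, 2, 3])  # neither attribute exists
```

Objects with `.dtype` behave as before, objects with only `.type` now report their type name, and anything else still yields an empty string.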

@seisman seisman added the "skip-changelog" (Skip adding Pull Request to changelog) label and removed the "final review call" (This PR requires final review and approval from a second reviewer) label Nov 15, 2024
@seisman seisman merged commit c07f1b6 into main Nov 15, 2024
18 of 20 checks passed
@seisman seisman deleted the to_numpy/pyarrow_string branch November 15, 2024 00:14
Labels
maintenance Boring but important stuff for the core devs skip-changelog Skip adding Pull Request to changelog
3 participants