-
-
Notifications
You must be signed in to change notification settings - Fork 19k
BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect #55227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
9876c64
0b936b3
00ad9c7
d54f950
adeceb9
3557b4a
975c87c
0ef179a
df996ac
0bea19b
d04ac92
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -266,21 +266,29 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]: | |||||
|
||||||
assert buffers["offsets"], "String buffers must contain offsets" | ||||||
# Retrieve the data buffer containing the UTF-8 code units | ||||||
data_buff, protocol_data_dtype = buffers["data"] | ||||||
# We're going to reinterpret the buffer as uint8, so make sure we can do it safely | ||||||
assert protocol_data_dtype[1] == 8 | ||||||
assert protocol_data_dtype[2] in ( | ||||||
ArrowCTypes.STRING, | ||||||
ArrowCTypes.LARGE_STRING, | ||||||
) # format_str == utf-8 | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This assertion is valid, but it should be on |
||||||
# Convert the buffers to NumPy arrays. In order to go from STRING to | ||||||
# an equivalent ndarray, we claim that the buffer is uint8 (i.e., a byte array) | ||||||
data_dtype = ( | ||||||
DtypeKind.UINT, | ||||||
8, | ||||||
ArrowCTypes.UINT8, | ||||||
Endianness.NATIVE, | ||||||
) | ||||||
data_buff, data_dtype = buffers["data"] | ||||||
|
data_buff, data_dtype = buffers["data"] | |
data_buff, _ = buffers["data"] |
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity what is the 8
here supposed to represent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey - from https://data-apis.org/dataframe-protocol/latest/API.html, it's number of bits:
@property @abstractmethod def dtype(self) -> Dtype: """ Dtype description as a tuple ``(kind, bit-width, format string, endianness)``. Bit-width : the number of bits as an integer Format string : data type description format string in Apache Arrow C Data Interface format. Endianness : current only native endianness (``=``) is supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are string dtypes supposed to have an 8 bit association? That is kind of confusing for variable length types, granted I know very little of how this interchange is supposed to work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea is that strings are meant to be utf-8, and so each character can be represented with 8bits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm interesting. Well keep in mind that UTF-8 doesn't mean a character is 8 bits; it is still 1-4 bytes
In arrow-adbc I've seen this assigned the value of 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UFT-8 array, consisting of the actual strings and offsets array, so here the buffer representing all string data (so which is typically much longer as the logical length of the array) can be seen as simply a buffer of bytes ("bytearray"), so so in numpy / buffer interface terms you can view such an array as one bitwidth 8
In arrow-adbc I've seen this assigned the value of 0
That's something postgres specific, I think
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not need to be in an if
-block. We can simply disregard the data buffer dtype - for string columns, this will ALWAYS be as listed here. This was already the case in the previous code! Only the assertions were wrong.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be an INT? Timestamps are backed by 64 bit signed integers in arrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, AFAIK that should be signed INT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion can indeed be deleted, as we can assume bit width 8 if the column dtype is STRING or LARGE_STRING.