Skip to content

Fixes for use with Pandas 1.2.1 #171

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 28, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 20 additions & 5 deletions text_extensions_for_pandas/array/span.py
Original file line number Diff line number Diff line change
Expand Up @@ -466,6 +466,17 @@ def __hash__(self):
self._ends.tobytes()))
return self._hash

def __contains__(self, item) -> bool:
"""
Return true if scalar item exists in this SpanArray.
:param item: scalar Span value.
:return: true if item exists in this SpanArray.
"""
if isinstance(item, Span) and \
item.begin == Span.NULL_OFFSET_VALUE:
return Span.NULL_OFFSET_VALUE in self._begins
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed because otherwise the default is to check all values for equality, and that will fail if there is a NULL value with Span.target_text=None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how Pandas defines the result of NA == NA for other nullable types? Should that expression evaluate True, False, or NA? It could go either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the current Pandas policy is "¯_(ツ)_/¯"; see the warning on https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good question. The test failure related to this was checking assert na_value in data_missing which expects that na_value == na_value to be True. Seems like it would be different for np.nan though..

return super().__contains__(item)

def equals(self, other: "SpanArray"):
"""
:param other: A second `SpanArray`
Expand Down Expand Up @@ -541,12 +552,16 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
i = 0
for s in scalars:
if not isinstance(s, Span):
raise ValueError(f"Can only convert a sequence of Span "
f"objects to a SpanArray. Found an "
f"object of type {type(s)}")
if text is None:
# TODO: Temporary fix for np.nan values, pandas-dev GH#38980
if np.isnan(s):
s = Span(text, Span.NULL_OFFSET_VALUE, Span.NULL_OFFSET_VALUE)
else:
raise ValueError(f"Can only convert a sequence of Span "
f"objects to a SpanArray. Found an "
f"object of type {type(s)}")
if text is None and s.begin != Span.NULL_OFFSET_VALUE:
text = s.target_text
if s.target_text != text:
if s.target_text != text and s.begin != Span.NULL_OFFSET_VALUE:
raise ValueError(
f"Mixing different target texts is not currently "
f"supported. Received two different strings:\n"
Expand Down
2 changes: 1 addition & 1 deletion text_extensions_for_pandas/array/tensor.py
Original file line number Diff line number Diff line change
Expand Up @@ -433,7 +433,7 @@ def __setitem__(self, key: Union[int, np.ndarray], value: Any) -> None:
def __contains__(self, item) -> bool:
if isinstance(item, TensorElement):
npitem = np.asarray(item)
if npitem.size == 1 and np.isnan(npitem).all():
if npitem.size == 1 and np.isnan(npitem).all():
return self.isna().any()
return super().__contains__(item)

Expand Down
31 changes: 23 additions & 8 deletions text_extensions_for_pandas/array/token_span.py
Original file line number Diff line number Diff line change
Expand Up @@ -433,6 +433,17 @@ def __hash__(self):
self._hash = SpanArray.__hash__(self)
return self._hash

def __contains__(self, item) -> bool:
"""
Return true if scalar item exists in this TokenSpanArray.
:param item: scalar TokenSpan value.
:return: true if item exists in this TokenSpanArray.
"""
if isinstance(item, TokenSpan) and \
item.begin == TokenSpan.NULL_OFFSET_VALUE:
return TokenSpan.NULL_OFFSET_VALUE in self._begin_tokens
return super().__contains__(item)

@classmethod
def _concat_same_type(
cls, to_concat: Sequence[pd.api.extensions.ExtensionArray]
Expand Down Expand Up @@ -474,7 +485,7 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
for information about this method.
"""
tokens = None
if isinstance(scalars, Span):
if isinstance(scalars, TokenSpan):
scalars = [scalars]
if isinstance(scalars, TokenSpanArray):
tokens = scalars.tokens
Expand All @@ -483,14 +494,18 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
i = 0
for s in scalars:
if not isinstance(s, TokenSpan):
raise ValueError(
f"Can only convert a sequence of TokenSpan "
f"objects to a TokenSpanArray. Found an "
f"object of type {type(s)}"
)
if tokens is None:
# TODO: Temporary fix for np.nan values, pandas-dev GH#38980
if np.isnan(s):
s = TokenSpanDtype().na_value
else:
raise ValueError(
f"Can only convert a sequence of TokenSpan "
f"objects to a TokenSpanArray. Found an "
f"object of type {type(s)}"
)
if tokens is None and not (s.begin == TokenSpan.NULL_OFFSET_VALUE and s.tokens.target_text is None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets a little awkward dealing with with a NULL value with target_text=None, but should be ok for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case will work better once TokenSpanArray allows multiple target texts/tokenization pairs. Still some design work to be done on that front, methinks.

tokens = s.tokens
if not s.tokens.equals(tokens):
if not (s.begin == TokenSpan.NULL_OFFSET_VALUE and s.tokens.target_text is None) and not s.tokens.equals(tokens):
raise ValueError(
f"Mixing different token sets is not currently "
f"supported. Received two token sets:\n"
Expand Down