
refactor: New read node that defers ibis table instantiation #709


Merged
merged 12 commits into main on May 30, 2024

Conversation

TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@TrevorBergeron TrevorBergeron requested review from a team as code owners May 20, 2024 19:55
@TrevorBergeron TrevorBergeron requested a review from tswast May 20, 2024 19:55
@product-auto-label product-auto-label bot added the size: l Pull request size is large. label May 20, 2024
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label May 20, 2024
Collaborator

@tswast tswast left a comment

Nowhere near finished reviewing, but sending some early comments so it doesn't get stuck for too long.

session: Session,
*,
predicate: Optional[str] = None,
snapshot_time: Optional[datetime.datetime] = None,
Collaborator

Let's make sure we reconcile this with the changes from #712

Contributor Author

done

Comment on lines 117 to 118
# These parameters should not be used
index_cols=(),
Collaborator

Is this because we're in ArrayValue, which doesn't have a concept of "index"? Let's clarify in the comment.

Alternatively, any chance you could make these parameters optional in to_query() and omit them?

Contributor Author

Yeah, index is really only a concept at higher layers; managing it is pulled out to the caller.

ibis_table = ibis.table(physical_schema, full_table_name)

if ordered:
if node.primary_key:
Collaborator

Seems a bit odd to me to put ordering generation here, but I guess this is just for total ordering, right? We still generate separate order by when we add the index, right?

Contributor Author

Yes, read table nodes should be able to establish their own total ordering, either with provided uniqueness metadata (the primary_key field) or by generating a hash-based key. Just like before, we do a .sort_index() on top of the read operation if the user provided index columns.
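The fallback described above can be sketched in plain Python. This is illustrative only, not the bigframes implementation: when primary-key values are available they serve directly as the ordering key, otherwise a stable hash over all row values is used.

```python
import hashlib

def total_order_key(row_values, primary_key_values=None):
    """Sketch: derive a total-ordering key for a row.

    If primary-key values are available, they already define a total
    order; otherwise fall back to a stable digest over the serialized
    row. (Hypothetical helper; real implementations also need a
    tiebreaker for duplicate rows.)
    """
    if primary_key_values:
        return tuple(primary_key_values)
    digest = hashlib.sha256(repr(row_values).encode()).hexdigest()
    return (digest,)

rows = [("b", 2), ("a", 1), ("c", 3)]
ordered = sorted(rows, key=total_order_key)
```

The hash-based key is deterministic, so re-running the read yields the same ordering for the same data.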

ordering_value_columns = tuple(
bf_ordering.ascending_over(col) for col in node.primary_key
)
if node.primary_key_sequential:
Collaborator

Where do we have primary keys that we know are sequential integers?

Contributor Author

Caching, which doesn't use this new node yet. Also, uploading local data could provide this.

columns: schemata.ArraySchema = field()

table_session: bigframes.session.Session = field()
# Should this even be stored here?
Collaborator

Would "native ordering column" or something be more appropriate? Such a name might allow us to use a row ID pseudocolumn as a fallback if one becomes available.

Contributor Author

renamed to total_order_cols

primary_key: Tuple[str, ...] = field() # subset of schema
# indicates a primary key that is exactly offsets 0, 1, 2, ..., N-2, N-1
primary_key_sequential: bool = False
snapshot_time: typing.Optional[datetime.datetime] = None
Collaborator

Technically "time travel" which is different from snapshot in BQ. https://cloud.google.com/bigquery/docs/access-historical-data

Although looking at that, even the backend messages conflate the two.

Contributor Author

renamed symbols to not say "snapshot"

Comment on lines +370 to +371
# Added for backwards compatibility, not validated
sql_predicate: typing.Optional[str] = None
Collaborator

Fascinating. This implies some level of SQL compilation outside of this node. Should this be a structured "filters" object, instead?

Contributor Author

The original filters type is a bit too flexible, allowing potentially non-hashable tuples. I could convert the whole thing to tuples I guess. Would there be a benefit to that approach?
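The hashability concern above can be addressed by recursively converting the nested filter lists into tuples. A minimal sketch, assuming the pandas-gbq style `filters` shape (lists of 3-tuples, possibly nested one level for OR-of-AND groups); the helper name is hypothetical:

```python
def freeze_filters(filters):
    """Sketch: recursively convert lists/tuples into nested tuples so
    the filters value becomes hashable and can live in a frozen
    dataclass field. (Illustrative helper, not bigframes API.)"""
    def freeze(item):
        if isinstance(item, (list, tuple)):
            return tuple(freeze(sub) for sub in item)
        return item
    return freeze(filters)

frozen = freeze_filters([[("col", "==", 1), ("other", "<", 2)]])
hash(frozen)  # no longer raises TypeError
```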

Collaborator

Hmm... Forcing compilation to a string doesn't seem like the right choice to me. Some namedtuple or frozen dataclass would make the most sense to me.

Collaborator

If eq and frozen are both true, by default @dataclass will generate a __hash__() method for you.

https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass
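As the linked docs describe, a frozen, eq-comparing dataclass gets a generated `__hash__`, so instances work in sets and as dict keys. A small self-contained illustration (the `Filter` class here is hypothetical, not a bigframes type):

```python
from dataclasses import dataclass

@dataclass(frozen=True, eq=True)  # eq=True is the default; shown for clarity
class Filter:
    column: str
    op: str
    value: int

f = Filter("col", "==", 1)
# frozen=True + eq=True => dataclasses generates __hash__, so
# instances are usable in sets and as dict keys.
seen = {f}
```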

def __post_init__(self):
# enforce invariants
physical_names = set(map(lambda i: i.name, self.physical_schema))
assert len(self.columns.names) > 0
Collaborator

Why this assertion? It is possible to create a completely empty table in BQ. Why one would want to do so, I'm not certain, but it is possible.

Contributor Author

Yeah, I guess we should allow empty tables; removed this constraint.

# enforce invariants
physical_names = set(map(lambda i: i.name, self.physical_schema))
assert len(self.columns.names) > 0
assert set(self.primary_key).issubset(physical_names)
Collaborator

This assertion might be false in future if "primary key" contains pseudo columns.

Contributor Author

Removed this constraint, though we might need a bit more work to support pseudo columns anyways.

physical_names = set(map(lambda i: i.name, self.physical_schema))
assert len(self.columns.names) > 0
assert set(self.primary_key).issubset(physical_names)
assert set(self.columns.names).issubset(physical_names)
Collaborator

If we ever reach this line of code, it would likely be an indication that we should have a ValueError further up the call stack. It would at least be helpful to have a custom error message here, as in other assertions, so the user knows to file a bug that we missed a validation check somewhere.

Contributor Author

Added some error messages
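A sketch of what an assertion with a descriptive message might look like for this invariant (the helper name and wording are assumptions, not the actual change):

```python
def check_columns(columns, physical_names):
    """Sketch: validate that every requested column exists in the
    physical schema, with a message that signals an internal invariant
    violation worth filing a bug about."""
    missing = set(columns) - set(physical_names)
    assert not missing, (
        f"Requested columns not in physical schema: {missing}. "
        "This is likely an internal error; please file a bug."
    )

check_columns(["a"], ["a", "b"])  # passes silently
```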

@@ -280,6 +281,10 @@ def ibis_dtype_to_bigframes_dtype(
if isinstance(ibis_dtype, ibis_dtypes.Integer):
return pd.Int64Dtype()

# Temporary: Will eventually support an explicit json type instead of casting to string.
Collaborator

We should probably raise a warning (PreviewWarning?) in this case to make sure folks know that depending on any JSON functionality may break in future.

Contributor Author

Added a preview warning.
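The warning suggested above might look roughly like this. The `PreviewWarning` class and function name here are assumptions for illustration; bigframes defines its own equivalents:

```python
import warnings

class PreviewWarning(Warning):
    """Assumed warning class marking behavior that may change."""

def json_to_string_dtype(ibis_dtype_name: str) -> str:
    # Sketch: warn when a JSON column is cast to string, since any
    # code relying on this mapping may break once a native JSON
    # dtype is supported.
    if ibis_dtype_name == "json":
        warnings.warn(
            "JSON columns are currently materialized as STRING; this "
            "behavior is subject to change.",
            category=PreviewWarning,
        )
    return "string"
```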

@@ -210,6 +210,7 @@ def start_query_with_client(
)

try:
print(sql)
Collaborator

Remove this print.

Contributor Author

done

)
bf_read_gbq_table.validate_sql_through_ibis(sql, self.ibis_client)
Collaborator

We'll need Henry's logic to dryrun with and without the time_travel_timestamp so we can continue to support tables that don't support time travel.

Contributor Author

Merged in his logic.

@TrevorBergeron TrevorBergeron requested a review from tswast May 28, 2024 17:59
@TrevorBergeron TrevorBergeron merged commit 9f0406e into main May 30, 2024
20 of 21 checks passed
@TrevorBergeron TrevorBergeron deleted the read_table_node branch May 30, 2024 00:30
3 participants