refactor: New read node that defers ibis table instantiation #709
Conversation
Nowhere near finished reviewing, but sending some early comments so it doesn't get stuck for too long.
bigframes/core/__init__.py
Outdated
session: Session,
*,
predicate: Optional[str] = None,
snapshot_time: Optional[datetime.datetime] = None,
Let's make sure we reconcile this with the changes from #712
done
bigframes/core/compile/compiler.py
Outdated
# These parameters should not be used
index_cols=(),
Is this because we're in ArrayValue, which doesn't have a concept of "index"? Let's clarify in the comment.
Alternatively, any chance you could make these parameters optional in to_query() and omit them?
Yeah, index is really only a concept at higher layers; pulling management of it out to the caller.
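For illustration, making those parameters optional could look something like the sketch below; the signature and SQL construction are assumptions for this discussion, not the actual to_query() helper:

```python
from typing import Iterable, Optional

def to_query(
    table_name: str,
    columns: Iterable[str],
    index_cols: Iterable[str] = (),  # optional: callers without an index concept can omit it
    predicate: Optional[str] = None,
) -> str:
    """Hypothetical sketch of a SELECT builder with optional index columns."""
    select_list = [*index_cols, *columns] or ["*"]
    sql = f"SELECT {', '.join(select_list)} FROM `{table_name}`"
    if predicate:
        sql += f" WHERE {predicate}"
    return sql
```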
bigframes/core/compile/compiler.py
Outdated
ibis_table = ibis.table(physical_schema, full_table_name)

if ordered:
    if node.primary_key:
Seems a bit odd to me to put ordering generation here, but I guess this is just for total ordering, right? We still generate a separate order by when we add the index, right?
Yes, read table nodes should be able to establish their own total ordering, either with provided uniqueness metadata (the primary_key field) or by generating a hash-based key. Just like before, we do a .sort_index() on top of the read operation if the user provided index columns.
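Roughly, the decision described above could be sketched as follows; the class and field names are illustrative, not the actual node or compiler code:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TotalOrdering:
    ordering_columns: Tuple[str, ...]  # columns that together form a total order
    is_sequential: bool  # True when the key is exactly offsets 0, 1, ..., N-1

def choose_total_ordering(
    primary_key: Tuple[str, ...],
    primary_key_sequential: bool,
    hash_key_column: str = "_row_hash",  # synthetic column a compiler could generate
) -> TotalOrdering:
    if primary_key:
        # Uniqueness metadata gives a total ordering directly over the key columns.
        return TotalOrdering(primary_key, primary_key_sequential)
    # Otherwise fall back to a generated hash-based key over the row contents.
    return TotalOrdering((hash_key_column,), False)
```

A .sort_index() over user-provided index columns would then be layered on top, as described above.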
bigframes/core/compile/compiler.py
Outdated
ordering_value_columns = tuple(
    bf_ordering.ascending_over(col) for col in node.primary_key
)
if node.primary_key_sequential:
Where do we have primary keys that we know are sequential integers?
Caching, which doesn't use this new node yet. Also, uploading local data could provide this
bigframes/core/nodes.py
Outdated
columns: schemata.ArraySchema = field()

table_session: bigframes.session.Session = field()
# Should this even be stored here?
Would "native ordering column" or something be more appropriate? Such a name might allow us to use a row ID pseudocolumn as a fallback if one becomes available.
renamed to total_order_cols
bigframes/core/nodes.py
Outdated
primary_key: Tuple[str, ...] = field()  # subset of schema
# indicates a primary key that is exactly offsets 0, 1, 2, ..., N-2, N-1
primary_key_sequential: bool = False
snapshot_time: typing.Optional[datetime.datetime] = None
Technically "time travel" which is different from snapshot in BQ. https://cloud.google.com/bigquery/docs/access-historical-data
Although looking at that, even our the backend messages conflate the two.
renamed symbols to not say "snapshot"
# Added for backwards compatibility, not validated
sql_predicate: typing.Optional[str] = None
Fascinating. This implies some level of SQL compilation outside of this node. Should this be a structured "filters" object, instead?
The original filters type is a bit too flexible, allowing potentially non-hashable tuples. I could convert the whole thing to tuples I guess. Would there be a benefit to that approach?
Hmm... Forcing compilation to a string doesn't seem like the right choice to me. Some namedtuple or frozen dataclass would make the most sense to me.
If eq and frozen are both true, by default @dataclass will generate a __hash__() method for you.
https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass
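As a sketch of that direction (the field names and shapes here are assumptions, not an agreed-on API), a frozen dataclass over tuples stays hashable:

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)  # eq defaults to True, so frozen=True also generates __hash__
class FilterCondition:
    column: str
    operator: str  # e.g. "==", ">", "in"
    value: Union[str, int, float, bool, Tuple[str, ...]]

@dataclass(frozen=True)
class Filters:
    conditions: Tuple[FilterCondition, ...] = ()

# Hashable, so it can live on a node that itself needs to be hashable.
f = Filters((FilterCondition("state", "==", "WA"),))
assert hash(f) == hash(Filters((FilterCondition("state", "==", "WA"),)))
```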
bigframes/core/nodes.py
Outdated
def __post_init__(self):
    # enforce invariants
    physical_names = set(map(lambda i: i.name, self.physical_schema))
    assert len(self.columns.names) > 0
Why this assertion? It is possible to create a completely empty table in BQ. Why one would want to do so, I'm not certain, but it is possible.
Yeah, I guess we should allow empty tables, removed this constraint
bigframes/core/nodes.py
Outdated
# enforce invariants
physical_names = set(map(lambda i: i.name, self.physical_schema))
assert len(self.columns.names) > 0
assert set(self.primary_key).issubset(physical_names)
This assertion might be false in future if "primary key" contains pseudo columns.
Removed this constraint, though we might need a bit more work to support pseudo columns anyways.
bigframes/core/nodes.py
Outdated
physical_names = set(map(lambda i: i.name, self.physical_schema))
assert len(self.columns.names) > 0
assert set(self.primary_key).issubset(physical_names)
assert set(self.columns.names).issubset(physical_names)
If we ever reach this line of code, it would likely be an indication that we should have raised a ValueError further up the call stack. It would at least be helpful to have a custom error message here, as in the other assertions, so the user knows to file a bug about the validation check we missed somewhere.
Added some error messages
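Something along these lines, as a minimal sketch rather than the real node definition:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ReadTableNode:
    physical_columns: Tuple[str, ...]  # columns present in the table schema
    selected_columns: Tuple[str, ...]  # columns requested by the caller

    def __post_init__(self):
        missing = set(self.selected_columns) - set(self.physical_columns)
        assert not missing, (
            f"Requested columns {missing} are not in the table schema. "
            "This is likely a bug in BigQuery DataFrames; please file an issue."
        )
```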
@@ -280,6 +281,10 @@ def ibis_dtype_to_bigframes_dtype(
if isinstance(ibis_dtype, ibis_dtypes.Integer):
    return pd.Int64Dtype()

# Temporary: Will eventually support an explicit json type instead of casting to string.
We should probably raise a warning (PreviewWarning?) in this case to make sure folks know that depending on any JSON functionality may break in future.
added a preview warning
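Something like the following sketch; the actual change may use a dedicated PreviewWarning class, with a plain UserWarning standing in for it here:

```python
import warnings

import pandas as pd

def json_dtype_as_string() -> pd.StringDtype:
    """Hedged sketch: warn that JSON columns are cast to STRING for now."""
    warnings.warn(
        "Interpreting JSON column(s) as pandas StringDtype. This behavior may "
        "change once an explicit JSON type is supported.",
        category=UserWarning,
    )
    return pd.StringDtype(storage="pyarrow")
```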
@@ -210,6 +210,7 @@ def start_query_with_client(
)

try:
    print(sql)
Remove this print.
done
bigframes/session/__init__.py
Outdated
)
bf_read_gbq_table.validate_sql_through_ibis(sql, self.ibis_client)
We'll need Henry's logic to dry-run with and without the time_travel_timestamp so we can continue to support tables that don't support time travel.
Merged in his logic.
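For reference, the dry-run fallback could be sketched like this; it is illustrative only, not the actual merged implementation:

```python
from google.api_core import exceptions as api_exceptions
from google.cloud import bigquery

def supports_time_travel(
    client: bigquery.Client, sql_with_time_travel: str, sql_without: str
) -> bool:
    """Dry-run the time-travel query first; fall back to the plain query.

    Returns True if the table can be read with FOR SYSTEM_TIME AS OF,
    False if it is readable only without time travel. Raises if neither works.
    """
    job_config = bigquery.QueryJobConfig(dry_run=True)
    try:
        client.query(sql_with_time_travel, job_config=job_config)
        return True
    except api_exceptions.BadRequest:
        client.query(sql_without, job_config=job_config)
        return False
```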