Merged
Commits
39 commits
b303999
docs: document index as a best practice
tswast Apr 19, 2024
0ddd86b
docs: set `index_cols` in `read_gbq` as a best practice
tswast Apr 19, 2024
994a8f1
feat: support primary key(s) in `read_gbq` by using as the `index_col…
tswast Apr 19, 2024
5fcc5a0
revert WIP commit
tswast Apr 19, 2024
6b6a5ab
Merge branch 'main' into b335727141-primary_key
tswast Apr 22, 2024
8c4e31c
address type error in tests
tswast Apr 22, 2024
dd940bd
Merge branch 'b335727141-primary_key' into b335727141-clustered-or-pa…
tswast Apr 22, 2024
b96cba3
document behaviors
tswast Apr 22, 2024
fb3b508
Merge branch 'b335727141-docs' into b335727141-clustered-or-partition…
tswast Apr 22, 2024
477a516
update docs to reflect new default index behavior
tswast Apr 23, 2024
2c5a0dd
add DefaultIndexKind to allowed `index_col` values
tswast Apr 24, 2024
d485be6
Merge remote-tracking branch 'origin/main' into b335727141-clustered-…
tswast Apr 24, 2024
d816db3
refactor: cache table metadata alongside snapshot time
tswast Apr 24, 2024
d3f0891
Merge branch 'b335727141-snapshot-save-metadata' into b335727141-clus…
tswast Apr 24, 2024
241dc60
add unit tests
tswast Apr 25, 2024
613e660
parametrize tables with clustered and partitioned
tswast Apr 25, 2024
2c782ca
Merge remote-tracking branch 'origin/main' into b335727141-clustered-…
tswast Apr 26, 2024
f437dcf
refactor: split `read_gbq_table` implementation into functions and mo…
tswast Apr 26, 2024
0090dc0
refactor progress
tswast Apr 29, 2024
850db7a
add index_cols function
tswast Apr 29, 2024
ab98d4a
maybe ready for review
tswast Apr 29, 2024
5b665dd
Merge remote-tracking branch 'origin/main' into b335727141-refactor-r…
tswast Apr 29, 2024
0577131
Update bigframes/session/__init__.py
tswast Apr 30, 2024
f3f6982
Merge branch 'main' into b335727141-refactor-read_gbq_table
tswast Apr 30, 2024
453eece
Merge branch 'b335727141-refactor-read_gbq_table' into b335727141-clu…
tswast Apr 30, 2024
175a23c
Merge remote-tracking branch 'origin/main' into b335727141-clustered-…
tswast Apr 30, 2024
204b2db
remove some todos
tswast Apr 30, 2024
adaf664
add error raising plus todos
tswast Apr 30, 2024
e8bdded
Merge remote-tracking branch 'origin/main' into b335727141-clustered-…
tswast May 1, 2024
d028bc5
add TODO for ROW_NUMBER() in the query we generate
tswast May 1, 2024
658f61d
remove filters unit test for now
tswast May 1, 2024
f1b3f88
docstring fixes
tswast May 1, 2024
6b0e63c
Merge branch 'main' into b335727141-clustered-or-partitioned-default-…
tswast May 1, 2024
40fab82
Merge remote-tracking branch 'origin/main' into b335727141-clustered-…
tswast May 2, 2024
9f3e149
feat: support `index_col=False` in `read_csv` and `engine="bigquery"`
tswast May 2, 2024
722abbb
Merge remote-tracking branch 'origin/b335727141-clustered-or-partitio…
tswast May 2, 2024
e7c4d93
revert typo
tswast May 2, 2024
d136bc0
attempt 2
tswast May 2, 2024
586cca2
Merge remote-tracking branch 'origin/main' into b335727141-clustered-…
tswast May 2, 2024
4 changes: 4 additions & 0 deletions bigframes/__init__.py
@@ -17,6 +2,8 @@
from bigframes._config import option_context, options
from bigframes._config.bigquery_options import BigQueryOptions
from bigframes.core.global_session import close_session, get_global_session
import bigframes.enums as enums
import bigframes.exceptions as exceptions
from bigframes.session import connect, Session
from bigframes.version import __version__

@@ -25,6 +27,8 @@
"BigQueryOptions",
"get_global_session",
"close_session",
"enums",
Contributor commented:

Is `enums` an intuitive module name, or would a domain-related term be better, e.g. `indexing.IndexType`, or directly putting the enum in the main module as `bigframes.pandas.IndexType`?

Collaborator (author) replied:

I tried to find some guidance on this, but the Python community doesn't seem particularly prescriptive about module names.

PEP-8 has this to say:

Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability.

https://peps.python.org/pep-0008/#package-and-module-names

Google Python style guide has a bit more to say:

Place related classes and top-level functions together in a module. Unlike Java, there is no need to limit yourself to one class per module.

Use CapWords for class names, but lower_with_under.py for module names.

https://google.github.io/styleguide/pyguide.html#3162-naming-conventions

I tried a few of these options out locally (bigframes.indexes.DefaultIndexKind and bigframes.pandas.DefaultIndexKind), but it feels strange to have something not really mimicking pandas in the pandas sub-package and bigframes.indexes.DefaultIndexKind would imply that we should move the Index and MultiIndex classes there, which is kinda the opposite of what we want to do.

The other option we could try is bigframes.pandas.core.indexes, but in pandas "core" is how they signify that an API is private and not to be relied on.

IMO, determining if classes are "related" by type for the basic types (e.g. exceptions, enums, ...) will be less effort for us long-term than having to figure out which public package to place these things in if they don't fit an existing API.

"exceptions",
"connect",
"Session",
"__version__",
10 changes: 10 additions & 0 deletions bigframes/core/blocks.py
@@ -116,10 +116,20 @@ def __init__(
raise ValueError(
f"'index_columns' (size {len(index_columns)}) and 'index_labels' (size {len(index_labels)}) must have equal length"
)

# If no index columns are set, create one.
#
# Note: get_index_cols_and_uniqueness in
# bigframes/session/_io/bigquery/read_gbq_table.py depends on this
# being a sequential integer index column. If this default behavior
# ever changes, please also update get_index_cols_and_uniqueness so
# that users who explicitly request a sequential integer index can
# still get one.
if len(index_columns) == 0:
new_index_col_id = guid.generate_guid()
expr = expr.promote_offsets(new_index_col_id)
index_columns = [new_index_col_id]

self._index_columns = tuple(index_columns)
# Index labels don't need complicated hierarchical access so can store as tuple
self._index_labels = (
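The default index that `Block.__init__` now creates is a zero-based row offset produced by `promote_offsets`. A TODO in this PR also mentions generating it via `ROW_NUMBER()` in the query. As a rough standalone sketch (the helper name and SQL shape are illustrative assumptions, not the actual implementation, which operates on the expression tree):

```python
def with_sequential_index_sql(table_expr_sql: str, index_col: str) -> str:
    # Hypothetical sketch: wrap a query in a projection that adds a
    # 0-based sequential integer column, analogous to what
    # promote_offsets does for the default index.
    return (
        f"SELECT *, ROW_NUMBER() OVER () - 1 AS `{index_col}` "
        f"FROM ({table_expr_sql})"
    )

print(with_sequential_index_sql("SELECT * FROM `proj.ds.tbl`", "row_offset"))
```

Note that `ROW_NUMBER() OVER ()` without an `ORDER BY` gives an arbitrary but total ordering, which matches the "consecutive integers" contract of the new enum without pinning a row order.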
29 changes: 29 additions & 0 deletions bigframes/enums.py
@@ -0,0 +1,29 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Public enums used across BigQuery DataFrames."""

# NOTE: This module should not depend on any others in the package.


import enum


class DefaultIndexKind(enum.Enum):
"""Sentinel values used to override default indexing behavior."""

#: Use consecutive integers as the index. This is ``0``, ``1``, ``2``, ...,
#: ``n - 3``, ``n - 2``, ``n - 1``, where ``n`` is the number of items in
#: the index.
SEQUENTIAL_INT64 = enum.auto()
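The sentinel is dispatched in the session code below: when `index_col` is a `DefaultIndexKind`, no index columns are selected, so `Block.__init__` falls back to generating the sequential integer index. A self-contained mirror of that dispatch (the enum and helper are re-declared here for illustration only):

```python
import enum
from typing import Any, List


class DefaultIndexKind(enum.Enum):
    # Standalone mirror of bigframes.enums.DefaultIndexKind.
    SEQUENTIAL_INT64 = enum.auto()


def normalize_index_col(index_col: Any) -> List[str]:
    # Mirrors the dispatch this PR adds in _read_gbq_query and
    # _read_bigquery_load_job: a DefaultIndexKind sentinel yields no
    # index columns, so the default sequential index is created later.
    if isinstance(index_col, DefaultIndexKind):
        return []
    if isinstance(index_col, str):
        return [index_col]
    return list(index_col)


print(normalize_index_col("id"))        # ['id']
print(normalize_index_col(("a", "b")))  # ['a', 'b']
print(normalize_index_col(DefaultIndexKind.SEQUENTIAL_INT64))  # []
```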
8 changes: 8 additions & 0 deletions bigframes/exceptions.py
@@ -12,6 +12,14 @@
# See the License for the specific language governing permissions and
# limitations under the License.

"""Public exceptions and warnings used across BigQuery DataFrames."""

# NOTE: This module should not depend on any others in the package.


class UnknownLocationWarning(Warning):
"""The location is set to an unknown value."""


class NoDefaultIndexError(ValueError):
"""Unable to create a default index."""
15 changes: 11 additions & 4 deletions bigframes/pandas/__init__.py
@@ -63,6 +63,7 @@
import bigframes.core.reshape
import bigframes.core.tools
import bigframes.dataframe
import bigframes.enums
import bigframes.operations as ops
import bigframes.series
import bigframes.session
@@ -423,7 +424,13 @@ def read_csv(
Union[MutableSequence[Any], numpy.ndarray[Any, Any], Tuple[Any, ...], range]
] = None,
index_col: Optional[
Union[int, str, Sequence[Union[str, int]], Literal[False]]
Union[
int,
str,
Sequence[Union[str, int]],
bigframes.enums.DefaultIndexKind,
Literal[False],
]
] = None,
usecols: Optional[
Union[
@@ -491,7 +498,7 @@ def read_json(
def read_gbq(
query_or_table: str,
*,
index_col: Iterable[str] | str = (),
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
configuration: Optional[Dict] = None,
max_results: Optional[int] = None,
@@ -529,7 +536,7 @@ def read_gbq_model(model_name: str):
def read_gbq_query(
query: str,
*,
index_col: Iterable[str] | str = (),
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
configuration: Optional[Dict] = None,
max_results: Optional[int] = None,
@@ -555,7 +562,7 @@ def read_gbq_query(
def read_gbq_table(
query: str,
*,
index_col: Iterable[str] | str = (),
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
max_results: Optional[int] = None,
filters: vendored_pandas_gbq.FiltersType = (),
74 changes: 55 additions & 19 deletions bigframes/session/__init__.py
@@ -294,7 +294,7 @@ def read_gbq(
self,
query_or_table: str,
*,
index_col: Iterable[str] | str = (),
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
configuration: Optional[Dict] = None,
max_results: Optional[int] = None,
@@ -313,6 +313,9 @@ def read_gbq(

filters = list(filters)
if len(filters) != 0 or _is_table_with_wildcard_suffix(query_or_table):
# TODO(b/338111344): This appears to be missing index_cols, which
# are necessary to be selected.
# TODO(b/338039517): Also, need to account for primary keys.
query_or_table = self._to_query(query_or_table, columns, filters)

if _is_query(query_or_table):
@@ -326,9 +329,6 @@
use_cache=use_cache,
)
else:
# TODO(swast): Query the snapshot table but mark it as a
# deterministic query so we can avoid serializing if we have a
# unique index.
if configuration is not None:
raise ValueError(
"The 'configuration' argument is not allowed when "
@@ -359,6 +359,8 @@ def _to_query(
else f"`{query_or_table}`"
)

# TODO(b/338111344): Generate an index based on DefaultIndexKind if we
# don't have index columns specified.
select_clause = "SELECT " + (
", ".join(f"`{column}`" for column in columns) if columns else "*"
)
@@ -488,7 +490,7 @@ def read_gbq_query(
self,
query: str,
*,
index_col: Iterable[str] | str = (),
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
configuration: Optional[Dict] = None,
max_results: Optional[int] = None,
@@ -566,7 +568,7 @@ def _read_gbq_query(
self,
query: str,
*,
index_col: Iterable[str] | str = (),
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
configuration: Optional[Dict] = None,
max_results: Optional[int] = None,
@@ -598,7 +600,9 @@ def _read_gbq_query(
True if use_cache is None else use_cache
)

if isinstance(index_col, str):
if isinstance(index_col, bigframes.enums.DefaultIndexKind):
index_cols = []
elif isinstance(index_col, str):
index_cols = [index_col]
else:
index_cols = list(index_col)
@@ -628,7 +632,7 @@

return self.read_gbq_table(
f"{destination.project}.{destination.dataset_id}.{destination.table_id}",
index_col=index_cols,
index_col=index_col,
columns=columns,
max_results=max_results,
use_cache=configuration["query"]["useQueryCache"],
@@ -638,7 +642,7 @@ def read_gbq_table(
self,
query: str,
*,
index_col: Iterable[str] | str = (),
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
max_results: Optional[int] = None,
filters: third_party_pandas_gbq.FiltersType = (),
@@ -693,7 +697,7 @@ def _read_gbq_table(
self,
query: str,
*,
index_col: Iterable[str] | str = (),
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
max_results: Optional[int] = None,
api_name: str,
@@ -821,10 +825,12 @@ def _read_bigquery_load_job(
table: Union[bigquery.Table, bigquery.TableReference],
*,
job_config: bigquery.LoadJobConfig,
index_col: Iterable[str] | str = (),
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
) -> dataframe.DataFrame:
if isinstance(index_col, str):
if isinstance(index_col, bigframes.enums.DefaultIndexKind):
index_cols = []
elif isinstance(index_col, str):
index_cols = [index_col]
else:
index_cols = list(index_col)
@@ -1113,7 +1119,13 @@ def read_csv(
Union[MutableSequence[Any], np.ndarray[Any, Any], Tuple[Any, ...], range]
] = None,
index_col: Optional[
Union[int, str, Sequence[Union[str, int]], Literal[False]]
Union[
int,
str,
Sequence[Union[str, int]],
bigframes.enums.DefaultIndexKind,
Literal[False],
]
] = None,
usecols: Optional[
Union[
@@ -1143,18 +1155,37 @@
f"{constants.FEEDBACK_LINK}"
)

if index_col is not None and (
not index_col or not isinstance(index_col, str)
# TODO(b/338089659): Looks like we can relax this one-column
# restriction if we check that the contents of an iterable are
# strings, not integers.
if (
# Empty tuples, None, and False are allowed and falsey.
index_col
and not isinstance(index_col, bigframes.enums.DefaultIndexKind)
and not isinstance(index_col, str)
):
raise NotImplementedError(
"BigQuery engine only supports a single column name for `index_col`. "
f"{constants.FEEDBACK_LINK}"
"BigQuery engine only supports a single column name for `index_col`, "
f"got: {repr(index_col)}. {constants.FEEDBACK_LINK}"
)

# None value for index_col cannot be passed to read_gbq
if index_col is None:
# None and False cannot be passed to read_gbq.
# TODO(b/338400133): When index_col is None, we should be using the
# first column of the CSV as the index to be compatible with the
# pandas engine. According to the pandas docs, only "False"
# indicates a default sequential index.
if not index_col:
index_col = ()

index_col = typing.cast(
Union[
Sequence[str], # Falsey values
bigframes.enums.DefaultIndexKind,
str,
],
index_col,
)

# usecols should only be an iterable of strings (column names) for use as columns in read_gbq.
columns: Tuple[Any, ...] = tuple()
if usecols is not None:
@@ -1199,6 +1230,11 @@ def read_csv(
columns=columns,
)
else:
if isinstance(index_col, bigframes.enums.DefaultIndexKind):
raise NotImplementedError(
f"With index_col={repr(index_col)}, only engine='bigquery' is supported. "
f"{constants.FEEDBACK_LINK}"
)
if any(arg in kwargs for arg in ("chunksize", "iterator")):
raise NotImplementedError(
"'chunksize' and 'iterator' arguments are not supported. "
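The `read_csv` changes above gate the BigQuery engine on `index_col`: falsey values (`None`, `False`, `()`) and `DefaultIndexKind` sentinels pass through, and anything else must be a single column-name string. A self-contained mirror of that validation (enum and helper re-declared here purely for illustration):

```python
import enum
from typing import Any


class DefaultIndexKind(enum.Enum):
    SEQUENTIAL_INT64 = enum.auto()


def validate_bigquery_index_col(index_col: Any) -> None:
    # Mirrors the check this PR adds to read_csv for engine="bigquery":
    # empty tuples, None, and False are allowed and falsey; a
    # DefaultIndexKind sentinel or a single column-name string is also
    # allowed; anything else (e.g. a list of names) is rejected.
    if (
        index_col
        and not isinstance(index_col, DefaultIndexKind)
        and not isinstance(index_col, str)
    ):
        raise NotImplementedError(
            "BigQuery engine only supports a single column name for "
            f"`index_col`, got: {index_col!r}."
        )


validate_bigquery_index_col(False)                              # OK
validate_bigquery_index_col("id")                               # OK
validate_bigquery_index_col(DefaultIndexKind.SEQUENTIAL_INT64)  # OK
```

Note that `index_col=False` and `index_col=None` are later normalized to `()` before calling `read_gbq`, per the comments in the diff.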