From 3e3b1850572bc9383a8f68bd4fa1f88f9590f616 Mon Sep 17 00:00:00 2001 From: Chelsea Lin Date: Mon, 5 Aug 2024 23:25:32 +0000 Subject: [PATCH 1/3] docs: create sample notebook to manipulate struct and array data --- .../dataframes/struct_and_array_dtypes.ipynb | 658 ++++++++++++++++++ 1 file changed, 658 insertions(+) create mode 100644 notebooks/dataframes/struct_and_array_dtypes.ipynb diff --git a/notebooks/dataframes/struct_and_array_dtypes.ipynb b/notebooks/dataframes/struct_and_array_dtypes.ipynb new file mode 100644 index 0000000000..b056e78bd3 --- /dev/null +++ b/notebooks/dataframes/struct_and_array_dtypes.ipynb @@ -0,0 +1,658 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Copyright 2023 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# A Guide to Array and Struct Data Types in BigQuery DataFrames" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Set up your environment\n", + "\n", + "Please refer to the notebooks in the `getting_started` folder for instructions on setting up your environment. 
Once your environment is ready, run the following code to import the necessary packages for working with BigFrames arrays:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "import bigframes.pandas as bpd\n", + "import bigframes.bigquery as bbq\n", + "import pyarrow as pa" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "REGION = \"US\" # @param {type: \"string\"}\n", + "bpd.options.display.progress_bar = None\n", + "bpd.options.bigquery.location = REGION\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Array Data Types\n", + "\n", + "In BigQuery, an [array](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type), also referred to as a `repeated` column, is an ordered list of zero or more non-array elements. These elements must be of the same data type, and arrays cannot contain other arrays. Furthermore, query results cannot include arrays with `NULL` elements.\n", + "\n", + "BigFrames DataFrames, inheriting these properties, map BigQuery array types to `pandas.ArrowDtype(pa.list_())`. This section provides code examples demonstrating how to effectively work with array columns within BigFrames DataFrames." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create DataFrames with array columns \n", + "\n", + "Let's create a sample BigFrames DataFrame where the `Scores` column holds array data of type `list[pyarrow]`:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
NameScores
0Alice[95 88 92]
1Bob[78 81]
2Charlie[ 82 89 94 100]
\n", + "

3 rows × 2 columns

\n", + "
[3 rows x 2 columns in total]" + ], + "text/plain": [ + " Name Scores\n", + "0 Alice [95 88 92]\n", + "1 Bob [78 81]\n", + "2 Charlie [ 82 89 94 100]\n", + "\n", + "[3 rows x 2 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = bpd.DataFrame({\n", + " 'Name': ['Alice', 'Bob', 'Charlie'],\n", + " 'Scores': [[95, 88, 92], [78, 81], [82, 89, 94, 100]],\n", + "})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Name string[pyarrow]\n", + "Scores list[pyarrow]\n", + "dtype: object" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## CRUD operations for array data\n", + "\n", + "While Pandas offers vectorized operations and lambda expressions to manipulate array data, BigFrames leverages BigQuery's computational power. BigFrames introduces the [`bigframes.bigquery`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery) package to provide access to a variety of native BigQuery array operations, such as [array_agg](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg), [array_length](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length), and others. This module allows you to seamlessly perform create, read, update, and delete (CRUD) operations on array data within your BigFrames DataFrames.\n", + "\n", + "Let's delve into how you can utilize these functions to effectively manipulate array data in BigFrames." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 3\n", + "1 2\n", + "2 4\n", + "Name: Scores, dtype: Int64" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Find the length in each array\n", + "bbq.array_length(df['Scores'])" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 95\n", + "0 88\n", + "0 92\n", + "1 78\n", + "1 81\n", + "2 82\n", + "2 89\n", + "2 94\n", + "2 100\n", + "Name: Scores, dtype: Int64" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Explode array elements into rows\n", + "scores = df['Scores'].explode()\n", + "scores" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 95.238095\n", + "0 88.571429\n", + "0 92.380952\n", + "1 79.047619\n", + "1 81.904762\n", + "2 82.857143\n", + "2 89.52381\n", + "2 94.285714\n", + "2 100.0\n", + "Name: Scores, dtype: Float64" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Adjuste the scores\n", + "adj_scores = (scores + 5) / 105.0 * 100.0\n", + "adj_scores" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 [95.23809524 88.57142857 92.38095238]\n", + "1 [79.04761905 81.9047619 ]\n", + "2 [ 82.85714286 89.52380952 94.28571429 100. 
...\n", + "Name: Scores, dtype: list[pyarrow]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Aggregate adjusted scores back into arrays\n", + "adj_scores_arr = bbq.array_agg(adj_scores.groupby(level=0))\n", + "adj_scores_arr" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
NameScoresNewScores
0Alice[95 88 92][95.23809524 88.57142857 92.38095238]
1Bob[78 81][79.04761905 81.9047619 ]
2Charlie[ 82 89 94 100][ 82.85714286 89.52380952 94.28571429 100. ...
\n", + "

3 rows × 3 columns

\n", + "
[3 rows x 3 columns in total]" + ], + "text/plain": [ + " Name Scores \\\n", + "0 Alice [95 88 92] \n", + "1 Bob [78 81] \n", + "2 Charlie [ 82 89 94 100] \n", + "\n", + " NewScores \n", + "0 [95.23809524 88.57142857 92.38095238] \n", + "1 [79.04761905 81.9047619 ] \n", + "2 [ 82.85714286 89.52380952 94.28571429 100. ... \n", + "\n", + "[3 rows x 3 columns]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Incorporate adjusted scores into the DataFrame\n", + "df['NewScores'] = adj_scores_arr\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Struct Data Types\n", + "\n", + "In BigQuery, an [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigFrames maps BigQuery struct types to the Pandas equivalent, `pandas.ArrowDtype(pa.struct())`. In this section, we'll explore practical code examples illustrating how to work with struct columns within your BigFrames DataFrames." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create DataFrames with struct columns \n", + "\n", + "Let's create a sample BigFrames DataFrame where the `Address` column holds struct data of type `struct[pyarrow]`:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/google/home/chelsealin/src/bigframes2/venv/lib/python3.12/site-packages/google/cloud/bigquery/_pandas_helpers.py:537: UserWarning: Pyarrow could not determine the type of columns: bigframes_unnamed_index.\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
NameAddress
0Alice{'City': 'New York', 'State': 'NY'}
1Bob{'City': 'San Francisco', 'State': 'CA'}
2Charlie{'City': 'Seattle', 'State': 'WA'}
\n", + "

3 rows × 2 columns

\n", + "
[3 rows x 2 columns in total]" + ], + "text/plain": [ + " Name Address\n", + "0 Alice {'City': 'New York', 'State': 'NY'}\n", + "1 Bob {'City': 'San Francisco', 'State': 'CA'}\n", + "2 Charlie {'City': 'Seattle', 'State': 'WA'}\n", + "\n", + "[3 rows x 2 columns]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "names = bpd.Series(['Alice', 'Bob', 'Charlie'])\n", + "address = bpd.Series(\n", + " [\n", + " {'City': 'New York', 'State': 'NY'},\n", + " {'City': 'San Francisco', 'State': 'CA'},\n", + " {'City': 'Seattle', 'State': 'WA'}\n", + " ],\n", + " dtype=bpd.ArrowDtype(pa.struct(\n", + " [('City', pa.string()), ('State', pa.string())]\n", + " )))\n", + "\n", + "df = bpd.DataFrame({'Name': names, 'Address': address})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Name string[pyarrow]\n", + "Address struct[pyarrow]\n", + "dtype: object" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## CRUD operations for struct data\n", + "\n", + "Similar to Pandas, BigFrames provides a [`StructAccessor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor) to streamline the manipulation of struct data. Let's explore how you can utilize this feature for efficient CRUD operations on your nested struct columns." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "City string[pyarrow]\n", + "State string[pyarrow]\n", + "dtype: object" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Return the dtype object of each child field of the struct.\n", + "df['Address'].struct.dtypes()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 New York\n", + "1 San Francisco\n", + "2 Seattle\n", + "Name: City, dtype: string" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Extract a child field as a Series\n", + "city = df['Address'].struct.field(\"City\")\n", + "city" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CityState
0New YorkNY
1San FranciscoCA
2SeattleWA
\n", + "

3 rows × 2 columns

\n", + "
[3 rows x 2 columns in total]" + ], + "text/plain": [ + " City State\n", + "0 New York NY\n", + "1 San Francisco CA\n", + "2 Seattle WA\n", + "\n", + "[3 rows x 2 columns]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Extract all child fields of a struct as a DataFrame.\n", + "address_df = df['Address'].struct.explode()\n", + "address_df" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From f3ce5310a47a235deff8016aeb75d1c190a1202d Mon Sep 17 00:00:00 2001 From: Chelsea Lin Date: Wed, 7 Aug 2024 20:27:38 +0000 Subject: [PATCH 2/3] typo --- notebooks/dataframes/struct_and_array_dtypes.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/notebooks/dataframes/struct_and_array_dtypes.ipynb b/notebooks/dataframes/struct_and_array_dtypes.ipynb index b056e78bd3..3ba07d1b88 100644 --- a/notebooks/dataframes/struct_and_array_dtypes.ipynb +++ b/notebooks/dataframes/struct_and_array_dtypes.ipynb @@ -266,7 +266,7 @@ } ], "source": [ - "# Adjuste the scores\n", + "# Adjust the scores\n", "adj_scores = (scores + 5) / 105.0 * 100.0\n", "adj_scores" ] @@ -382,7 +382,7 @@ "source": [ "# Struct Data Types\n", "\n", - "In BigQuery, an [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigFrames maps BigQuery struct types to the Pandas equivalent, `pandas.ArrowDtype(pa.struct())`. 
In this section, we'll explore practical code examples illustrating how to work with struct columns within your BigFrames DataFrames." + "In BigQuery, a [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigFrames maps BigQuery struct types to the Pandas equivalent, `pandas.ArrowDtype(pa.struct())`. In this section, we'll explore practical code examples illustrating how to work with struct columns within your BigFrames DataFrames." ] }, { From e172cf8cb0a8d236da5dee90c805548760a74ec0 Mon Sep 17 00:00:00 2001 From: Chelsea Lin Date: Tue, 13 Aug 2024 17:55:01 +0000 Subject: [PATCH 3/3] address comments --- .../dataframes/struct_and_array_dtypes.ipynb | 92 +++++++++---------- 1 file changed, 45 insertions(+), 47 deletions(-) diff --git a/notebooks/dataframes/struct_and_array_dtypes.ipynb b/notebooks/dataframes/struct_and_array_dtypes.ipynb index 3ba07d1b88..3bcdaf40f7 100644 --- a/notebooks/dataframes/struct_and_array_dtypes.ipynb +++ b/notebooks/dataframes/struct_and_array_dtypes.ipynb @@ -34,12 +34,12 @@ "source": [ "# Set up your environment\n", "\n", - "Please refer to the notebooks in the `getting_started` folder for instructions on setting up your environment. Once your environment is ready, run the following code to import the necessary packages for working with BigFrames arrays:" + "To get started, follow the instructions in the notebooks within the `getting_started` folder to set up your environment. 
Once your environment is ready, you can import the necessary packages by running the following code:" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -50,13 +50,14 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "REGION = \"US\" # @param {type: \"string\"}\n", + "\n", "bpd.options.display.progress_bar = None\n", - "bpd.options.bigquery.location = REGION\n" + "bpd.options.bigquery.location = REGION" ] }, { @@ -65,18 +66,18 @@ "source": [ "# Array Data Types\n", "\n", - "In BigQuery, an [array](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type), also referred to as a `repeated` column, is an ordered list of zero or more non-array elements. These elements must be of the same data type, and arrays cannot contain other arrays. Furthermore, query results cannot include arrays with `NULL` elements.\n", + "In BigQuery, an [array](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type) (also called a repeated column) is an ordered list of zero or more elements of the same data type. Arrays cannot contain other arrays or `NULL` elements.\n", "\n", - "BigFrames DataFrames, inheriting these properties, map BigQuery array types to `pandas.ArrowDtype(pa.list_())`. This section provides code examples demonstrating how to effectively work with array columns within BigFrames DataFrames." + "BigQuery DataFrames map BigQuery array types to `pandas.ArrowDtype(pa.list_())`. The following code examples illustrate how to work with array columns in BigQuery DataFrames." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Create DataFrames with array columns \n", + "## Create DataFrames with array columns\n", "\n", - "Let's create a sample BigFrames DataFrame where the `Scores` column holds array data of type `list[pyarrow]`:" + "Create a DataFrame in BigQuery DataFrames from local sample data. Use a list of lists to create a column with the `list[pyarrow]` dtype, which corresponds to the `ARRAY` type in BigQuery." ] }, { @@ -178,11 +179,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## CRUD operations for array data\n", - "\n", - "While Pandas offers vectorized operations and lambda expressions to manipulate array data, BigFrames leverages BigQuery's computational power. BigFrames introduces the [`bigframes.bigquery`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery) package to provide access to a variety of native BigQuery array operations, such as [array_agg](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg), [array_length](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length), and others. This module allows you to seamlessly perform create, read, update, and delete (CRUD) operations on array data within your BigFrames DataFrames.\n", + "## Operate on array data\n", "\n", - "Let's delve into how you can utilize these functions to effectively manipulate array data in BigFrames." + "While pandas offers vectorized operations and lambda expressions for array manipulation, BigQuery DataFrames leverages the computational power of BigQuery itself. 
You can access a variety of native BigQuery array operations, such as [`array_agg`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg) and [`array_length`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length), through the [`bigframes.bigquery`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery) package (abbreviated as `bbq` in the following code samples)." ] }, { @@ -205,7 +204,7 @@ } ], "source": [ - "# Find the length in each array\n", + "# Find the length in each array.\n", "bbq.array_length(df['Scores'])" ] }, @@ -235,7 +234,9 @@ } ], "source": [ - "# Explode array elements into rows\n", + "# Transforms array elements into individual rows, preserving original order when in ordering\n", + "# mode. If an array has multiple elements, exploded rows are ordered by the element's index\n", + "# within its original array.\n", "scores = df['Scores'].explode()\n", "scores" ] @@ -248,15 +249,15 @@ { "data": { "text/plain": [ - "0 95.238095\n", - "0 88.571429\n", - "0 92.380952\n", - "1 79.047619\n", - "1 81.904762\n", - "2 82.857143\n", - "2 89.52381\n", - "2 94.285714\n", - "2 100.0\n", + "0 100.0\n", + "0 93.0\n", + "0 97.0\n", + "1 83.0\n", + "1 86.0\n", + "2 87.0\n", + "2 94.0\n", + "2 99.0\n", + "2 105.0\n", "Name: Scores, dtype: Float64" ] }, @@ -266,8 +267,8 @@ } ], "source": [ - "# Adjust the scores\n", - "adj_scores = (scores + 5) / 105.0 * 100.0\n", + "# Adjust the scores.\n", + "adj_scores = scores + 5.0\n", "adj_scores" ] }, @@ -279,9 +280,9 @@ { "data": { "text/plain": [ - "0 [95.23809524 88.57142857 92.38095238]\n", - "1 [79.04761905 81.9047619 ]\n", - "2 [ 82.85714286 89.52380952 94.28571429 100. ...\n", + "0 [100. 93. 97.]\n", + "1 [83. 86.]\n", + "2 [ 87. 94. 99. 
105.]\n", "Name: Scores, dtype: list[pyarrow]" ] }, @@ -291,7 +292,7 @@ } ], "source": [ - "# Aggregate adjusted scores back into arrays\n", + "# Aggregate adjusted scores back into arrays.\n", "adj_scores_arr = bbq.array_agg(adj_scores.groupby(level=0))\n", "adj_scores_arr" ] @@ -332,19 +333,19 @@ " 0\n", " Alice\n", " [95 88 92]\n", - " [95.23809524 88.57142857 92.38095238]\n", + " [100. 93. 97.]\n", " \n", " \n", " 1\n", " Bob\n", " [78 81]\n", - " [79.04761905 81.9047619 ]\n", + " [83. 86.]\n", " \n", " \n", " 2\n", " Charlie\n", " [ 82 89 94 100]\n", - " [ 82.85714286 89.52380952 94.28571429 100. ...\n", + " [ 87. 94. 99. 105.]\n", " \n", " \n", "\n", @@ -352,15 +353,10 @@ "[3 rows x 3 columns in total]" ], "text/plain": [ - " Name Scores \\\n", - "0 Alice [95 88 92] \n", - "1 Bob [78 81] \n", - "2 Charlie [ 82 89 94 100] \n", - "\n", - " NewScores \n", - "0 [95.23809524 88.57142857 92.38095238] \n", - "1 [79.04761905 81.9047619 ] \n", - "2 [ 82.85714286 89.52380952 94.28571429 100. ... \n", + " Name Scores NewScores\n", + "0 Alice [95 88 92] [100. 93. 97.]\n", + "1 Bob [78 81] [83. 86.]\n", + "2 Charlie [ 82 89 94 100] [ 87. 94. 99. 105.]\n", "\n", "[3 rows x 3 columns]" ] @@ -371,7 +367,9 @@ } ], "source": [ - "# Incorporate adjusted scores into the DataFrame\n", + "# Add adjusted scores into the DataFrame. This operation requires an implicit join \n", + "# between the two tables, necessitating a unique index in the DataFrame (guaranteed \n", + "# in the default ordering and index mode).\n", "df['NewScores'] = adj_scores_arr\n", "df" ] @@ -382,7 +380,7 @@ "source": [ "# Struct Data Types\n", "\n", - "In BigQuery, a [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigFrames maps BigQuery struct types to the Pandas equivalent, `pandas.ArrowDtype(pa.struct())`. 
In this section, we'll explore practical code examples illustrating how to work with struct columns within your BigFrames DataFrames." + "In BigQuery, a [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigQuery DataFrames maps BigQuery struct types to the pandas equivalent, `pandas.ArrowDtype(pa.struct())`. This section provides practical code examples illustrating how to use struct columns with BigQuery DataFrames." ] }, { @@ -391,7 +389,7 @@ "source": [ "## Create DataFrames with struct columns \n", "\n", - "Let's create a sample BigFrames DataFrame where the `Address` column holds struct data of type `struct[pyarrow]`:" + "Create a DataFrame with an `Address` struct column by using dictionaries for the data and setting the dtype to `struct[pyarrow]`." ] }, { @@ -403,7 +401,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "/usr/local/google/home/chelsealin/src/bigframes2/venv/lib/python3.12/site-packages/google/cloud/bigquery/_pandas_helpers.py:537: UserWarning: Pyarrow could not determine the type of columns: bigframes_unnamed_index.\n", + "/usr/local/google/home/chelsealin/src/bigframes/venv/lib/python3.12/site-packages/google/cloud/bigquery/_pandas_helpers.py:570: UserWarning: Pyarrow could not determine the type of columns: bigframes_unnamed_index.\n", " warnings.warn(\n" ] }, @@ -509,9 +507,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## CRUD operations for struct data\n", + "## Operate on struct data\n", "\n", - "Similar to Pandas, BigFrames provides a [`StructAccessor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor) to streamline the manipulation of struct data. Let's explore how you can utilize this feature for efficient CRUD operations on your nested struct columns." 
+ "Similar to pandas, BigQuery DataFrames provides a [`StructAccessor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor). Use the methods provided in this accessor to manipulate struct data." ] }, {
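
The array and struct operations this patch series adds (`bbq.array_length`, `explode`, `bbq.array_agg`, and the `StructAccessor`) can be sketched locally with plain pandas, which is handy for sanity-checking the notebook's expected outputs without a BigQuery session. This is a hedged approximation, not the BigQuery DataFrames implementation: object-dtype Python lists and dicts stand in for the Arrow-backed `list[pyarrow]` / `struct[pyarrow]` dtypes, and `groupby(...).size()`, `groupby(...).agg(list)`, and `apply` stand in for `bbq.array_length`, `bbq.array_agg`, and `Series.struct.field`.

```python
# Local pandas sketch of the notebook's array and struct operations.
# Assumption: plain lists/dicts approximate the Arrow-backed dtypes that
# BigQuery DataFrames uses; bigframes.bigquery functions are replaced with
# pandas equivalents for offline verification.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Scores": [[95, 88, 92], [78, 81], [82, 89, 94, 100]],
})

# bbq.array_length equivalent: element count per array.
scores = df["Scores"].explode()            # one row per element; index repeats
lengths = scores.groupby(level=0).size()   # -> [3, 2, 4]

# Adjust the exploded scores, mirroring the patch-3 cell (scores + 5.0).
adj = scores.astype("float64") + 5.0

# bbq.array_agg equivalent: collect rows back into per-group lists.
adj_arr = adj.groupby(level=0).agg(list)

# Assigning aligns on the shared index, analogous to the implicit join the
# notebook mentions when adding NewScores.
df["NewScores"] = adj_arr

# Struct sketch: dicts stand in for pa.struct fields.
address = pd.Series([
    {"City": "New York", "State": "NY"},
    {"City": "San Francisco", "State": "CA"},
    {"City": "Seattle", "State": "WA"},
])
city = address.apply(lambda d: d["City"])  # struct.field("City") equivalent
address_df = pd.DataFrame(list(address))   # struct.explode() equivalent

print(df)
print(address_df)
```

In the actual notebook the columns carry `pandas.ArrowDtype(pa.list_())` / `pandas.ArrowDtype(pa.struct())` dtypes and the work is pushed down to BigQuery, so the same operations scale past local memory; this local sketch only checks that the values shown in the patch-3 outputs (e.g. `[100. 93. 97.]` for Alice) are arithmetically consistent.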