diff --git a/README.rst b/README.rst
index 73709641de..26bbbffa88 100644
--- a/README.rst
+++ b/README.rst
@@ -10,395 +10,19 @@ powered by the BigQuery engine.
 
 BigQuery DataFrames is an open-source package. You can run
 ``pip install --upgrade bigframes`` to install the latest version.
 
+
 Documentation
 -------------
 
 * `BigQuery DataFrames source code (GitHub) `_
 * `BigQuery DataFrames sample notebooks `_
 * `BigQuery DataFrames API reference `_
-* `BigQuery documentation `_
-
-
-Quickstart
-----------
-
-Prerequisites
-^^^^^^^^^^^^^
-
-* Install the ``bigframes`` package.
-* Create a Google Cloud project and billing account.
-* In an interactive environment (like a notebook, Python REPL, or command line),
-  ``bigframes`` does the authentication on the fly if needed. Otherwise, see
-  `how to set up application default credentials `_
-  for various environments. For example, to pre-authenticate on your laptop you can
-  `install and initialize the gcloud CLI `_,
-  and then generate the application default credentials by running
-  `gcloud auth application-default login `_.
-* At a minimum, the user must have the
-  `BigQuery Job User `_ and
-  `BigQuery Read Session User `_
-  roles. Additional IAM requirements apply for using remote functions and ML.
-
-Code sample
-^^^^^^^^^^^
-
-Import ``bigframes.pandas`` for a pandas-like interface. The ``read_gbq``
-method accepts either a fully-qualified table ID or a SQL query.
-
-.. code-block:: python
-
-    import bigframes.pandas as bpd
-
-    bpd.options.bigquery.project = your_gcp_project_id
-    df1 = bpd.read_gbq("project.dataset.table")
-    df2 = bpd.read_gbq("SELECT a, b, c, FROM `project.dataset.table`")
-
-* `More code samples `_
-
-
-Locations
----------
-BigQuery DataFrames uses a
-`BigQuery session `_
-internally to manage metadata on the service side. This session is tied to a
-`location `_.
-BigQuery DataFrames uses the US multi-region as the default location, but you
-can use ``session_options.location`` to set a different location. Every query
-in a session is executed in the location where the session was created.
-BigQuery DataFrames auto-populates ``bf.options.bigquery.location`` if the
-user starts with ``read_gbq/read_gbq_table/read_gbq_query()`` and specifies a
-table, either directly or in a SQL statement.
-
-If you want to reset the location of the created DataFrame or Series objects,
-you can close the session by executing ``bigframes.pandas.close_session()``.
-After that, you can reuse ``bigframes.pandas.options.bigquery.location`` to
-specify another location.
-
-
-``read_gbq()`` requires you to specify a location if the dataset you are
-querying is not in the US multi-region. If you try to read a table from another
-location, you get a ``NotFound`` exception.
-
-Project
--------
-If ``bf.options.bigquery.project`` is not set, the ``$GOOGLE_CLOUD_PROJECT``
-environment variable is used; this variable is set in the notebook runtime
-that serves BigQuery Studio and Vertex AI notebooks.
-
-ML Capabilities
----------------
-
-The ML capabilities in BigQuery DataFrames let you preprocess data and then
-train models on that data. You can also chain these actions together to
-create data pipelines.
-
-Preprocess data
-^^^^^^^^^^^^^^^
-
-Create transformers to prepare data for use in estimators (models) by using the
-`bigframes.ml.preprocessing module `_
-and the `bigframes.ml.compose module `_.
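As a brief, hedged illustration of the preprocessing workflow described above, the sketch below fits transformers from ``bigframes.ml.preprocessing`` and ``bigframes.ml.compose``; the table and column names are hypothetical placeholders.

.. code-block:: python

    import bigframes.pandas as bpd
    from bigframes.ml.compose import ColumnTransformer
    from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler

    # Hypothetical table and column names, used only for illustration.
    df = bpd.read_gbq("project.dataset.table")

    preprocessor = ColumnTransformer([
        # (name, transformer, columns to apply it to)
        ("scale", StandardScaler(), ["numeric_col_a", "numeric_col_b"]),
        ("encode", OneHotEncoder(), ["category_col"]),
    ])

    # Fit the transformers on the DataFrame, then produce transformed columns.
    preprocessor.fit(df)
    prepared = preprocessor.transform(df)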
-BigQuery DataFrames offers the following transformations:
-
-* Use the `KBinsDiscretizer class `_
-  in the ``bigframes.ml.preprocessing`` module to bin continuous data into intervals.
-* Use the `LabelEncoder class `_
-  in the ``bigframes.ml.preprocessing`` module to normalize the target labels as integer values.
-* Use the `MaxAbsScaler class `_
-  in the ``bigframes.ml.preprocessing`` module to scale each feature to the range ``[-1, 1]``
-  by its maximum absolute value.
-* Use the `MinMaxScaler class `_
-  in the ``bigframes.ml.preprocessing`` module to standardize features by scaling each
-  feature to the range ``[0, 1]``.
-* Use the `StandardScaler class `_
-  in the ``bigframes.ml.preprocessing`` module to standardize features by removing the
-  mean and scaling to unit variance.
-* Use the `OneHotEncoder class `_
-  in the ``bigframes.ml.preprocessing`` module to transform categorical values into numeric format.
-* Use the `ColumnTransformer class `_
-  in the ``bigframes.ml.compose`` module to apply transformers to DataFrame columns.
-
-
-Train models
-^^^^^^^^^^^^
-
-Create estimators to train models in BigQuery DataFrames.
-
-**Clustering models**
-
-Create estimators for clustering models by using the
-`bigframes.ml.cluster module `_.
-
-* Use the `KMeans class `_
-  to create K-means clustering models. Use these models for data segmentation;
-  for example, identifying customer segments. K-means is an unsupervised learning
-  technique, so model training doesn't require labels or split data for training
-  or evaluation.
-
-**Decomposition models**
-
-Create estimators for decomposition models by using the
-`bigframes.ml.decomposition module `_.
-
-* Use the `PCA class `_
-  to create principal component analysis (PCA) models. Use these models for
-  computing principal components and using them to perform a change of basis on
-  the data. This provides dimensionality reduction by projecting each data point
-  onto only the first few principal components to obtain lower-dimensional data
-  while preserving as much of the data's variation as possible.
-
-
-**Ensemble models**
-
-Create estimators for ensemble models by using the
-`bigframes.ml.ensemble module `_.
-
-* Use the `RandomForestClassifier class `_
-  to create random forest classifier models. Use these models to construct
-  ensembles of decision trees for classification.
-* Use the `RandomForestRegressor class `_
-  to create random forest regression models. Use these models to construct
-  ensembles of decision trees for regression.
-* Use the `XGBClassifier class `_
-  to create gradient boosted tree classifier models. Use these models to
-  additively construct ensembles of decision trees for classification.
-* Use the `XGBRegressor class `_
-  to create gradient boosted tree regression models. Use these models to
-  additively construct ensembles of decision trees for regression.
-
-
-**Forecasting models**
-
-Create estimators for forecasting models by using the
-`bigframes.ml.forecasting module `_.
-
-* Use the `ARIMAPlus class `_
-  to create time series forecasting models.
-
-**Imported models**
-
-Create estimators for imported models by using the
-`bigframes.ml.imported module `_.
-
-* Use the `ONNXModel class `_
-  to import Open Neural Network Exchange (ONNX) models.
-* Use the `TensorFlowModel class `_
-  to import TensorFlow models.
-* Use the `XGBoostModel class `_
-  to import XGBoost models.
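The estimators above follow a scikit-learn-style ``fit``/``predict`` workflow. As a minimal, hedged sketch (the table and feature column names are hypothetical), training and applying a K-means model from ``bigframes.ml.cluster`` might look like this:

.. code-block:: python

    import bigframes.pandas as bpd
    from bigframes.ml.cluster import KMeans

    # Hypothetical table and feature columns, used only for illustration.
    df = bpd.read_gbq("project.dataset.table")
    features = df[["feature_a", "feature_b"]]

    # K-means is unsupervised, so no labels or train/test split are needed.
    model = KMeans(n_clusters=4)
    model.fit(features)

    # Assign each row to one of the four clusters.
    segments = model.predict(features)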
-
-**Linear models**
-
-Create estimators for linear models by using the
-`bigframes.ml.linear_model module `_.
-
-* Use the `LinearRegression class `_
-  to create linear regression models. Use these models for forecasting; for example,
-  forecasting the sales of an item on a given day.
-* Use the `LogisticRegression class `_
-  to create logistic regression models. Use these models for the classification of
-  two or more possible values, such as whether an input is ``low-value``,
-  ``medium-value``, or ``high-value``.
-
-**Large language models**
-
-Create estimators for LLMs by using the
-`bigframes.ml.llm module `_.
-
-* Use the `GeminiTextGenerator class `_
-  to create Gemini text generator models. Use these models for text generation tasks.
-* Use the `PaLM2TextGenerator class `_
-  to create PaLM2 text generator models. Use these models for text generation tasks.
-* Use the `PaLM2TextEmbeddingGenerator class `_
-  to create PaLM2 text embedding generator models. Use these models for text
-  embedding generation tasks.
-
-
-Create pipelines
-^^^^^^^^^^^^^^^^
-
-Create ML pipelines by using the
-`bigframes.ml.pipeline module `_.
-Pipelines let you assemble several ML steps to be cross-validated together while
-setting different parameters. This simplifies your code and allows you to deploy
-data preprocessing steps and an estimator together.
-
-* Use the `Pipeline class `_
-  to create a pipeline of transforms with a final estimator.
-
-
-ML remote models
-----------------
-
-**Requirements**
-
-To use BigQuery DataFrames ML remote models (``bigframes.ml.remote`` or ``bigframes.ml.llm``),
-you must enable the following APIs:
-
-* The BigQuery API (bigquery.googleapis.com)
-* The BigQuery Connection API (bigqueryconnection.googleapis.com)
-* The Vertex AI API (aiplatform.googleapis.com)
-
-You must also be granted the following IAM roles in the project:
-
-* BigQuery Data Editor (roles/bigquery.dataEditor)
-* BigQuery Connection Admin (roles/bigquery.connectionAdmin)
-* Service Account User (roles/iam.serviceAccountUser)
-* Vertex AI User (roles/aiplatform.user)
-* Project IAM Admin (roles/resourcemanager.projectIamAdmin) if using the default
-  BigQuery connection, or Browser (roles/browser) if using a pre-configured connection.
-  This requirement can be avoided by setting the
-  ``bigframes.pandas.options.bigquery.skip_bq_connection_check`` option to ``True``,
-  in which case the connection (default or pre-configured) is used as-is without
-  any existence or permission check.
-
-
-ML locations
-------------
-
-``bigframes.ml`` supports the same locations as BigQuery ML. BigQuery ML model
-prediction and other ML functions are supported in all BigQuery regions. Support
-for model training varies by region. For more information, see
-`BigQuery ML locations `_.
-
-
-Data types
-----------
-
-BigQuery DataFrames supports the following NumPy and pandas dtypes:
-
-* ``numpy.dtype("O")``
-* ``pandas.BooleanDtype()``
-* ``pandas.Float64Dtype()``
-* ``pandas.Int64Dtype()``
-* ``pandas.StringDtype(storage="pyarrow")``
-* ``pandas.ArrowDtype(pa.date32())``
-* ``pandas.ArrowDtype(pa.time64("us"))``
-* ``pandas.ArrowDtype(pa.timestamp("us"))``
-* ``pandas.ArrowDtype(pa.timestamp("us", tz="UTC"))``
-
-BigQuery DataFrames doesn't support the following BigQuery data types:
-
-* ``ARRAY``
-* ``NUMERIC``
-* ``BIGNUMERIC``
-* ``INTERVAL``
-* ``STRUCT``
-* ``JSON``
-
-All other BigQuery data types display as the object type.
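As a small, hedged sketch of how the supported dtypes above appear in practice, the example below loads a local pandas DataFrame with ``read_pandas`` and inspects the resulting dtypes; the column names are arbitrary.

.. code-block:: python

    import pandas as pd
    import bigframes.pandas as bpd

    # A local pandas DataFrame built from supported nullable dtypes.
    local = pd.DataFrame(
        {
            "id": pd.array([1, 2, 3], dtype=pd.Int64Dtype()),
            "name": pd.array(["a", "b", "c"], dtype=pd.StringDtype(storage="pyarrow")),
            "score": pd.array([0.5, 0.7, 0.9], dtype=pd.Float64Dtype()),
        }
    )

    # Load it into a BigQuery DataFrames DataFrame and check the dtypes.
    df = bpd.read_pandas(local)
    print(df.dtypes)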
-
-
-Remote functions
-----------------
-
-BigQuery DataFrames gives you the ability to turn your custom scalar functions
-into `BigQuery remote functions `_. Creating a remote
-function in BigQuery DataFrames (see `code samples `_)
-creates:
-
-1. A `Cloud Functions (2nd gen) function `_.
-2. A `BigQuery connection `_.
-   If the BigQuery connection is created, the BigQuery service will create a
-   `Google Cloud-managed IAM service account `_
-   and attach it to the connection. You can use a pre-configured BigQuery
-   connection if you prefer, in which case the connection creation is skipped.
-3. A BigQuery remote function that talks to the cloud function (1) using the
-   BigQuery connection (2).
-
-BigQuery connections are created in the same location as the BigQuery
-DataFrames session, using the name you provide in the custom function
-definition. To view and manage connections, do the following:
-
-1. Go to `BigQuery in the Google Cloud Console `__.
-2. Select the project in which you created the remote function.
-3. In the Explorer pane, expand that project and then expand External connections.
-
-BigQuery remote functions are created in the dataset you specify, or
-in a special type of `hidden dataset `__
-referred to as an anonymous dataset. To view and manage remote functions created
-in a user-provided dataset, do the following:
-
-1. Go to `BigQuery in the Google Cloud Console `__.
-2. Select the project in which you created the remote function.
-3. In the Explorer pane, expand that project, expand the dataset in which you
-   created the remote function, and then expand Routines.
-
-To view and manage Cloud Functions functions, use the
-`Functions `_
-page and use the project picker to select the project in which you
-created the function. For easy identification, the names of the functions
-created by BigQuery DataFrames are prefixed by ``bigframes``.
-
-**Requirements**
-
-To use BigQuery DataFrames remote functions, you must enable the following APIs:
-
-* The BigQuery API (bigquery.googleapis.com)
-* The BigQuery Connection API (bigqueryconnection.googleapis.com)
-* The Cloud Functions API (cloudfunctions.googleapis.com)
-* The Cloud Run API (run.googleapis.com)
-* The Artifact Registry API (artifactregistry.googleapis.com)
-* The Cloud Build API (cloudbuild.googleapis.com)
-* The Cloud Resource Manager API (cloudresourcemanager.googleapis.com)
-
-To use BigQuery DataFrames remote functions, you must be granted the
-following IAM roles in the project:
-
-* BigQuery Data Editor (roles/bigquery.dataEditor)
-* BigQuery Connection Admin (roles/bigquery.connectionAdmin)
-* Cloud Functions Developer (roles/cloudfunctions.developer)
-* Service Account User (roles/iam.serviceAccountUser)
-* Storage Object Viewer (roles/storage.objectViewer)
-* Project IAM Admin (roles/resourcemanager.projectIamAdmin) if using the default
-  BigQuery connection, or Browser (roles/browser) if using a pre-configured connection.
-  This requirement can be avoided by setting the
-  ``bigframes.pandas.options.bigquery.skip_bq_connection_check`` option to ``True``,
-  in which case the connection (default or pre-configured) is used as-is without
-  any existence or permission check.
-
-**Limitations**
-
-* Remote functions take about 90 seconds to become available when you first create them.
-* Trivial changes in the notebook, such as inserting a new cell or renaming a variable,
-  might cause the remote function to be re-created, even if these changes are unrelated
-  to the remote function code.
-* BigQuery DataFrames does not differentiate between personal data and any other data
-  you include in the remote function code. The remote function code is serialized as an
-  opaque box and deployed as a Cloud Functions function.
-* The Cloud Functions (2nd gen) functions, BigQuery connections, and BigQuery remote
-  functions created by BigQuery DataFrames persist in Google Cloud. If you don't want to
-  keep these resources, you must delete them separately using the appropriate Cloud
-  Functions or BigQuery interface.
-* A project can have up to 1000 Cloud Functions (2nd gen) functions at a time. See Cloud
-  Functions quotas for all the limits.
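To tie the steps above together, here is a hedged sketch of turning a scalar Python function into a BigQuery remote function and applying it to a Series. The connection name, table, and column are hypothetical, and the decorator arguments shown reflect one version of the API and may differ in your ``bigframes`` release.

.. code-block:: python

    import bigframes.pandas as bpd

    # Hypothetical pre-configured connection name; input/output types are
    # passed positionally here, which may vary between bigframes versions.
    @bpd.remote_function([float], str, bigquery_connection="bigframes-rf-conn")
    def bucketize(value):
        # This body runs inside the deployed Cloud Functions function.
        return "high" if value >= 100.0 else "low"

    df = bpd.read_gbq("project.dataset.table")   # hypothetical table
    labels = df["metric_col"].apply(bucketize)   # hypothetical numeric column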
-
-
-Quotas and limits
------------------
-
-`BigQuery quotas `_ apply, including limits on
-hardware, software, and network components.
-
-
-Session termination
--------------------
-
-Each BigQuery DataFrames DataFrame or Series object is tied to a BigQuery
-DataFrames session, which is in turn based on a BigQuery session. BigQuery
-sessions `auto-terminate `_; when this happens, you can't use previously
-created DataFrame or Series objects and must re-create them using a new
-BigQuery DataFrames session. You can do this by running
-``bigframes.pandas.close_session()`` and then re-running the BigQuery
-DataFrames expressions.
-
-Data processing location
-------------------------
-BigQuery DataFrames is designed for scale, which it achieves by keeping data
-and processing on the BigQuery service. However, you can bring data into the
-memory of your client machine by calling ``.to_pandas()`` on a DataFrame or
-Series object. If you choose to do this, the memory limitation of your client
-machine applies.
+Getting started with BigQuery DataFrames
+----------------------------------------
+Try the `BigQuery DataFrames quickstart `_
+to get up and running in just a few minutes.
 
 
 License