diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb b/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb index f869cdc330b..3d695ece947 100644 --- a/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb +++ b/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb @@ -1,998 +1,1183 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "name": "retail_product_stockout_prediction.ipynb", - "version": "0.3.2", - "provenance": [], - "collapsed_sections": [] - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "cells": [ - { - "metadata": { - "id": "9V5sA5glWemD", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "Copyright 2018 Google LLC \n", - "\n", - "Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "you may not use this file except in compliance with the License.\n", - "You may obtain a copy of the License at\n", - "\n", - "http://www.apache.org/licenses/LICENSE-2.0\n", - "\n", - "Unless required by applicable law or agreed to in writing, software\n", - "distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "See the License for the specific language governing permissions and limitations under the License." - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "m26YhtBMvVWA" - }, - "cell_type": "markdown", - "source": [ - "# Retail Product Stockouts Prediction using AutoML Tables\n", - "\n", - "AutoML Tables enables you to build machine learning models based on tables of your own data and host them on Google Cloud for scalability. This solution demonstrates how you can use AutoML Tables to solve a product stockouts problem in the retail industry. 
This problem is solved using a binary classification approach, which predicts whether a particular product at a certain store will be out-of-stock or not in the next four weeks. Once the solution is built, you can plug this in with your production system and proactively predict stock-outs for your business. \n", - "\n", - "To use this Colab notebook, copy it to your own Google Drive and open it with [Colaboratory](https://colab.research.google.com/) (or Colab). To run a cell hold the Shift key and press the Enter key (or Return key). Colab automatically displays the return value of the last line in each cell. Refer to [this page](https://colab.research.google.com/notebooks/welcome.ipynb) for more information on Colab.\n", - "\n", - "You can run a Colab notebook on a hosted runtime in the Cloud. The hosted VM times out after 90 minutes of inactivity and you will lose all the data stored in the memory including your authentication data. If your session gets disconnected (for example, because you closed your laptop) for less than the 90 minute inactivity timeout limit, press 'RECONNECT' on the top right corner of your notebook and resume the session. After Colab timeout, you'll need to\n", - "\n", - "1. Re-run the initialization and authentication.\n", - "2. Continue from where you left off. You may need to copy-paste the value of some variables such as the `dataset_name` from the printed output of the previous cells.\n", - "\n", - "Alternatively you can connect your Colab notebook to a [local runtime](https://research.google.com/colaboratory/local-runtimes.html)." - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "b--5FDDwCG9C" - }, - "cell_type": "markdown", - "source": [ - "## 1. 
Project set up\n", - "\n", - "\n", - "\n" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "AZs0ICgy4jkQ" - }, - "cell_type": "markdown", - "source": [ - "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to\n", - "* Create a Google Cloud Platform (GCP) project.\n", - "* Enable billing.\n", - "* Apply to whitelist your project.\n", - "* Enable AutoML API.\n", - "* Enable AutoML Talbes API.\n", - "* Create a service account, grant required permissions, and download the service account private key.\n", - "\n", - "You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source\n", - "* Create a GCS bucket.\n", - "* Upload the training and batch prediction files.\n", - "\n", - "\n", - "**Warning:** Private keys must be kept secret. If you expose your private key it is recommended to revoke it immediately from the Google Cloud Console." - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "xZECt1oL429r" - }, - "cell_type": "markdown", - "source": [ - "\n", - "\n", - "---\n", - "\n" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "rstRPH9SyZj_" - }, - "cell_type": "markdown", - "source": [ - "## 2. Initialize and authenticate\n", - "This section runs intialization and authentication. It creates an authenticated session which is required for running any of the following sections." - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "BR0POq2UzE7e" - }, - "cell_type": "markdown", - "source": [ - "### Install the client library in Colab\n", - "Run the following cell to install the client libary using `pip`.\n", - "\n", - "See [documentations ](https://cloud.google.com/automl-tables/docs/client-libraries) of Google Cloud AutoML Client Library for Python. 
\n" - ] - }, - { - "metadata": { - "id": "43aXKjDRt_qZ", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Install AutoML Tables client library { vertical-output: true }\n", - "\n", - "!pip install google-cloud-automl" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "eVFsPPEociwF" - }, - "cell_type": "markdown", - "source": [ - "### Authenticate using service account key\n", - "Run the following cell. Click on the __Choose Files__ button and select the service account private key file. If your Service Account Key file or folder is hidden, you can reveal it in a Mac by pressing the __Command + Shift + .__ combo.\n", - "\n" - ] - }, - { - "metadata": { - "id": "u-kCqysAuaJk", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Authenticate using service account key and create a client. { vertical-output: true }\n", - "\n", - "from google.cloud import automl_v1beta1\n", - "from google.colab import files\n", - "\n", - "# Upload service account key\n", - "keyfile_upload = files.upload()\n", - "keyfile_name = list(keyfile_upload.keys())[0]\n", - "# Authenticate and create an AutoML client.\n", - "client = automl_v1beta1.AutoMlClient.from_service_account_file(keyfile_name)\n", - "# Authenticate and create a prediction service client.\n", - "prediction_client = automl_v1beta1.PredictionServiceClient.from_service_account_file(keyfile_name)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "s3F2xbEJdDvN" - }, - "cell_type": "markdown", - "source": [ - "### Test" - ] - }, - { - "metadata": { - "id": "0uX4aJYUiXh5", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "Enter your GCP project ID." 
- ] - }, - { - "metadata": { - "colab_type": "code", - "id": "6R4h5HF1Dtds", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title GCP project ID and location\n", - "\n", - "project_id = '' #@param {type:'string'}\n", - "location = 'us-central1'\n", - "location_path = client.location_path(project_id, location)\n", - "location_path" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "rUlBcZ3OfWcJ" - }, - "cell_type": "markdown", - "source": [ - "To test whether your project set up and authentication steps were successful, run the following cell to list your datasets in this project.\n", - "\n", - "If no dataset has previously imported into AutoML Tables, you shall expect an empty return." - ] - }, - { - "metadata": { - "cellView": "both", - "colab_type": "code", - "id": "sf32nKXIqYje", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title List datasets. { vertical-output: true }\n", - "\n", - "list_datasets_response = client.list_datasets(location_path)\n", - "datasets = {dataset.display_name: dataset.name for dataset in list_datasets_response}\n", - "datasets" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "t9uE8MvMkOPd" - }, - "cell_type": "markdown", - "source": [ - "You can also print the list of your models by running the following cell.\n", - "\n", - "If no model has previously trained using AutoML Tables, you shall expect an empty return." - ] - }, - { - "metadata": { - "cellView": "both", - "colab_type": "code", - "id": "j4-bYRSWj7xk", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title List models. 
{ vertical-output: true }\n", - "\n", - "list_models_response = client.list_models(location_path)\n", - "models = {model.display_name: model.name for model in list_models_response}\n", - "models" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "qozQWMnOu48y" - }, - "cell_type": "markdown", - "source": [ - "\n", - "\n", - "---\n", - "\n" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "ODt86YuVDZzm" - }, - "cell_type": "markdown", - "source": [ - "## 3. Import training data" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "XwjZc9Q62Fm5" - }, - "cell_type": "markdown", - "source": [ - "### Create dataset" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "_JfZFGSceyE_" - }, - "cell_type": "markdown", - "source": [ - "Select a dataset display name and pass your table source information to create a new dataset." - ] - }, - { - "metadata": { - "id": "Z_JErW3cw-0J", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Create dataset { vertical-output: true, output-height: 200 }\n", - "\n", - "dataset_display_name = 'stockout_data' #@param {type: 'string'}\n", - "\n", - "dataset_dict = {\n", - " 'display_name': dataset_display_name, \n", - " 'tables_dataset_metadata': {}\n", - "}\n", - "\n", - "create_dataset_response = client.create_dataset(\n", - " location_path,\n", - " dataset_dict\n", - ")\n", - "create_dataset_response" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "RLRgvqzUdxfL", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - " #@title Get dataset name { vertical-output: true }\n", - "\n", - "dataset_name = create_dataset_response.name\n", - "dataset_name" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "35YZ9dy34VqJ" - }, - "cell_type": "markdown", - "source": [ - "### Import data" - ] - }, - { - "metadata": { - 
"colab_type": "text", - "id": "3c0o15gVREAw" - }, - "cell_type": "markdown", - "source": [ - "You can import your data to AutoML Tables from GCS or BigQuery. For this solution, you will import data from a BigQuery Table. The URI for your table is in the format of `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", - "\n", - "The BigQuery Table used for demonstration purpose can be accessed as `bq://product-stockout.product_stockout.stockout`. \n", - "\n", - "See the table schema and dataset description from the README. " - ] - }, - { - "metadata": { - "id": "bB_GdeqCJW5i", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title ... if data source is BigQuery { vertical-output: true }\n", - "\n", - "dataset_bq_input_uri = 'bq://product-stockout.product_stockout.stockout' #@param {type: 'string'}\n", - "# Define input configuration.\n", - "input_config = {\n", - " 'bigquery_source': {\n", - " 'input_uri': dataset_bq_input_uri\n", - " }\n", - "}" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "FNVYfpoXJsNB", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - " #@title Import data { vertical-output: true }\n", - "\n", - "import_data_response = client.import_data(dataset_name, \n", - " input_config)\n", - "print('Dataset import operation: {}'.format(import_data_response.operation))" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "1O7tJ8IlefRC", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - " #@title Check if importing the data is complete { vertical-output: true }\n", - "\n", - "# If returns `False`, you can check back again later.\n", - "# Continue with the rest only if this cell returns a `True`.\n", - "import_data_response.done()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "_WLvyGIDe9ah", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "Importing this 
stockout datasets takes about 10 minutes. \n", - "\n", - "If you re-visit this Colab, uncomment the following cell and run the command to retrieve your dataset. Replace `YOUR_DATASET_NAME` with its actual value obtained in the preceding cells.\n", - "\n", - "`YOUR_DATASET_NAME` is a string in the format of `'projects//locations//datasets/'`." - ] - }, - { - "metadata": { - "id": "P6NkRMyJfAGm", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "# dataset_name = '' #@param {type: 'string'}\n", - "# dataset = client.get_dataset(dataset_name) " - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "QdxBI4s44ZRI", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "### Review the specs" - ] - }, - { - "metadata": { - "id": "RC0PWKqH4jwr", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "Run the following command to see table specs such as row count." - ] - }, - { - "metadata": { - "id": "v2Vzq_gwXxo-", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Table schema { vertical-output: true }\n", - "\n", - "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", - "import matplotlib.pyplot as plt\n", - "\n", - "# List table specs\n", - "list_table_specs_response = client.list_table_specs(dataset_name)\n", - "table_specs = [s for s in list_table_specs_response]\n", - "# List column specs\n", - "table_spec_name = table_specs[0].name\n", - "list_column_specs_response = client.list_column_specs(table_spec_name)\n", - "column_specs = {s.display_name: s for s in list_column_specs_response}\n", - "# Table schema pie chart.\n", - "type_counts = {}\n", - "for column_spec in column_specs.values():\n", - " type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n", - " type_counts[type_name] = type_counts.get(type_name, 0) + 1\n", - "\n", - "plt.pie(x=type_counts.values(), labels=type_counts.keys(), 
autopct='%1.1f%%')\n", - "plt.axis('equal')\n", - "plt.show()\n" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "Lqjq4X43v3ON", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "In the pie chart above, you see this dataset contains three variable types: `FLOAT64` (treated as `Numeric`), `CATEGORY` (treated as `Categorical`) and `STRING` (treated as `Text`). " - ] - }, - { - "metadata": { - "id": "FNykW_YOYt6d", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "___" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "kNRVJqVOL8h3" - }, - "cell_type": "markdown", - "source": [ - "## 4. Update dataset: assign a label column and enable nullable columns" - ] - }, - { - "metadata": { - "id": "VsOPwxN9fOIl", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "### Get column specs" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "-57gehId9PQ5" - }, - "cell_type": "markdown", - "source": [ - "AutoML Tables automatically detects your data column type. \n", - "\n", - "There are a total of 120 columns in this stockout dataset.\n", - "\n", - "Run the following command to check the column data type that automaticallyed detected. If columns contains only numerical values, but they represent categories, change that column data type to caregorical by updating your schema.\n", - "\n", - "In addition, AutoML Tables detects `Stockout` to be categorical that chooses to run a classification model. 
" - ] - }, - { - "metadata": { - "id": "Pyku3AHEfSp4", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title List table specs { vertical-output: true }\n", - "\n", - "list_table_specs_response = client.list_table_specs(dataset_name)\n", - "table_specs = [s for s in list_table_specs_response]\n", - "table_specs" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "jso_JBI9fgy6", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Check column data type { vertical-output: true }\n", - "\n", - "# Get column specs.\n", - "table_spec_name = table_specs[0].name\n", - "list_column_specs_response = client.list_column_specs(table_spec_name)\n", - "column_specs = {s.display_name: s for s in list_column_specs_response}\n", - "\n", - "# Print column data types.\n", - "for column in column_specs:\n", - " print(column, '-', column_specs[column].data_type)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "iRqdQ7Xiq04x", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "### Update columns: make categorical\n", - "\n", - "From the column data type, you noticed `Item_Number`, `Category`, `Vendor_Number`, `Store_Number`, `Zip_Code` and `County_Number` have been autodetected as `FLOAT64` (Numerical) instead of `CATEGORY` (Categorical). 
\n", - "\n", - "In this solution, the columns `Item_Number`, `Category`, `Vendor_Number` and `Store_Number` are not nullable, but `Zip_Code` and `County_Number` can take null values.\n", - "\n", - "To change the data type, you can update the schema by updating the column spec.\n", - "\n", - "`update_column_response = client.update_column_spec(update_column_spec_dict)`" - ] - }, - { - "metadata": { - "id": "gAPg_ymDf4kL", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "def create_update_column_sepc_dict(column_name, type_code, nullable):\n", - " \"\"\"\n", - " Create `update_column_spec_dict` with a given column name and target `type_code`.\n", - " Inputs:\n", - " column_name: string. Represents column name.\n", - " type_code: string. Represents variable type. See details: \\\n", - " https://cloud.google.com/automl-tables/docs/reference/rest/v1beta1/projects.locations.datasets.tableSpecs.columnSpecs#typecode\n", - " nullable: boolean. If true, this DataType can also be null.\n", - " Return:\n", - " update_column_spec_dict: dictionary. 
Encodes the target column specs.\n", - " \"\"\"\n", - " update_column_spec_dict = {\n", - " 'name': column_specs[column_name].name,\n", - " 'data_type': {\n", - " 'type_code': type_code,\n", - " 'nullable': nullable\n", - " }\n", - " }\n", - " return update_column_spec_dict" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "_xePITEYf5po", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "# Update dataset\n", - "categorical_column_names = ['Item_Number',\n", - " 'Category',\n", - " 'Vendor_Number',\n", - " 'Store_Number',\n", - " 'Zip_Code',\n", - " 'County_Number']\n", - "is_nullable = [False, \n", - " False,\n", - " False,\n", - " False,\n", - " True,\n", - " True]\n", - "\n", - "for i in range(len(categorical_column_names)):\n", - " column_name = categorical_column_names[i]\n", - " nullable = is_nullable[i]\n", - " update_column_spec_dict = create_update_column_sepc_dict(column_name, 'CATEGORY', nullable)\n", - " update_column_response = client.update_column_spec(update_column_spec_dict)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "colab_type": "text", - "id": "nDMH_chybe4w" - }, - "cell_type": "markdown", - "source": [ - "### Update dataset: assign a label\n", - "\n", - "Select the label column and update the dataset." 
- ] - }, - { - "metadata": { - "id": "hVIruWg0u33t", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Update dataset { vertical-output: true }\n", - "\n", - "label_column_name = 'Stockout' #@param {type: 'string'}\n", - "label_column_spec = column_specs[label_column_name]\n", - "label_column_id = label_column_spec.name.rsplit('/', 1)[-1]\n", - "print('Label column ID: {}'.format(label_column_id))\n", - "# Define the values of the fields to be updated.\n", - "update_dataset_dict = {\n", - " 'name': dataset_name,\n", - " 'tables_dataset_metadata': {\n", - " 'target_column_spec_id': label_column_id\n", - " }\n", - "}\n", - "\n", - "update_dataset_response = client.update_dataset(update_dataset_dict)\n", - "update_dataset_response" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "z23NITLrcxmi", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "___" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "FcKgvj1-Tbgj" - }, - "cell_type": "markdown", - "source": [ - "## 5. Creating a model" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "Pnlk8vdQlO_k" - }, - "cell_type": "markdown", - "source": [ - "### Train a model\n", - "Training the model may take one hour or more. To obtain the results with less training time or budget, you can set [`train_budget_milli_node_hours`](https://cloud.google.com/automl-tables/docs/reference/rest/v1beta1/projects.locations.models), which is the train budget of creating this model, expressed in milli node hours i.e. 1,000 value in this field means 1 node hour. \n", - "\n", - "For demonstration purpose, the following command sets the budget as 1 node hour. You can increate that number up to a maximum of 72 hours ('train_budget_milli_node_hours': 72000) for the best model performance. \n", - "\n", - "You can also select the objective to optimize your model training by setting `optimization_objective`. 
This solution optimizes the model by maximizing the Area Under the Precision-Recall (PR) Curve. \n" - ] - }, - { - "metadata": { - "id": "11izNd6Fu37N", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Create model { vertical-output: true }\n", - "\n", - "feature_list = list(column_specs.keys())\n", - "feature_list.remove('Stockout')\n", - "\n", - "model_display_name = 'stockout_model' #@param {type:'string'}\n", - "dataset_id = dataset_name.rsplit('/', 1)[-1]\n", - "\n", - "model_dict = {\n", - " 'display_name': model_display_name,\n", - " 'dataset_id': dataset_id, \n", - " 'tables_model_metadata': {\n", - " 'target_column_spec': column_specs['Stockout'],\n", - " 'input_feature_column_specs': [column_specs[f] for f in feature_list],\n", - " 'optimization_objective': 'MAXIMIZE_AU_PRC',\n", - " 'train_budget_milli_node_hours': 1000\n", - " }, \n", - "}\n", - "\n", - "create_model_response = client.create_model(location_path, model_dict)\n", - "print('Dataset import operation: {}'.format(create_model_response.operation))" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "wCQdx9VyhKY5", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Check if model training is complete { vertical-output: true }\n", - "# If returns `False`, you can check back again later.\n", - "# Continue with the rest only if this cell returns a `True`.\n", - "create_model_response.done()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "bPiR8zMwhQYO", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Retrieve the model name { vertical-output: true }\n", - "create_model_result = create_model_response.result()\n", - "model_name = create_model_result.name\n", - "model_name" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "neYjToB36q9E", - "colab_type": "text" - }, - "cell_type": "markdown", 
- "source": [ - "If your Colab times out, use `client.list_models(location_path)` to check whether your model has been created. \n", - "\n", - "Then uncomment the following cell and run the command to retrieve your model. Replace `YOUR_MODEL_NAME` with its actual value obtained in the preceding cell.\n", - "\n", - "`YOUR_MODEL_NAME` is a string in the format of `'projects//locations//models/'`" - ] - }, - { - "metadata": { - "id": "QptCwUIK7yhU", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "# model_name = '' #@param {type: 'string'}\n", - "# model = client.get_model(model_name)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "1wS1is9IY5nK", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "___" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "TarOq84-GXch" - }, - "cell_type": "markdown", - "source": [ - "## 6. Batch prediction" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "Soy5OB8Wbp_R" - }, - "cell_type": "markdown", - "source": [ - "### Initialize prediction" - ] - }, - { - "metadata": { - "colab_type": "text", - "id": "39bIGjIlau5a" - }, - "cell_type": "markdown", - "source": [ - "Your data source for batch prediction can be GCS or BigQuery. For this solution, you will use a BigQuery Table as the input source. The URI for your table is in the format of `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", - "\n", - "To write out the predictions, you need to specify a GCS bucket `gs://BUCKET_NAME`.\n", - "\n", - "The AutoML Tables logs the errors in the `errors.csv` file.\n", - "\n", - "**NOTE:** The batch prediction output file(s) will be updated to the GCS bucket that you set in the preceding cells." 
- ] - }, - { - "metadata": { - "id": "gkF3bH0qu4DU", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Start batch prediction { vertical-output: true, output-height: 200 }\n", - "\n", - "batch_predict_bq_input_uri = 'bq://product-stockout.product_stockout.batch_prediction_inputs'\n", - "batch_predict_gcs_output_uri_prefix = 'gs://' #@param {type:'string'}\n", - "\n", - "# Define input source.\n", - "batch_prediction_input_source = {\n", - " 'bigquery_source': {\n", - " 'input_uri': batch_predict_bq_input_uri\n", - " }\n", - "}\n", - "# Define output target.\n", - "batch_prediction_output_target = {\n", - " 'gcs_destination': {\n", - " 'output_uri_prefix': batch_predict_gcs_output_uri_prefix\n", - " }\n", - "}\n", - "batch_predict_response = prediction_client.batch_predict(model_name, \n", - " batch_prediction_input_source, \n", - " batch_prediction_output_target)\n", - "print('Batch prediction operation: {}'.format(batch_predict_response.operation))" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "AVJhh_k0PfxD", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Check if batch prediction is complete { vertical-output: true }\n", - "\n", - "# If returns `False`, you can check back again later.\n", - "# Continue with the rest only if this cell returns a `True`.\n", - "batch_predict_response.done()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "8nr5q2M8W2VX", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Retrieve batch prediction metadata { vertical-output: true }\n", - "\n", - "batch_predict_response.metadata" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "kgwbJwS2iLpc", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title Check prediction results { vertical-output: true }\n", - "\n", - "gcs_output_directory = 
batch_predict_response.metadata.batch_predict_details.output_info.gcs_output_directory\n", - "result_file = gcs_output_directory + '/result.csv'\n", - "print('Batch prediction results are stored as: {}'.format(result_file))" - ], - "execution_count": 0, - "outputs": [] - } - ] + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Copyright 2019 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Retail Product Stockouts Prediction using AutoML Tables\n", + "\n", + "\n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \"Google Read on cloud.google.com\n", + " \n", + " \n", + " \n", + " \"Colab Run in Colab\n", + " \n", + " \n", + " \n", + " \"GitHub\n", + " View on GitHub\n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "m26YhtBMvVWA" + }, + "source": [ + "# Overview\n", + "\n", + "AutoML Tables enables you to build machine learning models based on tables of your own data and host them on Google Cloud for scalability. This Notebook demonstrates how you can use AutoML Tables to solve a product stockouts problem in the retail industry. This problem is solved using a binary classification approach, which predicts whether a particular product at a certain store will be out-of-stock or not in the next four weeks. Once the solution is built, you can plug this in with your production system and proactively predict stock-outs for your business." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "m26YhtBMvVWA" + }, + "source": [ + "## Objective\n", + "\n", + "### Problem statement\n", + "\n", + "A stockout, or out-of-stock (OOS) event is an event that causes inventory to be exhausted. While out-of-stocks can occur along the entire supply chain, the most visible kind are retail out-of-stocks in the fast-moving consumer goods industry (e.g., sweets, diapers, fruits). Stockouts are the opposite of overstocks, where too much inventory is retained.\n", + "\n", + "### Impact\n", + "\n", + "According to a study by researchers Thomas Gruen and Daniel Corsten, the global average level of out-of-stocks within retail fast-moving consumer goods sector across developed economies was 8.3% in 2002. This means that shoppers would have a 42% chance of fulfilling a ten-item shopping list without encountering a stockout. 
Despite the initiatives designed to improve the collaboration of retailers and their suppliers, such as Efficient Consumer Response (ECR), and despite the increasing use of new technologies such as radio-frequency identification (RFID) and point-of-sale data analytics, this situation has improved little over the past decades.\n", + "\n", + "The biggest impacts are:\n", + "1. Customer dissatisfaction\n", + "2. Loss of revenue\n", + "\n", + "### Machine Learning Solution\n", + "\n", + "Using machine learning to predict stock-outs can help store operations prevent them proactively.\n", + "\n", + "There are three big challenges any retailer would face when trying to solve this problem with machine learning:\n", + "\n", + "1. Data silos: Sales data, supply-chain data, inventory data, etc. may all be in silos. Such disjoint datasets can be a challenge to work with, because a machine learning model needs to derive insights from all these data points together. \n", + "2. Missing Features: Features such as vendor location, weather conditions, etc. could give a machine learning algorithm a lot of additional signal to learn from. But such features are not always available, so when building machine learning solutions we treat feature collection as an iterative approach to improving the model.\n", + "3. Imbalanced dataset: Datasets for classification problems such as retail stock-out are typically very imbalanced, with far fewer stock-out cases than non-stock-out cases. Designing machine learning solutions by hand for such problems would be a time-consuming effort when your team should be focusing on collecting features.\n", + "\n", + "Hence, we recommend using AutoML Tables. With AutoML Tables you only need to work on acquiring the data and features, and AutoML Tables does the rest. This is a one-click deployment for solving the stock-out problem with machine learning." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "m26YhtBMvVWA" + }, + "source": [ + "## Dataset\n", + "\n", + "In this solution, you will use two datasets: Training/Evaluation data and Batch Prediction inputs. To access the datasets in BigQuery, you need the following information. \n", + "\n", + "Training/Evaluation dataset: \n", + "\n", + "`Project ID: product-stockout`\n", + "\n", + "`Dataset ID: product_stockout`\n", + "\n", + "`Table ID: stockout`\n", + "\n", + "Batch Prediction inputs: \n", + "\n", + "`Project ID: product-stockout`\n", + "\n", + "`Dataset ID: product_stockout`\n", + "\n", + "`Table ID: batch_prediction_inputs`\n", + "\n", + "### Data Schema\n", + "\n", + "|Field name \t|Datatype \t|Type \t|Description \t|\n", + "|---\t|---\t|---\t|---\t|\n", + "|Item_Number |STRING |Identifier |This is the product/ item identifier |\n", + "|Category |STRING |Identifier\t|Several items could belong to one category |\n", + "|Vendor_Number\t|STRING\t|Identifier\t|Product vendor identifier |\n", + "|Store_Number\t|STRING\t|Identifier\t|Store identifier |\n", + "|Item_Description\t|STRING\t|Text Features\t|Item Description |\n", + "|Category_Name\t|STRING\t|Text Features\t|Category Name |\n", + "|Vendor_Name\t|STRING\t|Text Features\t|Vendor Name |\n", + "|Store_Name\t|STRING\t|Text Features\t|Store Name |\n", + "|Address\t|STRING\t|Text Features\t|Address |\n", + "|City\t|STRING\t|Categorical Features\t|City |\n", + "|Zip_Code\t|STRING\t|Categorical Features\t|Zip-code |\n", + "|Store_Location\t|STRING\t|Categorical Features\t|Store Location |\n", + "|County_Number\t|STRING\t|Categorical Features\t|County Number |\n", + "|County\t|STRING\t|Categorical Features\t|County Name |\n", + "|Weekly Sales Quantity |INTEGER\t|Time series data\t|52 columns for weekly sales quantity from week 1 to week 52 |\n", + "|Weekly Sales Dollars\t|INTEGER\t|Time series data\t|52 columns for weekly sales dollars from week 1 to week 52 |\n", + 
"|Inventory\t|FLOAT\t|Numeric Feature\t|This inventory is stocked by the retailer looking at past sales and seasonality of the product to meet demand for future sales. |\n", + "|Stockout\t|INTEGER\t|Label\t|(1 - Stock-out, 0 - No stock-out) When the demand for four weeks future sales is not met by the inventory in stock we say we see a stock-out. This is because an early warning sign would help the retailer re-stock inventory with a lead time for the stock to be replenished. |\n", + "\n", + "\n", + "To use AutoML Tables with BigQuery you do not need to download this dataset. However, if you would like to use AutoML Tables with GCS you may want to download this dataset and upload it into your GCP Project storage bucket. \n", + "\n", + "Instructions to download dataset: \n", + "\n", + "Sample Dataset: Download this dataset which contains sales data.\n", + "\n", + "1. [Link to training data](https://console.cloud.google.com/bigquery?folder=&organizationId=&project=product-stockout&p=product-stockout&d=product_stockout&t=stockout&page=table): \n", + "\n", + "Dataset URI: \n", + "\n", + "2. [Link to data for batch predictions](https://console.cloud.google.com/bigquery?folder=&organizationId=&project=product-stockout&p=product-stockout&d=product_stockout&t=batch_prediction_inputs&page=table): \n", + "\n", + "Dataset URI: \n", + "\n", + "Upload this dataset to GCS or BigQuery (optional). \n", + "\n", + "You could select either [GCS](https://cloud.google.com/storage/) or [BigQuery](https://cloud.google.com/bigquery/) as the location of your choice to store the data for this challenge. \n", + "\n", + "1. Storing data on GCS: [Creating storage buckets, Uploading data to storage buckets](https://cloud.google.com/storage/docs/creating-buckets)\n", + "2. 
Storing data on BigQuery: [Create and load data to BigQuery](https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui) (optional)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "b--5FDDwCG9C" + }, + "source": [ + "## 1. Before you begin\n", + "\n", + "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to:\n", + "* Create a Google Cloud Platform (GCP) project and local development environment.\n", + "* Enable billing.\n", + "* Enable the AutoML API.\n", + "* Enter your project ID in the cell below. Then run the cell to make sure the\n", + "Cloud SDK uses the right project for all the commands in this notebook.\n", + "\n", + "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n", + "all the requirements to run this notebook. You can skip the environment setup step in the AutoML Tables documentation.\n", + "\n", + "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "PROJECT_ID = \"\" # @param {type:\"string\"}\n", + "COMPUTE_REGION = \"us-central1\" # Currently the only supported region.\n", + "! gcloud config set project $PROJECT_ID" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "xZECt1oL429r" + }, + "source": [ + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This section runs initialization and authentication. It creates an authenticated session, which is required for running any of the following sections." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "rstRPH9SyZj_" + }, + "source": [ + "### Authenticate your GCP account\n", + "\n", + "**If you are using AI Platform Notebooks**, your environment is already\n", + "authenticated. Skip this step." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**If you are using Colab**, run the cell below and follow the instructions\n", + "when prompted to authenticate your account via oAuth.\n", + "\n", + "**Otherwise**, follow these steps:\n", + "\n", + "1. In the GCP Console, go to the [**Create service account key**\n", + " page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n", + "\n", + "2. From the **Service account** drop-down list, select **New service account**.\n", + "\n", + "3. In the **Service account name** field, enter a name.\n", + "\n", + "4. From the **Role** drop-down list, select\n", + " **AutoML > AutoML Admin** and\n", + " **Storage > Storage Object Admin**.\n", + "\n", + "5. Click *Create*. A JSON file that contains your key downloads to your\n", + "local environment.\n", + "\n", + "6. Enter the path to your service account key as the\n", + "`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "# If you are running this notebook in Colab, run this cell and follow the\n", + "# instructions to authenticate your GCP account. 
This provides access to your\n", + "# Cloud Storage bucket and lets you submit training jobs and prediction\n", + "# requests.\n", + "\n", + "if 'google.colab' in sys.modules: \n", + " from google.colab import files\n", + " keyfile_upload = files.upload()\n", + " keyfile = list(keyfile_upload.keys())[0]\n", + " %env GOOGLE_APPLICATION_CREDENTIALS $keyfile\n", + "# If you are running this notebook locally, replace the string below with the\n", + "# path to your service account key and run this cell to authenticate your GCP\n", + "# account.\n", + "else:\n", + " %env GOOGLE_APPLICATION_CREDENTIALS /path/to/service_account.json" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "BR0POq2UzE7e" + }, + "source": [ + "### Install the client library\n", + "Run the following cell to install the client library using `pip`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install --quiet google-cloud-automl" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Jupyter Notebooks, restart the kernel so that `automl_v1beta1` can be imported." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.core.display import HTML\n", + "HTML(\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Import libraries and define constants\n", + "\n", + "First, import the Python libraries required for training.\n", + "The code example below demonstrates importing the AutoML Python API module into a Python script. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# AutoML library\n", + "from google.cloud import automl\n", + "\n", + "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", + "import matplotlib.pyplot as plt\n", + "\n", + "client = automl.AutoMlClient()\n", + "prediction_client = automl.PredictionServiceClient()\n", + "\n", + "# Get the GCP location of your project.\n", + "location_path = client.location_path(PROJECT_ID, COMPUTE_REGION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "s3F2xbEJdDvN" + }, + "source": [ + "### Test the set up" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "rUlBcZ3OfWcJ" + }, + "source": [ + "To test whether your project set up and authentication steps were successful, run the following cell to list your datasets in this project.\n", + "\n", + "If no dataset has previously imported into AutoML Tables, you shall expect an empty return." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "both", + "colab": {}, + "colab_type": "code", + "id": "sf32nKXIqYje" + }, + "outputs": [], + "source": [ + "#@title List datasets. { vertical-output: true }\n", + "\n", + "list_datasets_response = client.list_datasets(location_path)\n", + "datasets = {dataset.display_name: dataset.name for dataset in list_datasets_response}\n", + "datasets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "t9uE8MvMkOPd" + }, + "source": [ + "You can also print the list of your models by running the following cell.\n", + "\n", + "If no model has previously trained using AutoML Tables, you shall expect an empty return." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "both", + "colab": {}, + "colab_type": "code", + "id": "j4-bYRSWj7xk" + }, + "outputs": [], + "source": [ + "#@title List models. { vertical-output: true }\n", + "\n", + "list_models_response = client.list_models(location_path)\n", + "models = {model.display_name: model.name for model in list_models_response}\n", + "models" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "qozQWMnOu48y" + }, + "source": [ + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ODt86YuVDZzm" + }, + "source": [ + "## 2. Import training data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "XwjZc9Q62Fm5" + }, + "source": [ + "### Create dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "_JfZFGSceyE_" + }, + "source": [ + "Select a dataset display name and pass your table source information to create a new dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Z_JErW3cw-0J" + }, + "outputs": [], + "source": [ + "#@title Create dataset { vertical-output: true, output-height: 200 }\n", + "\n", + "dataset_display_name = 'stockout_data' #@param {type: 'string'}\n", + "\n", + "dataset_dict = {\n", + " 'display_name': dataset_display_name, \n", + " 'tables_dataset_metadata': {}\n", + "}\n", + "\n", + "create_dataset_response = client.create_dataset(\n", + " location_path,\n", + " dataset_dict\n", + ")\n", + "create_dataset_response" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "RLRgvqzUdxfL" + }, + "outputs": [], + "source": [ + " #@title Get dataset name { vertical-output: true }\n", + "\n", + "dataset_name = create_dataset_response.name\n", + "dataset_name" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "35YZ9dy34VqJ" + }, + "source": [ + "### Import data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "3c0o15gVREAw" + }, + "source": [ + "You can import your data to AutoML Tables from GCS or BigQuery. For this solution, you will import data from a BigQuery Table. The URI for your table is in the format of `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", + "\n", + "The BigQuery Table used for demonstration purpose can be accessed as `bq://product-stockout.product_stockout.stockout`. \n", + "\n", + "See the table schema and dataset description from the README. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "bB_GdeqCJW5i" + }, + "outputs": [], + "source": [ + "#@title ... 
if data source is BigQuery { vertical-output: true }\n", + "\n", + "dataset_bq_input_uri = 'bq://product-stockout.product_stockout.stockout' #@param {type: 'string'}\n", + "# Define input configuration.\n", + "input_config = {\n", + " 'bigquery_source': {\n", + " 'input_uri': dataset_bq_input_uri\n", + " }\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "FNVYfpoXJsNB" + }, + "outputs": [], + "source": [ + "#@title Import data { vertical-output: true }\n", + "\n", + "import_data_response = client.import_data(dataset_name, \n", + " input_config)\n", + "print('Dataset import operation: {}'.format(import_data_response.operation))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "1O7tJ8IlefRC" + }, + "outputs": [], + "source": [ + "#@title Check if importing the data is complete { vertical-output: true }\n", + "\n", + "# If this returns `False`, check back again later.\n", + "# Continue with the rest only if this cell returns `True`.\n", + "import_data_response.done()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "_WLvyGIDe9ah" + }, + "source": [ + "Importing this stockout dataset takes about 10 minutes. \n", + "\n", + "If you revisit this notebook, uncomment the following cell and run the command to retrieve your dataset. Replace `YOUR_DATASET_NAME` with its actual value obtained in the preceding cells.\n", + "\n", + "`YOUR_DATASET_NAME` is a string in the format of `'projects//locations//datasets/'`." 
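+ ,
+ "\n",
+ "As a minimal sketch of that format, assuming the usual AutoML resource-name layout `projects/PROJECT_ID/locations/REGION/datasets/DATASET_ID`, a saved name could be checked before it is reused. The helper `split_dataset_name` and the sample IDs are hypothetical and not part of the client library:\n",
+ "\n",
+ "```python\n",
+ "def split_dataset_name(name):\n",
+ "    # Split 'projects/P/locations/R/datasets/D' into its three IDs.\n",
+ "    parts = name.split('/')\n",
+ "    if len(parts) != 6 or parts[0::2] != ['projects', 'locations', 'datasets']:\n",
+ "        raise ValueError('Unexpected dataset name: {}'.format(name))\n",
+ "    project_id, region, dataset_id = parts[1::2]\n",
+ "    return project_id, region, dataset_id\n",
+ "\n",
+ "# 'my-project' and 'TBL123' are placeholder values for illustration only.\n",
+ "split_dataset_name('projects/my-project/locations/us-central1/datasets/TBL123')\n",
+ "```"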
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "P6NkRMyJfAGm" + }, + "outputs": [], + "source": [ + "# dataset_name = '' #@param {type: 'string'}\n", + "# dataset = client.get_dataset(dataset_name) " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "QdxBI4s44ZRI" + }, + "source": [ + "### Review the specs" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "RC0PWKqH4jwr" + }, + "source": [ + "Run the following command to see table specs such as row count." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "v2Vzq_gwXxo-" + }, + "outputs": [], + "source": [ + "#@title Table schema { vertical-output: true }\n", + "\n", + "# List table specs\n", + "list_table_specs_response = client.list_table_specs(dataset_name)\n", + "table_specs = [s for s in list_table_specs_response]\n", + "# List column specs\n", + "table_spec_name = table_specs[0].name\n", + "list_column_specs_response = client.list_column_specs(table_spec_name)\n", + "column_specs = {s.display_name: s for s in list_column_specs_response}\n", + "# Table schema pie chart.\n", + "type_counts = {}\n", + "for column_spec in column_specs.values():\n", + " type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n", + " type_counts[type_name] = type_counts.get(type_name, 0) + 1\n", + "\n", + "plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')\n", + "plt.axis('equal')\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Lqjq4X43v3ON" + }, + "source": [ + "In the pie chart above, you see this dataset contains three variable types: `FLOAT64` (treated as `Numeric`), `CATEGORY` (treated as `Categorical`) and `STRING` (treated as `Text`). 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "FNykW_YOYt6d" + }, + "source": [ + "___" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "kNRVJqVOL8h3" + }, + "source": [ + "## 3. Update dataset: assign a label column and enable nullable columns" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "VsOPwxN9fOIl" + }, + "source": [ + "### Get column specs" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "-57gehId9PQ5" + }, + "source": [ + "AutoML Tables automatically detects your data column type. \n", + "\n", + "There are a total of 120 columns in this stockout dataset.\n", + "\n", + "Run the following command to check the column data type that automaticallyed detected. If columns contains only numerical values, but they represent categories, change that column data type to caregorical by updating your schema.\n", + "\n", + "In addition, AutoML Tables detects `Stockout` to be categorical that chooses to run a classification model. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Pyku3AHEfSp4" + }, + "outputs": [], + "source": [ + "#@title List table specs { vertical-output: true }\n", + "\n", + "list_table_specs_response = client.list_table_specs(dataset_name)\n", + "table_specs = [s for s in list_table_specs_response]\n", + "table_specs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "jso_JBI9fgy6" + }, + "outputs": [], + "source": [ + "#@title Check column data type { vertical-output: true }\n", + "\n", + "# Get column specs.\n", + "table_spec_name = table_specs[0].name\n", + "list_column_specs_response = client.list_column_specs(table_spec_name)\n", + "column_specs = {s.display_name: s for s in list_column_specs_response}\n", + "\n", + "# Print column data types.\n", + "for column in column_specs:\n", + " print(column, '-', column_specs[column].data_type)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "iRqdQ7Xiq04x" + }, + "source": [ + "### Update columns: make categorical\n", + "\n", + "From the column data type, you noticed `Item_Number`, `Category`, `Vendor_Number`, `Store_Number`, `Zip_Code` and `County_Number` have been autodetected as `FLOAT64` (Numerical) instead of `CATEGORY` (Categorical). 
\n", + "\n", + "In this solution, the columns `Item_Number`, `Category`, `Vendor_Number` and `Store_Number` are not nullable, but `Zip_Code` and `County_Number` can take null values.\n", + "\n", + "To change the data type, you can update the schema by updating the column spec.\n", + "\n", + "`update_column_response = client.update_column_spec(update_column_spec_dict)`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "gAPg_ymDf4kL" + }, + "outputs": [], + "source": [ + "def create_update_column_sepc_dict(column_name, type_code, nullable):\n", + " \"\"\"\n", + " Create `update_column_spec_dict` with a given column name and target `type_code`.\n", + " Inputs:\n", + " column_name: string. Represents column name.\n", + " type_code: string. Represents variable type. See details: \\\n", + " https://cloud.google.com/automl-tables/docs/reference/rest/v1beta1/projects.locations.datasets.tableSpecs.columnSpecs#typecode\n", + " nullable: boolean. If true, this DataType can also be null.\n", + " Return:\n", + " update_column_spec_dict: dictionary. 
Encodes the target column specs.\n", + " \"\"\"\n", + " update_column_spec_dict = {\n", + " 'name': column_specs[column_name].name,\n", + " 'data_type': {\n", + " 'type_code': type_code,\n", + " 'nullable': nullable\n", + " }\n", + " }\n", + " return update_column_spec_dict" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "_xePITEYf5po" + }, + "outputs": [], + "source": [ + "# Update dataset\n", + "categorical_column_names = ['Item_Number',\n", + " 'Category',\n", + " 'Vendor_Number',\n", + " 'Store_Number',\n", + " 'Zip_Code',\n", + " 'County_Number']\n", + "is_nullable = [False, \n", + " False,\n", + " False,\n", + " False,\n", + " True,\n", + " True]\n", + "\n", + "for i in range(len(categorical_column_names)):\n", + " column_name = categorical_column_names[i]\n", + " nullable = is_nullable[i]\n", + " update_column_spec_dict = create_update_column_sepc_dict(column_name, 'CATEGORY', nullable)\n", + " update_column_response = client.update_column_spec(update_column_spec_dict)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "nDMH_chybe4w" + }, + "source": [ + "### Update dataset: assign a label\n", + "\n", + "Select the label column and update the dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "hVIruWg0u33t" + }, + "outputs": [], + "source": [ + "#@title Update dataset { vertical-output: true }\n", + "\n", + "label_column_name = 'Stockout' #@param {type: 'string'}\n", + "label_column_spec = column_specs[label_column_name]\n", + "label_column_id = label_column_spec.name.rsplit('/', 1)[-1]\n", + "print('Label column ID: {}'.format(label_column_id))\n", + "# Define the values of the fields to be updated.\n", + "update_dataset_dict = {\n", + " 'name': dataset_name,\n", + " 'tables_dataset_metadata': {\n", + " 'target_column_spec_id': label_column_id\n", + " }\n", + "}\n", + "\n", + "update_dataset_response = client.update_dataset(update_dataset_dict)\n", + "update_dataset_response" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "z23NITLrcxmi" + }, + "source": [ + "___" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "FcKgvj1-Tbgj" + }, + "source": [ + "## 4. Creating a model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Pnlk8vdQlO_k" + }, + "source": [ + "### Train a model\n", + "Training the model may take one hour or more. To obtain results with less training time or budget, you can set [`train_budget_milli_node_hours`](https://cloud.google.com/automl-tables/docs/reference/rest/v1beta1/projects.locations.models), which is the training budget for creating this model, expressed in milli node hours (a value of 1,000 in this field means 1 node hour). \n", + "\n", + "For demonstration purposes, the following command sets the budget to 1 node hour. You can increase that number up to a maximum of 72 hours (`'train_budget_milli_node_hours': 72000`) for the best model performance. \n", + "\n", + "You can also select the objective to optimize your model training by setting `optimization_objective`. 
This solution optimizes the model by maximizing the Area Under the Precision-Recall (PR) Curve. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "11izNd6Fu37N" + }, + "outputs": [], + "source": [ + "#@title Create model { vertical-output: true }\n", + "\n", + "feature_list = list(column_specs.keys())\n", + "feature_list.remove('Stockout')\n", + "\n", + "model_display_name = 'stockout_model' #@param {type:'string'}\n", + "dataset_id = dataset_name.rsplit('/', 1)[-1]\n", + "\n", + "model_dict = {\n", + " 'display_name': model_display_name,\n", + " 'dataset_id': dataset_id, \n", + " 'tables_model_metadata': {\n", + " 'target_column_spec': column_specs['Stockout'],\n", + " 'input_feature_column_specs': [column_specs[f] for f in feature_list],\n", + " 'optimization_objective': 'MAXIMIZE_AU_PRC',\n", + " 'train_budget_milli_node_hours': 1000\n", + " }, \n", + "}\n", + "\n", + "create_model_response = client.create_model(location_path, model_dict)\n", + "print('Model creation operation: {}'.format(create_model_response.operation))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "wCQdx9VyhKY5" + }, + "outputs": [], + "source": [ + "#@title Check if model training is complete { vertical-output: true }\n", + "# If this returns `False`, check back again later.\n", + "# Continue with the rest only if this cell returns `True`.\n", + "create_model_response.done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "bPiR8zMwhQYO" + }, + "outputs": [], + "source": [ + "#@title Retrieve the model name { vertical-output: true }\n", + "create_model_result = create_model_response.result()\n", + "model_name = create_model_result.name\n", + "model_name" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": 
"neYjToB36q9E" + }, + "source": [ + "If your Colab times out, use `client.list_models(location_path)` to check whether your model has been created. \n", + "\n", + "Then uncomment the following cell and run the command to retrieve your model. Replace `YOUR_MODEL_NAME` with its actual value obtained in the preceding cell.\n", + "\n", + "`YOUR_MODEL_NAME` is a string in the format of `'projects//locations//models/'`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "QptCwUIK7yhU" + }, + "outputs": [], + "source": [ + "# model_name = '' #@param {type: 'string'}\n", + "# model = client.get_model(model_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "1wS1is9IY5nK" + }, + "source": [ + "___" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "TarOq84-GXch" + }, + "source": [ + "## 5. Batch prediction" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Soy5OB8Wbp_R" + }, + "source": [ + "### Initialize prediction" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "39bIGjIlau5a" + }, + "source": [ + "Your data source for batch prediction can be GCS or BigQuery. For this solution, you will use a BigQuery Table as the input source. The URI for your table is in the format of `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", + "\n", + "To write out the predictions, you need to specify a GCS bucket `gs://BUCKET_NAME`.\n", + "\n", + "The AutoML Tables logs the errors in the `errors.csv` file.\n", + "\n", + "**NOTE:** The batch prediction output file(s) will be updated to the GCS bucket that you set in the preceding cells." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "gkF3bH0qu4DU" + }, + "outputs": [], + "source": [ + "#@title Start batch prediction { vertical-output: true, output-height: 200 }\n", + "\n", + "batch_predict_bq_input_uri = 'bq://product-stockout.product_stockout.batch_prediction_inputs'\n", + "batch_predict_gcs_output_uri_prefix = 'gs://' #@param {type:'string'}\n", + "\n", + "# Define input source.\n", + "batch_prediction_input_source = {\n", + " 'bigquery_source': {\n", + " 'input_uri': batch_predict_bq_input_uri\n", + " }\n", + "}\n", + "# Define output target.\n", + "batch_prediction_output_target = {\n", + " 'gcs_destination': {\n", + " 'output_uri_prefix': batch_predict_gcs_output_uri_prefix\n", + " }\n", + "}\n", + "batch_predict_response = prediction_client.batch_predict(model_name, \n", + " batch_prediction_input_source, \n", + " batch_prediction_output_target)\n", + "print('Batch prediction operation: {}'.format(batch_predict_response.operation))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "AVJhh_k0PfxD" + }, + "outputs": [], + "source": [ + "#@title Check if batch prediction is complete { vertical-output: true }\n", + "\n", + "# If returns `False`, you can check back again later.\n", + "# Continue with the rest only if this cell returns a `True`.\n", + "batch_predict_response.done()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "8nr5q2M8W2VX" + }, + "outputs": [], + "source": [ + "#@title Retrieve batch prediction metadata { vertical-output: true }\n", + "\n", + "batch_predict_response.metadata" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "kgwbJwS2iLpc" + }, + "outputs": [], + "source": [ + "#@title Check prediction results { 
vertical-output: true }\n", + "\n", + "gcs_output_directory = batch_predict_response.metadata.batch_predict_details.output_info.gcs_output_directory\n", + "result_file = gcs_output_directory + 'tables_1.csv'\n", + "print('Batch prediction results are stored as: {}'.format(result_file))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Clean up\n", + "\n", + "To clean up all GCP resources used in this notebook, you can [delete the GCP\n", + "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects)." + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "retail_product_stockout_prediction.ipynb", + "provenance": [], + "version": "0.3.2" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 }