diff --git a/notebooks/tutorials/cloud-ml-engine/Training and prediction with scikit-learn.ipynb b/notebooks/tutorials/cloud-ml-engine/Training and prediction with scikit-learn.ipynb
new file mode 100644
index 00000000000..f968cc911f6
--- /dev/null
+++ b/notebooks/tutorials/cloud-ml-engine/Training and prediction with scikit-learn.ipynb
@@ -0,0 +1,570 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Training and prediction with scikit-learn\n",
+    "\n",
+    "This notebook demonstrates how to use Cloud Machine Learning Engine to train a simple classification model using `scikit-learn`, and then deploy the model to get predictions.\n",
+    "\n",
+    "You train the model to predict a person's income level based on the [Census Income data set](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n",
+    "\n",
+    "Before you jump in, let’s cover some of the different tools you’ll be using:\n",
+    "\n",
+    "+ [Cloud Machine Learning Engine](https://cloud.google.com/ml-engine/) (Cloud ML Engine) is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.\n",
+    "\n",
+    "+ [Cloud Storage](https://cloud.google.com/storage/) is unified object storage for developers and enterprises, covering everything from live data serving to data analytics/ML to data archiving.\n",
+    "\n",
+    "+ [Cloud SDK](https://cloud.google.com/sdk/) is a command-line tool that lets you interact with Google Cloud products. This notebook introduces several `gcloud` and `gsutil` commands, which are part of the Cloud SDK. Note that shell commands in a notebook must be prepended with a `!`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Set up your environment\n",
+    "\n",
+    "### Enable the required APIs\n",
+    "\n",
+    "To use Cloud ML Engine, confirm that the required APIs are enabled:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!gcloud services enable ml.googleapis.com\n",
+    "!gcloud services enable compute.googleapis.com"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Create a storage bucket\n",
+    "\n",
+    "Buckets are the basic containers that hold your data. Everything that you store in Cloud Storage must be contained in a bucket. You can use buckets to organize your data and control access to it.\n",
+    "\n",
+    "Start by defining a globally unique bucket name.\n",
+    "\n",
+    "For more information about naming buckets, see [Bucket name requirements](https://cloud.google.com/storage/docs/naming#requirements)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "BUCKET_NAME = 'your-new-bucket'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the examples below, the `BUCKET_NAME` variable is referenced in the commands using `$`.\n",
+    "\n",
+    "Create the new bucket with the `gsutil mb` command:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!gsutil mb gs://$BUCKET_NAME/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### About the data\n",
+    "\n",
+    "The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample\n",
+    "uses for training is provided by the [UC Irvine Machine Learning\n",
+    "Repository](https://archive.ics.uci.edu/ml/datasets/).\n",
+    "\n",
+    "Census data courtesy of: Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://archive.ics.uci.edu/ml - and is provided \"AS IS\" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.\n",
+    "\n",
+    "The data used in this tutorial is located in a public Cloud Storage bucket:\n",
+    "\n",
+    "    gs://cloud-samples-data/ml-engine/sklearn/census_data/\n",
+    "\n",
+    "The training file is `adult.data` ([download](https://storage.googleapis.com/cloud-samples-data/ml-engine/sklearn/census_data/adult.data)) and the evaluation file is `adult.test` ([download](https://storage.googleapis.com/cloud-samples-data/ml-engine/sklearn/census_data/adult.test)). The evaluation file is not used in this tutorial."
+   ]
+  },
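+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optionally, preview a few rows of the training data before you package the training code. The following cell is a minimal sketch that assumes `pandas` is available in the notebook environment; the column names mirror the ones the training code defines later in this tutorial."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Column names for the census data files (the files contain no header row)\n",
+    "CENSUS_COLUMNS = [\n",
+    "    'age', 'workclass', 'fnlwgt', 'education', 'education-num',\n",
+    "    'marital-status', 'occupation', 'relationship', 'race', 'sex',\n",
+    "    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',\n",
+    "    'income-level'\n",
+    "]\n",
+    "\n",
+    "# Read the public training file over HTTPS and show the first few rows\n",
+    "preview = pd.read_csv(\n",
+    "    'https://storage.googleapis.com/cloud-samples-data/ml-engine/sklearn/census_data/adult.data',\n",
+    "    header=None, names=CENSUS_COLUMNS)\n",
+    "preview.head()"
+   ]
+  },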
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Create training application package\n",
+    "\n",
+    "The easiest (and recommended) way to create a training application package is to use `gcloud` to package and upload the application when you submit your training job. This method allows you to create a very simple file structure with only two files. For this tutorial, the file structure of your training application package should appear similar to the following:\n",
+    "\n",
+    "    census_training/\n",
+    "        __init__.py\n",
+    "        train.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create a directory locally:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!mkdir census_training"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create a blank file named `__init__.py`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!touch ./census_training/__init__.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Save your training code in a single Python file in the `census_training` directory. The following cell writes that training file, which performs these operations:\n",
+    "+ Loads the data into a pandas `DataFrame` that can be used by `scikit-learn`\n",
+    "+ Fits the model against the training data\n",
+    "+ Exports the model with the [Python `pickle` library](https://docs.python.org/3/library/pickle.html)\n",
+    "\n",
+    "The following model training code is not executed within this notebook. Instead, it is saved to a Python file and packaged as a Python module that runs on Cloud ML Engine after you submit the training job."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile ./census_training/train.py\n",
+    "import argparse\n",
+    "import pickle\n",
+    "import pandas as pd\n",
+    "\n",
+    "from google.cloud import storage\n",
+    "\n",
+    "from sklearn.ensemble import RandomForestClassifier\n",
+    "from sklearn.feature_selection import SelectKBest\n",
+    "from sklearn.pipeline import FeatureUnion\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "from sklearn.preprocessing import LabelBinarizer\n",
+    "\n",
+    "parser = argparse.ArgumentParser()\n",
+    "parser.add_argument(\"--bucket-name\", help=\"The bucket name\", required=True)\n",
+    "\n",
+    "arguments, unknown = parser.parse_known_args()\n",
+    "bucket_name = arguments.bucket_name\n",
+    "\n",
+    "# Define the format of your input data, including unused columns.\n",
+    "# These are the columns from the census data files.\n",
+    "COLUMNS = (\n",
+    "    'age',\n",
+    "    'workclass',\n",
+    "    'fnlwgt',\n",
+    "    'education',\n",
+    "    'education-num',\n",
+    "    'marital-status',\n",
+    "    'occupation',\n",
+    "    'relationship',\n",
+    "    'race',\n",
+    "    'sex',\n",
+    "    'capital-gain',\n",
+    "    'capital-loss',\n",
+    "    'hours-per-week',\n",
+    "    'native-country',\n",
+    "    'income-level'\n",
+    ")\n",
+    "\n",
+    "# Categorical columns are columns that need to be turned into a numerical\n",
+    "# value before they can be used by scikit-learn\n",
+    "CATEGORICAL_COLUMNS = (\n",
+    "    'workclass',\n",
+    "    'education',\n",
+    "    'marital-status',\n",
+    "    'occupation',\n",
+    "    'relationship',\n",
+    "    'race',\n",
+    "    'sex',\n",
+    "    'native-country'\n",
+    ")\n",
+    "\n",
+    "# Create a Cloud Storage client to download the census data\n",
+    "storage_client = storage.Client()\n",
+    "\n",
+    "# Download the data\n",
+    "public_bucket = storage_client.bucket('cloud-samples-data')\n",
+    "blob = public_bucket.blob('ml-engine/sklearn/census_data/adult.data')\n",
+    "blob.download_to_filename('adult.data')\n",
+    "\n",
+    "# Load the training census dataset\n",
+    "with open(\"./adult.data\", \"r\") as train_data:\n",
+    "    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)\n",
+    "    # Strip whitespace from the categorical features\n",
+    "    for col in CATEGORICAL_COLUMNS:\n",
+    "        raw_training_data[col] = raw_training_data[col].apply(lambda x: str(x).strip())\n",
+    "\n",
+    "# Remove the column we are trying to predict ('income-level') from our features\n",
+    "# list and convert the DataFrame to a list of lists\n",
+    "train_features = raw_training_data.drop(\"income-level\", axis=1).values.tolist()\n",
+    "# Create the training labels as a list of booleans (True when income > $50K).\n",
+    "# The comparison string keeps its leading space because 'income-level' is not\n",
+    "# in CATEGORICAL_COLUMNS and therefore was not stripped above.\n",
+    "train_labels = (raw_training_data[\"income-level\"] == \" >50K\").values.tolist()\n",
+    "\n",
+    "# Since the census data set has categorical features, we need to convert\n",
+    "# them to numerical values. We'll use a list of pipelines to convert each\n",
+    "# categorical column and then use FeatureUnion to combine them before calling\n",
+    "# the RandomForestClassifier.\n",
+    "categorical_pipelines = []\n",
+    "\n",
+    "# Each categorical column needs to be extracted individually and converted to a\n",
+    "# numerical value. To do this, each categorical column will use a pipeline that\n",
+    "# extracts one feature column via SelectKBest(k=1) and a LabelBinarizer() to\n",
+    "# convert the categorical value to a numerical one. A scores array (created\n",
+    "# below) will select and extract the feature column. The scores array is\n",
+    "# created by iterating over the columns and checking whether each one is\n",
+    "# a categorical column.\n",
+    "for i, col in enumerate(COLUMNS[:-1]):\n",
+    "    if col in CATEGORICAL_COLUMNS:\n",
+    "        # Create a scores array to get the individual categorical column.\n",
+    "        # Example:\n",
+    "        #  data = [\n",
+    "        #      39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married',\n",
+    "        #      'Adm-clerical', 'Not-in-family', 'White', 'Male', 2174, 0,\n",
+    "        #      40, 'United-States'\n",
+    "        #  ]\n",
+    "        #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n",
+    "        #\n",
+    "        #  Returns: [['State-gov']]\n",
+    "        # Build the scores array\n",
+    "        scores = [0] * len(COLUMNS[:-1])\n",
+    "        # This column is the categorical column we want to extract.\n",
+    "        scores[i] = 1\n",
+    "        skb = SelectKBest(k=1)\n",
+    "        skb.scores_ = scores\n",
+    "        # Convert the categorical column to a numerical value\n",
+    "        lbn = LabelBinarizer()\n",
+    "        r = skb.transform(train_features)\n",
+    "        lbn.fit(r)\n",
+    "        # Create the pipeline to extract the categorical feature\n",
+    "        categorical_pipelines.append(\n",
+    "            (\n",
+    "                'categorical-{}'.format(i),\n",
+    "                Pipeline([\n",
+    "                    ('SKB-{}'.format(i), skb),\n",
+    "                    ('LBN-{}'.format(i), lbn)])\n",
+    "            )\n",
+    "        )\n",
+    "\n",
+    "# Create a pipeline to extract the numerical features\n",
+    "skb = SelectKBest(k=6)\n",
+    "# From COLUMNS, select the features that are numerical\n",
+    "skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]\n",
+    "categorical_pipelines.append((\"numerical\", skb))\n",
+    "\n",
+    "# Combine all the features using FeatureUnion\n",
+    "preprocess = FeatureUnion(categorical_pipelines)\n",
+    "\n",
+    "# Create the classifier\n",
+    "classifier = RandomForestClassifier()\n",
+    "\n",
+    "# Transform the features and fit them to the classifier\n",
+    "classifier.fit(preprocess.transform(train_features), train_labels)\n",
+    "\n",
+    "# Create the overall model as a single pipeline\n",
+    "pipeline = Pipeline([(\"union\", preprocess), (\"classifier\", classifier)])\n",
+    "\n",
+    "# Create the model file\n",
+    "# The model file must be named \"model.pkl\" if you export it with pickle\n",
+    "model_filename = \"model.pkl\"\n",
+    "with open(model_filename, \"wb\") as model_file:\n",
+    "    pickle.dump(pipeline, model_file)\n",
+    "\n",
+    "# Upload the model to Cloud Storage\n",
+    "bucket = storage_client.bucket(bucket_name)\n",
+    "blob = bucket.blob(model_filename)\n",
+    "blob.upload_from_filename(model_filename)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Submit the training job\n",
+    "\n",
+    "In this section, you use [`gcloud ml-engine jobs submit training`](https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/submit/training) to submit your training job. The `--` argument passed to the command is a separator; anything after the separator is passed to the Python code as input arguments.\n",
+    "\n",
+    "For more information about the arguments preceding the separator, run the following:\n",
+    "\n",
+    "    gcloud ml-engine jobs submit training --help\n",
+    "\n",
+    "The argument passed to the Python script is `--bucket-name`, which specifies the name of the bucket where the model file is saved."
+   ]
+  },
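+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optionally, you can smoke-test the training module on your machine before submitting it to the cloud. The following cell is a sketch that uses `gcloud ml-engine local train`; it assumes your local Python environment has `pandas`, `scikit-learn`, and `google-cloud-storage` installed. Note that the training script uploads `model.pkl` to your bucket even when it runs locally."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional local smoke test of the training module\n",
+    "!gcloud ml-engine local train \\\n",
+    "    --package-path ./census_training \\\n",
+    "    --module-name census_training.train \\\n",
+    "    -- \\\n",
+    "    --bucket-name $BUCKET_NAME"
+   ]
+  },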
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "# Define a timestamped job name\n",
+    "JOB_NAME = \"census_training_{}\".format(int(time.time()))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# Submit the training job:\n",
+    "!gcloud ml-engine jobs submit training $JOB_NAME \\\n",
+    "    --job-dir gs://$BUCKET_NAME/census_job_dir \\\n",
+    "    --package-path ./census_training \\\n",
+    "    --module-name census_training.train \\\n",
+    "    --region us-central1 \\\n",
+    "    --runtime-version=1.12 \\\n",
+    "    --python-version=3.5 \\\n",
+    "    --scale-tier BASIC \\\n",
+    "    --stream-logs \\\n",
+    "    -- \\\n",
+    "    --bucket-name $BUCKET_NAME"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Verify model file in Cloud Storage\n",
+    "\n",
+    "View the contents of the destination model directory to verify that your model file has been uploaded to Cloud Storage.\n",
+    "\n",
+    "Note: The job can take a few minutes to train the model and upload it to Cloud Storage."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!gsutil ls gs://$BUCKET_NAME/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Serve the model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once the model is trained and uploaded, you can serve it. In Cloud ML Engine, a model resource can contain multiple versions; to serve your trained model, create a model and a version in Cloud ML Engine.\n",
+    "\n",
+    "Define the model and version names:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "MODEL_NAME = \"CensusPredictor\"\n",
+    "VERSION_NAME = \"census_predictor_{}\".format(int(time.time()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create the model in Cloud ML Engine:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!gcloud ml-engine models create $MODEL_NAME --regions us-central1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create a version that points to your model file in Cloud Storage:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!gcloud ml-engine versions create $VERSION_NAME \\\n",
+    "    --model=$MODEL_NAME \\\n",
+    "    --framework=scikit-learn \\\n",
+    "    --origin=gs://$BUCKET_NAME/ \\\n",
+    "    --python-version=3.5 \\\n",
+    "    --runtime-version=1.12"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Make predictions\n",
+    "\n",
+    "### Format data for prediction\n",
+    "\n",
+    "Before you send an online prediction request, you must format your test data for use by the Cloud ML Engine prediction service. Make sure that the format of your input instances matches what your model expects.\n",
+    "\n",
+    "Create an `input.json` file with each input instance on a separate line. The following example uses ten data instances. The Census model requires 14 features per instance, so your input must be a matrix of shape (`num_instances`, 14)."
+   ]
+  },
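+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The next two cells write `input.json` by hand with `%%writefile`. If your instances already live in Python lists or a pandas `DataFrame`, you can generate the file programmatically instead. The following cell is a minimal sketch of that approach; it writes two illustrative instances to a separate `sample_input.json` file so it does not interfere with the `input.json` used below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "# Two illustrative instances; each is a list of the 14 feature values\n",
+    "# in column order (no label column)\n",
+    "sample_instances = [\n",
+    "    [25, \"Private\", 226802, \"11th\", 7, \"Never-married\", \"Machine-op-inspct\",\n",
+    "     \"Own-child\", \"Black\", \"Male\", 0, 0, 40, \"United-States\"],\n",
+    "    [38, \"Private\", 89814, \"HS-grad\", 9, \"Married-civ-spouse\", \"Farming-fishing\",\n",
+    "     \"Husband\", \"White\", \"Male\", 0, 0, 50, \"United-States\"],\n",
+    "]\n",
+    "\n",
+    "# Write one JSON array per line, the format the prediction service expects\n",
+    "with open(\"./census_training/sample_input.json\", \"w\") as f:\n",
+    "    for instance in sample_instances:\n",
+    "        f.write(json.dumps(instance) + \"\\n\")"
+   ]
+  },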
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define a name for the input file\n",
+    "INPUT_FILE = \"./census_training/input.json\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "%%writefile $INPUT_FILE\n",
+    "[25, \"Private\", 226802, \"11th\", 7, \"Never-married\", \"Machine-op-inspct\", \"Own-child\", \"Black\", \"Male\", 0, 0, 40, \"United-States\"]\n",
+    "[38, \"Private\", 89814, \"HS-grad\", 9, \"Married-civ-spouse\", \"Farming-fishing\", \"Husband\", \"White\", \"Male\", 0, 0, 50, \"United-States\"]\n",
+    "[28, \"Local-gov\", 336951, \"Assoc-acdm\", 12, \"Married-civ-spouse\", \"Protective-serv\", \"Husband\", \"White\", \"Male\", 0, 0, 40, \"United-States\"]\n",
+    "[44, \"Private\", 160323, \"Some-college\", 10, \"Married-civ-spouse\", \"Machine-op-inspct\", \"Husband\", \"Black\", \"Male\", 7688, 0, 40, \"United-States\"]\n",
+    "[18, \"?\", 103497, \"Some-college\", 10, \"Never-married\", \"?\", \"Own-child\", \"White\", \"Female\", 0, 0, 30, \"United-States\"]\n",
+    "[34, \"Private\", 198693, \"10th\", 6, \"Never-married\", \"Other-service\", \"Not-in-family\", \"White\", \"Male\", 0, 0, 30, \"United-States\"]\n",
+    "[29, \"?\", 227026, \"HS-grad\", 9, \"Never-married\", \"?\", \"Unmarried\", \"Black\", \"Male\", 0, 0, 40, \"United-States\"]\n",
+    "[63, \"Self-emp-not-inc\", 104626, \"Prof-school\", 15, \"Married-civ-spouse\", \"Prof-specialty\", \"Husband\", \"White\", \"Male\", 3103, 0, 32, \"United-States\"]\n",
+    "[24, \"Private\", 369667, \"Some-college\", 10, \"Never-married\", \"Other-service\", \"Unmarried\", \"White\", \"Female\", 0, 0, 40, \"United-States\"]\n",
+    "[55, \"Private\", 104996, \"7th-8th\", 4, \"Married-civ-spouse\", \"Craft-repair\", \"Husband\", \"White\", \"Male\", 0, 0, 10, \"United-States\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Send the online prediction request\n",
+    "\n",
+    "Each prediction result is `True` if the person's income is predicted to be greater than $50,000 per year, and `False` otherwise. The output of the command below may appear similar to the following:\n",
+    "\n",
+    "    [False, False, False, True, False, False, False, False, False, False]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "!gcloud ml-engine predict \\\n",
+    "    --model $MODEL_NAME \\\n",
+    "    --version $VERSION_NAME \\\n",
+    "    --json-instances $INPUT_FILE"
+   ]
+  },
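+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can also request online predictions programmatically with the Google API Client Library for Python. The following cell is a minimal sketch of that route; it assumes the `google-api-python-client` package is installed and that you replace the hypothetical `PROJECT_ID` value with your own project ID."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "import googleapiclient.discovery\n",
+    "\n",
+    "PROJECT_ID = \"your-project-id\"  # Assumption: replace with your project ID\n",
+    "\n",
+    "# Read the instances back from the input file\n",
+    "with open(INPUT_FILE, \"r\") as f:\n",
+    "    instances = [json.loads(line) for line in f]\n",
+    "\n",
+    "# Build a client for the Cloud ML Engine v1 API and call projects.predict\n",
+    "service = googleapiclient.discovery.build(\"ml\", \"v1\")\n",
+    "name = \"projects/{}/models/{}/versions/{}\".format(\n",
+    "    PROJECT_ID, MODEL_NAME, VERSION_NAME)\n",
+    "\n",
+    "response = service.projects().predict(\n",
+    "    name=name, body={\"instances\": instances}).execute()\n",
+    "\n",
+    "if \"error\" in response:\n",
+    "    raise RuntimeError(response[\"error\"])\n",
+    "\n",
+    "print(response[\"predictions\"])"
+   ]
+  },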
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Clean up"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To delete all resources you created in this tutorial, run the following commands:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# Delete the model version\n",
+    "!gcloud ml-engine versions delete $VERSION_NAME --model=$MODEL_NAME --quiet\n",
+    "\n",
+    "# Delete the model\n",
+    "!gcloud ml-engine models delete $MODEL_NAME --quiet\n",
+    "\n",
+    "# Delete the bucket and contents\n",
+    "!gsutil rm -r gs://$BUCKET_NAME\n",
+    "\n",
+    "# Delete the local files created by the tutorial\n",
+    "!rm -rf census_training"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}