Starting AI/ML section in python docs [3rd attempt] #2702

Merged
merged 36 commits into from
Aug 18, 2020
36 commits
d7d9288
Add sklearn to docs requirements
Feb 8, 2020
612c0f6
Create kNN docs draft
Feb 8, 2020
6b3bbb1
Update based on Emma's suggestions
Feb 22, 2020
fbdd889
Add a header
Feb 22, 2020
b1d7fef
Placeholder Regression Section
Feb 23, 2020
eafaf28
Create 2 basic sections, 2 advanced sections
Feb 23, 2020
08aa89b
KNN ML docs: Update thumbnail, name, permalink, description, display_as
Feb 28, 2020
be71cfe
Added 3 sections, drafted out 2 sections
Feb 28, 2020
61b3ad8
ML Docs: Added 3 new sections to regression notebook
Mar 2, 2020
86e987b
ML Docs: Updated last ML regression section for clarity
Mar 2, 2020
1e4a008
ML Docs: Added annotations after each section of regression notebook
Mar 2, 2020
1de7a14
ML Docs: updated ml regression header
Mar 2, 2020
a28ee1f
ML Docs: Added new section to regression, updated references
Mar 3, 2020
0df5bcb
ML Docs: Added coefficient MLR example
Mar 6, 2020
8e4dad2
ML Docs: Start pca notebook
Mar 6, 2020
4b71430
ML Docs: Start ROC/PR section
Mar 6, 2020
ca24949
ML Docs: Remove 2 sections
Mar 13, 2020
99621b0
ML Docs, Regression: fix import, update titles, colors
Mar 13, 2020
0cde621
ML Docs: Update all kNN sections based on discussions
Mar 13, 2020
7447304
ML Docs: Update Regression notebook
Mar 14, 2020
7ab73cb
ML Docs: Updated PCA notebook
Mar 14, 2020
0e8b5d6
ML Docs: Update knn and regression based on Emma's reviews
Mar 17, 2020
3bb49a3
ML Docs: Update header description
Mar 17, 2020
895231f
ML Docs: Add t-SNE/UMAP notebook (read todo)
Mar 17, 2020
802d1ef
ML Docs: More explanations for the KNN section
Aug 12, 2020
5cda611
Rename Tsne tutorial
Aug 12, 2020
46d93de
Update kNN page
Aug 12, 2020
38ef59d
ML Docs: Update PCA page
Aug 12, 2020
a954d0d
ML Docs: Update regression page
Aug 12, 2020
2152601
ML Docs: Update ROC/PR Page
Aug 12, 2020
209dfea
ML Docs: Update T-sne and UMAP section
Aug 12, 2020
c419b09
Add umap to requirements
Aug 13, 2020
caebf49
fixups
nicolaskruchten Aug 18, 2020
f3507e4
longer timeout for umap
nicolaskruchten Aug 18, 2020
53de99c
longer timeout for umap
nicolaskruchten Aug 18, 2020
9b8ec17
Merge branch 'doc-prod' into updated-ml-docs
nicolaskruchten Aug 18, 2020
1 change: 1 addition & 0 deletions doc/Makefile
@@ -38,6 +38,7 @@ $(HTML_DIR)/2019-07-03-%.html: $(IPYNB_DIR)/%.ipynb
@mkdir -p $(FAIL_DIR)
@echo "[nbconvert] $<"
@jupyter nbconvert $< --to html --template nb.tpl \
--ExecutePreprocessor.timeout=600\
--output-dir $(HTML_DIR) --output 2019-07-03-$*.html \
--execute > $(FAIL_DIR)/$* 2>&1 && rm -f $(FAIL_DIR)/$*

324 changes: 324 additions & 0 deletions doc/python/ml-knn.md
@@ -0,0 +1,324 @@
---
jupyter:
  jupytext:
    notebook_metadata_filter: all
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.2'
      jupytext_version: 1.4.2
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
  language_info:
    codemirror_mode:
      name: ipython
      version: 3
    file_extension: .py
    mimetype: text/x-python
    name: python
    nbconvert_exporter: python
    pygments_lexer: ipython3
    version: 3.7.7
  plotly:
    description: Visualize scikit-learn's k-Nearest Neighbors (kNN) classification
      in Python with Plotly.
    display_as: ai_ml
    language: python
    layout: base
    name: kNN Classification
    order: 2
    page_type: u-guide
    permalink: python/knn-classification/
    thumbnail: thumbnail/knn-classification.png
---

## Basic binary classification with kNN

This section gets us started with displaying basic binary classification using 2D data. We first show how to display training versus testing data using [various marker styles](https://plot.ly/python/marker-style/), then demonstrate how to evaluate our classifier's performance on the **test split** using a continuous color gradient to indicate the model's predicted score.

We will use [Scikit-learn](https://scikit-learn.org/) for training our model and for loading and splitting data. Scikit-learn is a popular Machine Learning (ML) library that offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models. It was designed to be accessible, and to work seamlessly with popular libraries like NumPy and Pandas.

We will train a [k-Nearest Neighbors (kNN)](https://scikit-learn.org/stable/modules/neighbors.html) classifier. First, the model records the label of each training sample. Then, whenever we give it a new sample, it will look at the `k` closest samples from the training set to find the most common label, and assign it to our new sample.
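
To make the voting rule concrete, here is a minimal sketch of the idea in plain NumPy. It is not how scikit-learn implements kNN internally; `knn_predict` and the toy arrays are hypothetical names introduced only for illustration, and the sketch assumes Euclidean distance with a simple majority vote.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Hypothetical helper: label x_new by majority vote of its k nearest neighbors."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy data (made up for this sketch): two small, well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array(['0', '0', '0', '1', '1', '1'])

pred_a = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]))
pred_b = knn_predict(X_toy, y_toy, np.array([1.0, 0.95]))
print(pred_a, pred_b)
```

In the rest of this page we rely on `KNeighborsClassifier` instead, which also exposes class scores through `predict_proba`.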


### Display training and test splits

Using Scikit-learn, we first generate synthetic data shaped like two interleaving half-moons. We then split it into a training and a testing set. Finally, we display the ground truth labels using [a scatter plot](https://plotly.com/python/line-and-scatter/).

In the graph, we display all the negative labels as squares, and positive labels as circles. We differentiate the training and test sets by adding a dot to the center of each test marker.

In this example, we will use [graph objects](/python/graph-objects/), Plotly's low-level API for building figures.

```python
import plotly.graph_objects as go
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load and split data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0)

trace_specs = [
    [X_train, y_train, '0', 'Train', 'square'],
    [X_train, y_train, '1', 'Train', 'circle'],
    [X_test, y_test, '0', 'Test', 'square-dot'],
    [X_test, y_test, '1', 'Test', 'circle-dot']
]

fig = go.Figure(data=[
    go.Scatter(
        x=X[y==label, 0], y=X[y==label, 1],
        name=f'{split} Split, Label {label}',
        mode='markers', marker_symbol=marker
    )
    for X, y, label, split, marker in trace_specs
])
fig.update_traces(
    marker_size=12, marker_line_width=1.5,
    marker_color="lightyellow"
)
fig.show()
```

### Visualize predictions on test split with [`plotly.express`](https://plotly.com/python/plotly-express/)


Now, we train the kNN model on the same training data displayed in the previous graph. Then, we predict the confidence score of the model for each of the data points in the test set. We will use shapes to denote the true labels, and the color will indicate the score the model assigns to each point.

In this example, we will use [Plotly Express](/python/plotly-express/), Plotly's high-level API for building figures. Notice that `px.scatter` requires only one function call to plot both negative and positive labels, and can additionally set a continuous color scale based on the `y_score` output by our kNN model.

```python
import plotly.express as px
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load and split data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0)

# Fit the model on training data, predict on test data
clf = KNeighborsClassifier(15)
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]

fig = px.scatter(
    X_test, x=0, y=1,
    color=y_score, color_continuous_scale='RdBu',
    symbol=y_test, symbol_map={'0': 'square-dot', '1': 'circle-dot'},
    labels={'symbol': 'label', 'color': 'score of <br>first class'}
)
fig.update_traces(marker_size=12, marker_line_width=1.5)
fig.update_layout(legend_orientation='h')
fig.show()
```
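
The `y_score` values above come from `predict_proba`. As a rough illustration of what that score means, the sketch below uses a made-up 1D dataset (`X_toy`, `y_toy` and `clf_toy` are hypothetical names, not part of the tutorial). With the default `weights='uniform'`, the score for a class is the fraction of the `k` nearest neighbors that carry that label, and the columns follow the order of `classes_`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1D data (made up): three samples of class '0' and two of class '1'
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array(['0', '0', '0', '1', '1'])

clf_toy = KNeighborsClassifier(n_neighbors=3)  # default weights='uniform'
clf_toy.fit(X_toy, y_toy)

# Columns of predict_proba follow clf_toy.classes_, so column 1 is class '1'
print(clf_toy.classes_)

# The 3 nearest neighbors of 3.2 are 3.0 and 4.0 (class '1') and 2.0 (class '0')
score = clf_toy.predict_proba(np.array([[3.2]]))[:, 1]
print(score)
```

Here two of the three neighbors belong to class `'1'`, so the score for that class is two thirds.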

## Probability Estimates with `go.Contour`

Just like the previous example, we will first train our kNN model on the training set.

Instead of predicting the confidence for the test set, we can predict the confidence map for the entire area that wraps around the dimensions of our dataset. To do this, we use [`np.meshgrid`](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html) to create a grid, where the distance between neighboring points is set by the `mesh_size` variable.

Then, for each of those points, we will use our model to give a confidence score, and plot it with a [contour plot](https://plotly.com/python/contour-plots/).

In this example, we will use [graph objects](/python/graph-objects/), Plotly's low-level API for building figures.

```python
import plotly.graph_objects as go
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

mesh_size = .02
margin = 0.25

# Load and split data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0)

# Create a mesh grid on which we will run our model
x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)

# Create classifier, fit on the training split, run predictions on grid
clf = KNeighborsClassifier(15, weights='uniform')
clf.fit(X_train, y_train)
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

# Plot the figure
fig = go.Figure(data=[
    go.Contour(
        x=xrange,
        y=yrange,
        z=Z,
        colorscale='RdBu'
    )
])
fig.show()
```
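
To see what the grid construction in the example above produces, the following sketch runs the same `np.meshgrid` / `np.c_` / `reshape` steps on a tiny grid. The `_demo` names and the coarse spacing of 0.5 are made up for readability, and the summed coordinates are only a stand-in for real model scores.

```python
import numpy as np

# Hypothetical coarse grid; the tutorial uses a much finer mesh_size of 0.02
xrange_demo = np.arange(0.0, 1.5, 0.5)   # three x values
yrange_demo = np.arange(0.0, 1.0, 0.5)   # two y values
xx_demo, yy_demo = np.meshgrid(xrange_demo, yrange_demo)

# Each row of grid_points is one (x, y) coordinate the model would score
grid_points = np.c_[xx_demo.ravel(), yy_demo.ravel()]
print(xx_demo.shape)      # rows follow y, columns follow x
print(grid_points.shape)  # one row per grid point, two columns

# Stand-in for per-point model scores, reshaped back to the grid layout
scores = grid_points.sum(axis=1).reshape(xx_demo.shape)
print(scores)
```

Because the reshaped array has one row per `y` value and one column per `x` value, it can be passed directly as `z` alongside the 1D `x` and `y` ranges.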

Now, let's try to combine our `go.Contour` plot with the first scatter plot of our data points, so that we can visually compare the confidence of our model with the true labels.

```python
import plotly.graph_objects as go
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

mesh_size = .02
margin = 0.25

# Load and split data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0)

# Create a mesh grid on which we will run our model
x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)

# Create classifier, fit on the training split, run predictions on grid
clf = KNeighborsClassifier(15, weights='uniform')
clf.fit(X_train, y_train)
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

trace_specs = [
    [X_train, y_train, '0', 'Train', 'square'],
    [X_train, y_train, '1', 'Train', 'circle'],
    [X_test, y_test, '0', 'Test', 'square-dot'],
    [X_test, y_test, '1', 'Test', 'circle-dot']
]

fig = go.Figure(data=[
    go.Scatter(
        x=X[y==label, 0], y=X[y==label, 1],
        name=f'{split} Split, Label {label}',
        mode='markers', marker_symbol=marker
    )
    for X, y, label, split, marker in trace_specs
])
fig.update_traces(
    marker_size=12, marker_line_width=1.5,
    marker_color="lightyellow"
)

fig.add_trace(
    go.Contour(
        x=xrange,
        y=yrange,
        z=Z,
        showscale=False,
        colorscale='RdBu',
        opacity=0.4,
        name='Score',
        hoverinfo='skip'
    )
)
fig.show()
```

## Multi-class prediction confidence with [`go.Heatmap`](https://plotly.com/python/heatmaps/)

It is also possible to visualize the prediction confidence of the model using [heatmaps](https://plotly.com/python/heatmaps/). This example computes how confident the model is about its prediction at every point in the 2D grid. Here, we define the confidence at a given point as the highest class score minus the sum of the scores of the other classes.

In this example, we will use [Plotly Express](/python/plotly-express/), Plotly's high-level API for building figures, for the scatter plot of the test split, and add the heatmap with [graph objects](/python/graph-objects/).

```python
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

mesh_size = .02
margin = 1

# We will use the iris data, which is included in px
df = px.data.iris()
df_train, df_test = train_test_split(df, test_size=0.25, random_state=0)
X_train = df_train[['sepal_length', 'sepal_width']]
y_train = df_train.species_id

# Create a mesh grid on which we will run our model
l_min, l_max = df.sepal_length.min() - margin, df.sepal_length.max() + margin
w_min, w_max = df.sepal_width.min() - margin, df.sepal_width.max() + margin
lrange = np.arange(l_min, l_max, mesh_size)
wrange = np.arange(w_min, w_max, mesh_size)
ll, ww = np.meshgrid(lrange, wrange)

# Create classifier, run predictions on grid
clf = KNeighborsClassifier(15, weights='distance')
clf.fit(X_train, y_train)
Z = clf.predict(np.c_[ll.ravel(), ww.ravel()])
Z = Z.reshape(ll.shape)
proba = clf.predict_proba(np.c_[ll.ravel(), ww.ravel()])
proba = proba.reshape(ll.shape + (3,))

# Compute the confidence: the top class score minus the summed scores of the other classes
diff = proba.max(axis=-1) - (proba.sum(axis=-1) - proba.max(axis=-1))

fig = px.scatter(
    df_test, x='sepal_length', y='sepal_width',
    symbol='species',
    symbol_map={
        'setosa': 'square-dot',
        'versicolor': 'circle-dot',
        'virginica': 'diamond-dot'},
)
fig.update_traces(
    marker_size=12, marker_line_width=1.5,
    marker_color="lightyellow"
)
fig.add_trace(
    go.Heatmap(
        x=lrange,
        y=wrange,
        z=diff,
        opacity=0.25,
        customdata=proba,
        colorscale='RdBu',
        hovertemplate=(
            'sepal length: %{x} <br>'
            'sepal width: %{y} <br>'
            'p(setosa): %{customdata[0]:.3f}<br>'
            'p(versicolor): %{customdata[1]:.3f}<br>'
            'p(virginica): %{customdata[2]:.3f}<extra></extra>'
        )
    )
)
fig.update_layout(
    legend_orientation='h',
    title='Prediction Confidence on Test Split'
)
fig.show()
```
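
The confidence computation in the example above can be checked by hand on a few rows. The sketch below uses made-up probability rows (`proba_demo` is a hypothetical array, not model output) and applies the same expression. Since each row sums to 1, the result equals `2 * max - 1`, so it ranges from 1 (fully confident) down to negative values when the top class is outweighed by the others combined.

```python
import numpy as np

# Hypothetical class-probability rows for three classes (each row sums to 1)
proba_demo = np.array([
    [1.0, 0.0, 0.0],    # fully confident in one class
    [0.5, 0.25, 0.25],  # top class exactly balances the rest
    [0.4, 0.4, 0.2],    # top class is outweighed by the others
])

top = proba_demo.max(axis=-1)           # highest class score per row
rest = proba_demo.sum(axis=-1) - top    # summed score of the other classes
confidence = top - rest
print(confidence)  # approximately 1.0, 0.0 and -0.2
```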

### Reference

Learn more about `px`, `go.Contour`, and `go.Heatmap` here:
* https://plot.ly/python/plotly-express/
* https://plot.ly/python/heatmaps/
* https://plot.ly/python/contour-plots/

This tutorial was inspired by amazing examples from the official scikit-learn docs:
* https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
* https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
* https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html