Commit c4356aa

Merge pull request #2702 from plotly/updated-ml-docs
Starting AI/ML section in python docs [3rd attempt]
2 parents 4efdc59 + 9b8ec17 commit c4356aa

7 files changed

+1583
-0
lines changed

doc/Makefile

+1
@@ -38,6 +38,7 @@ $(HTML_DIR)/2019-07-03-%.html: $(IPYNB_DIR)/%.ipynb
 	@mkdir -p $(FAIL_DIR)
 	@echo "[nbconvert] $<"
 	@jupyter nbconvert $< --to html --template nb.tpl \
+		--ExecutePreprocessor.timeout=600\
 		--output-dir $(HTML_DIR) --output 2019-07-03-$*.html \
 		--execute > $(FAIL_DIR)/$* 2>&1 && rm -f $(FAIL_DIR)/$*

doc/python/ml-knn.md

+324
@@ -0,0 +1,324 @@
---
jupyter:
  jupytext:
    notebook_metadata_filter: all
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.2'
      jupytext_version: 1.4.2
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
  language_info:
    codemirror_mode:
      name: ipython
      version: 3
    file_extension: .py
    mimetype: text/x-python
    name: python
    nbconvert_exporter: python
    pygments_lexer: ipython3
    version: 3.7.7
  plotly:
    description: Visualize scikit-learn's k-Nearest Neighbors (kNN) classification
      in Python with Plotly.
    display_as: ai_ml
    language: python
    layout: base
    name: kNN Classification
    order: 2
    page_type: u-guide
    permalink: python/knn-classification/
    thumbnail: thumbnail/knn-classification.png
---

## Basic binary classification with kNN

This section shows how to visualize basic binary classification on 2D data. We first show how to display training versus testing data using [various marker styles](https://plot.ly/python/marker-style/), then demonstrate how to evaluate our classifier's performance on the **test split** using a continuous color gradient to indicate the model's predicted score.

We will use [Scikit-learn](https://scikit-learn.org/) for training our model and for loading and splitting data. Scikit-learn is a popular Machine Learning (ML) library that offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models. It was designed to be accessible, and to work seamlessly with popular libraries like NumPy and Pandas.

We will train a [k-Nearest Neighbors (kNN)](https://scikit-learn.org/stable/modules/neighbors.html) classifier. First, the model stores each training sample together with its label. Then, whenever we give it a new sample, it looks at the `k` closest samples from the training set, finds the most common label among them, and assigns that label to the new sample.
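
As a rough illustration of the voting step described above, here is a minimal from-scratch sketch using only NumPy. The helper name `knn_predict`, the toy arrays, and the choice of Euclidean distance are assumptions made for this example; it is not how scikit-learn implements `KNeighborsClassifier` internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Return the most common label among the k training points closest to x_new.

    Hypothetical helper for illustration only (Euclidean distance assumed).
    """
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)  # count labels among the neighbors
    return votes.most_common(1)[0][0]


# Toy data, made up for this example: two small, well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array(['0', '0', '0', '1', '1', '1'])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15])))
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.00])))
```

In the rest of this page we rely on scikit-learn's `KNeighborsClassifier`, which also provides faster neighbor searches and optional distance weighting.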

### Display training and test splits

Using Scikit-learn, we first generate synthetic data in the shape of two interleaving half-moons. We then split it into a training and a testing set. Finally, we display the ground truth labels using [a scatter plot](https://plotly.com/python/line-and-scatter/).

In the graph, we display all the negative labels as squares, and positive labels as circles. We differentiate the training and test sets by adding a dot to the center of each test marker.

In this example, we will use [graph objects](/python/graph-objects/), Plotly's low-level API for building figures.

```python
import plotly.graph_objects as go
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load and split data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0)

trace_specs = [
    [X_train, y_train, '0', 'Train', 'square'],
    [X_train, y_train, '1', 'Train', 'circle'],
    [X_test, y_test, '0', 'Test', 'square-dot'],
    [X_test, y_test, '1', 'Test', 'circle-dot']
]

fig = go.Figure(data=[
    go.Scatter(
        x=X[y==label, 0], y=X[y==label, 1],
        name=f'{split} Split, Label {label}',
        mode='markers', marker_symbol=marker
    )
    for X, y, label, split, marker in trace_specs
])
fig.update_traces(
    marker_size=12, marker_line_width=1.5,
    marker_color="lightyellow"
)
fig.show()
```

### Visualize predictions on test split with [`plotly.express`](https://plotly.com/python/plotly-express/)

Now, we train the kNN model on the same training data displayed in the previous graph. Then, we predict the confidence score of the model for each of the data points in the test set. We will use marker shapes to denote the true labels, and color to indicate the score the model assigns to each point.

In this example, we will use [Plotly Express](/python/plotly-express/), Plotly's high-level API for building figures. Notice that `px.scatter` requires only one function call to plot both negative and positive labels, and can additionally set a continuous color scale based on the `y_score` output by our kNN model.

```python
import plotly.express as px
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load and split data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0)

# Fit the model on training data, predict on test data
clf = KNeighborsClassifier(15)
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]

fig = px.scatter(
    X_test, x=0, y=1,
    color=y_score, color_continuous_scale='RdBu',
    symbol=y_test, symbol_map={'0': 'square-dot', '1': 'circle-dot'},
    labels={'symbol': 'label', 'color': 'score of <br>first class'}
)
fig.update_traces(marker_size=12, marker_line_width=1.5)
fig.update_layout(legend_orientation='h')
fig.show()
```
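
To make the `y_score` values above more concrete: with the default `weights='uniform'`, the score for class `'1'` should simply be the fraction of the `k` nearest training points labeled `'1'`. The sketch below checks this by recomputing the scores from `clf.kneighbors`; the variables `k` and `manual_score` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as in the example above
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0)

k = 15
clf = KNeighborsClassifier(k).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]  # column 1 corresponds to clf.classes_[1]

# Indices (into X_train) of the k nearest training points for each test point
_, neighbor_idx = clf.kneighbors(X_test)

# Fraction of those neighbors whose label is '1'
manual_score = (y_train[neighbor_idx] == '1').mean(axis=1)

print(clf.classes_)
print(np.allclose(y_score, manual_score))
```

With `weights='distance'`, closer neighbors count more, so this simple fraction would no longer match.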

## Probability Estimates with `go.Contour`

Just like the previous example, we will first train our kNN model on the training set.

Instead of predicting the confidence for the test set only, we can predict a confidence map for the entire area that wraps around the extent of our dataset. To do this, we use [`np.meshgrid`](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html) to create a grid, where the spacing between adjacent points is set by the `mesh_size` variable.

Then, for each of those points, we will use our model to give a confidence score, and plot it with a [contour plot](https://plotly.com/python/contour-plots/).

In this example, we will use [graph objects](/python/graph-objects/), Plotly's low-level API for building figures.

```python
import plotly.graph_objects as go
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

mesh_size = .02
margin = 0.25

# Load and split data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0)

# Create a mesh grid on which we will run our model
x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)

# Create classifier, fit it on the training split, run predictions on grid
clf = KNeighborsClassifier(15, weights='uniform')
clf.fit(X_train, y_train)
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

# Plot the figure
fig = go.Figure(data=[
    go.Contour(
        x=xrange,
        y=yrange,
        z=Z,
        colorscale='RdBu'
    )
])
fig.show()
```
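
The grid construction used above can be easier to follow on a tiny example. The following sketch builds a 2-by-3 grid, flattens it into a list of `(x, y)` points the same way we do before calling `predict_proba`, and reshapes the per-point values back to the grid shape. The names `xs`, `ys`, `grid_points`, and the stand-in scoring expression are hypothetical; the expression only takes the place of the model for illustration.

```python
import numpy as np

xs = np.arange(0.0, 1.5, 0.5)  # three x-values: 0.0, 0.5, 1.0
ys = np.arange(0.0, 1.0, 0.5)  # two y-values: 0.0, 0.5

# xx and yy both have shape (len(ys), len(xs))
xx, yy = np.meshgrid(xs, ys)

# One row per grid point, columns are (x, y); this is the model's input format
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Stand-in for clf.predict_proba(grid_points)[:, 1]: any one-value-per-point function
scores = grid_points[:, 0] + grid_points[:, 1]

# Back to grid shape, so that Z[i, j] belongs to the point (xs[j], ys[i])
Z = scores.reshape(xx.shape)

print(xx.shape)
print(grid_points.shape)
print(Z)
```

This row/column layout is why `go.Contour` can take the 1D `x` and `y` ranges together with the 2D `z` array.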

Now, let's try to combine our `go.Contour` plot with the first scatter plot of our data points, so that we can visually compare the confidence of our model with the true labels.

```python
import plotly.graph_objects as go
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

mesh_size = .02
margin = 0.25

# Load and split data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(str), test_size=0.25, random_state=0)

# Create a mesh grid on which we will run our model
x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)

# Create classifier, fit it on the training split, run predictions on grid
clf = KNeighborsClassifier(15, weights='uniform')
clf.fit(X_train, y_train)
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

trace_specs = [
    [X_train, y_train, '0', 'Train', 'square'],
    [X_train, y_train, '1', 'Train', 'circle'],
    [X_test, y_test, '0', 'Test', 'square-dot'],
    [X_test, y_test, '1', 'Test', 'circle-dot']
]

fig = go.Figure(data=[
    go.Scatter(
        x=X[y==label, 0], y=X[y==label, 1],
        name=f'{split} Split, Label {label}',
        mode='markers', marker_symbol=marker
    )
    for X, y, label, split, marker in trace_specs
])
fig.update_traces(
    marker_size=12, marker_line_width=1.5,
    marker_color="lightyellow"
)

fig.add_trace(
    go.Contour(
        x=xrange,
        y=yrange,
        z=Z,
        showscale=False,
        colorscale='RdBu',
        opacity=0.4,
        name='Score',
        hoverinfo='skip'
    )
)
fig.show()
```

## Multi-class prediction confidence with [`go.Heatmap`](https://plotly.com/python/heatmaps/)

It is also possible to visualize the prediction confidence of the model using [heatmaps](https://plotly.com/python/heatmaps/). In this example, you can see how to compute how confident the model is about its prediction at every point in the 2D grid. Here, we define the confidence at a given point as the highest class score minus the sum of the scores of the other classes.

In this example, we will use [Plotly Express](/python/plotly-express/), Plotly's high-level API for building figures, for the scatter plot of the test data, and add a `go.Heatmap` trace from graph objects to the same figure.

```python
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

mesh_size = .02
margin = 1

# We will use the iris data, which is included in px
df = px.data.iris()
df_train, df_test = train_test_split(df, test_size=0.25, random_state=0)
X_train = df_train[['sepal_length', 'sepal_width']]
y_train = df_train.species_id

# Create a mesh grid on which we will run our model
l_min, l_max = df.sepal_length.min() - margin, df.sepal_length.max() + margin
w_min, w_max = df.sepal_width.min() - margin, df.sepal_width.max() + margin
lrange = np.arange(l_min, l_max, mesh_size)
wrange = np.arange(w_min, w_max, mesh_size)
ll, ww = np.meshgrid(lrange, wrange)

# Create classifier, run predictions on grid
clf = KNeighborsClassifier(15, weights='distance')
clf.fit(X_train, y_train)
Z = clf.predict(np.c_[ll.ravel(), ww.ravel()])
Z = Z.reshape(ll.shape)
proba = clf.predict_proba(np.c_[ll.ravel(), ww.ravel()])
proba = proba.reshape(ll.shape + (3,))

# Compute the confidence: the top class score minus the sum of the other scores
diff = proba.max(axis=-1) - (proba.sum(axis=-1) - proba.max(axis=-1))

fig = px.scatter(
    df_test, x='sepal_length', y='sepal_width',
    symbol='species',
    symbol_map={
        'setosa': 'square-dot',
        'versicolor': 'circle-dot',
        'virginica': 'diamond-dot'},
)
fig.update_traces(
    marker_size=12, marker_line_width=1.5,
    marker_color="lightyellow"
)
fig.add_trace(
    go.Heatmap(
        x=lrange,
        y=wrange,
        z=diff,
        opacity=0.25,
        customdata=proba,
        colorscale='RdBu',
        hovertemplate=(
            'sepal length: %{x} <br>'
            'sepal width: %{y} <br>'
            'p(setosa): %{customdata[0]:.3f}<br>'
            'p(versicolor): %{customdata[1]:.3f}<br>'
            'p(virginica): %{customdata[2]:.3f}<extra></extra>'
        )
    )
)
fig.update_layout(
    legend_orientation='h',
    title='Prediction Confidence on Test Split'
)
fig.show()
```
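
To see how the confidence measure used for the heatmap behaves, the sketch below applies the same formula to a few hand-picked probability rows (made up for illustration; the name `top` is introduced here). Because each row sums to 1, the measure equals `2 * max - 1`, so it reaches 1 when the model is certain and turns negative once the top score drops below 0.5.

```python
import numpy as np

# Hypothetical class probabilities for three grid points (each row sums to 1)
proba = np.array([
    [1.0, 0.0, 0.0],  # fully certain
    [0.6, 0.3, 0.1],  # fairly confident
    [0.4, 0.3, 0.3],  # nearly undecided
])

top = proba.max(axis=-1)

# Same expression as in the heatmap example: top score minus the sum of the others
diff = top - (proba.sum(axis=-1) - top)

print(diff)
print(np.allclose(diff, 2 * top - 1))
```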

### Reference

Learn more about `px`, `go.Contour`, and `go.Heatmap` here:
* https://plot.ly/python/plotly-express/
* https://plot.ly/python/heatmaps/
* https://plot.ly/python/contour-plots/

This tutorial was inspired by amazing examples from the official scikit-learn docs:
* https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
* https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
* https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
