class: title-slide, center

<span class="fa-stack fa-4x">
  <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i>
  <strong class="fa-stack-1x" style="color:#E7553C;">08</strong>
</span>

# Tuning

## Machine Learning in the Tidyverse

### Alison Hill · Garrett Grolemund

#### [https://conf20-intro-ml.netlify.com/](https://conf20-intro-ml.netlify.com/) · [https://rstd.io/conf20-intro-ml](https://rstd.io/conf20-intro-ml)

---

# KNN

---
class: middle, center

# `nearest_neighbor()`

Specifies a model that uses K Nearest Neighbors

```r
nearest_neighbor(neighbors = 1)
```

--

### k = `neighbors` (PLURAL)

--

.footnote[regression and classification modes]

---
class: your-turn

# Your Turn 1

Here's a new recipe (also in your .Rmd)…

```r
normalize_rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
```

---
class: your-turn

# Your Turn 1

…and a new model. Can you tell what type of model this is?…

```r
knn5_spec <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("regression")
```

---
class: your-turn

# Your Turn 1

Combine the recipe and model into a new workflow named `knn5_wf`. Fit the workflow to `cv_folds` and collect its RMSE.
04:00
---

```r
knn5_wf <- workflow() %>%
  add_recipe(normalize_rec) %>%
  add_model(knn5_spec)

knn5_wf %>%
  fit_resamples(resamples = cv_folds) %>%
  collect_metrics()

# A tibble: 2 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 rmse    standard  37191.     10 1130.
2 rsq     standard      0.786  10    0.00971
```

---
class: your-turn

# Your Turn 2

Repeat the process in Your Turn 1 with a similar workflow that uses `neighbors = 10`. Does the RMSE change?
05:00
---

```r
knn10_spec <- nearest_neighbor(neighbors = 10) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn10_wf <- knn5_wf %>%
  update_model(knn10_spec)

knn10_wf %>%
  fit_resamples(resamples = cv_folds) %>%
  collect_metrics()

# A tibble: 2 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 rmse    standard  35817.     10 972.
2 rsq     standard      0.806  10   0.00869
```

---
class: middle, center

# Quiz

How can you find the best value of neighbors/k?

--

Compare all the separate values/models

---
class: inverse, middle, center

# `tune_grid()`

---
class: middle, center, frame

# tune

Functions for fitting and tuning models

<tidymodels.github.io/tune/>

<iframe src="https://tidymodels.github.io/tune/" width="100%" height="400px"></iframe>

---
class: middle, center

# `tune()`

A placeholder for hyper-parameters to be "tuned"

```r
nearest_neighbor(neighbors = tune())
```

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
* object,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

--

.pull-right[
One of:

+ A `workflow`
+ A formula
+ A `recipe`
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
* object,
* model,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
One of:

+ formula + `model`
+ `recipe` + `model`
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
* grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
One of:

+ A positive integer.
+ A data frame of tuning combinations.
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
* grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
Number of candidate parameter sets to be created automatically.
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
* grid = df,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
A data frame of tuning combinations.
]

---
class: middle, center

# `expand_grid()`

Takes one or more vectors, and returns a data frame holding all combinations of their values.

```r
expand_grid(neighbors = c(1,2), foo = 3:5)

# A tibble: 6 x 2
  neighbors   foo
      <dbl> <int>
1         1     3
2         1     4
3         1     5
4         2     3
5         2     4
6         2     5
```

--

.footnote[tidyr package; see also base `expand.grid()`]

---
class: your-turn

# Your Turn 3

Use `expand_grid()` to create a grid of values for `neighbors` that spans from 10 to 20. Save the result as `k10_20`.
02:00
---

```r
k10_20 <- expand_grid(neighbors = 10:20)
k10_20

# A tibble: 11 x 1
   neighbors
       <int>
 1        10
 2        11
 3        12
 4        13
 5        14
 6        15
 7        16
 8        17
 9        18
10        19
11        20
```

---
class: your-turn

# Your Turn 4

Create a knn workflow that tunes over `neighbors`. Then use `tune_grid()`, `cv_folds` and `k10_20` to find the best value of `neighbors`. Save the output of `tune_grid()` as `knn_results`.
05:00
---

```r
knn_tuner <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_twf <- workflow() %>%
  add_recipe(normalize_rec) %>%
  add_model(knn_tuner)

knn_results <- knn_twf %>%
  tune_grid(resamples = cv_folds, grid = k10_20)

knn_results %>%
  collect_metrics() %>%
  filter(.metric == "rmse")
```

---

```
# A tibble: 11 x 6
   neighbors .metric .estimator   mean     n std_err
       <int> <chr>   <chr>       <dbl> <int>   <dbl>
 1        10 rmse    standard   35817.    10    972.
 2        11 rmse    standard   35719.    10    979.
 3        12 rmse    standard   35648.    10    991.
 4        13 rmse    standard   35596.    10   1004.
 5        14 rmse    standard   35558.    10   1017.
 6        15 rmse    standard   35533.    10   1030.
 7        16 rmse    standard   35524.    10   1044.
 8        17 rmse    standard   35530.    10   1057.
 9        18 rmse    standard   35543.    10   1068.
10        19 rmse    standard   35557.    10   1078.
11        20 rmse    standard   35577.    10   1088.
```

---

```
# A tibble: 110 x 5
   id     neighbors .metric .estimator .estimate
   <chr>      <int> <chr>   <chr>          <dbl>
 1 Fold01        10 rmse    standard      39579.
 2 Fold01        11 rmse    standard      39582.
 3 Fold01        12 rmse    standard      39628.
 4 Fold01        13 rmse    standard      39693.
 5 Fold01        14 rmse    standard      39743.
 6 Fold01        15 rmse    standard      39787.
 7 Fold01        16 rmse    standard      39850.
 8 Fold01        17 rmse    standard      39928.
 9 Fold01        18 rmse    standard      40004.
10 Fold01        19 rmse    standard      40081.
# … with 100 more rows
```

---
class: middle
name: show-best

.center[
# `show_best()`

Shows the .display[n] best combinations of hyper-parameters
]

```r
knn_results %>%
  show_best(metric = "rmse", n = 5, maximize = FALSE)
```

---
template: show-best

```
# A tibble: 5 x 6
  neighbors .metric .estimator   mean     n std_err
      <int> <chr>   <chr>       <dbl> <int>   <dbl>
1        16 rmse    standard   35524.    10   1044.
2        17 rmse    standard   35530.    10   1057.
3        15 rmse    standard   35533.    10   1030.
4        18 rmse    standard   35543.    10   1068.
5        19 rmse    standard   35557.    10   1078.
```

---
class: middle, center

# `autoplot()`

Quickly visualize tuning results

```r
knn_results %>% autoplot()
```

<img src="figs/04-Tune/knn-plot-1.png" width="504" />

---
class: middle, center

<img src="figs/04-Tune/unnamed-chunk-21-1.png" width="504" />

---

# You can tune models *and* recipes!

---
class: your-turn

# Your Turn 5

Modify the PCA workflow provided to find the best value for `num_comp` on the grid given. Which is it? Use `show_best()` to see. Save the output of the fit function as `pca_results`.
05:00
---

```r
lm_spec <- linear_reg() %>%
  set_engine("lm")

pca_tuner <- recipe(Sale_Price ~ ., data = ames) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())

pca_twf <- workflow() %>%
  add_recipe(pca_tuner) %>%
  add_model(lm_spec)

nc10_40 <- expand_grid(num_comp = c(10,20,30,40))

pca_results <- pca_twf %>%
  tune_grid(resamples = cv_folds, grid = nc10_40)

pca_results %>%
  show_best(metric = "rmse", maximize = FALSE)
```

---

```
# A tibble: 4 x 6
  num_comp .metric .estimator   mean     n std_err
     <dbl> <chr>   <chr>       <dbl> <int>   <dbl>
1       40 rmse    standard   32384.    10   2184.
2       30 rmse    standard   33549.    10   2089.
3       20 rmse    standard   33997.    10   2063.
4       10 rmse    standard   36081.    10   1881.
```

---

```r
library(modeldata)
data(stackoverflow)

# split the data
set.seed(100) # Important!
so_split <- initial_split(stackoverflow, strata = Remote)
so_train <- training(so_split)
so_test <- testing(so_split)

set.seed(100) # Important!
so_folds <- vfold_cv(so_train, v = 10, strata = Remote)
```

---
class: your-turn

# Your Turn 6

Here's a new recipe (also in your .Rmd)…

```r
so_rec <- recipe(Remote ~ ., data = so_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_lincomb(all_predictors()) %>%
  step_downsample(Remote)
```

---
class: your-turn

# Your Turn 6

…and a new model plus workflow. Can you tell what type of model this is?…

```r
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_recipe(so_rec) %>%
  add_model(rf_spec)
```

---
class: your-turn

# Your Turn 6

Here is the output from `fit_resamples()`...

```r
rf_results <- rf_wf %>%
  fit_resamples(resamples = so_folds,
                metrics = metric_set(roc_auc))

rf_results %>%
  collect_metrics(summarize = TRUE)

# A tibble: 1 x 5
  .metric .estimator  mean     n std_err
  <chr>   <chr>      <dbl> <int>   <dbl>
1 roc_auc binary     0.684    10  0.0165
```

---
class: your-turn

# Your Turn 6

Edit the random forest model to tune the `mtry` and `min_n` hyper-parameters; call the new model spec `rf_tuner`.

Update the model for your workflow; save it as `rf_twf`.

Tune the workflow to `so_folds` and show the best combination of hyper-parameters to maximize `roc_auc`.

How does it compare to the average ROC AUC across folds from `fit_resamples()`?
10:00
---

```r
rf_tuner <- rand_forest(mtry = tune(),
                        min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_twf <- rf_wf %>%
  update_model(rf_tuner)

rf_results <- rf_twf %>%
  tune_grid(resamples = so_folds)

i Creating pre-processing data to finalize unknown parameter: mtry
```

---
class: middle, center

# `metric_set()`

A helper function for selecting yardstick metric functions.

```r
metric_set(roc_auc, sens, spec)
```

.footnote[an optional `tune_grid()` usage sketch follows the Your Turn 7 prompt]

---

# What next?

---
class: middle
name: show-best

.center[
# `show_best()`

Shows the .display[n] best combinations of hyper-parameters.
]

```r
rf_results %>%
  show_best(metric = "roc_auc")

# A tibble: 5 x 7
   mtry min_n .metric .estimator  mean     n std_err
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
1     1    33 roc_auc binary     0.690    10  0.0182
2     4    17 roc_auc binary     0.689    10  0.0169
3     8    32 roc_auc binary     0.686    10  0.0189
4    17    38 roc_auc binary     0.682    10  0.0200
5    13    24 roc_auc binary     0.679    10  0.0198
```

---
class: middle
name: select-best

.center[
# `select_best()`

Shows the .display[top] combination of hyper-parameters.
]

```r
so_best <- rf_results %>%
  select_best(metric = "roc_auc")

so_best
```

---
template: select-best

```
# A tibble: 1 x 2
   mtry min_n
  <int> <int>
1     1    33
```

---

.center[
# `finalize_workflow()`

Replaces `tune()` placeholders in a model/recipe/workflow with a set of hyper-parameter values.
]

```r
so_wfl_final <- rf_twf %>%
  finalize_workflow(so_best)
```

---
class: middle, center

# The test set

Remember me?

---
class: middle

.center[
# `fit_split()`

Remember me?
]

```r
so_test_results <- so_wfl_final %>%
  fit_split(split = so_split)
```

---

```r
so_test_results

# # Monte Carlo cross-validation (0.75/0.25) with 1 resamples
# A tibble: 1 x 6
  splits        id          .metrics      .notes      .predictions     .workflow
* <list>        <chr>       <list>        <list>      <list>           <list>
1 <split [4.2K… train/test… <tibble [2 ×… <tibble [0… <tibble [1,398 … <workflo…
```

---
class: your-turn

# Your Turn 7

Use `select_best()`, `finalize_workflow()`, and `fit_split()` to take the best combination of hyper-parameters from `rf_results` and use them to predict the test set.

How does our actual test ROC AUC compare to our cross-validated estimate?
05:00
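---
class: middle

.center[
# Aside: your own grid and metrics
]

An optional aside, not one of the exercises: a minimal sketch of passing an explicit grid and a `metric_set()` to `tune_grid()`, assuming the `rf_twf` workflow and `so_folds` resamples from Your Turn 6. The `mtry`/`min_n` values here are illustrative, not tuned recommendations.

```r
# Sketch (illustrative values): an explicit grid replaces the default
# grid = 10 candidates, and a metric set tracks several yardstick
# metrics while tuning.
rf_grid <- expand_grid(mtry  = c(1, 5, 10, 20),
                       min_n = c(5, 15, 25, 35))

rf_grid_results <- rf_twf %>%
  tune_grid(resamples = so_folds,
            grid      = rf_grid,
            metrics   = metric_set(roc_auc, sens, spec))

rf_grid_results %>% show_best(metric = "roc_auc")
```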
---

```r
so_best <- rf_results %>%
  select_best(metric = "roc_auc")

so_wfl_final <- rf_twf %>%
  finalize_workflow(so_best)

so_test_results <- so_wfl_final %>%
  fit_split(split = so_split)

so_test_results %>%
  collect_metrics()
```

---

# final final final

---
class: middle

.center[
# Final metrics
]

```r
so_test_results %>%
  collect_metrics()

# A tibble: 2 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.665
2 roc_auc  binary         0.709
```

---
class: middle

.center[
# Predict the test set
]

```r
so_test_results %>%
  collect_predictions()

# A tibble: 1,398 x 6
   id               .pred_Remote `.pred_Not remote`  .row .pred_class Remote
   <chr>                   <dbl>              <dbl> <int> <fct>       <fct>
 1 train/test split        0.479              0.521     1 Not remote  Remote
 2 train/test split        0.450              0.550     6 Not remote  Not remote
 3 train/test split        0.409              0.591    18 Not remote  Not remote
 4 train/test split        0.582              0.418    23 Remote      Not remote
 5 train/test split        0.513              0.487    30 Remote      Not remote
 6 train/test split        0.521              0.479    50 Remote      Not remote
 7 train/test split        0.609              0.391    53 Remote      Not remote
 8 train/test split        0.504              0.496    56 Remote      Not remote
 9 train/test split        0.552              0.448    63 Remote      Not remote
10 train/test split        0.418              0.582    68 Not remote  Not remote
# … with 1,388 more rows
```

---

```r
roc_values <- so_test_results %>%
  collect_predictions() %>%
  roc_curve(truth = Remote, estimate = .pred_Remote)

autoplot(roc_values)
```

<img src="figs/04-Tune/unnamed-chunk-40-1.png" width="504" />

---

# Mea Culpa

.pull-left[
```r
fit_split(
  object,
  split,
  ...,
  metrics = NULL
)
```
]

.pull-right[
```r
last_fit(
  object,
  split,
  ...,
  metrics = NULL
)
```

From the tune package
]
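---
class: middle

.center[
# `last_fit()`
]

A minimal sketch of the same final fit written with `last_fit()`, the tune-package counterpart shown on the Mea Culpa slide, assuming the `so_wfl_final` workflow and `so_split` split created earlier.

```r
# Sketch: last_fit() fits the finalized workflow on the training set
# and evaluates it once on the held-out test set.
so_last <- so_wfl_final %>%
  last_fit(split = so_split)

so_last %>% collect_metrics()      # test set metrics
so_last %>% collect_predictions()  # test set predictions
```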