class: title-slide, center

<span class="fa-stack fa-4x">
  <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i>
  <strong class="fa-stack-1x" style="color:#E7553C;">7</strong>
</span>

# Cross-validation

## Machine Learning in the Tidyverse

### Alison Hill · Garrett Grolemund

#### [https://conf20-intro-ml.netlify.com/](https://conf20-intro-ml.netlify.com/) · [https://rstd.io/conf20-intro-ml](https://rstd.io/conf20-intro-ml)

---
class: middle, center

# Quiz

What property of models is Machine Learning most concerned about?

--

Predictions.

---
class: middle, center, frame

# rsample

<iframe src="https://tidymodels.github.io/rsample/" width="100%" height="400px"></iframe>

---
class: your-turn

# Your Turn 1

Run the first code chunk. Then fill in the blanks to:

1. Create a split object that apportions 75% of `ames` to a training set and the remainder to a testing set.
1. Fit `all_wf` to the split object.
1. Extract the RMSE of the fit.
03:00
---

```r
all_wf <- workflow() %>% 
  add_formula(Sale_Price ~ .) %>% 
  add_model(lm_spec)

new_split <- initial_split(ames)

all_wf %>% 
  fit_split(split = new_split, metrics = metric_set(rmse)) %>% 
  collect_metrics()
```

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      24111.
```

---
class: your-turn

# Your Turn 2

What would happen if you repeated this process? Would you get the same answers?

Discuss in your team. Then rerun the last code chunk from Your Turn 1. Do you get the same answer?
02:00
---

.pull-left[

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      38727.
```

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      41088.
```

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      44389.
```
]

--

.pull-right[

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      36100.
```

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      38463.
```

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      44457.
```
]

---
class: middle, center

# Quiz

Why is the new estimate different?

---
class: middle, center

# Data Splitting

<img src="figs/03-cv/unnamed-chunk-11-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-12-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-13-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-14-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-15-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-16-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-17-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-18-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-19-1.png" width="720" style="display: block; margin: auto;" />

---

<img src="figs/03-cv/unnamed-chunk-20-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-21-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-22-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-23-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-24-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-25-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-26-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-27-1.png" width="1080" style="display: block; margin: auto;" />

--

.right[Mean RMSE]

---
class: your-turn

# Your Turn 3

Rerun the code below 10 times and then compute the mean of the results (you will need to jot them down as you go).
03:00
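---
class: middle

A base-R sketch of why the RMSEs above keep changing: every run draws a different random 75/25 split. This uses simulated data rather than `ames`, so the numbers are illustrative only.

```r
set.seed(1)
n <- 200
d <- data.frame(x = runif(n))
d$y <- 2 * d$x + rnorm(n)

# 10 different random 75/25 splits -> 10 different test-set RMSEs
rmses <- replicate(10, {
  idx  <- sample(n, size = floor(0.75 * n))  # random training rows
  fit  <- lm(y ~ x, data = d[idx, ])         # train on 75%
  pred <- predict(fit, newdata = d[-idx, ])  # predict the held-out 25%
  sqrt(mean((d$y[-idx] - pred)^2))           # RMSE on the test set
})
mean(rmses)  # averaging steadies the estimate
```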
---

```r
rmses %>% tibble::enframe(name = "rmse")
# A tibble: 10 x 2
    rmse  value
   <int>  <dbl>
 1     1 32886.
 2     2 43209.
 3     3 44021.
 4     4 34077.
 5     5 38473.
 6     6 30188.
 7     7 30311.
 8     8 28644.
 9     9 35107.
10    10 42656.

mean(rmses)
[1] 35957.18
```

---
class: middle, center

# Discuss

Which do you think is more accurate, the best result or the mean of the results? Why?

Discuss with your team.

---
class: middle, center, inverse

# Cross-validation

---

# There has to be a better way...

```r
rmses <- vector(length = 10, mode = "double")

for (i in 1:10) {
  new_split <- initial_split(ames)
  rmses[i] <- all_wf %>% 
    fit_split(split = new_split, metrics = metric_set(rmse)) %>% 
    collect_metrics() %>% 
    pull(.estimate)
}
```

---
class: middle, center

# V-fold cross-validation

```r
vfold_cv(data, v = 10, ...)
```

---

<img src="images/cv.gif" style="display: block; margin: auto;" />

---
class: middle, center

# Guess

How many times does an observation/row appear in the assessment set?

<img src="figs/03-cv/vfold-tiles-1.png" width="864" style="display: block; margin: auto;" />

---

<img src="figs/03-cv/unnamed-chunk-34-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# Quiz

If we use 10 folds, what percent of our data will end up in the training set and what percent in the testing set for each fold?

--

90% - training

10% - test

---
class: your-turn

# Your Turn 4

Run the code below. What does it return?

```r
set.seed(100)
cv_folds <- 
  vfold_cv(ames_train, v = 10, strata = Sale_Price, breaks = 4)
cv_folds
```
01:00
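---
class: middle

The fold bookkeeping behind `vfold_cv()` can be sketched in base R. This is not rsample's implementation, just the idea: each row is assigned to exactly one assessment fold, so with `v = 10` each analysis set keeps roughly 90% of the rows.

```r
set.seed(100)
n <- 50   # toy number of rows
v <- 10   # number of folds

# assign every row to exactly one of v folds, as evenly as possible
fold <- sample(rep(1:v, length.out = n))

# each row shows up in an assessment set exactly once across the v folds
assessment_rows <- unlist(lapply(1:v, function(i) which(fold == i)))
all(table(assessment_rows) == 1)  # TRUE

# for fold 1, the analysis set is everything not in fold 1: ~90% of rows
length(which(fold != 1)) / n      # 0.9
```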
---

```r
set.seed(100)
cv_folds <- 
  vfold_cv(ames_train, v = 10, strata = Sale_Price, breaks = 4)
cv_folds
```

```
#  10-fold cross-validation using stratification 
# A tibble: 10 x 2
   splits           id    
   <named list>     <chr> 
 1 <split [2K/221]> Fold01
 2 <split [2K/221]> Fold02
 3 <split [2K/220]> Fold03
 4 <split [2K/220]> Fold04
 5 <split [2K/220]> Fold05
 6 <split [2K/220]> Fold06
 7 <split [2K/220]> Fold07
 8 <split [2K/219]> Fold08
 9 <split [2K/219]> Fold09
10 <split [2K/218]> Fold10
```

---
class: middle

.center[

# We need a new way to fit

]

```r
split1 <- cv_folds %>% pluck("splits", 1)

all_wf %>% 
  fit_split(split = split1, metrics = metric_set(rmse)) %>% 
  collect_metrics()

# rinse and repeat
split2 <- ...
```

---
class: inverse, middle, center

# `fit_resamples()`

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a model with cross-validation.

]

```r
fit_resamples(
  Sale_Price ~ Gr_Liv_Area, 
  model = lm_spec, 
  resamples = cv_folds
)
```

---

```r
fit_resamples(
  Sale_Price ~ Gr_Liv_Area, 
  model = lm_spec, 
  resamples = cv_folds
)
```

```
#  10-fold cross-validation using stratification 
# A tibble: 10 x 4
   splits           id     .metrics         .notes          
 * <list>           <chr>  <list>           <list>          
 1 <split [2K/221]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]>
 2 <split [2K/221]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]>
 3 <split [2K/220]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
 4 <split [2K/220]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]>
 5 <split [2K/220]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]>
 6 <split [2K/220]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]>
 7 <split [2K/220]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]>
 8 <split [2K/219]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]>
 9 <split [2K/219]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]>
10 <split [2K/218]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]>
```

---

# `fit_resamples()`

.pull-left[

Fit with formula and model

```r
fit_resamples(
* Sale_Price ~ Gr_Liv_Area, 
* model = lm_spec, 
  resamples = cv_folds
)
```
]

.pull-right[

Fit with workflow

```r
fit_resamples(
* all_wf, 
  resamples = cv_folds
)
```
]

---
class: middle, center

# `collect_metrics()`

Unnest the metrics column from a tidymodels `fit_resamples()`

```r
_results %>% collect_metrics(summarize = TRUE)
```

--

.footnote[`TRUE` is actually the default; averages across folds]

---
class: your-turn

# Your Turn 5

Modify the code below to use `fit_resamples()` and `cv_folds` to cross-validate the `all_wf` workflow. Which RMSE do you collect at the end?

```r
all_wf %>% 
  fit_split(split = new_split, metrics = metric_set(rmse)) %>% 
  collect_metrics()
```
03:00
---

```r
all_wf %>% 
  fit_resamples(resamples = cv_folds, metrics = metric_set(rmse)) %>% 
  collect_metrics()
```

```
# A tibble: 1 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 rmse    standard   41797.    10   5122.
```

---
class: inverse, middle, center

# Comparing Models

---
class: your-turn

# Your Turn 6

Create two new workflows, one that fits the bedbath model, `Sale_Price ~ Bedroom_AbvGr + Full_Bath + Half_Bath`, and one that fits the square foot model, `Sale_Price ~ Gr_Liv_Area`.

Then use `fit_resamples()` and `cv_folds` to compare the performance of each.
06:00
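---
class: middle

What `collect_metrics(summarize = TRUE)` reports is just the per-fold metric values averaged, with a standard error. A base-R sketch on made-up per-fold RMSEs (illustrative numbers, not from the `ames` fits):

```r
# pretend these are 10 per-fold RMSE estimates
fold_rmse <- c(32886, 43209, 44021, 34077, 38473,
               30188, 30311, 28644, 35107, 42656)

data.frame(
  .metric    = "rmse",
  .estimator = "standard",
  mean       = mean(fold_rmse),                         # average across folds
  n          = length(fold_rmse),                       # number of folds
  std_err    = sd(fold_rmse) / sqrt(length(fold_rmse))  # SE of the mean
)
```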
---

```r
bb_wf <- workflow() %>% 
  add_formula(Sale_Price ~ Bedroom_AbvGr + Full_Bath + Half_Bath) %>% 
  add_model(lm_spec)

sqft_wf <- workflow() %>% 
  add_formula(Sale_Price ~ Gr_Liv_Area) %>% 
  add_model(lm_spec)

bb_wf %>% 
  fit_resamples(resamples = cv_folds) %>% 
  collect_metrics()

sqft_wf %>% 
  fit_resamples(resamples = cv_folds) %>% 
  collect_metrics()
```

---
class: middle

.pull-left[

```r
bb_wf %>% 
  fit_resamples(resamples = cv_folds) %>% 
  collect_metrics()
```

```
# A tibble: 2 x 5
  .metric .estimator      mean     n  std_err
  <chr>   <chr>          <dbl> <int>    <dbl>
1 rmse    standard   64514.       10 1588.   
2 rsq     standard       0.339    10    0.0160
```
]

.pull-right[

```r
sqft_wf %>% 
  fit_resamples(resamples = cv_folds) %>% 
  collect_metrics()
```

```
# A tibble: 2 x 5
  .metric .estimator      mean     n  std_err
  <chr>   <chr>          <dbl> <int>    <dbl>
1 rmse    standard   57177.       10 1919.   
2 rsq     standard       0.482    10    0.0321
```
]

---
class: middle, center

# Quiz

Why should you use the same data splits to compare each model?

--

🍎 to 🍎

---
class: middle, center

# Quiz

Does cross-validation measure the accuracy of just your model, or your entire workflow?

--

Your entire workflow

---
class: your-turn

# Your Turn 7

Work together with your teammates to complete the Cross-Validation handout.
05:00
---
background-image: url(images/cv-match.jpeg)
background-size: contain

---
background-image: url(images/vfoldcv/vfoldcv.001.jpeg)
background-size: contain

---
background-image: url(images/vfoldcv/vfoldcv.002.jpeg)
background-size: contain

---
background-image: url(images/vfoldcv/vfoldcv.003.jpeg)
background-size: contain

---
background-image: url(images/vfoldcv/vfoldcv.004.jpeg)
background-size: contain

---
background-image: url(images/vfoldcv/vfoldcv.005.jpeg)
background-size: contain

---
class: middle, center, inverse

# Other types of cross-validation

---
class: middle, center

# `vfold_cv()` - V-fold cross-validation

<img src="figs/03-cv/unnamed-chunk-50-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `loo_cv()` - Leave-one-out CV

<img src="figs/03-cv/loocv-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

# `mc_cv()` - Monte Carlo (random) CV

(Test sets sampled without replacement)

<img src="figs/03-cv/mccv-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `bootstraps()`

(Test sets sampled with replacement)

<img src="figs/03-cv/bootstrap-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center, frame

# yardstick

Functions that compute common model metrics

<tidymodels.github.io/yardstick/>

<iframe src="https://tidymodels.github.io/yardstick/" width="100%" height="400px"></iframe>

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a model with cross-validation.

]

.pull-left[

```r
fit_resamples(
  object, 
  resamples, 
  ..., 
* metrics = NULL, 
  control = control_resamples()
)
```
]

.pull-right[

If `NULL`...

regression = `rmse` + `rsq`

classification = `accuracy` + `roc_auc`
]

---
class: middle, center

# `metric_set()`

A helper function for selecting yardstick metric functions.

```r
metric_set(rmse, rsq)
```

---
class: middle

.center[

# `fit_resamples()`

.fade[Trains and tests a model with cross-validation.]
]

.pull-left[

```r
fit_resamples(
  object, 
  resamples, 
  ..., 
* metrics = metric_set(rmse, rsq), 
  control = control_resamples()
)
```
]

---
class: middle, center, frame

# Metric Functions

<https://tidymodels.github.io/yardstick/reference/index.html>

<iframe src="https://tidymodels.github.io/yardstick/reference/index.html" width="100%" height="400px"></iframe>

---
class: your-turn

# Your Turn 8

Modify the code below to return the **Mean Absolute Error**. Visit <https://tidymodels.github.io/yardstick/reference/index.html> to find the right function to use.
03:00
---

```r
bb_wf %>% 
  fit_resamples(resamples = cv_folds, metrics = metric_set(mae)) %>% 
  collect_metrics()
```

```
# A tibble: 1 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 mae     standard   44970.    10   1079.
```

```r
sqft_wf %>% 
  fit_resamples(resamples = cv_folds, metrics = metric_set(mae)) %>% 
  collect_metrics()
```

```
# A tibble: 1 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 mae     standard   38831.    10   1031.
```
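---
class: middle

For intuition, `mae()` computes the mean of the absolute residuals, where `rmse()` squares them first. A base-R sketch with toy numbers (not the `ames` predictions above):

```r
truth    <- c(200, 150, 320, 275)  # toy observed sale prices
estimate <- c(190, 170, 300, 280)  # toy predictions

mae_by_hand  <- mean(abs(truth - estimate))       # 13.75
rmse_by_hand <- sqrt(mean((truth - estimate)^2))  # larger: squaring
                                                  # punishes big misses more
```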