class: title-slide, center

<span class="fa-stack fa-4x">
  <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i>
  <strong class="fa-stack-1x" style="color:#E7553C;">7</strong>
</span>

# Cross-validation

## Machine Learning in the Tidyverse

### Alison Hill · Garrett Grolemund

#### [https://conf20-intro-ml.netlify.com/](https://conf20-intro-ml.netlify.com/) · [https://rstd.io/conf20-intro-ml](https://rstd.io/conf20-intro-ml)

---
class: middle, center

# Quiz

What property of models is Machine Learning most concerned about?

--

Predictions.

---
class: middle, center, frame

# rsample

<iframe src="https://tidymodels.github.io/rsample/" width="100%" height="400px"></iframe>

---
class: your-turn

# Your Turn 1

Run the first code chunk. Then fill in the blanks to:

1. Create a split object that apportions 75% of `ames` to a training set and the remainder to a testing set.
1. Fit `all_wf` to the split object.
1. Extract the RMSE of the fit.
03:00
---

```r
all_wf <- workflow() %>% 
  add_formula(Sale_Price ~ .) %>% 
  add_model(lm_spec)

new_split <- initial_split(ames)

all_wf %>% 
  fit_split(split = new_split, metrics = metric_set(rmse)) %>% 
  collect_metrics()
```

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      24111.
```

---
class: your-turn

# Your Turn 2

What would happen if you repeated this process? Would you get the same answers?

Discuss in your team. Then rerun the last code chunk from Your Turn 1. Do you get the same answer?
02:00
---

.pull-left[

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      38727.
```

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      41088.
```

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      44389.
```
]

--

.pull-right[

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      36100.
```

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      38463.
```

```
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      44457.
```
]

---
class: middle, center

# Quiz

Why is the new estimate different?

---
class: middle, center

# Data Splitting

<img src="figs/03-cv/unnamed-chunk-11-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-12-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-13-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-14-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-15-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-16-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-17-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-18-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-19-1.png" width="720" style="display: block; margin: auto;" />

---

<img src="figs/03-cv/unnamed-chunk-20-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-21-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-22-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-23-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-24-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-25-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-26-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/03-cv/unnamed-chunk-27-1.png" width="1080" style="display: block; margin: auto;" />

--

.right[Mean RMSE]

---
class: your-turn

# Your Turn 3

Rerun the code below 10 times and then compute the mean of the results (you will need to jot them down as you go).
03:00
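---
class: middle

A base-R sketch of why the RMSEs above keep changing: every run draws a different random 75/25 split. This uses simulated data rather than `ames`, so the numbers are illustrative only.

```r
set.seed(1)
n <- 200
d <- data.frame(x = runif(n))
d$y <- 2 * d$x + rnorm(n)

# 10 different random 75/25 splits -> 10 different test-set RMSEs
rmses <- replicate(10, {
  idx  <- sample(n, size = floor(0.75 * n))  # random training rows
  fit  <- lm(y ~ x, data = d[idx, ])         # train on 75%
  pred <- predict(fit, newdata = d[-idx, ])  # predict the held-out 25%
  sqrt(mean((d$y[-idx] - pred)^2))           # RMSE on the test set
})
mean(rmses)  # averaging steadies the estimate
```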
---

```r
rmses %>% tibble::enframe(name = "rmse")
# A tibble: 10 x 2
    rmse  value
   <int>  <dbl>
 1     1 32886.
 2     2 43209.
 3     3 44021.
 4     4 34077.
 5     5 38473.
 6     6 30188.
 7     7 30311.
 8     8 28644.
 9     9 35107.
10    10 42656.

mean(rmses)
[1] 35957.18
```

---
class: middle, center

# Discuss

Which do you think is more accurate, the best result or the mean of the results? Why?

Discuss with your team.

---
class: middle, center, inverse

# Cross-validation

---

# There has to be a better way...

```r
rmses <- vector(length = 10, mode = "double")

for (i in 1:10) {
  new_split <- initial_split(ames)
  rmses[i] <- all_wf %>% 
    fit_split(split = new_split, metrics = metric_set(rmse)) %>% 
    collect_metrics() %>% 
    pull(.estimate)
}
```

---
class: middle, center

# V-fold cross-validation

```r
vfold_cv(data, v = 10, ...)
```

---

<img src="images/cv.gif" style="display: block; margin: auto;" />

---
class: middle, center

# Guess

How many times does an observation/row appear in the assessment set?

<img src="figs/03-cv/vfold-tiles-1.png" width="864" style="display: block; margin: auto;" />

---

<img src="figs/03-cv/unnamed-chunk-34-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# Quiz

If we use 10 folds, what percent of our data will end up in the training set and what percent in the testing set for each fold?

--

90% - training

10% - test

---
class: your-turn

# Your Turn 4

Run the code below. What does it return?

```r
set.seed(100)
cv_folds <- 
  vfold_cv(ames_train, v = 10, strata = Sale_Price, breaks = 4)
cv_folds
```
01:00
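---
class: middle

The fold bookkeeping behind `vfold_cv()` can be sketched in base R. This is not rsample's implementation, just the idea: each row is assigned to exactly one assessment fold, so with `v = 10` each analysis set keeps roughly 90% of the rows.

```r
set.seed(100)
n <- 50   # toy number of rows
v <- 10   # number of folds

# assign every row to exactly one of v folds, as evenly as possible
fold <- sample(rep(1:v, length.out = n))

# each row shows up in an assessment set exactly once across the v folds
assessment_rows <- unlist(lapply(1:v, function(i) which(fold == i)))
all(table(assessment_rows) == 1)  # TRUE

# for fold 1, the analysis set is everything not in fold 1: ~90% of rows
length(which(fold != 1)) / n      # 0.9
```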
---

```r
set.seed(100)
cv_folds <- 
  vfold_cv(ames_train, v = 10, strata = Sale_Price, breaks = 4)
cv_folds
```

```
#  10-fold cross-validation using stratification 
# A tibble: 10 x 2
   splits           id    
   <named list>     <chr> 
 1 <split [2K/221]> Fold01
 2 <split [2K/221]> Fold02
 3 <split [2K/220]> Fold03
 4 <split [2K/220]> Fold04
 5 <split [2K/220]> Fold05
 6 <split [2K/220]> Fold06
 7 <split [2K/220]> Fold07
 8 <split [2K/219]> Fold08
 9 <split [2K/219]> Fold09
10 <split [2K/218]> Fold10
```

---
class: middle

.center[

# We need a new way to fit

]

```r
split1 <- cv_folds %>% pluck("splits", 1)

all_wf %>% 
  fit_split(split = split1, metrics = metric_set(rmse)) %>% 
  collect_metrics()

# rinse and repeat
split2 <- ...
```

---
class: inverse, middle, center

# `fit_resamples()`

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a model with cross-validation.

]

```r
fit_resamples(
  Sale_Price ~ Gr_Liv_Area, 
  model = lm_spec, 
  resamples = cv_folds
)
```

---

```r
fit_resamples(
  Sale_Price ~ Gr_Liv_Area, 
  model = lm_spec, 
  resamples = cv_folds
)
```

```
#  10-fold cross-validation using stratification 
# A tibble: 10 x 4
   splits           id     .metrics         .notes          
 * <list>           <chr>  <list>           <list>          
 1 <split [2K/221]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]>
 2 <split [2K/221]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]>
 3 <split [2K/220]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
 4 <split [2K/220]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]>
 5 <split [2K/220]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]>
 6 <split [2K/220]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]>
 7 <split [2K/220]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]>
 8 <split [2K/219]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]>
 9 <split [2K/219]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]>
10 <split [2K/218]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]>
```

---

# `fit_resamples()`

.pull-left[

Fit with formula and model

```r
fit_resamples(
* Sale_Price ~ Gr_Liv_Area, 
* model = lm_spec, 
  resamples = cv_folds
)
```
]

.pull-right[

Fit with workflow

```r
fit_resamples(
* all_wf, 
  resamples = cv_folds
)
```
]

---
class: middle, center

# `collect_metrics()`

Unnest the metrics column from a tidymodels `fit_resamples()`

```r
_results %>% collect_metrics(summarize = TRUE)
```

--

.footnote[`TRUE` is actually the default; averages across folds]

---
class: your-turn

# Your Turn 5

Modify the code below to use `fit_resamples()` and `cv_folds` to cross-validate the `all_wf` workflow. Which RMSE do you collect at the end?

```r
all_wf %>% 
  fit_split(split = new_split, metrics = metric_set(rmse)) %>% 
  collect_metrics()
```
03:00
---

```r
all_wf %>% 
  fit_resamples(resamples = cv_folds, metrics = metric_set(rmse)) %>% 
  collect_metrics()
```

```
# A tibble: 1 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 rmse    standard   41797.    10   5122.
```

---
class: inverse, middle, center

# Comparing Models

---
class: your-turn

# Your Turn 6

Create two new workflows, one that fits the bedbath model, `Sale_Price ~ Bedroom_AbvGr + Full_Bath + Half_Bath`, and one that fits the square foot model, `Sale_Price ~ Gr_Liv_Area`.

Then use `fit_resamples()` and `cv_folds` to compare the performance of each.
06:00
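---
class: middle

What `collect_metrics(summarize = TRUE)` reports is just the per-fold metric values averaged, with a standard error. A base-R sketch on made-up per-fold RMSEs (illustrative numbers, not from the `ames` fits):

```r
# pretend these are 10 per-fold RMSE estimates
fold_rmse <- c(32886, 43209, 44021, 34077, 38473,
               30188, 30311, 28644, 35107, 42656)

data.frame(
  .metric    = "rmse",
  .estimator = "standard",
  mean       = mean(fold_rmse),                         # average across folds
  n          = length(fold_rmse),                       # number of folds
  std_err    = sd(fold_rmse) / sqrt(length(fold_rmse))  # SE of the mean
)
```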
---

```r
bb_wf <- workflow() %>% 
  add_formula(Sale_Price ~ Bedroom_AbvGr + Full_Bath + Half_Bath) %>% 
  add_model(lm_spec)

sqft_wf <- workflow() %>% 
  add_formula(Sale_Price ~ Gr_Liv_Area) %>% 
  add_model(lm_spec)

bb_wf %>% 
  fit_resamples(resamples = cv_folds) %>% 
  collect_metrics()

sqft_wf %>% 
  fit_resamples(resamples = cv_folds) %>% 
  collect_metrics()
```

---
class: middle

.pull-left[

```r
bb_wf %>% 
  fit_resamples(resamples = cv_folds) %>% 
  collect_metrics()
```

```
# A tibble: 2 x 5
  .metric .estimator      mean     n  std_err
  <chr>   <chr>          <dbl> <int>    <dbl>
1 rmse    standard   64514.       10 1588.   
2 rsq     standard       0.339    10    0.0160
```
]

.pull-right[

```r
sqft_wf %>% 
  fit_resamples(resamples = cv_folds) %>% 
  collect_metrics()
```

```
# A tibble: 2 x 5
  .metric .estimator      mean     n  std_err
  <chr>   <chr>          <dbl> <int>    <dbl>
1 rmse    standard   57177.       10 1919.   
2 rsq     standard       0.482    10    0.0321
```
]

---
class: middle, center

# Quiz

Why should you use the same data splits to compare each model?

--

🍎 to 🍎

---
class: middle, center

# Quiz

Does cross-validation measure the accuracy of just your model, or your entire workflow?

--

Your entire workflow

---
class: your-turn

# Your Turn 7

Work together with your teammates to complete the Cross-Validation handout.
05:00
---
background-image: url(images/cv-match.jpeg)
background-size: contain

---
background-image: url(images/vfoldcv/vfoldcv.001.jpeg)
background-size: contain

---
background-image: url(images/vfoldcv/vfoldcv.002.jpeg)
background-size: contain

---
background-image: url(images/vfoldcv/vfoldcv.003.jpeg)
background-size: contain

---
background-image: url(images/vfoldcv/vfoldcv.004.jpeg)
background-size: contain

---
background-image: url(images/vfoldcv/vfoldcv.005.jpeg)
background-size: contain

---
class: middle, center, inverse

# Other types of cross-validation

---
class: middle, center

# `vfold_cv()` - V-fold cross-validation

<img src="figs/03-cv/unnamed-chunk-50-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `loo_cv()` - Leave-one-out CV

<img src="figs/03-cv/loocv-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

# `mc_cv()` - Monte Carlo (random) CV

(Test sets sampled without replacement)

<img src="figs/03-cv/mccv-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `bootstraps()`

(Test sets sampled with replacement)

<img src="figs/03-cv/bootstrap-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center, frame

# yardstick

Functions that compute common model metrics

<tidymodels.github.io/yardstick/>

<iframe src="https://tidymodels.github.io/yardstick/" width="100%" height="400px"></iframe>

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a model with cross-validation.

]

.pull-left[

```r
fit_resamples(
  object, 
  resamples, 
  ..., 
* metrics = NULL, 
  control = control_resamples()
)
```
]

.pull-right[

If `NULL`...

regression = `rmse` + `rsq`

classification = `accuracy` + `roc_auc`
]

---
class: middle, center

# `metric_set()`

A helper function for selecting yardstick metric functions.

```r
metric_set(rmse, rsq)
```

---
class: middle

.center[

# `fit_resamples()`

.fade[Trains and tests a model with cross-validation.]
]

.pull-left[

```r
fit_resamples(
  object, 
  resamples, 
  ..., 
* metrics = metric_set(rmse, rsq), 
  control = control_resamples()
)
```
]

---
class: middle, center, frame

# Metric Functions

<https://tidymodels.github.io/yardstick/reference/index.html>

<iframe src="https://tidymodels.github.io/yardstick/reference/index.html" width="100%" height="400px"></iframe>

---
class: your-turn

# Your Turn 8

Modify the code below to return the **Mean Absolute Error**. Visit <https://tidymodels.github.io/yardstick/reference/index.html> to find the right function to use.
03:00
---

```r
bb_wf %>% 
  fit_resamples(resamples = cv_folds, metrics = metric_set(mae)) %>% 
  collect_metrics()
```

```
# A tibble: 1 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 mae     standard   44970.    10   1079.
```

```r
sqft_wf %>% 
  fit_resamples(resamples = cv_folds, metrics = metric_set(mae)) %>% 
  collect_metrics()
```

```
# A tibble: 1 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 mae     standard   38831.    10   1031.
```
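---
class: middle

For intuition, `mae()` computes the mean of the absolute residuals, where `rmse()` squares them first. A base-R sketch with toy numbers (not the `ames` predictions above):

```r
truth    <- c(200, 150, 320, 275)  # toy observed sale prices
estimate <- c(190, 170, 300, 280)  # toy predictions

mae_by_hand  <- mean(abs(truth - estimate))       # 13.75
rmse_by_hand <- sqrt(mean((truth - estimate)^2))  # larger: squaring
                                                  # punishes big misses more
```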