class: title-slide, center

<span class="fa-stack fa-4x">
  <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i>
  <strong class="fa-stack-1x" style="color:#E7553C;">08</strong>
</span>

# Tuning

## Machine Learning in the Tidyverse

### Alison Hill · Garrett Grolemund

#### [https://conf20-intro-ml.netlify.com/](https://conf20-intro-ml.netlify.com/) · [https://rstd.io/conf20-intro-ml](https://rstd.io/conf20-intro-ml)

---

# KNN

---
class: middle, center

# `nearest_neighbor()`

Specifies a model that uses K Nearest Neighbors

```r
nearest_neighbor(neighbors = 1)
```

--

### k = `neighbors` (PLURAL)

--

.footnote[regression and classification modes]

---
class: your-turn

# Your Turn 1

Here's a new recipe (also in your .Rmd)…

```r
normalize_rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
```

---
class: your-turn

# Your Turn 1

…and a new model. Can you tell what type of model this is?…

```r
knn5_spec <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("regression")
```

---
class: your-turn

# Your Turn 1

Combine the recipe and model into a new workflow named `knn5_wf`. Fit the workflow to `cv_folds` and collect its RMSE.
04:00
---

```r
knn5_wf <- workflow() %>%
  add_recipe(normalize_rec) %>%
  add_model(knn5_spec)

knn5_wf %>%
  fit_resamples(resamples = cv_folds) %>%
  collect_metrics()

# A tibble: 2 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 rmse    standard  37191.     10 1130.
2 rsq     standard      0.786  10    0.00971
```

---
class: your-turn

# Your Turn 2

Repeat the process in Your Turn 1 with a similar workflow that uses `neighbors = 10`. Does the RMSE change?
05:00
---

```r
knn10_spec <- nearest_neighbor(neighbors = 10) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn10_wf <- knn5_wf %>%
  update_model(knn10_spec)

knn10_wf %>%
  fit_resamples(resamples = cv_folds) %>%
  collect_metrics()

# A tibble: 2 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 rmse    standard  35817.     10 972.
2 rsq     standard      0.806  10   0.00869
```

---
class: middle, center

# Quiz

How can you find the best value of neighbors/k?

--

Compare all the separate values/models

---
class: inverse, middle, center

# `tune_grid()`

---
class: middle, center, frame

# tune

Functions for fitting and tuning models

<tidymodels.github.io/tune/>

<iframe src="https://tidymodels.github.io/tune/" width="100%" height="400px"></iframe>

---
class: middle, center

# `tune()`

A placeholder for hyper-parameters to be "tuned"

```r
nearest_neighbor(neighbors = tune())
```

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
* object,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

--

.pull-right[
One of:

+ A `workflow`
+ A formula
+ A `recipe`
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
* object,
* model,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
One of:

+ formula + `model`
+ `recipe` + `model`
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
* grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
One of:

+ A positive integer.
+ A data frame of tuning combinations.
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
* grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
Number of candidate parameter sets to be created automatically.
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
* grid = df,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
A data frame of tuning combinations.
]

---
class: middle, center

# `expand_grid()`

Takes one or more vectors, and returns a data frame holding all combinations of their values.

```r
expand_grid(neighbors = c(1,2), foo = 3:5)

# A tibble: 6 x 2
  neighbors   foo
      <dbl> <int>
1         1     3
2         1     4
3         1     5
4         2     3
5         2     4
6         2     5
```

--

.footnote[tidyr package; see also base `expand.grid()`]

---
class: your-turn

# Your Turn 3

Use `expand_grid()` to create a grid of values for `neighbors` that spans from 10 to 20. Save the result as `k10_20`.
02:00
---

```r
k10_20 <- expand_grid(neighbors = 10:20)
k10_20

# A tibble: 11 x 1
   neighbors
       <int>
 1        10
 2        11
 3        12
 4        13
 5        14
 6        15
 7        16
 8        17
 9        18
10        19
11        20
```

---
class: your-turn

# Your Turn 4

Create a knn workflow that tunes over `neighbors`. Then use `tune_grid()`, `cv_folds` and `k10_20` to find the best value of `neighbors`. Save the output of `tune_grid()` as `knn_results`.
05:00
---

```r
knn_tuner <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_twf <- workflow() %>%
  add_recipe(normalize_rec) %>%
  add_model(knn_tuner)

knn_results <- knn_twf %>%
  tune_grid(resamples = cv_folds, grid = k10_20)

knn_results %>%
  collect_metrics() %>%
  filter(.metric == "rmse")
```

---

```
# A tibble: 11 x 6
   neighbors .metric .estimator   mean     n std_err
       <int> <chr>   <chr>       <dbl> <int>   <dbl>
 1        10 rmse    standard   35817.    10    972.
 2        11 rmse    standard   35719.    10    979.
 3        12 rmse    standard   35648.    10    991.
 4        13 rmse    standard   35596.    10   1004.
 5        14 rmse    standard   35558.    10   1017.
 6        15 rmse    standard   35533.    10   1030.
 7        16 rmse    standard   35524.    10   1044.
 8        17 rmse    standard   35530.    10   1057.
 9        18 rmse    standard   35543.    10   1068.
10        19 rmse    standard   35557.    10   1078.
11        20 rmse    standard   35577.    10   1088.
```

---

```
# A tibble: 110 x 5
   id     neighbors .metric .estimator .estimate
   <chr>      <int> <chr>   <chr>          <dbl>
 1 Fold01        10 rmse    standard      39579.
 2 Fold01        11 rmse    standard      39582.
 3 Fold01        12 rmse    standard      39628.
 4 Fold01        13 rmse    standard      39693.
 5 Fold01        14 rmse    standard      39743.
 6 Fold01        15 rmse    standard      39787.
 7 Fold01        16 rmse    standard      39850.
 8 Fold01        17 rmse    standard      39928.
 9 Fold01        18 rmse    standard      40004.
10 Fold01        19 rmse    standard      40081.
# … with 100 more rows
```

---
class: middle
name: show-best

.center[
# `show_best()`

Shows the .display[n] best combinations of hyper-parameters
]

```r
knn_results %>%
  show_best(metric = "rmse", n = 5, maximize = FALSE)
```

---
template: show-best

```
# A tibble: 5 x 6
  neighbors .metric .estimator   mean     n std_err
      <int> <chr>   <chr>       <dbl> <int>   <dbl>
1        16 rmse    standard   35524.    10   1044.
2        17 rmse    standard   35530.    10   1057.
3        15 rmse    standard   35533.    10   1030.
4        18 rmse    standard   35543.    10   1068.
5        19 rmse    standard   35557.    10   1078.
```

---
class: middle, center

# `autoplot()`

Quickly visualize tuning results

```r
knn_results %>% autoplot()
```

<img src="figs/04-Tune/knn-plot-1.png" width="504" />

---
class: middle, center

<img src="figs/04-Tune/unnamed-chunk-21-1.png" width="504" />

---

# You can tune models *and* recipes!

---
class: your-turn

# Your Turn 5

Modify the PCA workflow provided to find the best value for `num_comp` on the grid given. Which is it? Use `show_best()` to see. Save the output of the fit function as `pca_results`.
05:00
---

```r
lm_spec <- linear_reg() %>%
  set_engine("lm")

pca_tuner <- recipe(Sale_Price ~ ., data = ames) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())

pca_twf <- workflow() %>%
  add_recipe(pca_tuner) %>%
  add_model(lm_spec)

nc10_40 <- expand_grid(num_comp = c(10,20,30,40))

pca_results <- pca_twf %>%
  tune_grid(resamples = cv_folds, grid = nc10_40)

pca_results %>%
  show_best(metric = "rmse", maximize = FALSE)
```

---

```
# A tibble: 4 x 6
  num_comp .metric .estimator   mean     n std_err
     <dbl> <chr>   <chr>       <dbl> <int>   <dbl>
1       40 rmse    standard   32384.    10   2184.
2       30 rmse    standard   33549.    10   2089.
3       20 rmse    standard   33997.    10   2063.
4       10 rmse    standard   36081.    10   1881.
```

---

```r
library(modeldata)
data(stackoverflow)

# split the data
set.seed(100) # Important!
so_split <- initial_split(stackoverflow, strata = Remote)
so_train <- training(so_split)
so_test <- testing(so_split)

set.seed(100) # Important!
so_folds <- vfold_cv(so_train, v = 10, strata = Remote)
```

---
class: your-turn

# Your Turn 6

Here's a new recipe (also in your .Rmd)…

```r
so_rec <- recipe(Remote ~ ., data = so_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_lincomb(all_predictors()) %>%
  step_downsample(Remote)
```

---
class: your-turn

# Your Turn 6

…and a new model plus workflow. Can you tell what type of model this is?…

```r
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_recipe(so_rec) %>%
  add_model(rf_spec)
```

---
class: your-turn

# Your Turn 6

Here is the output from `fit_resamples()`...

```r
rf_results <- rf_wf %>%
  fit_resamples(resamples = so_folds,
                metrics = metric_set(roc_auc))

rf_results %>%
  collect_metrics(summarize = TRUE)

# A tibble: 1 x 5
  .metric .estimator  mean     n std_err
  <chr>   <chr>      <dbl> <int>   <dbl>
1 roc_auc binary     0.684    10  0.0165
```

---
class: your-turn

# Your Turn 6

Edit the random forest model to tune the `mtry` and `min_n` hyper-parameters; call the new model spec `rf_tuner`.

Update the model for your workflow; save it as `rf_twf`.

Tune the workflow to `so_folds` and show the best combination of hyper-parameters to maximize `roc_auc`.

How does it compare to the average ROC AUC across folds from `fit_resamples()`?
10:00
---

```r
rf_tuner <- rand_forest(mtry = tune(),
                        min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_twf <- rf_wf %>%
  update_model(rf_tuner)

rf_results <- rf_twf %>%
  tune_grid(resamples = so_folds)

i Creating pre-processing data to finalize unknown parameter: mtry
```

---
class: middle, center

# `metric_set()`

A helper function for selecting yardstick metric functions.

```r
metric_set(roc_auc, sens, spec)
```

.footnote[an optional `tune_grid()` usage sketch follows the Your Turn 7 prompt]

---

# What next?

---
class: middle
name: show-best

.center[
# `show_best()`

Shows the .display[n] best combinations of hyper-parameters.
]

```r
rf_results %>%
  show_best(metric = "roc_auc")

# A tibble: 5 x 7
   mtry min_n .metric .estimator  mean     n std_err
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
1     1    33 roc_auc binary     0.690    10  0.0182
2     4    17 roc_auc binary     0.689    10  0.0169
3     8    32 roc_auc binary     0.686    10  0.0189
4    17    38 roc_auc binary     0.682    10  0.0200
5    13    24 roc_auc binary     0.679    10  0.0198
```

---
class: middle
name: select-best

.center[
# `select_best()`

Shows the .display[top] combination of hyper-parameters.
]

```r
so_best <- rf_results %>%
  select_best(metric = "roc_auc")

so_best
```

---
template: select-best

```
# A tibble: 1 x 2
   mtry min_n
  <int> <int>
1     1    33
```

---

.center[
# `finalize_workflow()`

Replaces `tune()` placeholders in a model/recipe/workflow with a set of hyper-parameter values.
]

```r
so_wfl_final <- rf_twf %>%
  finalize_workflow(so_best)
```

---
class: middle, center

# The test set

Remember me?

---
class: middle

.center[
# `fit_split()`

Remember me?
]

```r
so_test_results <- so_wfl_final %>%
  fit_split(split = so_split)
```

---

```r
so_test_results

# # Monte Carlo cross-validation (0.75/0.25) with 1 resamples
# A tibble: 1 x 6
  splits        id          .metrics      .notes      .predictions     .workflow
* <list>        <chr>       <list>        <list>      <list>           <list>
1 <split [4.2K… train/test… <tibble [2 ×… <tibble [0… <tibble [1,398 … <workflo…
```

---
class: your-turn

# Your Turn 7

Use `select_best()`, `finalize_workflow()`, and `fit_split()` to take the best combination of hyper-parameters from `rf_results` and use them to predict the test set.

How does our actual test ROC AUC compare to our cross-validated estimate?
05:00
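---
class: middle

.center[
# Aside: your own grid and metrics
]

An optional aside, not one of the exercises: a minimal sketch of passing an explicit grid and a `metric_set()` to `tune_grid()`, assuming the `rf_twf` workflow and `so_folds` resamples from Your Turn 6. The `mtry`/`min_n` values here are illustrative, not tuned recommendations.

```r
# Sketch (illustrative values): an explicit grid replaces the default
# grid = 10 candidates, and a metric set tracks several yardstick
# metrics while tuning.
rf_grid <- expand_grid(mtry  = c(1, 5, 10, 20),
                       min_n = c(5, 15, 25, 35))

rf_grid_results <- rf_twf %>%
  tune_grid(resamples = so_folds,
            grid      = rf_grid,
            metrics   = metric_set(roc_auc, sens, spec))

rf_grid_results %>% show_best(metric = "roc_auc")
```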
---

```r
so_best <- rf_results %>%
  select_best(metric = "roc_auc")

so_wfl_final <- rf_twf %>%
  finalize_workflow(so_best)

so_test_results <- so_wfl_final %>%
  fit_split(split = so_split)

so_test_results %>%
  collect_metrics()
```

---

# final final final

---
class: middle

.center[
# Final metrics
]

```r
so_test_results %>%
  collect_metrics()

# A tibble: 2 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.665
2 roc_auc  binary         0.709
```

---
class: middle

.center[
# Predict the test set
]

```r
so_test_results %>%
  collect_predictions()

# A tibble: 1,398 x 6
   id               .pred_Remote `.pred_Not remote`  .row .pred_class Remote
   <chr>                   <dbl>              <dbl> <int> <fct>       <fct>
 1 train/test split        0.479              0.521     1 Not remote  Remote
 2 train/test split        0.450              0.550     6 Not remote  Not remote
 3 train/test split        0.409              0.591    18 Not remote  Not remote
 4 train/test split        0.582              0.418    23 Remote      Not remote
 5 train/test split        0.513              0.487    30 Remote      Not remote
 6 train/test split        0.521              0.479    50 Remote      Not remote
 7 train/test split        0.609              0.391    53 Remote      Not remote
 8 train/test split        0.504              0.496    56 Remote      Not remote
 9 train/test split        0.552              0.448    63 Remote      Not remote
10 train/test split        0.418              0.582    68 Not remote  Not remote
# … with 1,388 more rows
```

---

```r
roc_values <- so_test_results %>%
  collect_predictions() %>%
  roc_curve(truth = Remote, estimate = .pred_Remote)

autoplot(roc_values)
```

<img src="figs/04-Tune/unnamed-chunk-40-1.png" width="504" />

---

# Mea Culpa

.pull-left[
```r
fit_split(
  object,
  split,
  ...,
  metrics = NULL
)
```
]

.pull-right[
```r
last_fit(
  object,
  split,
  ...,
  metrics = NULL
)
```

From the tune package
]
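---
class: middle

.center[
# `last_fit()`
]

A minimal sketch of the same final fit written with `last_fit()`, the tune-package counterpart shown on the Mea Culpa slide, assuming the `so_wfl_final` workflow and `so_split` split created earlier.

```r
# Sketch: last_fit() fits the finalized workflow on the training set
# and evaluates it once on the held-out test set.
so_last <- so_wfl_final %>%
  last_fit(split = so_split)

so_last %>% collect_metrics()      # test set metrics
so_last %>% collect_predictions()  # test set predictions
```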