class: title-slide, center

<span class="fa-stack fa-4x">
  <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i>
  <strong class="fa-stack-1x" style="color:#E7553C;">5</strong>
</span>

# Workflows

## Machine Learning in the Tidyverse

### Alison Hill · Garrett Grolemund

#### [https://conf20-intro-ml.netlify.com/](https://conf20-intro-ml.netlify.com/) · [https://rstd.io/conf20-intro-ml](https://rstd.io/conf20-intro-ml)

---
background-image: url(images/daan-mooij-91LGCVN5SAI-unsplash.jpg)
background-size: cover

---
class: middle, center, inverse

# ⚠️ Data Leakage ⚠️

---

### What will this code do?

```r
ames_zsplit <- ames %>%
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price)) %>%
  initial_split()
```

--

```
## # A tibble: 2,198 x 2
##    Sale_Price  z_price
##         <int>    <dbl>
##  1     105000 -0.949
##  2     172000 -0.110
##  3     244000  0.791
##  4     213500  0.409
##  5     191500  0.134
##  6     236500  0.697
##  7     189000  0.103
##  8     175900 -0.0613
##  9     185000  0.0526
## 10     180400 -0.00496
## # … with 2,188 more rows
```

---

# Quiz

What could go wrong?

1. Take the `mean` and `sd` of `Sale_Price`

1. Transform all sale prices in `ames`

1. Train with training set

1. Predict sale prices with testing set

---

# What (else) could go wrong?

```r
ames_train <- training(ames_split) %>%
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price))

ames_test <- testing(ames_split) %>%
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price))

lm_fit <- fit_data(Sale_Price ~ Gr_Liv_Area,
                   model = lm_spec,
                   data = ames_train)

price_pred <- lm_fit %>%
  predict(new_data = ames_test) %>%
  mutate(price_truth = ames_test$Sale_Price)

rmse(price_pred, truth = price_truth, estimate = .pred)
```

---

# Better

1. Split the data

1. Transform training set sale prices based on `mean` and `sd` of `Sale_Price` of the training set

1. Train with training set

1. Transform testing set sale prices based on `mean` and `sd` of `Sale_Price` of the **training set**

1. Predict sale prices with testing set

---
class: middle, center, frame

# Data Leakage

"When the data you are using to train a machine learning algorithm happens to have the information you are trying to predict."

.footnote[Daniel Gutierrez, [Ask a Data Scientist: Data Leakage](http://insidebigdata.com/2014/11/26/ask-data-scientist-data-leakage/)]

---
class: middle, center, frame

# Axiom

Your learner is more than a model.

---
class: middle, center, frame

# Lemma #1

Your learner is more than a model.

--

Your learner is only as good as your data.

---
class: middle, center, frame

# Lemma #2

Your learner is more than a model.

Your learner is only as good as your data.

--

Your data is only as good as your workflow.

---
class: middle, center

<img src="images/pink-thunder.png" width="618" />

---
class: middle, center, frame

# **Revised** Goal of Machine Learning

--

Build reliable workflows

--

that generate accurate predictions

--

for future, yet-to-be-seen data.

---
class: middle, center, frame

# Quiz

What does GIGO stand for?

--

Garbage in, garbage out

---
class: center, middle, frame

# Axiom

Feature engineering and modeling are two halves of a single predictive workflow.
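---

# Sketch: scaling without leakage

A minimal sketch of the "Better" steps, assuming a simulated data frame (`toy_homes`, a hypothetical stand-in for `ames`). The scaling constants are learned from the training set only and then reused for the testing set.

```r
library(rsample)
library(dplyr)

set.seed(100)

# Hypothetical stand-in for ames: simulated living areas and prices
toy_homes <- tibble(
  Gr_Liv_Area = round(runif(200, min = 600, max = 3000)),
  Sale_Price  = round(50000 + 90 * Gr_Liv_Area + rnorm(200, sd = 20000))
)

# 1. Split the data first
toy_split <- initial_split(toy_homes)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# 2. Learn the mean and sd from the training set only
train_mean <- mean(toy_train$Sale_Price)
train_sd   <- sd(toy_train$Sale_Price)

# 3. Apply the same training-set constants to both sets
toy_train <- toy_train %>%
  mutate(z_price = (Sale_Price - train_mean) / train_sd)

toy_test <- toy_test %>%
  mutate(z_price = (Sale_Price - train_mean) / train_sd)
```

The testing set never contributes to `train_mean` or `train_sd`, so nothing about it leaks into the transformation.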
---
background-image: url(images/workflows/workflows.001.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.002.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.003.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.004.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.005.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.006.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.007.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.008.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.009.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.010.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.011.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.012.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.013.jpeg)
background-size: contain

---
class: center, middle, inverse

# Workflows

---
class: middle, center

# `workflow()`

Creates an empty workflow, a container to which you add a model and more

```r
workflow()
```

---
class: middle, center

# `add_formula()`

Adds a formula to a workflow `*`

```r
workflow() %>% add_formula(Sale_Price ~ Year)
```

.footnote[`*` If you do not plan to do your own preprocessing]

---
class: middle, center

# `add_model()`

Adds a parsnip model spec to a workflow

```r
workflow() %>% add_model(lm_spec)
```

---
background-image: url(images/zestimate.png)
background-position: center
background-size: contain

---
class: your-turn

# Your Turn 1

Build a workflow that uses a linear model to predict `Sale_Price` with `Bedroom_AbvGr`, `Full_Bath` and `Half_Bath` in `ames`. Save it as `bb_wf`.
---

```r
lm_spec <-
  linear_reg() %>%
  set_engine("lm")

bb_wf <-
  workflow() %>%
  add_formula(Sale_Price ~ Bedroom_AbvGr + Full_Bath + Half_Bath) %>%
  add_model(lm_spec)
```

---

```r
bb_wf
## ══ Workflow ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
##
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Sale_Price ~ Bedroom_AbvGr + Full_Bath + Half_Bath
##
## ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
```

---

`fit_data()` and `fit_split()` also use workflows. Pass a workflow in place of a formula and model.

.pull-left[
```r
fit_split(
* Sale_Price ~ Bedroom_AbvGr +
*   Full_Bath + Half_Bath,
* model = lm_spec,
  split = ames_split
)
```
]

.pull-right[
```r
fit_split(
* bb_wf,
  split = ames_split
)
```
]

---
class: middle, center

# `update_formula()`

Removes the formula, then replaces with the new one.

```r
workflow() %>% update_formula(Sale_Price ~ Bedroom_AbvGr)
```

---
class: your-turn

# Your Turn 2

Test the linear model that predicts `Sale_Price` with everything else in `ames` on `ames_split`. What RMSE do you get?

Hint: Create a new workflow by updating `bb_wf`.
---

```r
all_wf <-
  bb_wf %>%
  update_formula(Sale_Price ~ .)

fit_split(all_wf, split = ames_split) %>%
  collect_metrics()
## ! Resample1: model (predictions): prediction from a rank-deficient fit may be misleading
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   22701.
## 2 rsq     standard       0.923
```

---
class: middle, center

# `update_model()`

Removes the model spec, then replaces with the new one.

```r
workflow() %>% update_model(knn_spec)
```

---
class: your-turn

# Your Turn 3

Fill in the blanks to test the regression tree model that predicts `Sale_Price` with _everything else in `ames`_ on `ames_split`. What RMSE do you get?

Hint: Create a new workflow by updating `all_wf`.
---

```r
rt_spec <-
  decision_tree() %>%
  set_engine(engine = "rpart") %>%
  set_mode("regression")

rt_wf <-
  all_wf %>%
  update_model(rt_spec)

fit_split(rt_wf, split = ames_split) %>%
  collect_metrics()
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   42678.
## 2 rsq     standard       0.727
```

---
class: your-turn

# Your Turn 4

But what about the predictions of our model?

Save the fitted object from your regression tree, and use `collect_predictions()` to see the predictions generated from the test data.
---

```r
all_fitwf <- fit_split(rt_wf, split = ames_split)

all_fitwf %>%
  collect_predictions()
## # A tibble: 732 x 4
##    id                 .pred  .row Sale_Price
##    <chr>              <dbl> <int>      <int>
##  1 train/test split 190775.     1     215000
##  2 train/test split 108409.     2     105000
##  3 train/test split 252556.     4     244000
##  4 train/test split 155275.    11     175900
##  5 train/test split 339239.    16     538000
##  6 train/test split 351391.    18     394432
##  7 train/test split 138151.    26     142000
##  8 train/test split 108409.    30      96000
##  9 train/test split 192131.    56     216500
## 10 train/test split 252556.    65     221000
## # … with 722 more rows
```

---

# Quiz

Another tibble with list columns!

```r
all_fitwf
## # # Monte Carlo cross-validation (0.75/0.25) with 1 resamples
## # A tibble: 1 x 6
##   splits        id           .metrics      .notes      .predictions    .workflow
## * <list>        <chr>        <list>        <list>      <list>          <list>
## 1 <split [2.2K… train/test … <tibble [2 ×… <tibble [0… <tibble [732 ×… <workflo…
```

--

How can we expand a single row in a list column to see what is in it?

---

```r
all_fitwf %>%
  pluck(".workflow", 1)
## ══ Workflow ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: decision_tree()
##
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Sale_Price ~ .
##
## ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## n= 2198
##
## node), split, n, deviance, yval
##       * denotes terminal node
##
##  1) root 2198 13813560000000 180960.9
##    2) Garage_Cars< 2.5 1905 5695476000000 161849.2
##      4) Gr_Liv_Area< 1416.5 1024 1262826000000 133918.8
##        8) Year_Built< 1976.5 741 629702500000 121734.8
##         16) Total_Bsmt_SF< 908.5 409 287201900000 108409.2 *
##         17) Total_Bsmt_SF>=908.5 332 180402800000 138151.0 *
##        9) Year_Built>=1976.5 283 235093500000 165821.3 *
##      5) Gr_Liv_Area>=1416.5 881 2705332000000 194313.1
##       10) Exter_QualTypical>=0.5 479 840585200000 169653.2
##         20) BsmtFin_SF_1>=3.5 285 398958500000 155275.4 *
##         21) BsmtFin_SF_1< 3.5 194 296159600000 190775.3 *
##       11) Exter_QualTypical< 0.5 402 1226384000000 223696.4
##         22) Total_Bsmt_SF< 1015 192 327074400000 192131.0 *
##         23) Total_Bsmt_SF>=1015 210 533098200000 252556.3 *
##    3) Garage_Cars>=2.5 293 2898272000000 305219.8
##      6) Total_Bsmt_SF< 1716.5 204 1018322000000 268711.9
##       12) Year_Remod_Add< 1977.5 26 31649720000 154457.7 *
##       13) Year_Remod_Add>=1977.5 178 597691700000 285400.7
##         26) Gr_Liv_Area< 2322 121 208583900000 260039.1 *
##         27) Gr_Liv_Area>=2322 57 146064400000 339238.5 *
##      7) Total_Bsmt_SF>=1716.5 89 984832300000 388900.7
##       14) Gr_Liv_Area< 2187 60 237424700000 351391.3 *
##       15) Gr_Liv_Area>=2187 29 488334000000 466506.3
##         30) Latitude< 42.05321 7 117163300000 329621.3 *
##         31) Latitude>=42.05321 22 198274700000 510060.6 *
```

---
class: middle

# .center[`pull_workflow_fit()`]

.center[Returns the parsnip model fit.]

```r
all_fitwf %>%
  pluck(".workflow", 1) %>%
  pull_workflow_fit()
```

--

.footnote[Pipe to `pluck("fit")` to get the non-parsnip fit back. Useful for plotting.]
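---

# Sketch: from workflow to engine fit

A minimal sketch of the footnote above, assuming the built-in `mtcars` data in place of `ames` and a recent release of workflows, where `extract_fit_parsnip()` is the newer name for what this deck calls `pull_workflow_fit()`. The object names (`toy_wf`, `toy_fit`) are hypothetical.

```r
library(tidymodels)

rt_spec <-
  decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

# Assumption: mtcars stands in for ames so the sketch runs on its own
toy_wf <-
  workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(rt_spec)

# Fit the whole workflow (formula + model) in one step
toy_fit <- fit(toy_wf, data = mtcars)

# The parsnip model fit stored inside the fitted workflow
parsnip_fit <- extract_fit_parsnip(toy_fit)

# The underlying rpart object, as in the footnote's pluck("fit")
rpart_fit <- parsnip_fit %>% pluck("fit")

class(rpart_fit)
```

The `rpart` object can then be handed to plotting tools that expect a plain engine fit rather than a parsnip wrapper.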
---

```r
all_fitwf %>%
  pluck(".workflow", 1) %>%
  pull_workflow_fit()
## parsnip model object
##
## Fit time:  544ms
## n= 2198
##
## node), split, n, deviance, yval
##       * denotes terminal node
##
##  1) root 2198 13813560000000 180960.9
##    2) Garage_Cars< 2.5 1905 5695476000000 161849.2
##      4) Gr_Liv_Area< 1416.5 1024 1262826000000 133918.8
##        8) Year_Built< 1976.5 741 629702500000 121734.8
##         16) Total_Bsmt_SF< 908.5 409 287201900000 108409.2 *
##         17) Total_Bsmt_SF>=908.5 332 180402800000 138151.0 *
##        9) Year_Built>=1976.5 283 235093500000 165821.3 *
##      5) Gr_Liv_Area>=1416.5 881 2705332000000 194313.1
##       10) Exter_QualTypical>=0.5 479 840585200000 169653.2
##         20) BsmtFin_SF_1>=3.5 285 398958500000 155275.4 *
##         21) BsmtFin_SF_1< 3.5 194 296159600000 190775.3 *
##       11) Exter_QualTypical< 0.5 402 1226384000000 223696.4
##         22) Total_Bsmt_SF< 1015 192 327074400000 192131.0 *
##         23) Total_Bsmt_SF>=1015 210 533098200000 252556.3 *
##    3) Garage_Cars>=2.5 293 2898272000000 305219.8
##      6) Total_Bsmt_SF< 1716.5 204 1018322000000 268711.9
##       12) Year_Remod_Add< 1977.5 26 31649720000 154457.7 *
##       13) Year_Remod_Add>=1977.5 178 597691700000 285400.7
##         26) Gr_Liv_Area< 2322 121 208583900000 260039.1 *
##         27) Gr_Liv_Area>=2322 57 146064400000 339238.5 *
##      7) Total_Bsmt_SF>=1716.5 89 984832300000 388900.7
##       14) Gr_Liv_Area< 2187 60 237424700000 351391.3 *
##       15) Gr_Liv_Area>=2187 29 488334000000 466506.3
##         30) Latitude< 42.05321 7 117163300000 329621.3 *
##         31) Latitude>=42.05321 22 198274700000 510060.6 *
```

---
class: middle

# .center[`pull_workflow_spec()`]

.center[Returns the parsnip model specification.]

```r
all_fitwf %>%
  pluck(".workflow", 1) %>%
  pull_workflow_spec()
```

---

```r
all_fitwf %>%
  pluck(".workflow", 1) %>%
  pull_workflow_spec()
## Decision Tree Model Specification (regression)
##
## Computational engine: rpart
```
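---

# Sketch: the whole workflow, end to end

A minimal recap sketch, assuming the built-in `mtcars` data in place of `ames`. `fit_split()` is a helper used in this workshop; the sketch swaps in `last_fit()` from the tune package, which likewise fits a workflow on the training portion of a split and evaluates it on the testing portion. The object names (`car_split`, `lm_wf`, `rt_wf`) are hypothetical.

```r
library(tidymodels)

set.seed(100)

# Assumption: mtcars stands in for ames so the sketch runs on its own
car_split <- initial_split(mtcars)

lm_spec <-
  linear_reg() %>%
  set_engine("lm")

rt_spec <-
  decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

# Build a workflow from a formula and a model spec
lm_wf <-
  workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(lm_spec)

# Derive a second workflow by swapping the formula and the model
rt_wf <-
  lm_wf %>%
  update_formula(mpg ~ .) %>%
  update_model(rt_spec)

# Fit on the training set, evaluate on the testing set
lm_metrics <- last_fit(lm_wf, split = car_split) %>% collect_metrics()
rt_metrics <- last_fit(rt_wf, split = car_split) %>% collect_metrics()

lm_metrics
rt_metrics
```

Because preprocessing and model travel together in one workflow object, the same split can be reused across candidate models without re-deriving anything from the testing set.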