Workflows

class: title-slide, center

<span class="fa-stack fa-4x">
  <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i>
  <strong class="fa-stack-1x" style="color:#E7553C;">5</strong>
</span>

# Workflows

## Machine Learning in the Tidyverse

### Alison Hill &#183; Garrett Grolemund

#### [https://conf20-intro-ml.netlify.com/](https://conf20-intro-ml.netlify.com/) &#183; [https://rstd.io/conf20-intro-ml](https://rstd.io/conf20-intro-ml)                                   
---
background-image: url(images/daan-mooij-91LGCVN5SAI-unsplash.jpg)
background-size: cover

---
class: middle, center, inverse

# ⚠️ Data Leakage ⚠️

---

### What will this code do?

```r
ames_zsplit <- ames %>% 
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price)) %>% 
  initial_split()
```

--

```
## # A tibble: 2,198 x 2
##    Sale_Price  z_price
##         <int>    <dbl>
##  1     105000 -0.949  
##  2     172000 -0.110  
##  3     244000  0.791  
##  4     213500  0.409  
##  5     191500  0.134  
##  6     236500  0.697  
##  7     189000  0.103  
##  8     175900 -0.0613 
##  9     185000  0.0526 
## 10     180400 -0.00496
## # … with 2,188 more rows
```

---

# Quiz

What could go wrong?

1. Take the `mean` and `sd` of `Sale_Price`

1. Transform all sale prices in `ames`

1. Train with training set

1. Predict sale prices with testing set

---

# What (else) could go wrong?

```r
ames_train <- training(ames_split) %>% 
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price))

ames_test <- testing(ames_split) %>% 
  mutate(z_price = (Sale_Price - mean(Sale_Price)) / sd(Sale_Price))

lm_fit <- fit_data(Sale_Price ~ Gr_Liv_Area, 
                   model = lm_spec, 
                   data = ames_train)

price_pred  <- lm_fit %>% 
  predict(new_data = ames_test) %>% 
  mutate(price_truth = ames_test$Sale_Price)

rmse(price_pred, truth = price_truth, estimate = .pred)
```

---

# Better

1. Split the data

1. Transform training set sale prices based on `mean` and `sd` of `Sale_Price` of the training set

1. Train with training set

1. Transform testing set sale prices based on `mean` and `sd` of `Sale_Price` of the **training set**

1. Predict sale prices with testing set
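
---

The steps above could be sketched roughly as follows. This is a minimal illustration, not the workshop's own solution: it assumes `ames` is loaded from the modeldata package, and the seed and the names `train_mean` and `train_sd` are hypothetical.

```r
library(tidymodels)  # attaches rsample, dplyr, and friends

# Assumption: `ames` comes from modeldata; the workshop may load it elsewhere
data(ames, package = "modeldata")

set.seed(100)  # arbitrary seed, for reproducibility only
ames_split <- initial_split(ames)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# Learn the scaling constants from the training set only
train_mean <- mean(ames_train$Sale_Price)
train_sd   <- sd(ames_train$Sale_Price)

# Apply the same training-set constants to both sets
ames_train <- ames_train %>% 
  mutate(z_price = (Sale_Price - train_mean) / train_sd)

ames_test <- ames_test %>% 
  mutate(z_price = (Sale_Price - train_mean) / train_sd)
```

Because the testing set never contributes to `train_mean` or `train_sd`, nothing about it leaks into the transformation.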

---
class: middle, center, frame

# Data Leakage

"When the data you are using to train a machine learning algorithm happens to have the information you are trying to predict."

.footnote[Daniel Gutierrez, [Ask a Data Scientist: Data Leakage](http://insidebigdata.com/2014/11/26/ask-data-scientist-data-leakage/)]

---
class: middle, center, frame

# Axiom

Your learner is more than a model.

---
class: middle, center, frame

# Lemma #1

Your learner is more than a model.

--

Your learner is only as good as your data.

---
class: middle, center, frame

# Lemma #2

Your learner is more than a model.

Your learner is only as good as your data.

--

Your data is only as good as your workflow.

---
class: middle, center

<img src="images/pink-thunder.png" width="618" />

---
class: middle, center, frame

# **Revised** Goal of Machine Learning

--

Build reliable workflows

--

that generate accurate predictions

--

for future, yet-to-be-seen data.

---
class: middle, center, frame

# Quiz

What does GIGO stand for?

--

Garbage in, garbage out

---
class: center, middle, frame

# Axiom

Feature engineering and modeling are two halves of a single predictive workflow.

---
background-image: url(images/workflows/workflows.001.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.002.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.003.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.004.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.005.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.006.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.007.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.008.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.009.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.010.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.011.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.012.jpeg)
background-size: contain

---
background-image: url(images/workflows/workflows.013.jpeg)
background-size: contain

---
class: center, middle, inverse

# Workflows

---
class: middle, center

# `workflow()`

Creates an empty workflow, to which you can add a model and a preprocessor

```r
workflow()
```

---
class: middle, center

# `add_formula()`

Adds a formula to a workflow `*`

```r
workflow() %>% add_formula(Sale_Price ~ Year)
```

.footnote[`*` If you do not plan to do your own preprocessing]

---
class: middle, center

# `add_model()`

Adds a parsnip model spec to a workflow

```r
workflow() %>% add_model(lm_spec)
```
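
---

As a rough sketch of how these pieces combine, a formula and a model spec can be added to the same workflow, which can then be fit as one unit. The example assumes `ames` comes from the modeldata package and uses the generic `fit()` from workflows rather than the workshop helpers; `area_wf` and `area_fit` are hypothetical names.

```r
library(tidymodels)

# Assumption: `ames` comes from modeldata
data(ames, package = "modeldata")

lm_spec <- 
  linear_reg() %>% 
  set_engine("lm")

# A workflow bundles the preprocessor (here a formula) with the model spec
area_wf <- 
  workflow() %>% 
  add_formula(Sale_Price ~ Gr_Liv_Area) %>% 
  add_model(lm_spec)

# fit() trains the whole workflow; predict() reuses it on new data
area_fit <- fit(area_wf, data = ames)
area_pred <- predict(area_fit, new_data = head(ames))
area_pred
```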

---
background-image: url(images/zestimate.png)
background-position: center
background-size: contain

---
class: your-turn

# Your Turn 1

Build a workflow that uses a linear model to predict `Sale_Price` with `Bedroom_AbvGr`, `Full_Bath` and `Half_Bath` in `ames`. Save it as `bb_wf`.

<div class="countdown" id="timer_5e46e729" style="right:0;bottom:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">03</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

---

```r
lm_spec <- 
  linear_reg() %>% 
  set_engine("lm")

bb_wf <- 
  workflow() %>% 
  add_formula(Sale_Price ~ Bedroom_AbvGr + 
              Full_Bath + Half_Bath) %>% 
  add_model(lm_spec)
```

---

```r
bb_wf
## ══ Workflow ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## 
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Sale_Price ~ Bedroom_AbvGr + Full_Bath + Half_Bath
## 
## ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

---

`fit_data()` and `fit_split()` also accept workflows. Pass a workflow in place of the formula and model arguments.

.pull-left[

```r
fit_split(
* Sale_Price ~ Bedroom_AbvGr +
*   Full_Bath + Half_Bath,
* model = lm_spec,
  split = ames_split
)
```

]

.pull-right[

```r
fit_split(
* bb_wf,
  split = ames_split
)
```
]
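
---

If `fit_split()` is not available in your session, `last_fit()` from the tune package takes a workflow and a split in much the same way. The sketch below assumes a recent tidymodels installation and `ames` from modeldata; the seed and the name `bb_res` are hypothetical.

```r
library(tidymodels)

# Assumption: `ames` comes from modeldata
data(ames, package = "modeldata")

set.seed(100)  # arbitrary seed
ames_split <- initial_split(ames)

lm_spec <- 
  linear_reg() %>% 
  set_engine("lm")

bb_wf <- 
  workflow() %>% 
  add_formula(Sale_Price ~ Bedroom_AbvGr + 
              Full_Bath + Half_Bath) %>% 
  add_model(lm_spec)

# Fit on the training portion, then evaluate on the testing portion
bb_res <- last_fit(bb_wf, split = ames_split)
collect_metrics(bb_res)
```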

---
class: middle, center

# `update_formula()`

Removes the formula, then replaces it with the new one.

```r
workflow() %>% update_formula(Sale_Price ~ Bedroom_AbvGr)
```

---
class: your-turn

# Your Turn 2

Test the linear model that predicts `Sale_Price` with everything else in `ames` on `ames_split`. What RMSE do you get?

Hint: Create a new workflow by updating `bb_wf`.

<div class="countdown" id="timer_5e46e7fc" style="right:0;bottom:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">04</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

---

```r
all_wf <- 
  bb_wf %>% 
  update_formula(Sale_Price ~ .)

fit_split(all_wf, split = ames_split) %>% 
  collect_metrics()
## ! Resample1: model (predictions): prediction from a rank-deficient fit may be misleading
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   22701.   
## 2 rsq     standard       0.923
```

---
class: middle, center

# `update_model()`

Removes the model spec, then replaces it with the new one.

```r
workflow() %>% update_model(knn_spec)
```
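
---

Both update functions are normally applied to a workflow that already has a formula and a model. A minimal sketch, in which `knn_spec` is a hypothetical stand-in for the spec named on the slide and `lm_wf` and `knn_wf` are illustrative names:

```r
library(tidymodels)

lm_spec <- 
  linear_reg() %>% 
  set_engine("lm")

# Hypothetical definition of `knn_spec`; it is only specified here, never fit
knn_spec <- 
  nearest_neighbor() %>% 
  set_engine("kknn") %>% 
  set_mode("regression")

lm_wf <- 
  workflow() %>% 
  add_formula(Sale_Price ~ Gr_Liv_Area) %>% 
  add_model(lm_spec)

# update_*() returns a modified copy; lm_wf itself is unchanged
knn_wf <- 
  lm_wf %>% 
  update_formula(Sale_Price ~ Gr_Liv_Area + Year_Built) %>% 
  update_model(knn_spec)
```

Keeping the original workflow intact makes it easy to compare several variants that differ in only one piece.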

---
class: your-turn

# Your Turn 3

Fill in the blanks to test the regression tree model that predicts `Sale_Price` with _everything else in `ames`_ on `ames_split`. What RMSE do you get?

Hint: Create a new workflow by updating `all_wf`.

<div class="countdown" id="timer_5e46e68e" style="right:0;bottom:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">04</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

---

```r
rt_spec <- 
  decision_tree() %>%          
  set_engine(engine = "rpart") %>% 
  set_mode("regression")

rt_wf <- 
  all_wf %>% 
  update_model(rt_spec)

fit_split(rt_wf, split = ames_split) %>% 
  collect_metrics()
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   42678.   
## 2 rsq     standard       0.727
```

---
class: your-turn

# Your Turn 4

But what about the predictions of our model?

Save the fitted object from your regression tree, and use `collect_predictions()` to see the predictions generated from the test data.

<div class="countdown" id="timer_5e46e4b9" style="right:0;bottom:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">03</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

---

```r
all_fitwf <- fit_split(rt_wf, split = ames_split)
all_fitwf %>% 
  collect_predictions()
## # A tibble: 732 x 4
##    id                 .pred  .row Sale_Price
##    <chr>              <dbl> <int>      <int>
##  1 train/test split 190775.     1     215000
##  2 train/test split 108409.     2     105000
##  3 train/test split 252556.     4     244000
##  4 train/test split 155275.    11     175900
##  5 train/test split 339239.    16     538000
##  6 train/test split 351391.    18     394432
##  7 train/test split 138151.    26     142000
##  8 train/test split 108409.    30      96000
##  9 train/test split 192131.    56     216500
## 10 train/test split 252556.    65     221000
## # … with 722 more rows
```
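
---

The collected predictions can also be scored by hand, which is one way to check where the reported RMSE comes from. This sketch assumes `ames` from modeldata and uses tune's `last_fit()` in place of the workshop's `fit_split()`; the smaller formula and the names `rt_res` and `rt_preds` are hypothetical.

```r
library(tidymodels)

# Assumption: `ames` comes from modeldata
data(ames, package = "modeldata")

set.seed(100)  # arbitrary seed
ames_split <- initial_split(ames)

rt_spec <- 
  decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

rt_wf <- 
  workflow() %>% 
  add_formula(Sale_Price ~ Gr_Liv_Area + Year_Built) %>% 
  add_model(rt_spec)

rt_res <- last_fit(rt_wf, split = ames_split)

# Recompute RMSE directly from the held-out predictions
rt_preds <- collect_predictions(rt_res)
rmse(rt_preds, truth = Sale_Price, estimate = .pred)
```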

---

# Quiz

Another tibble with list columns!

```r
all_fitwf
## # # Monte Carlo cross-validation (0.75/0.25) with 1 resamples  
## # A tibble: 1 x 6
##   splits        id           .metrics      .notes      .predictions    .workflow
## * <list>        <chr>        <list>        <list>      <list>          <list>   
## 1 <split [2.2K… train/test … <tibble [2 ×… <tibble [0… <tibble [732 ×… <workflo…
```

--

How can we expand a single row in a list column to see what is in it?

---

```r
all_fitwf %>% 
  pluck(".workflow", 1)
## ══ Workflow ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: decision_tree()
## 
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Sale_Price ~ .
## 
## ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## n= 2198 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 2198 13813560000000 180960.9  
##    2) Garage_Cars< 2.5 1905  5695476000000 161849.2  
##      4) Gr_Liv_Area< 1416.5 1024  1262826000000 133918.8  
##        8) Year_Built< 1976.5 741   629702500000 121734.8  
##         16) Total_Bsmt_SF< 908.5 409   287201900000 108409.2 *
##         17) Total_Bsmt_SF>=908.5 332   180402800000 138151.0 *
##        9) Year_Built>=1976.5 283   235093500000 165821.3 *
##      5) Gr_Liv_Area>=1416.5 881  2705332000000 194313.1  
##       10) Exter_QualTypical>=0.5 479   840585200000 169653.2  
##         20) BsmtFin_SF_1>=3.5 285   398958500000 155275.4 *
##         21) BsmtFin_SF_1< 3.5 194   296159600000 190775.3 *
##       11) Exter_QualTypical< 0.5 402  1226384000000 223696.4  
##         22) Total_Bsmt_SF< 1015 192   327074400000 192131.0 *
##         23) Total_Bsmt_SF>=1015 210   533098200000 252556.3 *
##    3) Garage_Cars>=2.5 293  2898272000000 305219.8  
##      6) Total_Bsmt_SF< 1716.5 204  1018322000000 268711.9  
##       12) Year_Remod_Add< 1977.5 26    31649720000 154457.7 *
##       13) Year_Remod_Add>=1977.5 178   597691700000 285400.7  
##         26) Gr_Liv_Area< 2322 121   208583900000 260039.1 *
##         27) Gr_Liv_Area>=2322 57   146064400000 339238.5 *
##      7) Total_Bsmt_SF>=1716.5 89   984832300000 388900.7  
##       14) Gr_Liv_Area< 2187 60   237424700000 351391.3 *
##       15) Gr_Liv_Area>=2187 29   488334000000 466506.3  
##         30) Latitude< 42.05321 7   117163300000 329621.3 *
##         31) Latitude>=42.05321 22   198274700000 510060.6 *
```
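
---

To see what `pluck()` is doing in isolation, here is a small sketch on a toy tibble with one list column. The data and the name `toy` are made up for illustration.

```r
library(tibble)
library(purrr)

# A toy tibble with one list column (hypothetical data)
toy <- tibble(
  id    = c("a", "b"),
  stuff = list(1:3, letters[1:2])
)

# pluck(x, "column", i) is shorthand for x[["column"]][[i]]
pluck(toy, "stuff", 1)
## [1] 1 2 3

pluck(toy, "stuff", 2)
## [1] "a" "b"
```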

---
class: middle

# .center[`pull_workflow_fit()`]

.center[Returns the parsnip model fit.]

```r
all_fitwf %>% 
  pluck(".workflow", 1) %>% 
  pull_workflow_fit()
```

--

.footnote[Pipe the result to `pluck("fit")` to get the underlying (non-parsnip) engine fit back, which is useful for plotting.]
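
---

A sketch of that plotting use case, under a few assumptions: `ames` comes from modeldata, the workflow is fit directly with `fit()`, and `extract_fit_parsnip()` (the name newer releases of workflows use for `pull_workflow_fit()`) is available. The formula and object names are hypothetical.

```r
library(tidymodels)

# Assumption: `ames` comes from modeldata
data(ames, package = "modeldata")

rt_spec <- 
  decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

rt_fit <- 
  workflow() %>% 
  add_formula(Sale_Price ~ Gr_Liv_Area + Year_Built) %>% 
  add_model(rt_spec) %>% 
  fit(data = ames)

# Parsnip fit first, then the raw rpart object stored inside it
rpart_fit <- 
  rt_fit %>% 
  extract_fit_parsnip() %>% 
  pluck("fit")

# Base-graphics plot of the tree, using rpart's own methods
plot(rpart_fit)
text(rpart_fit)
```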

---

```r
all_fitwf %>% 
  pluck(".workflow", 1) %>% 
  pull_workflow_fit()
## parsnip model object
## 
## Fit time:  544ms 
## n= 2198 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 2198 13813560000000 180960.9  
##    2) Garage_Cars< 2.5 1905  5695476000000 161849.2  
##      4) Gr_Liv_Area< 1416.5 1024  1262826000000 133918.8  
##        8) Year_Built< 1976.5 741   629702500000 121734.8  
##         16) Total_Bsmt_SF< 908.5 409   287201900000 108409.2 *
##         17) Total_Bsmt_SF>=908.5 332   180402800000 138151.0 *
##        9) Year_Built>=1976.5 283   235093500000 165821.3 *
##      5) Gr_Liv_Area>=1416.5 881  2705332000000 194313.1  
##       10) Exter_QualTypical>=0.5 479   840585200000 169653.2  
##         20) BsmtFin_SF_1>=3.5 285   398958500000 155275.4 *
##         21) BsmtFin_SF_1< 3.5 194   296159600000 190775.3 *
##       11) Exter_QualTypical< 0.5 402  1226384000000 223696.4  
##         22) Total_Bsmt_SF< 1015 192   327074400000 192131.0 *
##         23) Total_Bsmt_SF>=1015 210   533098200000 252556.3 *
##    3) Garage_Cars>=2.5 293  2898272000000 305219.8  
##      6) Total_Bsmt_SF< 1716.5 204  1018322000000 268711.9  
##       12) Year_Remod_Add< 1977.5 26    31649720000 154457.7 *
##       13) Year_Remod_Add>=1977.5 178   597691700000 285400.7  
##         26) Gr_Liv_Area< 2322 121   208583900000 260039.1 *
##         27) Gr_Liv_Area>=2322 57   146064400000 339238.5 *
##      7) Total_Bsmt_SF>=1716.5 89   984832300000 388900.7  
##       14) Gr_Liv_Area< 2187 60   237424700000 351391.3 *
##       15) Gr_Liv_Area>=2187 29   488334000000 466506.3  
##         30) Latitude< 42.05321 7   117163300000 329621.3 *
##         31) Latitude>=42.05321 22   198274700000 510060.6 *
```

---
class: middle

# .center[`pull_workflow_spec()`]

.center[Returns the parsnip model specification.]

```r
all_fitwf %>% 
  pluck(".workflow", 1) %>% 
  pull_workflow_spec()
```

---

```r
all_fitwf %>% 
  pluck(".workflow", 1) %>% 
  pull_workflow_spec()
## Decision Tree Model Specification (regression)
## 
## Computational engine: rpart
```