A tidymodels workflow

class: title-slide, center, middle
background-image: url(img/tidymodels_all_30.png)
background-size: contain
background-color:black

# A tidymodels workflow

## Caroline Ledbetter

### 2020-02-18

#### <i class="fas fa-images"></i> https://slides.carolineledbetter.us/tidymodels/ 
<i class="fab fa-github-square"></i> https://github.com/ledbettc/slides/tree/master/tidymodels
---
class:middle
name: details

.intro[
#Find me at: 
]
.intro-right[
<i class = "fas fa-globe"></i> carolineledbetter.us  
<i class = "far fa-envelope"></i>
carolineledbetter@yahoo.com  
<i class="fab fa-github"></i>
 @ledbettc  
<i class="fab fa-twitter"></i>
@C_line_sealion
]

---
layout: true
background-image: url(img/tidymodels_all_15.png)
background-size: contain
.footer[https://slides.carolineledbetter.us/tidymodels/]
---
class:center, middle

**`tidymodels` is a "meta-package" for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse**  
**&mdash; https://github.com/tidymodels/tidymodels**

------
<strong>
<blockquote class="twitter-tweet">
<p lang="en" dir="ltr">
Exciting - with the tune package, 
<a href="https://twitter.com/topepos?ref_src=twsrc%5Etfw">@topepos</a> 
says the tidymodels suite of packages is now “a fully operational death star” for modeling in 
<a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a>. 
Check them out! <a href="https://t.co/nCdA1Fayvv">https://t.co/nCdA1Fayvv</a></p>&mdash; 
Emily Robinson (@robinson_es) 
<a href="https://twitter.com/robinson_es/status/1204918320346157056?ref_src=twsrc%5Etfw">
December 12, 2019</a>

</blockquote> <script async src="https://platform.twitter.com/widgets.js" 
charset="utf-8"></script>
</strong>

---
class: center
background-image: url(img/tidymodels_all.png)
background-size: contain

---
class: middle
# The Data
###[German Credit data from the University of California Irving Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29)

```r
library(tidymodels)
library(tune)
library(workflows)
source("R/import_credit_data.R")
```
<br>
<br>
Code: https://slides.carolineledbetter.us/tidymodels/R/import_credit_data.R
Data: https://slides.carolineledbetter.us/tidymodels/data/german_credit.rda

---

```r
glimpse(german_credit)
## Observations: 1,000
## Variables: 21
## $ CheckingAccountStatus     <chr> "lt.0", "0.to.200", "no checking account", "lt.0", "lt.0", …
## $ Duration                  <dbl> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12, 24, 15, …
## $ CreditHistory             <chr> "critical account/ other credits existing (not at this bank…
## $ Purpose                   <chr> "radio/television", "radio/television", "education", "furni…
## $ Amount                    <dbl> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059, 5234,…
## $ SavingsAccountBonds       <chr> "unknown/ no savings account", "lt.100", "lt.100", "lt.100"…
## $ EmploymentDuration        <chr> "gt.7", "1.to.4", "4.to.7", "4.to.7", "1.to.4", "1.to.4", "…
## $ InstallmentRatePercentage <dbl> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4, 2, 4, 3,…
## $ Personal                  <chr> "male:single", "female:divorced/separated/married", "male:s…
## $ OtherDebtorsGuarantors    <chr> "none", "none", "none", "guarantor", "none", "none", "none"…
## $ ResidenceDuration         <dbl> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4, 3, 2, 2,…
## $ Property                  <chr> "RealEstate", "RealEstate", "RealEstate", "Insurance", "Unk…
## $ Age                       <dbl> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22, 60, 28,…
## $ OtherInstallmentPlans     <chr> "none", "none", "none", "none", "none", "none", "none", "no…
## $ Housing                   <chr> "own", "own", "own", "for free", "for free", "for free", "o…
## $ NumberExistingCredits     <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 3, 1, 1,…
## $ Job                       <chr> "skilled employee", "skilled employee", "unskilled - reside…
## $ NumberPeopleMaintenance   <dbl> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ Telephone                 <chr> "yes, registered under the customers name", "none", "none",…
## $ ForeignWorker             <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ Class                     <chr> "Good", "Bad", "Good", "Good", "Bad", "Good", "Good", "Good…
```
---
<br>
<br>

```r
set.seed(1450)
credit_split <- german_credit %>% 
  initial_split(prop = 0.75, strata = Class)
credit_split
## <750/250/1000>
```
<br>

```r
set.seed(1450)
credit_split_no_strata <- german_credit %>% 
  initial_split(prop = 0.75)
credit_split_no_strata
## <750/250/1000>
```

---

# stratified sample

```
## # A tibble: 2 x 3
##   Class Training Testing
##   <chr> <chr>    <chr>  
## 1 Bad   30%      30%    
## 2 Good  70%      70%
```

# unstratified sample

```
## # A tibble: 2 x 3
##   Class Training Testing
##   <chr> <chr>    <chr>  
## 1 Bad   29%      34%    
## 2 Good  71%      66%
```

---

<div class="figure">
<a href="https://github.com/allisonhorst" target="_blank"><img src="img/recipes.png" alt="Artwork by @allison_horst" width="1067" /></a>
<p class="caption">Artwork by @allison_horst</p>
</div>
---

```r
our_recipe <- 
  training(credit_split) %>% 
  recipe(Class ~ .) %>% 
  step_center(all_numeric()) %>% 
  step_scale(all_numeric()) %>%
  step_mutate(EmploymentDuration = 
                factor(EmploymentDuration,
                       levels = c('unemployed',
                                  'lt.1',
                                  '1.to.4',
                                  '4.to.7',
                                  'gt.7'),
                       ordered = TRUE)
              ) %>%
  step_ordinalscore(EmploymentDuration) %>% 
  step_string2factor(all_nominal()) 
  # step_knnimpute(all_predictors()) %>% 
  # step_nzv(all_predictors()) %>% 
  # step_upsample(all_outcomes(), over_ratio = 1) %>% 
  # step_filter(age > 70) 
```

---
class:middle

```r
our_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         20
## 
## Operations:
## 
## Centering for all_numeric
## Scaling for all_numeric
## Variable mutation for  EmploymentDuration
## Scoring for EmploymentDuration
## Factor variables from all_nominal
```

---
<div class="figure">
<a href="https://github.com/allisonhorst" target="_blank"><img src="img/parsnip.png" alt="Artwork by @allison_horst" width="1067" /></a>
<p class="caption">Artwork by @allison_horst</p>
</div>

---

```r
?boost_tree
```

```r
# logisitic regression
log_reg_mod <- 
  logistic_reg() %>%
  set_engine("glm")%>% 
  set_mode('classification')

# random forest
rf_mod <- rand_forest(
  trees = tune(),
  mtry = tune(),
  min_n = tune(), 
  mode = 'classification'
  ) %>%
  set_engine("ranger")

#k nearest neighbors
knn_mod <- 
  nearest_neighbor(neighbors = tune(), 
                   weight_func = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")
```

---

```r
# random forest
rf_mod <- rand_forest(
* trees = tune(),
* mtry = tune(),
* min_n = tune(),
  mode = 'classification'
  ) %>%
  set_engine("ranger")

#k nearest neighbors
knn_mod <- 
  nearest_neighbor(
*   neighbors = tune(),
*   weight_func = tune()
    ) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")
```

---

```r
set.seed(2134)
(cv_resamples <- 
  training(credit_split) %>% 
  vfold_cv(v = 10))
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits           id    
##    <named list>     <chr> 
##  1 <split [675/75]> Fold01
##  2 <split [675/75]> Fold02
##  3 <split [675/75]> Fold03
##  4 <split [675/75]> Fold04
##  5 <split [675/75]> Fold05
##  6 <split [675/75]> Fold06
##  7 <split [675/75]> Fold07
##  8 <split [675/75]> Fold08
##  9 <split [675/75]> Fold09
## 10 <split [675/75]> Fold10
```

---
class: center
background-image: url(img/middle_row.png)
background-size: contain

---

```r
ctrl <- control_grid(verbose = TRUE)
set.seed(2117)
knn_grid <- knn_mod %>% 
* parameters() %>%
  grid_regular(levels = c(15, 5))
knn_tune <- tune_grid(
  our_recipe, 
  model = knn_mod, 
  resamples = cv_resamples, 
  grid = knn_grid, 
  control = ctrl
)
```

---

```r
(rf_params <- 
  dials::parameters(dials::trees(), 
                    dials::min_n(), 
*                   finalize(mtry(),
*                            select(training(credit_split),
*                                   -Class))
                    ) %>% 
  dials::grid_latin_hypercube(size = 3))
## # A tibble: 3 x 3
##   trees min_n  mtry
##   <int> <int> <int>
## 1  1462    30     8
## 2   944     9    18
## 3   636    23     3
```

```r
rf_tune <- 
    tune::tune_grid(
      our_recipe, 
      model = rf_mod, 
      resamples = cv_resamples, 
      grid = rf_params, 
      control = ctrl
    )
```

---
class: middle

```r
best_rf <-
  select_best(rf_tune, metric = "roc_auc",  maximize = FALSE)
best_rf
## # A tibble: 1 x 3
##    mtry trees min_n
##   <int> <int> <int>
## 1    18   944     9
rf_mod_final <- finalize_model(rf_mod, best_rf)
```
---

```r
train <- our_recipe %>% 
  prep() %>% 
  juice()
test <- our_recipe %>% 
  prep() %>% 
  bake(testing(credit_split))
```

```r
rf_fit <- rf_mod_final %>% 
  fit(Class ~ ., train)
rf_fit
## parsnip model object
## 
## Fit time:  988ms 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, mtry = ~18L, num.trees = ~944L,      min.node.size = ~9L, num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  944 
## Sample size:                      750 
## Number of independent variables:  20 
## Mtry:                             18 
## Target node size:                 9 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1661911
```
---

```r
credit_wfl <- 
  workflow() %>% 
  add_recipe(our_recipe) %>% 
  add_model(rf_mod_final)
fit(credit_wfl, 
*   training(credit_split))
## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## ● step_center()
## ● step_scale()
## ● step_mutate()
## ● step_ordinalscore()
## ● step_string2factor()
## 
## ── Model ───────────────────────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, mtry = ~18L, num.trees = ~944L,      min.node.size = ~9L, num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  944 
## Sample size:                      750 
## Number of independent variables:  20 
## Mtry:                             18 
## Target node size:                 9 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1662282
```

---

```r
(log_reg_fit <- 
   credit_wfl %>% 
*  update_model(log_reg_mod) %>%
   fit(training(credit_split)))
## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## ● step_center()
## ● step_scale()
## ● step_mutate()
## ● step_ordinalscore()
## ● step_string2factor()
## 
## ── Model ───────────────────────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = formula, family = stats::binomial, data = data)
## 
## Coefficients:
##                                                              (Intercept)  
##                                                                 -0.08702  
##                                              CheckingAccountStatusgt.200  
##                                                                  0.38310  
##                                                CheckingAccountStatuslt.0  
##                                                                 -0.31961  
##                                 CheckingAccountStatusno checking account  
##                                                                  1.34995  
##                                                                 Duration  
##                                                                 -0.19458  
## CreditHistorycritical account/ other credits existing (not at this bank)  
##                                                                  1.52333  
##                             CreditHistorydelay in paying off in the past  
##                                                                  0.88547  
##                    CreditHistoryexisting credits paid back duly till now  
##                                                                  0.55648  
##                CreditHistoryno credits taken/ all credits paid back duly  
##                                                                  0.13453  
##                                                         Purposecar (new)  
##                                                                 -0.86459  
##                                                        Purposecar (used)  
##                                                                  1.03172  
##                                               Purposedomestic appliances  
##                                                                 -0.65840  
##                                                         Purposeeducation  
##                                                                 -0.57080  
##                                               Purposefurniture/equipment  
##                                                                  0.02357  
##                                                            Purposeothers  
##                                                                  0.06858  
##                                                  Purposeradio/television  
##                                                                  0.19205  
##                                                           Purposerepairs  
##                                                                 -0.38717  
##                                                        Purposeretraining  
##                                                                  1.60914  
##                                                                   Amount  
##                                                                 -0.42958  
##                                           SavingsAccountBonds500.to.1000  
##                                                                  0.23510  
##                                               SavingsAccountBondsgt.1000  
##                                                                  1.48298  
##                                                SavingsAccountBondslt.100  
##                                                                 -0.40903  
##                           SavingsAccountBondsunknown/ no savings account  
##                                                                  0.75882  
## 
## ...
## and 50 more lines.
```

---

.pull-left[

```r
predict(rf_fit, test) %>% 
  bind_cols(select(test, 
                   truth = Class))
## # A tibble: 250 x 2
##    .pred_class truth
##    <fct>       <fct>
##  1 Good        Good 
##  2 Good        Bad  
##  3 Good        Good 
##  4 Good        Good 
##  5 Bad         Bad  
##  6 Good        Bad  
##  7 Good        Good 
##  8 Good        Good 
##  9 Bad         Bad  
## 10 Good        Good 
## # … with 240 more rows
```
]

.pull-right[

```r
predict(rf_fit, 
        test, 
        type = 'prob') %>% 
  bind_cols(select(test, 
                   truth = Class))
## # A tibble: 250 x 3
##    .pred_Bad .pred_Good truth
##        <dbl>      <dbl> <fct>
##  1     0.128      0.872 Good 
##  2     0.358      0.642 Bad  
##  3     0.156      0.844 Good 
##  4     0.427      0.573 Good 
##  5     0.801      0.199 Bad  
##  6     0.490      0.510 Bad  
##  7     0.403      0.597 Good 
##  8     0.410      0.590 Good 
##  9     0.673      0.327 Bad  
## 10     0.374      0.626 Good 
## # … with 240 more rows
```
]

```r
rf_pred <- test %>% 
  select(truth = Class) %>% 
  bind_cols(predict(rf_fit, test, type = 'prob'), 
            predict(rf_fit, test) )
```

---
class: middle

#Yardstick

```r
metrics(rf_pred, truth, .pred_class)
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.784
## 2 kap      binary         0.404
roc_auc(rf_pred, truth, .pred_Bad)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.779
```

---
class: middle

.pull-left[

```r
roc_curve(rf_pred, 
          truth, 
          .pred_Bad) %>% 
  autoplot()
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-26-1.png" width="504" />
]

---
layout:false
class:center, middle

---

#Resources:

.resources[
Blog Post: *Coming Soon* :http://carolineledbetter.us/  
tidymodels: https://github.com/tidymodels/tidymodels  
rsample: https://tidymodels.github.io/rsample/  
recipes: https://tidymodels.github.io/recipes/  
parsnip: https://tidymodels.github.io/parsnip/  
tune: https://tidymodels.github.io/tune/  
dials:  https://tidymodels.github.io/dials/  
yardstick: https://tidymodels.github.io/yardstick/  
workflows: https://tidymodels.github.io/workflows/  
Applied Machine Learning: rstudio::conf 2020
https://rstudio-conf-2020.github.io/applied-ml/
]
---

#[Return to Contact Info](#details)