class: title-slide, center, middle background-image: url(img/tidymodels_all_30.png) background-size: contain background-color:black # A tidymodels workflow ## Caroline Ledbetter ### 2020-02-18 #### <i class="fas fa-images"></i> https://slides.carolineledbetter.us/tidymodels/ <i class="fab fa-github-square"></i> https://github.com/ledbettc/slides/tree/master/tidymodels --- class:middle name: details <!-- edit slides and github repo in the YAML above --> .intro[ #Find me at: ] .intro-right[ <i class = "fas fa-globe"></i> carolineledbetter.us <i class = "far fa-envelope"></i> carolineledbetter@yahoo.com <i class="fab fa-github"></i> @ledbettc <i class="fab fa-twitter"></i> @C_line_sealion ] --- layout: true background-image: url(img/tidymodels_all_15.png) background-size: contain .footer[https://slides.carolineledbetter.us/tidymodels/] --- class:center, middle **`tidymodels` is a "meta-package" for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse** **— https://github.com/tidymodels/tidymodels** -- ------ <strong> <blockquote class="twitter-tweet"> <p lang="en" dir="ltr"> Exciting - with the tune package, <a href="https://twitter.com/topepos?ref_src=twsrc%5Etfw">@topepos</a> says the tidymodels suite of packages is now “a fully operational death star” for modeling in <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a>. Check them out! <a href="https://t.co/nCdA1Fayvv">https://t.co/nCdA1Fayvv</a></p>— Emily Robinson (@robinson_es) <a href="https://twitter.com/robinson_es/status/1204918320346157056?ref_src=twsrc%5Etfw"> December 12, 2019</a> </blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </strong> --- class: center background-image: url(img/tidymodels_all.png) background-size: contain --- class: middle # The Data ###[German Credit data from the University of California Irving Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29) ```r library(tidymodels) library(tune) library(workflows) source("R/import_credit_data.R") ``` <br> <br> Code: https://slides.carolineledbetter.us/tidymodels/R/import_credit_data.R Data: https://slides.carolineledbetter.us/tidymodels/data/german_credit.rda --- ```r glimpse(german_credit) ## Observations: 1,000 ## Variables: 21 ## $ CheckingAccountStatus <chr> "lt.0", "0.to.200", "no checking account", "lt.0", "lt.0", … ## $ Duration <dbl> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12, 24, 15, … ## $ CreditHistory <chr> "critical account/ other credits existing (not at this bank… ## $ Purpose <chr> "radio/television", "radio/television", "education", "furni… ## $ Amount <dbl> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059, 5234,… ## $ SavingsAccountBonds <chr> "unknown/ no savings account", "lt.100", "lt.100", "lt.100"… ## $ EmploymentDuration <chr> "gt.7", "1.to.4", "4.to.7", "4.to.7", "1.to.4", "1.to.4", "… ## $ InstallmentRatePercentage <dbl> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4, 2, 4, 3,… ## $ Personal <chr> "male:single", "female:divorced/separated/married", "male:s… ## $ OtherDebtorsGuarantors <chr> "none", "none", "none", "guarantor", "none", "none", "none"… ## $ ResidenceDuration <dbl> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4, 3, 2, 2,… ## $ Property <chr> "RealEstate", "RealEstate", "RealEstate", "Insurance", "Unk… ## $ Age <dbl> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22, 60, 28,… ## $ OtherInstallmentPlans <chr> "none", "none", "none", "none", "none", "none", "none", "no… ## $ Housing <chr> "own", "own", "own", "for free", "for free", "for free", "o… ## $ NumberExistingCredits <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 3, 1, 1,… ## $ Job <chr> "skilled employee", "skilled employee", "unskilled - reside… ## $ NumberPeopleMaintenance <dbl> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,… ## $ Telephone <chr> "yes, registered under the customers name", "none", "none",… ## $ ForeignWorker <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye… ## $ Class <chr> "Good", "Bad", "Good", "Good", "Bad", "Good", "Good", "Good… ``` --- <br> <br> ```r set.seed(1450) credit_split <- german_credit %>% initial_split(prop = 0.75, strata = Class) credit_split ## <750/250/1000> ``` <br> -- ```r set.seed(1450) credit_split_no_strata <- german_credit %>% initial_split(prop = 0.75) credit_split_no_strata ## <750/250/1000> ``` --- # stratified sample ``` ## # A tibble: 2 x 3 ## Class Training Testing ## <chr> <chr> <chr> ## 1 Bad 30% 30% ## 2 Good 70% 70% ``` # unstratified sample ``` ## # A tibble: 2 x 3 ## Class Training Testing ## <chr> <chr> <chr> ## 1 Bad 29% 34% ## 2 Good 71% 66% ``` --- <div class="figure"> <a href="https://github.com/allisonhorst" target="_blank"><img src="img/recipes.png" alt="Artwork by @allison_horst" width="1067" /></a> <p class="caption">Artwork by @allison_horst</p> </div> --- ```r our_recipe <- training(credit_split) %>% recipe(Class ~ .) %>% step_center(all_numeric()) %>% step_scale(all_numeric()) %>% step_mutate(EmploymentDuration = factor(EmploymentDuration, levels = c('unemployed', 'lt.1', '1.to.4', '4.to.7', 'gt.7'), ordered = TRUE) ) %>% step_ordinalscore(EmploymentDuration) %>% step_string2factor(all_nominal()) # step_knnimpute(all_predictors()) %>% # step_nzv(all_predictors()) %>% # step_upsample(all_outcomes(), over_ratio = 1) %>% # step_filter(age > 70) ``` --- class:middle ```r our_recipe ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Centering for all_numeric ## Scaling for all_numeric ## Variable mutation for EmploymentDuration ## Scoring for EmploymentDuration ## Factor variables from all_nominal ``` --- <div class="figure"> <a href="https://github.com/allisonhorst" target="_blank"><img src="img/parsnip.png" alt="Artwork by @allison_horst" width="1067" /></a> <p class="caption">Artwork by @allison_horst</p> </div> --- ```r ?boost_tree ``` ```r # logisitic regression log_reg_mod <- logistic_reg() %>% set_engine("glm")%>% set_mode('classification') # random forest rf_mod <- rand_forest( trees = tune(), mtry = tune(), min_n = tune(), mode = 'classification' ) %>% set_engine("ranger") #k nearest neighbors knn_mod <- nearest_neighbor(neighbors = tune(), weight_func = tune()) %>% set_engine("kknn") %>% set_mode("classification") ``` --- ```r # random forest rf_mod <- rand_forest( * trees = tune(), * mtry = tune(), * min_n = tune(), mode = 'classification' ) %>% set_engine("ranger") #k nearest neighbors knn_mod <- nearest_neighbor( * neighbors = tune(), * weight_func = tune() ) %>% set_engine("kknn") %>% set_mode("classification") ``` --- ```r set.seed(2134) (cv_resamples <- training(credit_split) %>% vfold_cv(v = 10)) ## # 10-fold cross-validation ## # A tibble: 10 x 2 ## splits id ## <named list> <chr> ## 1 <split [675/75]> Fold01 ## 2 <split [675/75]> Fold02 ## 3 <split [675/75]> Fold03 ## 4 <split [675/75]> Fold04 ## 5 <split [675/75]> Fold05 ## 6 <split [675/75]> Fold06 ## 7 <split [675/75]> Fold07 ## 8 <split [675/75]> Fold08 ## 9 <split [675/75]> Fold09 ## 10 <split [675/75]> Fold10 ``` --- class: center background-image: url(img/middle_row.png) background-size: contain --- ```r ctrl <- control_grid(verbose = TRUE) set.seed(2117) knn_grid <- knn_mod %>% * parameters() %>% grid_regular(levels = c(15, 5)) knn_tune <- tune_grid( our_recipe, model = knn_mod, resamples = cv_resamples, grid = knn_grid, control = ctrl ) ``` --- ```r (rf_params <- dials::parameters(dials::trees(), dials::min_n(), * finalize(mtry(), * select(training(credit_split), * -Class)) ) %>% dials::grid_latin_hypercube(size = 3)) ## # A tibble: 3 x 3 ## trees min_n mtry ## <int> <int> <int> ## 1 1462 30 8 ## 2 944 9 18 ## 3 636 23 3 ``` ```r rf_tune <- tune::tune_grid( our_recipe, model = rf_mod, resamples = cv_resamples, grid = rf_params, control = ctrl ) ``` --- class: middle ```r best_rf <- select_best(rf_tune, metric = "roc_auc", maximize = FALSE) best_rf ## # A tibble: 1 x 3 ## mtry trees min_n ## <int> <int> <int> ## 1 18 944 9 rf_mod_final <- finalize_model(rf_mod, best_rf) ``` --- ```r train <- our_recipe %>% prep() %>% juice() test <- our_recipe %>% prep() %>% bake(testing(credit_split)) ``` -- ```r rf_fit <- rf_mod_final %>% fit(Class ~ ., train) rf_fit ## parsnip model object ## ## Fit time: 988ms ## Ranger result ## ## Call: ## ranger::ranger(formula = formula, data = data, mtry = ~18L, num.trees = ~944L, min.node.size = ~9L, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) ## ## Type: Probability estimation ## Number of trees: 944 ## Sample size: 750 ## Number of independent variables: 20 ## Mtry: 18 ## Target node size: 9 ## Variable importance mode: none ## Splitrule: gini ## OOB prediction error (Brier s.): 0.1661911 ``` --- ```r credit_wfl <- workflow() %>% add_recipe(our_recipe) %>% add_model(rf_mod_final) fit(credit_wfl, * training(credit_split)) ## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: rand_forest() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────── ## 5 Recipe Steps ## ## ● step_center() ## ● step_scale() ## ● step_mutate() ## ● step_ordinalscore() ## ● step_string2factor() ## ## ── Model ─────────────────────────────────────────────────────────────────────────────────────── ## Ranger result ## ## Call: ## ranger::ranger(formula = formula, data = data, mtry = ~18L, num.trees = ~944L, min.node.size = ~9L, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) ## ## Type: Probability estimation ## Number of trees: 944 ## Sample size: 750 ## Number of independent variables: 20 ## Mtry: 18 ## Target node size: 9 ## Variable importance mode: none ## Splitrule: gini ## OOB prediction error (Brier s.): 0.1662282 ``` --- ```r (log_reg_fit <- credit_wfl %>% * update_model(log_reg_mod) %>% fit(training(credit_split))) ## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: logistic_reg() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────── ## 5 Recipe Steps ## ## ● step_center() ## ● step_scale() ## ● step_mutate() ## ● step_ordinalscore() ## ● step_string2factor() ## ## ── Model ─────────────────────────────────────────────────────────────────────────────────────── ## ## Call: stats::glm(formula = formula, family = stats::binomial, data = data) ## ## Coefficients: ## (Intercept) ## -0.08702 ## CheckingAccountStatusgt.200 ## 0.38310 ## CheckingAccountStatuslt.0 ## -0.31961 ## CheckingAccountStatusno checking account ## 1.34995 ## Duration ## -0.19458 ## CreditHistorycritical account/ other credits existing (not at this bank) ## 1.52333 ## CreditHistorydelay in paying off in the past ## 0.88547 ## CreditHistoryexisting credits paid back duly till now ## 0.55648 ## CreditHistoryno credits taken/ all credits paid back duly ## 0.13453 ## Purposecar (new) ## -0.86459 ## Purposecar (used) ## 1.03172 ## Purposedomestic appliances ## -0.65840 ## Purposeeducation ## -0.57080 ## Purposefurniture/equipment ## 0.02357 ## Purposeothers ## 0.06858 ## Purposeradio/television ## 0.19205 ## Purposerepairs ## -0.38717 ## Purposeretraining ## 1.60914 ## Amount ## -0.42958 ## SavingsAccountBonds500.to.1000 ## 0.23510 ## SavingsAccountBondsgt.1000 ## 1.48298 ## SavingsAccountBondslt.100 ## -0.40903 ## SavingsAccountBondsunknown/ no savings account ## 0.75882 ## ## ... ## and 50 more lines. ``` --- .pull-left[ ```r predict(rf_fit, test) %>% bind_cols(select(test, truth = Class)) ## # A tibble: 250 x 2 ## .pred_class truth ## <fct> <fct> ## 1 Good Good ## 2 Good Bad ## 3 Good Good ## 4 Good Good ## 5 Bad Bad ## 6 Good Bad ## 7 Good Good ## 8 Good Good ## 9 Bad Bad ## 10 Good Good ## # … with 240 more rows ``` ] .pull-right[ ```r predict(rf_fit, test, type = 'prob') %>% bind_cols(select(test, truth = Class)) ## # A tibble: 250 x 3 ## .pred_Bad .pred_Good truth ## <dbl> <dbl> <fct> ## 1 0.128 0.872 Good ## 2 0.358 0.642 Bad ## 3 0.156 0.844 Good ## 4 0.427 0.573 Good ## 5 0.801 0.199 Bad ## 6 0.490 0.510 Bad ## 7 0.403 0.597 Good ## 8 0.410 0.590 Good ## 9 0.673 0.327 Bad ## 10 0.374 0.626 Good ## # … with 240 more rows ``` ] -- ```r rf_pred <- test %>% select(truth = Class) %>% bind_cols(predict(rf_fit, test, type = 'prob'), predict(rf_fit, test) ) ``` --- class: middle #Yardstick ```r metrics(rf_pred, truth, .pred_class) ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.784 ## 2 kap binary 0.404 roc_auc(rf_pred, truth, .pred_Bad) ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 roc_auc binary 0.779 ``` --- class: middle .pull-left[ ```r roc_curve(rf_pred, truth, .pred_Bad) %>% autoplot() ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-26-1.png" width="504" /> ] --- layout:false class:center, middle <img src="img/iu.jpg" width="70%" /> --- #Resources: .resources[ Blog Post: *Coming Soon* :http://carolineledbetter.us/ tidymodels: https://github.com/tidymodels/tidymodels rsample: https://tidymodels.github.io/rsample/ recipes: https://tidymodels.github.io/recipes/ parsnip: https://tidymodels.github.io/parsnip/ tune: https://tidymodels.github.io/tune/ dials: https://tidymodels.github.io/dials/ yardstick: https://tidymodels.github.io/yardstick/ workflows: https://tidymodels.github.io/workflows/ Applied Machine Learning: rstudio::conf 2020 https://rstudio-conf-2020.github.io/applied-ml/ ] --- <!-- last slide go back to details slide --> #[Return to Contact Info](#details)