This family of functions allows using AMR-specific data types such as <mic>
and <sir>
inside tidymodels
pipelines.
Usage
all_mic()
all_mic_predictors()
all_sir()
all_sir_predictors()
step_mic_log2(recipe, ..., role = NA, trained = FALSE, columns = NULL,
skip = FALSE, id = recipes::rand_id("mic_log2"))
step_sir_numeric(recipe, ..., role = NA, trained = FALSE, columns = NULL,
skip = FALSE, id = recipes::rand_id("sir_numeric"))
Arguments
- recipe
A recipe object. The step will be added to the sequence of operations for this recipe.
- ...
One or more selector functions to choose variables for this step. See
selections()
for more details.- role
Not used by this step since no new variables are created.
- trained
A logical to indicate if the quantities for preprocessing have been estimated.
- skip
A logical. Should the step be skipped when the recipe is baked by
bake()
? While all operations are baked whenprep()
is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when usingskip = TRUE
as it may affect the computations for subsequent operations.- id
A character string that is unique to this step to identify it.
Details
You can read more in our online AMR with tidymodels introduction.
Tidyselect helpers include:
all_mic()
andall_mic_predictors()
to select<mic>
columnsall_sir()
andall_sir_predictors()
to select<sir>
columns
Pre-processing pipeline steps include:
step_mic_log2()
to convert MIC columns to numeric (viaas.numeric()
) and apply a log2 transform, to be used withall_mic_predictors()
step_sir_numeric()
to convert SIR columns to numeric (viaas.numeric()
), to be used withall_sir_predictors()
:"S"
= 1,"I"
/"SDD"
= 2,"R"
= 3. All other values are renderedNA
. Keep this in mind for further processing, especially if the model does not allow forNA
values.
These steps integrate with recipes::recipe()
and work like standard preprocessing steps. They are useful for preparing data for modelling, especially with classification models.
Examples
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
#> ✔ broom 1.0.8 ✔ rsample 1.3.0
#> ✔ dials 1.4.0 ✔ tibble 3.3.0
#> ✔ infer 1.0.8 ✔ tidyr 1.3.1
#> ✔ modeldata 1.4.0 ✔ tune 1.3.0
#> ✔ parsnip 1.3.2 ✔ workflows 1.2.0
#> ✔ purrr 1.0.4 ✔ workflowsets 1.1.1
#> ✔ recipes 1.3.1 ✔ yardstick 1.3.2
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ recipes::step() masks stats::step()
# The below approach formed the basis for this paper: DOI 10.3389/fmicb.2025.1582703
# Presence of ESBL genes was predicted based on raw MIC values.
# example data set in the AMR package
esbl_isolates
#> # A tibble: 500 × 19
#> esbl genus AMC AMP TZP CXM FOX CTX CAZ GEN TOB TMP SXT
#> <lgl> <chr> <mic> <mic> <mic> <mic> <mic> <mic> <mic> <mic> <mic> <mic> <mic>
#> 1 FALSE Esch… 32 32 4 64 64 8.00 8.00 1 1 16.0 20
#> 2 FALSE Esch… 32 32 4 64 64 4.00 8.00 1 1 16.0 320
#> 3 FALSE Esch… 4 2 64 8 4 8.00 0.12 16 16 0.5 20
#> 4 FALSE Kleb… 32 32 16 64 64 8.00 8.00 1 1 0.5 20
#> 5 FALSE Esch… 32 32 4 4 4 0.25 2.00 1 1 16.0 320
#> 6 FALSE Citr… 32 32 16 64 64 64.00 32.00 1 1 0.5 20
#> 7 FALSE Morg… 32 32 4 64 64 16.00 2.00 1 1 0.5 20
#> 8 FALSE Prot… 16 32 4 1 4 8.00 0.12 1 1 16.0 320
#> 9 FALSE Ente… 32 32 8 64 64 32.00 4.00 1 1 0.5 20
#> 10 FALSE Citr… 32 32 32 64 64 8.00 64.00 1 1 16.0 320
#> # ℹ 490 more rows
#> # ℹ 6 more variables: NIT <mic>, FOS <mic>, CIP <mic>, IPM <mic>, MEM <mic>,
#> # COL <mic>
# Prepare a binary outcome and convert to ordered factor
data <- esbl_isolates %>%
mutate(esbl = factor(esbl, levels = c(FALSE, TRUE), ordered = TRUE))
# Split into training and testing sets
split <- initial_split(data)
training_data <- training(split)
testing_data <- testing(split)
# Create and prep a recipe with MIC log2 transformation
mic_recipe <- recipe(esbl ~ ., data = training_data) %>%
# Optionally remove non-predictive variables
remove_role(genus, old_role = "predictor") %>%
# Apply the log2 transformation to all MIC predictors
step_mic_log2(all_mic_predictors()) %>%
prep()
# View prepped recipe
mic_recipe
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 17
#> undeclared role: 1
#>
#> ── Training information
#> Training data contained 375 data points and no incomplete rows.
#>
#> ── Operations
#> • Log2 transformation of MIC columns: AMC, AMP, TZP, CXM, FOX, ... | Trained
# Apply the recipe to training and testing data
out_training <- bake(mic_recipe, new_data = NULL)
out_testing <- bake(mic_recipe, new_data = testing_data)
# Fit a logistic regression model
fitted <- logistic_reg(mode = "classification") %>%
set_engine("glm") %>%
fit(esbl ~ ., data = out_training)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Generate predictions on the test set
predictions <- predict(fitted, out_testing) %>%
bind_cols(out_testing)
# Evaluate predictions using standard classification metrics
our_metrics <- metric_set(accuracy, kap, ppv, npv)
metrics <- our_metrics(predictions, truth = esbl, estimate = .pred_class)
# Show performance:
# - negative predictive value (NPV) of ~98%
# - positive predictive value (PPV) of ~94%
metrics
#> # A tibble: 4 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.936
#> 2 kap binary 0.872
#> 3 ppv binary 0.925
#> 4 npv binary 0.948