This family of functions lets you use AMR-specific data types, such as <mic> and <sir>, inside tidymodels pipelines.

Usage

all_mic()

all_mic_predictors()

all_sir()

all_sir_predictors()

step_mic_log2(recipe, ..., role = NA, trained = FALSE, columns = NULL,
  skip = FALSE, id = recipes::rand_id("mic_log2"))

step_sir_numeric(recipe, ..., role = NA, trained = FALSE, columns = NULL,
  skip = FALSE, id = recipes::rand_id("sir_numeric"))

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose variables for this step. See selections() for more details.

role

Not used by this step since no new variables are created.

trained

A logical to indicate if the quantities for preprocessing have been estimated.

skip

A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.

id

A character string that is unique to this step to identify it.

Details

You can read more in our online AMR with tidymodels introduction.

Tidyselect helpers include:

  • all_mic() and all_mic_predictors() to select <mic> columns

  • all_sir() and all_sir_predictors() to select <sir> columns

Pre-processing pipeline steps include:

  • step_mic_log2() to convert MIC columns to numeric (via as.numeric()) and apply a log2 transform, to be used with all_mic_predictors()

  • step_sir_numeric() to convert SIR columns to numeric (via as.numeric()), to be used with all_sir_predictors(): "S" = 1, "I"/"SDD" = 2, "R" = 3. All other values become NA. Keep this in mind for further processing, especially if the model does not accept NA values.

These steps integrate with recipes::recipe() and work like standard preprocessing steps. They are useful for preparing data for modelling, especially with classification models.
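
As a rough illustration of the two transformations described above, the following sketch reproduces the arithmetic in base R on plain vectors. It does not call the AMR steps themselves: the vectors `mic_values` and `sir_values` and the lookup table `sir_codes` are made up for this example, and "NI" merely stands in for any value other than S, I, SDD, or R.

```r
# Hypothetical MIC values as plain numerics (not <mic> objects)
mic_values <- c(0.5, 2, 16, 64)

# A log2 transform turns the doubling dilution series into evenly spaced values
log2(mic_values)
#> [1] -1  1  4  6

# Hypothetical SIR values as plain characters (not <sir> objects)
sir_values <- c("S", "I", "SDD", "R", "NI")

# Lookup table mirroring the mapping documented above
sir_codes <- c(S = 1, I = 2, SDD = 2, R = 3)

# Values without an entry in the lookup table become NA
sir_numeric <- unname(sir_codes[sir_values])
sir_numeric
#> [1]  1  2  2  3 NA
```

Because unmapped values end up as NA, models that reject missing values need an extra imputation or filtering step after the conversion.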

Examples

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
#>  broom        1.0.8      rsample      1.3.0
#>  dials        1.4.0      tibble       3.3.0
#>  infer        1.0.8      tidyr        1.3.1
#>  modeldata    1.4.0      tune         1.3.0
#>  parsnip      1.3.2      workflows    1.2.0
#>  purrr        1.0.4      workflowsets 1.1.1
#>  recipes      1.3.1      yardstick    1.3.2
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#>  purrr::discard() masks scales::discard()
#>  dplyr::filter()  masks stats::filter()
#>  dplyr::lag()     masks stats::lag()
#>  recipes::step()  masks stats::step()

# The approach below formed the basis for this paper: DOI 10.3389/fmicb.2025.1582703
# Presence of ESBL genes was predicted based on raw MIC values.


# example data set in the AMR package
esbl_isolates
#> # A tibble: 500 × 19
#>    esbl  genus   AMC   AMP   TZP   CXM   FOX   CTX   CAZ   GEN   TOB   TMP   SXT
#>    <lgl> <chr> <mic> <mic> <mic> <mic> <mic> <mic> <mic> <mic> <mic> <mic> <mic>
#>  1 FALSE Esch…    32    32     4    64    64  8.00  8.00     1     1  16.0    20
#>  2 FALSE Esch…    32    32     4    64    64  4.00  8.00     1     1  16.0   320
#>  3 FALSE Esch…     4     2    64     8     4  8.00  0.12    16    16   0.5    20
#>  4 FALSE Kleb…    32    32    16    64    64  8.00  8.00     1     1   0.5    20
#>  5 FALSE Esch…    32    32     4     4     4  0.25  2.00     1     1  16.0   320
#>  6 FALSE Citr…    32    32    16    64    64 64.00 32.00     1     1   0.5    20
#>  7 FALSE Morg…    32    32     4    64    64 16.00  2.00     1     1   0.5    20
#>  8 FALSE Prot…    16    32     4     1     4  8.00  0.12     1     1  16.0   320
#>  9 FALSE Ente…    32    32     8    64    64 32.00  4.00     1     1   0.5    20
#> 10 FALSE Citr…    32    32    32    64    64  8.00 64.00     1     1  16.0   320
#> # ℹ 490 more rows
#> # ℹ 6 more variables: NIT <mic>, FOS <mic>, CIP <mic>, IPM <mic>, MEM <mic>,
#> #   COL <mic>

# Prepare a binary outcome and convert to ordered factor
data <- esbl_isolates %>%
  mutate(esbl = factor(esbl, levels = c(FALSE, TRUE), ordered = TRUE))

# Split into training and testing sets
split <- initial_split(data)
training_data <- training(split)
testing_data <- testing(split)

# Create and prep a recipe with MIC log2 transformation
mic_recipe <- recipe(esbl ~ ., data = training_data) %>%
  # Optionally remove non-predictive variables
  remove_role(genus, old_role = "predictor") %>%
  # Apply the log2 transformation to all MIC predictors
  step_mic_log2(all_mic_predictors()) %>%
  prep()

# View prepped recipe
mic_recipe
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:          1
#> predictor:       17
#> undeclared role:  1
#> 
#> ── Training information 
#> Training data contained 375 data points and no incomplete rows.
#> 
#> ── Operations 
#>  Log2 transformation of MIC columns: AMC, AMP, TZP, CXM, FOX, ... | Trained

# Apply the recipe to training and testing data
out_training <- bake(mic_recipe, new_data = NULL)
out_testing <- bake(mic_recipe, new_data = testing_data)

# Fit a logistic regression model
fitted <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(esbl ~ ., data = out_training)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Generate predictions on the test set
predictions <- predict(fitted, out_testing) %>%
  bind_cols(out_testing)

# Evaluate predictions using standard classification metrics
our_metrics <- metric_set(accuracy, kap, ppv, npv)
metrics <- our_metrics(predictions, truth = esbl, estimate = .pred_class)

# Show performance; exact values vary between runs because the split is random.
# In the run shown here:
# - negative predictive value (NPV) of ~95%
# - positive predictive value (PPV) of ~93%
metrics
#> # A tibble: 4 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.936
#> 2 kap      binary         0.872
#> 3 ppv      binary         0.925
#> 4 npv      binary         0.948
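
The metrics in this example depend on the train/test split, because initial_split() samples rows at random and the example sets no seed. A minimal sketch of making such a split reproducible with set.seed() follows; it uses the built-in mtcars data as a stand-in, and the seed value 42 is arbitrary.

```r
library(rsample)

# Fix the random number generator state before splitting
set.seed(42)
split_a <- initial_split(mtcars)

# Resetting the same seed reproduces the same split
set.seed(42)
split_b <- initial_split(mtcars)

identical(training(split_a), training(split_b))
#> [1] TRUE
```

Calling set.seed() once before initial_split() in the example above would likewise make its reported metrics repeatable across runs.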