Section 10 Workflow Management with {targets}

10.1 Using the {targets} package to automate your analyses

Data science workflows often grow beyond a single script and can rapidly become unwieldy when many steps, datasets, and files are involved. Such workflows are a challenge both to keep track of (i.e., which scripts to run first, second, etc.) and to communicate to other users in the world of open science.

Workflow management software is a solution to this that the ROSSyndicate has embraced. Specifically, we use the {targets} R package, which allows you to use R to build out a pipeline (used interchangeably with “workflow” here) that the software automatically tracks, so that it “knows” which steps have been run, which have failed, and which need to be re-run. Your entire analytical pipeline or workflow can be rerun with just the command tar_make(), and you can visualize the connections between objects and functions in your pipeline with tar_visnetwork().
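For a sense of scale, a minimal _targets.R might look like the sketch below (the input file, column names, and helper function here are hypothetical, not part of any ROSSyndicate workflow):

# _targets.R: a minimal sketch with a hypothetical input file and columns
library(targets)
tar_option_set(packages = c("readr", "dplyr"))

# a small helper; in real pipelines these usually live in sourced scripts
summarize_by_date <- function(df) {
  df |>
    dplyr::group_by(date) |>
    dplyr::summarise(mean_value = mean(value, na.rm = TRUE))
}

list(
  # track the input file itself so edits to it invalidate downstream targets
  tar_target(raw_data_file, "data/raw_data.csv", format = "file"),
  # read the tracked file
  tar_target(raw_data, readr::read_csv(raw_data_file)),
  # a downstream step that depends on raw_data
  tar_target(daily_means, summarize_by_date(raw_data))
)

With a file like this in place, tar_make() builds (or skips) each target in dependency order and tar_visnetwork() draws the dependency graph.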

One standout strength of this workflow management architecture is that the software “knows” which downstream steps are affected by a change in upstream code or datasets. This means that you never need to manually re-run scripts that have been made outdated by code changes; {targets} knows which targets are affected, can tell you, and will automatically re-run them.
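In practice, you can see this at work with a couple of commands:

tar_outdated()    # lists the targets invalidated by your changes
tar_visnetwork()  # outdated targets are highlighted in the dependency graph
tar_make()        # re-runs only the outdated targets and skips the rest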

Not every analysis is a perfect fit for a {targets} workflow, however. Very small analyses may not be worth the time investment needed to build out the structure of the pipeline. Workflows that are re-run often with updated datasets, or that can take advantage of many parallel iterations, are especially good candidates for the {targets} pipeline style.

Because the ROSSyndicate values literate code and {targets} does not render and track the output of Markdown or Quarto documents, there are additional steps to creating easily understandable and literate code within the {targets} framework. In short, it involves using the tar_render() function (from the {tarchetypes} package) to render Markdown documents into a bookdown as one of the final steps in the workflow. In that bookdown, you would integrate our practice of literate coding. We will be developing more framework around this as time passes.
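As a rough sketch of the simplest case (a single, hypothetical summary.Rmd that pulls results from upstream targets with tar_read()), the rendering step might look like:

# at the end of the target list in _targets.R; summary.Rmd is hypothetical
library(targets)
library(tarchetypes)

list(
  # ... upstream analysis targets would precede this ...
  tar_render(summary_report, "summary.Rmd")
)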

We won’t explain how to use {targets} in detail here, but good resources are readily available. The {targets} R package user manual, written by the package’s author, is a great place to start. Matt Brousil from the ROSSyndicate has also written a short example of building an analysis with the package in Targets for Ecologists and created an instructional video.

10.1.1 Hot tips and tricks

Sourcing scripts for functions in {targets} pipelines

The _targets.R script is an essential piece of a {targets} pipeline. This file is also where functions that define workflow steps (“targets”) are sourced from other scripts. {targets} provides a built-in function to do this, tar_source(). For example: tar_source("src/custom_functions.R")
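tar_source() accepts either individual script files or whole directories, so both of the following (hypothetical paths) work near the top of _targets.R:

# near the top of _targets.R (hypothetical paths)
library(targets)
tar_source("src")                     # source every .R script in src/
tar_source("src/custom_functions.R")  # or source one specific script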

Package use in a {targets} pipeline

You might find it unclear how to require certain packages when writing code for a {targets} workflow. Ideally, you use the tar_option_set() function near the top of your _targets.R script and provide its packages argument a character vector of the package names that should be loaded for every target in the workflow. This often might just be tar_option_set(packages = c("tidyverse")). You can then change this default for specific targets by setting the packages argument within the respective target’s tar_target() call, e.g., tar_target(..., packages = c("tidyverse", "lubridate")). Note that you’ll still need to repeat any packages you’ve already listed in tar_option_set() when changing the packages argument in tar_target().
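Putting those two pieces together, a sketch (with a hypothetical target) might look like:

# in _targets.R
library(targets)

# default packages loaded for every target in the pipeline
tar_option_set(packages = c("tidyverse"))

list(
  tar_target(
    name = sample_dates,
    command = ymd(c("2021-01-01", "2021-06-01")),
    # overrides the pipeline default; note that "tidyverse" is repeated
    packages = c("tidyverse", "lubridate")
  )
)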

Reading external files into {targets} pipelines

When building your {targets} pipeline you may wonder how best to read in data and have the pipeline track the existence of (and changes to) input files. The {tarchetypes} package provides some handy shortcut functions for this: tar_file() and tar_file_read(). The first, tar_file(), declares that a pipeline target is a dynamic file, based on a path provided to the command argument. By contrast, tar_file_read() creates two pipeline targets: one tracking the file path provided, and a second reading in the file according to the instructions you provide to the read argument.

For instance, if you have a .yml file that you want to track for changes, you can use the tar_file() method for tracking. If you have a file (for example, a .csv) that is output from one target and needed by a later one, the target that creates that .csv would return the filepath of that .csv. A downstream target (for example, via tar_file_read()) can then read the data from that filepath while still tracking the file itself (via the filepath).
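A sketch of both patterns (the file paths here are hypothetical):

# in _targets.R (hypothetical file paths)
library(targets)
library(tarchetypes)

list(
  # track a config file so that edits to it invalidate downstream targets
  tar_file(config_file, "config.yml"),

  # track a csv AND read it in, as two linked targets
  tar_file_read(
    site_data,
    "data/site_data.csv",
    read = readr::read_csv(!!.x)
  )
)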

Using the ‘branching’ function in {targets}

There are times when your workflow will benefit from tracking specific inputs to a target. For instance, in the Landsat acquisition workflow, we repeat a function over every single WRS path-row over the United States. Because this acquisition takes a long time to execute, it sometimes fails partway through; branching lets {targets} track each path-row separately, so only the failed branches need to be re-run. Deploying branches requires two steps: first, you create a list to iterate (or branch) on - in this case, a list of path-rows; then you map the function that relies on the path-row over that list. Your {targets} list might look something like this:

list(
  # get WRS tiles - this target returns a list of path-rows
  tar_target(
    name = WRS_tiles,
    command = get_WRS_tiles(WRS_detection_method, yml_file),
    packages = c("readr", "sf")
  ),

  # run the landsat pull as function per path-row
  tar_target(
    name = eeRun,
    command = {
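      # listing these two functions here makes {targets} treat them as
      # dependencies of this target (see the note in the Python section below)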
      ref_pull_457_DSWE1
      ref_pull_89_DSWE1
      run_GEE_per_tile(WRS_tiles)
    },
    pattern = map(WRS_tiles), # this is where the magic happens!
    packages = "reticulate"
  )
)

Using Python scripts in {targets}

The {targets} package was designed to work for R workflows only, which means that for our mixed-language workflows, we have to get creative. Again, {targets} will not render AND track the output of a .Rmd file, which means all of our precious .Rmd must be translated into .R and .py scripts in order to work in {targets}. We employ source_python() from the {reticulate} package, used in much the same way you might use tar_source() to make the functions you call in your pipeline available. To run a Python script as a target, you have to nest it in an R function. For instance, our run_GEE_per_tile() from above is just this wrapper:

run_GEE_per_tile <- function(WRS_tile) {
  tile <- WRS_tile # store the tile from the list
  write_lines(tile, 'data_acquisition/out/current_tile.txt', sep = '') # save it as a generic file
  source_python('data_acquisition/src/runGEEperTile.py') # this script actually *reads* the generic file above as one of the first steps
}

Note that {targets} does not track any of the functions called via the source_python() line the way it does for functions sourced from an .R script - you must list all of the functions that are needed in your script in the command block (as in the eeRun target above) so that {targets} knows you need those functions to run the code. Additionally, {targets} does not actively track any output from source_python(), so again, you’ll have to get creative!
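One pattern that can help (a sketch only, assuming a hypothetical Python script that writes its results to a known output file) is to have the R wrapper return that file’s path and declare the target as a file target, so {targets} at least notices when the output changes:

# hypothetical wrapper; the output path is whatever your .py script writes to
run_py_step <- function() {
  reticulate::source_python('src/my_py_step.py') # the script writes out/results.csv
  return('out/results.csv')                      # hand the file path back to {targets}
}

# in the target list:
tar_target(
  name = py_results,
  command = run_py_step(),
  format = "file", # track the returned path and the file's contents
  packages = "reticulate"
)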

Rendering a Bookdown in {targets} [placeholder]

10.1.2 Resources

Below are links to resources to help you learn more about {targets} and implementing workflow management software: