Interfacing your model with R

Interfacing a model with BT - step-by-step guide

Step 1: Create a runModel(par) function

To enable calibration in BT, it’s essential to run the model with a specified set of parameters. In fact, BT does not see the model as such, but requires a likelihood function with the interface likelihood(par), where par is a vector, but in this function you will probably then run the model with the parameters par, where par could remain a vector or be transformed into some other format, e.g. data.frame, matrix or list.

What happens next depends on how your model is programmed. The following steps are arranged according to speed and convenience. If your model has not been interfaced with R before, you will most likely have to skip to the last option.

Case 1 - model programmed in R

Typically, no action is required. Ensure that you are able to call your model.

runMyModel(par)

Typically this function will directly return the model outputs, so step 2 can be skipped.

Case 2 - compiled dll, parameters are set via dll interface

If your model is already in the form of a DLL, or if you can prepare it as such, use the function to link R to your model.

dyn.load(model)

runMyModel(par){
  out = # model call here 
  # process out
  return(out)
}

If you implement this process, you will also usually return the output directly via the dll and not write to a file, which means that step 2 can be skipped.

The problem with this approach is that you have to code the interface to your dll, which in most programming languages technically means setting your variables as external or something like that, so that they can be accessed from outside. The specifics of how this works will vary depending on the programming language being used.

Case 3 - model programmed in C / C++, interfaced with RCPP

RCPP provides a highly flexible platform to interface between R and C/C++. RCPP provides a secure and powerful way to connect to R if your model is coded in C/C++ (much more flexible than using the command line or a dll).

Nevertheless, code adjustments may be necessary for the interface, and beginners may find it challenging to resolve any technical issues. Attempting to interface an existing C/C++ model without prior experience using RCPP or at least substantial experience with C/C++ is not advisable.

In addition, step 2 can be skipped if you are implementing this, as you will usually return the output directly via the dll, rather than writing to a file.

Case 4 - compiled executable, parameters set via command line (std I/O)

If your model is written in a compiled or interpreted language and accepts parameters via std I/O, wrapping is usually nothing more than writing the system call in an R function. For example

runMyModel(par){
  
  # Create here a string with what you would write to call the model from the command line
  systemCall <- paste("model.exe", par[1], par[2])
  
  out = system(systemCall,  intern = TRUE) # intern indicates whether to capture the output of the command as an R character vector
  
  # write here to convert out in the apprpriate R classes
  
}

Note: If you encounter difficulties with the system command, try system2. If the model provides the output via std.out, you can catch and convert it and skip step 2. If your model writes the output to a file, proceed to step 2.

Case 5 - compiled model, parameters set via parameter file or in any other method

Many models use parameter files to read parameters. For this case, you want to do something like the following example

runMyModel(par, returnData = NULL){
  
  writeParameters(par)
  
  system("Model.exe")
  
  if(! is.null(returnData)) return(readData(returnData)) # The readData function will be defined later
  
}

writeParameters(par){
  
  # e.g.
  # read template parameter fil
  # replace strings in template file
  # write parameter file 
}

For some problems, it may be useful to create a setup function, as in the example below.

setUpModel <- function(parameterTemplate, site, localConditions){
  
  # create the runModel, readData functions (see later) here
  
  return(list(runModel, readData))
  
}

The way you write a parameter function depends on the file format you are using for the parameters. Usually, creating a template parameter file is recommended as a starting point, and then the parameters can be changed as required.

  • If your parameter file is in an .xml format, check out the xml functions in R.
  • If your parameter file is in a general text format, the best option may be to create a template parameter file, place a unique string at the locations of the parameters that you want to replace, and then use string replace functions in R, e.g. grep to replace this string.

Case 6 - compiled model, parameters cannot be changed

To achieve one of the previous options, you must change your model code. If the model is in C/C++, the best alternative is to go directly to RCPP.

Step 2: Read back data

If your model returns the output directly (which is highly preferable, ), you can skip this step.

You can skip this process if your model directly returns the output (which is highly recommended).

If you have a simple model, you might consider using the runMyModel function to return the model’s output directly. This is suitable for cases in a) and b) namely, models that are already in R, or receive parameters via the command line.

In contrast, more complex models generate a large number of outputs. However, usually you do not need all of them. Therefore, it is more useful to create one or multiple separate readData or getDate functions. The only two cases I will consider here are

  • via dll / RCPP
  • via file ouputs

Model is a dll If the model is a dll file, it would probably be best to implement appropriate getData functions in the source code that can then be called from R. If your model is in C and in a dll, interfacing it via RCPP would probably be easier because you can return R dataframes and other data structures directly.

Model writes file output If the model generates file output, create a getData function that reads the model outputs and returns the data in the desired format, typically the same format you would use to represent your field data.

getData(type = X){
  
  read.csv(xxx)
  
  # do some transformation 
  
  # return data in desidered format   
}

You should now see an example from R that looks like this

par = c(1,2,3,4 ..)

runMyModel(par)

output <- getData(type = DesiredType)

plot(output)

Step 3 (optional) - creating an R package from your code

Although the final step is optional, we recommend doing it from the beginning because there is really no downside. You can work with R code in multiple files that can be run manually or incorporated into an R package directly. Creating and managing R packages is a straightforward process and makes it simpler to share your code, as everything, including help guides, is in one package. To create an R package, please refer to the tutorial . Please remember to write good documentation using Roxygen.

Speed optimization and parallelization

Runtime is often a concern when performing sensitivity analyses or calibrations. Ensure that your model has been optimized for maximum speed before parallelization.

Easy things

  • Are you compiling with maximum optimization (e.g. -o3 in cpp)
  • If there is a spin-up phase, consider increasing the time step during that phase.
  • Consider increasing the time step in general.
  • Are you producing unnecessary outputs that could be turned off to reduce hard disk I/O, which is often slow?

Difficult things

  • Make the model directly callable (RCPP or dll) to avoid harddisk I/O
  • Is it possible to reduce the initialization time (not only for spin-up, but also for reading the forcings / drivers) by not terminating the model executable after each run, but keeping it “waiting” for a new run?
  • Code optimization: did you use a profiler? Read up on code optimization
  • Check for unnecessary calculations in your code / introduce compiler flags where appropriate

Parallelization

One way to speed up the runtime of your model is to run it on multiple cores (CPUs). To do so, you have two choices:

  1. Parallelize the model itself
  2. Parallelize the model call so that BT can perform multiple model evaluations in parallel.

Which of the two makes more sense depends a lot on your problem. Parallelizing the model itself will be especially interesting for very large models that could not otherwise be calibrated with MCMCs. However, this approach typically requires writing parallel C/C++ code and advanced programming skills, which is why we will not discuss it further here.

The usual advice in parallel computing anyway is to parallelize the outer loops first to minimize the communication overhead, which would suggest starting with parallelizing the model evaluations. This is also much easier to program. Even within that, there are two levels of parallelization possible:

  1. Parallelize the call of multiple MCMC / SMC samplers.
  2. Parallelize within the MCMC / SMC samplers

Currently, BT only supports parallelization within MCMCs / SMCs, but it easy to also implement between sampler parallelization by hand. Both approaches are describe below.

Within sampler parallelization

Within-sampler parallelization is particularly useful for algorithms that can use a large number of cores in parallel, such as sensitivity analysis or SMC sampling. For MCMCs, the amount of parallelization that can be used depends on the settings and the algorithm. In general, MCMCs are, as the name implies, Markovian, i.e., they set up a chain of sequential model evaluations, and these calls cannot be fully parallelized. However, a number of MCMCs in the BT package uses MCMC algorithms that can be partly parallelized, in particular the population MCMC algorithms DE/DEzs/DREAM/DREAMzs. In all these cases, BT will automatically use parallelization of the BT setup to indicate that it is implemented.

How to do this? A first requirement is that your model is wrapped in an R function (see PREVIOUS SECTION). Once that is the case, R provides a number of ways to run functions in parallel. The easiest is to use the parallel package that comes with the R core. Other packages can be found on the Internet and in the CRAN task view at High Performance Computing

As an example, suppose we have the following very simple model:

mymodel<-function(x){
  output<-0.2*x+0.1^x
  return(output)
}

To start a parallel computation, we must first create a cluster object. Here we start a cluster with 2 CPUs.

library(parallel)
cl <- makeCluster(2)

runParallel<- function(parList){
  parSapply(cl, parList, mymodel)
}

runParallel(c(1,2))

You could use this principle to build your own parallelized likelihood. However, something very similar to the previous loop is automated in BayesianTools. You can create a parallel model evaluation function directly with the generateParallelExecuter function, or alternatively directly in the createBayesianSetup function.

library(BayesianTools)
parModel <- generateParallelExecuter(mymodel)

If your model is thread-safe, you should be able to run it out of the box. Therefore, I recommend using the hand-coded parallelization only if your model is not thread-safe.

Running several MCMCs in parallel

In addition to within-chain parallelization, you can also run multiple MCMCs in parallel and later combine them into a single McmcSamplerList.

library(BayesianTools)

ll <- generateTestDensityMultiNormal(sigma = "no correlation")
bayesianSetup <- createBayesianSetup(likelihood = ll, lower = rep(-10, 3), upper = rep(10, 3))

settings = list(iterations = 200)

# run the several MCMCs chains either in seperate R sessions, or via R parallel packages
out1 <- runMCMC(bayesianSetup = bayesianSetup, sampler = "DEzs", settings = settings)
##  Running DEzs-MCMC, chain  1 iteration 201 of 201 . Current logp  -16.84623 -12.12165 -13.12708 . Please wait! 
## runMCMC terminated after 0.0520000000000014seconds
out2 <- runMCMC(bayesianSetup = bayesianSetup, sampler = "DEzs", settings = settings)
##  Running DEzs-MCMC, chain  1 iteration 201 of 201 . Current logp  -12.16394 -12.36114 -13.63612 . Please wait! 
## runMCMC terminated after 0.0549999999999997seconds
res <- createMcmcSamplerList(list(out1, out2))
plot(res)

Thread safety

Thread safety generally means that you can run multiple instances of your code on your hardware. There are several things that can limit thread safety, such as

  • Writing output to a file (multiple threads can write to the same file at the same time)