PEcAn.assim.batch Vignette

Install package from Github

Only needs to be done the first time :

library(devtools)
install_github("PecanProject/pecan", subdir="all")

Parameter Data Assimilation (a.k.a. calibration) in PEcAn

Currently, there are three ways of doing Parameter Data Assimilation (PDA) in PEcAn :

bayesian.tools (bruteforce algorithms)
emulator
multi-site hierarchical PDA (HPDA) with emulators

Note that, there used to be Metropolis-Hastings Markov Chain Monte Carlo (MCMC) algorithms implemented natively in PEcAn, namely /modules/assim.batch/R/pda.mcmc.R and pda.mcmc.bs.R. These are deprecated and they will no longer be maintained.

Which one to use?

bayesian.tools : The pda.bayesian.tools.R is the main script of this PDA workflow that performs a bruteforce (MCMC methods that require running your models at least couple of thousand of times) calibration. This script uses the BayesianTools R-package (Hartig, Minuno and Paul, 2019) that includes MCMC and SMC samplers and other tools for Bayesian parameter calibration with external models. If you choose bayesian.tools option, PEcAn framework will hand the PDA calculations over to the BayesianTools package. Although this package includes algorithms that are designed to explore the parameter space more efficiently than the traditional MH-MCMC, you would still need a relatively faster model to use these algorithms. Note that the BayesianTools R-package itself is constantly under development, and its integration with PEcAn may not be able to leverage all BayesianTools functionality.

emulator : The pda.emulator.R is the main script of this PDA workflow. When a model is slow, it is practically not possible to run it hundreds of thousands of times sequentially to explore the parameter space and draw enough samples to converge the target distribution with bruteforce algorithms. Instead, we can run the model for a relatively smaller number of times with parameter combinations that have been carefully chosen to give good coverage of parameter space. Then we can interpolate the likelihood calculated for each of those runs to get a surface that “emulates” the true likelihood and perform regular MCMC (just like the “bruteforce” approach), except instead of actually running the model on every iteration to obtain a likelihood value, this time we will just get an approximation from the likelihood emulator. A lot more details about the emulator approach can be found in Fer et al. 2018.

Multi-site HPDA : The pda.emulator.ms.R is the main script of this PDA workflow which performs a multi-site hierarchical Bayesian calibration using the emulator approach. In the multi-site hierarhical calibration, global posteriors are fitted simultaneously using an additional Gibbs update step, and used as the new prior hyperparameters in the next iteration of MCMC to obtain site-level hierarchical posteriors. This PDA workflow has its own vignette (see /modules/assim.batch/vignettes/MultiSitePDAVignette.Rmd) and will not be one of the examples here. For more information also see Fer et al. 2021 bioRxiv preprint.

Adding PDA tags to pecan.xml

The easiest way to use PEcAn’s parameter data assimilation module is to add an <assim.batch> block to pecan.xml, load the file with read.settings, and pass the resulting settings object to pda.bayesian.tools() or pda.emulator(). There are some differences in the settings for using different PDA methods (we will go through test cases, at this point just familiarize yourself with the general pattern), but here is an example <assim.batch> block :

<assim.batch>
  <iter>100000</iter>
  <method>emulator</method>                     <--- WHICH PDA APPROACH
  <chain>3</chain>
  <param.names>
   <temperate.coniferous>                       <--- YOUR PFT(S)
    <param>PARAM1</param>                       <--- PARAMETER(S) YOU TARGET IN PDA
    <param>PARAM2</param>
   </temperate.coniferous>
  </param.names>
  <jump>
   <adapt>250</adapt>
   <adj.min>0.1</adj.min>
   <ar.target>0.3</ar.target>
  </jump>
  <inputs>
    <file>
      <input.id> ... </input.id>               <--- YOUR CONSTRAINT ID FROM BETY DB 
      <path>
        <path> ... </path>                     <--- YOUR CONSTRAINT FILE PATH(S)
        <path> ... </path>
      </path>
      <likelihood>Laplace</likelihood>         <--- LIKELIHOOD FORM FOR THIS CONSTRAINT 
      <variable.id>298</variable.id>           <--- (CONSTRAINT) VARIABLE ID FROM BETY DB
      <variable.name>
        <variable.name>LE</variable.name>      <--- (CONSTRAINT) VARIABLE
        <variable.name>UST</variable.name>     <--- HELPER VARIABLE SPECIFIC TO FLUX CASE
      </variable.name>
    </file>
    <file>
      <input.id>...</input.id>
      <path>
        <path>...</path>
      </path>
      <likelihood>multipGauss</likelihood>
      <hyper.pars>
        <parama>0.001</parama>                 <--- HYPERPARAMETERS SPECIFIC TO GAUSSIAN FORM
        <paramb>0.001</paramb>
      </hyper.pars>
      <ss.positive>TRUE</ss.positive>
      <variable.id>...</variable.id> 
      <variable.name>
        <variable.name>...</variable.name>
      </variable.name>
    </file>
  </inputs>
</assim.batch>

Here are some more details about the PDA tags:

<iter> Specifies the number of MCMC iterations to run. If continuing a previous MCMC, this is the number of additional iterations, which will be added to the previous total. Defaults to 100 if missing.
<chain> Specifies the number of MCMC chains to be run. Defaults to 2. Note that some of the post-processing functions will look for at least 2 chains for diagnostics.

<prior> Identifies the prior to be used for PDA. Can be one of either:

 + `<posterior.id>` A posterior ID in BETY specifying the posterior from a previous PEcAn analysis (e.g., meta-analysis or previous PDA) to be used as the prior for PDA. Defaults to the most recent relevant posterior in the database if omitted (and no `<path>` specified instead; see below).

 + `<path>` As an alternative to using a posterior ID, can specify a file path to either a `prior.distns.Rdata` or `post.distns.Rdata` file generated from an earlier analysis. Conceptually, using a posterior distribution as the prior for PDA is preferred, as this allows the multiple analyses to work together to iteratively constrain parameters. In practice, previous analyses may have over-constrained parameters to ranges that do not actually optimize model outputs, so using a less informative prior for PDA might yield better results. 

 NOTE: It is recommended to leave the `<prior>` tag empty if you are doing this for the first time. PEcAn workflow will handle it for you.

<param.names> The names of parameters to be constrained by assimilation, listed in individual <param> tags. These must be the standard names given in BETY DB (e.g. wood_density), not your model specific parameter names (e.g. wooddens), i.e. check out the id column of the trait.dictionary, i.e. :

data(trait.dictionary, package = "PEcAn.utils")

head(trait.dictionary[,c("id", "figid")])

##                         id                   figid
## 1           plant_min_temp          Plant Min Temp
## 2                      GDD     Growing Degree Days
## 3                    leafC            Leaf C conc.
## 4                 c2n_leaf                Leaf C:N
## 5 leaf_respiration_rate_m2   Leaf Respiration Rate
## 6  dark_respiration_factor Dark Respiration Factor

You can inspect your write.config.MODEL.R script for standard parameter names and their mappings to model parameter names.

       NOTE : `<param.names>` chunk should always be on your xml file regardless of the method you use in the PDA.

<inputs> Observation (constraint) data to be compared to the model. In principle, can be one or more datasets, specified in a variety of ways. In practice, the code is tested for assimilating flux datasets numerous times using the NEE/FC or LE variables.
- <file> Denotes a set of tags for a single input (constraint data). Would be repeated for multiple datasets/variables, e.g. in this case note the differences in <variable.id> and <variable.name> :
```
...
<inputs>
 <file>
  <input.id>15000000028</input.id>   
  <path>
   <path>/data/dbfiles/ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv</path> 
  </path>
  <likelihood>Laplace</likelihood>
  <variable.id>297</variable.id>
  <variable.name>
   <variable.name>NEE</variable.name>
   <variable.name>UST</variable.name>
  </variable.name>
 </file>   
 <file>
  <input.id>15000000028</input.id>   
  <path>
   <path>/data/dbfiles/ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv</path> 
  </path>
  <likelihood>Laplace</likelihood>
  <variable.id>298</variable.id>
  <variable.name>
   <variable.name>LE</variable.name>
   <variable.name>UST</variable.name>
  </variable.name>
 </file>
</inputs>
...
```
- <input.id> BETY input ID for looking up the input.
- <path> File path to the input. Both <id> and <path> of the observation data should be supplied for the PDA.
- <source> A standardized source of input data (e.g., Ameriflux). Not implemented yet, but the idea would be similar to the met workflow, PEcAn would be able to use standard data sources automatically where available. Only used if no <id> or <path> is given.
- <likelihood> Identifier for the likelihood to use. E.g., the Ameriflux NEE/FC and LE data use a Laplacian likelihood.
- <hyper.pars> Optional. Hyperparameters for your likelihood. E.g. Prior parameters of the precision for Gaussian likelihood. Defaults to scaled values if not provided.
- <ss.positive> When using the emulator, it is important to let the algorithm know that you have a likelihood whose sufficient statistics is zero bound.
- <variable.id> The BETY variable ID associated with this dataset. The idea is that specific preprocessing steps (e.g., estimating heteroskedastic error for tower NEE) would be associated with particular IDs. Could automate further by assigning default <likelihood> to variable.id values (allowing <likelihood> to be omitted from pecan.xml).
```
 NOTE : `<inputs>` chunk should always be on your xml file regardless of the method you use in the PDA.
```
<jump> Settings for the specifics of the proposal schema and adaptation functionality. It is used in the emulator approach, it’s possible to set adaptation options using bayesian.tools specific tags as well (see below for <bt.settings>).
- <ar.target> Target acceptance rate for the adaptive jump algorithm. Defaults to 0.5 if missing.
- <adapt> Number of iterations between jump variance adaptations. Defaults to floor(iter/10) if missing. If set equal to the number of <iter>, basically turns off the adaptation functionality, it is not recommended to turn-off this functionality for the emulator approach.
- <adj.min> Minimum factor by which to reduce jump variance when adapting. Prevents jump variances from degenerating to 0. Defaults to 0.1 if missing.

Method specific settings

Now we will go through 4 examples of how you could arrange your <assim.batch> tags and run a PDA in PEcAn:

Using existing PEcAn data pipelines (e.g. Ameriflux) and variables in the emulator-based calibration.
Using existing PEcAn data pipelines and variables in the BayesianTools-based (bruteforce) calibration.
Using your own pre-formatted data and variables in the emulator-based calibration.
Using your own pre-formatted data and variables in the BayesianTools-based (bruteforce) calibration.

1. PEcAn data pipeline with emulator

Assuming you already familiarized yourself with Demo 1 and Demo 2, and ran SIPNET at Niwot Ridge (US-NR1), in this section we will go through the pecan.xml tags for calibrating SIPNET with Ameriflux NEE data with the emulator approach at the US-NR1 site. The SIPNET PFT for this site was temperate.coniferous, let’s assume the top three parameters contributing to the model NEE output uncertainty (Demo 2) were psnTOpt,half_saturation_PAR, and Vm_low_temp, so we decided to constrain these in the PDA. Go to your workflow directory where you performed the uncertainty analysis, open up your pecan.CONFIGS.xml, insert the following tags at the top of your xml after the <pecan> tag (don’t forget to remove <--- comments), modify the paths according to your setup, and save the file as pecan.PDA.xml:

 <assim.batch>
  <method>emulator</method>
  <n.knot>60</n.knot>   <--- 20 x p (= no. of paramaters you target) 
  <iter>100000</iter>
  <chain>3</chain>
  <param.names>
   <temperate.coniferous>
    <param>psnTOpt</param>
    <param>half_saturation_PAR</param>
    <param>Vm_low_temp</param>
   </temperate.coniferous>
  </param.names>
  <jump>
   <adapt>250</adapt>
    <adj.min>0.1</adj.min>
    <ar.target>0.3</ar.target>
  </jump>
  <inputs>
   <file>
    <input.id>1000011238</input.id>   <--- your input ID (of the DB record of where the AMF csv file is)
    <path>
     <path>/.../AmerifluxLBL_site_0-772/AMF_US-NR1_BASE_HH_12-5.csv</path>  <--- your path
    </path>
    <likelihood>Laplace</likelihood>
    <variable.id>1000000042</variable.id>
    <variable.name>
     <variable.name>FC</variable.name>
     <variable.name>UST</variable.name>
    </variable.name>
   </file>
  </inputs>
 </assim.batch>

For reference, these are the tags for a similar workflow for SIPNET at ICOS-NEE (FI-Var site) combination. Note the differences in PFT name, input ID and path, variable ID and name:

 <assim.batch>
  <method>emulator</method>
  <n.knot>60</n.knot>   
  <iter>100000</iter>
  <chain>3</chain>
  <param.names>
   <boreal.coniferous>
    <param>psnTOpt</param>
    <param>half_saturation_PAR</param>
    <param>Vm_low_temp</param>
   </boreal.coniferous>
  </param.names>
  <jump>
   <adapt>250</adapt>
    <adj.min>0.1</adj.min>
    <ar.target>0.3</ar.target>
  </jump>
  <inputs>
   <file>
    <input.id>15000000028</input.id>  
    <path>
     <path>/.../ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv</path>  
    </path>
    <likelihood>Laplace</likelihood>
    <variable.id>297</variable.id>
    <variable.name>
     <variable.name>NEE</variable.name>
     <variable.name>UST</variable.name>
    </variable.name>
   </file>
  </inputs>
 </assim.batch>

Now you can read in your pecan.PDA.xml and run PDA (This will take some time, go grab a drink. The first round especially takes time because this is when the constraint data is loaded and its effective sample size is calculated):

settings <- read.settings("pecan.PDA.xml")
settings_PDA_round1 <- pda.emulator(settings)

The PDA workflow has populated you settings list object with couple of relevant tags. Now settings_PDA_round1 has these additional tags (note that this post-PDA settings list is also automatically written to your workflow directory as pecan.pdaR1_ENS_ID.xml). These tags look like this:

  <prior>
   <prior.id>
    <boreal.coniferous>INITIAL_PDA_PRIOR_ID</boreal.coniferous>
   </prior.id>
  </prior>
  <ensemble.id>
   <id>R1_ENS_ID</id>
  </ensemble.id>
  <emulator.path>/.../PEcAn_WORKFLOWID/emulator.pdaR1_ENS_ID.Rdata</emulator.path>
  <ss.path>/.../PEcAn_WORKFLOWID/ss.pdaR1_ENS_ID.Rdata</ss.path>
  <mcmc.path>/.../PEcAn_WORKFLOWID/mcmc.list.pdaR1_ENS_ID.Rdata</mcmc.path>
  <resume.path>/.../PEcAn_WORKFLOWID/resume.pdaR1_ENS_ID.Rdata</resume.path>
  <round_counter>1</round_counter>
  <extension>round</extension>

<prior.id> This is the initial prior that went into your first PDA round. Even though the PDA workflow updates the <posteriorid> under your PFT block, if you do another emulator round only the initial prior will be used in the MCMC (otherwise you use data twice!). Hence, we keep track of its ID in the xml. But it is still perfectly fine to propose the new training points for the next emulator round from the posterior of the previous round, so that you can refine your emulator surface, that’s when the PDA workflow uses <posteriorid> under the hood.
<ensemble.id> Each emulator round is assigned a new ensemble ID. You can see this ID in every file recorded for the associated emulator round, e.g.: history.pdaR1_ENS_ID.Rdata, pecan.pdaR1_ENS_ID.xml, post.distns.pda.YOURPFT_R1_ENS_ID.Rdata etc.

Rest of the tags are for the emulator workflow to be able to find previous rounds’ information and understand what to do in the next round. The emulator workflow automatically adds <extension>round</extension> because if you forget to add this tag manually after the first round, you will just be repeating the first round while you almost certainly intend to do an iterative round if you are using the settings of your previous round. Forgetting to add this tag has happened enough times that we now add this tag automatically. But note that in PDA we have a second type of extension which is <extension>longer</extension>. This is to run your MCMC chain longer in case it didn’t converge. This will be more of an issue for the bruteforce calibration (see below), but it is also possible to run your MCMC chains longer if your emulator-based MCMC also didn’t converge:

<extension> Specifies the extension type of additional emulator runs, should be skipped if this a first emulator run of its own. Otherwise it is possible to extend emulator PDA runs in two ways:
- <extension>longer</extension> using the same emulator, this extension run takes the MCMC sampling from where it was left in the previous emulator run and runs a longer MCMC chain.
- <extension>round</extension> this extension run proposes new points in the parameter space in addition to the previous ones, and builds a new emulator including these additional points for a new MCMC sampling. These new points can come from both your inital PDA prior and the posterior of your first round of emulator run. You can determine the percentage of new knots coming from the posterior of your previous run in the <knot.par> tag. If you leave it empty, 90% of your new points will be drawn from the posterior of your previous run by default.

Now you can either pass the post PDA settings in your environment to the pda.emulator() function again (yes you can put this in a loop) or you can re-read your pecan.pdaR1_ENS_ID.xml at any time to perform an iterative emulator round.

### settings_PDA_round1 <- read.settings("pecan.pdaR1_ENS_ID.xml")
settings_PDA_round2 <- pda.emulator(settings_PDA_round1)

Before moving on to the BayesianTools case, here is what your pecan.PDA.xml could look like for a case where you constrain parameters of more than one model PFT with more than one Ameriflux data stream, note the ordering in <assim.batch><param.names> and <pfts>:

...
 <assim.batch>
  ...
  <param.names>
   <PFT1>
    <param>PFT1_PDA_param1</param>
    <param>PFT1_PDA_param2</param>
   </PFT1>
   <PFT2>
    <param>PFT2_PDA_param1</param>
    <param>PFT2_PDA_param2</param>
    <param>PFT2_PDA_param3</param>
   </PFT2>
  </param.names>
  ...
  <inputs>
   <file>
    <input.id>YOURINPUTID</input.id>   
    <path>
     <path>/YOURPATH/AmerifluxLBL_site_***/AMF_***_BASE_HH_***.csv</path> 
    </path>
    <likelihood>Laplace</likelihood>
    <variable.id>297</variable.id>
    <variable.name>
     <variable.name>NEE</variable.name>
     <variable.name>UST</variable.name>
    </variable.name>
   </file>   
   <file>
    <input.id>YOURINPUTID</input.id>   
    <path>
     <path>/YOURPATH/AmerifluxLBL_site_***/AMF_***_BASE_HH_***.csv</path> 
    </path>
    <likelihood>Laplace</likelihood>
    <variable.id>298</variable.id>
    <variable.name>
     <variable.name>LE</variable.name>
     <variable.name>UST</variable.name>
    </variable.name>
   </file>
  </inputs>
  ...
 </assim.batch>
 ...
 <pfts>
  <pft>
   <name>PFT1</name>
   <outdir>/YOUR/WORKFLOWS/PATH/PEcAn_WORKFLOWID/pft/PFT1</outdir>
   <posteriorid>PFT1_POSTID</posteriorid>
  </pft>
  <pft>
   <name>PFT2</name>
   <outdir>/YOUR/WORKFLOWS/PATH/PEcAn_WORKFLOWID/pft/PFT2</outdir>
   <posteriorid>PFT2_POSTID</posteriorid>
  </pft>
 </pfts>
 ...

2. PEcAn data pipeline with BayesianTools

Using the bruteforce PDA approach with one of the MCMC-flavors as implemented in the BayesianTools R-package requires slightly different tags to configure its setup. pda.bayesian.tools() would look for sampler specific settings that can be passed through the pecan.xml as a block under the <bt.settings> tag. Currently, the available samplers in the BayesianTools package are:

Metropolis
- Standard MH-MCMC
- With pre-optimization
- Adaptive MCMC
- Delayed rejection
- Gibbs updating
M : Another implementation of standard MH-MCMC.
AM : Adaptive Metropolis
DR : Delayed Rejection
DRAM : Delayed Rejection Adaptive Metropolis
DE : Differential Evolution
DEzs : Differential Evolution with a snooker updater
DREAM : Differential Evolution Adaptive Metropolis
DREAMzs : Differential Evolution Adaptive Metropolis with a snooker updater
Twalk : “traverse” or “thoughtful” walk, a general purpose sampling algorithm
SMC : Sequential Monte Carlo

Note that it is not possible to use some of the samplers in this package for univariate cases. The ones that you can use for univariate cases are: “Metropolis”, “DE”, “DEzs” and “Twalk”. Some of the samplers are also restartable : “Metropolis”, “DE”, “DEzs”, “DREAM”, “DREAMzs”, “Twalk”. For further details please see the BayesianTools documentation which itself has a helpful vignette.

The name of the chosen sampler would be passed under <sampler> tag within the <bt.settings> block. E.g. for Metropolis variations:

  ...
  <method>bayesian.tools</method>
  <bt.settings>
    <iter>10000</iter>
    <sampler>Metropolis</sampler>
    <DRlevels>1</DRlevels>                               <-- set 2 for Delayed Rejection
    <optimize>FALSE</optimize>                           <-- set TRUE for pre-optimization
    <adapt>FALSE</adapt>                                 <-- set TRUE for Adaptive Metropolis
    <adaptationNotBefore>500</adaptationNotBefore>       <-- set for Adaptive Metropolis
  </bt.settings>
  ...

The rest of the <assim.batch> block is the same (except you can now omit the <jump> tag as it takes no effect in this case). While all the algorithms can be accessed by the “Metropolis” sampler in BayesianTools package by specifying the sampler’s settings, it could be easier to directly pass the sampler name explicitly. Below is an example for DREAMzs (recommended) algorithm:

 <assim.batch>
   <method>bayesian.tools</method>
   <bt.settings>
     <sampler>DREAMzs</sampler>
     <iter>10000</iter>
     <chain>3</chain>
   </bt.settings>
  <param.names>
   <soil>
    <param>som_respiration_rate</param>
   </soil>
   <boreal.coniferous>
    <param>psnTOpt</param>
    <param>half_saturation_PAR</param>
    <param>Vm_low_temp</param>
   </boreal.coniferous>
  </param.names>
  <inputs>
   <file>
    <input.id>15000000028</input.id>   
    <path>
     <path>/data/dbfiles/ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv</path> 
    </path>
    <likelihood>Laplace</likelihood>
    <variable.id>297</variable.id>
    <variable.name>
     <variable.name>NEE</variable.name>
     <variable.name>UST</variable.name>
    </variable.name>
   </file>
  </inputs>
 </assim.batch>

Assuming you save these tags under a pecan.PDA.xml

settings <- read.settings("pecan.PDA.xml")
settings_postPDA <- pda.bayesian.tools(settings)

In case of no convergence and/or if you would like to run the same chains longer, you can add the extension option:

settings_postPDA$assim.batch$extension <- 'longer'
settings_postPDA_longer <- pda.bayesian.tools(settings_postPDA)

3. Own data with emulator

Sometimes you might want to use your own local data (e.g. from a tower that is not necessarily part of any flux tower network PEcAn has access to, or some chamber measurements etc.) for constraining your model. Although we still do recommend using formal PEcAn data registration and processing pipelines, sometimes you may not have access to the DB and/or may want to bypass other data processing steps to perform a quick PDA analysis to plan ahead. In that case, you can use external.* arguments of the pda.emulator() function. You can prepare the external.* arguments manually, but the easiest way to generate them is to use pda.generate.externals() function that you can find under the script with the same name. pda.emulator() function accepts the following external.* arguments:

external.data This is a list containing your constraining variables in the format expected by the PDA functions (see also pda.generate.externals() function). The length of this list is the number of your data streams (e.g. if you are using NEE and LE in calibration, it will have two sublists). Hint: You may want to perform case #1 and load the history.pda file to check the format of the inputs object out, because if external.data is passed to the pda.emulator() it will simply be assined to this variable, inputs <- external.data. To be able to prepare external.data with the pda.generate.externals() function you need your data as a(n ordered) list where each sublist corresponds to a data frame of your constraining variable with two columns, variable name - posix.
external.formats This is a list of variable names in the format expected by the PDA functions (see also pda.generate.externals() function). The length of this list is the number of your data streams (e.g. if you are using NEE and LE in calibration, it will have two sublists). Depending on expressions provided in this object, PDA functions are capable of performing simple derivations on data. Hence, this list has a slightly more complex structure than a simple list of names, but don’t worry about it. To be able to prepare external.formats with the pda.generate.externals() it is enough to pass varn vector that has the standard variable names (when in doubt, check the standard variable names under /base/utils/data/standard_vars.csv).
external.priors This is a list of pecan probability distribution dataframes to be used as PDA priors, one sublist per PFT, e.g.:

[[1]]
                       distn parama  paramb  n
frozenSoilEff           beta  1.000    1.00 NA
soilRespMoistEffect     unif  0.000    2.00 NA
...

[[2]]
                              distn      parama       paramb   n
growth_resp_factor             beta       2.630      6.5e+00   0
root_respiration_rate          norm       8.700      9.5e-01  NA
...

external.knots This is a list of dataframes containing emulator training points. Only pass it if you want to build the emulator on specific knots over which you want to have control, otherwise knots are generated from the PDA priors using Latin Hypercube Sampling. Each sublist should be a nknot x nPFTparam dataframe, e.g. if you are building an emulator with 250 knots and your PFT has 30 parameters, this will be a 250 x 30 dataframe where each row is a parameter vector. Therefore only the columns of parameters that you are targeting in PDA should have varying parameter values. E.g. if you are calibrating the psnTOpt parameter, it would look like this:

> external.knots
$temperate.deciduous
        ...  half_saturation_PAR fine_root_respiration_Q10   psnTOpt  AmaxFrac   ...
  [1,]  ...                 15.5                       3.2  24.83626 0.8605507   ...   
  [2,]  ...                 15.5                       3.2  25.18056 0.8605507   ...
  [3,]  ...                 15.5                       3.2  26.10204 0.8605507   ...
  [4,]  ...                 15.5                       3.2  25.03353 0.8605507   ...
   ...  ...                 ...                        ...       ...       ...   ...

Once you prepare external.data and external.format, now you can omit the <input.id></input.id> and <path></path> tags in the <assim.batch> block (they will be ignored even if you pass them when external.data is not NULL:

 <assim.batch>
  <method>emulator</method>
  <n.knot>60</n.knot>   
  <iter>100000</iter>
  <chain>3</chain>
  <param.names>
   <boreal.coniferous>
    <param>psnTOpt</param>
    <param>half_saturation_PAR</param>
    <param>Vm_low_temp</param>
   </boreal.coniferous>
  </param.names>
  <jump>
   <adapt>250</adapt>
    <adj.min>0.1</adj.min>
    <ar.target>0.3</ar.target>
  </jump>
  <inputs>
   <file>
    <likelihood>Laplace</likelihood>
    <variable.id>297</variable.id>
    <variable.name>
     <variable.name>NEE</variable.name>
     <variable.name>UST</variable.name>
    </variable.name>
   </file>
  </inputs>
 </assim.batch>

Save and read your pecan.PDA.xml accordingly. Next, pass the external data and formats you prepared explicitly to the pda.emulator() function:

settings <- read.settings("pecan.PDA.xml")
pda.externals <-  pda.generate.externals(external.data = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE) 
settings_postPDA <- pda.emulator(settings, external.data = pda.externals$external.data, external.formats = pda.externals$external.formats)

Note that this way we provided external constraint data and its format information to the PDA functions but we are still using database connection. If you want to use PDA functions with no database connection, then you need to provide prior list to the function as an external object as well and run it in remote = TRUE mode. In that case, it is also recommended to pass an ensemble.id as an identifier for your PDA run (normally this is given by the PDA workflow), it can even be a string:

settings <- read.settings("pecan.PDA.xml")
prior.list <-  list(data.frame(distn  = c("unif", "unif", "norm"), 
                               parama = c(5, 4, 0),  
                               paramb = c(40, 27, 3),   
                               n = rep(NA, 3), 
                               row.names = c("psnTOpt", "half_saturation_PAR", "Vm_low_temp")))
pda.externals <-  pda.generate.externals(external.data    = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE, external.priors  = TRUE, prior.list = prior.list) 
settings_postPDA <- pda.emulator(settings, external.data    = pda.externals$external.data, 
                                           external.formats = pda.externals$external.formats,
                                           external.priors  = pda.externals$external.priors, 
                                           remote = TRUE, ensemble.id = "PDA_VIGNETTE")

Two more notes here before moving onto the next case. Above, in the prior.list object, we only passed prior distribution for the three parameters we target. In this case rest of the model parameters will be kept as their defaults (whichever way it is implemented in the model package), they won’t be kept at the medians of their prior range as it is normally the case in PDA because we haven’t passed a prior (range) for them. If you don’t want that to happen, pass a more comprehensive list or try to use prior (or meta analysis posterior) objects generated by PEcAn workflows. Finally, we haven’t used the external.knots argument yet. We only use it when we need to overwrite and enforce the training points for the emulator and don’t necessarily need to populate this argument when doing PDA without database connection. We leave testing with external.knots as an exercise.

4. Own data with BayesianTools

It’s all very much the same as the cases above, in combination of #2 (prepare BayesianTools specific xml tags) and #3 (prepare external arguments).

 <assim.batch>
   <method>bayesian.tools</method>
   <bt.settings>
     <sampler>DREAMzs</sampler>
     <iter>10000</iter>
     <chain>3</chain>
   </bt.settings>
  <param.names>
   <boreal.coniferous>
    <param>psnTOpt</param>
    <param>half_saturation_PAR</param>
    <param>Vm_low_temp</param>
   </boreal.coniferous>
  </param.names>
  <inputs>
   <file>
    <likelihood>Laplace</likelihood>
    <variable.id>297</variable.id>
    <variable.name>
     <variable.name>NEE</variable.name>
     <variable.name>UST</variable.name>
    </variable.name>
   </file>
  </inputs>
 </assim.batch>

Save and read your pecan.PDA.xml accordingly. Next, pass the external data and formats you prepared explicitly to the pda.bayesian.tools() function:

settings <- read.settings("pecan.PDA.xml")
pda.externals <-  pda.generate.externals(external.data    = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE, external.priors  = TRUE, prior.list = prior.list) 
settings_postPDA <- pda.bayesian.tools(settings, external.data    = pda.externals$external.data, 
                                           external.formats = pda.externals$external.formats,
                                           external.priors  = pda.externals$external.priors, 
                                           remote = TRUE, ensemble.id = "PDA_VIGNETTE")