--- title: "PEcAn.assim.batch Vignette" author: - "Istem Fer" - "Ryan Kelly" date: "January 5, 2022" output: html_vignette vignette: > %\VignetteIndexEntry{PEcAn.assim.batch Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Install package from Github Only needs to be done the first time : ```{r, eval=FALSE} library(devtools) install_github("PecanProject/pecan", subdir="all") ``` ## Parameter Data Assimilation (a.k.a. calibration) in PEcAn Currently, there are three ways of doing Parameter Data Assimilation (PDA) in PEcAn : - bayesian.tools (bruteforce algorithms) - emulator - multi-site hierarchical PDA (HPDA) with emulators **Note that, there used to be Metropolis-Hastings Markov Chain Monte Carlo (MCMC) algorithms implemented natively in PEcAn, namely `/modules/assim.batch/R/pda.mcmc.R` and `pda.mcmc.bs.R`. These are deprecated and they will no longer be maintained.** ### Which one to use? **bayesian.tools :** The `pda.bayesian.tools.R` is the main script of this PDA workflow that performs a bruteforce (MCMC methods that require running your models at least couple of thousand of times) calibration. This script uses the `BayesianTools` R-package (Hartig, Minuno and Paul, 2019) that includes MCMC and SMC samplers and other tools for Bayesian parameter calibration with external models. If you choose *bayesian.tools* option, PEcAn framework will hand the PDA calculations over to the `BayesianTools` package. Although this package includes algorithms that are designed to explore the parameter space more efficiently than the traditional MH-MCMC, you would still need a relatively faster model to use these algorithms. Note that the `BayesianTools` R-package itself is constantly under development, and its integration with PEcAn may not be able to leverage all `BayesianTools` functionality. **emulator :** The `pda.emulator.R` is the main script of this PDA workflow. When a model is slow, it is practically not possible to run it hundreds of thousands of times sequentially to explore the parameter space and draw enough samples to converge the target distribution with bruteforce algorithms. Instead, we can run the model for a relatively smaller number of times with parameter combinations that have been carefully chosen to give good coverage of parameter space. Then we can interpolate the likelihood calculated for each of those runs to get a surface that "emulates" the true likelihood and perform regular MCMC (just like the "bruteforce" approach), except instead of actually running the model on every iteration to obtain a likelihood value, this time we will just get an approximation from the likelihood emulator. A lot more details about the emulator approach can be found in [Fer et al. 2018.](https://doi.org/10.5194/bg-15-5801-2018) **Multi-site HPDA :** The `pda.emulator.ms.R` is the main script of this PDA workflow which performs a multi-site hierarchical Bayesian calibration using the emulator approach. In the multi-site hierarhical calibration, global posteriors are fitted simultaneously using an additional Gibbs update step, and used as the new prior hyperparameters in the next iteration of MCMC to obtain site-level hierarchical posteriors. This PDA workflow has its own vignette (see `/modules/assim.batch/vignettes/MultiSitePDAVignette.Rmd`) and will not be one of the examples here. For more information also see [Fer et al. 2021 bioRxiv preprint](https://doi.org/10.1101/2021.04.28.441243). ## Adding PDA tags to pecan.xml The easiest way to use PEcAn's parameter data assimilation module is to add an `` block to pecan.xml, load the file with `read.settings`, and pass the resulting settings object to `pda.bayesian.tools()` or `pda.emulator()`. There are some differences in the settings for using different PDA methods (we will go through test cases, at this point just familiarize yourself with the general pattern), but here is an example `` block : ``` 100000 emulator <--- WHICH PDA APPROACH 3 <--- YOUR PFT(S) PARAM1 <--- PARAMETER(S) YOU TARGET IN PDA PARAM2 250 0.1 0.3 ... <--- YOUR CONSTRAINT ID FROM BETY DB ... <--- YOUR CONSTRAINT FILE PATH(S) ... Laplace <--- LIKELIHOOD FORM FOR THIS CONSTRAINT 298 <--- (CONSTRAINT) VARIABLE ID FROM BETY DB LE <--- (CONSTRAINT) VARIABLE UST <--- HELPER VARIABLE SPECIFIC TO FLUX CASE ... ... multipGauss 0.001 <--- HYPERPARAMETERS SPECIFIC TO GAUSSIAN FORM 0.001 TRUE ... ... ``` Here are some more details about the PDA tags: * `` Specifies the number of MCMC iterations to run. If continuing a previous MCMC, this is the number of additional iterations, which will be added to the previous total. Defaults to 100 if missing. * `` Specifies the number of MCMC chains to be run. Defaults to 2. Note that some of the post-processing functions will look for at least 2 chains for diagnostics. * `` Identifies the prior to be used for PDA. Can be one of either: + `` A posterior ID in BETY specifying the posterior from a previous PEcAn analysis (e.g., meta-analysis or previous PDA) to be used as the prior for PDA. Defaults to the most recent relevant posterior in the database if omitted (and no `` specified instead; see below). + `` As an alternative to using a posterior ID, can specify a file path to either a `prior.distns.Rdata` or `post.distns.Rdata` file generated from an earlier analysis. Conceptually, using a posterior distribution as the prior for PDA is preferred, as this allows the multiple analyses to work together to iteratively constrain parameters. In practice, previous analyses may have over-constrained parameters to ranges that do not actually optimize model outputs, so using a less informative prior for PDA might yield better results. NOTE: It is recommended to leave the `` tag empty if you are doing this for the first time. PEcAn workflow will handle it for you. * `` The names of parameters to be constrained by assimilation, listed in individual `` tags. *These must be the standard names given in BETY DB (e.g. wood_density), not your model specific parameter names (e.g. wooddens)*, i.e. check out the `id` column of the `trait.dictionary`, i.e. : ```{r} data(trait.dictionary, package = "PEcAn.utils") head(trait.dictionary[,c("id", "figid")]) ``` You can inspect your `write.config.MODEL.R` script for standard parameter names and their mappings to model parameter names. NOTE : `` chunk should always be on your xml file regardless of the method you use in the PDA. * `` Observation (constraint) data to be compared to the model. In principle, can be one or more datasets, specified in a variety of ways. In practice, the code is tested for assimilating flux datasets numerous times using the NEE/FC or LE variables. + `` Denotes a set of tags for a single input (constraint data). Would be repeated for multiple datasets/variables, e.g. in this case note the differences in `` and `` : ``` ... 15000000028 /data/dbfiles/ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv Laplace 297 NEE UST 15000000028 /data/dbfiles/ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv Laplace 298 LE UST ... ``` + `` BETY input ID for looking up the input. + `` File path to the input. Both `` and `` of the observation data should be supplied for the PDA. + `` A standardized source of input data (e.g., Ameriflux). Not implemented yet, but the idea would be similar to the met workflow, PEcAn would be able to use standard data sources automatically where available. Only used if no `` or `` is given. + `` Identifier for the likelihood to use. E.g., the Ameriflux NEE/FC and LE data use a Laplacian likelihood. + `` Optional. Hyperparameters for your likelihood. E.g. Prior parameters of the precision for Gaussian likelihood. Defaults to scaled values if not provided. + `` When using the emulator, it is important to let the algorithm know that you have a likelihood whose sufficient statistics is zero bound. + `` The BETY variable ID associated with this dataset. The idea is that specific preprocessing steps (e.g., estimating heteroskedastic error for tower NEE) would be associated with particular IDs. Could automate further by assigning default `` to variable.id values (allowing `` to be omitted from pecan.xml). NOTE : `` chunk should always be on your xml file regardless of the method you use in the PDA. * `` Settings for the specifics of the proposal schema and adaptation functionality. It is used in the `emulator` approach, it's possible to set adaptation options using `bayesian.tools` specific tags as well (see below for ``). + `` Target acceptance rate for the adaptive jump algorithm. Defaults to 0.5 if missing. + `` Number of iterations between jump variance adaptations. Defaults to `floor(iter/10)` if missing. If set equal to the number of ``, basically turns off the adaptation functionality, it is not recommended to turn-off this functionality for the emulator approach. + `` Minimum factor by which to reduce jump variance when adapting. Prevents jump variances from degenerating to 0. Defaults to 0.1 if missing. ## Method specific settings Now we will go through 4 examples of how you could arrange your `` tags and run a PDA in PEcAn: 1. Using existing PEcAn data pipelines (e.g. Ameriflux) and variables in the emulator-based calibration. 2. Using existing PEcAn data pipelines and variables in the BayesianTools-based (bruteforce) calibration. 3. Using your own pre-formatted data and variables in the emulator-based calibration. 4. Using your own pre-formatted data and variables in the BayesianTools-based (bruteforce) calibration. ## 1. PEcAn data pipeline with emulator Assuming you already familiarized yourself with Demo 1 and Demo 2, and ran SIPNET at Niwot Ridge (US-NR1), in this section we will go through the `pecan.xml` tags for calibrating SIPNET with Ameriflux NEE data with the emulator approach at the US-NR1 site. The SIPNET PFT for this site was `temperate.coniferous`, let's assume the top three parameters contributing to the model NEE output uncertainty (Demo 2) were `psnTOpt`,`half_saturation_PAR`, and `Vm_low_temp`, so we decided to constrain these in the PDA. Go to your workflow directory where you performed the uncertainty analysis, open up your `pecan.CONFIGS.xml`, insert the following tags at the top of your xml after the `` tag (don't forget to remove `<---` comments), modify the paths according to *your* setup, and save the file as `pecan.PDA.xml`: ``` emulator 60 <--- 20 x p (= no. of paramaters you target) 100000 3 psnTOpt half_saturation_PAR Vm_low_temp 250 0.1 0.3 1000011238 <--- your input ID (of the DB record of where the AMF csv file is) /.../AmerifluxLBL_site_0-772/AMF_US-NR1_BASE_HH_12-5.csv <--- your path Laplace 1000000042 FC UST ``` For reference, these are the tags for a similar workflow for SIPNET at ICOS-NEE (FI-Var site) combination. Note the differences in PFT name, input ID and path, variable ID and name: ``` emulator 60 100000 3 psnTOpt half_saturation_PAR Vm_low_temp 250 0.1 0.3 15000000028 /.../ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv Laplace 297 NEE UST ``` Now you can read in your `pecan.PDA.xml` and run PDA (This will take some time, go grab a drink. The first round especially takes time because this is when the constraint data is loaded and its effective sample size is calculated): ``` settings <- read.settings("pecan.PDA.xml") settings_PDA_round1 <- pda.emulator(settings) ``` The PDA workflow has populated you `settings` list object with couple of relevant tags. Now `settings_PDA_round1` has these additional tags (note that this post-PDA `settings` list is also automatically written to your workflow directory as `pecan.pdaR1_ENS_ID.xml`). These tags look like this: ``` INITIAL_PDA_PRIOR_ID R1_ENS_ID /.../PEcAn_WORKFLOWID/emulator.pdaR1_ENS_ID.Rdata /.../PEcAn_WORKFLOWID/ss.pdaR1_ENS_ID.Rdata /.../PEcAn_WORKFLOWID/mcmc.list.pdaR1_ENS_ID.Rdata /.../PEcAn_WORKFLOWID/resume.pdaR1_ENS_ID.Rdata 1 round ``` * `` This is the initial prior that went into your first PDA round. Even though the PDA workflow updates the `` under your PFT block, if you do another emulator round only the initial prior will be used in the MCMC (otherwise you use data twice!). Hence, we keep track of its ID in the xml. But it is still perfectly fine to propose the new training points for the next emulator round from the posterior of the previous round, so that you can refine your emulator surface, that's when the PDA workflow uses `` under the hood. * `` Each emulator round is assigned a new ensemble ID. You can see this ID in every file recorded for the associated emulator round, e.g.: `history.pdaR1_ENS_ID.Rdata`, `pecan.pdaR1_ENS_ID.xml`, `post.distns.pda.YOURPFT_R1_ENS_ID.Rdata` etc. Rest of the tags are for the emulator workflow to be able to find previous rounds' information and understand what to do in the next round. The emulator workflow automatically adds `round` because if you forget to add this tag manually after the first round, you will just be repeating the first round while you almost certainly intend to do an iterative round if you are using the settings of your previous round. Forgetting to add this tag has happened enough times that we now add this tag automatically. But note that in PDA we have a second type of extension which is `longer`. This is to run your MCMC chain longer in case it didn't converge. This will be more of an issue for the bruteforce calibration (see below), but it is also possible to run your MCMC chains longer if your emulator-based MCMC also didn't converge: * `` Specifies the extension type of additional emulator runs, should be skipped if this a first emulator run of its own. Otherwise it is possible to extend emulator PDA runs in two ways: + `longer` using the same emulator, this extension run takes the MCMC sampling from where it was left in the previous emulator run and runs a longer MCMC chain. + `round` this extension run proposes new points in the parameter space in addition to the previous ones, and builds a new emulator including these additional points for a new MCMC sampling. These new points can come from both your inital PDA prior and the posterior of your first round of emulator run. You can determine the percentage of new knots coming from the posterior of your previous run in the `` tag. If you leave it empty, 90% of your new points will be drawn from the posterior of your previous run by default. Now you can either pass the post PDA settings in your environment to the `pda.emulator()` function again (yes you can put this in a loop) or you can re-read your `pecan.pdaR1_ENS_ID.xml` at any time to perform an iterative emulator round. ``` ### settings_PDA_round1 <- read.settings("pecan.pdaR1_ENS_ID.xml") settings_PDA_round2 <- pda.emulator(settings_PDA_round1) ``` Before moving on to the `BayesianTools` case, here is what your `pecan.PDA.xml` could look like for a case where you constrain parameters of more than one model PFT with more than one Ameriflux data stream, note the ordering in `` and ``: ``` ... ... PFT1_PDA_param1 PFT1_PDA_param2 PFT2_PDA_param1 PFT2_PDA_param2 PFT2_PDA_param3 ... YOURINPUTID /YOURPATH/AmerifluxLBL_site_***/AMF_***_BASE_HH_***.csv Laplace 297 NEE UST YOURINPUTID /YOURPATH/AmerifluxLBL_site_***/AMF_***_BASE_HH_***.csv Laplace 298 LE UST ... ... PFT1 /YOUR/WORKFLOWS/PATH/PEcAn_WORKFLOWID/pft/PFT1 PFT1_POSTID PFT2 /YOUR/WORKFLOWS/PATH/PEcAn_WORKFLOWID/pft/PFT2 PFT2_POSTID ... ``` ## 2. PEcAn data pipeline with BayesianTools Using the bruteforce PDA approach with one of the MCMC-flavors as implemented in the BayesianTools R-package requires slightly different tags to configure its setup. **pda.bayesian.tools()** would look for sampler specific settings that can be passed through the pecan.xml as a block under the `` tag. Currently, the available samplers in the BayesianTools package are: * Metropolis + Standard MH-MCMC + With pre-optimization + Adaptive MCMC + Delayed rejection + Gibbs updating * M : Another implementation of standard MH-MCMC. * AM : Adaptive Metropolis * DR : Delayed Rejection * DRAM : Delayed Rejection Adaptive Metropolis * DE : Differential Evolution * DEzs : Differential Evolution with a snooker updater * DREAM : Differential Evolution Adaptive Metropolis * DREAMzs : Differential Evolution Adaptive Metropolis with a snooker updater * Twalk : "traverse" or "thoughtful" walk, a general purpose sampling algorithm * SMC : Sequential Monte Carlo Note that it is not possible to use some of the samplers in this package for univariate cases. The ones that you can use for univariate cases are: "Metropolis", "DE", "DEzs" and "Twalk". Some of the samplers are also restartable : "Metropolis", "DE", "DEzs", "DREAM", "DREAMzs", "Twalk". For further details please see the BayesianTools documentation which itself has a helpful vignette. The name of the chosen sampler would be passed under `` tag within the `` block. E.g. for Metropolis variations: ``` ... bayesian.tools 10000 Metropolis 1 <-- set 2 for Delayed Rejection FALSE <-- set TRUE for pre-optimization FALSE <-- set TRUE for Adaptive Metropolis 500 <-- set for Adaptive Metropolis ... ``` The rest of the `` block is the same (except you can now omit the `` tag as it takes no effect in this case). While all the algorithms can be accessed by the “Metropolis” sampler in BayesianTools package by specifying the sampler’s settings, it could be easier to directly pass the sampler name explicitly. Below is an example for DREAMzs (recommended) algorithm: ``` bayesian.tools DREAMzs 10000 3 som_respiration_rate psnTOpt half_saturation_PAR Vm_low_temp 15000000028 /data/dbfiles/ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv Laplace 297 NEE UST ``` Assuming you save these tags under a `pecan.PDA.xml` ``` settings <- read.settings("pecan.PDA.xml") settings_postPDA <- pda.bayesian.tools(settings) ``` In case of no convergence and/or if you would like to run the same chains longer, you can add the extension option: ``` settings_postPDA$assim.batch$extension <- 'longer' settings_postPDA_longer <- pda.bayesian.tools(settings_postPDA) ``` ## 3. Own data with emulator Sometimes you might want to use your own local data (e.g. from a tower that is not necessarily part of any flux tower network PEcAn has access to, or some chamber measurements etc.) for constraining your model. Although we still do recommend using formal PEcAn data registration and processing pipelines, sometimes you may not have access to the DB and/or may want to bypass other data processing steps to perform a quick PDA analysis to plan ahead. In that case, you can use `external.*` arguments of the `pda.emulator()` function. You can prepare the `external.*` arguments manually, but the easiest way to generate them is to use `pda.generate.externals()` function that you can find under the script with the same name. `pda.emulator()` function accepts the following `external.*` arguments: * `external.data` This is a list containing your constraining variables in the format expected by the PDA functions (see also `pda.generate.externals()` function). The length of this list is the number of your data streams (e.g. if you are using NEE and LE in calibration, it will have two sublists). *Hint: You may want to perform case #1 and load the `history.pda` file to check the format of the `inputs` object out, because if `external.data` is passed to the `pda.emulator()` it will simply be assined to this variable, `inputs <- external.data`.* To be able to prepare `external.data` with the `pda.generate.externals()` function you need your data as a(n ordered) list where each sublist corresponds to a data frame of your constraining variable with two columns, variable name - posix. * `external.formats` This is a list of variable names in the format expected by the PDA functions (see also `pda.generate.externals()` function). The length of this list is the number of your data streams (e.g. if you are using NEE and LE in calibration, it will have two sublists). Depending on expressions provided in this object, PDA functions are capable of performing simple derivations on data. Hence, this list has a slightly more complex structure than a simple list of names, but don't worry about it. To be able to prepare `external.formats` with the `pda.generate.externals()` it is enough to pass `varn` vector that has the standard variable names (when in doubt, check the standard variable names under `/base/utils/data/standard_vars.csv`). * `external.priors` This is a list of pecan probability distribution dataframes to be used as PDA priors, one sublist per PFT, e.g.: ``` [[1]] distn parama paramb n frozenSoilEff beta 1.000 1.00 NA soilRespMoistEffect unif 0.000 2.00 NA ... [[2]] distn parama paramb n growth_resp_factor beta 2.630 6.5e+00 0 root_respiration_rate norm 8.700 9.5e-01 NA ... ``` * `external.knots` This is a list of dataframes containing emulator training points. Only pass it if you want to build the emulator on specific knots over which you want to have control, otherwise knots are generated from the PDA priors using Latin Hypercube Sampling. Each sublist should be a `nknot x nPFTparam` dataframe, e.g. if you are building an emulator with 250 knots and your PFT has 30 parameters, this will be a 250 x 30 dataframe where each row is a parameter vector. Therefore only the columns of parameters that you are targeting in PDA should have varying parameter values. E.g. if you are calibrating the `psnTOpt` parameter, it would look like this: ``` > external.knots $temperate.deciduous ... half_saturation_PAR fine_root_respiration_Q10 psnTOpt AmaxFrac ... [1,] ... 15.5 3.2 24.83626 0.8605507 ... [2,] ... 15.5 3.2 25.18056 0.8605507 ... [3,] ... 15.5 3.2 26.10204 0.8605507 ... [4,] ... 15.5 3.2 25.03353 0.8605507 ... ... ... ... ... ... ... ... ``` Once you prepare `external.data` and `external.format`, now you can omit the `` and `` tags in the `` block (they will be ignored even if you pass them when `external.data` is not NULL: ``` emulator 60 100000 3 psnTOpt half_saturation_PAR Vm_low_temp 250 0.1 0.3 Laplace 297 NEE UST ``` Save and read your `pecan.PDA.xml` accordingly. Next, pass the external data and formats you prepared explicitly to the `pda.emulator()` function: ``` settings <- read.settings("pecan.PDA.xml") pda.externals <- pda.generate.externals(external.data = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE) settings_postPDA <- pda.emulator(settings, external.data = pda.externals$external.data, external.formats = pda.externals$external.formats) ``` Note that this way we provided external constraint data and its format information to the PDA functions but we are still using database connection. If you want to use PDA functions with no database connection, then you need to provide prior list to the function as an external object as well and run it in `remote = TRUE` mode. In that case, it is also recommended to pass an `ensemble.id` as an identifier for your PDA run (normally this is given by the PDA workflow), it can even be a string: ``` settings <- read.settings("pecan.PDA.xml") prior.list <- list(data.frame(distn = c("unif", "unif", "norm"), parama = c(5, 4, 0), paramb = c(40, 27, 3), n = rep(NA, 3), row.names = c("psnTOpt", "half_saturation_PAR", "Vm_low_temp"))) pda.externals <- pda.generate.externals(external.data = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE, external.priors = TRUE, prior.list = prior.list) settings_postPDA <- pda.emulator(settings, external.data = pda.externals$external.data, external.formats = pda.externals$external.formats, external.priors = pda.externals$external.priors, remote = TRUE, ensemble.id = "PDA_VIGNETTE") ``` Two more notes here before moving onto the next case. Above, in the `prior.list` object, we only passed prior distribution for the three parameters we target. In this case rest of the model parameters will be kept as their defaults (whichever way it is implemented in the model package), *they won't be kept at the medians of their prior range as it is normally the case in PDA because we haven't passed a prior (range) for them*. If you don't want that to happen, pass a more comprehensive list or try to use prior (or meta analysis posterior) objects generated by PEcAn workflows. Finally, we haven't used the `external.knots` argument yet. We only use it when we need to overwrite and enforce the training points for the emulator and don't necessarily need to populate this argument when doing PDA without database connection. We leave testing with `external.knots` as an exercise. ## 4. Own data with BayesianTools It's all very much the same as the cases above, in combination of #2 (prepare `BayesianTools` specific xml tags) and #3 (prepare external arguments). ``` bayesian.tools DREAMzs 10000 3 psnTOpt half_saturation_PAR Vm_low_temp Laplace 297 NEE UST ``` Save and read your `pecan.PDA.xml` accordingly. Next, pass the external data and formats you prepared explicitly to the `pda.bayesian.tools()` function: ``` settings <- read.settings("pecan.PDA.xml") pda.externals <- pda.generate.externals(external.data = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE, external.priors = TRUE, prior.list = prior.list) settings_postPDA <- pda.bayesian.tools(settings, external.data = pda.externals$external.data, external.formats = pda.externals$external.formats, external.priors = pda.externals$external.priors, remote = TRUE, ensemble.id = "PDA_VIGNETTE") ```