Only needs to be done the first time :
Currently, there are three ways of doing Parameter Data Assimilation (PDA) in PEcAn :
Note that, there used to be Metropolis-Hastings Markov Chain
Monte Carlo (MCMC) algorithms implemented natively in PEcAn, namely
/modules/assim.batch/R/pda.mcmc.R
and
pda.mcmc.bs.R
. These are deprecated and they will no longer
be maintained.
bayesian.tools : The
pda.bayesian.tools.R
is the main script of this PDA
workflow that performs a bruteforce (MCMC methods that require running
your models at least couple of thousand of times) calibration. This
script uses the BayesianTools
R-package (Hartig, Minuno and
Paul, 2019) that includes MCMC and SMC samplers and other tools for
Bayesian parameter calibration with external models. If you choose
bayesian.tools option, PEcAn framework will hand the PDA
calculations over to the BayesianTools
package. Although
this package includes algorithms that are designed to explore the
parameter space more efficiently than the traditional MH-MCMC, you would
still need a relatively faster model to use these algorithms. Note that
the BayesianTools
R-package itself is constantly under
development, and its integration with PEcAn may not be able to leverage
all BayesianTools
functionality.
emulator : The pda.emulator.R
is the
main script of this PDA workflow. When a model is slow, it is
practically not possible to run it hundreds of thousands of times
sequentially to explore the parameter space and draw enough samples to
converge the target distribution with bruteforce algorithms. Instead, we
can run the model for a relatively smaller number of times with
parameter combinations that have been carefully chosen to give good
coverage of parameter space. Then we can interpolate the likelihood
calculated for each of those runs to get a surface that “emulates” the
true likelihood and perform regular MCMC (just like the “bruteforce”
approach), except instead of actually running the model on every
iteration to obtain a likelihood value, this time we will just get an
approximation from the likelihood emulator. A lot more details about the
emulator approach can be found in Fer et al. 2018.
Multi-site HPDA : The pda.emulator.ms.R
is the main script of this PDA workflow which performs a multi-site
hierarchical Bayesian calibration using the emulator approach. In the
multi-site hierarhical calibration, global posteriors are fitted
simultaneously using an additional Gibbs update step, and used as the
new prior hyperparameters in the next iteration of MCMC to obtain
site-level hierarchical posteriors. This PDA workflow has its own
vignette (see
/modules/assim.batch/vignettes/MultiSitePDAVignette.Rmd
)
and will not be one of the examples here. For more information also see
Fer et al. 2021
bioRxiv preprint.
Now we will go through 4 examples of how you could arrange your
<assim.batch>
tags and run a PDA in PEcAn:
Assuming you already familiarized yourself with Demo 1 and Demo 2,
and ran SIPNET at Niwot Ridge (US-NR1), in this section we will go
through the pecan.xml
tags for calibrating SIPNET with
Ameriflux NEE data with the emulator approach at the US-NR1 site. The
SIPNET PFT for this site was temperate.coniferous
, let’s
assume the top three parameters contributing to the model NEE output
uncertainty (Demo 2) were
psnTOpt
,half_saturation_PAR
, and
Vm_low_temp
, so we decided to constrain these in the PDA.
Go to your workflow directory where you performed the uncertainty
analysis, open up your pecan.CONFIGS.xml
, insert the
following tags at the top of your xml after the
<pecan>
tag (don’t forget to remove
<---
comments), modify the paths according to
your setup, and save the file as
pecan.PDA.xml
:
<assim.batch>
<method>emulator</method>
<n.knot>60</n.knot> <--- 20 x p (= no. of paramaters you target)
<iter>100000</iter>
<chain>3</chain>
<param.names>
<temperate.coniferous>
<param>psnTOpt</param>
<param>half_saturation_PAR</param>
<param>Vm_low_temp</param>
</temperate.coniferous>
</param.names>
<jump>
<adapt>250</adapt>
<adj.min>0.1</adj.min>
<ar.target>0.3</ar.target>
</jump>
<inputs>
<file>
<input.id>1000011238</input.id> <--- your input ID (of the DB record of where the AMF csv file is)
<path>
<path>/.../AmerifluxLBL_site_0-772/AMF_US-NR1_BASE_HH_12-5.csv</path> <--- your path
</path>
<likelihood>Laplace</likelihood>
<variable.id>1000000042</variable.id>
<variable.name>
<variable.name>FC</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
For reference, these are the tags for a similar workflow for SIPNET at ICOS-NEE (FI-Var site) combination. Note the differences in PFT name, input ID and path, variable ID and name:
<assim.batch>
<method>emulator</method>
<n.knot>60</n.knot>
<iter>100000</iter>
<chain>3</chain>
<param.names>
<boreal.coniferous>
<param>psnTOpt</param>
<param>half_saturation_PAR</param>
<param>Vm_low_temp</param>
</boreal.coniferous>
</param.names>
<jump>
<adapt>250</adapt>
<adj.min>0.1</adj.min>
<ar.target>0.3</ar.target>
</jump>
<inputs>
<file>
<input.id>15000000028</input.id>
<path>
<path>/.../ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv</path>
</path>
<likelihood>Laplace</likelihood>
<variable.id>297</variable.id>
<variable.name>
<variable.name>NEE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
Now you can read in your pecan.PDA.xml
and run PDA (This
will take some time, go grab a drink. The first round especially takes
time because this is when the constraint data is loaded and its
effective sample size is calculated):
settings <- read.settings("pecan.PDA.xml")
settings_PDA_round1 <- pda.emulator(settings)
The PDA workflow has populated you settings
list object
with couple of relevant tags. Now settings_PDA_round1
has
these additional tags (note that this post-PDA settings
list is also automatically written to your workflow directory as
pecan.pdaR1_ENS_ID.xml
). These tags look like this:
<prior>
<prior.id>
<boreal.coniferous>INITIAL_PDA_PRIOR_ID</boreal.coniferous>
</prior.id>
</prior>
<ensemble.id>
<id>R1_ENS_ID</id>
</ensemble.id>
<emulator.path>/.../PEcAn_WORKFLOWID/emulator.pdaR1_ENS_ID.Rdata</emulator.path>
<ss.path>/.../PEcAn_WORKFLOWID/ss.pdaR1_ENS_ID.Rdata</ss.path>
<mcmc.path>/.../PEcAn_WORKFLOWID/mcmc.list.pdaR1_ENS_ID.Rdata</mcmc.path>
<resume.path>/.../PEcAn_WORKFLOWID/resume.pdaR1_ENS_ID.Rdata</resume.path>
<round_counter>1</round_counter>
<extension>round</extension>
<prior.id>
This is the initial prior that went
into your first PDA round. Even though the PDA workflow updates the
<posteriorid>
under your PFT block, if you do another
emulator round only the initial prior will be used in the MCMC
(otherwise you use data twice!). Hence, we keep track of its ID in the
xml. But it is still perfectly fine to propose the new training points
for the next emulator round from the posterior of the previous round, so
that you can refine your emulator surface, that’s when the PDA workflow
uses <posteriorid>
under the hood.
<ensemble.id>
Each emulator round is assigned
a new ensemble ID. You can see this ID in every file recorded for the
associated emulator round, e.g.:
history.pdaR1_ENS_ID.Rdata
,
pecan.pdaR1_ENS_ID.xml
,
post.distns.pda.YOURPFT_R1_ENS_ID.Rdata
etc.
Rest of the tags are for the emulator workflow to be able to find
previous rounds’ information and understand what to do in the next
round. The emulator workflow automatically adds
<extension>round</extension>
because if you
forget to add this tag manually after the first round, you will just be
repeating the first round while you almost certainly intend to do an
iterative round if you are using the settings of your previous round.
Forgetting to add this tag has happened enough times that we now add
this tag automatically. But note that in PDA we have a second type of
extension which is
<extension>longer</extension>
. This is to run
your MCMC chain longer in case it didn’t converge. This will be more of
an issue for the bruteforce calibration (see below), but it is also
possible to run your MCMC chains longer if your emulator-based MCMC also
didn’t converge:
<extension>
Specifies the extension type of
additional emulator runs, should be skipped if this a first emulator run
of its own. Otherwise it is possible to extend emulator PDA runs in two
ways:
<extension>longer</extension>
using the
same emulator, this extension run takes the MCMC sampling from where it
was left in the previous emulator run and runs a longer MCMC
chain.
<extension>round</extension>
this
extension run proposes new points in the parameter space in addition to
the previous ones, and builds a new emulator including these additional
points for a new MCMC sampling. These new points can come from both your
inital PDA prior and the posterior of your first round of emulator run.
You can determine the percentage of new knots coming from the posterior
of your previous run in the <knot.par>
tag. If you
leave it empty, 90% of your new points will be drawn from the posterior
of your previous run by default.
Now you can either pass the post PDA settings in your environment to
the pda.emulator()
function again (yes you can put this in
a loop) or you can re-read your pecan.pdaR1_ENS_ID.xml
at
any time to perform an iterative emulator round.
### settings_PDA_round1 <- read.settings("pecan.pdaR1_ENS_ID.xml")
settings_PDA_round2 <- pda.emulator(settings_PDA_round1)
Before moving on to the BayesianTools
case, here is what
your pecan.PDA.xml
could look like for a case where you
constrain parameters of more than one model PFT with more than one
Ameriflux data stream, note the ordering in
<assim.batch><param.names>
and
<pfts>
:
...
<assim.batch>
...
<param.names>
<PFT1>
<param>PFT1_PDA_param1</param>
<param>PFT1_PDA_param2</param>
</PFT1>
<PFT2>
<param>PFT2_PDA_param1</param>
<param>PFT2_PDA_param2</param>
<param>PFT2_PDA_param3</param>
</PFT2>
</param.names>
...
<inputs>
<file>
<input.id>YOURINPUTID</input.id>
<path>
<path>/YOURPATH/AmerifluxLBL_site_***/AMF_***_BASE_HH_***.csv</path>
</path>
<likelihood>Laplace</likelihood>
<variable.id>297</variable.id>
<variable.name>
<variable.name>NEE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
<file>
<input.id>YOURINPUTID</input.id>
<path>
<path>/YOURPATH/AmerifluxLBL_site_***/AMF_***_BASE_HH_***.csv</path>
</path>
<likelihood>Laplace</likelihood>
<variable.id>298</variable.id>
<variable.name>
<variable.name>LE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
...
</assim.batch>
...
<pfts>
<pft>
<name>PFT1</name>
<outdir>/YOUR/WORKFLOWS/PATH/PEcAn_WORKFLOWID/pft/PFT1</outdir>
<posteriorid>PFT1_POSTID</posteriorid>
</pft>
<pft>
<name>PFT2</name>
<outdir>/YOUR/WORKFLOWS/PATH/PEcAn_WORKFLOWID/pft/PFT2</outdir>
<posteriorid>PFT2_POSTID</posteriorid>
</pft>
</pfts>
...
Using the bruteforce PDA approach with one of the MCMC-flavors as
implemented in the BayesianTools R-package requires slightly different
tags to configure its setup. pda.bayesian.tools() would
look for sampler specific settings that can be passed through the
pecan.xml as a block under the <bt.settings>
tag.
Currently, the available samplers in the BayesianTools package are:
Note that it is not possible to use some of the samplers in this package for univariate cases. The ones that you can use for univariate cases are: “Metropolis”, “DE”, “DEzs” and “Twalk”. Some of the samplers are also restartable : “Metropolis”, “DE”, “DEzs”, “DREAM”, “DREAMzs”, “Twalk”. For further details please see the BayesianTools documentation which itself has a helpful vignette.
The name of the chosen sampler would be passed under
<sampler>
tag within the
<bt.settings>
block. E.g. for Metropolis
variations:
...
<method>bayesian.tools</method>
<bt.settings>
<iter>10000</iter>
<sampler>Metropolis</sampler>
<DRlevels>1</DRlevels> <-- set 2 for Delayed Rejection
<optimize>FALSE</optimize> <-- set TRUE for pre-optimization
<adapt>FALSE</adapt> <-- set TRUE for Adaptive Metropolis
<adaptationNotBefore>500</adaptationNotBefore> <-- set for Adaptive Metropolis
</bt.settings>
...
The rest of the <assim.batch>
block is the same
(except you can now omit the <jump>
tag as it takes
no effect in this case). While all the algorithms can be accessed by the
“Metropolis” sampler in BayesianTools package by specifying the
sampler’s settings, it could be easier to directly pass the sampler name
explicitly. Below is an example for DREAMzs (recommended) algorithm:
<assim.batch>
<method>bayesian.tools</method>
<bt.settings>
<sampler>DREAMzs</sampler>
<iter>10000</iter>
<chain>3</chain>
</bt.settings>
<param.names>
<soil>
<param>som_respiration_rate</param>
</soil>
<boreal.coniferous>
<param>psnTOpt</param>
<param>half_saturation_PAR</param>
<param>Vm_low_temp</param>
</boreal.coniferous>
</param.names>
<inputs>
<file>
<input.id>15000000028</input.id>
<path>
<path>/data/dbfiles/ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv</path>
</path>
<likelihood>Laplace</likelihood>
<variable.id>297</variable.id>
<variable.name>
<variable.name>NEE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
Assuming you save these tags under a pecan.PDA.xml
settings <- read.settings("pecan.PDA.xml")
settings_postPDA <- pda.bayesian.tools(settings)
In case of no convergence and/or if you would like to run the same chains longer, you can add the extension option:
settings_postPDA$assim.batch$extension <- 'longer'
settings_postPDA_longer <- pda.bayesian.tools(settings_postPDA)
Sometimes you might want to use your own local data (e.g. from a
tower that is not necessarily part of any flux tower network PEcAn has
access to, or some chamber measurements etc.) for constraining your
model. Although we still do recommend using formal PEcAn data
registration and processing pipelines, sometimes you may not have access
to the DB and/or may want to bypass other data processing steps to
perform a quick PDA analysis to plan ahead. In that case, you can use
external.*
arguments of the pda.emulator()
function. You can prepare the external.*
arguments
manually, but the easiest way to generate them is to use
pda.generate.externals()
function that you can find under
the script with the same name. pda.emulator()
function
accepts the following external.*
arguments:
external.data
This is a list containing your
constraining variables in the format expected by the PDA functions (see
also pda.generate.externals()
function). The length of this
list is the number of your data streams (e.g. if you are using NEE and
LE in calibration, it will have two sublists). Hint: You may want to
perform case #1 and load the history.pda
file to check the
format of the inputs
object out, because if
external.data
is passed to the pda.emulator()
it will simply be assined to this variable,
inputs <- external.data
. To be able to prepare
external.data
with the
pda.generate.externals()
function you need your data as a(n
ordered) list where each sublist corresponds to a data frame of your
constraining variable with two columns, variable name - posix.
external.formats
This is a list of variable names in
the format expected by the PDA functions (see also
pda.generate.externals()
function). The length of this list
is the number of your data streams (e.g. if you are using NEE and LE in
calibration, it will have two sublists). Depending on expressions
provided in this object, PDA functions are capable of performing simple
derivations on data. Hence, this list has a slightly more complex
structure than a simple list of names, but don’t worry about it. To be
able to prepare external.formats
with the
pda.generate.externals()
it is enough to pass
varn
vector that has the standard variable names (when in
doubt, check the standard variable names under
/base/utils/data/standard_vars.csv
).
external.priors
This is a list of pecan probability
distribution dataframes to be used as PDA priors, one sublist per PFT,
e.g.:
[[1]]
distn parama paramb n
frozenSoilEff beta 1.000 1.00 NA
soilRespMoistEffect unif 0.000 2.00 NA
...
[[2]]
distn parama paramb n
growth_resp_factor beta 2.630 6.5e+00 0
root_respiration_rate norm 8.700 9.5e-01 NA
...
external.knots
This is a list of dataframes containing
emulator training points. Only pass it if you want to build the emulator
on specific knots over which you want to have control, otherwise knots
are generated from the PDA priors using Latin Hypercube Sampling. Each
sublist should be a nknot x nPFTparam
dataframe, e.g. if
you are building an emulator with 250 knots and your PFT has 30
parameters, this will be a 250 x 30 dataframe where each row is a
parameter vector. Therefore only the columns of parameters that you are
targeting in PDA should have varying parameter values. E.g. if you are
calibrating the psnTOpt
parameter, it would look like
this:> external.knots
$temperate.deciduous
... half_saturation_PAR fine_root_respiration_Q10 psnTOpt AmaxFrac ...
[1,] ... 15.5 3.2 24.83626 0.8605507 ...
[2,] ... 15.5 3.2 25.18056 0.8605507 ...
[3,] ... 15.5 3.2 26.10204 0.8605507 ...
[4,] ... 15.5 3.2 25.03353 0.8605507 ...
... ... ... ... ... ... ...
Once you prepare external.data
and
external.format
, now you can omit the
<input.id></input.id>
and
<path></path>
tags in the
<assim.batch>
block (they will be ignored even if you
pass them when external.data
is not NULL:
<assim.batch>
<method>emulator</method>
<n.knot>60</n.knot>
<iter>100000</iter>
<chain>3</chain>
<param.names>
<boreal.coniferous>
<param>psnTOpt</param>
<param>half_saturation_PAR</param>
<param>Vm_low_temp</param>
</boreal.coniferous>
</param.names>
<jump>
<adapt>250</adapt>
<adj.min>0.1</adj.min>
<ar.target>0.3</ar.target>
</jump>
<inputs>
<file>
<likelihood>Laplace</likelihood>
<variable.id>297</variable.id>
<variable.name>
<variable.name>NEE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
Save and read your pecan.PDA.xml
accordingly. Next, pass
the external data and formats you prepared explicitly to the
pda.emulator()
function:
settings <- read.settings("pecan.PDA.xml")
pda.externals <- pda.generate.externals(external.data = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE)
settings_postPDA <- pda.emulator(settings, external.data = pda.externals$external.data, external.formats = pda.externals$external.formats)
Note that this way we provided external constraint data and its
format information to the PDA functions but we are still using database
connection. If you want to use PDA functions with no database
connection, then you need to provide prior list to the function as an
external object as well and run it in remote = TRUE
mode.
In that case, it is also recommended to pass an ensemble.id
as an identifier for your PDA run (normally this is given by the PDA
workflow), it can even be a string:
settings <- read.settings("pecan.PDA.xml")
prior.list <- list(data.frame(distn = c("unif", "unif", "norm"),
parama = c(5, 4, 0),
paramb = c(40, 27, 3),
n = rep(NA, 3),
row.names = c("psnTOpt", "half_saturation_PAR", "Vm_low_temp")))
pda.externals <- pda.generate.externals(external.data = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE, external.priors = TRUE, prior.list = prior.list)
settings_postPDA <- pda.emulator(settings, external.data = pda.externals$external.data,
external.formats = pda.externals$external.formats,
external.priors = pda.externals$external.priors,
remote = TRUE, ensemble.id = "PDA_VIGNETTE")
Two more notes here before moving onto the next case. Above, in the
prior.list
object, we only passed prior distribution for
the three parameters we target. In this case rest of the model
parameters will be kept as their defaults (whichever way it is
implemented in the model package), they won’t be kept at the medians
of their prior range as it is normally the case in PDA because we
haven’t passed a prior (range) for them. If you don’t want that to
happen, pass a more comprehensive list or try to use prior (or meta
analysis posterior) objects generated by PEcAn workflows. Finally, we
haven’t used the external.knots
argument yet. We only use
it when we need to overwrite and enforce the training points for the
emulator and don’t necessarily need to populate this argument when doing
PDA without database connection. We leave testing with
external.knots
as an exercise.
It’s all very much the same as the cases above, in combination of #2
(prepare BayesianTools
specific xml tags) and #3 (prepare
external arguments).
<assim.batch>
<method>bayesian.tools</method>
<bt.settings>
<sampler>DREAMzs</sampler>
<iter>10000</iter>
<chain>3</chain>
</bt.settings>
<param.names>
<boreal.coniferous>
<param>psnTOpt</param>
<param>half_saturation_PAR</param>
<param>Vm_low_temp</param>
</boreal.coniferous>
</param.names>
<inputs>
<file>
<likelihood>Laplace</likelihood>
<variable.id>297</variable.id>
<variable.name>
<variable.name>NEE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
Save and read your pecan.PDA.xml
accordingly. Next, pass
the external data and formats you prepared explicitly to the
pda.bayesian.tools()
function:
settings <- read.settings("pecan.PDA.xml")
pda.externals <- pda.generate.externals(external.data = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE, external.priors = TRUE, prior.list = prior.list)
settings_postPDA <- pda.bayesian.tools(settings, external.data = pda.externals$external.data,
external.formats = pda.externals$external.formats,
external.priors = pda.externals$external.priors,
remote = TRUE, ensemble.id = "PDA_VIGNETTE")