--- title: "Multi-site hierarchical calibration vignette" author: "Istem Fer" date: "09/23/2019" output: html_vignette vignette: > %\VignetteIndexEntry{Multi-site hierarchical calibration vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Multi-Site Hierarchical PDA This vignette documents the steps and settings of Hierarchical Bayesian Parameter Data Assimilation (HB-PDA) analysis on multiple sites, and accompanies the paper. ## Initiate a new run from the web interface Follow the steps: ``` host : pecan model : SIPNET (r136) # note I'll change this, because I want to make sure SIPNET code is also version controlled (which exact settings/modules I turned on/off etc.) sitegroup : HB-PDA site : US-Bar (choose any we will modify the xml in the next step) CLICK NEXT pft : soil.ALL, temperate.deciduous.ALL start date : leave as it is end date : leave as it is pool_initial_condition : skip Sipnet.climna: Use AmerifluxLBL NOW CHECK THE EDIT XML TICK BOX AND CLICK NEXT (Enter pecan for user name in the following page) ``` ## Start with multi-settings Now you should be seeing the `pecan.xml` associated with the choices you made above. Remove ``, `` and `` tags from the `` section and add the `` tag in your `pecan.xml` before the `` section for the group of sites you are interested in. In this vignette, we will use HB-PDA site group in BETYdb which consists of 12 temperate decidious broadleaf forest (DBF) sites ofthe Ameriflux network. Overall, your `pecan.xml` for these sections will look like this: ``` 1000000022 AmerifluxLBL SIPNET pecan ``` Now hit save button and open RStudio. Go to the working directory for the run you just created. E.g. `setwd("/fs/data2/output/PEcAn_1000010272")`. Open `workflow.R` script and start going through the lines. Stop after: ``` # Write pecan.CHECKED.xml PEcAn.settings::write.settings(settings, outputfile = "pecan.CHECKED.xml") ``` ## Fill in site-specific input information When you open `pecan.CHECKED.xml`, you'll see that the `` tag has expanded with the same tags repeated for each site. Now specify ``, `` and `` information. E.g.: ``` 796 2005/01/01 2011/12/31 Bartlett Experimental Forest (US-Bar) 44.06464 -71.288077 2005/01/01 2011/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_0-796/AMF_US-Bar_BASE_HH_4-1.2005-01-01.2011-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_0-796/IC_site_0-796.nc 767 2001/01/01 2014/12/31 Morgan Monroe State Forest (US-MMS) 39.3231 -86.4131 2001/01/01 2014/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_0-767/AMF_US-MMS_BASE_HR_8-1.2001-01-01.2014-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_0-767/IC_site_0-767.nc 768 2005/01/01 2015/12/31 Missouri Ozark Site/BREA (US-MOz) 38.7441 -92.2 2005/01/01 2015/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_0-768/AMF_US-MOz_BASE_HH_7-1.2005-01-01.2015-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_0-768/IC_site_0-768.nc 776 2007/01/01 2014/12/31 Univiversity of Michigan Biological Station (US-UMB) 45.5598 -84.7138 2007/01/01 2014/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_0-776/AMF_US-UMB_BASE_HH_10-1.2007-01-01.2014-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_0-776/IC_site_0-776.nc 676 1999/01/01 2006/12/31 Willow Creek (US-WCr) 45.805925 -90.07961 1999/01/01 2006/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_0-676/AMF_US-WCr_BASE_HH_11-1.1999-01-01.2006-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_0-676/IC_site_0-676.nc 1000000109 2013/01/01 2015/12/31 Ontario - Turkey Point Mature Deciduous (CA-TPD) 42.6353 -80.5577 2013/01/01 2015/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_1-109/AMF_CA-TPD_BASE_HH_1-1.2013-01-01.2015-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_1-109/IC_site_1-109.nc 755 2002/01/01 2008/12/31 Duke Forest-hardwoods (US-Dk2) 35.9736 -79.1004 2002/01/01 2008/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_0-755/AMF_US-Dk2_BASE_HH_4-4.2002-01-01.2008-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_0-755/IC_site_0-755.nc 1000000145 2010/01/01 2010/12/31 Chestnut Ridge (US-ChR) 35.9311 -84.3324 2010/01/01 2010/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_1-145//AMF_US-ChR_BASE_HH_2-1.2010-01-01.2010-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_1-145/IC_site_1-145.nc 1000000066 2005/01/01 2014/12/31 Silas Little- New Jersey (US-Slt) 39.9138 -74.596 2005/01/01 2014/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_1-66/AMF_US-Slt_BASE_HH_5-1.2005-01-01.2014-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_1-66/IC_site_1-66.nc 1000000061 2004/01/01 2013/12/31 Oak Openings (US-Oho) 41.5545 -83.8438 2004/01/01 2013/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_1-61/AMF_US-Oho_BASE_HH_4-1.2004-01-01.2013-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_1-61/IC_site_1-61.nc 740 1997/01/01 2010/12/31 BOREAS SSA Old Aspen (CA-Oas) 53.6289 -106.198 1997/01/01 2010/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_0-740/AMF_CA-Oas_BASE_HH_1-1.1997-01-01.2010-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_0-740/IC_site_0-740.nc 758 2010/01/01 2015/12/31 Harvard Forest EMS Tower/HFR1 (US-Ha1) 42.5378 -72.1715 2010/01/01 2015/12/31 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_0-758/AMF_US-Ha1_BASE_HR_10-1.2010-01-01.2015-12-31.clim /fs/data3/istfer/HPDA/data/param.files/site_0-758/IC_site_0-758.nc ``` ## Modify ensemble analysis tags Now let's specify the ensemble `` and the ensemble `` within the `` tag. In the paper we give an ensemble of runs with `250` members but you can try smaller ensembles. Overall, your `pecan.xml` for the `` sections will look like this: ``` 250 NEE uniform sampling ``` Now save the file. Re-read the updated `pecan.CHECKED.xml`, make sure to re-assign random effects as logical, otherwise `meta.analysis` will throw an error: ``` settings <- read.settings("pecan.CHECKED.xml") settings$meta.analysis$random.effects <- as.logical(settings$meta.analysis$random.effects) ``` ## Global SA ``` globalSA_explore <- function(workflow_dir, var, ens.settings){ load(paste0(workflow_dir, "/ensemble.samples.", ens.settings$ensemble$ensemble.id,".Rdata")) samps <- do.call("cbind", ens.samples) load(paste0(workflow_dir, "/ensemble.output.", ens.settings$ensemble$ensemble.id, ".",var,".", ens.settings$ensemble$start.year, ".", ens.settings$ensemble$end.year,".Rdata")) allm <- cbind(unlist(ensemble.output), samps) colnames(allm) <- c(var, colnames(samps)) form <- as.formula(paste(var, " ~ .")) var.fit <- lm(form, allm) alls <-summary(var.fit)$coefficients[,1] alls[summary(var.fit)$coefficients[,4] >= 0.05] <- NA # don't consider non-significant fits alls <- alls[names(alls) != "(Intercept)"] preds <- names(alls) n <- length(preds) id <- unlist(lapply(1, function(i)combn(1:n,i,simplify=FALSE)),recursive=FALSE) Formulas <- sapply(id, function(i) paste(var,"~",paste(preds[i],collapse=":")) ) fits <- lapply(Formulas,function(i) lm(as.formula(i), data=allm)) for(tr in 1:n){ f <- summary(fits[[tr]])$fstatistic pfval <- pf(f[1],f[2],f[3],lower.tail=F) if(pfval < 0.001){ alls[names(alls) == preds[tr]] <- summary(fits[[tr]])$adj.r.squared }else{ alls[names(alls) == preds[tr]] <- NA } } alls <- data.frame(par=names(alls), rexp = alls, site =i) alls <- alls[order( alls$rexp),] return(alls) } workflow_dir <- "/fs/data2/output//PEcAn_1000010172" multi.settings <- read.settings("pecan.CONFIGS.xml") Qle_list <- NEE_list <- list() for(i in seq_along(multi.settings)){ Qle_list[[i]] <- globalSA_explore(workflow_dir, "Qle", multi.settings[[i]]) NEE_list[[i]] <- globalSA_explore(workflow_dir, "NEE", multi.settings[[i]]) } Qle_df <- do.call("rbind", Qle_list) Qle_df$var <- "Qle" NEE_df <- do.call("rbind", NEE_list) NEE_df$var <- "NEE" allm <- rbind(Qle_df, NEE_df) allm <- allm[!is.na(allm$rexp),] p1 <- ggplot(allm, aes(par, rexp, fill=var)) + coord_flip() + ylab("R2") + xlab("SIPNET parameters") p1 + geom_boxplot() + theme_bw() ``` Then continue with the rest of the workflow until `` module: ``` # Run parameter data assimilation if ('assim.batch' %in% names(settings)) { if (PEcAn.utils::status.check("PDA") == 0) { PEcAn.utils::status.start("PDA") settings <- PEcAn.assim.batch::runModule.assim.batch(settings) PEcAn.utils::status.end() } } ``` Before running this module, we will conduct some analyses on the ensemble run outputs to identify the parameters we would like to target in PDA. add the PDA tags, see next section. ## Settings for HB-PDA Similar to PDA settings, we will contain HB-PDA settings under the `` section. Likewise standard PDA, if you have chosen the parameters you want to target in the calibration, we will open the `pecan.CONFIGS.xml` file, and insert the `` tag. Different from the PDA settings, this time the main function will be the `emulator.ms`. This function can be run in three modes, which will be selected under the `` tag above: individual, joint and hierarchical. If left empty, the function will go through all three modes. But note that, for both joint and hierarhical calibration, we need site-level runs. We will start with individual mode, i.e. site-level calibration. Now open the `pecan.CONFIGS.xml` file. First, add `` tag to the multisettings in the bottom of your xml among others: ``` assim.batch ensemble sensitivity.analysis run ``` Now we can insert multi-site settings in the `` section accordingly (note that `method`, `mode`, `param.names` etc. will go under each `setting.*`): ``` emulator.ms individual 70 100000 3 som_respiration_rate soil_respiration_Q10 psnTOpt half_saturation_PAR dVPDSlope leafGrowth AmaxFrac 1000013322 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_site_0-796/AMF_US-Bar_BASE_HH_4-1.csv Laplace 1000000042 FC UST emulator.ms individual 70 100000 3 som_respiration_rate soil_respiration_Q10 psnTOpt half_saturation_PAR dVPDSlope leafGrowth AmaxFrac 1000013317 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_site_0-767/AMF_US-MMS_BASE_HR_8-1.csv Laplace 1000000042 FC UST emulator.ms individual 70 100000 3 som_respiration_rate soil_respiration_Q10 psnTOpt half_saturation_PAR dVPDSlope leafGrowth AmaxFrac 1000013316 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_site_0-768/AMF_US-MOz_BASE_HH_7-1.csv Laplace 1000000042 FC UST emulator.ms individual 70 100000 3 som_respiration_rate soil_respiration_Q10 psnTOpt half_saturation_PAR dVPDSlope leafGrowth AmaxFrac 1000013832 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_site_0-776/AMF_US-UMB_BASE_HH_10-1.csv Laplace 1000000042 FC UST emulator.ms individual 70 100000 3 som_respiration_rate soil_respiration_Q10 psnTOpt half_saturation_PAR dVPDSlope leafGrowth AmaxFrac 1000012965 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_site_0-676/AMF_US-WCr_BASE_HH_11-1.csv Laplace 1000000042 FC UST emulator.ms individual 70 100000 3 som_respiration_rate soil_respiration_Q10 psnTOpt half_saturation_PAR dVPDSlope leafGrowth AmaxFrac 1000013836 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_site_1-109/AMF_CA-TPD_BASE_HH_1-1.csv Laplace 1000000042 FC UST emulator.ms individual 70 100000 3 som_respiration_rate soil_respiration_Q10 psnTOpt half_saturation_PAR dVPDSlope leafGrowth AmaxFrac 1000013546 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_site_1-145/AMF_US-ChR_BASE_HH_2-1.csv Laplace 1000000042 FC UST emulator.ms individual 70 100000 3 som_respiration_rate soil_respiration_Q10 psnTOpt half_saturation_PAR dVPDSlope leafGrowth AmaxFrac 1000013572 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_site_1-66/AMF_US-Slt_BASE_HH_5-1.csv Laplace 1000000042 FC UST emulator.ms individual 70 100000 3 som_respiration_rate soil_respiration_Q10 psnTOpt half_saturation_PAR dVPDSlope leafGrowth AmaxFrac 1000013482 /fs/data1/pecan.data/dbfiles/AmerifluxLBL_site_1-61/AMF_US-Oho_BASE_HH_4-1.csv Laplace 1000000042 FC UST ``` Now save this as `pecan.HBPDA.xml` and read it. Following section briefly explains what happens after: ``` multi.settings <- read.settings("pecan.HBPDA.xml") pda.emulator.ms(multi.settings) ``` ## The workflow for HB-PDA HB-PDA workflow for multiple sites starts with fitting models locally. This is done by looping over the PEcAn multi-settings list using the `pda.emulator` function. During this step, site-level runs, GP fitting and MCMC are done. Components of these runs are saved and will be used in global and hierarhical fitting. In global calibration, we estimate posterior using all site-level GPs at once at each iteration. Proposed values are plugged-in all GPs, and accepted/rejected according to all estimated likelihoods. Hierarhical calibration, uses a similar function to `mcmc.GP` but modified for hierarchihcal fitting, called `hier.mcmc`. Within this function, both site-level and (hierarchically modeled) global parameters are fitted.