| Title: | PEcAn Functions Used for Ecological Forecasts and Reanalysis |
|---|---|
| Description: | The Predictive Ecosystem Carbon Analyzer (PEcAn) is a scientific workflow management tool that is designed to simplify the management of model parameterization, execution, and analysis. The goal of PECAn is to streamline the interaction between data and models, and to improve the efficacy of scientific investigation. |
| Authors: | Mike Dietze |
| Maintainer: | Mike Dietze <[email protected]> |
| License: | BSD_3_clause + file LICENSE |
| Version: | 1.10.0.9000 |
| Built: | 2026-06-05 14:54:39 UTC |
| Source: | https://github.com/PecanProject/pecan |
This functions gives weights to different ensemble members based on their likelihood during the analysis step. Then it adjusts the analysis mean estimates of state variables based on the estimated weights.
adj.ens(Pf, X, mu.f, mu.a, Pa)adj.ens(Pf, X, mu.f, mu.a, Pa)
Pf |
A cov matrix of forecast state variables. |
X |
Dataframe or matrix of forecast state variables for different ensembles. |
mu.f |
A vector with forecast mean estimates of state variables. |
mu.a |
A vector with analysis mean estimates of state variables. |
Pa |
The state estimate cov matrix of analysis. |
Returns a vector of adjusted analysis mean estimates of state variables.
Michael Dietze [email protected], Ann Raiho and Hamze Dokoohaki
Aggregation Function
aggregate(downscale_output, polygon_data, func = "mean")aggregate(downscale_output, polygon_data, func = "mean")
downscale_output |
Raster file output from downscale_function.R. Read file in this way if stored locally: |
polygon_data |
A spatial polygon object (e.g., an 'sf' object) that defines the spatial units for aggregation. This data should be in a coordinate reference system compatible with the raster data (e.g., "EPSG:4326"). |
func |
A character string specifying the aggregation function to use (e.g., 'mean', 'sum'). |
This function will aggregate previously downscaled carbon flux amount to a spatial unit of choice
It returns the 'polygon_data' with added columns for mean and sum values of the aggregated raster data for each ensemble member.
Harunobu Ishii
## Not run: # Download a shapefile of U.S. (polygon data) url <- "https://www2.census.gov/geo/tiger/GENZ2020/shp/cb_2020_us_state_20m.zip" download.file(url, destfile = "polygon/us_states.zip") # Unzip the downloaded file and save locally unzip("polygon/us_states.zip", exdir = "polygon/us_states") us_states <- st_read("polygon/us_states/cb_2020_us_state_20m.shp") saveRDS(us_states, "polygon/us_states.rds") # Load the saved polygon data with Massachusetts as an example us_states <- readRDS("polygon/us_states.rds") state <- "MA" polygon_data <- st_transform(us_states[us_states$STUSPS == state, ], crs = "EPSG:4326") # Load the downscaled raster output downscale_output <- readRDS("path/to/downscale_output.rds") # Slot in as argument to the aggregate function result <- aggregate(downscale_output, polygon_data) print(result) ## End(Not run)## Not run: # Download a shapefile of U.S. (polygon data) url <- "https://www2.census.gov/geo/tiger/GENZ2020/shp/cb_2020_us_state_20m.zip" download.file(url, destfile = "polygon/us_states.zip") # Unzip the downloaded file and save locally unzip("polygon/us_states.zip", exdir = "polygon/us_states") us_states <- st_read("polygon/us_states/cb_2020_us_state_20m.shp") saveRDS(us_states, "polygon/us_states.rds") # Load the saved polygon data with Massachusetts as an example us_states <- readRDS("polygon/us_states.rds") state <- "MA" polygon_data <- st_transform(us_states[us_states$STUSPS == state, ], crs = "EPSG:4326") # Load the downscaled raster output downscale_output <- readRDS("path/to/downscale_output.rds") # Slot in as argument to the aggregate function result <- aggregate(downscale_output, polygon_data) print(result) ## End(Not run)
This function finds all the tic functions called before and estimates the time elapsed for each one saves/appends it to a csv file.
alltocs(fname = "tocs.csv")alltocs(fname = "tocs.csv")
fname |
string path to where the output needs to be saved as a csv file. |
This function writes down a csv file with three columns: 1- message sepecified in the 'tic' 2- Total elapsed time and 3- the execution time
## Not run: library(tictoc) tic("Analysis") Sys.sleep(5) testfunc() tic("Adjustment") Sys.sleep(4) alltocs("timing.csv") ## End(Not run)## Not run: library(tictoc) tic("Analysis") Sys.sleep(5) testfunc() tic("Adjustment") Sys.sleep(4) alltocs("timing.csv") ## End(Not run)
Additive Log Ratio transform
alr(y)alr(y)
y |
state var |
This function provides the block-based MCMC sampling approach.
analysis_sda_block( settings, block.list.all, X, obs.mean, obs.cov, t, nt, MCMC.args, block.list.all.pre = NULL )analysis_sda_block( settings, block.list.all, X, obs.mean, obs.cov, t, nt, MCMC.args, block.list.all.pre = NULL )
settings |
pecan standard multi-site settings list. |
block.list.all |
Lists of forecast and analysis outputs for each time point of each block. If t=1, we initialize those outputs of each block with NULL from the 'sda.enkf.multisite' function. |
X |
A matrix contains ensemble forecasts with the dimensions of '[ensemble number, site number * number of state variables]'. The columns are matched with the site.ids and state variable names of the inside the 'FORECAST' object in the 'sda.enkf.multisite' script. |
obs.mean |
Lists of date times named by time points, which contains lists of sites named by site ids, which contains observation means for each state variables of each site for each time point. |
obs.cov |
Lists of date times named by time points, which contains lists of sites named by site ids, which contains observation covariances for all state variables of each site for each time point. |
t |
time point in format of YYYY-MM-DD. |
nt |
total length of time steps, corresponding to the 'nt' variable in the 'sda.enkf.multisite' function. |
MCMC.args |
arguments for the MCMC sampling, details can be found in the roxygen strucutre for control list in the 'sda.enkf.multisite' function. |
block.list.all.pre |
pre-existed block.list.all object for passing the aqq and bqq to the current SDA run, the default is NULL. Details can be found in the roxygen structure for 'pre_enkf_params' of the 'sda.enkf.multisite' function. |
This function will add data and constants into each block that are needed for the MCMC sampling.
It returns the 'build.block.xy' object and the analysis results.
Dongchen Zhang
This functions uses the FUN to perform the analysis. EnKF function is developed inside the PEcAnAssimSequential package which can be sent to this function to perform the Ensemble Kalman Filter. The other option is GEF function inside the same package allowing to perform Generalized Ensemble kalman Filter.
If you're using an arbitrary function you can use the ... to send any other variables to your desired analysis function.
Analysis.sda( settings, FUN, Forecast = list(Pf = NULL, mu.f = NULL, Q = NULL, X = NULL), Observed = list(R = NULL, Y = NULL), H, extraArg, ... )Analysis.sda( settings, FUN, Forecast = list(Pf = NULL, mu.f = NULL, Q = NULL, X = NULL), Observed = list(R = NULL, Y = NULL), H, extraArg, ... )
settings |
pecan standard settings list. |
FUN |
A Function for performing the analysis step. Two available options are: 1-EnKF and 2-GEF. |
Forecast |
A list containing the forecasts variables including Q (process variance) and X (a dataframe of forecasts state variables for different ensemble) |
Observed |
A list containing the observed variables including R (cov of observed state variables) and Y (vector of estimated mean of observed state variables) |
H |
is a matrix of 1's and 0's specifying which observations go with which variables. |
extraArg |
This argument is a list containing aqq, bqq and t. The aqq and bqq are shape parameters estimated over time for the proccess covariance and t gives the time in terms of index of obs.list. See Details. |
... |
Extra argument sent to the analysis function. In case you're using the 'GEF' function, this function requires nt, obs.mean, obs.cov, which are the total number of steps, list of observed means and list of observed cov respectively. |
Returns whatever the FUN is returning. In case of EnKF and GEF, this function returns a list with estimated mean and cov matrix of forecast state variables as well as mean and cov estimated as a result of assimilation/analysis .
Michael Dietze [email protected], Ann Raiho and Hamze Dokoohaki
Assessing parameter estimations after mapping model output to tobit space
assessParams(dat, Xt, wts = NULL, mu_f_TRUE = NULL, P_f_TRUE = NULL)assessParams(dat, Xt, wts = NULL, mu_f_TRUE = NULL, P_f_TRUE = NULL)
dat |
MCMC output |
Xt |
ensemble output matrix |
wts |
ensemble weights |
mu_f_TRUE |
muf before tobit2space |
P_f_TRUE |
Pf before tobit2space |
make plots
Michael Dietze and Ann Raiho [email protected]
This function is adopted from migest package.
block_matrix(x = NULL, b = NULL, byrow = FALSE, dimnames = NULL)block_matrix(x = NULL, b = NULL, byrow = FALSE, dimnames = NULL)
x |
Vector of numbers to identify each block. |
b |
Numeric value for the size of the blocks within the matrix ordered depending on byrow |
byrow |
logical value. If FALSE (the default) the blocks are filled by columns, otherwise the blocks in the matrix are filled by rows. |
dimnames |
Character string of name attribute for the basis of the block matrix. If NULL a vector of the same length of b provides the basis of row and column names.#'. |
Returns a matrix with block sizes determined by the b argument. Each block is filled with the same value taken from x.
Guy J. Abel
block.2.vector
block.2.vector(block.list, X, H, adjustment)block.2.vector(block.list, X, H, adjustment)
block.list |
lists of blocks generated by the 'build.block.xy' function. |
X |
A matrix contains ensemble forecasts. |
H |
H index created by the 'construct_nimble_H' function. |
adjustment |
logical variable determine if we want to adjust the analysis ensembles based on likelihood. |
It returns a list of analysis results by MCMC sampling.
Dongchen Zhang
builds X matrix for SDA
build_X( out.configs, settings, new.params, nens, read_restart_times, outdir, t = 1, var.names, my.read_restart, restart_flag = FALSE )build_X( out.configs, settings, new.params, nens, read_restart_times, outdir, t = 1, var.names, my.read_restart, restart_flag = FALSE )
out.configs |
object created for build_X passed from sda.enkf_MultiSite |
settings |
settings object, passed from sda.enkf_MultiSite |
new.params |
object created from sda_matchparam, passed from sda.enkf_MultiSite |
nens |
number of ensemble members i.e. runs |
read_restart_times |
passed from sda.enkf_MultiSite |
outdir |
location of previous run output folder containing .nc files |
t |
Default t=1, for function to work within time loop |
var.names |
list of state variables taken from settings object |
my.read_restart |
object that points to the model restart function i.e. read_restart.SIPNET |
restart_flag |
flag if it's a restart stage. Default is FALSE. |
X ready to be passed to SDA Analysis code
Alexis Helgeson
This function split long vector and covariance matrix into blocks corresponding to the localization.
build.block.xy(settings, block.list.all, X, obs.mean, obs.cov, t)build.block.xy(settings, block.list.all, X, obs.mean, obs.cov, t)
settings |
pecan standard multi-site settings list. |
block.list.all |
List contains nt empty sub-elements. |
X |
A matrix contains ensemble forecasts. |
obs.mean |
List of dataframe of observation means, named with observation datetime. |
obs.cov |
List of covariance matrices of state variables , named with observation datetime. |
t |
time point. |
This function will add data and constants into each block that are needed for the MCMC sampling.
Dongchen Zhang
This function can help to assemble sda outputs (analysis and forecasts) from each job execution.
check.qsub.job.info(username)check.qsub.job.info(username)
username |
character: username for the HPC. |
list: number of jobs and vector of job ids.
Dongchen Zhang.
Weighted conjugate wishart
conj_wt_wishart_sampler(model, mvSaved, target, control)conj_wt_wishart_sampler(model, mvSaved, target, control)
model |
model |
mvSaved |
copied to |
target |
thing being targetted |
control |
unused |
This function creates a matrix mapping obsereved data to their forecast state variable.
Construct_H(choose, Y, X)Construct_H(choose, Y, X)
choose |
a vector of observations indices oredered based on their appearances in the list of state variable names. |
Y |
vector of observations |
X |
Dataframe or matrix of forecast state variables for different ensembles. |
This returns a matrix specifying which observation go with which state variables.
Hamze Dokoohaki
This function is an upgrade to the Construct.H.multisite function which provides the index by different criteria.
construct_nimble_H( site.ids, var.names, obs.t, pft.path = NULL, by = "single", products = NULL )construct_nimble_H( site.ids, var.names, obs.t, pft.path = NULL, by = "single", products = NULL )
site.ids |
a vector name of site ids |
var.names |
vector names of state variable names |
obs.t |
list of vector of means for the time t for different sites. |
pft.path |
physical path to the pft.csv file. |
by |
criteria, it supports by variable, site, pft, all, and single Q. |
products |
vector names of products for each observation. |
Returns one vector containing index for which Q to be estimated for which variable, and the other vector gives which state variable has which observation (= element.W.Data).
Dongchen Zhang
This function is makes the blocked mapping function.
Construct.H.multisite(site.ids, var.names, obs.t.mean, products = NULL)Construct.H.multisite(site.ids, var.names, obs.t.mean, products = NULL)
site.ids |
a vector name of site ids |
var.names |
vector names of state variable names |
obs.t.mean |
list of vector of means for the time t for different sites. |
products |
vector names of products for each observation. |
Returns a matrix with block sizes determined by the b argument. Each block is filled with the same value taken from x.
Hamze
Make sure that both lists are named with siteids.
Construct.R(site.ids, var.names, obs.t.mean, obs.t.cov)Construct.R(site.ids, var.names, obs.t.mean, obs.t.cov)
site.ids |
a vector name of site ids |
var.names |
vector names of state variable names |
obs.t.mean |
list of vector of means for the time t for different sites. |
obs.t.cov |
list of list of cov for the time for different sites. |
This function returns a list with Y and R ready to be sent to the analysis functions.
Hamze Dokoohaki
The argument X needs to have an attribute pointing the state variables to their corresponding site. This attribute needs to be called 'Site'. At the moment, the cov between state variables at blocks defining the cov between two sites are assumed zero.
Contruct.Pf( site.ids, var.names, X, localization.FUN = NULL, t = 1, blocked.dis = NULL, ... )Contruct.Pf( site.ids, var.names, X, localization.FUN = NULL, t = 1, blocked.dis = NULL, ... )
site.ids |
a vector name of site ids. |
var.names |
vector names of state variable names. |
X |
a matrix of state variables. In this matrix rows represent ensembles, while columns show the variables for different sites. |
localization.FUN |
This is the function that performs the localization of the Pf matrix and it returns a localized matrix with the same dimensions. |
t |
not used |
blocked.dis |
passed to 'localization.FUN' |
... |
passed to 'localization.FUN' |
It returns the var-cov matrix of state variables at multiple sites.
Hamze Dokoohaki
Title Identify pft for each site of a multi-site settings using NLCD and Eco-region
Create_Site_PFT_CSV(settings, Ecoregion, NLCD, con)Create_Site_PFT_CSV(settings, Ecoregion, NLCD, con)
settings |
a multi-site settings |
Ecoregion |
path of Ecoregion data (*.shp) |
NLCD |
path of NLCD img file |
con |
connection to bety |
pft info with each site
## Not run: NLCD <- file.path( "/fs", "data1", "pecan.data", "input", "nlcd_2001_landcover_2011_edition_2014_10_10", "nlcd_2001_landcover_2011_edition_2014_10_10.img") Ecoregion <- file.path( "/projectnb", "dietzelab", "dongchen", "All_NEON_SDA", "NEON42", "eco-region", "us_eco_l3_state_boundaries.shp") settings <- PEcAn.settings::read.settings( "/projectnb/dietzelab/dongchen/All_NEON_SDA/NEON42/pecan.xml") con <- PEcAn.DB::db.open(settings$database$bety) site_pft_info <- Create_Site_PFT_CSV(settings, Ecoregion, NLCD, con) ## End(Not run)## Not run: NLCD <- file.path( "/fs", "data1", "pecan.data", "input", "nlcd_2001_landcover_2011_edition_2014_10_10", "nlcd_2001_landcover_2011_edition_2014_10_10.img") Ecoregion <- file.path( "/projectnb", "dietzelab", "dongchen", "All_NEON_SDA", "NEON42", "eco-region", "us_eco_l3_state_boundaries.shp") settings <- PEcAn.settings::read.settings( "/projectnb/dietzelab/dongchen/All_NEON_SDA/NEON42/pecan.xml") con <- PEcAn.DB::db.open(settings$database$bety) site_pft_info <- Create_Site_PFT_CSV(settings, Ecoregion, NLCD, con) ## End(Not run)
This function helps to average the predicted residuals based on certain criteria. The default function is averaging. TODO: we will be averaging residuals based on ML precision for each residual prediction.
debias.average(residuals, by_product_accuracy = NULL)debias.average(residuals, by_product_accuracy = NULL)
residuals |
vector: predicted residuals across products. |
by_product_accuracy |
vector: the out-of-sample RMSE for each product. |
numeric: an averaged residual.
Dongchen Zhang
This function helps to extract the covariates based on date and coordinates from the settings object.
debias.extract.cov(cov.dir, date, site.info)debias.extract.cov(cov.dir, date, site.info)
cov.dir |
character: physical path to the directory contains the time series covariate maps. |
date |
numeric or character: date that used to find the corresponding covariate file. |
site.info |
data.frame: data.frame that contains longitude and latitude in its first and second column. |
data.frame: a data frame containing M (rows) of locations across N (columns) variables.
Dongchen Zhang
This function helps to create the mapping operator that maps between product names and variable names by locations.
debias.map.products(obs.mean.t, obs_prep)debias.map.products(obs.mean.t, obs_prep)
obs.mean.t |
list: lists of sites that contain observations for each variable. |
obs_prep |
list: lists contains basic information for each product of each observation. |
matrix: a matrix containing M (rows) of locations*variables across N (columns) products.
Dongchen Zhang
This function helps to correct the forecasts' biases based on ML (random forest) training on the previous time points.
debias.ML(pred.name, cov.names, dat.train, dat.pred, var.name, py.init)debias.ML(pred.name, cov.names, dat.train, dat.pred, var.name, py.init)
pred.name |
character: the name for the predictive variable. |
cov.names |
character: the name for the covariates. |
dat.train |
data.frame: data frame containing associated covariates and predictive variable for ML training. |
dat.pred |
data.frame: data frame containing associated covariates for prediction. |
var.name |
character: variable name to be predicted. |
py.init |
R function: R function to initialize the python functions. Default is NULL. the default random forest will be used if 'py.init' is NULL. |
list: the ML predicted residuals and other ML outputs.
Dongchen Zhang
This function helps to calculate the lagged residual error time series.
debias.res.lag.calc(res.ts)debias.res.lag.calc(res.ts)
res.ts |
list: lists of residuals at each time point. |
list: lists of lagged residuals between time points.
Dongchen Zhang
This function helps to calculate the residual error for a certain time point and variable.
debias.residual.calc(obs.mean, all.X, t, var.name, data.source)debias.residual.calc(obs.mean, all.X, t, var.name, data.source)
obs.mean |
List: lists of date times named by time points, which contains lists of sites named by site ids, which contains observation means for each state variables of each site for each time point. |
all.X |
list: lists of data frame of model forecast from the beginning to the current time points that has n (ensemble size) rows and n.var (number of variables) times n.site (number of locations) columns. (e.g., 100 ensembles, 4 variables, and 8,000 locations will end up with data.frame of 100 rows and 32,000 columns). |
t |
numeric: the current number of time points (e.g., t=1 for the beginning time point). |
var.name |
character: variable name to be predicted. |
data.source |
character: product name of the data. |
list: lists of residuals between forecasts and observations across at time t.
Dongchen Zhang
This function helps to calculate the out-of-sample accuracy of residual predictions.
debias.train.accuracy( pred.name, cov.names, dat.train, var.name, py.init, ratio = 0.8 )debias.train.accuracy( pred.name, cov.names, dat.train, var.name, py.init, ratio = 0.8 )
pred.name |
character: the name for the predictive variable. |
cov.names |
character: the name for the covariates. |
dat.train |
data.frame: data frame containing associated covariates and predictive variable for ML training. |
var.name |
character: variable name to be predicted. |
py.init |
R function: R function to initialize the python functions. Default is NULL. the default random forest will be used if 'py.init' is NULL. |
ratio |
numeric from 0 to 1: define the ratio of samples used for training. The rest samples will be used for calculating the out-of-sample accuracy. Default is 0.8. |
list: the variable importance of the ML and RMSE of the out-of-sample predictions.
Dongchen Zhang
This is the main function to execute the machine learning training and prediction. Note it will be deployed by each node you requested if the qsub feature is enabled below.
downscale_main( settings, analysis, covariates.dir, time, variable, outdir, base.map.dir, method = "randomForest", cores = parallel::detectCores() )downscale_main( settings, analysis, covariates.dir, time, variable, outdir, base.map.dir, method = "randomForest", cores = parallel::detectCores() )
settings |
character: physical path that points to the pecan settings XML file. |
analysis |
numeric: data frame (rows: ensemble member; columns: site*state_variables) of updated ensemble analysis results from the 'sda_enkf' function. |
covariates.dir |
character: path to the exported covariates GeoTIFF file. |
time |
character: the time tag used to differentiate the outputs from others. |
variable |
character: name of state variable. It should match up with the column names of the analysis data frame. |
outdir |
character: the output directory where the downscaled maps will be stored. |
base.map.dir |
character: path to the GeoTIFF file within which the extents and CRS will be used to generate the ensemble maps. |
method |
character: machine learning method, default is randomForest (currently support randomForest and xgboost). |
cores |
numeric: how many CPus to be used in the calculation, the default is the total CPU number you have. |
paths to the ensemble downscaled maps.
Dongchen Zhang
This qsub function helps to run the submitted qsub jobs for running the downscale_main function.
downscale_qsub_main(folder.path)downscale_qsub_main(folder.path)
folder.path |
Character: physical path to which the job file is located. |
Dongchen Zhang
weighted multivariate normal density
dwtmnorm(x, mean, prec, wt, log = 0)dwtmnorm(x, mean, prec, wt, log = 0)
x |
random variable |
mean |
mean |
prec |
precision |
wt |
weight |
log |
log |
Given the Forecast and Observed this function performs the Ensemble Kalamn Filter.
EnKF(settings, Forecast, Observed, H, extraArg = NULL, ...)EnKF(settings, Forecast, Observed, H, extraArg = NULL, ...)
settings |
pecan standard settings list. |
Forecast |
A list containing the forecasts variables including Q (process variance) and X (a dataframe of forecasts state variables for different ensemble) |
Observed |
A list containing the observed variables including R (cov of observed state variables) and Y (vector of estimated mean of observed state variables) |
H |
is a matrix of 1's and 0's specifying which observations go with which variables. |
extraArg |
This argument is NOT used inside this function but it is a list containing aqq, bqq and t. The aqq and bqq are shape parameters estimated over time for the proccess covariance and t gives the time in terms of index of obs.list. See Details. |
... |
Extra argument sent to the analysis function. |
It returns a list with estimated mean and cov matrix of forecast state variables as well as mean and cov estimated as a result of assimilation/analysis .
Michael Dietze [email protected], Ann Raiho and Hamze Dokoohaki
Given the Forecast and Observed this function performs the Ensemble Kalamn Filter.
EnKF.MultiSite(settings, Forecast, Observed, H, extraArg = NULL, ...)EnKF.MultiSite(settings, Forecast, Observed, H, extraArg = NULL, ...)
settings |
pecan standard settings list. |
Forecast |
A list containing the forecasts variables including Q (process variance) and X (a dataframe of forecasts state variables for different ensemble) |
Observed |
A list containing the observed variables including R (cov of observed state variables) and Y (vector of estimated mean of observed state variables) |
H |
is a matrix of 1's and 0's specifying which observations go with which state variables. |
extraArg |
This argument is NOT used inside this function but it is a list containing aqq, bqq and t. The aqq and bqq are shape parameters estimated over time for the proccess covariance and t gives the time in terms of index of obs.list. |
... |
Extra argument sent to the analysis function. |
This function is different than 'EnKF' function in terms of how it creates the Pf matrix.
It returns a list with estimated mean and cov matrix of forecast state variables as well as mean and cov estimated as a result of assimilation/analysis .
Michael Dietze [email protected], Ann Raiho and Hamze Dokoohaki
Given the Forecast and Observed this function performs the Generalized Ensemble Kalamn Filter. The generalized ensemble filter follows generally the three steps of sequential state data assimilation. But, in the generalized ensemble filter we add a latent state vector that accounts for added process variance. Furthermore, instead of solving the analysis analytically like the EnKF, we have to estimate the mean analysis vector and covariance matrix with MCMC.
GEF( settings, Forecast, Observed, H, extraArg, nitr = 50000, nburnin = 10000, ... ) GEF.MultiSite(settings, Forecast, Observed, H, extraArg, ...)GEF( settings, Forecast, Observed, H, extraArg, nitr = 50000, nburnin = 10000, ... ) GEF.MultiSite(settings, Forecast, Observed, H, extraArg, ...)
settings |
pecan standard settings list. |
Forecast |
A list containing the forecasts variables including Q (process variance) and X (a dataframe of forecast state variables for different ensemble) |
Observed |
A list containing the observed variables including R (cov of observed state variables) and Y (vector of estimated mean of observed state variables) |
H |
not used |
extraArg |
This argument is a list containing aqq, bqq and t. The aqq and bqq are shape parameters estimated over time for the process covariance and t gives the time in terms of index of obs.list. See Details. |
nitr |
Number of iterations to run each MCMC chain. |
nburnin |
Number of initial, pre-thinning, MCMC iterations to discard. |
... |
This function requires nt, obs.mean, obs.cov, which are the total number of steps, list of observed means and list of observed cov respectively. |
It returns a list with estimated mean and cov matrix of forecast state variables as well as mean and cov estimated as a result of assimilation/analysis .
Michael Dietze [email protected], Ann Raiho and Hamze Dokoohaki
multisite TWEnF
GEF.MultiSite.NimbleGEF.MultiSite.Nimble
TBD
Creates file of ensemble weights in format needed for SDA
get_ensemble_weights(settings, time_do)get_ensemble_weights(settings, time_do)
settings |
PEcAn settings object |
time_do |
Give user specific time so you don't have to have it be annual |
NONE
Ann Raiho [email protected]
GrabFillMatrix
GrabFillMatrix(M, ind, M1 = NULL)GrabFillMatrix(M, ind, M1 = NULL)
M |
source matrix that will be either subtracted or filled in. |
ind |
vector of index that of where to be subtracted or filled in. |
M1 |
additional matrix used to fill in the source matrix, the default it NULL. |
This function helps subtract or fill in a matrix given the index.
Dongchen Zhang
Hop test. This script tests that the model successfully reads it's own restart and can restart without loss of information.
hop_test(settings, ens.runid = NULL, nyear)hop_test(settings, ens.runid = NULL, nyear)
settings |
SDA PEcAn settings object |
ens.runid |
run id. If not provided, is looked up from [settings$outdir]/runs.txt |
nyear |
number of years to run hop test over |
NONE
Ann Raiho [email protected]
Internal functions for plotting SDA outputs. Interactive, post analysis time-series and bias plots in base plotting system and ggplot
interactive.plotting.sda( settings, t, obs.times, obs.mean, obs.cov, obs, X, FORECAST, ANALYSIS ) postana.timeser.plotting.sda( settings, t, obs.times, obs.mean, obs.cov, obs, X, FORECAST, ANALYSIS ) postana.bias.plotting.sda( settings, t, obs.times, obs.mean, obs.cov, obs, X, FORECAST, ANALYSIS ) postana.bias.plotting.sda.corr(t, obs.times, X, aqq, bqq) post.analysis.ggplot( settings, t, obs.times, obs.mean, obs.cov, obs, X, FORECAST, ANALYSIS, plot.title = NULL ) post.analysis.ggplot.violin( settings, t, obs.times, obs.mean, obs.cov, obs, X, FORECAST, ANALYSIS, plot.title = NULL ) post.analysis.multisite.ggplot( settings, t, obs.times, obs.mean, obs.cov, FORECAST, ANALYSIS, plot.title = NULL, facetg = FALSE, readsFF = NULL, Add_Map = FALSE ) SDA_timeseries_plot( ANALYSIS, FORECAST, obs.mean = NULL, obs.cov = NULL, outdir, pft.path = NULL, by = "site", types = c("FORECAST", "ANALYSIS", "OBS"), CI = c(0.025, 0.975), unit = list(AbvGrndWood = "Mg/ha", LAI = "m2/m2", SoilMoistFrac = "", TotSoilCarb = "kg/m2"), style = list(general_color = c(FORECAST = "blue", ANALYSIS = "red", OBS = "black"), fill_color = c(FORECAST = "yellow", ANALYSIS = "green", OBS = "grey"), title_color = "red"), PDF_w = 20, PDF_h = 16, t.inds = NULL )interactive.plotting.sda( settings, t, obs.times, obs.mean, obs.cov, obs, X, FORECAST, ANALYSIS ) postana.timeser.plotting.sda( settings, t, obs.times, obs.mean, obs.cov, obs, X, FORECAST, ANALYSIS ) postana.bias.plotting.sda( settings, t, obs.times, obs.mean, obs.cov, obs, X, FORECAST, ANALYSIS ) postana.bias.plotting.sda.corr(t, obs.times, X, aqq, bqq) post.analysis.ggplot( settings, t, obs.times, obs.mean, obs.cov, obs, X, FORECAST, ANALYSIS, plot.title = NULL ) post.analysis.ggplot.violin( settings, t, obs.times, obs.mean, obs.cov, obs, X, FORECAST, ANALYSIS, plot.title = NULL ) post.analysis.multisite.ggplot( settings, t, obs.times, obs.mean, obs.cov, FORECAST, ANALYSIS, plot.title = NULL, facetg = FALSE, readsFF = NULL, Add_Map = FALSE ) SDA_timeseries_plot( ANALYSIS, FORECAST, obs.mean = NULL, obs.cov = NULL, outdir, pft.path = NULL, by = "site", types = c("FORECAST", "ANALYSIS", "OBS"), CI = c(0.025, 0.975), unit = list(AbvGrndWood = "Mg/ha", LAI = "m2/m2", SoilMoistFrac = "", TotSoilCarb = "kg/m2"), style = list(general_color = c(FORECAST = "blue", ANALYSIS = "red", OBS = "black"), fill_color = c(FORECAST = "yellow", ANALYSIS = "green", OBS = "grey"), title_color = "red"), PDF_w = 20, PDF_h = 16, t.inds = NULL )
settings |
pecan standard settings list. |
t |
current time - int number giving the position of the current time in obs.time. |
obs.times |
vector of dates of measurements |
obs.mean |
obs.mean |
obs.cov |
obs.cov |
obs |
list containing the mean and cov object |
X |
dataframe of state variables for each ensemble |
FORECAST |
Forecast object from the sda.output.Rdata. |
ANALYSIS |
Analysis object from the sda.output.Rdata. |
aqq, bqq
|
shape parameters estimated over time for the process covariance |
plot.title |
character giving the title for post visualization ggplots |
facetg |
logical: Create a subpanel for each variable? |
readsFF |
optional forward forecast |
Add_Map |
Bool variable decide if we want to export the GIS map of Ecoregion. |
outdir |
physical path where the pdf will be stored. |
pft.path |
Physical path of pft.csv file to allow by = pft option. |
by |
arrange figures by var, pft, or site. |
types |
data types that shown in the figure. |
CI |
range of confidence interval. |
unit |
list of unit used for y axis label. |
style |
color option. |
PDF_w |
width of exported PDF file, passed on to 'base::pdf()'. |
PDF_h |
height of exported PDF file, passed on to 'base::pdf()'. |
t.inds |
index of period that will be plotted. |
Dongchen Zhang
inverse of ALR transform
inv.alr(alr)inv.alr(alr)
alr |
state var |
Load data function for paleon SDA data products
load_data_paleon_sda(settings)load_data_paleon_sda(settings)
settings |
PEcAn SDA settings object |
obs.mean and obs.cov for sda.enkf
Ann Raiho [email protected]
This functions is internally used to register a series of nimble functions inside GEF analysis function.
y_star_create(X)y_star_create(X)
X |
state var |
Ann Raiho, Hamze Dokoohaki
distance.mat matrix doesn't need to be just the physical distance, however it represent a measure of similarity between state variables in different sites.
Local.support(Pf, distance.mat, scalef = 1)Local.support(Pf, distance.mat, scalef = 1)
Pf |
Forecast error covariance matrix |
distance.mat |
is matrix of distances between sites. |
scalef |
scalef is a numeric value that requires tuning and it controls the shape of the corrolation function |
It returns a localized covariance matrix by taking a Schur product between Pf and a corrolation function
Hamze Dokoohaki
matrix_network
matrix_network(mat)matrix_network(mat)
mat |
a boolean matrix representing the interactions between any sites. |
It returns lists of index representing each network.
Dongchen Zhang
MCMC_block_function
MCMC_block_function(block)MCMC_block_function(block)
block |
each block within the 'block.list' lists. |
It returns the 'block' object with analysis results filled in.
Dongchen Zhang
MCMC_function
MCMC_function(data)MCMC_function(data)
data |
list containing everything needed for the MCMC sampling. |
This function replace the previous code where implenmented the MCMC sampling part, which allows the MCMC sampling of multiple chains under parallel mode.
Michael Dietze [email protected], Ann Raiho, Hamze Dokoohaki, and Dongchen Zhang.
MCMC_Init
MCMC_Init(block.list, X)MCMC_Init(block.list, X)
block.list |
lists of blocks generated by the 'build.block.xy' function. |
X |
A matrix contains ensemble forecasts. |
This function helps create initial conditions for the MCMC sampling.
It returns the 'block.list' object with initial conditions filled in.
Dongchen Zhang
metSplit
metSplit( conf.settings, inputs, settings, model, no_split = FALSE, obs.times, t, nens, restart_flag = FALSE, my.split_inputs )metSplit( conf.settings, inputs, settings, model, no_split = FALSE, obs.times, t, nens, restart_flag = FALSE, my.split_inputs )
conf.settings |
listed multisite settings object generated by sda.enkf_MultiSite |
inputs |
listed object containing met ensemble members generated by input.ens.gen |
settings |
settings object passed to sda.enkf_MultiSite |
model |
model name ex. SIPNET |
no_split |
TRUE/FALSE if model requires met split |
obs.times |
matrix of dates used for assimilation |
t |
number of dates in obs.times |
nens |
number of ensemble members, taken from settings object |
restart_flag |
TRUE/FALSE taken from restart arguement |
my.split_inputs |
generated by sda.enkf_MultiSite ex. split_inputs.SIPNET |
input.split object with split met filepaths
Alexis Helgeson
convert from timestep to actual time points. supports year, month, week, and day as time unit.
obs_timestep2timepoint(start.date, end.date, timestep)obs_timestep2timepoint(start.date, end.date, timestep)
start.date |
start date when the first observation was taken. |
end.date |
end date when the last observation was taken. |
timestep |
a list includes time unit and number of time unit per timestep. |
timepoints from start to end date given the number of time unit per timestep.
Dongchen Zhang
Obs.data.prepare.MultiSite
Obs.data.prepare.MultiSite(obs.path, site.ids)Obs.data.prepare.MultiSite(obs.path, site.ids)
obs.path |
Path to the obs data which is expected to be an .Rdata. |
site.ids |
a character vector of site ids which need to be extracted. |
a list of observed mean and cov as the SDA expected it to be.
This function performs a simple outlier replacement on all the columns of dataframes inside a list
outlier.detector.boxplot(X)outlier.detector.boxplot(X)
X |
A list of dataframes |
A list the same dimension as X, with each column of each dataframe modified by replacing outlier points with the column median
This function helps to predict the target variable observations based on the covariates. The prediction is working in parallel across vegetated pixels.
parallel_prediction( base.map.dir, models, cov.vecs, non.na.inds, outdir, name, cores = parallel::detectCores() )parallel_prediction( base.map.dir, models, cov.vecs, non.na.inds, outdir, name, cores = parallel::detectCores() )
base.map.dir |
character: path to the GeoTIFF file within which the extents and CRS will be used to generate the ensemble maps. |
models |
list: trained models across ensemble members generated by the 'parallel_train' function. |
cov.vecs |
numeric: data frame containing covaraites across vegetated pixels generated from the 'stack_covariates_2_df' function. |
non.na.inds |
numeric: the corresponding index of vegetated pixels generated from the 'stack_covariates_2_df' function. |
outdir |
character: the output directory where the downscaled maps will be stored. |
name |
list: containing the time and variable name to create the final GeoTIFF file name. |
cores |
numeric: how many CPus to be used in the calculation, the default is the total CPU number you have. |
paths to the ensemble downscaled maps.
Dongchen Zhang
This function helps to train the ML model across ensemble members in parallel.
parallel_train( full_data, method = "randomForest", cores = parallel::detectCores() )parallel_train( full_data, method = "randomForest", cores = parallel::detectCores() )
full_data |
numeric: the matrix generated using the 'prepare_train_dat' function. |
method |
character: machine learning method (currently support randomForest and xgboost). |
cores |
numeric: how many CPus to be used in the calculation, the default is the total CPU number you have. |
list of trained models across ensemble members.
Dongchen Zhang
convert settings to geospatial points in terra.
pecan_settings_2_pts(settings)pecan_settings_2_pts(settings)
settings |
PEcAn settings: either a character that points to the settings or shape file or the actual pecan settings object will be accepted. |
terra spatial points object.
Dongchen Zhang
5th order piecewise polynomial adopted from Data assimilation for spatio-temporal processes - p250 - Sebastian Reich
piecew.poly.local(Pf, distance.mat, scalef = 2)piecew.poly.local(Pf, distance.mat, scalef = 2)
Pf |
Forecast error covariance matrix |
distance.mat |
is matrix of distances between sites. |
scalef |
scalef is a numeric value that requires tuning and it controls the shape of the corrolation function |
It returns a localized covariance matrix by taking a Schur product between Pf and a corrolation function
Hamze Dokoohaki
SDA observation preparation function for LAI and AGB
Prep_OBS_SDA(settings, out_dir, AGB_dir, Search_Window = 30)Prep_OBS_SDA(settings, out_dir, AGB_dir, Search_Window = 30)
settings |
multi.settings objects that contains multiple sites info |
out_dir |
output dir |
AGB_dir |
AGB data dir |
Search_Window |
search window for locate available LAI values |
mean and covariance of observations
This function helps to create the training dataset of specific variable type and locations for downscaling. TODO: Add a ratio argument (training sample size/total sample size) so that we could calculate the out-of-sample accuracy.
prepare_train_dat(pts, analysis, covariates.dir, variable)prepare_train_dat(pts, analysis, covariates.dir, variable)
pts |
spatialpoints: spatial points returned by 'terra::vectors' function. |
analysis |
numeric: data frame (rows: ensemble member; columns: site*state_variables) of updated ensemble analysis results from the 'sda_enkf' function. |
covariates.dir |
character: path to the exported covariates GeoTIFF file. |
variable |
character: name of state variable. It should match up with the column names of the analysis data frame. |
matrix (num.sites, num.variables * num.ensemble + num.covariates) within which the first sets of columns contain values of state variables for each ensemble member of every site, and the rest columns contain the corresponding covariates.
Dongchen Zhang
This function provides means to split large SDA runs into separate 'qsub' jobs. Including job creation, submission, and assemble.
qsub_sda( settings, obs.mean, obs.cov, Q, pre_enkf_params, ensemble.samples, outdir, control, block.index = NULL, debias = list(cov.dir = NULL, start.year = NULL, t.start = NULL, residual.lag = NULL, fun = NULL), prefix = "batch" )qsub_sda( settings, obs.mean, obs.cov, Q, pre_enkf_params, ensemble.samples, outdir, control, block.index = NULL, debias = list(cov.dir = NULL, start.year = NULL, t.start = NULL, residual.lag = NULL, fun = NULL), prefix = "batch" )
settings |
PEcAn settings object |
obs.mean |
Lists of date times named by time points, which contains lists of sites named by site ids, which contains observation means for each state variables of each site for each time point. |
obs.cov |
Lists of date times named by time points, which contains lists of sites named by site ids, which contains observation covariances for all state variables of each site for each time point. |
Q |
Process covariance matrix given if there is no data to estimate it. |
pre_enkf_params |
Used for passing pre-existing time-series of process error into the current SDA runs to ignore the impact by the differences between process errors. |
ensemble.samples |
Pass ensemble.samples from outside to avoid GitHub check issues. |
outdir |
Physical path to the folder where the SDA outputs will be stored. The default is NULL, where we will be using outdir from the settings object. |
control |
List of flags controlling the behavior of the SDA. 'TimeseriesPlot' for post analysis examination; 'OutlierDetection' decide if we want to execute the outlier detection each time after the model forecasting; 'send_email' contains lists for sending email to report the SDA progress; 'keepNC' decide if we want to keep the NetCDF files inside the out directory; 'forceRun' decide if we want to proceed the Bayesian MCMC sampling without observations; 'MCMC.args' include lists for controling the MCMC sampling process (iteration, nchains, burnin, and nthin.). |
block.index |
list of site ids for each block, default is NULL. This is used when the localization turns on. Please keep using the default value because the localization feature is still in development. |
debias |
List: R list containing the covariance directory ('cov.dir'), start year ('start.year'), start time point ('t.start'), if we want to include the 'residual.lag' as additional covariate, and the ML function ('fun'). covariance directory should include GeoTIFF files named by year. start time point is numeric input which decide when to start the debiasing feature. 'residual.lag' is either TRUE or FALSE. |
prefix |
character: the desired folder name to store the outputs. |
NONE
Dongchen Zhang
This function can help to execute sda function.
qsub_sda_batch(folder.path)qsub_sda_batch(folder.path)
folder.path |
character: path where the 'configs.rds' file is stored. |
Dongchen Zhang.
Remote_Sync_launcher
Remote_Sync_launcher(settingPath, remote.path, PID)Remote_Sync_launcher(settingPath, remote.path, PID)
settingPath |
Path to your local setting . |
remote.path |
Path generated by SDA_remote_launcher which shows the path to your remote SDA run. |
PID |
PID generated by SDA_remote_launcher which shows the active PID running your SDA job. |
This function uses a set of scaling factors defined in the pecan XML to scale a given matrix
rescaling_stateVars(settings, X, multiply = TRUE)rescaling_stateVars(settings, X, multiply = TRUE)
settings |
pecan xml settings where state variables have the scaling_factor tag |
X |
Any Matrix with column names as variable names |
multiply |
TRUE = multiplication, FALSE = division |
rescaled Matrix
random weighted multivariate normal
rwtmnorm(n, mean, prec, wt)rwtmnorm(n, mean, prec, wt)
n |
sample size |
mean |
mean |
prec |
precision |
wt |
weight |
Sample meteorological ensembles
sample_met(settings, nens = 1)sample_met(settings, nens = 1)
settings |
PEcAn settings list |
nens |
number of ensemble members to be sampled |
sample parameters
sample.parameters(ne, settings, con)sample.parameters(ne, settings, con)
ne |
number of ensemble members |
settings |
PEcAn settings object |
con |
PEcAn database connection |
data frame of sampled parameters from the posterior distribution
Michael Dietze [email protected]
sampler toggling
sampler_toggle(model, mvSaved, target, control)sampler_toggle(model, mvSaved, target, control)
model |
model |
mvSaved |
copied to |
target |
thing being targetted |
control |
unused |
This function can help to assemble sda outputs (analysis and forecasts) from each job execution.
sda_assemble(batch.folder, outdir)sda_assemble(batch.folder, outdir)
batch.folder |
character: path where the SDA batch jobs stored. |
outdir |
character: path where we want to store the assembled analysis and forecasts. |
Dongchen Zhang.
SDA_control
SDA_control( trace = TRUE, ForewardForecast = FALSE, interactivePlot = FALSE, TimeseriesPlot = FALSE, BiasPlot = FALSE, plot.title = NULL, facet.plots = FALSE, debug = FALSE, pause = FALSE, Profiling = FALSE, OutlierDetection = FALSE )SDA_control( trace = TRUE, ForewardForecast = FALSE, interactivePlot = FALSE, TimeseriesPlot = FALSE, BiasPlot = FALSE, plot.title = NULL, facet.plots = FALSE, debug = FALSE, pause = FALSE, Profiling = FALSE, OutlierDetection = FALSE )
trace |
Logical if code should print out the progress of SDA . |
ForewardForecast |
Logical if the foreward forecast estimates needs to be read and visualized in time series plots. |
interactivePlot |
Logical if time series plots need to be generated. |
TimeseriesPlot |
Logical if time series plots need to be generated. |
BiasPlot |
Logical if bias plots need to be generated |
plot.title |
Character defining the title of times series plots |
facet.plots |
Logical if the timeseries plots should be faceted based on state variables |
debug |
Logical if the code should stop at some milestones and open an interactive debugging environment |
pause |
Logical if code needs to be paused and wait for further instruction after the analysis step |
Profiling |
Logical if code should keep track of how much time each step took |
OutlierDetection |
Logical if TRUE then a simple method will be used to replace simulations of outside 3IQR with the median of ensembles. |
list of all arguments needed to setup the SDA function
This function uses either Random Forest or Convolutional Neural Network model based on the model_type parameter.
SDA_downscale( preprocessed, date, carbon_pool, covariates, model_type = "rf", seed = NULL )SDA_downscale( preprocessed, date, carbon_pool, covariates, model_type = "rf", seed = NULL )
preprocessed |
List. Preprocessed data returned as an output from the SDA_downscale_preprocess function. |
date |
Date. If SDA site run, format is yyyy/mm/dd; if NEON, yyyy-mm-dd. Restricted to years within file supplied to 'preprocessed' from the 'data_path'. |
carbon_pool |
Character. Carbon pool of interest. Name must match carbon pool name found within file supplied to 'preprocessed' from the 'data_path'. |
covariates |
SpatRaster stack. Used as predictors in downscaling. Layers within stack should be named. Recommended that this stack be generated using 'covariates' instructions in assim.sequential/inst folder |
model_type |
Character. Either "rf" for Random Forest or "cnn" for Convolutional Neural Network. Default is Random Forest. |
seed |
Numeric or NULL. Optional seed for random number generation. Default is NULL. |
This function will downscale forecast data to unmodeled locations using covariates and site locations
A list containing the training and testing data sets, models, predicted maps for each ensemble member, and predictions for testing data.
Joshua Ploshay, Sambhav Dixit
This function uses the randomForest model to downscale forecast data (hourly) to unmodeled locations using covariates and site locations
SDA_downscale_hrly(nc_file, coords, yyyy, covariates)SDA_downscale_hrly(nc_file, coords, yyyy, covariates)
nc_file |
In quotes, file path for .nc containing ensemble data. |
coords |
In quotes, file path for .csv file containing the site coordinates, columns named "lon" and "lat". |
yyyy |
In string, format is yyyy(year of interest) |
covariates |
SpatRaster stack, used as predictors in randomForest. Layers within stack should be named. Recommended that this stack be generated using 'covariates' instructions in assim.sequential/inst folder |
It returns the 'downscale_output' list containing lists for the training and testing data sets, models, and predicted maps for each ensemble member.
Harunobu Ishii
This function takes the output from the SDA_downscale function and computes various performance metrics for each ensemble. It provides a way to evaluate the accuracy of the downscaling results without modifying the main downscaling function.
SDA_downscale_metrics(downscale_output, carbon_pool)SDA_downscale_metrics(downscale_output, carbon_pool)
downscale_output |
List. Output from the SDA_downscale function, containing data, models, maps, and predictions for each ensemble. |
carbon_pool |
Character. Name of the carbon pool used in the downscaling process. |
This function calculates performance metrics for the downscaling results. It computes Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared for each ensemble. The function uses the actual values from the testing data and the predictions generated during the downscaling process.
A list of metrics for each ensemble, where each element contains MAE , MSE ,R_squared ,actual values from testing data and predicted values for the testing data
Sambhav Dixit
This function reads and checks the input data, ensuring that the required date and carbon pool exist, and that the site coordinates are valid.
SDA_downscale_preprocess(data_path, coords_path, date, carbon_pool)SDA_downscale_preprocess(data_path, coords_path, date, carbon_pool)
data_path |
Character. File path for .rds containing ensemble data. |
coords_path |
Character. File path for .csv file containing the site coordinates, with columns named "lon" and "lat". |
date |
Date. If SDA site run, format is yyyy/mm/dd; if NEON, yyyy-mm-dd. Restricted to years within the file supplied to 'data_path'. |
carbon_pool |
Character. Carbon pool of interest. Name must match the carbon pool name found within the file supplied to 'data_path'. |
This function ensures that the specified date and carbon pool are present in the input data. It also checks the validity of the site coordinates and aligns the number of rows between site coordinates and carbon data.
A list containing The read .rds data , The cleaned site coordinates, and the preprocessed carbon data.
Sambhav Dixit
sda_matchparam
sda_matchparam(settings, ensemble.samples, site.ids, nens)sda_matchparam(settings, ensemble.samples, site.ids, nens)
settings |
settings object passed from sda.enkf_MultiSite |
ensemble.samples |
taken from sample.Rdata object |
site.ids |
character object passed from sda.enkf_MultiSite |
nens |
number of ensemble members in model runs, taken from restart$runids |
new.params object used to
Alexis Helgeson
Assembler for preparing obs.mean and obs.cov for the SDA workflow
SDA_OBS_Assembler(settings)SDA_OBS_Assembler(settings)
settings |
the settings object followed by PEcAn.settings format. |
list of obs.mean and obs.cov
Dongchen Zhang
## Not run: settings_dir <- "/projectnb/dietzelab/dongchen/All_NEON_SDA/NEON42/IC/pecan.xml" settings <- PEcAn.settings::read.settings(settings_dir) OBS <- SDA_OBS_Assembler(settings) ## End(Not run)## Not run: settings_dir <- "/projectnb/dietzelab/dongchen/All_NEON_SDA/NEON42/IC/pecan.xml" settings <- PEcAn.settings::read.settings(settings_dir) OBS <- SDA_OBS_Assembler(settings) ## End(Not run)
SDA_remote_launcher
SDA_remote_launcher(settingPath, ObsPath, run.bash.args)SDA_remote_launcher(settingPath, ObsPath, run.bash.args)
settingPath |
The Path to the setting that will run SDA |
ObsPath |
Path to the obs data which is expected to be an .Rdata. |
run.bash.args |
Shell commands to be run on the remote host before launching the SDA. See examples |
This function returns a list of two pieces of information. One the remote path that SDA is running and the PID of the active run.
## Not run: # This example can be found under inst folder in the package library(PEcAn.all) library(purrr) run.bash.args <- c( "#$ -l h_rt=48:00:00", "#$ -pe omp 28 # Request a parallel environment with 4 cores", "#$ -l mem_per_core=1G # and 4G memory for each", "#$ -l buyin", "module load R/3.5.2", "module load python/2.7.13" ) settingPath <-"pecan.SDA.4sites.xml" ObsPath <- "Obs/LandTrendr_AGB_output50s.RData" SDA_remote_launcher(settingPath, ObsPath, run.bash.args) ## End(Not run)## Not run: # This example can be found under inst folder in the package library(PEcAn.all) library(purrr) run.bash.args <- c( "#$ -l h_rt=48:00:00", "#$ -pe omp 28 # Request a parallel environment with 4 cores", "#$ -l mem_per_core=1G # and 4G memory for each", "#$ -l buyin", "module load R/3.5.2", "module load python/2.7.13" ) settingPath <-"pecan.SDA.4sites.xml" ObsPath <- "Obs/LandTrendr_AGB_output50s.RData" SDA_remote_launcher(settingPath, ObsPath, run.bash.args) ## End(Not run)
Calculate ensemble weights for each site at time t.
sda_weights_site(FORECAST, ANALYSIS, t, ens)sda_weights_site(FORECAST, ANALYSIS, t, ens)
FORECAST |
FORECAST object built within the sda.enkf_MultiSite function. |
ANALYSIS |
ANALYSIS object built within the Analysis_sda_multisite function. |
t |
exact number of t inside the sda.enkf_MultiSite function. |
ens |
number of ensemble members. |
list of weights associated with each ensemble member of each site.
Dongchen Zhang and Hamze Dokoohaki
This function helps to correct the forecasts' biases based on ML (random forest) training on the previous time points.
sda.bias.correction( settings, t, t.start, dates, all.X, obs.mean, state.interval, cov.dir, residual.lag = FALSE, py.init = NULL )sda.bias.correction( settings, t, t.start, dates, all.X, obs.mean, state.interval, cov.dir, residual.lag = FALSE, py.init = NULL )
settings |
PEcAn settings object. |
t |
numeric: the current number of time points (e.g., t=1 for the beginning time point). |
t.start |
numeric: the user-defined time point to avoid the initial burnin period. |
dates |
vector: a vector of dates used for extracting covariates through time. |
all.X |
list: lists of data frame of model forecast from the beginning to the current time points that has n (ensemble size) rows and n.var (number of variables) times n.site (number of locations) columns. (e.g., 100 ensembles, 4 variables, and 8,000 locations will end up with data.frame of 100 rows and 32,000 columns) |
obs.mean |
List: lists of date times named by time points, which contains lists of sites named by site ids, which contains observation means for each state variables of each site for each time point. |
state.interval |
matrix: containing the upper and lower boundaries for each state variable. |
cov.dir |
character: physical path to the directory contains the time series covariate maps. |
residual.lag |
logical: decide if we want to include the lagged residual (difference in residual between time points) in the ML process. |
py.init |
R function: R function to initialize the python functions. Default is NULL. the default random forest will be used if 'py.init' is NULL. |
list: the current X after the bias-correction; the ML outputs for each variable; predicted residuals.
Dongchen Zhang
Restart mode: Basic idea is that during a restart (primary case envisioned as an iterative forecast), a new workflow folder is created and the previous forecast for the start_time is copied over. During restart the initial run before the loop is skipped, with the info being populated from the previous run. The function then dives right into the first Analysis, then continues on like normal.
sda.enkf( settings, obs.mean, obs.cov, Q = NULL, restart = NULL, control = list(trace = TRUE, interactivePlot = TRUE, TimeseriesPlot = TRUE, BiasPlot = FALSE, plot.title = NULL, debug = FALSE, pause = FALSE), ... )sda.enkf( settings, obs.mean, obs.cov, Q = NULL, restart = NULL, control = list(trace = TRUE, interactivePlot = TRUE, TimeseriesPlot = TRUE, BiasPlot = FALSE, plot.title = NULL, debug = FALSE, pause = FALSE), ... )
settings |
PEcAn settings object |
obs.mean |
List of dataframe of observation means, named with observation datetime. |
obs.cov |
List of covariance matrices of state variables , named with observation datetime. |
Q |
Process covariance matrix given if there is no data to estimate it. |
restart |
Used for iterative updating previous forecasts. When the restart is TRUE it read the object in SDA folder written from previous SDA. |
control |
List of flags controlling the behaviour of the SDA. trace for reporting back the SDA outcomes, interactivePlot for plotting the outcomes after each step, TimeseriesPlot for post analysis examination, BiasPlot for plotting the correlation between state variables, plot.title is the title of post analysis plots and debug mode allows for pausing the code and examining the variables inside the function. |
... |
Additional arguments, currently ignored |
NONE
Michael Dietze and Ann Raiho [email protected]
This function provides complete support for the multi-core and multi-node computation on the general HPC system. Thus, this script will be more computationally efficient, making it possible to run SDA over thousands of locations.
sda.enkf_local( settings, obs.mean, obs.cov, Q = NULL, pre_enkf_params = NULL, ensemble.samples = NULL, outdir = NULL, control = list(TimeseriesPlot = FALSE, OutlierDetection = FALSE, send_email = NULL, keepNC = TRUE, forceRun = TRUE, MCMC.args = NULL, merge_nc = TRUE), debias = list(cov.dir = NULL, start.year = NULL, t.start = NULL, residual.lag = NULL, fun = NULL) )sda.enkf_local( settings, obs.mean, obs.cov, Q = NULL, pre_enkf_params = NULL, ensemble.samples = NULL, outdir = NULL, control = list(TimeseriesPlot = FALSE, OutlierDetection = FALSE, send_email = NULL, keepNC = TRUE, forceRun = TRUE, MCMC.args = NULL, merge_nc = TRUE), debias = list(cov.dir = NULL, start.year = NULL, t.start = NULL, residual.lag = NULL, fun = NULL) )
settings |
PEcAn settings object |
obs.mean |
Lists of date times named by time points, which contains lists of sites named by site ids, which contains observation means for each state variables of each site for each time point. |
obs.cov |
Lists of date times named by time points, which contains lists of sites named by site ids, which contains observation covariances for all state variables of each site for each time point. |
Q |
Process covariance matrix given if there is no data to estimate it. |
pre_enkf_params |
Used for passing pre-existing time-series of process error into the current SDA runs to ignore the impact by the differences between process errors. |
ensemble.samples |
list of ensemble parameters across PFTs. Default is NULL. |
outdir |
physical path to the folder that stores the SDA outputs. Default is NULL. |
control |
List of flags controlling the behavior of the SDA. 'TimeseriesPlot' for post analysis examination; 'OutlierDetection' decide if we want to execute the outlier detection each time after the model forecasting; 'send_email' contains lists for sending email to report the SDA progress; 'keepNC' decide if we want to keep the NetCDF files inside the out directory; 'forceRun' decide if we want to proceed the Bayesian MCMC sampling without observations; 'MCMC.args' include lists for controling the MCMC sampling process (iteration, nchains, burnin, and nthin.). 'merge_nc' determine if we want to merge all netCDF files across sites and ensembles. If it's set as 'TRUE', we will then combine all netCDF files into the 'merged_nc' folder within the 'outdir'. |
debias |
List: R list containing the covariance directory ('cov.dir'), start year ('start.year'), start time point ('t.start'), if we want to include the 'residual.lag' as additional covariate, and the ML function ('fun'). covariance directory should include GeoTIFF files named by year. start time point is numeric input which decide when to start the debiasing feature. 'residual.lag' is either TRUE or FALSE. |
NONE
Dongchen Zhang [email protected]
Check out SDA_control function for more details on the control arguments.
sda.enkf.multisite( settings, obs.mean, obs.cov, Q = NULL, restart = NULL, pre_enkf_params = NULL, ensemble.samples = NULL, control = list(TimeseriesPlot = FALSE, OutlierDetection = FALSE, send_email = NULL, keepNC = TRUE, forceRun = TRUE, run_parallel = TRUE, MCMC.args = NULL, merge_nc = TRUE, execution = "local"), debias = list(cov.dir = NULL, t.start = NULL, residual.lag = NULL), ... )sda.enkf.multisite( settings, obs.mean, obs.cov, Q = NULL, restart = NULL, pre_enkf_params = NULL, ensemble.samples = NULL, control = list(TimeseriesPlot = FALSE, OutlierDetection = FALSE, send_email = NULL, keepNC = TRUE, forceRun = TRUE, run_parallel = TRUE, MCMC.args = NULL, merge_nc = TRUE, execution = "local"), debias = list(cov.dir = NULL, t.start = NULL, residual.lag = NULL), ... )
settings |
PEcAn settings object |
obs.mean |
Lists of date times named by time points, which contains lists of sites named by site ids, which contains observation means for each state variables of each site for each time point. |
obs.cov |
Lists of date times named by time points, which contains lists of sites named by site ids, which contains observation covariances for all state variables of each site for each time point. |
Q |
Process covariance matrix given if there is no data to estimate it. |
restart |
Used for iterative updating previous forecasts. Default NULL. List object includes file path to previous runs and start date for SDA. |
pre_enkf_params |
Used for passing pre-existing time-series of process error into the current SDA runs to ignore the impact by the differences between process errors. |
ensemble.samples |
list of ensemble parameters across PFTs. Default is NULL. |
control |
List of flags controlling the behavior of the SDA. 'TimeseriesPlot' for post analysis examination; 'OutlierDetection' decide if we want to execute the outlier detection each time after the model forecasting; 'send_email' contains lists for sending email to report the SDA progress; 'keepNC' decide if we want to keep the NetCDF files inside the out directory; 'forceRun' decide if we want to proceed the Bayesian MCMC sampling without observations; 'run_parallel' decide if we want to run the SDA under parallel mode for the 'future_map' function; 'MCMC.args' include lists for controling the MCMC sampling process (iteration, nchains, burnin, and nthin.). 'merge_nc' determine if we want to merge all netCDF files across sites and ensembles. If it's set as 'TRUE', we will then combine all netCDF files into the 'merged_nc' folder within the 'outdir'. 'execution' decide the way we want to execute model including 'local' ,where we execute the model locally; 'qsub', where we use the traditional 'start_model_runs' function for submission; 'qsub_parallel', where we first combine jobs and submit them into the SCC. |
debias |
List: R list containing the covariance directory, the start time point, and if we want to include the 'residual.lag' as additional covariate. covariance directory should include GeoTIFF files named by year. start time point is numeric input which decide when to start the debiasing feature. 'residual.lag' is either TRUE or FALSE. |
... |
Additional arguments, currently ignored |
Restart mode: Basic idea is that during a restart (primary case envisioned as an iterative forecast), a new workflow folder is created and the previous forecast for the start_time is copied over. During restart the initial run before the loop is skipped, with the info being populated from the previous run. The function then dives right into the first Analysis, then continues on like normal.
NONE
Michael Dietze, Ann Raiho and Alexis Helgeson [email protected]
Restart mode: Basic idea is that during a restart (primary case envisioned as an iterative forecast), a new workflow folder is created and the previous forecast for the start_time is copied over. During restart the initial run before the loop is skipped, with the info being populated from the previous run. The function then dives right into the first Analysis, then continues on like normal.
sda.enkf.original( settings, obs.mean, obs.cov, IC = NULL, Q = NULL, adjustment = TRUE, restart = NULL )sda.enkf.original( settings, obs.mean, obs.cov, IC = NULL, Q = NULL, adjustment = TRUE, restart = NULL )
settings |
PEcAn settings object |
obs.mean |
list of observations of the means of state variable (time X nstate) |
obs.cov |
list of observations of covariance matrices of state variables (time X nstate X nstate) |
IC |
initial conditions |
Q |
process covariance matrix given if there is no data to estimate it |
adjustment |
flag for using ensemble adjustment filter or not |
restart |
Used for iterative updating previous forecasts. This is a list that includes ens.inputs, the list of inputs by ensemble member, params, the parameters, and old_outdir, the output directory from the previous workflow. These three things are needed to ensure that if a new workflow is started that ensemble members keep there run-specific met and params. See Details |
NONE
Michael Dietze and Ann Raiho [email protected]
This function can help to assemble sda outputs (analysis and forecasts) from each job execution.
sda.qsub.job.submission( batch.folder, username, outdir = NULL, past.job.ids = NULL, max.job = 20, prefix = "sda.all.forecast.analysis.Rdata", resources = list(hour = 24, memory = "4G", ncpu = 28) )sda.qsub.job.submission( batch.folder, username, outdir = NULL, past.job.ids = NULL, max.job = 20, prefix = "sda.all.forecast.analysis.Rdata", resources = list(hour = 24, memory = "4G", ncpu = 28) )
batch.folder |
character: path where the SDA batch jobs stored. |
username |
character: username for the HPC. |
outdir |
character: path where to stored the assembled results. Default is NULL. |
past.job.ids |
vector: vector of the past submitted job ids. Default is NULL. |
max.job |
numeric: maximum allowance for the number of running jobs on HPC. |
prefix |
character: file name to be determined if a job is finished or not. |
resources |
list: computation resources (time, memory, cores) when used to submit jobs. |
vector or numeric: return -1 when all jobs are finished; return vector of submitted job ids when there is any job that is not finished.
Dongchen Zhang.
Adopted from Data assimilation for spatio-temporal processes - p250 - Sebastian Reich
simple.local(Pf, distance.mat, scalef = 2)simple.local(Pf, distance.mat, scalef = 2)
Pf |
Forecast error covariance matrix |
distance.mat |
is matrix of distances between sites. |
scalef |
scalef is a numeric value that requires tuning and it controls the shape of the corrolation function |
It returns a localized covariance matrix by taking a Schur product between Pf and a corrolation function
Hamze Dokoohaki
This function helps to build the data frame (pixels by data columns) for only vegetated pixels to improve the efficiency. Note that the 'LC' field using the 'MODIS land cover' observations (MCD12Q1.061) must be supplied in the covariates to make this function work.
stack_covariates_2_df(rast.dir, cores = parallel::detectCores())stack_covariates_2_df(rast.dir, cores = parallel::detectCores())
rast.dir |
character: a character that points to the covariates raster file generated by the 'stack_covariates_2_geotiff' function. |
cores |
numeric: how many CPus to be used in the calculation, the default is the total CPU number you have. |
list containing the data frame of covariates for vegetated pixels and the corresponding index of the pixels.
Dongchen Zhang
This function helps to stack target data layers from various GeoTIFF maps (with different extents, CRS, and resolutions) to a single map.
stack_covariates_2_geotiff( outdir, year, base.map.dir, cov.tif.file.list, normalize = T, cores = parallel::detectCores() )stack_covariates_2_geotiff( outdir, year, base.map.dir, cov.tif.file.list, normalize = T, cores = parallel::detectCores() )
outdir |
character: the output directory where the stacked GeoTIFF file will be generated. |
year |
numeric: the year of when the covariates are stacked. |
base.map.dir |
character: path to the GeoTIFF file within which the extents and CRS will be used to generate the final map. |
cov.tif.file.list |
list: a list contains sub-lists with each including path to the corresponding map and the variables to be extracted (e.g., list(LC = list(dir = "path/to/landcover.tiff", var.name = "LC")). |
normalize |
boolean: decide if we want to normalize each data layer, the default is TRUE. |
cores |
numeric: how many CPus to be used in the calculation, the default is the total CPU number you have. |
path to the exported GeoTIFF file.
Dongchen Zhang
tobit_model_censored
tobit_model_censored(settings, X, var.names, mu.f, Pf, t)tobit_model_censored(settings, X, var.names, mu.f, Pf, t)
settings |
(list) pecan standard settings list. |
X |
(numeric) A matrix contains ensemble forecasts (ensembles * variables). |
var.names |
(character) variable names. |
mu.f |
(numeric) forecast mean values. |
Pf |
(numeric) forecast covariance matrix. |
t |
(numeric) timestep. If t=1, initial values are imputed for zero values in mu.f |
list with updated mu.f, pf, X, and indication of which y values are censored
Fit tobit prior to ensemble members
tobit2space.modeltobit2space.model
TBD
update_q
update_q( block.list.all, t, nt, aqq.Init = NULL, bqq.Init = NULL, MCMC_dat = NULL, block.list.all.pre = NULL )update_q( block.list.all, t, nt, aqq.Init = NULL, bqq.Init = NULL, MCMC_dat = NULL, block.list.all.pre = NULL )
block.list.all |
each block within the 'block.list' lists. |
t |
time point. |
nt |
total length of time steps. |
aqq.Init |
the initial values of aqq, the default is NULL. |
bqq.Init |
the initial values of bqq, the default is NULL. |
MCMC_dat |
data frame of MCMC samples, the default it NULL. |
block.list.all.pre |
pre-existed block.list.all object for passing the aqq and bqq to the current SDA run, the default is NULL. |
It returns the 'block.list.all' object with initialized/updated Q filled in.
Dongchen Zhang