eXtreme Gradient Boosting models

Usage

pipe_xgboost(
  df,
  predInput = NULL,
  responseVars = 1,
  caseClass = NULL,
  idVars = character(),
  weight = "class",
  crossValStrategy = c("Kfold", "bootstrap"),
  k = 5,
  replicates = 10,
  crossValRatio = c(train = 0.6, test = 0.2, validate = 0.2),
  params = list(),
  nrounds = 5,
  shap = TRUE,
  aggregate_shap = TRUE,
  repVi = 5,
  summarizePred = TRUE,
  scaleDataset = FALSE,
  XGBmodel = FALSE,
  DALEXexplainer = FALSE,
  variableResponse = FALSE,
  save_validateset = FALSE,
  baseFilenameXDG = NULL,
  filenameRasterPred = NULL,
  tempdirRaster = NULL,
  nCoresRaster = parallel::detectCores()%/%2,
  verbose = 0,
  ...
)

Arguments

df

a data.frame with the data.

predInput

a data.frame or a Raster with the input variables for the model as columns or layers. The columns or layer names must match the names of df columns.

responseVars

response variables as column names or indexes on df.

caseClass

class of the samples used to weight cases. Column names or indexes on df, or a vector with the class for each row in df.

idVars

id column names or indexes on df. These columns will not be used for training.

weight

Optional array of the same length as nrow(df), containing weights to apply to the model's loss for each sample.
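
As an illustration of a per-sample weight vector, the sketch below builds inverse-frequency weights from a class label so that each class contributes equally to the loss. This is an assumption about one reasonable weighting scheme, not a description of what weight = "class" computes internally; the toy factor cls is hypothetical.

```r
# Hypothetical class labels for six samples (unbalanced on purpose).
cls <- factor(c("a", "a", "a", "b", "b", "c"))

# Inverse class frequency: rare classes get larger weights.
freq <- table(cls)
w <- as.numeric(1 / freq[as.character(cls)])

# Rescale so the weights average 1 (keeps the loss on a familiar scale).
w <- w / mean(w)

# One weight per sample, as the argument description requires.
length(w) == length(cls)
```

A vector built this way has one element per row, which is the shape the description above asks for.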

crossValStrategy

Kfold or bootstrap.

k

number of data partitions when crossValStrategy="Kfold".
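
To make the role of k concrete, here is a minimal base R sketch of a K-fold partition: every row is assigned to one of k folds and each fold is held out once. It only illustrates the general technique; the function's internal partitioning may differ, and mtcars stands in for df.

```r
set.seed(1)
k <- 5
n <- nrow(mtcars)

# Assign each row to one of k folds of near-equal size, in random order.
fold_id <- sample(rep_len(seq_len(k), n))

# Each fold is held out once while the remaining k - 1 folds are used for fitting.
for (i in seq_len(k)) {
  held_out <- which(fold_id == i)
  fit_rows <- which(fold_id != i)
  stopifnot(length(held_out) + length(fit_rows) == n)
}

table(fold_id)
```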

replicates

number of replicates for crossValStrategy="bootstrap" and crossValStrategy="Kfold" (replicates * k-1, 1 fold for validation).

crossValRatio

proportion of the dataset used to train, test and validate the model when crossValStrategy="bootstrap". Defaults to c(train=0.6, test=0.2, validate=0.2). If only one value is given, it is taken as the train proportion and the test set is used for validation.
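
The following sketch shows how a proportion vector like the default could translate into row subsets. It is a plain random split written in base R under the assumption that the three proportions sum to 1; it is not the function's own resampling code, and mtcars stands in for df.

```r
set.seed(42)
crossValRatio <- c(train = 0.6, test = 0.2, validate = 0.2)
n <- nrow(mtcars)

# Subset sizes: round down, then give the leftover rows to the train set.
sizes <- floor(n * crossValRatio)
sizes["train"] <- n - sum(sizes[c("test", "validate")])

# Shuffle the row indexes and cut them into the three subsets.
idx <- sample(n)
groups <- factor(rep(names(sizes), times = sizes), levels = names(sizes))
split_idx <- split(idx, groups)

lengths(split_idx)
```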

params

the list of parameters to xgboost::xgb.train(). The complete list of parameters is available in the online documentation.
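
A params list could look like the sketch below. The parameter names are standard xgboost ones; the values are arbitrary placeholders, and the commented call is an untested assumption about how the list would be passed (mtcars and the "mpg" response are illustrative only).

```r
# Named list of xgboost training parameters (values chosen only for illustration).
params <- list(
  objective = "reg:squarederror", # regression with squared error loss
  eta = 0.1,                      # learning rate
  max_depth = 4,                  # maximum tree depth
  subsample = 0.8,                # fraction of rows sampled per tree
  colsample_bytree = 0.8          # fraction of columns sampled per tree
)

# Hypothetical call, assuming the package providing pipe_xgboost() is attached:
# res <- pipe_xgboost(df = mtcars, responseVars = "mpg", params = params, nrounds = 50)
```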

nrounds

maximum number of boosting iterations.

shap

if TRUE, return the SHAP values as shapviz::shapviz() objects.

aggregate_shap

if TRUE, and shap is also TRUE, aggregate SHAP from all replicates.

repVi

replicates of the permutations to calculate the importance of the variables. 0 to avoid calculating variable importance.

summarizePred

if TRUE, return the mean, sd and se of the predictions. If FALSE, return the predictions for each replicate.

scaleDataset

if TRUE, scale the whole dataset only once instead of the train set at each replicate. This reduces processing time for predictions with large rasters.

XGBmodel

if TRUE, return the model with the result.

DALEXexplainer

if TRUE, return an explainer for the models from the DALEX::explain() function. It doesn't work with multisession future plans.

variableResponse

if TRUE, return an aggregated_profiles_explainer object from ingredients::partial_dependency() and the coefficients of the adjusted linear model.

save_validateset

save the validateset (independent data not used for training).

baseFilenameXDG

if not missing, save the model in hdf5 format on this path with the iteration appended.

filenameRasterPred

if not missing, save the predictions in a RasterBrick to this file.

tempdirRaster

path to a directory to save temporary raster files.

nCoresRaster

number of cores used for parallelized raster computations. Uses half of the available cores by default.
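
The default expression can evaluate to 0 on a single-core machine, and parallel::detectCores() may return NA on some platforms. A defensive way to pick a value, offered as a suggestion rather than something the function does itself, might be:

```r
# Half of the detected cores, but never fewer than 1 and never NA.
nCores <- max(1L, parallel::detectCores() %/% 2L, na.rm = TRUE)
nCores
```

The resulting value could then be passed as nCoresRaster = nCores.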

verbose

if > 0, print progress information. Larger values print more information.

...

extra parameters for xgboost::xgb.train(), future.apply::future_replicate() or ingredients::feature_importance().