Introduction

This vignette gives an overview of the main functions of the package, using the simulated data that ships with it.

Setup

We load the package stackBagg:

library(stackBagg)

Example: Simulated Data

The simulated data can be loaded via:
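A minimal way to do this is shown below; the object name exampleData is taken from the description that follows, and the data may also be lazy-loaded when the package is attached.

data("exampleData", package = "stackBagg")  # assumes the data object is exported as exampleData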

exampleData is a list with three elements. The first element contains the training data set and the second element contains the test data set. The train and test data sets consist of the variables id, the binary outcome E, the event times ttilde, the event indicator delta at ttilde (censored observations are coded 0), trueT, which denotes the binary outcome under no censoring, and lastly 20 covariates:

train <- exampleData[[1]][[1]]
head(train)
#>      id    E    ttilde delta trueT          X1           X2          X3
#> 423 423    0 52.300159     1 FALSE  0.08281419  0.172352230  0.06647657
#> 780 780    1  7.949695     1  TRUE  0.16976528 -0.081576673 -0.10240238
#> 250 250    0 34.372640     1 FALSE  0.09636968  0.052661443  0.12062371
#> 646 646 <NA>  8.767113     0 FALSE -0.15564309 -0.061024800 -0.01785616
#> 411 411    0 36.957775     1 FALSE -0.07297439  0.004880959 -0.13187802
#> 273 273 <NA> 10.331866     0 FALSE  0.09327623 -0.163935475 -0.05678734
#>               X4           X5          X6          X7          X8
#> 423  0.010862900 -0.060631525 -0.08074032  0.16873440  0.04340777
#> 780 -0.063301947 -0.189945655 -0.24955745 -0.05273801  0.01022313
#> 250  0.002720333 -0.121034800 -0.15937346  0.12669938  0.13122538
#> 646  0.080797624  0.007448868 -0.04943161  0.06093838 -0.23895869
#> 411 -0.046518184  0.003982125 -0.04968490 -0.09684055  0.06668652
#> 273 -0.106773616 -0.053778966 -0.36543560  0.04241530 -0.04468489
#>               X9         X10         X11        X12        X13        X14
#> 423 -0.006117758 -0.12283608 -0.64504704  0.3258370 -0.2671135  0.8266186
#> 780 -0.070563657 -0.08396796 -0.18397728  0.6004917 -0.4475623 -0.3590347
#> 250  0.029273293  0.01372634 -0.38093408 -0.9907098  0.2355303  0.8957453
#> 646 -0.113911314 -0.13912994 -0.69319058 -1.3872817  1.0595717 -0.1584360
#> 411 -0.071835283 -0.10135100 -0.02037296 -0.3527412 -0.4043686 -0.6587839
#> 273 -0.071334425 -0.05086768 -0.67349006 -0.3978803  0.3390769  0.2385083
#>            X15        X16        X17        X18         X19        X20
#> 423 -0.2348479  0.9260907  0.5582131  0.1672586  0.05897599  0.4196418
#> 780  0.9044775  0.0510946 -0.3397124  0.8655912  0.58665758 -1.0098241
#> 250  0.3558889  1.3780777  0.1241028 -0.4286867  0.10661455 -0.4749545
#> 646 -0.7281273  0.6509288 -0.5915999 -0.2572964 -0.81711210  0.5843599
#> 411  0.5088058 -0.5017764 -0.6649954  0.2478391  0.03220401  0.7999203
#> 273 -0.1007935  0.1772234  0.9978399  0.7961343 -0.08876092  0.4818328
test <- exampleData[[2]][[1]]
head(test)
#>    id    E    ttilde delta trueT          X1          X2          X3
#> 2   2    0 64.494547     1 FALSE  0.19653678 -0.06612803 -0.17318574
#> 7   7    0 64.323597     2 FALSE  0.07215234  0.02897143  0.08957245
#> 32 32    1 20.115272     1  TRUE  0.10124635 -0.13058958  0.08660183
#> 50 50 <NA>  9.417668     0  TRUE -0.05097176 -0.06717374  0.18942554
#> 62 62 <NA> 12.124566     0 FALSE -0.11860139 -0.03328412  0.01552746
#> 65 65    0 20.023369     2 FALSE  0.21348415 -0.02690853 -0.02702318
#>             X4          X5          X6          X7          X8
#> 2   0.08928017  0.02733158 -0.06803704 -0.06582461 -0.03348928
#> 7   0.05467074 -0.05344298 -0.07477942  0.08642445 -0.05110457
#> 32  0.01934914 -0.13447104 -0.00592559  0.02323055  0.03760170
#> 50 -0.15595468  0.18782818 -0.11684748 -0.07174367  0.10368502
#> 62 -0.03566977  0.17223400  0.11178515 -0.07117080  0.10993110
#> 65  0.05891138  0.08819088  0.02286303  0.07790421  0.13443462
#>               X9         X10         X11         X12         X13
#> 2  -0.0817359311  0.03067802  0.53518872  0.04725869  0.69187769
#> 7  -0.0635635495 -0.02555120  0.45535370  0.24984347  0.08591775
#> 32 -0.0156705988  0.09414932 -0.77074516 -0.05461036  0.18559911
#> 50 -0.0001588212  0.03087372  0.09910245  0.08448019  0.06461913
#> 62  0.2151523951 -0.03255328  0.75207670  0.66549092  0.46742318
#> 65  0.1478304786  0.01003466 -0.01380349 -1.25888263 -0.49010812
#>           X14        X15           X16        X17        X18         X19
#> 2  -0.1857249  0.4538783  0.0308172015 -0.7309331  0.3560292 -0.27862282
#> 7   0.5821856 -0.3025899 -0.9635968963  1.4319108 -0.1644014  1.37290745
#> 32 -0.4332053  0.9523075  0.3172001109 -1.0319663  1.0912307 -0.09370163
#> 50 -1.2068642 -0.7959719 -0.2613358633 -0.4844889 -0.5355841  0.01226348
#> 62 -1.2911703 -0.2458894  0.0004135976 -0.7451024 -0.6173149 -0.43773691
#> 65  0.2247324  0.3529755  0.3388744213 -0.2660526 -0.2320293 -0.18283466
#>            X20
#> 2  -0.63319481
#> 7   0.02371843
#> 32 -0.57132685
#> 50 -0.55612238
#> 62 -0.10535540
#> 65 -1.30989032

The third element of exampleData is the true AUC for the simulated data on the cumulative incidence of the main event at time 26.5, computed analytically under the Weibull distribution using the scale and shape parameters of both event times.
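For reference, this value can be extracted from the list (the object name trueAUC is just an illustrative choice):

trueAUC <- exampleData[[3]]  # true AUC at time 26.5, computed analytically
trueAUC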

The train data set consists of 800 individuals, of whom 167 are censored and 180 experience the event of interest by time 26.5.
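These counts can be obtained, for example, as follows; the sketch assumes the coding delta = 0 for censored, 1 for the main event and 2 for the competing event, as in the data shown above.

nrow(train)                                   # 800 individuals
sum(train$delta == 0)                         # censored observations
sum(train$ttilde <= 26.5 & train$delta == 1)  # main event by time 26.5
sum(train$ttilde <= 26.5 & train$delta == 2)  # competing event by time 26.5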

In addition, 51 subjects experience the competing event within the time of interest 26.5; these individuals are part of the control group.

The test data set consists of 200 individuals, of whom 45 are censored and 39 experience the event of interest by time 26.5.
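The analogous counts for the test data set, under the same coding assumption as above:

sum(test$delta == 0)                         # censored observations
sum(test$ttilde <= 26.5 & test$delta == 1)   # main event by time 26.5
sum(test$ttilde <= 26.5 & test$delta == 2)   # competing event by time 26.5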

In addition, 24 subjects experience the competing event within the time of interest 26.5.

Next we apply stackBagg to estimate the risk of experiencing the main event using the 20 covariates in the data set. In other words, using stackBagg we are going to estimate \(P(T < 26.5, \delta = 1 \mid X)\) with a library of machine learning algorithms. Before applying stackBagg, we need to make sure that the data set is in the appropriate format: the event time is in the first column and the event type indicator is in the second column.

Note: The first two columns of the data.frame must be in the following order: time and then status/event indicator. The function stackBagg::stackBagg creates the binary variable of interest E for us from the time point of interest and the status/event indicator.
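For the example data, this ordering can be obtained by moving ttilde and delta to the front; this is a minimal sketch in which the remaining columns keep their original order.

train <- train[, c("ttilde", "delta", setdiff(names(train), c("ttilde", "delta")))]
test  <- test[,  c("ttilde", "delta", setdiff(names(test),  c("ttilde", "delta")))]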

Another argument of the function stackBagg::stackBagg is the vector of covariate names, as they appear in the data set, that we want to include in the model. As mentioned above, we are going to use all 20 covariates.
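Since the covariates in the example data are named X1 through X20, the vector of names can be built as follows (the object name covariates is an arbitrary choice):

covariates <- paste0("X", 1:20)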

We also have to specify the library of algorithms that we want to use to predict the event of interest and to form the stack. All the algorithms that the user can potentially include in the analysis can be listed with stackBagg::all.algorithms(). Let's use all of them and denote the resulting library as ens.library.
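Following the text above, this amounts to:

ens.library <- stackBagg::all.algorithms()  # use every algorithm the package offers
ens.library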

Another argument of the function stackBagg::stackBagg is a list of tuning parameters for each machine learning procedure. If this argument is missing, the function uses as defaults the same values used for the simulations in the paper. For now, we are going to use the default tuning parameters. Additionally, we will use 5 folds, and we are going to show the results computing the weights under a Cox proportional hazards model (CoxPH) and boosting Cox regression (Cox-Boost).

First, we model the weights under CoxPH, train the different models and obtain their predictions on the test data set.
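A sketch of what such a call might look like; the argument names below are illustrative assumptions rather than the package's documented interface (see ?stackBagg::stackBagg for the exact arguments).

# Illustrative only: argument names are assumed, not taken from the package documentation.
pred <- stackBagg::stackBagg(train.data = train, test.data = test,
                             xnam = covariates, tao = 26.5,
                             ens.library = ens.library,
                             weighting = "CoxPH", folds = 5)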

We show several components of the output object pred below.

Predictions and performance of IPCW Bagging

Let's first take a look at the IPCW Bagging predictions of the different algorithms and the stacked IPCW Bagging on the test data set:
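For example, the first rows of the prediction matrix can be inspected via the prediction_ensBagg component, which is also used in the plotting calls below:

head(pred$prediction_ensBagg)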

The assessment of predictive performance using the IPCW AUC is:

The optimal coefficients used to obtain the stacked IPCW Bagging are:

We can note that the algorithms with the largest predictive performance carry more weight in the stack. We check whether convergence was reached in the optimization problem of finding the optimal coefficients (0 denotes convergence, 1 otherwise), as well as the penalization term used:

We can check the tuning parameters used to train the algorithms:

The GAM is trained with two degrees of freedom, 3 and 4. The LASSO parameter refers to the lambda penalization term. The parameters in the random forest refer to the number of trees (num_tree=500) and the number of variables randomly sampled as candidates at each split (mtry=4). The k chosen in the k-NN is 25. The SVM parameters are the cost, gamma and kernel (radial=1 and linear=2); since the kernel is linear, gamma is NA, because the linear kernel does not use the gamma parameter. The number of neurons is set to 1 in the neural network. The last values refer to the number of trees, k, which determines the prior probability that the average of the outcome falls into (-3,3), and q, the quantile of the prior on the error variance.

ROC curve

Let's compare the ROC curves for the best and worst single algorithms and the stack. The ROC curve of the stack is

stackBagg::plot_roc(time = test$ttilde, delta = test$delta, marker = pred$prediction_ensBagg[, "Stack"], wts = pred$wts_test, tao = 26.5, method = "ipcw")

The GAM.4 ROC curve is

stackBagg::plot_roc(time = test$ttilde, delta = test$delta, marker = pred$prediction_ensBagg[, "ens.gam.4"], wts = pred$wts_test, tao = 26.5, method = "ipcw")

The k-NN ROC curve is

stackBagg::plot_roc(time = test$ttilde, delta = test$delta, marker = pred$prediction_ensBagg[, "ens.knn"], wts = pred$wts_test, tao = 26.5, method = "ipcw")

Survival Methods

Moreover, let's see the predictions of three survival-based methods: a cause-specific Cox proportional hazards regression model, Cox-Boost and survival random forests for competing risks.

Naive Methods

Lastly, we can look at the performance of the algorithms if we were to discard the censored observations.
