Title: | Stability Selection with Error Control |
---|---|
Description: | Resampling procedures to assess the stability of selected variables with additional finite sample error control for high-dimensional variable selection procedures such as Lasso or boosting. Both, standard stability selection (Meinshausen & Buhlmann, 2010, <doi:10.1111/j.1467-9868.2010.00740.x>) and complementary pairs stability selection with improved error bounds (Shah & Samworth, 2013, <doi:10.1111/j.1467-9868.2011.01034.x>) are implemented. The package can be combined with arbitrary user specified variable selection approaches. |
Authors: | Benjamin Hofner [aut, cre], Torsten Hothorn [aut] |
Maintainer: | Benjamin Hofner <[email protected]> |
License: | GPL-2 |
Version: | 0.6-4 |
Built: | 2024-11-13 03:27:50 UTC |
Source: | https://github.com/hofnerb/stabs |
(Internal) function that checks if folds result from subsampling with p = 0.5 and adds complementary pairs if needed.
check_folds(folds, B, n, sampling.type)
check_folds(folds, B, n, sampling.type)
folds |
a weight matrix that represents the subsamples. |
B |
number of subsampling replicates. |
n |
the number of observations; needed for internal checks. |
sampling.type |
sampling type to be used. |
This is an internal function used to check if folds
are
specified correctly. For details (e.g. on arguments) see
stabsel
.
A matrix containing the folds
, possibly after adding the
complementary pairs.
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi:10.1186/s12859-015-0575-3.
For details see stabsel
.
Functions that fit a model until variables are selected and
that returns the indices (and names) of the selected variables.
## package lars: lars.lasso(x, y, q, ...) lars.stepwise(x, y, q, ...) ## package glmnet: glmnet.lasso(x, y, q, type = c("conservative", "anticonservative"), ...) glmnet.lasso_maxCoef(x, y, q, ...)
## package lars: lars.lasso(x, y, q, ...) lars.stepwise(x, y, q, ...) ## package glmnet: glmnet.lasso(x, y, q, type = c("conservative", "anticonservative"), ...) glmnet.lasso_maxCoef(x, y, q, ...)
x |
a matrix containing the predictors or an object of class
|
y |
a vector or matrix containing the outcome. |
q |
number of (unique) selected variables (or groups of variables depending on the model) that are selected on each subsample. |
type |
a charachter vector specifying if the number of selected
variables per subsample is |
... |
additional arguments passed to the underlying fitting function.
See the example on |
All fitting functions are named after the package and the type of
model that is fitted: package_name.model
, e.g.,
glmnet.lasso
stands for a lasso model that is fitted using the
package glmnet.
glmnet.lasso_maxCoef
fits a lasso model with a given penalty parameter
and returns the q
largest coefficients. If one wants to use
glmnet.lasso_maxCoef
, one must specify the penalty parameter
lambda
(via the ...
argument) or in
stabsel
via args.fitfun(lambda = )
. Note that usually,
the penalty parameter cannot be specified but is chosen such that q
variables are selected. For an example on how to use
glmnet.lasso_maxCoef
see stabsel
.
A named list with elements
selected |
logical. A vector that indicates which variable was selected. |
path |
logical. A matrix that indicates which variable was selected in which step. Each row represents one variable, the columns represent the steps. |
stabsel
for stability selection itself.
if (require("TH.data")) { ## make data set available data("bodyfat", package = "TH.data") } else { ## simulate some data if TH.data not available. ## Note that results are non-sense with this data. bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10) } if (require("lars")) { ## selected variables lars.lasso(bodyfat[, -2], bodyfat[,2], q = 3)$selected lars.stepwise(bodyfat[, -2], bodyfat[,2], q = 3)$selected } if (require("glmnet")) { glmnet.lasso(bodyfat[, -2], bodyfat[,2], q = 3)$selected ## selection path glmnet.lasso(bodyfat[, -2], bodyfat[,2], q = 3)$path ## Using the anticonservative glmnet.lasso (see args.fitfun): stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = glmnet.lasso, args.fitfun = list(type = "anticonservative"), cutoff = 0.75, PFER = 1) }
if (require("TH.data")) { ## make data set available data("bodyfat", package = "TH.data") } else { ## simulate some data if TH.data not available. ## Note that results are non-sense with this data. bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10) } if (require("lars")) { ## selected variables lars.lasso(bodyfat[, -2], bodyfat[,2], q = 3)$selected lars.stepwise(bodyfat[, -2], bodyfat[,2], q = 3)$selected } if (require("glmnet")) { glmnet.lasso(bodyfat[, -2], bodyfat[,2], q = 3)$selected ## selection path glmnet.lasso(bodyfat[, -2], bodyfat[,2], q = 3)$path ## Using the anticonservative glmnet.lasso (see args.fitfun): stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = glmnet.lasso, args.fitfun = list(type = "anticonservative"), cutoff = 0.75, PFER = 1) }
Extract stability selection parameters, i.e., tuning parameters, from
a stabsel
object.
parameters(object) ## extract parameters from a stabsel model ## (same as parameters(p) ) ## S3 method for class 'stabsel' stabsel_parameters(p, ...)
parameters(object) ## extract parameters from a stabsel model ## (same as parameters(p) ) ## S3 method for class 'stabsel' stabsel_parameters(p, ...)
object |
an object of class |
p |
an object of class |
... |
additional arguments, currently not used. |
An object of class stabsel_parameters
with a special
print
method. See there for details.
stabsel
to run stability selection and
stabsel_parameters
for details on the parameters.
Display results of stability selection.
## S3 method for class 'stabsel' plot(x, main = deparse(x$call), type = c("maxsel", "paths"), xlab = NULL, ylab = NULL, col = NULL, ymargin = 10, np = sum(x$max > 0), labels = NULL, ...) ## S3 method for class 'stabsel' print(x, decreasing = FALSE, print.all = TRUE, ...)
## S3 method for class 'stabsel' plot(x, main = deparse(x$call), type = c("maxsel", "paths"), xlab = NULL, ylab = NULL, col = NULL, ymargin = 10, np = sum(x$max > 0), labels = NULL, ...) ## S3 method for class 'stabsel' print(x, decreasing = FALSE, print.all = TRUE, ...)
x |
object of class |
main |
main title for the plot. |
type |
plot type; either stability paths ( |
xlab , ylab
|
labels for the x- and y-axis of the plot. Per
default, sensible labels are used depending on the |
col |
a vector of colors; Typically, one can specify a single color or one color for each variable. Per default, colors depend on the maximal selection frequency of the variable and range from grey to red. |
ymargin |
(temporarily) specifies the y margin of of the plot in
lines (see argument |
np |
number of variables to plot for the maximum selection
frequency plot ( |
labels |
variable labels for the plot; one label per variable / effect
must be specified. Per default, the names of |
decreasing |
logical. Should the selection frequencies be printed
in descending order ( |
print.all |
logical. Should all selection frequencies be displayed or only those that are greater than zero? |
... |
additional arguments to |
This function implements the stability selection procedure by Meinshausen and Buehlmann (2010) and the improved error bounds by Shah and Samworth (2013).
Two of the three arguments cutoff
, q
and PFER
must be specified. The per-family error rate (PFER), i.e., the
expected number of false positives , where
is the
number of false positives, is bounded by the argument
PFER
.
As controlling the PFER is more conservative as controlling the
family-wise error rate (FWER), the procedure also controlls the FWER,
i.e., the probability of selecting at least one non-influential
variable (or model component) is less than PFER
.
An object of class stabsel
with a special print
method.
The object has the following elements:
phat |
selection probabilities. |
selected |
elements with maximal selection probability greater
|
max |
maximum of selection probabilities. |
cutoff |
cutoff used. |
q |
average number of selected variables used. |
PFER |
per-family error rate. |
sampling.type |
the sampling type used for stability selection. |
assumption |
the assumptions made on the selection probabilities. |
call |
the call. |
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi:10.1186/s12859-015-0575-3.
N. Meinshausen and P. Buehlmann (2010), Stability selection. Journal of the Royal Statistical Society, Series B, 72, 417–473.
R.D. Shah and R.J. Samworth (2013), Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society, Series B, 75, 55–80.
if (require("TH.data")) { ## make data set available data("bodyfat", package = "TH.data") } else { ## simulate some data if TH.data not available. ## Note that results are non-sense with this data. bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10) } ## set seed set.seed(1234) #################################################################### ### using stability selection with Lasso methods: if (require("lars")) { (stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = lars.lasso, cutoff = 0.75, PFER = 1)) par(mfrow = c(2, 1)) plot(stab.lasso, ymargin = 6) opar <- par(mai = par("mai") * c(1, 1, 1, 2.7)) plot(stab.lasso, type = "paths") }
if (require("TH.data")) { ## make data set available data("bodyfat", package = "TH.data") } else { ## simulate some data if TH.data not available. ## Note that results are non-sense with this data. bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10) } ## set seed set.seed(1234) #################################################################### ### using stability selection with Lasso methods: if (require("lars")) { (stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = lars.lasso, cutoff = 0.75, PFER = 1)) par(mfrow = c(2, 1)) plot(stab.lasso, ymargin = 6) opar <- par(mai = par("mai") * c(1, 1, 1, 2.7)) plot(stab.lasso, type = "paths") }
(Internal) function that is used to run stability selection (i.e. to apply the fit-function to the subsamples. This function is not intended to be directly called.
run_stabsel(fitter, args.fitter, n, p, cutoff, q, PFER, folds, B, assumption, sampling.type, papply, verbose, FWER, eval, names, mc.preschedule = FALSE, ...)
run_stabsel(fitter, args.fitter, n, p, cutoff, q, PFER, folds, B, assumption, sampling.type, papply, verbose, FWER, eval, names, mc.preschedule = FALSE, ...)
fitter |
a function to fit the model on subsamples. See argument
|
args.fitter |
a named list containing additional arguments that are
passed to |
n |
the number of observations; needed for internal checks. |
p |
number of possible predictors (including intercept if applicable). |
cutoff |
cutoff between 0.5 and 1. |
q |
number of (unique) selected variables (or groups of variables depending on the model) that are selected on each subsample. |
PFER |
upper bound for the per-family error rate. |
folds |
a weight matrix that represents the subsamples. |
B |
number of subsampling replicates. |
assumption |
distributional assumption. |
sampling.type |
sampling type to be used. |
papply |
(parallel) apply function. |
verbose |
logical (default: |
FWER |
deprecated. Only for compatibility with older versions, use PFER instead. |
eval |
logical. Determines whether stability selection is evaluated. |
names |
variable names that are used to label the results. |
mc.preschedule |
preschedule tasks? |
... |
additional arguments to be passed to next function. |
This is an internal function that fits the actual models to the
subsamples, i.e., this is the work horse that runs stability
selection. Usually, one should use stabsel
, which
internally calls run_stabsel
.
run_stabsel
can be used by expert users to implement stability
selection methods for new model types.
For details (e.g. on arguments) see stabsel
.
An object of class stabsel
with the following elements:
phat |
selection probabilities. |
selected |
elements with maximal selection probability greater
|
max |
maximum of selection probabilities. |
cutoff |
cutoff used. |
q |
average number of selected variables used. |
PFER |
per-family error rate. |
p |
the number of effects subject to selection. |
sampling.type |
the sampling type used for stability selection. |
assumption |
the assumptions made on the selection probabilities. |
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi:10.1186/s12859-015-0575-3.
For details see stabsel
.
Extract selected variables from a stabsel
object.
selected(object, ...) ## S3 method for class 'stabsel' selected(object, ...)
selected(object, ...) ## S3 method for class 'stabsel' selected(object, ...)
object |
an object of class |
... |
additional arguments passed to specific |
The ids of variables selected during the stability selection process
can be extracted using selected()
.
Selection of influential variables or model components with error control.
## generic stability selection funcion stabsel(x, ...) ## a method to fit models with stability selection ## S3 method for class 'matrix' stabsel(x, y, fitfun = lars.lasso, args.fitfun = list(), cutoff, q, PFER, folds = subsample(rep(1, nrow(x)), B = B), B = ifelse(sampling.type == "MB", 100, 50), assumption = c("unimodal", "r-concave", "none"), sampling.type = c("SS", "MB"), papply = mclapply, mc.preschedule = FALSE, verbose = TRUE, FWER, eval = TRUE, ...) ## essentially a wrapper for data.frames (see details) ## S3 method for class 'data.frame' stabsel(x, y, intercept = FALSE, ...)
## generic stability selection funcion stabsel(x, ...) ## a method to fit models with stability selection ## S3 method for class 'matrix' stabsel(x, y, fitfun = lars.lasso, args.fitfun = list(), cutoff, q, PFER, folds = subsample(rep(1, nrow(x)), B = B), B = ifelse(sampling.type == "MB", 100, 50), assumption = c("unimodal", "r-concave", "none"), sampling.type = c("SS", "MB"), papply = mclapply, mc.preschedule = FALSE, verbose = TRUE, FWER, eval = TRUE, ...) ## essentially a wrapper for data.frames (see details) ## S3 method for class 'data.frame' stabsel(x, y, intercept = FALSE, ...)
x |
a |
y |
a vector or matrix containing the outcome. |
intercept |
logical. If |
fitfun |
a function that takes the arguments |
args.fitfun |
a named list containing additional arguments that are
passed to the fitting function; see also argument |
cutoff |
cutoff between 0.5 and 1. Preferably a value between 0.6 and 0.9 should be used. |
q |
number of (unique) selected variables (or groups of variables depending on the model) that are selected on each subsample. |
PFER |
upper bound for the per-family error rate. This specifies the amount of falsely selected base-learners, which is tolerated. See details. |
folds |
a weight matrix with number of rows equal to the number
of observations, see |
assumption |
Defines the type of assumptions on the
distributions of the selection probabilities and simultaneous
selection probabilities. Only applicable for
|
sampling.type |
use sampling scheme of of Shah & Samworth
(2013), i.e., with complementarty pairs ( |
B |
number of subsampling replicates. Per default, we use 50
complementary pairs for the error bounds of Shah & Samworth (2013)
and 100 for the error bound derived in Meinshausen & Buehlmann
(2010). As we use |
papply |
(parallel) apply function, defaults to
|
mc.preschedule |
preschedule tasks if |
verbose |
logical (default: |
FWER |
deprecated. Only for compatibility with older versions, use PFER instead. |
eval |
logical. Determines whether stability selection is
evaluated ( |
... |
additional arguments to parallel apply methods such as
|
This function implements the stability selection procedure by
Meinshausen and Buehlmann (2010) and the improved error bounds by Shah
and Samworth (2013). For details see also Hofner et al. (2014). The
error bounds are implemented in the function
stabsel_parameters
. Two of the three arguments
cutoff
, q
and PFER
must be specified. The
per-family error rate (PFER), i.e., the expected number of false
positives , where
is the number of false positives,
is bounded by the argument
PFER
.
As controlling the PFER is more conservative as controlling the
family-wise error rate (FWER), the procedure also controlls the FWER,
i.e., the probability of selecting at least one non-influential
variable (or model component) is less than PFER
.
Predefined fitfuns
functions exist but more can be
easily implemented. Note that stepwise regression methods are usually
not advised as they tend to be relatively unstable. See example below.
The function stabsel
for data.frame
s is
essentially just a wrapper to the matrix
function with
the same argments. The only difference is that in a pre-processing
step, the data set is converted to a model matrix using the function
model.matrix
. The additional argument intercept
determines if an explicit intercept should be added to the model
matrix. This is often not neccessary but depends on the fitfun
.
An object of class stabsel
with a special print
method.
The object has the following elements:
phat |
selection probabilities. |
selected |
elements with maximal selection probability greater
|
max |
maximum of selection probabilities. |
cutoff |
cutoff used. |
q |
average number of selected variables used. |
PFER |
(realized) upper bound for the per-family error rate. |
specifiedPFER |
specified upper bound for the per-family error rate. |
p |
the number of effects subject to selection. |
B |
the number of subsamples. |
sampling.type |
the sampling type used for stability selection. |
assumption |
the assumptions made on the selection probabilities. |
call |
the call. |
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi:10.1186/s12859-015-0575-3.
N. Meinshausen and P. Buehlmann (2010), Stability selection. Journal of the Royal Statistical Society, Series B, 72, 417–473.
R.D. Shah and R.J. Samworth (2013), Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society, Series B, 75, 55–80.
stabsel_parameters
for the computation of error bounds,
stabsel.stabsel
for the fast re-computation of
parameters of a fitted stabsel
object,
fitfun
for available fitting functions and
plot.stabsel
for available plot functions
if (require("TH.data")) { ## make data set available data("bodyfat", package = "TH.data") } else { ## simulate some data if TH.data not available. ## Note that results are non-sense with this data. bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10) } ## set seed set.seed(1234) #################################################################### ### using stability selection with Lasso methods: if (require("lars")) { (stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = lars.lasso, cutoff = 0.75, PFER = 1)) (stab.stepwise <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = lars.stepwise, cutoff = 0.75, PFER = 1)) par(mfrow = c(2, 1)) plot(stab.lasso, main = "Lasso") plot(stab.stepwise, main = "Stepwise Selection") ## --> stepwise selection seems to be quite unstable even in this low ## dimensional example! } ## set seed (again to make results comparable) set.seed(1234) if (require("glmnet")) { (stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)) par(mfrow = c(2, 1)) plot(stab.glmnet, main = "Lasso (glmnet)") if (exists("stab.lasso")) plot(stab.lasso, main = "Lasso (lars)") } ## Select variables with maximum coefficients based on lasso estimate set.seed(1234) # reset seed if (require("glmnet")) { ## use cross-validated lambda lambda.min <- cv.glmnet(x = as.matrix(bodyfat[, -2]), y = bodyfat[,2])$lambda.min (stab.maxCoef <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = glmnet.lasso_maxCoef, # specify additional parameters to fitfun args.fitfun = list(lambda = lambda.min), cutoff = 0.75, PFER = 1)) ## WARNING: Using a fixed penalty (lambda) is usually not permitted and ## not sensible. See ?fitfun for details. ## now compare standard lasso with "maximal parameter estimates" from lasso par(mfrow = c(2, 1)) plot(stab.maxCoef, main = "Lasso (glmnet; Maximum Coefficients)") plot(stab.glmnet, main = "Lasso (glmnet)") ## --> very different results. } #################################################################### ### using stability selection directly on computed boosting models ### from mboost if (require("mboost")) { ### low-dimensional example mod <- glmboost(DEXfat ~ ., data = bodyfat) ## compute cutoff ahead of running stabsel to see if it is a sensible ## parameter choice. ## p = ncol(bodyfat) - 1 (= Outcome) + 1 ( = Intercept) stabsel_parameters(q = 3, PFER = 1, p = ncol(bodyfat) - 1 + 1, sampling.type = "MB") ## the same: stabsel(mod, q = 3, PFER = 1, sampling.type = "MB", eval = FALSE) ### Do not test the following code per default on CRAN as it takes some time to run: ## now run stability selection (sbody <- stabsel(mod, q = 3, PFER = 1, sampling.type = "MB")) opar <- par(mai = par("mai") * c(1, 1, 1, 2.7)) plot(sbody) par(opar) plot(sbody, type = "maxsel", ymargin = 6) }
if (require("TH.data")) { ## make data set available data("bodyfat", package = "TH.data") } else { ## simulate some data if TH.data not available. ## Note that results are non-sense with this data. bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10) } ## set seed set.seed(1234) #################################################################### ### using stability selection with Lasso methods: if (require("lars")) { (stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = lars.lasso, cutoff = 0.75, PFER = 1)) (stab.stepwise <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = lars.stepwise, cutoff = 0.75, PFER = 1)) par(mfrow = c(2, 1)) plot(stab.lasso, main = "Lasso") plot(stab.stepwise, main = "Stepwise Selection") ## --> stepwise selection seems to be quite unstable even in this low ## dimensional example! } ## set seed (again to make results comparable) set.seed(1234) if (require("glmnet")) { (stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)) par(mfrow = c(2, 1)) plot(stab.glmnet, main = "Lasso (glmnet)") if (exists("stab.lasso")) plot(stab.lasso, main = "Lasso (lars)") } ## Select variables with maximum coefficients based on lasso estimate set.seed(1234) # reset seed if (require("glmnet")) { ## use cross-validated lambda lambda.min <- cv.glmnet(x = as.matrix(bodyfat[, -2]), y = bodyfat[,2])$lambda.min (stab.maxCoef <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = glmnet.lasso_maxCoef, # specify additional parameters to fitfun args.fitfun = list(lambda = lambda.min), cutoff = 0.75, PFER = 1)) ## WARNING: Using a fixed penalty (lambda) is usually not permitted and ## not sensible. See ?fitfun for details. ## now compare standard lasso with "maximal parameter estimates" from lasso par(mfrow = c(2, 1)) plot(stab.maxCoef, main = "Lasso (glmnet; Maximum Coefficients)") plot(stab.glmnet, main = "Lasso (glmnet)") ## --> very different results. } #################################################################### ### using stability selection directly on computed boosting models ### from mboost if (require("mboost")) { ### low-dimensional example mod <- glmboost(DEXfat ~ ., data = bodyfat) ## compute cutoff ahead of running stabsel to see if it is a sensible ## parameter choice. ## p = ncol(bodyfat) - 1 (= Outcome) + 1 ( = Intercept) stabsel_parameters(q = 3, PFER = 1, p = ncol(bodyfat) - 1 + 1, sampling.type = "MB") ## the same: stabsel(mod, q = 3, PFER = 1, sampling.type = "MB", eval = FALSE) ### Do not test the following code per default on CRAN as it takes some time to run: ## now run stability selection (sbody <- stabsel(mod, q = 3, PFER = 1, sampling.type = "MB")) opar <- par(mai = par("mai") * c(1, 1, 1, 2.7)) plot(sbody) par(opar) plot(sbody, type = "maxsel", ymargin = 6) }
Compute the missing parameter from the two given parameters in order to assess suitability of the parameter constellation
stabsel_parameters(p, ...) ## Default S3 method: stabsel_parameters(p, cutoff, q, PFER, B = ifelse(sampling.type == "MB", 100, 50), assumption = c("unimodal", "r-concave", "none"), sampling.type = c("SS", "MB"), verbose = FALSE, FWER, ...) ## S3 method for class 'stabsel_parameters' print(x, heading = TRUE, ...)
stabsel_parameters(p, ...) ## Default S3 method: stabsel_parameters(p, cutoff, q, PFER, B = ifelse(sampling.type == "MB", 100, 50), assumption = c("unimodal", "r-concave", "none"), sampling.type = c("SS", "MB"), verbose = FALSE, FWER, ...) ## S3 method for class 'stabsel_parameters' print(x, heading = TRUE, ...)
p |
number of possible predictors (including intercept if applicable). |
cutoff |
cutoff between 0.5 and 1. Preferably a value between 0.6 and 0.9 should be used. |
q |
number of (unique) selected variables (or groups of variables depending on the model) that are selected on each subsample. |
PFER |
upper bound for the per-family error rate. This specifies the amount of falsely selected base-learners, which is tolerated. See details. |
B |
number of subsampling replicates. Per default, we use 50
complementary pairs for the error bounds of Shah & Samworth (2013)
and 100 for the error bound derived in Meinshausen & Buehlmann
(2010). As we use |
assumption |
Defines the type of assumptions on the
distributions of the selection probabilities and simultaneous
selection probabilities. Only applicable for
|
sampling.type |
use sampling scheme of of Shah & Samworth
(2013), i.e., with complementarty pairs ( |
verbose |
logical (default: |
FWER |
deprecated. Only for compatibility with older versions, use PFER instead. |
x |
an object of class |
heading |
logical. Specifies if a heading line should be printed. |
... |
additional arguments to be passed to next function. |
This function implements the error bounds for stability selection by Meinshausen and Buehlmann (2010) and the improved error bounds by Shah and Samworth (2013). For details see also Hofner et al. (2014).
Two of the three arguments cutoff
, q
and PFER
must be specified. The per-family error rate (PFER), i.e., the
expected number of false positives , where
is the
number of false positives, is bounded by the argument
PFER
.
For more details see also stabsel
.
An object of class stabsel_parameters
with a special print
method.
The object has the following elements:
cutoff |
cutoff used. |
q |
average number of selected variables used. |
PFER |
(realized) upper bound for the per-family error rate. |
specifiedPFER |
specified upper bound for the per-family error rate. |
p |
the number of effects subject to selection. |
B |
the number of subsamples. |
sampling.type |
the sampling type used for stability selection. |
assumption |
the assumptions made on the selection probabilities. |
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi:10.1186/s12859-015-0575-3.
N. Meinshausen and P. Buehlmann (2010), Stability selection. Journal of the Royal Statistical Society, Series B, 72, 417–473.
R.D. Shah and R.J. Samworth (2013), Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society, Series B, 75, 55–80.
For more details see also stabsel
.
Method to change the parameters cutoff
, PFER
and
assumption
of stability selection that can be altered without
the need to re-run the subsampling process.
## S3 method for class 'stabsel' stabsel(x, cutoff, PFER, assumption = x$assumption, ...)
## S3 method for class 'stabsel' stabsel(x, cutoff, PFER, assumption = x$assumption, ...)
x |
an object that results from a call to |
cutoff |
cutoff between 0.5 and 1. Preferably a value between 0.6 and 0.9 should be used. |
PFER |
upper bound for the per-family error rate. This specifies the amount of falsely selected base-learners, which is tolerated. See details. |
assumption |
Defines the type of assumptions on the
distributions of the selection probabilities and simultaneous
selection probabilities. Only applicable for
|
... |
additional arguments that are currently ignored. |
This function allows to alter the parameters cutoff
,
PFER
and assumption
of a fitted stability selection
result. All other parameters are re-used from the original stability
selection results. The missing paramter is computed and the selected
variables are updated accordingly.
An object of class stabsel
. For details see there.
stabsel
for the generic function,
stabsel_parameters
for the computation of error bounds,
fitfun
for available fitting functions and
plot.stabsel
for available plot functions
if (require("TH.data")) { ## make data set available data("bodyfat", package = "TH.data") } else { ## simulate some data if TH.data not available. ## Note that results are non-sense with this data. bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10) } ## set seed set.seed(1234) #################################################################### ### using stability selection with Lasso methods: if (require("lars")) { (stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = lars.lasso, cutoff = 0.75, PFER = 1)) par(mfrow = c(2, 1)) plot(stab.lasso) ## now change the PFER and the assumption: (stab.lasso_cf0.93_rconc <- stabsel(stab.lasso, cutoff = 0.93, assumption = "r-concave")) plot(stab.lasso_cf0.93_rconc) ## the cutoff did change and hence the PFER and the selected ## variables }
if (require("TH.data")) { ## make data set available data("bodyfat", package = "TH.data") } else { ## simulate some data if TH.data not available. ## Note that results are non-sense with this data. bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10) } ## set seed set.seed(1234) #################################################################### ### using stability selection with Lasso methods: if (require("lars")) { (stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2], fitfun = lars.lasso, cutoff = 0.75, PFER = 1)) par(mfrow = c(2, 1)) plot(stab.lasso) ## now change the PFER and the assumption: (stab.lasso_cf0.93_rconc <- stabsel(stab.lasso, cutoff = 0.93, assumption = "r-concave")) plot(stab.lasso_cf0.93_rconc) ## the cutoff did change and hence the PFER and the selected ## variables }
Set up weight matrix for subsampling with sample proportion
to be used with
stabsel
.
subsample(weights, B = 100, strata = NULL)
subsample(weights, B = 100, strata = NULL)
weights |
a numeric vector of weights for the model to be cross-validated. |
B |
number of folds, per default 25 for |
strata |
a factor of the same length as |
The function subsample
can be used to build an appropriate
weight matrix to be used with stabsel
. See there for
more details.
If strata
is defined sampling is performed in each stratum
separately thus preserving the distribution of the strata
variable in each fold.
## just a low-dimensional example subsample(weights = rep(1, 10), B = 50)
## just a low-dimensional example subsample(weights = rep(1, 10), B = 50)