Default starting values for fitting logistic regression with glm

I am wondering how the default starting values are specified in glm.

This post suggests that the default values are set to zeros. This one says that there is an algorithm behind it; however, the relevant link is broken.

I tried to fit a simple logistic regression model while tracing the algorithm:

set.seed(123)

x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)

# to see parameter estimates in each step
trace(glm.fit, quote(print(coefold)), at = list(c(22, 4, 8, 4, 19, 3)))

First, without specifying starting values:

glm(y ~ x, family = "binomial")

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
NULL
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.386379 1.106234
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3991135 1.1653971
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995188 1.1669508

In the first step, the initial values are NULL.

Second, I set the starting values to zeros:

glm(y ~ x, family = "binomial", start = c(0, 0))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0 0
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3177530 0.9097521
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3909975 1.1397163
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3994147 1.1666173
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995191 1.1669518

We can see that the iterations of the first and second approaches differ.

To see the initial values chosen by glm, I tried to fit the model with only one iteration:

glm(y ~ x, family = "binomial", control = list(maxit = 1))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
NULL

Call:  glm(formula = y ~ x, family = "binomial", control = list(maxit = 1))

Coefficients:
(Intercept)            x  
     0.3864       1.1062  

Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      134.6 
Residual Deviance: 115  AIC: 119

The parameter estimates (not surprisingly) correspond to the estimates of the first approach at the second iteration, i.e., [1] 0.386379 1.106234. Setting these values as the starting values leads to the same sequence of iterations as in the first approach:

glm(y ~ x, family = "binomial", start = c(0.386379, 1.106234))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.386379 1.106234
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3991135 1.1653971
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995188 1.1669508
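
The position passed to trace() through at = depends on the structure of the glm.fit body, which may differ between R versions. As a rough alternative under that assumption, the coefficients after k iterations can be recovered by refitting with maxit = k. The helper name coef_path and the data frame dat are hypothetical, introduced only for this sketch:

```r
set.seed(123)
x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)
dat <- data.frame(x = x, y = y)

# Hypothetical helper: coefficients after 1, 2, ..., k_max IRLS iterations.
coef_path <- function(data, k_max, start = NULL) {
  path <- lapply(seq_len(k_max), function(k) {
    fit <- suppressWarnings(          # early stops warn about non-convergence
      glm(y ~ x, family = binomial(), data = data, start = start,
          control = glm.control(maxit = k))
    )
    coef(fit)
  })
  do.call(rbind, path)                # one row per iteration count
}

coef_path(dat, 4)                     # default initialization (start = NULL)
coef_path(dat, 5, start = c(0, 0))    # zeros as starting values
```

Each row can be compared with the traced values of the following iteration, since the trace prints coefold, the coefficients from the previous step.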

So the question is: how are these values calculated?

r logistic-regression glm

— Adela

It's complicated. If you supply start values, they are used in calculating what gets passed to the C_Cdqrls routine. If you don't, the values that get passed are calculated (including a call to eval(binomial()$initialize)), but glm.fit never explicitly calculates values for start. Take an hour or two and study the glm.fit code.

— Roland

Thank you for the comment. I tried to study the glm.fit code, but I still have no idea how the starting values are calculated.

— Adela

TL;DR

start=c(b0,b1) initializes eta to b0+x*b1 (mu to 1/(1+exp(-eta))).
start=c(0,0) initializes eta to 0 (mu to 0.5) regardless of the value of y or x.
start=NULL initializes eta = 1.098612 (mu = 0.75) if y = 1, regardless of the value of x.
start=NULL initializes eta = -1.098612 (mu = 0.25) if y = 0, regardless of the value of x.
Once eta (and consequently mu and var(mu)) has been calculated, w and z are calculated and sent to a QR solver, in the spirit of qr.solve(cbind(1,x) * w, z*w).
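
The steps in this summary can be sketched in a few lines of R. This is only an illustration, assuming unit prior weights, zero offset and the logit link. The helper name first_irls_step is hypothetical, and the start = NULL branch hard-codes the mu = 0.75 / 0.25 rule described above rather than calling the family's initialize expression:

```r
# Hypothetical helper (not part of base R): first IRLS step for a logistic model.
first_irls_step <- function(x, y, start = NULL) {
  X <- cbind(1, x)                       # design matrix with intercept column
  if (is.null(start)) {
    mu  <- (y + 0.5) / 2                 # 0.75 if y = 1, 0.25 if y = 0
    eta <- qlogis(mu)                    # logit link: log(mu / (1 - mu))
  } else {
    eta <- as.vector(X %*% start)        # linear predictor from supplied coefficients
    mu  <- plogis(eta)                   # inverse logit
  }
  mu_eta <- mu * (1 - mu)                # d mu / d eta for the logit link
  z <- eta + (y - mu) / mu_eta           # working response
  w <- sqrt(mu_eta^2 / (mu * (1 - mu)))  # square root of the working weights
  unname(qr.solve(X * w, z * w))         # weighted least squares via QR
}

set.seed(123)
x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)

first_irls_step(x, y)                    # compare with glm(..., control = list(maxit = 1))
first_irls_step(x, y, start = c(0, 0))   # compare with the same call plus start = c(0, 0)
```

The two results can be checked against the first non-initial values printed by the trace in the question.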

Long form

Building on Roland's comment: I made a glm.fit.truncated(), in which I took glm.fit down to the C_Cdqrls call and then commented that call out. glm.fit.truncated outputs the z and w values (as well as the values of the quantities used to calculate z and w) that would then be passed to the C_Cdqrls call:

## call Fortran code via C wrapper
fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * w, z * w,
             min(1e-7, control$epsilon/1000), check=FALSE)

More can be read about C_Cdqrls here. Luckily, the function qr.solve in base R taps directly into the LINPACK versions that are called upon in glm.fit().

So we run glm.fit.truncated for the different starting value specifications, then call qr.solve with the resulting w and z values, and we see how the "starting values" (or the first displayed iteration values) are calculated. As Roland indicated, specifying start=NULL or start=c(0,0) in glm() affects the calculations for w and z, not for start.

For start=NULL: z is a vector whose elements have the value 2.431946 or -2.431946, and w is a vector whose elements are all 0.4330127:

start.is.null <- glm.fit.truncated(x,y,family=binomial(), start=NULL)
start.is.null
w <- start.is.null$w
z <- start.is.null$z
## if start is NULL, the first displayed values are:
qr.solve(cbind(1,x) * w, z*w)  
# > qr.solve(cbind(1,x) * w, z*w)  
#                 x 
# 0.386379 1.106234

For start=c(0,0): z is a vector whose elements have the value 2 or -2, and w is a vector whose elements are all 0.5:

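## if start is c(0,0)
## (x is passed here without the intercept column, so nvars is 1 and
##  start = 0 has the required length; eta is 0 either way)
start.is.00 <- glm.fit.truncated(x,y,family=binomial(), start=0)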
start.is.00
w <- start.is.00$w
z <- start.is.00$z
## if start is c(0,0), the first displayed values are:    
qr.solve(cbind(1,x) * w, z*w)  
# > qr.solve(cbind(1,x) * w, z*w)  
#                   x 
# 0.3177530 0.9097521

That's all well and good, but how do we calculate w and z? Near the bottom of glm.fit.truncated() we see

z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])

Note the following comparisons between the output values of the quantities used to calculate z and w:

cbind(y, start.is.null$mu, start.is.00$mu)
cbind(y, start.is.null$eta, start.is.00$eta)
cbind(start.is.null$var_mu, start.is.00$var_mu)
cbind(start.is.null$mu.eta.val, start.is.00$mu.eta.val)

Note that start.is.00 will have a mu vector with only the value 0.5, because eta is set to 0 and mu(eta) = 1/(1 + exp(-0)) = 0.5. start.is.null sets those with y = 1 to mu = 0.75 (which corresponds to eta = 1.098612) and those with y = 0 to mu = 0.25 (which corresponds to eta = -1.098612), and thus var_mu = 0.75 * 0.25 = 0.1875.
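
As a worked check of these numbers, the two formulas for z and w can be evaluated by hand. This assumes unit prior weights and zero offset, as in the example, and uses the fact that for the logit link dmu/deta equals the variance function mu(1 - mu):

```latex
\begin{aligned}
z_i &= \eta_i + \frac{y_i - \mu_i}{d\mu_i/d\eta_i}, \qquad
w_i = \sqrt{\frac{(d\mu_i/d\eta_i)^2}{V(\mu_i)}}, \qquad
\frac{d\mu_i}{d\eta_i} = V(\mu_i) = \mu_i(1-\mu_i) \\[4pt]
\text{start = NULL},\ y_i = 1:\quad
z_i &= 1.098612 + \frac{1 - 0.75}{0.1875} = 2.431946, \qquad
w_i = \sqrt{0.1875} = 0.4330127 \\
\text{start = NULL},\ y_i = 0:\quad
z_i &= -1.098612 + \frac{0 - 0.25}{0.1875} = -2.431946, \qquad
w_i = \sqrt{0.1875} = 0.4330127 \\
\text{start = c(0,0)}:\quad
z_i &= 0 + \frac{y_i - 0.5}{0.25} = \pm 2, \qquad
w_i = \sqrt{0.25} = 0.5
\end{aligned}
```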

It is interesting to note, however, that I changed the seed and reran everything, and still got mu = 0.75 for y = 1 and mu = 0.25 for y = 0 (and thus the other quantities stayed the same). That is to say, start = NULL gives rise to the same w, and the same z for a given value of y, regardless of what the data are, because eta is initialized to 1.098612 (mu = 0.75) if y = 1 and to -1.098612 (mu = 0.25) if y = 0.
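
A likely explanation, stated here as an assumption about current R versions rather than something traced in this answer, is that the binomial family's initialize expression sets mustart to (weights * y + 0.5)/(weights + 1), which is 0.75 or 0.25 when all weights are 1. A small sketch to check this for several seeds follows. The helper name binomial_mustart is hypothetical:

```r
# Hypothetical helper: run the family's own initialize expression and return mustart.
binomial_mustart <- function(y, weights = rep(1, length(y))) {
  nobs <- length(y)            # the expression expects y, weights and nobs in scope
  eval(binomial()$initialize)  # evaluated in this function's frame, as in glm.fit
  mustart
}

for (seed in c(1, 42, 2024)) {
  set.seed(seed)
  y <- rbinom(100, size = 1, prob = 0.5)
  # Cross-tabulate the response against the initial means chosen for it
  print(table(y = y, mustart = binomial_mustart(y)))
}
```

If the assumption holds, every table pairs y = 0 with 0.25 and y = 1 with 0.75, whatever the seed.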

So it appears that, for start = NULL, no starting value is set for the intercept coefficient or for the x coefficient; rather, initial values are given to eta depending on the value of y and independently of the value of x. From there w and z are calculated and then sent, along with x, to qr.solve.

Code to run before the snippets above:

set.seed(123)

x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)


glm.fit.truncated <- function(x, y, weights = rep.int(1, nobs), 
start = 0,etastart = NULL, mustart = NULL, 
offset = rep.int(0, nobs),
family = binomial(), 
control = list(), 
intercept = TRUE,
singular.ok = TRUE
){
control <- do.call("glm.control", control)
x <- as.matrix(x)
xnames <- dimnames(x)[[2L]]
ynames <- if(is.matrix(y)) rownames(y) else names(y)
conv <- FALSE
nobs <- NROW(y)
nvars <- ncol(x)
EMPTY <- nvars == 0
## define weights and offset if needed
if (is.null(weights))
  weights <- rep.int(1, nobs)
if (is.null(offset))
  offset <- rep.int(0, nobs)

## get family functions:
variance <- family$variance
linkinv  <- family$linkinv
if (!is.function(variance) || !is.function(linkinv) )
  stop("'family' argument seems not to be a valid family object", call. = FALSE)
dev.resids <- family$dev.resids
aic <- family$aic
mu.eta <- family$mu.eta
unless.null <- function(x, if.null) if(is.null(x)) if.null else x
valideta <- unless.null(family$valideta, function(eta) TRUE)
validmu  <- unless.null(family$validmu,  function(mu) TRUE)
if(is.null(mustart)) {
  ## calculates mustart and may change y and weights and set n (!)
  eval(family$initialize)
} else {
  mukeep <- mustart
  eval(family$initialize)
  mustart <- mukeep
}
if(EMPTY) {
  eta <- rep.int(0, nobs) + offset
  if (!valideta(eta))
    stop("invalid linear predictor values in empty model", call. = FALSE)
  mu <- linkinv(eta)
  ## calculate initial deviance and coefficient
  if (!validmu(mu))
    stop("invalid fitted means in empty model", call. = FALSE)
  dev <- sum(dev.resids(y, mu, weights))
  w <- sqrt((weights * mu.eta(eta)^2)/variance(mu))
  residuals <- (y - mu)/mu.eta(eta)
  good <- rep_len(TRUE, length(residuals))
  boundary <- conv <- TRUE
  coef <- numeric()
  iter <- 0L
} else {
  coefold <- NULL
  eta <-
    if(!is.null(etastart)) etastart
  else if(!is.null(start))
    if (length(start) != nvars)
      stop(gettextf("length of 'start' should equal %d and correspond to initial coefs for %s", nvars, paste(deparse(xnames), collapse=", ")),
           domain = NA)
  else {
    coefold <- start
    offset + as.vector(if (NCOL(x) == 1L) x * start else x %*% start)
  }
  else family$linkfun(mustart)
  mu <- linkinv(eta)
  if (!(validmu(mu) && valideta(eta)))
    stop("cannot find valid starting values: please specify some", call. = FALSE)
  ## calculate initial deviance and coefficient
  devold <- sum(dev.resids(y, mu, weights))
  boundary <- conv <- FALSE

  ##------------- THE Iteratively Reweighting L.S. iteration -----------
  for (iter in 1L:control$maxit) {
    good <- weights > 0
    varmu <- variance(mu)[good]
    if (anyNA(varmu))
      stop("NAs in V(mu)")
    if (any(varmu == 0))
      stop("0s in V(mu)")
    mu.eta.val <- mu.eta(eta)
    if (any(is.na(mu.eta.val[good])))
      stop("NAs in d(mu)/d(eta)")
    ## drop observations for which w will be zero
    good <- (weights > 0) & (mu.eta.val != 0)

    if (all(!good)) {
      conv <- FALSE
      warning(gettextf("no observations informative at iteration %d",
                       iter), domain = NA)
      break
    }
    z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
    w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
    # ## call Fortran code via C wrapper
    # fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * w, z * w,
    #              min(1e-7, control$epsilon/1000), check=FALSE)
    # 

    #print(iter)
    #print(z)
    #print(w)
  }


  }
  return(list(z=z, w=w, mustart=mustart, etastart=etastart, eta=eta, offset=offset, mu=mu, mu.eta.val=mu.eta.val,
              weight=weights, var_mu=variance(mu)))

}

— swihart

Thank you for your excellent answer, this is much more than I hoped for :)

— Adela