Why is using Newton's method to optimize logistic regression called iteratively reweighted least squares?

It isn't clear to me, because the logistic loss and the least-squares loss are completely different things.

— Haitao Du

I don't think they are the same. IRLS is Newton-Raphson with the expected Hessian in place of the observed Hessian.

— Dimitriy V. Masterov

@DimitriyV.Masterov thanks, could you tell me more about the expected vs. observed Hessian? Also, what do you think about this explanation

— Haitao Du

See also stats.stackexchange.com/questions/236676/…

— kjetil b halvorsen

Summary: GLMs are fit via Fisher scoring which, as Dimitriy V. Masterov notes, is Newton-Raphson with the expected Hessian instead (i.e. we use an estimate of the Fisher information instead of the observed information). If we are using the canonical link function, it turns out that the observed Hessian equals the expected Hessian, so NR and Fisher scoring are the same in that case. Either way, we'll see that Fisher scoring is actually fitting a weighted least squares linear model, and the coefficient estimates from this converge$^*$ on a maximum of the logistic regression likelihood. Aside from reducing the fitting of a logistic regression to an already solved problem, we also get the benefit of being able to use linear regression diagnostics on the final WLS fit to learn about our logistic regression.

I'm going to keep this focused on logistic regression, but for a more general perspective on maximum likelihood in GLMs I recommend section 15.3 of this chapter, which goes through this and derives IRLS in a more general setting (I think it's from John Fox's Applied Regression Analysis and Generalized Linear Models).

$^*$ see comments at the end

The likelihood and score function

We will fit our GLM by iterating something of the form

$b^{(m+1)} = b^{(m)} - J^{-1}_{(m)}\nabla \ell(b^{(m)})$

where $\ell$ is the log likelihood and $J_{(m)}$ will be either the observed or expected Hessian of the log likelihood.

Our link function is a function $g$ that maps the conditional mean $\mu_i = E(y_i | x_i)$ to our linear predictor, so our model for the mean is $g(\mu_i) = x_i^T\beta$ . Let $h$ be the inverse link function mapping the linear predictor to the mean.
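As a small illustration of the $g$ / $h$ pair (a sketch on the side, not part of the derivation): base R's `make.link()` returns a link, its inverse, and the derivative of the inverse, which play the roles of $g$ , $h$ and $h'$ here. The variable names below are my own.

```r
# Link g, inverse link h, and derivative h' for the logit and probit links,
# taken from make.link() in base R's stats package.
logit  <- make.link("logit")
probit <- make.link("probit")

eta <- c(-2, 0, 1.5)          # a few values of the linear predictor
mu  <- logit$linkinv(eta)     # h(eta) = 1 / (1 + exp(-eta))

# g and h are inverses of each other, so this recovers eta
g_of_h <- logit$linkfun(mu)

# mu.eta is h'(eta); for the logit link it equals h * (1 - h)
h_prime_logit  <- logit$mu.eta(eta)
h_prime_probit <- probit$mu.eta(eta)   # the standard normal density
```

The identity $h' = h(1-h)$ for the logit link is what makes the canonical case simplify later on.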

For a logistic regression we have a Bernoulli likelihood with independent observations, so

$\ell(b; y) = \sum_{i=1}^n y_i\log h(x_i^T b) + (1 - y_i) \log(1 - h(x_i^Tb)).$

Taking derivatives,

$\frac{\partial \ell}{\partial b_j} = \sum_{i=1}^n \frac{y_i}{h(x_i^T b)} h'(x_i^T b) x_{ij} - \frac{1 - y_i}{1 - h(x_i^T b)} h'(x_i^T b) x_{ij}$

$= \sum_{i=1}^n x_{ij} h'(x_i^T b) \left(\frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right)$

$= \sum_i x_{ij} \frac{h'(x_i^T b)}{h(x_i^T b)(1 - h(x_i^T b))}(y_i - h(x_i^T b)).$
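To sanity-check the score formula, here is a minimal sketch that compares it to a central finite difference of the log likelihood under the probit link. The data, the evaluation point `b`, and all names are made up for illustration.

```r
# Check the score formula against central finite differences (probit link).
set.seed(1)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)            # arbitrary 0/1 responses, enough for a derivative check
b <- c(0.3, -0.2, 0.1)            # arbitrary point at which to evaluate the gradient

h       <- function(eta) pnorm(eta)   # inverse link
h_prime <- function(eta) dnorm(eta)   # its derivative

loglik <- function(b) {
  mu <- h(X %*% b)
  sum(y * log(mu) + (1 - y) * log(1 - mu))
}

# analytic score: sum_i x_ij * h' / (h (1 - h)) * (y_i - h)
score <- function(b) {
  eta <- X %*% b
  mu  <- h(eta)
  drop(t(X) %*% (h_prime(eta) / (mu * (1 - mu)) * (y - mu)))
}

# central differences, one coordinate at a time
num_grad <- sapply(seq_len(p), function(j) {
  e <- replace(numeric(p), j, 1e-6)
  (loglik(b + e) - loglik(b - e)) / 2e-6
})

max_abs_diff <- max(abs(score(b) - num_grad))
```

If the derivation is right, `max_abs_diff` should be tiny compared with the size of the gradient entries.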

Using the canonical link

Now let's assume we're using the canonical link function $g_c = \text{logit}$ . Then $g^{-1}_c(x) := h_c(x) = \frac{1}{1+e^{-x}}$ so $h_c' = h_c \cdot (1-h_c)$ , which means this simplifies to

$\frac{\partial \ell}{\partial b_j} = \sum_i x_{ij} (y_i - h_c(x_i^T b))$

so, with $\hat y_i = h_c(x_i^T b)$ ,

$\nabla \ell (b; y) = X^T (y - \hat y).$

Furthermore, still using $h_c$ ,

$\frac{\partial^2 \ell}{\partial b_k \partial b_j} = - \sum_i x_{ij} \frac{\partial}{\partial b_k} h_c(x_i^T b) = - \sum_i x_{ij}x_{ik} \left[h_c(x_i^T b) (1 - h_c(x_i^T b))\right].$

Let

$W = \text{diag}\left(h_c(x_1^T b)(1 - h_c(x_1^T b)), \dots, h_c(x_n^T b)(1 - h_c(x_n^T b))\right) = \text{diag}\left(\hat y_1(1 - \hat y_1), \dots, \hat y_n (1 - \hat y_n)\right).$

Then we have

$H = -X^TWX$

and note how this doesn't have any $y_i$ in it anymore, so $E(H) = H$ (we're viewing this as a function of $b$ , so the only random thing is $y$ itself). Thus we've shown that Fisher scoring is equivalent to Newton-Raphson when we use the canonical link in logistic regression. Also, by virtue of $\hat y_i \in (0,1)$ , $-X^TWX$ will always be strictly negative definite (given a full column rank $X$ ), although numerically, if $\hat y_i$ gets too close to $0$ or $1$ , some weights may round to $0$ , which can make $H$ negative semidefinite and therefore computationally singular.

Now create the working response $z = W^{-1}(y - \hat y)$ and note that

$\nabla \ell = X^T(y - \hat y) = X^T W z.$

All together this means that we can optimize the log likelihood by iterating

$b^{(m+1)} = b^{(m)} + (X^T W_{(m)} X)^{-1}X^T W_{(m)} z_{(m)}$

and $(X^T W_{(m)} X)^{-1}X^T W_{(m)} z_{(m)}$ is exactly $\hat \beta$ for a weighted least squares regression of $z_{(m)}$ on $X$ .
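Before running the full loop, a single step can be checked in isolation. The sketch below (simulated data and names are my own) computes one update under the logit link two ways: as $-H^{-1}\nabla \ell$ and as the coefficients of a weighted regression of $z$ on $X$ .

```r
# One Newton/Fisher step under the logit link, computed two ways.
set.seed(2)
n <- 300; p <- 4
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X %*% c(1, -1, 0.5, 0)))
b <- rep(0, p)                         # current iterate b^(m)

y_hat <- drop(plogis(X %*% b))
w     <- y_hat * (1 - y_hat)           # diagonal of W
z     <- (y - y_hat) / w               # working response W^{-1}(y - y_hat)

# (a) Newton step from the gradient and Hessian: -H^{-1} grad
grad <- t(X) %*% (y - y_hat)
H    <- -(t(X) %*% (w * X))            # H = -X^T W X (w * X scales row i by w_i)
step_newton <- drop(solve(-H, grad))

# (b) coefficients of a weighted least squares regression of z on X
step_wls <- unname(coef(lm(z ~ X - 1, weights = w)))

max_abs_diff <- max(abs(step_newton - step_wls))
```

The two steps should agree up to floating-point error, which is the whole content of the "reweighted least squares" name.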

Checking this in R:

set.seed(123)
p <- 5
n <- 500
x <- matrix(rnorm(n * p), n, p)
betas <- runif(p, -2, 2)
hc <- function(x) 1 /(1 + exp(-x)) # inverse canonical link
p.true <- hc(x %*% betas)
y <- rbinom(n, 1, p.true)

# fitting with our procedure
my_IRLS_canonical <- function(x, y, b.init, hc, tol=1e-8) {
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    y.hat <- hc(eta)
    h.prime_eta <- y.hat * (1 - y.hat)
    z <- (y - y.hat) / h.prime_eta

    b.new <- b.old + lm(z ~ x - 1, weights = h.prime_eta)$coef  # WLS regression
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

my_IRLS_canonical(x, y, rep(1,p), hc)
# x1         x2         x3         x4         x5 
# -1.1149687  2.1897992  1.0271298  0.8702975 -1.2074851

glm(y ~ x - 1, family=binomial())$coef
# x1         x2         x3         x4         x5 
# -1.1149687  2.1897992  1.0271298  0.8702975 -1.2074851

and they agree.
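As one example of reusing the final WLS problem, here is a hedged sketch (fresh simulated data, my own names, and an iteration cap that is not in the function above) that reads standard errors off $(X^TWX)^{-1}$ at the converged weights and compares them with what `glm` reports through `vcov()`.

```r
# Standard errors from the final WLS problem vs. those reported by glm().
set.seed(3)
n <- 400; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X %*% c(0.5, -1, 1)))

# canonical-link IRLS, with an iteration cap added as a safeguard
b <- rep(0, p)
for (iter in 1:50) {
  y_hat <- drop(plogis(X %*% b))
  w     <- y_hat * (1 - y_hat)
  z     <- (y - y_hat) / w
  step  <- unname(coef(lm(z ~ X - 1, weights = w)))
  b     <- b + step
  if (sqrt(sum(step^2)) < 1e-10) break
}

# weights at the converged coefficients
y_hat <- drop(plogis(X %*% b))
w     <- y_hat * (1 - y_hat)

# sqrt(diag((X^T W X)^{-1})): inverse Fisher information on the diagonal
se_wls <- sqrt(diag(solve(t(X) %*% (w * X))))

fit    <- glm(y ~ X - 1, family = binomial())
b_glm  <- unname(coef(fit))
se_glm <- unname(sqrt(diag(vcov(fit))))
```

`se_wls` and `se_glm` should match to several digits. Note that the standard errors printed by `summary()` on the last `lm` fit would additionally be scaled by that fit's residual standard error, so they are not directly comparable.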

Non-canonical link functions

Now if we're not using the canonical link we don't get the simplification of $\frac{h'}{h(1-h)} = 1$ in $\nabla \ell$ so $H$ becomes much more complicated, and we therefore see a noticeable difference by using $E(H)$ in our Fisher scoring.

Here's how this will go: we already worked out the general $\nabla \ell$ so the Hessian will be the main difficulty. We need

$\frac{\partial^2 \ell}{\partial b_k \partial b_j} = \sum_i x_{ij} \frac{\partial}{\partial b_k}\left[h'(x_i^T b) \left(\frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right)\right]$

$= \sum_i x_{ij}x_{ik} \left[h''(x_i^T b) \left(\frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2\left(\frac{y_i}{h(x_i^T b)^2} + \frac{1-y_i}{(1-h(x_i^T b))^2} \right)\right]$

Via the linearity of expectation all we need to do to get $E(H)$ is replace each occurrence of $y_i$ with its mean under our model, which is $\mu_i=h(x_i^T\beta)$ . Each term in the summand will therefore contain a factor of the form

$h''(x_i^T b) \left(\frac{h(x_i^T \beta)}{h(x_i^T b)} - \frac{1 - h(x_i^T \beta)}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2\left(\frac{h(x_i^T \beta)}{h(x_i^T b)^2} + \frac{1-h(x_i^T \beta)}{(1-h(x_i^T b))^2} \right).$

But to actually do our optimization we'll need to estimate $\beta$ , and at step $m$ , $b^{(m)}$ is the best guess we have. This means that this will reduce to

$h''(x_i^T b) \left(\frac{h(x_i^T b)}{h(x_i^T b)} - \frac{1 - h(x_i^T b)}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2\left(\frac{h(x_i^T b)}{h(x_i^T b)^2} + \frac{1-h(x_i^T b)}{(1-h(x_i^T b))^2} \right)$

$= - h'(x_i^T b)^2\left(\frac{1}{h(x_i^T b)} + \frac{1}{1-h(x_i^T b)} \right)$

$= -\frac{h'(x_i^T b)^2}{h(x_i^T b)(1-h(x_i^T b))}.$

This means we will use $J$ with

$J_{jk} = -\sum_i x_{ij}x_{ik} \frac{h'(x_i^T b)^2}{h(x_i^T b)(1-h(x_i^T b))}.$

Now let

$W^* = \text{diag}\left(\frac{h'(x_1^T b)^2}{h(x_1^T b)(1-h(x_1^T b))} ,\dots, \frac{h'(x_n^T b)^2}{h(x_n^T b)(1-h(x_n^T b))}\right)$

and note how under the canonical link $h_c' = h_c \cdot (1-h_c)$ reduces $W^*$ to $W$ from the previous section. This lets us write

$J = -X^TW^*X$

except this is now $\hat E(H)$ rather than necessarily being $H$ itself, so this can differ from Newton-Raphson. For all $i$ , $W_{ii}^* > 0$ , so aside from numerical issues $J$ will be negative definite.
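To see the difference between $H$ and $J$ concretely, the sketch below (made-up data and names) evaluates both formulas at an arbitrary point `b` for the probit and logit links. It uses $h''(\eta) = -\eta\,\phi(\eta)$ for the probit link and $h'' = h(1-h)(1-2h)$ for the logit link.

```r
# Observed Hessian H vs. expected Hessian J at an arbitrary point b.
set.seed(4)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)
b <- c(0.4, -0.3, 0.2)
eta <- drop(X %*% b)

# H: sum_i x_i x_i^T [ h''(y/h - (1-y)/(1-h)) - h'^2 (y/h^2 + (1-y)/(1-h)^2) ]
observed_hessian <- function(h, h1, h2) {
  d <- h2 * (y / h - (1 - y) / (1 - h)) - h1^2 * (y / h^2 + (1 - y) / (1 - h)^2)
  t(X) %*% (d * X)
}
# J = -X^T W* X with W*_ii = h'^2 / (h (1 - h))
expected_hessian <- function(h, h1) -(t(X) %*% ((h1^2 / (h * (1 - h))) * X))

# probit: h = pnorm, h' = dnorm, h''(eta) = -eta * dnorm(eta)
H_probit <- observed_hessian(pnorm(eta), dnorm(eta), -eta * dnorm(eta))
J_probit <- expected_hessian(pnorm(eta), dnorm(eta))

# logit: h' = h (1 - h), h'' = h (1 - h) (1 - 2 h)
hl <- plogis(eta)
H_logit <- observed_hessian(hl, hl * (1 - hl), hl * (1 - hl) * (1 - 2 * hl))
J_logit <- expected_hessian(hl, hl * (1 - hl))

gap_probit   <- max(abs(H_probit - J_probit))   # clearly nonzero
gap_logit    <- max(abs(H_logit  - J_logit))    # zero up to rounding
J_is_neg_def <- all(eigen(J_probit, symmetric = TRUE)$values < 0)
```

Under the logit link the two matrices coincide, while under the probit link they differ, which is exactly why Fisher scoring and Newton-Raphson part ways for non-canonical links.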

We have

$\frac{\partial \ell}{\partial b_j} = \sum_i x_{ij} \frac{h'(x_i^T b)}{h(x_i^T b)(1 - h(x_i^T b))}(y_i - h(x_i^T b))$

so letting our new working response be $z^* = D^{-1}(y-\hat y)$ with $D=\text{diag}\left(h'(x_1^T b), \dots, h'(x_n^T b)\right)$ , we have $\nabla \ell = X^TW^*z^*$ .

All together we are iterating

$b^{(m+1)} = b^{(m)} + (X^T W_{(m)}^* X)^{-1}X^T W_{(m)}^* z_{(m)}^*$

so this is still a sequence of WLS regressions except now it's not necessarily Newton-Raphson.

I've written it out this way to emphasize the connection to Newton-Raphson, but frequently people will factor the updates so that each new point $b^{(m+1)}$ is itself the WLS solution, rather than a WLS solution added to the current point $b^{(m)}$ . If we wanted to do this, we can do the following:

$b^{(m+1)} = b^{(m)} + (X^T W_{(m)}^* X)^{-1}X^T W_{(m)}^* z_{(m)}^*$

$= (X^T W_{(m)}^* X)^{-1}\left(X^T W_{(m)}^* Xb^{(m)}+ X^TW^*_{(m)}z_{(m)}^* \right)$

$= (X^T W_{(m)}^* X)^{-1}X^TW_{(m)}^*\left(Xb^{(m)}+ z_{(m)}^* \right)$

so if we're going this way you'll see the working response take the form $\eta^{(m)} + D^{-1}_{(m)}(y - \hat y^{(m)})$ , where $\eta^{(m)} = Xb^{(m)}$ is the current linear predictor, but it's the same thing.
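A minimal sketch of this factored form for the probit link, where each new iterate is itself the WLS solution. The simulated data, the names, and the iteration cap are my own additions rather than part of the derivation.

```r
# IRLS where each iterate is itself a WLS fit (probit link).
set.seed(5)
n <- 400; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, pnorm(X %*% c(0.5, -0.8, 0.3)))

b <- rep(0, p)
for (iter in 1:50) {                        # iteration cap added as a safeguard
  eta   <- drop(X %*% b)
  y_hat <- pnorm(eta)
  d     <- dnorm(eta)                       # diagonal of D, i.e. h'(eta)
  w     <- d^2 / (y_hat * (1 - y_hat))      # diagonal of W*
  z     <- eta + (y - y_hat) / d            # working response eta + D^{-1}(y - y_hat)
  b_new <- unname(coef(lm(z ~ X - 1, weights = w)))
  done  <- sqrt(sum((b_new - b)^2)) < 1e-10
  b     <- b_new
  if (done) break
}

b_glm <- unname(coef(glm(y ~ X - 1, family = binomial(link = "probit"))))
```

Compared with the additive form, the only change is that `eta` is added to the working response and the regression output replaces `b` instead of being added to it.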

Let's confirm that this works by using it to perform a probit regression on the same simulated data as before (and this is not the canonical link, so we need this more general form of IRLS).

my_IRLS_general <- function(x, y, b.init, h, h.prime, tol=1e-8) {
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    y.hat <- h(eta)
    h.prime_eta <- h.prime(eta)
    w_star <- h.prime_eta^2 / (y.hat * (1 - y.hat))
    z_star <- (y - y.hat) / h.prime_eta

    b.new <- b.old + lm(z_star ~ x - 1, weights = w_star)$coef  # WLS

    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

# probit inverse link and derivative
h_probit <- function(x) pnorm(x, 0, 1)
h.prime_probit <- function(x) dnorm(x, 0, 1)

my_IRLS_general(x, y, rep(0,p), h_probit, h.prime_probit)
# x1         x2         x3         x4         x5 
# -0.6456508  1.2520266  0.5820856  0.4982678 -0.6768585 

glm(y~x-1, family=binomial(link="probit"))$coef
# x1         x2         x3         x4         x5 
# -0.6456490  1.2520241  0.5820835  0.4982663 -0.6768581

and again the two agree.

Comments on convergence

Finally, a few quick comments on convergence (I'll keep this brief as this is getting really long and I'm no expert at optimization). Even though theoretically each $J_{(m)}$ is negative definite, bad initial conditions can still prevent this algorithm from converging. In the probit example above, changing the initial conditions to b.init=rep(1,p) results in this, and that doesn't even look like a suspicious initial condition. If you step through the IRLS procedure with that initialization and these simulated data, by the second time through the loop there are some $\hat y_i$ that round to exactly $1$ and so the weights become undefined. If we're using the canonical link in the algorithm I gave we won't ever be dividing by $\hat y_i (1 - \hat y_i)$ to get undefined weights, but if we've got a situation where some $\hat y_i$ are approaching $0$ or $1$ , such as in the case of perfect separation, then we'll still get non-convergence as the gradient dies without us reaching anything.
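The saturation problem is easy to reproduce in isolation. In the sketch below the value `eta <- 10` is just an illustrative choice, and the clamping at the end is one possible guard that I am assuming for illustration; it is not part of the algorithms given above.

```r
# How a fitted probability saturates in double precision (probit link).
eta    <- 10                      # a large linear predictor value
y_hat  <- pnorm(eta)              # rounds to exactly 1 in double precision
w_star <- dnorm(eta)^2 / (y_hat * (1 - y_hat))   # division by zero

saturated   <- (y_hat == 1)
w_is_finite <- is.finite(w_star)

# One possible guard (an assumption, not the answer's method):
# keep fitted probabilities a small distance away from 0 and 1.
eps <- 1e-10
y_hat_clamped <- pmin(pmax(y_hat, eps), 1 - eps)
w_clamped     <- dnorm(eta)^2 / (y_hat_clamped * (1 - y_hat_clamped))
```

Clamping keeps the weight finite, but it does not fix the underlying issue that the gradient contribution of such observations is essentially gone.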

— jld

+1. I love how detailed your answers often are.

— amoeba says Reinstate Monica

You stated "the coefficient estimates from this converge on a maximum of the logistic regression likelihood." Is that necessarily so, from any initial values?

— Mark L. Stone

@MarkL.Stone ah I was being too casual there, didn't mean to offend the optimization people :) I'll add some more details (and would appreciate your thoughts on them when I do)

— jld

Any chance you watched the link I posted? It seems that video is talking from a machine learning perspective, just optimizing the logistic loss, without talking about the Hessian expectation?

— Haitao Du

@hxd1011 in that pdf i linked to (link again: sagepub.com/sites/default/files/upm-binaries/…) on page 24 of it the author goes into the theory and explains what exactly makes a link function canonical. I found that pdf extremely helpful when I first came across this (although it took me a while to get through).

— jld