Ejemplo cuando se usa la precisión como medida de resultado conducirá a una conclusión incorrecta

Estoy buscando varias medidas de rendimiento para modelos predictivos. Se escribió mucho sobre problemas de uso de precisión, en lugar de algo más continuo para evaluar el rendimiento del modelo. Frank Harrell http://www.fharrell.com/post/class-damage/ proporciona un ejemplo cuando agregar una variable informativa a un modelo conducirá a una caída en la precisión, claramente contraintuitiva y una conclusión errónea.

Sin embargo, en este caso, esto parece ser causado por tener clases desequilibradas y, por lo tanto, puede resolverse simplemente usando precisión equilibrada ((sens + spec) / 2). ¿Hay algún ejemplo en el que el uso de la precisión en un conjunto de datos equilibrado conduzca a conclusiones claramente erróneas o contradictorias?

Editar

Estoy buscando algo donde la precisión disminuirá incluso cuando el modelo sea claramente mejor, o que el uso de la precisión conduzca a una selección falsa positiva de algunas características. Es fácil hacer ejemplos falsos negativos, donde la precisión es la misma para dos modelos donde uno es claramente mejor usando otros criterios.

machine-learning predictive-models accuracy

— rep_ho
fuente

Relacionado, pero no un duplicado: ¿Por qué la precisión no es la mejor medida para evaluar los modelos de clasificación?

— Stephan Kolassa

La publicación de Stephan vinculada anteriormente es un recurso maravilloso y tiene todo lo que necesita. Su paso inicial de asumir que se necesita una clasificación forzada (decisión prematura) ha llevado a un problema. Y (sens + spec) / 2 no es una puntuación de precisión adecuada; optimizarlo conducirá a elegir las características incorrectas y darles los pesos incorrectos, sin mencionar ignorar la información útil que proviene de las probabilidades, como las zonas de "no decisión".

— Frank Harrell

Entiendo el valor de las predicciones probabilísticas, pero estoy buscando ejemplos en los que estas cosas malas que mencionó suceden realmente para datos equilibrados o precisión equilibrada.

— rep_ho

Por cierto: según Gneiting & Raftery 2007, la precisión es adecuada (aunque no estrictamente adecuada). algún comentario sobre eso? amstat.tandfonline.com/doi/abs/10.1198/…

— rep_ho

Espero que mi respuesta sea útil. (@FrankHarrell: cualquier comentario sería bienvenido). La precisión equilibrada tampoco será útil aquí. Con respecto a la precisión como una regla de puntuación adecuada pero no estrictamente adecuada: puede interesarle ¿Es la precisión una regla de puntuación incorrecta en una configuración de clasificación binaria? Si ese hilo no responde a su pregunta, considere hacer una nueva pregunta (y señalar por qué el hilo existente no es una respuesta, por lo que no está cerrado).

— Stephan Kolassa

Respuestas:

Voy a engañar

Específicamente, he argumentado a menudo (por ejemplo, aquí ) que la parte estadística del modelado y la predicción se extiende solo a hacer predicciones probabilísticas para los miembros de la clase (o dar densidades predictivas, en el caso del pronóstico numérico). Tratar una instancia específica como si perteneciera a una clase específica (o predicciones puntuales en el caso numérico), ya no es una estadística adecuada. Es parte del aspecto teórico de la decisión .

Y las decisiones no solo deben basarse en la predicción probabilística, sino también en los costos de las clasificaciones erróneas y en una serie de otras posibles acciones . Por ejemplo, incluso si solo tiene dos clases posibles, "enfermo" versus "saludable", podría tener una amplia gama de acciones posibles dependiendo de la probabilidad de que un paciente sufra la enfermedad, de enviarlo a casa porque es casi seguro que es saludable, darle dos aspirinas, realizar pruebas adicionales, llamar inmediatamente a una ambulancia y ponerlo en soporte vital.

Evaluar la precisión presupone tal decisión. La precisión como medida de evaluación para la clasificación es un error de categoría .

Entonces, para responder a su pregunta, caminaré por el camino de ese error de categoría. Consideraremos un escenario simple con clases equilibradas donde la clasificación sin tener en cuenta los costos de clasificación errónea nos inducirá a error.

Supongamos que una epidemia de Gutrot maligno corre desenfrenada en la población. Afortunadamente, podemos examinar a todos fácilmente por algún rasgo $t$ ( $0\leq t \leq 1$ ), y sabemos que la probabilidad de desarrollar MG depende linealmente de $t$ , $p=\gamma t$ para algún parámetro $\gamma$ ( $0\leq \gamma \leq 1$ ) El rasgo $t$ se distribuye uniformemente en la población.

Afortunadamente, hay una vacuna. Desafortunadamente, es costoso y los efectos secundarios son muy incómodos. (Dejaré que tu imaginación proporcione los detalles). Sin embargo, son mejores que sufrir de MG.

En aras de la abstracción, considero que en realidad solo hay dos posibles cursos de acción para un paciente dado, dado su valor de rasgo $t$ : vacunar o no vacunar.

Por lo tanto, la pregunta es: ¿cómo debemos decidir a quién vacunar y a quién no? $t$ ? Seremos utilitarios al respecto y aspiraremos a tener los costos totales esperados más bajos. Es obvio que esto se reduce a elegir un umbral $\theta$ y vacunar a todos con $t\geq\theta$ .

El modelo y la decisión 1 dependen de la precisión. Montar un modelo. Afortunadamente, ya conocemos el modelo. Elige el umbral $\theta$ que maximiza la precisión al clasificar pacientes y vacunar a todos con $t\geq \theta$ . Fácilmente vemos eso $\theta=\frac{1}{2\gamma}$ es el número mágico, todos con $t\geq \theta$ tiene una mayor probabilidad de contraer MG que no, y viceversa, por lo que este umbral de probabilidad de clasificación maximizará la precisión. Asumiendo clases balanceadas, $\gamma=1$ , vamos a vacunar a la mitad de la población. Curiosamente, si $\gamma<\frac{1}{2}$ , no vamos a vacunar a nadie . (Nos interesan principalmente las clases equilibradas, por lo que debemos dejar de lado que solo dejamos que una parte de la población muera como una muerte dolorosa horrible).

Huelga decir que esto no tiene en cuenta los costos diferenciales de la clasificación errónea.

El modelo y la decisión 2 aprovechan nuestra predicción probabilística ("dado su rasgo $t$ , su probabilidad de contraer MG es $\gamma t$ ") y la estructura de costos.

Primero, aquí hay un pequeño gráfico. El eje horizontal da el rasgo, el eje vertical la probabilidad de MG. El triángulo sombreado da la proporción de la población que contraerá MG. La línea vertical da algo de particular $\theta$ . La línea discontinua horizontal en $\gamma\theta$ hará que los cálculos a continuación sean un poco más fáciles de seguir. Asumimos $\gamma>\frac{1}{2}$ , solo para hacer la vida más fácil.

Démosle nombres a nuestros costos y calculemos sus contribuciones al costo total esperado, dado $\theta$ y $\gamma$ (y el hecho de que el rasgo se distribuye uniformemente en la población).

Dejar $c^+_+$ denote el costo para un paciente que está vacunado y habría contraído MG. Dado $\theta$ , la proporción de la población que incurre en este costo es el trapecio sombreado en la parte inferior derecha con área $(1 - θ) γ θ + \frac{1}{2} (1 - θ) (γ - γ θ) .$ $(1-\theta)\gamma\theta + \frac{1}{2}(1-\theta)(\gamma-\gamma\theta).$
Dejar $c^-_+$ denotar el costo para un paciente que está vacunado y no habría contraído MG. Dado $\theta$ , la proporción de la población que incurre en este costo es el trapecio sin sombrear en la parte superior derecha con área $(1 - θ) (1 - γ) + \frac{1}{2} (1 - θ) (γ - γ θ) .$ $(1-\theta)(1-\gamma) + \frac{1}{2}(1-\theta)(\gamma-\gamma\theta).$
Dejar $c^-_-$ denote el costo para un paciente que no está vacunado y no habría contraído MG. Dado $\theta$ , la proporción de la población que incurre en este costo es el trapecio sin sombrear en la parte superior izquierda con área $θ (1 - γ θ) + \frac{1}{2} θ γ θ .$ $\theta(1-\gamma\theta) + \frac{1}{2}\theta\gamma\theta.$
Dejar $c^+_-$ denote el costo para un paciente que no está vacunado y que habría contraído MG. Dado $\theta$ , la proporción de la población que incurre en este costo es el triángulo sombreado en la parte inferior izquierda con área $\frac{1}{2} θ γ θ .$ $\frac{1}{2}\theta\gamma\theta.$

(En cada trapecio, primero calculo el área del rectángulo, luego agrego el área del triángulo).

Los costos totales esperados son

c_{+}^{+} ((1 - θ) γ θ + \frac{1}{2} (1 - θ) (γ - γ θ)) + c_{+}^{-} ((1 - θ) (1 - γ) + \frac{1}{2} (1 - θ) (γ - γ θ)) + c_{-}^{-} (θ (1 - γ θ) + \frac{1}{2} θ γ θ) + c_{-}^{+} \frac{1}{2} θ γ θ .

$c^+_+\bigg((1-\theta)\gamma\theta + \frac{1}{2}(1-\theta)(\gamma-\gamma\theta)\bigg) + c^-_+\bigg((1-\theta)(1-\gamma) + \frac{1}{2}(1-\theta)(\gamma-\gamma\theta)\bigg) + c^-_-\bigg(\theta(1-\gamma\theta) + \frac{1}{2}\theta\gamma\theta\bigg) + c^+_-\frac{1}{2}\theta\gamma\theta.$

Al diferenciar y establecer la derivada en cero, obtenemos que los costos esperados se minimizan al

θ^{*} = \frac{c_{+}^{-} - c_{-}^{-}}{γ (c_{-}^{+} + c_{+}^{-} - c_{+}^{+} - c_{-}^{-})} .

$\theta^\ast = \frac{c^-_+-c^-_-}{\gamma(c^+_-+c^-_+-c^+_+-c^-_-)}.$

Esto es solo igual al valor de maximización de precisión de $\theta$ para una estructura de costos muy específica, es decir, si y solo si

\frac{1}{2 γ} = \frac{c_{+}^{-} - c_{-}^{-}}{γ (c_{-}^{+} + c_{+}^{-} - c_{+}^{+} - c_{-}^{-})},

$\frac{1}{2\gamma} = \frac{c^-_+-c^-_-}{\gamma(c^+_-+c^-_+-c^+_+-c^-_-)},$ o

\frac{1}{2} = \frac{c_{+}^{-} - c_{-}^{-}}{c_{-}^{+} + c_{+}^{-} - c_{+}^{+} - c_{-}^{-}} .

$\frac{1}{2} = \frac{c^-_+-c^-_-}{c^+_-+c^-_+-c^+_+-c^-_-}.$

Como ejemplo, supongamos que $\gamma=1$ para clases equilibradas y que los costos son

c_{+}^{+} = 1, c_{+}^{-} = 2, c_{-}^{+} = 10, c_{-}^{-} = 0.

$c^+_+ = 1, \quad c^-_+=2, \quad c^+_-=10, \quad c^-_-=0.$ Entonces la precisión maximizando

θ = \frac{1}{2}

$\theta=\frac{1}{2}$ rendirá los costos esperados de

1.875

$1.875$ , mientras que el costo minimiza

θ = \frac{2}{11}

$\theta=\frac{2}{11}$ rendirá los costos esperados de

1.318

$1.318$ .

En este ejemplo, basar nuestras decisiones en clasificaciones no probabilísticas que maximizan la precisión condujo a más vacunas y costos más altos que usar una regla de decisión que usara explícitamente las estructuras de costos diferenciales en el contexto de una predicción probabilística.

En pocas palabras: la precisión es solo un criterio de decisión válido si

Existe una relación uno a uno entre las clases y las posibles acciones.
y los costos de las acciones aplicadas a las clases siguen una estructura muy específica.

En el caso general, evaluar la precisión hace una pregunta incorrecta, y maximizar la precisión es un error llamado tipo III: proporcionar la respuesta correcta a la pregunta incorrecta.

Código R:

rm(list=ls())
gamma <- 0.7

cost_treated_positive <- 1          # cost of treatment, side effects unimportant
cost_treated_negative <- 2          # cost of treatment, side effects unnecessary
cost_untreated_positive <- 10       # horrible, painful death
cost_untreated_negative <- 0        # nothing

expected_cost <- function ( theta ) {
    cost_treated_positive * ( (1-theta)*theta*gamma + (1-theta)*(gamma-gamma*theta)/2 ) +
    cost_treated_negative * ( (1-theta)*(1-gamma) + (1-theta)*(gamma-gamma*theta)/2 ) +
    cost_untreated_negative *( theta*(1-gamma*theta) + theta*gamma*theta/2 ) +
    cost_untreated_positive * theta*gamma*theta/2
}

(theta <- optim(par=0.5,fn=expected_cost,lower=0,upper=1,method="L-BFGS-B")$par)
(cost_treated_negative-cost_untreated_negative)/
    (gamma*(cost_treated_negative+cost_untreated_positive-cost_treated_positive-cost_untreated_negative))

plot(c(0,1),c(0,1),type="n",bty="n",xaxt="n",xlab="Trait t",yaxt="n",ylab="MG probability")
rect(0,0,1,1)
axis(1,c(0,theta,1),c(0,"theta",1),lty=0,line=-1)
axis(2,c(0,1),lty=0,line=-1,las=1)
axis(4,c(0,gamma,1),c(0,"gamma",1),lty=0,line=-1.8,las=1)
polygon(c(0,1,1),c(0,0,gamma),col="lightgray")
abline(v=theta,col="red",lwd=2)
abline(h=gamma*theta,lty=2,col="red",lwd=2)

expected_cost(1/(2*gamma))
expected_cost(theta)

— Stephan Kolassa
fuente

¡Buena publicación! ¡Acabo de iniciar sesión para expresar mi agradecimiento!

— Wolfone

"pero también en los costos de las clasificaciones erróneas" No creo que sea cierto: tal como lo muestra su cálculo, también (¡sorprendentemente!) ¡también depende del costo de las clasificaciones correctas !

— Tamas Ferenci

No quería editar su excelente respuesta tan profundamente, pero tal vez sería instructivo notar que el umbral óptimo depende de los cuatro costos, pero solo a través de

c_{d}^{+} = c_{-}^{+} - c_{+}^{+}

$c_d^+=c_-^+-c_+^+$ y

c_{d}^{-} = c_{+}^{-} - c_{-}^{-}

$c_d^-=c_+^--c_-^-$ :

θ^{*} = \frac{c_{d}^{-}}{γ (c_{d}^{-} + c_{d}^{+})}

$\theta^\ast=\frac{c_d^-}{\gamma\left(c_d^-+c_d^+\right)}$ .

— Tamas Ferenci

También podríamos hacer una trama de esto:

levelplot( thetastar ~ cdminus + cdplus, data = data.table( expand.grid( cdminus = seq( 0, 10, 0.01 ), cdplus = seq( 0, 10, 0.01 ) ) )[ , .( cdminus, cdplus, thetastar = cdminus/(cdminus + cdplus) ) ] )

— Tamas Ferenci

"El umbral óptimo depende de los cuatro costos, pero solo a través de" Más específicamente, solo a través de su relación, lo siento:

θ^{*} = \frac{1 / γ}{1 + \frac{c_{d}^{+}}{c_{d}^{-}}}

$\theta^{\ast}=\frac{1/\gamma}{1+\frac{c_d^+}{c_d^-}}$ .

— Tamas Ferenci

Podría valer la pena agregar otro ejemplo, quizás más directo, a la excelente respuesta de Stephen.

Consideremos una prueba médica, cuyo resultado se distribuye normalmente, tanto en personas enfermas como en personas sanas, con diferentes parámetros, por supuesto (pero por simplicidad, supongamos homoscedasticidad, es decir, que la varianza es la misma):

\begin{matrix} T ∣ D ⊖ \sim N (μ_{-}, σ^{2}) \\ T ∣ D \oplus \sim N (μ_{+}, σ^{2}) \end{matrix} .

$\begin{gather*}T \mid D \ominus \sim \mathcal{N}\left(\mu_{-},\sigma^2\right)\\T \mid D \oplus \sim \mathcal{N}\left(\mu_{+},\sigma^2\right)\end{gather*}.$ Denotemos la prevalencia de la enfermedad con

p

$p$ (es decir

D \oplus \sim B e r n (p)

$D\oplus\sim Bern\left(p\right)$ ), por lo que esto, junto con lo anterior, que son distribuciones esencialmente condicionales, especifica completamente la distribución conjunta.

Así, la matriz de confusión con umbral $b$ (es decir, aquellos con resultados de pruebas anteriores $b$ están clasificados como enfermos) es

(\begin{matrix} D \oplus & D ⊖ \\ T \oplus & p (1 - Φ_{+} (b)) & (1 - p) (1 - Φ_{-} (b)) \\ T ⊖ & p Φ_{+} (b) & (1 - p) Φ_{-} (b) \end{matrix}) .

$\begin{pmatrix} & D\oplus & D\ominus\\ T\oplus & p\left(1-\Phi_{+}\left(b\right)\right) & \left(1-p\right)\left(1-\Phi_{-}\left(b\right)\right)\\ T\ominus & p\Phi_{+}\left(b\right) & \left(1-p\right)\Phi_{-}\left(b\right)\\ \end{pmatrix}.$

Enfoque basado en la precisión

La precisión es

p (1 - Φ_{+} (b)) + (1 - p) Φ_{-} (b),

$p\left(1-\Phi_{+}\left(b\right)\right)+\left(1-p\right)\Phi_{-}\left(b\right),$

tomamos su derivada wrt $b$ , póngalo igual a 0, multiplique con $\sqrt{1\pi\sigma^2}$ y reorganizar un poco:

\begin{matrix} - p φ_{+} (b) + φ_{-} (b) - p φ_{-} (b) = 0 \\ e^{- \frac{{(b - μ_{-})}^{2}}{2 σ^{2}}} [(1 - p) - p e^{- \frac{2 b (μ_{-} - μ_{+}) + (μ_{+}^{2} - μ_{-}^{2})}{2 σ^{2}}}] = 0 \end{matrix}

$\begin{gather*} -p\varphi_{+}\left(b\right)+\varphi_{-}\left(b\right)-p\varphi_{-}\left(b\right)=0\\ e^{-\frac{\left(b-\mu_{-}\right)^2}{2\sigma^2}}\left[\left(1-p\right)-pe^{-\frac{2b\left(\mu_{-}-\mu_{+}\right)+\left(\mu_{+}^2-\mu_{-}^2\right)}{2\sigma^2}}\right]=0\end{gather*}$ El primer término no puede ser cero, por lo que la única forma en que el producto puede ser cero es si el segundo término es cero:

\begin{matrix} (1 - p) - p e^{- \frac{2 b (μ_{-} - μ_{+}) + (μ_{+}^{2} - μ_{-}^{2})}{2 σ^{2}}} = 0 \\ - \frac{2 b (μ_{-} - μ_{+}) + (μ_{+}^{2} - μ_{-}^{2})}{2 σ^{2}} = \log \frac{1 - p}{p} \\ 2 b (μ_{+} - μ_{-}) + (μ_{-}^{2} - μ_{+}^{2}) = 2 σ^{2} \log \frac{1 - p}{p} \end{matrix}

$\begin{gather*}\left(1-p\right)-pe^{-\frac{2b\left(\mu_{-}-\mu_{+}\right)+\left(\mu_{+}^2-\mu_{-}^2\right)}{2\sigma^2}}=0\\-\frac{2b\left(\mu_{-}-\mu_{+}\right)+\left(\mu_{+}^2-\mu_{-}^2\right)}{2\sigma^2}=\log\frac{1-p}{p}\\ 2b\left(\mu_{+}-\mu_{-}\right)+\left(\mu_{-}^2-\mu_{+}^2\right)=2\sigma^2\log\frac{1-p}{p}\\ \end{gather*}$ So the solution is

b^{*} = \frac{(μ_{+}^{2} - μ_{-}^{2}) + 2 σ^{2} \log \frac{1 - p}{p}}{2 (μ_{+} - μ_{-})} = \frac{μ_{+} + μ_{-}}{2} + \frac{σ^{2}}{μ_{+} - μ_{-}} \log \frac{1 - p}{p} .

$b^{\ast}=\frac{\left(\mu_{+}^2-\mu_{-}^2\right)+2\sigma^2\log\frac{1-p}{p}}{2\left(\mu_{+}-\mu_{-}\right)}=\frac{\mu_{+}+\mu_{-}}{2}+\frac{\sigma^2}{\mu_{+}-\mu_{-}}\log\frac{1-p}{p}.$

Note that this - of course - doesn't depend on the costs.

If the classes are balanced, the optimum is the average of the mean test values in sick and healthy people, otherwise it is displaced based on the imbalance.

Cost-based approach

Using Stephen's notation, the expected overall cost is

c_{+}^{+} p (1 - Φ_{+} (b)) + c_{+}^{-} (1 - p) (1 - Φ_{-} (b)) + c_{-}^{+} p Φ_{+} (b) + c_{-}^{-} (1 - p) Φ_{-} (b) .

$c_{+}^{+}p\left(1-\Phi_{+}\left(b\right)\right) + c_{+}^{-}\left(1-p\right)\left(1-\Phi_{-}\left(b\right)\right) + c_{-}^{+} p\Phi_{+}\left(b\right) + c_{-}^{-} \left(1-p\right)\Phi_{-}\left(b\right).$ Take its derivate w.r.t

b

$b$ and set it equal to zero:

\begin{matrix} - c_{+}^{+} p φ_{+} (b) - c_{+}^{-} (1 - p) φ_{-} (b) + c_{-}^{+} p φ_{+} (b) + c_{-}^{-} (1 - p) φ_{-} (b) = \\ = φ_{+} (b) p (c_{-}^{+} - c_{+}^{+}) + φ_{-} (b) (1 - p) (c_{-}^{-} - c_{+}^{-}) = \\ = φ_{+} (b) p c_{d}^{+} - φ_{-} (b) (1 - p) c_{d}^{-} = 0, \end{matrix}

$\begin{gather*} -c_{+}^{+} p \varphi_{+}\left(b\right)-c_{+}^{-}\left(1-p\right)\varphi_{-}\left(b\right)+c_{-}^{+}p\varphi_{+}\left(b\right)+c_{-}^{-}\left(1-p\right)\varphi_{-}\left(b\right)=\\ =\varphi_{+}\left(b\right)p\left(c_{-}^{+}-c_{+}^{+}\right)+\varphi_{-}\left(b\right)\left(1-p\right)\left(c_{-}^{-}-c_{+}^{-}\right)=\\ = \varphi_{+}\left(b\right)pc_d^{+}-\varphi_{-}\left(b\right)\left(1-p\right)c_d^{-}= 0,\end{gather*}$ using the notation I introduced in my comments below Stephen's answer, i.e.,

c_{d}^{+} = c_{-}^{+} - c_{+}^{+}

$c_d^{+}=c_{-}^{+}-c_{+}^{+}$ and

c_{d}^{-} = c_{+}^{-} - c_{-}^{-}

$c_d^{-}=c_{+}^{-}-c_{-}^{-}$ .

The optimal threshold is therefore given by the solution of the equation

\frac{φ_{+} (b)}{φ_{-} (b)} = \frac{(1 - p) c_{d}^{-}}{p c_{d}^{+}} .

$\boxed{\frac{\varphi_{+}\left(b\right)}{\varphi_{-}\left(b\right)}=\frac{\left(1-p\right)c_d^{-}}{pc_d^{+}}}.$ Two things should be noted here:

This results is totally generic and works for any distribution of the test results, not only normal. ( $\varphi$ in that case of course means the probability density function of the distribution, not the normal density.)
Whatever the solution for $b$ is, it is surely a function of $\frac{\left(1-p\right)c_d^{-}}{pc_d^{+}}$ . (I.e., we immediately see how costs matter - in addition to class imbalance!)

I'd be really interested to see if this equation has a generic solution for $b$ (parametrized by the $\varphi$ s), but I would be surprised.

Nevertheless, we can work it out for normal! $\sqrt{2\pi\sigma^2}$ s cancel on the left hand side, so we have

\begin{matrix} e^{- \frac{1}{2} (\frac{{(b - μ_{+})}^{2}}{σ^{2}} - \frac{{(b - μ_{-})}^{2}}{σ^{2}})} = \frac{(1 - p) c_{d}^{-}}{p c_{d}^{+}} \\ {(b - μ_{-})}^{2} - {(b - μ_{+})}^{2} = 2 σ^{2} \log \frac{(1 - p) c_{d}^{-}}{p c_{d}^{+}} \\ 2 b (μ_{+} - μ_{-}) + (μ_{-}^{2} - μ_{+}^{2}) = 2 σ^{2} \log \frac{(1 - p) c_{d}^{-}}{p c_{d}^{+}} \end{matrix}

$\begin{gather*} e^{-\frac{1}{2}\left(\frac{\left(b-\mu_{+}\right)^2}{\sigma^2}-\frac{\left(b-\mu_{-}\right)^2}{\sigma^2}\right)}=\frac{\left(1-p\right)c_d^{-}}{pc_d^{+}} \\ \left(b-\mu_{-}\right)^2-\left(b-\mu_{+}\right)^2 =2\sigma^2 \log \frac{\left(1-p\right)c_d^{-}}{pc_d^{+}} \\ 2b\left(\mu_{+}-\mu_{-}\right)+\left(\mu_{-}^2-\mu_{+}^2\right) =2\sigma^2 \log \frac{\left(1-p\right)c_d^{-}}{pc_d^{+}}\end{gather*}$ therefore the solution is

b^{*} = \frac{(μ_{+}^{2} - μ_{-}^{2}) + 2 σ^{2} \log \frac{(1 - p) c_{d}^{-}}{p c_{d}^{+}}}{2 (μ_{+} - μ_{-})} = \frac{μ_{+} + μ_{-}}{2} + \frac{σ^{2}}{μ_{+} - μ_{-}} \log \frac{(1 - p) c_{d}^{-}}{p c_{d}^{+}} .

$b^{\ast}=\frac{\left(\mu_{+}^2-\mu_{-}^2\right)+2\sigma^2 \log \frac{\left(1-p\right)c_d^{-}}{pc_d^{+}}}{2\left(\mu_{+}-\mu_{-}\right)}=\frac{\mu_{+}+\mu_{-}}{2}+\frac{\sigma^2}{\mu_{+}-\mu_{-}}\log \frac{\left(1-p\right)c_d^{-}}{pc_d^{+}}.$

(Compare it the the previous result! We see that they are equal if and only if $c_d^{-}=c_d^{+}$ , i.e. the differences in misclassification cost compared to the cost of correct classification is the same in sick and healthy people.)

A short demonstration

Let's say $c_{-}^{-}=0$ (it is quite natural medically), and that $c_{+}^{+}=1$ (we can always obtain it by dividing the costs with $c_{+}^{+}$ , i.e., by measuring every cost in $c_{+}^{+}$ units). Let's say that the prevalence is $p=0.2$ . Also, let's say that $\mu_{-}=9.5$ , $\mu_{+}=10.5$ and $\sigma=1$ .

In this case:

library( data.table )
library( lattice )

cminusminus <- 0
cplusplus <- 1
p <- 0.2
muminus <- 9.5
muplus <- 10.5
sigma <- 1

res <- data.table( expand.grid( b = seq( 6, 17, 0.1 ),
                                cplusminus = c( 1, 5, 10, 50, 100 ),
                                cminusplus = c( 2, 5, 10, 50, 100 ) ) )
res$cost <- cplusplus*p*( 1-pnorm( res$b, muplus, sigma ) ) +
  res$cplusminus*(1-p)*(1-pnorm( res$b, muminus, sigma ) ) +
  res$cminusplus*p*pnorm( res$b, muplus, sigma ) +
  cminusminus*(1-p)*pnorm( res$b, muminus, sigma )

xyplot( cost ~ b | factor( cminusplus ), groups = cplusminus, ylim = c( -1, 22 ),
        data = res, type = "l", xlab = "Threshold",
        ylab = "Expected overall cost", as.table = TRUE,
        abline = list( v = (muplus+muminus)/2+
                         sigma^2/(muplus-muminus)*log((1-p)/p) ),
        strip = strip.custom( var.name = expression( {"c"^{"+"}}["-"] ),
                              strip.names = c( TRUE, TRUE ) ),
        auto.key = list( space = "right", points = FALSE, lines = TRUE,
                         title = expression( {"c"^{"-"}}["+"] ) ),
        panel = panel.superpose, panel.groups = function( x, y, col.line, ... ) {
          panel.xyplot( x, y, col.line = col.line, ... )
          panel.points( x[ which.min( y ) ], min( y ), pch = 19, col = col.line )
        } )

The result is (points depict the minimum cost, and the vertical line shows the optimal threshold with the accuracy-based approach):

We can very nicely see how cost-based optimum can be different than the accuracy-based optimum. It is instructive to think over why: if it is more costly to classify a sick people erroneously healthy than the other way around ( $c_{-}^{+}$ is high, $c_{+}^{-}$ is low) than the threshold goes down, as we prefer to classify more easily into the category sick, on the other hand, if it is more costly to classify a healthy people erroneously sick than the other way around ( $c_{-}^{+}$ is low, $c_{+}^{-}$ is high) than the threshold goes up, as we prefer to classify more easily into the category healthy. (Check these on the figure!)

A real-life example

Let's have a look at an empirical example, instead of a theoretical derivation. This example will be different basically from two aspects:

Instead of assuming normality, we will simply use the empirical data without any such assumption.
Instead of using one single test, and its results in its own units, we will use several tests (and combine them with a logistic regression). Threshold will be given to the final predicted probability. This is actually the preferred approach, see Chapter 19 - Diagnosis - in Frank Harrell's BBR.

The dataset (acath from the package Hmisc) is from the Duke University Cardiovascular Disease Databank, and contains whether the patient had significant coronary disease, as assessed by cardiac catheterization, this will be our gold standard, i.e., the true disease status, and the "test" will be the combination of the subject's age, sex, cholesterol level and duration of symptoms:

library( rms )
library( lattice )
library( latticeExtra )
library( data.table )

getHdata( "acath" )
acath <- acath[ !is.na( acath$choleste ), ]
dd <- datadist( acath )
options( datadist = "dd" )

fit <- lrm( sigdz ~ rcs( age )*sex + rcs( choleste ) + cad.dur, data = acath )

It worth plotting the predicted risks on logit-scale, to see how normal they are (essentially, that was what we assumed previously, with one single test!):

densityplot( ~predict( fit ), groups = acath$sigdz, plot.points = FALSE, ref = TRUE,
             auto.key = list( columns = 2 ) )

Well, they're hardly normal...

Let's go on and calculate the expected overall cost:

ExpectedOverallCost <- function( b, p, y, cplusminus, cminusplus,
                                 cplusplus = 1, cminusminus = 0 ) {
  sum( table( factor( p>b, levels = c( FALSE, TRUE ) ), y )*matrix(
    c( cminusminus, cplusminus, cminusplus, cplusplus ), nc = 2 ) )
}

table( predict( fit, type = "fitted" )>0.5, acath$sigdz )

ExpectedOverallCost( 0.5, predict( fit, type = "fitted" ), acath$sigdz, 2, 4 )

And let's plot it for all possible costs (a computational note: we don't need to mindlessly iterate through numbers from 0 to 1, we can perfectly reconstruct the curve by calculating it for all unique values of predicted probabilities):

ps <- sort( unique( c( 0, 1, predict( fit, type = "fitted" ) ) ) )

xyplot( sapply( ps, ExpectedOverallCost,
                p = predict( fit, type = "fitted" ), y = acath$sigdz,
                cplusminus = 2, cminusplus = 4 ) ~ ps, type = "l", xlab = "Threshold",
        ylab = "Expected overall cost", panel = function( x, y, ... ) {
          panel.xyplot( x, y, ... )
          panel.points( x[ which.min( y ) ], min( y ), pch = 19, cex = 1.1 )
          panel.text( x[ which.min( y ) ], min( y ), round( x[ which.min( y ) ], 3 ),
                      pos = 3 )
        } )

We can very well see where we should put the threshold to optimize the expected overall cost (without using sensitivity, specificity or predictive values anywhere!). This is the correct approach.

It is especially instructive to contrast these metrics:

ExpectedOverallCost2 <- function( b, p, y, cplusminus, cminusplus,
                                  cplusplus = 1, cminusminus = 0 ) {
  tab <- table( factor( p>b, levels = c( FALSE, TRUE ) ), y )
  sens <- tab[ 2, 2 ] / sum( tab[ , 2 ] )
  spec <- tab[ 1, 1 ] / sum( tab[ , 1 ] )
  c( `Expected overall cost` = sum( tab*matrix( c( cminusminus, cplusminus, cminusplus,
                                                   cplusplus ), nc = 2 ) ),
     Sensitivity = sens,
     Specificity = spec,
     PPV = tab[ 2, 2 ] / sum( tab[ 2, ] ),
     NPV = tab[ 1, 1 ] / sum( tab[ 1, ] ),
     Accuracy = 1 - ( tab[ 1, 1 ] + tab[ 2, 2 ] )/sum( tab ),
     Youden = 1 - ( sens + spec - 1 ),
     Topleft = ( 1-sens )^2 + ( 1-spec )^2
  )
}

ExpectedOverallCost2( 0.5, predict( fit, type = "fitted" ), acath$sigdz, 2, 4 )

res <- melt( data.table( ps, t( sapply( ps, ExpectedOverallCost2,
                                        p = predict( fit, type = "fitted" ),
                                        y = acath$sigdz,
                                        cplusminus = 2, cminusplus = 4 ) ) ),
             id.vars = "ps" )

p1 <- xyplot( value ~ ps, data = res, subset = variable=="Expected overall cost",
              type = "l", xlab = "Threshold", ylab = "Expected overall cost",
              panel=function( x, y, ... ) {
                panel.xyplot( x, y,  ... )
                panel.abline( v = x[ which.min( y ) ],
                              col = trellis.par.get()$plot.line$col )
                panel.points( x[ which.min( y ) ], min( y ), pch = 19 )
              }  )
p2 <- xyplot( value ~ ps, groups = variable,
              data = droplevels( res[ variable%in%c( "Expected overall cost",
                                                     "Sensitivity",
                                                     "Specificity", "PPV", "NPV" ) ] ),
              subset = variable%in%c( "Sensitivity", "Specificity", "PPV", "NPV" ),
              type = "l", xlab = "Threshold", ylab = "Sensitivity/Specificity/PPV/NPV",
              auto.key = list( columns = 3, points = FALSE, lines = TRUE ) )
doubleYScale( p1, p2, use.style = FALSE, add.ylab2 = TRUE )

We can now analyze those metrics that are sometimes specifically advertised as being able to come up with an optimal cutoff without costs, and contrast it with our cost-based approach! Let's use the three most often used metrics:

Accuracy (maximize accuracy)
Youden rule (maximize $Sens+Spec-1$ )
Topleft rule (minimize $\left(1-Sens\right)^2+\left(1-Spec\right)^2$ )

(For simplicity, we will subtract the above values from 1 for the Youden and the Accuracy rule so that we have a minimization problem everywhere.)

Let's see the results:

p3 <- xyplot( value ~ ps, groups = variable,
              data = droplevels( res[ variable%in%c( "Expected overall cost", "Accuracy",
                                                     "Youden", "Topleft"  ) ] ),
              subset = variable%in%c( "Accuracy", "Youden", "Topleft"  ),
              type = "l", xlab = "Threshold", ylab = "Accuracy/Youden/Topleft",
              auto.key = list( columns = 3, points = FALSE, lines = TRUE ),
              panel = panel.superpose, panel.groups = function( x, y, col.line, ... ) {
                panel.xyplot( x, y, col.line = col.line, ... )
                panel.abline( v = x[ which.min( y ) ], col = col.line )
                panel.points( x[ which.min( y ) ], min( y ), pch = 19, col = col.line )
              } )
doubleYScale( p1, p3, use.style = FALSE, add.ylab2 = TRUE )

This of course pertains to one specific cost structure, $c_{-}^{-}=0$ , $c_{+}^{+}=1$ , $c_{+}^{-}=2$ , $c_{-}^{+}=4$ (this obviously matters only for the optimal cost decision). To investigate the effect of cost structure, let's pick just the optimal threshold (instead of tracing the whole curve), but plot it as a function of costs. More specifically, as we have already seen, the optimal threshold depends on the four costs only through the $c_d^{-}/c_d^{+}$ ratio, so let's plot the optimal cutoff as a function of this, along with the typically used metrics that don't use costs:

res2 <- data.frame( rat = 10^( seq( log10( 0.02 ), log10( 50 ), length.out = 500 ) ) )
res2$OptThreshold <- sapply( res2$rat,
                             function( rat ) ps[ which.min(
                               sapply( ps, Vectorize( ExpectedOverallCost, "b" ),
                                       p = predict( fit, type = "fitted" ),
                                       y = acath$sigdz,
                                       cplusminus = rat,
                                       cminusplus = 1,
                                       cplusplus = 0 ) ) ] )

xyplot( OptThreshold ~ rat, data = res2, type = "l", ylim = c( -0.1, 1.1 ),
        xlab = expression( {"c"^{"-"}}["d"]/{"c"^{"+"}}["d"] ), ylab = "Optimal threshold",
        scales = list( x = list( log = 10, at = c( 0.02, 0.05, 0.1, 0.2, 0.5, 1,
                                                   2, 5, 10, 20, 50 ) ) ),
        panel = function( x, y, resin = res[ ,.( ps[ which.min( value ) ] ),
                                             .( variable ) ], ... ) {
          panel.xyplot( x, y, ... )
          panel.abline( h = resin[variable=="Youden"] )
          panel.text( log10( 0.02 ), resin[variable=="Youden"], "Y", pos = 3 )
          panel.abline( h = resin[variable=="Accuracy"] )
          panel.text( log10( 0.02 ), resin[variable=="Accuracy"], "A", pos = 3 )
          panel.abline( h = resin[variable=="Topleft"] )
          panel.text( log10( 0.02 ), resin[variable=="Topleft"], "TL", pos = 1 )
        } )

Horizontal lines indicate the approaches that don't use costs (and are therefore constant).

Again, we nicely see that as the additional cost of misclassification in the healthy group rises compared to that of the diseased group, the optimal threshold increases: if we really don't want healthy people to be classified as sick, we will use higher cutoff (and the other way around, of course!).

And, finally, we yet again see why those methods that don't use costs are not (and can't!) be always optimal.

— Tamas Ferenci
fuente