The short answer is that your conjecture is true when and only when there is a positive intra-class correlation in the data. Empirically speaking, most clustered datasets, most of the time, show a positive intra-class correlation, which means that in practice your conjecture is usually true. But if the intra-class correlation is 0, then the two cases you mentioned are equally informative. And if the intra-class correlation is negative, then it's actually less informative to take fewer measurements on more subjects; we would actually prefer (as far as reducing the variance of the parameter estimate is concerned) to take all our measurements on a single subject.
Statistically, there are two perspectives from which we can think about this: a random-effects (or mixed) model, which you mention in your question, or a marginal model, which ends up being a bit more informative here.
Random-effects (mixed) model
Say we have a set of $n$ subjects from whom we have taken $m$ measurements each. Then a simple random-effects model of the $j$th measurement from the $i$th subject might be
$$y_{ij} = \beta + u_i + e_{ij},$$
where $\beta$ is the fixed intercept, $u_i$ is the random subject effect (with variance $\sigma^2_u$), $e_{ij}$ is the observation-level error term (with variance $\sigma^2_e$), and the last two random terms are independent of each other.
In this model, $\beta$ represents the population mean, and with a balanced dataset (i.e., an equal number of measurements from each subject), our best estimate of it is simply the sample mean. So if we take "more information" to mean a smaller variance for this estimate, then basically we want to know how the variance of the sample mean depends on $n$ and $m$. With a bit of algebra we can work out that
$$\begin{aligned}
\operatorname{var}\left(\frac{1}{nm}\sum_i\sum_j y_{ij}\right) &= \operatorname{var}\left(\frac{1}{nm}\sum_i\sum_j (\beta + u_i + e_{ij})\right)\\
&= \frac{1}{n^2m^2}\operatorname{var}\left(\sum_i\sum_j u_i + \sum_i\sum_j e_{ij}\right)\\
&= \frac{1}{n^2m^2}\left(m^2\sum_i \operatorname{var}(u_i) + \sum_i\sum_j \operatorname{var}(e_{ij})\right)\\
&= \frac{1}{n^2m^2}\left(nm^2\sigma^2_u + nm\sigma^2_e\right)\\
&= \frac{\sigma^2_u}{n} + \frac{\sigma^2_e}{nm}.
\end{aligned}$$
Examining this expression, we can see that whenever there is any subject variance at all (i.e., $\sigma^2_u > 0$), increasing the number of subjects ($n$) makes both terms smaller, while increasing the number of measurements per subject ($m$) only makes the second term smaller. (For a practical implication of this for designing multi-site replication projects, see this blog post I wrote a while ago.)

Now, holding the total number of observations $nm$ constant, the whole variance expression just looks like
$$\frac{\sigma^2_u}{n} + \text{constant},$$
which is as small as possible when $n$ is as large as possible (up to a maximum of $n = nm$, in which case $m = 1$, meaning we take a single measurement from each subject).
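If it helps to see this concretely, here's a small Monte Carlo sketch in Python (with hypothetical parameter values chosen just for illustration) that simulates the random-effects model and compares the empirical variance of the sample mean against the $\sigma^2_u/n + \sigma^2_e/nm$ formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, chosen only for illustration
beta, sigma_u, sigma_e = 0.0, 1.0, 1.0
n, m = 20, 5          # n subjects, m measurements each
n_sims = 20_000

means = np.empty(n_sims)
for s in range(n_sims):
    u = rng.normal(0.0, sigma_u, size=n)        # subject effects u_i
    e = rng.normal(0.0, sigma_e, size=(n, m))   # observation-level errors e_ij
    y = beta + u[:, None] + e                   # y_ij = beta + u_i + e_ij
    means[s] = y.mean()                         # sample mean for this simulated dataset

empirical = means.var()
theoretical = sigma_u**2 / n + sigma_e**2 / (n * m)
print(f"empirical var:   {empirical:.5f}")
print(f"theoretical var: {theoretical:.5f}")
```

With these values the theoretical variance is $1/20 + 1/100 = 0.06$, and the empirical variance should land very close to that.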
A simple random-effects model like this one has an intra-class correlation of
$$\rho = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2_e}$$
(sketch of a derivation here). So we can write the variance equation above as
$$\operatorname{var}\left(\frac{1}{nm}\sum_i\sum_j y_{ij}\right) = \frac{\sigma^2_u}{n} + \frac{\sigma^2_e}{nm} = \left(\frac{\rho}{n} + \frac{1-\rho}{nm}\right)\left(\sigma^2_u + \sigma^2_e\right)$$
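As a quick sanity check on this bit of algebra, the two forms of the variance agree numerically (again with hypothetical variance components):

```python
sigma_u2, sigma_e2 = 1.5, 0.5      # hypothetical variance components
n, m = 20, 5
rho = sigma_u2 / (sigma_u2 + sigma_e2)

form1 = sigma_u2 / n + sigma_e2 / (n * m)
form2 = (rho / n + (1 - rho) / (n * m)) * (sigma_u2 + sigma_e2)
print(form1, form2)   # identical up to floating point
```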
This doesn't really add any insight to what we already saw above, but it does make us wonder: since the intra-class correlation is a bona fide correlation coefficient, and correlation coefficients can be negative, what would happen (and what would it mean) if the intra-class correlation were negative?
In the context of the random-effects model, a negative intra-class correlation doesn't really make sense, because it implies that the subject variance $\sigma^2_u$ is somehow negative (as we can see from the $\rho$ equation above, and as explained here and here)... but variances can't be negative! But this doesn't mean that the concept of a negative intra-class correlation doesn't make sense; it just means that the random-effects model doesn't have any way to express this concept, which is a failure of the model, not of the concept. To express this concept adequately we need to consider the marginal model.
Marginal model
For this same dataset we could consider a so-called marginal model of $y_{ij}$,
$$y_{ij} = \beta + e^*_{ij},$$
where basically we've pushed the random subject effect $u_i$ from before into the error term $e_{ij}$ so that we have $e^*_{ij} = u_i + e_{ij}$. In the random-effects model we considered the two random terms $u_i$ and $e_{ij}$ to be i.i.d., but in the marginal model we instead consider $e^*_{ij}$ to follow a block-diagonal covariance matrix $C$ like
$$C = \sigma^2\begin{bmatrix}R & 0 & \cdots & 0\\ 0 & R & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & R\end{bmatrix}, \qquad R = \begin{bmatrix}1 & \rho & \cdots & \rho\\ \rho & 1 & \cdots & \rho\\ \vdots & \vdots & \ddots & \vdots\\ \rho & \rho & \cdots & 1\end{bmatrix}$$
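For concreteness, here's a sketch of how one might build this block-diagonal $C$ with numpy (the helper name `marginal_cov` is mine, not from any library):

```python
import numpy as np

def marginal_cov(n, m, sigma2, rho):
    """Block-diagonal covariance matrix C for n subjects with m measurements each."""
    R = np.full((m, m), rho)               # compound-symmetric within-subject block
    np.fill_diagonal(R, 1.0)
    return sigma2 * np.kron(np.eye(n), R)  # C = sigma^2 * diag(R, ..., R)

C = marginal_cov(n=3, m=2, sigma2=2.0, rho=0.5)
print(C)
```

The Kronecker product with the identity is just a compact way of placing $n$ copies of $R$ along the diagonal.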
In words, this means that under the marginal model we simply consider $\rho$ to be the expected correlation between two $e^*$s from the same subject (we assume the correlation across subjects is 0). When $\rho$ is positive, two observations drawn from the same subject tend to be more similar (closer together), on average, than two observations drawn randomly from the dataset while ignoring the clustering due to subjects. When $\rho$ is negative, two observations drawn from the same subject tend to be less similar (further apart), on average, than two observations drawn completely at random. (More information about this interpretation in the question/answers here.)
So now when we look at the equation for the variance of the sample mean under the marginal model, we have
$$\begin{aligned}
\operatorname{var}\left(\frac{1}{nm}\sum_i\sum_j y_{ij}\right) &= \operatorname{var}\left(\frac{1}{nm}\sum_i\sum_j (\beta + e^*_{ij})\right)\\
&= \frac{1}{n^2m^2}\operatorname{var}\left(\sum_i\sum_j e^*_{ij}\right)\\
&= \frac{1}{n^2m^2}\left(n\left(m\sigma^2 + (m^2 - m)\rho\sigma^2\right)\right)\\
&= \frac{\sigma^2\left(1 + (m-1)\rho\right)}{nm}\\
&= \left(\frac{\rho}{n} + \frac{1-\rho}{nm}\right)\sigma^2,
\end{aligned}$$
which is the same variance expression we derived above for the random-effects model, just with $\sigma^2_e + \sigma^2_u = \sigma^2$, which is consistent with our note above that $e^*_{ij} = u_i + e_{ij}$. The advantage of this (statistically equivalent) perspective is that here we can think about a negative intra-class correlation without needing to invoke any weird concepts like a negative subject variance. Negative intra-class correlations just fit naturally in this framework.
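One way to convince yourself of this derivation is to compute the variance of the sample mean directly from $C$, since for equal weights $w = 1/nm$ it is just $w^\top C w$. A sketch, reusing the hypothetical `marginal_cov` helper from above:

```python
import numpy as np

def marginal_cov(n, m, sigma2, rho):
    R = np.full((m, m), rho)
    np.fill_diagonal(R, 1.0)
    return sigma2 * np.kron(np.eye(n), R)

n, m, sigma2, rho = 10, 4, 1.0, 0.3   # hypothetical values
C = marginal_cov(n, m, sigma2, rho)

# variance of the sample mean is w' C w with equal weights w = 1/(nm)
w = np.full(n * m, 1.0 / (n * m))
var_direct = w @ C @ w
var_formula = sigma2 * (1 + (m - 1) * rho) / (n * m)
print(var_direct, var_formula)        # should agree up to floating point
```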
(BTW, just a quick aside to point out that the second-to-last line of the derivation above implies that we must have $\rho \ge -1/(m-1)$, or else the whole expression is negative, but variances can't be negative! So there is a lower bound on the intra-class correlation that depends on how many measurements we have per cluster. For $m = 2$ (i.e., we measure each subject twice), the intra-class correlation can go all the way down to $\rho = -1$; for $m = 3$ it can only go down to $\rho = -1/2$; and so on. Fun fact!)
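The same lower bound shows up if you ask when $R$ is still a valid (positive semi-definite) correlation matrix: its smallest eigenvalue is exactly $1 + (m-1)\rho$. A quick numerical sketch:

```python
import numpy as np

def min_eig_R(m, rho):
    """Smallest eigenvalue of the m x m compound-symmetric correlation matrix."""
    R = np.full((m, m), rho)
    np.fill_diagonal(R, 1.0)
    return np.linalg.eigvalsh(R).min()

for m in (2, 3, 5):
    bound = -1.0 / (m - 1)
    # at the bound the smallest eigenvalue is ~0; just below it goes negative,
    # so R is no longer a valid correlation matrix
    print(m, bound, min_eig_R(m, bound), min_eig_R(m, bound - 0.05))
```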
So finally, once again considering the total number of observations $nm$ to be a constant, we see that the second-to-last line of the derivation above just looks like
$$\left(1 + (m-1)\rho\right) \times \text{positive constant}.$$
So when $\rho > 0$, having $m$ as small as possible (so that we take fewer measurements of more subjects--in the limit, 1 measurement of each subject) makes the variance of the estimate as small as possible. But when $\rho < 0$, we actually want $m$ to be as large as possible (so that, in the limit, we take all $nm$ measurements from a single subject) in order to make the variance as small as possible. And when $\rho = 0$, the variance of the estimate is just a constant, so our allocation of $m$ and $n$ doesn't matter.
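To see all three cases side by side, here's a final sketch that evaluates the variance formula over different allocations of $n$ and $m$ with $nm$ held fixed (all values hypothetical; note that $m$ is capped at 6 so that $\rho = -0.2$ stays above its $-1/(m-1)$ lower bound):

```python
def var_mean(n, m, sigma2, rho):
    """Variance of the sample mean: sigma^2 * (1 + (m - 1) * rho) / (n * m)."""
    return sigma2 * (1 + (m - 1) * rho) / (n * m)

total = 60  # nm held constant (hypothetical total number of observations)
for rho in (0.2, 0.0, -0.2):
    allocations = ", ".join(
        f"n={total // m:2d},m={m}: {var_mean(total // m, m, 1.0, rho):.4f}"
        for m in (1, 2, 3, 6)
    )
    print(f"rho={rho:+.1f}  ->  {allocations}")
```

For $\rho = +0.2$ the variance grows with $m$ (so $m = 1$ wins), for $\rho = 0$ it stays at $1/60$ for every allocation, and for $\rho = -0.2$ it shrinks as $m$ grows, hitting exactly 0 at the boundary $m = 6$, consistent with the conclusion above.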