No rederive la B e t a ( k - 12 ,n - k2 )distribución en la excelente respuesta de @ Alecos (es un resultado estándar, veaaquípara otra buena discusión) pero quiero completar más detalles sobre las consecuencias. En primer lugar, ¿cómo se ve la distribución nula deR2para un rango de valores denyk? El gráfico en la respuesta de @ Alecos es bastante representativo de lo que ocurre en las regresiones múltiples prácticas, pero a veces la percepción se obtiene más fácilmente de casos más pequeños. He incluido la media, el modo (donde existe) y la desviación estándar. El gráfico / tabla merece un buen globo ocular: seve mejor a tamaño completo. Podría haber incluido menos facetas, pero el patrón habría sido menos claro; He adjuntadoBeta(k−12,n−k2)R2nkR
código para que los lectores puedan experimentar con diferentes subconjuntos de n y k .nk
Valores de parámetros de forma
El esquema de color del gráfico indica si cada parámetro de forma es menor que uno (rojo), igual a uno (azul) o más de uno (verde). El lado izquierdo muestra el valor de α mientras que β está a la derecha. Como α = k - 1αβ2 , su valor aumenta en progresión aritmética por una diferencia común de1α=k−122 a medida que nos movemos hacia la derecha de una columna a otra (agreguemos un regresor a nuestro modelo) mientras que, paranfijo,β=n-k12n2 disminuye en1β=n−k22 . El totalα+β=n-1122 se fija para cada fila (para un tamaño de muestra dado). Si en cambio arreglamosky nos movemos hacia abajo en la columna (aumentamos el tamaño de la muestra en 1), entoncesαpermanece constante yβaumenta en1α+β=n−12kαβ2 . En términos de regresión,αes la mitad del número de regresores incluidos en el modelo, yβes la mitad de los grados residuales de libertad. Para determinar la forma de la distribución, estamos particularmente interesados en dóndeαoβ soniguales.12αβαβ
El álgebra es sencillo para α : tenemos k - 1α2 =1,entoncesk=3. De hecho, esta es la única columna del gráfico de facetas que se llena de azul a la izquierda. De manera similar,α<1parak<3(lacolumnak=2es roja a la izquierda) yα>1parak>3(desde lacolumnak=4 enadelante, el lado izquierdo es verde).k−12=1k=3α<1k<3k=2α>1k>3k=4
Para β = 1 tenemos n - kβ=12 =1 por lotantok=n-2. Observe cómo estos casos (marcados con un lado azul a la derecha) cortan una línea diagonal a través del diagrama de facetas. Paraβ>1obtenemosk<n-2(las gráficas con un lado verde a la izquierda se encuentran a la izquierda de la línea diagonal). Paraβ<1necesitamosk>n-2, que involucra solo los casos más a la derecha en mi gráfico: enn=ktenemosβ=0y la distribución es degenerada, peronn−k2=1k=n−2β>1k<n−2β<1k>n−2n=kβ=0= k - 1 donde β = 1n=k−12 se traza (lado derecho en rojo).β=12
Como el PDF es f ( x ;α ,β ) ∝ x α - 1 ( 1 - x ) β - 1 , está claro que si (y solo si) α < 1, entonces f ( x ) → ∞ como x → 0 . Podemos ver esto en el gráfico: cuando el lado izquierdo está sombreado en rojo, observe el comportamiento en 0. De manera similar cuando β < 1 entonces f ( x ) → ∞ como x → 1 . ¡Mira donde el lado derecho es rojo!f(x;α,β)∝xα−1(1−x)β−1α<1f(x)→∞x→0β<1f(x)→∞x→1
Simetrías
Una de las características más llamativas del gráfico es el nivel de simetría, pero cuando se trata de la distribución Beta, ¡esto no debería ser sorprendente!
La distribución Beta en sí es simétrica si α = β . Para nosotros esto ocurre si n = 2 k - 1 que identifica correctamente los paneles ( k = 2 , n = 3 ) , ( k = 3 , n = 5 ) , ( k = 4 , n = 7 ) y ( k = 5 , n = 9 )α=βn=2k−1(k=2,n=3)(k=3,n=5)(k=4,n=7)(k=5,n=9). La medida en que la distribución es simétrica a través de R 2 = 0.5 depende del número de las variables regresor incluimos en el modelo para ese tamaño de la muestra. Si k = n + 1R2=0.52k=n+12 the distribution of R2R2 is perfectly symmetric about 0.5; if we include fewer variables than that it becomes increasingly asymmetric and the bulk of the probability mass shifts closer to R2=0R2=0; if we include more variables then it shifts closer to R2=1R2=1. Remember that kk includes the intercept in its count, and that we are working under the null, so the regressor variables should have coefficient zero in the correctly specified model.
There is also an obviously symmetry between distributions for any given nn, i.e. any row in the facet grid. For example, compare (k=3,n=9)(k=3,n=9) with (k=7,n=9)(k=7,n=9). What's causing this? Recall that the distribution of Beta(α,β)Beta(α,β) is the mirror image of Beta(β,α)Beta(β,α) across x=0.5x=0.5. Now we had αk,n=k−12αk,n=k−12 and βk,n=n−k2βk,n=n−k2. Consider k′=n−k+1k′=n−k+1 and we find:
αk′,n=(n−k+1)−12=n−k2=βk,n
αk′,n=(n−k+1)−12=n−k2=βk,n
βk′,n=n−(n−k+1)2=k−12=αk,nβk′,n=n−(n−k+1)2=k−12=αk,n
So this explains the symmetry as we vary the number of regressors in the model for a fixed sample size. It also explains the distributions that are themselves symmetric as a special case: for them, k′=kk′=k so they are obliged to be symmetric with themselves!
This tells us something we might not have guessed about multiple regression: for a given sample size nn, and assuming no regressors have a genuine relationship with YY, the R2R2 for a model using k−1k−1 regressors plus an intercept has the same distribution as 1−R21−R2 does for a model with k−1k−1 residual degrees of freedom remaining.
Special distributions
When k=nk=n we have β=0β=0, which isn't a valid parameter. However, as β→0β→0 the distribution becomes degenerate with a spike such that P(R2=1)=1P(R2=1)=1. This is consistent with what we know about a model with as many parameters as data points - it achieves perfect fit. I haven't drawn the degenerate distribution on my graph but did include the mean, mode and standard deviation.
When k=2k=2 and n=3n=3 we obtain Beta(12,12)Beta(12,12) which is the arcsine distribution. This is symmetric (since α=βα=β) and bimodal (0 and 1). Since this is the only case where both α<1α<1 and β<1β<1 (marked red on both sides), it is our only distribution which goes to infinity at both ends of the support.
The Beta(1,1)Beta(1,1) distribution is the only Beta distribution to be rectangular (uniform). All values of R2R2 from 0 to 1 are equally likely. The only combination of kk and nn for which α=β=1α=β=1 occurs is k=3k=3 and n=5n=5 (marked blue on both sides).
The previous special cases are of limited applicability but the case α>1α>1 and β=1β=1 (green on left, blue on right) is important. Now f(x;α,β)∝xα−1(1−x)β−1=xα−1f(x;α,β)∝xα−1(1−x)β−1=xα−1 so we have a power-law distribution on [0, 1]. Of course it's unlikely we'd perform a regression with k=n−2k=n−2 and k>3k>3, which is when this situation occurs. But by the previous symmetry argument, or some trivial algebra on the PDF, when k=3k=3 and n>5n>5, which is the frequent procedure of multiple regression with two regressors and an intercept on a non-trivial sample size, R2R2 will follow a reflected power law distribution on [0, 1] under H0H0. This corresponds to α=1α=1 and β>1β>1 so is marked blue on left, green on right.
You may also have noticed the triangular distributions at (k=5,n=7)(k=5,n=7) and its reflection (k=3,n=7)(k=3,n=7). We can recognise from their αα and ββ that these are just special cases of the power-law and reflected power-law distributions where the power is 2−1=12−1=1.
Mode
If α>1α>1 and β>1β>1, all green in the plot, f(x;α,β)f(x;α,β) is concave with f(0)=f(1)=0f(0)=f(1)=0, and the Beta distribution has a unique mode α−1α+β−2α−1α+β−2. Putting these in terms of kk and nn, the condition becomes k>3k>3 and n>k+2n>k+2 while the mode is k−3n−5k−3n−5.
All other cases have been dealt with above. If we relax the inequality to allow β=1β=1, then we include the (green-blue) power-law distributions with k=n−2k=n−2 and k>3k>3 (equivalently, n>5n>5). These cases clearly have mode 1, which actually agrees with the previous formula since (n−2)−3n−5=1(n−2)−3n−5=1. If instead we allowed α=1α=1 but still demanded β>1β>1, we'd find the (blue-green) reflected power-law distributions with k=3k=3 and n>5n>5. Their mode is 0, which agrees with 3−3n−5=03−3n−5=0. However, if we relaxed both inequalities simultaneously to allow α=β=1α=β=1, we'd find the (all blue) uniform distribution with k=3k=3 and n=5n=5, which does not have a unique mode. Moreover the previous formula can't be applied in this case, since it would return the indeterminate form 3−35−5=003−35−5=00.
When n=kn=k we get a degenerate distribution with mode 1. When β<1β<1 (in regression terms, n=k−1n=k−1 so there is only one residual degree of freedom) then f(x)→∞f(x)→∞ as x→1x→1, and when α<1α<1 (in regression terms, k=2k=2 so a simple linear model with intercept and one regressor) then f(x)→∞f(x)→∞ as x→0x→0. These would be unique modes except in the unusual case where k=2k=2 and n=3n=3 (fitting a simple linear model to three points) which is bimodal at 0 and 1.
Mean
The question asked about the mode, but the mean of R2R2 under the null is also interesting - it has the remarkably simple form k−1n−1k−1n−1. For a fixed sample size it increases in arithmetic progression as more regressors are added to the model, until the mean value is 1 when k=nk=n. The mean of a Beta distribution is αα+βαα+β so such an arithmetic progression was inevitable from our earlier observation that, for fixed nn, the sum α+βα+β is constant but αα increases by 0.5 for each regressor added to the model.
αα+β=(k−1)/2(k−1)/2+(n−k)/2=k−1n−1
αα+β=(k−1)/2(k−1)/2+(n−k)/2=k−1n−1
Code for plots
require(grid)
require(dplyr)
nlist <- 3:9 #change here which n to plot
klist <- 2:8 #change here which k to plot
totaln <- length(nlist)
totalk <- length(klist)
df <- data.frame(
x = rep(seq(0, 1, length.out = 100), times = totaln * totalk),
k = rep(klist, times = totaln, each = 100),
n = rep(nlist, each = totalk * 100)
)
df <- mutate(df,
kname = paste("k =", k),
nname = paste("n =", n),
a = (k-1)/2,
b = (n-k)/2,
density = dbeta(x, (k-1)/2, (n-k)/2),
groupcol = ifelse(x < 0.5,
ifelse(a < 1, "below 1", ifelse(a ==1, "equals 1", "more than 1")),
ifelse(b < 1, "below 1", ifelse(b ==1, "equals 1", "more than 1")))
)
g <- ggplot(df, aes(x, density)) +
geom_line(size=0.8) + geom_area(aes(group=groupcol, fill=groupcol)) +
scale_fill_brewer(palette="Set1") +
facet_grid(nname ~ kname) +
ylab("probability density") + theme_bw() +
labs(x = expression(R^{2}), fill = expression(alpha~(left)~beta~(right))) +
theme(panel.margin = unit(0.6, "lines"),
legend.title=element_text(size=20),
legend.text=element_text(size=20),
legend.background = element_rect(colour = "black"),
legend.position = c(1, 1), legend.justification = c(1, 1))
df2 <- data.frame(
k = rep(klist, times = totaln),
n = rep(nlist, each = totalk),
x = 0.5,
ymean = 7.5,
ymode = 5,
ysd = 2.5
)
df2 <- mutate(df2,
kname = paste("k =", k),
nname = paste("n =", n),
a = (k-1)/2,
b = (n-k)/2,
meanR2 = ifelse(k > n, NaN, a/(a+b)),
modeR2 = ifelse((a>1 & b>=1) | (a>=1 & b>1), (a-1)/(a+b-2),
ifelse(a<1 & b>=1 & n>=k, 0, ifelse(a>=1 & b<1 & n>=k, 1, NaN))),
sdR2 = ifelse(k > n, NaN, sqrt(a*b/((a+b)^2 * (a+b+1)))),
meantext = ifelse(is.nan(meanR2), "", paste("Mean =", round(meanR2,3))),
modetext = ifelse(is.nan(modeR2), "", paste("Mode =", round(modeR2,3))),
sdtext = ifelse(is.nan(sdR2), "", paste("SD =", round(sdR2,3)))
)
g <- g + geom_text(data=df2, aes(x, ymean, label=meantext)) +
geom_text(data=df2, aes(x, ymode, label=modetext)) +
geom_text(data=df2, aes(x, ysd, label=sdtext))
print(g)