How to keep time-invariant variables in a fixed effects model

I have data on the employees of a large Italian firm over ten years, and I would like to see how the gender gap in earnings between men and women has changed over time. For this purpose I run pooled OLS:

$y_{it} = X'_{it}\beta + \delta {\rm male}_i + \sum^{10}_{t=1}\gamma_t d_t + \varepsilon_{it}$

where $y$ is log earnings per year, $X_{it}$ includes covariates that differ by individual and time, $d_t$ are year dummies, and ${\rm male}_i$ equals one if a worker is male and zero otherwise.

Now I am worried that some of the covariates may be correlated with unobserved fixed effects. But when I use the fixed effects (within) estimator or first differences, I lose the gender dummy because this variable does not change over time. I do not want to use the random effects estimator because I often hear people say that it relies on assumptions that are very unrealistic and unlikely to hold.

Is there any way to keep the gender dummy and control for fixed effects at the same time? If so, do I need to cluster or deal with other problems in the errors for hypothesis tests on the gender variable?

— usuario42263

Answers:

There are a few possible ways to keep the gender dummy in a fixed effects regression.

Within Estimator
Suppose you have a model similar to your pooled OLS model, namely

$y_{it} = \beta_1 + \sum^{10}_{t=2} \beta_t d_t + \gamma_1 (male_i) + \sum^{10}_{t=2} \gamma_t (d_t \cdot male_i) + X'_{it}\theta + c_i + \epsilon_{it}$

where the variables are as before. Note that $\beta_1$ and $\beta_1 + \gamma_1 (male_i)$ cannot be identified because the within estimator cannot distinguish them from the fixed effect $c_i$. Given that $\beta_1$ is the intercept for the base year $t=1$, $\gamma_1$ is the gender effect on earnings in this period. What we can identify in this case are $\gamma_2, ..., \gamma_{10}$, because they are interacted with your time dummies and they measure the differences in the partial effects of your gender variable relative to the first time period. This means that if you observe an increase in your $\gamma_2,...,\gamma_{10}$ over time, this is an indication of a widening earnings gap between men and women.
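The identification argument can be checked on simulated data. The following is only a sketch under assumed values: the panel size, the coefficients, and the helper `within` are all made up for illustration, and NumPy least squares stands in for a packaged fixed effects routine. It shows that demeaning wipes out the gender dummy itself, while the year-by-male interactions are still recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 2000, 10                      # hypothetical panel: 2000 workers, 10 years

male = rng.integers(0, 2, size=N).astype(float)   # time-invariant gender dummy
c = rng.normal(size=N) + 0.5 * male               # fixed effect c_i
x = c[:, None] + rng.normal(size=(N, T))          # covariate correlated with c_i

# Assumed truth: a base-year gap of 0.3 that widens by 0.02 each year.
gamma_true = 0.02 * np.arange(T)     # gamma_true[0] = 0 for the base year
year_effect = 0.1 * np.arange(T)
y = (year_effect + 0.3 * male[:, None] + gamma_true * male[:, None]
     + 0.5 * x + c[:, None] + rng.normal(scale=0.2, size=(N, T)))

def within(a):
    """Subtract each individual's time average (axis 1 indexes years)."""
    return a - a.mean(axis=1, keepdims=True)

# The gender dummy is constant within a worker, so demeaning wipes it out.
male_panel = np.repeat(male[:, None], T, axis=1)
male_survives = not np.allclose(within(male_panel), 0.0)

# Year dummies d_2..d_T and their interactions with male, shape (N, T, T-1).
d = np.broadcast_to(np.eye(T)[:, 1:], (N, T, T - 1))
d_male = d * male[:, None, None]

# Stack regressors, demean everything by worker, and run least squares.
Z = np.concatenate([d, d_male, x[:, :, None]], axis=2)
Z_dm = within(Z).reshape(N * T, -1)
y_dm = within(y).reshape(N * T)
coef, *_ = np.linalg.lstsq(Z_dm, y_dm, rcond=None)

gamma_hat = coef[T - 1:2 * (T - 1)]  # estimates of gamma_2..gamma_T
theta_hat = coef[-1]                 # coefficient on x
print("male survives demeaning:", male_survives)
print("gamma_hat:", np.round(gamma_hat, 3))
```

In this setup the estimated interactions should rise roughly in line with the assumed 0.02 per year, which is the pattern the answer reads as a widening gap.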

First-Difference Estimator
If you want to know the overall effect of the difference between men and women over time, you can try the following model:

$y_{it} = \beta_1 + \sum^{10}_{t=2} \beta_t d_t + \gamma (t\cdot male_i) + X'_{it}\theta + c_i + \epsilon_{it}$

where the variable $t = 1, 2,...,10$ is interacted with the time-invariant gender dummy. Now if you take first differences, $\beta_1$ and $c_i$ drop out and you get

$y_{it} - y_{i(t-1)} = \sum^{10}_{t=2} \beta_t (d_t - d_{(t-1)}) + \gamma (t\cdot male_i - [(t-1)male_i]) + (X'_{it}-X'_{i(t-1)})\theta + \epsilon_{it}-\epsilon_{i(t-1)}$

Then

$\gamma(t\cdot male_i - [(t-1)male_i]) = \gamma[(t - (t-1))\cdot male_i] = \gamma (male_i)$

and you can identify the gender difference in earnings $\gamma$. So the final regression equation will be:

$\Delta y_{it} = \sum_{t=2}^{10}\beta_t \Delta d_t + \gamma(male_i) + \Delta X'_{it}\theta + \Delta \epsilon_{it}$

and you get your effect of interest. The nice thing is that this is easily implemented in any statistical software, but you lose a time period.
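As a rough illustration of the first-difference idea, the sketch below simulates a panel in which the gap grows linearly, differences the data, and regresses $\Delta y_{it}$ on period dummies, the gender dummy, and $\Delta X_{it}$. All numbers are assumptions chosen for the example. Using one dummy per differenced period to absorb the year effects is an implementation choice made here, not something taken from the answer.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 2000, 10                      # hypothetical panel size
t = np.arange(1, T + 1)

male = rng.integers(0, 2, size=N).astype(float)
c = rng.normal(size=N) + 0.5 * male               # fixed effect c_i
x = c[:, None] + rng.normal(size=(N, T))          # covariate correlated with c_i

gamma_true, theta_true = 0.02, 0.5                # assumed values
year_effect = rng.normal(scale=0.1, size=T)
y = (year_effect + gamma_true * t * male[:, None]
     + theta_true * x + c[:, None] + rng.normal(scale=0.2, size=(N, T)))

# First differences along the time axis: one period is lost, c_i drops out.
dy = np.diff(y, axis=1)
dx = np.diff(x, axis=1)

# Differencing t * male leaves just male, as in the derivation.
d_tmale = np.diff(t * male[:, None], axis=1)
collapses_to_male = np.allclose(d_tmale, male[:, None])

# Regressors: a dummy per differenced period, male, and the differenced x.
period = np.broadcast_to(np.eye(T - 1), (N, T - 1, T - 1))
male3 = np.broadcast_to(male[:, None, None], (N, T - 1, 1))
Z = np.concatenate([period, male3, dx[:, :, None]], axis=2)
Z = Z.reshape(N * (T - 1), -1)

coef, *_ = np.linalg.lstsq(Z, dy.reshape(-1), rcond=None)
gamma_hat, theta_hat = coef[-2], coef[-1]
print("gamma_hat:", round(gamma_hat, 4), "theta_hat:", round(theta_hat, 4))
```

Note that $\gamma$ in this specification is the average yearly change in the gap, so the linear-trend restriction is part of what is being assumed.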

Hausman-Taylor Estimator
Let a subscript $1$ denote variables that are uncorrelated with $c_i$ and a subscript $2$ those that are, and let's say your gender variable is the only time-invariant variable. The Hausman-Taylor estimator then applies the random effects transformation:

$\tilde{y}_{it} = \tilde{X}'_{1it} + \tilde{X}'_{2it} + \gamma (\widetilde{male}_{i2}) + \tilde{c}_i + \tilde{\epsilon}_{it}$

where the tilde notation means

$\tilde{X}_{1it} = X_{1it} - \hat{\theta}_i \overline{X}_{1i}$

where $\hat{\theta}_i$ is used for the random effects transformation and $\overline{X}_{1i}$ is the time average for each individual. This isn't like the usual random effects estimator that you wanted to avoid, because group $2$ variables are instrumented in order to remove the correlation with $c_i$. For $\tilde{X}_{2it}$ the instrument is $X_{2it} - \overline{X}_{2i}$. The same is done for the time-invariant variables, so if you specify the gender variable to be potentially correlated with the fixed effect, it gets instrumented with $\overline{X}_{1i}$, so you must have more time-varying than time-invariant variables.

All of this might sound a little complicated but there are canned packages for this estimator. For instance, in Stata the corresponding command is xthtaylor. For further information on this method you could read Cameron and Trivedi (2009) "Microeconometrics Using Stata". Otherwise you can just stick with the two previous methods which are a bit easier.

Inference
For your hypothesis tests there is not much that needs to be considered other than what you would need to do anyway in a fixed effects regression. You need to account for autocorrelation in the errors, for example by clustering on the individual ID variable. This allows for an arbitrary correlation structure within clusters (individuals), which deals with autocorrelation. For a reference see again Cameron and Trivedi (2009).
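To make the clustering advice concrete, here is a minimal sketch of a cluster-robust (sandwich) variance computed by hand and compared with the classical OLS standard errors. The data-generating process and the helper names `ols`, `classical_se`, and `cluster_se` are assumptions for illustration, and no small-sample correction is applied, so packaged routines will report slightly different numbers.

```python
import numpy as np

def ols(Z, y):
    """Least-squares coefficients and residuals."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef, y - Z @ coef

def classical_se(Z, resid):
    """Standard errors that assume independent, homoskedastic errors."""
    n, k = Z.shape
    sigma2 = resid @ resid / (n - k)
    return np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))

def cluster_se(Z, resid, ids):
    """Sandwich standard errors allowing any correlation within each id."""
    _, g = np.unique(ids, return_inverse=True)
    scores = np.zeros((g.max() + 1, Z.shape[1]))
    np.add.at(scores, g, Z * resid[:, None])   # sum Z_it * u_it within worker
    bread = np.linalg.inv(Z.T @ Z)
    return np.sqrt(np.diag(bread @ scores.T @ scores @ bread))

rng = np.random.default_rng(2)
N, T = 500, 9                                  # hypothetical panel size
ids = np.repeat(np.arange(N), T)
male = np.repeat(rng.integers(0, 2, size=N), T).astype(float)

# Errors share a worker-level component, so they are correlated over time.
u = np.repeat(rng.normal(size=N), T) + rng.normal(size=N * T)
y = 0.1 + 0.05 * male + u

Z = np.column_stack([np.ones(N * T), male])
coef, resid = ols(Z, y)
se_naive = classical_se(Z, resid)
se_cluster = cluster_se(Z, resid, ids)
print("naive SE on male:  ", round(se_naive[1], 4))
print("cluster SE on male:", round(se_cluster[1], 4))
```

Because the gender dummy does not vary within a worker, ignoring the within-worker correlation understates its standard error noticeably in this simulated example.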

— Andy

Another potential way for you to keep the gender dummy is Mundlak's (1978) approach for a fixed effects model with time-invariant variables. Mundlak's approach posits that the individual effect can be projected onto the group means of the time-varying variables, which lets the gender dummy stay in the model.

Mundlak, Y. 1978: On the pooling of time series and cross section data. Econometrica 46:69-85.

— emeryville

Another method is to estimate the time-invariant coefficients in a second-stage equation, using the mean error as the dependent variable.

First, estimate the model with FE. From here you get estimates of $\beta$ and $\gamma_{t}$. For simplicity, let's forget about the year effects. Define the estimation error $\hat{u}_{it}$ as:

$\hat{u}_{it} \equiv y_{it} - X_{it}\hat{\beta}$

The linear predictor $\bar{u}_{i}$ is:

$\bar{u}_{i} \equiv \frac{\sum_{t=1}^{T}\hat{u}_{it}}{T} = \bar{y}_{i} - \bar{x}_{i}\hat{\beta}$

Now, consider the following second-stage equation:

$\bar{u}_{i} = \delta male_{i} + c_{i}$

Assume that gender is uncorrelated with the unobserved factors $c_{i}$. Then the OLS estimator of $\delta$ is unbiased and time-consistent (that is, it is consistent when $T \rightarrow \infty$).

To prove the above, substitute the original model into the estimator $\bar{u}_{i}$:

$\bar{u}_{i} = \bar{x}_{i}\beta - \bar{x}_{i}\hat{\beta} + \delta male_{i} + c_{i} + \frac{\sum_{t=1}^{T}\epsilon_{it}}{T}$

The expectation of this estimator is:

$E(\bar{u}_{i}) = \bar{x}_{i}\beta - \bar{x}_{i}E(\hat{\beta}) + \delta male_{i} + E(c_{i}) + \frac{\sum_{t=1}^{T}E(\epsilon_{it})}{T}$

If the assumptions for FE consistency hold, $\hat{\beta}$ is an unbiased estimator of $\beta$, and $E(\epsilon_{it}) = 0$. Thus:

$E(\bar{u}_{i}) = \delta male_{i} + E(c_{i})$

That is, our predictor is an unbiased estimator of the time-invariant components of the model.

Regarding consistency, the probability limit of this predictor is:

$p \lim\limits_{T \rightarrow \infty} \bar{u}_{i} = p \lim\limits_{T \rightarrow \infty} \left( \bar{x}_{i}\beta\right) - p \lim\limits_{T \rightarrow \infty} \left(\bar{x}_{i}\hat{\beta}\right) + p \lim\limits_{T \rightarrow \infty} \delta male_{i} + p \lim\limits_{T \rightarrow \infty} c_{i} + p \lim\limits_{T \rightarrow \infty} \left( \frac{\sum_{t=1}^{T}\epsilon_{it}}{T}\right)$

Again, given the FE assumptions, $\hat{\beta}$ is a consistent estimator of $\beta$, and the error term converges to its mean, which is zero. Therefore:

$p \lim\limits_{T \rightarrow \infty} \bar{u}_{i} = \delta male_{i} + c_{i}$

Again, our predictor is a consistent estimator of the time-invariant components of the model.
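The two-stage procedure can be sketched on simulated data as follows. Everything here is an assumption made for illustration: the sample size, the coefficient values, a single covariate, no year effects, and a gender dummy that is independent of $c_i$ by construction. The second stage also includes an intercept, which is an addition of this sketch so that a nonzero mean of $c_i$ would be absorbed.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 5000, 10                                # hypothetical panel size

male = rng.integers(0, 2, size=N).astype(float)
c = rng.normal(size=N)                         # independent of male by construction
x = c[:, None] + rng.normal(size=(N, T))       # covariate correlated with c_i

beta_true, delta_true = 0.5, 0.3               # assumed values
y = (beta_true * x + delta_true * male[:, None] + c[:, None]
     + rng.normal(scale=0.2, size=(N, T)))

# Stage 1: within (FE) estimate of beta from worker-demeaned data.
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
beta_hat = (x_dm * y_dm).sum() / (x_dm ** 2).sum()

# Mean residual per worker: u_bar_i = y_bar_i - x_bar_i * beta_hat.
u_bar = y.mean(axis=1) - beta_hat * x.mean(axis=1)

# Stage 2: regress the mean residual on an intercept and the gender dummy.
W = np.column_stack([np.ones(N), male])
(alpha_hat, delta_hat), *_ = np.linalg.lstsq(W, u_bar, rcond=None)
print("beta_hat:", round(beta_hat, 4), "delta_hat:", round(delta_hat, 4))
```

If the gender dummy were correlated with $c_i$, the second stage would pick up that correlation, so the estimate is only as good as the stated assumption.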

— luchonacho

The Mundlak-Chamberlain device is a perfect tool for this. It is usually referred to as the correlated random effects model because it uses the random effects model to implicitly estimate fixed effects for time-varying variables while also estimating the random effects for time-invariant variables.

In statistical software, you implement it the same way as the random effects model, but you have to add the means of all time-varying covariates.
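A small simulation can illustrate the recipe of adding individual means. In this sketch pooled OLS stands in for the random effects step, which is a simplification made here, and the data-generating values are assumptions. With pooled OLS on a panel like this, the coefficient on the time-varying covariate matches the within estimate, while the gender dummy keeps its own coefficient.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 1000, 10                                # hypothetical panel size

male = rng.integers(0, 2, size=N).astype(float)
c = rng.normal(size=N)                         # individual effect, correlated with x
x = c[:, None] + rng.normal(size=(N, T))
y = (0.5 * x + 0.3 * male[:, None] + c[:, None]
     + rng.normal(scale=0.2, size=(N, T)))

# Long format: rows are ordered worker by worker, T rows each.
x_long = x.reshape(-1)
x_bar_long = np.repeat(x.mean(axis=1), T)      # worker mean of x on every row
male_long = np.repeat(male, T)

# Mundlak-style regression: constant, x, worker mean of x, and male.
Z = np.column_stack([np.ones(N * T), x_long, x_bar_long, male_long])
coef, *_ = np.linalg.lstsq(Z, y.reshape(-1), rcond=None)
beta_mundlak, male_coef = coef[1], coef[3]

# Within (FE) estimate of the same slope, for comparison.
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
beta_fe = (x_dm * y_dm).sum() / (x_dm ** 2).sum()

print("beta (Mundlak):", round(beta_mundlak, 4))
print("beta (within): ", round(beta_fe, 4))
print("male coefficient:", round(male_coef, 4))
```

The gender coefficient is still only credible if, after conditioning on the covariate means, gender is unrelated to what remains of the individual effect.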

— Martin Paul