## Wednesday, 28 February 2018

### Panel Data Analysis (Lecture 2): How to Perform the Hausman Test in Stata

Introduction to Panel Data Models

The panel data approach pools time series data with cross-sectional data. Depending on the application, it can comprise a sample of individuals, firms, countries, or regions over a specific time period. The general structure of such a model can be expressed as follows:

$$Y_{it} = a_0 + bX_{it} + u_{it}$$

where $u_{it} \sim IID(0, \sigma^2)$, $i = 1, 2, \ldots, N$ indexes the individual-level observations, and $t = 1, 2, \ldots, T$ indexes the time series observations.

In this application, it is assumed that $Y_{it}$ is a continuous variable. The observations for each individual, firm or country are simply stacked over time on top of one another. This is the standard pooled model, in which the intercept and slope coefficients are homogeneous across all N cross-sections and through all T time periods. Applying OLS to this model ignores the temporal and cross-sectional (spatial) structure inherent in the data and thus throws away useful information. The temporal dimension captures the ‘within’ variation in the data, while the cross-sectional dimension captures the ‘between’ variation. The pooled OLS estimator exploits both dimensions but does not do so efficiently, because each observation is given equal weight in estimation. In addition, the unbiasedness and consistency of the estimator require that the explanatory variables are uncorrelated with any omitted factors. These limitations prompted interest in alternative procedures. There are a number of different panel estimators, but the most popular is the fixed effects (or ‘within’) estimator.
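
The ‘between’ and ‘within’ variation described above can be inspected directly before any model is chosen. The lines below are a minimal sketch, assuming a dataset is already in memory and reusing the placeholder names (id, year, y, x1 to x4) that appear in the Stata steps later in this post.

```stata
* Assumes a panel dataset is already loaded; id, year, y and x1-x4 are placeholders
xtset id year

* Report overall, between and within standard deviations for each variable
xtsum y x1 x2 x3 x4

* Pooled OLS: stacks all observations and ignores the panel structure
regress y x1 x2 x3 x4
```

A variable whose ‘within’ standard deviation is close to zero barely changes over time, which matters for the fixed effects estimator discussed next.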

Fixed Effects or Random Effects?
A common question is which econometric model an investigator should use when modelling with panel data. The different models can generate considerably different results, as documented in many empirical studies. Assuming for simplicity that time effects are absent, the model to be estimated is:

$$Y_{it} = a_i + bX_{it} + u_{it}$$

The question, therefore, is whether to treat $a_i$ as fixed or random. The following points are worth noting.

- The estimation of the fixed effects model is costly in terms of degrees of freedom. This is a statistical cost, not a computing cost. It is particularly problematic when N is large and T is small, which characterizes most panel data applications currently encountered.
- The $a_i$ terms are taken to characterize (for want of a better expression) investigator ignorance. In the fixed effects model, does it make sense to treat one type of investigator ignorance ($a_i$) as fixed but another ($u_{it}$) as random?
- The fixed effects formulation is viewed as one where investigators make inferences conditional on the fixed effects in the sample.
- The random effects formulation is viewed as one where investigators make unconditional inferences with respect to the population of all effects.
- The random effects formulation treats the random effects as independent of the explanatory variables (i.e. $E(a_i X_{it}) = 0$). Violation of this assumption leads to bias and inconsistency in the estimated b vector.

The main advantage of the fixed effects model is its relative ease of estimation and the fact that it does not require independence of the fixed effects from the other included explanatory variables. The main disadvantage is that it requires estimation of N separate intercepts. This causes problems because much of the variation in the data may be used up in estimating these intercept terms. As a consequence, the effects of the other explanatory variables in the regression model (the b coefficients), which are often the parameters of most interest from a policy perspective, may be imprecisely estimated. The fixed effects estimator is derived using the deviations of each observation from the average value over time for its cross-sectional unit. The problem is therefore most acute when there is little movement in the characteristics over time, that is, when the variables are rarely changing or time-invariant. In the time-invariant case, the effects of these variables are eliminated from the analysis altogether.
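
To see why time-invariant variables drop out, it may help to write the ‘within’ transformation explicitly. This is a sketch in the notation of the model above, where the bar notation (an addition for this illustration) denotes the average over the T periods for unit i, e.g. $\bar{Y}_i = T^{-1}\sum_{t=1}^{T} Y_{it}$. Averaging the model over time and subtracting the result from the original equation gives:

$$\bar{Y}_i = a_i + b\bar{X}_i + \bar{u}_i$$

$$Y_{it} - \bar{Y}_i = b\,(X_{it} - \bar{X}_i) + (u_{it} - \bar{u}_i)$$

The $a_i$ terms cancel, so they need not be independent of $X_{it}$. For a time-invariant regressor, however, $X_{it} = \bar{X}_i$ in every period, so $X_{it} - \bar{X}_i = 0$ and its coefficient cannot be estimated.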

The main advantage of the random effects estimator is that it uses up fewer degrees of freedom in estimation and allows for the inclusion of time-invariant covariates. The main disadvantage of the model is the assumption that the random effects are independent of the included explanatory variables. It is quite plausible that there are unobservable attributes not included in the regression model that are correlated with the observable characteristics. This procedure, unlike fixed effects, does not allow for the elimination of the omitted heterogeneous effects.
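
One common textbook way to see the difference is that random effects estimation can be written as OLS on partially demeaned data. The expression below is a sketch of that standard result rather than something taken from the lecture notes; $\sigma_a^2$ (the variance of the random effects) and $\theta$ are symbols introduced here for illustration, and $\sigma_u^2$ is the variance of $u_{it}$.

$$Y_{it} - \theta\bar{Y}_i = a_0(1 - \theta) + b\,(X_{it} - \theta\bar{X}_i) + \text{error}, \qquad \theta = 1 - \sqrt{\frac{\sigma_u^2}{\sigma_u^2 + T\sigma_a^2}}$$

With $\theta = 0$ this reduces to pooled OLS, and with $\theta = 1$ it reduces to the ‘within’ estimator. Because $\theta$ lies strictly between 0 and 1 when both variances are positive, a time-invariant covariate is scaled by $(1 - \theta)$ rather than wiped out, but the omitted effect is also only partially removed.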

The Hausman Test
In determining which model is the more appropriate to use, a statistical test can be implemented. The Hausman test compares the random effects estimator to the ‘within’ estimator. If the null is rejected, this favours the ‘within’ estimator’s treatment of the omitted effects (i.e., it favours the fixed effects but only relative to the random effects). The use of the test in this case is to discriminate between a model where the omitted heterogeneity is treated as fixed and correlated with the explanatory variables, and a model where the omitted heterogeneity is treated as random and independent of the explanatory variables.

- If the omitted effects are uncorrelated with the explanatory variables, the random effects estimator is consistent and efficient. The fixed effects estimator is also consistent but is not efficient, given the estimation of a large number of additional parameters (i.e., the fixed effects).
- If the effects are correlated with the explanatory variables, the fixed effects estimator is consistent but the random effects estimator is inconsistent.

The Hausman test provides the basis for discriminating between these two models. The matrix version of the Hausman test is expressed as:

$$[b_{RE} - b_{FE}]\,[V(b_{FE}) - V(b_{RE})]^{-1}\,[b_{RE} - b_{FE}]' \sim \chi^2_k$$

where k is the number of covariates (excluding the constant) in the specification. If the random effects are correlated with the explanatory variables, then there will be a statistically significant difference between the random effects and the fixed effects estimates. Thus, the null and alternative hypotheses are expressed as:

H0: Random effects are independent of explanatory variables
H1: H0 is not true.

The null hypothesis is the random effects model, and if the test statistic exceeds the relevant critical value, the random effects model is rejected in favour of the fixed effects model. In finite samples the difference between the two variance-covariance matrices may fail to be positive definite (it may be negative definite or negative semi-definite), in which case the test statistic can be negative and cannot be interpreted as a chi-squared value.
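
With a single covariate (k = 1) the matrix expression collapses to a scalar, which makes the mechanics easier to follow. The numbers below are made up purely for illustration and do not come from any dataset.

$$H = \frac{(b_{RE} - b_{FE})^2}{V(b_{FE}) - V(b_{RE})} \sim \chi^2_1$$

Suppose $b_{FE} = 0.50$, $b_{RE} = 0.40$, $V(b_{FE}) = 0.004$ and $V(b_{RE}) = 0.002$. Then:

$$H = \frac{(0.40 - 0.50)^2}{0.004 - 0.002} = \frac{0.01}{0.002} = 5.0$$

Since 5.0 exceeds the 5% critical value of a $\chi^2_1$ distribution (3.84), the random effects model would be rejected in this hypothetical case. The same formula also shows the finite-sample problem: if the estimated $V(b_{FE})$ happened to be smaller than $V(b_{RE})$, the denominator, and hence $H$, would be negative.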

The selection of one model over the other might be dictated by the nature of the application. For example, if the cross-sectional units are countries or states, it may be plausible to assume that the omitted effects are fixed in nature and not the outcome of a random draw. If, instead, we are dealing with a sample of individuals or firms drawn from a population, the random effects model has greater appeal. Ultimately, however, the choice of model is dictated empirically. If it does not prove possible to discriminate between the two models on the basis of the Hausman test, it may be safest to use the fixed effects model: it remains consistent when the effects are correlated with the explanatory variables, whereas the random effects model yields inconsistent estimates when its independence assumption fails. Of course, if the random effects are found to be independent of the covariates, the random effects model is the more appropriate because it provides a more efficient estimator than the fixed effects estimator.

Note: This tutorial is culled from my lecture notes, as given by Prof. Barry Reilly (Professor of Econometrics, University of Sussex, UK).

How to Perform the Hausman Test in Stata
First: Open a log file, load the data into Stata, and work from a do-file (so that your research can be replicated)
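
The first step might look like the following at the top of a do-file. This is only a sketch: the file names hausman_test.log and panel_data.dta are hypothetical and should be replaced with your own.

```stata
* Close any log that is still open, then start a new plain-text log
capture log close
log using hausman_test.log, replace text

* Load the panel dataset (hypothetical file name) and list its variables
use panel_data.dta, clear
describe
```

Ending the do-file with `log close` saves the log once the analysis has run.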

Second: Declare the data to be a panel, with 'id' as the cross-sectional indicator and 'year' as the time period indicator, to prepare for panel data analysis.
xtset id year

Third: Create year dummies (to capture time variations in the data)
tab year, gen(yr)
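
As an alternative to generating the dummies by hand, Stata's factor-variable notation (available from Stata 11 onwards) can create the year indicators on the fly. The line below is a sketch using the same placeholder variable names as the rest of this post.

```stata
* Assumes the panel has been declared with: xtset id year
* i.year expands into a set of year indicators and omits a base year automatically
xtreg y x1 x2 x3 x4 i.year, fe
```

This avoids having to know the name of the last generated dummy, at the cost of the output listing years by value instead of by yr2, yr3 and so on.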

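Fourth: Run the fixed effects model and store the results. Here yr... stands for the last year dummy created in the previous step. The panel identifier was already declared with xtset, so it need not be repeated. Note that eststo is part of the user-written estout package, which must be installed before use.
eststo fixed: xtreg y x1 x2 x3 x4 yr2-yr..., fe

Fifth: Run the random effects model and store the results
eststo random: xtreg y x1 x2 x3 x4 yr2-yr..., re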

Sixth: Run the Hausman test
hausman fixed random
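
If the estout package is not installed, the same sequence can be run with Stata's built-in estimates store command. The sketch below uses the placeholder variable names from this post and factor-variable year dummies; the sigmamore option is an addition, not part of the original steps.

```stata
* Assumes the panel dataset is already loaded
xtset id year

* Fixed effects (consistent under both hypotheses): estimate and store
xtreg y x1 x2 x3 x4 i.year, fe
estimates store fixed

* Random effects (efficient under the null): estimate and store
xtreg y x1 x2 x3 x4 i.year, re
estimates store random

* Consistent estimator first, efficient estimator second
* sigmamore bases both covariance matrices on the efficient estimator's
* disturbance variance
hausman fixed random, sigmamore
```

Using a common variance estimate makes a non-positive-definite difference in the covariance matrices, the finite-sample problem mentioned earlier, much less likely. The order of the two names matters: the always-consistent estimator (fixed effects) goes first.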

Seventh: Interpret the results. Reject the null hypothesis if the p-value (the Prob>chi2 value in the Stata output) is below 0.05, that is, if the test statistic is significant at the 5% level. Rejection implies that the individual effects ($a_i$) are correlated with the explanatory variables, so use the fixed effects estimator for the analysis. Otherwise, use the random effects estimator.
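
The decision rule can also be applied inside a do-file using the results that hausman leaves behind in r(). This is a minimal sketch, assuming estimates named fixed and random have already been stored as in the earlier steps.

```stata
* Assumes estimates named fixed and random are already stored
hausman fixed random

* Show the statistic, its degrees of freedom and the p-value
display "chi2(" r(df) ") = " %8.3f r(chi2) "   p-value = " %6.4f r(p)

if missing(r(p)) {
    display "Test statistic not interpretable: inspect the hausman output"
}
else if r(p) < 0.05 {
    display "Reject H0 at the 5% level: prefer the fixed effects estimator"
}
else {
    display "Fail to reject H0: prefer the random effects estimator"
}
```

The first branch guards against the case where no valid p-value is available, for example when the statistic is negative.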
