## Monday, 8 January 2018

### Heteroscedasticity

Heteroscedasticity: try pronouncing it the way I do, he-te-ro-sce-das-ti-ci-ty. You see, it isn’t so difficult to pronounce after all. It happens to be the longest word in the econometrics dictionary, with 18 letters…yes, 18 letters! It can also be written as heteroskedasticity; whichever spelling you choose is fine, only be consistent with your choice. I will be sticking to heteroscedasticity…

Perhaps you have heard the word before, so what exactly is heteroscedasticity? It may seem like a mouthful, but the concept is very simple to grasp. It refers to disturbances (errors) whose variances are not constant in a given model, that is, the variance of the error term differs across observations. Put differently, the data show unequal variability (dispersion) across the values of the predictor variables. And what are disturbances? They are simply the error terms of the model: the part of the dependent variable that the explanatory variables do not explain. You may begin to think at this point that econometrics has tons of jargon, and yes, you are absolutely right. But relax, you will understand the terms as you become more involved in its processes.

So, again, what does heteroscedasticity mean? It means that the error variances in a given model are not constant across observations, which matters because ordinary least squares (OLS) assumes that they are. That is, one of the assumptions of OLS is that the errors are homoscedastic (have constant variance).
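In symbols, and assuming a notation the post itself does not introduce ($u_i$ for the error of observation $i$, $x_i$ for its regressors, $\sigma^2$ for a variance), the two cases can be sketched as:

```latex
\begin{align*}
\text{Homoscedasticity:}   \quad & \operatorname{Var}(u_i \mid x_i) = \sigma^2   && \text{for all } i \\
\text{Heteroscedasticity:} \quad & \operatorname{Var}(u_i \mid x_i) = \sigma_i^2 && \text{with } \sigma_i^2 \neq \sigma_j^2 \text{ for some } i \neq j
\end{align*}
```

The subscript on $\sigma_i^2$ is the whole story: each observation is allowed to have its own error variance.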

In the presence of heteroscedasticity:
·      OLS estimators are still linear, unbiased, consistent and asymptotically normally distributed. The regression estimates and the attendant predictions remain unbiased and consistent. But the estimators are inefficient, that is, they no longer have minimum variance in the class of linear unbiased estimators. Hence, OLS is not BLUE (Best Linear Unbiased Estimator), and the regression predictions are also inefficient, though consistent. In addition, the usual OLS standard errors are biased, which means they cannot be relied on to construct confidence intervals or to draw inferences.

·      Heteroscedasticity causes statistical inference based on the usual t and F statistics to be invalid, even in large samples, because those statistics are built on the usual standard errors. This follows from heteroscedasticity being a violation of the Gauss-Markov assumptions.
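To see why the usual standard errors go wrong, consider a sketch for the simple regression $y_i = \beta_0 + \beta_1 x_i + u_i$ (this notation is assumed here for illustration; it is not from the post). With $\operatorname{Var}(u_i \mid x_i) = \sigma_i^2$, the sampling variance of the OLS slope is:

```latex
\begin{align*}
\operatorname{Var}(\hat{\beta}_1 \mid x)
  &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sigma_i^2}{\mathrm{SST}_x^2},
  \qquad \mathrm{SST}_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\text{and, if } \sigma_i^2 = \sigma^2 \text{ for all } i: \quad
\operatorname{Var}(\hat{\beta}_1 \mid x)
  &= \frac{\sigma^2}{\mathrm{SST}_x}
\end{align*}
```

The usual OLS standard error estimates the second expression. When the error variances differ, it targets the wrong quantity, and the t and F statistics built on it inherit the problem.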

Causes of heteroscedasticity:
Okay, now that we know the presence of heteroscedasticity in a model can invalidate statistical tests of significance, it is important to know its causes. That is, what can lead to heteroscedasticity being evident in your data?
·      The presence of outliers can lead to your model becoming heteroscedastic. And what are outliers? These are simply extreme figures in your data that stand out from the rest, very obvious to the prying eye. Running simple summary statistics on your data before any regression analysis can easily reveal outliers, because they report both the minimum and maximum values of each variable. For example, you may have 30 years of inflation data for country J in which the yearly inflation figures hover around 9%, 7.5%, 8.2%, and suddenly you observe an inflation rate of 58.7%. If there is no economic phenomenon to support that outrageous figure, then 58.7% is an outlier which may cause your model to become heteroscedastic.
·      Wrongly specifying your model is another factor. This can be related to the functional form in which your model is specified. The functional form can be a log-log model (where the dependent variable and all or some of the explanatory variables are in natural logarithms, or logs for short); a log-level model (where only the dependent variable is transformed into natural logarithms and the explanatory variables are in their level forms, that is, not transformed); or a level-level model (where no variable is transformed).
·      Wrong data transformation. For instance, over-differencing a variable can be a cause. I have seen cases where a variable is stationary in levels at the 10% significance level, yet students still go ahead to difference it in order to obtain stationarity at maybe the 1% or 5% level. This is not necessary. Once your variable is stationary in levels, that is, an I(0) series, just go ahead and run your analysis. Note that differencing the variable further may lead to heteroscedasticity.
·      A poor data sampling method may lead to heteroscedasticity, particularly when collecting primary data.
·      Skewness of one or more regressors (closely related to outliers being evident in the data). Regressors are explanatory or independent variables.
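A quick way to screen for outliers and skewness before running any regression can be sketched in Stata as follows. The variable names inflation and year are hypothetical (they mirror the country J example above), and the cut-off of 50 is only illustrative:

```stata
* Hypothetical dataset in memory with variables: year, inflation

* Detailed summary: minimum, maximum, percentiles, skewness and kurtosis
summarize inflation, detail

* List the suspicious observations (50 is an arbitrary, illustrative cut-off);
* missing values count as very large numbers in Stata, so exclude them
list year inflation if inflation > 50 & !missing(inflation)

* A box plot makes values far from the bulk of the data easy to spot
graph box inflation
```

Whether a flagged value is a data error or a genuine economic event is a judgment call that the summary statistics cannot make for you.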

Detecting heteroscedasticity
Now that we know what heteroscedasticity is and what causes it, how can it be detected? The truth is that there is no hard and fast rule for detecting heteroscedasticity. Therefore, more often than not, suspecting heteroscedasticity may be a case of educated guesswork, prior empirical experience or mere speculation. However, several formal and informal approaches can be used to detect it. The discussion here is limited to the graphical approach (plotting the residuals from the regression against the estimated dependent variable), the Breusch-Pagan test and the White test.

So, let us take an example using JM Wooldridge’s GPA3.dta or GPA3.xls data to make this topic clearer. (Use the .xls file if Stata is not installed on your device and run the analysis using any econometric software.)

[Figure: Regression output in Stata. Source: CrunchEconometrix]

From the regression output, the F-statistic is significant at the 1% level, and the R² reveals that about 48% of the variation in the dependent variable is explained by the independent variables.
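For readers following along in Stata, the regression behind that output can be sketched as below. The variable list is taken from the command shown later in this post; the file location is an assumption (the file is taken to be in the current working directory).

```stata
* Assumes GPA3.dta sits in the current working directory
use GPA3.dta, clear

* Alternative if only the spreadsheet is available
* (assumes the variable names are in the first row of the sheet):
* import excel using "GPA3.xls", firstrow clear

* OLS with the usual (non-robust) standard errors
regress trmgpa crsgpa cumgpa tothrs sat hsperc female season
```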

But how do we know if this model is heteroscedastic or not?
1). Start with the informal approach, which is plotting the residuals against the fitted values using the Stata commands rvfplot or rvfplot, yline(0) to see if there is a definite pattern. If a definite pattern exists, for example a spread that widens or narrows as the fitted values increase, then the model is heteroscedastic.

rvfplot
[Figure: Residual plot in Stata. Source: CrunchEconometrix]

rvfplot, yline(0)
[Figure: Residual plot in Stata. Source: CrunchEconometrix]

From both plots, a definite pattern is observed evidencing that the model is heteroscedastic.
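The plot can also be built by hand, which makes it clear what sits on each axis. This is a minimal sketch, assuming the GPA3 data are in memory; the variable names yhat, uhat and uhat2 are made up for illustration. Note that rvfplot shows the residuals themselves, while this version plots their squares, which should be roughly flat across the fitted values if the variance is constant.

```stata
* Re-run the regression so that predict uses these estimates
regress trmgpa crsgpa cumgpa tothrs sat hsperc female season

* Fitted values and residuals (new variable names are illustrative)
predict yhat, xb
predict uhat, residuals

* Squared residuals as a rough proxy for the error variance
generate uhat2 = uhat^2

* Squared residuals against fitted values
scatter uhat2 yhat
```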

2). Conduct either the Breusch-Pagan or the White heteroscedasticity test after your regression to check whether the residuals have a changing variance. The Stata commands are estat hettest and estat imtest, white. If the obtained p-value is significant (below your chosen significance level, for example 0.05), then the model exhibits heteroscedasticity; otherwise, there is no evidence against homoscedasticity.

estat hettest
Breusch-Pagan/Cook-Weisberg test for heteroscedasticity
Ho: Constant variance
Variables: fitted values of trmgpa

chi2(1)      =    14.12
Prob > chi2  =   0.0002

estat imtest, white
White's test for Ho: homoscedasticity
against Ha: unrestricted heteroscedasticity

chi2(33)     =     61.22
Prob > chi2  =    0.0020
The null hypothesis of both tests is that the model is homoscedastic. Since the p-values of both tests are significant (0.0002 and 0.0020), the null hypothesis is rejected in favour of the alternative, evidencing that the model is heteroscedastic.
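To demystify what the Breusch-Pagan test does, here is a hand-rolled sketch of one common variant: regress the squared residuals on the regressors and compute LM = n × R², which is compared with a chi-squared distribution. The names uhat_bp, uhat_bp2, lm and pval are made up for illustration. This variant uses all right-hand-side variables and does not assume normal errors, so it is not expected to reproduce the chi2(1) figure above, which is based on the fitted values.

```stata
* Step 1: original regression and its residuals
quietly regress trmgpa crsgpa cumgpa tothrs sat hsperc female season
predict uhat_bp, residuals
generate uhat_bp2 = uhat_bp^2

* Step 2: auxiliary regression of squared residuals on the regressors
quietly regress uhat_bp2 crsgpa cumgpa tothrs sat hsperc female season

* Step 3: LM statistic = n * R-squared, chi-squared with df = number of regressors
scalar lm   = e(N) * e(r2)
scalar pval = chi2tail(e(df_m), lm)
display "LM = " lm "   p-value = " pval

* Built-in counterpart of this variant (run after the original regression)
quietly regress trmgpa crsgpa cumgpa tothrs sat hsperc female season
estat hettest, rhs iid
```

A small p-value is read the same way as before: reject constant variance.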

Controlling/Correcting heteroscedasticity
As a precaution, it is advisable to run your analysis using White’s heteroscedasticity-robust standard errors by including the robust option in the command line, as in this example:

reg trmgpa crsgpa cumgpa tothrs sat hsperc female season, robust

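With the robust option, the coefficient estimates are the same as without it, but the standard errors (and hence the t statistics, p-values and confidence intervals) are computed in a way that remains valid under heteroscedasticity. In this sense the problem of heteroscedasticity is controlled, compared with when the robust option is not used.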
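One way to see the effect of the option is to place both sets of standard errors side by side. The following is a sketch, assuming the GPA3 data are in memory; the stored-estimate names ols_usual and ols_robust are made up for illustration, and vce(robust) is the longer spelling of the robust option.

```stata
* Usual standard errors
quietly regress trmgpa crsgpa cumgpa tothrs sat hsperc female season
estimates store ols_usual

* Heteroscedasticity-robust standard errors
quietly regress trmgpa crsgpa cumgpa tothrs sat hsperc female season, vce(robust)
estimates store ols_robust

* Coefficients with standard errors underneath, one column per model
estimates table ols_usual ols_robust, b(%9.4f) se(%9.4f)
```

The coefficient rows should be identical across the two columns; only the standard errors change.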

Assignment: Using Wooldridge’s hprice1.dta or hprice1.xls data, how can you detect if the model is heteroscedastic and how will you correct it? Compare the usual standard errors with the obtained heteroscedasticity-robust standard errors. What do you observe?

So, with this brief and practical tutorial, you can confidently run your regressions and test whether your model suffers from heteroscedasticity or not. Good luck!

Post your comments and questions….