Friday, 12 January 2018

Tell me, what is econometrics?

This is the first question I asked the very day this website came online… so, I ask again: what is econometrics? It simply refers to measuring economic phenomena. The “econo” part refers to economic events while the “metrics” part is the measurement. So, econometrics is the process of measuring economic scenarios. Measurement is therefore the defining distinction between econometrics and other related disciplines like statistics, mathematics and mathematical economics.

I have come across other supportive definitions of econometrics as the discipline most concerned with quantitative analysis: a discipline built around empirical testing to either validate or refute economic theory. That means an econometrician must be quite proficient with the use of data and must devise ways of making the data communicate. Communication is germane in the sense that a large pool of data (whether quantitative or qualitative) will not make much sense to policy makers if the data do not say anything, point out any prevailing scenarios or assist in planning and forecasting. So, am I implying that data talk? Oh yes, they do… all the time, if an econometrician has a prevailing theory and knows the appropriate model to deploy.

Next, I will briefly explain the pillars that hold econometrics, so that starters will know what basic skills to acquire.

Econometrics is a combination of economic theory, mathematical economics, economic statistics and mathematical statistics. Each of these is a benchmark that an econometrician must be familiar with.

So, what is an economic theory? Theories are simply hypotheses, conjectures, assumptions and ideas that describe how economies operate. Since theories are often qualitative in nature, it is the econometrician who validates them through empirical testing. For instance, an economic theory says that the higher the price, the lower the quantity demanded of a commodity. This is an economic theory stating that price and quantity demanded exhibit a negative relationship. The econometrician then takes it from there by analysing data on both prices and quantities demanded to observe what the outcome will be… in effect, verifying/validating or disproving what the economic theory says.

A mathematical economist, on the other hand, expresses economic theories in mathematical form, that is, using equations. Given the economic theory stated above, the mathematical economist will express it as: Q = a − bP or, solving for price, P = (a − Q)/b. The first equation is known as the demand function while the second is the inverse demand function. The −b sign indicates the negative slope of the demand function, which emanates from economic theory. The mathematical economist is not involved in measuring the impact of a price increase on quantity demanded or in the empirical investigation of economic theory; again, this is where the econometrician comes in, using these mathematical equations to either validate or disprove theory via empirical testing.
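To see how the econometrician turns that equation into an empirical test, here is a minimal sketch in Python (not part of the original post; the data, the true values a = 100 and b = 2, and the noise level are all made up for illustration). It simulates a demand relationship and recovers the negative slope by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demand function Q = a - b*P with a = 100, b = 2, plus noise
a_true, b_true = 100.0, 2.0
price = rng.uniform(1, 20, size=200)
quantity = a_true - b_true * price + rng.normal(0, 3, size=200)

# Fit Q on P by ordinary least squares (np.polyfit with degree 1)
slope, intercept = np.polyfit(price, quantity, 1)
print(f"estimated demand: Q = {intercept:.1f} {slope:+.2f}*P")
# A negative estimated slope "validates" the theory that price and
# quantity demanded move in opposite directions.
```

The estimated slope will sit close to −2, which is exactly the kind of empirical verification of theory the post describes.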

What of an economic statistician? Such a person is concerned only with the collection and descriptive presentation of data. The main tools used are charts, tables and graphs. Again, from the price and quantity example, an economic statistician goes to the field, collects data on prices and quantities and presents them using pie charts, bar charts, histograms and line graphs as pictorial illustrations of the two variables. This primary level of data communication is very relevant to policy makers as it shows the relationship between the two variables. However, since the economic statistician is not involved in empirical testing, it is left for the econometrician to take the data collected and subject it to tests using the several econometric tools and models at his/her disposal.

Lastly, the mathematical statistician provides the tools used by the econometrician. They construct the programmes and methods used in econometric analysis.

I conclude that a starting econometrician must have some basic knowledge of each of the sub-fields explained above. Econometrics is not difficult; it is interesting. Begin by learning the fundamentals, from the tools used to modelling, analysis, hypothesis testing, interpretation and forecasting. Stay with me on this platform as I patiently teach you this subject and it will not take long for you to graduate to the intermediate and complex stuff.

So, let us keep it simple and take it one step at a time 😊

Again, I ask you: what is econometrics? Let me know what other definitions you can come up with.

Post your comments and questions…

Wednesday, 10 January 2018


Multicollinearity, pronounced mul-ti-co-lli-nea-ri-ty, is the second longest word in the econometrics dictionary after heteroscedasticity. It contains 17 letters! It occurs when there is perfect (exact) or near-perfect linear dependence among the explanatory variables in a given model. Collinearity is when such dependence is between two variables; in that case, we say that the variables are collinear. For instance, when an explanatory variable is 80 to 100% explained by another explanatory variable, separating the influence of each of them on the dependent variable (regressand) becomes difficult, and interpreting the estimated coefficients from that model will also be problematic. This is because variation in one regressor can be almost completely explained by another regressor in the same model.

With perfect multicollinearity:

Ø  Regression coefficients are indeterminate (the collinear variables cannot be distinguished from one another)

Ø  Standard errors are infinite

With high but less-than-perfect multicollinearity:

Ø  Estimates remain unbiased, but their variances and standard errors are very large

Ø  Coefficients cannot be estimated with precision or accuracy

Note: high (but imperfect) multicollinearity does not violate any regression assumptions; the OLS estimators are still BLUE (Best Linear Unbiased Estimators) and retain the minimum-variance property. The problem is practical, not theoretical: the estimates are simply too imprecise to be useful.

Multicollinearity can be detected using r, the coefficient of correlation. If r = 1 between two regressors, perfect collinearity exists. So whenever you run your correlation matrix, look out for those relationships where r > 0.8; that tells us that the respective variables are collinear. Note that multicollinearity concerns linear relationships only; regressors related in a purely non-linear way are not collinear in this sense.
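The correlation-matrix check can be sketched as follows in Python with pandas (the post itself works in Stata; the variables x1, x2, x3 and the simulated data here are my own illustration, with x2 deliberately built as a near-copy of x1):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# x2 is constructed to be almost a linear copy of x1 -> near-perfect collinearity
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)               # an unrelated regressor

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
corr = X.corr()
print(corr.round(2))

# Flag any pair with |r| > 0.8 -- the rule of thumb from the post
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)
pairs = [(i, j) for i in corr for j in corr if i < j and high.loc[i, j]]
print("collinear pairs:", pairs)
```

Only the (x1, x2) pair gets flagged, which is the signal that those two should not enter the same regression together.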

A major problem associated with multicollinearity is that, if r is high, the standard errors will be high and the computed t-statistics will be low, making it more likely that you fail to reject the null hypothesis when it is false, thereby committing a Type II error… that is, incorrectly retaining a false null hypothesis.

How do you know if your model suffers from multicollinearity? 

Ø  High R2

Ø  Few significant t-ratios

Ø  Wider confidence intervals

Ø  Beta coefficients whose signs contradict a priori expectations

Ø  Estimates are sensitive to even small changes in model specification

Ø  High pair-wise correlation statistic among the regressors

Ø  From the tolerance level and variance inflation factor (VIF): a tolerance below 0.10 or a VIF above 10 is indicative of multicollinearity in a model. The higher the VIF, the stronger the evidence of multicollinearity.

Correcting/controlling for multicollinearity:

Ø  Collect more data

Ø  Change the scope of analysis

Ø  Do not include collinear variables in the same regression

Ø  Drop the highly collinear variable

Ø  Transform the collinear variable through differencing (however, the differenced error term is serially-correlated and violates OLS assumptions).

What I often do is drop the collinear variable; if that variable is very important to my model, I restructure my modelling in a step-wise fashion so that collinear variables are not included together in the same regression.

[Watch video on multicollinearity]

Post your comments and questions….


Monday, 8 January 2018


Heteroscedasticity… you can try to pronounce it the way I do: he-te-ro-sce-das-ti-ci-ty. You see, it isn’t so difficult to pronounce after all. It happens to be the longest word in the econometrics dictionary with 18 letters… yes, 18 letters! It can also be written as heteroskedasticity; whichever way you choose to write it is fine, only be consistent with your choice. I will be sticking to heteroscedasticity…

Perhaps you have heard the word before, so what exactly is heteroscedasticity? It may seem like a mouthful but the concept is very simple to grasp. It refers to disturbances (errors) whose variances are not constant in a given model. It is when the variance of the error terms differs across observations, that is, when the data have unequal variability (dispersion) across the range of values of a predictor variable. And what are disturbances? They are the error terms of the model. You may begin to think at this point that econometrics has tons of jargon; yes, you are absolutely right. But relax, you will understand the terms as you become more involved in its processes.

So, what does the opposite, homoscedasticity, mean? It means that in a given model the error variances are constant across observations. One of the assumptions of ordinary least squares (OLS) is that the model must be homoscedastic; heteroscedasticity is the violation of that assumption.

In the presence of heteroscedasticity:
·      OLS estimators are still linear, unbiased, consistent and asymptotically normally distributed. The regression estimates and the attendant predictions remain unbiased and consistent. But the estimators are inefficient (that is, they no longer have minimum variance) in the class of linear unbiased estimators. Hence, OLS is not BLUE (Best Linear Unbiased Estimator), and the regression predictions are also inefficient, though consistent. What this means is that the usual standard errors cannot be used to construct valid confidence intervals or to make inferences.

·      Heteroskedasticity causes statistical inference based on the usual t and F statistics to be invalid, even in large samples. As heteroskedasticity is a violation of the Gauss-Markov assumptions, OLS is no longer BLUE.

Causes of heteroscedasticity:
Okay, having known that the presence of heteroscedasticity in a model can invalidate statistical tests of significance, it is important to know its causes. That is, what can lead to heteroscedasticity being evident in your data?
·      The presence of outliers can lead to your model becoming heteroscedastic. And what are outliers? These are simply figures in your data that stand out, very obvious to prying eyes. Computing simple summary statistics on your data before any regression analysis can easily detect outliers by showing the minimum and maximum values of each variable. For example, you may have 30 years of inflation data for country J: on average, the yearly inflation figures hover around 9%, 7.5% and 8.2%, and suddenly you observe an inflation rate of 58.7%. If there is no economic phenomenon to support that outrageous figure, then 58.7% is an outlier which may cause your model to become heteroscedastic.
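That summary-statistic screening can be sketched in a few lines of Python with pandas (the inflation series below is hypothetical, echoing the country J example; the 1.5×IQR fence is a standard rule of thumb, not something prescribed by the post):

```python
import pandas as pd

# Hypothetical inflation series for "country J"; 58.7 is the planted outlier
inflation = pd.Series([9.0, 7.5, 8.2, 8.9, 7.8, 58.7, 8.4, 9.1, 7.9, 8.6])

print(inflation.describe())   # the max of 58.7 stands out immediately

# A robust flag: values beyond 1.5 * IQR outside the quartiles
q1, q3 = inflation.quantile([0.25, 0.75])
iqr = q3 - q1
flagged = inflation[(inflation < q1 - 1.5 * iqr) | (inflation > q3 + 1.5 * iqr)]
print("possible outliers:", flagged.tolist())
```

The quartile-based fence is used here because a mean-and-standard-deviation rule can be distorted by the very outlier it is trying to find.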
·      Wrongly specifying your model is another factor. This can be related to the functional form by which your model is specified. Functional form can be a log-log model (where the dependent variable and all or some of the explanatory variables are in natural logarithms or logs for short); a log-level model (where only the dependent variable is transformed into natural logarithm and the explanatory variables are in their level forms, that is, not transformed); lastly is the level-level form.
·      Wrong data transformation, for instance over-differencing a variable. If a variable is stationary in levels at the 10% significance level, for example, I have seen cases where students still go ahead to difference the same variable in order to obtain stationarity at the 1% or 5% level. This is not necessary. Once your variable is stationary in levels, that is, an I(0) series, just go ahead and run your analysis. Note that differencing the variable further may lead to heteroscedasticity.
·      Poor data sampling method may lead to heteroscedasticity particularly when collecting primary data.
·      Skewness of one or more regressors (closely related to outliers being evident in the data). Regressors are explanatory or independent variables.

Detecting heteroscedasticity
Having known what heteroscedasticity is and its causes, how can it be detected? The truth is that there is no hard and fast rule for detecting heteroscedasticity. Therefore, more often than not, detecting it may be a case of educated guesswork, prior empirical experience or mere speculation. However, several formal and informal approaches can be used to detect the presence of heteroscedasticity, but the discussion here will be limited to the graphical approach (plotting the residuals from the regression against the fitted values of the dependent variable), the Breusch-Pagan test and the White test.

So, let us take an example using JM Wooldridge’s GPA3.dta or GPA3.xls data to make this topic clearer. (Use .xls if Stata is not installed on your device and run the analysis using any econometric software.)

[Figure: Regression output in Stata. Source: CrunchEconometrix]
From the regression output, the F-statistic is significant at the 1% level, and the R2 reveals that about 48% of the variation in the dependent variable is explained by the independent variables.

But how do we know if this model is heteroscedastic or not?
1). Start with the informal approach, which is plotting the residuals against the fitted values using the Stata commands rvfplot or rvfplot, yline(0) to see if there is a definite pattern. If a definite pattern exists, then the model is heteroscedastic.


[Figure: Residual plot in Stata (rvfplot). Source: CrunchEconometrix]

rvfplot, yline(0)

[Figure: Residual plot in Stata (rvfplot, yline(0)). Source: CrunchEconometrix]

From both plots, a definite pattern is observed, evidencing that the model is heteroscedastic.

2). Conduct either the Breusch-Pagan or the White heteroscedasticity test after your regression to check if the residuals have a changing variance. The Stata commands are: estat hettest and estat imtest, white. If the obtained p-values are significant, then the model exhibits heteroscedasticity; otherwise, the model is homoscedastic.

estat hettest
Breusch-Pagan/Cook-Weisberg test for heteroscedasticity
         Ho: Constant variance
         Variables: fitted values of trmgpa

         chi2(1)      =    14.12
         Prob > chi2  =   0.0002

estat imtest, white
White's test for Ho: homoscedasticity
         against Ha: unrestricted heteroscedasticity

         chi2(33)     =     61.22
         Prob > chi2  =    0.0020
//the null hypotheses for both tests are that the model is homoscedastic. But since the p-values for both tests are significant, the null hypothesis is rejected in favour of the alternative hypothesis evidencing that the model is heteroscedastic

Controlling/Correcting heteroscedasticity
As a practical safeguard, it is advisable to run your analysis using White’s heteroscedasticity-robust standard errors by including the robust option in the command line, as in this example:

reg trmgpa crsgpa cumgpa tothrs sat hsperc female season, robust

With the robust option, inference remains valid in the presence of heteroscedasticity, which is not the case when the usual (non-robust) standard errors are used. Note that the coefficient estimates themselves do not change; only the standard errors do.

Assignment: Using Wooldridge’s hprice1.dta or hprice1.xls data, how can you detect if the model is heteroscedastic and how will you correct it? Compare the usual standard errors with the obtained heteroscedasticity-robust standard errors. What do you observe?

So, with this brief and practical tutorial, you can confidently run your regressions and test if your model suffers from heteroscedasticity or not….good luck!

Post your comments and questions….