Monday 12 February 2018

Excel: How to Interpret Regression Output


The dissertation buzz is on and students are doing everything possible to meet the deadline. This is the current atmosphere in tertiary institutions, at least for those with an undisrupted academic calendar :). The students are at different stages of their project, as it is commonly called. Some are yet to wrap up chapter one, which gives the study background and the framing of the research hypotheses, objectives and questions. Some have moved on to chapter two, reviewing literature relevant to their scope of study. Others have gone further, developing both the theoretical and empirical frameworks for chapter three, though not without the usual teething lags…but they'll get around them, somehow :). A handful have done even better by progressing to chapter four and attempting to analyse their data.

While chapters one to three are specific to each student's scope of research, a regression output is common to all (although the actual results differ). It is for this reason that I decided to do a tutorial explaining the basic features of a regression output. This write-up also responds to requests from readers on (1) what some specific figures in a regression output are and (2) how to interpret them. Let me state here that, regardless of the analytical software, whether Stata, EViews, SPSS, R, Python, Excel, etc., the features of a regression output are common to all packages (except for slight variations).

For instance, in undertaking an ordinary least squares (OLS) estimation using any of these applications, the regression output will report the ANOVA (analysis of variance) table, F-statistic, R-squared, p-values, coefficients, standard errors, t-statistics, degrees of freedom, 95% confidence intervals and so on. These are the features of a regression output. The issue, however, is: what do these figures mean, and how can they be interpreted and related to your research?

Hence, the essence of this tutorial is to teach students the relevance of these features and how to interpret their results. I will be using Excel's Analysis ToolPak to explain a regression output, but you can practise along using any analytical package of your choice. (See the tutorials for Stata and EViews users.)

An Example: Use Gujarati and Porter Table7_12.dta or Table7_12.xlsx dataset
Note: I will not be discussing stationarity or cointegration analysis in this tutorial (that will come later). Since the issue here is how to understand the features of a regression output and interpret the results, I will simply run a simple linear regression (a bivariate analysis) with only one explanatory variable.

The dataset covers the United States from 1960 to 2009 (50 years of annual data). The outcome variable is consumption expenditure (pce) and the explanatory variable is income (income).
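
If you prefer to follow along in Python rather than Excel, here is a minimal sketch for loading the dataset. It assumes the Table7_12.xlsx file sits in your working directory and that its columns are named pce and income; adjust the path and column names to match your own copy of the Gujarati and Porter data.

    import pandas as pd

    # Load the dataset (assumed to be in the working directory with
    # columns named 'pce' and 'income' -- adjust to match your file).
    df = pd.read_excel("Table7_12.xlsx")

    print(df.head())        # first few rows
    print(df.describe())    # summary statistics for pce and income
    print(len(df))          # should be 50 observations (1960-2009)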

First step: get the Data Analysis Add-in menu
Before you begin, ensure that the Data Analysis add-in is on your toolbar, because without it you cannot run a regression. To obtain it, follow this guide:
File >> Options >> Add-ins >> the Excel Options dialog box opens
In the Manage box at the bottom, choose Excel Add-ins and click Go
In the Add-ins dialog box that opens, tick Analysis ToolPak
Click OK

If it is correctly done, you should see this:
[Figure: Excel Add-ins dialog box. Source: CrunchEconometrix]
And you will have the Data Analysis menu at the extreme top-right corner of the Data tab:
[Figure: Data Analysis menu under the Data tab. Source: CrunchEconometrix]



Second step: have your data ready
Here is the data in Excel format:
[Figure: The pce and income data in Excel. Source: CrunchEconometrix]




Third step: Visualise the relationship between the variables
Before analysing the data, it is good practice to graph the dependent variable against the key explanatory variable (using a scatter plot) in order to observe the pattern between them. This gives you an idea of what to expect from the actual analysis. Here's the procedure:
1.    Highlight the 2 columns that contain the variables
2.    Go to Insert >> Charts >> Scatter

[Figure: Scatter plot of pce and income. Source: CrunchEconometrix]
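
For readers practising along in Python instead of Excel, a rough equivalent of this scatter plot (assuming the DataFrame df with columns pce and income loaded in the earlier sketch) is:

    import matplotlib.pyplot as plt

    # Scatter plot of consumption expenditure against income.
    plt.scatter(df["income"], df["pce"], s=20)
    plt.xlabel("income")
    plt.ylabel("pce")
    plt.title("pce vs income, US 1960-2009")
    plt.show()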

The graph indicates a positive relationship between the two variables. This seems plausible because the higher your income, the higher your consumption, unless you are very frugal :).

The graph can be formatted by adding a trendline (see the video on how to do this). In Excel, adding a trendline also gives you the linear prediction:

[Figure: Scatter plot of pce and income with trendline. Source: CrunchEconometrix]

As can be seen from the second graph, we now have the linear prediction (the fitted equation) for pce together with its R².
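
In Python, the same trendline and its R² can be reproduced with a least-squares fit. This is only a sketch, again assuming the df loaded earlier:

    import numpy as np

    x = df["income"].to_numpy()
    y = df["pce"].to_numpy()

    # Fit a straight line y = b1*x + b0 by least squares (what the Excel trendline does).
    b1, b0 = np.polyfit(x, y, deg=1)
    y_hat = b0 + b1 * x

    # R-squared: share of the variation in pce explained by the fitted line.
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r_squared = 1 - rss / tss
    print(f"pce_hat = {b0:.2f} + {b1:.4f} * income, R^2 = {r_squared:.4f}")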

Fourth step: The scientific investigation
Now we want to scientifically investigate the relationship between pce and income. To do this, go to Data >> Data Analysis (a dialogue box opens) >> Regression >> OK.

[Figure: Data Analysis dialogue box. Source: CrunchEconometrix]
Once you click OK, the Regression dialogue box opens:
[Figure: Regression dialogue box. Source: CrunchEconometrix]

·      Put the data range for pce under Input Y Range
·      Put the data range for income under Input X Range
·      Check the Labels box (since the first row contains the variable names)
·      Check the Confidence Level box (the default is 95%)
·      Check Output Range and select the cell where the output should start
·      Click OK
(You have simply told Excel to regress the dependent variable, pce, on the explanatory variable, income.) The output is shown below:

[Figure: Excel regression output. Source: CrunchEconometrix]
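
If you are following along in Python, here is a sketch that produces essentially the same output as Excel's Regression tool, using statsmodels and the df assumed from the earlier sketches:

    import statsmodels.api as sm

    # Regress pce on income with an intercept, as in the Excel Regression tool.
    X = sm.add_constant(df["income"])   # adds the intercept term
    model = sm.OLS(df["pce"], X).fit()

    # The summary reports the same features: R-squared, the ANOVA-style F-statistic,
    # coefficients, standard errors, t-statistics, p-values and 95% confidence intervals.
    print(model.summary())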



Fifth step: The features of a regression output
The Excel output gives the Regression Statistics and the ANOVA table. So what do these figures mean? I will explain each feature in turn.

Under “Regression Statistics”:
R-squared: gives the proportion of the variation in pce that is explained by income. The higher the R², the better the model fits and the more predictive power it has, although an R² equal to 1 should raise suspicion. In a bivariate regression, the Multiple R reported by Excel is the correlation coefficient between the two variables, which implies that √R² = r, the correlation coefficient.
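
You can verify this relationship directly. A quick check in Python, using the df and the fitted model from the earlier sketches, might look like this:

    import numpy as np

    # In a bivariate regression, the squared correlation equals R-squared.
    r = np.corrcoef(df["income"], df["pce"])[0, 1]
    print(r, r**2)                 # r is Excel's "Multiple R"; r**2 matches R Square
    print(model.rsquared)          # statsmodels' R-squared, for comparison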


Adjusted R-squared: this is the R² adjusted for the number of explanatory variables, computed as 1 − (1 − R²)(n − 1)/(n − k), where n is the number of observations and k the number of estimated parameters. Unlike R², it does not automatically rise when more explanatory variables are added; it falls if an added variable does not improve the fit enough to justify the lost degree of freedom.

Standard Error: this is the standard error of the regression, i.e. the standard deviation of the residuals.

Observations: 50, since the data run annually from 1960 to 2009.

Under “ANOVA” (analysis of variance):
Source: there are two sources of variation in the dependent variable, pce: the variation explained by the regression (the Regression, or Model, row) and the variation due to randomness (the Residual row).

df: the degrees of freedom, calculated as k − 1 for the regression and n − k for the residuals, where n is the number of observations and k the number of estimated parameters (including the intercept). Here df = 2 − 1 = 1 for the regression and 50 − 2 = 48 for the residuals.

SS: the sum of squares. For the Regression row it is the explained variation in pce (the Explained Sum of Squares, ESS); for the Residual row it is the unexplained variation (the Residual Sum of Squares, RSS). After the regression is run, the observed values of pce do not all fall on the predicted line; the gaps between the points and the line are the residuals. The two pieces add up to the total variation in pce (the Total Sum of Squares), so TSS = ESS + RSS, and R² = ESS/TSS.

[Figure: Predicted values of pce. Source: CrunchEconometrix]
As observed from the graph, the points do not all fall on the predicted line: some lie above it, while others lie beneath it. These gaps are the residuals (in other words, what is left over after the regression). If the predicted line lies above a point, pce is over-predicted at that observation (that is, pce − pcehat is negative); if it lies beneath a point, pce is under-predicted (pce − pcehat is positive). When the model includes an intercept, the residuals sum (and hence average) to zero.

MS: the mean sum of squares, obtained by dividing each SS by its degrees of freedom, i.e. MS = SS/df.

F: tests whether the explanatory variable, income, is significant in explaining the outcome variable, pce. It is computed as the ratio of the regression MS to the residual MS. The higher the F-statistic, the stronger the evidence that the model explains pce.

Significance F: the probability value attached to the F-statistic, indicating its statistical significance. A value below 0.05 means the regression is significant at the 5% level, which is usually what you want.
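
To make the ANOVA arithmetic concrete, here is a small sketch that rebuilds df, SS, MS, F and Significance F from the fitted values. It assumes the df and model objects from the earlier sketches and simply mirrors what Excel computes:

    import numpy as np
    from scipy import stats

    n = len(df)          # 50 observations
    k = 2                # estimated parameters: intercept + income

    y = df["pce"].to_numpy()
    y_hat = model.fittedvalues.to_numpy()

    ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares (Regression SS)
    rss = np.sum((y - y_hat) ** 2)          # residual sum of squares (Residual SS)

    df_reg, df_res = k - 1, n - k           # degrees of freedom: 1 and 48
    ms_reg, ms_res = ess / df_reg, rss / df_res

    f_stat = ms_reg / ms_res
    sig_f = stats.f.sf(f_stat, df_reg, df_res)   # "Significance F" (p-value of the F test)
    print(f_stat, sig_f)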

Coefficient: this is the slope coefficient, the estimate for income. Its sign tells you the direction of the relationship: a positive (negative) sign implies a positive (negative) relationship.

Intercept: the predicted value of pce when income is zero; it is the constant term of the model.

Standard error: the standard error of the coefficient estimate. Because the slope is estimated from a sample rather than known exactly, the estimate would vary from sample to sample; the standard error measures how much the estimated slope coefficient would typically deviate across such samples.

t-stat: measures how many standard errors the coefficient is away from zero. It is obtained as coeff/std. error. A t-statistic above about 2 in absolute value is usually sufficient evidence against the null hypothesis that the true coefficient is zero.

P-value: there are several ways to read this. (1) It is the smallest significance level at which the null hypothesis can be rejected. (2) It is the probability of obtaining a slope coefficient at least as large (in absolute value) as the one estimated if the true slope were actually zero. (3) It is found from the t-distribution, using the degrees of freedom (df) and the number of standard errors the coefficient lies from zero. (4) It tells you whether the relationship is statistically significant or not.

So, if the p-value is 0.3, there is a 30% chance of seeing a slope estimate this large even when the true slope is zero, which is weak evidence that the slope is non-zero. That is not good enough: the lower the p-value, the more confidently you can reject the null hypothesis. Hence a p-value of 0.02 means there is only a 2% chance of such an estimate arising when the true slope is zero, which is far more reassuring! :)

Lower and Upper 95%: these give the 95% confidence interval for each coefficient. If the coefficient is statistically significant at the 5% level, this interval will not contain zero; if it is insignificant, the interval will straddle zero.
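
Finally, a short sketch showing where the t-statistic, p-value and 95% confidence interval for income come from, using the model fitted in the earlier sketch (the numbers should match the Excel output up to rounding):

    from scipy import stats

    coef = model.params["income"]      # slope estimate for income
    se = model.bse["income"]           # its standard error
    df_res = int(model.df_resid)       # residual degrees of freedom (n - k)

    t_stat = coef / se                              # how many standard errors from zero
    p_value = 2 * stats.t.sf(abs(t_stat), df_res)   # two-sided p-value

    t_crit = stats.t.ppf(0.975, df_res)             # critical value for a 95% interval
    lower, upper = coef - t_crit * se, coef + t_crit * se
    print(t_stat, p_value, (lower, upper))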

Assignment:
Use Gujarati and Porter Table7_12.dta or Table7_12.xlsx dataset.
(1)  With pce as the dependent variable and gdpi as the explanatory variable, plot the graph of pce against gdpi. What do you observe?
(2)  Run your regression. Can you interpret the table and the features?
(3)  Plot the predicted line. What are your observations?

I have taken you through the basic features of a regression output, using Excel's Data Analysis ToolPak to estimate an ordinary least squares (OLS) model in a simple linear regression. So you now have a basic idea of what the F-stat, t-stat, df, SS, MS, Significance F, p-values, confidence intervals, R², coefficients and standard errors stand for.


[Watch video on "How to interpret regression output in Excel"]


Practise the assignment, and if you still have further questions, kindly post them below...
