Showing posts with label students dissertation. Show all posts
Showing posts with label students dissertation. Show all posts

Monday, 12 February 2018

Excel: How to Interpret Regression Output

How to Interpret Regression Output in Excel

The dissertation buzz is on and students are doing everything possible to meet up with the deadline. This is the current atmosphere in tertiary institutions, at least for those with undisrupted academic calendar J. The students are in different stages of their project, as it is commonly called. Some are yet to wrap up their chapter one which gives the “study background” and the framing of research hypotheses, objectives and questions. Some have moved on to chapter two reviewing relevant literature related to their scope of study. Others have gone further in developing both the theoretical and empirical frameworks for chapter three, but not without the usual teething lags…but they’ll get around it, somehow J. A handful have even done better by progressing to chapter four attempting to analyse their data.

Since, chapters one to three are relative to each students’ scope of research, however, a regression output is common to all (although actual outcomes differ). It is based on this that I decided to do a tutorial in explaining the basic features in a regression output. Likewise, this write-up is in response to requests received from readers on (1) what some specific figures in a regression output are and (2) how to interpret their results. Let me state here that regardless of the analytical software whether Stata, EViews, SPSS, R, Python, Excel etc. what you obtain in a regression output is common to all analytical packages (except where slight variations occur).

For instance, in undertaking an ordinary least squares (OLS) estimation using any of these applications, the regression output will churn out the ANOVA (analysis of variance) table, F-statistic, R-squared, prob-values, coefficient, standard error, t-statistic, degrees of freedom, 95% confidence interval and so on. These are the features of a regression output. However, the issue is: what do these mean and how can they be interpreted and related to a research.

Hence, the essence of this tutorial is to teach students the relevance of these features and how to interpret their results. I will be using Excel analytical package to explain a regression output, but you can practise along using any analytical package of your choice. (See tutorial for Stata and EViews users).

An Example: Use Gujarati and Porter Table7_12.dta or Table7_12.xlsx dataset
Note: I will not be discussing stationarity or cointegration analysis in this tutorial (that will come later on). Since the issue on how to understand the features of a regression output and interpret results, I will just be doing a simple linear regression analysis (a bi-variate analysis) with only one explanatory variable.

The dataset is on the United States from 1960 to 2009 (50 years data). The outcome variable is consumption expenditure (pce) and the explanatory variable is income (income).

First step: get the Data Analysis Add-in menu
Before you begin, ensure that the DATA ANALYSIS Add-in is in your tool bar because without it, you cannot perform any regression analysis. To obtain it follow this guide:
File >> Options >> Add-ins >> Excel Options dialog box opens
Under Active Application Add-ins, choose Analysis ToolPak
In the Manage section, choose Excel Add-ins
Click Go, then OK

If it is correctly done, you should see this:
Excel Add-in Dialog Box from cruncheconometrix.com.ng
Excel Add-in Dialog Box
Source: CrunchEconometrix
And you will have the Data Analysis menu to your extreme top-right corner under Data menu:
Excel Data Analysis Menu from cruncheconometrix.com.ng



Second step: have your data ready
Here is the data in excel format:
Data in excel format from cruncheconometrix.com.ng




Third step: Visualise the relationship between the variables
Before analysing the data, it is good to always graph the dependent and key explanatory variable (using a scatter plot) in order to observe the pattern between them. This gives you what to expect in your actual analysis. Here’s the procedure:
1.    Highlight the 2 columns that contain the variables
2.    Go to Insert >> Charts >> Scatter

Excel - Scatter plot of pce and income from cruncheconometrix.com.ng
Excel - Scatter plot of pce and income
Source: CrunchEconometrix

The graph indicates a positive relationship between the two variables. This seems plausible because the higher your income, the higher will be your consumption, except you are very frugalJ.

The graph can be formatted by adding a trend line (see video on how to do this). In Excel, adding a trendline also gives you the linear prediction:

Excel - Scatter plot with trendline from cruncheconometrix.com.ng

As can be seen from the second graph, we have the linear prediction for pce and the R2.

Fourth step: The scientific investigation
Now we want to scientifically investigate the relationship between pce and income. To do this, go to Data >> Data Analysis (dialogue box opens) >> Regression >> OK. 

Excel - Data Analysis Dialogue Box from cruncheconometrix.com.ng
Excel - Data Analysis Dialogue Box
Source: Crunch Econometrix
Once you click OK, the Regression dialogue box opens:
Excel - Regression Dialogue Box from cruncheconometrix.com.ng
Excel - Regression Dialogue Box
Source: CrunchEconometrix

·      Put data range for pce under Input Y Range
·      Put data range for income under Input X Range
·      Check label box
·      Check Confidence Level box
·      Check Output range
·      Click OK
(You have simply told Excel to regress the dependent variable, pce, on the explanatory variable, income), and the output is shown as:

Excel - Regression Output from cruncheconometrix.com.ng



Fifth step: The features of a regression output
The Excel output gives the Regression Statistics and the ANOVA table. So what do these figures mean? I will explain each feature in turns.

Under “Regression Statistics”:
R-squared: gives the variation in pce that is explained by income. The higher the R2, the better the model and the more predictive power the variables have. Although, an R2 that equals 1 will elicit some suspicion. The R is actually the correlation coefficient between the 2 variables. This implies that  
= the correlation coefficient.


Adjusted R-squared: this is the R2 adjusted as you increase your explanatory variables. It reduces as more explanatory variables are added.

Standard Error: this is the standard error of the regression

Observations: the data span is from 1960 to 2009 = 50 years

Under “ANOVA” (analysis of variance):
Source: there are two sources of variation on the dependent variable, pce. Those explained by the regression (i.e, the Model) and those due to randomness (Residuals)

df: this is degree of freedom calculated as k-1 (for the model) and n-k (for the residuals)

SS: implies sum of squared residuals for the Regression (explained variation in pce) and Residuals (unexplained variation in pce). After doing the regression analysis, all the points on pce do not fall on the predicted line. Those points outside the line are known as residuals. Those that can be explained by the regression are known as Explained Sum of Squares (ESS) while those that are due to random forces, which are outside the model are known as Residual Sum of Squares (RSS).

Excel - predicted Value of pce from cruncheconometrix.com.ng
Excel - Predicted Value of pce
Source: CrunchEconomterix
As observed from the graph, all the points do not fall on the predicted line. Some lie above, while some are beneath the line. These are all the residuals (in order words, the remnants obtained after the regression analysis). If the predicted line falls above a point, it means that pce is over-predicted (that is, pce – pcehat is negative) and if it is beneath a point, it implies that pce is under-predicted (that is, pce – pcehat is positive). The sum and mean of the residuals equals zero.

MS: implies mean sum of squared residuals obtained by dividing SS by df i.e. SS/df

F: captures whether the explanatory variable, income is significant in explaining the outcome variable, pce. The higher the F-stat, the better for the model.

Significance F: this is the probability value that indicates the statistical significance of the F ratio. A significance-value that is less than 0.05 is often preferred.

Coefficient: this is the slope coefficient. The estimate for income. The sign of the coefficient also tells you the direction of the relationship. A positive (negative) sign implies a positive (negative) relationship.

Intercept: this is the hypothetical outcome on pce if income is zero. It is also the intercept for the model.

Standard error: this is the standard deviation for the coefficient. That is, since you are not so sure about the exact value for income, there will be some variation in the prediction for the coefficient. Therefore, the standard error shows how much deviation occurs from predicting the slope coefficient estimate.

t-stat: this measures the number of standard errors that the coefficient is from zero. It is obtained by: coeff/std. error. A t-stat above 2 is sufficient evidence against the null hypothesis

P-value: there are several interpretations for this. (1) it is smallest evidence required to reject the null hypothesis, (2) it is the probability that one would have obtained the slope coefficient value from the data if the actual slope coefficient is zero, (3) the p-value looks up the t-stat table using the degree of freedom (df) to show the number of standard errors the coefficient is from zero, (4) tells whether the relationship is significant or not.

So, if the p-value is 0.3, then it means that you are only 70% (that is, (100-30)% ) confident that the slope coefficient is non-zero. This is not good enough. This is because a very low p-value gives a higher level of confidence in rejecting the null hypothesis. Hence, a p-value of 0.02, implies that you are 98% (that is, (100 - 2)% ) confident that the slope coefficient is non-zero which is more re-assuring! J.

Lower and Upper 95%: these are the confidence intervals. If the coefficient is significant, this interval will contain that slope coefficient but it will not, if otherwise.

Assignment:
Use Gujarati and Porter Table7_12.dta or Table7_12.xlsx dataset.
(1)  With pce as the dependent variable and gdpi as the explanatory variable, plot the graph of pce and gdpi, what do you observe?
(2)  Run your regression. Can you interpret the table and the features?
(3)  Plot the predicted line. What are your observations?

I have taken you through the basic features of a regression output using Excel data analysis software on ordinary least squares (OLS) model in a simple linear regression. So, you now have the basic idea of what the F-stat, t-stat, df, SS, MS, prob>F, p>|t|, confidence interval, R2, coefficient, standard error stand for.


[Watch video on "How to interpret regression output in Excel"]


Practice the assignment and if you still have further questions, kindly post them below....

Wednesday, 7 February 2018

Panel Data Analysis (Lecture 1): Sourcing Data, Theoretical Framework and Model Specification

Caution: This tutorial is only a guide and should not be adopted in its entirety. Endeavour to consult your tutor and other resource materials for proper guidance!

Introduction
The dissertation fervor is heating up with the usual twists and turns. In view of these and in response to readers’ requests, I will be starting a series of lectures on how to run time series and panel data analyses. These will be in parts and supported with short video tutorials posted to YouTube (so ensure to hook up to get the hands-on training). In order not to leave anyone out, these practical lectures will be carried out using three (3) analytical packages that is common among final-year students – Stata, EViews and Excel. Also, real country-level and longitudinal data will be used (but subject to my modifications to prevent unethical conduct from readers). Lastly, only quantitative research will be addressed.

For time series analysis, the lectures will only cover: data sourcing, model specification, lag selection, unit root testing, cointegration test, vector autoregressive model (VAR), autoregressive distributed lag model (ARDL), vector error correction mechanism (VECM), Granger causality tests, CUSUMSQ test and other post-estimation tests. While for panel data analysis, the lectures will only cover: setting up a panel data in Stata and EViews, data sourcing, model specification, Hausman test, fixed effects (FE) model, random effects (RE) model and generalised methods of moments (GMM).

So, in order to get prompt tutorials, the moment I click the “post” button, I will encourage you to subscribe for these blog posts. Use the “Follow by Email” menu on my blog https://cruncheconometrix.blogspot.com.ng, activate the link once you receive the notification in your email (check your spam box too) and you are good to go! Likewise, follow that up by subscribing to my YouTube videos for those short hands-on video clips. Click on this link CrunchEconometrix YouTube videos and subscribe!

Data Sourcing
“I can’t get data!!!”, “what’s a proxy?”, “I have data but not for all the groups”,“ how do I go about modeling my theoretical framework?”, “how do I construct my empirical model?”, “in fact, I’m confused!”…so many questions and believe me the chattering seems endless. First, I always tell students to relax! Secondly, I tell them that the moment the research area has been identified, and the topic streamlined, the next thing to do is to go on data-search. Okay, think about this: of what use is an empirical research if there is no data (or you have insufficient data to test your hypothesis)? So before, you proceed to writing chapter 1 (that is, the study background), make certain that you have the data handy.

Primary Data Sources
Regardless of the field of study or research discipline, primary data gathering requires the use of questionnaires, interviews, focus group discussions etc. It may require one of these or a combination of 2 or 3 data-gathering methods. So, if you are using primary data, ensure to get out these materials and distribute to the respondents in order to harvest responses within the shortest time frame. Getting a good number of responses is a precursor to having a quality research and unbiased results. However, these structured tutorials will not be extended to analysing primary data….my sincere apologies!

Secondary Data Sources
Since, research is not limited to those in the field of economics, it is important that researchers identify those databases hosting the relevant data required for their work. As an economist, I will indicate some databases/sources where students can go source for their data. Here are some which can be accessed (for macro and micro datasets):
IEA Coal Information
IEA CO2 Emissions from Fuel Combustion
IEA Electricity information
IEA Energy Prices and Taxes
IEA Energy Technology Research and Development Database
IEA Natural Gas Information
IEA Oil Information
IEA Renewables Information
IEA World Energy Statistics and Balances
ILO Key Indicators of the Labour Market
IMF Balance of Payment Statistics
IMF Direction of Trade Statistics
IMF Government Finance Statistics
IMF International Financial Statistics
IMF World Economic Outlook
OECD Education Statistics
OECD Globalisation
OECD International Development
OECD International Direct Investment Statistics
OECD International Migration Statistics
OECD International Trade by Commodities Statistics
OECD Main Economic Indicators
OECD Main Science and Technology Indicators
OECD National Accounts
OECD Quarterly Labour Force Statistics
OECD Services Statistics
OECD Social Expenditure Database
OECD Structural Analysis
UNIDO Industrial Demand Supply
UNIDO Industrial Statistics
World Bank Global Development Finance
World Bank World Development Indicators
World Bank Africa Development Indicators

Other sources of international data include but not limited to:
International Monetary Fund - http://www.imf.org/external/data.htm#data
United Nations - http://data.un.org/
Data on aid flows complied by OECD - http://www.oecd.org/dac/stats/
NBER data sets - http://www.nber.org/data/

For information from over 256 and regions since 1960, the accessible databases are:
World Development Indicators
Global Development Finance
The African Development Indicators
Doing Business
Education Statistics
Enterprise Surveys
Gender Statistics
Health Nutrition and Population Statistics
Millennium Development Goals
Worldwide Governance Indicators
Endeavour to check out those sites that are relevant to your study.
Note: it is expected that you state your data source in your thesis/dissertation and the years of coverage say 1980 to 2016, or 1970 to 2015 etc.

What is a Panel Data?
The panel data approach pools time series data with cross-sectional data. Depending on the application, it can comprise a sample of individuals, firms, countries, or regions over a specific time period. The general structure of such a model could be expressed as follows:

Yit = a + bXit + uit 
where uit ~ IID(0, s2),i = 1,2,…,N individual-level observations, and t = 1, 2,…,T time series observations.

In this application, it is assumed that Yit is a continuous variable. The panel data model is simply where the observations of each individual, firm or country over time are stacked on top of each another. This is the standard pooled model where intercepts and slope coefficients are homogeneous across all N cross-sections and through all T time periods. The application of ordinary least squares (OLS) to this model ignores the temporal and spatial dimension inherent in the data and thus throws away useful information. It is important to note that the temporal dimension captures the ‘within’ variation in the data while the spatial dimension captures the ‘between’ variation in the data. The pooled OLS estimator exploits both ‘between’ and ‘within’ dimensions of the data but does not do so efficiently. Thus, in this procedure each observation is given equal weight in estimation. In addition, the unbiasedness and consistency of the estimator requires that the explanatory variables are uncorrelated with any omitted factors. The limitations of OLS in such an application prompted interest in alternative procedures. There are a number of different panel estimators but the most popular is the fixed effects (or ‘within’) estimator and this will be reviewed extensively here. Lastly, the generalized methods of moments (GMM) estimator will be discussed given its relevance to dynamic panel modelling.

Some Advantages of Panel Data Analysis
Panel data analysis has quite a number of distinct advantages over time series and cross-section analysis:
·   Panel (or longitudinal) data allows a researcher to analyse a number of important economic questions not readily answerable by either a cross-section or a time-series dataset alone.
·  The availability of panel data increases the number of data points available and reduces collinearity among the explanatory variables thus improving the efficiency of the econometric estimates.
·  Panel data captures the heterogeneity that is related to the individuals, firms, states, countries etc. over time.
· By combining time series of cross-sectional observations, panel data gives “more informative data, more variability, less collinearity among variables, more degrees of freedom and more efficiency”.
·    Dynamic effects cannot be estimated using cross-sectional data. Even time series data are imprecise in this regard as there is generally limited change or variation in the data to identify such effects. For instance, in estimating a distributed lag model using only time series data, multicollinearity lowers the precision of the estimates. Hence, panel data models can provide greater variation in the explanatory variable for a given year thus reducing the degree of multicollinearity and improving the precision of the estimates. This clearly renders panel data better suited to the study of dynamic change However, it should be emphasised that the estimation procedures required for dynamic models which include a lagged dependent variable are not straightforward and this issue is the subject of discussion in later sections.
· Panel data models can take into account a greater degree of the heterogeneity that characterize individuals, states, and firms over time.
(Detailed discussion on the rudiments of panel data analysis will be done in the next tutorial).

(Here is the link to video clip on converting wide-format data to long-format in Stata).

Model Framework and Specification
This section focusses on the theoretical framework and model specification. I will also touch on description of variables in a model, the a priori expectations and finally, the method of analysis (or the estimation technique(s) to be used in testing the research hypothesis).

Theoretical Framework
Before you specify the empirical model, you must first state the theoretical model. That is, let your readers know where your empirical model is linked to. The theoretical model is that model supporting the theory you are using to undertake your research because no research can be done in isolation without an underlying theory. For instance, if my study is on the effect of exchange rate on output for 30 countries from 2000 to 2016 (that is, 17years), then I must look for a suitable theory which I can adapt to my research. Hence, I may decide to use the “monetary model of exchange rate” which is one of the earliest models used to determine the exchange rate. It is used as a measure to study the other approaches that are used in determining exchange rate. The monetary model approach assumes a simple demand for money curve, the purchasing power parity or the law of one price and a vertical aggregate supply curve.

The theoretical framework can be built as follows: (remember that this is just an example, and should not to be copied literarily!)

From the absolute purchasing power parity (P = EP*), the exchange rate is obtained by dividing the price of the domestic currency by the foreign price for that domestic currency. That is: Eppp = P/P*. The demand for money assumption: since real money balance depends on real income, demand for money is given as Md = kPY, where k is constant and Y is the real income level. Hence, in equilibrium, money demand (Md) equals money supply (Ms) and at the point of intersection of the aggregate demand and the aggregate supply curve:
kPY=Ms
P = Ms/kY
EP* = P = Ms/kY
and E = Ms/P*kY

From the stated framework, it is theorised that if the money supply within an economy increases, it will result in appreciation of the domestic currency. Hence, if it is generalised for the 30 countries in the data, the same assumption must be made, ceteris paribus. Likewise, foreign price level and the output level are inversely related to the exchange rate. If fixed money supply rises in the domestic economy, since prices are held constant, excess money supply leads to higher demand for goods and services within the economy.

Model Specification
So, having stated the theoretical framework, I can now go ahead to modify it to suit my research and form there formulate my empirical model. For instance, in using a Cobb-Douglas production from the neo-classical growth mode, I will attempt to explain output growth in the context of capital accumulation, labour and productivity, usually referred to as technological progress. The Cobb- Douglas production model is implicitly stated as:

Y = f(ALβKα)                                                                                    [1]
where, Y is output; K is capital stock; L is labour and A is productivity of labour which grows at an exogenous rate. As a result of constant returns to scale, if all inputs are increased by the same amount, then there would be an increase in output. The production function,

Y = KαL1-α                                                                                         [2]
where (1 - a = b) is mainly used by economists and researchers due to the following reasons: firstly, there is a constant return to scale and secondly, the two exponents α and (1 - a), sum up to one.

Next, is to tie up the empirical model to the theoretical framework. That is given the relationship between exchange rate and output, the model is implicitly specified as:

Yit = f (Exchrateit, X1it, X2it, …, Xnit)                                          [3]
where Yit = output (the dependent variable, state the measurement either gross output, or % of GDP, or growth rate etc.)
Exchrateit = real exchange rate (main explanatory variable)
X1it, X2it, …, Xnit = control variables (state their individual measurements either gross output, or % of GDP, or growth rate etc.)

On the basis of the theoretical framework and using the Cobb-Douglas production, the explicit model is stated as:
Yit = β0 + β1Exchrateit + β2X1it + β3X2it + … + βnXnit + uit        [4]
where, ut = white noise error term

 A Priori Expectations
Always know that the expected a priori is directly related to what theory says. It is from that you know what signs of the coefficients are expected from the main regressor and other covariates. For instance, from the theory, it is expected that currency depreciation will have a positive impact on domestic output, hence, a negative sign of the coefficient is expected. That is:

β1 < 0

Therefore, the expected signs of the control variables must be in line with their respective theories which must be related to your study.

Estimation Technique
At this point, the researcher may not know the exact technique or estimator to adopt between the fixed-effects within-group (fixed effects model) or the random effects estimator. The choice between these two is subject to the outcome of the Hausman test.
That is, to determine which model is the more appropriate to adopt, a statistical test is implemented. The Hausman test compares the random effects estimator to the ‘within’ estimator. The null hypothesis of the test is that the composite error term is not correlated with the explanatory variables in the model. If the null is rejected, then the fixed effects estimator is applicable (i.e., it favours the fixed effects but only relative to the random effects). The use of the test in this case is to discriminate between a model where the omitted heterogeneity is treated as fixed and correlated with the explanatory variables, and a model where the omitted heterogeneity is treated as random and independent of the explanatory variables.

Variables, Measurement and Description
Lastly, tabulate your variables detailing their names, short description, measurement and sources.

Here’s an example:
Table xxx: Variables Description and Measurement
Variables
Short Definition
Measurement
Source
Output


World Bank (2016)
Real exchange rate


World Bank (2016)
If you have any comments or question in relation to what have been discussed in this post, do not hesitate to post them in the comment section below….
Source: Researcher’s compilation (always put this at the bottom of the Table)

Conclusion
I have taken you through the steps required on how to source for your data, in addition to a brief on panel data analysis and its relevance over time series and cross-sectional data. I also briefly explained how to formulate a theoretical framework, adapting the framework to align with the research, how to construct the empirical model, stating the expected a priori, having an idea about the estimation technique with a brief on the Hausman test, tabulating your data showing the brief description of your variables, their measurements and data sources.

From next lecture, I will begin analysing the data using both Stata and EViews analytical packages. So, endeavour to follow these tutorials by getting the most of it to ease the dissertation pressure. Make sure you follow me on the next lecture series which is: Panel Data Analysis (Lecture 2): Setting up panel data model and the Hausman Test.

If you have any comments or question in relation to what have been discussed, do not hesitate to post them in the comment section below….