
Monday, 22 January 2018

Data Handling: Interpretation and Discussion of Results in Scientific Economic Research

Philip O. Alege, Ph.D
Professor of Economics
Department of Economics and Development Studies
Covenant University, Ota, Ogun State

Introduction
The tools available to the modern economist as an analyst are many, largely because knowledge from other disciplines, such as physics, biology, mechanical engineering and particularly mathematics and statistics, has found its way into economics. Today, modern economies would be difficult to analyse, understand and predict without the tools of mathematics, statistics and, in particular, econometrics. This is explained by the growing number of economic activities and interactions among the different agents within a country and between countries. One school of thought favours more economics and less mathematics. Another holds that substantial application of mathematical tools is necessary to obtain “useful” results from our analysis. Though I belong to the latter school, I also contend that things must be done properly.

Basics of Econometric Modeling
Basically, econometrics provides empirical support for economic theory using economic data. Its main purpose is to estimate the parameter(s) of a model that captures the behaviour of economic agent(s) as described by the theory. Since the estimated parameters may be useful for understanding the economic theory, for policy analysis and for forecasting, it is incumbent on the econometrician to obtain efficient parameter estimates. To achieve this, we must adhere to some principles of model building that generate results whose interpretation and discussion will be useful for policy analysis as well as decision making. These are as follows:
·      Economic theory applicable to the specific area of the research
·      Design of the mathematical model and the hypotheses of the study
·      The quest to obtain the right economic statistics, i.e. the collection, collation and analysis of requisite data for the research, and
·      Interpretation/discussion of the findings/results

The researcher should keep in mind that econometric models are tools and therefore means to some desired ends. That is, our professional calling is to provide plausible parameter estimates that should be useful for policy analysis and decision making. Therefore, any mathematical and/or statistical model must be able to deliver these objectives of the researcher in an efficient manner.

Model Specification and Estimation Techniques
Consequently, model specification is the nucleus/DNA of any scientific economic research. I usually call it the economics of the study. It shows the depth of the researcher's knowledge of theoretical economics as well as the ability to state clearly the contribution(s) to knowledge envisaged in the study. The latter may come as:
·      An additional variable in an existing theoretical model
·      A single equation now specified as a system of equations in order to capture a phenomenon hitherto not considered, or
·      Application of a technique not commonly used in our own environment.

Once the model is correctly specified, the next step is to consider the estimation technique that will produce the most efficient estimates of the parameters of the model. It is important to use a technique of estimation that will deliver the objective(s) of the study. It is apposite to mention some estimation techniques at this stage, although the list is not exhaustive. Some of these are: ordinary least squares (OLS), indirect least squares (ILS), instrumental variables (IV), two-stage least squares (2SLS) and three-stage least squares (3SLS) in the case of a system of simultaneous equations, the error correction model (ECM), which examines short-run dynamics, cointegration regression, generalised method of moments (GMM), the vector autoregressive method (VAR), which examines the effect of shocks on a system, the structural vector autoregressive method (SVAR), the panel data method, the panel vector autoregressive method (PVAR), the panel structural vector autoregressive method (PSVAR), vector error correction (VECM), panel cointegration, panel vector error correction (PVECM) and so on.
Some learning resources include, but are not limited to:
1.    Gujarati D. N. (2013). Basic Econometrics, Eighth Edition, McGraw-Hill International Editions Economic Series, Glasgow
2.    Maddala, G. S. and Lahiri, K. (2009). Introduction to Econometrics. Fourth Edition. John Wiley
3.    Wooldridge J. M. (2009). Introductory Econometrics, Fourth Edition, South-Western Cengage Learning, Mason, U.S.A

Dynamic General Equilibrium (DGE) Models
There are many other estimation techniques that should be of interest to the younger generation of economists. The basic framework is dynamic general equilibrium (DGE) theory. Models built around this method are solved using DYNARE codes in the MATLAB environment or directly using MATLAB codes written for such models. Part of the estimation is the need to calibrate the model. This consists of finding values for some parameters in the model through theoretical knowledge, calculating long-run averages, as well as micro-econometric studies. The statistics often used are derived from Bayesian inference, as against the classical statistics referred to in the preceding paragraphs. Some of these models are: real business cycle (RBC), New Keynesian models (NKM), dynamic stochastic general equilibrium (DSGE), overlapping generations (OLG), computable general equilibrium (CGE), dynamic computable general equilibrium (DCGE), Bayesian vector autoregression (BVAR), Bayesian structural vector autoregression (BSVAR), dynamic macro panels (DMP), augmented gravity models (AGM) and multicountry New Keynesian (MCNK) models.
Some learning resource materials are:
1.    Wickens, M. (2008). Macroeconomic Theory: A Dynamic General Equilibrium Approach. Princeton University Press, Princeton
2.    Canova, F. (undated). Methods for Applied Macroeconomic Research
3.    DeJong, D. N. and Dave, C. (2007). Structural Macroeconometrics, Princeton University Press, Princeton.
4.    Cooley, T. F. (ed.) (1995). Frontiers of Business Cycle Research. Princeton University Press, Princeton.
5.    McCandless, G. (2008). The ABCs of RBCs: An Introduction to Dynamic Macroeconomic Models. Harvard University Press; and
6.    Lucas, R. E. (1991). Models of Business Cycles.

It is apposite to state that researchers must have a working understanding of the tests that must be carried out under each technique of estimation. I also need to draw the attention of researchers interested in dynamic general equilibrium to the fact that it requires adequate knowledge of computational economics. Specifically, you need a sound working knowledge of the following: dynamic optimization, the method of Lagrange multipliers, continuous-time optimization, dynamic programming, stochastic dynamic optimization, time-consistency and time-inconsistency, and linear rational-expectation models.
Some learning resource materials:
1.    Dadkhah, K. (undated) Foundations of Mathematical and Computational Economics, Thomson South-Western.

Interpretation of Results
In interpreting the results of an econometric model, you choose the most appropriate approach for your work, either classical or Bayesian statistics, as mentioned above. This aspect of the work constitutes the scientific content emanating from economic statistics and mathematical economics. In this case, we should be addressing statistics such as:
·      R-squared
·      Adjusted R-squared (“goodness of fit” test)
·      F-statistic
·      Durbin-Watson statistic
These, in addition to the test of heteroscedasticity, constitute the “diagnostic tests”. If they fail to fall within the zones of acceptance, we cannot go ahead to test for the significance of each variable. There may be a need for model re-specification, detection and correction of autocorrelation, and/or detection and correction of multicollinearity.

We may also need to test for heteroscedasticity. The occurrence of any of these problems is evidence of the violation of the assumption(s) of the technique being applied. This is followed by the statistics that test the significance of the individual variables included in the model. This was the standard during the époque of almighty OLS. Later in the history of applied econometrics, it was observed that certain time series are non-stationary, i.e. their means, variances and covariances are not constant over time. In such situations regression results are generally meaningless and are, therefore, termed spurious. To guard against this, the tools often used to examine stationarity and to model non-stationary series include the following: the Dickey-Fuller test, the “augmented” Dickey-Fuller test when the error term is not white noise, panel data unit root tests, co-integration tests and the error correction model (ECM), to mention a few. The use of any of these tests should be in response to the objective of the researcher and the desired contribution(s) to knowledge.
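
As a rough illustration of these diagnostics in Stata, the sketch below runs an OLS regression on simulated data and then calls the built-in post-estimation tests. The variable names (t, x, y, rw) and the data-generating process are assumptions made purely for illustration; they do not come from any real study.

```stata
* Simulated (hypothetical) data: 200 time periods
clear
set seed 12345
set obs 200
generate t = _n
* Declare the data to be a time series
tsset t

generate x = rnormal()
generate y = 1 + 0.5*x + rnormal()
* A random walk, which is non-stationary by construction
generate rw = sum(rnormal())

* OLS: the output reports R-squared, adjusted R-squared, F and t statistics
regress y x
* Breusch-Pagan / Cook-Weisberg test for heteroscedasticity
estat hettest
* Durbin-Watson statistic (requires tsset data)
estat dwatson

* Dickey-Fuller test on the level of the series
dfuller rw
* Augmented Dickey-Fuller test with 2 lagged differences
dfuller rw, lags(2)
* The same test on the first difference
dfuller D.rw
```

With a random walk, one would expect the unit-root null not to be rejected in levels but to be rejected after first differencing.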

Some Pitfalls in Econometrics
·      The wrong way to go in modelling
How one interprets the coefficients in regression models will be a function of how the dependent (y) and independent (x) variables are measured. In general, there are three main types of variables used in econometrics: (1) continuous variables, (2) the natural logarithm of continuous variables, and (3) dummy variables.
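
To make the point concrete, here is a minimal sketch using the auto dataset that ships with Stata (my choice of data and variables, purely for illustration). The comments give the usual textbook reading of the slope coefficient b in each functional form; the percentage readings are approximations that hold for small changes.

```stata
* Example data shipped with Stata; price and weight are strictly positive
sysuse auto, clear
generate ln_price  = ln(price)
generate ln_weight = ln(weight)

* Level-level: a one-unit rise in weight changes price by b units
regress price weight
* Log-level: a one-unit rise in weight changes price by roughly 100*b percent
regress ln_price weight
* Level-log: a 1% rise in weight changes price by roughly b/100 units
regress price ln_weight
* Log-log: b is an elasticity (a 1% rise in weight changes price by about b percent)
regress ln_price ln_weight
* Dummy regressor: b is the difference in mean price between foreign and domestic cars
regress price i.foreign
```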

·      Some Specific Rules of Thumb from Statistics

After performing a regression analysis:
1.    Look at the number of observations:
·      Is it in line with what you expected a priori?
·      If not, you should find out why.
·      Remember, any observations with missing values will be dropped from the regression.
·      Do not take the logarithm of a variable whose value equals zero. The logarithm of zero is undefined, so the model will not run as intended.
·      Ensure the number of observations in your model satisfies the usual rule of thumb, i.e. the sample size should be greater than or equal to 30 (the threshold commonly associated with the central limit theorem).
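
These checks can be sketched in Stata as follows. The example again assumes the auto dataset shipped with Stata, in which the variable rep78 has some missing values; the variables are chosen only for illustration.

```stata
sysuse auto, clear
* How many observations are in memory, and which variables have missing values?
count
misstable summarize

* rep78 has missing values, so the estimation sample is smaller than the dataset
regress price mpg rep78
display "Observations used in the regression: " e(N)

* Take logs only where the variable is strictly positive;
* ln() of zero or of a negative number returns a missing value
generate ln_price = ln(price) if price > 0
count if missing(ln_price)
```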

2.    Observe the value of the R²:
·      The R² tells you the percentage of the total variation in the dependent variable that the independent variables of your model “explain”.
·      It lies between 0 and 1. The remainder is attributed to the error term.
·      Suppose an estimated model has R² = 0.46. This means that 46% of the total variation in the dependent variable is explained by the independent variables. For a time-series regression, this is not a “good fit”.
·      As a rule of thumb, a time-series regression has a good fit when 0.5 < R² < 1.
·      However, an R² of 0.46 is considered good for cross-section data and very good for panel data.

Problems with R²:
·      If you have a ‘very low’ R², rethink whether you might have omitted some important variables.
·      However, be careful not to include unnecessary variables only to increase your R².
·      A ‘very high’ R² could indicate several problems.
·      Firstly, if a high R² is combined with many statistically insignificant variables, your independent variables might be highly correlated amongst themselves (multicollinearity).
·      You might consider dropping some in the interest of parsimony.
·      It might also be an indication that you have mis-specified your model.

The adjusted R²:
·      The adjusted R² penalizes the inclusion of more variables, i.e. it corrects for degrees of freedom.
·      Include as many variables as you need, but keep your model as parsimonious as possible. Observe the rules guiding this.
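
A small sketch of the degrees-of-freedom correction, assuming the auto dataset shipped with Stata (an illustrative choice). It uses the standard formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - k), where k counts all estimated parameters including the intercept, and compares a hand calculation with the value Stata stores after regress.

```stata
sysuse auto, clear
regress price mpg weight
display "R-squared          = " e(r2)
display "Adjusted R-squared = " e(r2_a)
* Manual check; e(df_r) is the residual degrees of freedom, n - k
display "By hand            = " 1 - (1 - e(r2))*(e(N) - 1)/e(df_r)

* Adding a regressor never lowers R-squared, but the adjusted R-squared can fall
regress price mpg weight length
display "R-squared          = " e(r2)
display "Adjusted R-squared = " e(r2_a)
```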

3.    Look at the F-test.
·      The F-test assesses the “joint significance” of the model.
·      More formally, it tests the null hypothesis that all your slope coefficients are jointly equal to zero.
·      If they are, your model is effectively not explaining anything. Hint: ideally you want a high F-value and a low corresponding p-value.
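
For illustration, the overall F-test can be recovered from Stata's stored results and reproduced with an explicit joint test. The dataset and regressors below are assumptions chosen only to make the sketch runnable.

```stata
sysuse auto, clear
regress price mpg weight
* Overall F statistic and its p-value, as printed in the regression header
display "F = " e(F)
display "p = " Ftail(e(df_m), e(df_r), e(F))
* The same joint hypothesis (both slope coefficients equal zero), tested explicitly
test mpg weight
```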

4.    Interpret the signs of the coefficients.
·      Which ones should be positive and which should be negative from the theoretical perspective? Interpret this!
·      A positive coefficient means that variable has a positive impact on your dependent variable, and a negative one has a negative impact or inverse relationship.

5.    Interpret the size of the coefficients where relevant.
·      If you obtain a statistically significant coefficient, wonderful!
·      So maybe you’ve found that consumption increases with disposable income. But by how much? Is the coefficient close to 1, in which case the marginal propensity to consume is high and the marginal propensity to save is low? What would be the effect of this on the economy?

6.    Look at the significance of the coefficients (most important?).
·      This should in fact become the first thing that your eyes drift towards when you get regression output.
·      You should feel a little hint of excitement as you are waiting to find out whether your model works and whether your theory has been proved correct or not.
·      The test of significance is designed to test whether a coefficient is significantly different from zero or not.
·      If it is not, then you must conclude that your explanatory variable does not, in fact, explain your dependent variable at all.
·      We use the t-test (just like we learnt in first-year statistics) for this: we compare a calculated t with a t-value taken from the table at a given significance level, α, with n - k degrees of freedom, where n = number of observations and k = number of parameters estimated (the independent variables plus the intercept). If the calculated t exceeds the table value in absolute terms, the coefficient is significantly different from zero.
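
The comparison of a calculated t with a tabulated t can be sketched in Stata as below. The dataset, the regressors and the 5% two-tailed significance level are assumptions made for illustration.

```stata
sysuse auto, clear
regress price mpg weight
* Calculated t for mpg: the coefficient divided by its standard error
display "t calculated = " _b[mpg]/_se[mpg]
* Tabulated (critical) t at the 5% level, two-tailed, with n - k degrees of freedom
display "t critical   = " invttail(e(df_r), 0.025)
* Two-sided p-value for the same test
display "p-value      = " 2*ttail(e(df_r), abs(_b[mpg]/_se[mpg]))
```

The calculated t and p-value should agree with the figures in the regress output table for mpg.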


7.    Others
·      Other tests follow, such as testing for normality of the error terms, checking for the existence of heteroscedasticity, and performing specification and robustness tests.
·      These topics must be covered if your econometric work is to produce output that is useful for policy making and decision making.
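
A minimal sketch of these follow-up checks using Stata's built-in post-estimation commands. The dataset and model are assumptions for illustration, and the new variable name resid is hypothetical.

```stata
sysuse auto, clear
regress price mpg weight
* Save the residuals and test them for normality (Shapiro-Wilk)
predict resid, residuals
swilk resid
* Breusch-Pagan / Cook-Weisberg test for heteroscedasticity
estat hettest
* Ramsey RESET test for omitted variables / functional form
estat ovtest
* Variance inflation factors as a check for multicollinearity
estat vif
```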

Discussion of Results
The essence of scientific economic research is to build economic models that enable us to obtain plausible estimates from a given set of data. We should know that the structural parameters estimated encapsulate our behaviour and, therefore, in discussing them, we need to go beyond the confines of economics to locate additional means of buttressing our results from:
·      historical context
·      socio-political condition, and
·      psychological state as well as
·      international environment

Conclusion
I have tried to raise some important issues in this post. There are so many things to keep in mind when preparing a research work. The most important of them all is the need to keep your model simple and avoid frivolities in modeling. It is important to remember that we are first of all economists. The tools of analysis at our disposal should not overshadow that calling.


Quite a lot has been said about how a researcher in the field of economics can handle data, interpret and discuss research findings that will be relevant for policy-making. If you still have further questions or comments in this regard, kindly post them below for the benefit of all.

Post your comments and questions….

Saturday, 20 January 2018

Stata: Interpreting Two-way ANOVA Procedure

Two-way ANOVA Procedure using Stata

This is a follow-up to my previous post on how to analyse the one-way ANOVA using the Stata analytical software. Endeavour to read it up, as it also provides a good introduction to running ANOVA.

The essence of two-way ANOVA in data analysis
ANOVA simply means analysis of variance, and its importance in analysing behavioural relationships between and among variables endears it to researchers. Basically, the ANOVA procedure determines whether the average value (that is, the mean) of a dependent variable (also called the regressand, outcome variable or endogenous variable) is the same in two or more unrelated, independent groups. That is, it indicates whether the mean of the dependent variable is the same or differs across independent, unrelated categorical groups. The two-way ANOVA compares the mean differences between groups that have been split on two independent variables (called factors). Its primary purpose is to understand if there is an interaction between the two independent variables on the dependent variable. Once you understand how to compute the two-way ANOVA and interpret its table, you will always want to incorporate it in your study or research, after ensuring that your data meets some salient conditions.

For instance, you could use a two-way ANOVA to understand whether there is an interaction between activity level and diet on bmi. Here the dependent variable would be “bmi”, measured on a continuous scale, and the independent variables would be “activity level” (which has three groups: “low”, “moderate” and “high”) and “diet” (which has three groups: “vegan”, “vegetarian” and “animal-based”). Again, the two-way ANOVA can be used to understand if there is an interaction between demographic location and type of housing on rentals. The dependent variable would be “rentals”, measured on a continuous scale, and the independent variables would be “location” (which has two groups: “rural” and “urban”) and “housing types” (which has six groups: “one-room apartment”, “two-room apartment”, “one-bedroom condo”, “two-bedroom condo”, “mini flat” and “standard flat”). Lastly, an agronomist may be interested in the interaction between temperate (seasonal) conditions and type of fertiliser on, say, the crop yield of cassava. The dependent variable would be “yield”, measured on a continuous scale, and the independent variables would be “temperate” (which has four groups: “autumn”, “spring”, “summer” and “winter”) and “fertiliser” (which has two groups: “organic” and “inorganic”). If there are three independent variables rather than two, a three-way ANOVA is performed; with four independent variables, a four-way ANOVA, and so on.

Given this preamble, here is a “step-by-step” tutorial showing you how to carry out a two-way ANOVA and some post-hoc checks using the Stata analytical package. But before I proceed, it is important for you to understand some basic rules underlying the use of the two-way ANOVA procedure. Your data must meet these criteria, failing which your results may be invalid. There are six (6) of them:

Rules:
These six "rules" represent the blueprint guiding the use of the two-way ANOVA technique. If any is not satisfied, you may obtain invalid results. Please note that the first three rules are closely related to the nature of your data and study structure (that is, directly related to your choice of variables), so Stata cannot validate them, while the last three can be checked using Stata. It is therefore important that you ascertain that your study meets these conditions before proceeding with the two-way ANOVA.

· Rule #1: Make sure that the dependent variable (regressand, outcome variable) is cardinal and measured in continuous terms. Some examples of variables measured in continuous terms are: time (measured in minutes, seconds and milliseconds), weight (measured in stones, pounds, kilogrammes and grams), rentals (measured in local currency) and so on. These are called continuous variables.

· Rule #2: Both explanatory variables (regressors, independent variables) ought to comprise two or more categorical, independent (unrelated) groups. Some examples of these categorical variables are income group (3 groups: high-income, middle-income and low-income); grade (4 groups: excellent, very good, good and poor); demography (2 groups: rural and urban); banking (3 groups: investment, mortgage, microfinance) etc. So make sure that each explanatory variable is a categorical variable.

· Rule #3: Ensure that you have independence of observations. That is, your observations must not overlap across the different groups. This simply means that there must be no relationship between the observations in each group or between the groups themselves. For instance, an observation in a “winter” group must not be represented again in a “spring” group. Needless to say, participants across the groups must be different. Where this is not the case, the repeated-measures ANOVA should be used rather than the two-way ANOVA.

· Rule #4: Be wary of conspicuous outliers. These are figures that are either abnormally high or low, that is, they do not follow the typical pattern in a particular variable. The presence of outliers can bias your results and reduce the accuracy of the two-way ANOVA. However, they can easily be detected in Stata by using a box plot or the summarize command (sum for short). The command reports the mean, standard deviation, minimum and maximum values of each variable in your data, thus enabling you to identify an abnormal figure.

· Rule #5: Because the two-way ANOVA is sensitive to strong departures from normality, the dependent variable should be approximately normally distributed for each category of the independent variables. Mild violations may still give valid results, which is why the requirement is approximate rather than exact normality. A histogram, the Shapiro-Wilk test or the Jarque-Bera test can be used in Stata to check for normality of residuals.

· Rule #6: There must be homogeneity of variances. This can be tested with Levene’s test for homogeneity of variances in Stata. Levene’s test is vital when it comes to interpreting the results from a two-way ANOVA, because the appropriate output depends on whether your data meets or fails this assumption.

Note: The first three rules are specific to your data, choice of variables and nature of study, over which any analytical package, like Stata, has no control; they cannot be verified by the software. The last three rules, however, can be verified. This may seem daunting, but it is important that you do so. Moreover, these packages have really simplified the procedures.

So let us take an example to understand the two-way ANOVA….

EXAMPLE
The data are from Wooldridge’s discrim2.dta or discrim2.xlsx files (if you don’t have Stata installed on your device, download the .xlsx file and feed it into the analytical package of your choice).
(Note: for simplicity, I have extracted a subset of the initial dataset, discrim.dta, to use for this example. The initial dataset is quite detailed, such that several two-way ANOVA simulations can be carried out.)

A researcher collected ZIP code-level data on the prices charged for small fries at four fast-food chains – Burger King, Kentucky Fried Chicken, Roy Rogers and Wendy’s – along with the characteristics of the ZIP-code population in two US states – New Jersey and Pennsylvania. The idea is to compare the prices charged by these fast-food chains to see whether the prices are the same across the two states.

In this example, the dependent variable is “price of fries” (measured in US dollars), whilst the independent variables are “state” and “chain”. “state” has two independent groups, “New Jersey” and “Penn”, and “chain” has four independent groups: “BK”, “KFC”, “RR” and “WD”. Remember that both are categorical variables whose members (observations) must not overlap within their groups. The two-way ANOVA in this instance is used to determine whether there is a statistically significant difference in prices charged among the four fast-food chains across the two states.

But before we begin, ensure that you set up your data in Stata (or any analytical package of your choice).

Setting up the data in Stata
1.    Ensure original data is in excel format (.xlsx, .xls or .csv)
2.    Have separate columns for prices of fries, state and chain
3.    Open the Stata application
4.    Go to Data >> Data Editor (Edit)
5.    Highlight data to be copied from excel
6.    Click the “paste” icon in Stata
7.    A dialog box opens: select “Treat first row as variable names”
8.    Click “OK” and Save.
These steps (1 – 8) create your Stata dataset (that is, .dta file)
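
If you prefer commands to copying and pasting, the same dataset can be created along the following lines. This is a minimal sketch which assumes that discrim2.xlsx is in Stata's current working directory and that its first row holds the variable names; adjust the path to suit your machine.

```stata
* Read the Excel sheet, treating the first row as variable names
import excel using "discrim2.xlsx", firstrow clear
* Inspect the variables that were read in
describe
* Save as a Stata dataset (discrim2.dta)
save discrim2, replace
```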

ATTENTION: If you are using Stata, make sure you create a log file and a do-file.

To create a log file:
The log file gives a history of what you have done. You can always revisit the log file (saved as .smcl) to review the processes, so it is advantageous to always have one. The steps are:
1.    Go to Stata >> File >> Log >> Begin
2.    Give it a filename
3.    Click Save

To create a do-file:
The do-file, on the other hand, shows the commands (codes) used to execute each process. Those familiar with the coding approach will agree with me that having a do-file speeds up the work. To create a do-file (saved as .do):
1.    Go to Stata >> New Do-File Editor
2.    New do-file opens
3.    Click File >> Save As
4.    Give it a filename
5.    Click Save
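
As an illustration, a do-file for this tutorial might look like the sketch below. The filename twoway_anova and the assumption that discrim2.dta is in the working directory are mine. The log using and log close lines do from the command line what the log menu steps do.

```stata
* twoway_anova.do (hypothetical filename)
clear all
* Start a log; Stata saves it as twoway_anova.smcl by default
log using twoway_anova, replace
* Assumes discrim2.dta is in the current working directory
use discrim2, clear
tabulate state chain
anova pfries state chain
* Close the log when the work is done
log close
```

Once the file is saved, typing do twoway_anova in the Command window runs every line in one go.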

Having established that both explanatory variables are categorical variables made up of two and four groups respectively, it is important that Value Labels for both explanatory variables, “state” and “chain”, are created in Stata. The essence is to create values for each group in order to make estimation possible. So, the values for New Jersey and Penn under state will be 1 and 2 respectively, while those for BK, KFC, RR and WD under chain will be 1, 2, 3 and 4 respectively.

How to do that? Here are the steps:
1.    Go to Stata >> Data >> Data Utilities >> Label Utilities >> Manage Value Labels >> Create Label
2.    Enter “new label name”: state
3.    Enter the appropriate values. Enter 1 for Value, and New Jersey for Label, click ADD. Next, enter 2 for Value, and Penn for Label click ADD. Then click OK.
4.    Again, click “Create Label”
5.    Enter “new label name”: chain
6.    Enter the appropriate values. Enter 1 for Value, and BK for Label, click ADD. Next, enter 2 for Value, and KFC for Label click ADD. Again, enter 3 for Value, and RR for Label, click ADD. Lastly, enter 4 for Value, and WD for Label click ADD. Then click OK.
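
For those using the command approach, the same value labels can be defined as sketched below. The label names and codes are those given in the steps above.

```stata
* Define a value label called state: 1 = New Jersey, 2 = Penn
label define state 1 "New Jersey" 2 "Penn"
* Define a value label called chain: 1 = BK, 2 = KFC, 3 = RR, 4 = WD
label define chain 1 "BK" 2 "KFC" 3 "RR" 4 "WD"
* List both labels to confirm the codes
label list state chain
```

If a label with the same name already exists, Stata will report an error; adding the replace option to label define overwrites it.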

If it is correctly done, then you should have something like this as shown below:
Creating Value Labels for Categorical Variables Using Stata
Source: CrunchEconometrix
(Used with written permission from Stata)

Next, assign the value labels to both categorical/explanatory variables, one at a time. To do that:
1.    Go to Stata >> Data >> Data Utilities >> Label Utilities >> Assign Value Label to Variable
2.    Under “Variables” select state
3.    Under “Value label” select state
4.    Click OK.
5.    Again, under “Variables” select chain
6.    Under “Value label” select chain
7.    Click OK.
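
The equivalent commands are sketched below. They assume that value labels named state and chain have already been defined and that the dataset is in memory.

```stata
* Attach the value label state to the variable state
label values state state
* Attach the value label chain to the variable chain
label values chain chain
* Check: the tabulations should now show the labels rather than the numeric codes
tabulate state
tabulate chain
```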

You should have something like this for both state and chain:
Adding Value Labels to Categorical Variables
Source: CrunchEconometrix
(Used with written permission from Stata) 

With all the steps correctly done, your dataset should look like this:
Dataset showing dependent and explanatory variables in Stata
Source: CrunchEconometrix from Wooldridge Dataset
(Used with written permission from Stata)

There are 410 observations, and to know the distribution of the four fast-food chains across the two states, use the tabulate syntax. That is,

tab state chain

and you have this output shown below:
Table showing the distribution of fast-food chains across state
Source: CrunchEconometrix from Wooldridge Dataset
(Used with written permission from Stata)
The above table shows how the 410 observations are distributed among the four fast-food chains in the two US states. For instance, Roy Rogers has 82 outlets in New Jersey and 17 in Pennsylvania, Wendy’s has 45 outlets in New Jersey and 15 in Pennsylvania and so on.
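
Beyond the simple counts, it can help to look at the mean price in each cell before running the ANOVA. The sketch below assumes the dataset with the variables pfries, state and chain is already in memory.

```stata
* Mean, standard deviation and frequency of pfries for each state-by-chain cell
tabulate state chain, summarize(pfries)
* Number of observations with a missing price of fries
count if missing(pfries)
```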

We are about to dig in much further…

Please note that in Stata, you can either use the code (command, syntax) approach or the graphical user interface (GUI). Either approach is fine. If you are familiar with the coding approach, just go ahead and use it, if otherwise use the GUI (where you just click the applicable menus).

Having prepared our dataset, let us now run the two-way ANOVA. The first part of this tutorial covers the two-way ANOVA analysis and the second part the post-hoc checks. I will be using the syntax approach, but will show you later on how to manoeuvre the GUI. Are you ready? On the assumption that our dataset is in line with the six rules, we begin!
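
For readers who would rather check rules 4 to 6 than assume them, one possible sequence is sketched below. It assumes discrim2.dta is in the working directory; the new variable name res is hypothetical. Stata's robvar command reports Levene's statistic as W0.

```stata
* Assumes discrim2.dta (the dataset used in this post) is in the working directory
use discrim2, clear

* Rule #4: look for outliers
summarize pfries, detail
graph box pfries, over(chain) by(state)

* Rule #5: normality of the residuals from the two-way model
anova pfries state chain state#chain
predict res, residuals
histogram res, normal
swilk res

* Rule #6: homogeneity of variances (Levene's test is the W0 statistic)
robvar pfries, by(chain)
robvar pfries, by(state)
```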

State the null and alternative hypotheses for the test
H0: state (location) has no effect on prices charged for small fries
H0: the type of fast-food chain has no effect on prices charged for small fries
H0: the state and chain interaction has no effect on prices charged for small fries
H1: in each case, the null hypothesis is not true

All codes are typed into the Command window, as shown below, and you simply press the ENTER key:
Command box in Stata
Source: CrunchEconometrix
(Used with written permission from Stata)

Two-way ANOVA Procedure
I will approach this from two angles.

First, we may want to know the main effects of each explanatory variable on the dependent variable, and the syntax is:

anova y x1 x2

where y is the dependent variable (pfries), x1 is the categorical/explanatory variable state and x2 is the categorical/explanatory variable chain. This becomes:

anova pfries state chain

The Stata output is shown as:
 
Stata output on the main effects
Source: CrunchEconometrix
The Stata output churns out quite a lot of information. For instance, the number of observations is given as 393 instead of 410 because 17 observations have missing values. The F-statistics and the associated p-values are also indicated. For the Model, the F-statistic (55.25) and its associated p-value (0.0000) show that the two categorical variables jointly explain pfries significantly. For state and chain, the individual F-statistics and associated p-values indicate that each has a significant effect on pfries. The R² (0.3629) shows that about 36% of the variation in pfries is explained by state and chain.
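
It may help to see that the main-effects ANOVA is simply a linear regression on indicator (dummy) variables. The sketch below assumes the same dataset is in memory; the joint tests from testparm should reproduce the F-statistics that anova reports for chain and state.

```stata
* Main-effects two-way ANOVA
anova pfries state chain
* The same model written as a regression with indicator variables
regress pfries i.state i.chain
* Joint test of the chain indicators
testparm i.chain
* Test of the state indicator
testparm i.state
```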

Second, to obtain both the individual and interactive effects of state and chain on pfries, the syntax is:

anova pfries state chain state#chain

and the Stata output is as shown below:
Stata output on the main and interaction effects
Source: CrunchEconometrix
The explanations are similar to those stated previously, except for the addition of the interaction term state#chain. Here the F-statistic (0.31) and its associated p-value (0.8204) show that the interaction between the two categorical variables is not statistically significant in explaining pfries. If a statistically significant interaction is observed, the result can be followed up by determining if there are any “simple main effects”, and if there are, what these effects are.
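
As a sketch of how such a follow-up could look, Stata's margins and contrast commands can be run after the interaction model. This assumes the same dataset is in memory and a version of Stata that includes margins, marginsplot and contrast.

```stata
* state##chain is shorthand for: state chain state#chain
anova pfries state##chain
* Predicted mean price in each state-by-chain cell
margins state#chain
* Plot the cell means; roughly parallel lines suggest little interaction
marginsplot
* Simple main effects: the effect of chain within each state
contrast chain@state
```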

Post-hoc tests
The F-statistic tells us whether there is a need to perform a post-hoc test. If the statistic is significant, as it is for state and chain, then post-hoc tests can be done. Where the statistic is not significant, as in the case of the interaction term state#chain, there is no need to discuss the variable further: we cannot reject the hypothesis that its effect in the population is zero, so we act as if it is zero.

Bottom line: only discuss the results that are significant! 

Therefore, since the main effect of each categorical variable is significant, post-hoc tests can be performed in the same way as after a one-way ANOVA. In this example, we use Scheffe’s test. The test is unnecessary for state, since there are only two means and the F-statistic has already shown that the difference between them is statistically significant. However, because chain has four groups, Scheffe’s test is useful for identifying which pairs of groups differ significantly in their mean prices. The test can be computed using the syntax:

oneway pfries chain, scheffe

The Stata output is shown below:
Scheffe's post-hoc test in Stata
Source: CrunchEconometrix
The Scheffe multiple comparison test tells us where the differences lie between each pair of means. In a scenario with more than two groups, the test also adjusts the reported significance levels to account for the fact that multiple comparisons are being made. Thus, as can be seen from the printout, the difference between the means of BK and KFC is -.053457 and is significant at the 1% level. Of all six pairwise comparisons, only the difference between WD and KFC (.012284) falls just short of being statistically significant.
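An alternative, sketched here under the assumption that the two-way model without interaction is fitted first, is pwcompare with the Scheffe adjustment. Unlike oneway, it compares the chain means adjusted for state, so the differences and p-values need not match the printout above exactly.

```stata
* Fit the main-effects model first; pwcompare needs stored estimation results
anova pfries state chain

* All 6 pairwise differences among the 4 chains, with Scheffe-adjusted
* p-values and confidence intervals
pwcompare chain, mcompare(scheffe) effects
```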

Addendum:
For reference, here is how to use the graphical user interface (GUI) to run the two-way ANOVA.

Go to Stata >> Statistics >> Linear models and related >> ANOVA/MANOVA >> Analysis of variance and covariance from the top menu, as shown below.
Graphical user interface (GUI) for Two-way ANOVA in Stata
Source: CrunchEconometrix
(Used with written permission from Stata)
A dialogue box for anova - Analysis of variance and covariance opens:
1.    Under Dependent variable, select pfries from the drop-down menu

You should have something like this:
Dialog box for dependent variable in Two-way ANOVA, Stata
Source: CrunchEconometrix
(Used with written permission from Stata)
To analyse the individual effects of both categorical variables on the dependent variable, here is what to do:
2.    Click on the three-dot button to the far right of the Model: drop-down box. Another dialog box, Create varlist with factor variables, opens:
·    Under Type of variable, leave Factor variable unchanged
·    Under Specification, leave Main effects unchanged
·    Open the drop down menu under Variable 1 >> select state >> Add to varlist
·    Again, open the drop down menu under Variable 1 >> select chain >> Add to varlist

Both state and chain will be shown under Varlist, so you should have something like this:
Dialog box for factor variables in Two-way ANOVA, Stata
Source: CrunchEconometrix
(Used with written permission from Stata)

3.    Click OK. The previous dialog box is updated, with state and chain now appearing under Model.

4.    Click OK to obtain the same ANOVA output as with the syntax approach.

To analyse the interactive effects of both categorical variables on the dependent variable, here is what to do:
1.    Click on the three-dot button to the far right of the Model: drop-down box. Another dialog box, Create varlist with factor variables, opens:
·    Under Type of variable, leave Factor variable unchanged
·    Under Specification, select Interaction (2-way)
·    Open the drop down menu under Variable 1 >> select state
·    Open the drop down menu under Variable 2 >> select chain >> Add to varlist
2.    Click OK.

If it’s correctly done, state chain state#chain will show under Varlist, so you have something like this:
Dialog box for factor and interaction variables in Two-way ANOVA, Stata
Source: CrunchEconometrix
(Used with written permission from Stata)
3.    Click OK and you will obtain the same ANOVA output as with the syntax approach.

Summary of points to note when running a two-way ANOVA:
1.    Inform readers about the nature of your study (tell us what you are about to do)
2.    Ensure that your dependent variable is a continuous variable
3.    The explanatory variables must be categorical variables with at least two groups
4.    The groups must not overlap: no observation may belong to more than one group
5.    State the null and alternative hypotheses.
6.    Run the two-way ANOVA before carrying out any post-hoc checks that rely on its stored results; otherwise Stata will give an error message.
7.    Report the F-statistic, degrees of freedom (df), the level of significance (the prob value [Prob>F])
8.    State whether there were statistically significant differences between your groups and on the interaction term. Report the interaction first if it is significant.
9.    Report the results from the post-hoc checks and their prob values.
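The figures to report in points 7 and 8 can also be read from the results that anova stores. This is a minimal sketch, assuming the interaction model from this post; e(F), e(df_m), e(df_r) and e(r2) are the overall-model results, and ereturn list shows the full set available in your version of Stata.

```stata
* Fit the two-way model with interaction
anova pfries state##chain

* List everything anova stored, including the term-by-term results
ereturn list

* Overall model F statistic, its degrees of freedom, p-value and R-squared
display "F(" e(df_m) ", " e(df_r) ") = " %8.2f e(F)
display "Prob > F = " %6.4f Ftail(e(df_m), e(df_r), e(F))
display "R-squared = " %6.4f e(r2)
```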

ASSIGNMENT
Using Wooldridge’s discrim2.dta or discrim2.xlsx, show whether the price of fries (pfries2) differs among the four food chains (Burger King, Kentucky Fried Chicken, Roy Rogers and Wendy’s) and across the two states, New Jersey and Pennsylvania.



If you have further questions on how to run the two-way ANOVA procedure and the post-hoc tests, kindly post your comments and questions below….