Note: This tutorial is somewhat detailed!
Data is essential to all disciplines, professions and fields of endeavour, whether in the social sciences, arts, technology, the life sciences or medicine. The truth is, who are we without data? Data, whether qualitative or quantitative, is informative. It tells us about past and current occurrences. With data, predictions and forecasts can be made, either to forestall a recurring negative trend or to improve future events. Either way, knowing the rules that guide the use of data, and how to make it communicate, is very important, since data often arrives as voluminous figures or statements. In the same vein, undertaking research is impossible without data. I mean, what would be the essence of your research if you had no data? In other words, research and data are like Siamese twins.
Everyone has different views about how research should be undertaken and how data should be analysed. After all, isn't that why we have different schools of thought? I guess that's why. So, what I am about to teach are simple steps, common to all disciplines, that are required to undertake any form of research and analyse the data accordingly. Therefore, whether you are a student or a practitioner, you will find this guide very helpful. Although I may be a bit biased towards economics, this approach is not foolproof, and regardless of what you already know (and whatever your field is), you will learn a thing or two from this tutorial.
So, let us dig in…
1. State the underlying theory.
You must have a theory underlying your study or research. Theories are hypotheses, statements, conjectures, ideas and assumptions that someone somewhere came up with at some point in time, such as Darwin's theory of evolution, the Malthusian theory of food and population growth, Keynes' theory of consumption, or the McKinnon-Shaw hypothesis on financial reforms. Every discipline has its fair share of theories, so make sure you have a theory upon which your research hinges. It is this theory you set out to test with the available data, and that is what culminates in a piece of research. Right now, I have a funny theory of my own: that countries with strong and efficient institutions have lower income inequality (…oh well, I just came up with that!). Or yours could be that richer countries have happier citizens. Anyone can have a theory. Have a theory before you begin that research!
2. Specify the theoretical (mathematical) model
Having established the theory within which you are situating your research, the next thing to do is to state the theoretical model. Remember, since theories are statements (and therefore somewhat unobservable), you have to express them in a functional mathematical form that embodies the theory. The model to be specified is a set of mathematical equations. For instance, given my postulated negative relationship between effective institutions and income inequality, a mathematical economist might specify it as:
INQ = b1 + b2INST   [1]
So, equation [1] becomes the mathematical model of the relationship between institutions and income inequality, where INQ = income inequality, INST = institutions, and b1 and b2 are the parameters of the model: the intercept and slope coefficients respectively. According to the theory, b2 is expected to have a negative sign.
The variable appearing on the left side of the equality sign is the dependent variable or regressand, while the one on the right side is called the independent or explanatory variable, or regressor. Again, if the model has one equation, as in equation [1], it is known as a single-equation model; if it has more than one equation, it is called a multiple-equation model. Therefore, the inequality-institutions model stated above is a single-equation model.
3. Specify the empirical model
The word "empirical" connotes knowledge derived from experimentation, investigation or verification. The mathematical model stated in equation [1] is therefore of limited interest to the econometrician, who must modify it to make it suitable for analysis. This is because that model assumes an exact relationship between effective institutions and income inequality, whereas the relationship is generally inexact. If we were to obtain institutional data on 10 countries known to have good rankings on governance, rule of law or corruption, we would not expect all 10 observations to lie exactly on a straight line. The reason is that, aside from the quality or effectiveness of institutions, other variables affect income inequality. Variables such as income level, education, access to loans and economic opportunities are likely to exert some influence on it. Therefore, to capture the inexact relationships between and among economic variables, the econometrician will modify equation [1] as:
INQ = b1 + b2INST + u   [2]
Thus, equation [2] becomes the econometric model of the relationship between institutions and income inequality. It is with this model that the econometrician verifies the inequality-institutions hypothesis using data. Here, u is the disturbance term, often called the error term: a random variable that captures the other factors affecting income inequality that are not taken into account explicitly by the model.
Technically, equation [2] is an example of a linear regression model. The major difference between equations [1] and [2] is that the econometric inequality function hypothesises that the dependent variable INQ is linearly related to the explanatory variable INST, but that this relationship is not exact; it is subject to individual variation, represented by u.
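To see what the disturbance term does in practice, here is a minimal Stata sketch that simulates equation [2] with made-up numbers (a true intercept of 40 and a true slope of -5) and then fits a line through the noisy data:

* a hypothetical simulation of equation [2] for 50 countries
clear
set obs 50
set seed 12345
gen inst = -2.5 + 5*runiform()   // institutional quality score between -2.5 and +2.5
gen u = rnormal(0, 3)            // the disturbance term
gen inq = 40 - 5*inst + u        // inequality is not an exact function of inst
regress inq inst                 // estimates come out close to, but not exactly, 40 and -5

Because of u, the estimated coefficients only approximate the true values; that is precisely the inexactness equation [2] is built to capture.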
4. Data
Now that you have the theory and have successfully constructed your model, the question is: do you have data? To estimate the econometric model stated in equation [2], data is essential for obtaining the numerical estimates of b1 and b2. Your choice of data depends on the structure or nature of your research, which may determine whether you require qualitative or quantitative data. Related to that is whether you require primary or secondary data. As a researcher, you can also mix qualitative and quantitative data to gain breadth and depth of understanding and to corroborate what others have done; this is known as a mixed-methods approach, and there is a growing body of researchers using it. At this point, you already know whether the data is available for your research or not.
When sourcing your data, identify the dependent variable and the explanatory variables. Let me say a word or two about the explanatory variables: they can be further broken down into key explanatory variables and control variables. The control variables are not directly in the scope of your research, but they are often included to test whether the expected a priori sign on the key explanatory variable still holds once they enter the regression model. For instance, in the inequality-institutions model, the dependent variable is INQ, the key explanatory variable is INST, and I may decide to control for education, per capita income and access to loans; those last three are the control variables. Also, in applied research, data is often plagued by approximation errors, incomplete coverage or omitted variables. The social sciences, for instance, often depend on secondary data and usually have no way of identifying the errors made by the agency that collected the primary data. That being said, do not engage in any research without first confirming that data is available.
…So, start sourcing and putting your data together; we are about to delve into some pretty serious stuff! :)
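Once your data is together (say, in an Excel workbook), getting it into Stata takes one command. A minimal sketch, assuming a hypothetical file inequality.xlsx with variable names in its first row:

* import a hypothetical Excel dataset into Stata
import excel "inequality.xlsx", firstrow clear
describe              // check variable names and storage types
summarize             // a quick first look at every variable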
5. Methodology
The next thing is knowing what methodology to apply. This is peculiar to your research and your discipline. There are many methodologies; identify the one which best fits your model and use it.
6. Analytical software
Students often ask me this question: "what analytical software should I use?" My answer has been, and will always be: "use the software that you are familiar with". Don't be lazy! Be proficient in the use of at least one analytical package. There are quite a number of them out there – Stata, R, EViews, SPSS, SAS, Python, SQL, Excel, and so on. Learn how to use any of them; there are plenty of tutorial videos on YouTube. For instance, I am very proficient in the use of Stata and Excel, with above 60% proficiency in the use of EViews, SAS and SQL. As a researcher and data analyst, you cannot be taken seriously if you cannot lay claim to some level of proficiency in at least one of these packages. I use Stata, I love Stata, and I will be giving out periodic hands-on tutorials on how to use Stata to analyse your data. By way of information, I currently analyse data using the Stata 13.1 package.
So, let us dig in further… it is getting pretty interesting and more involving :)
7. Estimation technique
This is the method for obtaining the estimates of your model's parameters, or at least an approximation to them. It is the method that finds the parameter estimates that best minimise the discrepancies between the observed sample and the fitted or predicted model. At this point, you already know which technique will best give you unbiased estimates.
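The workhorse technique in econometrics, and the one behind the ordinary least squares (OLS) estimates mentioned later in this tutorial, chooses the values of b1 and b2 that minimise the sum of squared residuals. In the notation of equation [2]:

minimise Σ(INQ − b1 − b2INST)²   over b1 and b2, summed over all observations

In words: OLS picks the line that passes as closely as possible through the cloud of data points, where "closely" is measured by squared vertical distances.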
8. Pre-estimation checks
At this point, you are almost set to begin analysing your data. However, before you proceed, your data must be subjected to some pre-estimation checks. I am very sure that every discipline has pre-estimation checks in place before any analysis is carried out. In economics there are several of them, such as multicollinearity tests, summary statistics (mean, standard deviation, minimum, maximum, kurtosis, skewness, normality etc.), stationarity tests and the Hausman test. It is from these checks that you identify and correct any abnormality in your data. You may observe the presence of an outlier (a figure that stands out conspicuously because it is abnormally low or high). You will also get information on the deviation of a variable from its mean (average value). The shape of the probability distribution also matters – is it mesokurtic, platykurtic or leptokurtic? That is, you may want to know whether your data is heavy- or light-tailed. Also, if you are using time-series data, the stationarity of each variable should be of paramount interest, and if it is panel data (a combination of time-series and cross-sectional data), the Hausman test comes in handy for deciding which estimator (fixed or random effects) to adopt. The bottom line: always carry out some pre-estimation checks before you begin your analysis!
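Here is a minimal Stata sketch of some of these checks (variable and identifier names are hypothetical):

* summary statistics: mean, sd, min, max, skewness, kurtosis
summarize inq inst educ, detail
* pairwise correlations among the regressors as a first multicollinearity check
correlate inst educ
* variance inflation factors after a trial regression
regress inq inst educ
estat vif
* stationarity test for time-series data (after tsset year):
* dfuller inq
* Hausman test for panel data (after xtset country year):
* xtreg inq inst educ, fe
* estimates store fe
* xtreg inq inst educ, re
* estimates store re
* hausman fe re

The time-series and panel lines are commented out because they only apply to those data structures; uncomment whichever matches your dataset.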
9. Functional form of the model
Linear relationships are not common to all economic research, and it is usual to come across studies that incorporate nonlinearities into their regression analysis simply by changing the functional form of the dependent variable (regressand), the independent variables (regressors), or both. More often than not, econometricians transform variables from their level forms to logged forms using natural logarithms (denoted ln). Since variables come in different measurements, it is crucial to know how they are measured in order to make sense of their regression estimates in an equation. For example, in the inequality-institutions model, the inequality variable (using the Gini index) ranges between 0 and 100, while the institutions variable is a score ranging between -2.5 and +2.5; obviously these two variables have different measurements. An important advantage of transforming variables into natural logarithms (logs, for short) is therefore that it puts the variables on a comparable scale and yields a constant-elasticity relationship and interpretation. It also dampens the influence of outliers in the data, amongst other benefits. Let me state here that when the units of measurement of the dependent and independent variables change, the ordinary least squares (OLS) estimates change in entirely expected ways.
Note: changing the units of measurement of only the regressor does not affect the intercept.
Table: different functional forms of a model
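A standard version of that table, as summarised in most introductory econometrics texts, reads:

Model        Regressand   Regressor   Interpretation of b2
Level-level  INQ          INST        a 1-unit change in INST changes INQ by b2 units
Log-level    lnINQ        INST        a 1-unit change in INST changes INQ by about 100*b2 per cent
Level-log    INQ          lnINST      a 1% change in INST changes INQ by about b2/100 units
Log-log      lnINQ        lnINST      a 1% change in INST changes INQ by b2 per cent (an elasticity)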
So, given the inequality-institutions
model, I may decide to re-specify equation [2] in a log-linear form to obtain
an elasticity relationship. That is:
lnINQ = b1 + b2lnINST + u   [3]
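One practical caveat before estimating equation [3]: logs are only defined for positive values, and the institutions score ranges between -2.5 and +2.5, so in practice it would first be shifted into positive territory. A minimal Stata sketch (the variable names and the size of the shift are hypothetical):

* generate logged variables for the log-linear model
gen lninq  = ln(gini)           // Gini index, which ranges over 0-100
gen lninst = ln(inst + 2.6)     // shift the -2.5 to +2.5 score above zero before logging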
10. Estimate the model
Prior to the existence of analytical software, econometricians went through the cumbersome process of computing regression coefficients by hand. Well, I am glad to tell you that those days are gone forever! With the advent of computerised analytical packages like Stata, EViews, R and the rest of them, all you have to do is feed your data into whichever one you are familiar with and click "RUN"… and voila! You have your results in a split second! Most, if not all, of these packages are Excel-friendly. That is, you first put your data into an Excel-compatible format (a .csv, .xls or .xlsx file) and then feed it in. This is the easiest part of the entire data-analysis process; every researcher loves being at this stage. All you need do is feed in your data, click "RUN", and your result is churned out! However, whether your coefficients turn out according to your expectations is an entirely different story (one that won't be told in this write-up… hahahaha :)).
From the regression output, Stata (just like other packages) provides the beta coefficients, standard errors, t-statistics, probability values, confidence intervals, R2, the F-statistic, the number of observations, the degrees of freedom, and the explained (denoted Model) and unexplained (denoted Residual) sums of squares. I will cover analysis of variance (ANOVA) in subsequent tutorials.
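In Stata, for instance, this whole step is a single line (a sketch, assuming the logged variables created earlier):

* estimate the log-linear inequality-institutions model
regress lninq lninst
* the output header reports the observations, F-statistic, R-squared and the
* Model/Residual (ANOVA) table; the body reports each coefficient with its
* standard error, t-statistic, p-value and 95% confidence interval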
11. Results interpretation
In line with your model specification, you can then interpret your results. Always be mindful of the units of measurement (if you are not using a log-linear model).
[Stata regression output displayed here in the original post: a linear (that is, level-level) model.]
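For instance (with made-up numbers): in the level-level model of equation [2], an estimate of b2 = -3 would mean that a one-unit improvement in the institutions score is associated with a 3-point fall in the Gini index; in the log-log model of equation [3], an estimate of b2 = -0.25 would mean that a 1% improvement in institutional quality is associated with a 0.25% fall in income inequality.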
12. Hypothesis testing
Since your primary goal in undertaking the research is to test a theory or hypothesis, it is at this point that you do so, having stated your null and alternative hypotheses. Remember, any theory or hypothesis that is not verifiable by empirical evidence is not admissible as part of scientific enquiry. Now that you have obtained your results, do you reject the null hypothesis in favour of the alternative? The econometric packages always include this element in the output, so you don't have to compute it manually. Simply check your t-statistics or p-values to decide whether to reject the null hypothesis.
For instance, from the output above, the beta coefficient on crsgpa is 1.007817 and, with a standard error of 0.1038808, you can easily compute the t-statistic as 1.007817/0.1038808 = 9.70 (as given by the Stata output). Importantly, know that a large t-statistic will always provide evidence against the null hypothesis. Likewise, the p-value of 0.000 indicates that the likelihood of committing a Type I error (that is, rejecting the null hypothesis when it is true) is very, very remote… close to zero! So, when the null hypothesis is rejected, we say that the coefficient on crsgpa is statistically significant. When we fail to reject the null hypothesis, we say the coefficient is not statistically significant. It is inappropriate to say that you "accept" the null hypothesis; one can only "fail to reject" it, because you fail to reject the null hypothesis due to insufficient evidence against it (often down to the sample collected). So, we don't accept the null, we simply fail to reject it!
(Detailed rudiments of hypothesis testing and Type I and II errors will be covered in subsequent tutorials.)
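You can reproduce those numbers by hand in Stata. A quick sketch (the degrees of freedom below are hypothetical):

* recompute the t-statistic and its two-sided p-value
display 1.007817/0.1038808     // t-statistic: 9.70
display 2*ttail(120, 9.70)     // p-value: effectively 0.000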
13. Post-estimation checks
Having obtained your estimates, it is advisable to subject your model to some diagnostics. Most journals, and even your supervisor, will want to see the post-estimation checks carried out on your model. It will also give some level of confidence if your model passes the following tests: normality, stability, heteroscedasticity, serial correlation, model specification and so on. Regardless of your discipline, empirical model and estimation technique, it is essential that your results are supported by some "comforting" post-estimation checks. Find out those applicable to your model and estimation technique.
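For OLS models in Stata, a minimal sketch of common diagnostics (variable names are hypothetical) looks like this:

* run the regression, then test its assumptions
regress lninq lninst educ
estat hettest                  // Breusch-Pagan test for heteroscedasticity
estat ovtest                   // Ramsey RESET test for model specification
predict uhat, residuals
sktest uhat                    // skewness/kurtosis test for normality of residuals
* for time-series models (after tsset):
* estat bgodfrey               // Breusch-Godfrey test for serial correlation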
14. Forecasting or prediction / Submission
At this point, if the econometric model does not refute the theory under consideration, it may be used to predict (forecast) future values of the dependent variable on the basis of known or expected future values of the regressors. However, if the work is for submission, I would advise that it be proof-read as many times as possible before you submit.
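Obtaining predicted values after estimation is, again, one line in Stata. A minimal sketch:

* after estimating the model, obtain fitted (predicted) values
regress inq inst educ
predict inq_hat, xb            // linear prediction of the dependent variable
list inq inq_hat in 1/5        // compare actual and predicted values for the first 5 observations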
I hope this step-by-step guide gives you some level of confidence to engage in research and data analysis. Let me know if you have any additions or if I omitted any salient points.
Post your comments
and questions….