**Note: This tutorial is somewhat detailed!**

Data is essential to every discipline, profession and field of endeavour, whether in the social sciences, arts, technology, life sciences or medicine. The truth is, who are we without data? Data, whether qualitative or quantitative, is informative. It tells us about past and current occurrences. With data, predictions and forecasts can be made, either to forestall a negative recurring trend or to improve future events. Either way, knowing the rules guiding the use of data, and how to make it communicate, is very important, since data often arrives as a voluminous mass of figures or statements. In the same vein, undertaking research is impossible without data. I mean, what would be the essence of your research if you had no data? In other words, research and data are like *Siamese* twins.
Everyone has different views about how research should be undertaken and how data should be analysed. After all, isn't that why we have different *schools of thought*? I guess that's why. So, what I am about to teach are simple steps, common to all disciplines, that are required to undertake any form of research and analyse the resulting data accordingly. Therefore, whether you are a student or a practitioner, you will find this guide very helpful. Although I may be a bit biased towards economics and this approach is not foolproof, regardless of what you already know (and whatever your field is), you will learn a thing or two from this tutorial.

So, let us dig in…

**1. State the underlying theory**

You must have a theory underlying your study or research. Theories are hypotheses, statements, conjectures, ideas and assumptions that someone, somewhere, came up with at some point in time: Darwin's theory of evolution, the Malthusian theory of food and population growth, Keynes' theory of consumption, the McKinnon-Shaw hypothesis on financial reforms, and so on. Every discipline has its fair share of theories, so make sure you have a theory upon which your research hinges. It is this theory you are out to test with the available data, and that is what culminates in you undertaking the research. Right now, I have a *funny* theory of my own: that countries with strong and efficient institutions have lower income inequality (…oh well, I just came up with that!). Or yours could be that richer countries have happier citizens. Anyone can have a theory. Have a theory **before** you begin that research!

**2. Specify the theoretical (mathematical) model**

Having established the theory within
which you are situating your research, the next thing to do is to state the
theoretical model. Remember, since theories are statements (which are somewhat
unobservable), you have to construct them in a functional mathematical form
that embodies the theory. The model to be specified is a set of mathematical
equations. For instance, given my postulated negative relationship between
effective institutions and income inequality, a mathematical economist might specify
it as:

*INQ* = b₁ + b₂ *INST* ……………………..[1]

So, equation [1] becomes the **mathematical model** of the relationship between institutions and income inequality, where *INQ* = income inequality, *INST* = institutions, and b₁ and b₂ are the **parameters** of the model: the **intercept** and **slope** coefficients respectively. According to the theory, b₂ is expected to have a **negative** sign.
The variable appearing on the left side of the equality sign is the **dependent variable** or **regressand**, while the one on the right side is called the **independent** or **explanatory variable** or **regressor**. Again, if the model has one equation, as in equation [1], it is known as a **single-equation** model; if it has more than one equation, it is called a **multiple-equation** model. Therefore, the inequality-institutions model stated above is a single-equation model.
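To make the idea concrete, equation [1] can be written as an ordinary function. This is a minimal sketch: the parameter values b₁ = 40 and b₂ = -8 are purely hypothetical numbers chosen for illustration, not estimates from any data.

```python
# Hypothetical parameter values for illustration only (not estimated).
def inq_model(inst, b1=40.0, b2=-8.0):
    """Equation [1]: INQ = b1 + b2 * INST -- an exact, deterministic relationship."""
    return b1 + b2 * inst

# With b2 negative, better institutions imply lower predicted inequality:
print(inq_model(0.5))   # 36.0
print(inq_model(2.0))   # 24.0
```

Notice that for any given *INST* value the model returns exactly one *INQ* value; that exactness is precisely what the next step relaxes.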

**3. Specify the empirical model**

The word “empirical” connotes knowledge
derived from experimentation, investigation or verification. Therefore, the
mathematical model stated in equation [1] is of limited interest to the econometrician. The
econometrician must modify equation [1] to make it suitable for analysis of
some sort. This is because that model assumes an exact relationship exists between effective institutions and income inequality. However, this relationship is generally inexact: if we were to obtain institutional data on 10 countries known to have good rankings on governance, rule of law or corruption, we would not expect all of them to lie exactly on the straight line. The reason is that, aside from the quality or effectiveness of institutions, other variables affect income inequality. Variables such as income level, education, access to loans and economic opportunities are likely to exert some influence on income inequality. Therefore, to capture the *inexact* relationships between and among economic variables, the econometrician will modify equation [1] as:

*INQ* = b₁ + b₂ *INST* + *u* ……………………..[2]

Thus, equation [2] becomes the **econometric model** of the relationship between institutions and income inequality. It is with this model that the econometrician verifies the inequality-institutions hypothesis using data, where *u* is the **disturbance term**, often called the **error term**. The error term is a random variable that captures other factors affecting income inequality that are not explicitly taken into account by the model. Technically, equation [2] is an example of a **linear regression model**. The major difference between equations [1] and [2] is that the econometric inequality function hypothesises that the dependent variable *INQ* is linearly related to the explanatory variable *INST*, but that this relationship is **not** exact due to individual variation, represented by *u*.
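The role of the error term is easy to see in a quick simulation. This is a sketch under assumed values: b₁, b₂ and the spread of *u* are hypothetical numbers, and the *INST* values are drawn at random rather than taken from any real governance dataset.

```python
import random

random.seed(42)

# Hypothetical parameter values, purely for illustration.
b1, b2 = 40.0, -8.0

# Simulate equation [2]: INQ = b1 + b2*INST + u, where u is a random disturbance.
inst = [random.uniform(-2.5, 2.5) for _ in range(10)]
u    = [random.gauss(0, 2.0) for _ in range(10)]        # the error term
inq  = [b1 + b2 * x + e for x, e in zip(inst, u)]

# Unlike equation [1], the observations no longer lie exactly on the line:
for x, y in zip(inst[:3], inq[:3]):
    print(f"INST={x:+.2f}  INQ={y:.2f}  exact line would give {b1 + b2*x:.2f}")
```

Each simulated country sits a little above or below the straight line, which is exactly the "individual variation" that *u* represents.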

**4. Data**

Now that you have the theory and have successfully constructed your model, the question is: do you have data? To estimate the econometric model stated in equation [2], data is essential to obtain the numerical estimates of b₁ and b₂. Your choice of data depends on the structure or nature of your research, which may determine whether you require qualitative or quantitative data, and, in line with that, whether you require primary or secondary data. As a researcher, you can also mix qualitative and quantitative data to gain breadth and depth of understanding while corroborating what others have done; this is known as *mixed-methods* research, and there is a growing body of researchers using this approach. At this point, you already know whether the data for your research is available or not.

When sourcing your data, identify the dependent variable and the explanatory variables. Let me say a word or two on the explanatory variables. They can further be broken into key explanatory variables and **control variables**. The control variables are not directly in your scope of research, but they are often included to test whether the expected ***a priori*** sign on the key explanatory variable still holds once they enter the regression model. For instance, using the inequality-institutions model, the dependent variable is *INQ*, the key explanatory variable is *INST*, and I may decide to control for education, per capita income and access to loans; these last three variables are the control variables. Also, in applied research, data is often plagued by approximation errors, incomplete coverage or omitted variables. For instance, the social sciences often depend on secondary data and usually have no way of identifying the errors made by the agency that collected the primary data. That being said, do not engage in any research without first knowing that data is available.
…So, start sourcing and putting your data together; we are about to delve into some pretty serious stuff!
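One tidy habit when putting the data together is to be explicit about which column plays which role. This is just an organisational sketch: every number and variable name below (educ, gdp_pc, credit) is an invented placeholder, not real country data.

```python
# Illustrative placeholder data -- four imaginary countries.
data = {
    "INQ":    [34.1, 41.2, 29.8, 50.3],   # dependent variable (Gini index)
    "INST":   [ 1.2, -0.4,  1.8, -1.1],   # key explanatory variable
    "educ":   [12.0,  9.5, 13.1,  7.8],   # control: average years of schooling
    "gdp_pc": [ 9.8,  8.1, 10.2,  7.4],   # control: log per-capita income
    "credit": [45.0, 20.0, 60.0, 15.0],   # control: access to loans (% of GDP)
}

dependent     = "INQ"
key_regressor = "INST"
controls      = [k for k in data if k not in (dependent, key_regressor)]
print(controls)   # ['educ', 'gdp_pc', 'credit']
```

Labelling the roles up front makes it obvious, later on, which coefficient carries your hypothesis and which ones are merely there to guard it.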

**5. Methodology**

The next thing is knowing what methodology to apply. This is peculiar to your research and your discipline. There are many methodologies; identify the one which best fits your model and use it.

**6. Analytical software**

Students often ask me this question: "what analytical software should I use?" My answer has been and will always be: "use the software that you are familiar with". Don't be lazy! Be proficient in the use of at least one analytical package. There are plenty of them out there – Stata, R, EViews, SPSS, SAS, Python, SQL, Excel, and so on. Learn how to use any of them; there are many tutorial videos on YouTube. For instance, I am very proficient in the use of Stata and Excel, with above 60% proficiency in EViews, SAS and SQL. As a researcher and data analyst, you cannot be taken seriously if you cannot lay claim to some level of proficiency in at least one of these packages. I use Stata, I love Stata, and I will be giving out periodic hands-on tutorials on how to use Stata to analyse your data. By way of information, I currently analyse data using the Stata 13.1 package.

So, let us dig in further… it is getting pretty interesting and more involving.

**7. Estimation technique**

This is the method of obtaining the estimates for your model, or at least an approximation to them. It is the method of finding the parameter estimates that best *minimise* the discrepancies between the observed sample(s) and the fitted or predicted model. At this point, you already know what technique to apply to obtain *unbiased* estimates.
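For the two-variable model, the most familiar such technique is ordinary least squares, and it can be written out from first principles in a few lines. The toy data below is invented for illustration; this is a sketch of the textbook formulas, not a replacement for your statistical package.

```python
from statistics import mean

def ols(x, y):
    """Simple OLS: choose b1, b2 to minimise the sum of squared discrepancies
    between the observed y and the fitted line b1 + b2*x."""
    xbar, ybar = mean(x), mean(y)
    b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
         / sum((xi - xbar) ** 2 for xi in x)
    b1 = ybar - b2 * xbar          # the fitted line passes through the means
    return b1, b2

# Toy data (illustrative only): y falls as x rises.
x = [0.0, 1.0, 2.0, 3.0]
y = [40.1, 32.2, 23.9, 16.0]
b1, b2 = ols(x, y)
print(round(b1, 2), round(b2, 2))   # 40.14 -8.06
```

The slope formula is just the sample covariance of x and y divided by the variance of x; every package, Stata included, is doing some generalisation of this under the hood.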

**8. Pre-estimation checks**

At this point, you are almost set to begin
analysing your data. However, before you proceed, your data must be subjected
to some pre-estimation checks. I am very sure that every discipline has these
pre-estimation checks in place before carrying out any analysis. In economics
there are several of them, such as: multicollinearity test, summary
statistics (mean, standard deviation, minimum, maximum, kurtosis, skewness,
normality etc.), stationarity test, Hausman test etc. It is from these tests
that you identify and correct any abnormality in your data. You may observe the
presence of an outlier (when a figure stands out conspicuously either because
it is abnormally low or high). You will also get some information regarding the
deviation of a variable from the mean (average value). The shape of the probability distribution is also important: is it mesokurtic, platykurtic or leptokurtic? You may want to know whether your data is heavy- or light-tailed. Also, if you are using time-series data, the stationarity of each variable should be of
paramount interest, and if it is panel data (a combination of time-series and cross-sectional data), the Hausman test should come in handy in deciding which estimator (fixed or random effects) to adopt. The bottom line is: **always carry out some pre-estimation checks before you begin your analysis!**
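A few of these summary statistics are simple enough to compute by hand. The sketch below uses an invented sample with one deliberately extreme value, so you can see an outlier show up as strong positive skewness and a heavy-tailed (leptokurtic) kurtosis well above 3.

```python
from statistics import mean, pstdev

def describe(x):
    """Summary statistics of the kind a pre-estimation check would report."""
    m, s, n = mean(x), pstdev(x), len(x)
    skew = sum(((v - m) / s) ** 3 for v in x) / n
    kurt = sum(((v - m) / s) ** 4 for v in x) / n   # equals 3 for a normal distribution
    return {"mean": m, "sd": s, "min": min(x), "max": max(x),
            "skewness": skew, "kurtosis": kurt}

# Toy sample with one conspicuous outlier at 95.
gini = [32, 35, 38, 31, 36, 34, 33, 95]
stats = describe(gini)
print({k: round(v, 2) for k, v in stats.items()})
```

If you saw numbers like these in your own data, the outlier is the first thing you would investigate before estimating anything.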

**9. Functional form of the model**

Linear relationships are not the rule in economic research; it is common to come across studies that incorporate nonlinearities into their regression analysis simply by changing the functional form of the dependent variable (regressand), the independent variables (regressors), or both. More often than not, econometricians transform variables from their level form using natural logarithms (denoted *ln*). Since variables come in different measurements, it is crucial to know how they are measured in order to make sense of the regression estimates. For example, in the inequality-institutions model, the inequality variable (the Gini index) ranges between 0 and 100, while the institutions variable is an index ranging between -2.5 and +2.5; obviously these two variables are measured on different scales. An important advantage of transforming variables into natural logarithms (*logs*, for short) is that it puts the variables on comparable scales and yields a *constant elasticity* interpretation. It also dampens the influence of outliers in the data, amongst other benefits. Let me state here that when the units of measurement of the dependent and independent variables change, the ordinary least squares (OLS) estimates change in entirely predictable ways.
Note: changing the units of measurement
of only the regressor does not affect the intercept.
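That note is easy to verify numerically. The sketch below (with made-up data) fits the same simple OLS regression twice, once with the regressor rescaled by a factor of 100: the intercept comes out identical, and the slope simply shrinks by the same factor of 100.

```python
from statistics import mean

def ols(x, y):
    # simple two-variable OLS, just to check the claim in the note above
    xbar, ybar = mean(x), mean(y)
    b2 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) \
         / sum((a - xbar) ** 2 for a in x)
    return ybar - b2 * xbar, b2

x = [1.0, 2.0, 3.0, 4.0]          # illustrative data
y = [10.5, 14.8, 19.1, 23.6]

b1_a, b2_a = ols(x, y)
b1_b, b2_b = ols([v * 100 for v in x], y)   # rescale the regressor only

print(round(b1_a, 3), round(b1_b, 3))   # intercepts identical
print(round(b2_a, 3), round(b2_b, 5))   # slope divided by 100
```

So a change of units in the regressor never touches the intercept; only the slope rescales to compensate.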

*Table showing different functional forms of a model*

So, given the inequality-institutions
model, I may decide to re-specify equation [2] in a log-linear form to obtain
an elasticity relationship. That is:

ln *INQ* = b₁ + b₂ ln *INST* + *u* ……………………..[3]
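The "constant elasticity" reading of a log-log model like equation [3] can be checked directly. In this sketch the elasticity b₂ = -0.8 is a hypothetical value: raising x by 1% moves y by roughly b₂ percent, regardless of the units of either variable.

```python
import math

# A constant-elasticity (log-log) relationship with a hypothetical elasticity of -0.8.
b1, b2 = 2.0, -0.8

def y_of(x):
    """ln y = b1 + b2 ln x, solved for y."""
    return math.exp(b1 + b2 * math.log(x))

# Raise x by 1% and look at the percentage change in y:
x = 50.0
pct_change_y = (y_of(x * 1.01) / y_of(x) - 1) * 100
print(round(pct_change_y, 3))   # roughly b2 = -0.8 percent
```

Try other values of x: the percentage response is the same everywhere on the curve, which is exactly why log-log coefficients are interpreted as elasticities.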

**10. Estimate the model**

Prior to the existence of analytical software, econometricians went through the cumbersome process of computing regression coefficients manually. Well, I am glad to tell you that those days are gone forever! With the advent of computerised analytical packages like Stata, EViews, R and the rest of them, all you have to do is feed your data into whichever one you are familiar with and click "*RUN*"… and voilà! You have your results in a split second! Most, if not all, of these packages are Excel-friendly; that is, you first put your data into an Excel-readable format (a .csv, .xls or .xlsx file) and then feed it in. This is the easiest part of the entire data-analysis process, and every researcher loves being at this stage. All you need do is feed in your data, click "*RUN*" and your result is churned out! However, whether your coefficients will match your expectations is an entirely different story (one that won't be told in this write-up… hahahaha).
From the regression output, Stata (just like other packages) provides the beta coefficients, standard errors, *t*-statistics, probability values, confidence intervals, R², the *F*-statistic, the number of observations, the degrees of freedom, and the explained (denoted *Model*) and unexplained (denoted *Residual*) sums of squares. I will cover analysis of variance (ANOVA) in subsequent tutorials.
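For the simple two-variable case, the headline numbers in that output come from a handful of textbook formulas. This sketch (with invented data) computes them by hand: the coefficients, the standard error of the slope, its *t*-statistic and R², using the n − 2 degrees of freedom of a simple regression.

```python
import math
from statistics import mean

def ols_output(x, y):
    """Coefficients plus the headline output statistics (simple regression)."""
    n, xbar, ybar = len(x), mean(x), mean(y)
    sxx = sum((v - xbar) ** 2 for v in x)
    b2  = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    b1  = ybar - b2 * xbar
    resid  = [yi - (b1 + b2 * xi) for xi, yi in zip(x, y)]
    ssr    = sum(e ** 2 for e in resid)            # unexplained ("Residual") sum of squares
    sst    = sum((yi - ybar) ** 2 for yi in y)     # total sum of squares
    sigma2 = ssr / (n - 2)                         # residual variance, n-2 degrees of freedom
    se_b2  = math.sqrt(sigma2 / sxx)               # standard error of the slope
    return {"b1": b1, "b2": b2, "se_b2": se_b2,
            "t_b2": b2 / se_b2, "R2": 1 - ssr / sst, "n": n}

x = [1.0, 2.0, 3.0, 4.0, 5.0]     # illustrative data
y = [9.9, 8.1, 6.2, 3.8, 2.1]
out = ols_output(x, y)
print({k: round(v, 3) for k, v in out.items()})
```

Seeing the same quantities fall out of a dozen lines of arithmetic makes the packaged output much less mysterious.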

**11. Results interpretation**

In line with your model specification, you can then interpret your results. Always be mindful of the units of measurement (if you are not using a log-linear model). The results output shown above is for a linear (that is, a level-level) model.

**12. Hypothesis testing**

Since your primary goal in undertaking the research is to test a theory or hypothesis, this is the point at which you do so, having stated your null and alternative hypotheses. Remember, any theory or hypothesis that is not verifiable by empirical evidence is not admissible as part of scientific enquiry. Now that you have obtained your results, do you reject the null hypothesis in favour of the alternative? The econometric packages always include this element in the result output, so you don't have to compute it manually: simply check your *t*-statistics or *p*-values to decide whether to reject the null hypothesis or not.
For instance, from the above output, the beta coefficient for *crsgpa* is 1.007817 with a standard error of 0.1038808, so you can easily compute the *t*-statistic as 1.007817/0.1038808 = 9.70 (as given by the Stata output). Importantly, know that a large *t*-statistic will always provide evidence against the null hypothesis. Likewise, the *p*-value of 0.000 indicates that the likelihood of committing a Type I error (that is, rejecting the null hypothesis when it is true) is very, very remote… close to zero! So, when the null hypothesis is rejected, we say that the coefficient of *crsgpa* is **statistically significant**. When we fail to reject the null hypothesis, we say the coefficient is **not** statistically significant. It is inappropriate to say that you "*accept*" the null hypothesis; one can only *fail to reject* it. This is because you fail to reject the null hypothesis due to insufficient evidence against it (often a matter of the sample collected). So, we don't *accept* the null, we simply *fail to reject* it!
(Detailed rudiments of hypothesis testing, Type I and II
errors will be covered in subsequent tutorials).
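The arithmetic quoted above can be reproduced in a couple of lines. This sketch recomputes the *t*-statistic for *crsgpa* and attaches a two-sided *p*-value using a normal approximation (the stdlib `math.erfc` gives the standard-normal tail; for the large sample behind that output this is effectively the same as the *t* distribution).

```python
import math

# The coefficient and standard error quoted from the Stata output above.
beta, se = 1.007817, 0.1038808
t = beta / se
p = math.erfc(abs(t) / math.sqrt(2))   # two-sided p-value, normal approximation

print(round(t, 2))    # 9.7, matching the Stata output
print(p < 0.001)      # True: reject the null at any conventional level
```

With a *t*-statistic near 10, the evidence against the null is overwhelming, which is exactly what the 0.000 *p*-value in the output was saying.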

**13. Post-estimation checks**

Having obtained your estimates, it is advisable to subject your model to some diagnostics. Most journals, and indeed your supervisor, will want to see the post-estimation checks carried out on your model. It will also give some level of confidence if your model passes the following tests: normality, stability, heteroscedasticity, serial correlation, model specification and so on. Regardless of your discipline, empirical model and estimation technique, it is essential that your results are supported by some "comforting" post-estimation checks. Find out those applicable to your model and technique of estimation.
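One classic serial-correlation diagnostic is compact enough to show here: the Durbin-Watson statistic computed from the regression residuals. The residuals below are invented to illustrate the point; values near 2 suggest no first-order autocorrelation, while values pushed towards 0 or 4 signal positive or negative autocorrelation respectively.

```python
# Durbin-Watson statistic: a standard post-estimation check on residuals.
def durbin_watson(resid):
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(e ** 2 for e in resid)

# Illustrative residuals that alternate in sign (a tell-tale pattern):
resid = [0.5, -0.3, 0.2, -0.4, 0.1, -0.2, 0.3, -0.1]
dw = durbin_watson(resid)
print(round(dw, 2))   # well above 2, hinting at negative autocorrelation
```

Your package will report this (and its critical values) for you, but knowing the formula tells you what a "bad" number actually means.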

**14. Forecasting or prediction / Submission**

At this point, if the econometric model does not refute the theory under consideration, it may be used for predicting (forecasting) future values of the dependent variable on the basis of the known or expected future values of the regressors. However, if the work is for submission, I would advise that it be proof-read as many times as possible before doing so.
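Mechanically, prediction is nothing more than plugging the expected future regressor values into the fitted equation. The coefficient values below are hypothetical stand-ins for whatever your estimation produced.

```python
# Hypothetical fitted coefficients standing in for real estimates.
b1_hat, b2_hat = 39.8, -7.9

def predict_inq(inst_future):
    """Forecast INQ from an expected future value of INST, using equation [2]
    with the error term set to its expected value of zero."""
    return b1_hat + b2_hat * inst_future

print(predict_inq(1.5))   # forecast of INQ if INST is expected to reach 1.5
```

Of course, the forecast is only as trustworthy as the model behind it, which is why all the checks above come first.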

I hope this step-by-step guide gives you
some level of confidence to engage in research and data analysis. Let me know
if you have any additions or if I omitted some salient points.

Post your comments
and questions….
