Sunday, 14 January 2018

A Step-by-Step Tutorial on Research and Data Analysis

Note: This tutorial is somewhat detailed!

Data is essential to all disciplines, professions, and fields of endeavour, whether in the social sciences, arts, technology, life sciences or medicine. The truth is, who are we without data? Data, whether qualitative or quantitative, is informative. It tells us about past and current occurrences. With data, predictions and forecasts can be made, either to forestall a negative recurring trend or to improve future outcomes. Whichever way, knowing some rules guiding the use of data and how to make it communicate is very important, since it often comes as a large volume of figures or statements. In the same vein, undertaking research is impossible without data. I mean, what would be the essence of your research if you had no data? In other words, research and data are like Siamese twins.

Everyone has different views about how research should be undertaken and how data should be analysed. After all, isn’t that why we have different schools of thought? I guess that’s why. So, what I am about to teach are just simple steps, common to all disciplines, that are required to undertake any form of research and analyse the data accordingly. Therefore, whether you are a student or a practitioner, you will find this guide very helpful. I may be a bit biased towards economics, and this approach is not foolproof, but regardless of what you know already (and whatever your field is), you will learn a thing or two from this tutorial.

So, let us dig in…..

1.    State the underlying theory.
You must have a theory underlying your study or research. Theories are hypotheses, statements, conjectures, ideas, and assumptions that someone somewhere came up with at some point in time. Examples include Darwin’s theory of evolution, the Malthusian theory of food and population growth, Keynes’ theory of consumption, and the McKinnon-Shaw hypothesis on financial reforms. Every discipline has its fair share of theories. So make sure you have a theory upon which your research hinges. It is this theory that you set out to test with the available data, and that test is what your research amounts to. Right now, I have a funny theory of my own that countries with strong and efficient institutions have lower income inequality (…oh well, I just came up with that!). Or yours could be that richer countries have happier citizens. In short, anyone can have a theory. Have a theory before you begin that research!

2.    Specify the theoretical (mathematical) model
Having established the theory within which you are situating your research, the next thing to do is to state the theoretical model. Remember, since theories are statements (which are somewhat unobservable), you have to construct them in a functional mathematical form that embodies the theory. The model to be specified is a set of mathematical equations. For instance, given my postulated negative relationship between effective institutions and income inequality, a mathematical economist might specify it as:

                               INQ = b1 + b2·INST        [1]

So, equation [1] becomes the mathematical model of the relationship between institutions and income inequality.

Here, INQ = income inequality and INST = institutions; b1 and b2 are the parameters of the model, namely the intercept and the slope coefficient respectively. According to the theory, b2 is expected to have a negative sign.

The variable appearing on the left side of the equality sign is the dependent variable or regressand, while the one on the right side is called the independent or explanatory variable, or regressor. If the model has one equation, as in equation [1], it is known as a single-equation model; if it has more than one equation, it is called a multiple-equation model. Therefore, the inequality-institutions model stated above is a single-equation model.

3.    Specify the empirical model
The word “empirical” connotes knowledge derived from experimentation, investigation or verification. The mathematical model stated in equation [1] is therefore of limited interest to the econometrician, who must modify it to make it suitable for empirical analysis. This is because that model assumes an exact relationship between effective institutions and income inequality, whereas the relationship is generally inexact. If we were to obtain institutional data on 10 countries known to rank well on governance, rule of law or corruption, we would not expect all 10 observations to lie exactly on the straight line. The reason is that, aside from the quality of institutions, other variables affect income inequality. Variables such as income level, education, access to loans and economic opportunities are likely to exert some influence on income inequality. Therefore, to capture the inexact relationship(s) between and among economic variables, the econometrician will modify equation [1] as:

                             INQ = b1 + b2·INST + u        [2]

Thus, equation [2] becomes the econometric model of the relationship between institutions and income inequality. It is with this model that the econometrician verifies the inequality-institutions hypothesis using data.

Here, u is the disturbance term, often called the error term. The error term is a random variable that captures other factors that affect income inequality but are not explicitly taken into account by the model. Technically, equation [2] is an example of a linear regression model. The major difference between equations [1] and [2] is that the econometric inequality function hypothesises that the dependent variable INQ is linearly related to the explanatory variable INST, but that this relationship is not exact because of individual variation, represented by u.

4.    Data
Now that you have the theory and have successfully constructed your model, the question is: do you have data? To estimate the econometric model stated in equation [2], data is essential for obtaining numerical estimates of b1 and b2. Your choice of data depends on the structure or nature of your research, which determines whether you will require qualitative or quantitative data. A related question is whether you require primary or secondary data. As a researcher, you can mix qualitative and quantitative data to gain breadth and depth of understanding and to corroborate what others have done. This is known as mixed-methods research. There is a growing body of researchers using this approach. By this point, you should already know whether data is available for your research or not.

When sourcing your data, identify the dependent variable and the explanatory variables. Let me say a word or two on the explanatory variables. They can be further divided into the key explanatory variable and the control variables. The control variables are not directly within the scope of your research, but they are often included to test whether the a priori expectation on the key explanatory variable still holds once they enter the regression model. For instance, using the inequality-institutions model, the dependent variable is INQ, the key explanatory variable is INST, and I may decide to control for education, per capita income and access to loans; these last three are the control variables. Also, in applied research, data is often plagued by approximation errors, incomplete coverage or omitted variables. For instance, the social sciences often depend on secondary data and usually have no way of identifying the errors made by the agency that collected the primary data. That being said, do not engage in any research without first knowing that data is available.

...So, start sourcing and putting your data together; we are about to delve into some pretty serious stuff!

5.    Methodology
The next thing is knowing what methodology to apply. This is peculiar to your research and your discipline. There are many methodologies; identify the one that best fits your model and use it.

6.    Analytical software
Students often ask me this question: “what analytical software should I use?” My answer has always been, and will always be: “use the software that you are familiar with”. Don’t be lazy! Be proficient in the use of at least one analytical package. There are hundreds of them out there – Stata, R, EViews, SPSS, SAS, Python, SQL, Excel, and so on. Learn how to use any of them. There are so many tutorial videos on YouTube. For instance, I am very proficient in the use of Stata and Excel, with above 60% proficiency in the use of EViews, SAS and SQL. As a researcher and data analyst, you cannot be taken seriously if you cannot lay claim to some level of proficiency in any of these packages. I use Stata, I love Stata, and I will be giving periodic hands-on tutorials on how to use Stata to analyse your data. By way of information, I currently analyse data using Stata 13.1.

So, let us dig in further… it is getting pretty interesting and more involving!

7.    Estimation technique
This is the method used to obtain the estimates of your model’s parameters, or at least approximations of them. It works by finding the parameter estimates that minimise the discrepancies between the observed sample(s) and the fitted or predicted model. At this point, you should already know which technique will give the best unbiased estimates for your model.
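To make the idea of minimising discrepancies concrete, here is a minimal Python sketch (Python is only my choice for illustration; the same can be done in Stata or any other package). It assumes the simplest case, ordinary least squares with a single regressor. The function names are my own and the INST and INQ figures are made up, not real data.

```python
# Minimal OLS sketch for a one-regressor model: y = b1 + b2*x + u.
# All figures below are hypothetical and serve only as an illustration.

def ols_simple(x, y):
    """Return (intercept, slope) that minimise the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Total squared gap between observed y and the fitted line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


inst = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]       # hypothetical institutional scores
inq = [52.0, 47.0, 45.0, 40.0, 36.0, 33.0]    # hypothetical Gini values

b1, b2 = ols_simple(inst, inq)
ssr_fitted = sum_squared_residuals(inst, inq, b1, b2)
# Any other line, for example one with a different slope, fits worse.
ssr_other = sum_squared_residuals(inst, inq, b1, b2 + 1.0)

print("intercept:", b1, "slope:", b2)
print("SSR at OLS estimates:", ssr_fitted, "SSR at another slope:", ssr_other)
```

The slope comes out negative for these made-up figures, which is the sign the inequality-institutions theory expects, and moving the slope away from the OLS estimate raises the sum of squared residuals.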

8.    Pre-estimation checks
At this point, you are almost set to begin analysing your data. However, before you proceed, your data must be subjected to some pre-estimation checks. I am sure that every discipline has such checks in place before any analysis is carried out. In economics there are several of them, such as the multicollinearity test, summary statistics (mean, standard deviation, minimum, maximum, kurtosis, skewness etc.), the normality test, the stationarity test and the Hausman test. It is from these checks that you identify and correct any abnormality in your data. You may observe the presence of an outlier (a figure that stands out conspicuously because it is abnormally low or high). You will also get some information on how far a variable deviates from its mean (average value). The shape of the probability distribution is also important: is it mesokurtic, platykurtic or leptokurtic? That is, you may want to know whether your data is heavy- or light-tailed. Also, if you are using time-series data, the stationarity of each variable should be of paramount interest, and if you are using panel data (a combination of time-series and cross-sectional data), the Hausman test should come in handy in deciding which estimator (fixed or random effects) to adopt. The bottom line is: always carry out some pre-estimation checks before you begin your analysis!
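As a rough illustration of the summary-statistics part of these checks, the Python sketch below computes the mean, standard deviation, skewness and kurtosis of one variable and flags possible outliers. The Gini figures are invented, the function names are mine, and the cut-off of 2 standard deviations is only an assumed rule of thumb, not a standard you must follow.

```python
import statistics


def describe(values):
    """Summary statistics, using population (not sample) moments."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n
    kurt = sum(((v - mean) / sd) ** 4 for v in values) / n
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "min": min(values),
        "max": max(values),
        "skewness": skew,   # 0 for a symmetric distribution
        "kurtosis": kurt,   # 3 is mesokurtic; above 3 leptokurtic; below 3 platykurtic
    }


def flag_outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical Gini values; the last one is deliberately extreme.
gini = [33.0, 35.0, 36.0, 38.0, 40.0, 41.0, 43.0, 85.0]

stats = describe(gini)
outliers = flag_outliers(gini)
print(stats)
print("possible outliers:", outliers)
```

With these figures the single extreme value is flagged, the skewness is positive and the kurtosis is above 3, so the variable would be described as leptokurtic (heavy-tailed).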

9.    Functional form of the model
Linear relationships do not hold in all economic research, and it is common to come across studies that incorporate nonlinearities into their regression analysis simply by changing the functional form of the dependent variable (regressand), the independent variables (regressors), or both. More often than not, econometricians transform variables from their level forms using natural logarithms (denoted as ln). Since variables come in different units of measurement, it is crucial to know how they are measured in order to make sense of their regression estimates in an equation. For example, in the inequality-institutions model, the inequality variable (using the Gini index) ranges between 0 and 100, while the institution variable is a decimal ranging between -2.5 and +2.5; obviously, these two variables are measured differently. An important advantage of transforming variables into natural logarithms (logs, for short) is that the slope coefficient no longer depends on the units of measurement and can be interpreted as a constant elasticity. Logs also dampen the influence of outliers in the data, amongst other benefits. Keep in mind that logs are defined only for positive values, so a variable that can be zero or negative has to be rescaled before it is logged. Let me state here that when the units of measurement of the dependent and independent variables change, the ordinary least squares (OLS) estimates change in entirely expected ways.
Note: changing the units of measurement of only the regressor does not affect the intercept.
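A quick way to convince yourself of this is to rescale a variable and re-estimate. The Python sketch below does that with made-up numbers and my own helper function. Multiplying the regressor by 100 should leave the intercept untouched and divide the slope by 100, while multiplying the dependent variable by 10 should multiply both estimates by 10.

```python
def ols_simple(x, y):
    """Return (intercept, slope) from a one-regressor OLS fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, b2 = ols_simple(x, y)

# Case 1: change the units of the regressor only (for example, 1 becomes 100).
x_scaled = [xi * 100 for xi in x]
b1_s, b2_s = ols_simple(x_scaled, y)

# Case 2: change the units of the dependent variable only.
y_scaled = [yi * 10 for yi in y]
b1_y, b2_y = ols_simple(x, y_scaled)

print("original:        ", b1, b2)
print("regressor x 100: ", b1_s, b2_s)
print("dependent x 10:  ", b1_y, b2_y)
```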

[Table: different functional forms of a model]
So, given the inequality-institutions model, I may decide to re-specify equation [2] in a log-linear form to obtain an elasticity relationship. That is:

                           lnINQ = b1 + b2·lnINST + u        [3]
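Here is a small Python sketch of what the log form buys you. It is only an illustration: the helper names are mine, and the data are generated from y = 3·x^0.5 so that the true elasticity is known to be 0.5. Because logs are defined only for positive numbers, the sketch assumes both series are strictly positive (a variable that can be negative would first have to be rescaled) and refuses to run otherwise.

```python
import math


def ols_simple(x, y):
    """Return (intercept, slope) from a one-regressor OLS fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def log_log_fit(x, y):
    """Regress ln(y) on ln(x); the slope is the elasticity of y with respect to x."""
    if min(x) <= 0 or min(y) <= 0:
        raise ValueError("logs require strictly positive values; rescale first")
    ln_x = [math.log(v) for v in x]
    ln_y = [math.log(v) for v in y]
    return ols_simple(ln_x, ln_y)


# Hypothetical data built from y = 3 * x**0.5, so the elasticity is 0.5.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 * v ** 0.5 for v in x]

b1, b2 = log_log_fit(x, y)
print("elasticity (slope):", b2)
print("exp(intercept):", math.exp(b1))
```

The slope is read as: a 1% increase in x is associated with a b2% change in y, whatever units x and y were originally measured in.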

10.    Estimate the model
Prior to the existence of analytical software, econometricians went through the cumbersome process of computing regression coefficients manually. Well, I am glad to tell you that those days are gone forever! With the advent of computerised analytical packages like Stata, EViews, R and the rest, all you have to do is feed your data into any one that you are familiar with and click “RUN”…and voila! You have your results in a split second! Most, if not all, of these packages are Excel-friendly. That is, you first put your data into an Excel-readable format (a .csv, .xls or .xlsx file) and then feed it into any of them. This is the easiest part of the entire data analysis process, and every researcher loves getting to this stage. However, whether your coefficients will turn out according to your expectations is an entirely different story (one that won’t be told in this write-up…hahahaha).

Below is an example of a result output from Stata analytical software (see heteroscedasticity).

From the regression output, Stata (just like other software) provides the beta coefficients, standard errors, t-statistics, probability values, confidence intervals, R-squared, F-statistic, number of observations, degrees of freedom, and the explained (denoted as Model) and unexplained (denoted as Residual) sums of squares. I will cover analysis of variance (ANOVA) in subsequent tutorials.

11.    Results interpretation
In line with your model specification, you can then interpret your results. Always be mindful of the units of measurement (if you are not using a log-linear model). The results output shown above is for a linear (that is, a level-level) model.

12.    Hypothesis testing
Since your primary goal for undertaking the research is to test a theory or hypothesis, it is at this point that you do so, having stated your null and alternative hypotheses. Remember, any theory or hypothesis that is not verifiable by empirical evidence is not admissible as part of scientific enquiry. Now that you have obtained your results, do you reject the null hypothesis in favour of the alternative? Econometric packages include the relevant statistics in the result output, so you don’t have to compute them manually. Simply check your t-statistics or p-values to know whether or not to reject the null hypothesis.

For instance, from the above output, the beta coefficient for crsgpa is 1.007817 and, with a standard error of 0.1038808, you can easily compute the t-statistic as 1.007817/0.1038808 = 9.70 (as given by the Stata output). Importantly, a large t-statistic (in absolute value) provides evidence against the null hypothesis. Likewise, the p-value of 0.000 indicates that the likelihood of committing a Type I error (that is, rejecting the null hypothesis when it is true) is very, very remote…close to zero! So, when the null hypothesis is rejected, we say that the coefficient of crsgpa is statistically significant. When we fail to reject the null hypothesis, we say the coefficient is not statistically significant. It is inappropriate to say that you “accept” the null hypothesis. One can only “fail to reject” it, because the failure to reject comes from insufficient evidence against the null (often due to the sample collected). So, we don’t accept the null, but simply fail to reject it!
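The arithmetic above can be reproduced in a few lines of Python. The coefficient and standard error are the ones quoted from the Stata output. The p-value, however, is only my approximation: Stata uses the t distribution with the model’s degrees of freedom, which are not given here, so this sketch assumes a standard normal distribution and a 5% critical value of 1.96 instead.

```python
import math

coef = 1.007817          # beta coefficient for crsgpa (from the output)
std_err = 0.1038808      # its standard error (from the output)

# t-statistic for the null hypothesis that the coefficient is zero.
t_stat = coef / std_err

# Two-sided p-value under a standard normal approximation (an assumption;
# the exact value needs the t distribution and the degrees of freedom).
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

# Decision at the 5% level, again using the normal critical value of 1.96.
reject_null = abs(t_stat) > 1.96

print("t-statistic:", round(t_stat, 2))
print("approximate p-value:", p_approx)
print("reject the null at 5%?", reject_null)
```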

(Detailed rudiments of hypothesis testing, Type I and II errors will be covered in subsequent tutorials).

13.    Post-estimation checks
Having obtained your estimates, it is advisable to subject your model to some diagnostics. Most journals, and even your supervisor, will want to see the post-estimation checks carried out on your model. You can also have more confidence in your results if the model passes tests for normality, stability, heteroscedasticity, serial correlation, model specification and so on. Regardless of your discipline, empirical model and estimation technique, it is essential that your results are supported with some “comforting” post-estimation checks. Find out those applicable to your model and technique of estimation.
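To give a flavour of one such diagnostic, the Python sketch below computes the Durbin-Watson statistic, a common first check for serial correlation in the residuals. It is my choice of example and only one of many possible tests. The two residual series are invented to show the extremes: a value near 2 suggests no first-order serial correlation, values towards 0 suggest positive serial correlation and values towards 4 suggest negative serial correlation.

```python
def durbin_watson(residuals):
    """Sum of squared changes in the residuals over their sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that stay on one side for a while (positive correlation).
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
# Hypothetical residuals that flip sign every period (negative correlation).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_pos = durbin_watson(persistent)
dw_neg = durbin_watson(alternating)

print("persistent residuals: ", dw_pos)
print("alternating residuals:", dw_neg)
```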

14.    Forecasting or prediction/Submission
At this point, if the econometric model does not refute the theory under consideration, it may be used to predict (forecast) future values of the dependent variable on the basis of the known or expected future values of the regressors. However, if the work is for submission, I advise that it be proofread as many times as possible before you submit.
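For the forecasting part of this step, the idea is simply to plug expected values of the regressor into the estimated equation. The Python sketch below does this for the inequality-institutions model. The coefficient estimates and the future INST values are hypothetical placeholders of mine, not results from any real data.

```python
def predict(intercept, slope, regressor_values):
    """Fitted values of the dependent variable for given regressor values."""
    return [intercept + slope * v for v in regressor_values]


# Hypothetical estimates of b1 and b2 from the inequality-institutions model.
b1_hat = 44.0
b2_hat = -7.6

# Hypothetical expected future values of the institutions index.
future_inst = [0.5, 1.0, 2.0]

forecasts = predict(b1_hat, b2_hat, future_inst)
for inst_value, inq_value in zip(future_inst, forecasts):
    print("INST =", inst_value, "-> predicted INQ =", round(inq_value, 1))
```

With a negative slope, stronger institutions translate into lower predicted inequality, which is what the theory postulated at the start.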

I hope this step-by-step guide gives you some level of confidence to engage in research and data analysis. Let me know if you have any additions or if I omitted some salient points.

Post your comments and questions….


  1. Thanks a lot ma. I will ensure to read through. I greatly appreciate this.

    1. No worries, Idowu. Just stay with me on this blog. I intend to simplify the hard stuff, and whenever you need more clarification, simply hit me with your questions!