Wednesday, 7 February 2018

Panel Data Analysis (Lecture 1): Sourcing Data, Theoretical Framework and Model Specification

Caution: This tutorial is only a guide and should not be adopted in its entirety. Endeavour to consult your tutor and other resource materials for proper guidance!

The dissertation fervour is heating up with its usual twists and turns. In view of this, and in response to readers’ requests, I am starting a series of lectures on how to run time series and panel data analyses. The lectures will come in parts and will be supported by short video tutorials posted to YouTube (so be sure to subscribe to get the hands-on training). So that no one is left out, the practical lectures will use three (3) analytical packages that are common among final-year students: Stata, EViews and Excel. Real country-level and longitudinal data will also be used (subject to my modifications, to prevent unethical conduct by readers). Lastly, only quantitative research will be addressed.

For time series analysis, the lectures will cover only: data sourcing, model specification, lag selection, unit root testing, cointegration testing, the vector autoregressive (VAR) model, the autoregressive distributed lag (ARDL) model, the vector error correction model (VECM), Granger causality tests, the CUSUMSQ test and other post-estimation tests. For panel data analysis, the lectures will cover only: setting up panel data in Stata and EViews, data sourcing, model specification, the Hausman test, the fixed effects (FE) model, the random effects (RE) model and the generalised method of moments (GMM).

So, to get each tutorial the moment I click the “post” button, I encourage you to subscribe to these blog posts. Use the “Follow by Email” menu on my blog, activate the link once you receive the notification in your email (check your spam box too) and you are good to go! Likewise, subscribe to my YouTube channel for the short hands-on video clips: follow the CrunchEconometrix YouTube videos link and subscribe!

Data Sourcing
“I can’t get data!!!”, “what’s a proxy?”, “I have data but not for all the groups”, “how do I go about modelling my theoretical framework?”, “how do I construct my empirical model?”, “in fact, I’m confused!”…so many questions, and believe me, the chatter seems endless. First, I always tell students to relax! Secondly, I tell them that the moment the research area has been identified and the topic streamlined, the next thing to do is to go searching for data. Think about this: of what use is empirical research if there is no data (or you have insufficient data to test your hypothesis)? So, before you proceed to writing chapter 1 (that is, the study background), make certain that you have the data handy.

Primary Data Sources
Regardless of the field of study or research discipline, primary data gathering requires the use of questionnaires, interviews, focus group discussions etc. It may require one of these or a combination of two or three data-gathering methods. So, if you are using primary data, be sure to send these materials out to the respondents early in order to harvest responses within the shortest time frame. Getting a good number of responses is a prerequisite for quality research and unbiased results. However, these structured tutorials will not extend to analysing primary data….my sincere apologies!

Secondary Data Sources
Since research is not limited to the field of economics, it is important that researchers identify the databases hosting the data required for their work. As an economist, I will indicate some databases/sources where students can go to source their data. Here are some which can be accessed (for macro and micro datasets):
IEA Coal Information
IEA CO2 Emissions from Fuel Combustion
IEA Electricity information
IEA Energy Prices and Taxes
IEA Energy Technology Research and Development Database
IEA Natural Gas Information
IEA Oil Information
IEA Renewables Information
IEA World Energy Statistics and Balances
ILO Key Indicators of the Labour Market
IMF Balance of Payment Statistics
IMF Direction of Trade Statistics
IMF Government Finance Statistics
IMF International Financial Statistics
IMF World Economic Outlook
OECD Education Statistics
OECD Globalisation
OECD International Development
OECD International Direct Investment Statistics
OECD International Migration Statistics
OECD International Trade by Commodities Statistics
OECD Main Economic Indicators
OECD Main Science and Technology Indicators
OECD National Accounts
OECD Quarterly Labour Force Statistics
OECD Services Statistics
OECD Social Expenditure Database
OECD Structural Analysis
UNIDO Industrial Demand Supply
UNIDO Industrial Statistics
World Bank Global Development Finance
World Bank World Development Indicators
World Bank Africa Development Indicators

Other sources of international data include, but are not limited to:
International Monetary Fund
United Nations
Data on aid flows compiled by the OECD
NBER data sets

For information on over 256 countries and regions since 1960, the accessible databases are:
World Development Indicators
Global Development Finance
The African Development Indicators
Doing Business
Education Statistics
Enterprise Surveys
Gender Statistics
Health Nutrition and Population Statistics
Millennium Development Goals
Worldwide Governance Indicators
Endeavour to check out those sites that are relevant to your study.
Note: you are expected to state your data source in your thesis/dissertation, along with the years of coverage, say 1980 to 2016 or 1970 to 2015.

What is Panel Data?
The panel data approach pools time series data with cross-sectional data. Depending on the application, it can comprise a sample of individuals, firms, countries, or regions over a specific time period. The general structure of such a model could be expressed as follows:

$$Y_{it} = \alpha + \beta X_{it} + u_{it}$$
where $u_{it} \sim \text{IID}(0, \sigma^2)$, $i = 1, 2, \ldots, N$ indexes the individual-level observations, and $t = 1, 2, \ldots, T$ indexes the time series observations.

In this application, it is assumed that $Y_{it}$ is a continuous variable. The panel data model is simply one where the observations on each individual, firm or country over time are stacked on top of one another. This is the standard pooled model, in which intercepts and slope coefficients are homogeneous across all N cross-sections and through all T time periods. Applying ordinary least squares (OLS) to this model ignores the temporal and spatial dimensions inherent in the data and thus throws away useful information. It is important to note that the temporal dimension captures the ‘within’ variation in the data while the spatial dimension captures the ‘between’ variation. The pooled OLS estimator exploits both the ‘between’ and ‘within’ dimensions of the data but does not do so efficiently, because each observation is given equal weight in estimation. In addition, the unbiasedness and consistency of the estimator require that the explanatory variables are uncorrelated with any omitted factors. The limitations of OLS in such an application prompted interest in alternative procedures. There are a number of different panel estimators, but the most popular is the fixed effects (or ‘within’) estimator, and this will be reviewed extensively here. Lastly, the generalised method of moments (GMM) estimator will be discussed, given its relevance to dynamic panel modelling.
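
To make the difference between pooled OLS and the ‘within’ estimator concrete, here is a minimal Python sketch (Python is my choice for illustration only; the lectures themselves use Stata and EViews). The toy panel and the function names `ols_slope` and `within_transform` are hypothetical. The data are built so that each unit has its own intercept and a true slope of 2, and the intercepts are correlated with X, which is exactly the case where pooled OLS is misleading.

```python
from collections import defaultdict

# Hypothetical toy panel: (unit, year, x, y). Values are invented for illustration.
# Each unit follows y = a_i + 2*x with unit intercepts a_A=1, a_B=10, a_C=20.
panel = [
    ("A", 2000, 1.0, 3.0), ("A", 2001, 2.0, 5.0), ("A", 2002, 3.0, 7.0),
    ("B", 2000, 4.0, 18.0), ("B", 2001, 5.0, 20.0), ("B", 2002, 6.0, 22.0),
    ("C", 2000, 7.0, 34.0), ("C", 2001, 8.0, 36.0), ("C", 2002, 9.0, 38.0),
]


def ols_slope(xs, ys):
    """Slope from a simple OLS regression of ys on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def within_transform(rows):
    """Subtract each unit's own mean from x and y (the 'within' transformation)."""
    groups = defaultdict(list)
    for unit, _year, x, y in rows:
        groups[unit].append((x, y))
    xs, ys = [], []
    for obs in groups.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            xs.append(x - mean_x)
            ys.append(y - mean_y)
    return xs, ys


# Pooled OLS stacks all observations and ignores which unit they belong to.
pooled_slope = ols_slope([r[2] for r in panel], [r[3] for r in panel])

# The within estimator runs OLS on the unit-demeaned data.
within_slope = ols_slope(*within_transform(panel))

print("pooled slope:", round(pooled_slope, 3))
print("within slope:", round(within_slope, 3))
```

In this made-up example the within slope recovers the value of 2 used to build the data, while the pooled slope is pulled upwards because it mixes the differences between units into the estimate.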

Some Advantages of Panel Data Analysis
Panel data analysis has quite a number of distinct advantages over time series and cross-section analysis:
· Panel (or longitudinal) data allows a researcher to analyse a number of important economic questions not readily answerable by either a cross-section or a time-series dataset alone.
· The availability of panel data increases the number of data points available and reduces collinearity among the explanatory variables, thus improving the efficiency of the econometric estimates.
· Panel data captures the heterogeneity that is related to the individuals, firms, states, countries etc. over time.
· By combining time series of cross-sectional observations, panel data gives “more informative data, more variability, less collinearity among variables, more degrees of freedom and more efficiency”.
· Dynamic effects cannot be estimated using cross-sectional data. Even time series data are imprecise in this regard, as there is generally limited change or variation in the data to identify such effects. For instance, in estimating a distributed lag model using only time series data, multicollinearity lowers the precision of the estimates. Panel data models can provide greater variation in the explanatory variable for a given year, thus reducing the degree of multicollinearity and improving the precision of the estimates. This clearly renders panel data better suited to the study of dynamic change. However, it should be emphasised that the estimation procedures required for dynamic models which include a lagged dependent variable are not straightforward, and this issue is the subject of discussion in later sections.
· Panel data models can take into account a greater degree of the heterogeneity that characterizes individuals, states, and firms over time.
(Detailed discussion on the rudiments of panel data analysis will be done in the next tutorial).

(Here is the link to video clip on converting wide-format data to long-format in Stata).
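
For readers who want to see what the wide-to-long conversion actually does to the data, here is a minimal Python sketch. It is not the Stata procedure shown in the video; the table, the column names (`country`, `gdp2000`, `gdp2001`) and the function `wide_to_long` are all hypothetical. The idea is the same as Stata's `reshape long`: one row per country becomes one row per country-year.

```python
# Hypothetical wide-format table: one row per country, one column per year.
wide = [
    {"country": "A", "gdp2000": 1.0, "gdp2001": 1.2},
    {"country": "B", "gdp2000": 2.0, "gdp2001": 2.1},
]


def wide_to_long(rows, id_col, stub):
    """Turn columns named <stub><year> into one row per (id, year)."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            suffix = col[len(stub):]
            # Keep only columns like "gdp2000": the stub followed by digits.
            if col.startswith(stub) and suffix.isdigit():
                long_rows.append(
                    {id_col: row[id_col], "year": int(suffix), stub: value}
                )
    # Sort so each unit's time series is stacked in year order.
    long_rows.sort(key=lambda r: (r[id_col], r["year"]))
    return long_rows


long_rows = wide_to_long(wide, id_col="country", stub="gdp")
for r in long_rows:
    print(r)
```

The long format is the "stacked" layout described earlier: each unit's observations over time sit on top of the next unit's.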

Model Framework and Specification
This section focusses on the theoretical framework and model specification. I will also touch on description of variables in a model, the a priori expectations and finally, the method of analysis (or the estimation technique(s) to be used in testing the research hypothesis).

Theoretical Framework
Before you specify the empirical model, you must first state the theoretical model. That is, let your readers know what your empirical model is built on. The theoretical model is the model behind the theory you are using to undertake your research, because no research can be done in isolation without an underlying theory. For instance, if my study is on the effect of exchange rate on output for 30 countries from 2000 to 2016 (that is, 17 years), then I must look for a suitable theory which I can adapt to my research. Hence, I may decide to use the “monetary model of exchange rate”, which is one of the earliest models used to determine the exchange rate. It serves as a benchmark for studying the other approaches used in determining the exchange rate. The monetary model approach assumes a simple demand for money curve, purchasing power parity (or the law of one price) and a vertical aggregate supply curve.

The theoretical framework can be built as follows: (remember that this is just an example, and should not be copied literally!)

From absolute purchasing power parity ($P = EP^*$), the exchange rate is obtained by dividing the domestic price level by the foreign price level. That is: $E_{ppp} = P/P^*$. The demand for money assumption: since real money balances depend on real income, the demand for money is given as $M^d = kPY$, where $k$ is a constant and $Y$ is the real income level. Hence, in equilibrium, money demand ($M^d$) equals money supply ($M^s$), and at the point of intersection of the aggregate demand and aggregate supply curves:
$$P = \frac{M^s}{kY}$$
$$EP^* = P = \frac{M^s}{kY}$$
and
$$E = \frac{M^s}{P^* kY}$$

From the stated framework, it is theorised that if the money supply within an economy increases, E rises, which means a depreciation of the domestic currency (since E is the domestic-currency price of foreign currency in P = EP*). Hence, if this is generalised to the 30 countries in the data, the same assumption must be made, ceteris paribus. Likewise, the foreign price level and the output level are inversely related to the exchange rate. If the money supply rises in the domestic economy while prices are held constant, the excess money supply leads to higher demand for goods and services within the economy.
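
As a quick numerical check of the final expression for E, here is a small Python sketch. The function name `exchange_rate` and every number in it are made up for illustration; only the formula E = Ms/(P*kY) comes from the framework above. It shows how E moves when the money supply or output changes while everything else is held fixed.

```python
def exchange_rate(ms, p_star, k, y):
    """Monetary-model exchange rate: E = Ms / (P* . k . Y)."""
    return ms / (p_star * k * y)


# Hypothetical baseline values, chosen only to give round numbers.
baseline = exchange_rate(ms=100.0, p_star=2.0, k=0.5, y=50.0)

# Double the money supply, everything else unchanged.
more_money = exchange_rate(ms=200.0, p_star=2.0, k=0.5, y=50.0)

# Double real output, everything else unchanged.
more_output = exchange_rate(ms=100.0, p_star=2.0, k=0.5, y=100.0)

print("baseline E:", baseline)
print("E after doubling Ms:", more_money)
print("E after doubling Y:", more_output)
```

With these invented numbers, doubling Ms doubles E, while doubling Y halves it, which matches the direct and inverse relationships stated in the framework.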

Model Specification
So, having stated the theoretical framework, I can now go ahead to modify it to suit my research and from there formulate my empirical model. For instance, in using a Cobb-Douglas production function from the neoclassical growth model, I will attempt to explain output growth in the context of capital accumulation, labour and productivity, usually referred to as technological progress. The Cobb-Douglas production model is implicitly stated as:

$$Y = f(A L^{\beta} K^{\alpha}) \qquad [1]$$
where $Y$ is output, $K$ is capital stock, $L$ is labour and $A$ is the productivity of labour, which grows at an exogenous rate. Under constant returns to scale, if all inputs are increased by the same proportion, output increases by that same proportion. The production function,

$$Y = K^{\alpha} L^{1-\alpha} \qquad [2]$$
where $\beta = 1 - \alpha$, is widely used by economists and researchers because it exhibits constant returns to scale: the two exponents, $\alpha$ and $(1 - \alpha)$, sum to one.
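
The constant returns to scale property of equation [2] can be checked with a few lines of Python. This is only a sketch: the function name `cobb_douglas`, the optional productivity term `a` (set to 1 so that it matches equation [2]) and the input values are my own assumptions for illustration.

```python
import math


def cobb_douglas(k, l, alpha, a=1.0):
    """Output Y = A * K^alpha * L^(1 - alpha); a=1.0 reproduces equation [2]."""
    return a * (k ** alpha) * (l ** (1.0 - alpha))


# Hypothetical inputs chosen to give round numbers.
y_base = cobb_douglas(k=16.0, l=81.0, alpha=0.5)

# Scale both inputs by the same factor and compare outputs.
scale = 2.0
y_scaled = cobb_douglas(k=scale * 16.0, l=scale * 81.0, alpha=0.5)

print("base output:", y_base)
print("scaled output:", y_scaled)
print("output ratio:", y_scaled / y_base)
```

Because the exponents sum to one, scaling capital and labour by the same factor scales output by that factor too.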

Next, tie the empirical model to the theoretical framework. That is, given the relationship between exchange rate and output, the model is implicitly specified as:

$$Y_{it} = f(\text{Exchrate}_{it}, X_{1it}, X_{2it}, \ldots, X_{nit}) \qquad [3]$$
where $Y_{it}$ = output (the dependent variable; state the measurement, e.g. gross output, % of GDP or growth rate)
$\text{Exchrate}_{it}$ = real exchange rate (main explanatory variable)
$X_{1it}, X_{2it}, \ldots, X_{nit}$ = control variables (state their individual measurements, e.g. gross output, % of GDP or growth rate)

On the basis of the theoretical framework and using the Cobb-Douglas production function, the explicit model is stated as:
$$Y_{it} = \beta_0 + \beta_1 \text{Exchrate}_{it} + \beta_2 X_{1it} + \beta_3 X_{2it} + \ldots + \beta_{n+1} X_{nit} + u_{it} \qquad [4]$$
where $u_{it}$ = white noise error term

A Priori Expectations
Always remember that the a priori expectations come directly from what theory says. It is from theory that you know what signs to expect on the coefficients of the main regressor and the other covariates. For instance, from the theory, it is expected that currency depreciation will have a positive impact on domestic output. If the exchange rate variable is measured such that a fall denotes depreciation, a negative sign on its coefficient is expected. That is:

$\beta_1 < 0$

Therefore, the expected signs of the control variables must be in line with their respective theories which must be related to your study.

Estimation Technique
At this point, the researcher may not know which estimator to adopt: the fixed effects (‘within-group’) estimator or the random effects estimator. The choice between the two is subject to the outcome of the Hausman test.
That is, to determine which model is more appropriate, a statistical test is implemented. The Hausman test compares the random effects estimator with the ‘within’ estimator. The null hypothesis of the test is that the composite error term is not correlated with the explanatory variables in the model. If the null is rejected, the fixed effects estimator is applicable (i.e., the test favours fixed effects, but only relative to random effects). The test is used here to discriminate between a model where the omitted heterogeneity is treated as fixed and correlated with the explanatory variables, and a model where the omitted heterogeneity is treated as random and independent of the explanatory variables.
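
To show the logic of the decision rule, here is a rough Python sketch of the Hausman statistic for the simplest case of a single slope coefficient, where it reduces to the squared difference between the two estimates divided by the difference in their variances. The function name `hausman_scalar` and all the input numbers are hypothetical; in practice Stata or EViews computes the full matrix version for you. The only fixed fact used is the 5% chi-square critical value for one degree of freedom, 3.841.

```python
# Chi-square critical value with 1 degree of freedom at the 5% level.
CHI2_1DF_5PCT = 3.841


def hausman_scalar(b_fe, b_re, var_fe, var_re):
    """Hausman statistic for one coefficient: (b_FE - b_RE)^2 / (Var_FE - Var_RE)."""
    var_diff = var_fe - var_re
    if var_diff <= 0:
        # Under the null, RE is the more efficient estimator, so Var_FE > Var_RE.
        raise ValueError("Var(FE) must exceed Var(RE) for the statistic to be defined")
    return (b_fe - b_re) ** 2 / var_diff


# Hypothetical estimates and variances, invented purely for illustration.
h_stat = hausman_scalar(b_fe=2.0, b_re=2.6, var_fe=0.05, var_re=0.02)

# Reject the null (and prefer fixed effects) if the statistic exceeds the critical value.
reject_null = h_stat > CHI2_1DF_5PCT

print("Hausman statistic:", round(h_stat, 3))
print("Prefer fixed effects:", reject_null)
```

With these invented numbers the statistic is well above the critical value, so the null is rejected and the fixed effects estimator would be chosen over random effects.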

Variables, Measurement and Description
Lastly, tabulate your variables detailing their names, short description, measurement and sources.

Here’s an example:
Table xxx: Variables Description and Measurement
Variable | Short Definition | Measurement | Source
… | … | … | World Bank (2016)
… | Real exchange rate | … | World Bank (2016)
Source: Researcher’s compilation (always put this at the bottom of the Table)

I have taken you through the steps required to source your data, in addition to a brief on panel data analysis and its advantages over time series and cross-sectional data. I also briefly explained how to formulate a theoretical framework, how to adapt the framework to align with the research, how to construct the empirical model, how to state the a priori expectations, how to think about the estimation technique (with a brief on the Hausman test), and how to tabulate your variables with their brief descriptions, measurements and data sources.

From the next lecture, I will begin analysing the data using both the Stata and EViews analytical packages. So, endeavour to follow these tutorials and get the most out of them to ease the dissertation pressure. Make sure you follow me on the next lecture in the series, which is: Panel Data Analysis (Lecture 2): Setting up panel data model and the Hausman Test.

If you have any comments or questions in relation to what has been discussed, do not hesitate to post them in the comment section below….
