Regression analysis
When investigating the relationship between two variables, regression analysis is an appropriate tool. In a regression analysis, one variable must be distinguished as the dependent variable and the other as the independent variable. Naming a variable the dependent variable means that its values depend on the values of another variable; the independent variable explains the dependent variable (Lee et al., 2000). For example, if we were to investigate the relationship between eyesight and age, the dependent variable would be eyesight and the independent variable would be age, since eyesight tends to worsen as a person gets older, making age an explanatory factor for that person's eyesight.
When only one variable is assumed to explain the value of another variable, meaning there is one dependent and one independent variable, this is called simple regression. If more than one variable is thought to explain the value of the dependent variable, multiple regression analysis is used instead; it determines the relationship between a dependent variable and several independent variables.
Regression analysis provides two important results, presented below; the most important result in this study concerns the marginal change in Y related to changes in the independent variables (a short illustrative sketch follows the list):
1) An estimated linear equation that describes the dependent variable, Y, as a function of K observed independent variables, xj, where j = 1,…,K:
Y = b0 + b1x1i + b2x2i + … + bKxKi
where i = 1,…,n indexes the observations.
2) The marginal change in the dependent variable, Y, related to changes in the independent variables, represented by the bj's. The coefficient bj indicates the change in Y given a unit change in xj while controlling for the simultaneous effect of the other independent variables. (Newbold et al., 2007)
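These two results can be illustrated with a minimal Python sketch, assuming made-up data and two independent variables (the statsmodels package is used here purely for illustration; the variable names and numbers are not taken from this study):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)          # first independent variable
x2 = rng.normal(size=n)          # second independent variable
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)  # dependent variable

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept b0
model = sm.OLS(y, X).fit()

# model.params contains [b0, b1, b2]; b1 is the estimated change in Y for a
# one-unit change in x1 while holding x2 fixed (the marginal effect).
print(model.params)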
When applying multiple linear regression, a model is constructed to explain variability in the chosen dependent variable. The multiple regression model is:
yi = β0 + β1x1i + β2x2i + … + βKxKi + εi
where
εi is the random error term,
βj are the coefficients of the marginal effects of the independent variables,
xj are the selected independent variables, also known as explanatory variables,
j indicates the index of the specific independent variable, j = 1,…,K,
i indicates the specific observation, i = 1,…,n. (Newbold et al., 2007)
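To make the role of the random error term εi concrete, the following short sketch generates observations according to the model above with K = 2; the chosen β values, sample size and distributions are assumptions made purely for illustration:

import numpy as np

rng = np.random.default_rng(1)
n = 100
beta0, beta1, beta2 = 0.5, 1.5, -2.0            # assumed "true" parameters
x1 = rng.uniform(0, 10, size=n)                 # first explanatory variable
x2 = rng.uniform(0, 10, size=n)                 # second explanatory variable
eps = rng.normal(loc=0.0, scale=1.0, size=n)    # random error term epsilon_i
y = beta0 + beta1 * x1 + beta2 * x2 + eps       # y_i generated by the model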
In multiple regression analysis, the ordinary least squares method is used to estimate the regression parameters. The least squares method is based on the idea that the sum of the squared deviations is to be minimized, where a deviation is the distance from the constructed plane to the actual observation. In other words, the least squares method minimizes the errors when constructing the regression plane from which the intercept and the regression coefficients are found. Below, this is formulated in mathematical terms for the case of one dependent variable and two independent variables (Lee et al., 2000):
Minimize ∑ei² = ∑(yi − b0 − b1x1i − b2x2i)²
where both sums run over the observations i = 1,…,n and ei is the deviation (residual) of observation i from the fitted plane.
This method can only be illustrated graphically in two and three dimensions; in regression analysis, however, more than two independent variables can be used to determine the relationship between the dependent variable and the independent variables.
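A compact way to see the least-squares idea in code is sketched below for one dependent and two independent variables: the coefficients that minimize the sum of squared deviations are obtained with numpy's least-squares solver. The data are invented for illustration only:

import numpy as np

# Illustrative data: one dependent variable y and two independent variables.
rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a column of ones for the intercept b0.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates: the b that minimizes sum((y - X @ b) ** 2).
b, residual_ss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(b)  # [b0, b1, b2], defining the fitted regression plane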

Figure: Regression plane with Y as the dependent variable and X1 and X2 as independent variables. The regression plane is constructed so as to minimize the deviations (residuals) of the observations. (www.sjsu.edu, 2007-11-20)
In such a model, where the intercept, β0, and the regression coefficients, β1,…, βK, are assumed to be constant and the residuals are homogeneous and normally distributed, the model can be estimated with ordinary least squares estimation (Yaffee, 2007-11-20).
When the data used in a regression analysis include multiple cases (companies, people, countries, etc.) that are observed at two or more time periods, the data are called panel data. Panel data complicate the ordering of the data and the calculation process since there is an extra dimension: apart from studying several companies, they are also studied over time. Therefore, this type of data is commonly simplified into what is called pooled data, meaning that all data are merged into one dimension. This implies that the panel data characteristics are disregarded; however, in the case where the intercept and the regression coefficients are assumed to be constant over time, it is recommended to proceed in this way. Further, panel data can be balanced or unbalanced. Balanced panel data means that the number of time series observations is equal in each cross-section, whereas unbalanced data means that the number of time series observations differs across cross-sections. (Antell, 2007-10-22 and Yaffee, 2007-11-20)
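As a rough illustration of pooling, the pandas sketch below builds a small, made-up balanced panel (company identifiers, years and values are invented) and simply stacks all company-year rows into one data set, which is what treating the data as pooled amounts to; checking balance means checking that every cross-sectional unit has the same number of time observations:

import pandas as pd

# Made-up balanced panel: 3 companies observed in 2 years each.
panel = pd.DataFrame({
    "company": ["A", "A", "B", "B", "C", "C"],
    "year":    [2005, 2006, 2005, 2006, 2005, 2006],
    "y":       [1.2, 1.4, 0.8, 0.9, 2.1, 2.3],
    "x1":      [10, 12, 7, 8, 15, 16],
})

# Pooling: disregard the panel structure and treat every row as one observation.
pooled = panel.reset_index(drop=True)

# Balanced if every company has the same number of yearly observations.
counts = panel.groupby("company")["year"].count()
print("balanced:", counts.nunique() == 1)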
Before performing a regression analysis, hypotheses regarding the relationship between the dependent and independent variables are formulated. After performing the regression analysis, it can then be concluded, in the case of significant results, that the relationship between the variables is of a certain type.
There are two special types of variables that can be used as independent variables: dummy variables and interaction variables. A dummy variable is a variable that takes only two values, 0 or 1, and it can be a powerful tool in multiple regression, especially in situations involving categorical variables. For example, a dummy variable could indicate whether a person smokes or not by taking the value 1 when the person smokes and 0 when the person does not; it is a simple way to represent different categories. An interaction variable is the product of two variables and is used to investigate whether the combined effect of two variables is stronger than their separate effects. For example, it might be relevant to include an interaction variable when investigating the growth of crops: using a fertilizer might make the crops grow particularly well, and rainfall might also influence growth, so an interaction variable can be included to investigate what happens when fertilizer is used and there is rainfall. (Newbold et al., 2007)
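The fertilizer-and-rainfall example can be sketched in code as follows (all numbers are invented purely to show the mechanics): the dummy variable takes the value 1 when fertilizer is used and 0 otherwise, and the interaction variable is simply the product of the dummy and the rainfall variable.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 120
fertilizer = rng.integers(0, 2, size=n)      # dummy: 1 = fertilizer used, 0 = not
rainfall = rng.uniform(0, 100, size=n)       # rainfall on a made-up scale
interaction = fertilizer * rainfall          # interaction variable: product of the two

# Invented growth data with an extra effect when fertilizer and rain occur together.
growth = 5 + 2 * fertilizer + 0.1 * rainfall + 0.05 * interaction + rng.normal(size=n)

X = sm.add_constant(np.column_stack([fertilizer, rainfall, interaction]))
result = sm.OLS(growth, X).fit()
print(result.params)  # intercept, dummy effect, rainfall effect, interaction effect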
To find the level of significance of the results of a regression analysis, several methods can be used. One of these methods is the t-test, where a calculated t-value is translated into a p-value that shows the level of significance. There are two types of t-tests, one-tailed and two-tailed. The one-tailed t-test is used if the results are interesting only if they turn out in a particular direction, whereas the two-tailed t-test is performed if the results are interesting regardless of direction.

Figure: Illustration of a one-tailed t-test and a two-tailed t-test.
Calculations necessary to perform multiple regression can be
executed by computer programs such as Excel, SPSS, STATA, Minitab and SAS. In
this thesis work, the statistical package STATA was used.
The signs of the estimated regression coefficients, βj, can be interpreted as showing whether the dependent and independent variables are positively or negatively related. The t-statistics calculated in the regression analysis are used to determine the significance of the results. STATA produces p-values based on two-tailed t-tests; to obtain the correct p-values in this study, where a one-tailed t-test is wanted, the p-value provided by STATA is divided by two.
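The conversion from a two-tailed to a one-tailed p-value can be mimicked in Python as a rough check (the sketch uses statsmodels rather than STATA, so it only illustrates the principle, on invented data): when the estimated coefficient has the hypothesized sign, the one-tailed p-value is half the two-tailed p-value.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 80
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

coef = fit.params[1]
p_two_tailed = fit.pvalues[1]

# One-tailed test of the hypothesis that the coefficient is positive:
# halve the two-tailed p-value if the estimate has the expected sign.
p_one_tailed = p_two_tailed / 2 if coef > 0 else 1 - p_two_tailed / 2
print(p_one_tailed)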