Darko Milosevic, Dr.rer.nat./Dr.oec.


Regression analysis

When investigating the relationship between two variables, regression analysis is an appropriate tool. In a regression analysis, one variable needs to be distinguished as the dependent variable and the other as the independent variable. When a variable is named the dependent variable, it means that the values of this variable depend on the values of another variable; the independent variable explains the dependent variable (Lee et al., 2000). For example, if we were to investigate the relationship between eyesight and age, the dependent variable would be eyesight and the independent variable would be age. The reason is that it is most likely that the older a person gets, the worse his or her eyesight gets, and a person's age is therefore an explanatory factor when it comes to that person's eyesight.
When only one variable is assumed to explain the value of another variable, meaning there is one dependent and one independent variable, this is called simple regression. However, if more than one variable is thought to explain the value of the dependent variable, multiple regression analysis is used. In multiple regression analysis, the relationship between a dependent variable and more than one independent variable can be determined.

Regression analysis provides two important results, which are presented below. The most important result in this study concerns the marginal change in Y related to changes in the independent variables:
1)      An estimated linear equation that describes the dependent variable, Y, as a function of K observed independent variables, x_j, where j = 1,…,K:

\[
\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_K x_{Ki}
\]

where i = 1,…,n observations.
2)      The marginal change in the dependent variable, Y, related to changes in the independent variables, represented by the b_j's. The coefficient b_j indicates the change in Y given a unit change in x_j while controlling for the simultaneous effect of the other independent variables. (Newbold et al., 2007)
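As a brief worked illustration of point 2 (using purely hypothetical coefficient values, not estimates from this study), the coefficient b_j can be read as the partial derivative of the fitted value with respect to x_j:

```latex
% Marginal effect of x_j in the estimated linear equation:
% a one-unit increase in x_j changes the fitted value by b_j,
% holding the other independent variables constant.
\[
\frac{\partial \hat{y}}{\partial x_j} = b_j , \qquad j = 1,\dots,K .
\]
% Hypothetical example: if \hat{y} = 2 + 0.5\,x_1 - 1.3\,x_2, a one-unit
% increase in x_1 raises the fitted value by 0.5 with x_2 held constant.
```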
When applying multiple linear regression, a model is constructed to explain variability in the dependent variable chosen. The multiple regression model is:
\[
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_K x_{Ki} + \varepsilon_i
\]

where 
      ε_i   is the random error term
      β_j   are the coefficients of the marginal effects of the independent variables
      x_j   are the independent variables selected, also known as explanatory variables
      j     indicates the index of the specific independent variable, j = 1,…,K
      i     indicates the specific observation, i = 1,…,n (Newbold et al., 2007)

In multiple regression analysis, the ordinary least squares (OLS) method is used to estimate the regression parameters. The least squares method is based on the idea that the sum of the squared deviations is to be minimized. The deviation refers to the distance from the constructed regression plane to the actual observation. In other words, the least squares method aims to minimize the errors when constructing the regression plane, from which the intercept and the regression coefficients are found. Below, this is formulated in mathematical terms for a case with one dependent variable and two independent variables (Lee et al., 2000):

\[
\min_{b_0, b_1, b_2} \; \sum_{i=1}^{n} e_i^2 \;=\; \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} \right)^2
\]
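The same minimization can be carried out numerically. The sketch below is a minimal illustration with made-up data (not the data set of this study); it estimates b_0, b_1 and b_2 for one dependent and two independent variables by solving the least-squares problem with NumPy:

```python
import numpy as np

# Made-up example data: n observations of one dependent variable y
# and two independent variables x1 and x2 (not the data of this study).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.5, 3.5, 3.0, 5.0, 4.5])
y  = np.array([3.1, 4.0, 6.2, 6.8, 9.1, 9.0])

# Design matrix with a column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: choose b to minimize the sum of squared
# deviations between the observed y and the fitted plane X @ b.
b, residual_ss, rank, _ = np.linalg.lstsq(X, y, rcond=None)

b0, b1, b2 = b
fitted = X @ b
errors = y - fitted          # deviations from the regression plane
print("intercept b0:", b0)
print("coefficients b1, b2:", b1, b2)
print("sum of squared deviations:", np.sum(errors ** 2))
```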
This method can only be illustrated graphically in two and three dimensions; however, in regression analysis more than two independent variables can be used to determine the relationship between the dependent variable and these independent variables.
Figure: Regression plane with Y as the dependent variable and X1 and X2 as independent variables. The regression plane is created to minimize the deviations (residuals) of the observations. (www.sjsu.edu, 2007-11-20)
In such a model, where the intercept, β_0, and the regression coefficients, β_1,…,β_K, are assumed to be constant and the residuals are homogeneous and normally distributed, the model can be estimated with ordinary least squares estimation (Yaffee, 2007-11-20).
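The residual assumptions mentioned above can be inspected informally. The sketch below is a minimal illustration (made-up residuals and fitted values, hypothetical names) using SciPy's Shapiro-Wilk test for normality and a crude split-sample variance comparison as a rough homogeneity check:

```python
import numpy as np
from scipy import stats

# Residuals and fitted values from a fitted regression
# (made-up values for illustration only).
residuals = np.array([0.2, -0.5, 0.1, 0.4, -0.3, 0.0, -0.2, 0.3, -0.1, 0.1])
fitted    = np.array([3.0, 4.1, 5.2, 5.9, 7.1, 7.8, 8.9, 9.6, 10.4, 11.2])

# Normality: Shapiro-Wilk test on the residuals.
w_stat, p_normal = stats.shapiro(residuals)
print("Shapiro-Wilk p-value (normality):", p_normal)

# Rough homogeneity check: compare residual spread for low vs. high fitted values.
order = np.argsort(fitted)
low, high = residuals[order[:5]], residuals[order[5:]]
print("residual variance, low fitted half :", np.var(low, ddof=1))
print("residual variance, high fitted half:", np.var(high, ddof=1))
```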

When the data used in a regression analysis include multiple cases (companies, people, countries, etc.) that are observed at two or more time periods, they are called panel data. The use of panel data complicates the ordering of the data and the calculation process, since there is an extra dimension: apart from studying several companies, they are also studied over time. Therefore, this type of data is commonly simplified into what is called pooled data. Pooled data means that all data are merged into one dimension. This implies that the panel data characteristics are disregarded; however, in the case where the intercept and the regression coefficients are assumed to be constant over time, it is recommended to proceed in this way. Further, panel data can be balanced or unbalanced. Balanced panel data means that the number of time series observations is equal in each cross-section, whereas unbalanced data means that the number of time series observations is not equal in each cross-section. (Antell, 2007-10-22 and Yaffee, 2007-11-20)
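As a minimal sketch of what pooling panel data can look like in practice (hypothetical company data and column names, not the data set of this study), the small panel below is stacked so that each company-year observation becomes one row of a single cross-section:

```python
import pandas as pd

# Hypothetical panel: three companies observed in two years
# (an unbalanced panel would simply be missing some company-year rows).
panel = pd.DataFrame({
    "company": ["A", "A", "B", "B", "C", "C"],
    "year":    [2006, 2007, 2006, 2007, 2006, 2007],
    "y":       [1.2, 1.4, 0.9, 1.1, 2.0, 2.3],
    "x1":      [10.0, 11.0, 8.0, 8.5, 15.0, 16.0],
}).set_index(["company", "year"])   # two dimensions: cross-section and time

# Pooling: the company/time structure is disregarded and every
# company-year observation is treated as one row in a single dimension.
pooled = panel.reset_index(drop=True)
print(pooled)

# Balanced vs. unbalanced: a panel is balanced when every cross-section
# (company) has the same number of time observations.
counts = panel.groupby(level="company").size()
print("balanced panel:", counts.nunique() == 1)
```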

Before performing a regression analysis, hypotheses regarding the relationship between the dependent and independent variables are formulated. After performing the regression analysis, it can then be concluded, in the case of significant results, that the relationship between the variables is of a certain type.

There are two special types of variables that can be used as independent variables: dummy variables and interaction variables. A dummy variable is a variable that takes only two values, 0 or 1, and it can be a powerful tool in multiple regression, especially in situations involving categorical variables. For example, a dummy variable could indicate whether a person smokes by taking the value 1 when the person smokes and 0 when the person does not smoke. It is a simple way to represent different categories. An interaction variable is the product of two variables. It investigates whether the combination of two variables has a particularly strong effect. For example, it might be relevant to include an interaction variable when investigating the growth of crops. Using a fertilizer might imply that the crops grow particularly well, and rainfall might also be a factor that influences the growth of the crops. In this case it would be a good idea to include an interaction variable to investigate what happens when fertilizer is used and there is rainfall. (Newbold et al., 2007) A short sketch of how such variables can be constructed is given below.
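The sketch below (hypothetical variable names and data, for illustration only) builds a smoking dummy and a fertilizer-by-rainfall interaction term of the kind described above:

```python
import pandas as pd

# Hypothetical data set for illustration only.
df = pd.DataFrame({
    "smokes":     ["yes", "no", "yes", "no"],
    "fertilizer": [1.0, 0.0, 2.0, 1.5],      # amount of fertilizer used
    "rainfall":   [30.0, 45.0, 20.0, 50.0],  # rainfall in mm
})

# Dummy variable: 1 if the person smokes, 0 otherwise.
df["smoker_dummy"] = (df["smokes"] == "yes").astype(int)

# Interaction variable: product of fertilizer use and rainfall,
# capturing whether the combination has an extra effect on growth.
df["fertilizer_x_rainfall"] = df["fertilizer"] * df["rainfall"]

print(df[["smoker_dummy", "fertilizer_x_rainfall"]])
```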

To determine the level of significance of the results of a regression analysis, several methods can be used. One of these is the t-test, where a calculated t-value is translated into a p-value which shows the level of significance. There are two types of t-tests, one-tailed and two-tailed. The one-tailed t-test is used if the results are interesting only if they turn out in a particular direction. The two-tailed t-test is performed if the results are interesting regardless of direction.

Figure: Illustration of a one-tailed t-test and a two-tailed t-test.

Calculations necessary to perform multiple regression can be executed by computer programs such as Excel, SPSS, STATA, Minitab and SAS. In this thesis work, the statistical package STATA was used. The sign of a calculated regression coefficient, β_j, can be interpreted as showing whether the dependent and independent variables are positively or negatively related. The t-statistics calculated in the regression analysis are used to determine the significance of the results. STATA produces p-values based on two-tailed t-tests. To get the correct p-values in this study, where a one-tailed t-test is wanted, the p-value provided by STATA is divided by two.
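A minimal sketch of that conversion is given below (hypothetical numbers; as an added note, halving the two-tailed p-value is only appropriate when the estimated coefficient has the hypothesized sign):

```python
def one_tailed_p(two_tailed_p: float, coefficient: float, expected_sign: int) -> float:
    """Convert a two-tailed p-value (as reported e.g. by STATA) into a
    one-tailed p-value for a directional hypothesis.

    expected_sign is +1 if the hypothesis predicts a positive coefficient,
    -1 if it predicts a negative one.
    """
    if coefficient * expected_sign > 0:
        # Estimate lies in the hypothesized direction: halve the p-value.
        return two_tailed_p / 2
    # Estimate lies in the opposite direction: the one-tailed p-value is large.
    return 1 - two_tailed_p / 2

# Hypothetical example: STATA reports p = 0.08 for a coefficient of 0.42,
# and the hypothesis predicts a positive relationship.
print(one_tailed_p(0.08, 0.42, expected_sign=+1))   # 0.04
```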
