# An Introduction to Multiple Regression

What is multiple regression? When is it used? How is it computed? This article expounds on these questions.

Multiple regression is a commonly used statistical tool that has a range of applications. It is most useful in making predictions of the behavior of a dependent variable using a set of related factors or independent variables. It is one of the many multivariate (many variables) statistical tools applied in a variety of fields.

### Origin of Multiple Regression

Multiple regression originated from the work of Sir Francis Galton, an Englishman who pioneered eugenics, a philosophy that advocates reproduction of desirable traits. In his study of sweet peas, an experimental plant popular among scientists like Gregor Mendel because it is easy to cultivate and has a short life span, Galton proposed that a characteristic (or variable) may be influenced, not by a single important cause but by a multitude of causes of greater and lesser importance. His work was further developed by English mathematician Karl Pearson, who employed a rigorous mathematical treatment of his findings.

### When do you use multiple regression?

Multiple regression is used appropriately on those occasions where only one dependent variable (denoted by the letter Y) is correlated with two or more independent variables (denoted by Xn). It is used to assess causal linkages and predict outcomes.

For example, a student’s grade in college as the dependent variable of a study can be predicted by the following variables: high school grade, college entrance examination score, study time, sports involvement, number of absences, hours of sleep, time spent viewing the television, among others. The computation of the multiple regression equation will show which of the independent variables have more influence than the others.

### How is a multiple regression equation computed?

The data in calculating multiple regression formula take the form of ratio and interval variables (see four statistical measures of measurement for a detailed description of variables). When data are in the form of categories, dummy variables are used instead because the computer cannot interpret those data. Dummy variables are numbers representing a categorical variable. For example, when gender is included in the multiple regression analysis, these are encoded as 1 to represent a male subject and 0 to represent a female or vice-versa.

If several independent variables are involved in the investigation, manual computation will be tedious and time-consuming. For this reason, statistical softwares like SPSS, Statistica, Minitab, Systat, and even MS Excel are used to correlate a set of independent variables to the dependent variable. The data analyst will just have to encode data into columns of categories for each sample which will occupy one row in a spreadsheet.

The formula used in multiple regression analysis is given below:

Y = a + b1*X1 + b2*X2 + … + bn*Xn

where a is the intercept, b is the beta coefficient, and X is an independent variable.

From the set of variables initially incorporated in the multiple regression equation, a set of significant predictors can be identified. This means that some of the independent variables will have to be eliminated in the multiple regression equation if they are found to exert minimal or insignificant correlation to the dependent variable. Thus, it is good practice to make an exhaustive review of literature first to avoid including variables which have consistently shown no correlation to the dependent variable being investigated.

### How do you write the multiple regression hypothesis?

For the example given above, you can state the multiple regression hypothesis this way:

There is no significant relationship between a student’s grade in college and the following: