**Regression**

In this post I am going to explain how linear regression works. Let us start with what *regression* is and how it works. Regression is widely used for prediction and forecasting in the field of machine learning. Its focus is the relationship between a dependent variable and one or more independent variables. The “dependent variable” represents the output or effect, or is tested to see if it is the effect. The “independent variables” represent the inputs or causes, or are tested to see if they are the cause. Regression analysis helps us understand how the value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. In regression, the dependent variable is estimated as a function of the independent variables, called the regression function. A regression model involves the following variables.

- Independent variables X
- Dependent variable Y
- Unknown parameters θ

In the regression model, Y is a function of (X, θ). There are many techniques for regression analysis, but here we will consider linear regression.

**Linear regression**

In linear regression, the dependent variable (Y) is modeled as a linear combination of the independent variables (X). The regression function is called the hypothesis and is defined as below.

h_{θ}(X) = f(X,θ)

Suppose we have only one independent variable (x). The hypothesis is then a straight line with an intercept θ_{0} and a slope θ_{1}:

h_{θ}(x) = θ_{0} + θ_{1}x

The goal is to find values of θ (known as coefficients) that minimize the difference between the real and predicted values of the dependent variable (y). If we take all θ to be zero, then the predicted value will be zero. The cost function measures how well a linear regression model fits: it is the average squared error between the predicted and real values over **m** observations, and it is denoted by J(θ).
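Written out, one common form of the average squared error over m observations is shown here. This is a sketch under the assumption that the cost is the plain mean of squared errors; the superscript (i), marking the i-th observation, is notation introduced for this illustration. Some texts scale the sum by an extra factor of 1/2, which does not change the θ that minimizes it.

$$
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}\left(x^{(i)}\right) - y^{(i)} \right)^{2}
$$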

From this definition, a large cost means the predicted values are far from the real values, and a small cost means they are close. Therefore, we have to minimize the cost to get more accurate predictions.
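As a rough illustration of this idea, the sketch below evaluates the cost in R on the built-in `women` dataset (the same data used in the example further down). The helper name `cost` and the two trial θ vectors are assumptions made for this illustration, and the cost is taken to be the plain mean of squared errors.

```r
# Hypothetical helper (not part of base R): mean squared error of the
# one-variable hypothesis h(x) = theta[1] + theta[2] * x.
cost <- function(theta, x, y) {
  predicted <- theta[1] + theta[2] * x
  mean((predicted - y)^2)
}

x <- women$height   # independent variable
y <- women$weight   # dependent variable

cost_zero  <- cost(c(0, 0), x, y)          # all theta zero: every prediction is 0
cost_trial <- cost(c(-87.5, 3.45), x, y)   # trial line that follows the data closely

print(cost_zero)
print(cost_trial)
```

The second cost should come out far smaller than the first, which is what "minimizing the cost" is after: a line whose predictions sit close to the observed values.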

**Linear regression in R**

R is a language and environment for statistical computing, with powerful and comprehensive features for fitting regression models. Let us look at how linear regression works in R. The basic function for fitting a linear model in R is lm(). The format is

fit <- lm(*formula*, *data*)

where *formula* describes the model (in our case, a linear model) and *data* is the data used to fit the model. The resulting object (*fit* in this case) is a list that contains information about the fitted model. The formula is typically written as

Y ~ x1 + x2 + … + xk

where ~ separates the dependent variable (Y) on the left from the independent variables (x1, x2, …, xk) on the right, and the independent variables are separated by + signs.

Let’s look at a simple regression example (taken from the book *R in Action*). The dataset *women* contains the height and weight of 15 women aged 30 to 39. We want to predict weight from height. The R code to fit this model is as below.

    > fit <- lm(weight ~ height, data = women)
    > summary(fit)

The summary function gives information about the object *fit*. The output is as below.

    Call:
    lm(formula = weight ~ height, data = women)

    Residuals:
        Min      1Q  Median      3Q     Max
    -1.7333 -1.1333 -0.3833  0.7417  3.1167

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
    height        3.45000    0.09114   37.85 1.09e-14 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1.525 on 13 degrees of freedom
    Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903
    F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

Let’s understand the output. The values of the coefficients (θs) are -87.51667 and 3.45000, hence the prediction equation for the model is as below.

*weight = -87.52 + 3.45 × height*
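To make the prediction equation concrete, the sketch below applies it to a single height, once by hand from the coefficients and once with `predict()`. The height of 66 is only an illustrative input, and `new_woman` is a hypothetical variable name.

```r
fit <- lm(weight ~ height, data = women)

theta <- coef(fit)                     # named vector: (Intercept), height
new_woman <- data.frame(height = 66)   # hypothetical input for illustration

# Prediction by hand: intercept + slope * height
by_hand <- unname(theta["(Intercept)"] + theta["height"] * new_woman$height)

# Same prediction using predict() on the fitted model
by_predict <- unname(predict(fit, newdata = new_woman))

print(by_hand)
print(by_predict)
```

Both values should agree, at about 140.18 (that is, -87.51667 + 3.45 × 66).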

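In the output, the *residual standard error* (1.525) plays the role of the cost: it is the square root of the residual sum of squares divided by the 13 degrees of freedom, so a smaller value means the predictions are closer to the real values. Now we will look at the real weights of the 15 women and then at the predicted values. The actual weights are as below.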

    > women$weight

Output

    [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

The predicted weights of the 15 women are as below.

    > fitted(fit)

Output

           1        2        3        4        5        6        7        8        9
    112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 140.1833
          10       11       12       13       14       15
    143.6333 147.0833 150.5333 153.9833 157.4333 160.8833

We can see that the predicted values are close to the actual values. We have now seen what regression is, how it works, and how to fit a regression in R.
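How close the predictions are can also be checked numerically. The sketch below refits the model so that it runs on its own, then computes the errors, their mean squared value, and the residual standard error from its definition. The variable names `errors`, `rss`, `mse` and `rse` are chosen for this illustration.

```r
fit <- lm(weight ~ height, data = women)

errors <- women$weight - fitted(fit)   # actual minus predicted

# These hand-computed errors are the same numbers residuals() returns
stopifnot(isTRUE(all.equal(unname(errors), unname(residuals(fit)))))

rss <- sum(errors^2)                   # residual sum of squares
mse <- rss / nrow(women)               # average squared error
rse <- sqrt(rss / df.residual(fit))    # residual standard error

print(round(range(errors), 4))
print(mse)
print(rse)
```

The last value should match the residual standard error reported by `summary(fit)`, about 1.525.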

**Caveat**

Here I want to warn you about a common misunderstanding regarding correlation and causation. In regression, the dependent variable is correlated with the independent variable: as the value of the independent variable changes, the value of the dependent variable also changes. But this does not mean that the independent variable causes the change in the dependent variable. Causation implies correlation, but the reverse is not true. For example, smoking causes lung cancer, whereas smoking is merely correlated with alcoholism. There is a great deal of discussion on this topic, and one blog post is not enough to cover it in depth. For now, keep in mind that regression considers only the correlation between the dependent and independent variables.

In the next post, I will discuss a real-world business problem and how to apply regression to it.

#### Amar Gondaliya

Amar is a data modeling engineer at Tatvic. He focuses on building predictive models from available data using R, Hadoop and the Google Prediction API.