Author Archives: Amar Gondaliya



About Amar Gondaliya

Amar is a data modeling engineer at Tatvic. He is focused on building predictive models based on available data using R, Hadoop and the Google Prediction API. Google Plus Profile: Amar Gondaliya

Predictive analysis on Web Analytics tool data

In our previous webinar, we discussed predictive analytics and the basics needed to perform predictive analysis. We also discussed an eCommerce problem and how it can be solved using predictive analysis. In this post, I will explain the R script that I used to perform the predictive analysis during the webinar.

Before I explain the R script, let me recall the eCommerce problem that we discussed during the webinar, so you can get a better idea about the data and the script. For eCommerce retailers, product returns are a headache, and higher return rates hurt the bottom line of their business. So if the return rate is reduced even by a small amount, it has an impact on total revenue. In order to reduce the return rate, we need to identify transactions where the probability of a product return is higher; if we can identify those transactions, then we can take some action before delivering the products and reduce the return rate.


In the webinar, we discussed that we can solve this problem using predictive analytics on Google Analytics data. To perform predictive analysis we need to go through the modeling process, whose major steps are:

  1. Load input data
  2. Introduce model variables
  3. Create model
  4. Check model performance
  5. Apply model on test data

I have included these steps in the R script that we used in the webinar, shown below.

# Step-1 : Read train dataset
# (file names and the response column name 'product_return' are illustrative;
#  the original right-hand sides were lost in extraction)
train <- read.csv("train.csv")

# remove TransactionID from train dataset
train$TransactionID <- NULL

# Step-3 : Create model
model <- glm(product_return ~ ., family = binomial("logit"), data = train)

# Step-4 : Calculate accuracy of model
predicted <- round(predict(model, type = "response"))
accuracy <- sum(predicted == train$product_return) / nrow(train) * 100

# Step-5 : Applying model on test data
# Load test dataset
test <- read.csv("test.csv")

# Predict probability of product return for test data
test_predict <- predict(model, test, type = "response")

# creating label for test dataset, 0 for every transaction initially
label <- rep(0, nrow(test))

# set label equal to 1 where probability of return > 0.6
label[test_predict > 0.6] <- 1

# attach label to test dataset
test$label <- label

# Identify TransactionIDs where label is 1
high_prob_transactionIds <- test$TransactionID[test$label == 1]
high_prob_transactionIds

As you can see, the first step is to load the input data set. In our case the input data are the train data, which are loaded using the read.csv() function. The train data contain transaction-based records, including a TransactionID. TransactionID is not needed in the model, so it is removed from the train data.

We also discussed the variables during the webinar. The train data include pre-purchase, in-purchase and some general attributes. We can retrieve these data from Google Analytics.

Next, the model is created using the glm() function, which is given three arguments: formula, family and data. In the formula, we specify the response variable and the predictor variables separated by the ~ sign. For the second argument we set family equal to binomial, and last we set data equal to train. Once the model is created, its performance is checked by calculating the accuracy of the model, as shown in the script.

Finally, the model is applied to the test dataset to predict the probability of a product return for each transaction in it. In the script, you can see that I have performed several steps to identify the TransactionIDs from the test data with a higher probability of product return. First, the test data are loaded. Second, the predict() function generates the probabilities of product return, which are stored in test_predict. Third, a new variable label is created which initially contains 0 for all transactions; then, using test_predict, the 0 is replaced with 1 wherever the probability of return is greater than 0.6 (60%). This label is attached to the test data. Finally, all the TransactionIDs are retrieved where label is 1, meaning the probability of a product return is greater than 60% for those transactions.

So this is the script that I used during the webinar to perform the predictive analysis. I have created dummy datasets which you can use to perform these steps yourself. You can download the data and R script from here.

One thing I want to share: this is not an optimized model; it is a practice model. You can improve it by taking other variables from Google Analytics or performing some optimization tasks to get better results. However, if you want to look at some other predictive models on web analytics tool data, click here.


Predict User’s Return Visit within a day part-3

Welcome to the last part of the series on predicting a user's revisit to the website. In the first part of the series, I generated the logistic regression model for the prediction problem of whether a user will come back to the website in the next 24 hours. In the second part, I discussed model improvement and checked the model accuracy.

In this post, I will discuss logistic regression with the Google Prediction API and compare it to our model.

When I used the Google Prediction API on our data set, it returned the following result.

Let's understand the result first. The id of the model is “revisit_log_it”, the model type is “CLASSIFICATION” (i.e. logistic regression), the number of instances is 2555 (i.e. the data set contains 2555 rows), and the most important figure is the classification accuracy, which is 0.98 (98%).

Our model's accuracy was 98.43%, which is similar to the Google Prediction API result. Having compared both results, let's try to predict whether a user will come back to the website in the next 24 hours. Suppose we have tracked the following information from a user's last visit and want to predict whether the user will return in the next 24 hours:

  1. visitCount - 2
  2. daySinceLastVisit - 0
  3. medium - organic
  4. landingPagePath - '/features-adwords-excel-add-in/
  5. exitPagePath - '/excel-add-in-calculator/
  6. pageDepth - 2

Here, we first need to understand all the parameter values. In this record, visitCount is 2, meaning the user has visited a second time; daySinceLastVisit is 0, meaning the user visited the second time within a day (not after some days); medium is organic, meaning the user came through a search engine; landingPagePath is " '/features-adwords-excel-add-in/ ", meaning the user entered the website on this page; exitPagePath is " '/excel-add-in-calculator/ ", meaning the user exited the website from this page; and pageDepth is 2, meaning the user visited 2 pages during the visit. Let's predict whether this user will come back to the website in the next 24 hours. The R code for predicting the above observation is as below.

>in_d <- data.frame(DaySinceLastVisit=0, visitCount=2, f.medium="organic", f.landingPagePath="'/features-adwords-excel-add-in/", f.exitPagepath="'/excel-add-in-calculator/", pageDepth=2)
>round(predict(Model_3,in_d,type="response"))
Output
1

The output 1 means the user will come back to the website in the next 24 hours. Let's make the same prediction using the Google Prediction API; the response is as below.

We can see from the response that outputLabel is “YES”, meaning the user will come back in the next 24 hours. Finally, we have made the prediction for a user using both models (i.e. our model and the Prediction API model).

Feel free to write your feedback about this series of posts, and let us know if you want to do such a predictive analysis.

Would you like to understand the value of predictive analysis when applied to web analytics data, to help improve your understanding of the relationships between different variables? We think you may like to watch our Webinar – How to perform predictive analysis on your web analytics tool data. Watch the Replay now!


Predict User’s Return Visit within a day part-2

Welcome to the second part of the series on predicting a user's revisit to the website. In my earlier blog Logistic Regression with R, I discussed what logistic regression is. In the first part of the series, we applied logistic regression to the available data set. The problem statement there was whether a user will return in the next 24 hours or not. The model was built, and so far it has shown 88% accuracy in predicting a user's revisit.

In this post, I'll try to showcase ways to improve this accuracy and take it to the next level. This is more about technical optimization, so if you are a business reader you may want to skip ahead and check how you can use this for your benefit. But if you are a techwiz or a data modeling guy like me, let's get rolling.

As I discussed in the blog Improving Bounce Rate Prediction Model for Google Analytics Data, the first step of model improvement is variable selection and the second step is outlier detection (if you want more details on these steps, refer to the mentioned blog). Let's apply these steps one by one.

Variable selection

I have used the stepwise backward selection method for variable selection. The R code for it is as below.

>Model_1 <- glm(revisit ~ DaySinceLastVisit + visitCount +f.medium +f.landingPagePath +f.exitPagepath+pageDepth, data=data ,family = binomial("logit"))
>library(MASS)
>stepAIC(Model_1, direction="backward")
Output
Start:  AIC=2119.37
revisit ~ DaySinceLastVisit + visitCount + f.medium + f.landingPagePath +  f.exitPagepath + pageDepth

                     Df Deviance    AIC
- f.exitPagepath    152   1732.4 1966.4
- f.landingPagePath  87   1751.0 2115.0
&lt;none&gt;                    1581.4 2119.4
- pageDepth           1   1583.4 2119.4
- f.medium           11   1656.5 2172.5
- visitCount          1   1740.1 2276.1
- DaySinceLastVisit   1   1826.4 2362.4

Step:  AIC=1966.42
revisit ~ DaySinceLastVisit + visitCount + f.medium + f.landingPagePath + pageDepth

                     Df Deviance    AIC
&lt;none&gt;                    1732.4 1966.4
- pageDepth           1   1738.9 1970.9
- f.landingPagePath 101   1987.5 2019.5
- f.medium           12   1821.2 2031.2
- visitCount          1   1929.3 2161.3
- DaySinceLastVisit   1   1978.4 2210.4

Before we interpret the output, let me explain how variables are selected in stepwise backward selection. The method uses AIC as the selection criterion. The general rule is: the lower the AIC, the better the model (i.e. for a group of variables, if the AIC decreases by removing a variable from the group, then the remaining variables are used in the model; this process continues until the AIC stops decreasing). From the output, we can see that the AIC decreased and the variable f.exitPagepath was excluded from the model. Now we will create a new model (Model_2) which does not include f.exitPagepath. The R code for the new model is as below.

>Model_2<-glm(revisit ~ DaySinceLastVisit + visitCount +f.medium +f.landingPagePath +pageDepth, data=data,family = binomial("logit"))

After generating the new model, let's check its accuracy, as below.

>predicted_revisit <- round(predict(Model_2, data, type="response"))
>confusion_matrix<- ftable(revisit, predicted_revisit)
>accuracy<- sum(diag(confusion_matrix))/2555*100
Output
86.57534

From the output, we can see that the accuracy of the new model decreased, which is not good for us. The variable selection method did not help us improve the model. Let's try the second step of model improvement, outlier detection.

Outlier detection

As we know, the data set contains some unreliable observations which degrade the model's quality. We always need to detect outliers and remove them. For numerical variables, outliers can be removed by observing the histogram of the frequency distribution of each variable's values (the process is described in the blog Improving Bounce Rate Prediction Model for Google Analytics Data). In our data set, there are three numerical variables: visitCount, daySinceLastVisit and pageDepth. I have generated a new data set after removing outliers. Let's create a new model based on it and check its accuracy. The R code for the new model is as below.

>Model_3 <- glm(revisit ~ DaySinceLastVisit + visitCount +f.medium +f.landingPagePath +f.exitPagepath+pageDepth, data=data_outlier_removed ,family = binomial("logit"))
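The post does not show how `data_outlier_removed` is produced. Below is a minimal sketch of one possible approach, trimming values above the 99th percentile of each numerical variable; the data and threshold are illustrative (the original analysis inspects histograms instead):

```r
# Illustrative data: 100 visits with two extreme visitCount values
set.seed(1)
data <- data.frame(visitCount        = c(rpois(98, 2) + 1, 60, 75),
                   daySinceLastVisit = rpois(100, 1),
                   pageDepth         = rpois(100, 3) + 1)
# Keep rows whose numerical values fall at or below the 99th percentile
keep <- with(data, visitCount <= quantile(visitCount, 0.99) &
                   pageDepth  <= quantile(pageDepth, 0.99))
data_outlier_removed <- data[keep, ]
nrow(data_outlier_removed)   # fewer rows than the original 100
```

Whatever the exact rule, the point is that the model is then refitted on the trimmed data frame instead of the original one.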

Now, we will check the accuracy of the new model and it is as below.

>predicted_revisit <- round(predict(Model_3, data_outlier_removed, type="response"))
>confusion_matrix <- ftable(revisit, predicted_revisit)
>accuracy <- sum(diag(confusion_matrix))/2292*100
Output
98.42932

From the result, we can see that this model is more accurate than the previous models (Model_1 and Model_2), which is good for us. So, by removing the outliers from the data set, the model gained a substantial improvement in prediction accuracy. For now, we can conclude that through this model (Model_3) we can predict more accurately whether a user will return to the website in the next 24 hours. If you want to do the exercise, click here for the R code and sample data set. In the next blog, we will discuss logistic regression with the Google Prediction API, check its accuracy on our data set, and try to predict whether a user will return to the website in the next 24 hours.



Predict User’s Return Visit within a day part-1

In my earlier blog, I discussed what logistic regression is and how a logistic model is generated in R. Now we will apply that learning to a specific prediction problem. In this post, I will create a basic model to predict whether a user will return to the website in the next 24 hours. This problem depends on user characteristics as well as website characteristics, but here we will predict based on some measures (i.e. the user's visits, landing page path, exit page path, etc.). Our predicted outcome is 1 or 0, where 1 stands for "Yes" and 0 stands for "No". Let's discuss a possible data set to build a logistic regression model.

For this problem, we have collected the data of a website from Google Analytics. The data set contains the following parameters.

  1. visitor_ID
  2. visitCount
  3. daysSinceLastVisit
  4. medium
  5. landingPagePath
  6. exitPagePath
  7. pageDepth

Let's understand the parameters first. The first parameter is visitor_ID, the id of the visitor. The second is visitCount, which contains values in increasing order (i.e. for a particular visitor, if the visitor visits the site for the first time the value of visitCount is 1, the second time it is 2, and so on). The third is daysSinceLastVisit, which contains the difference in days between two consecutive visits. The fourth is medium, which contains categorical values (i.e. organic, referral, etc.). The fifth is landingPagePath, a string representing the entrance page of the user for each visit. The sixth is exitPagePath, a string representing the exit page of the user for each visit. The last parameter is pageDepth, which represents how many pages a user visited during a single visit.

Our goal is to predict whether a user will return to the website in the next 24 hours. From the collected data, we can say that a user came back if his visitCount is more than 1 and daysSinceLastVisit is less than or equal to 1. Based on this criterion, we have generated a new variable named revisit, which contains the value "1" or "0" for each user: "1" indicates the user came back and "0" indicates the user did not. This variable (revisit) is considered the dependent variable, and visitCount, daysSinceLastVisit, medium, landingPagePath, exitPagePath and pageDepth are considered the independent variables. Let's generate the model.
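The revisit rule can be written directly in R; here is a minimal sketch with toy rows (column names follow the parameter list above):

```r
# Toy rows following the parameter list above
data <- data.frame(visitCount         = c(1, 2, 3, 1),
                   daysSinceLastVisit = c(0, 0, 5, 2))
# revisit = 1 when the user came back within a day
data$revisit <- as.numeric(data$visitCount > 1 & data$daysSinceLastVisit <= 1)
data$revisit   # 0 1 0 0
```

Only the second row satisfies both conditions (a repeat visit within a day), so only it gets revisit = 1.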

Before generating a model, let's discuss one issue: the data set contains categorical variables, so how do we deal with them? In the blog Linear Regression using R, only numeric values were considered, but in logistic regression we need to handle categorical values as well. There are many solutions for this issue, but I have used dummy variable coding.

In dummy coding, a variable with K categories is usually entered into a regression as a sequence of K-1 dummy variables. Our data set has three categorical variables: medium, landingPagePath and exitPagePath, with 14, 102 and 167 categories respectively. Generally we do not append the dummy variables ourselves; instead, a contrast matrix is created for each categorical variable. I have done the dummy coding for our categorical variables. We will not go into the details of the coding scheme, because one blog is not enough to explain dummy coding; we will deal only with our actual prediction problem.
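To make the idea concrete, here is a minimal sketch of treatment (dummy) coding in R; the medium values are illustrative:

```r
# A categorical variable with K = 3 categories (values are illustrative)
f.medium <- factor(c("organic", "referral", "cpc", "organic"))
# Treatment contrasts: K categories become K - 1 dummy columns
contrasts(f.medium)
# model.matrix() shows the columns glm() actually works with
model.matrix(~ f.medium)
```

With 3 categories, the contrast matrix has 2 columns, and the design matrix has those 2 dummy columns plus the intercept; glm() builds this automatically for any factor in the formula.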

Let's generate the regression model based on the data set. The R code for our model is as below.

>Model_1 <- glm(revisit ~ DaySinceLastVisit + visitCount +f.medium +f.landingPagePath +f.exitPagepath+pageDepth, data=data ,family = binomial("logit"))

Here I am not going to show the summary of the model, because it is too large to view. The question then arises: how do we judge the model's effectiveness without the summary? I have chosen an alternate measure of effectiveness: the accuracy of the model. Accuracy is calculated as the percentage of predictions that match the actual values. Through the accuracy, we can decide the effectiveness of the model. Following is the R code snippet to calculate the accuracy of our model.

>confusion_matrix <- ftable(actual_revisit, predicted_revisit)
>accuracy <- sum(diag(confusion_matrix))/2555*100
Output
88.21918

In the above R code, I have used the ftable() function to generate the confusion matrix used in calculating the accuracy of the model. We will not discuss the confusion matrix in detail, because it is out of the scope of this blog; for more detail, refer to the wiki page on the confusion matrix. From the output, we can see that the accuracy of our model is 88.22%, which is good for us. But we can increase the accuracy if we improve the model; if the accuracy is above 95%, we can predict more accurately. Before we generate some predictions, we will improve the model first and then predict using the improved model. If you want to do the same exercise, click here for the R code and sample data set. In the next blog, we will discuss model improvement and check the accuracy of the improved model.
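Although the confusion matrix itself is out of scope here, a toy sketch shows how ftable() and diag() combine into the accuracy figure (the vectors are made-up data, not the actual data set):

```r
# Toy actual vs. predicted outcomes (not the webinar data set)
actual    <- c(1, 0, 1, 1, 0)
predicted <- c(1, 0, 0, 1, 0)
cm <- ftable(actual, predicted)          # 2x2 confusion matrix
accuracy <- sum(diag(cm)) / length(actual) * 100
accuracy                                 # 4 of 5 predictions correct
```

The diagonal of the matrix counts the cases where the prediction matched the actual outcome, so dividing its sum by the number of observations gives the accuracy, here 80%.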



Logistic Regression with R

Logistic Regression

In my first blog post, I explained what regression is and how a linear regression model is generated in R. In this post, I will explain what logistic regression is and how a logistic regression model is generated in R.

Let's first understand logistic regression. Logistic regression is a type of regression used to predict the outcome of a categorical dependent variable (i.e. a variable with a limited number of categorical values) based on one or more independent variables. For example, you might predict who will win the next T20 World Cup based on players' strength and other details; that is a prediction of a categorical variable. Logistic regression can be binomial or multinomial.

In binomial (binary) logistic regression, the outcome can have only two possible values (e.g. "Yes" or "No", "Success" or "Failure"). Multinomial logistic regression refers to cases where the outcome can have three or more possible values (e.g. "good" vs. "very good" vs. "best"). Generally the outcome is coded as "0" and "1" in binary logistic regression. We will use binary logistic regression in the rest of this blog. Now let's look at how the logistic regression model is generated in R.

Logistic regression in R

To fit a logistic regression model, the glm() function is used in R. It is similar to lm(), but glm() takes additional parameters. The format is

glm(Y~X1+X2+X3, family=binomial(link=”logit”), data=mydata)

Here, Y is the dependent variable and X1, X2 and X3 are the independent variables. The function takes the additional parameter family, with the value binomial(link="logit"), which means the probability distribution of the regression model is binomial and the link function is logit (refer to the book R in Action for more information). Let's generate a simple model. Suppose we want to predict whether a student will get admission based on his two exam scores. For this problem we have historical data from previous applicants, which can be used as the training data set to build a model. The data set contains the following parameters.

  1. exam_1- Exam-1 score
  2. exam_2- Exam-2 score
  3. admitted- 1 if admitted or 0 if not admitted

Among the above parameters, admitted has the value 1 or 0 for each observation. Now we will generate a model that can predict whether a student will get admission based on the two exam scores. For this problem, admitted is considered the dependent variable, and exam_1 and exam_2 are considered the independent variables. The R code for the model is given below.

>Model_1<-glm(admitted ~ exam_1 +exam_2, family = binomial("logit"), data=data)

After generating the model, let's try to predict using it. Suppose a student scored 60 on exam_1 and 86 on exam_2. Will the student get admission? Following is the R code for predicting the student's probability of admission.

>in_frame<-data.frame(exam_1=60,exam_2=86)
>predict(Model_1,in_frame, type="response")
Output
0.9894302

Here, the output is a probability score in the range 0 to 1. If the probability score is greater than 0.5 it is considered TRUE; if it is less than or equal to 0.5 it is considered FALSE. In our case, 1 or 0 is the output used to decide whether the student will get admission: if it is 1 the student will get admission, otherwise not. So I have used the round() function to convert the probability score to 0 or 1, as below.

>round(predict(Model_1, in_frame, type="response"))
Output
1

The output 1 means the student will get admission. We can predict for other observations in the same manner. Finally, we have understood what logistic regression is and how it works in R. If you want to do the same exercise, click here for the R code and sample data set of the above example. In the next blog, we will discuss a specific problem for Google Analytics data and see how to apply logistic regression to it.



Predict Bounce Rate based on Page Load Time in Google Analytics

Welcome to the second part. In the last blog post on Linear Regression with R, we discussed what regression is and how it is used. Now we will apply that learning to a specific prediction problem. In this post, I will create a basic model to predict bounce rate as a function of page load time components. In the next blog, I'll share how to improve the model to improve the prediction.

We know that bounce rate is important for a website. Here, we want to identify relationships between bounce rate and the time components of a web page (e.g. average page download time, average page load time, average server response time, etc.) and how much these time components impact bounce rate. For this problem, we have collected data on various websites from Google Analytics. The data set contains the following parameters.

  1. x_id – Id of the page
  2. ismobile - whether the page was visited from a mobile device or not
  3. Country
  4. pagePath
  5. pageTitle
  6. avgServerResponseTime
  7. avgServerConnectionTime
  8. avgRedirectionTime
  9. avgPageDownloadTime
  10. avgDomainLookupTime
  11. avgPageLoadTime
  12. entrances
  13. pageviews
  14. exits
  15. bounces

Each parameter is tracked for a single page. We have 8488 rows in the data set, and we have calculated the bounce rate for each page as below.

Bounce rate = (bounces / entrances)*100

Here, we want to know the impact of average server response time, average server connection time, average redirection time, average domain lookup time, average page download time and average page load time on the bounce rate. So we have rearranged the data set: we removed x_id, ismobile, country, page path, page title, entrances, page views, exits and bounces, and appended bouncerate after calculating it. The data set now contains the following parameters.

  1. bouncerate
  2. avgServerResponseTime
  3. avgServerConnectionTime
  4. avgRedirectionTime
  5. avgPageDownloadTime
  6. avgDomainLookupTime
  7. avgPageLoadTime
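The bounce rate calculation and column pruning described above can be sketched in R; here is a minimal example with toy values (only one time component is shown for brevity):

```r
ga <- data.frame(x_id = 1:3,
                 entrances = c(100, 50, 200),
                 bounces = c(40, 10, 90),
                 avgPageLoadTime = c(2.1, 3.5, 1.2))
# Bounce rate = (bounces / entrances) * 100
ga$bouncerate <- ga$bounces / ga$entrances * 100
# Keep only the rate and the time component(s)
ga <- ga[, c("bouncerate", "avgPageLoadTime")]
ga$bouncerate
```

The same column subsetting, applied to all six time components, yields the rearranged data set used below.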

Let's use regression on this data set. We want to identify the dependency of the bounce rate on the time components, so we will consider bouncerate the dependent variable and the rest of the parameters the independent variables. The regression model for our data set in R is as below.

>Model_1 <- lm(bouncerate ~ avgServerResponseTime + avgServerConnectionTime + avgRedirectionTime + avgPageDownloadTime +avgDomainLookupTime + avgPageLoadTime)

We have generated the model, but we are interested in the relationships between bounce rate and the time components. Let's check the summary of the model.

>summary(Model_1)
Output
Call:
lm(formula = bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
    avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
    avgPageLoadTime)

Residuals:
    Min      1Q  Median      3Q     Max
-98.276 -19.816  -1.169  19.805 107.705 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             49.10686    0.32862 149.435  < 2e-16 ***
avgServerResponseTime   -0.85724    0.17154  -4.997 5.93e-07 ***
avgServerConnectionTime  2.02335    0.55566   3.641 0.000273 ***
avgRedirectionTime      -0.37822    0.06368  -5.939 2.97e-09 ***
avgPageDownloadTime      0.31975    0.12172   2.627 0.008631 **
avgDomainLookupTime      4.14929    0.88525   4.687 2.81e-06 ***
avgPageLoadTime          0.04684    0.01896   2.470 0.013528 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 26.74 on 8481 degrees of freedom
Multiple R-squared: 0.01339,	Adjusted R-squared: 0.0127
F-statistic: 19.19 on 6 and 8481 DF,  p-value: < 2.2e-16

Let's understand the result. The coefficients are shown in the Estimate column, so the equation for bounce rate becomes:

bouncerate = 49.11 + (-0.86)*avgServerResponseTime + (2.02)*avgServerConnectionTime + (-0.38)*avgRedirectionTime + (0.32)*avgPageDownloadTime + (4.15)*avgDomainLookupTime + (0.05)*avgPageLoadTime

As we can see from the equation, avgDomainLookupTime has the largest coefficient and so the biggest impact on bounce rate: if avgDomainLookupTime increases by 1 unit, the bounce rate increases by about 4.15, other variables held constant. At last, we have succeeded in identifying the relationship between bounce rate and the time components of a web page using regression.
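To see the equation in action, we can plug sample time values into it. A minimal sketch (the time values are made up; the coefficients are the summary's estimates rounded to two decimals):

```r
# Rounded coefficient estimates from the model summary
coefs <- c(intercept = 49.11, srt = -0.86, sct = 2.02, rt = -0.38,
           pdt = 0.32, dlt = 4.15, plt = 0.05)
# Hypothetical time values for one page; the leading 1 multiplies the intercept
times <- c(1, 0.5, 0.1, 0.2, 1.0, 0.3, 4.0)
bouncerate <- sum(coefs * times)
round(bouncerate, 2)   # predicted bounce rate for this page
```

This is exactly what predict() does internally for a fitted lm object, just written out by hand for one observation.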

Here, we cannot say that the relationships estimated from this regression model (Model_1) are perfect, because the result is generated after the model is fitted to the data set (i.e. the model learns from the data and then estimates the coefficient values), and the data set may contain some unreliable observations. It is necessary to improve the model so that we can identify the relationships between bounce rate and the time components more precisely. In the next blog, we will discuss how to improve the model and look at the summary of the improved model.



Linear Regression using R

Regression

Through this post I am going to explain how linear regression works. Let us start with what regression is. Regression is widely used for prediction and forecasting in the field of machine learning. The focus of regression is on the relationship between a dependent variable and one or more independent variables. The dependent variable represents the output or effect, or is tested to see if it is the effect. The independent variables represent the inputs or causes, or are tested to see if they are the cause. Regression analysis helps us understand how the value of the dependent variable changes when any one of the independent variables is varied while the others are kept unchanged. In regression, the dependent variable is estimated as a function of the independent variables; this function is called the regression function. A regression model involves the following variables.

  • Independent variables X.
  • Dependent variable Y
  • Unknown parameter θ

In the regression model, Y is a function of (X, θ). There are many techniques for regression analysis, but here we will consider linear regression.

Linear regression

In linear regression, the dependent variable (Y) is a linear combination of the independent variables (X). Here the regression function is known as the hypothesis, which is defined as below.

hθ(X) = f(X,θ)

Suppose we have only one independent variable (x); then our hypothesis is defined as below.

hθ(x) = θ0 + θ1x

The goal is to find values of θ (known as coefficients) that minimize the difference between the real and the predicted values of the dependent variable (y). If we take all values of θ to be zero, then our predicted value will be zero. The cost function is used as the measurement factor of a linear regression model; it calculates the average squared error over m observations (here written with the common 1/(2m) convention). The cost function is denoted J(θ) and defined as below.

J(θ) = (1 / 2m) * Σ (hθ(x(i)) − y(i))²,  summed over i = 1 … m

As we can see from the formula, if the cost is large then the predicted values are far from the real values, and if the cost is small then the predicted values are near the real values. Therefore, we have to minimize the cost to achieve a more accurate prediction.
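The cost computation can be sketched in R for the one-variable case; the data points below are illustrative:

```r
# Cost J(theta) for a single independent variable, 1/(2m) convention
cost <- function(theta, x, y) {
  m <- length(y)
  h <- theta[1] + theta[2] * x      # hypothesis h(x) = theta0 + theta1 * x
  sum((h - y)^2) / (2 * m)
}
x <- c(1, 2, 3)
y <- c(2, 4, 6)
cost(c(0, 2), x, y)   # perfect fit: cost is 0
cost(c(0, 0), x, y)   # all thetas zero: predictions are all 0
```

With θ = (0, 2) the hypothesis reproduces y exactly and the cost is 0; with θ = (0, 0) every prediction is 0 and the cost is large, which is exactly the behaviour the formula describes.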

Linear regression in R

R is a language and environment for statistical computing, with powerful and comprehensive features for fitting regression models. We will discuss how linear regression works in R. In R, the basic function for fitting a linear model is lm(). The format is

fit <- lm(formula, data)

where formula describes the model (in our case a linear model) and data specifies the data used to fit the model. The resulting object (fit in this case) is a list that contains information about the fitted model. The formula is typically written as

Y ~ x1 + x2 + … + xk

where ~ separates the dependent variable (y) on the left from the independent variables (x1, x2, …, xk) on the right, and the independent variables are separated by + signs. Let's see a simple regression example (taken from the book R in Action). We have the dataset women, which contains the heights and weights of 15 women aged 30 to 39, and we want to predict weight from height. The R code to fit this model is as below.

>fit <- lm(weight ~ height, data=women)
>summary(fit)

The output of the summary() function gives information about the object fit. The output is as below.

Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991,	Adjusted R-squared: 0.9903
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

Let's understand the output. The values of the coefficients (θs) are -87.51667 and 3.45000, hence the prediction equation for the model is as below.

Weight = -87.52 + 3.45*height

In the output, the residual standard error plays the role of the cost; it is 1.525. Now we will look at the actual weights of the 15 women first, and then at the predicted values. The actual weights are as below.

>women$weight
Output
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

Predicted values of 15 women are as below

>fitted(fit)
Output
       1        2        3        4        5        6        7        8        9
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 140.1833
      10       11       12       13       14       15
143.6333 147.0833 150.5333 153.9833 157.4333 160.8833

We can see that the predicted values are close to the actual values. With that, we have covered what regression is, how it works, and how to perform it in R.
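To double-check this, here is a small sketch that recomputes the residual standard error by hand from the fitted model, using the built-in women dataset:

```r
# Fit the model on R's built-in women dataset and compare actual vs predicted.
fit <- lm(weight ~ height, data = women)

residuals <- women$weight - fitted(fit)  # actual minus predicted weights
max(abs(residuals))                      # largest prediction error, about 3.12 lbs

# The residual standard error reported by summary(fit), computed by hand:
sqrt(sum(residuals^2) / df.residual(fit))  # about 1.525, on 13 degrees of freedom
```

The hand-computed value matches the "Residual standard error: 1.525 on 13 degrees of freedom" line in the summary output above.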

Caveat

Here, I want to warn you about a common misunderstanding of correlation and causation. In regression, the dependent variable is correlated with the independent variables; this means that as the value of an independent variable changes, the value of the dependent variable also changes. But this does not mean that the independent variable causes the change in the dependent variable. Causation implies correlation, but the reverse is not true. For example, smoking causes lung cancer, and smoking is also correlated with alcoholism, but that does not make alcoholism a cause of lung cancer. There are many discussions of this topic, and one blog post is not enough to cover it in depth. But keep in mind that regression only captures the correlation between the dependent and independent variables.

In the next blog, I will discuss a real-world business problem and how to use regression to solve it.

Would you like to understand the value of predictive analysis when applied to web analytics data, to help improve your understanding of the relationships between different variables? We think you may like to watch our Webinar – How to perform predictive analysis on your web analytics tool data. Watch the Replay now!

Amar Gondaliya

Amar is a data modeling engineer at Tatvic. He is focused on building predictive models based on available data using R, Hadoop and the Google Prediction API. Google Plus profile: Amar Gondaliya

Regression with Google Prediction API

Welcome to the last part. In the previous blog, we discussed model improvement and looked at the summary of the improved model. In this post, I will discuss regression with the Google Prediction API, compare it with our regression model, and predict the bounce rate. When I ran the Google Prediction API on our dataset, I found the following result.

Let's understand the result first. The id of the model is "a1", the model type is "REGRESSION", the number of instances is 8488 (i.e., the dataset contains 8488 rows), and the most important figure is the mean squared error, which is 704.82. Here there is no information about the coefficients of the model, so the question arises: how do we evaluate the model? Don't worry, I will explain.

In the first part, I explained cost: the lower the cost, the better the model. The cost of the Google Prediction API model is calculated as the square root of the mean squared error, which is 26.55, while the cost of our improved model is reported as the residual standard error, which is 24.83. Comparing these two costs, we can say that the R regression model performs similarly to the Google Prediction API model.
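The comparison can be reproduced in a couple of lines of R; the numbers are the ones reported above, nothing else is assumed:

```r
# Put the two models on the same scale: the Google Prediction API reports a
# mean squared error, whose square root is comparable to R's residual
# standard error.
api_mse  <- 704.82
api_cost <- sqrt(api_mse)  # about 26.55
r_cost   <- 24.83          # residual standard error of the improved R model
api_cost - r_cost          # the two costs differ by less than 2
```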

After understanding the relationships between bounce rate and the time components, let's predict the bounce rate with the regression model. R provides the predict() function to generate predictions. Suppose we have the following observation for a webpage.

  • avgServerResponseTime – 0.427189189
  • avgServerConnectionTime – 0.007081081
  • avgRedirectionTime – 0.318081081
  • avgPageDownloadTime – 0.416432432
  • avgDomainLookupTime – 0.033351351
  • avgPageLoadTime – 3.395026316

The R code to generate the prediction is as follows.

>insert_frame <- data.frame(avgServerResponseTime   = 0.427189189,
                            avgServerConnectionTime = 0.007081081,
                            avgRedirectionTime      = 0.318081081,
                            avgPageDownloadTime     = 0.416432432,
                            avgDomainLookupTime     = 0.033351351,
                            avgPageLoadTime         = 3.395026316)
>predict(Model_2,insert_frame,type='response')
Output
50.14

Let's check the prediction for the same observation with the Google Prediction API. This is shown below.

From the result, we can see that the Google Prediction API's prediction for the observation is 48.39, which is close to the 50.14 produced by our regression model.



Improving Bounce Rate Prediction Model for Google Analytics Data

Welcome to the third part. In the previous blog, we discussed the relationships between bounce rate and the page load time components. We also fitted a regression model to identify the relationships and discussed why the model should be improved.

In this post, I will discuss the steps for improving the existing model, to increase the accuracy of predicting bounce rate from the components of page load time. In model improvement, the first step is variable selection (i.e., choosing the independent variables) and the second step is outlier detection. These two steps are essential for model improvement. We will discuss them one by one.

Variable selection is a crucial part of model improvement, because the independent variables play an important role in developing the best model. We always need to identify which variables are important for the model and which are not. There are two methods for variable selection: the first is stepwise selection and the second is all subsets regression. Let's discuss them one by one.

Stepwise selection

In stepwise selection, variables are added to or deleted from a model one at a time until some stopping criterion is reached. For example, in stepwise forward selection we add independent variables to the model one at a time, stopping when adding further variables would no longer improve the model.

In stepwise backward selection we start with a model that includes all independent variables, and then delete them one at a time until removing a variable would degrade the quality of the model. R provides the MASS package, whose stepAIC() function performs stepwise selection. I have used stepwise backward selection as below.

>library(MASS)
>stepAIC(Model_1, direction="backward")
Output
Start:  AIC=55790.44
bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
    avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
    avgPageLoadTime

                          Df Sum of Sq     RSS   AIC
                                 6062418 55790
- avgPageLoadTime          1    4361.4 6066779 55795
- avgPageDownloadTime      1    4932.9 6067351 55795
- avgServerConnectionTime  1    9478.1 6071896 55802
- avgDomainLookupTime      1   15704.0 6078122 55810
- avgServerResponseTime    1   17852.1 6080270 55813
- avgRedirectionTime       1   25216.4 6087634 55824

Call:
lm(formula = bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
    avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
    avgPageLoadTime)

Coefficients:
            (Intercept)    avgServerResponseTime  avgServerConnectionTime
               49.10686                 -0.85724                  2.02335
     avgRedirectionTime      avgPageDownloadTime      avgDomainLookupTime
               -0.37822                  0.31975                  4.14929
        avgPageLoadTime
                0.04684

As we can see from the result, there is a term AIC. AIC is used as the stopping criterion for variable selection; the general rule is: the lower the AIC, the better the model. Here we used the backward selection method and no variable was removed. This tells us that removing any variable from the model would not decrease the AIC, and would therefore not improve the current model.

All subset regression

In all subsets regression, every possible model is inspected. All subsets regression is performed using the regsubsets() function from the leaps package. We can request the best n models of each size by setting nbest=2, 3, …, and we can choose R-squared or adjusted R-squared as our criterion for the best model. I have used adjusted R-squared and nbest=2, as below.

>library(leaps)
>leaps <-regsubsets(bouncerate ~ avgServerResponseTime + avgServerConnectionTime + avgRedirectionTime + avgPageDownloadTime +avgDomainLookupTime + avgPageLoadTime,data=data,nbest=2)
>plot(leaps, scale="adjr2")

When we plot the result, we learn that the maximum value of adjusted R-squared is reached when all the variables are selected. This is shown in the plot below.

In the plot, we can see that the minimum value of adjusted R-squared is 0.003, with only two corresponding variables (the intercept and avgDomainLookupTime) marked in black. The maximum value of adjusted R-squared is 0.013, with all corresponding variables marked in black; this means all the variables should be selected.

After variable selection, we can conclude that in the current model all independent variables are important and all of them impact the bounce rate. Let's move to the second step of model improvement.

Now we have to detect outliers, because outliers degrade the quality of the model, and the quality of the model affects our ability to identify the relationships between bounce rate and the page load time components. To detect outliers we take a summary of the data, which gives the minimum, maximum, quartile and median values of every variable. The summary of the data is as below.

>summary(data)
Output
   bouncerate     avgServerResponseTime avgServerConnectionTime avgRedirectionTime
 Min.   :  0.00   Min.   :  0.00000     Min.   : 0.00000        Min.   :0.000e+00
 1st Qu.: 29.09   1st Qu.:  0.08771     1st Qu.: 0.00000        1st Qu.:0.000e+00
 Median : 47.97   Median :  0.21517     Median : 0.00904        Median :1.687e-03
 Mean   : 49.30   Mean   :  0.63631     Mean   : 0.08004        Mean   :5.147e-01
 3rd Qu.: 69.23   3rd Qu.:  0.74684     3rd Qu.: 0.05492        3rd Qu.:6.270e-02
 Max.   :100.00   Max.   :110.49100     Max.   :26.58582        Max.   :1.757e+02
 avgPageDownloadTime avgDomainLookupTime avgPageLoadTime
 Min.   : 0.00000    Min.   : 0.000000   Min.   :  0.004
 1st Qu.: 0.04575    1st Qu.: 0.000000   1st Qu.:  1.994
 Median : 0.21200    Median : 0.000000   Median :  4.105
 Mean   : 0.73316    Mean   : 0.041300   Mean   :  7.886
 3rd Qu.: 0.64848    3rd Qu.: 0.001388   3rd Qu.:  7.973
 Max.   :93.17600    Max.   :20.184500   Max.   :485.234

When we check the max value for each variable, we find that for many variables (avgServerResponseTime, avgPageLoadTime, etc.) the max value is far out of line with the other statistics (mean, median, etc.). Take avgPageLoadTime as an example: there is a big difference between the max value (485.234) and the mean (7.886). The max value is much larger than the mean, which means the variable has outliers. So we need to check a histogram of this variable to see the frequency of occurrence of its values. I have plotted the histogram for avgPageLoadTime as below.
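The histogram itself is not reproduced here, but a sketch like the following regenerates the same kind of plot; simulated page load times stand in for the Google Analytics export used in the post:

```r
# Sketch: a histogram to spot outliers. Simulated values stand in for the
# real avgPageLoadTime column; outliers show up as isolated bars far to the
# right of the bulk of the distribution.
set.seed(1)
avgPageLoadTime <- c(rexp(1000, rate = 1/8),  # bulk of the values, mean ~8s
                     c(200, 350, 485))        # a few extreme outliers
hist(avgPageLoadTime,
     breaks = 100,
     main = "Distribution of avgPageLoadTime",
     xlab = "avgPageLoadTime (seconds)")
```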

From the histogram, we can see that values from 0 to 50 occur with high frequency, while the remaining values occur very rarely (once or twice); these are outliers, and they make the quality of the regression model poor. So we have to remove the outliers. R's subset() function can be used to do this. After plotting a histogram for each variable, I used the subset() function as below.

>data_frame_hist <- subset(data, data$avgServerResponseTime < 10 & data$avgServerConnectionTime < 2.5 & data$avgPageDownloadTime < 10 & data$avgRedirectionTime < 10 & data$avgDomainLookupTime < 2.5 & data$avgPageLoadTime < 50)

I have set constraints on all the variables and generated a new dataset. Let's fit a second model on the new dataset and check its summary.

>Model_2 <- lm(bouncerate ~ avgServerResponseTime + avgServerConnectionTime + avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime + avgPageLoadTime, data=data_frame_hist)
>summary(Model_2)
Output
Call:
lm(formula = bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
    avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
    avgPageLoadTime)

Residuals:
    Min      1Q  Median      3Q     Max
-86.845 -18.834  -1.784  18.245  90.590 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             51.81372    0.40375 128.331  < 2e-16 ***
avgServerResponseTime   -4.47623    0.34155 -13.106  < 2e-16 ***
avgServerConnectionTime 18.08111    1.77633  10.179  < 2e-16 ***
avgRedirectionTime      -3.47483    0.37124  -9.360  < 2e-16 ***
avgPageDownloadTime      0.95150    0.33956   2.802  0.00509 **
avgDomainLookupTime     15.44193    2.15348   7.171 8.14e-13 ***
avgPageLoadTime          0.08968    0.05385   1.665  0.09589 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 24.83 on 7892 degrees of freedom
Multiple R-squared: 0.0522,	Adjusted R-squared: 0.05148
F-statistic: 72.44 on 6 and 7892 DF,  p-value: < 2.2e-16

Now we have the result of the improved model. From the result, the equation for bounce rate becomes as below.

bouncerate = 51.81 + (-4.48)avgServerResponseTime + (18.08)avgServerConnectionTime + (-3.47)avgRedirectionTime + (0.95)avgPageDownloadTime + (15.44)avgDomainLookupTime + (0.09)avgPageLoadTime

As we can see from the equation, avgServerConnectionTime has the biggest impact on bounce rate. If we look at my initial model (Model_1) in the second blog, the most impactful parameter was avgDomainLookupTime. We can conclude that avgServerConnectionTime impacts bounce rate more than avgDomainLookupTime does.
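As a sketch, the ranking by impact can also be read programmatically from the coefficient estimates. The values below are copied from the summary output above; with the fitted model at hand, coef(Model_2)[-1] would give the same vector:

```r
# Rank the independent variables by the magnitude of their coefficients.
coefs <- c(avgServerResponseTime   = -4.47623,
           avgServerConnectionTime = 18.08111,
           avgRedirectionTime      = -3.47483,
           avgPageDownloadTime     =  0.95150,
           avgDomainLookupTime     = 15.44193,
           avgPageLoadTime         =  0.08968)
names(sort(abs(coefs), decreasing = TRUE))  # avgServerConnectionTime comes first
```

Note that comparing raw coefficients like this is only meaningful when the variables are on similar scales; otherwise they should be standardized first.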

Finally, we have identified the relationships between bounce rate and the time components. In the next blog, we will discuss regression with the Google Prediction API and prediction of the bounce rate.

