Improving Bounce Rate Prediction Model for Google Analytics Data

By Amar Gondaliya | 15.09.2012

Welcome to the third part of this series. In the previous post, we discussed the relationships between bounce rate and the components of page load time. We also fitted a regression model to identify those relationships and discussed why the model needs to be improved.

In this post, I will discuss the steps for improving the existing model so that it predicts bounce rate more accurately from the components of page load time. The first step is variable selection (i.e. choosing the independent variables) and the second step is outlier detection. Both steps are essential for model improvement, and we will discuss them one by one.

Variable selection is a crucial part of model improvement, because the independent variables play an important role in developing the best model. We always need to identify which variables are important for the model and which are not. There are two methods for variable selection: stepwise selection and all subset regression. Let’s discuss them one by one.

Stepwise selection

In stepwise selection, variables are added to or deleted from a model one at a time until some stopping criterion is reached. For example, in forward stepwise selection we add independent variables to the model one at a time, stopping when adding further variables would no longer improve the model.

In backward stepwise selection we start with a model that includes all independent variables, and then delete them one at a time until removing a variable would degrade the quality of the model. R provides the MASS package, which performs stepwise selection with the stepAIC() function. I have used backward stepwise selection as below.

>library(MASS)
>stepAIC(Model_1, direction="backward")
Output
Start:  AIC=55790.44
bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
    avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
    avgPageLoadTime

                          Df Sum of Sq     RSS   AIC
<none>                                 6062418 55790
- avgPageLoadTime          1    4361.4 6066779 55795
- avgPageDownloadTime      1    4932.9 6067351 55795
- avgServerConnectionTime  1    9478.1 6071896 55802
- avgDomainLookupTime      1   15704.0 6078122 55810
- avgServerResponseTime    1   17852.1 6080270 55813
- avgRedirectionTime       1   25216.4 6087634 55824

Call:
lm(formula = bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
    avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
    avgPageLoadTime)

Coefficients:
            (Intercept)    avgServerResponseTime  avgServerConnectionTime
               49.10686                 -0.85724                  2.02335
     avgRedirectionTime      avgPageDownloadTime      avgDomainLookupTime
               -0.37822                  0.31975                  4.14929
        avgPageLoadTime
                0.04684

As we can see from the result, there is a term called AIC (Akaike Information Criterion), which is used as the stopping criterion for variable selection. The general rule is: the lower the AIC, the better the model. The first row of the table shows the AIC of the current model (55790), and each following row shows the AIC the model would have if that variable were removed. Here, backward selection removed no variables: every candidate removal raises the AIC above 55790, so dropping any variable would not improve the current model.
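To make the AIC column less of a black box, here is a minimal sketch of how the value can be reproduced by hand. It uses R's built-in mtcars data as a stand-in, since the Google Analytics data frame is not included in this post, and the names full_model, manual_aic and reduced_model are my own for illustration. For lm fits, extractAIC() (which stepAIC() relies on) computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients.

```r
library(MASS)

# Built-in mtcars data stands in for the Google Analytics data frame (assumption).
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# AIC as reported for lm fits: n * log(RSS / n) + 2 * k
n   <- nrow(mtcars)                   # number of observations
rss <- sum(residuals(full_model)^2)   # residual sum of squares
k   <- length(coef(full_model))       # coefficients, including the intercept
manual_aic <- n * log(rss / n) + 2 * k

# extractAIC() returns c(edf, AIC); its second element should equal manual_aic
extractAIC(full_model)

# Backward elimination: a term is dropped only if that lowers the AIC
reduced_model <- stepAIC(full_model, direction = "backward", trace = FALSE)
formula(reduced_model)
```

If formula(reduced_model) comes back identical to the starting formula, no removal lowered the AIC, which is the situation we saw with Model_1.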

All subset regression

In all subset regression, every possible model is inspected. All subset regression is performed using the regsubsets() function from the leaps package. The nbest argument sets how many of the best models to report for each subset size (nbest=2, 3, ...). We can choose R-squared or adjusted R-squared as the criterion for the best model. I have used adjusted R-squared and nbest=2, as below.

>library(leaps)
>leaps <-regsubsets(bouncerate ~ avgServerResponseTime + avgServerConnectionTime + avgRedirectionTime + avgPageDownloadTime +avgDomainLookupTime + avgPageLoadTime,data=data,nbest=2)
>plot(leaps, scale="adjr2")

The plot shows that the maximum value of adjusted R-squared is reached only when all the variables are selected. This is shown in the plot below.

In the plot, we can see that the minimum value of adjusted R-squared is 0.003, and only two corresponding terms (the intercept and avgDomainLookupTime) are marked in black. The maximum value of adjusted R-squared is 0.013, and all corresponding variables are marked in black, which means all the variables should be selected.

After variable selection, we can conclude that all independent variables in the current model are important and have an impact on bounce rate. Let’s move to the second step of model improvement.
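The same information can also be read off numerically instead of from the plot. The sketch below is only an illustration: it again uses the built-in mtcars data rather than the Google Analytics data, and the names subsets, subset_summary and best are hypothetical. It relies on summary() of a regsubsets object, which returns the adjusted R-squared of every inspected model (adjr2) and a logical matrix of the terms each model contains (which).

```r
library(leaps)

# Built-in mtcars data stands in for the Google Analytics data frame (assumption).
subsets <- regsubsets(mpg ~ wt + hp + qsec + drat, data = mtcars, nbest = 2)
subset_summary <- summary(subsets)

# Adjusted R-squared of every model that was kept (up to nbest per subset size)
subset_summary$adjr2

# Row index of the model with the highest adjusted R-squared
best <- which.max(subset_summary$adjr2)

# TRUE marks the terms included in that model (the black cells in the plot)
subset_summary$which[best, ]
```

On the bounce rate data, the row selected this way would correspond to the top row of the plot, where every variable is marked.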

Now we have to detect outliers, because outliers degrade the quality of the model, and the quality of the model affects how well we can identify the relationships between bounce rate and the page load time components. To detect outliers, we first take a summary of the data. The summary gives the minimum, maximum, mean, median and quartile values of every variable. The summary of the data is as below.

>summary(data)
Output
   bouncerate     avgServerResponseTime avgServerConnectionTime avgRedirectionTime
 Min.   :  0.00   Min.   :  0.00000     Min.   : 0.00000        Min.   :0.000e+00
 1st Qu.: 29.09   1st Qu.:  0.08771     1st Qu.: 0.00000        1st Qu.:0.000e+00
 Median : 47.97   Median :  0.21517     Median : 0.00904        Median :1.687e-03
 Mean   : 49.30   Mean   :  0.63631     Mean   : 0.08004        Mean   :5.147e-01
 3rd Qu.: 69.23   3rd Qu.:  0.74684     3rd Qu.: 0.05492        3rd Qu.:6.270e-02
 Max.   :100.00   Max.   :110.49100     Max.   :26.58582        Max.   :1.757e+02
 avgPageDownloadTime avgDomainLookupTime avgPageLoadTime
 Min.   : 0.00000    Min.   : 0.000000   Min.   :  0.004
 1st Qu.: 0.04575    1st Qu.: 0.000000   1st Qu.:  1.994
 Median : 0.21200    Median : 0.000000   Median :  4.105
 Mean   : 0.73316    Mean   : 0.041300   Mean   :  7.886
 3rd Qu.: 0.64848    3rd Qu.: 0.001388   3rd Qu.:  7.973
 Max.   :93.17600    Max.   :20.184500   Max.   :485.234

When we check the maximum of each variable, we find that for many variables (avgServerResponseTime, avgPageLoadTime, etc.) the maximum is far out of line with the other values (mean, median, etc.). For example, for avgPageLoadTime there is a big difference between the maximum (485.234) and the mean (7.886). A maximum this much larger than the mean suggests that the variable has outliers. So we need to check the histogram for this variable to see the frequency distribution of its values. I have plotted the histogram for avgPageLoadTime as below.
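The plotting call itself is not shown in this post. As a minimal sketch of this kind of check, the code below builds a synthetic, right-skewed variable (made-up values, not the Google Analytics data) and inspects it with hist(). The name load_time, the three extreme values and the choice of roughly 50 bins are all assumptions for illustration.

```r
set.seed(42)

# Synthetic, right-skewed load times in seconds (assumption, not the real data),
# plus three extreme values appended by hand
load_time <- c(rexp(1000, rate = 1 / 8), 180, 320, 485)

# Draw the histogram; breaks = 50 is only a suggestion that hist() may adjust
h <- hist(load_time, breaks = 50,
          main = "avgPageLoadTime (synthetic)", xlab = "seconds")

# Bin counts make the long, nearly empty right tail explicit
data.frame(bin_start = head(h$breaks, -1), count = h$counts)

# How many values lie at or beyond a candidate cut-off of 50
sum(load_time >= 50)
```

Bins with a count of one or two far to the right of the bulk of the data are the kind of values treated as outliers in the rest of this post.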

From the histogram, we can see that values between 0 and 50 occur frequently, while the other values occur very rarely (i.e. once or twice). These rare values are outliers, and they reduce the quality of the regression model, so we have to remove them. R provides the subset() function, which we can use to keep only the rows that satisfy given conditions. After plotting a histogram for each variable, I have used the subset function as below.

>data_frame_hist <- subset(data, data$bouncerate & data$avgServerResponseTime<10 & data$avgServerConnectionTime<2.5 & data$avgPageDownloadTime<10 & data$avgRedirectionTime<10 & data$avgDomainLookupTime<2.5 & data$avgPageLoadTime<50)

I have set a constraint for every variable and generated a new data set. Let’s fit a second model based on the new data set and check its summary.

>Model_2<-lm(bouncerate ~ avgServerResponseTime + avgServerConnectionTime + avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime + avgPageLoadTime, data=data_frame_hist)
>summary(Model_2)
Output
Call:
lm(formula = bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
    avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
    avgPageLoadTime)

Residuals:
    Min      1Q  Median      3Q     Max
-86.845 -18.834  -1.784  18.245  90.590 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             51.81372    0.40375 128.331  < 2e-16 ***
avgServerResponseTime   -4.47623    0.34155 -13.106  < 2e-16 ***
avgServerConnectionTime 18.08111    1.77633  10.179  < 2e-16 ***
avgRedirectionTime      -3.47483    0.37124  -9.360  < 2e-16 ***
avgPageDownloadTime      0.95150    0.33956   2.802  0.00509 **
avgDomainLookupTime     15.44193    2.15348   7.171 8.14e-13 ***
avgPageLoadTime          0.08968    0.05385   1.665  0.09589 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 24.83 on 7892 degrees of freedom
Multiple R-squared: 0.0522,	Adjusted R-squared: 0.05148
F-statistic: 72.44 on 6 and 7892 DF,  p-value: < 2.2e-16

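Now we have the result of the improved model. Adjusted R-squared is 0.051, up from the 0.013 seen in the all subset regression plot, so the model fits better, although it still explains only a small share of the variation in bounce rate. From the result, the equation for bounce rate becomes as below.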

bouncerate = 51.81 + (-4.48)avgServerResponseTime + (18.08)avgServerConnectionTime + (-3.47)avgRedirectionTime + (0.95)avgPageDownloadTime + (15.44)avgDomainLookupTime + (0.09)avgPageLoadTime

As we can see from the equation, avgServerConnectionTime has the largest coefficient and therefore the largest per-unit impact on bounce rate. In my initial model (Model_1) from the second post, the most impactful variable was avgDomainLookupTime. With the outliers removed, we can conclude that avgServerConnectionTime impacts bounce rate more than avgDomainLookupTime.
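To see the fitted equation in use, here is a small worked example that plugs one set of timings into the Model_2 coefficients from the summary above. The timing values in page are hypothetical, chosen only for illustration, and the names coefs, page and predicted are my own.

```r
# Coefficients copied from the Model_2 summary output
coefs <- c(intercept               = 51.81372,
           avgServerResponseTime   = -4.47623,
           avgServerConnectionTime = 18.08111,
           avgRedirectionTime      = -3.47483,
           avgPageDownloadTime     =  0.95150,
           avgDomainLookupTime     = 15.44193,
           avgPageLoadTime         =  0.08968)

# Hypothetical timings for one page, in the same units as the data;
# the leading 1 multiplies the intercept
page <- c(intercept               = 1,
          avgServerResponseTime   = 0.2,
          avgServerConnectionTime = 0.01,
          avgRedirectionTime      = 0,
          avgPageDownloadTime     = 0.2,
          avgDomainLookupTime     = 0,
          avgPageLoadTime         = 4)

# Predicted bounce rate: intercept plus each coefficient times its timing
predicted <- sum(coefs * page)
predicted
```

With the fitted model object available, predict(Model_2, newdata = ...) on a one-row data frame with the same column names should give the same kind of result without copying coefficients by hand.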

Finally, we have identified the relationships between bounce rate and the time components. In the next post, we will discuss regression with the Google Prediction API and prediction of the bounce rate.

Amar is a data modeling engineer at Tatvic. He is focused on building predictive models from available data using R, Hadoop and the Google Prediction API.