Welcome to the third part of this series. In the previous post, we discussed the relationships between bounce rate and the components of page load time. We also fitted a regression model to identify those relationships and discussed why the model needs to be improved.

In this post, I will discuss the steps for improving the existing model so that it predicts bounce rate from the page load time components more accurately. The first step is variable selection (i.e. choosing the independent variables) and the second step is outlier detection. Both steps are essential for model improvement. We will discuss them one by one.

Variable selection is a crucial part of model improvement, because the independent variables play an important role in developing the best model. We always need to identify which variables are important for the model and which are not. Two common methods for variable selection are stepwise selection and all subsets regression. Let’s discuss them one by one.

## Stepwise selection

In stepwise selection, variables are added to or deleted from the model one at a time until some stopping criterion is reached. For example, in forward stepwise selection we add independent variables to the model one at a time, stopping when adding further variables no longer improves the model.
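Forward selection as described above can be tried with a minimal sketch. It assumes R's built-in mtcars data rather than the bounce-rate data set from this post, and it uses base R's step() function so that no extra package is needed. The predictors chosen here are purely illustrative.

```r
# Forward stepwise selection on the built-in mtcars data (illustrative only;
# this is NOT the bounce-rate data set used in this post).
null_model <- lm(mpg ~ 1, data = mtcars)        # start from the intercept-only model

forward_fit <- step(null_model,
                    scope = list(lower = ~ 1,
                                 upper = ~ wt + hp + disp + qsec),  # candidate predictors
                    direction = "forward",
                    trace = 0)                   # trace = 0 hides the step-by-step log

# Variables that were added, and the AIC before and after selection
selected <- attr(terms(forward_fit), "term.labels")
print(selected)
print(c(start = extractAIC(null_model)[2], final = extractAIC(forward_fit)[2]))
```

Each step adds the single variable that lowers the AIC the most, and the search stops once no remaining candidate lowers it further.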

In backward stepwise selection, we start with a model that includes all independent variables and then delete them one at a time until removing a variable would degrade the quality of the model. R provides the MASS package, whose stepAIC() function performs stepwise selection. I have used backward stepwise selection as below.

    > library(MASS)
    > stepAIC(Model_1, direction="backward")

Output

    Start:  AIC=55790.44
    bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
        avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
        avgPageLoadTime

                              Df Sum of Sq     RSS   AIC
    <none>                                 6062418 55790
    - avgPageLoadTime          1    4361.4 6066779 55795
    - avgPageDownloadTime      1    4932.9 6067351 55795
    - avgServerConnectionTime  1    9478.1 6071896 55802
    - avgDomainLookupTime      1   15704.0 6078122 55810
    - avgServerResponseTime    1   17852.1 6080270 55813
    - avgRedirectionTime       1   25216.4 6087634 55824

    Call:
    lm(formula = bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
        avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
        avgPageLoadTime)

    Coefficients:
                (Intercept)    avgServerResponseTime  avgServerConnectionTime
                   49.10686                 -0.85724                  2.02335
         avgRedirectionTime      avgPageDownloadTime      avgDomainLookupTime
                   -0.37822                  0.31975                  4.14929
            avgPageLoadTime
                    0.04684

As we can see from the result, there is a term called AIC (Akaike Information Criterion). AIC is used as the stopping criterion for variable selection. The general rule is: the lower the AIC, the better the model. Here, we used backward selection and no variable was removed. Every row in the table shows a higher AIC than the current model (55790), which tells us that removing any variable would increase the AIC and therefore would not improve the model we currently have.
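For a linear model, the AIC that step() and stepAIC() compare is computed from the residual sum of squares (the RSS column above) as n * log(RSS / n) + 2 * k, where n is the number of rows and k the number of estimated coefficients. The sketch below checks this on the built-in mtcars data, which is an assumption made for illustration and not the data from this post. Note that this value differs from what AIC() returns by an additive constant, which does not matter when comparing models fitted to the same data.

```r
# Illustration on the built-in mtcars data, not the post's data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)   # residual sum of squares (the RSS column)
k   <- length(coef(fit))       # number of estimated coefficients, intercept included

aic_by_hand  <- n * log(rss / n) + 2 * k
aic_reported <- extractAIC(fit)[2]   # the value step() and stepAIC() compare

print(c(by_hand = aic_by_hand, reported = aic_reported))
```

Dropping a variable lowers k by one but raises the RSS, so the AIC only falls when the variable contributes very little to the fit.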

## All subsets regression

In all subsets regression, every possible model is inspected. It is performed using the regsubsets() function from the leaps package. We can keep the best n models of each size by setting nbest (e.g. nbest=2 or nbest=3), and we can choose R-squared or adjusted R-squared as the criterion for the best model. I have used adjusted R-squared and nbest=2 as below.

    > library(leaps)
    > leaps <- regsubsets(bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
    +     avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
    +     avgPageLoadTime, data=data, nbest=2)
    > plot(leaps, scale="adjr2")

When we plot the result, we see that the maximum value of adjusted R-squared is reached when all the variables are selected. This is shown in the plot below.

In the plot, we can see that the minimum value of adjusted R-squared is 0.003, and only two corresponding terms (i.e. the intercept and avgDomainLookupTime) are marked in black. The maximum value of adjusted R-squared is 0.013, and all corresponding variables are marked in black. This means all the variables should be selected.
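The idea of inspecting every possible model can also be sketched in base R by brute force. This is only a conceptual illustration on the built-in mtcars data with four arbitrarily chosen predictors. It is not how the leaps package works internally and it is not the bounce-rate data from this post.

```r
# Brute-force all subsets search in base R on the built-in mtcars data.
predictors <- c("wt", "hp", "disp", "qsec")

# Every non-empty combination of predictors: 2^4 - 1 = 15 candidate models
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate model and record its adjusted R-squared
adj_r2 <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
})

results <- data.frame(
  model  = sapply(subsets, paste, collapse = " + "),
  adj_r2 = round(adj_r2, 4)
)
results <- results[order(-results$adj_r2), ]   # best models first
print(head(results, 3))
```

The row at the top of the sorted table plays the same role as the top row of the regsubsets() plot: it is the subset with the highest adjusted R-squared.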

After variable selection, we can conclude that all independent variables in the current model are important and have an impact on bounce rate. Let’s move to the second step of model improvement.

Now we have to detect outliers, because outliers degrade the quality of the model, and the quality of the model affects how well we can identify the relationships between bounce rate and the page load time components. To detect outliers, we first take a summary of the data. The summary gives the minimum, maximum, mean, median and quartile values of every variable in the data. The summary of the data is as below.

    > summary(data)

Output

       bouncerate     avgServerResponseTime avgServerConnectionTime avgRedirectionTime
     Min.   :  0.00   Min.   :  0.00000     Min.   : 0.00000        Min.   :0.000e+00
     1st Qu.: 29.09   1st Qu.:  0.08771     1st Qu.: 0.00000        1st Qu.:0.000e+00
     Median : 47.97   Median :  0.21517     Median : 0.00904        Median :1.687e-03
     Mean   : 49.30   Mean   :  0.63631     Mean   : 0.08004        Mean   :5.147e-01
     3rd Qu.: 69.23   3rd Qu.:  0.74684     3rd Qu.: 0.05492        3rd Qu.:6.270e-02
     Max.   :100.00   Max.   :110.49100     Max.   :26.58582        Max.   :1.757e+02
     avgPageDownloadTime avgDomainLookupTime  avgPageLoadTime
     Min.   : 0.00000    Min.   : 0.000000    Min.   :  0.004
     1st Qu.: 0.04575    1st Qu.: 0.000000    1st Qu.:  1.994
     Median : 0.21200    Median : 0.000000    Median :  4.105
     Mean   : 0.73316    Mean   : 0.041300    Mean   :  7.886
     3rd Qu.: 0.64848    3rd Qu.: 0.001388    3rd Qu.:  7.973
     Max.   :93.17600    Max.   :20.184500    Max.   :485.234

When we check the maximum value of each variable, we find that for many variables (avgServerResponseTime, avgPageLoadTime, etc.) the maximum is far out of proportion to the other values (i.e. mean, median, etc.). For example, for the variable *avgPageLoadTime* there is a big difference between the maximum (485.234) and the mean (7.886). The maximum is far larger than the mean, which suggests that the variable has outliers. So we need to check the histogram for this variable to see the frequency distribution of its values. I have plotted the histogram for avgPageLoadTime as below.

From the histogram, we can see that values between 0 and 50 occur with high frequency, while the other values occur very rarely (i.e. once or twice). These rare values are outliers, and they reduce the quality of the regression model, so we have to remove them. R provides the subset() function, which we can use to filter them out. After plotting a histogram for each variable, I used the subset() function as below.

    > # bouncerate != 0 is what the original bare data$bouncerate condition evaluates to
    > data_frame_hist <- subset(data, bouncerate != 0 & avgServerResponseTime < 10 &
    +     avgServerConnectionTime < 2.5 & avgPageDownloadTime < 10 &
    +     avgRedirectionTime < 10 & avgDomainLookupTime < 2.5 &
    +     avgPageLoadTime < 50)
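To see what this kind of filter does before applying it to real data, here is a small sketch on synthetic values. The data frame toy, its three extreme values and the random seed are all made up for illustration; only the column names and the cut-off of 50 are borrowed from this post.

```r
set.seed(42)
# Synthetic, hypothetical load times: mostly small values plus three extreme ones
toy <- data.frame(
  avgPageLoadTime = c(rexp(500, rate = 1 / 6), 180, 320, 485),
  bouncerate      = runif(503, min = 0, max = 100)
)

# Bin counts without drawing the plot; sparse far-right bins hint at outliers
h <- hist(toy$avgPageLoadTime, breaks = seq(0, 500, by = 50), plot = FALSE)
print(data.frame(bin_start = head(h$breaks, -1), count = h$counts))

# Keep only the rows below the cut-off, as subset() does in the post
toy_trimmed <- subset(toy, avgPageLoadTime < 50)
print(c(before = nrow(toy), after = nrow(toy_trimmed)))
```

Almost all rows fall in the first bin, and the few rows in the far-right bins are the ones the cut-off removes.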

I have set a constraint for each variable and generated a new data set. Let’s build a second model based on the new data set and check its summary.

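    > # data = data_frame_hist fits the model on the filtered data explicitly
    > Model_2 <- lm(bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
    +     avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
    +     avgPageLoadTime, data = data_frame_hist)
    > summary(Model_2)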

Output

    Call:
    lm(formula = bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
        avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
        avgPageLoadTime)

    Residuals:
        Min      1Q  Median      3Q     Max
    -86.845 -18.834  -1.784  18.245  90.590

    Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
    (Intercept)             51.81372    0.40375 128.331  < 2e-16 ***
    avgServerResponseTime   -4.47623    0.34155 -13.106  < 2e-16 ***
    avgServerConnectionTime 18.08111    1.77633  10.179  < 2e-16 ***
    avgRedirectionTime      -3.47483    0.37124  -9.360  < 2e-16 ***
    avgPageDownloadTime      0.95150    0.33956   2.802  0.00509 **
    avgDomainLookupTime     15.44193    2.15348   7.171 8.14e-13 ***
    avgPageLoadTime          0.08968    0.05385   1.665  0.09589 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 24.83 on 7892 degrees of freedom
    Multiple R-squared:  0.0522,    Adjusted R-squared:  0.05148
    F-statistic: 72.44 on 6 and 7892 DF,  p-value: < 2.2e-16

Now we have the result of the improved model. From the result, the equation for bounce rate becomes as below.

*bouncerate = 51.81 + (-4.48)avgServerResponseTime + (18.08)avgServerConnectionTime + (-3.47)avgRedirectionTime + (0.95)avgPageDownloadTime + (15.44)avgDomainLookupTime + (0.09)avgPageLoadTime*
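As a rough illustration of how this equation is used, the sketch below plugs one set of timings into it. The coefficients are copied from the Model_2 summary above, while the timings in page are hypothetical values chosen near the medians of the data summary, in the same units as the data set.

```r
# Coefficients copied from the Model_2 summary above
coefs <- c(
  "(Intercept)"           = 51.81372,
  avgServerResponseTime   = -4.47623,
  avgServerConnectionTime = 18.08111,
  avgRedirectionTime      = -3.47483,
  avgPageDownloadTime     =  0.95150,
  avgDomainLookupTime     = 15.44193,
  avgPageLoadTime         =  0.08968
)

# Hypothetical page timings (assumed values, not taken from the data)
page <- c(
  avgServerResponseTime   = 0.2,
  avgServerConnectionTime = 0.01,
  avgRedirectionTime      = 0,
  avgPageDownloadTime     = 0.2,
  avgDomainLookupTime     = 0,
  avgPageLoadTime         = 4
)

# Intercept plus the sum of coefficient * value for each time component
predicted <- coefs[["(Intercept)"]] + sum(coefs[names(page)] * page)
print(round(predicted, 2))
```

For these assumed timings the predicted bounce rate comes out at about 51.65. With the fitted model object available, predict(Model_2, newdata = ...) would do the same calculation without copying coefficients by hand.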

As we can see from the equation, *avgServerConnectionTime* has the largest coefficient (18.08) and therefore the largest impact on bounce rate. In my initial model (Model_1) in the second post, the most impactful parameter was *avgDomainLookupTime*. We can conclude that *avgServerConnectionTime* impacts bounce rate more than *avgDomainLookupTime* does.

Finally, we have identified the relationships between bounce rate and the time components. In the next post, we will discuss regression with the Google Prediction API and prediction of the bounce rate.



#### Amar Gondaliya

Amar is a data modeling engineer at Tatvic. He focuses on building predictive models from available data using R, Hadoop and the Google Prediction API.