Welcome to the second part. In the last blog post on Linear Regression with R, we discussed what regression is and how it is used. Now we will apply that learning to a specific prediction problem. In this post, I will create a basic model to predict bounce rate as a function of page load time components. In the next post, I will share how to improve the model to get better predictions.

We know that bounce rate is important for a web site. Here, we want to identify the relationships between bounce rate and the time components of a web page (e.g. average page download time, average page load time, average server response time) and how much these time components affect bounce rate. For this problem, we have collected data on various web sites from Google Analytics. The data set contains the following parameters.

- x_id – ID of the page
- ismobile – whether the page was visited from a mobile device
- Country
- pagePath
- pageTitle
- avgServerResponseTime
- avgServerConnectionTime
- avgRedirectionTime
- avgPageDownloadTime
- avgDomainLookupTime
- avgPageLoadTime
- entrances
- pageviews
- exits
- bounces

Each parameter is tracked for a single page. We have 8488 rows in the data set, and we calculated the bounce rate for each page as follows.

Bounce rate = (bounces / entrances)*100
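As a minimal sketch, the formula above could be applied in R as follows. The data frame name `pages` and its values are hypothetical, and the guard against pages with zero entrances is an assumption of this sketch, not something described for the original data set.

```r
# Hypothetical toy data; real values would come from the Google Analytics export.
pages <- data.frame(
  entrances = c(200, 50, 0),
  bounces   = c(90, 10, 0)
)

# Bounce rate as a percentage of entrances, per page.
# Assumption: pages with no entrances get NA instead of a division by zero.
pages$bouncerate <- ifelse(
  pages$entrances > 0,
  (pages$bounces / pages$entrances) * 100,
  NA_real_
)

print(pages)
```

Rows with `NA` would be dropped by `lm()` under its default missing-value handling, so it is worth checking how many there are before fitting.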

Here, we want to know the impact of *average server response time*, *average server connection time*, *average redirection time*, *average domain lookup time*, *average page download time* and *average page load time* on the bounce rate. So we rearranged the data set: we removed *x_id*, *country*, *page path*, *page title*, *entrances*, *page views*, *exits* and *bounces*, and appended *bouncerate* after calculating it. The data set now contains the following parameters.

- bouncerate
- avgServerResponseTime
- avgServerConnectionTime
- avgRedirectionTime
- avgPageDownloadTime
- avgDomainLookupTime
- avgPageLoadTime
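The rearrangement described above can be sketched in R like this. The names `pages`, `time_cols` and `model_data` are hypothetical, and the two rows are made-up values used only to make the example runnable.

```r
# Hypothetical two-row stand-in for the Google Analytics export.
pages <- data.frame(
  x_id                    = c(1, 2),
  pagePath                = c("/home", "/pricing"),
  avgServerResponseTime   = c(0.4, 1.2),
  avgServerConnectionTime = c(0.05, 0.10),
  avgRedirectionTime      = c(0.0, 0.3),
  avgPageDownloadTime     = c(0.2, 0.6),
  avgDomainLookupTime     = c(0.02, 0.08),
  avgPageLoadTime         = c(3.1, 7.4),
  entrances               = c(120, 40),
  bounces                 = c(54, 30)
)

# The six time components we want to keep as predictors.
time_cols <- c(
  "avgServerResponseTime", "avgServerConnectionTime", "avgRedirectionTime",
  "avgPageDownloadTime", "avgDomainLookupTime", "avgPageLoadTime"
)

# Compute the bounce rate, then keep only bouncerate and the time components.
pages$bouncerate <- (pages$bounces / pages$entrances) * 100
model_data <- pages[, c("bouncerate", time_cols)]

str(model_data)
```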

Let’s use regression on this data set. In this problem, we want to identify how the bounce rate depends on the time components. So we will treat *bouncerate* as the dependent variable and the rest of the parameters in the data set as independent variables. The regression model for our data set in R is as follows.

> model_1 <- lm(bouncerate ~ avgServerResponseTime + avgServerConnectionTime + avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime + avgPageLoadTime)
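The call above looks up the variables in the current workspace. As a hedged alternative, the sketch below passes a data frame through `lm()`'s `data` argument and uses the `.` shorthand for "all other columns". The data here is simulated and the names `model_data`, `fit_explicit` and `fit_dot` are hypothetical; only two of the six predictors are used to keep it short.

```r
set.seed(42)  # reproducible toy data

# Simulated stand-in for the real data set (values are made up).
n <- 200
model_data <- data.frame(
  avgServerResponseTime = runif(n, 0, 3),
  avgPageLoadTime       = runif(n, 0, 20)
)
model_data$bouncerate <- 50 + 0.5 * model_data$avgPageLoadTime + rnorm(n, sd = 10)

# `data =` tells lm() where to find the variables named in the formula.
fit_explicit <- lm(bouncerate ~ avgServerResponseTime + avgPageLoadTime,
                   data = model_data)

# `.` means "every column in data except the response".
fit_dot <- lm(bouncerate ~ ., data = model_data)

# Both calls describe the same model, so the coefficients agree.
print(all.equal(coef(fit_explicit), coef(fit_dot)))
```

Passing `data` explicitly avoids depending on whichever variables happen to be in the workspace, which makes the model easier to re-run on a cleaned data set later.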

We have fitted the model, but we are interested in the relationships between bounce rate and the time components. Let’s check the summary of the model.

>summary(model_1)

Output

    Call:
    lm(formula = bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
        avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
        avgPageLoadTime)

    Residuals:
        Min      1Q  Median      3Q     Max
    -98.276 -19.816  -1.169  19.805 107.705

    Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
    (Intercept)             49.10686    0.32862 149.435  < 2e-16 ***
    avgServerResponseTime   -0.85724    0.17154  -4.997 5.93e-07 ***
    avgServerConnectionTime  2.02335    0.55566   3.641 0.000273 ***
    avgRedirectionTime      -0.37822    0.06368  -5.939 2.97e-09 ***
    avgPageDownloadTime      0.31975    0.12172   2.627 0.008631 **
    avgDomainLookupTime      4.14929    0.88525   4.687 2.81e-06 ***
    avgPageLoadTime          0.04684    0.01896   2.470 0.013528 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 26.74 on 8481 degrees of freedom
    Multiple R-squared:  0.01339,  Adjusted R-squared:  0.0127
    F-statistic: 19.19 on 6 and 8481 DF,  p-value: < 2.2e-16
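The same numbers can also be pulled out of the summary object programmatically instead of being read off the printout. The sketch below uses a small simulated model, since the original data set is not available here; the names `toy`, `fit` and `s` are hypothetical.

```r
set.seed(1)  # reproducible toy data

# Simulated data with one predictor, only to have a model to summarise.
n <- 100
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(n)

fit <- lm(y ~ x, data = toy)
s <- summary(fit)

# Coefficient table: a matrix with one row per term.
coef_table <- coef(s)
print(colnames(coef_table))      # Estimate, Std. Error, t value, Pr(>|t|)
print(coef_table[, "Estimate"])  # the fitted coefficients

# Goodness-of-fit figures shown at the bottom of the printed summary.
print(s$r.squared)      # Multiple R-squared
print(s$adj.r.squared)  # Adjusted R-squared
print(s$sigma)          # Residual standard error
```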

Let’s understand the result. The coefficients are shown in the *Estimate* column. So, the equation for bounce rate becomes as below.

*bouncerate = 49.107 + (-0.86)avgServerResponseTime + (2.02)avgServerConnectionTime + (-0.38)avgRedirectionTime + (0.32)avgPageDownloadTime + (4.15)avgDomainLookupTime + (0.05)avgPageLoadTime*

As we can see from the equation, *avgDomainLookupTime* has the largest coefficient, so it has the largest estimated per-unit impact on bounce rate. If *avgDomainLookupTime* increases by 1 unit while the other time components stay fixed, the predicted bounce rate increases by about 4.15 percentage points. With this, we have identified the relationship between bounce rate and the time components of a web page using regression.
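To see how the equation is used, here is a small worked sketch in R. The coefficients are copied from the summary output; the page timings are hypothetical values chosen for illustration, in whatever time units the data set uses.

```r
# Coefficients copied from the model summary.
coefs <- c(
  "(Intercept)"           = 49.10686,
  avgServerResponseTime   = -0.85724,
  avgServerConnectionTime = 2.02335,
  avgRedirectionTime      = -0.37822,
  avgPageDownloadTime     = 0.31975,
  avgDomainLookupTime     = 4.14929,
  avgPageLoadTime         = 0.04684
)

# Hypothetical page timings (illustrative values, not from the data set).
page <- c(
  avgServerResponseTime   = 0.5,
  avgServerConnectionTime = 0.1,
  avgRedirectionTime      = 0.2,
  avgPageDownloadTime     = 0.3,
  avgDomainLookupTime     = 0.05,
  avgPageLoadTime         = 4
)

# Intercept plus the sum of coefficient * value for each time component.
predicted <- coefs[["(Intercept)"]] + sum(coefs[names(page)] * page)
print(predicted)
```

With a fitted model object available, `predict()` on a one-row data frame would do the same arithmetic; writing it out by hand just makes the role of each coefficient visible.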

However, we cannot say that the relationships estimated by this regression model (model_1) are perfect. The coefficients are estimated by fitting the model to the data set (i.e. the model learns from the data and then estimates the coefficient values), and the data set may contain some unreliable observations. The model needs to be improved so that we can identify the relationships between bounce rate and the time components more precisely. In the next post, we will discuss how to improve the model and look at the summary of the improved model.

Hi Amar,

Thank you, the article is very clear in its structure and meaning. I’m glad to see this kind of practical article.

I have just one comment on the possible outputs of the model. From “Multiple R-squared: 0.01339” one can see that the model explains only 1.34% of the variance in the Y variable (bounce rate). I understand that this was just an example data set, but it seems there is something more that can impact the bounce rate metric 🙂 Still, it is a great step, and next one could try another model with additional numerical metrics (hour of the day, day of the week) or categorical ones (traffic medium).

Regards,

Pavel Jasek

Hi Pavel,

Thanks for your comment.

The above example was for educational purposes only, to show how we can build a model using a GA data set and which variables we can predict or estimate using a regression model in R.

I agree with your suggestion. Next time I will definitely use the additional variables you have suggested alongside the existing ones.

If you ever have more ideas for building predictive models based on GA data sets, please let me know. I would love to create them.

Regards,

Amar