Predict User’s Return Visit Within a Day: Part 1

In my earlier post, I discussed what logistic regression is and how a logistic model is built in R. Now we will apply that to a specific prediction problem. In this post, I will create a basic model to predict whether a user will return to a website within the next 24 hours. The answer depends on both user and website characteristics, but here we will predict from a few measures (i.e. the user’s visits, landing page path, exit page path, etc.). The predicted outcome is 1 or 0, where 1 stands for “Yes” and 0 stands for “No”. Let’s discuss the data set used to build the logistic regression model.

For this problem, we collected a website’s data from Google Analytics. The data set contains the following parameters.

  1. visitor_ID
  2. visitCount
  3. daysSinceLastVisit
  4. medium
  5. landingPagePath
  6. exitPagePath
  7. pageDepth

Let’s understand the parameters first. The first parameter is visitor_ID, which is the ID of the visitor. The second is visitCount, which increases with each visit (i.e. for a particular visitor, visitCount is 1 on the first visit to the site, 2 on the second visit, and so on). The third is daysSinceLastVisit, which contains the number of days between two consecutive visits. The fourth is medium, which contains categorical values (i.e. organic, referral, etc.). The fifth is landingPagePath, a string that represents the user’s entrance page for each visit. The sixth is exitPagePath, a string that represents the user’s exit page for each visit. The last is pageDepth, which represents how many pages the user viewed during a single visit.

Our goal is to predict whether a user will return to the website within the next 24 hours. From the collected data, we can say that a user has come back if his visitCount is more than 1 and his daysSinceLastVisit is less than or equal to 1. Based on this criterion, we generated a new variable named revisit, which contains the value “1” or “0” for each user. “1” indicates that the user came back and “0” indicates that the user did not. This variable (revisit) is the dependent variable, and visitCount, daysSinceLastVisit, medium, landingPagePath, exitPagePath and pageDepth are the independent variables. Let’s generate the model.
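The rule for revisit can be sketched in a few lines of R. This is only an illustration on made-up rows: the data frame name visits and its values are assumptions, and only the column names come from the parameter list.

```r
# Toy rows; column names follow the parameter list, values are invented.
visits <- data.frame(
  visitor_ID         = c("a", "a", "b", "c", "c"),
  visitCount         = c(1, 2, 1, 3, 4),
  daysSinceLastVisit = c(0, 1, 0, 5, 0)
)

# revisit is 1 when this is not the first visit AND it happened
# within one day of the previous visit; otherwise it is 0.
visits$revisit <- as.integer(visits$visitCount > 1 &
                             visits$daysSinceLastVisit <= 1)

print(visits)
```

Only the second and fifth rows satisfy both conditions, so only they are flagged as return visits.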

Before generating the model, let us discuss one issue: the data set contains categorical variables, so how do we deal with them? In the blog Linear Regression using R, only numeric values were considered, but in this logistic regression we need to include categorical values. There are many solutions for this issue; I have used dummy variable coding.

In dummy coding, a variable with K categories is usually entered into a regression as a sequence of K-1 dummy variables. Our data set has three categorical variables: medium, landingPagePath and exitPagePath. They contain 14, 102 and 167 categories, respectively. In R we generally do not append the dummy variables to the data by hand; instead, we create a contrast matrix for each categorical variable. I have done dummy coding for our categorical variables. We will not go into the details of the coding scheme, because one blog is not enough to explain dummy coding. We will deal only with our actual prediction problem.
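To make the K-1 idea concrete, here is a small sketch using R's default treatment contrasts. The levels "organic" and "referral" are mentioned in this post; "direct" and "cpc" are assumed purely for illustration, and the real medium variable has 14 levels.

```r
# "organic" and "referral" come from the post; "direct" and "cpc" are assumed.
medium <- c("organic", "referral", "direct", "cpc", "organic", "direct")
f.medium <- factor(medium)

# Contrast matrix: one row per level, K-1 dummy columns.
# The first level (alphabetically) acts as the reference category.
cm <- contrasts(f.medium)
print(cm)

# model.matrix() shows the columns a regression would actually receive:
# an intercept plus the K-1 dummy columns.
mm <- model.matrix(~ f.medium)
print(mm)
```

With four levels, the contrast matrix has four rows and three columns, so the factor enters the model as three 0/1 predictors.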

Let’s generate the regression model from the data set. The R code for our model is below.

>Model_1 <- glm(revisit ~ DaySinceLastVisit + visitCount + f.medium + f.landingPagePath + f.exitPagepath + pageDepth, data = data, family = binomial("logit"))

I am not going to show the summary of the model here, because it is too large to view. The question then is how to judge the model’s effectiveness without the summary. I have chosen an alternative measure: the accuracy of the model. Accuracy is the percentage of observations for which the model’s prediction matches the actual value. Following is the R code snippet to calculate the accuracy of our model.

>confusion_matrix <- ftable(actual_revisit, predicted_revisit)
>accuracy <- sum(diag(confusion_matrix))/2555*100   # 2555 is the total number of observations
>accuracy
Output
88.21918
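The snippet starts from actual_revisit and predicted_revisit without showing where they come from. One plausible way to obtain them is sketched below on simulated data. The toy data frame, the 0.5 cut-off and the use of table() in place of ftable() are assumptions of this sketch, not taken from the original code.

```r
set.seed(1)

# Simulated stand-in data; the real data set is not reproduced here.
n <- 200
toy <- data.frame(
  visitCount = sample(1:10, n, replace = TRUE),
  pageDepth  = sample(1:8,  n, replace = TRUE)
)
# Simulated outcome that loosely depends on visitCount.
toy$revisit <- rbinom(n, 1, plogis(-2 + 0.4 * toy$visitCount))

fit <- glm(revisit ~ visitCount + pageDepth,
           data = toy, family = binomial("logit"))

# Predicted probabilities, turned into classes with an assumed 0.5 cut-off.
prob <- predict(fit, type = "response")
predicted_revisit <- factor(as.integer(prob > 0.5), levels = c(0, 1))
actual_revisit    <- factor(toy$revisit, levels = c(0, 1))

# Fixed factor levels keep the table 2 x 2 even if one class is never predicted.
confusion_matrix <- table(actual_revisit, predicted_revisit)

# Correct predictions sit on the diagonal; divide by the total count.
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix) * 100
print(confusion_matrix)
print(accuracy)
```

Dividing by sum(confusion_matrix) instead of a hard-coded count keeps the calculation valid if the number of rows changes.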

In the R code for the accuracy calculation, I used the ftable() function to generate the confusion matrix from which the accuracy is calculated. We will not discuss the confusion matrix in detail here, because it is out of the scope of this blog; for more detail, refer to the wiki page on the confusion matrix. From the output, we can see that the accuracy of our model is 88.22%, which is good for us. But we can increase the accuracy by improving the model, and a model with accuracy above 95% would give more reliable predictions. So before we generate any predictions, we will first improve the model and then predict using the improved model. If you want to do the same exercise, Click here for R code and sample data set. In the next blog, we will discuss model improvement and check the accuracy of the improved model.

Predict User's Return Visit within a day part-1 by
Amar Gondaliya

Amar is a data modeling engineer at Tatvic. He focuses on building predictive models from available data using R, Hadoop and the Google Prediction API.

  • http://www.facebook.com/kantibanerjeee Kanti Banerjee

    a bit confused about categorical variables…

    • Amar Gondaliya

      Hi Kanti,
      I think you are confused by the dummy variable coding of categorical variables.
      Let me clear that up. In regression analysis, numerical variables are easy to use, but categorical variables cannot be used directly. They are first transformed into numeric form so that they can enter the regression. There are many ways to do that; I have used dummy variable coding in the provided code. If a categorical variable contains K categories, then K-1 dummy variables are created and entered into the regression analysis. These K-1 dummy variables are numeric, so they can easily be used in regression.
      Let me explain with an example. Suppose a categorical variable performance contains four categories: “poor”, “average”, “good” and “excellent”. Then three new dummy variables are created, called performance1, performance2 and performance3. These dummy variables are numeric and ready to use in regression. R handles this through the contrast matrix (download the provided code).
      I hope this clears up how categorical variables are handled. If you still have any questions, let me know.

      Regards,
      Amar

  • majom

    I like your post and that you provide a lot of detail on the topic you cover. I would recommend looking into survival models (also called hazard models), which seem to fit the context of your analysis even better than logistic regression. Discrete hazard models can even be fitted with the “glm” command, or you can go for the Cox model and use one of the many handy packages, such as “survival”.