Author Archives: Vignesh Prajapati


Vignesh Prajapati

About Vignesh Prajapati

Vignesh is a Data Engineer at Tatvic. He loves to play in the open-source playground, building predictive solutions on Big Data with R, Hadoop and the Google Prediction API.
Google Plus profile: Vignesh Prajapati

Google analytics data extraction in R

Unlike other posts on this blog, this post focuses on coding in R, so readers with a developer mindset will probably get more out of it than pure business analysts.

My goal is to describe an alternative way to extract data from Google Analytics into R via the API. I have been using R for quite some time, and the existing GA library for R has been broken; although an update was released, it is not really usable at the moment.

Considering this, I thought I would write the extraction code myself and move on, since more and more data-related operations are now being done in R.

Moreover, the RGoogleAnalytics package that is available is built for Linux only, and my Windows friends may, like me, want something that works for them as well.

OK, so let's get started; it's going to be very quick and easy.

There are some prerequisites for GA Data extraction in R:

  1. At least one domain must be registered with your Google Analytics account
  2. R must be installed, along with the packages used below: RCurl (for the API requests) and RJSONIO (for parsing the JSON responses)
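
The script below relies on those two packages: RCurl provides getURL() for the HTTPS requests and RJSONIO provides fromJSON() for parsing the responses. Load them up front:

# Install once if needed: install.packages(c("RCurl", "RJSONIO"))
library(RCurl)    # getURL() for requests to the GA APIs
library(RJSONIO)  # fromJSON() for parsing the JSON responses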

Steps for Google Analytics data extraction in R:

  1. Set the Google Analytics query parameters to prepare the request URI

To extract the Google Analytics data, first define the query parameters (dimensions, metrics, start date, end date, sort and filters) as per your requirement.

# Defining Google analytics search query parameters

# Set the dimensions and metrics
ga_dimensions <- 'ga:visitorType,ga:operatingSystem,ga:country'
ga_metrics <- 'ga:visits,ga:bounces,ga:avgTimeOnSite'

# Set the starting and ending date
startdate <- '2012-01-01'
enddate <- '2012-11-30'

# Set the segment, sort and filters
segment <- 'dynamic::ga:operatingSystem==Android'
sort <- 'ga:visits'
filters <- 'ga:visits>2'
  2. Get the access token from the OAuth 2.0 Playground

We will obtain the access token from the OAuth 2.0 Playground. The steps for generating it are:

  1. Go to the OAuth 2.0 Playground
  2. Select the Analytics API and click the Authorize APIs button, providing the credentials of the related account
  3. Generate the access token by clicking Exchange authorization code for tokens, and assign it to the access token variable in the R script, as shown below
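
In the script this is a single assignment (the token value below is a placeholder, not a real token):

# Paste the access token generated in the OAuth 2.0 Playground here
access_token <- 'ya29.XXXXXXXXXXXXXXXX'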

  3. Retrieve and select the profile

With the code below, you can retrieve the profiles registered under your Google Analytics account, and with them the related GA profile ids. Before retrieving profiles, ensure that the access token is set.

We retrieve the profiles by sending a request to the Management API with the access token as a parameter; it returns a JSON response. The steps below convert that response to a list and store it in the data frame object profiles.

# Request the GA profiles and store the JSON response in the GA.profiles.Json variable
GA.profiles.Json <- getURL(paste("https://www.googleapis.com/analytics/v3/management/accounts/~all/webproperties/~all/profiles?access_token=",access_token, sep="", collapse = ","))

# Convert the response variable GA.profiles.Json to a list
GA.profiles.List <- fromJSON(GA.profiles.Json, method='C')

# Extract the profile id and name into the profiles.id and profiles.name variables
GA.profiles.param <- t(sapply(GA.profiles.List$items,
                              '[', 1:max(sapply(GA.profiles.List$items, length))))
profiles.id <- as.character(GA.profiles.param[, 1])
profiles.name <- as.character(GA.profiles.param[, 7])

# Store profiles.id and profiles.name in the profiles data.frame
profiles <- data.frame(id=profiles.id, name=profiles.name)

We have stored the profile information, profile id and profile name, in the profiles data frame. We can print the retrieved list with the following code:

profiles
OUTPUT::

         id       name
1 ga:123456    abc.com
2 ga:234567    xyz.com

We can retrieve Google Analytics data from only one GA profile at a time, so we need to define the profile id for which we want to retrieve the data. Select the related profile id from the output above and store it in the profileid variable, to be used later in the code.

# Set your google analytics profile id
profileid <- 'ga:123456'
  4. Retrieve the GA data

Now we request the Google Analytics data from the data feed API, passing the access token and all of the query parameters defined above: dimensions, metrics, start date, end date, segment, sort and filters.

# Request URI for querying the Google Analytics data
start_index <- 0  # first page of results; see the paging sketch below
GA.Data <- getURL(paste('https://www.googleapis.com/analytics/v3/data/ga?',
                        'ids=',profileid,
                        '&dimensions=',ga_dimensions,
                        '&metrics=',ga_metrics,
                        '&start-date=',startdate,
                        '&end-date=',enddate,
                        '&segment=',segment,
                        '&sort=',sort,
                        '&filters=',filters,
                        '&max-results=',10000,
                        '&start-index=',start_index*10000+1,
                        '&access_token=',access_token, sep='', collapse=''))

This request returns a response body with a JSON structure, so to interpret the response values we need to convert it to a list object.

# Convert the JSON data to the list object GA.list
GA.list <- fromJSON(GA.Data, method='C')

Now it's easy to read the response parameters from this list object. For example, the total number of data rows is obtained with the following command:

# Get the total number of data rows
totalrow <- GA.list$totalResults
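
Because the API returns at most 10,000 rows per request (which is what the start-index arithmetic above is for), larger result sets need paging. A minimal sketch of such a loop, reusing the request from the previous step:

# Page through the full result set, 10,000 rows per request
pages <- ceiling(totalrow / 10000)
all_rows <- list()
for (start_index in 0:(pages - 1)) {
  GA.Data <- getURL(paste('https://www.googleapis.com/analytics/v3/data/ga?',
                          'ids=', profileid,
                          '&dimensions=', ga_dimensions,
                          '&metrics=', ga_metrics,
                          '&start-date=', startdate,
                          '&end-date=', enddate,
                          '&segment=', segment,
                          '&sort=', sort,
                          '&filters=', filters,
                          '&max-results=', 10000,
                          '&start-index=', start_index * 10000 + 1,
                          '&access_token=', access_token, sep = '', collapse = ''))
  page <- fromJSON(GA.Data, method = 'C')
  all_rows <- c(all_rows, page$rows)  # accumulate this page's rows
}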
  5. Store the GA data in a data frame

Finally, we store the Google Analytics response data in an R data frame, which is the most convenient shape for data visualization and modeling in R.

# Collect the returned data rows into a matrix
# (assumes the 'rows' field of the JSON response, as in the full script)
finalres <- do.call(rbind, GA.list$rows)

# Split ga_metrics into a vector of metric names
metrics_vec <- unlist(strsplit(ga_metrics, split=','))

# Split ga_dimensions into a vector of dimension names
dimension_vec <- unlist(strsplit(ga_dimensions, split=','))

# Strip the 'ga:' prefix to get the dimension column names
ColnamesDimension <- gsub('ga:', '', dimension_vec)

# Strip the 'ga:' prefix to get the metric column names
ColnamesMetric <- gsub('ga:', '', metrics_vec)

# Combine dimension and metric column names into col_names
col_names <- c(ColnamesDimension, ColnamesMetric)
colnames(finalres) <- col_names

# Convert the object finalres to a data frame
GA.DF <- as.data.frame(finalres)

Finally, the retrieved data is stored in the GA.DF data frame. You can check its top rows with the following command:

head(GA.DF)
OUTPUT::
        visitorType operatingSystem   country visits bounces      avgTimeOnSite
1       New Visitor         Android Australia      3       1              106.0
2       New Visitor         Android   Belgium      3       1 155.33333333333334
3       New Visitor         Android    Poland      3       0               60.0
4       New Visitor         Android    Serbia      3       2 40.666666666666664
5       New Visitor         Android     Spain      3       1               43.0
6 Returning Visitor         Android (not set)      3       3                0.0

You will need the full R script to try this yourself; you can download it by clicking here. I am currently working on an R package that will let R users do the same task with less effort. If you are interested, leave your email id in a comment and we'll get in touch.

Would you like to understand the value of predictive analysis applied to web analytics data, to help improve your understanding of the relationships between different variables? We think you may like to watch our webinar, How to perform predictive analysis on your web analytics tool data. Watch the replay now!


Product revenue prediction with R – part 2

After developing a predictive model for transactional product revenue (Product revenue prediction with R – part 1), we can further improve the model's predictions by modifying the model. In this post, we will see the steps required for model improvement. With the help of a set of model summary parameters, a data analyst can improve and evaluate the predictive model. Here I describe how we can choose the best, or best-fitting, model for accurate prediction. We can do that in the following ways, using certain R functions.

  1. Choose Effective variables for the model
  2. Model Comparisons
  3. Measure Prediction Accuracy
  4. Cross validation

1. Choose Effective variables for the model:

With this technique, we choose appropriate variables, and filter the data taken into the development of the predictive model. One common, useful trick is to remove outliers from the dataset to make predictions more accurate.

Outlier detection and removal:

We can check data ranges and distributions with the histogram function, and take subsets of our dataset to get a better fit and reduce the model's RSS (residual sum of squares); removing outliers this way increases the prediction accuracy of the model. One easy way to detect outliers in a dataset is the histogram function: with hist(), we can plot frequency against data values for a single variable. We have displayed it here for only one variable, xproductviews.
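
The histogram image is not reproduced here, but a call along these lines produces it (column name taken from the subset() filter used later in the post):

# Frequency vs. value for product page views; a long right tail indicates outliers
hist(data$xproductviews, breaks = 100,
     main = 'Distribution of xproductviews',
     xlab = 'xproductviews')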

The plot shows that about 4,000 observations have a value of xproductviews below 8,000; here we choose to keep observations with xproductviews less than 5,000. We can also check the distribution of the data by applying the summary function to the data variable. The dataset is stored in the data object, whose summary is given below.

> summary(data)
output
Nofinstancesofcartadd     Nofuniqueinstancesofcartadd    cartaddTotalRsValue
Min.   :  0.000                Min.   :  0.000             Min.   :     0
1st Qu.:  0.000                1st Qu.:  0.000             1st Qu.:     0
Median :  0.000                Median :  0.000             Median :     0
Mean   :  3.638                Mean   :  2.668             Mean   :  4207
3rd Qu.:  0.000                3rd Qu.:  0.000             3rd Qu.:     0
Max.   :833.000                Max.   :622.000             Max.   :752186

Nofinstancesofcartremoval NofUniqueinstancesofcartremoval productviews
Min.   : 0.0000               Min.   : 0.0000             Min.   :    0.00
1st Qu.: 0.0000               1st Qu.: 0.0000             1st Qu.:   14.75
Median : 0.0000               Median : 0.0000             Median :   44.00
Mean   : 0.2553               Mean   : 0.1283             Mean   :  161.52
3rd Qu.: 0.0000               3rd Qu.: 0.0000             3rd Qu.:  130.00
Max.   :36.0000               Max.   :29.0000             Max.   :24306.00   

cartremoveTotalvalueinRs  uniqueproductviews       productviewRsvalue      ItemrevenuenRs
Min.   :    0.0              Min.   :    0         Min.   :       0        Min.  :   0.0
1st Qu.:    0.0              1st Qu.:   11         1st Qu.:   11883        1st Qu:   0.0
Median :    0.0              Median :   35         Median :   40194        Median:   0.0
Mean   :  301.3              Mean   :  130         Mean   :  252390        Mean  :  64.8
3rd Qu.:    0.0              3rd Qu.:  104         3rd Qu.:  180365        3rd Qu:   0.0
Max.   :29994.0              Max.   :20498         Max.   :29930894        Max.  :80380.0

Here we can see that every explanatory variable has a Min., 1st Qu., Median, Mean, 3rd Qu. and Max. These sequential values should be near to each other, but they are very far apart. One possible solution is to filter the data with conditions that keep the more representative observations. With the subset function, we can take a subset of our dataset under conditions such as xcartadd<200, xcartuniqadd<100, xcartaddtotalrs<2e+05, xcartremove<5, xcardtremovetotal<5, xcardtremovetotalrs<5000, xproductviews<5000 and xuniqprodview<2500, chosen by considering the histogram of each variable. We have chosen these conditions so that the filtered variables retain a large fraction of the original data while having Min., 1st Qu., Median, Mean, 3rd Qu. and Max. values that are much closer together. This removes the outliers from the dataset; we then store the result in newdata.

> newdata <- subset(data,xcartadd<200 & xcartuniqadd<100 & xcartaddtotalrs<2e+05 & xcartremove<5 & xcardtremovetotal<5 & xcardtremovetotalrs<5000 & xproductviews <5000 & xuniqprodview<2500 )

After removing outliers from our dataset, the summary of newdata looks like:

> summary(newdata)
output
Nofinstancesofcartadd        Nofuniqueinstancesofcartadd    cartaddTotalRsValue
Min.   : 0.0000                 Min.   : 0.0000             Min.   :    0.0
1st Qu.: 0.0000                 1st Qu.: 0.0000             1st Qu.:    0.0
Median : 0.0000                 Median : 0.0000             Median :    0.0
Mean   : 0.3275                 Mean   : 0.1857             Mean   :  295.4
3rd Qu.: 0.0000                 3rd Qu.: 0.0000             3rd Qu.:    0.0
Max.   :14.0000                 Max.   :10.0000             Max.   :48400.0

Nofinstancesofcartremoval NofUniqueinstancesofcartremoval   productviews
Min.   :0.0000                  Min.   :0.00000             Min.   : 0.00
1st Qu.:0.0000                  1st Qu.:0.00000             1st Qu.: 9.00
Median :0.0000                  Median :0.00000             Median :24.00
Mean   :0.0436                  Mean   :0.01666             Mean   :30.47
3rd Qu.:0.0000                  3rd Qu.:0.00000             3rd Qu.:47.00
Max.   :4.0000                  Max.   :2.00000             Max.   :99.00   

cartremoveTotalvalueinRs uniqueproductviews productviewRsvalue    ItemrevenuenRs
Min.   :   0.00           Min.   : 0.00     Min.   :     0        Min.   :  0.00
1st Qu.:   0.00           1st Qu.: 7.00     1st Qu.:  7077        1st Qu.:  0.00
Median :   0.00           Median :19.00     Median : 19383        Median :  0.00
Mean   :  24.22           Mean   :24.21     Mean   : 45150        Mean   : 33.42
3rd Qu.:   0.00           3rd Qu.:38.00     3rd Qu.: 47889        3rd Qu.:  0.00
Max.   :4190.00           Max.   :91.00     Max.   :942160        Max.   :989.44

Now we will develop our second model, model_out, on the newdata object.

model_out <- lm(formula=yitemrevenue_out ~ xcartadd_out + xcartuniqadd_out + xcartaddtotalrs_out + xcartremove_out + xcardtremovetotal_out + xcardtremovetotalrs_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out,data= newdata)

We now have two models: one (Model 1) built with outlier values and the other (Model 2) built without them.

  1. Model 1 – model (Model with outliers)
  2. Model 2 – model_out (Model without outliers)

In Model 2, after removing the outliers, we renamed the explanatory variables with the postfix _out. We can choose the appropriate variables with two techniques:

  • Stepwise Regression
  • All Subsets Regression

Stepwise Regression:

In stepwise regression, variables are added to or deleted from the model one at a time until a stopping criterion is reached. For example, in forward stepwise regression we add predictor variables to the model one at a time, stopping when adding variables would no longer improve the model. In backward stepwise regression, we start with a model that includes all predictor variables and then delete them one at a time until removing variables would degrade the quality of the model. The model with the lower AIC value fits the data better, so it is the more appropriate model. Here we have applied stepwise regression in the backward direction, using the MASS package, on model_out (the model without outliers).

> library(MASS)
> stepAIC(model_out,direction='backward')
output
Start:  AIC=27799.14
yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out +
    xuniqprodview_out + xprodviewinrs_out

                      Df Sum of Sq      RSS   AIC
- xuniqprodview_out    1     25570 53512589 27799
                             53487020 27799
- xcartaddtotalrs_out  1     47194 53534214 27800
- xcartremove_out      1     48485 53535505 27800
- xproductviews_out    1    185256 53672276 27807
- xprodviewinrs_out    1    871098 54358118 27843

Step:  AIC=27798.49
yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out +
    xprodviewinrs_out

                      Df Sum of Sq      RSS   AIC
                             53512589 27799
- xcartaddtotalrs_out  1     39230 53551819 27799
- xcartremove_out      1     50853 53563442 27799
- xprodviewinrs_out    1    940137 54452727 27846
- xproductviews_out    1   2039730 55552319 27902

Call:
lm(formula = yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out +
    xproductviews_out + xprodviewinrs_out)

Coefficients:
        (Intercept)  xcartaddtotalrs_out      xcartremove_out    xproductviews_out
          8.8942468           -0.0023806           11.9088716            1.2072294
  xprodviewinrs_out
         -0.0002675

where RSS (residual sum of squares) = Σ(actual − predicted)²

This method suggests keeping four variables in the predictive model: xcartaddtotalrs_out, xcartremove_out, xprodviewinrs_out and xproductviews_out. The technique is controversial (see this criticism); there is no guarantee that it will find the best model. So we use another technique, all subsets regression, to cross-check the result.

All Subsets Regression:

All subsets regression is implemented using the regsubsets() function from the leaps package. This regression suggests the best sets of variables graphically, and an analyst may prefer it for variable selection. With scale="adjr2" it ranks subsets of variables by adjusted R-squared, so we can read off which combination of variables fits best. The following commands produce the subsets of variables.

> library(leaps)
> leaps <- regsubsets(yitemrevenue_out ~ xcartadd_out + xcartuniqadd_out + xcartaddtotalrs_out + xcartremove_out + xcardtremovetotal_out + xcardtremovetotalrs_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out,data= newdata)
> plot(leaps,scale="adjr2")

The resulting graph is:

From the graph above, we can distinguish which variables to include and which to drop. You can see that the first row of the graph has a black strip over xcartaddtotalrs_out, xcartremove_out, xproductviews_out, xuniqprodview_out and xprodviewinrs_out, so these are the variables to consider in the model.

Now we update model_out with this set of variables:

model_out <- lm(formula=yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out, data = newdata)

2. Model Comparisons:
We can compare models with the AIC and anova functions.

  • AIC
  • anova

AIC:
We can check the AIC value of both models (Model 1 and Model 2) with this function; the model with the smaller AIC value is the better fit. The command is given below.

> AIC(model,model_out)
output
         df      AIC
model     11 72204.46
model_out  7 58937.51

Here, model was fitted with the outlier data and model_out without it. We choose model_out, which has the smaller AIC value, as it is better than model for prediction.

anova:
This function lets us choose the better-fitting model among nested models: a probability value of 0.05 or smaller indicates that the larger model fits the data significantly better. Our two models (with and without outliers) are fitted on different datasets and are not nested, so anova does not apply in this case. The function is meant for comparing two or three models; for large numbers of models, prefer stepwise selection or subsets selection as above. For nested models, the call looks like the sketch below.
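
Purely as an illustration, reusing variables from the models above:

# anova() compares nested models fitted on the same data
model_small <- lm(yitemrevenue_out ~ xproductviews_out + xprodviewinrs_out,
                  data = newdata)
model_big   <- lm(yitemrevenue_out ~ xproductviews_out + xprodviewinrs_out +
                    xcartremove_out, data = newdata)
anova(model_small, model_big)  # Pr(>F) < 0.05 favours keeping the extra variable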

3. Measure Prediction Accuracy:
To measure the prediction accuracy of the model, we need to check the model summary parameters: residual standard error, degrees of freedom, multiple R-squared and the p-values. The model summary of model_out looks like this:

> summary(model_out)
output
Call:
lm(formula = yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out +
    xproductviews_out + xuniqprodview_out + xprodviewinrs_out,
    data = newdata)

Residuals:
    Min      1Q  Median      3Q     Max
-2671.1  -173.6   -83.4   -42.9 14288.6 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          3.992e+01  1.254e+01   3.183  0.00147 **
xcartaddtotalrs_out -7.888e-03  2.570e-03  -3.070  0.00216 **
xcartremove_out     -3.410e+01  2.431e+01  -1.403  0.16076
xproductviews_out    1.248e+01  1.222e+00  10.215  < 2e-16 ***
xuniqprodview_out   -1.350e+01  1.487e+00  -9.076  < 2e-16 ***
xprodviewinrs_out    3.705e-04  5.151e-05   7.193 7.62e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 656.4 on 3721 degrees of freedom
Multiple R-squared: 0.1398,	Adjusted R-squared: 0.1386
F-statistic: 120.9 on 5 and 3721 DF,  p-value: < 2.2e-16

We can check the model's prediction accuracy from summary parameters such as the residual standard error, the p-values and the R-squared value. The coefficient estimates for the explanatory variables of a linear model describe a positive or negative relationship between the response variable and each explanatory variable. For example, since we are predicting product revenue here, a 1-unit increase in product page views explains a 12.48-unit increase in transactional product revenue (and, checking xprodviewinrs_out, a 1-unit increase in product view value in Rs explains a 0.0003705-unit increase in revenue). We can consider the following points when choosing the model:

  • RSS and the residual standard error should be as small as possible; logically, a model with an RSS of 0 would predict the actual values exactly.
  • A variable with a low p-value (less than 0.05) is significant and should remain in the model.
  • R-squared describes the proportion of variation in the response explained by the model. Between two models, choose the one with the lowest residual standard error and the highest R-squared.
  • The model with the lower AIC value is a better fit than the others.

4. Cross validation:
We can cross-validate our regression model in several ways; here we use two methods:

  1. Shrinkage method
  2. 80/20 datasets training/testing

Shrinkage method:

With the shrinkage method, we cross-check the R-squared values of the training and testing datasets. It first folds the dataset into k subsets, then picks k−1 of them for training and the rest for testing, and calculates the R-squared for each phase. We choose the model with the lower difference between the training and cross-validated R-squared values. The shrinkage() helper used below is not part of base R; one common definition is sketched next.
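
A sketch, adapted from the widely circulated crossval()-based definition (assumes the bootstrap package is installed):

shrinkage <- function(fit, k = 10) {
  require(bootstrap)  # provides crossval()
  # Fit/predict wrappers that crossval() calls on each fold
  theta.fit <- function(x, y) { lsfit(x, y) }
  theta.predict <- function(fit, x) { cbind(1, x) %*% fit$coef }
  x <- as.matrix(fit$model[, 2:ncol(fit$model)])  # explanatory variables
  y <- fit$model[, 1]                             # response variable
  results <- crossval(x, y, theta.fit, theta.predict, ngroup = k)
  r2 <- cor(y, fit$fitted.values)^2    # R-squared on the full data
  r2cv <- cor(y, results$cv.fit)^2     # cross-validated R-squared
  cat("Original R-square =", r2, "\n")
  cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
  cat("Change =", r2 - r2cv, "\n")
}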

Below is the cross-validation output for the two models:

  • model(With outliers)
  • model_out(Without outliers)
> shrinkage(model)
output
Original R-square = 0.7109059
10 Fold Cross-Validated R-square = 0.6222664
Change = 0.08863944
> shrinkage(model_out)
output
Original R-square = 0.1397824
10 Fold Cross-Validated R-square = 0.116201
Change = 0.02358148

Here we can see that the change value for model_out is lower than for the other model. Therefore we keep model_out, because of its smaller variance in prediction.

80/20 datasets training/testing:
With this technique, we use 80% of our dataset for the training phase and 20% for the testing phase. That is, we build the model on 80% of the dataset and then generate predictions using the remaining 20% as input. The output is compared with the actual values from that held-out 20% of the historical dataset, so the ratio of correctly predicted values to total observations (or an error measure such as RMSE) lets us compare the prediction accuracy of different models, as the sketch below shows.
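
A minimal sketch, assuming the model_out variables and the newdata object from above, with RMSE as the error measure:

# Reproducible 80/20 split of the outlier-filtered dataset
set.seed(42)
train_idx <- sample(nrow(newdata), size = floor(0.8 * nrow(newdata)))
train <- newdata[train_idx, ]
test  <- newdata[-train_idx, ]

# Fit the model on the 80% training rows
fit80 <- lm(yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out +
              xproductviews_out + xuniqprodview_out + xprodviewinrs_out,
            data = train)

# Predict on the held-out 20% and compare against the actual revenue
pred <- predict(fit80, newdata = test)
sqrt(mean((test$yitemrevenue_out - pred)^2))  # RMSE on unseen data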

In this blog, we have done model development and evaluation in R. If you want to do it yourself in R, you can download the R code + sample dataset. In my next post (Product revenue prediction with R – part 3), I will explain how to generate predictions for transactional product revenue from our model given an input data object, and also compare it with a Google Prediction API model.

Want us to help you implement or analyze the data for your visitors? Contact us.



Product revenue prediction with R – part 3

After the development and improvement of the predictive model with R (as in the previous blog), I focus here on making predictions with the R model (a linear regression model) and comparing them with a Google Prediction API model. In statistical modeling, R calculates the intercept and variable coefficients that describe the relationship between the response variable and the explanatory variables, and the model uses this relationship to make predictions.

The statistical relationship can be described by the following formula (derived from this model):

Productrevenue = 39.92 - 0.0079 * (xcartaddtotalrs_out) - 34.10 * (xcartremove_out) + 12.48 * (xproductviews_out) - 13.50 * (xuniqprodview_out) + 0.00037 * (xprodviewinrs_out)

Therefore, in R, the predict() function makes predictions using this relationship. To make a prediction we need input data, which we store in input_data. Suppose we have an input dataset (description) like:

  • xcartaddtotalrs_out = 0
  • xcartremove_out = 0
  • xproductviews_out = 47
  • xuniqprodview_out = 38
  • xprodviewinrs_out = 5828
> input_data <- data.frame(xcartaddtotalrs_out=0, xcartremove_out=0, xproductviews_out=47, xuniqprodview_out=38, xprodviewinrs_out=5828)

This means we want to predict the transactional product revenue on the basis of xcartaddtotalrs, xcartremove, xproductviews, xuniqprodview and xprodviewinrs. Now we make the prediction with the predict function and the model_out prediction model (which we developed in Product revenue prediction with R – part 2).

> predict(model_out,input_data,type="response")
output
115.8346013
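
The same number can be reproduced by hand from the fitted coefficients, which is a quick sanity check on the formula above (a sketch; small differences come from rounding in the printed coefficients):

# Multiply each coefficient by its input value and sum, intercept first
inputs <- c(1,     # intercept
            0,     # xcartaddtotalrs_out
            0,     # xcartremove_out
            47,    # xproductviews_out
            38,    # xuniqprodview_out
            5828)  # xprodviewinrs_out
sum(coef(model_out) * inputs)  # approximately 115.8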

We can do the same prediction with the Google Prediction API with less effort. When we processed the same dataset with the Google Prediction API for predictive modeling, its model summary looked like this:

Let's identify the attributes above from the prediction result. The id is the unique identity information for model identification. In the model information, numberInstances describes the total number of rows in the dataset, 4061; the modelType attribute describes the type of model (either regression or categorical), which here is regression; and meanSquaredError is 1606123.17. The square root of the mean squared error, 1267.33, is the cost of this regression model in the Google Prediction API.

In R, our predictive model has a cost of 656.4 (its residual standard error), which is lower than the Google Prediction API's. The reason for the reduced cost is variable selection and the removal of outliers from our dataset. In R, we can improve prediction accuracy by improving the model as well as the dataset quality; with the Google Prediction API, we can improve prediction accuracy only through dataset quality, since we cannot update the model.

Don't think this stuff is overly complex; it's pretty interesting once you are used to developing it. To start learning predictive modeling, just start with a rough implementation and improve step by step as per your requirements. If you want to do it yourself, you can download this R code + sample dataset. In my next blog, Product revenue prediction with Prediction API, I will discuss generating predictions with the Google Prediction API in more detail.



Product revenue prediction with R – part 1

In my upcoming three blogs, I am going to discuss how product managers, data analysts and data scientists can develop a model for predicting transactional product revenue on the basis of user actions such as the total number of times a product is added to the cart, the total number of product page views, and more. Product managers and data scientists can use the linear regression tool for model-based predictive analysis on business data. We will apply regression learning to product transactional data to identify the variables with the most impact on transactional product revenue. In this blog, I discuss how we can develop a prediction model on a GA dataset and what the model's summary statistics mean. First, we will see how a data analyst can get a transaction-related dataset from the source (Google Analytics) for predictive analysis of product revenue.

Business Dataset:

For business analytics, we require a set of historical product purchase data on which to perform analysis. We use Google Analytics to capture data such as product name, product SKU for purchased items, number of instances removed from the cart, total number of product page views, item revenue and more.

After capturing the data from Google Analytics, we store it in our Mongo instance on an Amazon server. With the RMongo package we can load the datasets into RStudio (a free and open-source integrated development environment for R), or we can use r-google-analytics to do the same.

As we know, business data can be either numerical or categorical. If we take product revenue and the number of product page views, they are numerical data; if we take product name and country name, they are categorical data.
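
In R this distinction maps onto numeric vectors and factors; a toy illustration with invented values:

# Numerical data: stored as plain numeric vectors
revenue   <- c(110.06, 803.81, 0)
pageviews <- c(47, 484, 12)

# Categorical data: stored as factors; each distinct value becomes a level
country <- factor(c('Australia', 'Belgium', 'Australia'))
levels(country)  # "Australia" "Belgium"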

The extracted Google Analytics dataset, which we used for the shopping cart analysis in the last post, looks like the image below.


From the available Google Analytics dataset, we have collected the list of variables necessary for predictive model development:

  1. yitemrevenue – Item revenue in Rs
  2. xcartadd – Number of instances added to cart
  3. xcartuniqadd – Number of unique instances added to cart
  4. xcartaddtotalrs – Total Rs value of products after they are added to cart
  5. xcartremove – Number of instances removed from cart
  6. xcardtremovetotal – Total number of instances removed from cart
  7. xcardtremovetotalrs – Total Rs after instances removed from cart
  8. xproductviews – Number of page views
  9. xuniqprodview – Number of unique product views
  10. xprodviewinrs – Rs at total number of page views

After collecting the necessary data, we are ready to develop a predictive model in R. For regression modeling we need a response variable and explanatory variables: the element being predicted is called the response variable, and the variables from which we predict it are called the explanatory variables. In the next part, we develop the model with the regression tool.

Model development in R:

Since we are trying to describe the relationship between product revenue and user behavior, we will develop a regression model with product revenue as the response variable and the rest as explanatory variables. With this relationship, we can predict transactional product revenue. We have separated the dataset into the response variable and explanatory variables as follows:

Response variable:

  1. yitemrevenue – Item revenue in Rs

Explanatory variables:

  1. xcartadd – Number of instances added to cart
  2. xcartuniqadd – Number of unique instances added to cart
  3. xcartaddtotalrs – Total Rs after instances added to cart
  4. xcartremove – Number of instances removed from cart
  5. xcardtremovetotal – Total number of instances removed from cart
  6. xcardtremovetotalrs – Total Rs after instances removed from cart
  7. xproductviews – Number of page views
  8. xuniqprodview – Number of unique product views
  9. xprodviewinrs – Rs at total number of page views

Now we can assume that this dataset is ready for building the linear regression model. In R, we use the lm() function to fit a linear model, with the following syntax.

> fit <- lm(formula,data)

Here, fit is the fitted linear model object and data is the data object applied in the model. Our model can be developed with the following code.

> model <- lm(formula=yitemrevenue ~ xcartadd + xcartuniqadd + xcartaddtotalrs + xcartremove + xcardtremovetotal + xcardtremovetotalrs + xproductviews + xuniqprodview + xprodviewinrs , data = data)

Here, the model object is the regression model we are interested in. In the formula, the (~) sign separates the response variable from the explanatory variables: the variable on the left of the sign is the response variable and those on the right are the explanatory variables. We can check the summary statistics of the model with the summary function in R; with these, we can measure the model's prediction accuracy. Here is the summary of our linear model.

> summary(model)

Call:
lm(formula = yitemrevenue ~ xcartadd + xcartuniqadd + xcartaddtotalrs +
    xcartremove + xcardtremovetotal + xcardtremovetotalrs + xproductviews +
    xuniqprodview + xprodviewinrs)

Residuals:
   Min     1Q Median     3Q    Max
-22489   -143    -58    -34  37838 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          3.544e+01  2.325e+01   1.524 0.127464
xcartadd            -5.209e+01  1.343e+01  -3.878 0.000107 ***
xcartuniqadd         4.220e+01  1.786e+01   2.362 0.018208 *
xcartaddtotalrs      7.089e-02  2.406e-03  29.465  < 2e-16 ***
xcartremove         -5.810e+01  3.842e+01  -1.512 0.130598
xcardtremovetotal    4.299e+02  5.430e+01   7.917 3.09e-15 ***
xcardtremovetotalrs -7.004e-02  2.066e-02  -3.390 0.000705 ***
xproductviews        9.629e+00  9.405e-01  10.238  < 2e-16 ***
xuniqprodview       -1.097e+01  1.080e+00 -10.159  < 2e-16 ***
xprodviewinrs        2.344e-04  4.456e-05   5.261 1.51e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1351 on 4174 degrees of freedom
Multiple R-squared: 0.7109,	Adjusted R-squared: 0.7103
F-statistic:  1140 on 9 and 4174 DF,  p-value: < 2.2e-16

And this summary has a list of parameters like:

Residual sum of squares (RSS) = Σ(actual − predicted)²
Degrees of freedom (DF) = number of rows in the dataset − number of fitted coefficients (the explanatory variables plus the intercept)
Residual standard error = √(RSS / DF)
Multiple R-squared = 1 − RSS/TSS, where TSS = Σ(actual − mean)² is the total sum of squares; this is the fraction of the response's variance explained by the model
Pr(>|t|) = the probability of a coefficient this large arising if the variable had no real effect; it should be less than 0.05 for a variable to be included in the model
t value = coefficient estimate divided by its standard error

The F-statistic checks whether the R-squared is significantly different from zero.

With reference to this model summary, we have a residual standard error of 1351, which should be as small as possible (a value of 0 would logically denote perfect prediction). The multiple R-squared is 0.7109, which means the model explains nearly 71% of the variance in product revenue. We can see that xproductviews and xcartaddtotalrs have the most impact on product revenue; here, a 1-unit increase in xproductviews explains a 9.629-unit increase in product revenue.

Therefore, product revenue appears to depend mainly on the total page views of the product page as well as the total number of times the product is added to the cart, and is largely proportional to the product-page views. We can assume that we can increase transactional product revenue by increasing the number of page views. If you want to do this yourself in R, you can download the R code + sample dataset. In the next blog post (Product revenue prediction with R – part 2), I will share how to improve this predictive model with R.



Product revenue prediction with Google Prediction API

In this post, I am going to explain how we can build a model for transactional product revenue prediction with the Google Prediction API, as we already discussed the same task in R (Product revenue prediction with R). With the Prediction API, we can build a prediction model without any programming; we just have to focus on our dataset to make the predictions more accurate. Google provides this service through Google Cloud Storage and the Google Prediction API.

We store our dataset in Google Cloud Storage and fire queries from the Google Prediction API to generate predictions on it, which is easy. The Google Prediction API is built on a set of optimized machine learning algorithms for model-based prediction. The description here is based on Prediction API v1.5.

How to train data with Google Prediction API:

First, we require at least one project in the Google API console, with the Google Prediction API and Google Cloud Storage services enabled for it. Our dataset has to meet a certain format:

  • There must be no header row for any column
  • The variable to be predicted must be in the first column
  • There must be no NA values in the dataset, as they confuse the Google Prediction server (a sketch of producing such a file from R follows)
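
In R, writing a training file in that shape might look like this, assuming the data object from the earlier posts with the columns listed further below:

# Response variable first, then the explanatory variables
train_df <- data[, c('yitemrevenue', 'xcartadd', 'xcartuniqadd', 'xcartaddtotalrs',
                     'xcartremove', 'xcardtremovetotal', 'xcardtremovetotalrs',
                     'xproductviews', 'xuniqprodview', 'xprodviewinrs')]
train_df <- na.omit(train_df)  # drop rows with NA values

# No header row, comma-separated, ready for upload to Google Cloud Storage
write.table(train_df, file = 'ga_training.csv', sep = ',',
            row.names = FALSE, col.names = FALSE)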

Then we can upload our dataset to Google Cloud Storage by creating a bucket. After uploading the dataset, we query the Google Prediction API through the Google API explorer. The Prediction API service has a number of methods for dealing with the data:

  1. prediction.trainedmodels.insert – inserts your dataset for training
  2. prediction.trainedmodels.get – gets the training status of a previously inserted data model
  3. prediction.trainedmodels.analyze – provides an analysis of the trained model
  4. prediction.trainedmodels.predict – makes a prediction

Here we predict product revenue (yitemrevenue) on the basis of xcartadd, xcartuniqadd, xcartaddtotalrs, xcartremove, xcardtremovetotal, xcardtremovetotalrs, xproductviews, xuniqprodview and xprodviewinrs. Therefore, yitemrevenue is the predicted variable and the others are explanatory variables. As in the previous blog, we have already developed a model in R for the same dataset. To do the same with the Google Prediction API, we start training the dataset with prediction.trainedmodels.insert, passing a unique model id, the data storage location, and the type of model as parameters.

[You can use this dataset]

This gives a response like the one shown, which means the request was accepted by the Google Prediction server and training has started. We can check the training status of the model with prediction.trainedmodels.get.

Here is the response of the Get function of the Prediction API, which describes:

Total number of instances = 4061
Type of model = regression
Mean squared error (MSE) = 1606123.17

Note that this model summary does not provide the intercept and variable coefficients the way R does. To make a prediction with the above model, we query with the following data (description) using the Predict method (prediction.trainedmodels.predict).

Test 1:

  1. xcartadd = 0
  2. xcartuniqadd = 0
  3. xcartaddtotalrs = 0
  4. xcartremove = 0
  5. xcardtremovetotal = 0
  6. xcardtremovetotalrs = 0
  7. xproductviews = 47
  8. xuniqprodview = 38
  9. xprodviewinrs = 5828

Actual  (yitemrevenue) = 110.06

 Output
(yitemrevenue) = 155.15717487938028

Test 2:

  1. xcartadd = 0
  2. xcartuniqadd = 0
  3. xcartaddtotalrs = 0
  4. xcartremove = 0
  5. xcardtremovetotal = 0
  6. xcardtremovetotalrs = 0
  7. xproductviews = 484
  8. xuniqprodview = 392
  9. xprodviewinrs = 445026

Actual (yitemrevenue) = 803.81

 Output
(yitemrevenue) = 934

With R (after removing outliers and taking a subset of the independent variables), we predicted 115.8346013 for test 1 and 955.153476 for test 2 on the same dataset, so the predictions from R and the Google Prediction API are quite close. With the Prediction API we can improve predictions only by improving dataset quality; in R, we can improve prediction accuracy by improving both the dataset quality and the prediction model itself, as discussed in Product revenue prediction with R – part 2. For readers who prefer issuing these calls from R rather than the API explorer, a sketch follows.
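
This is only a sketch of the equivalent raw REST calls issued with RCurl: the endpoint paths follow the Prediction API v1.5 pattern, the model id, bucket and file names are placeholders, and access_token is an OAuth token obtained as in the data extraction post.

library(RCurl)

# Train: insert the Cloud Storage dataset as a new model
insert_body <- '{"id": "product-revenue-model", "storageDataLocation": "mybucket/ga_training.csv"}'
getURL('https://www.googleapis.com/prediction/v1.5/trainedmodels',
       customrequest = 'POST', postfields = insert_body,
       httpheader = c('Content-Type' = 'application/json',
                      'Authorization' = paste('Bearer', access_token)))

# Poll the training status until training is done
getURL('https://www.googleapis.com/prediction/v1.5/trainedmodels/product-revenue-model',
       httpheader = c('Authorization' = paste('Bearer', access_token)))

# Predict: explanatory variables in dataset column order (test 1 values)
predict_body <- '{"input": {"csvInstance": [0, 0, 0, 0, 0, 0, 47, 38, 5828]}}'
getURL('https://www.googleapis.com/prediction/v1.5/trainedmodels/product-revenue-model/predict',
       customrequest = 'POST', postfields = predict_body,
       httpheader = c('Content-Type' = 'application/json',
                      'Authorization' = paste('Bearer', access_token)))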

