After development of predictive model for transactional product revenue -(Product revenue prediction with R – part 1), we can further improvise the model prediction by modifications in the model. In this post, we will see what are the steps required for model improvement. With the help of a set of model summary parameters, the data analyst can improve and evaluate the predictive model. Here, I have provided the information about how can we choose the best model or more fitted model for accurate prediction. We can do that by following ways using certain R functions.
- Choose Effective variables for the model
- Model Comparisons
- Measure Prediction Accuracy
- Cross validation
1. Choose Effective variables for the model:
With this technique, we can choose appropriate variable as well as filter variables to take into the development of predictive model. One of the common useful trick is to remove outliers from dataset to make a more accurate prediction.
Outliers Detection and removal:
We can check data ranges or distribution with the help of histogram function, set subsets of our datasets to better fit and reduce the RSS (Residual Sum of Squares) of the model. That will increase the prediction accuracy of the model by removing outliers. One easy way to detect outliers from our dataset is to use histogram function. With hist(), we can check frequency vs data values for a single variable. We have displayed it here for only one variable. The output of hist() on the variable xproductviews is given below
It represents that there are about 4000 numbers of observations having value of xproductviews less than 8000. Here, we can choose observations having xproductviews less than 5000 for filteration. We can also check the distribution of data with summary function upon data variable. The dataset is stored in “data” Object, the summary of which is given below.
output Nofinstancesofcartadd Nofuniqueinstancesofcartadd cartaddTotalRsValue Min. : 0.000 Min. : 0.000 Min. : 0 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0 Median : 0.000 Median : 0.000 Median : 0 Mean : 3.638 Mean : 2.668 Mean : 4207 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.: 0 Max. :833.000 Max. :622.000 Max. :752186 Nofinstancesofcartremoval NofUniqueinstancesofcartremoval productviews Min. : 0.0000 Min. : 0.0000 Min. : 0.00 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 14.75 Median : 0.0000 Median : 0.0000 Median : 44.00 Mean : 0.2553 Mean : 0.1283 Mean : 161.52 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 130.00 Max. :36.0000 Max. :29.0000 Max. :24306.00 cartremoveTotalvalueinRs uniqueproductviews productviewRsvalue ItemrevenuenRs Min. : 0.0 Min. : 0 Min. : 0 Min. : 0.0 1st Qu.: 0.0 1st Qu.: 11 1st Qu.: 11883 1st Qu: 0.0 Median : 0.0 Median : 35 Median : 40194 Median: 0.0 Mean : 301.3 Mean : 130 Mean : 252390 Mean : 64.8 3rd Qu.: 0.0 3rd Qu.: 104 3rd Qu.: 180365 3rd Qu: 0.0 Max. :29994.0 Max. :20498 Max. :29930894 Max. :80380.0
Here, we can see that every explanatory variable has Min., 1st Qu., Median, Mean, 3rd Qu. and Max. All sequential values should be near to each other but they are very far. One possible solution for this is to filter data with such conditions that would give more related data. With subset function, we can get subset of our dataset with certain conditions like xcartadd<200, xcartuniqadd<100, xcartaddtotalrs<2e+05, xcartremove<5, xcardtremovetotal<5, xcardtremovetotalrs<5000, xproductviews <5000, xuniqprodview<2500 and xuniqprodview<2500 by considering histogram graph of these variables. We have choosed above conditions for formatting our dataset variables such that they might have large fraction of original data and nearly similar values of Min., 1st Qu., Median, Mean, 3rd Qu. and Max. It will remove the outliers from the dataset and then store the dataset to newdata.
> newdata <- subset(data,xcartadd<200 & xcartuniqadd<100 & xcartaddtotalrs<2e+05 & xcartremove<5 & xcardtremovetotal<5 & xcardtremovetotalrs<5000 & xproductviews <5000 & xuniqprodview<2500 )
After removing outliers from our datasets, summary of newdata looks like
output Nofinstancesofcartadd Nofuniqueinstancesofcartadd cartaddTotalRsValue Min. : 0.0000 Min. : 0.0000 Min. : 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0 Median : 0.0000 Median : 0.0000 Median : 0.0 Mean : 0.3275 Mean : 0.1857 Mean : 295.4 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0 Max. :14.0000 Max. :10.0000 Max. :48400.0 Nofinstancesofcartremoval NofUniqueinstancesofcartremoval productviews Min. :0.0000 Min. :0.00000 Min. : 0.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 9.00 Median :0.0000 Median :0.00000 Median :24.00 Mean :0.0436 Mean :0.01666 Mean :30.47 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:47.00 Max. :4.0000 Max. :2.00000 Max. :99.00 cartremoveTotalvalueinRs uniqueproductviews productviewRsvalue ItemrevenuenRs Min. : 0.00 Min. : 0.00 Min. : 0 Min. : 0.00 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 7077 1st Qu.: 0.00 Median : 0.00 Median :19.00 Median : 19383 Median : 0.00 Mean : 24.22 Mean :24.21 Mean : 45150 Mean : 33.42 3rd Qu.: 0.00 3rd Qu.:38.00 3rd Qu.: 47889 3rd Qu.: 0.00 Max. :4190.00 Max. :91.00 Max. :942160 Max. :989.44
Now, we will develop our second model model_out with the newdata object.
model_out <- lm(formula=yitemrevenue_out ~ xcartadd_out + xcartuniqadd_out + xcartaddtotalrs_out + xcartremove_out + xcardtremovetotal_out + xcardtremovetotalrs_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out,data= newdata)
We have two models, one(Model1) with outlier values and other(Model2) is without outlier values.
- Model 1 – model (Model with outliers)
- Model 2 – model_out (Model without outliers)
In model 2, after removing outliers from explanatory variable we have updated variables names with postfix (_out). We can choose appropriate variables with two techniques like
- Stepwise Regression
- All Subsets Regression
In stepwise Regression, variables are added to or deleted from model one at a time until stopping criterion is reached. For example, in forward stepwise regression we add predictor variables to the model one at a time, stopping when the addition of variables would no longer improve the model. In backward stepwise regression, you start with a model that includes all predictor variables and then delete them one at a time until removing variables would degrade the quality of the model. Model with lower AIC value will fit the data better, therefore its appropriate model. We have applied Stepwise Regression with backward direction on above dataset. Here, we have applied stepwise regression with MASS package from R on model_out which is without outliers.
> library(MASS) > stepAIC(model_out,direction='backward')
output Start: AIC=27799.14 yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out Df Sum of Sq RSS AIC - xuniqprodview_out 1 25570 53512589 27799 53487020 27799 - xcartaddtotalrs_out 1 47194 53534214 27800 - xcartremove_out 1 48485 53535505 27800 - xproductviews_out 1 185256 53672276 27807 - xprodviewinrs_out 1 871098 54358118 27843 Step: AIC=27798.49 yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out + xprodviewinrs_out Df Sum of Sq RSS AIC 53512589 27799 - xcartaddtotalrs_out 1 39230 53551819 27799 - xcartremove_out 1 50853 53563442 27799 - xprodviewinrs_out 1 940137 54452727 27846 - xproductviews_out 1 2039730 55552319 27902 Call: lm(formula = yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out + xprodviewinrs_out) Coefficients: (Intercept) xcartaddtotalrs_out xcartremove_out xproductviews_out 8.8942468 -0.0023806 11.9088716 1.2072294 xprodviewinrs_out -0.0002675
where RSS – Residual sum of square= Σ(Actual-predicted)2
This method suggests us to consider the four variables in the predictive model which are xcartaddtotalrs_out, xcartremove_out, xprodviewinrs_out and xprodviews_out. This technique is controversial (by this criticism), there’s no guarantee that it will find the best model. So, we have another technique – All Subsets Regression to cross check this result.
All Subsets Regression:
All subsets regression is implemented using the regsubsets() function from the leaps package. This regression will suggest the best set of variables graphically. Analyst can prefer this method for variable selection. It will suggest the set of variables having p value less than 0.05. p value denotes significance of the existence of variables into the model. With the following set of command we can get the subsets of variables.
> library(leaps) > leaps <- regsubsets(yitemrevenue_out ~ xcartadd_out + xcartuniqadd_out + xcartaddtotalrs_out + xcartremove_out + xcardtremovetotal_out + xcardtremovetotalrs_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out,data= newdata) > plot(leaps,scale="adjr2")
From above graph, we can distinguish which variables to include and which not to. You can see, the first row of this graph having black strip on xcartaddtotalrs_out, xcartremove_out, xproductviews_out, xuniqprodview_out and xprodviewinrs_out to be considered in to model.
Now, we will update model_out variables with this output
model_out <- lm(formula=yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out, data = newdata)
2. Model Comparisons:
We can compare models with AIC and anova functions.
We can check AIC value of both models (model1 and model2) with this function. And distinguish that smaller AIC value model is a better fit. Command for AIC is given below
output df AIC model 11 72204.46 model_out 7 58937.51
Here, model is with outliers data and model_out is without outliers data. Here, we will choose model_out having smaller AIC value as it is a better than model for prediction.
We can choose better to fit model among nested models with this function. The probability value which is less than 0.05 or smaller is better model to fit the data values. We are having two models with outliers and without outliers which are not nested model, so it will not be applied in this case. This function is for comparing the two or three models, but for large numbers of model we can prefer stepwise selection or subsets selection.
3. Measure Prediction Accuracy:
For measuring the prediction accuracy of the model, we require model summary parameters to be checked. Like Residual standard error, Degrees of freedom, Multiple R squared and p-values. Model summary of model_out looks like below.
output Call: lm(formula = yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out, data = newdata) Residuals: Min 1Q Median 3Q Max -2671.1 -173.6 -83.4 -42.9 14288.6 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.992e+01 1.254e+01 3.183 0.00147 ** xcartaddtotalrs_out -7.888e-03 2.570e-03 -3.070 0.00216 ** xcartremove_out -3.410e+01 2.431e+01 -1.403 0.16076 xproductviews_out 1.248e+01 1.222e+00 10.215 < 2e-16 *** xuniqprodview_out -1.350e+01 1.487e+00 -9.076 < 2e-16 *** xprodviewinrs_out 3.705e-04 5.151e-05 7.193 7.62e-13 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 656.4 on 3721 degrees of freedom Multiple R-squared: 0.1398, Adjusted R-squared: 0.1386 F-statistic: 120.9 on 5 and 3721 DF, p-value: < 2.2e-16
We can check the model prediction accuracy based on summary parameters like Residual Standard error, p value and R squared value. The theta (coefficients) values for all the explanatory variables of a linear model, which describe a positive or negative relationship between a response variable and explanatory variable. e.g. Here we are predicting the product revenue so for 12.48 unit increase in transactional product revenue explained by 1 unit increase in product page view (if we check for xprodviewinrs_out , 0.0003705 unit increase in transactional product revenue explained by 1 unit increase in productview in rs). We can consider following points for choosing the model
- RSS, Residual standard error and R squared error. The RSS should be as small as possible. Logically model with RSS value 0 will predict exact as actual value.
- Variable with low (less than 0.5) p value describes significant to be exist in the model.
- R squared error describes the correct prediction probability. From this we can choose the best model from given two models, with lowest Residual standard error high R squared error.
- The lower AIC valued model is a better fit than others.
4. Cross validation:
We can cross validate our regression model with several ways but we are doing this by two methods:
- Shrinkage method
- 80/20 datasets training/testing
With shrinkage method, we can cross check values of R squared values of training datasets and testing datasets. It first folds dataset in k subsets and then picks k-1 for training and rest of them for testing phase. Then calaulate R-squared for training and testing. We can choose the model based on lower Multiple R squared difference of training and testing dataset.
Below is given snap of cross validation of two models
- model(With outliers)
- model_out(Without outliers)
output Original R-square = 0.7109059 10 Fold Cross-Validated R-square = 0.6222664 Change = 0.08863944
output Original R-square = 0.1397824 10 Fold Cross-Validated R-square = 0.116201 Change = 0.02358148
Here we can see the change value for the model_out is lower than another model. Therefore we are considering model_out because of its small variance on prediction.
80/20 datasets training/testing:
With this technique, we can choose 80% of our dataset for training phase and 20% of our dataset for testing phase. That means we can build our model on 80% of the dataset and then prediction is generated on the input as 20% dataset. The output is compared with actual value from 20% of historical dataset. Therefore on the basis of ratio of correct predicted values to the total observations(from 20% of dataset), we can measure the prediction accuracy of different model.
In this blog, we have done model development and evaluation in R. If you need to do it yourself in R, you can download R code + sample dataset. In next of my post(Product revenue prediction with R – part 3), I will explain how to generate prediction for transactional product revenue with our model by input data object and also compare it with Google Prediction API model.
Want us to help you implement or analyze the data for your visitors. Contact us
Would you like to understand the value of predictive analysis when applied on web analytics data to help improve your understanding relationship between different variables? We think you may like to watch our Webinar – How to perform predictive analysis on your web analytics tool data. Watch the Replay now!
The following two tabs change content below.
Vignesh is Data Engineer at Tatvic. He loves to play with opensource playground to make predictive solution on Big data with R, Hadoop and Google Prediction API.
Google Plus profile: Vignesh Prajapati
Google Plus profile: Vignesh Prajapati