As some of you may know, our Predictive Analytics webinar took place yesterday (June 19th, 2013). Predictive analysis is a vast topic, so our intention for this first webinar was to share our knowledge and help you understand how you can use this method to take your analysis to the next level. We gave you an introduction to R (the tool that we have adopted), showed you how to extract data from your web analytics tool (Google Analytics) into R, and took you through the steps of building a predictive model in order to perform a predictive analysis.
As a standard, all our webinars end with a question-and-answer round. Since time constraints keep us from answering every query during the live event, one of our follow-up practices is to compile all the Q&A into a single blog post and share it with you here once the webinar has taken place.
Below are the answers to the questions that we received during yesterday’s webinar.
And if you have any further questions about R, how to extract your web analytics data into R, or how to build a predictive model and perform a predictive analysis, please feel free to contact us!
Q: Will the webinar recording, slides and sample codes be available for download?
A: Yes. They are already available on the webinar’s page.
Q: Are R packages free?
A: Yes. R is a free, open source language, and the packages distributed through CRAN are free as well.
Q: Does “R” pull the data every time you make the calculations? Or does the data remain stored locally in R?
A: Once you have extracted data for a specific query, it remains stored locally in your R session. You can also export it as a .csv file to store it permanently. It may help to think of it this way: data is pulled only when you fire a query at the Google Analytics API.
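As a minimal sketch of the export step, the snippet below saves a data frame to a .csv file and reloads it later. The data frame `ga_data`, its columns and the file name are hypothetical stand-ins for whatever your own query returns; they are not output from the webinar.

```r
# Hypothetical stand-in for a data frame returned by a Google Analytics query.
ga_data <- data.frame(
  date = c("20130617", "20130618", "20130619"),
  visits = c(120L, 135L, 128L),
  transactions = c(4L, 6L, 5L),
  stringsAsFactors = FALSE
)

# Write the extracted data to disk once.
csv_path <- file.path(tempdir(), "ga_extract.csv")
write.csv(ga_data, csv_path, row.names = FALSE)

# Reload it in a later session without querying the API again.
# colClasses keeps the date column as text instead of converting it to a number.
ga_saved <- read.csv(csv_path, colClasses = c(date = "character"),
                     stringsAsFactors = FALSE)
print(ga_saved)
```

Replace `tempdir()` with a permanent folder if you want the file to survive beyond the current R session.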
Q: I only see RGoogleAnalytics for R version 3.0.0, and it says it doesn’t work on my R version 3.0.1. Any suggestions?
A: This is a warning message specific to RStudio. When you install the package from RStudio, it calls getDependencies() to check the package’s dependencies and to verify whether the package exists on CRAN, and it throws this warning when it doesn’t. Since the RGoogleAnalytics package is not on CRAN, you see the warning, and it can be safely ignored. We expect this to be resolved once RGoogleAnalytics is available on CRAN. Installing RGoogleAnalytics from the default R console does not trigger this warning.
Q: Kushan showed how to connect R and GA data using the API. Can we use flat file extracts (data warehouse extracts) to feed into R and do analysis? How?
A: Yes, flat files in the form of tables can be easily imported into R using the read.table() function. Here is a link explaining how to get your data in R from various data sources: .
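To illustrate, here is a small sketch that reads a tab-delimited flat file with read.table(). The file contents and column names are made up for the example; a real data warehouse extract may need a different `sep` or extra arguments.

```r
# Create a small tab-delimited file standing in for a data warehouse extract.
flat_path <- file.path(tempdir(), "warehouse_extract.txt")
writeLines(c(
  "transaction_id\tregion\trevenue",
  "T1001\tNorth\t59.90",
  "T1002\tSouth\t120.00",
  "T1003\tNorth\t35.50"
), flat_path)

# header = TRUE treats the first line as column names; sep = "\t" splits on tabs.
extract <- read.table(flat_path, header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE)
str(extract)
```

For comma-separated files, the same call works with sep = ",", or you can use the read.csv() shortcut.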
Q: How can I integrate R and Predictive Modeling with Omniture SiteCatalyst?
A: You can extract your SiteCatalyst data via the data warehouse and store it on your server, from where it can easily be loaded into R. Alternatively, similar to the RGoogleAnalytics package, there is a package named RSiteCatalyst. Here’s a link to the package homepage: http://cran.r-project.org/web/packages/RSiteCatalyst/
Q: How can I get this dataset from Google Analytics in the form of a table?
A: You can export your Google Analytics data to a .csv file and read it into R using the read.csv() function, or read.table() with sep = ",".
Q: How exactly is machine language related to our situation?
A: Machine Learning is a class of algorithms that form the core of predictive models. These algorithms are based on Statistical and Mathematical formulas. Hence, a model built on these Machine Learning Algorithms can be used to predict the probabilities of occurrence of future events.
Q: I didn’t see any equation when you were building the model? Did I miss something?
A: In the first argument to the glm() function, we specified the response and predictor variables separated by a ~ sign, which forms the equation (formula) for that particular model. The response variable in our case was ‘label’, and we used all the remaining variables as predictors. R has a neat shorthand for this: a “.” (dot) stands for all the other variables in the data frame. If we want to name the variables individually, we separate them with a + sign. In that case, the call would look like glm(label ~ Is_holiday + Is_gift + bSumPrice + bMinPrice).
Q: Can you please share the ‘differentiating factors’ of the approach Tatvic has taken in Web Analytics using its ‘Web Analytics Framework’? Can we do whatever you have explained in this webinar using Tatvic’s ‘Web Analytics Framework’?
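The sketch below shows that the dot shorthand and the spelled-out formula describe the same model. The data is simulated purely for illustration (it is not the webinar dataset), and family = binomial is an assumption based on the logistic regression technique used in the webinar.

```r
set.seed(42)
n <- 200

# Simulated stand-in data using the variable names from the webinar model.
orders <- data.frame(
  Is_holiday = rbinom(n, 1, 0.3),
  Is_gift    = rbinom(n, 1, 0.2),
  bSumPrice  = round(runif(n, 10, 500), 2)
)
orders$bMinPrice <- round(orders$bSumPrice * runif(n, 0.1, 1), 2)
orders$label <- rbinom(
  n, 1, plogis(-1 + 0.8 * orders$Is_gift + 0.002 * orders$bSumPrice)
)

# "." expands to every column of the data frame except the response.
fit_dot <- glm(label ~ ., data = orders, family = binomial)

# The same model with each predictor written out and joined by "+".
fit_explicit <- glm(label ~ Is_holiday + Is_gift + bSumPrice + bMinPrice,
                    data = orders, family = binomial)

# Both formulas yield the same coefficients.
same_model <- isTRUE(all.equal(coef(fit_dot), coef(fit_explicit)))
print(same_model)
```

The explicit form is handy when the data frame contains columns, such as an ID, that should not enter the model.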
A: The key differentiating factor in our approach with the Web Analytics Framework is the focus on ROI. Whatever activity we carry out for you, we keep ROI in mind and help you either win on the revenue side or identify redundant areas where costs can be optimized.
Regarding the second part of your question, the answer is yes. The top end of the web analytics activity cycle leads to prediction, forecasting and personalization, which is where you can apply the learning from this webinar.
Q: Which R command was the machine learning algorithm?
A: We used the Logistic Regression technique for building the model. This was done using the glm() function, which fits a logistic regression when called with family = binomial.
Q: You demoed using transaction info, which is detailed (lower level) data. The other reports in GA are aggregated. Is the approach applicable in these other cases too?
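As a rough sketch of how a fitted logistic regression turns into predicted probabilities, the example below trains glm() on simulated data and scores two new rows with predict(). The data, the reduced set of predictors and the name `p_return` are illustrative assumptions, not the model shown in the webinar.

```r
set.seed(1)
n <- 300

# Simulated training data; label = 1 marks the event we want to predict.
train <- data.frame(
  Is_gift   = rbinom(n, 1, 0.25),
  bSumPrice = round(runif(n, 10, 500), 2)
)
train$label <- rbinom(
  n, 1, plogis(-2 + 1.2 * train$Is_gift + 0.004 * train$bSumPrice)
)

# family = binomial makes glm() fit a logistic regression.
model <- glm(label ~ Is_gift + bSumPrice, data = train, family = binomial)

# Score new observations; type = "response" returns probabilities (0 to 1)
# rather than values on the log-odds scale.
new_orders <- data.frame(Is_gift = c(0, 1), bSumPrice = c(50, 400))
new_orders$p_return <- predict(model, newdata = new_orders, type = "response")
print(new_orders)
```

Calling summary(model) prints the estimated coefficients together with their p-values.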
A: In our case, we wanted to identify the individual transaction IDs with a higher probability of returns, so that we could bucket those customers and send customized offers to ensure retention. However, you can also use aggregate-level parameters to check, for example, the areas/regions from which you are getting the most returns. So, depending on the requirement, we can focus on either lower-level or top-level data sets.
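For the aggregate view, one possible approach is to roll transaction-level rows up by region with aggregate(). The data frame and its `region` and `returned` columns below are hypothetical and only serve to show the idea.

```r
# Hypothetical transaction-level data: returned = 1 means the order came back.
transactions <- data.frame(
  transaction_id = paste0("T", 1:8),
  region   = c("North", "South", "North", "East",
               "South", "East", "North", "South"),
  returned = c(1, 0, 1, 0, 0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Share of returned transactions per region (mean of a 0/1 column).
return_rate <- aggregate(returned ~ region, data = transactions, FUN = mean)

# Sort so that the regions with the highest return rate come first.
return_rate <- return_rate[order(-return_rate$returned), ]
print(return_rate)
```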
Q: How do you determine the most important variables in the model?
A: To identify the important variables, we perform a univariate analysis: we build the model with one predictor variable at a time and examine the p-value for each. If a variable’s p-value is greater than the significance level (commonly 0.05), that variable can be removed. At the end of this process, we are left with the statistically significant variables. Alternatively, we can check the correlation of the response variable with each of the predictor variables. The variables with a strong correlation can be regarded as important.
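A minimal sketch of this screening could look as follows. It fits one single-predictor logistic model per variable and collects the p-values. The data is simulated, and the cutoff `alpha` is a parameter you should set yourself; 0.05 is used here only as a conventional choice.

```r
set.seed(7)
n <- 400

# Simulated data: Is_holiday has no real effect on label by construction.
orders <- data.frame(
  Is_holiday = rbinom(n, 1, 0.3),
  Is_gift    = rbinom(n, 1, 0.25),
  bSumPrice  = round(runif(n, 10, 500), 2)
)
orders$label <- rbinom(
  n, 1, plogis(-2 + 1.5 * orders$Is_gift + 0.004 * orders$bSumPrice)
)

predictors <- setdiff(names(orders), "label")
alpha <- 0.05  # assumed significance level; adjust to your own standard

# Fit label ~ <one predictor> for each variable and pull out its p-value.
p_values <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "label"),
             data = orders, family = binomial)
  coef(summary(fit))[v, "Pr(>|z|)"]
})
print(round(p_values, 4))

# Keep the predictors whose p-value does not exceed the cutoff.
keep <- names(p_values)[p_values <= alpha]
print(keep)

# Alternative check: correlation of each predictor with the response.
print(cor(orders[predictors], orders$label))
```

Keep in mind that screening variables one at a time ignores interactions between predictors, so it is best treated as a first pass.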
Q: Do you have a procedure for a complete predictive model, including refining what you did in the webinar?
A: Yes. An insight into all the steps associated with Predictive Modeling can be found here: http://bit.ly/1bZ4OIZ.
Q: What would be the best career advice for someone with a software engineering background who is currently studying for an MBA in the US and is interested in predictive analytics?
A: Predictive Analytics is a very exciting field to work in. Data has the potential to solve a variety of problems. We would suggest starting out by exploring public datasets and competing with other data scientists on real-life problems like the ones on Kaggle.