Logistic Regression with R

By Amar Gondaliya | 22.10.2012

Logistic Regression

In my first blog post, I explained what regression is and how a linear regression model is built in R. In this post, I will explain what logistic regression is and how a logistic regression model is built in R.

Let’s first understand logistic regression. Logistic regression is a type of regression used to predict the outcome of a categorical dependent variable (i.e., a variable that can take only a limited number of categorical values) based on one or more independent variables. For example, you might want to predict who will win the next T20 World Cup based on the players’ strengths and other details. That is a prediction of a categorical variable. Logistic regression can be binomial or multinomial.

In binomial (binary) logistic regression, the outcome can take only two possible values (e.g., “Yes” or “No”, “Success” or “Failure”). Multinomial logistic regression refers to cases where the outcome can take three or more possible values (e.g., “good” vs. “very good” vs. “best”). In binary logistic regression the outcome is generally coded as “0” and “1”. We will use binary logistic regression in the rest of this post. Now, let’s look at how a logistic regression model is built in R.
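If the outcome is stored as labels rather than numbers, it can be recoded to 0 and 1 before modeling. A minimal sketch, assuming a hypothetical character vector named result that holds the values "Yes" and "No" (it is not part of the data set used in this post):

```r
# Hypothetical outcome labels, made up for illustration
result <- c("Yes", "No", "No", "Yes", "Yes")

# Recode to 1 for "Yes" and 0 for "No"
result_coded <- ifelse(result == "Yes", 1, 0)

# Count how many observations fall into each class
table(result_coded)
```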

Logistic regression in R

To fit a logistic regression model in R, use the glm() function. It is similar to lm(), but takes additional parameters. The format is

glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit"), data = mydata)

Here, Y is the dependent variable and X1, X2 and X3 are the independent variables. The additional parameter family is set to binomial(link = "logit"), which means the probability distribution of the model is binomial and the link function is logit (refer to the book R in Action for more information). Let’s build a simple model. Suppose we want to predict whether a student will be admitted based on two exam scores. For this problem we have historical data from previous applicants, which can be used as the training data set to build the model. The data set contains the following variables.

  1. exam_1- Exam-1 score
  2. exam_2- Exam-2 score
  3. admitted- 1 if admitted or 0 if not admitted

Of these variables, admitted has the value 1 or 0 for each observation. Now we will build a model that predicts whether a student will be admitted based on the two exam scores. For this problem, admitted is the dependent variable, and exam_1 and exam_2 are the independent variables. The R code for the model is given below.

>Model_1<-glm(admitted ~ exam_1 +exam_2, family = binomial("logit"), data=data)
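The original data set is not reproduced in this post, so the following sketch simulates a stand-in with the same three columns to show the call end to end. The simulated scores, the coefficients used to generate them, and the name sim_data are assumptions for illustration only; results on the real data will differ.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-in for the historical applicant data (assumption)
n <- 200
exam_1 <- runif(n, min = 30, max = 100)
exam_2 <- runif(n, min = 30, max = 100)

# Admission probability rises with both scores; these coefficients are made up
p_admit <- plogis(-10 + 0.08 * exam_1 + 0.08 * exam_2)
admitted <- rbinom(n, size = 1, prob = p_admit)
sim_data <- data.frame(exam_1, exam_2, admitted)

# Fit the binary logistic regression model
Model_1 <- glm(admitted ~ exam_1 + exam_2,
               family = binomial(link = "logit"), data = sim_data)

# Estimated coefficients are on the log-odds (logit) scale
summary(Model_1)$coefficients
```

A positive coefficient for an exam score means that a higher score raises the estimated odds of admission.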

After building the model, let’s use it to make a prediction. Suppose a student scored 60 on exam_1 and 86 on exam_2. Will this student be admitted? The following R code predicts the probability of admission.

>in_frame<-data.frame(exam_1=60,exam_2=86)
>predict(Model_1,in_frame, type="response")
Output
0.9894302
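With type="response", predict() returns a probability, which is the inverse logit of the linear combination of the coefficients and the inputs. The sketch below redoes that calculation by hand and compares it with predict(). It uses simulated stand-in data and a model named fit (both assumptions), so the probability it produces will differ from the output above.

```r
set.seed(42)

# Simulated stand-in data with made-up generating coefficients (assumption)
n <- 200
sim_data <- data.frame(exam_1 = runif(n, 30, 100),
                       exam_2 = runif(n, 30, 100))
sim_data$admitted <- rbinom(n, 1, plogis(-10 + 0.08 * sim_data$exam_1 +
                                           0.08 * sim_data$exam_2))
fit <- glm(admitted ~ exam_1 + exam_2, family = binomial("logit"),
           data = sim_data)

in_frame <- data.frame(exam_1 = 60, exam_2 = 86)

# Linear predictor: b0 + b1 * exam_1 + b2 * exam_2 (log-odds scale)
b <- coef(fit)
log_odds <- b["(Intercept)"] + b["exam_1"] * 60 + b["exam_2"] * 86

# The inverse logit turns log-odds into a probability
p_manual <- unname(plogis(log_odds))
p_predict <- unname(predict(fit, in_frame, type = "response"))

# Both routes should agree up to floating-point tolerance
all.equal(p_manual, p_predict)
```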

Here, the output is a probability score in the range 0 to 1. If the probability score is greater than 0.5, it is treated as TRUE; if it is less than or equal to 0.5, it is treated as FALSE. In our case the final output should be 1 or 0, to indicate whether the student will be admitted: 1 means admitted, 0 means not admitted. So I have used the round() function to convert the probability score to 0 or 1, as below.

>round(predict(Model_1, in_frame, type="response"))
Output
1
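round() works here because it maps probabilities above 0.5 to 1 and those below 0.5 to 0. Writing the cutoff explicitly makes the rule easier to change and applies the same way to several students at once. A sketch, using made-up probabilities in place of model output:

```r
# Hypothetical predicted probabilities for five students (made-up values)
prob <- c(0.12, 0.47, 0.50, 0.63, 0.99)

# Explicit cutoff: 1 (admitted) if the probability is greater than 0.5, else 0
cutoff <- 0.5
pred_class <- as.integer(prob > cutoff)
pred_class

# round() gives the same labels; R rounds exactly 0.5 to the even number 0
round(prob)
```

A cutoff other than 0.5 may suit cases where one kind of mistake is more costly than the other; that choice depends on the problem and is not covered here.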

An output of 1 means the student will be admitted. We can predict for other observations in the same manner. We have now seen what logistic regression is and how it works in R. If you want to try the same exercise, click here for the R code and sample data set for the above example. In the next post, we will look at a specific problem with Google Analytics data and see how logistic regression can be applied to it.

Amar Gondaliya

Amar is a data modeling engineer at Tatvic. He focuses on building predictive models from available data using R, Hadoop and the Google Prediction API.

  • Somebody

    Thank you so much ;)