📈 Interpreting a logistic regression using R

Dec 19, 2023 · Andres Hernandez · 6 min read

Welcome back! This post is the direct follow-up to my last article, “The Logic Behind Logistic Regression.”

In that post, we covered the theory. Now, it’s time to get our hands dirty.

We’re going to take a real-world dataset and walk through a complete example of implementing and interpreting a logistic regression model in R. The code itself is surprisingly short, but the insights we can pull from it are powerful.

This tutorial is broken down into three simple steps:

  1. Getting the Data
  2. Building the Model
  3. Interpreting the Results

Let’s get started.


📊 1. Reading the Data

First, we need to load our data. I’m using a public dataset called Social_Network_Ads.csv.

We’ll start by setting our working directory and loading the data.

# Set the working directory to your file's location
setwd('Path_to_your_csv_file')

# Read the CSV file
input_dataset <- read.csv('./Social_Network_Ads.csv', na.strings="NA")

# View the first few rows of the data
head(input_dataset)

The head command shows us the first few rows of our data:

   User.ID Gender Age EstimatedSalary Purchased
1 15624510   Male  19           19000         0
2 15810944   Male  35           20000         0
3 15668575 Female  26           43000         0
4 15603246 Female  27           57000         0
5 15804002   Male  19           76000         0
6 15728773   Male  27           58000         0

Our goal is to predict the Purchased column (our dichotomous 0 or 1 outcome) using Gender, Age, and EstimatedSalary as our independent variables.

A Quick Bit of Data Prep

Before we model, we have to tell R that Gender is a categorical variable (a “factor”). We’ll also set “Female” as the baseline (or “reference level”) to compare against.

Why do this? This changes the interpretation of the Gender coefficient from “what is the effect of ‘Male’?” to “what is the effect of ‘Male’ compared to ‘Female’?”

# Convert Gender to a factor variable
input_dataset$Gender <- as.factor(input_dataset$Gender)

# Set 'Female' as the reference level
ref_level <- 'Female'
input_dataset$Gender <- relevel(input_dataset$Gender, ref_level)

# Convert output to factor variable
input_dataset$Purchased <- as.factor(input_dataset$Purchased)
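
If you want to confirm which category R will treat as the reference, a quick optional check is to inspect the factor levels; the first level listed is the baseline:

# Check the factor levels; the first one is the reference category
levels(input_dataset$Gender)
# [1] "Female" "Male"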

Quick Exploratory Data Analysis (EDA)

It’s always smart to peek at your data’s structure.

# See the structure of the data (variable types)
str(input_dataset)
'data.frame':	400 obs. of  5 variables:
 $ User.ID        : int  15624510 15810944 15668575 15603246 ...
 $ Gender         : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 2 ...
 $ Age            : int  19 35 26 27 19 27 ...
 $ EstimatedSalary: int  19000 20000 43000 57000 76000 58000 ...
 $ Purchased      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 ...

And get a statistical summary:

summary(input_dataset)
    User.ID            Gender         Age        EstimatedSalary  Purchased
 Min.   :15566689   Female:204   Min.   :18.00   Min.   : 15000   0:257
 1st Qu.:15626764   Male  :196   1st Qu.:29.75   1st Qu.: 43000   1:143
 Median :15694341                Median :37.00   Median : 70000
 Mean   :15691539                Mean   :37.66   Mean   : 69742
 3rd Qu.:15750363                3rd Qu.:46.00   3rd Qu.: 88000
 Max.   :15815236                Max.   :60.00   Max.   :150000

Key observations:

  • Our Gender variable is nicely balanced (204 Female, 196 Male).
  • Our Purchased variable is unbalanced (143 “1s” vs 257 “0s”), but not so extreme that we need special handling for this example (a quick way to verify the balance is shown just below).
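
If you want to verify those counts yourself, a one-line check with table() (and prop.table() for proportions) is enough:

# How balanced is the outcome variable?
table(input_dataset$Purchased)
prop.table(table(input_dataset$Purchased))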

🚀 2. Implementing the Model

This is the best part. Building the logistic regression model in R is a single line of code.

We use the glm() function, which stands for Generalized Linear Model.

# Create the logistic regression model
logistic_classifier <- glm(formula = Purchased ~ Gender + Age + EstimatedSalary,
                           family = 'binomial',
                           data = na.omit(input_dataset))

Let’s quickly break that down:

  • formula = Purchased ~ Gender + Age + EstimatedSalary: This is our model. We’re predicting Purchased using (~) Gender, Age, and Salary.
  • family = ‘binomial’: This is the magic key. It tells glm() to perform a logistic regression (for a binomial outcome) instead of a standard linear regression.
  • data = na.omit(…): We tell R to use our dataset and to simply ignore rows with missing data for this example.
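
Before we move on to interpretation, an optional sanity check is to look at the fitted probabilities; with family = 'binomial', the standard predict() call with type = "response" returns values between 0 and 1:

# Predicted purchase probabilities for the first few users (optional check)
head(predict(logistic_classifier, type = "response"))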

📈 3. Interpreting the Results

Now, let’s see what our model found. We just call summary() on the model we created.

# Get the model results
summary(logistic_classifier)

This gives us the main output table. The Coefficients section is what we care about most.

Coefficients:

[Image: coefficients table from the summary() output]

Is My Variable Significant? (The p-value)

Look at the Pr(>|z|) column (the p-value). This tells us whether a variable has a statistically significant effect (a snippet for pulling these values out programmatically follows the list below).

  • Age has a p-value of < 2e-16 (which is tiny) and three stars (***). This means Age is highly significant.
  • GenderMale has a p-value of 0.274. This is much higher than our usual cutoff of 0.05. This means Gender is not a significant predictor in this model.
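
If you prefer to extract these p-values from the model rather than reading them off the printed table, the coefficient matrix returned by summary() contains them; a minimal sketch:

# Extract the coefficient table; the last column holds the p-values
coef_table <- coef(summary(logistic_classifier))
coef_table[, "Pr(>|z|)"]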

What’s the Null Hypothesis?

The p-value tests the “null hypothesis,” which for any variable is that its coefficient $\beta$ is zero.

If $\beta$ = 0, its odds ratio ($e^\beta$) is 1. This means the variable has no effect on the odds.

A tiny p-value (like for Age) lets us “reject the null hypothesis” and conclude our variable does have a real effect.
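
An equivalent way to see this is through confidence intervals on the odds ratios: if the 95% interval for $e^\beta$ contains 1, the variable is not significant at the 5% level. A minimal sketch using Wald intervals (confint.default() from base R):

# 95% Wald confidence intervals, transformed to the odds-ratio scale
exp(confint.default(logistic_classifier))
# An interval that contains 1 means that variable's effect is not significant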

The Magic: Interpreting Coefficients as Odds

The Estimate column is in log-odds, which are hard to understand. We need to convert them into odds ratios ($e^\beta$) to make them interpretable.

We can run this simple code to convert all our coefficients:

# Convert the coefficients to percentage change in odds
(exp(coef(logistic_classifier)) - 1) * 100

This gives us the percentage change in odds for a one-unit increase in each variable.

Variable            Percentage Change in Odds
(Intercept)                     -99.999719367
GenderMale                       39.632444566
Age                              26.740233562
EstimatedSalary                   0.003644185

Now we can make clear, human-readable interpretations:

  • GenderMale: 39.63% Interpretation: “Being male increases the odds of purchasing by 39.63% compared to being female, holding all other variables constant.” (Remember, though, this variable was not statistically significant!)

  • Age: 26.74% Interpretation: “For each additional year of age, the odds of purchasing increase by 26.74%.”

  • EstimatedSalary: 0.003644185 Interpretation: This number is tiny because its unit is a single dollar, so it is easier to read after rescaling: for every $10,000 increase in salary, the odds of purchasing increase by roughly 44% (the calculation is shown below).
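
To get that scaled figure, multiply the log-odds coefficient by 10,000 before exponentiating; a short calculation based on the fitted model above:

# Percentage change in odds for a $10,000 increase in salary
salary_coef <- coef(logistic_classifier)["EstimatedSalary"]
(exp(salary_coef * 10000) - 1) * 100
# Roughly a 44% increase in the odds of purchasing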

How Good is the Model Overall?

Finally, look at the bottom of the summary() output.

  • Null Deviance (509.9): This is the “badness of fit” for a model with no variables (an “empty” model).
  • Residual Deviance (264.1): This is the “badness of fit” for our model.
  • The fact that our Residual Deviance is much lower than the Null Deviance (a drop of ~246) shows that our set of variables (Age, Gender, Salary) is much better at predicting the outcome than just guessing (this drop can be turned into a formal test, as shown after this list).
  • AIC (Akaike Information Criterion): This is another fit measure. It’s mainly useful for comparing models. If you built another model with different variables, you would prefer the one with the lower AIC.
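
If you want to turn that drop in deviance into a formal test, it follows a chi-squared distribution (with degrees of freedom equal to the number of added predictors) under the null hypothesis that none of the predictors matter. A minimal sketch using the values stored in the fitted model:

# Likelihood-ratio test: does the full model beat the intercept-only model?
dev_drop <- logistic_classifier$null.deviance - logistic_classifier$deviance
df_drop  <- logistic_classifier$df.null - logistic_classifier$df.residual
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
# A p-value near zero means the predictors jointly improve the fit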

✅ Conclusion

And that’s it! From just a few lines of R, we were able to:

  • Load and prepare data.
  • Build a logistic regression model.
  • Determine which variables were significant predictors (Age).
  • Translate cryptic log-odds into clear, interpretable percentages (“a 26.7% increase in odds per year”).
  • Confirm that our model, as a whole, is a good fit for the data.

Hopefully, by combining the theory from the last post with this practical example, you now have a solid grasp of how (and why) to use logistic regression.

Please leave a comment if you have any questions!

All code and data can be downloaded from my blog’s dedicated GitHub account.

References

Most of the content was reviewed in detail from:

  1. Pampel, F. C. (2000). Logistic Regression. Quantitative Applications in the Social Sciences. Thousand Oaks, CA: SAGE Publications. doi: 10.4135/9781412984805. URL: http://methods.sagepub.com/book/logistic-regression/n1.xml