Binary Logistic Regression

This post demonstrates how to use binary logistic regression to classify two palmetto species found in Florida. The analysis proceeds in three steps: data exploration, model building, and model selection.

Yutian Fang
2022-03-13
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

library(tidyverse)
library(here)
library(broom)
library(caret)
library(AICcmodavg)
library(patchwork)
library(kableExtra)

Overview of the data: the palmetto dataset records height, canopy length, canopy width, and counts of green leaves (among other variables) for two Florida palmetto species, Serenoa repens and Sabal etonia.

Read in data

palmetto <- read_csv(here('data', 'palmetto.csv'),
                     show_col_types = FALSE)

Pre-process the data: keep only the variables used in the analysis, convert species to a factor, and drop rows with missing values.

# Keep only the columns used in the study
palmetto_df <- palmetto %>%
  select(species, height, length, width, green_lvs) %>%
  mutate(species = as.factor(species)) %>%
  drop_na()

# Change plant species from "1" and "2" to species name
palmetto_df$species <- recode_factor(palmetto_df$species, "1" = "Serenoa repens", "2" = "Sabal etonia")
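
Before modeling, it is worth confirming the factor level order, since glm() with family = 'binomial' treats the first level as 0 and the second as 1; that order determines which species the models below predict. A minimal check:

levels(palmetto_df$species)
# The first level ("Serenoa repens") is the reference (coded 0), so a fitted
# model will estimate P(species == "Sabal etonia")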

Create three graphs to explore which variables are most likely to help classify the species

p1 <- ggplot(data = palmetto_df, aes(x = height, y = length)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "length (cm)")

p2 <- ggplot(data = palmetto_df, aes(x = height, y = width)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "width (cm)")

p3 <- ggplot(data = palmetto_df, aes(x = height, y = green_lvs)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "count of green leaves")

p1/p2/p3

Figure 1: This compound figure is composed of three sub-figures: the first shows the relationship between height and canopy length, the second between height and canopy width, and the third between height and the count of green leaves. Point color indicates species. The count of green leaves appears to do the best job of separating the two species, as the two colors overlap least in the third sub-figure.

Build binary logistic regression models

Model 1: use height, width, length, and count of green leaves to predict palmetto species

f1 <- species~height + width + length + green_lvs

pal_df_blr1 <- glm(formula = f1,
                   data = palmetto_df,
                   family = 'binomial')
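
With a factor response and family = 'binomial', glm() models the log odds of the second factor level, here Sabal etonia. Model 1 therefore estimates

$$\ln\frac{P(\text{Sabal etonia})}{1 - P(\text{Sabal etonia})} = \beta_0 + \beta_1 \, \text{height} + \beta_2 \, \text{width} + \beta_3 \, \text{length} + \beta_4 \, \text{green\_lvs}$$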

Model 2: use height, width, and count of green leaves to predict palmetto species

f2 <- species~height + width + green_lvs

pal_df_blr2 <- glm(formula = f2,
                   data = palmetto_df,
                   family = 'binomial')

Model selection based on AIC and cross-validation

Use the caret package to automate cross-validation (10-fold, repeated 10 times)

set.seed(123)

tr_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 10)

### Train the models

model1 <- train(f1, data = palmetto_df,
                method = 'glm', family = 'binomial',
                trControl = tr_ctrl)

model1
Generalized Linear Model 

12267 samples
    4 predictor
    2 classes: 'Serenoa repens', 'Sabal etonia' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 11040, 11040, 11040, 11041, 11040, 11041, ... 
Resampling results:

  Accuracy   Kappa    
  0.9169231  0.8338335
model2 <- train(f2, data = palmetto_df,
                method = 'glm', family = 'binomial',
                trControl = tr_ctrl)

model2
Generalized Linear Model 

12267 samples
    3 predictor
    2 classes: 'Serenoa repens', 'Sabal etonia' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 11039, 11041, 11040, 11041, 11041, 11039, ... 
Resampling results:

  Accuracy   Kappa    
  0.8988022  0.7975858

According to the accuracy results, the first model performs better than the second, with a higher mean cross-validated accuracy (0.9169 > 0.8988).
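
For a side-by-side view of the two cross-validation runs, caret's resamples() can collect the accuracy and kappa values from all 100 resamples of each model. A minimal sketch, using the two train objects above:

# Collect and summarize the resampling results of both models
cv_results <- resamples(list(model_1 = model1, model_2 = model2))
summary(cv_results)   # Accuracy and Kappa distributions across the resamples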

Use AICcmodavg::aictab to compare the model results

AICcmodavg::aictab(list(pal_df_blr1, pal_df_blr2))

Model selection based on AICc:

     K    AICc Delta_AICc AICcWt Cum.Wt       LL
Mod1 5 5194.57       0.00      1      1 -2592.28
Mod2 4 5987.48     792.91      0      1 -2989.74

According to the AICc table, the first model performs better than the second, with a lower AICc value (5194.57 < 5987.48).
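
Since the sample size (12,267) is large relative to the number of parameters, AICc is nearly identical to the uncorrected AIC, so base R's AIC() should give the same ranking; a quick cross-check:

# Cross-check the ranking with the uncorrected AIC
AIC(pal_df_blr1, pal_df_blr2)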

Therefore, I select model 1, fit to the entire dataset, for the final classification.

Formatted Model 1 table

blr1_tidy <- broom::tidy(pal_df_blr1)

blr1_tidy %>%
  # Report very small p-values as "<0.001" and round the rest
  mutate(p_value = ifelse(p.value < 0.001, "<0.001", as.character(round(p.value, 3)))) %>%
  select(-p.value) %>%
  kbl(caption = "Table 1: Binary Logistic Regression Results for Model 1",
      col.names = c('Variable',
                    'Coefficient Estimate',
                    'Standard Error',
                    'Statistic',
                    'P Value'),
      digits = c(0, 5, 5, 5, 0)) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "center")
Table 1: Binary Logistic Regression Results for Model 1

Variable       Coefficient Estimate   Standard Error   Statistic   P Value
(Intercept)                 3.22669          0.14207    22.71180    <0.001
height                     -0.02922          0.00231   -12.66984    <0.001
width                       0.03944          0.00210    18.78227    <0.001
length                      0.04582          0.00187    24.55600    <0.001
green_lvs                  -1.90847          0.03886   -49.10728    <0.001
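
Because the coefficients are on the log-odds scale, exponentiating them converts each one into an odds ratio for Sabal etonia; for example, exp(-1.90847) ≈ 0.148 suggests that each additional green leaf multiplies the odds of a plant being Sabal etonia by roughly 0.15, holding the other variables constant. A minimal sketch:

# Convert the log-odds coefficients into odds ratios for Sabal etonia
exp(coef(pal_df_blr1))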

Classify the original data with model 1

Create species predictions from model 1 and examine whether they are accurate

blr1_fitted <- pal_df_blr1 %>%
  broom::augment(type.predict = 'response') %>%  # .fitted = P(Sabal etonia)
  select(-.resid, -.std.resid, -.hat, -.sigma, -.cooksd) %>%
  # Classify as Sabal etonia when the fitted probability is at least 0.5
  mutate(species_predicted = ifelse(.fitted >= 0.5, "Sabal etonia", "Serenoa repens")) %>%
  mutate(correct_classified = ifelse(species == species_predicted, "TRUE", "FALSE"))
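
As a quick cross-check before aggregating, caret's confusionMatrix() summarizes the same predicted-versus-actual counts that Table 2 tabulates below. A minimal sketch using the blr1_fitted data frame:

# Tabulate predicted vs. actual species and report overall accuracy
cm <- confusionMatrix(
  data = factor(blr1_fitted$species_predicted,
                levels = levels(blr1_fitted$species)),
  reference = blr1_fitted$species)
cm$table                  # predicted vs. actual counts
cm$overall['Accuracy']    # overall proportion classified correctly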

Create and format the final classification table

# Create final table
final_table <- blr1_fitted %>%
  select(-height,-width,-length,-green_lvs,-.fitted,-species_predicted) %>%
  group_by(species) %>%
  count(correct_classified) %>%
  rename(number_counts = n) %>%
  mutate(total_counts = sum(number_counts)) %>%
  mutate(percentage_classified = (number_counts/total_counts)*100) %>%
  select(-total_counts)

# Format final table
final_table %>%
  kbl(caption = "Table 2: Classification Results for Model 1",
      col.names = c('Species',
                    'Classified Correctly?',
                    'Count',
                    'Percentage (%)'),
      digits = c(0, 0, 0, 3)) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "center")
Table 2: Classification Results for Model 1

Species          Classified Correctly?   Count   Percentage (%)
Serenoa repens   FALSE                     564            9.228
Serenoa repens   TRUE                     5548           90.772
Sabal etonia     FALSE                     454            7.376
Sabal etonia     TRUE                     5701           92.624

According to Table 2, model 1 does a good job of predicting palmetto species from height, canopy length, canopy width, and count of green leaves, with more than 90% of plants classified correctly for both species. Between the two, model 1 is more accurate for Sabal etonia than for Serenoa repens (92.62% > 90.77%).
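
Finally, the selected model can classify a plant outside the original data. A minimal sketch, where the measurements are hypothetical values chosen only for illustration:

# Hypothetical measurements for a single new plant (not from the dataset)
new_plant <- data.frame(height = 100, width = 120, length = 110, green_lvs = 8)

# predict() with type = 'response' returns P(Sabal etonia)
p_sabal <- predict(pal_df_blr1, newdata = new_plant, type = 'response')
ifelse(p_sabal >= 0.5, "Sabal etonia", "Serenoa repens")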