This post demonstrates how to use binary logistic regression to classify two palmetto species (Serenoa repens and Sabal etonia) in Florida. The analysis proceeds in three stages: data exploration, model building, and model selection.
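All code in this post assumes the following packages are attached. The package list is inferred from the functions used below, so adjust it to your own setup:

```r
library(tidyverse)   # select(), mutate(), drop_na(), recode_factor(), ggplot()
library(patchwork)   # stacking ggplots with the / operator
library(caret)       # train() and trainControl() for cross-validation
library(broom)       # tidy() and augment() for model output
library(kableExtra)  # kbl() and kable_classic() for formatted tables
library(AICcmodavg)  # aictab() for AICc-based model comparison
```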
# Select the columns that will be used in the study
palmetto_df <- palmetto %>%
  select(-year, -plant, -site, -habitat, -treatment, -survival, -scape,
         -new_lvs, -biomass, -canopy, -lf_long, -comments) %>%
  mutate(species = as.factor(species)) %>%
  drop_na()
# Recode plant species from "1" and "2" to the species names
palmetto_df$species <- recode_factor(palmetto_df$species,
                                     "1" = "Serenoa repens",
                                     "2" = "Sabal etonia")
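Before modeling, it is worth confirming that the two classes are roughly balanced. The counts shown here are not from the original post; they are derived from the classification totals in Table 2 below (564 + 5548 and 454 + 5701):

```r
# Class balance check; counts derived from the Table 2 totals
summary(palmetto_df$species)
#> Serenoa repens   Sabal etonia
#>           6112           6155
```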
# Exploratory scatterplots of each measurement against height, colored by species
p1 <- ggplot(data = palmetto_df, aes(x = height, y = length)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "length (cm)")

p2 <- ggplot(data = palmetto_df, aes(x = height, y = width)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "width (cm)")

p3 <- ggplot(data = palmetto_df, aes(x = height, y = green_lvs)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "count of green leaves")

# Stack the three plots vertically with patchwork's `/` operator
p1 / p2 / p3
# Model 1: all four predictors
f1 <- species ~ height + width + length + green_lvs
pal_df_blr1 <- glm(formula = f1,
                   data = palmetto_df,
                   family = 'binomial')

# Model 2: same predictors without canopy length
f2 <- species ~ height + width + green_lvs
pal_df_blr2 <- glm(formula = f2,
                   data = palmetto_df,
                   family = 'binomial')
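One detail worth noting: when the response is a factor, `glm(family = 'binomial')` models the probability of the *second* factor level, so the fitted values here are probabilities that a plant is Sabal etonia. A quick sanity check of the level ordering:

```r
# The first level is the reference class; fitted probabilities refer to the second
levels(palmetto_df$species)
#> [1] "Serenoa repens" "Sabal etonia"
```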
I use the `caret` package to automate cross-validation (10-fold, repeated 10 times).
# Set a seed so the cross-validation fold assignments are reproducible
set.seed(123)
tr_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 10)
### Train the models
model1 <- train(f1, data = palmetto_df,
                method = 'glm', family = 'binomial',
                trControl = tr_ctrl)
model1
Generalized Linear Model

12267 samples
    4 predictor
    2 classes: 'Serenoa repens', 'Sabal etonia'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 11040, 11040, 11040, 11041, 11040, 11041, ...
Resampling results:

  Accuracy   Kappa
  0.9169231  0.8338335
model2 <- train(f2, data = palmetto_df,
                method = 'glm', family = 'binomial',
                trControl = tr_ctrl)
model2
Generalized Linear Model

12267 samples
    3 predictor
    2 classes: 'Serenoa repens', 'Sabal etonia'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 11039, 11041, 11040, 11041, 11041, 11039, ...
Resampling results:

  Accuracy   Kappa
  0.8988022  0.7975858
Based on the cross-validation results, the first model outperforms the second, with higher accuracy (0.9169 > 0.8988) and a higher Kappa (0.8338 > 0.7976).
Next, I use `AICcmodavg::aictab()` to compare the two models.
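The original post shows only the `aictab()` output; a call consistent with the Mod1/Mod2 labels below would be:

```r
# Compare the two candidate models by AICc
AICcmodavg::aictab(cand.set = list(pal_df_blr1, pal_df_blr2),
                   modnames = c("Mod1", "Mod2"))
```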
Model selection based on AICc:

     K    AICc Delta_AICc AICcWt Cum.Wt       LL
Mod1 5 5194.57       0.00      1      1 -2592.28
Mod2 4 5987.48     792.91      0      1 -2989.74
According to the AICc result, the first model again performs better, with a much lower AICc value (5194.57 < 5987.48, a Delta_AICc of about 793). Therefore, I select model 1 and fit it to the entire dataset.
# Tidy the model-1 coefficients for presentation
blr1_tidy <- broom::tidy(pal_df_blr1)

blr1_tidy %>%
  # Report very small p-values as "<0.001"; otherwise show the rounded value
  mutate(p_value = ifelse(p.value < 0.001, "<0.001", as.character(round(p.value, 3)))) %>%
  select(-p.value) %>%
  kbl(caption = "Table 1: Binary Logistic Regression Results for Model 1",
      col.names = c('Variable',
                    'Coefficient Estimate',
                    'Standard Error',
                    'z-statistic',
                    'p-value'),
      digits = c(0, 5, 5, 5, 0)) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "center")
| Variable    | Coefficient Estimate | Standard Error | z-statistic | p-value |
|-------------|----------------------|----------------|-------------|---------|
| (Intercept) | 3.22669              | 0.14207        | 22.71180    | <0.001  |
| height      | -0.02922             | 0.00231        | -12.66984   | <0.001  |
| width       | 0.03944              | 0.00210        | 18.78227    | <0.001  |
| length      | 0.04582              | 0.00187        | 24.55600    | <0.001  |
| green_lvs   | -1.90847             | 0.03886        | -49.10728   | <0.001  |
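Because the model predicts the probability of Sabal etonia, the coefficients are easiest to read as odds ratios. As one worked example (my own calculation from Table 1, not part of the original output):

```r
# Each additional green leaf multiplies the odds that a plant is Sabal etonia
# by exp(-1.90847), roughly 0.148 -- so leafier plants are far more likely
# to be Serenoa repens
exp(-1.90847)
```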
# Get fitted probabilities for each observation; since 'Sabal etonia' is the
# second factor level, .fitted is the predicted probability of Sabal etonia
blr1_fitted <- pal_df_blr1 %>%
  broom::augment(type.predict = 'response') %>%
  select(-.resid, -.std.resid, -.hat, -.sigma, -.cooksd) %>%
  # Classify as Sabal etonia when the predicted probability is at least 0.5
  mutate(species_predicted = ifelse(.fitted >= 0.5, "Sabal etonia", "Serenoa repens")) %>%
  mutate(correct_classified = ifelse(species == species_predicted, "TRUE", "FALSE"))
# Create final table of classification counts and percentages by species
final_table <- blr1_fitted %>%
  select(-height, -width, -length, -green_lvs, -.fitted, -species_predicted) %>%
  group_by(species) %>%
  count(correct_classified) %>%
  rename(number_counts = n) %>%
  mutate(total_counts = sum(number_counts)) %>%
  mutate(percentage_classified = (number_counts / total_counts) * 100) %>%
  select(-total_counts)
# Format final table
final_table %>%
  kbl(caption = "Table 2: Model 1 Classification Results by Species",
      col.names = c('Species',
                    'Classified Correctly?',
                    'Count',
                    'Percentage (%)'),
      digits = c(0, 0, 0, 3)) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "center")
| Species        | Classified Correctly? | Count | Percentage (%) |
|----------------|-----------------------|-------|----------------|
| Serenoa repens | FALSE                 | 564   | 9.228          |
| Serenoa repens | TRUE                  | 5548  | 90.772         |
| Sabal etonia   | FALSE                 | 454   | 7.376          |
| Sabal etonia   | TRUE                  | 5701  | 92.624         |
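As a quick consistency check (my own arithmetic on the Table 2 counts), the overall in-sample accuracy agrees closely with the cross-validated accuracy of 0.917 reported earlier:

```r
# Overall proportion of plants classified correctly
(5548 + 5701) / 12267
#> [1] 0.9170131
```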
According to Table 2, model 1 does a good job of predicting palmetto species from plant height, canopy length, canopy width, and the count of green leaves, correctly classifying more than 90% of observations for both species. Between the two species, model 1 is more accurate for Sabal etonia than for Serenoa repens (92.62% > 90.77%).