This post demonstrates how to use binary logistic regression to classify two palmetto species found in Florida. The analysis proceeds in three steps: data exploration, model fitting, and model selection.
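For context, the code below assumes a setup chunk along these lines; the exact package list and the data file name palmetto.csv are assumptions inferred from the functions used later in the post.
library(tidyverse)    # wrangling (dplyr, tidyr) and ggplot2
library(patchwork)    # stacking ggplots with the / operator
library(caret)        # repeated k-fold cross-validation
library(AICcmodavg)   # AICc model comparison
library(broom)        # tidy() and augment() for model output
library(kableExtra)   # kbl() and kable_classic() tables

# Read in the raw palmetto data (file name is an assumption)
palmetto <- read_csv("palmetto.csv")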
# Keep only the columns used in the study, convert species to a factor, and drop missing values
palmetto_df <- palmetto %>%
  select(-year, -plant, -site, -habitat, -treatment, -survival, -scape,
         -new_lvs, -biomass, -canopy, -lf_long, -comments) %>%
  mutate(species = as.factor(species)) %>%
  drop_na()
# Change plant species from "1" and "2" to species name
palmetto_df$species <- recode_factor(palmetto_df$species, "1" = "Serenoa repens", "2" = "Sabal etonia")
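Because glm() with a binomial family models the probability of the second factor level, it is worth confirming the level order that recode_factor() produced; this quick check is not in the original code.
# recode_factor() sets the levels in the order of the replacements, so
# "Serenoa repens" is the reference level and the fitted probabilities
# later in the post refer to "Sabal etonia"
levels(palmetto_df$species)
# [1] "Serenoa repens" "Sabal etonia"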
p1 <- ggplot(data = palmetto_df, aes(x = height, y = length)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "length (cm)")

p2 <- ggplot(data = palmetto_df, aes(x = height, y = width)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "width (cm)")

p3 <- ggplot(data = palmetto_df, aes(x = height, y = green_lvs)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "count of green leaves")

# Stack the three plots vertically with patchwork
p1 / p2 / p3
Figure 1: This compound figure contains three sub-figures: the first shows the relationship between height and canopy length, the second between height and canopy width, and the third between height and the count of green leaves. Point color indicates plant species. Based on these plots, the count of green leaves appears to do the best job of separating the two species: the two colors overlap least in the third sub-figure.
# Model 1: predict species from height, canopy width, canopy length, and green leaf count
f1 <- species ~ height + width + length + green_lvs
pal_df_blr1 <- glm(formula = f1,
                   data = palmetto_df,
                   family = 'binomial')

# Model 2: drop canopy length as a predictor
f2 <- species ~ height + width + green_lvs
pal_df_blr2 <- glm(formula = f2,
                   data = palmetto_df,
                   family = 'binomial')
I use the caret package to automate cross-validation (10-fold cross-validation, repeated 10 times).
# Set a seed so the cross-validation folds are reproducible
set.seed(123)
tr_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 10)
### Train and cross-validate model 1
model1 <- train(f1, data = palmetto_df,
                method = 'glm', family = 'binomial',
                trControl = tr_ctrl)
model1
Generalized Linear Model
12267 samples
4 predictor
2 classes: 'Serenoa repens', 'Sabal etonia'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 11040, 11040, 11040, 11041, 11040, 11041, ...
Resampling results:
Accuracy Kappa
0.9169231 0.8338335
### Train and cross-validate model 2
model2 <- train(f2, data = palmetto_df,
                method = 'glm', family = 'binomial',
                trControl = tr_ctrl)
model2
Generalized Linear Model
12267 samples
3 predictor
2 classes: 'Serenoa repens', 'Sabal etonia'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 11039, 11041, 11040, 11041, 11041, 11039, ...
Resampling results:
Accuracy Kappa
0.8988022 0.7975858
Based on the cross-validation results, model 1 outperforms model 2, with a higher mean accuracy (0.9169 > 0.8988).
I then use AICcmodavg::aictab() to compare the two models.
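The original post does not show the aictab() call itself; the output below is consistent with a call of roughly this form (the candidate list and the model names Mod1/Mod2 are assumptions):
# Compare the two fitted models by AICc
AICcmodavg::aictab(cand.set = list(pal_df_blr1, pal_df_blr2),
                   modnames = c("Mod1", "Mod2"))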
Model selection based on AICc:
K AICc Delta_AICc AICcWt Cum.Wt LL
Mod1 5 5194.57 0.00 1 1 -2592.28
Mod2 4 5987.48 792.91 0 1 -2989.74
The AICc result agrees: model 1 performs better than model 2, with a substantially lower AICc value (5194.57 < 5987.48).
Therefore, I select model 1 and report its fit to the entire dataset.
blr1_tidy <- broom::tidy(pal_df_blr1)

blr1_tidy %>%
  # Report very small p-values as "<0.001"; otherwise show the rounded value
  mutate(p_value = ifelse(p.value < 0.001, "<0.001", as.character(round(p.value, 3)))) %>%
  select(-p.value) %>%
  kbl(caption = "Table 1: Binary logistic regression results for model 1",
      col.names = c('Variable',
                    'Coefficient estimate',
                    'Standard error',
                    'z-statistic',
                    'p-value'),
      digits = c(0, 5, 5, 5, 0)) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "center")
Variable | Coefficient estimate | Standard error | z-statistic | p-value |
---|---|---|---|---|
(Intercept) | 3.22669 | 0.14207 | 22.71180 | <0.001 |
height | -0.02922 | 0.00231 | -12.66984 | <0.001 |
width | 0.03944 | 0.00210 | 18.78227 | <0.001 |
length | 0.04582 | 0.00187 | 24.55600 | <0.001 |
green_lvs | -1.90847 | 0.03886 | -49.10728 | <0.001 |
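The coefficients above are on the log-odds scale for Sabal etonia (the second factor level). A quick, optional way to read them is to exponentiate into odds ratios; this check is not part of the original analysis.
# Odds ratios for model 1: e.g. each additional green leaf multiplies the odds
# that a plant is Sabal etonia by about exp(-1.908) ≈ 0.148
exp(coef(pal_df_blr1))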
# Because "Sabal etonia" is the second factor level, .fitted is the predicted
# probability that a plant is Sabal etonia; classify with a 0.5 cutoff
blr1_fitted <- pal_df_blr1 %>%
  broom::augment(type.predict = 'response') %>%
  select(-.resid, -.std.resid, -.hat, -.sigma, -.cooksd) %>%
  mutate(species_predicted = ifelse(.fitted >= 0.5, "Sabal etonia", "Serenoa repens")) %>%
  mutate(correct_classified = ifelse(species == species_predicted, "TRUE", "FALSE"))
# Create final table
final_table <- blr1_fitted %>%
  select(-height, -width, -length, -green_lvs, -.fitted, -species_predicted) %>%
  group_by(species) %>%
  count(correct_classified) %>%
  rename(number_counts = n) %>%
  mutate(total_counts = sum(number_counts)) %>%
  mutate(percentage_classified = (number_counts / total_counts) * 100) %>%
  select(-total_counts)
# Format final table
final_table %>%
  kbl(caption = "Table 2: Classification results for model 1",
      col.names = c('Species',
                    'Correctly classified?',
                    'Count',
                    'Percentage of species (%)'),
      digits = c(0, 0, 0, 3)) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "center")
Species | Correctly classified? | Count | Percentage of species (%) |
---|---|---|---|
Serenoa repens | FALSE | 564 | 9.228 |
Serenoa repens | TRUE | 5548 | 90.772 |
Sabal etonia | FALSE | 454 | 7.376 |
Sabal etonia | TRUE | 5701 | 92.624 |
As Table 2 shows, model 1 does a good job of predicting palmetto species from height, canopy length, canopy width, and count of green leaves, classifying more than 90% of observations correctly for both species. Between the two, model 1 is slightly more accurate for Sabal etonia than for Serenoa repens (92.62% vs. 90.77%).
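As a final sanity check (not in the original post), the overall accuracy of model 1 on the full dataset can be computed directly from blr1_fitted; from the counts in Table 2 it works out to (5548 + 5701) / 12267 ≈ 0.917, in line with the cross-validated accuracy reported above.
# Proportion of all observations classified correctly by model 1
mean(blr1_fitted$correct_classified == "TRUE")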