This post demonstrates how to use binary logistic regression to classify two palmetto species (Serenoa repens and Sabal etonia) in Florida. The analysis proceeds in three stages: data exploration, model building, and model selection.
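All code in this post assumes the following packages are attached. The package list is inferred from the functions used below, so adjust it to your own setup:

```r
library(tidyverse)   # select(), mutate(), drop_na(), recode_factor(), ggplot()
library(patchwork)   # stacking ggplots with the / operator
library(caret)       # train() and trainControl() for cross-validation
library(broom)       # tidy() and augment() for model output
library(kableExtra)  # kbl() and kable_classic() for formatted tables
library(AICcmodavg)  # aictab() for AICc-based model comparison
```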
# Select the columns that will be used in the study
palmetto_df <- palmetto %>%
  select(-year, -plant, -site, -habitat, -treatment, -survival, -scape,
         -new_lvs, -biomass, -canopy, -lf_long, -comments) %>%
  mutate(species = as.factor(species)) %>%
  drop_na()
# Recode plant species from "1" and "2" to the species names
palmetto_df$species <- recode_factor(palmetto_df$species,
                                     "1" = "Serenoa repens",
                                     "2" = "Sabal etonia")
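Before modeling, it is worth confirming that the two classes are roughly balanced. The counts shown here are not from the original post; they are derived from the classification totals in Table 2 below (564 + 5548 and 454 + 5701):

```r
# Class balance check; counts derived from the Table 2 totals
summary(palmetto_df$species)
#> Serenoa repens   Sabal etonia
#>           6112           6155
```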
# Exploratory scatterplots of each measurement against height, colored by species
p1 <- ggplot(data = palmetto_df, aes(x = height, y = length)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "length (cm)")

p2 <- ggplot(data = palmetto_df, aes(x = height, y = width)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "width (cm)")

p3 <- ggplot(data = palmetto_df, aes(x = height, y = green_lvs)) +
  geom_point(aes(color = species)) +
  labs(x = "height (cm)", y = "count of green leaves")

# Stack the three plots vertically with patchwork's `/` operator
p1 / p2 / p3
# Model 1: all four predictors
f1 <- species ~ height + width + length + green_lvs
pal_df_blr1 <- glm(formula = f1,
                   data = palmetto_df,
                   family = 'binomial')

# Model 2: same predictors without canopy length
f2 <- species ~ height + width + green_lvs
pal_df_blr2 <- glm(formula = f2,
                   data = palmetto_df,
                   family = 'binomial')
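One detail worth noting: when the response is a factor, `glm(family = 'binomial')` models the probability of the *second* factor level, so the fitted values here are probabilities that a plant is Sabal etonia. A quick sanity check of the level ordering:

```r
# The first level is the reference class; fitted probabilities refer to the second
levels(palmetto_df$species)
#> [1] "Serenoa repens" "Sabal etonia"
```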
I use the `caret` package to automate cross-validation (10-fold, repeated 10 times).
# Set a seed so the cross-validation fold assignments are reproducible
set.seed(123)
tr_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 10)
### Train the models
model1 <- train(f1, data = palmetto_df,
                method = 'glm', family = 'binomial',
                trControl = tr_ctrl)
model1
Generalized Linear Model

12267 samples
    4 predictor
    2 classes: 'Serenoa repens', 'Sabal etonia'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 11040, 11040, 11040, 11041, 11040, 11041, ...
Resampling results:

  Accuracy   Kappa
  0.9169231  0.8338335
model2 <- train(f2, data = palmetto_df,
                method = 'glm', family = 'binomial',
                trControl = tr_ctrl)
model2
Generalized Linear Model

12267 samples
    3 predictor
    2 classes: 'Serenoa repens', 'Sabal etonia'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 11039, 11041, 11040, 11041, 11041, 11039, ...
Resampling results:

  Accuracy   Kappa
  0.8988022  0.7975858
Based on the cross-validation results, the first model outperforms the second, with higher accuracy (0.9169 > 0.8988) and a higher Kappa (0.8338 > 0.7976).
Next, I use `AICcmodavg::aictab()` to compare the two models.
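The original post shows only the `aictab()` output; a call consistent with the Mod1/Mod2 labels below would be:

```r
# Compare the two candidate models by AICc
AICcmodavg::aictab(cand.set = list(pal_df_blr1, pal_df_blr2),
                   modnames = c("Mod1", "Mod2"))
```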
Model selection based on AICc:

     K    AICc Delta_AICc AICcWt Cum.Wt       LL
Mod1 5 5194.57       0.00      1      1 -2592.28
Mod2 4 5987.48     792.91      0      1 -2989.74
According to the AICc result, the first model again performs better, with a much lower AICc value (5194.57 < 5987.48, a Delta_AICc of about 793). Therefore, I select model 1 and fit it to the entire dataset.
# Tidy the model-1 coefficients for presentation
blr1_tidy <- broom::tidy(pal_df_blr1)

blr1_tidy %>%
  # Report very small p-values as "<0.001"; otherwise show the rounded value
  mutate(p_value = ifelse(p.value < 0.001, "<0.001", as.character(round(p.value, 3)))) %>%
  select(-p.value) %>%
  kbl(caption = "Table 1: Binary Logistic Regression Results for Model 1",
      col.names = c('Variable',
                    'Coefficient Estimate',
                    'Standard Error',
                    'z-statistic',
                    'p-value'),
      digits = c(0, 5, 5, 5, 0)) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "center")
| Variable    | Coefficient Estimate | Standard Error | z-statistic | p-value |
|-------------|----------------------|----------------|-------------|---------|
| (Intercept) | 3.22669              | 0.14207        | 22.71180    | <0.001  |
| height      | -0.02922             | 0.00231        | -12.66984   | <0.001  |
| width       | 0.03944              | 0.00210        | 18.78227    | <0.001  |
| length      | 0.04582              | 0.00187        | 24.55600    | <0.001  |
| green_lvs   | -1.90847             | 0.03886        | -49.10728   | <0.001  |
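Because the model predicts the probability of Sabal etonia, the coefficients are easiest to read as odds ratios. As one worked example (my own calculation from Table 1, not part of the original output):

```r
# Each additional green leaf multiplies the odds that a plant is Sabal etonia
# by exp(-1.90847), roughly 0.148 -- so leafier plants are far more likely
# to be Serenoa repens
exp(-1.90847)
```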
# Get fitted probabilities for each observation; since 'Sabal etonia' is the
# second factor level, .fitted is the predicted probability of Sabal etonia
blr1_fitted <- pal_df_blr1 %>%
  broom::augment(type.predict = 'response') %>%
  select(-.resid, -.std.resid, -.hat, -.sigma, -.cooksd) %>%
  # Classify as Sabal etonia when the predicted probability is at least 0.5
  mutate(species_predicted = ifelse(.fitted >= 0.5, "Sabal etonia", "Serenoa repens")) %>%
  mutate(correct_classified = ifelse(species == species_predicted, "TRUE", "FALSE"))
# Create final table of classification counts and percentages by species
final_table <- blr1_fitted %>%
  select(-height, -width, -length, -green_lvs, -.fitted, -species_predicted) %>%
  group_by(species) %>%
  count(correct_classified) %>%
  rename(number_counts = n) %>%
  mutate(total_counts = sum(number_counts)) %>%
  mutate(percentage_classified = (number_counts / total_counts) * 100) %>%
  select(-total_counts)
# Format final table
final_table %>%
  kbl(caption = "Table 2: Model 1 Classification Results by Species",
      col.names = c('Species',
                    'Classified Correctly?',
                    'Count',
                    'Percentage (%)'),
      digits = c(0, 0, 0, 3)) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "center")
| Species        | Classified Correctly? | Count | Percentage (%) |
|----------------|-----------------------|-------|----------------|
| Serenoa repens | FALSE                 | 564   | 9.228          |
| Serenoa repens | TRUE                  | 5548  | 90.772         |
| Sabal etonia   | FALSE                 | 454   | 7.376          |
| Sabal etonia   | TRUE                  | 5701  | 92.624         |
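As a quick consistency check (my own arithmetic on the Table 2 counts), the overall in-sample accuracy agrees closely with the cross-validated accuracy of 0.917 reported earlier:

```r
# Overall proportion of plants classified correctly
(5548 + 5701) / 12267
#> [1] 0.9170131
```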
According to Table 2, model 1 does a good job of predicting palmetto species from plant height, canopy length, canopy width, and the count of green leaves, correctly classifying more than 90% of observations for both species. Between the two species, model 1 is more accurate for Sabal etonia than for Serenoa repens (92.62% > 90.77%).