Demo Post 2

decision trees
machine learning
arrests
Author

Jane Doe

Published

September 5, 2023

Understanding the Data

We are looking at arrests data by state. The data set has 50 rows (one for each state) and four variables.

glimpse(USArrests)
Rows: 50
Columns: 4
$ Murder   <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4, 5.3, 2.…
$ Assault  <int> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46, 120, 24…
$ UrbanPop <int> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 65, 57, 6…
$ Rape     <dbl> 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9, 25.8, 2…

Each of the variables is a numeric, continuous data type. We have arrests per 100,000 people for three violent crimes: assault, murder, and rape. We also have a column indicating the percentage of each state's population living in urban areas. Before proceeding with prediction, we note that tree-based techniques can be unstable when the predictors are highly correlated with one another. We can also check whether there are any extreme skews in the data.

library(GGally)
ggpairs(USArrests)

We do see some positive relationships and a few stronger correlations, but maybe not quite enough to get us in trouble.
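To put numbers on those pairwise relationships, we can look at the correlation matrix directly. This is a quick sketch, assuming base R's default Pearson correlation is what we want, and the 0.8 cutoff is just a common rule of thumb, not a hard rule:

```r
# Pairwise Pearson correlations, rounded for readability
cm <- cor(USArrests)
round(cm, 2)

# Flag any predictor pairs whose absolute correlation exceeds 0.8
which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
```

If any pairs are flagged, that is where we would keep an eye out for unstable splits.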

Now let's try to predict Murder using the other features.

library(rpart)
library(rpart.plot)

dt = rpart(Murder ~ .,
           data = USArrests)
rpart.plot(dt)
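rpart grows the tree with built-in ten-fold cross-validation, so before leaning on this fit we can inspect how the cross-validated error changes with tree size. A minimal sketch, using printcp() and plotcp() from the rpart package:

```r
library(rpart)

# Same model as above
dt <- rpart(Murder ~ ., data = USArrests)

# Table of complexity parameter (CP), number of splits, and
# cross-validated relative error (xerror) at each pruning level
printcp(dt)

# Visual version: a common heuristic is to pick the smallest tree
# whose xerror is within one standard error of the minimum
plotcp(dt)
```

If the cross-validated error stops improving after a split or two, a pruned tree may be the safer choice.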

We can calculate a kind of R-squared measure of accuracy by squaring the correlation between the actual Murder values and our predicted ones.

library(dplyr)

USArrests %>%
  mutate(predicted_murder = predict(dt, USArrests)) %>%
  select(Murder, predicted_murder) %>%
  cor() -> corrmat

rsq = corrmat[["Murder", "predicted_murder"]]^2
print(paste("The r-square for our model is", round(rsq, 2), sep = ": "))
[1] "The r-square for our model is: 0.78"
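That 0.78 is an in-sample figure, so it is likely optimistic: the tree is predicting the same 50 rows it was fit on. One hedge is to compute the same R-squared from cross-validated predictions, which rpart supports via xpred.rpart(). A sketch; the choice of the last cp column (the least-pruned tree, if I have the column ordering right) is an assumption:

```r
library(rpart)

dt <- rpart(Murder ~ ., data = USArrests)

# Cross-validated predictions: one column per complexity-parameter value
cv_pred <- xpred.rpart(dt, xval = 10)

# R-squared against the least-pruned tree's out-of-fold predictions
# (column choice is an assumption about xpred.rpart's ordering)
cv_rsq <- cor(USArrests$Murder, cv_pred[, ncol(cv_pred)])^2
round(cv_rsq, 2)
```

Expect this number to come in below the in-sample 0.78.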