Resources
Tutorial
The tutorial below will help you get started with basic data analysis in R. To follow along, you can download R and RStudio. R is freely available at http://www.r-project.org/. RStudio, the popular IDE for R, is freely available at https://posit.co/downloads/.
It’s important to test the usefulness of sophisticated models against more basic alternatives. The following may serve as a simple baseline comparator.
Step 1: Collecting data
1a. Install a package to scrape NCAA data
install.packages("devtools")
devtools::install_github("lbenz730/ncaahoopR")
1b. Load ncaahoopR package and load tidyverse
library(ncaahoopR)
library(tidyverse)
1c. Glimpse at a particular team’s schedule
# look at a team's schedule
duke_sample <- get_schedule("Duke", "2024-25")
duke_sample %>%
  glimpse()
Rows: 39
Columns: 7
$ game_id <dbl> 401706881, 401706882, 401706883, 401706884, 401706885, 4017…
$ date <date> 2024-11-04, 2024-11-08, 2024-11-12, 2024-11-16, 2024-11-22…
$ opponent <chr> "Maine", "Army", "Kentucky", "Wofford", "Arizona", "Kansas"…
$ location <chr> "H", "H", "N", "H", "A", "N", "H", "H", "A", "H", "H", "A",…
$ team_score <dbl> 96, 100, 72, 86, 69, 72, 70, 84, 76, 72, 68, 82, 88, 89, 76…
$ opp_score <dbl> 62, 58, 77, 35, 55, 75, 48, 78, 65, 46, 47, 56, 65, 62, 47,…
$ record <chr> "1-0 (0-0)", "2-0 (0-0)", "2-1 (0-0)", "3-1 (0-0)", "4-1 (0…
1d. Check out documentation at https://github.com/lbenz730/ncaahoopR.
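To get a feel for the schedule columns without hitting the network, the sketch below rebuilds the first five rows of the glimpse output above by hand and computes each game's point margin. The data frame `sample_games` and the `margin` and `win` columns are illustrative names, not part of ncaahoopR, and base R is used so the snippet runs without extra packages.

```r
# Hand-entered copy of the first five games from the glimpse output above
# (illustrative only; a real analysis would use get_schedule()).
sample_games <- data.frame(
  opponent   = c("Maine", "Army", "Kentucky", "Wofford", "Arizona"),
  location   = c("H", "H", "N", "H", "A"),
  team_score = c(96, 100, 72, 86, 69),
  opp_score  = c(62, 58, 77, 35, 55)
)

# Realized point margin: positive means the team won the game.
sample_games$margin <- sample_games$team_score - sample_games$opp_score
sample_games$win <- sample_games$margin > 0

print(sample_games)
```

The `margin` column is the realized version of the point spread that Step 2 tries to predict.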
Step 2: Basic prediction
Imagine you wish to predict the point spread in matchups between Duke, UNC, and NC State using data from Fall 2022. We can build a simple (and admittedly poor) naive model that assumes the number of points each team scores per game is normally distributed.
pt_df = NULL
for(teamName in c("Duke", "UNC", "NC State")) {
  pt_df = rbind(pt_df, get_schedule(teamName, "2022-23") %>%
    filter(date < "2023-01-01") %>% # look at only Fall games
    summarize(mean_pts = mean(team_score),
              var_pts = var(team_score)) %>%
    mutate(team = teamName))
}
pt_df
# A tibble: 3 × 3
  mean_pts var_pts team
     <dbl>   <dbl> <chr>
1     73.9    118. Duke
2     80.9    172. UNC
3     79.5    173. NC State
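The per-team summary can also be written without growing a data frame inside a loop. The following is a minimal sketch, assuming a hypothetical helper `summarize_scores()` and hand-entered score vectors in place of `get_schedule()` output. The numbers are the first 15 Duke and opponent scores from the 2024-25 glimpse output above, not the Fall 2022 data used in the table, so the results are not comparable to it.

```r
# Hypothetical helper: one summary row (mean and variance) per score vector.
summarize_scores <- function(scores, label) {
  data.frame(mean_pts = mean(scores), var_pts = var(scores), team = label)
}

# First 15 scores from the 2024-25 Duke glimpse output (not the Fall 2022 data).
score_list <- list(
  "Duke"           = c(96, 100, 72, 86, 69, 72, 70, 84, 76, 72, 68, 82, 88, 89, 76),
  "Duke opponents" = c(62, 58, 77, 35, 55, 75, 48, 78, 65, 46, 47, 56, 65, 62, 47)
)

# Build all rows first, then bind once instead of calling rbind() in a loop.
rows <- Map(summarize_scores, score_list, names(score_list))
pt_sketch <- do.call(rbind, rows)
rownames(pt_sketch) <- NULL
print(pt_sketch)
```

Binding once at the end gives the same kind of table as the loop and avoids repeatedly copying the accumulated data frame, which matters more as the number of teams grows.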
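Using these estimates, we can make naive predictions about the outcome of each matchup. If the two teams' scores are treated as independent normal variables, the point spread (team1 minus team2) is also normal, with mean equal to the difference of the team means and variance equal to the sum of the team variances. The 95% interval computed below is the mean spread plus or minus qnorm(0.975) (about 1.96) standard deviations.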
# 95% interval for a normal variable with mean m and standard deviation s,
# returned as a single "lower, upper" string
getCI95 = function(m, s) {
  return(toString(
    c(m - (qnorm(0.975) * s), m + (qnorm(0.975) * s))
  ))
}
# rename combn()'s default columns V1 and V2 to team1 and team2
lookup <- c(team1 = "V1", team2 = "V2")
combn(pt_df$team, 2) %>%
  t() %>%
  as.data.frame() %>%
  rename(all_of(lookup)) %>%
  rowwise() %>%
  mutate(pt_spread =
           pt_df$mean_pts[pt_df$team == team1] -
           pt_df$mean_pts[pt_df$team == team2]) %>%
  mutate(spread_sd =
           sqrt(pt_df$var_pts[pt_df$team == team1] +
                pt_df$var_pts[pt_df$team == team2])) %>%
  mutate(spread_CI_95 = getCI95(pt_spread, spread_sd)) %>%
  select(-spread_sd)
# A tibble: 3 × 4
# Rowwise:
  team1 team2    pt_spread spread_CI_95
  <chr> <chr>        <dbl> <chr>
1 Duke  UNC          -7.07 -40.4689585336895, 26.3261013908323
2 Duke  NC State     -5.68 -39.1512178722751, 27.7988369198942
3 UNC   NC State      1.40 -35.0191969327747, 37.8096731232509
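Because `getCI95()` returns a string, the bounds are awkward to reuse in later calculations. A possible alternative, sketched below under the same independent-normal assumption, returns numeric bounds instead. `spread_interval()` is a hypothetical name, and the inputs are the rounded means and variances printed in the summary table, so the result only approximately matches the Duke and UNC row above.

```r
# Hypothetical helper: interval for the spread (team1 minus team2), assuming
# the two scores are independent and normally distributed.
spread_interval <- function(mean1, var1, mean2, var2, level = 0.95) {
  spread <- mean1 - mean2
  spread_sd <- sqrt(var1 + var2)   # variances add under independence
  z <- qnorm(1 - (1 - level) / 2)  # about 1.96 when level = 0.95
  c(spread = spread,
    lower = spread - z * spread_sd,
    upper = spread + z * spread_sd)
}

# Duke vs UNC, using the rounded values printed in the summary table.
duke_unc <- spread_interval(73.9, 118, 80.9, 172)
print(round(duke_unc, 2))
```

The interval is wide and comfortably includes zero, which reflects how little this baseline says about any single game. Under the model it describes the plausible range of one game's spread, which is closer in spirit to a prediction interval than to a confidence interval for the mean spread.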
Further reading
Here is a list of online articles that may be helpful in this competition.
– Creating a College Basketball Metric to Predict Point Spreads for March Madness
– Confidence vs Prediction Intervals: Understanding the Difference