Resources

Tutorial

The tutorial below will help you get started with basic data analysis in R. To follow along, download R and RStudio. R is freely available at http://www.r-project.org/. RStudio, a popular IDE for R, is freely available at https://posit.co/downloads/.

Note

It’s important to test the usefulness of sophisticated models against more basic alternatives. The following may serve as a simple baseline comparator.

Step 1: Collecting data

1a. Install a package to scrape NCAA data

install.packages("devtools")
devtools::install_github("lbenz730/ncaahoopR")

1b. Load ncaahoopR package and load tidyverse

library(ncaahoopR)
library(tidyverse)

1c. Glimpse at a particular team’s schedule

# look at a team's schedule
duke_sample <- get_schedule("Duke", "2024-25")

duke_sample %>%
  glimpse()
Rows: 39
Columns: 7
$ game_id    <dbl> 401706881, 401706882, 401706883, 401706884, 401706885, 4017…
$ date       <date> 2024-11-04, 2024-11-08, 2024-11-12, 2024-11-16, 2024-11-22…
$ opponent   <chr> "Maine", "Army", "Kentucky", "Wofford", "Arizona", "Kansas"…
$ location   <chr> "H", "H", "N", "H", "A", "N", "H", "H", "A", "H", "H", "A",…
$ team_score <dbl> 96, 100, 72, 86, 69, 72, 70, 84, 76, 72, 68, 82, 88, 89, 76…
$ opp_score  <dbl> 62, 58, 77, 35, 55, 75, 48, 78, 65, 46, 47, 56, 65, 62, 47,…
$ record     <chr> "1-0 (0-0)", "2-0 (0-0)", "2-1 (0-0)", "3-1 (0-0)", "4-1 (0…
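
Once a schedule is loaded, a natural first step is to turn the two score columns into a margin of victory. The sketch below is illustrative only: it rebuilds the first six games by hand from the glimpse() output above (so it runs without a network call), and the names duke_head, margin and win are ours, not part of ncaahoopR.

```r
# First six games, copied by hand from the glimpse() output above
duke_head <- data.frame(
  opponent   = c("Maine", "Army", "Kentucky", "Wofford", "Arizona", "Kansas"),
  location   = c("H", "H", "N", "H", "A", "N"),
  team_score = c(96, 100, 72, 86, 69, 72),
  opp_score  = c(62, 58, 77, 35, 55, 75)
)

# Margin of victory (negative means a loss) and a win indicator
duke_head$margin <- duke_head$team_score - duke_head$opp_score
duke_head$win    <- duke_head$margin > 0

duke_head
mean(duke_head$margin)  # average margin over these six games
```

With the full data frame returned by get_schedule(), the same two columns could be added with mutate() instead of base R assignment.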

1d. Check out the documentation at https://github.com/lbenz730/ncaahoopR.

Step 2: Basic prediction

Imagine you wish to predict the point spread in matchups between Duke, UNC, and NC State using data from Fall 2022. We can build a simple (and poor) naive model that assumes the number of points each team scores per game is normally distributed.

pt_df <- NULL

for (teamName in c("Duke", "UNC", "NC State")) {
  # summarize each team's Fall scoring, then append it to pt_df
  team_summary <- get_schedule(teamName, "2022-23") %>%
    filter(date < as.Date("2023-01-01")) %>% # look at only Fall games
    summarize(mean_pts = mean(team_score),
              var_pts = var(team_score)) %>%
    mutate(team = teamName)
  pt_df <- rbind(pt_df, team_summary)
}

pt_df 
# A tibble: 3 × 3
  mean_pts var_pts team    
     <dbl>   <dbl> <chr>   
1     73.9    118. Duke    
2     80.9    172. UNC     
3     79.5    173. NC State

Using this, we can make naive predictions about the outcome of each matchup.
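
The spread calculation rests on one further assumption that is worth stating: the two teams' scores are treated as independent. As a sketch under that assumption (the symbols here are ours, introduced for illustration), write team i's score as X_i with mean \mu_i and variance \sigma_i^2. The difference of two independent normal variables is again normal, so the means subtract and the variances add:

```latex
\[
X_1 - X_2 \sim \mathcal{N}\!\left(\mu_1 - \mu_2,\; \sigma_1^2 + \sigma_2^2\right),
\qquad
\text{95\% interval: } (\mu_1 - \mu_2) \pm z_{0.975}\sqrt{\sigma_1^2 + \sigma_2^2},
\quad z_{0.975} \approx 1.96 .
\]
```

In practice \mu_i and \sigma_i^2 are replaced by the sample mean and sample variance from pt_df. Because the resulting interval describes the spread of a single game rather than uncertainty about the mean spread, it is closer in spirit to a prediction interval than to a confidence interval.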

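# 95% interval for a normal variable with mean m and standard deviation s,
# returned as a single comma-separated string
getCI95 <- function(m, s) {
  half_width <- qnorm(0.975) * s
  toString(c(m - half_width, m + half_width))
}

# as.data.frame() labels the unnamed matrix columns V1 and V2
lookup <- c(team1 = "V1", team2 = "V2")

combn(pt_df$team, 2) %>%
  t() %>%
  as.data.frame() %>%
  rename(all_of(lookup)) %>%
  rowwise() %>%
  # expected spread: difference in mean points
  mutate(pt_spread =
           pt_df$mean_pts[pt_df$team == team1] -
           pt_df$mean_pts[pt_df$team == team2]) %>%
  # standard deviation of the difference of two independent scores
  mutate(spread_sd =
           sqrt(pt_df$var_pts[pt_df$team == team1] +
                pt_df$var_pts[pt_df$team == team2])) %>%
  mutate(spread_CI_95 = getCI95(pt_spread, spread_sd)) %>%
  select(-spread_sd)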
# A tibble: 3 × 4
# Rowwise: 
  team1 team2    pt_spread spread_CI_95                       
  <chr> <chr>        <dbl> <chr>                              
1 Duke  UNC          -7.07 -40.4689585336895, 26.3261013908323
2 Duke  NC State     -5.68 -39.1512178722751, 27.7988369198942
3 UNC   NC State      1.40 -35.0191969327747, 37.8096731232509
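
The same naive model also implies a win probability for each matchup: if the spread is normal with mean m and standard deviation s, the chance that team1 outscores team2 is pnorm(m / s). The sketch below is one way to compute it, not part of the original tutorial. It assumes independent normal scores, uses the rounded values printed in the pt_df output above rather than the full-precision ones, and the names pt_df_demo and win_prob are hypothetical.

```r
# Rounded summary values, copied by hand from the pt_df output above
pt_df_demo <- data.frame(
  team     = c("Duke", "UNC", "NC State"),
  mean_pts = c(73.9, 80.9, 79.5),
  var_pts  = c(118, 172, 173)
)

# P(team1 outscores team2) when the spread is Normal(spread_mean, spread_sd):
# P(spread > 0) = pnorm(spread_mean / spread_sd)
win_prob <- function(team1, team2, df = pt_df_demo) {
  a <- df[df$team == team1, ]
  b <- df[df$team == team2, ]
  spread_mean <- a$mean_pts - b$mean_pts
  spread_sd   <- sqrt(a$var_pts + b$var_pts)
  pnorm(spread_mean / spread_sd)
}

win_prob("Duke", "UNC")       # below 0.5: Duke's Fall scoring average is lower
win_prob("UNC", "NC State")   # slightly above 0.5
```

Every interval in the table above contains zero, so these probabilities all sit fairly close to 0.5, which is a useful reminder of how little the naive model distinguishes between the teams.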

Further reading

Here is a list of online articles that may be helpful in this competition.

Creating a College Basketball Metric to Predict Point Spreads for March Madness

Confidence vs Prediction Intervals: Understanding the Difference