Getting started

The tutorial below will help you get started with basic data analysis in R. To follow along, download R and RStudio. R is freely available at http://www.r-project.org/. RStudio, a popular IDE for R, is freely available at https://posit.co/downloads/.

Note

It’s important to test the usefulness of sophisticated models against more basic alternatives. The following may serve as a simple baseline comparator.

Step 1: Collecting data

1a. Install a package to scrape NCAA data

#install.packages("devtools")
#devtools::install_github("lbenz730/ncaahoopR")

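1b. Load the package, plus dplyr for the pipe and data verbs used below

library(ncaahoopR)
library(dplyr) # provides %>%, glimpse(), filter(), summarize(), mutate()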

1c. Glimpse at a particular team’s schedule

# look at a team's schedule
get_schedule("Duke", "2022-23") %>%
  glimpse()
Rows: 36
Columns: 7
$ game_id    <chr> "401482906", "401482907", "401482908", "401482909", "401482…
$ date       <date> 2022-11-07, 2022-11-11, 2022-11-15, 2022-11-18, 2022-11-21…
$ opponent   <chr> "Jacksonville", "South Carolina Upstate", "Kansas", "Delawa…
$ location   <chr> "H", "H", "N", "H", "H", "N", "N", "N", "H", "H", "N", "H",…
$ team_score <dbl> 71, 84, 64, 92, 74, 54, 71, 56, 81, 75, 74, 82, 70, 86, 60,…
$ opp_score  <dbl> 44, 38, 69, 58, 57, 51, 64, 75, 72, 59, 62, 55, 81, 67, 84,…
$ record     <chr> "1-0 (0-0)", "2-0 (0-0)", "2-1 (0-0)", "3-1 (0-0)", "4-1 (0…
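
The columns above are enough to derive simple per-game quantities. As a minimal sketch, the snippet below rebuilds the first three Duke games by hand from the glimpse() output (so it runs without a network call) and adds a scoring margin and a win flag. The names duke_first3, duke_margins, margin and won are illustrative and not part of ncaahoopR.

```r
library(dplyr)

# First three games, copied by hand from the glimpse() output above
duke_first3 <- data.frame(
  date       = as.Date(c("2022-11-07", "2022-11-11", "2022-11-15")),
  opponent   = c("Jacksonville", "South Carolina Upstate", "Kansas"),
  location   = c("H", "H", "N"),
  team_score = c(71, 84, 64),
  opp_score  = c(44, 38, 69)
)

# margin and won are illustrative derived columns
duke_margins <- duke_first3 %>%
  mutate(margin = team_score - opp_score,
         won    = margin > 0)

duke_margins
```

Two wins followed by a loss agrees with the record column, which reads 2-1 after the third game. The same mutate() call should also apply to the full result of get_schedule().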

1d. Check out the documentation at https://github.com/lbenz730/ncaahoopR.

Step 2: Basic prediction

Suppose you wish to predict the point spread in matchups between Duke, UNC and NC State in Spring 2023 using data from Fall 2022. We can build a simple (and admittedly poor) naive model that assumes the number of points each team scores per game is normally distributed, with the mean and variance estimated from that team's Fall games.

pt_df = NULL

for (teamName in c("Duke", "UNC", "NC State")) {
  # summarize one team's Fall scoring, then append it to pt_df
  team_summary = get_schedule(teamName, "2022-23") %>%
    filter(date < as.Date("2023-01-01")) %>% # look at only Fall games
    summarize(mean_pts = mean(team_score),
              var_pts = var(team_score)) %>%
    mutate(team = teamName)
  pt_df = rbind(pt_df, team_summary)
}

pt_df
  mean_pts  var_pts     team
1 73.85714 118.4396     Duke
2 80.92857 171.9176      UNC
3 79.53333 173.2667 NC State

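Using these estimates, we can make naive predictions about the outcome of each matchup. If the two teams' scores are treated as independent, the predicted spread is the difference of their mean scores, and its variance is the sum of their variances.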

getCI95 = function(m, s) {
  return(toString(
    c(m - (qnorm(0.975) * s), m + (qnorm(0.975) * s))
  ))
}
  
# map combn()'s default column names (V1, V2) to team1, team2
lookup <- c(team1 = "V1", team2 = "V2")

combn(pt_df$team, 2) %>%
  t() %>%
  as.data.frame() %>%
  rename(all_of(lookup)) %>%
  rowwise() %>%
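  # expected margin: team1's mean score minus team2's mean score
  mutate(pt_spread =
           pt_df$mean_pts[pt_df$team == team1] -
           pt_df$mean_pts[pt_df$team == team2]) %>%
  # variances add for the difference of two independent scores
  mutate(spread_sd =
           sqrt(pt_df$var_pts[pt_df$team == team1] +
                pt_df$var_pts[pt_df$team == team2])) %>%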
  mutate(spread_CI_95 = getCI95(pt_spread, spread_sd)) %>%
  select(-spread_sd)
# A tibble: 3 × 4
# Rowwise: 
  team1 team2    pt_spread spread_CI_95                       
  <chr> <chr>        <dbl> <chr>                              
1 Duke  UNC          -7.07 -40.4689585336895, 26.3261013908323
2 Duke  NC State     -5.68 -39.1512178722751, 27.7988369198942
3 UNC   NC State      1.40 -35.0191969327747, 37.8096731232509
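
To see where these numbers come from, the first row can be reproduced by hand. Under the model's assumptions (normal, independent scores), the difference of two teams' scores is itself normal, with mean equal to the difference of the means and variance equal to the sum of the variances. The sketch below plugs in the rounded Duke and UNC values printed in pt_df and keeps the bounds numeric instead of pasting them into a string. The variable names are illustrative.

```r
# Rounded summary values copied from the pt_df output above
duke_mean <- 73.85714; duke_var <- 118.4396
unc_mean  <- 80.92857; unc_var  <- 171.9176

pt_spread <- duke_mean - unc_mean        # expected Duke minus UNC margin
spread_sd <- sqrt(duke_var + unc_var)    # variances add under independence
z <- qnorm(0.975)                        # about 1.96 for a 95% interval

interval <- c(lower = pt_spread - z * spread_sd,
              upper = pt_spread + z * spread_sd)

round(pt_spread, 2)
round(interval, 2)
```

Up to rounding of the inputs, this should agree with the Duke vs. UNC row. Every interval in the table spans zero by a wide margin, so this baseline does not confidently favor any team. Because it describes the spread of a single game, it is arguably better read as a prediction interval than as a confidence interval for the mean spread.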