In this computer lab we’re going to try to predict outcomes for AFL football.
The technique we will use is very basic, so feel free to experiment with others after the lab.
The data set we will use is trainingdata.txt.
It can be read in as a table using this command:
footy <- read.table("trainingdata.txt", header=TRUE)
The data contains results for AFL games in 2008.
For those who don’t know about football, here’s what you need to know for the exercise.
When two teams play, the winner is the team with the most points at the end of the game.
In the data set, each row is the outcome of one game.
Each row compares the home team against the away team for that game.
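The data set contains many variables but we will look at only three:
footy$home_team_win: 1 if the home team won the current game, -1 if it lost
footy$lg_home_team_margin: The home team's margin in its previous game
footy$lg_away_team_margin: Same as lg_home_team_margin, but for away team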
We are interested in using lg_home_team_margin and lg_away_team_margin as predictors for who will win the current game.
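Before plotting, it can help to tabulate the outcome variable and summarize the two predictors. The sketch below uses a small made-up data frame called toy as a stand-in for footy (the five rows are invented for illustration and are not real AFL results); the same calls should work on footy once it has been read in.

```r
# Hypothetical stand-in for the lab data: five made-up games, not real AFL results
toy <- data.frame(
  home_team_win       = c(1, -1, 1, 1, -1),
  lg_home_team_margin = c(23, -40, 5, 61, -12),
  lg_away_team_margin = c(-15, 30, -8, 10, 44)
)

# Count wins (1) and losses (-1) for the home team
outcome_counts <- table(toy$home_team_win)
print(outcome_counts)

# Summaries of the two predictors
print(summary(toy[, c("lg_home_team_margin", "lg_away_team_margin")]))

# Fraction of games won by the home team
home_win_rate <- mean(toy$home_team_win == 1)
print(home_win_rate)
```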
For starters, let’s look at the data:
To understand the figure, consider a point in the south east corner. South east means that the home team did well in their previous game, and the away team did badly. In this situation, we might expect that the home team would win the current game, and the circle would be black. Hence, we would expect black circles in the south east, and red circles in the north west. This seems like it might be the average case, although the relationship is actually not very clear.
The code for producing the figure is below. Please run it.
footy <- read.table("trainingdata.txt", header=TRUE)
x2 <- footy$lg_home_team_margin
x3 <- footy$lg_away_team_margin
Y <- footy$home_team_win
# Black are winners, reds are losers
plot(x2[Y == 1], x3[Y == 1], col="black",
     xlab="Home team margin, previous game",
     ylab="Away team margin, previous game",
     main="Outcomes for home team")
points(x2[Y == -1], x3[Y == -1], col="red")
legend(-135, -100, c("win","loss"), col=c("black","red"), pch=c(21, 21))
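Now we’re going to predict home team win/loss by linear regression.
Our regression model will be
$$y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + u$$
Here:
y is home_team_win
x2 is lg_home_team_margin
x3 is lg_away_team_margin
u is an unobserved error term
Note that the variable y takes only the values 1 or -1, corresponding to win or loss.
We can now run the regression, leading to predictions of the form
$$\hat y = \hat\beta_1 + \hat\beta_2 x_2 + \hat\beta_3 x_3$$
If we put in new values for x2 = lg_home_team_margin and x3 = lg_away_team_margin we get predictions for win or loss.
Here we understand that a positive predicted value of y means we predict a win, and a negative predicted value means we predict a loss.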
In the next figure, we run the regression, and then draw a line through the points.
The line gives the “decision boundary” (where prediction changes from “win” to “loss” or vice versa).
Here is the figure:
In the figure, for points to the south east of the line the predicted value of y is positive, and hence we predict win. For points to the north west of the line the predicted value of y is negative, and hence we predict loss.
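The equation of the decision boundary follows from setting the predicted value of y to zero. As a short worked derivation, write \hat\beta_1 for the estimated intercept and \hat\beta_2, \hat\beta_3 for the estimated coefficients on x2 and x3 (this notation is assumed here), and suppose \hat\beta_3 is not zero:

```latex
\begin{aligned}
\hat y = 0
  &\iff \hat\beta_1 + \hat\beta_2 x_2 + \hat\beta_3 x_3 = 0 \\
  &\iff x_3 = \frac{-\hat\beta_1 - \hat\beta_2 x_2}{\hat\beta_3}
\end{aligned}
```

So for each value of x2 on the horizontal axis, the boundary gives the value of x3 on the vertical axis at which the prediction switches sign.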
Replicate the previous figure by running the regression on the training data and adding the decision boundary to the scatter plot.
Next let’s see how our prediction model performs on new data.
Here is a second data set of the same form, from the same season: testdata.txt.
We’ll call this new data the test data set. The previous data set we’ll call the training data set.
If we look at the win rate for the home team in the training data, it is over 50%. Hence the naive prediction for the test data set is that the home team will win. If we predict like this on the test data set, we are right 59.7% of the time.
If, on the other hand, we predict with our model (using the coefficient values we estimated on the training data set), the success rate is 64.8%.
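The naive benchmark described above can be sketched in a few lines. The vectors y_train and y_test below are made-up outcomes (hypothetical names and values, not the lab data), so the rate printed here will not match the 59.7% figure; the point is only to show the mechanics of predicting the more common training outcome for every test game.

```r
# Hypothetical outcome vectors (1 = home win, -1 = home loss), made up for illustration
y_train <- c(1, 1, -1, 1, -1, 1, 1, -1)
y_test  <- c(1, -1, 1, 1, -1)

# Naive rule: always predict whichever outcome was more common in training
naive_label <- ifelse(mean(y_train == 1) > 0.5, 1, -1)

# Success rate of the naive rule on the test outcomes
naive_rate <- mean(y_test == naive_label)
cat("naive prediction:", naive_label, "\n")
cat("naive success rate:", naive_rate, "\n")
```

Any model worth using should beat this rate on the test data, which is why it serves as the comparison point.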
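Replicate these results by applying the coefficients estimated on the training data to the test data and comparing the predictions with the actual outcomes.
The following code contains solutions to both exercises.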
footy <- read.table("trainingdata.txt", header=TRUE)
x2 <- footy$lg_home_team_margin
x3 <- footy$lg_away_team_margin
Y <- footy$home_team_win
# Black are winners, reds are losers
plot(x2[Y == 1], x3[Y == 1], col="black",
xlab="Home team margin, previous game",
ylab="Away team margin, previous game",
main="Outcomes for home team")
points(x2[Y == -1], x3[Y == -1], col="red")
legend(-135, -100, c("win","loss"),
col=c("black","red"), pch=c(21, 21))
# Run the regression
reg1 <- lm(Y ~ x2 + x3)
# Plot the line such that (1, x2, x3)'beta = 0
gridsize <- 40
b <- coef(reg1)
grid <- seq(min(x2), max(x2), length.out=gridsize)
lines(grid, (- b[1] - b[2] * grid ) / b[3])
# Add text to indicate prediction categories
text(-110, 110, "Predict loss")
text(50, -105, "Predict win")
# Test success rate predicting on test data set
footy_test <- read.table("testdata.txt", header=TRUE)
x2_test <- footy_test$lg_home_team_margin
x3_test <- footy_test$lg_away_team_margin
# Combine regressors into matrix of rows (1, x2_test, x3_test)
X_test <- cbind(rep(1, length(x2_test)), x2_test, x3_test)
# Evaluate predictions for each row
pred <- X_test %*% coef(reg1)
# Convert to 1, -1 values
pred <- ifelse(pred > 0, 1, -1)
# Actual outcomes, to compare against predictions
Y_test <- footy_test$home_team_win
cat("fraction of wins in training set:", mean(Y == 1), "\n")
cat("fraction of wins in test set:", mean(Y_test == 1), "\n")
cat("prediction success rate:", mean(pred == Y_test), "\n")
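As a possible alternative to building the regressor matrix by hand, the fitted values on new data can also be obtained with predict(). The sketch below is not part of the original solution: it uses simulated data from a hypothetical helper make_games (the coefficients inside it are arbitrary) and fits the model with column names in the formula so that predict() can match them in the new data frame.

```r
set.seed(1)  # Make the simulated data reproducible

# Simulated stand-in for the training and test files (not real AFL results)
make_games <- function(n) {
  home <- rnorm(n, sd = 40)
  away <- rnorm(n, sd = 40)
  win  <- ifelse(10 + 0.3 * home - 0.3 * away + rnorm(n, sd = 40) > 0, 1, -1)
  data.frame(home_team_win = win,
             lg_home_team_margin = home,
             lg_away_team_margin = away)
}
train <- make_games(200)
test  <- make_games(100)

# Fit on the training data, naming columns in the formula
fit <- lm(home_team_win ~ lg_home_team_margin + lg_away_team_margin, data = train)

# predict() builds the (1, x2, x3) rows internally from the new data frame
yhat <- predict(fit, newdata = test)
pred <- ifelse(yhat > 0, 1, -1)

# Same values as multiplying the regressor matrix by the coefficients by hand
X_test <- cbind(1, test$lg_home_team_margin, test$lg_away_team_margin)
manual <- as.vector(X_test %*% coef(fit))

cat("prediction success rate:", mean(pred == test$home_team_win), "\n")
```

The design difference is that the model is fitted with data = train rather than on free-standing vectors, which is what lets predict() look up the same column names in test.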