Prediction

In this computer lab we’re going to try to predict outcomes for AFL football

The technique we will use is very basic—feel free to experiment with others after the lab

The data set we will use is trainingdata.txt

It can be read in as a table using this command:

footy <- read.table("trainingdata.txt", header=TRUE)

The data contains results for AFL games in 2008

For those who don’t know about football, here’s what you need to know for the exercise

  • When two teams play, the winner is the team with the most points at the end of the game

  • Teams usually play either “at home” or “away”
    • At home means at the team’s local home ground

In the data set, each row is the outcome of one game

Each row compares the home team against the away team for that game

The data set contains many variables but we will look at only three

  • footy$home_team_win: Whether or not the home team won the game
    • 1 indicates that the home team won, -1 indicates that they lost
  • footy$lg_home_team_margin: The winning margin of the home team in their previous game
    • Measured in points (e.g. lg_home_team_margin = 10 means they won their previous game by ten points)
    • Negative value indicates that they lost
  • footy$lg_away_team_margin: Same as lg_home_team_margin, but for away team
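
A quick way to get a feel for these three variables is to tabulate the outcomes and summarize the margins. The sketch below uses a small made-up data frame (called toy here, a hypothetical name) with the same column names, so it runs without the data file. The values are invented for illustration and are not taken from the AFL data.

```r
# Hypothetical stand-in for footy: same column names, invented values
toy <- data.frame(
    home_team_win       = c(1, -1, 1, 1, -1),
    lg_home_team_margin = c(10, -25, 42, -3, 7),
    lg_away_team_margin = c(-15, 30, 5, -8, 12)
)

# Count home wins (1) and home losses (-1)
outcome_counts <- table(toy$home_team_win)
print(outcome_counts)

# Fraction of games won by the home team
home_win_rate <- mean(toy$home_team_win == 1)
print(home_win_rate)

# Range and quartiles of the previous-game margins
print(summary(toy$lg_home_team_margin))
print(summary(toy$lg_away_team_margin))
```

With the real data loaded, the same calls should work with footy in place of toy.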

We are interested in using lg_home_team_margin and lg_away_team_margin as predictors for who will win the current game.

For starters, let’s look at the data:

_images/winloss.png
  • On the x-axis is lg_home_team_margin, while on the y-axis is lg_away_team_margin
  • The data points (circles) correspond to the value of these variables for each game
  • Black circles indicate that the home team won the current game, red circles indicate that they lost

To understand the figure, consider a point in the south east corner. South east means that the home team did well in their previous game, and the away team did badly. In this situation, we might expect that the home team would win the current game, and the circle would be black. Hence, we would expect mostly black circles in the south east, and mostly red circles in the north west. The figure is roughly consistent with this pattern, although the relationship is not very clear.

The code for producing the figure is below. Please run it.

footy <- read.table("trainingdata.txt", header=TRUE)

x2 <- footy$lg_home_team_margin
x3 <- footy$lg_away_team_margin
Y <- footy$home_team_win

# Black are winners, reds are losers
plot(x2[Y == 1], x3[Y == 1], col="black",
    xlab="Home team margin, previous game",
    ylab="Away team margin, previous game",
    main="Outcomes for home team")
points(x2[Y == -1], x3[Y == -1], col="red")
legend(-135, -100, c("win","loss"), col=c("black","red"), pch=c(21, 21))

Now we’re going to predict home team win/loss by linear regression

Our regression model will be

\[y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 \tag{1}\]

Here:

  • y = home_team_win
  • x2 = lg_home_team_margin
  • x3 = lg_away_team_margin

Note that the variable y takes only the values 1 or -1, corresponding to win or loss

We can now run the regression, leading to predictions of the form

\[y = \hat \beta_1 + \hat \beta_2 x_2 + \hat \beta_3 x_3 \tag{2}\]

If we put in new values for x2 = lg_home_team_margin and x3 = lg_away_team_margin we get predictions for win or loss

Here we interpret the predicted value of y as follows:

  • if the predicted value is positive, then the model predicts a win
  • if the predicted value is zero or negative, then the model predicts a loss
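
As a minimal sketch of this decision rule, the code below applies it to three imaginary games. The coefficient values in b and the regressor values are hypothetical, chosen only for illustration. They are not estimates from the AFL data.

```r
# Hypothetical coefficient values (beta_1, beta_2, beta_3),
# chosen only for illustration, not estimated from the AFL data
b <- c(0.2, 0.005, -0.004)

# Invented regressor values for three imaginary games
x2_new <- c(30, -40, 0)    # home team's margin in its previous game
x3_new <- c(-20, 25, 10)   # away team's margin in its previous game

# Predicted value of y from the linear rule
y_hat <- b[1] + b[2] * x2_new + b[3] * x3_new

# Positive predicted value -> win (1), otherwise -> loss (-1)
pred <- ifelse(y_hat > 0, 1, -1)
print(cbind(x2_new, x3_new, y_hat, pred))
```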

In the next figure, we run the regression and then draw the line made up of the points

\[\{(x_2, x_3) \in \mathbb{R}^2 \,|\, \hat \beta_1 + \hat \beta_2 x_2 + \hat \beta_3 x_3 = 0\} \tag{3}\]

The line gives the “decision boundary” (where prediction changes from “win” to “loss” or vice versa)
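
To see why this set is a straight line in the figure, it can help to solve the boundary condition for x3. This step assumes that the estimated coefficient on x3 is nonzero:

```latex
\[
\hat \beta_1 + \hat \beta_2 x_2 + \hat \beta_3 x_3 = 0
\quad \Longleftrightarrow \quad
x_3 = - \frac{\hat \beta_1 + \hat \beta_2 x_2}{\hat \beta_3}
\qquad (\hat \beta_3 \neq 0)
\]
```

In other words, the boundary is a line in the (x2, x3) plane with intercept \(-\hat \beta_1 / \hat \beta_3\) and slope \(-\hat \beta_2 / \hat \beta_3\).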

Here is the figure:

_images/winlosspred.png

In the figure, for points to the south east of the line the predicted value of y is positive, and hence we predict win. For points to the north west of the line the predicted value of y is negative, and hence we predict loss.

Exercises

Exercise 1

Replicate the previous figure by

  • running the regression
  • writing down an equation for the line and plotting it using lines
  • adding the text with text(-110, 110, "Predict loss") and text(50, -105, "Predict win")

Next let’s see how our prediction model performs on new data.

Here is a second data set of the same form, from the same season (testdata.txt)

We’ll call this new data the test data set. The previous data set we’ll call the training data set

If we look at the win rate for the home team in the training data, it is over 50%. Hence the naive prediction for the test data set is that the home team will win. If we predict like this on the test data set, we are right 59.7% of the time.
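
The naive benchmark can be sketched as follows. The outcome vectors here are invented and the names Y_train_toy and Y_test_toy are hypothetical, so the printed rate will not match the 59.7% figure for the real test data.

```r
# Hypothetical outcome vectors (1 = home win, -1 = home loss), invented values
Y_train_toy <- c(1, 1, -1, 1, -1, 1, 1, -1)
Y_test_toy  <- c(1, -1, 1, 1, -1)

# Naive rule: always predict the more common outcome in the training data
naive_guess <- if (mean(Y_train_toy == 1) > 0.5) 1 else -1

# Success rate of that constant prediction on the test outcomes
naive_rate <- mean(Y_test_toy == naive_guess)
cat("naive guess:", naive_guess, "\n")
cat("naive success rate:", naive_rate, "\n")
```

Any prediction model is only useful here if it beats this constant guess on the test data.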

If, on the other hand, we predict with our model (using the coefficient values we estimated on the training data set), the success rate is 64.8%

Exercise 2

Replicate these results by

  • combining the new regressor values from the test data set with the estimated coefficients from the training data set to produce predictions, and
  • comparing those predictions with actual outcomes in the test data set

Solutions

The following code contains solutions to both exercises

footy <- read.table("trainingdata.txt", header=TRUE)

x2 <- footy$lg_home_team_margin
x3 <- footy$lg_away_team_margin
Y <- footy$home_team_win

# Black are winners, reds are losers
plot(x2[Y == 1], x3[Y == 1], col="black",
    xlab="Home team margin, previous game",
    ylab="Away team margin, previous game",
    main="Outcomes for home team")
points(x2[Y == -1], x3[Y == -1], col="red")

legend(-135, -100, c("win","loss"),
   col=c("black","red"), pch=c(21, 21))


# Run the regression
reg1 <- lm(Y ~ x2 + x3)
# Plot the line such that (1, x2, x3)'beta = 0
gridsize <- 40
b <- coef(reg1)
grid <- seq(min(x2), max(x2), length.out=gridsize)
lines(grid, (- b[1] - b[2] * grid ) / b[3])
# Add text to indicate prediction categories
text(-110, 110, "Predict loss")
text(50, -105, "Predict win")


# Test success rate predicting on test data set
footy_test <- read.table("testdata.txt", header=TRUE)
x2_test <- footy_test$lg_home_team_margin
x3_test <- footy_test$lg_away_team_margin
# Combine regressors into matrix of rows (1, x2_test, x3_test)
X_test <- cbind(rep(1, length(x2_test)), x2_test, x3_test)
# Evaluate predictions for each row
pred <- X_test %*% coef(reg1)
# Convert to 1, -1 values
pred <- ifelse(pred > 0, 1, -1)
# Actual outcomes, to compare against predictions
Y_test <- footy_test$home_team_win
cat("fraction of wins in training set:", mean(Y == 1), "\n")
cat("fraction of wins in test set:", mean(Y_test == 1), "\n")
cat("prediction success rate:", mean(pred == Y_test), "\n")
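
As an alternative to building the regressor matrix by hand, R's predict function can evaluate a fitted lm model on new data, provided the model was fitted with a formula and a data argument so that column names can be matched. The sketch below illustrates this on simulated data. The function simulate_games, the data-generating rule inside it, and all the names ending in _sim are hypothetical, so the success rate it prints says nothing about the AFL data.

```r
set.seed(1)  # make the simulated data reproducible

# Simulate a hypothetical data set with the same column names as footy
simulate_games <- function(n) {
    home <- round(rnorm(n, sd = 40))
    away <- round(rnorm(n, sd = 40))
    # Invented data-generating rule, for illustration only
    score <- 0.2 + 0.01 * home - 0.01 * away + rnorm(n)
    data.frame(home_team_win = ifelse(score > 0, 1, -1),
               lg_home_team_margin = home,
               lg_away_team_margin = away)
}
train_sim <- simulate_games(150)
test_sim  <- simulate_games(50)

# Fit with a formula and data argument so predict() can match column names
reg_sim <- lm(home_team_win ~ lg_home_team_margin + lg_away_team_margin,
              data = train_sim)

# Predicted values on the new data, then convert to 1, -1 values
y_hat <- predict(reg_sim, newdata = test_sim)
pred_sim <- ifelse(y_hat > 0, 1, -1)

# Cross-check against the explicit matrix product with rows (1, x2, x3)
X_sim <- cbind(1, test_sim$lg_home_team_margin, test_sim$lg_away_team_margin)
manual <- as.vector(X_sim %*% coef(reg_sim))
stopifnot(isTRUE(all.equal(unname(y_hat), manual)))

cat("success rate on simulated test data:",
    mean(pred_sim == test_sim$home_team_win), "\n")
```

To use this approach on the lab data, the regression would need to be refitted as lm(home_team_win ~ lg_home_team_margin + lg_away_team_margin, data=footy), after which predict(reg, newdata=footy_test) should give the same values as the matrix product above.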