A Late Data Cup 2021 Submission

Apr 29, 2021

My dataset consists of 75,000 events from 41 OHL games played by the Erie Otters in the 2019–2020 season. It was made available by the NHL for this year’s annual Data Cup competition. The competition ended in March, but I decided to use the data anyway because the competition was part of what motivated me to take data science more seriously: “I can do this... but how can I do it better?”

I decided to build three models. They predict:

1. The probability of a goal conditional on a shot.
2. The probability of a completed pass conditional on a pass attempt.
3. The expected number of shots in the next 10 seconds of play.

Feature Engineering

Before I could predict anything, I had to transform the data and engineer features. The features common to all models were:

1. Total seconds remaining in the period and game.
2. The prior two events.
3. The x, y coordinates of the prior two events.
4. The seconds elapsed since the last event.
5. The home skater advantage.
6. The score differential.
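As a rough illustration of the first four features, here is a minimal sketch in plain Python. It is not the notebook code: the field names (`period`, `clock_s`, `event`, `x`, `y`), the three 20-minute periods, and the helper `add_context_features` are all assumptions made for the example. The skater advantage and score differential are simple home-minus-away differences and are left out.

```python
PERIOD_SECONDS = 20 * 60  # assumed regulation period length

# Hypothetical event records; clock_s is the time left on the period clock.
events = [
    {"period": 1, "clock_s": 1200, "event": "Faceoff Win", "x": 100, "y": 42},
    {"period": 1, "clock_s": 1195, "event": "Play", "x": 120, "y": 30},
    {"period": 1, "clock_s": 1189, "event": "Shot", "x": 170, "y": 45},
]

def add_context_features(events, n_lags=2, n_periods=3):
    """Return copies of time-ordered events with time and prior-event features."""
    out = []
    for i, ev in enumerate(events):
        row = dict(ev)
        # The clock counts down, so it already is the time left in the period.
        row["period_s_left"] = ev["clock_s"]
        row["game_s_left"] = (n_periods - ev["period"]) * PERIOD_SECONDS + ev["clock_s"]
        # Type and location of each of the previous n_lags events.
        for lag in range(1, n_lags + 1):
            prev = events[i - lag] if i - lag >= 0 else None
            row[f"prev{lag}_event"] = prev["event"] if prev else None
            row[f"prev{lag}_x"] = prev["x"] if prev else None
            row[f"prev{lag}_y"] = prev["y"] if prev else None
        # Elapsed time is only defined within a period, where the clock is continuous.
        prev = events[i - 1] if i > 0 else None
        same_period = prev is not None and prev["period"] == ev["period"]
        row["s_since_last"] = prev["clock_s"] - ev["clock_s"] if same_period else None
        out.append(row)
    return out

features = add_context_features(events)
```

In this sketch the first events of a game simply get `None` for missing history; a tree model such as XGBoost can handle missing values, but other models would need them filled in.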

For the pass model I also created a Pass Angle feature (in radians). For the Expected Shots in the Next 10 Seconds model I engineered the target variable, which required a nifty for-loop to sum the shots in qualifying rows!
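Both of these can be sketched in a few lines. The version below is an illustration under stated assumptions, not the loop from the notebook: `t` is taken to be seconds elapsed in the period, shots are identified by a hypothetical `"Shot"` label, an event never counts itself, and the window never crosses a period boundary.

```python
import math

def pass_angle(x, y, target_x, target_y):
    """Direction of the pass in radians, measured from the +x axis."""
    return math.atan2(target_y - y, target_x - x)

def shots_in_next_window(events, window_s=10):
    """For each event, count the shots later in the sequence that fall
    within window_s seconds of it and in the same period."""
    counts = []
    for i, ev in enumerate(events):
        n = 0
        for later in events[i + 1:]:
            if later["period"] != ev["period"] or later["t"] - ev["t"] > window_s:
                break  # events are time-ordered, so nothing further qualifies
            if later["event"] == "Shot":
                n += 1
        counts.append(n)
    return counts

# Hypothetical, time-ordered events.
events = [
    {"period": 1, "t": 0, "event": "Faceoff Win"},
    {"period": 1, "t": 4, "event": "Shot"},
    {"period": 1, "t": 9, "event": "Shot"},
    {"period": 1, "t": 15, "event": "Play"},
    {"period": 2, "t": 3, "event": "Shot"},
]
targets = shots_in_next_window(events)
```

The early `break` is what keeps the nested loop cheap: for each row the inner loop only walks as far as the end of the 10-second window.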

Metrics to Evaluate the Models

The Goal Probability Model

Because goals are so infrequent, accuracy alone says little (a model that never predicts a goal is almost always right), so I decided to focus on the area under the Receiver Operating Characteristic (ROC) curve, or AUC.
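To see why AUC is a better fit than accuracy for a rare outcome, here is a small sketch with made-up numbers. It computes AUC from its ranking interpretation (the probability that a randomly chosen goal is scored higher than a randomly chosen non-goal); the function `roc_auc` is written for this example and is not the library routine used in the notebooks.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs in which the positive
    gets the higher score; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: one goal in ten shots.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_no = [0.0] * 10  # a "model" that never predicts a goal
ranked = [0.02, 0.03, 0.05, 0.04, 0.10, 0.01, 0.06, 0.30, 0.02, 0.25]

accuracy_always_no = sum(y == 0 for y in y_true) / len(y_true)
auc_always_no = roc_auc(y_true, always_no)
auc_ranked = roc_auc(y_true, ranked)
```

The never-a-goal model is 90% accurate on this toy data but its AUC is only 0.5, no better than chance, while the model that ranks the goal above eight of the nine misses scores 8/9.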

The Pass Completion Model

Passes are usually completed, but the outcome of each pass is still subject to inherent (aleatory) randomness. I therefore chose to focus on the ROC curve and the area under it again.

The Expected Shots Model

Because the number of shots is a count, I used Poisson deviance to establish a baseline and compare my models. In the process I learned that XGBoost models are especially flexible! You can specify a number of different objectives and metrics.
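For reference, here is a minimal sketch of mean Poisson deviance and a mean-only baseline. The data are invented and the helper `mean_poisson_deviance` is written for this example. The parameter dictionary at the end uses XGBoost's documented `count:poisson` objective and `poisson-nloglik` metric; whether the notebooks use exactly these settings is an assumption.

```python
import math

def mean_poisson_deviance(y_true, y_pred):
    """2 * mean(y * log(y / mu) - (y - mu)), with y * log(y / mu) = 0 when y == 0."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        if mu <= 0:
            raise ValueError("predicted means must be positive")
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2.0 * total / len(y_true)

# Invented shot counts in the next 10 seconds for eight events.
y_true = [0, 0, 1, 0, 2, 0, 0, 1]
baseline = [sum(y_true) / len(y_true)] * len(y_true)  # predict the mean everywhere
model = [0.1, 0.2, 0.9, 0.1, 1.5, 0.2, 0.3, 0.7]      # hypothetical model output

dev_baseline = mean_poisson_deviance(y_true, baseline)
dev_model = mean_poisson_deviance(y_true, model)
frac_explained = 1 - dev_model / dev_baseline  # share of baseline deviance removed

# Assumed settings for a count target; these names exist in XGBoost.
xgb_params = {"objective": "count:poisson", "eval_metric": "poisson-nloglik"}
```

Lower deviance is better, and a perfect prediction scores zero, so a model is only useful if it beats the constant-mean baseline.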

Model Results

Goal Model:

My goal model ended up performing well on the test set! Here is a comparison between the XGBoost Classifier and a simple logistic regression model.

Surprising absolutely nobody, the XGBoost model wins again!

Much to my satisfaction, the interaction of the x- and y-coordinates when predicting whether a shot will result in a goal matches what you’d expect if you’ve ever played or watched a hockey game for more than thirty seconds. Note that all x, y coordinates below are within the offensive zone.

Shots from right in front of the net are much, much more valuable than from anywhere else.

Curiously, the goal probability declines faster as you go south on the above plot, but this is not entirely unexpected. Roughly 65% of players shoot left-handed, so shots to the north will be more powerful for the majority of players in the OHL.

Here’s another plot showing the same thing in more dramatic fashion.

There’s a striking discontinuity in goal probability as the shooter’s coordinates move outside of the central goal scoring zone. You can also discern the importance of shot angle from the above plot, with goal probability dropping off more sharply closer to the net.

The Pass Model

My pass model performed extremely well on the test set. Below is the ROC curve for the XGBoost model.

This result is comparable to the result achieved by a Data Cup finalist working in R with a BART (Bayesian Additive Regression Trees) model. Not quite as good, but comparable.

When examining the interaction of feature variables I discovered relationships that resemble what an experienced hockey player would expect. The likelihood of a pass completion drops off sharply when the pass target is in the center of the ice and increases significantly towards the boards and corners.

Passes towards the center of the ice in the offensive and neutral zones and towards the goal scoring zones in front of the net are least likely to be completed.

The dataset included each player who participated in an event, so with my passing model I computed the expected pass completion probability for every pass each player attempted. I then calculated “pass completions above expectation per 100 pass attempts” for every Erie Otters player.
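The metric itself is simple arithmetic: actual completions minus the sum of predicted completion probabilities, scaled to 100 attempts. Here is a minimal sketch with invented players and probabilities; the tuple layout and the `min_attempts` parameter are assumptions for the example.

```python
from collections import defaultdict

def completions_above_expected_per_100(passes, min_attempts=1):
    """passes: iterable of (player, completed, predicted_probability)."""
    totals = defaultdict(lambda: [0, 0.0, 0])  # completed, expected, attempts
    for player, completed, p_hat in passes:
        t = totals[player]
        t[0] += int(completed)
        t[1] += p_hat  # expected completions is the sum of model probabilities
        t[2] += 1
    return {
        player: 100.0 * (done - expected) / n
        for player, (done, expected, n) in totals.items()
        if n >= min_attempts
    }

# Invented passes: (player, 1 if completed else 0, model probability).
passes = [
    ("Player A", 1, 0.60), ("Player A", 1, 0.70),
    ("Player A", 0, 0.50), ("Player A", 1, 0.80),
    ("Player B", 1, 0.90), ("Player B", 0, 0.90),
]
result = completions_above_expected_per_100(passes)
```

Player A completes 3 passes against 2.6 expected, about +10 per 100 attempts, while Player B attempts only easy passes and still misses one, about -40 per 100. Setting `min_attempts` filters out players with too few passes for the number to mean much.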

Pass Completions Over Expectations Per 100 Passes for Erie Otters players with more than 400 total pass attempts

Jamie Drysdale and Maxim Golod, the two players leading the team in Passes Completed Above Expectation, are considered the top prospects on the Erie Otters by NHL scouts.

As an exercise, I also used Grid Approximation to do a Bayesian update on the passing skill of Aidan Campbell, a player with only 200 pass attempts (note my prior here could easily be questioned: this was an exercise).

Prior and Posterior for Aidan Campbell’s passing skill using a prior from the model. He is an especially risk-averse passer, but he completes his passes as well as you’d expect.

We learned a little bit: Aidan is less likely to be much better or much worse than an average player in the dataset with his pass attempt risk profile.
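For readers unfamiliar with grid approximation, here is a minimal sketch of the idea. Everything specific in it is an assumption made for illustration: the skill is modelled as a single completion rate, the prior is a Beta shape centred on a model-style expected rate of 0.75 with a weight of 50 pseudo-attempts, and the 150 completions in 200 attempts are invented. It is not the prior or the data used above.

```python
import math

def grid_posterior(k, n, prior_mean, prior_strength, grid_size=1001):
    """Posterior over a completion rate theta on an evenly spaced grid.
    Prior: Beta(prior_mean * prior_strength, (1 - prior_mean) * prior_strength).
    Likelihood: binomial, k completions in n attempts."""
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]  # avoids theta = 0 or 1
    a = prior_mean * prior_strength
    b = (1 - prior_mean) * prior_strength
    # Unnormalised log posterior = log prior + log likelihood at each grid point.
    log_post = [
        (a - 1 + k) * math.log(t) + (b - 1 + n - k) * math.log(1 - t)
        for t in grid
    ]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]  # subtract peak for stability
    total = sum(weights)
    return grid, [w / total for w in weights]

def mean_and_sd(grid, probs):
    mean = sum(t * p for t, p in zip(grid, probs))
    var = sum((t - mean) ** 2 * p for t, p in zip(grid, probs))
    return mean, math.sqrt(var)

# With no data (k = n = 0) the function returns the prior itself.
grid, prior = grid_posterior(k=0, n=0, prior_mean=0.75, prior_strength=50)
_, post = grid_posterior(k=150, n=200, prior_mean=0.75, prior_strength=50)

prior_mean, prior_sd = mean_and_sd(grid, prior)
post_mean, post_sd = mean_and_sd(grid, post)
```

In this toy setup the observed rate matches the prior mean, so the posterior stays centred in the same place but is noticeably narrower: the data rule out skill levels much better or much worse than expected.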

The Shots Model

My shots model was more modest in its ability to predict the future. It improved upon the baseline by explaining roughly 10% of the variance in the shot count over the next 10 seconds. The most important feature by far was the x-coordinate of the event.

What feature did the shot model use most frequently?

The interaction of the features was nevertheless illuminating. The graph below shows both the discontinuity in the effect of the x-coordinate and the stratification of the influence of each event type.

The expected number of shots in the next ten seconds jumps as soon as an event is within the offensive zone. This is a consequence of both the nature of spacetime and the offside rule restricting the order in which the puck and players must enter the offensive zone.

Conclusion

My models are pretty good and the interaction of the features is intuitive and consistent with domain knowledge. The models also allow for player comparisons that accord with the consensus opinions of hockey experts. That’s very satisfying.

There is much more that could be done with this dataset and I look forward to further exploration.

All of the notebooks for the above analysis can be found on my GitHub here.

Written by Jason Young